Many methods have been developed to estimate species richness, but few are useful for estimating regional richness. We compared the performance of commonly used non-parametric and area-based estimators, with a particular focus on testing a newly developed but little-tested maximum entropy method (MaxEnt).

Location

Tropical forest of Jianfengling Reserve, Hainan Island, China.

Methods

We extrapolated species richness using 12 estimators up to a larger regional scale, the reserve (472 km^{2}), within which 164 quadrats of 25 m × 25 m were distributed on a grid of 160 km^{2} in the tropical forest. We also analysed the effects of the base (or ‘anchor’) scale A_{0} on the species richness estimated (S_{est}) with MaxEnt.

Results

Six non-parametric methods underestimated species richness, while six area-based methods overestimated it. The accuracy of the MaxEnt estimate (S_{est}) improved as the base scale A_{0} increased.

Conclusions

Our findings suggest non-parametric methods should not be used to estimate richness across heterogeneous landscapes but can be used in well-defined sampling areas. Jack2 is the best of the six non-parametric methods, while the logistic model and the MaxEnt method seem to be the best of the six area-based methods. Improvements to the MaxEnt method are possible but that will require reformulation of the method by considering species–abundance distributions other than log-series and more general spatial allocation rules.

Information on the number of species in a region is essential for addressing urgent issues of biodiversity conservation and extinction risk assessment in the face of climate change and anthropogenic disturbance. Darwin showed how species originate but we are still unable to count how many there are and how many are disappearing due to irresponsible human activities. Without this crucial information, efforts to protect them in the rapidly changing world are hindered. Numerous methods have been proposed to estimate species richness (Palmer 1990; Bunge & Fitzpatrick 1993; Colwell & Coddington 1994; Chazdon et al. 1998; Chao 2005; Shen & He 2008; Harte et al. 2009; Colwell et al. 2012); however, these have so far only met with limited success. Most currently available methods only work well for estimating richness in areas of small extent (e.g. a few hectares). Of far more importance is the ability to predict richness at scales of hundreds or thousands of square kilometers, where direct enumeration of species is extremely resource demanding and often impossible. For large scales, the current methods either substantially underestimate (e.g. many of the non-parametric estimators) or substantially overestimate (e.g. widely used species–area models) richness (Chao & Lee 1992; Chazdon et al. 1998; Hortal et al. 2006; Jobe 2008). How to extrapolate diversity across a large magnitude of scales remains one of the major challenges in ecology.

Of the few methods recently developed to estimate regional richness, the method of maximum entropy (MaxEnt) proposed by Harte et al. (2009) appears to have great potential. This method is the latest addition to the MaxEnt formalism in ecology that has developed along two major lines: modelling trait–abundance relationships (Haegeman & Loreau 2008, 2009; Shipley 2009a,b; He 2010) and describing macroecological patterns (Pueyo et al. 2007; Dewar & Porté 2008; Harte et al. 2008, 2009; Volkov et al. 2009). For estimating species richness, Harte et al. (2009) demonstrated that a species–area model derived from the MaxEnt formalism could successfully extrapolate the number of tree species from census data of small plots (0.25 ha) to a region of 60 000 km^{2} in the Western Ghats, India. To our knowledge, this is currently the most promising method that can extrapolate richness across areas spanning seven orders of magnitude, although another method (Krishnamani et al. 2004) also successfully estimated the richness of the same region. The MaxEnt method is especially attractive because it requires just two simple state variables, the total number of individuals (N_{0}) and the number of species (S_{0}), at a base scale (A_{0}) (also called an anchor scale; Harte et al. 2009). However, the performance and robustness of the method have not been tested beyond the work of Harte et al. (2009). Before the new MaxEnt method is applied, it is essential to evaluate its performance against other commonly used methods and to investigate the effects of variation in A_{0} on its accuracy.

In this study, we tested the performance of 12 non-parametric and area-based methods with particular focus on testing the MaxEnt method using an exceptional data set consisting of 164 25 m × 25 m quadrats regularly distributed on a grid of 160 km^{2} nested within a nature reserve in a tropical rain forest in Hainan Island, China. Based on these 164 quadrats, we extrapolated the number of free-standing shrub/tree species up to a larger regional scale: the nature reserve (472 km^{2}). We first compared the performance of the non-parametric and area-based estimators and then investigated the effects of base scale A_{0} on the MaxEnt method.

Methods

Twelve species richness estimators, with a focus on the MaxEnt method

The 12 estimators evaluated in this study are a selection of both widely used and new methods (Table 1). Six of them are non-parametric methods, including four incidence-based methods (Bootstrap, Chao_{2}, Jack_{1} and Jack_{2}) and two abundance-based methods [Chao_{1} and ACE (abundance-based coverage estimator); Palmer 1990; Chazdon et al. 1998; Colwell et al. 2012]. Six are area-based methods (power-law species–area model, exponential species–area model, logistic species–area model, Ugland's total species method, Shen & He's method and MaxEnt model; He et al. 1996; Chazdon et al. 1998; Ugland et al. 2003; Jobe 2008; Shen & He 2008).
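As an illustration of how the four incidence-based estimators work, the sketch below computes them from quadrat presence data. It is a minimal sketch only: the text does not state which variants were used, so the classic formulas (as commonly given following Colwell & Coddington 1994) are assumed, and the function name `incidence_estimators` and its input layout are hypothetical.

```python
from collections import Counter


def incidence_estimators(quadrats):
    """Classic incidence-based richness estimators (assumed variants).

    quadrats: list of iterables, each holding the species names found in
    one quadrat. Returns a dict of estimates.
    """
    m = len(quadrats)
    if m < 2:
        raise ValueError("need at least two quadrats")
    # Number of quadrats in which each species occurs.
    occ = Counter(sp for q in quadrats for sp in set(q))
    s_obs = len(occ)
    q1 = sum(1 for c in occ.values() if c == 1)  # uniques
    q2 = sum(1 for c in occ.values() if c == 2)  # duplicates

    if q2 > 0:
        chao2 = s_obs + q1 ** 2 / (2 * q2)
    else:
        # Bias-corrected form, used here only to avoid division by zero.
        chao2 = s_obs + q1 * (q1 - 1) / 2
    jack1 = s_obs + q1 * (m - 1) / m
    jack2 = s_obs + q1 * (2 * m - 3) / m - q2 * (m - 2) ** 2 / (m * (m - 1))
    bootstrap = s_obs + sum((1 - c / m) ** m for c in occ.values())
    return {"S_obs": s_obs, "Chao2": chao2, "Jack1": jack1,
            "Jack2": jack2, "Bootstrap": bootstrap}


# Toy data (made up): four quadrats, four species.
toy = [{"A", "B"}, {"A", "C"}, {"A", "D"}, {"A", "D"}]
print(incidence_estimators(toy))
```

All four estimators add a correction to the observed richness that depends only on how many species occur in exactly one or two quadrats (or, for Bootstrap, on occurrence frequencies), which is why none of them refers to a physical target area.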

Table 1. Estimation of species richness with 12 non-parametric and area-based methods. The number of species was extrapolated from the base scale of 625-m^{2} quadrats to that of the Jianfengling reserve (472 km^{2}). The values in parentheses are the ratios between the estimated and true richness. The observed richness was obtained from the 164 quadrats of 25 m × 25 m.

Estimator                    Number of species (ratio to true richness)

True richness                992
Observed richness            596 (0.60)

Non-parametric estimators
  Chao1                      671 (0.68)
  ACE                        670 (0.68)
  Chao2                      784 (0.79)
  Jack1                      755 (0.76)
  Jack2                      846 (0.85)
  Bootstrap                  664 (0.67)

Area-based estimators
  Ugland's method            1714 (1.73)
  Shen & He's method         1563 (1.58)
  Power-law model            7620 (7.68)
  Exponential model          1546 (1.56)
  Logistic model             1398 (1.41)
  MaxEnt method              1394 (1.41)

In the following, we introduce the MaxEnt method. A very general form of species–area models can be written as:

S(A) = S_{0} \sum_{n=1}^{N_{0}} [1 - P(0|n,A,A_{0})] \phi(n|S_{0},N_{0})    (1)

where S_{0} is the total number of species in the study area A_{0}, N_{0} is the total number of individuals summed across S_{0}, and S(A) is the number of species in sub-area A nested within A_{0}. The term 1 − P(0|n,A,A_{0}) is the probability that a species with abundance n in A_{0} is present in A, and φ(n|S_{0},N_{0}) is the probability that a species randomly drawn from the species pool in A_{0} has abundance n; that is, φ(n|S_{0},N_{0}) is a virtual species–abundance distribution.

There are many ways to define the presence probability and species–abundance distribution in Eq. (1). The MaxEnt formalism developed by Harte et al. (2008, 2009) shows that the presence probability for halved A_{0} has the form P(0|n,A_{0}/2,A_{0}) = 1/(n + 1) and the species–abundance distribution φ(n|S_{0},N_{0} ) follows a log-series distribution (also see Pueyo et al. 2007). Substituting these results into Eq. (1) leads to a recurrent species–area model linking scales A and 2A:

where S(A) and N(A) are, respectively, the number of species and number of individuals in area A, while S(2A) and N(2A) are the numbers in double area 2A. In the case where N(2A) is reasonably large, e.g. >50, which is always the case in real applications, Eq. (3) is simplified to

S(2A) = -N(2A) (\exp(\lambda_{\phi,2A}) - 1) \log(1 - \exp(-\lambda_{\phi,2A}))    (4)
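As a consistency check on Eq. (4), one can start from the mean-abundance constraint of a log-series distribution. The derivation below is a sketch under two assumptions that are not spelled out in the text: the log-series is written in its usual normalised form with x = exp(−λ_{φ,2A}), and the sum over n is taken to infinity (the large-N(2A) limit).

```latex
\begin{align*}
\phi(n) &= \frac{x^{n}}{n\,[-\log(1-x)]}, \qquad x = e^{-\lambda_{\phi,2A}} \\
\frac{N(2A)}{S(2A)} &= \sum_{n=1}^{\infty} n\,\phi(n)
   = \frac{x}{(1-x)\,[-\log(1-x)]} \\
S(2A) &= -N(2A)\,\frac{1-x}{x}\,\log(1-x)
   = -N(2A)\,\bigl(e^{\lambda_{\phi,2A}}-1\bigr)\,
     \log\bigl(1-e^{-\lambda_{\phi,2A}}\bigr)
\end{align*}
```

The last line has the same form as Eq. (4), so under these assumptions Eq. (4) simply states that the average abundance per species in area 2A equals the mean of the log-series.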

Because of the recurrent nature of the formula, we can start from a base (or ‘anchor’) scale A_{0} with S_{0} and N_{0} to scale richness either up or down. To scale up, we start by substituting S(A) = S(A_{0}) = S_{0} and N(2A) = N(2A_{0}) = 2N_{0} into Eqs (2) and (4). (Note that N(2A_{0}) = 2N_{0} holds because N almost always scales linearly with A; He et al. 1996.) As a result, there are only two unknowns, S(2A) and λ_{φ,2A}, which can be obtained by solving Eqs (2) and (4) simultaneously. λ_{φ,2A} is the Lagrange multiplier of the MaxEnt formalism (Harte et al. 2009). The newly solved S(2A), together with N(2A), then serves as the new base scale for estimating S(4A), and so on. This procedure is iterated until the richness at the desired scale is obtained. Down-scaling can be carried out similarly, but using S(2A) = S(A_{0}) = S_{0} and N(2A) = N(A_{0}) = N_{0} as the base scale.
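The doubling procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: Eq. (4) is used as printed, while the companion relation is assumed to be the large-N form obtained by substituting P(0|n,A_{0}/2,A_{0}) = 1/(n + 1) and a log-series into Eq. (1), namely S(A) = S(2A)·exp(λ) − N(2A)·(exp(λ) − 1). The function names, the bisection solver and the values S_{0} = 30 and N_{0} = 200 are hypothetical, and only whole doublings of area are handled.

```python
import math


def _ratio(lam):
    """S(A)/N(2A) implied by a Lagrange multiplier lam (assumed large-N form)."""
    log_term = -math.log(-math.expm1(-lam))  # -log(1 - exp(-lam)), positive
    return math.expm1(lam) * (math.exp(lam) * log_term - 1.0)


def solve_lambda(s_a, n_2a, lo=1e-15, hi=5.0, iters=200):
    """Find lam with _ratio(lam) == s_a / n_2a by bisection on a log scale."""
    target = s_a / n_2a
    if not (_ratio(lo) < target < _ratio(hi)):
        raise ValueError("S(A)/N(2A) is outside the range this sketch can solve")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if _ratio(mid) < target:  # _ratio increases with lam
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def upscale_once(s_a, n_a):
    """One doubling of area: return S(2A), N(2A) and the multiplier lam."""
    n_2a = 2.0 * n_a  # N assumed to scale linearly with area
    lam = solve_lambda(s_a, n_2a)
    # Eq. (4): S(2A) = -N(2A) (exp(lam) - 1) log(1 - exp(-lam))
    s_2a = -n_2a * math.expm1(lam) * math.log(-math.expm1(-lam))
    return s_2a, n_2a, lam


def upscale(s0, n0, doublings):
    """Iterate the doubling step; each result becomes the next base scale."""
    s, n = s0, n0
    for _ in range(doublings):
        s, n, _ = upscale_once(s, n)
    return s


# Hypothetical state variables at the base scale (not taken from the paper).
for k in (1, 5, 10, 19):
    print(k, round(upscale(30.0, 200.0, k), 1))
```

Because each step takes only S and N at the current scale as input, any error in S_{0} or N_{0} at the base scale propagates through every subsequent doubling.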

Data

The data for testing the MaxEnt method were collected from the Jianfengling forest reserve located in southwest Hainan Island, China (18°23′–18°50′ N, 108°36′–109°05′ E, 0–1412 m a.s.l., annual rainfall 1000–3600 mm). The reserve comprises 472 km^{2} of tropical rain forest. The region has a tropical monsoon climate, with the wet season from May to October and the dry season from November to the following April. The annual average temperature is 24.5 °C, with the coldest and warmest monthly average temperatures being 19.4 °C and 27.3 °C, respectively. The total number of plant species in Jianfengling reserve is 2813, including 497 introduced species (Zeng et al. 1995). Among them, there are about 992 free-standing tree/shrub species (excluding vines and introduced species) with a diameter at breast height (DBH) ≥ 1.0 cm. These 992 species are the focus of this study.

A total of 164 quadrats (each 25 m × 25 m) were established from August 2007 through June 2009 in the middle of the reserve, at elevations ranging from 259 m to 1265 m. These quadrats were systematically established on a 1 km × 1 km grid, covering ca. 160 km^{2} and all vegetation types in the reserve.

The following data were recorded for each of the 164 quadrats: (1) species identity for each free-standing stem with DBH ≥ 1 cm, (2) relative position of each stem in each quadrat, and (3) latitude and longitude. Species nomenclature followed the Flora Republicae Popularis Sinicae and Flora of China (http://www.efloras.org).

Testing performance of the 12 non-parametric and area-based richness estimators

Comparison of the performance of the 12 methods

First, the six non-parametric methods were used to extrapolate richness to the Jianfengling reserve (472 km^{2}). Because non-parametric methods do not have a defined target area for the methods to refer to and the 164 quadrats cover all vegetation types and 60% of shrub/tree species of Jianfengling reserve, we assumed the reserve is the area that the 164 quadrats represent. Second, the six area-based methods, including the MaxEnt method (Eqs (2) and (4)), were used to extrapolate richness from the base scale of the quadrats of 625 m^{2} to the Jianfengling reserve (472 km^{2}). For the MaxEnt method, the number of species S_{0} and the number of individuals N_{0} at the base scale are the averaged species richness and abundance of all the 164 25 m × 25 m quadrats. Then, these methods were compared with each other and to the true species richness in Jianfengling reserve (992 species) to assess the accuracy of the estimated richness.

Evaluating the effects of base scale A_{0} on performance of the MaxEnt method

To conduct the analysis, four pairs of N_{0} and S_{0} at different base areas A_{0} (100, 225, 400 and 625 m^{2}) were used to extrapolate richness (S_{est}) in the reserve (472 km^{2}). These four pairs of N_{0} and S_{0} were produced using the following procedures: (1) for the base area of 625 m^{2}, N_{0} and S_{0} were obtained by averaging abundance and richness across the 164 quadrats; (2) for base areas smaller than 625 m^{2} (100, 225 and 400 m^{2}), because we had measured the relative position of each stem in each 625-m^{2} quadrat, we could randomly select any number of small sub-quadrats from each quadrat. First, we randomly selected 100 sub-quadrats of 10 m × 10 m from each of the 164 quadrats of 25 m × 25 m. The abundance and richness of these sub-quadrats, averaged within each of the 164 quadrats, were used as N_{0} and S_{0} for the base scale of 100 m^{2}. The same procedure was repeated to obtain N_{0} and S_{0} for base areas of 225 m^{2} (15 m × 15 m sub-quadrats) and 400 m^{2} (20 m × 20 m sub-quadrats). Finally, N_{0} and S_{0} at the four base areas (100, 225, 400 and 625 m^{2}) were used to extrapolate richness with increasing target area. The richness estimated from the four base areas was also plotted against the logarithm of target area for comparison.
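The sub-quadrat resampling can be sketched as follows. This is a minimal sketch under assumptions not stated in the text: stems are stored as hypothetical (x, y, species) tuples with coordinates in metres relative to the quadrat corner, sub-quadrats are placed uniformly at random and may overlap, and the per-quadrat means are then averaged across quadrats to give N_{0} and S_{0}.

```python
import random


def mean_n_s(stems, sub_side, n_sub, rng, quadrat_side=25.0):
    """Mean abundance and richness over n_sub random square sub-quadrats."""
    max_origin = quadrat_side - sub_side
    total_n = 0
    total_s = 0
    for _ in range(n_sub):
        # Lower-left corner chosen so the sub-quadrat stays inside the quadrat.
        x0 = rng.uniform(0.0, max_origin)
        y0 = rng.uniform(0.0, max_origin)
        inside = [sp for x, y, sp in stems
                  if x0 <= x < x0 + sub_side and y0 <= y < y0 + sub_side]
        total_n += len(inside)
        total_s += len(set(inside))
    return total_n / n_sub, total_s / n_sub


def base_scale_state(quadrats, sub_side, n_sub=100, seed=0):
    """Average the per-quadrat means across all quadrats to get (N0, S0)."""
    rng = random.Random(seed)
    means = [mean_n_s(stems, sub_side, n_sub, rng) for stems in quadrats]
    n0 = sum(n for n, _ in means) / len(means)
    s0 = sum(s for _, s in means) / len(means)
    return n0, s0


# Toy quadrat with three stems of two species (made-up data).
toy_quadrat = [(1.0, 1.0, "a"), (2.0, 2.0, "a"), (20.0, 20.0, "b")]
for side in (10.0, 15.0, 20.0, 25.0):
    print(side, base_scale_state([toy_quadrat], sub_side=side))
```

With `sub_side` equal to the quadrat side, the sub-quadrat coincides with the quadrat, so the function returns the quadrat-level averages used for the 625 m^{2} base area.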

All analyses were conducted in R (R Development Core Team 2010; R Foundation for Statistical Computing, Vienna, AT).

Results

A comparison of the results reveals that all six non-parametric methods underestimated the number of species in the reserve. Jack2 was the best estimator among the six non-parametric methods examined, although it underestimated Jianfengling richness by 15%. In contrast, the area-based methods substantially overestimated the species richness at the reserve scale (Table 1). The MaxEnt method performed as well as the logistic model for estimating richness of the reserve (Table 1). Even though the MaxEnt method and the logistic model appear to be the ‘best’ area-based methods, they still overestimated the true Jianfengling richness by 41% (Table 1), and the MaxEnt method does not show an asymptote when scaling up (Fig. 1). Furthermore, as is clear from Eqs (2) and (4), the MaxEnt richness estimate S_{est} is solely a function of the base scale A_{0}. Results show that S_{est} extrapolated to any larger target area decreased as the base scale A_{0} increased from 100 through 225 and 400 to 625 m^{2} (Fig. 2).

Discussion

Estimating species diversity on the landscape or regional scale has increasingly become a pressing issue for assessing the impact of human disturbance and climate change on biodiversity. Although many richness estimation methods have been developed, few are of satisfactory robustness for extrapolating richness across a large spatial scale (Palmer 1990; Bunge & Fitzpatrick 1993; Colwell & Coddington 1994; Chazdon et al. 1998; Scheiner 2003; Tjørve 2003; Hortal et al. 2006; Magnussen et al. 2006; Jobe 2008). In this study, we found that underestimation was widespread for the non-parametric methods, while overestimation prevailed for the area-based methods (Table 1). Furthermore, the problem with the non-parametric methods is not only their widely observed underestimation (Chazdon et al. 1998; Brose et al. 2003; Magnussen et al. 2006; Jobe 2008; Shen & He 2008) but also the lack of a physically defined target area for the methods to refer to. Because of this problem we do not know whether the richness estimated from the non-parametric methods based on the 164 quadrats is for Jianfengling reserve or for a certain community that these 164 quadrats genuinely represent. In this study, we can only assume subjectively the defined target area to be the 472 km^{2} Jianfengling reserve because sampling quadrats were systematically and sufficiently set up in this area. Therefore, all the non-parametric estimators require the assumption that the extrapolated area is inherently represented by the sampled data.

In contrast to the non-parametric methods, area-based methods have their own problems. Among all the area-based methods examined, only the logistic model has an asymptote when scaling up (Fig. 1). Substantial overestimation is inevitable for other area-based methods if estimation is extrapolated to regional scales. Overestimation is also found for the total species curve method of Ugland et al. (2003) for our data, which is not in agreement with Jobe (2008), who showed the method worked well for estimating tree richness of the Great Smoky Mountains National Park, USA (~2000 km^{2}). The overestimation here is not a surprise given that in the original application Ugland et al. (2003) estimated 5403 species in the Norwegian continental shelf (~81 300 km^{2}) from a sample size of 50.5 m^{2}, more than five times that estimated with the Chao2 method (1035 species) and a species–accumulation curve model (1192 species). The innovation of the total species method is that it incorporates the rate of species turnover into a species–accumulation curve. The overestimation lies in the fact that the total species model has the same mathematical form as the exponential model (S = c + z log A) in Fig. 1, which does not have an asymptote.
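To make the lack of an asymptote concrete, the sketch below fits the exponential model S = c + z log A by ordinary least squares and extrapolates it. The fitting procedure and the base of the logarithm are not given in the text, so ordinary least squares and the natural logarithm are assumptions here, and the function names and data points are hypothetical.

```python
import math


def fit_exponential_sar(areas, richness):
    """Least-squares fit of S = c + z * ln(A); returns (c, z)."""
    xs = [math.log(a) for a in areas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(richness) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, richness))
    z = sxy / sxx
    c = mean_y - z * mean_x
    return c, z


def predict_exponential_sar(c, z, area):
    """Richness predicted by the fitted model at a given area."""
    return c + z * math.log(area)


# Made-up accumulation data: areas in m^2 and cumulative species counts.
areas = [625, 1250, 2500, 5000, 10000]
species = [40, 58, 77, 95, 113]
c, z = fit_exponential_sar(areas, species)
for target in (1e6, 4.72e8, 1e12):
    print(target, round(predict_exponential_sar(c, z, target), 1))
```

Since the prediction grows by a fixed amount z log 2 for every doubling of area, it increases without bound as the target area grows, whereas a model with an asymptote would level off.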

The newly developed MaxEnt method (Harte et al. 2009) does not provide a significant improvement in richness estimation. Our results show that the MaxEnt method systematically overestimates species richness by about 40% for all the plots, similar to the overestimation from the logistic model (Table 1). These results are not in agreement with Harte et al. (2009), who found that the method estimated 1070 tree species in a 60 000-km^{2} reserve in the Western Ghats, India, which is close to the recorded 900 tree species of the reserve. Certainly, we need to acknowledge that we cannot thoroughly survey the 472 km^{2} Jianfengling reserve and that the true richness (992 species) was taken from the published plant species checklist, as in Harte et al. (2009), which will likely miss some less common species in the reserve. This means that the MaxEnt method, the logistic model and the other area-based methods are likely to be more useful, while all non-parametric estimators clearly underestimate true species richness and will be less useful.

Like any other richness estimator, the MaxEnt method is not immune to the effect of sampling intensity. Unlike other area-based methods, the MaxEnt estimate decreases as the base area increases (Fig. 2). In real applications, the choice of base area can therefore lead to substantially different results. This suggests that the MaxEnt method could be as biased as other area-based methods when used to extrapolate regional richness.

On the other hand, the MaxEnt prediction of a universal slope of species–area relationship vs. N/S curve is well matched with the data of the 164 quadrats (Fig. S1). We also tried to downscale the species richness to smaller scales and found that the MaxEnt method worked well in this mode. However, this does not necessarily guarantee that the MaxEnt method would perform better when scaling up. As shown in Fig. 1, the MaxEnt curve is not much different from other models (except the power-law and logistic models) and deviates at the larger scales. This is because the MaxEnt model is more suitable for homogeneous landscapes. Samples from heterogeneous landscapes can influence the accuracy of estimated species richness when scaling up, as found by Harte et al. (2009).

Certainly, the MaxEnt method may be improved if two assumptions on which the derivation of the method (Eqs (2) and (3)) is based are relaxed. One is that the species–abundance distribution follows a log-series distribution; the other is that the spatial distribution of species follows a specific allocation rule: P(0|n,A_{0}/2,A_{0}) = 1/(n + 1). For the first assumption, it is known from both empirical and theoretical evidence that the log-series distribution is a model of meta-communities (or communities of a large spatial extent) (Hubbell 2001) or undisturbed habitats. The log-series distribution is not a dominant pattern for terrestrial communities of limited spatial extent (Ulrich et al. 2010) or disturbed forests (Kempton & Taylor 1976), which is the case for the Jianfengling reserve, whose species–abundance distribution better follows the log-normal than the log-series distribution (Fig. S2). The Jianfengling reserve underwent various degrees of logging disturbance in the past. Therefore, a possible improvement to MaxEnt is to replace the log-series distribution with the more widely used log-normal distribution, which can also be derived from MaxEnt methods (Pueyo et al. 2007). As for the second assumption, it is widely documented that species in nature have a broad spectrum of spatial distributions, ranging from highly aggregated to regular (He et al. 1997; Condit et al. 2002). It is unlikely that any specific allocation rule would adequately capture this variation (Zillio & He 2010). Another possible improvement to the MaxEnt method is thus to use a more general allocation rule, such as that given in Conlisk et al. (2007), to define the spatial distribution of species, although this would inevitably introduce new parameters.

Acknowledgements

This work was conducted during the visit of Han Xu to the University of Alberta in 2009. The study was supported by the State Forestry Administration (201104057, 200804001), National Nonprofit Institute Research Grant of CAF (CAFYBB2011004, RITFYWZX200902), National Natural Science Foundation of China (30430570, 30590383), the China Institute of the University of Alberta and the GEOIDE of Canada, a CFERN and GENE award for ecological papers. The authors are grateful to Mingxian Lin, Jianhui Wu, Zhang Zhou, Tushou Luo, and Dexiang Chen from the Research Institute of Tropical Forestry, Jinhua Mo from Jianfengling Bureau of Forestry and Guangjian Li from the Jianfengling National Reserve for their support.