### Introduction


One important component of ecological diversity is the number of species in a community. Although obtaining an estimate of this number would seem to be a vital part of the process of quantifying a community, such estimates are difficult to obtain. Censusing a community is usually unfeasible, so instead samples are taken from the community and some form of extrapolation is carried out to estimate the number of unobserved species (e.g. Colwell & Coddington 1994). This can be achieved either by taking a single sample and using the distribution of abundances to estimate the species richness, or by taking several samples and using the distribution of incidences in the different samples to get the estimate. This paper will be concerned with the former method, although much of the discussion will carry straight over to the latter.

A great many approaches to the problem of estimating species richness have been suggested in the statistical literature (reviewed in Bunge & FitzPatrick 1993). Only a few of these methods have seen use in ecology, and the choice of the methods seems to be due largely to the availability of software. The ideal approach to estimating species richness would be to use the data (in the form of the number of species observed once, twice, etc.) to estimate the distribution of species abundances, and from this estimate the number of species that were present but observed zero times. The problem here is that the distribution of abundances is unknown, so we have to either find a distribution that is flexible enough to take in all reasonable distributions, but for which the parameters can be estimated with a reasonable degree of precision, or we have to find the ‘true’ abundance distribution (or a good approximation to it). Neither of these seems likely. The problem that has to be faced is whether different distributions will give similar estimates of species richness, i.e. whether the estimation is robust to model misspecification.

One model of abundances that is often used is the log-normal distribution (Preston 1948). This is despite well-documented problems in fitting it to data (e.g. Hughes 1986). A better alternative is the Poisson log-normal distribution, which preserves the underlying concept (that species’ abundances are log-normally distributed), but places a model of the sampling from these distributions on top of this (Cassie 1962). An alternative distribution can be obtained by assuming that the abundances follow a gamma distribution. When random sampling is then assumed, the distribution of the numbers of species captured can be shown to follow a negative binomial distribution (Fisher, Corbet & Williams 1943; White & Bennetts 1996). Despite their statistical appeal, and their use in modelling diversity (e.g. Kempton & Taylor 1974), the use of these distributions in estimating species richness has been rare (for an exception see Peterson & Meier 2003).
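The step from gamma-distributed abundances to a negative binomial distribution of counts can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper's Methods: $\lambda$ is a species' expected count in the sample, gamma distributed with shape $k$ and scale $\theta$, and $n$ is the number of individuals of that species captured under Poisson sampling.

```latex
\begin{aligned}
P(n) &= \int_0^\infty \frac{e^{-\lambda}\lambda^{n}}{n!}\,
        \frac{\lambda^{k-1} e^{-\lambda/\theta}}{\Gamma(k)\,\theta^{k}}\,\mathrm{d}\lambda \\
     &= \frac{\Gamma(n+k)}{n!\,\Gamma(k)}
        \left(\frac{\theta}{1+\theta}\right)^{n}
        \left(\frac{1}{1+\theta}\right)^{k},
        \qquad n = 0, 1, 2, \dots \\
P(0) &= (1+\theta)^{-k}
\end{aligned}
```

Under these assumptions $P(0)$ is the probability that a species in the community is missed entirely, so if $S$ species are present the expected number observed is $S\,[1 - P(0)]$. The zero class is never seen, which is why the estimate of species richness depends so heavily on the assumed form of the distribution.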

Non-parametric estimates of species richness have also been developed and used. These do not make any explicit distributional assumptions about the species abundances, but instead use approximations based on statistics calculated from the data. Of the two methods commonly used, the first (historically speaking) is only a lower bound of the number of species (Chao 1984, 1987). The second (Chao & Lee 1992; Chao, Ma & Yang 1993) is a genuine estimate, which uses the estimate for the case where all species are equally abundant, and corrects this with a term based on the coefficient of variation, i.e. on the first two moments of the abundance distribution (Bunge & FitzPatrick 1993).
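As an illustration of the first of these methods, the Chao (1984) lower bound in its widely used form adds $f_1^2 / (2 f_2)$ to the observed number of species, where $f_1$ and $f_2$ are the numbers of species observed exactly once and exactly twice. The sketch below is a minimal version; the function name `chao1` and the fallback used when there are no doubletons (a common bias-corrected convention) are assumptions, not something specified in this paper.

```python
from collections import Counter


def chao1(abundances):
    """Chao (1984) lower bound on species richness from per-species counts."""
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                 # number of species observed
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]           # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Assumed convention when f2 == 0: bias-corrected form f1(f1 - 1) / (2(f2 + 1))
    return s_obs + f1 * (f1 - 1) / 2


# Toy sample (made-up counts): 8 species, 4 singletons, 2 doubletons
print(chao1([1, 1, 1, 1, 2, 2, 5, 10]))
```

Only the two rarest abundance classes enter the correction term, so the result is insensitive to the counts of the common species.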

An alternative parametric approach to estimating species richness is to construct species sampling curves. These are plots of the number of species in samples of different sizes (Gotelli & Colwell 2001), and can be constructed either by sequentially adding a new sample and re-counting the number of species (an accumulation curve), or by taking independent samples of varying sizes (a rarefaction curve), constructed by randomly drawing subsamples of different sizes from the sample. The number of species in the subsamples is then plotted against the sample size to produce a taxon sampling curve (Gotelli & Colwell 2001). The species richness is then estimated by fitting an equation to the curve, and estimating its asymptote, i.e. how many species there would be if an infinite number of individuals were collected. Although it may not be apparent, this method is parametric. This can be seen if we consider that the distribution of abundances of species can be used to predict the number of species in a sample of size *N*, and then can clearly be used to predict the shape of the curve of the number of species as *N* is changed, i.e. the species accumulation curve (Christen & Nakamura 2000).
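A rarefaction curve of this kind can be sketched without repeated random subsampling by using the expected number of species in a subsample of *n* individuals drawn without replacement, $\sum_i [1 - \binom{N - N_i}{n} / \binom{N}{n}]$, where $N_i$ is the abundance of species *i* and $N$ the total. This analytic expectation is swapped in here for the random re-sampling described above; the function name and the toy counts are hypothetical.

```python
from math import comb


def rarefied_richness(abundances, n):
    """Expected number of species in a random subsample of n individuals."""
    counts = [a for a in abundances if a > 0]
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the sample size")
    denom = comb(total, n)
    # For each species: 1 - P(no individual of that species is drawn)
    return sum(1 - comb(total - a, n) / denom for a in counts)


# Toy sample (made-up counts); print the curve at a few subsample sizes
sample = [40, 25, 10, 5, 2, 1, 1, 1]
for size in (1, 10, 40, 85):
    print(size, round(rarefied_richness(sample, size), 3))
```

The curve necessarily reaches the observed number of species at the full sample size, so any estimate of richness comes from the functional form that is fitted and extrapolated, not from the curve itself.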

All these methods assume (as they have to) that there is a clearly definable community. In practice this community is defined informally through empirical and taxonomic means, i.e. how the organisms are trapped, and which groups the researchers are able to identify. Neither aspect of this definition defines a community operationally, although both may be a reflection of an operational definition. This would weaken any connection between theoretical studies of community structure and the sort of data that are collected. For example, because samples tend to be of whatever species is trapped, the community that is sampled is usually an open one, including any immigrants or tourist species that are found (for an exception see Longino, Colwell & Coddington 2002).

There is a related question of whether the estimate is intended to estimate the number of species at the time of sampling, or whether it is meant to be a wider statement about the number of species that could be in the community. The estimation methods imply the latter definition, as they assume an infinite number of individuals in the community; in other words, they count any species that could be found in the community. The species richness that is being measured is then the total number of species in the region, where (depending on the dispersal abilities of the organisms) the region may have to be defined at the continental, or even global, scale. For field data, this is probably not the quantity required, although it is clearly the relevant one for museum collection data (e.g. Peterson & Meier 2003).

A more relevant quantity might be the number of species in a community at any one time, i.e. an answer to the question ‘How many species are there here now?’. In principle this can be estimated using parametric methods. If the population size of the community (i.e. the number of individuals in the community) is known or can be estimated, then the predicted number of species at this population size can be calculated. Sampling curves are simply extrapolated out to the population size (rather than to infinity). Distribution-based estimates can be calculated by taking the difference between the number of species at an infinite population size and the predicted number of zeroes in a sample of size equal to the population size (i.e. the number of species not present). Obtaining an estimate of the number of individuals in a community is clearly a problem, although any estimates are probably fairly robust to bias.

Even when a community has been defined suitably, different estimates of species richness may arise from different abundance distributions (Colwell & Coddington 1994). It is certainly important that a fitted curve should fit the data, but little attention has been paid to this in the literature. For example, Colwell & Coddington (1994; their Fig. 1) show a species accumulation plot and the fitted hyperbolic curve. However, the curve clearly overestimates the number of species over the middle of the range of pooled samples. One effect of this is to make the estimate of species richness dependent on sample size. The authors note that their estimate of species richness depends on sample size, but do not link this to a lack of fit (the effect of the larger samples is to pull the fitted curve up; without these the curve is lower, so the predicted species richness is lower). If the curve fitted the data then the change in the estimate would not be directional, but rather would be pulled up and down by randomness in the data.

Several authors have attempted to address the question of how accurate species richness estimators are. A common comparison is of the estimated species richness with a ‘true’, known species richness at different re-sample sizes (e.g. Colwell & Coddington 1994; Walther & Morand 1998; Longino, Colwell & Coddington 2002; Brose *et al*. 2003; Foggo *et al*. 2003; Peterson & Meier 2003). With the exception of Brose *et al*. (2003) and Peterson & Meier (2003), the ‘true’ species richness is defined as the number of species in the sample, which seems to be prejudging the issue. Even when data sets are screened to try to ensure a full coverage (e.g. Walther & Morand 1998; Brose *et al*. 2003; Foggo *et al*. 2003) the comparisons cannot be regarded as foolproof, as the extremely rare species will still be difficult to capture. It is then perhaps not surprising that the studies conclude that the early methods of Chao (1984, 1987) work best as these, like the observed number of species, are a lower bound. Several studies have looked at the behaviour of the estimators by taking re-samples of different sizes, and examining how the point estimates change with re-sample size (e.g. Walther & Morand 1998; Foggo *et al*. 2003). The criterion for a method being good has generally been that it approaches the observed richness quickly, but this is a property of the particular sample and not of the population from which it has been taken. A correct criterion should look at whether independent samples give similar estimates, and in particular that the 95% confidence interval for the estimates contains the true value 95% of the time.

This paper compares the different estimation methods, and investigates some of their properties. The principal question is whether the estimation methods give the same answers, and if not, whether we can use the data to discard any of the estimates. Investigating this latter point means examining the properties of the estimators. Both field data and data simulated from the fit of models to the field data are used. The former has the advantage that it is realistic, and so likely to be similar to other field data. However, it is not possible to assess the bias of the estimates as the true number of species is unknown. Hence, simulated data are also used, with the simulated data being drawn from the distributions that are fitted to the data. The true values are then known, and are reasonable if one assumes that the distributional assumptions are reasonable.

### Results


The abundance distributions for the two data sets are shown in Fig. 1. Both samples show the typical pattern of a J-shaped curve, with the singletons being the modal class. The Fort Augustus sample has about three times as many individuals as Barnfield, but only a few more species (Table 1). The most common species in the Barnfield sample is about three times as common as the next most common species (365 individuals as against 118), and seems to be an extreme species. The difference between the two most common species in the Fort Augustus sample is much less (255 and 317 individuals). It is this difference that motivates the examination of the effect of the most common Barnfield species.

The different estimates of species richness are shown in Table 1. Most of the estimates are slightly higher than the number of species in the samples: for Fort Augustus this means that the methods suggest that between about five and 20 species are not in the sample; for Barnfield, between about 10 and 40 species are predicted as being present but not sampled. The exceptions to this come from ACE2, the negative binomial estimator and the two estimators based on the overdispersion model (all of which give much higher estimates of species richness), and estimates based on an exponential model fitted to the species accumulation curves, which with an intercept predict only slightly more species than present, and without an intercept manage to predict fewer species than sampled.

The predictions for the simulated data are shown in Fig. 2. In all cases, the only estimation method that gives a good estimate of the species number is the correct parametric estimate. The accumulation curve methods and the non-parametric methods all underestimate the true number of species. The negative binomial model gives the highest estimates, which considerably overestimate the number of species when the data are simulated from the Poisson log-normal distribution; conversely, the Poisson log-normal distribution underestimates the species richness when the data are taken from a negative binomial distribution. It is worth noting that the order of the estimates is the same for all four simulated situations, and indeed the real data. This is despite the differences in the distributions and data sets that are used, and suggests that the patterns are due to the properties of the estimators themselves rather than the data.

It is curious that for Fort Augustus the ACE estimates are smaller than the Chao1 estimates (Table 1). Chao1 is not strictly an estimate of the number of species, but an estimate of the minimum number of species (Chao 1984), so for these data sets the ACE appears to underestimate the species richness. However, using all the data in the ACE (i.e. ACE2) considerably increases the estimate. The effect of increasing the cut-off point (*t*) between rare and common species is initially to decrease, and then to increase, the estimate (Fig. 3), with little effect on the standard error. There seems no obvious point at which to place a cut-off, unless a lower bound rather than an estimate is required, in which case the smallest estimate should be preferred. The estimates follow the (re-)sample size (Fig. 4), which suggests strongly that the estimates are biased, and might be better used as lower bounds on species richness.
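The dependence of the ACE on the cut-off can be explored with a sketch such as the one below. It uses a commonly quoted formulation of the abundance-based coverage estimator, in which species with at most *t* individuals are treated as rare; whether this matches the exact ACE variants reported in Table 1 is not established here, and the function name, the default `t=10` and the toy counts are assumptions.

```python
def ace(abundances, t=10):
    """Abundance-based coverage estimator with rare/common cut-off t."""
    counts = [n for n in abundances if n > 0]
    rare = [n for n in counts if n <= t]
    s_abund = len(counts) - len(rare)        # species treated as common
    s_rare = len(rare)
    if s_rare == 0:
        return float(s_abund)
    n_rare = sum(rare)                       # individuals in rare species
    f1 = sum(1 for n in rare if n == 1)      # singletons
    if f1 == n_rare:
        raise ValueError("all rare species are singletons: coverage estimate is zero")
    coverage = 1 - f1 / n_rare
    # Squared coefficient of variation of the rare species, truncated at zero
    pair_sum = sum(n * (n - 1) for n in rare)
    gamma_sq = max(s_rare / coverage * pair_sum / (n_rare * (n_rare - 1)) - 1, 0.0)
    return s_abund + s_rare / coverage + f1 / coverage * gamma_sq


# Toy sample (made-up counts): how the estimate moves with the cut-off
sample = [1, 1, 1, 1, 2, 2, 3, 5, 8, 12, 30, 120]
for cutoff in (2, 5, 10, 50, 200):
    print(cutoff, round(ace(sample, cutoff), 2))
```

Setting `t` at or above the largest abundance uses all the data, so that the most common species enters the coefficient of variation term directly.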

The accumulation curves and the residuals from the fitted curves are shown in Fig. 5. For all the fitted curves without an intercept term, the residuals are positive at low and high re-sample sizes, and negative at intermediate sizes. This is a clear indication that the models are not good fits to the data: if the models were correct, then there would be no structure in the residual plots. The effect of this is that the accumulation curves are extrapolated so that the estimate of species richness is too low. The exponential curve fitted to the Fort Augustus data set is an extreme case, where the estimate is actually below the observed number of species.

When the intercept term is added the fits are much improved. The residual standard deviation (which is also the standard error of the species richness estimate, Table 1) is lower, and the residual plots show very little structure. The residuals do show a decrease in the variance as the re-sample size increases, although this does not seem to be large. This is opposite to the pattern assumed in the method developed by Raaijmakers (1987) that has commonly been used to fit the Michaelis–Menten model to species accumulation curves. The suggestion of Keating & Quinn (1998) that the variance will be largest at intermediate values is supported, as the sampling here does not include the lowest 10% of sample sizes, and the variance has to be zero at a sample size of 1. The residual variances for the two models with intercepts are very similar, indicating that both models fit the data almost equally well. However, they still give different estimates of the species richness, with the exponential curve giving a smaller estimate than the non-parametric estimates, and the Michaelis–Menten curve giving estimates that are up to 20 species larger than the non-parametric estimates.
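The kind of check described here, fitting a curve with and without an intercept and then inspecting the residuals, can be sketched as follows. The parameterization $S(n) = c + A\,n/(B + n)$, with asymptote $c + A$, is an assumption made for illustration (the paper's own equations are in the Methods, not in this section), as are the function name and the grid-search fitting, which is not the fitting method used in the paper.

```python
def fit_michaelis_menten(sizes, richness, intercept=False):
    """Least-squares fit of S(n) = c + A * n / (B + n); c = 0 unless intercept=True.

    For a fixed B the model is linear in (c, A), so B is scanned over a
    log-spaced grid and the linear part is solved in closed form each time.
    """
    m = len(sizes)

    def solve(B):
        x = [n / (B + n) for n in sizes]
        if intercept:
            x_bar = sum(x) / m
            y_bar = sum(richness) / m
            sxx = sum((xi - x_bar) ** 2 for xi in x)
            sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, richness))
            A = sxy / sxx
            c = y_bar - A * x_bar
        else:
            A = sum(xi * yi for xi, yi in zip(x, richness)) / sum(xi * xi for xi in x)
            c = 0.0
        residuals = [yi - (c + A * xi) for xi, yi in zip(x, richness)]
        rss = sum(r * r for r in residuals)
        return rss, B, c, A, residuals

    grid = [10 ** (e / 50) for e in range(-100, 301)]   # B from 0.01 to 1e6
    rss, B, c, A, residuals = min((solve(B) for B in grid), key=lambda fit: fit[0])
    return {"asymptote": c + A, "B": B, "intercept": c, "rss": rss,
            "residuals": residuals}


# Made-up curve that needs an intercept; compare the two fits
sizes = list(range(10, 510, 10))
curve = [5 + 80 * n / (100 + n) for n in sizes]
for use_intercept in (False, True):
    fit = fit_michaelis_menten(sizes, curve, intercept=use_intercept)
    print(use_intercept, round(fit["asymptote"], 2), round(fit["rss"], 4))
```

Plotting `fit["residuals"]` against `sizes` gives the residual plot: runs of residuals with the same sign, rather than scatter around zero, indicate that the extrapolated asymptote should not be trusted.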

The profile likelihoods of the ML estimates are shown in Fig. 6. All of the profiles are positively skewed, even after a log transformation. The negative binomial distribution shows a maximum at a higher species richness (as already noted), but for Barnfield there is a large range of values with very similar likelihoods. The profiles for the Poisson log-normal distribution are more peaked, and have modes at smaller values. For Fort Augustus the maximum likelihoods are very similar, but for Barnfield the maximum is larger for the Poisson log-normal distribution; the likelihood ratio is almost 7000. The lack of fit is caused largely by the most common species: when this is removed, the likelihood ratio drops to 24·5. The effect of adding an overdispersion term (i.e. to change the sampling distribution from a Poisson to a negative binomial) is to massively flatten the profile likelihood. This is because the unexplained variation in the data is so large that the model allows for this excess variation. This also means that the standard errors are large (Table 1).

The fitted models are plotted in Fig. 1 and their residuals in Fig. 7. The residuals for Fort Augustus do not seem to show any pattern. In contrast, the residuals from the fit of the negative binomial distribution to the Barnfield data are sigmoid, being positive for lower abundances, and negative for higher abundances. This suggests that the lowest and highest values are pulling the fitted model away from the rest of the data. The effect of the most common species on the fit has already been examined. Overall, there is little difference between the fit of the two distributions to the Fort Augustus data, but the Poisson log-normal gives a better fit to the data from Barnfield.

Removing the most common species from the Barnfield data set has little effect on the non-parametric and accumulation curve estimates (Table 1). In contrast, the ML estimates and the ACE2 estimates are reduced considerably, in particular the estimate from the negative binomial distribution. This can be explained by the flatness of the profile likelihood: the estimated value is still well within the reasonable range for the full data set (Fig. 6). The effect of this species is reduced in the models with overdispersion, because the influence of outliers is reduced in these models (as a common species can be a common species with an additional effect of the extra variation allowed for in the overdispersion); however, there is still a reduction.

### Discussion


The main conclusion to be drawn from the results is that different estimators give different estimates of species richness. When the data are simulated, only the distribution from which the data are simulated gives a correct estimate. Therefore, unless either the actual number of species is known (in which case the estimation is unnecessary) or the form of the underlying sampling distribution is known, the estimates cannot be relied upon. If a recommendation for which estimator to use has to be made, then the negative binomial distribution or either of the distributions with overdispersion may be the most appropriate, simply because their standard errors seem to reflect most accurately the range of likely values.

Brose *et al*. (2003) recently examined the behaviour of incidence-based species richness estimators and showed that unless the abundance distribution was even, or the coverage (the proportion of species sampled) was high, the non-parametric estimators underestimated the true species richness. They suggested a scheme for choosing an estimator based on taking an average estimated coverage from several methods. However, as shown here, the estimates may all have an extremely large bias, so it will be difficult to trust the average estimate, and hence the chosen estimator. The problem, again, is that the only way to decide which estimator is good is either to know the correct answer in advance, or to know the underlying abundance distribution.

The non-parametric estimates and those based on extrapolating sampling curves all give similar estimates (Table 1, Fig. 2) with a similar ordering, regardless of the actual number of species. Their similarity is probably not a reflection of the true number of species, but is due to intrinsic differences between the properties of the estimators. This is apparent when the estimates for the simulated distributions are examined: regardless of the distribution, the same order is retained, with all estimates less than the true number.

It should perhaps not be a surprise that the non-parametric methods do not perform well: the derivation of Chao1 was as a lower bound (Chao 1984), and the ACE has also been shown to be an underestimate in simulations of negative binomial distributions (Chao & Lee 1992). In those simulations, the minimum value of *k* that was used was 1, whereas the highest estimated value here was 0·2 (for Fort Augustus; for Barnfield it was 0·024 and 0·076 when the most common species was removed). The two non-parametric methods used here were derived considering only the mean and variance of the empirical distribution. However, the distributions of the data used here are clearly positively skewed (Fig. 1), and this is probably a common feature of abundance distributions. Ignoring this may contribute to the bias.

The species curves methods assume an underlying abundance distribution (Christen & Nakamura 2000). However, the form of the abundance distribution may not be clear from the fitted curve, so any connections to ecological theory are obscure (although the exponential curve, eqn 7, can be derived from an even distribution of species). If the distribution is known then the estimator could be derived through the hierarchical approach used here for the gamma and log-normal abundance distributions, which should lead to a more efficient estimator (Bunge & FitzPatrick 1993). The exponential accumulation curve used here provides a clear example of this inefficiency, by providing impossible estimates.

The fit of parametric models can be checked using relatively simple techniques such as plotting residuals. If a model that has been fitted to data is unable to predict the data, then any predictions of new data should be viewed with extreme caution. Although the importance of these sorts of checks is emphasized in the statistical literature (e.g. Miller 1997), they are often missing from ecological analyses. Many estimates of species richness based on sampling curves may well be biased for this simple, and checkable, reason.

Other properties of the estimates are also important. Intuitively, the estimate of the number of unsampled species should depend largely on the number of rare species sampled. The Barnfield sample, however, tells a different story. The estimate from the negative binomial distribution shifts considerably when the most frequent species is removed. The ACE using all the data is also sensitive to this species. A potential solution to this problem (at least for the parametric case) is to fit distributions which will allow flexibility in the upper tail of the distribution, while allowing the species richness to be influenced more by the rarer species. An alternative would be to follow the approach of Chao, Ma & Yang (1993), and use only the uncommon species in the estimation. However, it is unclear how uncommonness should be defined. Any cut-off will be arbitrary and placing it at the same abundance for all data sets does not seem reasonable, as it does not take into account variation in sampling effort. A better method might be to define common species as being the most frequent species that include a set proportion of the total number of individuals in the sample.
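One reading of this last suggestion is sketched below: species are ranked by abundance, and the most frequent ones are classed as common until they account for a set proportion of all individuals. The function name, the default proportion of 0.5 and the toy counts are hypothetical, and the rule for species tied in abundance at the boundary is left arbitrary.

```python
def split_common_rare(abundances, proportion=0.5):
    """Split species into common and rare by cumulative share of individuals."""
    counts = sorted((n for n in abundances if n > 0), reverse=True)
    target = proportion * sum(counts)
    common, running = [], 0
    for n in counts:
        if running >= target:        # the common set already covers the target share
            break
        common.append(n)
        running += n
    rare = counts[len(common):]
    return common, rare


# Toy sample (made-up counts) dominated by one very abundant species
sample = [300, 90, 40, 20, 5, 2, 1, 1, 1]
for share in (0.5, 0.8, 0.95):
    common, rare = split_common_rare(sample, share)
    print(share, common, len(rare))
```

Unlike a fixed abundance threshold, this cut-off scales with the sample: doubling every count leaves the classification unchanged.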

If a parametric model is to be used to estimate species richness, then the problem is choosing the right model. Finding such a model from ecological theory is difficult because the rate of capture of each species is a product of two quantities: the abundance of the species and the catchability, i.e. the rate of capture of each individual. If the catchabilities are identical for each species, then the fitted distribution will be an estimate of the abundance distribution. If the catchabilities are unequal, then the fitted distribution will be a biased estimate of the abundance distribution. If the catchabilities of each species can be estimated (for example, through a large-scale mark–recapture experiment), then the abundance distribution can be estimated. Even if the catchability can be modelled there is little good theory to guide the choice of which abundance distribution to use, so developments in this area (e.g. Hubbell 2001) seem necessary.

The large effects of the overdispersion seen here are indicative of the two-parameter models used not fitting well. The models with overdispersion place all the extra variation into a random term, and so are the most flexible models. The cost of doing this is their large standard errors. An alternative would be to fit a three-parameter distribution, such as the generalized gamma distribution (Diserud & Engen 2000). Of course, this only moves the problem on one stage further: different three-parameter models will probably also give different estimates.

It should be clear from this discussion that it is difficult to estimate the number of species in a community. It appears that the only way to be sure that estimates are genuinely estimating species richness is if the distribution of abundances of species in a community, and their catchabilities, are known. It seems unlikely that this will be possible, and hence unlikely that the bias of parametric estimates can be measured. Non-parametric estimators do not provide a solution as their bias is unbounded (Engen 1978). Estimating species richness therefore seems futile, as it is impossible to know how bad the estimates are. What, then, can be estimated? Chao1 is a lower bound to species richness and ACE seems to be an underestimate, so both of these can be regarded as providing lower limits to species richness. It is possible that an upper bound can be derived from ecological theory, by estimating the maximum number of species that a region can sustain, or if the empirical abundance distribution has an internal mode, in which case it might be expected that the number of unobserved species is less than the mode (but see Magurran & Henderson 2003 for an example of a bimodal data set). There seems little prospect for doing anything other than trying to provide tighter estimates of these bounds through refinements in ecological and statistical theory, with the hope that in the course of doing this a reliable estimator is discovered.