## Introduction

One important component of ecological diversity is the number of species in a community. Although obtaining an estimate of this number would seem to be a vital part of the process of quantifying a community, such estimates are difficult to obtain. Censusing a community is usually unfeasible, so instead samples are taken from the community and some form of extrapolation is carried out to estimate the number of unobserved species (e.g. Colwell & Coddington 1994). This can be achieved either by taking a single sample and using the distribution of abundances to estimate the species richness, or by taking several samples and using the distribution of incidences in the different samples to get the estimate. This paper will be concerned with the former method, although much of the discussion will carry straight over to the latter.

A great many approaches to the problem of estimating species richness have been suggested in the statistical literature (reviewed in Bunge & FitzPatrick 1993). Only a few of these methods have seen use in ecology, and the choice of the methods seems to be due largely to the availability of software. The ideal approach to estimating species richness would be to use the data (in the form of the number of species observed once, twice, etc.) to estimate the distribution of species abundances, and from this estimate the number of species that were present but observed zero times. The problem here is that the distribution of abundances is unknown, so we have to either find a distribution that is flexible enough to take in all reasonable distributions, but for which the parameters can be estimated with a reasonable degree of precision, or we have to find the ‘true’ abundance distribution (or a good approximation to it). Neither of these seems likely. The problem that has to be faced is whether different distributions will give similar estimates of species richness, i.e. whether the estimation is robust to model misspecification.

One model of abundances that is often used is the log-normal distribution (Preston 1948). This is despite well-documented problems in fitting it to data (e.g. Hughes 1986). A better alternative is the Poisson log-normal distribution, which preserves the underlying concept (that species’ abundances are log-normally distributed), but places a model of the sampling from these distributions on top of this (Cassie 1962). An alternative distribution can be obtained by assuming that the abundances follow a gamma distribution. When random sampling is then assumed, the distribution of the numbers of species captured can be shown to follow a negative binomial distribution (Fisher, Corbet & Williams 1943; White & Bennetts 1996). Despite their statistical appeal, and their use in modelling diversity (e.g. Kempton & Taylor 1974), the use of these distributions in estimating species richness has been rare (for an exception see Peterson & Meier 2003).
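
The step from gamma-distributed abundances to a negative binomial distribution of counts can be written out briefly. This is a sketch under stated assumptions: the symbols $\lambda$ (a species' expected count in the sample), $k$ (gamma shape) and $\theta$ (gamma scale) are notation introduced here for illustration, and Poisson sampling of each species is assumed as the model of random sampling.

```latex
\begin{align}
P(n) &= \int_0^\infty \frac{e^{-\lambda}\lambda^{n}}{n!}\,
        \frac{\lambda^{k-1} e^{-\lambda/\theta}}{\Gamma(k)\,\theta^{k}}\,d\lambda \\
     &= \frac{\Gamma(n+k)}{n!\,\Gamma(k)}
        \left(\frac{\theta}{1+\theta}\right)^{n}
        \left(\frac{1}{1+\theta}\right)^{k}, \qquad n = 0, 1, 2, \ldots
\end{align}
```

Setting $n = 0$ gives $(1+\theta)^{-k}$, the probability under this model that a species present in the community is missed by the sample.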

Non-parametric estimates of species richness have also been developed and used. These do not make any explicit distributional assumptions about the species abundances, but instead use approximations based on statistics calculated from the data. Of the two methods commonly used, the first (historically speaking) is only a lower bound of the number of species (Chao 1984, 1987). The second (Chao & Lee 1992; Chao, Ma & Yang 1993) is a genuine estimate, which uses the estimate for the case where all species are equally abundant, and corrects this with a term based on the coefficient of variation, i.e. on the first two moments of the abundance distribution (Bunge & FitzPatrick 1993).
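
As an illustration of the first of these methods, the lower bound of Chao (1984) can be computed from the number of species seen exactly once and exactly twice. The following is a minimal sketch, not a reference implementation: the function name `chao1` and the input format (a list of per-species counts from a single sample) are assumptions made here, and the fallback used when no species is seen exactly twice is the commonly used bias-corrected form, which is not discussed in the text above.

```python
from collections import Counter


def chao1(abundances):
    """Lower bound on species richness from per-species counts in one sample."""
    counts = [a for a in abundances if a > 0]   # ignore unobserved species
    s_obs = len(counts)                         # number of species observed
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]                   # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Assumed fallback (bias-corrected form) when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2


# Hypothetical sample: 9 species, 4 singletons, 2 doubletons.
print(chao1([10, 5, 3, 2, 2, 1, 1, 1, 1]))  # 13.0
```

Because only the singleton and doubleton counts enter the correction, the result is driven entirely by the rarest observed species.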

An alternative parametric approach to estimating species richness is to construct species sampling curves. These are plots of the number of species in samples of different sizes (Gotelli & Colwell 2001). They can be constructed either by sequentially adding a new sample and re-counting the number of species (an accumulation curve), or by taking independent samples of varying sizes (a rarefaction curve, constructed by randomly taking subsamples of different sizes from the sample). The number of species in the subsamples is then plotted against the sample size to produce a taxon sampling curve (Gotelli & Colwell 2001). The species richness is then estimated by fitting an equation to the curve and estimating its asymptote, i.e. how many species there would be if an infinite number of individuals were collected. Although it may not be apparent, this method is parametric. This can be seen if we consider that the distribution of abundances of species can be used to predict the number of species in a sample of size *N*, and hence can clearly be used to predict the shape of the curve of the number of species as *N* is changed, i.e. the species accumulation curve (Christen & Nakamura 2000).
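
The rarefaction curve described above need not be built by repeated random subsampling; its expected value can be computed directly. The sketch below uses the standard hypergeometric expectation for subsampling without replacement, which is swapped in here for the random subsampling the text describes. The function name `rarefaction_curve` and the input format are assumptions, and fitting an asymptotic equation to the resulting curve is not shown.

```python
from math import comb


def rarefaction_curve(abundances, sizes):
    """Expected number of species in random subsamples of the given sizes.

    abundances: per-species counts in the full sample.
    sizes: subsample sizes, each between 0 and the total number of individuals.
    """
    counts = [a for a in abundances if a > 0]
    total = sum(counts)
    curve = []
    for n in sizes:
        if not 0 <= n <= total:
            raise ValueError("subsample size must lie between 0 and the sample size")
        denom = comb(total, n)
        # A species with count a is absent from the subsample with
        # probability C(total - a, n) / C(total, n).
        expected = sum(1 - comb(total - a, n) / denom for a in counts)
        curve.append(expected)
    return curve


# Hypothetical sample of 6 individuals from 3 species.
print(rarefaction_curve([3, 2, 1], [0, 1, 3, 6]))
```

The curve starts at zero, rises to the observed number of species at the full sample size, and its shape is determined entirely by the abundances, which is the sense in which curve-based estimates depend on the abundance distribution.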

All these methods assume (as they have to) that there is a clearly definable community. In practice this community is defined informally through empirical and taxonomic means, i.e. how the organisms are trapped, and which groups the researchers are able to identify. Neither aspect of this definition defines a community operationally, although both may be a reflection of an operational definition. This would weaken any connection between theoretical studies of community structure and the sort of data that are collected. For example, because samples tend to be of whatever species is trapped, the community that is sampled is usually an open one, including any immigrants or tourist species that are found (for an exception see Longino, Colwell & Coddington 2002).

There is a related question of whether the estimate is intended to estimate the number of species at the time of sampling, or whether it is meant to be a wider statement about the number of species that could be in the community. The estimation methods imply the latter definition, as they assume an infinite number of individuals in the community; in other words, they count any species that could be found in the community. The species richness that is being measured is then the total number of species in the region, where (depending on the dispersal abilities of the organisms) the region may have to be defined at the continental, or even global, scale. For field data, this is probably not the quantity required, although it is clearly the relevant one for museum collection data (e.g. Peterson & Meier 2003).

A more relevant quantity might be the number of species in a community at any one time, i.e. an answer to the question ‘How many species are there here now?’. In principle this can be estimated using parametric methods. If the population size of the community (i.e. the number of individuals in the community) is known or can be estimated, then the predicted number of species at this population size can be calculated. Sampling curves are simply extrapolated out to the population size (rather than to infinity). Distribution-based estimates can be calculated by taking the difference between the number of species at an infinite population size and the predicted number of zeroes in a sample of size equal to the population size (i.e. the number of species not present). Obtaining an estimate of the number of individuals in a community is clearly a problem, although the resulting estimates of species richness are probably fairly robust to bias in the estimated population size.
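
The distribution-based calculation can be sketched for one particular model. The example below assumes gamma-distributed abundances with Poisson sampling (so that counts are negative binomial), and further assumes that each species' expected count scales linearly with the number of individuals examined. The function name, the parameter names and the numerical values are all hypothetical and chosen only for illustration.

```python
def species_present(total_species, shape, scale, sample_size, population_size):
    """Expected number of species among `population_size` individuals.

    total_species: estimated richness at an infinite population size.
    shape, scale: gamma parameters fitted to a sample of `sample_size` individuals.
    """
    # Assumption: expected counts scale in proportion to individuals examined.
    scaled = scale * population_size / sample_size
    # Negative binomial probability that a species has zero individuals.
    p_zero = (1.0 + scaled) ** (-shape)
    # Species at infinite population size minus the predicted zeroes.
    return total_species * (1.0 - p_zero)


# Hypothetical values: 100 species in total, a sample of 1000 individuals.
for n_individuals in (1_000, 100_000, 10_000_000):
    print(n_individuals, species_present(100, 0.5, 2.0, 1_000, n_individuals))
```

Under these assumptions the predicted richness increases with the population size and approaches the infinite-population value from below.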

Even when a community has been defined suitably, different estimates of species richness may arise from different abundance distributions (Colwell & Coddington 1994). It is certainly important that a fitted curve should fit the data, but little attention has been paid to this in the literature. For example, Colwell & Coddington (1994; their Fig. 1) show a species accumulation plot and the fitted hyperbolic curve. However, the curve clearly overestimates the number of species over the middle of the range of pooled samples. One effect of this is to make the estimate of species richness dependent on sample size. The authors note that their estimate of species richness depends on sample size, but do not link this to a lack of fit (the effect of the larger samples is to pull the fitted curve up; without them the curve is lower, so the predicted species richness is lower). If the curve fitted the data then the change in the estimate would not be directional, but rather would be pulled up and down by randomness in the data.

Several authors have attempted to address the question of how accurate species richness estimators are. A common comparison is of the estimated species richness with a ‘true’, known species richness at different re-sample sizes (e.g. Colwell & Coddington 1994; Walther & Morand 1998; Longino, Colwell & Coddington 2002; Brose *et al*. 2003; Foggo *et al*. 2003; Peterson & Meier 2003). With the exception of Brose *et al*. (2003) and Peterson & Meier (2003), the ‘true’ species richness is defined as the number of species in the sample, which seems to be prejudging the issue. Even when data sets are screened to try to ensure a full coverage (e.g. Walther & Morand 1998; Brose *et al*. 2003; Foggo *et al*. 2003) the comparisons cannot be regarded as foolproof, as the extremely rare species will still be difficult to capture. It is then perhaps not surprising that the studies conclude that the early methods of Chao (1984, 1987) work best as these, like the observed number of species, are a lower bound. Several studies have looked at the behaviour of the estimators by taking re-samples of different sizes, and examining how the point estimates change with re-sample size (e.g. Walther & Morand 1998; Foggo *et al*. 2003). The criterion for a method being good has generally been that it approaches the observed richness quickly, but this is a property of the particular sample and not of the population from which it has been taken. A correct criterion should look at whether independent samples give similar estimates, and in particular that the 95% confidence interval for the estimates contains the true value 95% of the time.

This paper compares the different estimation methods, and investigates some of their properties. The principal question is whether the estimation methods give the same answers and, if not, whether we can use the data to discard any of the estimates. Investigating this latter point means examining the properties of the estimators. Both field data and data simulated from the fit of models to the field data are used. The former have the advantage that they are realistic, and so likely to be similar to other field data. However, it is not possible to assess the bias of the estimates as the true number of species is unknown. Hence, simulated data are also used, with the simulated data being drawn from the distributions that are fitted to the data. The true values are then known, and are reasonable if one assumes that the distributional assumptions are reasonable.