EstimateS turns 20: statistical estimation of species richness and shared species from samples, with non-parametric extrapolation

Authors


R. K. Colwell, Dept of Ecology and Evolutionary Biology, Univ. of Connecticut, 75 N. Eagleville Rd., Storrs, CT 06268, USA, and Univ. of Colorado Museum of Natural History, Boulder, CO 80309, USA. E-mail: colwell@uconn.edu

Abstract

EstimateS offers statistical tools for analyzing and comparing the diversity and composition of species assemblages, based on sampling data. The latest version computes a wide range of biodiversity statistics for both sample-based and individual-based data, including analytical rarefaction and non-parametric extrapolation, estimators of asymptotic species richness, diversity indices, Hill numbers, and (for sample-based data) measures of compositional similarity among assemblages. In the first 20 yr of its existence, EstimateS has been downloaded more than 70 000 times by users in 140 countries, who have cited it in 5000 publications in studies of taxa from microbes to mammals in every biome.

EstimateS is a free software application for Windows and Macintosh operating systems, designed to help assess and compare the diversity and composition of species assemblages, based on sampling data. With a fully graphical user interface, the application computes a wide range of biodiversity statistics, including rarefaction and extrapolation, estimators of species richness, diversity indices, Hill numbers, and measures of compositional similarity among assemblages.

Twenty years ago, Colwell and Coddington (1994) developed a conceptual framework for describing species assemblages at the landscape level, in terms of richness and compositional similarity. As tropical entomologists involved in biotic inventory work (Longino et al. 2002), they were acutely aware that biodiversity sampling data, even for intensive and carefully designed studies, are routinely biased by undersampling. Observed species counts and other measures of diversity that take account of rarer species are inevitably underestimates (Gotelli and Colwell 2001, 2011), and measures of similarity based on observed counts are routinely overestimates (Chao et al. 2005). Colwell and Coddington (1994) reviewed most of the statistical tools then available for reducing undersampling bias, including parametric distribution-fitting (e.g. lognormal), parametric function-fitting (e.g. Michaelis–Menten curves), and non-parametric estimators of asymptotic species richness (e.g. Chao's estimators and jackknife estimators).

To visualize the effect of undersampling on observed richness and on the performance of richness estimators, Colwell and Coddington (1994) introduced graphs that came to be known as sample-based rarefaction plots (Gotelli and Colwell 2001), showing both expected (rarefied) richness and estimated asymptotic richness as a function of increasingly large numbers of pooled sampling units, up to the total number in the full empirical sample set (the reference sample). The Pascal program that Colwell developed to produce the figures in the Colwell and Coddington (1994) study formed the core of the first version of EstimateS. That program, like every subsequent version of EstimateS, was based on the idea of combining rarefaction with asymptotic richness estimation. Later, measures of compositional similarity that take undersampling into account (Chao et al. 2000, 2005) were incorporated into EstimateS.

Between 1993 and 1996, early Pascal (for MacOS) versions of EstimateS were circulated among colleagues in the biodiversity inventory community. The critiques and comments of these early adopters helped guide further development, enhanced by increasingly frequent collaboration with Anne Chao. In 1997, the EstimateS website (<http://purl.oclc.org/EstimateS>) went live, supporting the launch of the first downloadable version: a fast, compiled application with a graphical user interface for both Windows and Mac OS, built in the application development environment, 4th Dimension® (still the development environment used for EstimateS). A download registry recorded 500 downloads in 1998, 3000 total downloads by the year 2000, and 7200 by 2003.

Ten years later, as of December, 2013, more than 70 000 downloads had been registered to users in 140 countries (193 countries are currently members of the UN). According to Google Scholar, the number of scholarly publications citing EstimateS (in its several versions) has steadily risen over the years, to more than 5000 citations as of March, 2014 (nearly two citations per day during 2012) (Fig. 1A). Remarkably, these citations have appeared in more than 700 different journals (and 60 books), ranging from 120 in Biodiversity and Conservation and about 60 each in Biota Neotropica, Forest Ecology and Management, Biological Conservation, and Biotropica to more than 400 journals with one citation each. It is surely no accident that journals that feature tropical research on hyperdiverse biotas figure prominently in the list.

Figure 1.

Citations of EstimateS and its uses since 1998. (A) Number of citations per year. These citations appeared in more than 700 different journals, of which the top 10 were Biodiversity and Conservation, Biota Neotropica, Forest Ecology and Management, Biological Conservation, Biotropica, Journal of Biogeography, Diversity and Distributions, Journal of Insect Conservation, Conservation Biology, and PLoS One. (B) Focal taxa of studies citing EstimateS. (C) Conceptual focus of studies citing EstimateS.

We attribute the continued success of EstimateS not only to a fundamental and widespread interest in estimating diversity, but also to the multiplicative propagation of its popularity through citations, word-of-mouth recommendations, and its use in classrooms and teaching laboratories. We would like to hope that the widespread use of EstimateS arises, as well, from its continually updated functionality, incorporation of up-to-date statistical developments and refinements of biodiversity estimation, comprehensive output, ease of use, and easy-to-understand EstimateS User's Guide.

Ecologists, conservation biologists, microbiologists, paleontologists, and other scientists have used EstimateS to study a great range of terrestrial and freshwater taxa, from mammals to microbes, in every biome and on every continent (including Antarctica) and every major island. In the oceans, EstimateS has been applied to data for marine taxa living in habitats ranging from estuaries and surface waters to hydrothermal vents. Figure 1B shows the results of an analysis of the titles of 3695 citations (the total number of citations as of 8 June 2012, when we began this bibliographic analysis).

Although researchers in a surprising variety of fields have put EstimateS to use in many ways (Fig. 1C), an analysis of ~10% of citations, randomly selected from those listed by Google Scholar in June, 2012, revealed that the majority of studies used EstimateS to quantify the species richness (and other measures of diversity) of a plot or geographical area, or to quantify changes in diversity or assemblage structure along a gradient. Studies of species interactions (Perez et al. 2009) and evaluation of competing sampling methods (Chiarucci et al. 2001, Allford et al. 2008) have also been frequent themes.

EstimateS has been used in some unexpected and innovative ways. Ethnobiologists have used it to estimate and track the diversity of medicinal plants in marketplaces (Mati and de Boer 2011) and also to estimate the richness of vegetable cultivars in studies of the conservation of agricultural diversity (Baco et al. 2007). Archaeologists have used it to estimate the richness of artifact types in assemblages at dig sites (Eren et al. 2012). EstimateS has been useful in estimating the richness of hyperdiverse bacterial assemblages, from those found within the human body (Sepehri et al. 2007, Ji et al. 2012) to the microbial communities of fermenting drinks (Escalante et al. 2008). The program has also been widely used to estimate genetic diversity (Vos and Velicer 2006, Viprey et al. 2008).

The current version of EstimateS (ver. 9) departs from previous versions in three fundamental ways: 1) it offers direct individual-based rarefaction for abundance data, with unconditional (‘open’) variance and confidence intervals, while continuing to provide classic rarefaction for sample-based incidence or abundance data as in all previous versions; 2) it introduces non-parametric extrapolation of species richness (for both sample-based and individual-based data), smoothly extending the rarefaction curve beyond the reference sample to augmented sample sizes, with unconditional variance and confidence intervals; and 3) it allows the automatic input and analysis of multiple datasets (batch input) (Fig. 2A).

Figure 2.

Option screen examples from the EstimateS 9 graphical user interface. (A) The four input filetypes: sample-based incidence or abundance data (one set or multiple sets of replicated sampling units) or individual-based abundance data (one sample or multiple samples). (B) The randomization and rarefaction panel of the diversity settings screen for sample-based data. Here, the user sets the number of sample-order randomizations, specifies the extent of extrapolation, and sets the number of sampling points (knots) on the rarefaction and extrapolation curve. Settings on the other panels of this screen specify the richness estimators and diversity indices to be computed (estimators and indices panel) and some specialized options (other options panel). The diversity settings screen for individual-based data is similar. Options for sample similarity and shared species estimators are specified in a shared species settings screen.

Rarefaction is a resampling framework that selects, at random, 1, 2, …, n individuals or 1, 2, …, t sampling units until all individuals or sampling units in the reference sample have been accumulated. For each level of rarefaction, EstimateS computes a large number of biodiversity statistics. For species richness, exact analytical methods are used to compute the expected number of species (with unconditional variance and confidence intervals) for each level of rarefaction (or equivalently, accumulation) of individuals or samples. For other diversity measures, EstimateS resamples individuals or sampling units stochastically (based on random numbers from a strong-hash-driven cryptographic algorithm). The resampling process is repeated many times, and the means of the resamples for each level of accumulation are reported. The biasing effects of differences in sample size on diversity statistics for two or more data sets can usually be substantially reduced by comparing them at the same level of accumulation of individuals or sampling units.
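
As an illustration of what an exact analytical calculation looks like in the individual-based case, the sketch below uses the classic hypergeometric rarefaction formula: a species with abundance x in a reference sample of n individuals is absent from a random subsample of m individuals with probability C(n − x, m)/C(n, m). This is a minimal Python sketch under that assumption, not the EstimateS implementation; the function name is hypothetical, and the unconditional variance is not computed.

```python
from math import comb


def rarefied_richness(abundances, m):
    """Expected number of species in a random subsample of m individuals.

    `abundances` holds the per-species counts of the reference sample;
    individuals are drawn without replacement (hypergeometric model).
    """
    n = sum(abundances)
    if not 0 <= m <= n:
        raise ValueError("m must lie between 0 and the reference sample size")
    total = comb(n, m)
    # A species with abundance x is missed with probability C(n - x, m) / C(n, m),
    # so it contributes one minus that probability to the expected richness.
    return sum(1 - comb(n - x, m) / total for x in abundances if x > 0)


# Hypothetical reference sample: four species, ten individuals.
reference = [5, 3, 1, 1]
curve = [rarefied_richness(reference, m) for m in range(sum(reference) + 1)]
```

The resulting curve starts at zero, equals one species for a single individual, and reaches the observed richness at the full reference sample, which is the interpolation behaviour expected of an individual-based rarefaction curve.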

Traditional variances calculated by classic rarefaction formulas and estimated by bootstrapping methods are conditional on the sample. Therefore, these variances approach zero as the size of the sample approaches the size of the reference sample. The variance in rarefied and extrapolated richness that is computed by EstimateS is called an unconditional variance because it estimates the true variance of the estimated richness of the assemblage from which the samples were taken, rather than the variance in richness conditional on the reference sample. The unconditional variance in richness for the reference sample must be greater than zero to account for the heterogeneity that would be expected among additional random samples of the same size taken from the entire assemblage. Unconditional variance (and the confidence limits derived from it) for sample-based rarefaction was introduced by Colwell et al. (2004), while unconditional variance for individual-based rarefaction was missing from the toolbox of biodiversity statistics until 2012 (Colwell et al. 2012).

Rarefaction, in effect, represents an interpolation between the value of a diversity measure assessed for the reference sample and zero (for individual-based abundance data) or between the value of a diversity measure assessed for the reference sample and the diversity of a typical single sampling unit (for sample-based incidence data). For species richness, EstimateS ver. 9 introduces extrapolation from a reference sample to the expected richness (with unconditional confidence intervals) for a user-specified, augmented number of individuals or sampling units. The recently developed methods that EstimateS uses for richness extrapolation (Colwell et al. 2012) rely on statistical sampling models, not on the fitting of mathematical functions. They require an estimator for asymptotic richness as a ‘target’ for the extrapolation. EstimateS uses Chao1 for individual-based abundance data and Chao2 for sample-based incidence data. Figure 2B shows the options screen for sample-based data, and Fig. 3 illustrates rarefaction and extrapolation for the comparison of multiple datasets.
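
The role of the asymptotic estimator as a target can be sketched for individual-based data. The Python sketch below assumes the classic Chao1 form for the number of undetected species, f0 = f1²/(2 f2), where f1 and f2 are the numbers of singletons and doubletons (with the bias-corrected form f1(f1 − 1)/2 when there are no doubletons), and a multinomial-model extrapolation of the form S(n + m) = S_obs + f0 [1 − (1 − f1/(n f0 + f1))^m]. The function names are hypothetical, confidence intervals are omitted, and the exact formulas used by EstimateS are those documented in Colwell et al. (2012) and the User's Guide, which may differ in detail.

```python
def singleton_doubleton_counts(abundances):
    """Return (f1, f2): the numbers of species seen exactly once and exactly twice."""
    f1 = sum(1 for x in abundances if x == 1)
    f2 = sum(1 for x in abundances if x == 2)
    return f1, f2


def chao1_f0(abundances):
    """Estimated number of undetected species (assumed classic Chao1 form)."""
    f1, f2 = singleton_doubleton_counts(abundances)
    if f2 > 0:
        return f1 * f1 / (2 * f2)
    # Assumed bias-corrected form for samples without doubletons.
    return f1 * (f1 - 1) / 2


def extrapolated_richness(abundances, m):
    """Expected richness after adding m individuals to the reference sample."""
    n = sum(abundances)
    s_obs = sum(1 for x in abundances if x > 0)
    f1, _ = singleton_doubleton_counts(abundances)
    f0 = chao1_f0(abundances)
    if f0 == 0:
        # No undetected species are estimated, so the curve is already flat.
        return float(s_obs)
    return s_obs + f0 * (1 - (1 - f1 / (n * f0 + f1)) ** m)


# Hypothetical reference sample: 8 species, 23 individuals, f1 = 4, f2 = 2.
reference = [10, 5, 2, 2, 1, 1, 1, 1]
asymptote = len(reference) + chao1_f0(reference)  # Chao1 target richness
doubled = extrapolated_richness(reference, sum(reference))
```

With m = 0 the function returns the observed richness, and as m grows the curve rises toward the Chao1 asymptote without exceeding it, so the extrapolated segment joins the rarefaction curve at the reference sample.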

Figure 3.

Sample-based rarefaction (interpolation) and non-parametric extrapolation for reference samples (filled black circles) for ground-dwelling ants from five elevations on the Barva Transect in northeastern Costa Rica (Longino and Colwell 2011), with 95% unconditional confidence intervals, as calculated by EstimateS ver. 9. Maximum species density is found at the 500-m elevation site, consistently exceeding the species density at both higher and lower elevations. Species density drops significantly with each increase in elevation above 500 m, based conservatively on non-overlapping confidence intervals (graph from Colwell et al. 2012).

Hill numbers are a family of diversity measures that quantify diversity in units of equivalent numbers of equally abundant species (Jost 2006, Gotelli and Chao 2013). EstimateS ver. 9 (and earlier versions) computes the most widely used Hill numbers (richness, exponential Shannon diversity, and reciprocal Simpson diversity) by averaging Hill number values among random resamples for the reference sample and each level of rarefaction. Chao et al. (2013) recently extended the analytical rarefaction and extrapolation tools of Colwell et al. (2012) to the full set of Hill numbers and to coverage-based rarefaction (Chao and Jost 2012). The addition of these tools is on the drawing board for future development of EstimateS.
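
For readers unfamiliar with Hill numbers, the three orders named above can be computed directly from relative abundances: order q = 0 gives richness, q = 1 the exponential of Shannon entropy, and q = 2 the reciprocal of Simpson's concentration. The Python sketch below computes these plug-in values for a single sample; the function name is hypothetical, and the averaging over random resamples that EstimateS performs at each level of rarefaction is not reproduced here.

```python
from math import exp, log


def hill_number(abundances, q):
    """Hill number of order q, in units of equally abundant species."""
    n = sum(abundances)
    p = [x / n for x in abundances if x > 0]
    if q == 1:
        # Limit of the general formula as q approaches 1: exponential Shannon.
        return exp(-sum(pi * log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))


# Hypothetical sample dominated by one species.
sample = [8, 1, 1]
profile = [hill_number(sample, q) for q in (0, 1, 2)]
```

For a perfectly even sample all three orders equal the number of species; for an uneven sample such as the one above the values decline as q increases, because higher orders give more weight to the dominant species.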

In the Shared Species options screen, EstimateS offers an important set of tools for measuring the similarity in species composition between pairs of samples and (more important) estimating similarity between pairs of assemblages. In addition to key, traditional similarity indices (Jaccard, Sørensen, Morisita–Horn, and Bray–Curtis), which measure sample similarity, EstimateS computes Chao's widely-used Jaccard and Sørensen similarity estimators, which take into account species shared but not detected in one or both samples (Chao et al. 2005, 750 citations). Chao's estimators require either sample-based abundance data or replicated incidence data.
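
The distinction between sample similarity and assemblage similarity is easier to see with the classic indices written out. The Python sketch below computes the traditional incidence-based Jaccard and Sørensen indices for two samples given as collections of species names; the function names and the example species labels are hypothetical. These are the uncorrected sample indices only: Chao's estimators, which adjust for shared species that went undetected, are not implemented here.

```python
def jaccard(sample_a, sample_b):
    """Classic Jaccard index: shared species / species found in either sample."""
    a, b = set(sample_a), set(sample_b)
    if not a and not b:
        raise ValueError("at least one sample must contain a species")
    return len(a & b) / len(a | b)


def sorensen(sample_a, sample_b):
    """Classic Sorensen index: 2 * shared species / sum of the two richness values."""
    a, b = set(sample_a), set(sample_b)
    if not a and not b:
        raise ValueError("at least one sample must contain a species")
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical species lists: two species shared out of five observed overall.
plot_1 = ["sp_A", "sp_B", "sp_C"]
plot_2 = ["sp_B", "sp_C", "sp_D", "sp_E"]
j = jaccard(plot_1, plot_2)
s = sorensen(plot_1, plot_2)
```

Because rare shared species are easily missed in one or both samples, indices computed this way describe the samples rather than the underlying assemblages, which is the bias that the estimators described above are designed to reduce.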

When EstimateS moved from a command-line interface to a fully graphical user interface (GUI) about 15 yr ago, it seemed inconceivable that anyone would ever want to return to the command-line world of hieratic syntax that characterized computing from 1960 to the early 1990s. But it seems that the R revolution in data analysis and presentation graphics has brought things full circle, as R users work from the console or from script files. For those who prefer to work in the R environment, we can suggest Jari Oksanen's ‘vegan’ package (<http://cran.r-project.org/web/packages/vegan/index.html>) and Noah Charney's ‘vegetarian’ package (<http://cran.r-project.org/web/packages/vegetarian/index.html>), which include some of the statistical tools offered by EstimateS. Meanwhile, the next version of EstimateS aims to offer a modest hybrid solution, by providing GUI-based options to output R data frames, together with a small library of R code to access these exported data frames to produce frequently-used graphical output types from EstimateS analyses.

You can download the EstimateS application and access the online EstimateS User's Guide at <http://purl.oclc.org/estimates>. If you publish a paper with results from EstimateS, be sure to specify the version and release date in the Methods section, and cite this Software note (Colwell and Elsensohn 2014). To reference the User's Guide itself, or its mathematical appendices, cite Colwell (2013).

Acknowledgements

The authors would like to thank the multitude of EstimateS users who have invented new ways to use it and those who have suggested extensions and improvements over the years.
