Program SimAssem: software for simulating species assemblages and estimating species richness


  • Earlier versions of this paper and supporting information constituted chapter two in my (Gordon C. Reese) dissertation which is accessible at



  1. Species richness, the number of species in a defined area, is the most frequently used biodiversity measure. Despite its intuitive appeal and conceptual simplicity, species richness is often difficult to quantify, even in well-surveyed areas, because of sampling limitations such as survey effort and species detection probability. Nonparametric estimators have generally performed better than other options, but no particular estimator has consistently performed best across variation in assemblage and survey parameters.
  2. In order to evaluate estimator performances, we developed the program SimAssem. SimAssem can: (i) simulate assemblages and surveys with user-specified parameters, (ii) process existing species encounter history files, (iii) generate species richness estimates not available in other programs and (iv) format encounter history data for several other programs.
  3. SimAssem can help elucidate relationships between assemblage and survey parameters and the performance of species richness estimators, thereby increasing our understanding of estimator sensitivity, improving estimator development and defining the bounds for appropriate application.


Species richness, the number of unique species in a defined area, is the most commonly used measure of biological diversity (Gaston 1996; Moreno et al. 2006). Species richness (SR) can be used to delineate protected areas, monitor biological systems and investigate environmental relationships. Surveys rarely encounter all of the species in an area; therefore, numerous estimators have been proposed to improve upon the negative bias of raw counts.

Three categories are regularly used to classify SR estimators (Colwell & Coddington 1994). The first category includes extrapolative methods applied to species accumulation curves or species–area curves. The Michaelis–Menten equation (Michaelis & Menten 1913), negative exponential model (Holdridge et al. 1971) and power model (Arrhenius 1921; Tjørve 2009) are commonly used to extrapolate to an estimate of SR at some large sample or large area.

A second category includes parametric estimators that make assumptions about the underlying species-abundance distribution or species detection probabilities (p). One type of parametric estimator uses a fitted distribution, often either a log-normal or log-series. For this category, required steps such as estimating total abundance and selecting the discrete abundance classes to which a continuous distribution is fit are often prohibitive (see Colwell & Coddington 1994; Magurran 2004). There are also parametric estimators based on the assumption that p is constant across species.

A third category includes nonparametric estimators, which are those that are neutral on the probability distribution from which parameters are drawn. Many of the nonparametric SR estimators were originally derived from methods to estimate the number of individuals in a closed population (e.g. Burnham & Overton 1978; Chao 1984; Pledger 2000).

The search for a single best estimator has not yet been resolved. However, general comparisons of the three estimator categories favour the nonparametric methods (see table 1 in Cao, Larsen & White 2004; table 3 in Walther & Moore 2005). Nonparametric estimators are therefore the focus of this project.

The performance of nonparametric SR estimators can be affected by species- and assemblage-level attributes as well as by survey design parameters, hereafter collectively referred to as factors (Keating & Quinn 1998; Brose, Martinez & Williams 2003). Several studies have indicated that bias decreases as species-abundance distributions become more even (Wagner & Wildi 2002; O'Dea, Whittaker & Ugland 2006). One assumption of the closed population estimators, translated for species data, holds that species are equally detectable across space. Spatial aggregation regularly challenges this assumption (Schmit, Murphy & Mueller 1999). Other factors found to affect SR estimator performance include the number of species (Keating & Quinn 1998; Poulin 1998), total abundance or density of individuals (Baltanás 1992; Walther & Morand 1998) and species detection probability, p (Boulinier et al. 1998).

Raw sample data and consequently, SR estimates, are also affected by survey design parameters such as effort (Burnham & Overton 1979; Brose, Martinez & Williams 2003). Additionally, survey configuration has been important to other estimation issues (Reese et al. 2005). Selecting survey locations randomly is unbiased and therefore preferable; however, survey locations are often selected based on accessibility and previous results (Beck & Kitching 2007). The above factors can all affect sample coverage (sc), which is the proportion of a species pool represented in a sample and the single most important factor with respect to estimator performance (Baltanás 1992; Brose, Martinez & Williams 2003). Unfortunately, one needs to know the true number of species to calculate sc and, if this information were available, estimation would be unnecessary. It is therefore important to understand how the aforementioned factors affect performance.

Evaluating SR estimators across a wide range of factors in the field is difficult because of temporal, financial and logistical constraints as well as uncertainty about species- and assemblage-level parameters. Despite the simplifications, simulations are advantageous because they can be systematically varied and randomly surveyed, and most important, the true number of species is known. Our objective therefore was to develop a program in which specified parameters are used to simulate and survey species assemblages, thereby revealing the behaviour of SR estimators in a controlled setting. Most, if not all, of the programs currently available for estimating SR, for example, EstimateS (Colwell 2006), SPADE (Chao & Shen 2010) and ws2m (Turner, Leitner & Rosenzweig 2003), process existing encounter history data (information indicating whether a species was encountered during a particular survey occasion), but include little or no simulation capability. In addition, SimAssem includes a more comprehensive suite of SR and variance estimators than other programs.

Program SimAssem

SimAssem is application software developed in Visual Basic 6.0 for 32-bit versions of Microsoft Windows and includes a graphical user interface (Fig. 1) and internal dialogue with R software (R Development Core Team 2009). SimAssem can process both existing encounter history data and data from assemblages simulated with user-specified parameters. Other than for some R functions, the Mersenne twister pseudorandom number generator is used for randomizations (Matsumoto & Nishimura 1998). The SimAssem program and source code are available at We hope that the source code provides a valuable foundation for the quick evaluation and development of new estimators.

Figure 1.

The graphical user interface for SimAssem which includes sections for: inputting a file (a); setting simulation parameters for: species and abundance (b), spatial configuration (c), species detection probability (d), and survey design (e); outputting files (f); displaying a simulated assemblage (g); and displaying estimates (h).

Simulating an assemblage

In SimAssem, assemblages are simulated by specifying the number of species (S), total abundance across species (N) and a distribution to which abundances conform (Fig. 2, Table 1). SimAssem will run only when N ≥ S.

Table 1. Species-abundance distributions in SimAssem. See the referenced publications and Appendix S1 for details

Abundance distribution               Descriptive literature
Broken-stick (BS)^a                  Tokeshi (1990)
Dominance-decay (DD)^a               Tokeshi (1990)
Dominance-preemption (DP)^a          Tokeshi (1990)
Geometric-series (GS)                Tokeshi (1990)
Log-normal (LN)^a                    Appendix S1
Log-series (LS)                      Magurran (2004)
Power-fraction (PF)^a                Tokeshi (1996)
Particulate-niche (PN)^a             Tokeshi (1993)
Random-assortment (RA)^a             Tokeshi (1990)
Random-fraction (RF)^a               Tokeshi (1990)
Sugihara's sequential model (S75)^a  Sugihara (1980), Tokeshi (1993)
Zero-sum (ZS)^a                      Hubbell (2001)

^a The allocation of abundances to species is stochastic. Final abundances are the integer portion of averages across the specified number of iterations (Iterations menu item).

Figure 2.

Species-abundance distributions available in SimAssem (see Table 1 for abbreviations). In this example, we specified 25 species and 10 000 total individuals and, other than for the ZS distribution, each graph point was the average of 1000 iterations. Due to stochasticity, the line shown for ZS represents the first iteration with 25 species (θ = 4). For the LS distribution, the log-series parameter x = 0.99969074 (see Appendix S1 for details).

The fundamental attributes of an assemblage include its species richness and species-abundance distribution. Two established theories about species abundance are: (i) abundances are generally unequal amongst species and (ii) most species are relatively rare (Fisher, Corbet & Williams 1943). The geometric-series (Motomura 1932), log-normal (Preston 1948) and log-series distributions (Fisher, Corbet & Williams 1943) have been successfully fit to biological datasets; however, the representation of abundance distributions with purely mathematical models has been criticized for not explaining the patterns. Some of the earliest alternatives focusing instead on process include the broken-stick and particulate-niche models (MacArthur 1957). More recent work with abundance distributions continued to emphasize the methodological steps required to create a distribution and, by way of analogy, the ecological processes that result in real abundance distributions (Tokeshi 1990, 1993, 1996). These models are assumed to approximate the interactions and subsequent patterns of small groups of taxonomically related species, that is, species vying for the same resources, and have therefore been termed niche-based models. A basic premise holds that niche apportionment can be modelled by a stick being broken, where the units of the stick represent individuals. Methods for generating the available species-abundance distributions and worked examples are given in Appendix S1.
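The stick-breaking premise can be sketched in a few lines. This is a minimal illustration only, assuming MacArthur's broken-stick model in which a stick of N units is broken simultaneously at S − 1 random points. The function names are hypothetical, and averaging ranked abundances across iterations is one reading of the Table 1 footnote, not necessarily the procedure documented in Appendix S1.

```python
import random

def broken_stick_once(n_species, n_individuals, rng):
    """Break a stick of n_individuals units at n_species - 1 random points."""
    # Continuous break points; segment lengths are real-valued abundances.
    breaks = sorted(rng.uniform(0, n_individuals) for _ in range(n_species - 1))
    edges = [0.0] + breaks + [float(n_individuals)]
    segments = [hi - lo for lo, hi in zip(edges, edges[1:])]
    # Rank from most to least abundant so iterations can be averaged by rank
    # (an assumption made for this sketch).
    return sorted(segments, reverse=True)

def broken_stick(n_species, n_individuals, iterations=1000, seed=1):
    """Average ranked abundances over iterations and keep the integer portion."""
    if n_individuals < n_species:
        raise ValueError("total abundance N must be >= number of species S")
    rng = random.Random(seed)
    totals = [0.0] * n_species
    for _ in range(iterations):
        ranked = broken_stick_once(n_species, n_individuals, rng)
        for rank, length in enumerate(ranked):
            totals[rank] += length
    return [int(total / iterations) for total in totals]

# Same settings as the Fig. 2 example: 25 species, 10 000 individuals.
abundances = broken_stick(25, 10_000)
```

Because each averaged abundance is truncated to an integer, the abundances can sum to slightly less than N.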

Several different algorithms are included for distributing individuals across a square landscape (Table 2). Possible spatial patterns range from aggregated (species-specific or assemblage-wide), to random, to hyper-dispersed (more evenly spaced than random) (Fig. 3; see Appendix S1 for details).

Table 2. Spatial configuration algorithms in SimAssem. User-specified parameters are described in Appendix S1 and include distance (D) [0–1], fidelity (F) [0–1], maximum number of seeds (Sds), and the length of the shoulder (τ) [0–1] and rate of decline (ω) [0–1] for a distance-decay formula, $1 - (1 - \omega^{\mathrm{DistanceToSeed}})^{\tau}$

Configuration algorithm              Parameters
Aggregated (centres)                 D, F
Aggregated (centres equal abun)      D, F
Aggregated (individuals)             D, F
Aggregated (individuals max dist)    D, F
Clustered (assemblage-wide)          Sds, τ, ω
Clustered (species-specific)         Sds, τ, ω

Figure 3.

Example configuration patterns with 1 species and 1000 individuals, including aggregated (left panel), random (middle panel) and hyper-dispersed (right panel).

Creating encounter data: detection and design

Before ps are assigned, species are grouped into thirds based on abundance; for example, one group comprises the least abundant species. When the true number of species (Strue) is not a multiple of three, a randomly selected group is increased by one for each species that remains. Within each group, species-specific ps can be randomly drawn from a beta distribution with specified α and β parameters (R function rbeta). Beta distributions are characterized by an expected value (mean), $E(X) = \alpha/(\alpha + \beta)$, and variance, $\mathrm{var}(X) = \alpha\beta/[(\alpha + \beta)^2(\alpha + \beta + 1)]$, where X is a random beta variate. Alternatively, ps can be fixed for each abundance group.
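The grouping and beta draws can be illustrated as follows. This is a sketch under stated assumptions rather than SimAssem's own code: it uses Python's `random.betavariate` in place of R's `rbeta`, the function names are hypothetical, and it places each leftover species in an independently chosen group, which is only one possible reading of the rule above.

```python
import random

def beta_mean_var(alpha, beta):
    """Mean and variance of a beta(alpha, beta) variate."""
    total = alpha + beta
    mean = alpha / total
    var = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, var

def assign_detection(abundances, beta_params, seed=1):
    """Draw a detection probability p for every species.

    beta_params holds three (alpha, beta) pairs, ordered from the least
    abundant group to the most abundant group.
    """
    rng = random.Random(seed)
    n = len(abundances)
    base, extra = divmod(n, 3)
    sizes = [base] * 3
    # Each leftover species enlarges a randomly selected group by one.
    for _ in range(extra):
        sizes[rng.randrange(3)] += 1
    order = sorted(range(n), key=lambda i: abundances[i])  # least abundant first
    p = [0.0] * n
    start = 0
    for size, (a, b) in zip(sizes, beta_params):
        for i in order[start:start + size]:
            p[i] = rng.betavariate(a, b)
        start += size
    return p, sizes

abund = [50, 3, 200, 7, 1, 90, 12]
p, sizes = assign_detection(abund, [(2, 8), (5, 5), (8, 2)], seed=3)
mean, var = beta_mean_var(2, 8)  # expected p for the least abundant group
```

With α = 2 and β = 8, the expected detection probability is 0.2, so rare species in this example are, on average, harder to detect than common ones.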

Each simulated landscape is partitioned by a 100 × 100 grid for the purpose of conducting surveys. User-settings include the Survey design, that is, spatial configuration of surveyed grid cells, and Number of cells to survey (t), from 1 to 10 000. Cells are surveyed without replacement.

SimAssem includes two survey designs. Surveyed grid cells can be randomly selected (Random) or added to randomly oriented, horizontal or vertical linear transects that are each one grid cell wide (Linear transect). The Linear transect option requires a Minimum number of transects (m) across which t is divided. Due to landscape dimensions, the maximum transect length is the smaller of 100 or t/m. When transect length is truncated to 100 or when m is not a factor of t, additional transects are added until the number of surveyed cells equals t. One random uniform variate is drawn for every individual in a surveyed cell, and an individual is encountered when the random uniform variate ≤ p.
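The Random design and the encounter rule can be sketched as below. The data layout is an assumption made for illustration: each individual is a hypothetical `(species_id, cell, p)` tuple, cells are numbered 1 to 10 000, and the returned history is abundance-based with one row per simulated species and one column per surveyed cell. The Linear transect design is not shown.

```python
import random

GRID_CELLS = 100 * 100  # the landscape is partitioned by a 100 x 100 grid

def survey_random(individuals, n_species, t, seed=1):
    """Survey t randomly selected grid cells without replacement.

    individuals: iterable of (species_id, cell, p) with species_id in
    0..n_species-1 and cell in 1..GRID_CELLS.
    Returns (surveyed_cells, history), where history[s][j] is the number of
    individuals of species s encountered in the j-th surveyed cell.
    """
    if not 1 <= t <= GRID_CELLS:
        raise ValueError("t must be between 1 and 10 000")
    rng = random.Random(seed)
    surveyed = rng.sample(range(1, GRID_CELLS + 1), t)
    column = {cell: j for j, cell in enumerate(surveyed)}
    history = [[0] * t for _ in range(n_species)]
    for species_id, cell, p in individuals:
        j = column.get(cell)
        if j is None:
            continue  # the individual's cell was not surveyed
        # One uniform variate per individual; encountered when variate <= p.
        if rng.random() <= p:
            history[species_id][j] += 1
    return surveyed, history

individuals = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (2, 9999, 1.0)]
cells, hist = survey_random(individuals, n_species=3, t=GRID_CELLS, seed=7)
```

Rows that sum to zero correspond to species that were present but never encountered; those rows would be dropped before counting the observed species.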

Estimating species richness

SimAssem includes numerous SR estimators. Two in particular, CY-1 and CY-2 (see Table 3 for estimators and abbreviations), are computer-intensive estimators that performed relatively well in comparative studies (Reese 2012) but are unavailable elsewhere. Log-transformed variance estimates are used to restrict the lower bound of 95% confidence intervals to the number of species observed, Sobs (Burnham et al. 1987, part 3). Several estimators involve iterations that can be set under the Iterations menu item. For more details and estimator equations, see Appendix S1.
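To make the estimators concrete, the sketch below computes three of the simpler ones from an encounter history, using their commonly published forms: bias-corrected Chao1 and Chao2, and the first-order jackknife. It also shows a log-transformed interval whose lower bound cannot fall below Sobs. The function names are hypothetical, and the exact equations implemented in SimAssem are those in Appendix S1, which may differ in detail.

```python
import math

def summarize(history):
    """Frequency counts from an abundance history (rows = species, columns = surveys)."""
    observed = [row for row in history if sum(row) > 0]
    totals = [sum(row) for row in observed]                      # individuals per species
    incidences = [sum(1 for c in row if c > 0) for row in observed]  # surveys per species
    return {
        "s_obs": len(observed),
        "t": len(history[0]) if history else 0,
        "f1": totals.count(1), "f2": totals.count(2),            # singletons, doubletons
        "q1": incidences.count(1), "q2": incidences.count(2),    # uniques, duplicates
    }

def chao1_bc(s_obs, f1, f2):
    """Bias-corrected Chao1 (abundance data)."""
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

def chao2_bc(s_obs, q1, q2, t):
    """Bias-corrected Chao2 (incidence data from t surveys)."""
    return s_obs + ((t - 1) / t) * q1 * (q1 - 1) / (2 * (q2 + 1))

def jack1(s_obs, q1, t):
    """First-order jackknife (incidence data from t surveys)."""
    return s_obs + q1 * (t - 1) / t

def log_ci(s_est, s_obs, variance, z=1.96):
    """Log-transformed interval with a lower bound no smaller than s_obs."""
    excess = s_est - s_obs  # estimated number of undetected species
    if excess <= 0:
        return float(s_obs), float(s_obs)
    k = math.exp(z * math.sqrt(math.log(1 + variance / excess ** 2)))
    return s_obs + excess / k, s_obs + excess * k

history = [
    [3, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 0],  # present in the assemblage but never encountered
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
c = summarize(history)
est_chao1 = chao1_bc(c["s_obs"], c["f1"], c["f2"])
est_chao2 = chao2_bc(c["s_obs"], c["q1"], c["q2"], c["t"])
est_jack1 = jack1(c["s_obs"], c["q1"], c["t"])
lower, upper = log_ci(est_jack1, c["s_obs"], variance=1.0)  # variance chosen for illustration
```

The interval is built around the estimated number of undetected species rather than around the estimate itself, which is why its lower bound stays above Sobs.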

Table 3. Species richness estimators and abbreviations used in SimAssem

Description^a                      Abbreviation   Reference
Abundance-based coverage (a)       ACE            Chao & Lee (1992)
Bootstrap^b (i)                    Boot           Smith & van Belle (1984)
Bootstrap; iterated^c (i)          Boot-B         Smith & van Belle (1984)
Chao1^b (a)                        Chao1          Chao (1984)
Chao1 (bias-corrected)^b (a)       Chao1BC        Chao (2005)
Chao2^b (i)                        Chao2          Chao (1987)
Chao2 (bias-corrected)^b (i)       Chao2BC        Chao (2005)
Coverage-adjusted (i)              C1, C2, C3     Ashbridge & Goudie (2000)
CY-1^c (i)                         CY-1           Cao, Larsen & Hughes (2001)
CY-2 (i)                           CY-2           Cao, Larsen & White (2004)
Darroch–Ratcliff (i)               DR             Darroch & Ratcliff (1980)
Incidence-based coverage (i)       ICE            Lee & Chao (1994)
1st-order jackknife^b (i)          Jack1          Burnham & Overton (1978)
2nd-order jackknife^b (i)          Jack2          Burnham & Overton (1978)
3rd-order jackknife^b (i)          Jack3          Burnham & Overton (1978)
4th-order jackknife^b (i)          Jack4          Burnham & Overton (1978)
5th-order jackknife^b (i)          Jack5          Burnham & Overton (1978)
Mixture-model (i)                  Mixture        Pledger (2000)
Observed species count             Sobs

^a A description of the estimator where (a) indicates that the estimator uses sample abundance data, that is, number of individuals, and (i) indicates that the estimator uses sample incidence data, that is, presence/absence in surveys.
^b Variance is estimated by an analytically derived estimator.
^c Variance is estimated by the variance across iterations.

Additional output

Other reported values, some requiring simulated data, include: (i) the number of simulated species, (ii) the total number of simulated individuals, (iii) the number of species observed, (iv) the number of surveys with encounters, (v) the total number of individuals encountered, (vi) the true and estimated sample coverage (via CY-1), (vii) Shannon's evenness index (Shannon & Weaver 1949) and (viii) Clark and Evans aggregation index (Clark & Evans 1954). Two diversity indices are also reported: Margalef's diversity index (Clifford & Stephenson 1975) and Menhinick's index (Whittaker 1977).
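Three of these indices have simple closed forms, sketched below using their textbook definitions: Shannon's evenness as H′/ln S, Margalef's index as (S − 1)/ln N and Menhinick's index as S/√N. Whether SimAssem applies them to the simulated assemblage or to the sample is not stated here, so the input is simply a list of per-species abundances; the function names are hypothetical.

```python
import math

def _positive(abundances):
    """Keep only species with at least one individual."""
    return [a for a in abundances if a > 0]

def shannon_evenness(abundances):
    """Shannon's evenness, H' / ln(S), with H' = -sum(p_i * ln(p_i))."""
    counts = _positive(abundances)
    if len(counts) < 2:
        raise ValueError("evenness needs at least two species")
    n = sum(counts)
    h = -sum((a / n) * math.log(a / n) for a in counts)
    return h / math.log(len(counts))

def margalef(abundances):
    """Margalef's diversity index, (S - 1) / ln(N)."""
    counts = _positive(abundances)
    n = sum(counts)
    if n < 2:
        raise ValueError("Margalef's index needs at least two individuals")
    return (len(counts) - 1) / math.log(n)

def menhinick(abundances):
    """Menhinick's index, S / sqrt(N)."""
    counts = _positive(abundances)
    if not counts:
        raise ValueError("Menhinick's index needs at least one individual")
    return len(counts) / math.sqrt(sum(counts))

even = [10, 10, 10, 10]
uneven = [37, 1, 1, 1]
j_even = shannon_evenness(even)      # perfectly even assemblage
j_uneven = shannon_evenness(uneven)  # strongly dominated assemblage
```

Both example assemblages have S = 4 and N = 40, so Margalef's and Menhinick's indices are identical for them while the evenness values differ.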

Biological surveys are generally expensive and often provide diminishing returns on investment, that is, effort. SimAssem includes two estimators of the additional effort needed to encounter a user-specified proportion (Parameters menu item) of certain SR estimators (Chao et al. 2009). One estimates the additional number of individuals needed to encounter the specified proportion of Chao1, thus requiring abundance data. An incidence-based version estimates the number of additional surveys, for example, quadrats, needed to encounter a user-specified proportion of Chao2.

Import and export options

SimAssem can import comma-, space- and tab-delimited encounter history data saved as a plain text file. The first line is disregarded and is therefore useful for documentation. Line two must contain two numbers: Sobs and the number of surveys conducted. The encounter history data must begin on line three, where each row represents a different species and each column a different survey result. Results are recorded either by abundance, that is, the actual number of individuals encountered, or by incidence, that is, a one indicates that one or more individuals were encountered and a zero that there were no encounters.
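A reader for this layout might look like the following. This is an illustrative sketch, not SimAssem's import routine: the function names are hypothetical, and the parser accepts mixed delimiters within one file, which is more permissive than the program may be.

```python
import re

def _split(line):
    """Split a line on commas, spaces or tabs, ignoring empty tokens."""
    return [tok for tok in re.split(r"[,\s]+", line.strip()) if tok]

def parse_encounter_history(text):
    """Return (s_obs, n_surveys, rows) from the text of an encounter history file."""
    lines = text.splitlines()
    if len(lines) < 3:
        raise ValueError("expected a comment line, a header line and data rows")
    # Line one is documentation only and is disregarded.
    s_obs, n_surveys = (int(tok) for tok in _split(lines[1]))
    rows = [[int(tok) for tok in _split(line)] for line in lines[2:] if line.strip()]
    if len(rows) != s_obs:
        raise ValueError(f"expected {s_obs} species rows, found {len(rows)}")
    if any(len(row) != n_surveys for row in rows):
        raise ValueError(f"every row must contain {n_surveys} survey results")
    return s_obs, n_surveys, rows

def to_incidence(rows):
    """Convert abundance counts to incidence (1 = encountered, 0 = not)."""
    return [[1 if count > 0 else 0 for count in row] for row in rows]

sample = "Example data, 3 species by 4 surveys\n3 4\n1,0,2,0\n0 0 1 1\n0\t3\t0\t0\n"
s_obs, n_surveys, rows = parse_encounter_history(sample)
incidence = to_incidence(rows)
```

Checking the row and column counts against line two catches truncated or misaligned files before any estimates are computed.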

There are several export options available in SimAssem. Estimates can be exported to a comma-delimited file, where the first line lists estimator names and the following lines list the estimates. Encounter history data can be formatted for programs EstimateS (Colwell 2006), MARK (White & Burnham 2009) and SPADE (Chao & Shen 2010). Also, individual-level data can be exported to a comma-delimited text file including (in the following order): a numerical species identifier, x-coordinate, y-coordinate, the grid cell in which it fell (beginning with 1 in the lower left corner and proceeding first across and then up), p, and whether the individual was encountered (1) or not (0). Accumulation curve data are also exportable where, beginning with the specified survey size (1–t) and increasing sequentially by that amount, surveys are randomly drawn without replacement, and estimates at each survey size are averaged over a user-specified number of replications.
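The grid cell numbering in the individual-level export can be reproduced with a short function. The sketch assumes that coordinates lie on a unit square with the origin in the lower left corner; the coordinate range that SimAssem actually exports is not stated here, and the function name is hypothetical.

```python
GRID_SIDE = 100  # the landscape is partitioned by a 100 x 100 grid

def grid_cell(x, y, side_length=1.0):
    """Cell number for a point, counting from 1 in the lower left corner,
    first across (increasing x) and then up (increasing y)."""
    if not (0 <= x <= side_length and 0 <= y <= side_length):
        raise ValueError("point lies outside the landscape")
    # Points on the upper or right edge are assigned to the last row or column.
    col = min(int(x / side_length * GRID_SIDE), GRID_SIDE - 1)
    row = min(int(y / side_length * GRID_SIDE), GRID_SIDE - 1)
    return row * GRID_SIDE + col + 1

lower_left = grid_cell(0.0, 0.0)
lower_right = grid_cell(0.995, 0.0)
second_row_start = grid_cell(0.0, 0.015)
upper_right = grid_cell(1.0, 1.0)
```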


SimAssem allows users to quickly and easily evaluate numerous estimators across a wide range of assemblages. Such investigations are difficult, if not impossible, in the real world because of both sampling limitations and uncertainty regarding the true assemblage parameters. Simulating assemblages involves a considerable amount of stochasticity; therefore, SimAssem provides the option to set the number of runs with a specific set of parameters in order to facilitate comparisons.

We envision SimAssem being used to compare estimator performance under surveyed or expected field conditions, thereby improving estimator selection for a particular application. For example, suppose that data were collected from an assemblage with an apparent log-normal species-abundance distribution, species-specific spatial aggregation and where species detection probabilities varied around an average of 0.3. SimAssem could reveal that coverage-based estimators (see Chao & Lee 1992; Lee & Chao 1994) are less biased than other estimators in assemblages simulated with similar characteristics and robust to variation in the degree of spatial aggregation. Furthermore, SimAssem could provide an estimate of the amount of bias (e.g. −15%) given the level of effort expended.
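Because the true number of species is known in a simulation, relative bias and true sample coverage follow directly from the output of repeated runs. The helper functions below are hypothetical and the estimates are invented solely to reproduce the −15% figure used in the example above.

```python
def percent_bias(estimates, s_true):
    """Mean relative bias of an estimator across runs, as a percentage."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - s_true) / s_true

def sample_coverage(s_obs, s_true):
    """True sample coverage: proportion of the species pool in the sample."""
    return s_obs / s_true

# Illustrative values only: three runs against a simulated pool of 25 species.
estimates = [20.0, 22.0, 21.75]
bias = percent_bias(estimates, s_true=25)
coverage = sample_coverage(s_obs=18, s_true=25)
```

A negative value indicates that the estimator, on average, underestimates the simulated richness at the specified level of effort.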


We thank Amy Angert and Thomas Stohlgren for their reviews of an earlier version of this manuscript. We also thank an anonymous reviewer for comments that considerably improved this manuscript.