Undersampling and the measurement of beta diversity

Authors


Corresponding author. E-mail: jan.beck@unibas.ch

Summary

  1. Beta diversity is a conceptual link between diversity at local and regional scales. A variety of methodologies for quantifying it and related phenomena have been applied; among them, measures of pairwise (dis)similarity of sites are particularly popular. Undersampling, i.e. not recording all taxa present at a site, is a common situation in ecological data. Bias in many metrics related to beta diversity must be expected, but only a few studies have explicitly investigated the properties of various measures under undersampling conditions.
  2. On the basis of an empirical data set representing near-complete local inventories of the Lepidoptera of an isolated Pacific island, as well as simulated communities with varying properties, we mimicked different levels of undersampling. We used 14 different approaches to quantify beta diversity, among them dataset-wide multiplicative partitioning (i.e. ‘true beta diversity’) and pairwise site × site dissimilarities. We compared their values from incomplete samples with the true results from the full data, used these comparisons to quantify undersampling bias, and calculated correlations between dissimilarity measures from undersampled data and those from the complete data.
  3. Almost all tested metrics showed bias and low correlations under moderate to severe undersampling conditions (as well as deteriorating precision, i.e. large chance effects on results). Measures that used only species incidence were very sensitive to undersampling, while abundance-based metrics with high dependency on the distribution of the most common taxa were particularly robust. Simulated data showed sensitivity of results to the abundance distribution, confirming that data sets of high evenness and/or the application of metrics that are strongly affected by rare species are particularly sensitive to undersampling.
  4. The class of beta measure to be used should depend on the research question being asked as different metrics can lead to quite different conclusions even without undersampling effects. For each class of metric, there is a trade-off between robustness to undersampling and sensitivity to rare species. In consequence, using incidence-based metrics carries a particular risk of false conclusions when undersampled data are involved. Developing bias corrections for such metrics would be desirable.

Introduction

Not all species occur at equal abundances at all sites. However, this (in principle) simple phenomenon has proved difficult to measure and interpret. True beta diversity has been described as the link between species diversity at local and regional scales (i.e. alpha and gamma diversity), but many studies have employed, explicitly or implicitly, quite different ideas of how to quantify the phenomenon, e.g. various (dis)similarity metrics between pairs of communities (Tuomisto 2010a).

Recently, several authors have attempted to straighten out the different aspects that have been summarized under the umbrella term ‘beta diversity’ (Whittaker, Willis & Field 2001; Jost 2007; Jurasinski, Retzer & Beierkuhnlein 2009; Jost et al. 2010; Tuomisto 2010a,b; Anderson et al. 2011; Jost, Chao & Chazdon 2011). Apart from the ‘true beta diversity’ of a region, i.e. the ratio between gamma diversity and average alpha diversity (Whittaker 1960; Jost 2007; Jost et al. 2010), researchers have often applied pairwise (dis)similarities between samples (arranged and analysed as symmetric distance matrices). Some of the most widely used similarity metrics are the Jaccard, Sørensen and Bray-Curtis indices (Magurran 2004), while an overwhelming number of further metrics has been devised (Koleff, Gaston & Lennon 2003). Further concepts include additive partitioning of alpha, beta and gamma components of diversity, (dis)similarity metrics for more than two samples (Chao et al. 2008), direct ordination of species' abundances (Legendre, Borcard & Peres-Neto 2005) and inferences from species accumulation curves (Tuomisto 2010b). Furthermore, effects of shared species, nestedness and species richness have been acknowledged as different components of beta diversity (Baselga 2010; Schmera & Podani 2011).

As with metrics of alpha and gamma diversity, one practical consideration (often constrained by data availability) is whether the abundances of species are considered or whether incidence data (i.e. presence/absence) are used. Different concepts of quantifying the degree to which sites differ in their species communities lead to measuring quite different ecological phenomena, so the first decision when choosing methods is their appropriateness to the ecological question at hand (Tuomisto 2010a,b). This important decision is not the topic of this article, but will be discussed in the context of our results.

Data quality issues (e.g. sampling methods and timing, spatial position of sites, taxonomy) can affect the measurement and interpretation of biodiversity irrespective of definitions and concepts. Undersampling, i.e. not recording all species present at a site, has been recognized as the rule rather than the exception in field studies from species-rich systems (Coddington et al. 2009), and it also plays a role in many assessments of gamma diversity at macroecological scales (Beck et al. 2012a). With the advent of estimators to correct for undersampling bias, researchers became increasingly aware of this problem with respect to alpha and gamma diversity (Hurlbert 1971; Wolda 1981; Colwell & Coddington 1994; Chao & Shen 2003; Beck & Kitching 2007; Beck & Schwanghart 2010). However, undersampling can also bias the measurement of beta diversity, and, because of its simultaneous effect on alpha and gamma diversity estimates, it is not trivial to even predict the direction of expected bias (Tuomisto 2010b).

Recently, Cardoso, Borges & Veech (2009) used simulation experiments to assess the robustness of several incidence-based measures of faunal similarity to undersampling. Furthermore, Chao et al. (2005, 2006) have devised bias corrections to variants of some popular community similarity metrics (i.e. Jaccard, Sørensen) and tested their behaviour under undersampling conditions. Morisita's index (Morisita 1959; Wolda 1981) and its generalized form (normalized expected shared species, NESS; Grassle & Smith 1976; Trueblood, Gallagher & Gould 1994) were reported to be relatively unbiased by small samples. Comparisons of various metrics with regard to their robustness to undersampling, as provided e.g. by Cardoso, Borges & Veech (2009), are important guidelines for authors of field studies to choose metrics appropriate to their data.

Here, we advance this approach by including various concepts (i.e. average ‘true beta diversity’, pairwise dissimilarity based on incidence and abundance data) and comparing their behaviour under undersampling conditions. Rare species in a community will be represented less faithfully in a small sample, both with regard to their occurrence (i.e. they may often not be found at all) and their relative abundance (random variation in individual numbers has a larger effect on their relative abundance in the sample), whereas common species will often be adequately represented. We therefore hypothesize that the abundance distribution of samples, and its influence on metrics of beta diversity or dissimilarity, affects the robustness of these metrics to undersampling. Metrics that are strongly affected by common species should be more robust to undersampling than those that are sensitive to the occurrence and relative abundance of rare taxa (at the extreme, incidence-based measures).

We utilize a suitable empirical data set as well as simulated species communities to investigate undersampling effects. This enables us to (1) quantify the magnitude of the problem for various estimates of beta diversity, and (2) advise users on which metrics are particularly suitable if high degrees of undersampling are known or suspected in a data set. Because of the large number of proposed measures, we had to restrict our analysis to a handful of metrics. We tried to include measures that are widely used, that seemed particularly suitable for avoiding undersampling effects, or that had properties allowing us to investigate the hypothesized causes of undersampling bias. For example, several pairwise indices of dissimilarity have been shown to be related to each other as a function of a scale parameter q, where increasing q leads to an increased focus on the abundant species (the Hill series for local diversity is a popular example of these relationships; Jost 2006); i.e. q = 0 leads to Sørensen, q = 1 to Horn and q = 2 to Morisita-Horn. Similarly, ‘true beta diversity’ can be based on species numbers (q = 0) or Shannon diversity (q = 1). More details on the conceptual relationships between different metrics can be found in Tuomisto (2010a,b). Table 1 lists the metrics included in our comparison and provides descriptions of their properties.
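To make the two uncorrected multiplicative measures concrete, the following minimal R sketch (invented toy counts, not our Norfolk data) computes βM1 from observed species richness and βM2 from effective numbers of species, with mean local diversity obtained as the exponent of the mean Shannon entropy, as described for metric (2) in Table 1 below.

```r
## Toy community matrix: 3 sites (rows) x 5 species (columns); counts are invented.
comm <- matrix(c(10, 5, 2, 0, 0,
                  8, 6, 0, 3, 1,
                  0, 4, 7, 5, 2),
               nrow = 3, byrow = TRUE)

shannon <- function(x) {            # Shannon entropy H of one sample
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

## beta_M1: gamma richness divided by mean local richness (incidence only)
gamma_rich <- sum(colSums(comm) > 0)
alpha_rich <- mean(rowSums(comm > 0))
beta_M1 <- gamma_rich / alpha_rich

## beta_M2: effective numbers of species exp(H); alpha component = exp(mean H)
gamma_eff <- exp(shannon(colSums(comm)))
alpha_eff <- exp(mean(apply(comm, 1, shannon)))
beta_M2 <- gamma_eff / alpha_eff

c(beta_M1 = beta_M1, beta_M2 = beta_M2)
```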

Table 1. Beta diversity metrics investigated in this study. For each metric we give its definition (or a reference to the formula), whether abundance data are used, and additional comments.
  1. (1)–(4) represent regional measures of multiplicative ‘true beta diversity’. (1) is based on species richness (i.e. incidence data), whereas (2) uses the effective number of species SpE = exp(H), where H is Shannon's entropy (Jost 2006; note that average local diversity is calculated as the exponent of the average H). (3) and (4) apply undersampling corrections to the respective measures of α- and γ-diversity: Chao1 estimators of species richness (Chao 1984) were used in (3), and the bias-corrected version of Shannon's entropy (Hbc; Chao & Shen 2003) was used for (4). (5)–(14) represent pairwise dissimilarity measures that lead to a symmetric site × site dissimilarity matrix. (5) is the complement of Jaccard's index (Magurran 2004). Note that Sørensen's index (9) has very similar mathematical properties; both indices are related to true beta diversity (βM) and proportional effective species turnover (βPt) through relatively simple transformations for the special case of two sample sites (Tuomisto 2010a). (6) and (7) are dissimilarity metrics that were particularly robust to undersampling in Cardoso, Borges & Veech (2009); we used Cardoso et al.'s modification of Williams' β−3. (8) is an abundance-based variant of Jaccard's index that is corrected for undersampling by estimating the unseen shared species in two samples (Chao et al. 2005, 2006); note that it differs from many other abundance-based dissimilarity indices (e.g. 10–13) in not being based on matching relative abundances species-by-species. (9)–(13) are conceptually linked dissimilarity metrics (Jost 2006; Chao et al. 2008): whereas (9) relates to species counts (i.e. q = 0), Horn overlap (10; Horn 1966) is based on Shannon entropy (q = 1) and the Morisita-type indices (11, 12) on Simpson diversity (q = 2). (12) is a probabilistic metric that quantifies the complement of the chance of finding the same species when drawing two individuals at random from two samples (Morisita 1959); in a slightly different formulation (11) it becomes the Morisita-Horn index. Both variants are considered highly dependent on the most abundant species in communities. NESS (Grassle & Smith 1976) is an extension of (12) in which a parameter m allows putting more weight on rare species (we used m = 10); Trueblood, Gallagher & Gould (1994) suggested that better results are recovered after chord transformation (CNESS; 13). (14) is a widely used quantitative extension of Sørensen's index that takes species' abundances into account. For more information on the properties and implications of the metrics we refer to Southwood & Henderson (2000), Koleff, Gaston & Lennon (2003), Magurran (2004) and Tuomisto (2010a,b).

  2. Notation for pairwise indices: a = species common to both sites; b, c = species exclusive to one of the two compared sites; h1N = total individuals sampled in habitat h1; h2N = total individuals sampled in habitat h2; jN = sum of the lesser abundance values for species common to both habitats.

(1) βM1 (observed species richness, Sp): γ/ᾱ based on observed species richness, i.e. the richness of the pooled data divided by the average local richness. Abundance data: no.

(2) βM2 (effective number of species, SpE): γ/ᾱ based on effective numbers of species, SpE = exp(H); the α-component is calculated as the exponent of the average local H. Abundance data: yes.

(3) βM3 (Chao1-estimated species richness, estSp): as (1), but using Chao1-estimated instead of observed species richness. Abundance data: no. Comment: value compared with the full-sample value of βM1.

(4) βM4 (bias-corrected effective number of species, SpEbc): as (2), but using the bias-corrected Shannon entropy Hbc. Abundance data: yes. Comment: value compared with the full-sample value of βM2.

(5) Complement of Jaccard index (CJ): CJ = 1 − a/(a + b + c). Abundance data: no. Comment: for two sites, βM = 2/(1 + J) and βPt = (1 − J)/2, where J = a/(a + b + c) is the Jaccard similarity (Tuomisto 2010a).

(6) β−2 (Harrison et al. 1992): incidence-based dissimilarity; see Harrison et al. (1992) and Koleff, Gaston & Lennon (2003) for the formula. Abundance data: no.

(7) β−3 (Williams 1996): incidence-based dissimilarity; we used the modification of Cardoso, Borges & Veech (2009), where the formula is given. Abundance data: no.

(8) Complement of bias-corrected ‘Jaccard index’ (CJchao): ‘Jaccard index’ for abundance data, corrected for unseen shared species; see Chao et al. (2005, 2006) for the formula. Abundance data: yes.

(9) Complement of Sørensen index (CS): CS = 1 − 2a/(2a + b + c). Abundance data: no.

(10) Complement of Horn overlap (Ch): Ch = 1 − [Σ(xi + yi) log(xi + yi) − Σ xi log xi − Σ yi log yi] / [(X + Y) log(X + Y) − X log X − Y log Y], where xi is the abundance of species i in a sample of X individuals from one site and yi the abundance of species i in a sample of Y individuals from the other site; all sums (Σ) run from i = 1 to i = S, where S is the number of species. Abundance data: yes.

(11) Complement of Morisita-Horn index (Cmh): Cmh = 1 − 2 Σ xi yi / [(Σ xi²/X² + Σ yi²/Y²) X Y]; see (10) for notation. Abundance data: yes.

(12) Complement of Morisita index (Cm): Cm = 1 − 2 Σ xi yi / [(dX + dY) X Y], where dX = Σ xi(xi − 1)/[X(X − 1)] and dY is defined analogously; see (10) for notation. Abundance data: yes.

(13) Complement of CNESS (m = 10): chord-transformed normalized expected shared species; see Grassle & Smith (1976) and Trueblood, Gallagher & Gould (1994) for the formula. Abundance data: yes. Comment: distance ranges between 0 and √2.

(14) Complement of Bray-Curtis index (CN): CN = 1 − 2 jN / (h1N + h2N). Abundance data: yes.
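As a worked illustration of some of the pairwise formulas above, a minimal R sketch (two invented abundance vectors, for illustration only and not the analysis pipeline of this study) computes the complements of the Jaccard, Sørensen, Morisita-Horn and Bray-Curtis indices.

```r
x <- c(12, 7, 3, 1, 0, 0)   # abundances of 6 species at site 1 (invented)
y <- c( 9, 0, 4, 2, 5, 0)   # abundances of the same species at site 2

X <- sum(x); Y <- sum(y)
a  <- sum(x > 0 & y > 0)    # species shared by both sites
b  <- sum(x > 0 & y == 0)   # species exclusive to site 1
c_ <- sum(x == 0 & y > 0)   # species exclusive to site 2

CJ  <- 1 - a / (a + b + c_)                          # complement of Jaccard (5)
CS  <- 1 - 2 * a / (2 * a + b + c_)                  # complement of Soerensen (9)
Cmh <- 1 - 2 * sum(x * y) /
        ((sum(x^2) / X^2 + sum(y^2) / Y^2) * X * Y)  # complement of Morisita-Horn (11)
CN  <- 1 - 2 * sum(pmin(x, y)) / (X + Y)             # complement of Bray-Curtis (14)

round(c(CJ = CJ, CS = CS, Cmh = Cmh, CN = CN), 3)
```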

Materials and methods

We used empirical data to ensure that data structures resembled those found in real species communities, and simulated communities to provide generality in our conclusions, i.e. to make sure they are not affected by idiosyncratic properties of a particular data set. Simulated communities also allowed us to investigate how certain variations in community structure affect undersampling sensitivity. To both types of data, we applied a resampling protocol to mimic the process of undersampling, and we compared analyses of such incomplete data to ‘true’ results from the fully inventoried communities. It is important to note that our primary aim was to see which method successfully accounted for undersampling (i.e. sample results should be similar to full data results). Different methods may, independently of the undersampling problem, lead to different conclusions because they imply different interpretations of quantifying differences between sites.

Empirical data

We utilized a data set of the nocturnal Macrolepidoptera of Norfolk Island (see Holloway 1977, 1996). The moth fauna of this isolated Pacific island can be clearly separated into resident species and only slightly fewer vagrants from Australia, and we used only data on residents for this study. Criteria for identifying vagrants (Holloway 1982) included, among others, inconsistency of appearance and numbers from year to year, times of appearance coinciding across frequent to rare species and correlating with frontal systems passing over the island, absence of known host plants, and a track record of vagrancy (e.g. records on oil rigs, etc.). This data set is exceptionally suitable for our analysis because failure to separate vagrant and resident communities (Holloway 1977, 1982, 1996; see also Magurran & Henderson 2003) can lead to erroneous assessments of diversity (Wolda & Marek 1994). Despite its small size, Norfolk Island features different landscapes, such as forests, human-affected habitats and coastal cliffs (differing in associated Lepidoptera), and the selected sites (see below) spanned this habitat variability. The very large numbers of individuals are partly due to pooling data from 12 years. Temporal variation in species' relative abundances and community evenness occurred, but analyses based on a smaller subset of these data (Holloway 1977) did not differ in their ecological conclusions from those based on the whole data set (Holloway 1996).

We excluded 34 sites where either fewer than 4000 individuals were collected or the proportion of singletons was higher than 8% of all observed species. For the 6 remaining sites, we had 48210 sampled individuals (on average per site: 8035 individuals, 40 species, 2·5 singletons = 6% of total species), covering 52 of the 56 resident species known to occur on the island (i.e. gamma diversity, measured as the sum of species in the 6 samples, is similar to the total regional species richness of residents on this well-surveyed island). We calculated the numerical species richness estimators Chao1 and ACE (Colwell & Coddington 1994), which suggested that 95–100% (mean 97%) of the estimated local species richness was observed. We consider these 6 sites completely inventoried for the purpose of our study (note that singleton records persist even in much larger ecological sampling programmes; Nee, Harvey & May 1991). Species-abundance relationships per site were typical in resembling a (truncated) log-normal distribution (KS-test for normality on log10-transformed data, p > 0·20; Holloway 1996). Many species occurred at all 6 sites, whereas some were restricted to one or a few sites. There was a positive abundance-occupancy relationship (N = 52, Spearman's R = 0·87, p < 0·0001), i.e. locally common species were often also found at many sites.
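As a hedged sketch of this kind of completeness check, the vegan package's estimateR() returns observed richness together with Chao1 and ACE estimates, from which the observed proportion can be derived (toy counts below, not the Norfolk data).

```r
library(vegan)  # for estimateR()

## Toy site x species count matrix (invented); rows are sites.
comm <- rbind(site1 = c(25, 12, 6, 3, 1, 1, 0, 0),
              site2 = c(18,  9, 0, 4, 2, 1, 1, 0),
              site3 = c( 0, 15, 8, 5, 2, 1, 1, 1))

est <- estimateR(comm)                 # S.obs, Chao1 and ACE (with SEs) per site
est

## Proportion of the estimated richness that was actually observed ("completeness")
round(est["S.obs", ] / est["S.chao1", ], 2)
round(est["S.obs", ] / est["S.ACE", ], 2)
```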

Simulated data

To simulate sampling from a range of theoretically realistic communities, we used the software SimSSD4 (Legendre, Borcard & Peres-Neto 2005). This software allows simulating the abundance of species in samples from a grid-based landscape, constrained by environmental gradient(s) to which species may react as well as by autocorrelation of species ranges. Subject to these constraints, species' abundances are first drawn from a normal distribution and then transformed (e.g. exponentiated, rounded) to yield realistically distributed abundances.

We simulated a landscape of 100 × 100 pixels with environmental gradients along both axes, in which species with autocorrelated abundances occur. These landscapes were sampled systematically at 25 sites, and the data were transformed to provide log-normal-type species frequencies (as found in real communities; i.e. option int(exp(y)) in SimSSD4; abbreviated ‘exp.’). Furthermore, we replicated all simulations with square-rooted abundances (int(sqrt(exp(y))) in SimSSD4; ‘sqrt.’), hence producing a more even abundance distribution of taxa in the samples. While all simulations had a landscape-wide species richness of 50, we varied two further properties that can affect resulting patterns of beta diversity. We used three different variogram ranges for the autocorrelation of species abundance (5, 15, 30 pixels, leading to increased clumping in the distribution of each species and hence higher absolute differences between samples), and we varied the proportion of species that were distributed by chance only, as opposed to those that were additionally related to the simulated environmental gradients (10:40, 25:25, 40:10); the higher the proportion of gradient-related species, the more spatial structure communities should show in their beta diversity patterns. For each of the 12 parameter combinations (2 types of abundance × 3 variogram ranges × 3 proportions of random vs. environmentally bound species), we simulated three runs (i.e. nsim = 3 in SimSSD4) to be able to assess the variation expected from random effects alone, but we present only results from run 1 in detail. Details on parameter settings for each simulation are provided in the Supplementary Information.

Undersampling

We resampled original data (empirical and simulated) by randomly drawing individuals (without replacement) until a fixed proportion of total species richness of each site was recorded. This proportion (henceforth ‘sample completeness’) is a direct and intuitive measure of undersampling. Probabilities of encountering particular species were given by their relative abundance in the full data for the respective site. We simulated sample completeness from 0·3 (i.e. only 30% of the species at each site have been recorded) to 1·0 in intervals of 0·1. For each level of sample completeness, we performed a bootstrapping analysis with 1000 undersampled distributions and calculated all metrics (Table 1) for each bootstrap sample.
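A minimal R sketch of this resampling step, assuming a vector of per-species counts for one site (the function name and toy counts are our own illustration, not the study's MATLAB code): individuals are drawn at random without replacement until the target proportion of the site's species has been recorded.

```r
## Draw individuals without replacement until `target` proportion of the
## site's species has been recorded; returns the resampled count vector.
undersample <- function(counts, target = 0.5) {
  pool   <- rep(seq_along(counts), counts)   # one entry per individual
  pool   <- sample(pool)                     # random draw order (without replacement)
  S_true <- sum(counts > 0)
  seen   <- integer(0)
  n      <- 0
  while (length(unique(seen)) < target * S_true) {
    n    <- n + 1
    seen <- c(seen, pool[n])
  }
  tabulate(seen, nbins = length(counts))     # counts of each species in the subsample
}

set.seed(1)
undersample(c(120, 40, 15, 6, 3, 1, 1), target = 0.7)
```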

We measured bias by comparing results from incomplete samples with those from the full data for each chosen metric. As a measure of comparison, we used the mean error (over 1000 bootstrap replications) divided by the interval length (interval-scaled mean error, ISME; i.e. scaled to 1 for dissimilarities ranging from 0 to 1, scaled to the number of sites − 1 for true beta diversities, etc.). For pairwise dissimilarities, ISME was calculated for all unique site-by-site pairs. We considered the 15 pairwise calculations for the empirical data separately, but used locally weighted regression spline smoothing (LOESS) to indicate average trends for the 300 site pairs from each simulation.

In addition, for pairwise site dissimilarities, we compared observed dissimilarity matrices from undersampled data with the true matrices by linear correlation (i.e. Pearson's r). The rationale behind this is that many ecologists would be satisfied if the dissimilarities of sites in the sample matrix, relative to each other, correspond with their relative positions derived from complete data, even if the numerical values of site × site comparisons were biased, i.e. different from the complete-data values. If dissimilarity data are used for further correlation analysis (e.g. with environmental conditions), preserving the right order is more relevant for correct conclusions than deviations in absolute values. Bias (i.e. deviation of numerical values from those based on complete data), on the other hand, is more important if data are compared with other studies. Thus, it depends on the specific use of the data whether ISME or correlation is more useful to quantify undersampling effects, and we present and discuss both assessments. We assessed the estimates using averages from bootstrapping and indicated their precision (Walther & Moore 2005) by showing standard deviations in the figures.
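The two evaluation criteria can be sketched in R as follows (hypothetical helper functions, not the study's code): ISME as the mean signed error over bootstrap replicates divided by the length of the metric's interval, and the Pearson correlation between the lower triangles of the undersampled and true site × site dissimilarity matrices.

```r
## `boot_vals`: bootstrap values of one metric for one site pair (e.g. 1000 values);
## `true_val` : value of the same metric from the complete data;
## `interval` : length of the metric's range (e.g. 1 for 0-1 dissimilarities).
isme <- function(boot_vals, true_val, interval = 1) {
  mean(boot_vals - true_val) / interval
}

## Pearson correlation of an undersampled dissimilarity matrix with the true one.
## `d_sub`, `d_true`: symmetric site x site matrices of equal dimension.
matrix_cor <- function(d_sub, d_true) {
  cor(d_sub[lower.tri(d_sub)], d_true[lower.tri(d_true)], method = "pearson")
}
```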

Although the primary purpose of this study was analysing undersampling effects, we also compared results obtained with the different metrics for the full empirical data set (matrix correlations of pairwise dissimilarities; Supporting Information). Furthermore, we extracted three measures that are sometimes used as approximations of the completeness of a sample (i.e. number of individuals N; N/S ratio; coverage, Good 1953) to evaluate their relationship with sample completeness as defined here (Supporting Information).
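For illustration, the three approximations can be computed from a single site's count vector; we assume here the common form of Good's (1953) coverage estimator, 1 − f1/N, where f1 is the number of singletons (an assumption about the exact variant used in the study).

```r
sample_stats <- function(counts) {
  counts <- counts[counts > 0]
  N  <- sum(counts)                  # number of individuals
  S  <- length(counts)               # observed species richness
  f1 <- sum(counts == 1)             # singletons
  c(N = N, N_over_S = N / S, coverage = 1 - f1 / N)
}

sample_stats(c(120, 40, 15, 6, 3, 1, 1))
```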

Undersampling simulations, calculation of metrics and analyses of undersampling effects were carried out in MATLAB (http://www.mathworks.com/products/matlab/, scripts and functions are available in the Supplementary Information). While many procedures were written by us, we integrated available MATLAB and R functions (http://www.r-project.org/; packages vegan and ecodist accessed from within MATLAB using statconnDCOM) to calculate some of the metrics. CNESS (chord-transformed normalized expected shared species) was calculated with the MATLAB software Legal6 (Legendre & Gallagher 2001).

Results

Figure 1 shows results for dataset-wide assessments of multiplicative beta diversity. Metrics based on species richness (βM1, βM3) often overestimated beta diversity if undersampling occurred, although less so if undersampling was explicitly accounted for (βM3; however, precision was low with highly incomplete data). Results from empirical and simulated data were very similar, and there were no differences between simulation types. Metrics based on effective numbers of species (βM2, βM4) were less biased in the empirical data, but this was not consistently so in the simulated data. The undersampling correction in βM4, in particular, worked well with the empirical data, whereas in simulations with a log-normal abundance distribution its bias was not much lower than that of the uncorrected metric, and for the more even, square-root-transformed data it was much worse (see Discussion).

Figure 1.

Undersampling errors (scaled to the data interval, ISME) on dataset-wide metrics of multiplicative beta diversity. Metrics βM1 and βM3 are based on species counts, βM2 and βM4 on Shannon entropy. βM3 and βM4 explicitly attempt to account for undersampling effects. Graphs on the left are based on empirical data (Norfolk), those on the right on simulations of communities that vary in autocorrelation [variogram range (ra) 5–30 pixel], the proportion of species (sp) affected by an environmental gradient (10:40, 25:25, 40:10) and species-abundance distributions (log-normal (exp.), square root (sqrt) transformation thereof). All data are means of 1000 random draws (standard deviations shown only for empirical data). Unbiased estimates of beta diversity have an ISME of zero. Note the difference in y-axis scaling in βM4 simulation (lower left); the usual y-axis scaling (max. 0·25) is indicated by a horizontal line.

Correlations of pairwise dissimilarity metrics between samples and full data (Fig. 2), as well as bias of these metrics (Fig. 3 and Supporting Information), indicate strong effects of undersampling (and low precision) for all incidence-based metrics (Jaccard, Sørensen, β−2, β−3). Correlations were also surprisingly weak and bias was high for the abundance-based Jaccard-Chao metric (CJchao) despite its explicit correction for unseen shared species, especially for simulated data. Data in Fig. 3 suggest that this metric is particularly sensitive to idiosyncratic data conditions (i.e. bias varied strongly between individual site pairs). Simulations indicated lower correlations and higher bias in data with more even species-abundance distributions (i.e. square-rooted), as was the case for all abundance-based metrics.

Figure 2.

Undersampling effects on pairwise dissimilarity metrics as measured by correlation (Pearson's r; black line) of site × site pairwise distance matrices with the respective true matrix. A perfect match in the (relative) dissimilarities of sites would yield r = 1. Note that this criterion accounts for the preservation of order, but not necessarily for correct values (i.e. bias; Fig. 3 and Supporting Information). Data on the left refer to empirical moth samples from Norfolk Island, data on the right to simulated communities (see Fig. 1 for acronyms of the different simulations). All data are means of 1000 random draws for each level of sample completeness; standard deviations (SD) are shown as a measure of precision only for the empirical data.

Figure 3.

Bias (ISME) for pairwise site dissimilarities from undersampled, simulated communities (unbiased metrics have ISME = 0). Points show values for the 300 pairwise site comparisons per simulation (means of 1000 replicates); lines are smoothing curves (LOESS). We show simulated data for different species-abundance distributions (log-normal (exp.) and square-rooted log-normal (sqrt.)) but otherwise identical conditions (variogram range 15, half of the species affected by the environmental gradient); variation in these features did not affect conclusions. Data for all simulation types as well as for the empirical samples are presented in the Supplementary Information.

When comparing metrics with regard to different scale parameters q (i.e. βM1 vs. βM2 for multiplicative beta; Sørensen vs. Horn vs. Morisita-Horn/Morisita for pairwise dissimilarity), we observed a decrease in undersampling sensitivity (i.e. higher correlations, lower bias) with higher weight on common taxa (i.e. increasing q), as predicted. For some data conditions (notably our empirical samples), Morisita and Morisita-Horn dissimilarities were extremely robust to undersampling and attained correlations of r > 0·8 and very small bias (ISME << 0·1) even if only 30% of species had been recorded. Horn, Bray-Curtis and CNESS (m = 10) were intermediate in their robustness (again, depending on species-abundance structure). We also made use of the properties of CNESS to investigate this effect of rare species further. By adjusting the sampling parameter m, we could use the same metric yet change the weight of rare species in dissimilarity estimates (higher m = more weight on rare taxa; Grassle & Smith 1976). Using our empirical data, these analyses confirmed that undersampling robustness (correlation as well as bias) decreased with increasing m (Supporting Information).

Mean biases were usually positive, the exception being Morisita's metric with simulated log-normal abundances. We did not find differences in undersampling robustness (for any tested metric) among the other simulation variants (autocorrelation of species' abundances, importance of the gradient) that exceeded the random variation found among different runs of the same simulations (data not shown).

Analyses of the complete empirical data identified two sets of metrics that lead to relatively homogeneous conclusions regarding community patterns (Supplementary Information): on the one hand, the incidence-based measures (Jaccard, Sørensen, β−2, β−3) together with the (abundance-based) Jaccard-Chao metric (CJchao); on the other hand, the abundance-based Morisita-type metrics (i.e. Horn, Morisita-Horn, Morisita and, to a certain degree, CNESS), but also Bray-Curtis. Thus, despite conceptual similarities, e.g. between Sørensen and the ‘Morisita family’ (by varying q, see above) or Bray-Curtis (‘quantitative Sørensen’), the main distinction in empirical results seemed to be the utilization of abundance data. CNESS was not strongly correlated with any other distance metric, but was generally more similar to the Morisita-type, abundance-based metrics. Negative correlations of moderate strength (i.e. −0·5 < r < −0·2) were not uncommon when comparing these two main groups of metrics.

Discussion

Almost all metrics compared in this study (Table 1) showed substantial sensitivity to undersampling in two relevant ways. Undersampling led to biased estimates of beta diversity or dissimilarity (i.e. mean results from samples were substantially different from true values; Figs. 1 & 3), and results from undersampled data sets deviated strongly from those of the full data even if only the relative positions of sites to each other were considered (correlation analyses of site × site matrices; Fig. 2). Overestimates (positive bias) were observed more often than underestimates in our data. This is probably common, but the direction of bias can differ under particular circumstances of community structure (Chao et al. 2005). Furthermore, undersampling greatly increased random variation in estimates, i.e. it led to low precision. This implies that data from undersampled communities may sometimes reflect true patterns closely, but may also, due to chance alone, show opposite patterns. Only multiple sampling replicates, which are unlikely in local field studies, would allow any assessment of such random variation when analysing empirical data.

Our results support the expectation that the importance of abundant species in a metric leads to higher robustness to undersampling. We found this effect where metrics can be ranked according to their sensitivity to common species (e.g. increasing q for true diversity and the Sørensen-Horn-Morisita series; increasing m for CNESS), and in its extreme by the high sensitivity of incidence-based measures. The particular robustness of Morisita and Morisita-Horn metrics was already described by Wolda (1981), who, however, did not clearly separate between effects of sample size and incompleteness of species inventories. Furthermore, within all abundance-based measures, we noticed stronger undersampling effects if simulated data featured more even species-abundance distributions (i.e. less extreme abundance in the most common species), which also shifts weight to rarer taxa. Note that this directly reflects on the common practice of square root transforming abundance data prior to similarity analysis. We did not, however, observe this effect for βM3 although the Chao1-estimator of species richness also relies on abundance data (but note low precision in empirical data; Fig. 1).

Many abundance-based metrics had lower bias with empirical data than with simulations (most obviously βM4 and CJchao), and several effects could be responsible for this. Local communities in simulations were, due to the simulation methodology, less individual-rich (on average 30–50 individuals, depending on the applied species-abundance distribution) than our (very large) empirical samples. While this is not necessarily biologically unrealistic (depending on taxon and the scale of sampling), it led to a diminished dominance of the common species and to more singletons, which may increase undersampling bias in many metrics. Furthermore, (biologically realistic) sampling without replacement has more effect on the structure of a small local community than on a large one. Most estimators assume sampling with replacement (i.e. sampling from an infinite population), and this assumption could have been violated. We investigated this by repeating our analysis with (unrealistic) sampling with replacement. This led to a substantial reduction in the bias of βM4 at low sample completeness, pointing to a sensitivity of this method to local community size. Notably, it did not lead to a bias reduction in CJchao.

Our analyses of empirical data could not confirm any particular stability of β−2 and β−3 (which had been suggested by Cardoso, Borges & Veech 2009). The relatively poor performance of the Jaccard-Chao dissimilarity (CJchao) in empirical and simulated data was also unexpected, as it is a measure that specifically attempts to correct for unseen shared species, i.e. undersampling (Chao et al. 2005, 2006). We communicated with A. Chao to preclude software errors in the vegan script as a possible reason, and we compared results for the full data with results from A. Chao's own software (SPADE). We conclude that substantial deviation from true results remains even if the probabilistic correction for unseen shared species reduces bias compared with uncorrected dissimilarities (Chao et al. 2005, 2006). Data in Fig. 3 suggested a high variability in bias depending on the individual site pair investigated (but this was not clearly associated with different simulations, apart from the species-abundance distribution). Identifying the data properties that may be responsible for this would be important for evaluating when the metric should be used, and when not.

CNESS has become popular in studies on mega-diverse taxa and ecosystems (e.g. Brehm & Fiedler 2003; Beck & Chey 2007; Beck et al. 2010) as it may offer, with intermediate m (10 is often used as default), a compromise between sensitivity to rare and common taxa. However, we found that it was not as robust to undersampling as is often assumed (e.g. it was not substantially better than Bray-Curtis at a sample completeness of 0·5). Other metrics that may achieve some balance between weighting for rare taxa and relative robustness to undersampling may be those of the (q = 1) type, i.e. Horn overlap for pairwise similarities and true beta diversity based on Shannon entropy.

Among multiplicative dataset-wide assessments of ‘true’ beta diversity, the benefits of using methods with an explicit correction for unseen species (βM3 and particularly βM4) were quite sensitive to the abundance structure of data (O'Hara 2005), which calls for particular caution when applying such metrics to data that are not based on individual counts (which will often be log-normal), but to data based on biomass or other units that may feature very different abundance distributions. Apart from Shannon entropy, other concepts of ‘effective numbers of species’ (e.g. based on Simpson indices) could be used that put even less weight on rare taxa (Jost et al. 2010; Tuomisto 2010b; Jost, Chao & Chazdon 2011).

In practice, researchers cannot know a priori the degree of undersampling, although numerical methods exist to estimate, with varying reliability, the number of unseen species (Colwell & Coddington 1994) and even the sampling effort needed to find them (Chao et al. 2009). We evaluated how useful three commonly applied ‘simple’ approximations would have been to estimate the sample completeness of our artificially undersampled data (Supporting Information). These comparisons showed that coverage (Good 1953) is strongly dependent on the data structure and often did not provide good approximations of sample completeness in our definition. Individual numbers and the N/S ratio were (nonlinearly) related to completeness within data sets, but they cannot be used to assess the sample completeness in a new data set (as their intercept was not constant). Estimates of undersampling from a range of numerical species richness estimators are probably useful (though not infallible; Beck & Schwanghart 2010) for assessing the magnitude of the problem. In better-known taxa, the consideration of the regionally known species pool may also provide guidance. In species-rich systems undersampling should probably always be assumed as long as there is no evidence otherwise (Coddington et al. 2009).

Among alternative approaches not covered in our analysis, hierarchical additive diversity partitioning has recently been developed as a promising method to assess beta components of diversity (Crist et al. 2003; Crist & Veech 2006; Beck et al. 2012b; see Holloway 1996 for an R-mode approach assessing species co-occurrence within our empirical data). Crucially to our topic, results are compared with null model simulations (related to rarefaction), which ensure that inference is robust to undersampling even if absolute measures of diversity components may be biased (Crist & Veech 2006; Chase et al. 2011). A similar null model approach is taken for testing species co-occurrence patterns (Ulrich & Gotelli 2007), which will also be affected by undersampling. Legendre, Borcard & Peres-Neto (2005) had suggested that direct ordination (e.g. detrended correspondence analysis, redundancy analysis) can be appropriate for many research questions on community composition. However, many uncertainties remain on how to best analyse undersampling robustness for these methods.

The lack of congruency of results for different metrics found for the full data (i.e. no undersampling, Supporting Information; cf. Anderson et al. 2011) underlines the need to choose a metric that is appropriate for the question to be analysed. Different metrics measure different aspects of community variation, and they can lead to different ecological conclusions. These differences also prevent any simple solution to the undersampling problem, such as using only Morisita-type measures (i.e. Cm or Cmh) whenever undersampling is suspected. The choice of beta diversity metric has to be made primarily in the context of the research question (constrained by data availability). For example, studies with a focus on the conservation of rare species will not benefit from applying metrics that are primarily affected by the most common taxa, whereas this may be an option if functional properties of a system, e.g. species interactions, are at the core of a study (because most interactions will be among the common taxa). An example of the former is provided by our empirical data where the 18 species of the association restricted to remnant areas of native forest are mostly ranked within the third and fourth abundance quartiles in the total sample of resident species (Holloway 1996).

Apart from incompletely inventorying local sites (the effects of which we investigated in this study), there are other forms of undersampling that require further study. For example, not all habitats of a landscape may be covered in a study, and the resulting metrics of beta diversity or similarity may therefore be biased with regard to the landscape even if correct for the sampled data set, particularly when mobile organisms are being sampled, as seen in the considerable input of vagrant moth species to samples even in a highly isolated situation such as Norfolk Island (Holloway 1977, 1982, 1996).

Conclusions

Most metrics that quantify differences between local communities are liable to undersampling effects. Our results suggest a trade-off between metrics with high sensitivity to rare taxa (in the extreme, incidence data) and those attaining results that are robust to undersampling. This implies that some research questions, at the current state of available metrics, cannot be analysed without taking a risk of bias due to incomplete sampling. Where abundance data are used (i.e. they are available and suitable for approaching the research question at hand; Tuomisto 2010a,b; Anderson et al. 2011), effects of undersampling are probably less of a concern than if an analysis is based on incidence data. If, for example, measuring faunal difference among the common species is sufficient for the research question, Morisita-type metrics are the most reliable pairwise site × site metrics when undersampling is suspected. Methods that explicitly account for undersampling are useful, but their performance can be sensitive to the evenness of the species community as well as to other, as yet unidentified, properties of the data. Further research is needed to provide robust incidence-based metrics of beta diversity, because (1) many research questions (e.g. in conservation) explicitly require a high weight on rare taxa, and (2) incidence data are most commonly available from non-quantitative species inventories.

Acknowledgements

Our study benefited from discussions with M. Curran, J. Fahr and K. Fiedler. Five anonymous reviewers helped considerably to improve the study and clarify our presentation. A. Chao provided important feedback on software. We are indebted to those who helped compile the data set on Norfolk Island moths, particularly Maurge and Freddie Jowett, as acknowledged in Holloway (1977, 1996).
