1. Numbers of individuals or species are often recorded to test for variations in abundance or richness between treatments, habitat types, ecosystem management types, experimental treatments, time periods, etc. However, a difference in mean detectability among treatments is likely to lead to the erroneous conclusion that mean abundance differs among treatments. No guidelines exist to determine the maximum acceptable difference in detectability.
2. In this study, we simulated count data with imperfect detectability for two treatments with identical mean abundance (N) and number of plots (nplots) but different mean detectability (p). We then estimated the risk of erroneously concluding that N differed between treatments because the difference in p was ignored. The magnitude of the risk depended on p, N and nplots.
3. Our simulations showed that even small differences in p can dramatically increase this risk. A detectability difference as small as 4–8% can lead to a 50–90% risk of erroneously concluding that a significant difference in N exists among treatments with identical N = 50 and nplots = 50. Yet, differences in p of this magnitude among treatments or along gradients are commonplace in ecological studies.
4. Fortunately, simple methods of accounting for imperfect detectability prove effective at removing detectability difference between treatments.
5. Considering the high sensitivity of statistical tests to detectability differences among treatments, we conclude that accounting for detectability, by setting up a replicated design (applied to at least part of the design scheme) and analysing data with appropriate statistical tools, is always worthwhile when comparing count data (abundance, richness).
Data on abundance and species richness of animals or plants are often collected in comparative studies to determine the variations in abundance and richness between treatments or habitats, areas that are managed differently, experimentally manipulated situations, monitoring over time, etc. Yet, the observed number of individuals or species is both a function of the true state of the system (i.e. the true number of individuals or species) and of the observation process. A significant proportion of the individuals or species are invariably missed during inventory observations. This has been demonstrated for a wide range of taxonomic groups, e.g. mammals (Baker 2004), birds (Boulinier et al. 1998), butterflies (Casula & Nichols 2003), spiders (Coddington, Young & Coyle 1996) and plants (Kéry et al. 2006). In this article, we will develop the case of abundance estimators, but our conclusions are equally relevant to species richness estimators as, from a methodological point of view, there is a parallel between species in a community and individuals in a population.
Some statistical approaches do account for differences in p; however, they usually require replicating counts in space or in time to be able to distinguish variations in the detection process from variations in abundance (but see Discussion for non-replicated methods). Replication greatly increases study costs in terms of manpower (time in the field), ecological disturbance (multiple sampling) and training requirements (statistical methodology). In practice, if manpower is limited, increasing the number of subsamples per plot is likely to necessitate reducing the number of plots proportionately. This would in turn reduce the statistical power of the study to detect differences among treatments. It is therefore tempting to ignore the detection difference issue with a view to maximising statistical power. This is probably one of the main reasons why imperfect detectability is still frequently ignored in species-monitoring schemes. For instance, 66% of 396 species-monitoring schemes in Europe do not control for detectability (EuMon 2006; consulted 18 October 2010).
A common belief among opponents of the systematic consideration of imperfect detectability (i.e. a detection probability of <1) is that, when the degree of detectability is ‘roughly’ the same among the treatments, its effects should cancel out and tests for differences among treatments should not be affected. By this reasoning, any effort to account for imperfect detectability is a waste of manpower. However, Tyre et al. (2003) have already demonstrated that this simplification is unwarranted in the specific case of occurrence data analysis, where imperfect detectability can bias the estimator of the slope for the effect of covariates. The problem of detectability is generally acknowledged when the probability of detection varies with the factor of interest (e.g. experimental treatment, ecological or temporal gradient): differences in detectability may cause, accentuate or hide differences in the observed mean abundance between treatments and result in false interpretations. But, to our knowledge, no threshold values have been defined for the maximum difference in p among treatments that can be considered negligible and can therefore be ignored in the analyses (low risk of type I error). In other words, when does the risk of drawing false conclusions become important enough to counterbalance the cost of replicating counts?
The detectability issue in comparative studies is not only a question of bias and precision in population size estimates (e.g. Burnham & Overton 1979; Chao 1987; Hellmann & Fowler 1999; Walther & Moore 2005; Xi, Watson & Yip 2008). Indeed, information about actual differences in N may be more reliable when detectability is low but varies little among treatments than it is when detectability is globally higher but varies more among treatments. Some specific case studies have emphasised the importance of accounting for detectability – i.e. accounting for p changed the conclusions of the study (Kéry, Gardner & Monnerat 2010a), while others have reached the opposite conclusion – i.e. accounting for p did not change the conclusions (Kéry & Schmid 2006; Bas et al. 2008). MacKenzie & Kendall (2002) discussed ways of incorporating detectability into abundance ratios (relative abundances) and recommended estimating detectability in all cases. Although they argued that there is generally no good reason to assume detectability to be constant (among treatments, along gradients), they did not provide a formal assessment of this statement.
The main goal of our study was therefore to estimate the minimum acceptable difference in detection probability that justifies estimating detectability when designing a comparative study and analysing data. We used simulations to explore the sensitivity of comparative tests of abundance data to among-treatment differences in mean individual detectability. We considered the additional influence of the mean number of individuals per plot and the number of plots. We then explored to what extent basic methods of accounting for detectability (using a nonparametric estimator) are effective at removing the detectability difference among treatments and limiting the type I error risk.
Materials and methods
Simulations mimicking real count data surveys were performed to compare two different treatments (e.g. two habitats, areas or experimental treatments) with different mean detectabilities (p1 ≠ p2), but with identical mean population size (N), number of sampling plots (nplots) and number of subsamples per plot (S). The simulations artificially created detection/non-detection histories, i.e. matrices indicating whether a particular individual was detected or not at a given subsampling occasion. The simulations were designed to estimate the risk of committing what is known in statistical terminology as a type I error, i.e. rejecting the null hypothesis H0, ‘there is no difference in population size between the treatments’, even though H0 is true (N1 = N2 = N was fixed). The usual accepted risk is 5%, and a between-treatment difference in detectability is likely to inflate this risk.
Each run had four steps as described below (with i for individual, j for plot, t for treatment and s for subsample):
Step 1. Assign a population size Njt to each of the nplots of each treatment assuming a Poisson distribution: Njt∼ Poisson(N).
Step 2. Assign a detection probability pijt to each of the Njt individuals present on each plot (jt) assuming a Beta distribution: pijt∼ Beta(beta1,beta2t) (beta1 was identical for both treatments but not beta2; pt = beta1/(beta1 + beta2t)), and determine whether the individual is detected on each of the S subsamples by performing S Bernoulli trials, Bernoulli(pijt).
Step 3. For each plot, count the number of individuals that were recorded in at least one subsample (N.rawjt ≤ Njt) and calculate the corresponding Jackknife 2 estimate (Burnham & Overton 1979), given here in its standard second-order form for S subsamples:
$$\hat{S}_{\mathrm{Jack2},jt} = N.\mathrm{raw}_{jt} + \frac{n_1(2S-3)}{S} - \frac{n_2(S-2)^2}{S(S-1)}$$
where n1 is the number of individuals detected in exactly one of the S subsamples (singletons), and n2 is the number of individuals detected in exactly two of the S subsamples (doubletons). We also calculated the corresponding Chao 2 estimate (see results in Appendix 2).
Step 4. Test whether the mean N.rawjt (and ŜJack2jt) values statistically differ between the two treatments, assuming a Poisson (and respectively Gaussian) distribution; see below for the justification of using these distributions. Increment a counting variable if the ‘treatment’ P-value is significant (i.e. <5%).
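Steps 1 to 3 for a single plot can be sketched as follows. This is a minimal Python illustration using only the standard library, not the R functions of Appendix 1. The function names (`poisson_draw`, `beta2_for_mean`, `jackknife2`, `simulate_plot`) are hypothetical, and the second-order jackknife is written in its commonly stated form for S sampling occasions, which is assumed here to match the estimator cited from Burnham & Overton (1979).

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw from Poisson(mean) with Knuth's multiplication method.

    Adequate for the moderate means used here (N = 10 to 100).
    """
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def beta2_for_mean(p_mean, beta1=2.0):
    """Solve p = beta1 / (beta1 + beta2) for beta2."""
    return beta1 * (1.0 - p_mean) / p_mean


def jackknife2(detection_counts, n_subsamples):
    """Second-order jackknife from per-individual detection counts.

    detection_counts[i] is the number of subsamples (0..S) in which
    individual i was detected. Assumes S >= 2.
    """
    s = n_subsamples
    n_raw = sum(1 for c in detection_counts if c >= 1)  # seen at least once
    n1 = sum(1 for c in detection_counts if c == 1)     # singletons
    n2 = sum(1 for c in detection_counts if c == 2)     # doubletons
    return n_raw + n1 * (2 * s - 3) / s - n2 * (s - 2) ** 2 / (s * (s - 1))


def simulate_plot(rng, n_mean, p_mean, n_subsamples, beta1=2.0):
    """Return (true N, raw count, Jackknife 2 estimate) for one plot."""
    n_true = poisson_draw(rng, n_mean)                  # step 1
    beta2 = beta2_for_mean(p_mean, beta1)
    counts = []
    for _ in range(n_true):                             # step 2
        p_i = rng.betavariate(beta1, beta2)
        counts.append(sum(rng.random() < p_i for _ in range(n_subsamples)))
    n_raw = sum(1 for c in counts if c >= 1)            # step 3
    return n_true, n_raw, jackknife2(counts, n_subsamples)


rng = random.Random(1)
n_true, n_raw, n_jack2 = simulate_plot(rng, n_mean=50, p_mean=0.25, n_subsamples=5)
```

Repeating `simulate_plot` over nplots plots and both treatments (with a different `p_mean` for each) yields the two samples that are compared in step 4.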
For each combination of p1, p2, N and nplots, steps 1 to 4 were repeated 5000 times. The proportion of runs with a P-value <0·05 (αsim) was then used as an estimate of the type I error risk. Because two sets of populations were randomly generated for the two treatments in step 1, a significant between-treatment difference in mean N occurred by chance in c. 5% of the simulations even when p, N and nplots were identical for the two treatments. That is, estimated rejection rates are perfectly acceptable unless they are >5%. To determine the maximum acceptable difference in p between the two treatments (Δpmax = p2−p1) for which the type I error risk was still below the desired 5%, the value of p2 was increased (from p1) until the type I error risk exceeded the threshold value (in practice 5·6%, i.e. 0·05 + 1·96*√(0·05*0·95/5000), which allows for the Monte Carlo error of 5000 runs). This involved performing sets of 5000 runs without knowing a priori the number of sets needed to reach the p2max value with the desired precision. To speed up the process, we incremented p2 in steps of 0·1 among sets of runs, then in steps of 0·01 and finally in steps of 0·001, rather than incrementing p2 directly in steps of 0·001.
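The threshold value and the coarse-to-fine search for p2max described above can be sketched as follows. This is an illustrative Python sketch, not the Appendix 1 code. `risk_fn` is a hypothetical callable that would run one set of 5000 simulations and return αsim for a given p2, and the search assumes that the risk increases monotonically with p2.

```python
import math


def rejection_threshold(alpha=0.05, n_runs=5000, z=1.96):
    """Upper 95% bound of a simulated rejection rate whose true value is alpha."""
    return alpha + z * math.sqrt(alpha * (1.0 - alpha) / n_runs)


def find_p2_max(risk_fn, p1, threshold, steps=(0.1, 0.01, 0.001)):
    """Largest p2 (to 0.001) whose estimated risk does not exceed threshold.

    p2 is raised in coarse steps first, then refined with finer steps,
    so that far fewer sets of runs are needed than with 0.001 steps alone.
    """
    p2 = p1
    for step in steps:
        while True:
            candidate = round(p2 + step, 3)
            if candidate >= 1.0 or risk_fn(candidate) > threshold:
                break
            p2 = candidate
    return p2


threshold = rejection_threshold()  # about 0.056 for 5000 runs


# Toy stand-in for a simulation-based risk estimate (an assumption for
# illustration only): risk rises linearly with p2 - p1.
def toy_risk(p2, p1=0.25):
    return 0.05 + 0.1 * (p2 - p1)


p2_max = find_p2_max(toy_risk, p1=0.25, threshold=threshold)
delta_p_max = p2_max - 0.25
```

With the toy risk function, the search evaluates only nine candidate values of p2 instead of the 61 that a direct 0·001 increment would require.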
For simplicity, we assumed that there was no variation in pijt among the S subsamples. Initially, we also explored the effect of the level of heterogeneity (var(p)) on the type I error risk by using combinations of beta1t and beta2t values and keeping the same mean detectability pt. However, as the results varied only slightly with var(p) (at least in comparison with p), only the results for beta1 = 2 are presented herein.
We used a generalised linear model with a Poisson distribution for N.raw but a linear model with a Gaussian distribution for ŜJack2 in step 4 because simulations with N1 = N2 and p1 = p2 (to check that the nominal rejection rate (αnom) of the tests was effectively 5%) showed that αnom was effectively 5% when assuming a Poisson distribution for N.raw but was generally above 5% for ŜJack2. With a Gaussian distribution for ŜJack2, the nominal rejection rate was 5%.
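For raw counts, the step 4 comparison can be illustrated with a likelihood-ratio test of a Poisson model with a two-level treatment factor. The sketch below is an assumption-laden stand-in: the article does not state which test statistic was extracted from the generalised linear model, and the function name `poisson_two_group_pvalue` is hypothetical. The P-value uses the chi-square distribution with one degree of freedom, whose upper tail equals erfc(√(x/2)).

```python
import math


def poisson_two_group_pvalue(counts_1, counts_2):
    """Likelihood-ratio P-value for equal Poisson means in two groups.

    Equivalent to comparing a Poisson GLM with a treatment factor against
    the intercept-only model (deviance difference, 1 degree of freedom).
    """
    total_1, total_2 = sum(counts_1), sum(counts_2)
    n_1, n_2 = len(counts_1), len(counts_2)
    pooled_mean = (total_1 + total_2) / (n_1 + n_2)
    if pooled_mean == 0:
        return 1.0  # no individuals recorded anywhere: nothing to compare

    def term(total, n):
        # Contribution total * log(group mean / pooled mean); 0 when total is 0.
        if total == 0:
            return 0.0
        return total * math.log((total / n) / pooled_mean)

    deviance_drop = 2.0 * (term(total_1, n_1) + term(total_2, n_2))
    # Upper tail of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(deviance_drop, 0.0) / 2.0))


# One run counts as a rejection when the P-value falls below 0.05.
p_value = poisson_two_group_pvalue([12, 9, 15, 11], [14, 10, 13, 12])
rejected = p_value < 0.05
```

The proportion of `rejected` outcomes over many simulated runs is what αsim estimates.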
Obviously, we could not explore all possible values for all parameter combinations. Hence, we used values that are commonly found in biostatistics and ecology: N = 10, 50 or 100; nplots = 10, 25, 50, 75 or 100; S = 1 (for raw counts), 3–10; p1 = 0·1, 0·25, 0·5 and 0·8 (with beta1 = 2). We developed specific functions for the R software (R Development Core Team 2009), provided in Appendix 1, which will allow the reader to estimate the type I error risk in other situations (note that the functions can be easily modified so as to draw a single set of populations for the two treatments, mimicking the case of a single set of plots visited by two different observers or sampled using two competing methods).
Raw counts of individuals
As expected, the type I error risk αsim gradually increased to above the 5% limit as the difference in p among treatments (Δp) increased. The increase was stronger for lower values of p. Logically, αsim also increased with both N and nplots, as the two variables play symmetrical roles (Fig. 1).
More importantly, αsim was far above the 5% threshold value, even for low Δp, in the majority of the situations considered. At the extreme, when N = 100, nplots = 100 and Δp≥4%, αsim was >0·9, for all values of p. The situations with a reasonably low (<0·1) risk of committing a type I error had very low statistical power (and therefore a high risk of committing a type II error), e.g. with N = 10, nplots = 10, p1 = 0·1 and Δp ≤ 3%.
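The pattern for raw counts (a risk that grows with Δp, N and nplots, and more steeply at low p) can be motivated with a back-of-envelope approximation. The sketch below is not the simulation used in this study and will not reproduce its values exactly. It assumes a single subsample, so that the total raw count of a treatment is Poisson with mean nplots × N × p (independent thinning of a Poisson count remains Poisson), and it replaces the GLM test by a two-sided normal approximation. The function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_rejection_rate(n_mean, n_plots, p1, p2, z_crit=1.96):
    """Approximate probability of declaring a difference in raw counts.

    Treatment totals are Poisson(n_plots * n_mean * p), so their difference
    has mean n_plots * n_mean * (p2 - p1) and variance
    n_plots * n_mean * (p1 + p2); the standardised shift is their ratio.
    """
    shift = math.sqrt(n_mean * n_plots) * (p2 - p1) / math.sqrt(p1 + p2)
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


# N and nplots enter only through their product, hence their symmetrical
# roles; the divisor sqrt(p1 + p2) explains the steeper rise at low p.
rate_equal_p = approx_rejection_rate(50, 50, 0.5, 0.5)
rate_low_p = approx_rejection_rate(50, 50, 0.1, 0.14)
rate_mid_p = approx_rejection_rate(50, 50, 0.5, 0.54)
```

Under this approximation the rejection rate stays at the nominal level only when p1 = p2, and the same Δp produces a larger standardised shift when p is low.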
Performance of the Jackknife 2 estimator
As expected, the factor that impacted the performance of the Jackknife 2 estimator the most was p: the higher the p value, the larger the Δpmax (shown by colder colours in Fig. 2). However, no simple monotonic relationship related Δpmax to the number of subsamples s. Indeed, when p1 = 0·25, we found an optimum number of subsamples below and above which Δpmax decreased (s = 5 or 6 for all N and nplots; Fig. 2). For other combinations, the higher (at least 10 replicates per plot) or the lower the s value, the higher the Δpmax. In the remaining cases, s seemed to have little influence on Δpmax.
More quantitatively, when p1 = 0·1, the Jackknife failed to keep αsim < 5% when the difference in mean p between treatments was ≥2%, for all combinations of N, S and nplots tested. When p1 = 0·25, the Jackknife performed better as it maintained αsim<5% for Δpmax values ranging from 5 to 40% (the % decreasing with increasing N and nplots). Interestingly, Δpmax peaked when s equalled six replicates per plot. A more surprising pattern was observed when p1 = 0·5. Indeed, higher values for Δpmax were found when s was either low (3–4) or high (8–9) than when s was intermediate (especially at low nplots, such as 20). When p1 = 0·8, the Jackknife estimator was able to preserve αsim<5% for a Δp value of around 10% (i.e. p2 around 0·9) when s was three for N = 10, 3–4 for N = 50 and 4–6 for N = 100 (depending on nplots).
Difference in p among treatments and type I error risk
This study provides the first formal evidence supporting the often-repeated recommendation to systematically use statistical tools to account for detection probability when comparing count data among treatments. Our main result is that the risk of committing type I errors because of differences in mean detectability among treatments is high even when the difference is barely perceptible in the field. For instance, for two treatments with the same number of plots (nplots = 50) and the same mean number of individuals/species per plot (N = 50), there is a 50–90% risk of erroneously declaring that the two treatments differ in their mean number of individuals/species per plot, when the mean probability of detection differs by only 4–8% among treatments (depending on the mean probability of detection, p). Table 1 provides a representative overview of the variability in p reported in ecological studies, including a wide variety of taxa and factors impacting p (e.g. species identity, observer, etc.). The range in detection probability (among species, observers, etc.) is almost invariably above 10%. We suspect that in many cases, if not most, assuming that the variation in p among treatments or over time is negligible could lead scientists or managers to draw misleading conclusions. We therefore agree with MacKenzie & Kendall (2002) who stated that ‘a priori, it is more likely that detection probabilities are actually different; hence, the burden of proof should be shifted, requiring evidence that detection probabilities are practically equivalent’. Small differences in detectability among treatments are likely to be commonplace in ecological studies. For example, the set of species may differ among treatments, different observers may participate in the surveys, individuals may be counted at different distances among treatments (e.g. because of variations in habitat proximity) or weather or seasonal conditions may not be the same.
Table 1. Examples of factors impacting the probability of detection p in a variety of taxonomic groups (with mean value p and range of values)
Because the type I error risk decreases as the mean detection probability p increases, one way to reduce this risk is to increase p. This can be achieved simply by visiting the same plot a number of times or by installing several subplots (i.e. replicating counts) and calculating the total number of different species/individuals (or if individuals cannot be identified from one visit to the next, the highest number of individuals counted in one visit). Nonetheless, even though this strategy may help to limit type I errors, the risk still remains above the 5% threshold when raw counts are used. Therefore, as soon as counts are replicated, one should always use estimators of N accounting for p, rather than raw counts, because estimators always keep the type I error risk closer to the desired 5%.
Estimators of N accounting for p
The reliability of abundance/richness estimates (high precision, low bias) increases with the fraction of individuals or species that is recorded, i.e. with the mean detectability and the level of replication (Xi, Watson & Yip 2008). Our simulations show that for very low detection probabilities (i.e. around 0·1), none of the strategies we explored (up to 10 subsamples) kept the rejection rate below 5% when the difference in mean detectability among treatments was >2%. In those cases, our simulations showed that the Chao2 estimator (Chao 1987) performed only marginally better than the Jack2 estimator (results shown in Appendix 2). Nonetheless, for greater detection probabilities (>0·1), the risk of committing a type I error as a result of detectability difference was significantly reduced, sometimes to the 5% threshold, by replicating counts and using nonparametric estimators (Fig. 2). The Jackknife estimator was able to preserve a nominal rejection rate of 5% despite an among-treatment difference in p of 10%, with only three replicates per plot when p ≥ 0·5 and N ≤ 50. A higher level of replication would be necessary for higher N values, especially for large sample sizes (large nplots). Schmeller et al. (2009; see their Table 2) provided data on monitoring practices (mainly volunteer based) in five European countries: the median level of subsampling among 262 monitoring schemes was between 1 and 3·5, with the number of replicates being either the number of visits each year or the number of samples per visit. Our results show that such a limited level of subsampling may not be sufficient to adequately account for potential variations in detectability.
We focused on nonparametric estimators because they give sound results in many circumstances (e.g. Otis et al. 1978; Walther & Moore 2005) and are widely used, although alternative, more flexible methods are now available to assess the size of a closed population or community. These methods are mostly based on generalised linear models and can incorporate a rich variety of factors possibly affecting N and p (see Royle & Dorazio 2008; King et al. 2009). It would be interesting to assess the minimum replication needed with these methods to ensure acceptable type I error risks as we have done for nonparametric estimators. We therefore suggest extending simulations to these recent approaches, including patch occupancy (MacKenzie et al. 2002; MacKenzie & Royle 2005), finite-mixture models (Pledger 2000, 2005) and N-mixture models (Royle 2004b). Finally, methods such as distance sampling (Buckland et al. 1993; Nichols et al. 2000) and spatial capture–recapture methods (Efford & Dawson 2009) that account for detectability without requiring replicated counts should also be considered.
Unless unreplicated methods can be implemented, a minimum requirement of doubling or tripling the number of subsamples per plot may be incompatible with the resources (manpower, funds) that can be devoted to many survey programmes. Furthermore, fieldworkers who have to repeat visits might become demotivated, thus potentially lowering the amount and quality of data (particularly when volunteers are concerned). If replication is traded off against the number of plots, it would also dangerously lower the ability to detect differences among treatments (i.e. increase the type II error risk). It is nevertheless crucial that biodiversity data be collected in ways that allow reliable inferences to be made (Yoccoz, Nichols & Boulinier 2001). MacKenzie & Kendall (2002) discussed various ways of incorporating detectability into abundance estimates (equivalence testing, model averaging). A double-sampling approach may offer a useful balance between limiting the cost of a study and improving its robustness. This approach relies on estimating the mean detectability and its standard deviation by replicating counts over a representative part of the study plots only (Bart & Earnst 2002; Pollock et al. 2002) and has been successfully used with patch occupancy models (Kéry et al. 2010b).
Statistical tests comparing mean plot population size or mean species richness between treatments are very sensitive to even small differences in mean probability of detection among treatments. As numerous factors are likely to significantly affect detectability in most biodiversity surveys, it is more reasonable to assume a priori that differences in detectability among treatments could bias the statistical comparison tests. Consequently, in line with MacKenzie & Kendall (2002), we recommend that scientists and managers always choose a robust sampling design that estimates and incorporates detection probability in the statistical analyses, before the routine sampling starts.
This work originated from a workshop held in 2005 at Obergurgl (Austria) jointly organised by the 6th European Funding Program Network of Excellence ALTER-Net (http://www.alter-net.info/) and the EuMon STREP EU-project (http://eumon.ckff.si; EU-Commission contract number 6463). We thank Frédéric Gosselin, Marc Kéry and one anonymous referee for constructive comments on earlier versions of the manuscript and Victoria Moore for English corrections.