Incidence‐data‐based species richness estimation via a Beta‐Binomial model

Individual‐based abundance data and sample‐based incidence data are the two most widely used survey data formats to assess the species diversity in a target area, where the sample‐based incidence data are more available and efficient for estimating species richness. For species individual with spatial aggregation, individual‐unit‐based random sampling scheme is difficult to implement, and quadrat‐unit‐based sampling scheme is more available to implement and more likely to fit the model assumption of random sampling. In addition, sample‐based incidence data, without recording the number of individuals of a species and only recording the binary presence or absence of a species in the sampled unit, could considerably reduce the survey loading in the field. In this study, according to sample‐based incidence data and based on a beta‐binomial model assumption, instead of using the maximum likelihood method, I used the moment method to derive the richness estimator. The proposed richness estimation method provides a lower bound estimator of species richness for beta‐binomial models, in which the new method only uses the number of singletons, doubletons and tripletons in the sample to estimate undetected richness. I evaluated the proposed estimator using simulated datasets generated from various species abundance models. For highly heterogeneous communities, the simulation results indicate that the proposed estimator could provide a more stable, less biased estimate and a more accurate 95% confidence interval of true richness compared to other traditional parametric‐based estimators. I also applied the proposed approach to real datasets for assessment and comparison with traditional estimators. The newly proposed richness estimator provides different information and conclusions from other estimators.


| INTRODUC TI ON
Species richness, or the number of species in a particular area, is the simplest and most frequently used measurement of biodiversity.
Understanding spatial and temporal patterns of species richness is essential to community ecology and conservation biology. However, because of sampling limitations, a complete species inventory is difficult to compile; many previous approaches in scientific literature were developed to address the underestimation of observed species richness associated with incomplete sampling data (Bunge et al., 2014;Bunge & Fitzpatrick, 1993;Chiu et al., 2014;Colwell & Coddington, 1994;Dorazio & Andrew Royle, 2003;Griffiths, 1973;Rocchetti et al., 2014;Wilson & Collins, 1992). There are two types of data typically collected in the field to assess richness diversity, namely individual-based abundance data, in which the sampling unit is the individual and the number of individuals (abundance) is recorded for each observed species, and sample-based incidence data, in which the sampling unit is a quadrat (or, e.g., a plot, a transect or a time interval) and only the binary presence or absence of species is recorded in each sampled unit (Colwell et al., 2012).
Furthermore, there are generally two typical sampling schemes implemented to collect data, which are sampling with replacement and sampling without replacement. Depending on their data collection formats and sampling schemes, most previously derived estimators can be classified into parametric or nonparametric methods. Nonparametric methods yield robust statistical performance because no assumption is made related to the species detection probability; these methods are widely used by ecologists, and the lower bound estimator (Chao, 1984(Chao, , 1987 and jackknife estimators (Burnham & Overton, 1978, 1979 are the most popular estimators.
Parametric approaches assume the species detection probability is a random variable to reduce the number of parameters and applies the traditional statistical inference to estimate species richness (Chao & Bunge, 2002;Dorazio & Andrew Royle, 2003;Griffiths, 1973;Sanathanan, 1972Sanathanan, , 1977. In general, parametric approaches usually work well when sample size is sufficiently large or the distribution of the species detection probability fits the model assumption. However, parametric approaches are often overlooked in ecological applications due to their unreliability, especially when data are sparse (Chao & Bunge, 2002).
For estimating species richness within a target area, samplebased incidence data that only the presence or absence of species is recorded in each sampling unit, without precisely recording the number of individuals for each species in the sampled plot, require less field effort than individual-based abundance data. In addition, for some individual organisms, especially micro-organisms, invertebrates or plants, it is hard to sample individuals randomly and independently due to their tiny size or highly spatial aggregated pattern. Therefore, sample-based incidence data are more practical compared to individual-based abundance data for estimating species richness. In this study, according to sample-based incidence data with a replacement sampling scheme, the proposed approach involved uses the parametric method to assume that the species detection probability is a random variable following a beta distribution. Instead of using the maximum likelihood (ML) method (Dorazio & Andrew Royle, 2003;Griffiths, 1973;Sanathanan, 1972Sanathanan, , 1977, the traditional statistical moment method was applied to obtain the estimators of the parameters in the assumed beta-binomial model based on the rare species in the sample. The new richness estimator was shown as a nearly unbiased estimator under the homogeneous model and provided a parametric lower bound estimator in other discussed models. Compared to the ML method using an entire sample, the new estimation method only requires information about the rare species (i.e. the number of singletons, doubletons and tripletons) in the sample to estimate the number of undetected species. The simulation results indicated that the proposed approach reduced bias and root mean squared error (RMSE) more so than traditional nonparametric estimators and the ML estimator (MLE), improving the accuracy of the coverage rate when the heterogeneity of species abundances was relatively high. Additionally, this method has the advantage of employing nonparametric approaches because only the number of rare species in the sample is required to estimate the number of undetected species, and information about the number of highly frequent species is not needed. Through this method, the field work required for field investigations can be greatly reduced.
The theoretical framework of the estimation method in our study is detailed in the following model and estimator section. In the simulation analysis section, we analyse the statistical performance of the proposed estimator approach in ecological species composition models by examining the bias, estimated standard error (SE), and coverage rate with a 95% confidence interval (CI). Afterwards, real data are used to illustrate the application of the newly proposed estimator and compare the new approach with other commonly used estimators. The final section provides a discussion of our estimation method.

| Model and estimator
We assume that there are S species presented in the target area, which is an unknown parameter. In this study, the sampling unit is a plot, quadrat, or transect randomly sampled from a target region and only the absence or presence of a species is recorded in each sampled unit. If a sample contains T sampling units or plots, and Y i is the number of plots that ith species observed or recorded in these T plots, then Y i follows a binomial distribution with parameter T and detection probability i . Here, the detection probability i is determined not only by species population size, but also by many other biotic factors, such as individual colour, size, vocalizations and movement patterns.
To assess the heterogeneity of species detection probabilities, a mixed binomial model widely used in ecological studies was applied. I assumed that the count Y i follows a binomial distribution Binomial T, i , and 1 , 2 , … , S are independent and identically distributed (i.i.d.) random variables with probability density function g( ). Herein, I presume that g( ) follows a beta distribution with two parameters (α, β) ensuring species relative abundance distribution flexibility. Thus, I obtain the marginal distribution of the species count in the sample as follows: Let Q k denote the number of species exactly observed in the k is an indicator function. I(A) equals to 1 if event A occurs and 0 if otherwise. Thus, S obs = ∑ T k=1 Q k is the observed richness in the sample and Q 0 is the undetected richness. Based on the mixture model, the observed frequency counts Q k : k ≥ 1 follow a multinomial distribution with the total sum S obs and cell probabilities p k 1 − p 0 : k ≥ 1 . The conditional likelihood function is expressed as: Then, the conditional MLE of and can be obtained using an iterative numerical procedure (Sanathanan, 1972(Sanathanan, , 1977. The mean detection probability of unobserved species can be obtained as is a function of ̂ and̂ . Additionally, S obs approximately follows a binomial distribution with parameters S and E S obs ∕ S. Therefore, Var S obs can be estimated by S obs 1 − S obs ∕Ŝ MLE , and Var g ̂ ,̂ | S obs can be approximated by the following formula. where I( , ) is the Fisher information matrix of ( , ) with respect to the likelihood function L( , ) shown in Equation 2. Here, the estimator of Var g ̂ ,̂ | S obs can be obtained by replacing ( , ) as the ML estimate (̂ ,̂ ). Therefore, the variance estimator of Ŝ MLE can be obtained as In this manuscript, we took another approach to estimate parameters of the beta-binomial model. On the basis of the ML method, the estimator of species richness uses all observed species in the sample, including frequently and rarely detected species, to estimate undetected richness. However, the MLE of richness is only robust when the model assumption is met. When the model assumption is violated, the MLE will become very unreliable (see Section 2.2 section). To modify the unreliability of MLE, we adopt a nonparametric concept. According to the Good-Turing frequency formula theory (Good, 1953(Good, , 2000 or the Cauchy-Schwarz inequality, the following inequality is held: This means that the rarer species could supply a more constrained range for undetected richness (i.e. rare species in a sample provide the most information about undetected species). Although some proposed nonparametric estimators use the entire sample to estimate undetected richness (Chao & Lee, 1992;Jiménez-Gamero & Puig, 2020;Puig & Kokonendji, 2018). However, for highly heterogeneous communities, Chao and Yang (1993) modified the estimator derived by Chao and Lee (1992) using the subset of the sample that consisted of only rare species to obtain a more robust estimator based on the same nonparametric approaches. That is why many widely used nonparametric estimating approaches use only rare species in the sample to estimate undetected richness (Burnham & Overton, 1978, 1979Chao, 1984Chao, , 1987Chiu et al., 2014).
Therefore, on the basis that rare species in the sample contain the most information on undetected richness, instead of using the ML method, I employed the statistical method of moments to derive a new richness estimator using only information on the rare species in a sample. Based on the marginal distribution of the species count in the sample, Equation 1, the expectation of the frequency counts can be expressed as: (2) We can now obtain the following expectations of undetected rich- Therefore, the expectation of undetected richness (Q 0 ) can be approximately presented as: Here, is the lower bound of E Q 0 when T is large enough (Chao, 1987) and applies to all mixed binomial models. When expectations are replaced by the observed numbers, is the widely used richness estimator (named as Chao2) in the literature (Chao, 1987).
Therefore, according to Equation is approximately equal to the bias of under the assumption that the species detection rate follows a beta-binomial distribution. Since the Cauchy-Schwarz inequality can also show the inequality, is approximately equal to: To obtain a more stable estimate and ensure � > 0, which needs to meet the requirement of 1 2 ≤ 2Q 2 2 3Q 1 Q 3 ≤ 1, the newly proposed estimator is expressed as: where ; and the expression (F) − is equal to max 1 2 , F if F ≤ 1 and 1 if otherwise. Here, Q 1 or Q 3 equal to zero will be replaced by 1 to make Equation 7 well defined. Although the newly proposed estimator (Equation 7) was derived based on the beta-binomial model assumption, it can also be interpreted as a bias-corrected estimator of Chao2 and be directly derived using the Good-Turing frequency formula (see Appendix S1 for details).
Since the newly proposed estimator is expressed in a simple closed form, without using the resampling method, the statistical delta method could be used to derive the estimator of variance.
According to the asymptotic approximated distribution, in which follows a multinomial distribution with a total size S and cell probabilities ( (Chao & Lee, 1992;Chao & Yang, 1993), the variance estimators of the aforementioned richness estimators (Ŝ) can all be derived using the delta method. Thus, To derive the 95% CI of species richness and to ensure the lower bound of 95% CI of species richness is larger than observed richness, assume Ŝ − S obs follows a log normal distribution (Chiu et al., 2014). Then the 95% CI of species richness is obtained as  in other models (proved in Appendix S3). (iv) The estimator provides a lower bound estimator of species richness under the betabinomial model, which is a relatively flexible ecological model that includes the broken stick and homogeneous models.

| Simulation study
A simulation study was conducted to investigate the statistical behaviour of the new estimator and compare it with current estimators, namely the nonparametric-based Chao2 lower bound estimator, improved Chao2 lower bound estimator (Chiu et al., 2014) and first-order jackknife estimator, and the parametric-based MLE. Where the ML estimates of parameters in the beta-binomial model can be efficiently calculated using the R function in the package 'extraDistr'.
, denoted as Ŝ Chao2 ): the numbers of singletons and doubletons are used to estimate the number of undetected species.
(ii) The improved Chao2 estimator , 0 , denoted as Ŝ iChao2 ): compare to Chao2 estimator, this estimator uses additional information (namely, the numbers of tripletons and quadrupletons) to estimate the number of undetected specie and correct the negative bias of Chao2 (Chiu et al., 2014).
(iii) The first-order jackknife estimator (S obs + T − 1 T Q 1 , denoted as Ŝ Jack1 ): the number of singletons is used to estimate the number of undetected species.    = 1, 2, … , S, and a 1 , a 2 , … , a S is a random sample from an exponential distribution with a parameter of 1, and the maximum of i is set as 0.5. This model is commonly used in previous literature and equivalent to the Dirichlet distribution.
Model 4: the log-normal model (CV = 1.34): where i = ca i , i = 1, 2, … , S, and a 1 , a 2 , … , a S is a random sample from a log-normal distribution, and the maximum of i is set as 1.
Model 6: the Beta model (CV = 0.55): where i = ca i , i = 1, 2, … , S , and a 1 , a 2 , … , a S is a random sample from a beta distribution with the parameters (2, 5) that meet our model assumption, and the maximum of i is set as 0.25.
The CV in these six models ranged from 0 to 1.84, covering most practical cases in real applications. Here, I considered four different quadrat numbers, 10, 20, 40 and 80, as different sampling sizes or efforts, and a total of 24 model-size combinational scenarios were generated. For each discussed estimator in each simulation scenario, the estimate and estimated standard error (SE) were averaged over 1000 simulated datasets to obtain the mean estimate and the mean estimated SE. The sample SE and RMSE over these 1000 estimates were also obtained. The percentage of the datasets in which the 95% CIs covered the true value is detailed in the tables (Tables 1-6). The average of observed richness over 1000 samples is also presented.

| Simulation results
In general, a good species richness estimator should have as small bias and variance as possible (i.e. low RMSE). Moreover, the coverage rate of its associated 95% CI should be close to 0.95. As sample size increases, the estimator should also exhibit the following essential patterns: its bias, accuracy (as measured by RMSE), and coverage TA B L E 3 Broken stick model with mean detection probability of 0.071 and CV = 1.03 rate of the CI should generally improve as sample size increases, and its estimates should finally converge to the true species richness when sample size is large enough. Using these criteria and according to the simulation results, we concluded the following findings: 1. In some cases (see Table 2), the jackknife estimator performed well in terms of bias and RMSE. However, as an estimator of species richness, the jackknife estimator violated the essential criteria that bias and accuracy (quantified by RMSE) should decrease with sample size (Tables 1 and 6). These results are consistent with those of Chiu et al. (2014). Of the discussed estimators, the jackknife estimator usually had the smallest SE, indicating favourable robustness; however, its coverage rate of CI is significantly lower than the claimed level, especially in cases with high heterogeneity (Tables 3-6).
Meanwhile, no theoretical foundation supports the use of the jackknife estimator to reduce bias of observed species richness (Cormack, 1989).
2. Chao2 is almost unbiased in the homogeneous model (Table 1).
However, its negative bias and CI coverage rate become worse as the heterogeneity of the species detection rates increased (Tables 2-6). Overall, Chao2 is advantageous as a richness estimator for its bias, RMSE and 95% CI coverage rate that perform better with increasing sample size (Tables 1-6) and tends to derive true species richness when sample size is sufficiently large. Compared with the newly proposed estimator, Chao2 had a smaller SE, larger bias and larger RMSE in the discussed hypothetical models (Tables 2-5).
3. The improved Chao2 (denoted as iChao2), a bias-corrected Chao2 estimator, fulfils the essential properties of a richness estimator (Tables 1-6). Compare to Chao2, iChao2 had a smaller bias, lower RMSE and more accurate 95% CI than Chao2 for most discussed hypothetical models (Tables 2-6). However, for the model with high CV (i.e. high heterogeneity), iChao2 still presented seriously negative bias and a low coverage rate of 95% CI when sample size is not large enough (Tables 3-5).
4. The MLE performs well in relation to bias, RMSE and 95% CI coverage rate when the hypothetical model is of low CV and sample size is large enough (Tables 1 and 2). However, when the model has a high CV, the MLE will tend to be unstable due to high variance (Tables 3-5). If beta distribution is assumed, the MLE performs well in terms of bias and RMSE ( Table 6).  (Tables 2-6). Although the SE of the newly proposed estimator is always larger than that of Chao2 and iChao2, its 95% CI coverage rate tends to be more highly accurate in most simulation scenarios (Tables 2-6).

Generally
6. The asymptotic approach used to estimate the variances of the discussed estimators performs well in most simulation cases except in some cases for the MLE (Tables 1-6).

| Work example 1: Barro Colorado Island Forest dataset
Here, I used one real dataset of forest as a true community to evaluate and compare our estimator with other discussed estimators. The Finally, the estimate and their estimated SE were averaged over 500 simulated datasets to obtain the mean estimate and mean estimated SE. The sample SE, RMSE and coverage rate of 95% CI over these 500 estimates are also shown in Table 7. Abbreviations: CI, confidence interval; CV, coefficient of variation; RMSE, root mean squared error; SE, standard error.
As with the simulation results, the results in Table 7 show that the number of observed species exhibits large negative bias, especially with lower sampling effort. The MLE is severely overestimating due to its unreliability (high variance) in the model with high CV.
Although Chao2, iChao2 and first-order Jackknife estimator could reduce the bias of observed richness, the three estimators are still severely underestimating under low sampling effort especially in the small sample size. Compared with the newly proposed estimator, the SEs of Jackknife1, Chao2 and iChao2 are much lower in each sampling effort setting, and the coverage rates of CI of the two estimators are quite low, especially with lower sampling effort. The newly proposed estimator displays more robust performance in each sampling effort setting, which has lower bias and RMSE. Although the new estimator has higher SE in each sampling effort setting, the coverage rate of 95% CI is closer to 95% that is consistent with the results in simulation studies.
3.2.2 | Work example 2: Tropical ant data I used another real dataset to illustrate the application of the proposed estimator. Here, tropical ant data collected at five elevations (50, 500, 1070, 1500 and 2000 m) in Costa Rica by Longino and Colwell (2011) were employed. Table 8 shows that the numbers of   trapping samples at the five elevations were 599, 230, 150, 200 and 200, respectively; the reference sample size (T), observed species richness (S obs ), CV of empirical species detection rate and first 10 frequency counts (Q 1 ∼ Q 10 ) are also noted. As Table 8 shows, the observed richness roughly decreases as elevation increases, and the observed richness in the elevation of 500 m is the highest. Table 8 also shows that the number of singletons is more than twice the number of doubletons at each elevation, except at the elevation of 2000 m, in which there are still many undetected species in each site.
In Table 9, the estimates and estimated SE of the various richness

| DISCUSS ION
According to the concept of the Good-Turing frequency formula, we According to statistical theory, random sampling is the fundamental assumption for deriving the estimator of species richness.
In ecological studies, individual-based abundance data and samplebased incidence data are two common data types collected to assess the species richness of different taxa. In individual-based data, the sampling unit is the individual, which is randomly and independently sampled and identified to species. In sample-based incidence data, the sampling unit is not an individual, but a plot (or quadrat, net, trap, transect, timed survey, etc.), which is randomly and independently sampled from the target region, and only the incidence (presence or absence) of each species in the sampled unit is recorded. In the field, individuals of each species always present a strong or slight spatial aggregation pattern due to the biotic or abiotic effects that make individuals difficult to individually sample and count. In this case, sample-based incidence data, with their ease of collection and proper random sampling assumption, can be effectively adopted to estimate species richness. However, richness estimation based on sample-based incidence data is still a statistical challenge. Even if communities have the same richness and species abundance, different communities display specific spatial aggregation patterns of individuals that could lead to differing species detection rates for communities of different heterogeneities. Thus, a single statistical model is always insufficient in describing all the diverse ecological communities, so there is no unbiased richness estimator for all ecological communities. Nevertheless, in this study, although the newly proposed richness estimator was derived based on the constrained beta-binomial model through a parametric approach, the newly proposed estimator can be also derived as a bias corrected estimator of Chao2 using the Good-Turing frequency formula without model assumption (see Appendix S1). That can explain why the proposed estimator presents robust behaviour in all simulation scenarios, including homogeneous, random uniform, broken stick, log-normal, and Zipf-Mandelbrot models and a real forest community. As a result, I am confident that the proposed estimator can supply another candidate to estimate richness based on sample-based incidence data especially when sample size is not large enough. Species richness in the study is defined as the number of detectable species of the assemblage that is the target of estimation. Therefore, if a low species detection rate resulting from high spatial aggregation of individuals causes zero-inflation in the survey data, it may be challenging to accurately estimate species richness, due to the many almost undetectable species. In this case, the newly proposed estimator can supply a superior lower bound for species richness (see Tables 1-6).
The sample-based replacement sampling scheme is identical to a capture-recapture sampling scheme employed to estimate population size of a specific species. Therefore, the proposed estimator can also be used to estimate population size when only the heterogeneous detection rate of individuals is considered and without varying by time and without depending on the previous detection records.

ACK N OWLED G EM ENTS
I would like to thank the Editor (Dr. Aaron Ellison), the Associate Editor (Dr. Claudia Coleine) and two reviewers for providing insightful and thoughtful comments and suggestions. I am especially grateful for Reviewer 2's comments regarding the detailed TA B L E 9 Richness estimates of the Chao2 (Ŝ Chao2 ), the improved Chao2 (Ŝ iChao2 ), the first-order jackknife estimator (Ŝ Jack1 ), the MLE (Ŝ MLE ) and newly proposed estimator (Ŝ BB ) and their SEs at the five elevations correction and rigorousness of the article. This work is supported by the Taiwan Ministry of Science and Technology under Contract 109-2118-M-002-003-MY2.

CO N FLI C T O F I NTE R E S T
There is no conflict of interest.

PEER R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1111/2041-210X.13979.

DATA AVA I L A B I L I T Y S TAT E M E N T
R code for calculating the estimate and SE of discussed estimator (including Chao2, iChao2, MLE-based estimator and newly proposed estimator) is archived in Zenodo repository. https://doi.org/10.5281/ zenodo.7015481 (Chiu, 2022). The BCI Plot data used in manuscript are available from the website: http://ctfs.si.edu/webat las/datas ets/bci/.