Estimating infection prevalence: Best practices and their theoretical underpinnings

Abstract Accurately estimating infection prevalence is fundamental to the study of population health, disease dynamics, and infection risk factors. Prevalence is estimated as the proportion of infected individuals (“individual‐based estimation”), but is also estimated as the proportion of samples in which evidence of infection is detected (“anonymous estimation”). The latter method is often used when researchers lack information on individual host identity, which can occur during noninvasive sampling of wild populations or when the individual that produced a fecal sample is unknown. The goal of this study was to investigate biases in individual‐based versus anonymous prevalence estimation theoretically and to test whether mathematically derived predictions are evident in a comparative dataset of gastrointestinal helminth infections in nonhuman primates. Using a mathematical model, we predict that anonymous estimates of prevalence will be lower than individual‐based estimates when (a) samples from infected individuals do not always contain evidence of infection and/or (b) when false negatives occur. The mathematical model further predicts that no difference in bias should exist between anonymous estimation and individual‐based estimation when one sample is collected from each individual. Using data on helminth parasites of primates, we find that anonymous estimates of prevalence are significantly and substantially (12.17%) lower than individual‐based estimates of prevalence. We also observed that individual‐based estimates of prevalence from studies employing single sampling are on average 6.4% higher than anonymous estimates, suggesting a bias toward sampling infected individuals. We recommend that researchers use individual‐based study designs with repeated sampling of individuals to obtain the most accurate estimate of infection prevalence. 
Moreover, to ensure accurate interpretation of their results and to allow for prevalence estimates to be compared among studies, it is essential that authors explicitly describe their sampling designs and prevalence calculations in publications.


| INTRODUCTION
Prevalence, a key measure in studies of disease ecology, is defined as the percentage of individuals in a population infected with a given pathogen (Jovani & Tella, 2006). This measure describes the occurrence of a pathogen in a population and is an essential component of mathematical models in epidemiology (Kermack & McKendrick, 1927).
Because determining the "true" prevalence of a pathogen in a population would require exhaustive sampling of every individual in the target population, studies generally estimate pathogen prevalence by determining the infection status of a proportion of the population via necropsy or sampling of feces, urine, blood, or saliva (Jovani & Tella, 2006). Because invasive procedures may be impractical or prohibited, particularly in studies of threatened populations, the analysis of noninvasive samples of material that potentially contains evidence of infection (e.g., feces or urine) is often preferred (Leendertz et al., 2006).
Methods for estimating prevalence from such samples can be individual-based or anonymous.
Several past studies have discussed the accuracy of prevalence estimation methods. Muehlenbein (2005) found that the prevalence of multiple helminth species increased as Pan troglodytes schweinfurthii individuals were sampled repeatedly, and recommended that researchers standardize their prevalence estimation methods by sampling individuals repeatedly and using only individual-based prevalence estimation methods. Huffman, Gotoh, Turner, Hamai, and Yoshida (1997) asserted that anonymous estimation methods are biased relative to individual-based methods, but provided only empirical evidence from a single population of P. troglodytes schweinfurthii to back this claim. Several other authors (including Murray, Stem, Boudreau, & Goodall, 2000; Gillespie, 2006; Muehlenbein, Schwartz, & Richard, 2003) have cautioned against anonymous estimation methods or claimed to have benefited from individual-based estimation methods, but the comparative performance of the two methods has yet to be rigorously examined mathematically or empirically.
Here, we formally compare the performance of individual-based and anonymous prevalence estimation methods. We begin by presenting a simple mathematical model that demonstrates the differences in bias between the two. Our model guides us toward two specific predictions, described below, which we investigate with empirical data on gastrointestinal helminth infections of primates taken from the Global Mammal Parasite Database (GMPD) (Nunn & Altizer, 2005; Stephens et al., 2017). We focus on these hosts and parasites because helminths are the main parasite for which fecal sampling occurs, and sampling challenges are common in primates due to their complex ecology and some species' threatened status.

| Individual-based prevalence estimation
Using the definition above, "true" prevalence (P) is defined mathematically as:

$$P = \frac{I}{N} \tag{1}$$

In this equation, I is the number of infected individuals in a population, and N is the total number of individuals. True prevalence is a theoretical representation of the actual occurrence of a pathogen in a discrete population.

FIGURE 1 Prevalence estimation methods. In anonymous prevalence estimation, the origin of samples is unknown, and any information about the number of hosts that generated the samples cannot be used in estimating prevalence. In individual-based prevalence estimation with single sampling, each sample is paired to a different host. In individual-based prevalence estimation with repeated sampling, multiple samples are paired to each host, enabling more accurate estimates of prevalence when infected hosts do not always produce samples containing evidence of infection.
In practice, the true prevalence of a pathogen is often impossible or impractical to measure, and sampling designs are restricted to providing an estimate of prevalence ($\hat{P}$). An estimate of prevalence is biased if its expected value ($E[\hat{P}]$) is not equal to the true prevalence (P). If $E[\hat{P}] < P$, prevalence will be underestimated, while if $E[\hat{P}] > P$, prevalence will be overestimated.
In individual-based methods, $\hat{P}$ is calculated by dividing the number of individuals observed to be infected (i) by the total number of individuals that were sampled (n):

$$\hat{P} = \frac{i}{n} \tag{2}$$

When n < N, this calculation assumes that sampling is random.
Using Equation (3), we can calculate the expected value of $\hat{P}$ while incorporating information about repeated sampling of individuals and the efficacy of the method used to detect evidence of infection in a sample:

$$E[\hat{P}] = P\left(1 - (1 - D)^X\right) \tag{3}$$

In this equation, D is the probability that a sample containing evidence of infection is detected as such (i.e., detection rate), and X is the number of samples collected from each individual (see Appendix for derivations of all equations). If D = 1, the expected value of $\hat{P}$ is equal to P, and thus, $\hat{P}$ is an unbiased estimator of prevalence. If D < 1, $\hat{P}$ is a negatively biased estimator of P. Thus, $\hat{P}$ underestimates P whenever the probability of a false negative is greater than zero.
Many factors can cause the detection rate to fall below 1. For example, certain chemicals used in the past are not conducive to long-term preservation of delicate specimens, such as some protozoa. Many protozoa and even some of the more common helminths can also be difficult to distinguish from fecal debris. However, bias due to false negatives decreases as per-individual sampling effort (X) increases, because $1 - (1 - D)^X$ approaches 1 as X increases.
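The shrinking bias described above can be illustrated numerically. The following is a minimal sketch, assuming the expected individual-based estimate equals true prevalence multiplied by the factor 1 − (1 − D)^X discussed in the text; the function name and the example values of P and D are hypothetical and are not taken from the study.

```python
def expected_individual_estimate(P, D, X):
    """Expected individual-based prevalence estimate, assuming every sample from
    an infected host contains evidence of infection and each such sample is
    detected with probability D. X is the number of samples per host."""
    if not (0.0 <= P <= 1.0 and 0.0 <= D <= 1.0) or X < 1:
        raise ValueError("P and D must lie in [0, 1] and X must be at least 1")
    return P * (1.0 - (1.0 - D) ** X)


# Hypothetical values chosen only for illustration.
true_prevalence = 0.40
detection_rate = 0.60

for samples_per_host in range(1, 6):
    estimate = expected_individual_estimate(
        true_prevalence, detection_rate, samples_per_host
    )
    bias = estimate - true_prevalence  # negative whenever D < 1
    print(f"X={samples_per_host}: expected estimate={estimate:.4f}, bias={bias:+.4f}")
```

With these illustrative values the bias is negative at every X but moves toward zero as more samples are taken per host.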
Further bias is introduced if there is variation in the presence of evidence of infection in samples from an infected individual. For example, egg production by helminths can vary with age of the parasite population and the presence of co-infections by other parasites (Muehlenbein & Lewis, 2013). In Equation (4), we define F as the proportion of an infected individual's samples that contain evidence of infection, or the occurrence rate. In this scenario, the expected value of $\hat{P}$ is as follows:

$$E[\hat{P}] = P\left(1 - (1 - FD)^X\right) \tag{4}$$

If D < 1 or F < 1 in this equation, $\hat{P}$ is a negatively biased estimator of P, regardless of the sampling effort. However, the bias still decreases as X increases because repeatedly sampling individuals increases the likelihood that infected individuals will be correctly identified as such. It is not necessary to distinguish between the effects of false negatives and variation in the presence of evidence of infection in order to infer the presence of bias, because the F and D terms are multiplicatively combined.

Muehlenbein (2005) provides an empirical example of how repeated sampling can mitigate the bias introduced when not all samples from infected individuals test positive. He found that within a population of wild chimpanzees, cumulative parasite richness (number of unique intestinal parasites infecting a given host) significantly increased for every sequential sample (up to four samples) taken per animal. In the same study, the most commonly occurring parasites were found in all of the serial samples of only a fraction of the chimpanzees, and not one of the twelve parasitic species recovered from the group was found in all samples from any one animal.
Sampling protocols that collect only one or a few samples per individual are particularly prone to large biases in prevalence estimation, especially when D and/or F are much less than 1. To observe these biases, many estimates of prevalence from multiple studies of the same disease system would have to be compared. In a dataset of many disease systems that vary significantly in terms of P, F, and D, the complex interaction between these variables would obscure the pattern of how increased sampling effort corresponds to increased estimated prevalence.

| Anonymous prevalence estimation
In anonymous estimation methods, prevalence is estimated by dividing the number of samples that test positive ($S_I$) for the pathogen by the total number of samples collected ($S_N$) (Equation 5). This approach is based on the assumption that the proportion of infected samples reflects the proportion of infected individuals in the population:

$$\hat{P} = \frac{S_I}{S_N} \tag{5}$$

Note that measures of population size are not present in the equation. A major assumption underlying this calculation is that sampling is random. The expected value of $\hat{P}$ for anonymous sampling is:

$$E[\hat{P}] = PFD \tag{6}$$

The expected value of prevalence is the same for anonymous estimations of prevalence and individual-based estimations of prevalence from studies in which individuals are only sampled once (i.e., Equation 4 reduces to Equation 6 when X = 1). In all other cases, assuming that the detection rate (D) and occurrence rate (F) are less than 1, the bias for anonymous prevalence estimation is more negative than the bias for individual-based prevalence estimation (see Appendix). This effect arises because anonymous estimation is unable to account for infected individuals producing samples that do not contain any evidence of infection. Individual-based estimation methods can partially overcome this problem by accounting for the repeated sampling of individuals.

A sensitivity analysis of the effect of the values of P, F, D, and X on the difference in bias between individual-based and anonymous prevalence estimation methods is given in the Appendix. The key finding that emerges from this analysis is that the difference between the prevalence estimates generated using the two methods increases proportionally to P and is greater for higher values of X.
Individual-based estimates of prevalence are greater than or equal to anonymous estimates of prevalence for all values of all parameters.
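The contrast between the two estimators can also be checked by simulation. The sketch below is illustrative only: it assumes that every host contributes exactly X samples, that a sample tests positive only when the host is infected, the sample contains evidence of infection (probability F), and that evidence is detected (probability D), and that a host is scored as infected if any of its samples tests positive. The function name, parameter values, and seed are hypothetical.

```python
import random


def simulate_estimates(P, F, D, X, n_hosts=20000, seed=1):
    """Simulate one study and return (individual_based, anonymous) estimates."""
    rng = random.Random(seed)
    positive_hosts = 0
    positive_samples = 0
    for _ in range(n_hosts):
        infected = rng.random() < P
        host_positive = False
        for _ in range(X):
            # A positive sample requires infection, occurrence (F), and detection (D).
            if infected and rng.random() < F and rng.random() < D:
                positive_samples += 1
                host_positive = True
        if host_positive:
            positive_hosts += 1
    individual_based = positive_hosts / n_hosts       # hosts with >= 1 positive sample
    anonymous = positive_samples / (n_hosts * X)      # positive samples / all samples
    return individual_based, anonymous


# Hypothetical parameter values chosen only for illustration.
P, F, D, X = 0.40, 0.80, 0.70, 3
individual_based, anonymous = simulate_estimates(P, F, D, X)
print(f"individual-based: {individual_based:.3f}")
print(f"anonymous:        {anonymous:.3f}")
print(f"P * (1 - (1 - F*D)**X) = {P * (1 - (1 - F * D) ** X):.3f}")
print(f"P * F * D              = {P * F * D:.3f}")
```

Under these assumptions the simulated individual-based estimate should fall near P(1 − (1 − FD)^X) and the anonymous estimate near PFD, with both below the true prevalence P.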

| Predictions
Our theoretical treatment of prevalence estimation gives rise to two predictions with regard to the performance of individual-based and anonymous estimations of prevalence. First, individual-based estimates of prevalence from studies in which individuals are repeatedly sampled should be on average higher than anonymous estimates of prevalence, assuming that random sampling of individuals or samples occurred in all studies. This prediction arises because we expect that less than 100% of samples from infected individuals will show evidence of infection (i.e., F and/or D < 1), based on technical and biological failures to detect infections as described above. Second, we predict equivalence between individual-based estimates of prevalence from studies with single sampling of individuals and anonymous estimates of prevalence. If differences in sampling bias toward infected individuals exist between these two categories of prevalence estimates, then these estimates of prevalence will differ, based on the equations and assumptions given above (Equations 4 and 6). To test both predictions, the estimates of prevalence being compared must represent a random sample of parasites, hosts, and laboratory techniques, as this helps account for variation in F and D among studies.

| Methods
We evaluate the above predictions using empirical data on gastrointestinal helminth parasite infections in primate hosts, detected through fecal sampling, from the GMPD (Nunn & Altizer, 2005; Stephens et al., 2017), a database compiled through systematic literature searches for infectious diseases of primates. The data we extracted span 31 host genera and 64 parasite genera, and are drawn from 123 published papers representing multiple different laboratories and authors. Thus, we view our dataset as a random sample of prevalence estimates. We extracted the prevalence estimates, sample sizes, host species, and parasite species from each relevant entry in the GMPD and then coded the prevalence estimates as either "individual-based" or "anonymous." All ambiguously described prevalence estimates were coded as anonymous. We did not extract anonymous estimates of prevalence from studies where the number of individuals sampled was stated and equal to the number of samples, but instead considered these data to represent individual-based estimates of prevalence with single sampling. When a study reported separate prevalence estimates for age and sex classes within a population, or when a study reported prevalence estimates for study subpopulations (i.e., different social groups within a park), we pooled the data and calculated a combined prevalence estimate.
This was carried out to make these data consistent with data from other studies that pooled data across demographic groups and subpopulations. Prevalence estimates reported for the same study population in different years were treated as separate data points, because such studies often investigated changes in prevalence over time due to factors such as environmental change. In several cases, multiple prevalence estimates corresponding to different laboratory techniques were given for a host-parasite pair within a study and were treated as separate data points. Finally, we removed entries for which all forms of estimated prevalence were equal to 0, which indicates that the authors searched for the parasite but failed to find it.
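As an illustration of the pooling step, the sketch below combines subgroup estimates into a single prevalence estimate. The text does not state the exact pooling formula, so weighting each subgroup by its sample size (total positives divided by total sampled) is an assumption, and the function name and example numbers are hypothetical.

```python
def pooled_prevalence(groups):
    """Combine subgroup prevalence estimates into one estimate.

    groups: list of (n_sampled, prevalence) pairs, one per age/sex class or
    subpopulation. Assumes pooling by sample-size weighting."""
    total_sampled = sum(n for n, _ in groups)
    if total_sampled == 0:
        raise ValueError("at least one subgroup must have a nonzero sample size")
    total_positive = sum(n * prevalence for n, prevalence in groups)
    return total_positive / total_sampled


# Hypothetical subgroups: 10 sampled with 50% positive, 30 sampled with 10% positive.
subgroups = [(10, 0.50), (30, 0.10)]
print(f"pooled prevalence: {pooled_prevalence(subgroups):.3f}")
```

Note that the pooled value differs from the unweighted mean of the subgroup estimates whenever subgroup sample sizes differ.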
To test our first prediction, we first compared anonymous and individual-based prevalence estimates from studies that provide both of these types of estimates using a paired t-test. We then conducted a statistical analysis to assess differences between individual-based and anonymous estimates of prevalence in the entire dataset. In this larger analysis of all available data, we performed model selection with the Akaike information criterion corrected for small sample sizes (AICc), which selects an optimum model based on maximum likelihood (Akaike, 1998), in the R statistical platform (R Core Team 2014) using the "MuMIn" (Barton, 2009) and "lme4" (Bates, Maechler, Bolker, & Walker, 2014) packages. We averaged models that were within 10 AICc units of the best model. All candidate models were linear. We considered all combinations of prevalence estimation type ("individual" or "anonymous") and the interaction between host genus and parasite genus as predictor variables of prevalence estimate.
We included random effects of host genus and parasite genus in all candidate models. We did not include interactions between prevalence estimation type and host or parasite genus, as no effects were predicted and doing so would not help in evaluating our specific predictions about estimation type and prevalence. We chose not to use models that specifically incorporate phylogenetic effects (such as phylogenetic generalized least squares, i.e., PGLS), because such models cannot incorporate more than one data point per species, which would prevent the comparison of estimates of prevalence of different parasites from the same host. Additionally, such methods would control for the phylogeny of either the host or the parasite, but not both. However, we were able to control for some phylogenetic effects by including host and parasite genus as random effects in all candidate models.
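The ranking and weighting of candidate models can be sketched generically. The code below is not the MuMIn or lme4 implementation; it uses the standard AICc formula and Akaike weights, assumes weights are renormalized over the models retained within the stated threshold, and uses hypothetical model names, log-likelihoods, and parameter counts.

```python
import math


def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC for a model with k parameters fit to n data points."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)


def akaike_weights(scores, max_delta=10.0):
    """Akaike weights over the models whose AICc is within max_delta of the best."""
    best = min(scores.values())
    retained = {name: s - best for name, s in scores.items() if s - best <= max_delta}
    raw = {name: math.exp(-0.5 * delta) for name, delta in retained.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Hypothetical candidate models (log-likelihood, number of parameters), n = 200.
candidates = {
    "estimation type + host genus x parasite genus": aicc(-100.0, 5, 200),
    "host genus x parasite genus": aicc(-118.0, 4, 200),
}
for name, weight in akaike_weights(candidates).items():
    print(f"{name}: weight = {weight:.3f}")
```

In this hypothetical case the second model lies more than 10 AICc units from the best model, so it is dropped and the top model carries all of the weight.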
To test our second prediction, we investigated differences between individual-based prevalence estimates from studies that sampled individuals only once and anonymous prevalence estimates using the model selection approach described above.

| Results
In total, we extracted 737 entries on helminth infection prevalence estimated through fecal sampling from the GMPD. Of these, 425 give an individual-based estimate of prevalence, 349 give an anonymous estimate of prevalence, and 37 give both individual-based and anonymous estimates of prevalence. Our data span 31 host genera and 64 parasite genera. Further details are provided in Table S1.
Among the 37 entries that provided both individual and anonymous estimates of prevalence, we find that individual sampling led to higher estimates of prevalence (Figure 2). The mean of the individual-based estimates of prevalence is 44.5% (SD = 30.9%), and the mean of the anonymous estimates of prevalence is 27.2% (SD = 24.7%), leading to a mean difference of 17.3% (95% CI: 11.8%-22.7%).
Only one entry reports a higher anonymous than individual-based prevalence estimate, and in that case, the difference is very small (3%). In support of Prediction 1, a paired t-test reveals that anonymous estimates of prevalence are significantly lower than individual-based estimates of prevalence (t(36) = 6.46, p < 0.0001).
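The reported summary statistics can be cross-checked against one another. The sketch below back-calculates the standard error of the mean paired difference from the reported 95% confidence interval and recomputes the t statistic; it assumes the interval is a standard t-based interval and uses a tabulated critical value of about 2.028 for 36 degrees of freedom. The variable names are illustrative.

```python
import math

# Values reported in the text for the 37 paired entries.
n_pairs = 37
mean_difference = 17.3          # percentage points
ci_low, ci_high = 11.8, 22.7    # 95% CI of the mean difference

# Assumption: two-sided 95% critical value of Student's t with 36 df (tabulated).
t_critical = 2.028

half_width = (ci_high - ci_low) / 2.0
standard_error = half_width / t_critical          # SE of the mean difference
sd_of_differences = standard_error * math.sqrt(n_pairs)
t_statistic = mean_difference / standard_error

print(f"degrees of freedom: {n_pairs - 1}")
print(f"implied SE of mean difference: {standard_error:.2f}")
print(f"implied SD of paired differences: {sd_of_differences:.1f}")
print(f"recomputed t statistic: {t_statistic:.2f}")
```

The recomputed statistic is roughly 6.4, which agrees with the reported value of 6.46 to within the rounding of the published interval.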
We observe this same pattern in the broader analysis of the full dataset of prevalence estimates from studies that reported one or both types of estimates. The results of the model selection process reveal that the top model received 100% of the weight (Table 1).
Prevalence estimation type has a relative importance score of 1. In this model, anonymous prevalence estimation is again associated with substantially decreased prevalence (coefficient = −0.1217, t = −5.87, p < 0.0001). Prevalence estimates from individual-based designs are on average 12.17% higher than those from anonymous designs after accounting for the genus of the host and the parasite (Figure 3), also supporting our first prediction.
In testing Prediction 2, we find that individual-based estimates of prevalence from studies with single sampling of individuals (N = 120) differ from anonymous measures of prevalence (Figure 4).
Prevalence estimation method has a relative importance score of 0.29 (Table 2). In the averaged model, the coefficient of anonymous prevalence estimation is −0.064 (Z = 2.40, p < 0.02). This indicates that after controlling for other factors, anonymous estimations of prevalence are on average 6.4% lower than individual-based estimates of prevalence. (Huffman et al., 1997; Muehlenbein, 2005).

| DISCUSSION
Individual-based estimates of prevalence are expected to be closer to true prevalence than anonymous estimates because they are able to use information about the repeated sampling of individuals.

Note. AICc: Akaike information criterion. Table 2 shows the top models selected for the analysis of prevalence. "+" symbols indicate included variables. All other models had ΔAICc > 10, were not included in the averaged model, and are not shown.

Furthermore, we make several important simplifications during our theoretical treatment of prevalence estimation. We do not incorporate false positives. False positives are less likely to occur in fecal sample analyses that focus on helminth egg detection (relative to analyses that seek to identify larvae and protozoa), because fecal debris and/or other materials are unlikely to be confused for helminth eggs. However, false negatives do remain an issue, especially in cases where it is difficult to discern helminth eggs from fecal debris and other material in the sample, and in genetic procedures such as PCR (Borst, Box, & Fluit, 2004). We also do not consider parasite misidentification. While this may occur, it would not affect any estimate of prevalence as long as the misidentification is consistent, and parasites of separate species are not identified as members of the same species.
Our empirical analyses may have been affected by discrepancies between our data coding and the actual methods employed in the original studies, as many papers from which we collected empirical data were unclear in their descriptions of sample collection and prevalence estimation. However, we classified all ambiguously described methods as anonymous estimation, so any incorrect classifications were almost certainly individual-based estimates being classified as anonymous estimates. This would obscure differences between estimation types, making it a conservative practice.
In conclusion, we demonstrate theoretically that estimating prev- parasitology. Therefore, researchers should take care to sample randomly, use methods designed to reduce unconscious sampling bias, and fully and unambiguously report their sampling procedures.

ACKNOWLEDGMENTS
We would like to thank Roderick Acquaye and Marie Rogers for assisting with data coding.

CONFLICT OF INTEREST
The authors have no conflict of interests to declare.

AUTHOR CONTRIBUTIONS
IFM compiled data, constructed the mathematical model, designed and conducted statistical analyses, and drafted the manuscript. ISC helped to compile data and design statistical analyses, and drafted the manuscript. CLN and MPM helped conceive of the study and helped draft the manuscript. All authors gave final approval for publication.

DATA ACCESSIBILITY
The dataset used in this study is available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.51t5s6b.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.

DERIVATION OF EQUATION 6
Equation 6 gives the expected value of the estimate of prevalence for anonymous estimation. We assume that all individuals generate the same number of samples regardless of their infection status and that samples are randomly selected from the "pool" of all samples.
The number of samples generated from infected individuals containing evidence of infection will be $S_N PF$. The number of these that will be detected as containing infectious material will be $S_N PFD$. Therefore, the expected proportion of samples detected as containing infectious material in a population will be

$$E[\hat{P}] = \frac{S_N P F D}{S_N} = PFD \tag{6}$$

FIGURE A1 Multiplying the values displayed in the plots by true prevalence, P, gives the difference in bias between individual-based and anonymous estimates of prevalence.

When X = 1, the difference in bias is equal to 0, as $1 - FD - (1 - FD)^X = 0$ for all values of F and D. Therefore, no difference in bias exists between anonymous estimation and individual-based estimation with single sampling.
To prove that the bias is always more negative for anonymous sampling than for individual-based sampling when X > 1, we wish to show that $P(FD - 1) < P\left(\left(1 - (1 - FD)^X\right) - 1\right)$.
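One way to establish this inequality is sketched below. The argument assumes P > 0, 0 < FD < 1, and an integer X > 1; the symbol u is introduced only for this sketch and does not appear elsewhere in the text.

$$
\begin{aligned}
P(FD - 1) &< P\left(\left(1 - (1 - FD)^X\right) - 1\right) \\
\iff\quad FD - 1 &< -(1 - FD)^X && \text{(dividing by } P > 0\text{)} \\
\iff\quad (1 - FD)^X &< 1 - FD && \text{(multiplying by } -1\text{)} \\
\iff\quad u^{X-1} &< 1 && \text{(with } u = 1 - FD \in (0, 1)\text{, dividing by } u > 0\text{)}
\end{aligned}
$$

The final statement holds because a number strictly between 0 and 1 raised to a positive integer power remains below 1. In the boundary cases P = 0, FD = 0, or FD = 1, the two sides are equal and the inequality is no longer strict.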

SENSITIVITY ANALYSIS OF PREVALENCE BIASES
Here, we consider the difference in bias for individual-based and anonymous prevalence estimation methods applied to the same system (i.e., same values of F, D, and P). The difference in bias is:

$$P\left(1 - FD - (1 - FD)^X\right)$$

Thus, the difference in bias is expected to be directly proportional to true prevalence, P. Figure A1 above shows that in addition to increasing with P, the difference in bias between the two methods increases with greater values of X and is maximized at different combinations of F and D, depending on the value of X.
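The behavior described here can be explored numerically. The sketch below assumes the difference in bias takes the form P(1 − FD − (1 − FD)^X), consistent with the inequality in the preceding derivation; the function names, grid resolution, and example values are hypothetical.

```python
def bias_difference(P, F, D, X):
    """Individual-based bias minus anonymous bias, assuming the form
    P * (1 - F*D - (1 - F*D)**X)."""
    fd = F * D
    return P * (1 - fd - (1 - fd) ** X)


def maximizing_fd(X, steps=1000):
    """Grid search for the product F*D in [0, 1] that maximizes the difference."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda fd: 1 - fd - (1 - fd) ** X)


# Hypothetical true prevalence, used only for illustration.
P = 0.40
for X in (1, 2, 3, 5, 10):
    fd_star = maximizing_fd(X)
    largest = bias_difference(P, fd_star, 1.0, X)  # F = fd_star, D = 1 gives F*D = fd_star
    print(f"X={X}: F*D maximizing the difference ~ {fd_star:.3f}, difference ~ {largest:.4f}")
```

Under this form the difference vanishes at X = 1, scales linearly with P, and peaks at progressively smaller values of the product FD as X grows.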