Estimating viral prevalence with data fusion for adaptive two‐phase pooled sampling

Abstract The COVID‐19 pandemic has highlighted the importance of efficient sampling strategies and statistical methods for monitoring infection prevalence, both in humans and in reservoir hosts. Pooled testing can be an efficient tool for learning pathogen prevalence in a population. Typically, pooled testing requires a second‐phase retesting procedure to identify infected individuals, but when the goal is solely to learn prevalence in a population, such as a reservoir host, there are more efficient methods for allocating the second‐phase samples. To estimate pathogen prevalence in a population, this manuscript presents an approach for data fusion with two‐phase testing of pooled samples that allows more efficient estimation of prevalence with fewer samples than traditional methods. The first phase uses pooled samples to estimate the population prevalence and inform efficient strategies for the second phase. To combine information from both phases, we introduce a Bayesian data fusion procedure that combines pooled samples with individual samples for joint inferences about the population prevalence. Data fusion procedures result in more efficient estimation of prevalence than traditional procedures that only use individual samples or a single phase of pooled sampling. The manuscript presents guidance on implementing the first‐phase and second‐phase sampling plans using data fusion. Such methods can be used to assess the risk of pathogen spillover from reservoir hosts to humans, or to track pathogens such as SARS‐CoV‐2 in populations.


| INTRODUCTION
The rapid pandemic spread of COVID-19 has overwhelmed health systems globally, from funding and supply chains to testing and hospital capacity. The capacity to detect infections circulating in a population is constrained by technical limitations, costs, and logistics, which rapidly scale with the temporal and spatial scales of epidemics. The COVID-19 pandemic has highlighted the importance of efficient sampling strategies and statistical methods for monitoring infection prevalence, both in humans and in reservoir hosts. In studies of reservoir hosts, the research question is not necessarily whether an individual is infected; rather, the goal can be to estimate the prevalence in the reservoir host population and how this changes over space and time. Furthermore, funding for screening potential reservoirs is generally limited compared with human screening. Nevertheless, estimating prevalence in reservoir hosts is critical for understanding drivers of pathogen spillover, and precise estimates of population prevalence require testing a large number of samples (Plowright et al., 2017, 2019). Unfortunately, the total number of samples that can be screened is limited by the high costs of field sampling, laboratory testing, and other fiscal constraints; thus, strategies to optimize testing samples are critical for successful disease surveillance. One approach for testing, particularly when the prevalence rates are expected to be low, is to pool individual samples to assess whether one, or more, of the pooled samples results in a positive test. This pooling procedure is commonly referred to as group testing (Du et al., 2000). Group testing was first developed in the 1940s to detect cases of syphilis in soldiers in the US military during the Second World War. The technique increases the efficiency with which limited resources are used during outbreaks or surveillance programs, with direct impact on response capacity (Dorfman, 1943).
However, the effectiveness of pooled testing can be compromised as disease incidence and prevalence increase, as this results in more tests conducted during the second rounds of diagnostic assays to identify individual positive samples. Thus, there is a need for clear guidance on the optimal number of samples to pool or number of total pools to be tested. Depending on the population prevalence, pooling too many or too few samples can decrease precision in the estimated parameter and make inferences on the population prevalence unreliable, or require multiple stages of individual testing.
In light of testing limitations, pooled techniques have been adapted for COVID-19 screening in humans (Mallapaty, 2020; Mutesa et al., 2020). In this scenario, the primary intent is to determine which individuals are infected to implement isolation of cases and contact tracing protocols to mitigate spread of the virus. The idea is to first combine individual samples into a single pooled sample. If the pooled sample is negative, then all of the individual samples are assumed to be negative. If a pooled sample tests positive, it is not immediately clear how many and which of the individual samples are positive, but there are many strategies to subsequently identify them.
The simplest strategy involves retesting each of the individual samples that comprise the pool. This procedure will enable a researcher to have a complete dataset that identifies all individuals that test positive. Rather than automatically retesting all individuals in a positive pool, Sobel and Elashoff (1975) present a hierarchical approach for testing subsets of the pools in an iterative fashion. Phatarfod and Sudbury (1994) proposed an array approach where an individual specimen was divided across multiple pools. In the context of estimating population prevalence, Bilder et al. (2010) present "Informative Retesting," where the retesting protocol uses individual covariate information, and Hepworth and Watson (2017) and Hepworth and Walter (2020) present a restricted randomization approach for retesting similar in spirit to the arrays of Phatarfod and Sudbury (1994). With appropriate pooling strategies and retesting of positive pools, the overall number of tests run can be smaller than the total number of individuals.
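The claim that pooling with retesting can require fewer tests than individuals can be checked with a short calculation. The following is a minimal sketch, assuming the simplest scheme (one test per pool, then every member of a positive pool retested individually), independent infections, and a perfect assay. The function name and the prevalence values are illustrative and do not come from the study.

```python
def expected_tests_per_individual(p: float, n: int) -> float:
    """Expected number of tests per individual when pools of size n are
    tested once and every member of a positive pool is retested."""
    if n == 1:
        return 1.0  # individual testing: exactly one test each
    prob_pool_positive = 1.0 - (1.0 - p) ** n
    # one pooled test shared by n individuals, plus a retest if the pool is positive
    return 1.0 / n + prob_pool_positive


def best_pool_size(p: float, max_size: int = 30) -> int:
    """Pool size (1..max_size) with the smallest expected tests per individual."""
    return min(range(1, max_size + 1),
               key=lambda n: expected_tests_per_individual(p, n))


for p in (0.01, 0.05, 0.20, 0.40):
    n = best_pool_size(p)
    print(f"p = {p:.2f}: best pool size {n}, "
          f"{expected_tests_per_individual(p, n):.3f} tests per individual")
```

Under these assumptions the saving shrinks as prevalence rises, and at high prevalence the best pool size is 1, that is, individual testing.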
In addition to testing humans, there are also broad efforts to identify the reservoir hosts for SARS-CoV-2-related viruses, and coronaviruses more broadly. In contrast to studies where the goal is case identification (Zhang et al., 2013), individual results may not be required, and population-level estimates of prevalence are often sufficient to identify hosts and understand transmission dynamics within their populations. In some cases, this will involve collecting new samples with purpose-fit sampling designs; in other cases, sample banks may already exist that can be screened for coronaviruses. The number of samples available does not always match the funding available for screening, and optimal testing approaches to achieve desired inferences are required. Thus, to understand population prevalence in reservoir populations and in other situations where the focus is on population prevalence, we propose a data fusion procedure, with pooled testing, that enables estimates of population prevalence in multiphase studies without retesting positive pools.
In the remainder of this article, Section 2 details sampling approaches for screening individuals and presents estimation approaches, including data fusion techniques, for estimating population prevalence; Section 3 contains results from a set of simulation studies; and Section 4 concludes with a discussion.

| Sampling approaches
In some scientific studies, particularly for studying viral prevalence in reservoir hosts, a large number of samples may already exist or can be collected for minimal costs relative to the cost of testing samples. Ideally, all of these samples would be tested; however, with this work, we assume that the number of samples that can be tested is constrained. Most often the constraint is the research funding, but limits can also be a result of instrument capacity or availability of materials. Given these constraints, a sampling strategy needs to be devised to determine which individual samples to test and whether these samples should be combined into pools.
The optimal number of samples per pool requires knowledge of the population prevalence, which is generally unknown for investigations into novel host-pathogen combinations. Thus, an approach for pooled sampling is to implement a two-phase sampling design where an initial pooling in the first phase can be used to inform second-phase sampling strategies. Altering the pool size or testing additional samples individually in the second phase requires a data fusion procedure to combine inferences across pooled and individual samples. With this work, we provide guidance on adaptive two-phase sampling designs while combining pooled samples with individual samples using a novel Bayesian data fusion (Allen, 2017) procedure. We will show that this procedure results in a more precise estimator of the population prevalence than retesting positive pools.

| Pooling and group testing
If the overall population prevalence is close to zero, then most of the individual samples will be negative, and therefore, testing costs per positive sample detected are high. With a fixed cost for a single test, pooling strategies allow two or more samples to be jointly tested for the same cost as an individual sample. While a negative pool implies that all of the individual samples are negative, a positive pooled sample only implies that one or more samples in the pool are positive. Samples from positive pools can be retested, or the pooled results can be used directly to inform population prevalence and future sampling. Optimal pooling approaches require knowledge of population prevalence. This article focuses on data fusion methods that combine pooled results from multiple phases, without requiring retesting of positive pools, to estimate population prevalence and establish optimal pool sizes. Second-phase options include retesting samples from positive pools; testing additional pools of the same or different sizes; or testing additional individual samples, which requires a data fusion procedure.
Pooling procedures involve combining two or more individual samples into a pool. There is a long history of pooled sampling approaches (Bhattacharyya et al., 1979; Burrows, 1987; Dorfman, 1943; Swallow, 1985) and associated statistical methodology for parameter estimation (Biggerstaff, 2008; Chen et al., 2009; Colón et al., 2001; Hepworth, 2005). The difference between these approaches and what we propose is that existing methods are generally focused on designing a single-phase pooled sampling plan or estimating population prevalence conditional on pooled samples (Colón et al., 2001).
In contrast, we propose a procedure that combines adaptive pooling with data fusion techniques to integrate pooled and individual samples.

| Adaptive sampling + two-phase sampling
Adaptive sampling is a procedure where the sampling strategy is informed by previously collected data. Adaptive sampling has a history in quality control fields (Prabhu et al., 1994; Runger & Montgomery, 1993; Runger & Pignatiello Jr, 1991). More recently, adaptive sampling has become popular in sensor networks (Gedik et al., 2007; Jain & Chang, 2004) and Bayesian model selection procedures (Clyde et al., 2011; Nott & Kohn, 2005). In the context of pooled testing, Hepworth (1996) proposed a sequential approach for pool sizes. In theory, the number of samples in a pool could be adaptively changed after each test (e.g., as an outbreak is developing and cases rise and then subsequently fall); however, for practical implementation we restrict the sampling procedure to two phases. In addition to the sampling approach, there is a corresponding estimation problem, often referred to as sequential estimation (Lai, 2001).
Estimation methods for the two-phase sampling approach are described in the next section.
In ecological settings, due to data collection, processing, and analysis procedures, adaptive sampling and estimation are generally conducted with a small number of discrete phases. Often, two-phase approaches include an initial screening of rare species that informs more efficient resource allocation in the second phase (Pacifici et al., 2012). In other settings, certain types of sampling are more expensive and a cheaper method is used for an initial screening before employing a more expensive sampling procedure (Rivest et al., 1990; Villella & Smith, 2005). Conceptually, pooled sampling represents a cheaper form of sampling, on a per-sample basis, and hence is quite similar to these scenarios.
Adaptive sampling approaches have been developed for monitoring prevalence, both with and without pooling. Reilly (1996) developed optimal sampling approaches for two-phase sampling where the cost of sampling may differ between phase 1 and phase 2. Breslow and Chatterjee (1999) presented an approach for two-phase sampling where the second-phase estimation is a case-control sample. McIsaac and Cook (2015) developed a framework of two-phase sampling which they define as "response-dependent," meaning that the sampling approach in the second phase depends on the responses in the first phase. However, we are not aware of any approaches that combine adaptive sampling with data fusion for estimation of viral prevalence by enabling researchers to combine information from pooled and individual samples. Our two-phase approach would be beneficial in calibrating the optimal pool size in any biological setting with pooled samples, such as pooled samples for Salmonella detection (Kinde et al., 1996) or estimating prevalence of infected insect vectors (Ebert et al., 2010). Moreover, the ability to fuse pooled and individual samples can lead to more efficient sampling in any scenario that implemented a phase 1 pilot study to inform sampling parameters for prevalence or took an either-or approach to pooled sampling versus individual samples (Arnold et al., 2011).

| Choosing pool size and identifiability of population prevalence
When choosing pool sizes, one issue is identifiability of population prevalence when the prevalence is high and a large number of samples are pooled. The maximum-likelihood estimator of the prevalence has been shown to be biased (Colón et al., 2001). The reason is that the probability that a pooled sample tests positive, denoted π, can be effectively 1 for a range of values of the population prevalence, denoted p. For illustrative purposes, assume that p ≥ .5 and the pool size, n, is 10. In this example, if p = .5, then π = 0.999. Thus, nearly all pooled samples would test positive for any prevalence between 0.5 and 1, and hence, it would be impossible to differentiate values in the interval between 0.5 and 1; in other words, the prevalence is not identifiable.
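The identifiability problem can be seen numerically. The sketch below assumes independent samples and a perfect assay, so that a pool of size n is positive with probability π = 1 − (1 − p)^n, which reproduces the value π = 0.999 quoted for p = .5 and n = 10. The function name is illustrative.

```python
def pool_positive_prob(p: float, n: int) -> float:
    """Probability that a pool of n independent samples contains at least
    one positive, given population prevalence p."""
    return 1.0 - (1.0 - p) ** n


# For pools of size 10, pi is nearly flat (and nearly 1) once p exceeds about 0.5,
# so the pooled results carry almost no information to separate those values of p.
for p in (0.01, 0.05, 0.10, 0.30, 0.50, 0.70, 0.90):
    print(f"p = {p:.2f}  ->  pi = {pool_positive_prob(p, 10):.4f}")
```

With smaller pools the curve flattens later, which is one reason the choice of pool size interacts with the unknown prevalence.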

| Model framework
Using two-phase adaptive sampling, we describe three methods for estimating population prevalence using a combination of pooled and individual samples. Because data from the first phase are used to inform the second phase, efficient adaptive sampling requires sequential estimation. These estimation methods present a set of approaches for two-phase sampling. Bayesian data fusion methods are implemented to estimate overall prevalence, where a beta(α,β) distribution is used as the prior for p. For the experiments in Section 3, a uniform prior, beta(1,1), is used, but for specific applications, subject matter expertise can, and should, be used to select parameters in the beta distribution for an informed prior of population prevalence.
Practitioners with knowledge of prevalence can use established prior elicitation techniques, such as Wu et al. (2008), to create informative prior distributions.
Data integration is formally defined as using different data streams that measure the response of interest to make combined inferences and uncertainty calculations. It is important to differentiate between collecting streams of data that can be used as covariates and collecting multiple streams of data about the outcome of interest. In the former, information such as demographic information could be collected that might be indicative of the parameter of interest. In contrast, the latter scenario contains multiple direct measurements of the outcome of interest, but they may be on different spatial and/or temporal scales and require a formal method for combining the data. In this work, we use the more general term data fusion, rather than data integration, to signify the combining of pools of different sizes, including individual samples.
[…] where $Y = (Y_1, \ldots, Y_n)$ and $n$ is the total number of individuals that are tested.
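For individually tested samples, the beta prior leads to a closed-form posterior. The sketch below assumes that the individual results are independent Bernoulli(p) draws, in which case the standard conjugate result is a beta(α + ΣY, β + n − ΣY) posterior. The source's own equations are not recoverable here, so this is the textbook form and not necessarily the authors' exact notation. The data and function names are hypothetical.

```python
import random


def beta_posterior(y, alpha=1.0, beta=1.0):
    """Posterior beta parameters for 0/1 individual results y under a
    beta(alpha, beta) prior on the prevalence p."""
    positives = sum(y)
    return alpha + positives, beta + len(y) - positives


def credible_interval(a, b, level=0.95, draws=20000, seed=1):
    """Equal-tailed credible interval estimated from Monte Carlo draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lower = samples[int((1.0 - level) / 2.0 * draws)]
    upper = samples[int((1.0 + level) / 2.0 * draws) - 1]
    return lower, upper


# Hypothetical data: 6 positives among 100 individually tested samples.
y = [1] * 6 + [0] * 94
a, b = beta_posterior(y)            # uniform beta(1, 1) prior
posterior_mean = a / (a + b)
lower, upper = credible_interval(a, b)
print(f"posterior beta({a:.0f}, {b:.0f}), mean {posterior_mean:.3f}, "
      f"95% interval ({lower:.3f}, {upper:.3f})")
```

An informative prior enters only through `alpha` and `beta`, so elicited prior knowledge can be swapped in without changing the rest of the calculation.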

| Pooled samples
With pooled samples, a common second-phase approach is to retest individuals from pools that test positive. Brookmeyer (1999) presents a conditional probability function that accounts for the dependence between the initial and subsequent pools. When the retesting pools are of size one, the end result is that the total number of positives and negatives are known and the exact posterior distribution can be computed using Equation 3. If the prevalence is low, this approach can be more efficient than testing individual samples.
In particular, this approach is more efficient, on a per-test basis, if the expected number of samples tested is less than the total number of individual samples.
Rather than a set of individual tests, the sampling model, or the statistical likelihood, is defined as:
$$Z_j \sim \text{Bernoulli}(\pi_j), \qquad \pi_j = 1 - (1 - p)^{n_j},$$
where $Z_j$ is a binary variable denoting whether the $j$th pooled sample tests positive, $n_j$ is the number of samples in pool $j$, and $\pi_j$ is a function of $p$ that corresponds to the probability that pooled sample $j$ tests positive. A Metropolis-Hastings algorithm, which is detailed in Appendix 1, is used to estimate the prevalence, $p$, using the calculated probability $\pi_j$.
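To illustrate how a sampler of this kind can work, the following is a minimal random-walk Metropolis-Hastings sketch for the pooled likelihood with a beta prior on p. It is not the algorithm from Appendix 1, which is not reproduced here. The proposal scale, chain length, burn-in, and the data (41 positive pools out of 100 pools of size 5) are all assumptions chosen for illustration, and a perfect assay is assumed.

```python
import math
import random


def log_posterior(p, z, sizes, alpha=1.0, beta=1.0):
    """Log posterior of p (up to a constant) given pooled results z (0/1)
    and the corresponding pool sizes."""
    if not 0.0 < p < 1.0:
        return -math.inf                        # outside the support of the prior
    log_q = math.log1p(-p)                      # log(1 - p)
    total = (alpha - 1.0) * math.log(p) + (beta - 1.0) * log_q
    for z_j, n_j in zip(z, sizes):
        log_neg = n_j * log_q                   # log P(pool j tests negative)
        total += math.log(-math.expm1(log_neg)) if z_j else log_neg
    return total


def metropolis_hastings(z, sizes, n_iter=10000, step=0.05, burn_in=2000, seed=1):
    """Random-walk Metropolis-Hastings draws from the posterior of p."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, z, sizes)
    draws = []
    for _ in range(n_iter):
        proposal = p + rng.gauss(0.0, step)     # symmetric proposal
        candidate = log_posterior(proposal, z, sizes)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            p, current = proposal, candidate
        draws.append(p)
    return draws[burn_in:]


# Hypothetical phase 1 data: 100 pools of size 5, of which 41 test positive.
z = [1] * 41 + [0] * 59
sizes = [5] * 100
draws = metropolis_hastings(z, sizes)
posterior_mean = sum(draws) / len(draws)
print(f"posterior mean of p: {posterior_mean:.3f}")
```

Because each pool carries its own size, the same function accepts pools of different sizes, including pools of size 1, without modification.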

| Integrated analysis
When

| RESULTS
A set of four synthetic studies are constructed to explore adaptive sampling and data fusion techniques across Phase 1 (initial testing of samples to obtain initial estimates of prevalence) and Phase 2 (follow-up or additional testing to obtain more detailed prevalence estimates). Simulation 1 provides justification for the data fusion […] Figure 2 shows the credible interval width for a set of sample sizes. Unsurprisingly, the credible intervals are narrower for larger sample sizes, but this […]
FIGURE 1 Simulation 1: Data fusion efficacy for phase 1 testing. This simulation shows that the data fusion technique, which requires combining pooled and individual samples, is superior to either approach alone. For the lower ranges of prevalence, the posterior mean of all of the methods tracks the overall prevalence, on average. However, as the prevalence becomes closer to one, the pooled-only approach is unable to identify the true prevalence. The data fusion approach has the narrowest credible interval width across a range of prevalence values.

| Simulation 2: Retesting versus new samples for phase 2 of testing
Following the initial phase of testing to obtain an estimate of population […] However, in the bottom panel of Figure 3, the data fusion procedure has smaller credible interval width for all prevalence levels. This result is fairly intuitive, as taking additional samples would contain more information than retesting old samples, where it is known that at least one sample is positive. Figure 4 shows the credible interval width as a function of sample size. Prior to this work, the challenge had been the lack of a coherent method to combine the pooled and individual samples, which our proposed data fusion procedure allows.
[Figure panel: Credible Interval Width by Prevalence]
[…] gets larger, the posterior means of the pooled estimators start to deviate from the true prevalence. In fact, at a certain threshold when all of the pooled samples end up positive, the posterior means flatten off and the posterior distribution ends up being uniform between a threshold of p values that map to θ ≈ 1. In addition to inaccurate posterior means, this identifiability problem also results in credible intervals that are very wide.

| Simulation 3: Initial pool sizes for phase 1
The takeaway from this simulation is that a larger pool size can be more efficient with lower prevalence. The potential drawback to large pool sizes is the case when prevalence is high and many, or all, of the pools test positive, resulting in an imprecise answer. Prior knowledge about the prevalence should be used in selecting phase 1 pooling strategies. As described in Hepworth and Watson (2009)

| Simulation 4: Adaptive sampling for phase 2 of testing
Simulation 3 provided guidance on choosing the phase 1 pool size; this simulation follows that to compare the performance of phase 2 sampling and estimation techniques. Simulation 2 already established that the second-phase data fusion procedure (adding data from new individual samples without pooling) is superior to following up with retesting of individual samples in positive pools. For this next scenario, we compare outcomes when the second-phase data fusion also uses pooled samples. Specifically, in simulation 4, phase 1 consists of 100 pooled samples of size 5; then, we compare second-phase testing with pools of sizes 1, 3, and 5 that run the same number of total tests. Figure 6 shows the estimated posterior means and credible interval widths from the phase 1 samples as well as the three phase 2 approaches. The only approach that accurately estimates the entire range of prevalence values is the phase 2 method that tests addi-

| DISCUSSION
When the goal is to estimate population prevalence, rather than identify infected individuals, pooled testing without retesting positive pools as a second-phase strategy can result in fewer overall tests or more efficient estimates than testing individual samples. Sampling strategies for pooled testing require selecting the pool size as well as the total number of pools to test. Selecting optimal, or efficient, pool sizes is a well-known problem (Thompson, 1962) […] another approach is to implement a two-phase sampling procedure that can be used to gain information about the population prevalence, which can then be used to inform pool sizes for the second phase and subsequent testing.
The two-phase approach can enable researchers to learn about the population prevalence before conducting all of the tests on a given pool size; however, using data from the first phase and second phase to jointly inform population prevalence requires formal data fusion procedures. This article presents data fusion methods that can combine information from pooled samples of different sizes, including individual samples.
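The core of fusing pooled and individual results can be sketched by treating an individual sample as a pool of size 1 and multiplying the likelihood contributions from both phases. The sketch below evaluates the posterior on a grid instead of by Metropolis-Hastings, purely for brevity. It assumes a uniform beta(1, 1) prior, a perfect assay, and that phase 2 uses new samples that are independent of phase 1. The counts are hypothetical.

```python
import math


def grid_posterior(z, sizes, grid_size=2000):
    """Posterior probabilities of p on an evenly spaced grid, under a uniform
    prior, for 0/1 results z from pools with the given sizes."""
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_post = []
    for p in grid:
        log_q = math.log1p(-p)                   # log(1 - p)
        total = 0.0
        for z_j, n_j in zip(z, sizes):
            log_neg = n_j * log_q                # log P(pool j tests negative)
            total += math.log(-math.expm1(log_neg)) if z_j else log_neg
        log_post.append(total)
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    norm = sum(weights)
    return grid, [w / norm for w in weights]


def interval_width(grid, probs, level=0.95):
    """Width of the equal-tailed credible interval on the grid."""
    tail = (1.0 - level) / 2.0
    cumulative, lower, upper = 0.0, None, None
    for p, w in zip(grid, probs):
        cumulative += w
        if lower is None and cumulative >= tail:
            lower = p
        if upper is None and cumulative >= 1.0 - tail:
            upper = p
    return upper - lower


# Hypothetical phase 1: 100 pools of size 5, 41 positive.
z1, sizes1 = [1] * 41 + [0] * 59, [5] * 100
# Hypothetical phase 2: 100 new individual samples (pools of size 1), 10 positive.
z2, sizes2 = [1] * 10 + [0] * 90, [1] * 100

width_pooled = interval_width(*grid_posterior(z1, sizes1))
width_fused = interval_width(*grid_posterior(z1 + z2, sizes1 + sizes2))
print(f"95% interval width, phase 1 only: {width_pooled:.4f}")
print(f"95% interval width, fused phases: {width_fused:.4f}")
```

Because the pool size is carried per observation, the same code handles second-phase pools of size 3 or 5 by changing `sizes2`.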
Adaptive sampling could be implemented on a sample-to-sample basis where the pool size changes with each test. However, in many settings this is impractical as tests are run in large batches or concurrently. In general, a smaller share of the total tests should be allocated to the phase 1 setting than the phase 2 setting. This allows the information gained from phase 1 to be used to create the most efficient pool sizes.
This work has assumed perfect sensitivity and specificity of tests. In practice, this is rarely the case. However, known sensitivity and specificity could easily be incorporated into the model framework to adjust for the potential of false positives and false negatives. If the sensitivity and specificity of the pooled tests depend on the proportion of pooled samples that are positive or negative, this could also be handled in the model framework but would likely require laboratory testing to understand the relationship between sensitivity and specificity and the underlying samples.
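One common way to incorporate known error rates is to replace the probability that a pool truly contains a positive with the probability that the pooled test reads positive. The sketch below assumes that sensitivity and specificity are known constants that do not change with pool size or with the number of positives in the pool, which is exactly the simplification the text cautions may fail under dilution. The function name and numeric values are illustrative.

```python
def observed_positive_prob(p, n, sensitivity=1.0, specificity=1.0):
    """Probability that a pooled test of n samples reads positive, allowing
    for false negatives (sensitivity < 1) and false positives (specificity < 1)."""
    pi_true = 1.0 - (1.0 - p) ** n              # pool truly contains a positive
    return sensitivity * pi_true + (1.0 - specificity) * (1.0 - pi_true)


# With a perfect assay this reduces to 1 - (1 - p)^n.
print(observed_positive_prob(0.1, 5))
# Illustrative imperfect assay: 95% sensitivity, 98% specificity.
print(observed_positive_prob(0.1, 5, sensitivity=0.95, specificity=0.98))
```

Substituting this probability for the error-free one in the pooled likelihood leaves the rest of the estimation procedure unchanged.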
One potential advantage of the data fusion approach, and an active area of ongoing research, is the ability to incorporate individual-level covariate information; see McMahan et al. (2017) or Joyner et al. (2020). Consider the case where demographic information is available and is related to the probability that an individual is infected. Then, pooled tests could be used to inform the overall prevalence in the population, along with the individual samples. The individual samples also can be used to infer the relationship between individual demographic information and the likelihood of being infected.

ACKNOWLEDGMENTS
This research was developed with funding from The Defense Advanced Research Projects Agency DARPA PREEMPT D18AC00031. The content of the information does not necessarily reflect the position or the policy of the U.S. government, and no official endorsement should be inferred.

CONFLICT OF INTEREST
We declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
No new data are created for this project; however, code for the simulations highlighted in the manuscript is publicly available on GitHub (https://github.com/andyhoegh/DataIntegration).