Estimating the prevalence of two or more diseases using outcomes from multiplex group testing

When screening a population for infectious diseases, pooling individual specimens (e.g., blood, swabs, urine, etc.) can provide enormous cost savings when compared to testing specimens individually. In the biostatistics literature, testing pools of specimens is commonly known as group testing or pooled testing. Although estimating a population‐level prevalence with group testing data has received a large amount of attention, most of this work has focused on applications involving a single disease, such as human immunodeficiency virus. Modern methods of screening now involve testing pools and individuals for multiple diseases simultaneously through the use of multiplex assays. Hou et al. (2017, Biometrics, 73, 656–665) and Hou et al. (2020, Biostatistics, 21, 417–431) recently proposed group testing protocols for multiplex assays and derived relevant case identification characteristics, including the expected number of tests and those which quantify classification accuracy. In this article, we describe Bayesian methods to estimate population‐level disease probabilities from implementing these protocols or any other multiplex group testing protocol which might be carried out in practice. Our estimation methods can be used with multiplex assays for two or more diseases while incorporating the possibility of test misclassification for each disease. We use chlamydia and gonorrhea testing data collected at the State Hygienic Laboratory at the University of Iowa to illustrate our work. We also provide an online R resource practitioners can use to implement the methods in this article.


INTRODUCTION
In group testing applications, individual specimens are combined into pools and tests are performed on the pools for a binary outcome (e.g., positive/negative). Individuals from negative pools are diagnosed to be negative, while individuals from positive pools are tested further to determine which ones are positive. Dorfman (1943) is credited with introducing this method to test American soldiers for syphilis during World War II. In the seminal article, nonoverlapping pools of individual specimens are formed in the first stage of testing, and positive pools are resolved by testing each individual one by one in the second stage. When the probability of disease is small, group testing protocols that implement a larger number of stages (Pilcher et al., 2005; Quinn et al., 2000) and/or overlapping pools (Martin et al., 2013) can further reduce the number of tests needed to identify positive individuals. The infectious disease literature documents numerous applications of group testing, including for human immunodeficiency virus (HIV), hepatitis B, and hepatitis C (Hourfar et al., 2008; Stramer, Notari, et al., 2013; Westreich et al., 2008), chlamydia and gonorrhea (Lindan et al., 2005), and West Nile virus (Busch et al., 2005). More recently, group testing has played a critical role in reducing laboratory testing loads when diagnosing individuals for SARS-CoV-2 (Abdalhamid et al., 2020; Bilder et al., 2021; Pilcher et al., 2020).
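Dorfman's two-stage design trades one pooled test against the risk of having to retest everyone in the pool. As a quick illustration with a perfect assay and a hypothetical 1% prevalence, the expected number of tests per individual for a pool of size n is 1/n + 1 − (1 − p)^n:

```python
def dorfman_tests_per_individual(n, p):
    """Expected tests per individual under two-stage Dorfman testing
    with pool size n, prevalence p, and a perfect assay: one pooled
    test shared by n people, plus n individual tests whenever the
    pool contains at least one positive individual."""
    return 1.0 / n + 1.0 - (1.0 - p) ** n

# With 1% prevalence, pools of size 10 need ~0.196 tests per person,
# roughly an 80% reduction versus testing everyone individually.
print(round(dorfman_tests_per_individual(10, 0.01), 3))   # 0.196
```

The formula makes the article's opening claim concrete: the savings come entirely from negative pools, which is why pooling pays off only when prevalence is low.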
Statistical research in group testing generally falls into one of two categories: case identification and estimation. In the case identification problem, the goal is to characterize the efficiency and accuracy of group testing protocols with the usual objectives of minimizing the expected number of tests and/or maximizing classification accuracy (Kim et al., 2007). On the other hand, the estimation problem involves estimating a population-level probability of disease (Huang et al., 2017; Liu et al., 2012) or covariate-adjusted probabilities by using regression (Delaigle & Meister, 2011; McMahan et al., 2017; Wang et al., 2014). In both problems, the performance of group testing and its ability to offer cost-effective screening and surveillance have been documented extensively. However, most of the existing research in group testing, including the articles referenced above, has focused on a single disease.
In this article, we consider the estimation problem in group testing when multiplex assays are used to test specimens for multiple diseases simultaneously. Our work is largely motivated by the screening practices for Chlamydia trachomatis (CT) and Neisseria gonorrhoeae (NG), the bacteria that lead to chlamydia and gonorrhea infections, respectively. For example, the Aptima Combo 2 Assay (Hologic) and the cobas CT/NG Assay (Roche) are two multiplex assays commonly used by laboratories to test individuals for these bacteria. Recent advances in technology have seen the development of multiplex assays for more than two infections. For example, the BD MAX CT/GC/Trichomonas vaginalis (TV) Assay (Becton, Dickinson and Company) tests for CT, NG, and TV simultaneously, and the Allplex STI Essential Assay (Seegene) detects CT, NG, TV, and Mycoplasma genitalium (de Salazar et al., 2019). Commonly used triplex assays for HIV, hepatitis B, and hepatitis C have been compared in Stramer, Krysztof, et al. (2013), and, more recently, multiplex assays have been authorized by the US Food and Drug Administration for the simultaneous detection of influenza and SARS-CoV-2 (Roche, 2020).
When compared to research for single diseases, estimating multiple population-level disease probabilities from group testing data has received far less attention. The original work on this problem is attributed to Hughes-Oliver and Rosenberger (2000), who developed D-optimal designs for estimation when assays are 100% accurate. Ding and Xiong (2015) and Li et al. (2017) proposed optimal designs to estimate probabilities for multiple independent diseases and two correlated diseases, respectively, while allowing for testing error. A practical limitation in these articles is that the methods are based only on outcomes from initially formed master pools; that is, subsequent testing results from resolving positive pools are not incorporated. Another limitation is that the assay accuracy rates are assumed to be 100% for each disease or they are assumed to be known. In some applications, reasonable estimates may be available for disease-specific sensitivities and specificities. A more flexible approach is to regard these population-level parameters as unknown and then estimate them simultaneously with the disease probabilities. This is the approach we espouse in this article.
Estimation for group testing with multiplex assays is challenging. When incorporating test misclassification, (a) the true disease statuses of each specimen tested are latent and are likely correlated and (b) the available data from group testing protocols may include multiple (possibly misclassified) testing outcomes on the same individual. Tebbs et al. (2013) and Warasi et al. (2016) proposed frequentist and Bayesian approaches for estimation with multiplex assays, respectively, but limited their investigations to Dorfman testing, that is, hierarchical group testing protocols which implement two stages only. In this article, we propose a Bayesian framework to estimate population-level disease probabilities and assay accuracy probabilities from any group testing protocol which uses multiplex assays. This includes the higher-stage hierarchical and array-based protocols recently proposed in Hou et al. (2017) and Hou et al. (2020), respectively, and any other protocol that might be used in practice. In other words, the estimation framework we present herein is invariant to how the multiplex outcomes are recorded. Therefore, we can compare the accuracy and precision of population-level estimates for different group testing protocols which use multiplex assays. Until now, such a comparison has been missing in the literature.
Subsequent sections of this article are organized as follows. In Section 2, we describe notation, state assumptions, and derive the observed data likelihood, which is applicable for any group testing protocol using multiplex assays. In Section 3, we present the specifics of our Bayesian estimation approach when assay accuracy probabilities (sensitivities and specificities) are known, including prior model specification and data augmentation steps to construct an efficient posterior sampling algorithm. In Section 4, we then generalize our approach to allow for assay accuracy probabilities to be unknown. In Section 5, we provide simulation evidence to assess the performance of our estimation methods and provide a comparison of the estimates for different group testing protocols. In Section 6, we illustrate our work by using CT/NG data collected at the State Hygienic Laboratory (SHL) at the University of Iowa. In Section 7, we conclude with a summary discussion and describe future work. Additional details and simulation evidence are provided in the Supporting Information.

NOTATION AND PRELIMINARIES
Suppose N individuals are to be tested for K ≥ 2 diseases using a group testing protocol. We assume all diagnostic test results are obtained from a multiplex assay which provides a positive/negative diagnosis for each disease each time it is used (on specimen pools or on individual specimens). For example, the SHL at the University of Iowa uses the Aptima Combo 2 Assay (AC2A) to detect CT and NG simultaneously (see Section 6). The current multiplex protocol at the SHL tests specimen pools (usually of size 4) with the AC2A. Pools testing positively for either disease are then resolved by testing each individual specimen with the same assay. Disease diagnoses are then determined from the individual tests.
The observed data in group testing consist of diagnostic test results collected as part of a testing protocol. These protocols are typically completed over S ≥ 2 stages, where, within each stage, pooled or individual specimens are tested in response to the results from the previous stage. For example, as noted earlier, the SHL uses an S = 2 stage protocol where pools of specimens are tested in the first stage and individual specimens from positive pools are tested in the second. Hou et al. (2017) evaluated the utility of hierarchical group testing protocols using a larger number of stages, showing that S = 3 stage protocols conferred the smallest number of tests when screening for CT/NG in four western states in the United States (Alaska, Idaho, Oregon, and Washington). A three-stage hierarchical protocol uses an intermediate second stage with smaller-sized subpools; for example, first-stage pools of size 9, three second-stage pools of size 3, and individual testing in the third stage. Hou et al. (2020) later proposed S = 2 and S = 3 stage multiplex protocols which use array testing (AT). In these (nonhierarchical) protocols, testing results arise from pooling rows and columns of overlapping specimens arranged in an array-like configuration.
In this article, we propose an estimation framework which is applicable for any group testing protocol using multiplex assays. To maintain this level of generality, we need notation that helps us track pool membership. Define the index set P_j ⊆ {1, 2, …, N}, for j = 1, 2, …, J, which identifies the individuals contributing to the jth pool; that is, i ∈ P_j when the ith individual is in the jth pool. Let Z̃_j = (Z̃_j1, Z̃_j2)′ denote a vector of binary random variables encoding the true status of the jth pool, where Z̃_jk = I(∑_{i∈P_j} Ỹ_ik > 0), for j = 1, 2, …, J and k = 1, 2, and Ỹ_ik denotes the true (latent) status of the ith individual for the kth disease. In other words, the jth pool is truly positive (truly negative) for the kth disease if the pool contains at least one (no) positive individual(s) for disease k. Again, due to the effects of imperfect testing, the Z̃_jk's are not observed. Instead, we observe Z_j = (Z_j1, Z_j2)′, a vector of binary random variables encoding the test results for the jth pool, where Z_jk = 1 (0) if the jth pool tests positively (negatively) for the kth disease.
To allow for imperfect testing, we need to relate the observed testing results in Z_j to the true disease statuses in Z̃_j. We assume pr(Z_jk = 1 | Z̃_jk = 1) = S_{e:jk} and pr(Z_jk = 0 | Z̃_jk = 0) = S_{p:jk}, for j = 1, 2, …, J and k = 1, 2. That is, S_{e:jk} (S_{p:jk}) is the sensitivity (specificity) of the multiplex assay when testing the jth pool for the kth disease. We assume these accuracy probabilities do not depend on the number of truly positive individuals in the jth pool; that is, there is no effect due to the dilution of positive individuals. However, our notation emphasizes that S_{e:jk} and S_{p:jk} can be "pool-specific," affording us the flexibility to allow for different multiplex assays to be used and/or to have the testing accuracy of a multiplex assay be a function of the jth pool size (see Section 4). The conditional distribution of the observed data Z = (Z′_1, Z′_2, …, Z′_J)′ given the individuals' true disease statuses Ỹ is

pr(Z | Ỹ, δ) = ∏_{j=1}^{J} ∏_{k=1}^{2} [S_{e:jk}^{Z_jk} (1 − S_{e:jk})^{1−Z_jk}]^{Z̃_jk} [(1 − S_{p:jk})^{Z_jk} S_{p:jk}^{1−Z_jk}]^{1−Z̃_jk},    (2)

where δ is a vector that contains all assay accuracy probabilities; that is, the S_{e:jk}'s and S_{p:jk}'s for j = 1, 2, …, J and k = 1, 2.
Note that the right-hand side of Equation (2) is pr(Z | Z̃_1, Z̃_2, …, Z̃_J, δ), but this equals pr(Z | Ỹ, δ) because Z̃_j = (Z̃_j1, Z̃_j2)′ and Z̃_jk = I(∑_{i∈P_j} Ỹ_ik > 0) are uniquely determined when the true disease statuses Ỹ are given. When writing Equation (2), we assume that testing results in Z are conditionally independent given the true statuses in Ỹ and that the values of S_{e:jk} and S_{p:jk} for one disease do not depend on the true status of the other disease (see Hou et al., 2020). Combining Equations (1) and (2), we can express the distribution of the observed data from any group testing protocol as

pr(Z | θ, δ) = ∑_{Ỹ ∈ {0,1}^{2N}} pr(Ỹ | θ) pr(Z | Ỹ, δ),    (3)

where {0, 1}^{2N} denotes the collection of all possible realizations of Ỹ. This distribution is obtained by marginalizing the joint distribution of the observed testing responses and the individuals' latent statuses, that is, by summing pr(Z, Ỹ | θ, δ) = pr(Ỹ | θ) pr(Z | Ỹ, δ) over Ỹ. This marginalization requires computing the sum over the 2^{2N} possible realizations of Ỹ, which can be computationally prohibitive in practical settings. For example, the Iowa CT/NG data considered in Section 6 involve N = 14,450 individuals.
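To make the marginalization in Equation (3) concrete, the sketch below enumerates all 2^(2N) latent status vectors for a toy two-disease problem; the pool assignment and all parameter values are hypothetical. This brute-force sum is exactly the computation that becomes infeasible at realistic sample sizes:

```python
from itertools import product

def lik_obs(Z, pools, theta, Se, Sp, N):
    """Marginal likelihood pr(Z | theta, delta) computed by summing
    pr(Ytilde | theta) * pr(Z | Ytilde, delta) over all 2^(2N) latent
    status vectors for two diseases.  Feasible only for tiny N; the
    MCMC approach in the article avoids this enumeration entirely."""
    total = 0.0
    for Y in product([(0, 0), (1, 0), (0, 1), (1, 1)], repeat=N):
        pY = 1.0                       # pr(Ytilde | theta): iid category draws
        for y in Y:
            pY *= theta[y]
        pZ = 1.0                       # pr(Z | Ytilde, delta): independent tests
        for j, members in enumerate(pools):
            for k in range(2):
                ztil = int(any(Y[i][k] for i in members))   # true pool status
                acc = Se[k] if ztil else Sp[k]              # pr(correct result)
                pZ *= acc if Z[j][k] == ztil else 1.0 - acc
        total += pY * pZ
    return total

theta = {(0, 0): 0.95, (1, 0): 0.02, (0, 1): 0.02, (1, 1): 0.01}
Se, Sp = (0.95, 0.95), (0.99, 0.99)
# one master pool of N = 3 individuals testing negative for both diseases
print(lik_obs(Z=[(0, 0)], pools=[[0, 1, 2]], theta=theta, Se=Se, Sp=Sp, N=3))
```

A quick sanity check is that the likelihood sums to one over the four possible pool outcomes, which follows from the conditional-independence assumptions behind Equation (2).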

ESTIMATION WITH KNOWN ASSAY ACCURACY PROBABILITIES
To incorporate prior knowledge about the disease probabilities in θ and the assay accuracy probabilities in δ, we take a Bayesian approach, as in Warasi et al. (2016), who considered two-stage hierarchical protocols only. Our methods herein are more general and apply to any group testing protocol with multiplex assays. In this section, we consider the simpler setting where the assay accuracy probabilities in δ are known. This assumption is then relaxed in Section 4.

Posterior sampling
We assume a priori that θ ∼ Dirichlet(α); that is, the prior distribution for θ is given by

π(θ) = c(α) θ_00^{α00−1} θ_10^{α10−1} θ_01^{α01−1} θ_11^{α11−1},

where c(α) is a normalizing constant and α = (α00, α10, α01, α11)′ is a vector of known hyperparameters. Based on the observed data Z, we then update our knowledge about θ through its posterior distribution, given by π(θ | Z, δ) ∝ pr(Z | θ, δ)π(θ). Unfortunately, this distribution involves pr(Z | θ, δ), whose calculation in Equation (3) is generally infeasible. Therefore, to facilitate posterior estimation, we develop a Markov chain Monte Carlo sampling algorithm that can draw realizations from π(θ | Z, δ).
At the crux of this development is a data augmentation step which involves introducing individuals' true disease statuses as "missing data." Define the vector Ṽ_i = (Ṽ_i(00), Ṽ_i(10), Ṽ_i(01), Ṽ_i(11))′ so that Ṽ_i(00) = 1 when Ỹ′_i = (0, 0), Ṽ_i(10) = 1 when Ỹ′_i = (1, 0), Ṽ_i(01) = 1 when Ỹ′_i = (0, 1), and Ṽ_i(11) = 1 when Ỹ′_i = (1, 1). With this augmentation, posterior sampling proceeds by iterating between the full conditional distributions of the latent statuses and of θ:

1. Initialize θ^(0) and set g = 0.
2. Sample each Ṽ_i from its multinomial full conditional distribution given the observed data Z, the accuracy probabilities in δ, and the current value θ^(g).
3. Sample θ^(g+1) from its Dirichlet full conditional distribution, whose parameters add the sampled latent category counts ∑_i Ṽ_i to the hyperparameters in α.
4. Set g = g + 1 and repeat steps 2-4 while g < G, the number of Gibbs iterates.

Two remarks are in order. First, it is worth emphasizing that the multinomial cell probabilities in Step 2 are functions of the observed data in Z (see Appendix A). This is why the posterior sampling algorithm above can be implemented with any group testing protocol using multiplex assays. In other words, different protocols will give rise to different types of observed data Z, but the sampling procedure remains unchanged. Second, in practice, we recommend selecting the number of Gibbs iterates G to be large; for example, G = 10,000, after discarding the first thousand or so iterates for burn-in purposes (see Sections 5 and 6 for specific illustrations). For inference, the sample mean of the G iterates can be used as an estimate of the posterior mean of θ, that is, E(θ | Z, δ), and credible intervals can be constructed by using the appropriate sample quantiles.
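A minimal sketch of such a data-augmentation sampler for two diseases is given below. It is not the Appendix A algorithm: it updates one individual's latent status pair at a time from its full conditional given everyone else's statuses and the pool results, then draws θ from its conjugate Dirichlet full conditional. The pool layout and all numeric settings are illustrative only:

```python
import numpy as np

CATS = [(0, 0), (1, 0), (0, 1), (1, 1)]   # latent status pairs (disease 1, disease 2)

def gibbs(Z, pools, Se, Sp, alpha, n_ind, n_iter=1000, burn=200, seed=1):
    """Data-augmentation Gibbs sampler sketch with known assay accuracies.
    Alternates between (i) sampling each individual's latent status pair
    from its full conditional and (ii) sampling theta from its Dirichlet
    full conditional, whose parameters add the latent counts to alpha."""
    rng = np.random.default_rng(seed)
    member_pools = [[j for j, P in enumerate(pools) if i in P] for i in range(n_ind)]
    Y = [(0, 0)] * n_ind                          # current latent statuses Ytilde
    theta = np.full(4, 0.25)
    draws = []
    for g in range(n_iter):
        for i in range(n_ind):                    # (i) update Ytilde_i
            w = np.empty(4)
            for c, y in enumerate(CATS):
                p = theta[c]
                for j in member_pools[i]:
                    for k in range(2):            # true pool status if Ytilde_i = y
                        ztil = int(y[k] or any(Y[m][k] for m in pools[j] if m != i))
                        acc = Se[k] if ztil else Sp[k]
                        p *= acc if Z[j][k] == ztil else 1.0 - acc
                w[c] = p
            Y[i] = CATS[rng.choice(4, p=w / w.sum())]
        counts = np.zeros(4)
        for y in Y:                               # latent category counts sum(V_i)
            counts[CATS.index(y)] += 1
        theta = rng.dirichlet(np.asarray(alpha) + counts)   # (ii) update theta
        if g >= burn:
            draws.append(theta)
    return np.mean(draws, axis=0)                 # posterior mean estimate of theta

# illustration: 5 master pools of size 4, all testing negative for both diseases
pools = [list(range(4 * j, 4 * j + 4)) for j in range(5)]
Z = [(0, 0)] * 5
est = gibbs(Z, pools, Se=(0.95, 0.95), Sp=(0.99, 0.99), alpha=(1, 1, 1, 1), n_ind=20)
print(est)
```

Because every full conditional is available in closed form, no tuning is required; this is the practical payoff of the data-augmentation construction.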

Maximum a posteriori (MAP) estimation
It is well known that group testing is most beneficial when the probability of disease is low. Otherwise, most initially formed master pools would test positively and the motivation for pooling specimens would quickly diminish. In our multiplex setting, this means θ00, the probability an individual is disease-free, may be close to unity, and the population-level parameters θ10, θ01, θ11 and the marginal probabilities θ1+ = θ10 + θ11 and θ+1 = θ01 + θ11 may all be close to zero, depending on the diseases under investigation. Because of these constraints on the parameter space, the marginal posterior distributions from π(θ | Z, δ) may be heavily skewed, and summarizing the posterior distribution with a mean (or median) estimate may be unwise.
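To see why skewness matters, note that the marginal distributions of a Dirichlet are Beta. For a rare-disease parameter, the posterior is right-skewed, so its mean sits noticeably above its mode; the counts below are hypothetical:

```python
def beta_mean_mode(a, b):
    """Mean and mode of a Beta(a, b) distribution (the mode formula
    requires a > 1 and b > 1).  For a right-skewed posterior, as with
    a rare joint infection probability, the mean exceeds the mode."""
    return a / (a + b), (a - 1.0) / (a + b - 2.0)

# hypothetical posterior for a rare joint infection probability
mean, mode = beta_mean_mode(2.0, 100.0)
print(round(mean, 4), round(mode, 4))   # 0.0196 0.01
```

Here the mean is nearly twice the mode, which is the distortion that motivates reporting a posterior mode in the next subsection.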
In such instances, reporting a posterior mode may be more sensible. We therefore describe an approach to find the MAP estimate, that is, the mode of π(θ | Z, δ). Using the same missing data conceptualization as in Section 3.1, we use the expectation-maximization (EM) algorithm to maximize π(θ | Z, δ). This algorithm involves evaluating Q(θ, θ^(d)), the conditional expectation of the logarithm of the augmented posterior π(θ, Ỹ | Z) ∝ pr(Ỹ | θ) pr(Z | Ỹ, δ)π(θ) given the observed data and current parameter value θ^(d), and then maximizing it as a function of θ. One then iterates between these two steps until convergence. This can be accomplished by using the steps described below.

1. Initialize θ^(0) and set d = 0.
2. (E-Step): Use the Gibbs sampler in Section 3.1 to estimate E[(Ṽ(00), Ṽ(10), Ṽ(01), Ṽ(11))′ | Z, δ; θ^(d)].
3. (M-Step): Calculate θ^(d+1) using the solution in Appendix B in the Supporting Information; that is, this maximizer depends on E[(Ṽ(00), Ṽ(10), Ṽ(01), Ṽ(11))′ | Z, δ; θ^(d)] and exists in closed form.
4. Set d = d + 1, and repeat steps 2-4 until the maximum absolute difference in θ^(d+1) − θ^(d) is less than ε, where ε is small.

We again make brief remarks. First, because Step 2 uses a Gibbs sampler to estimate the conditional expectation E[(Ṽ(00), Ṽ(10), Ṽ(01), Ṽ(11))′ | Z, δ; θ^(d)], calculating the MAP estimate of θ takes longer than simply summarizing π(θ | Z, δ) with the posterior mean from Section 3.1. However, because the M-Step solution exists in closed form, the additional time required is generally not prohibitive. We again recommend using a large number of Gibbs iterates in Step 2 (e.g., G = 10,000, etc.)
after a sufficient burn-in (see Sections 5 and 6). Second, when a uniform prior distribution π(θ) is used, that is, setting α00 = α10 = α01 = α11 = 1, the MAP estimate of θ coincides with the maximum likelihood estimate (MLE) of θ, a potential preference for users wanting to report frequentist-based point estimates. Finally, when the assay accuracy probabilities in δ are unknown, our simulation results in Section 5 and Appendix D in the Supporting Information demonstrate that MAP estimates of θ and δ can be more accurate than other posterior estimates. We now generalize our methodology to allow for this situation.
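Under the standard Dirichlet-multinomial conjugacy, the M-step maximizer is the mode of a Dirichlet whose parameters add the prior hyperparameters to the expected latent category counts. The sketch below shows one plausible form of this update, not the exact Appendix B solution, and the counts are hypothetical:

```python
def m_step(expected_counts, alpha):
    """Closed-form M-step sketch for the MAP/EM algorithm: maximize the
    expected log augmented posterior by taking the Dirichlet mode formed
    from the prior hyperparameters alpha plus the expected latent counts
    E[V(ab) | Z; theta^(d)].  Valid when each alpha entry is >= 1."""
    num = [n + a - 1.0 for n, a in zip(expected_counts, alpha)]
    denom = sum(num)
    return [x / denom for x in num]

# A flat prior (all alpha = 1) reduces the M-step to sample proportions,
# which is why MAP coincides with maximum likelihood in that case.
print(m_step([4750.0, 100.0, 100.0, 50.0], alpha=(1, 1, 1, 1)))
# [0.95, 0.02, 0.02, 0.01]
```

In the full algorithm the expected counts come from the Gibbs-based E-step rather than being supplied directly.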

ESTIMATION WITH UNKNOWN ASSAY ACCURACY PROBABILITIES
Our goal now is to estimate the population-level infection probabilities in θ and the assay accuracy probabilities in δ simultaneously. As we demonstrate, this can be accomplished by taking our algorithms in Section 3 and adding appropriate steps for the conditional distribution and MAP solution of δ. Such an extension is practically useful in incorporating the uncertainty in δ. For example, although manufacturers will typically report values of sensitivity and specificity for multiplex assays (for each disease) in their product literature, these values are usually obtained from small pilot studies involving specimens whose true disease statuses are known in advance. The practice of ostensibly regarding these values as "correct" can lead to two potential problems. First, doing so ignores the sampling error incurred from having to estimate these values in small feasibility experiments. Second, the population under investigation (e.g., high-risk females in Iowa) may differ substantially from the one which was used to validate the multiplex assay initially. Extending the approach in McMahan et al. (2017) for single diseases, let S_{e:(l)k} and S_{p:(l)k} denote the sensitivity and specificity of the lth assay for the kth disease, for k = 1, 2 and l = 1, 2, …, L, and let A(l) = {j : the lth assay tests pool j} denote the index set of the pools tested by the lth assay, for l = 1, 2, …, L. Our use of the set A(l) simply allows us to reparameterize the exposition in Section 2.
For example, at the SHL in Iowa, the AC2A is used for all specimens tested in pools and individually. If this assay performs the same when testing pools and individuals, then L = 1 and the parameter vector is δ = (S_{e:(1)1}, S_{e:(1)2}, S_{p:(1)1}, S_{p:(1)2})′. On the other hand, if the performance of the AC2A depends on whether pools or individuals are tested, one could envision one set of assay accuracy probabilities for pools (l = 1) and a separate set for individuals (l = 2). This situation would correspond to L = 2, and the parameter vector would become δ = (S_{e:(1)1}, S_{e:(1)2}, S_{p:(1)1}, S_{p:(1)2}, S_{e:(2)1}, S_{e:(2)2}, S_{p:(2)1}, S_{p:(2)2})′.
Under our reparameterization, the distribution of the observed data Z from any group testing protocol in Equation (3) can be written as

pr(Z | θ, δ) = ∑_{Ỹ ∈ {0,1}^{2N}} pr(Ỹ | θ) ∏_{l=1}^{L} ∏_{j ∈ A(l)} ∏_{k=1}^{2} [S_{e:(l)k}^{Z_jk} (1 − S_{e:(l)k})^{1−Z_jk}]^{Z̃_jk} [(1 − S_{p:(l)k})^{Z_jk} S_{p:(l)k}^{1−Z_jk}]^{1−Z̃_jk},

where now both θ and the assay accuracy probabilities in δ are regarded as unknown. To incorporate the uncertainty in δ, we use beta prior distributions for each sensitivity and specificity parameter. Because each of these parameters has a beta full conditional distribution and all other conditionals remain unchanged, one can take the posterior sampling algorithm described in Section 3.1 and insert one additional step. Similarly, to calculate the MAP estimates of θ and δ, the EM algorithm in Section 3.2 can be easily amended. The conditional expectation of the logarithm of the augmented posterior given the observed data and current parameter values, now written Q(θ, δ; θ^(d), δ^(d)), also has a closed-form solution in the M-step. The complete algorithms are given in Appendix C in the Supporting Information.
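Conditional on the latent true pool statuses, each sensitivity and specificity has a conjugate beta full conditional, which is the one additional Gibbs step referred to above. A sketch for a single assay and a single disease follows; the test results, latent statuses, and hyperparameters are all illustrative:

```python
def accuracy_full_conditionals(Z, Ztil, a_se, b_se, a_sp, b_sp):
    """Beta full-conditional parameters for one assay and one disease,
    given the current latent true pool statuses Ztil over the pools that
    assay tested: Se | . ~ beta(a_se + TP, b_se + FN) and
    Sp | . ~ beta(a_sp + TN, b_sp + FP), where TP/FN/TN/FP count
    agreements and disagreements between test results and true statuses."""
    TP = sum(1 for z, zt in zip(Z, Ztil) if zt == 1 and z == 1)
    FN = sum(1 for z, zt in zip(Z, Ztil) if zt == 1 and z == 0)
    TN = sum(1 for z, zt in zip(Z, Ztil) if zt == 0 and z == 0)
    FP = sum(1 for z, zt in zip(Z, Ztil) if zt == 0 and z == 1)
    return (a_se + TP, b_se + FN), (a_sp + TN, b_sp + FP)

# 6 pools: 3 truly positive (one missed), 3 truly negative (one false positive)
se_par, sp_par = accuracy_full_conditionals(
    Z=[1, 1, 0, 0, 0, 1], Ztil=[1, 1, 1, 0, 0, 0],
    a_se=1, b_se=1, a_sp=1, b_sp=1)
print(se_par, sp_par)   # (3, 2) (3, 2)
```

Within the full sampler, one such draw per assay and disease is inserted between the latent-status update and the θ update at each iteration.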

SIMULATION EVIDENCE
We performed a comprehensive simulation study to evaluate the performance of our estimation methods. This study included examining three hierarchical group testing protocols (H2, H3, and H4) from Hou et al. (2017) and one AT protocol from Hou et al. (2020). We now briefly describe these protocols.

Multiplex protocols and simulation description
A hierarchical group testing protocol is carried out by first testing a nonoverlapping master pool of individual specimens. If this pool tests negatively, then each individual in the pool is declared to be negative. If this pool tests positively, the master pool is divided into nonoverlapping subpools of specimens. Two-stage Dorfman protocols (H2) revert to individual testing in the second stage, while higher-stage protocols use smaller-sized subpools during intermediate stages of testing before individual testing is used in the final stage. In AT, individual specimens are arranged in a square array configuration forming row and column master pools which are tested in the first stage. Individuals in positive row/column intersections are tested in the second stage along with other individuals whose statuses are potentially unknown because of testing errors (Hou et al., 2020). The overarching message from Hou et al. (2017, 2020) is that higher-stage hierarchical protocols (H3, H4) and AT can substantially reduce the number of tests needed when compared to H2, especially when the probability of at least one disease, 1 − θ00, is small. This prompts an obvious question: when compared to H2, how do H3, H4, and AT perform in terms of estimation? One might hypothesize that because H3, H4, and AT generally require fewer tests, fewer observations would be available and thus the estimation performance for these protocols would be degraded. On the other hand, it could be that H3 and H4 implement more tests "where it counts," that is, on individuals who are more likely to be positive, and that AT uses master pools (rows and columns) that consist of overlapping individuals. In the presence of testing errors, more replicate tests on potentially positive individuals may actually improve estimation, despite H3, H4, and AT requiring fewer tests overall.
We simulated the execution of each protocol (H2, H3, H4, and AT) using two configurations of the disease probabilities, θ = (0.95, 0.02, 0.02, 0.01)′ and θ = (0.990, 0.004, 0.004, 0.002)′; we henceforth call these Configurations I and II, respectively. The first configuration was chosen to represent the overall prevalence of CT/NG in higher-risk populations, while the second configuration allows for two rarer diseases, each with a marginal probability of 0.004 + 0.002 = 0.006. For each of H2, H3, H4, and AT, Table 1 lists the specific protocol which minimizes the expected number of tests when S_{e:(1)k} = 0.95 and S_{p:(1)k} = 0.99, for k = 1, 2. For example, the entry "5:1" for H2 under Configuration I means that master pools of size 5 reduce the number of tests as much as possible on average among all two-stage hierarchical protocols. Similarly, the entry "9:3:1" for H3 means that master pools of size 9 are used in the first stage, three subpools of size 3 are used in the second stage, and individual testing is used in the third. We determined these pool sizes using the optimization methods described in Hou et al. (2017, 2020).
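As an illustration of how such a design search can be carried out, the sketch below computes the expected number of tests per individual for a two-stage protocol that resolves any pool testing positive for at least one disease, then searches over master pool sizes. This is a simplified calculation, not the optimization software of Hou et al. (2017, 2020), but under Configuration I it also selects pools of size 5:

```python
def dorfman_tests_two_diseases(n, theta, Se, Sp):
    """Expected tests per individual for two-stage Dorfman (H2) with a
    duplex assay: one pooled test, plus n individual tests whenever the
    pool tests positive for at least one disease.  theta = (t00, t10,
    t01, t11) is the joint status distribution for one individual;
    dilution effects are ignored, matching the article's assumption."""
    t00, t10, t01, t11 = theta
    # true pool status probabilities via inclusion-exclusion
    p00 = t00 ** n
    p10 = (t00 + t10) ** n - p00          # pool positive for disease 1 only
    p01 = (t00 + t01) ** n - p00          # pool positive for disease 2 only
    p11 = 1.0 - p00 - p10 - p01           # pool positive for both diseases
    q = lambda ztil, k: Sp[k] if ztil == 0 else 1.0 - Se[k]   # pr(test negative)
    both_neg = (p00 * q(0, 0) * q(0, 1) + p10 * q(1, 0) * q(0, 1)
                + p01 * q(0, 0) * q(1, 1) + p11 * q(1, 0) * q(1, 1))
    return 1.0 / n + (1.0 - both_neg)

theta = (0.95, 0.02, 0.02, 0.01)          # Configuration I
Se, Sp = (0.95, 0.95), (0.99, 0.99)
best = min(range(2, 11), key=lambda n: dorfman_tests_two_diseases(n, theta, Se, Sp))
print(best)   # 5, matching the "5:1" H2 entry in Table 1
```

The same expected-value logic extends to three- and four-stage protocols, although the search space of subpool configurations grows quickly.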
For each protocol and disease probability configuration, we simulated the true disease statuses of N = 5000 individuals and randomly assigned these individuals to appropriately sized master pools. We then simulated the testing outcomes on

TABLE 1  Testing protocols in Section 5. Hierarchical protocols H2, H3, and H4 use two, three, and four stages, respectively (Hou et al., 2017). Array testing (AT) uses square arrays (Hou et al., 2020). The protocols listed below minimize the expected number of tests per individual specimen. "Configuration I" uses θ = (0.95, 0.02, 0.02, 0.01)′ and "Configuration II" uses θ = (0.990, 0.004, 0.004, 0.002)′.

pools and individuals (allowing for potential testing error) to produce data that would be available for estimation purposes. This entire process was repeated B = 500 times, providing us with 500 independent data sets for each protocol under Configurations I and II. Note that in some cases smaller-sized master pools were formed when there were remainder individuals. For example, 555 master pools of size 9 were formed for the "9:3:1" H3 protocol listed in Table 1; the remaining five individuals were tested in a master pool of size 5 and resolved using H2. This practice of using H2 for remainder pools was applied uniformly in all cases to ensure a fair comparison among the protocols. In all simulations, we used G = 10,000 Gibbs iterates after discarding the first 2000 for burn-in purposes, and we used θ^(0) = (0.92, 0.05, 0.02, 0.01)′ and δ^(0) = (0.96, 0.96, 0.98, 0.98)′ as initial values for the disease probabilities and assay accuracy probabilities, respectively. All posterior measures of variability were calculated by thinning, with every fifth Gibbs iterate selected. For the simulations reported in this section, independent investigations on our part revealed that perturbing these selections (i.e., using different starting values, using different numbers of Gibbs iterates, thinning differently) did not have a large impact on the results.

Simulation results
Tables 2 and 3 show the estimation results for both disease probability configurations when the assay accuracy probabilities in δ = (S_{e:(1)1}, S_{e:(1)2}, S_{p:(1)1}, S_{p:(1)2})′ are assumed to be unknown. Table 2 shows the results for estimating θ, while Table 3 presents those for estimating δ. When δ is treated as known (Section 3), we provide a table of estimation results for θ in Appendix D. Unless otherwise stated, we used flat priors for both θ and δ; that is, θ ∼ Dirichlet(1_4), S_{e:(1)k} ∼ beta(1, 1), and S_{p:(1)k} ∼ beta(1, 1), for k = 1, 2. We selected these noninformative priors for two reasons. First, these distributions give us the most challenging case for estimation because no useful prior information is injected into the model. Second, our use of a flat prior for θ produces MAP estimates which should largely coincide with the MLE of θ. When the Dirichlet(1_4) distribution is specified a priori, the MAP estimate and the MLE of θ will be identical when δ is known and should be approximately equal otherwise.
In both Tables 2 and 3, we present the sample mean ("Est") of the posterior mean (Mean) and MAP estimates calculated from B = 500 independent data sets. We also report two measures of posterior variability: "SD," which is the sample standard deviation of the 500 posterior estimates, and "SE," which is the posterior standard deviation of the Gibbs iterates retained for one data set and then averaged over the 500 data sets. To compare the four protocols (H2, H3, H4, and AT) in terms of classification, we also recorded the average and standard deviation of the number of tests needed to classify each of the 5000 individuals as positive/negative for each disease. These quantities are denoted in Table 2 by T̄ and s_T, respectively.
Table 2 reveals that the averaged mean and MAP estimates of θ are on target for both configurations of the disease probabilities across all four group testing protocols. An intriguing finding is that if one moves from Dorfman testing (H2) to one of the more complex protocols (H3, H4, or AT), the posterior variability gets no larger and may actually decrease slightly. This is interesting because H3, H4, and AT all require fewer tests to complete on average. For example, for Configuration I with θ = (0.95, 0.02, 0.02, 0.01)′, moving from H2 to H3 decreases the average number of tests by approximately 14.6% (2166.6 tests for H2; 1850.8 tests for H3), yet the posterior mean and MAP estimates for H3 are as good as or better than those for H2. Similar observations can be made for H4 and AT in terms of estimation performance, despite these protocols also offering a large reduction in the number of tests needed.
Moving to the assay accuracy probabilities in Table 3, one does observe a slight degradation in the performance of Dorfman testing (H2) when attempting to estimate the sensitivity parameters S_{e:(1)1} and S_{e:(1)2} using a posterior mean, especially for Configuration II with θ = (0.990, 0.004, 0.004, 0.002)′. This should not be surprising because, with so few positive individuals for either disease, posterior distributions are highly skewed and thus the mean may not be an ideal choice. However, Table 3 also shows that MAP estimation in this setting is much improved for H2, and that MAP estimation with H3, H4, or AT appears to recover S_{e:(1)1} and S_{e:(1)2} nearly perfectly on average. In addition, when compared to H2, the posterior distributions of S_{e:(1)1} and S_{e:(1)2} are less variable (smaller SD/SE) when using H3, H4, and AT under both disease probability configurations. Following the recommendations of an anonymous reviewer, we have performed additional simulation studies which use a smaller sample size (N = 1000) and a larger number of assays (L = 2), and we have investigated the performance of our methods when the assay accuracy probabilities in δ are substantially lower. In these more challenging settings for estimation, the use of informative prior distributions for θ and/or δ can be useful. Another reviewer astutely noted that when assays are perfect, that is, when S_{e:(1)1} = S_{e:(1)2} = S_{p:(1)1} = S_{p:(1)2} = 1, all group testing protocols will produce the same estimates of θ because all true individual disease statuses are recoverable. Because retesting positive pools may confer limited utility in this setting (Chen & Swallow, 1990), this has motivated us to evaluate the use of master pool testing (MPT) as a means to estimate θ in the presence of imperfect testing; see Tu et al. (1995) and McMahan et al. (2017). All additional simulation studies are summarized in Appendix D.

TABLE 2  Simulation results for the posterior mean (Mean) and the maximum a posteriori (MAP) estimates of θ = (θ00, θ10, θ01, θ11)′ when assay accuracy probabilities are unknown. Estimates ("Est") are averages over B = 500 Monte Carlo data sets, "SD" is the sample standard deviation of the 500 estimates, and "SE" is the estimated posterior standard deviation as described in Section 5.2. Flat priors have been used for all parameters; that is, θ ∼ Dirichlet(1_4), S_{e:(1)k} ∼ beta(1, 1), and S_{p:(1)k} ∼ beta(1, 1), for k = 1, 2. The mean T̄ and standard deviation s_T of the number of tests are also shown; the percentage reduction in T̄ is relative to H2 (Tebbs et al., 2013; Warasi et al., 2016).

6 CHLAMYDIA AND GONORRHEA APPLICATION

We used values of Se:(1)1 and Sp:(1)1 reported in the AC2A package insert for CT and, similarly, Se:(1)2 and Sp:(1)2 for NG (see www.hologic.com). These values were used only to determine the protocols in Table 4 and to simulate all test responses in our feasibility study. To average over the effect of Monte Carlo simulation error, we created 500 sets of master pools for each protocol with the configurations in Table 4. Within each specimen type (urine/swab), random assignment of individuals to pools was used throughout.
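The data-generating step can be sketched as follows (pool sizes, accuracy values, and function names here are illustrative, not the Iowa SHL configurations): individuals are randomly assigned to master pools, and each pool receives one multiplex outcome per disease, testing positive with probability Se if the pool truly contains a positive individual and with probability 1 - Sp otherwise.

```python
import random

def simulate_master_pools(statuses, pool_size, se, sp, rng):
    """Randomly assign individuals to pools and simulate one two-disease multiplex assay.

    statuses: list of (d1, d2) true 0/1 statuses per individual
    se, sp:   (Se1, Se2) and (Sp1, Sp2), one entry per disease
    Returns a list of (pool_members, (y1, y2)) observed pool outcomes.
    """
    order = list(range(len(statuses)))
    rng.shuffle(order)  # random assignment of individuals to pools
    pools = [order[i:i + pool_size] for i in range(0, len(order), pool_size)]
    results = []
    for members in pools:
        outcome = []
        for k in range(2):  # one outcome per disease from the multiplex assay
            truly_positive = any(statuses[i][k] for i in members)
            p_pos = se[k] if truly_positive else 1.0 - sp[k]
            outcome.append(1 if rng.random() < p_pos else 0)
        results.append((members, tuple(outcome)))
    return results

rng = random.Random(2023)
# 12 hypothetical individuals; individual 0 is positive for disease 1 only
statuses = [(1, 0)] + [(0, 0)] * 11
pools = simulate_master_pools(statuses, pool_size=4, se=(1.0, 1.0), sp=(1.0, 1.0), rng=rng)
# With perfect assays, a pool is positive for disease k iff some member is positive
```

Repeating this over many random assignments is what averages out the Monte Carlo simulation error described above.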
For each set of master pools, we used simulation to create the test outcomes one would observe had H2, H3, and AT been implemented at the Iowa SHL, and we calculated the posterior mean and MAP estimates of p and δ = (Se:(1)1, Se:(1)2, Sp:(1)1, Sp:(1)2)' under the assumption that δ is unknown. The disease probabilities in p in this application are given in Table 4. As in Section 5, we used flat priors for all parameters; that is, p ~ Dirichlet(1, 1, 1, 1), Se:(1)k ~ beta(1, 1), and Sp:(1)k ~ beta(1, 1), for k = 1, 2. We also continued to use the same starting values p(0) and δ(0) as in Section 5 and the same selections for the number of Gibbs iterates and thinning. Trace plots were used to monitor convergence and to check posterior mixing. For one data set (out of 500) in the urine stratum, which includes the results for H2, H3, and AT, the simulation took 70 s to determine the posterior mean estimates and 271 s for the MAP estimates. For the larger swab stratum, the posterior mean estimates took 167 s.

We conclude with three remarks, all of which present avenues for future work. First, it is always of interest to think about optimal designs in group testing, not only for case identification but also for estimation. We have used the protocols in Sections 5 and 6 on the basis of the former because these designs represent those which laboratories can implement to provide diagnoses to all individuals in the most efficient way possible. At the same time, one might also select those designs which provide the best estimation performance. As noted in Section 1, this has been investigated on frequentist grounds when resolving positive pools is not performed. More recently, Warasi et al. (2022) have extended this work to Dorfman testing (H2) when detecting multiple diseases in animal populations. Second, instead of specifying multiple sets of prior distributions for differently sized pools, perhaps to fend off fears of dilution (Warasi et al., 2017), it should be possible to specify a secondary model which describes how pools might experience the dilution of positive specimens. Modeling assay sensitivities directly, for each disease separately or jointly, could provide a way to relax the assumptions in Section 2, and it may provide a more parsimonious approach to estimation. Modifications of our EM algorithm to determine posterior maximizers, such as variational EM approaches, could be useful in the event of increased computational complexity. Finally, an anonymous reviewer has remarked that including covariates which are predictive of disease (e.g., number of sexual partners, race/ethnicity, etc.) may sharpen probability estimates at the individual level. A number of authors have studied the regression analysis of group testing data for a single disease (see Section 1). Generalizing this work to accommodate outcomes from multiplex group testing is an excellent topic for future research.

ACKNOWLEDGMENTS
We are grateful to the Editor, Associate Editor, and two reviewers for their valuable comments and suggestions on earlier versions of this article. This research was funded by Grant R01 AI121351 from the National Institutes of Health.

CONFLICT OF INTEREST STATEMENT
The authors have declared no conflict of interest.

DATA AVAILABILITY STATEMENT
Simulated data in Section 5 can be generated with the R code provided in the Supporting Information. The Iowa CT/NG data, in the form of disease counts, are summarized in Appendix E in the Supporting Information.

OPEN RESEARCH BADGES
This article has earned an Open Data badge for making publicly available the digitally shareable data necessary to reproduce the reported results; the data are available in the Supporting Information. This article has also earned a Reproducible Research badge for making publicly available the code necessary to reproduce the reported results. The reported results were reproduced only partially due to data confidentiality issues.

p00 = proportion of individuals negative for both CT and NG
p10 = proportion of individuals positive for CT but negative for NG
p01 = proportion of individuals negative for CT but positive for NG
p11 = proportion of individuals positive for both CT and NG.