On optimal pooling designs to identify rare variants through massive resequencing

The advent of next-generation sequencing technologies has facilitated the detection of rare variants. Despite the significant cost reduction, sequencing cost is still high for large-scale studies. In this article, we examine DNA pooling as a cost-effective strategy for rare variant detection. We consider the optimal number of individuals in a DNA pool to detect an allele with a specific minor allele frequency (MAF) under a given coverage depth and detection threshold. We found that the optimal number of individuals in a pool is essentially invariant to the MAF at the same coverage depth and detection threshold. In addition, when the individual contributions to each pool are equal, the total number of individuals across different pools required in an optimal design to detect a variant with a desired power is similar at different coverage depths. When the contributions are more variable, more individuals tend to be needed at higher coverage depths. Our study provides general guidelines on using DNA pooling for more cost-effective identification of rare variants. Genet. Epidemiol. 35:139-147, 2011. © 2011 Wiley-Liss, Inc.


INTRODUCTION
Genome-wide association studies (GWAS) have enjoyed great success in the past several years in localizing disease-susceptibility loci for many common traits and diseases. The current GWAS paradigm was partially motivated by the common disease common variant (CDCV) assumption, which postulates that a large proportion of the heritability of common diseases is due to common variants. GWAS were made possible both by technological advances that can type hundreds of thousands of single nucleotide polymorphisms (SNPs) at affordable cost and by the strong dependency, called linkage disequilibrium (LD), among SNPs at the population level. The presence of LD allows researchers to capture the genetic variations in a person's genome by a set of tagSNPs, selected based on the LD patterns, and thereby to detect associations between diseases and disease-causing loci indirectly. One key to success in GWAS lies in how strong the correlations between tagSNPs and disease-causing loci are. From this CDCV perspective, GWAS have been successful in uncovering many common SNPs associated with common diseases including type I/II diabetes, rheumatoid arthritis, Crohn's disease, and coronary heart disease. However, as noted in Hardy and Singleton [2009], the combination of many identified common variants only explains a small proportion of the genetic component of the common diseases. One possible explanation of this limitation is that GWAS have focused on variants that are common (minor allele frequencies >5%), whereas many disease-causing variants are rare and therefore difficult to tag with common variants.
Recently, researchers have explored an alternative hypothesis, the common disease rare variant assumption, which states that diseases are caused by combinations of multiple rare genetic variants. Gorlov et al. [2008] found that the minor allele frequency (MAF) distribution of possibly and probably damaging SNPs is shifted toward rare SNPs compared with the MAF distribution of benign and synonymous SNPs, based on prediction results obtained from PolyPhen. Li and Leal [2008] pointed out that multiple rare variants have been implicated in diseases such as obesity and schizophrenia. The low frequencies of rare variants lead to weak correlations with tagSNPs. As a result, GWAS are underpowered for detecting rare variants, and different approaches are required for their detection. At present, sequencing of candidate genes or entire genomes seems to be a good strategy to identify rare variants, as claimed in Li and Leal [2009].
Next-generation sequencing (NGS) or massively parallel sequencing technologies (454 FLX, Illumina/Solexa Genome Analyzer, ABI SOLiD; see Mardis [2008] for a review) have revolutionized biological research and deepened our knowledge of the biology underlying diseases. New sequencing technologies have enabled the processing of millions of short sequence reads (35-250 bp, depending on the platform) at a time. Only one or two instrument runs may be required to complete a sequencing experiment. This technological breakthrough has given rise to an international research consortium, the 1000 Genomes Project (1000GP), in which scientists will sequence the genomes of at least 1,000 people from different ethnic groups.
NGS technologies have opened up great opportunities for discovering more variants in the human genome. Whole exome sequencing is emerging as an effective way of capturing a patient's functional rare variants. However, whole genome or exome sequencing cost is still high, although researchers are endeavoring to bring the cost of sequencing a whole genome down to as low as $1,000 [Service, 2006]. In addition, thousands of genomes need to be sequenced in order to find rare SNPs with MAFs ≤1%. Consequently, a cost-effective procedure is needed to employ NGS methods most efficiently for identifying rare variants. The practical limits of cost and labor can be mitigated by pooling the genomic DNAs from a relatively small number of individuals. DNA pooling has been used to reduce the cost of large-scale association studies based on high-throughput genotyping technologies [for reviews see Norton et al., 2004; Sham et al., 2002]. For GWAS, DNA pooling has been considered a cost-efficient initial screening tool to detect candidate regions in a two-stage design. In the first stage, a case-control association test for each marker is performed based on the allele frequencies estimated from the case and control pools. In the second stage, the candidate markers selected in the first stage are re-evaluated by individual genotyping [Zhao and Wang, 2009; Zou and Zhao, 2004; Zuo et al., 2006]. As suggested by Out et al. [2009], the use of a pooled DNA sample for targeted NGS can also be an attractive cost-effective method to identify rare variants in candidate genes. In their paper, a Poisson model was employed to calculate the mis-detection probability and, similarly, the power to detect a variant. In the calculation of the mis-detection probability, they did not take into account the dependency among incorrect bases.
Moreover, their proposed statistical power represents the probability of identifying a variant that is present in a given pooled sample, so the probability that the variant is included in the sample at all is not reflected in the power calculation. However, it is very important to reflect this sampling variation in the power calculation for pooling designs. In this paper, considering both issues, we investigate the detection probability of a variant in DNA pooling for NGS, and the resulting optimal pooling designs. This paper is organized as follows. In the next section, we describe how to estimate the detection probability of a variant with a MAF p at a coverage depth C in a DNA pool of k individuals. Due to technical variations in DNA pooling and exon capturing, the contribution of each individual may not be equal; therefore, we discuss how to evaluate the average detection probability allowing for individual variations in the pooled DNA sample. We illustrate these points with a real sequencing data set in the subsequent section. We conclude the paper with a discussion; some technical details are given in Appendix A.

METHODS
Suppose that a pooled DNA sample j is constituted by combining DNA from k individuals. Let w^j = (w_1^j, ..., w_k^j) denote the proportional contributions of the k individuals in the j-th pooled DNA sample to be analyzed by a NGS platform. Therefore, Σ_{i=1}^k w_i^j = 1 and each w_i^j ≥ 0. We assume that w^j is invariant across genome positions. As detailed later, by comparing genotyping results with sequencing results from an empirical study, the contribution of each individual to the resulting base reads can be estimated as shown in Appendix A.2, and empirical data suggest that the variations across individuals can be substantial. Moreover, in a practical pooling study, those contributions are often unknown. The objective of this paper is to assess how likely a variant with a MAF p can be detected from a pool of k individuals when the sequencing coverage depth at the position is C. As shown later in this section, w^j is a key component in the calculation of the detection probability of a variant. In the following discussion, we call a variant detected if at least T sequence reads carry this variant.
First, we assess the detection probability of a rare variant in the simplest setup, in which the contribution is assumed to be equal across the individuals in a pooled sample. Suppose that there is a total of 2k chromosomes among the k individuals, and let N denote the number of chromosomes among them carrying the rare variant. Then the detection probability of a variant with a MAF p can be calculated as

P(detect | p) = Σ_{n=0}^{2k} [ Σ_{x=T}^{C} (C choose x) q_n^x (1 − q_n)^{C−x} ] (2k choose n) p^n (1 − p)^{2k−n},   (1)

where n is the number of chromosomes carrying the variant in a sample, T is the threshold to call the presence of the minor allele in the sample, C is the coverage depth, and q_n = n/(2k) for n = 0, ..., 2k.

Procedure 1: Sample-Specific Detection Probability
[initialization] Specify/define the following parameters:
  minor allele frequency p (0 < p < 1)
  sequencing coverage depth C (C > 0)
  threshold for the detection T (T > 0)
  estimates of the individual contributions ŵ^j = (ŵ_1^j, ..., ŵ_k^j) from genotype and sequencing data
[main] Let X_i denote the number of chromosomes with a variant carried by individual i (i = 1, ..., k), so that X_i ∈ {0, 1, 2}. Then there are 3^k distinct configurations of X^j = (X_1, ..., X_k). For each configuration, the proportion of variant-carrying reads is q = Σ_{i=1}^k ŵ_i^j X_i / 2; the detection probability is accumulated over configurations by weighting the conditional probability that at least T of the C reads carry the variant (given q) by the probability of the configuration under the MAF p.

Generally, for each sample j, the individual contributions may not be equal, that is, the w^j may differ. From this perspective, it is desirable to evaluate the detection probability under a specific distribution for w when it varies. The randomness of the individual contributions in a pooled sample can be represented by specifying a prior distribution for w. A natural choice for this distribution of w is the Dirichlet distribution with hyperparameters α = (α_1, ..., α_k), where α_i > 0 for i = 1, ..., k. Due to exchangeability among the sampled individuals, we may assume α_1 = ... = α_k = α. For the hyperparameter α, we may either specify a hyperprior distribution or estimate α empirically. For a specific α, we can use Procedure 1 to estimate the detection probability.
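Equation (1) can be evaluated directly as a double binomial sum. The following is a minimal sketch under the equal-contribution assumption; the function names (`binom_pmf`, `detect_prob`) are ours, not the authors'.

```python
import math

def binom_pmf(n, x, q):
    """P(Binomial(n, q) = x)."""
    return math.comb(n, x) * q**x * (1.0 - q)**(n - x)

def detect_prob(p, k, C, T):
    """Equation (1): probability that a variant with MAF p is carried by
    at least T of C reads from a pool of k equally contributing individuals."""
    total = 0.0
    for n in range(2 * k + 1):                     # n variant chromosomes among 2k
        q_n = n / (2.0 * k)                        # proportion of variant-carrying reads
        tail = sum(binom_pmf(C, x, q_n) for x in range(T, C + 1))
        total += tail * binom_pmf(2 * k, n, p)     # weight by P(N = n | p)
    return total
```

For instance, `detect_prob(0.01, 10, 30, 3)` gives the detection probability of a variant with MAF 0.01 in a pool of 10 individuals at 30x coverage with threshold 3.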
To get a sense of the α value in practice, we have gathered empirical data to estimate α. In Appendix A, we describe the empirical data and the estimation procedure. We call the resulting estimator α̂ the pseudo maximum-likelihood estimator (PMLE) or the pseudo method of moments estimator (PMME); see Appendix A.3 for more details. We illustrate how to compute the average detection probability in Procedure 2.

Procedure 2: Average Detection Probability
[initialization] Specify/define the following parameters:
  minor allele frequency p (0 < p < 1)
  sequencing coverage depth C (C > 0)
  threshold for the detection T (T > 0)
  estimates of the hyperparameters, α̂
[main] Let p_ave = 0. For i = 1 to M: draw w from the Dirichlet distribution with parameter α̂, compute the sample-specific detection probability for this w as in Procedure 1, and add it to p_ave. Return p_ave/M.

Up to this point, the estimation of the detection probability is based on the use of a single lane. However, if L lanes are used to analyze independent samples, then the detection probability can be computed as

P(detect | p, L) = 1 − [1 − P(detect | p)]^L,   (2)

where P(detect | p) is calculated by Equation (1) or Procedure 2.
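The single-lane-to-L-lane relationship, and the minimum number of lanes needed to reach a target power, follow directly from the independence of lanes. A small sketch (function names ours; the single-lane probability is taken as an input rather than recomputed here):

```python
import math

def multi_lane_power(p_single, L):
    """Equation (2): detection probability with L independent lanes,
    given the single-lane detection probability p_single."""
    return 1.0 - (1.0 - p_single) ** L

def lanes_needed(p_single, power=0.8):
    """Smallest L such that multi_lane_power(p_single, L) >= power."""
    if p_single <= 0.0:
        raise ValueError("single-lane detection probability must be positive")
    if p_single >= power:
        return 1
    return math.ceil(math.log(1.0 - power) / math.log(1.0 - p_single))
```

For example, a single-lane detection probability of 0.3 requires 5 lanes to reach 80% power, since 1 − 0.7^5 ≈ 0.832.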

EQUAL CONTRIBUTIONS
We calculate the detection probability for a given number of individuals in a pooled sample at a given coverage depth, threshold, and MAF. Therefore, the optimal number of individuals in a pooled sample can be determined in terms of the detection probability. In addition, we study the number of lanes required to reach a certain level of statistical power to identify a rare variant. Since our interest lies in the identification of rare variants, we choose 0.005, 0.01, and 0.025 for the MAFs in our analysis. We use several coverage depths C = 20, 30, 40, and 50 and a fixed threshold, T = 3. The choice of the threshold T is discussed in more detail in the Discussion and Appendix sections. As shown in Figure 1, the detection probability initially increases with more individuals in a pool but then decreases beyond a certain point. This phenomenon can be explained using Equation (1). We focus on rare variants in this manuscript, and for such variants only a small number of chromosomes among the 2k chromosomes tend to carry the variant in a given pooled sample, even when the pool size k increases. For example, consider a variant with a MAF equal to 0.01 and pool sizes k = 1, ..., 30. The probability that the number of chromosomes carrying the variant is at most 2 is above 97% for k = 1, ..., 30. Therefore, if pools have the rare variant, most of the pools will carry the variant on 1 or 2 chromosomes among the 2k chromosomes, and it is most likely that only one chromosome holds the variant in those pools. As a result, Equation (1) may be approximated by P(detect | N = 1) · P(N = 1 | p). As the pool size k increases, the sampling probability P(N = 1 | p) increases (due to the presence of more chromosomes), whereas the conditional detection probability P(detect | N = 1) decreases (due to the threshold set to declare the presence of a rare variant). These two factors counterbalance each other and lead to an optimal number of samples in a pool.
For example, if the pool size k increases from 3 to 30, the probability that only one chromosome carries the variant among the 2k chromosomes increases about eightfold, from 0.03 to 0.22. However, when the coverage is C = 20 and the threshold is T = 3, the detection probability conditional on having only one chromosome harboring the variant decreases much more dramatically, from 0.67 to 0.004, since the proportion of chromosomes in the pool carrying the variant drops from 1/6 to 1/60, making it much harder to detect the rare variant. It is also interesting to note that the optimal numbers of individuals in a pooled sample are somewhat invariant to the MAFs. However, significantly more lanes, and consequently more individuals, are required to detect a variant of a lower MAF at a given level of statistical power. Table I also shows that the total number of individuals is about the same across different options for the same MAF.
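The trade-off described above can be explored numerically by scanning pool sizes. A sketch under the equal-contribution assumption (the helper re-implements Equation (1); `optimal_pool_size` and `k_max` are our illustrative names and default, not the authors'):

```python
import math

def detect_prob(p, k, C, T):
    """Equation (1) under equal contributions (see Methods)."""
    def binom_pmf(n, x, q):
        return math.comb(n, x) * q**x * (1.0 - q)**(n - x)
    total = 0.0
    for n in range(2 * k + 1):
        q_n = n / (2.0 * k)
        tail = sum(binom_pmf(C, x, q_n) for x in range(T, C + 1))
        total += tail * binom_pmf(2 * k, n, p)
    return total

def optimal_pool_size(p, C, T, k_max=60):
    """Scan pool sizes 1..k_max and return (k*, detection probability at k*)."""
    probs = {k: detect_prob(p, k, C, T) for k in range(1, k_max + 1)}
    k_star = max(probs, key=probs.get)
    return k_star, probs[k_star]
```

Repeating the scan for p = 0.005, 0.01, and 0.025 at a fixed C and T illustrates the near-invariance of the optimal pool size to the MAF.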

UNEQUAL CONTRIBUTIONS
For the unequal contribution case, we first need to consider the distribution of the unknown contribution w^j for a given sample j. The estimation of w from sequencing and genotyping information is described in Appendix A.2. The proposed estimation approach for w was applied to a sample data set (for the data description, see Appendix A.1), and our estimate ŵ is (0.1380, 0.0836, 0.1142, 0.0188, 0.1805, 0.1364, 0.1617, 0.1667). It is apparent that individuals 2 and 4 contributed less to the pool.
Assuming that the w are drawn from a Dirichlet distribution, we first explore the effect of α on the optimal number of individuals in a pooled sample. We select a set of different values for the hyperparameter, α = 0.25, 0.5, 0.75, 1.2, and 5. The variability of each individual contribution decreases as the hyperparameter α of the Dirichlet distribution increases, so w should be generated more closely around the mean 1/k for larger α. Consequently, it can be seen in Figure 2 that as α increases, the corresponding optimal number of individuals gets smaller and closer to the one based on equal contributions. It can also be seen that the average detection probability increases with the magnitude of α. Next, we estimate the hyperparameter α based on ŵ from our empirical data. As described in Appendix A.3, we can estimate the hyperparameter α by the PMLE or the PMME; the estimates are 2.89 and 4.76, respectively. Using these two estimates of α, average detection probabilities are calculated for coverage depths C = 20, 30, 40, and 50 and MAFs p = 0.005, 0.01, and 0.025. As in the equal contribution case, we investigate how many individuals and/or lanes should be used to best identify a rare variant at a fixed coverage depth based on the given estimates. From Figure 3 and Table II, we find patterns similar to those for the equal contribution case. However, the results show that more individuals per lane and more lanes are required to obtain a given level of statistical power compared to the equal contribution case. In addition, the resulting detection probabilities decrease by 7-10% in comparison with those for the equal contribution case (Tables I and II).
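The average detection probability under a symmetric Dirichlet prior can be approximated by Monte Carlo in the spirit of Procedure 2: draw the contributions w, draw each individual's variant genotype, and average the conditional binomial tail probability. This is a sketch with our own function names and defaults (`M`, `seed`); Dirichlet draws are built from normalized gamma variates.

```python
import math
import random

def binom_tail(C, q, T):
    """P(Binomial(C, q) >= T)."""
    return sum(math.comb(C, x) * q**x * (1.0 - q)**(C - x) for x in range(T, C + 1))

def average_detect_prob(p, k, C, T, alpha, M=2000, seed=1):
    """Monte Carlo sketch of Procedure 2: average the sample-specific
    detection probability over w ~ Dirichlet(alpha, ..., alpha) and over
    the individuals' variant genotypes X_i ~ Binomial(2, p)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(M):
        g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        s = sum(g)
        w = [gi / s for gi in g]                        # one draw of the contributions
        x = [sum(rng.random() < p for _ in range(2)) for _ in range(k)]
        q = sum(wi * xi for wi, xi in zip(w, x)) / 2.0  # variant-read proportion
        acc += binom_tail(C, q, T)
    return acc / M
```

Running this with alpha set to the empirical estimates 2.89 or 4.76 mimics the unequal-contribution analysis; very large alpha recovers the equal-contribution behavior.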

DISCUSSION AND CONCLUSION
In this paper, we have considered the detection probability of a variant when a NGS platform is utilized to identify a rare variant through DNA pooling. Through the use of an empirical data set, we examined the number of lanes and individuals per lane needed to locate a rare variant with a given probability. In this examination, a number of interesting properties were uncovered. First, increasing the number of individuals raises the detection probability up to a certain point, after which the detection probability decreases with the number of individuals. Therefore, we can determine the optimal number of individuals in a single lane for a given MAF, coverage depth, and threshold. Second, the optimal number of individuals per lane is very close across MAFs, but many more lanes are needed for the identification of a rarer variant at a given level of detection probability. For a higher coverage depth C and MAF p, the optimal number of individuals increases.

[Table I note: Indv, the optimal number of individuals; Prob, detection probability; Lane, the minimum number of lanes required for 80% power; Total, the total number of individuals required for 80% power.]
As introduced at the beginning of this article, Out et al. [2009] also analyzed the detection of rare variants. There are a number of differences between their analysis and our approach. First, they defined the mis-detection probability by

P(mis-detect) = 1 − Σ_{i=0}^{T−1} e^{−λ} λ^i / i!,   (3)

where λ is the mis-sequencing rate, that is, λ = C · p, with C and p denoting the local coverage depth and the sequencing error rate, respectively. Due to sequencing errors, up to three distinct incorrect bases can appear at a position, and their counts are mutually dependent. Unlike the calculation of the mis-detection probability shown in Appendix A.4, Equation (3) cannot take this dependency into account. Second, they focused on the identification of a variant present in a given pooled sample in their power analysis. However, when we collect samples for DNA pooling, we cannot guarantee that those samples include a specific variant. In this sense, it is crucial to consider the sampling variation in the power calculation, as can be seen in Equation (1). Last but not least, the power calculation in our work can take into account the variations of individual contributions in a pooled sample by making use of results from microarray-based genotyping and NGS DNA sequencing, whereas their power calculation is based on the assumption that individual contributions are equal. In our analysis, a very simple model is employed without taking into account variations such as sequencing errors. Sequencing errors were estimated to be between 1 and 3% [Illumina, 2009; Richter et al., 2008], and error rates are currently expected to be between 0.5 and 1% owing to advances in sequencing technologies. This level of sequencing error has very little effect on the detection probability results discussed above.
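For reference, the Poisson-style mis-detection probability of Equation (3) is straightforward to compute. This sketch reflects our reading of that calculation (function name ours); as noted above, it ignores the dependency among the three possible incorrect bases, which the multinomial calculation in Appendix A.4 accounts for.

```python
import math

def poisson_misdetect(C, err_rate, T):
    """Poisson-model mis-detection probability in the spirit of Equation (3):
    P(at least T error reads) with lambda = C * err_rate, ignoring the
    dependency among the three possible incorrect bases."""
    lam = C * err_rate
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(T))
```

For example, at C = 20 and a 1% error rate with T = 3, lambda = 0.2 and the mis-detection probability is roughly 0.1%.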
In this article, we have considered a threshold of 3 for the detection of a rare variant. This is a somewhat conservative threshold, as the probability that a nonexistent variant is detected three times or more at the coverage depths discussed is very small at an overall sequencing error rate of 0.5-1%, if the errors occur independently of each other. We choose to err on the conservative side because of the potential non-independence of sequencing errors and the large number of bases investigated. At an overall sequencing error rate of 1% and coverage depths of C = 20, 30, 40, and 50, assuming a base has an equal chance of being mis-sequenced to each of the three other bases, the chance that an incorrect base is observed two (three, or four) or more times is shown in Table IV and Figure 4. The results show that a threshold of 3 controls these mis-detection probabilities at the level of 0.2% across the given coverage depths C = 20, 30, 40, and 50. To control the mis-detection probability more stringently, a threshold larger than T = 3 may be preferred. As shown in Figure 4, the optimal numbers of individuals in a pooled sample are still similar across MAFs for a given coverage depth and a larger threshold. These results also suggest that controlling the mis-detection probability more stringently requires the use of a smaller pool size. Additionally, in Appendix A.5, we briefly describe how to construct a random threshold that yields an exact mis-detection probability at a given significance level, and we repeat our analysis with it.

Fig. 3. The optimal numbers of individuals based on average detection probabilities of variants with p = 0.005, 0.01, and 0.025, coverage depths C = 20, 30, and 40, and threshold T = 3. The hyperparameter values 2.89 and 4.76 are estimated by the PMLE and PMME, respectively. The number on each curve is the optimal number of individuals. PMLE, pseudo maximum-likelihood estimator; PMME, pseudo method of moments estimator.
See Appendix A for more details. To summarize, our study has shown that DNA pooling can be a very cost-effective approach for detecting rare variants, and that the optimal number of individuals in a pool is robust to the MAFs of rare variants at a specific coverage depth. This is a very desirable property, as the rare variants to be discovered have unknown frequencies. Moreover, DNA pooling can also be a very effective approach for genetic association studies, and this will be explored in our future work.

[Table II note: Indv, the optimal number of individuals; Prob, detection probability; Lane, the minimum number of lanes required for 80% power; Total, the total number of individuals required for 80% power. MME, method of moments estimator; MLE, maximum-likelihood estimator.]

APPENDIX A

A.1. DATA DESCRIPTION

For our analysis, we use an empirical data set from a pooled sample consisting of the genomic DNAs from eight individuals. Exome DNA sequencing was performed by first capturing the exome using the NimbleGen 2.1M human exome array, followed by Illumina Genome Analyzer sequencing. The resulting reads are 75 bp paired-end reads. For the alignment and mapping of sequencing reads, we used Bowtie. The genomic DNAs from the eight individuals were also genotyped on the Illumina 610-Quad (San Diego, CA) SNP array. As a result, both base counts and genotypes are available for 11,312 positions in the exome.

A.2. ESTIMATION OF THE CONTRIBUTIONS OF INDIVIDUALS IN A POOLED SAMPLE
A common approach to evaluating the accuracy of NGS SNP detection is to compare the sequencing results with those from microarray-based genotyping platforms. Microarray-based genotyping information can also be utilized to assess the rate of discovery of a variant from a NGS technology. If genotyping information is available for all the individuals in the pooled DNA sample j, the individual contributions in the sample can be estimated in the following way. Let x_{i,l}^j denote the proportion of the major allele at a DNA position l (l = l_1, ..., l_N) for the i-th individual, so that x_{i,l}^j = 0, 0.5, or 1. Then

ŵ^j = argmin_w || Xw − y ||²  subject to Σ_{i=1}^k w_i = 1 and w_i ≥ 0 for all i,

where || · || is the l_2 norm, the i-th row of X is x_{l_i}^j = (x_{1,l_i}^j, ..., x_{k,l_i}^j), and y = (y_{l_1}, ..., y_{l_N})^T is the vector of sample major allele frequencies from the NGS method. Note that we impose the constraints to ensure that the individual contributions to the overall pool are non-negative and sum to one.
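The constrained least-squares problem above is a quadratic program over the probability simplex and can be solved by any QP solver. As one minimal sketch (not the authors' implementation), projected gradient descent with the standard Euclidean simplex projection works; the function names and iteration count are our choices.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def estimate_contributions(X, y, iters=5000):
    """Minimize ||X w - y||^2 subject to w on the probability simplex,
    via projected gradient descent.
    X: (N positions) x (k individuals) matrix of genotype proportions (0, 0.5, 1),
    y: pooled major-allele frequencies from sequencing."""
    k = X.shape[1]
    w = np.full(k, 1.0 / k)                          # start from equal contributions
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1/L for this quadratic objective
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ w - y)
        w = project_to_simplex(w - step * grad)
    return w
```

With enough genotyped positions N relative to k, the fit recovers the true contributions when the pooled frequencies are consistent with the genotypes.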

A.3. ESTIMATION OF THE HYPERPARAMETER FOR A DIRICHLET DISTRIBUTION
The (pseudo) maximum-likelihood estimate (hereafter MLE) α̂ cannot be expressed in closed form but can be obtained by an iterative scheme such as the Newton-Raphson method [Minka, 2009]. Here, we briefly describe the Newton-Raphson procedure for the MLE α̂. The probability density of the Dirichlet distribution with parameters α = (α, ..., α) at w = (w_1, ..., w_k) is

f(w) = [Γ(kα) / Γ(α)^k] Π_{i=1}^k w_i^{α−1},

where w_i > 0 for each i = 1, ..., k and Σ_{i=1}^k w_i = 1. If w^1, ..., w^m were available, the log-likelihood could be written as

log L(α | w^1, ..., w^m) = m log Γ(kα) − mk log Γ(α) + (α − 1) Σ_{j=1}^m Σ_{i=1}^k log w_i^j.

Note that for unknown w^1, ..., w^m, w_i^j is replaced by ŵ_i^j in this step. The first and second derivatives of the log-likelihood are

g(α) = mk Ψ(kα) − mk Ψ(α) + Σ_{j=1}^m Σ_{i=1}^k log w_i^j

and

g′(α) = mk² Ψ′(kα) − mk Ψ′(α),

respectively, where Ψ(x) = d log Γ(x)/dx. The Newton-Raphson method is performed iteratively via

α_new = α_old − g(α_old)/g′(α_old),

until |α_old − α_new| < ε for a precision ε.
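The iteration above can be sketched as follows. The digamma and trigamma functions are implemented here via the standard recurrence-plus-asymptotic-expansion approach (to keep the sketch dependency-free); the safeguard that halves α when a Newton step goes nonpositive is our addition, not part of the paper.

```python
import math
import random

def digamma(x):
    """Psi(x) via upward recurrence plus asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0/12 - f * (1.0/120 - f / 252))

def trigamma(x):
    """Psi'(x) via upward recurrence plus asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / x
    return r + f * (1.0 + f * (0.5 + f * (1.0/6 - f * f / 30)))

def dirichlet_mle(W, tol=1e-10, max_iter=100):
    """Newton-Raphson for the symmetric Dirichlet MLE of alpha, given
    W = [w^1, ..., w^m], each w^j a length-k probability vector."""
    m, k = len(W), len(W[0])
    S = sum(math.log(wij) for wj in W for wij in wj)  # sum of log w_i^j
    a = 1.0
    for _ in range(max_iter):
        g = m * k * digamma(k * a) - m * k * digamma(a) + S
        gp = m * k * k * trigamma(k * a) - m * k * trigamma(a)
        a_new = a - g / gp
        if a_new <= 0.0:
            a_new = a / 2.0            # simple safeguard against overshooting
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a
```

Since the symmetric-Dirichlet log-likelihood is concave in alpha, the iteration converges quickly from a reasonable starting point.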
Our second estimator is based on the method of moments. Under the assumption α_1 = ... = α_k = α, for each i = 1, ..., k,

E(w_i) = 1/k  and  Var(w_i) = (k − 1) / [k²(kα + 1)].

The MME (PMME) is obtained by setting Var(w_i) equal to s², the sample variance of all the w's (ŵ's), and solving for α.
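Solving the variance identity for alpha gives a one-line estimator (a sketch; the function name is ours):

```python
def dirichlet_mme(sample_var, k):
    """Method-of-moments estimate of the symmetric Dirichlet parameter:
    solve Var(w_i) = (k - 1) / (k^2 (k*alpha + 1)) for alpha."""
    return ((k - 1) / (k * k * sample_var) - 1.0) / k
```

For instance, with k = 8 and the variance implied by alpha = 2, i.e. 7/(64·17), the estimator returns exactly 2.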
To compare the two estimators, we performed a simulation with α = 2, k = 8, and 20,000 iterations. As shown in Table III, the summary statistics of the two estimators are very comparable, although the median and mean for the MME are closer to the true α than those for the MLE. We also conducted simulations with different α and k and found similar patterns.

A.4. MIS-DETECTION PROBABILITIES DUE TO SEQUENCING ERRORS
In this section, we describe the calculation of mis-detection probabilities due to sequencing errors. Suppose that the overall sequencing error rate is 1%. Let X_1 denote the number of reads showing the true base and X_i (i = 2, 3, 4) the number of reads showing each of the three incorrect bases at a coverage depth C. Under the assumption that a base has an equal chance of being mis-sequenced to each of the three incorrect bases, (X_1, X_2, X_3, X_4) follows a multinomial distribution with p_1 = 0.99 and p_2 = p_3 = p_4 = 1/300. Note that

P(x_1, x_2, x_3, x_4) = [n! / (x_1! x_2! x_3! x_4!)] p_1^{x_1} p_2^{x_2} p_3^{x_3} p_4^{x_4},

where n = Σ_{i=1}^4 x_i. Then the mis-detection probability for a threshold of 2 is calculated as

P(mis-detect | C) = 1 − [P(C, 0, 0, 0) + P(C − 1, 1, 0, 0) + P(C − 1, 0, 1, 0) + P(C − 1, 0, 0, 1) + P(C − 2, 1, 1, 0) + P(C − 2, 1, 0, 1) + P(C − 2, 0, 1, 1) + P(C − 3, 1, 1, 1)],

where C is the coverage depth. The mis-detection probability for T ≥ 3 can be computed in a similar way. The mis-detection probabilities for T = 2, 3, and 4 can be found in Table IV.

A.5. RANDOM THRESHOLDS FOR A MIS-DETECTION PROBABILITY OF 0.01%

Based on Table IV, we construct a random threshold for controlling the type I error at 0.01% as follows. For U ∼ Unif[0, 1],

T(U) = t_1 · I(U ≤ w) + t_2 · I(U > w),   (A.5)

where t_1 = 3, t_2 = 4, and w is chosen so that the corresponding type I error equals 0.01% at coverage depth C. See Tables V and VI for more details.

[Tables V and VI note: Indv, the optimal number of individuals; Prob, probability; Lane, the minimum number of lanes required for 80% power; Total, the total number of individuals required for 80% power.]
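The multinomial mis-detection probability generalizes to any threshold T by summing over all configurations in which every incorrect-base count stays below T. A sketch of that enumeration (function name and default error rate are our choices; for T = 2 the enumerated terms reduce to the eight terms listed above):

```python
import math

def multinomial_misdetect(C, T, err=0.01):
    """Probability that at least one of the three incorrect bases is read
    T or more times among C reads, under the multinomial model with
    p1 = 1 - err and p2 = p3 = p4 = err/3 (Appendix A.4)."""
    pe = err / 3.0
    ok = 0.0                                     # P(all incorrect counts <= T - 1)
    for x2 in range(T):
        for x3 in range(T):
            for x4 in range(T):
                e = x2 + x3 + x4                 # total number of error reads
                if e > C:
                    continue
                coef = (math.factorial(C) //
                        (math.factorial(C - e) * math.factorial(x2) *
                         math.factorial(x3) * math.factorial(x4)))
                ok += coef * (1.0 - err)**(C - e) * pe**e
    return 1.0 - ok
```

At a 1% error rate and the coverage depths considered here, the T = 3 mis-detection probability stays well below 1%, consistent with the ~0.2% control discussed in the text.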