Analysis and Optimal Design for Association Studies Using Next-Generation Sequencing With Case-Control Pools


  • Wei E. Liang,

  • Duncan C. Thomas,

  • David V. Conti

    Corresponding author
    • Department of Preventive Medicine, University of Southern California, Los Angeles, California

Correspondence to: David V. Conti, Department of Preventive Medicine, Zilkha Neurogenetic Institute, University of Southern California, 2001 North Soto Street, Room 202S, Los Angeles, CA 90089. E-mail:


With its potential to discover a much greater amount of genetic variation, next-generation sequencing is fast emerging as a tool for genetic association studies. However, the cost of sequencing all individuals in a large-scale population study remains high compared with most alternative genotyping options. Although individual-level data can no longer be identified (without bar-coding), sequencing pooled samples can substantially lower costs without compromising the power to detect significant associations. We propose a hierarchical Bayesian model that estimates the association of each variant using pools of cases and controls, accounting for sequencing error and for the variation in read depth across pools. To investigate the performance of our method across a range of designs (number of pools, number of individuals within each pool, and average coverage), we undertook extensive simulations varying effect sizes, minor allele frequencies, and sequencing error rates. In general, the number of pools and the pool size have dramatic effects on power, whereas the total depth of coverage per pool has only a moderate impact. This information can guide the selection of a study design that maximizes power subject to cost, sample size, or other laboratory constraints. We provide an R package (hiPOD: hierarchical Pooled Optimal Design) to find the optimal design, allowing the user to specify a cost function, limits on cost and sample size, and distributions of effect size, minor allele frequency, and sequencing error rate.
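
As a rough illustration of the kind of data such a pooled design produces, the Python sketch below simulates variant read counts for a single pool and computes a simple error-corrected, depth-weighted allele frequency across pools. It is not the hierarchical Bayesian model described above, nor the interface of the hiPOD package: the function names, the symmetric sequencing-error model, and the moment-based correction are all assumptions made for illustration only.

```python
import random


def simulate_pool(n_individuals, maf, depth, error_rate, rng):
    """Return the number of reads showing the minor allele in one pool.

    Assumptions (illustrative, not from the paper): diploid individuals,
    equal contribution of each chromosome to the pool, and a symmetric
    per-read error that flips the observed allele with probability
    `error_rate`.
    """
    n_chrom = 2 * n_individuals
    # Minor-allele chromosomes actually present in the pool.
    minor = sum(rng.random() < maf for _ in range(n_chrom))
    pool_freq = minor / n_chrom
    # Probability that a single read is observed as the minor allele.
    p_read = pool_freq * (1 - error_rate) + (1 - pool_freq) * error_rate
    return sum(rng.random() < p_read for _ in range(depth))


def corrected_frequency(variant_reads, depth, error_rate):
    """Invert the symmetric error model and clamp the result to [0, 1]."""
    raw = variant_reads / depth
    est = (raw - error_rate) / (1 - 2 * error_rate)
    return min(1.0, max(0.0, est))


def pooled_estimate(pools, error_rate):
    """Depth-weighted frequency estimate over (variant_reads, depth) pairs."""
    total_variant = sum(v for v, _ in pools)
    total_depth = sum(d for _, d in pools)
    return corrected_frequency(total_variant, total_depth, error_rate)


def simulate_group(n_pools, n_individuals, maf, depths, error_rate, rng):
    """Simulate `n_pools` pools whose read depths are given in `depths`."""
    return [
        (simulate_pool(n_individuals, maf, depths[i], error_rate, rng), depths[i])
        for i in range(n_pools)
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    depths = [80, 120, 100, 150]  # read depth varies across pools
    cases = simulate_group(4, 50, 0.08, depths, 0.01, rng)
    controls = simulate_group(4, 50, 0.05, depths, 0.01, rng)
    print("case frequency estimate:   ", pooled_estimate(cases, 0.01))
    print("control frequency estimate:", pooled_estimate(controls, 0.01))
```

Repeating such a simulation over different numbers of pools, pool sizes, and depths gives an informal feel for how each design parameter affects the precision of the case-control comparison.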