Using Family Data as a Verification Standard to Evaluate Copy Number Variation Calling Strategies for Genetic Association Studies
Article first published online: 24 APR 2012
© 2012 Wiley Periodicals, Inc.
Volume 36, Issue 3, pages 253–262, April 2012
How to Cite
Zheng, X., Shaffer, J. R., McHugh, C. P., Laurie, C. C., Feenstra, B., Melbye, M., Murray, J. C., Marazita, M. L. and Feingold, E. (2012), Using Family Data as a Verification Standard to Evaluate Copy Number Variation Calling Strategies for Genetic Association Studies. Genet. Epidemiol., 36: 253–262. doi: 10.1002/gepi.21618
- Issue published online: 19 APR 2012
- Manuscript Accepted: 14 NOV 2011
- Manuscript Revised: 5 NOV 2011
- Manuscript Received: 17 AUG 2011
- NIDCR. Grant Number: R01-DE 014899
- NIDCR. Grant Numbers: R01-DE09551, R01-DE12101
- Keywords: CNV-calling strategies; family-based GWAS
A major concern for all copy number variation (CNV) detection algorithms is their reliability and repeatability. However, it is difficult to evaluate the reliability of CNV-calling strategies due to the lack of gold-standard data that would tell us which CNVs are real. We propose that if CNVs are called in duplicate samples, or inherited from parent to child, then these can be considered validated CNVs. We used two large family-based genome-wide association study (GWAS) datasets from the GENEVA consortium to look at concordance rates of CNV calls between duplicate samples, parent-child pairs, and unrelated pairs. Our goal was to make recommendations for ways to filter and use CNV calls in GWAS datasets that do not include family data. We used PennCNV as our primary CNV-calling algorithm, and tested CNV calls using different datasets and marker sets, and with various filters on CNVs and samples. Using the Illumina core HumanHap550 single nucleotide polymorphism (SNP) set, we saw duplicate concordance rates of approximately 55% and parent-child transmission rates of approximately 28% in our datasets. GC model adjustment and sample quality filtering had little effect on these reliability measures. Stratification on CNV size and DNA sample type did have some effect. Overall, our results show that it is probably not possible to find a CNV-calling strategy (including filtering and algorithm) that will give us a set of “reliable” CNV calls using current chip technologies. But if we understand the error process, we can still use CNV calls appropriately in genetic association studies.
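To make the idea of a concordance rate between duplicate samples concrete, the following is a minimal sketch in Python. It is not the authors' procedure: the abstract does not state how two CNV calls were matched, so the matching rule used here (same chromosome, same call type, any positional overlap) is an assumption, and the `CNV` record, the `overlaps` and `concordance_rate` functions, and the example coordinates are all hypothetical names and values introduced for illustration.

```python
from typing import List, NamedTuple


class CNV(NamedTuple):
    """One CNV call (hypothetical record layout, not PennCNV's output format)."""
    chrom: str
    start: int  # first base of the call, inclusive
    end: int    # last base of the call, inclusive
    kind: str   # "del" or "dup"


def overlaps(a: CNV, b: CNV) -> bool:
    # ASSUMPTION: two calls match if they are on the same chromosome, have
    # the same type, and share at least one base. Stricter rules (for example
    # a minimum reciprocal overlap) are equally plausible.
    return (
        a.chrom == b.chrom
        and a.kind == b.kind
        and a.start <= b.end
        and b.start <= a.end
    )


def concordance_rate(calls_a: List[CNV], calls_b: List[CNV]) -> float:
    """Fraction of calls in calls_a that are matched by some call in calls_b.

    Returns NaN when calls_a is empty, since the rate is undefined.
    """
    if not calls_a:
        return float("nan")
    matched = sum(any(overlaps(a, b) for b in calls_b) for a in calls_a)
    return matched / len(calls_a)


# Made-up calls for one sample and its duplicate.
sample = [
    CNV("1", 100, 200, "del"),
    CNV("2", 500, 900, "dup"),
    CNV("3", 10, 50, "del"),
    CNV("5", 1000, 2000, "dup"),
]
duplicate = [
    CNV("1", 150, 250, "del"),    # overlaps the chr1 deletion
    CNV("2", 500, 900, "del"),    # same position, different type: no match
    CNV("5", 1500, 1600, "dup"),  # nested inside the chr5 duplication
]

print(concordance_rate(sample, duplicate))  # 2 of 4 calls matched
print(concordance_rate(duplicate, sample))  # 2 of 3 calls matched
```

The rate is asymmetric because the denominator is the number of calls in the first argument. Under the same assumptions, a parent-child transmission rate could be sketched by passing the child's calls as `calls_a` and the pooled calls of both parents as `calls_b`, and the unrelated-pair rate gives a baseline for matches that occur by chance or through common CNVs.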