Simulating Realistic Genomic Data With Rare Variants
Article first published online: 17 NOV 2012
© 2012 WILEY PERIODICALS, INC.
Volume 37, Issue 2, pages 163–172, February 2013
How to Cite
Xu, Y., Wu, Y., Song, C. and Zhang, H. (2013), Simulating Realistic Genomic Data With Rare Variants. Genet. Epidemiol., 37: 163–172. doi: 10.1002/gepi.21696
- Issue published online: 10 JAN 2013
- Manuscript Accepted: 15 OCT 2012
- Manuscript Revised: 9 OCT 2012
- Manuscript Received: 5 SEP 2012
- National Institute on Drug Abuse. Grant Number: R01DA016750
- National Institutes of Health. Grant Number: R01GM088566
- logistic regression;
- rare SNPs
Increasing evidence suggests that rare and generally deleterious genetic variants may strongly affect the risk of not only Mendelian diseases but also many common diseases. However, identifying such rare variants remains challenging, and novel statistical methods and bioinformatic software must be developed. These methods, in turn, must be evaluated extensively under reasonable genetic models. Although genomic data are abundant, they are of limited use for evaluating methods because the underlying disease mechanism is unknown. It is therefore imperative to simulate genomic data that mimic real data containing rare variants and that allow a known disease penetrance model to be imposed. Resampling simulation methods are computationally efficient and preserve important properties such as linkage disequilibrium (LD) and allele frequency, but, as we demonstrate, they still have limitations. We propose an algorithm that combines regression-based imputation with resampling to simulate genetic data with both rare and common variants. A logistic regression model was fitted to the relationship between a rare variant and its nearby common variants in the 1000 Genomes Project data; the fitted model was then applied to the real data to fill in one rare variant at a time from the common variants. Individuals were then simulated from the real data with the imputed rare variants. We compared our method with existing simulators and found that it performed well in retaining, qualitatively, properties of the real sample such as LD and minor allele frequency.
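The following is a minimal sketch of the general idea described above (fit a logistic model for one rare variant on nearby common variants in a reference panel, impute that variant into data that lack it, then resample individuals). It is not the authors' implementation. Several details are assumptions made only for illustration: the haplotype-level 0/1 allele coding, the toy data generator `toy_common`, the plain gradient-descent fit, drawing the imputed allele at random from the fitted probability, and the resampling scheme that pairs two haplotypes drawn with replacement. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    # P(rare allele = 1 | common alleles x); w[0] is the intercept.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=300):
    # Batch gradient descent on the logistic log-loss (assumed fitting method).
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for x, t in zip(X, y):
            err = predict(w, x) - t
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def impute_rare(w, X, rng):
    # Draw one 0/1 rare allele per haplotype from the fitted probability.
    return [1 if rng.random() < predict(w, x) else 0 for x in X]


def resample_individuals(haps, n, rng):
    # Assumed scheme: a genotype is the sum of two haplotypes drawn with replacement.
    out = []
    for _ in range(n):
        a, b = rng.choice(haps), rng.choice(haps)
        out.append([u + v for u, v in zip(a, b)])
    return out


def toy_common(n, rng):
    # Hypothetical data: three common variants, each with allele frequency 0.3.
    return [[int(rng.random() < 0.3) for _ in range(3)] for _ in range(n)]


rng = random.Random(0)

# Stand-in for a reference panel that contains both common and rare variants.
ref_common = toy_common(400, rng)
# Toy rare variant: occurs only on haplotypes carrying the first common allele.
ref_rare = [int(x[0] == 1 and rng.random() < 0.1) for x in ref_common]

# Step 1: fit rare ~ common on the reference panel.
w = fit_logistic(ref_common, ref_rare)

# Step 2: impute the rare variant into "real" data that only has common variants.
target_common = toy_common(200, rng)
target_rare = impute_rare(w, target_common, rng)
target_full = [x + [r] for x, r in zip(target_common, target_rare)]

# Step 3: resample individuals from the augmented data.
simulated = resample_individuals(target_full, 50, rng)
print(len(simulated), sum(target_rare) / len(target_rare))
```

In this sketch the rare variant is imputed one at a time, as in the description above; handling many rare variants would presumably repeat steps 1 and 2 per variant before resampling. A known penetrance model could then be applied to the `simulated` genotypes to assign disease status.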