R functions are available at http://bioinformatics.med.yale.edu/group/josh/index.html.
Selecting SNPs to Identify Ancestry
Article first published online: 14 JUN 2011
© 2011 The Authors Annals of Human Genetics © 2011 Blackwell Publishing Ltd/University College London
Annals of Human Genetics
Volume 75, Issue 4, pages 539–553, July 2011
How to Cite
Sampson, J. N., Kidd, K. K., Kidd, J. R. and Zhao, H. (2011), Selecting SNPs to Identify Ancestry. Annals of Human Genetics, 75: 539–553. doi: 10.1111/j.1469-1809.2011.00656.x
- Issue published online: 14 JUN 2011
- Received: 12 November 2010, Accepted: 17 March 2011
- error rate;
- allele frequency;
An individual's genotypes at a group of single-nucleotide polymorphisms (SNPs) can be used to predict that individual's ethnicity or ancestry. In medical studies, knowledge of a subject's ancestry can minimize possible confounding, and in forensic applications, such knowledge can help direct investigations. Our goal is to select a small subset of SNPs, from the millions already identified in the human genome, that can predict ancestry with a minimal error rate. The general form of this variable selection procedure is to estimate the expected error rates for sets of SNPs using a training dataset and then consider those sets with the lowest error rates given their size. The quality of the error-rate estimate determines the quality of the selected SNPs. Because the apparent error rate performs poorly when either the number of SNPs or the number of populations is large, we propose a new estimate, the Improved Bayesian Estimate. We demonstrate that selection procedures based on this estimate produce small sets of SNPs that can accurately predict ancestry. We also provide a list of the 100 optimal SNPs for identifying ancestry.
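To make the general form of the selection procedure concrete, the following is a minimal sketch in Python, not the authors' R functions and not the Improved Bayesian Estimate. It rests on several assumptions that do not come from the abstract: genotypes are coded as 0, 1, or 2 copies of a reference allele; ancestry is predicted with a simple likelihood classifier that treats SNPs as independent and in Hardy-Weinberg equilibrium within each population; and SNP sets are grown by greedy forward selection. The error-rate estimate used here is the apparent (training-set) error rate, chosen only because it is the simplest to write down; the abstract notes that it performs poorly when the number of SNPs or populations is large. All function names are hypothetical.

```python
import math


def allele_freqs(genotypes, labels, n_pops, pseudo=0.5):
    """Estimate the reference-allele frequency of every SNP in every population.

    genotypes: list of rows, each a list of 0/1/2 allele counts (assumed coding).
    labels:    population index (0..n_pops-1) for each row.
    pseudo:    pseudocount that keeps every frequency strictly inside (0, 1).
    """
    n_snps = len(genotypes[0])
    freqs = []
    for pop in range(n_pops):
        rows = [g for g, y in zip(genotypes, labels) if y == pop]
        freqs.append([
            (sum(r[j] for r in rows) + pseudo) / (2 * len(rows) + 2 * pseudo)
            for j in range(n_snps)
        ])
    return freqs


def log_lik(g, p):
    """Binomial(2, p) log-likelihood of genotype g, dropping the constant
    term that is identical across populations."""
    return g * math.log(p) + (2 - g) * math.log(1 - p)


def predict(genotype, freqs, snps):
    """Return the population with the highest likelihood over the chosen SNPs
    (ties go to the lowest population index)."""
    scores = [sum(log_lik(genotype[j], f[j]) for j in snps) for f in freqs]
    return max(range(len(freqs)), key=scores.__getitem__)


def apparent_error(genotypes, labels, freqs, snps):
    """Fraction of training individuals misclassified using the given SNPs."""
    wrong = sum(predict(g, freqs, snps) != y for g, y in zip(genotypes, labels))
    return wrong / len(labels)


def greedy_select(genotypes, labels, n_pops, k):
    """Greedy forward selection: repeatedly add the SNP that gives the lowest
    estimated error rate for the enlarged set (an assumed search strategy)."""
    freqs = allele_freqs(genotypes, labels, n_pops)
    chosen = []
    remaining = set(range(len(genotypes[0])))
    while len(chosen) < k and remaining:
        best = min(
            sorted(remaining),
            key=lambda j: apparent_error(genotypes, labels, freqs, chosen + [j]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    # Toy training data: only the middle SNP differs between the two populations.
    toy_genotypes = [[1, 0, 1], [1, 0, 1], [1, 2, 1], [1, 2, 1]]
    toy_labels = [0, 0, 1, 1]
    print(greedy_select(toy_genotypes, toy_labels, n_pops=2, k=1))
```

In this sketch, substituting a different error-rate estimate only requires replacing `apparent_error`; the search loop in `greedy_select` is unchanged, which mirrors the abstract's point that the quality of the estimate drives the quality of the selected SNPs.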