Selecting SNPs to Identify Ancestry


Corresponding author: Joshua N. Sampson, 6120 Executive Blvd, 8038, Bethesda, MD 20892. Tel: (301) 443-8207; Fax: (301) 402-0081; E-mail:


An individual's genotypes at a group of single-nucleotide polymorphisms (SNPs) can be used to predict that individual's ethnicity or ancestry. In medical studies, knowledge of a subject's ancestry can minimize possible confounding, and in forensic applications, such knowledge can help direct investigations. Our goal is to select a small subset of SNPs, from the millions already identified in the human genome, that can predict ancestry with a minimal error rate. The general form of this variable selection procedure is to estimate the expected error rates for sets of SNPs using a training dataset and then consider those sets with the lowest estimated error rates for their size. The quality of the error rate estimate determines the quality of the selected SNPs. Because the apparent error rate performs poorly when either the number of SNPs or the number of populations is large, we propose a new estimate, the Improved Bayesian Estimate. We demonstrate that selection procedures based on this estimate produce small sets of SNPs that can accurately predict ancestry. We also provide a list of the 100 optimal SNPs for identifying ancestry.
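
The general form of the selection procedure described above can be illustrated with a minimal sketch. It is not the Improved Bayesian Estimate: it scores candidate sets with the apparent (training-set) error rate, the estimate that the abstract notes performs poorly for large problems. The classifier (independent SNPs in Hardy-Weinberg proportions, equal population priors), the pseudocount, the greedy forward search, and all function names are assumptions made only for this illustration. Genotypes are assumed to be coded as 0, 1, or 2 copies of a reference allele.

```python
import math


def allele_freqs(genotypes, labels, populations):
    """Estimate the reference-allele frequency of each SNP in each population.

    A pseudocount of one allele of each type (an assumption of this sketch)
    keeps every frequency strictly between 0 and 1.
    """
    n_snps = len(genotypes[0])
    freqs = {}
    for pop in populations:
        rows = [g for g, lab in zip(genotypes, labels) if lab == pop]
        freqs[pop] = [
            (sum(r[j] for r in rows) + 1.0) / (2.0 * len(rows) + 2.0)
            for j in range(n_snps)
        ]
    return freqs


def log_lik(genotype, freq, snps):
    """Log-likelihood of a genotype over the chosen SNPs, assuming
    Hardy-Weinberg proportions and independence between SNPs."""
    total = 0.0
    for j in snps:
        p, g = freq[j], genotype[j]
        total += (math.log(math.comb(2, g))
                  + g * math.log(p) + (2 - g) * math.log(1.0 - p))
    return total


def apparent_error(genotypes, labels, freqs, snps):
    """Fraction of training subjects misclassified using only `snps`."""
    populations = sorted(freqs)
    wrong = 0
    for g, lab in zip(genotypes, labels):
        pred = max(populations, key=lambda pop: log_lik(g, freqs[pop], snps))
        wrong += pred != lab
    return wrong / len(labels)


def greedy_select(genotypes, labels, k):
    """Forward selection: repeatedly add the SNP that gives the lowest
    estimated error rate when combined with the SNPs already chosen."""
    populations = sorted(set(labels))
    freqs = allele_freqs(genotypes, labels, populations)
    chosen = []
    remaining = list(range(len(genotypes[0])))
    for _ in range(k):
        best = min(
            remaining,
            key=lambda j: apparent_error(genotypes, labels, freqs, chosen + [j]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Replacing `apparent_error` with a better estimate of the expected error rate changes which sets are preferred while leaving the search itself unchanged, which is why the quality of the estimate drives the quality of the selected SNPs.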