R functions are available at http://bioinformatics.med.yale.edu/group/josh/index.html.

# Selecting SNPs to Identify Ancestry

Version of Record online: 14 JUN 2011

DOI: 10.1111/j.1469-1809.2011.00656.x

© 2011 The Authors Annals of Human Genetics © 2011 Blackwell Publishing Ltd/University College London

Additional Information

#### How to Cite

Sampson, J. N., Kidd, K. K., Kidd, J. R. and Zhao, H. (2011), Selecting SNPs to Identify Ancestry. Annals of Human Genetics, 75: 539–553. doi: 10.1111/j.1469-1809.2011.00656.x

#### Publication History

- Issue online: 14 JUN 2011
- Version of Record online: 14 JUN 2011
- *Received*: 12 November 2010; *Accepted*: 17 March 2011

### Keywords:

- Ancestry;
- ethnicity;
- SNPs;
- error rate;
- allele frequency;
- genotype;
- AIM;
- bootstrap;
- FOSSIL

### Summary


An individual's genotypes at a group of single-nucleotide polymorphisms (SNPs) can be used to predict that individual's ethnicity or ancestry. In medical studies, knowledge of a subject's ancestry can minimize possible confounding, and in forensic applications, such knowledge can help direct investigations. Our goal is to select a small subset of SNPs, from the millions already identified in the human genome, that can predict ancestry with a minimal error rate. The general form for this variable selection procedure is to estimate the expected error rates for sets of SNPs using a training dataset and consider those sets with the lowest error rates given their size. The quality of the estimate for the error rate determines the quality of the resulting SNPs. Because the *apparent error rate* performs poorly when either the number of SNPs or the number of populations is large, we propose a new estimate, the *Improved Bayesian Estimate*. We demonstrate that selection procedures based on this estimate produce small sets of SNPs that can accurately predict ancestry. We also provide a list of the 100 optimal SNPs for identifying ancestry.

### Introduction


An individual's genotypes at a group of single nucleotide polymorphisms (SNPs) can be used to predict that individual's ethnicity or ancestry (Shriver et al., 1997; Rosenberg et al., 2002; Jorde & Wooding, 2004; Weir et al., 2005; Paschou et al., 2007; Yamaguchi-Kabata et al., 2008). Identifying ancestry through this approach is often useful (Bamshad et al., 2003; Li et al., 2008; Seldin & Price, 2008). For example, subjects in a medical study may be genotyped because adjusting for precise ancestry can minimize one source of confounding (Freedman et al., 2004; Marchini et al., 2004; Barnholtz-Sloan et al., 2008). Similarly, a sample from a crime scene may be genotyped so that ancestry can be included in the description of a suspect (Lowe et al., 2001; Daniel et al., 2006; Budowle & van Daal, 2008). Although millions of SNPs have been identified, only a small subset needs to be genotyped in order to accurately predict ancestry. Reducing the needed number to the tens or hundreds is still useful even in the era of SNP microarrays. First, genotyping only a few hundred SNPs, compared to the hundreds of thousands of SNPs on a microarray, should be less expensive (Seldin & Price, 2008; Nassir et al., 2009). Second, by basing predictions on only those informative SNPs, we can remove variability caused by considering SNPs with little information. In this article, we aim to describe a method for selecting an "optimal" group of SNPs, that is, a group that has maximal predictive accuracy given its size.

The first set of methods for marker selection ranked SNPs individually by some measure of their ability to distinguish ancestries, such as the estimated values for *F _{ST}*, the allele frequencies, **p**, or the informativeness for assignment, *I _{n}*, calculated from a training dataset, and then the top-ranked SNPs were selected (Rosenberg et al., 2003). See the Appendix for formal definitions of *I _{n}* and *F _{ST}*. Obviously, this led to redundant markers, and the majority of markers separated African from European ancestries. Therefore, the next set of methods were more advanced (Xu et al., 2005; Hemminger et al., 2006; Phillips et al., 2007) and included (1) selecting those SNPs that are the strongest contributors to the principal components (Paschou et al., 2007), (2) selecting the 1000 SNPs with the highest *F _{ST}* and then, among those, using a genetic algorithm to jointly select the set with the largest *I _{n}* (Lao et al., 2006), and (3) selecting a set of SNPs with a greedy algorithm aimed to minimize the apparent error rate (Rosenberg, 2005). Method (1) can still select redundant SNPs and may rank those SNPs that distinguish closely related populations relatively low. Method (2) uses unnecessary surrogates, *F _{ST}* and *I _{n}*, for predictive accuracy. Method (3) is the most promising, as it directly tries to minimize the error rate.

Our goal is to improve the latter method by selecting SNPs that minimize a better estimate of the error rate. A selection procedure based on our new estimate, which introduces an improved form for the error rate and an improved estimate for the allele frequencies, will result in a better group of SNPs. Although our focus is on SNP selection, our discussion should have broader appeal. We will introduce a parametric estimator for the error rate that can be applied to any prediction rule based on genotypes or, even more generally, any prediction rule using logistic regression to discriminate categories. Unlike the apparent error, this method acknowledges that the true allele frequencies are unknown (Efron, 1986; Claeskens et al., 2006). Moreover, our new estimates of allele frequencies offer a means to reduce the variance of the maximum likelihood estimates (MLE) whenever knowledge of the evolutionary tree is available (Farris, 1972; Saitou & Nei, 1987). Instead of estimating the allele frequencies for an ancestry using only subjects from that ancestry, we will now average over all available subjects in the training dataset. The need for these improvements has only arisen as we now attempt to predict ancestry more accurately than continental origin. For a training dataset, we now have access to the Human Genome Diversity Project (HGDP), where 500,000+ SNPs have been genotyped on hundreds of subjects from 54 populations (Jakobsson et al., 2008).

The remainder of the article is organized as follows. In the Methods section, we introduce our new selection procedure. Then, in the Results section, we apply our selection procedure to both simulated and HGDP data. Finally, we conclude with a short discussion.

### Methods


#### Introduction to a New Selection Procedure

##### Overview

As discussed in the introduction, our goal is to define a procedure for choosing a small set of SNPs that, when genotyped, can be used to predict an individual's ancestry. We restrict our search to one specific class of candidate procedures: each selection procedure considered here first estimates the expected error rate for every set of SNPs and then chooses a set with the lowest estimated error rate, given its size. Obviously, computational limits prevent truly searching over every set of SNPs, but we will deal with that technical issue later. The key point is that we need only define an estimate of the error rate to define a selection procedure.

##### Notation

Assume we select *n* individuals from a heterogeneous population containing *n _{e}* distinct ancestries or ethnicities, and denote the ancestry of an individual, *i*, by *Y _{i}* ∈ {1, … , *n _{e}*}. We let *n _{k}* = Σ_{i} 1(*Y _{i}* = *k*) be the total number of subjects from ancestry *k*, where 1(*Y _{i}* = *k*) = 1 if *Y _{i}* = *k* and 0 otherwise. In the heterogeneous population, from which our sample was obtained, we denote the proportion of subjects from ethnicity *k* by π*_{k} and let **π*** = (π*_{1}, … , π*_{n_e}). We will presume that **π*** can be accurately estimated and treated as a known, fixed quantity. Note that the asterisk, *, denotes the true value of a parameter.

Assume a genome for an individual contains *N* SNPs, and denote the three genotypes at each SNP by *AA*, *AB*, and *BB*. Denote the genotype for subject *i* at SNP *j* by *G _{ij}* ∈ {*AA*, *AB*, *BB*} and the genotypes at all SNPs by *G _{i}*. When referring to a subset, Ω ⊂ {1, … , *N*}, of those *N* SNPs, we denote the genotypes for those specific SNPs by *G _{i}*(Ω). Let *X _{i}* = (*G _{i}*, *Y _{i}*) be the genotype and ancestry information for subject *i*, and let **X** = {*X*_{1}, … , *X _{n}*} be the training dataset.

The genotype frequencies at these *N* SNPs vary by population. For ancestry *k*, we denote the proportion of individuals with genotype *g* at SNP *j* by *p**_{kj}(*g*), and we collect these frequencies, over all SNPs and ancestries, in **p***.

##### Maximum likelihood estimates

The MLE, **p̂**, for **p*** will assume Hardy-Weinberg equilibrium. Let *n*_{kj1} and *n*_{kj0} be the numbers of *A* and *B* alleles, respectively, observed at SNP *j* among the subjects from ancestry *k*. Then the MLE of the *A* allele frequency is

- *p̂ _{kj}* = *n*_{kj1} / (*n*_{kj1} + *n*_{kj0})  (1)

Then, under Hardy-Weinberg equilibrium,

- *p̂ _{kj}*(*AA*) = *p̂ _{kj}*², *p̂ _{kj}*(*AB*) = 2 *p̂ _{kj}*(1 − *p̂ _{kj}*), *p̂ _{kj}*(*BB*) = (1 − *p̂ _{kj}*)²  (2)

Again, note that the MLE, **p̂**, are estimators of the true parameter **p***.
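As a concrete sketch of these estimates (our own minimal Python illustration, not the authors' released R functions; the function names are ours), the allele-frequency MLE under Hardy-Weinberg equilibrium simply counts *A* alleles, and the genotype frequencies follow from the allele frequency:

```python
from collections import Counter

def mle_allele_freq(genotypes):
    """MLE of the A allele frequency from genotype calls ('AA', 'AB', 'BB'):
    the count of A alleles divided by the 2n alleles observed."""
    counts = Counter(genotypes)
    n = sum(counts.values())
    n_a = 2 * counts["AA"] + counts["AB"]  # AA carries two A alleles, AB one
    return n_a / (2 * n)

def hwe_genotype_freqs(p):
    """Genotype frequencies implied by allele frequency p under
    Hardy-Weinberg equilibrium."""
    return {"AA": p * p, "AB": 2 * p * (1 - p), "BB": (1 - p) ** 2}
```

For example, the genotype sample `['AA', 'AB', 'BB', 'AA']` contains five *A* alleles among eight, giving an estimate of 0.625.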

##### Prediction rule

For an individual, *i*, not in the training dataset, we would like to predict his ancestry using his genotype, *G _{i}*(Ω), and the training data, **X**. Our prediction rule will assign the most likely ancestry to the individual, *i*, as if the estimated allele frequencies were the true allele frequencies. This estimate is asymptotically optimal, in the sense that as *n* → ∞, this prediction rule will have the lowest possible error rate (Rosenberg et al., 2003). Because of its optimality, we chose this estimate over other options (Michie & Spiegelhalter, 1994; Hastie et al., 2001).

To define the prediction rule formally, we need to provide equations for calculating the probability that an individual, *i*, has a specific genotype given his ethnicity and **p***. The likelihood of the event can be written as

- P(*G _{i}*(Ω) | *Y _{i}* = *k*, **p***) = Π_{j∈Ω} *p**_{kj}(*G _{ij}*)  (3)

Next, we use Bayes' theorem to define the probability that the individual, *i*, is from a specific ancestry given his genotype, **p***, and **π***,

- P(*Y _{i}* = *k* | *G _{i}*(Ω), **p***, **π***) = π*_{k} P(*G _{i}*(Ω) | *Y _{i}* = *k*, **p***) / Σ_{v} π*_{v} P(*G _{i}*(Ω) | *Y _{i}* = *v*, **p***)  (4)

If we knew **p***, we would just classify individual *i* to the ancestry that maximizes equation (4). However, without knowing the true value of **p***, we replace **p*** with its estimate, **p̂**. Then, we can define our prediction rule by

- Ŷ(*G _{i}*(Ω), **X**) = argmax_{k} P(*Y _{i}* = *k* | *G _{i}*(Ω), **p̂**, **π***)  (5)

When the meaning is clear from context, we use abbreviated notation and may omit arguments.
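The rule in equations (3)-(5) amounts to a naive-Bayes classifier over SNP genotypes. A minimal sketch follows (our own illustration; the genotype-frequency table and population labels are hypothetical):

```python
import math

def log_genotype_lik(g_obs, geno_freqs_k):
    """log P(G_i(Omega) | Y_i = k): sum of log per-SNP genotype
    frequencies, treating SNPs as independent (equation (3))."""
    return sum(math.log(geno_freqs_k[j][g]) for j, g in enumerate(g_obs))

def predict_ancestry(g_obs, geno_freqs, priors):
    """Assign the ancestry maximizing the posterior of equation (4),
    computed in log space for numerical stability (equation (5))."""
    scores = {k: math.log(priors[k]) + log_genotype_lik(g_obs, geno_freqs[k])
              for k in geno_freqs}
    return max(scores, key=scores.get)
```

With a single SNP whose *AA* genotype is common in `pop1` and rare in `pop2`, an *AA* individual is assigned to `pop1`.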

##### Error rate

There are two types of error rates for prediction. First, there is the expected error rate when **p*** is known. Second, there is the expected error rate when **p*** is unknown and an estimate, , must be used in its place. These two error rates are distinct, and the error rate of interest is the second one. In the genetics literature, we are the first to propose a parametric estimate for this second, realistic, error rate.

Before describing our estimate, let us consider the example where we use known values of **p*** and **π*** to predict the ancestry of individual *i*, and another group aims to estimate our error rate. If this other group also knew the true parameters, they could accurately estimate our error rate by

- *err*(*G _{i}*(Ω), **p***) = 1 − max_{k} P(*Y _{i}* = *k* | *G _{i}*(Ω), **p***, **π***)  (6)

Note that equation (6) is calculated as one minus the probability of correctly predicting the ancestry. Now, if this other group knew only **p̂**, their best option would be to plug **p̂**, **π***, and *G _{i}*(Ω) into that same function. The result is the apparent error rate

- *err*(*G _{i}*(Ω), **p̂**) = 1 − max_{k} P(*Y _{i}* = *k* | *G _{i}*(Ω), **p̂**, **π***)  (7)

However, the true scenario, where our predictions are based only on estimates of **p***, presents a far more difficult challenge. There is no closed-form equivalent to equation (6). Even if the other group knew the true parameters, they could not precisely calculate our error rate. In fact, we can only define a function, equation (13), that takes the true parameters as input and outputs a consistent approximation of the error rate. The remainder of this section discusses the derivation of this equation.

We start by writing down a formula describing the error rate. Given our prediction rule and **p***, the expected error rate for a given genotype, *G _{i}*(Ω), averaged over all possible training datasets, can be described by equation (8). The probability of a training dataset is the probability of the observed genotypes given the ancestries (i.e., the product of equation (3) across all subjects). Note that in the prediction rule, the estimate **p̂** is a function of **X**, so the prediction is a random variable, as opposed to a fixed value. Here, the terms random and fixed refer to the training data.

- *err*(*G _{i}*(Ω), **p***) = 1 − Σ_{k} P(*Y _{i}* = *k* | *G _{i}*(Ω), **p***, **π***) · P(Ŷ(*G _{i}*(Ω), **X**) = *k* | **p***)  (8)

Also, unless stated otherwise, we assume that ancestry is being predicted for an individual not in the training dataset, so **X** and (*G _{i}*, *Y _{i}*) are independent. The probability of correctly predicting the ancestry is the sum, over all possible *k*, of the probability that the true ancestry is *k* multiplied by the probability that the predicted ancestry is *k*.
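The decomposition in equation (8), one minus the sum over *k* of P(true ancestry = *k*) times P(predicted ancestry = *k*), is a one-liner once those two sets of probabilities are in hand (a sketch; the probability inputs are assumed to be computed elsewhere):

```python
def expected_error(post_true, pred_probs):
    """err(G_i(Omega), p*): one minus the sum over ancestries k of
    P(Y_i = k | G_i(Omega)) * P(prediction = k), following the
    correct-prediction decomposition of equation (8)."""
    return 1.0 - sum(post_true[k] * pred_probs[k] for k in post_true)
```

For instance, if the true posterior is 0.7/0.3 over two ancestries and the rule always predicts the first, the expected error is 0.3.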

Unfortunately, equation (8) does not quite offer a way to calculate *err*(*G _{i}*(Ω), **p***). For a subject with a specific genotype, we know how to calculate the first factor using equation (4). However, we have yet to state a means to calculate the probability that the predicted ancestry is *k*. Below, we suggest that this probability can be approximated by the probability that a value drawn from one normal distribution is greater than *n _{e}* − 1 values drawn from other normal distributions. Unfortunately, there is no closed-form solution. The details of the derivation are left to the Appendix, and here, we offer only a sketch of how to go from equation (8) to equation (12).

Consider the estimate **p̂**. Plugging **p̂** into the function described by equation (4) creates a new continuous random variable, the estimated posterior probability for each ancestry. We can approximate the distribution of this variable by

- (9)

where

- (10)

with the genotype code equal to 1, 2, and 3 when *G _{ij}* = *AA*, *AB*, and *BB*, respectively, and *C* a constant independent of ancestry.

The prediction rule outputs ancestry *k* when the estimated posterior probability for *k* is the largest among all ancestries. The probability of this event is the same as the probability that *Z _{k}* is the greatest among *Z*_{1}, … , *Z*_{n_e}, where

- (11)

Again, we suppress arguments to simplify notation. The *Z _{v}* are independent, as the MLEs for population *k* are derived only from the subjects within population *k*. Therefore, we take the following to be a satisfactory approximation of the error rate. Let

- (12)

and as an approximation for the overall error rate, let

- (13)

The foundation of this estimate is the "Prediction Focused Information Criteria" described by Claeskens et al. (2006) and Efron (1986) for other scenarios.

##### Estimated error rate and the selection procedure

We could estimate the error rate for a set of SNPs, Ω, by replacing **p*** with the MLE, **p̂**, in equation (13):

- (14)

However, the MLE, **p̂**, use only a small subset of the training data to estimate any given allele frequency: estimates of *p**_{kj} are based only on subjects from ancestry *k*. We believe this to be wasteful, because ancestries located near each other on an evolutionary tree should have similar allele frequencies. Therefore, we suggest estimating *p**_{kj} by an appropriate average over all of the MLE. Obviously, population *k* and its evolutionary neighbors will have the greatest weights in this average. In other words, we permit our estimates to be slightly biased when *n* is finite, and in return, our estimates will have much smaller variances. We postpone the details, but we will create a Bayesian averaged estimate, **p̃**, of **p*** and will ultimately suggest estimating the error rate by

- (15)

Therefore, our ideal selection procedure would be to estimate the error for every given set of SNPs, and then choose a set with the lowest error rate for its size. In practice, because it is computationally infeasible to search through all sets, we suggest using the greedy algorithm described in the appendix. Because our method does not naturally incorporate linkage disequilibrium, we amend the standard greedy algorithm, so that new SNPs cannot be added to the selected set if they are within approximately 75kb of any previously selected SNP.
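A sketch of this amended greedy search follows (our own illustration; `est_error` stands in for whichever error-rate estimate is being minimized, and the SNP identifiers and positions are hypothetical):

```python
def greedy_select(candidates, positions, est_error, n_select, min_dist=75_000):
    """Forward greedy SNP selection: at each step, add the candidate SNP
    that minimizes the estimated error rate of the growing set, skipping
    any SNP within min_dist base pairs of a previously chosen SNP on the
    same chromosome (the ~75 kb linkage-disequilibrium exclusion)."""
    chosen = []
    for _ in range(n_select):
        best, best_err = None, float("inf")
        for s in candidates:
            if s in chosen:
                continue
            chrom, pos = positions[s]
            if any(positions[c][0] == chrom and abs(positions[c][1] - pos) < min_dist
                   for c in chosen):
                continue  # too close to an already-selected SNP
            e = est_error(chosen + [s])
            if e < best_err:
                best, best_err = s, e
        if best is None:
            break  # no admissible SNP remains
        chosen.append(best)
    return chosen
```

With a toy error function that rewards informative SNPs, a candidate 50 kb from a chosen SNP is skipped in favor of one on another chromosome.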

#### Alternatives to the Improved Bayesian Estimate

Instead of using the improved error rate, we could have used the apparent error rate defined in equation (7). As the apparent error rate has been used for SNP selection previously (Rosenberg, 2005), it will be one of the selection procedures discussed in our simulations and examples. As we will see, this estimate always underestimates the true error rate.

Because the apparent error performs poorly when dealing with hundreds of thousands of SNPs, we offer another selection procedure. This selection procedure uses the nonparametric 0.632+ bootstrap estimate of the error rate. We tried other nonparametric methods, including various forms of cross-validation, but in our simulations, we always found the 0.632+ bootstrap estimate to be the most accurate and precise. In general, the 0.632+ estimate outperforms other nonparametric estimates (Efron, 1983; Efron & Tibshirani, 1997). Here, we briefly explain how to calculate this estimate.

A bootstrap sample, **X***_{b}, *b* ∈ {1, … , *B*}, is a randomly selected sample of *n* pairs of observations from **X**, drawn with replacement. By chance, each bootstrap sample excludes some observations. If we create our prediction rule based on **X***_{b}, we can calculate the error rate for those excluded observations, leading to the leave-one-out bootstrap estimate,

- *err*^{(1)} = (1/*n*) Σ_{i} [Σ_{b} 1(*i* ∉ **X***_{b}) · 1(Ŷ_{b} ≠ *Y _{i}*)] / [Σ_{b} 1(*i* ∉ **X***_{b})]  (16)

The bootstrap 0.632 estimator (Efron, 1983) combines *err*^{(1)} and the observed error,

- *err*_{obs} = (1/*n*) Σ_{i} 1(Ŷ(*G _{i}*, **X**) ≠ *Y _{i}*)  (17)

Note that *err*_{obs} uses the training data as test data as well. The bootstrap 0.632 estimate is defined as

- *err*_{0.632} = 0.368 · *err*_{obs} + 0.632 · *err*^{(1)}  (18)

The 0.632+ estimate is a slight variation of this estimate, but for brevity, we omit the details here and refer the reader elsewhere (Efron & Tibshirani, 1997).
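As a sketch of the plain 0.632 estimator that the 0.632+ variant builds on (our own illustration; the 0.368/0.632 weights follow Efron (1983), and the 0.632+ overfitting correction is omitted):

```python
import random

def bootstrap_632(data, fit, err, B=200, seed=0):
    """Efron's 0.632 bootstrap estimate of prediction error.
    data: list of (x, y) pairs; fit(train) returns a prediction rule;
    err(rule, test) returns that rule's error rate on test."""
    rng = random.Random(seed)
    n = len(data)
    err_obs = err(fit(data), data)  # apparent (resubstitution) error
    total, count = 0.0, 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        train = [data[i] for i in idx]
        held_out = [data[i] for i in range(n) if i not in set(idx)]
        if held_out:  # score only observations the bootstrap sample excluded
            total += err(fit(train), held_out)
            count += 1
    err_loo = total / count  # leave-one-out bootstrap estimate
    return 0.368 * err_obs + 0.632 * err_loo
```

Any classifier can be plugged in through `fit` and `err`; the estimate always lies between the optimistic resubstitution error and the pessimistic leave-one-out error.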

#### Calculating the Bayesian Averaged Estimates

As promised earlier, we now describe the Bayesian averaged estimates of the allele frequencies. We found it best to propose a mathematical model that describes the development of the allele frequencies over time. We started with a single set of allele frequencies in one historical population, and then allowed these allele frequencies to change, in steps, as individuals spread around the globe and formed distinct populations. Given this model, we then calculate Bayesian estimates of the allele frequencies: we essentially obtain prior distributions for **p*** based on the evolutionary tree and then update these priors given the MLE obtained from the training dataset. Our proposed estimate is given in equation (21). The remainder of this section shows the derivation of this estimate.

The new estimate takes advantage of a known evolutionary tree. Assume the tree has *n _{n}* nodes (see Figs 1, 2, and 3 for examples). There are *n _{e}* terminal nodes, each representing an observed population, and *n _{n}* − *n _{e}* interior nodes, each representing a historical, combined population. Label the nodes 1, 2, … , *n _{n}*. Label the edges by the attached node, 2, … , *n _{n}*, where each edge acquires the label of the larger of its two attached nodes.

_{n}In the actual model, we will assume that groups of populations share a common allele frequency at each SNP. We will introduce a vector, , which identifies those ancestries sharing the same allele frequency. For SNP *j*, 1 ≤*j*≤*N*, we create an *n _{n}*− 1 length vector of binary variables, . If , then

*p*is the same for all populations. If but

_{kj}*S*= 0 for all other

_{vj}*v*≠

*v*

_{1}, then the populations prior to edge

*v*

_{1}will share a common allele frequency, and the populations following edge

*v*

_{1}will share a different common allele frequency. In Figure 3, we label edge 17 for an example. If

*S*

_{17j}= 1 but

*S*= 0 for all other

_{vj}*v*, then populations 15 and 16 would share a common allele frequency, and all other populations would share a different frequency. Since

*S*= 1 allows allele frequencies to vary, we refer to it as a “bottleneck event”. In general, if two nodes, or populations,

_{vj}*k*

_{1}and

*k*

_{2}, can be connected by a set of edges,

*V*, and

*S*= 0 ∀

_{vj}*v*∈

*V*, then . We place a prior distribution on

*S*,

_{vj}*P*(

*S*= 1) =α

_{vj}_{v}, where α

_{v}can be an increasing function of the distance between the nodes adjacent to edge

*v*.

The number of unique allele frequencies, *N*_{Sj}, will be much smaller than the total number of ancestries, *n _{e}*.

- (19)

We denote the *N*_{Sj} unique allele frequencies by *p*_{(1)j}, … , *p*_{(N_Sj)j}, and we denote the numbers of subjects in populations sharing each of those *N*_{Sj} unique allele frequencies by *n*_{(1)j}, … , *n*_{(N_Sj)j}.

- (20)

The Bayesian hierarchical model can be described as follows. We place a uniform prior on **S**. Given **S**, the distribution of **p** is *f*(**p** | **S**) = 1 for any set of allele frequencies consistent with **S**. Given **p**, we know the distribution of the training dataset. After we perform the appropriate integrations, we can calculate the posterior means for **p**.

- (21)

where *n*_{kj1} is the number of *A* alleles in population *k*, *n*_{(κ)j1} is the number of *A* alleles in populations with the κ^{th} unique allele frequency, and *n*_{kj0} and *n*_{(κ)j0} are the corresponding quantities for the *B* allele. To minimize confusion, we note that although *n _{kj}* and *n*_{(κ)j} are numbers of subjects, *n*_{kj0}, *n*_{(κ)j0}, *n*_{kj1}, and *n*_{(κ)j1} are numbers of alleles. Furthermore,

- (22)

Note that the estimate requires specification of the hyperparameters α_{v}. The derivation of this equation is in the Appendix.

The model is, of course, a simplification of how allele frequencies develop. In reality, bottlenecks are rare events, and allele frequencies change gradually over evolutionary time, so allele frequencies for neighboring populations should be highly correlated. For any given *S*_{·j}, the abrupt bottlenecks may distort the estimated allele frequencies. However, by averaging over multiple *S*_{·j}, we obtain allele frequencies that vary smoothly across the evolutionary tree. Therefore, although we could try to incorporate a correlation structure into *f*(**p** | **S**) and allow for more events, these additions did not improve our estimates. We found our proposed model to perform as well as any of the more complex models examined.
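One way to realize the grouping implied by a bottleneck vector **S** is a union-find pass over the tree edges: nodes joined by non-bottleneck edges share a frequency, and the pooled estimate within each group combines all of its allele counts (a sketch under our own data layout, not the paper's implementation):

```python
def shared_freq_groups(n_nodes, edges, S):
    """Partition tree nodes into shared-allele-frequency groups: two nodes
    share a frequency when every edge on the path between them has
    bottleneck indicator S == 0. edges is a list of (node_a, node_b)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), s in zip(edges, S):
        if s == 0:  # no bottleneck on this edge: merge the two sides
            parent[find(a)] = find(b)
    groups = {}
    for v in range(n_nodes):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

def pooled_freq(groups, n_A, n_alleles):
    """Pooled A-allele frequency within each group: total A alleles over
    total alleles, combining all populations that share a frequency."""
    return {tuple(g): sum(n_A[k] for k in g) / sum(n_alleles[k] for k in g)
            for g in groups}
```

On a three-node path with a bottleneck on the second edge, nodes 0 and 1 pool their counts while node 2 keeps its own estimate.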

#### Data and Simulation

##### Simulations: aims

Our main objective is to understand and compare the selection procedures. We start with three sets of simulations, designed to answer three questions:

1. Which is a better estimate of **p**: the MLE, **p̂**, or the Bayesian estimate, **p̃**?
2. Which is a better estimate of the error rate: the apparent error (AE), the 0.632+ bootstrap estimate, or the Improved Bayesian Estimate (IBE)?
3. Which error-rate estimate is best for our selection procedure: AE, 0.632+, or IBE?

##### Simulations: common framework

The three sets of simulations examining these questions share a common framework. There are *n _{e}* ancestries, and these ancestries are related by an evolutionary tree with all edges of equal length. In these simulations, *n _{e}* ∈ {13, 20, 24}, and the possible evolutionary trees are illustrated in Figures 1, 2, and 3. The evolutionary trees with 13 and 24 populations were trimmed versions of the tree that describes the relationships among the HGDP populations (Jakobsson et al., 2008). Details of the trees follow in two sections. We assume that all populations are equally common, and let the training dataset contain an equal number of subjects, 5, 10, or 20, from each population. Simulation results were always based on 10,000 datasets.

For simulation sets 2 and 3, we introduce a new type of error. Recall that the expected error rate, *err*(*G _{i}*(Ω), **p***) (equation (8)), is averaged over all possible training datasets. Now, we let *err _{X}*(*G _{i}*(Ω), **p***) be the error that would be observed given a specific training dataset, **X**.

##### Simulations: description

Simulation 1

*General:* For each dataset containing a group of subjects with one genotyped SNP, calculate the MLE and the Bayesian estimate and compare them to **p***.

*Specifics:* Let *n _{e}* = 20 and *N* = 1. To fairly compare the two estimators, we examine three possible sets of allele frequencies: (1) No variation: *p**_{k} = 0.5 for all *k*. (2) Intercontinental variation: *p**_{k} = 0.5 − 1.5*d*_{1} for *k* ∈ {1, … , 5}, *p**_{k} = 0.5 − 0.5*d*_{1} for *k* ∈ {6, … , 10}, *p**_{k} = 0.5 + 0.5*d*_{1} for *k* ∈ {11, … , 15}, and *p**_{k} = 0.5 + 1.5*d*_{1} for *k* ∈ {16, … , 20}. (3) Intracontinental variation: *p**_{1} = 0.5 − 1.5*d*_{1} − 2*d*_{2}, *p**_{2} = 0.5 − 1.5*d*_{1} − 1*d*_{2}, *p**_{3} = 0.5 − 1.5*d*_{1}, *p**_{4} = 0.5 − 1.5*d*_{1} + 1*d*_{2}, *p**_{5} = 0.5 − 1.5*d*_{1} + 2*d*_{2}, … , where *d*_{1} = 0.2 and *d*_{2} = 0.067.
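The three Simulation-1 frequency scenarios can be generated directly from the description above (a sketch; the function and argument names are ours):

```python
def sim1_freqs(scenario, d1=0.2, d2=0.067, n_e=20):
    """Allele frequencies for Simulation 1: 20 populations arranged in
    four 'continental' blocks of five."""
    if scenario == "none":           # no variation
        return [0.5] * n_e
    block_shift = [-1.5 * d1, -0.5 * d1, 0.5 * d1, 1.5 * d1]
    if scenario == "inter":          # intercontinental variation only
        return [0.5 + block_shift[k // 5] for k in range(n_e)]
    if scenario == "intra":          # add within-block spread of +/- 2*d2
        return [0.5 + block_shift[k // 5] + (k % 5 - 2) * d2
                for k in range(n_e)]
    raise ValueError(scenario)
```

With the defaults, the first population under intracontinental variation sits at 0.5 − 1.5(0.2) − 2(0.067) = 0.066, matching the pattern above.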

Simulation 2

*General:* For each dataset, calculate the apparent error (AE), the 0.632+ bootstrap estimate, and the Improved Bayesian Estimate (IBE). With the additional use of a test set containing 100,000 individuals, calculate *err _{X}*. Compare the three estimated error rates with *err _{X}*.

*Specifics:* Let *n _{e}* ∈ {13, 24} and *N* ∈ {10, 40, 80}. We generate allele frequencies from the more complex evolutionary trees according to the Bayesian model described when we defined the Bayesian estimates. For each SNP, we first generate **S**_{·j}, allowing *t* ∈ {1, … , 3} bottleneck events for the 13-population tree and *t* ∈ {1, … , 4} for the 24-population tree; the *S _{vj}* are iid ∀ *v*. Allele frequencies for each connected set of populations are generated from a uniform[0.05, 0.95] distribution.

Simulation 3

*General:* For each dataset, select the top 40 SNPs according to each of AE, 0.632+, and IBE. Then, using those SNPs and a training dataset, calculate *err _{X}*(Ω*_{AE}, **p***), *err _{X}*(Ω*_{632+}, **p***), and *err _{X}*(Ω*_{IBE}, **p***). We compare these three error rates to see which is the lowest.

*Specifics:* Let *n _{e}* ∈ {13, 20, 24} and *N* ∈ {1000, 10000}. Here, the remaining probability is split evenly over bottleneck events when *n _{e}* = 13 and when *n _{e}* ∈ {20, 24}, where the *S _{vj}* are iid ∀ *v*.

##### HGDP Data

Data 1

The HGDP dataset is more than an example. As it is the dataset that will likely be used for selecting SNPs, the performance of the three possible selection procedures on this specific dataset is of primary importance. As the HGDP grows and changes, the rankings of the three methods will need to be reevaluated. For this comparison, we use only a subset of the data, containing 400 subjects in 24 populations, from the HGDP (population names are given in the Appendix). We limit our focus to those subjects with easily available data (Jakobsson et al., 2008). The evolutionary tree for these groups was based on the pairwise allele-sharing distance among populations and had been previously estimated by Jakobsson et al. (2008). We split the data into 50 sets of 10,000 SNPs. For each set of SNPs, we select the top 40 using the greedy algorithm with each of AE, 0.632+, and IBE on 80% of the data. Then, we estimate the true error rate using the remaining 20%. These error rates are then averaged over all 50 sets of data. Splitting the data into smaller sets was necessary to decide whether the improvement in the set of SNPs selected by IBE is statistically significant. In the supplementary material, we show the results from selecting SNPs according to a different set of methods, in which the top-ranked SNPs, ranked by *F _{ST}*, *I _{n}*, or the optimal rate of correct assignment (*ORCA*), are selected.

Data 2

We use the entire HGDP dataset to select an optimal group of 100 SNPs for distinguishing ancestry. We start by selecting a candidate group of 5,000 SNPs. This group includes the 2000 SNPs (40 SNPs × 50 test sets) chosen from our initial 10,000-SNP searches. We then repeat the analysis described for dataset 1 focusing on populations within each continent separately. Here, we select the top 20, as opposed to the top 40 SNPs. These chosen SNPs comprise the remaining 3000 SNPs (3 continental regions × 20 SNPs × 50 data sets). The top 100 SNPs are selected from this set of 5000 SNPs and listed in the supplementary material.

### Results


#### Simulations

*Simulation 1*

The MLE, **p̂**, are the most commonly used approximations for **p***. The Bayesian estimates, **p̃**, shrink the MLE toward the average value from neighboring populations. Therefore, if the truth is that neighboring populations share a common *A* allele frequency, *p**_{kj}, at SNP *j*, then the mean square error, *MSE ^{ML}*, for the MLE should be larger than the *MSE ^{B}* for the Bayesian estimates. The first two columns in Table 1 show that the improvement can be quite large when all populations in the study share a single *p**_{kj}. When populations can have different allele frequencies, the extent of the advantage or disadvantage depends on the evolutionary tree. The tradeoff between maximum likelihood and Bayesian estimation is a tradeoff between variance and bias: **p̃** can be biased, but will have lower variance. In general, as the number of subjects per population decreases, the ratio *MSE ^{B}* : *MSE ^{ML}* decreases, favoring estimation by **p̃** (Table 1).

^{ML}No var | Intercontinental var | Intracontinental var | ||||
---|---|---|---|---|---|---|

MLE | Bayes | MLE | Bayes | MLE | Bayes | |

5 | 0.025 | 0.002 | 0.02 | 0.016 | 0.019 | 0.013 |

10 | 0.013 | 0.001 | 0.01 | 0.011 | 0.01 | 0.008 |

15 | 0.008 | 0.001 | 0.007 | 0.009 | 0.007 | 0.007 |

*Simulation 2*

We compared three options, , and for the 13 and 24 population examples (Table 2). Clearly, greatly underestimates the true error and the , where *n _{sim}* is the number of simulations, is an order of magnitude larger than the mean square error (MSE) for either of the other estimates. The ratios, and increase as the number of informative SNPs or the number of populations increases. In these simulations, tends to be lower than , but the order reverses as

*N*grows large. The , with its default settings for , slightly overestimates the true value, but when calculating the MSE, this bias is offset by lower variance and a higher correlation between and

*err*.

_{X}13 Populations | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

n_{k} | N | err_{X} | MSE | SD() | cor() | |||||||||

AE | 0.632+ | IBE | AE | 0.632+ | IBE | AE | 0.632+ | IBE | AE | 0.632+ | IBE | |||

5 | 10 | 0.776 | 0.1311 | 0.0269 | 0.0204 | 0.6481 | 0.7662 | 0.786 | 0.0488 | 0.0477 | 0.0449 | 0.805 | 0.851 | 0.9178 |

5 | 40 | 0.4489 | 0.2669 | 0.0311 | 0.0278 | 0.1844 | 0.4461 | 0.4548 | 0.0403 | 0.0653 | 0.0684 | 0.8212 | 0.8815 | 0.9185 |

5 | 80 | 0.2357 | 0.2038 | 0.0287 | 0.023 | 0.0352 | 0.2486 | 0.2361 | 0.0144 | 0.0477 | 0.048 | 0.591 | 0.8456 | 0.8776 |

10 | 10 | 0.7797 | 0.133 | 0.0274 | 0.0197 | 0.6499 | 0.7684 | 0.7876 | 0.0505 | 0.0502 | 0.0467 | 0.8146 | 0.8674 | 0.9219 |

10 | 40 | 0.4426 | 0.263 | 0.03 | 0.0282 | 0.1819 | 0.4386 | 0.4476 | 0.0383 | 0.0625 | 0.0643 | 0.7828 | 0.8796 | 0.9027 |

10 | 80 | 0.2284 | 0.1987 | 0.0274 | 0.0221 | 0.0332 | 0.2415 | 0.2286 | 0.0134 | 0.0482 | 0.0494 | 0.5882 | 0.8676 | 0.8945 |

24 Populations | ||||||||||||||

n_{k} | N | err_{X} | MSE | SD() | cor() | |||||||||

AE | 0.632+ | IBE | AE | 0.632+ | IBE | AE | 0.632+ | IBE | AE | 0.632+ | IBE | |||

5 | 10 | 0.9037 | 0.1033 | 0.016 | 0.0099 | 0.8021 | 0.8962 | 0.9084 | 0.0306 | 0.0265 | 0.0219 | 0.802 | 0.8474 | 0.9178 |

5 | 40 | 0.7792 | 0.3432 | 0.0226 | 0.0207 | 0.4368 | 0.7704 | 0.7928 | 0.046 | 0.0455 | 0.0384 | 0.8553 | 0.8907 | 0.9174 |

5 | 80 | 0.6601 | 0.4851 | 0.0227 | 0.0257 | 0.1758 | 0.6573 | 0.6775 | 0.0307 | 0.0507 | 0.0484 | 0.8149 | 0.8952 | 0.9214 |

10 | 10 | 0.9044 | 0.1017 | 0.0154 | 0.0096 | 0.8044 | 0.8985 | 0.9081 | 0.0302 | 0.0267 | 0.0216 | 0.8032 | 0.852 | 0.9109 |

10 | 40 | 0.7814 | 0.3436 | 0.0215 | 0.0207 | 0.4386 | 0.7722 | 0.7949 | 0.0459 | 0.046 | 0.0413 | 0.863 | 0.907 | 0.9267 |

10 | 80 | 0.6542 | 0.4816 | 0.0227 | 0.0259 | 0.1734 | 0.6508 | 0.6699 | 0.0305 | 0.0488 | 0.0483 | 0.8016 | 0.8879 | 0.9043 |

*Simulation 3*

SNPs were selected by the greedy algorithm aimed at minimizing the AE, 0.632+, or IBE estimate of the error rate. For each group, the selected SNPs were ordered by the step in which they were added; SNP 1 is therefore essentially the most informative and SNP 40 the least informative. For each group, the error rate was calculated (via simulation) when the top *T* SNPs were used, *T*∈{1, … , 40}, and is illustrated in Figure 4. The main point is that when more than three SNPs were used, the SNPs in Ω_{AE} (the set chosen using the apparent error) proved to be poor predictors of the true ancestries. Selection based on the 0.632+ estimate resulted in lower error rates, and selection based on the IBE resulted in the lowest error rates. These simulations therefore clearly suggest that the apparent error is an extremely inefficient basis for selection and that the IBE can be the most efficient. However, as the simulation model unfairly favors the IBE, we hold off on general statements about the IBE-based selection procedure until we see the results for the HGDP data.

#### Data

*Data 1*

Selecting from groups of 10,000 SNPs, we denoted the resulting sets of 40 SNPs by Ω*_{AE}, Ω*_{0.632+}, and Ω*_{IBE}. These SNPs and their corresponding allele-frequency estimates were then used to predict the ancestry of the 80 subjects in the separate test dataset, resulting in three sets of error rates, *err*_{AE}, *err*_{0.632+}, and *err*_{IBE}. These error rates were then averaged over all 50 sets of 10,000 SNPs to produce Figure 5. The results are similar to those from the simulations, showing that SNP selection by the IBE outperformed both of the other selection procedures so long as there were more than eight SNPs. Using only eight SNPs, 77% of the subjects were assigned to a population in the correct continental region. Additional SNPs were selected to distinguish intracontinental populations. At this stage in the selection procedure, differences in allele frequencies due to random chance can rival informative differences, and because the IBE is designed to remove those that occur by chance, it starts to perform better.
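The prediction step itself is a straightforward maximum-likelihood assignment: given each population's estimated allele frequencies, an individual is assigned to the population that maximizes the genotype likelihood under Hardy–Weinberg equilibrium. A minimal sketch (the function name and toy data are ours, not the paper's):

```python
import numpy as np

def assign_ancestry(genotypes, freqs):
    """Assign each individual to the population maximizing the genotype
    likelihood under Hardy-Weinberg equilibrium.

    genotypes: (n_ind, n_snp) array of minor-allele counts in {0, 1, 2}
    freqs:     (n_pop, n_snp) array of minor-allele frequencies
    Returns the index of the most likely population per individual.
    """
    eps = 1e-6                       # guard against log(0)
    p = np.clip(freqs, eps, 1 - eps)
    logl = np.zeros((genotypes.shape[0], p.shape[0]))
    for k in range(p.shape[0]):
        pk = p[k]
        # log P(g | p) for g in {0,1,2}: (1-p)^2, 2p(1-p), p^2
        ll = np.where(genotypes == 0, 2 * np.log(1 - pk),
             np.where(genotypes == 1, np.log(2 * pk * (1 - pk)),
                      2 * np.log(pk)))
        logl[:, k] = ll.sum(axis=1)
    return logl.argmax(axis=1)

# Two toy populations differing strongly at 5 SNPs
freqs = np.array([[0.9] * 5, [0.1] * 5])
genos = np.array([[2, 2, 2, 1, 2],   # looks like population 0
                  [0, 0, 1, 0, 0]])  # looks like population 1
print(assign_ancestry(genos, freqs))  # → [0 1]
```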

We then compared the selection methods based on the 0.632+ and Improved Bayesian Estimate (IBE) of the error rate to see whether selection by the IBE produced a statistically significantly better set of SNPs than selection by the 0.632+ estimate. Figure 6 shows the difference in error rate, and a point-wise 95% CI, using the sample variance of the 50 values and assuming normality. The improvement was statistically significant. Although training sets contained only 80% of the data, we presume this benefit persists when selecting SNPs using all individuals. For future studies, we recommend splitting the data into training and test sets, or using a cross-validation approach, to choose the optimal method for selecting SNPs and, when desirable, to tune the hyperparameter α. Here, using simulations as our guide, we let (*n _{v}*− 1)α_{v}= 7 ∀ *v*.

The error rate is still near 50% when Ω* includes 40 markers. However, a more detailed analysis shows that the majority of errors involve classifying a subject from population *k*_{1} to population *k*_{2}, where *k*_{1} and *k*_{2} are close to each other on the evolutionary tree. Figure 7, created by *superStruct* (available at http://bioinformatics.med.yale.edu/group/josh/FOSSIL.html), is similar to the output from STRUCTURE and shows that using 2000 markers reduces the error rate to near 0%. Each point on the x-axis corresponds to one of the subjects from one of the test datasets, and above that point is a series of 24 stacked bars. Each bar has a unique color and represents a single population; the height of the bar corresponding to population *k* is proportional to the posterior probability of that population. Populations within the same continental region are different shades of the same color. The total number of subjects described by Figure 7A is 3650 (= 73 subjects × 50 datasets). As for the overall potential of SNPs, we examined the predictive accuracy of all 2000 SNPs (40 SNPs × 50 datasets) and found near-perfect identification for the majority of the 73 subjects (Fig. 7B). The six predictions that disagreed with the self-identification were to neighboring populations. This figure shows that we can do better than predicting continental origin.

*Data 2*

We used the IBE to select an optimal set of 100 SNPs. Those SNPs are listed in the supplementary material.

### Discussion


This article has introduced two ideas with an influence that should extend beyond SNP selection procedures. First, we offer an improved method for estimating the population-specific allele frequencies. Second, we offer an improved method for estimating the error rate for prediction rules using genotypes. In fact, this latter method can be applied to any classification problem based on logistic regression. Focusing on the SNP selection procedures, we have demonstrated that selecting SNPs to minimize the IBE, instead of the apparent error, can lead to a group of SNPs that predicts ancestry with high accuracy.

The apparent error rate and MLE have been successfully used in the past for selecting SNPs (Rosenberg, 2005). However, here the apparent error performed poorly. The main difference is that the number of populations has increased. With more populations, SNPs that truly separate a small group of populations no longer stand out. The population differences in estimated allele frequencies caused by sample selection are relatively small, but, in terms of the overall importance of a SNP, these differences are additive. Also, as the number of populations increases, we need more SNPs. As the number of needed SNPs increases, the improvement due to each additional SNP decreases, and it becomes more likely that a noninformative SNP will appear to be the best candidate. More general limitations of the apparent error rate are that it estimates the wrong quantity and cannot account for the fact that estimates of allele frequencies for some populations (i.e., those with more subjects in the training dataset) should be more accurate than others.

In this article, we never actually considered the plug-in estimate of the error rate, obtained by treating the estimated allele frequencies as the truth, and here we discuss one of its limitations and our reason for avoiding it. Although it has gone unstated in the literature, this estimate can be heavily biased because using the estimated frequencies will exaggerate the true accuracy of the prediction rule. The following simple example illustrates the bias. Let there be one gene and two populations, where the allele frequencies in the populations are *p*_{1} and *p*_{2}. As an extremely rough approximation, suitable only for illustration, consider the error to be a function of the difference *log*(*p*_{1}) −*log*(*p*_{2}) ≡*log*(*p*_{1}/*p*_{2}). The true error generally increases as the distance between *p*_{1} and *p*_{2} decreases, with *err* attaining its maximum of 0.5 when *log*(*p*_{1}/*p*_{2}) = 0, that is, when the allele frequencies are the same in both populations. Now, assume we are unlucky, and the truth happens to be *log*(*p*_{1}/*p*_{2}) = 0. The estimated log ratio is distributed around its true value of zero, so the estimated error falls below 0.5, and the estimate understates the true error.
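This optimism can be seen in a small Monte Carlo experiment. The sketch below is our construction, not the paper's simulation: for a single biallelic locus with *p*_{1} = *p*_{2} = 0.5, no rule can beat an error of 0.5, yet the error computed by plugging the estimated frequencies into the single-locus Bayes rule averages below 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 5000
p1 = p2 = 0.5          # identical populations: true error of any rule is 0.5
plug_in = []
for _ in range(reps):
    # estimate each population's allele frequency from n training alleles
    p1h = rng.binomial(n, p1) / n
    p2h = rng.binomial(n, p2) / n
    # plug-in error of the single-locus Bayes rule built from the estimates:
    # 1 - (max(p1h, p2h) + max(1 - p1h, 1 - p2h)) / 2, which is < 0.5
    # whenever the estimates differ by chance
    plug_in.append(1 - (max(p1h, p2h) + max(1 - p1h, 1 - p2h)) / 2)

print(round(float(np.mean(plug_in)), 3))  # strictly below the true 0.5
```

Averaged over replicates, the plug-in estimate sits well below 0.5, which is exactly the optimistic bias described above.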

The bias discussed above is absent from the 0.632+ estimate, and therefore, without the additional information from the evolutionary tree, the 0.632+ estimate would be the preferred method for estimating the error rate. However, we did find that the IBE performed favorably when compared to the 0.632+ estimate in the Results section. Because of the nature of nonparametric estimates, it would be difficult to introduce the information from the evolutionary tree into the 0.632+ estimate.

Our study focused on individuals with only a single ancestry. However, our general conclusions about the SNP-selection procedure and the selected SNPs will be valid when our goal is to identify the multiple ancestries of admixed individuals. Obviously, the selected group of SNPs will need to be expanded to attain similar error rates. We suspect that the total number of SNPs needed to identify one of the admixed ancestries will be inversely proportional to the percentage of an individual's genome originating with that ancestry. Instead of looking for ancestries of an individual, we would now be looking for ancestries of sections of the chromosomes. Admixture, therefore, requires a selection procedure that assumes only a random subset of the chosen SNPs will actually be available to identify a given ancestry. Therefore, the selected set should include some redundancy. This will also safeguard against genotyping error. We are currently exploring solutions for our two objectives in admixtures.

The next goal, already under examination, is how to incorporate the HGDP data and the knowledge of the optimal set of SNPs in identifying population substructure in genome-wide association studies (GWAS). First, most GWAS are large enough to contribute their own information about allele frequencies in populations. Second, GWAS are often more influenced by large-scale population substructure and may not need to identify populations that are not well represented in the study. However, this focus is likely to change as we start searching for rare, disease-causing mutations.

### Acknowledgements


This work was supported, in part, by NIJ grants 2007-DN-BX-K197 and 2010-DN-BX-K225 to KKK awarded by the National Institute of Justice, Office of Justice Programs, US Department of Justice. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of the US Department of Justice.

### References


- (2003) Human population genetic structure and inference of group membership. Am J Hum Genet 72, 578–589.
- (2008) Ancestry estimation and correction for population stratification in molecular epidemiologic association studies. Cancer Epidemiol Biomarkers Prev 17, 471–477.
- (2008) Forensically relevant SNP classes. BioTechniques 44, 603–610.
- (2006) Variable selection for logistic regression using a prediction-focused information criterion. Biometrics 62, 972–979.
- (2006) Investigation of single-nucleotide polymorphisms associated with ethnicity. Progress in Forensic Genetics 11—Proceedings of the 21st International ISFG Congress. International Congress Series 1288, 79–81.
- Efron, B. (1983) Estimating the error rate of a prediction rule: Improvement on cross-validation. J Am Stat Assoc 78, 316–331.
- Efron, B. (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81, 461–470.
- Efron, B. & Tibshirani, R. (1997) Improvements on cross-validation: The .632+ bootstrap method. J Am Stat Assoc 92, 548–560.
- (1972) Estimating phylogenetic trees from distance matrices. Amer Nat 106, 645–668.
- (2004) Assessing the impact of population stratification on genetic association studies. Nat Genet 36, 388–393.
- Hastie, T., Tibshirani, R. & Friedman, J. (2001) *The Elements of Statistical Learning*. Springer Series in Statistics. New York, NY: Springer.
- (2006) TAMAL: An integrated approach to choosing SNPs for genetic studies of human complex traits. Bioinformatics 22, 626–627.
- (2008) Genotype, haplotype, and copy number variation in worldwide human populations. Nature 451, 998–1003.
- Jorde, L. B. & Wooding, S. P. (2004) Genetic variation, classification and ‘race’. Nat Genet 36, S28–S33.
- (2006) Proportioning whole-genome single-nucleotide polymorphism diversity for the identification of geographic population structure and genetic ancestry. Am J Hum Genet 78, 680–690.
- (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100–1104.
- (2001) Inferring ethnic origin by means of an STR profile. Forensic Sci Int 119, 17–22.
- Marchini, J., Cardon, L. R., Phillips, M. S. & Donnelly, P. (2004) The effects of human population structure on large genetic association studies. Nat Genet 36, 512–517.
- Michie, D., Spiegelhalter, D. J. & Taylor, C. C. (1994) *Machine Learning, Neural and Statistical Classification*. Englewood Cliffs, NJ: Prentice Hall.
- (2009) An ancestry informative marker set for determining continental origin: Validation and extension using human genome diversity panels. BMC Genetics 10, 39.
- (2007) PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 3, e160.
- (2007) Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs. Forensic Sci Int Genet 1, 273–280.
- Rosenberg, N. A. (2005) Algorithms for selecting informative marker panels for population assignment. J Comput Biol 12, 1183–1201.
- Rosenberg, N. A., Li, L. M., Ward, R. & Pritchard, J. K. (2003) Informativeness of genetic markers for inference of ancestry. Am J Hum Genet 73, 1402–1422.
- Rosenberg, N. A., et al. (2002) Genetic structure of human populations. Science 298, 2381–2385.
- Saitou, N. & Nei, M. (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406–425.
- (2008) Application of ancestry informative markers to association studies in European Americans. PLoS Genet 4, e5.
- (1997) Ethnic-affiliation estimation by use of population-specific DNA markers. Am J Hum Genet 60, 957–964.
- (2005) Measures of human population structure show heterogeneity among genomic regions. Genome Res 15, 1468–1476.
- (2005) SNPselector: A web tool for selecting SNPs for genetic association studies. Bioinformatics 21, 4181–4186.
- (2008) Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: Effects on population-based association studies. Am J Hum Genet 83, 445–456.

### Appendix


##### Definition of *F*_{ST} and *I*_{n}

The fixation index (*F _{ST}*) is a measure of population differentiation, usually described as the correlation of randomly chosen alleles within the same subpopulation relative to that found in the entire population. To define *F _{ST}*, let *p**_{k} be the frequency of allele “A" in population *k*, *k*∈{1, … , *n _{e}*}, where *n _{e}* is the number of ancestries and *q**_{k}= 1 −*p**_{k}. Then

- (23) p̄ = Σ_{k} w_{k} p**_{k}, with q̄ = 1 − p̄
- (24) H_{T} = 2p̄q̄ and H_{S} = Σ_{k} w_{k} 2p**_{k}q**_{k}
- (25) F_{ST} = (H_{T} − H_{S})/H_{T}

where *w _{k}* is the relative size of the *k*^{th} subpopulation and Σ_{k}*w _{k}*= 1. In our definitions of these quantities, *w _{k}* is the relative size of each subpopulation in some larger, natural, population, not among the individuals sampled.

Information, *I _{n}*, is based on the idea of statistical entropy as described in Rosenberg et al. (2003). The information, or informativeness for assignment, is

- (26) I_{n} = −p̄ log p̄ − q̄ log q̄ + (1/*n _{e}*) Σ_{k} (p**_{k} log p**_{k} + q**_{k} log q**_{k})

where p̄ = Σ_{k}*p**_{k}/*n _{e}* and *q** = 1 −*p**.
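Both measures are straightforward to compute. The sketch below (function names are ours) implements the heterozygosity form of *F*_{ST} above and the Rosenberg et al. (2003) informativeness for a single biallelic SNP with equal population weights:

```python
import numpy as np

def fst(p, w=None):
    """Wright's fixation index for one biallelic SNP.
    p: allele-'A' frequencies per subpopulation; w: relative sizes."""
    p = np.asarray(p, float)
    w = np.full(len(p), 1 / len(p)) if w is None else np.asarray(w, float)
    p_bar = (w * p).sum()
    h_t = 2 * p_bar * (1 - p_bar)          # expected total heterozygosity
    h_s = (w * 2 * p * (1 - p)).sum()      # mean subpopulation heterozygosity
    return (h_t - h_s) / h_t

def informativeness(p):
    """Rosenberg et al. (2003) informativeness for assignment, I_n,
    for one biallelic SNP with equal population weights."""
    p = np.asarray(p, float)
    K = len(p)
    out = 0.0
    for a in (p, 1 - p):                   # sum over both alleles
        a_bar = a.mean()
        with np.errstate(divide="ignore", invalid="ignore"):
            term = -a_bar * np.log(a_bar) + np.nansum(a * np.log(a)) / K
        out += term
    return out

print(round(fst([0.9, 0.1]), 3))           # strong differentiation
print(round(fst([0.5, 0.5]), 3))           # identical populations → 0.0
```

An uninformative SNP gives both measures near zero, while a SNP with very different frequencies across populations scores highly on both.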

##### Calculating p^{B}

Recall that for each SNP *j*, we defined a set of indicator variables, *S*, representing bottleneck events. Formally, a bottleneck is an evolutionary event in which only a small subset of individuals survives to the next generation, leading to a dramatic change in population allele frequencies. We use the term more liberally to mean any event that allows two neighboring populations to have different allele frequencies. Recall that *N _{Sj}* is the number of unique allele frequencies, *n*_{(κ)j} is the number of subjects with the κ^{th} unique allele frequency, and *n _{e}* is the total number of ancestries. Central to this calculation is the fact that the density of the estimated parameters, conditional on the true parameters, *p*, can be defined by

- (27)

where

- (28)

Note that *f*(·|·) represents a generic conditional density function, with the exact form depending on the variables in that function. Clearly, as *f*(*p*|*S*) is a constant when nonzero,

- (29)

Because the conditional density of *p* is a density function, it must have the form of a beta density for all nonzero values, where *f*_{β}(·|α, β) denotes the beta density with parameters α and β.

With the prior over *S* known, we start by calculating

- (30)

where we calculate *C*_{2} by noting that the integrand is a multiple of *f*_{β}(*p*_{(κ)} |*n*_{(κ)1}+ 1, *n*_{(κ)0}+ 1),

- (31)

where B is the beta function. Because the result is a density function, we can define the normalizing constant and conclude that

- (32)

To calculate the posterior, note that for each value of *S*, we now know

- (33)

where 1(*p _{j}*=*p*_{(κ)}) indicates whether population *j* shares the κ^{th} unique allele frequency. Therefore, we have arrived at

- (34)
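The beta form appearing in (30)–(32) is the usual binomial–beta conjugate update: under a uniform prior, an allele frequency conditional on *n*_{1} observed copies of allele “A" and *n*_{0} copies of “a" follows a Beta(*n*_{1}+ 1, *n*_{0}+ 1) distribution. A minimal sketch of the resulting posterior mean (the function name is ours):

```python
def beta_posterior_mean(n1, n0):
    """Posterior mean of an allele frequency under a uniform Beta(1, 1)
    prior after observing n1 copies of allele 'A' and n0 copies of 'a':
    p | counts ~ Beta(n1 + 1, n0 + 1), so E[p | counts] = (n1+1)/(n1+n0+2)."""
    return (n1 + 1) / (n1 + n0 + 2)

# 7 'A' alleles out of 10 observed chromosomes
print(beta_posterior_mean(7, 3))  # → 0.6666666666666666
```

Relative to the MLE *n*_{1}/(*n*_{1}+*n*_{0}), the posterior mean is shrunk toward 1/2, which stabilizes frequency estimates for populations with few sampled subjects.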

##### Estimating Error

Here, we show that the error estimate is asymptotically normal with mean μ_{ki} and variance σ^{2}_{ki} defined in equation (10). Start by focusing on SNP *j* in population *k*. Recall that *n _{k}* is the number of subjects in population *k* and *n _{e}* is the total number of ancestries. Then, our estimate of *p _{kj}* is

- (35)

where by the central limit theorem we know,

- (36)

Next, we want to define our estimate for

- (37)

and

- (38)

We approximate the distribution of by a linear function of , specifically,

- (39)

where **m**′_{kj}≡**m**′ (*p _{kj}*) and

- (40)

Then, we have the following approximation

- (41)

where

- (42)

##### Greedy Algorithm

If the total number of SNPs available is around 1,000,000, the number of possible groups of size *N _{S}* grows combinatorially, and it is computationally infeasible to search such a large space. Therefore, we propose using a greedy algorithm with *N _{S}* steps.

Step 1: Select the single SNP *j* that minimizes the expected error rate.

Steps 2, … , *N _{S}*: Given the current set of *n*− 1 SNPs, {*j*_{1}, *j*_{2}, … , *j*_{n−1}}, select the SNP that, when added to that set, minimizes the expected error rate.

Although the set chosen by the greedy algorithm is not guaranteed to be the optimal set, it should perform satisfactorily, in that the resulting error rate should be similar to the true minimum.
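The two steps above can be sketched generically; the error-rate function below is a toy surrogate of our own construction (a panel's error decays with its total information), standing in for an estimate such as the IBE:

```python
import numpy as np

def greedy_select(candidates, est_error, n_select):
    """Greedy forward selection: at each step, add the SNP whose inclusion
    minimizes the estimated error rate of the current panel.

    candidates: iterable of SNP ids
    est_error:  function mapping a tuple of SNP ids to an estimated error
    """
    chosen, remaining = [], list(candidates)
    for _ in range(n_select):
        best = min(remaining, key=lambda j: est_error(tuple(chosen) + (j,)))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy surrogate: each SNP carries an "information" score, and a panel's
# error decreases with its total information.
info = {"rs1": 3.0, "rs2": 1.0, "rs3": 2.0}
err = lambda panel: float(np.exp(-sum(info[j] for j in panel)))
print(greedy_select(info, err, 2))  # → ['rs1', 'rs3']
```

Each step evaluates only the remaining candidates against the current panel, so the search costs on the order of *N _{S}* passes over the SNP list rather than an exhaustive enumeration of all subsets.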

##### Population Names

(Continental Region 1) 1: Yoruba (25), 2: Mandenka (20), 3: Bantu (8), 4: San (7), 5: Biaka Pygmy (32), 6: Mbuti Pygmy (15)

(Continental Region 2) 7: Papuan (16), 8: Melanesian (17), 9: Pima (11), 10: Maya (13), 11: Colombian (7), 12: Yakut (15), 13: Mongola (9), 14: Daur (10), 15: Cambodian (10), 16: Yi (10)

(Continental Region 3) 17: Burusho (7), 18: Kalash (18), 19: Balochi (15), 20: Russian (13), 21: Druze (43), 22: Bedouin (47), 23: Palestinian (26), 24: Mozabite (96)

Format: Population Number: Population Name (Number of subjects in population)

### Supporting Information


**Table S1:** Top 100 SNPs.

**Figure S1:** Comparison of error rates: *F _{ST}*, *I _{n}*, and ORCA.

| Filename | Format | Size | Description |
|---|---|---|---|
| AHG_656_sm_suppmat.pdf | PDF | 20K | Supporting info item |

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.