Keywords:

  • Ancestry;
  • ethnicity;
  • SNPs;
  • error rate;
  • allele frequency;
  • genotype;
  • AIM;
  • bootstrap;
  • FOSSIL

Summary


An individual's genotypes at a group of single-nucleotide polymorphisms (SNPs) can be used to predict that individual's ethnicity or ancestry. In medical studies, knowledge of a subject's ancestry can minimize possible confounding, and in forensic applications, such knowledge can help direct investigations. Our goal is to select a small subset of SNPs, from the millions already identified in the human genome, that can predict ancestry with a minimal error rate. The general form of this variable selection procedure is to estimate the expected error rates for sets of SNPs using a training dataset and to consider those sets with the lowest error rates given their size. The quality of the estimate of the error rate determines the quality of the resulting SNPs. Because the apparent error rate performs poorly when either the number of SNPs or the number of populations is large, we propose a new estimate, the Improved Bayesian Estimate. We demonstrate that selection procedures based on this estimate produce small sets of SNPs that can accurately predict ancestry. We also provide a list of the 100 optimal SNPs for identifying ancestry.


Introduction


An individual's genotypes at a group of single nucleotide polymorphisms (SNPs) can be used to predict that individual's ethnicity or ancestry (Shriver et al., 1997; Rosenberg et al., 2002; Jorde & Wooding, 2004; Weir et al., 2005; Paschou et al., 2007; Yamaguchi-Kabata et al., 2008). Identifying ancestry through this approach is often useful (Bamshad et al., 2003; Li et al., 2008; Seldin & Price, 2008). For example, subjects in a medical study may be genotyped because adjusting for precise ancestry can minimize one source of confounding (Freedman et al., 2004; Marchini et al., 2004; Barnholtz-Sloan et al., 2008). Similarly, a sample from a crime scene may be genotyped so that ancestry can be included in the description of a suspect (Lowe et al., 2001; Daniel et al., 2006; Budowle & van Daal, 2008). Although millions of SNPs have been identified, only a small subset needs to be genotyped in order to accurately predict ancestry. Reducing the number needed to tens or hundreds is still useful even in the era of SNP microarrays. First, genotyping only a few hundred SNPs, compared to the hundreds of thousands of SNPs on a microarray, should be less expensive (Seldin & Price, 2008; Nassir et al., 2009). Second, by basing predictions on only the most informative SNPs, we remove variability caused by considering SNPs with little information. In this article, we aim to describe a method for selecting an “optimal” group of SNPs, that is, a group that has maximal predictive accuracy given its size.

The first set of methods for marker selection ranked SNPs individually by some measure, calculated from a training dataset, of their ability to distinguish ancestries, such as the estimated values of FST, the allele frequencies, p, or the informativeness for assignment, In, and then selected the top-ranked SNPs (Rosenberg et al., 2003). See the Appendix for formal definitions of In and FST. This approach led to redundant markers, and the majority of the selected markers separated African from European ancestries. The next set of methods was therefore more advanced (Xu et al., 2005; Hemminger et al., 2006; Phillips et al., 2007) and included (1) selecting those SNPs that are the strongest contributors to the principal components (Paschou et al., 2007), (2) selecting the 1000 SNPs with the highest FST and then, among those, using a genetic algorithm to jointly select the set with the largest In (Lao et al., 2006), and (3) selecting a set of SNPs with a greedy algorithm aimed at minimizing the apparent error rate (Rosenberg, 2005). Method (1) can still select redundant SNPs and may rank SNPs that distinguish closely related populations relatively low. Method (2) uses unnecessary surrogates, FST and In, for predictive accuracy. Method (3) is the most promising, as it directly tries to minimize the error rate.

Our goal is to improve the latter method by selecting SNPs that minimize a better estimate of the error rate. A selection procedure based on our new estimate, which introduces both an improved form for the error rate and an improved estimate of the allele frequencies, will result in a better group of SNPs. Although our focus is on SNP selection, our discussion should have broader appeal. We will introduce a parametric estimator for the error rate that can be applied to any prediction rule based on genotypes or, even more generally, any prediction rule using logistic regression to discriminate categories. Unlike the apparent error, this method acknowledges that the true allele frequencies are unknown (Efron, 1986; Claeskens et al., 2006). Moreover, our new estimates of the allele frequencies offer a means to reduce the variance of the maximum likelihood estimates (MLE) whenever knowledge of the evolutionary tree is available (Farris, 1972; Saitou & Nei, 1987). Instead of estimating the allele frequencies for an ancestry using only subjects from that ancestry, we now average over all available subjects in the training dataset. The need for these improvements has arisen only now that we attempt to predict ancestry at a finer resolution than continental origin. For a training dataset, we now have access to the Human Genome Diversity Project (HGDP), where 500,000+ SNPs have been genotyped on hundreds of subjects from 54 populations (Jakobsson et al., 2008).

The remainder of the article is organized as follows. In the Methods section, we introduce our new selection procedure. Then, in the Results section, we apply the selection procedure to both simulated and HGDP data. Finally, we conclude with a short discussion.

Methods


Introduction to a New Selection Procedure

Overview

As discussed in the Introduction, our goal is to define a procedure for choosing a small set of SNPs that, when genotyped, can be used to predict an individual's ancestry. We restrict our search to one specific class of candidate procedures. Each selection procedure considered here first estimates the expected error rate for every set of SNPs and then chooses a set with the lowest estimated error rate given its size. Obviously, computational limits prevent truly searching over every set of SNPs, but we deal with that technical issue later. The key point is that we need only define an estimate of the error rate to define a selection procedure.

Notation

Assume we select n individuals from a heterogeneous population containing $n_e$ distinct ancestries or ethnicities, and denote the ancestry of an individual, i, by $Y_i \in \{1, \ldots, n_e\}$. We let $n_k = \sum_{i=1}^{n} 1(Y_i = k)$ be the total number of subjects from ancestry k, where $1(Y_i = k) = 1$ if $Y_i = k$ and 0 otherwise. In the heterogeneous population from which our sample was obtained, we denote the proportion of subjects from ethnicity k by $\pi^*_k$ and let $\pi^* = (\pi^*_1, \ldots, \pi^*_{n_e})$. We will presume that $\pi^*$ can be accurately estimated and treated as a known, fixed quantity. Note that the asterisk, *, denotes the true value of a parameter.

Assume that an individual's genome contains N SNPs, and denote the three genotypes at each SNP by AA, AB, and BB. Denote the genotype for subject i at SNP j by $G_{ij} \in \{AA, AB, BB\}$ and the genotypes at all N SNPs by $G_i = (G_{i1}, \ldots, G_{iN})$. When referring to a subset, $\Omega \subset \{1, \ldots, N\}$, of those N SNPs, we denote the genotypes at those specific SNPs by $G_i(\Omega)$. Let $X_i = (G_i, Y_i)$ be the genotype and ancestry information for subject i, and let $X = \{X_1, \ldots, X_n\}$ be the training dataset.

The genotype frequencies at these N SNPs vary by population. For ancestry k, we denote the frequency of the A allele at SNP j by $p^*_{kj}$, which, together with the Hardy-Weinberg assumption introduced below, determines the proportion of individuals with each genotype g at SNP j, and we let $p^*$ denote the collection of these allele frequencies over all ancestries and SNPs.

Maximum likelihood estimates

The MLE, $\hat{p}$, for $p^*$ will assume Hardy-Weinberg equilibrium. Let

  • $n_{kj1} = \sum_{i=1}^{n} 1(Y_i = k)\left[2 \cdot 1(G_{ij} = AA) + 1(G_{ij} = AB)\right]$  (1)

Then,

  • $\hat{p}_{kj} = \dfrac{n_{kj1}}{2 n_k}$  (2)

Again, note that the MLE, $\hat{p}_{kj}$, are estimators of the true parameters, $p^*_{kj}$.
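
To make the estimator concrete, here is a minimal sketch (in Python, with illustrative names; the original analysis did not necessarily use this language) of the MLE allele frequencies in equations (1) and (2), computed from genotypes coded as counts of the A allele.

```python
import numpy as np

def mle_allele_freqs(genotypes, ancestries, n_populations):
    """MLE of the A-allele frequency for each population at each SNP.

    genotypes  : (n_subjects, n_snps) array with entries 0, 1, 2
                 counting copies of the A allele (BB=0, AB=1, AA=2).
    ancestries : (n_subjects,) array of population labels in {0, ..., K-1}.
    Returns an (n_populations, n_snps) array of estimated frequencies.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    ancestries = np.asarray(ancestries)
    p_hat = np.zeros((n_populations, genotypes.shape[1]))
    for k in range(n_populations):
        g_k = genotypes[ancestries == k]          # subjects from population k
        n_k = g_k.shape[0]                        # number of subjects in population k
        p_hat[k] = g_k.sum(axis=0) / (2.0 * n_k)  # A-allele count / (2 n_k)
    return p_hat
```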

Prediction rule

For an individual, i, not in the training dataset, we would like to predict his ancestry using his genotype, $G_i(\Omega)$, and the training data, X. Our prediction rule, $\hat{Y}_i$, will assign the most likely ancestry to the individual, i, assuming that the estimates $\hat{p}$ were the true allele frequencies. This prediction rule is asymptotically optimal, in the sense that, as $n \rightarrow \infty$, it will have the lowest possible error rate (Rosenberg et al., 2003). Because of its optimality, we chose this rule over other options (Michie & Spiegelhalter, 1994; Hastie et al., 2001).

To define the prediction rule formally, we need to provide equations for calculating the probability an individual, i, has a specific genotype given his ethnicity and p*. The likelihood of the event can be written as

  • $P(G_i(\Omega) = g \mid Y_i = k, p^*) = \prod_{j \in \Omega} (p^*_{kj})^{2 \cdot 1(g_j = AA)} \left[2\, p^*_{kj}(1 - p^*_{kj})\right]^{1(g_j = AB)} (1 - p^*_{kj})^{2 \cdot 1(g_j = BB)}$  (3)

Next, we use Bayes theorem to define the probability that the individual, i, is from a specific ancestry given his genotype, p*, and $\pi^*$,

  • $P(Y_i = k \mid G_i(\Omega), p^*, \pi^*) = \dfrac{\pi^*_k\, P(G_i(\Omega) \mid Y_i = k, p^*)}{\sum_{k'=1}^{n_e} \pi^*_{k'}\, P(G_i(\Omega) \mid Y_i = k', p^*)}$  (4)

If we knew p*, we would just classify individual i to the ancestry that maximized equation (4). However, without knowing the true value of p*, we replace p* with its estimate, $\hat{p}$. Then, we can define our prediction rule, $\hat{Y}_i$, by

  • $\hat{Y}_i(G_i(\Omega), \hat{p}, \pi^*) = \operatorname*{arg\,max}_{k \in \{1, \ldots, n_e\}} P(Y_i = k \mid G_i(\Omega), \hat{p}, \pi^*)$  (5)

When clear from context, we may use one of the two abbreviations, $\hat{Y}_i$ or $\hat{Y}_i(G_i(\Omega))$. Similarly, when notation becomes cumbersome, we omit some of the arguments from $\hat{Y}_i(G_i(\Omega), \hat{p}, \pi^*)$.
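
The prediction rule in equations (3)-(5) can be sketched as follows; the HWE likelihood, the Bayes update, and the argmax step mirror the description above, while the function names and numerical guards are illustrative assumptions.

```python
import numpy as np

def posterior_ancestry(g, p_hat, priors):
    """Posterior P(Y = k | G, p_hat, priors) for one individual.

    g      : (n_snps,) genotype vector counting A alleles (0, 1, or 2).
    p_hat  : (n_populations, n_snps) estimated A-allele frequencies.
    priors : (n_populations,) population proportions pi.
    """
    g = np.asarray(g, dtype=float)
    p = np.clip(np.asarray(p_hat, dtype=float), 1e-6, 1 - 1e-6)  # guard against log(0)
    # log P(G | Y = k, p) under Hardy-Weinberg equilibrium (equation (3));
    # the log(2) term for heterozygotes is the same for every k.
    log_lik = (g * np.log(p) + (2.0 - g) * np.log(1.0 - p)).sum(axis=1)
    log_lik += np.log(2.0) * np.sum(g == 1.0)
    # Bayes theorem (equation (4)), computed on the log scale for stability.
    log_post = np.log(np.asarray(priors, dtype=float)) + log_lik
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def predict_ancestry(g, p_hat, priors):
    """Prediction rule (equation (5)): the ancestry with the largest posterior."""
    return int(np.argmax(posterior_ancestry(g, p_hat, priors)))
```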

Error rate

There are two types of error rates for prediction. First, there is the expected error rate when p* is known. Second, there is the expected error rate when p* is unknown and an estimate, inline image, must be used in its place. These two error rates are distinct, and the error rate of interest is the second one. In the genetics literature, we are the first to propose a parametric estimate for this second, realistic, error rate.

Before describing our estimate, let us consider the example where we use known values of p* and π* to predict the ancestry of individual, i, and another group aims to estimate our error rate. If this other group also knew the true parameters, they could accurately estimate our error rate by

  • image(6)

Note that equation (6) is calculated as one minus the probability of correctly predicting the ancestry. Now, if this other group knew only $\hat{p}$, their best estimate of our error rate would be to plug $\hat{p}$ and $G_i(\Omega)$ into that same function. The result is the apparent error rate

  • image(7)

However, the true scenario, where our predictions are based only on estimates of p*, presents a far more difficult challenge. There is no closed-form equivalent to equation (6). Even if the other group knew the true parameters, they could not precisely calculate our error rate. In fact, we can only define a function, equation (13), that takes the true parameters as input and outputs a consistent approximation of the error rate. The remainder of this section discusses the derivation of this equation.
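
Since the displayed form of equations (6) and (7) is not reproduced here, the following sketch shows one common reading of the apparent error rate: for each training subject, plug the estimated frequencies into the posterior of equation (4) and average one minus the posterior probability of the predicted ancestry (a misclassification-count version is an equally common choice). The posterior_fn argument can be the posterior_ancestry helper sketched above; all names are illustrative.

```python
import numpy as np

def apparent_error(genotypes, ancestries, posterior_fn, p_hat, priors):
    """Apparent error rate on the training data, with p_hat plugged in for p*.

    posterior_fn(g, p_hat, priors) should return the vector of posterior
    probabilities P(Y = k | g) for one subject.
    """
    errs = []
    for g, y in zip(genotypes, ancestries):
        post = posterior_fn(g, p_hat, priors)
        k_hat = int(np.argmax(post))            # predicted ancestry
        errs.append(1.0 - post[k_hat])          # 1 - P(predicted ancestry | g, p_hat)
    return float(np.mean(errs))
```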

We start by writing down a formula describing the error rate. Given our prediction rule and p*, the expected error rate for a given genotype, $G_i(\Omega)$, averaged over all possible training datasets, can be described by equation (8). The probability of a training dataset is the probability of the observed genotypes given the ancestries (i.e., the product of equation (3) across all subjects). Note that, in the prediction rule, we consider $\hat{p}$ to be a function of X and that $\hat{p}$ is a random variable, as opposed to a fixed value. Here, the terms random and fixed refer to the training data.

  • image(8)

Also, unless stated otherwise, we assume that ancestry is being predicted for an individual not in the training dataset, so inline image and Yi are independent. The probability of correctly predicting the ancestry is the sum, over all possible k, of the probability that the true ancestry is k multiplied by the probability that the predicted ancestry is k.

Unfortunately, equation (8) does not quite offer a way to calculate err(Gi(Ω), p*). For a subject with a specific genotype, we know how to calculate $P(Y_i = k \mid G_i(\Omega), p^*, \pi^*)$ using equation (4). However, we have yet to state a means to calculate the probability that the prediction rule outputs ancestry k, $P(\hat{Y}_i(G_i(\Omega), \hat{p}, \pi^*) = k)$. Below, we suggest that this probability can be approximated by the probability that a value drawn from one normal distribution is greater than $n_e - 1$ values drawn from other normal distributions. Unfortunately, there is no closed-form solution. The details of the derivation are left to the appendix, and here, we offer only a sketch of how to go from equation (8) to equation (12).

Consider the variable inline image. We can plug that variable into the function described by equation (4), to create a new variable inline image. We can estimate the distribution of this continuous variable by

  • image(9)

where

  • image(10)

where inline image= 1, 2, and 3, when Gij= AA, AB, and BB, respectively, and C is a constant independent of ancestry.

The prediction rule will output ancestry inline image, when inline image is the largest among the probabilities for all ancestries. The probability of this event is the same as the probability that inline image is greatest among inline image, where

  • image(11)

Again, we exclude inline image to simplify notation. The Zv are independent as the MLE for population k are derived only from the subjects within population k. Therefore, we take the following to be a satisfactory approximation of the error rate. Let

  • image(12)

and as an approximation for the overall error rate, let

  • image(13)

The foundation of this estimate is the “prediction-focused information criterion” described by Claeskens et al. (2006) and Efron (1986) for other scenarios.

Estimated error rate and the selection procedure

We could estimate the error rate for a set of SNPs, Ω, by replacing p* with its MLE, $\hat{p}$, in equation (13)

  • image(14)

However, the MLE, $\hat{p}$, use only a small subset of the training data to estimate any given allele frequency. Estimates of $p^*_{kj}$ are based only on subjects from ancestry k. We believe this to be wasteful because ancestries located near each other on an evolutionary tree should have similar allele frequencies. Therefore, we suggest estimating $p^*_{kj}$ by an appropriate average of the estimates from all ancestries. Obviously, population k and its evolutionary neighbors will receive the greatest weights in this average. In other words, we permit our estimates to be slightly biased when n is finite, and in return, our estimates will have much smaller variances. The details are postponed until later, but we will create a Bayesian averaged estimate, $p^B$, of p* and will ultimately suggest estimating the error rate by

  • image(15)

Therefore, our ideal selection procedure would be to estimate the error rate for every possible set of SNPs and then choose a set with the lowest error rate for its size. In practice, because it is computationally infeasible to search through all sets, we suggest using the greedy algorithm described in the appendix. Because our method does not naturally incorporate linkage disequilibrium, we amend the standard greedy algorithm so that new SNPs cannot be added to the selected set if they are within approximately 75 kb of any previously selected SNP.
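
A sketch of the greedy forward search with the linkage-disequilibrium filter described above; the 75 kb threshold comes from the text, while the data structures and the error-estimate callback are illustrative assumptions.

```python
def greedy_select(candidates, positions, estimate_error, n_select, min_dist=75_000):
    """Greedy forward selection of SNPs.

    candidates     : list of SNP identifiers.
    positions      : dict mapping SNP id -> (chromosome, basepair position).
    estimate_error : callable taking a list of SNP ids and returning the
                     estimated error rate for that set.
    n_select       : number of SNPs to choose.
    min_dist       : exclude SNPs within this distance of a chosen SNP.
    """
    chosen = []
    for _ in range(n_select):
        best_snp, best_err = None, float("inf")
        for snp in candidates:
            if snp in chosen:
                continue
            # skip SNPs on the same chromosome and too close to a chosen SNP
            chrom, pos = positions[snp]
            if any(chrom == positions[s][0] and abs(pos - positions[s][1]) < min_dist
                   for s in chosen):
                continue
            err = estimate_error(chosen + [snp])
            if err < best_err:
                best_snp, best_err = snp, err
        if best_snp is None:
            break
        chosen.append(best_snp)
    return chosen
```

In the procedure proposed here, the estimate_error callback would be the IBE-based estimate of equation (15) evaluated on the training data.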

Alternatives to $\widehat{err}_{IBE}$

Instead of using the improved error rates, we could have used the apparent error rate, inline image, defined in equation (7). As the apparent error rate has been used for SNP selection previously (Rosenberg, 2005), this will be one of the selection procedures discussed in our simulations and examples. As we will see, this estimate always underestimates the true error rate.

Because the apparent error performs poorly when dealing with hundreds of thousands of SNPs, we offer another selection procedure. This selection procedure uses the nonparametric 0.632+ bootstrap estimate of the error rate. We tried other nonparametric methods, including various forms of cross-validation, but in our simulations, we always found the 0.632+ bootstrap estimate to be the most accurate and precise. In general, the 0.632+ estimate outperforms other nonparametric estimates (Efron, 1983; Efron & Tibshirani, 1997). Here, we briefly explain how to calculate this estimate.

A bootstrap sample, X*b, b ∈ {1, … , B}, is a randomly selected sample of n pairs of observations from X, drawn with replacement. By chance, each bootstrap sample excludes some of the observations. If we create our prediction rule based on X*b, we can calculate the error rate for those excluded observations, leading to the leave-one-out bootstrap estimate, $\widehat{err}^{(1)}$.

  • image(16)

The bootstrap 0.632 estimator (Efron, 1983) combines $\widehat{err}^{(1)}$ and the observed error, $\overline{err}$,

  • image(17)

Note that $\overline{err}$ uses the training data as test data as well. The bootstrap 0.632 estimate, $\widehat{err}_{0.632}$, is defined as

  • $\widehat{err}_{0.632} = 0.368\,\overline{err} + 0.632\,\widehat{err}^{(1)}$  (18)

The 0.632+ estimate, $\widehat{err}_{0.632+}$, is a slight variation of this estimate, but for brevity, we omit the details here and refer the reader elsewhere (Efron & Tibshirani, 1997).
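
A sketch of the bootstrap calculation in equations (16)-(18), using misclassification counts and a generic fit/predict pair; the 0.632+ variant additionally rescales the 0.632 weight using the no-information error rate (Efron & Tibshirani, 1997), a refinement omitted here for brevity. All names are illustrative.

```python
import numpy as np

def bootstrap_632_error(X, y, fit, predict, n_boot=200, seed=0):
    """0.632 bootstrap estimate of the prediction error (Efron, 1983).

    X, y : numpy arrays of genotypes and ancestry labels.
    fit(X, y) returns a fitted model; predict(model, X) returns labels.
    """
    rng = np.random.default_rng(seed)
    n = len(y)

    # Observed (resubstitution) error: train and test on the full data.
    model = fit(X, y)
    err_obs = np.mean(predict(model, X) != y)

    # Leave-one-out bootstrap error: each subject is scored only by
    # bootstrap replicates that exclude it.
    err_sum = np.zeros(n)
    err_cnt = np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample n subjects with replacement
        out = np.setdiff1d(np.arange(n), idx)     # subjects left out of this replicate
        if out.size == 0:
            continue
        model_b = fit(X[idx], y[idx])
        err_sum[out] += (predict(model_b, X[out]) != y[out])
        err_cnt[out] += 1
    err_loo = np.mean(err_sum[err_cnt > 0] / err_cnt[err_cnt > 0])

    # Weighted combination of the two error estimates (equation (18)).
    return 0.368 * err_obs + 0.632 * err_loo
```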

Calculating $p^B$

As promised earlier, we now describe $p^B$. We found it best to propose a mathematical model that describes the development of the allele frequencies over time. We started with a single set of allele frequencies in one historical population and then allowed these allele frequencies to change, in steps, as individuals spread around the globe and formed distinct populations. Given this model, we then calculate Bayesian estimates, $p^B$, of the allele frequencies. We essentially obtain prior distributions for p* based on the evolutionary tree and then update these priors given the MLE obtained from the training dataset. Therefore, our proposed estimate, given in equation (21), is the posterior mean of the allele frequencies. The remainder of this section shows the derivation of this estimate.

The new estimate, $p^B$, takes advantage of a known evolutionary tree. Assume the tree has $n_n$ nodes (see Figs 1, 2, and 3 for examples). There are $n_e$ terminal nodes, each representing an observed population, and $n_n - n_e$ interior nodes, each representing a historical, combined population. Label the nodes 1, 2, …, $n_n$. Label the edges 2, …, $n_n$, where each edge acquires the label of the larger of its two attached nodes.


Figure 1. Evolutionary tree—25 nodes. An example of an evolutionary tree with 25 nodes (nn= 25) and 20 populations (ne= 20). Next to each population node is the population identifier (k).


Figure 2. Evolutionary tree—46 nodes. An example of an evolutionary tree with 46 nodes (nn = 46) and 24 populations (ne = 24). Next to each population node is the population identifier (k). See appendix for population names.


Figure 3. Evolutionary tree—24 nodes. An example of an evolutionary tree with 24 nodes (nn = 24) and 13 populations (ne = 13). Next to each node are both the node identifier (node) and the population identifier (k), listed as k/node. Internal nodes are listed as –/node. Edge 17 is labeled for discussion in the text.


In the actual model, we will assume that groups of populations share a common allele frequency at each SNP. We will introduce a vector, $S_{\cdot j}$, which identifies those ancestries sharing the same allele frequency. For SNP j, 1 ≤ j ≤ N, we create an $n_n - 1$ length vector of binary variables, $S_{\cdot j} = (S_{2j}, \ldots, S_{n_n j})$. If $S_{vj} = 0$ for all v, then $p_{kj}$ is the same for all populations. If $S_{v_1 j} = 1$ but $S_{vj} = 0$ for all other $v \neq v_1$, then the populations prior to edge $v_1$ will share a common allele frequency, and the populations following edge $v_1$ will share a different common allele frequency. In Figure 3, we label edge 17 as an example. If $S_{17j} = 1$ but $S_{vj} = 0$ for all other v, then populations 15 and 16 would share a common allele frequency, and all other populations would share a different frequency. Because $S_{vj} = 1$ allows allele frequencies to vary, we refer to it as a “bottleneck event”. In general, if two nodes, or populations, $k_1$ and $k_2$, can be connected by a set of edges, V, with $S_{vj} = 0$ ∀ v ∈ V, then $p_{k_1 j} = p_{k_2 j}$. We place a prior distribution on $S_{vj}$, $P(S_{vj} = 1) = \alpha_v$, where $\alpha_v$ can be an increasing function of the distance between the nodes adjacent to edge v.
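
To illustrate the role of $S_{\cdot j}$, the sketch below recovers the groups of populations that must share an allele frequency by dropping the bottleneck edges ($S_{vj} = 1$) from the tree and collecting connected components; the tree encoding is a hypothetical choice, not the authors' data structure.

```python
from collections import defaultdict

def shared_frequency_groups(edges, bottleneck, populations):
    """Group populations that must share an allele frequency at one SNP.

    edges      : dict mapping edge label v -> (node_a, node_b) of the tree.
    bottleneck : dict mapping edge label v -> S_vj (1 = bottleneck on edge v).
    populations: set of terminal (population) nodes.
    Two populations share a frequency if they are connected by edges with S_vj = 0.
    """
    graph = defaultdict(set)
    for v, (a, b) in edges.items():
        if bottleneck.get(v, 0) == 0:       # only edges without a bottleneck connect nodes
            graph[a].add(b)
            graph[b].add(a)

    seen, groups = set(), []
    for node in populations:
        if node in seen:
            continue
        # depth-first search over non-bottleneck edges
        stack, comp = [node], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(graph[u] - comp)
        seen |= comp
        groups.append(sorted(comp & populations))
    return groups
```

For the tree in Figure 3, for example, setting S17j = 1 and all other Svj = 0 would return populations 15 and 16 as one group and the remaining populations as another.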

The number of unique allele frequencies, NSj, will be much smaller than the total number of ancestries, ne.

  • image(19)

We denote the $N_{Sj}$ unique allele frequencies by $p_{(1)j}, \ldots, p_{(N_{Sj})j}$, and we denote the number of subjects in populations sharing the $\kappa$th unique allele frequency by $n_{(\kappa)j}$.

  • image(20)

The Bayesian hierarchical model can be described as follows. We place a uniform prior on S. Given S, the distribution of p is f(p | S) = 1 for any set of allele frequencies consistent with S. Given p, we know the distribution of the training dataset. After we perform the appropriate integrations, we can calculate the posterior means for p.

  • image(21)

where $n_{kj1}$ is the number of A alleles in population k, $n_{(\kappa)j1}$ is the number of A alleles in populations with the κth unique allele frequency, and $n_{kj0}$ and $n_{(\kappa)j0}$ are the respective quantities for the B allele. To minimize confusion, we note that although $n_{kj}$ and $n_{(\kappa)j}$ are numbers of subjects, $n_{kj0}$, $n_{(\kappa)j0}$, $n_{kj1}$, and $n_{(\kappa)j1}$ are numbers of alleles. Furthermore,

  • image(22)

Note that $p^B$ requires specification of the hyperparameters $\alpha_v$. The derivation of this equation is given in the appendix.

The model is an obvious simplification for the development of allele frequencies. In history, bottlenecks are actually rare events and allele frequencies change gradually over evolutionary development. Therefore, allele frequencies for neighboring populations should be highly correlated. For any given S·j, the abrupt bottlenecks may distort the estimated allele frequencies. However, by averaging over multiple S·j, we observe allele frequencies that vary smoothly across the evolutionary tree. Therefore, although we could try to incorporate a correlation structure into f(p |S) and allow for more events, these additions did not improve our estimates. We found our proposed model to perform as well as any of the more complex models examined.
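
Conditional on one fixed bottleneck configuration, the uniform prior on each shared frequency combines with the pooled allele counts to give a Beta(n(κ)j1 + 1, n(κ)j0 + 1) posterior, whose mean is sketched below for each population. The full estimator of equation (21) additionally averages these conditional means over configurations S, weighted by their posterior probabilities; that averaging step is omitted here, and the names are illustrative.

```python
def conditional_posterior_means(groups, a_counts, b_counts):
    """Posterior mean allele frequency for each population, given one
    partition of populations into groups that share a frequency.

    groups   : list of lists of population labels sharing a frequency.
    a_counts : dict population -> number of A alleles observed at the SNP.
    b_counts : dict population -> number of B alleles observed at the SNP.
    With a uniform prior on the shared frequency, the posterior is
    Beta(n_A + 1, n_B + 1), so the posterior mean is (n_A + 1)/(n_A + n_B + 2).
    """
    means = {}
    for group in groups:
        n_a = sum(a_counts[k] for k in group)   # pooled A-allele count for the group
        n_b = sum(b_counts[k] for k in group)   # pooled B-allele count for the group
        post_mean = (n_a + 1.0) / (n_a + n_b + 2.0)
        for k in group:
            means[k] = post_mean
    return means
```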

Data and Simulation

Simulations: aims

Our main objective is to understand and compare the selection procedures. We start with three sets of simulations, designed to answer three questions:

  1. Which is a better estimate of p: the MLE, $\hat{p}$, or the Bayesian estimate, $p^B$?
  2. Which is a better estimate of the error rate: $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, or $\widehat{err}_{IBE}$?
  3. Which error rate is best for our selection procedure: $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, or $\widehat{err}_{IBE}$?
Simulations: common framework

The three sets of simulations examining these questions share a common framework. There are ne ancestries, and these ancestries are related by an evolutionary tree with all edges of equal length. In these simulations, ne∈{13, 20, 24} and the possible evolutionary trees are illustrated in Figures 1, 2, and 3. The evolutionary trees with 13 and 24 populations were trimmed versions of the tree that described the relationships among the HGDP populations (Jakobsson et al., 2008). Details of the tree follow in two sections. We assume that all populations are equally common, and let the training dataset contain an equal number of subjects, 5, 10, or 20, from each population. Simulation results were always based on 10,000 datasets.

For simulation sets 2 and 3, we introduce a new type of error. Recall that the expected error rate, err(Gi(Ω), p*) (equation (8)), is averaged over all possible training datasets. Now, we let errX(Gi(Ω), p*) be the error that would be observed given a specific value of $\hat{p}$ or, equivalently, a specific training dataset, X.

Simulations: description

Simulation 1

General: For each dataset containing a group of subjects with one genotyped SNP, calculate $\hat{p}$ and $p^B$ and compare them to p*.

Specifics: Let ne= 20 and N= 1. To fairly compare inline image and inline image, we examine three possible sets of allele frequencies, (1) No variation: p*k= 0.5 for all k, (2) Intercontinental variation: p*k= 0.5 − 1.5d1 for k∈{1, … , 5}, p*k= 0.5 − 0.5d1 for k∈{6, … , 10}, p*k= 0.5 + 0.5d1 for k∈{11, … , 15}, and p*k= 0.5 + 1.5d1 for k∈{16, … , 20}, and (3) Intracontinental variation: p*1= 0.5 − 1.5d1− 2d2, p*2= 0.5 − 1.5d1− 1d2, p*3= 0.5 − 1.5d1, p*4= 0.5 − 1.5d1+ 1 d2, p*5= 0.5 − 1.5d1+ 2d2, … , where d1= 0.2 and d2= 0.067.
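
A minimal sketch of the data-generating step in Simulation 1 under the "intercontinental variation" scenario: genotypes are drawn under HWE from the specified frequencies and the MSE of the MLE is accumulated. In the paper, the Bayesian estimate $p^B$ would be computed on the same simulated data for comparison; the sample sizes and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

d1 = 0.2
# Intercontinental variation: four blocks of five populations.
p_true = np.repeat([0.5 - 1.5 * d1, 0.5 - 0.5 * d1,
                    0.5 + 0.5 * d1, 0.5 + 1.5 * d1], 5)

n_k, n_sim = 10, 1000
mse_mle = 0.0
for _ in range(n_sim):
    # genotype = number of A alleles ~ Binomial(2, p) under Hardy-Weinberg equilibrium
    genotypes = rng.binomial(2, p_true[:, None], size=(p_true.size, n_k))
    p_mle = genotypes.sum(axis=1) / (2.0 * n_k)
    mse_mle += np.mean((p_mle - p_true) ** 2) / n_sim

print(f"MSE of the MLE with {n_k} subjects per population: {mse_mle:.4f}")
```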

Simulation 2

General: For each dataset, calculate $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, and $\widehat{err}_{IBE}$. With the additional use of a test set containing 100,000 individuals, calculate errX. Compare the three estimated error rates with errX.

Specifics: Let ne ∈ {13, 24} and N ∈ {10, 40, 80}. We generate allele frequencies from the more complex evolutionary trees according to the Bayesian model described above when we defined $p^B$. For each SNP, we first generate $S_{\cdot j}$. For inline image for t ∈ {1, … , 3}. For inline image for t ∈ {1, … , 4}. The $S_{vj}$ are iid across v. Allele frequencies for each connected set of populations are generated from a uniform[0.05, 0.95] distribution.

Simulation 3

General: For each dataset, select the top 40 SNPs according to $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, and $\widehat{err}_{IBE}$. Then, using those SNPs and a training dataset, calculate errX(Ω*AE, p*), errX(Ω*632+, p*), and errX(Ω*IBE, p*). We compare these three error rates to see which is the lowest.

Specifics: Let ne ∈ {13, 20, 24} and N ∈ {1000, 10000}. Here, inline image, and the remaining probability is split evenly over inline image events when ne = 13 and inline image events when ne ∈ {20, 24}, where the $S_{vj}$ are iid across v.

HGDP Data

Data 1

The HGDP dataset is more than an example. As it is the dataset that will likely be used for selecting SNPs, the performance of the three possible selection procedures on this specific dataset is of primary importance. As the HGDP grows and changes, the rankings of the three methods will need to be reevaluated. For this comparison, we use only a subset of the data, containing 400 subjects in 24 populations, from the HGDP (population names are given in the appendix). We limit our focus to those subjects with easily available data (Jakobsson et al., 2008). The evolutionary tree for these groups was based on the pairwise allele-sharing distance among populations and had been previously estimated by Jakobsson et al. (2008). We split the data into 50 sets of 10,000 SNPs. For each set of SNPs, we select the top 40 using the greedy algorithm with $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, or $\widehat{err}_{IBE}$ on 80% of the data. Then, we estimate the true error rate using the remaining 20%. These error rates are then averaged over all 50 sets of data. Splitting the data into smaller sets was necessary to decide whether the improvement in the set of SNPs selected by $\widehat{err}_{IBE}$ is statistically significant. In the supplementary material, we show the results from selecting SNPs according to a different set of methods, in which the top-ranked SNPs, where rankings are by FST, In, or the optimal rate of correct assignment (ORCA), are selected.

Data 2

We use the entire HGDP dataset to select an optimal group of 100 SNPs for distinguishing ancestry. We start by selecting a candidate group of 5,000 SNPs. This group includes the 2000 SNPs (40 SNPs × 50 test sets) chosen from our initial 10,000-SNP searches. We then repeat the analysis described for dataset 1 focusing on populations within each continent separately. Here, we select the top 20, as opposed to the top 40 SNPs. These chosen SNPs comprise the remaining 3000 SNPs (3 continental regions × 20 SNPs × 50 data sets). The top 100 SNPs are selected from this set of 5000 SNPs and listed in the supplementary material.

Results


Simulations

Simulation 1

The MLE, $\hat{p}$, are the most commonly used approximations for $p^*$. The Bayesian estimates, $p^B$, shrink the MLE toward the average value from neighboring populations. Therefore, if the truth is that neighboring populations share a common “A” allele frequency, p*kj, at SNP j, then the mean square error for the MLE, $MSE_{ML} = E[(\hat{p}_{kj} - p^*_{kj})^2]$, should be larger than that for the Bayesian estimates, $MSE_B = E[(p^B_{kj} - p^*_{kj})^2]$. The first two columns in Table 1 show that the improvement can be quite large when all populations in the study share a single p*kj. When populations can have different allele frequencies, the extent of the advantage or disadvantage depends on the evolutionary tree. The tradeoff between maximum likelihood and Bayesian estimates is a tradeoff between variance and bias: $p^B$ can be biased, but will have lower variance. In general, as the number of subjects per population decreases, the ratio MSEB:MSEML decreases, favoring estimation by $p^B$ (Table 1).

Table 1.  The MSE, MSE_ML, between the MLE and the truth, and the MSE, MSE_B, between the Bayesian estimate and the truth. These MSE, averaged over all simulations, are listed for different sets of p, where (1) all populations share a common p (No var), (2) all populations within a continent share a common p (Intercontinental var), and (3) all populations have a unique p (Intracontinental var)

nk | No var (MLE, Bayes) | Intercontinental var (MLE, Bayes) | Intracontinental var (MLE, Bayes)
5  | 0.025, 0.002        | 0.02, 0.016                       | 0.019, 0.013
10 | 0.013, 0.001        | 0.01, 0.011                       | 0.01, 0.008
15 | 0.008, 0.001        | 0.007, 0.009                      | 0.007, 0.007

Simulation 2

We compared three options, $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, and $\widehat{err}_{IBE}$, for the 13- and 24-population examples (Table 2). Clearly, $\widehat{err}_{AE}$ greatly underestimates the true error, and its mean square error relative to errX, computed over the nsim simulations, where nsim is the number of simulations, is an order of magnitude larger than the mean square error (MSE) for either of the other estimates. The ratios MSEAE:MSE0.632+ and MSEAE:MSEIBE increase as the number of informative SNPs or the number of populations increases. In these simulations, $\widehat{err}_{0.632+}$ tends to be lower than $\widehat{err}_{IBE}$, but the order reverses as N grows large. The $\widehat{err}_{IBE}$, with its default settings for $\alpha$, slightly overestimates the true value, but when calculating the MSE, this bias is offset by a lower variance and a higher correlation between $\widehat{err}_{IBE}$ and errX.

Table 2.  A comparison of the three methods for estimating error rates: $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, and $\widehat{err}_{IBE}$. The first results column, errX, is the true unconditional error. The remaining column groups give, for each estimate (AE, 0.632+, IBE), the MSE from comparing the estimated errors with errX, the mean of the estimate, the standard deviation of the estimate, and the correlation between the estimate and errX

13 Populations
nk | N  | errX   | MSE (AE, 0.632+, IBE)  | Mean (AE, 0.632+, IBE) | SD (AE, 0.632+, IBE)   | cor (AE, 0.632+, IBE)
5  | 10 | 0.776  | 0.1311, 0.0269, 0.0204 | 0.6481, 0.7662, 0.786  | 0.0488, 0.0477, 0.0449 | 0.805, 0.851, 0.9178
5  | 40 | 0.4489 | 0.2669, 0.0311, 0.0278 | 0.1844, 0.4461, 0.4548 | 0.0403, 0.0653, 0.0684 | 0.8212, 0.8815, 0.9185
5  | 80 | 0.2357 | 0.2038, 0.0287, 0.023  | 0.0352, 0.2486, 0.2361 | 0.0144, 0.0477, 0.048  | 0.591, 0.8456, 0.8776
10 | 10 | 0.7797 | 0.133, 0.0274, 0.0197  | 0.6499, 0.7684, 0.7876 | 0.0505, 0.0502, 0.0467 | 0.8146, 0.8674, 0.9219
10 | 40 | 0.4426 | 0.263, 0.03, 0.0282    | 0.1819, 0.4386, 0.4476 | 0.0383, 0.0625, 0.0643 | 0.7828, 0.8796, 0.9027
10 | 80 | 0.2284 | 0.1987, 0.0274, 0.0221 | 0.0332, 0.2415, 0.2286 | 0.0134, 0.0482, 0.0494 | 0.5882, 0.8676, 0.8945

24 Populations
nk | N  | errX   | MSE (AE, 0.632+, IBE)  | Mean (AE, 0.632+, IBE) | SD (AE, 0.632+, IBE)   | cor (AE, 0.632+, IBE)
5  | 10 | 0.9037 | 0.1033, 0.016, 0.0099  | 0.8021, 0.8962, 0.9084 | 0.0306, 0.0265, 0.0219 | 0.802, 0.8474, 0.9178
5  | 40 | 0.7792 | 0.3432, 0.0226, 0.0207 | 0.4368, 0.7704, 0.7928 | 0.046, 0.0455, 0.0384  | 0.8553, 0.8907, 0.9174
5  | 80 | 0.6601 | 0.4851, 0.0227, 0.0257 | 0.1758, 0.6573, 0.6775 | 0.0307, 0.0507, 0.0484 | 0.8149, 0.8952, 0.9214
10 | 10 | 0.9044 | 0.1017, 0.0154, 0.0096 | 0.8044, 0.8985, 0.9081 | 0.0302, 0.0267, 0.0216 | 0.8032, 0.852, 0.9109
10 | 40 | 0.7814 | 0.3436, 0.0215, 0.0207 | 0.4386, 0.7722, 0.7949 | 0.0459, 0.046, 0.0413  | 0.863, 0.907, 0.9267
10 | 80 | 0.6542 | 0.4816, 0.0227, 0.0259 | 0.1734, 0.6508, 0.6699 | 0.0305, 0.0488, 0.0483 | 0.8016, 0.8879, 0.9043

Simulation 3

SNPs were selected by the greedy algorithm aimed at minimizing $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, or $\widehat{err}_{IBE}$. For each group, the selected SNPs were ordered by the step in which they were added. Therefore, SNP 1 is essentially the most informative and SNP 40 the least informative. For each group, the error rate was calculated (via simulation) when the top T SNPs were used, T ∈ {1, … , 40}, and is illustrated in Figure 4. The main point is that, when more than three SNPs were used, the SNPs in ΩAE (the set chosen using the apparent error) proved to be poor predictors of the true ancestries. Selection based on $\widehat{err}_{0.632+}$ resulted in lower error rates, and selection based on $\widehat{err}_{IBE}$ resulted in the lowest error rates. Therefore, these simulations clearly suggest that the use of $\widehat{err}_{AE}$ is extremely inefficient and that the use of $\widehat{err}_{IBE}$ can be the most efficient. However, as the simulation model unfairly favors $\widehat{err}_{IBE}$, we hold off on general statements about the $\widehat{err}_{IBE}$-based selection procedure until we see the results for the HGDP data.

Figure 4. Comparison of the error rates when the markers are chosen by $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, and $\widehat{err}_{IBE}$, using simulated data.

Data

Data 1

Selecting from groups of 10,000 SNPs, we denoted the resulting sets of 40 SNPs by Ω*AE, Ω*632+, and Ω*IBE. These SNPs and their corresponding allele frequency estimates were then used to predict the ancestry for the 80 subjects in the separate test dataset, resulting in three sets of error rates: errAE, err0.632+, and errIBE. These error rates were then averaged over all 50 sets of 10,000 SNPs to produce Figure 5. The results are similar to those from the simulations, showing that SNP selection by $\widehat{err}_{IBE}$ outperformed both of the other selection procedures so long as there were more than eight SNPs. Using only eight SNPs, 77% of the subjects were assigned to a population in the correct continental region. Additional SNPs were selected to distinguish intracontinental populations. At this stage of the selection procedure, differences in allele frequencies due to random chance can rival informative differences, and because $p^B$ is designed to smooth out those differences that occur by chance, selection by $\widehat{err}_{IBE}$ starts to perform better.

Figure 5. Comparison of the error rates when the markers are chosen by $\widehat{err}_{AE}$, $\widehat{err}_{0.632+}$, and $\widehat{err}_{IBE}$, using HGDP data.

We then compared the selection methods based on the 0.632+ and Imputed Bayesian Error (IBE) estimates of the error rate to see whether selection by $\widehat{err}_{IBE}$ produced a statistically significantly better set of SNPs than selection by $\widehat{err}_{0.632+}$. Figure 6 shows the difference in error rates between the two procedures, together with a point-wise 95% CI computed using the sample variance of the 50 values and assuming normality. The improvement was statistically significant. Although the training sets contained only 80% of the data, we presume this benefit persists when selecting SNPs using all individuals. For future studies, we recommend splitting the data into training and test sets, or using a cross-validation approach, to choose the optimal method for selecting SNPs and, when desirable, to tune the hyperparameters $\alpha_v$. Here, using simulations as our guide, we let (nn − 1)αv = 7 ∀ v.

Figure 6. The difference between the error rates when the markers are chosen by $\widehat{err}_{0.632+}$ and by $\widehat{err}_{IBE}$. The solid line is the difference (error rate using $\widehat{err}_{0.632+}$ minus error rate using $\widehat{err}_{IBE}$), and the dotted lines are the point-wise 95% confidence intervals.

The error rate is still near 50% when Ω* includes 40 markers. However, a more detailed analysis of the predictions shows that the majority of errors involve classifying a subject from population k1 to population k2, where k1 and k2 are close to each other on the evolutionary tree. Figure 7, created by superStruct (available at http://bioinformatics.med.yale.edu/group/josh/FOSSIL.html), is similar to the output from STRUCTURE and shows that using 2000 markers reduces the error rate to near 0%. Each point on the x-axis corresponds to one of the subjects from one of the test datasets, and above that point is a series of 24 stacked bars. Each bar has a unique color and represents a single population. The height of the colored bar corresponding to population k is proportional to the posterior probability that the subject belongs to population k. Populations within the same continental region are different shades of the same color. The total number of subjects described by Figure 7A is 3650 (= 73 subjects × 50 datasets). As for the overall potential of SNPs, we examined the predictive accuracy of all 2000 selected SNPs (40 SNPs × 50 datasets) and found near-perfect identification (i.e., a posterior probability near 1 for the self-identified population) for the majority of the 73 subjects (Fig 7B). The six predictions that disagreed with the self-identification involved neighboring populations. This figure shows that we can do better than predicting continental origin.


Figure 7. A graph to summarize the ancestry information for each individual. The x-axis indicates subject. For each subject, 24 bars, corresponding to the 24 populations, are stacked. The height of a bar, k, is the estimated probability that the individual is from that population. Each bar is a different color (see online version for colored version). Populations from one continent are varying shades of a single color: Red = Africa, Orange = America, Green = S.E. Asia, Blue = EuroAsia. Black lines separate populations. Top (A) and Bottom (B) images show the expected results using 40 and 2000 markers, respectively. The total number of subjects described in the top half of the figure is 3650 (=73 subjects × 50 datasets), whereas the bottom half of the figure describes only 73 subjects.


Data 2

We used inline image to select an optimal set of 100 SNPs. Those SNPs are listed in the supplementary material.

Discussion


This article has introduced two ideas whose influence should extend beyond SNP selection procedures. First, we offer an improved method for estimating population-specific allele frequencies. Second, we offer an improved method for estimating the error rate of prediction rules using genotypes. In fact, this latter method can be applied to any classification problem based on logistic regression. Focusing on the SNP selection procedures, we have demonstrated that selecting SNPs to minimize $\widehat{err}_{IBE}$, instead of $\widehat{err}_{AE}$, can lead to a group of SNPs that can predict ancestry with high accuracy.

The apparent error rate and MLE have been used successfully in the past for selecting SNPs (Rosenberg, 2005). However, here the apparent error performed poorly. The main difference is that the number of populations has increased. With more populations, SNPs that truly separate a small group of populations no longer stand out. The population differences in $\hat{p}$ caused by sample selection are relatively small, but, in terms of the overall importance of a SNP, these differences are additive. Also, as the number of populations increases, we need more SNPs. As the number of needed SNPs increases, the improvement due to each additional SNP decreases, and it becomes more likely that a noninformative SNP can appear to be the best candidate. More general limitations of the apparent error rate are that it estimates the wrong quantity and that it cannot account for the fact that estimates of allele frequencies for some populations (i.e., those with more subjects in the training dataset) should be more accurate than others.

In this article, we never actually considered using the estimate of equation (14), obtained by plugging the MLE, $\hat{p}$, directly into equation (13), and here we discuss one of its limitations and our reason for avoiding it. Although it has gone unstated in the literature, this estimate can be heavily biased because using $\hat{p}$ in place of p* will exaggerate the true accuracy of the prediction rule. The following simple example illustrates that the expected value of this plug-in estimate can fall below the true error rate. Let there be one gene and two populations, where the allele frequencies in the populations are p1 and p2. As an extremely rough approximation, suitable only for illustration, consider the error to be a function of the difference log(p1) − log(p2) ≡ log(p1/p2). The true error generally increases as the distance between p1 and p2 decreases, with err attaining its maximum of 0.5 when log(p1/p2) = 0, or when the allele frequencies are the same in both populations. Now, assume we are unlucky, and the truth happens to be log(p1/p2) = 0. The estimate, log($\hat{p}_1/\hat{p}_2$), is distributed around its true value of 0, so the plug-in error estimate is, on average, less than 0.5, the true error rate.
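
A small Monte Carlo version of this toy example (sample sizes and names are arbitrary): with p1 = p2 = 0.5 the true error of any rule is 0.5, yet the error rate computed as if the estimated frequencies were the truth averages below 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n_per_pop, n_sim = 0.5, 20, 5000   # both populations share p = 0.5

plug_in_errors = []
for _ in range(n_sim):
    # estimate the allele frequency in each population from 2n sampled alleles
    p1_hat = rng.binomial(2 * n_per_pop, p_true) / (2 * n_per_pop)
    p2_hat = rng.binomial(2 * n_per_pop, p_true) / (2 * n_per_pop)
    p_hat = np.clip(np.array([p1_hat, p2_hat]), 1e-6, 1 - 1e-6)

    # genotype probabilities (AA, AB, BB) under HWE for each population
    geno = np.stack([p_hat**2, 2 * p_hat * (1 - p_hat), (1 - p_hat)**2], axis=1)

    # error of the Bayes rule *if* p_hat were the truth (equal priors):
    # for each genotype, the non-predicted population contributes min/2
    plug_in_errors.append(0.5 * np.minimum(geno[0], geno[1]).sum())

print(f"true error: 0.5, mean plug-in error: {np.mean(plug_in_errors):.3f}")
```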

This bias discussed for the plug-in estimate of equation (14) is absent from $\widehat{err}_{0.632+}$, and therefore, without the additional information from the evolutionary tree, $\widehat{err}_{0.632+}$ would be the preferred method for estimating the error rate. However, we did find that $\widehat{err}_{IBE}$ performed favorably when compared to $\widehat{err}_{0.632+}$ in the Results section. Because of the nature of nonparametric estimates, it would be difficult to introduce the information from the evolutionary tree into $\widehat{err}_{0.632+}$.

Our study focused on individuals with only a single ancestry. However, our general conclusions about the SNP-selection procedure and the selected SNPs will remain valid when the goal is to identify the multiple ancestries of admixed individuals. Obviously, the selected group of SNPs will need to be expanded to attain similar error rates. We suspect that the total number of SNPs needed to identify one of the admixed ancestries will be inversely proportional to the percentage of an individual's genome originating from that ancestry. Instead of looking for the ancestries of an individual, we would now be looking for the ancestries of sections of the chromosomes. Admixture, therefore, requires a selection procedure that assumes only a random subset of the chosen SNPs will actually be available to identify a given ancestry. Therefore, the selected set should include some redundancy, which will also safeguard against genotyping error. We are currently exploring solutions to these two objectives for admixed samples.

The next goal, already under examination, is how to incorporate the HGDP data and the knowledge of the optimal set of SNPs when identifying population substructure in genome-wide association studies (GWAS). First, most GWAS are large enough to contribute their own information about allele frequencies in populations. Second, GWAS are often more influenced by large population substructure and may not need to identify populations that are not greatly represented in the study. However, this focus is likely to change as we start searching for rare, disease-causing mutations.

Acknowledgements


This work was supported, in part, by NIJ grants 2007-DN-BX-K197 and 2010-DN-BX-K225 to KKK awarded by the National Institute of Justice, Office of Justice Programs, US Department of Justice. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of the US Department of Justice.

References

  • Bamshad, M. J., Wooding, S., Watkins, W. S., Ostler, C. T., Batzer, M. A. & Jorde, L. B. (2003) Human population genetic structure and inference of group membership. Am J Hum Genet 72, 578–589.
  • Barnholtz-Sloan, J. S., McEvoy, B., Shriver, M. D. & Rebbeck, T. R. (2008) Ancestry estimation and correction for population stratification in molecular epidemiologic association studies. Cancer Epidemiol Biomarkers Prev 17, 471–477.
  • Budowle, B. & van Daal, A. (2008) Forensically relevant SNP classes. BioTechniques 44, 603–610.
  • Claeskens, G., Croux, C. & Kerckhoven, J. V. (2006) Variable selection for logistic regression using a prediction-focused information criterion. Biometrics 62, 972–979.
  • Daniel, R., Walsh, S. J. & Piper, A. (2006) Investigation of single-nucleotide polymorphisms associated with ethnicity. Progress in Forensic Genetics 11—Proceedings of the 21st International ISFG Congress. International Congress Series 1288, 79–81.
  • Efron, B. (1983) Estimating the error rate of a prediction rule: Improvement on cross-validation. J Am Stat Assoc 78, 316–331.
  • Efron, B. (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81, 461–470.
  • Efron, B. & Tibshirani, R. (1997) Improvements on cross-validation: The .632+ bootstrap method. J Am Stat Assoc 92, 548–560.
  • Farris, J. S. (1972) Estimating phylogenetic trees from distance matrices. Amer Nat 106, 645–668.
  • Freedman, M. L., Reich, D., Penney, K. L., McDonald, G. J., Mignault, A. A., Patterson, N., Gabriel, S. B., Topol, E. J., Smoller, J. W., Pato, C. N., Pato, M. T., Petryshen, T. L., Kolonel, L. N., Lander, E. S., Sklar, P., Henderson, B., Hirschhorn, J. N. & Altshuler, D. (2004) Assessing the impact of population stratification on genetic association studies. Nat Genet 36, 388–393.
  • Hastie, T., Tibshirani, R. & Friedman, J. (2001) The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.
  • Hemminger, B. M., Saelim, B. & Sullivan, P. F. (2006) TAMAL: An integrated approach to choosing SNPs for genetic studies of human complex traits. Bioinformatics 22, 626–627.
  • Jakobsson, M., Scholz, S. W., Scheet, P., Gibbs, J. R., VanLiere, J. M., Fung, H.-C., Szpiech, Z. A., Degnan, J. H., Wang, K., Guerreiro, R., Bras, J. M., Schymick, J. C., Hernandez, D. G., Traynor, B. J., Simon-Sanchez, J., Matarin, M., Britton, A., van de Leemput, J., Rafferty, I., Bucan, M., Cann, H. M., Hardy, J. A., Rosenberg, N. & Singleton, A. B. (2008) Genotype, haplotype, and copy number variation in worldwide human populations. Nature 451, 998–1003.
  • Jorde, L. B. & Wooding, S. P. (2004) Genetic variation, classification and ‘race’. Nat Genet 36, s28–s33.
  • Lao, O., Duijn, K. V., Kersbergen, P., Knijff, P. D. & Kayser, M. (2006) Proportioning whole-genome single-nucleotide polymorphism diversity for the identification of geographic population structure and genetic ancestry. Am J Hum Genet 78, 680–690.
  • Li, J. Z., Absher, D. M., Tang, H., Southwick, A. M., Casto, A. M., Ramachandran, S., Cann, H. M., Barsh, G. S., Feldman, M., Cavalli-Sforza, L. L. & Myers, R. M. (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100–1104.
  • Lowe, A. L., Urquhart, A., Foreman, L. A. & Evett, I. W. (2001) Inferring ethnic origin by means of an STR profile. Forensic Sci Int 119, 17–22.
  • Marchini, J., Cardon, L. R., Phillips, M. S. & Donnelly, P. (2004) The effects of human population structure on large genetic association studies. Nat Genet 36, 512–517.
  • Michie, D., Spiegelhalter, D. J. & Taylor, C. C. (1994) Machine Learning, Neural and Statistical Classification. Englewood Cliffs, NJ: Prentice Hall.
  • Nassir, R., Kosoy, R., Tian, C., White, P., Butler, L., Silva, G., Kittles, R., Alarcon-Riquelme, M., Gregersen, P., Belmont, J., De La Vega, F. & Seldin, M. (2009) An ancestry informative marker set for determining continental origin: Validation and extension using human genome diversity panels. BMC Genetics 10, 39.
  • Paschou, P., Ziv, E., Burchard, E. G., Choudhry, S., Rodriguez-Cintron, W., Mahoney, M. W. & Drineas, P. (2007) PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 3, e160.
  • Phillips, C., Salas, A., Sanchez, J., Fondevila, M., Gomez-Tato, A., Alvarez-Dios, J., Calaza, M., de Cal, M. C., Ballard, D., Lareu, M. & Carracedo, A. (2007) Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs. Forensic Sci Int Genet 1, 273–280.
  • Rosenberg, N. A. (2005) Algorithms for selecting informative marker panels for population assignment. J Comput Biol 12, 1183–1201.
  • Rosenberg, N. A., Li, L. M., Ward, R. & Pritchard, J. K. (2003) Informativeness of genetic markers for inference of ancestry. Am J Hum Genet 73, 1402–1422.
  • Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A. & Feldman, M. W. (2002) Genetic structure of human populations. Science 298, 2381–2385.
  • Saitou, N. & Nei, M. (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406–425.
  • Seldin, M. F. & Price, A. L. (2008) Application of ancestry informative markers to association studies in European Americans. PLoS Genet 4, e5.
  • Shriver, M. D., Smith, M. W., Jin, L., Marcini, A., Akey, J. M., Deka, R. & Ferrell, R. E. (1997) Ethnic-affiliation estimation by use of population-specific DNA markers. Am J Hum Genet 60, 957–964.
  • Weir, B. S., Cardon, L. R., Anderson, A. D., Nielsen, D. M. & Hill, W. G. (2005) Measures of human population structure show heterogeneity among genomic regions. Genome Res 15, 1468–1476.
  • Xu, H., Gregory, S. G., Hauser, E. R., Stenger, J. E., Pericak-Vance, M. A., Vance, J. M., Zuchner, S. & Hauser, M. A. (2005) SNPselector: A web tool for selecting SNPs for genetic association studies. Bioinformatics 21, 4181–4186.
  • Yamaguchi-Kabata, Y., Nakazono, K., Takahashi, A., Saito, S., Hosono, N., Kubo, M., Nakamura, Y. & Kamatani, N. (2008) Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: Effects on population-based association studies. Am J Hum Genet 83, 445–456.

Appendix

Definition of FST and In

The fixation index (FST) is a measure of population differentiation and is usually described as the correlation between randomly chosen alleles within the same subpopulation relative to that found in the entire population. To define FST, let p*k be the frequency of allele “A” in population k, k ∈ {1, … , ne}, where ne is the number of ancestries and q*k = 1 − p*k. Then

  • $\bar{p} = \sum_{k=1}^{n_e} w_k\, p^*_k$  (23)
  • $\sigma^2_p = \sum_{k=1}^{n_e} w_k\,(p^*_k - \bar{p})^2$  (24)
  • $F_{ST} = \dfrac{\sigma^2_p}{\bar{p}(1 - \bar{p})}$  (25)

where wk is the relative size of the kth subpopulation and $\sum_{k=1}^{n_e} w_k = 1$. In our definitions of these quantities, wk is the relative size of each subpopulation in some larger, natural population, not among the individuals sampled.

Information, I, is based on the idea of statistical entropy as described in Rosenberg et al. (2003). The information, or informativeness for assignment, of a biallelic SNP is

  • $I_n = -\bar{p}\,\log\bar{p} - \bar{q}\,\log\bar{q} + \sum_{k=1}^{n_e} \frac{1}{n_e}\left(p^*_k \log p^*_k + q^*_k \log q^*_k\right)$  (26)

where $\bar{p} = \frac{1}{n_e}\sum_{k=1}^{n_e} p^*_k$, $\bar{q} = 1 - \bar{p}$, and q*k = 1 − p*k.
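
Based on the definitions above (using the standard biallelic forms where the displayed equations are not reproduced), here is a sketch that computes FST and In for a single SNP from a vector of population allele frequencies.

```python
import numpy as np

def fst(p, w=None):
    """FST for one biallelic SNP from population A-allele frequencies p."""
    p = np.asarray(p, dtype=float)
    w = np.full(p.size, 1.0 / p.size) if w is None else np.asarray(w, dtype=float)
    p_bar = np.sum(w * p)                      # weighted mean frequency
    var_p = np.sum(w * (p - p_bar) ** 2)       # between-population variance
    return var_p / (p_bar * (1.0 - p_bar))

def informativeness(p):
    """Informativeness for assignment, In (Rosenberg et al., 2003),
    for one biallelic SNP with equally weighted populations."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    q = 1.0 - p
    p_bar, q_bar = p.mean(), q.mean()
    entropy_of_mean = -(p_bar * np.log(p_bar) + q_bar * np.log(q_bar))
    mean_neg_entropy = np.mean(p * np.log(p) + q * np.log(q))
    return entropy_of_mean + mean_neg_entropy

# example: allele frequencies in four populations
print(fst([0.2, 0.4, 0.6, 0.8]), informativeness([0.2, 0.4, 0.6, 0.8]))
```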

Calculating pB

Recall that for each SNP j, we defined a set of indicator variables, $S_{\cdot j} = (S_{2j}, \ldots, S_{n_n j})$, representing bottleneck events. Formally, a bottleneck is an evolutionary event in which only a small subset of individuals survives to the next generation, leading to a dramatic change in population allele frequencies. We use the term more liberally to imply any event that allows two neighboring populations to have different allele frequencies. Recall that NSj is the number of unique allele frequencies, $n_{(\kappa)j}$ is the number of subjects in populations with the κth unique allele frequency, and ne is the total number of ancestries. Central to this calculation is the fact that the density of the estimated parameters, $\hat{p}$, conditional on the true parameters, p, can be defined by

  • image(27)

where

  • image(28)

Note that f(·|·) represents a generic conditional density function, with the exact form depending on the variables in that function. Clearly, as f(p|S) is a constant when nonzero,

  • image(29)

Because we know that inline image is a density function, we know that inline image must have the form inline image for all nonzero values, where fβ(· | α, β) is the beta density with parameters α and β.

We know inline image. We start by calculating inline image,

  • image(30)

where we calculate C2 by noting that inline image is a multiple of fβ(p(κ) |n(κ)1+ 1, n(κ)0+ 1),

  • image(31)

where B is beta function. Because we know that inline image is a density function, we know that we can define inline image and conclude that

  • image(32)

To calculate inline image, note that for each value of S, we now know

  • image(33)

where $1(p_{kj} = p_{(\kappa)j})$ indicates whether population k shares the κth unique allele frequency. Therefore, we have arrived at

  • image(34)
Estimating Error

Here, we show that inline image is asymptotically normal with mean μki and variance σ2ki defined in equation (10). Start by focusing on SNP j in population k. Recall that nk is the number of subjects in population k and ne is the total number of ancestries. Then, our estimate of pkj is

  • image(35)

where by the central limit theorem we know,

  • image(36)

Next, we want to define our estimate for

  • image(37)

and

  • image(38)

We approximate the distribution of inline image by a linear function of inline image, specifically,

  • image(39)

where mkjm′ (pkj) and

  • image(40)

Then, we have the following approximation

  • image(41)

where

  • image(42)
Greedy Algorithm

If the total number of SNPs available is around 1,000,000, the number of possible groups grows combinatorially with the group size. It is computationally infeasible to search such a large space. Therefore, we propose using a greedy algorithm with NS steps.

Step 1: Select the single SNP, j1, that minimizes the estimated error rate for the one-SNP set {j1}.

Steps 2, …, NS: Given a set of s − 1 SNPs, {j1, j2, … , js−1}, select the SNP, js, that, when added to the current set, minimizes the estimated error rate.

Although the set chosen by the greedy algorithm is not guaranteed to be the optimal set, it should perform satisfactorily, in that the resulting error rate should be similar to the true minimum.

Population Names

(Continental Region 1) 1: Yoruba (25), 2: Mandenka (20), 3: Bantu (8), 4: San (7), 5: Biaka Pygmy (32), 6: Mbuti Pygmy (15)

(Continental Region 2) 7: Papuan (16), 8: Melanesian (17), 9: Pima (11), 10: Maya (13), 11: Colombian (7), 12: Yakut (15), 13: Mongola (9), 14: Daur (10), 15: Cambodian (10), 16: Yi (10)

(Continental Region 3) 17: Burusho (7), 18: Kalash (18), 19: Balochi (15), 20: Russian (13), 21: Druze (43), 22: Bedouin (47), 23: Palestinian (26), 24: Mozabite (6)

Population Number: Population Name (Number of subjects in population)

Supporting Information


Table S1: Top 100 SNPs.

Figure S1: Comparison of error rates: FST, In, and ORCA.

Filename: AHG_656_sm_suppmat.pdf (PDF, 20 KB), Supporting info item.
