Corresponding author: Scott M. Williams, Department of Genetics, Geisel School of Medicine, Dartmouth College, 78 College Street, HB-6044, Hanover, New Hampshire 03755. Tel: 603 646 8171; E-mail: email@example.com
Using genetic data from an obesity candidate gene study of self-reported African Americans and European Americans, we investigated the number of Ancestry Informative Markers (AIMs) and candidate gene SNPs necessary to infer continental ancestry. Proportions of African and European ancestry were assessed with STRUCTURE (K = 2), using 276 AIMs. These reference values were compared to estimates derived using 120, 60, 30, and 15 SNP subsets randomly chosen from the 276 AIMs and from 1144 SNPs in 44 candidate genes. All subsets generated estimates of ancestry consistent with the reference estimates, with mean correlations greater than 0.99 for all subsets of AIMs, and mean correlations of 0.99 ± 0.003; 0.98 ± 0.01; 0.93 ± 0.03; and 0.81 ± 0.11 for subsets of 120, 60, 30, and 15 candidate gene SNPs, respectively. Among African Americans, the median absolute difference from reference African ancestry values ranged from 0.01 to 0.03 for the four AIMs subsets and from 0.03 to 0.09 for the four candidate gene SNP subsets. Furthermore, YRI/CEU Fst values provided a metric to predict the performance of candidate gene SNPs. Our results demonstrate that a small number of SNPs randomly selected from candidate genes can be used to estimate admixture proportions in African Americans reliably.
Genetic epidemiology studies seek to identify loci with statistically significant allele or genotype frequency differences between cases and controls. If case status covaries with differences in ancestry, many genetic loci can be expected to differ in their allele frequencies irrespective of etiology, potentially giving rise to spurious associations (Clayton et al., 2005; Rosenberg and Nordborg, 2006; Tian et al., 2006; Price et al., 2008; Tian et al., 2008b). To minimize the confounding effects of population substructure, quantitative estimates of individual ancestry need to be considered. Bayesian clustering or maximum likelihood methods are typically used to calculate these estimates (Pritchard et al., 2000; Alexander et al., 2009), or clusters are identified using Principal Components Analysis (PCA) or multidimensional scaling with genotyped genetic markers (Price et al., 2006; Li and Yu, 2008; Novembre and Stephens, 2008).
The number of genotyped markers sufficient to infer and to correct for ancestry depends on the genetic heterogeneity of the populations under study and the informativeness of the markers being used, with informativeness a function of allele frequency differences between the ancestral populations from which the study samples derive. The most commonly used markers for this purpose are termed Ancestry Informative Markers (AIMs), which are selected for their large allele frequency differences between ancestral populations (Shriver et al., 1997; Akey et al., 2002; Rosenberg et al., 2003; Smith et al., 2004; Halder et al., 2008; Nassir et al., 2009). However, genetic variants with relatively small allele frequency differences between populations can also be used to infer ancestry if they are genotyped in sufficient numbers. In large-scale candidate gene studies or in genome-wide association studies (GWAS), the sheer quantity of markers under study makes the use of independent AIMs unnecessary, even for assessing ancestry in highly homogeneous populations. For example, using 960 cancer candidate gene SNPs, Sloan et al. (2009) were able to infer the European country of origin in immigrants to the United States; and using over half a million SNPs, Novembre et al. (2008) were able to characterize substructure in a European population accurate to within a few hundred kilometres of geographic origin.
Although panels of AIMs have been assumed to be necessary in smaller candidate gene studies, the point at which they can be confidently bypassed, owing to a sufficient number of candidate SNPs, remains largely unexplored. Allocco et al. found that as few as 50 SNPs chosen randomly from the HapMap database can assign individuals to their ancestral continent of origin with an average accuracy of 95%, suggesting that AIMs may not be necessary even in studies with relatively few markers (Allocco et al., 2007). However, SNPs chosen randomly from throughout the genome via HapMap are likely to provide more independent information than SNPs chosen from a set of nonrandomly distributed candidate genes, many of which may be in linkage disequilibrium (LD). Moreover, many studies of admixed populations, such as African Americans, require an assessment of the proportion of admixture. These analyses demand more information from genotypic markers than when individuals need only be assigned to their continent of origin.
Herein, we present an analysis of 1300 self-reported African American and 1247 self-reported European American subjects using 276 AIMs and 1144 obesity-related candidate gene SNPs to evaluate the number of AIMs and/or candidate gene SNPs necessary to characterize global ancestry adequately in similarly designed studies.
Materials and Methods
The Southern Community Cohort Study (SCCS) participants provided written informed consent, and protocols were approved by the Vanderbilt University Human Research Program and Institutional Review Board and by the Meharry Medical College Institutional Review Board.
The SCCS is a cohort study of cancer risk disparities related to ancestry and socioeconomic status among populations. Men and women aged 40–79 were recruited in person at community health centers and also by mail across 12 southeastern U.S. states between 2002 and 2009 (Signorello et al., 2005; Signorello et al., 2010). Approximately 86,000 participants were enrolled, with African Americans comprising two-thirds of the study population (http://www.southerncommunitystudy.org).
For an obesity-related candidate gene study within the SCCS, 2157 female and 390 male participants (1300 of self-reported African ancestry and 1247 of self-reported European ancestry) were selected from among those who enrolled from March 2002 to October 2004. Genomic DNA was extracted from blood samples using Qiagen's DNA Purification kits (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions.
Marker Selection and Genotyping
AIMs were drawn from a list of 1509 AIM SNPs from an Illumina-designed panel for ancestry estimation and an additional 360 SNPs with comparably large frequency differences between European (CEU) and African (YRI) samples in HapMap I. Selection used the following criteria: (1) AIMs were at least 5 Mb from any of the 44 candidate gene boundaries to ensure independence from the candidate genes and (2) AIMs displayed the largest allele frequency differences between the CEU and YRI HapMap populations. From this list we chose 300 AIMs, of which 292 passed the Illumina scoring algorithm and were genotyped with the Illumina GoldenGate platform (Illumina Inc., San Diego, CA, USA). All SNPs were assessed for deviations from Hardy–Weinberg equilibrium. Of the 292 AIMs, 276 were successfully genotyped with call rates greater than 95% and subsequently used to estimate African and European ancestry in the SCCS African Americans and European Americans.
An additional panel of 1244 SNPs was selected from the obesity-related candidate genes. The candidate SNPs were selected for the obesity study using a tagSNP approach that combined tagSNPs from both the European (CEU) and Yoruba (YRI) HapMap I data (Thorisson et al., 2005). HapMap SNPs within each gene and an additional 10 kb upstream and downstream of each gene were identified and evaluated by the Illumina scoring algorithm. SNPs that scored poorly or had minor allele frequencies below 0.05 in both CEU and YRI were excluded. LDSelect was then run separately for the CEU and YRI data using an r² cut-off of 0.8 to partition SNPs into LD bins for each population (Carlson et al., 2004). When multiple tagSNPs were in an LD bin, SNPs that tagged both populations were preferentially selected. Among equivalent tagSNPs of a given LD bin, one categorized as a candidate functional SNP or one previously employed on Illumina chips was preferentially selected for assay. The same quality control criteria were applied as for the AIMs, resulting in the removal of 100 SNPs and yielding 1144 for ancestry estimation.
African and European ancestry for each individual was estimated using STRUCTURE (version 2.2.3, http://pritch.bsd.uchicago.edu/structure.html), a software platform that uses a Bayesian clustering algorithm to identify groups of individuals with similar allele frequency profiles (Pritchard et al., 2000). The algorithm estimates the shared population ancestry of individuals based solely on their genotypes, under the assumptions of Hardy–Weinberg equilibrium and linkage equilibrium in ancestral populations. Individuals are assigned admixture estimates, proportions of ancestry summing to 1 across K clusters (K = 2 ancestral populations for all analyses in this study). All runs of STRUCTURE were performed on the ACCRE supercomputing cluster at Vanderbilt University.
To create reference values of African and European ancestry proportions for all 2547 individuals, STRUCTURE was run 10 times (50,000 iterations after a burn-in of 50,000 iterations) using the 276 AIMs. CLUMPP was used to align multiple replicate analyses and to calculate the means of the 10 quantitative ancestry estimates per individual (Jakobsson and Rosenberg, 2007). This procedure was repeated using the 1144 candidate gene SNPs, and the Pearson correlation coefficient between the two sets of individual ancestry estimates was determined.
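The replicate averaging and correlation step described above can be illustrated with a minimal Python sketch. It assumes the replicate runs have already been aligned so that the same cluster represents African ancestry in every run (the label-alignment work that CLUMPP performs is not reproduced here). The function names and the toy values are hypothetical and are not taken from the study.

```python
import math


def mean_per_individual(replicates):
    """Average ancestry estimates across replicate runs.

    replicates: list of runs; each run is a list of African-ancestry
    proportions, one per individual, in the same order in every run.
    Assumes cluster labels are already aligned across runs.
    """
    n_runs = len(replicates)
    return [sum(values) / n_runs for values in zip(*replicates)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data: two replicate runs for three individuals (hypothetical values).
aim_runs = [[0.90, 0.05, 0.80], [0.92, 0.03, 0.78]]
reference = mean_per_individual(aim_runs)

# Hypothetical per-individual means obtained from a second marker set.
candidate = [0.88, 0.06, 0.75]
r = pearson_r(reference, candidate)
```

In the study itself the same calculation would run over 10 replicates and 2547 individuals per marker set.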
From each of the sets of 276 AIMs and 1144 candidate gene SNPs, 100 random subsets of 120, 60, 30, and 15 markers were selected. Markers were selected without replacement for each randomisation. The number of genes represented in each random sample of SNPs was tabulated. STRUCTURE was run 10 times (50,000 iterations after a burn-in of 50,000 iterations) for each of the 800 randomised datasets (100 randomisations for each category of 120, 60, 30, and 15 AIMs and 120, 60, 30, and 15 candidate gene SNPs). The means of the 10 quantitative ancestry estimates per individual per randomisation were calculated, generating 100 sets of 2547 mean ancestry estimates for each of the eight categories (Fig. S1). The Pearson correlation coefficients between each of these sets of individual ancestry estimates and the reference vector were then calculated.
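The resampling design above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the marker identifiers, the SNP-to-gene mapping, and the fixed seed are all hypothetical, and only the subset sizes, the number of draws, and sampling without replacement come from the text.

```python
import random


def draw_subsets(marker_ids, subset_sizes, n_draws, seed=0):
    """Draw n_draws random subsets of each size, without replacement
    within a subset. Returns {size: [subset, subset, ...]}."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return {
        size: [rng.sample(marker_ids, size) for _ in range(n_draws)]
        for size in subset_sizes
    }


def genes_represented(subset, gene_of):
    """Count the distinct genes that contribute at least one SNP."""
    return len({gene_of[marker] for marker in subset})


# Hypothetical identifiers standing in for the 1144 candidate gene SNPs,
# spread evenly over 44 hypothetical genes.
candidate_snps = [f"snp{i}" for i in range(1144)]
gene_of = {snp: f"gene{i % 44}" for i, snp in enumerate(candidate_snps)}

subsets = draw_subsets(candidate_snps, [120, 60, 30, 15], n_draws=100)
genes_in_first_draw = genes_represented(subsets[15][0], gene_of)
```

Repeating the same call on the 276 AIM identifiers would give the other 400 of the 800 randomised datasets.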
To assess how accurately a given individual's ancestry could be estimated using only 120, 60, 30, or 15 AIMs or candidate gene SNPs, each individual's reference estimate of African ancestry was subtracted from each of the individual's 100 estimates of African ancestry per category of 120, 60, 30, or 15 AIMs and candidate gene SNPs. The frequency distributions of the 254,700 absolute values of differences per category were plotted, in addition to the frequency distributions for only the 1300 self-reported African Americans.
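A short sketch of this accuracy summary is given below. The helper names and the toy numbers are hypothetical. The 0.15 tolerance is the threshold used in the Results, and the treatment of a difference of exactly 0.15 as "within" is an assumption.

```python
import statistics


def absolute_differences(reference, estimates_by_draw):
    """Pool |estimate - reference| over every individual and every draw."""
    return [
        abs(estimate - ref)
        for draw in estimates_by_draw
        for estimate, ref in zip(draw, reference)
    ]


def summarize(diffs, tolerance=0.15):
    """Mean, SD, median, and fraction of differences within the tolerance."""
    within = sum(d <= tolerance for d in diffs) / len(diffs)
    return {
        "mean": statistics.mean(diffs),
        "sd": statistics.stdev(diffs),
        "median": statistics.median(diffs),
        "fraction_within": within,
    }


# Toy data: three individuals, two random draws (hypothetical values).
reference = [0.90, 0.50, 0.10]
draws = [[0.85, 0.70, 0.10], [0.95, 0.45, 0.30]]

diffs = absolute_differences(reference, draws)
summary = summarize(diffs)
```

With 2547 individuals and 100 draws, `absolute_differences` returns the 254,700 values per category described above.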
Mean Fst was calculated for each randomisation of candidate gene SNPs, using allele frequency data from the YRI and CEU HapMap samples. For the 1300 self-reported African Americans, the association between Fst and the mean absolute difference of African ancestry estimates from reference values was assessed using linear regression for each category of candidate gene SNPs. The Weir and Cockerham algorithm was used to calculate Fst (Weir and Cockerham, 1984).
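The idea of relating a subset's mean Fst to its accuracy can be sketched as below. Note the substitution: the study used the Weir and Cockerham (1984) estimator, whereas this sketch uses the simpler textbook form Fst = (H_T − H_S)/H_T for a biallelic SNP in two equally weighted populations, which ignores sample-size corrections. All allele frequencies and accuracy values are hypothetical.

```python
def wright_fst(p1, p2):
    """Simplified Fst for one biallelic SNP in two equally weighted
    populations: (H_T - H_S) / H_T. Not the Weir-Cockerham estimator."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return 0.0 if h_total == 0 else (h_total - h_within) / h_total


def mean_fst(freq_pairs):
    """Average per-SNP Fst over a subset of (freq_pop1, freq_pop2) pairs."""
    return sum(wright_fst(a, b) for a, b in freq_pairs) / len(freq_pairs)


def ols(x, y):
    """Ordinary least squares fit of y on x: (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1 - ss_resid / ss_total


# Three hypothetical SNP subsets, each a list of (YRI, CEU) allele frequencies.
toy_subsets = [
    [(0.9, 0.1), (0.5, 0.4)],
    [(0.6, 0.5), (0.3, 0.3)],
    [(0.8, 0.3), (0.2, 0.6)],
]
# Hypothetical mean absolute differences from reference for each subset.
mean_abs_diff = [0.05, 0.12, 0.07]

fst_values = [mean_fst(s) for s in toy_subsets]
slope, intercept, r_squared = ols(fst_values, mean_abs_diff)
```

A negative slope in such a fit corresponds to more differentiated subsets lying closer to the reference values.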
To determine whether the population differentiation of our SNPs was consistent with background levels of genetic distance between African Americans and European Americans, Fst values were first calculated for the 1144 candidate gene SNPs and 276 AIMs using genotype data from our study. These values were then compared to Fst values for 40 random draws of 1144 SNPs from unrelated HapMap phase III CEU and ASW (African Americans in the Southwest USA) samples, using the Kruskal–Wallis one-way analysis of variance test, a nonparametric method for determining whether samples originate from the same distribution. A two-sided significance probability of 0.05 was used to infer nonrandom influences.
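For illustration, a two-group Kruskal–Wallis comparison of the kind described above can be written in plain Python. This is a sketch and not the software used in the study: it handles exactly two groups, applies the standard tie correction, and uses the chi-square approximation with one degree of freedom, for which the upper-tail probability equals erfc(sqrt(H/2)). The function names and inputs are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two samples using the chi-square(1) approximation."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_term = 0.0
    for start, group in ((0, a), (len(a), b)):
        rank_sum = sum(ranks[start:start + len(group)])
        rank_term += rank_sum ** 2 / len(group)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction > 0:
        h /= correction
    h = max(h, 0.0)  # guard against tiny negative rounding error
    p = math.erfc(math.sqrt(h / 2))  # chi-square survival function, df = 1
    return h, p


h_stat, p_value = kruskal_wallis_two_groups([1, 2, 3], [4, 5, 6])
```

In the setting above, `a` would hold the per-SNP Fst values of the 1144 candidate gene SNPs and `b` those of one random HapMap draw, with the comparison repeated for each of the 40 draws.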
Finally, the extent to which correlations with the reference estimates were affected by the number of genes per randomisation was determined. For each category of candidate gene SNPs, the association between the number of genes represented in each randomised sample and the correlation of that sample's ancestry estimates with the reference estimates was assessed using linear regression. Statistical analyses were done using R (http://cran.r-project.org/) and STATA 10.0.
Ancestry Estimated from 276 AIMs and 1144 Candidate Gene SNPs
Reference values of ancestry for the 2547 individuals were calculated using 276 AIMs. The 1300 self-reported African Americans had a mean proportion of 0.92 African ancestry and the 1247 self-reported European Americans a mean African ancestry of 0.01. Nine of the self-reported African Americans were found to have greater than 0.88 proportion European ancestry, suggesting incorrect classification. The 1144 candidate gene SNPs yielded ancestry estimates that were highly correlated with the reference estimates derived from the 276 AIMs (r = 0.989).
Ancestry Estimated from Random Subsets of 276 AIMs
For the full study population, the 15 AIM subsets generated ancestry estimates that were highly correlated with the reference values (r = 0.991; Table 1). Of the total 254,700 ancestry estimates for all subjects generated using 15 AIMs, 93.8% fell within ±0.15 of the corresponding reference estimates. The mean absolute difference from the reference values was 0.06 ± 0.04, and the median was less than 0.01 (Fig. 1). For the self-reported African Americans, the mean absolute difference from the reference values was 0.06 ± 0.07, and the median was 0.03, with the great majority (89.1%) of estimates within ±0.15 of the reference estimates (Fig. 2). Using subsets of 30, 60, and 120 AIMs, the mean absolute difference from the reference values for the self-reported African Americans improved to 0.05 ± 0.06, 0.04 ± 0.04, and 0.02 ± 0.03, respectively, with medians of 0.03, 0.02, and 0.01 (Fig. 2).
Table 1. Correlations between sets of 2547 ancestry estimates derived using the full set of AIMs and those derived using subsets of SNPs
Ancestry Estimated from Random Subsets of 1144 Candidate Gene SNPs
Highly correlated ancestry estimates were also obtained when smaller subsets of the 1144 candidate gene SNPs were used for estimation (Table 1). With 120 candidate gene SNPs, the mean Pearson correlation coefficient with the reference estimates was 0.986 ± 0.003, indicating that little information was lost when roughly 10% of the full set of candidate gene SNPs was used. The smallest correlation obtained from the 100 randomisations of 120 candidate gene SNPs was 0.977 (Table 1 and Fig. 3). Across the 254,700 ancestry estimates generated using 120 randomly chosen candidate gene SNPs, the mean absolute difference from the reference values was 0.04 ± 0.07, the median was 0.01, and 92.0% of estimates fell within ±0.15 of their corresponding reference estimates (Fig. 4). For the self-reported African Americans, the mean absolute difference from the reference estimates was 0.06 ± 0.08, the median was 0.03, and 86.5% of estimates fell within ±0.15 of the reference estimates (Fig. 5).
Results remained consistent when 60 candidate gene SNPs were used. The mean correlation with the reference estimates was 0.977 ± 0.009, with only two of 100 randomisations yielding correlations less than 0.95 (0.922 and 0.947; Table 1 and Fig. 3). The mean absolute difference from the reference values was 0.05 ± 0.09, and the median was 0.01; only 9.6% of all ancestry estimates differed from corresponding reference values by more than 0.15 (Fig. 4). For the self-reported African Americans, the mean absolute difference from the reference estimates was 0.07 ± 0.10, the median was 0.04, and 85.0% of estimates fell within ±0.15 of the reference values (Fig. 5).
Random subsets of 30 candidate gene SNPs generated ancestry estimates less consistent with the reference values, with correlations ranging from 0.78 to 0.95. However, 92% of the 30-SNP randomisations yielded correlations greater than 0.90 (Table 1 and Fig. 3). For the self-reported African Americans, the mean absolute difference from the reference values was 0.10 ± 0.15, the median was 0.05, and 80.0% of estimates fell within ±0.15 of the reference values (Fig. 5). Random subsets of 15 candidate gene SNPs performed markedly worse than the other subsets (Table 1), especially with respect to inferring the ancestry of the self-reported African Americans: the mean absolute difference from the reference values was 0.18 ± 0.21 and the median was 0.09, with 34.9% of estimates missing the reference by more than 0.15.
Performance of Candidate Gene SNP Subsets Based on YRI/CEU Fst
An inverse relationship existed between the mean YRI/CEU Fst values of candidate gene SNP randomisations and the error in their estimates of African ancestry for the 1300 self-reported African Americans (measured as the mean absolute difference from reference values): subsets with higher mean Fst deviated less from the reference. This relationship was stronger in smaller SNP subsets, and significant in all subsets except the 120 SNP data (Fig. 6; Table S1). For the 120 SNP subsets, r² = 0.02 (P = 0.16).
Genetic Differentiation of AIMs, 1144 Candidate Gene SNPs, and HapMap Data
The median Fst value between self-reported African Americans and self-reported European Americans for the 1144 candidate gene SNPs was 0.059, which was slightly higher than the median Fst value calculated for 40 draws of 1144 random SNPs taken from the HapMap CEU and ASW samples (0.050). The mean Fst for the 1144 candidate gene SNPs (0.087) was slightly lower than that for the 40 random draws from HapMap (0.091). As expected, these Fst estimates were much smaller than the mean and median Fst for the 276 AIMs (both were 0.51; Fig. 7). Forty percent (16/40) of the Fst distribution comparisons between the ASW/CEU simulations and the candidate gene SNPs were not statistically significant at P = 0.05.
Effect of Number of Genes Used to Estimate Ancestry
The candidate gene SNPs were sampled from a total of 44 candidate genes. The mean numbers of candidate genes represented in the subsets of 120, 60, 30, and 15 SNPs were 34.3 ± 2.0, 27.1 ± 2.1, 18.9 ± 1.9, and 11.8 ± 1.4, respectively. For a given number of SNPs (120, 60, 30, or 15), correlations did not vary significantly with the number of candidate genes represented in the randomised samples (Fig. S2; P > 0.28 for all subsets).
The need to adjust genetic association studies for differences in ancestry between cases and controls is now well recognized. Differentially distributed continental ancestry, in particular, increases the risk of type I error, because the fraction of SNPs with >40% allele frequency differences between continental populations is an order of magnitude greater than the fraction within continental subpopulations (Tian et al., 2008a). Because smaller candidate gene studies and follow-up studies to GWAS assess only a limited number of markers, panels of AIMs are often genotyped to address these issues (Seldin and Price, 2008). Recent studies suggest that ancestry, and especially continental ancestry, can be characterized with fewer AIMs than originally thought (Risch et al., 2002; Tsai et al., 2005; Ruiz-Narvaez et al., 2011; Sampson et al., 2011). For example, a subset of 24 AIMs from a set of 128 adequately distinguished European and West African ancestry (Kosoy et al., 2009).
Using a large number of AIMs to estimate ancestry for our reference measure and a rigorous resampling approach, our study confirms that as few as 15 AIMs provide excellent correlation with reference estimates, indicating that a small number of AIMs is sufficient to differentiate continental ancestry. The accuracy of the estimates tended to be lower among African Americans than European Americans, but with only 15 AIMs, 89.1% of the ancestry estimates for the 1300 self-reported African Americans fell within ±0.15 of reference estimates. When 30, 60, and 120 AIMs were used, the percent of estimates falling within ±0.15 of the reference values in African Americans improved to 92.6%, 97.9%, and 99.9%, respectively. Thus, the practical utility of genotyping more than 15 AIMs for some types of studies, including those distinguishing African Americans from European Americans, would appear to be marginal. However, studies requiring increased accuracy, including those distinguishing moderate from high African ancestry among self-reported African Americans in the context of small effect sizes (Reich et al., 2004), could more prudently use approximately 60 AIMs, assuming that the level of accuracy we observed in this study is sufficient to minimize any threats to internal validity.
The history of candidate gene studies indicates that most interrogated markers do not associate with the phenotypes under study. Because we expect a far greater proportion of selected markers to associate significantly with continental ancestry than with any particular phenotype, the necessity of using AIMs to infer ancestry in medium- to large-scale candidate gene studies with a large number of unassociated markers is not clear. The numerical threshold at which ordinary SNPs perform as well as AIMs in this respect has been largely unexplored. Allocco et al. provided evidence that as few as 50 randomly selected HapMap SNPs can assign individuals to their continent of origin, but to our knowledge, our study is the first to systematically investigate the minimum number of nonindependently drawn SNPs (e.g., candidate gene SNPs) sufficient to estimate proportions of admixture in a mixed study population. Our approach allowed us to evaluate the point at which genotyping AIMs may become superfluous, not only as a theoretical matter, but also as a practical guideline for minimizing expense in future candidate gene studies, deep sequencing analyses, and GWAS replications.
We found that as few as 60 SNPs drawn from 22–31 genes generated ancestry estimates that correlated well with reference ancestry estimates in our total sample (mean r = 0.977). Within self-reported African Americans, 15.0% of the ancestry estimates deviated from the reference estimates by more than 0.15 when 60 candidate gene SNPs were used, not appreciably different from the proportion when 120 candidate gene SNPs were used (13.6%). The number of genes from which a given number of random SNPs were drawn did not significantly influence the correlations. Although not directly tested in our study, it is probable that admixture proportions can be well estimated with even fewer genes, as long as the number of independent SNPs from those genic regions (i.e., tagSNPs) is similar to the number that we tested.
To determine whether our particular candidate gene SNPs influenced these results, we calculated Fst values for the 1144 SNPs and compared them to Fst values for 1144 SNPs drawn randomly 40 times from CEU and ASW HapMap samples. A moderate inflation in the differences between populations was expected, because we selected our candidate gene SNPs to tag both African American and European American samples. The median Fst of our candidate gene SNPs (0.059) was slightly higher than that of the random draws (0.050) and the mean was slightly lower (0.087 vs. 0.091), but both were much lower than the AIMs’ mean (0.51) and median (0.51), indicating that the variation in our candidate gene SNPs was not atypical of background levels of genetic variation between the two populations.
One way to assess the utility of candidate gene SNPs as ancestry estimators in studies with African Americans is to determine whether the mean Fst value of a set of SNPs, calculated from pre-existing data such as the YRI/CEU allele frequencies from HapMap, can predict their performance. We found a significant linear relationship between mean YRI/CEU Fst and accuracy (measured as the mean absolute difference of estimates from reference values) for subsets of 15, 30, and 60 SNPs. The magnitude of correlation between mean Fst and accuracy decreased as the number of SNPs increased and the range of the mean absolute differences from references narrowed. For the 120 SNP category, where the range of differences in estimated ancestry was 0.058–0.072, the linear relationship between mean Fst and accuracy was not statistically significant, even though mean Fst varied by as much as 50% across different randomisations. This probably reflects the fact that, somewhere between 60 and 120 SNPs, the greater ancestry information provided by the increased number of SNPs comes to outweigh the contribution of average differences in mean allele frequency. This analysis provides a metric with which to judge the likelihood that candidate gene SNPs will estimate ancestry well, and allows investigators to define their own tolerance for error for smaller sets of candidate gene SNPs.
Our data indicate that small numbers of AIMs and a moderately larger number of candidate gene SNPs can be effective in estimating continental ancestry. While larger studies with greater sample sizes should require less precision of individual assignments, if more precision is sought, mixing a few AIMs (e.g., 15) with the candidate gene SNPs will likely be adequate to correct for population stratification while still providing substantial cost savings. Selection of markers in this way can be a practical and cost-effective approach to estimating global genetic ancestry in admixed population studies.
This project was supported in part by the Komen for the Cure grant OP05–0927-DR1 and by NIH grants R01CA092447 and 3T32GM080178–03S1. DNA sample preparation was conducted at the Survey and Biospecimen Shared Resource that is supported in part by the Vanderbilt-Ingram Cancer Center (P30 CA68485). Analysis was conducted in part using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University, Nashville TN. We wish to thank Emily Phillips for help with graphics and Regina Courtney for DNA sample preparation.