Keywords:

  • Ancestry Informative Markers;
  • AIMs;
  • African Americans;
  • candidate genes;
  • genetic ancestry;
  • STRUCTURE;
  • admixture;
  • population stratification

Summary

  1. Top of page
  2. Summary
  3. Introduction
  4. Materials and Methods
  5. Results
  6. Discussion
  7. Acknowledgements
  8. References
  9. Supporting Information

Using genetic data from an obesity candidate gene study of self-reported African Americans and European Americans, we investigated the number of Ancestry Informative Markers (AIMs) and candidate gene SNPs necessary to infer continental ancestry. Proportions of African and European ancestry were assessed with STRUCTURE (K = 2), using 276 AIMs. These reference values were compared to estimates derived using 120, 60, 30, and 15 SNP subsets randomly chosen from the 276 AIMs and from 1144 SNPs in 44 candidate genes. All subsets generated estimates of ancestry consistent with the reference estimates, with mean correlations greater than 0.99 for all subsets of AIMs, and mean correlations of 0.99 ± 0.003; 0.98 ± 0.01; 0.93 ± 0.03; and 0.81 ± 0.11 for subsets of 120, 60, 30, and 15 candidate gene SNPs, respectively. Among African Americans, the median absolute difference from reference African ancestry values ranged from 0.01 to 0.03 for the four AIMs subsets and from 0.03 to 0.09 for the four candidate gene SNP subsets. Furthermore, YRI/CEU Fst values provided a metric to predict the performance of candidate gene SNPs. Our results demonstrate that a small number of SNPs randomly selected from candidate genes can be used to estimate admixture proportions in African Americans reliably.


Introduction

Genetic epidemiology studies seek to identify loci with statistically significant allele or genotype frequency differences between cases and controls. If case status covaries with differences in ancestry, many genetic loci can be expected to differ in their allele frequencies irrespective of etiology, potentially giving rise to spurious associations (Clayton et al., 2005; Rosenberg and Nordborg, 2006; Tian et al., 2006; Price et al., 2008; Tian et al., 2008b). To minimize the confounding effects of population substructure, quantitative estimates of individual ancestry need to be considered. Bayesian clustering or maximum likelihood methods are typically used to calculate these estimates (Pritchard et al., 2000; Alexander et al., 2009), or clusters are identified using Principal Components Analysis (PCA) or multidimensional scaling with genotyped genetic markers (Price et al., 2006; Li and Yu, 2008; Novembre and Stephens, 2008).

The number of genotyped markers sufficient to infer and to correct for ancestry depends on the genetic heterogeneity of the populations under study and the informativeness of the markers being used, with informativeness a function of allele frequency differences between the ancestral populations from which the study samples derive. The most commonly used markers for this purpose are termed Ancestry Informative Markers (AIMs), which are selected for their large allele frequency differences between ancestral populations (Shriver et al., 1997; Akey et al., 2002; Rosenberg et al., 2003; Smith et al., 2004; Halder et al., 2008; Nassir et al., 2009). However, genetic variants with relatively small allele frequency differences between populations can also be used to infer ancestry if they are genotyped in sufficient numbers. In large-scale candidate gene studies or in genome-wide association studies (GWAS), the sheer quantity of markers under study makes the use of independent AIMs unnecessary, even for assessing ancestry in highly homogeneous populations. For example, using 960 cancer candidate gene SNPs, Sloan et al. (2009) were able to infer the European country of origin in immigrants to the United States; and using over half a million SNPs, Novembre et al. (2008) were able to characterize substructure in a European population accurate to within a few hundred kilometres of geographic origin.

Although panels of AIMs have been assumed to be necessary in smaller candidate gene studies, the point at which they can be confidently bypassed, owing to a sufficient number of candidate SNPs, remains largely unexplored. Allocco et al. found that as few as 50 SNPs chosen randomly from the HapMap database can assign individuals to their ancestral continent of origin with an average accuracy of 95%, suggesting that AIMs may not be necessary even in studies with relatively few markers (Allocco et al., 2007). However, SNPs chosen randomly from throughout the genome via HapMap are likely to provide more independent information than SNPs chosen from a set of nonrandomly distributed candidate genes, many of which may be in linkage disequilibrium (LD). Moreover, many studies of admixed populations, such as African Americans, require an assessment of the proportion of admixture. These analyses demand more information from genotypic markers than when individuals need only be assigned to their continent of origin.

Herein, we present an analysis of 1300 self-reported African American and 1247 self-reported European American subjects using 276 AIMs and 1144 obesity-related candidate gene SNPs to evaluate the number of AIMs and/or candidate gene SNPs necessary to characterize global ancestry adequately in similarly designed studies.

Materials and Methods

Ethics Statement

The Southern Community Cohort Study (SCCS) participants provided written informed consent, and protocols were approved by the Vanderbilt University Human Research Program and Institutional Review Board and by the Meharry Medical College Institutional Review Board.

Study Population

The SCCS is a cohort study of cancer risk disparities related to ancestry and socioeconomic status among populations. Men and women aged 40–79 were recruited in person at community health centers and also by mail across 12 southeastern U.S. states between 2002 and 2009 (Signorello et al., 2005; Signorello et al., 2010). Approximately 86,000 participants were enrolled, with African Americans comprising two-thirds of the study population (http://www.southerncommunitystudy.org).

For an obesity-related candidate gene study within the SCCS, 2157 female and 390 male participants (1300 of self-reported African ancestry and 1247 of self-reported European ancestry) were selected from among those who enrolled from March 2002 to October 2004. Genomic DNA was extracted from blood samples using Qiagen's DNA Purification kits (Qiagen, Valencia, CA, USA) according to manufacturer's instructions.

Marker Selection and Genotyping

We selected AIMs from a list of 1509 AIM SNPs from an Illumina-designed panel for ancestry estimation and an additional 360 SNPs with comparably large frequency differences between European (CEU) and African (YRI) samples in HapMap I. We selected AIMs using the following criteria: (1) AIMs were at least 5 Mb from any of the 44 candidate gene boundaries to ensure independence from the candidate genes and (2) AIMs displayed the largest allele frequency differences between the CEU and YRI HapMap populations. From this list we chose 300 AIMs, of which 292 passed the Illumina scoring algorithm and were genotyped with the Illumina GoldenGate platform (Illumina Inc., San Diego, CA, USA). All SNPs were assessed for deviations from Hardy–Weinberg equilibrium. Of the 292 AIMs, 276 were successfully genotyped with call rates greater than 95% and subsequently used to estimate African and European ancestry in the SCCS African Americans and European Americans.

An additional panel of genetic markers was selected from the obesity-related candidate genes, comprising 1244 SNPs. The candidate SNPs were selected for the obesity study using a tagSNP approach that combined tagSNPs from both the European (CEU) and Yoruba (YRI) HapMap 1 data (Thorisson et al., 2005). HapMap SNPs within each gene and an additional 10 kb upstream and downstream of each gene were identified and evaluated by the Illumina scoring algorithm. SNPs that scored poorly or had minor allele frequencies below 0.05 in both CEU and YRI were excluded. LDSelect was then run separately for the CEU and YRI data using an r2 cut-off of 0.8 to partition SNPs into LD bins for each population (Carlson et al., 2004). When multiple tagSNPs were in an LD bin, SNPs that tagged both populations were preferentially selected. Among equivalent tagSNPs of a given LD bin, one categorized as a candidate functional SNP or one previously employed on Illumina chips was preferentially selected for assay. The same quality control criteria were applied as for the AIMs, and resulted in the removal of 100 SNPs, yielding 1144 for ancestry estimation.

Analyses

African and European ancestry for each individual was estimated using STRUCTURE (version 2.2.3, http://pritch.bsd.uchicago.edu/structure.html), a software platform that uses a Bayesian clustering algorithm to identify groups of individuals with similar allele frequency profiles (Pritchard et al., 2000). The algorithm estimates the shared population ancestry of individuals based solely on their genotypes, under the assumptions of Hardy–Weinberg equilibrium and linkage equilibrium in ancestral populations. Individuals are assigned admixture estimates, proportions of ancestry summing to 1 across K clusters (K = 2 ancestral populations for all analyses in this study). All runs of STRUCTURE were performed on the ACCRE supercomputing cluster at Vanderbilt University.

To create reference values of African and European ancestry proportions for all 2547 individuals, STRUCTURE was run 10 times (50,000 iterations after a burn-in of 50,000 iterations) using the 276 AIMs. CLUMPP was used to align multiple replicate analyses and to calculate the means of the 10 quantitative ancestry estimates per individual (Jakobsson and Rosenberg, 2007). This procedure was repeated using the 1144 candidate gene SNPs, and the Pearson correlation coefficient between the two sets of individual ancestry estimates was determined.

From each of the sets of 276 AIMs and 1144 candidate gene SNPs, 100 random subsets of 120, 60, 30, and 15 markers were selected. Markers were selected without replacement for each randomisation. The number of genes represented in each random sample of SNPs was tabulated. STRUCTURE was run 10 times (50,000 iterations after a burn-in of 50,000 iterations) for each of the 800 randomised datasets (100 randomisations for each category of 120, 60, 30, and 15 AIMs and 120, 60, 30, and 15 candidate gene SNPs). The means of the 10 quantitative ancestry estimates per individual per randomisation were calculated, generating 100 sets of 2547 mean ancestry estimates for each of the eight categories (Fig. S1). The Pearson correlation coefficients between each of these sets of individual ancestry estimates and the reference vector were then calculated.
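
The resampling design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: STRUCTURE is not invoked, the marker labels are placeholders, and the function names (`draw_subsets`, `pearson_r`) and the random seed are introduced here for the example.

```python
import math
import random

def draw_subsets(marker_ids, subset_size, n_draws=100, seed=0):
    """Draw n_draws random subsets; markers within a subset are sampled without replacement."""
    rng = random.Random(seed)
    return [rng.sample(marker_ids, subset_size) for _ in range(n_draws)]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Placeholder labels standing in for the 276 AIMs (hypothetical names).
aims = [f"AIM{i}" for i in range(276)]

# 100 random subsets for each of the four subset sizes used in the study.
subsets = {size: draw_subsets(aims, size) for size in (120, 60, 30, 15)}
```

In a full analysis, each subset would be passed to the ancestry-estimation software, and `pearson_r` would then compare the resulting per-individual estimates with the reference vector.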

To assess how accurately a given individual's ancestry could be estimated using only 120, 60, 30, or 15 AIMs or candidate gene SNPs, each individual's reference estimate of African ancestry was subtracted from each of the individual's 100 estimates of African ancestry per category of 120, 60, 30, or 15 AIMs and candidate gene SNPs. The frequency distributions of the 254,700 absolute values of differences per category were plotted, in addition to the frequency distributions for only the 1300 self-reported African Americans.
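
As a rough sketch of this accuracy summary, the following Python function pools the absolute differences across all randomisations (2547 individuals × 100 randomisations gives the 254,700 values per category) and reports the mean, the median, and the fraction within a tolerance of the reference. The function name `summarize_differences`, the default tolerance argument, and the toy input values are assumptions made for illustration; they are not study data.

```python
import statistics

def summarize_differences(reference, estimates, tolerance=0.15):
    """Summarise absolute differences between subset estimates and reference values.

    reference: one reference ancestry proportion per individual.
    estimates: list of randomisations, each holding one estimate per individual.
    """
    diffs = [abs(est - ref) for run in estimates for est, ref in zip(run, reference)]
    within = sum(d <= tolerance for d in diffs) / len(diffs)
    return {
        "n": len(diffs),
        "mean": statistics.mean(diffs),
        "median": statistics.median(diffs),
        "fraction_within": within,
    }

# Toy values for illustration only (three individuals, two randomisations).
reference = [0.90, 0.80, 0.10]
estimates = [[0.85, 0.80, 0.30], [0.95, 0.70, 0.10]]
summary = summarize_differences(reference, estimates)
```

Restricting `reference` and each entry of `estimates` to the self-reported African Americans would give the corresponding subgroup summary.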

Mean Fst was calculated for each randomisation of candidate gene SNPs, using allele frequency data from the YRI and CEU HapMap samples. For the 1300 self-reported African Americans, the association between Fst and the mean absolute difference of African ancestry estimates from reference values was assessed using linear regression for each category of candidate gene SNPs. The Weir and Cockerham algorithm was used to calculate Fst (Weir and Cockerham, 1984).
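
To make the role of Fst concrete, the sketch below computes a simplified per-SNP Fst from two population allele frequencies and averages it over a subset. It deliberately swaps in Wright's basic heterozygosity-based definition with equal population weights and no sample-size correction; it is not the Weir and Cockerham (1984) estimator used in the study, and the function names are hypothetical.

```python
def wright_fst(p1, p2):
    """Wright's Fst for one biallelic SNP from two population allele frequencies.

    Simplification: equal population weights and no sample-size correction,
    so this is NOT the Weir and Cockerham (1984) estimator.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity in the pooled population
    if h_total == 0:
        return 0.0  # SNP is monomorphic in both populations
    # Mean expected heterozygosity within the two populations.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total

def mean_fst(freq_pairs):
    """Mean per-SNP Fst over a subset of (p1, p2) allele-frequency pairs."""
    return sum(wright_fst(p1, p2) for p1, p2 in freq_pairs) / len(freq_pairs)
```

Under this definition, a SNP with identical frequencies in both populations has Fst of 0 and a SNP fixed for different alleles has Fst of 1, which is why subsets with a higher mean Fst carry more ancestry information per marker.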

To determine whether the population differentiation of our SNPs was consistent with background levels of genetic distance between African Americans and European Americans, Fst values were first calculated for the 1144 candidate gene SNPs and 276 AIMs using genotype data from our study. These values were then compared to Fst values for 40 random draws of 1144 SNPs from unrelated HapMap phase III CEU and ASW (African Americans in the Southwest USA) samples, using the Kruskal–Wallis one-way analysis of variance test, a nonparametric method for determining whether samples originate from the same distribution. A two-sided significance probability of 0.05 was used to infer nonrandom influences.

Finally, the extent to which correlations with the reference estimates were impacted by the number of genes per randomisation was determined. For each category of candidate gene SNPs, the associations between the number of genes represented in each randomised sample and the correlation of that sample's ancestry estimates with the reference estimates was assessed using linear regression. Statistical analyses were done using R (http://cran.r-project.org/) and STATA 10.0.
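
The regressions in this section were run by the authors in R and STATA. Purely as an illustration of the calculation, a one-predictor ordinary least squares fit can be written as below; the function name `simple_linear_regression` and its return layout are assumptions for this sketch, and significance testing of the slope is omitted.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit of y on a single predictor x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared
```

Here `x` could hold the number of genes represented in each randomised sample (or its mean Fst) and `y` the corresponding correlation with the reference estimates (or the mean absolute difference).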

Results

Ancestry Estimated from 276 AIMs and 1144 Candidate Gene SNPs

Reference values of ancestry for the 2547 individuals were calculated using 276 AIMs. The 1300 self-reported African Americans had a mean proportion of 0.92 African ancestry and the 1247 self-reported European Americans a mean African ancestry of 0.01. Nine of the self-reported African Americans were found to have greater than 0.88 proportion European ancestry, suggesting incorrect classification. The 1144 candidate gene SNPs yielded ancestry estimates that were highly correlated with the reference estimates derived from the 276 AIMs (r = 0.989).

Ancestry Estimated from Random Subsets of 276 AIMs

For the full study population, the 15 AIM subsets generated ancestry estimates that were highly correlated with the reference values (r = 0.991; Table 1). Of the total 254,700 ancestry estimates for all subjects generated using 15 AIMs, 93.8% fell within ±0.15 of the corresponding reference estimates. The mean absolute difference from the reference values was 0.06 ± 0.04, and the median was less than 0.01 (Fig. 1). For the self-reported African Americans, the mean absolute difference from the reference values was 0.06 ± 0.07, and the median was 0.03, with the great majority (89.1%) of estimates within ±0.15 of the reference estimates (Fig. 2). Using subsets of 30, 60, and 120 AIMs, the mean absolute difference from the reference values for the self-reported African Americans improved to 0.05 ± 0.06, 0.04 ± 0.04, and 0.02 ± 0.03, respectively, with medians of 0.03, 0.02, and 0.01 (Fig. 2).

Table 1. Correlations between sets of 2547 ancestry estimates derived using the full set of AIMs and those derived using subsets of SNPs

SNP set                                  Mean r    S.E.      Min–Max
Candidate gene SNP panel (a)             0.989     NA        NA
Random AIM subsets, n
  15                                     0.991     0.001     0.989–0.993
  30                                     0.995     <0.001    0.994–0.995
  60                                     0.997     <0.001    0.996–0.997
  120                                    0.999     <0.001    0.998–0.999
Random candidate gene SNP subsets, n
  15                                     0.811     0.106     0.384–0.943
  30                                     0.934     0.032     0.785–0.978
  60                                     0.977     0.009     0.922–0.987
  120                                    0.986     0.003     0.977–0.990

(a) Candidate gene SNP panel: N = 1144 SNPs.
Mean r refers to the mean of 100 correlations.
S.E., standard error.

Figure 1. Variation in estimates of ancestry with AIM subsets. Distribution of absolute values of differences between reference estimates of African ancestry and corresponding estimates derived using random subsets of 120 (green), 60 (red), 30 (blue), and 15 (purple) AIMs, for all 2547 European American and African American study participants.

Figure 2. Variation in estimates of ancestry for self-reported African Americans only, using AIM subsets. Distribution of absolute values of differences between the self-reported African Americans’ reference estimates of African ancestry and corresponding estimates derived using random subsets of 120 (green), 60 (red), 30 (blue), and 15 (purple) AIMs.

Ancestry Estimated from Random Subsets of 1144 Candidate Gene SNPs

Highly correlated ancestry estimates were also obtained when smaller subsets of the 1144 candidate gene SNPs were used for estimation (Table 1). With 120 candidate gene SNPs, the mean Pearson correlation coefficient with the reference estimates was 0.986 ± 0.003, indicating that little information was lost when roughly 10% of the full set of candidate gene SNPs was used. The smallest correlation obtained from the 100 randomisations of 120 candidate gene SNPs was 0.977 (Table 1 and Fig. 3). For the total 254,700 ancestry estimates generated using 120 randomly chosen candidate gene SNPs, the mean absolute difference from the reference values was 0.04 ± 0.07, the median was 0.01, and 92.0% of estimates fell within ±0.15 of their corresponding reference estimates (Fig. 4). For the self-reported African Americans, the mean absolute difference from the reference estimates was 0.06 ± 0.08, the median was 0.03, and 86.5% of estimates fell within ±0.15 of the reference estimates (Fig. 5).

Figure 3. Correlations of ancestry estimates using candidate gene SNPs. Distribution of correlations between the reference set of ancestry estimates for all 2547 study participants and 100 sets of corresponding estimates derived using 30 (blue), 60 (red), and 120 (green) random candidate gene SNPs. Data for 15 SNPs not shown; see Table 1.

Figure 4. Variation in estimates of ancestry with candidate gene SNP subsets. Distribution of absolute values of differences between reference estimates of African ancestry and corresponding estimates derived using random subsets of 120 (green), 60 (red), and 30 (blue) candidate gene SNPs, for all 2547 study participants.

Figure 5. Variation in estimates of ancestry for self-reported African Americans only, using candidate gene SNP subsets. Distribution of absolute values of differences between the self-reported African Americans’ reference estimates of African ancestry and corresponding estimates derived using random subsets of 120 (green), 60 (red), and 30 (blue) candidate gene SNPs.

Results remained consistent when 60 candidate gene SNPs were used. The mean correlation with the reference estimates was 0.977 ± 0.009, with only two of 100 randomisations yielding correlations less than 0.95 (0.922 and 0.947; Table 1 and Fig. 3). The mean absolute difference from the reference values was 0.05 ± 0.09, and the median was 0.01; only 9.6% of all ancestry estimates differed from corresponding reference values by more than 0.15 (Fig. 4). For the self-reported African Americans, the mean absolute difference from the reference estimates was 0.07 ± 0.10, the median was 0.04, and 85.0% of estimates fell within ±0.15 of the reference values (Fig. 5).

Random subsets of 30-candidate gene SNPs generated ancestry estimates less consistent with the reference values, with correlations ranging from 0.78 to 0.95. However, 92% of the 30 SNP randomisations yielded correlations greater than 0.90 (Table 1 and Fig. 3). For the self-reported African Americans, the mean absolute difference from the reference values was 0.10 ± 0.15, the median was 0.05, and 80.0% of estimates fell within ±0.15 of the reference values (Fig. 5). Random subsets of 15 candidate gene SNPs performed markedly worse than the other subsets (Table 1), especially with respect to inferring the ancestry of the self-reported African Americans: the mean absolute difference from the reference values was 0.18 ± 0.21 and the median was 0.09, with 34.9% of estimates missing the reference by more than 0.15.

Performance of Candidate Gene SNP Subsets Based on YRI/CEU Fst

An inverse relationship existed between the YRI/CEU Fst values of candidate gene SNP randomisations and their accuracy in estimating African ancestry in the 1300 self-reported African Americans (as measured by the mean absolute difference from reference values). This relationship was stronger in smaller SNP subsets, and significant in all subsets except the 120 SNP data (Fig. 6; Table S1). For the 120 SNP subsets, r² was 0.02 (P = 0.16).

Figure 6. Relationship between Fst and accuracy of African ancestry estimation. For each category of 15 (panel a), 30 (panel b), 60 (panel c), and 120 (panel d) candidate gene SNPs, randomised subsets are mapped by mean YRI/CEU Fst (horizontal axis) and mean absolute difference of African ancestry estimates from corresponding reference estimates for the 1300 self-reported African Americans (vertical axis). The slope of the linear fit was significantly different from zero for 15, 30, and 60 SNPs (P < 0.0001, P < 0.0001, P = 0.017, respectively). Horizontal and vertical axes vary among the panels.

Genetic Differentiation of AIMs, 1144 Candidate Gene SNPs, and HapMap Data

The median Fst value between self-reported African Americans and self-reported European Americans for the 1144 candidate gene SNPs was 0.059, which was slightly higher than the median Fst value calculated for 40 draws of 1144 random SNPs taken from the HapMap CEU and ASW samples (0.050). The mean Fst for the 1144 candidate gene SNPs (0.087) was slightly lower than that for the 40 random draws from HapMap (0.091). As expected, these Fst estimates were much smaller than the mean and median Fst for the 276 AIMs (both were 0.51; Fig. 7). Forty percent (16/40) of the Fst distribution comparisons between the ASW/CEU simulations and the candidate gene SNPs were not statistically significant at P = 0.05.

Figure 7. Distribution of Fst values. The distribution of Fst values between the CEU and ASW HapMap populations using 40 iterations of 1144 randomly selected SNPs (left), and the Fst distributions of the 1144 candidate gene SNPs (centre) and 276 AIMs (right) for the self-reported African American and European American study participants.

Effect of Number of Genes Used to Estimate Ancestry

The candidate gene SNPs were sampled from a total of 44 candidate genes. The mean numbers of candidate genes represented in the subsets of 120, 60, 30, and 15 SNPs were 34.3 ± 2.0, 27.1 ± 2.1, 18.9 ± 1.9, and 11.8 ± 1.4, respectively. For a given number of SNPs (120, 60, 30, or 15), correlations did not vary significantly with the number of candidate genes represented in the randomised samples (Fig. S2; P > 0.28 for all subsets).

Discussion

The need to adjust genetic association studies for differences in ancestry between cases and controls is now well recognized. Differentially distributed continental ancestry, in particular, increases the risk of type 1 error, because the fraction of SNPs with >40% allele frequency differences between continental populations is an order of magnitude greater than the fraction within continental subpopulations (Tian et al., 2008a). Because smaller candidate gene studies and follow-up studies to GWAS assess only a limited number of markers, panels of AIMs are often genotyped to address these issues (Seldin and Price, 2008). Recent studies suggest that ancestry, and especially continental ancestry, can be characterized with fewer AIMs than originally thought (Risch et al., 2002; Tsai et al., 2005; Ruiz-Narvaez et al., 2011; Sampson et al., 2011). For example, a subset of 24 AIMs from a set of 128 adequately distinguished European and West African ancestry (Kosoy et al., 2009).

Using a large number of AIMs to estimate ancestry for our reference measure and a rigorous resampling approach, our study confirms that as few as 15 AIMs provide excellent correlation with reference estimates, indicating that a small number of AIMs is sufficient to differentiate continental ancestry. The accuracy of the estimates tended to be lower among African Americans than European Americans, but with only 15 AIMs, 89.1% of the ancestry estimates for the 1300 self-reported African Americans fell within ±0.15 of reference estimates. When 30, 60, and 120 AIMs were used, the percent of estimates falling within ±0.15 of the reference values in African Americans improved to 92.6%, 97.9%, and 99.9%, respectively. Thus, the practical utility of genotyping more than 15 AIMs for some types of studies, including those distinguishing African Americans from European Americans, would appear to be marginal. However, studies requiring increased accuracy, including those distinguishing moderate from high African ancestry among self-reported African Americans in the context of small effect sizes (Reich et al., 2004), could more prudently use approximately 60 AIMs, assuming that the level of accuracy we observed in this study is sufficient to minimize any threats to internal validity.

The history of candidate gene studies indicates that most interrogated markers do not associate with the phenotypes under study. Because we expect a far greater proportion of selected markers to associate significantly with continental ancestry than with any particular phenotype, the necessity of using AIMs to infer ancestry in medium- to large-scale candidate gene studies with a large number of unassociated markers is not clear. The numerical threshold at which ordinary SNPs perform as well as AIMs in this respect has been largely unexplored. Allocco et al. provided evidence that as few as 50 randomly selected HapMap SNPs can assign individuals to their continent of origin, but to our knowledge, our study is the first to systematically investigate the minimum number of nonindependently drawn SNPs (e.g., candidate gene SNPs) sufficient to estimate proportions of admixture in a mixed study population. Our approach allowed us to evaluate the point at which genotyping AIMs may become superfluous, not only as a theoretical matter, but also as a practical guideline for minimizing expense in future candidate gene studies, deep sequencing analyses, and GWAS replications.

We found that as few as 60 SNPs drawn from 22–31 genes generated ancestry estimates that correlated well with reference ancestry estimates in our total sample (mean r = 0.977). Within self-reported African Americans, 15.0% of the ancestry estimates deviated from the reference estimates by more than 0.15 when 60 candidate gene SNPs were used, not appreciably different from the proportion when 120 candidate gene SNPs were used (13.6%). The number of genes from which a given number of random SNPs were drawn did not significantly influence the correlations. Although not directly tested in our study, it is probable that admixture proportions can be well estimated with even fewer genes, as long as the number of independent SNPs from those genic regions (i.e., tagSNPs) is similar to the number that we tested.

To determine whether our particular candidate gene SNPs influenced these results, we calculated Fst values for the 1144 SNPs and compared them to Fst values for 1144 SNPs drawn randomly 40 times from CEU and ASW HapMap samples. A moderate inflation in the differences between populations was expected, because we selected our candidate gene SNPs to tag both African American and European American samples. The median Fst of our candidate gene SNPs (0.059) was slightly higher than that of the random draws (0.050) and the mean was slightly lower (0.087 vs. 0.091), but both were much lower than the AIMs’ mean (0.51) and median (0.51), indicating that the variation in our candidate gene SNPs was not atypical of background levels of genetic variation between the two populations.

One way to assess the utility of candidate gene SNPs as ancestry estimators in studies with African Americans is to determine if the mean Fst value of a set of SNPs can predict their performance using Fst values calculated from pre-existing data, such as the YRI/CEU allele frequencies from HapMap. We found a significant linear relationship between mean YRI/CEU Fst and accuracy (measured as the mean absolute difference of estimates from reference values) for subsets of 15, 30, and 60 SNPs. The magnitude of correlation between mean Fst and accuracy decreased as the number of SNPs increased and the range of the mean absolute differences from references narrowed. For the 120 SNP category, where the range of differences in estimated ancestry was 0.058–0.072, the linear relationship between mean Fst and accuracy was not statistically significant, even though mean Fst varied by as much as 50% across different randomisations. This probably reflects the fact that the greater ancestry information provided by the increased number of SNPs significantly outweighed the contribution of average differences in mean allele frequency somewhere between 60 and 120 SNPs. This analysis provides a metric with which to judge the likelihood that candidate gene SNPs will estimate ancestry well, and allows investigators to define their own tolerance for error for smaller sets of candidate gene SNPs.

Our data indicate that small numbers of AIMs and a moderately larger number of candidate gene SNPs can be effective in estimating continental ancestry. While larger studies with greater sample sizes should require less precision of individual assignments, if more precision is sought, mixing a few AIMs (e.g., 15) with the candidate gene SNPs will likely be adequate to correct for population stratification while still providing substantial cost savings. Selection of markers in this way can be a practical and cost-effective approach to estimating global genetic ancestry in admixed population studies.

Acknowledgements

This project was supported in part by the Komen for the Cure grant OP05–0927-DR1 and by NIH grants R01CA092447 and 3T32GM080178–03S1. DNA sample preparation was conducted at the Survey and Biospecimen Shared Resource that is supported in part by the Vanderbilt-Ingram Cancer Center (P30 CA68485). Analysis was conducted in part using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University, Nashville TN. We wish to thank Emily Phillips for help with graphics and Regina Courtney for DNA sample preparation.

References

  • Akey, J.M., Zhang, G., Zhang, K., Jin, L., & Shriver, M.D. (2002) Interrogating a high-density SNP map for signatures of natural selection. Genome Res 12, 18051814.
  • Alexander, D.H., Novembre, J., & Lange, K. (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19, 16551664.
  • Allocco, D.J., Song, Q., Gibbons, G.H., Ramoni, M.F., & Kohane, I.S. (2007) Geography and genography: prediction of continental origin using randomly selected single nucleotide polymorphisms. BMC Genomics 8, 68.
  • Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L., & Nickerson, D.A. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 74, 106–120.
  • Clayton, D.G., Walker, N.M., Smyth, D.J., Pask, R., Cooper, J.D., Maier, L.M., Smink, L.J., Lam, A.C., Ovington, N.R., Stevens, H.E., Nutland, S., Howson, J.M., Faham, M., Moorhead, M., Jones, H.B., Falkowski, M., Hardenbol, P., Willis, T.D., & Todd, J.A. (2005) Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet 37, 1243–1246.
  • Halder, I., Shriver, M., Thomas, M., Fernandez, J.R., & Frudakis, T. (2008) A panel of ancestry informative markers for estimating individual biogeographical ancestry and admixture from four continents: utility and applications. Hum Mutat 29, 648–658.
  • Jakobsson, M. & Rosenberg, N.A. (2007) CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23, 1801–1806.
  • Kosoy, R., Nassir, R., Tian, C., White, P.A., Butler, L.M., Silva, G., Kittles, R., Alarcon-Riquelme, M.E., Gregersen, P.K., Belmont, J.W., De La Vega, F.M., & Seldin, M.F. (2009) Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America. Hum Mutat 30, 69–78.
  • Li, Q.Z. & Yu, K. (2008) Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol 32, 215–226.
  • Nassir, R., Kosoy, R., Tian, C., White, P.A., Butler, L.M., Silva, G., Kittles, R., Alarcon-Riquelme, M.E., Gregersen, P.K., Belmont, J.W., De La Vega, F.M., & Seldin, M.F. (2009) An ancestry informative marker set for determining continental origin: validation and extension using human genome diversity panels. BMC Genet 10, 39–52.
  • Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson, M.R., Stephens, M., & Bustamante, C.D. (2008) Genes mirror geography within Europe. Nature 456, 98–101.
  • Novembre, J. & Stephens, M. (2008) Interpreting principal component analyses of spatial population genetic variation. Nat Genet 40, 646–649.
  • Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L., Scarnicci, F., Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolopoulou, P., Seligsohn, U., Waliszewska, A., Schirmer, C., Ardlie, K., Ramos, A., Nemesh, J., Arbeitman, L., Goldstein, D.B., Reich, D., & Hirschhorn, J.N. (2008) Discerning the ancestry of European Americans in genetic association studies. PLoS Genet 4, e236.
  • Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., & Reich, D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–909.
  • Pritchard, J.K., Stephens, M., & Donnelly, P. (2000) Inference of population structure using multilocus genotype data. Genetics 155, 945–959.
  • Reich, D., Freedman, M.L., Penney, K.L., McDonald, G.J., Mignault, A.A., Patterson, N., Gabriel, S.B., Topol, E.J., Smoller, J.W., Pato, C.N., Pato, M.T., Petryshen, T.L., Kolonel, L.N., Lander, E.S., Sklar, P., Henderson, B., Hirschhorn, J.N., & Altshuler, D. (2004) Assessing the impact of population stratification on genetic association studies. Nat Genet 36, 388–393.
  • Risch, N., Burchard, E., Ziv, E., & Tang, H. (2002) Categorization of humans in biomedical research: genes, race and disease. Genome Biol 3, 719.
  • Rosenberg, N.A., Li, L.M., Ward, R., & Pritchard, J.K. (2003) Informativeness of genetic markers for inference of ancestry. Am J Hum Genet 73, 1402–1422.
  • Rosenberg, N.A. & Nordborg, M. (2006) A general population-genetic model for the production by population structure of spurious genotype-phenotype associations in discrete, admixed or spatially distributed populations. Genetics 173, 1665–1678.
  • Ruiz-Narvaez, E.A., Rosenberg, L., Wise, L.A., Reich, D., & Palmer, J.R. (2011) Validation of a small set of ancestral informative markers for control of population admixture in African Americans. Am J Epidemiol 173, 587–592.
  • Sampson, J.N., Kidd, K.K., Kidd, J.R., & Zhao, H. (2011) Selecting SNPs to identify ancestry. Ann Hum Genet 75, 539–553.
  • Seldin, M.F. & Price, A.L. (2008) Application of ancestry informative markers to association studies in European Americans. PLoS Genet 4, e5.
  • Shriver, M.D., Smith, M.W., Jin, L., Marcini, A., Akey, J.M., Deka, R., & Ferrell, R.E. (1997) Ethnic-affiliation estimation by use of population-specific DNA markers. Am J Hum Genet 60, 957–964.
  • Signorello, L.B., Hargreaves, M.K., & Blot, W.J. (2010) The Southern Community Cohort Study: investigating health disparities. J Health Care Poor Underserved 21, 26–37.
  • Signorello, L.B., Hargreaves, M.K., Steinwandel, M.D., Zheng, W., Cai, Q., Schlundt, D.G., Buchowski, M.S., Arnold, C.W., McLaughlin, J.K., & Blot, W.J. (2005) Southern community cohort study: establishing a cohort to investigate health disparities. J Natl Med Assoc 97, 972–979.
  • Sloan, C.D., Andrew, A.D., Duell, E.J., Williams, S.M., Karagas, M.R., & Moore, J.H. (2009) Genetic population structure analysis in New Hampshire reveals Eastern European ancestry. PLoS One 4, e6928.
  • Smith, M.W., Patterson, N., Lautenberger, J.A., Truelove, A.L., McDonald, G.J., Waliszewska, A., Kessing, B.D., Malasky, M.J., Scafe, C., Le, E., De Jager, P.L., Mignault, A.A., Yi, Z., De The, G., Essex, M., Sankale, J.L., Moore, J.H., Poku, K., Phair, J.P., Goedert, J.J., Vlahov, D., Williams, S.M., Tishkoff, S.A., Winkler, C.A., De La Vega, F.M., Woodage, T., Sninsky, J.J., Hafler, D.A., Altshuler, D., Gilbert, D.A., O'Brien, S.J., & Reich, D. (2004) A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet 74, 1001–1013.
  • Thorisson, G.A., Smith, A.V., Krishnan, L., & Stein, L.D. (2005) The International HapMap Project Web site. Genome Res 15, 1592–1593.
  • Tian, C., Gregersen, P.K., & Seldin, M.F. (2008a) Accounting for ancestry: population substructure and genome-wide association studies. Hum Mol Genet 17, R143–R150.
  • Tian, C., Hinds, D.A., Shigeta, R., Kittles, R., Ballinger, D.G., & Seldin, M.F. (2006) A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am J Hum Genet 79, 640–649.
  • Tian, C., Plenge, R.M., Ransom, M., Lee, A., Villoslada, P., Selmi, C., Klareskog, L., Pulver, A.E., Qi, L., Gregersen, P.K., & Seldin, M.F. (2008b) Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet 4, e4.
  • Tsai, H.J., Choudhry, S., Naqvi, M., Rodriguez-Cintron, W., Burchard, E.G., & Ziv, E. (2005) Comparison of three methods to estimate genetic ancestry and control for stratification in genetic association studies among admixed populations. Hum Genet 118, 424–433.
  • Weir, B.S., & Cockerham, C.C. (1984) Estimating F-statistics for the analysis of population-structure. Evolution 38, 1358–1370.

Supporting Information

Additional Supporting Information may be found in the online version of this article.

Filename | Format | Size | Description
ahg738-sup-0001-F1.tiff | TIFF | 1520K | Figure S1 Flowchart of analytical approach.
ahg738-sup-0002-F2.tiff | TIFF | 1520K | Figure S2 Correlations between reference values and candidate gene SNP analyses as a function of gene number.
ahg738-sup-0003-T1.docx | DOCX | 18K | Table S1 The YRI/CEU Fst of candidate gene SNPs versus accuracy of African ancestry estimates for the 1300 self-reported African Americans.

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.