Keywords:

  • genetic structure;
  • insertion-deletion polymorphisms;
  • indels;
  • human diversity;
  • DNA

Summary

  1. Top of page
  2. Summary
  3. Introduction
  4. Materials and Methods
  5. Results and Discussion
  6. Conclusions
  7. Acknowledgements
  8. References
  9. Web Site References

In a landmark study Rosenberg et al. (2002) analyzed human genome diversity with 377 microsatellites in the HGDP-CEPH Genome Diversity Panel and reported that the populations were structured into five geographical regions: America, Sub-Saharan Africa, East Asia, Oceania and a cluster composed of Europe, the Middle East and Central Asia. They also observed that the within-population component accounted for 93–95%, and that the among-regions portion was only 3.6%, of the total genetic variance. We have also studied the HGDP-CEPH Diversity Panel (1064 individuals from 52 populations) with a set of 40 biallelic slow-evolving short insertion-deletion polymorphisms (indels). We confirmed the partition of worldwide diversity into five genetic clusters that correspond to major geographic regions. Using the indels we have also disclosed an among-regions component of genetic variance considerably larger (12.1%) than had been estimated using microsatellites. Our study demonstrates that a set of 40 well-chosen biallelic markers is sufficient for the characterization of human population structure at the global level.


Introduction

Now that we have the complete DNA sequence of the euchromatic human genome (International Human Genome Sequencing Consortium, 2004) there is growing interest in characterizing human genomic variation. The conventional approach for this goal has been first to divide humanity into populations which can then be studied. However, populations often have an ambiguous meaning, being irregularly defined on the basis of “race”, geography, culture, religion, physical appearance or other criteria. Rosenberg et al. (2002) tried to avoid this problem by studying the structure of human genome diversity without prior population assignment. In a landmark study they used the Structure computer program (Pritchard et al. 2000) that uses a cluster algorithm for inferring population structure on the basis of genotype data. A set of 377 autosomal microsatellites and the 52 populations of the HGDP-CEPH Diversity Panel (Cann et al. 2002) were used to study worldwide human genome variation. Without a priori information about the origin of individuals Structure was able to identify five main clusters that corresponded to major geographic regions of the globe. Rosenberg et al. (2002) also observed that the within-population differences among the individuals accounted for 93–95%, and that the among-regions variation was only 3.6%, of the total genetic variance.

The first estimation of the levels of human genetic variation at individual, population and regional levels was published in 1972 by Lewontin, who used blood groups, protein variants and isoenzymes to calculate values of 85.4% for the within-population, 8.3% for the among-populations-within-continent, and 6.3% for the among-continents components of genetic variance. Using DNA markers, other authors obtained similar results (Barbujani et al. 1997; reviewed in Barbujani & Di Benedetto, 2001; Excoffier & Hamilton, 2003), leading to the important corollary that the human species has a low level of geographical structuring that is not compatible with the existence of human races (Templeton, 1999). Observing that no previous study had estimated a within-population component of human genetic variance as high as 93–95%, Excoffier & Hamilton (2003) suggested that the microsatellite mutation model that had been used by Rosenberg et al. (2002) had caused an artifactual underestimation of the among-regions component. In response, Rosenberg et al. (2003) made a spirited defense of their mutation model and blamed the sampling scheme. We decided to approach this problem by studying exactly the same HGDP-CEPH Diversity Panel used by Rosenberg et al. (2002), but using a different type of genetic marker: slow-evolving diallelic short insertion-deletion polymorphisms (indels).

Weber et al. (2002) characterized 2,000 diallelic short indels in the human genome. We accessed their database (http://research.marshfieldclinic.org/) and identified 40 polymorphisms that fulfilled the following criteria: widespread chromosomal location, increasing amplicon sizes that allow multiplex analysis, and allele frequency close to 0.5 in the European population (Supplementary Table 1). Here we report our results from the application of these 40 indel markers to the study of all the samples in the HGDP-CEPH Diversity Panel. Using the Structure program we could reproduce the identification of five main clusters that corresponded to major geographic regions of the globe. However, our analysis of genetic variance showed considerably more structuring, with 85.7% within-population, 2.3% among-populations-within-regions, and 12.1% among-regions components of genetic variance.

Materials and Methods

Populations Studied

DNA samples from 1,064 individuals were obtained from the HGDP-CEPH Human Genome Diversity Cell Line Panel (http://www.cephb.fr/HGDP-CEPH-Panel/; Cann et al. 2002). The individuals were sampled across all five continents and assigned to 52 different populations from seven regional groups (Africa, Europe, Middle East, Central/South Asia, East Asia, Oceania and America).

DNA Analysis

DNA from each individual was independently typed for 40 biallelic short insertion/deletion polymorphisms (indels) selected from those described by Weber et al. (2002) and available at http://research.marshfieldclinic.org/genetics/indels/default.asp (Supplementary Table 1). The PCR amplifications used four multiplex reaction systems, each consisting of a mix of 10–12 primer pairs (Supplementary Table 2). To each forward PCR primer a tail of the M13-40 17-mer oligonucleotide was added. The multiplex PCR assay was performed in a 10-μl final volume of the following: 1× PCR buffer (10 mM Tris-HCl pH 8.3 or pH 9.2, 75 mM KCl, 3.5 mM MgCl₂), 200 μM dNTPs, 1.0 U of Platinum Taq DNA polymerase (Invitrogen), 20 ng of genomic DNA, 1.5 μM of M13-40 forward primer labelled with the FAM dye, 1.5 μM of each unlabelled reverse primer, and 0.1 μM of each unlabelled forward primer.

Two microlitres of each labelled PCR product were denatured in formamide solution plus MegaBACE™ ET550-R Size Standard at 95°C for 5 min, and subjected to fragment analysis using a MegaBACE 1000 DNA sequencer (GE Healthcare) according to the manufacturer's instructions. Allele sizes were scored using the Genetic Profiler (version 2.2) and Fragment Profiler (version 1.2) programs (GE Healthcare).

Population Structure Analysis

We used the Structure program version 2.1 (Pritchard et al. 2000), available at http://pritch.bsd.uchicago.edu/software.html. This software uses multilocus genotypes to infer population structure and to allocate individuals to populations. It defines K clusters (where K has to be provided by the user), each characterized by a set of allele frequencies at each locus, and groups the individuals probabilistically on the basis of their genotypes. We ran ten independent replicates for each value of K, which varied from 1 to 10. Every run consisted of 100,000 burn-in steps, followed by 2 × 10⁶ Markov Chain Monte Carlo iterations, without any prior information on the population origin of each sampled individual. We used the “no admixture” model, in which each individual is assumed to have originated in a single population, and additionally assumed the allele frequencies of different populations to be correlated. The graphical output of Structure was modified by the use of the Distruct software (Rosenberg, 2003) available at http://www.cmb.usc.edu/~noahr/district.html.
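The probabilistic grouping under a no-admixture model can be illustrated with a small Python sketch. It covers only the assignment step, assuming the cluster allele frequencies are already known, Hardy-Weinberg proportions hold within clusters, loci are independent, and the prior over clusters is uniform. Structure itself estimates frequencies and assignments jointly by MCMC, so this is not its implementation. The function names and the toy frequencies are hypothetical.

```python
import math

def genotype_log_likelihood(genotype, long_allele_freqs):
    """Log-probability of a multilocus biallelic genotype in one cluster.

    genotype: number of long alleles (0, 1 or 2) carried at each locus.
    long_allele_freqs: long-allele frequency at each locus in the cluster,
    each strictly between 0 and 1.
    Assumes Hardy-Weinberg proportions and independent loci.
    """
    total = 0.0
    for count, p in zip(genotype, long_allele_freqs):
        total += (math.log(math.comb(2, count))
                  + count * math.log(p)
                  + (2 - count) * math.log(1.0 - p))
    return total

def cluster_posteriors(genotype, cluster_freqs):
    """Posterior probability of each cluster, assuming a uniform prior."""
    log_liks = [genotype_log_likelihood(genotype, f) for f in cluster_freqs]
    largest = max(log_liks)  # subtract the maximum for numerical stability
    weights = [math.exp(v - largest) for v in log_liks]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical long-allele frequencies for three loci in two clusters
# (illustrative values only, not taken from the study).
toy_cluster_freqs = [[0.9, 0.8, 0.7], [0.2, 0.3, 0.1]]
toy_genotype = [2, 2, 1]
toy_posteriors = cluster_posteriors(toy_genotype, toy_cluster_freqs)
print(toy_posteriors)
```

With many loci the per-locus evidence multiplies, which is why a few dozen informative biallelic markers can already give sharp assignments.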

Statistical Analyses

The genetic structure of the populations and basic parameters of molecular diversity, including analyses of molecular variance (AMOVA) (Excoffier et al. 1992), matrix of co-ancestry coefficients (Reynolds et al. 1983), and Hardy-Weinberg proportions by the exact test (Guo & Thompson, 1992), were calculated using the package Arlequin 2.0 (Schneider et al. 2000) with 100,000 steps in the Markov chain. The statistical significance of Fst values was estimated by permutation analysis using 100,000 permutations. Multidimensional Scaling (Kruskal & Wish, 1978) was performed with the program Statistica for Windows, release 4.0. The Mantel test for matrix correlation was performed with a program written by Dr. Jeffrey Long of the University of Michigan Medical School and made available to us by Dr. Keith Hunley from the University of New Mexico.

Results and Discussion

Biallelic Short Insertion-Deletion Polymorphisms (Indels)

The biallelic short indels chosen for this study, together with some of their properties, are shown in Supplementary Table 1. We tested the Hardy-Weinberg equilibrium in all 52 populations for all 40 loci by the exact method of Guo & Thompson (1992). Among the 2080 values thus obtained 94 (0.045) were significant at the 0.05 level. Therefore, the number of significant departures was less than expected on the basis of chance alone.
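The comparison with chance expectation is simple arithmetic, sketched here in Python. The counts come from the text; the variable names are illustrative.

```python
# Expected versus observed number of nominally significant
# Hardy-Weinberg tests across all population-by-locus combinations.
n_populations = 52
n_loci = 40
alpha = 0.05
observed_significant = 94  # value reported in the text

n_tests = n_populations * n_loci
expected_significant = alpha * n_tests          # expected by chance alone
observed_fraction = observed_significant / n_tests

print(n_tests, expected_significant, round(observed_fraction, 3))
```

The observed count falls below the number expected if every test were a true null, which is the basis for the statement that there is no excess of departures from equilibrium.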

Analysis of Molecular Variance

We typed the 40 indels in the full HGDP-CEPH Diversity Panel, composed of 1,064 individuals from 52 different populations distributed in seven geographical regions: Europe, the Middle East, Central Asia, East Asia, Oceania, the Americas and Sub-Saharan Africa. The genotypes were then submitted to an analysis of molecular variance (AMOVA) using the Arlequin program (Schneider et al. 2000). The results of the analysis are shown in Table 1, in comparison with those of Rosenberg et al. (2002). If we focus on each region the results in the two studies are almost identical, the within-population component being responsible for more than 93% of the genetic variance, except for Amerindians, who exhibited 11.5% of among-population-within-region variance (11.6% in the microsatellite study) and a correspondingly lower within-population component. This result is not unexpected, since it is well known that the demography of Amerindians, especially in South America, has occasioned very high degrees of genetic drift that produce elevated levels of between-population gene frequency variation (Cavalli-Sforza et al. 1994). On the other hand, when we examined the data for the seven geographical regions our indel analysis showed a much larger among-regions component of variation (12.1%) compared with the 3.6% observed in the microsatellite study of Rosenberg et al. (2002), who had attributed their low value to the sampling scheme of the HGDP-CEPH Diversity Panel. Excoffier & Hamilton (2003) have already observed that the level of among-regions variance observed by Rosenberg et al. (2002) was smaller than in other worldwide studies, and attributed this to the fact that the authors had not used a stepwise mutation model, the most appropriate model for microsatellite studies. Indeed, not taking homoplasy into account can depress the among-regions variance component (Flint et al. 1999; Romualdi et al. 2002). If one associates the relatively high mutation rate of microsatellites (Leopoldino & Pena, 2003) with the possibility of size constraints for their growth, different populations would tend to approach a common allelic distribution for these markers (Romualdi et al. 2002). The short biallelic indel markers that we employed are expected to have a much lower evolution rate, and thus their distribution is expected to reflect deeper events in the demographic history of populations than would that of microsatellites (Romualdi et al. 2002).

Table 1. Analysis of molecular variance (AMOVA) of the typing results with 40 short insertion-deletion polymorphisms in comparison with the results of Rosenberg et al. (2002). Data were analyzed with the Arlequin ver. 2.000 software (Schneider et al. 2000). Variance components are given in %: WP, within populations; APWR, among populations within regions; AR, among regions. A hyphen marks a component that does not apply because the sample was treated as a single region.

                                             40 indels (this study)     377 microsatellites (Rosenberg et al. 2002)
Sample                Regions  Populations   WP       APWR     AR       WP       APWR     AR
World                 1        52            87.2     12.8     -        94.6     5.4      -
World                 7        52            85.7     2.3      12.1     94.1     2.4      3.6
Africa                1        7             95.3     4.7      -        96.9     3.1      -
Eurasia               1        21            97.7     2.3      -        98.5     1.5      -
Eurasia               3        21            97.4     1.4      1.2      98.3     1.2      0.5
  Europe              1        8             99.0     1.0      -        99.3     0.7      -
  Middle East         1        4             98.5     1.5      -        98.7     1.3      -
  Central/South Asia  1        9             98.3     1.7      -        98.6     1.4      -
East Asia             1        18            98.7     1.4      -        98.7     1.3      -
Oceania               1        2             93.9     6.1      -        93.6     6.4      -
America               1        5             88.5     11.5     -        88.4     11.6     -
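The three-level partition in Table 1 can be re-expressed as hierarchical fixation indices. The Python sketch below uses the standard AMOVA definitions (FCT for regions relative to the total, FSC for populations relative to their region, FST for populations relative to the total) and takes the seven-region world rows as input. The function name is hypothetical, and because the published percentages are rounded the results are approximate.

```python
def fixation_indices(among_regions, among_pops_within_regions, within_pops):
    """Convert AMOVA variance components (in any common unit) to F-statistics."""
    total = among_regions + among_pops_within_regions + within_pops
    f_ct = among_regions / total
    f_sc = among_pops_within_regions / (among_pops_within_regions + within_pops)
    f_st = (among_regions + among_pops_within_regions) / total
    return f_ct, f_sc, f_st

# Seven-region world rows of Table 1 (percentages as published).
indel_f = fixation_indices(12.1, 2.3, 85.7)
microsatellite_f = fixation_indices(3.6, 2.4, 94.1)

for label, (f_ct, f_sc, f_st) in [("indels", indel_f),
                                  ("microsatellites", microsatellite_f)]:
    print(label, round(f_ct, 3), round(f_sc, 3), round(f_st, 3))
```

The indices satisfy (1 - FST) = (1 - FSC)(1 - FCT), which is a quick consistency check on any reported three-level partition.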

Our results are compatible with those obtained by two worldwide studies more directly comparable with ours: one by Romualdi et al. (2002) with 21 Alu insertion-deletion polymorphisms in 32 populations and another by Watkins et al. (2003) with 100 Alu polymorphisms in 31 populations. The former obtained the following components of genetic variance: 82.9% within-populations, 8.2% among-populations-within-regions and 8.9% among-regions. The latter observed that 88.6% of the genetic variance occurred within-populations, 1.9% among-populations-within-regions, and 9.6% among-regions (Watkins et al. 2003). Our results are also very similar to those of Bowcock et al. (1991) for 100 diallelic DNA polymorphisms (SNPs) tested in five populations from four continents.

As pointed out by Bowcock et al. (1991) the disparity of the among-populations component (FST) from one polymorphism to another may help to establish whether natural selection is playing a role or whether variation is selectively neutral. In the latter case the only force at play is drift, which we expect to be equal for all genes, since it depends only on demographic properties of the populations and not on the particular genes being studied. To investigate this, we plotted the observed FST values for the 40 polymorphisms against the corresponding mean gene frequency (data not shown). We then compared the number of FST values expected in the various percentile classes, calculated according to Bowcock et al. (1991), with the number observed. Thirty-eight out of 40 plotted points (95%) were located between the 5th percentile and the 95th percentile of the simulated FST distributions for different initial gene frequencies. In other words, at this level of resolution there was no evidence of deviation from neutrality.

Multidimensional Scaling

We tested the discrimination power of our 40-indel set by obtaining a distance matrix of the 52 populations using the Reynolds genetic measure, which is based on the FST linearized for short divergence times (Reynolds et al. 1983). From the matrix we undertook a Multidimensional Scaling analysis (MDS; Kruskal & Wish, 1978) using the program Statistica. With only two dimensions we obtained an adequate graphical representation of the distance matrix (stress = 0.108), as shown in Figure 1. It is immediately apparent that the points corresponding to the 52 populations aggregate into five widely separated clusters that correspond to Africa, Oceania, East Asia, America and a central Europe-Middle East-Central Asia group (E-ME-CA). It is interesting to observe that the two most distant clusters are Africa and America, exactly the two anchor clusters produced by the Structure program when K = 2 (see below). The E-ME-CA cluster can be separated into three population groups using prior geographical information, thereby producing the separation into seven major geographical regions. Among the 52 populations, the only one misclassified according to geographical region was the Kalash population of Central Asia (open arrow). Zhivotovsky et al. (2003) used the data from Rosenberg et al. (2002) to produce an MDS plot with a topography similar to ours, including the “anomalous” position of the Kalash. However, in accordance with the AMOVA results the major geographical regions appeared to be more separated in our MDS plot.
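The linearized distance mentioned above is commonly computed as D = -ln(1 - FST). The Python sketch below applies that transformation to a small matrix of pairwise FST values; the values and function names are hypothetical and serve only to illustrate the kind of matrix that is passed to an MDS routine.

```python
import math

def reynolds_distance(fst):
    """Linearize a pairwise FST as D = -ln(1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in the interval [0, 1)")
    return math.log(1.0 / (1.0 - fst))

def distance_matrix(pairwise_fst):
    """Apply the transformation to every cell of a square FST matrix."""
    return [[reynolds_distance(value) for value in row] for row in pairwise_fst]

# Hypothetical pairwise FST values for three populations (illustrative only).
toy_fst = [
    [0.00, 0.02, 0.15],
    [0.02, 0.00, 0.12],
    [0.15, 0.12, 0.00],
]
toy_distances = distance_matrix(toy_fst)
for row in toy_distances:
    print([round(value, 4) for value in row])
```

For small FST the distance is close to FST itself, and it grows faster as differentiation increases, which is what makes it approximately proportional to divergence time under drift alone.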

Figure 1. Multidimensional scaling plot obtained with the program Statistica, ver. 4.0. Each point represents one population, numbered as follows: (1) Biaka_Pygmies, (2) Mbuti_Pygmies, (3) Mandenka, (4) Yoruba, (5) Bantu_NE, (6) San, (7) Bantu_SE/SW, (8) Mozabite, (9) Bedouin, (10) Druze, (11) Palestinian, (12) Brahui, (13) Balochi, (14) Hazara, (15) Makrani, (16) Sindhi, (17) Pathan, (18) Kalash, (19) Burusho, (20) Han, (21) Tujia, (22) Yizu, (23) Miaozu, (24) Oroqen, (25) Daur, (26) Mongola, (27) Hezhen, (28) Xibo, (29) Uygur, (30) Dai, (31) Lahu, (32) She, (33) Naxi, (34) Tu, (35) Yakut, (36) Japanese, (37) Cambodian, (38) Papuan, (39) NAN_Melanesian, (40) French, (41) French_Basque, (42) Sardinian, (43) North_Italian, (44) Tuscan, (45) Orcadian, (46) Adygei, (47) Russian, (48) Pima, (49) Maya, (50) Colombian, (51) Karitiana, (52) Surui. The double arrow indicates the Kalash population, which belongs to Central Asia and is apparently misclassified (see text). Details about the populations can be obtained at http://www.cephb.fr/HGDP-CEPH-Panel/.

Cluster Analysis with the Structure Program

Finally, we analyzed the indel data with the Structure program (Pritchard et al. 2000). We made ten runs for each value of K from 1 to 10, with a burn-in of 100,000 steps and a run length of 2,000,000 iterations. All runs produced the same clusters except for those at K ≥ 7. The results with K varying from 2 to 6 are shown in Fig. 2. At K = 2 the data were, as seen by Rosenberg et al. (2002), anchored by Africa and America, separated by a relatively large genetic distance, with East Asia very close to America and the Europe-Middle East-Central Asia block close to Africa. Excoffier (2003) pointed out that this division observed by Rosenberg et al. (2002) was at odds with previous results, in which a first split had often been observed between sub-Saharan Africans and non-Africans. Our indel data now reproduce the division reported by Rosenberg et al. (2002). At K = 3 we observe clusters of Africa, a Europe-Middle East-Central Asia-Oceania block and East Asia-America. At K = 4 East Asia and America separate, and at K = 5 we get groups that correspond to five major geographical regions (with Europe, Middle East and Central Asia clustered). With K = 6 the situation continues more or less unchanged, with no further significant splits. The value of K = 5 has the highest posterior probability (0.9999).
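A posterior probability for K of this kind is usually obtained with the ad hoc approximation described by Pritchard et al. (2000): with a uniform prior on K, the estimated ln P(X|K) values are exponentiated and normalized. The Python sketch below shows that calculation. The log-likelihood values are hypothetical placeholders, not the values obtained in this study, and the function name is illustrative.

```python
import math

def posterior_over_k(log_likelihoods):
    """Normalize exp(ln P(X|K)) across K, assuming a uniform prior on K.

    log_likelihoods: dict mapping K to the estimated ln P(X|K).
    """
    largest = max(log_likelihoods.values())
    # Subtracting the maximum avoids underflow when exponentiating.
    weights = {k: math.exp(v - largest) for k, v in log_likelihoods.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

# Hypothetical ln P(X|K) estimates (placeholders for illustration only).
toy_log_likelihoods = {4: -52310.0, 5: -52290.0, 6: -52300.0}
toy_posterior = posterior_over_k(toy_log_likelihoods)
best_k = max(toy_posterior, key=toy_posterior.get)
print(best_k, toy_posterior)
```

Because the weights depend exponentially on differences in log-likelihood, even modest differences between values of K push nearly all of the posterior mass onto a single K.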

Figure 2. Estimated population structure of 52 human populations studied with 40 diallelic short insertion-deletion polymorphisms. Each of the five horizontal bars is composed of thin vertical lines representing all 1,064 individuals. The lines are coloured according to the individual's estimated membership fractions in the K clusters; the value of K is stated on the left. Vertical black lines separate the individuals into 52 different populations, identified by the labels at the bottom. Ten Structure runs were performed for each K with a burn-in of 100,000 steps and a run length of 2,000,000 iterations. The graph was prepared with the Distruct software. Details about the populations can be obtained at http://www.cephb.fr/HGDP-CEPH-Panel/.

Turakulov & Easteal (2003) computed that 65 random biallelic polymorphisms (SNPs) would be necessary for identifying distinct geographically separated populations, while Bamshad et al. (2003) calculated that 60 Alu indels would be sufficient to obtain assignment to the continent of origin with an accuracy of 90%. Our study demonstrates that a set of 40 well-chosen biallelic markers is sufficient to characterize human population structure at the global level.

Verification of Possible Biases

As explained above, one of the criteria for the choice of these specific indels was allele frequency close to 0.5 in the European population. For the chosen loci the average frequency of the long allele in Europeans, as determined by Weber et al. (2002), was 0.51 (0.42 in Amerindians, 0.50 in Japanese and 0.61 in Africans). The fact that the indels had been chosen for their high variability in Europeans will lead to biases in the comparison of allele frequencies and gene diversity among the various regions. However, this need not affect the partition of genetic variance. Indeed, when each region is analyzed separately, our within-population and among-population-within-region components of the total genetic variance are practically identical to those of Rosenberg et al. (2002) (Table 1). We checked this further using three approaches. First, we studied the distribution of worldwide Fst values. Kidd et al. (2004), studying 369 biallelic markers, observed that the distribution of worldwide Fst values had a mean of 0.138 and was skewed to the right (skewness = 1.082). They attributed the skewness to the fact that the markers had been ascertained by being shown to be polymorphic with moderate to high heterozygosities in a non-African population. The Fst distribution for the 40 indels had a mean of 0.141, which is very similar to the value obtained by Kidd et al. (2004). Moreover, it was not significantly different from normality and had a skewness of only 0.669, not significantly different from zero. Thus we found no evidence of bias. Second, we compared our distance matrix of the 52 populations, calculated using the Reynolds genetic measure (Reynolds et al. 1983), with a distance matrix calculated from the data of Rosenberg et al. (2002), which can be considered essentially free of ascertainment bias. The Mantel test revealed a highly significant correlation of 0.48 (p < 0.0001). The fact that the correlation was not perfect may be related to the fact that, as already discussed, the indels are expected to have a lower mutation rate than microsatellites and may therefore be more sensitive to deeper events in the human genealogical tree. As a final test, following Urbanek et al. (1996), we chose the 11 loci among our indels with the highest gene diversity in Europe (set E) and the 11 loci with the highest diversity in Africa (set A). We then performed separate worldwide AMOVA analyses with the two sets, obtaining virtually identical results: for set E and set A, respectively, the within-population components of variance were 86.35% and 88.38%, the among-groups components were 10.84% and 9.05%, and the among-populations-within-groups components were 2.82% and 2.56%. We conclude that in our study the ascertainment bias apparently did not significantly affect the partition of variance.
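The matrix comparison can be illustrated with a generic Mantel permutation test in Python. This is a minimal sketch, not the program used in the study: it correlates the upper triangles of two distance matrices and estimates a one-sided p-value by jointly permuting the rows and columns of one matrix. The toy matrices, the function names and the number of permutations are all assumptions.

```python
import random

def upper_triangle(matrix, order=None):
    """Flatten the upper triangle, optionally after relabelling rows and columns."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mantel_test(d1, d2, permutations=999, seed=0):
    """Return the matrix correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    fixed = upper_triangle(d2)
    observed = pearson(upper_triangle(d1), fixed)
    n = len(d1)
    at_least_as_large = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # relabel the populations of d1 at random
        if pearson(upper_triangle(d1, order), fixed) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (permutations + 1)
    return observed, p_value

# Toy example: distances among five points on a line, and the same
# distances doubled (illustrative data only).
positions = [0.0, 1.0, 3.0, 6.0, 10.0]
toy_d1 = [[abs(a - b) for b in positions] for a in positions]
toy_d2 = [[2.0 * value for value in row] for row in toy_d1]
toy_r, toy_p = mantel_test(toy_d1, toy_d2)
print(round(toy_r, 3), toy_p)
```

Permuting whole rows and columns together, rather than individual cells, preserves the dependence among distances that share a population, which is why an ordinary correlation test is not appropriate here.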

Conclusions

In summary, we have studied the same worldwide population sample as Rosenberg et al. (2002) with a set of 40 biallelic slow-evolving short insertion-deletion polymorphisms. We found that the genetic structure of the populations included in the HGDP-CEPH Diversity Panel is best portrayed by a picture of the world divided into genetic clusters that tightly correspond to five geographic regions: America, Sub-Saharan Africa, East Asia, Oceania and a group composed of Europe, the Middle East and Central Asia. We have also shown that with our set of indels we disclose an among-regions component of genetic variance considerably larger than was estimated by Rosenberg et al. (2002) using microsatellites.

Population studies have suggested that genetic variation is essentially continuous throughout space among humans (Romualdi et al. 2002). This knowledge is apparently at odds with the regional discontinuity observed by Rosenberg et al. (2002) and by us. Serre & Paabo (2004) have proposed that such discontinuity might be an artifact imposed by the sample structure of the HGDP-CEPH Diversity Panel. To address this issue more directly, it will probably be necessary in the future to switch the emphasis of worldwide panels from populations to individuals (Cavalli-Sforza, 2005).

Acknowledgements

This work was partly supported by a grant from the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). We are grateful to Dr. Howard Cann of the Fondation Jean Dausset who provided the HGDP-CEPH Diversity Panel, and to Dr. Jeffrey Long of the University of Michigan Medical School and Dr. Keith Hunley from the University of New Mexico for making available to us software for the analysis of matrix correlation. We thank Rodrigo Richard Gomes for software development and Tales Silva for his help in running Structure. Neuza A. Rodrigues and Kátia Barroso provided expert technical assistance.

References
