
Large Allele Frequency Differences between Human Continental Groups are more Likely to have Occurred by Drift During Range Expansions than by Selection

T. Hofer

Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland and Swiss Institute of Bioinformatics

N. Ray

Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland and Swiss Institute of Bioinformatics

D. Wegmann

Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland and Swiss Institute of Bioinformatics

L. Excoffier

Corresponding Author

Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland and Swiss Institute of Bioinformatics

*Corresponding author: Laurent Excoffier, Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, Baltzerstrasse 6, 3012 Berne, Switzerland. Tel.: +41 31 631 30 31. Fax: +41 31 631 48 88. E‐mail: Laurent.Excoffier@zoo.unibe.ch

First published: 25 November 2008
Cited by: 84

Summary

Several studies have found strikingly different allele frequencies between continents. This has been mainly interpreted as being due to local adaptation. However, demographic factors can generate similar patterns. In particular, allelic surfing during a population range expansion may increase the frequency of alleles in newly colonised areas. In this study, we examined 772 STRs, 210 diallelic indels, and 2834 SNPs typed in 53 human populations worldwide from the HGDP‐CEPH Diversity Panel to determine to what extent allele frequencies differ among four regions (Africa, Eurasia, East Asia, and America). We find that large allele frequency differences between continents are surprisingly common, and that Africa and America show the largest number of loci with extreme frequency differences. Moreover, more STR alleles have increased rather than decreased in frequency outside Africa, as expected under allelic surfing. Finally, there is no relationship between the extent of allele frequency differences and proximity to genes, as would be expected under selection. We therefore conclude that most of the observed large allele frequency differences between continents result from demography rather than from positive selection.

Introduction

On a worldwide scale, human populations show large phenotypic variability, particularly for skin colour, face and body shapes, susceptibility to pathogens, as well as for the prevalence of genetic diseases (Lewontin, 1995). However, most of the genetic variation in humans is found within populations rather than among populations or geographic regions (Lewontin, 1972, Barbujani et al. 1997, Rosenberg et al. 2002b). Still, many studies have focused on traits or loci showing geographically restricted distribution, or on loci showing drastic allele frequency differences between two regions. These particular cases can indeed reveal important information about local selective pressures or about the demographic histories of different populations (Balaresque et al. 2007). It is, however, difficult to disentangle the effects of positive selection from those of demography, since past demographic events such as population bottlenecks or range expansions can mimic the genetic signatures of a selective sweep, such as long‐range linkage disequilibrium and reduced allelic diversity.

The colonisation of the world by modern humans was probably accompanied by a series of founder effects with subsequent local population expansions (Handley et al. 2007). Strong bottlenecks have also certainly occurred during the exit out of Africa and at the onset of the colonisation of the Americas by people from Asia (Fagundes et al. 2007, Goebel et al. 2008). These bottlenecks, followed by a spatial expansion, can lead to the geographic spread of an allele that rides on the wave of advance of the spatial expansion, a phenomenon called allelic surfing (Edmonds et al. 2004, Klopfstein et al. 2006, Travis et al. 2007). New mutations arising on the wave front and extant alleles may surf successfully (Excoffier & Ray, 2008), spreading geographically and increasing in frequency in the newly colonised areas (Klopfstein et al. 2006). A combination of simulation, analytical and experimental studies has shown that the probability for an allele to successfully surf is increased in the presence of spatial bottlenecks, when local deme size is small, and when populations at the wave front grow rapidly and exchange few genes with their neighbours (Klopfstein et al. 2006, Hallatschek et al. 2007, Travis et al. 2007, Excoffier & Ray, 2008, Hallatschek & Nelson, 2008). This neutral process has received much attention recently because its effects on allele frequencies mimic those of selective processes (Nielsen et al. 2007).

It is clear, however, that human populations colonising novel habitats have been confronted by new selective pressures due to their exposure to different climate, food sources, and pathogens (Balaresque et al. 2007). Some of these selective pressures certainly triggered local adaptation that affected allele frequencies at several loci. Yet neutral allelic surfing, like selection, will also occur at only a few loci, and will therefore not affect all loci uniformly, in contrast to other demographic factors such as demographic expansions, inbreeding or bottlenecks.

Until recently, most human genes showing strong geographic structures were considered to be under positive selection (see Table 1, where 44 such genes are listed). Most of these genes show a marked difference in allele frequencies (typically larger than 20%) between African and non‐African populations. In many of these studies, local selection outside Africa was thought to have promoted these large allele frequency differences. Prominent examples are two genes that are involved in the control of brain size, MCPH1 and ASPM (Evans et al. 2005, Mekel‐Bobrov et al. 2005). Both genes showed an increased frequency of a derived allele outside Africa and high levels of linkage disequilibrium. The authors therefore hypothesised that the derived haplotypes were under local positive selection in non‐African populations. However, Currat et al. (2006) showed by spatially‐explicit simulations that similar geographic distributions of allele frequencies could be generated by neutral allelic surfing during the range expansion outside Africa.

Table 1. Genes reported as showing a high degree of population differentiation in the literature. We use here the official gene symbols as defined by the HGNC, and we provide in brackets the symbols used in the references if these are not the official symbols. For only 11 genes out of 44 (25%), past demography was proposed to be more likely to have shaped geographic structure than selection.
| Gene | Demography proposed as an explanation | References |
| ABCB1 (MDR1) | | Tang et al. (2004), Wang et al. (2007a) |
| ABCG2 | × | de Jong et al. (2004) |
| ADH1B | | Han et al. (2007), Osier et al. (2002) |
| AGT | | Nakajima et al. (2004) |
| ALDH2 | | Oota et al. (2004) |
| APOE | × | Singh et al. (2006) |
| ASIP | × | Norton et al. (2007) |
| ASPM | | Mekel‐Bobrov et al. (2005) |
| ATXN2 (SCA2) | | Yu et al. (2005) |
| CAPN10 | | Fullerton et al. (2002) |
| CASP12 | | Xue et al. (2006) |
| CCR5 | × | Sabeti et al. (2005) |
| CD28 | × | Butty et al. (2007) |
| CTLA4 | × | Butty et al. (2007) |
| CYP3A | × | Schirmer et al. (2006), Thompson et al. (2004) |
| DARC (FY) | | Hamblin et al. (2002) |
| DMD | | Nachman & Crowell (2000) |
| EDAR | | Bryk et al. (2008) |
| F7 | | Hahn et al. (2004) |
| G6PD | | Saunders et al. (2002) |
| GNB3 | | Young et al. (2005) |
| GRK4 | | Lohmueller et al. (2006) |
| ICOS | × | Butty et al. (2007) |
| IL13 | | Zhou et al. (2004) |
| IL4 | | Rockman et al. (2003) |
| LCT | | Bersaglieri et al. (2004) |
| MAOA | | Gilad et al. (2002) |
| MAPT | | Stefansson et al. (2005) |
| MC1R | | Gerstenblith et al. (2007) |
| MCPH1 | | Evans et al. (2005) |
| MMP3 | | Rockman et al. (2004) |
| MSTN (GDF8) | | Saunders et al. (2006) |
| MTHFR | × | Hughes et al. (2006), Rosenberg et al. (2002a) |
| NAT2 | | Sabbagh et al. (2008) |
| OCA2 | × | Norton et al. (2007) |
| PDYN | | Rockman et al. (2005) |
| PTPRC | | Stanton et al. (2003) |
| SLC22A4 | | Mori et al. (2005) |
| SLC24A5 | | Lamason et al. (2005), Norton et al. (2007) |
| SLC45A2 (MATP/AIM1) | | Norton et al. (2007), Soejima et al. (2006) |
| SLCO1B1 | × | Pasanen et al. (2008) |
| TAS2R16 | | Soranzo et al. (2005) |
| TRPV6 | | Akey et al. (2006) |
| TYR | | Norton et al. (2007) |

In this study, we explore data from the HGDP‐CEPH Diversity Panel consisting of 772 STRs, 210 insertion‐deletion polymorphisms and 2834 SNPs typed in 53 populations worldwide to determine the prevalence of large allele frequency differences between regions. We find that large allele frequency differences between continental regions are extremely common, as they occur at almost one third of all loci. We discuss the respective role of selection and demographic factors for shaping these patterns in the light of geographic and genomic information.

Material and Methods

Data

We analysed three multilocus data sets containing short tandem repeats (STR), insertion‐deletion polymorphisms (indel) and single nucleotide polymorphisms (SNP), respectively, typed in 53 worldwide populations belonging to the CEPH Human Genome Diversity Panel (Cann et al. 2002, Rosenberg et al. 2002b, Ramachandran et al. 2005, Conrad et al. 2006). The individuals analysed correspond to the H1048 subset defined by Rosenberg (2006), which excludes atypical and duplicated samples. The datasets were downloaded from the web site http://rosenberglab.bioinformatics.med.umich.edu/diversity.html.

Initially, the STR data set contained 783 loci typed in 1048 individuals, but we have removed eleven loci showing overall more than 10% missing data (GATA43C11, GGAA22E01, GATA193D02, GATA135F02P, AAC023, ATT015, ATT077P, GATA63C02, ATA109H09, GATA7F09, and TTTA033), and we thus analysed a total of 9210 alleles at 772 STR loci. We also examined 210 diallelic indels that were typed in the same 1048 individuals, as well as 2834 SNP loci that were typed in a subset of 927 individuals.

The populations were grouped in five main geographic regions, following Rosenberg et al. (2002b): Africa, Eurasia, East Asia, America, and Oceania (Excoffier, 2003, see also Bastos‐Rodrigues et al. 2006, Li et al. 2008). A complete list of the populations is found in Table S1.

Analyses

The STR, indel and SNP data sets were analysed separately. We used ARLEQUIN ver 3.11 (Excoffier et al. 2005) to calculate the average frequency of each allele in the populations. The R statistical package (R Development Core Team, 2008) was used to develop scripts for the analyses listed below.

For each allele i, we computed the average allele frequency $\bar{f}_{ij}$ within each geographic region j, as well as the difference with the average frequency computed over all other populations as $\Delta F_{ij} = \bar{f}_{i\bar{j}} - \bar{f}_{ij}$, where $\bar{f}_{i\bar{j}}$ is the average frequency of allele i in all populations not belonging to the geographic region j. This was done for all regions except Oceania, because there are only two populations in this region, and therefore the average frequency is subject to large fluctuations and the power to detect significant differences is low. For STR data, we also computed for each locus the index ΔFmax as the largest absolute value of ΔF found among all alleles present at that locus. This index ΔFmax allows us to characterize allele frequency differences at each locus with a single statistic, as in the case of diallelic loci. For diallelic loci, ΔFmax = |ΔF|.
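As a minimal sketch of the two statistics defined above, the Python fragment below computes ΔF for one allele and ΔFmax for one locus. The function names, the dictionary layout and the toy frequencies are illustrative assumptions and not the authors' R scripts; the sign convention (average outside the region minus average inside it) follows the caption of Table 2. Whether Oceanian populations were counted among the "other" populations is not stated, so the sketch simply treats every population outside the tested region as "outside".

```python
from statistics import mean

def delta_f(freqs, region_of, region):
    """ΔF for one allele: mean frequency outside `region` minus mean inside.

    freqs     -- dict: population name -> frequency of the allele
    region_of -- dict: population name -> region label
    """
    inside = [f for pop, f in freqs.items() if region_of[pop] == region]
    outside = [f for pop, f in freqs.items() if region_of[pop] != region]
    return mean(outside) - mean(inside)

def delta_f_max(locus_freqs, region_of, region):
    """ΔFmax for one locus: largest |ΔF| over all alleles at the locus."""
    return max(abs(delta_f(f, region_of, region)) for f in locus_freqs.values())

# Hypothetical toy data: two African and two non-African populations.
region_of = {"A1": "Africa", "A2": "Africa", "E1": "Eurasia", "E2": "East Asia"}
locus = {
    "allele_1": {"A1": 0.1, "A2": 0.3, "E1": 0.6, "E2": 0.8},
    "allele_2": {"A1": 0.9, "A2": 0.7, "E1": 0.4, "E2": 0.2},
}
df_allele_1 = delta_f(locus["allele_1"], region_of, "Africa")
df_max = delta_f_max(locus, region_of, "Africa")
```

In this toy case allele_1 is rarer inside Africa than outside, so its ΔF is positive, and ΔFmax for the locus is the common absolute value of the two allele-specific differences.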

We randomly permuted populations between regions and recomputed each time ΔF, to obtain its null distribution and test for the significance of ΔF for each allele. The same permutation procedure was used to test if the number of alleles with a given frequency difference (kΔF) between a region and the rest of the world was significantly larger than expected by chance.
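The free permutation procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the test is taken to be two-sided (on |ΔF|), the observed labelling is counted as one permutation, and the number of permutations, the seed and the toy frequencies are arbitrary.

```python
import random
from statistics import mean

def delta_f(freqs, labels, region):
    """Mean frequency outside `region` minus mean frequency inside it."""
    inside = [f for f, lab in zip(freqs, labels) if lab == region]
    outside = [f for f, lab in zip(freqs, labels) if lab != region]
    return mean(outside) - mean(inside)

def permutation_p_value(freqs, labels, region, n_perm=2000, seed=1):
    """Share of label permutations with |ΔF| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(delta_f(freqs, labels, region))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign populations to regions at random
        if abs(delta_f(freqs, shuffled, region)) >= observed:
            hits += 1
    # Count the observed labelling as one permutation so that p is never 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 4 populations in the tested region, 8 elsewhere.
labels = ["R"] * 4 + ["other"] * 8
differentiated = [0.1] * 4 + [0.9] * 8   # allele much rarer inside region R
uniform = [0.5] * 12                      # no differentiation at all
p_diff = permutation_p_value(differentiated, labels, "R")
p_flat = permutation_p_value(uniform, labels, "R")
```

The same shuffled labellings could be reused to build the null distribution of kΔF, by counting in each permutation how many alleles exceed a given frequency difference.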

We also introduced a procedure to test if a random set of populations that are geographically close to each other also present sharp allele frequency differences with the rest of the world. Taking geography into account is actually a more stringent test of allele frequency differences than a procedure based on free permutation of random populations, because populations closer to each other tend to be more similar than populations at greater distance, due to isolation by distance and shared history. However, when regions consist of only a small number of populations, such as America, the number of possible random groups is reduced. kΔF was tested by taking geographical constraints into account as follows: a random population is assigned to the group representing the tested region, and the other populations allocated to this group are drawn at random from the 2Pj − 1 geographically closest populations, where Pj is the number of populations in the tested region. The geographic distance between populations was computed as the shortest distance on land (i.e. least‐cost path avoiding seas) using the software PATHMATRIX (Ray, 2005).
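One draw of a geographically constrained random group, as described above, might look like the sketch below. The function name, the distance-matrix layout and the toy geography are hypothetical. The pool size is exposed as a parameter and defaults to 2Pj − 1, and the sketch assumes the pool is ranked by distance to the first (seed) population.

```python
import random

def draw_constrained_group(dist, group_size, rng, pool_size=None):
    """Draw one random group of geographically close populations.

    dist       -- dict of dicts: dist[a][b] is the distance between a and b
    group_size -- number of populations in the tested region (Pj)
    pool_size  -- number of nearest neighbours to sample from
                  (defaults to 2 * Pj - 1)
    """
    if pool_size is None:
        pool_size = 2 * group_size - 1
    seed_pop = rng.choice(sorted(dist))
    # Rank all other populations by distance to the seed population.
    neighbours = sorted((p for p in dist if p != seed_pop),
                        key=lambda p: dist[seed_pop][p])
    pool = neighbours[:pool_size]
    # The remaining Pj - 1 members are sampled without replacement.
    return [seed_pop] + rng.sample(pool, group_size - 1)

# Hypothetical toy geography: ten populations spaced along a line.
positions = {f"pop{i}": float(i) for i in range(10)}
dist = {a: {b: abs(xa - xb) for b, xb in positions.items()}
        for a, xa in positions.items()}
group = draw_constrained_group(dist, group_size=3, rng=random.Random(42))
```

Repeating such draws and recomputing kΔF for each random group gives the null distribution for the geographically explicit test. With few populations in a region, the number of distinct groups that can be drawn this way is small, which is the limitation noted in the text.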

If allelic surfing was a major driving force behind allele frequency differences, we would expect to find more STR alleles with a higher frequency in newly colonised areas, because surfing promotes the increase in frequency of low frequency alleles. However, we would not necessarily expect to find any asymmetry in the direction of frequency change of derived SNP and indel alleles, since surfing should affect ancestral and derived alleles equally. We tested these predictions by performing a sign test on the number of alleles having increased or decreased in frequency outside a region of interest. The ancestral allele for each human SNP was inferred by comparisons with orthologous alleles in the chimpanzee and rhesus macaque genome assemblies, available in the Table Browser at the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/, table snp128OrthoPanTro2RheMac2; Karolchik et al. 2008). The ancestral allele was assumed to be identified if both the chimpanzee and macaque alleles were described and identical, or if an allele was only known in one of these two species. If orthologous alleles were known in both species but were different from each other, the ancestral allele was assumed to be the chimpanzee allele if the human variants contained the chimpanzee allele but not the macaque allele. In all other cases the ancestral allele was assumed to be unknown. Likewise, the ancestral state of the indels was inferred by comparing human allelic diversity to orthologous alleles in the chimpanzee and in the gorilla (Weber et al. 2002). In this way, we were able to determine the ancestral allelic states of 176 indel loci and 1530 SNP loci. We then used the R function ‘sign.test’ (Package BSDA; Arnholt, 2007) to perform a sign test allowing us to determine if there is any asymmetry in the frequency change of derived alleles.

The genomic positions of a subset of 476 STRs, 162 indels and 2784 SNPs could be determined in the NCBI Build 35‐reference system. The distance to the nearest gene was computed for each of the mapped loci, and varied from 0 (when the marker is found within the transcript of a gene) to 73.9 Mb. We computed the Pearson correlation coefficient between ΔFmax and marker distance to the closest gene to assess whether there was any relationship between these two variables.
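For reference, the Pearson correlation coefficient between marker distance to the closest gene and ΔFmax can be computed as in the short sketch below. The five data points are invented for illustration only, and the significance test is not reproduced here.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical values: distance to the nearest gene (kb) and ΔFmax
# for five markers, constructed to lie exactly on a decreasing line.
distance_kb = [0.0, 10.0, 20.0, 30.0, 40.0]
delta_f_max = [0.50, 0.40, 0.30, 0.20, 0.10]
r = pearson_r(distance_kb, delta_f_max)
r_squared = r ** 2   # share of variance explained
```

A negative r, as in this constructed example, is what would be expected if large frequency differences were concentrated close to genes.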

Results

We tested whether populations belonging to the same region have more similar allele frequencies than expected by chance due to shared demographic history or shared selective events. Indeed, they show more similar allele frequencies than random populations, as the number of alleles showing ΔF > 0.2 for a given comparison is always significantly larger than expected by chance when tested with the random population permutation procedure (Tables 2–4 and Tables S2‐S4). However, this is not always the case when tested with the geographically explicit permutation test, in which randomized regions are made up of spatially neighbouring populations. In the STR dataset, all positive frequency differences between America and the rest of the world that are larger than 0.2 are non‐significant (Table 2 and Table S2). Additionally, some of the larger frequency differences between America and the rest of the world in the indel dataset are also non‐significant (Table 3 and Table S3). The geographically explicit permutation test is expected to be more stringent, as geographically close populations are often genetically more similar than random populations. However, if there are only a few populations in a region, as is the case for the Americas, the geographically explicit permutation test is too stringent because the number of different random groups is reduced. Allele‐specific ΔF was therefore tested with the random permutation procedure only, and it was found to be significant in all cases where ΔF > 0.25. We therefore chose an arbitrary threshold for ΔF of 0.3 to define a set of alleles with significant ΔF to summarise the results.

Table 2. STR allele frequency differences (ΔF) for the comparisons of Africa vs. the rest of the world and America vs. the rest of the world. Positive ΔF values (in the upper part of the table) indicate that the alleles have a lower frequency within African (or American) populations than in the non‐African (or non‐American) populations (because ΔF is the average frequency outside the region minus the average frequency within it).
Columns 2–7: Africa vs. non‐Africa. Columns 8–13: America vs. non‐America.
| ΔF | Alleles (a) | significant (b) | p‐value 1 (c) | p‐value 2 (d) | Loci (e) | significant (f) | Alleles (a) | significant (b) | p‐value 1 (c) | p‐value 2 (d) | Loci (e) | significant (f) |
| 0.65–0.7 | 0 | | | | | | 0 | | | | | |
| 0.6–0.65 | 1 | 1 | ** | ** | 1 | 1 | 0 | | | | | |
| 0.55–0.6 | 0 | | | | | | 0 | | | | | |
| 0.5–0.55 | 5 | 5 | ** | ** | 5 | 5 | 0 | | | | | |
| 0.45–0.5 | 9 | 9 | ** | ** | 9 | 9 | 1 | 1 | * | | 0 | 0 |
| 0.4–0.45 | 9 | 9 | ** | ** | 9 | 9 | 1 | 1 | * | | 0 | 0 |
| 0.35–0.4 | 19 | 19 | ** | ** | 17 | 17 | 6 | 6 | ** | | 6 | 6 |
| 0.3–0.35 | 24 | 24 | ** | ** | 22 | 22 | 13 | 13 | * | | 6 | 6 |
| (−0.3)–0.3 | 9122 | 4604 | | | 693 | 609 | 9049 | 3916 | | | 625 | 568 |
| (−0.3)–(−0.35) | 9 | 9 | ** | * | 6 | 6 | 53 | 53 | ** | ** | 49 | 49 |
| (−0.35)–(−0.4) | 8 | 8 | ** | * | 7 | 7 | 34 | 34 | ** | ** | 33 | 33 |
| (−0.4)–(−0.45) | 2 | 2 | ** | * | 1 | 1 | 24 | 24 | ** | ** | 24 | 24 |
| (−0.45)–(−0.5) | 1 | 1 | ** | * | 1 | 1 | 15 | 15 | ** | ** | 15 | 15 |
| (−0.5)–(−0.55) | 1 | 1 | ** | * | 1 | 1 | 5 | 5 | ** | ** | 5 | 5 |
| (−0.55)–(−0.6) | 0 | | | | | | 7 | 7 | ** | ** | 7 | 7 |
| (−0.6)–(−0.65) | 0 | | | | | | 2 | 2 | ** | ** | 2 | 2 |
  • (a) Total number of alleles with a given ΔF. Note that we have used semi‐open ΔF intervals (]x‐y]) to assign alleles to particular intervals, such that for instance a ΔF value of 0.4 was put in the interval 0.35–0.4.
  • (b) Number of alleles with a significant ΔF (inline image).
  • (c) p‐value for the number of alleles with a given ΔF using random population permutations (* <= 0.05, ** <= 0.001).
  • (d) Same as (c), but constraining permutations by geography (see Methods; * <= 0.05, ** <= 0.001).
  • (e) Number of loci with a given ΔFmax value.
  • (f) Number of loci with a significant allele frequency difference.
Table 3. Indel absolute allele frequency differences for the comparisons of Africa vs. the rest of the world and America vs. the rest of the world. Since indels can be considered as diallelic loci, we directly report the number of loci with a given ΔFmax value.
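Columns 2–5: Africa vs. non‐Africa. Columns 6–9: America vs. non‐America.
| ΔFmax | Loci (a) | significant (b) | p‐value 1 (c) | p‐value 2 (d) | Loci (a) | significant (b) | p‐value 1 (c) | p‐value 2 (d) |
| 0.75–0.8 | 0 | | | | 0 | | | |
| 0.7–0.75 | 1 | 1 | ** | ** | 1 | 1 | ** | * |
| 0.65–0.7 | 2 | 2 | ** | ** | 0 | | | |
| 0.6–0.65 | 1 | 1 | ** | ** | 0 | | | |
| 0.55–0.6 | 4 | 4 | ** | ** | 0 | | | |
| 0.5–0.55 | 3 | 3 | ** | * | 2 | 2 | ** | * |
| 0.45–0.5 | 10 | 10 | ** | ** | 1 | 1 | * | |
| 0.4–0.45 | 14 | 14 | ** | ** | 6 | 6 | ** | * |
| 0.35–0.4 | 14 | 14 | ** | * | 8 | 8 | ** | |
| 0.3–0.35 | 12 | 12 | ** | * | 11 | 11 | ** | |
| 0–0.3 | 149 | 104 | | | 181 | 93 | | |
  • Table header is defined in Table 2.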
Table 4. SNP allele frequency differences for the comparisons of Africa vs. the rest of the world and America vs. the rest of the world. Since SNPs can be considered as diallelic loci, we directly report the number of loci with a given ΔFmax value.
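Columns 2–5: Africa vs. non‐Africa. Columns 6–9: America vs. non‐America.
| ΔFmax | Loci (a) | significant (b) | p‐value 1 (c) | p‐value 2 (d) | Loci (a) | significant (b) | p‐value 1 (c) | p‐value 2 (d) |
| 0.75–0.8 | 3 | 3 | ** | ** | 0 | | | |
| 0.7–0.75 | 10 | 10 | ** | ** | 0 | | | |
| 0.65–0.7 | 1 | 1 | ** | * | 1 | 1 | ** | * |
| 0.6–0.65 | 14 | 14 | ** | ** | 5 | 5 | ** | * |
| 0.55–0.6 | 31 | 31 | ** | ** | 13 | 13 | ** | * |
| 0.5–0.55 | 38 | 38 | ** | ** | 19 | 19 | ** | * |
| 0.45–0.5 | 62 | 62 | ** | ** | 22 | 22 | ** | * |
| 0.4–0.45 | 89 | 89 | ** | ** | 60 | 60 | ** | * |
| 0.35–0.4 | 136 | 136 | ** | ** | 72 | 72 | ** | * |
| 0.3–0.35 | 129 | 129 | ** | * | 143 | 143 | ** | * |
| 0–0.3 | 2321 | 1484 | | | 2499 | 1303 | | |
  • Table header is defined in Table 2.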

Overall, we find that large allele frequency differences between geographic regions are extremely frequent (Tables 2–4 and Tables S2–S4). Indeed, 215 of the 772 STR loci (27.9%), 90 out of 210 indel loci (42.9%) and 913 of the 2834 SNP loci (32.2%) have ΔFmax > 0.3 for at least one comparison. Among these, 18.1% of the STR loci with ΔFmax > 0.3 show such a large ΔFmax for more than one comparison, while for the indels and SNPs this fraction is 28.9% and 18.1%, respectively. Note that the total number of loci with ΔFmax > 0.3 is smaller than the sum of the number of loci with ΔFmax > 0.3 involved in the different comparisons that can be computed from Tables S2–S4, because a given locus can show large allele frequency differences in more than one continental comparison. The largest observed ΔF (0.79) was found between African and non‐African populations for the SNP locus ‘rs5972561’ (see below in Figure 4I).

Figure 4. Examples of spatial distribution of alleles with large ΔFs. Black pies represent the frequency of a given allele, and its average frequency within (WR) and out of (OR) the region of interest is shown on the bar plot. Whiskers in the bar plots represent standard deviations. A: allele 298 at the ATA1F08 STR locus (ΔF = 0.45), 18.7 Kb away from closest gene UTRN. B: allele 176 at the GATA84B12 STR locus (ΔF = 0.56), 106.3 Kb away from closest gene CCDC54. C: allele 111 at the GGAA20G10 STR locus (ΔF = 0.51), 628 bp away from closest gene E2F6. D: allele 190 at the GATA11C08 STR locus (ΔF = 0.41), 149.0 Kb away from closest gene STARD13. E: indel locus rs2307832 (ΔF = 0.74), 14.9 Kb away from closest gene USP24. F: indel locus rs133052 (ΔF = 0.72), 9.7 Kb away from closest gene MKL1. G: SNP locus rs6431253 (ΔF = 0.54), 169.2 Kb away from closest gene ARL4C. H: SNP locus rs2252199 (ΔF = 0.53), 30 Kb away from closest gene HSPA13. I: SNP locus rs5972561 (ΔF = 0.79), located in the gene DMD. J: SNP locus rs5959428 (ΔF = 0.52), 323.6 Kb away from closest gene ITM2A.

In the comparisons of Africa and America to the rest of the World, the allele frequency differences are strikingly large (Tables S2‐S4), as expected under the surfing out‐of‐Africa hypothesis. When Africa is contrasted to the rest of the world the fraction of loci with ΔFmax > 0.3 is 10.2%, 29.0%, and 18.1%, for STRs, indels, and SNPs, respectively, and these fractions are 19.0%, 13.8%, and 11.8%, respectively, for the Americas. For the Eurasian and East Asian regions, these numbers are much lower, and vary between 1.2% and 8.6%. In keeping with these results, ΔF's are actually never as large in the comparisons of Eurasia and East Asia as in other comparisons. For instance, STRs do not show any allele with ΔF > 0.45 in Eurasia or in East Asia, whereas ΔF reaches 0.6 in Africa and 0.65 in America.

Given their large mutation rate, it may seem surprising that STR alleles show ΔF as large as those observed for SNPs and for indels if these differences had been created during the expansion out‐of‐Africa some 50 to 60 thousand years ago. Over time, mutations are indeed expected to erode large initial frequency differences at neutral loci, and thus large ΔF (50% or more) could be better explained by their maintenance due to selection. In order to check how quickly mutations would lower the frequency of an allele initially fixed in a population, we have carried out simple simulations at STR loci of an unsubdivided population under a pure stepwise mutation model. We have reported this decrease over 2000 generations in Figure S1 for different mutation rates and different effective population sizes. As expected, the rate of decrease is positively correlated with mutation rate, and its variance is negatively correlated with population size. However, for a mutation rate of 5×10⁻⁴, the allele frequency is still about 65% after 1,000 generations and 46% after 2,000 generations. For a lower mutation rate of 10⁻⁴, the mean expected frequencies are 91–92% and 83–85% after 1,000 and 2,000 generations, respectively, depending on the effective population size. Given the relatively large variance of mutation rates for human STR loci (Xu et al. 2005), it appears therefore likely that STR allele frequencies of more than 80% could still be observed after 2,000 generations if they were initially fixed by surfing or a strong bottleneck, without the need to invoke selection for their maintenance. Still, one would expect that loci with high mutation rates would show lower allele frequency differences today. Since heterozygosity is positively correlated with mutation rate for STRs (Kimmel & Chakraborty, 1996), we would expect loci with a low heterozygosity to have larger allele frequency differences than loci with a high heterozygosity, and this is exactly what we observe in Figure 1.
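The order of magnitude of these figures can be checked without simulating drift. Under neutrality, random sampling changes the variance of allele frequencies but not their expectation, so the expected frequency of the initially fixed allele follows a simple mutation recursion. The sketch below is not the authors' simulation procedure. It assumes a symmetric single-step mutation model with discrete generations, tracks only the expected frequency, and uses a finite window of repeat-count offsets, all of which are assumptions made for illustration.

```python
def expected_freq_initial_allele(mu, generations, max_steps=30):
    """Expected frequency of an initially fixed STR allele under a symmetric
    single-step mutation model (drift affects the variance, not this mean)."""
    size = 2 * max_steps + 1      # repeat-count offsets -max_steps..+max_steps
    probs = [0.0] * size
    probs[max_steps] = 1.0        # every gene copy starts in the same state
    for _ in range(generations):
        new = [0.0] * size
        for k, p in enumerate(probs):
            if p == 0.0:
                continue
            new[k] += (1.0 - mu) * p          # no mutation this generation
            # A mutation removes or adds one repeat with equal probability;
            # mass at the edge of the window stays put (negligible here).
            new[k - 1 if k > 0 else k] += 0.5 * mu * p
            new[k + 1 if k < size - 1 else k] += 0.5 * mu * p
        probs = new
    return probs[max_steps]

high_1000 = expected_freq_initial_allele(5e-4, 1000)
high_2000 = expected_freq_initial_allele(5e-4, 2000)
low_1000 = expected_freq_initial_allele(1e-4, 1000)
low_2000 = expected_freq_initial_allele(1e-4, 2000)
```

Under these assumptions the recursion gives expected frequencies broadly in line with the simulated means quoted above for both mutation rates, although small differences are to be expected because the original simulations included finite population sizes.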

Figure 1. Relationship between average heterozygosity over all populations (He) and largest absolute allele frequency difference (ΔFmax) for STR loci.

Surfing promotes the increase of allele frequencies in the direction of a spatial expansion. Therefore we expect to find more STR alleles with increased frequency in newly colonised areas than alleles with decreased frequency, since the decrease compensating the increase of a single allele will affect several other alleles at a given locus. This excess should be especially pronounced for Africa and America, because they are separated by spatial bottlenecks from the Eurasian continent. As shown in Figures 2 and 3, there is indeed a clear asymmetry in the distribution of STR allele frequency differences between regions. For instance, by considering only alleles with ΔF > 0.3, there are clearly more alleles that increased in frequency outside Africa than there are alleles that decreased in frequency. On the contrary, for East Asia and the Americas, there are more alleles at a higher frequency within these regions (Table S5). Since it is not possible to describe this pattern for diallelic loci like SNPs and indels, we tested for these markers whether the derived alleles show an asymmetry in frequency differences. We actually did not expect to find any asymmetry, as surfing does not discriminate between ancestral and derived alleles. For the indels the derived allele is about equally likely to increase in frequency as it is to decrease in frequency (Table S6). For SNPs however, we find that derived alleles have more often increased than decreased outside Africa for 0.15 < ΔF < 0.5, while we see the reverse situation in America for 0.3 < ΔF < 0.4 (Table S7). No clear pattern occurs for the other two regions (Table S7). This pattern is compatible with surfing, since most derived SNP alleles have low frequencies in Africa and could thus have had more room to increase in frequency by surfing than already frequent alleles.

Figure 2. Comparison of the distribution of allele frequencies between regions. A: Africa vs. rest of the World; B: Eurasia vs. rest of the World; C: East Asia vs. rest of the World; D: America vs. rest of the World. The grey scale in each square is proportional to the fraction of alleles (on a log‐scale) with a given average frequency. The size of the circles within squares is proportional to the number of loci with a given average frequency. Note that each locus is represented here by the allele with the largest frequency difference. Frequencies below 0 indicate that the alleles are not present in the respective group of populations. Note that alleles on the diagonal have equal frequencies in the two groups of populations.

Figure 3. Lod ratio of the number of alleles with a positive frequency difference (#ΔF+) and the number of alleles with a negative frequency difference (#ΔF‐), where positive means a lower frequency in the region of interest and a higher frequency in the rest of the world, as a function of ΔF. A positive lod ratio indicates that more alleles increased than decreased by a given ΔF out of the region of interest. Filled symbols indicate significant lod ratios (p‐value < 0.05, as assessed by a sign test). We only report ΔF categories with more than 10 alleles.
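The lod ratio and its sign test can be illustrated with a short sketch. The logarithm base is not stated in the caption, so base 10 is assumed here, and the sign test is written as an exact two-sided binomial test rather than a call to the BSDA package. The counts are the Africa vs. non‐Africa allele numbers from the 0.3–0.35 and (−0.3)–(−0.35) rows of Table 2, which need not coincide with the ΔF categories used in the figure.

```python
from math import comb, log10

def lod_ratio(n_pos, n_neg):
    """log10 of the ratio of positive to negative frequency differences."""
    return log10(n_pos / n_neg)

def sign_test_p_value(n_pos, n_neg):
    """Exact two-sided sign test of H0: both signs are equally likely."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # Probability of a count as small as k under Binomial(n, 1/2) ...
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # ... doubled for the two-sided test and capped at 1.
    return min(1.0, 2.0 * tail)

# Table 2, Africa vs. non-Africa: 24 alleles in the 0.3–0.35 bin and
# 9 alleles in the (−0.3)–(−0.35) bin.
n_pos, n_neg = 24, 9
lod = lod_ratio(n_pos, n_neg)
p_value = sign_test_p_value(n_pos, n_neg)
```

For these counts the lod ratio is positive and the sign test rejects symmetry at the 5% level, in the direction of more alleles having a higher frequency outside Africa.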

Eberle et al. (2006) found that genic regions are enriched for signals of positive selection compared to non‐genic regions (see also Hinds et al. 2005, Voight et al. 2006, Barreiro et al. 2008). If large ΔF were mainly created by the action of positive selection, it should be especially common close to genes. However, we find that the correlation between ΔFmax and distance to the closest gene is only significant (at the 5% level) in three instances: for STR alleles in Eurasia, as well as for SNP alleles in Eurasia and America (Figures S2 and S4). In all three cases the explained variance (R²) is small and the p‐values are above the 1% level. For indels there is no significant correlation between ΔFmax and distance to genes (Figure S3). The power to detect selection close to genic regions may nevertheless be limited here by the lower density of markers than that available in previous genomic studies, although these were based on a much smaller number of populations.

Discussion

We have found an unexpectedly large fraction of loci showing strong differences in allele frequencies between continents in all three datasets. 43% of the indels, 32% of the SNPs and 28% of the STR loci show large frequency differences (ΔFmax > 0.3) between a given geographic region and the rest of the world. A visual inspection of the spatial distribution of some of these allele frequencies indeed reveals striking features (Figure 4), with strong differences between continents, either with very narrow or broader clines, which at first sight is difficult to attribute to pure neutral processes. However, the sheer number of loci showing such striking patterns makes it difficult to believe that these patterns have all been shaped by positive selection, as previously advocated (Evans et al. 2005, Mekel‐Bobrov et al. 2005, Akey et al. 2006, Myles et al. 2008).

There is a clear excess of large ΔF between sub‐Saharan Africa or the Americas and other regions as compared to ΔF between Eurasia or East Asia and other regions (Tables S2‐S4). This is in line with previous genome scan studies, which detected more evidence of recent positive selection in Eurasian and East Asian populations as compared to African populations (Kayser et al. 2003, Akey et al. 2004, Storz et al. 2004, Carlson et al. 2005, Williamson et al. 2007). African populations seem therefore to have a deficit of recent positive selection (but see Hawks et al. 2007), which may be interpreted as evidence that selective pressures in recent times were more prevalent outside of Africa (Akey et al. 2004, Storz et al. 2004). In agreement with this hypothesis, Tang et al. (2007) found more genomic regions potentially influenced by selection when Africa was compared to Eurasian or to Asian populations than in the comparison of Eurasia to Asia. Under a selectionist view, this could be explained by the fact that the Eurasian continent has been colonized only recently and traces of selection would be easier to recognize. However, the populations remaining in Africa have also experienced drastic changes in their environment during the past 50,000 years (deMenocal, 2004), and prominent examples of recent genetic adaptations have been found in this continent as well (e.g. beta‐globin (Hanchard et al. 2007), G6PD (Saunders et al. 2002), or lactose tolerance (Tishkoff et al. 2007)). Like Africa, the Americas are also strongly differentiated from the rest of the World, and here selection would have had little time to operate, especially given the overall small sizes of the populations, leading to large levels of differentiation among Amerindian populations (Wang et al. 2007b).

We believe that demographic factors can better explain the particular differentiation of both Africa and the Americas. These two continents are indeed geographically very isolated from the others, such that spatial and demographic bottlenecks have certainly occurred during the exit out‐of‐Africa to colonize Eurasia and during the colonization of the Americas from North‐East Asia (see e.g. Fagundes et al. 2007). Moreover, these spatial bottlenecks could also have enhanced the possibility of allelic surfing during subsequent spatial expansions (Travis et al. 2007). Allelic surfing could also explain the asymmetry of the STR allele frequency distributions (Figures 2 and 3), since this phenomenon was originally described as the increase in frequency of rare alleles over large and recently colonized areas (Edmonds et al. 2004, Klopfstein et al. 2006). Therefore, the asymmetries shown in Figures 2 and 3 are expected after a range expansion out‐of‐Africa, as well as into Eurasia, East Asia and the Americas.
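The way repeated founder events can carry an initially rare allele to high frequency without any selection can be illustrated with a deliberately simplified simulation. The sketch below is not the spatially explicit model used in the surfing literature cited above: it assumes a single colonisation axis, each new deme founded by a fixed number of diploid individuals drawn from the previous deme, and no later migration or mutation. All names and parameter values are hypothetical.

```python
import random
from typing import List


def serial_founder_walk(p0: float, n_demes: int, founders: int,
                        rng: random.Random) -> List[float]:
    """Allele frequency in each deme along one colonisation axis.

    Each new deme is founded by `founders` diploid individuals sampled from
    the previous deme (binomial sampling of 2 * founders gene copies).
    """
    freqs = [p0]
    p = p0
    copies = 2 * founders
    for _ in range(n_demes - 1):
        carried = sum(1 for _ in range(copies) if rng.random() < p)
        p = carried / copies
        freqs.append(p)
    return freqs


def front_frequencies(p0: float, n_demes: int, founders: int,
                      replicates: int, seed: int = 1) -> List[float]:
    """Frequency reached in the last colonised deme, one value per replicate locus."""
    rng = random.Random(seed)
    return [serial_founder_walk(p0, n_demes, founders, rng)[-1]
            for _ in range(replicates)]
```

Under this pure sampling scheme the expected frequency at the front stays at p0, but the variance among replicate loci grows with every founder event. Calling `front_frequencies(0.1, 50, 10, 1000)` and counting the values above a chosen cut-off shows how often a neutral allele starting at 10% ends up common at the end of the expansion axis.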

If large allele frequency differences were mainly driven by positive selection acting on coding regions, one would expect to see a negative relationship between ΔF and the distance between genes and markers. Voight et al. (2006) indeed discovered more signals of selection in genic regions than in non‐genic regions of the genome, and Hinds et al. (2005) and Eberle et al. (2006) found that regions of extended linkage disequilibrium are enriched for genic SNPs. When testing for a correlation between allele frequency differences and distance to genes, however, we find only marginally significant results in three cases. We note, however, that the relatively low number of loci examined here in a large number of populations contrasts with previous genome scan studies, where hundreds of thousands of loci were studied in very few populations. This low marker density may indeed prevent us from obtaining significant results, and it would be interesting to extend our analysis to new databases containing hundreds of thousands of markers (see e.g. Jakobsson et al. 2008, Li et al. 2008). In any case, the fact that markers showing high levels of differentiation between continents appear randomly scattered over the whole genome is more in line with surfing than with positive selection as a cause. It is, however, very likely that we observe the effects of diverse selective and neutral forces and their interaction. Positive selection, genetic drift and allelic surfing mainly lead to increased genetic differences between populations, while balancing selection and migration decrease differentiation. Our results suggest that local adaptation is certainly not the main force promoting these large allele frequency changes between continental regions, but selection could certainly be involved at various loci.
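The expectation tested above, a negative relationship between ΔF and distance to the nearest gene, can be checked with a rank correlation. The sketch below uses Spearman's coefficient as one reasonable choice; this section does not state which statistic was actually used, so that choice is an assumption, as are the function names. A clearly negative coefficient would be the pattern expected under selection on genic regions, and a value near zero the pattern expected if differentiated markers are scattered without regard to genes.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(delta_f: Sequence[float], distance_to_gene: Sequence[float]) -> float:
    """Spearman rank correlation between per-marker ΔF and distance to the nearest gene."""
    return pearson(average_ranks(delta_f), average_ranks(distance_to_gene))
```

Assessing significance would additionally require a permutation test or a reference distribution, which is left out of this sketch.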

Among the genes that are close to markers with high allele frequency differences between African and non‐African populations, we could identify some that were already signalled as candidates for positive selection in previous studies using criteria other than mere allele frequency differences between continents. These are TCF15 (Storz et al. 2004), KRTAP23–1 (Williamson et al. 2007), PHACTR1 (Williamson et al. 2007), C20orf26 (Williamson et al. 2007), ANTXR2 (Kimura et al. 2007), UTRN (Tang et al. 2007), TYRP1 (Izagirre et al. 2006, Lao et al. 2007), LYST (Izagirre et al. 2006), DMD (Nachman & Crowell, 2000), SEMA4F (Nielsen et al. 2005), and E2F6 (Kayser et al. 2003). This suggests either that markers with geographic differentiation may indeed point to linked selected genes, or that previous studies using allele frequency differences as a criterion to identify outlier loci have mistaken surfing for selection.

Since allelic surfing looks very much like a selective sweep (Nielsen et al. 2007, Excoffier & Ray, 2008), it would affect aspects of genetic diversity other than the allele frequency spectrum, such as linkage disequilibrium and extended homozygosity (Biswas & Akey, 2006). Previous studies aiming at detecting positively selected loci have attempted to control for past demography by 1) explicitly modelling some complex demography (Sabeti et al. 2007, Stajich & Hahn, 2005, Tang et al. 2007, Williamson et al. 2007), 2) comparing diversity linked to derived or ancestral alleles (Voight et al. 2006), or 3) contrasting coding to non‐coding regions (Akey et al. 2002, Barreiro et al. 2008). To our knowledge, range expansions have never been used as a null model against which observed patterns were examined, and it is thus unclear (and would be worth examining) how the sensitivity of the first type of approach would change under such a new null model. As mentioned above, derived and ancestral alleles show different frequencies in Africa (The International HapMap Consortium, 2007, Li et al. 2008) and the result of positive selection differs between new and standing variation (Przeworski et al. 2005, Teshima et al. 2006, Barrett & Schluter, 2008), so that tests based on the comparison of diversity associated with derived and ancestral alleles may indeed be sensitive to allelic surfing, simply because these two allele categories have different initial frequencies. The comparison of genic to non‐genic regions may indeed be the approach most robust against past demography. For instance, Barreiro et al. (2008) compared the proportion of loci with a high FST between genic and non‐genic SNPs. They found that the proportion of genic SNPs with an FST > 0.65 was about 2.8‐fold larger than the proportion of non‐genic SNPs with equally large FST, and they could identify several candidate genes based on this high level of differentiation between populations. However, since this class of high FST SNPs represents only about 0.35% of all genic SNPs, it suggests that most genic regions have not been influenced much by selection. We find that positive selection is unlikely to have shaped the allele frequency spectrum at most loci, and it may well have acted on fewer genes than previously believed; our current results, however, do not allow us to discriminate between the effects of demography and selection for an individual locus. Loci which are candidates for being under positive selection should therefore be more carefully scrutinized to find links between potentially selected alleles and a phenotypic effect (see e.g. Sabeti et al. 2007).

Conclusions

The survey of the HGDP database on human polymorphisms reveals that large allele frequency differences between continental regions are extremely common. Indeed, as many as 30% of loci show very large allele frequency differences between continents. These differences are unlikely to have been created by positive selection, and are more likely the result of neutral demographic processes such as the surfing phenomenon. Because the erosion of large allele frequency differences by mutation is slow, even for large mutation rates, the surprisingly large number of strongly differentiated STR alleles also does not need to be explained by the action of positive selection. Africa and the Americas show a much larger extent of differentiation than Eurasia or East Asia, which is certainly due to changes in allele frequencies during the colonisation of the Eurasian and the American continents. Disentangling the effects of selection and neutral demographic processes on genome diversity remains an important challenge for future human evolution studies.

Acknowledgements

Thanks to Montgomery Slatkin for his comments on a previous version of the manuscript, and to Gerald Heckel and Matthieu Foll for stimulating discussions on the subject. We are grateful to Mourad Sahbatou and Sijia Wang for providing information about the genomic location of some of the markers, and to Isabelle Dupanloup for providing help on database issues. This work was supported by a Swiss NSF grant No 3100A0‐112072 to L.E.

Web resources

Noah Rosenberg Laboratory: http://rosenberglab.bioinformatics.med.umich.edu/diversity.html

UCSC Genome Browser: http://genome.ucsc.edu/
