• Open Access

Identification of local selective sweeps in human populations since the exodus from Africa

Authors


Ulf Gyllensten, Dept of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, SE-751 85, Uppsala, Sweden. E-mail: ulf.gyllensten@genpat.uu.se

Abstract

Selection on the human genome has been studied using comparative genomics and SNP architecture in the lineage leading to modern humans. In connection with the African exodus and colonization of other continents, human populations have adapted to a range of different environmental conditions. Using a new method that jointly analyses haplotype block length and allele frequency variation (FST) within and between populations, we have identified chromosomal regions that are candidates for having been affected by local selection. Based on 1.6 million SNPs typed in 71 individuals of African American, European American and Han Chinese descent, we have identified a number of genes and non-coding regions that are candidates for having been subjected to local positive selection during the last 100 000 years. Among these genes are those involved in skin pigmentation (SLC24A5) and diet adaptation (LCT). The list of genes implicated in these local selective sweeps overlaps partly with those implicated in other studies of human populations using other methods, but shows little overlap with those postulated to have been under selection in the 5–7 myr since the divergence of the ancestors of human and chimpanzee. Our analysis provides focal points in the genome for detailed studies of evolutionary events that have shaped human populations as they explored different regions of the world.

Comparisons of the human and chimpanzee genomes have been used to identify genes subjected to selection on the lineage leading to modern humans (Clark et al. 2003; Bustamante et al. 2005; Nielsen et al. 2005). Such comparative genomic approaches address genetic changes that have occurred during the 5–7 myr since humans and chimpanzees shared a common ancestor. However, modern humans emerged in Africa less than 200 kyr ago (Stringer and Andrews 1988) and only began to colonize other continents 50–80 kyr ago. To specifically target genes that have been subjected to selection in association with more recent evolutionary events, such as the African exodus and the separation of Caucasian and Asian populations, genomic analyses based on population comparisons are needed. Some genetic adaptations to environmental conditions have been described in humans. For instance, populations depending on agriculture often have a high tolerance to lactose, associated with a mutation in the lactase gene (Hollox et al. 2001; Enattah et al. 2002). Also, variation at the Duffy blood group locus has been associated with malaria resistance (Hamblin and Di Rienzo 2000; Hamblin et al. 2002). Of the three Duffy alleles (FY*A, FY*B and FY*0), homozygotes for FY*0 have been associated with resistance to malaria. This allele has almost reached fixation in sub-Saharan populations but remains rare in Asian and European populations. However, the number of loci identified that are candidates for local selection is limited and new methods are needed to scan the human genome for evidence of selection.

Positive (directional) selection on an allele is expected to increase its frequency in the affected population. Differences in allele frequency between populations, estimated by Wright's FST (Wright 1950), have therefore been used to indicate genes under selection (Beaumont and Balding 2004; Storz 2005). A high FST indicates positive selection whereas a low FST indicates that the loci are subject to purifying or balancing selection. Genome-wide searches based on allele frequency differences have resulted in a number of genes and gene categories postulated to be under selection (Akey et al. 2002). However, FST estimates for individual sites have a high variance, complicating their use as a sole indicator of regions under selection (Weir et al. 2005). Positive selection is expected to increase both the frequency of the affected site and of linked sites, and selective sweeps may therefore result in the presence of long haplotype blocks. This is utilized in hitchhike mapping for the identification of genes under selection (Harr et al. 2002; Schlotterer 2003). However, since both recombination frequency and population structure contribute to haplotype block variability, reduced haplotype variability per se is not a strong indication of selective events. Local selective sweeps have also been studied using a number of methods based on the haplotype or LD architecture (Sabeti et al. 2002; Hanchard et al. 2006; Voight et al. 2006; Wang et al. 2006), and such methods have been proposed to be more powerful for detecting selective events than methods based on the nucleotide diversity (Hanchard et al. 2006), such as Tajima's D-test, Fu and Li's D-test and Fay and Wu's H-test (Tajima 1989; Fu and Li 1993; Fay and Wu 2000). While most methods focus on haplotype patterns within a population, a haplotype-based method was recently presented for cross-population comparisons to detect alleles that have reached near-fixation (Sabeti et al. 2007).

Here, we apply a new cross-population method to search for genomic regions that have been subjected to local positive selection. Our method considers both the haplotype block length and allele frequency variation between different populations as well as between genomic regions. Selection may have occurred at a number of time points in human history, such as in Africa prior to the separation of the major population groups (Fig. 1, branch 2), after the exodus of non-African populations (branch 3), prior to the separation of Asian and European populations (branch 4) or after the separation of European (branch 5) and Asian (branch 6) populations. Comparison between populations makes it possible to study the genetic changes associated with some of these events. We have used available SNP and haplotype data from three major human populations, African (American), European (American) and Asian (Han Chinese) (Hinds et al. 2005), to address selective sweeps that occurred during the last 50–80 kyr of human evolution. Our results are compared to those of recent studies that have assessed selection in the human genome using other methods.

Figure 1.

Different stages when selection could have acted during human evolution. The numbers on the branches indicate time points when selection could have occurred, such as on the lineage leading to chimpanzee (branch 1), in African populations prior to the separation of the three populations (branch 2), after the exodus of non-African populations (branch 3), in non-African populations prior to the separation of Asian and European populations (branch 4), and after the separation of European (branch 5) and Asian (branch 6) populations.

Material and Methods

We used a publicly available dataset consisting of 1.6 million SNPs evenly distributed across the genome and genotyped in 71 individuals by Perlegen Sciences, representing 23 African Americans (AA), 24 European Americans (EA) and 24 Han Chinese (HC) (Hinds et al. 2005). Haplotypes had been inferred separately for each of the three sample sets using the HAP program (Halperin and Eskin 2004) and partitioned into blocks with limited diversity (Hinds et al. 2005). These blocks were defined as sets of SNPs for which at least 80% of the inferred haplotypes could be grouped into common patterns with a population frequency of at least 5%. Using this definition, 235 663 blocks have previously been identified in African Americans, 109 913 blocks in European Americans and 89 994 blocks in Han Chinese (Hinds et al. 2005). For identifying chromosomal regions subjected to positive selection, we only considered the autosomal chromosomes.

Block length and FST values

For each haplotype block, consisting of at least two SNPs, we estimated the block length using Build 35 (ver. 35 of the human genome sequence annotations). Blocks for which the SNP order did not agree between Build 35 and Build 34 (ver. 34 of the human genome sequence annotations, which was used by Perlegen) were removed from further analysis. In addition, blocks shorter than 500 bases were removed. A block FST in a population was defined as the average pairwise FST (Wright 1950) for all individual SNPs in that block between that population and another population, resulting in two block FST values for each haplotype block. For instance, for a block defined in African Americans the first block FST is the average FST between African Americans and European Americans for all SNPs in the African American block. The second FST is the average FST between African Americans and Han Chinese for the same SNPs. Genes included in a haplotype block were identified through their reference sequence position according to Build 35.
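
The block FST described above can be illustrated with a minimal Python sketch. It assumes the simple heterozygosity-based form of Wright's FST for a biallelic SNP, with the two populations weighted equally; the exact estimator applied to the data is not specified in the text, and the function names `snp_fst` and `block_fst` are hypothetical.

```python
from statistics import mean


def snp_fst(p1, p2):
    """Wright's FST for one biallelic SNP, given the frequency of the same
    allele in two populations (assumption: equal population weights)."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                             # SNP is monomorphic in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def block_fst(freqs_pop_a, freqs_pop_b):
    """Block FST: average of the per-SNP FST values over all SNPs in a block."""
    return mean(snp_fst(a, b) for a, b in zip(freqs_pop_a, freqs_pop_b))


# A block defined in one population yields two block FST values,
# one against each of the other two populations (frequencies are made up).
aa = [0.90, 0.80, 0.85]
ea = [0.20, 0.30, 0.25]
hc = [0.10, 0.20, 0.15]
fst_aa_ea = block_fst(aa, ea)
fst_aa_hc = block_fst(aa, hc)
```

Averaging over the SNPs in a block is what reduces the high per-site variance of FST that motivates the block-level statistic.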

FR for haplotype blocks

Our aim was to search for regions that differ in block allele frequency (high FST) between African and non-African populations, between Asian and both other populations and between European and both other populations. To study whether each population differs from both the other two populations, we calculated the following FST ratios (FRs) for each haplotype block (i),

image

where AA refers to African Americans, EA to European Americans, and HC to Han Chinese, respectively. The individual block FST values were divided by the median to be compared to the genome in general and log transformed. An FST ratio around zero indicates an average block FST and a low (negative) FR indicates very low differentiation between the study population and both of the other populations. A high FR reflects large differentiation but does not indicate the specific population(s) in which a potential selective sweep has occurred.

Population-specific extended haplotype blocks

To identify in which population a selective sweep has occurred, we compared the length of haplotype blocks between populations for every site. For each block defined in a population, the corresponding blocks in the two other populations were identified as the blocks with the largest overlap. If no overlapping block was found in the other populations, the block was removed from further analysis. We calculated a ratio between the lengths of the haplotype block (i) in the different populations as:

image
image
image

To make the ratios comparable across populations, each ratio was divided by the median and log transformed. The log transformed LR is expected to be zero for regions exhibiting an average proportion of block length within populations. A low (negative) LR indicates that the haplotype block, in the population studied, is relatively short compared to the blocks in the other populations. A high LR in one population indicates a larger haplotype block relative to the other populations and that the region has potentially undergone a selective sweep in that population.
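
The normalization step stated above (division by the median, followed by a log transform) can be sketched as follows. This is only an illustration of that one step: the base of the logarithm is not given in the text, so base 2 is an assumption here, and the function name `log_median_ratio` is hypothetical. The raw ratios passed in must be positive.

```python
import math
from statistics import median


def log_median_ratio(raw_ratios):
    """Divide each raw ratio by the median of all ratios, then log-transform.
    Assumption: base-2 logarithm; the text does not state the base."""
    med = median(raw_ratios)
    return [math.log2(r / med) for r in raw_ratios]


# A ratio equal to the median maps to 0; larger ratios become positive,
# smaller ratios become negative.
scores = log_median_ratio([0.5, 1.0, 2.0, 4.0, 8.0])
```

With this transform, a score of zero corresponds to a block with the median ratio, which matches the interpretation given for both FR and LR.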

Simulation of LR distribution

We used the software SelSim (Spencer and Coop 2004) to simulate a dataset using different strengths of selection on an SNP, varying the effective population size and using different fractions of derived to ancestral alleles. The number of chromosomes used in each simulation was similar to the number of chromosomes in each of our three human populations from which the empirical data was obtained (n=48). To examine the effect of variation in selection coefficient (s=0, s=0.01, s=0.02, s=0.04), we used an effective population size of Ne=10 000 and the proportion of derived to ancestral alleles of 44/4. To study the effect of the effective population size, we used Ne=1000, Ne=5000 and Ne=10 000, a proportion of derived to ancestral alleles of 44/4 and selection coefficients of s=0 and s=0.04. In all simulations the recombination rate was set to 1 cM/Mb, the number of SNPs to 200 and the density to 1 SNP per kb. For each set of simulations, we made 10 000 permutations. The length of haplotype blocks surrounding the SNP under selection was determined using the same criteria as used for the population data: 80% of the sequences should be grouped into haplotypes with an allele frequency of at least 5% in the dataset (Hinds et al. 2005). For each of the simulated replicates, the block length was determined using a script in MATLAB in the following way:

  1. Start at the SNP under selection.
  2. Step one SNP in each direction from the starting point.
  3. Determine whether the block including one or both of the new SNPs meets the definition of a haplotype block.
  4. If the SNPs are included in the same block as the SNP under selection, repeat from step 2. If one SNP does not belong to the same block as the SNP under selection, no further SNPs in that direction from the starting point are evaluated.
  5. When none of the new SNPs are included in the block, the block length is defined by the distance between the last SNPs in each direction from the starting point.
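
The steps above can be sketched in Python as follows. This is not the MATLAB script used for the analysis, only one possible reading of the procedure: each direction is tested one SNP at a time against the 80%/5% block definition, a direction is closed the first time its extension fails, and the block length is taken between the outermost SNPs that were included. The names `is_block`, `grow_block` and `block_length` are hypothetical.

```python
from collections import Counter


def is_block(haps, lo, hi, min_cover=0.80, min_freq=0.05):
    """True if at least min_cover of the haplotypes, restricted to SNP columns
    lo..hi (inclusive), fall into patterns with frequency >= min_freq."""
    n = len(haps)
    patterns = Counter(tuple(h[lo:hi + 1]) for h in haps)
    covered = sum(c for c in patterns.values() if c / n >= min_freq)
    return covered / n >= min_cover


def grow_block(haps, core, min_cover=0.80, min_freq=0.05):
    """Extend a block outwards from the SNP at column index `core`.
    Returns the (lo, hi) column indices of the final block."""
    n_snps = len(haps[0])
    lo = hi = core
    open_left = open_right = True
    while open_left or open_right:
        if open_left:
            if lo > 0 and is_block(haps, lo - 1, hi, min_cover, min_freq):
                lo -= 1
            else:
                open_left = False    # stop evaluating SNPs in this direction
        if open_right:
            if hi < n_snps - 1 and is_block(haps, lo, hi + 1, min_cover, min_freq):
                hi += 1
            else:
                open_right = False
    return lo, hi


def block_length(positions, lo, hi):
    """Physical distance between the outermost SNPs included in the block."""
    return positions[hi] - positions[lo]
```

Here `haps` is a list of equal-length allele strings, one per simulated chromosome, and `positions` holds the SNP coordinates in base pairs. For a simulated replicate with the selected SNP at index `core`, `block_length(positions, *grow_block(haps, core))` would give the block length that enters the LR calculation.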

LRs were calculated for the simulated haplotype blocks, similar to the empirical population data. Each simulated block was compared to two randomly chosen blocks from the dataset simulated with s=0:

image

Blocks used for calculating the LRs were resampled 100 000 times with replacement.

Combining methods for studying genes under positive selection

We used the iHS method by Voight and colleagues (Voight et al. 2006) to test whether our top 1% candidate genes show evidence of positive selection. For all SNPs in each block, the ancestral state was identified using the UCSC Human Genome database (Mar. 2006 assembly) where possible. For all SNPs with an identified ancestral state, an unstandardized iHS was calculated as the ratio between the integrated EHH (extended haplotype homozygosity) score of the ancestral and the derived allele (Voight et al. 2006). Since our genomic regions are selected as candidates for positive selection, we cannot use our empirical distribution to standardize the iHS. Instead, we used the already described, almost linear, relationship between the cutoff for the top 1% tail of the unstandardized iHS distribution and the allele frequency (Voight et al. 2006, Fig. 4) to estimate what fraction of SNPs for each of our genes are within the top 1% genome-wide empirical distribution of unstandardized iHSs. Large blocks containing more than one gene were divided to assign an independent iHS to each gene. The significance for each gene was calculated as the significance of observing a certain fraction of SNPs within the 1% tail of the genome-wide distribution, compared to the 1% expected by chance, using the χ² statistic.
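
The per-gene significance test described above can be illustrated with a minimal Python sketch. It assumes a one-degree-of-freedom χ² goodness-of-fit test on the counts of SNPs inside and outside the 1% tail; the exact form of the test is not given in the text, and the function name `tail_enrichment` and the example counts are hypothetical.

```python
import math


def tail_enrichment(n_snps, n_in_tail, expected_frac=0.01):
    """Chi-square goodness-of-fit test (1 degree of freedom) comparing the
    observed number of SNPs in the top tail with the number expected by chance.
    Returns (chi2, p_value)."""
    exp_in = n_snps * expected_frac
    exp_out = n_snps - exp_in
    obs_out = n_snps - n_in_tail
    chi2 = (n_in_tail - exp_in) ** 2 / exp_in + (obs_out - exp_out) ** 2 / exp_out
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical gene: 5 of 100 SNPs fall in the 1% tail, against 1 expected.
chi2, p_value = tail_enrichment(100, 5)
```

For genes with few SNPs the expected count in the tail is small, so the χ² approximation is rough in that setting.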

Figure 4.

Distribution of LR for neutrally evolving sites compared to sites with a selection coefficient of s=0.01. For the neutrally evolving genes, only 1% will exhibit an LR above 3 compared to 22% of the genes with a selection coefficient of 0.01.

Results

Distribution of LR

We have used the length ratio (LR) to measure the length of a haplotype block relative to the length of the corresponding block in other populations and relative to the genome average. The behavior of the LR under different selection models was studied by simulation using the SelSim software (Spencer and Coop 2004). In these simulations, the length of each block was calculated relative to a randomly chosen block from the dataset evolving under neutrality (s=0), and relative to the average block length in the dataset. When keeping all other parameters constant in the simulations and varying the selection coefficient between s=0, s=0.01, s=0.02 and s=0.04, selection will result in a shift of the block length in the expected direction (towards longer blocks) (Fig. 2A). For example, using s=0.01 will result in a mean LR=1.7. The actual LR distributions for the SNP data from African American (AA), European American (EA) and Han Chinese (HC) are quite similar to the simulated data under neutrality (s=0). This is consistent with the prediction that most of the variation in haplotype block length observed in the human genome is the result of random genetic drift rather than selection. The influence of selection on a genomic region does not only depend on the selection coefficient but also on the effective population size. When considering sites that are evolving under neutrality (s=0), varying the effective population size between Ne=1000, Ne=5000 and Ne=10 000 does not affect the LR distribution (Fig. 2B), while the strongest effect from adding a selective advantage for one allele (s=0.04) is, not surprisingly, seen in populations with the largest effective size. As shown by the simulations, we would expect longer haplotype blocks with reduced variability when an allele is favored by positive selection (Fig. 2A). 
Under certain conditions, positive selection may therefore result in extended haplotype blocks, supporting the use of block length as a variable for the identification of genomic regions that have been subjected to local positive selection.

Figure 2.

A–C. LR and FR distributions. (A) LR distribution. LRs for haplotype blocks as a function of selection coefficient are calculated for simulated datasets with different selection coefficients (s=0, s=0.01, s=0.02 and s=0.04) for an effective population size of Ne=10 000. The solid grey lines represent the LR distributions for the empirical data from the African American, European American and Han Chinese populations. (B) LR distribution for haplotype blocks as a function of effective population size. LRs are calculated for simulated datasets with Ne=1000, Ne=5000 or Ne=10 000 and s=0 or s=0.04. The three solid lines (almost overlapping) represent the simulated dataset without selection for three different effective population sizes. The other lines represent the different effective population sizes and s=0.04. (C) FR distribution. Empirical FST ratios (FRs) for the blocks in African Americans, European Americans and Han Chinese.

Distribution of FST ratios

As a second parameter, we calculated the FST ratio (FR). The populations for which SNP data are available share a relatively recent common ancestry (less than 50 000 years) and most of the SNP variation is shared between populations. Similar to the LR, the FR for each block in a population is based on the pairwise FST between that population and each of the other populations, expressed relative to the genome average. If a gene has been under recent positive selection in only one population, we would expect a large FR, since the FST between that population and both others would be larger than average. The distribution of FR values is very similar between populations, with a somewhat lower fraction of the blocks with FR=0 in the Han Chinese (Fig. 2C).

Block LR and FR distributions

In order to identify genomic regions with deviating patterns that are candidates for having been affected by selection, we combined the LR and the FR values for each population and plotted these against each other. Blocks that are only affected by positive selection in Africans are expected to have a high LRAA and a high FRAA (Fig. 3a), while blocks that are only under selection in European Americans are expected to have a high LREA and a high FREA (Fig. 3b) and, finally, blocks under selection in Asians are expected to have a high LRHC and high FRHC (Fig. 3c). To determine the statistical significance of deviating LR and FR values, the distributions of these parameters have to be addressed. Both the LR and FR values are approximately normally distributed, with LR possibly having a slight negative skew (Fig. 2A, 2C). Since we are only interested in outliers with a high value, only positive values are considered in determining the significance levels. From the empirical distributions, each block was assigned a p-value describing the likelihood of the observed value considering the distribution of values. For each block and each population, p-values for both LRs and FRs were assessed separately. Even though FR and LR are not completely uncorrelated (r²=0.013), we computed a combined p-value by multiplying the p-values for the FR and LR of each block.
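
The combination of the two p-values can be sketched as follows. The sketch assumes a rank-based upper-tail p-value (the fraction of blocks with a value at least as large as the observed one); how the p-values were actually derived from the empirical distributions is not detailed in the text, and the function names are hypothetical.

```python
from bisect import bisect_left


def empirical_upper_p(values):
    """Upper-tail empirical p-value for each entry: the fraction of all
    values that are greater than or equal to it (assumed definition)."""
    ordered = sorted(values)
    n = len(values)
    # bisect_left counts the values strictly smaller than v.
    return [(n - bisect_left(ordered, v)) / n for v in values]


def combined_p(lr_values, fr_values):
    """Combined p-value per block: product of the LR and FR p-values,
    treating the two statistics as approximately independent."""
    p_lr = empirical_upper_p(lr_values)
    p_fr = empirical_upper_p(fr_values)
    return [a * b for a, b in zip(p_lr, p_fr)]


# Four hypothetical blocks; the last one has the highest LR and FR
# and therefore receives the smallest combined p-value.
lr = [0.1, 0.8, 1.5, 3.2]
fr = [0.3, 0.2, 1.1, 4.0]
p_comb = combined_p(lr, fr)
```

Multiplying the two p-values is only justified to the extent that LR and FR are uncorrelated, which is why the weak correlation between them matters for this step.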

Figure 3.

a–c. Length ratio (LR) vs FST ratio (FR) for the three human populations studied. Haplotype blocks including a gene or part of a gene and exhibiting a positive LR and FR in (a) African American, (b) European American and (c) Han Chinese. The red dots represent blocks with genes with a combination of LR and FR values that are significant at a genome-wide level of p<0.01.

FR and LR in genic and non-genic regions

We first examined the distribution of FR and LR values relative to some general features in the genome. Across all populations, 55% (155 558/284 953) of the blocks included at least part of a gene, whereas 45% (129 395/284 953) consisted of only intergenic regions. The median FR is lower for blocks containing genes, both when each population is considered separately and when all populations are analyzed together (Table 1). The fraction of significant (p<0.01) blocks containing genes is higher in all but the African American population (Table 1). Since a low FR indicates a small differentiation between populations relative to the genome average, the higher average FR may indicate more drift in intergenic regions. If positive selection is acting specifically on genic rather than intergenic regions, this could result in a higher number of significant FR values for the blocks containing genes, as is seen in our data. In contrast to FR, a low LR reflects blocks that are relatively shorter in a population, whereas blocks with an LR around zero indicate similarity in block distribution among populations. Therefore, the median value of LR for blocks in genic and intergenic regions is not of interest. The fraction of significant combined p-values (p<0.05) is higher for blocks with genes as compared to blocks without genes for Han Chinese and European Americans (Table 1).

Table 1. Analysis of the distribution of median FR, fraction of significant FR blocks and combined probabilities of LR and FR in genic and non-genic regions¹.

Comparison                          African American (AA)   European American (EA)   Han Chinese (HC)   All
I². FR median
  Genic                             −0.054                  −0.063                   −0.072             −0.062
  Non-genic                         −0.083                  −0.073                   −0.096             −0.083
  P-value                           4.5×10^−4               3.8×10^−2                1.5×10^−2          3.85×10^−6
II³. Fraction of significant FR
  Genic                             0.0100                  0.0110                   0.0114             0.0108
  Non-genic                         0.0100                  0.0087                   0.0083             0.0091
  P-value                           ns                      1.43×10^−4               2.72×10^−6         1.86×10^−6
III⁴. Combined P-value
  Genic                             0.0441                  0.0575                   0.0617             0.0540
  Non-genic                         0.0485                  0.0548                   0.0551             0.0527
  P-value                           0.0005                  0.040                    2.43×10^−5         0.072

¹ Abbreviations: ns – not significant; FR – FST ratio.
² I. The median FR was calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values are calculated by comparing the median between the two groups of blocks using the Mann-Whitney rank sum test.
³ II. The fraction of FRs with p<0.01 was calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values were calculated by comparing the number of significant genes within each group using χ² statistics.
⁴ III. Combined LR and FR probabilities. The fraction of combined values of p<0.05 was calculated for blocks containing at least part of a gene and for blocks not containing any part of a gene. p-values were calculated by comparing the number of significant genes within each group using χ² statistics.

Candidate genes for positive selection

All genes in a haplotype block were assessed for their FR and LR and the combined (FR×LR) p-value calculated for that block. Among the total number of 284 953 blocks (100 481, 100 450 and 84 022 for AA, EA and HC respectively), 55% included at least part of a gene or predicted gene. Since both block length and gene density vary across the genome, the number of genes per block will vary and some genes will be part of more than one block. In order to determine a gene-specific LR and FR, we identified all blocks to which each gene found in a population belonged and chose the largest LR, FR, and combined p-value to represent the gene. These analyses include 22 737, 22 807 and 22 665 genes for AA, EA and HC, respectively. After removing duplicates, some genes were still part of the same block, and the final analysis resulted in 15 855, 15 715 and 15 577 blocks containing the genes for each population, respectively. Using a criterion of p=0.01 for genome-wide significance (0.01/15 855=6.31×10^−7, 0.01/15 715=6.36×10^−7 and 0.01/15 577=6.42×10^−7 for AA, EA and HC, respectively) on the combined p-value for LR and FR resulted in a list of 31 genes (Table 2) located in 23 blocks (9, 1, 13 for each population). The number of significant blocks differs widely between populations. This could be due to a bias in the dataset, but considering the similarity between the distributions of LR and FR between populations (Figs. 2 and 3), it is more likely to be due to stochastic variation. The individual genes on these two lists will be addressed further in the discussion.
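
The genome-wide significance thresholds quoted above follow from dividing the nominal level by the number of gene-containing blocks tested in each population (a Bonferroni-style correction). A short Python check of that arithmetic, with variable names chosen here for illustration:

```python
# Number of gene-containing blocks per population, as given in the text.
n_blocks = {"AA": 15855, "EA": 15715, "HC": 15577}
alpha = 0.01  # nominal genome-wide significance level

# Per-block threshold on the combined LR x FR p-value.
thresholds = {pop: alpha / n for pop, n in n_blocks.items()}

for pop, thr in thresholds.items():
    print(f"{pop}: {thr:.2e}")
```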

Table 2. Genes identified by our method, proposed to have been under positive selection in different human populations¹.

Pop   Chromosome   FR     LR     p-value²      Genes
AA    6            5.85   2.33   2.15×10^−6    TCBA1
AA    7            5.89   2.68   1.90×10^−6    AIP1
AA    8            6.44   2.88   6.85×10^−8    CSMD1
AA    8            4.61   3.53   7.24×10^−7    PSD3
AA    11           5.78   2.17   2.92×10^−6    NELL1
AA    14           5.10   3.94   1.84×10^−6    NPAS3
AA    15           9.15   2.16   7.02×10^−14   CYP19A1
AA    16           5.21   2.03   8.92×10^−7    LOC441745
AA    20           4.24   2.71   2.92×10^−6    NFATC2
EA    9            2.78   3.97   2.70×10^−6    C9orf121
HC    2            4.44   3.02   1.09×10^−6    CPS1
HC    2            3.87   3.28   2.47×10^−6    EDAR³
HC    2            4.61   2.79   1.32×10^−6    LOC375295, FUCA1P
HC    5            3.55   3.49   2.65×10^−6    GALNT10
HC    5            4.17   3.25   1.17×10^−6    GRIA1
HC    8            4.37   2.88   2.13×10^−6    LOC439940
HC    9            2.65   4.22   8.97×10^−7    LOC401539
HC    10           3.64   3.64   1.19×10^−6    LOC389997, LOC387703
HC    13           4.29   3.64   1.97×10^−6    ATP8A2
HC    14           4.30   3.24   1.42×10^−6    ACTN1
HC    14           4.30   3.07   1.42×10^−6    RPS29P1, WDR22, DDX18P1
HC    15           4.63   3.69   2.83×10^−7    HERC1³
HC    19           4.16   3.17   1.60×10^−6    FUT2, FLJ36070, RASIP1, MGC34799, FUT1

¹ Abbreviations: Pop – population; FR – FST ratio; LR – length ratio; AA – African American; EA – European American; HC – Han Chinese.
² All p-values are genome wide significant p<0.01.
³ Overlaps with genes suggested to have been under selection in earlier genome-scans based on other methods (Carlson et al. 2005; Voight et al. 2006).

Correlation with other studies of genes under positive selection

A number of studies have searched for genes under positive selection, either those that have been affected by selection on the lineage leading to modern humans or those affected more recently during human evolution (Clark et al. 2003; Bustamante et al. 2005; Nielsen et al. 2005; Voight et al. 2006; Wang et al. 2006). The loci implicated in studies of selection on the lineage leading to modern humans show little overlap with the genes on our lists. However, when comparing with studies of more recent events of selection, the overlap is larger. There is a total of 34 genes (GNA14, PQLC1, MYH9, MYEF2, LOC400369, SLC12A1, AQP1, LOC90193, PRKG2, HERC1, LOC439940, EDAR, SULT1C2, APOL4, PTF1A, C10orf115, C10orf67, MKRN2, RAF1, FLJ11036, C9orf82, C8orf7, EPHB1, SULT1C1, DAPK2, UBXD2, LCT, CBARA1, CCDC2, LRRC19, GCC2, MGC10701, LIMS1, R3HDM) located in 25 different regions that overlap between our top 1% candidates and the top 250 regions suggested to be under selection by Voight et al. (2006). This overlap is not surprising given that the methods used have some similarities. There is also an overlap of 12 genes (CLSPN, EIF2C4, EIF2C1, EIF2C3, KCNH7, USP3, HERC1, EDAR, SULT1C2, DAPK2, GCC2, LIMS1) between our top 1% candidates and the 181 genes proposed by Carlson and colleagues (Carlson et al. 2005). This is a higher overlap than expected by chance (5 overlapping regions out of 48 proposed by Carlson et al. as compared to 1% expected by chance, p=0.0139, χ² test), even though Carlson and colleagues used an approach based on Tajima's D that is conceptually different from ours.

Combining methods for studying genes under positive selection

The somewhat low overlap of genes under positive selection for different identification methods is not surprising. Even though many of the methods are somewhat similar, different populations and SNP data have been used, and both the size of the window used to scan the genome for particular features and the allele frequency spectra vary. In an attempt to study the overlap between the genes identified by our method and those determined by Voight and colleagues (Voight et al. 2006), we analyzed all our top 1% blocks using the iHS method (Voight et al. 2006). Out of the top 1% regions for AA, EA and HC populations (159, 157 and 156, respectively), 19, 46 and 42 regions are identified as candidates for selection (p<0.05) using the iHS method, clearly higher than the number expected by chance. However, if we apply a genome-wide significance level of p=0.01, only 4, 29 and 17 regions, for AA, EA and HC respectively, remain as candidates for positive selection (Table 3).

Table 3. The genes among our top 1% list of candidates that are proposed to have been under positive selection when evaluated using the iHS method.

Pop¹   P-value     Genes
AA     2.80E-03    AIP1
AA     1.91E-02    CNTN5
AA     2.37E-04    CSMD1
AA     1.97E-03    RORA
EA     2.25E-06    A2BP1
EA     1.04E-10    ACTBP4
EA     1.08E-06    AIP1
EA     3.54E-02    APBA2
EA     2.32E-04    ARHGAP26
EA     1.20E-02    BNC2
EA     3.76E-05    DCC
EA     2.84E-04    DKFZP566N034
EA     4.73E-02    DOCK4
EA     2.62E-03    FLJ10159
EA     2.01E-03    KIAA0861
EA     2.83E-02    KIAA1889
EA     7.15E-02    KIAA2026
EA     2.09E-08    LCT
EA     6.56E-03    LOC284788
EA     1.65E-05    LOC440867
EA     7.41E-04    LOC90193
EA     9.72E-02    MYEF2, SLC24A5, LOC400369
EA     1.56E-06    PEPP2
EA     1.27E-03    PPP2R2B
EA     1.30E-03    PRDM10
EA     5.60E-06    PRKG2
EA     1.94E-04    PSD3
EA     3.25E-14    R3HDM
EA     7.52E-04    SGCZ
EA     8.93E-04    SLC12A1
EA     1.89E-02    SMYD3
EA     5.05E-03    TCBA1
EA     6.42E-06    UBXD2
HC     3.45E-03    ATP1B3P1, PAPOLG, LOC130865
HC     4.01E-03    C2orf23
HC     1.10E-02    C6orf176
HC     1.80E-02    C8orf21
HC     4.86E-06    DAB1
HC     1.07E-02    EPHB1
HC     9.60E-03    FHIT
HC     1.89E-03    FLJ11036, MKRN2, RAF1
HC     4.13E-03    LOC131368
HC     4.49E-05    LOC442008
HC     2.47E-06    LRPPRC
HC     2.97E-02    MGC42105
HC     1.06E-02    NAV2
HC     2.42E-04    SEMA3E
HC     5.48E-12    SMC6L1, FLJ40869, LOC343930
HC     5.34E-02    TCBA1
HC     2.45E-06    TRPC6

¹ Abbreviations: Pop – population; AA – African American; EA – European American; HC – Han Chinese.

Discussion

Many studies of selection in the human genome have been based on comparisons of the human and chimpanzee genomes and have therefore addressed selection over 5–7 myr. We have focused on events associated with the evolution of major human population groups over the last 100 000 years. The genetic differentiation of human populations is the result of both selective and stochastic forces. Many adaptations, such as those resulting from spatial and temporal variation in climate, exposure to pathogens and diet, may have been restricted to particular populations and are therefore likely to remain undetected by comparative genomic studies.

Positive selection acting on a locus is expected to result in a more rapid fixation of alleles and consequently less variation around the site under selection. A means to identify chromosomal regions subjected to positive selection is therefore to examine the pattern of DNA polymorphism combined with the haplotype structure. The FST for individual nucleotide sites often has a high variance (Akey et al. 2002; Weir et al. 2005) but since variation at neighboring sites is often correlated, we instead estimated the FST for haplotype blocks. Positive selection for an allele in a large population is expected to result in regions with extended haplotypes due to lower levels of genetic variation than expected by random genetic drift. Therefore, one characteristic of genomic regions under recent positive or negative selection is large haplotype blocks of reduced diversity and a high correlation between the FST of proximate SNPs. The genome is known to contain recombination hotspots located between regions with higher LD (Altshuler et al. 2005). By focusing on haplotype blocks with reduced diversity (regardless of LD) rather than studying extended haplotypes, the results should be less sensitive to variation in the recombination rate between populations.

False and true positive rates for genes under selection

Approaches to identify genes under positive selection have to consider the relative contributions of genetic drift and natural selection to the pattern of genetic variability. Most studies focus on the top 1% of candidates, but are unable to distinguish between the alternative explanations of positive selection and neutral evolution. In our simulations of the length of blocks with reduced haplotype diversity (LR), we observe that a selective sweep with a selection coefficient s=0.01 will result in an average LR=1.7 (Fig. 4). However, when no selection is acting (s=0), about 10% of the simulated data exhibited an LR>1.7. Therefore, it is not possible to distinguish with certainty between blocks that are under selection and those evolving under neutrality. The number of genes affected by positive selection in the human genome is unknown, but it has been suggested that as many as 3% of genes have been subjected to recent positive selection (Eberle et al. 2006). If we assume that 3% is the true fraction, it follows that roughly 660 of the about 22 000 genes in the human genome have been subjected to positive selection. Assuming that those 660 genes have a selection coefficient of s=0.01 and using LR=3 as the threshold for identification of genes under positive selection, then 1% of the neutrally evolving genes (1%×97%×22 000=213) and 22% of the genes truly under selection (22%×660=145) will be suggested as candidates for positive selection (Fig. 4). Using more stringent criteria for the LR cutoff will result in a higher ratio of true to false positives, but also a larger number of false negatives. For example, using LR=4.5 results in 0.1% (21 genes) of the neutrally evolving genes and 5.9% (45 genes) of the positively selected genes. Assuming a selection coefficient of s=0.01 on 3% of the genes in the human genome may also be an overestimation, in which case the ratio of false positives to true positives would be even higher.
As these estimates suggest, the frequency of false positives among top candidates for positive selection is probably high for most methods developed for scanning the genome for positive selection. At the same time, most of the genes that have undergone positive selection have not been detected. This is the most likely explanation for the low extent of overlap of candidate genes between different studies and methods (Biswas and Akey 2006; Sabeti et al. 2006, 2007), even though the power to detect almost complete selective sweeps has been shown to be high (Sabeti et al. 2007). Many selective events are likely to be weak and their signatures can easily be eradicated by genetic drift. Combining several genomic characteristics when searching for genes subjected to recent positive selection is one way to reduce the number of both false positives and false negatives.
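The expected counts in the estimate above for the LR=3 threshold can be reproduced with a few lines of arithmetic. The sketch below uses only the figures given in the text (22 000 genes, 3% under selection, a 1% false positive rate and a 22% true positive rate); the function and variable names are illustrative and not part of the original analysis.

```python
def expected_candidates(n_genes, frac_selected, false_pos_rate, true_pos_rate):
    """Expected numbers of false and true positives above an LR threshold.

    false_pos_rate: fraction of neutrally evolving genes exceeding the threshold.
    true_pos_rate:  fraction of genes truly under selection exceeding it.
    Returns (false_positives, true_positives, fraction_of_candidates_that_are_true).
    """
    n_selected = n_genes * frac_selected
    n_neutral = n_genes - n_selected
    false_pos = false_pos_rate * n_neutral
    true_pos = true_pos_rate * n_selected
    precision = true_pos / (true_pos + false_pos)
    return false_pos, true_pos, precision


# LR=3 threshold, with the rates read from the text.
fp, tp, precision = expected_candidates(22000, 0.03, 0.01, 0.22)
print(round(fp), round(tp))  # 213 145
```

Under these assumptions roughly 40% of the candidates above the LR=3 threshold would be genes truly under selection, which is why a candidate list of this kind is best read as a starting point for follow-up rather than a set of confirmed loci.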

Genes within our top 1% candidates likely to have been under selection

Our analysis of the haplotype LR and FR in three major populations resulted in the identification of a number of genes that are candidates for having been affected by selection, even though their p-values do not reach genome-wide significance. The lactase gene (LCT) appears on the top 1% list of candidates for positive selection in European Americans. The LCT gene is likely to have been under positive selection because of the nutritional benefit of consuming dairy products, which were introduced to humans during cattle domestication in the Near East, about 9 000 years ago. Another interesting candidate gene is MCPH1, involved in regulation of brain size. Our analysis indicates that this gene has been under selection in European Americans, and it has earlier been suggested to be both under negative selection on the human lineage (Bustamante et al. 2005) and under positive selection in Caucasians (Evans et al. 2005). The polymorphism in MCPH1 proposed to be under selection in Caucasians was estimated to have arisen approximately 37 000 years ago and simulations indicate that it has been increasing in frequency too rapidly to be compatible with neutral drift (Evans et al. 2005). One further interesting gene for which positive selection is indicated in European Americans is AIM1 (MATP). Polymorphism at AIM1 is associated with normal variation in human pigmentation (dark hair, skin and eye color in Caucasians) (Graf et al. 2005). A recent study also suggested that the AIM1 gene has been subjected to positive selection (Soejima et al. 2006). Skin pigmentation is a well-known example of genetic adaptation in humans (Cavalli-Sforza et al. 1996). Both AIM1 and a gene called OCA2, which is also among our top candidates for being under selection in Europeans, are implicated in different types of oculocutaneous albinism.
A mutation in OCA2 causes the most prevalent type of oculocutaneous albinism throughout the world and occurs at much lower frequency in Europeans than in Africans (Lee et al. 1994). SLC24A5, which is among our top candidates in European Americans, has also been shown to be involved in skin pigmentation and the allele associated with light pigmentation is almost fixed in European populations (Lamason et al. 2005). A number of coat color genes have been described in mice and we identified 51 human orthologues of genes associated with variation in coat color in mice (http://albinismdb.med.umn.edu/genes.htm). Among these 51 genes, 5 (RAB27A, DCT, EGFR, ATRN, MATP) are among our top candidates. This is a significantly higher number than expected by chance for 51 randomly chosen genes (p=0.029, χ² test), indicating that many of the genes involved in human skin pigmentation have been under selection in different populations. Recently a number of studies have focused on genes involved in skin pigmentation in humans (McEvoy et al. 2006; Lao et al. 2007; Myles et al. 2007), also indicating that many of those genes have been subjected to positive selection.

In addition to the genes investigated in our study, a number of regions lacking recognized genes were also as likely to have been under selection as the genes discussed (data not shown). The results for some of these regions may reflect stochastic variation, but in general, this supports the notion that some non-coding DNA and intergenic sequences are under selection for functional reasons (Andolfatto 2005). One example of such a case is the LCT gene. The most likely mutation to have been under selection in association with the LCT gene is situated 14 kb upstream of the gene (Enattah et al. 2002). In the case of LCT, the haplotype block is large enough to include both the gene and the upstream region, which might not be the case in other similar situations.

Genome-wide significance of positive selection

As seen in our study and in others, most methods face the problem that using stringent statistical thresholds to avoid false positives results in high numbers of false negatives. When applying thresholds for genome-wide significance we identify 31 candidates for having been under selection (Table 2). As an alternative means of enriching for candidates, we applied the method based on iHS (Voight et al. 2006) to our list of top 1% candidates, which resulted in 50 regions with a significant p-value (Table 3). The overlap between the regions identified by our method using a genome-wide significance threshold (Table 2) and those identified by applying the iHS method to our top 1% candidates is restricted to AIP1 and CSMD1 in the African American population. In addition, TCBA1, which is on the top 1% list for all three populations, is found among the genome-wide significant genes in African Americans using our method, but among the significant genes for both European Americans and Han Chinese when applying the iHS method. Surprisingly, HERC1 and EDAR (Table 2), which show genome-wide significance using our method, were not significant when we applied the iHS method, even though they were listed as significant by other groups using this test on other datasets (Voight et al. 2006; Sabeti et al. 2007). This discrepancy probably reflects the sensitivity of the results to the dataset used for identifying the candidate regions. The limited sample sizes available for these kinds of studies not only decrease the power to identify genes under selection, but also make the findings hard to replicate in another sample.

Overlap with other studies

The rather small overlap between the loci indicated to be under selection in our study and those pinpointed previously in between-species comparisons is not surprising. First, there is a large difference in the time perspective of the selective events. Local adaptive selection that affects one or a few human populations may be quite distinct from the selective pressure that shaped modern humans from archaic forms of Homo. Second, since environmental factors vary between locations it is not expected that selection will affect all populations equally. Therefore, some of the apparent differences between the results of comparative genomic and population genetic approaches may be due to differences in the samples included. For example, LPP has been suggested to be under negative selection on the human lineage (Bustamante et al. 2005) but is among our top candidates for having been under positive selection in Han Chinese. However, Bustamante and colleagues used only European and African samples to represent humans in their analysis, and if local selection on LPP has occurred only in the Han Chinese this would not have been detected in their study. We observe a larger overlap with studies focusing on individual human populations. Carlson et al. (2005) based their study on the Tajima's D statistic and used a sliding window technique to identify candidate regions. Tajima's D is conceptually unrelated to our method and it is therefore interesting that we have a larger overlap than expected by chance between our list of genes and that of Carlson et al. (2005). Since we are considering both the allele frequency differences between human populations and the deviation in length of haplotype blocks, we will preferentially identify genes that have been under selection after the separation of the three major human groups.
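One common way to ask whether two candidate lists overlap more than expected by chance is a one-sided hypergeometric test. The sketch below is a generic illustration and not necessarily the test used in this study; the gene counts in the usage example are hypothetical placeholders, not values from either study.

```python
from math import comb


def overlap_p_value(n_total, n_a, n_b, k):
    """P(overlap >= k) when list B (n_b genes) is drawn at random from
    n_total genes, of which n_a belong to list A (hypergeometric upper tail)."""
    upper = min(n_a, n_b)
    total = comb(n_total, n_b)
    tail = sum(
        comb(n_a, i) * comb(n_total - n_a, n_b - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Hypothetical numbers for illustration only: 22 000 genes in the genome,
# two candidate lists of 200 genes each, 10 genes shared between them.
p = overlap_p_value(22000, 200, 200, 10)
print(p)
```

The test treats genes as independent draws, so physical clustering of candidate genes within the same haplotype block would make the resulting p-value too small.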

In summary, we have developed and applied a new method for identifying candidate loci for being under selection at different time points during the evolution of human populations. To identify the specific loci and polymorphisms under selection requires further studies of the sequence variability and natural history of different populations. None of the methods available for identifying genes under selection is resistant to errors, but the use of both haplotype architecture and population differentiation increases our ability to identify genomic regions that have been affected by non-random forces. These methodologies provide focal points in the genome for future studies of the evolutionary events that have shaped modern human populations as they explored different parts of the world. One continuation of these studies is to evaluate the top candidates by resequencing the genes in a number of populations. Resequencing is still both time-consuming and expensive but with the rapid progress in high-resolution genotyping of individuals from different populations (Altshuler et al. 2005; Hinds et al. 2005) and the availability of new techniques for high throughput genomic resequencing (Bennett et al. 2005; Margulies et al. 2005), the amount of genome data is growing exponentially and thereby also the potential for further evaluation of genes that have been under selection in humans.

Acknowledgements

This study was supported by grants from the Swedish Natural Sciences Research Council. ÅJ is affiliated to The Linnaeus Centre for Bioinformatics, Uppsala University, Sweden.
