Chromosome-Wide Haplotype Sharing: A Measure Integrating Recombination Information to Reconstruct the Phylogeny of Human Populations

Authors

  • Shuhua Xu

    Corresponding author
    1. Chinese Academy of Sciences and Max Planck Society (CAS-MPG) Partner Institute for Computational Biology, Key Laboratory of Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
  • Li Jin

    1. Ministry of Education (MOE) Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai 200433, China

Prof. Dr. Shuhua Xu, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China. Tel: +86 21 54920479; Fax: +86 21 54920451; E-mail: xushua@picb.ac.cn

Summary

The vast amount of recombination information in the human genome has long been ignored or deliberately avoided in studies on human population genetic relationships. One reason is that estimation of the recombination parameter from genotyping data is computationally challenging and practically difficult. Here we propose chromosome-wide haplotype sharing (CHS) as a measure of genetic similarity between human populations, which is an indirect approach to integrate recombination information. We showed in both empirical and simulated data that recombination differences and genetic differences between human populations are strongly correlated, indicating that recombination events in different human populations are evolutionarily related. We further demonstrated that CHS can be used to reconstruct reliable phylogenies of human populations and the majority of the variation in CHS matrix can be attributed to recombination. However, for distantly related populations, the utility of CHS to reconstruct correct phylogeny is limited, suggesting that the linear correlation of CHS and population divergence could have been disturbed by recurrent recombination events over a large time scale. The CHS we proposed in this study is a practical approach without involving computationally challenging and time-consuming estimation of recombination parameter. The advantage of CHS is rooted in its integration of both drift and recombination information, therefore providing additional resolution especially for populations separated recently.

Introduction

The reconstruction of human phylogeny from contemporary genetic information was first attempted by the use of allele frequencies from five major blood-group systems (Cavalli-Sforza & Edwards, 1967; Cavalli-Sforza et al., 1988). Over the last two decades, data from mitochondrial DNA (mtDNA) (Vigilant et al., 1991; Wallace et al., 1999) and the Y chromosome (Underhill et al., 2000; Jobling & Tyler-Smith, 2003) have been used almost exclusively to infer relationships among human populations. However, Y-DNA and mtDNA are both inherited effectively as a single “linkage block” owing to the absence of recombination, while other genomic regions could be expected to have different lineages or histories (Hey & Machado, 2003). Recently available data on genome-wide high-density single nucleotide polymorphisms (SNPs), and the advent of whole-genome sequencing data for human populations, have marked a transition from single-locus based studies to genomic analysis of human population structure and relationships (Rosenberg et al., 2002, 2005; The International HapMap Consortium, 2005, 2007; Friedlaender et al., 2008; Jakobsson et al., 2008; Kayser et al., 2008; Li et al., 2008; The HUGO Pan-Asian SNP Consortium, 2009). Apart from the significant increase in the number of loci or markers, the accumulated recombination events in the genome are expected to provide additional information for human genetic relationship studies. In practice, however, estimation of the population recombination parameter 4Ner from genotyping data is computationally challenging, and the theory of optimal estimation has not yet been fully worked out. Furthermore, estimators rely on assumptions about demography and selective neutrality (Ardlie et al., 2002). As a result, the vast recombination information in the human genome has long been ignored or deliberately avoided in studies on human population genetic relationships.
Most recent studies considering haplotype information focused on either the genomic structure of recombination rates (McVean et al., 2004), or haplotype diversity (Conrad et al., 2006; Jakobsson et al., 2008; Auton et al., 2009), or demographic parameters of single populations (Lohmueller et al., 2009). A recent study introduced a copying model to infer population relationships (Hellenthal et al., 2008), but it was based on SNP data of sparse density (about 4.1 kb per SNP) and small sample size.

In this study, we analyzed 20,177 SNPs on chromosome 21 with a density of 1.6 kb per SNP in 11 human populations representing Africa, Europe, and East Asia. The high-density markers of this data set allow us to investigate the fine-scale recombination pattern across the chromosome and to explore multiple chromosomal regions of various sizes. We first investigated whether recombination information in the genome could be used to study human population genetic relationships, that is, whether the correlation of recombination events and their frequencies among populations could reflect population divergence history. Since recombination events cannot be precisely estimated, we propose chromosome-wide haplotype sharing (CHS) in sliding windows to capture chromosome-wide recombination information in human populations. This approach integrates information of both recombination and drift without relying on precise estimation of the recombination parameter per se. We demonstrated that CHS can be used to reconstruct reliable relationships among human populations, and showed that it could be partitioned into the contributions of recombination and genetic drift. We further conducted simulation studies to investigate properties of CHS and compared it with statistics based on single SNPs; the results showed that HS provides much higher resolution than statistics based on single SNPs, especially when the relationship of closely related human populations is concerned. We also explored the appropriate size of genomic regions for CHS analysis to correctly reconstruct phylogenies and efficiently reveal the population genetic relationship.

Methods

Populations and Samples

Overall 11 human population samples representing Africa, Europe, and East Asia were studied. DNA samples from 48 African-Americans (AfA) (Xu et al., 2007) and 40 Europeans (CAU) were obtained from Coriell Cell Repositories. The five Chinese population samples, 46 Han Chinese (HAN), 42 Chuangs (CHU), 42 Hmongs (HMO), 43 Was (AVA), and 40 Uyghurs (UIG), represent five major linguistic families in East Asia and have been described elsewhere (Huang et al., 2006; Xu & Jin, 2008; Xu et al., 2008). Four HapMap population samples (60 YRI, Yoruba from Ibadan, Nigeria; 60 CEU, Utah residents with ancestry from northern and western Europe; 45 CHB, Han Chinese in Beijing; and 44 JPT, Japanese in Tokyo) (The International HapMap Consortium, 2003, 2005, 2007) were also included in this study. Only unrelated individuals were analyzed in this study. Please refer to Table 1 for more information about population samples.

Table 1.  Information on population samples.
Sample ID  Ethnicity          Geographical location  Sample size
AVA        Wa                 Yunnan, China          43
HMO        Hmong              Guizhou, China         42
CHU        Chuang             Guangxi, China         42
HAN        Han Chinese        Shanghai, China        46
CHB        Han Chinese        Beijing, China         45
JPT        Japanese           Tokyo, Japan           44
UIG        Uyghur             XinJiang, China        40
CEU        European American  USA                    60
CAU        Caucasian          European               40
AfA        African American   USA                    48
YRI        Yoruba             Nigeria                60

Markers and Their Positions

A set of 29,177 SNPs on chromosome 21 was genotyped in 48 AfA, 40 CAU, 46 HAN, 42 CHU, 42 HMO, 43 AVA, and 40 UIG. Illumina Beadlab™ technology (Illumina, Inc., San Diego, CA, USA) was used in genotyping and the method of genotyping was described elsewhere (Huang et al., 2006). Genotyped SNPs on chromosome 21 of 60 CEU, 60 YRI, 45 CHB, and 44 JPT were downloaded from the web site of The International HapMap Project (HapMap public release #23a, 2008-04-01). After necessary data filtration, for example, deleting markers with missing data in >5% of samples and excluding SNPs showing deviation from Hardy-Weinberg equilibrium within a population (Fisher's exact test, P < 0.05, where P was estimated using Arlequin 3.0 with 100,000 permutations; Schneider et al., 2000), we obtained 20,177 SNPs that were genotyped successfully in all 11 population samples. The physical positions of SNPs were based on the Homo sapiens Genome Build 36. The total chromosome region studied is 33.4 Mb. The average spacing between adjacent markers was 1.6 kb, with a minimum of 69 bp and a maximum of 189 kb.

Statistical Analysis

Haplotype estimation

Haplotypes were estimated for each individual from its genotypes with fastPHASE (Scheet & Stephens, 2006) version 1.2. “Population labels” were applied during the model fitting procedure to enhance accuracy. The number of random starts of the EM algorithm (-T) was set to 20, and the number of iterations of the EM algorithm (-C) was set to 50. This analysis was used to generate a “best guess” estimate of the true underlying patterns of haplotype structure (Scheet & Stephens, 2006).

Genetic Distance for Populations

Three genetic distance measurements, FST (Weir & Hill, 2002), Nei's standard distance (Nei, 1972), and Nei's DA (Nei et al., 1983) were used to estimate genetic divergence among populations.
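As an illustration of the two Nei distances named above, the following is a minimal sketch using their textbook definitions, D = −ln(JXY / sqrt(JX·JY)) and DA = 1 − (1/L) Σ Σ sqrt(x·y). The function names and the input layout (one list of allele frequencies per locus) are assumptions made for this example; no sample-size bias correction is applied, and the Weir & Hill FST estimator is not reproduced.

```python
import math

def nei_standard_distance(freqs_x, freqs_y):
    """Nei's (1972) standard distance.

    freqs_x, freqs_y: one list of allele frequencies per locus,
    with alleles in the same order in both populations.
    """
    jx = jy = jxy = 0.0
    for x, y in zip(freqs_x, freqs_y):
        jx += sum(p * p for p in x)
        jy += sum(q * q for q in y)
        jxy += sum(p * q for p, q in zip(x, y))
    # Averaging each sum over loci would cancel in this ratio.
    return -math.log(jxy / math.sqrt(jx * jy))

def nei_da_distance(freqs_x, freqs_y):
    """Nei et al.'s (1983) DA distance."""
    total = 0.0
    for x, y in zip(freqs_x, freqs_y):
        total += sum(math.sqrt(p * q) for p, q in zip(x, y))
    return 1.0 - total / len(freqs_x)
```

Both functions return 0 for two populations with identical allele frequencies; the standard distance is undefined when the populations share no alleles at any locus.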

Estimates of Haplotype Sharing Between Populations

HS between populations was estimated as the proportion of shared haplotypes in the populations compared (The HUGO Pan-Asian SNP Consortium, 2009; Xu et al., 2009). Suppose we have two populations, A and B, where n(HA) and n(HB) are the total number of haplotypes observed in population A and B, respectively. The n(HA) and n(HB) equal twice the number of persons studied in populations A and B, respectively. We denote the ith distinct haplotype in population A by HAi, whose frequency is denoted by fAi. Similarly, the jth distinct haplotype in population B and its frequency are denoted by HBj and fBj, respectively.

HS between population A and B (HSAB) is defined as:

[Equation for HSAB; presented as an image in the source and not recoverable here]

In the HS calculation, HAi and HBj are both replaced by a {0, 1} indicator matrix, where 0 indicates that the ith distinct haplotype in population A is private to population A, and 1 indicates the ith distinct haplotype in population A is also common in population B. The same rules are applied to HBj, that is, 0 indicates that the jth distinct haplotype is private to population B, while 1 indicates that the jth distinct haplotype is also common in population A.
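The indicator scheme described above can be sketched in a few lines. Because the defining equation is not available in this text, the formula below is an assumption: HS is taken to be the fraction of all sampled chromosomes (pooled over A and B) that carry a haplotype observed in both populations. The function names are hypothetical and this is not the PEAS implementation.

```python
from collections import Counter

def haplotype_sharing(haps_a, haps_b):
    """Assumed HS: pooled fraction of chromosomes carrying a shared haplotype.

    haps_a, haps_b: lists of haplotype strings for one window,
    two entries per individual.
    """
    counts_a, counts_b = Counter(haps_a), Counter(haps_b)
    shared = set(counts_a) & set(counts_b)  # haplotypes with indicator = 1
    shared_copies = sum(counts_a[h] + counts_b[h] for h in shared)
    return shared_copies / (len(haps_a) + len(haps_b))

def private_proportion(haps_a, haps_b):
    """Fraction of chromosomes in A whose haplotype is absent from B."""
    seen_in_b = set(haps_b)
    private = sum(1 for h in haps_a if h not in seen_in_b)
    return private / len(haps_a)
```

For the toy input `haplotype_sharing(["00", "00", "01", "11"], ["00", "10", "10", "11"])`, five of the eight chromosomes carry a haplotype seen in both samples, so the sketch returns 0.625.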

Haplotype sharing distance between population A and B (HSDAB) is estimated as:

[Equation for HSDAB; presented as an image in the source and not recoverable here]

In some special cases, private haplotypes in a population also provide important information for population genetic history (Xu et al., 2009). Using the earlier notation, the proportion of private haplotypes, for example, in population A (HSAp) can be defined as:

[Equation for HSAp; presented as an image in the source and not recoverable here]

In practice, the proportion of private haplotypes has special values in distinguishing recent admixture from shared ancestry. It has been applied in two recent studies (The HUGO Pan-Asian SNP Consortium, 2009; Xu et al., 2009) but will not be further discussed here.

As mentioned earlier, considering the substantial variation of recombination across the human genome (McVean et al., 2004; Myers et al., 2005), we adopted a sliding window strategy and HS was calculated in each window (5 kb ∼ 500 kb bin) for population pairs. The adjacent sliding windows were overlapped by half of the window, that is, the sliding window moves forward half of the given distance bin each time. The HS calculation has been implemented in a computer program package (PEAS v1.0) (Xu et al., 2010).
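The half-overlapping window scheme can be sketched as follows. This is an illustration only: it assumes windows are defined by physical distance, start at the first SNP position, and that SNP positions are sorted; the function name is hypothetical.

```python
def half_overlap_windows(positions, window_bp):
    """Yield (start, end, snp_indices) for windows that advance by half a window.

    positions: sorted SNP positions in bp.
    window_bp: window size in bp (e.g. 5_000 to 500_000).
    Windows containing no SNP are skipped.
    """
    step = max(1, window_bp // 2)
    start, last = positions[0], positions[-1]
    while start <= last:
        end = start + window_bp
        idx = [i for i, p in enumerate(positions) if start <= p < end]
        if idx:
            yield start, end, idx
        start += step
```

Each SNP away from the chromosome ends falls into two consecutive windows, which is what the half-window step implies.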

Since the results could be affected by differences in sample size among populations, we sampled 80 chromosomes (equal to the number of chromosomes in 40 individuals) with replacement in each population and counted the number of haplotypes in each genomic interval. The sampling procedure was repeated 100 times and the results were averaged for each genomic interval.
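The resampling step can be sketched as below, assuming that "the number of haplotypes" means the number of distinct haplotypes in the window. The function name and the fixed seed are choices made for this example.

```python
import random

def mean_distinct_haplotypes(haps, n_chrom=80, n_rep=100, seed=0):
    """Average number of distinct haplotypes over bootstrap resamples.

    haps: haplotype strings observed in one population for one window.
    n_chrom: chromosomes drawn per replicate (80 = 40 individuals).
    n_rep: number of replicates averaged.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rep):
        sample = rng.choices(haps, k=n_chrom)  # sampling with replacement
        total += len(set(sample))
    return total / n_rep
```

Drawing the same number of chromosomes from every population keeps the haplotype counts comparable across samples of different sizes.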

Tree Reconstruction

Distance based population trees were reconstructed using the Neighbor-Joining (NJ) algorithm (Saitou & Nei, 1987) with the Molecular Evolutionary Genetics Analysis software package (MEGA version 4.0) (Tamura et al., 2007). A maximum-likelihood tree of populations was reconstructed using the maximum-likelihood method (Felsenstein, 1973) with the CONTML program in the PHYLIP package (Felsenstein, 1989).

Recombination Parameter Estimation

The PHASE software implements a Bayesian statistical method for reconstructing haplotypes and estimating recombination parameters from population genotype data (Stephens et al., 2001; Li & Stephens, 2003; Stephens & Donnelly, 2003). A “background” recombination rate was given as a prior; any increase over the background rate (“hotspots”) was assumed to occur as a Poisson process, and the width of the hotspot was assumed to have a truncated normal distribution. We divided the 20,177 SNPs into sections (windows) of 40 SNPs with 20 overlapping SNPs between each two consecutive sections. PHASE was run for each section of chromosomal data with recombination model (-MR), 10,000 iterations, 100 thinning interval, and 10,000 burn-ins. X1000 option was invoked to obtain more accurate estimates, which increased the number of iterations of the final runs to be 1000 times longer than other runs. The other parameters were set as the default.

Linkage Disequilibrium Measures and Calculation

In this study, linkage disequilibrium (LD) was also used to measure the recombination magnitude between adjacent SNPs. Several statistics have been used to measure the LD between a pair of loci (Jorde, 1995). The two most common measures are the absolute value of D′ (denoted by |D′| hereafter), and r2, both derived from Lewontin's D (Lewontin, 1964). In this study, both |D′| and r2 were used to measure LD between adjacent SNPs, and were calculated from haplotype data inferred by fastPHASE. Estimates of |D′| were calculated following Devlin and Risch (1995), while estimates of r2 were calculated following Hill and Weir (1994).
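For reference, a minimal sketch of the two LD statistics computed from phased haplotypes is given below. It uses the textbook definitions based on Lewontin's D rather than the specific estimators of Devlin and Risch (1995) or Hill and Weir (1994); the function name and the 0/1 allele coding are assumptions.

```python
def pairwise_ld(haps, i, j):
    """Return (|D'|, r2) for SNPs i and j.

    haps: phased haplotypes, each a sequence of 0/1 alleles.
    """
    n = len(haps)
    p_a = sum(h[i] for h in haps) / n          # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haps) / n          # frequency of allele 1 at SNP j
    p_ab = sum(1 for h in haps if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                       # Lewontin's D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max, d * d / denom
```

The two statistics can disagree: when one of the four possible two-SNP haplotypes is absent, |D′| equals 1 even though r2 may be well below 1.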

Forward-Time Simulation Studies

To investigate the decay of HS between populations as the divergence time increases, and to evaluate the ability of HS to reconstruct the correct phylogeny, we conducted forward-time simulations. We explored the appropriate window size in which HS can reconstruct the correct phylogeny of populations with various divergence times. A most recent common ancestor (MRCA) population with effective population size (Ne) 10,000 was created from 120 YRI chromosomes based on 20,177 SNPs on chromosome 21. A recombination rate of 1 cM/Mb per generation was used to break the chromosomes. Populations split and diverged hierarchically every five generations. All populations were simulated with constant Ne of 10,000 and without bottleneck so that the effect of drift was reduced to a minimum level. HS analysis was performed using the same procedure as that used in empirical data. Phylogenetic trees based on the haplotype sharing measure were reconstructed using haplotypes in various window sizes (e.g., 5 kb–500 kb). For each given divergence time, the minimum window size was determined when the topology of the phylogenetic tree was consistent with the presimulated one and was stable with 100% bootstrapping value in all clades.
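A forward-time simulation of this kind can be sketched as a simplified Wright-Fisher process. The code below is an illustration under stated assumptions, not the simulator used in the study: each offspring chromosome has at most one crossover, breakpoints are uniform over SNP intervals rather than physical distance, and only a single two-way split is modelled. All names are hypothetical. With 1 cM/Mb over 33.4 Mb, the expected number of crossovers per chromosome per generation is roughly 0.334, which is one possible value for `crossover_prob`.

```python
import random

def next_generation(pop, size, crossover_prob, rng):
    """One Wright-Fisher generation with at most one crossover per offspring."""
    n_sites = len(pop[0])
    children = []
    for _ in range(size):
        child = rng.choice(pop)
        if n_sites > 1 and rng.random() < crossover_prob:
            other = rng.choice(pop)
            cut = rng.randrange(1, n_sites)   # breakpoint between two SNPs
            child = child[:cut] + other[cut:]
        children.append(child)
    return children

def split_and_diverge(ancestor, size, generations, crossover_prob, seed=0):
    """Split an ancestral haplotype pool in two and evolve both independently."""
    rng = random.Random(seed)
    pop_a, pop_b = list(ancestor), list(ancestor)
    for _ in range(generations):
        pop_a = next_generation(pop_a, size, crossover_prob, rng)
        pop_b = next_generation(pop_b, size, crossover_prob, rng)
    return pop_a, pop_b
```

Haplotype sharing between `pop_a` and `pop_b` can then be computed per window at increasing values of `generations` to trace its decay.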

Partial and Multiple Mantel Tests for the Effect of Recombination and Drift on Haplotype Sharing

We used Mantel tests under a multiple correlation and regression design to simultaneously evaluate the contribution of genetic drift and recombination to HS among populations. There are three different matrices to be analyzed which are obtained from pairwise population comparisons: (1) HS matrix; (2) pairwise FST matrix, and (3) pairwise recombination correlation coefficient matrix. In this case, it would be possible to establish which part of the total explained variance of HS matrix could be attributed to drift or recombination. These relative values could be obtained simply by performing Mantel tests, using each effect separately and combined into a single model.
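The partial Mantel idea can be sketched as follows: compute the correlation between the HS matrix and one predictor matrix while controlling for the other, then assess significance by permuting the population labels of the HS matrix. This is a generic sketch of a partial Mantel test using the standard first-order partial correlation formula; it is not the software used in the study, and the function names, permutation count, and two-sided P-value are assumptions.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def lower_triangle(m, order=None):
    """Flatten the below-diagonal entries, optionally relabelling populations."""
    n = len(m)
    order = order or list(range(n))
    return [m[order[i]][order[j]] for i in range(n) for j in range(i)]

def partial_mantel(hs, x, z, n_perm=999, seed=0):
    """Partial correlation of HS with X given Z, plus a permutation P-value."""
    rng = random.Random(seed)
    xv, zv = lower_triangle(x), lower_triangle(z)
    r_xz = pearson(xv, zv)

    def stat(order):
        hv = lower_triangle(hs, order)
        return partial_r(pearson(hv, xv), pearson(hv, zv), r_xz)

    observed = stat(list(range(len(hs))))
    hits = 0
    for _ in range(n_perm):
        order = list(range(len(hs)))
        rng.shuffle(order)                    # permute rows and columns together
        if abs(stat(order)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Calling `partial_mantel(hs, rec, fst)` gives the HS-recombination association controlling for FST, and swapping the last two arguments gives the HS-FST association controlling for recombination.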

Results

Correlation between Recombination and Genetic Divergence

We first estimated population recombination rates (ρ) in 11 populations (Table 1) using the program PHASE 2.1 (see Methods section for details). Figure 1 displays ρ values along the 33.4 Mb region of chromosome 21 across 11 populations. The average ρ varies substantially among populations (Fig. 2). For example, when ρ was estimated from markers with minor allele frequency (MAF) > 0, AVA showed the smallest average ρ (0.19 per kb), while AfA showed the largest average ρ (1.76 per kb).

Figure 1.

Estimated ρ values along a 33.4-Mb region of chromosome 21. Each series of points shows the estimates of log10(ρ) per bp. To separate the curves on the Y-axis, a multiple of an arbitrary constant (c = 4) was added to the estimates for each sample other than YRI; that is, 4 was added to AfA, and so on. Note both the large variation along the chromosome and the similar positions of the troughs and spikes shared among the 11 population samples.

Figure 2.

Estimated population recombination rate (ρ) in 11 populations. Y-axis shows the ρ values per kb which were averaged from 40-SNPs windows for each population. Population IDs are displayed on the X-axis.

Although the average ρ varies substantially among populations (Fig. 2), the chromosome-wide recombination pattern (cREC) is highly correlated among populations as shown by high Spearman rank correlation coefficients (Table 2, Table S1 and S2), indicating that the genomic positions of low and high values of ρ are consistent across populations. Similar results were obtained using the LDhat program (McVean et al., 2004). High correlation among populations could also be observed for regional LD (e.g., r2) since the decay of LD is largely dictated by recombination (Table 3, Table S3 and S4, Fig. S1).
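The between-population comparison of recombination profiles amounts to a rank correlation between two vectors of per-window ρ estimates. A minimal sketch is shown below; the function names are hypothetical, and ties receive average ranks.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)
```

Because only ranks are used, two populations whose average ρ differs several-fold can still show a high correlation if their peaks and troughs fall in the same windows.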

Table 2.  Spearman rank correlations of recombination rate (ρ, above diagonal) and genetic divergence (FST, below diagonal) between populations. FST was averaged over all windows of the same distance bin for a given pair of populations.
     JPT    CHB    HAN    CHU    HMO    AVA    UIG    CAU    CEU    AfA    YRI
JPT  -      0.893  0.898  0.885  0.886  0.874  0.876  0.833  0.849  0.792  0.772
CHB  0.007  -      0.909  0.892  0.892  0.881  0.890  0.847  0.862  0.807  0.791
HAN  0.008  0.000  -      0.901  0.895  0.887  0.888  0.849  0.860  0.806  0.788
CHU  0.019  0.010  0.010  -      0.895  0.887  0.880  0.847  0.855  0.802  0.787
HMO  0.022  0.013  0.012  0.011  -      0.878  0.867  0.837  0.846  0.783  0.768
AVA  0.029  0.018  0.021  0.022  0.026  -      0.871  0.827  0.844  0.782  0.776
UIG  0.036  0.037  0.037  0.039  0.043  0.048  -      0.885  0.888  0.826  0.801
CAU  0.101  0.102  0.102  0.098  0.104  0.108  0.024  -      0.904  0.812  0.780
CEU  0.105  0.107  0.107  0.103  0.109  0.113  0.028  0.001  -      0.813  0.780
AfA  0.137  0.139  0.138  0.134  0.141  0.139  0.092  0.104  0.109  -      0.891
YRI  0.182  0.185  0.183  0.180  0.188  0.183  0.146  0.162  0.166  0.008  -
Note: All Spearman's ρ are significantly different from zero by one-tailed test (P < 10^−6).

Table 3.  Spearman rank correlations of LD (|D′|, above diagonal; r2, below diagonal) between populations.
     JPT    CHB    HAN    CHU    HMO    AVA    UIG    CAU    CEU    AfA    YRI
JPT  -      0.566  0.534  0.516  0.530  0.532  0.451  0.369  0.373  0.313  0.279
CHB  0.932  -      0.560  0.549  0.557  0.530  0.483  0.398  0.405  0.323  0.289
HAN  0.931  0.945  -      0.565  0.523  0.508  0.477  0.389  0.378  0.324  0.286
CHU  0.919  0.927  0.935  -      0.571  0.545  0.462  0.398  0.387  0.314  0.291
HMO  0.910  0.927  0.926  0.929  -      0.536  0.448  0.370  0.380  0.287  0.270
AVA  0.897  0.915  0.910  0.916  0.909  -      0.454  0.392  0.399  0.328  0.284
UIG  0.849  0.866  0.859  0.858  0.847  0.861  -      0.515  0.538  0.399  0.342
CAU  0.773  0.784  0.779  0.785  0.771  0.786  0.908  -      0.601  0.389  0.319
CEU  0.769  0.787  0.778  0.786  0.773  0.788  0.916  0.950  -      0.388  0.331
AfA  0.634  0.643  0.638  0.648  0.635  0.653  0.725  0.721  0.721  -      0.645
YRI  0.556  0.560  0.553  0.568  0.554  0.570  0.620  0.613  0.613  0.889  -
Note: All Spearman's ρ are significantly different from zero by one-tailed test (P < 10^−6).

Interestingly, the between-population correlation of cREC is strongly correlated with the differentiation between populations as measured by FST (Fig. 3). It is obvious that the cREC and LD patterns between populations are both strongly correlated with the genetic differences (FST) between populations as indicated by large R2 (0.92 for ρ, 0.96 for r2, and 0.94 for |D′|). Furthermore, population trees reconstructed from Spearman rank correlation coefficients of both recombination and LD between populations (Fig. S2 and S3) were very similar to the maximum-likelihood tree (Fig. S4A) and the NJ tree using Nei's DA distance (Fig. S4C) reconstructed from single SNPs, suggesting that recombination information reflects the genetic relationship among human populations or population divergence history.

Figure 3.

The relationship of FST and Spearman rank correlation of recombination rate (ρ), Spearman rank correlation of LD (|D′| and r2) between populations. Spearman correlation coefficients are −0.955, −0.961, −0.956 for FST versus ρ, FST versus r2, FST versus |D′|, respectively; P < 10^−6 (Mantel test). The regression formulas of the three lines are: y = −0.68x + 0.90 (R2 = 0.92), y = −2.07x + 0.96 (R2 = 0.96), y = −1.61x + 0.56 (R2 = 0.94), for ρ, r2, and |D′|, respectively.

Correlation between Haplotype Sharing and Genetic Divergence

HS statistics were calculated among 11 human populations in sliding windows of given size (see Methods section for details). The results from the same window size were averaged to give the CHS statistic. The HS values calculated in each distance bin (5 kb–500 kb) for the 55 population pairs are shown in Tables S5–S13. As expected, populations share more haplotypes within short distance bins and closely related populations share more haplotypes. Overall, as shown in Figure 4A, CHS and FST values are strongly correlated in all distance bins (Mantel test, r < −0.925, P < 10^−4). The magnitude of correlation increases with bin size from 5 kb (r = −0.950) to 50 kb (r = −0.954), but starts to decrease from 100 kb onward (r = −0.949).

Figure 4.

Relationship of haplotype sharing proportions (HSP), correlation of recombination rates and FST. HSP, recombination rates and pairwise FST were calculated from 20,177 SNPs for 55 population pairs. Both recombination rates and FST were averaged over all windows of the same distance bin. (A): Relationship of HSP and FST. (B) Relationship of HSP and recombination. Correlation coefficients are shown in Table 4.

Correlation between Haplotype Sharing and Recombination

Considering the strong correlations between recombination and FST, and the correlation between CHS and FST as we observed earlier, it is expected that cREC and HS are also highly correlated. We further investigated the correlation between cREC and CHS in different distance bins (5–500 kb). Overall, as shown in Figure 4B, cREC and CHS are strongly correlated in all distance bins (Mantel test, r > 0.905, P < 10^−4). Correlation magnitudes increase from the 5 kb bin (r = 0.965) to the 50 kb bin (r = 0.967), and then decrease from the 100 kb bin onward (r = 0.956).

Partition Variation of Haplotype Sharing into Drift and Recombination

Since the CHS values in this study were calculated from haplotypes in sliding windows, they are expected to contain information of both drift and recombination. It is helpful to know the respective contribution of recombination and drift to HS measurement. We thus simultaneously evaluated the contribution of genetic drift and recombination to HS among populations using Mantel tests under a multiple correlation and regression design (see Methods section for details), with the results being shown in Table 4. In distance bins less than 100kb, partial correlation values of HS and cREC are all significant (P < 0.01), and about 40% of CHS variation can be attributed to cREC; while partial correlation values of CHS and FST are all not significant (P > 0.05), and only about 10% of CHS variation can be attributed to FST. In contrast, in distance bins greater than 100 kb, partial correlation of CHS and cREC are not significant (P > 0.07), and only about 10% of CHS variation can be attributed to cREC, while partial correlation values of CHS and FST are all significant (P < 0.05), and about 20% of CHS variation can be attributed to FST. Therefore, both drift and recombination contribute information to HS, but for genomic regions within 100 kb, the contribution of recombination is predominant and that of genetic drift is relatively small.
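The percentages quoted above appear consistent with squared partial correlations, although the text does not state how they were derived, so this reading is an assumption. Under that assumption, the shares can be recomputed from the partial correlation coefficients reported in Table 4:

```python
# Partial correlations of haplotype sharing from Table 4.
# Keys are distance bins; values are (partial r with FST, partial r with recombination).
table4_partial = {
    "5 kb": (-0.318, 0.599),
    "10 kb": (-0.310, 0.629),
    "20 kb": (-0.312, 0.638),
    "30 kb": (-0.322, 0.642),
    "40 kb": (-0.332, 0.633),
    "50 kb": (-0.355, 0.608),
    "100 kb": (-0.375, 0.508),
    "200 kb": (-0.432, 0.363),
    "500 kb": (-0.492, 0.137),
}

# Squared partial correlation, read here as a share of CHS variation.
share_fst = {b: round(r_fst ** 2, 3) for b, (r_fst, _) in table4_partial.items()}
share_rec = {b: round(r_rec ** 2, 3) for b, (_, r_rec) in table4_partial.items()}

for b in table4_partial:
    print(f"{b:>7}: FST {share_fst[b]:.3f}, recombination {share_rec[b]:.3f}")
```

For bins below 100 kb the recombination share computed this way falls between about 0.36 and 0.41, and the FST share is about 0.10 to 0.13, in line with the "about 40%" and "about 10%" figures in the text.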

Table 4.  Correlation and partial correlation of haplotype sharing, genetic distance (FST) and recombination between populations.
Bin      Correlation                         Partial correlation
         FST               Recombination     FST              Recombination
5 kb     −0.950 (<0.0001)  0.965 (<0.0001)   −0.318 (0.066)   0.599 (0.010)
10 kb    −0.952 (<0.0001)  0.968 (<0.0001)   −0.310 (0.071)   0.629 (0.008)
20 kb    −0.952 (<0.0001)  0.969 (<0.0001)   −0.312 (0.075)   0.638 (0.008)
30 kb    −0.954 (<0.0001)  0.970 (<0.0001)   −0.322 (0.072)   0.642 (0.006)
40 kb    −0.954 (<0.0001)  0.969 (<0.0001)   −0.332 (0.069)   0.633 (0.008)
50 kb    −0.954 (<0.0001)  0.967 (<0.0001)   −0.355 (0.055)   0.608 (0.008)
100 kb   −0.949 (<0.0001)  0.956 (<0.0001)   −0.375 (0.047)   0.508 (0.024)
200 kb   −0.944 (<0.0001)  0.940 (<0.0001)   −0.432 (0.020)   0.363 (0.074)
500 kb   −0.928 (<0.0001)  0.906 (<0.0001)   −0.492 (0.004)   0.137 (0.271)
Note: Numbers in parentheses are P-values.

Decay of HS with Increasing Divergence Time

Since CHS between populations is contributed by both recombination and drift, it is expected to decay as divergence time increases between populations. We investigated the decay of CHS by forward-time simulation (see Methods section). The results showed that CHS in larger windows decayed more than that in smaller windows (Fig. 5). For example, for populations that diverged 5000 generations ago, 71.1% of CHS remained in the 5 kb windows, while only 17.3% remained in the 50 kb windows and less than 1% remained in the 500 kb windows. To compare the results with those of single SNPs, we calculated average FST from 500 kb windows and used (1 − FST) as the measure of similarity between populations. CHS decayed significantly faster than (1 − FST) (Fig. 5). This result suggests that CHS has higher resolution than FST calculated from the same number of SNPs in distinguishing populations, especially those with very short divergence times.

Figure 5.

Decay of haplotype sharing as a function of the increase of population divergence time. Results obtained from computer simulation assuming constant population size of 10,000 for all populations so that the effect of genetic drift is minimal. Haplotype sharing was calculated in sliding windows (5 kb–500 kb bin) for population pairs. Simulations were repeated 1000 times and results were averaged. The decay of haplotype sharing percentage is compared with the decay of (1 −FST) which can be taken as allele sharing of single SNPs. FST values were calculated from single SNPs within 500 kb windows and averaged for each time scale.

Window Size and Power of HS to Distinguish Populations

In haplotype-based analysis, it may be difficult to determine the ideal window size for analysis. We thus investigated different bin sizes and determined the power of CHS in each bin size to reconstruct reliable population phylogenies by simulation studies (see Methods section). The results showed that the ability to reconstruct a correct phylogeny of populations is very low using CHS with small bin sizes (Fig. 6). For example, less than 70% of windows smaller than 40 kb showed the correct phylogeny when they were individually used to reconstruct population trees, that is, when population trees were reconstructed from a single window of a given size. This percentage remained below 70% no matter how long ago the populations diverged. Generally speaking, larger windows showed higher percentages of correctly reconstructed population phylogenies. For example, more than 75% of 100 kb windows showed the correct population phylogeny for populations that diverged between 20 and 500 generations ago. However, even for large windows (>100 kb), the ability to reconstruct the correct phylogeny remained high only if the populations had not diverged for a very long time. For example, more than 77% of 200 kb windows led to the correct population phylogeny for populations that diverged 10 to 300 generations ago, and more than 79% of 500 kb windows led to the correct population phylogeny for populations that diverged 5 to 100 generations ago. However, this does not mean that small windows cannot be used to reconstruct correct phylogenies with 100% support. As shown in Figure 7, the tree topologies within 100 kb bins were supported by 100% bootstrapping values when CHS was used to reconstruct population relationships. Considering the tradeoff between achieving high resolution and robustness of the topology, we recommend using CHS within 100 kb windows for reconstructing genetic relationships of human populations.

Figure 6.

The ability to recover correct phylogeny of haplotype sharing in different window sizes. Phylogenies were reconstructed from simulated haplotypes within each window of given size; the percentage of windows showing correct phylogeny were calculated for each window size (5 kb–500 kb bin). Simulations were repeated 1000 times and results were averaged.

Figure 7.

Population trees reconstructed based on haplotype sharing distance (HSD). (A) HSD calculated from 5 kb bins; (B) HSD calculated from 10 kb bins; (C) HSD calculated from 20 kb bins; (D) HSD calculated from 30 kb bins; (E) HSD calculated from 40 kb bins; (F) HSD calculated from 50 kb bins; (G) HSD calculated from 100 kb bins; (H) HSD calculated from 200 kb bins; (I) HSD calculated from 500 kb bins. Italic numbers on trees are bootstrap values obtained by sampling data 1000 times with replacement.

Discussion

In this study, we have proposed HS as a measure of genetic relationship among human populations, and have shown in a set of high-density SNP data that HS could be used to reconstruct a reliable genetic relationship among human populations. The primary advantage of HS measurement is the increased resolution when closely related populations are compared, especially when data are from common variants such as SNPs with MAF of 5% or more. For the majority of common SNPs in the human genome, the differences between populations lie not in the absence or presence of a certain allele, but in the allele frequencies. Furthermore, closely related populations often share very similar allele frequencies at most of the common loci. Therefore, single SNPs are expected to provide very limited resolution when closely related populations are compared. The HS measurement provides additional resolution as it contains recombination information, which could cause more genetic differences between populations.

Both drift and recombination contribute to the genetic divergence between populations; while the former removes the polymorphism within a population, the latter increases the diversity within and between populations. Previous studies reported a positive correlation between recombination rates and levels of genetic variation within human populations (Nachman et al., 1998; Nachman, 2001; Hellmann et al., 2003; Hellmann et al., 2005). Others reported positive correlations between recombination rates and levels of genetic divergence between human and mouse (Sachidanandam et al., 2001; Lercher & Hurst, 2002; Hardison et al., 2003), and other species (Migone et al., 1983; Roselius et al., 2005). However, there are also studies showing a weak correlation of recombination rate between rat, mouse and human (Jensen-Seaman et al., 2004), and showing a recombination rate varying greatly between bird species with highly conserved genome structures (Dawson et al., 2007). In this study, we showed that chromosome-wide recombination patterns among populations are strongly correlated and this correlation is in accordance with the genetic difference between human populations, indicating that recombination events in different human populations are evolutionarily related. We further demonstrated by multiple Mantel testing that the increased resolution of HS is due to the additional recombination information contained in haplotypes, that is, the majority (about 40%, for genomic regions <100 kb) of variation in HS can be attributed to recombination. Therefore, it is biologically reasonable to use HS that integrates both drift and recombination information to study human population relationships. Our current study is based on data from a single chromosome, but our results and method can be extended to the other chromosomes and even the entire genome.

The main limitation of the HS measurement is that its magnitude depends on the size of the genomic region; that is, the HS magnitude in a 5 kb region will differ from that in a 500 kb region. We suggest that authors report the genomic region size as well as the number of markers when reporting HS analysis results; for example, the HS between population A and population B is 80% per 100 kb, reported together with the average number of markers. To obtain a reliable tree topology and effectively distinguish closely related populations, researchers can choose a window size appropriate for the estimated divergence time of the populations (refer to Fig. 6). However, as we have shown in this study, the relative relationships between populations are not substantially affected by the size of the genomic region when population trees are used to display them (Fig. 7). We estimated the expected magnitudes of HS for populations with different divergence times, but this required restrictive scenarios with strong assumptions, such as a large and constant effective population size. Therefore, the divergence time cannot be deduced from the HS observed in empirical data, where the demographic history of the studied populations could be more complicated than that simulated in this study.
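The dependence of HS on window size can be illustrated with a small Python sketch. It assumes, purely for illustration, that HS in a window is the proportion of haplotypes in the pooled sample whose allele sequence within that window is observed in both populations; this may differ from the exact definition given in the Methods. Windows are measured in markers rather than kb, and the function names and toy haplotypes are hypothetical.

```python
def haplotype_sharing(pop_a, pop_b, start, size):
    """Assumed HS for one window: fraction of pooled haplotypes whose
    window sequence occurs in both populations."""
    win_a = [h[start:start + size] for h in pop_a]
    win_b = [h[start:start + size] for h in pop_b]
    shared = set(win_a) & set(win_b)
    in_shared = sum(h in shared for h in win_a + win_b)
    return in_shared / (len(win_a) + len(win_b))

def mean_sharing(pop_a, pop_b, size):
    """Average the assumed HS over non-overlapping windows of `size` markers."""
    n_markers = len(pop_a[0])
    starts = range(0, n_markers - size + 1, size)
    values = [haplotype_sharing(pop_a, pop_b, s, size) for s in starts]
    return sum(values) / len(values)

# Toy phased haplotypes over four biallelic markers (not data from this study).
toy_pop_a = ["0011", "0101", "0011"]
toy_pop_b = ["0010", "0101", "0111"]

hs_small_window = mean_sharing(toy_pop_a, toy_pop_b, 2)  # two windows of 2 markers
hs_large_window = mean_sharing(toy_pop_a, toy_pop_b, 4)  # one window of 4 markers
```

In this toy example the larger window yields the lower sharing value, because a longer haplotype must match at more markers to count as shared. This is why an HS value is only interpretable alongside the window size and marker count used to compute it.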

Theoretically, recombination could be avoided when HS analysis is applied to studies of human population relationships; for instance, one solution is to choose “haplotype block” regions from the human genome (Jeffreys et al., 2001; Ardlie et al., 2002; Gabriel et al., 2002), where recombination has been very rare or absent, for a “perfect” HS analysis. Nevertheless, we found that this approach is infeasible in practice. One reason is that most haplotype blocks shared by multiple populations are likely to be very short, with few variants, and therefore provide very limited information and resolution. The other reason is that even if a long block shared by some populations is occasionally found, it can only be treated as a single locus, like Y-DNA, and would not reveal the genome-wide scenario. Besides, such a long block is likely to be shared by only a small number of populations and would be of little use in other studies. It would also be impractical to compare the results of different studies.

Conclusions

In summary, for reconstructing human phylogenies or studying human genetic relationships, single-site-based approaches ignore the vast recombination information in human genomes, and “haplotype block”-based approaches cannot be generalized either to the entire genome or to most human populations. The CHS we proposed in this study is a practical way of integrating recombination information without the computationally challenging and time-consuming estimation of the recombination parameter. We demonstrated in both empirical and simulated data that recombination differences and genetic differences among human populations are strongly correlated, and that our approach can be used to reconstruct reliable genetic relationships among human populations. The advantage of CHS is rooted in its integration of both drift and recombination information, which provides additional resolution, especially for recently separated populations.

Authors’ Contributions

SX conceived and designed the study. SX collected data and performed the analysis. SX wrote the paper, with contribution from LJ. All authors read and approved the final manuscript. All authors declare that no competing financial interests exist.

Acknowledgements

SX was supported by the National Science Foundation of China (30971577), Shanghai Rising-Star Program (11QA1407600), the Science and Technology Commission of Shanghai Municipality (09ZR1436400), and the Science Foundation of The Chinese Academy of Sciences (KSCX2-EW-Q-1-11; KSCX2-EW-R-01-05; KSCX2-EW-J-15-05). LJ was supported by the National Science Foundation of China (30890034). SX also gratefully acknowledges the support of the K. C. Wong Education Foundation, Hong Kong. This work was also supported by the MoST International Cooperation Base of China.
