Intercoalescence Time Distribution of Incomplete Gene Genealogies in Temporally Varying Populations, and Applications in Population Genetic Inference

Errata

This article is corrected by:

  1. Errata: Corrigendum Volume 77, Issue 5, 464, Article first published online: 6 August 2013

Corresponding author: HUA CHEN, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts 02115. Tel: 617-935-2254; E-mail: hchen007@gmail.com

Summary

Tracing back to a specific time T in the past, the genealogy of a sample of haplotypes may not have reached their common ancestor and may leave m lineages extant. For such an incomplete genealogy truncated at a specific time T in the past, the distribution and expectation of the intercoalescence times conditional on T are derived in an exact form in this paper for populations of deterministically time-varying sizes, specifically, for populations growing exponentially. The derived intercoalescence time distribution can be integrated to the coalescent-based joint allele frequency spectrum (JAFS) theory, and is useful for population genetic inference from large-scale genomic data, without relying on computationally intensive approaches, such as importance sampling and Markov Chain Monte Carlo (MCMC) methods. The inference of several important parameters relying on this derived conditional distribution is demonstrated: quantifying population growth rate and onset time, and estimating the number of ancestral lineages at a specific ancient time. Simulation studies confirm validity of the derivation and statistical efficiency of the methods using the derived intercoalescence time distribution. Two examples of real data are given to show the inference of the population growth rate of a European sample from the NIEHS Environmental Genome Project, and the number of ancient lineages of 31 mitochondrial genomes from Tibetan populations.

Introduction

The coalescent process can be decomposed into two independent processes: the jump chain of the coalescent process, that is, the topology of the gene genealogy; and the sequential process of intercoalescence times (Kingman, 1982). The latter is well studied for the standard coalescent process in a constant-sized population with infinite-site mutations. It is known that the waiting times between coalescent events (math formula) are independent and exponentially distributed, and that mutations occurring along the genealogy follow a Poisson process. For gene genealogies in a population with temporally varying size, the math formula’s are no longer mutually independent. Griffiths & Tavare (1994a) provided a general equation for time-varying populations and proposed an importance sampling algorithm for calculating the likelihood function for haplotypes sampled therein. Wooding & Rogers (2002) and Polanski et al. (2003) derived analytical equations for the marginal distributions of coalescence times in populations of time-varying sizes. Statistical methods based on their results were developed and applied to infer population bottlenecks and growth rates.

The gene genealogy discussed in this paper differs from those in previous studies. In the above studies, when tracing back in time, gene genealogies of current haplotypes eventually reach their most recent common ancestor (MRCA). However, if a population or demographic event occurred at some time T in the past, the lineages may or may not have reached their common ancestor by the event time T. When the lineages have not reached their common ancestor by time T, the gene genealogy is “incomplete,” and is referred to as a “truncated genealogy” (TG) in this paper. For a TG, the distribution of its intercoalescence times conditional on the truncation time T differs from the unconditional distribution of complete genealogies in classic coalescent theory, but has seldom been investigated in the literature. Blum & Rosenberg (2007) derived the conditional distribution of math formula, and used a rejection sampling algorithm developed from the distribution to infer ancient lineage number. Chen (2012) derived the marginal distribution of intercoalescence times conditional on a TG for populations of constant size and populations whose size only changes instantaneously, the so-called n-epoch model. The derived marginal distribution was subsequently used for analytically deriving the allele frequency spectrum (AFS, also known as the site frequency spectrum in the literature) and the joint allele frequency spectrum (JAFS). However, the marginal distribution of intercoalescence times has not been addressed for a population with deterministically and continuously time-varying size, which is more realistic than a population whose size jumps only at discrete time points. For example, the simple exponential growth model is a commonly used approximation for modern human demography (Slatkin & Hudson, 1991; Marjoram & Donnelly, 1994; Di Rienzo et al., 1998; Wall & Przeworski, 2000; Adams & Hudson, 2004; Voight & Pritchard, 2005). To infer the demographic history without the error of model misspecification, it is necessary to incorporate populations of temporally varying sizes into the model.

The work presented in this paper follows the theoretical framework of Chen (2012), but is generalized for populations with deterministically time-varying sizes, specifically, for populations under exponential growth. It is an essential step to derive the AFS summarizing the genetic polymorphism pattern, which afterwards can be used to construct various methods for statistical inference in nonequilibrium populations. After we obtain the distribution and expectation of time lengths of truncated genealogies in temporally varying populations, we emphasize the potential applications of the theoretical results to population genetic inference in nonequilibrium populations. We present two applications that use the intercoalescence time distribution: inferring the rate and onset time of population exponential growth; and inferring the number of ancient lineages or founding lineages of samples from the current population at a specific ancient time.

Human populations experienced frequent bottlenecks and growth. In recent years, with the advent of genomic sequencing data from various populations, it has been a research hotspot to construct demographic history from genetic data (Novembre et al., 2008; Gutenkunst et al., 2009; HUGO Pan-Asian SNP Consortium, 2009; Reich et al., 2009; Tishkoff et al., 2009; Li & Durbin, 2011; Gronau et al., 2011; Reich et al., 2012). The result presented in this paper can be a useful component for constructing a coalescent likelihood in order to infer demographic history with a temporally varying population size. As an illustration, we elaborate on a three-parameter model for a single population undergoing two stages of growth: a period with constant population size followed by an exponential growth. The three parameters of interest are the onset time, the rate of exponential growth, and the ancient population size. More complicated demographic models for multiple populations can also be built to derive the joint allele frequency spectrum as similarly done in Chen (2012), and the resulting JAFS can then be used to infer the demographic history of multiple populations on a fine scale (Gutenkunst et al., 2009; Lukic et al., 2011; Chen, 2012).

Inferring the number of ancient lineages of a contemporary sample or population is another topic of great interest in population genetics and ecological studies (Hey, 2005; Anderson & Slatkin, 2007; Blum & Rosenberg, 2007; Leblois & Slatkin, 2007). Estimation of the founding lineage number at a specific ancient time, especially during a bottleneck or founding history, helps to elucidate ancient admixture, founder effect, their effects on genetic diversity of modern population, and the presence of Mendelian diseases in isolated populations (Risch et al., 2003), as well as helps to understand species invasion in ecological studies (Dlugosch & Parker, 2007). The existing methods are developed under the coalescent framework and require computationally intensive approaches, such as importance sampling (Anderson & Slatkin, 2007), coalescent simulation (Leblois & Slatkin, 2007) and rejection sampling (Blum & Rosenberg, 2007). Compared to the existing methods, the new method for inference of lineage number presented in this paper benefits from the analytical form of the intercoalescence time distribution, and thus gains computational efficiency.

We further applied the methods to two real data sets: (1) the resequencing data of 20 European haplotypes from the NIEHS Environmental Genome Project (EGP), to infer the rate and onset time of population growth in West Europeans; (2) 31 Tibetan mitochondrial genomes, to infer the number of ancient lineages in the late Paleolithic age. We expect the methods developed in this paper to be useful tools for population genetic inference with the coming influx of large-scale genomic data from human populations and other species.

Methods

Settings

Assume that we are investigating a sample of n haplotypes collected from a single contemporary population. Let the current generation be generation 0, and T be the time in the past from the current generation (if not specified otherwise, all time in this paper is in units of generations). The population size math formula does not have to be constant over time. At time math formula in the past, a demographic event, for example, the end of a population bottleneck or the onset of the population growth, occurred. The gene genealogy of the haplotypes eventually coalesce into the most recent common ancestor (MRCA) if τ is sufficiently large. However, when τ is not large enough, the genealogy may still contain m lineages at time τ. Such a TG is denoted with math formula. An example of a TG is shown in Figure 1, in which math formula lineages are present in the current generation and coalesced into math formula ancient lineages back to time τ. For a TG with math formula fixed, the probability of observing m lineages at time τ is given by (Tavare, 1984; Takahata & Nei, 1985):

display math(1)
Figure 1.

An example of an incomplete genealogy and the intercoalescence times of a sample of n haplotypes (lineages). The population remained a constant size math formula before time τ, and then started the phase of exponential growth with rate of r until it reached the current population size N0. There are math formula lineages in the current generation and math formula ancient lineages at time τ. The coalescence time when math formula lineages reduce to i lineages is represented by math formula, and the intercoalescence time, that is, the time length when only i lineages are presented in the genealogy, is denoted with math formula (math formula).

where

display math(2)

math formula and math formula are the rising and falling factorial functions (i.e., Pochhammer functions).
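The lineage-count probability can be illustrated numerically. The following is a minimal sketch, assuming the standard constant-size form attributed to Tavare (1984), with time t measured in coalescent units (any pair of lineages coalesces at rate 1). The function names `rising`, `falling`, and `prob_lineages` are chosen here for illustration and are not taken from the paper.

```python
from fractions import Fraction
from math import exp, factorial


def rising(a, k):
    """Rising factorial a(a+1)...(a+k-1); equals 1 when k == 0."""
    out = 1
    for i in range(k):
        out *= a + i
    return out


def falling(a, k):
    """Falling factorial a(a-1)...(a-k+1); equals 1 when k == 0."""
    out = 1
    for i in range(k):
        out *= a - i
    return out


def prob_lineages(n, m, t):
    """P(m ancestral lineages remain at scaled time t | n sampled lineages).

    Assumed form: sum over k = m..n of
    exp(-k(k-1)t/2) (2k-1) (-1)^(k-m) m_(k-1) n_[k] / (m! (k-m)! n_(k)),
    where a_(k) is the rising and a_[k] the falling factorial.
    """
    if not 1 <= m <= n:
        return 0.0
    total = 0.0
    for k in range(m, n + 1):
        # Exact rational coefficient avoids overflow in the factorial ratios.
        coef = Fraction(
            (2 * k - 1) * (-1) ** (k - m) * rising(m, k - 1) * falling(n, k),
            factorial(m) * factorial(k - m) * rising(n, k),
        )
        total += float(coef) * exp(-k * (k - 1) * t / 2.0)
    return total


if __name__ == "__main__":
    n, t = 20, 0.17
    for m in range(1, n + 1):
        print(m, round(prob_lineages(n, m, t), 4))
```

The sum alternates in sign, so plain floating-point evaluation may lose accuracy for samples much larger than those used in this paper.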

As shown in Figure 1, the coalescence time for the event when math formula lineages coalesce into i lineages, is denoted by math formula. The intercoalescence time of a truncated genealogy, math formula, is denoted by math formula, and by definition math formula. For a complete genealogy in a population of constant size, the math formula intercoalescence time of a coalescence process, math formula, follows an exponential distribution with the rate of math formula, and math formula’s are mutually independent, which are classical conclusions in the literature of coalescent theory (Hudson, 1990; Fu, 1995; Nordborg, 2001). For a population with varying size, the distribution of intercoalescence times of a complete genealogy has also been well studied (Griffiths & Tavare, 1998; Polanski et al., 2003), and it has been shown that intercoalescence times are no longer mutually independent. Even more complicated is the distribution of the intercoalescence times of a TG. In the following sections, the general equation for intercoalescence time distribution conditional on a TG is first provided in a varying population size model, and then the specific probability density function (pdf) and expectation of intercoalescence times are shown for a population under exponential growth.

Distribution and Expectation of Intercoalescence Times in Deterministically Varying Population Size Models

In a temporally size-varying population, the probability that n lineages at time t0 coalesced into m ancient lineages back to time τ is given by Griffiths & Tavare (1998):

display math(3)

where math formula is defined in Equation (2), and math formula is the ratio of the present population size to the population size at t units of time before present, and the time τ is when the exponential growth started.
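The role of the varying population size in Equation (3) can be pictured as a change of time scale. The sketch below assumes the usual cumulative coalescent intensity, the integral of 1/N(u) over u from 0 to τ, where N(u) counts gene copies (twice the number of diploid individuals). It compares the closed form for exponential growth with a numerical quadrature that works for any deterministic size function. All function names are illustrative.

```python
from math import exp, log


def scaled_time_closed(tau, n0, r):
    """Integral of 1/N(u) on [0, tau] for N(u) = n0 * exp(-r * u)."""
    if r == 0.0:
        return tau / n0
    return (exp(r * tau) - 1.0) / (r * n0)


def scaled_time_numeric(tau, size_fn, steps=10_000):
    """Composite trapezoid rule for the integral of 1/size_fn(u) on [0, tau]."""
    h = tau / steps
    total = 0.5 * (1.0 / size_fn(0.0) + 1.0 / size_fn(tau))
    for i in range(1, steps):
        total += 1.0 / size_fn(i * h)
    return total * h


def generations_from_scaled(s, n0, r):
    """Inverse of scaled_time_closed: generations corresponding to scaled time s."""
    if r == 0.0:
        return s * n0
    return log(1.0 + r * n0 * s) / r


if __name__ == "__main__":
    n0, r, tau = 10_000.0, 0.001, 1000.0
    print(scaled_time_closed(tau, n0, r))
    print(scaled_time_numeric(tau, lambda u: n0 * exp(-r * u)))
```

On the rescaled time axis the genealogy behaves like a constant-size coalescent, which is why constant-size results can be reused with τ replaced by the rescaled time.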

Denote by math formula the pdf of the intercoalescence time math formula conditional on m lineages remaining at the genealogy truncation time τ. Here math formula is a death process describing the number of ancestral lineages at time τ. The density function can then be derived as follows:

display math(4)

The detailed derivation can be found in Appendix A1. Given that we have the pdf of intercoalescence times for populations with time-varying sizes, the expectation of times can simply be obtained by integrating over the pdf:

display math(5)

The above equations are generalized for the class of models in which population size evolves deterministically. A specific model of exponential population growth is elaborated in the next section.
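The step from a conditional pdf to its expectation in Equation (5) can be checked in the simplest truncated case. This is a sketch under stated assumptions rather than the paper's general formula: for n = 2 lineages in a constant-size population, conditional on coalescence before τ (m = 1), the coalescence time is an exponential variable with rate 1 (in coalescent units) truncated to the interval (0, τ). The names below are illustrative.

```python
from math import exp


def truncated_exp_pdf(w, rate, tau):
    """Density of an exponential(rate) variable conditioned to lie in [0, tau]."""
    if w < 0.0 or w > tau:
        return 0.0
    return rate * exp(-rate * w) / (1.0 - exp(-rate * tau))


def expectation_simpson(pdf, upper, steps=2000):
    """Integral of w * pdf(w) over [0, upper] by composite Simpson's rule."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = upper / steps
    total = upper * pdf(upper)  # the w = 0 endpoint contributes zero
    for i in range(1, steps):
        w = i * h
        total += (4 if i % 2 else 2) * w * pdf(w)
    return total * h / 3.0


def truncated_exp_mean(rate, tau):
    """Closed-form mean of the truncated exponential, used as a cross-check."""
    return 1.0 / rate - tau / (exp(rate * tau) - 1.0)


if __name__ == "__main__":
    rate, tau = 1.0, 0.5
    numeric = expectation_simpson(lambda w: truncated_exp_pdf(w, rate, tau), tau)
    print(numeric, truncated_exp_mean(rate, tau))
```

The same quadrature routine could be applied to any conditional pdf supplied as a callable, at the cost of one numerical integral per intercoalescence time.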

Exponential Growth

A special case of the models describing deterministically time-varying population size is the exponential growth model. The historical size of the population under exponential growth is described by function:

display math(6)

where N is the present population size, math formula is the population size at time t in the past, and r is the population growth rate (per generation). Exponential growth is a close approximation to the real population history of many human populations, and is extensively used to model population size change in the literature (Slatkin & Hudson, 1991; Marjoram & Donnelly, 1994; Polanski et al., 2003; Adams & Hudson, 2004; Voight & Pritchard, 2005; Chen et al., 2007; Gutenkunst et al., 2009). For an exponentially growing population, Equation (3) becomes:

display math(7)

where math formula are defined as in the section entitled “Settings.” Substituting Equations (6) and (7) into Equation (4), and after some algebra (see Appendix A2 for the details), the conditional pdf of intercoalescence time math formula is then given by:

display math(8)

Here math formula is defined as math formula. The pdf for the coalescence time (math formula) of incomplete genealogies can also be derived in a similar fashion (see Appendix A2). The pdf of math formula is obtained as follows:

display math(9)

A more useful result for ancestral inference is the expected intercoalescence time math formula (also called the time length of stage j as in Fu, 1995). The expectation of math formula is given by

display math(10)

where math formula denotes the exponential integral (Gradshteyn et al., 2007), which is defined as:

display math(11)

The exponential integral can be evaluated by the power series or an asymptotic expansion through publicly available subroutines (Press et al., 1992). The detailed derivation can be found in Appendix A3.
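The power-series evaluation mentioned above can be sketched as follows. The sketch assumes the convention E1(x) = ∫ from x to ∞ of exp(−t)/t dt for x > 0, related to the other common convention by Ei(−x) = −E1(x); which convention Equation (11) uses should be checked against the equation itself. The function name `expint_e1_series` is illustrative and is not the routine of Press et al. (1992).

```python
from math import log

EULER_GAMMA = 0.5772156649015329


def expint_e1_series(x, tol=1e-15, max_terms=200):
    """E1(x) for x > 0 via E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!).

    The series converges for every x > 0 but suffers cancellation when x is
    large, where an asymptotic or continued-fraction form is preferable.
    """
    if x <= 0.0:
        raise ValueError("x must be positive")
    total = -EULER_GAMMA - log(x)
    term = 1.0  # holds (-x)^k / k! after k updates
    for k in range(1, max_terms + 1):
        term *= -x / k
        contribution = -term / k
        total += contribution
        if abs(contribution) < tol * abs(total):
            break
    return total


if __name__ == "__main__":
    for x in (0.1, 0.5, 1.0, 2.0):
        print(x, expint_e1_series(x))
```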

The pdf's of intercoalescence and coalescence times for two specific cases are shown in Figures 2(A–D). In the first case, the population has a contemporary size of 10,000, and the growth phase starts 1000 generations ago with a rate of 0.001. Subplots (A) and (C) show the pdf's of intercoalescence and coalescence times for the truncated genealogy math formula in the given setting. In the second case, the modern population size is set to 1 × 10^6, and the growth phase also starts 1000 generations ago, with a rate of 0.01. Subplots (B) and (D) show the pdf's of intercoalescence and coalescence times for the truncated genealogy math formula in this setting. The pdf of math formula shifts toward larger values as the growth rate increases. In other words, the time to the first coalescence event is elongated disproportionately to the other intercoalescence times and the shape of the genealogy tends to be star-like. This is consistent with former studies (Slatkin & Hudson, 1991). To further illustrate the validity and reliability of the equations, the coalescent-based simulation program ms was used to generate the intercoalescence times (Hudson, 2002). To simulate the TGs, only samples with a fixed number of lineages at a prechosen time were kept, and the intercoalescence times for such genealogies were recorded. As an example, the histograms of intercoalescence times from 2500 simulations for the above two exponential population growth models are presented together with the pdf curves for math formula and math formula from Equations (8) and (9) in Figures 2(E and F).
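The conditioning step described above (keep only genealogies with a fixed number of lineages at a prechosen time) can be sketched without ms. The code below is a stand-in, not the ms program: it assumes N(t) = N0·exp(−rt) with N0 counting gene copies and r > 0, draws coalescence times on the rescaled time axis, maps them back to generations, and rejects genealogies that do not have exactly m lineages at τ. All names and the seed are illustrative.

```python
import random
from math import log


def simulate_coalescence_times(n, n0, r, rng):
    """Coalescence times in generations for n lineages under exponential growth."""
    times = []
    s = 0.0  # cumulative time on the rescaled (constant-size) axis
    for k in range(n, 1, -1):
        s += rng.expovariate(k * (k - 1) / 2.0)
        # Invert s = (exp(r t) - 1) / (r n0) to recover generations.
        times.append(log(1.0 + r * n0 * s) / r)
    return times


def simulate_truncated(n, m, tau, n0, r, trials, seed=1):
    """Keep only genealogies with exactly m lineages remaining at time tau."""
    rng = random.Random(seed)
    kept = []
    for _ in range(trials):
        times = simulate_coalescence_times(n, n0, r, rng)
        before = [t for t in times if t <= tau]
        if n - len(before) == m:
            kept.append(before)
    return kept


def intercoalescence_times(before, tau):
    """Durations spent with n, n-1, ..., m lineages; the last one is cut at tau."""
    edges = [0.0] + before + [tau]
    return [b - a for a, b in zip(edges, edges[1:])]


if __name__ == "__main__":
    kept = simulate_truncated(20, 7, 1000.0, 10_000.0, 0.001, trials=2000)
    first = [intercoalescence_times(g, 1000.0)[0] for g in kept]
    print(len(kept), sum(first) / len(first))
```

Rejection is simple but wasteful when the target m is improbable; a histogram of the retained times is what would be compared against the analytical pdf.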

Figure 2.

(A–B) Probability density functions for the intercoalescence times, math formula, of an exponential growth population. In the figure, the genealogy of a sample of 20 haplotypes is investigated. The parameters are chosen to be math formula for (A), and math formula for (B). Colors and symbols for different intercoalescence times are shown on the key. (C–D) Probability density function for the coalescence times, math formula, of an exponential growth population. In the figure, the genealogy of a sample of 20 haplotypes is investigated. The parameters are chosen to be math formula for (C), and math formula for (D). Colors and symbols for different coalescence times are shown on the key. (E–F) Probability density function for the intercoalescence times, compared to the empirical distribution, represented by the histogram, obtained from 2500 simulations. (E) The pdf of the intercoalescence time math formula with identical settings to (C). (F) The pdf of math formula, with identical settings to (D).

The theoretical result matches simulation well as shown in the figure.

Applications to Ancestral Inference

The distribution and expectation of intercoalescence times of incomplete genealogies are useful in population genetic inference. Traditional coalescent-based ancestral inference methods view the gene genealogy as a nuisance parameter (Felsenstein, 1988), and require computationally intensive approaches, such as importance sampling (Griffiths and Tavare, 1994b) and Markov Chain Monte Carlo (Kuhner et al., 1995; Nielsen & Wakeley, 2001) to integrate over the space of gene genealogies. As an exception, the AFS and AFS-based methods only depend on intercoalescence times, instead of the whole gene genealogy (Griffiths & Tavare, 1998; Chen, 2012). Consequently, if the expectation of intercoalescence times can be derived, we can analytically get the coalescent likelihood function without relying on intensive computation or simulation. In this section, we first derive the AFS of populations under exponential growth, using the expectation of time length conditional on the genealogy truncated at the onset of population growth, and then present two AFS-based methods for population genetic inference: we thus infer population growth rate and onset time, and the number of ancient lineages at a specific ancient time for a sample from the contemporary population.

The Allele Frequency Spectrum of Populations Under Exponential Growth

The AFS is defined as the sampling distribution of allele frequencies of SNPs (Sawyer & Hartl, 1992; Ewens, 2004). For a random sample of n haplotypes (lineages) collected from the contemporary population, the AFS is described using the set of summary statistics: math formula, with math formula corresponding to the expected counts of segregating sites with i copies of derived alleles in the sample of n haplotypes. To ease the notation, the sample size n is sometimes suppressed by writing math formula instead of math formula when the context is clear. In this section, we consider the nonequilibrium AFS of a population that experienced exponential growth. More specifically, the population had a constant size of math formula in the past, and at time math formula, it started growing exponentially with the rate of r, and reached the current population size of N0.

Similarly as in Chen (2012) (see also Wakeley & Hey, 1997; Williamson et al., 2005; Evans et al., 2007), the segregating sites can be divided into two classes: those more ancient than τ, and those appearing after the onset of exponential growth, then:

display math(12)

where math formula denotes expectation, math formula and math formula denote segregating sites that arose before the onset of population growth at time τ (ancient segregating sites), and those arising during the population growth phase (new segregating sites) respectively. Assume that there are m ancient lineages at time τ. Since the population has a constant population size before the exponential growth, the AFS at time τ follows the stationary distribution, and

display math(13)

where μ is the mutation rate per generation per locus, and math formula is the ancient population size at time τ. The AFS of the current generation contributed by those ancient segregating sites is then given by (Fu, 1995; Griffiths & Tavare, 1998):

display math(14)

where math formula is the Polya–Eggenberger distribution (a result of the Polya urn model, see Johnson & Kotz, 1977; Fu, 1995; Slatkin, 1996; Wakeley & Hey, 1997):

display math(15)

where math formula denotes the rising factorial function as in the previous section. The Polya–Eggenberger distribution provides the probability that the j copies of the derived allele among m ancient lineages at time τ grow to i copies among n lineages at present due to the bifurcation during the coalescent process.
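The projection of ancient segregating sites onto the current sample can be sketched as follows. The code assumes the usual Polya-urn form of the Polya–Eggenberger probability written with rising factorials, and a fixed number m of ancient lineages; the parameter `theta` in the usage lines is a hypothetical scaled mutation parameter introduced only to build a stationary ancient spectrum proportional to 1/j.

```python
from math import comb


def rising(a, k):
    """Rising factorial a(a+1)...(a+k-1); equals 1 when k == 0."""
    out = 1
    for i in range(k):
        out *= a + i
    return out


def polya_eggenberger(i, j, m, n):
    """P(i derived copies among n lineages | j derived among m ancient lineages)."""
    added = n - m   # lineages added between time tau and the present
    k = i - j       # how many of the added lineages carry the derived allele
    if not 1 <= j <= m - 1 or k < 0 or k > added:
        return 0.0
    num = comb(added, k) * rising(j, k) * rising(m - j, added - k)
    return num / rising(m, added)


def project_ancient_afs(ancient, m, n):
    """Map an AFS on m ancient lineages (ancient[j-1], j=1..m-1) to n lineages."""
    current = [0.0] * (n - 1)
    for j in range(1, m):
        for i in range(j, n - (m - j) + 1):
            current[i - 1] += ancient[j - 1] * polya_eggenberger(i, j, m, n)
    return current


if __name__ == "__main__":
    theta, m, n = 1.0, 5, 20  # theta is a hypothetical value for illustration
    ancient = [theta / j for j in range(1, m)]
    print([round(x, 4) for x in project_ancient_afs(ancient, m, n)])
```

Because each ancient site is only redistributed across frequency classes, the projected spectrum has the same total as the ancient one.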

Assuming an infinitely-many-sites model, we obtain the AFS generated by new mutations from the expected intercoalescence times of incomplete genealogies (see the details of derivation in appendix A2 of Chen, 2012):

display math(16)

Summing up the expectations of the two classes of segregating sites, the AFS for a population with exponential growth is obtained. One can also use Equations (14) and (16) to estimate the relative proportion of mutations belonging to the two classes for a given demographic history. Figure 3 shows the AFS's estimated using Equations (12), (14), and (15) in populations with different growth rates and onset times of the growth phase. Figures 3(A–C) show the overall AFS and the AFS contributed by ancient segregating sites and new segregating sites for three different demographic scenarios: (1) a population with a current size of 1 × 10^7, which started growing 1000 generations ago at a rate of 0.01; (2) a population with a current size of 10,000, which started growing 1000 generations ago at a rate of 0.001; (3) a population with a current size of 1 × 10^6, which started growing 1000 generations ago at a rate of 0.01. The enrichment of SNPs with rare alleles is more pronounced in populations with faster growth, and is mostly contributed by new mutations during the growth phase. For the three demographic scenarios, the proportions of new segregating sites are 78.8%, 30%, and 99%, respectively. If the demographic model proposed by Gutenkunst et al. (2009) is used to mimic the history of a European population, that is, the ancient population size is 2100, and exponential growth starts 848 generations ago with an initial size of 1000 and a growth rate of 0.0040 (see Figure 3D), 40.4% of the segregating sites in the European population arose since the onset of population growth.
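As a quick arithmetic check on the scenarios above, the population size at the onset of growth follows from the exponential model of Equation (6), taken here in the form N(t) = N0·exp(−rt) as described in the text. The helper names are illustrative.

```python
from math import exp


def size_at(t, n0, r):
    """Population size t generations ago under N(t) = n0 * exp(-r * t)."""
    return n0 * exp(-r * t)


def current_size(initial, r, duration):
    """Present size after growing at rate r for `duration` generations."""
    return initial * exp(r * duration)


if __name__ == "__main__":
    # (current size, growth rate, onset time) for scenarios (1)-(3)
    scenarios = [(1e7, 0.01, 1000), (1e4, 0.001, 1000), (1e6, 0.01, 1000)]
    for n0, r, tau in scenarios:
        print(n0, r, tau, round(size_at(tau, n0, r), 1))
    # European model: initial size 1000, rate 0.0040, 848 generations of growth
    print(round(current_size(1000, 0.0040, 848)))
```

Under this form, the sizes at onset are roughly 454, 3679, and 45 for scenarios (1) to (3), and the European model implies a current size of roughly 29,700.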

Figure 3.

The allele frequency spectrum from a population undergoing exponential population growth. The AFS's of different population sizes and growth rates have been displayed in the subplots. (A) Onset time math formula generations; the current population size math formula and population growth rate: math formula. (B) Onset time math formula generations; the current population size math formula and population growth rate: math formula. (C) Onset time math formula generations; the current population size math formula and population growth rate: math formula. (D) The European demographic model proposed by Gutenkunst et al. (2009). From lighter to darker, the colors correspond to the AFS of math formula, respectively.

Inferring the Onset Time and Rate of Population Growth

A two-stage population growth model, in which an ancient population of constant size is followed by an exponential growth phase, is commonly used to approximate the demographic histories of many modern human populations (Adams & Hudson, 2004; Voight & Pritchard, 2005; Chen et al., 2007). A method for inferring the population growth rate and the onset of the growth phase can be developed based on the AFS for a population under exponential growth. Former methods for this purpose used coalescent simulations to generate a large number of samples, and optimized over the parameter space by matching summary statistics or the allele frequency spectrum of simulated and real data (Wall & Przeworski, 2000; Adams & Hudson, 2004; Voight & Pritchard, 2005; Wall et al., 2009). Here we use the analytical form of the AFS derived in the previous section and adopt the Poisson random field model to set up the likelihood function (Sawyer & Hartl, 1992; Bustamante et al., 2001). According to the Poisson random field model, the number of segregating sites in each entry of the AFS, math formula, independently follows a Poisson distribution:

display math(17)

where math formula, is a function of the parameter set Θ. The full likelihood of the data can be constructed by taking the product over all entries:

display math(18)
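The product of independent Poisson terms is usually evaluated on the log scale. A minimal sketch, assuming the observed AFS is a list of counts and the expected AFS is a list of positive Poisson means of the same length (how those means depend on Θ is left to the caller):

```python
from math import lgamma, log


def poisson_loglik(observed, expected):
    """Sum over AFS entries of log Poisson(observed_i | mean = expected_i)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected AFS must have equal length")
    total = 0.0
    for x, lam in zip(observed, expected):
        if lam <= 0.0:
            raise ValueError("expected counts must be positive")
        # log of lam^x * exp(-lam) / x!, with log(x!) computed as lgamma(x + 1)
        total += x * log(lam) - lam - lgamma(x + 1)
    return total


if __name__ == "__main__":
    observed = [60, 30, 20, 15]
    print(poisson_loglik(observed, [60.0, 30.0, 20.0, 15.0]))
    print(poisson_loglik(observed, [50.0, 35.0, 20.0, 10.0]))
```

A numerical optimizer such as BFGS would then be applied to the negative of this function, viewed as a function of the model parameters.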

Coalescent simulations were used to test the performance of the approach. For each combination of values of population growth rate and onset time, 1000 samples were generated. Each sample contains 20 haplotypes from a population with a contemporary size of 10,000, and each haplotype covers a 10-Mb region obtained by merging SNPs from 1000 small 10-kb regions simulated with a recombination rate of 1e-8 per generation per nucleotide. The two parameters were jointly estimated by maximizing the likelihood with the quasi-Newton Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS, Press et al., 1992). Boxplots of the inferred onset time and growth rate against the corresponding true values are presented in Figure 4. The method was then applied to a 4.04-Mb resequencing dataset including 20 haplotypes of European origin from the NIEHS Environmental Genome Project (Livingston et al., 2004). Of note, because the region covered by the real data is small compared to the simulated data, the two parameters, T and r, are confounded with each other. In the likelihood function, T and r appear together as a product term more often than as individual terms. As a result, the likelihood contains little information about the individual parameters. Only when the number of SNPs or the sample size is sufficiently large can the two parameters be disentangled and estimated correctly. By simulation, we found that for the AFS from a single population, a total region size larger than 10 Mb is sufficient to identify the two parameters well. We therefore fixed the current population size and the onset time of the growth to the values inferred in Gutenkunst et al. (2009) (math formula). The estimated growth rate is 0.0042 (95% confidence interval: 0.0012–0.0072) by the proposed approach, which is consistent with the estimate of Gutenkunst et al. (2009) (0.0040).

Figure 4.

The box plot of estimated population growth rate and the onset time of the growth phase. (A) Population growth rate. The X-axis is the true value in simulation and the Y-axis is the inferred values from 200 simulations. The numbers in the upper panel of the X-axis labels correspond to different onset times used in the simulation. (B) The onset time of population growth. The X-axis is the true value in simulation and the Y-axis is box plot of inferred values from 200 simulations. The numbers in the upper panel of the X-axis labels correspond to different growth rates used in the simulation.

Inferring the Number of Founding Lineages

Inferring the number of founding lineages for a population is of great interest in the study of population genetics and ecology (Hey, 2005; Anderson & Slatkin, 2007; Leblois & Slatkin, 2007). Blum & Rosenberg (2007) developed a rejection sampling algorithm for inferring the number of ancient or founding lineages at some time τ in the past for n lineages at the current generation. They first derived the conditional distribution of intercoalescence times for incomplete genealogies math formula, and this equation was then used to simulate a set of intercoalescence times in sequential order. Given this set of intercoalescence times, a coalescent-based simulation approach can be modified to generate samples that have a given number of ancient lineages, and the AFS's are obtained from the simulated data. The probability of the observed AFS of the sample was obtained by rejection sampling, which rejects simulated AFS's that match the observed AFS poorly. The method presented here also evaluates the likelihood of the observed AFS for a fixed number of ancient lineages; a grid search is used to infer the number of ancient lineages. It differs from the method of Blum & Rosenberg (2007) in that the likelihood of the observed AFS is calculated explicitly by exploiting the expected time lengths of the truncated genealogy, and is constructed using the Poisson random field model as in the previous section. The estimation procedure in the new method is computationally much more efficient than a rejection sampling algorithm.
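The grid search over the number of ancient lineages can be sketched as a scan of the Poisson log-likelihood over candidate values of m. In this sketch the expected AFS for a given m is supplied by the caller as a function; `toy_model` below is a deliberately artificial placeholder used only to exercise the code, and is not the AFS derived in this paper.

```python
from math import lgamma, log


def poisson_loglik(observed, expected):
    """Log-likelihood of observed AFS counts under independent Poisson means."""
    return sum(x * log(lam) - lam - lgamma(x + 1)
               for x, lam in zip(observed, expected))


def grid_search_lineages(observed, expected_afs_for_m, candidates):
    """Return the candidate m with the highest log-likelihood and the full curve."""
    curve = {}
    best_m, best_ll = None, float("-inf")
    for m in candidates:
        ll = poisson_loglik(observed, expected_afs_for_m(m))
        curve[m] = ll
        if ll > best_ll:
            best_m, best_ll = m, ll
    return best_m, curve


def toy_model(m):
    """Hypothetical placeholder: expected counts 10*m/i for i = 1..4."""
    return [10.0 * m / i for i in range(1, 5)]


if __name__ == "__main__":
    observed = [60, 30, 20, 15]
    best, curve = grid_search_lineages(observed, toy_model, range(2, 12))
    print(best)
    for m in sorted(curve):
        print(m, round(curve[m], 3))
```

The returned curve is the kind of profile that can be plotted as a likelihood curve over m.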

We also use coalescent simulations to test the performance of the method. Two demographic scenarios are simulated: one with the growth rate of 0.001, and the other 0.003, both with the current population size of 10,000. The truncated time is 1000 generations ago. Two hundred samples are generated for each scenario. In each simulation, 40 lineages are generated, each haplotype covers SNPs merged from 1000 1-Kb regions, and each 1-Kb region was simulated allowing both recombination and mutation. The results are shown as error-bar plots in Figure 5. The X-axes present the true values of the ancient lineage numbers, and Y-axes show the mean and standard deviation of the inferred lineage numbers. The results show that the method can infer the true ancient lineage number efficiently and accurately.

Figure 5.

The accuracy and precision of inferred ancient lineages. The sample from each simulation contains 40 lineages. The current population size is 10,000. The truncated time is 1000 generations ago. In each simulation, 200 samples covering a 1-Mb region were generated by assuming a mutation rate math formula per site per generation. (A) Population growth rate math formula. (B) Population growth rate math formula. In each error-bar plot, square indicates mean and error bar indicates 1 unit of standard deviation calculated from 200 samples.

We applied the method to 31 complete mitochondrial DNA genomes from Tibetan populations (Qin et al., 2010). The complete mitochondrial DNAs were aligned to the Reconstructed Sapiens Reference mitochondrial genome sequence (Behar et al., 2012), and polymorphic loci and their ancestral alleles were identified. SNPs within the highly variable region, that is, the control region were excluded from the analysis since we assume an infinitely-many-sites model in deriving the AFS, and the high mutation rate in the control region may violate this assumption. We used a mutation rate of math formula per nucleotide per year for these coding regions (Mishmar et al., 2003) in analysis. The ancient demographic history of the Tibetan population is still unclear, although two possible histories were explored by the proposed method: one originated from Yi et al. (2010), in which the Tibetan population has an initial size of 7360 at 42,955 years ago, expanded to 22,642 at 1973 years ago, and from then on exponentially declined to a current size of 1270; and the other is just a simple population history with the constant effective population size of 6000. For both histories, the population size equals math formula due to maternal inheritance of the mitochondrial genome. The results from the two different assumptions are comparable, and the likelihood curve for the number of ancient lineages at 1000 generations ago under the first history assumption is presented in Figure 6. As one can see, the number of maternal founding lineages of the Tibetan populations was high 25,000 years ago. This conclusion is consistent with other studies which suggest that the settlement of the Tibetan population is ancient and possibly has multiple sources of origin (Shi et al., 2008; Zhao et al., 2009; Qin et al., 2010; Peng et al., 2011; Xu et al., 2011). 
Given that the inferred number of ancient lineages for autosomal regions is region-specific, it would be of great interest to apply this method to genomic data and to explore how the number of ancient lineages varies along the genome.

Figure 6.

Likelihood curve for the number of ancient lineages of the 31 Tibetan mitochondrial genomes.

Discussion

In this paper, the distribution and expectation of intercoalescence times of the incomplete genealogy were investigated for a population with time-varying size. Specifically, an explicit expression of the pdf was given for populations undergoing exponential growth. The intercoalescence times of the incomplete genealogy can be used widely in population genetic inference. After deriving the allele frequency spectrum of a population under exponential growth, we gave two examples of its application: a method for inferring the onset time and rate of population growth based on the AFS, and a method for inferring the number of ancient lineages. The new approach for inferring ancient lineages provides analytical equations for the AFS through the distribution of intercoalescence times. It differs from an earlier method based on a rejection sampling algorithm (Blum & Rosenberg, 2007) and is therefore computationally more efficient. We applied the new AFS-based methods to infer the rate of population growth from the 20 European haplotypes in the NIEHS Environmental Genome Project (Livingston et al., 2004), and to infer the number of ancient maternal lineages of Tibetan populations in the late Paleolithic from the 31 mitochondrial DNA samples of Qin et al. (2010). The conclusions drawn from the proposed methods are consistent with earlier studies.

A recent study that infers the population growth rate from ancient lineage numbers is Maruvka et al. (2011), who suggested that for large-sample genealogies the ancient lineage number [which they call the “lineage number as a function of time” (LNFT)] is close to deterministic, with little randomness. Intercoalescence times from such a deterministic process can be greatly simplified compared to Equations (8) and (10). But even for large samples, the randomness of the LNFT still exists. For example, for a sample of math formula haplotypes collected from a population with an initial size of 1,000,000 and a growth rate of 0.003, the mean of the LNFT is around 280 at 1000 generations ago, while the variance is around 80. Ignoring this randomness can bias the derived AFS. In the derivation of the expected time length in this paper, the randomness of the intercoalescence time is taken into account by summing over its distribution, which increases the accuracy of the estimate.

The AFS of a population with time-varying size has been derived in previous studies using different models, including the n-epoch model (Marth et al., 2004) and the exponential growth model (Wooding & Rogers, 2002; Polanski & Kimmel, 2003). The exponential growth model used here differs from that of Polanski & Kimmel (2003) in that it includes a constant-population-size phase before the exponential growth phase. In addition, the model is flexible and can be extended to include multiple exponential growth phases and/or constant-size phases. Other events, such as selective sweeps and migration, can also be incorporated into the AFS for populations under exponential growth.

The proposed coalescent-based method can be integrated into the JAFS, as done by Chen (2012), and provides an alternative for population genetic inference from the JAFS in addition to the existing methods based on the diffusion approximation (Gutenkunst et al., 2009; Lukic et al., 2011; Zivkovic & Stephan, 2011; Song & Steinrücken, 2012). The coalescent-based JAFS has several advantages over the diffusion approximation; for example, parameter inference from more than three populations is feasible because of the computational efficiency gained from the analytical form of the coalescent likelihood. However, the coalescent-based method has some limitations, including the numerical instability of Tavare's transition function for large sample sizes, caused by alternating sums of hypergeometric series (Polanski et al., 2003), and the difficulty of modeling continuous migration. To handle the numerical issue, we can use a high-precision arithmetic library to avoid the overflow problem caused by calculating the alternating sums (Marth et al., 2004), although this slows down the computation significantly. Another approach is to find a solution that does not rely on the alternating sum, which is beyond the scope of this paper and will be addressed in subsequent work.
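The high-precision workaround mentioned above can be illustrated with a minimal sketch. It assumes the standard constant-size form of Tavare's transition function, with time t measured in coalescent units (pairwise coalescence rate 1); for a time-varying population, t would be replaced by the rescaled time. The function names and the choice of Python's built-in `decimal` module with 200 significant digits are illustrative assumptions, not the implementation used in this paper.

```python
from decimal import Decimal, getcontext
from math import factorial


def rising(x, k):
    """Rising factorial x (x+1) ... (x+k-1) in exact integer arithmetic."""
    out = 1
    for i in range(k):
        out *= x + i
    return out


def falling(x, k):
    """Falling factorial x (x-1) ... (x-k+1) in exact integer arithmetic."""
    out = 1
    for i in range(k):
        out *= x - i
    return out


def lineage_prob(n, j, t, digits=200):
    """P(j ancestral lineages at time t | n lineages at time 0), 1 <= j <= n.

    Constant-size alternating sum; t is in coalescent units (assumption).
    The coefficients are built as exact integers, and only the final
    exponential-weighted terms are rounded to `digits` significant digits.
    """
    getcontext().prec = digits
    t = Decimal(str(t))
    total = Decimal(0)
    for k in range(j, n + 1):
        sign = -1 if (k - j) % 2 else 1
        num = sign * (2 * k - 1) * rising(j, k - 1) * falling(n, k)
        den = factorial(j) * factorial(k - j) * rising(n, k)
        decay = (-Decimal(k * (k - 1)) / 2 * t).exp()
        total += decay * Decimal(num) / Decimal(den)
    return total
```

For example, `lineage_prob(100, 60, 0.01)` evaluates one entry of the distribution for a sample of 100 lineages. The individual terms of the sum are many orders of magnitude larger than the result, which is why extended precision is needed and why the computation is slower than ordinary floating-point arithmetic.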

Acknowledgements

I am grateful to Drs. Kun Chen, Rasmus Nielsen, Josh Schraiber and the anonymous reviewers for their constructive comments on an early version of the manuscript, to Dr. Hong Shi for his insightful discussion on mitochondrial DNA data and Tibetan history, to Dr. Hui Li for sharing the Tibetan mitochondrial DNA data, to Dr. Ryan Gutenkunst for sharing his pre-processed version of the EGP data and to Dr. Deborah Nickerson for providing public access to the EGP data set.

Appendices

Proof of Expressions

A1. The Probability Density Functions of Intercoalescence Times

The proof of Equations (8)-(10) follows the same rationale and generalizes the results in Appendix A of Chen (2012). Two results are required in the proof:

(1) The transition probability of observing m1 lineages at time t1 jumping to m2 lineages at time t2 is:

display math(1.1)

where math formula is the current population size, and math formula is the ratio of the current population size over the size at time w.
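As a minimal sketch of how the size ratio enters the calculation, the code below computes the rescaled (coalescent) time obtained by integrating the ratio of the current size to the size at time w, for the demographic model used in this paper: exponential growth since an onset time, preceded by a constant-size phase. The names `nu`, `rescaled_time`, `n0`, `r` and `t_onset`, and the haploid scaling by `n0` rather than `2 * n0`, are assumptions made for illustration. The closed form is checked against a midpoint-rule integral.

```python
import math


def nu(w, r, t_onset):
    """Ratio N(0) / N(w), with w measured backward in time in generations.

    Assumed model: exponential growth at rate r since t_onset generations
    ago, and constant size before that.
    """
    return math.exp(r * min(w, t_onset))


def rescaled_time(t, n0, r, t_onset):
    """Closed-form integral of nu(w) / n0 over [0, t] (haploid scaling assumed)."""
    if r == 0:
        return t / n0
    t_growth = min(t, t_onset)
    # Exponential-growth phase: integral of exp(r w) / n0.
    grown = math.expm1(r * t_growth) / (r * n0)
    # Constant-size phase older than the onset of growth, if t reaches it.
    constant = max(t - t_onset, 0.0) * math.exp(r * t_onset) / n0
    return grown + constant


def rescaled_time_numeric(t, n0, r, t_onset, steps=100_000):
    """Midpoint-rule evaluation of the same integral, used as a check."""
    h = t / steps
    return sum(nu((i + 0.5) * h, r, t_onset) for i in range(steps)) * h / n0
```

With the illustrative values `n0 = 10000`, `r = 0.003` and `t_onset = 600`, `rescaled_time(1000, 10000, 0.003, 600)` and `rescaled_time_numeric(1000, 10000, 0.003, 600)` agree to many digits.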

(2) Let math formula be the time at which a coalescent event occurs, that is, math formula lineages coalesce into i lineages. The probability density function of math formula is (Chen, 2012):

display math(1.2)

Starting from these two results, we can derive the probability density function (pdf) of the intercoalescence time math formula for a truncated genealogy in the time-varying population as below. Denote by m the number of lineages right before the onset of the population growth. Three cases are considered: math formula (Equation 1.3), math formula (Equation 1.4), and math formula (Equation 1.5).

(1.1) math formula:

display math(1.3)

(1.2) math formula:

display math(1.4)

(1.3) math formula:

display math(1.5)

A2. The Pdf of Intercoalescence and Coalescence Times for Exponentially Growing Populations

The pdf of intercoalescence times of an incomplete genealogy for the exponentially growing population is derived as follows:

(2.1) math formula:

display math(2.1)

To avoid overly long equations, we use the following notation: math formula; similar definitions hold for math formula, math formula, math formula, and math formula.

(2.2) math formula:

display math(2.2)

Let math formula then math formula. Substituting and changing the variable yields

display math(2.3)

(2.3) math formula:

display math(2.4)

Similarly, we can derive the pdf for coalescence times of an incomplete genealogy:

(2.4) math formula:

display math(2.5)

where math formula is defined as above.

A3. The Expectation of Intercoalescence Times for Incomplete Genealogies

The expectation of intercoalescence times can be calculated using the pdf of intercoalescence times derived in A2. A simpler alternative is presented in this section. Conditional on m lineages existing at time τ, the probability that at time t, there are j lineages is given as:

display math(3.1)

The expectation is then obtained by integrating over t:

display math(3.2)

(3.1) math formula:

display math(3.3)

Let math formula. Substituting and changing the variable yields:

display math(3.4)

where math formula denotes the exponential integral:

display math(3.5)
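To illustrate why an exponential integral appears after the change of variable, the sketch below works through a simplified instance rather than Equation (3.4) itself: the expected time that j lineages remain uncoalesced, truncated at a horizon T, in a population growing exponentially at rate r. Under an assumed haploid scaling, the probability of no coalescence by time t is exp(-c (e^{rt} - 1)) with c = j(j-1)/(2 r N0), and substituting v = c e^{rt} turns its integral over [0, T] into (e^c / r) [E1(c) - E1(c e^{rT})], where E1(x) = -Ei(-x). All function names and parameter values are hypothetical, and the series used for E1 is suitable only for moderate arguments.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x, tol=1e-18, max_terms=500):
    """E1(x) = integral of exp(-v) / v from x to infinity, for moderate x > 0.

    Uses the series E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!).
    """
    total = -EULER_GAMMA - math.log(x)
    term = 1.0
    for k in range(1, max_terms + 1):
        term *= -x / k          # term now equals (-x)^k / k!
        total -= term / k
        if k > x and abs(term / k) < tol:
            break
    return total


def expected_time_closed(j, n0, r, horizon):
    """Closed form obtained with the substitution v = c * exp(r t)."""
    c = j * (j - 1) / 2 / (r * n0)
    upper = c * math.exp(r * horizon)
    return math.exp(c) / r * (exp_integral_e1(c) - exp_integral_e1(upper))


def expected_time_numeric(j, n0, r, horizon, steps=20_000):
    """Composite Simpson's rule applied to the original integrand (steps even)."""
    c = j * (j - 1) / 2 / (r * n0)

    def survival(t):
        # Probability that none of the j lineages has coalesced by time t.
        return math.exp(-c * math.expm1(r * t))

    h = horizon / steps
    s = survival(0.0) + survival(horizon)
    s += 4 * sum(survival((2 * i + 1) * h) for i in range(steps // 2))
    s += 2 * sum(survival(2 * i * h) for i in range(1, steps // 2))
    return s * h / 3
```

Calling `expected_time_closed(5, 10000, 0.003, 1000)` and `expected_time_numeric(5, 10000, 0.003, 1000)` gives matching values, which confirms the substitution for this simplified case.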

(3.2) math formula:

The expectation of math formula is given by:

display math(3.6)

Similarly, let math formula. Substituting and changing the variable yields:

display math(3.7)

(3.3) math formula:

display math(3.8)
