Errata: Corrigendum Volume 77, Issue 5, 464, Article first published online: 6 August 2013
Corresponding author: HUA CHEN, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts 02115. Tel: 617-935-2254; E-mail: email@example.com
Tracing back to a specific time T in the past, the genealogy of a sample of haplotypes may not have reached its common ancestor and may leave m lineages extant. For such an incomplete genealogy truncated at a specific time T in the past, the distribution and expectation of the intercoalescence times conditional on T are derived in an exact form in this paper for populations of deterministically time-varying sizes, specifically, for populations growing exponentially. The derived intercoalescence time distribution can be integrated into the coalescent-based joint allele frequency spectrum (JAFS) theory, and is useful for population genetic inference from large-scale genomic data without relying on computationally intensive approaches such as importance sampling and Markov Chain Monte Carlo (MCMC) methods. The inference of several important parameters relying on this derived conditional distribution is demonstrated: quantifying population growth rate and onset time, and estimating the number of ancestral lineages at a specific ancient time. Simulation studies confirm the validity of the derivation and the statistical efficiency of the methods using the derived intercoalescence time distribution. Two examples of real data are given to show the inference of the population growth rate of a European sample from the NIEHS Environmental Genome Project, and the number of ancient lineages of 31 mitochondrial genomes from Tibetan populations.
The coalescence process can be decomposed into two independent processes: the jump chain of the coalescent process, that is, the topology of the gene genealogy; and the sequential process of intercoalescence times (Kingman, 1982). The latter is well studied for the standard coalescent process in a constant-sized population with infinite-site mutations. It is known that the waiting times between coalescent events are independent and exponentially distributed, and that mutations occurring along the genealogy follow a Poisson process. For gene genealogies in a population with temporally varying size, the intercoalescence times are no longer mutually independent. Griffiths & Tavare (1994a) provided a general equation for time-varying populations and proposed an importance sampling algorithm for calculating the likelihood function for haplotypes sampled therein. Wooding & Rogers (2002) and Polanski et al. (2003) derived analytical equations for the marginal distributions of coalescence times in populations of time-varying sizes. Statistical methods based on their results were developed and applied to infer population bottlenecks and growth rate.
The gene genealogy discussed in this paper differs from that of previous studies. In the above studies, when tracing back in time, gene genealogies of current haplotypes eventually reach their most recent common ancestor (MRCA). However, if a population or demographic event occurred at some time T in the past, then at the event time T the lineages may or may not have reached their common ancestor. When the lineages have not reached their common ancestor by time T, the gene genealogy is “incomplete,” and is referred to as a “truncated genealogy” (TG) in this paper. For a TG, the distribution of its intercoalescence times conditional on the truncation time T is different from the unconditional distribution of complete genealogies in classic coalescent theory, but is seldom investigated in the literature. Blum & Rosenberg (2007) derived the conditional distribution of intercoalescence times for incomplete genealogies, and used a rejection sampling algorithm developed from the distribution to infer ancient lineage number. Chen (2012) derived the marginal distribution of intercoalescence times conditional on a TG for populations of constant size and populations whose size only changes instantaneously, the so-called n-epoch model. The derived marginal distribution was subsequently used for analytically deriving the allele frequency spectrum (AFS, also known as the site frequency spectrum in the literature) and the joint allele frequency spectrum (JAFS). However, the marginal distribution of intercoalescence times was not addressed for a population with deterministically and continuously time-varying size, which is more realistic than a population whose size jumps only at discrete time points. For example, the simple exponential growth model is a commonly used approximation for modern human demography (Slatkin & Hudson, 1991; Marjoram & Donnelly, 1994; Di Rienzo et al., 1998; Wall & Przeworski, 2000; Adams & Hudson, 2004; Voight & Pritchard, 2005).
To infer the demographic history without the error of model misspecification, it is necessary to incorporate populations of temporally varying sizes into the model.
The work presented in this paper follows the theoretical framework of Chen (2012), but is generalized for populations with deterministically time-varying sizes, specifically, for populations under exponential growth. It is an essential step to derive the AFS summarizing the genetic polymorphism pattern, which afterwards can be used to construct various methods for statistical inference in nonequilibrium populations. After we obtain the distribution and expectation of time lengths of truncated genealogies in temporally varying populations, we emphasize the potential applications of the theoretical results to population genetic inference in nonequilibrium populations. We present two applications that use the intercoalescence time distribution: inferring the rate and onset time of population exponential growth; and inferring the number of ancient lineages or founding lineages of samples from the current population at a specific ancient time.
Human populations experienced frequent bottlenecks and growth. In recent years, with the advent of genomic sequencing data from various populations, it has been a research hotspot to construct demographic history from genetic data (Novembre et al., 2008; Gutenkunst et al., 2009; HUGO Pan-Asian SNP Consortium, 2009; Reich et al., 2009; Tishkoff et al., 2009; Li & Durbin, 2011; Gronau et al., 2011; Reich et al., 2012). The result presented in this paper can be a useful component for constructing a coalescent likelihood in order to infer demographic history with a temporally varying population size. As an illustration, we elaborate on a three-parameter model for a single population undergoing two stages of growth: a period with constant population size followed by an exponential growth. The three parameters of interest are the onset time, the rate of exponential growth, and the ancient population size. More complicated demographic models for multiple populations can also be built to derive the joint allele frequency spectrum as similarly done in Chen (2012), and the resulting JAFS can then be used to infer the demographic history of multiple populations on a fine scale (Gutenkunst et al., 2009; Lukic et al., 2011; Chen, 2012).
Inferring the number of ancient lineages of a contemporary sample or population is another topic of great interest in population genetics and ecological studies (Hey, 2005; Anderson & Slatkin, 2007; Blum & Rosenberg, 2007; Leblois & Slatkin, 2007). Estimation of the founding lineage number at a specific ancient time, especially during a bottleneck or founding history, helps to elucidate ancient admixture, founder effects, their impact on the genetic diversity of modern populations, and the presence of Mendelian diseases in isolated populations (Risch et al., 2003). It also helps to understand species invasion in ecological studies (Dlugosch & Parker, 2007). The existing methods are developed under the coalescent framework and require computationally intensive approaches, such as importance sampling (Anderson & Slatkin, 2007), coalescent simulation (Leblois & Slatkin, 2007) and rejection sampling (Blum & Rosenberg, 2007). Compared to the existing methods, the new method for inference of lineage number presented in this paper benefits from the analytical form of the intercoalescence time distribution, and thus gains computational efficiency.
We further applied the methods to two real data sets: (1) the resequencing data of 20 European haplotypes from the NIEHS Environmental Genome Project (EGP), to infer the rate and onset time of population growth in West Europeans; and (2) 31 Tibetan mitochondrial genomes, to infer the number of ancient lineages in the late Paleolithic age. We expect the methods developed in this paper to be useful tools for population genetic inference with the coming influx of large-scale genomic data from human populations and other species.
Assume that we are investigating a sample of n haplotypes collected from a single contemporary population. Let the current generation be generation 0, and T be the time in the past from the current generation (if not specified otherwise, all time in this paper is in units of generations). The population size does not have to be constant over time. At time τ in the past, a demographic event, for example, the end of a population bottleneck or the onset of the population growth, occurred. The gene genealogy of the haplotypes eventually coalesces into the most recent common ancestor (MRCA) if τ is sufficiently large. However, when τ is not large enough, the genealogy may still contain m lineages at time τ. Such a TG is denoted with . An example of a TG is shown in Figure 1, in which lineages are present in the current generation and coalesced into ancient lineages back to time τ. For a TG with fixed, the probability of observing m lineages at time τ is given by (Tavare, 1984; Takahata & Nei, 1985):
Here $a_{(k)} = a(a+1)\cdots(a+k-1)$ and $a_{[k]} = a(a-1)\cdots(a-k+1)$ are the rising and falling factorial functions (i.e., Pochhammer functions), respectively.
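As an illustration of how this probability can be evaluated, the following is a minimal sketch of the standard form of Tavaré's lineage-count distribution for a constant-size population. It assumes that time is expressed in coalescent units, so that k lineages coalesce at rate k(k − 1)/2; the conversion from generations depends on the ploidy convention and is not taken from the paper. The function names `rising`, `falling`, and `lineage_prob` are illustrative only.

```python
import math


def rising(a, k):
    """Rising factorial a_(k) = a (a+1) ... (a+k-1); equals 1 when k == 0."""
    out = 1
    for i in range(k):
        out *= a + i
    return out


def falling(a, k):
    """Falling factorial a_[k] = a (a-1) ... (a-k+1); equals 1 when k == 0."""
    out = 1
    for i in range(k):
        out *= a - i
    return out


def lineage_prob(n, m, t):
    """P(m ancestral lineages remain after scaled time t | n sampled lineages).

    Assumption: t is in coalescent units (pairwise coalescence rate 1).
    """
    if not 1 <= m <= n:
        return 0.0
    total = 0.0
    for k in range(m, n + 1):
        coef = ((2 * k - 1) * (-1) ** (k - m) * rising(m, k - 1) * falling(n, k)
                / (math.factorial(m) * math.factorial(k - m) * rising(n, k)))
        total += math.exp(-k * (k - 1) * t / 2.0) * coef
    return total


if __name__ == "__main__":
    # Distribution of the number of ancestors of 6 lineages after 0.3 time units.
    for m in range(1, 7):
        print(m, lineage_prob(6, m, 0.3))
```

The alternating signs in the sum are the source of the numerical instability for large samples that is noted in the Discussion; double precision is adequate only for small n.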
As shown in Figure 1, the coalescence time for the event when lineages coalesce into i lineages, is denoted by . The intercoalescence time of a truncated genealogy, , is denoted by , and by definition . For a complete genealogy in a population of constant size, the intercoalescence time of a coalescence process, , follows an exponential distribution with the rate of , and ’s are mutually independent, which are classical conclusions in the literature of coalescent theory (Hudson, 1990; Fu, 1995; Nordborg, 2001). For a population with varying size, the distribution of intercoalescence times of a complete genealogy has also been well studied (Griffiths & Tavare, 1998; Polanski et al., 2003), and it has been shown that intercoalescence times are no longer mutually independent. Even more complicated is the distribution of the intercoalescence times of a TG. In the following sections, the general equation for intercoalescence time distribution conditional on a TG is first provided in a varying population size model, and then the specific probability density function (pdf) and expectation of intercoalescence times are shown for a population under exponential growth.
Distribution and Expectation of Intercoalescence Times in Deterministically Varying Population Size Models
In a temporally size-varying population, the probability that n lineages at time t0 coalesced into m ancient lineages back to time τ is given by Griffiths & Tavare (1998):
where is defined in Equation (2), and is the ratio of the present population size to the population size at t units of time before present, and the time τ is when the exponential growth started.
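The role of the population-size ratio can be illustrated numerically: integrating the ratio over the interval from the present back to time τ converts generations into the rescaled time on which the constant-size results operate. The sketch below assumes a pairwise coalescence rate of 1/N(t) per generation (a haploid convention; other conventions change the result by a constant factor), and the names `scaled_time`, `scaled_time_exponential`, `size_ratio`, and `n_present` are hypothetical.

```python
import math


def scaled_time(tau, size_ratio, n_present, steps=10_000):
    """Integrate size_ratio(u) / n_present over [0, tau] with the trapezoid rule.

    size_ratio(u) is the present size divided by the size u generations ago.
    """
    h = tau / steps
    total = 0.5 * (size_ratio(0.0) + size_ratio(tau))
    for i in range(1, steps):
        total += size_ratio(i * h)
    return total * h / n_present


def scaled_time_exponential(tau, r, n_present):
    """Closed form for exponential growth, where the ratio is exp(r * u)."""
    return math.expm1(r * tau) / (r * n_present)


if __name__ == "__main__":
    r, tau, n_present = 0.001, 1000.0, 10_000.0
    numeric = scaled_time(tau, lambda u: math.exp(r * u), n_present)
    exact = scaled_time_exponential(tau, r, n_present)
    print(numeric, exact)
    # Under the stated convention, two lineages remain distinct at tau
    # with probability exp(-scaled time).
    print(math.exp(-exact))
```

The numerical version accepts any deterministic size history, whereas the closed form applies only to the exponential case.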
Consider the pdf of the intercoalescence time conditional on m lineages being left at the genealogy truncation time τ, where the number of ancestral lineages at time τ is described by a death process. The density function can then be derived as follows:
The detailed derivation can be found in Appendix A1. Given that we have the pdf of intercoalescence times for populations with time-varying sizes, the expectation of times can simply be obtained by integrating over the pdf:
The above equations are generalized for the class of models in which population size evolves deterministically. A specific model of exponential population growth is elaborated in the next section.
A special case of the models describing deterministically time-varying population size is the exponential growth model. The historical size of the population under exponential growth is described by the function:
$$N(t) = N e^{-rt},$$
where N is the present population size, $N(t)$ is the population size at time t in the past, and r is the population growth rate (per generation). Exponential growth is a close approximation to the real population history of many human populations, and is extensively used to model population size change in the literature (Slatkin & Hudson, 1991; Marjoram & Donnelly, 1994; Polanski et al., 2003; Adams & Hudson, 2004; Voight & Pritchard, 2005; Chen et al., 2007; Gutenkunst et al., 2009). For an exponentially growing population, Equation (3) becomes:
where are defined as in the section entitled “Settings.” Substituting Equations (6) and (7) into Equation (4), and after some algebra (see Appendix A2 for the details), the conditional pdf of intercoalescence time is then given by:
Here is defined as . The pdf for the coalescence time () of incomplete genealogies can also be derived in a similar fashion (see Appendix A2). The pdf of is obtained as follows:
A more useful result for ancestral inference is the expected intercoalescence time (also called the time length of stage j as in Fu, 1995). The expectation of is given by
where denotes the exponential integral (Gradshteyn et al., 2007), which is defined as:
The exponential integral can be evaluated by the power series or an asymptotic expansion through publicly available subroutines (Press et al., 1992). The detailed derivation can be found in Appendix A3.
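For readers without access to such subroutines, a minimal sketch of the power-series evaluation is given below. It assumes the convention Ei(x) = γ + ln|x| + Σ x^k/(k·k!) for real x ≠ 0, with E1(x) = −Ei(−x) for x > 0; which of the two conventions corresponds to the definition used in the paper is not recoverable from the text and should be checked against the cited reference. The function name `ei_series` is illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329


def ei_series(x, tol=1e-15, max_terms=500):
    """Power series Ei(x) = gamma + ln|x| + sum_{k>=1} x^k / (k * k!), x != 0."""
    if x == 0:
        raise ValueError("Ei is undefined at 0")
    term = 1.0
    total = EULER_GAMMA + math.log(abs(x))
    for k in range(1, max_terms + 1):
        term *= x / k              # term now equals x**k / k!
        total += term / k
        if abs(term / k) < tol * abs(total):
            break
    return total


if __name__ == "__main__":
    print(ei_series(1.0))          # Ei(1)
    print(-ei_series(-1.0))        # E1(1) = -Ei(-1)
```

The series is accurate for moderate |x|; for large arguments the terms first grow before decaying, so an asymptotic expansion, as mentioned above, is the safer choice.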
The pdf's of intercoalescence and coalescence times for two specific cases are shown in Figures 2(A–D). In the first case, the population has a contemporary size of 10,000, and the growth phase starts 1000 generations ago at a rate of 0.001. Subplots (A) and (C) show the pdf's of intercoalescence and coalescence times for the truncated genealogy in this setting. In the second case, the modern population size is set to 1 × 10^6, and the growth phase also starts 1000 generations ago, at a rate of 0.01. Subplots (B) and (D) show the pdf's of intercoalescence and coalescence times for the truncated genealogy in this setting. The pdf of the first intercoalescence time shifts toward larger values as the growth rate increases. In other words, the time to the first coalescence event is elongated disproportionately to the other intercoalescence times and the shape of the genealogy tends to be star-like. This is consistent with former studies (Slatkin & Hudson, 1991). To further illustrate the validity and reliability of the equations, the coalescent-based simulation program ms was used to generate the intercoalescence times (Hudson, 2002). To simulate the TGs, only samples with a fixed number of lineages at a prechosen time were kept, and the intercoalescence times for such genealogies were recorded. As an example, the histograms of intercoalescence times from 2500 simulations for the above two exponential population growth models are presented together with the pdf curves from Equations (8) and (9) in Figures 2(E and F).
The theoretical result matches simulation well as shown in the figure.
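The conditioning step used in these simulations, keeping only genealogies with a fixed number of lineages at the truncation time, can be sketched without ms. The example below is a stand-in, not the procedure used in the paper: it simulates coalescence times on a rescaled time axis and maps them back to generations, assuming exponential growth with a pairwise coalescence rate of 1/N(t) per generation. The function name `simulate_truncated` and its parameters are hypothetical.

```python
import math
import random


def simulate_truncated(n, m, tau, r, n_present, n_keep, rng):
    """Return n_keep lists of waiting times (generations) for genealogies of
    n lineages that have exactly m lineages left at time tau."""
    s_tau = math.expm1(r * tau) / (r * n_present)   # tau on the rescaled axis
    kept = []
    while len(kept) < n_keep:
        s, k, times = 0.0, n, []
        while k > 1:
            s_next = s + rng.expovariate(k * (k - 1) / 2.0)
            if s_next > s_tau:
                break                                # truncated before next event
            s = s_next
            # Map the rescaled coalescence time back to generations.
            times.append(math.log1p(r * n_present * s) / r)
            k -= 1
        if k == m:                                   # rejection step
            bounds = [0.0] + times + [tau]
            kept.append([b - a for a, b in zip(bounds, bounds[1:])])
    return kept


if __name__ == "__main__":
    rng = random.Random(2013)
    sims = simulate_truncated(10, 5, 1000.0, 0.001, 10_000.0, 2500, rng)
    first = [w[0] for w in sims]
    print(sum(first) / len(first))   # mean time to the first coalescence
```

Each returned list holds n − m + 1 intervals: the waiting times with n, n − 1, ..., m + 1 lineages, followed by the truncated interval during which m lineages persist until τ.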
Applications to Ancestral Inference
The distribution and expectation of intercoalescence times of incomplete genealogies are useful in population genetic inference. Traditional coalescent-based ancestral inference methods view the gene genealogy as a nuisance parameter (Felsenstein, 1988), and require computationally intensive approaches, such as importance sampling (Griffiths and Tavare, 1994b) and Markov Chain Monte Carlo (Kuhner et al., 1995; Nielsen & Wakeley, 2001) to integrate over the space of gene genealogies. As an exception, the AFS and AFS-based methods only depend on intercoalescence times, instead of the whole gene genealogy (Griffiths & Tavare, 1998; Chen, 2012). Consequently, if the expectation of intercoalescence times can be derived, we can analytically get the coalescent likelihood function without relying on intensive computation or simulation. In this section, we first derive the AFS of populations under exponential growth, using the expectation of time length conditional on the genealogy truncated at the onset of population growth, and then present two AFS-based methods for population genetic inference: we thus infer population growth rate and onset time, and the number of ancient lineages at a specific ancient time for a sample from the contemporary population.
The Allele Frequency Spectrum of Populations Under Exponential Growth
The AFS is defined as the sampling distribution of allele frequencies of SNPs (Sawyer & Hartl, 1992; Ewens, 2004). For a random sample of n haplotypes (lineages) collected from the contemporary population, the AFS is described by a set of summary statistics, the ith of which corresponds to the expected count of segregating sites with i copies of the derived allele in the sample of n haplotypes. To ease the notation, the sample size n is sometimes suppressed when the context is clear. In this section, we consider the nonequilibrium AFS of a population that experienced exponential growth. More specifically, the population had a constant size in the past, and at time τ it started growing exponentially at rate r, reaching the current population size N0.
As in Chen (2012) (see also Wakeley & Hey, 1997; Williamson et al., 2005; Evans et al., 2007), the segregating sites can be divided into two classes, those more ancient than τ and those appearing after the onset of exponential growth, so that:
where denotes expectation, and denote segregating sites that arose before the onset of population growth at time τ (ancient segregating sites), and those arising during the population growth phase (new segregating sites) respectively. Assume that there are m ancient lineages at time τ. Since the population has a constant population size before the exponential growth, the AFS at time τ follows the stationary distribution, and
where μ is the mutation rate per generation per locus, and is the ancient population size at time τ. The AFS of the current generation contributed by those ancient segregating sites is then given by (Fu, 1995; Griffiths & Tavare, 1998):
where is the Polya–Eggenberger distribution (a result of the Polya urn model, see Johnson & Kotz, 1977; Fu, 1995; Slatkin, 1996; Wakeley & Hey, 1997):
where denotes the rising factorial function as in the previous section. The Polya–Eggenberger distribution provides the probability that the j copies of the derived allele among m ancient lineages at time τ grow to i copies among n lineages at present due to the bifurcation during the coalescent process.
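A small numerical sketch may help to make this urn probability concrete. It uses the binomial-coefficient form of the Polya urn result, C(i − 1, j − 1)·C(n − i − 1, m − j − 1)/C(n − 1, m − 1), which should be equivalent to the rising-factorial expression but is not copied from the paper. The function name `polya_eggenberger` is illustrative.

```python
from math import comb


def polya_eggenberger(i, j, m, n):
    """P(j derived copies among m ancient lineages grow to i copies among n).

    Requires 1 <= j <= m - 1 (the site segregates among the m lineages).
    """
    if not (1 <= j <= m - 1 and m <= n):
        raise ValueError("need 1 <= j <= m - 1 and m <= n")
    if i < j or n - i < m - j:
        return 0.0
    return comb(i - 1, j - 1) * comb(n - i - 1, m - j - 1) / comb(n - 1, m - 1)


if __name__ == "__main__":
    n, m, j = 10, 4, 2
    probs = {i: polya_eggenberger(i, j, m, n) for i in range(1, n)}
    print(probs)
    print(sum(probs.values()))
```

With m = 2 and j = 1 the distribution is uniform on 1, ..., n − 1, which recovers the familiar result for the two branches descending from the root.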
Assuming an infinitely-many-sites model, we obtain the AFS generated by new mutations from the expected intercoalescence times of incomplete genealogies (see the details of derivation in appendix A2 of Chen, 2012):
Summing up the expectations of the two classes of segregating sites, the AFS for a population with exponential growth is obtained. One can also use Equations (14) and (16) to estimate the relative proportion of mutations belonging to the two classes for a given demographic history. Figure 3 shows the AFS's estimated using Equations (12), (14), and (15) in populations with different growth rates and onset times of the growth phase. Figures 3(A–C) show the overall AFS and the AFS contributed by ancient segregating sites and new segregating sites for three different demographic scenarios: (1) a population with a current size of 1 × 10^7, which starts growing 1000 generations ago at a rate of 0.01; (2) a population with a current size of 10,000, which starts growing 1000 generations ago at a rate of 0.001; (3) a population with a current size of 1 × 10^6, which starts growing 1000 generations ago at a rate of 0.01. The enrichment of SNPs with rare alleles is more pronounced in populations with faster growth, and is mostly contributed by new mutations during the growth phase. For the three demographic scenarios, the proportions of new segregating sites are 78.8%, 30%, and 99%, respectively. If the demographic model proposed by Gutenkunst et al. (2009) is used to mimic the history of a European population, that is, the ancient population size is 2100, and it starts exponential growth 848 generations ago with an initial size of 1000 and a growth rate of 0.0040 (see Figure 3D), 40.4% of the segregating sites in the European population arose since the onset of population growth.
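The way expected stage lengths translate into the mutation-generated part of the AFS can be illustrated with the classical identity for genealogies (Fu, 1995; Griffiths & Tavare, 1998): a mutation arising while k lineages exist is carried by i of the n sampled haplotypes with probability C(n − i − 1, k − 2)/C(n − 1, k − 1). The sketch below is not the paper's equation for truncated genealogies; it only assumes that the expected time spent with k lineages is supplied in the same time units as the mutation rate. The names `descendant_prob` and `afs_from_stage_lengths` are hypothetical.

```python
from math import comb


def descendant_prob(n, k, i):
    """P(a lineage present while k lineages exist has i descendants among n)."""
    if k < 2 or i < 1 or i > n - k + 1:
        return 0.0
    return comb(n - i - 1, k - 2) / comb(n - 1, k - 1)


def afs_from_stage_lengths(n, stage_lengths, mu):
    """Expected counts for i = 1..n-1 given {k: E[time with k lineages]}.

    mu is the mutation rate per locus per unit of the time used in stage_lengths.
    """
    afs = []
    for i in range(1, n):
        total = sum(k * length * descendant_prob(n, k, i)
                    for k, length in stage_lengths.items())
        afs.append(mu * total)
    return afs


if __name__ == "__main__":
    # Sanity check on a complete constant-size genealogy in coalescent units:
    # E[time with k lineages] = 2 / (k (k - 1)) and mu = theta / 2 give theta / i.
    n, theta = 8, 5.0
    stages = {k: 2.0 / (k * (k - 1)) for k in range(2, n + 1)}
    print(afs_from_stage_lengths(n, stages, theta / 2.0))
```

For a truncated genealogy, the dictionary would instead hold the conditional expected stage lengths for k = m, ..., n, and the resulting vector would cover only the new segregating sites.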
Inferring the Onset Time and Rate of Population Growth
A two-stage population growth model, in which an ancient population of constant size is followed by an exponential growth phase, is commonly used to approximate the demographic histories of many modern human populations (Adams & Hudson, 2004; Voight & Pritchard, 2005; Chen et al., 2007). A method for inferring the population growth rate and the onset of the growth phase can be developed based on the AFS for a population under exponential growth. Former methods for this purpose used coalescent simulations to generate a large number of samples, and optimized over parameter spaces by matching some summary statistics or the allele frequency spectrum of simulated data and real data (Wall & Przeworski, 2000; Adams & Hudson, 2004; Voight & Pritchard, 2005; Wall et al., 2009). Here we use the analytical forms of the AFS derived in the previous section and adopt the Poisson random field model to set up the likelihood function (Sawyer & Hartl, 1992; Bustamante et al., 2001). According to the Poisson random field model, the number of segregating sites in each entry of the AFS independently follows a Poisson distribution:
where the Poisson mean of each entry is a function of the parameter set Θ. The full likelihood of the data is constructed by taking the product of these Poisson probabilities over all entries of the AFS.
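In practice the product is evaluated on the log scale. The following is a minimal sketch of the Poisson random field log-likelihood, assuming that the observed counts and the model-based expected counts are supplied as two sequences of equal length; the function name `prf_log_likelihood` and the toy numbers are illustrative and not taken from the paper.

```python
import math


def prf_log_likelihood(observed, expected):
    """Sum of independent Poisson log-probabilities over the AFS entries."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected AFS must have equal length")
    ll = 0.0
    for x, lam in zip(observed, expected):
        if lam <= 0:
            raise ValueError("expected counts must be positive")
        ll += x * math.log(lam) - lam - math.lgamma(x + 1)
    return ll


if __name__ == "__main__":
    observed = [120, 48, 31, 20]                          # toy counts
    theta = 100.0
    expected = [theta / i for i in range(1, 5)]           # constant-size AFS
    print(prf_log_likelihood(observed, expected))
```

A numerical optimizer can then maximize this function over the parameter set by recomputing the expected AFS at each candidate value.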
Coalescent simulations were used to test the performance of the approach. For each combination of different values of population growth rate and onset time, 1000 samples were generated. Each sample contains 20 haplotypes from a population with a contemporary size of 10,000, and each haplotype covers a 10-Mb region obtained by merging SNPs from 1000 10-kb regions simulated with a recombination rate of 1e-8 per generation per nucleotide. The likelihood was jointly maximized over the two parameters using the quasi-Newton Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS, Press et al., 1992). Boxplots of the inferred onset time and growth rate against the corresponding true values are presented in Figure 4. The method was then applied to a 4.04-Mb resequencing dataset including 20 haplotypes of European origin from the NIEHS Environmental Genome Project (Livingston et al., 2004). Of note, because the region in the real data is small compared to the simulated data, the two parameters, T and r, are confounded. In the likelihood function, T and r appear together as a product term more often than as individual terms, so the likelihood contains little information about each parameter separately. Only when the number of SNPs or the sample size is sufficiently large can the two parameters be disentangled and estimated correctly. By simulation, we found that for the AFS from a single population, a total region size larger than 10 Mb is sufficient to identify the two parameters well. Thus we fixed the parameters of the current population size and the onset time of the growth to the values inferred in Gutenkunst et al. (2009). The estimated growth rate is 0.0042 (95% confidence interval: 0.0012–0.0072) by the proposed approach, which is consistent with Gutenkunst et al. (2009) (0.0040).
Inferring the Number of Founding Lineages
Inferring the number of founding lineages for a population is of great interest in the study of population genetics and ecology (Hey, 2005; Anderson & Slatkin, 2007; Leblois & Slatkin, 2007). Blum & Rosenberg (2007) developed a rejection sampling algorithm for inferring the number of ancient or founding lineages at some time τ in the past for n lineages at the current generation. They first derived the conditional distribution of intercoalescence times for incomplete genealogies, and this equation was then used to simulate a set of intercoalescence times in sequential order. Given this set of intercoalescence times, a coalescent-based simulation approach can be modified to generate samples that have a given number of ancient lineages, and the AFS's are obtained from the simulated data. The probability of the observed AFS of the sample was obtained by rejection sampling, which rejects any simulated AFS that matches the observed AFS poorly. The method presented here also evaluates the likelihood of the observed AFS for a fixed number of ancient lineages, and a grid search scheme is used to infer that number. It differs from the method in Blum & Rosenberg (2007) in that the likelihood of the observed AFS is explicitly calculated by exploiting the expected time lengths of the truncated genealogy, and is constructed using the Poisson random field model as in the previous section. The estimation procedure in the new method is computationally much more efficient than a rejection sampling algorithm.
We also use coalescent simulations to test the performance of the method. Two demographic scenarios are simulated: one with the growth rate of 0.001, and the other 0.003, both with the current population size of 10,000. The truncated time is 1000 generations ago. Two hundred samples are generated for each scenario. In each simulation, 40 lineages are generated, each haplotype covers SNPs merged from 1000 1-Kb regions, and each 1-Kb region was simulated allowing both recombination and mutation. The results are shown as error-bar plots in Figure 5. The X-axes present the true values of the ancient lineage numbers, and Y-axes show the mean and standard deviation of the inferred lineage numbers. The results show that the method can infer the true ancient lineage number efficiently and accurately.
We applied the method to 31 complete mitochondrial DNA genomes from Tibetan populations (Qin et al., 2010). The complete mitochondrial DNAs were aligned to the Reconstructed Sapiens Reference mitochondrial genome sequence (Behar et al., 2012), and polymorphic loci and their ancestral alleles were identified. SNPs within the highly variable region, that is, the control region, were excluded from the analysis since we assume an infinitely-many-sites model in deriving the AFS, and the high mutation rate in the control region may violate this assumption. For the coding regions, we used the per-nucleotide, per-year mutation rate of Mishmar et al. (2003) in the analysis. The ancient demographic history of the Tibetan population is still unclear, so two possible histories were explored with the proposed method: one originated from Yi et al. (2010), in which the Tibetan population has an initial size of 7360 at 42,955 years ago, expanded to 22,642 at 1973 years ago, and from then on exponentially declined to a current size of 1270; and the other is a simple population history with a constant effective population size of 6000. For both histories, the population size is adjusted for the maternal inheritance of the mitochondrial genome. The results from the two different assumptions are comparable, and the likelihood curve for the number of ancient lineages at 1000 generations ago under the first history assumption is presented in Figure 6. As one can see, the number of maternal founding lineages of the Tibetan populations was high 25,000 years ago. This conclusion is consistent with other studies which suggest that the settlement of the Tibetan population is ancient and possibly has multiple sources of origin (Shi et al., 2008; Zhao et al., 2009; Qin et al., 2010; Peng et al., 2011; Xu et al., 2011).
Provided that the inference of number of ancient lineages for autosomal regions is region-specific, it is of great interest to apply this method to genomic data and to explore the variation of the number of ancient lineages along the genome.
In this paper, the distribution and expectation of intercoalescence times of the incomplete genealogy were investigated for a population with time-varying size. Specifically, an explicit expression of the pdf was given for populations undergoing exponential growth. The intercoalescence times of the incomplete genealogy can be used widely in population genetic inference. We gave two examples of their application after the allele frequency spectrum of a population under exponential growth was derived: a method for inferring the onset time and rate of population growth was developed based on the AFS; and a method for inferring the number of ancient lineages was further presented. The new approach for inferring ancient lineages provides analytical equations for the AFS through the application of the distribution of intercoalescence time. It therefore avoids the rejection sampling used by a former method (Blum & Rosenberg, 2007), and is computationally more efficient. We applied the new AFS-based methods to infer the rate of population growth from the 20 haplotypes of European origin in the NIEHS Environmental Genome Project (Livingston et al., 2004), and to infer the number of ancient maternal lineages of Tibetan populations in the late Paleolithic age from 31 mitochondrial DNA samples in Qin et al. (2010). The conclusions drawn from the proposed methods are consistent with former studies.
A recent study that infers population growth rate using ancient lineage numbers is Maruvka et al. (2011), who suggested that for large-sample genealogies the ancient lineage number [called “lineage number as a function of time” (LNFT)] is close to deterministic, with little randomness. Intercoalescence times from such a deterministic process can be greatly simplified compared to Equations (8) and (10). But even for large samples, the randomness of LNFT still exists. For example, for a sample of haplotypes collected from a population with an initial size of 1,000,000 and growth rate of 0.003, the mean of LNFT is around 280 at 1000 generations ago, while the variance is around 80. Ignoring the randomness can cause bias in the derived AFS. In the derivation of the expected time length in this paper, the randomness of the intercoalescence time is taken into account by summing over its distribution, which increases the accuracy of the estimate.
The AFS of a population with time-varying size has been derived in previous studies using different models, including the n-epoch model (Marth et al., 2004) and the exponential growth model (Wooding & Rogers, 2002; Polanski & Kimmel, 2003). The exponential growth model used here is different from Polanski & Kimmel (2003) in that the model in this paper includes a constant-population-size phase before the exponential growth phase. In addition, this model is flexible and can be extended to include multiple exponential growth phases and/or constant-size phases. Other events, such as selective sweeps and migration can also be incorporated into the AFS for populations under exponential growth.
The proposed coalescent-based method can be integrated as part of the JAFS as done by Chen (2012), and provides an alternative for population genetic inference from JAFS in addition to the existing methods based on diffusion approximation (Gutenkunst et al., 2009; Lukic et al., 2011; Zivkovic & Stephan, 2011; Song & Steinrücken, 2012). The coalescent-based JAFS has several advantages compared to the diffusion approximation; for example, it is feasible for parameter inference from more than three populations due to the computational efficiency gained by the analytical form of the coalescent likelihood. However, there are some limitations to the coalescent-based method, including the numerical instability of Tavare's transition function for large sample sizes due to alternating sums of hypergeometric series (Polanski et al., 2003), and the difficulty of modeling continuous migration. To handle the numerical issue, we can use a high-precision arithmetic library to avoid the overflow problem caused by calculating the alternating sums (Marth et al., 2004), which on the other hand slows down the computation significantly. Another approach is to propose a solution that does not rely on the alternating sum, which is beyond the scope of this paper and will be addressed in subsequent work.
I am grateful to Drs. Kun Chen, Rasmus Nielsen, Josh Schraiber and the anonymous reviewers for their constructive comments on an early version of the manuscript, to Dr. Hong Shi for his insightful discussion on mitochondrial DNA data and Tibetan history, to Dr. Hui Li for sharing the Tibetan mitochondrial DNA data, to Dr. Ryan Gutenkunst for sharing his pre-processed version of the EGP data and to Dr. Deborah Nickerson for providing public access to the EGP data set.
Proof of Expressions
A1. The Probability Density Functions of Intercoalescence Times
The proof of Equations (8)-(10) follows the same rationale and generalizes the results in Appendix A of Chen (2012). Two results are required in the proof:
(1) The transition probability of observing m1 lineages at time t1 jumping to m2 lineages at time t2 is:
where is the current population size, and is the ratio of the current population size over the size at time w.
(2) Let be the time at which a coalescent event occurs, that is, lineages coalesce into i lineages. The probability density function of is (Chen, 2012):
Starting from these two results, we can derive the probability density function (pdf) of intercoalescence time for a truncated genealogy in the time-varying population as below. Denote by m the number of lineages right before the onset of the population growth. Three cases are discussed separately (Equations 1.3, 1.4, and 1.5).
A2. The Pdf of Intercoalescence and Coalescence Times for Exponentially Growing Populations
The pdf of intercoalescence times of an incomplete genealogy for the exponentially growing population is derived as follows:
To avoid extra long equation, we use the following notation: , and similar definition holds for , , , and .
Let then . Substituting and changing the variable yields
Similarly, we can derive the pdf for coalescence times of an incomplete genealogy:
where is defined as above.
A3. The Expectation of Intercoalescence Times for Incomplete Genealogies
The expectation of intercoalescence times can be calculated using the pdf of intercoalescence times derived in A2. A simpler alternative is presented in this section. Conditional on m lineages existing at time τ, the probability that at time t, there are j lineages is given as:
The expectation is then obtained by taking integral over t:
Let . Substituting and changing the variable yields:
where stands for Exponential integral:
The expectation of is given by:
And similarly, let . Substituting and changing the variable yields:
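The integral representation used in this appendix can also be checked numerically. The sketch below integrates, over t from 0 to τ, the probability of having j lineages at time t conditional on m lineages at τ, using Simpson's rule and the standard alternating-sum form of Tavaré's probability on a rescaled time axis. It assumes exponential growth with a pairwise coalescence rate of 1/N(t) per generation, which may differ from the paper's scaling by a constant factor, and the function names are hypothetical. A useful sanity check is that the expected stage lengths for j = m, ..., n sum to τ.

```python
import math


def lineage_prob(n, m, s):
    """P(n lineages have m ancestors after rescaled time s), Tavaré-type sum."""
    if not 1 <= m <= n:
        return 0.0
    total = 0.0
    for k in range(m, n + 1):
        num = (2 * k - 1) * (-1) ** (k - m)
        for i in range(k - 1):
            num *= m + i                      # rising factorial m_(k-1)
        den = math.factorial(m) * math.factorial(k - m)
        for i in range(k):
            num *= n - i                      # falling factorial n_[k]
            den *= n + i                      # rising factorial n_(k)
        total += math.exp(-k * (k - 1) * s / 2.0) * num / den
    return total


def scaled(t, r, n_present):
    """Rescaled time for exponential growth (assumed pairwise rate 1/N(t))."""
    return math.expm1(r * t) / (r * n_present)


def expected_stage_lengths(n, m, tau, r, n_present, steps=400):
    """E[time with j lineages | m lineages at tau] for j = m..n (Simpson's rule).

    steps must be even.
    """
    s_tau = scaled(tau, r, n_present)
    norm = lineage_prob(n, m, s_tau)
    h = tau / steps
    out = {}
    for j in range(m, n + 1):
        acc = 0.0
        for i in range(steps + 1):
            s_t = scaled(i * h, r, n_present)
            p = lineage_prob(n, j, s_t) * lineage_prob(j, m, s_tau - s_t) / norm
            weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
            acc += weight * p
        out[j] = acc * h / 3.0
    return out


if __name__ == "__main__":
    lengths = expected_stage_lengths(8, 3, 1000.0, 0.001, 10_000.0)
    for j in sorted(lengths, reverse=True):
        print(j, lengths[j])
    print(sum(lengths.values()))
```

Because of the alternating sums, this direct evaluation is reliable only for small samples; the closed-form expressions with the exponential integral avoid the numerical quadrature but not the alternating signs.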