In recent years, many studies have found evidence of gene flow between diverging populations by analyzing genetic data under an Isolation with Migration (IM) model (Pinho & Hey 2010). Given evidence of gene exchange, investigators often then wish to ask when gene flow occurred (e.g. Won & Hey 2005; Becquet & Przeworski 2009). For example, a model of divergence with gene flow would be suggested if gene flow occurred early or throughout the divergence process, whereas secondary contact would be the likely interpretation if gene flow was found to have occurred only after divergence had been ongoing for some time. Recently, Strasburg & Rieseberg (2011) assessed the quality of estimates for the time of migration events using the method currently implemented in the IMa2 program (Hey 2010). They found that the credible intervals of estimated times were so wide as to make the method unsuitable for the question. These results suggest that some conclusions of previous studies that draw upon the posterior distribution for times of migration should be discounted (e.g. Won & Hey 2005; Niemiller et al. 2008; Strasburg et al. 2008; Nadachowska & Babik 2009).
The Strasburg & Rieseberg (2011) study reports results from simulations. Here, we examine, using the theory underlying the method implemented in the IMa2 program, the possible bases for their observations. We demonstrate that gene migration times are not fully identifiable using the general coalescent for genealogies in an IM model, as implemented in IMa2 and similar programs. In many respects, the findings are general to methods that rely upon calculating the probabilities of genealogies under the coalescent and so are of broader interest than any particular program. We note that the method implemented in IMa2 is the same as that in the IMa program (Hey & Nielsen 2007), and hereafter, we refer simply to IMa.
Principles of IMa
The function of IMa is to obtain the posterior density, h(Θ|X), for the parameters Θ of an IM model given data X from one or more loci from two populations (or more than two populations in the case of IMa2) (for details see Hey & Nielsen 2007; Hey 2010). The parameters Θ include the effective population sizes, migration rates and times of population separation. Hey & Nielsen (2007) showed that the posterior of the parameters h(Θ|X) can be approximated given a sample of genealogies from the posterior density h(G|X). In effect, the method collects the information that the data contains about Θ in the form of a sample of genealogies and then uses these genealogies to estimate the posterior density for Θ, i.e. p(Θ|G,X) = p(Θ|G) if G∼h(G|X) (Hey & Nielsen 2007; Hey 2010). But because there is additional information in the genealogies, which does not bear directly on Θ, it is also possible to estimate a posterior density for other quantities, such as the time of most recent common ancestor in the genealogy (TMRCA), the number and time of coalescent events in each population, as well as the number and time of migration events between pairs of populations for each locus. Thus, even though the IM model assumes a constant rate of gene flow since population splitting, it seemed that by examining the genealogies sampled from the posterior density, it would also be possible to estimate the posterior density of migration times (Won & Hey 2005). As Strasburg & Rieseberg (2011) discovered by simulation and as we show here using an approach based on the calculation of the probability of a genealogy, this is not the case.
In IMa and related programs, a value of G is an ultrametric binary tree that depicts the topology, branch lengths, migration times and migration directions for a sample of genes at a locus (Beerli & Felsenstein 1999; Nielsen & Wakeley 2001). To address the identifiability of migration times, we partition G into several components, including a topology λ, a vector with the coalescent times tc = (tc,1,…,tc,cT), a vector with the migration times tm = (tm,1,…,tm,mT), where cT and mT are the total numbers of coalescent and migration events, respectively, and a matrix n, where nji is the number of lineages in population j during the ith interval between any two events. For simplicity, we refer to the topology and coalescent times as Λ = (λ,tc).
The probability of a genealogy, π(G|Θ) = π(tm,n,Λ|Θ), is obtained based on coalescent theory assuming a demographic model with parameters Θ. It is noteworthy that π(G|Θ) does not depend directly on much of the information in a genealogy but rather on a few summaries. In models that include migration, these summaries are counts and sums of rates for coalescent and migration events, including the following: (i) the number of coalescent events in each population cc = (cc1,…,ccp); (ii) the number of migration events between each pair of populations cm = (cm12,…,cmp(p−1)); (iii) the sum of coalescent rates for each population fc = (fc1,…,fcp); and (iv) the sum of rates for migration events for each pair of populations fm = (fm12,…,fmp(p−1)), where p refers to the number of populations. In more detail, the sums of coalescent rates for population j and of migration rates between populations j and l are defined as functions of the time intervals and the number of lineages in each population during each interval:
$$f_{cj} = \sum_i \Delta t_i \, n_{ji}(n_{ji} - 1), \qquad f_{mjl} = \sum_i \Delta t_i \, n_{ji} \qquad \text{(eqn 1)}$$
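where Δti = ti+1−ti is the time interval between any two consecutive events, either coalescent or migration, and t = (tm,tc) is a vector with the sorted coalescent and migration times. For simplicity, these summaries will be referred to as s = (cc,cm,fc,fm). For instance, for an IM model, during a time period with p populations, given the scaled effective sizes θ and migration rates m, this probability is
$$\pi(t_m, n, \Lambda \mid \Theta) = \prod_{j=1}^{p} \left(\frac{2}{\theta_j}\right)^{c_{cj}} \exp\!\left(-\frac{f_{cj}}{\theta_j}\right) \prod_{j \neq l} m_{j \to l}^{\,c_{mjl}} \exp\!\left(-m_{j \to l}\, f_{mjl}\right) \qquad \text{(eqn 2)}$$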
(Kuhner et al. 1998; Beerli & Felsenstein 1999; Hey & Nielsen 2007), where θj = 4Nejμ, mj→l = Mj→l/μ, Nej is the effective size of population j, μ is the mutation rate, and Mj→l is the migration rate between populations j and l. Note that the terms following the first and second products are associated with coalescent and migration events, respectively. From eqn 2, we can see that the probability of the genealogy (represented by its components tm, n, and Λ) depends on the values of the summaries s = (cc,cm,fc,fm). All genealogies whose tm, n and Λ correspond to the same set of summaries s have the same prior probability. This is a general result, as eqn 2 is the basis of most inference methods based on genealogies (e.g. Beerli & Felsenstein 1999), including methods where the prior probability of the genealogy is calculated by integrating over the prior distribution of the parameters Θ (Hey & Nielsen 2007; Hey 2010).
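The dependence of the genealogy probability on the summaries alone can be sketched in a few lines of Python. This is a minimal illustration, not the IMa implementation: it assumes the usual structured-coalescent form in which each coalescent event in population j contributes a factor 2/θj, each migration event from j to l contributes a factor mj→l, and the waiting times contribute exp(−fcj/θj) and exp(−mj→l·fmjl). The function name, the dictionary layout and the numerical values of the summaries are hypothetical.

```python
import math

def log_genealogy_prior(cc, cm, fc, fm, theta, m):
    """Log probability of a genealogy computed from its summaries only.

    cc, fc, theta: dicts keyed by population label j.
    cm, fm, m:     dicts keyed by (j, l) pairs with j != l.
    Assumed form (see lead-in): (2/theta_j)^cc_j * exp(-fc_j/theta_j) for
    coalescence and m_jl^cm_jl * exp(-m_jl * fm_jl) for migration.
    All theta and m values must be strictly positive.
    """
    logp = 0.0
    for j, th in theta.items():
        logp += cc.get(j, 0) * math.log(2.0 / th) - fc.get(j, 0.0) / th
    for pair, rate in m.items():
        logp += cm.get(pair, 0) * math.log(rate) - rate * fm.get(pair, 0.0)
    return logp

# Hypothetical summaries for one two-population period (illustrative values).
s = dict(
    cc={1: 1, 2: 0},
    cm={(1, 2): 1, (2, 1): 1},
    fc={1: 3.2, 2: 0.0},
    fm={(1, 2): 3.6, (2, 1): 0.4},
)
theta = {1: 5.0, 2: 5.0}
m = {(1, 2): 0.5, (2, 1): 0.5}
logp = log_genealogy_prior(**s, theta=theta, m=m)
```

The function never sees individual event times: any genealogy that maps to the same s returns the same value.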
As a consequence, for the final step of the estimation of the posterior probability h(Θ|X), we can use a sample of values of s from the posterior of genealogies. The result is a function that is itself a mean of functions, one for each sampled value of s,
$$h(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{f(s_i \mid \Theta)\,\pi(\Theta)}{\pi(s_i)} \qquad \text{(eqn 3)}$$
for a sample of k values of s∼h(s,Λ|X), where π(Θ) is the prior of the parameters (Hey & Nielsen 2007; Hey 2010). As f(si|Θ) = f(Gi|Θ)/p(Gi|si) and π(si) = π(Gi)/p(Gi|si) (similar to eqn A.2), the above expression is an alternative representation for the posterior h(Θ|X), which is typically expressed as a function of genealogies (see eqns 11 and 19 in Hey & Nielsen (2007)). In the case of an IM model with two sampled populations and one ancestral population, s includes just 10 quantities regardless of the sample sizes, and yet, it is sufficient for calculating the probability of a genealogy under the IM model. For multiple independent loci each with a genealogy, s still includes just 10 quantities, each the sum of the corresponding quantities calculated for the individual loci (Hey & Nielsen 2007; Hey 2010).
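As a rough numerical illustration of a posterior built as a mean of functions, the sketch below averages one normalized curve per sampled value of s. It makes strong simplifying assumptions that are not part of the method described here: a single population (so s reduces to the pair (cc, fc)), a uniform prior on θ over (0, 20], a grid approximation of π(s), and hypothetical sampled values of s. Any factor of f(s|θ) that does not depend on θ cancels in the ratio, so the genealogy form (2/θ)^cc·exp(−fc/θ) is used directly.

```python
import math

def density_given_s(cc, fc, grid, prior):
    """f(s|theta) * prior(theta) / pi(s) evaluated on a grid of theta values.

    Single-population assumption: f(s|theta) is proportional to
    (2/theta)**cc * exp(-fc/theta). pi(s) is approximated by a Riemann sum.
    """
    step = grid[1] - grid[0]
    unnorm = [(2.0 / th) ** cc * math.exp(-fc / th) * prior(th) for th in grid]
    pi_s = sum(unnorm) * step
    return [u / pi_s for u in unnorm]

def posterior_estimate(samples, grid, prior):
    """Mean over sampled summaries of the per-sample density curves."""
    curves = [density_given_s(cc, fc, grid, prior) for cc, fc in samples]
    k = len(curves)
    return [sum(c[i] for c in curves) / k for i in range(len(grid))]

theta_max = 20.0
grid = [0.01 + i * 0.01 for i in range(2000)]   # theta from 0.01 to 20.0
uniform = lambda th: 1.0 / theta_max            # assumed uniform prior
samples = [(4, 6.0), (4, 9.5), (4, 7.2)]        # hypothetical (cc, fc) draws
post = posterior_estimate(samples, grid, uniform)
area = sum(post) * (grid[1] - grid[0])          # should be close to 1
```

Each per-sample curve peaks near fc/cc, and the average inherits its shape from how the sampled summaries are spread, not from any individual event time.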
Posterior probability of migration times
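The posterior probability for the genealogy includes that for the migration times, tm,
$$h(t_m, n, \Lambda \mid X) = \frac{f(X \mid t_m, n, \Lambda)\,\pi(t_m, n, \Lambda)}{f(X)} \qquad \text{(eqn 4)}$$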
where f(X|tm,n,Λ) is the likelihood, π(tm,n,Λ) is the prior of the genealogy, and f(X) is the marginal likelihood. It is noteworthy that the likelihood depends only on the topology and coalescent times of the genealogy and does not depend on the number and times of migration events (Felsenstein 1988), i.e.
$$f(X \mid t_m, n, \Lambda) = f(X \mid \Lambda) \qquad \text{(eqn 5)}$$
This raises the question of whether data can in fact contain any information about the migration times, when considered under an IM model. This can be answered by looking further at the posterior distribution. Substituting eqn 5 into eqn 4, writing π(tm,n,Λ) = π(tm,n|Λ)π(Λ), and noting that h(Λ|X) = f(X|Λ)π(Λ)/f(X), the posterior becomes
$$h(t_m, n, \Lambda \mid X) = h(\Lambda \mid X)\,\pi(t_m, n \mid \Lambda) \qquad \text{(eqn 6)}$$
This shows that the posterior distribution for the times of migration depends on the posterior for the topology and coalescent times h(Λ|X) and on the conditional prior π(tm,n|Λ) (eqn 6). It can be seen that the most likely migration times are supported by the data indirectly through the posterior of the topology and coalescent times, i.e., the most likely Λ induce a change in the prior of migration timing π(tm,n|Λ). This demonstrates that data provide at least some information about the migration timing (eqn 6). However, as we describe later, the data inform us about the most likely values for summaries of the time intervals s, rather than about the elements of the migration time vector tm.
Nonidentifiability of genealogies
Consider two genealogies G = (tm,n,Λ) and G* = (tm*,n,Λ) that share the same coalescent times and topologies, Λ, and the same number of lineages n (implying the same number of migrations), but have different migration times, tm and tm*, respectively. Because the likelihood depends only on Λ (eqn 5) and does not depend on tm, the posterior probabilities are equal if the two genealogies have the same prior probabilities, π(tm,n,Λ) = π(tm*,n,Λ),
$$h(t_m, n, \Lambda \mid X) = h(t_m^{*}, n, \Lambda \mid X) \qquad \text{(eqn 7)}$$
As seen in eqn 2, this holds true for genealogies with the same set of summaries s. Therefore, it is possible to show that s is sufficient for (tm,n), in the sense that the posterior of the genealogy depends on s, irrespective of the particular values of (tm,n) (see Appendix I). In other words, the posterior of the migration timing (eqn 6) is fully characterized by the posterior h(s,Λ|X). This means that information provided by the data about the most likely times of migration is captured through the posterior of the summaries s. This makes sense because two of the set of summaries (fc and fm) are functions of the time intervals (eqn 1). However, the fact that these summaries are sums of counts and rates of events across loci introduces an identifiability problem. The reason is that we can estimate the most likely values for the sums given the data, h(s,Λ|X), but we cannot expect to estimate each term of the sum. In particular, there are multiple combinations of (tm,n) for a given value of s. Therefore, we can have two or more genealogies with the same posterior probability but with different migration timing distributions. In these cases, genealogies are said to be nonidentifiable as it is impossible to distinguish them based on their posterior.
Figure 1 shows an example of this nonidentifiability using two genealogies with different migration timings. In the left panel, both migrations happen recently, whereas in the right panel, both migrations happen just after the population split. Despite having different migration times, both genealogies have the same values for the summaries s = (cc,cm,fc,fm) and for the coalescent time tc, and hence have the same posterior probabilities. As seen in Fig. 1, all genealogies with the same time interval Δt and the same tc have the same posterior, despite having different migration timings tm.
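The situation in Fig. 1 can be reproduced with a small sketch that computes the summaries from a list of events. The event encoding, the function names and the event times are hypothetical; the rate sums are assumed to take the form Δt·n(n−1) for coalescence and Δt·n for migration within each interval. Two genes are sampled in population 1, one lineage spends an interval of length 0.4 in population 2, and only the two-population period up to the split time is considered.

```python
import math
from collections import defaultdict

def summaries(n_start, events, t_end):
    """Compute s = (cc, cm, fc, fm) for one period of a genealogy.

    n_start: dict population -> number of lineages at time 0.
    events:  time-sorted list of (time, kind, src, dst); kind is "c" for a
             coalescence in src (dst unused) or "m" for a lineage moving
             from src to dst, going backwards in time.
    t_end:   end of the period (here, the split time).
    """
    n = dict(n_start)
    cc, cm = defaultdict(int), defaultdict(int)
    fc, fm = defaultdict(float), defaultdict(float)
    t_prev = 0.0
    for t, kind, src, dst in list(events) + [(t_end, "end", None, None)]:
        dt = t - t_prev
        # Accumulate rate sums over the interval that just ended.
        for j, nj in n.items():
            fc[j] += dt * nj * (nj - 1)
            for l in n:
                if l != j:
                    fm[(j, l)] += dt * nj
        # Then apply the event that closes the interval.
        if kind == "c":
            cc[src] += 1
            n[src] -= 1
        elif kind == "m":
            cm[(src, dst)] += 1
            n[src] -= 1
            n[dst] += 1
        t_prev = t
    return dict(cc), dict(cm), dict(fc), dict(fm)

def same_summaries(a, b, tol=1e-9):
    """True if counts match exactly and rate sums match within tol."""
    (cc_a, cm_a, fc_a, fm_a), (cc_b, cm_b, fc_b, fm_b) = a, b
    if cc_a != cc_b or cm_a != cm_b:
        return False
    return all(
        math.isclose(x.get(k, 0.0), y.get(k, 0.0), abs_tol=tol)
        for x, y in ((fc_a, fc_b), (fm_a, fm_b))
        for k in set(x) | set(y)
    )

n_start = {1: 2, 2: 0}
t_split = 2.0
recent = [(0.1, "m", 1, 2), (0.5, "m", 2, 1)]   # migrations near the present
old = [(1.4, "m", 1, 2), (1.8, "m", 2, 1)]      # migrations near the split
longer = [(0.1, "m", 1, 2), (0.9, "m", 2, 1)]   # different interval length

s_recent = summaries(n_start, recent, t_split)
s_old = summaries(n_start, old, t_split)
s_longer = summaries(n_start, longer, t_split)
identical = same_summaries(s_recent, s_old)
```

Sliding the pair of migration events along the branch leaves s unchanged as long as the time spent in population 2 is the same, whereas changing that duration (as in `longer`) does alter s.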
When there are multiple loci, the nonidentifiability issue is compounded because the posterior probability of all the genealogies depends on summaries that are the sums of s for each of the individual loci. Figure 2 shows an example for two loci. As can be seen, genealogies have migrations in different periods of time, which are consistent in both loci. In Fig. 2a, the two loci suggest older migration, whereas in Fig. 2b, the two loci have recent migration events. These two different cases could be interpreted as favoring alternative models of divergence, if it were possible to distinguish them. But because s is a sum over loci, given that in this example (Δt1+Δt2) = (Δt1*+Δt2*) and the coalescent times are the same, the two groups of genealogies will have the same value of s. Hence, these two groups of genealogies have the same posterior, despite the very different times of migration.
Relation between genealogy summaries and migration times
Given that some information about migration time is contained in the data (eqn 6), we wondered if some general features of the migration times are contained in s, particularly in the summary fm, which is the sum of migration rates over time intervals (eqn 1). Data sets were simulated, and the joint distribution of fm and overall measures of migration time, including the mean, minimum and maximum migration time, was recorded. Simulations were carried out under an IM model, which assumes a constant migration rate, with two sampled populations that diverged from one ancestral population, using the coalescent-based simulator implemented in SimDiv (Wang & Hey 2010). Data sets were generated with a fixed set of parameter values (θ1 = θ2 = θA = 5.0, m1→2 = m2→1 = 0.5 and tsplit = 2.0), varying the sample size in each population, n = (2,10,100). If genealogies contain information about these overall measures of migration time, then we would expect to see a correlation with fm. However, as shown in Fig. 3, this was not observed. Regardless of sample size, fm shows only a weak association with the mean, minimum or maximum of tm. The Spearman's rank correlation coefficients were low, ranging from 0.09 to 0.12 for the mean, from 0.07 to 0.10 for the maximum, and from 0 to 0.05 for the minimum. Similar results were obtained for fc (not shown). These results suggest that we cannot expect to estimate these features of tm.
Strasburg & Rieseberg (2011) demonstrated with simulations an identifiability problem for migration timing. Here, we explain the underlying basis of their findings in terms of the calculation for the probability of genealogies. When using the coalescent to calculate the probability of genealogies under a model with migration, such as the IM model, the probability of a genealogy depends only on a modest set of summaries s = (cc,cm,fc,fm) (Hey & Nielsen 2007), which means that genealogies that differ in their times of migration can have the same values for s. This implies that genealogies with different migration timings can have the same posterior probability and that the migration timings are statistically nonidentifiable. Investigators cannot expect to be able to estimate migration times for the purpose of discerning models of population or species divergence where gene flow varies through time.
This is a general result applicable to genealogies under neutral demographic models that include migration and that depend on the coalescent theory. We thus expect that migration timing estimates obtained with programs such as mdiv (Nielsen & Wakeley 2001), IMa (Hey & Nielsen 2004, 2007), Lamarc (Kuhner et al. 1998; Kuhner 2006) and Migrate (Beerli & Felsenstein 1999) will suffer from this limitation. It is noteworthy that the nonidentifiability of migration timing does not introduce any bias in the estimates of the demographic parameters, such as the effective sizes and migration rates, because the summaries capture all the genealogical information needed to estimate the posterior of the parameters (eqn 3) (Hey & Nielsen 2007; Hey 2010).
Previous studies have reported a wide range of shapes for the posterior distribution of migration timings, including cases suggesting recent migrations, old migrations and/or complex multimodal distributions (e.g. Niemiller et al. 2008; Strasburg et al. 2008; Nadachowska & Babik 2009; Carneiro et al. 2010). The presence of a peak and of variation in the number and location of peaks in the posterior distribution lends the appearance that these distributions are informative. However, this is misleading as the estimated posterior densities for migration times are mostly a function of (i) the prior distribution of migration times and (ii) the nonidentifiability problem. Unlike the prior distributions for the migration rates that are usually uniform and specified by the investigator, the prior distributions for the migration times are induced by the model assumptions. In a model with constant gene flow, the prior distribution for the migration times is not expected to be uniform, but rather a decreasing function with a peak close to zero. The reason is that the number of migration events is proportional to the number of lineages in each population at any instant, and given that the number of lineages decreases going backwards in time owing to coalescent events, most migrations are expected to occur recently. This may explain some of the results found suggesting recent migration. In addition, the effects of the nonidentifiability on the posteriors arise because of the fact that the summaries s are sufficient (eqn A.1) and sums of functions of the migration and coalescent times (eqn 1). Given a particular data set, the most likely values for the summaries s impose strong correlations on the migration times tm. The shape of the posteriors is thus a function of the correlations between the migration times, which depend on the information contained in the data about the values of the summaries. 
This is influenced by the properties of each particular data set, such as the sample sizes, sequence lengths and number of loci, as well as the priors specified for the demographic parameters. As a consequence, the posteriors can have complex shapes, including distributions with multiple peaks. In any case, the fact that the times of migration are nonidentifiable implies that the posterior distributions do not have the desirable property of identifying the correct times of migration. Thus, irrespective of their shape, these posterior distributions are not useful for estimating the times of migration.
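The decreasing shape of the induced prior on migration times can be illustrated with a short simulation. This is a sketch under simplifying assumptions that are ours, not part of the IM model: a single unstructured population (coalescence is not slowed by population structure), a constant per-lineage migration rate m, and a pairwise coalescence rate of k(k−1)/θ while k lineages remain. Function names are hypothetical; the parameter values echo those used in the simulations above.

```python
import random

def coalescent_times(n, theta, rng):
    """Cumulative coalescence times for n lineages in one population,
    with total rate k*(k-1)/theta while k lineages remain."""
    t, times = 0.0, []
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / theta)
        times.append(t)
    return times

def lineages_at(t, times, n):
    """Number of lineages still present at time t (backwards in time)."""
    return n - sum(1 for tc in times if tc <= t)

def mean_migration_intensity(n, theta, m, checkpoints, reps=2000, seed=1):
    """Monte Carlo mean of m * (number of lineages) at each checkpoint."""
    rng = random.Random(seed)
    totals = [0.0] * len(checkpoints)
    for _ in range(reps):
        times = coalescent_times(n, theta, rng)
        for i, t in enumerate(checkpoints):
            totals[i] += m * lineages_at(t, times, n)
    return [x / reps for x in totals]

checkpoints = [0.0, 0.5, 1.0, 2.0]
intensity = mean_migration_intensity(n=10, theta=5.0, m=0.5, checkpoints=checkpoints)
```

Because the number of lineages can only decrease backwards in time, the expected migration intensity is highest at the present and declines toward the split time, which is why a peak near zero can appear in migration-time distributions without any signal from the data.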
The initial motivation for looking at the posterior of migration timing was to infer variation in gene flow through time (e.g. Won & Hey 2005). As noted by Strasburg & Rieseberg (2011), cases in which the migration rates vary through time violate the assumptions of the basic IM model. We can envision at least two possible approaches to modelling variable migration rates explicitly. One is to assume that migration rates vary through time following some deterministic function, e.g., exponential change, the parameters of which are estimated from the data along with other parameters. Another possibility is to include in the model more migration parameters, each associated with a distinct time period (e.g. as used in simulations by Becquet & Przeworski 2009). In the simplest case of an IM model with two sampled populations, there would be two migration periods, each with its own migration rates, as well as an additional parameter for the time at which migration rate changed. However, this approach increases significantly the number of parameters of the model, and it is possible that a large amount of additional data would be required for estimation.
We thank three anonymous reviewers for their comments. This work was supported by the National Science Foundation (NSF) grant DEB-0949561 and by National Institutes of Health (NIH) grant GM078204 to J.H.
J.H. conducts empirical and theoretical genetic research on diverse problems in speciation and evolutionary genetics. A.G. and V.C.S. are postdoctoral fellows at the Hey lab working on the population genetics of diverging populations and development of statistical methods.
Appendix I
Here, we demonstrate that the summaries of the genealogy s = (cc,cm,fc,fm) are sufficient for the migration timing tm and number of lineages n. This is analogous to demonstrating that a given statistic is sufficient for the parameters of a model. Note that by definition, a statistic is a function of the data, whereas we are dealing with functions of genealogies. This can be shown by applying the factorisation theorem (Lehmann & Casella 1998) to the posterior
$$h(t_m, n, \Lambda \mid X) = p(t_m, n \mid \Lambda, s)\, h(s, \Lambda \mid X) \qquad \text{(eqn A.1)}$$
where p(tm,n|Λ,s) is the probability of (tm,n) given the values of s, and h(s,Λ|X) is the posterior of s. Noting that h(tm,n,Λ|X) = h(Λ|X)π(tm,n|Λ) (eqn 6) and that h(s,Λ|X) = h(Λ|X)π(s|Λ), the above-mentioned equation becomes
$$\pi(t_m, n, \Lambda) = p(t_m, n \mid s, \Lambda)\, \pi(s, \Lambda) \qquad \text{(eqn A.2)}$$
Thus, showing that the prior π(tm,n,Λ) can be factorized into the two functions p(tm,n|s,Λ) and π(s,Λ), implies that s is sufficient for the posterior h(tm,n|X). The function p(tm,n|s,Λ) reflects the probability of obtaining a given configuration for (tm,n) conditional on the values of the summaries s. Note that it does not depend on the data X as required for s to be considered sufficient. Given that all genealogies that have the same corresponding values for the summaries are equally likely (eqn 2), the probability p(tm,n|s,Λ) will be proportional to the number of genealogies sharing the same values for s.
The prior π(s,Λ) is obtained by integrating over the prior probability of genealogies whose (tm,n,Λ) correspond to a given set of summaries s,
$$\pi(s, \Lambda) = \int \pi(t_m', n', \Lambda)\, \mathbf{1}_{\{s(t_m', n', \Lambda) = s\}}\, d(t_m', n') \qquad \text{(eqn A.3)}$$
where 1{c} is an indicator variable that takes the value 1 if the condition c holds true and zero otherwise. The same reasoning applies to the posterior h(s,Λ|X). Again, note that h(s,Λ|X) does not depend on (tm,n), as required for s to be considered sufficient. Given that s is sufficient and is a sum of counts and rates across periods of the genealogy and across loci, the elements of the sum (tm,n) are nonidentifiable.