Classical models of gene flow fail in three ways: they cannot explain large-scale patterns; they predict much more genetic diversity than is observed; and they assume that loosely linked genetic loci evolve independently. We propose a new model that deals with these problems. Extinction events kill some fraction of individuals in a region. These are replaced by offspring from a small number of parents, drawn from the preexisting population. This model of evolution forwards in time corresponds to a backwards model, in which ancestral lineages jump to a new location if they are hit by an event, and may coalesce with other lineages that are hit by the same event. We derive an expression for the identity in allelic state, and show that, over scales much larger than the largest event, this converges to the classical value derived by Wright and Malécot. However, rare events that cover large areas cause low genetic diversity, large-scale patterns, and correlations in ancestry between unlinked loci.
Much effort has gone into understanding the spatial structure of populations, from the collaboration between Wright and Dobzhansky during the Evolutionary Synthesis (Lewontin et al. 1981; Provine 1986), through to the many thousands of present-day studies that infer population structure from genetic variation. Knowing the history of populations is of interest in itself, and indirect estimates of dispersal and density may aid conservation. However, much work has been motivated by interest in the interaction between population structure and selection. In Wright's (1932) Shifting Balance Theory, selection acts more effectively in a structured population, whereas current efforts to detect selection and to map disease alleles depend on understanding the null model, of neutral variation in a structured population (e.g., Nielsen et al. 2007; Coop et al. 2009).
Yet, despite this long-standing interest, models of population structure are almost entirely limited to arrays of demes of constant size, and connected either with no spatial structure (the island model; Wright 1931), or in a one- or two-dimensional lattice (the stepping-stone model; Malécot 1948; Kimura and Weiss 1964). The evolution of both allele frequencies and of genealogies is well understood for these kinds of model (e.g., Charlesworth et al. 2003; Wakeley 2008). Yet, as we explain below, they fail to account for key features of genetic variation in nature. Moreover, there are fundamental difficulties in applying even the classical models to populations that live in continuous two-dimensional habitats, as we explain below (Felsenstein 1975; Nagylaki 1978; Barton et al. 2002). The lack of adequate theoretical models has forced phylogeographic analysis of long-term population history to be almost entirely qualitative (Knowles and Maddison 2002; Hey and Machado 2003; Avise 2004).
Of course, we could model an arbitrary population structure by specifying the population size and rates of migration over time. However, we hardly ever have this detailed information, and cannot hope to estimate it all from genetic data. Thus, we must try to condense a plethora of parameters that describe detailed and realistic models, into a few effective values describing the key features of the population. For example, in a panmictic population, the short-term rate of random drift is summarized by the effective population size, Ne, which depends on sex ratio, variance in fitness, and so on. In a spatially continuous habitat, the deterministic effects of gene flow are accurately approximated as a diffusion process with rate σ2, the variance in distance between parent and offspring along some axis—a single parameter that describes the distribution of distances moved by genes. However, it is difficult to extend this spatial diffusion approximation to include random drift. In one dimension, the approximation is well-behaved, and predicts smooth changes in neutral allele frequency, correlated over distances of , where μ is the mutation rate (Malécot 1948; Nagylaki 1978). In two dimensions, however, drift produces significant local fluctuations that cannot be described by diffusion. In a stepping-stone model, with fixed deme sizes, a diffusion approximation is accurate over all but very small scales (Malécot 1948; Kimura and Weiss 1964; Barton and Wilson 1995; Barton et al. 2002). However, fluctuations in population size in space and time, which are inevitable in a continuous population, may substantially alter the effective diffusion rate and the effective density. Indeed, it is difficult to find any analytically tractable model of a two-dimensional population with continuous space, that could serve as a null model from which we could define “effective” parameters. As Felsenstein (1975) showed, one cannot assume both a uniform density and independent movements: fitness has to decrease with local density if the population is to be stable, which necessarily leads to correlations between the movements and reproduction of nearby individuals. We incorporate these in a way that is tractable both forwards and backwards in time, giving a model that can be used to define effective parameters that describe a wider range of more realistic schemes.
A further difficulty with using the diffusion approximation to describe gene flow in a spatial continuum is that it is not clear how to follow the ancestry of lineages, traced backwards in time. The coalescent process describes this ancestral process in a panmictic population, and yields simple formulae for the distribution of coalescence times, and hence of allele frequencies. Moreover, it allows rapid simulation of just the sampled lineages, rather than of the entire population. The coalescent extends naturally to a population that is clustered into demes of given size (Hudson 1990; Cox and Durrett 2002; Zähle et al. 2005), but it is not known how to extend it to a two-dimensional spatial continuum, or to model fluctuations in population size. It would be natural to assume that lineages move in a random walk, with some probability of coalescence when they are sufficiently close; the common ancestor would be at a random location, centered on the midpoint between the coalescing lineages. Indeed, Wright (1943) proposed essentially this model for isolation by distance, long before the coalescent process was defined mathematically. Unfortunately, this simple model is inadequate. First, as noted above, regulation of population density implies some correlation in the movements of nearby lineages. Second, this naive model is inconsistent. If a large genealogy is followed in two dimensions, then the distribution of two lineages chosen from within it is not the same as the distribution of two lineages, modeled alone (Fig. 1B). This is because coalescence with an unsampled lineage causes a jump to the common ancestor, which would not be seen if we only followed one of the lineages involved. Thus, including more lineages in the sample changes the pattern along any one lineage by inducing more jumps in it. Yet, an algorithm that describes the ancestry of a sample is essential if we are to simulate evolution in populations of a realistic size. For example, a population that lives on a range ∼103σ across will contain ∼106 neighborhoods, which can only be simulated with a reasonable speed if there are very few individuals per neighborhood. Individual-based simulations of the whole population are therefore constrained to unreasonably small neighborhood sizes or range sizes.
Classical models of diffusive gene flow fail in (at least) three ways. First, they cannot explain patterns of allele frequency or genealogical relationship over large spatial scales, simply because diffusion is too slow to move genes far enough. A gene will typically diffuse a distance in t generations. Because many species may occupy ranges at least 103σ across, diffusion across the range will take ∼106 generations—and yet we know that at least at high latitudes, ranges have only been stable since the last Ice Age, for ∼104 years at most. To put this another way, neutral variation, in a balance between gene flow, mutation and drift, is predicted to fluctuate over scales of , where μ is the mutation rate. Thus, if mutation is ∼10−6 per gene per generation, patterns cannot spread over scales of more than ∼103σ, even if the population remained stable for that long. (Note that long-range movement of individuals will not greatly alter this argument. Provided that the standard deviation of the distribution of individual movements is small relative to the species' range, the diffusion approximation will still hold over large enough scales. Even in the extreme case in which a small fraction m move over distances comparable with the species' range, this will simply act as an additional damping force, which pulls fluctuations toward the overall mean, and reduces the spatial scale to ; Wright 1943; Malécot 1948).
Second, we often observe patterns over very large spatial scales, and indeed, these are the necessary material for phylogeographic analysis. For example, Barbujani and Sokal (1990) identify zones in Europe at which allele frequencies in human populations change at multiple loci; these correspond to language boundaries, and reflect large-scale movements of whole populations. Yet, in the classical models, genes diffuse independently of each other, and so patterns at different genetic loci will also be independent. Thus, if genes really did diffuse independently of each other, genealogies at different loci would also be independent, and could not be used to infer a common population history in any simple way. Conversely, correlations across loci, in allele frequencies or in genealogy, indicate demographic events that affect the population as a whole (Barton and Wilson 1995; Eldon and Wakeley 2008). If all the individuals in an area descend from a few high-fitness founders, then any alleles carried by those founders will tend to spread over the same area. Such correlations across loci are not captured by classical models of individual reproduction.
Finally, genetic diversity in abundant species is far lower than expected from census numbers (Lewontin 1974; Gillespie 1991): genetic diversity does increase with population size, but far more slowly than predicted by the simple neutral theory (Nevo et al. 1984; Lynch and Conery 2003). For example, the effective population size in humans is estimated to be ∼104 or less (Takahata 1993; Tenesa et al. 2007), and ∼106 in Drosophila melanogaster (Li et al. 1999), both of which are far lower than census numbers. This discrepancy may be explained by multiple selective sweeps, such that diversity is limited primarily by the extent of hitchhiking (Maynard Smith and Haigh 1974; Gillespie 2000). Alternatively, diversity can be limited by recurrent population bottlenecks, involving part or all of the species' range; the long-term effective size then depends on the number of founders involved in each event, rather than the average census size (Nei 1987; Whitlock and Barton 1997; Eldon and Wakeley 2006, 2009). Regardless of whether diversity is limited by selective sweeps that affect only part of the genome, or genome-wide demographic effects, the classical model of local diffusion and steady density cannot apply.
We begin with a model in which individual reproduction depends on extinction/recolonization events over a range of scales. We then describe a backward process that gives the ancestry of a sample drawn from such a population; unlike the standard coalescent, this allows simultaneous mergers among more than two lineages. This backwards process is an approximation to the forwards model and corresponds exactly in the limit of high population density (the exact backwards process for an arbitrary population density can be written down, but is rather cumbersome). Even though density is high, this limiting process still allows coalescence, because reproduction is correlated across appreciable areas. We introduce recombination into the model, and show that when reproduction in drastic extinction/recolonization events is correlated over large areas, it leads to correlations between patterns at different genetic loci. This distinguishes our model from those that allow only independent reproduction and local movement. We give an integral equation for the probability of identity, which leads to the distribution of coalescence times, and the long-term rate of decay of genetic variability across the whole population.
Although our basic model is idealized, it is open to a variety of extensions, and it captures key features of two-dimensional populations: multiple mergers of ancestral lineages, large-scale structure, diversity much lower than expected from census numbers, and correlations across loci.
FORWARDS MODEL FOR INDIVIDUAL REPRODUCTION
Imagine a population of haploid individuals, scattered over a continuous two-dimensional range in a uniform Poisson distribution with density ρ (Fig. 2A). Reproduction and dispersal occur solely as a result of extinction/recolonization events, which occur at random times and places, at a uniform rate Λ. Each event has an effect that declines with distance from its centre: if an event is centered on z, then individuals at x have probability u(z, x) of dying (Fig. 2C). Throughout, we take u(z, x) to be a function only of the distance |z−x|, where x, y, and z are two-dimensional vectors. We will focus on a Gaussian model, u(z, x) =u0 exp(−|z−x|2/(2θ2)); this has a maximum u0 between 0 and 1, and spatial variance θ2. Integrating over the location of the event, z, we see that the rate at which an individual is killed is . (Note that Λ has dimensions distance−2time−1, and θ has dimensions distance. So, the net rate of mortality has dimensions time−1.)
The population is restored by scattering new individuals randomly, with density ρu(z, x) (Fig. 2D). Crucially, the number of new individuals does not depend on the number that were originally in the patch. Thus, the population is immediately restored to a uniform Poisson density (i.e., the numbers in any area following a Poisson distribution), ρ. One might think of a population of plants, where there is always an individual at each suitable location, and such locations are randomly scattered, so that the population always follows a uniform Poisson density. In a sense, we are strictly enforcing the required population density, just as in the classical stepping stone model. However, the population now follows a random distribution that changes with every generation: the realized population is not regulated deterministically.
The final step is to assign parents for the new individuals (we can think of offspring “choosing” their parents just as in the Wright–Fisher model). We begin by considering a single genetic locus, so that we can ignore recombination, and can assign a single parent to each individual. In the simplest and most extreme case, all of the new offspring would come from a single parent, who is chosen randomly from the individuals near the event, and living immediately before it (Fig. 2B). (Thus, a parent at y will be dead with probability u(z, y) and would survive to live alongside its offspring with probability 1 −u(z, y).) We choose parents by weighting an individual at y by v(z, y), where v is a decreasing function of |z−y|. In the Gaussian model, we choose v as a Gaussian density with variance α2θ2; if α > 1, then parents are drawn from a wider region than that which suffers mortality, and conversely if α < 1. On average, in a population with high and uniform density, the distribution of locations of a parent will follow v; however, with finite density, the distribution of parental locations will depend on where the individuals happen to be. In particular, if the region around the event happens to be empty, then parents must be drawn from further afield.
Under this model, if we follow a single lineage forwards in time, we see a series of jumps, at exponentially distributed time intervals, that occur whenever the lineage is hit by an event. Of course, only the parental lineage in any given event will survive; all other lineages involved in the event are terminated. Crucially, the movements of nearby lineages are correlated, because they may be hit by the same event.
Notice that, in this model, all reproduction and all movement are associated with large-scale extinction/recolonization events. Of course, individuals must in reality reproduce between events, and there are (at least) two ways in which we can incorporate such “local” reproduction. The first method, and the approach we take here, is to assume that individuals reproduce outside of events but that this results in negligible coalescence and dispersal. This background of reproduction does, however, have a significant effect through recombination, as we explain below. The second approach to incorporating local reproduction, and the approach taken by Barton et al. (2010), is to model reproduction outside of large-scale events through a high rate of very small events. We cannot, however, introduce independent death and reproduction of individuals without reintroducing the difficulties which our model seeks to avoid.
This process can represent concrete biological systems. For example, we can imagine trees in a rainforest, whose seeds can only germinate in open patches that occasionally become available through fires, trees falling, and so on. Then, the actual distribution of individuals is determined entirely by the extrinsically determined distribution of available habitat, and effective reproduction only occurs at “events.” However, we can also imagine a situation in which there is a lot of illicit “individual reproduction,” out with the model, that causes no significant coalescence, and only has a significant effect on recombination and (possibly) dispersal.
Other forms for the functions u(z, x) and v(z, x) could be chosen. Etheridge (2008) chooses both to be uniform within discs of radius r, and zero outside. This requires a slight difference from our model, because if a disc is empty, no parents can be found, and so no extinction can be allowed. In the Gaussian model used here, v(z, x) > 0 for all x and z, so that a parent can always be found. We have described the model in terms of events that are described by the functions u and v, occurring at a uniform rate Λ. However, it is straightforward to allow a mixture of kinds of event, of different sizes and intensity, and occurring at a rate that could be written most generally as Λ(u, v). In the disc model, we would use a distribution Λ(u0, r), and in the Gaussian model, Λ(u0, θ, α).
The corresponding backwards process is straightforward. We trace the ancestry of a set of sampled genes backwards in time, until one or more of them encounter an event. Each lineage has a chance u(z, x) of having been born in the event. If one sampled lineage at x was born, then its ancestry jumps to a random location y with density proportional to . For the Gaussian model, the rate at which a gene is hit is 2πΛθ2u0; if it is hit, the distance to the parent along any axis has variance θ2(1 +α2), and so the variance in location increases at a rate σ2= 2πΛθ4u0(1 +α2). If two or more lineages are born then all lineages jump to the same parent, which again is at a random location with distribution v(z, y). The rate at which a gene at 0, and a gene at x, become identical by descent is ; hence, we get h(0, x) =πΛθ2u20 exp(−|x|2/(4θ2)) for the Gaussian model.
This backwards process gives a density of genealogies that is very close to that for genealogies sampled from the forwards model, described above. However, it is not identical. The difficulty arises from the slight variation in density, caused when the centre of an event falls on a sparsely populated region. Thus, there is a tendency for lineages to pull toward each other, viewed forwards in time. The discrepancy between the backwards and forwards models can be measured by calculating the average distance between parents and the centre of events under both simulations. For example, when θ= 0.4 and α= 1.3 the average distance from a parent to the centre of an event is indistinguishable in forwards and backwards simulations on a torus of side 50 when the density ρ≥ 5 (not shown). Forwards and backwards simulations are conducted by maintaining a list of individuals located on a continuous plane, iteratively applying events and storing genealogical information. Spatial indexing techniques (Bentley 1990) allow us to quickly find the set of individuals involved in an event and greatly improve the efficiency of simulation. C code for simulations is available on request from JK.
FORWARDS MODEL IN THE LIMIT OF HIGH DENSITY
In the limit of high population density, the forwards process corresponds exactly to the backwards model, as defined above. If we consider a two-allele model, we can now think of an allele frequency p(x, t), which is well defined at every point. The forwards model is simple. As before, events centered on z are thrown down at random; at each point x some fraction u(z, x) of the population dies and is immediately replaced by the offspring of an individual drawn randomly from the vicinity of z. Thus, the allele frequency immediately after an event is a linear mixture of the frequency immediately before and a random value X,
The value of X is derived from the mean allele frequency around the event, ; X is 0 or 1 with probability . Necessarily, the population density ρ remains constant. (Note that random fluctuations are due entirely to extinction/recolonization events, and so the actual density is irrelevant—the usual process of sampling drift is assumed negligible. Indeed, density could vary arbitrarily).
In this model, successive events produce a mosaic of patches with different allele frequencies. If these events have sharp edges (Fig. 3A), then a correspondingly sharp mosaic will be produced. However, by allowing a range of event sizes (Fig. 3B), or events with smooth edges (i.e., with intensity u(z, x) that decreases toward the edges; Fig. 3C), smooth patterns will be produced.
IDENTITY IN STATE, F(x, μ)
To calculate the distribution of the time to coalescence of a pair of lineages, separated by x, f(x, t), we find its generating function, . This is equal to the probability of identity in allelic state of two genes, separated by x, and with mutation rate μ to new alleles, assuming an infinitely many alleles model. (If two lineages have their most recent common ancestor at time t in the past, then the probability that neither lineage has acquired a mutation in that time is exp(−2μt).) We will take the simplest case, where each event involves recolonization by a single parent, and assume events are instantaneous so that mutation within events can be disregarded.
Taking the genes to be at 0, x, and denoting F(x, μ) as F(x) for convenience, the recursion for the identity is
The first term represents the decay of identity due to mutation, μ. The second term represents the increase in identity, from F(x) to 1, that occurs if both lineages are hit by the same event, centered on z, and both are born in that event (probability u(z, 0)u(z, x)). The last term represents a jump by one of two ancestral lineages that do not coalesce. (Note that because we are considering the extreme case of a single parent, if both jumped, they would necessarily both coalesce into a single parent). In the first term, the gene at x survives, whereas that at 0 derives from a parent at y+x; thus, identity changes from F(x) to F(y). The last term corresponds to the opposite case, where the gene at 0 survives, whereas that at x is derived from a parent at y.
Because x and y are two-dimensional vectors, this equation involves integrals over two-dimensional space. We can simplify, however, by noting that all the functions depend only on distance, and not on orientation. We can therefore integrate over the direction of the vectors x, y, keeping their length |x|, |y| the same. For the Gaussian model, this leads (after some algebra) to:
and I0 is a modified Bessel function of the first kind (Abramowitz and Stegun 1972, §9.6). This integral equation can be solved numerically by approximating the continuous range by discrete points: the kernel K(|x|, |y|) then becomes a matrix that can be inverted to give the solution.
Figure 4 shows that numerical solutions of equation (3) agree well with simulations. Moreover, F(x, μ) contains complete information about the distribution of pairwise coalescence times, f(x, t). To find this distribution, one could write down a recursion analogous to (3), and solve directly, or one could take the inverse Laplace Transform of F(x, μ) numerically. Similarly, recursions for the moments of the distribution of coalescence times can be derived by differentiating equation (3) for μ small. However, the mean coalescence time diverges on an infinite range, because if nearby lineages do not coalesce although they are still close, they will almost certainly wander very far apart. In a finite population, the mean and higher moments of coalescence time are also finite, and can be found by solving the derivatives of equation (3) on a finite range. Charlesworth et al. (2003) discuss the relation between identity in state and the distribution of coalescence times in two-dimensional populations.
THE DIFFUSION APPROXIMATION OVER LARGE SCALES
Our model does not suffer from the difficulties of the classical Wright–Malécot model, because local movements are correlated—if nearby lineages coalesce, then they both jump, whereas if they do not, they are likely to come from parents that are more widely separated. However, Barton et al. (2002) showed that, even with such local correlations, the diffusion approximation can still be applied over sufficiently large scales—in the present case, over scales larger than the sizes of the largest events. The effective population density, ρe, which together with the rate of diffusion of a single lineage, σ2, predicts the identity by descent for well-separated genes, is derived from the probability of identity by descent (IBD) over a time slice that is large enough for movements of a lineage over successive time slices to be uncorrelated; the integral of this IBD over the separation of the genes gives the inverse of the effective density (eq. 15 of Barton et al. 2002). In our model, the time slice can be chosen arbitrarily; movements of a lineage over disjoint time slices are automatically uncorrelated. The rate at which genes separated by x become identical is ; integrating over x gives 1/(2ρe). For the Gaussian model, this integral gives 4π2θ4Λu20= 1/(2ρe). We saw above that single lineages, traced backwards in time, disperse at rate σ2= 2πΛθ4u0(1 +α2).
These two parameters, ρe and σ2, give the rates of dispersal and of random fluctuation, when viewed over large spatial scales. The diffusion approximation then predicts that F can be approximated by:
and K0 is a modified Bessel function of the second kind. F0 is a measure of the identity between nearby genes, which depends on the local structure (Barton et al. 2002); this is an average over F(x) for small x, and is somewhat smaller than F(0). The number of events that strike a point in time 1/2μ is πΛθ2u0/μ; the spatial scale, ℓ, is the square root of this value, multiplied by σ. It can be thought of as the distance traveled between mutational events, as a result of extinction events that hit the lineage.
Figure 4 shows that this is an excellent approximation, provided that F0 is estimated from the exact solution of equation (3). The accuracy of this approximation provides a good justification for the use of small, frequent events to model reproduction outside large-scale events (Barton et al. 2010).
THE MALÉCOT APPROXIMATION
Malécot (1948) gave a solution for F(x) over all scales, including F(0); Wright (1943) derived the same result by a different route. As discussed above, this solution depends on the assumption that local movements are independent, which is not consistent with a stable demography. However, equation (2) converges to the same form as Malécot's (1969, Chap. IX, eq. 7) when u(z, x) is small, so that the terms 1 −u(z, x) can be dropped. The Fourier transform of F(x) can then be written explicitly (Appendix A), and gives equation (4) over large scales.
For low mutation rates and large , this has the form given by Barton et al. (2002, eq. 13). From equation (A3)
γ is the Euler-Mascheroni constant and ψ is the digamma function. This approximation is in fact accurate even when mortality is at its maximum, u0= 1 (Fig. 4): although the Malécot formula is based on inconsistent assumptions, it gives solutions close to those in our model. Our model is practically indistinguishable from Malécot's if we only look at pairwise measures, and if events are of a single size that is much smaller than the species' range.
In contrast, if there is a mixture of events of different sizes, then the distribution of pairwise identities and coalescence times can differ qualitatively from the Malécot solution. Figure 5 shows an example in which there is a mixture of frequent small events and rare large events. The pairwise identity changes over the two corresponding scales, and could not be approximated well by any Malécot-type solution. However, over small scales the identity for mixed events is close to the identity for just small events; and similarly, over large scales it is close to the pattern seen with events of just one size (although now one expects the parameters in the Malécot approximation to depend on both large and small events, cf. Theorem 3.3 of Barton et al. 2010).
Our model has three spatial scales: the size of events, θ; the characteristic scale of the diffusion approximation, , and the size of the range, R. In two dimensions, the overall variability, and the rate of long-term coalescence, is determined by the effective size of the whole population, and hence by the range size, R (Maruyama 1972). However, provided that the characteristic scale and the event size are much smaller than the range, the latter has little effect on the spatial variation in identity (Wilkins  shows how boundaries affect identity in a continuous two-dimensional model). If the event size is comparable with the characteristic scale or with the size of the range, then the Malécot approximation fails.
For a single locus, our idealized model assumes no reproduction or dispersal between events. Of course, organisms must actually reproduce; we are really assuming that reproduction leads to negligible coalescence and dispersal. However, it is not reasonable to ignore recombination. Even if local dispersal is negligible, and density is high enough that coalescence outside large-scale events can be ignored, individuals will still be reproducing sexually, and so linkage disequilibria will continually be broken down. There may be many hundreds of generations between extinction events, and so for all but very tightly linked loci we may be able to assume the limiting case in which there are no linkage disequilibria within local regions. Therefore, in the limit of high density, we can follow genotype frequencies forwards in time, which will simply relax toward linkage equilibrium as a result of recombination within local regions. (In the limit of high recombination, genotype frequencies immediately after an event are derived in the same manner as eq. (1), except we adjust the newly derived genotype frequencies such that the coefficient of linkage disequilibrium, D, is approximately zero just before the next event, looking forwards in time.)
In the backwards process, we trace lineages back in time, as before. Now, the genes at different loci may come from different ancestral lineages, producing a branching as we trace back in time (Fig. 6), just as in the standard ancestral recombination graph (Hudson 1990). At the previous event, a single sampled genome will derive from a set of different ancestors (A, C in Fig. 6). Now, when an event hits this set of ancestors, some may be unaffected, while others will be born from a smaller number of ancestors; thus, lineages may jump and coalesce at each event that hits them. Thus, separate blocks of the sampled genome may descend from the same founder individual: it is this which generates correlations across loci. In the limit of free recombination, every sampled gene comes from a different ancestral lineage, but these may coalesce with each other at an event.
CORRELATIONS ACROSS LOCI
A simple consequence of any bottleneck model is that individuals will descend from some small number of founders, and will therefore carry combinations of alleles seen in those founder individuals. That is, even if loci are loosely linked, there will be a statistical relationship between the combinations of alleles carried within individuals. It is this relationship we refer to when discussing correlations across loci. Our model captures these correlations and allows us to quantify them, giving us a potentially powerful tool for phylogeography: by measuring the correlations across loci within populations we can estimate the importance of large-scale extinction/recolonization events.
Correlations, in coalescence time and in allele frequencies, across unlinked loci arise naturally in our model when we consider the effects of large events. Figure 7 gives an example of how allele frequencies are affected by a large-scale event. Imagine that the lines in Figure 7A represent the allele frequencies at two loci (a and b) in a transect across some fictional range; these frequencies vary smoothly from one end of the range to the other, reflecting spatial variation in the population. At each locus, we have two alleles, 1 and 2, and therefore four possible genotypes 11, 12, 21, 22. Imagine then that a large-scale extinction/recolonization event occurs centered near R/2, and consider Figure 7B. Here, we see that the result of this event has been to, by chance, favor the genotype 12, because the frequency of allele a=1 increases and b=1 decreases. The point here is that, after the event, there is a clear relationship between the allele frequencies.
This correlation will not show up in pairwise measures, because in a neutral model, alleles are exchangeable: an increase of any particular allele is as likely as an increase of another. We believe that the simplest measure of correlation across loci is a fourth-order association, the covariance between identities at each of two pairs of loci. If genes a1, a2 are sampled at one locus, and b1, b2 at another, then we write the chance that they are pairwise identical in state as ; the commas indicate that the four genes are in four different individuals. We define the covariance between identities at the two loci as , where are the pairwise identities between each pair of loci. This measure leads directly to the covariance between squared deviations of allele frequency at the two loci, or equivalently, to the covariance between heterozygosities. See Appendix B for further discussion on calculating analytically.
Figure 8B shows the covariance of identities between loci when there is a mixture of large- and small-scale extinction/recolonization events, and Figure 8A shows the probability of identity in state, for the same parameters.
We have defined a model in which both coalescence and dispersal are due to events that wipe out a fraction of the local population, and replace them with offspring from a small number of parents. These events might correspond to major extinctions that cause severe population bottlenecks, or to smaller events, which in the limit approximate individual reproduction and dispersal. When density is high, this forwards model describes the evolution of the spatial pattern of allele frequencies. Tracing back in time, lineages that are hit by events jump to a new location, where they may coalesce with other lineages that are hit by the same event. The distribution of pairwise coalescence times converges to the classical Wright–Malécot formula over large scales. However, if events cover large areas, relationships may extend over large distances in space, and across the whole genetic map. Thus, our model provides a mathematically tractable framework that includes many features of real populations.
The model is, of course, idealized. Modeling the evolution of a population in a spatial continuum via extinction events falling randomly on nearby individuals is a drastic simplification of real-world population dynamics. Yet, we must simplify to make progress. Few would argue that the Wright–Fisher model is realistic; it has, however, been shown to be a useful approximation of many key features of biological populations.
RELAXING THE ASSUMPTIONS
We have described the simplest version of the model, but many of its assumptions can be relaxed without difficulty. Because we assume that the population density is so high that coalescence due to individual reproduction is negligible, the actual density is irrelevant, and can vary arbitrarily. The genetic structure is determined by the rate of events of different kinds, expressed most generally as Λ(u, v). In principle, this rate could vary in space and time. We have taken the simplest case in which each event is recolonized by one parent, but as indicated above, this could be relaxed to allow multiple parents. In this way, we would generalize from the spatial Λ-coalescent (Pitman 1999; Sagitov 1999) described here, to a spatial Ξ-coalescent (Schweinsberg 2000; Möhle and Sagitov 2001).
We have not specified the bounds on the two-dimensional space, and in numerical examples, used the traditional torus to avoid edge effects. The size of the range determines the mean coalescence time and the net genetic diversity (Maruyama 1972); we expect that if the range is much larger than the size of the largest event, then the classical results will still hold. However, these will break down if events have size comparable with the species' range.
THE SIGNIFICANCE OF THE PARAMETERS
In our Gaussian model, with events of spatial variance θ2, relative parental distance α, and maximum mortality u0, the magnitude of the pairwise identity, F, and hence the distribution of pairwise coalescence times is determined by the effective neighborhood size, . This is the ratio between the rates of spatial diffusion (σ2= 2πΛθ4u0(1 +α2)) and coalescence (1/2ρe= 4π2θ4Λu20), which is, remarkably, independent of both the rate of events, Λ, and their spatial extent, θ. That is because events are responsible for both dispersal and coalescence, and so increasing their rate or their extent increases both processes in the same way (although they cancel here, Λ and θ are real parameters of the model and cannot be discarded). This cancellation would not occur if there were a range of events of different types. Rare events that cause high mortality (Λ small, θ large) would contribute mainly to coalescence, whereas common events that cause low mortality (Λ high, θ small) would contribute more to dispersal (Barton et al. 2010): such events would be unlikely to hit more than one lineage (e.g. Fig. 5).
Even without spatial structure, it is difficult to incorporate selection: the ancestral selection graph applies only to simple forms of selection, and even then is hard to apply (Krone and Neuhauser 1997). It is straightforward, however, to include selection in the forwards model at the limit of high density. As with recombination, selection might act between events, changing genotype frequencies in the usual way. It might also act at events, via either mortality or fecundity: the probability of dying in an event, u(z, x), and the probability of being chosen as a parent, v(z, x), might both depend on genotype. Moreover, the shape of each function could differ between genotypes. Selection can also act on the whole population, through a dependence on genotype frequencies, rather than on individual genotype. The rate of events, Λ, could depend on the local genotype frequency, as could the mortality and fecundity (u, v). For example, the chance of an extinction might depend on population density or resource use, which in turn would in general depend on genotype frequencies. It would be interesting to compare the efficiency of selection, relative to drift, at these three levels: on reproduction between events, on individuals at events, and on the whole population at events.
DETECTING EXTINCTION AND RECOLONIZATION
The idea that extinction and recolonization might be detected through correlations in patterns across loci was suggested by Sokal and Wartenberg (1983), but has not been taken up explicitly. Correlations across loci are implicit in phylogeographic analysis and in admixture models: multiple loci are assumed to be affected in similar ways by their shared history. However, such correlations are rarely tested statistically (Knowles and Maddison 2002): often, population history is inferred from the genealogy of mitochondrial DNA, with the untested assumption that other loci should show similar patterns. One difficulty is that in most sexual organisms, there is too much recombination relative to mutation to allow genealogies to be reconstructed from nuclear sequence. Thus, explicitly genealogical analysis is restricted to, at best, mitochondrial DNA and NRY chromosomes, which are inherited through different sexes and so cannot provide even minimal replication.
Our model predicts that large events will cause multiple simultaneous coalescence, both within and between loci. If one knew the actual genealogies, with accurate times of coalescence, then the standard coalescent could be rejected through detection of such simultaneous coalescence. However, mutation rates are too low for this to give accurate resolution, especially in nuclear genomes where recombination and mutation occur at comparable rates. If we have genome-wide data, recombination itself may provide a more powerful way to detect large-scale events. For example, Hellenthal et al. (2008) distinguish specific historical events, using a hierarchical model in which a diverse ancestral population goes through successive bottlenecks; here, dates are estimated relative to recombination, not mutation. Our model provides an alternative framework for making inferences based on patterns of recombination across the genome.
Use of our model in data analysis would require a better understanding of what effective parameters are to be studied. It seems reasonable to begin by assuming high rates of recombination between loci, and negligible coalescence and movement outside large-scale events—at least, as a null model. Even then, however, we need to estimate the joint distribution of event intensities and sizes, assuming a fixed relation between the area of extinction and of recolonization, α. It would be helpful to know what features of this distribution are important, and what could in principle be estimated from data. It seems to us implausible that the full joint distribution, or even the marginal distributions of θ and u0, could both be estimated. Thus, we need to find simpler models that can describe a range of actual situations, and we need to know what features of the distributions are important. In principle, we could make inferences by simulating the random distribution of events, or even a random distribution of parameters, conditional on the observed data.
LOCAL MOVEMENT AND REPRODUCTION
Our model assumes that all movement and all reproduction are associated with extinction and recolonization. That is a reasonable approximation if genes diffuse a negligible distance between extinction events, and if the population density is so high that most coalescence is due to recolonization. Of course, individuals can reproduce: we simply assume that reproduction between events has negligible consequence for allele frequencies and for genealogies at single loci.
Nevertheless, these assumptions might not hold, and we might wish to introduce a process that describes local diffusion and local coalescence. We could readily add a Brownian motion of lineages, which would correspond to a diffusion of allele frequencies. However, we cannot allow lineages to coalesce when they come near to each other, without introducing the difficulties of the classical models. Instead, we allow a high frequency of small events, which leads to a process that is equivalent to diffusion, but which properly describes the correlated movements involved in local coalescence (Barton et al. 2010).
DISTINGUISHING GENE FLOW FROM HISTORY
A fundamental difficulty in interpreting spatial divergence is to distinguish history from current processes: at one extreme, patterns may have been laid down in the distant past, whereas at the other, they may reflect a statistical equilibrium between present-day processes. Often, one or other scenario is assumed, and the parameters estimated (time since divergence, or average numbers of migrants, say), but no attempt is made to test the relative plausibility of the qualitatively different models. Nested clade analysis (Templeton 1998, 2004, 2008, 2009) provides a formal procedure for inference from genealogies, and purports to distinguish between isolation by distance, population expansions, and so on. However, it performs poorly when tested against simulated data (Panchal and Beaumont 2007). Quantitative methods are being developed that can in principle distinguish current gene flow from allopatric divergence (Hey and Nielsen 2004; Becquet and Przeworski 2009). However, these require that divergence has been occurring for long enough that mutations have arisen which can date the timing of introgression: have novel mutations, found mainly in one population, moved across to the other? Without that mutational clock, it is not clear that the processes can be distinguished—though possibly, recombination might provide a clock instead, via the size of introgressing blocks (e.g., Hellenthal et al. 2008). Our model is relevant to this problem, in that it includes both elements: drastic extinction events may produce regions with distinct combinations of alleles (as assumed in phylogeographic models), while more frequent but smaller events may lead to a statistical equilibrium over local scales, as in the classical Wright–Malécot models.
There is an interesting general issue here: when can we make inferences about specific historical events, rather than estimating parameters that determine the rate of those events? On the one hand, there may have been only a few relevant events: although we may be able to detect these specific events, it will be impossible in principle to estimate their rate. On the other, if there have been very many local events, these would not be comprehensible even if we could detect them: we are then forced to make a more compact description in terms of the parameters of some model. The most compact description, and what can be inferred, may either be a list of specific events, or a set of model parameters: indeed, in our model we may have a mixture of kinds of event, some of which must be treated individually, and some in aggregate.
Associate Editor: M. Doebeli
This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF). The ECDF is partially supported by the eDIKT initiative. NHB is supported in part by EPSRC Grant EP/E066070/1; JK is supported by EPSRC Grant EP/E066070/1; and AME is supported in part by EPSRC Grant EP/E065945/1.
THE MALÉCOT APPROXIMATION
In the main text, the recursion for pairwise identity was integrated to give functions of radial distance (eq. 3), which simplifies numerical solution. However, it is easiest to obtain an analytical approximation by leaving the identity as a function of the vectors x, y: integrating over the angle between them introduces a troublesome Bessel function. In the limit of widely separated lineages, the last term in equation (2) simplifies to a convolution. We show that our formula for pairwise identity (2) is close to Malécot's when mortality is low.
The full equation can be simplified by dropping the factors (1 −F) in the second term, and the factor (1 −u) in the last term. Denoting this simpler solution by F*:
where . Taking Fourier transforms , the convolution becomes a product (i.e., ):
where the FT of u(x) at ω= 0 is denoted , and equals ; because . Hence
Substituting for the Gaussian model, and . Then, the identity between nearby genes is approximately :
where . For low mutation rates (ϕ≪ 1), and splitting the integral in (A1) into components for small and large z, this is approximately:
where γ is the Euler-Mascheroni constant and ψ is the digamma function.
We see that identity depends primarily on u0/(1 +α2), which is independent of θ2, Λ. That is because both dispersal and drift depend on the same process, and so the rate and spatial extent of this process cancel. Note also that this approximation assumes F0≪ 1, because we ignored the factor 1 −F0. For large spatial scales (i.e., for |ω| → 0), the FT is approximately
as |ω| → 0. This transforms back to
where the spatial scale is set by , just as in the classical Malécot solution.
In the main text, we defined the covariance in identities across two loci as , where are the pairwise identities between each pair of loci. The identities in state are also generating functions for the joint distribution of coalescence times, : , where μa, μb are the mutation rates. Thus, the covariance between coalescence times is , evaluated at μa=μb= 0 (McVean 2002). It is easy to see why an event that affects all four genes will generate a covariance in coalescence time, even when they are unlinked, and in different individuals.
In general, the identities depend on how the four genes are grouped into individuals (Cockerham 1954). We must distinguish cases in which genes (a1, b1) and (a2, b2) are grouped into two haploid individuals, (a1), (b1), (a2, b2) into three individuals, and so on (Fig. B1). It is straightforward to write recursions for the various identities. In the limit, where recombination is much faster than other processes, ancestral lineages rapidly scatter into different individuals, and so we need only track identity among two pairs of genes in four separate individuals, . Even so, it is not clear that this simpler recursion can be solved even numerically, because there are so many degrees of freedom. Instead of having a function of distance, we have in a function of four two-dimensional vectors, which has five degrees of freedom, even after allowing for the symmetries of translation and rotation. However, it is possible to find an explicit solution when events kill only a small fraction of the population (i.e., when u0≪ 1 in the Gaussian model). Because it is generated by events that affect all four lineages, the covariance is proportional to u40, and so is small relative to the pairwise identities, which are proportional to u20. The covariance changes over the same spatial scale as does the pairwise identity, , but falls away more steeply with distance. This is because it depends on the random movement of four lineages rather than just two, and so disperses more quickly.