When new advantageous alleles arise and spread within a population, deleterious alleles at neighboring loci can hitchhike alongside them and spread to fixation in areas of low recombination, introducing a fixed mutation load. We use branching processes and diffusion equations to calculate the probability that a deleterious allele hitchhikes and fixes alongside an advantageous mutant. As expected, the probability of fixation of a deleterious hitchhiker rises with the selective advantage of the sweeping allele and declines with the selective disadvantage of the deleterious hitchhiker. We then use computer simulations of a genome with an infinite number of loci to investigate the increase in load after an advantageous mutant is introduced. We show that the appearance of advantageous alleles on genetic backgrounds loaded with deleterious alleles has two potential effects: it can fix deleterious alleles, and it can facilitate the persistence of recombinant lineages that happen to occur. The latter is expected to reduce the signals of selection in the surrounding region. We consider these results in light of human genetic data to infer how likely it is that such deleterious hitchhikers have occurred in our recent evolutionary past.
The first generation of evolutionary models of advantageous alleles focused on the dynamics of single selected loci in isolation from surrounding sites (Haldane, 1927; Fisher, 1930). Hill and Robertson (1966) demonstrated, however, that selection acting at one site in finite populations interferes with the efficacy of selection at surrounding sites, hampering the spread of neighboring beneficial alleles, even in the absence of fitness interactions among the sites. As pointed out by Hill and Robertson (1966), Charlesworth et al. (1993), and generalized by Rice (1999), selection on linked sites reduces the effective number of lineages contributing to future generations to those lineages with the highest fitness. Such genetic bottlenecks increase the power of drift relative to selection, such that advantageous alleles are less likely to spread and they spread more slowly than predicted by their direct effects on fitness. In a general analysis by Barton (1995), Hill–Robertson interference was shown to reduce the fixation probability of beneficial alleles linked to other selected sites. Breaking down interference among selected loci has also been shown to favor increased rates of sex and recombination (Otto and Barton, 1997; Barton and Otto, 2005; Roze and Barton, 2006).
In addition to affecting neighboring loci under selection, Maynard Smith and Haigh (1974) showed that the dynamics of a single selected locus impacts surrounding neutral loci. In particular, an advantageous allele sweeping through a population reduces, on average, the genetic variance around the site of a sweep (see also Thomson 1977). This phenomenon provides a mechanism for detecting regions experiencing selection, forming the basis for the Hudson-Kreitman-Aguade (HKA) test (Hudson et al., 1987), for example.
Relatively little attention has been paid, however, to the effect that selection on neighboring sites might have on the net fitness change associated with the fixation of a focal beneficial allele and on the patterns of variation at surrounding selected sites (see, e.g., Yu and Etheridge (2010) regarding beneficial alleles segregating in the background, and Hadany and Feldman (2005) regarding deleterious alleles in the background). In this article, we consider a focal site carrying a new beneficial allele in the presence of neighboring sites subject to deleterious mutations. We calculate the chance that a linked deleterious allele hitchhikes to fixation along with the beneficial allele, as a function of the rate of recombination between them, and describe the implications for patterns of variation expected within the region of a selective sweep. This work builds upon a recent simulation study by Hadany and Feldman (2005), as well as complementary analytical work for asexual organisms (Johnson and Barton, 2002; Bachtrog and Gordo, 2004; Yu and Etheridge, 2008; Yu et al., 2010). Specifically, Hadany and Feldman (2005) demonstrated that beneficial alleles sweeping to fixation in a purely asexual population often carry along linked deleterious alleles. The fixation of deleterious alleles by hitchhiking generates a fixed mutation load that must await a future adaptive sweep by a back or compensatory mutation in order for it to be erased. Our work provides an analytical prediction of the probability of such undesirable hitchhikers, allowing for arbitrary rates of recombination between the sites under selection.
Empirical Background: Recent studies of amino-acid substitution data suggest that advantageous mutants are present at higher rates than previously assumed. Although precise values remain a matter of debate (Eyre-Walker, 2006), Bierne and Eyre-Walker (2004) estimated that approximately 45% of amino acid substitutions are adaptive in Drosophila melanogaster, equating to one substitution, on average, every 450 generations. Later studies have found that between 30 and 60% of substitutions in D. melanogaster coding and noncoding regions are adaptive (Andolfatto, 2005; Obbard et al., 2009; Andolfatto, 2007; Shapiro et al., 2007), highlighting the prevalence of beneficial mutation. Similar values have been observed in the wild mouse Mus musculus castaneus (Halligan et al., 2010). In hominids this rate tends to be lower; Boyko et al. (2008) and Eyre-Walker and Keightley (2009) found that on average 5% of amino-acid substitutions were adaptive if recent population bottlenecks were taken into account.
Another method to detect the presence of advantageous mutations is through investigating the underlying distribution of fitness effects among mutations. Using such a method Shaw et al. (2002) suggested that half of all mutations in Arabidopsis thaliana increased fitness (although see Keightley and Lynch (2003)). Even in fairly laboratory-adapted strains of Saccharomyces cerevisiae, Joseph and Hall (2004) estimated that around 6% of spontaneous mutations were beneficial (see also Hall and Joseph, 2010).
The strength of selection acting on beneficial alleles is also subject to much debate and is expected to depend on the nature of past environmental changes, both biotic and abiotic (Elena and Lenski, 2003). On the lower end, Jensen et al. (2008) estimated that advantageous mutants have had a mean selection coefficient of sa ≈ 10^-4 in Drosophila. On the upper end, very large selection coefficients have been detected in experimental evolution studies with bacteria and viruses, with an average sa ≈ 2 found in Pseudomonas fluorescens exposed to a novel carbon source (Barrett et al., 2006) and sa ranging between 6 and 14 in the bacteriophage φX174 subjected to heat stress (Bull et al., 2000).
Although there is increasing evidence for the frequent spread of advantageous alleles, it is an inescapable fact that most spontaneous mutations that affect fitness are deleterious (Crow, 1970) and are maintained in populations at a low frequency by recurrent mutation (Wright, 1931). These mutation rates can be substantial; for example, the per-generation genomic deleterious mutation rate Ud in Drosophila has been estimated at 1.2 (Haag-Liautard et al., 2007; Keightley et al., 2009), with estimated rates of Ud of around 4.2 in hominids (Eöry et al., 2010). Deleterious mutation rates are lower in microbes, however. In nonmutator strains of yeast, Hill and Otto (2007) estimated Ud= 0.013 for mutations acting on sporulation ability and Ud= 0.0003 for those affecting growth rate.
If selection acts against deleterious mutations with a coefficient of sd, then we would expect a total of ∼Ud/sd mutations to segregate within a population at mutation–selection balance (ignoring genetic associations among them). Even when Ud is less than one, the expected number of deleterious mutations carried by an individual may be much greater than one. Consequently, newly arisen advantageous alleles may occur within chromosomes also bearing deleterious alleles nearby. In the next section, we develop a model that describes the fate of a deleterious mutation that occurs in the genetic background of a novel beneficial allele. We later return to estimates of mutation rates and selection coefficients to assess how likely it is that deleterious alleles hitchhike to fixation, and how this depends on the mode of reproduction and the effective rate of recombination within a species.
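The Ud/sd approximation can be illustrated with a few lines of arithmetic. The sketch below uses the Ud estimates quoted above; the sd values are purely illustrative assumptions, and the function name `expected_segregating` is hypothetical.

```python
def expected_segregating(u_d: float, s_d: float) -> float:
    """Approximate number of deleterious mutations segregating at
    mutation-selection balance, U_d / s_d (associations ignored)."""
    if s_d <= 0:
        raise ValueError("s_d must be positive")
    return u_d / s_d

# U_d values as quoted in the text; s_d values are illustrative assumptions.
genomic_rates = {
    "Drosophila": 1.2,
    "hominids": 4.2,
    "yeast (sporulation)": 0.013,
}

for s_d in (0.01, 0.001):
    for name, u_d in genomic_rates.items():
        load = expected_segregating(u_d, s_d)
        print(f"{name:20s} s_d = {s_d:<6} U_d/s_d = {load:.1f}")
```

Even for the yeast estimate, where Ud is far below one, a weak selection coefficient gives an expected count above one, which is the point made in the paragraph above.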
We first present a semi-deterministic calculation of the fixation probability of a haplotype carrying both an advantageous and a deleterious allele using classic population genetics. In the next section, we build a stochastic diffusion model of the appearance and spread of this haplotype, but the calculations presented in this section help to develop an understanding of the key forces at work and so are a natural first step in investigating this problem.
We consider a finite population of N haploid chromosomes with discrete generations, using a standard Wright–Fisher model (Fisher, 1930; Wright, 1931). We are interested in the dynamics of a newly arisen beneficial allele at a locus A. The genome in which A first arises may carry one or more deleterious alleles. Deleterious alleles that are only loosely linked to locus A are unlikely to rise substantially in frequency and are ignored. We focus only on the single most closely linked of these deleterious mutations and call this second locus B, with recombination between A and B occurring at rate r. At locus A, the advantageous allele A1 has a selective advantage sa over the wild-type allele A0. At locus B, the deleterious allele B1 is selected against with selection coefficient sd, relative to the wild-type allele B0. We assume sa > sd, so that the advantageous-deleterious haplotype has a net beneficial effect, snet=sa−sd. For clarity of presentation, we assume additive selection, but all of our analytical results continue to apply if sd is replaced by sa−snet, wherever it occurs.
For each haplotype, we write a 0 subscript if the wild-type allele is present at the locus and a 1 subscript otherwise, in the order AB. All possible haplotypes, along with their fitness, are given in Table 1. In particular, the advantageous-deleterious haplotype is denoted A1B1, and when this haplotype first appears, the remainder of the population is either A0B0 (wild type) or A0B1 (bearing the deleterious allele). The latter haplotype (A0B1) is assumed to be rare and is ignored in the following analysis to simplify the calculations; simulations described in a later section indicate that this assumption introduces little bias. We also assume that no further mutation occurs at either of the loci during the course of the sweep, although the model can be modified to take this into account.
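Table 1. Haplotypes and their fitnesses under additive selection.
Haplotype    Fitness
A0B0         1
A0B1         1 − sd
A1B0         1 + sa
A1B1         1 + sa − sd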
Let p(t) denote the frequency of the A1B1 haplotype, where t is the number of generations since the beneficial allele arose and p0 is its initial frequency (generally 1/N). When the A1B1 haplotype first arises, it becomes established within the population with a probability u that is approximately twice the net selection coefficient, 2snet (Haldane, 1927). It is further assumed that snet≪ 1 and that the population size is large (see next section for results that apply in smaller populations).
In the following derivation, we only consider those A1 alleles that survive stochastic loss while rare. Once established, the frequency of A1B1 can be modeled by the standard deterministic equation for haploid selection (Haldane, 1924):
Among those alleles that succeed in fixing, the trajectory of the A1B1 haplotype is slightly faster, on average, than given by equation (1) (Maynard Smith and Haigh, 1974; Barton, 1994). This initial acceleration is taken into account in the diffusion model developed below; it turns out to have little effect, however, because rare recombination events that break apart the A1B1 haplotype are most likely to occur when the A1B1 haplotype is intermediate in frequency and not when it initially occurs.
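For reference, a form of equation (1) consistent with the weak-selection dynamics dp/dt = (sa − sd) p(1 − p) used later in this section is the logistic solution. This is a sketch under that continuous-time assumption; a discrete-generation recursion differs only at higher order in snet.

```latex
\frac{dp}{dt} = s_{\text{net}}\, p\,(1-p)
\quad\Longrightarrow\quad
p(t) = \frac{p_0\, e^{s_{\text{net}} t}}{1 - p_0 + p_0\, e^{s_{\text{net}} t}},
\qquad s_{\text{net}} = s_a - s_d .
```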
Our goal is to calculate the probability, P, that the A1B1 haplotype is not broken apart by recombination before the advantageous A1 allele fixes within the population. If such a recombination event has not yet occurred, there are approximately p(t) of the A1B1 haplotypes and 1 −p(t) of the A0B0 haplotypes (ignoring the rare A0B1 individuals), so that matings between these two haplotypes occur at frequency 2p(t)(1 −p(t)). Among the offspring of these matings, r will be recombinant, half of which will carry the most fit A1B0 haplotype and half of which will carry the least-fit A0B1 haplotype. Even once produced, the most fit recombinant may fail to establish itself within the population due to chance loss while rare. In Appendix A, we use branching processes to show that the probability that a single new A1B0 haplotype establishes within the population if it appears at time t equals
The derivation of equation (2) accounts for the fact that the A1B0 haplotype has fitness 1 +sa relative to the population mean fitness 1 +p(t)(sa−sd), which is changing over time according to equation (1). As expected, if the A1B0 recombinant haplotype arises while p(t) ≈ 0, the recombinant lineage will establish with probability nearly equal to 2sa, the fixation probability of an advantageous A1 allele in an otherwise wild-type population. Also as expected, if the A1B0 recombinant haplotype arises while p(t) ≈ 1, the recombinant lineage will establish with probability nearly equal to 2sd, the fixation probability of a haplotype that has shed the deleterious allele B1 in a population that otherwise carries both A1 and B1. We call A1B0 haplotypes that succeed in establishing while rare “successful recombinants.”
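A simple closed form reproduces both limits. As a sketch (a continuous-time heuristic, not the branching-process derivation of Appendix A), take the recombinant's advantage over the population mean to be s(u) = sa − snet p(u), assume a logistic trajectory for p, and use the weak-selection approximation for establishment under time-varying selection, 2 divided by the integral of exp(−∫ s). These modelling choices are assumptions made here for illustration.

```latex
\begin{aligned}
\Pi(t) &\approx \frac{2}{\displaystyle\int_t^{\infty}
        \exp\!\left(-\int_t^{\tau} s(u)\,du\right) d\tau},
\qquad s(u) = s_a - s_{\text{net}}\,p(u), \\[4pt]
\exp\!\left(-\int_t^{\tau} s(u)\,du\right)
 &= e^{-s_a(\tau - t)}\left[\,1 - p(t) + p(t)\,e^{s_{\text{net}}(\tau - t)}\right], \\[4pt]
\Pi(t) &\approx \frac{2}{\dfrac{1 - p(t)}{s_a} + \dfrac{p(t)}{s_d}}
 = \frac{2\,s_a s_d}{s_d + s_{\text{net}}\,p(t)} .
\end{aligned}
```

Setting p(t) = 0 gives 2sa and p(t) = 1 gives 2sd, matching the two limits described above.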
Altogether, κ(t) =rp(t)(1 −p(t))Π(t) is the probability that an A1B0 recombinant haplotype appears at time t and goes on to establish within the population. Note however that this calculation does not specify whether the A1 or B0 allele will fix first; in many cases, if a recombinant appears and fixes with probability Π(t), the actual fixation of the A1B0 haplotype would occur after A1 has reached fixation.
To calculate the overall probability, P, that the A1B1 haplotype is never broken apart by recombination, we must calculate the probability that in every generation, t, none of the N offspring are successful recombinants. Assuming weak selection such that both Π(t) and κ(t) are small, the probability that a deleterious hitchhiker will be carried to fixation by the spread of a linked beneficial allele is given by
Overall, P gives the probability that a fitter recombinant never establishes, assuming that the A1B1 haplotype is not lost stochastically when it first appears. The probability that the A1B1 haplotype succeeds in establishing initially and fixing within the population is thus u (= 2snet) times P. This equation is analogous to equation (16) in Yu and Etheridge (2010), who used a Moran model to estimate the fixation probability of two competing beneficial mutations, with recombination between the two loci.
Equation (3) can be solved by integrating over the allele frequency dynamics rather than over time and replacing the integral with
In this haploid model with weak selection, dp/dt= (sa−sd) p(1 −p). Carrying out the integration, the probability that a fitter recombinant never establishes is given by
where p0 was assumed negligible relative to terms on the order of one. At this point, we can eliminate the population size from the result by measuring the net selection and recombination rates within the population, defined as Sd=Nsd, Sa=Nsa, Snet=N(sa−sd), and ρ=Nr, yielding
where ω is the compound parameter defined by
The hitchhiking process thus depends primarily on these scaled parameters and not separately on the population size and selection or recombination parameters. The above equations show that the probability of hitchhiking to fixation declines exponentially with the recombination rate between the loci and with the number of individuals within the population. The probability of hitchhiking is especially small when the strength of selection for the beneficial allele and against the deleterious allele is similar (Snet small), as this will cause the sweep of the A1B1 haplotype to take longer and allow for more recombination events.
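The steps from equation (3) to the scaled result can be sketched as follows, assuming the establishment probability takes the form Π = 2 sa sd / (sd + snet p), which reproduces the limits 2sa and 2sd noted earlier. This is a reconstruction from the stated definitions, with p0 taken as negligible, and the grouping of terms into ω is left implicit.

```latex
\begin{aligned}
P &= \prod_{t}\bigl(1-\kappa(t)\bigr)^{N}
   \approx \exp\!\left(-N\int_{0}^{\infty}\kappa(t)\,dt\right)
   = \exp\!\left(-N\int_{p_0}^{1}
       \frac{r\,p(1-p)\,\Pi(p)}{s_{\text{net}}\,p(1-p)}\,dp\right) \\[4pt]
  &= \exp\!\left(-\frac{2 N r\, s_a s_d}{s_{\text{net}}}
       \int_{p_0}^{1}\frac{dp}{s_d + s_{\text{net}}\,p}\right)
   \approx \exp\!\left(-\frac{2 N r\, s_a s_d}{s_{\text{net}}^{2}}
       \,\ln\frac{s_a}{s_d}\right)
   = \exp\!\left(-\frac{2\rho\, S_a S_d}{S_{\text{net}}^{2}}
       \,\ln\frac{S_a}{S_d}\right).
\end{aligned}
```

As Snet becomes small, ln(Sa/Sd) ≈ Snet/Sd and the exponent approaches −2ρ Sa/Snet, so P falls toward zero, in line with the statement that hitchhiking is especially unlikely when the two selection coefficients are similar.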
To determine how small the recombination rate must be in order for hitchhiking to occur with a particular probability of interest, c, we set P=c and solve for ρ:
This gives us the recombination rate below which hitchhiking to fixation will occur with frequency greater than c, as a function only of the scaled selection coefficients Sd and Snet. At this point, we hold off discussing these results further until the next section, where we derive a stochastic solution.
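A short numerical check may help make the semi-deterministic calculation concrete. The sketch below assumes the establishment probability Π = 2 sa sd / (sd + snet p), chosen because it matches the limits 2sa and 2sd stated earlier, and the weak-selection change of variable dt = dp / (snet p(1 − p)). All function names and parameter values are hypothetical.

```python
import math

def establishment_prob(p, s_a, s_d):
    """Assumed form of Pi: equals 2*s_a at p = 0 and 2*s_d at p = 1."""
    return 2.0 * s_a * s_d / (s_d + (s_a - s_d) * p)

def hitchhike_prob_numeric(N, r, s_a, s_d, steps=100_000):
    """exp(-N * integral of kappa dt), integrated over p from 1/N to 1.

    kappa = r p (1-p) Pi and dt = dp / (s_net p (1-p)), so the p(1-p)
    factors cancel in the integrand.
    """
    s_net = s_a - s_d
    p0 = 1.0 / N
    h = (1.0 - p0) / steps
    total = 0.0
    for i in range(steps):  # midpoint rule
        p = p0 + (i + 0.5) * h
        total += r * establishment_prob(p, s_a, s_d) / s_net * h
    return math.exp(-N * total)

def hitchhike_prob_closed(N, r, s_a, s_d):
    """Closed form of the same integral with p0 taken as negligible."""
    s_net = s_a - s_d
    exponent = 2.0 * N * r * s_a * s_d * math.log(s_a / s_d) / s_net**2
    return math.exp(-exponent)

def critical_rho(c, S_a, S_d):
    """Scaled recombination rate at which the closed form equals c."""
    S_net = S_a - S_d
    return -math.log(c) * S_net**2 / (2.0 * S_a * S_d * math.log(S_a / S_d))

if __name__ == "__main__":
    N, s_a, s_d, r = 10_000, 0.02, 0.01, 1e-5  # illustrative values
    print(hitchhike_prob_numeric(N, r, s_a, s_d))
    print(hitchhike_prob_closed(N, r, s_a, s_d))
    print(critical_rho(0.5, N * s_a, N * s_d))
```

The numerical and closed-form values differ only through the neglected contribution between p = 0 and p = p0, which is of order r sa/snet in the exponent.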
The above analysis assumes that the population is very large, allowing us to combine stochastic results for the establishment of particular haplotypes while rare, with deterministic equations for the spread of these haplotypes. The above does not, however, take into account chance fluctuations in haplotype frequencies or the initial acceleration caused by considering only those trajectories where the beneficial allele becomes established (Maynard Smith and Haigh, 1974; Barton, 1994; Otto and Barton, 1997; Desai and Fisher, 2007). To account for these effects, we now derive a stochastic solution for this problem.
Again ignoring the rare deleterious-only lineage, we model the change in frequency, p(t), of the A1B1 haplotype using a diffusion approximation. If a successful recombinant appears, however, the diffusion process is killed. As described by Karlin and Taylor (1981), the probability that the process is not ultimately killed, P(p), given that A1B1 is currently at frequency p, satisfies
where M(p) is the mean change in p over a time step measured in N generations; V(p) is the variance in change of p; and K(p) is the killing function, which denotes the probability of the process being “killed” while the A1B1 haplotype is at frequency p. In this model, killing occurs if recombination forms a fitter haplotype (i.e., A1B0) that succeeds in establishing within the population. To solve equation (8), we use the boundary conditions P(0) =P(1) = 1; that is, the system cannot be killed if the A1B1 or A0B0 haplotype is fixed. Further descriptions of similar diffusion models with killing are available in Karlin et al. (1967) and section 15.10 of Karlin and Taylor (1981); in particular, a related model is described where the diffusion process is killed whenever any recombinant is formed (A1B0 or A0B1), regardless of whether the recombinant succeeds in establishing.
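In the notation just defined, the backward equation for a diffusion with killing takes the standard form below (see section 15.10 of Karlin and Taylor, 1981). It is written out here as a sketch, together with the boundary conditions stated above.

```latex
0 = M(p)\,\frac{dP}{dp} + \frac{V(p)}{2}\,\frac{d^{2}P}{dp^{2}} - K(p)\,P(p),
\qquad P(0) = P(1) = 1 .
```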
As with standard diffusion models investigating an allele under weak directional selection in a haploid population (Kimura, 1970; Ewens, 2004), we obtain the values M(p) = Snet p(1 − p) and V(p) = p(1 − p), where Snet = N(sa − sd) (see section 2 of Appendix S3). The killing term is obtained by taking the probability that the process is killed in a particular generation, 1 − (1 − κ)^N ≈ Nκ = N r p(1 − p) Π, and scaling in such a way that the killing term remains finite over the time step of N generations, as N→∞ (Karlin and Taylor, 1981). By doing so, we obtain the killing function K(p) = ρ p(1 − p) π(p), where ρ = Nr and π(p) is the scaled version of the establishment probability of the A1B0 recombinant, Π (eq. 2)
The diffusion approximation assumes that Snet, Sd, and ρ remain finite as N→∞.
Plugging these diffusion coefficients into equation (8) and dividing by p(1 −p), the probability that the process is not killed, P(p), given the current frequency p satisfies
If the process is not killed, there are two potential outcomes: fixation of A0B0 or fixation of A1B1. If we wish to know the probability that a particular advantageous allele that succeeds in fixing carries along with it a deleterious allele, we must rederive the diffusion model conditional on A1 establishing within the population. In Appendix B, we show that the conditional probability P*(p) that the process is not killed (i.e., the deleterious allele B1 fixes) among those cases where A1 sweeps to fixation satisfies:
The differential equations (10) and (11) were solved in Mathematica 6.0 (Supporting information), yielding the somewhat cumbersome equations (B5) and (B6), respectively. These can be solved numerically for the probability that the process is not ultimately killed (i.e., the probability that a successful recombinant never appears).
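As an independent numerical cross-check, the unconditional problem can also be solved by finite differences. The sketch below assumes that, after dividing by p(1 − p), the equation reads (1/2)P'' + Snet P' − ρ π(p) P = 0 with P(0) = P(1) = 1, and that π(p) = 2 Sa Sd / (Sd + Snet p); both forms are assumptions inferred from the definitions of M, V, and K. It does not address the conditional quantity P*(p), and the function name is hypothetical.

```python
def solve_unkilled_prob(S_a, S_d, rho, n=2000):
    """Finite-difference solution of 0.5*P'' + S_net*P' - rho*pi(p)*P = 0
    on [0, 1] with P(0) = P(1) = 1, using central differences and the
    Thomas algorithm. Returns P on the grid p_i = i/n, i = 0..n."""
    S_net = S_a - S_d
    h = 1.0 / n
    if S_net * h >= 1.0:
        raise ValueError("grid too coarse: need n > S_net")
    lower = 0.5 / h**2 - 0.5 * S_net / h   # coefficient of P[i-1]
    upper = 0.5 / h**2 + 0.5 * S_net / h   # coefficient of P[i+1]
    m = n - 1                              # number of interior points
    diag = []
    for i in range(1, n):
        p = i * h
        pi_p = 2.0 * S_a * S_d / (S_d + S_net * p)  # assumed scaled Pi
        diag.append(-1.0 / h**2 - rho * pi_p)
    rhs = [0.0] * m
    rhs[0] -= lower    # boundary value P(0) = 1
    rhs[-1] -= upper   # boundary value P(1) = 1

    # Forward elimination
    c_prime = [0.0] * m
    d_prime = [0.0] * m
    c_prime[0] = upper / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, m):
        denom = diag[i] - lower * c_prime[i - 1]
        c_prime[i] = upper / denom
        d_prime[i] = (rhs[i] - lower * d_prime[i - 1]) / denom

    # Back substitution
    interior = [0.0] * m
    interior[-1] = d_prime[-1]
    for i in range(m - 2, -1, -1):
        interior[i] = d_prime[i] - c_prime[i] * interior[i + 1]
    return [1.0] + interior + [1.0]

if __name__ == "__main__":
    P = solve_unkilled_prob(S_a=20.0, S_d=10.0, rho=0.5)
    print(P[len(P) // 2])  # value at p = 0.5
```

The grid check keeps both off-diagonal coefficients positive, so the discrete solution stays between 0 and 1, as the continuous solution does.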
P*(p0) as given by (B6) is the main quantity of interest in this article. It describes the probability that an A1 allele that fixes within a population carries along with it a linked deleterious allele B1, given that the initial frequency of the A1B1 haplotype is p0. Although equations (B5) and (B6) should be used in any numerical analysis, further insight is provided by approximating P*(p0) as an exponentially decreasing function of the recombination rate (as inferred in the semi-deterministic analysis). Assuming that selection is strong relative to drift (Sd, Snet≫ 1), that the frequency of the A1B1 haplotype when the A1 allele first appears is negligibly small (p0≪ 1), and that recombination is not too frequent (ρ≪Sd, Snet), we obtain:
(see details in section 3 of Appendix S3). Again, this can be used to calculate a critical value of recombination above which hitchhiking is unlikely to occur. Specifically, we solve equation (12) for the rate of recombination necessary for the deleterious B1 allele to fix with probability c, given that the beneficial allele A1 initially appears with B1 and ultimately fixes
For example, when c = 1/2, the term in square brackets is approximately 1/4 as long as neither Sd nor Snet is too small (see the figure in section 3 of Appendix S3). Thus, as a rough rule of thumb (using unscaled parameters), the recombination rate r must be less than 1/4 of snet/(Nsd) for there to be at least a 50% chance that the deleterious allele hitchhikes to fixation.
Hitchhiking events are thus likely to occur over larger regions of the genome if the net selection coefficient acting on the A1B1 haplotype, snet, is stronger because sweeps occur faster. Conversely, the stronger the disadvantage of the deleterious allele, sd, the less likely a hitchhiker will fix because recombinant A1B0 haplotypes are so much more fit. Finally, the larger the population size, the less likely that a hitchhiker will fix, simply because there are more individual chances for recombination to occur while the population remains polymorphic.
These patterns are illustrated in Figure 1, which gives the probability that the deleterious B1 allele hitchhikes to fixation given that the beneficial A1 allele fixes, with darker shading corresponding to higher probabilities. These contour plots are based on the exact solution (B6) to the diffusion equation for P*(p). The thick dashed curves show the approximate equation (13) for the critical value of the recombination rate, ρ, below which we expect deleterious alleles to hitchhike to fixation more than c of the time (c= 10%, 50%, or 90%) when they occur on the haplotype bearing a new beneficial allele; these curves accurately follow the appropriate contour lines as long as selection is not too weak (roughly, Snet, Sd≥ 2).
COMPARISON TO THE CASE OF A LINKED NEUTRAL ALLELE
The dynamics of neutral loci are likely to be affected by the spread nearby of a beneficial allele whenever r is approximately less than sa (Maynard Smith and Haigh, 1974). This rule cannot be used to compare to equation (13) directly, however, because our criterion for being “affected” is now quite strict: the linked B1 allele must fix due to the sweep. We thus briefly describe a corresponding model for the case when B is neutral (full details are provided in section 4 of Appendix S3).
The diffusion equations remain essentially the same, except that the killing term must be revised now that the recombinant A1B0 haplotype is no more fit than the A1B1 haplotype that is spreading through the population. We assume that, whenever a recombinant A1B0 haplotype appears, the probability that this haplotype becomes the ancestor of the population at some distant future point in time is very nearly 1/(Np). This assumes that any individual carrying the A1 allele alive at that time is equally likely to be the lucky one to ultimately fix and give rise to the entire descendant population. Using 1/(Np) instead of Π for the fixation probability of the recombinant A1B0 haplotype, we obtain the revised killing function, K(p) =ρ p(1 −p) 1/p, for use in the diffusion equation (8), assuming that allele A1 fixes. The conditional probability of the process not being killed was then obtained using Mathematica 6.0.
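The revised killing function simplifies on substitution. As a short worked step, using the same scaling over N generations as in the deleterious case:

```latex
K(p) = N \cdot \left[\, N\, r\, p(1-p) \cdot \frac{1}{N p} \right]
     = \rho\, p(1-p)\,\frac{1}{p}
     = \rho\,(1-p), \qquad \rho = N r .
```

Killing is therefore most likely while the sweeping haplotype is still rare, in contrast to the deleterious case, where the p(1 − p) factor concentrates killing at intermediate frequencies.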
Focusing on the conditional probability that the process reaches fixation on A1 before being killed by the appearance of a successful recombinant, we again obtained an approximation assuming that selection is strong relative to drift
where γ ≈ 0.577 is Euler’s constant. For ease of comparison with the previous case, we continue to refer to the net selection on the A1B1 haplotype as Snet, even though Snet = Sa now that B is neutral.
Again, solving this equation for the critical value of ρ below which hitchhiking to fixation occurs more than a proportion c of the time, we get
For c= 1/2, the term in square brackets is approximately 1/4 when Snet= 5, and it continues to decline (but slowly) as Snet increases. Thus, as a rough rule of thumb, r must be less than ≈1/4 of snet for there to be a 50% chance that a neutral allele hitchhikes to fixation. Again, such hitchhiking events are likely to occur over larger regions of the genome when the sweeps are faster (snet large). The key difference, however, from the case with a deleterious hitchhiker is the absence of Nsd in the denominator of this rule, which makes it easier to satisfy than the case of a deleterious hitchhiker (assuming selection is strong relative to drift). Figure 2 shows just how much more likely it is for alleles at locus B to hitchhike to fixation along with allele A when the B locus is neutral (thick top curve) than when it is subject to selection against deleterious mutations (dashed curves).
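The two rules of thumb can be compared directly. The sketch below encodes only the approximate thresholds stated in the text (a factor of roughly 1/4 in each case); the parameter values and function names are hypothetical, and the same snet is used in both cases for comparison, as in the text.

```python
import math

def r_crit_deleterious(s_net, s_d, N):
    """Rule of thumb: r below ~(1/4) * s_net / (N * s_d) gives at least
    a ~50% chance that the deleterious allele hitchhikes to fixation."""
    return 0.25 * s_net / (N * s_d)

def r_crit_neutral(s_net):
    """Rule of thumb: r below ~(1/4) * s_net gives a ~50% chance that a
    neutral allele hitchhikes to fixation."""
    return 0.25 * s_net

# Illustrative (assumed) parameter values.
N, s_net, s_d = 10_000, 0.01, 0.01

r_del = r_crit_deleterious(s_net, s_d, N)
r_neu = r_crit_neutral(s_net)

# The ratio of the two thresholds is N * s_d = S_d, the scaled strength
# of selection against the hitchhiker.
print(f"deleterious: r < {r_del:.2e}")
print(f"neutral:     r < {r_neu:.2e}")
print(f"ratio:       {r_neu / r_del:.1f}")
```

The ratio makes explicit why the neutral rule is so much easier to satisfy whenever selection against the hitchhiker is strong relative to drift (Sd ≫ 1).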
The fact that neutral alleles are much more likely to hitchhike to fixation than linked deleterious alleles has another important implication. Namely, the presence of a linked deleterious allele increases the chance that surrounding genetic variation will be rescued by recombination. Had there been no linked sites under selection, we would expect the region surrounding a sweep that lies within the neutral critical recombination distance (eq. 15) to be entirely fixed in the majority of cases. If a beneficial allele first occurs on a chromosome containing a deleterious allele, however, this region is greatly reduced to ρ < ρcrit (eq. 13), as illustrated in Figure 2. Consequently, linkage to sites carrying deleterious alleles reduces the impact of selective sweeps, making it less likely that surrounding genetic variation will be lost.
Turning this argument around, a recently fixed beneficial allele might have been strongly selected but appear to have been weakly selected based on the amount of genetic variation remaining in the region. This is because recombinants were favored that untied the beneficial allele from the deleterious genetic baggage with which it arose. Furthermore, we would expect that genetic variation should more often be rescued by the appearance of more fit recombinants on the side of a selective sweep that bears a higher density of other sites under selection. In Supporting information, we simulate a three-locus model with one locus subject to advantageous mutation, one locus being a neutral marker, and one locus subject to recurrent deleterious mutation, with the beneficial mutant placed on a randomly selected genetic background. As confirmed in Figure S1 the sweep of neutral diversity is less severe in cases where selection acts on the locus subject to deleterious mutations.
TWO COMPETING BENEFICIAL MUTATIONS
The above analyses can also be used to solve a related problem of beneficial mutations competing for fixation in the presence of recombination, as considered by Yu and Etheridge (2010). If a beneficial allele is rising in frequency when a second beneficial allele appears at a linked site, then the first beneficial allele can be lost under three conditions: the second allele is more strongly favored, it appears with the wild-type allele at the first locus, and a recombinant that brings both alleles onto the same haplotype fails to establish in time.
Although technically there are three chromosome types to be considered before the recombinant appears (00, 10, and the new 01, where the “1” now indicates a beneficial mutation at the first and second sites), we can approximate this scenario as did Yu and Etheridge (2010) by assuming that the 00 wild type is rapidly eliminated, so that the frequencies of 01 and 10 sum roughly to one. This approximation performs surprisingly well for this problem because rare recombination events do not occur until the 10 and 01 haplotypes are both common.
Equation (1) then describes the spread of the more fit 01 haplotype, whose frequency is ≈p(t) (frequency of 10 ≈1 −p(t)), with snet equal to the difference in fitness between 10 and 01 individuals. Equation (2) describes the fixation probability of a recombinant double mutant, with sa and sd giving the selective advantage of the double mutant when it appears in a population predominantly composed of 10 and 01 individuals, respectively. All of the subsequent results described above then follow. Figure S4 shows that equations (5) and (12) provide an excellent estimate of the probability that recombination successfully rescues both beneficial mutations. Although similar in spirit to the work of Yu and Etheridge (2010), our analyses have the advantage of providing closed-form solutions that appear to accurately capture the stochastic nature of recombination rescuing combinations of beneficial alleles at two selected loci.
To investigate the accuracy of the above results, we compare both the semi-deterministic and stochastic models to Monte Carlo simulations. Simulations start with a population of N haploid chromosomes, each consisting of two linked loci. Fitness is assumed to be additive.
An initial proportion p0= 1/N of the population is assigned the advantageous-deleterious A1B1 haplotype. The rest of the population bears the A0B0 haplotype. It is assumed that the A0B1 haplotype is present at a negligibly small frequency, and while it is not considered in the initial population it is tracked if it appears by recombination.
A new generation is formed by selecting two parents with probability proportional to their fitness. Recombination between the two parental loci then occurs with Poisson probability r. This is repeated until N new offspring are created. Successive generations are created in this way until the A1B1 haplotype is either fixed or lost from the population. This entire process is repeated 20,000 times to build up an overall probability of fixation along with 95% confidence intervals. We focus attention on the processes where the advantageous allele fixes.
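The simulation scheme described above can be sketched in a few dozen lines. This is a minimal illustration rather than the code used for the reported results. It assumes at most one crossover per offspring, occurring with probability r; it continues a run after A1B1 is lost until the A locus is monomorphic, so that sweeps without the hitchhiker can be counted; and it uses far fewer replicates than the 20,000 reported. Function and variable names are hypothetical.

```python
import random

def simulate_hitchhiking(N, s_a, s_d, r, replicates, seed=0):
    """Two-locus haploid Wright-Fisher simulation with additive fitness.

    Returns (sweeps, hitchhikes): the number of replicates in which A1
    fixed, and the number of those in which B1 fixed along with it.
    """
    rng = random.Random(seed)
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (A allele, B allele)
    fitness = [1.0, 1.0 - s_d, 1.0 + s_a, 1.0 + s_a - s_d]
    index = {hap: i for i, hap in enumerate(haps)}
    sweeps = hitchhikes = 0

    for _ in range(replicates):
        counts = [N - 1, 0, 0, 1]  # one A1B1 copy; the rest are A0B0
        while True:
            if counts[3] == N:                       # A1B1 fixed
                sweeps += 1
                hitchhikes += 1
                break
            if counts[3] == 0 and counts[2] == N:    # A1B0 fixed
                sweeps += 1
                break
            if counts[3] == 0 and counts[2] == 0:    # A1 lost
                break
            # Parents are drawn with probability proportional to fitness.
            weights = [c * w for c, w in zip(counts, fitness)]
            parents = rng.choices(range(4), weights=weights, k=2 * N)
            new_counts = [0, 0, 0, 0]
            for k in range(N):
                first, second = parents[2 * k], parents[2 * k + 1]
                if rng.random() < r:
                    # Recombinant: A from the first parent, B from the second.
                    child = index[(haps[first][0], haps[second][1])]
                else:
                    child = first
                new_counts[child] += 1
            counts = new_counts
    return sweeps, hitchhikes

if __name__ == "__main__":
    sweeps, hitch = simulate_hitchhiking(
        N=100, s_a=0.2, s_d=0.05, r=0.002, replicates=500, seed=1
    )
    if sweeps:
        print(f"P*(hitchhike | A1 fixes) ~ {hitch / sweeps:.3f} ({sweeps} sweeps)")
```

With r = 0 no recombinant can arise, so every sweep carries B1 with it; this gives a simple sanity check on the implementation.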
Results are plotted in Figure 3. Simulation data match up very well to all three solutions for the probability of hitchhiking P*(p): semi-deterministic equation (5), diffusion equation (B6), and the approximation to the diffusion equation (12). All three solutions offer similar results when we changed the population size, as long as ρ, Sd, and Sa are held constant. Differences between the solutions only become apparent when selection becomes weak. Stochastic effects then play more of a role, especially where the A1B1 haplotype is oversampled and rises to fixation faster than expected, so that the diffusion with killing (B6) provides a slightly more accurate solution. Additional figures presented in section 3 of Appendix S3 show that the analytical solutions perform less well as selection strengthens in very small populations (e.g., sd = 0.1 with N = 100 or 1000); in these cases, the diffusion approximation assuming weak selection breaks down and the fixation probability of the deleterious allele is underestimated.
Although the above two-locus models offer tractable results, novel advantageous alleles may arise in genomes with multiple mutant alleles. Therefore, we switch to using multilocus computer simulations to investigate the mutation load generated by the rise to fixation of an advantageous allele, given that such mutations arise at rate U in a genome with total map length R, where each new deleterious mutations is assigned a random position between 0 and R. The methods used for these simulations are based on Hartfield et al. (2010) and detailed in Supporting information.
We then determined the mean number of deleterious alleles that fix along with each beneficial mutation, assuming multiplicative selection. Simulations with different Sa values are compared to the control case, Sa= 0, in Figures 4 and S2. These results corroborate the two-locus model; the mean number of deleterious mutants that fix declines with the rate of recombination and rises with the strength of selection on the advantageous mutant, Sa. The mean number of fixed deleterious alleles also stays approximately the same as N increases, if the compound parameters Sa, Sd, NR, and NU are held constant.
Increasing the recombination rate also raises the fixation probability of the advantageous mutant (Fig. S3), which is a well-known result (Peck, 1994; Barton, 1995). Thus recombination is doubly advantageous, as it reduces the number of deleterious alleles that fix in a population following a selective sweep and it increases the likelihood that such an advantageous mutant can establish when rare. This is the likeliest cause of strong selection acting on a modifier for increased recombination in the presence of advantageous and deleterious mutations (Hartfield et al., 2010).
APPLYING RESULTS TO HUMAN GENETIC DATA
How likely is deleterious hitchhiking to occur in nature? To answer this, we use human data as an example. Deleterious mutants are maintained at a mutation–selection balance frequency of q=μ/sd (Wright, 1931), where sd measures selection against the deleterious allele in heterozygotes. Thus an estimate for the number of deleterious mutants segregating throughout a genome is U/sd, for U the diploid per-genome deleterious mutation rate, which has been recently estimated as U= 4.2 (Eöry et al., 2010).
U measures deleterious mutations arising across the entire genome, with the majority appearing in noncoding regions (Eöry et al., 2010). Thus we assume all deleterious mutations have a fixed, weak value of sd. This will slightly overestimate the number of deleterious mutants segregating, as we do not consider stronger deleterious mutations that can arise in coding regions (Eyre-Walker et al., 2006; Boyko et al., 2008).
A deleterious allele must have Nesd ≥ 1 in order for selection to overcome the effects of genetic drift (Kimura, 1983). Therefore, assuming deleterious alleles are very weakly selected (Nesd = 1, with human Ne = 10,000; Jorde et al. 1998), we expect U/sd = 4.2/0.0001 = 42,000 such deleterious alleles segregating at any time, roughly half of which lie in each haploid set of 3 Gb in the human genome. Including the site of the beneficial mutation, the average distance between two selected sites is thus 142.9 kb. Assuming that selected sites are randomly distributed across the genome (i.e., ignoring clustering), this distance would be approximately exponentially distributed. In this case, the distance to the closest of the deleterious alleles lying to either side of the beneficial allele would also be exponentially distributed, with mean 71.4 kb. As a rough guide, the average recombination rate is 1 cM/Mb in a human genome (Broman et al., 1998), thus the closest deleterious allele lies, on average, at a distance of Ner = 7.14. The fixation probability of the deleterious allele with the advantageous mutant would then be 18.8% for Nesa = 5, 37.1% for Nesa = 25, and 62.1% for Nesa = 100, obtained by integrating the hitchhiking probability (B6) over an exponentially distributed distance with mean Ner = 7.14. These calculations are explained in more detail in section 5 of Appendix S3. If we instead assume Nesd = 10, then following similar logic we calculate that the mean distance to the nearest deleterious allele is Ner = 71.4, and the estimated fixation probability of a deleterious allele is 0.8% for Nesa = 25 and 2.5% for Nesa = 100.
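The arithmetic behind these distances can be retraced with a short script. It is only a sketch of the back-of-the-envelope steps stated in the text; the function name and argument names are hypothetical, and recombination rate is taken to scale linearly with physical distance at 1 cM/Mb (that is, 1e-8 per base pair).

```python
def nearest_deleterious_distance(U=4.2, Ne=10_000, Ne_sd=1.0,
                                 genome_bp=3e9, cM_per_Mb=1.0):
    """Retrace the spacing calculation for segregating deleterious alleles.

    Returns (number segregating, mean spacing in kb,
             mean distance to nearest in kb, Ne * r to nearest).
    """
    sd = Ne_sd / Ne                      # selection coefficient implied by Ne*sd
    n_segregating = U / sd               # mutation-selection balance count, U/sd
    n_per_haploid = n_segregating / 2.0  # roughly half lie in each haploid set
    spacing_bp = genome_bp / n_per_haploid
    # The nearer of two independent exponential distances (one on each side)
    # is exponential with half the mean.
    nearest_bp = spacing_bp / 2.0
    r_per_bp = cM_per_Mb * 0.01 / 1e6    # 1 cM/Mb = 1e-8 per base pair
    r_nearest = nearest_bp * r_per_bp
    return n_segregating, spacing_bp / 1e3, nearest_bp / 1e3, Ne * r_nearest
```

With the defaults this gives about 42,000 segregating alleles, a mean spacing of about 142.9 kb, a mean nearest distance of about 71.4 kb, and Ner of about 7.14; setting `Ne_sd=10` gives Ner of about 71.4. The final step in the text, integrating equation (B6) over the exponential distance distribution, is not reproduced here.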
Overall, these calculations suggest that in humans, deleterious mutants will hitchhike at appreciable frequencies only if they are very weakly selected (Nesd < 10). However, this is only an initial calculation that deserves to be revised to take into account fine-scale recombination rates (McVean et al., 2004) and clustering of mutations around coding regions. For now we note that if clustering causes the average recombination distance to a deleterious allele to drop tenfold, then the hitchhiking probabilities calculated above increase substantially, rising for Nesd= 1 to 68%, 85%, 94% with Nesa= 5, 25, 100, respectively, and for Nesd= 10 to 7%, 20% with Nesa= 25 and 100.
As long as genetic variance in fitness is present within a population, new beneficial alleles can arise in genomes that, by chance, carry deleterious alleles at linked sites. Consequently, if they remain associated, deleterious alleles can hitchhike to fixation as an advantageous allele sweeps through the population. Even if recombination occurs between the two loci, there can still be a good chance of both alleles fixing, if either the recombinant fails to appear in time or is lost by chance when it does appear. Williamson et al. (2007) found possible evidence of such hitchhiking causing the high prevalence of the hereditary hemochromatosis mutation C282Y, due to a selective sweep occurring 150 kb away from the HFE gene where the deleterious C282Y allele is located.
To our knowledge, this article represents the first theoretical study on how recombination affects the hitchhiking to fixation of deleterious alleles. Using both a semi-deterministic and a diffusion approach, we show that in regions of low recombination there is a high probability that a deleterious mutant would be swept to fixation if linked to an advantageous mutant (Fig. 1). This probability approaches one as the deleterious effect sd tends towards zero and as the overall advantage of the A1B1 haplotype, snet, increases. Outside this parameter range, we find that hitchhiking is likely (greater than 50% chance) if r≲snet/(4Nsd) (more precisely, equation 13). A promising empirical approach would be to investigate areas around the genome that show high dN/dS values. Such regions are assumed to be subject to recurrent sweeps (Nielsen, 2005). If deleterious alleles do hitchhike, then around these sites there should be signs of increased load, such as increased indel frequency or a lower frequency of optimal codon usage. Such a negative relationship between dN and optimal codon usage was found in Drosophila by Betancourt and Presgraves (2002).
Furthermore, we determined that the hitchhiking of tightly linked deleterious alleles reduces the region in which the sweep is likely to fix surrounding sites (compare eq. 15 to eq. 13). This is important as it implies that deleterious hitchhiking can alter experimental estimates of the strength of such sweeps. A potential example of these effects was reported by Clegg et al. (1980), who found that linkage disequilibrium in D. melanogaster broke down more quickly than expected (geometric decay at a ratio 1 −r), based on the surrounding markers being neutral and on measured recombination rates between the selected and neutral markers. This observation could be explained by recombination untangling advantageous alleles from deleterious backgrounds (see also Fig. S1). Further work is warranted to explore the impact of neighboring selected sites on patterns of neutral sequence variability in a fully multilocus framework. In particular, a full treatment requires an exploration not only of the primary effects of a selective sweep at a focal site, but also of how hitchhiking of deleterious alleles can cause secondary sweeps as wild-type alleles reestablish themselves at surrounding sites.
Our work also sheds light on the results found by Hartfield et al. (2010), who showed that a modifier gene for increased recombination is more likely to fix in a population that is subject to both deleterious and advantageous mutation, compared to the deleterious-only mutation case (Keightley and Otto, 2006). The increased selection acting on a recombination modifier when both deleterious and advantageous mutants are present together, compared to when just deleterious or just advantageous mutations are present, suggests that uncoupling advantageous mutants from deleterious backgrounds provides a substantial amount of selection on a recombination modifier (Peck, 1994; Hartfield et al., 2010).
Our preliminary calculations suggest that in obligately sexual species with long genetic map lengths (such as the human genome), recombination is frequent enough to prevent all but weakly deleterious mutants from hitchhiking with advantageous mutants. Our calculations assumed, however, that mutations affecting fitness arise at equal rates throughout the genome, which ignores the clustering of fitness-impacting sites near genic regions. If recombination rates between selected sites are low, either because of this clustering or because of cold spots in recombination, the probability that deleterious alleles hitchhike to fixation rises substantially. Similarly, in species that frequently inbreed (e.g., selfing) or reproduce asexually, the effective amount of recombination may be much lower, substantially increasing the probability of deleterious alleles hitchhiking to fixation. In asexuals with no recombination, the subsequent mutation accumulation can be extremely detrimental (Hadany and Feldman, 2005).
In conclusion, sex and recombination both enhance the probability of beneficial alleles establishing and hinder the fixation of deleterious alleles within a lineage. If this can be shown empirically to be a potent selective force on recombination rates, then this would provide key insight into why sex and recombination are prevalent, which remains an open question in evolutionary genetics (Otto, 2009).
Associate Editor: J. Hermisson
We would like to thank P. Keightley for his support and advice on using human genetic data. We also thank P. Keightley, N. Barton, J. Hermisson, and two anonymous referees for comments on the manuscript. MH is funded by a Biotechnology and Biological Sciences Research Council studentship; SO is funded by the Natural Sciences and Engineering Research Council of Canada.
DERIVATION OF Π(t), THE PROBABILITY OF ESTABLISHMENT OF A RECOMBINANT HAPLOTYPE
When the recombinant A1B0 haplotype is produced, it appears within a population that is already changing due to the spread of the A1B1 haplotype. Thus, we cannot calculate the probability of fixation of the recombinant A1B0 haplotype based solely on its fitness 1 +sa relative to the current population mean 1 +p(t) (sa−sd). Rather, we must also account for future changes in the population mean fitness as the A1B1 haplotype rises in frequency. To do so, we develop a time-inhomogeneous branching process that explicitly follows the dynamics of p(t) (given by eq. 1) that occur after the appearance of the recombinant A1B0 haplotype. A previous diffusion analysis by Kimura and Ohta (1970) also calculated the fixation probability for a favorable allele whose benefit declined over time, but the focus of their analysis was on a case where selection declines linearly over time, whereas here the selection coefficient favoring A1B0 declines according to a logistic function of time, given by s(t) =sa−p(t)(sa−sd).
Let Π(t) be the fixation probability of the recombinant A1B0 haplotype at generation t, given that the current frequency of the A1B1 haplotype is p(t). In a population of constant size, the average parent has one surviving offspring, but we assume that the A1B0 haplotype is more fit and so has an average of 1 + s(t) offspring. Using branching process logic (Haldane, 1927), the recombinant A1B0 haplotype will ultimately be lost (with probability 1 − Π(t)) if and only if all j offspring inheriting the haplotype also fail to leave any descendants over the long run (with probability (1 − Π(t + 1))^j). Assuming a Poisson distribution for the number of offspring j, with mean 1 + s(t), and summing over this distribution, we obtain a recursion for Π(t):
$$1 - \Pi(t) = \sum_{j=0}^{\infty} \frac{e^{-(1+s(t))}\,(1+s(t))^{j}}{j!}\,\bigl(1-\Pi(t+1)\bigr)^{j} = e^{-(1+s(t))\,\Pi(t+1)}.$$
Solving for Π(t + 1) and subtracting Π(t), we obtain the change in fixation probability over time, which we assume is slow enough that it can be well approximated by the differential equation
$$\frac{d\Pi(t)}{dt} = -\frac{\ln\bigl(1-\Pi(t)\bigr)}{1+s(t)} - \Pi(t).$$
With weak selection (s(t) ≪ 1), Π(t) is of the same order as s(t) and the above simplifies to
$$\frac{d\Pi(t)}{dt} = -s(t)\,\Pi(t) + \frac{\Pi(t)^{2}}{2}$$
(Barton, 1995). This differential equation can be solved when selection on the recombinant haplotype varies according to s(t) = sa − p(t)(sa − sd) by first replacing the variable t with the variable p using the chain rule and dp/dt = snet p(1 − p) (section 1 of Appendix S3). To leading order in the selection coefficients, the resulting solution for the fixation probability of the recombinant A1B0 haplotype is given by equation (2).
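As a numerical cross-check on this appendix, the weak-selection equation dΠ/dt = −s(t)Π + Π²/2 (the form given by Barton, 1995) can be integrated backward in time along the logistic trajectory of p(t). The sketch below is not the analytical route taken in Appendix S3. It assumes snet = sa − sd, a simple explicit Euler step, and the terminal condition Π = 2s at a late time t_end when A1B1 is nearly fixed; the function names are hypothetical.

```python
import math


def a1b1_frequency(t, p0, snet):
    """Logistic solution of dp/dt = snet * p * (1 - p) with p(0) = p0."""
    growth = math.exp(snet * t)
    return p0 * growth / (1.0 - p0 + p0 * growth)


def selection_on_recombinant(t, sa, sd, p0):
    """s(t) = sa - p(t) * (sa - sd), the advantage of A1B0 at time t."""
    return sa - a1b1_frequency(t, p0, sa - sd) * (sa - sd)


def establishment_probability(t0, sa, sd, p0, t_end, dt=0.05):
    """Integrate dPi/dt = -s(t)*Pi + Pi**2/2 backward from t_end to t0.

    Assumption: at t_end the sweep is essentially complete, so Pi is set
    to the constant-selection value 2*s(t_end).
    """
    pi = 2.0 * selection_on_recombinant(t_end, sa, sd, p0)
    steps = int(round((t_end - t0) / dt))
    t = t_end
    for _ in range(steps):
        s = selection_on_recombinant(t, sa, sd, p0)
        pi -= dt * (-s * pi + 0.5 * pi * pi)  # explicit Euler step toward t0
        t -= dt
    return pi
```

Integrating backward is numerically convenient because Π = 2s is an unstable equilibrium forward in time but a stable one backward in time. The returned value lies between 2sd and 2sa, reflecting the decline in the recombinant's advantage as A1B1 spreads.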
DERIVING THE DIFFUSION PROCESS WITH KILLING CONDITIONAL ON FIXATION OF THE A1 ALLELE
Conditioning on the fixation of A1 implies that either the A1B1 haplotype fixes (if the process is not killed) or the recombinant successfully establishes and leads to the fixation of the A1B0 haplotype (if the process is killed). Either way, the A1B1 haplotype cannot be lost while it is rare. We must thus adjust the drift term in the diffusion, M(p), to account for the fact that the A1B1 haplotype will, on average, rise more rapidly when rare among those processes where the A1B1 haplotype is not lost. The variance term V(p) and the killing term K(p) are unchanged in the conditioned model, as these terms depend only on the current frequency of the A1B1 haplotype and not on its ultimate fate. From equation (9.5) in chapter 15 of Karlin and Taylor (1981), the conditional drift term M*(p) is given by.
Here, the values of M(p) and V(p) are for the unconditional diffusion process as outlined in the main part of the article. Plugging these terms into equations (B2) and (B3) and evaluating the integrals, we obtain the conditional drift term:
This revised drift term is then placed in equation (8), along with the variance and killing terms, which remain unchanged. Dividing the result by p(1 −p) yields equation (11) in the main text.
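For orientation, the textbook form of a drift term conditioned on fixation can be written out for a simple case. The following is a sketch under stated assumptions, not a reproduction of the article's equations (B1) to (B4): u(p) denotes the unconditional probability that the haplotype reaches p = 1 before p = 0, and the example takes M(p) = s_net p(1 − p) and V(p) = p(1 − p)/N for a haploid population, which may differ from the scaling used in the main text.

```latex
% General result for a diffusion conditioned on reaching p = 1 before p = 0:
M^{*}(p) = M(p) + V(p)\,\frac{u'(p)}{u(p)},
\qquad
u(p) = \frac{\int_{0}^{p} \psi(y)\,dy}{\int_{0}^{1} \psi(y)\,dy},
\qquad
\psi(y) = \exp\!\left(-\int^{y} \frac{2M(z)}{V(z)}\,dz\right).

% Assumed example: M(p) = s_{net}\,p(1-p), \; V(p) = p(1-p)/N, so \psi(y) = e^{-2 N s_{net} y}:
u(p) = \frac{1 - e^{-2 N s_{net} p}}{1 - e^{-2 N s_{net}}},
\qquad
M^{*}(p) = s_{net}\,p(1-p)\,\frac{1 + e^{-2 N s_{net} p}}{1 - e^{-2 N s_{net} p}}
         = s_{net}\,p(1-p)\,\coth\!\left(N s_{net}\,p\right).
```

Under these assumptions M*(p) tends to 1/N as p approaches 0 rather than to zero, which is the sense in which the conditioned haplotype rises more rapidly when rare.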
The conditional diffusion process requires some care, however, with the boundary conditions. The probability that the process is not killed given that the A1B1 haplotype is fixed remains one, P*(1) = 1, as before. Conditioning assumes, however, that the p= 0 boundary is never reached. Rather than assigning P*(0), we instead assume that P*(p) varies little over very small values of p, given that the process will ultimately reach p= 1 if it is not killed. Thus, we use dP*(0)/dp= 0 as a second boundary condition.
Solving equation (10), we find that the probability that the process is never killed, regardless of whether A0 or A1 ultimately fixes is
whereas the solution to equation (11), conditioned on the fixation of the beneficial A1 allele, is given by equation (B6) (below). These solutions involve the Tricomi confluent hypergeometric function and the generalized Laguerre polynomial (Abramowitz and Stegun, 1970), and ω is the compound parameter given by equation (6) in the main text. Additional details regarding the derivation and solutions for these equations are provided in a Mathematica 6.0 file (Supporting information, section 2).