The model
The model presented here assumes that newly duplicated genes encode a full-length protein with the signals necessary for its proper expression. It is further assumed that all duplicate genes are selectively neutral. (This postulate is examined in the Discussion.) Any given organism in the population may be thought to have anywhere from zero to multiple extra copies of the gene; that is, duplicate copy number is considered to have no selective effect. However, the model presupposes that there are a total of N duplicate copies of the gene, equal to the number of organisms in the population. The model assumes that either copy of a newly duplicated gene can be the one to undergo mutation and that either copy can retain the original function. That is, the original gene is not necessarily the one to retain the original function. Because the model does not include recombination, all copies of the gene accumulate point mutations independently of each other. The basic “task” that the model asks a duplicate gene to perform is to accumulate λ mutations at the correct nucleotide positions to code for a new selectable feature before suffering a null mutation. Because the model presented here does not include recombination, the results can be considered to be most applicable to a haploid, asexual population. However, as will be discussed, implications can also be made for the evolution of diploid, sexual species.
The process we envision for the production of a multiresidue (MR) feature is illustrated in Figure 1, where a duplicate gene coding for a protein is represented as an array of squares that stand for nucleotide positions. A gene coding for a duplicate, redundant protein would contain many nucleotides. The majority of nonneutral point mutations to the gene will yield a null allele (again, by which we mean a gene coding for a nonfunctional protein) because most mutations that alter the amino acid sequence of a protein effectively eliminate function (Reidhaar-Olson and Sauer 1988, 1990; Bowie and Sauer 1989; Lim and Sauer 1989; Bowie et al. 1990; Rennell et al. 1991; Axe et al. 1996; Huang et al. 1996; Sauer et al. 1996; Suckow et al. 1996). However, if several point mutations (indicated by a “+” in the figure) accumulate at specific nucleotide positions (indicated by the three squares outlined in blue in the figure) in the gene coding for the protein before a null mutation occurs elsewhere in the gene (indicated by a red “X”), then several amino acid residues will have been altered and the new selectable MR feature will have been successfully built in the protein (indicated by the green-shaded area). By hypothesis, the gene is not selectable for the new feature when an intermediate number of mutations has occurred, but only when all sites are in the correct state.
In our computer model of the process described above, the nucleotide positions that must be changed from the sequence of the parent gene to be compatible with the developing MR feature (we call states of nucleotide positions “compatible” if they are consistent with what is necessary to code for the MR feature, and “incompatible” if they are not) are explicitly represented as elements of an array (see Materials and Methods for details). These correspond to the squares outlined in blue in Figure 1. (Although the positions are next to each other in the figure, they are not necessarily contiguous in the gene.) These may be considered to be nucleotide positions in the same codon, separate codons, or a combination. The pertinent feature of the model is that multiple changes are required in the gene before the new, selectable feature appears. Changes in these nucleotide positions are assumed to be individually disruptive of the original function of the protein but are assumed either to enhance the original function or to confer a new function once all are in the compatible state. Thus, the mutations would be strongly selected against in an unduplicated gene, because its function would be disrupted and no duplicate would be available to back up the function.
The other nucleotide positions in the gene, corresponding to the black squares in Figure 1, which if they were changed would yield a null allele, are represented only implicitly in our computer model by the constant ρ, which is the ratio of the number of mutations of the original duplicated gene that would produce a null allele to the number of mutations of the original duplicated gene that would yield a compatible residue. (Definitions of terms are given in Table 1.) As an example, consider a gene of a thousand nucleotides. If a total of 2400 point mutations of those positions would yield a null allele, whereas three positions must be changed to build a new MR feature such as a disulfide bond, then ρ would be 2400/3, or 800. (Any possible mutations which are neutral are ignored.) In each generation of the simulation, each of the three positions that must be changed to yield the MR feature is sequentially given a chance to mutate with a probability governed by the mutation rate. However, although a mutation may occur in a position needed for an MR feature, it would nonetheless be unproductive if a null mutation had first occurred at a separate position. To simulate this possibility in our model, when an explicitly represented position does mutate, then we take a further probabilistic step to decide if a null mutation has in the meantime occurred elsewhere in the gene, in positions not explicitly represented. In the earlier example, if one of the three positions mutates, then a further step decides with probability ρ / (1 +ρ) (which in the example would be 800/801) that one or more null mutations have already occurred somewhere in the gene, and the gene is considered to be irrecoverably lost. (The likelihood of a null mutation reverting and the gene then successfully developing an MR feature before other null mutations occur is much lower than if the first λ mutations to the duplicate gene yield compatible residues; thus, we ignore that possibility.) 
With probability 1 / (1 +ρ) (in the example this would be 1/801), the gene is considered to be free of null mutations and continues in the simulation.
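The arithmetic of the worked example can be checked in a few lines. This is a minimal sketch using exact fractions; the function name `null_check_probabilities` is illustrative only and is not taken from the authors' program.

```python
from fractions import Fraction

def null_check_probabilities(null_mutations, compatible_mutations):
    """Return (rho, P(gene already null), P(gene still intact)).

    rho is the ratio of null to compatible mutations, as defined in the text.
    """
    rho = Fraction(null_mutations, compatible_mutations)
    p_null = rho / (1 + rho)      # gene is discarded at the stochastic check
    p_intact = 1 / (1 + rho)      # gene continues in the simulation
    return rho, p_null, p_intact

# The example from the text: 2400 null mutations, 3 required positions.
rho, p_null, p_intact = null_check_probabilities(2400, 3)
print(rho, p_null, p_intact)  # 800 800/801 1/801
```

The two probabilities sum to 1, so every mutation at an explicitly represented position leads to exactly one of the two outcomes.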
The starting point of the simulation (see Materials and Methods for a more complete description) is a population of organisms that already contains N exact duplicates of the parent gene, which then begin to undergo mutation. For simplicity, each position in an array, representing sites which must be changed to yield an MR feature, can be in either of just two states—the original incompatible state or the mutated, compatible state. Mutations can change a site either forward from incompatible to compatible or backward from compatible to incompatible. (Unlike for null mutations, reversions of compatible mutations back to incompatible ones must be explicitly considered because the probability of reversion in this case is significant.) These transitions occur with equal intrinsic probabilities.
Starting from a uniform population in which all sites that must be changed are in a state incompatible with the MR feature, three processes in our model affect the rate at which the population approaches steady state, and hence the time required to generate the new MR feature:
Sites in the incompatible state can mutate to the compatible state before any null mutation has occurred. This takes place at a rate equal to the mutation rate per site times the fraction of sites that are in the incompatible state (since only that fraction can mutate directly to the compatible state) times the probability that no null mutation has already occurred. That is, at a rate equal to

$$\frac{v\phi}{1+\rho},$$

where v is the mutation rate per site per generation, ϕ is the fraction of nucleotide sites in the population that are in the incompatible state, ρ (as mentioned above) is the ratio of possible null to compatible mutations over the entire protein, and 1 / (1 + ρ) is the probability that a compatible mutation occurs before a null mutation. (Definitions of terms are given in Table 1.)
A site in the compatible state can mutate back to the incompatible state before a null mutation occurs. This takes place at a rate equal to

$$\frac{v(1-\phi)}{1+\rho}.$$
A mutation can occur in any one of the λ sites, but a stochastic check at this point decides with probability ρ / (1 + ρ) that one or more detrimental mutations have already occurred somewhere else in the protein, rendering it nonfunctional. The gene is then considered to be null, and it no longer counts in the model. However, the model allows for the occurrence of new gene duplication events, which recent estimates have shown to happen at a rate comparable to that of point mutation (Lynch and Conery 2000). Because the rates of point mutations and gene duplication are similar, in the model a gene that is determined to be null is replaced by a new gene duplication event, with a new copy of the original gene (which is presumed to be still under selection) with all sites in the original, incompatible state. In the computer model, this process effectively results in all λ sites of a null gene being reset to the original, incompatible state from whatever state they were in. This will happen at rate

$$\frac{\lambda v\rho(1-\phi)}{1+\rho}.$$
The number of nucleotide positions λ appears in this expression because the more compatible positions that were contained in a discarded null gene, the more that are replaced with incompatible ones in a new gene duplication event. The protocol of checking for null mutations in the model only when a mutation first occurs in one of the λ array sites has the intended effect of ensuring that gene duplication occurs in the population at a rate that is comparable to the rate of point mutation.
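The three processes can be sketched as a single per-generation update. This is a minimal illustration, not the authors' program (which is described in Materials and Methods): the names `step_generation` and `compatible_fraction`, the representation of a gene as a list of booleans, the choice to stop processing a gene for the rest of the generation once it has been reset, and the omission of reproduction are all assumptions made here for brevity.

```python
import random

def step_generation(population, v, rho, rng):
    """Advance every gene by one generation, in place.

    Each gene is a list of booleans, one per required site
    (False = incompatible, True = compatible).
    """
    p_null = rho / (1 + rho)
    for gene in population:
        for i in range(len(gene)):
            if rng.random() >= v:
                continue                      # no mutation at this site
            if rng.random() < p_null:
                # Process 3: the gene is judged null and is replaced by a
                # fresh duplicate with all sites incompatible.
                for j in range(len(gene)):
                    gene[j] = False
                break                         # assumption: done for this generation
            # Processes 1 and 2: forward or back mutation of the site.
            gene[i] = not gene[i]

def compatible_fraction(population):
    """Population-wide fraction of sites in the compatible state, (1 - phi)."""
    total_sites = sum(len(gene) for gene in population)
    return sum(sum(gene) for gene in population) / total_sites

# Example run with simulation-scale parameters (illustrative values).
rng = random.Random(0)
population = [[False] * 3 for _ in range(200)]
for _ in range(200):
    step_generation(population, v=0.01, rho=10, rng=rng)
print(compatible_fraction(population))
```

Checking for null mutations only when one of the explicitly represented sites mutates mirrors the protocol described above, so gene replacement occurs at a rate tied to the point mutation rate.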
The overall net rate of change of the fraction ϕ of sites from the incompatible state will be a sum of these three processes:
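$$\frac{d\phi}{dt} = -\frac{v\phi}{1+\rho} + \frac{v(1-\phi)}{1+\rho} + \frac{\lambda v\rho(1-\phi)}{1+\rho} \tag{1}$$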
The first term of the right-hand side of the equation is negative because it is a process in which incompatible sites are removed. The second and third terms are positive because they describe processes where incompatible sites are gained.
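Integrating equation 1 from the initial condition ϕ = 1 at t = 0 gives the fraction of sites in the compatible state as a function of time:

$$1-\phi = \frac{1 - e^{-(2+\lambda\rho)vt/(1+\rho)}}{2+\lambda\rho} \tag{2}$$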
The numerator of the right-hand term is the degree of saturation of the population with compatible mutations—the degree to which it has approached steady state. The value of (1 - ϕ) is the population-wide fraction of nucleotide positions that are in a state compatible with the MR feature.
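The relation between the rate equation and its integrated form can be checked numerically. The sketch below assumes that equation 1 is the sum of the three process rates, dϕ/dt = −vϕ/(1 + ρ) + v(1 − ϕ)/(1 + ρ) + λvρ(1 − ϕ)/(1 + ρ), and that equation 2 is its solution for ϕ = 1 at t = 0; both forms are reconstructions from the surrounding text, and the function names are illustrative.

```python
import math

def dphi_dt(phi, v, rho, lam):
    """Assumed form of equation 1: net rate of change of the incompatible fraction."""
    forward = -v * phi / (1 + rho)                  # incompatible -> compatible
    backward = v * (1 - phi) / (1 + rho)            # compatible -> incompatible
    reset = lam * v * rho * (1 - phi) / (1 + rho)   # null gene replaced by fresh copy
    return forward + backward + reset

def compatible_fraction_closed_form(t, v, rho, lam):
    """Assumed form of equation 2: (1 - phi) as a function of time."""
    k = (2 + lam * rho) * v / (1 + rho)
    return (1 - math.exp(-k * t)) / (2 + lam * rho)

def compatible_fraction_euler(t_end, v, rho, lam, dt=0.01):
    """Integrate equation 1 by forward Euler steps, starting from phi = 1."""
    phi = 1.0
    for _ in range(int(round(t_end / dt))):
        phi += dphi_dt(phi, v, rho, lam) * dt
    return 1 - phi

# Compare the two at simulation-scale parameter values (illustrative).
v, rho, lam, t = 0.01, 10, 3, 50.0
print(compatible_fraction_closed_form(t, v, rho, lam))
print(compatible_fraction_euler(t, v, rho, lam))
```

Under these assumptions the compatible fraction levels off at 1/(2 + λρ), and the numerator, which runs from 0 to 1, measures how far the population has progressed toward that steady state.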
Because of computing limitations, the values of 0.01–0.0001 used for the mutation rate v in the simulations presented below are much higher than the biologically realistic value of about $10^{-8}$ (Drake et al. 1998), and the values of 1–100 used here for ρ are lower than the value of a thousand or greater expected for biologically realistic situations (Walsh 1995). However, Figure 2 shows that the fraction (1 - ϕ) of compatible mutants in our simulations follows equation 2 very closely over a wide range of values for λ and ρ in populations that reproduce either deterministically or stochastically, which makes us more confident when we extrapolate the model to biologically realistic values of v and ρ.
In the following paragraphs, we develop from simple considerations an equation which gives the same quantitative behavior as the numerical model. In Appendix 1, we derive the same form of equation more rigorously by considering coupled equations representing different segments of the population.
What is the probability that a duplicated gene will give rise to a particular MR feature? Consider a gene with λ sites all originally in the incompatible state. As discussed previously, the probability of one of those sites mutating to a compatible state before the occurrence of a null mutation elsewhere in the gene is

$$\frac{1}{1+\rho}.$$

Because any one of the λ sites can mutate first, we can write this as

$$\frac{\lambda}{\lambda(1+\rho)}.$$

To mutate another residue to a compatible state, we must choose among the remaining (λ − 1) possibilities. Thus, the probability for the second position is

$$\frac{\lambda-1}{\lambda(1+\rho)}.$$

The multiplied probability of all λ sites mutating to compatible states before a null mutation occurs and before a back mutation occurs is thus

$$\frac{\lambda!}{\lambda^{\lambda}(1+\rho)^{\lambda}}.$$
(If a back mutation occurs at any point, the likelihood of successfully developing an MR feature is much lower than if the first λ mutations to the duplicate gene yield compatible residues; thus, we ignore that possibility.)
If the probability of an event is P, then of course on average 1/P opportunities will be required before the event occurs. Thus, to produce an MR feature in our model will require an average number of opportunities equal to the inverse of the probability discussed earlier, or

$$\frac{(1+\rho)^{\lambda}\lambda^{\lambda}}{\lambda!}.$$
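The product of per-step probabilities and its inverse can be computed exactly. This sketch assumes the per-step probability takes the form (number of remaining incompatible sites)/(λ(1 + ρ)), which yields the overall expression (1 + ρ)^λ (λ^λ/λ!) quoted later in the text for the inverse; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def mr_success_probability(lam, rho):
    """Probability that all lam sites become compatible before any null
    or back mutation, as a product of per-step probabilities.

    rho should be an int or Fraction so the result stays exact.
    """
    p = Fraction(1)
    for remaining in range(lam, 0, -1):
        # 'remaining' of the lam sites are still incompatible at this step.
        p *= Fraction(remaining, lam) / (1 + rho)
    return p

def expected_opportunities(lam, rho):
    """Average number of opportunities needed, 1/P."""
    return 1 / mr_success_probability(lam, rho)

# Three required sites with rho = 800, as in the earlier worked example.
print(float(expected_opportunities(3, 800)))
```

Because the result grows as (1 + ρ)^λ, each additional required site multiplies the expected number of opportunities by a factor on the order of ρ.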
At steady state, the number of opportunities to produce an MR function in a given time period in a population will be equal to the number of point mutations that occur in the potential MR site across the population. This is the time multiplied by the mutation rate per nucleotide v, the number of nucleotide positions λ that must mutate to compatible residues, and the population size N; that is, Nvλt. To produce a gene with λ compatible mutations, the incompatible residue in a gene with λ − 1 compatible mutations has to be mutated, so the rate at which MR functions with λ compatible sites are produced will be proportional to the degree of saturation of the system with genes containing λ − 1 compatible sites. However, as exemplified by Figure 2, our model does not start at steady state; it starts with all sites in the incompatible state. Thus, the time required to produce an event will also depend on the degree to which the system has approached steady state, as follows. If the degree of saturation for one compatible site is in general S, then the degree of saturation for n compatible sites is $S^n$. Thus, the degree of saturation with λ − 1 compatible sites at any given time is equal to the degree of saturation given in equation 2 raised to the λ − 1 power. Because the degree of saturation changes in time, to find the total number of opportunities for producing an MR feature, this value must be integrated over time.
These considerations can be combined to yield a quantitative description of the behavior of the model with time. The expected average time Tf to the first occurrence of an MR feature for a population of duplicate genes initially in a uniform state, needing λ positions mutated to acquire the MR feature, and with a ratio ρ of null-to-compatible mutations, can be evaluated by equation 3.
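$$N v \lambda \int_0^{T_f} \left[1 - e^{-(2+\lambda\rho)vt/(1+\rho)}\right]^{\lambda-1} dt = \frac{(1+\rho)^{\lambda}\lambda^{\lambda}}{\lambda!} \tag{3}$$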
The right-hand side of equation 3 is the inverse of the probability discussed earlier. The left-hand side gives the number of opportunities for production of the MR feature in the nonequilibrium system starting with no nucleotide positions in compatible states. The preintegral term of the left-hand side of the equation, Nvλ, is the number of point mutations occurring in the population per unit time at steady state. The integrand of equation 3, which is the numerator from the right-hand side of equation 2 raised to the power of λ − 1, is the degree of saturation of the system with “preselectable” mutants, that is, mutants that are one step from being selectable, with λ − 1 sites in the compatible state.
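Equation 3 defines Tf only implicitly, so it has to be solved numerically. The sketch below assumes the saturation term is 1 − exp(−(2 + λρ)vt/(1 + ρ)) and the right-hand side is (1 + ρ)^λ λ^λ/λ!, both reconstructed from the descriptions in the text. Simpson's rule and bisection are generic numerical choices made here, not the authors' method, and the function names are illustrative.

```python
import math

def saturation_integral(T, k, lam, panels=2000):
    """Simpson's rule for the integral from 0 to T of (1 - exp(-k t))**(lam - 1)."""
    h = T / panels                       # panels must be even
    f = lambda t: (1 - math.exp(-k * t)) ** (lam - 1)
    total = f(0.0) + f(T)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3

def expected_first_time(N, v, rho, lam):
    """Solve N*v*lam * integral(0..Tf) = (1+rho)**lam * lam**lam / lam! for Tf."""
    k = (2 + lam * rho) * v / (1 + rho)
    target = (1 + rho) ** lam * lam ** lam / math.factorial(lam)
    opportunities = lambda T: N * v * lam * saturation_integral(T, k, lam)
    hi = 1.0
    while opportunities(hi) < target:    # bracket the root by doubling
        hi *= 2
    lo = 0.0
    for _ in range(60):                  # bisection; the left side increases with T
        mid = 0.5 * (lo + hi)
        if opportunities(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative simulation-scale values, in generations.
print(expected_first_time(N=1000, v=0.001, rho=10, lam=2))
```

For λ = 1 the integrand is identically 1, so the solution reduces to Tf = (1 + ρ)/(Nv), which gives a simple sanity check on the solver.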
Figure 3 shows the result of simulations in which the number of sites λ in an MR feature was varied along with the ratio ρ of null-to-compatible mutations and the haploid population size N. As can be seen, the curves generated by equation 3 match the results of the simulations very closely for a wide range of values of N, ρ, and λ.
Pre-equilibration of the population
Thus far, the starting point for the model has been a uniform population in which all genes are initially present as exact duplicates of the parent gene. Mutations then begin to accumulate and the program immediately starts to check for the presence of the MR feature, simulating the presence of selective pressure from the start. However, a different situation can also be considered, in which the duplicate gene begins to undergo mutation, but selective pressure arises only at a later time, perhaps as a result of environmental changes. In that case, the population of duplicate genes will be at least part of the way toward its steady-state frequency before selection affects the population. This can be modeled in the simulation by neglecting to check for the presence of the MR feature, treating it as a neutral property, until a predetermined number of generations have passed.
Figure 5 shows the result of simulations in which all duplicate genes began in a uniform state, identical to the parent gene, but the population was allowed to undergo mutation and reproduction for varying periods of time before the program started to check for the MR feature. As the length of the pre-equilibration period increases, the average time from the start of selection to observation of the duplicate gene coding for the new MR feature decreases, but only for population sizes at which, at steady state in the absence of selection, at least one duplicated gene with the feature is expected to be present already. This is the case when the population size is greater than the inverse of the probability of producing the MR feature, $N > (1+\rho)^{\lambda}(\lambda^{\lambda}/\lambda!)$. In Figure 5, this occurs at λ ≤ 5. For the case where $N < (1+\rho)^{\lambda}(\lambda^{\lambda}/\lambda!)$ (at λ ≥ 6 in Fig. 5), however, the expected time is essentially unaffected by pre-equilibration of the population. Because it follows from equation 3 that $N < (1+\rho)^{\lambda}(\lambda^{\lambda}/\lambda!)$ when v times the evaluated integral is >1, Tf will be substantially unaffected by pre-equilibration when Tf ≥ 1 / v.
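The population-size criterion can be evaluated directly for any choice of N, ρ, and λ. This is a minimal sketch using integer arithmetic to avoid rounding; the function name is illustrative, and the parameter values in the example are arbitrary rather than those used for Figure 5.

```python
from math import factorial

def preexisting_copy_expected(N, rho, lam):
    """True when N > (1 + rho)**lam * lam**lam / lam!, i.e. when at least one
    gene with the MR feature is expected at steady state without selection.

    N, rho, and lam are taken to be integers so the comparison is exact.
    """
    return N * factorial(lam) > (1 + rho) ** lam * lam ** lam

# Scan lam for one illustrative choice of N and rho.
N, rho = 10**6, 10
for lam in range(1, 8):
    print(lam, preexisting_copy_expected(N, rho, lam))
```

Values of λ for which the function returns True are those where, by the argument above, pre-equilibration would be expected to shorten the waiting time.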