The processive kinetics of gene conversion in bacteria

Summary

Gene conversion, non‐reciprocal transfer from one homologous sequence to another, is a major force in evolutionary dynamics, promoting co‐evolution in gene families and maintaining similarities between repeated genes. However, the properties of the transfer – where it initiates, how far it proceeds and how the resulting conversion tracts are affected by mismatch repair – are not well understood. Here, we use the duplicate tuf genes in Salmonella as a quantitatively tractable model system for gene conversion. We selected for conversion in multiple different positions of tuf, and examined the resulting distributions of conversion tracts in mismatch repair‐deficient and mismatch repair‐proficient strains. A simple stochastic model accounting for the essential steps of conversion showed excellent agreement with the data for all selection points using the same value of the conversion processivity, which is the only kinetic parameter of the model. The analysis suggests that gene conversion effectively initiates uniformly at any position within a tuf gene, and proceeds with an effectively uniform conversion processivity in either direction limited by the bounds of the gene.

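Because all steps are assumed independent, the probability that position m is successfully converted, without the tract proceeding beyond either end of the gene, is a product of three probabilities. Let ρ be the probability of proceeding by one position (the conversion processivity) and let the gene span positions 1 to N. For a start position i ≤ m, the product is the probability of proceeding from i to m, multiplied by the probability of not proceeding from m to N + 1, multiplied by the probability of not proceeding from position i to 0:

$$P(m \mid i) = \rho^{\,m-i}\,\bigl(1-\rho^{\,N+1-m}\bigr)\,\bigl(1-\rho^{\,i}\bigr), \qquad i \le m.$$

This is the sought probability given that we start in position i. The probability that we start at position i is in turn 1/N. Summing over all i ≤ m gives

$$P(m \mid i \le m) = \frac{1}{m}\sum_{i=1}^{m} \rho^{\,m-i}\,\bigl(1-\rho^{\,N+1-m}\bigr)\,\bigl(1-\rho^{\,i}\bigr),$$

where $P(m \mid i \le m)$ is the conditional probability that m is successfully converted given that the process starts at or to the left of m. The same procedure for i > m gives

$$P(m \mid i > m) = \frac{1}{N-m}\sum_{i=m+1}^{N} \rho^{\,i-m}\,\bigl(1-\rho^{\,m}\bigr)\,\bigl(1-\rho^{\,N+1-i}\bigr).$$

The total probability is

$$P(m) = \frac{m}{N}\,P(m \mid i \le m) + \frac{N-m}{N}\,P(m \mid i > m),$$

where m/N and (N − m)/N are the probabilities that a given DSB falls in the intervals [1, m] and [m + 1, N], respectively.

The joint probability $P(m, n)$ that both positions m and n are present in the same conversion tract can be calculated assuming m < n with no loss of generality. The calculation is broken up into three parts, depending on the position i of the DSB: i ≤ m, m < i < n, and i ≥ n. The contributions from the three segments then follow as

$$P_1(m,n) = \frac{1}{N}\sum_{i=1}^{m} \rho^{\,n-i}\,\bigl(1-\rho^{\,N+1-n}\bigr)\,\bigl(1-\rho^{\,i}\bigr),$$

$$P_2(m,n) = \frac{n-m-1}{N}\,\rho^{\,n-m}\,\bigl(1-\rho^{\,m}\bigr)\,\bigl(1-\rho^{\,N+1-n}\bigr),$$

$$P_3(m,n) = \frac{1}{N}\sum_{i=n}^{N} \rho^{\,i-m}\,\bigl(1-\rho^{\,m}\bigr)\,\bigl(1-\rho^{\,N+1-i}\bigr),$$

and the total probability sums to

$$P(m,n) = P_1(m,n) + P_2(m,n) + P_3(m,n).$$

The conditional probability in Eq.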
To ensure the correctness of the derivations, we also designed a straightforward first-principle Monte-Carlo simulation based on the same assumptions. In each round of the simulation, we initiate a single DSB with uniform probability at each position in the gene. We then simulate conversion to the left and right by drawing uniform random numbers and proceeding with the conversion with probability ρ. Each resulting conversion tract is stored, and the process is iterated at least 10^6 times. Finally, we discard tracts that proceeded beyond either end of the gene, or that do not include the selection position. Each simulation was then repeated for each of the three selection points and a very wide range of ρ-values.
The simulation results were identical to the analytical solutions down to the minuscule error at a sampling of 10^6 simulations for all parameters tested: for all the figures of the paper, the simulated curves cannot be separated from the analytically derived curves. They also agree perfectly on the expected sampling errors, included in the theoretical confidence intervals. However, we also note that the agreement does not lend any further support to the model: because the two approaches start with the same assumptions, the simulations must converge to the exact analytical distributions unless there are errors in either approach. The simulations are instead done to convince readers who do not wish to go through all the algebraic steps of the derivations, and to reduce the risk of errors in the derivations.
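The simulation procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names, the small gene length N = 50, the processivity ρ = 0.9 and the positions m and n are hypothetical choices made so that the step-by-step walk runs quickly. The analytical counterpart in `joint_probability` is an assumed form, written out from the verbal description of the derivation (reach the far position, stop before leaving the gene on either side).

```python
import random


def joint_probability(m, n, N, rho):
    """Probability that a tract covers both m and n and stays within 1..N.

    Assumed form: for each start position i (weight 1/N), the walk must
    reach the outermost required position on each side and then stop
    before stepping outside the gene. With m == n this is P(m).
    """
    if m > n:
        m, n = n, m
    total = 0.0
    for i in range(1, N + 1):
        a = min(i, m)  # leftmost position the tract must reach
        b = max(i, n)  # rightmost position the tract must reach
        left = rho ** (i - a) * (1.0 - rho ** a)
        right = rho ** (b - i) * (1.0 - rho ** (N + 1 - b))
        total += left * right
    return total / N


def walk_length(rng, rho):
    """Number of steps taken when each step proceeds with probability rho."""
    steps = 0
    while rng.random() < rho:
        steps += 1
    return steps


def simulate_conditional(m, n, N, rho, rounds, seed=0):
    """Fraction of accepted tracts containing selection point n that also contain m."""
    rng = random.Random(seed)
    with_n = 0
    with_both = 0
    for _ in range(rounds):
        start = rng.randint(1, N)  # uniform DSB position
        left_end = start - walk_length(rng, rho)
        right_end = start + walk_length(rng, rho)
        if left_end < 1 or right_end > N:
            continue  # tract proceeded beyond an end of the gene
        if left_end <= n <= right_end:
            with_n += 1
            if left_end <= m <= right_end:
                with_both += 1
    return with_both / with_n if with_n else float("nan")


if __name__ == "__main__":
    N, rho, m, n = 50, 0.9, 20, 30  # illustrative values only
    exact = joint_probability(m, n, N, rho) / joint_probability(n, n, N, rho)
    approx = simulate_conditional(m, n, N, rho, rounds=200_000)
    print(f"analytical P(m|n) = {exact:.4f}, simulated = {approx:.4f}")
```

With parameters closer to the tuf case (N = 1185 and ρ near 0.998), drawing each walk length directly from a geometric distribution would likely be much faster than stepping one position at a time; the loop is kept here because it mirrors the description in the text.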

SUPPORTING INFORMATION
for The processive kinetics of gene conversion by Paulsson, El Karoui, Lindell and Hughes

S2 Figure. Distribution of conversion tracts for the MudJ transposon.
We selected for conversion at nt 362 using an MMR-deficient strain in which a ~11 kb MudJ transposon is inserted at nt 713 in the donor tuf gene. This tested for consistency with the model, and the results are shown below. The lines show Eq. (1) of the main text for processivity parameter ρ = 0.998 (black line), and the limit distribution in Eq. (2) of the main text, where the average walk length approaches infinity (red line). The grey envelopes are theoretical 95% confidence intervals given the binomial statistics for a two-outcome process, as described above in S1 and in Fig. 2 of the main text, using ρ = 0.998 and the experimental sample size (n = 40).
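As a rough illustration of how a binomial envelope of this kind can be computed, the sketch below takes the model probability p that a given position lies in a tract and returns a central 95% range for the number of tracts, out of n = 40, that contain it. The exact construction used in S1 is not reproduced in this excerpt, so the equal-tailed quantile rule and the function name `binomial_interval` are assumptions.

```python
from math import comb


def binomial_interval(n, p, level=0.95):
    """Equal-tailed interval (lower, upper) for a Binomial(n, p) count.

    lower is the smallest k with CDF(k) >= (1 - level) / 2,
    upper is the smallest k with CDF(k) >= 1 - (1 - level) / 2.
    """
    tail = (1.0 - level) / 2.0
    cdf = 0.0
    lower = None
    upper = n
    for k in range(n + 1):
        cdf += comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        if lower is None and cdf >= tail:
            lower = k
        if cdf >= 1.0 - tail:
            upper = k
            break
    if lower is None:
        lower = n
    return lower, upper


if __name__ == "__main__":
    n_tracts = 40  # sample size quoted in the figure legend
    for p in (0.1, 0.5, 0.9):  # illustrative model probabilities
        lo, hi = binomial_interval(n_tracts, p)
        print(f"p = {p}: {lo / n_tracts:.3f} to {hi / n_tracts:.3f}")
```

Dividing the bounds by n converts the counts into the fraction of tracts, which is the scale on which such an envelope would be drawn around the model curve.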

Here we discuss the average length of duplicated genes in S. enterica. The mathematical results were also double-checked by first-principle computer simulations (for details see below Eq. (S5))

S3 Text. Average length of duplicated genes in Salmonella
The main repeated coding sequences in Salmonella (not counting IS elements) are:
• Two copies of the ccm operon, 7.5 kb, 99% nt identity.
• Five copies of rrl (23S rRNA), 2993 nts, 99-100% nt identity. (There is also a 3009 nt copy with one intervening sequence, and a 3092 nt copy with two intervening sequences.)
• Two copies of the tuf genes, 1185 nts, 99% nt identity.
The ccm operon is concerned with cytochrome c biogenesis and is the longest duplicate sequence in the Salmonella genome. The pag genes are PhoP-PhoQ activated. All other genes (mostly in the rrn ribosomal group) are involved in translation.
Longer genes have proportionally more sites for double-strand breaks, and should perhaps therefore count proportionally more towards the average. Weighting each gene by its length relative to the average is equivalent to accounting for the variance in length, $\sigma_L^2$. With $L_i$ as the length of duplicated sequence number i, out of k sequences, the unweighted and weighted averages are then:

$$\bar{L} = \frac{1}{k}\sum_{i=1}^{k} L_i, \qquad \bar{L}_{\text{weighted}} = \frac{1}{k}\sum_{i=1}^{k} \frac{L_i}{\bar{L}}\,L_i = \bar{L} + \frac{\sigma_L^2}{\bar{L}}.$$

Including all genes above, except the rrl genes with intervening sequences, we have $\bar{L} = 1.9$ kb and $\bar{L}_{\text{weighted}} = 3.9$ kb.
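The relation between the two averages can be checked numerically. The sketch below assumes that the weighted average is the mean of the lengths with each length weighted by its ratio to the unweighted mean, which then equals the unweighted mean plus the population variance divided by the mean. The three lengths are only the sizes quoted in the list above, one entry each, used for illustration; they are not the full gene set and are not expected to reproduce the 1.9 kb and 3.9 kb values.

```python
from math import isclose


def unweighted_mean(lengths):
    """Arithmetic mean of the sequence lengths."""
    return sum(lengths) / len(lengths)


def population_variance(lengths):
    """Variance with divisor k (not k - 1)."""
    mean = unweighted_mean(lengths)
    return sum((x - mean) ** 2 for x in lengths) / len(lengths)


def weighted_mean(lengths):
    """Mean with each length weighted by its length relative to the mean."""
    mean = unweighted_mean(lengths)
    return sum((x / mean) * x for x in lengths) / len(lengths)


if __name__ == "__main__":
    # Illustrative lengths in nt (ccm operon, rrl, tuf); not the full data set.
    lengths = [7500, 2993, 1185]
    mean = unweighted_mean(lengths)
    weighted = weighted_mean(lengths)
    via_variance = mean + population_variance(lengths) / mean
    print(f"unweighted = {mean:.0f} nt, weighted = {weighted:.0f} nt")
    print("identity holds:", isclose(weighted, via_variance))
```

Because the weighted average exceeds the unweighted one by the variance divided by the mean, a single long duplicate such as the ccm operon pulls the weighted value up noticeably.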