The bidirectional replication of bacterial genomes leads to transient gene dosage effects. Here, we show that such effects shape the chromosome organisation of fast-growing bacteria and that they correlate strongly with maximal growth rate. Surprisingly the predicted maximal number of replication rounds shows little if any phylogenetic inertia, suggesting that it is a very labile trait. Yet, a combination of theoretical and statistical analyses predicts that dozens of replication forks may be simultaneously present in the cells of certain species. This suggests a strikingly efficient management of the replication apparatus, of replication fork arrests and of chromosome segregation in such cells. Gene dosage effects strongly constrain the position of genes involved in translation and transcription, but not other highly expressed genes. The relative proximity of the former genes to the origin of replication follows the regulatory dependencies observed under exponential growth, as the bias is stronger for RNA polymerase, then rDNA, then ribosomal proteins and tDNA. Within tDNAs we find that only the positions of the previously proposed ‘ubiquitous’ tRNA, which translate the most frequent codons in highly expressed genes, show strong signs of selection for gene dosage effects. Finally, we provide evidence for selection acting upon genome organisation to take advantage of gene dosage effects by identifying a positive correlation between genome stability and the number of simultaneous replication rounds. We also show that gene dosage effects can explain the over-representation of highly expressed genes in the largest replichore of genomes containing more than one chromosome. Together, these results demonstrate that replication-associated gene dosage is an important determinant of chromosome organisation and dynamics, especially among fast-growing bacteria.
Bacterial species show a large variance in terms of minimal doubling times, which stretch from less than 10 min to several days. It is possible that some uncultivated bacteria, which constitute the majority, have even longer doubling times. Although long generation times are frequent among Eukaryotes, the very short doubling times observed among bacteria show that for these bacteria fast growth is a very important determinant of evolutionary fitness. Interestingly, several physiological characteristics of fast growth are remarkably constant (Neidhardt, 1999). Notably, the chemical composition in terms of proteins, stable RNA and DNA is a simple function of the steady-state rate of growth for a given temperature and tends to be constant with changes in the medium. Also, the rates of elongation of peptides in translation, of mRNA in transcription and of DNA in replication show little dependence on the growth rate, at least when compared with the changes of growth rate itself in different media (Vogel and Jensen, 1994; Bremer and Dennis, 1996; Bipatnath et al., 1998). In general, faster-growing cells increase in size and are enriched in the elements involved in translation and transcription (Bremer and Dennis, 1996). This increase may be dramatic. In Escherichia coli, when the doubling time decreases from 100 to 24 min, the number of RNA polymerases (RNAP) per cell increases from 1500 to 11 000, and the number of ribosomes from 6800 to 72 000 (Bremer and Dennis, 1996). In fact, at the maximal growth rate the large majority of the transcription/translation machinery is dedicated to its own synthesis.
When gene expression approaches saturation, the only way of significantly increasing transcription is by increasing gene dosage. This may or may not result from gene duplication and may be transient or stable. Transient gene dosage can result from tandem amplification of genes because of a particularly strong selection pressure for the increase in the expression of a given gene (Romero and Palacios, 1997). It usually reverts quickly to the single-copy case when the selective pressure is removed. Stable gene duplication is rare in bacteria and the most widespread example is the presence of multiple copies of rDNA and tDNA genes in many bacterial genomes (Marck and Grosjean, 2002; Stevenson and Schmidt, 2004). The number of copies of these genes is well correlated with the growth rate (Rocha, 2004a). The deletion of up to two out of the seven rDNA copies in E. coli has a small negative effect on the growth rate (Condon et al., 1992), but there is evidence that excessive rates of single-gene rDNA expression lead to folding problems and incorrect ribosome assembly (Lewicki et al., 1993). Also, the full set of copies is necessary for faster adaptation to situations of exponential growth (Condon et al., 1995; Klappenbach et al., 2000). This is because a quicker onset of exponential growth is a determinant of bacterial fitness in fast-growing bacteria. Finally, replication can lead to higher gene dosage in the cell through a mechanism that does not involve multiple copies of the genetic material in the chromosome but takes advantage of the distance of the gene from the origin of replication (Cooper and Helmstetter, 1968; Chandler and Pritchard, 1975). This is what we shall call, throughout this article, replication-associated gene dosage.
Replication shapes the organisation of the bacterial chromosome at several levels (Rocha, 2004b). The bidirectional replication of the bacterial chromosome from a single origin differentiates it into leading and lagging strands that differ in the number, type and composition of genes. Also, because some regions, near the origin, are replicated earlier than others, closer to the terminus of replication, there are asymmetries in gene type and composition along the chromosome. For example, late replicating regions tend to exhibit higher substitution rates (Sharp et al., 1989; Mira and Ochman, 2002) and A+T richness (Daubin and Perriere, 2003). Some of this effect stems from the fact that highly expressed genes are more conserved (Rocha and Danchin, 2004) and they tend to concentrate near the origin of replication of bacteria (this work). Indeed, as soon as a replication round starts, the genes near the origin are in two copies in the cell, whereas the genes near the terminus remain in one single copy. Hence, genes whose expression is near saturation in moments of fast growth may enjoy this replication-associated gene dosage effect (Chandler and Pritchard, 1975; Schmid and Roth, 1987; Sousa et al., 1997). The effect is even stronger for bacteria in which the minimal doubling time is smaller than the time necessary to replicate the chromosome. In this case, there are multiple simultaneous rounds of replication (Yoshikawa et al., 1964), and the ratio of the number of origins over the number of termini in the cell doubles with each additional replication round (Cooper and Helmstetter, 1968). In some fast-growing strains of E. coli there may be up to three replication rounds and thus genes near the origin are eight times more abundant in the cell than genes near the terminus. Naturally, the expression of a gene is primordially determined by the control at the transcriptional level, even when strong gene dosage effects are present (Ardell and Kirsebom, 2005). 
However, if at the moment of exponential growth a subset of genes is limiting growth and approaching gene expression saturation, one would expect the positioning of these genes near the origin of replication to be under strong selection. As this effect may be so strong it has been often invoked to explain the deleterious effects of chromosome rearrangements that shift highly expressed genes from the origin to the terminus of replication (Louarn et al., 1985; Hill and Gray, 1988; Campo et al., 2004; Kothapalli et al., 2005).
Here, we present an analysis of gene dosage effects in bacterial genomes. This subject has been extensively studied in E. coli (Cooper and Helmstetter, 1968; Chandler and Pritchard, 1975; Schmid and Roth, 1987; Sousa et al., 1997; Ardell and Kirsebom, 2005) and frequently invoked to justify expression and genome rearrangement data. However, the phylogenetic span of gene dosage effects, the genes they affect and their association with growth rates have not been analysed. This is because such an analysis requires the availability of many genomes and also information on the minimal doubling time for each species. Minimal doubling times are often hard to measure and contingent on a set of experimental conditions. Here, we used an improved update of previously published data on minimal doubling times for bacteria (Rocha, 2004a). As we were interested in the effects of gene dosage at maximal growth, we used the smallest available doubling times, which we expect to be an overestimation of the minimal ones, especially for little studied bacteria for which the best growth conditions are still far from optimal. We have used this information to compute gene dosage effects in bacterial genomes, using E. coli as a model. This allowed us to assess these effects and to quantify their influence on the distribution of different types of genes. We then put these results in relation to genome stability and structure.
Results and discussion
Estimation of the number of replication rounds in maximal growth conditions
We measured the importance of the replication-associated gene dosage effects according to the Cooper–Helmstetter model for each genome (Cooper and Helmstetter, 1968) (see Experimental procedure). For this, we first calculated the chromosome replication time using as a model the maximal rate of the replication fork in E. coli (1000 nt/s). Then, we divided the chromosome replication time, i.e. the time it takes to replicate each replichore, by the minimal doubling time of each bacterium; we call this ratio R. On average, bacteria have an R not significantly different from 0.5 (R = 0.48, P > 0.5) (Fig. 1). Yet, this value masks a tremendous diversity between species. Some bacteria have a very low R, less than 0.005 (Supplementary Table S1). These are mainly divided into two groups. First, obligatory symbionts (parasites and mutualists) tend to grow slowly and have small genomes, smaller genomes generally implying shorter replication times, and show a significantly smaller average R of 0.11 (P < 0.001, Wilcoxon test). Second, photosynthetic cyanobacteria, which often double once per day, show an average R of 0.035. The R of E. coli and Bacillus subtilis are close to 1.5. Slightly higher values have been observed in classical works (Yoshikawa et al., 1964; Cooper and Helmstetter, 1968), especially for E. coli, but this value depends on the genome length, which is smaller in MG1655 than in most other strains. In optimal conditions, E. coli MG1655 has $2 \times (2^{2} - 1) = 6$ replication forks at the same time, but strains larger than ∼4.8 Mb, if they have the same doubling time, have 14 replication forks. The highest R is 3.5 for Erwinia carotovora, followed by all the larger chromosomes of Vibrio species, which have R-values between 2 and 3.
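The calculation of R and of the corresponding number of forks can be sketched as follows. This is a minimal illustration rather than the procedure of the Experimental procedures: the function names are hypothetical, the number of simultaneous rounds is assumed to be R rounded up to the next integer (which reproduces the 6, 14 and 30 forks quoted in the text for R close to 1.5, between 2 and 3, and 3.5), and the genome length and doubling time in the usage lines are illustrative values only.

```python
import math

FORK_RATE_NT_PER_S = 1000  # maximal E. coli fork rate used as the model in the text


def replication_ratio(chromosome_length_nt, min_doubling_time_min,
                      fork_rate=FORK_RATE_NT_PER_S):
    """R = time needed to replicate one replichore / minimal doubling time."""
    replichore_nt = chromosome_length_nt / 2           # bidirectional replication
    replication_time_min = replichore_nt / fork_rate / 60
    return replication_time_min / min_doubling_time_min


def simultaneous_rounds(r):
    """Overlapping replication rounds; ASSUMED here to be ceil(R)."""
    if r <= 0:
        raise ValueError("R must be positive")
    return math.ceil(r)


def replication_forks(r):
    """Forks present at the same time: 2 * (2**n - 1) for n overlapping rounds."""
    n = simultaneous_rounds(r)
    return 2 * (2 ** n - 1)


# Illustrative (hypothetical) input: a 4.64 Mb chromosome doubling every 25 min.
r_example = replication_ratio(4_640_000, 25)
forks_example = replication_forks(r_example)
```

With these illustrative inputs R is about 1.55, so two rounds overlap and six forks are expected; the same function gives 30 forks for R = 3.5.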
According to the Cooper–Helmstetter model, and if the DNA polymerase (DNAP) rate in other enterobacteria is the same as in E. coli, exponentially growing E. carotovora have 30 replication forks at the same time in the chromosome, and the closely related Vibrio are expected to have between 6 and 14, for the major chromosomes, plus between 2 and 6 for the minor ones. These results suggest an extreme variability in the number of expected replication forks in maximally growing cells, matching the variability in bacterial minimal doubling times and chromosomal sizes. They also suggest that some bacteria, e.g. E. carotovora, can manage a very large number of replication forks. For this, not only must cell division mechanisms be tightly tuned with replication termination, but one would also expect an extremely efficient intervention of recombination mechanisms to repair stalled replication forks (Michel et al., 2004). It has been estimated that 18% of replications lead to a replication fork arrest in E. coli cells with a growth rate lower than that inducing multiple rounds of replication (Maisnier-Patin et al., 2001). This means that each fork completes chromosome replication without suffering an arrest with a probability of 0.91. If these results are valid for the closely related faster-growing Vibrio and Erwinia species, then the likelihood that none of the 30 replication forks possibly present at each given moment in Erwinia cells suffers arrest before completing replication is only $0.91^{30} \approx 6 \times 10^{-2}$ (0.15 in Vibrio vulnificus).
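The arithmetic behind these probabilities can be checked with a short sketch. Two assumptions are made here that the text does not state explicitly: the per-fork value of 0.91 is taken to follow from the 18% arrest frequency because each replication involves two forks, and forks are treated as independent. The function names are hypothetical.

```python
def per_fork_success(p_arrest_per_replication, forks_per_replication=2):
    """Probability that one fork finishes without arrest.

    ASSUMPTION: a replication succeeds only if both of its forks succeed,
    so the per-fork probability is the square root of the per-replication one.
    """
    return (1 - p_arrest_per_replication) ** (1 / forks_per_replication)


def prob_no_arrest(n_forks, p_fork_success):
    """Probability that none of n independent forks is arrested."""
    return p_fork_success ** n_forks


p_fork = per_fork_success(0.18)          # about 0.906, quoted as 0.91 in the text
p_erwinia = prob_no_arrest(30, 0.91)     # about 0.06 for 30 forks
p_vibrio = prob_no_arrest(20, 0.91)      # about 0.15, i.e. 14 + 6 forks
```

The value of 0.15 quoted for V. vulnificus corresponds to 20 forks under this reading, that is, 14 on the major chromosome plus 6 on the minor one.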
Testing the weight of phylogenetic inertia on R
Before proceeding with the multigenomic comparisons, we explicitly tackled the problem of phylogenetic inertia (Blomberg and Garland, 2002). Because some genomes are evolutionarily closer than others, their common history may lead to higher similarity. Stated otherwise, if the traits evolve slowly, e.g. because they are highly constrained, then two closer genomes are similar with higher probability even if they are not under the same selection pressure for the trait, simply because of the inertia associated with their common evolutionary history. There is an extensive literature on these effects and on how to deal with them (Martins and Hansen, 1997). We first removed from our analysis genomes of strains of the same species and then checked whether the remaining data were strongly affected by this effect. If so, then points could not be treated as statistically independent. For this we computed for all pairs of genomes the difference in R (ΔR) and the phylogenetic separation time. If phylogenetic inertia were strong, one would expect to find a correlation between the two variables, because small separation times should tend to lead to smaller differences in R. We checked this at several levels of divergence and found no signals of phylogenetic dependence (Table 1). For the 337 closer pairwise comparisons (5% of the total), we found a non-significant association between ΔR and time (Spearman ρ = 0.069, P = 0.2; Pearson's r = −0.01, $R^2$ = 0.0001, P = 0.8) (Supplementary Fig. S4). Hence, for all practical purposes the data points can be treated as statistically independent. Surprisingly, this suggests that at these large time scales, there is only weak phylogenetic inertia associated with the trait R (the same is valid for minimal doubling time).
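The pairwise test described above can be sketched as follows: for every pair of genomes, the absolute difference in R is paired with the separation time, and a rank correlation is computed over all pairs. This is an illustrative reimplementation in plain Python, not the code used for Table 1; the data structures, function names and toy values are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def delta_r_vs_time(r_values, separation_time):
    """Rank correlation between |delta R| and separation time over all pairs.

    r_values: dict genome -> R
    separation_time: dict frozenset({genome_a, genome_b}) -> time
    """
    delta_r, times = [], []
    for a, b in combinations(sorted(r_values), 2):
        delta_r.append(abs(r_values[a] - r_values[b]))
        times.append(separation_time[frozenset((a, b))])
    return spearman(delta_r, times)


# Toy (hypothetical) data: three genomes and their pairwise separation times.
toy_r = {"A": 0.1, "B": 0.2, "C": 1.5}
toy_time = {frozenset(("A", "B")): 1.0,
            frozenset(("A", "C")): 3.0,
            frozenset(("B", "C")): 3.0}
toy_rho = delta_r_vs_time(toy_r, toy_time)
```

A correlation close to zero, as reported for the real data, is what one expects in the absence of strong phylogenetic inertia; the toy data are built to show a positive correlation instead.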
Table 1. Correlation between the difference in R and the corresponding phylogenetic distance for pairs of genomes at increasingly closer phylogenetic distances (given by the threshold distance).
Threshold on distance
% of data
The comparisons include N points corresponding to different percentages of the data.
+: P-value < 0.05; ++: P-value < 0.001. NS, non-significantly different from zero.
The reliability of R, and its association with the doubling time and genome length
At this stage, we must emphasize that R is calculated assuming all replication forks to advance at 1000 nt/s, as in E. coli under fast growth. For very slow-growing bacteria, DNAP may be slower due to relaxed selection for a high processing rate. There is little information on DNAP processing rates. In Mycoplasma capricolum a DNAP processing rate of 100 nt/s has been observed. However, the genome is so small, that not more than one replication fork is expected at maximum growth (Seto and Miyata, 1998). In cases where the replication fork progresses slower, gene dosage effects will be under-represented. However, as gene dosage effects are not expected in bacteria for which there is no strong selection for fast growth, which are the ones that might show relaxed selection for fast DNAP, this probably has little consequence for the conclusions derived from our analysis. A particularly interesting example is found among Mycobacteria (Hiriyanna and Ramakrishnan, 1986). The DNAP of the slow-growing Mycobacterium tuberculosis progresses at ∼50 nt/s whereas the one of the faster-growing M. smegmatis (doubling time ∼3 h) progresses at ∼600 nt/s, comparable to E. coli at the same doubling time. This is compatible with the idea that fast replication is under strong selection in fast-growing bacteria, and that slower DNAP are the result of relaxed selection, but that such decrease in the fork advance does not result in significantly higher gene dosage effects.
In fact, throughout the work we shall show that low values of R are associated with low selection on gene dosage effects. It is also not implausible that some of the bacteria with the highest predicted R have evolved faster DNAP, which would lead to a lower effective increase in R. In both cases R should be viewed as a proxy for the expected intensity of selection for fast replication, and not necessarily as directly proportional to the number of simultaneous replication rounds. In E. carotovora, a DNAP with a rate twice that of E. coli MG1655 would result in only two simultaneous replication rounds. Further, the replication fork rate depends on the fork composition (Khodursky et al., 2000) and accelerates with increasing growth rates in E. coli (Bipatnath et al., 1998). A higher acceleration in even faster-growing bacteria could allow significantly smaller increases in R. If true, this raises interesting biotechnological potential for such DNAP. However, there are a number of reasons to expect that DNAP cannot totally compensate for shrinking doubling times. Although DNAP processing rates are unknown for the vast majority of sequenced bacteria, Vibrio cholerae replication forks have been found to proceed at a rate similar to those of E. coli (Egan et al., 2004). The same is true for B. subtilis replication forks in spite of the more than 2 billion years separating the two species and their different DNAP constituents (Huang and Ito, 1999; Dervyn et al., 2001). Hence, maximal replication speed may show limited variation among the fastest-growing bacteria, possibly being limited by factors such as kinetics and accuracy. Finally, our observation in the following paragraphs that R correlates systematically with growth-optimization strategies involving the biased distribution of genes to enjoy replication-associated gene dosage effects is also an indication that the number of simultaneous replication rounds varies significantly.
To further substantiate our claim, we computed the correlation between R and the minimum doubling time, which is very high (Spearman ρ = −0.94, P < 0.001) (Fig. 2). Then, throughout this work, we computed in parallel correlations using R and using solely the information of minimum doubling time. All correlations were qualitatively similar but, as expected, correlations using R were systematically higher (Supplementary Table S2), indicating that R is more informative than simply using the minimum doubling time.
As R is a function of the minimal doubling time and chromosome length we also computed the correlation between chromosome size and R (Spearman ρ = 0.52, P < 0.001) and chromosome size and the other variables treated throughout the article. These are much smaller than the ones using R or the doubling time. We then tested if the mutual dependency of R and other relevant variables upon chromosome size could be determinant. This was done using partial correlations and showed that partial correlations involving R and controlled for chromosome size remain almost unchanged. On the other hand, correlations between chromosome size and the other variables nearly disappear when one controls for their mutual dependency on R (Supplementary Table S4). Hence, the parallel analysis using chromosome size will be omitted throughout the text (see Supplementary Table S2 for results).
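For reference, a partial correlation between two variables controlled for a third can be obtained from the three pairwise correlations. The sketch below uses the standard first-order formula; that this exact formula was used here is an assumption, the function name is hypothetical, and the correlation between gene position and chromosome size in the usage line is an invented illustrative value.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


# x = R, y = average gene position, z = chromosome size.
# r_xz = 0.52 is the value reported in the text; r_xy and r_yz are
# HYPOTHETICAL numbers chosen only to illustrate the call.
example = partial_correlation(r_xy=-0.44, r_xz=0.52, r_yz=-0.20)
```

When the correlation of x and y is entirely explained by z (r_xy equal to the product r_xz × r_yz), the function returns zero, which is the pattern described above for chromosome size once R is controlled for.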
Estimation of the actual number of genes of tDNA and rDNA considering gene dosage effect associated to replication
Because of replication-associated gene dosage effects, genes located near the origin of replication are present in more copies in the cell than the ones near the terminus. The difference increases exponentially with the number of simultaneous replications, i.e. with R. We started by analysing the most transcribed genes in exponentially growing bacteria, the rDNA and tDNA genes, and calculating their effective gene dosage (see Experimental procedures). In our dataset the dosage of rDNA genes ranges from three (the minimal gene complement of rDNA) for M. tuberculosis to ∼160 for E. carotovora and Bacillus cereus (Fig. 1). The calculated dosage of tDNA genes ranges from 31 for Mycoplasma pulmonis to more than 450 for V. parahaemolyticus and B. cereus. Two effects can explain the wide differences of gene dosage effects in tDNA and rDNA among different genomes. First, the difference is related to the number of copies of rDNA (respectively tDNA) genes in the genomes, which varies from three for M. tuberculosis (respectively 29 for M. pulmonis) to 34 for B. cereus (respectively 168 for Photobacterium profundum). Second, it is linked to the place the genes occupy on the chromosome: when genes are nearer to the origin, their replication-associated gene dosage increases. This effect becomes predominant if the number of simultaneous replication rounds in the bacterium is large. Protein coding genes are rarely present in duplicated copies in bacteria and, if one ignores transient gene amplification by recombination mechanisms, their gene dosage is only a function of the two latter replication-associated effects. This will be discussed in the next section.
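The way copy number and position combine can be illustrated with a small sketch. It assumes, following the Cooper–Helmstetter model, that a locus at relative distance d from the origin (0 at the origin, 1 at the terminus) is present in 2^(R(1 − d)) copies relative to the terminus, and that the effective dosage of a gene family is the sum of this quantity over its copies. The exact definition is given in the Experimental procedures and may differ; the function names are hypothetical.

```python
def copy_dosage(relative_distance, r):
    """Copies of a locus relative to the terminus: 2 ** (R * (1 - d)).

    ASSUMPTION: d is the distance from the origin as a fraction of the
    replichore length (0 = origin, 1 = terminus).
    """
    if not 0.0 <= relative_distance <= 1.0:
        raise ValueError("relative distance must lie in [0, 1]")
    return 2.0 ** (r * (1.0 - relative_distance))


def effective_dosage(relative_distances, r):
    """Effective dosage of a gene family: sum over all of its copies."""
    return sum(copy_dosage(d, r) for d in relative_distances)


# With R close to 0 every copy counts once, so three genes give a dosage of 3.
slow_grower = effective_dosage([0.1, 0.5, 0.9], r=0.0)
# With three overlapping rounds, an origin-proximal gene is 8-fold enriched.
fast_origin = copy_dosage(0.0, r=3.0)
```

This reproduces the two limiting cases mentioned in the text: a dosage equal to the gene count when R is negligible, and an eightfold origin-to-terminus ratio for three replication rounds.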
Association between gene positioning and gene dosage effects reflects genetic regulation and expression levels
We wondered if the general effects described in the previous sections translate into a significant selective advantage of having highly expressed genes near the origin. We analysed the distribution of rDNA, ribosomal proteins (RP), RNAP and tDNA genes. To take into account other highly expressed genes we also used the set of the other genes having the 5% highest codon usage bias, measured through the Codon Adaptation Index [CAI5%: protein coding genes with the 5% strongest codon usage bias (excluding ribosomal proteins)]. These genes are the ones exhibiting the most adapted codon composition to the translation machinery and are thus usually regarded as highly expressed under exponential growth (Andersson and Kurland, 1990). One of us has previously proposed that some tDNA are much more ubiquitous (tDNAub) and important for high levels of translation than others (tDNAnub) (see Supplementary Table S3 for a list of tRNAub taken from table 2 of Rocha, 2004a). We thus divided tDNA genes into tDNAub and tDNAnub and analysed them separately. As the dosage effect associated with the location of genes near the origin increases with R, we calculated the correlation between R and the average distance of each type of gene to the origin of replication (Fig. 3). We observed that for rDNA, RP, RNAP and tDNAub the average distance to the origin is significantly negatively correlated with R (−0.23, −0.44, −0.51, −0.36 respectively, all Spearman ρ, P < 0.001). Thus, the higher the R, the closer the genes are to the origin of replication. We found qualitatively similar, although systematically weaker, correlations between the minimum doubling time and the average position of these genes (Supplementary Table S2 and Fig. S1). The only sets of genes whose positioning does not significantly correlate with R are tDNAnub and CAI5% (−0.07 and −0.13 respectively).
Some tDNA genes are included in the rDNA operons, which creates an artificial correlation between tDNA position and R, via rDNA position. We found that on average 34% of tDNAub and 12% of tDNAnub are within rDNA operons. Hence, tRNAub are more frequently with rDNA, and this frequency is higher for genomes with R > 0.5 (41%). We then removed all the tDNA genes clustered within rDNA operons from the analysis and found that the average distance to the origin of the remaining tDNAub still correlates with R (ρ = −0.21, P < 0.02). Hence, there is evidence for selection of tDNAub positioning to enjoy gene dosage, which is independent of rDNA gene dosage, and weaker. tDNAnub are less biased. This is probably because they are much less used than the other tDNA to translate highly expressed genes (Rocha, 2004a; Sharp et al., 2005). Hence, the selective advantage associated with a positioning close to the origin of replication is much smaller, if it exists at all, and these genes are randomly distributed in the chromosome. As far as the CAI5% genes are concerned, its low bias suggests that only the highly expressed genes associated with translation and transcription are subject to strong gene dosage effects. This will be analysed in more detail in the next section.
Replication-associated gene dosage effects are expected to be more important in fast-growing bacteria, where small doubling times are under particularly strong selection. Hence, we analysed the position of each set of genes separating bacteria with high values of R from the others (R > 0.5 and R ≤ 0.5). We divided each chromosome in 10 equal parts (bins) regarding the distance from the origin (five parts for RNAP because of the lower number of genes involved) and computed for each part the average number of rDNA, RP, RNAP, tDNAub, tDNAnub and CAI5% genes (Fig. 4). For each set of genes, we also computed the correlation between the numeric value of the bin (i.e. 1 for the first bin, 2 for the second, etc.) and the frequency of genes at the bin (Fig. 4). When R > 0.5, the correlations are highly significant for each set of genes. As previously, the positions of tDNAnub and CAI5% genes are the least affected by gene dosage effects. When R < 0.5, the correlations for rDNA, RP and RNAP genes are less significant, and the ones of tDNAub, tDNAnub, CAI5% genes are not significant at all. We repeated the analysis using minimum doubling time, instead of R, and found qualitatively similar results (Supplementary Fig. S2).
We observed in the group with strong gene dosage effects (R > 0.5) that the order of the average distance to the origin of replication is the following: RNAP (20%), rDNA (28%), RP (33%), tDNAub (41%), tDNAnub (46%) and CAI5% (47%). Interestingly, this matches the order of regulation of these genes during fast growth (Dennis et al., 2004). At high growth rate, in E. coli, rDNA synthesis is limited by the amount of RNA polymerase. Then, the amount of free rDNA regulates the expression of RP (Dennis et al., 2004). RNAP is the class of genes closest to the origin among 26 of the 33 genomes with R > 0.5 and in 17 of the 17 genomes with R > 1. Both observations indicate a very significant bias towards positioning RNAP closer to the origin (P < 0.0001, binomial tests). Strikingly, the order of average position of RNAP, rDNA and RP among the 17 genomes with R > 1, is RNAP < rDNA < RP in 12 genomes (expected < 3) and RNAP < RP < rDNA in the five others. Among genomes with R > 0.5 we find the order RNAP < rDNA < RP < tDNAub in 16 out of the 33 genomes (expected < 1, difference significant, P < 0.0001, binomial test). This strongly suggests an association between gene dosage, coding position and regulatory dependencies among these genes. The regulation of tDNA is more complex. The expression of the minority of tDNA that is included in rDNA operons is regulated in the same way as rDNA. The other part is regulated by other factors (Nilsson and Emilsson, 1994). Finally, some tRNAs arise in the genome from horizontal gene transfer, as is often the case in pathogenicity islands (Hacker and Kaper, 2000). In this case tDNA is randomly distributed relative to the origin of replication. The conjunction of these effects may explain why the tRNA average position is farther from the origin than the average position of rDNA and RP.
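The expected counts and binomial tests quoted above can be illustrated for the ordering of RNAP, rDNA and RP. The sketch assumes that, under the null hypothesis, the 3! = 6 possible orderings of the three classes are equally likely, which gives an expectation of 17/6 ≈ 2.8 genomes (consistent with the "expected < 3" in the text); the exact null model used for the other tests is not specified here, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_genomes = 17            # genomes with R > 1
observed = 12             # genomes showing RNAP < rDNA < RP
p_null = 1 / 6            # ASSUMED: all 6 orderings equally likely

expected = n_genomes * p_null
p_value = binomial_upper_tail(observed, n_genomes, p_null)
```

Under this null model the probability of observing 12 or more genomes with the same ordering is far below 0.0001.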
Replication-associated gene dosage effects are associated with translation/transcription genes
The previously observed low bias associated with the CAI5% set suggests that few highly expressed genes outside the categories of translation/transcription enjoy a selection for replication-associated gene dosage effects. Several methodological reasons complicate this analysis. First, the CAI5% set was built by identifying high expression levels through codon usage bias, which correlates well, but not perfectly, with expression levels in fast-growing organisms (Andersson and Kurland, 1990; Coghlan and Wolfe, 2000; Rocha, 2004a). Second, it would be best to remove from the CAI5% set all genes related to transcription and translation, but this is complicated by sequence divergence and the quality of annotations (in the CAI5% set we only removed the RP, which are easy to pinpoint). The first problem may lead to an underestimation of the gene dosage effect in highly expressed genes, whereas the second may lead to an overestimation of the effect if translation and transcription genes are more biased.
To further analyse this problem we turned to E. coli MG1655, which is a fast-growing bacterium for which codon usage biases have been used to identify highly expressed genes for decades (Ikemura, 1981; Gouy and Gautier, 1982) and for which there are publicly available studies of transcriptomics (Wei et al., 2001) and proteomics (VanBogelen et al., 1999). As annotations are very reliable for this species we could remove from the list of genes all the ones associated with transcription or translation (Blattner et al., 1997). We then computed the median position of the improved CAI5% set of genes in this genome. We did the same for the 5% most expressed genes in rich medium, identified through transcriptomics. Finally, we took the 61 non-transcription/non-translation related proteins identified in the E. coli gene–protein database, which are expected to be among the proteins in higher concentrations in the cell. The CAI5% set showed a median relative position of 50%, with an average relative position at 45% of the chromosome (non-significantly different from 50%, P > 0.05). The other sets also failed to show a significant effect (respectively medians of 52% and 51%, averages of 49% and 44%, non-significantly different from 50% at P < 0.05). Together with the previous data this shows that replication-associated gene dosage effects are mostly associated with the elements of the translation and transcription machinery.
The fraction of RNAP dedicated to transcribe mRNA decreases with exponential growth, from 75% to less than 10% in E. coli (Dennis et al., 2004). Because a large fraction of the mRNA involves translation and transcription genes, the expression of the remaining genes in the set of CAI5% is certainly much less significantly increased at the onset of exponential growth under the fastest growth rates. This may explain why the biases are so low in these genes.
Multiple chromosomes and the optimization of gene dosage effects
Some bacteria have two chromosomes and, if they have different sizes, the genes present in them may have very different replication-associated gene dosage effects. For example, suppose that one chromosome is twice as long as the other and that the doubling time is similar to the time required to replicate the smallest chromosome. Let us also assume that the forks advance at the same rates in the two chromosomes, which has not been extensively tested, but seems reasonable. Under these circumstances, the large chromosome has two simultaneous replication rounds, whereas the smallest has only one. The genes located near the origin of replication of the largest chromosome would be in four copies; the genes located near the origin of replication of the smallest chromosome would be in only two copies. Bacteria with their highly expressed genes near the origin of the largest chromosome would be advantaged compared with bacteria having them far from the origin or in the other chromosome. To test the association between gene dosage and gene distribution between the two chromosomes, we calculated the number of CAI5%, tDNAub, rDNA and RP on both chromosomes. Except for Leptospira interrogans, the amount of each of these genes on each chromosome is significantly different from what would be expected if it were proportional to the chromosome length (Table 2). Moreover, the over-representation of the highly expressed genes in the largest chromosome (supposing distribution proportional to size) is highly correlated with R (ρ = 0.79, P < 0.005, Fig. 5). Hence, the stronger the dosage effect, the higher the over-representation of highly expressed genes in the largest chromosome.
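The thought experiment above, and the null expectation used for the comparison with chromosome length, can be written as a short sketch. The chromosome lengths and gene total are hypothetical round numbers, the fork rate is the 1000 nt/s model value, the function names are illustrative, and the origin copy number is taken as 2^R relative to the terminus.

```python
FORK_RATE_NT_PER_S = 1000  # same fork rate assumed on both chromosomes


def replication_ratio(length_nt, doubling_time_s, fork_rate=FORK_RATE_NT_PER_S):
    """R = time to replicate one replichore / doubling time."""
    return (length_nt / 2 / fork_rate) / doubling_time_s


def origin_copies(r):
    """Copies of an origin-proximal gene relative to the terminus: 2 ** R."""
    return 2.0 ** r


def expected_split(total_genes, length_large, length_small):
    """Expected gene counts if genes were distributed proportionally to length."""
    fraction_large = length_large / (length_large + length_small)
    return total_genes * fraction_large, total_genes * (1 - fraction_large)


# Hypothetical replicons: 4 Mb and 2 Mb, doubling time equal to the time
# needed to replicate the smaller one (1000 s at 1000 nt/s per fork).
small, large = 2_000_000, 4_000_000
doubling_time_s = small / 2 / FORK_RATE_NT_PER_S

r_large = replication_ratio(large, doubling_time_s)   # two rounds
r_small = replication_ratio(small, doubling_time_s)   # one round
copies_large = origin_copies(r_large)                  # four copies
copies_small = origin_copies(r_small)                  # two copies
expected_large, expected_small = expected_split(90, large, small)
```

Observed counts of highly expressed genes on each chromosome can then be compared with the length-proportional expectation, as done in Table 2.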
Table 2. Differential gene dosage effects between two chromosomes and test that larger chromosomes tend to have more highly expressed genes in bacteria with significant gene dosage effects.
Highly expressed genes are the sum of CAI5% including ribosomal proteins, tDNAub and rDNA.
I, larger chromosome; II, smaller chromosome; L, length (Mb); O, observed number of highly expressed genes; E, expected number; P, significance of the rejection of equal partition between the two chromosomes (+: P-value < 0.05, ++: P-value < 0.001), R, replication time of the largest chromosome divided by the minimal doubling time; NS, non-significantly different from zero.
The reason for the existence of multiple chromosomes is subject to speculation (Egan et al., 2005). One could hypothesize that multiple chromosomes replicated in parallel take less time to replicate than one single chromosome with the same amount of genetic material. However, with the exceptions of Vibrio and Burkholderia, species having two chromosomes are not among the fastest-growing bacteria. Further, if faster DNA replication were driving the existence of multiple chromosomes, one would expect these chromosomes to have the same size. Instead, in all cases one chromosome is much larger than the other and contains most of the highly expressed genes. On the other hand, gene dosage effects would be more important in one single large chromosome than in two smaller ones. From our analysis it is tempting to speculate that, whatever the reason for the presence of two chromosomes in bacterial cells, differently sized replicons may allow a stronger gene dosage effect, and thus be selectively advantageous. Naturally, this is not applicable to genes in plasmids, which can be present in very large numbers in a single cell.
If the second chromosomes are derived from megaplasmids, the over-representation of translation- and transcription-associated genes in the largest chromosome might be due to phylogenetic inertia, i.e. not enough time has passed for the genes to be ‘transferred’ to the second chromosome. However, two controls show that our results are independent of this effect. First, one finds many more orthologues in the second chromosome than would be expected from the percentage of highly expressed genes it contains. For example, in V. vulnificus, which has the highest R in the set, 22% of the orthologues with E. coli are in the second chromosome, whereas only 12% of the highly expressed genes are in this chromosome. Hence, controlling for the different frequency of orthologues between the two replicons shows that highly expressed genes are under-represented in the smaller replicons relative to orthologues in general. Second, as indicated above, we find a strong association between the discrimination against highly expressed genes in the smaller chromosome and R (Table 2). Accordingly, the slowest-growing bacterium, L. interrogans, is also the only one whose genome shows no significant bias in the distribution of highly expressed genes between the two chromosomes. This further suggests that selection, possibly for gene dosage effects, has fixed, or maintained, highly expressed genes on the largest chromosome.
Gene dosage correlates with genome stability
If replication-associated gene dosage effects are under positive selection, then rearrangements leading to the loss of this adaptive feature should be counter-selected. Indeed, rearrangements causing such a disruption have been found to lead to lower growth rates (Louarn et al., 1985; Hill and Gray, 1988; Campo et al., 2004), although it is notoriously difficult to establish that it is indeed the disruption of gene dosage effects that leads to lower growth. Naturally, the systematic positioning of highly expressed genes near the origin of replication, as demonstrated in this work, is a further indication that gene dosage is under strong selection in some species and that some inversions will lead to lower fitness. To further test this hypothesis we measured the association between gene dosage, measured by R, and genome stability.
Recently, one of us proposed a method to quantify the intrinsic stability of genomes (Rocha, 2006), building on a previously published analysis (Rocha, 2003). The method starts by fitting the best model that explains the average loss of gene order conservation through time. Then, the stability of one genome is defined as the average deviation from the model over the comparisons in which it participates. Positive deviations indicate more than average stability, whereas negative deviations indicate the inverse. We used the stability data published as supplementary material in Rocha (2006), which correspond to 80 chromosomes analysed in this work. We found a positive correlation between R and genome stability among the genomes with higher R (Fig. 6). No such correlation is found for genomes with smaller R (Supplementary Fig. S3). Hence, the bacteria having the highest gene dosage effects are also among the most stable. This further indicates that gene dosage effects impose significant constraints on the dynamics of bacterial chromosomes. This does not mean that highly biased genomes will show lower intrinsic rearrangement rates. It means that asymmetric rearrangements are more likely to be deleterious, hence not fixed, in the natural populations of bacteria having a genome organization that takes advantage of gene dosage effects.
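The definition of stability as an average deviation from a fitted model can be sketched as follows. This is only an illustration of the averaging step described above: the fitted model of Rocha (2006) is not reproduced in this text, so `model` is a hypothetical callable supplied by the user, and the function name, the input layout and the toy numbers are all assumptions.

```python
from collections import defaultdict

def genome_stability(comparisons, model):
    """Average deviation from the model for each genome.

    comparisons: iterable of (genome_a, genome_b, distance, observed_conservation)
    model: callable mapping a distance to the expected gene order conservation
    """
    deviations = defaultdict(list)
    for genome_a, genome_b, distance, observed in comparisons:
        # Positive residual: gene order is more conserved than the model
        # predicts for this distance, i.e. more stable than average.
        residual = observed - model(distance)
        # Each pairwise comparison contributes to both genomes involved.
        deviations[genome_a].append(residual)
        deviations[genome_b].append(residual)
    return {g: sum(r) / len(r) for g, r in deviations.items()}

# Toy usage with a hypothetical linear decay model and invented comparisons.
def toy_model(distance):
    return 1.0 - distance

toy_comparisons = [
    ("A", "B", 0.2, 0.9),
    ("A", "C", 0.5, 0.4),
    ("B", "C", 0.4, 0.7),
]
stability = genome_stability(toy_comparisons, toy_model)
print(stability)
```

The resulting per-genome values are what would be compared with R in an analysis such as the one described here.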
We have shown that highly expressed genes associated with transcription and translation are preferentially located close to the origin of replication in bacteria with rapid growth, especially if R is high. We then showed that such positioning is under strong selection, as such genomes tend to be more stable. This work extends an intensive line of inquiry showing how fast growth has important consequences at the level of genome evolution and organisation. From our point of view it also raises several important questions. First, because of obvious experimental difficulties, many of the molecular details of replication and doubling have been studied under conditions of growth far from maximum. This is the case for studies on replication arrests (Maisnier-Patin et al., 2001), chromosome segregation (Errington et al., 2003) and chromosome folding domains (Valens et al., 2004). Our data suggest that at very high growth rates, bacteria may face challenges that are barely visible at lower growth rates. Second, gene dosage effects are so important in some bacteria that they may destabilize gene regulation if the operons participating in a regulon are at very different distances from the origin of replication. Consider two operons whose expression follows a given stoichiometry. If they are at very different distances from the origin, their relative copy numbers in the cell will differ much more at maximal growth than at moderate growth. This certainly complicates tight gene expression tuning. Hence, in fast-growing bacteria one would expect operons belonging to the same regulon to be close together, or symmetrically placed around the chromosome. There is some evidence for both (Lathe et al., 2000; Rocha et al., 2000; Horimoto et al., 2001; Audit and Ouzounis, 2003).
In agreement with previous data, we show that at very high growth rates, selection for high gene expression is concentrated within a very narrow group of genes. Gene dosage effects, in particular, are mostly limited to translation- and transcription-associated genes. The study of the imprint of selection on the organisation of genomes relative to gene dosage effects will certainly provide important clues on how important the optimisation of growth is in different bacteria. One would expect this to be very high for bacteria growing very fast and very small for bacteria growing very slowly. Accordingly, genomes will be more or less tolerant to disruptive rearrangements.
We started by analysing all genomes available at GenBank Genomes (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) by March 2005. We then tried to identify their origins and termini of replication using strand compositional biases as in Rocha and Danchin (2003). If these could not be unambiguously determined, we could only determine the value of R for the genome (see below), and the genome was excluded from further analysis. We analysed 126 chromosomes for which we could find the doubling time in the literature and only used one strain per species (typically the first published). For this, we updated and improved the list of previously published doubling times (Rocha, 2004a). tDNA genes were identified with tRNAscan using the standard procedure for bacterial genomes (Lowe and Eddy, 1997). The genes coding for rRNA and RP were identified from genome annotations, and RNA polymerase orthologues were identified using bidirectional best hits with at least 40% of protein similarity as in Rocha and Danchin (2003). Highly expressed genes were identified using the Codon Adaptation Index (CAI) (Sharp and Li, 1987). This index provides a measure of codon usage optimisation relative to a reference set of genes, for which we used RP, and it is usually assumed to correlate well with expression levels (Coghlan and Wolfe, 2000; Carbone et al., 2003). We took the genes with the 5% highest CAI values as the highly expressed genes (hence named the CAI5% set). We checked that variations around this value make little qualitative difference for the analysis (data not shown). The full table with the names of the genomes, their length, number of rDNA, tDNA, highly expressed genes and highest 5% CAI is given in supplementary material (Supplementary Table S1). Statistics were computed with R (http://www.r-project.org/).
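The identification of highly expressed genes by CAI can be sketched as follows. This is a minimal sketch following the definition of Sharp and Li (1987), not the pipeline actually used in this work: the function names, the pseudocount of 0.5 for codons absent from the reference set, the use of the standard codon assignments and the rounding applied to the 5% cut-off are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import exp, log

# Standard codon assignments, built from the usual compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons(seq):
    """Split a coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Weight of each codon: its count divided by that of the most used
    synonymous codon in the reference set (here, ribosomal protein genes)."""
    counts = Counter(c for s in reference_seqs for c in codons(s) if c in CODON_TO_AA)
    weights = {}
    # Met, Trp (single codon) and stop codons carry no information.
    for aa in set(AMINO) - set("*MW"):
        family = [c for c, a in CODON_TO_AA.items() if a == aa]
        freq = {c: counts[c] or pseudocount for c in family}  # assumed pseudocount
        top = max(freq.values())
        for c in family:
            weights[c] = freq[c] / top
    return weights

def cai(seq, weights):
    """Geometric mean of the codon weights along the gene."""
    logs = [log(weights[c]) for c in codons(seq) if c in weights]
    return exp(sum(logs) / len(logs)) if logs else float("nan")

def top_fraction(cai_by_gene, fraction=0.05):
    """Names of the genes with the highest CAI values (the CAI5% set by default)."""
    n_top = max(1, round(fraction * len(cai_by_gene)))  # assumed rounding rule
    ranked = sorted(cai_by_gene, key=cai_by_gene.get, reverse=True)
    return ranked[:n_top]
```

A typical use would compute `weights` once from the ribosomal protein genes of a genome, apply `cai` to every coding sequence, and pass the resulting dictionary to `top_fraction`.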
To test the association between phylogenetic distance and differences in R, we built a phylogenetic tree using 2822 sites resulting from the concatenation of conserved RP. The multiple alignments were made with clustalw (Thomson et al., 1994) and the phylogenetic tree was built with phyml (Guindon and Gascuel, 2003) using maximum likelihood with the model JTT + Γ(4) + I. The tree was then used to estimate the distance between all pairs of genomes.
The model of gene dosage was first described by Cooper and Helmstetter (1968) for synchronous cultures and then shown to be applicable to any type of culture (Bremer and Churchward, 1977). This model has been described from several different perspectives and we will only summarize its main results here. Let C be the time necessary to replicate the chromosome, i.e. the time that each replication fork takes to replicate a replichore (the half of the chromosome between the origin and the terminus). Let D be the time for division of the cell after the chromosome is replicated. Let τ be the doubling time of the cell. R is then the number of simultaneous replication rounds (R = C/τ). If τ > C, then only one replication round will be present in each cell. If τ < C, then a new replication round will start before the previous one has finished. In this situation the chromosome will present multiple replication rounds. More precisely, the average numbers of origins ($N_O$) and termini ($N_T$) per cell are given by:
$N_O = 2^{(C+D)/\tau}$ (1)
$N_T = 2^{D/\tau}$ (2)
Hence, the ratio of number of origins over the number of termini ($r_{O/T}$) is given by:
$r_{O/T} = 2^{C/\tau}$ (3)
And the number of replication forks ($N_{RF}$):
$N_{RF} = 2(2^{R} - 1)$
Now, let's consider a gene that is at a normalized position x in the chromosome, where x is the smallest distance of the gene to the origin of replication divided by half of the chromosome size. If one assumes that the replication fork advances along the chromosome at a constant rate, then it takes xC time to replicate the gene and then (1 − x)C time to replicate the remaining replichore. Hence, the replication-associated gene dosage of the gene (GD) is given by the ratio of the number of copies of the gene in the cell to the number of termini:
$GD = 2^{(1-x)C/\tau}$ (4)
GD is then an estimator of the gene dosage associated with replication for a given gene. It increases as x decreases, i.e. it is largest at the origin of replication, and it increases with the ratio C/τ, i.e. it is larger if there are many simultaneous replication rounds.
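The relations of the model can be sketched numerically as follows. This is a minimal sketch that reads the quantities after the factor 2 in the equations as exponents, as in the Cooper and Helmstetter model; the function names are hypothetical, and the usage values are the illustrative two-chromosome case discussed earlier in the text (one chromosome twice as long as the other, doubling time equal to the replication time of the smaller one), not measured data.

```python
def replication_rounds(c, tau):
    """R = C / tau, the number of simultaneous replication rounds."""
    return c / tau

def origins_per_cell(c, d, tau):
    """Average number of origins per cell, equation (1)."""
    return 2 ** ((c + d) / tau)

def termini_per_cell(d, tau):
    """Average number of termini per cell, equation (2)."""
    return 2 ** (d / tau)

def origin_terminus_ratio(c, tau):
    """Ratio of origins to termini, equation (3)."""
    return 2 ** (c / tau)

def replication_forks(c, tau):
    """Number of replication forks, 2(2^R - 1), as given in the text."""
    return 2 * (2 ** replication_rounds(c, tau) - 1)

def gene_dosage(x, c, tau):
    """Gene copies per terminus for a gene at normalized distance x
    (0 = origin, 1 = terminus) from the origin, equation (4)."""
    return 2 ** ((1 - x) * c / tau)

# Illustrative case in arbitrary time units: tau equals the replication
# time of the smaller chromosome; the larger one takes twice as long.
tau = 1.0
c_small, c_large = 1.0, 2.0

print(gene_dosage(0.0, c_large, tau))  # 4.0: origin-proximal gene, larger chromosome
print(gene_dosage(0.0, c_small, tau))  # 2.0: origin-proximal gene, smaller chromosome
print(gene_dosage(1.0, c_large, tau))  # 1.0: gene at the terminus
```

Because GD depends on C and τ only through their ratio, the same functions can be used with any consistent time unit.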
We thank Guillaume Achaz, Martine Boccara, Danielle Joseleau-Petit, Anne Vanet and four anonymous reviewers for comments and criticisms on earlier versions of this manuscript, and Antoine Danchin and Vic Norris for related discussions throughout the years.
The following supplementary material is available for this article online:
Fig. S1. Average position of the genes in the chromosome within each set plotted against the minimum doubling time and non-parametric correlation between the two variables.
Fig. S2. Repartition of sets of genes along the bacterial chromosomes (day ≥ 1 in red and day < 1 in blue).
Fig. S3. Association between R and genome stability as computed in Rocha (2006), for genomes lacking important gene dosage effects (R < 0.55).
Fig. S4. Association between the pairwise differences in R and the phylogenetic distance.
Table S1. Chromosomes used in the analysis, their estimated optimal doubling time (d) in hours, their size (kb), R, and the number of each type of genes used in the analysis (♯) and their average normalized position in the chromosome.
Table S2. Non-parametric Spearman correlations between R, the duplication time (d), the replicon length and the average position of ribosomal proteins (RP), RNA polymerase (RNAP), rDNA and tRNA ub genes.
Table S3. The list of anticodons corresponding to tRNA ub.
Table S4. Pearson and partial correlations of R (left) and genome length (right) with the average position of each type of genes.