In silico identification of TRs
The increasing availability of genome sequences and specialized bioinformatics software greatly facilitates the search for and identification of TR loci on a genomewide scale, which is a prerequisite for understanding their distribution, predicting their function, and tracking their evolution. A variety of algorithms have been developed for detecting TRs, but these differ in their ability to detect different types of TRs (Merkel & Gemmell, 2008; Treangen et al., 2009; Kajava, 2012). Hence, the search tool should be chosen according to the TR type of interest, and the parallel use of several algorithms is advisable when a wide screen for TRs is performed. Furthermore, parameter settings (e.g. alignment weights, definition of repeats, and threshold scores) can strongly affect the outcome in terms of the number and consensus motifs of the TRs detected (Lim et al., 2013). In particular, problems are still commonly encountered in detecting imperfect TRs (Leclercq et al., 2007; Schaper et al., 2012). Several algorithms are freely available online, such as Tandem Repeats Finder (Benson, 1999) and IMEx (Mudunuri & Nagarajaram, 2007). In addition, several databases of annotated TRs in prokaryotes have been established, such as TRs DB (http://minisatellites.u-psud.fr), PSSRDb (http://pssrdb.cdfd.org.in) and MICAS (http://18.104.22.168/micas/index.php). In the next section, we review the major findings from some recent in silico studies of the genomic distribution of TRs in bacteria.
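To illustrate how the definition of a repeat and the threshold settings shape the result, the following minimal Python sketch scans a sequence for perfect SSRs only. The function name find_perfect_ssrs, its parameters, and the example sequence are hypothetical and chosen purely for illustration. The sketch does not reproduce Tandem Repeats Finder, IMEx, or any other published tool, and it cannot detect imperfect repeats.

```python
import re


def find_perfect_ssrs(seq, min_unit=1, max_unit=6, min_copies=3):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    Illustrative assumptions: only uninterrupted repeats of A/C/G/T are
    reported, start positions are 0-based, and a unit that is itself a
    repeat of a shorter motif (e.g. "CACA") is skipped, because the
    shorter motif ("CA") already describes the same locus.
    """
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One captured unit followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            redundant = any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if redundant:
                continue
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), unit, copies))
    return sorted(hits)


# Hypothetical test sequence containing an A run, a CA repeat, and a GG pair.
EXAMPLE = "CGTAAAAAGCTCACACACAGG"
```

With the default threshold of three copies, find_perfect_ssrs(EXAMPLE) reports two loci, (3, 'A', 5) and (11, 'CA', 4). Lowering min_copies to 2 adds a third, (19, 'G', 2), which shows on a small scale how a single parameter changes the number of repeats counted.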
The distribution of TRs in bacterial genomes
The analysis of TRs in bacterial genomes has so far focused mainly on microsatellites with a unit size of 1–6 bp, also termed ‘simple sequence repeats (SSRs)’. A number of general observations regarding the distribution of SSRs can be made. First, SSRs are less abundant in bacteria than in eukaryotes (Schlotterer et al., 2006). Nevertheless, the number of SSRs is orders of magnitude higher than that of other repeat types (i.e. minisatellites and macrosatellites) in the genomes of most, if not all, bacteria. This is not unexpected, because the SSR count includes even mononucleotide trimers (e.g. AAA), which account for about 70% of the total number of SSRs. While SSRs are generally believed to contribute to the genome polymorphism and adaptation potential of bacteria (Kassai-Jáger et al., 2008), the contribution of very small SSRs such as mononucleotide trimers is probably limited. In fact, a rough threshold for the minimum number of repeat units (4–9) has been noted, below which an SSR is unlikely to mutate or be variable (Lai & Sun, 2003; Dettman & Taylor, 2004; Kelkar et al., 2010). Intriguingly, heptameric repeats were found to be overrepresented among these SSRs in most prokaryotes, and it was hypothesized that the seemingly preferred 7 bp length of a repeat unit might relate to the size of the DNA segment that interacts with the active site of the DNA polymerase, thus facilitating polymerase slippage (Mrázek et al., 2007).
A remarkable feature of SSRs is their widely diverse distribution across species, even closely related ones, which may indicate that they are subject to rapid evolutionary change (Yang et al., 2003; Mrázek, 2006; Kassai-Jáger et al., 2008). Analysis of more than 300 prokaryotic genomes showed that the distribution of SSRs varied with the bacterial species, genome size, and G + C content (Mrázek et al., 2007). More specifically, SSRs with a small motif (1–4 bp) are more abundant in small genomes, particularly in host-adapted pathogens with reduced genomes (< 2 Mb) and low G + C content (< 40%), such as Mycoplasma and Haemophilus spp. (Moxon et al., 2006; Treangen et al., 2009). In contrast, SSRs with a larger motif (5–11 bp) are more frequent in nonpathogens and opportunistic pathogens with large genomes (> 4 Mb) and high G + C content (> 60%), such as Burkholderia and Anabaena spp. Based on this observation, it was hypothesized that the differential representation of SSRs in bacteria may correlate with pathogenicity, but more work is needed to corroborate this. Another interesting observation is that some relatively large bacterial genomes (e.g. Pseudomonas aeruginosa, c. 5 Mb) have fewer SSRs than would be predicted from their genome properties, but harbor comparatively more two-component sensor transducers. In contrast, some host-adapted pathogens with a small genome size (e.g. Haemophilus influenzae, Neisseria meningitidis, and Helicobacter pylori) have comparatively more SSRs, but fewer two-component sensor transducers (Moxon et al., 2006). Thus, it seems that environmental adaptability depends primarily on SSR variation in host-adapted pathogens, whereas in opportunistic pathogens with a more versatile lifestyle it depends primarily on two-component sensor transducers.
Closer examination of the SSR distribution across the genome shows significant differences between coding and noncoding regions. Because bacterial genomes are more compact than those of eukaryotes, they have comparatively more intragenic than intergenic SSRs. For example, in Escherichia coli K-12, 79.5% of SSRs are located in coding regions (Gur-Arie et al., 2000), whereas in the genome of the Japanese pufferfish (Fugu rubripes), only 11.6% of SSRs are intragenic (Edwards et al., 1998). Generally, long mono- and dinucleotide SSRs are excluded from coding regions, probably because they have a higher probability of rearranging and causing frameshift mutations in genes (Coenye & Vandamme, 2005; Ackermann & Chao, 2006; Orsi et al., 2010; Lin & Kussell, 2012). In contrast, SSRs whose unit size is a multiple of three nucleotides (3, 6, 9 …) are overrepresented in open reading frames (ORFs) because their expansion or contraction does not disrupt the reading frame (Mrázek et al., 2007). However, exceptions have been reported. For example, tetranucleotide SSRs of H. influenzae are found exclusively in ORFs, which is consistent with their role in phase variation (Power et al., 2009). An interesting situation exists in the mycoplasmas: long trinucleotide repeats are overrepresented in Mycoplasma genitalium, Mycoplasma gallisepticum, and Mycoplasma hyopneumoniae, but they occur mainly in intergenic regions in the former two species and in coding regions in the latter (Mrázek, 2006). This difference in distribution is also reflected in different functional roles. In M. gallisepticum, the most prominent trinucleotide TRs are the GAA repeats in the 5′ untranslated region of the 42 to 70 vlpA adhesin gene paralogs that exist in each strain, which regulate vlpA gene expression (Glew et al., 1998, 2000; Liu et al., 2000; Papazisi et al., 2003). In contrast, M. hyopneumoniae trinucleotide repeats are found mostly within hypothetical ORFs, but also in some adhesins, and their contraction or expansion results in variability of amino acid repeats that are believed to play a role in protein–protein interaction or adhesion (Mrázek, 2006).
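The reading-frame argument above reduces to simple arithmetic: a change of n repeat units of length u inserts or deletes n × u nucleotides, and the frame is preserved only when that number is divisible by three. The short Python sketch below makes this explicit; the function name frame_effect and its return labels are hypothetical and serve only as an illustration.

```python
def frame_effect(unit_len, copy_change):
    """Classify a copy-number change of an intragenic repeat.

    unit_len    -- repeat unit size in nucleotides (e.g. 3 for a trinucleotide)
    copy_change -- units gained (positive) or lost (negative)

    Returns "in-frame" when the inserted or deleted length is a multiple
    of three and "frameshift" otherwise. Only the length change is
    considered; effects on the encoded amino acids are ignored.
    """
    if unit_len < 1:
        raise ValueError("unit_len must be a positive integer")
    net_change = unit_len * copy_change
    return "in-frame" if net_change % 3 == 0 else "frameshift"
```

For instance, frame_effect(3, -1) gives "in-frame", whereas frame_effect(4, 1) gives "frameshift". A tetranucleotide repeat returns to the original frame only after a net change of three units (frame_effect(4, 3) gives "in-frame"), which is compatible with reversible switching of the kind described for phase variation.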
A more detailed study of the occurrence of intragenic TRs in 44 bacteria and archaea revealed additional features (Lin & Kussell, 2012). Intragenic SSRs were found more frequently near the termini (5′ and 3′ ends) of the ORF than in the middle, which most likely stems from biophysical constraints on protein structure. In addition, SSR-induced frameshifts at the 3′ end are less harmful than those elsewhere in the ORF, because most of the upstream coding region is not affected. Nevertheless, an overrepresentation of SSRs was found at the 5′ end of ORFs in pathogens, probably because this allows SSR-induced frameshifts to function as an ON/OFF switch for these ORFs, which can be advantageous for pathogens because it facilitates rapid adaptation of populations. Similar observations had been made earlier for some intragenic mononucleotide repeats at the 5′ end of genes (van Passel & Ochman, 2007; Janulczyk et al., 2010; Orsi et al., 2010). However, it remains unclear, and is often difficult to prove, whether this type of distribution bias of intragenic SSRs is linked to selection pressures in bacteria. An argument in favor of such a link is that intragenic SSRs show a preference for certain categories of genes. In both Gram-negative (e.g. Haemophilus and Helicobacter) and Gram-positive (e.g. Streptococcus) pathogens, SSR-associated genes frequently encode virulence factors, cell surface components, and restriction–modification enzymes (van Belkum, 1999; Moxon et al., 2006; Guo & Mrázek, 2008; Power et al., 2009; Janulczyk et al., 2010). On the other hand, several intragenic SSRs with numerous repeat copies and a unit size that is not a multiple of three are also found in housekeeping genes whose products are essential for important cellular processes, such as cell division, energy production, and DNA replication and repair (Guo & Mrázek, 2008). Rearrangements of these TRs that disrupt the reading frame are expected to be detrimental or even lethal for the cell, and it remains unclear why such TRs have been maintained during evolution.
Intergenic SSRs also show a nonrandom distribution, being found more frequently in the immediate vicinity of genes than at distant positions. For example, intergenic SSRs of E. coli K-12 are concentrated in a region up to 200 bp upstream of the start codon, which contains proximal regulators of gene expression (Gur-Arie et al., 2000). Another study showed that, in most cases, intergenic SSRs with numerous copies are located upstream of the first gene in prokaryotic operons (Guo & Mrázek, 2008). Together, these studies point to a potential role of intergenic SSRs in the regulation of gene expression.
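As a rough illustration of how such a positional bias could be assessed from an annotation, the Python sketch below computes the strand-aware distance between an intergenic SSR and the start codon of a neighboring gene. The function names, the 0-based half-open coordinate convention, and the example coordinates are assumptions made for this sketch. The default window of 200 bp mirrors the region reported for E. coli K-12, but the code is not the method used in the cited studies.

```python
def upstream_distance(ssr_start, ssr_end, gene_start, gene_end, strand):
    """Distance in bp from an SSR to the start codon of a gene.

    All coordinates are 0-based, half-open [start, end) on the forward
    strand (an assumption of this sketch). Returns None when the SSR
    does not lie upstream of the gene's start codon.
    """
    if strand == "+":
        # The start codon begins at gene_start; upstream means lower coordinates.
        if ssr_end <= gene_start:
            return gene_start - ssr_end
    elif strand == "-":
        # The start codon sits at the high-coordinate end of the gene.
        if ssr_start >= gene_end:
            return ssr_start - gene_end
    else:
        raise ValueError("strand must be '+' or '-'")
    return None


def is_start_proximal(distance, window=200):
    """True if the SSR lies upstream within `window` bp of the start codon."""
    return distance is not None and distance <= window
```

For a hypothetical forward-strand gene spanning 1000 to 1900, an SSR at 850 to 870 lies 130 bp upstream and counts as proximal. The same gene annotated on the reverse strand would instead have its upstream region above coordinate 1900.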