General features of 20 Lactobacillus genomes
The 20 completed genomes of Lactobacillus, representing 14 different species, vary in size from approximately 1.8 to 3.3 Mb and show a number of discriminating features (Table 1). The number of predicted protein-coding sequences (CDS) in these Lactobacillus genomes ranges from 1721 to 3100. Such variation points to substantial gene loss and gain during their evolution, as reported previously for a smaller set of Lactobacillus genomes (Makarova et al., 2006). The pangenome, defined as the full complement of genes of these Lactobacillus genomes, consists of nearly 14 000 proteins (Table S1).
Table 1. A general overview of the origin and genome statistics of the 20 Lactobacillus genomes.
|Genome||Length (bp)||G+C content (%)||Predicted ORFs||Isolated from||Genes assigned to COG||LPXTG genes||Genes encoding signal peptides predicted by SignalP (%)||Genes encoding signal peptides predicted by LocateP (%)||Reference|
|Lactobacillus acidophilus NCFM||1 993 564||34.71||1864||Infant faeces||1433||13||21.03||9.40||Altermann et al. (2005)|
|Lactobacillus helveticus DPC 4571||2 080 931||37.08||1757||Cheese||1396||2||17.07||6.96||Callanan et al. (2008)|
|Lactobacillus gasseri ATCC 33323||1 894 360||35.26||1755||Human Gut||1316||14||17.89||6.38||Azcarate-Peril et al. (2008)|
|Lactobacillus crispatus ST1||2 043 161||36||2024||Chicken faeces||1499||8||13.58||9.39||Ojala et al. (2010)|
|Lactobacillus johnsonii FI9785||1 781 645||34.43||1733||Human faeces||1320||12||24.58||6.49||Wegmann et al. (2009)|
|Lactobacillus johnsonii NCC 533||1 992 676||34.61||1821||Human faeces||1403||18||19.17||7.41||Pridmore et al. (2004)|
|Lactobacillus delbrueckii ssp bulgaricus ATCC BAA 365||1 856 951||49.69||1721||Yoghurt||1196||3||17.49||8.28||Makarova et al. (2006)|
|Lactobacillus delbrueckii ssp bulgaricus ATCC 11842||1 864 998||49.72||2094||Yoghurt||1153||2||16.19||8.69||van de Guchte et al. (2006)|
|Lactobacillus casei ATCC 334||2 924 325||46.58||2771||Cheese||1959||18||20.39||8.55||Makarova et al. (2006)|
|Lactobacillus casei BL23||3 079 196||46.34||3044||Cheese||2152||20||20.47||8.40||Mazé et al. (2010)|
|Lactobacillus rhamnosus GG||3 010 111||46.69||2944||Human Gut||2032||18||30.16||7.83||Kankainen et al. (2009)|
|Lactobacillus rhamnosus Lc 705||3 033 106||46.68||2992||Cheese||2099||15||31.45||8.34||Kankainen et al. (2009)|
|Lactobacillus sakei 23k||1 884 661||41.26||1879||Meat||1462||7||19.96||8.46||Chaillou et al. (2005)|
|Lactobacillus brevis ATCC 367||2 340 228||46.06||2218||Human||1678||12||21.06||9.38||Makarova et al. (2006)|
|Lactobacillus plantarum JDM1||3 197 759||44.66||2948||Human saliva||2248||34||28.05||7.94||Zhang et al. (2009)|
|Lactobacillus plantarum WCFS1||3 348 625||44.42||3100||Adult Intestine||2305||35||19.87||7.88||Kleerebezem et al. (2003)|
|Lactobacillus fermentum IFO 3956||2 098 684||51.47||1843||Adult Intestine||1519||5||14.81||5.48||Morita et al. (2008)|
|Lactobacillus reuteri DSM 20016||1 999 618||38.87||1935||Silage||1529||4||15.76||4.84||A. Copeland, S. Lucas, A. Lapidus, K. Barry, J.C. Detter, T. Glavina del Rio, N. Hammon et al. (unpublished)|
|Lactobacillus reuteri JCM 1112||2 039 414||38.88||1820||Fermented plant material||1495||5||16.65||5.22||Morita et al. (2008)|
|Lactobacillus salivarius||2 133 977||33.04||2073||Terminal ileum of human||1476||5||14.52||6.58||Claesson et al. (2007)|
The Lactobacillus secretome has received considerable attention as it includes proteins that may interact with the environment (Kleerebezem et al., 2010). Both SignalP (Emanuelsson et al., 2007) and LocateP (Zhou et al., 2008) were used to predict the secretome of the lactobacilli (Table 1). SignalP predicted the larger secretome in each genome, although its predictions suffer from some inaccuracy not present in LocateP. It is of interest to note that the fraction of genes predicted to encode signal sequences is highly variable. The largest sets (over 30% of the predicted proteome according to SignalP) were encoded by the genomes of L. rhamnosus GG and Lc 705, both of which are marketed as probiotics (Kankainen et al., 2009). However, several other probiotic strains, including L. johnsonii NCC 533 and L. acidophilus NCFM, were also predicted to encode a higher fraction of secreted proteins than L. helveticus or L. delbrueckii, which have similarly sized genomes but derive from a well-known dairy background. Similarly, the latter starter strains were predicted to have the lowest number of proteins that are cell wall anchored via sortases that recognize the LPXTG-like motif located at the C-terminal end (termed here LPXTG genes) (Boekhorst et al., 2005).
The 20 Lactobacillus genomes showed a highly diverse G+C content, varying from 33% to 51%. This span of G+C values is about twice as large as that normally observed in well-defined bacterial genera, raising the question of whether the Lactobacillus species analysed here belong to a single genus (Fujisawa et al., 1992).
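The G+C span quoted above can be checked directly against Table 1. The following is a minimal sketch, not the procedure used in this study: the dictionary holds a handful of G+C values copied from Table 1 (including the lowest and highest of the 20), and `gc_content` is a hypothetical helper showing how such a percentage is obtained from a nucleotide sequence.

```python
# G+C content (%) copied from Table 1 for five of the 20 genomes,
# including the lowest (L. salivarius) and highest (L. fermentum) values.
gc_percent = {
    "L. salivarius": 33.04,
    "L. acidophilus NCFM": 34.71,
    "L. plantarum WCFS1": 44.42,
    "L. delbrueckii ssp. bulgaricus ATCC 11842": 49.72,
    "L. fermentum IFO 3956": 51.47,
}


def gc_content(sequence: str) -> float:
    """Return the G+C percentage of a DNA sequence (case-insensitive).

    Hypothetical helper for illustration; ambiguous bases are simply
    counted in the length and not as G or C.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Genomes with the extreme G+C values and the span between them.
lowest = min(gc_percent, key=gc_percent.get)
highest = max(gc_percent, key=gc_percent.get)
gc_span = gc_percent[highest] - gc_percent[lowest]

print(f"{lowest}: {gc_percent[lowest]:.2f}%")
print(f"{highest}: {gc_percent[highest]:.2f}%")
print(f"span: {gc_span:.2f} percentage points")
```

With the Table 1 values the span is a little over 18 percentage points, which is the range the text rounds to 33% to 51%.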
The Lactobacillus core genome
To study the relation between the genes in the 20 genomes, we determined the set of shared orthologous genes, termed the Lactobacillus core genome (LCG). A total of 383 sets of orthologous genes were calculated to constitute this LCG (Table S2). This LCG is larger than the set of 141 core proteins defined previously from a comparison of 12 Lactobacillus genomes (Claesson et al., 2008). The difference can be ascribed mainly to the more stringent criteria, and to the classification of genes into COGs, used to select the core genes in that earlier study.
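The definitions of the pangenome (the full gene complement) and the core genome (the orthologous genes shared by all genomes) reduce to a set union and a set intersection once each genome is represented by the ortholog groups it contains. The sketch below illustrates this under stated assumptions: the genome names and ortholog group identifiers (`og1`, `og2`, ...) are invented toy data, and the orthology assignment itself, which is the hard part, is assumed to have been done already. It is not the pipeline used in this study.

```python
def pangenome(presence: dict[str, set[str]]) -> set[str]:
    """Ortholog groups found in at least one genome (set union)."""
    return set().union(*presence.values())


def core_genome(presence: dict[str, set[str]]) -> set[str]:
    """Ortholog groups found in every genome (set intersection)."""
    if not presence:
        return set()
    return set.intersection(*presence.values())


# Toy presence table: genome name -> ortholog groups it encodes.
# All names and identifiers here are hypothetical.
toy_presence = {
    "genome_A": {"og1", "og2", "og3"},
    "genome_B": {"og1", "og2", "og4"},
    "genome_C": {"og1", "og2", "og5", "og6"},
}

toy_pan = pangenome(toy_presence)
toy_core = core_genome(toy_presence)

print(sorted(toy_pan))   # every group seen in any genome
print(sorted(toy_core))  # groups shared by all three genomes
```

Because the core genome is an intersection, it can only shrink or stay the same as genomes are added, whereas the pangenome can only grow. The size of a core set also depends strongly on how orthologs are called, which is why core sets from different studies are not directly comparable.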
Close inspection of the order of the genes in the LCG revealed that over 100 genes were organized in operon-like clusters that were conserved in all 20 genomes. This indicates that, apart from a shared function, these genes also have a conserved organization and control. It reflects a common ancestry that likely extends beyond the Lactobacillus group, as many of the genes in the LCG are also conserved in other related Gram-positive bacteria. Among those genes, we found the canonical large gene clusters for the ribosomal proteins, the major proton-translocating ATPase and many house-keeping functions. Moreover, the LCG contained all genes of the dlt operon coding for the D-alanylation of lipoteichoic acids that are involved in specific signalling to the host (de Vos, 2005). In addition, three conserved two-component regulatory systems were present in all 20 Lactobacillus genomes; these could form a basic network of responses to the environment, although it is not yet known what they control. Moreover, the ccpA gene for the carbon catabolite control protein was always located adjacent to the pepQ gene for a prolidase. This was observed earlier in L. delbrueckii, where the specific CcpA-mediated control of prolidase gene expression was experimentally verified (Morel et al., 1990; Schick et al., 1999). This common organization indicates a link between the control of sugar and nitrogen metabolism that is conserved in all lactobacilli. While the LCG contains over 80 genes for hypothetical proteins, one gene with an assigned function stands out: the gene annotated to encode FbpA, which is present in all sequenced Lactobacillus genomes, including those not yet completed, such as those of the intestinal L. buchneri and L. coleohominis (Nelson et al., 2010). This FbpA protein of over 500 residues was first described in Streptococcus pyogenes as a fibronectin-binding protein (Courtney et al., 1994). It is highly conserved in many lactic acid bacteria as well as some bacilli, and belongs to the PF05833 family of proteins. Given its widespread occurrence in Gram-positive bacteria and the absence of signal and other cognate topogenic sequences, it is doubtful whether binding to fibronectin is the natural function of this protein in lactobacilli. It is tempting to speculate that the FbpA-like proteins share a common function relating to environmental interactions such as biofilm formation.
In order to further characterize the Lactobacillus gene pool, we classified it using the COG classification, which annotated the vast majority of the LCG genes (Fig. 1). This functional prediction of the LCG showed 26% of genes belonging to ‘Translation, ribosomal structure and biogenesis’, most likely acting as house-keeping genes, while 10% of the genes belonged to ‘Replication, recombination and repair’, 14% to ‘unknown function or general function prediction only’, 7% to ‘Transcription’, and 6% to ‘Carbohydrate transport and metabolism’ (Fig. 1). Remarkably, in view of the large predicted secretome of the lactobacilli (Table 1), only a small fraction (5%) of the proteins encoded by the LCG were predicted to be secreted, indicating that many secreted proteins are encoded by strain-specific genes.
Grouping of Lactobacillus genomes
The 383 genes of the LCG were used for the construction of a phylogenetic tree of the lactobacilli (method described in detail in the Experimental procedures section). The obtained tree differs slightly from the well-known 16S rRNA-based grouping but adds a higher level of confidence as it is based on comparisons of the complete LCG with ∼130 kb per genome.
The generated whole-genome-based phylogeny revealed the presence of three distinct and large clusters of lactobacilli (Fig. 2). These clusters were named after the strain designation of the largest or most well-known genome they contained. In this way the NCFM, WCFS and GG clusters were defined, consisting of 8, 7 and 5 genomes respectively. The NCFM cluster is not only the largest but also the most coherent. In contrast, the WCFS and GG clusters each contain an outgroup genome, that of L. salivarius and L. sakei respectively.
Figure 2. Phylogenetic grouping of the Lactobacillus spp. with known genomes based on the features of their LCG. Three groups are shaded with different colours and termed NCFM, WCFS and GG groups (for further explanation see text).
The COG distribution of all 20 genomes was compared to reveal specific features (Table S3). The Lactobacillus genomes were dominated by COG categories including ‘Amino acid transport and metabolism’, ‘Carbohydrate transport and metabolism’, ‘Replication, recombination and repair’, ‘Transcription’ and ‘Translation, ribosomal structure and biogenesis’. Remarkably, the first two categories were only moderately represented in the LCG, as they included only 8 and 19 genes of the total of 383 genes respectively. The NCFM group was characterized by an above-average number of genes in the ‘Translation, ribosomal structure and biogenesis’ category, while the GG group had the smallest number of genes in this category. The categories ‘Transcription’ and ‘Replication, recombination and repair’ also varied among the different Lactobacillus groups, with some exceptions. It was also interesting to note that the largest genomes (L. casei BL23, L. rhamnosus GG and Lc 705, and L. plantarum WCFS1) have the most carbohydrate utilization proteins, as reported earlier for the L. plantarum WCFS1 genome (Kleerebezem et al., 2003). Apart from this, no clear trends could be observed when the COG distribution was analysed.
Specific signatures in the Lactobacillus genomes
Subsequently, we defined additional groups of core genes, next to the LCG. These include the set of genes that are present in all the genomes of one group (termed the core group genes) and the set of genes that are present in all genomes of one group and absent in all other Lactobacillus genomes (termed the signature group genes). The core group gene numbers are of similar magnitude (771, 636 and 991 in the NCFM, WCFS and GG groups respectively; Table 2, Tables S4–S6), but the signature group genes differ widely, numbering 119, 14 and 88 in the NCFM, WCFS and GG groups respectively (Table 2, Tables S7–S9). The low number of signature group genes in the WCFS group can be explained by the fact that this group is the least coherent, as indicated above (Fig. 2).
Table 2. Proteins found in core group and signature group genes of Lactobacillus genomes.
|Data set||NCFM group||WCFS group||GG group|
|Core group genes||771||636||991|
|Signature group genes||119||14||88|
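The two definitions behind Table 2 can be written as set operations: core group genes are the intersection over the genomes of one group, and signature group genes are those core group genes that occur in no genome outside the group. The sketch below is an illustration under stated assumptions rather than the method used here: genome names, group membership and ortholog identifiers are invented toy data, and orthology assignment is assumed to have been done beforehand.

```python
def core_group_genes(presence: dict[str, set[str]], group: list[str]) -> set[str]:
    """Ortholog groups present in every genome of the given group."""
    if not group:
        return set()
    return set.intersection(*(presence[name] for name in group))


def signature_group_genes(presence: dict[str, set[str]], group: list[str]) -> set[str]:
    """Core group genes that are absent from all genomes outside the group."""
    members = set(group)
    # Union of everything encoded by genomes that are not group members.
    outside = set().union(
        *(genes for name, genes in presence.items() if name not in members)
    )
    return core_group_genes(presence, group) - outside


# Toy presence table (hypothetical names and identifiers).
toy_presence = {
    "A1": {"og1", "og2", "og3"},
    "A2": {"og1", "og2", "og4"},
    "B1": {"og1", "og3", "og5"},
    "B2": {"og1", "og5"},
}
group_a = ["A1", "A2"]
group_b = ["B1", "B2"]

core_a = core_group_genes(toy_presence, group_a)
signature_a = signature_group_genes(toy_presence, group_a)
core_b = core_group_genes(toy_presence, group_b)
signature_b = signature_group_genes(toy_presence, group_b)

print(sorted(core_a), sorted(signature_a))
print(sorted(core_b), sorted(signature_b))
```

In the toy data `og1` is in both core group sets but in neither signature set, because it also occurs outside each group. The same effect explains why a less coherent group, whose members share fewer genes to begin with, ends up with few signature genes.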
The core group genes were further used to define the LCG-specific ORFans and the Group-specific ORFans. ORFans are genes that are present in the genome of one species and absent in all others. LCG-specific ORFans are the genes present in the LCG and absent in all other genomes, while Group-specific ORFans are the genes present in the core group genes of one group and absent in all other genomes. As can be expected from the different levels of coherence of the three groups (see above), there were large differences between the numbers of Group-specific ORFans, with 56, 4 and 30 for the NCFM, WCFS and GG groups respectively (Table 3, Tables S10–S12), while the LCG-specific ORFans consisted of 41 genes (Table 3, Table S13). Here we describe the salient features of these LCG-specific ORFans and the Group-specific ORFans that are characteristic of the lifestyle of the members of these groups.
Table 3. General statistics of proteins predicted to be ORFans from the three specific core groups of Lactobacillus genomes.
|Data set||Genes blasted||ORFans found||Hypothetical||Annotated|
|Complete core (LCG)||383||41||13||28|
The LCG-specific ORFans are the genes that are found only in the 20 completed Lactobacillus genomes. Remarkably, all ORFans were predicted to encode small proteins with an average size of 75 residues, which may be due to the method of calculating the ORFans. As these ORFans are unique to lactobacilli, it is no surprise that 13 out of 41 ORFans were predicted to encode hypothetical proteins. Several of these were found in operon structures but their function remains to be elucidated. Many of the annotated ORFans (a total of 13) were predicted to code for ribosomal proteins and some of them also existed in conserved operon-like clusters.
The NCFM Group-specific ORFans are found in the genomes of L. acidophilus, L. helveticus, L. crispatus, L. gasseri, L. johnsonii and L. delbrueckii. The 56 representatives include a majority (34) of genes coding for not-yet-annotated proteins and many that have been annotated inconsistently. An example is LBA1852, which is annotated as a potential D-ala-D-ala ligase in L. acidophilus but as a TAT-pathway signal in L. gasseri and as a conserved hypothetical protein in all other representatives of the NCFM group. Evidently, this hampers speculation about the function of these genes. Other NCFM Group-specific ORFans include LBA0044 for a GDSL-like lipase/acylhydrolase, LBA0342 for a 2′,3′-cyclic nucleotide 3′-phosphodiesterase with a polynucleotide kinase domain conserved in all NCFM group members, and LBA0189, predicted to code for the glycerol-3-phosphate acyltransferase PlsY involved in the early stages of glycerolipid biosynthesis.
The WCFS Group-specific ORFans are found in the genomes of L. plantarum, L. brevis, L. fermentum, L. reuteri and L. salivarius. Only four of these were detected, and three of them encode hypothetical proteins. The remaining one is represented by Lp_2528 in L. plantarum and is annotated in various ways, including as a dioxygenase, a bleomycin resistance protein and a lactoylglutathione lyase glyoxalase. The latter is likely to be the correct annotation based on extensive BLAST analysis. Lactoylglutathione lyase is involved in the detoxification of methylglyoxal, a highly toxic byproduct of triosephosphates, which are abundant glycolytic intermediates in lactobacilli. Recently, it has been observed that in Streptococcus mutans the lactoylglutathione lyase glyoxalase gene is upregulated during acid stress, while its inactivation results in loss of acid resistance (Korithoski et al., 2007). It is tempting to speculate that members of the WCFS group, which include species known to tolerate acidity below pH 3, have adopted a specific form of acid resistance effected by a highly related lactoylglutathione lyase glyoxalase.
The GG group includes L. rhamnosus, L. casei and L. sakei, and the GG Group-specific ORFans comprise 30 genes, of which 15 code for hypothetical proteins. Many of the annotated ones have discrete features in spite of their small size. These include LGG02390, which codes for a small hydrophobic bacteriocin immunity protein. In L. sakei the bacteriocin concerned is likely to be sakacin P, but the function of this protein in L. rhamnosus GG is not clear: a potential bacteriocin operon was predicted from the genome (Kankainen et al., 2009), yet experimental analysis suggested that this strain does not produce bacteriocins (De Keersmaecker et al., 2006). However, this may be due to the laboratory growth conditions employed in that study, which are known to induce different gene expression than the intestinal environment (Marco et al., 2010). Another small protein is that encoded by LGG01384 in L. rhamnosus GG, which has all the features of a 4Fe-4S ferredoxin, found in many anaerobic bacteria and archaea. The question remains in what redox reaction this ferredoxin is involved, as the members of the GG group are considered to grow only by fermentation and do not respire.
In our analysis we could not identify any niche-specific genes when considering the source of the isolated strains. Such genes were previously reported from the analysis of a smaller set of genomes (O'Sullivan et al., 2009). Based on the present set of Lactobacillus genomes, all nine niche-specific genes identified in that study were found to be present in other niches as well. However, it remains to be seen whether the source of isolation is really the natural niche, the more so as some species, such as L. plantarum, are found in plant fermentations, dairy products and the intestinal tract (de Vries et al., 2006). Within this cosmopolitan species, a set of characteristic genes can be detected, as was already indicated by complete genome hybridization (Molenaar et al., 2005). The observation that many L. plantarum genes are expressed in the intestine of humans and mice but are transcriptionally silent in laboratory media indicates the presence of a core of genes specific for the intestinal niche (Bron et al., 2004; Marco et al., 2010). Further comparative and functional genome sequencing will show whether more of these niche-specific genes can be detected and how widely they are distributed.