High or low correlation between co-occurring gene clusters and 16S rRNA gene phylogeny


Correspondence: Knut Rudi, Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, PO Box 5003, NO-1432 Ås, Norway. Tel.: +47 64 96 58 73; fax: +47 64 96 59 01; e-mail: knut.rudi@umb.no


Ribosomal RNA (rRNA) genes are universal for all living organisms. Yet, the correspondence between genome composition and rRNA phylogeny remains poorly known. The aim of this study was to use the information from genome sequence databases to address the correlation between rRNA gene phylogeny and total gene composition in bacteria. This was done by analysing 327 genomes with TIGRFAM functional gene annotations. Our approach consisted of two steps. First, we searched for discriminatory clusters of co-occurring genes. Using a multivariate statistical approach, we identified 11 such clusters, each containing genes that co-occurred only in a subset of genomes and that contributed to explaining the gene content differences between genome subsets. Second, we mapped the discovered clusters to 16S rRNA-based phylogeny and calculated the correlation between co-occurring genes and phylogeny. Six of the 11 clusters exhibited significant correlation with 16S rRNA gene phylogeny. The most distinct phylogenetic finding was a high correlation between iron–sulfur oxidoreductases in combination with carbon–nitrogen ligases and Chlorobium. The other correlations identified covered relatively large phylogroups: Actinobacteria were positively associated with kinases, while Gammaproteobacteria were positively associated with methylases and acyltransferases. The suggested functional differences between higher phylogroups, however, need experimental verification.


There is currently an exponential increase in bacterial genome sequences available in the public domain. A main focus until now has been to build a universal tree of life, either through gene composition or through phylogenetic relations deduced for core genes (McCann et al., 2008). The most widely accepted single-gene phylogenetic marker is the gene encoding the small subunit ribosomal RNA (16S rRNA gene; Woese, 1987). Extensive phylogenetic and taxonomic frameworks have already been built for this gene (Cole et al., 2005; Kumar et al., 2005), and it is currently by far the most widely used gene for taxonomic assignments of bacteria based on next-generation sequencing data (Tringe & Hugenholtz, 2008). However, it has recently been realized that defining a single universal genome tree may be neither feasible nor appropriate due to the fuzzy nature of evolution (Beauregard-Racine et al., 2011; Smillie et al., 2011). With the accumulating number of genomes it is becoming evident that the number of genes shared among all bacteria is rather low (Dagan et al., 2008; Wu et al., 2009). A field yet to be explored is how different bacterial phylogroups are related based on the presence/absence of co-occurring gene clusters (genes that occur together in a subset of genomes). Investigating the distribution of such clusters in a phylogenetic framework would make it possible to address which genes are associated with which branches of the phylogenetic tree, and possibly to identify phylogroup-specific clusters.

The aim of the work presented here was to explore the correlation between 16S rRNA gene phylogeny and the distribution of co-occurring genes for genome sequenced bacterial strains. Our hypothesis is that such gene clusters are important for determining the properties of the different bacterial taxonomic groups. Our approach consisted of two steps: first, identifying clusters of co-occurring genes that explain the gene content variation among strains independently of phylogenetic relations and, second, determining how these clusters are distributed in a 16S rRNA gene phylogenetic framework.

We used a multivariate statistical approach for both gene cluster identification and determination of the 16S rRNA gene phylogenetic framework. The rationale is that it is very difficult to handle the vast amount of data without identification of the underlying structures. Multivariate curve resolution with alternating least squares (MCR-ALS; Tauler, 1995; Zimonja et al., 2008) was utilized to identify gene clusters. The MCR-ALS method estimates concentrations and profiles of contributing factors/components when no prior information is available about the nature and composition of the mixtures analysed. In our case the factors/components correspond to co-occurring gene clusters, while the scores represent the distribution of these clusters among genomes. For the description of phylogenetic relatedness we chose an approach involving transforming DNA sequences into word frequencies and subsequently identifying underlying structures using principal component analysis (PCA; Rudi et al., 2006). Correlations between MCR-ALS components and 16S rRNA gene phylogeny were determined by multivariate regression.
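As a reading aid, the bilinear model that MCR-ALS fits can be written as below. The symbols are introduced here for illustration only, and the orientation of the data matrix (genomes as rows, TIGRFAM categories as columns) is an assumption; the non-negativity constraints correspond to the constraint named under Materials and methods.

```latex
\mathbf{X} \approx \mathbf{C}\,\mathbf{S}^{\mathsf{T}},
\qquad
(\hat{\mathbf{C}}, \hat{\mathbf{S}})
  = \arg\min_{\mathbf{C} \ge 0,\; \mathbf{S} \ge 0}
    \left\lVert \mathbf{X} - \mathbf{C}\,\mathbf{S}^{\mathsf{T}} \right\rVert_F^{2}
```

Here X (n genomes by p TIGRFAM categories) is the annotation table, the k columns of S (p by k) are the component profiles, interpreted as co-occurring gene clusters, and C (n by k) holds the scores, interpreted as the level of each cluster in each genome. The alternating least squares step updates C with S held fixed, then S with C held fixed, until the fit stops improving.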

Materials and methods

A schematic overview of the approach is given in Fig. 1. All bioinformatic and statistical analyses were done in the Matlab (MathWorks Inc., Natick, MA) programming environment utilizing the following toolboxes: Statistics, Bioinformatics and PLS Toolbox (Eigenvector Research Inc., Wenatchee, WA).

Figure 1.

Overview of our analytical approach. Genomes with TIGRFAM annotations were downloaded from GenBank, while the corresponding 16S rRNA gene information for the same genomes was downloaded from the RDP database. These data were processed according to the scheme.


All genomes with a TIGRFAM gene annotation were downloaded from GenBank (June 2011). For each genome the TIGRFAM annotations were categorized into tables, where each column represents a TIGRFAM category (Selengut et al., 2007). For hierarchical functional annotations the corresponding KEGG BRITE hierarchy information (Okuda et al., 2008) was extracted from the TIGRFAM data. The rationale for using the TIGRFAM database is that this database has a prokaryotic focus for annotations and the source files are freely available through an ftp site (ftp://ftp.jcvi.org/pub/data/TIGRFAMs/).

Gene cluster identification and correlation with 16S rRNA gene data

The number of co-occurring gene clusters in the dataset was estimated by evolving factor analysis (EFA) using the PLS Toolbox. The number of components (co-occurring clusters) identified by EFA was then used as the initial input to an MCR-ALS analysis conducted with a non-negativity constraint (Zimonja et al., 2008) to detect the actual co-occurring genes. To identify genes unique to each co-occurring gene cluster, we evaluated whether the loading of a gene for a particular cluster was higher than the sum of its loadings for all other clusters, meaning that the gene is mainly present in that cluster.
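The uniqueness criterion can be illustrated with a minimal Python sketch. It is not the original Matlab code; the function name, the gene labels and the toy loadings are hypothetical, and the loadings are assumed to be non-negative, as they would be under the non-negativity constraint.

```python
def unique_genes(loadings):
    """Return {cluster index: [genes]} for genes whose loading in one cluster
    exceeds the sum of their loadings in all other clusters.

    loadings: dict mapping a gene label to a list of non-negative loadings,
    one value per cluster (hypothetical input layout).
    """
    unique = {}
    for gene, row in loadings.items():
        total = sum(row)
        for k, value in enumerate(row):
            # "sum for the other clusters" is the row total minus this loading
            if value > total - value:
                unique.setdefault(k, []).append(gene)
    return unique


# Toy loadings for three clusters (made-up numbers, for illustration only)
toy_loadings = {
    "geneA": [0.9, 0.1, 0.2],  # dominated by cluster 0
    "geneB": [0.3, 0.3, 0.3],  # spread evenly, unique to no cluster
    "geneC": [0.0, 0.1, 0.8],  # dominated by cluster 2
}
result = unique_genes(toy_loadings)
print(result)
```

With non-negative loadings a gene can satisfy the criterion for at most one cluster, so each gene is assigned to one cluster or to none.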

The full-length 16S rRNA gene sequences were retrieved from the Ribosomal Database Project (RDP) genome browser for the corresponding TIGRFAM annotated genomes. The 16S rRNA gene phylogenetic description was determined by first transforming the full-length sequences into frequencies of words (pentamers), with subsequent PCA for data compression (Rudi et al., 2006). The principle here is to avoid DNA sequence alignments, and to describe the phylogenetic relations in a coordinate- rather than a tree-based system. The rationale is that coordinates (vector-based distance description) enable regression analyses. For validation of the pentamer-derived phylogeny we used an alignment-based approach where we generated a matrix composed of all pairwise 16S rRNA gene distances. These were subsequently correlated to the pentamer-derived distances through regression.
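The alignment-free word-frequency step can be sketched as follows. This is a minimal Python illustration and not the original implementation: overlapping windows, normalization to relative frequencies, and skipping of windows with ambiguous bases are assumptions, and the function name is hypothetical.

```python
from itertools import product


def kmer_frequencies(sequence, k=5):
    """Relative frequencies of all 4**k DNA words (pentamers when k=5),
    counted in overlapping windows along the sequence."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = sequence.upper().replace("U", "T")
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


# Toy sequence; a real input would be a full-length 16S rRNA gene sequence
freqs = kmer_frequencies("ACGTACGTAC")
print(len(freqs), sum(freqs))
```

Each sequence becomes a vector of 1024 pentamer frequencies, so a set of sequences forms a table that can be compressed by PCA without any alignment.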

To evaluate the prediction of bacterial functional properties from 16S rRNA gene sequences, partial least squares regression (PLSR) was conducted using the MCR-ALS scores as response, and 16S rRNA gene principal component scores for the first six components of the PCA-compressed pentamer data as predictors (Rudi et al., 2007). PLSR identifies the underlying correlation structure between a table of predictors and a response vector. The binary nature of the correlations was tested by fuzzy logic clustering. This approach determines how well each sample fits each of the binary categories (Bjornstad et al., 2004).

Finally, for visualization purposes, the presence/absence of the MCR-ALS identified co-occurring gene sequence clusters was determined based on binarized score data. The binarization was based on the bimodal distribution nature of the data using k-means clustering with two categories. The presence/absence information was subsequently mapped onto the coordinate 16S rRNA gene phylogeny framework.
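The binarization step can be sketched in Python as a one-dimensional k-means with two categories. This is an illustrative sketch and not the original Matlab routine; the initialization at the minimum and maximum score, the function name and the toy scores are assumptions.

```python
def binarize_two_means(scores, max_iter=100):
    """Split 1-D scores into absent/present using k-means with k = 2.

    In one dimension, assigning each value to the nearest of two centroids is
    the same as thresholding at the midpoint between the centroids.
    Returns (list of booleans, threshold).
    """
    lo, hi = min(scores), max(scores)
    if lo == hi:  # no variation, nothing to split
        return [False] * len(scores), lo
    for _ in range(max_iter):
        threshold = (lo + hi) / 2.0
        low_group = [s for s in scores if s <= threshold]
        high_group = [s for s in scores if s > threshold]
        new_lo = sum(low_group) / len(low_group)
        new_hi = sum(high_group) / len(high_group)
        if new_lo == lo and new_hi == hi:  # centroids stopped moving
            break
        lo, hi = new_lo, new_hi
    threshold = (lo + hi) / 2.0
    return [s > threshold for s in scores], threshold


# Toy bimodal scores for one cluster across seven genomes (made-up numbers)
toy_scores = [0.05, 0.10, 0.08, 0.90, 0.85, 0.12, 0.95]
present, threshold = binarize_two_means(toy_scores)
print(present, threshold)
```

Genomes scoring above the threshold are coded as harbouring the cluster, which corresponds to the threshold lines drawn in Fig. 3.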

Validation of TIGRFAM annotations

To validate the effect of annotation coverage on the gene clusters identified, we compared TIGRFAM annotated genes with genes annotated by blast homology searches. First, we selected two genes for each co-occurring cluster that showed 16S rRNA gene correlation. Second, we downloaded the respective seed alignments from the TIGRFAM database. For each alignment a consensus sequence was generated. Thereafter, we performed a tblastn search against a local blast database of the TIGRFAM annotated genomes. For each gene the lowest score value among the TIGRFAM annotated genomes was determined. The genomes with higher scores were assigned to contain the given gene. Then, for all annotated genes, the average and standard deviation for the first and second principal component for the 16S rRNA gene coordinate-based phylogeny were calculated, respectively. Finally, we determined the concordance in phylogenetic distribution between blast and TIGRFAM annotations by linear regression.
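The score threshold rule for blast-based gene assignment can be sketched as below. This is a hedged Python illustration: the function name, genome labels and scores are hypothetical, and treating a score equal to the threshold as a hit (so that the TIGRFAM annotated genomes themselves are included) is an assumption.

```python
def assign_gene_presence(blast_scores, tigrfam_genomes):
    """Return the set of genomes assigned to contain a gene.

    blast_scores: dict mapping genome label to its best tblastn score for the
    gene's consensus sequence (hypothetical input layout).
    tigrfam_genomes: genomes in which the gene carries a TIGRFAM annotation.
    The threshold is the lowest score among the TIGRFAM annotated genomes.
    """
    annotated = [blast_scores[g] for g in tigrfam_genomes if g in blast_scores]
    if not annotated:
        return set()
    threshold = min(annotated)
    return {g for g, score in blast_scores.items() if score >= threshold}


# Toy scores (made-up numbers, for illustration only)
toy_scores = {"genome1": 210.0, "genome2": 95.0, "genome3": 130.0, "genome4": 40.0}
toy_annotated = {"genome1", "genome2"}
assigned = assign_gene_presence(toy_scores, toy_annotated)
print(sorted(assigned))
```

In this toy case genome3 lacks a TIGRFAM annotation but scores above the weakest annotated genome, so it is counted as containing the gene.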


Dataset characteristics

We identified 327 genomes with TIGRFAM annotations. The dominant phyla in our dataset were Proteobacteria, Firmicutes and Actinobacteria – with a total of 20 phyla (Table 1). The coordinate phylogeny showed a separation of the main phyla in the first two principal components (Fig. 2a), while six components were needed to resolve the phyla representing minor components in the data (Supporting Information, Fig. S1). Comparison of pentamer- and alignment-derived distances showed good correspondence (Fig. S2), supporting the phylogenetic nature of the pentamer distances.

Table 1. Genome sequenced phyla with TIGRFAM annotations
Phylum                 No. of genomes
Deferribacteres        1
Fusobacteria           4
Actinobacteria         49
Firmicutes             51
Spirochaetes           2
Bacteroidetes          11
Chrysiogenetes         1
Proteobacteria         155
Thermotogae            8
Chlorobi               8
Acidobacteria          1
Aquificae              3
Deinococcus–Thermus    3
Synergistetes          2
Verrucomicrobia        3
Planctomycetes         3
Cyanobacteria          9
Dictyoglomi            1
Fibrobacteres          1
Chloroflexi            10

Figure 2.

Distribution of co-occurring gene clusters in a 16S rRNA gene phylogenetic framework. (a) A coordinate-based phylogenetic framework was generated based on the 16S rRNA gene by PCA. The colour code indicates the explained variance. (b–g) Genomes harbouring the respective co-occurring gene clusters are marked with red dots (b, k2; c, k4; d, k7; e, k8; f, k9; g, k11), while the others are marked with green triangles. The same coordinate system as in panel (a) is used.

The frequency of TIGRFAM annotations of the genes in our dataset was 26%, which is quite low compared with current gene annotation approaches (Halachev et al., 2011). The annotations of the individual genomes ranged from 2% to 46.7% (Fig. S3). We found that none of the TIGRFAM categories was universally annotated among all the analysed genomes. However, a subset consisting of 33 genes was annotated in more than 50% of all genomes.

Correlation between gene content and 16S rRNA gene phylogeny

Independent of phylogeny we first identified the number of co-occurring gene clusters contributing to explaining the differences in gene composition between genomes. These genes are co-occurring in a subset of genomes. EFA indicated 11 such clusters. These clusters were identified by MCR-ALS analysis. MCR-ALS revealed that each cluster represented a heterogeneous population of genes (Fig. S4a), explaining 39.2% of the variance in the dataset. For each gene cluster there was a set of highly influential genes (Fig. S4b), representing the genes that are important for the respective co-occurring clusters. Several of these influential genes were those annotated in more than 50% of the genomes, but none of these contributed to distinguishing between the co-occurring clusters. The distribution of the co-occurring gene clusters among genomes was represented by a sigmoidal Q–Q plot (Fig. 3), indicating a binary distribution where the clusters are either present or absent.

Figure 3.

Distribution of co-occurring gene clusters in the analysed genomes. The Q–Q plot illustrates the deviation from a normal distribution (a straight line would indicate a normal distribution) for the score of (level of) the co-occurring clusters k1 to k11 in each of the genomes. The sigmoid shapes indicate high- and low-level categories. The lines represent the thresholds used for data binarization with respect to presence of co-occurring clusters within genomes – where the levels above the line are coded as present and below as absent.

Regression analysis between the 11 co-occurring gene clusters and 16S rRNA gene phylogeny showed that six gene clusters had a high correlation (R² = 0.28 ± 0.11) to 16S rRNA gene phylogeny, while the remaining five had low correlation (R² = 0.033 ± 0.013). The high- and low-correlation groups were identified using fuzzy clustering (Fig. 4). Mapping the correlating co-occurring gene clusters onto the 16S rRNA gene coordinate phylogenetic framework showed that the co-occurring gene clusters mainly covered rather broad phylogroups (Fig. 2b–g). For the gene clusters showing low correlation, the mapping onto the 16S rRNA gene phylogeny confirmed the low phylogenetic correlation (results not shown).

Figure 4.

Binary distribution of correlation between co-occurring gene clusters and 16S rRNA gene phylogeny. Fuzzy clustering of R² values for co-occurring gene clusters as a function of 16S rRNA gene phylogeny (determined by PLS regression) is shown. Clusters correlating to 16S rRNA gene phylogeny are marked in grey.

Functionality of 16S rRNA phylogeny-correlated gene clusters

We identified the KEGG BRITE functional categories that corresponded to TIGRFAMs, and subsequently determined which functional categories were over-represented for the respective co-occurring gene clusters with high positive correlations to 16S rRNA gene phylogeny.

These analyses showed a range of functionalities for each gene cluster (Table S1). The most pronounced pattern was that cluster k2 (correlated with Actinobacteria) showed a high prevalence (14%, 47 of 319 genes) of kinases (EC number 2.7). Cluster k11 (correlated with Gammaproteobacteria) showed the second highest level of over-represented genes (n = 90 of 4024). This cluster showed an association with methylases, EC number 2.1 (10%, nine of 90), and acyltransferases, EC number 2.3 (11.1%, 10 of 90). Comparisons of k2 and k11 using the binomial test showed that kinases were significantly over-represented in k2 (P = 0.03), while methylases and acyltransferases were both significantly over-represented in k11 (P < 0.01). For the other clusters the most pronounced pattern was an association of cluster k9 (correlated with Chlorobium) with iron–sulfur oxidoreductases in combination with carbon–nitrogen ligases.
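For readers unfamiliar with the binomial test, the idea can be sketched with an exact one-sided tail probability in Python. This is a generic illustration under stated assumptions, not a reconstruction of the reported P values: the text does not state the null proportions or the sidedness used, the function name is hypothetical, and the numbers below are toy values.

```python
from math import comb


def binomial_upper_tail(successes, trials, p_null):
    """Exact probability of observing at least `successes` hits in `trials`
    independent draws when each draw is a hit with probability `p_null`."""
    return sum(
        comb(trials, i) * p_null ** i * (1.0 - p_null) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# Toy example: 9 hits in 10 draws under a null proportion of 0.5
p_value = binomial_upper_tail(9, 10, 0.5)
print(p_value)
```

In an over-representation setting, `trials` would be the number of genes in a cluster, `successes` the number falling in the functional category of interest, and `p_null` the proportion of that category in the comparison set.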

Concordance between blast and TIGRFAM annotated genes

The TIGRFAM annotation coverage ranged from about 20% to 95% as compared with the blast annotation. Despite the variation in coverage, there was good phylogenetic concordance between the TIGRFAM and blast annotated genes (Fig. 5). This was confirmed by regression analyses where the squared regression coefficients for the respective mean values were 0.95 for PC1 and 0.88 for PC2.

Figure 5.

Comparison of blast and TIGRFAM annotated genes in a 16S rRNA gene phylogenetic framework. The average and standard deviation for the 16S rRNA gene scores for the genomes with annotated genes (blue bars, blast; red bars, TIGRFAM) for 16S rRNA phylogeny coordinates PC1 (a) and PC2 (b). Two genes from each of the six gene clusters with 16S rRNA gene correlation were selected: for k2, TIGR02408 (n = 16 and n = 5 genomes for blast and TIGRFAM annotations, respectively) and TIGR03557 (n = 36 and n = 9); for k4, TIGR02965 (n = 34 and n = 13) and TIGR01416 (n = 96 and n = 53); for k7, TIGR02092 (n = 33 and n = 15) and TIGR01859 (n = 44 and n = 14); for k8, TIGR00644 (n = 81 and n = 16) and TIGR0135 (n = 25 and n = 5); for k9, TIGR02064 (n = 32 and n = 24) and TIGR01278 (n = 38 and n = 36); for k11, TIGR03146 (n = 9 and n = 5) and TIGR02445 (n = 14 and n = 3).


Our work represents a relatively naïve screening for correlations between co-occurring gene clusters and 16S rRNA gene phylogeny. Despite its simplicity, this approach revealed that more than one-third of the variance in the gene distribution dataset was explained by the co-occurring gene clusters identified. Given the metabolic and functional diversity of bacterial genomes (Koonin & Wolf, 2010), the relatively low number of co-occurring gene clusters was surprising. Therefore, there is probably also an underlying correlation structure between different functional and/or metabolic pathways.

Although the criteria for cluster identification did not include 16S rRNA gene information, the detected co-occurring gene clusters showed binary patterns with respect to 16S rRNA gene correlations. They showed either high or low correlation to 16S rRNA gene phylogeny. A potential explanation for this pattern could be that the co-occurring clusters with low correlation to 16S rRNA gene phylogeny represent mobile gene clusters with a wide host range (Smillie et al., 2011). Addressing mobility, however, is challenging, requiring detailed phylogenetic information for each gene, which is outside the scope of this article.

The co-occurring gene cluster with the most distinct correlation to 16S rRNA gene phylogeny is iron–sulfur oxidoreductases in combination with carbon–nitrogen ligases for Chlorobium. These genes are associated with the photosynthetic metabolism of this bacterium, and all act on the same substrate (Brocker et al., 2008). For Gammaproteobacteria we found an over-representation of RNA methylases. Therefore, RNA methylation could be important for the phylogroup-specific functional properties of RNA in the cell (Motorin & Helm, 2011). We also found an over-representation of acyltransferases. In Escherichia coli, acetylation is important for regulation of key metabolic processes (Zhang et al., 2009). For Actinobacteria a range of different kinases were over-represented. As for acetylation, bacterial phosphorylation is involved in a broad selection of different metabolic pathways (Cozzone, 1998). Therefore, phosphorylation could potentially be important for metabolic signalling in Actinobacteria. Regulation of metabolic genes is of key importance in bacterial competition and fitness, suggesting a potential evolutionary role of the proposed post-transcriptional modifications.

In conclusion, our work represents an initial characterization of co-occurring gene clusters showing a phylogenetic distribution. We found that some co-occurring gene clusters correlate with higher level phylogroups, suggesting that these higher phylogroups have distinct functional properties. Considering that our work does not provide mechanistic insights, further research should be directed towards understanding why these genes co-occur.


We would like to thank the Norwegian University of Life Sciences, Hedmark University College and Genetic Analysis AS for their institutional support.