Testing evolutionary models to explain the process of nucleotide substitution in gut bacterial 16S rRNA gene sequences

Correspondence: Jose F. Garcia-Mazcorro, Francisco Villa s/n Ex-Hacienda el Canadá C.P. 66050. General Escobedo, Nuevo León, México. Tel.: +52(81)8087 0592; e-mail: josegarcia_mex@hotmail.com

Abstract

The 16S rRNA gene has been widely used as a marker of gut bacterial diversity and phylogeny, yet we do not know the model of evolution that best explains the differences in its nucleotide composition within and among taxa. Over 46 000 good-quality near-full-length 16S rRNA gene sequences from five bacterial phyla were obtained from the ribosomal database project (RDP) by study and, when possible, by within-study characteristics (e.g. anatomical region). Using alignments (RDPX and MUSCLE) of unique sequences, the FINDMODEL tool available at http://www.hiv.lanl.gov/ was utilized to find the model of character evolution (28 models were available) that best describes the input sequence data, based on the Akaike information criterion. The results showed variable levels of agreement (from 33% to 100%) in the chosen models between the RDP-based and the MUSCLE-based alignments among the taxa. Moreover, subgroups of sequences (using either alignment method) from the same study were often explained by different models. Nonetheless, the different representatives of the gut microbiota were explained by different proportions of the available models. This is the first report using evolutionary models to explain the process of nucleotide substitution in gut bacterial 16S rRNA gene sequences.

Introduction

The intestinal tract of animals is inhabited by a complex assembly of microorganisms from the three main domains of life, which together with the host constitute an inseparable ecological system. The intestinal microbiota has coevolved with the host for millions of years up to the point where the health of the latter can be seriously compromised without the presence of the former. Different environmental forces have acted upon the host and its associated gut microorganisms, resulting in a highly efficient and most often peaceful coexistence between the two (Ley et al., 2006).

Despite recent massive efforts to culture the gut microbiota (Lagier et al., 2012), the use of molecular methods is still considered indispensable to fully characterize the membership of the microbiota in the gut and other environments. In particular, the gene encoding the 16S small subunit of the ribosomal RNA (16S rRNA gene) has been widely used to study phylogeny and diversity of bacteria in different ecosystems. Although extensive work has been performed on the evolution of rRNA sequences (e.g. Smit et al., 2007), and many tools have been developed to investigate the details (e.g. rate of transitions and transversions) of molecular evolution (Posada & Crandall, 1998; Johnson & Omland, 2004), we still do not know the model of evolution that best explains the process of nucleotide substitution in the 16S rRNA gene among gut microorganisms. This information is important not only for the accuracy of phylogenetic analysis (Posada, 2009), but because it can improve our understanding of the biological processes that shape the evolutionary process itself (Liò & Goldman, 1998). The aim of this study was to test different evolutionary models to explain the process of nucleotide substitution among gut bacterial 16S rRNA gene sequences.

Materials and methods

Over 46 000 16S rRNA gene sequences of several gut bacterial groups (Faecalibacterium, Ruminococcus, Bacteroides, Prevotella, and members of Actinobacteria, Proteobacteria, and Fusobacteria) were downloaded from the ribosomal database project (RDP, size > 1200 base pairs, good quality only) by study or submission (for unpublished research) and, when possible, by relevant within-study characteristics (e.g. anatomical region). The FASTA format without common gaps was used for download, and only sequences from the small and large intestine (including feces) of mammals were considered (several sequences were excluded, mainly because they came from unpublished studies that contributed only one or a few sequences). RDP allows the user to download sequences aligned with RDPX (Cole et al., 2009); the downloaded alignments were also realigned using MUSCLE (Edgar, 2004) in order to investigate the impact of the alignment method on the chosen model of evolution. The ElimDupes tool (http://www.hiv.lanl.gov/) was used to obtain unique sequences using the maximum threshold of similarity allowed (99%). Then, the FINDMODEL tool (http://www.hiv.lanl.gov/) was used to find the evolutionary model that best describes the input sequence alignment. The FINDMODEL tool uses an idea first implemented in MODELTEST (Posada & Crandall, 1998) and the Akaike information criterion (AIC) to choose the best model (lower AIC values indicate a better model fit). FINDMODEL currently offers 14 models, each of which can also be fitted with a gamma distribution (a continuous probability distribution that has proven useful for modeling site-specific rate heterogeneity; Yang, 1994), for a total of 28 candidate models with varying degrees of complexity in their assumptions about the process of nucleotide substitution (Table 1).
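
The AIC-based choice described above can be illustrated with a minimal sketch. It assumes the standard definition AIC = 2k - 2 ln L, counts only the substitution parameters listed in Table 1 plus one extra parameter for the gamma shape, and uses hypothetical log-likelihood values; it is not the FINDMODEL implementation. The function names and the "+G" suffix are illustrative choices.

```python
# Free substitution parameters per model, taken from Table 1.
FREE_PARAMS = {"JC": 0, "HKY": 4, "TrN": 5, "GTR": 8}


def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def best_model(log_likelihoods):
    """Return (model name, AIC) for the candidate with the lowest AIC.

    A '+G' suffix marks the gamma variant, which is assumed here to add
    one free parameter (the shape of the gamma distribution).
    """
    scores = {}
    for name, lnl in log_likelihoods.items():
        has_gamma = name.endswith("+G")
        base = name[:-2] if has_gamma else name
        k = FREE_PARAMS[base] + (1 if has_gamma else 0)
        scores[name] = aic(lnl, k)
    winner = min(scores, key=scores.get)
    return winner, scores[winner]


# Hypothetical log-likelihoods, for illustration only (not from this study).
example_lnl = {
    "JC": -5230.4,
    "HKY": -5101.7,
    "HKY+G": -5040.2,
    "GTR": -5095.9,
    "GTR+G": -5031.5,
}

if __name__ == "__main__":
    print(best_model(example_lnl))
```

Branch lengths are left out of the parameter count in this sketch. When all models are fitted to the same tree, including them would add the same constant to every AIC value and would not change the ranking.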
In order to confirm the differences in the chosen models among the bacterial groups (see below), all unique sequences from each bacterial group were compiled in separate files (a total of seven files were created, one for each bacterial group). These files were then used to obtain the same percentage of random sequences using the script subsample_fasta.py in QIIME (Caporaso et al., 2010). A total of 50 subgroups of random sequences were generated from each bacterial group and aligned with MUSCLE for analysis in the FINDMODEL tool. Using the data generated by this approach, a chi-squared test was used to test the null hypothesis of no association between the chosen evolutionary models and the bacterial group.
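
The test of association can be sketched as a Pearson chi-squared computation on a contingency table of chosen models (rows) by bacterial group (columns). The counts below are hypothetical and only mimic the design of 50 random subgroups per bacterial group; they are not the data in Table S1, and the function name is illustrative.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table.

    `table` is a list of rows of observed counts. Every row and column
    total must be positive so that all expected counts are non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are models (TrN, TIM, GTR), columns are three
# bacterial groups, and each column sums to 50 subgroups.
hypothetical_counts = [
    [30, 5, 10],
    [15, 5, 5],
    [5, 40, 35],
]

if __name__ == "__main__":
    stat, dof = chi_squared_independence(hypothetical_counts)
    print(f"chi-squared = {stat:.2f}, degrees of freedom = {dof}")
```

The statistic is then compared with a chi-squared distribution with the reported degrees of freedom to obtain a P value. Statistical packages provide this step directly, for example `scipy.stats.chi2_contingency` in SciPy.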

Table 1. Models supported by the FINDMODEL tool available at http://www.hiv.lanl.gov/ (adapted from Posada, 2009)
| Abbreviation | Model | Number of free parameters | Base frequencies | Substitution rates | Useful references |
| --- | --- | --- | --- | --- | --- |
| JC | Jukes-Cantor | 0 | Equal | AC=AG=AT=CG=CT=GT | Jukes & Cantor (1969) |
| F81 | Felsenstein 81 | 3 | Unequal | AC=AG=AT=CG=CT=GT | Felsenstein (1981) |
| K2P | Kimura 2-parameter | 1 | Equal | AC=AT=CG=GT, AG=CT | Kimura (1980) |
| HKY | Hasegawa-Kishino-Yano | 4 | Unequal | AC=AT=CG=GT, AG=CT | Hasegawa et al. (1985) |
| TrNef* | Tamura-Nei equal-frequencies | 2 | Equal | AC=AT=CG=GT, AG, CT | Tamura & Nei (1993) |
| TrN | Tamura-Nei | 5 | Unequal | AC=AT=CG=GT, AG, CT | Tamura & Nei (1993) |
| K81* | Kimura 3-parameter | 2 | Equal | AC=GT, AT=CG, AG=CT | Kimura (1981) |
| K81uf* | Kimura 3p unequal-frequencies | 5 | Unequal | AC=GT, AT=CG, AG=CT | Kimura (1981) |
| TIMef* | Transition equal-frequencies | 3 | Equal | AC=GT, AT=CG, AG, CT | Posada (2003) |
| TIM* | Transition | 6 | Unequal | AC=GT, AT=CG, AG, CT | Posada (2003) |
| TVMef* | Transversion equal-frequencies | 4 | Equal | AC, AT, CG, GT, AG=CT | Posada (2003) |
| TVM* | Transversion | 7 | Unequal | AC, AT, CG, GT, AG=CT | Posada (2003) |
| SYM* | Symmetrical | 5 | Equal | AC, AG, AT, CG, CT, GT | Zharkikh (1994) |
| GTR | General Time-reversible | 8 | Unequal | AC, AG, AT, CG, CT, GT | Rodriguez et al. (1990) |

Asterisks (*) indicate models that Los Alamos National Laboratory does not consider to have an obvious biological interpretation (http://www.hiv.lanl.gov/content/sequence/findmodel/findmodel.html). A summary of this information is provided at: http://molecularevolution.org/molevolfiles/models/submodels_final.pdf. More information about the FINDMODEL tool can be found here: http://www.hiv.lanl.gov/content/sequence/findmodel/doc.pdf.
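
The parameter counts in Table 1 follow from the other two columns: each distinct substitution-rate class beyond the first adds one free parameter, and unequal base frequencies add three (four frequencies constrained to sum to one). The short sketch below reproduces the counts from the rate patterns. The dictionary and function names are illustrative and are not part of FINDMODEL.

```python
# Substitution-rate constraints and base-frequency assumptions from Table 1.
# Rate classes are separated by commas; '=' joins rates constrained to be equal.
MODELS = {
    "JC": ("AC=AG=AT=CG=CT=GT", "Equal"),
    "F81": ("AC=AG=AT=CG=CT=GT", "Unequal"),
    "K2P": ("AC=AT=CG=GT,AG=CT", "Equal"),
    "HKY": ("AC=AT=CG=GT,AG=CT", "Unequal"),
    "TrNef": ("AC=AT=CG=GT,AG,CT", "Equal"),
    "TrN": ("AC=AT=CG=GT,AG,CT", "Unequal"),
    "K81": ("AC=GT,AT=CG,AG=CT", "Equal"),
    "K81uf": ("AC=GT,AT=CG,AG=CT", "Unequal"),
    "TIMef": ("AC=GT,AT=CG,AG,CT", "Equal"),
    "TIM": ("AC=GT,AT=CG,AG,CT", "Unequal"),
    "TVMef": ("AC,AT,CG,GT,AG=CT", "Equal"),
    "TVM": ("AC,AT,CG,GT,AG=CT", "Unequal"),
    "SYM": ("AC,AG,AT,CG,CT,GT", "Equal"),
    "GTR": ("AC,AG,AT,CG,CT,GT", "Unequal"),
}


def free_parameters(rate_pattern, base_frequencies):
    """Count free parameters implied by a rate pattern and frequency assumption."""
    rate_classes = len(rate_pattern.split(","))
    rate_params = rate_classes - 1  # one rate class serves as the reference
    freq_params = 3 if base_frequencies == "Unequal" else 0
    return rate_params + freq_params


if __name__ == "__main__":
    for name, (rates, freqs) in MODELS.items():
        print(name, free_parameters(rates, freqs))
```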

Results

The FINDMODEL tool allows the construction of the initial tree using MrBayes (Huelsenbeck & Ronquist, 2001), Weighbor (Bruno et al., 2000) and PAUP* (phylogenetic analysis using parsimony) (Swofford, 2003). The use of MrBayes was constrained to ten or fewer sequences of the size (in base pairs) used in this study and therefore could not be utilized to construct the initial tree. With very few exceptions, the chosen models were identical when using Weighbor or PAUP* to construct the initial tree. Also, Weighbor and PAUP* yielded results in similar amounts of time (differences in seconds and/or minutes were considered insignificant). Therefore, only one set of results (using Weighbor) for each sequence alignment is presented.

The use of MUSCLE and RDPX yielded different models of evolution for the same group of sequences (see below and Supporting Information, Tables S2–S8). Regardless of the alignment method, all but one group of Helicobacter sequences generated by Ley et al. (2005) yielded consistent results with respect to the gamma distribution. Several models were not chosen for any group or subgroup of sequences, including the Jukes and Cantor (JC), TIMef, and TVMef models (Table 2). Other models were chosen only a few times, including the TrNef, K81, K2P, and SYM models (Table 2).

Table 2. Summary of all chosen evolutionary models for the 16S rRNA gene sequences from each bacterial group investigated (all animal species included) using RDP-based and MUSCLE-based alignments. The numbers in parentheses indicate the number of models that also incorporated a gamma distribution for site-specific rate heterogeneity
Model  Faecalibacterium  Ruminococcus  Bacteroides  Prevotella  Proteobacteria  Actinobacteria  Fusobacterium
       RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE
JC*
F81  1 (0)  1 (0)  1 (0)  1 (0)  3 (0)  2 (0)
K2P  1 (1)
HKY  7 (4)  8 (3)  1 (1)  2 (1)  2 (1)  2 (1)  7 (3)  10 (4)
TrNef  1 (1)
TrN  14 (14)  17 (17)  1 (1)  1 (1)  10 (10)  10 (9)  2 (1)  4 (3)  7 (6)  6 (4)  5 (3)  6 (4)  2 (2)  1 (0)
K81  1 (1)
K81uf  3 (2)  2 (1)  1 (1)  3 (3)  1 (0)  1 (0)  1 (0)  2 (2)
TIMef*
TIM  18 (16)  12 (10)  6 (6)  3 (3)  5 (5)  6 (6)  4 (3)  2 (2)  5 (5)  4 (4)  1 (1)  1 (1)
TVMef*
TVM  3 (1)  1 (1)  2 (2)  4 (2)  1 (0)  1 (1)  1 (1)  2 (2)  1 (1)
SYM  1 (1)  2 (2)  2 (2)
GTR  6 (6)  11 (11)  19 (19)  19 (19)  32 (31)  35 (35)  21 (21)  19 (19)  4 (4)  7 (7)  8 (7)  8 (7)  4 (1)  4 (2)
% Gamma distribution  83  83  95  95  98  98  93  94  66  63  80  80  67  67

Asterisks (*) indicate that the models were not chosen at all. The symbol en dash (–) was written instead of zero for easier data visualization. A detailed description of the results within this table is presented as Supporting Information, including additional analysis using randomly selected sequences from each bacterial group.

The results for each bacterial group are presented in detail as Supporting Information. Several studies contained more than 100 unique sequences (the maximum allowed in the FINDMODEL tool) and therefore had to be divided into subgroups of sequences. Two groups of sequences that had to be divided into subgroups consistently yielded the same model using either alignment method. For example, the GTR model was consistently chosen for Faecalibacterium sequences generated by Durso et al. (2010), and the SYM model was chosen for all subgroups of Escherichia/Shigella sequences generated by Li et al. (2012). All other groups of sequences that had to be separated into subgroups yielded different models using either alignment method (Supporting Information).
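
Dividing a large set of unique sequences into subgroups that respect the 100-sequence limit can be sketched as follows. How the subgroups were actually formed is not stated here, so the balanced, order-preserving split below is only one plausible assumption, and the function name is illustrative.

```python
import math


def split_into_subgroups(sequence_ids, max_size=100):
    """Split a list into the fewest consecutive subgroups of at most max_size items.

    Subgroup sizes differ by at most one, which avoids leaving a very small
    final subgroup (an assumption of this sketch, not a stated method).
    """
    n = len(sequence_ids)
    if n == 0:
        return []
    n_groups = math.ceil(n / max_size)
    base, extra = divmod(n, n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(sequence_ids[start:start + size])
        start += size
    return groups


if __name__ == "__main__":
    ids = [f"seq{i}" for i in range(250)]  # hypothetical sequence identifiers
    print([len(group) for group in split_into_subgroups(ids)])
```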

In spite of the differences between alignments and within studies, the investigated gut bacterial sequences were explained by different proportions of the available models, suggesting that the 16S rRNA gene from different gut bacterial taxa has evolved according to different evolutionary models (Table 2). These observations were confirmed when looking only at the results obtained from humans (Table 3). Moreover, additional analysis using equal percentages of randomly selected sequences from each bacterial group (all animal species included) confirmed these observations with statistical significance (Supporting Information, Table S1). The proportion of models with a gamma distribution also differed among the investigated taxa, suggesting that site-specific rate heterogeneity throughout the 16S rRNA gene is not evenly spread among different members of the gut bacterial microbiota (Table 2). This was also confirmed when looking only at the results from humans and in the analysis of random sequences (Tables 3 and S1).

Table 3. Summary of all chosen evolutionary models for the 16S rRNA gene sequences from each bacterial group investigated (only humans included) using RDP-based and MUSCLE-based alignments. The numbers in parentheses indicate the number of models that also incorporated a gamma distribution for site-specific rate heterogeneity
Model  Faecalibacterium  Ruminococcus  Bacteroides  Prevotella  Proteobacteria  Actinobacteria  Fusobacterium
       RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE  RDP  MUSCLE
JC*
F81  1 (0)  1 (0)  1 (0)  1 (0)
K2P  1 (0)
HKY  5 (3)  4 (3)  1 (1)  4 (2)  6 (3)
TrNef  1 (1)
TrN  12 (12)  14 (14)  5 (5)  7 (7)  1 (1)  5 (4)  3 (2)  3 (2)  5 (4)
K81  1 (1)
K81uf  2 (1)  2 (1)  1 (0)  1 (0)  1 (1)
TIMef*
TIM  17 (16)  11 (9)  4 (4)  1 (1)  3 (3)  1 (1)  2 (2)  4 (4)  3 (3)  1 (1)  1 (1)
TVMef*
TVM  1 (1)  1 (1)  3 (1)  1 (0)  1 (1)
SYM  1 (1)  2 (2)  2 (2)
GTR  2 (2)  7 (7)  13 (13)  13 (13)  20 (20)  21 (21)  9 (9)  7 (7)  2 (2)  3 (3)  7 (6)  6 (5)  1 (0)  1 (0)
% Gamma distribution  85  88  100  100  100  100  100  100  67  67  80  80  67  67

Asterisks (*) indicate that the models were not chosen at all. The symbol en dash (–) was written instead of zero for easier data visualization. A detailed description of the results within this table is presented as Supporting Information.

Discussion

There is evidence that the 16S rRNA gene sequence composition has a role in modulating the initiation, efficiency, and fidelity of translation (Jacob et al., 1987; Sprengart et al., 1990; O'Connor et al., 1997; Asai et al., 1999). Also, higher-order structures of the 16S rRNA gene, which are crucial for the biological performance of the molecule, are believed to be in part dependent on the primary structure (Gutell et al., 1994). Because proteins are the fundamental building blocks of life on which natural selection acts, we can improve our understanding of the biological processes (e.g. use, cooperation, and competition for nutrients) that have shaped the evolution of the microorganisms into different lineages by studying the process of molecular evolution of the 16S rRNA gene. Despite previous work on RNA sequence evolution (Rzhetsky, 1995; Savill et al., 2001; Smit et al., 2007) and the wide availability of tools to investigate molecular evolution (Posada & Crandall, 1998; Johnson & Omland, 2004), to date there are no published studies that have looked at the process of nucleotide substitution of this gene within and among gut bacterial taxa. The aim of this study was to fill this gap by testing different evolutionary models to explain differences in nucleotide composition among gut bacterial 16S rRNA genes.

In order to find the best model of molecular evolution using FINDMODEL and other tools, the sequences must first be aligned. However, each program uses different criteria to align sequences (Edgar, 2004), which can affect any downstream analysis. For example, RDP uses the Infernal secondary-structure-aware aligner (Cole et al., 2009), while MUSCLE uses a three-stage algorithm that has been shown to provide significant improvements in accuracy and speed when compared with other commonly used alignment methods (Edgar, 2004). In this study, these two alignment methods yielded different evolutionary models for the same group of sequences. It is important to note that the great majority of the models chosen using MUSCLE-based alignments yielded lower AIC values than those chosen using the RDP-based alignments, suggesting a better model fit (Supporting Information). However, it is not clear whether this can help researchers determine which method to use, because subgroups of sequences from the same study often yielded different models using either alignment method.

Despite the differences within studies and between the alignment methods, the different gut bacterial sequences were explained by different proportions of the available models. In particular, the TrN and GTR models, which assume different nucleotide substitution rates (Table 1), were chosen with a different frequency among the bacterial groups (Table S1). Interestingly, evidence was found to suggest that another commonly chosen model (the TIM model), which is considered not to have a biological interpretation (Table 1), was also selected with a different frequency (Table S1). Moreover, the proportion of the models that incorporated a gamma distribution also differed among the taxa. These observations confirm previous findings suggesting that relative rates and patterns of rRNA evolution are lineage specific (Smit et al., 2007). The implications of these observations may relate to the well-researched diversification of gut bacteria throughout evolution (Ley et al., 2008). For instance, it is feasible to hypothesize that the machinery of translation, including the rRNA, has become specialized to exploit more efficiently distinctive metabolic pathways, such as utilization of specific dietary (De Filippo et al., 2010) and/or host compounds (Berry et al., 2013). It is the author's hope that other researchers can use this line of thought to study in more depth the relationship between the evolution of microbial rRNA and metabolic diversification in the gut and other environments.

The Jukes and Cantor (JC) model assumes that the equilibrium frequencies of the four nucleotides are each 25% and that, throughout evolution, any nucleotide has the same probability of being replaced by any other. As expected, this model was not chosen for any sequence alignment in this study because it is well documented that some substitutions occur more often than others (e.g. transitions occur more frequently than transversions). Other models that were chosen minimally or not at all include the TrNef, TIMef, TVMef, K81, and SYM models (Table 2). These models share a common feature with the JC model in that they assume equal base frequencies (Posada, 2009). These observations confirm that nucleotide frequencies do not change at the same rate in gut bacterial 16S rRNA gene sequences.
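
The distinction between transitions and transversions, and the single-rate assumption of the JC model, can be made concrete with a short sketch. It counts both kinds of differences between two aligned sequences and applies the standard JC distance correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The two sequences are hypothetical toy examples, and the function names are illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_substitutions(seq1, seq2):
    """Return (transitions, transversions, compared sites) for two aligned sequences.

    Positions with gaps or ambiguous characters in either sequence are skipped.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A transition stays within purines (A<->G) or within pyrimidines (C<->T).
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions, compared


def jc69_distance(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC correction")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    # Hypothetical aligned fragments, for illustration only.
    ts, tv, n = count_substitutions("ACGTACGTAC", "GCGTATGTCC")
    print(ts, tv, n, round(jc69_distance((ts + tv) / n), 4))
```

Because the JC correction depends only on the total proportion of differences, it treats a transition and a transversion as equally likely, which is the assumption that the more parameter-rich models in Table 1 relax.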

The FINDMODEL tool used in this study is a relatively fast and user-friendly way to obtain the best evolutionary model. Other tools available for the same purpose include DAMBE (Xia & Xie, 2001) and jModelTest (Darriba et al., 2012). In DAMBE, the find model function requires the user to manipulate the sequences, which may not be practical for a large number of sequences like the one presented here. jModelTest is a Java tool that provides more models and selection strategies but depends on third-party binaries. Although this flexibility could allow users to apply jModelTest more effectively, the FINDMODEL tool offers a more convenient alternative to find the best evolutionary model, especially for researchers with minimal training in computer programming.

Future studies working on microbial rRNA evolution in the gut or other environments should consider models that take into account other aspects of the molecule aside from its primary structure; for example, the base pairings that form the secondary and tertiary structures (Tillier & Collins, 1998) and the effect of phenotype on the evolution of the genotype (Yu & Thorne, 2006). Indeed, more work is still needed not only to develop and make available more precise models to explain the molecular evolution of rRNA but also to test their performance using different alignment methods with data from many naturally occurring environments.

In summary, this communication tested different evolutionary models to explain the process of nucleotide substitution in gut bacterial 16S rRNA gene sequences. The results showed that the alignment method has an impact on the chosen model and that sequences from the same bacterial taxa yield different models. The results also confirmed previous findings suggesting that relative rates and patterns of rRNA evolution are lineage specific. However, more research considering secondary and tertiary structures of the molecule and other naturally occurring environments is needed to build a more comprehensive picture of this phenomenon.

Acknowledgements

I would like to thank Los Alamos National Laboratory and the Ribosomal Database Project for their work and services, and CONACYT (México) for financial support through the National System of Researchers (SNI, for initials in Spanish) program. I am particularly grateful to one anonymous reviewer for thoughtful observations on a previous version of this manuscript.
