Yonghui H. Zeng, Guangdong Provincial Key Laboratory of Pathogenic Biology and Epidemiology for Aquatic Economic Animals, Guangdong Ocean University, Zhanjiang 524025, China. E-mail: email@example.com
To perform a systematic evaluation of the applicability, validity and reliability of the long PCR-RFLP of 16S-ITS-23S rRNA genes for bacterial genotyping using both sequences retrieved from public genome databases and the experimental data obtained on bacterial cultures.
Methods and Results
A total of 3301 full-length sequences of 16S-ITS-23S rRNA genes were retrieved from 885 published bacterial genomes. Copy numbers of the whole set of 16S-ITS-23S rRNA genes per genome ranged from 1 (n =161) to 14 (n =4) with an average of 3·71. Their length varied greatly, from 4319 to 6568 bp with an average of 4952 bp. Computer-simulated RFLP analyses of the 16S-ITS-23S fragments flanked by the conserved primers 27F and 2241R suggested MspI, RsaI, HhaI and TaqI as the most appropriate enzymes for long PCR-RFLP analysis of the 16S-ITS-23S sequence. MspI was used to screen over 900 bacterial cultures isolated from the Huguangyan Maar Lake in southern China. Sequencing of the 16S rRNA genes of isolates possessing unique RFLP band patterns confirmed the broad applicability and high resolution of this approach.
These results indicate that long PCR-RFLP of 16S-ITS-23S rRNA genes is a potentially universal and reliable bacterial genotyping tool with a high resolution.
Significance and Impact of the Study
The methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes will facilitate the exploration and tracing of cultivable microbial diversity in natural environments.
Microbial communities in nature are astonishingly diverse, and only a very tiny fraction of this diversity has been cultured in the laboratory, leaving the majority unknown (Rappe and Giovannoni 2003; Keller and Zengler 2004). Isolating novel bacterial cultures from this vast majority or tracing a given microbial species in the environment often requires the screening of a great number of strains. A number of DNA-based fingerprinting methodologies have been developed to facilitate such typing work, for instance, amplified fragment length polymorphism (Blears et al. 1998), random amplified polymorphic DNA (Cocconcelli et al. 1995), amplified ribosomal DNA restriction analysis (ARDRA) (Redecker et al. 1997), ribosomal intergenic spacer analysis (Bourque et al. 1995), pulsed-field gel electrophoresis (Tanskanen et al. 1990) and repetitive extragenic palindromic-PCR (rep-PCR) (Pooler et al. 1996).
Among these diverse genotyping tools, 16S rRNA gene targeted ARDRA (also known as 16S PCR-RFLP) was one of the earliest techniques used, with the advantages of simplicity, low cost and a broad scope of application. However, 16S PCR-RFLP is presently less used than other molecular genotyping methods because the high sequence identity of the 16S rRNA gene among bacterial species may result in a low resolution of differentiation. Three or more restriction enzymes were necessary for full species discrimination (Moyer et al. 1996), greatly limiting its application in bacterial genotyping. A study on typing some Acinetobacter species indicated that the number of restriction enzymes used could be reduced to one while reaching a resolution similar to that of conventional ribotyping/DNA hybridization identification methods, by introducing a long PCR protocol and extending the PCR template from the 16S rRNA gene alone to the combination of the 16S and 23S rRNA genes and the spacer sequence in between (Garcia-Arata et al. 1997).
Long PCR with proofreading DNA polymerase fusions was first introduced in the early 1990s and successfully generated DNA fragments of over 20 kb from human genomic DNA (Cheng et al. 1994) and lambda bacteriophage DNA with high fidelity (Barnes 1994). Since then, DNA polymerases for long PCR have been continuously improved. A recent study has shown that a significant proportion of long PCR products exceeding 20 kb in length were error-free (Hogrefe and Borns 2011). This progress has made long PCR a routine and reliable technique that is widely used in laboratories worldwide. In bacterial genotyping, the pioneering application of long PCR may be the work of Smith-Vaughan et al. (1995), who generated a 5·5-kb-long product of 16S-ITS-23S-5S rRNA genes from the genomic DNA of a human bacterial pathogen. However, only a few applications of long PCR-RFLP of rRNA operon sequences have hitherto been reported, with all data produced from pathogenic strains (Garcia-Arata et al. 1997; Abd-El-Haleem et al. 2002; Yavuz et al. 2004). A systematic evaluation of the applicability, validity and reliability of this methodology is lacking. Given that the three rRNA molecules, namely 16S, 23S and 5S, are linked together into an rRNA operon in virtually all bacterial genomes and that universal primers for amplifying 23S rRNA genes have recently been designed and successfully applied to environmental DNAs (Hunt et al. 2006), we hypothesize that the strategy of long PCR-RFLP of 16S-ITS-23S sequences is theoretically sound and applicable to most bacterial species.
Here, we present an in-depth comparison and computer-simulated restriction analyses of 16S-ITS-23S sequences retrieved from published bacterial genomes, with the aim of testing the above hypothesis and answering two questions: whether universal restriction endonucleases can be set for all bacteria, and whether their discriminatory capability at species level is greatly improved compared with that of traditional 16S PCR-RFLP. Experimental evidence was further provided by employing this technique on a large number of bacterial cultures isolated from the Huguangyan Maar Lake in southern China. We propose the methodology of long PCR-RFLP of the 16S-ITS-23S sequence as a universal and reliable bacterial genotyping tool.
Materials and methods
Collection of 16S-ITS-23S sequences from published bacterial genomes
Sequences of the genome fragment containing 16S rRNA gene, internal transcribed spacer (ITS) and 23S rRNA gene and associated taxonomy information were manually retrieved from the NCBI bacterial genome database (mostly as of May 2008 and supplemented up to November 2011). Sequences from thermophilic bacteria (the genera Thermus, Thermoanaerobacter, Thermosipho and Thermotoga) were discarded due to their unusual sequence and structural features in ribosomal RNA operons (Acinas et al. 2004). In order to gain high-quality data, we set size ranges for each individual component of the 16S-ITS-23S rRNA genes to exclude abnormal molecules: 1000–2000 bp for the 16S rRNA gene, 1500–5000 bp for the 23S rRNA gene and 1000–2500 bp for ITS. Sequences containing one or more components that fell out of these size ranges were discarded.
Computer-simulated RFLP analyses
Each sequence in our database was trimmed to the fragment flanked by two conserved regions corresponding to the primer sites of 27F (Lane 1991) at the 5′ end of the 16S rRNA gene and 2241R (Hunt et al. 2006) at the 3′ end of the 23S rRNA gene. The trimmed sequences were digested in silico using the electronic gel function of the Vector NTI Advance package (ver. 10.0; Invitrogen, Carlsbad, CA, USA). In a preliminary simulated digestion with some commonly used hexa-cutting restriction enzymes (PstI, XbaI, HindIII, EcoRI and BamHI), only 1–4 bands were generated for most sequences in our database. The number of both differential and common bands between samples was insufficient to support either a high discriminatory capability or reliable cluster analyses of RFLP data.
We analysed the frequency of all palindromic tetranucleotide sequences in a selection of trimmed 16S-ITS-23S sequences, including one sequence from Escherichia coli O157:H7_EDL933, 502 sequences from the 89 genomes within the family Enterobacteriaceae (defined as the 89Ent data set), and 360 representative sequences, one randomly picked from each bacterial genus in our database (defined as the 360Bac data set). Then, the tetra-cutting restriction endonucleases HhaI (GCG^C), MspI (C^CGG), RsaI (GT^AC), Sau3AI (^GATC) and TaqI (T^CGA) were chosen to conduct the computer-simulated digestion.
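The frequency count described above can be illustrated with a minimal Python sketch. It is not the procedure used in this study (which relied on Vector NTI); the function names and the toy sequence are hypothetical, and overlapping occurrences are counted, which is an assumption.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def palindromic_tetramers():
    """All 4-bp sequences that equal their own reverse complement."""
    tetramers = ("".join(p) for p in product("ACGT", repeat=4))
    return sorted(t for t in tetramers if t == reverse_complement(t))


def count_sites(seq, site):
    """Count occurrences of site in seq, including overlapping ones."""
    seq = seq.upper()
    width = len(site)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == site)


def tetramer_frequencies(seq):
    """Map each palindromic tetranucleotide to its number of occurrences."""
    return {site: count_sites(seq, site) for site in palindromic_tetramers()}


# Toy sequence (hypothetical, for illustration only).
freq = tetramer_frequencies("AACCGGTTGATCCCGG")
```

Because the first two bases of a palindromic tetranucleotide determine the last two, there are 16 such sequences; the recognition sites of the five enzymes named above (GCGC, CCGG, GTAC, GATC and TCGA) are all among them.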
Given that the strains belonging to the family Enterobacteriaceae received the most effort in full genome sequencing, the 89Ent data set was set up and in silico digested with the enzymes HhaI, MspI, RsaI, Sau3AI and TaqI to assess their discriminatory ability with a focus on species and strain levels. The 360Bac data set was further used to extend the assessment to genus or higher level and also to evaluate the scope of application of these enzymes. In addition, computer-simulated digestion analyses of both data sets would help to answer whether phylogenetically distant species can be clearly differentiated through a long PCR-RFLP analysis in the context of the massive number of species involved and how the phylogenetic relationship of the studied species based on 16S rRNA gene sequence correlates to the relatedness simply inferred from the similarity between their RFLP band patterns.
After computer-simulated digestions, RFLP band pattern data were exported to the Bionumerics software package (ver. 4.6; Applied Maths, Kortrijk, Belgium). Cluster analysis of the band pattern data was performed using the UPGMA algorithm with Pearson correlation set as the similarity coefficient. A band position tolerance of 4% was allowed. Finally, an RFLP band pattern database was established for each endonuclease and served as a reference for the analysis of wet experimental data.
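The clustering step can be sketched in plain Python as follows. This does not reproduce the Bionumerics implementation: the conversion of band sizes into a smooth profile on a logarithmic size axis, the use of the 4% tolerance as the peak width, the grid limits and all function names are assumptions made for illustration. Band lists are assumed to be non-empty.

```python
import math


def profile(bands, lo=150, hi=1500, points=200, tolerance=0.04):
    """Turn band sizes (bp) into a curve on a log-size grid.

    Each band is drawn as a Gaussian peak whose width corresponds to the
    band position tolerance (an assumption, not the Bionumerics method).
    """
    sigma = math.log(1 + tolerance)
    step = (math.log(hi) - math.log(lo)) / (points - 1)
    grid = [math.log(lo) + i * step for i in range(points)]
    return [
        sum(math.exp(-0.5 * ((g - math.log(b)) / sigma) ** 2) for b in bands)
        for g in grid
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def upgma(patterns):
    """Average-linkage clustering; returns merges as (cluster, cluster, similarity)."""
    names = list(patterns)
    prof = {n: profile(patterns[n]) for n in names}
    sim = {(a, b): pearson(prof[a], prof[b]) for a in names for b in names}

    def average(c1, c2):
        # UPGMA: unweighted mean over all pairs of members of the two clusters.
        return sum(sim[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        best, i, j = max(
            (average(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical band patterns (bp) for three lanes.
patterns = {"A": [200, 400, 900], "B": [202, 398, 905], "C": [170, 650, 1300]}
merges = upgma(patterns)
```

In this toy example lanes A and B, whose bands differ by about 1%, are joined first, and lane C joins them at a much lower similarity.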
Primers for long PCR and sequencing
The 360Bac data set (described above) was subjected to multiple alignment using the ClustalW algorithm implemented in BioEdit (ver. 7.0; Hall 1999). Alignments were used to assess the primers previously designed for 16S or 23S rRNA gene amplification (Table 3). The primer set 27f and 2241r was employed in this study to amplify the long PCR products of 16S-ITS-23S rRNA genes. Additional sequencing primers 1406f and 1492r (within 16S rRNA gene) and 457r, 820f and 1087r (within 23S rRNA gene) were used to assemble the full-length sequence of the amplified 16S-ITS-23S products.
Bacterial isolation, long PCR of 16S-ITS-23S rRNA genes and restriction analysis
A mesotrophic freshwater lake located in southern China, Huguangyan Maar Lake, was chosen as the study site to isolate bacterial cultures and experimentally test the methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes. This volcanic crater lake has an area of c. 2·3 km2 and a maximum water depth of 18 m, and was formed by small volcanic eruptions that occurred around 140 000 years ago and subsequent water input and accumulation (Wang et al. 2000; Mingram et al. 2004). Its features are as follows: no upheaval of surrounding earth surface, thick sediment (50 m) and a short distance to the nearby South China Sea (<6 km). By means of this molecular typing tool, we developed a nonredundant bacterial culture collection during 2009–2011 as a part of an endeavour to protect microbial resources in this unusual lake. Sampling times and layers are provided in Table S3.
Bacterial cultures were isolated with either heterotrophic R2A agar (Atlas 2004) or lake water agar (by mixing 1 l of 0·2 μm filtered in situ lake water with 15-g agar powder). 1 μl of original lake water was diluted into 100-μl pure water, and the dilution was then plated on agar media. Single colonies were picked and subjected to lysis treatment with 100 μl 0·05 mol l−1 NaOH solution at 95°C for 15 min. After centrifuging at 12 000 rev min−1 for 2 min, 1-μl lysis solution was employed as a PCR template. The long PCR of 16S-ITS-23S rRNA genes was performed as follows: 3-μl PCR buffer, 6-μl dNTP mixture with 2·5 mmol l−1 each, 0·2 μl 100 μmol l−1 primer each, 0·2-μl LA-Taq (TaKaRa, Dalian, China) and 1-μl colony lysis solution in a final volume of 30-μl. PCR products were directly digested with MspI (Fermentas, Burlington, ON, Canada) at 37°C for 1–3 h. The restriction fragments were resolved on a 1·5% agarose gel stained with ethidium bromide. The RFLP band pattern in each lane was analysed and grouped using Bionumerics 4.6 with the same settings, as described in the above computer simulation analyses.
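As a quick check of the reaction mix, the final concentrations follow from the dilution rule C_final = C_stock × V_added / V_total. The short Python sketch below applies it to the volumes listed above; the variable names are hypothetical, and making up the remaining volume with water is an assumption that is not stated in the protocol.

```python
def final_concentration(stock_conc, added_ul, total_ul):
    """Dilution rule: C_final = C_stock * V_added / V_total."""
    return stock_conc * added_ul / total_ul


TOTAL_UL = 30.0

# Component volumes in microlitres, as listed in the text.
components = {
    "PCR buffer": 3.0,
    "dNTP mixture": 6.0,
    "forward primer": 0.2,
    "reverse primer": 0.2,
    "LA-Taq": 0.2,
    "colony lysate": 1.0,
}

# dNTP stock is 2.5 mmol/l each; primer stocks are 100 umol/l each.
dntp_each_mM = final_concentration(2.5, components["dNTP mixture"], TOTAL_UL)
primer_each_uM = final_concentration(100.0, components["forward primer"], TOTAL_UL)

# Assumption: the balance of the 30-ul reaction is water.
water_ul = TOTAL_UL - sum(components.values())
```

With these volumes, each dNTP is at 0·5 mmol l−1 and each primer at about 0·67 μmol l−1 in the final reaction.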
Sequencing and assembly
Long PCR product of 16S-ITS-23S rRNA genes was purified using a commercial purification kit (Tiangen Biotech, Beijing, China) and sequenced on an ABI 3730 sequencer (Applied Biosystems, Foster City, CA, USA) using either 27f or 1492r as the sequencing primer. Sequences were analysed with the Classifier tool provided by the Ribosomal Database Project (Cole et al. 2009) and the SILVA web server (Pruesse et al. 2007) to determine their taxonomy. Sequencing was conducted on selected PCR products with all PCR and sequencing primers (Table 3) to assemble the full-length sequence of the long PCR product. Sequence assembly was performed with the ContigExpress program of the software package Vector NTI advance (ver. 10; Invitrogen). 16S-ITS-23S sequences obtained in this study were deposited into GenBank under the accession numbers JX219383–JX219400.
Copy number and length variation of 16S-ITS-23S rRNA genes in bacterial genome
The 16S rRNA gene, ITS and 23S rRNA gene within the bacterial ribosomal RNA (rrn) operon were treated as a single and complete unit in this study. The 16S-ITS-23S sequence database was set up with 3301 sequences retrieved from 885 bacterial genomes belonging to 360 genera of 29 known bacterial phyla or subphyla (Table 1; for full list see Table S1). Within the rrn operon, 16S and 23S rRNA genes are not always located on the same DNA strand, which could make PCR of the long rrn operon sequence problematic. In some cases, the 16S rRNA gene, ITS and 23S rRNA gene are not tightly linked together. However, we found that in each of the 885 genomes, there is at least one intact rrn operon containing a complete set of 16S-ITS-23S rRNA genes with all components located on the same strand (either plus or minus). The copy number of the 16S-ITS-23S rRNA genes within a single genome ranged from 1 (n =161) to 14 (n =4) with an average of 3·71 (n =885) (Fig. 1; Table 1). The 63 genomes that possessed ≥8 copies were all from highly pathogenic strains belonging to Firmicutes or Gammaproteobacteria. When these pathogenic strains were removed from the calculation, the average copy number decreased to 3·23 (n =822), which we suggest is representative of most bacteria that inhabit natural environments. Among the 724 genomes possessing multiple copies, 451 genomes have both plus- and minus-strand copies. No regular pattern was observed regarding the strand on which different copies within the same genome are located.
Table 1. Statistics of bacterial 16S-ITS-23S sequence database established by retrieving sequences and associated taxonomy information from the NCBI genome database
The length of the 16S-ITS-23S rRNA genes varies greatly across bacterial species, genera and phyla, ranging from 4319 to 6568 bp with an average of 4952 bp (n =3301) (Tables 1 and S2), suggesting that the length of 16S-ITS-23S rRNA genes may serve as a preliminary discriminator between bacteria. Among the 724 genomes containing multiple copies, 250 (34·5%) show no size variation between different copies within the same genome and 438 (60·5%) genomes show <20 bp length differences between copies (Fig. 2, also see Table S1). The maximum length difference occurred in a cyanobacterium, Nostoc sp. PCC 7120, with a maximum of 1698 bp variation observed among the four copies within its genome. The other species that showed over 1000-bp length variation was Corynebacterium aurimucosum ATCC 700975, belonging to Actinobacteria. We observed a weak correlation between the lengths of the ITS and the 23S rRNA gene and the length of the whole 16S-ITS-23S unit (Fig. 3). When 16S-ITS-23S rRNA gene size increases, the size of either the ITS or 23S rRNA gene within it increases accordingly, whereas 16S rRNA gene size displays no significant change (Fig. 3), suggesting that the ITS and 23S rRNA genes contribute the most to the length difference in 16S-ITS-23S rRNA genes. ITS size varies from 14 bp (Lactobacillus salivarius CECT 5713 and UCC118, Firmicutes) to 1977 bp (Coxiella burnetii CbuG Q212, Gammaproteobacteria) with an average of 521 bp (n =3301) (Table S2). The 23S rRNA gene sequence length ranges from 2181 bp (Agrobacterium vitis S4, Alphaproteobacteria) to 4517 bp (Ruminococcus bromii L2-63, Firmicutes), averaging 2909 bp (n =3301).
Computer-simulated RFLP of 16S-ITS-23S rRNA gene sequences
The occurrence frequency of all possible 4-bp-long palindromic sequences within 16S-ITS-23S sequences was analysed on representative data sets (results shown in Table 2). The average frequencies calculated from E. coli alone or from all 502 sequences from Enterobacteriaceae were within a comparable range to those based on 360 bacterial genera (the 360Bac data set, composed of representative sequences from 360 genera as defined in 'Materials and methods'), indicating that the occurrence frequency of each recognition sequence within 16S-ITS-23S tends to be stable across various bacterial species. We set the criterion for choosing appropriate restriction enzymes for long PCR-RFLP analysis as producing no fewer than 10 bands longer than 150 bp after single enzyme digestion, which roughly corresponded to an average cutting frequency of around 12–14. Among the enzymes that meet this criterion, we tested only those most commonly used for PCR-RFLP analysis, that is, HhaI, MspI, RsaI, Sau3AI and TaqI.
Table 2. Occurrence frequency of palindromic tetra nucleotide sequences in the selected data sets
The data set 89Ent contains 502 16S-ITS-23S sequences retrieved from the 89 genomes of the family Enterobacteriaceae.
Enzyme entries recovered from the table body: TaiI, TscI, MaeII; BsuRI, HaeIII, PhoI.
The 89Ent data set (as defined in 'Materials and methods') was digested using the Vector NTI program with one of these enzymes. A dendrogram was inferred for each enzyme based on the similarity of the band position pattern within each lane (for MspI see Fig. 4; for other enzymes, see Fig. S1). After simulated digestion, most sequences produced five or more restriction fragments of <150 bp. These small fragments would be poorly resolved in common agarose gels and thus are not suitable as efficient discriminators. Therefore, we only adopted bands longer than 150 bp in both computer simulation analyses and practical applications. Bands were scattered within the region of 150–1500 bp (see Fig. 4 and Fig. S1), which can be easily recognized both visually and via software. Different patterns were observed among most genomes, indicating a good resolution of RFLP analyses of 16S-ITS-23S sequences at species level.
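The simulated digestion and the band-selection rule can be sketched in a few lines of Python. This is an illustration only, not the Vector NTI procedure used here: the recognition sites and cut positions are those given in 'Materials and methods', whereas the function names, the treatment of the amplicon as a linear molecule and the toy sequence are assumptions.

```python
# Recognition site and cut offset (bases before the cut) for each enzyme.
ENZYMES = {
    "HhaI": ("GCGC", 3),    # GCG^C
    "MspI": ("CCGG", 1),    # C^CGG
    "RsaI": ("GTAC", 2),    # GT^AC
    "Sau3AI": ("GATC", 0),  # ^GATC
    "TaqI": ("TCGA", 1),    # T^CGA
}


def digest(seq, enzyme):
    """Fragment lengths after complete digestion of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    width = len(site)
    cuts = [
        i + offset
        for i in range(len(seq) - width + 1)
        if seq[i:i + width] == site
    ]
    bounds = [0] + cuts + [len(seq)]
    # Skip zero-length pieces (a cut exactly at the end of the molecule).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def scorable_bands(fragments, min_len=150):
    """Keep only fragments longer than min_len bp, largest first."""
    return sorted((f for f in fragments if f > min_len), reverse=True)


def meets_criterion(seq, enzyme, min_bands=10, min_len=150):
    """True if the digest yields at least min_bands bands longer than min_len."""
    return len(scorable_bands(digest(seq, enzyme), min_len)) >= min_bands


# Toy amplicon (hypothetical): two MspI sites in a 608-bp sequence.
toy = "A" * 200 + "CCGG" + "A" * 300 + "CCGG" + "A" * 100
fragments = digest(toy, "MspI")
bands = scorable_bands(fragments)
```

For the toy sequence, the digest yields three fragments, of which the smallest falls below the 150-bp cut-off and is dropped, mirroring the treatment of small fragments described above.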
We further tested MspI, RsaI, HhaI and TaqI on the 360Bac data set (see Fig. S2 for MspI, results for other enzymes not shown). Band clustering results indicated that they were able to easily discriminate all species with band similarity values below 90% and simultaneously show good band size distribution pattern. No significant difference was observed between these four enzymes.
The results from the simulated RFLP analyses suggested that MspI, RsaI, HhaI and TaqI are good candidates for practical use in long PCR-RFLP of 16S-ITS-23S rRNA genes. We found that identical band patterns only occurred occasionally within some strains of the same species, which suggests that this method may be able to guarantee a full discrimination at species level. At strain level, the resolution was constrained because strains of the same species may share identical 16S-ITS-23S rRNA genes, for example, Erwinia pyrifoliae str. Ep1/96 and Erw. pyrifoliae DSM 12163, and E. coli O157 str. EC4115 and the strain TW14359. But, such situations appeared only in a few highly pathogenic bacteria that occupied a tiny part of the sequences (<2%) in our database. When the sequences from pathogenic bacteria were not considered, all four enzymes were able to resolve to the subspecies/strain level (data not shown).
From the dendrogram based on MspI restriction profiles of the 89Ent data set (Fig. 4), we retrieved Pearson correlation values of the nodes where species of different genera started to be clustered into the same group and found that the minimum band similarity value was 89·6%. The minimum value for grouping different phyla was 62·3% (data retrieved from the tree in Fig. S2). These values suggest that an unknown species may be classified into a known genus or phylum based solely on the RFLP pattern of its 16S-ITS-23S sequences if its band pattern is correlated with that of a known species with a Pearson correlation value greater than 89·6 or 62·3%, respectively.
PCR and sequencing primers of 16S-ITS-23S rRNA genes
Multiple alignments of the sequences of the 360Bac data set were conducted and used to evaluate the conservation of the forward and reverse primers previously designed within 16S and 23S rRNA genes (for primer sequences and improved positions see Table 3; primer locations and the sequencing and assembly strategy are shown in Fig. S3). For the PCR of 16S-ITS-23S rRNA genes, primers close to the 5′ and 3′ ends of the rrn operon were preferred so as to produce the longest possible product for RFLP analyses. The forward PCR primer within the 16S rRNA gene was a slightly modified version of 27f. For the reverse primer located within the 23S rRNA gene, we searched all the candidates suggested by Hunt et al. (2006) and found that 2241r was the most conserved. However, there are also several mismatches in the conventional version of 2241r (Lane 1991). Some bases within 27f and 2241r were revised to allow for more degeneracy (Table 3) and thus improve their coverage. In addition, a set of sequencing primers within the 16S-ITS-23S region was introduced, with modifications based on their previous versions (Table 3).
Table 3. 16S-ITS-23S rRNA genes PCR and sequencing primers
Previous primer Reference
The underlined nucleotides are the revised sites in the previous version. The numbers under the nucleotides are the total mismatches to the sequence in the data set 360Bac that contains 360 16S-ITS-23S sequences randomly picked from 360 genera in our database. –, represents no mismatch.
Screening of bacterial culture collection using long PCR-RFLP of 16S-ITS-23S rRNA genes and 16S rRNA gene sequencing
During the years 2009–2011, we employed the long PCR-RFLP technique to screen over 900 bacterial colonies isolated from the Huguangyan Maar Lake in southern China using the restriction enzyme MspI. A single and distinct band between 4·2 and 4·6 kb was observed for all successful PCR products. After digestion by MspI (results partly shown in Fig. 5), the RFLP band pattern data were grouped using software, with a 90% Pearson correlation set as the threshold for the same species. Under this rigorous threshold, we set up a nonredundant bacterial culture collection containing 221 species (Table S3). Taxonomic results based on 16S rRNA gene sequencing showed that Alphaproteobacteria and Gammaproteobacteria dominated the culture collection, accounting for 55·7% (123/221) and 20·8% (46/221), respectively, whereas Betaproteobacteria was a minor group with only 12 (5·4%) species documented, together with Firmicutes (13, 5·9%), Actinobacteria (13, 5·9%), Bacteroidetes (13, 5·9%) and one unclassified proteobacterium.
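The grouping of isolates at a 90% similarity threshold can be illustrated with the following Python sketch. It is a simple greedy scheme in which each isolate is compared with the first member of every existing group; this is an assumption made for illustration and not the Bionumerics procedure used in this study. The isolate names and similarity values are hypothetical.

```python
def dereplicate(isolates, similarity, threshold=0.90):
    """Group isolates whose similarity to a group's first member meets the threshold.

    similarity is any function returning a value between 0 and 1 for two
    isolates, for example a Pearson correlation between band patterns.
    """
    groups = []
    for isolate in isolates:
        for group in groups:
            if similarity(isolate, group[0]) >= threshold:
                group.append(isolate)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([isolate])
    return groups


# Hypothetical pairwise similarities between three isolates.
table = {
    frozenset(("iso1", "iso2")): 0.95,
    frozenset(("iso1", "iso3")): 0.40,
    frozenset(("iso2", "iso3")): 0.42,
}


def lookup(a, b):
    """Return the tabulated similarity of two isolates."""
    return table[frozenset((a, b))]


groups = dereplicate(["iso1", "iso2", "iso3"], lookup)
```

One representative per group would then be retained for the nonredundant collection and for 16S rRNA gene sequencing.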
We documented abundant species diversity within the following genera, Microbacterium (Actinobacteria), Blastobacter, Bosea, Brevundimonas, Caulobacter, Methylobacterium, Novosphingobium, Porphyrobacter, Rhizobium, Sphingobium, Sphingomonas and Xanthobacter (Alphaproteobacteria), Chryseobacterium (Bacteroidetes), Bacillus (Firmicutes), Acinetobacter and Pseudomonas (Gammaproteobacteria), each of which contains five or more strains that show different RFLP band patterns. Some strains have identical 16S rRNA gene sequences, but RFLP band patterns of their 16S-ITS-23S sequences were different (data not shown); for instance, Acinetobacter sp. 12-9, LA5-27, dR1-1, dLA13-02, dLA1-10 and dLA1-012, suggesting the methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes functioned well at the species level.
Sequencing and assembly of long 16S-ITS-23S rRNA sequences
16S-ITS-23S rRNA gene PCR products from the strains whose RFLP patterns occurred at high frequency in our culture collection were sequenced over their full length using the PCR primers and sequencing primers introduced in this study. Ten full-length 16S-ITS-23S sequences (4·0–4·5 kb) were assembled (Table 4) and classified into Alphaproteobacteria (seven sequences), Betaproteobacteria (one sequence), Bacteroidetes (one sequence) and Actinobacteria (one sequence) based on the results of a BLAST analysis, the RDP Classifier tool and the SILVA web server. Six strains containing multiple copies of ITS that differed in base composition were identified by the presence of overlapping peaks in the sequence-read graphs. They were not amenable to the assembly of full-length 16S-ITS-23S sequences (listed in Table 4), suggesting a limitation of the direct sequencing of 16S-ITS-23S rRNA gene PCR products. Based on the BLAST analysis results (Table 4), we found that only one 16S-ITS-23S sequence (Ralstonia sp. 1–6, Betaproteobacteria) had a hit with over 97% identity to its closest match in GenBank, whereas the other nine sequences were all below 90% identity (82–89%) to their closest matches, much lower than the identities (94–100%) shown by the 16S rRNA gene alone within the same 16S-ITS-23S sequence. This discrepancy was mostly due to the sequence divergence within ITS and 23S molecules, which also suggests that introducing ITS and 23S sequences into the PCR template could improve the resolution of the rrn operon sequence-based method for differentiating bacteria.
For the strains Sphingomonas sp. SL9 (Alphaproteobacteria), Ralstonia sp. 1–6 (Betaproteobacteria), Elizabethkingia sp. LA1-18 (Bacteroidetes) and Microbacterium sp. 1–9 (Actinobacteria) (underlined in Table 4), we sequenced 16S-ITS-23S rRNA gene PCR products from three different colonies showing the same RFLP pattern and found that identical gel band profiles always led to identical 16S-ITS-23S rRNA gene sequences. Further, computer-simulated restriction analyses of the full-length 16S-ITS-23S sequences obtained from these strains clearly showed that the RFLP bands observed on real agarose gels were almost identical to those produced in silico and that discrepancies occurred only in bands of low molecular weight (mostly <150 bp), which were weak and smeared in real gels and could not be correctly recognized by the software. These results showed that our strategy for sequencing and assembling the long sequence of 16S-ITS-23S rRNA genes was feasible, provided that the ITS was shorter than 1000 bp, which is the case for most bacteria (97·4%) in our sequence database, and that the 16S-ITS-23S unit was present either as a single copy or as multiple copies with identical sequences.
Low-cost, convenient and rapid DNA fingerprinting techniques are still essential and routine tools for bacterial genotyping, although we have entered an ‘omics’ era. Among these tools, long PCR-RFLP of 16S-ITS-23S rRNA genes has only recently been introduced and successfully applied to some pathogenic bacteria (Garcia-Arata et al. 1997; Abd-El-Haleem et al. 2002; Yavuz et al. 2004). The limited information available thus far showed that the resolution of this method was much greater than that of traditional 16S PCR-RFLP and was comparable to that of rep-PCR and DNA-DNA hybridization, suggesting it would be a promising universal bacterial genotyping tool. However, no further studies have been reported on its practical applications, probably due to the lack of a thorough theoretical evaluation and standardization of this method. Taking advantage of the fast-growing databases of completely sequenced bacterial genomes, we were able, for the first time, to explore the copy number, length variation and sequence divergence of 16S-ITS-23S rRNA genes as a single unit within genomes of almost all known bacterial phyla and to critically assess the theoretical basis of the methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes.
The rRNA genes, like other members of a multigene family, are subjected to homogenization processes such as gene conversion (Liao 2000), which makes the copy number of 16S-ITS-23S rRNA genes nonidentical to that of the 16S or 23S rRNA gene alone within most bacterial genomes. Nonetheless, we found that at least one intact copy, composed of one 16S rRNA gene, ITS and one 23S rRNA gene in the same direction, exists in each bacterial genome in our database (a total of 885 bacterial genomes), suggesting that a whole set of 16S-ITS-23S rRNA genes might be an indispensable component of bacterial genomes and providing a basis for the application of a long PCR-RFLP methodology. Most bacterial genomes (822/885, 92·9%) contain fewer than eight copies of 16S-ITS-23S (as shown in Fig. 1). Moreover, 69·6% (616/885) of the bacterial genomes have one to four copies, similar to the distribution pattern observed for the 16S rRNA gene alone in bacterial genomes (Acinas et al. 2004). On average, Proteobacteria, Actinobacteria, Firmicutes and Bacteroidetes appear to possess more 16S-ITS-23S copies in their genomes than other bacterial groups (see Table 1). This phenomenon is consistent with their observed dominance in various bacterial communities in nature (Keller and Zengler 2004; Newton et al. 2011).
In environmental bacterial diversity surveys based on 16S or 23S rRNA gene amplification and sequencing, the level of divergence and redundancy among rRNA gene copies within the same genome is a concern, as it can result in an overestimated level of diversity (Acinas et al. 2004; Case et al. 2007; Pei et al. 2010). We did not test the validity of 16S-ITS-23S rRNA genes in clone library construction and screening, but they would be expected to face the same problem as 16S or 23S rRNA genes because the approaches are based on the same rationale. However, multiple copies can provide an advantage for long PCR-RFLP analyses of bacterial cultures. All the copies within a given genome are simultaneously amplified using the conserved primers, and the resulting mixture of 16S-ITS-23S molecules is then subjected to restriction digestion. Consequently, multiple copies may actually increase the number of restriction fragments because different copies provide more restriction sites for endonucleases and thus lead to an increased resolution.
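The effect of pooling divergent copies can be illustrated with a short Python sketch. The two toy copies and the function names are hypothetical; the sketch assumes complete MspI digestion of linear amplicons and treats fragments of exactly equal size as a single co-migrating band.

```python
def msp1_fragments(seq):
    """Fragment lengths after cutting a linear sequence at every C^CGG site."""
    cuts = [i + 1 for i in range(len(seq) - 3) if seq[i:i + 4] == "CCGG"]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def pooled_bands(copies, min_len=150):
    """Distinct band sizes (longer than min_len) from digesting all copies together."""
    bands = set()
    for seq in copies:
        bands.update(f for f in msp1_fragments(seq.upper()) if f > min_len)
    return sorted(bands, reverse=True)


# Two hypothetical operon copies of equal length; copy_b has an extra MspI site.
copy_a = "A" * 200 + "CCGG" + "A" * 300
copy_b = "A" * 200 + "CCGG" + "A" * 180 + "CCGG" + "A" * 116

single = pooled_bands([copy_a])
pooled = pooled_bands([copy_a, copy_b])
```

In this toy case the mixed amplicon shows one band more than the single copy, because the second copy contributes a fragment size that the first does not produce.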
Another concern may arise from the choice of restriction enzymes. A computer-simulated RFLP analysis of bacterial 16S rRNA genes suggested that the combined use of HhaI and RsaI could reach maximum resolution (Moyer et al. 1996), which is similar to our recommendation of MspI, RsaI, HhaI or TaqI for long PCR-RFLP of 16S-ITS-23S rRNA genes, except that only one enzyme is needed here. For the 16S rRNA gene sequences, HhaI and RsaI possess relatively few restriction sites, with an average of 4·17 and 4·48 cutting sites within the 16S sequence of a given taxon, respectively, leading to a limited resolution and irreproducible phylogenetic bootstrap values (Moyer et al. 1996). Such conditions are greatly improved in our tests, as the long 16S-ITS-23S sequences in our database provide at least nine restriction sites, with the size of the MspI restriction fragments spanning from 150 to 1500 bp and a stable relationship between taxa being achieved. We did not perform a statistically rigorous test of all candidate enzymes because the low-cost, widely used restriction enzyme MspI achieved a very good resolution in both computer simulation tests and practical application on our strains. The criterion for enzyme choice is not strict, owing to the advantage provided by the long length and high divergence of the 16S-ITS-23S sequences.
The conserved regions within the 16S and 23S rRNA genes may have the potential to offer bacterial group–specific recognition sites for restriction enzymes, whereas the variable regions together with hyper-variable ITS sequence could offer discriminating recognition sites. If this reasoning is true, common bands might appear in the RFLP analyses of a given bacterial group. We searched the simulated RFLP profiles generated with MspI, HhaI, RsaI and TaqI on reference data sets, but no marker band was observed at the phylum or subphylum level. We further collected the sequences of the FISH probes for major bacterial divisions from the probeBase database (Loy et al. 2007). These probe sequences correspond to the most conserved regions within the 16S or 23S rRNA gene sequence, but our attempt at searching for particular probe-specific restriction enzymes (equivalent to bacterial taxon specific) failed. This suggests it is not possible to roughly predict the taxonomic status of an unknown species based solely on the restriction fragment size distribution.
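A search for marker bands of the kind described above could look like the following sketch. It assumes that each simulated profile is available as a list of fragment sizes in bp. The function name `shared_bands`, the relative tolerance `tol` and the example profiles are illustrative assumptions, not the procedure actually used.

```python
def shared_bands(profiles, tol=0.0):
    """Return band sizes from the first profile that are matched, within a
    relative tolerance, by at least one band in every other profile."""
    if not profiles:
        return []
    first, rest = profiles[0], profiles[1:]

    def has_match(size, profile):
        return any(abs(size - other) <= tol * size for other in profile)

    return [size for size in first if all(has_match(size, p) for p in rest)]


# Hypothetical fragment-size profiles (bp) for three members of one group.
group = [
    [1500, 800, 300],
    [1490, 620, 310],
    [1510, 400, 150],
]
print(shared_bands(group, tol=0.05))  # candidate marker bands with 5% tolerance
print(shared_bands(group))            # exact matching, as for simulated digests
```

An empty result for a taxonomic group corresponds to the outcome reported here, where no marker band was found at the phylum or subphylum level.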
Similarly, we failed to predict the taxonomic status of a given band pattern by comparing it with our electronic RFLP pattern database before sequencing its 16S rRNA gene. Practical RFLP patterns in wet experiments did not always show as many bands, or bands as sharp, as those of the computer-simulated restriction analysis. One reason is that the agarose we used is a common type, of moderate resolution and strength, instead of MetaPhor agarose (strength >300 g cm−2), which allows recovery of fragments as short as 20 bp using a common horizontal electrophoresis system. The other, more important, reason is that common problems in real agarose gel runs (such as insufficient or uneven staining, gel shift effect and an excess or insufficient amount of loaded DNA), as well as slight differences in settings between runs, could significantly affect the accuracy of band detection by computer software and the quality of subsequent cluster analyses of the fingerprinting data. However, a continuously growing wet RFLP band pattern database, composed of strains that have been identified by 16S rRNA gene sequencing, has been developed over years of our isolation and screening work. This real band pattern database served successfully as a reference for the classification of later isolated strains, suggesting that the wet/wet data comparison was more feasible and reliable than the wet/computer data comparison.
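A wet/wet comparison against a reference band pattern database can be sketched with the Dice coefficient, a common similarity measure for fingerprint data. This is an assumption for illustration, as the measure used by the band analysis software is not specified here. The function names, the 5% size tolerance and the reference strains are likewise hypothetical.

```python
def dice_similarity(bands_a, bands_b, tol=0.05):
    """Dice coefficient between two band patterns (sizes in bp).

    Bands are paired one-to-one in size order and count as matched when
    they differ by no more than `tol` relative to the larger band.
    """
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = matched = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol * max(a[i], b[j]):
            matched += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    total = len(a) + len(b)
    return 2 * matched / total if total else 0.0


def best_reference(query, references, tol=0.05):
    """Name of the reference pattern most similar to the query pattern."""
    return max(references, key=lambda name: dice_similarity(query, references[name], tol))


# Hypothetical reference patterns from previously sequenced strains.
references = {
    "strain_A": [1480, 810, 305],
    "strain_B": [1200, 600, 300, 150],
}
query = [1500, 800, 300]
print(best_reference(query, references))
```

The tolerance absorbs small run-to-run shifts in estimated band size, which is why patterns from the same gel system compare more reliably with each other than with exact simulated sizes.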
Analyses of the nonredundant bacterial culture collection from the Huguangyan Maar Lake indicated that the genotyping tool of long PCR-RFLP of 16S-ITS-23S rRNA genes was successfully applied to Actinobacteria, Firmicutes, Bacteroidetes, Alphaproteobacteria, Betaproteobacteria and Gammaproteobacteria. These bacterial groups are thought to be the most abundant ones in freshwater lakes (Newton et al. 2011). The results experimentally proved the applicability of the long PCR-RFLP technique to both major and minor bacterial groups in the environment. The sequencing primers we designed and verified for the assembly of full-length 16S-ITS-23S sequences are another contribution from the wet experimental parts of this work. As DNA sequencing becomes less expensive, the large-scale sequencing of whole sets of 16S-ITS-23S molecules becomes more feasible.
Sequencing the whole set of 16S-ITS-23S rRNA genes would expand the current 23S rRNA gene sequence database and, more importantly, allow the taxonomy information inferred from the 16S rRNA gene sequence to be directly mapped to 23S rRNA gene sequences, thus gradually diminishing the obstacles that prevent 23S rRNA from being used in the classification of micro-organisms (Hunt et al. 2006; Yilmaz et al. 2011). Currently (June 2012), high-quality (length over 1200 bp for 16S; over 1900 bp for 23S) 16S rRNA gene sequences are deposited 30 times more often than those of 23S rRNA genes in the SILVA database (release 108) (Pruesse et al. 2007). In the foreseeable future, the number of 23S rRNA gene reference sequences will still grow at a much lower rate than that of 16S rRNA gene sequences, and continuous efforts will be required to solve this problem. Second- and third-generation sequencing technologies have progressed rapidly (Mardis 2011), but their short read lengths prevent them from being applied to the high-throughput sequencing of complete rrn operons retrieved from environmental genomic DNA samples. This remains a technical challenge but would undoubtedly be a promising application in microbial ecology studies. PCR amplification and sequencing of the whole 16S-ITS-23S molecule using traditional methods is still the only way to simultaneously gain 16S and 23S rRNA gene information from the same genome of either a laboratory culture or an environmental species. The long PCR-RFLP methodology verified in this work, combined with sequencing, would contribute in this respect, in addition to bacterial genotyping.
In summary, the genome region of 16S-ITS-23S rRNA genes is a conserved unit in bacterial genomes, and its size, length discrepancy, sequence divergence and the presence of conserved priming sites support the methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes as an efficient tool for typing either distantly or closely related bacterial species. Compared with other molecular typing tools (Table 5), this long PCR-RFLP approach has the combined advantages of low cost, a simple and fast process, high reproducibility and high resolution. It could be applied to virtually all bacterial phyla. RFLP enzyme choice is not strict but could be further optimized to obtain better resolution when targeting one specific bacterial taxon. Generally, the use of a single enzyme, MspI, RsaI, HhaI or TaqI, could be enough for most typing work.
Table 5. Comparison of bacterial molecular typing tools
Molecular typing tool | Major techniques used | Level of cross matching | High-throughput screening
AFLP, amplified fragment length polymorphism; RAPD, random amplified polymorphic DNA; ARDRA, amplified ribosomal DNA restriction analysis; RISA, ribosomal intergenic spacer analysis; PFGE, pulsed-field gel electrophoresis; rep, repetitive extragenic palindromic; MLST, multilocus sequence typing.
This work was supported by the NSFC project 30900045, the Guangdong provincial NSF project 9452408801002444 (B09292) and the open fund MELRS0923 of National Key Lab of Marine Environmental Science (Xiamen University, China) to Y.Z. and the Czech project Algatech (CZ.1.05/2.1.00/03.0110) to M.K. The authors thank Ms Qianru Guo and Ms Xiaojie Chen (Guangdong Ocean University, China) for their assistance in sampling and in laboratory work. We also thank Jason Dean for correction of the language.