Long PCR-RFLP of 16S-ITS-23S rRNA genes: a high-resolution molecular tool for bacterial genotyping

Authors

  • Y.H. Zeng,

    Corresponding author
    1. Institute of Microbiology CAS, Třeboň, Czech Republic
    2. Guangdong Provincial Key Laboratory of Pathogenic Biology and Epidemiology for Aquatic Economic Animals, Guangdong Ocean University, Zhanjiang, China
  • M. Koblížek,

    1. Institute of Microbiology CAS, Třeboň, Czech Republic
  • Y.X. Li,

    1. Institute for Applied & Environmental Microbiology, College of Life Sciences, Inner Mongolia Agriculture University, Huhhot, China
  • Y.P. Liu,

    1. Institute for Applied & Environmental Microbiology, College of Life Sciences, Inner Mongolia Agriculture University, Huhhot, China
  • F.Y. Feng,

    1. Institute for Applied & Environmental Microbiology, College of Life Sciences, Inner Mongolia Agriculture University, Huhhot, China
  • J.D. Ji,

    1. Guangdong Provincial Key Laboratory of Pathogenic Biology and Epidemiology for Aquatic Economic Animals, Guangdong Ocean University, Zhanjiang, China
  • J.C. Jian,

    1. Guangdong Provincial Key Laboratory of Pathogenic Biology and Epidemiology for Aquatic Economic Animals, Guangdong Ocean University, Zhanjiang, China
  • Z.H. Wu

    1. Guangdong Provincial Key Laboratory of Pathogenic Biology and Epidemiology for Aquatic Economic Animals, Guangdong Ocean University, Zhanjiang, China

Correspondence

Yonghui H. Zeng, Guangdong Provincial Key Laboratory of Pathogenic Biology and Epidemiology for Aquatic Economic Animals, Guangdong Ocean University, Zhanjiang 524025, China. E-mail: yonghui.sci@gmail.com

Abstract

Aims

To perform a systematic evaluation of the applicability, validity and reliability of the long PCR-RFLP of 16S-ITS-23S rRNA genes for bacterial genotyping using both sequences retrieved from public genome databases and the experimental data obtained on bacterial cultures.

Methods and Results

A total of 3301 full-length sequences of 16S-ITS-23S rRNA genes were retrieved from 885 published bacterial genomes. Copy numbers of the whole set of 16S-ITS-23S rRNA genes per genome ranged from 1 (n = 161) to 14 (n = 4) with an average of 3·71. Their length varied greatly, from 4319 to 6568 bp with an average of 4952 bp. Computer-simulated RFLP analyses of the 16S-ITS-23S fragments flanked by the conserved primers 27F and 2241R suggested MspI, RsaI, HhaI and TaqI as the most appropriate enzymes for long PCR-RFLP analysis of the 16S-ITS-23S sequence. MspI was used to screen over 900 bacterial cultures isolated from the Huguangyan Maar Lake in southern China. Sequencing of the 16S rRNA genes of isolates possessing unique RFLP band patterns demonstrated the broad applicability and high resolution of this approach.

Conclusions

These results indicate that long PCR-RFLP of 16S-ITS-23S rRNA genes is a potentially universal and reliable bacterial genotyping tool with a high resolution.

Significance and Impact of the Study

The methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes will facilitate the exploration and tracing of cultivable microbial diversity in natural environments.

Introduction

Microbial communities in nature are astonishingly diverse, and only a very tiny fraction of this diversity has been cultured in the laboratory, leaving the majority unknown (Rappe and Giovannoni 2003; Keller and Zengler 2004). Isolating novel bacterial cultures from this vast majority or tracing a given microbial species in the environment often requires the screening of a great number of strains. A number of DNA-based fingerprinting methodologies have been developed to facilitate such typing work, for instance, amplified fragment length polymorphism (Blears et al. 1998), random amplified polymorphic DNA (Cocconcelli et al. 1995), amplified ribosomal DNA restriction analysis (ARDRA) (Redecker et al. 1997), ribosomal intergenic spacer analysis (Bourque et al. 1995), pulsed-field gel electrophoresis (Tanskanen et al. 1990) and repetitive extragenic palindromic-PCR (rep-PCR) (Pooler et al. 1996).

Among these diverse genotyping tools, 16S rRNA gene targeted ARDRA (also known as 16S PCR-RFLP) was one of the earliest techniques used, with the advantages of simplicity, low cost and a broad scope of application. However, 16S PCR-RFLP is presently used less than other molecular genotyping methods because the high sequence identity of the 16S rRNA gene among bacterial species may result in a low resolution of differentiation. Three or more restriction enzymes were necessary for full species discrimination (Moyer et al. 1996), greatly limiting its application in bacterial genotyping. A study on typing some Acinetobacter species indicated that the number of restriction enzymes could be reduced to one, while reaching a resolution similar to that of conventional ribotyping/DNA hybridization identification methods, by introducing a long PCR protocol and extending the PCR template from the 16S rRNA gene alone to the combination of the 16S and 23S rRNA genes and the spacer sequence in between (Garcia-Arata et al. 1997).

Long PCR with proofreading DNA polymerase fusions was first introduced in the early 1990s and successfully generated DNA fragments of over 20 kb from human genomic DNA (Cheng et al. 1994) and lambda bacteriophage DNA with high fidelity (Barnes 1994). Since then, DNA polymerases for long PCR have been continuously improved. A recent study has shown that a significant proportion of the long PCR products exceeding 20 kb in length were error-free (Hogrefe and Borns 2011). These advances have made long PCR a routine and reliable technique widely used in laboratories worldwide. In bacterial genotyping, the pioneering application of long PCR may be the work of Smith-Vaughan et al. (1995), who generated a 5·5-kb-long product of 16S-ITS-23S-5S rRNA genes from the genomic DNA of a human bacterial pathogen. However, only a few applications of long PCR-RFLP of rRNA operon sequences have hitherto been reported, with all data produced from pathogenic strains (Garcia-Arata et al. 1997; Abd-El-Haleem et al. 2002; Yavuz et al. 2004). A systematic evaluation of the applicability, validity and reliability of this methodology is lacking. Given that the three rRNA molecules, namely 16S, 23S and 5S, are linked together into an rRNA operon in virtually all bacterial genomes and that universal primers for amplifying 23S rRNA genes have recently been designed and successfully applied to environmental DNAs (Hunt et al. 2006), we hypothesize that the strategy of long PCR-RFLP of 16S-ITS-23S sequences is theoretically sound and applicable to most bacterial species.

Here, we present an in-depth comparison and computer-simulated restriction analyses of 16S-ITS-23S sequences retrieved from published bacterial genomes, with the aim of testing the above hypothesis and answering two questions: whether universal restriction endonucleases can be set for all bacteria, and whether their discriminatory capability at species level is greatly improved compared with that of traditional 16S PCR-RFLP. Experimental evidence was further provided by employing this technique on a large number of bacterial cultures that were isolated from the Huguangyan Maar Lake in southern China. We propose the methodology of long PCR-RFLP of the 16S-ITS-23S sequence as a universal and reliable bacterial genotyping tool.

Materials and methods

Collection of 16S-ITS-23S sequences from published bacterial genomes

Sequences of the genome fragment containing the 16S rRNA gene, the internal transcribed spacer (ITS) and the 23S rRNA gene, together with the associated taxonomy information, were manually retrieved from the NCBI bacterial genome database (mostly as of May 2008 and supplemented up to November 2011). Sequences from thermophilic bacteria (the genera Thermus, Thermoanaerobacter, Thermosipho and Thermotoga) were discarded due to their unusual sequence and structural features in ribosomal RNA operons (Acinas et al. 2004). To obtain high-quality data, we set size ranges for each individual component of the 16S-ITS-23S rRNA genes to exclude abnormal molecules: 1000–2000 bp for the 16S rRNA gene, 1500–5000 bp for the 23S rRNA gene and 1000–2500 bp for ITS. Sequences containing one or more components that fell outside these size ranges were discarded.

Computer-simulated RFLP analyses

Each sequence in our database was trimmed to the fragment flanked by two conserved regions corresponding to the primer sites of 27F (Lane 1991) at the 5′ end of the 16S rRNA gene and 2241R (Hunt et al. 2006) at the 3′ end of the 23S rRNA gene. The trimmed sequences were digested in silico using the electronic gel function of the Vector NTI Advance package (ver. 10.0; Invitrogen, Carlsbad, CA, USA). In a preliminary simulated digestion with some commonly used hexa-cutting restriction enzymes (PstI, XbaI, HindIII, EcoRI and BamHI), 1–4 bands were generated for most sequences in our database. The number of both differential and common bands between samples was insufficient to support a high discriminatory capability or reliable cluster analyses of RFLP data.

We analysed the frequency of all palindromic tetranucleotide sequences in a selection of trimmed 16S-ITS-23S sequences, including one sequence from Escherichia coli O157:H7_EDL933, 502 sequences from the 89 genomes within the family Enterobacteriaceae (defined as the 89Ent data set), and 360 representative sequences from all bacterial genera in our database, each of which was randomly picked from a given genus (defined as the 360Bac data set). Then, the tetra-cutting restriction endonucleases HhaI (GCG^C), MspI (C^CGG), RsaI (GT^AC), Sau3AI (^GATC) and TaqI (T^CGA) were chosen to conduct the computer-simulated digestion.
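The single-enzyme digestion described above can be sketched in a few lines of Python. This is only an illustrative stand-in for the Vector NTI electronic gel function, not the software used in the study. The function name `digest`, the `longer_than` parameter and the treatment of the amplicon as a linear, completely digested molecule are assumptions made for the example; the recognition sequences and cut positions are those listed in the text.

```python
import re

# Recognition sequence and cut offset (bases before the cut on the top
# strand), following the notation in the text, e.g. MspI = C^CGG.
ENZYMES = {
    "HhaI": ("GCGC", 3),    # GCG^C
    "MspI": ("CCGG", 1),    # C^CGG
    "RsaI": ("GTAC", 2),    # GT^AC
    "Sau3AI": ("GATC", 0),  # ^GATC
    "TaqI": ("TCGA", 1),    # T^CGA
}


def digest(seq, enzyme, longer_than=0):
    """Return fragment lengths (bp), largest first, for a linear sequence.

    Assumes complete digestion; fragments of `longer_than` bp or fewer
    are dropped (hypothetical parameter for mimicking a gel size cut-off).
    """
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    # A lookahead pattern also reports overlapping sites (e.g. GCGCGC).
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    fragments = [end - start for start, end in zip(bounds, bounds[1:])]
    return sorted((f for f in fragments if f > longer_than), reverse=True)


if __name__ == "__main__":
    toy = "AAAACCGGTTTTTTCCGGAA"  # made-up 20-bp sequence with two MspI sites
    print(digest(toy, "MspI"))
```

Calling `digest(sequence, "MspI", longer_than=150)` on a trimmed 16S-ITS-23S sequence would keep only the bands longer than 150 bp, which is the size class the study retains for pattern comparison.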

Given that the strains belonging to the family Enterobacteriaceae received the most effort in full genome sequencing, the 89Ent data set was set up and in silico digested with the enzymes HhaI, MspI, RsaI, Sau3AI and TaqI to assess their discriminatory ability with a focus on species and strain levels. The 360Bac data set was further used to extend the assessment to genus or higher level and also to evaluate the scope of application of these enzymes. In addition, computer-simulated digestion analyses of both data sets would help to answer whether phylogenetically distant species can be clearly differentiated through a long PCR-RFLP analysis in the context of the massive number of species involved and how the phylogenetic relationship of the studied species based on 16S rRNA gene sequence correlates to the relatedness simply inferred from the similarity between their RFLP band patterns.

After computer-simulated digestions, RFLP band pattern data were exported to the Bionumerics software package (ver. 4.6; Applied Maths, Kortrijk, Belgium). Cluster analysis of the band pattern data was performed using the UPGMA algorithm with Pearson correlation set as the similarity coefficient. A band position tolerance of 4% was allowed. Finally, an RFLP band pattern database was established for each endonuclease and served as the reference for analysing the wet-laboratory data.
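As a rough illustration of a Pearson-based band pattern comparison, the sketch below reduces each lane to a presence/absence vector over log-spaced size bins and correlates two such vectors. This is a simplification chosen for the example, not the Bionumerics implementation: the bin layout (one bin per 4% size step, standing in for the band position tolerance), the 150–1500 bp window and the function names `band_profile` and `pearson` are all assumptions.

```python
import math


def band_profile(bands, lo=150, hi=1500, tol=0.04):
    """Map band sizes (bp) to a 0/1 vector over log-spaced bins.

    Each bin spans a factor of (1 + tol), so two bands differing by much
    less than `tol` usually share a bin (bin edges can still split them).
    """
    step = math.log(1 + tol)
    n_bins = math.ceil(math.log(hi / lo) / step) + 1
    profile = [0.0] * n_bins
    for size in bands:
        if lo <= size <= hi:
            profile[int(math.log(size / lo) / step)] = 1.0
    return profile


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    lane_a = band_profile([200, 500, 900])   # made-up band sizes
    lane_b = band_profile([300, 700, 1200])
    print(pearson(lane_a, lane_a), pearson(lane_a, lane_b))
```

A pairwise matrix of such values is the kind of input that a UPGMA clustering step would then consume.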

Primers for long PCR and sequencing

The 360Bac data set (described above) was subjected to multiple alignment using the ClustalW algorithm implemented in BioEdit (ver. 7.0; Hall 1999). Alignments were used to assess the primers previously designed for 16S or 23S rRNA gene amplification (Table 3). The primer set 27f and 2241r was employed in this study to amplify the long PCR products of 16S-ITS-23S rRNA genes. Additional sequencing primers 1406f and 1492r (within 16S rRNA gene) and 457r, 820f and 1087r (within 23S rRNA gene) were used to assemble the full-length sequence of the amplified 16S-ITS-23S products.
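Trimming a genome fragment to the region flanked by 27f and 2241r amounts to a simple in silico PCR. The sketch below is a minimal version under stated assumptions: it accepts exact matches to the degenerate primers only (no mismatches allowed), it searches one strand only (a minus-strand operon would need to be reverse-complemented first), and the helper names are hypothetical. The primer sequences are the revised versions given in Table 3.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

PRIMER_27F = "AGRGTTTGATYHTGGCTCAG"   # forward primer, Table 3
PRIMER_2241R = "ACCRCCCCAGTHAAACT"    # reverse primer, Table 3


def revcomp(seq):
    """Reverse complement, valid for IUPAC degenerate codes as well."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def to_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_amplicon(template, fwd=PRIMER_27F, rev=PRIMER_2241R):
    """Return the fragment from the forward site to the reverse site.

    Returns None if either primer site is not found on the given strand.
    """
    template = template.upper()
    fwd_hit = to_regex(fwd).search(template)
    if fwd_hit is None:
        return None
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is what appears downstream on the searched strand.
    rev_hit = to_regex(revcomp(rev)).search(template, fwd_hit.end())
    if rev_hit is None:
        return None
    return template[fwd_hit.start():rev_hit.end()]
```

The returned fragment includes both primer sites, so its length corresponds to the expected PCR product size.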

Bacterial isolation, long PCR of 16S-ITS-23S rRNA genes and restriction analysis

A mesotrophic freshwater lake located in southern China, Huguangyan Maar Lake, was chosen as the study site to isolate bacterial cultures and experimentally test the methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes. This volcanic crater lake has an area of c. 2·3 km2 and a maximum water depth of 18 m, and was formed by small volcanic eruptions that occurred around 140 000 years ago and subsequent water input and accumulation (Wang et al. 2000; Mingram et al. 2004). Its features are as follows: no upheaval of surrounding earth surface, thick sediment (50 m) and a short distance to the nearby South China Sea (<6 km). By means of this molecular typing tool, we developed a nonredundant bacterial culture collection during 2009–2011 as a part of an endeavour to protect microbial resources in this unusual lake. Sampling times and layers are provided in Table S3.

Bacterial cultures were isolated with either heterotrophic R2A agar (Atlas 2004) or lake water agar (by mixing 1 l of 0·2 μm filtered in situ lake water with 15-g agar powder). 1 μl of original lake water was diluted into 100-μl pure water, and the dilution was then plated on agar media. Single colonies were picked and subjected to lysis treatment with 100 μl 0·05 mol l−1 NaOH solution at 95°C for 15 min. After centrifuging at 12 000 rev min−1 for 2 min, 1-μl lysis solution was employed as a PCR template. The long PCR of 16S-ITS-23S rRNA genes was performed as follows: 3-μl PCR buffer, 6-μl dNTP mixture with 2·5 mmol l−1 each, 0·2 μl 100 μmol l−1 primer each, 0·2-μl LA-Taq (TaKaRa, Dalian, China) and 1-μl colony lysis solution in a final volume of 30-μl. PCR products were directly digested with MspI (Fermentas, Burlington, ON, Canada) at 37°C for 1–3 h. The restriction fragments were resolved on a 1·5% agarose gel stained with ethidium bromide. The RFLP band pattern in each lane was analysed and grouped using Bionumerics 4.6 with the same settings, as described in the above computer simulation analyses.
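The reaction volumes listed above can be checked with a short calculation. The sketch below derives the final dNTP and primer concentrations in the 30-μl reaction; it assumes, as the text does not state it, that the remaining volume is made up with water. The variable and function names are illustrative.

```python
TOTAL_UL = 30.0  # final reaction volume stated in the text

# Component volumes (μl) as listed in the text.
VOLUMES_UL = {
    "PCR buffer": 3.0,
    "dNTP mixture (2.5 mmol/l each)": 6.0,
    "primer 27f (100 μmol/l)": 0.2,
    "primer 2241r (100 μmol/l)": 0.2,
    "LA-Taq": 0.2,
    "colony lysis solution": 1.0,
}


def final_concentration(stock, volume_ul, total_ul=TOTAL_UL):
    """Dilution formula: stock concentration x added volume / total volume."""
    return stock * volume_ul / total_ul


# Assumption: the balance of the reaction is water.
water_ul = TOTAL_UL - sum(VOLUMES_UL.values())
dntp_mmol_l = final_concentration(2.5, 6.0)      # each dNTP, mmol/l
primer_umol_l = final_concentration(100.0, 0.2)  # each primer, μmol/l

if __name__ == "__main__":
    print(f"water: {water_ul:.1f} μl")
    print(f"each dNTP: {dntp_mmol_l:.2f} mmol/l")
    print(f"each primer: {primer_umol_l:.2f} μmol/l")
```

Under that assumption the mix contains about 19·4 μl of water, 0·5 mmol l−1 of each dNTP and roughly 0·67 μmol l−1 of each primer.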

Sequencing and assembly

Long PCR products of 16S-ITS-23S rRNA genes were purified using a commercial purification kit (Tiangen Biotech, Beijing, China) and sequenced on an ABI 3730 sequencer (Applied Biosystems, Foster City, CA, USA) using either 27f or 1492r as the sequencing primer. Sequences were analysed with the Classifier tool provided by the Ribosomal Database Project (Cole et al. 2009) and the SILVA web server (Pruesse et al. 2007) to determine their taxonomy. Sequencing was conducted on selected PCR products with all PCR and sequencing primers (Table 3) to assemble the full-length sequence of the long PCR product. Sequence assembly was performed with the ContigExpress program of the Vector NTI Advance software package (ver. 10; Invitrogen). The 16S-ITS-23S sequences obtained in this study were deposited in GenBank under the accession numbers JX219383–JX219400.

Results

Copy number and length variation of 16S-ITS-23S rRNA genes in bacterial genome

The 16S rRNA gene, ITS and 23S rRNA gene within the bacterial ribosomal RNA (rrn) operon were treated as a single, complete unit in this study. The 16S-ITS-23S sequence database was set up with 3301 sequences retrieved from 885 bacterial genomes belonging to 360 genera of 29 known bacterial phyla or subphyla (Table 1; for full list see Table S1). Within the rrn operon, 16S and 23S rRNA genes are not always located on the same DNA strand, which could make PCR of the long rrn operon sequence problematic. In some cases, the 16S rRNA gene, ITS and 23S rRNA gene are not tightly linked together. However, we found that each of the 885 genomes contains at least one intact rrn operon with a complete set of 16S-ITS-23S rRNA genes and all components located on the same strand (either plus or minus). The copy number of the 16S-ITS-23S rRNA genes within a single genome ranged from 1 (n = 161) to 14 (n = 4) with an average of 3·71 (n = 885) (Fig. 1; Table 1). The 63 genomes that possessed ≥8 copies were all from highly pathogenic strains belonging to Firmicutes or Gammaproteobacteria. When these pathogenic strains were removed from the calculation, the average copy number decreased to 3·23 (n = 822), which we suggest to be representative of most bacteria that inhabit natural environments. Among the 724 genomes possessing multiple copies, 451 have both plus- and minus-strand copies. No regular pattern was observed regarding the strand on which different copies within the same genome are located.

Table 1. Statistics of bacterial 16S-ITS-23S sequence database established by retrieving sequences and associated taxonomy information from the NCBI genome database
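Taxonomic group | Genera | Species | Copy number range | Copy number average | Length average (bp) | n | Length maximum (bp) | Length minimum (bp)
Alphaproteobacteria | 49 | 98 | 1–9 | 2·82 | 5058·1 | 276 | 5725 | 4502
Betaproteobacteria | 36 | 84 | 1–8 | 3·34 | 4985·9 | 282 | 5395 | 4284
Deltaproteobacteria | 20 | 35 | 1–6 | 2·63 | 4995·2 | 92 | 5261 | 4597
Epsilonproteobacteria | 10 | 22 | 1–5 | 2·96 | 5098·1 | 67 | 5557 | 4608
Gammaproteobacteria | 71 | 242 | 1–14 | 4·92 | 4966·6 | 1197 | 6346 | 4402
Unclassified Proteobacteria | 1 | 1 | 3 | 3 | 4468·7 | 3 | 4469 | 4468
Actinobacteria | 49 | 108 | 1–6 | 2·96 | 5000·4 | 320 | 6149 | 4611
Aquificales | 3 | 3 | 1–2 | 1·75 | 4850·1 | 5 | 4857 | 4812
Bacteroidetes | 24 | 33 | 1–6 | 3·21 | 4915·1 | 106 | 5307 | 4319
Chlamydias | 2 | 2 | 1 | 1 | 4804·5 | 2 | 4875 | 4734
Chlorobi | 5 | 8 | 1–3 | 1·75 | 4919 | 14 | 4980 | 4850
Chrysiogenetes | 1 | 1 | 1 | 1 | 4833 | 1 | - | -
Cyanobacteria | 9 | 26 | 1–4 | 1·73 | 4929·5 | 45 | 6301 | 4603
Deferribacteres | 3 | 3 | 2 | 2 | 4953·7 | 6 | 5083 | 4850
Deinococcus-Thermus | 2 | 2 | 1 | 1 | 4663·5 | 2 | 4712 | 4615
Dictyoglomi | 1 | 2 | 1 | 1 | 4868 | 2 | 4874 | 4862
Elusimicrobia | 1 | 1 | 1 | 1 | 5192 | 1 | - | -
Fibrobacteres/Acidobacteria | 2 | 2 | 1 | 1 | 5245·5 | 2 | 5272 | 5219
Firmicutes | 46 | 175 | 1–14 | 4·62 | 4862·5 | 820 | 6568 | 4490
Fusobacteria | 4 | 4 | 1–5 | 2·75 | 4776·9 | 11 | 5049 | 4566
GNS Bacteria | 5 | 8 | 1–3 | 1·5 | 4718·7 | 12 | 5090 | 4578
Gramella | 1 | 1 | 3 | 3 | 5060·3 | 3 | 5061 | 5060
Mycoplasmas | 3 | 12 | 1–2 | 1·08 | 4795·5 | 13 | 5082 | 4471
Nitrospirae | 1 | 1 | 1 | 1 | 5048 | 1 | - | -
Planctomycetes | 1 | 1 | 3 | 3 | 5051 | 3 | 5049 | 5054
Spirochetes | 2 | 2 | 1–2 | 1·5 | 4827 | 3 | 4844 | 4810
Synergistetes | 1 | 1 | 1 | 1 | 4827 | 1 | - | -
Tenericutes | 1 | 1 | 2 | 2 | 4590·5 | 2 | 4591 | 4590
Thermotogae | 2 | 2 | 1 | 1 | 5161 | 2 | 5737 | 4585
Verrucomicrobia | 4 | 4 | 1–3 | 1·75 | 4947·1 | 7 | 5022 | 4894
Σ | 360 | 885 | 1–14 | 3·71 (n = 885); 3·23 (n = 822)^a | 4952 | 3301 | 6568 | 4319

^a After removal of the values ≥8.
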
Figure 1.

Distribution of 16S-ITS-23S copy number in 885 bacterial genomes. Detailed data refer to Table S1.

The length of the 16S-ITS-23S rRNA genes varies greatly across bacterial species, genera and phyla, ranging from 4319 to 6568 bp with an average of 4952 bp (n = 3301) (Tables 1 and S2), suggesting that the length of 16S-ITS-23S rRNA genes may serve as a preliminary discriminator between bacteria. Among the 724 genomes containing multiple copies, 250 (34·5%) show no size variation between different copies within the same genome and 438 (60·5%) show length differences of <20 bp between copies (Fig. 2, also see Table S1). The maximum length difference occurred in a cyanobacterium, Nostoc sp. PCC 7120, with a variation of up to 1698 bp observed among the four copies within its genome. The other species that showed over 1000-bp length variation was Corynebacterium aurimucosum ATCC 700975, belonging to Actinobacteria. We observed a weak correlation between the sequence lengths of the ITS and 23S rRNA gene and those of the 16S-ITS-23S rRNA genes (Fig. 3). When 16S-ITS-23S rRNA gene size increases, the size of either the ITS or the 23S rRNA gene within it increases accordingly, whereas 16S rRNA gene size displays no significant change (Fig. 3), suggesting that the ITS and 23S rRNA genes contribute the most to the length difference in 16S-ITS-23S rRNA genes. ITS size varies from 14 bp (Lactobacillus salivarius CECT 5713 and UCC118, Firmicutes) to 1977 bp (Coxiella burnetii CbuG Q212, Gammaproteobacteria) with an average of 521 bp (n = 3301) (Table S2). The 23S rRNA gene sequence length ranges from 2181 bp (Agrobacterium vitis S4, Alphaproteobacteria) to 4517 bp (Ruminococcus bromii L2-63, Firmicutes), averaging 2909 bp (n = 3301).

Figure 2.

Maximum, minimum and average sequence length of 16S-ITS-23S copies within each of 885 bacterial genomes and 16S-ITS-23S length range among different copies within the same genome. The range equals the maximum minus the minimum. Data from the genomes possessing a single copy of 16S-ITS-23S were used. Plotted series: maximum, minimum, average and range.

Figure 3.

Length distribution of individual parts of 16S rRNA gene, ITS and 23S rRNA gene in each of 3301 sequences of 16S-ITS-23S in our database. Plotted series: 16S-ITS-23S, 23S, 16S and ITS.

Computer-simulated RFLP of 16S-ITS-23S rRNA gene sequences

The occurrence frequency of all possible 4-bp-long palindromic sequences within 16S-ITS-23S sequences was analysed on representative data sets (results shown in Table 2). The average frequencies calculated from E. coli alone or from all 502 sequences from Enterobacteriaceae were within a range comparable to those based on 360 bacterial genera (the 360Bac data set, composed of representative sequences from 360 genera as defined in 'Materials and methods'), indicating that the occurrence frequency of each recognition sequence within 16S-ITS-23S tends to be stable across various bacterial species. As the criterion for choosing appropriate restriction enzymes for long PCR-RFLP analysis, we required no fewer than 10 bands longer than 150 bp after single-enzyme digestion, which roughly corresponded to an average cutting frequency of around 12–14. Among the enzymes that meet this criterion, we tested only those most commonly used for PCR-RFLP analysis, that is, HhaI, MspI, RsaI, Sau3AI and TaqI.

Table 2. Occurrence frequency of palindromic tetra nucleotide sequences in the selected data sets
Recognition sequence | Examples of conventional enzyme | E. coli O157:H7_EDL933 | 89Ent^a (Enterobacteriaceae), average | 89Ent^a, SD | 360Bac (360 bacterial genera), average | 360Bac, SD
AATT | TasI, TspEI | 13 | 12·02 | 1·16 | 12·15 | 3·43
ATAT | n.a. | 6 | 6·83 | 1·17 | 7·86 | 3·89
AGCT | AluI | 23 | 22·25 | 1·87 | 20·96 | 3·01
ACGT | TaiI, TscI, MaeII | 12 | 13·42 | 1·61 | 11·88 | 3·03
TATA | n.a. | 8 | 7·96 | 1·1 | 7·75 | 3·72
TTAA | MseI | 26 | 22·89 | 1·6 | 24·66 | 3·24
TGCA | HpyCH4V | 22 | 20·77 | 1·21 | 23·06 | 3·87
TCGA | TaqI | 13 | 13·01 | 1·78 | 12·14 | 2·86
GATC | Sau3AI, MboI | 10 | 8·09 | 0·91 | 12·66 | 3·62
GTAC | RsaI, AfaI | 16 | 14·34 | 0·99 | 13·87 | 2·65
GGCC | BsuRI, HaeIII, PhoI | 17 | 17·17 | 1·18 | 15·76 | 3·32
GCGC | HhaI, HspAI | 12 | 11·69 | 2·42 | 12·39 | 3·71
CATG | FatI, NlaIII | 14 | 14·47 | 1·48 | 14·69 | 3·04
CTAG | XspI, FspBI | 6 | 8·85 | 3·27 | 11·1 | 3·73
CGCG | BstUI, Bsh1236I | 9 | 9·37 | 1·42 | 11·92 | 3·44
CCGG | MspI, HpaII | 24 | 25·06 | 1·54 | 21·87 | 3·86

^a The data set 89Ent contains 502 16S-ITS-23S sequences retrieved from the 89 genomes of the family Enterobacteriaceae.
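The 16 recognition sequences in Table 2 are exactly the tetranucleotides that equal their own reverse complement, and counting them is straightforward. The sketch below enumerates them and tallies overlapping occurrences in a sequence; the function names are illustrative, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from itertools import product
from statistics import mean, stdev

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


# All 4-mers that read the same on both strands (16 in total).
PALINDROMES = sorted(
    kmer
    for kmer in ("".join(p) for p in product("ACGT", repeat=4))
    if kmer == revcomp(kmer)
)


def count_sites(seq):
    """Count occurrences (overlaps included) of each palindromic 4-mer."""
    seq = seq.upper()
    counts = dict.fromkeys(PALINDROMES, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    return counts


def site_frequency(seqs, site):
    """Mean and sample SD of one site's count across several sequences."""
    counts = [count_sites(s)[site] for s in seqs]
    sd = stdev(counts) if len(counts) > 1 else 0.0
    return mean(counts), sd
```

Applied to a set of trimmed 16S-ITS-23S sequences, `site_frequency(seqs, "CCGG")` would give the kind of average and SD reported per data set in Table 2.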

The 89Ent data set (as defined in 'Materials and methods') was digested using the Vector NTI program with each of these enzymes in turn. A dendrogram was inferred for each enzyme based on the similarity of the band position pattern within each lane (for MspI see Fig. 4; for other enzymes, see Fig. S1). After simulated digestion, most sequences produced five or more restriction fragments of <150 bp. These small fragments would be poorly resolved in common agarose gels and thus are not suitable as efficient discriminators. Therefore, we only adopted bands longer than 150 bp in both computer simulation analyses and practical applications. Bands were scattered within the region of 150–1500 bp (see Fig. 4 and Fig. S1), where they can be easily recognized both visually and via software. Different patterns were observed among most genomes, indicating a good resolution of RFLP analyses of 16S-ITS-23S sequences at species level.

Figure 4.

Dendrogram based on a computer-simulated MspI restriction analysis of the 89Ent data set. Cluster analysis of the band pattern data was performed using the UPGMA algorithm with Pearson correlation set as the similarity coefficient. Four per cent band position tolerance was allowed.

We further tested MspI, RsaI, HhaI and TaqI on the 360Bac data set (see Fig. S2 for MspI, results for other enzymes not shown). Band clustering results indicated that they were able to easily discriminate all species with band similarity values below 90% and simultaneously show good band size distribution pattern. No significant difference was observed between these four enzymes.

The results from the simulated RFLP analyses suggested that MspI, RsaI, HhaI and TaqI are good candidates for practical use in long PCR-RFLP of 16S-ITS-23S rRNA genes. We found that identical band patterns occurred only occasionally, and only among strains of the same species, which suggests that this method may be able to guarantee full discrimination at species level. At strain level, the resolution was constrained because strains of the same species may share identical 16S-ITS-23S rRNA genes, for example, Erwinia pyrifoliae str. Ep1/96 and Erw. pyrifoliae DSM 12163, and E. coli O157 str. EC4115 and the strain TW14359. However, such situations appeared only in a few highly pathogenic bacteria that accounted for a tiny part of the sequences (<2%) in our database. When the sequences from pathogenic bacteria were not considered, all four enzymes were able to resolve to the subspecies/strain level (data not shown).

From the dendrogram based on MspI restriction profiles of the 89Ent data set (Fig. 4), we retrieved Pearson correlation values of the nodes where species of different genera started to be clustered into the same group and found that the minimum band similarity value was 89·6%. The minimum value for grouping different phyla was 62·3% (data retrieved from the tree in Fig. S2). These values suggest that an unknown species may be classified into a known genus or phylum based solely on the RFLP pattern of its 16S-ITS-23S sequences if its band pattern is correlated with that of a known species with a Pearson correlation value greater than 89·6% or 62·3%, respectively.
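The two cut-offs above translate into a simple decision rule. The sketch below is one hedged reading of that rule: given the highest Pearson similarity (in per cent) between an unknown pattern and any reference pattern, it reports the finest rank the two can be assumed to share. The function name and the `None` return for an unassigned pattern are assumptions for illustration.

```python
GENUS_THRESHOLD = 89.6   # % similarity, from the 89Ent MspI dendrogram
PHYLUM_THRESHOLD = 62.3  # % similarity, from the 360Bac MspI dendrogram


def shared_rank(best_similarity_pct):
    """Finest taxonomic rank suggested by the best match, or None.

    `best_similarity_pct` is the highest Pearson similarity (0-100)
    between the unknown band pattern and the reference patterns.
    """
    if best_similarity_pct > GENUS_THRESHOLD:
        return "genus"
    if best_similarity_pct > PHYLUM_THRESHOLD:
        return "phylum"
    return None


if __name__ == "__main__":
    for value in (95.0, 75.0, 40.0):  # made-up similarity values
        print(value, shared_rank(value))
```

Both thresholds are strict, mirroring the "greater than" wording of the text.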

PCR and sequencing primers of 16S-ITS-23S rRNA genes

Multiple alignments of the sequences of the 360Bac data set were conducted and used to evaluate the conservation of the forward and reverse primers previously designed within 16S and 23S rRNA genes (for primer sequences and revised positions, see Table 3; primer locations and the sequencing and assembly strategy are shown in Fig. S3). For the PCR of 16S-ITS-23S rRNA genes, primers close to the 5′ and 3′ ends of the rrn operon were considered so as to produce the longest possible product for RFLP analyses. The forward PCR primer within the 16S rRNA gene was a slightly modified version of 27f. For the reverse primer located within the 23S rRNA gene, we searched all the candidates suggested by Hunt et al. (2006) and found that 2241r was the most conserved. However, there are also several mismatches in the conventional version of 2241r (Lane 1991). Some bases within 27f and 2241r were revised to allow more degeneracy (Table 3) and thus improve their coverage. In addition, a set of sequencing primers within the 16S-ITS-23S region was introduced with modifications based on their previous versions (Table 3).

Table 3. 16S-ITS-23S rRNA genes PCR and sequencing primers
Primer | Sequence (5′–3′) | Location | Previous primer | Reference | Modification | Mismatch counts
27f | AG[R]GTTTGAT[Y][H]TGGCTCAG | 16S | 27f | Lane (1991) | Increased degeneracy | 12222
1406f | [T]TGYAC[W]CACCGCCCGT | 16S | 1406f | Lane (1991) | Increased degeneracy; add one T at the 5′ end | 11
1492r | TA[S]GG[H]TACCTTGTTACGACTT | 16S | 1492r | Lane (1991) | Increased degeneracy | 21521
457r | CC[D]TTCC[Y]TC[R]C[R]GTACT | 23S | 457r | Hunt et al. (2006) | Increased degeneracy | 411141
820f | TAKCT[S]GT[W]CTCY[B]CGAAA | 23S | 803r | Ward et al. (2000) | Increased degeneracy | 43211
1087r | [G][A][Y][Y]RGTGAGCTRTTACGC | 23S | 1091r | Lane (1991) | Add four nucleotides at the 5′ end | 33116
2241r | ACC[R]CCCCAGTHAAACT | 23S | 2241r | Lane (1991) | Increased degeneracy | 333

The bracketed nucleotides (underlined in the original table) are the sites revised from the previous version. The mismatch counts are the total mismatches to the sequences in the data set 360Bac, which contains 360 16S-ITS-23S sequences randomly picked from 360 genera in our database; the digits were printed under individual nucleotides and that alignment is not preserved here. –, represents no mismatch.

Table 4. Sequence analysis of 16S-ITS-23S rRNA genes cloned from some strains isolated from the Huguangyan Maar Lake
Each sequence column gives the closest match, identity (%) and length (bp). In the first column, the strain designation and the appearance frequency of its RFLP pattern are printed without a separator.

Species and appearance frequency of RFLP pattern | 16S | ITS | 23S | 16S-ITS-23S
Alphaproteobacteria
Agrobacterium sp. SL268 | Agrobacterium tumefaciens C58, 94, 1602 | Rhizobium gallicum CCBAU, 79, 259 | Agrobacterium radiobacter K84, 89, 2701 | Agrobacterium radiobacter K84, 82, 4562
Rhodobacter sp. SL2310 | Uncultured bacterium, 97, 1495 | Rhodobacter capsulatus, 80, 577 | Rhodobacter sphaeroides 2.4.1, 87, 2174 | Rhodobacter sphaeroides 2.4.1, 89, 4276
Rhodobacter sp. SL2411 | Uncultured bacterium clone, 98, 1494 | Rhodobacter capsulatus, 86, 587 | Rhodobacter capsulatus, 88, 2168 | Rhodobacter capsulatus, 87, 4249
Rhodospirillum sp. SL388 | Uncultured bacterium clone, 99, 1423 | No significant match, length 504 | Rhodospirillum centenum SW, 92, 2089 | Rhodospirillum centenum SW, 89, 4016
Starkeya sp. SL257 | Uncultured bacterium clone, 99, 1436 | Bradyrhizobium sp. TSA26, 81, 595 | Starkeya novella DSM 506, 88, 2165 | Starkeya novella DSM 506, 86, 4196
Sphingomonas sp. SL99^a | Sphingomonas sp. clone A V 069, 100, 1537 | Sphingomonas suberifaciens, 96, 424 | Sphingomonas wittichii RW1, 90, 2146 | Sphingomonas wittichii RW1, 89, 4107
Sphingomonas sp. SL2111 | Sphingomonas sp. MUELAK1, 99, 1479 | Novosphingobium capsulatum, 95, 392 | Novosphingobium aromaticivorans, 89, 2462 | Sphingobium chlorophenolicum, 86, 4333
Sphingomonas sp. SL215 | Sphingomonas sp. KOPRI 25891, 100, 1326 | Multiple copies^a | Sphingomonas wittichii RW1, 92, 1959 | Not available
Sphingomonas sp. SL87 | Sphingomonas sp. KOPRI 25891, 99, 1339 | Multiple copies | Sphingomonas wittichii RW1, 92, 1922 | Not available
Sphingomonas sp. SL7810 | Uncultured bacterium clone, 99, 1348 | Multiple copies | Sphingomonas wittichii RW1, 92, 1934 | Not available
Betaproteobacteria
Ralstonia sp. 1–68^a | Ralstonia pickettii 12D, 100, 1698 | Ralstonia pickettii, 96, 233 | Ralstonia pickettii 12D, 98, 2212 | Ralstonia pickettii 12D, 97, 4143
Paludibacterium sp. 12–549 | Uncultured beta proteobacterium, 98, 1369 | Multiple copies | Chromobacterium violaceum, 91, 2331 | Not available
Gammaproteobacteria
Pseudomonas sp. 12–1212 | Pseudomonas sp. CL0605, 100, 1382 | Multiple copies | Pseudomonas putida ND6, 92, 2215 | Not available
Bacteroidetes
Elizabethkingia sp. LA 1–1814^a | Elizabethkingia meningoseptica, 99, 1504 | Elizabethkingia meningoseptica, 95, 406 | Elizabethkingia meningoseptica, 92, 2153 | Riemerella anatipestifer, 89, 4063
Actinobacteria
Microbacterium sp. 1–917^a | Microbacterium hominis, 99, 1526 | Microbacterium testaceum, 75, 335 | Microbacterium testaceum, 92, 2416 | Microbacterium testaceum, 88, 4277
Microbacterium sp. 1–259 | Microbacterium sp. UN PA 149, 99, 1351 | Multiple copies | Microbacterium testaceum, 93, 2478 | Not available

^a 16S-ITS-23S rRNA gene PCR products from three different colonies that showed the same RFLP band pattern were sequenced.

Screening of bacterial culture collection using long PCR-RFLP of 16S-ITS-23S rRNA genes and 16S rRNA gene sequencing

During the years 2009–2011, we employed the long PCR-RFLP technique to screen over 900 bacterial colonies isolated from the Huguangyan Maar Lake in southern China using the restriction enzyme MspI. A single and distinct band between 4·2 and 4·6 kb was observed for all successful PCR products. After digestion by MspI (results partly shown in Fig. 5), the RFLP band pattern data were grouped using software, with a 90% Pearson correlation set as the threshold for the same species. Under this rigorous threshold, we set up a nonredundant bacterial culture collection containing 221 species (Table S3). Taxonomic results based on 16S rRNA gene sequencing showed that Alphaproteobacteria and Gammaproteobacteria dominated the culture collection, accounting for 55·7% (123/221) and 20·8% (46/221), respectively, whereas Betaproteobacteria was a minor group with only 12 (5·4%) species documented, together with Firmicutes (13, 5·9%), Actinobacteria (13, 5·9%), Bacteroidetes (13, 5·9%) and one unclassified proteobacterium.
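The grouping step, in which lanes are merged as long as their UPGMA similarity stays at or above 90%, can be sketched as follows. This is an illustrative re-implementation of average-linkage clustering with a similarity cut-off, not the Bionumerics procedure; the function name `upgma_groups` and the input format (a symmetric matrix of pairwise Pearson similarities expressed as fractions) are assumptions.

```python
def upgma_groups(sim, threshold=0.90):
    """Group items by average-linkage (UPGMA) clustering.

    `sim` is a symmetric list of lists of pairwise similarities (0-1).
    Clusters are merged, most similar pair first, until the best average
    between-cluster similarity falls below `threshold`.
    """
    clusters = [[i] for i in range(len(sim))]

    def average(a, b):
        # UPGMA similarity: mean over all cross-cluster pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -2.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Made-up similarities for three lanes: lanes 0 and 1 are near-identical.
    toy = [
        [1.00, 0.95, 0.20],
        [0.95, 1.00, 0.30],
        [0.20, 0.30, 1.00],
    ]
    print(upgma_groups(toy))
```

Each resulting group would correspond to one entry in the nonredundant culture collection, with a single representative kept per group.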

Figure 5.

MspI restriction analyses of 16S-ITS-23S PCR products from a portion of the cultures isolated from the Huguangyan Maar Lake in southern China. M, 100-bp ladder DNA marker.

We documented abundant species diversity within the following genera: Microbacterium (Actinobacteria); Blastobacter, Bosea, Brevundimonas, Caulobacter, Methylobacterium, Novosphingobium, Porphyrobacter, Rhizobium, Sphingobium, Sphingomonas and Xanthobacter (Alphaproteobacteria); Chryseobacterium (Bacteroidetes); Bacillus (Firmicutes); and Acinetobacter and Pseudomonas (Gammaproteobacteria). Each of these genera contains five or more strains that show different RFLP band patterns. Some strains, for instance Acinetobacter sp. 12-9, LA5-27, dR1-1, dLA13-02, dLA1-10 and dLA1-012, have identical 16S rRNA gene sequences but different RFLP band patterns of their 16S-ITS-23S sequences (data not shown), suggesting that the methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes functioned well at the species level.

Sequencing and assembly of long 16S-ITS-23S rRNA sequences

PCR products of the 16S-ITS-23S rRNA genes from strains whose RFLP patterns occurred at high frequency in our culture collection were sequenced over their full length using the PCR primers and sequencing primers introduced in this study. Ten full-length 16S-ITS-23S sequences (4·0–4·5 kb) were assembled (Table 4) and classified into Alphaproteobacteria (seven sequences), Betaproteobacteria (one sequence), Bacteroidetes (one sequence) and Actinobacteria (one sequence) based on the results of a BLAST analysis, the RDP classifier tool and the SILVA web server. Six strains containing multiple copies of ITS that differed in base composition were identified by the presence of overlapping peaks in the sequencing chromatograms. Their full-length 16S-ITS-23S sequences could not be assembled (listed in Table 4), which points to a limitation of the direct sequencing of 16S-ITS-23S rRNA gene PCR products. Based on the BLAST analysis results (Table 4), only one 16S-ITS-23S sequence (Ralstonia sp. 1–6, Betaproteobacteria) had a hit with over 97% identity to its closest match in GenBank, whereas the other nine sequences were below 90% identity (82–89%) to their closest matches, much lower than the identities (94–100%) shown by the 16S rRNA gene alone within the same 16S-ITS-23S sequence. This discrepancy was mostly due to the sequence divergence within the ITS and 23S molecules, which also suggests that introducing ITS and 23S sequences into the PCR template could improve the resolution of the rrn operon sequence-based method for differentiating bacteria.

For the strains Sphingomonas sp. SL9 (Alphaproteobacteria), Ralstonia sp. 1–6 (Betaproteobacteria), Elizabethkingia sp. LA1-18 (Bacteroidetes) and Microbacterium sp. 1–9 (Actinobacteria) (underlined in Table 4), we sequenced 16S-ITS-23S rRNA gene PCR products from three different colonies showing the same RFLP pattern and found that identical gel band profiles always led to identical sequencing results from the 16S-ITS-23S rRNA gene. Further, computer-simulated restriction analyses of the full-length 16S-ITS-23S sequences obtained from these strains showed that the RFLP bands observed on agarose gels were almost identical to those predicted in silico. Discrepancies occurred only in low-molecular-weight bands (mostly <150 bp), which were weak and smeared in real gels and could not be correctly recognized by the software. These results showed that our strategy for sequencing and assembling the long 16S-ITS-23S rRNA gene sequence was feasible, provided that the ITS was shorter than 1000 bp, which is the case for most bacteria (97·4%) in our sequence database, and that the operon was present either as a single copy or as multiple copies with identical sequence composition.
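
The comparison between simulated and observed band patterns can be sketched as follows. The sketch assumes that fragment sizes are already available as lists of lengths in base pairs, drops fragments below 150 bp (the range described above as weak and smeared on gels) and pairs the remaining bands in size order within a relative tolerance. The 5% tolerance, the function names and the example sizes are illustrative assumptions, not values from this study.

```python
def visible(fragments, min_bp=150):
    """Keep only fragments long enough to be scored on a gel, sorted by size."""
    return sorted(f for f in fragments if f >= min_bp)

def patterns_match(simulated, observed, tol=0.05, min_bp=150):
    """True if both patterns have the same number of scorable bands and each
    size-ordered pair agrees within a relative tolerance of the simulated size."""
    sim = visible(simulated, min_bp)
    obs = visible(observed, min_bp)
    if len(sim) != len(obs):
        return False
    return all(abs(s - o) <= tol * s for s, o in zip(sim, obs))

# Hypothetical fragment sizes (bp): an in silico digest and a gel reading.
simulated = [1480, 920, 610, 430, 260, 120, 85, 40]
observed = [1500, 900, 600, 440, 255]
print(patterns_match(simulated, observed))
```

Pairing bands in size order is a simplification. It treats co-migrating fragments as separate bands, so a real comparison would need to merge fragments that cannot be resolved on the gel.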

Discussion

Low-cost, convenient and rapid DNA fingerprinting techniques remain essential, routine tools for bacterial genotyping, even though we have entered an ‘omics’ era. Among these tools, long PCR-RFLP of 16S-ITS-23S rRNA genes has only recently been introduced and successfully applied to some pathogenic bacteria (Garcia-Arata et al. 1997; Abd-El-Haleem et al. 2002; Yavuz et al. 2004). The limited information available thus far showed that the resolution of this method was much greater than that of traditional 16S PCR-RFLP and was comparable to that of REP-PCR and DNA-DNA hybridization, suggesting it would be a promising universal bacterial genotyping tool. However, no further studies have been reported on its practical applications, probably because the method lacks a thorough theoretical evaluation and standardization. Taking advantage of the fast-growing databases of completely sequenced bacterial genomes, we were able, for the first time, to explore the copy number, length variation and sequence divergence of 16S-ITS-23S rRNA genes as a single unit within genomes of almost all known bacterial phyla and to critically assess the theoretical basis of the methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes.

The rRNA genes, like other members of a multigene family, are subjected to homogenization processes such as gene conversion (Liao 2000), which makes the copy number of 16S-ITS-23S rRNA genes differ from that of the 16S or 23S rRNA gene alone within most bacterial genomes. Nonetheless, we found that at least one intact copy, composed of one 16S rRNA gene, ITS and one 23S rRNA gene in the same direction, exists in each bacterial genome in our database (a total of 885 bacterial genomes), suggesting that a whole set of 16S-ITS-23S rRNA genes might be an indispensable component of bacterial genomes and providing a basis for application of a long PCR-RFLP methodology. Most bacterial genomes (822/885, 92·9%) contain fewer than eight copies of 16S-ITS-23S (as shown in Fig. 1). Moreover, 69·6% (616/885) of the bacterial genomes have one to four copies, similar to the distribution pattern observed for the 16S rRNA gene alone in bacterial genomes (Acinas et al. 2004). On average, Proteobacteria, Actinobacteria, Firmicutes and Bacteroidetes appear to possess more 16S-ITS-23S copies in their genomes than other bacterial groups (see Table 1). This phenomenon is consistent with observations of their dominance in various bacterial communities in nature (Keller and Zengler 2004; Newton et al. 2011).

In environmental bacterial diversity surveys based on 16S or 23S rRNA gene amplification and sequencing, the level of divergence and redundancy among rRNA gene copies within the same genome is a concern, as it can result in an overestimated level of diversity (Acinas et al. 2004; Case et al. 2007; Pei et al. 2010). We did not test the validity of 16S-ITS-23S rRNA genes in clone library construction and screening, but this application would be expected to face the same problem as 16S or 23S rRNA gene-based surveys because it rests on the same rationale. However, multiple copies can provide an advantage for long PCR-RFLP analyses of bacterial cultures. All the copies within a given genome are simultaneously amplified using the conserved primers, and the resulting mixture of 16S-ITS-23S molecules is then subjected to restriction digestion. Consequently, multiple copies may actually increase the number of restriction fragments, because divergent copies provide additional restriction sites for endonucleases, and thus lead to an increased resolution.
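
The effect of divergent copies on the pooled band pattern can be shown with a small in silico digest. The sketch below assumes a linear amplicon and uses the MspI recognition sequence CCGG, cut after the first C. The two toy "copies" are hypothetical sequences that differ by a single lost site, and the function names are illustrative only.

```python
def digest(seq, site="CCGG", cut_offset=1):
    """Fragment lengths after cutting a linear sequence at every recognition site.
    cut_offset is the cut position within the site (1 for MspI, C^CGG)."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def pooled_pattern(copies, **kwargs):
    """Distinct fragment sizes expected when all copies are digested together."""
    sizes = set()
    for copy in copies:
        sizes.update(digest(copy, **kwargs))
    return sorted(sizes, reverse=True)

# Two hypothetical operon copies; copy_2 has lost the second MspI site.
copy_1 = "A" * 10 + "CCGG" + "T" * 20 + "CCGG" + "A" * 6
copy_2 = "A" * 10 + "CCGG" + "T" * 20 + "CTGG" + "A" * 6

print(pooled_pattern([copy_1]))          # bands from a single copy
print(pooled_pattern([copy_1, copy_2]))  # bands from the mixture
```

The mixture yields one more distinct fragment size than copy_1 alone, which is the mechanism by which divergent copies can add bands to a fingerprint.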

Another concern may arise from the choice of restriction enzymes. A computer-simulated RFLP analysis of bacterial 16S rRNA genes suggested that the combined use of HhaI and RsaI could reach maximum resolution (Moyer et al. 1996). This is similar to our recommendation of MspI, RsaI, HhaI or TaqI for long PCR-RFLP of 16S-ITS-23S rRNA genes, except that we use only one enzyme. For the 16S rRNA gene sequences, HhaI and RsaI possess relatively few restriction sites, with an average of 4·17 and 4·48 cutting sites within the 16S sequence of a given taxon, respectively, leading to a limited resolution and irreproducible phylogenetic bootstrap values (Moyer et al. 1996). Such conditions are greatly improved in our tests, as the long 16S-ITS-23S sequences in our database provide at least nine restriction sites, with the size of the MspI restriction fragments spanning from 150 to 1500 bp and a stable relationship between taxa being achieved. We did not perform a statistically rigorous test of all candidate enzymes because the low-cost and widely used restriction enzyme MspI achieved a very good resolution in both computer simulation tests and practical application on our strains. The criterion of enzyme choice is not strict owing to the advantage provided by the long length and high divergence of the 16S-ITS-23S sequences.
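
Counting recognition sites per enzyme, as discussed above, is simple to sketch. The example uses the standard recognition sequences of the four enzymes (MspI CCGG, HhaI GCGC, RsaI GTAC, TaqI TCGA). Because all four are palindromic, scanning one strand is sufficient. The function names and the toy sequence are hypothetical; a real analysis would read full-length 16S-ITS-23S sequences from a file.

```python
import re

# Recognition sequences of the four enzymes named in the text.
ENZYMES = {"MspI": "CCGG", "HhaI": "GCGC", "RsaI": "GTAC", "TaqI": "TCGA"}

def count_sites(seq, site):
    """Number of occurrences of a recognition site (lookahead counts overlaps)."""
    return len(re.findall(f"(?={site})", seq.upper()))

def site_table(seq):
    """Site count per enzyme; a linear amplicon yields count + 1 fragments."""
    return {name: count_sites(seq, site) for name, site in ENZYMES.items()}

# Hypothetical 30-bp toy sequence, not a real rRNA operon.
toy = "ttccggaagcgcttgtacaatcgaccggtt"
counts = site_table(toy)
print(counts)
```

Applied to a set of reference sequences, the same table would allow enzymes to be ranked by their mean number of sites per sequence.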

The conserved regions within the 16S and 23S rRNA genes may have the potential to offer bacterial group-specific recognition sites for restriction enzymes, whereas the variable regions together with the hypervariable ITS sequence could offer discriminating recognition sites. If this reasoning is true, common bands might appear in the RFLP analyses of a given bacterial group. We searched the simulated RFLP profiles generated with MspI, HhaI, RsaI and TaqI on reference data sets, but no marker band was observed at the phylum or subphylum level. We further collected the sequences of the FISH probes for major bacterial divisions from the probeBase database (Loy et al. 2007). These probe sequences correspond to the most conserved regions within the 16S or 23S rRNA gene sequence, but our search for restriction enzymes specific to particular probe sites (and hence to particular bacterial taxa) failed. This suggests that it is not possible to predict, even roughly, the taxonomic status of an unknown species based solely on the restriction fragment size distribution.

Similarly, we failed to predict the taxonomic status of a given band pattern by comparing it with our electronic RFLP pattern database before sequencing its 16S rRNA gene. RFLP patterns obtained in wet experiments did not always show as many or as sharp bands as those of the computer-simulated restriction analysis. One reason is that the agarose we used is a common type, of moderate resolution and strength, rather than MetaPhor agarose (strength >300 g cm−2), which allows recovery of fragments as short as 20 bp using a common horizontal electrophoresis system. The other, more important reason is that common problems in real agarose gel runs – such as insufficient or uneven staining, gel shift effect and excess or insufficient amount of the DNA loaded – as well as slight differences in settings between different runs could significantly affect the accuracy of band detection by computer software and the quality of subsequent cluster analyses of the fingerprinting data. However, over years of isolating and screening work we have built a continuously growing database of wet RFLP band patterns, composed of strains that have been identified with 16S rRNA gene sequencing. This real band pattern database served successfully as a reference for the classification of later isolated strains, suggesting that the wet/wet data comparison was more feasible and reliable than the wet/computer data comparison.

Analyses of the nonredundant bacterial culture collection from the Huguangyan Maar Lake indicated that the genotyping tool of long PCR-RFLP of 16S-ITS-23S rRNA genes was successfully applied to Actinobacteria, Firmicutes, Bacteroidetes, Alphaproteobacteria, Betaproteobacteria and Gammaproteobacteria. These bacterial groups are thought to be the most abundant ones in freshwater lakes (Newton et al. 2011). The results experimentally proved the applicability of the long PCR-RFLP technique on both major and minor bacterial groups in the environment. The sequencing primers we designed and verified for the assembly of full-length 16S-ITS-23S sequences are another contribution from the wet experimental parts of this work. As DNA sequencing becomes less expensive, the large-scale sequencing of whole sets of 16S-ITS-23S molecules becomes more feasible.

Sequencing the whole set of 16S-ITS-23S rRNA genes would expand the current 23S rRNA gene sequence database and, more importantly, allow the taxonomy information inferred from the 16S rRNA gene sequence to be directly mapped to 23S rRNA gene sequences, thus gradually diminishing the obstacles that prevent 23S rRNA from being used in a classification of micro-organisms (Hunt et al. 2006; Yilmaz et al. 2011). Currently (June 2012), high-quality (length over 1200 bp for 16S; over 1900 bp for 23S) 16S rRNA gene sequences are deposited 30 times more often than those of 23S rRNA genes in the SILVA database (release 108) (Pruesse et al. 2007). In the foreseeable future, the number of 23S rRNA gene reference sequences will still grow at a much lower rate than that of 16S rRNA gene sequences, requiring continuous efforts to solve this problem. Second- and third-generation sequencing technologies have progressed rapidly (Mardis 2011), but their short read lengths prevent them from being applied to the high-throughput sequencing of complete rrn operons retrieved from environmental genomic DNA samples. This remains a technical challenge but would undoubtedly be a promising application in microbial ecology studies. PCR amplification and sequencing of the whole 16S-ITS-23S molecule using traditional methods is still the only way to simultaneously gain 16S and 23S rRNA gene information from the same genome of either a laboratory culture or an environmental species. The long PCR-RFLP methodology verified in this work, combined with sequencing, would contribute in this respect, in addition to bacterial genotyping.

In summary, the genome region of 16S-ITS-23S rRNA genes is a conserved unit in bacterial genomes, and its size, length discrepancy, sequence divergence and the presence of conserved priming sites support the methodology of long PCR-RFLP of 16S-ITS-23S rRNA genes as an efficient tool for typing either distantly or closely related bacterial species. Compared with other molecular typing tools (Table 5), this long PCR-RFLP approach has the combined advantages of low cost, a simple and fast process, high reproducibility and a high resolution. It could be applied to virtually all bacterial phyla. RFLP enzyme choice is not strict but could be further optimized to obtain better resolution when targeting one specific bacterial taxon. Generally, the use of a single enzyme (MspI, RsaI, HhaI or TaqI) could be enough for most typing work.

Table 5. Comparison of bacterial molecular typing tools

Molecular typing tool | Major techniques used | Time | Cost | Applicability | Reproducibility | Resolution level | Level of cross matching | High-throughput screening | Reference
AFLP | PCR | Hours | Low | Most bacteria | Low | Low and not stable | High | Allowed, not recommended | Blears et al. (1998)
RAPD | PCR | Hours | Low | Most bacteria | Low | Low and not stable | High | Allowed, not recommended | Cocconcelli et al. (1995)
ARDRA | PCR; restriction enzyme digestion | Hours | Low | Most bacteria | High | Genus, sometimes species | High | Allowed | Redecker et al. (1997)
RISA | PCR sequencing | Days | Moderate | Most bacteria | High | Species, sometimes strains | Low | Allowed | Bourque et al. (1995)
PFGE | Prepare intact genome DNA; restriction enzyme digestion | Days | Moderate | Most bacteria | High | Species, sometimes strains | Low | Allowed | Tanskanen et al. (1990)
Ribotyping | Restriction enzyme digestion; hybridization | Days | Moderate | Most bacteria | High | Species, sometimes strains | Low | Allowed, not recommended | Bouchet et al. (2008)
rep-PCR | PCR | Hours | Low | Most bacteria | Moderate | Species, sometimes strains | Low | Allowed | Pooler et al. (1996)
MLST | PCR sequencing | Days | High | Most bacterial pathogens | High | Strains | Very low | Allowed | Maiden (2006)
Long PCR-RFLP of 16S-ITS-23S rRNA genes | PCR; restriction enzyme digestion | Hours | Low | Most bacteria | High | Species, sometimes strains | Low | Allowed | This study

AFLP, amplified fragment length polymorphism; RAPD, random amplified polymorphic DNA; ARDRA, amplified ribosomal DNA restriction analysis; RISA, ribosomal intergenic spacer analysis; PFGE, pulsed-field gel electrophoresis; rep, repetitive extragenic palindromic; MLST, multilocus sequence typing.

Acknowledgements

This work was supported by the NSFC project 30900045, the Guangdong provincial NSF project 9452408801002444 (B09292) and the open fund MELRS0923 of National Key Lab of Marine Environmental Science (Xiamen University, China) to Y.Z. and the Czech project Algatech (CZ.1.05/2.1.00/03.0110) to M.K. The authors thank Ms Qianru Guo and Ms Xiaojie Chen (Guangdong Ocean University, China) for their assistance in sampling and in laboratory work. We also thank Jason Dean for correction of the language.
