V-REVCOMP: automated high-throughput detection of reverse complementary 16S rRNA gene sequences in large environmental and taxonomic datasets

Editor: David Studholme

Correspondence: Martin Hartmann, Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada V6T 1Z3 Tel.: +1 604 822 5646; fax: +1 604 822 6041; e-mail: martinha@mail.ubc.ca

Abstract

Reverse complementary DNA sequences – sequences that are inadvertently given backwards with all purines and pyrimidines transposed – can affect sequence analysis detrimentally unless taken into account. We present an open-source, high-throughput software tool – v-revcomp (http://www.cmde.science.ubc.ca/mohn/software.html) – to detect and reorient reverse complementary entries of the small-subunit rRNA (16S) gene from sequencing datasets, particularly from environmental sources. The software supports sequence lengths ranging from full length down to the short reads that are characteristic of next-generation sequencing technologies. We evaluated the reliability of v-revcomp by screening all 406 781 16S sequences deposited in release 102 of the curated SILVA database and demonstrated that the tool has a detection accuracy of virtually 100%. We subsequently used v-revcomp to analyse 1 171 646 16S sequences deposited in the International Nucleotide Sequence Databases and found that about 1% of these user-submitted sequences were reverse complementary. In addition, a nontrivial proportion of the entries were otherwise anomalous, including reverse complementary chimeras, sequences associated with wrong taxa, nonribosomal genes, sequences of poor quality or otherwise erroneous sequences without a reasonable match to any other entry in the database. Thus, v-revcomp is highly efficient in detecting and reorienting reverse complementary 16S sequences of almost any length and can be used to detect various sequence anomalies.

Introduction

The bacterial and archaeal small-subunit rRNA (SSU rRNA, 16S) gene has emerged as the gold standard genetic marker for determining the diversity and structure of prokaryotic communities in the environment and for the assessment of phylogenetic relationships within the microbial tree of life (reviewed in Tringe & Hugenholtz, 2008; Pace, 2009). Numerous international efforts to characterize microbial communities have led to an unparalleled accumulation of 16S sequences in the International Nucleotide Sequence Databases (INSDs, Sayers et al., 2010) and warranted the establishment of curated 16S reference databases such as SILVA (Pruesse et al., 2007), RDP (Cole et al., 2007) and Greengenes (DeSantis et al., 2006). As of the October 2010 release of SILVA (version 104), close to 3 million 16S sequences were deposited in the INSDs, not counting the enormous number of short reads currently generated by massively parallel sequencing technologies (Margulies et al., 2005) and typically deposited as raw data in the Sequence Read Archive (Leinonen et al., 2011).

The contribution of these data repositories to scientific progress is indisputable. However, as the number of public 16S sequences increases, so does the number of sequences exhibiting poor read quality, chimaerism and incomplete or incorrect taxonomic annotation (Bridge et al., 2003; Hugenholtz & Huber, 2003; Ashelford et al., 2005; Bidartondo et al., 2008; Christen, 2008). Another problem undermining data integrity in the INSDs is the deposition of sequences in the reverse complementary orientation (i.e. backwards and with all purines and pyrimidines transposed). Reverse complementary sequences are generated unintentionally, usually during the sequence assembly step, through human or machine failure to relate the orientation of the sequences under processing to that of the others being generated. Reverse complementary sequences are easy to reorient using publicly available software resources (e.g. Stajich et al., 2002), but detecting them in the first place is not always as straightforward.
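To make the notion of a reverse complementary sequence concrete, the following minimal Python sketch reorients a nucleotide string. It is an illustration only and is not taken from v-revcomp, which is written in Perl; the function name `reverse_complement` and the handling of IUPAC ambiguity codes are our own assumptions.

```python
# Illustrative sketch only; not code from v-revcomp.
# Complement table for the IUPAC nucleotide codes (upper and lower case).
# Characters not listed here (e.g. alignment gaps) are left unchanged.
_COMPLEMENT = str.maketrans(
    "ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    "TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn",
)


def reverse_complement(seq: str) -> str:
    """Return the sequence read backwards with every base complemented."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AACGT"))
```

Applying the function twice to a DNA sequence returns the original string, which is why reorientation itself is trivial once the wrong orientation has been recognized.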

Contamination of datasets with reverse complementary sequences can seriously affect downstream analysis. Currently, only a few tools such as NCBI blast (Altschul et al., 1997) can actually account for the presence of reverse complementary sequences. In contrast, these sequences will introduce analytic noise in analyses such as multiple sequence alignments, phylogenetic classifications and various approaches to sequence-based clustering. These events are usually detectable by manual screening; however, this becomes unfeasible as datasets grow. Automated detection and correction of reverse complementary sequences has therefore become essential in order to screen individually generated datasets as well as to assess and maintain the integrity of public data repositories. To address the problem of reverse complementary bacterial and archaeal 16S sequences in environmental sequence datasets, we developed v-revcomp, a high-throughput, command-line driven, open-source software package.

Materials and methods

Drawing from Nilsson et al. (2011), the software is written in Perl and processes arbitrarily large fasta format (Pearson & Lipman, 1988) datasets. Hidden Markov Models (HMMs) recently designed for every conserved region along the bacterial and archaeal 16S gene (Hartmann et al., 2010) are used to determine the orientation of the sequence. The software attempts to locate up to 18 HMM regions along the query sequence using hmmer version 3 (Eddy, 1998). The query sequence is first screened in its input orientation and subsequently in the reverse complementary orientation. The ratio of HMM detection frequency between the default and the opposite orientation of a query sequence provides a reliable measure of its orientation. A fasta format output file containing all entries of the input file is generated; in this file, all sequences identified as reverse complementary are given in the correct orientation. A comma-separated value file contains the detection statistics and allows the user to examine sequences with ambiguous detection results in more detail. This output lists the HMM detection frequency in the input and reverse complementary orientation, and provides a prediction of the sequence orientation based on the detection ratio, i.e. ‘forward’ (all HMMs in input orientation), ‘reverse’ (all HMMs in the reverse complementary orientation), ‘uncertain-forward’ (majority of HMMs in input orientation), ‘uncertain-reverse’ (majority of HMMs in the reverse complementary orientation), ‘uncertain-eitherway’ (equal HMM counts in both orientations) and ‘notfound’ (no HMMs detected in either orientation). The software flags sequences that have uncertain assignments or in which no HMM regions could be detected in either orientation, suggesting the presence of sequence anomalies.
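The decision rule described above can be sketched in a few lines of Python. This is a hedged reconstruction from the six category labels given in the text, not the Perl source of v-revcomp; the function name `classify_orientation` and its two integer arguments (HMM detection counts in the input and in the reverse complementary orientation) are assumptions made for illustration.

```python
# Illustrative reconstruction of the orientation call from the category
# definitions in the text; the actual v-revcomp implementation may differ.
def classify_orientation(forward_hits: int, reverse_hits: int) -> str:
    """Map HMM detection counts in both orientations to an orientation label."""
    if forward_hits < 0 or reverse_hits < 0:
        raise ValueError("HMM detection counts cannot be negative")
    if forward_hits == 0 and reverse_hits == 0:
        return "notfound"             # no HMM detected in either orientation
    if reverse_hits == 0:
        return "forward"              # all HMMs in the input orientation
    if forward_hits == 0:
        return "reverse"              # all HMMs in the reverse complement
    if forward_hits > reverse_hits:
        return "uncertain-forward"    # majority in the input orientation
    if reverse_hits > forward_hits:
        return "uncertain-reverse"    # majority in the reverse complement
    return "uncertain-eitherway"      # equal counts in both orientations


if __name__ == "__main__":
    for counts in [(16, 0), (0, 10), (15, 1), (1, 9), (3, 3), (0, 0)]:
        print(counts, classify_orientation(*counts))
```

Under this rule, a single stray detection in the opposite orientation is enough to move a sequence into one of the 'uncertain' categories, so those categories serve as a review flag rather than as a statement that the orientation is unknown.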

Results and discussion

We evaluated the reliability of the software by screening all bacterial (387 520) and archaeal (19 261) 16S sequences deposited in the SILVA database release 102 (Fig. 1a); mitochondrial and chloroplast sequences were excluded beforehand. Because the SILVA database stores all entries in a well-curated multiple sequence alignment, all these entries should be present in the 5′–3′ orientation. On a 3 GHz dual-core computer, v-revcomp processed the bacterial and archaeal datasets in 252 and 8 min, respectively. All sequences except one bacterial entry were assigned as being in the 5′–3′ orientation, representing a detection accuracy of virtually 100%. The software flagged 40 (0.01%) sequences that showed the detection of either one HMM (37 cases), two HMMs (two cases) or three HMMs (one case) in the reverse complementary orientation; however, the majority of HMMs (i.e. 9–16) were detected in the input orientation. We studied these 40 uncertain sequences in more detail using blast against NCBI GenBank (Benson et al., 2010) as well as through pairwise sequence alignments against an Escherichia coli reference rRNA operon (GenBank accession J01695, Prestle et al., 1992) where necessary. Fifteen cases (37.5%) were reverse complementary chimeras, i.e. sequences erroneously assembled to contain one segment in the reverse complementary orientation as compared with the remainder of the sequence (see representative example in Supporting Information, Fig. S1a). This reverse-complemented segment led to the detection of one or more HMMs in the opposite orientation compared with the rest of the sequence. In 12 cases (30%), the HMMs detected a segment at either the 5′ or the 3′ end of the reverse complementary sequence that did not match any entry in GenBank; such sequences are very likely to represent chimeric unions or other sequence artefacts (see representative example in Fig. S1b). 
The remaining 13 cases (32.5%) contained no obvious anomaly and might represent occasional false-positive detections by individual HMMs. Importantly, though, the average HMM detection ratio between the original and the reverse complementary sequence in these 13 cases was 16 : 1, which leaves no doubt about the true orientation of the query.

Figure 1.

 Detection statistics of the HMMs for all bacterial and archaeal queries extracted from the SILVA database (a) and the NCBI GenBank (b). HMM detection counts are given for the original (x-axis) as well as the reverse complementary (y-axis) orientation and are plotted as a function of sequence length (z-axis). White circles represent sequences that show no suspicious results, suggesting that these are correct 16S sequences. Coloured symbols represent problematic queries, i.e. sequences that are chimeric (red); taxonomically misclassified (green); feature genetic regions other than 16S (orange); contain a significant proportion of up- or downstream information (blue); or showed no, poor or a partial match to other entries in the database and/or are of poor quality (purple).

Considering that any 500-bp segment of a 16S sequence should have approximately 4–6 HMM detections (Hartmann et al., 2010), some sequences had lower HMM detection counts than would have been expected based on the sequence length. We examined the 104 cases that showed eight or fewer HMM counts in more detail. The 12 most extreme cases, with only 0–4 HMM detections over 1051–1808 bp, were all identified as taxonomic misclassifications and represented eukaryotic 18S rather than bacterial or archaeal 16S sequences. This prevented detection by the domain-specific HMMs, although some HMMs that were designed for highly conserved regions were able to perform detections across taxonomic domains. Among the 92 less extreme cases, with 6 to 9 HMM detections over 900–1504 bp, most sequences (i.e. 75 cases) contained a sequence segment at either the 5′ or the 3′ end that did not match any entry in GenBank, as assessed through blast. We extracted these segments from 15 entries and subjected them to a separate blast analysis. In 11 cases, the segment alone showed no reasonable match to any entry in GenBank, indicating that the segment probably represents erroneous sequence information. In the other four cases, the segment matched entries other than the matches from the full blast search, indicating that the entire sequence is probably chimeric. Eight sequences were chimeric, which might have reduced the number of HMM detections per read length equivalent. Notably, most of these cases (76 out of 92) were flagged as being potentially chimeric in the SILVA database (average SILVA pintail score of 1.7%). In conclusion, the software showed extremely high detection reliability and flagged sequences containing anomalies that can be detected by the algorithm, such as reverse complementary chimeras or non-16S sequence information.

Automated detection of the sequence orientation might be particularly useful for environmental sequence datasets generated by high-throughput sequencing (HTS) techniques. However, the reduced length might affect detection reliability, and speed could be a limiting factor in processing millions of reads in a reasonable time. In order to assess the performance of v-revcomp on HTS data, we extracted 332 835 and 13 876 V1-V2 subregions as well as 332 799 and 13 870 V1-V3 subregions from the bacterial and archaeal SILVA datasets using v-xtractor 2.0 (Hartmann et al., 2010). These two datasets simulate sequence lengths approximately equivalent to lengths generated by the current HTS platforms (V1-V2, 261±18 bp) and lengths that will likely be reached by the next generation of HTS platforms (V1-V3, 481±22 bp). The bacterial V1-V2 and V1-V3 datasets were processed in 18 and 37 min, respectively, whereas both archaeal datasets took around 1 min. All sequences were given in the correct orientation, but five V1-V3 and four V1-V2 sequences were flagged as containing one reverse complementary HMM detection. These were cases already flagged in the full-length dataset. In conclusion, the tool also performed well on the short sequence reads characteristic of HTS datasets. The processing time increases linearly with the number of sequences, and the million reads obtained from a full round of 454 pyrosequencing are processed in around one hour.
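The one-hour estimate follows directly from the linear scaling and the bacterial V1-V2 benchmark reported above (332 835 reads in 18 min). The short Python sketch below reproduces that arithmetic; the function name `estimate_minutes` is ours, and the calculation assumes hardware and read lengths comparable to the benchmark run, so it should be read as a rough extrapolation rather than a performance guarantee.

```python
# Back-of-the-envelope extrapolation, assuming processing time scales
# linearly with read count and that hardware matches the benchmark run.
def estimate_minutes(n_reads: int,
                     ref_reads: int = 332_835,    # bacterial V1-V2 reads (from the text)
                     ref_minutes: float = 18.0    # their reported processing time
                     ) -> float:
    """Extrapolate processing time in minutes for n_reads short reads."""
    return n_reads * ref_minutes / ref_reads


if __name__ == "__main__":
    minutes = estimate_minutes(1_000_000)
    print(f"Estimated time for one million V1-V2 reads: {minutes:.0f} min")
```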

orientationchecker (Ashelford et al., 2006) is a competing software package for reverse complementary 16S sequence detection and orientation. The software operates by matching short oligonucleotide sequences at highly conserved positions along the gene and offers a user-friendly interface combined with an impressive processing speed. In order to compare the detection efficiency and reliability of this tool, we processed the bacterial and archaeal full-length, V1-V3 and V1-V2 datasets. The detection efficiency of orientationchecker decreased with decreasing sequence length, showing detection of 100%, 95% and 2% for the full-length, V1-V3 and V1-V2 datasets, respectively. Although the performance on full-length sequences was somewhat similar to that of v-revcomp, orientationchecker failed to detect the correct orientation of 124 full-length sequences and incorrectly assigned 10 as being reverse complementary. The lack of detection increased by 5% on V1-V3 sequences when compared with v-revcomp and the tool almost completely failed to detect the shorter V1-V2 sequences. In conclusion, v-revcomp demonstrated superior performance, especially on shorter sequences, and features a more reliable mechanism by screening multiple conserved regions at once. Furthermore, HMMs will be more flexible in detecting deviant sequences than a simple pattern matching using oligonucleotide sequences. In addition, the command-line nature of v-revcomp facilitates incorporation into automated software pipelines (e.g. Barker et al., 2010; Caporaso et al., 2010), which makes it especially suited to screen HTS datasets.

In order to assess the status of reverse complementary sequences in public data repositories, we ran v-revcomp on the 1 113 159 bacterial and 58 487 archaeal 16S sequences of a minimum length of 500 bp that were available in GenBank as of 1 July 2010. The 16S status was determined by screening the GenBank definition line for various synonyms for this gene; therefore, 16S sequences including parts of up- or downstream regions of the gene (e.g. promoter region, intergenic spacer) were coextracted. A total of 1 158 546 sequences (i.e. 98.9%) were reported by v-revcomp to be in the correct orientation, 9067 (0.8%) in the reverse orientation, 185 (0.02%) were flagged as uncertain and 3848 (0.3%) did not show any HMM detection at all such that no decision was obtained (Fig. 1b).
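As a consistency check on the figures in this paragraph, the category counts can be tallied against the number of screened sequences. The Python sketch below uses only the counts reported above; the variable names are ours.

```python
# Tally of the GenBank survey counts reported in the text.
screened = 1_113_159 + 58_487      # bacterial + archaeal sequences screened

counts = {
    "forward": 1_158_546,          # reported in the correct orientation
    "reverse": 9_067,              # reported as reverse complementary
    "uncertain": 185,              # HMM detections in both orientations
    "notfound": 3_848,             # no HMM detection at all
}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

if __name__ == "__main__":
    print("categories sum to screened total:", total == screened)
    for label, pct in shares.items():
        print(f"{label:>9}: {pct:.2f}%")
```

The four categories add up to the 1 171 646 screened sequences, and the computed shares agree with the rounded percentages given in the text.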

The following reasons accounted for the failure to detect any HMM in the 3848 sequences. In 3437 cases (89.3%), only a very small segment was actually identified as the 16S, whereas most of the sequence information comprised either the intergenic spacer region downstream of the gene (3421 cases) or regions, such as promoters, upstream of the gene (16 cases). In 220 cases (5.7%), the sequences showed only partial, poor or no match to any entry in GenBank as assessed through blast, and are therefore likely to be artefacts created during PCR amplification, sequencing or data processing. In 26 cases (0.7%), blast assigned the sequences to a different gene than 16S, for example dehydrogenases, transposases, kinases, translocases, ATP-binding components, membrane proteins and hypothetical proteins. In 26 cases (0.7%), the sequences were taxonomically misclassified, representing SSU rRNA genes from other taxonomic domains. In 28 cases (0.7%), the sequences were chimeric, some of which were sequences with serious anomalies (Fig. S1c). Eight sequences (0.2%) were of poor quality (i.e. many ambiguous base calls or long homopolymers) and two queries (0.1%) exclusively contained sequences identified as cloning vectors. The remaining 101 cases (2.6%) did not show any anomaly within the scope of this investigation and likely represented highly divergent sequences.

The following reasons accounted for at least one HMM detection in both orientations, leading to the 185 sequences being flagged as uncertain. In 61 cases (33%), the sequences were reverse complementary chimeras, with the reverse complement segment matching one or more HMMs. In 29 cases (16%), the sequences showed only partial, poor or no match to any entry in GenBank as assessed through blast. The remaining 95 sequences (51%) did not show any anomaly within the scope of this investigation and likely represent rare false detection by individual HMMs. In all these 95 cases, only single HMMs were detected in the opposite orientation, while the remaining HMMs were detected in the other orientation, leaving no doubt about the true orientation of the sequence (i.e. 90 forward and five reverse complementary). In conclusion, the queries showing multiple HMM detections in both orientations were all identified as having some sort of anomaly, whereas all other queries flagged as uncertain represented rare single false-positive detections, which did not impair determination of the true orientation of the sequence.

Among the 1 167 613 sequences with unambiguous orientation assignments, 3117 sequences had unusually low HMM counts of three or fewer. After looking in more detail at all these cases, we identified the following reasons for these observations. In 1882 cases (60%), the sequences contained only partial 16S information and partial up- or downstream regions, i.e. 101 upstream and 1781 downstream cases. A total of 714 sequences (23%) showed only partial, poor or no match to any entry in GenBank, whereas 277 sequences (9%) were of poor quality. In 110 cases (4%), the sequences had been associated with wrong taxa and represented different domains, and three cases (0.1%) were chimeric sequences that contained two concatenated identical segments. The remaining 131 cases (4%) did not show any anomaly within the scope of this investigation and are likely sequences with long hypervariable regions and/or sequences that contain divergent segments that are not detected by some individual HMMs. However, if no anomaly was present, the HMM detection ratio allowed reliable determination of the true orientation in most cases even if not all HMMs could be detected.

We used the tool to screen three published studies with sequences deposited in the first 2 months after our GenBank survey took place. Among the 1076 16S sequences published by Fujita et al. (2010), we found 403 (37%) sequences that were reverse complementary (i.e. average HMM detection ratio of 0 : 6), indicating that reverse complementary sequences can be a very significant problem. Screening the very small dataset of Jurado et al. (2010), one among the 39 sequences was reverse complementary (i.e. HMM ratio 0 : 10), indicating that reverse complementary entries can occur even in very small datasets where manual curation should not be an issue. No reverse complementary sequences or any other anomalies were detected among the 11 173 sequences published by Durso et al. (2010), demonstrating that v-revcomp can identify studies of high data integrity with respect to reverse complementary sequences.

The fraction of reverse complementary 16S sequences in public data repositories is around 1%, which must be seen as low, given the error-prone user-controlled submission mechanism and the lack of support for third-party annotation of INSD entries (Pennisi, 2008). Nevertheless, the over 9000 reverse complementary sequences can have serious implications for downstream analysis if the user is not aware of their status. Furthermore, the number of sequences deposited in these repositories will increase drastically with HTS technologies used in amplicon and metagenome sequencing projects, highlighting the need to detect these events in an automated manner. The clear cases of reverse complementary sequences found in this survey were reported to NCBI for reorientation. NCBI does not need prior agreement with sequence authors in order to correct sequences that were deposited in the incorrect orientation, and such reorientations are brought about quickly.

While the problem of reverse complementary sequences can be avoided with v-revcomp, the number and types of anomalous 16S sequences are of greater concern. It is worrisome that we detected 136 sequences that were taxonomically misclassified at the domain level, and more surprising that 26 cases did not even represent ribosomal genes. Our results stress the importance of critically examining sequences before inclusion in scientific analysis and submission to public databases (Harris, 2003). While v-revcomp is specifically designed to detect reverse complementary sequences, it has certain intrinsic capabilities of detecting some types of sequence anomalies such as reverse complementary chimeras, nontarget genes and erroneous reads. In particular, large-scale metagenome sequencing projects that require automated fragment assembly are prone to errors that could be detected by v-revcomp. As a rule of thumb, we suggest examining sequences that are flagged as having one or more HMM detections in both orientations, a step we feel will be helpful to the detection of anomalous sequences and the enhancement of the quality of the dataset beyond reorienting unwanted reverse complementary entries. As a special case, the failure to detect HMMs in either orientation is a very strong indicator that the entry does not represent a 16S sequence to begin with, at least not one of good quality.
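The rule of thumb above lends itself to a simple filter over per-sequence detection counts. The Python sketch below is a hypothetical illustration: the record layout (sequence identifier, detections in the input orientation, detections in the reverse complementary orientation) and the function names are assumptions, and they do not reflect the actual column layout of the v-revcomp output file.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (sequence id, forward HMM hits, reverse HMM hits).
Record = Tuple[str, int, int]


def needs_inspection(forward_hits: int, reverse_hits: int) -> bool:
    """True if HMMs were found in both orientations, or in neither."""
    found_in_both = forward_hits > 0 and reverse_hits > 0
    found_in_none = forward_hits == 0 and reverse_hits == 0
    return found_in_both or found_in_none


def select_for_inspection(records: Iterable[Record]) -> List[str]:
    """Return the identifiers of sequences that merit a closer manual look."""
    return [seq_id for seq_id, fwd, rev in records if needs_inspection(fwd, rev)]


if __name__ == "__main__":
    example = [("seqA", 16, 0), ("seqB", 12, 1), ("seqC", 0, 0), ("seqD", 0, 10)]
    print(select_for_inspection(example))
```

In this made-up example, the sequences with detections in both orientations and with no detections at all are selected, whereas the unambiguous forward and reverse complementary entries are not.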

Acknowledgements

This study was supported by a grant to the Centre for Microbial Diversity and Evolution from the Tula Foundation and a grant from Genome British Columbia. We also acknowledge support from the Frontiers in Biodiversity Research Centre of Excellence (University of Tartu, Estonia).

Authors' contribution

M.H. and C.G.H. contributed equally to this work.
