Hybridization to microarrays has been the standard for genome-wide transcriptome analyses of prokaryotes in the past 10 years. Microarrays have several limitations, however, including a small dynamic range for detection of transcript levels due to problems with saturation, background noise, spot density and spot quality. Moreover, comparing different experiments requires complex normalization methods (Hinton et al., 2004), and comparing different strains requires designing pangenome arrays based on multiple sequenced genomes, leading to further problems with non-specific or cross-hybridization and complicated data analysis (Bayjanov et al., 2009). Most microarrays have a biased genome coverage, as they only contain a limited number of short probes for known or expected genes in sequenced genomes, and they rarely probe intergenic regions. Technological advances in array production and dropping costs have recently led to the design and use of high-density tiling arrays based on overlapping short oligonucleotides covering both strands of entire genomes (Selinger et al., 2000; Mcgrath et al., 2007; Rasmussen et al., 2009; Toledo-Arana et al., 2009). Tiling array and other studies have provided a first insight into far more complex transcriptomes than previously envisioned, including an ever-expanding range of regulatory RNAs (Waters and Storz, 2009). To overcome the remaining limitations of microarrays, a totally new approach to whole-transcriptome analysis was needed, and a much-awaited breakthrough in DNA sequencing came to the rescue. Here, we describe the first whole-transcriptome applications in prokaryotes and show that a new treasure chest of regulation in prokaryotes is being opened.
With the dawn of next generation (or deep) sequencing technologies in recent years (Ansorge, 2009; Metzker, 2010), their application to high-depth sequencing of whole transcriptomes, a technique now referred to as RNA-seq, has been explored (Morozova et al., 2009; Wang et al., 2009; Wilhelm and Landry, 2009). RNA-seq requires a conversion of mRNA into cDNA by reverse transcription, followed by deep sequencing of this cDNA (Fig. 1A). RNA-seq was initially only used for analysing eukaryotic mRNA, as prokaryote mRNA is less stable and lacks the poly(A) tail that is used for enrichment and reverse transcription priming in eukaryotes. But these technological difficulties are being overcome, as various methods for enrichment of prokaryote mRNA and appropriate cDNA library construction protocols have been developed, some generating strand-specific libraries which provide valuable information about the orientation of transcripts.
In June 2008, the first reports appeared of RNA sequencing of whole microbial transcriptomes, i.e. the yeasts Saccharomyces cerevisiae (Nagalakshmi et al., 2008) and Schizosaccharomyces pombe (Wilhelm et al., 2008). Both studies demonstrated that most of the non-repetitive sequence of the yeast genome is transcribed, and provided detailed information on novel genes, introns and their boundaries, 3′ and 5′ boundary mapping, 3′ end heterogeneity, overlapping genes, antisense RNA and more. Starting in 2009, several examples have been reported of prokaryote whole-transcriptome analysis using tiling arrays and/or RNA-seq, and these are summarized in Table 1. The first reviews of prokaryote transcriptome sequencing have just appeared (Croucher et al., 2009; van Vliet and Wren, 2009; Sorek and Cossart, 2010; van Vliet, 2010).
|Organism||Technique||Corrected genes||New genes||ncRNA||Antisense RNA||Reference|
|Mycoplasma pneumoniae||TA, RNA-seq, spotted arrays||5||4||108||89||Guell et al. (2009)|
|Salmonella enterica sv Typhi||ssRNA-seq||40||Perkins et al. (2009)|
|Chlamydia trachomatis L2b||RNA-seq||5||41||25||Albrecht et al. (2009)|
|Listeria monocytogenes EGD-e||TA||5||45||7||Toledo-Arana et al. (2009)|
|Listeria monocytogenes 10403S||RNA-seq||67||Oliver et al. (2009)|
|Burkholderia cenocepacia||RNA-seq||13||Yoder-Himes et al. (2009)|
|Bacillus anthracis Sterne 34eF2||RNA-seq||11||57||Passalacqua et al. (2009)|
|Bacillus subtilis 168||TA||119||84||127||Rasmussen et al. (2009)|
|Vibrio cholerae||RNA-seqa||520||127||Liu et al. (2009)|
|Sulfolobus solfataricus P2||RNA-seq||162||80||310||185||Wurtzel et al. (2010)|
|Halobacterium salinarum||TA||61||10||61||Koide et al. (2009)|
|Schizosaccharomyces pombe||TA, RNA-seq||75||26||427||37||Wilhelm et al. (2008)|
|Saccharomyces cerevisiae||RNA-seq||64||487||Nagalakshmi et al. (2008)|
Novel general features discovered
Numerous new insights into genomic elements, gene expression and complexity of regulation are emerging from these new high-throughput and high-resolution studies of microbial transcriptomes (Fig. 1B).
Gene structure/length, novel genes
Gene annotation has always been fraught with difficulties and is not a trivial exercise. Most gene-finding algorithms miss or misannotate small protein-encoding genes and non-coding RNAs (together called sRNAs), but tiling arrays and RNA-seq can readily identify these genes (Figs 2 and 3). The high resolution of these techniques allows transcription start sites (TSS) to be mapped with single-base-pair resolution. Moreover, gene structure can be corrected (Table 1), as many gene starts are found to lie downstream of the automatically predicted start of the largest possible ORF, e.g. in Sulfolobus solfataricus (Wurtzel et al., 2010).
Whole-transcriptome mapping can identify contiguous expression extending into flanking regions of a protein-encoding gene, indicative of 5′ or 3′ untranslated regions (UTRs). Long 5′ UTRs are often indicative of upstream regulatory elements, such as riboswitches (Toledo-Arana et al., 2009). Archaea have much shorter or no 5′ UTRs compared with bacteria (Koide et al., 2009; Wurtzel et al., 2010), suggesting alternative modes of regulation. Long 3′ UTRs could affect expression of downstream genes or genes on the opposite strand, as found in archaea (Brenneis and Soppa, 2009).
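The relation between a mapped TSS and an annotated gene start can be illustrated with a minimal sketch. It assumes 1-based coordinates and a hypothetical function name and threshold (`classify_gene_start`, `leaderless_max`); neither is taken from the cited studies, and the cut-off of 5 nt for calling a transcript leaderless is an arbitrary illustrative choice.

```python
def classify_gene_start(tss, annotated_start, strand, leaderless_max=5):
    """Compare a mapped TSS with the annotated start codon position.

    Coordinates are 1-based. For a '+' gene the start codon lies at or
    after the TSS; for a '-' gene it lies at or before it.
    Returns a (label, offset) tuple, where offset is the distance in nt
    from the TSS to the annotated start along the direction of
    transcription (the apparent 5' UTR length when non-negative).
    """
    if strand == "+":
        offset = annotated_start - tss
    elif strand == "-":
        offset = tss - annotated_start
    else:
        raise ValueError("strand must be '+' or '-'")

    if offset < 0:
        # Transcription starts inside the predicted ORF, so the real
        # start codon is probably downstream of the annotated one.
        return ("start_downstream_of_annotation", offset)
    if offset <= leaderless_max:
        # Hypothetical cut-off for a (near-)leaderless transcript.
        return ("leaderless", offset)
    return ("has_5utr", offset)
```

For example, `classify_gene_start(100, 150, "+")` reports a 50-nt 5′ UTR, whereas a TSS at position 120 for the same annotated start at 100 would flag the gene as a candidate for start-site correction.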
Whole-transcriptome data allow operons to be better defined, and the first experimentally determined operon maps show that 60–70% of bacterial genes are transcribed as operons, but only 30–40% in archaea. Staircase-like expression within operons appears to be common (Guell et al., 2009).
Whole-transcriptome analysis of Mycoplasma pneumoniae, using a mixture of tiling arrays, deep sequencing and 137 different growth conditions, showed that there is context-dependent modulation of operon structure (Guell et al., 2009). This involves repression or activation of operon internal genes as well as genes located at the operon ends. This adds a whole new level of complexity to gene regulation. Similar ‘conditional operons’ were found in Halobacterium salinarum (Koide et al., 2009).
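One simple way to derive an operon map from contiguous expression can be sketched as follows. This is an illustrative heuristic, not the procedure used in the cited studies: adjacent genes on the same strand are merged when every intergenic base is covered at or above a threshold. The `Gene` class, the `call_operons` function and the `min_cov` value are all assumptions made for the example, and the sketch ignores conditional operons and staircase-like expression.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # '+' or '-'


def call_operons(genes, coverage, min_cov=5):
    """Group genes into putative operons.

    genes    : list of Gene objects sorted by start coordinate
    coverage : dict mapping strand to a list of per-base read depth,
               where index 0 corresponds to genome position 1
    min_cov  : hypothetical minimum depth required across the gap
    """
    operons = []
    for gene in genes:
        if operons:
            prev = operons[-1][-1]
            # Bases strictly between the two genes (empty if they abut
            # or overlap, in which case all() is True and they merge).
            gap = coverage[gene.strand][prev.end:gene.start - 1]
            if prev.strand == gene.strand and all(d >= min_cov for d in gap):
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


def fraction_in_operons(operons):
    """Fraction of genes that belong to a multi-gene transcription unit."""
    total = sum(len(o) for o in operons)
    in_operon = sum(len(o) for o in operons if len(o) > 1)
    return in_operon / total if total else 0.0
```

Applying `fraction_in_operons` to the result gives the kind of summary figure quoted above for bacteria and archaea, although real analyses would also have to handle genes on the opposite strand that interrupt a transcription unit.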
Non-coding RNAs (ncRNA), typically 50–500 nt long, can play important regulatory roles in prokaryotic physiology, such as in virulence, stress response and quorum sensing. These ncRNAs have been largely overlooked in prokaryote genome annotation, since they are very difficult to detect with existing gene-prediction software (Meyer, 2008; Livny and Waldor, 2009). Many act by base pairing with the 5′ UTR of a target mRNA, resulting in inhibition of translation or mRNA degradation. Whole-transcriptome analysis of several prokaryotes has now identified large numbers of ncRNAs (Table 1), some of which are induced during niche switching, such as in Burkholderia cenocepacia (Yoder-Himes et al., 2009).
Cis-antisense RNA was previously thought to be extremely rare in prokaryotes, but whole-transcriptome analysis has recently detected hundreds of antisense transcripts in bacteria and archaea (Table 1). Some of these have been experimentally shown to downregulate their sense counterparts (Toledo-Arana et al., 2009). This is an area in which much is still to be discovered, as cis-antisense may be a common form of regulation in prokaryotes.
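Detecting cis-antisense transcription depends on strand-specific data. The sketch below counts reads that overlap a gene on the same or the opposite strand. It assumes that the strand recorded for each read reflects the strand of the original transcript, which depends on the library protocol; the function name and the tuple layout are hypothetical.

```python
def count_sense_antisense(gene, reads):
    """Count reads overlapping a gene on the sense and antisense strand.

    gene  : (start, end, strand) with 1-based inclusive coordinates
    reads : iterable of (start, end, strand) tuples, where strand is
            assumed to be the strand of the transcript the read came from
    """
    g_start, g_end, g_strand = gene
    sense = antisense = 0
    for r_start, r_end, r_strand in reads:
        # Skip reads that do not overlap the gene at all.
        if r_end < g_start or r_start > g_end:
            continue
        if r_strand == g_strand:
            sense += 1
        else:
            antisense += 1
    return sense, antisense
```

A gene with a substantial antisense count relative to its sense count would be a candidate for the kind of follow-up experiments mentioned above; any threshold for "substantial" would have to be chosen per data set.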
Validation and comparing techniques
The ultimate goal is to obtain a complete and bias-free view of microbial transcriptomes. The question remains to what extent RNA-seq can provide such a view. Clearly, RNA-seq has a number of advantages over microarray technology, since it offers both single-base resolution and high mapping resolution. RNA-seq is especially suited to identifying novel transcripts, alternative splice variants and non-coding RNA (Marioni et al., 2008; Mortazavi et al., 2008; Nagalakshmi et al., 2008; Wilhelm et al., 2008).
However, some studies indicate that RNA-seq is also not bias-free (Marioni et al., 2008; Mortazavi et al., 2008). In recent studies that compared expression levels measured using both (tiling) microarrays and RNA-seq, expression levels between the two technologies show reasonably good correlation (ranging from 0.62 to 0.75) (Marioni et al., 2008; Mortazavi et al., 2008; Fu et al., 2009), especially when comparison is restricted to protein-coding gene loci (Sasidharan et al., 2009a,b). It should be noted that in order to compare expression levels from tiling microarray and RNA-seq, one has to consider the different data types of the two technologies. Comparison of results may depend on the procedure applied to convert continuous expression levels from tiling microarray into a ‘digital’ signal (Sasidharan et al., 2009a,b). Correlating expression levels from both technologies to proteomics data shows that RNA-seq provides a better estimate not only of absolute transcript levels but also of protein levels (Fu et al., 2009).
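A comparison of this kind can be sketched as a correlation between per-gene array intensities and RNA-seq read counts. The cited studies do not all use the same statistic or transformation; the example below assumes a Pearson correlation on log2-transformed values, with a pseudocount added to the read counts so that genes with zero reads can be included. The function names and the pseudocount are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log2_expression_correlation(array_intensity, rnaseq_counts, pseudocount=1.0):
    """Correlate continuous array signal with 'digital' read counts.

    array_intensity : per-gene intensities (must be positive)
    rnaseq_counts   : per-gene read counts (may contain zeros)
    pseudocount     : hypothetical offset that avoids log2(0)
    """
    xs = [math.log2(v) for v in array_intensity]
    ys = [math.log2(c + pseudocount) for c in rnaseq_counts]
    return pearson(xs, ys)
```

The choice of transformation matters: the result can change with the pseudocount and with whether counts are first normalized for gene length, which is one reason why reported correlations differ between studies.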
As demonstrated in a recent study on M. pneumoniae, combining various experimental data types can provide a more complete view of a transcriptome than using tiling arrays or RNA-seq alone (Guell et al., 2009). The authors report that in some cases (in particular for lowly expressed genes), RNA-seq data alone were not sufficient to unambiguously define operon boundaries. However, the single-base resolution of RNA-seq allows more precise prediction of promoter locations (Guell et al., 2009).
Deep RNA sequencing provides clear advantages over conventional (tiling) microarray technology. It allows transcriptome analysis of the entire nucleotide sequence of the genome, it is very sensitive, it offers a large dynamic range, and it allows accurate determination of boundaries (e.g. TSS, 3′/5′ ends, exons). However, RNA-seq is not completely bias-free. Nearly all studies to date have used some sort of enrichment procedure for mRNA, inherently leading to some bias. In many recent studies this enrichment step is being skipped, as the enormous volume of cDNA sequence data holds enough information, even if mRNA comprises only a few per cent of the total RNA. Just throw away 95–98% of your sequence data!
The conversion of RNA into complementary DNA (cDNA) may also lead to bias. Recently, a new method was developed that measures RNA levels directly without this conversion step (Ozsolak et al., 2009). The method is based on direct sequencing of RNA and is an extension of single-molecule DNA sequencing technology (Braslavsky et al., 2003; Harris et al., 2008). The direct method uses RNA directly as a template for nucleotide incorporation by a modified DNA polymerase with reverse transcriptase activity. Under optimal conditions the method yields sequences in the range of 20–40 nucleotides in length, with a total raw base error rate of approximately 4%. These read lengths and error rates are sufficient to align sequences to reference genomes (Ozsolak et al., 2009).
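To see why such reads remain alignable, the number of errors per read can be estimated with a simple binomial model. This sketch assumes that errors occur independently at each base with the same probability, which is a simplification and not a claim made by the cited study; the function names are hypothetical.

```python
import math


def expected_errors(read_len, err=0.04):
    """Expected number of erroneous bases in a read of the given length."""
    return read_len * err


def prob_at_most_k_errors(read_len, k, err=0.04):
    """Probability that a read contains at most k errors.

    Assumes independent, identically distributed per-base errors
    (binomial model); real error profiles are usually position- and
    context-dependent.
    """
    return sum(
        math.comb(read_len, i) * err ** i * (1 - err) ** (read_len - i)
        for i in range(k + 1)
    )
```

Under these assumptions a 30-nt read carries 1.2 errors on average, and `prob_at_most_k_errors(30, 2)` gives the fraction of reads that an aligner tolerating two mismatches could still place.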
What does the future hold for sequencing and RNA-seq? There is no doubt that the revolution that has occurred in our ability to sequence and profile RNA from the days of a single ‘Southern blot’ to microarray RNA dot-blot hybridization and Q-PCR to RNA-seq has been exciting, informative and rapid. In the future we will need to miniaturize as we move to single-cell sequencing and transcriptomics. How will this be achieved? IBM is working on nanotechnology (‘The DNA transistor’; for a video see http://www.youtube.com/watch?v=wvclP3GySUY) to enable even more rapid, accurate and cheap genome sequencing (patent US200828191A1). DNA, or in fact any charged polymer, can be made to move through nanopores, and detection of the bases moving through the pore is possible. In fact the DNA moves through the pore too quickly and needs to be slowed down to be readable. So in the not too distant future, we may see that the genome sequence, transcriptome and regulome of a single cell will all be determined before the first coffee break of the day.
This project was carried out within the research programmes of the Kluyver Centre for Genomics of Industrial Fermentation and the Netherlands Bioinformatics Centre, which are part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research. TT is funded by the HAN University of Applied Sciences.