Making the most of drought and salinity transcriptomics
Michael K. Deyholos. Fax: +1-780-492-9234; e-mail: email@example.com
More than 100 different studies of plant transcriptomic responses to salinity or drought-related stress have now been published. Most of these use microarrays or related high-throughput profiling technologies. This compels us to ask three questions in review: (1) what has transcriptomics contributed to our understanding of stress physiology; (2) what limits the ability of transcriptomics to contribute to increases in stress tolerance; and (3) given these limits, what are the most appropriate uses of transcriptomics? We conclude that although microarrays are now a mature technology that accurately describes the transcriptome, the consistently low correlation between transcript abundance and other measures of gene expression imposes an inherent limitation that cannot be ignored. Further limitations on the relevance of transcriptomics arise in some cases from experimental practices related to the treatment regimen and the selection of tissue or germplasm. Nevertheless, there is good evidence to support the continued use of transcriptomics, especially emerging techniques such as RNA-Seq, as a screening tool for candidate gene discovery. Microarrays can also be valuable in analysing the transcriptome per se (e.g. when describing the phenotype of a transcription factor mutant or discovering non-coding RNA species), and when integrated with other types of data including metabolomic analyses.
INTRODUCTION: THE ALLURE OF TRANSCRIPTOMICS
The past decade has seen the application of new techniques of functional genomics to the analysis of crops and model species under stress. These techniques are also known as transcriptomics, because they measure the abundance of transcripts of thousands of genes in parallel. Microarrays are the basis of the most widely used transcriptomic techniques. To date, over 100 publications have used microarrays to describe a total of 28 species exposed to either hypersalinity or treatments meant to simulate drought. Most of these studies seek an answer to the same question: what genes contribute to stress tolerance? Researchers hope that by answering this question (and many questions that follow it) a better understanding of stress physiology will be gained, leading to improvements in agricultural productivity. This review will describe progress towards these goals, beginning with a discussion of the advantages of transcriptome profiling.
Microarray analysis, like any analytical technique, has both inherent advantages and disadvantages. In comparison to other existing methods of high-throughput molecular profiling, the great strength of transcriptomics derives from the relative simplicity of its subject: mRNA is a polymer of only four subunits, unlike proteins and other chemically more diverse metabolites within the cell. Thus, a single method of extraction and detection can theoretically identify and quantify every transcript in a tissue sample (Peck 2005). In practice, the sensitivity of microarrays is reduced by a combination of factors, including non-specific hybridization, labelling biases, autofluorescence and detector noise. Nevertheless, microarray studies tend to identify at least an order of magnitude more gene products than are typically reported in proteome surveys (Baginsky 2009). The accuracy of microarrays in describing transcriptomes is demonstrated in part by the consistency with other assays; each published microarray experiment is routinely validated by qRT-PCR analysis of a subset of transcripts (some of which should be selected at random). Thus, microarrays have become the predominant platform for molecular profiling, because of their relatively high sensitivity, specificity, accuracy, throughput and cost-efficiency.
The expanded availability of next-generation DNA sequencing technologies promises further improvements in transcript profiling. In these sequence-based profiling methods (e.g. RNA-Seq), RNA is extracted from tissues and converted to cDNA, and part of each of tens of thousands of cDNA molecules is sequenced in parallel. Transcript abundance is then estimated based on the relative frequency with which sequence is obtained for a given gene (Wang, Gerstein & Snyder 2009). RNA-Seq therefore offers a dynamic range (typically >10 000-fold) that is limited only by the depth of sequencing, and is in any case much greater than the dynamic range of microarrays, which is effectively limited to a few hundred-fold due to noise and the information capacity of electronic detectors. RNA-Seq also appears to have greater sensitivity, and increased base resolution to discriminate between splicing variants, alleles, and other isoforms. There is also evidence that RNA-Seq may have improved accuracy: at least one study (using animal subjects) showed that in comparison to proteomic data, RNA-Seq showed moderately better correlation (r = 0.36) than microarray data (r = 0.24) (Fu et al. 2009). A final important consideration is that microarrays can only measure the abundance of transcripts that are represented by a pre-defined probeset. In contrast, RNA-Seq is inherently open-ended, meaning that it can detect any molecule presented to the sequencer (although some pre-existing genomic sequence is useful in interpreting the results). This allows RNA-Seq to provide more complete descriptions of the transcriptome than can be obtained from traditional microarrays (with the possible exception of whole-genome tiling arrays). Removing the requirement for microarray probeset definition and production will also allow transcriptomic profiling to be applied to a much wider range of species as subjects for abiotic stress studies.
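The count-based estimate of abundance described above can be illustrated with a minimal sketch. It assumes that reads have already been assigned to genes; the gene names, read counts and the `counts_per_million` helper are hypothetical, and practical pipelines typically also correct for transcript length and other biases.

```python
def counts_per_million(read_counts):
    """Scale raw per-gene read counts to reads per million assigned reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("library contains no assigned reads")
    return {gene: 1_000_000 * n / total for gene, n in read_counts.items()}


# Hypothetical read counts for three genes in one library (not real data).
library = {"geneA": 50, "geneB": 950, "geneC": 9000}
cpm = counts_per_million(library)

for gene, value in cpm.items():
    print(f"{gene}: {value:.0f} reads per million")
```

Because the estimate is a relative frequency, sequencing more deeply adds counts to rare transcripts without saturating the abundant ones, which is why the dynamic range is said to depend on sequencing depth.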
Based on all of these advantages, it is clear that sequence-based transcript profiling methods will replace microarrays in the near future, as access to next-generation sequencing instruments expands (Morozova & Marra 2008). However, additional limitations and sources of error associated with RNA-Seq may yet be discovered as this new technology is more widely adopted. For example, an unexplained bias towards increased variance in RNA-Seq quantification of smaller transcripts was recently reported (Oshlack & Wakefield 2009). To date, only a few reports of plant transcriptome analysis using next-generation sequencers have appeared, in addition to previous reports of Serial Analysis of Gene Expression-type analyses of various abiotic stresses using more expensive Sanger sequencing techniques (Moon et al. 2007; Molina et al. 2008; Barakat et al. 2009; Byun, Kim & Lee 2009).
BIOLOGICAL LESSONS FROM MICROARRAYS
The efficiency and presumed accuracy of microarray analysis have produced a wealth of data about stress-related transcript expression. These data have often been used to guide selection of individual candidate genes for mutation or overexpression.
In some cases, this has resulted in plants with modified responses to specific stresses (Vij & Tyagi 2007; Luhua et al. 2008). However, the further impact of microarray analyses on our understanding of stress physiology is less apparent; indeed, a reading of comprehensive reviews of mechanisms of stress tolerance fails to identify any pathway or major component of the stress response that was first discovered through transcriptomics (Yamaguchi-Shinozaki & Shinozaki 2006; Munns & Tester 2008; Takeda & Matsuoka 2008). Part of this is due simply to the time lag that might be expected when translating new and large data sets into well-accepted models. Nevertheless, the microarray literature has tended to follow rather than lead physiology in the interpretation of the stress responses. One overly simplified version of the salinity and osmotic microarray narrative is that following the onset of stress, a decrease in abundance of transcripts related to primary energy metabolism, photosynthesis and protein synthesis is observed, concomitant with an increase in stress-signalling, transporters and hydrophilic, osmoprotective and antioxidative-related transcripts (Bray 2002; Gong et al. 2005; Jiang & Deyholos 2006; Sahi et al. 2006). This generalization understates the real complexity of patterns within each functional category, and the variability between experiments, to make the point that this story largely matches what was already known from pre-microarray physiology and molecular biology studies. Why have microarray data, despite their comprehensive scale and great efficiency, thus far been unable to make more definitive contributions to stress physiology?
BIOLOGICAL LIMITATIONS ON THE RELEVANCE OF MICROARRAYS
In contrast to the technical advantages outlined previously, the single major disadvantage of transcriptome analysis is biological in origin. Although all gene expression is fundamentally dependent on transcription, various post-transcriptional regulatory influences mean that transcript abundance is not necessarily correlated with the ultimate activity of the gene product. Thus, even if microarrays or related techniques accurately describe stress-induced changes in the transcriptome, it cannot be assumed that a change in abundance of any particular gene has any relevance to the physiology of either the cell or the whole plant. This inherent ambiguity presents an unavoidable limitation on the power of microarray analysis and all other types of transcriptomics.
The extent of the variation between transcript abundance and other measures of gene expression is not well characterized, but for almost every direct comparison of proteomic and transcriptomic data, the correlation appears to be disappointingly low (Feng et al. 2009; Fu et al. 2009; Lee et al. 2009; Minic et al. 2009). In a comparison of protein and transcript abundance in NaCl-treated Arabidopsis roots, a correlation of as low as r2 = −0.1 was observed (Jiang et al. 2007). Of course, protein abundance is not itself perfectly correlated with gene activity; proteins are regulated post-translationally and by their localization and association with other molecules. The question of the relevance of transcriptomics is especially important in the context of stress physiology, where it has been shown that in some contexts, only a small portion of the transcripts representing a specific subset of genes are actively translated (Kawaguchi et al. 2004; Branco-Price et al. 2008). Accordingly, the selection of polysome (i.e. multiple ribosome) associated transcripts during RNA extraction appears to be useful in increasing the physiological relevance of microarray data. Unfortunately, because of its extra technical demands, this approach has not been widely adopted in plant genomics, although refinements to the method are being developed including immunopurification of epitope-tagged ribosomes (Arava et al. 2003; Zanetti et al. 2005; Ederth et al. 2009; Mustroph et al. 2009).
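The transcript-protein comparisons cited above are usually summarized with a correlation coefficient. The following is a minimal sketch of how a Pearson coefficient might be computed for paired measurements; the `pearson_r` helper and the log2 fold-change values are hypothetical illustrations, not data from any of the cited studies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 fold changes for six genes (illustrative values only).
transcript_change = [2.0, 1.5, -1.0, 0.5, 3.0, -0.5]
protein_change = [0.2, 1.0, 0.1, -0.4, 0.8, 0.3]

r = pearson_r(transcript_change, protein_change)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")
```

A coefficient near zero for such paired values would mean that knowing the change in a transcript says little about the change in the corresponding protein, which is the practical concern raised in this section.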
METHODOLOGICAL LIMITATIONS ON THE RELEVANCE OF MICROARRAYS
In addition to the inherent limitations described earlier, some experimental practices introduce further constraints on the relevance of transcriptomics to stress physiology. These constraints can arise because of the tissues, treatments, and germplasm used in a given experiment. Each of these three methodological considerations will be discussed briefly in the sections that follow.
Most of the species studied by stress physiologists contain complex tissues composed of many different cell types. The expression of genes in just a few of these cells (e.g. meristems, vascular tissue or guard cells) can have a profound impact on the whole plant. Moreover, the stress responses of reproductive tissues are of greater relevance to agriculture than those of the juvenile, vegetative tissues that have been most frequently subjected to transcriptome analysis (Munns & Tester 2008). Despite this, individual cells or tissues have only rarely been dissected from stress-treated plants prior to transcriptome profiling. As a result, relevant changes in gene expression within specific cells are diluted by transcripts from surrounding cells and go undetected. The importance of spatial resolution in tissue sampling was demonstrated by a microarray analysis of the effects of water stress in four serial, transverse segments cut from the maize root apex (Spollen et al. 2008). Following treatment with osmoticum, only 10% of the transcripts that increased in abundance in one segment also increased in an adjacent segment. Higher resolution techniques have also been used to study transcriptomes of NaCl-treated Arabidopsis roots (Dinneny et al. 2008). Fluorescence-activated cell sorting (FACS) of protoplasts from seedlings expressing green fluorescent protein markers in six different cell types along the radial axis demonstrated that nearly 80% of the NaCl-responsive transcripts were responsive in only one of the cell types. Most recently, this FACS approach has been combined with ribosome immunopurification to obtain cell-type specific profiles of actively translated mRNA (Mustroph et al. 2009). A third class of techniques, known as laser microdissection, is also starting to be used to harvest specific cell types from stressed tissues, such as the root hairs from iron-deficient roots of cucumber (Santi & Schmidt 2008).
Each of these dissection techniques requires considerably more cost and effort (and in some cases additional equipment) compared to experiments involving whole organs. Nevertheless, results of cell and tissue-specific stress transcriptome analyses conducted to date demonstrate that high spatial resolution is essential to increasing the utility of microarray data.
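The dilution of cell-specific responses in whole-organ samples can be made concrete with a short worked calculation. This sketch rests on simplifying assumptions that are not taken from the cited studies: the transcript responds in only one cell type, its abundance is unchanged in all other cells, and `cell_fraction` is the share of the transcript's whole-organ signal contributed by the responding cells before stress. The function name and the example values are hypothetical.

```python
def whole_organ_fold_change(cell_fraction, cell_fold_change):
    """Apparent fold change in a whole-organ sample when one cell type responds.

    cell_fraction: pre-stress share (0 to 1) of the transcript's total signal
        that comes from the responding cell type.
    cell_fold_change: fold change of the transcript within that cell type.
    """
    if not 0.0 <= cell_fraction <= 1.0:
        raise ValueError("cell_fraction must be between 0 and 1")
    # Unresponsive cells keep their signal; responding cells are scaled.
    return (1.0 - cell_fraction) + cell_fraction * cell_fold_change


# A hypothetical 10-fold induction viewed at three levels of tissue purity.
for fraction in (0.05, 0.25, 1.0):
    apparent = whole_organ_fold_change(fraction, 10.0)
    print(f"responding share {fraction:.2f}: apparent fold change {apparent:.2f}")
```

Under these assumptions, a 10-fold induction confined to cells that contribute 5% of the signal appears as a change of less than 1.5-fold in the whole organ, a magnitude that falls below the cut-offs used in many microarray analyses.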
Stress exposure regimens
Most plants used in expression profiling are grown in laboratories because this allows for greater control and replication of environmental variables in any season. A major challenge in experimental design is therefore finding laboratory conditions that are most relevant to conditions experienced by field-grown plants. Simulation of the meteorological phenomenon of drought in the laboratory presents unique issues. This has been variously attempted by introduction of osmotica (PEG, CaSO4, mannitol) into hydroponic growth media, passive dehydration of depotted plants or water withholding of plants in soil. Although all of these treatments decrease the water potential of the tissue, there are critical differences in their effects on the transcriptome. A meta-analysis of three separate microarray experiments using either filter paper, mannitol or soil water deprivation to reduce water potential found that only 1% of transcripts were commonly affected by all three treatments (Bray 2004). In addition, differences in the rate of stress imposition greatly affect the transcriptomic responses: rapidly dehydrated (∼6 h) as compared to gradually dehydrated (∼7 d) barley roots had only 10% of stress-responsive transcripts in common (Talame et al. 2007). Thus, the kinetics of stress treatments are particularly important and should be carefully considered in experimental designs: stresses imposed in the laboratory (especially salinity) tend to be more sudden than what is typically experienced in the field.
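Comparisons of this kind reduce to counting the transcripts shared between lists of stress-responsive genes. The sketch below shows one way such an overlap might be computed; the `shared_fraction` helper, the gene identifiers and the choice of the union of all lists as the denominator are assumptions made for illustration, and the cited studies may have defined their percentages differently.

```python
def shared_fraction(*gene_sets):
    """Fraction of all responsive genes (the union) found in every set."""
    union = set().union(*gene_sets)
    if not union:
        return 0.0
    common = set.intersection(*(set(s) for s in gene_sets))
    return len(common) / len(union)


# Hypothetical responsive-gene lists from three dehydration treatments.
filter_paper = {"g1", "g2", "g3", "g4"}
mannitol = {"g3", "g4", "g5"}
soil_drying = {"g4", "g6", "g7", "g8"}

overlap = shared_fraction(filter_paper, mannitol, soil_drying)
print(f"{overlap:.1%} of responsive genes are shared by all three treatments")
```

The denominator matters when reading reported overlaps: the same shared genes give a larger percentage when expressed relative to the smallest list than when expressed relative to the union of all lists.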
Under physiologically relevant conditions, hypersalinity imposes two distinct effects (Munns & Tester 2008). First, plants experience an osmotic stress due to decreased water potential surrounding the roots. Later, ions exert toxic effects as they are drawn into the plant and accumulate especially in the shoot, a process that requires transpiration. In this context, Munns and Tester neatly describe some of the shortcomings of commonly used NaCl-stress regimens: seedlings grown in Petri dishes almost certainly do not have sufficient transpiration rates to experience the ionic aspect of salinity exposure. Moreover, a rapid NaCl shock under these conditions is likely to result in only a transient dehydration of the cell followed by osmotic adjustment. Finally, the importance of including supplemental Ca2+ during NaCl exposure must be emphasized, because experiments involving Na+ alone will be confounded by secondary effects from decreased Ca2+ availability. Others have pointed out that soil alkalinity, modelled by laboratory exposure to Na2CO3, is a related stress of great agronomic importance that also deserves more attention (Jin et al. 2008).
Selection of germplasm
The exclusive use of stress-susceptible species (e.g. Arabidopsis thaliana) as subjects for drought and salinity studies has been criticized because their limited defensive repertoires mean that the exhibited responses may actually reflect death or senescence, rather than useful adaptations (Munns & Tester 2008). Microarray-based comparisons of related species or genotypes (ideally near isogenic lines) that differ only in their tolerance to stress are a potentially powerful strategy for the identification of genes that confer stress tolerance. For example, Thellungiella salsuginea (also known as T. halophila) is a stress-tolerant species from the same family as Arabidopsis, and so has served as a useful subject for comparative transcriptome analyses (Taji et al. 2004; Gong et al. 2005; Wong et al. 2006). According to one study, approximately 40% of the NaCl-responsive transcripts were similarly induced in both T. salsuginea and Arabidopsis, while the remaining majority of responsive transcripts in T. salsuginea were unique to that species (Gong et al. 2005). Other pairs of closely related species that differ in their stress tolerance have also been identified (Bohnert et al. 2006). Comparative stress transcriptome studies have now also been reported in drought or salinity treated rice, wheat, sugarcane and Andean potato (Mane et al. 2008; Mohammadi, Kav & Deyholos 2008; Rabello et al. 2008; Ergen et al. 2009; Rodrigues, de Laia & Zingaretti 2009). Establishing causality in this type of comparison depends on careful experimental design with sufficient controls, because any observed differences in transcript abundance after the onset of stress could simply arise from the resulting differences in physiology (e.g. water status, metabolic rates) that occur between the tolerant and susceptible varieties under these conditions.
RATIONAL USE OF MICROARRAYS IN STRESS PHYSIOLOGY
Given the inherent limitations on the relevance of transcriptomics to stress physiology, what are the most appropriate uses of the technology? Accepting the weak average correlation between transcript abundance and gene activity, we can for this discussion describe the best uses of microarrays in the following three non-exclusive categories: analysis of transcription, gene discovery, and systems biology (in its broadest definition).
Detecting the immediate products of transcription
Microarrays are most relevant to studies in which transcripts themselves are the subject of interest, rather than when they serve as a proxy of ultimate gene expression. For example, when sufficient genomic sequence is available, microarray data can be very useful in the identification of cis-regulatory elements adjacent to transcripts co-regulated by drought or salinity (Ma & Bohnert 2007). Microarrays are also useful when describing the phenotype associated with a mutation in a transcription factor or other direct regulator of transcript abundance (Fowler & Thomashow 2002; Maruyama et al. 2004; Li et al. 2008; Jiang & Deyholos 2009; Yokotani et al. 2009). However, even in this situation, a portion of transcripts will be differentially expressed due only to indirect effects of the mutation. Therefore, additional techniques, such as the use of chemically inducible transcription factors and chromatin immunoprecipitation, are used with microarrays to more accurately define direct targets of transcriptional regulators (Waters et al. 2009).
Another application of transcriptomics in which RNA molecules are the focus of inquiry is the detection of unusual, stress-induced transcripts. These include transcribed regions that differ from annotated exons: e.g. miRNAs, splicing and processing variants, and long non-coding transcripts of unknown function (Liu et al. 2008; Ding et al. 2009). Recently, whole-genome tiling microarrays have been used to compare transcriptomes of Arabidopsis plants exposed to various abiotic stresses (Matsui et al. 2008; Zeller et al. 2009). Together, these experiments resulted in the identification of thousands of unannotated RNA molecules that increased or decreased in abundance following stress. Some of these previously unannotated molecules likely represent protein-coding genes that have eluded prediction algorithms, while others are non-coding RNAs, the significance of which in relation to stress is only starting to be investigated.
Screening for novel, stress-related genes
Transcriptomics can also be effective in the characterization of protein-coding genes associated with stress, notwithstanding the uncertain correlation between transcript abundance and gene activity. In this context, transcriptomics should be seen as a blunt screening tool to identify candidate genes and generate hypotheses for further analysis. Transcriptomics may here have an advantage in the discovery of redundantly acting genes that cannot be easily identified in typical phenotypic screens. The success of transcriptomics in gene identification depends on the frequency of false positives and the efficiency of subsequent biochemical and physiological assays. Functional relevance of a candidate gene can be demonstrated by phenotypic analysis of loss-of-function and overexpressing mutants, but only if a given gene is either necessary or sufficient, respectively, for some measurable parameter of stress tolerance. Here, Arabidopsis has advantages as a model system because of the availability of a large collection of sequence-tagged insertions (e.g. Salk Lines), and its rapid generation time and ease of transformation (Alonso et al. 2003). A particularly encouraging demonstration of the use of microarrays in the discovery of novel genes was recently provided by the constitutive expression of 41 genes of unknown function in Arabidopsis. These genes were selected based on their increased transcript abundance in a microarray study of H2O2 accumulating mutants (Luhua et al. 2008). In transgenic plants, the majority (70%) of these genes conferred increased tolerance to oxidative stress, and in almost all cases, the increased stress tolerance was specific to oxidative stress, as compared to other abiotic stresses. These results support the use of microarrays as a screening tool in gene discovery.
However, the general utility of this strategy is not certain: many contrary experiments (in which a microarray-derived candidate gene fails to confer tolerance) go unreported, and caution must be used when extrapolating the relevance of assays of stress tolerance in Arabidopsis to field conditions. It is also possible that the luxuriant conditions under which most plants are grown for phenotypic analysis in the laboratory mask some mutant phenotypes. We have reported an Arabidopsis bHLH transcription factor that increases in transcript abundance more than 100-fold upon exposure of roots to salinity (Jiang & Deyholos 2006). Yet despite this very strong transcriptional response, there is little effect on either the downstream transcriptome or stress tolerance of the plant in either loss-of-function mutants or transgenic plants constitutively expressing the transgene (Jiang, Yang & Deyholos 2009).
Integrating with other datasets
Although transcriptome profiles alone are generally insufficient to draw reliable conclusions about physiological responses to stress, microarray data can be integrated effectively with other types of linkage or expression data (Bohnert et al. 2006). For example, stress-induced transcript expression profiles have been used to identify candidate genes associated with QTLs (Gorantla et al. 2005; Street et al. 2006; Diab et al. 2008; Golkari et al. 2009). Integration of metabolomic and transcriptomic data has also been used effectively to confirm inferences about stress-related gene expression (Gong et al. 2005; Lippold et al. 2009; Urano et al. 2009). Many types of network and meta-analyses of stress microarray data have also been proposed (Long, Rady & Benfey 2008; Ma & Bohnert 2008; Vandepoele et al. 2009; Weston et al. 2008). In some cases these provide a net improvement on the accuracy of transcriptomic data, in part because experimental noise is reduced by filtering, averaging, consensus or other algorithms (Benedict et al. 2006). Indeed, network analyses have proven useful in plant biology for the identification of candidate genes, putative regulatory modules and cis-elements (Persson et al. 2005). However, all analyses that rely exclusively on transcript abundance data ultimately face the same limitation on biological relevance because of apparently overwhelming effects of post-transcriptional regulation in abiotic stress. A recent review provides an excellent summary of what is needed to improve pathway reconstruction from expression data in plants, such as improved functional annotations (Van Baarlen et al. 2008). The theoretical limits on reverse-engineering transcriptomic networks have been treated elsewhere, and again highlight the constraints imposed by the low correlation between transcript abundance and other measures of expression (Margolin & Califano 2007).
Microarrays provide a combination of technical and practical advantages that make them the most widely used platform for transcriptional profiling, and these advantages justify their continued use until major improvements are made in the efficiency of proteomics technologies and until the availability of next-generation sequencing technologies expands. Microarrays do appear to describe the transcriptome accurately, but all transcriptome profiling techniques have limited relevance in describing the physiology of a whole cell or organism because of extensive post-transcriptional regulation, which has been shown to be particularly relevant in stress responses. Wider use of polysomal mRNA fractions could increase the relevance of microarray data to physiology. Other improvements in some common methods are also strongly recommended including higher spatial resolution in tissue sampling and further efforts to make treatment conditions relevant to plant growth outside of the lab. These improvements will help ensure that transcriptomic profiles from salinity and drought treatments of an expanding number of species will be of the most use in gene discovery, comparative genomics and systems biology.