Transcriptome characterization, novel gene discovery, and analysis of alternative splicing
Although computational methods have steadily improved, simply applying computational gene prediction methods for annotation of genome sequences does not allow accurate determination of the gene structure and/or identification of all transcription units of an organism. Additionally, large-scale cloning and sequencing of complementary DNA (cDNA) molecules corresponding to expressed gene products, the traditional approach for identifying coding regions, often misses very low abundance and non-polyadenylated transcripts. Furthermore, cDNA collections are often devoid of transcripts that are expressed only in response to specific physiological or environmental conditions. To circumvent such problems, Yamada et al. (2003) used WGAs containing oligonucleotide features that cover the entire Arabidopsis thaliana genome sequence. These arrays were used for gene expression studies by hybridizing targets made from RNA samples of four different tissues (flower, leaf, root, cultured cell). To make the targets for array hybridization, total RNA was isolated from the four different Arabidopsis tissues, and first-strand cDNA was synthesized using an oligo(dT) primer containing a linked promoter that is recognized by T7 RNA polymerase (RNAP). The single-strand cDNA was then used as a template for a second-strand cDNA reaction, thus resulting in double-stranded cDNA (dscDNA) molecules. The dscDNA was then used as a template for an in vitro transcription reaction using T7 RNAP to amplify complementary RNA (cRNA) molecules by initiation from the promoter in the original oligo(dT) primer. This protocol is based on a method devised by Van Gelder et al. (1990), and results in equal representation of all expressed gene products contained in the total RNA samples, as well as amplification of targets in sufficient quantity for hybridization to WGAs (Figure 1).
Interestingly, initial expression studies in Arabidopsis using WGAs identified a large number of novel sites of active gene expression that had been missed by computational gene prediction algorithms and cDNA collections (Hanada et al., 2007; Kim et al., 2003; Stolc et al., 2005; Yamada et al., 2003). Furthermore, many of these newly identified transcripts were expressed from the opposite DNA strand in the reverse orientation (antisense) to previously annotated transcripts (Yamada et al., 2003). For instance, it was noticed (based on the whole-genome tiling array data) that an antisense transcript overlapping the 3′ end of the RNA for the repressor of Arabidopsis flowering time, FLOWERING LOCUS C (FLC), is involved in the regulation of flowering time (Swiezewski et al., 2007). Furthermore, Swiezewski et al. (2007) were able to demonstrate that the antisense transcript may also act as a biogenesis substrate for small interfering RNAs (siRNAs) that are responsible for the heterochromatization and subsequent silencing of this genomic region. Additionally, the study by Yamada et al. (2003) demonstrated that gene expression occurs in centromeric regions, which were previously thought to be mostly devoid of active transcription. Thus, a much more complete view of the entire transcriptome can be gained from gene expression studies using WGAs.
Figure 1. Diagram of the labeling procedure for gene expression studies using WGAs. Total RNA is used as the target template for hybridization to WGAs. To begin, first-strand cDNA synthesis is performed using an oligo(dT) containing a T7 RNAP promoter. Next, second-strand cDNA is synthesized, resulting in double-stranded cDNA (dscDNA) molecules. The dscDNA is then used as a template for in vitro transcription using T7 RNAP to amplify complementary RNA (cRNA) molecules by initiation from the T7 promoter in the oligo(dT) primer. The targets are then fragmented and hybridized to the WGAs. After hybridization, the WGAs are stringently washed, stained with a fluorophore, and scanned for detection of fluorescence intensity. Target hybridization to WGAs results in genome-wide transcriptome data for analysis.
To gain a better understanding of the transcriptome of rice (Oryza sativa), WGAs were developed that spanned the entire genome of this important crop and model for grass species (Li et al., 2006). Li et al. (2006) hybridized RNA targets (created by a method similar to that described above) from the indica rice sub-species to this set of high-density tiling microarrays. These studies provided expression data support for the existence of 35 970 (81.9%) of the annotated gene models, and identified a large number (5464) of novel sites of active gene expression that had never before been annotated in this model crop genome. More recently, Li et al. (2007) hybridized RNA targets (again created by a similar method) from two rice sub-species, indica and japonica, to WGAs spanning the entire rice genome. From these studies, 25 352 and 27 744 transcriptionally active regions (TARs) not encoded by annotated exons in the sub-species japonica and indica, respectively, were identified. Overall, transcriptome studies making use of WGAs have demonstrated that this technology can be successfully applied for more comprehensive characterization of the transcriptome, as well as for discovery of novel sites of active transcription that lie within plant genomes (Hanada et al., 2007; Kim et al., 2003; Li et al., 2006, 2007; Udall et al., 2007; Yamada et al., 2003). Furthermore, the data obtained from genome tiling microarray studies have consistently documented rich transcriptional activity beyond the annotated set of protein-coding genes. Thus, WGAs are a useful tool for novel gene discovery, as well as for interrogation of a much larger number of transcription units in the course of gene expression studies performed using RNA samples extracted from various plants.
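The idea of a transcriptionally active region can be made concrete with a small sketch. The code below is a generic threshold-and-run procedure of the kind often applied to tiling-array signal, not the method used in the studies cited above; the function names, the threshold, and the `max_gap` and `min_run` parameters are hypothetical choices made for illustration. Probes whose signal exceeds a threshold are merged into one region when they lie close together, and regions that do not overlap any annotated exon are kept as candidate novel TARs.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]


def call_tars(positions: Sequence[int], intensities: Sequence[float],
              threshold: float, max_gap: int = 50,
              min_run: int = 50) -> List[Interval]:
    """Merge above-threshold probes into candidate TARs.

    positions are probe start coordinates sorted in ascending order.
    A returned interval runs from the first to the last probe start
    of a run; runs shorter than min_run are discarded.
    """
    tars: List[Interval] = []
    start = end = None
    for pos, signal in zip(positions, intensities):
        if signal < threshold:
            continue  # probe not considered expressed
        if start is not None and pos - end <= max_gap:
            end = pos  # close enough: extend the current run
        else:
            if start is not None and end - start >= min_run:
                tars.append((start, end))
            start = end = pos  # open a new run
    if start is not None and end - start >= min_run:
        tars.append((start, end))
    return tars


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def novel_tars(tars: List[Interval], exons: List[Interval]) -> List[Interval]:
    """Keep only TARs that overlap no annotated exon."""
    return [t for t in tars if not any(overlaps(t, e) for e in exons)]


# Toy data (hypothetical): probes every 25 bases with arbitrary signal.
positions = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225]
intensities = [1, 9, 8, 9, 1, 9, 9, 1, 1, 9]
tars = call_tars(positions, intensities, threshold=5)
candidates = novel_tars(tars, exons=[(300, 400)])
```

In practice the threshold would be derived from negative-control or background probes, and strand information would be tracked separately so that antisense TARs can be distinguished from sense transcription.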
Alternative splicing (AS) is an important post-transcriptional regulatory mechanism that can increase protein diversity and affect mRNA stability by combining the various splice junctions that are present in pre-mRNA transcripts. In this way, a variety of mRNAs and proteins can be created from the same gene; thus, AS is thought to play a major role in expanding the potential informational content of eukaryotic genomes. A number of different types of AS have been observed, including exon skipping, alternative donor or acceptor sites, and intron retention (Carninci et al., 2005; Kan et al., 2002; Modrek and Lee, 2002; Nagasaki et al., 2005; Reddy, 2007). In Arabidopsis, intron retention has been demonstrated to be the most prevalent type of AS (40%; Ner-Gaon et al., 2004; Reddy, 2007). A study performed by Ner-Gaon and Fluhr (2006) using a novel algorithm on the available transcriptome data from WGAs demonstrated that retained introns are detected in 8% of all the Arabidopsis transcripts examined, and suggested an overall total AS rate of 20% for Arabidopsis compared with 10–22% based on EST/cDNA-based analysis. Together, these data demonstrate that direct transcript expression analysis using WGAs is particularly amenable for assessing global intron retention in Arabidopsis and probably for other systems as well.
Population genomic studies
It is well known that the genomic content of individuals from the same species can vary widely in sequence as a result of diverse evolutionary processes. To identify sequence variants affecting phenotypes among these individuals, comprehensive polymorphism data are required (reviewed in Shiu and Borevitz, 2006). Although direct sequencing of individuals from a population is the most straightforward method for amassing comprehensive polymorphism data, this methodology is only now becoming cost-effective for such experiments in most organisms (Kling, 2005). To circumvent this limitation, Clark et al. (2007) applied an alternative approach of using WGAs for polymorphism detection in Arabidopsis, a strategy that had previously been used to identify a large fraction of the SNP variation in human and mouse (Hinds et al., 2005; Patil et al., 2001). To do this, Clark et al. (2007) isolated DNA samples from 19 wild accessions of A. thaliana, which were selected to maximize the genetic diversity captured in the dataset. To generate sufficient DNA for hybridization, each DNA sample was whole-genome-amplified. The amplified DNA fragments were treated briefly with DNase I to reduce the size of the DNA molecules, thereby making them better targets for hybridization. Next, these DNA targets were end-labeled with biotinylated dUTP and ddUTP using terminal transferase. These labeled DNA fragments were hybridized to WGAs that span the entire Arabidopsis genome with single base resolution on both Watson and Crick strands, using nearly a billion features per hybridization experiment. Analysis of the data obtained from these experiments demonstrates that much of the common sequence polymorphism found in the worldwide A. thaliana population has been captured in this single study.
Furthermore, these data were used to determine in a systematic fashion the types of sequences and genes that differ between accessions, and to provide a high-resolution description of the genome-wide distribution of polymorphism in this multi-cellular reference organism. Specifically, >1 million non-redundant single nucleotide polymorphisms (SNPs) were identified, and approximately 4% of the genome was identified as being highly dissimilar or even deleted relative to the reference (Col-0) genome sequence. Interestingly, the patterns of polymorphism between the 19 wild accessions and the reference genome sequence (Col-0) are highly non-random among gene families, with genes mediating interaction with the biotic environment having exceptional polymorphism levels. Also, regional variation in polymorphism was readily apparent at the chromosomal scale. Overall, this dataset provides an extremely important resource for both evolutionary genetic and functional studies, including mapping genetic mutations obtained in laboratory studies.
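How hybridization signal can be turned into a SNP call is illustrated by the minimal sketch below. It assumes the common resequencing-array design in which each genomic position is queried by four probes that differ only at their central base; the text above does not spell out the array design or the calling algorithm, so the function names, the `min_ratio` cutoff, and the toy intensities are all hypothetical. The base whose probe gives the strongest signal is called, ambiguous positions are left uncalled, and any confident call that differs from the reference is reported as a SNP.

```python
from typing import Dict, List, Sequence, Tuple

BASES = "ACGT"


def call_base(signals: Dict[str, float], min_ratio: float = 1.5) -> str:
    """Call the base with the strongest probe signal, or 'N' if ambiguous.

    A call is accepted only when the best signal exceeds the runner-up
    by at least min_ratio (an illustrative cutoff, not a published one).
    """
    ranked = sorted(BASES, key=lambda b: signals[b], reverse=True)
    best, second = ranked[0], ranked[1]
    if signals[second] > 0 and signals[best] / signals[second] < min_ratio:
        return "N"
    return best


def find_snps(reference: str, probe_sets: Sequence[Dict[str, float]],
              min_ratio: float = 1.5) -> List[Tuple[int, str, str]]:
    """Return (position, reference base, called base) for each SNP."""
    snps = []
    for pos, (ref_base, signals) in enumerate(zip(reference, probe_sets)):
        call = call_base(signals, min_ratio)
        if call != "N" and call != ref_base:
            snps.append((pos, ref_base, call))
    return snps


# Toy data (hypothetical): three positions of an accession hybridized
# against probes designed from the reference sequence "ACG".
reference = "ACG"
probe_sets = [
    {"A": 900, "C": 100, "G": 120, "T": 90},   # matches the reference
    {"A": 100, "C": 150, "G": 110, "T": 800},  # clear C-to-T difference
    {"A": 300, "C": 280, "G": 310, "T": 290},  # ambiguous, left uncalled
]
snps = find_snps(reference, probe_sets)
```

Regions in which many consecutive positions fail to give a confident call are the kind of signal that, in a real analysis, would suggest highly dissimilar or deleted sequence rather than isolated SNPs.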
Furthermore, two other related studies performed by Borevitz et al. (2007) and Kim et al. (2007) also used hybridization of DNA samples from wild accessions of A. thaliana to measure the genetic diversity and polymorphism between individuals of the same species. Interestingly, the study by Borevitz et al. (2007) demonstrated that total and pair-wise diversity was higher near the centromeres and the heterochromatic knob region, which are highly repetitive in nature and (as noted above) contain less active transcription. Furthermore, this study noted that overall diversity between the Arabidopsis accessions was positively correlated with recombination rate. The combined data from these studies (Clark et al., 2007; Kim et al., 2007) provided the requisite information necessary for the production of an Affymetrix® Arabidopsis genotyping array (http://www.affymetrix.com), which contains 250 000 SNPs and is commercially available. Additionally, Kim et al. (2007) provided the first evidence that this SNP array should have more than adequate coverage for future genome-wide association mapping studies in this reference plant, thus providing the community with a framework for further in-depth studies on genetic variation in plants. Taken together, the results of these studies and others like them demonstrate that, in the absence of sequence data for a number of individuals from the same species, population genomic studies can still be carried out successfully using hybridization to arrays (Borevitz et al., 2007; Clark et al., 2007; Kim et al., 2007; Vaughn et al., 2007).
High-resolution mapping of the methylome
The methylation of cytosine bases within DNA molecules (DNA methylation) is a conserved epigenetic silencing mechanism that is involved in many important biological phenomena, including defense against transposon proliferation, genomic imprinting and regulation of gene expression. DNA methylation is a heritable epigenetic modification that has been previously demonstrated to regulate the expression of a number of genes without changes to the DNA sequence (Grewal and Klar, 1996; Kakutani, 2002). The regulation of gene expression by DNA methylation can occur in cis (the gene itself is methylated) or in trans (methylation at another site in the genome regulates the target gene; Alleman and Doctor, 2000; Alleman et al., 2006; Stam and Mittelsten Scheid, 2005). The machinery for establishing and maintaining cytosine DNA methylation has been reviewed recently (Chan et al., 2005; Henderson and Jacobsen, 2007).
Whole-genome tiling microarrays have been used to map sites of DNA methylation within the Arabidopsis genome (Lippman et al., 2005; Martienssen et al., 2005; Zhang et al., 2006; Zilberman et al., 2007). To do this, Zhang et al. (2006) and Zilberman et al. (2007) used a combination of biochemical and molecular biological approaches to map the methylated sequences of the Arabidopsis thaliana genome. For example, Zhang et al. (2006) used an antibody that recognizes methylated cytosine bases within the context of genomic DNA. After fragmenting the genome into smaller pieces, methylated regions could be specifically immunoprecipitated using this antibody. After immunoprecipitation, the DNA fragments were amplified to obtain higher amounts of DNA for subsequent hybridization to WGAs. To minimize artifacts from the amplification process, multiple amplification reactions were performed in parallel, and the products were pooled. The amplified DNA fragments were treated briefly with DNase I to reduce the size of the DNA molecules, making them better targets for hybridization. Next, these DNA targets were end-labeled with biotinylated dUTP and ddUTP using terminal transferase. The labeled DNA fragments were purified over a size-exclusion column, and the purified targets were used for hybridizing to WGAs spanning the entire Arabidopsis genome. Interestingly, the resulting DNA methylation map reveals that almost 19% of the A. thaliana genome is methylated. The data obtained from these experiments demonstrate that regions containing the highest density of DNA methylation are located in highly repetitive regions of the genome. For instance, much of this methylation occurs in heterochromatin, including centromeres and the knob region found on chromosome 4, peri-centromeric repeats of all chromosomes that harbor transposons, and repetitive elements. Surprisingly, a considerable amount of methylation was also distributed in euchromatin. 
Unsurprisingly, the highest levels were seen in pseudogenes and unexpressed genes. However, only approximately 5% of expressed genes contained methylation upstream of their ORFs (promoters), while 33% of the transcribed regions of these genes were methylated (body methylation), consistent with a previous report (Tran et al., 2005). In fact, based on these data, promoter regions generally seem to be hypomethylated. Another surprising finding from these whole-genome studies was that most of the genes that contain DNA methylation within their transcribed regions are highly expressed and constitutively active. Furthermore, these studies demonstrated that the distribution of DNA methylation is clearly different between transposons and genes with annotated function. While DNA methylation of transposons was highly distributed across their entire length, including up and downstream regions, genes with annotated function tended to contain methylation sites with a biased distribution towards their 3′ end. The bias in the expressed genes indicates that methylation might interfere with transcription initiation and termination, which is similar to what is hypothesized to occur in mice (Carninci et al., 2006). Taken together, these findings suggest that the distribution of DNA methylation on a gene may determine how gene expression is affected by this epigenetic marker. Furthermore, these studies highlight the importance of this epigenetic marker in the control of genome dynamics within plants.
The methyl groups added to cytosine bases through the process of DNA methylation are removable through the actions of so-called DNA demethylases (Agius et al., 2006; Gehring et al., 2006; Xiao et al., 2003; Zhu et al., 2007). Arabidopsis has four such DNA demethylases, REPRESSOR OF SILENCING1 (ROS1), DEMETER (DME), DEMETER-LIKE2 (DML2) and DEMETER-LIKE3 (DML3; Agius et al., 2006; Gehring et al., 2006; Zhu et al., 2007). DME is required for genomic imprinting during Arabidopsis embryo development (Choi et al., 2002), while the closely related ROS1 is involved in regulating transcriptional gene silencing in a transgenic background mediated by DNA methylation (Zhu et al., 2007). Recently, whole-genome tiling microarrays were used by Penterman et al. (2007) to map sites of DNA demethylation within the Arabidopsis genome. Using an immunoprecipitation protocol similar to that described for mapping the methylome (see above), this group immunoprecipitated methylated DNA samples from wild-type (WT) plants and from mutant plants lacking three of the DNA demethylases (ROS1, DML2, DML3). After hybridization of these immunoprecipitated samples to WGAs, 179 loci that are actively demethylated by DML enzymes in Arabidopsis were identified. These loci were identified as those showing DNA hypermethylation in triple mutant plants (ros1-1 dml2-1 dml3-1) relative to the wild-type control plants, which suggests locus-specific DNA demethylation mediated by at least one of the three DNA demethylases. Interestingly, demethylation by DML enzymes in gene coding regions primarily occurs at both the 5′ and 3′ ends, a pattern opposite to the overall distribution of DNA methylation. Taken together, these results suggest that DNA methylation is actively removed, and the demethylases are likely to protect genes from potentially deleterious methylation.
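The comparison of mutant and wild-type samples can be sketched as follows. This is a minimal illustration of the general idea, not the analysis pipeline of Penterman et al. (2007); the pseudocount, window size, cutoff, function names, and toy intensities are assumptions chosen for the example. For each probe, the log2 ratio of the mutant to the wild-type immunoprecipitation signal is computed, and runs of adjacent probes whose mean ratio exceeds a cutoff are flagged as candidate hypermethylated loci.

```python
import math
from typing import List, Sequence


def log2_ratios(mutant: Sequence[float], wild_type: Sequence[float],
                pseudocount: float = 1.0) -> List[float]:
    """Per-probe log2(mutant / wild type); the pseudocount avoids log(0)."""
    return [math.log2((m + pseudocount) / (w + pseudocount))
            for m, w in zip(mutant, wild_type)]


def hypermethylated_windows(ratios: Sequence[float], window: int = 3,
                            cutoff: float = 1.0) -> List[int]:
    """Return start indices of sliding windows whose mean ratio >= cutoff.

    Averaging over adjacent probes damps single-probe noise, so only
    consistent gains in methylation signal across a locus are reported.
    """
    hits = []
    for i in range(len(ratios) - window + 1):
        mean = sum(ratios[i:i + window]) / window
        if mean >= cutoff:
            hits.append(i)
    return hits


# Toy data (hypothetical): six adjacent probes, with stronger
# immunoprecipitation signal in the triple mutant at probes 2 to 4.
wild_type = [100, 100, 100, 100, 100, 100]
mutant = [100, 110, 400, 450, 420, 100]
ratios = log2_ratios(mutant, wild_type)
hits = hypermethylated_windows(ratios)
```

A real analysis would also normalize the two arrays against each other and use replicate hybridizations to estimate which differences are statistically reliable.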
Overall, hybridizing to WGAs the DNA samples immunoprecipitated with an antibody that specifically recognizes methyl groups attached to cytosine bases yields a whole-genome view of DNA methylation within plant genomes. These first methylome studies have provided a glimpse of the dynamic nature of this important epigenetic marker, and provide a hint of some of the roles it may be playing in regulating genome function in plants.
Genome-wide mapping of regulatory DNA motifs using ChIP–chip
Gene expression is regulated by a number of mechanisms, including DNA binding transcription factors and the covalent modifications of histone tails. To gain insight into the genome-wide locations of transcription factor (TF) binding sites and regions of histone modifications, the practice of chromatin immunoprecipitation (ChIP) with an antibody specific to the protein or modification of interest followed by hybridization of these products to WGAs (ChIP–chip) has emerged (Bernstein et al., 2006; Bulyk, 2006; Hudson and Snyder, 2006; Li et al., 2007; Wu et al., 2006). Interestingly, the major limiting factor of this technique is the quality of the antibody used in immunoprecipitation of the DNA-bound protein of interest. A high-quality antibody is required to achieve proper enrichment of protein-bound DNA fragments for hybridization to WGAs. Recently, Thibaud-Nissen et al. (2006) used antibodies specific for the transcription factor TGA2 to immunoprecipitate chromatin associated with this DNA-binding protein after treatment of plants with the phytohormone salicylic acid (SA). The immunoprecipitated DNA molecules were amplified to obtain enough material for hybridization to WGAs. The amplification products were digested with DNase I and end-labeled with biotinylated ddATP using terminal transferase. These labeled targets were then hybridized to two distinct types of WGA. The first platform contained 190 000 probes representing 2 kb regions upstream of all annotated genes, at a density of seven probes per promoter, and the other platform was divided into three chips, each of over 390 000 features, that represent the entire Arabidopsis genome at a density of one probe per 90 bases. Analyses of the results obtained from these ChIP–chip experiments provided evidence for 51 putative binding sites for TGA2, including the only previously confirmed site in the promoter of PR-1 (At2g14610). 
Furthermore, 15 of the putative binding sites for TGA2 lie outside presumptive promoter regions. The effect of the SA treatment on gene expression was measured using standard gene expression arrays, and SA-induced genes were found to be significantly over-represented among genes neighboring putative TGA2-binding sites. Therefore, ChIP–chip experiments performed on WGAs in combination with gene expression studies can provide clues as to which regulatory networks are controlled by specific transcription factors within plant cells. Thus, as the number of ChIP–chip experiments utilizing WGAs for mapping transcription factor binding sites increases, a much more complete view of the transcriptional networks controlling plant growth and development will soon emerge.
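One common way to ask whether induced genes are over-represented near binding sites is a one-sided hypergeometric test, sketched below. The source does not state which statistical test was used, so this is an illustration of the reasoning rather than a reconstruction of the published analysis, and every count in the example is hypothetical. The test gives the probability of finding at least k induced genes among n genes neighboring binding sites if those n genes had been drawn at random from a genome of N genes, K of which are induced.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N genes, K induced, n sampled).

    Exact integer arithmetic is used for the binomial coefficients,
    so no external statistics library is required (Python 3.8+).
    """
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, upper + 1))
    return favorable / total


# Hypothetical counts for illustration only: 25 000 genes on the
# expression array, 500 of them SA-induced, 60 genes neighboring
# putative binding sites, 10 of which are SA-induced.
p_value = hypergeom_sf(k=10, N=25_000, K=500, n=60)
```

A small p-value under these assumptions would indicate that induced genes cluster near the putative binding sites more often than expected by chance, which is the kind of evidence that links a ChIP–chip binding map to an expression response.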
Recently, the covalent addition of methyl groups to lysine or arginine residues of histone tails (histone methylation) has emerged as another crucial step in controlling eukaryotic genome dynamics (Bernstein et al., 2006; Chanvivattana et al., 2004; Gendrel et al., 2005; Kinoshita et al., 2001; Lee et al., 2006; Schubert et al., 2005). For instance, tri-methylation of lysine 27 of histone H3 (H3K27me3) plays critical roles in regulating animal development (Bernstein et al., 2006; Boyer et al., 2006; Lee et al., 2006). Furthermore, H3K27me3 has been demonstrated to regulate several genes that are important for proper development in plants (Chanvivattana et al., 2004; Kinoshita et al., 2001; Schubert et al., 2005; Yadegari et al., 2000). Recently, chromosome- and genome-wide profiling of H3K27me3 in Arabidopsis was carried out using tiling microarrays (Turck et al., 2007; Zhang et al., 2007a). To do this, DNA cross-linked with formaldehyde was incubated in the presence of a polyclonal antibody to H3K27me3 (α-H3K27me3). The ChIP sample was then amplified, labeled, and hybridized to tiling arrays. Interestingly, it was determined that H3K27me3 is a major silencing mechanism in plants, regulating an unexpectedly large number of Arabidopsis genes located in mostly euchromatic regions of the genome (Turck et al., 2007; Zhang et al., 2007a). Furthermore, analysis of the H3K27me3 profiles suggested that establishment and maintenance of this epigenetic modification is largely independent of other epigenetic pathways, such as DNA methylation or RNA silencing. 
Interestingly, using ChIP–chip (Turck et al., 2007) and DamID-chip (Zhang et al., 2007b) it was found that genomic domains marked by H3K27me3 associate almost exclusively and co-extensively with binding sites for the protein TERMINAL FLOWER 2/LIKE HETEROCHROMATIN PROTEIN 1 (TFL2/LHP1), which is the only Arabidopsis protein that shares overall sequence similarity to the HETEROCHROMATIN PROTEIN 1 (HP1) family of metazoans and Schizosaccharomyces pombe. Additionally, results from this study demonstrated that the distribution of H3K27me3 is unaffected in lhp1 mutant plants. Overall, these results suggest that TFL2/LHP1 is not involved in the deposition of this chromatin modification, but is part of a mechanism that represses the expression of many genes that are marked with H3K27me3. Therefore, the use of genomic tiling microarrays to study a variety of epigenetic regulatory pathways in plants and animals has suggested an extremely complex network of mechanisms involved in the regulation of genome dynamics. Thus, ChIP–chip and DamID-chip experiments can be used for studying regulatory DNA motifs in plants, and will aid in obtaining an appreciation for the complex nature of regulatory mechanisms employed in controlling genome dynamics in these sessile organisms.