Genome-wide analyses of epigenomic and transcriptomic profiles provide extensive resources for discovering epigenetic regulatory mechanisms. However, the construction of functionally relevant hypotheses from correlative patterns and the rigorous testing of these hypotheses may be challenging. We combined bioinformatics-driven hypothesis building with mutant analyses to identify potential epigenetic mechanisms using the model plant Arabidopsis thaliana. Genome-wide maps of nine histone modifications produced by ChIP-seq were used together with a strand-specific RNA-seq dataset to profile the epigenome and transcriptome of Arabidopsis. Combinatorial chromatin patterns were described by 42 major chromatin states with selected states validated using the re-ChIP assay. The functional relevance of chromatin modifications was analyzed using the ANchored CORrelative Pattern (ANCORP) method and a newly developed state-specific effects analysis (SSEA) method, which interrogates individual chromatin marks in the context of combinatorial chromatin states. Based on results from these approaches, we propose the hypothesis that cytosine methylation (5mC) and histone methylation H3K36me may synergistically repress production of natural antisense transcripts (NATs) in the context of actively expressed genes. Mutant analyses supported this proposed model at a significant proportion of the tested loci. We further identified polymerase-associated factor as a potential repressor for NAT abundance. Although the majority of tested NATs were found to localize to the nucleus, we also found evidence for cytoplasmically partitioned NATs. The significance of the subcellular localization of NATs and their biological functions remain to be defined.
Epigenomes of eukaryotes consist of a complex ensemble of molecular patterns that are spatially distributed over the chromatin, a subset of which are highly correlated with transcription outputs (Liu et al., 2005; Filion et al., 2010; Roudier et al., 2011). Mapping of histone modifications and DNA methylation in multiple organisms has led to similar observations: the patterns of many chromatin marks are highly correlated and may be summarized as a few chromatin states using fuzzy statistical methods (Liu et al., 2005; Filion et al., 2010; Roudier et al., 2011). The correlated nature of chromatin modifications has presented the opportunity for deciphering the relationship between epigenomic and transcriptomic features through bioinformatic analyses. However, many previous studies of plant epigenomics were largely correlative, and did not generate explicitly testable hypotheses or perform critical validations using genetic or biochemical approaches.
A considerable amount of work in this field has focused on the regulation of protein-coding gene expression by chromatin. However, eukaryotic genomes are known to produce a variety of small and large non-coding RNAs (Ponting et al., 2009). In recent years, long non-coding RNAs (lncRNAs) have attracted much interest because of their potential roles in regulating gene expression and chromatin-dependent functions (Ponting et al., 2009). lncRNAs may originate from intergenic regions as independent transcription units (TUs), or may appear as cognate sense or antisense transcripts of protein-coding genes (Rinn et al., 2007; Guttman et al., 2009; Heo and Sung, 2011). The cognate antisense lncRNAs are commonly referred to as natural antisense transcripts (NATs) (Rinn et al., 2007; Swiezewski et al., 2009). A series of case studies using mammalian and plant systems has suggested that NATs either positively or negatively modulate expression of the cognate loci (Faghihi and Wahlestedt, 2009). The repressive functions of NATs may be mediated by deposition of DNA methylation or repressive histone marks, generation of siRNA species, or interference with translation processes (Tufarelli et al., 2003; Borsani et al., 2005; Ebralidze et al., 2008; Yu et al., 2008; Modarresi et al., 2012). Several NATs have been shown to have regulatory importance in plants. The Arabidopsis P5CDH-SRO5 antisense overlapping gene pair produced a 24 nt siRNA and caused post-transcriptional silencing of the P5CDH gene (Borsani et al., 2005). The NAT associated with the Arabidopsis FLC locus (COOLAIR) was capable of causing cold-induced silencing when coupled with a reporter gene (Swiezewski et al., 2009), although further analysis may be necessary to elucidate the role of COOLAIR in epigenetic silencing of FLC (Helliwell et al., 2011). Interestingly, NATs may also positively regulate the abundance of sense transcripts. For example, the NAT associated with the mammalian BACE1 locus has been shown to form a duplex with BACE1 mRNA that results in an increase in its stability (Faghihi et al., 2008). Taken together, NATs may potentially interact with the production of sense transcripts through diverse mechanisms. However, the pervasiveness of such regulation is difficult to assess due to the small number of examples that have been investigated. Although various kinds of lncRNA, including NATs, have been found to be functionally relevant, it remains possible that a proportion of NATs are the consequence of ‘noisy’ transcription activities, and thus may not have significant biological functions (Werner and Berdal, 2005).
In the past decade, some efforts have been made to define the global pattern of NATs in plant genomes using cDNA clones and whole-genome tiling arrays (Yamada et al., 2003; Stolc et al., 2005; Li et al., 2006; Matsui et al., 2008). Approximately 30% of the annotated genes were estimated to associate with NATs in both Arabidopsis and rice (Yamada et al., 2003; Stolc et al., 2005; Li et al., 2006; Matsui et al., 2008). A few pathways, including those involved in microRNA biogenesis and nonsense-mediated mRNA decay, have been implicated in the regulation of NAT accumulation at several hundred loci in the Arabidopsis genome (Kurihara et al., 2009; Luo et al., 2009). However, the pervasiveness of NATs and the known complexities of gene regulation mechanisms suggest that diverse pathways may be involved in regulating NATs at either the transcriptional or post-transcriptional levels. Chromatin-associated regulators are particularly attractive candidates, because sense and antisense transcription activities may share a common chromatin context that is frequently polarized along TUs. As it is generally believed that chromatin modifications across TUs are established to support and perhaps optimize sense transcription activities (Li et al., 2007), it is intriguing to determine how polarized chromatin configurations may also affect antisense transcription in the genome.
In the present work, we used the ANchored CORrelative Pattern (ANCORP) method and a newly developed informatics tool called state-specific effects analysis (SSEA) to integrate epigenomic and transcriptomic datasets of Arabidopsis (Luo and Lam, 2010). We aim to extend our previous analyses in two respects. First, the generation of hypotheses and subsequent testing with mutant analyses are stressed in the present work. The highly abstract chromatin state models reported by other studies were useful in describing the global structure of the Arabidopsis epigenome (Roudier et al., 2011). However, this approach may be less informative for predictive purposes due to its emphasis on spatial rather than quantitative information. Therefore, we have produced a more refined chromatin state model by defining 42 major states from 10 chromatin modifications. The resulting chromatin state information was then used for prediction of functional relevance by SSEA. Second, taking advantage of the relatively new technique of strand-specific RNA-seq (Levin et al., 2010), we performed genome-wide correlation analyses between chromatin states and NAT abundance. We discovered clear correlations between certain chromatin states and the levels of NATs, leading to the hypothesis that antisense transcripts are in part regulated by chromatin modifications. Further testing of this hypothesis with genetic mutants showed that 5mC and H3K36me2 synergistically regulate NAT abundance at a significant percentage of the NAT-producing loci. The analysis also identified polymerase-associated factor (PAF) as a potential regulator of NAT production. Some of the tested NATs were found to accumulate in the nucleus, suggesting that these transcripts may participate in the regulatory network of chromatin states and gene expression, while other NATs that apparently localized to the cytoplasm may have cryptic functions that remain to be defined.
Determination of chromatin states using global profiles of 10 chromatin modifications
Whole-genome profiles of nine histone modifications in the aerial tissue of 2-week-old Arabidopsis plants were produced using the ChIP-seq approach (Table S1). The data may be browsed using the GBrowse interface at http://epigenome.rutgers.edu. The patterns for three histone modifications (H3K4me3, H3K27me3 and H3K36me2) with distinct distributions were compared with the results of a previous ChIP–chip study performed with similar plant materials (Figure S1) (Oh et al., 2008). The global Pearson correlation coefficients were 0.63, 0.60 and 0.54 for H3K4me3, H3K27me3 and H3K36me2, respectively, between the ChIP-seq and ChIP–chip datasets. Thus the data produced by the two technology platforms were well correlated. The correlation between ChIP-seq and ChIP–chip data was also supported by visually browsing the patterns (Figure S1). Our analyses also incorporated a profile of DNA methylation (5mC) produced by methylated DNA immunoprecipitation (MeDIP)-chip from similar Arabidopsis tissues to enable a more comprehensive integration of epigenomic information (Zilberman et al., 2007). To correlate the epigenome with transcription outputs, strand-specific RNA-seq was performed using leaf tissue of 3-week-old Arabidopsis plants. Although the plant tissues used for ChIP-seq and RNA-seq were not identical, the correlative analysis between the epigenome and transcriptome is likely to be valid as previous work has shown that rosette leaves at different stages exhibit highly similar transcriptome profiles (Schmid et al., 2005).
To enable a higher-order integration of epigenomic data, chromatin states were assigned to each of more than 30 000 annotated TUs as 10-digit binary codes (Table S2a). To this end, the ‘peaks’ of each chromatin mark were first identified by Model-based Analysis of ChIP-Seq (MACS) at P < 10⁻⁵ (Zhang et al., 2008). To accurately determine chromatin states in the highly compact Arabidopsis genome, histone modification peaks were assigned to genes, allowing no interval between the peaks and annotated genic regions. The sets of genes found to be associated with peaks of H3K4me3, H3K36me2 or H3K27me3 were highly similar (> 88%) when allowing either zero or up to 500 bp intervals (Figure S2). This result suggests that the vast majority of histone modification peaks are found within the annotated genic regions rather than the intergenic regulatory sequences for Arabidopsis. The global structure of the epigenome was visualized by bi-directional clustering of the chromatin states (Figure 1a). Of the 1024 (2¹⁰) theoretically possible chromatin states, only 295 were observed. Forty-two states were found to associate with more than 100 genes (major states), while 186 states associated with fewer than 10 genes (Table S2b). The relatively small number of identified chromatin states is consistent with other reports, and is probably caused by extensive concurrences and exclusivities between individual modifications (Liu et al., 2005; Roudier et al., 2011). For example, H3K4me3 and H3K36me3 co-localized at 12 695 genes. However, only 1510 and 771 genes solely associated with either H3K4me3 or H3K36me3, respectively. Chromatin modifications that appear to be mutually exclusive, such as H3K27me3 and H3K36me2, further restricted the appearance of certain states. H3K27me3 and H3K36me2 co-localized at only 128 genes even though they individually mark 6309 and 7940 genes, respectively. Thus, the absence or under-representation of certain chromatin states may provide information on synergistic or antagonistic roles between chromatin marks. The global relationships between chromatin marks were further demonstrated by a matrix of pairwise Pearson correlation coefficients between modifications along chromosome arms (Figure 1b). Consistent with partitioning of the genome into euchromatin and heterochromatin, two dominant clusters were observed, consisting of euchromatic and heterochromatic chromatin marks, respectively (Figure 1b). Intriguingly, a strong correlation was detected between H3K18Ac and H3K27me3 in the Arabidopsis genome (Pearson r = 0.44, Figure 1b and Figure S4), although histone acetylations are not expected to co-localize with a repressive mark such as H3K27me3. The functional relevance of the co-existence of these two marks is unclear at this time.
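The binary state coding described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the order of marks in `MARKS`, the placeholder name `other_histone_mark` (the text does not list all nine histone modifications), the function names, and the use of closed single-chromosome coordinates are all assumptions made for illustration. Peaks are assumed to have been called already (for example by MACS).

```python
from collections import Counter

# Assumed mark order; "other_histone_mark" is a placeholder for the ninth
# histone modification, which is not named in the text.
MARKS = ["H3K4me3", "H3K9Ac", "H3K18Ac", "H3K36me2", "H3K36me3",
         "H3K27me1", "H3K27me3", "H3K9me2", "other_histone_mark", "5mC"]


def overlaps(peak, gene, max_gap=0):
    """True if a peak (start, end) lies within max_gap bp of a gene (start, end)."""
    return peak[0] <= gene[1] + max_gap and peak[1] >= gene[0] - max_gap


def state_code(gene, peaks_by_mark, marks=MARKS, max_gap=0):
    """Return one binary digit per mark: '1' if any peak of that mark hits the gene."""
    return "".join(
        "1" if any(overlaps(p, gene, max_gap) for p in peaks_by_mark.get(m, [])) else "0"
        for m in marks
    )


def summarize_states(codes, major_min=100):
    """Count genes per state; 'major' states have more than major_min genes."""
    counts = Counter(codes.values())
    major = {state for state, n in counts.items() if n > major_min}
    return counts, major


# Toy data: two genes and two peaks on a single chromosome.
genes = {"g1": (100, 500), "g2": (1000, 1500)}
peaks = {"H3K4me3": [(90, 200)], "5mC": [(1200, 1300)]}
codes = {name: state_code(coords, peaks) for name, coords in genes.items()}
counts, major = summarize_states(codes, major_min=0)
```

With 10 marks the code space has 2¹⁰ = 1024 possible strings, so `len(counts)` on real data would give the number of observed states. The `max_gap` parameter corresponds to the zero versus 500 bp interval comparison mentioned in the text.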
Validation of the physical co-existence of histone modifications by re-ChIP assays
The use of tissues with mixed cell types for our present ChIP-seq work raised the possibility that the apparent concurrence of two chromatin marks results from superimposition of chromatin patterns from two or more different cell types. The physical co-existence of chromatin marks was tested by the re-ChIP approach for selected chromatin states. For all three active TUs tested, re-ChIP for H3K4me3 + H3K9Ac or H3K4me3 + H3K36me3 yielded a comparable amount of signal as re-ChIP for H3K4me3 + H3K4me3 (Figure S3a–c). This result suggests that the physical co-localization of these three chromatin marks at the transcription start site is indeed a common property of actively expressed genes. In addition, our re-ChIP analysis supported co-localization of H3K18Ac and H3K27me3 at all four loci tested (Figure S3d–g).
Chromatin that is simultaneously modified with H3K4me3 and H3K27me3, hereafter referred to as the ‘K4/K27me3’ bivalent state, pervasively associates with developmental regulator genes in mammalian embryonic stem cells (Bernstein et al., 2006; Mikkelsen et al., 2007). A total of 378 genes were found to associate with the putative K4/K27me3 bivalent state in our analysis (Table S3). Bi-directional re-ChIP was performed for 10 of these putative bivalent loci by swapping the antibodies used in the first and second round of ChIP (Figure 1c,d). The specificity of the re-ChIP assay was empirically controlled by analyzing H3K4me3 or H3K27me3 monovalent regions (Figure 1c,d). Five of the 10 tested loci were found to be truly bivalent in the H3K4me3 + H3K27me3 re-ChIP study, including the previously known bivalent locus FLOWERING LOCUS C (Figure 1c) (Jiang et al., 2008). However, the bivalency of two of these five loci was not supported by the reciprocal H3K27me3 + H3K4me3 re-ChIP (Figure 1d). These two loci may be marginally bivalent and thus cannot be consistently detected. We noted that the H3K4me3 + H3K27me3 or reciprocal re-ChIP consistently recovered a smaller percentage of the chromatin compared to H3K4me3 + H3K4me3 or H3K27me3 + H3K27me3 re-ChIP, even for loci found to be truly bivalent (Figure 1c,d). Thus H3K4me3 and H3K27me3 may co-localize at these loci only in a fraction of the cells in aerial tissues of mature plants.
Gene ontology (GO) term and locus type enrichment in finely segregated chromatin states
Inclusion of 10 chromatin modifications in the present analysis enabled us to assess the gene function and locus type enrichments in finely divided chromatin states. Gene ontology (GO) and locus type analyses were performed for each of the 42 major states, and are summarized in Table S4(a,b). The analysis of locus type with chromatin states identified striking enrichment patterns for various kinds of non-coding RNAs. The vast majority of annotated pre-tRNAs (93%), miRNAs (85%) and snoRNAs (86%) were remarkably restricted to only a few states (no more than three), as shown in Figure 2. Importantly, 11 (6%) and 45 (26%) miRNA-encoding genes were modified by H3K4me3 or H3K27me3, respectively, suggesting that antagonism between the two histone modifications may modulate the transcription of a number of miRNA genes (Figure 2). To test this speculation, we estimated the expression level of miRNA genes by counting RNA-seq tags mapped in the predicted pre-miRNA (foldback) regions. miRNA genes marked with H3K4me3 were expressed at significantly greater levels compared to those marked with H3K27me3 (P = 1.1 × 10⁻⁵) or with no significant modification (P = 7.4 × 10⁻⁶) (Figure S5a). The potential regulation of miRNA genes by H3K4me3 or H3K27me3 marks was further supported by the differential marking of miRNA genes by the two histone modifications between plant tissues: 42% of miRNA genes associated with distinct H3K4me3 or H3K27me3 states between seedlings and roots (P = 1.4 × 10⁻⁴) (Table S4d,e). Compared to other locus types and GO terms, a large fraction of miRNA genes associated differentially with the two chromatin marks between seedlings and roots. Therefore, the results suggest that antagonism between H3K4me3 and H3K27me3 may play pivotal roles in modulating miRNA gene expression during plant development.
As expected, transposable elements were predominantly found in states enriched in 5mC, H3K9me2 and H3K27me1 marks, either individually or in various combinations. Differential enrichments of certain transposon super-families were observed between chromatin states (Table S4f). Notably, LTR/Gypsy retrotransposons were enriched in State 2 of the transposable element panel (Figure 2 and Table S4f), constituting 76% of all transposons found in this state (Table S4f). Consistent with the expectation that retroelements are more enriched in the pericentromeric regions compared to DNA transposons (Arabidopsis Genome Initiative, 2000), transposons in State 2 were located significantly closer to the centromere than those in other states (Figure S5b). Therefore, the results imply that co-marking of H3K9me2 and H3K27me1, together with cytosine methylation, is characteristic of pericentromeric transposable elements.
Determination of NATs using RNA-seq data
We previously hypothesized that antisense transcription activities may be modulated by the H3K36me2 mark and DNA methylation of the gene body (Luo and Lam, 2010). To test this hypothesis, we analyzed the global distribution of NATs by performing strand-specific RNA-seq on total RNA that was depleted of ribosomal RNAs. The experiment generated 5.6 million non-redundant tags that matched unique positions. The global ratio between antisense and sense tags in exons was 0.01 for our dataset, similar to the ratio of 0.008 reported for a strand-specific RNA-seq study using floral tissue (Lister et al., 2008). We first applied a stringent definition for NATs, requiring consecutive regions longer than 90 bp covered with antisense tags, allowing a maximum gap of 20 bp between adjacent tags (Table S5a). Of the 4939 putative NAT domains identified, 3043 domains were not caused by tail-to-tail protein-coding genes and had a mean length of 200 bp (Table S5a). These NAT domains mapped to 1302 annotated genes. To test whether these NAT domains were reproducibly detected in an independent experiment, we analyzed a published strand-specific RNA-seq dataset prepared using oligo(dT)-enriched RNA (Deng et al., 2010). This dataset contained 3.3 million non-redundant and uniquely mapped tags. The two RNA-seq experiments produced highly correlated results with regard to both NAT domains and genes associated with these domains (Figure 3a,b). Sequencing of the oligo(dT)-enriched RNA identified approximately 40% more NAT domains despite a lower sequencing depth, suggesting that a large portion of the NATs may be polyadenylated (Figure 3a). This result is consistent with a previous study showing that a few thousand NATs may be detected by whole-genome tiling array using cDNAs synthesized with oligo(dT) primers (Matsui et al., 2008).
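The stringent and relaxed NAT definitions can be expressed as a short sketch. This is an illustration under stated assumptions, not the authors' code: antisense tags are assumed to be given as half-open `(start, end)` intervals for one strand of one locus, the 36 bp tag length in the example is hypothetical, and the function names are invented here.

```python
def nat_domains(tags, min_len=90, max_gap=20):
    """Merge antisense tags into domains (stringent definition).

    Adjacent tags are joined when the gap between them is at most max_gap bp;
    a merged region is reported only if it is longer than min_len bp.
    """
    domains = []
    cur_start = cur_end = None
    for start, end in sorted(tags):
        if cur_start is None:
            cur_start, cur_end = start, end
        elif start - cur_end <= max_gap:
            # Close enough to the current region: extend it.
            cur_end = max(cur_end, end)
        else:
            # Gap too large: close the current region and start a new one.
            if cur_end - cur_start > min_len:
                domains.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None and cur_end - cur_start > min_len:
        domains.append((cur_start, cur_end))
    return domains


def is_relaxed_nat(tags_in_gene, min_tags=2):
    """Relaxed definition: at least two antisense tags within an annotated gene."""
    return len(tags_in_gene) >= min_tags


# Three nearby tags merge into one 126 bp domain; the isolated tag is dropped.
example = nat_domains([(0, 36), (40, 76), (90, 126), (200, 236)])
```

In practice the tags would be grouped by chromosome and strand before this step, and domains attributable to tail-to-tail protein-coding gene pairs would be filtered out afterwards, as described in the text.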
The number of NATs that we detected was substantially smaller than that reported in several previous studies performed with microarray platforms (Yamada et al., 2003; Matsui et al., 2008). The under-estimation of NAT frequency was probably caused by the discontinuous nature of the RNA-seq data and the relatively low abundance of these RNA species at the current depth of sequencing. To discover additional NATs, we applied a more relaxed definition of NATs, requiring at least two antisense tags to be detected within the region of an annotated gene (Table S6a). The two RNA-seq experiments again produced highly correlated gene sets that associated with NATs defined by the relaxed definition (Figure 3c). Taking advantage of the apparent correlation between the two RNA-seq experiments, we combined the two datasets to improve the overall sequencing depth (Tables S5c and S6c). The NATs detected from the combined data with either stringent or relaxed criteria were compared with the results generated by the genome-wide tiling array (Figure S6) (Matsui et al., 2008). Although the correlation between RNA-seq and tiling array experiments was significant (Figure S6), approximately one-half to two-thirds of the NATs identified were represented in only one experiment. Thus global mapping of NATs using the two technical platforms showed substantial divergence. This may reflect differential technical limitations inherent in these methods that are exacerbated by the overall low levels of NATs (approximately 1%) compared to sense transcripts.
The global pattern of NATs across annotated TUs was characterized by plotting the frequency of antisense tags across scaled TUs (Figure 3d). For genes associated with stringently defined NATs, antisense tags were predominantly localized within the transcribed region, with a notable depletion at the transcription start site (Figure 3d). Compared to genes showing no evidence of NATs, NAT-associated genes showed substantially greater antisense signals at the transcription termination site (Figure 3d). Therefore, promoter activities in downstream intergenic spaces may be responsible for certain antisense transcripts, as shown for some characterized examples such as COOLAIR (Swiezewski et al., 2009). We next analyzed the relationship between the abundance of sense and antisense transcripts. Tag counts from both strands had a global Spearman correlation coefficient of 0.24 (P < 2.2 × 10⁻¹⁶), suggesting that transcription activities from both strands are positively correlated. After categorizing all genes into five bins according to their sense RNA quantity, we found that actively expressed bins have higher frequencies of NATs defined either by the stringent or relaxed definitions (Figure 3e). This result indicates that NAT production is coupled to sense transcription for a subset of expressed genes in the genome.
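The two summaries used in this paragraph, a Spearman rank correlation between sense and antisense tag counts and the NAT frequency within five expression bins, can be sketched as follows. The sketch assumes equal-sized bins ordered by sense abundance, which the text does not specify, and all function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values receiving the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def nat_frequency_by_bin(sense, has_nat, n_bins=5):
    """Fraction of NAT-associated genes in each bin of increasing sense abundance."""
    order = sorted(range(len(sense)), key=lambda i: sense[i])
    freqs = []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        members = order[lo:hi]
        freqs.append(sum(has_nat[i] for i in members) / len(members) if members else 0.0)
    return freqs
```

Here `sense` would hold per-gene sense tag counts and `has_nat` a 0/1 flag per gene under either NAT definition; a rising trend across the returned list corresponds to the pattern described for Figure 3e.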
Predicting chromatin modification functions by state-specific effects analysis (SSEA) in the context of chromatin states
Correlative analyses examining individual chromatin modifications together with genome outputs are useful to estimate the impact of a particular chromatin mark. For example, assuming the analysis concerns chromatin modification X, correlative analyses commonly compare all genes modified with X to the rest of the gene space to deduce its function, without considering the specific chromatin context such as the presence of other chromatin marks (Figure 4a). We refer to this type of analysis as global effect analysis. However, the influence of, and interactions between, multiple chromatin marks on a locus are unaccounted for. To address this challenge, we developed state-specific effects analysis (SSEA) to harness more complex chromatin state information and improve functional predictions. SSEA was designed to interrogate the functions of a particular chromatin mark X in distinct chromatin state contexts (e.g. circles labeled with A, B or C in Figure 4a,b) defined by an ensemble of other chromatin modifications (Figure 4b). For example, analysis of H3K36me2 between 11 control and positive state pairs was performed to assess the state-specific effect of H3K36me2 in various chromatin contexts (Figure 4c). Each row in the two panels of Figure 4(c) describes two chromatin states with a similar ensemble of 10 chromatin marks as indicated on the top. The states in the left panel (control) differ from those in the right panel (positive) only in that the control states all lack significant levels of H3K36me2 (Figure 4c). The numbers of genes associated with each state are shown to the side of the panels, and only the 42 major states associated with more than 100 genes were included in this analysis (Figure 4c). As the states of the nine chromatin modifications other than H3K36me2 were identical between the control and positive state sets, the potential impact resulting from functional interactions between the H3K36me2 mark and other modifications may be visualized by SSEA. In Figure 4(d), the control and positive chromatin states from each row in Figure 4(c) are compared for their mean levels of chromatin marks and genome outputs (e.g. sense and antisense transcript abundances). The log2 value for the ratio between positive and control states, which is referred to as the state-specific effect (SSE) for each measurement, is presented as a heat map, as shown in the corresponding rows of Figure 4(d). The mean abundance of chromatin modifications was determined separately for the transcription start site and 3′ gene bodies to account for potential location-dependent functions of these modifications. For global effect analysis, all genes with significant enrichment of H3K36me2 are compared to the genes without it. In SSEA, only the genes that associate with the control and positive state pair in a single row of Figure 4(c) are compared. The significance of each reported ratio was assessed by comparison with 10 000 permutations that randomly assigned genes from the whole gene space into the control or positive states, while maintaining the number of genes associated with each state identical to the actual dataset (Figure S7).
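The SSEA procedure can be summarized in a minimal sketch: pair each control state with the positive state that differs only in the mark of interest, compute the log2 ratio of mean values, and compare it with random gene assignments of the same sizes. The names, the small pseudocount that guards against zero means, the two-sided comparison, and the add-one correction of the empirical P-value are assumptions introduced here; the text does not describe these details.

```python
import math
import random


def state_pairs(genes_by_state, mark_idx, min_genes=100):
    """Pair states that differ only at mark_idx (control '0', positive '1').

    Both states of a pair must be associated with more than min_genes genes.
    """
    pairs = []
    for state, genes in genes_by_state.items():
        if state[mark_idx] != "0" or len(genes) <= min_genes:
            continue
        positive = state[:mark_idx] + "1" + state[mark_idx + 1:]
        if len(genes_by_state.get(positive, [])) > min_genes:
            pairs.append((state, positive))
    return pairs


def sse(values, control, positive, pseudo=1e-9):
    """State-specific effect: log2 of mean(positive) / mean(control)."""
    def mean(genes):
        return sum(values[g] for g in genes) / len(genes)
    # The pseudocount is an assumption, added to avoid log2(0) or division by zero.
    return math.log2((mean(positive) + pseudo) / (mean(control) + pseudo))


def permutation_p(values, control, positive, n_perm=10000, seed=0):
    """Empirical two-sided P-value from random, size-matched gene assignments."""
    rng = random.Random(seed)
    observed = sse(values, control, positive)
    pool = list(values)  # the whole gene space
    n_c, n_p = len(control), len(positive)
    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(pool, n_c + n_p)
        if abs(sse(values, draw[:n_c], draw[n_c:])) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Here `values` maps each gene to one measurement (for example antisense tags per gene length) and `genes_by_state` maps each binary state code to its gene list. Running `sse` once per state pair and per measurement would yield one row of a heat map like the one described for Figure 4(d).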
By applying SSEA to the RNA-seq dataset, we identified an intriguing correlation between H3K36me2 and antisense RNA tags. Global effect analysis suggested that H3K36me2 is supportive for antisense RNA with a log2 positive/control value of 0.57 (column indicated by an arrow in Figure 4d). However, a more sophisticated relationship between the two features was revealed by SSEA. SSEA identified three state pairs that show positive SSEs and eight state pairs that show negative SSEs with regard to the quantity of antisense RNA tags per gene length (column indicated by an arrow in Figure 4d). Interestingly, the state pairs showing positive SSEs all had no H3K4me3 enrichment, whereas state pairs showing negative SSEs were consistently modified by H3K4me3 (Figure 4c,d). Thus the results suggest that the state of H3K4me3 may determine the role that H3K36me2 plays in regulating antisense RNA abundance. H3K36me2 may repress antisense RNA production in H3K4me3-enriched states but positively modulate antisense RNA levels in states lacking significant H3K4me3 marks.
NAT abundance is correlated with chromatin states
The correlation between transcription and chromatin states was assessed by plotting transcripts from both strands as correlative patterns of clustered chromatin states (Figure S8). In addition, chromatin states were ordered according to the abundance of sense transcripts or NATs (Figure 5a,b and Table S7a,b). The analysis of NATs and chromatin states identified two types of chromatin states that were enriched with antisense transcripts. Consistent with our finding that NATs were more frequently associated with actively expressed genes, states associated with one or multiple active marks were found at the top of the chart (e.g. 1st, 3rd, 5th and 6th states, Figure 5b). SSEA revealed that H3K36me2 was positively correlated with NAT abundance in the absence of H3K4me3 enrichment. In line with this observation, the states modified with H3K36me2 alone (H3K36me2) or together with 5mC (H3K36me2 + 5mC) were enriched in NATs (i.e. 2nd and 4th states, Figure 5b). The results were confirmed by determining the percentage of genes associated with stringently defined NATs within each state (Table S7c). As the H3K36me2 and H3K36me2 + 5mC states associated with a relatively small number of genes, it was essential to test whether these observations are caused by mis-annotation of gene models. By manual examination, we found that 83% (53/64) and 91% (31/34) of gene models in the H3K36me2 and H3K36me2 + 5mC states, respectively, were supported by at least one representative in the available cDNA or RNA-seq datasets (Table S7d). The observation that H3K36me2 and H3K36me2 + 5mC states associated with a low level of sense transcripts but a high level of antisense transcripts may seem contradictory to the global positive correlation between sense and antisense transcripts (Figure 3e). However, substantial deviations from the global trends were expected as the correlation between sense and antisense transcript abundance was moderate and thus not strictly observed (Spearman r = 0.24).
The correlation between H3K36me2, 5mC and NATs in H3K4me3-modified states was further visualized, and the results are presented in Figure 5(c–h). States 2, 3 and 4 in Figure 5(c), which were modified with H3K36me2 and/or 5mC, associated with fewer antisense signals compared to State 1 (Figure 5e), but only a modest difference in sense transcript abundance was observed between genes in these four chromatin states (Figure 5d and Figure S9a). Our analysis also showed that multiple active histone marks such as H3K4me3, H3K9Ac and H3K18Ac were significantly more depleted in gene bodies for genes in States 2, 3 and 4 compared to those in State 1 (Figure 5f–h and Figure S9b–d). Further analysis of NATs in the four states showed that genes in State 1 associated with substantially more stringently defined NATs compared to those in States 2, 3 and 4, whereas the frequency of NATs identified using the relaxed definition (stringent NATs excluded) was similar across the four states (Figure 5i). As stringently defined NATs were usually supported by more antisense tags than those only identified under relaxed criteria, we found that NAT-positive loci in State 1 associate with significantly more antisense tags per gene length than for States 3 and 4 (Figure 5j). Collectively, these observations are consistent with the model proposed previously (Luo and Lam, 2010) that the H3K36me2 and 5mC marks may synergistically inhibit the production of NATs, probably through inhibiting ‘active’ chromatin marks such as histone acetylation in gene bodies.
The abundance of antisense transcripts is regulated in part by the synergy between 5mC and H3K36me, as well as by polymerase-associated factor
The proposed role of H3K36me2 and 5mC marks in regulating NATs was tested using mutants disrupted in DNA methyltransferase MET1 (met1-1) or histone H3K36 methyltransferase SDG8 (sdg8-2), as well as the met1-1 sdg8-2 double mutant. As mutations in PAF subunits have been shown to cause abnormal H3K36me patterns (Oh et al., 2008), we also included mutant elf7-3, in which the PAF1 subunit is disrupted, and mutant elf8-1, in which the CTR9 subunit is disrupted (He et al., 2004). Strand-specific quantitative RT-PCR assays were developed for selected loci in order to quantify their NAT levels (Figure S10). Previous cloning of NATs suggested that a proportion of NATs may be derived from spliced mRNA, presumably through the activity of RNA-dependent RNA polymerase in the cytoplasm (Matsui et al., 2008). As 5mC and H3K36me are more likely to regulate nuclear processes, we chose to test NATs that contained more than two intronic antisense tags (Table S8), which were presumably generated from DNA-dependent transcription activities. The relative abundance of NATs was determined for 17 loci that associate with States 3 or 4 (as determined in Figure 5c) in different genetic backgrounds (Figure 6). The selected NATs accumulated at various levels, supported by between several hundred tags and only a few tags. Over-accumulation of NATs was not observed in met1-1 or sdg8-2 single mutants for any locus, whereas NATs associated with four loci were induced more than two-fold in the met1 sdg8 background (Figure 6). Therefore, the results supported our hypothesis that H3K36me2 and 5mC perform overlapping or redundant functions in repression of NATs at least for a subset of loci. A partially overlapping set of four NATs was found to over-accumulate in elf7-3 (Figure 6). Importantly, three of the four NATs were also induced in elf8-1, suggesting that mis-regulation of NATs was not subunit-specific and was indeed caused by loss of PAF functions (Figure S12).
To address whether the over-accumulation of NATs was mediated by histone hyper-acetylation in gene bodies as speculated, the levels of histone modifications were analyzed for the AT3G06510 and AT3G08670 loci. The induction of NATs in the met1 sdg8 and elf8-1 backgrounds at the two loci was not coupled with hyper-acetylation of histone H3 or H4 (Figure S13c–e), which indicates that 5mC, H3K36me and PAF suppress NATs through a mechanism that is independent of depleting histone acetylation. The results also implied that PAF suppresses NATs through a mechanism that is distinct from modulation of H3K36me levels, as gene body H3K36me2 was more drastically reduced in sdg8-2 than in elf8-1 (Figure S13c,d). As the depletion of H3K36me in sdg8-2 and the lack of 5mC in met1-1 were efficient for the two loci tested (Figures S13c,d and S14), the lack of significant NAT induction in met1-1 or sdg8-2 backgrounds was unlikely to result from ineffective removal of these chromatin marks.
The sense mRNAs for the tested loci were quantified to determine whether changes in NAT abundance lead to corresponding modulations of sense transcription in cis (Figure S11). Changes in NAT levels were not associated with a general change in the level of sense transcripts (Figure S11). Interestingly, the mRNA level of SGS3 was up-regulated by more than 50% in met1 sdg8, together with a more than five-fold induction of NATs in this background. Further studies are necessary to determine whether the induction of NATs is mechanistically related to the up-regulation of sense transcripts.
A significant percentage of NATs are retained in the nucleus
The potential roles of NATs in regulating chromatin states suggest that some NATs may primarily localize to the nucleus (Rinn et al., 2007; Swiezewski et al., 2009). To address the subcellular localization of some of the NATs identified in this work, we quantified seven NATs, together with the corresponding spliced mRNAs, in RNA samples isolated from total tissue or nuclear preparations (Figure S15). The assay was effective, as the two nascent (unspliced) sense transcripts were enriched in the nucleus (Figure S15b). Six of the seven NATs showed significantly more nuclear partitioning compared to the spliced mRNA, suggesting that a large proportion of these NATs are retained in the nucleus. Interestingly, the NAT associated with the TAF4 locus (AT5G43130) behaved similarly to the TAF4 spliced mRNA. Therefore, this NAT may be exported to and retained in the cytoplasm (Figure S15a). The TAF4-associated NAT was a short transcript covering the 3′ end of the TAF4 locus (Figure S16). This NAT accumulated at a significant level and was supported by 75 RNA-seq tags. Future deep RNA-seq studies with nuclear and cytoplasmic RNA preparations are required to systematically determine the global partitioning and functional diversities of NATs in plant cells.
The highly correlative nature of the eukaryotic epigenome has stimulated efforts to elucidate the functional consequences of global organization of chromatin marks. Some of the previous work utilized fuzzy statistical methods to describe the global pattern of chromatin marks with a few dominant chromatin states (Liu et al., 2005; Filion et al., 2010; Roudier et al., 2011). This approach has been useful in revealing the overall structure of epigenomes in multiple species. However, the results from these studies were largely correlative, and did not commonly lead to the discovery of novel epigenetic mechanisms. From our perspective, the difficulties associated with the effort to identify new mechanisms through correlative analyses are two-fold: (i) the sophisticated correlative patterns found in the epigenome prevent identification of true causal factors, and (ii) insufficient genetic or biochemical verifications were performed to test the hypotheses derived from bioinformatics analyses. To address the first difficulty, we introduced a new informatics tool (SSEA) to more critically dissect the consequence of a particular chromatin mark in various chromatin state contexts. We have shown that SSEA is able to resolve state-specific correlations that are not captured by global analysis due to averaging amongst diverse gene sets. With regard to the second challenge, we obtained and constructed genetic mutants to test the hypothesis derived from our bioinformatics analyses. The genetic testing supported part of our speculations, i.e. that 5mC and H3K36me synergistically modulate production of NATs at a subset of loci in the genome.
A critical part of our analysis of chromatin states in Arabidopsis was the more rigorous validation of chromatin mark co-localization via re-ChIP. Correlative analyses generally assume that the coincidental chromatin marks are physically co-localized on the same chromatin molecules. However, this assumption is rarely tested at either local or global scales. The establishment of a highly specific re-ChIP assay has enabled us to biochemically test the co-localization for several chromatin states of interest. Our studies showed that active chromatin marks were highly co-localized at all three loci that were tested, which corroborates the overall validity of our chromatin state model. However, only a proportion of the putative K4/K27me3 bivalent loci (three of 10) were found to be truly bivalent by bi-directional re-ChIP. This observation suggests that assignment of chromatin states using ChIP-seq data obtained from complex tissue may be inaccurate for a subset of genes, especially for some scarce states such as the K4/K27me3 bivalent state. The analysis of gene expression may be similarly affected by use of complex tissues because the observed transcripts and epigenomic features may be present in different cell populations. These difficulties may be resolved by combining the epigenomic and transcriptomic techniques (e.g. ChIP-seq and RNA-seq) with isolation of specific cell populations by fluorescence-activated cell sorting (FACS) or isolation of nuclei tagged in specific cell types (INTACT). Such experiments may also facilitate the study of epigenetics in the context of cell differentiation (Deal and Henikoff, 2010).
Our analysis using strand-specific RNA-seq data has confirmed the pervasiveness of NATs in the Arabidopsis transcriptome. The large number of identified NATs raised the question of what percentage of NATs may perform regulatory functions. For NATs accumulated at substantial levels, such as those associated with the AT4G39270 and GL2 loci, their nuclear localization is consistent with the hypothesis that lncRNAs may bind chromatin regulatory complexes and modulate gene expression (Rinn et al., 2007; Heo and Sung, 2011). However, it was not apparent whether some of the extremely scarce NATs mediate any significant molecular functions. The low-abundance NATs may be produced by ‘leaky transcription’ and may be targeted for immediate degradation. 5mC, H3K36me and PAF may suppress the steady-state level of these low-abundance NATs post-transcriptionally through recruitment of RNA-degrading complexes. It is also possible that the chromatin modifications and PAF directly inhibit antisense transcription at these loci. The depletion of NATs in actively expressed genes modified with 5mC and H3K36me2 suggests that generation of NATs may adversely affect the function of this group of genes. 5mC and H3K36me2 preferentially associate with longer genes (> 2 kb) that are more likely to contain introns (Luo and Lam, 2010). Therefore, it is plausible that NATs may interfere with the splicing of longer genes, presumably through hybridization with pre-mRNA and masking of their splice sites (Faghihi and Wahlestedt, 2009). This hypothesis may be tested in future studies by examining the efficiency and fidelity of the splicing events for loci in which the abundance of associated NATs is drastically increased upon disruption of 5mC and H3K36me2.
In contrast to some previous reports that NATs regulate their cognate loci in cis, we did not find general changes in sense transcription associated with over-accumulation of NATs except for the SGS3 locus. For NATs that regulate sense transcription co-transcriptionally, the increase in steady-state NAT abundance, if it occurs through post-transcriptional mechanisms, may have no effect on the sense transcription process per se. In order to dissect the function of NATs at various stages of gene expression, experiments are required to manipulate NAT abundance at either the transcriptional or post-transcriptional levels. A prerequisite for such studies is a more accurate global description of NAT structures that may be obtained through deeper strand-specific RNA sequencing with longer read length, coupled with the capture and sequencing of 5′ and 3′ ends of RNA. These efforts may shed light on the biogenesis mechanism of NATs, such as the location of promoter sequences that drive NAT expression. A number of recently developed plant genetic engineering approaches may greatly facilitate the otherwise difficult manipulation of NAT expression. Using a recombineering-based approach, large genomic fragments containing the NAT-producing locus may be introduced into plants with the promoter driving NAT expression modified (Zhou et al., 2011). The native NAT promoter may also be replaced by foreign sequences using transcription activator-like effector nuclease or zinc finger nuclease (Bogdanove and Voytas, 2011). Such methods would be ideal for analyzing the co-transcriptional functions of NATs at near-endogenous conditions. With the development of state-of-the-art detection and reverse-genetic tools that can be applied to NATs and other types of lncRNAs, it should be possible to analyze these RNA molecules much more effectively, ultimately leading to a deeper understanding of lncRNA biology and its function in epigenetic control.
The antibodies used were purchased from Millipore (http://www.millipore.com/) or Abcam (http://www.abcam.com/), as follows: H3K4me2 (catalog number 07-030, lot number DAM1503382; Millipore), H3K4me3 (catalog number 07-473, lot number 27343; Millipore), H3K9Ac (catalog number 07-352, lot number 31388; Millipore), H3K9me2 (catalog number ab1220, lot number 625300; Abcam), H3K18Ac (catalog number ab1191; Abcam), H3K27me1 (catalog number 07-448, lot number DAM1598790; Millipore), H3K27me3 (catalog number 07-449, lot number DAM1514011; Millipore), H3K36me2 (catalog number ab9049-100, lot number 614453; Abcam), H3K36me3 (catalog number ab9050-100, lot number 573603; Abcam).
ChIP-seq and RNA-seq
ChIP was performed using the aerial tissue of 2-week-old Arabidopsis plants grown on 0.5× MS medium (1% sucrose, 0.8% agar) under constant illumination. The ChIP-seq samples were sequenced as multiplexed fragment libraries using an Applied Biosystems SOLiD™ 2.0 or 3.0 system (Applied Biosystems, http://www.appliedbiosystems.com), and 35 bp sequences were read as color codes. For the RNA-seq experiment, Arabidopsis plants were grown on soil for 21–28 days. Total RNA was isolated from young leaves ranging from 7 to 12 mm long. After depletion of ribosomal RNAs, RNA-seq fragment libraries were constructed using a SOLiD™ whole-transcriptome analysis kit. Detailed ChIP-seq and RNA-seq methods are described in Appendix S1.
Bioinformatic processing of ChIP-seq and RNA-seq datasets
The color-space reads were aligned to the TAIR8 Arabidopsis thaliana reference genome using Corona Lite 4.2.1 (Life Technologies, http://www.lifetechnologies.com), allowing three color-space mismatches in the 35 bp reads. For ChIP-seq tags, the reported location was randomly chosen if the tag mapped to multiple positions. For RNA-seq libraries, only uniquely placed tags were used for analyses. When multiple tags mapped to an identical genomic position, all but one were removed in order to compensate for potential over-amplification of the library. Identification of peaks for histone modifications was performed by Model-based Analysis of ChIP-Seq (MACS) with the P value cut-off set at 10⁻⁵ (Zhang et al., 2008). Annotated genes that overlapped with identified peaks were considered enriched with the particular histone modification. The cytosine methylation profile was obtained from the National Center for Biotechnology Information Gene Expression Omnibus (accession number GSE5974), and processed as previously described (Zilberman et al., 2007; Luo and Lam, 2010). The RNA-seq data for oligo(dT)-enriched Arabidopsis total RNA were downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (accession number GSE21323; Deng et al., 2010) and processed identically to the RNA-seq data generated by us. All scripts used for the bioinformatic analyses may be downloaded from http://aesop.rutgers.edu/~lamlab/resources/resources.html.
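The tag-filtering and gene-assignment rules described above can be illustrated with a minimal Python sketch. It is not the authors' script (those are available at the URL given above); the names `Tag`, `Interval`, `place_multimapper`, `collapse_duplicates` and `genes_with_peak` are hypothetical, and treating strand as part of an "identical genomic position" is an assumption made here for illustration.

```python
import random
from collections import namedtuple

# Hypothetical record types; the original scripts may use a different layout.
Tag = namedtuple("Tag", "chrom pos strand")
Interval = namedtuple("Interval", "chrom start end name")


def place_multimapper(candidates, rng):
    """ChIP-seq rule: report one randomly chosen location for a multi-mapping tag."""
    return rng.choice(candidates)


def collapse_duplicates(tags):
    """Keep one tag per position (assumption: position = chrom, pos and strand)."""
    seen = set()
    kept = []
    for tag in tags:
        if tag not in seen:
            seen.add(tag)
            kept.append(tag)
    return kept


def genes_with_peak(genes, peaks):
    """Return names of genes that overlap at least one peak (closed intervals)."""
    marked = []
    for gene in genes:
        for peak in peaks:
            same_chrom = gene.chrom == peak.chrom
            if same_chrom and gene.start <= peak.end and peak.start <= gene.end:
                marked.append(gene.name)
                break
    return marked


# Toy data, invented for illustration only.
tags = [Tag("Chr1", 100, "+"), Tag("Chr1", 100, "+"),
        Tag("Chr1", 100, "-"), Tag("Chr1", 250, "+")]
genes = [Interval("Chr1", 1000, 2000, "geneA"),
         Interval("Chr1", 5000, 6000, "geneB")]
peaks = [Interval("Chr1", 1900, 2100, "peak1")]

unique_tags = collapse_duplicates(tags)
enriched = genes_with_peak(genes, peaks)
chosen = place_multimapper([Tag("Chr1", 10, "+"), Tag("Chr3", 77, "-")],
                           random.Random(0))
```

The nested loop in `genes_with_peak` is adequate for a toy example; for genome-scale peak sets a sorted sweep or interval tree would be the more practical choice.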
In the first round of ChIP, 20 μg of the primary antibodies was cross-linked with 40 μl of Protein A resin using a crosslink immunoprecipitation kit (Pierce, http://www.piercenet.com/). The Protein A resin coupled to the antibody was used for ChIP with a 3 ml chromatin sample. After washing the resin with ChIP dilution buffer (15 mM Tris-Cl pH 7.5, 1% Triton X-100, 150 mM NaCl and 1 mM EDTA) three times with rotation for 10 min each, the chromatin was eluted with 200 μl of 50 mM Tris-Cl pH 7.5, 1% SDS at room temperature for 30 min. The eluted chromatin was diluted 10-fold with ChIP dilution buffer, and 500 μl of chromatin sample was used for each second-round ChIP with 2 μg of primary antibody according to the regular ChIP procedure.
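The volumes in the elution and dilution steps can be checked with a short calculation. The sketch below assumes that "diluted 10-fold" means a final volume ten times the eluate volume, and that the ChIP dilution buffer itself contributes no SDS (none is listed in its composition); the function name `rechip_dilution` is hypothetical.

```python
def rechip_dilution(elution_ul=200.0, sds_percent=1.0,
                    dilution_factor=10.0, aliquot_ul=500.0):
    """Return (diluted volume in ul, SDS % after dilution, whole aliquots available)."""
    diluted_ul = elution_ul * dilution_factor      # 200 ul eluate brought to 10x volume
    diluted_sds = sds_percent / dilution_factor    # SDS is diluted by the same factor
    n_aliquots = int(diluted_ul // aliquot_ul)     # 500 ul used per second-round ChIP
    return diluted_ul, diluted_sds, n_aliquots


volume_ul, sds_after, second_round_chips = rechip_dilution()
```

Under these assumptions, one first-round eluate yields 2 ml of diluted chromatin at 0.1% SDS, enough for up to four second-round ChIP reactions.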
RT-PCR for detection of sense and antisense transcripts
Total RNA was isolated from the aerial tissue of 2-week-old Arabidopsis seedlings using plant RNA reagent (Invitrogen, http://www.invitrogen.com). RNA samples were subjected to two rounds of RQ1 DNase treatment (Promega, www.promega.com). For quantification of NATs, we noticed that cDNAs corresponding to exonic regions of active genes may be detected without addition of the priming oligo. Therefore, we have chosen to quantify antisense transcripts that map to intronic regions or intron–exon junctions. As pre-mRNAs accumulate at much lower abundance than mature mRNAs, obtaining non-specific cDNA corresponding to introns or intron–exon junctions is much less likely. Reverse transcriptions were primed using gene-specific primers (Table S9) and were performed at 50°C using Improm-II reverse transcriptase (Promega). The abundance of TUB8 mRNA was used as the internal control for quantitative PCR. The sequences of primers are listed in Table S9.
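The text does not state how relative abundance was calculated from the quantitative PCR data. As an illustration only, the sketch below uses the widely used 2^-ΔΔCt calculation with TUB8 as the reference, assuming roughly 100% amplification efficiency for both amplicons. The function names and all Ct values are hypothetical and are not taken from the study.

```python
def relative_abundance(ct_target, ct_reference):
    """Target abundance relative to the reference transcript: 2^-(Ct_target - Ct_ref)."""
    return 2.0 ** -(ct_target - ct_reference)


def fold_change(mutant, wild_type):
    """Each argument is a (ct_target, ct_reference) pair; returns mutant / wild type."""
    return relative_abundance(*mutant) / relative_abundance(*wild_type)


# Invented Ct values: NAT amplicon and TUB8 control in wild type and a mutant.
wild_type_cts = (30.0, 20.0)
mutant_cts = (28.5, 20.0)

nat_fold = fold_change(mutant_cts, wild_type_cts)  # 2 ** 1.5, about 2.83
induced = nat_fold > 2.0                           # the two-fold criterion used for Figure 6
```

With these invented values, a 1.5-cycle earlier detection of the NAT at unchanged TUB8 Ct corresponds to an approximately 2.8-fold increase, which would pass the two-fold induction criterion.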
Determination of the subcellular localization of NATs
Plant nuclei were isolated using 1.0 g of leaf tissue as described above for the ChIP method without adding formaldehyde to the nuclear isolation buffer (Appendix S1). Nuclear RNA was extracted using plant RNA reagent (Invitrogen) and subjected to two rounds of RQ1 DNase digestion (Promega). Approximately equal amounts of nuclear RNA and total RNA (approximately 100 ng) were used for reverse transcription with gene-specific primers or random hexamer oligos.
ChIP-seq and RNA-seq data are available through the National Center for Biotechnology Information Sequence Read Archive under accession number SRA010097. Tag alignments may be retrieved from the National Center for Biotechnology Information Gene Expression Omnibus (accession number GSE28398). The data may be browsed using the GBrowse interface at http://epigenome.rutgers.edu.
We would like to thank L. Honaas and C.W. DePamphillis from Pennsylvania State University, Department of Biology (University Park, PA) for providing the RNA sample for RNA-seq. We are grateful to R.A. Martienssen from the Cold Spring Harbor Laboratory (Cold Spring Harbor, NY), J. Li from Mississippi State University, Department of Biochemistry & Molecular Biology (Starkville, MS) and R.M. Amasino from the University of Wisconsin, Department of Biochemistry (Madison, WI) for sharing the met1-1, sdg8-2, and elf8-1 and elf7-3 mutants, respectively. Suggestions and advice from A. Amini and T. Widiez are sincerely acknowledged. Support for this work from the Biotechnology Center for Agriculture and the Environment, School of Environmental and Biological Sciences, Rutgers University (New Brunswick, NJ) is greatly appreciated.