ANCORP: a high-resolution approach that generates distinct chromatin state models from multiple genome-wide datasets

Authors

  • Chongyuan Luo,

    1. Biotechnology Center for Agriculture and the Environment, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
    2. Graduate Program of Plant Biology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
  • Eric Lam

    Corresponding author
    1. Biotechnology Center for Agriculture and the Environment, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
    2. Graduate Program of Plant Biology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA

For correspondence (fax +1 732 932 6535; e-mail lam@aesop.rutgers.edu).

Summary

Chromatin components can be extensively modified and dynamically regulated by a plethora of catalytic complexes. The numerous modifications may form a type of molecular pattern that defines particular local and global ‘chromatin states’ through extensive cross-talk. Analyses that can integrate multiple genome-wide datasets are essential to determine the interactions and biological function of chromatin modifications in various contexts. Through a combination of hierarchical clustering and pattern visualization, we categorized all annotated Arabidopsis genes into 16 chromatin state clusters based on combinations of four chromatin marks (H3K4me3, H3K36me2, H3K27me3 and cytosine methylation) from publicly available data. Our results suggest that gene length may be an important factor in shaping chromatin states across transcription units. By analysis of two rare chromatin states, we found that the enrichment of H3K36me2 around the transcription start site is negatively correlated with transcriptional activities. High-resolution association analyses in the context of chromatin states have identified inter-correlations between chromatin modifications. H3K4me3 was found to be under-represented in actively transcribed regions that are modified by DNA methylation and the H3K36me2 mark, concomitant with increased nucleosome occupancy in these regions. Lastly, quantitative data from transcriptome analyses and gene ontology partitioning were integrated to determine the possible functional relevance of the corresponding chromatin states. We show that modelling the plant epigenome in terms of chromatin states and combining correlative visualization methods can be a productive approach to unravel complex relationships between epigenomic features and the functional output of the genome.

Introduction

One prominent difference between the genome structure of prokaryotes and eukaryotes is that, in the latter, nuclear DNA is packaged into chromatin, which consists of nucleosomes as the fundamental units. Early research on chromatin function revealed that nucleosomes could prevent DNA from being digested by nucleases and also impede transcription in vitro (Kornberg and Lorch, 1999). These results are consistent with the speculation that nucleosomes are inert structures that reduce the accessibility of DNA. Although this over-simplified hypothesis has been dramatically revised over the past decade, the lessons learnt from these early studies are nevertheless important, i.e. that the biochemical behaviour and functional output of DNA can be altered through variations in chromatin packaging. Therefore, any DNA-related processes in eukaryotes, such as transcription, replication or repair, will inevitably involve manipulation of chromatin structures, such as the packing density between neighbouring nucleosomes or the remodelling of nucleosome positioning.

Because of the central role of transcriptional regulation in development, pathogenesis and adaptation to environmental changes, the impact of chromatin structure on transcription has prompted numerous investigations. Chromatin structures are known to be relevant for most stages of transcription, including initiation, elongation and termination (Li et al., 2007a). Transcription initiation has been shown to be highly correlated with nucleosome depletion at the promoter region of expressed genes (Jiang and Pugh, 2009), localization of the histone variant H2A.Z and various histone modifications including H3K4 methylation and H3 acetylation in a broad range of eukaryotes (Henikoff and Ahmad, 2005; Pokholok et al., 2005; Oh et al., 2008; Wang et al., 2008). During transcription elongation, H2Bub1 (mono-ubiquitinated H2B) and H3K36 methylation were found to correlate with the C-terminal domain (CTD) phosphorylated form of RNA polymerase II (Li et al., 2007a; Weake and Workman, 2008). Both mono-ubiquitination of histone H2B and FACT (facilitates chromatin transcription) complex are essential for transcription elongation in a highly reconstituted in vitro transcription system (Pavri et al., 2006; Weake and Workman, 2008). Abolishing H2Bub1 in Saccharomyces cerevisiae led to defects in H3K4 and H3K79 methylation (Weake and Workman, 2008), suggesting that H2Bub1 can regulate transcription partly by mediating other types of histone modification. H3K36 methylation can recruit the Rpd3 histone deacetylase to gene bodies, which is apparently crucial for the repression of cryptic transcription (Li et al., 2007b; Lickwar et al., 2009). In addition to histone modifications, studies using Arabidopsis thaliana have revealed that CpG DNA methylation is associated with the ORF region in thousands of moderately active genes (Zhang et al., 2006; Zilberman et al., 2007; Cokus et al., 2008; Lister et al., 2008). 
Intriguingly, CpG methylation also correlates strongly with heterochromatic silencing in Arabidopsis. How CpG methylation can have functionally distinct biological outcomes depending on the genomic context is not yet understood. Although no chromatin modification has been correlated specifically with transcription termination so far, evidence has been reported for a physical association (i.e. looping) between the transcription start site (TSS) and the polyadenylation site for several S. cerevisiae genes (Ansari and Hampsey, 2005; Singh and Hampsey, 2007), which suggests potential involvement of chromatin structure in the coordination of transcription initiation and termination.

As a complement to classical biochemical and molecular approaches, recent application of ultra-high-throughput genome-wide profiling technologies (e.g. chromatin immunoprecipitation followed by either hybridization to microarrays (ChIP-chip) or high-throughput sequencing (ChIP-Seq)) has allowed more rapid and quantitative definition of global chromatin modification patterns. One of the most comprehensive datasets reported so far includes whole-genome maps for 39 distinct histone modifications in human CD4+ T cells (Wang et al., 2008). In that study, a common histone modification module formed by 17 distinct marks was found to correlate with active expression, and a cooperative model was proposed for the function of various histone marks with respect to transcription regulation (Wang et al., 2008).

The results from studies with various eukaryotes using biochemical, genetic and genomic approaches are consistent with the concept that chromatin modifications function in a combinatorial fashion (Sims and Reinberg, 2008). Thus, the presence or absence of one histone mark may influence the level of another by recruitment or inhibition of particular chromatin modification enzymes. In addition, the functional consequence of a chromatin mark may be dependent on other modifications and genome context. In an effort to resolve this complexity, we are seeking to integrate multiple datasets of genome-wide chromatin-related information including gene length, histone marks, cytosine methylation and transcript abundance. In this paper, we have adopted and extended a clustering approach to incorporate both genomic and chromatin context into the analysis of chromatin modifications (Zilberman et al., 2008). In the first step of our two-step method, Arabidopsis genes are ordered based on genomic or epigenomic features and displayed as a stacked heat map. The derived gene order is referred to as the ‘anchor’. In the second step, using the order of genes (the ‘anchor’) generated in step 1, other sets of features such as chromatin modifications or transcription activity are then passively sorted and visually compared between clusters. We have named the procedure anchored correlative pattern (ANCORP) analysis. Using this method, we categorized the epigenomic features of all annotated Arabidopsis genes into 16 chromatin states consisting of all possible combinations of four distinct chromatin marks: H3K4me3 (trimethylated histone H3 lysine 4), H3K36me2, H3K27me3 and DNA methylation (5mC). Correlative analyses on chromatin states identified state-specific associations between distinct chromatin modifications. Further incorporation of transcriptome, stress transcriptome and ontology information indicated the potential functional relevance of some of the chromatin states.
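The two-step procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `make_anchor` and `apply_anchor` and the toy values are hypothetical, and the anchor here comes from a simple sort on one feature, although the leaf order of a hierarchical clustering could serve as the anchor in the same way.

```python
import numpy as np

def make_anchor(sort_key):
    """Step 1: derive the gene order (the 'anchor') from one per-gene feature.

    sort_key holds one value per gene, e.g. transcription unit length.
    Assumption: a plain sort stands in for hierarchical clustering here.
    """
    return np.argsort(np.asarray(sort_key), kind="stable")

def apply_anchor(anchor, *tracks):
    """Step 2: passively reorder other per-gene tracks using the same anchor."""
    return [np.asarray(track)[anchor] for track in tracks]

# Toy data (hypothetical): 4 genes, 3 positions per gene.
gene_length = [2400, 800, 5100, 1500]
h3k4me3 = np.array([[2.0, 0.5, 0.1],
                    [0.1, 0.0, 0.0],
                    [1.8, 0.2, 0.0],
                    [1.5, 0.4, 0.1]])
expression = np.array([7.2, 0.3, 6.1, 5.0])

anchor = make_anchor(gene_length)
h3k4me3_sorted, expression_sorted = apply_anchor(anchor, h3k4me3, expression)
```

Because every track is indexed with the same `anchor`, row i of each reordered array refers to the same gene, which is what makes the side-by-side heat maps directly comparable.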

Results

ANCORP: display of global correlations of chromatin modifications

We aimed to detect global interactions between several types of chromatin marks (i.e. H3K4me3, H3K36me2, H3K27me3 and 5mC) and transcript abundance. Our analyses first focused on the euchromatic regions of the Arabidopsis genome, so sequences encoding putative transposable elements (TEs) and pseudogenes based on version 8 of the genome annotation curated by The Arabidopsis Information Resource (TAIR) were excluded (Swarbreck et al., 2008). We generated an anchoring pattern by performing hierarchical clustering of the region between −1 and +5 kb for the 28 244 annotated Arabidopsis genes, using the relative level and intragenic location of the H3K4me3 chromatin mark as criteria for comparison and clustering (Figure 1a). We then plotted the levels of H3K36me2, H3K27me3, 5mC and transcript abundance for the corresponding regions (Figure 1b–e) using the same order of genes as in Figure 1(a) to reveal the presence of any significant correlations. As each row in the various panels of Figure 1 corresponds to the same gene, this approach enables direct visual comparison of multiple chromatin modifications across the same set of genes. Consistent with previous analysis (Oh et al., 2008), we found that the H3K4me3 mark associates with the TSS of expressed genes and H3K36me2 is enriched in the 3′ portions of these genes. In contrast, the H3K27me3 mark appears to cover the whole transcription unit of its targets. Also, we observed that H3K36me2 is frequently found in genes associated with H3K4me3, whereas H3K27me3 and H3K4me3 are mutually exclusive in most instances. In addition, by displaying the transcript abundance as another correlative pattern in Figure 1(e), it is readily apparent that H3K4me3 and H3K36me2 marks are generally coupled with active transcription, whereas the H3K27me3 mark correlates with transcriptionally silent genes.
The relationship between methylation found in transcription units and transcription was not clearly established by this analysis and will be discussed in later sections.

Figure 1.

 ANCORP analysis of inter-modification relationships using the pattern of H3K4me3 marks as the anchor.
(a) All Arabidopsis genes except for TEs and pseudogenes were hierarchically clustered based on H3K4me3 patterns. Genes were aligned at the 5′ ends, with the white line indicating the TSS. The enrichments of H3K4me3 between −1 and +5 kbp are displayed as heat maps, with each row representing a single gene.
(b–e) Enrichments of H3K36me2, H3K27me3, DNA methylation and transcription activity plotted using the same order of genes as in (a).

High-resolution and multi-colour displays of gene length-dependent chromatin landscapes

Before defining chromatin state groups within the Arabidopsis genome, it is necessary to thoroughly address the relationship between gene length and individual chromatin modification. Multiple studies have demonstrated the interdependence between histone modifications and gene length at relatively low resolutions (Oh et al., 2008; Lickwar et al., 2009). The method used involves computing average patterns of chromatin modifications after grouping genes into several size ranges. For higher plants, each average pattern represents the mean of approximately 2000–10 000 genes if all annotated genes are separated into 3–10 groups (Oh et al., 2008; Wang et al., 2008). To improve the resolution of this type of analysis, we generated stacked heat maps for each chromatin modification using genes that had first been sorted by length (Figure 2a–d). The annotated sizes of Arabidopsis genes included in the current analyses range from 35 bases to 31 kb, with most genes (>95%) shorter than 5 kbp. This ANCORP method enables recognition of groups containing as few as 100 genes (Figure 3a and Table S1). Furthermore, the heat map presentation method provides a convenient tool to simultaneously reveal spatial information for multiple chromatin modifications at a large range of resolutions (from single genes to more than 10 000 genes), which cannot be easily done using other methods.

Figure 2.

 Correlation between chromatin modifications and gene lengths.
(a–d) All Arabidopsis genes except for TEs and pseudogenes were sorted by length of transcription units. Genes were aligned at the 5′ ends, with the straight white lines indicating the TSS and white curves indicating transcription termination sites (TTS). The enrichments of chromatin modifications are shown as heat maps, with each row representing a single gene. Numbers on the left of each panel indicate the size of genes at the corresponding positions.
(e–h) Merging of the positive values of two heat maps (a–d). The colours of channels and the modifications being shown are indicated below each panel. Yellow indicates the presence of both red and green.
(i–l) Magnifications of regions corresponding to two-channel merged images as indicated on the right of panels (e–h).

Figure 3.

 Correlation between chromatin states and gene lengths.
(a) Modified chromatin domains were identified for H3K4me3, H3K36me2, H3K27me3 and DNA methylation, and then mapped onto the TAIR8 Arabidopsis genome. Chromatin states for all genes except for TEs and pseudogenes were displayed as a heat map and subsequently hierarchically clustered. Each row in the heat map corresponds to a single gene. A positive association with chromatin modifications is represented in red, and negative associations are indicated in black. The 13 chromatin states that can be recognized on the heat map were given a number, as shown on the left.
(b) Lengths of transcription units were plotted using the same order of genes as displayed in (a) after the hierarchical clustering. Genic regions are coloured yellow.

The patterns observed from Figure 2(a–d) are largely consistent with previous studies (Oh et al., 2008). H3K4me3 and H3K36me2 marks are localized to the 5′ and 3′ region of gene bodies, respectively. Both marks are under-represented in genes shorter than 1 kb, in which the H3K27me3 mark is enriched. In addition, we note that DNA methylation is essentially absent in genes shorter than 2 kb. A conspicuous enrichment of DNA methylation was found in most genes longer than 4 kb.

To explicitly visualize the spatial relationship between chromatin modifications, we merged positive values of the respective heat maps from several pair-wise combinations of chromatin marks (Figure 2e–h,i–l). The multi-colour visualization indicates the presence of genic regions with either one of the two chromatin marks (red or green) or both (yellow). In Figure 2(e,i), we found that H3K4me3 and H3K36me2 cover essentially non-overlapping regions within the bodies of the same genes. In shorter genes, there is almost no overlap between the genic regions marked by H3K4me3 and H3K36me2, with few or no yellow regions (Figure 2e,i). However, as the gene size increases, significant gaps emerge between the H3K4me3-marked domains around the TSS and the H3K36me2 domains in the 3′ portion of gene bodies (Figure 2e), such that the central part of longer genes associates with neither the H3K4me3 nor the H3K36me2 marks. This H3K36me2 pattern may be detrimental to faithful transcription, as H3K36 methylation has been reported to be essential for suppressing cryptic transcription initiation within gene bodies in other eukaryotes (Li et al., 2007b; Lickwar et al., 2009). Cryptic transcription initiation may generate antisense or aberrant transcripts that can trigger siRNA silencing and subsequently impair normal transcription. Strikingly, we found from our analysis that the location of DNA methylation domains in Arabidopsis genes covers the region in longer genes that is not marked by H3K36me2 (Figure 2d,h,k,l), in addition to significant degrees of overlap as shown by the yellow region in Figure 2(h,l). In contrast, there appears to be little overlap between regions that are methylated and those with the H3K4me3 mark (Figure 2g,k).
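The two-channel merge described above can be illustrated as follows. This is a sketch under stated assumptions rather than the procedure actually used for Figure 2: the function name `merge_two_channels`, the shared scaling by a common maximum and the toy values are all hypothetical. Negative values are set to zero so that only enrichment contributes to the colour, and pixels where both marks are enriched come out yellow (red plus green).

```python
import numpy as np

def merge_two_channels(red_track, green_track, vmax=None):
    """Merge the positive values of two heat maps into one RGB image.

    Both inputs are genes x positions arrays of the same shape.
    Assumption: both channels are scaled by the same maximum (vmax).
    """
    red = np.clip(np.asarray(red_track, dtype=float), 0.0, None)
    green = np.clip(np.asarray(green_track, dtype=float), 0.0, None)
    if vmax is None:
        # Guard against division by zero when no value is positive.
        vmax = max(red.max(), green.max(), 1e-12)
    rgb = np.zeros(red.shape + (3,))
    rgb[..., 0] = np.minimum(red / vmax, 1.0)    # red channel: first mark
    rgb[..., 1] = np.minimum(green / vmax, 1.0)  # green channel: second mark
    return rgb                                   # blue channel stays at zero

# Toy data (hypothetical): one gene, three positions.
first_mark = [[1.0, -0.5, 0.0]]
second_mark = [[1.0, 2.0, -1.0]]
merged = merge_two_channels(first_mark, second_mark)
```

The resulting array can be passed to any image display routine that accepts RGB values in the range 0 to 1.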

Chromatin states reflect dominant correlations between gene lengths and individual chromatin modifications

In order to perform integrated analyses of multiple chromatin modifications, we categorized Arabidopsis genes into combinatorial chromatin states. Modified chromatin domains were first identified throughout the genome using a hidden Markov model (HMM)-based method (Ji and Wong, 2005) as described in Experimental procedures, and then assigned to each annotated Arabidopsis gene. All genes were then hierarchically clustered by their associations for these four chromatin marks (Figure 3a). Thirteen of the 16 possible chromatin state clusters that can be recognized on the heat map were given a number for designation (Figure 3a). The global chromatin states generated are not random combinations of four modifications. Rather, notable concurrences and exclusions exist between chromatin modifications. The number of genes associated with each state is significantly different from that generated by computational permutation (Table S1).
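The combinatorial logic can be made concrete with a short sketch. Four presence/absence calls per gene give 2^4 = 16 possible states, and the observed state counts can be compared with counts obtained after permutation. The names below (`state_of`, `count_states`, `permuted_counts`, `gene`) are hypothetical, the integer state index is not the cluster numbering used in Figure 3, and the permutation scheme (shuffling each mark independently across genes) is an assumption, as the text does not specify how the permutation was performed.

```python
import random
from collections import Counter

MARKS = ("H3K4me3", "H3K36me2", "H3K27me3", "5mC")

def gene(k4, k36, k27, mc):
    """Build one gene's presence/absence calls (hypothetical helper)."""
    return dict(zip(MARKS, (k4, k36, k27, mc)))

def state_of(calls):
    """Encode four presence/absence calls as an integer from 0 to 15."""
    return sum(1 << i for i, mark in enumerate(MARKS) if calls[mark])

def count_states(genes):
    """Count how many genes fall into each combinatorial state."""
    return Counter(state_of(calls) for calls in genes)

def permuted_counts(genes, rng):
    """Shuffle each mark independently across genes, then recount states.

    This keeps the number of genes carrying each mark unchanged
    but breaks any co-occurrence between marks.
    """
    columns = {mark: [g[mark] for g in genes] for mark in MARKS}
    for column in columns.values():
        rng.shuffle(column)
    shuffled = [{mark: columns[mark][i] for mark in MARKS}
                for i in range(len(genes))]
    return count_states(shuffled)

# Toy data (hypothetical): five genes.
genes = [
    gene(True, True, False, False),
    gene(True, True, False, True),
    gene(False, False, True, False),
    gene(False, False, False, False),
    gene(True, False, False, False),
]
observed = count_states(genes)
permuted = permuted_counts(genes, random.Random(0))
```

Repeating `permuted_counts` many times would give a null distribution for each state against which the observed count can be compared.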

Using the gene order that resulted from this clustering procedure as the anchor, we plotted the size of genes as a correlative pattern in Figure 3b. It is apparent that each chromatin state cluster preferentially associates with a distinct range of gene lengths (Figures 3b and S1a,b). For example, genes associated with both H3K4me3 and H3K36me2 (cluster 9, median length 2420 bp) are generally longer than those in clusters 4 (median length 1495 bp) and 5 (median length 1401 bp), which predominantly associate with either H3K4me3 or H3K27me3, respectively (Figure S1b,c). Also, consistent with data presented in Figure 2, genes longer than 2.5 kb are highly represented in cluster 12 (median length 3327 bp). The percentage of genes associated with each chromatin modification in six size groups was calculated and is shown in Figure S1d.

Among genes shorter than 1 kb (Figure S1d), the percentage of genes that are H3K4 trimethylated in their TSS region is relatively low (approximately 23%). However, this does not necessarily imply that the H3K4me3 mark is more important for longer genes than for short genes. As we showed earlier, the H3K4me3 mark is under-represented in H3K27me3-targeted genes, most of which are shorter than 2 kb (Figures 3b and S1b,c). Thus, under-representation of the H3K4me3 mark in shorter genes may be simply due to the frequent suppression of transcription that is associated with the presence of the H3K27me3 mark in this category of genes.

Overall, a considerable proportion of the chromatin state clusters can be explained by the gene length preferences of individual chromatin modifications. For example, H3K36me2 and DNA methylation are over-represented in longer transcription units compared to H3K4me3. These characteristics may contribute to the separation of clusters 4, 9 and 12. However, gene length is apparently not the only determinant of chromatin states. Thus, substantial overlaps can be observed between clusters 4, 9 and 12, while clusters with genes solely modified by H3K4me3 or H3K27me3 (clusters 4 and 5) share a similar size range.

The structure of chromatin states reveals diversities in transcription regulation

This relatively simple partitioning of the Arabidopsis epigenome using just four distinct chromatin marks has identified five major chromatin states, each consisting of more than 3000 genes, in addition to 11 relatively minor states. Of the five major states, clusters 4, 9 and 12 associate with actively transcribed loci and are commonly modified with H3K4me3. Cluster 1, which is not enriched with any of the four modifications studied here, and cluster 5, which is modified by H3K27me3, were found to be transcriptionally repressed.

To examine the finer details and patterns underlying the various chromatin states generated by our ANCORP analyses, we used the same anchor as in Figure 3a to sequentially display patterns of chromatin modifications across gene bodies within various clusters (Figure 4a–e). As shown in Figure 4a–c, most genes with a high level of H3K36me2 also associate with H3K4me3, except for two clusters (clusters 3 and 8). This observation is consistent with the model that H3K36me2 is established during transcription elongation and thus usually correlates with the presence of the H3K4me3 mark. Genes within cluster 3 only associate with H3K36me2, while both H3K36me2 and 5mC are targeted to overlapping regions of genes within cluster 8 (Figure 4c,e). For higher-resolution views, the chromatin states for the 373 genes in cluster 8 are magnified and displayed in Figures 4f–i, together with their transcript levels (Figure 4j). Only background levels of H3K4me3 are found around the TSS region for clusters 3 and 8 (Figure 4b,g). The H3K36me2 domain in these genes showed clear expansion into the 5′ portion of the gene bodies close to the TSS compared to clusters 9 and 12 (Figure 4c,h). These special chromatin states are correlated with generally suppressed transcript abundance (Figures 4j and S3).

Figure 4.

 Inter-relationship between chromatin modifications in various chromatin states.
(a) ANCORP anchor (as described in Figure 3 (a)).
(b–e) Enrichments of H3K4me3, H3K36me2, H3K27me3 and DNA methylation marks between −1 and +5 kb plotted using the same order of genes as displayed in (a).
(f) Magnification of cluster 8 together with part of clusters 7 and 9 from (a).
(g–j) Enrichments of H3K4me3, H3K36me2, DNA methylation and transcription activity plotted for cluster 8 using the same order of genes as displayed in (f).

As the deposition of H3K36me2 is believed to depend on active transcription, it is perplexing that genes within clusters 3 and 8 are modified with H3K36me2 without significant enrichment of H3K4me3. We suggest several mechanisms to explain the formation of these relatively rare chromatin states.

(i) Active transcription can occasionally be uncoupled from H3K4me3. Owing to the limited resolution, a small fraction of actively expressed genes that belong to clusters 3 and 8 cannot be recognized through visual examination of the heat map. The TSSs of these genes apparently lack significant H3K4me3, but H3K36me2 (the mark of transcription elongation) persists. An example is given in Figure 5a, which shows that AT1G07300 is actively expressed and modified by H3K36me2 within the transcription unit, with little H3K4me3 around the TSS.

Figure 5.

 Chromatin states and expression for selected genes of cluster 3.
(a–c) Enrichments of H3K4me3 and H3K36me2 (log2 IP/H3) and RNA on both strands plotted for the surrounding 5 kb regions of AT1G07300 (a), AT2G36355 (b) and AT4G35900 (c). The TSSs are indicated by arrows and are supported by full-length cDNA clones.
(d) Average expression percentiles plotted against the distance between the TSS and the closest H3K36me2 domain for either all H3K36me2 genes or cluster 9 only.

(ii) The signal for H3K36me2 within the transcription unit might be contributed by overlapping genes that are actively expressed. As seen in Figure 5b, the H3K36me2 mark associated with AT2G36355 is probably attributable to transcription of the overlapping gene AT2G36360.

(iii) Other behaviours may occur that cannot be explained by either hypothesis, such as for AT4G35900 shown in Figure 5c. The whole transcription unit, including the region juxtaposed to TSS, is modified by H3K36me2 without significant H3K4me3 enrichment or transcription.

The general behaviour of clusters 3 and 8 led us to assess whether an H3K36me2 mark-enriched domain adjacent to the TSS is associated with transcription repression. We found that a positive relationship exists between gene expression and the distance between the TSS and the closest H3K36me2 domain for distances up to approximately 1 kb (Figure 5d), i.e. genes with an H3K36me2 domain closer to the TSS tend to be expressed at lower levels. This is valid for all H3K36me2-associated genes as well as those in cluster 9, which are in general actively expressed (Figure 5d). An observation that coincides with these results was obtained using S. cerevisiae, in which tethering of SET2, a protein that can mediate H3K36 methylation, to a heterologous promoter region was found to cause transcriptional repression (Strahl et al., 2002).
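The kind of calculation behind Figure 5d can be sketched as follows, assuming each gene has a TSS coordinate, a list of H3K36me2 domain intervals on the same chromosome and an expression percentile. The function names, the 200 bp bin size and the toy values are hypothetical, and strand orientation is ignored for simplicity, so this is an illustration of the idea rather than the analysis as performed.

```python
from collections import defaultdict

def distance_to_closest_domain(tss, domains):
    """Distance in bp from a TSS to the nearest domain.

    domains is a list of (start, end) pairs on the same chromosome.
    Returns 0 if the TSS lies inside a domain and None if there are no domains.
    Assumption: strand is ignored, so upstream and downstream are treated alike.
    """
    best = None
    for start, end in domains:
        if start <= tss <= end:
            return 0
        d = min(abs(tss - start), abs(tss - end))
        if best is None or d < best:
            best = d
    return best

def mean_percentile_by_distance(records, bin_size=200):
    """Average expression percentile per distance bin.

    records is an iterable of (distance_bp, expression_percentile) pairs;
    the returned dict maps each bin's lower edge to the mean percentile.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for distance, percentile in records:
        lower_edge = (distance // bin_size) * bin_size
        sums[lower_edge][0] += percentile
        sums[lower_edge][1] += 1
    return {edge: total / n for edge, (total, n) in sorted(sums.items())}

# Toy data (hypothetical): three genes with distances and percentiles.
toy_records = [(50, 20.0), (150, 40.0), (250, 60.0)]
profile = mean_percentile_by_distance(toy_records)
```

Plotting the values of `profile` against the bin edges gives a curve of average expression percentile as a function of the distance between the TSS and the nearest domain.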

In most cases, for genes >2 kb, DNA methylation (5mC) co-localizes with the H3K4me3 and H3K36me2 marks (cluster 12), while genes belonging to cluster 2 are heavily methylated and are not associated with H3K4me3 or H3K36me2. In addition, genes in cluster 2 behave similarly to transposable element-related genes as their transcription is largely repressed. However, this group of genes has only limited homology with known transposable elements as shown by the TAIR8 filtering of TEs (Swarbreck et al., 2008). Nevertheless, genes of cluster 2 are located closer to annotated TEs than any other clusters (Figure S2), suggesting that they may be derived from highly degenerate TE sequences or could be epigenetically influenced by some kind of ‘spreading’ mechanism due to their juxtaposition to heterochromatin.

Although the H3K4me3 and H3K27me3 marks appear to be mutually exclusive at the global level, a set of 630 genes is found to associate with both of these chromatin marks (cluster 7). The expression level of this ‘bivalent’ group is generally higher than that of genes that are only targeted by H3K27me3 (cluster 5, Figures S3 and S4).

Identification of a gene length-independent inverse correlation between H3K4me3 and H3K36me2 by partitioning Arabidopsis genome into chromatin states

Categorizing all the genes using the model of chromatin states enabled the discovery of patterns or correlations that are specific to certain chromatin states. By examining the H3K4me3 pattern in Figure 4b, we noted that the H3K4me3 marks in the 3′ region of genes are more thoroughly depleted in cluster 9 or 12, compared with genes in cluster 4. In addition to H3K4me3, genes that belong to cluster 9 are also enriched with H3K36me2, while both H3K36me2 and 5mC are found to be associated with cluster 12. To confirm this observation more quantitatively, we compared average log2 values of the H3K4me3/H3 ratio across genes associated with five chromatin states (Figure 6a–e). Arabidopsis genes were grouped according to size for this analysis, as we have shown above that gene size is correlated with distinct types of chromatin states. Consistent with our qualitative visual assessment, genes within clusters 9 and 12 display stronger depletion of H3K4me3 in the 3′ portion of their gene bodies compared to the other chromatin states. This preferential under-representation of H3K4me3 marks is specific to the region beyond +1 kb, as clusters 4, 9 and 12 all show a comparable H3K4me3 level around the TSS. The differential depletion of H3K4me3 among chromatin states is statistically significant within gene bodies, as shown by the P values across gene bodies generated using Welch's t tests (Figure S5a–d). Also, the distribution of gene expression level for clusters 4, 9 and 12 is indistinguishable for genes longer than 2 kb (Figure S4), indicating that the differential deposition of H3K4me3 is not caused by differences in expression level between gene groups from each chromatin state.
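A position-by-position Welch comparison of two groups of genes can be sketched as below. The function names `welch_t` and `welch_along_gene` and the toy profiles are hypothetical; only the t statistic and the Welch–Satterthwaite degrees of freedom are computed, and converting them to P values would additionally require a t-distribution routine that is not shown here.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t test on two samples."""
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def welch_along_gene(group_a, group_b):
    """Apply the test at each position along the gene.

    Each group is a list of per-gene profiles of equal length,
    e.g. log2 enrichment values at fixed offsets from the TSS.
    """
    return [welch_t(col_a, col_b)
            for col_a, col_b in zip(zip(*group_a), zip(*group_b))]

# Toy data (hypothetical): two groups of three genes, two positions each.
group_a = [[1.0, 5.0], [2.0, 5.5], [3.0, 6.0]]
group_b = [[2.0, 5.0], [4.0, 5.5], [6.0, 6.0]]
per_position = welch_along_gene(group_a, group_b)
```

Each entry of `per_position` is a `(t, df)` pair for one position, so regions where two chromatin states differ can be located along the gene body.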

Figure 6.

 Correlation between H3K4me3, nucleosome occupancy and chromatin states.
All genes except for TEs and pseudogenes were grouped based on gene length. For each group, average enrichments of H3K4me3 (a–e) and histone H3 (f–j) were calculated for genes associated with the five chromatin state clusters 4, 5, 7, 9 and 12. The average enrichments of H3K4me3 and histone H3 between −1 and +5 kb were plotted. Regions that can be unequivocally assigned within gene bodies were shaded in grey. Genes shorter than 1 kb are omitted from the curve for cluster 12 due to small sample number (n < 20).

Genes that belong to cluster 9 have a higher average 5mC level in their transcription regions compared to those in cluster 4 (Figures 4e and S6). Thus, the respective contributions of H3K36me2 and 5mC to H3K4me3 depletion are unclear. To resolve this question, we performed an ‘epistasis-like’ analysis by comparing H3K4me3 enrichment in clusters 9 and 12 with that of cluster 10, which consists of 270 genes enriched with the H3K4me3 and 5mC marks but not with H3K36me2 (Figure S8a–c). The average H3K4me3 enrichment in the 3′ region of both clusters 9 and 10 is lower than that of cluster 4 but higher than that of cluster 12 (Figure S8a–c). Therefore, both H3K36me2 and DNA methylation appear to inversely correlate with H3K4me3 within gene bodies in an independent fashion.

Nucleosome occupancy is positively correlated with the H3K36me2 mark and DNA methylation within gene bodies

In Arabidopsis, DNA methylation and the histone variant H2A.Z have been shown to be mutually antagonistic chromatin marks (Zilberman et al., 2008). Together with our discovery that H3K4me3 density inversely correlates with DNA methylation and H3K36me2 levels within gene bodies, we postulate that the pattern of nucleosome occupancy might also correlate with different chromatin modifications. We plotted log2 values for H3 ChIP/genomic input DNA using the same order of genes as displayed in Figure 3a (Figure S7b). In addition, average log2 values for H3 IP/genomic input DNA across genes associated with five chromatin states were compared quantitatively (Figure 6f–j). Consistent with the notion that nucleosome depletion correlates with high transcription frequency (Jiang and Pugh, 2009), histone H3 is consistently under-represented in cluster 4. In contrast, significant enrichment of histone H3 in the 3′ region of gene bodies was found in genes belonging to clusters 9 and 12. The differential enrichment of H3 deposition levels between genes in clusters 4, 9 and 12 is specific to the 3′ region, not at the promoter. Similarly to the H3K4me3 mark, the observed differential representation of H3 is independent of gene length or expression level. An additive effect of H3K36me2 and DNA methylation with respect to increased H3 occupancy in the corresponding regions was observed in a similar ‘epistasis-like’ analysis for H3K4me3 (Figure S8d–f).

Genes belonging to cluster 5 (H3K27me3 only) have relatively high levels of histone H3 enrichment, consistent with the transcriptional suppression in this group (Figures S3 and S4). Interestingly, intermediate enrichment of H3 was observed for genes associated with both H3K4me3 and H3K27me3 (cluster 7). This correlates with the higher gene expression of this ‘bivalent’ group compared to that found in cluster 5. The distribution curve of gene expression for cluster 7 lies between those for clusters 4 and 5 (Figure S4).

Recently, it has been demonstrated in yeast that the genome-wide nucleosome density is largely determined by DNA sequences (Kaplan et al., 2009). When the algorithm developed from this in vitro nucleosome assembly experiment is applied to the Arabidopsis genome, the resulting pattern is only moderately similar to the in vivo H3 ChIP map (Figure S7c). Given that plant H2A and H2B histones have significantly diverged from their yeast and animal orthologs, the prediction model developed with chicken nucleosomes and yeast genomic DNA may not readily apply in plant systems (Kaplan et al., 2009).

Enrichment of H2A.Z in H3K27me3-modified genes

By performing ANCORP with Figure 3a as the anchor, we observed a conspicuous enrichment of H2A.Z in genes of cluster 5, which are solely modified by H3K27me3 (Figure S7d); the pattern of H2A.Z in this cluster differs from that in clusters 4, 9 or 12. As the presence of the H3K27me3 mark was assessed using aerial tissue, whereas H2A.Z was mapped in root tissue, the two datasets might not be entirely comparable. However, the pattern of H3K27me3 in roots is expected to be largely similar to that in the shoots, as shown in the rice system (Wang et al., 2009). Thus our observation may indicate a possible functional link between the plant Polycomb-like mechanism and H2A.Z.

Inter-relationships between chromatin states and gene expression

The relationship between transcript levels and each of the H3K4me3, H3K36me2, H3K27me3 and 5mC marks has been investigated in previous studies (Zhang et al., 2006, 2007; Zilberman et al., 2007; Cokus et al., 2008; Lister et al., 2008; Oh et al., 2008). By performing the analysis of gene expression in the context of chromatin states, we were able to refine and extend some of the previous observations.

The distribution of expression level for genes associated with five distinct chromatin states is quantified in Figure 7a. As gene lengths are not independent of chromatin states, we also performed a similar analysis after separating all Arabidopsis genes into size groups (Figure S4). The results are largely consistent with those based on visual examination (Figure S3). H3K36me2 is clearly not required for active expression per se as no significant differences were observed between clusters 4 and 9 regarding gene expression. Although this result appears to contradict a previous study showing that H3K36me2 correlates with active expression (Oh et al., 2008), the two observations are in fact complementary. The previous analysis involved all genes in the Arabidopsis genome, including genes that are silenced by DNA methylation or Polycomb protein-mediated mechanisms, whereas our comparison was between clusters 4 and 9, which are both marked by H3K4me3.

Figure 7. Correlation between gene functions and chromatin states.
(a) Distributions of gene expression for each chromatin state cluster (4, 5, 7, 9 and 12) shown as a histogram.
(b–e) Specificity of gene expression in response to (b) developmental stage and tissue type, (c) various light conditions, (d) abiotic stresses and (e) pathogen challenges was estimated by calculation of the Shannon entropy. The distribution of Shannon entropies was plotted for each of the five chromatin state clusters and for all analysed genes.

We also noted that the relationship between methylated and unmethylated genes with respect to gene expression is dependent on the length of the genes. DNA methylation was reported previously to be preferentially associated with gene bodies of moderately expressed genes, which means that the bodies of genes with either low or high expression are unlikely to be methylated (Zilberman et al., 2007). Using our approach, we found that this conclusion is only true for genes shorter than 2 kb (Figure S4b,g). For genes longer than 2 kb, the distribution patterns of gene expression for clusters 4, 9 and 12 are essentially indistinguishable (Figure S4h,i). Therefore, it is unclear whether DNA methylation is specifically targeted to genes with a particular expression level.

Preferential association of particular chromatin states with classes of genes that are involved in development or response to pathogens

To address whether any combined chromatin patterns are involved in the regulation of specific biological processes, such as development or response to environmental stimuli, we quantified the specificity of gene expression for five chromatin states during plant development and in response to various light conditions, abiotic stresses and biotic stresses (Schmid et al., 2005). Shannon entropies were calculated using the corresponding AtGenExpress datasets for genes in each chromatin state group (Figure 7b–e) (Schug et al., 2005; Zhang et al., 2006). Large entropy values indicate low specificity of gene expression and smaller entropy values suggest higher specificity (Schug et al., 2005). Genes within clusters 4, 9 and 12 generally show high entropy for all four conditions. This suggests that the majority of genes within these groups are consistently expressed across different developmental stages and light or stress conditions. In contrast, genes that have a specific expression pattern (i.e. low entropy) in response to developmental stage or pathogen challenge are enriched in cluster 5 (i.e. H3K27me3 only) and to a lesser extent cluster 7, which is the bivalent group that contains both H3K4me3 and H3K27me3 marks (Figure 7b,e). Genes that respond to abiotic stresses or light treatments do not appear to be enriched in any of the chromatin state groups examined here (Figure 7c,d). As gene suppression mediated by the H3K27me3 mark probably involves Polycomb-related proteins in plants (Pien and Grossniklaus, 2007), our observation suggests that this type of silencing mechanism may be involved in responses to biotic stresses in addition to its well-known function in plant development. However, responses to light and other abiotic stresses may be largely independent of regulation via the H3K27me3 mark.

Consistent with the fact that H3K27me3-targeted genes show expression specificity in response to developmental cues and biotic stresses, transcription factors are over-represented in clusters 5 and 7 (Table S2). In addition, genes that encode structural constituents of the ribosome were frequently found in cluster 4, which contains genes with high expression levels. This apparent enrichment might reflect both the housekeeping function and the relatively short length of the genes in this cluster. The gene ontology was similarly examined for genes of various size ranges (Table S3).

Discussion

Genome-wide correlative analysis constantly faces a trade-off between resolution and extraction of general patterns. A previous study determined the inter-relationship between gene length and the H3K4me3, H3K36me2 and H3K27me3 marks with a bin size of approximately 1500 genes (Oh et al., 2008). We improved the resolution of this analysis by using heat map visualization. In addition, using the multi-colour merging technique, spatial relationships between multiple chromatin modifications can be visualized easily at high resolution. We found that H3K4me3 and H3K36me2 respectively mark the 5′ and 3′ regions of transcription units with little overlap (Figure 2e,i). In contrast, DNA methylation coincides with the regions observed in longer genes that are not covered by either H3K4me3 or H3K36me2 marks, but can also spatially overlap with the H3K36me2 mark (Figure 2d,h).

The pattern of 16 chromatin states identified in the current study confirmed that gene length is one of the critical parameters involved in establishing and/or maintaining chromatin states across transcription units. In addition to the state-specific chromatin patterns, a major distinction between clusters 4, 9 and 12 is the differential distribution of gene lengths. We hypothesize that, depending on the gene size, different chromatin states have evolved for optimal gene regulation to ensure correct amplitude and specificity of expression. H3K36me2 and DNA methylation may be dispensable for transcription of shorter genes as the elongating RNA polymerase II complex needs little time to traverse the transcription unit and fewer nucleosomes will need to be displaced and repositioned.

Of the 16 possible chromatin state clusters generated based on four chromatin modifications, some may correspond to as yet unknown mechanisms of genome regulation. We detected significant 5′ enrichment of the H3K36me2 mark in clusters 3 and 8 (a total of 1288 genes), which correlates with transcription repression. This observation underscores that combining clustering and multi-colour visualization of genome-wide datasets is a useful approach to aid the discovery of infrequent patterns that deviate from the general picture.

Using the 16 chromatin state clusters as a common anchor, we show here that ANCORP enables simultaneous visualization and parallel analysis of multiple chromatin modifications as well as gene expression data. This approach provides a convenient platform for discovering correlations in the context of chromatin states. H3K4me3 was found to be specifically depleted from the 3′ region of genes in clusters 9, 10 and 12 compared to those in cluster 4, concomitant with greater nucleosome occupancy in these regions. Together with a previous observation that DNA methylation and H2A.Z are mutually exclusive (Zilberman et al., 2008), DNA methylation of gene bodies inversely correlates with multiple chromatin properties that associate with transcription initiation. We thus envisage that DNA methylation, as well as the H3K36me2 mark, may result in a more ‘closed’ chromatin conformation at the 3′ portion of gene bodies to ensure the fidelity of transcription and prevent the activity of cryptic promoters in longer transcription units. It is worth noting that these correlations are unlikely to be found without defining chromatin states together with pattern visualization at high resolution. For instance, calculating the global correlation coefficient reveals a positive relationship between nucleosome density and DNA methylation. However, the enrichment of both DNA methylation and histones in the pericentromeric region will impede unambiguous correlation determination in transcribed genes.

Observations from the current analysis illustrate the importance of data integration. Combining genomic and epigenomic features to define the context or chromatin states has augmented the sensitivity of genome-wide correlative analysis. Patterns associated with a relatively small number of genes (e.g. clusters 3 and 8) or a portion of the transcription unit (e.g. the 3′ region of long genes) can be extracted from the datasets and used to define a testable hypothesis. In the present study, chromatin states were defined on the basis of only three histone modifications and DNA methylation, and are likely to be a gross over-simplification of the actual molecular patterns of chromatin in cells. In addition, the ‘all-or-none’ determination of modified chromatin domains using HMM methods is somewhat arbitrary, and a continuous spectrum of ChIP enrichment may be overlooked in some instances due to the statistical cut-off. Lastly, the heterogeneity of cell populations in the tissue used to obtain the genome-wide datasets also creates some uncertainties as to the actual situation in specific cell types. Notwithstanding these deficiencies, the present study revealed novel patterns of correlation between gene length, chromatin states and genome output. In future, genome-wide profiling of more chromatin features using defined cell populations, together with more refinement of the algorithm for the definition of chromatin states, will provide a more accurate and functional description of the molecular patterns associated with chromatin.

Experimental procedures

Microarray data and processing

Genome-wide profiles of H3K4me3, H3K36me2, H3K27me3 and histone H3 have been published previously (Oh et al., 2008). Microarray intensity files in the CEL format were downloaded from GEO accession GSE7907 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7907). Microarray data from H3K4me3, H3K36me2 and H3K27me3 ChIP–chip experiments were quantile-normalized against H3 ChIP–chip data using TileMap (Ji and Wong, 2005). The HMM method implemented in TileMap was used for identification of chromatin domains modified with the various histone modifications. Modified chromatin domains were defined as contiguous probes with posterior probabilities >0.5. The minimum domain size was set to 100 bp, with domains located within 300 bp merged into one. To analyse global nucleosome occupancy, microarray data for H3 ChIP–chip were similarly quantile-normalized against data obtained with genomic DNA hybridized to arrays.

The genomic profile of DNA methylation was downloaded from GSE5974 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5974) and has been published previously (Zilberman et al., 2007). The log2 values for immunoprecipitation (IP)/input provided in GSE5974 had already been scaled by subtracting the bi-weight mean, and therefore we performed no further normalization. Methylated domains were identified as at least two contiguous probes with log2 values for IP/input >1.28 (Zilberman et al., 2007).
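The threshold rule for methylated domains (at least two contiguous probes with log2 IP/input >1.28) can be sketched as follows. The function name `methylated_domains` and the probe tuple layout are hypothetical, and consecutive list entries are assumed to be neighbouring probes.

```python
def methylated_domains(probes, threshold=1.28, min_probes=2):
    """Return (start, end) of runs of >= min_probes contiguous probes
    whose log2 IP/input exceeds threshold.

    probes: list of (start, end, log2_ratio) sorted by position
    (hypothetical layout, for illustration only).
    """
    domains, run = [], []
    for probe in probes:
        if probe[2] > threshold:
            run.append(probe)
            continue
        # A sub-threshold probe closes the current run.
        if len(run) >= min_probes:
            domains.append((run[0][0], run[-1][1]))
        run = []
    # Close a run that extends to the last probe.
    if len(run) >= min_probes:
        domains.append((run[0][0], run[-1][1]))
    return domains


example = [(0, 50, 1.5), (60, 110, 1.3), (120, 170, 0.5),
           (180, 230, 2.0), (240, 290, 1.0)]
called = methylated_domains(example)
```

The single probe at 180–230 exceeds the threshold but is not reported, because it has no qualifying neighbour.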

The microarray data for transcript abundance used in Figures 1e, 4j and S3 were obtained from GSE9646 (Matsui et al., 2008). The plant tissue used for this experiment is comparable to samples used in other ChIP–chip datasets. Untreated control samples were used for our analysis. CEL files were normalized using single-sample mode by MAT (model-based analysis of tiling arrays) (Johnson et al., 2006). The MAT scores for each probe were used to generate heat maps.

The H2A.Z profile in root tissue of Arabidopsis was downloaded from GEO accession GSE12212 (Zilberman et al., 2008).

Hierarchical clustering and generation of stacked heat maps

For Figure 1, average log2 values for IP/H3 or IP/input were calculated across all TAIR8 genes at 50 bp resolution using custom Perl scripts. The resulting list, including the enrichment level of chromatin modifications across all genes, was hierarchically clustered using Cluster 3.0 with the single-linkage method (de Hoon et al., 2004). To generate a correlative pattern based on a particular anchor, the list (CDT file in Cluster 3.0 format) including the correlative pattern was sorted into the same gene order as the anchor CDT file using custom Perl scripts. All CDT files were visualized using Java TreeView 1.1.3 (http://sourceforge.net/projects/jtreeview/, Saldanha, 2004). Two-channel merging of heat maps was performed using the RGB merging function of ImageJ version 1.38 (http://rsbweb.nih.gov/ij/, Abramoff et al., 2004).
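The anchoring step, in which a second dataset is sorted into the gene order of the anchor, is the core of the correlative display. The authors used custom Perl scripts operating on CDT files; the Python sketch below only illustrates the idea on in-memory rows. The function name `reorder_by_anchor`, the gene identifiers, and the choice to fill genes missing from the second dataset with `None` are assumptions.

```python
def reorder_by_anchor(anchor_order, rows):
    """Sort rows of a correlative dataset into the anchor's gene order.

    anchor_order: gene IDs in the order they appear in the anchor.
    rows: list of (gene_id, values) for the dataset to be re-sorted.
    Genes absent from rows get a row of None (an assumed convention).
    """
    by_gene = {gene: values for gene, values in rows}
    width = len(rows[0][1]) if rows else 0
    return [(gene, by_gene.get(gene, [None] * width))
            for gene in anchor_order]


# Hypothetical gene IDs and binned enrichment values.
anchor_order = ["geneC", "geneA", "geneB"]
h3_rows = [("geneA", [0.1, 0.2]),
           ("geneB", [1.0, 1.1]),
           ("geneC", [-0.5, 0.0])]
sorted_rows = reorder_by_anchor(anchor_order, h3_rows)
```

Because every correlative dataset is sorted by the same anchor, row i in each resulting heat map refers to the same gene, which is what allows the heat maps to be compared side by side or merged as colour channels.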

Determination and clustering of chromatin states

The positions of modified chromatin domains for each modification identified using the HMM method were compared for all Arabidopsis genes. Genes (from transcription start site to transcription termination site based on TAIR8) that overlap with modified domains were identified as targets of the particular chromatin modification. The chromatin state for each gene was represented by a four-digit binary code, where 0 and 1 correspond to a negative or positive association with a certain chromatin modification. The binary codes for all Arabidopsis genes were then subjected to hierarchical clustering using the single-linkage method to reveal groups of genes with similar dominant states.
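The four-digit binary encoding can be sketched in a few lines of Python. The order of marks in the code, the half-open coordinate convention and the function names are assumptions made for illustration; the source does not specify the digit order.

```python
# Assumed digit order; the source does not state which digit is which mark.
MARKS = ("H3K4me3", "H3K36me2", "H3K27me3", "5mC")


def overlaps(gene_start, gene_end, domains):
    """True if any (start, end) domain overlaps the gene (half-open coordinates)."""
    return any(d_start < gene_end and d_end > gene_start
               for d_start, d_end in domains)


def chromatin_state(gene_start, gene_end, domains_by_mark):
    """Return a four-digit binary code, one digit per mark in MARKS."""
    return "".join(
        "1" if overlaps(gene_start, gene_end, domains_by_mark.get(mark, []))
        else "0"
        for mark in MARKS
    )


# Hypothetical gene spanning 1000-3000 bp on one chromosome.
domains_by_mark = {
    "H3K4me3": [(900, 1200)],    # overlaps the 5' end
    "H3K36me2": [(3500, 4000)],  # downstream of the gene, no overlap
    "H3K27me3": [],
    "5mC": [(2000, 2500)],       # inside the gene body
}
state = chromatin_state(1000, 3000, domains_by_mark)
```

Four binary digits give 2^4 = 16 possible codes, which corresponds to the 16 chromatin state clusters discussed in the text.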

A computational permutation method was used to test the statistical significance of the chromatin state models generated by our method. Simulated chromatin states were generated by randomly assigning chromatin modifications to genes. In this process, the number of genes carrying each modification was kept identical to that determined by the biological experiment (e.g. 15 887 genes modified by H3K4me3). This simulation was repeated 10 000 times. The actual numbers of genes associated with each state were compared with those generated by permutation to derive the P value. The P value describes the likelihood that a certain chromatin state is generated by random coincidence rather than biological mechanisms.
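The permutation procedure can be sketched in Python as below. The source does not state how the P value was computed from the simulated counts, so the one-sided test for enrichment and the add-one correction used here are assumptions, as are the function names and the integer bit encoding of states.

```python
import random
from collections import Counter


def simulate_state_counts(n_genes, mark_counts, rng):
    """Assign each mark to a random gene subset of fixed size and count
    how many genes fall into each combined state (encoded as bit flags)."""
    codes = [0] * n_genes
    for bit, count in enumerate(mark_counts):
        for gene in rng.sample(range(n_genes), count):
            codes[gene] |= 1 << bit
    return Counter(codes)


def permutation_p_value(observed_count, state, n_genes, mark_counts,
                        n_iter=10000, seed=0):
    """One-sided empirical P value for enrichment of `state` (assumed test)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        simulated = simulate_state_counts(n_genes, mark_counts, rng)
        if simulated[state] >= observed_count:
            hits += 1
    # Add-one correction avoids reporting P = 0 (an assumed convention).
    return (hits + 1) / (n_iter + 1)


# Toy example: 100 genes, two marks carried by 50 genes each.
toy_counts = simulate_state_counts(100, (50, 50, 0, 0), random.Random(1))
p_enriched = permutation_p_value(50, 0b11, 100, (50, 50, 0, 0), n_iter=200)
p_trivial = permutation_p_value(0, 0b11, 100, (50, 50, 0, 0), n_iter=200)
```

In the toy example, observing all 50 doubly marked genes coinciding is far more overlap than random assignment produces, so the empirical P value is bounded only by the number of iterations. A test for depletion of a state would count simulations with `simulated[state] <= observed_count` instead.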

Analysis of gene expression level and specificity of expression

For the analysis of gene expression shown in Figures 7a and S4, the tissue type ATGE 101 (aerial part of 21-day-old seedlings grown on 1 × MS agar, 1% sucrose under continuous light) of the AtGenExpress Arabidopsis development expression atlas was used. In order to examine the specificity of gene expression across tissue identities and growth conditions, we obtained gcRMA (guanine cytosine robust multi-array analysis) normalized gene expression datasets from http://www.weigelworld.org/resources/microarray/AtGenExpress. Only expression data generated using wild-type Arabidopsis tissues in the AtGenExpress developmental expression atlas were used to calculate the developmental entropy of genes. To estimate the specificity of gene expression in response to abiotic stress, only data obtained from shoot tissue were used. Shannon entropies of genes were calculated as described previously (Zhang et al., 2006). The expression data were first converted from logarithmic to linear scale. To allow detection of genes that are only expressed in a few tissues or under a few conditions, the median value of the expression profile for each gene was subtracted. As a consequence, entropy was calculated for only the top half of the expression profile for individual genes. The relative expression level for gene g at condition c was defined as $P_{c|g} = \omega_{c,g} / \sum_{1 \le c \le N} \omega_{c,g}$, where $\omega$ is the median-subtracted expression level and N is the total number of conditions. The Shannon entropy (H) of gene g is determined as $H_g = -\sum_{1 \le c \le N} P_{c|g} \log_2 P_{c|g}$.
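The entropy calculation can be illustrated with a short Python sketch. Several details are assumptions rather than statements from the source: the input is taken to be on a log2 scale, values that fall below the median are set to zero after subtraction (one reading of "only the top half of the expression profile"), the entropy carries the conventional negative sign, and a flat profile returns NaN because its entropy is undefined under this scheme.

```python
import math
from statistics import median


def shannon_entropy(log2_expression):
    """Expression specificity of one gene across N conditions.

    log2_expression: list of log2-scale expression values (assumed base 2).
    Low entropy indicates condition-specific expression.
    """
    linear = [2.0 ** value for value in log2_expression]
    mid = median(linear)
    # Median subtraction; sub-median conditions are clipped to zero (assumption).
    weights = [max(value - mid, 0.0) for value in linear]
    total = sum(weights)
    if total == 0.0:
        return math.nan  # flat profile: relative expression is undefined
    probs = [w / total for w in weights]
    # Terms with P = 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Expressed above the median in a single condition: entropy 0 bits.
specific = shannon_entropy([0, 0, 0, 0, 3])
# Expressed above the median equally in two conditions: entropy 1 bit.
two_conditions = shannon_entropy([0, 0, 0, 1, 1])
```

With N conditions the entropy ranges from 0 bits (all above-median expression in a single condition) up to at most log2 N bits, which matches the interpretation in the text that small entropies indicate high specificity.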

Acknowledgements

We thank Drs T.P. Michael (Waksman Institute, Rutgers University) and A. Sengupta for critical reading of the manuscript, and Drs R.A. Kerstetters (Waksman Institute, Rutgers University), N. Watanabe and A. Amini for discussions and advice. Support for this chromatin research from the Biotechnology Center for Agriculture and the Environment, School of Environmental and Biological Sciences, Rutgers University, NJ, is greatly appreciated.
