Gene body methylation shows distinct patterns associated with different gene origins and duplication modes and has a heterogeneous relationship with gene expression in Oryza sativa (rice)

Authors


Author for correspondence:

Andrew H. Paterson

Tel: +1 706 583 0162

Email: paterson@plantbio.uga.edu

Summary

  • Whole-genome duplication (WGD) has occurred recurrently, and single-gene duplication is also widespread, in angiosperms. Recent whole-genome DNA methylation maps indicate that gene body methylation (i.e. of coding regions) has a functional role. However, whether gene body methylation is related to gene origins and duplication modes has yet to be reported.
  • In rice (Oryza sativa), we computed a body methylation level (proportion of methylated CpG within coding regions) for each gene in five tissues.
  • Body methylation levels follow a bimodal distribution, but show distinct patterns associated with transposable element-related genes; WGD, tandem, proximal and transposed duplicates; and singleton genes. For pairs of duplicated genes, divergence in body methylation levels increases with physical distance and synonymous (Ks) substitution rates, and WGDs show lower divergence than single-gene duplications of similar Ks levels. Intermediate body methylation tends to be associated with high levels of gene expression, whereas heavy body methylation is associated with lower levels of gene expression.
  • The biological trends revealed here are consistent across five rice tissues, indicating that genes of different origins and duplication modes have distinct body methylation patterns, and body methylation has a heterogeneous relationship with gene expression and may be related to survivorship of duplicated genes.

Introduction

Gene duplication is a primary mechanism for the evolution of novelty and complexity in higher organisms (Ohno, 1970; Flagel & Wendel, 2009; Innan & Kondrashov, 2010). It is now known that genes may be duplicated by various modes, generally referred to as large-scale and small-scale duplications (Maere et al., 2005; Casneuf et al., 2006; Ganko et al., 2007; Freeling, 2009; Wang et al., 2012). The most frequent consequence of gene duplication is reversion to single-copy (singleton) status (Freeling & Thomas, 2006; Freeling, 2009); however, genes retained in duplicate offer the potential for the evolution of novelty (Ohno, 1970; Flagel & Wendel, 2009; Innan & Kondrashov, 2010). Thus, the study of mechanisms for gene retention and evolution in view of different gene duplication modes is very important (Wang et al., 2012). Oryza sativa (rice) is a good model to elucidate the genetic mechanisms and evolutionary features of different gene duplication modes (Wang et al., 2007, 2011; Li et al., 2009).

Rice has experienced at least two whole-genome duplications (WGDs), one shared with most if not all cereals (ρ), and another more ancient event (σ) (Paterson et al., 2004; Tang et al., 2010). In angiosperm species, most duplicated chromosomal segments are thought to arise from WGDs (Tang et al., 2008a,b). Small-scale gene duplications, often referred to as single-gene duplications, are also widespread in rice (Wang et al., 2007, 2011; Li et al., 2009). According to the physical distance between duplicates, single-gene duplications can be further classified into local and transposed gene duplications (Ganko et al., 2007; Wang et al., 2011, 2012). Local duplications may occur as tandem duplications (i.e. duplicated genes are consecutive in the genome), which may be caused by illegitimate chromosomal recombination (Freeling, 2009), or proximal duplications (i.e. separated by one or more genes), which may be caused by localized transposon activities (Zhao et al., 1998; Wang et al., 2011, 2012). Transposable element (TE)-related genes comprise a significant portion of rice protein-coding genes (Yuan et al., 2005; Jiao & Deng, 2007). TE-related genes have normal gene structures with coding capacity and transcriptional activity, but share significant sequence similarity with known TEs (Jiao & Deng, 2007). Transposed duplications that create two gene copies far away from each other are widespread in plants (Freeling et al., 2008; Freeling, 2009; Woodhouse et al., 2010, 2011; Wang et al., 2011, 2012), suggesting that many non-TE-related genes are also mobile, via either DNA- or RNA-mediated transposition (Cusack & Wolfe, 2007). Transposed duplicates may also occur by intrachromosomal recombination (Woodhouse et al., 2011).

Divergence between duplicated genes increases with time, but the rate/extent of divergence is affected by gene duplication modes (Casneuf et al., 2006; Arabidopsis Interactome Mapping Consortium, 2011; Wang et al., 2011). Generally, WGD duplicates are less divergent than other duplicates (Casneuf et al., 2006; Ganko et al., 2007; Li et al., 2009; Wang et al., 2011). Moreover, singletons show higher interspecies conservation than duplicates based on cross-species comparison of genomic and expression data (Ha et al., 2009; Wang et al., 2011). Indeed, the distinct evolutionary effects of gene duplication modes may, in turn, affect the rates of gene retention, depending on functional category-specific selection pressures on neo-functionalization, functional buffering or high expression (Freeling, 2009; Innan & Kondrashov, 2010; Wang et al., 2012).

Under-explored and controversial in the current literature are the roles of epigenetic marks in gene duplication, evolution and retention. DNA methylation is one of the most important epigenetic marks, and high-resolution whole-genome DNA methylation maps based on bisulfite sequencing have been made for rice (Feng et al., 2010; Zemach et al., 2010a,b). Previous analyses of whole-genome DNA methylation data have suggested that rice DNA methylation occurs predominantly at cytosine followed by guanine, that is, ‘CpG’ dinucleotides (Feng et al., 2010; Zemach et al., 2010b). Gene body methylation (DNA methylation of coding regions) is conserved across eukaryotic lineages (Lee et al., 2010; Su et al., 2011). Although it is broadly accepted that promoter methylation is generally associated with the repression of plant gene expression (Zhang et al., 2006; Su et al., 2011), the functional roles of gene body methylation are controversial (Lee et al., 2010; Su et al., 2011). To date, gene body methylation has been suggested to enhance accurate splicing of primary transcripts (Lorincz et al., 2004; Kolasinska-Zwierz et al., 2009; Schwartz et al., 2009; Luco et al., 2010) and/or prevent ‘leaky’ expression from intragenic cryptic promoters (Zilberman et al., 2007; Maunakea et al., 2010). In Arabidopsis and rice, association of gene body methylation with active transcription has been proposed (Zhang et al., 2006; Zilberman et al., 2007; Zemach et al., 2010b; Takuno & Gaut, 2012). By contrast, several studies in rice have suggested that the major effect of body methylation on gene expression is repression (Li et al., 2008; He et al., 2010). From the point of view of evolution, body-methylated genes have been suggested to be functionally important and to evolve slowly (Sarda et al., 2012; Takuno & Gaut, 2012). However, the interplay between gene body methylation and gene duplication, as well as the evolution of duplicate genes, has been little explored.

Study of the potential interplay between gene body methylation and gene origins and duplications may help us to understand the roles of epigenetic factors in shaping current genomes, as well as the mechanisms underlying gene duplications and evolution. In rice, we analyzed single-base resolution, whole-genome DNA methylation maps of five tissues (Zemach et al., 2010a,b). For each gene, we computed a body methylation level (proportion of methylated CpG dinucleotides within coding regions) in each tissue. We classified rice genes into different origins and duplication modes, including TE-related genes, singletons, and WGD, tandem, proximal and transposed duplicates, and compared the body methylation levels among different categories of genes. For duplicated genes, we examined divergence in body methylation levels and its relationship with coding sequence divergence. Furthermore, we studied the potential relationships between body methylation and duplicate gene retention. Finally, we investigated the complicated relationships between body methylation and gene expression levels.

Materials and Methods

Sequence sources

The rice gene set was retrieved from the Rice Genome Annotation Project (TIGR5, http://rice.plantbiology.msu.edu/). The gene sets of outgroups, including Sorghum bicolor, Brachypodium and Zea mays, were retrieved from Phytozome (http://www.phytozome.net/). For each gene, only the first transcript in the genome annotation (transcript name suffixed by ‘.1’) was used for analysis.

Identification of genes of different origins

Rice genes were first divided into TE-related and non-TE-related genes, according to TIGR5. The non-TE-related genes were further classified into WGD duplicates, singletons, tandem, proximal, transposed and dispersed duplicates. To this end, the population of potential gene duplications in rice was identified using BLASTP (Altschul et al., 1990) (TE-related genes were not considered for BLASTP). For each gene, only the top five nonself BLASTP matches that met a threshold of E < 10⁻¹⁰ were considered as potential gene duplication relationships. The genes without any BLASTP hit were deemed singletons. WGD duplicates were obtained from a previous study (Tang et al., 2010). We then derived single-gene duplications by excluding pairs of WGD duplicates from the population of gene duplications. Tandem duplicates were adjacent homologs, and proximal duplicates were not adjacent, but within 10 annotated genes of each other on the same chromosome and without any paralog between them. The remaining single-gene duplications, that is, after deduction of the tandem and proximal duplications, were searched for transposed duplications. To accomplish this, genes at ancestral (i.e. interspecies collinear) chromosomal positions were discerned by aligning syntenic blocks within rice and between rice and its outgroups, including Sorghum bicolor, Brachypodium and Zea mays. For a pair of transposed duplicates, we required that one duplicate was at its ancestral locus and the other was at a nonancestral locus, named the parental duplicate and transposed duplicate, respectively. For a transposed duplicate, there may be multiple ancestral paralogs, and we regarded the ancestral paralog with the highest sequence identity as its parental duplicate. The remaining duplicates that did not belong to any of the WGD, tandem, proximal or transposed categories were simply denoted as dispersed duplicates.
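The tandem and proximal rules described above can be illustrated with a short sketch. This is not the authors' pipeline: the function name, the representation of gene order as integer ranks along a chromosome, and the reading of 'within 10 annotated genes' as a rank difference of at most 10 are all assumptions made for illustration.

```python
def classify_local_pair(chrom_a, idx_a, chrom_b, idx_b,
                        paralog_indices=(), max_gap=10):
    """Classify a single-gene duplicate pair as 'tandem', 'proximal' or 'other'.

    idx_a and idx_b are ranks of the two genes in annotation order along
    their chromosomes. paralog_indices holds the ranks of other paralogs
    located on the same chromosome (hypothetical input format).
    """
    if chrom_a != chrom_b:
        return "other"
    lo, hi = sorted((idx_a, idx_b))
    distance = hi - lo
    if distance == 1:
        # Adjacent homologs.
        return "tandem"
    # Assumption: "within 10 annotated genes" means a rank difference <= 10.
    if 1 < distance <= max_gap and not any(lo < p < hi for p in paralog_indices):
        return "proximal"
    return "other"
```

Pairs returned as 'other' would still need the synteny-based step to separate transposed from dispersed duplicates, which this sketch does not attempt.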

Rice whole-genome DNA methylation data

Rice single-base resolution DNA methylation data of embryo, endosperm, leaf, root and shoot tissues, generated by bisulfite sequencing technology, were obtained from two previous studies (Zemach et al., 2010a,b). We used the processed data provided by the authors, available at the Gene Expression Omnibus database (accession numbers: GSM497260, GSM560562, GSM560563, GSM560564 and GSM560565). In the processed data, the likelihood of methylation was shown for each CpG, CHG and CHH site, whose chromosomal position was annotated according to TIGR5. Only CpG methylation was considered in this study. The likelihood of CpG methylation showed a strong bimodal distribution, and we regarded a value of > 0.5 as methylation of CpG dinucleotides.
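As a minimal sketch of how a per-gene body methylation level follows from these data, the function below takes the methylation likelihoods of the CpG sites within one coding region and returns the proportion called methylated. The function name and the input format (a plain list of likelihoods) are assumptions; only the > 0.5 cut-off comes from the text.

```python
def body_methylation_level(cpg_likelihoods, threshold=0.5):
    """Proportion of CpG sites in a coding region that are called methylated.

    A site is called methylated when its likelihood is strictly greater
    than `threshold`. Returns None when no CpG site is available.
    """
    values = list(cpg_likelihoods)
    if not values:
        return None
    methylated = sum(1 for v in values if v > threshold)
    return methylated / len(values)
```

Returning None for genes without any CpG data is a design choice of this sketch, so that such genes can be dropped rather than counted as unmethylated.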

Comparing the distributions of body methylation levels

As body methylation levels tend to be bimodally distributed, it is not reasonable to compute a single mean and standard deviation of body methylation levels for a gene group. To compare the distributions of body methylation levels of different gene groups, we used both parametric and nonparametric tests: (1) parametric test: we counted the number of genes with low methylation (body methylation level < 0.1), intermediate methylation (0.1 ≤ body methylation level ≤ 0.9) and high methylation (body methylation level > 0.9) in each gene group, and then compared these counts between gene groups using a χ² test; and (2) nonparametric test: the comparison of the distributions of body methylation levels between two gene groups was modeled as testing whether one gene group had more outliers (highly body-methylated genes) than the other group. The Outlier-Sum statistic (Tibshirani & Hastie, 2007) was adopted. P values were assessed based on 10⁴ permutations of the pooled body methylation levels of the two gene groups being compared.
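The binning step and the χ² comparison for two gene groups can be sketched as follows. The function names are hypothetical, and the code covers only the special case of two groups and three methylation classes (a 2 × 3 table, 2 degrees of freedom), for which the χ² tail probability has the closed form exp(−statistic/2). The Outlier-Sum permutation test is not shown.

```python
import math


def methylation_class(level):
    """Assign a body methylation level to the low/intermediate/high class."""
    if level < 0.1:
        return "low"
    if level > 0.9:
        return "high"
    return "intermediate"


def class_counts(levels):
    """Count genes per class, returned as [low, intermediate, high]."""
    counts = {"low": 0, "intermediate": 0, "high": 0}
    for level in levels:
        counts[methylation_class(level)] += 1
    return [counts["low"], counts["intermediate"], counts["high"]]


def chi_square_two_groups(counts_a, counts_b):
    """Pearson chi-square test on a 2 x 3 contingency table.

    Returns (statistic, p_value). With 2 degrees of freedom the survival
    function of the chi-square distribution is exp(-statistic / 2).
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    for obs_a, obs_b in zip(counts_a, counts_b):
        column = obs_a + obs_b
        if column == 0:
            raise ValueError("empty class: the 2-df formula does not apply")
        exp_a = total_a * column / grand
        exp_b = total_b * column / grand
        stat += (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    return stat, math.exp(-stat / 2)
```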

Ks calculation

Protein sequences of duplicated genes were aligned using Clustalw (Thompson et al., 1994) with default parameters. Then, the protein alignment was converted to a coding sequence alignment using the ‘Bio::Align::Utilities’ module in the BioPerl package (http://www.bioperl.org/). Ks was calculated using the methods of Nei & Gojobori (1986) and Yang & Nielsen (2000), via the ‘Bio::Align::DNAStatistics’ and ‘Bio::Tools::Run::Phylo::PAML::Yn00’ modules, respectively, in the BioPerl package. It should be noted that extremely high levels of sequence divergence between duplicated genes may cause the ‘Bio::Align::DNAStatistics’ module to generate invalid Ks values, which were then ruled out from the related analysis. Following a previous study in rice (Tang et al., 2010), we excluded Ks values for gene pairs with average third-codon-position GC content (GC3) > 75% from related statistical analyses because there are two distinct groups of genes with significantly different GC3. Ks values > 3.0 were also excluded because of saturated substitutions at synonymous positions.
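The two exclusion rules for Ks values can be expressed as a small filter. This is an illustrative sketch rather than the BioPerl code used in the study: the function names are hypothetical, the coding sequences are assumed to be in frame, and 'average GC3' is taken here to be the mean of the two genes' GC3 fractions.

```python
def gc3_fraction(cds):
    """Fraction of G or C at third codon positions of an in-frame CDS."""
    third = [cds[i].upper() for i in range(2, len(cds), 3)]
    if not third:
        return None
    return sum(1 for base in third if base in ("G", "C")) / len(third)


def keep_ks(ks, cds_a, cds_b, max_gc3=0.75, max_ks=3.0):
    """Return True if a duplicate pair passes the GC3 and saturation filters.

    ks is None (or negative) when the Ks estimate was invalid.
    """
    if ks is None or ks < 0 or ks > max_ks:
        return False
    gc3_a, gc3_b = gc3_fraction(cds_a), gc3_fraction(cds_b)
    if gc3_a is None or gc3_b is None:
        return False
    # Assumption: pair-level GC3 is the mean of the two genes' GC3 values.
    return (gc3_a + gc3_b) / 2 <= max_gc3
```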

Gene expression data

Processed rice expression data over 508 tissues and physiological conditions, generated by the Affymetrix GeneChip Rice Genome Array, were obtained from previous studies (Ficklin et al., 2010; Wang et al., 2011). In the data, the numbers of columns that sampled embryo, endosperm, leaf, root and shoot were 3, 4, 50, 99 and 84, respectively. For some genes, there are multiple probe sets on the array to measure their expression. Inclusion or exclusion of ‘suboptimal’ probe sets with suffix ‘_s_at’ or ‘_x_at’, which were suspected of potential cross-hybridization, has been shown previously to have only trivial effects (Wang et al., 2011). In this study, all types of probe sets were considered and, for a gene with multiple probe sets, the first probe set according to alphabetic sorting was used to represent its expression profile.
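The rule for choosing one probe set per gene can be sketched as below. The function name and the dictionary input are assumptions, and Python's default string ordering (by code point) may differ in detail from the alphabetic sorting used in the study.

```python
def representative_probe_sets(probe_to_gene):
    """Map each gene to the first of its probe set IDs in sorted order.

    probe_to_gene: dict of probe set ID -> gene ID (hypothetical format).
    """
    chosen = {}
    for probe_id in sorted(probe_to_gene):
        gene = probe_to_gene[probe_id]
        # setdefault keeps the first (alphabetically smallest) probe set.
        chosen.setdefault(gene, probe_id)
    return chosen
```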

Correlation analysis and smoothing spline regression

In this study, correlations were measured by Spearman's correlation coefficients. Smoothing spline regression was performed via the ‘smooth.spline’ function of the R language. To avoid overfitting in smoothing spline regression, three settings of the degrees of freedom (2, 4 and 6) were tested.
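For readers unfamiliar with Spearman's coefficient, it is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration of that definition, not the code used in the study, and all function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, any monotonic relationship between the two variables yields a coefficient of 1 or −1, which suits bimodally distributed methylation levels better than a linear measure.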

Results

Gene origins in rice

Like many other eukaryotic species, the rice genome has been shaped and dynamically reconstructed by multiple evolutionary forces and events, which render its genes to have different origins (International Rice Genome Sequencing Project, 2005). TE-related genes are classified on the basis of sharing significant sequence similarity with TEs (Jiao & Deng, 2007). Among non-TE-related genes, those present in only single copies were deemed to be singletons, whereas others were deemed to be duplicated. Duplicated genes were further classified in terms of duplication modes, with those at collinear positions of intraspecies syntenic blocks deemed to be WGD duplicates (Tang et al., 2010). All other duplicates were assumed to have occurred by single-gene duplications, further classified into tandem, proximal and dispersed, as described above. The mechanisms underlying dispersed duplications are very complicated (Wang et al., 2012). However, if one member of a pair of dispersed duplications was at its ancestral locus and the other was at a nonancestral locus, such gene duplications were deemed to be transposed (Wang et al., 2011, 2012). Summary statistics on rice gene origins are shown in Table 1, and the classification of duplicated genes is shown in Supporting Information Table S1.

Table 1. Statistics on rice (Oryza sativa) genes of different origins and duplication modes
Gene origin        Number of gene pairs   Number of distinct genes
Non-TE-related     N/A                    41 046
  Singletons       N/A                    12 618
  Duplicates       N/A                    28 428
    WGD            3087                   5061
    Tandem         2008                   3529
    Proximal       2484                   3728
    Transposed     6269                   6269
    Dispersed      N/A                    12 957
TE-related         N/A                    15 232

N/A, not applicable; TE, transposable element.

Body methylation levels show different distributions associated with gene origins and duplication modes

To investigate the patterns of gene body methylation in view of different gene origins and duplication modes, we computed the body methylation level for each gene, defined as the proportion of methylated CpG dinucleotides relative to all CpG dinucleotides within its coding region, in embryo, endosperm, leaf, root and shoot. To test the consistency of body methylation levels across tissues, we visualized the body methylation levels of all genes between all pairs of tissues via scatter plots (Fig. S1). Although endosperm tissue shows greater variation than other tissues, body methylation levels are much more likely to be consistent (rather than different) across tissues, that is, points (genes) are densely distributed along the ‘y = x’ diagonal line in the scatter plots. This analysis indicates that it is feasible to study the evolutionary characteristics of body methylation for large groups of genes, while acknowledging the existence of tissue-specific body methylation for specific genes.

A recent study has suggested that gene bodies cluster into two groups corresponding to high and low levels of DNA methylation, respectively, in honeybee, silkworm, sea squirt and sea anemone (Sarda et al., 2012). We plotted the distribution of body methylation levels for all rice genes (Fig. 1a), finding a clear bimodal distribution peaking at ‘0’ or ‘1’, suggesting that gene bodies tend to be either highly methylated or little methylated in rice.

Figure 1.

Gene body methylation shows different patterns associated with gene origins and duplication modes. Each column represents one tissue. (a) Distribution of body methylation levels for all rice genes. (b) Comparison of distributions of body methylation levels between transposable element (TE)-related and non-TE-related genes. (c) Comparison of distributions of body methylation levels between singleton and duplicate genes. (d) Comparison of distributions of body methylation levels among whole-genome duplication (WGD), tandem, proximal and transposed duplicates.

We found that different gene origins differ in the distributions of body methylation levels. First, we compared the distributions of body methylation levels between TE-related and non-TE-related genes, and found that the two distributions were significantly different (P < 2.2 × 10⁻¹⁶, χ²; P < 10⁻⁴, Outlier-Sum statistic; see the 'Materials and Methods' section) (Fig. 1b). Specifically, most TE-related genes are highly body-methylated (body methylation level > 0.9), consistent with previous studies (Zilberman et al., 2007; Li et al., 2008; Feng et al., 2010; He et al., 2010; Zemach et al., 2010b), whereas non-TE-related genes are bimodally distributed, with more genes little body-methylated (body methylation level < 0.1). As noted previously, TE-related genes exhibit much lower transcriptional activities than non-TE-related genes (Jiao & Deng, 2007), suggesting that high levels of body methylation may be associated with reduced transcription, and conflicting with the hypothesis that body methylation has only minor, but positive, effects on the levels of gene expression (Zhang et al., 2006; Zilberman et al., 2007; Zemach et al., 2010b; Takuno & Gaut, 2012).

We compared the distributions of body methylation levels between different origins within non-TE-related genes. Singletons show a higher frequency of high body methylation than do duplicates (Fig. 1c; P < 2.2 × 10⁻¹⁶, χ²; P < 10⁻⁴, Outlier-Sum statistic; see the 'Materials and Methods' section). Tandem, proximal and transposed duplicates show an obvious frequency peak of high body methylation (Fig. 1d), whereas WGD duplicates do not (P < 2.2 × 10⁻¹⁶, χ²; P < 10⁻⁴, Outlier-Sum statistic; see the 'Materials and Methods' section). Moreover, the likelihood of a duplicated gene being highly body-methylated follows the tendency: transposed > proximal > tandem > WGD (P < 2.2 × 10⁻¹⁶, χ²; P < 10⁻⁴, Outlier-Sum statistic; see the 'Materials and Methods' section). In partial summary, body methylation levels show different distributions associated with gene origins and duplication modes, suggesting that genes of different origins tend to have distinct epigenetic features.

Divergence in body methylation levels between duplicated genes

Genes duplicated by different modes differ in the extent of expression divergence and the rewiring of protein–protein networks (De Smet & Van de Peer, 2012; Wang et al., 2012). Here, we examined whether duplicated genes of different modes also differ significantly in divergence in body methylation levels. Divergence in body methylation levels among gene pairs duplicated by different modes (Fig. 2a) showed the following trend: random gene pairs > transposed duplicates > proximal duplicates > tandem duplicates ≈ WGD duplicates (both an ANOVA model involving all duplication modes and Tukey's honestly significant difference (HSD) test between adjacent duplication modes were significant at α = 0.05), indicating that different modes of gene duplication tend to result in different extents of divergence in body methylation levels. The physical distance between single-gene duplicates (in terms of number of genes apart) also followed a trend: transposed duplicates > proximal duplicates > tandem duplicates. We hypothesized that there may be position effects that affect body methylation levels, for example, genes that are closer to each other on chromosomes tend to have more similar body methylation levels. To test this hypothesis, we randomly selected 20 000 gene pairs on the same chromosomes and computed the correlations between divergence in body methylation levels and physical distance. These correlations ranged from 0.053 to 0.061 (P < 4.2 × 10⁻¹⁴), indicating that there exist weak position effects that affect body methylation levels for all rice genes. For single-gene duplicates, these correlations ranged from 0.111 to 0.137 (P < 2.2 × 10⁻¹⁶), indicating that the position effects increase slightly for single-gene duplicate pairs relative to random gene pairs. At the same physical distance, single-gene duplicates diverge less in body methylation levels than do random gene pairs (Fig. 2b), suggesting that body methylation patterns are either copied or recapitulated following gene duplication.

Figure 2.

Divergence in body methylation levels between duplicated genes. Each column represents one tissue. (a) Comparison of divergence in body methylation levels among different modes of gene duplication. Whiskers correspond to the minimum and maximum values in the data. (b) Linear regressions between divergence in body methylation levels and physical distance for random gene pairs and single-gene duplicate pairs.

Relationship between body methylation patterns and Ks for pairs of duplicated genes

To understand how gene body methylation evolves following gene duplication, it may be helpful to relate patterns of body methylation of duplicated genes to the divergence of their coding sequence. Synonymous (Ks) substitution rates largely reflect the neutral mutation rates of coding sequences, suggested to increase approximately linearly with time for relatively low levels of sequence divergence (Li, 1997). We first related divergence in body methylation levels between duplicated genes to Ks using linear regression (Fig. 3a). Positive correlations were found for all duplication modes (0.113 ≤ r ≤ 0.175, P < 2.2 × 10⁻¹⁶). For single-gene duplicates, these correlations ranged from 0.112 to 0.185 (P ≤ 1.081 × 10⁻⁹). However, as we have shown that, for single-gene duplicates, there is a weak correlation between divergence in body methylation levels and physical distance, the position effects could be a nuisance factor for the correlation between divergence in body methylation levels and Ks. To remove the effect of physical distance on these correlations for single-gene duplicates, we computed the partial correlations between divergence in body methylation levels and Ks. These partial correlations ranged from 0.101 to 0.159 (P ≤ 3.794 × 10⁻⁸), declining by 0.01–0.03 from their corresponding correlations, indicating that physical distance has a very weak effect on the correlation between divergence in body methylation levels and Ks. Thus, divergence in body methylation levels between duplicated genes tends to increase with Ks. Moreover, at similar Ks levels, WGDs tend to have smaller divergence in body methylation levels between duplicates than do tandem, proximal or transposed duplications. The different extent of divergence in body methylation levels between gene duplication modes may be explained by the hypothesis that WGDs generate duplicated chromosomal segments in which collinear duplicates are more likely to have similar chromatin environments, whereas single-gene, especially transposed, duplications relocate to new chromosomal positions which often have different chromatin environments.
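The partial correlation used above to control for physical distance can be illustrated with the standard first-order formula, which removes the linear effect of a third variable z from the correlation between x and y. The text does not state how the partial correlations were computed, so the sketch below is an assumption about the general technique rather than a reproduction of the analysis; the function name is hypothetical.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z.

    Here x could be divergence in body methylation, y could be Ks and
    z could be physical distance (an illustrative mapping).
    """
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom
```

When z is only weakly correlated with both x and y, the partial correlation stays close to the raw correlation, which matches the small decline reported above.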

Figure 3.

Relationships between patterns of body methylation and Ks for duplicated genes. Each column represents one tissue. (a) Linear regressions between divergence in body methylation levels and Ks for different modes of gene duplication. (b) Linear regressions between body methylation levels and Ks for different modes of gene duplication.

Next, we related the body methylation levels of duplicated genes to Ks using linear regression (Fig. 3b). The direction of the correlations differs among different modes of gene duplication: body methylation of WGD duplicates is positively correlated with Ks (0.051 ≤ r ≤ 0.084, P < 0.05), whereas body methylation of single-gene duplicates decreases with Ks (−0.212 ≤ r ≤ −0.082, P < 9.4 × 10⁻⁴). Some duplicated genes are highly methylated, particularly those generated by single-gene duplications. It is well known that single-gene duplicates have a shorter half-life than WGD-generated duplicates (Lynch & Conery, 2000). Different rates of nonrandom gene loss shortly after WGD and single-gene duplication may contribute to the contrasting directions of the correlations between body methylation levels and Ks. In the first few million years following single-gene duplication, many duplicates become nonfunctionalized and are lost (Innan & Kondrashov, 2010). Biases among these genes may mitigate the long-term tendency towards increased body methylation, as in WGD duplicates, for example if highly body-methylated duplicates are preferentially lost. Thus, there could be links between body methylation patterns and the probability of long-term survival of duplicated genes.

Relationship between gene body methylation and gene expression

The observation that TE-related genes are highly body-methylated, but little expressed, appears to conflict with the observation that body methylation has a positive effect on the levels of gene expression (Zhang et al., 2006; Zilberman et al., 2007; Zemach et al., 2010b; Takuno & Gaut, 2012). However, these two observations might be reconciled if gene body methylation has heterogeneous effects on gene expression, that is, gene body methylation affects gene expression in different ways under different conditions. We plotted the regression lines between gene expression levels and body methylation levels for all non-TE-related genes based on each tissue, using smooth splines with different degrees of freedom (Fig. 4); this showed that intermediate body methylation tends to be associated with higher gene expression levels than both low and high body methylation. To test this observation statistically, we computed the correlations between body methylation levels and expression levels for the genes with body methylation levels of < 0.5 and ≥ 0.5. These correlations ranged from 0.223 to 0.284 (P < 2.2 × 10⁻¹⁶) when the body methylation level was < 0.5, and from −0.182 to −0.101 (P ≤ 1.648 × 10⁻⁹) when the body methylation level was ≥ 0.5. This result suggests that intermediate body methylation may indeed have positive effects on transcription, possibly through the enhancement of accurate splicing of primary transcripts, whereas high body methylation is more likely to repress gene expression, which may lead to pseudofunctionalization or gene losses.

Figure 4.

Gene body methylation has heterogeneous effects on gene expression. Smooth spline curves are fitted between gene expression levels and body methylation levels for all non-transposable element (TE)-related genes, based on different degrees of freedom. A body methylation level of 0.5 appears to be a point dividing the up- and down-regulation of gene expression levels.

We related gene expression to variances of body methylation levels across tissues. Based on Fig. S1, we inferred that TE-related genes tend to have more uniform body methylation levels (closer to the ‘y = x’ diagonal line) than do non-TE-related genes, which was then confirmed statistically by a two-sample t-test on the variances of body methylation levels between TE-related and non-TE-related genes (P < 2.2 × 10⁻¹⁶). This observation indicates that the ‘repressive’ TE-related body methylation tends to be uniform across tissues. For non-TE-related genes, we found that there is a significant positive correlation (r = 0.173, P < 2.2 × 10⁻¹⁶) between the average expression levels and variances of body methylation levels, indicating that non-TE-related genes with high expression tend to vary in body methylation across tissues.

Discussion

We have related gene body methylation to gene origins and duplication modes in rice. Our results suggest that genes of different origins and duplication modes are associated with different patterns of gene body methylation, and highly body-methylated genes are preferentially lost following gene duplication. Although it is known that natural variations in DNA methylation exist among individuals of a species (Becker et al., 2011; Bell et al., 2011; Fraser et al., 2012) and that, within an individual, many cytosines may be differentially methylated among different tissues (Zemach et al., 2010a; Zhang et al., 2011; Vining et al., 2012) or developmental stages (Alisch et al., 2012), or between normal and stress conditions (Chinnusamy & Zhu, 2009), our analyses of body methylation patterns based on five different tissues reveal highly consistent evolutionary trends. We summarized a body methylation level for each gene that may involve hundreds of CpG dinucleotides. Further, we compared body methylation levels among large groups of genes with each group consisting of several thousand genes. Thus, our computational procedure, through mitigation of the effect of dynamic changes of methylation status that may occur at some cytosine nucleotides, is reliable for large-scale evolutionary analyses.

DNA methylation is an important epigenetic mark and can affect the nucleotide composition of DNA sequences. DNA methylation can trigger the spontaneous deamination of methyl-cytosine to thymine (Bird, 1980; Jones et al., 1987; Pfeifer, 2006), which makes DNA methylation levels and GC levels interdependent. The data of this study showed strong negative correlations (−0.514 ≤ r ≤ −0.458, P < 2.2 × 10⁻¹⁶) between body methylation levels and the GC content at the third codon position (GC3) for rice genes. The evolution of DNA methylation patterns and DNA sequences can be intermingled, and the study of DNA methylation evolution may facilitate the understanding of mechanisms for DNA sequence evolution.

In eukaryotic genomes, there are multiple epigenetic marks, including DNA methylation, histone modifications, nucleosome positioning and others, all of which may contribute to the regulation of gene expression (Henderson & Jacobsen, 2007). Among these epigenetic marks, DNA methylation has been studied extensively for its role in the regulation of gene expression. In rice, Li et al. (2008) showed an interplay between DNA methylation, histone methylation and gene expression, and that gene expression appeared to be repressed by DNA methylation, but to be rescued by the concurrence of DNA and H3K4 methylation. He et al. (2010) found a weak negative correlation between DNA methylation and transcript levels, and that TE-related genes are highly methylated and little transcribed. In Populus trichocarpa, gene body methylation is suggested to have a more repressive effect than promoter methylation on transcription (Vining et al., 2012). By contrast, in Arabidopsis, many studies have suggested that gene body methylation is associated with active transcription (Zhang et al., 2006; Zilberman et al., 2007; Takuno & Gaut, 2012). The conflicting conclusions on the direction of the relationship between body methylation and gene expression in previous studies may be because an overall correlation pattern has often been sought, overlooking the possibility that body methylation may have heterogeneous effects on gene expression.

In conclusion, in rice, using the proportion of methylated CpG dinucleotides within coding regions to measure the level of gene body methylation, we found that body methylation levels follow a bimodal distribution peaking at ‘0’ or ‘1’, and display distinct patterns associated with different gene origins and duplication modes. For pairs of duplicated genes, divergence in body methylation levels increases with physical distance and Ks, and WGDs show lower divergence than single-gene duplications at similar Ks levels. Body methylation of WGD duplicates tends to increase with Ks, whereas the body methylation levels of single-gene duplicates decrease with Ks, indicating that highly body-methylated genes are preferentially lost following gene duplication. Moderate body methylation tends to enhance gene expression, whereas light or heavy body methylation tends to repress gene expression. This study suggests that genes of different origins and duplication modes have distinct body methylation patterns, and body methylation evolves with DNA sequence evolution, has heterogeneous effects on gene expression and might be related to survivorship of duplicated genes.

Acknowledgements

We thank Barry Marler for IT support, Xinyu Liu for statistical consulting and Haibao Tang for providing python scripts. A.H.P. appreciates funding from the National Science Foundation (NSF: DBI 0849896, MCB 0821096, MCB 1021718). This study was supported in part by resources and technical expertise from the Georgia Advanced Computing Resource Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer.
