Evidence of function for conserved noncoding sequences in Arabidopsis thaliana


Author for correspondence:
F. Alex Feltus
Tel: +1 864 656 3231
Email: ffeltus@clemson.edu


  • Whole genome duplication events provide a lineage with a large reservoir of genes that can be molded by evolutionary forces into phenotypes that fit alternative environments. A well-studied whole genome duplication, the α-event, occurred in an ancestor of the model plant Arabidopsis thaliana. Retained segments of the α-event have been defined in recent years in the form of duplicate protein coding sequences (α-pairs) and associated conserved noncoding DNA sequences (CNSs). Our aim was to identify any association between CNSs and α-pair co-functionality at the gene expression level.
  • Here, we tested for correlation between CNS counts and α-pair co-expression and expression intensity across nine expression datasets: aerial tissue, flowers, leaves, roots, rosettes, seedlings, seeds, shoots and whole plants.
  • We provide evidence for a putative regulatory role of the CNSs. The association of CNSs with α-pair co-expression and expression intensity varied by gene function, subgene position and the presence of transcription factor binding motifs. A range of possible CNS regulatory mechanisms, including intron-mediated enhancement, messenger RNA fold stability and transcriptional regulation, are discussed.
  • This study provides a framework to understand how CNS motifs are involved in the maintenance of gene expression after a whole genome duplication event.


Ancestral duplication of chromosomes occurs on a small (e.g. tandem or transposition) or large (e.g. polyploidy) scale and provides a lineage with new genetic resources to modify biological processes (Ohno, 1970). An extreme form of gene duplication is the whole genome duplication (WGD) event. The remnants of multiple WGDs have been observed in most plant lineages (reviewed in Sémon & Wolfe, 2007; Van de Peer et al., 2009; Paterson et al., 2010). Although it is impossible to determine the precise effects of WGD on fitness or the plasticity of fitness in these ancestors of modern plants, it is clear that these lineages survived and possibly drew upon the expanded gene pool to provide adaptive advantages through subfunctionalization and neofunctionalization mechanisms (Walsh, 1995; Lynch & Force, 2000; Lynch et al., 2001; Sémon & Wolfe, 2007; Freeling, 2009). A deeper comprehension of the evolutionary forces that sculpt the enhanced post-WGD gene pool has implications for the evolution of genome size, as well as for an understanding of agriculturally relevant genome interactions in modern, heterotic polyploids and hybrids.

In the case of the model plant Arabidopsis thaliana, evidence suggests that the Arabidopsis lineage has survived three WGD events (α, β and γ, the latter being a paleohexaploidy event) and consistently returned to a diploid state (Blanc et al., 2003; Bowers et al., 2003; Maere et al., 2005). Following each WGD event, some gene pairs tend to be preferentially retained (loss resistant), whereas the remaining gene complement is reduced to a preduplicated state (diploidization, also known as fractionation; Freeling, 2009). The mechanism for partial retention of the polyploidy state in some species vs the process of fractionation is unclear, but clues may lie in synthetic polyploidy events. Several studies have examined the expression changes in the Brassicaceae family following recent polyploidy, and implicated epigenetic control as a means of differential gene silencing (Wang et al., 2006; Xu et al., 2009; Yu et al., 2010). Another system of interest is from the Asteraceae family. Expression changes in recent Tragopogon miscellus polyploids provide a model system for the examination of rapid and short-term duplicate gene fates (Tate et al., 2006; Buggs et al., 2009). It is possible that these studies of recent WGD events will lead to subfunctionalization/neofunctionalization hypotheses that can be applied to paleopolyploidy events.

By whatever retention mechanism, it is clear that many Arabidopsis genes have ‘resisted’ deletion since the most recent α-duplication event. Specific α-duplicate gene pairs are well defined (Bowers et al., 2003; Thomas et al., 2006). The question thus becomes: Why are these gene sequences conserved and what biological functions are encoded in these DNA patterns? For example, Paterson et al. (2006) have shown that there is a pattern of functional conservation at the protein domain level after WGD within and across multiple eukaryotic lineages. In that study, four protein domains found in plant gene products (e.g. ‘protein kinase’) tended to be maintained after WGD, whereas 23 domains (e.g. ‘glycine-rich’) tended to be repeatedly lost. In an A. thaliana focused α-WGD analysis, it was shown that there was a nonrandom preference for the retention (e.g. ‘cysteine metabolic process’ (biological process, BP), ‘oxygen evolving complex’ (cellular component, CC), ‘casein kinase activity’ (molecular function, MF)) or loss (e.g. ‘apoptosis (BP)’, ‘mitochondrion (CC)’, ‘oxygen binding activity (MF)’) of gene ontology (GO) terms (Blanc & Wolfe, 2004). Chapman et al. (2006) provided evidence that amino acid changes tended to be more severe in genes that were diploidized relative to unfractionated pairs following WGD in both Arabidopsis and Oryza lineages. This suggests that there are evolutionary forces potentially shifting post-WGD gene function at the protein encoding sequence level.

Sequence alignments of the regions surrounding retained duplicated genes from the α paleopolyploidy event have revealed conserved noncoding DNA sequence (CNS) patterns. These are genomic DNA motifs (15–255 bp) in close proximity to α-duplicate genes that have resisted fractionation (Thomas et al., 2007; Freeling & Subramaniam, 2009), and are similar to those previously identified in maize and rice (Kaplinsky et al., 2002; Inada et al., 2003). The size and similarity of these CNS signatures implies a functional role, but this role does not appear to be related to small RNAs or transposable elements (Freeling et al., 2007). Significantly, those genes that were most enriched for CNSs were most often associated with transcription factor activity and were enriched in particular known DNA–protein binding motifs (especially G-boxes; Freeling et al., 2007). Although conserved, the functional role of these CNS patterns is unclear. It seems likely that many CNSs play cis-regulatory roles shared by α-duplicate pairs, as reviewed by Freeling & Subramaniam (2009). For this reason, we used the presence of one or more known, significantly CNS-enriched, DNA binding motifs within an Arabidopsis CNS as a validating metric.

Utilizing publicly available gene expression datasets, our study aimed to examine gene expression patterns between retained A. thaliana α-duplicate pairs in the context of CNS signatures. Our working hypothesis was that CNSs common to an α-duplicate pair would be involved in the concomitant control of gene expression for both genes, even though the genes may now exist on different chromosomes. Our strategy was to determine whether CNS frequencies would correlate with pairwise α-duplicate co-expression or a co-increase in expression intensity, with the logic being that more CNS signatures would have a higher probability of containing cis-regulatory patterns conferring common control mechanisms.

Materials and Methods

Microarray dataset collection, tissue categorization and preprocessing

Microarray CEL files were obtained from the National Center for Biotechnology Information (NCBI)’s Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo; Edgar et al., 2002) for the Affymetrix Arabidopsis ATH1 Genome Array (GPL198). At the time of collection (15 December 2009), 5009 individual microarray experiments were downloaded. Each GEO experiment description was manually categorized into specific transcriptome categories using plant ontology (PO) terms defined by the Plant Ontology Consortium (http://www.plantontology.org). The entire annotated microarray set was then RMA normalized using RMAexpress (http://rmaexpress.bmbolstad.com/; Bolstad et al., 2003) and screened for outliers with arrayQualityMetrics (http://www.bioconductor.org/help/bioc-views/devel/bioc/html/arrayQualityMetrics.html; Kauffmann et al., 2009) as implemented in R (http://www.r-project.org). A microarray dataset was considered to be an outlier if it failed at least one of the three default tests. The entire collection of microarray datasets was iterated through outlier detection five times before no datasets were flagged as outliers (Supporting Information Fig. S1). After outlier removal, the numbers of filtered arrays with normalized transcriptome expression intensities and common PO coding used for downstream analysis were (array count in parentheses): aerial (231), flower (146), leaves (877), root (640), rosette (268), seedlings (675), seeds (108), shoot (305) and whole plant (771).

α-Duplicate pair functional categorization

Probe sets were assigned to Arabidopsis genes using AFFY-TAIR8 mappings (affy_ATH1_array_elements-2009-7-29.txt; ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix) highlighting ATH1 probe sets mapped to TAIR8 genes with α-duplicate pair genes as defined by Thomas et al. (2006) (Tables S1, S2). Any probe set at high risk of cross-hybridization (*x_at; *s_at) was excluded. The number of α-duplicate genes measured on the ATH1 array was 5550. Next, α-duplicate pairs were grouped on the basis of common predicted molecular function or cellular component GO terms (http://www.arabidopsis.org/tools/bulk/go; gene counts in parentheses; Ashburner et al., 2000): all pairs (5216), transcription factors (GO:0006351; 540), kinases (GO:0016301; 298), plasma membrane (GO:0005886; 588), chloroplast (GO:0009507; 478) (Table S3). Environmental response genes were collected from the published list of differentially expressed (DE) genes induced by varied hormone response (Goda et al., 2008) and UV stress (Kilian et al., 2007). In total, 465 DE α-duplicates were present in the ‘high-stringency’ hormone-responsive list and 146 DE α-duplicates were present in the UV stress list. An α-duplicate was associated with its partner even if it was not present in the DE list, yielding 191 hormone- and 118 UV stress-responsive α-duplicate pairs. CNSs were screened for known transcription factor binding site motifs (TFMs; detailed methods below) and α-duplicates were coded as follows: CNS-positive (CNS+) α-duplicates (2810), CNS+ and TFM-positive (TFM+) α-duplicates (2210), and CNS+ and TFM-negative (TFM–) α-duplicates (600). For a CNS to be TFM+, it contained at least one motif that was determined to be significantly enriched (P < 0.05).
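
The CNS/TFM coding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the gene identifiers and the input layout (one list of motif enrichment P values per CNS) are hypothetical, and it assumes that an α-duplicate is TFM+ as soon as any one of its CNSs carries a significantly enriched motif.

```python
def code_duplicate(cns_motif_pvalues, alpha=0.05):
    """Code one alpha-duplicate from its CNSs.

    cns_motif_pvalues: one entry per CNS; each entry lists the enrichment
    P values of the motifs found in that CNS (hypothetical input layout).
    """
    if not cns_motif_pvalues:
        return "CNS-"
    # Assumption: a single CNS with one enriched motif makes the gene TFM+.
    has_tfm = any(any(p < alpha for p in motifs) for motifs in cns_motif_pvalues)
    return "CNS+/TFM+" if has_tfm else "CNS+/TFM-"


# Hypothetical genes used only to exercise the three categories.
examples = {
    "geneA": [],                      # no CNS at all
    "geneB": [[0.20], []],            # two CNSs, no enriched motif
    "geneC": [[0.30, 0.01], [0.50]],  # one CNS carries an enriched motif
}
codes = {gene: code_duplicate(cns) for gene, cns in examples.items()}
print(codes)
```

Under this coding the TFM+ and TFM– groups partition the CNS+ group, which agrees with the counts reported above (2210 + 600 = 2810).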

α-Duplicate pair presence/absence of expression

Individual expression calls were made for all probe sets in each transcriptome dataset using the MAS 5.0 algorithm (default parameters) as implemented in the R Bioconductor affy package (Gentleman et al., 2004; http://www.bioconductor.org). Individual probe sets were called ‘present’ (P) if their P value was ≤ 0.04, ‘marginal’ (M) if the P value was between 0.04 and 0.06, and ‘absent’ (A) if the P value was ≥ 0.06. For each PO-defined dataset, the number of P calls for each α-duplicate (A1/A2) was determined across all arrays, and log10(PA1/PA2) P count ratios were determined. If the ratio was within two standard deviations (SDs) from the mean of the normal distribution, both members of the α-duplicate pair were deemed to be present in the PO-defined group. Alternatively, if the log10(PA1/PA2) value was outside of two SDs, only one-half of the pair was considered to be present. If both members of a pair had no P calls in an organ set, the pair was considered to be absent for that dataset. Heat maps of all ATH1 probe sets (Fig. S2) and of only the α-duplicates present in each of the nine PO-defined datasets (Fig. S3) were generated using the heatmap function in R/Bioconductor.
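
The presence/absence rule above can be illustrated with a short sketch. It is not the authors' implementation: the function names and the example counts are hypothetical, the sample mean and SD of the observed ratios stand in for the fitted normal distribution, and pairs in which exactly one member has zero P calls (where the log ratio is undefined) are assumed to count as one-member-present.

```python
import math
from statistics import mean, stdev


def detection_call(p_value):
    """Map a MAS 5.0 detection P value to a call using the thresholds in the text."""
    if p_value <= 0.04:
        return "P"   # present
    if p_value < 0.06:
        return "M"   # marginal
    return "A"       # absent


def pair_presence(p_counts, n_sd=2.0):
    """Classify pairs from {pair_id: (P calls for A1, P calls for A2)}."""
    status, ratios = {}, {}
    for pair, (a1, a2) in p_counts.items():
        if a1 == 0 and a2 == 0:
            status[pair] = "absent"
        elif a1 == 0 or a2 == 0:
            # Assumption: the log ratio is undefined, so treat as one-sided.
            status[pair] = "one member"
        else:
            ratios[pair] = math.log10(a1 / a2)
    if ratios:
        values = list(ratios.values())
        mu = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        for pair, ratio in ratios.items():
            within = abs(ratio - mu) <= n_sd * sd
            status[pair] = "both" if within else "one member"
    return status


# Hypothetical P-call counts for one PO-defined dataset.
counts = {f"pair{i}": (100, 100) for i in range(1, 9)}
counts.update({"pair9": (100, 1), "pair10": (0, 0), "pair11": (5, 0)})
status = pair_presence(counts)
print(status)
```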

α-Duplicate pair CNS counts

CNS counts per gene originally published for TAIR5 (Thomas et al., 2007) were updated to TAIR8 for this investigation. Any ambiguous CNS gene assignments were manually examined using the GEvo application in the CoGe suite of genomics tools (http://synteny.cnr.berkeley.edu/CoGe/) employing the same rules as applied in the original study. These genes were checked with regard to whether they were α-duplicate pairs according to the annotations of Thomas et al. (2006), and are listed in Table S1. CNS counts per gene can be found in Table S4.

Detection of TFMs enriched within At–At CNSs

The enrichment of DNA sequence TFMs within A. thaliana homeologous CNSs was calculated. Motifs used for this analysis came primarily from AtcisDB (http://arabidopsis.med.ohio-state.edu/AtcisDB/), whereas others were obtained through an extensive literature search for experimentally confirmed A. thaliana transcription factor binding sites. The citations for all motifs are available at http://genomevolution.org/CoGe/MotifView.pl. Using regular expressions coded in Perl, each CNS was analyzed for the presence of every motif. For each motif, it was assumed that both complement and reverse complement constituted a functional orientation. Noncoding nonconserved and nonrepetitive (any 100-bp fragment that hit the genome 50 times or more was masked; E ≤ 0.001) nucleotide sequences from the gene space of each CNS-containing gene were pooled and used as the control for the CNSs in that region of gene space. The gene space was defined as the extended space including and around genic regions, encompassing coding as well as intergenic noncoding regions bounded by the farthest upstream and downstream CNSs associated with a gene. Depending on the position of the CNS relative to the gene coding sequence, the control sequences were separated into three positional groups: 5′, 3′ or intronic. The χ2 significance of the detection of each motif within the CNS per region was calculated by comparing the expected motif count based on the incidence frequency of the motif in the control sequence vs the observed motif count within the CNS per region. Independent χ2 values were determined for each motif and for each position relative to the gene with a significance cut-off of 0.05 (95% confidence). All motifs enriched within CNSs relative to control sequences with a maximum P value significance of 0.05 were classified as significantly enriched. The ratio of the motif frequency within the CNS to within the control sequence was used as the enrichment measure. 
Using these methods, we found 195 motifs to be significantly ‘enriched’ in Arabidopsis CNSs in the 5′-region, and similar groups of motifs in intronic and 3′-regions (Table S5).
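
The motif scan and enrichment test described above can be sketched as follows. The original analysis used Perl regular expressions; this Python version is an illustrative stand-in, and all function names are hypothetical. It assumes motifs are written in IUPAC codes, counts a site once when a palindromic motif matches both strands, and uses a one-degree-of-freedom χ2 statistic of the form (observed − expected)²/expected, which is one plausible reading of the comparison described in the text.

```python
import math
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "W": "[AT]", "S": "[CG]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYWSKMN", "TGCAYRWSMKN")


def reverse_complement(motif):
    """Reverse complement of an IUPAC motif string."""
    return motif.translate(COMPLEMENT)[::-1]


def site_starts(motif, seq):
    """Start positions of all (possibly overlapping) matches of motif in seq."""
    pattern = "".join(IUPAC[base] for base in motif)
    return {m.start() for m in re.finditer(f"(?={pattern})", seq)}


def count_sites(motif, seq):
    """Count sites in either orientation; palindromic hits are counted once."""
    forward = site_starts(motif, seq)
    reverse = site_starts(reverse_complement(motif), seq)
    return len(forward | reverse)


def enrichment(observed, cns_len, control_count, control_len):
    """Return (enrichment ratio, chi-square, P value) for one motif in one region."""
    if control_count == 0:
        raise ValueError("motif absent from control; expected count is zero")
    expected = control_count * cns_len / control_len
    ratio = observed * control_len / (control_count * cns_len)
    chi2 = (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 d.f.
    return ratio, chi2, p_value


gbox_hits = count_sites("CACGTG", "AACACGTGTT")    # palindromic G-box core
other_hits = count_sites("TGACG", "TGACGAAACGTCA")  # one hit per strand
ratio, chi2, p_value = enrichment(observed=10, cns_len=1000,
                                  control_count=50, control_len=10000)
print(gbox_hits, other_hits, ratio, chi2, p_value)
```

In this toy call the motif occurs twice as often per base pair in the CNS as in the control, so the enrichment ratio is 2.0 and the χ2 statistic is 5.0.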

α-Duplicate pair expression analysis in the CNS context

α-Duplicate pairs were associated with several parameters on an expression dataset-specific basis. For each pair, pairwise co-expression was calculated as determined by the Pearson correlation coefficient. Next, the average pairwise expression intensity (log2(I)) of a combined duplicate pair was calculated across a given PO-defined dataset. Lastly, the combined CNS counts for each α-duplicate pair were assigned across the full-gene or subgene regions (5′-upstream, 5′-untranslated region (5′-UTR), intron, 3′-UTR, 3′-downstream). The metrics were correlated using the standard cor function in R, using default parameters and Spearman’s ρ rank correlation for each PO-defined microarray dataset. The significance of a correlation was determined by permutation analysis in which randomly selected α-duplicate probe sets (not necessarily pairs) were subjected to an identical analysis over 10 000 permutations. Any correlation was considered to be significant when P < 0.01. All correlation coefficients and associated P values can be found in Tables S6 and S7.
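
The three per-pair metrics described above can be computed as in the following sketch. The analysis itself was run in R; this Python version is only illustrative, the function names and example values are hypothetical, and the expression vectors are assumed to be already on the log2 scale (as RMA output is).

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def pair_metrics(expr_a1, expr_a2, cns_a1, cns_a2):
    """Return (co-expression r, mean log2 intensity, combined CNS count)."""
    coexpression = pearson(expr_a1, expr_a2)
    intensity = mean(expr_a1 + expr_a2)  # average over both genes, all arrays
    return coexpression, intensity, cns_a1 + cns_a2


# Hypothetical log2 intensities for one pair across three arrays.
r, intensity, cns_total = pair_metrics([7.0, 8.0, 9.0], [6.5, 7.5, 8.5], 3, 2)
print(r, intensity, cns_total)
```

The resulting per-pair values are the inputs to the rank correlation against CNS counts.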

5′-UTR folding energy calculations and noncoding RNA pattern searches

Arabidopsis 5′-UTR, intron and 3′-UTR sequences were downloaded from TAIR (http://www.arabidopsis.org/; release TAIR8_5_utr_20080228, TAIR8_intron_20080228, TAIR8_intron_20080228). These files contained annotations of 24 267 5′-UTRs (18 962 genes), 154 240 introns (22 167 genes) and 25 273 3′-UTRs (19 889 genes) with average lengths of 148.9, 164.8 and 238.2 bp, respectively. Free folding energies (ΔG) were calculated using the RNAfold program in the Vienna RNA package (http://www.tbi.univie.ac.at/~ivo/RNA/; Hofacker et al., 1994) employing default parameters. Each of the gene lists was separated into α-duplicates and non-α-duplicates to compare differences in mean free folding energy and 5′-UTR length between groups using Student’s t-test. The gene lists were further separated into categories as follows: α-duplicates with no CNSs, α-duplicates with 5′-upstream CNSs, α-duplicates with 5′-UTR CNSs, α-duplicates with intronic CNSs, α-duplicates with 3′-UTR CNSs and α-duplicates with 3′-downstream CNSs. Control sequences for microRNA (miRNA) and transfer RNA (tRNA) were downloaded from version 16 of the miRNA database and the Genomic tRNA database (http://www.mirbase.org; http://gtrnadb.ucsc.edu/Athal). RNAfold was used to calculate the free folding energies of 243 miRNAs and 639 tRNAs. In order to associate free folding energies with expression intensity, it was necessary to remove any genes in which 5′-UTR sequences had multiple ΔG values. The remaining 16 379 genes were associated with available probe sets on the ATH1 platform (*x_at and *s_at probe sets were removed), reducing the total count to 13 768 genes with an average 5′-UTR length of 126.9 bp. The Pearson correlation values between the expression intensity and 5′-UTR ΔG were calculated using R, and no significant correlations were found (data not shown). 
The Rfam 10.0 database and scanning software (rfam_scan-1.0.2.pl; Gardner et al., 2009; ftp://ftp.sanger.ac.uk/pub/databases/Rfam/10.0/rfam_scan/rfam_scan-1.0.2.pl) were downloaded and used to search the TAIR8_5_utr_20080228 database for specific noncoding RNA patterns. Both BLAST and full covariance model searches (– global) methods were used.

α-Duplicate average expression intensity differences

All α-duplicate pairs for which both members were considered to be present in each of the nine PO-defined datasets were separated into categories on the basis of exclusive CNS position (e.g. only 5′-upstream, 3′-downstream, etc.). The mean and standard error of the pairwise α average expression intensity for each CNS positional category were calculated for each of the nine PO-defined datasets. Student’s t-test was performed to compare each of the categories with α-duplicates with no CNSs, and categories that were found to be statistically similar at the level of P > 0.01 were identified.
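
A minimal sketch of the group comparison described above is given below. The function names and example intensities are hypothetical. It computes the mean ± SEM for a category and the pooled-variance (Student's) t statistic against a reference category; converting the statistic to a P value requires a t-distribution function from a statistics library and is left out here.

```python
import math
from statistics import mean, stdev, variance


def mean_sem(values):
    """Mean and standard error of the mean."""
    return mean(values), stdev(values) / math.sqrt(len(values))


def student_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Hypothetical mean log2 intensities for pairs in two CNS categories.
intronic_only = [8.0, 8.2, 8.4]
zero_cns = [7.4, 7.6, 7.8]
m, sem = mean_sem(intronic_only)
t, df = student_t(intronic_only, zero_cns)
print(m, sem, t, df)
```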

Intron-mediated enhancement (IME) calculations

IMEter v1.0 (http://korflab.ucdavis.edu/Software/imeter-2008-08-11.tar.gz) was used to calculate the IMEter score for all TAIR8 intron sequences (TAIR8_intron_20080228), and the A. thaliana training set was obtained from the software website (IMEter-2008-08-11). The IMEter score for the first intron of the first gene variant was determined and used in the group-wise analyses.


PO-defined microarray expression set framework

Our A. thaliana gene expression analysis framework was constructed from 5009 Affymetrix ATH1 arrays downloaded from the NCBI GEO database. These arrays were RMA-normalized, probe sets flagged for the presence or absence of expression, and arrays that demonstrated outlier expression intensity distributions were removed (Fig. S1). Outliers included microdissected tissue, flow cytometry sorted samples and pollen samples. For each microarray dataset, the GEO experiment description was examined manually and the dataset was assigned one or more PO terms in order to sort the arrays into similar transcriptome profiles. We were then able to dissect the master expression matrix into nine PO-defined ‘organ/organ system’ expression datasets: aerial tissue, flower, leaves, root, rosette, seedlings, seeds, shoot and whole plant. We chose to focus our studies on similar expression profiles derived from common tissue transcriptome mixes as opposed to specific ‘treatments’ or genetic backgrounds. However, as duplicated genes have been linked to adaptation to environmental stimuli (Hanada et al., 2008), DE α-duplicate gene lists from specific UV stress and hormone-treated datasets were identified to examine environmentally responsive (i.e. outside of the organ/organ system) α-duplicate expression patterns.

ATH1 platform probe sets were then associated with TAIR8 transcript models and the probe sets underlying the CNS-coded 3166 α-duplicate pairs were identified (6332 genes; Tables S1, S2). All α-duplicate pairs were grouped on the basis of putative function, including GO annotations for transcription factors, plasma membrane, kinases and chloroplast (Table S3). α-Duplicate pairs were also annotated for α-CNS count and CNS position within the gene model structure (Table S4). Using these coded framework expression data, we were able to examine the expression patterns of all or functionally sorted α-duplicate pairs across nine PO-defined transcriptome groups. On average across the PO-defined expression sets, both members of an α-duplicate pair were expressed 84% of the time, indicating a tendency towards unrestricted expression across multiple tissues. Interestingly, the α-duplicate expression patterns appear to be sufficient to cluster similar tissue types (Fig. S3).

Correlation of CNS richness and position with α-duplicate pair co-expression

In order to determine whether there was an association between α-duplicate pair co-expression and CNS signatures, we first determined the Pearson correlation coefficient for each α-duplicate pair normalized expression vector in each of the nine PO-defined expression datasets. We used this value as our measure of α-duplicate pair pairwise co-expression and determined the Spearman’s ρ rank correlation between co-expression and total CNS frequency. Co-expression significance was determined by randomly selecting an identical number of gene expression vectors from the relevant expression dataset in which co-expression was measured. Genes were selected from the total α-duplicate pool, and a random Spearman’s ρ was determined across 10 000 permutation tests (all ρ and P values are listed in Table S6).
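
The rank correlation and its permutation-based significance can be illustrated with the sketch below. It deliberately swaps in a simpler null model than the one described above: instead of drawing random gene expression vectors from the α-duplicate pool, it shuffles the co-expression values relative to the CNS counts. All names and example values are hypothetical.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average
        i = j + 1
    return out


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


def permutation_p(x, y, n_perm=10000, seed=0):
    """Observed rho and a two-sided empirical P value from label shuffling."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical CNS counts and co-expression values for eight pairs.
cns_counts = [0, 0, 1, 2, 3, 5, 8, 13]
coexpression = [0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.80]
rho, p_value = permutation_p(cns_counts, coexpression, n_perm=2000, seed=1)
print(rho, p_value)
```

Adding one to both the numerator and denominator keeps the empirical P value above zero, so its smallest attainable value is set by the number of permutations.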

This analysis revealed that there was significant positive correlation of α-duplicate pair co-expression with the full-gene CNS count in all nine PO-defined datasets examined, except for seeds (Fig. 1a). This indicates a broad positive effect of CNS signatures on α-pair co-expression across many plant organ/organ systems. An almost identical effect was seen when total CNS base pair counts were used, given that most CNSs are very short (mean is c. 33 bp; data not shown). These correlations were weak (Spearman’s ρ ranging from 0.07 to 0.14), but significant (P < 0.01). When a similar correlation was made for CNSs localized to TAIR8 subgene positions (5′-upstream, 5′-UTR, intronic, 3′-UTR and 3′-downstream), there was a significant positive correlation with co-expression for CNSs localized to the 5′-upstream and 5′-UTR positions in all datasets, except for seeds (5′-upstream CNSs). Interestingly, there was a weak, but significant, inverse correlation between co-expression and 3′-downstream CNSs in some datasets (leaves, rosette, seedling and whole plants). When the same analysis was restricted to α-duplicate pairs that contained at least one CNS, a significant correlation between total CNS frequency and co-expression was detected in root and shoot datasets only. However, most of the 5′-upstream CNS and 5′-UTR CNS correlation trends were maintained, and the 3′-downstream inverse trend was strengthened in significance. These data suggest that CNSs are correlated with α-duplicate pair co-expression, but that the putative underlying expression control mechanisms in subgene positions (e.g. 5′-upstream/5′-UTR vs 3′-downstream) and tissues (e.g. root vs shoot) may be mixed in the sampled population.

Figure 1.

Correlation patterns between conserved noncoding DNA sequence (CNS) frequency and α-duplicate pairwise co-expression. Heat map for CNS frequency correlation with α-duplicate pair co-expression (Spearman’s ρ) in all nine expression datasets for all α-duplicate pairs (a), gene ontology (GO)-related α-duplicate pairs (b) and UV stress/hormone-treated differentially expressed (DE) α-duplicate pairs (c). CNS counts subdivided into subgene positions including 5′-upstream, 5′-untranslated region (5′-UTR), intron, 3′-UTR, 3′-downstream, or the sum of all five positions (full gene). TFM, transcription factor binding site motif; n, number of α-duplicate pairs used in test.

CNSs enriched for TFMs provide additional evidence for CNS relevance and imply (not prove) transcription factor binding near α-duplicate genes. To perform this important control and to test for the potential involvement of TFMs, we segregated CNSs into those enriched for TFMs (TFM+) and performed a correlation analysis (Fig. 1a). This revealed that at least one TFM+ CNS was required for significant positive correlation in root (full-gene CNSs, 5′-upstream CNSs), flower (5′-upstream CNSs), leaves (5′-UTR CNSs), rosette (5′-UTR CNSs), seedlings (5′-UTR CNSs), seeds (5′-UTR CNSs) and shoot (5′-UTR CNSs) datasets. In addition, the inverse correlation between α-duplicate pair co-expression and CNS frequency in 3′-downstream CNSs was most common with TFM+ CNSs (flower, leaves, rosette, seedlings, whole plant datasets), with the exception of 3′-downstream TFM– CNSs being associated with co-expression in seeds and leaves.

With the constraint that a subgene position had to contain at least one TFM+ CNS, α-duplicate pairs with 3′-downstream TFM+ CNSs demonstrated a positive correlation between α-duplicate pair co-expression and full-gene CNS frequency, which was significant across all datasets, except rosette (Fig. 2a). This effect was not seen for α-duplicate pairs whose 3′-downstream CNSs did not contain TFMs. These results do not explain the observed inverse correlation between 3′-downstream CNS frequency and co-expression, but suggest the possible involvement of regulatory protein–DNA binding, including the possibility of direct transcriptional control.

Figure 2.

Correlation patterns between conserved noncoding DNA sequence (CNS) frequency and α-duplicate pair expression intensity. Heat map for CNS frequency correlation with α-duplicate pair joint pairwise expression intensity (Spearman’s ρ) in all nine expression datasets for all α-duplicate pairs (a), gene ontology (GO)-related α-duplicate pairs (b) and UV stress/hormone-treated differentially expressed (DE) α-duplicate pairs (c). CNS counts subdivided into subgene positions including 5′-upstream, 5′-untranslated region (5′-UTR), intron, 3′-UTR, 3′-downstream, or the sum of all five positions (full gene). TFM, transcription factor binding site motif; n, number of α-duplicate pairs used in test.

CNS correlation with α-duplicate pair expression intensities

Given the expression dataset and subgene position associations of CNSs with α-duplicate pair co-expression, we tested whether CNSs might have an effect on the overall expression intensity. For these experiments, CNS counts were tested for correlation with combined average expression intensity across a dataset for both genes in an α-duplicate pair. Permutation tests, as described above, were used in significance testing. No significant correlations were observed when the full-gene CNS count was tested for all α-duplicates (Fig. 2a; all ρ and P values are listed in Table S7). However, when only α-duplicates that contained a CNS were considered, a significant trend of negative correlation was observed for all datasets when 5′-upstream CNS counts were used in the correlation with expression intensity (ρ: − 0.21 to − 0.27). Conversely, a significant positive correlation between intronic CNSs and expression intensity was observed in five datasets (flower, seedlings, seeds, shoot and whole plant). The relevance of these correlations was supported by the fact that the average expression intensity for α-duplicates with only 5′-upstream CNSs was significantly lower and for α-duplicates with only intronic CNSs was significantly higher than for α-duplicates with no CNSs in almost all datasets (Table 1). These results suggest that the CNS subgene position affects the expression levels of α-duplicate genes.

Table 1. Comparison of mean combined expression intensity of α-duplicate pairs grouped by conserved noncoding DNA sequence (CNS) position in gene

| CNS position category | Aerial | Flower | Leaves | Root | Rosette | Seedlings | Seeds | Shoot | Whole |
|---|---|---|---|---|---|---|---|---|---|
| α-Duplicates with only 5′ upstream CNSs | 7.24 ± 0.14 B | 7.32 ± 0.12 A | 7.15 ± 0.13 A | 7.14 ± 0.12 B | 7.30 ± 0.14 A | 7.05 ± 0.13 B | 7.10 ± 0.12 B | 7.02 ± 0.13 B | 7.00 ± 0.13 B |
| α-Duplicates with only 5′-UTR CNSs | 8.25 ± 0.17 C | 8.28 ± 0.16 B | 8.14 ± 0.17 B | 8.14 ± 0.16 AC | 8.23 ± 0.17 B | 8.22 ± 0.17 C | 8.18 ± 0.15 AC | 8.25 ± 0.16 C | 8.20 ± 0.17 C |
| α-Duplicates with only intronic CNSs | 8.07 ± 0.12 C | 8.07 ± 0.11 B | 7.98 ± 0.12 B | 8.09 ± 0.12 C | 8.11 ± 0.11 B | 7.97 ± 0.12 C | 8.02 ± 0.12 AC | 8.02 ± 0.11 C | 7.99 ± 0.12 C |
| α-Duplicates with only 3′-UTR CNSs¹ | nd | nd | nd | nd | nd | nd | nd | nd | nd |
| α-Duplicates with only 3′ downstream CNSs | 7.33 ± 0.25 AB | 7.17 ± 0.26 A | 7.11 ± 0.24 A | 7.06 ± 0.27 AB | 7.22 ± 0.26 A | 7.12 ± 0.27 AB | 7.16 ± 0.29 AB | 7.07 ± 0.26 AB | 7.14 ± 0.25 AB |
| α-Duplicates with zero CNSs | 7.66 ± 0.07 A | 7.66 ± 0.06 A | 7.48 ± 0.06 A | 7.65 ± 0.06 A | 7.65 ± 0.06 A | 7.50 ± 0.06 A | 7.68 ± 0.06 AC | 7.51 ± 0.06 A | 7.51 ± 0.06 A |
| α-Duplicates with at least 1 CNS in any position | 7.47 ± 0.05 AB | 7.46 ± 0.05 A | 7.35 ± 0.05 A | 7.38 ± 0.05 B | 7.49 ± 0.05 A | 7.33 ± 0.05 AB | 7.31 ± 0.05 B | 7.33 ± 0.05 AB | 7.31 ± 0.05 AB |
| α-Duplicates with only nontranscribed CNSs | 7.25 ± 0.12 | 7.30 ± 0.11 | 7.14 ± 0.12 | 7.12 ± 0.11 | 7.29 ± 0.12 | 7.06 ± 0.11 | 7.11 ± 0.11 | 7.03 ± 0.11 | 7.02 ± 0.11 |
| α-Duplicates with only transcribed CNSs | 8.13 ± 0.10** | 8.15 ± 0.09** | 8.03 ± 0.10** | 8.11 ± 0.09** | 8.15 ± 0.10** | 8.05 ± 0.10** | 8.08 ± 0.09** | 8.10 ± 0.09** | 8.06 ± 0.10** |

UTR, untranslated region.
Numbers represent combined average log expression for all α-pairs determined to be present in each tissue, ± SEM.
¹Group was not analyzed because of insufficient sample size. nd, not determined.
A, B, C: values with the same letter are statistically similar via t-test (P > 0.01).
**, P < 0.001 (t-test) when compared with nontranscribed α-pairs.

We then dissected α-duplicate pairs that were TFM(+/−) and found that the inverse correlation caused by 5′-upstream CNSs was associated with TFM+ CNSs (ρ: − 0.21 to − 0.27) (Fig. 2a). Interestingly, the 5′-upstream TFM+ effect was enhanced when 5′-UTR TFM− CNSs were excluded from the CNS count (ρ: − 0.30 to − 0.34). The positive correlation associated with the presence of intronic CNSs was maintained in TFM+ tests for seeds, shoot and flower datasets, and the correlation was extended to four additional datasets when 5′-UTR TFM− CNSs were included. Further refinement of TFM+ CNSs to those with a transcription factor binding site enrichment of 11 times or greater exhibited an increase in correlation across full-gene CNS counts and the 5′-upstream region. Increases in inverse correlation (decreased ρ) across full-gene CNS counts were observed in all nine expression datasets, with a modest increase observed in seeds (Δ0.10; data not shown), whereas mild increases in the 5′-upstream region were limited to aerial, flower, rosette and seeds (Δ0.02; data not shown). These data suggest that CNSs localized to the 5′-upstream/5′-UTR regions may have an opposite effect on α-duplicate expression relative to intronic TFM− CNSs, and that the regulation could occur transcriptionally or post-transcriptionally.

Co-expression and expression intensity correlations with CNS frequency in functionally restricted α-duplicate pairs

When α-duplicate pairs were divided into GO functional example subgroups (α-transcription factors, α-kinases, α-chloroplast genes, α-plasma membrane genes), the correlation of the full-gene CNS frequency with α-duplicate pair co-expression tended to be more expression dataset specific (Fig. 1b). In root samples, a significant and stronger correlation was seen for α-duplicate pairs coded as α-transcription factors (ρ = 0.26), α-kinases (ρ = 0.32) or localized to the chloroplast (ρ = 0.24). Three additional significant correlations were observed for α-transcription factor pairs in the flower (ρ = 0.24) and shoot (ρ = 0.18) datasets, and α-kinases in the seedling dataset (ρ = 0.24). When the constraint that the α-duplicates had to contain at least one CNS was imposed, only the root α-kinase full-gene CNS frequency correlation with co-expression was retained (ρ = 0.39). The correlation of CNS frequency with α-duplicate pair intensity tended to be more specific to subgene position (Fig. 2b). Significant positive correlations were restricted to α-duplicate transcription factors with 5′-UTR CNSs in seven datasets (aerial, flower, leaves, rosette, seedlings, shoot, whole) and α-duplicate kinases with intronic CNSs in the seed dataset.

CNS expression patterns of hormone/UV-responsive α-duplicates

Many genes associated with biotic and abiotic stresses in Arabidopsis are duplicated and exhibit discordant or partitioned gene expression (Zou et al., 2009). Therefore, we examined sets of α-duplicate genes found to be DE under ‘environmental’ perturbations (UV stress and hormone treatment) across all nine expression datasets to determine whether CNSs were associated with particular treatment conditions. Both the hormone DE and UV stress DE gene lists were found to be significantly enriched (Fisher’s exact test; P < 0.001) for α-duplicates (34% and 33%, respectively, vs 19% expected from α-duplicates in the genome background), as well as enriched for the presence of CNSs (74% and 69%, respectively, vs 60% expected from α-duplicates in the genome background).
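The enrichment test described above can be sketched as a one-sided Fisher's exact test, which is equivalent to the upper tail of the hypergeometric distribution. The sketch assumes a one-sided test (the text does not state whether the test was one- or two-sided), and the gene counts are hypothetical, chosen only to mirror the reported 34% vs 19% proportions.

```python
from math import comb


def enrichment_p_value(n_total, n_category, n_list, n_hits):
    """One-sided Fisher's exact test for over-representation.

    Probability of drawing at least n_hits category members when n_list genes
    are sampled without replacement from n_total genes, of which n_category
    belong to the category (e.g. alpha-duplicates).
    """
    upper = min(n_category, n_list)
    tail = sum(
        comb(n_category, k) * comb(n_total - n_category, n_list - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / comb(n_total, n_list)


# Hypothetical counts: 1000 genes, 190 alpha-duplicates (19%),
# and a differentially expressed list of 100 genes containing 34 alpha-duplicates (34%).
p = enrichment_p_value(n_total=1000, n_category=190, n_list=100, n_hits=34)
print(f"P = {p:.2e}")
```

Exact integer arithmetic with `math.comb` avoids the rounding problems of factorial-based formulas; for genome-scale counts a library implementation would be faster.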

Correlations between CNS frequency and co-expression (Fig. 1c; Table S6) or combined average expression intensity (Fig. 2c; Table S7) were found in DE α-duplicates from both hormone-/UV-treated datasets, but the patterns of significant correlation were distinct. For α-duplicate hormone response genes, a significant positive association between full-gene CNS frequency and co-expression was observed in flower, root, seedlings, shoot and whole plant datasets (Fig. 1c). However, this trend was only observed for α-duplicates DE under UV stress in the seed dataset.

When separated into positional categories, DE α-duplicates with 5′-upstream CNSs tended to broadly correlate with co-expression under hormone (leaves, whole plant, shoot, seedlings, root, flower), but not UV (seedlings, whole plant), treatment (Fig. 1c). When DE α-duplicates with CNSs localized to the 5′-UTR were considered, this trend was reversed in that no correlation with co-expression was observed for hormone DE α-duplicates, but correlation was seen for UV DE α-duplicates in five datasets. Interestingly, UV DE α-duplicates (but not hormone α-duplicates) showed an inverse correlation between CNS+ α-duplicates with 3′-downstream CNSs and co-expression in three datasets (leaves, whole plant and root).

Although UV stress DE α-duplicates showed no significant correlations between CNS frequency and expression intensity, a correlation was found for hormone response DE α-duplicates between intronic CNS frequency and pairwise expression intensity in five of the nine datasets (Fig. 2c). Although there were significant correlations for 3′-UTR CNSs in UV stress α-duplicates, further investigation revealed only five 3′-UTR CNSs within the selected gene list, suggesting that the significance may be a result of the presence of these rare CNSs. These data suggest that position-specific CNSs may exhibit alternative roles for α-duplicates under varying conditions.

Possible CNS control mechanisms of α-duplicate pair co-expression via 5′-UTRs

To test the possibility that stable RNA folds in 5′-UTRs could be involved in the post-transcriptional regulation of α-duplicate steady-state transcript levels, we predicted RNA fold ΔG values for the TAIR-annotated 5′-UTRs, introns and 3′-UTRs of α-duplicate pairs and non-α-genes, allowing for a fold stability comparison of all transcribed CNSs. The length-corrected ΔG values of 5′-UTRs and 3′-UTRs differed significantly between all α-duplicates and all non-α-transcripts, with α-duplicates tending to be more ‘stable’ (Table 2; Student’s t-test; P = 1.79 × 10⁻²² and 4.38 × 10⁻⁸, respectively).

Table 2.   Comparison of free folding energies and transcribed unit length grouped by conserved noncoding DNA sequence (CNS) position

Group | 5′-UTRs | Mean ΔG per kbp (5′-UTR) | Introns | Mean ΔG per kbp (intron) | 3′-UTRs | Mean ΔG per kbp (3′-UTR)
α-Duplicates with zero CNSs | 2279 | −138.85 ± 1.72 | 10 858 | −152.23 ± 0.49 | 2293 | −182.58 ± 0.97
α-Duplicates with at least 1 CNS in any position | 3889 | −154.52 ± 1.30* | 18 878 | −152.61 ± 0.36 | 3798 | −186.11 ± 0.76
α-Duplicates with only 5′ upstream CNSs | 886 | −123.53 ± 2.82* | 3883 | −148.95 ± 0.80* | 927 | −179.81 ± 1.56
α-Duplicates with only 5′-UTR CNSs | 400 | −185.03 ± 3.37* | 1601 | −156.71 ± 1.26* | 379 | −188.98 ± 2.08
α-Duplicates with only intronic CNSs | 705 | −172.34 ± 2.79* | 4311 | −156.30 ± 0.76* | 644 | −192.88 ± 1.68*
α-Duplicates with only 3′-UTR CNSs | 2 | nd¹ | 59 | −150.95 ± 6.27 | 10 | −175.29 ± 14.25
α-Duplicates with only 3′ downstream CNSs | 249 | −129.01 ± 5.48 | 1076 | −147.25 ± 1.54 | 238 | −180.20 ± 3.30
Non-α-duplicates | 16 153 | −139.22 ± 0.67 | 93 469 | −152.56 ± 0.17 | 16 578 | −180.71 ± 0.39
All α-duplicates | 6170 | −148.73 ± 1.04** | 29 736 | −152.47 ± 0.29 | 6091 | −184.79 ± 0.60**

UTR, untranslated region; nd, not determined.
¹Group was not analyzed because of insufficient sample size.
*, P < 0.001 (t-test) when compared with α-duplicates with zero CNSs; **, P < 0.001 (t-test) when compared with non-α-pairs.

We refined this analysis to α-duplicates with CNS signatures by comparing CNS(+) α-duplicate pairs against CNS(−) α-genes. A significant decrease in the α-gene 5′-UTR ΔG value was observed if the α-gene contained at least one CNS or a 5′-UTR-localized CNS. This increase in RNA fold ‘stability’ was either very small or insignificant when the intron/3′-UTR CNS average ΔG value was tested (Table 2). Interestingly, a general increase in ΔG was seen in α-duplicate pairs that contained putatively nontranscribed CNSs (5′-upstream, 3′-downstream), which provided a comparison with nontranscribed CNSs that would not be found in messenger RNA (mRNA) (Table 2). However, our examination of the 5′-UTR fold stability in A. thaliana found no significant correlations between 5′-UTR folding energy and expression intensity or co-expression for α-duplicate pairs (data not shown). Although these speculative data suggest that the co-expression of α-duplicate pairs, in general, could be influenced by more stable 5′-UTR folds, this effect could be caused by the presence of the 5′-UTR CNSs.

Given our results that α-duplicate pair 5′-UTRs may be enriched for stable RNA folds relative to non-α-5′-UTRs, we tested whether α-duplicate 5′-UTRs were enriched for any specific noncoding RNA patterns. One possible RNA motif class is the riboswitch, which is an mRNA fold that can act as a protein-free metabolite sensor in bacteria and eukaryotes (Breaker, 2008). A thiamine pyrophosphate (TPP)-based riboswitch motif has been observed in the 3′-UTR of A. thaliana transcripts and affects differential transcript processing (Sudarsan et al., 2003; Bocobza et al., 2007). The TPP riboswitch has also been located in the 5′-UTR of an ascomycete, Neurospora crassa (Cheah et al., 2007). Another possible motif is the AU-rich element (AUUUA core motif; Bakheet et al., 2001), which interacts with the RNA-degrading exosome complex (Schilders et al., 2006). To search for these and other motifs, we scanned all Arabidopsis 5′-UTRs for motifs from the Rfam database (Gardner et al., 2009). Even at moderate stringencies, few motifs were identified in the 5′-UTRs of either the α-duplicates or the non-α-genes, with no evidence for enrichment of these motifs within CNSs.

Possible CNS control mechanisms of α-duplicate expression intensity

The positions of CNSs relative to their associated genes can be separated into two categories: transcribed and nontranscribed. Each of these categories suggests the potential for alternative mechanisms of expression intensity regulation if the transcribed CNS does not contain TFM(s). In order to evaluate the predicted differences in CNS position effects, α-duplicates were separated into categories on the basis of CNS counts for varying positions (Table 1). The frequency of α-duplicate pairs with only 3′-UTR CNSs was found to be less than five pairs in all of the nine expression datasets, and this class was excluded from the analysis. Duplicate pairs with transcribed CNSs (5′-UTR and intronic) showed significantly higher average expression intensity when compared with nontranscribed CNSs (5′-upstream and 3′-downstream) in all nine expression groups. When CNS position categories were examined independently, the transcribed groups were found to be significantly different from all other categories, except in the root and seed datasets (Table 1).

Although not significantly different in all expression datasets, the observed increase in average expression intensity for transcribed CNSs suggests that they play a role distinct from that of nontranscribed CNSs. Although the possibility of particular sequence motifs in the 5′-UTR that confer mRNA stability could explain the increase in observed intensity, the role of intronic sequences is unclear. Any intronic sequences would be removed from the final mRNA and would be unable to influence mRNA stability directly after splicing. A potential mechanism for the influence of intronic CNSs on expression intensity is that they may contain motifs that exhibit intron-mediated enhancement (IME) of gene expression (Mascarenhas et al., 1990). IME is an observed phenomenon by which the presence of particular introns near the 5′-end of a gene is found to enhance expression levels above those observed in the absence of the intron (Rose, 2008; Rose et al., 2008). However, for the maize knotted1 homeobox transcription factor gene, a cluster of intron CNSs conserved in grasses turns ‘off’ the gene when actively bound (Inada et al., 2003). Using the IMEter algorithm, introns were scored for IME potential. The average IMEter score for α-duplicates with at least one intronic CNS was 12.41, whereas all α-duplicates and non-α-duplicates had scores of 7.46 and 2.10, respectively (Table 3). Although each of these scores is lower than those obtained from the screening of known IME elements (Rose et al., 2008), the trend in IME scoring and the observed differences in expression intensity suggest that intronic CNSs may be a marker for gene regulation.

Table 3.   Evidence for intron-mediated enhancement (IME) of transcription in α-duplicate pairs

Intron category¹ | Transcripts | IMEter score (average) | P value²
Non-α-duplicates | 16 840 | 2.10 |
α-Duplicates with at least one intronic CNS | 1382 | 12.41 | 1.98E-44
α-Duplicates with zero intronic CNS | 3759 | 5.62 | 9.95E-14
α-Duplicates with intronic CNS only | 533 | 17.13 | 4.73E-39

CNS, conserved noncoding DNA sequence.
¹First intron of first transcript variant.
²t-test relative to non-α-duplicates.


The total collection of microarrays sampled in this study comprised not only a large variety of different organ systems, organs, cells and tissues, but also multiple transcriptome measurements involving chemical, hormonal or environmental treatments. We chose to focus our study on potential organ/organ system CNS control of α-duplicate pair expression with a brief examination of ‘treated’ datasets. Despite the relatively large amount of ‘noise’ conferred by mixed transcriptomes in our system, significant correlations between CNS frequency and α-duplicate co-expression or expression intensity were detectable in all tissue sets examined. These results support our hypothesis that CNSs common to an α-duplicate pair are involved in the coordinated control of gene expression for both genes, even though the genes may now exist on different chromosomes. Furthermore, our analysis has begun to reveal CNS regulatory complexity, in that CNSs may be involved in multiple mechanisms of gene expression control based on their position relative to the reading frame of the gene, as well as tissue-specific control.

Our data suggest that there is a link between CNS sequence patterns and α-duplicate co-expression. In general, our data suggest that CNSs near the transcription start site (TSS; 5′-upstream/5′-UTR) tend to have a positive effect on co-expression, whereas CNSs downstream of the TSS tend to disrupt co-expression (3′-UTR). However, these trends are not true for all organs, with seeds and roots being notable exceptions (Fig. 1a). The underlying mechanism of control by 5′-UTR and 3′-downstream CNSs appears to be a result, at least in part, of transcription factor binding, given the tendency for the TFM+ CNS frequency to correlate with co-expression. Again, this trend was not true for all datasets, suggesting that CNSs may be inactive or behave differently depending on the tissue type. Intriguingly, the 5′-upstream CNS+ correlation pattern was primarily observed in the absence of 5′-UTR/intronic/3′-UTR TFM− CNSs, suggesting that TFM-free transcribed CNSs could mask the transcriptional effect of 5′-upstream CNSs on co-expression, presumably through post-transcriptional mechanisms. It should also be noted that the TFM+ 5′-UTR effect on co-expression was only present in the absence of 5′-upstream TFM− CNSs, but not in all datasets. These data suggest that CNSs can play a role in coordinated expression, but the correct CNS mixture (e.g. gain/loss of a 3′-downstream CNS) across a gene could result in divergent expression and possibly drive subfunctionalization/neofunctionalization.

Our data suggest that the potential regulation of expression intensity by CNSs is, in general, centered on CNSs localized to the 5′-upstream and intronic regions of α-duplicate genes. As one might expect, the evidence suggests that 5′-upstream CNSs appear to affect steady-state transcript levels at the level of transcription, given the requirement for TFM+ CNSs in the CNS/expression intensity correlation. Furthermore, these correlations are enhanced by the enrichment for α-duplicates with 5′-UTR TFM+ CNSs. An unexpected result is that these correlations are inverse, suggesting that 5′-upstream/5′-UTR CNSs may function in a gene repression capacity. Intronic CNSs, however, seem to have a positive effect on transcription, but in a more tissue-restricted fashion, and this effect is not always dependent on TFM+ CNSs. Interestingly, if α-duplicate pairs with 5′-UTR TFM+ CNSs are removed, the intron CNS+ effect on expression intensity becomes more general, suggesting that intronic CNSs acting alone could increase transcript levels, which could be attenuated by the presence of 5′-UTR CNSs.

Given that TFM+ CNSs do not explain all the transcript levels in our analysis, an obvious question is: what other regulatory mechanism(s) are encoded in the CNS signatures? Dramatic progress has been made in recent years in understanding the rich variety of mechanisms involved in the post-transcriptional regulation of gene expression. Steady-state levels of RNA transcript concentrations can be controlled by the regulation of nuclear RNA processing by the spliceosome complex (Rino & Carmo-Fonseca, 2009), nuclear RNA degradation by the exosome complex (Belostotsky & Sieburth, 2009), mRNA nuclear export (Durairaj et al., 2009), riboswitches (Breaker, 2008) and cytoplasmic degradation of transcript through miRNA binding and recruitment of the RNA-induced silencing complex (RISC) (Kawamata & Tomari, 2010). One or more of these molecular regulatory mechanisms could be coded in the transcribed CNS motifs (e.g. 5′-UTR), which we found to be strongly associated with α-duplicate gene regulation.

It has been reported previously that the mean length of α-duplicate genes is c. 25% larger than their nonduplicated counterparts, which is consistent with our analysis of 5′-UTRs with more recent gene annotations (TAIR8 vs TIGR3; data not shown) (Chapman et al., 2006). The length of 5′-UTRs has been shown to have an important role in the differential blend of tissue/organ-specific transcripts of various genes, and may play a role in some cancers (reviewed in Pickering & Willis, 2005). Computational analysis of the transcription/translation profiles of Saccharomyces cerevisiae found significant correlations between 5′-UTR-mediated transcript stability, transcript half-life, protein level and translation rates (Ringner & Krogh, 2005). This observation, coupled with the significant correlations between 5′-UTR CNS counts and α-duplicate pair co-expression for all nine core organ datasets, suggests that the 5′-UTR-localized CNSs may be involved in the control of the steady-state expression of α-duplicate pairs. However, we were unsuccessful at identifying any known 5′-UTR motifs in α-duplicates that might explain the co-regulation patterns observed. It may be that specific RNA motifs from known classes are present in α-duplicate 5′-UTRs, but are not present in the Rfam database at this time. It is also possible that novel classes of motifs acting post-transcriptionally might reside in α-duplicate 5′-UTRs, an idea that will require further study.

Once transcribed, the propensity of a single-stranded RNA molecule to fold is high, and the bioinformatic determination of the theoretical free energy of RNA folding (ΔG) can be associated with potential RNA fold stability. Nevertheless, this is a prediction and the expected ‘noise’ of the nucleotide free folding energy can vary substantially in reality. To provide a clue to the range of ΔG in real genes, the average ΔG value (sequence length corrected) was determined for short A. thaliana processed miRNAs (−61.36 ± 4.49 ΔG kbp⁻¹) and longer highly folded tRNAs (−366.69 ± 1.93 ΔG kbp⁻¹). We assumed that significant differences in ΔG in CNS(+) α-duplicate pairs relative to CNS(−) pairs, if they fell well within the above range, could be a measure of RNA fold stability trends and possibly allow for the detection of different modes of potential CNS regulation (e.g. transcriptional vs post-transcriptional).
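The length correction and the range check described above can be illustrated as follows. This is a minimal sketch that assumes each sequence has a predicted folding energy from an RNA folding tool (not specified here) and that the correction is a simple division by sequence length; the function names and the example 5′-UTR are hypothetical. The two reference values are the miRNA and tRNA means given in the text.

```python
# Reference means from the text (ΔG per kbp).
MIRNA_MEAN_DG_PER_KBP = -61.36   # short processed miRNAs (less stable bound)
TRNA_MEAN_DG_PER_KBP = -366.69   # highly folded tRNAs (more stable bound)


def delta_g_per_kbp(delta_g, length_nt):
    """Scale a predicted folding energy to a per-kilobase value."""
    if length_nt <= 0:
        raise ValueError("sequence length must be positive")
    return delta_g / length_nt * 1000.0


def within_reference_range(value,
                           less_stable=MIRNA_MEAN_DG_PER_KBP,
                           more_stable=TRNA_MEAN_DG_PER_KBP):
    """True if a length-corrected ΔG lies between the tRNA and miRNA means."""
    return more_stable <= value <= less_stable


# Hypothetical 5'-UTR: 150 nt with a predicted ΔG of -22.5.
corrected = delta_g_per_kbp(-22.5, 150)
print(round(corrected, 2), within_reference_range(corrected))
```

Normalizing by length matters because longer sequences accumulate more negative ΔG simply by having more base pairs, which would otherwise confound comparisons between groups with different mean lengths.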

It has been suggested previously that duplicated genes which are retained in pairs may be restricted to a nonfractionated state, as their biological functions are more sensitive to gene dosage. These duplicate pairs are therefore retained as a byproduct of purifying selection (the natural drive to eliminate deleterious alleles from a population) as the loss of one gene copy among a group of interacting genes would result in reduced fitness (Freeling & Thomas, 2006; Freeling, 2009). It may be that CNSs are maintaining this dosage constraint as their putative cis action can influence the degree of gene expression and affect common dosage in an organ-specific manner.


This work was supported by grants from the National Science Foundation: MCB-0820345 to F.A.F. and MCB-0820821 to M.F.