Islands of co-expressed neighbouring genes in Arabidopsis thaliana suggest higher-order chromosome domains

Authors


*(fax +1 519 763 8933; e-mail llukens@uoguelph.ca).

Summary

Biochemical and cytogenetic experiments have led to the hypothesis that eukaryotic chromatin is organized into a series of distinct domains that are functionally independent. Two expectations of this hypothesis are: (i) adjacent genes are more frequently co-expressed than is expected by chance; and (ii) co-expressed neighbouring genes are often functionally related. Here we report that over 10% of Arabidopsis thaliana genes are within large, co-expressed chromosomal regions. Two per cent (497/22 520) of genes are highly co-expressed (r > 0.7), about five times the number expected by chance. These genes fall into 226 groups distributed across the genome, and each group typically contains two to three genes. Among the highly co-expressed groups, 40% (91/226) have genes with high amino acid sequence similarity. Nonetheless, duplicate genes alone do not explain the observed levels of co-expression. Co-expressed, non-homologous genes are transcribed in parallel, share functions, and lie close together more frequently than expected. Our results show that the A. thaliana genome contains domains of gene expression. Small domains have highly co-expressed genes that often share functional and sequence similarity and are probably co-regulated by nearby regulatory sequences. Genes within large, significantly correlated groups are typically co-regulated at a low level, suggesting the presence of large chromosomal domains.

Introduction

The ability of cells to regulate gene expression in a precise manner is central to plant development and environmental acclimation. Classical experiments proposed that eukaryotic genomes are organized into distinct domains, visible for example in the banding of Drosophila melanogaster salivary gland chromosomes that define units of genetic function (Weisbrod, 1982). Consistent with this hypothesis, several recent studies across a range of eukaryotes have shown that genomes have a specific functional architecture that acts to regulate gene transcription locally through cis-regulatory sequences, and globally through the modification of large segments of the genome (Grewal and Moazed, 2003). Cis-regulatory sequences interact with transcription factors to promote or repress the transcription of target genes. In addition, the activation or repression of genes is associated with nuclear compartmentalization and chromatin changes that affect large chromosomal regions (Emerson, 2002; Udvardy et al., 1985). The goal of this study was to identify consistently co-regulated regions within the Arabidopsis thaliana genome, and to characterize the genes within them. Co-regulated regions may contain suites of locally or globally regulated genes that are critical for plant development and environmental response. Such regions may provide an efficient mechanism for co-ordinated gene expression.

All gene transcription is local in nature because transcription depends on the recruitment of chromatin-remodelling complexes, histone deacetylases, and the basal transcriptional machinery at a core promoter (Hsieh and Fischer, 2005). Tissue- or developmentally specific initiation factor subunits (Freiman et al., 2001) and cofactor-modifying enzymes (Xu et al., 2001) also can induce or silence a gene. The idea that specific activators or repressors control gene expression has been confirmed with microarray analysis. Genes with correlated expression across microarray experiments frequently are bound by common transcription factors or may be within the same regulatory pathway (Allocco et al., 2004).

Genes are also affected by their chromosomal location. Nucleosome modifications may spread over domains that extend over tens of kilobases and potentiate or depotentiate large genomic regions for expression. Actively transcribed genes are associated with a variety of covalent modifications of histones, including the acetylation of histone 3 at lysine 9 (H3K9) and the methylation of histone 3 at lysine 4 (H3K4), while inaccessible chromatin is associated with covalent modifications including the methylation of H3K9. During lymphocyte maturation, the mouse Dntt locus is acetylated early in B- and T-cell development, but inactivated as the cells mature (Su et al., 2004). These changes are associated with progressive histone acetylation and deacetylation over 19 kb of the locus. Similarly, a 15-kb HSP70 domain within D. melanogaster is modified following heat treatment (Udvardy et al., 1985). Chromatin immunoprecipitation analysis of the 55-kb chicken beta globin locus showed that global levels of H4 and H3 acetylation and H3K4 methylation increase in the regions located between two insulator sites during erythrocyte development (Litt et al., 2001; Mutskov et al., 2002). The insulator may act as a key developmental switch. At the imprinted Igf2/H19 locus in mouse, the binding site of an enhancer-blocking protein (CTCF) is inaccessible within the paternal allele, causing a long-range (90-kb) interaction between enhancers and Igf2 (Bell and Felsenfeld, 2000; Hark et al., 2000).

These changes in global chromatin structure may be visible microscopically. In different human cell types, loops of chromatin are between several thousand and several million nucleotides long. The frequency of specific loops is cell type- and stimulus-specific, and the frequency is related to the transcriptional activity of genes within the loop. For example, a matrix-attachment region is necessary for the looping and repression of the interleukin-2 receptor α gene; and megabase-sized loops are positively correlated with gene activity at the MHC locus (Volpi et al., 2000; Yasui et al., 2002). Visible loops in A. thaliana interphase chromosomes are also associated with transcriptional activity, and may be important for the regulation of transcription. Arabidopsis thaliana interphase chromosomes are organized as heterochromatic chromocentres with emanating loops between 200 and 2000 kb enriched in acetylated histone H4 (Fransz et al., 2002). Co-localization of DNA sequences with the chromocentre varies among cells (Fransz et al., 2003). The dispersed, multicopy transposons Emil2 and AthE1.4 co-localize at chromocentres (Soppe et al., 2002) and are frequently mapped to gene-rich regions (Fransz et al., 2003), suggesting that large, gene-rich regions of the interphase chromosome may be partitioned into distinct loops.

Functionally related genes are often co-expressed within a common global chromosomal domain. Although the influence of the domain on co-expression may be confounded by the influence of shared cis regulatory sequences because of tandem duplication of genes, functionally related genes are frequently clustered even after removing duplicates. In Caenorhabditis elegans, over 30% of muscle-expressed genes are within 10 kb of other muscle-expressed genes (Roy et al., 2002). In humans, highly co-expressed genes and tissue-specific genes are localized to large chromosomal domains (Caron et al., 2001; Lercher et al., 2002). Genes may become clustered over time due to the selective advantage of a common regulatory environment. Consistent with this theory, genes within 10–900-kb conserved regions of synteny between D. melanogaster and Drosophila obscura have highly correlated expression patterns (Stolc et al., 2004).

The existence of co-expressed groups of genes has been reported recently in A. thaliana, and the identity and characteristics of these co-expressed groups are now being elucidated. Williams and Bowles (2004) noted that the mean correlation coefficient of a large number of neighbouring genes was slightly higher than expected by chance. This result suggested that the genome contains groups of large, co-expressed gene clusters, and Ma et al. (2005) and Schmid et al. (2005) identified many groups of 10 genes with high correlation coefficients. In an analysis of two small data sets, Ren et al. (2005) found small clusters of highly co-expressed neighbouring genes. Pairs of co-expressed genes occurred approximately 1.3 times more than expected by chance.

In this study we used a large data set of 128 hybridizations over diverse conditions to identify groups of genes throughout the A. thaliana genome that are co-regulated. Our work identifies large groups of neighbouring genes that are weakly but significantly correlated. Such co-expressed regions account for over 10% of all genes, and a subset overlap with large, highly co-expressed domains previously identified (Schmid et al., 2005). Approximately 2% of neighbouring genes (226 groups comprised of 497 genes) within the A. thaliana genome fall into small groups that are highly co-expressed, a level five times greater than expected by chance. Forty per cent (91/226) of the groups have co-expressed genes with high amino acid sequence similarity, and probably contain tandem duplicates. Excluding tandem duplicates, co-expressed groups occur twice as often as expected. These highly co-expressed groups have more co-expressed neighbouring genes that are close together, share functional annotations, and are arrayed in parallel than expected. Our results suggest two levels of gene regulation in A. thaliana. At the local level, tandem duplicated regulatory elements or shared regulatory elements may contribute to the highly similar expression levels of a small number of neighbouring genes. At the global level, large domains of closed and open chromatin that differ across experimental conditions may explain the low level of co-expression among large groups of genes. Plants must adapt to extreme changes in environment, such as water availability and other environmental cues. Domain-level regulation of gene expression may enhance plant responses to environmental change (Arnholdt-Schmitt, 2004; Finnegan et al., 2004).

Results

Identification and characterization of highly correlated genes

Gene-expression values were obtained from 128 different microarray hybridizations representing 128 experimental conditions. For each hybridization, probe set signal values were given and designated as present (1), absent (−1) or marginal (0), using Affymetrix software (Affymetrix Microarray Suite ver. 5.0). To determine the data set with the greatest power to detect co-expressed neighbouring genes, we sampled the expression data of genes located on chromosome 2. With the chromosome 2 data set, including present, absent and marginal calls, we identified 37 more highly correlated groups (mean absolute value of Pearson correlation coefficient between all gene pairs within the group, |r| > 0.7) than the mean of 10 permuted data sets (Figure 1). Using only chromosome 2 data with the present detection call, we identified 22 more highly correlated groups than the mean of 10 permuted data sets (Figure 1). The larger data set, including absent and marginal calls, had greater power than the smaller data set, and the former was used in all subsequent analyses.

Figure 1.

Number of groups with highly correlated genes (mean |r| > 0.7) detected in two data sets of chromosome 2 genes.
The net number of highly correlated groups was higher when all expression values were included in the analysis (detection calls −1,0,1) than when only the high-quality expression values were included (detection calls 1). The number of highly correlated groups for the random data sets was an average of 10 permutations.

Across the genome, most groups of neighbouring genes were weakly correlated. Out of 22 515 two-gene groups, 90% had |r| values <0.4 (Figure S1a). The ratio between the number of expected and observed groups is similar across different, high |r| values (Figure S1b). Nonetheless, the number of observed correlated groups with |r| > 0.8 deviates slightly more from the expectation than the number of observed, correlated groups with |r| > 0.6. In this study, groups with |r| values >0.7 define highly correlated groups.
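
The grouping statistic described here, the mean |r| over all gene pairs in a group of neighbouring genes, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the scan over overlapping windows of a fixed size, and the toy expression values are hypothetical and are not taken from the analysis reported in this paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_abs_r(profiles):
    """Mean |r| over all gene pairs within one group of expression profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)


def highly_correlated_groups(ordered_profiles, size=2, threshold=0.7):
    """Start indices of overlapping windows of `size` neighbouring genes
    whose mean |r| exceeds `threshold` (0.7 is the cut-off used in the text)."""
    hits = []
    for start in range(len(ordered_profiles) - size + 1):
        window = ordered_profiles[start:start + size]
        if mean_abs_r(window) > threshold:
            hits.append(start)
    return hits


# Hypothetical toy data: four genes in chromosomal order, five conditions each.
toy = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.9],  # closely tracks the first gene
    [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated pattern
    [3.0, 2.0, 5.0, 1.0, 4.0],
]
print(highly_correlated_groups(toy))
```

In the toy data only the first two genes form a group above the threshold; the third and fourth genes correlate at r = 0.6 and so fall just below it.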

Although most neighbouring genes are not co-expressed across diverse conditions, small groups of neighbouring genes with highly similar expression patterns occurred five times more frequently than expected by chance. Neighbouring genes lie in close proximity to each other on the chromosome, but are not necessarily adjacent. In total, 497 genes (226 groups) were highly co-expressed (Table 1), and 15 genes (seven groups) were expressed divergently (Table 2). In contrast, 92 genes had high, positive correlation coefficients (mean Pearson correlation coefficient between all gene pairs within the group, r > 0.7) in a random association of genes with genomic position. The number of negatively correlated genes (15) was about the same as expected by chance (20). Across the genome, 92% (456/497) of highly co-expressed genes were found in small groups of two or three genes (Table 1). The largest groups of six genes were found on chromosome 3. One group contained six repeats of a jacalin lectin family protein. The second group contained six expressed proteins of unknown function that share some sequence similarity with other genes within the group. Groups of highly co-expressed neighbouring genes appeared randomly distributed across the five chromosomes with a mean of about four co-expressed groups per 1 Mb (Table 1), but they were seldom located in heterochromatic, gene-poor regions such as centromeric and peri-centromeric regions. Among the 233 groups of positively and negatively correlated neighbouring genes (Table 3), 22 groups contained a gene whose probe set was predicted to cross-hybridize to another gene (Redman et al., 2004). An analysis of only those probe sets that were predicted to hybridize to a unique sequence found 10 fewer co-expressed groups. A complete listing of co-expressed genes is given in Table S1.
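
The fold enrichment quoted in this paragraph follows directly from the observed and randomly permuted counts. The short calculation below is a sketch that uses only the totals given in the text; the variable names are illustrative.

```python
# Totals for highly co-expressed (r > 0.7) neighbouring genes, as quoted in the text.
observed = {"genes": 497, "groups": 226}
random_set = {"genes": 92, "groups": 45}  # single random association of genes with position

# Ratio of observed to randomly expected counts.
fold = {key: observed[key] / random_set[key] for key in observed}

# Divergently expressed (negatively correlated) genes: observed versus expected.
negative_fold = 15 / 20

for key, value in fold.items():
    print(key, round(value, 1))
print("negatively correlated genes", round(negative_fold, 2))
```

Both the gene and group ratios come out close to five, whereas the ratio for negatively correlated genes is below one, in line with the statement that divergent expression is no more common than expected by chance.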

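Table 1. Number of groups of genes and total number of genes identified as highly co-expressed, and their proportions in the corresponding chromosome

| Chromosome | Group size | Observed genes | Proportion (%) | Observed groups | Proportion (%) | Random genes (a) | Proportion (%) | Random groups (a) | Proportion (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 90 | 1.54 | 45 | 0.77 | 26 | 0.44 | 13 | 0.22 |
| 1 | 3 | 21 | 0.36 | 7 | 0.12 | 6 | 0.1 | 2 | 0.03 |
| 1 | 4 | 4 | 0.07 | 1 | 0.02 | 0 | 0 | 0 | 0 |
| 2 | 2 | 74 | 1.98 | 37 | 0.99 | 8 | 0.21 | 4 | 0.11 |
| 2 | 3 | 6 | 0.17 | 2 | 0.06 | 0 | 0 | 0 | 0 |
| 3 | 2 | 78 | 1.74 | 39 | 0.87 | 16 | 0.36 | 8 | 0.18 |
| 3 | 3 | 6 | 0.13 | 2 | 0.04 | 0 | 0 | 0 | 0 |
| 3 | 4 | 4 | 0.09 | 1 | 0.02 | 0 | 0 | 0 | 0 |
| 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6 | 12 | 0.27 | 2 | 0.04 | 0 | 0 | 0 | 0 |
| 4 | 2 | 54 | 1.61 | 27 | 0.81 | 16 | 0.48 | 8 | 0.24 |
| 4 | 3 | 18 | 0.54 | 6 | 0.18 | 0 | 0 | 0 | 0 |
| 4 | 4 | 12 | 0.36 | 3 | 0.09 | 0 | 0 | 0 | 0 |
| 4 | 5 | 5 | 0.15 | 1 | 0.03 | 0 | 0 | 0 | 0 |
| 5 | 2 | 94 | 1.84 | 47 | 0.92 | 20 | 0.39 | 10 | 0.2 |
| 5 | 3 | 15 | 0.29 | 5 | 0.1 | 0 | 0 | 0 | 0 |
| 5 | 4 | 4 | 0.08 | 1 | 0.02 | 0 | 0 | 0 | 0 |
| Whole genome | 2 | 390 | 1.73 | 195 | 0.87 | 86 | 0.38 | 43 | 0.19 |
| Whole genome | 3 | 66 | 0.28 | 22 | 0.09 | 6 | 0.03 | 2 | 0.01 |
| Whole genome | 4 | 24 | 0.11 | 6 | 0.03 | 0 | 0 | 0 | 0 |
| Whole genome | 5 | 5 | 0.02 | 1 | 0 | 0 | 0 | 0 | 0 |
| Whole genome | 6 | 12 | 0.05 | 2 | 0.01 | 0 | 0 | 0 | 0 |
| Whole genome | Total | 497 | 2.19 | 226 | 1.00 | 92 | 0.41 | 45 | 0.20 |

(a) Random data set generated from a single permutation.
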
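Table 2. Number of groups of genes and total number of genes identified as divergently expressed (negatively correlated) in the Arabidopsis thaliana genome

| Group size | Observed genes | Proportion (%) | Observed groups | Proportion (%) | Random genes (a) | Proportion (%) | Random groups (a) | Proportion (%) |
|---|---|---|---|---|---|---|---|---|
| 2 | 12 | 0.05 | 6 | 0.03 | 14 | 0.06 | 7 | 0.03 |
| 3 | 3 | 0.01 | 1 | 0 | 6 | 0.03 | 2 | 0.01 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 15 | 0.06 | 7 | 0.03 | 20 | 0.09 | 9 | 0.04 |

(a) Random data set generated from a single permutation.
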
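Table 3. Inferred protein-sequence similarity of neighbouring, highly correlated genes including co-expressed and divergently expressed genes

Column groups: Strong correlation (all highly correlated groups); Co-expression (positively correlated); Divergent expression (negatively correlated). Within the last two, counts are also given for groups in which all genes are homologous (a) and groups with partial homology (b).

| Group size | Strong correlation: groups | Strong correlation: genes | Co-expression: groups | Co-expression: genes | Co-expression, all homologous (a): groups | Co-expression, all homologous (a): genes | Co-expression, partial homology (b): groups | Co-expression, partial homology (b): genes (c) | Divergent expression: groups | Divergent expression: genes | Divergent, all homologous (a): groups | Divergent, all homologous (a): genes | Divergent, partial homology (b): groups | Divergent, partial homology (b): genes (c) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 201 | 402 | 195 | 390 | 74 | 148 | 0 | 0 | 6 | 12 | 0 | 0 | 0 | 0 |
| 3 | 23 | 69 | 22 | 66 | 5 | 15 | 4 | 12 | 1 | 3 | 0 | 0 | 0 | 0 |
| 4 | 6 | 24 | 6 | 24 | 3 | 12 | 3 | 12 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 5 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2 | 12 | 2 | 12 | 2 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 233 | 512 | 226 | 497 | 84 | 187 | 7 | 24 | 7 | 15 | 0 | 0 | 0 | 0 |

(a) All possible gene pairs in the group had blast E < 1.0 × 10⁻⁷.
(b) Groups contained at least one gene pair, but not all gene pairs had blast E < 1.0 × 10⁻⁷.
(c) All genes within a group are listed, but only some had sequence similarity.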

Many of the 226 co-expressed groups were composed of tandem duplicates. Forty per cent (91/226) of the co-expressed groups contained homologous genes (Table 3). Thirty-seven per cent (84/226) contained genes that were all homologous (all possible gene pairs in the group have a blastp E value of <1.0 × 10⁻⁷). Three per cent (7/226) of groups had at least two genes that were homologous (Table 3). In contrast, not one of the 15 divergently expressed, neighbouring genes was homologous (Table 3).
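
The three homology classes used in Table 3 (all pairs homologous, some pairs homologous, no pairs homologous) can be illustrated with a small classifier. This is a sketch only: the gene names and E values are invented, the function name is hypothetical, and treating a missing pair as "no hit" is an assumption of the example rather than a rule stated in the paper.

```python
from itertools import combinations

E_CUTOFF = 1.0e-7  # blastp E-value threshold quoted in the text


def classify_group(genes, evalue):
    """Return 'all', 'partial' or 'none' for a group of gene identifiers.

    `evalue` maps frozenset({gene_a, gene_b}) to a blastp E value.
    Pairs missing from the mapping are treated as having no hit
    (an assumption made for this sketch).
    """
    pairs = [frozenset(pair) for pair in combinations(genes, 2)]
    similar = [evalue.get(pair, float("inf")) < E_CUTOFF for pair in pairs]
    if all(similar):
        return "all"
    if any(similar):
        return "partial"
    return "none"


# Hypothetical E values for three neighbouring genes.
toy_evalues = {
    frozenset({"geneA", "geneB"}): 1e-40,
    frozenset({"geneB", "geneC"}): 1e-12,
    frozenset({"geneA", "geneC"}): 0.5,
}
print(classify_group(["geneA", "geneB"], toy_evalues))           # every pair below the cut-off
print(classify_group(["geneA", "geneB", "geneC"], toy_evalues))  # only some pairs below it
print(classify_group(["geneA", "geneC"], toy_evalues))           # no pair below it
```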

Genes with functional relationships were more likely among all co-expressed groups, including groups that did not have tandem duplicates, than expected by chance. Forty-one per cent (92/226) of highly co-expressed groups had genes with shared level 3 gene ontology biological process (GO BP) terms (Table 4). In contrast, among 226 groups randomly chosen from the genome, only 28 (14%) contained genes that shared level 3 GO BP terms. After removing the 91 groups with homologous genes, co-expressed groups shared functional categories twice as often as expected. Thirty-nine of the 135 co-expressed groups that did not contain homologous genes (29%) contained genes that shared level 3 GO BP terms (Table 4). Genes that shared level 3 terms but were not homologous sometimes had clear functional relationships. At1g53430 and At1g53440 both encode protein kinases and share the level 3 BP terms ‘cellular physiological process’ and ‘metabolism’. At2g19740 and At2g19750 encode 60S and 40S ribosomal proteins, respectively. Both share the level 3 term ‘cellular physiological process’. The functional relationship among other co-expressed groups was less clear. For example, a dehydroascorbate reductase (At5g16710) and a tRNA synthetase (At5g16715) were co-expressed and shared GO BP terms.

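Table 4. Number of groups and genes sharing the same gene ontology biological process (GO BP) term at least at level 3 for highly co-expressed neighbouring genes with high and low inferred protein-sequence similarities

| Group size | Highly co-expressed: groups | Highly co-expressed: genes | Shared GO terms: groups | Shared GO terms: genes (e) | Homologous (a), shared GO by all (c): groups | Homologous (a), shared GO by all (c): genes | Homologous (a), partial GO sharing (d): groups | Homologous (a), partial GO sharing (d): genes (e) | Non-homologous (b), shared GO by all (c): groups | Non-homologous (b), shared GO by all (c): genes | Non-homologous (b), partial GO sharing (d): groups | Non-homologous (b), partial GO sharing (d): genes (e) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 195 | 390 | 70 | 140 | 42 | 84 | 0 | 0 | 28 | 56 | 0 | 0 |
| 3 | 22 | 66 | 16 | 42 | 6 | 18 | 0 | 0 | 4 | 12 | 6 | 12 |
| 4 | 6 | 24 | 4 | 16 | 4 | 16 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 5 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| 6 | 2 | 12 | 1 | 2 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 |
| Total | 226 | 497 | 92 | 203 | 52 | 118 | 1 | 2 | 32 | 68 | 7 | 15 |

(a) Within homologous groups, at least one gene pair in a group had blast E < 1.0 × 10⁻⁷.
(b) Within non-homologous groups, all possible gene pairs in a group had blast E ≥ 1.0 × 10⁻⁷.
(c) All possible gene pairs in a group shared the same GO BP term, at least at level 3.
(d) At least one, but not all, gene pairs shared the same GO BP term, at least at level 3.
(e) Number of genes within a group only includes those that shared the same GO BP term at least at level 3.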

Because high levels of co-expression remained even after tandem duplicates had been removed from the data, the duplication of cis-regulatory sequences alone could not explain co-expression, and we examined whether co-expression was a function of the proximity of two genes to each other. The mean intergenic distance between highly co-expressed neighbouring genes was markedly lower than the mean distance between all neighbouring genes in the A. thaliana genome. Among the 271 pairs of genes within the 226 groups of co-expressed genes, the mean distance was 962 bp shorter than that between all neighbouring genes (2092 bp versus 3054 bp; Table 5). The standard deviation for co-expressed neighbouring genes was also less than half that for all genes within this analysis. Thus, neighbouring co-expressed genes are much more tightly arrayed within the genome than expected. One reason for the large difference between the genome standard deviation and the co-expressed gene standard deviation could be that only a single co-expressed group of genes lies close to a centromere. Genes that are far apart within the peri-centromeric regions of the genome could have inflated the distance among the set of all neighbouring genes. Nonetheless, we found the same result excluding gene pairs close to centromeres. The distance between all gene pairs remained significantly greater than the distance between co-expressed genes (P ≤ 6.8 × 10⁻³). The median intergenic distance among co-expressed, non-homologous genes (1168 bp) was lower than the distance among co-expressed homologous genes (1276 bp) (Table 5). The proximity of non-homologous gene pairs is probably an important factor that contributes to their co-expression.

Table 5. Intergenic distances (a) and standard deviations of all neighbouring genes and all correlated neighbouring genes within the Arabidopsis thaliana genome

| Statistic | All neighbouring genes | Positively correlated neighbouring genes | Positively correlated non-homologous neighbouring genes | Negatively correlated neighbouring genes |
|---|---|---|---|---|
| Median (bp) | 1459 | 1276 (b) | 1168 (c) | 1109 (d) |
| Mean (bp) | 3054 | 2092 | 1767 | 1741 |
| Standard deviation | 6652 | 2671 | 2715 | 3164 |
| Observations | 22 515 | 271 | 160 | 8 |

(a) Distance is the shortest distance in bp between two genes, irrespective of the strand on which the genes are located. If two genes are overlapping on opposite strands their intergenic distance is set at zero.
(b) Median significantly differs from all non-positively correlated neighbouring genes (median = 1463 bp; two-sided Wilcoxon rank sum test, w = 3 232 068, P = 4.04 × 10⁻²).
(c) Median significantly differs from all non-positively correlated neighbouring genes (median = 1463 bp; w = 2 130 901, P = 1.63 × 10⁻⁵).
(d) Median significantly differs from all non-negatively correlated neighbouring genes (median = 1459 bp; w = 127 614, P = 4.09 × 10⁻²).
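
The distance definition in footnote (a) of Table 5 can be written as a short function. The sketch below assumes 1-based, inclusive gene coordinates and counts the bases strictly between two genes; the coordinate convention, the function name and the example coordinates are assumptions made for illustration, while the three mean values at the end are taken from Table 5.

```python
from statistics import median


def intergenic_distance(gene_a, gene_b):
    """Shortest distance in bp between two genes, ignoring strand.

    Each gene is a (start, end) tuple with start <= end, in 1-based inclusive
    coordinates (an assumption of this sketch). Overlapping genes score zero.
    """
    first, second = sorted([gene_a, gene_b])
    gap = second[0] - first[1] - 1
    return max(0, gap)


# Hypothetical neighbouring gene pairs.
pairs = [
    ((1000, 2000), (2501, 3000)),  # separated by 500 bp
    ((5000, 6000), (5800, 7000)),  # overlapping, so distance is zero
    ((9000, 9500), (9601, 9900)),  # separated by 100 bp
]
distances = [intergenic_distance(a, b) for a, b in pairs]
print(distances, median(distances))

# Differences between the mean distances reported in Table 5.
mean_all, mean_coexpressed, mean_nonhomologous = 3054, 2092, 1767
print(mean_all - mean_coexpressed, mean_all - mean_nonhomologous)
```

The last line reproduces the two differences discussed in the text: 962 bp for all co-expressed pairs and 1287 bp for co-expressed, non-homologous pairs.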

Divergently transcribed genes may share the same promoter region, and we determined whether this class of genes is over-represented within co-expressed groups. Remarkably, the proportion of divergently transcribed genes among co-expressed gene pairs was lower than expected (19.56 versus 24.04%, respectively; Table 6). In contrast, genes encoded in parallel were greatly over-represented among co-expressed neighbouring genes. Among the 271 pairs of genes within the 226 highly co-expressed groups, the large majority, 189/271 (70%), consisted of pairs of genes with parallel transcription orientations (Table 6). Tandem duplicates are often in parallel orientations, but co-expressed genes in parallel that did not have high sequence similarity were also over-represented. After removing homologous gene pairs, 62% of all gene pairs were in parallel. The frequency of parallel pairs within both classes differed from the frequency of parallel pairs within the genome as a whole (52%). The frequency of neighbouring genes located on opposite strands and transcribed towards each other was lower than expected, 11% (29/271) within all co-expressed genes and 14% (23/160) within non-homologous neighbouring genes (Table 6). For all orientations, the intergenic distance among co-expressed genes was shorter than expected.

Table 6. Orientation of transcription of all neighbouring, co-expressed gene pairs within the Arabidopsis thaliana genome: median of intergenic distances between gene pairs within each class

| Orientation of transcription | All neighbouring genes: gene pairs | All neighbouring genes: proportion (%) | All neighbouring genes: median intergenic distance (bp) | Highly co-expressed: gene pairs | Highly co-expressed: proportion (%) (a) | Highly co-expressed: median intergenic distance (bp) | Highly co-expressed non-homologous: gene pairs | Highly co-expressed non-homologous: proportion (%) (b) | Highly co-expressed non-homologous: median intergenic distance (bp) |
|---|---|---|---|---|---|---|---|---|---|
| Divergent (c) | 5413 | 24.04 | 2162 | 53 | 19.56 | 1252 | 37 | 23.13 | 1134 |
| Parallel (c) | 11690 | 51.92 | 1550 | 189 | 69.74 | 1359 | 100 | 62.50 | 1111 |
| Convergent | 5412 | 24.04 | 657 | 29 | 10.70 | 453 | 23 | 14.37 | 252 |

(a) Proportions in the three transcriptional orientations differed significantly from those for non-co-expressed neighbouring genes (χ² = 39.3, d.f. = 2, P = 2.93 × 10⁻⁹).
(b) Proportions in the three transcriptional orientations differed significantly from those for non-co-expressed neighbouring genes (χ² = 9.99, d.f. = 2, P = 6.76 × 10⁻³).
(c) Median intergenic distance of divergent and parallel gene pairs differed significantly between highly co-expressed genes and non-co-expressed neighbouring genes (two-sided Wilcoxon rank sum tests: divergent, P = 9.19 × 10⁻³; parallel, P = 2.44 × 10⁻²), and between highly co-expressed non-homologous genes and non-co-expressed neighbouring genes (divergent, P = 7.47 × 10⁻³, parallel, P = 4.08 × 10⁻⁵).
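
The orientation classes and the test of proportions in Table 6 can be illustrated as follows. The sketch classifies a pair from the strands of the left and right gene, then computes a Pearson chi-square statistic from the Table 6 counts. Two points are assumptions rather than statements from the paper: the function names are hypothetical, and the non-co-expressed counts are taken to be all pairs minus the co-expressed pairs.

```python
from math import exp


def orientation(strand_left, strand_right):
    """Classify a neighbouring pair from the strands ('+' or '-') of the
    gene at the lower coordinate (left) and the higher coordinate (right)."""
    if strand_left == strand_right:
        return "parallel"
    # Left gene on '-' and right gene on '+' are transcribed away from each other.
    return "divergent" if strand_left == "-" else "convergent"


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Counts from Table 6, ordered divergent, parallel, convergent.
all_pairs = [5413, 11690, 5412]
co_expressed = [53, 189, 29]
# Assumption: non-co-expressed pairs are all pairs minus the co-expressed ones.
non_co_expressed = [a - c for a, c in zip(all_pairs, co_expressed)]

stat = chi_square([co_expressed, non_co_expressed])
p_value = exp(-stat / 2)  # exact chi-square tail probability for 2 degrees of freedom
print(round(stat, 1), p_value)
```

Under these assumptions the statistic comes out close to the value given in footnote (a), with a P value of the same order of magnitude.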

Large groups of genes are significantly co-expressed

As described above, highly co-expressed (r > 0.7) neighbouring genes comprise approximately 2% of the genome and are predominantly clustered into groups of two or three genes. To determine if large regions of the genome are co-expressed at low but significant levels, we identified groups of genes in which the mean pairwise correlation coefficient was greater than expected by chance (α = 0.01). We compared the observed |r| for group sizes two to nine with the distribution of |r| for 1000 permuted data sets. The number of significantly correlated groups was substantially higher than expected for all group sizes examined. At P < 0.01 and with a group size of two, we found 602 significant groups out of 22 515 groups (2.7%) over the entire genome (Table 7). In contrast, 225 (1.0%) significant groups were expected. Across chromosomes, the median number of significant groups was between 2.4 (chromosome 3) and 3.8 times (chromosome 2) greater than the number of groups expected by chance. When the genome is subdivided into groups of nine genes, the number of significant groups (694) is similar to the number of significant groups when group size is two (Table 7). However, many more novel genes are within those groups. As group size increased from two to nine, the total number of significantly correlated genes increased from 1178, over 5% of the 22 520 genes surveyed to 2520, over 10% of genes surveyed (Figure 2). The increase in the number of significantly correlated genes was not caused by larger groups per se. Both the number of significantly correlated genes and the number of significantly correlated groups were the same or similar within permuted data (Figure 2). These results show that at least 10% of the A. thaliana genome lies in correlated groups.

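Table 7. Number of n-gene (2 ≤ n ≤ 9) groups of adjacent genes for which transcript levels were significantly correlated at P < 0.01 in the Arabidopsis thaliana genome

| Chromosome | Group size | Significant groups | Median of significant groups | Expected groups | Total groups |
|---|---|---|---|---|---|
| 1 | 2 | 147 | 147.5 | 58 | 5844 |
| 1 | 3 | 120 | | 58 | 5843 |
| 1 | 4 | 142 | | 58 | 5842 |
| 1 | 5 | 143 | | 58 | 5841 |
| 1 | 6 | 155 | | 58 | 5840 |
| 1 | 7 | 148 | | 58 | 5839 |
| 1 | 8 | 156 | | 58 | 5838 |
| 1 | 9 | 162 | | 58 | 5837 |
| 2 | 2 | 114 | 141.5 | 37 | 3740 |
| 2 | 3 | 120 | | 37 | 3739 |
| 2 | 4 | 131 | | 37 | 3738 |
| 2 | 5 | 135 | | 37 | 3737 |
| 2 | 6 | 155 | | 37 | 3736 |
| 2 | 7 | 148 | | 37 | 3735 |
| 2 | 8 | 162 | | 37 | 3734 |
| 2 | 9 | 176 | | 37 | 3733 |
| 3 | 2 | 110 | 110 | 45 | 4470 |
| 3 | 3 | 104 | | 45 | 4469 |
| 3 | 4 | 91 | | 45 | 4468 |
| 3 | 5 | 108 | | 45 | 4467 |
| 3 | 6 | 118 | | 45 | 4466 |
| 3 | 7 | 111 | | 45 | 4465 |
| 3 | 8 | 110 | | 45 | 4464 |
| 3 | 9 | 126 | | 45 | 4463 |
| 4 | 2 | 102 | 107 | 34 | 3362 |
| 4 | 3 | 95 | | 34 | 3361 |
| 4 | 4 | 115 | | 34 | 3360 |
| 4 | 5 | 101 | | 34 | 3359 |
| 4 | 6 | 107 | | 34 | 3358 |
| 4 | 7 | 110 | | 34 | 3357 |
| 4 | 8 | 107 | | 34 | 3356 |
| 4 | 9 | 110 | | 34 | 3355 |
| 5 | 2 | 139 | 126 | 51 | 5099 |
| 5 | 3 | 119 | | 51 | 5098 |
| 5 | 4 | 128 | | 51 | 5097 |
| 5 | 5 | 136 | | 51 | 5096 |
| 5 | 6 | 126 | | 51 | 5095 |
| 5 | 7 | 122 | | 51 | 5094 |
| 5 | 8 | 126 | | 51 | 5093 |
| 5 | 9 | 120 | | 51 | 5092 |
| Total | | | 632 | 225 | 22 515 |
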
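
The per-chromosome medians in Table 7 and the enrichment ratios quoted in the text can be recomputed from the tabulated counts. The snippet below is a small sketch using only the values in Table 7; the variable names are illustrative.

```python
from statistics import median

# Significant groups at P < 0.01 for group sizes 2 to 9, per chromosome (Table 7).
significant = {
    1: [147, 120, 142, 143, 155, 148, 156, 162],
    2: [114, 120, 131, 135, 155, 148, 162, 176],
    3: [110, 104, 91, 108, 118, 111, 110, 126],
    4: [102, 95, 115, 101, 107, 110, 107, 110],
    5: [139, 119, 128, 136, 126, 122, 126, 120],
}
# Groups expected by chance on each chromosome (Table 7).
expected = {1: 58, 2: 37, 3: 45, 4: 34, 5: 51}

medians = {chrom: median(counts) for chrom, counts in significant.items()}
ratios = {chrom: medians[chrom] / expected[chrom] for chrom in medians}

for chrom in sorted(medians):
    print(chrom, medians[chrom], round(ratios[chrom], 1))

# Genome-wide number of significant nine-gene groups (last entry per chromosome).
print(sum(counts[-1] for counts in significant.values()))
```

The smallest ratio is found on chromosome 3 and the largest on chromosome 2, and the nine-gene groups sum to 694 across the genome, as stated in the text.
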
Figure 2.

Number of significantly correlated genes identified in groups at P < 0.01 across the genome for group sizes between two and nine genes.
Observed and expected genes are given for chromosome 1. Random data reflect the genes from one permutation that meet the significance threshold.

Mean |r| values for groups obtained from the observed data were consistently slightly higher than those obtained from the permuted data (Figure 3), confirming a previous report (Williams and Bowles, 2004). The difference was between 0.023 and 0.007 for groups of two to nine genes. It is important to note that smaller group sizes had much greater critical |r| at α = 0.01 than larger groups (Figure 3). For example, a group of three genes with a mean |r| value of 0.434 is significant at P < 0.01; while the critical |r| value for a group of nine genes is 0.267.

Figure 3.

Observed and expected |r| means of overlapping groups of different sizes.
Expected mean |r| was calculated from 1000 permutations of gene order. Critical |r| mean is the score from permutations that is >99% of all scores (α = 0.01).
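
A permutation-derived critical value of the kind described in this caption can be sketched as follows. The code shuffles gene order, rescans overlapping groups, and takes the score exceeded by only a fraction α of permuted scores. Pooling the scores from all permutations into one null distribution, the function names, and the simulated expression data are assumptions of this sketch, not details confirmed by the paper.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def window_scores(profiles, size):
    """Mean |r| for every overlapping window of `size` consecutive profiles."""
    scores = []
    for start in range(len(profiles) - size + 1):
        pairs = list(combinations(profiles[start:start + size], 2))
        scores.append(sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs))
    return scores


def critical_mean_abs_r(profiles, size, n_perm=1000, alpha=0.01, seed=0):
    """(1 - alpha) quantile of mean |r| after permuting gene order.

    Scores from all permutations are pooled (an assumption of this sketch).
    """
    rng = random.Random(seed)
    shuffled = list(profiles)
    pooled = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        pooled.extend(window_scores(shuffled, size))
    pooled.sort()
    index = min(len(pooled) - 1, int((1 - alpha) * len(pooled)))
    return pooled[index]


# Simulated, uncorrelated expression profiles: 30 genes, 20 conditions.
data_rng = random.Random(1)
toy = [[data_rng.gauss(0, 1) for _ in range(20)] for _ in range(30)]

# Fewer permutations than the 1000 used in the paper, to keep the example fast.
crit2 = critical_mean_abs_r(toy, size=2, n_perm=50)
crit5 = critical_mean_abs_r(toy, size=5, n_perm=50)
print(round(crit2, 3), round(crit5, 3))
```

Because the mean over many gene pairs varies less than a single correlation, the critical value for the larger group should be the lower of the two, which mirrors the trend reported for Figure 3.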

Discussion

Identifying co-expressed domains

Classical work in cytogenetics revealed distinct chromatin compartments within eukaryotic chromatin, and chromosomal regions that contain co-regulated genes have been found across a wide range of species. These domains vary in their length, gene content, and the strength of co-regulation of the genes within them. Here we analyse a large and diverse data set to identify gene-expression domains across the A. thaliana genome. Over 10% of neighbouring genes within the A. thaliana genome lie in groups of nine genes that have significantly correlated expression values. Over 2.2% of genes have highly correlated expression values (Pearson's |r| > 0.7). The vast majority of these genes are positively correlated and in small groups of two to three genes.

Our results confirm recent reports that both large and small groups of neighbouring genes are co-expressed within A. thaliana (Ma et al., 2005; Ren et al., 2005; Schmid et al., 2005). Nonetheless, the location and attributes of these groups differ among studies. Schmid et al. (2005) identified a number of significantly co-expressed, 10-gene groups. Eight domains contained very highly correlated groups, and several of these domains encompassed hundreds of genes. In our data such large domains were absent (Figure S2), but three regions of unusually high co-expression on chromosomes 2–4 overlap the regions reported by Schmid et al. (Figure S2). Ren et al. (2005) identified small groups of neighbouring genes that were highly correlated. However, in contrast to the analysis reported here, neither tandemly duplicated genes, nor gene orientation, nor gene distance explained the occurrence of co-expression. We found that co-expressed genes are homologous; share level 3 GO BP terms; are close together; and are in a parallel orientation more frequently than expected. The difference between this study and that of Ren et al. (2005) may be, in part, because the co-expressed groups analysed in this report have a relatively low number of false positives (type I errors). For example, we expect that of the 195 highly co-expressed gene pairs identified in this study, 43 (23%) are false positives (Table 1). Of 1800 co-expressed groups identified by Ren et al. (2005), 1352 (75%) are expected to be false positives.

Although we report a number of neighbouring genes with similar expression patterns, ours is a conservative estimate. First, the microarray data did not include every gene within the genome. The current estimate for gene number in A. thaliana is 30 700, while 22 520 genes are detected by the chip (Borevitz and Ecker, 2004). Tandem duplicates were also under-represented on the chip. Second, the data set reflected mRNA transcript levels derived from a large number of different experimental conditions and tissues. A detailed analysis of multiple-expression profiles from a single experiment or a particular tissue type may detect additional groups (Birnbaum et al., 2003; Ren et al., 2005; Roy et al., 2002). For example, in Saccharomyces cerevisiae adjacent genes were often correlated across hybridizations conducted during one experiment, but the same genes were frequently not correlated across hybridizations in other experiments (Cohen et al., 2000).

Characterization of co-expressed domains

Among the 2.2% of all neighbouring genes that were highly co-expressed with r > 0.7, 40% (91/226) probably arose from tandem duplication. One explanation for the co-expression of tandem duplicates is that these genes have duplicate promoter elements. Although a shared cis-regulatory sequence may also lead to the activation or repression of two divergently transcribed genes (Kraakman et al., 1989; Shin et al., 2003; Xie et al., 2001), divergently transcribed genes were significantly under-represented among co-expressed genes.

Our findings suggest that intergenic distance is an important factor that contributes to co-expression. First, highly correlated neighbouring genes, on average, were 962 bp closer to each other than the typical gene pairs (Table 5). After removing tandem duplicates, co-expressed neighbouring genes were, on average, 1287 bp closer than expected (P < 1.63 × 10⁻⁵). Second, co-expressed genes were remarkably clustered. The standard deviation of intergenic distance for co-expressed gene pairs was 2671 bp, compared with 6652 bp between genes within the genome as a whole (Table 5). Distance has been shown in other genomes to be predictive of co-expression. Cohen et al. (2000) found that approximately 15% of genes within 1000 bp in the S. cerevisiae genome were co-expressed at r > 0.7. This effect was independent of gene orientation, coding sequence relationships, or shared upstream activating sequences (Cohen et al., 2000). In C. elegans, after removing duplicate genes, the mean correlation coefficient for genes located within 20 kb was higher than expected by chance (Lercher et al., 2003). The discovery of this phenomenon in A. thaliana suggests that chromatin modifications affect gene expression across a wide range of eukaryotes. Mechanistically, these regions may be regulated by a common transcription factor. Upstream activating sequences within areas of open chromatin may be capable of long-range interactions (Felsenfeld, 1996).

The presence of significantly co-expressed groups of genes that span long distances is consistent with recent observations of chromosome structure during interphase. The interphase eukaryotic genome contains compartments of open chromatin anchored within highly condensed chromatin (Gerasimova et al., 2000; Noma et al., 2001). Euchromatic domains could contain large groups of significantly co-expressed genes. This suggestion is based on two observations. First, domains of co-expressed genes encompassed groups of at least nine genes (Figure 2), and the mean level of co-expression of all groups encompassing at least 16 genes was also higher than expected by chance (data not shown). Second, the transcriptional state of some large genomic regions is developmentally regulated, and we found that large groups of genes changed in similar ways across a variety of developmental stages and/or treatments. The correlation values between genes within large domains were low, perhaps because general nucleosome disruption over large regions is necessary but not sufficient for stable gene expression (Vermaak et al., 2003). Large clusters of co-expressed genes were also identified in D. melanogaster (Spellman and Rubin, 2002), suggesting that large regions of co-expression may be common across diverse eukaryotes. However, within D. melanogaster, approximately twice the proportion of genes is co-expressed (>20% of the genes in the genome; Spellman and Rubin, 2002).

Highly co-expressed groups contained functionally related genes over twice as frequently as expected by chance, even after excluding homologous genes. In humans, Lercher et al. (2002) found that genes expressed in most tissues (housekeeping genes) are often clustered, while tissue-specific genes are not. Within our data, an analysis of 70 highly co-expressed gene pairs showed that housekeeping genes and stress-response genes were over-represented within A. thaliana co-expression domains. From the 342 possible level 3 GO BP terms, 16 terms occurred 104 times. The process ‘metabolism’ (45 times) was very highly represented, followed by ‘cell growth and/or maintenance’ (15 times) and ‘response to external stimulus’ (14 times). GO BP processes representing regulation or development were noticeably under-represented. Functionally related genes may be over-represented among co-expressed, adjacent genes because genome rearrangements placed genes into a novel and advantageous regulatory environment. Nonetheless, co-expression of neighbouring genes is not universal, and genes that encode proteins involved in a shared process may be further apart than expected by chance (Biehl et al., 2005).

In summary, our results show that the A. thaliana genome contains discrete domains of neighbouring genes that are co-regulated across a large number of diverse treatments. These domains may be large and encompass more than nine genes. Over 10% of the genome contains genes that are correlated at low, but significant levels. Many domains, comprising over 2% of the genome, are small and composed of highly co-expressed genes. Functionally related genes are over-represented within highly co-expressed groups with or without homologous genes. Intergenic distance and gene orientation are both important predictors of co-expression. Investigation of the regulatory elements that control these domains may lead to the identification of development-specific motifs that control gene expression through chromatin-fibre modification. We suggest that higher-order chromatin domains may change in a genome-wide, programmed fashion in response to specific environmental stimuli such as temperature, drought or biotic stress.

Experimental procedures

Data source

We collected 128 A. thaliana gene-expression profiles from the online Affymetrix microarray data repository of the Nottingham Arabidopsis Stock Centre (NASCArrays: http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl), using the criteria described in Results. Expression data were selected to represent 128 different conditions. More than one hybridization from a single experiment was sampled only if hybridizations were of different tissues, or tissues that had undergone different treatments. Over 100 hybridizations were assayed because a previous study found that sample sizes >100 had high resolution to detect relationships among genes (Allocco et al., 2004). The entire (22 520 × 128) data set is available via anonymous ftp from 131.104.74.76. We used a subset of these data, genes on chromosome 2 measured across 40 unrelated hybridizations, to investigate the signal-to-noise ratio of different data sets. One data set consisted of all genes with a detection call of present (1), absent (−1) or marginal (0). A second data set consisted of genes with a detection call of present (1) only. Across data sets, detection values P < 0.04 were defined as present; values P > 0.04 but < 0.06 were defined as marginal; and values P > 0.06 were defined as absent. The discrimination score threshold τ was 0.015 across all hybridizations.

All gene-expression profiles of 22 520 genes were from Affymetrix Arabidopsis ATH1-121501 chips. Within the NASC data set, all experiments had been normalized so as to share a similar mean gene-expression level. The normalized signal value in the Affymetrix data spreadsheet is referred to as the raw gene-expression level. All expression values were subsequently log2-transformed before analysis. We described probe sets on the chip as ‘genes’ using the probe-set matches to Arabidopsis Genome Initiative (AGI) locus identifiers (affy25k_array_elements-2004-06-01.txt available via anonymous ftp from ftp://ftp.arabidopsis.org). The probe sets on the Affymetrix chip were developed from predicted coding sequences derived from the A. thaliana genome annotation from The Institute for Genomic Research (TIGR) of 15 December 2001.

Recovery and analysis of gene characteristics

Using bioperl-1.4 scripts (http://bioperl.org), the predicted starting and ending positions of the 22 520 arrayed genes were extracted from GenBank RefSeq records (accessions NC_003070.2; NC_003071.1; NC_003074.2; NC_003075.1; NC_003076.2). Analyses to calculate the number of groups of co-expressed genes, and the intergenic distance between genes, used the 2001 TIGR version 2 annotation. The intergenic distance of neighbouring genes was calculated as the shortest distance in nucleotides between two genes. If two genes overlapped, their intergenic distance was set to zero. The standalone BLAST bl2seq program with the blastp option and default parameters (NCBI, ftp://ftp.ncbi.nih.gov/blast/executables) was used to determine homology. Genes were conceptually translated into amino acid sequences prior to the analysis, based on GenBank sequence records (accessions NC_003070.4; NC_003071.3; NC_003074.4; NC_003075.3; NC_003076.4; 14 May 2004). We defined sequence pairs with E < 1.0 × 10⁻⁷ as homologous. We used the most recent genome annotation (ver. 5, released in 2005) to confirm the genes used from the 2001 annotation. In version 5, 263 genes from version 2 were removed from the annotation, renamed with an existing gene name, or renamed with a novel gene name. Fourteen highly co-expressed groups contained these genes, and these groups were not reported. The version 5 annotation was used to calculate the orientation of gene pairs relative to each other.
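The intergenic distance rule described above can be illustrated with a minimal Python sketch. It is not the authors' bioperl code. The function name, the (start, end) tuple layout, and the choice to count only the bases lying strictly between two genes are assumptions made for illustration.

```python
def intergenic_distance(gene_a, gene_b):
    """Return the gap in nucleotides between two genes on one chromosome.

    Each gene is a (start, end) tuple of 1-based inclusive coordinates with
    start <= end (an assumed convention). Overlapping genes return 0.
    """
    # Order the two genes by start coordinate so the result is symmetric.
    (_, upstream_end), (downstream_start, _) = sorted([gene_a, gene_b])
    # Assumption: distance = number of bases strictly between the genes.
    gap = downstream_start - upstream_end - 1
    # A negative gap means the genes overlap, so the distance is set to zero.
    return max(gap, 0)
```

Sorting by start coordinate keeps the function independent of argument order, and a gene nested inside another is treated as overlapping.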

To classify functional relationships between genes, gene ontology biological process (GO BP) terms for all genes were downloaded from The Arabidopsis Information Resource (TAIR, 21 August 2004: ftp://ftp.arabidopsis.org/home/tair/Ontologies/Gene_Ontology/OLD). We acquired all GO BP terms for genes from the TIGR 2004 annotation using Perl scripts. Gene ontology terms are hierarchical with a parent–offspring structure, and upper-level terms are more general than lower-level terms. We defined genes that share one of 342 level 3 BP terms as sharing a general biological function. Genes frequently had several GO BP terms, and annotated terms were at different nodal levels from the root. For every assigned GO BP term located at a level deeper than three nodes from the root of the hierarchy, we retrieved its corresponding level 3 GO BP term(s) using the GOstats package (ver. 1.0.1) obtained from Bioconductor (http://www.bioconductor.org).
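The mapping of deep GO BP terms to their level 3 ancestors can be sketched in plain Python as a stand-in for the GOstats step; it does not reproduce the GOstats API. The toy hierarchy, the term labels, the function names, and the convention that "level 3" means three edges below a node called "root" are all assumptions for illustration and do not reflect the actual GO structure.

```python
def level_terms(term, parents, level=3, root="root"):
    """Return the terms lying `level` edges below the root on any path
    from the root down to `term` (the term itself counts if it sits there)."""
    found = set()

    def walk(node, path_below):
        path = [node] + path_below  # path[0] is the node nearest the root
        if node == root:
            if len(path) > level:
                found.add(path[level])
            return
        for parent in parents.get(node, ()):
            walk(parent, path)

    walk(term, [])
    return found


def share_general_function(terms_a, terms_b, parents):
    """True if two genes share at least one level 3 term."""
    a = set().union(*(level_terms(t, parents) for t in terms_a))
    b = set().union(*(level_terms(t, parents) for t in terms_b))
    return bool(a & b)


# Toy child -> parents map (hypothetical, not the real GO graph).
toy_parents = {
    "biological_process": ["root"],
    "physiological_process": ["biological_process"],
    "metabolism": ["physiological_process"],
    "response_to_external_stimulus": ["physiological_process"],
    "protein_metabolism": ["metabolism"],
    "proteolysis": ["protein_metabolism"],
    "response_to_wounding": ["response_to_external_stimulus"],
}
```

Because a GO term can have several parents, the sketch follows every path to the root and may return more than one level 3 term for a single annotation.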

The non-parametric two-sided Wilcoxon rank sum test evaluated the significance of differences between groups, and goodness-of-fit (χ2) tests compared the number of expected and observed individuals within certain classes. Expected distributions of traits such as intergenic distances or gene orientation were obtained by sampling from the entire genome. All tests were done in R (R Development Core Team, 2004).

Co-expression analysis

To determine if genes located near to each other on the chromosome tended to be co-expressed, the expression values of genes across experiments were ordered first by the gene's chromosome number, and second by the gene's putative chromosomal position. Two genes represented on the Affymetrix chip were defined as neighbouring if their ordered rank differed by one. The A. thaliana genome has approximately 30 700 genes, 8000 more than on the chip, so neighbouring genes are not necessarily immediately adjacent.
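The ordering step can be illustrated with a short Python sketch. The tuple layout, the function name, and the gene identifiers are hypothetical. The rule that the last gene on one chromosome is not paired with the first gene on the next is an assumption, because the text does not state how chromosome boundaries were handled.

```python
def neighbouring_pairs(genes):
    """Return pairs of gene IDs whose ordered ranks differ by one.

    `genes` is a list of (gene_id, chromosome, start) tuples for the
    genes represented on the chip.
    """
    # Order first by chromosome number, then by chromosomal position.
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        # Assumption: genes on different chromosomes are never neighbours.
        if left[1] == right[1]:
            pairs.append((left[0], right[0]))
    return pairs


# Hypothetical identifiers and coordinates, for illustration only.
arrayed = [
    ("geneC", 1, 9000),
    ("geneA", 1, 1000),
    ("geneD", 2, 500),
    ("geneB", 1, 4000),
    ("geneE", 2, 7000),
]
```

Only arrayed genes enter the ordering, so two "neighbours" in this sense may be separated by genes that are absent from the chip.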

Pearson's correlation coefficient, r, was used to determine if neighbouring gene-transcript levels changed in similar ways across experiments. The coefficient measures the linear relationship between two variables. If the variables are normally distributed, then r fully characterizes the joint variation (Cook and Weisberg, 1999). The NASC data were highly skewed to the right. A log2-transformation of all expression data approximated normality. Pearson's correlation coefficient is given as:

$$ r = \frac{1}{N - 1} \sum_{i=1}^{N} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{S_x S_y} $$

where N is the total number of hybridizations; Xi = X1, X2, …, XN represents expression levels of one gene across hybridizations; and Yi = Y1, Y2, …, YN represents expression levels of a neighbouring gene across hybridizations. X̄ and Ȳ are the mean expression levels of genes x and y, and Sx and Sy are the standard deviations of expression values for genes x and y, respectively, across hybridizations. Genes or groups of genes whose expression was correlated at r > 0.7 were considered highly co-expressed (Cohen et al., 2000).
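As a minimal sketch, Pearson's r for two neighbouring genes can be computed from log2-transformed signal values as follows. It assumes sample standard deviations with an N − 1 denominator. The function names and the 0.7 cut-off constant are illustrative and are not taken from the authors' R code.

```python
import math

HIGH_COEXPRESSION = 0.7  # threshold for "highly co-expressed" used in the text


def log2_profile(raw_values):
    """Log2-transform raw (positive) expression values."""
    return [math.log2(v) for v in raw_values]


def pearson_r(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations across hybridizations (N - 1 denominator).
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined for a constant profile")
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)


def highly_coexpressed(raw_x, raw_y):
    """True if two raw profiles correlate at r > 0.7 after log2 transform."""
    return pearson_r(log2_profile(raw_x), log2_profile(raw_y)) > HIGH_COEXPRESSION
```

The transform is applied before the correlation because r summarizes joint variation well only when the values are approximately normal.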

Groups of between two and nine ordered genes throughout the genome were analysed to determine if a number of neighbouring genes within the genome have similar gene-expression changes across different treatments. Adjacent groups overlapped, and differed by a single gene. The level of correlation among genes within a group was estimated either as the mean absolute value of the Pearson correlation coefficient between all gene pairs within the group (|r| mean), or as the mean Pearson correlation coefficient between all gene pairs within the group (r mean). To estimate the number of genes that are co-expressed, we counted all genes within a co-expressed group. Genes were counted only once. A gene within a highly co-expressed group of length n would not be counted if the same gene had been counted as part of another group of size n, or as part of a group of ≥n + 1.
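The sliding-group calculation can be sketched as below. The sketch assumes that the profiles are already log2-transformed and ordered by chromosomal position on a single chromosome. Function names are hypothetical, and the rule for counting each gene only once is not implemented here.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def group_mean_r(profiles, absolute=True):
    """Mean of r (or of |r| when absolute=True) over all gene pairs in a group."""
    values = [pearson_r(a, b) for a, b in combinations(profiles, 2)]
    if absolute:
        values = [abs(v) for v in values]
    return sum(values) / len(values)


def sliding_groups(ordered_profiles, size, absolute=True):
    """Return (start_index, mean correlation) for every window of `size`
    consecutive genes; adjacent windows differ by a single gene."""
    results = []
    for start in range(len(ordered_profiles) - size + 1):
        window = ordered_profiles[start:start + size]
        results.append((start, group_mean_r(window, absolute)))
    return results
```

Calling `sliding_groups` once for each group size from two to nine would cover the range of group sizes analysed in the text.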

Groups of genes with significant and high mean |r| values were composed of pairs of genes that had low-to-high pairwise |r| values. For example, among all significantly co-expressed chromosome 1 groups of n = 9, the standard deviation of pairwise |r| values within a group averaged 0.202, with a range of 0.143–0.326. The distribution of pairwise |r| values within the most variable group, the most uniform group, and six additional, randomly selected groups is given in Figure S3.

To evaluate the significance of co-expressed genes, we estimated the distribution of |r| under the null hypothesis of random association between gene positions and expression values. To determine if groups of genes were significantly correlated, 1000 permutations were conducted to estimate the null distribution, and an observed |r| greater than or equal to 100(1 − α)% of the permuted values was considered significant at level α.
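One way to approximate this permutation test is sketched below. It assumes that each permutation randomizes gene positions, so that a group of a given size is simply a random draw of genes from the genome. The function names, the default α, and the fixed seed are illustrative assumptions, and the authors' own procedure may differ in detail.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def mean_abs_r(profiles):
    """Mean |r| over all gene pairs within a group."""
    pairs = list(combinations(profiles, 2))
    return sum(abs(pearson_r(a, b)) for a, b in pairs) / len(pairs)


def null_threshold(profiles, size, alpha=0.05, n_perm=1000, seed=0):
    """Critical mean |r| for groups of `size` genes under random positions."""
    rng = random.Random(seed)
    # Each permutation: a random group of genes, ignoring chromosomal order.
    null = sorted(mean_abs_r(rng.sample(profiles, size)) for _ in range(n_perm))
    # Value that is >= a fraction (1 - alpha) of the permuted values.
    rank = max(1, min(n_perm, math.ceil((1 - alpha) * n_perm)))
    return null[rank - 1]


def is_significant(group_profiles, threshold):
    """True if the observed mean |r| reaches the permutation threshold."""
    return mean_abs_r(group_profiles) >= threshold
```

Drawing random groups avoids reshuffling the whole genome for every permutation, but gives the same null distribution for a single window under the stated assumption.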

Acknowledgements

We thank Dr Nick Provart (University of Toronto) for access to the University of Toronto Botany Beowulf cluster, and the contributors of array data to the Nottingham Arabidopsis Stock Centre (NASC). We thank Dr David Wolyn, Dr Joseph Colasanti and two anonymous reviewers for valuable comments on the manuscript. This research was funded by the Ontario Ministry of Agriculture, the Natural Sciences and Engineering Research Council of Canada, and the Ontario Graduate Scholarship in Science and Technology.
