A gene expression atlas of the model legume Medicago truncatula


*For correspondence (fax +1 580 224 6692; e-mail mudvardi@noble.org).


Legumes played central roles in the development of agriculture and civilization, and today account for approximately one-third of the world’s primary crop production. Unfortunately, most cultivated legumes are poor model systems for genomic research. Therefore, Medicago truncatula, which has a relatively small diploid genome, has been adopted as a model species for legume genomics. To enhance its value as a model, we have generated a gene expression atlas that provides a global view of gene expression in all major organ systems of this species, with special emphasis on nodule and seed development. The atlas reveals massive differences in gene expression between organs that are accompanied by changes in the expression of key regulatory genes, such as transcription factor genes, which presumably orchestrate genetic reprogramming during development and differentiation. Interestingly, many legume-specific genes are preferentially expressed in nitrogen-fixing nodules, indicating that evolution endowed them with special roles in this unique and important organ. Comparative transcriptome analysis of Medicago versus Arabidopsis revealed significant divergence in developmental expression profiles of orthologous genes, which indicates that phylogenetic analysis alone is insufficient to predict the function of orthologs in different species. The data presented here represent an unparalleled resource for legume functional genomics, which will accelerate discoveries in legume biology.


Legumes (family Fabaceae) are second only to grasses (Gramineae) in importance to humans as a source of food, feed for livestock, and raw materials for industry (Graham and Vance, 2003). Legumes account for one-third of the world’s primary crop production and are key to sustainable agriculture because they can ‘fix’ nitrogen (reduce N2 to NH3) in a symbiotic association with bacteria called rhizobia, providing crops with a free and renewable source of nitrogen. It is estimated that 40–60 million tonnes of N are fixed annually by cultivated legumes (Smil, 1999), saving about US$10 billion on nitrogen fertilizer (Graham and Vance, 2003).

Symbiotic nitrogen fixation (SNF) in legumes takes place in specialized organs called nodules which develop from root cortical cells that start dividing following signal exchanges between the plant roots and rhizobia in the soil (Brewin, 1991; Long, 2001; Oldroyd and Downie, 2004). Rhizobia gain entry into cortical cells of developing nodules via an infection thread, which traverses the epidermal root hair cell that makes first contact with the bacteria, and subsequently ramifies throughout the cortical tissue. Concomitantly, dividing cortical cells form a nodule primordium inside which a new meristem is initiated that drives nodule organogenesis. Bacteria are then released from infection threads into the cytoplasm of cortical cells via endocytosis, which leaves the bacteria surrounded by a host membrane called the symbiosome membrane (Udvardi and Day, 1997). Within the resulting organelle, called the symbiosome, the bacteria multiply and ultimately differentiate into their nitrogen-fixing ‘bacteroid’ state. Infected cortical cells typically contain thousands of symbiosomes, each containing one or a few bacteroids. Transcriptome analyses have identified hundreds of plant and bacterial genes that are differentially expressed during nodule development and differentiation (Becker et al., 2004; Colebatch et al., 2002, 2004; El-Yahyaoui et al., 2004; Mitra et al., 2004; Uchiumi et al., 2004), although genome-wide studies of plant gene expression during nodulation are yet to be reported. Likewise, although transcriptome studies have been carried out on different organs in multiple legume species under a variety of experimental conditions (Colebatch et al., 2004; El-Yahyaoui et al., 2004; Zabala et al., 2006), no single study has yet brought together genome-wide data for all major plant organs of a single species.

Results and discussion

To create a comprehensive gene expression atlas for Medicago truncatula, we utilized the new Affymetrix GeneChip Medicago Genome Array, which contains 50 900 probe sets representing the majority of genes in this species. Gene expression values were obtained from three independent biological replicates of each of the major organ systems: roots, nodules, stems, petioles, leaf blades, vegetative buds, flowers, and seed pods. In addition, multiple stages of nodule and seed development were profiled to obtain greater insight into the transcriptional programs that underlie the development of these two organs, which are the foci of most legume research. The fraction of genes for which transcripts were detected (called ‘present’ by dCHIP; Li and Wong, 2001) was similar in all samples, and ranged from 55.2% in seeds to 63.3% in roots (Figure S1). Similar results were obtained recently for Arabidopsis (Schmid et al., 2005). However, Arabidopsis cannot nodulate and provides no information about symbiotic nitrogen fixation.

No transcripts were detected for 13.9% of putative genes in any of the organs tested. The majority of these correspond to genes annotated from genomic sequences by the International Medicago Genome Annotation Group (IMGAG; Town, 2006). No evidence for expression of 23.3% of IMGAG-annotated genes was found, corresponding to 4370 out of 18731 probe sets. It is possible that some of these have been incorrectly annotated as genes, although more thorough sampling of the transcriptome, by encompassing a wider range of developmental stages, growth, and stress conditions, together with more sensitive measurement devices such as quantitative (q)RT-PCR, will be required to conclude that such DNA is not transcribed under any conditions. In contrast to the results obtained from probe sets designed from IMGAG-annotated genes, only 8.4% of probe sets derived from expressed sequence tag (EST)/cDNA sequences did not detect gene transcripts in any of the organs tested, which presumably reflects the inherent bias in EST data towards more highly expressed genes.

Transcripts were detected in at least one organ for 86.1% of all genes represented by the 50 900 probe sets on the Medicago Genome Chip. There was a high degree of overlap in the sets of genes expressed in different organs: transcripts of 42% of all expressed genes (36% of all probe sets; Figure S1) were detected in all organs, and this percentage increased in pair-wise comparisons between organs. For example, 79% of genes expressed in roots or nodules of 28-day-old plants were expressed in both organs (Figure 1a). Despite the qualitative similarities in gene expression amongst organs, the dynamics of gene expression differed markedly between organs and within organs over developmental time (Figure 1b–d). The majority of genes were subject to transcriptional and/or post-transcriptional regulation that altered transcript levels during plant development. In fact, at least 73% of all genes exhibited a >100% change in transcript level from the organ with lowest expression to the organ with highest expression. The mean coefficient of variation (CV = standard deviation/mean) of transcript levels for all expressed genes across all organs was 60.6%, ranging from 2.3% to 428.6%, while the mean CV for the three biological replicates of each organ was only 13.3%. In other words, the biological variation in gene transcript levels within an organ, including technical errors associated with measurement, was far less than the biological variation between organs.
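The comparison of between-organ and within-organ variation can be illustrated with a minimal Python sketch for a single gene. The transcript levels and organ names below are invented for illustration, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation (standard deviation / mean) as a percentage."""
    return 100.0 * stdev(values) / mean(values)

# Hypothetical transcript levels for one gene: three biological replicates per organ.
replicates = {
    "root":   [200.0, 220.0, 210.0],
    "nodule": [900.0, 950.0, 1000.0],
    "leaf":   [45.0, 50.0, 55.0],
}

# Between-organ CV: computed across the per-organ means.
organ_means = [mean(levels) for levels in replicates.values()]
between_organ_cv = cv_percent(organ_means)

# Within-organ CV: computed per organ across replicates, then averaged.
within_organ_cv = mean(cv_percent(levels) for levels in replicates.values())

print(f"between organs: {between_organ_cv:.1f}%  within organs: {within_organ_cv:.1f}%")
```

For a gene that is regulated during development, the between-organ value is expected to be much larger than the within-organ value, which is the pattern reported above for the data set as a whole.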

Figure 1.

 Gene expression dynamics during Medicago development.
(a) Comparison of gene expression (detected transcripts) in roots, nodules, and leaves of 28-day-old plants.
(b–d) Relative gene expression levels (Z scores) in different organs at one developmental stage (b), and at multiple stages for nodules (c) and seeds (d). Transcript levels were log2-transformed before calculation of Z = (X − Xav)/SD; where X is the mean transcript level for a given gene in the specified organ, Xav is the average transcript level for that gene in all organs, and SD is the standard deviation of transcript level for that gene across all organs. The number of genes was determined for each ΔZ = 0.1. Nod4–Nod28 represent nodules harvested 4–28 days after inoculation with Sinorhizobium meliloti, while Root0 represents root tissue immediately prior to inoculation. S10–36 represent seed harvested from 10 to 36 days after pollination.

Similarity between the transcriptomes of different organs was estimated using Pearson correlation, taking into account all genes expressed in at least one organ. The resulting heatmap of correlations revealed three main groups of organs with similar transcriptomes (Figure 2). The first group consisted of the underground organs, roots and nodules, the second group included seeds at different developmental stages, and the third group contained aerial organs, including leaves, stems, petioles, shoot apices, pods, and flowers. The aerial organs were more closely related to seeds than to underground organs.

Figure 2.

 Comparison of transcriptomes of various organs.
Pair-wise Pearson correlation coefficients were used to generate the heat map. The color scale indicates the degree of correlation. Samples were clustered with Euclidean distance using the MultiExperiment Viewer (MeV, http://www.tm4.org/mev.html).
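A pair-wise correlation matrix of the kind underlying Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the profiles are invented log2 transcript levels for five genes, and the code is not the MeV implementation used to draw the figure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical log2 transcript levels for the same five genes in three organs.
profiles = {
    "root":   [8.0, 6.0, 7.0, 3.0, 9.0],
    "nodule": [8.5, 6.5, 6.0, 3.5, 9.5],
    "seed":   [3.0, 9.0, 4.0, 8.0, 2.0],
}

# Full pair-wise matrix, stored as a dictionary keyed by organ pair.
organs = list(profiles)
corr = {(a, b): pearson(profiles[a], profiles[b]) for a in organs for b in organs}

for a in organs:
    print(a, [round(corr[(a, b)], 2) for b in organs])
```

In this toy example the root and nodule profiles correlate strongly, whereas the seed profile does not, mirroring the grouping of underground organs described in the text.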

Consistent with the large degree of overlap between expressed genes in the various organs, relatively few genes were found to be expressed in an organ-specific manner (Figure 3). The number of organ-specific genes identified in samples from 4-week-old plants ranged from 10 for stems to 322 for nodules. These numbers increased when data were obtained from multiple stages during development, as exemplified by nodules and seeds in which 473 and 584 genes, respectively, were identified as organ-specific from the corresponding developmental series (Table S1). In some cases, organ-specific genes were expressed only transiently during development, suggesting to us that they play roles in development per se, rather than in the maintenance of specialized biochemical or physiological functions of each organ, at least under ideal growth conditions.

Figure 3.

 Heat map of organ-specific genes.
The color scale indicates the number of times transcripts for a given gene were detected in the three biological replicates of each organ. Only those genes are shown for which transcripts were detected in all three biological replicates of one organ and no more than once in another organ.
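The selection rule in the Figure 3 legend can be written as a short Python function. The sketch assumes one reading of that rule: a gene is organ-specific if it is detected in all three replicates of exactly one organ and in no more than one replicate of every other organ. The function name and the example counts are hypothetical.

```python
def organ_specific(present_counts, n_replicates=3):
    """Return the organ a gene is specific to, or None.

    present_counts maps organ name -> number of biological replicates
    (0..n_replicates) in which the transcript was called 'present'.
    """
    # Organs in which the gene was detected in every replicate.
    fully_detected = [o for o, c in present_counts.items() if c == n_replicates]
    if len(fully_detected) != 1:
        return None
    target = fully_detected[0]
    # Every other organ may show at most one 'present' call.
    others_ok = all(c <= 1 for o, c in present_counts.items() if o != target)
    return target if others_ok else None

print(organ_specific({"root": 0, "nodule": 3, "leaf": 1}))  # nodule
print(organ_specific({"root": 2, "nodule": 3, "leaf": 0}))  # None
```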

A more complex picture emerged for genes that were expressed in specific organs at a level at least twice that of any other organ (Figure S2). The number of such genes ranged from 40 in petioles to 908 in roots of 4-week-old plants. Even larger numbers of organ-induced genes were uncovered in the developmental time-course for nodules (1354 genes) and seeds (3228 genes; Table S2). Interestingly, transcript levels of many of the genes induced during nodule or seed development were maintained at relatively high levels in the mature organ, suggesting that they may play roles in differentiation or the maintenance of specialized organ functions.
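The twofold criterion used here for organ-enhanced expression can be sketched as follows. The function name and transcript levels are hypothetical, and treating a ratio of exactly two as passing is an assumption, since the text says only 'at least twice'.

```python
def organ_enhanced(levels, fold=2.0):
    """Return the organ whose transcript level is at least `fold` times
    that of every other organ, or None. Requires at least two organs."""
    ranked = sorted(levels.items(), key=lambda item: item[1], reverse=True)
    (top_organ, top_level), (_, second_level) = ranked[0], ranked[1]
    # Comparing against the runner-up is enough: all other organs are lower still.
    return top_organ if top_level >= fold * second_level else None

print(organ_enhanced({"root": 900.0, "nodule": 400.0, "leaf": 100.0}))  # root
print(organ_enhanced({"root": 700.0, "nodule": 400.0, "leaf": 100.0}))  # None
```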

To identify all genes that are subject to transcriptional or post-transcriptional regulation during the development of Medicago, we chose roots as an arbitrary reference organ and tested the null hypothesis that expression in other organs was not significantly different from that in roots. Using a Bonferroni-corrected (Abdi, 2007) P-value threshold of 1.14 × 10⁻⁶, the percentage of genes expressed at a different level from roots ranged from 46.7% in nodules to 55.9% in leaves. In total, 81.5% of genes were differentially expressed between roots and one or more of the other seven organs (Table S3). The false discovery rate of these genes was estimated by determining Q-values for each, using EDGE software (Leek et al., 2006; Storey and Tibshirani, 2003). Clearly, organ development, differentiation, and maintenance in plants are underpinned by massive quantitative changes in gene expression.

Genes with a constant expression level throughout development and in the face of environmental challenges, which often fulfill housekeeping roles in cells, are useful reference points for comparative gene expression analysis (Czechowski et al., 2005). We identified 102 genes with a coefficient of variation (CV) of <16% for transcript levels amongst all the organs analyzed (Table S4). Transcript levels of these genes ranged widely, from as high as 14 500 to as low as 100 units, which was used as the minimum threshold level. Amongst the stably expressed genes were several that are used traditionally as reference genes in plants, including glyceraldehyde-3-P-dehydrogenase and ubiquitin. The most stably expressed gene, corresponding to TC97716 with unknown function, had a CV for transcript level amongst all organs of 9.4%. This set of reference genes will be of great value to legume research for normalizing gene expression data derived from qRT-PCR or probe-hybridization approaches.
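A filter for candidate reference genes along these lines might look like the Python sketch below. The gene names and levels are invented. Applying the 100-unit minimum to the lowest organ value, and using the sample standard deviation, are both assumptions, because the text does not specify either detail.

```python
from statistics import mean, stdev

def stable_genes(expression, max_cv=16.0, min_level=100.0):
    """Return {gene: CV%} for stably expressed genes, most stable first.

    expression maps gene -> list of mean transcript levels, one per organ.
    """
    selected = {}
    for gene, levels in expression.items():
        if min(levels) < min_level:      # assumed: threshold applies to every organ
            continue
        cv = 100.0 * stdev(levels) / mean(levels)
        if cv < max_cv:
            selected[gene] = cv
    return dict(sorted(selected.items(), key=lambda item: item[1]))

# Hypothetical per-organ transcript levels for four genes.
expression = {
    "geneA": [1000.0, 1050.0, 980.0, 1020.0],   # stable, high level
    "geneB": [120.0, 900.0, 300.0, 150.0],      # strongly regulated
    "geneC": [60.0, 62.0, 61.0, 59.0],          # stable but below threshold
    "geneD": [500.0, 560.0, 450.0, 520.0],      # moderately stable
}
stable = stable_genes(expression)
print(list(stable))
```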

Symbiotic nitrogen fixation in plants is a process confined largely to the legume (Fabaceae) family. Therefore, well-established, non-legume model species such as Arabidopsis thaliana and Oryza sativa (rice) cannot be used to learn more about SNF. The data presented here represent the most comprehensive data set to date for gene expression during nodule development in a legume. More than 26 000 genes are expressed during nodule development and 30.2% of these are differentially expressed (transcript levels increase or decrease more than twofold compared with roots with Bonferroni-corrected P < 1.14 × 10⁻⁶) at some stage during this development (Figure 4 and Table S5). Visualization of the nodule development data, using MapMan to overlay changes in gene expression onto metabolic maps (Goffard and Weiller, 2006; Thimm et al., 2004), confirmed and extended previous, smaller-scale transcriptomics studies (El-Yahyaoui et al., 2004) that showed induction during nodule development of genes involved in glycolysis, carbon fixation, and nitrogen metabolism (Figure S3a–c). Many genes involved in secondary metabolism, such as the terpenoid and flavonoid pathways, were repressed during nodule development (Figure S3d). These results are interesting in light of the fact that many secondary compounds play roles in plant defense (Dixon, 2001), a process that is presumably suppressed in nodules in order to maintain a quasi-stable symbiosis with the nitrogen-fixing rhizobia. We provide the nodule development data described here in a form that can be imported into MapMan for exploration by the reader of other pathways and processes (Table S6).

Figure 4.

 Hierarchical clustering of genes that were differentially expressed during nodule development.
Clustering of genes was based on Pearson correlation. The heat map portrays transcript levels in nodules 4, 10, 14, and 28 days after inoculation (Nod4–Nod28) relative to that in uninoculated roots (Root0).

The massive changes in transcript abundance that occur during the development of nodules and other organs indicate tremendous regulatory activity at the transcriptional and/or post-transcriptional levels. Transcription factors (TFs) are DNA-binding proteins that interact with specific cis-elements of genes to regulate transcription, either positively or negatively. Plants such as Arabidopsis may possess as many as 2000 TF genes, representing more than 6% of all their genes (Riechmann and Ratcliffe, 2000; Riechmann et al., 2000). To identify TF genes that control development and differentiation in Medicago, we first created a list of putative TF genes by screening predicted protein sequences for the presence of known or suspected DNA-binding domains, using InterPro (http://www.ebi.ac.uk/interpro/) and Pfam (http://pfam.sanger.ac.uk/) domain identification, additional hidden Markov model (HMM) predictions (Guo et al., 2005; Sonnhammer et al., 1997), and a BLASTX search of the NCBI NR database to support annotations. In this way, we previously identified 1298 putative TF genes represented by probe sets on the Affymetrix Medicago GeneChip (Udvardi et al., 2007). Most of these (1169) fall into the 45 known families of plant TF genes or other transcriptional regulators, while 129 may define novel TF families in plants (Table S7). Five hundred and thirty-two of the putative TF genes are differentially expressed (more than twofold change; P < 1.14 × 10⁻⁶) during nodule development and may therefore play important roles in SNF (Table S5). The vast majority (>1100) of TFs are differentially expressed in other organs (Table S3). These data are a rich source of information and a sound platform for future experimental work aimed at unraveling genetic regulatory networks that govern organ development in Medicago.

The ability to form a nitrogen-fixing symbiosis appears to have evolved relatively recently in land plants, approximately 65 million years ago (Doyle, 1998), and as a result SNF is restricted to legumes and a few non-legume species. Interestingly, some of the genes required to establish SNF in legumes appear to have been recruited from a more ancient set of genes that are required for mycorrhizal symbiosis (Kistner and Parniske, 2002). Mycorrhizal symbioses are believed to have evolved when plants first colonized land 450 million years ago (Redecker et al., 2000; Remy et al., 1994), and it has been suggested that these fungal symbionts served as an extension of the primitive plant root system. Indeed, mycorrhizal fungi extend the reach of plant roots and aid in plant nutrition, especially phosphorus uptake (Harrison, 1999). The ancient origin and importance of mycorrhizal symbioses to land plants is reflected by the fact that approximately 90% of all land plant species are able to form such symbioses. A number of genes, mostly encoding signaling proteins, have been discovered in legumes that are required for both mycorrhizal symbiosis and nodulation/SNF (Kistner and Parniske, 2002; Parniske, 2004). Additional genes are essential for SNF but not for mycorrhizal symbiosis (Kalo et al., 2005; Radutoiu et al., 2003; Schauser et al., 1999; Smit et al., 2005). Each of these additional genes has so far been found to have one or more homologs in non-legume, non-nitrogen-fixing plant species, such as Arabidopsis. This raises an important question: Were all genes required for nodule development and SNF simply recruited from a pre-existing stock of plant genes, or did novel genes evolve as a result of natural selection for SNF? Legume-specific genes (LSGs) that appear to be absent from the genomes of non-legumes have been identified in a number of species. 
Several classes of LSGs have been identified in Medicago that encode short proteins, including over 300 cysteine cluster proteins (CCPs; Alunni et al., 2007; Fedorova et al., 2002; Graham et al., 2004; Mergaert et al., 2003), 63 proline-rich proteins (PRPs; Graham et al., 2004; Sherrier et al., 2005) and 21 glycine-rich proteins (GRPs; Alunni et al., 2007; Kevei et al., 2002; Silverstein et al., 2006). Five thousand eight hundred and forty-two probe sets representing LSGs were identified on the Medicago GeneChip. This included sequences for 355 CCPs, 5 PRPs, and 7 GRPs (Table S8). Analysis of LSG expression revealed that a subset of 322 CCPs (called NCRs for nodule-specific cysteine rich) and all seven GRPs were expressed in a nodule-specific manner consistent with roles for these genes in nodule development and/or function (Figure 5 and Table S8). Some PRPs were expressed in nodules but were also detected in other tissues such as seeds or flowers (Figure 5). These results confirm and extend earlier work on LSGs (Alunni et al., 2007; Fedorova et al., 2002; Graham et al., 2004; Mergaert et al., 2003).

Figure 5.

 Expression of legume-specific genes in Medicago organs.
Genes were clustered based on Pearson correlation. Cysteine cluster proteins (CCPs) are indicated in the far-right column. The color scale shows log2-transformed transcript levels for each gene.

Large data sets of the type presented here for Medicago and elsewhere for different plant species, such as Arabidopsis (Schmid et al., 2005), enable us to address other important questions about gene evolution in plants. One of these questions relates to the conservation of gene function in different plant lineages. To gain insight into the possible role(s) of a gene in a crop species, for instance, plant biologists often turn to a related model species, such as Arabidopsis, rice, or Medicago, and ask: what is the function of the ortholog(s) of the crop gene(s)/protein(s) in the model species? The implicit assumption is that orthologs in different species perform similar, if not identical, physiological functions despite millions of years of evolution. For this to be true, orthologous genes must have similar expression profiles in the two organisms. To test whether this is generally the case, we identified homologs of Medicago genes in Arabidopsis and made pair-wise comparisons of gene expression between the two species, using matching data from roots, stems, leaves, flowers, seeds, petioles, and vegetative buds. Pearson correlation analysis was used to rank Arabidopsis homologs based on the similarity of gene expression profiles in the various organs of the two species. The top-ranking homolog determined from correlation analysis of gene expression matched the putative ortholog identified by phylogenetic analysis in only 62% of cases (Figure 6 and Table S9). Thus, transcriptional regulation of putative orthologs has diverged substantially between Medicago and Arabidopsis, indicating that many orthologs may not perform the same range of functions in these two species. It should be noted that the plant growth conditions and developmental states of Medicago and Arabidopsis organs compared here were not identical, which would tend to decrease the correlation between gene expression patterns in the two species.
Nonetheless, given that the expression patterns of paralogs often exhibited higher correlation than predicted orthologs (Figure 6), phylogenetic analysis alone may yield inaccurate predictions for the physiological functions of many genes, at least for comparisons between families as divergent as legumes and crucifers. This underscores the importance of the gene expression data collected here as a tool for Medicago and legume functional genomics.

Figure 6.

 Correlation between expression profiles of homologous and orthologous genes in Arabidopsis thaliana and Medicago truncatula.
Normalized transcript levels of roots, stems, leaves, flowers, seeds, petioles, and vegetative buds were compared between the two species. Pearson (linear) correlation coefficients were determined for all pairs of sequences regarded as either homologous or orthologous between Medicago and Arabidopsis. Histograms of the number of sequences (Y-axis) over the correlation coefficient (X-axis) of Medicago sequences and the corresponding Arabidopsis sequences are shown. The blue line represents the best-correlating Arabidopsis sequence identifiable within the set of sequence homologs for each Medicago sequence. The green line represents the best-correlating Arabidopsis sequence identifiable within the set of putative sequence orthologs for each Medicago sequence.

In summary, we have produced a comprehensive gene expression atlas for the model legume M. truncatula, which encompasses all major organs of this species, including detailed time-courses through nodule and seed development. In addition to being a rich source of information for legume biologists, this data set enables large-scale comparisons between the transcriptomes of different plant species. Analysis of the data presented here shows that differences between plant organs result mainly from quantitative rather than qualitative changes in global gene expression. Relatively few genes are organ-specific in this species. Amongst these are subsets of legume-specific genes that appear to be expressed exclusively or preferentially in nodules. This implies that evolution of symbiotic nitrogen fixation was accompanied, and possibly facilitated, by the evolution of novel genes in legumes. These genes are clear targets for future studies aimed at identifying the core set of genes required for SNF in legumes. Finally, comparative functional genomics studies of the type presented here for Medicago and Arabidopsis will add a new dimension to studies of gene evolution in plants and other organisms.

We plan to expand the Medicago Gene Expression Atlas to encompass transcript data from wild-type and mutant plants exposed to various biotic and abiotic challenges, and to increase the spatial resolution of expression data by analyzing specific tissues and cell types. We invite the scientific community to collaborate with us in this venture by submitting raw Medicago Affymetrix data from complementary experiments together with metadata describing the experimental material, either directly to us (contact mudvardi@noble.org) or to ArrayExpress (http://www.ebi.ac.uk/miamexpress/).

Experimental procedures

Plant material, RNA isolation, probe preparation and array hybridization

Medicago truncatula cv. Jemalong, line A17 seeds were scarified with concentrated sulfuric acid, rinsed, sterilized with 2% sodium hypochlorite, and vernalized at 4°C for 3 days on moist, sterile filter paper. Germinated seedlings were transplanted to pots containing Turface MVP calcined (illite) clay (Profile Products, http://www.profileproducts.com/) and placed in a growth chamber set to the following conditions: 16-h/8-h light/dark regime, 200 μE m⁻² sec⁻¹ light irradiance, 24°C and 40% relative humidity. Plants were fertilized daily with half-strength B&D solution (Broughton and Dilworth, 1971) containing 2 mM KNO3 and 2 mM NH4NO3. A subset of plants was inoculated with Sinorhizobium meliloti strain 1021 at 1 and 7 days after sowing and fertilized with half-strength B&D solution containing 0.5 mM KNO3. Vegetative organs (roots, stems, petioles, leaves, and shoot buds from uninoculated plants and nodules from inoculated plants) were harvested from multiple plants 28 days after planting and pooled for individual biological replicates in a completely randomized design. All experiments were performed with three biological replicates planted on separate days. For flowers and pods, plants were vernalized for 2 weeks to decrease time to flowering. Flowers were harvested on the first day that they opened fully, while pods were collected at various stages of development (length ranged from 2.5 to 9.0 mm) within 21 days after the appearance of the floral bud. Harvesting of all organs occurred at the same time each morning, approximately 3 h after ‘dawn’, to avoid as far as possible diurnal changes in gene expression. All harvested material was frozen immediately in liquid nitrogen and stored at −80°C prior to RNA isolation.
Total RNA was extracted using TRIzol reagent (Invitrogen, http://www.invitrogen.com/; Chomczynski and Mackey, 1995), treated with DNaseI (Ambion, http://www.ambion.com/), and column purified with an RNeasy MinElute CleanUp Kit (Qiagen, http://www.qiagen.com/).

Material for the nodule developmental series was harvested from plants grown aeroponically at 22°C, 75% hygrometry, a light irradiance of 200 μE m⁻² sec⁻¹, and a 16-h/8-h light/dark regime. Plants were grown initially for 11 days using a nitrogen-rich medium (Journet et al., 2001) then deprived of nitrogen for 4 days before being inoculated with S. meliloti strain 2011. Roots were harvested 0, 4, 10, and 14 days post-inoculation and nodules were dissected from roots prior to freezing in liquid nitrogen, storage at −80°C, and RNA isolation using a Nucleospin RNA kit (Macherey-Nagel, http://www.macherey-nagel.com/).

Material for the seed developmental series was harvested from plants grown in pots containing attapulgite (50%) and clay beads (50%) at 22°C/19°C day/night, 16-h photoperiod at 220 μE m⁻² sec⁻¹ light irradiance, and 60–70% relative humidity. Plants were fertilized with nutrient solution three times a week and watered on intervening days. Seeds were excised from pods 10, 12, 16, 20, 24, and 36 days after pollination, frozen in liquid nitrogen, and stored at −80°C prior to RNA isolation (Chang et al., 1993).

Ribonucleic acid was quantified using a Nanodrop Spectrophotometer ND-100 (NanoDrop Technologies, http://www.nanodrop.com/) and evaluated for purity with a Bioanalyzer 2100 (Agilent, http://www.home.agilent.com/). The Affymetrix GeneChip® Medicago Genome Array (Affymetrix, http://www.affymetrix.com/) was used for expression analysis. The RNA from three independent biological replicates was analyzed for each organ/developmental stage. Probe labeling using 10 μg RNA, array hybridization and scanning were performed according to the manufacturer’s instructions (Affymetrix) for eukaryotic RNA, using a one-cycle protocol for cDNA synthesis.

Data extraction and normalization

For each Affymetrix array hybridized, the resulting .cel file was exported from GeneChip Operating Software Version 1.4 (Affymetrix) and processed with Robust Multiarray Average (RMA; Irizarry et al., 2003) for global normalization. Presence/absence calls for each probe set were obtained using dCHIP (Li and Wong, 2001). Gene selections based on an associative t-test (Dozmorov and Centola, 2003) were made using Matlab (MathWorks, http://www.mathworks.com/). In this method, the background noise present between replicates and the technical noise generated during hybridization are measured by the residual present among a group of genes whose residuals are homoscedastic within the control group. Only genes whose residuals between compared sample pairs were significantly higher than the measured noise level were considered to be differentially expressed. Since the residual was obtained from thousands of genes on the chip, the P-values obtained by this method are corrected towards a large sample size, thus enabling the use of Bonferroni corrections without being overly stringent. The advantage of this methodology is that it takes into consideration technical noise and internal variations between replicates within a sample group and provides a baseline for selecting biologically significant genes. A selection threshold of two for transcript ratios (where applicable) and a Bonferroni-corrected P-value threshold of 1.14 × 10⁻⁶ were used. The Bonferroni-corrected threshold was calculated as 0.05/N, where N is the number of genes in the comparison, which was 43 836 in the experiments reported here. To monitor the false discovery rate of differentially expressed genes, the Q-value of each gene was obtained using EDGE software (Leek et al., 2006; Storey and Tibshirani, 2003).
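The arithmetic behind the significance threshold, and the way it combines with the twofold ratio cut-off, can be checked with a short Python sketch. It does not implement the associative t-test itself; the P-values passed in are assumed to come from that test, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# Values stated in the text: alpha = 0.05 and N = 43 836 genes.
P_THRESHOLD = bonferroni_threshold(0.05, 43836)

def is_differentially_expressed(ratio, p_value, min_fold=2.0, p_threshold=P_THRESHOLD):
    """Apply both selection criteria to one gene.

    ratio is the transcript level in the test organ divided by the level in
    the reference organ; values below 1 are treated as fold decreases.
    """
    fold = max(ratio, 1.0 / ratio)
    return fold >= min_fold and p_value < p_threshold

print(f"{P_THRESHOLD:.3g}")                        # threshold in scientific notation
print(is_differentially_expressed(0.4, 1e-7))      # 2.5-fold decrease, significant
print(is_differentially_expressed(1.5, 1e-9))      # significant, but below twofold
```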

Z scores

The Z score was calculated as follows: Z = (X − Xav)/SD; where X is the log2-transformed mean transcript level for a given gene in a specific organ, Xav is the log2-transformed mean transcript level for that gene in all organs, and SD is the standard deviation of transcript level for that gene across all organs. Mean transcript levels were determined from three biological replicates of each organ.
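The calculation can be sketched in Python for a single gene. The example levels are invented, and two details are assumptions rather than statements from the text: the standard deviation is taken over the log2-transformed values, and the sample estimator is used.

```python
from math import log2
from statistics import mean, stdev

def z_scores(organ_levels):
    """Return organ -> Z score for one gene.

    organ_levels maps organ -> mean transcript level (untransformed) from
    the biological replicates of that organ. Levels must differ between
    organs, otherwise the standard deviation is zero.
    """
    logged = {organ: log2(level) for organ, level in organ_levels.items()}
    values = list(logged.values())
    x_av = mean(values)
    sd = stdev(values)   # assumed: sample SD of the log2 values
    return {organ: (x - x_av) / sd for organ, x in logged.items()}

# Hypothetical mean transcript levels; each step is a fourfold change.
z = z_scores({"root": 100.0, "leaf": 400.0, "nodule": 1600.0})
print({organ: round(score, 2) for organ, score in z.items()})
```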

Hierarchical clustering analysis (HCA)

Hierarchical clustering analysis was conducted with Spotfire DecisionSite 8.1 (Spotfire Inc., http://spotfire.tibco.com/). For clustering analysis of data from different organs, data were log2-transformed and clustered using Pearson correlation analysis (Zar, 1999). For the nodule developmental series, transcript levels were expressed relative to the level in roots just prior to inoculation with rhizobia (Root0) before constructing clusters using the Pearson correlation coefficient.

Legume-specific genes

Legume-specific genes (LSGs) represented on the Affymetrix Medicago GeneChip were identified by a series of in-house BLAST searches that were used to eliminate probe sets representing sequences with homology to any non-legume plant sequence in GenBank. Briefly, starting with all 50 900 sequences upon which the Medicago GeneChip was based, consecutive BLAST searches were used to filter out homologs from O. sativa, A. thaliana, Populus trichocarpa, and Chlamydomonas reinhardtii. Finally, using the reduced list, a search against all remaining non-legume sequences in GenBank NR and EST databases was made using the BLAST Network Client ‘BLASTcl3’ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/). For all these searches both TBLASTX and BLASTN (E-value cutoff of ≤ 1 × 10⁻⁴) were used. Subsets of LSGs (CCPs and GRPs) represented on the Affymetrix Medicago GeneChip were identified based on homology to known family members using a TBLASTX search (match criteria: E-value ≤ 1 × 10⁻⁵). Probe sets corresponding to known PRPs were identified based on perfect or near perfect matches from a BLASTN search employing a complexity filter.
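The consecutive filtering logic can be sketched in Python as a series of set reductions. This illustrates the procedure only and does not run BLAST: the probe set identifiers and E-values are invented, and each search result is assumed to have been parsed already into a mapping from probe set to its best E-value against one non-legume database.

```python
E_CUTOFF = 1e-4  # E-value cutoff stated in the text for TBLASTX and BLASTN

def legume_specific(probe_sets, searches, e_cutoff=E_CUTOFF):
    """Remove probe sets that have a non-legume hit in any search.

    searches is an ordered list of dicts, one per database searched,
    mapping probe set ID -> best E-value. A missing ID means no hit.
    """
    remaining = set(probe_sets)
    for best_evalues in searches:
        # Keep only probe sets without a hit at or below the cutoff.
        remaining = {
            probe for probe in remaining
            if best_evalues.get(probe, float("inf")) > e_cutoff
        }
    return remaining

# Hypothetical identifiers and parsed search results.
probe_sets = ["Mtr.1", "Mtr.2", "Mtr.3", "Mtr.4"]
searches = [
    {"Mtr.1": 1e-50},                 # e.g. hits against rice
    {"Mtr.2": 1e-4, "Mtr.3": 0.5},    # e.g. hits against Arabidopsis
]
print(sorted(legume_specific(probe_sets, searches)))
```

Filtering against the fully sequenced genomes first shrinks the list before the final, more expensive search against GenBank, which is presumably why the searches were run consecutively.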

Correlation analysis of expression profiles for homologous genes in Medicago and Arabidopsis

All M. truncatula consensus sequences (http://www.affymetrix.com/support/technical/byproduct.affx?product=medicago), the sequences from which the probe sets for the corresponding Affymetrix chip were derived, were translated into protein in all six frames. The longest open reading frame (ORF) for each sequence was compared, using BLASTP, against six-frame translations of the consensus sequences of other Affymetrix chips as well as the NCBI non-redundant sequence database ‘nr’. Consensus sequences for the M. truncatula and Arabidopsis chips were accepted as sequence homologs if BLASTP hits connected the sequences with E-values lower than 1 × 10⁻³⁰ and bit-scores greater than 150. Putative phylogenetic orthologs between M. truncatula and A. thaliana were identified using AffyTrees (Frickey et al., 2008). AffyTrees is based on PhyloGenie (Frickey and Lupas, 2004) and provides a repository of neighbor-joining (Saitou and Nei, 1987) trees for Affymetrix consensus sequences in plants.

To compare the gene expression of homologous sequences, A. thaliana microarray data for stems (ATGE_27), petioles (ATGE_19), leaves (ATGE_14), vegetative buds (ATGE_8), flowers (ATGE_39), roots (ATGE_9), and seeds (ATGE_79; Schmid et al., 2005) were compared against the corresponding organs provided by this atlas (stem, petiole, leaf, vegetative bud, flower, root, and seed20d). All expression data were normalized using gcrma (Wu et al., 2004). The Pearson (linear) correlation coefficient of the expression values was calculated for all pairs of sequence homologs and used to quantify the similarity of expression of homologous sequences for the two species. As the number of Medicago consensus sequences for which BLAST and AffyTrees could determine homologs or orthologs differed, we restricted the analysis to the 10 243 sequences for which both methods were able to produce results. Sequence pairs between M. truncatula and A. thaliana with the highest correlation coefficient within the set of sequence homologs, as determined by BLASTP, and the set of putative orthologs, as determined by AffyTrees, were compared.
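The comparison for a single Medicago sequence can be sketched in Python as follows. The sequence identifiers, BLASTP statistics, expression profiles, and ortholog assignment are all invented for illustration. Only the homolog thresholds (E-value below 1 × 10⁻³⁰, bit score above 150) and the use of Pearson correlation across seven organs come from the text.

```python
from math import sqrt

E_MAX = 1e-30      # E-value threshold for accepting a sequence homolog
BITS_MIN = 150.0   # bit-score threshold for accepting a sequence homolog

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def best_correlated(query_profile, candidate_profiles):
    """Return the candidate ID whose profile correlates best with the query."""
    return max(candidate_profiles,
               key=lambda gene: pearson(query_profile, candidate_profiles[gene]))

# Organ order: root, stem, leaf, flower, seed, petiole, vegetative bud.
mt_profile = [9.0, 5.0, 4.0, 4.5, 3.0, 5.0, 4.0]            # hypothetical Medicago gene
blast_hits = {"At_a": (1e-80, 300.0), "At_b": (1e-45, 180.0), "At_c": (1e-10, 60.0)}
at_profiles = {
    "At_a": [3.0, 4.0, 9.0, 5.0, 4.0, 6.0, 5.0],            # leaf-enriched
    "At_b": [8.5, 5.5, 4.0, 4.0, 3.5, 5.0, 4.5],            # root-enriched
    "At_c": [9.0, 5.0, 4.0, 4.5, 3.0, 5.0, 4.0],            # fails the BLASTP thresholds
}
putative_ortholog = "At_a"   # assumed result of the phylogenetic analysis

# Keep only hits that meet both homolog criteria, then rank by expression similarity.
homologs = {gene: at_profiles[gene]
            for gene, (e_value, bits) in blast_hits.items()
            if e_value < E_MAX and bits > BITS_MIN}
top_homolog = best_correlated(mt_profile, homologs)
agrees = top_homolog == putative_ortholog
print(top_homolog, agrees)
```

In this toy case the best-correlating homolog is not the putative ortholog, which is the kind of disagreement counted in Figure 6 and Table S9.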

Our project web site is http://bioinfo.noble.org/gene-atlas/. All gene expression data have been deposited in the ArrayExpress Database (http://www.ebi.ac.uk/miamexpress/) under accession number E-MEXP-1097.


We thank the USDA CSREES-NRI, the Samuel Roberts Noble Foundation, the Max Planck Society, the European Union FP6 Program, and the Australian Research Council Centre for Integrative Legume Research for support of this work.