Establishment of the Lotus japonicus Gene Expression Atlas (LjGEA) and its use to explore legume seed maturation




Lotus japonicus is a model species for legume genomics. To accelerate legume functional genomics, we developed a Lotus japonicus Gene Expression Atlas (LjGEA), which provides a global view of gene expression in all organ systems of this species, including roots, nodules, stems, petioles, leaves, flowers, pods and seeds. Time-series data covering multiple stages of developing pod and seed are included in the LjGEA. In addition, previously published L. japonicus Affymetrix data are included in the database, making it a ‘one-stop shop’ for transcriptome analysis of this species. The LjGEA web server enables flexible, multi-faceted analyses of the transcriptome. Transcript data may be accessed using the Affymetrix probe identification number, DNA sequence, gene name, functional description in natural language, and GO and KEGG annotation terms. Genes may be discovered through co-expression or differential expression analysis. Users may select a subset of experiments and visualize and compare expression profiles of multiple genes simultaneously. Data may be downloaded in a tabular form compatible with common analytical and visualization software. To illustrate the power of LjGEA, we explored the transcriptome of developing seeds. Genes represented by 36 474 probe sets were expressed at some stage during seed development, and almost half of these genes displayed differential expression during development. Among the latter were 624 transcription factor genes, some of which are orthologs of transcription factor genes that are known to regulate seed development in other species, while most are novel and represent attractive targets for reverse genetics approaches to determine their roles in this important organ.


Legumes (Fabaceae) are the second most important plant family for agriculture, after cereals. They account for approximately one-third of world crop production, and are an excellent source of protein, oil, carbohydrates, fiber and minerals for human nutrition, animal feed and industry (Graham and Vance, 2003). Legumes also play a central role in sustainable agriculture by virtue of symbiotic nitrogen fixation, which injects approximately 40 million tonnes of nitrogen into agricultural systems each year (Herridge et al., 2008). As a substitute for nitrogen fertilizer, symbiotic nitrogen fixation by rhizobia in legume root nodules has enormous economic benefits (approximately US$40 billion per annum) and environmental benefits (due to reduced nitrogen pollution).

Legume seeds and shoots are a rich and sustainable source of protein for humans and other animals. Legume pods are grown either until desiccation for production of dry seeds, or directly consumed as fresh immature pods containing seeds. High seed storage protein (SSP) content in legumes is mainly the result of accumulation of two kinds of proteins from the globulin family: vicilins and legumins. Accumulation of SSP occurs during the seed filling phase following embryogenesis, and preceding the desiccation phase during which the seed prepares for dormancy. Although some regulatory genes that control seed maturation have been identified in Arabidopsis and other non-legumes, such as LEC1, LEC2, ABI3 and FUS3 (Santos-Mendoza et al., 2008; Verdier and Thompson, 2008), little is known about regulation of legume seed maturation.

Draft genome sequences of five legume species have now been published: Lotus japonicus (Sato et al., 2008), Glycine max (soybean; Schmutz et al., 2010), Glycine soja (Kim et al., 2010), Cajanus cajan (pigeonpea; Varshney et al., 2011) and Medicago truncatula (Young et al., 2011), and more legume genomes are underway. Two of these, soybean and pigeonpea, are important crop legumes, while L. japonicus (Lotus) and M. truncatula are the premier model systems for legume genetics and genomics (Barker et al., 1990; Udvardi et al., 2005; Young and Udvardi, 2009). Importantly, the seed composition of Lotus is similar to that of soybean, with approximately 7% of dry weight as lipids, very low starch content (approximately 0.6%), and a protein content of approximately 43% of dry weight (Dam et al., 2009).

Knowledge gained from the model legumes is expected to lead to a better understanding of the molecular basis for important traits and eventually to improvements in crop legume species via translational genomics. Key resources for functional genomics are populations of mutants that may be used to uncover the function of individual genes in the genome. A variety of mutant populations have been developed for the model legumes, including chemically induced mutants that mostly contain point mutations (Perry et al., 2003; Le Signor et al., 2009), deletion mutants generated using fast neutrons (Hoffmann et al., 2007), and transposon insertion mutants (Tadege et al., 2008). These mutants have been instrumental in advancing our understanding of symbiotic nitrogen fixation (Murray et al., 2007, 2011; Pislariu et al., 2012) and other important legume processes, including legume seed development and metabolism (Pang et al., 2008; Zhao and Dixon, 2009; Verdier et al., 2012).

Other key resources for functional genomics are analytical databases of genome-wide transcriptome data. Changes in the transcriptional activity of genes during development and in response to environmental challenges provide clues about the function of genes. Transcriptome databases or Gene Expression Atlases have been developed for M. truncatula (Benedito et al., 2008) and soybean (Libault et al., 2010; Severin et al., 2010). For example, the M. truncatula Gene Expression Atlas (MtGEA) provides access to published transcriptome data from essentially all organs at various stages of plant development, and data on plant responses to biotic and abiotic stresses (Benedito et al., 2008; He et al., 2009). The MtGEA has been heavily utilized by the legume research community (Wang et al., 2010; Verdier et al., 2012). Although many individual transcriptome studies have been published for L. japonicus (Colebatch et al., 2002, 2004; Kouchi et al., 2004; Deguchi et al., 2007; Sanchez et al., 2008, 2011; Guether et al., 2009; Høgslund et al., 2009), no attempt has been made to merge these data into a Gene Expression Atlas for this species. To remedy this situation, we have developed the L. japonicus Gene Expression Atlas (LjGEA), using the same database structure and analysis functions as for the MtGEA (Benedito et al., 2008; He et al., 2009) and data generated using the Affymetrix Lotus GeneChip® (Sanchez et al., 2008). We have extended the LjGEA with new transcriptome data from all major organs, including a detailed developmental time series during pod and seed development. In addition, we imported publicly available Affymetrix data for L. japonicus, such as data on plant responses to salinity stress, and gene expression during symbiosis with rhizobia or mycorrhizal fungi. The database is expandable and will accommodate other types of transcript data, such as RNA-seq data.
In addition to describing the previously unpublished data contained in the LjGEA, we illustrate the database's utility by using it to decipher a complex and crucial developmental process, namely seed maturation.

Results and Discussion

Transcript profiling of various plant organs

To provide a foundation for the LjGEA, we measured transcript levels in each of the major organ systems of L. japonicus (Myakojima line MG–20), including roots, nodules, stems, petioles, leaves, open flowers, seeds and seed pods of plants grown under optimal controlled conditions. Intact seed pods and isolated seeds were assayed at multiple stages of development. Transcriptomic data were obtained for three biological replicates in all cases using the Affymetrix Lotus japonicus GeneChip® that contains 53 751 probe sets representing the majority of genes in this species.

A presence/absence call analysis was performed for each probe set to remove background noise using dCHIP (Li and Wong, 2001) (Table S1). The proportion of the genome expressed at detectable levels was similar for the various organs, and ranged from 50.8% of all genes in seeds at 20 days after pollination (DAP) to 60.8% in roots at 0 h. Similar results were obtained in other species (Schmid et al., 2005; Benedito et al., 2008). Under the growth conditions used in our experiments, 19.7% of genes were not expressed at detectable levels in any of the organs tested, which is similar to the value of 13.9% for undetected transcripts in the Medicago Gene Expression Atlas (Benedito et al., 2008). Thus, 80.3% of all genes were expressed in one or more organs. The majority of genes in Lotus were subject to transcriptional or post-transcriptional regulation that altered steady-state transcript levels during development, with a mean coefficient of variation (CV, standard deviation/mean) of 40.7% and a CV range from 1.8 to 384.4%. Approximately 53.4% of genes exhibited a difference in transcript levels between organs of at least twofold.
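The two summary statistics used above, the CV and the maximum fold difference between organs, can be illustrated with a minimal sketch. The probe names and signal values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Return the CV as a percentage: standard deviation / mean * 100.

    Assumption: population standard deviation (pstdev); the source does
    not say whether the sample or population estimator was used.
    """
    return pstdev(values) / mean(values) * 100.0

def max_fold_change(values):
    """Ratio of the highest to the lowest transcript level across organs."""
    return max(values) / min(values)

# Hypothetical probe sets with mean signal in four organs (illustrative only).
profiles = {
    "probe_A": [100.0, 100.0, 100.0, 100.0],  # perfectly stable
    "probe_B": [50.0, 100.0, 150.0, 200.0],   # developmentally regulated
}

for probe, levels in profiles.items():
    cv = coefficient_of_variation(levels)
    fold = max_fold_change(levels)
    flag = "at least twofold" if fold >= 2.0 else "less than twofold"
    print(f"{probe}: CV = {cv:.1f}%, max/min = {fold:.1f} ({flag})")
```

Applied to every probe set, the same two functions would yield the distribution of CV values and the fraction of genes that differ at least twofold between organs.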

Hierarchical clustering analysis (HCA) of global transcript data, based on Pearson correlations between tissues and developmental stages, yielded three main groups of organs: underground organs (i.e. roots and nodules), aerial tissues (i.e. flowers, petioles, stems and leaves), and seeds and pods (Figure 1). Differences between organs of the same group were also apparent. More detailed analysis of these data, for example via HCA of transcript data for individual genes throughout development, may identify subsets of genes that control the development and differentiation of specific organs (see below).
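The idea of grouping samples by the similarity of their global transcript profiles can be sketched as follows. This is not the MeV procedure used for Figure 1: the distance (1 minus Pearson r), the average-linkage rule and all signal values are assumptions chosen for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(samples, n_groups):
    """Merge the two closest clusters until n_groups remain.

    Distance between clusters is the mean of (1 - Pearson r) over all
    pairs of members (average linkage, an assumed choice).
    """
    clusters = [[name] for name in samples]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1.0 - pearson(samples[a], samples[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_groups:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical log-scale signals for five probe sets in six samples.
samples = {
    "root":   [9.0, 8.5, 2.0, 2.5, 5.0],
    "nodule": [8.5, 9.0, 2.5, 2.0, 5.5],
    "leaf":   [2.0, 2.5, 9.0, 8.5, 5.0],
    "stem":   [2.5, 2.0, 8.5, 9.0, 5.5],
    "seed":   [5.0, 5.5, 5.0, 5.5, 9.5],
    "pod":    [5.5, 5.0, 5.5, 5.0, 9.0],
}
groups = average_linkage(samples, 3)
print(groups)
```

With these toy values the underground, aerial and reproductive samples separate into three groups, mirroring the structure described in the text.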

Figure 1.

Comparison of transcriptomes of various organs. Pairwise correlation coefficients were calculated to generate the heatmap. The color scale indicates the degree of correlation. Samples were clustered on the basis of Euclidean distance using MultiExperiment Viewer (MeV).

Lotus japonicus Gene Expression Atlas web server

To provide open access to the data presented here, and other published Affymetrix data from L. japonicus, together with a suite of analysis tools to explore the data, we generated a web-accessible database called the Lotus japonicus Gene Expression Atlas (LjGEA). The LjGEA utilizes the architecture and tools of the Medicago truncatula Gene Expression Atlas (MtGEA) server (He et al., 2009). Data in the LjGEA may be explored using a set of flexible user-friendly tools to: (i) find genes/transcripts using their Affymetrix probe set ID, their gene ID according to the Lotus japonicus genome database (LjGDB), their gene annotation (see 'Experimental Procedures'), the gene description or their DNA sequence using BLAST; (ii) visualize and compare transcript profiles; (iii) perform co- or differential expression analyses for specific genes; (iv) help researchers to formulate hypotheses about gene function based on gene annotations and expression profiles; and (v) download results of analyses in a tabular or a graphical form. By enabling identification of putative orthologs of Lotus genes in Medicago and soybean, the LjGEA also provides a useful tool for comparative genomics.

The current version of LjGEA contains expression data from 83 treatments of eight organs resulting from hybridization of 237 Affymetrix GeneChips. This includes data from all major organ systems described above, salt acclimatization experiments for L. japonicus and related Lotus species (Sanchez et al., 2008, 2011), analysis of a plastidic glutamine synthetase mutant (Díaz et al., 2010), analysis of arbuscular mycorrhizal symbiosis (Guether et al., 2009), and a detailed dissection of nodulation in mutant and wild-type plants (Høgslund et al., 2009). A ‘microarray sample selection’ tool is available that enables users to specify which dataset/chip sample to analyze. The Lotus web server will be updated on a regular basis with the latest gene annotations and new gene expression data. Given the rapid development of high-throughput, so-called next-generation sequencing technologies, we plan to integrate RNA-seq and classical microarray data into the LjGEA, using recently developed normalization strategies (Battke and Nieselt, 2011), as new data become available.

Legume-specific genes

Previous analysis of the Lotus genome identified 1190 putative legume-specific genes (Sato et al., 2008) that match 729 probe sets present on the chip (Table S2). To obtain insight into the possible roles of these genes, we performed HCA of legume-specific gene transcript levels in the major organ systems of Lotus (Figure 2a). Intriguingly, some of the legume-specific genes showed organ-specific patterns of gene expression, including 10 nodule-specific genes, five root susceptible zone-specific genes, two leaf-specific genes, 12 flower-specific genes, three pod-specific genes and 12 seed-specific genes (Table S3). Unfortunately, most of these 44 genes have unknown function. Nonetheless, knowledge of the organ specificity of these genes should help to guide genetic studies to link genotype to phenotype in the future.

Figure 2.

HCA of genes annotated as legume-specific genes (a), transporters (b) and transcription factors (c). Clustering was based on Pearson correlation using MultiExperiment Viewer version 4 (MeV4). The heatmap displays normalized values for each gene transcript for each sample.

Stably expressed ‘reference’ genes for transcript normalization

Genes with stable expression (transcript) levels throughout development and/or during environmental fluctuations are useful for transcript normalization prior to comparative gene expression analysis (Czechowski et al., 2005). To identify a core set of such reference genes, we performed coefficient of variation (CV) analysis of data from a total of 24 experiments and 72 Affymetrix GeneChips, covering nodules (four developmental stages), roots (eight stages; uninoculated and inoculated with rhizobia at various stages; whole root, tip and nodulation-susceptible zones), stems, petioles, leaves, flowers, pods (three stages) and seeds (five stages). The CV represents a normalized measure of dispersion, and ranged from 5.4% for the most stably expressed genes to 481% for the most differentially expressed genes. Using the selection criteria of a CV < 20% and less than a twofold difference between the highest and lowest levels of a gene transcript in any organs, we identified 71 reference genes that may be suitable for normalizing expression of genes expressed at low (14 genes with transcript levels <100 Affymetrix units), medium (24 genes with transcript levels from 100 to <1000), and high levels (33 genes with transcript levels >1000 Affymetrix units; Table S4). Interestingly, these genes included homologs of several that have been used as reference genes in other species, including GLYCERALDEHYDE-3-PHOSPHATE-DEHYDROGENASE (Ljwgs_092567.1_s_at and TC14070-3_at) and UBIQUITIN (TC14054-M_s_at, TC14054-3_at and TC14054-5_at) (Czechowski et al., 2005; Benedito et al., 2008).
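The selection of reference genes can be sketched as a simple filter. The sketch assumes that stability means a CV below 20% together with a highest-to-lowest ratio below twofold, uses the population standard deviation, and places a mean of exactly 1000 units in the high class; the profiles are hypothetical.

```python
from statistics import mean, pstdev

def classify_reference_gene(levels, max_cv=20.0, max_fold=2.0):
    """Return 'low', 'medium' or 'high' for a stable profile, else None.

    Assumed criteria: CV (%) below max_cv and max/min ratio below max_fold.
    The class reflects the mean transcript level in Affymetrix units.
    """
    avg = mean(levels)
    cv = pstdev(levels) / avg * 100.0
    fold = max(levels) / min(levels)
    if cv >= max_cv or fold >= max_fold:
        return None
    if avg < 100:
        return "low"
    if avg < 1000:
        return "medium"
    return "high"

# Hypothetical profiles across four experiments (illustrative values only).
candidates = {
    "stable_low": [80.0, 90.0, 85.0, 95.0],
    "stable_high": [2000.0, 2200.0, 2100.0, 1900.0],
    "variable": [100.0, 400.0, 250.0, 900.0],
}
calls = {name: classify_reference_gene(levels) for name, levels in candidates.items()}
print(calls)
```

Binning the stable genes by expression level matters in practice because a reference gene works best when its abundance is comparable to that of the gene being normalized.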

These genes represent a valuable catalogue of potential reference genes that may be used for normalizing gene expression data derived from quantitative RT–PCR or probe-hybridization approaches. From a functional point of view, most of these genes are involved in basic cell functions and maintenance, based on our annotations (Table S4).

Organ-specific genes

Genes that are expressed specifically or preferentially in a particular organ may provide insight into specialized processes at work in these organs, including biochemical, physiological, developmental and other processes. To identify organ-specific genes, we calculated Z–scores for each probe set using the same representative subset of 24 experiments described above. The Z–score is an indicator of tissue specificity; a high Z–score indicates a high specificity for a particular organ. Using a minimum Z–score of 2.85 and a minimum normalized expression value >100 as threshold values, we identified 2949 genes/probe sets that are specific to or preferentially expressed in a single organ (Figure 3 and Table S3). HCA of gene transcript levels in various organs revealed that approximately 39% were nodule-specific, 21% were flower-specific and 16% were seed-specific, with fewer genes specific to other organs.
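A minimal sketch of the Z-score filter follows. It assumes the Z-score of a sample is its signal minus the mean over all samples, divided by the sample standard deviation; the exact formula used for the LjGEA is not given in the text, and the 12-sample profile is hypothetical.

```python
from statistics import mean, stdev

def organ_specific_call(profile, min_z=2.85, min_expr=100.0):
    """Return (sample, z) for the top sample if it passes both thresholds.

    profile maps sample name -> normalized expression value.
    Assumption: z = (value - mean) / sample standard deviation.
    """
    values = list(profile.values())
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return None  # flat profile, no specificity
    top = max(profile, key=profile.get)
    z = (profile[top] - mu) / sd
    if z >= min_z and profile[top] > min_expr:
        return top, round(z, 2)
    return None

# Hypothetical probe set: background signal in 11 samples, strong in nodules.
nodule_like = {f"sample_{i}": 50.0 for i in range(11)}
nodule_like["nodule"] = 1200.0

# Hypothetical probe set expressed at the same level everywhere.
flat = {f"sample_{i}": 500.0 for i in range(12)}

nodule_call = organ_specific_call(nodule_like)
flat_call = organ_specific_call(flat)
print(nodule_call, flat_call)
```

Note that the largest Z-score a single sample can reach depends on the number of samples, so a threshold such as 2.85 is only attainable when enough experiments are included, as with the 24 experiments used here.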

Figure 3.

Heatmap of organ-specific genes. The color scale indicates the Z–score associated with all probe sets preferentially expressed in the various organs.

To make the data more useful to the symbiosis research community, the known symbiosis genes have been annotated in the LjGEA database and are listed in Table S5. The LjGEA and MtGEA databases will enable interesting comparisons of gene expression between the determinate nodules of L. japonicus and the indeterminate nodules of M. truncatula. For example, Lotus nodules have a transient infection stage during which infection threads ramify throughout the nodule, which lasts just a few days, whereas Medicago nodules retain an infection zone in which infection threads are constantly infecting new cells produced from the nodule meristem. Using the two databases, a comparison of orthologous pairs for three genes involved in rhizobial infection [VAPYRIN (Murray et al., 2011), PUB1 (Mbengue et al., 2010) and NODULE PECTATE LYASE (NPL; Xie et al., 2012)] revealed that they were highly expressed early during the infection phase in both species, and repressed in mature nodules of Lotus but not of Medicago, consistent with their roles in infection (Figure S1). In this manner, the LjGEA and MtGEA databases will allow researchers to explore the differences between determinate and indeterminate nodule types.

The list of organ-specific genes is potentially useful for targeted gene expression. Promoters of organ- or tissue-specific genes may be used to direct expression of transgenes in specific locations or at certain developmental stages, which may be essential to avoid the negative effects associated with constitutive expression of some genes.

We have a special interest in legume transcription factors (TFs) and transporters (Udvardi et al., 2007; Benedito et al., 2010), and found that a large proportion of TF and transporter genes represented by probe sets on the Affymetrix chip were developmentally regulated, with many exhibiting organ-specific expression (Figure 2b,c and Table S2). Presumably, this reflects specialization of these genes for distinct developmental and/or metabolic tasks. Organ-specific TF genes are particularly interesting targets for future work aimed at understanding the regulation of organ development and differentiation. For instance, we recently showed that a seed coat-specific MYB–TF gene is a regulator of proanthocyanidin (tannin) biosynthesis in Medicago (Verdier et al., 2012).

Characterization of Lotus seed and pod development

Seed development is a complex process that requires coordinated expression of many genes in different tissues. To obtain insight into the transcriptional network underlying seed development, particularly storage compound metabolism in Lotus seeds, we performed a detailed time-series study of the seed maturation phase.

Lotus japonicus seeds and pods were harvested and weighed every 2 days from 10 to 25 DAP to determine the timing of the major developmental phases (Figure 4). Seed development may be divided into three major phases coinciding with changes in seed fresh weight: embryogenesis, which involves little change in seed weight; seed filling (or early maturation), accompanied by a marked increase in weight; and desiccation (or late maturation), during which seed weight declines (Bewley and Black, 1994). Under our plant growth conditions, seed fresh weight increased slightly between 10 and 13 DAP, consistent with this period being the end of embryogenesis. Between 14 and 20 DAP, seed fresh weight increased rapidly, indicating the filling phase. Seeds reached a maximum fresh weight of approximately 3.5 mg at 20 DAP before decreasing in mass as they dried during late maturation, which ended with seed maturity/dormancy and pod dehiscence at 30 DAP.

Figure 4.

Timeline of Lotus seed and pod development from late embryogenesis to desiccation. Fresh weight is indicated in mg. The dashed line corresponds to pod weight and the solid line corresponds to seed weight. Putative developmental seed stages are indicated. Boxes on the x axis correspond to stages chosen for the transcriptional profiling study of seed development.

Intact pods containing seeds were harvested at 10, 14 and 20 DAP to complement our seed development study. We observed a high correlation between the increase in pod fresh weight over time and seed fresh weight (r² = 0.88). The mean number of seeds per pod was 24.6 ± 2.1 (SD, n = 37), which accounted for approximately half the fresh weight of intact pods and hence the strong correlation between whole pod and seed weights (Figure 4). Similar correlations in other species indicate that pod weight may be used as a convenient measure of seed yield (Diepenbrock, 2000).

Transcriptional dynamics of seed development

To reveal molecular events underlying seed maturation, we analyzed the dynamics of the Lotus transcriptome during this developmental process. Harvest days for transcriptome analysis of isolated seeds were chosen based on our analysis of seed weights during development (Figure 4). Seeds were harvested at 10 and 12 DAP to describe late embryogenesis, at 14 and 16 DAP for early and mid seed-filling, and at 20 DAP to capture the transition to desiccation and maturation.

Approximately 68% of genes, corresponding to 36 474 probe sets, were expressed in at least one stage of seed development. Approximately 45% of these genes (16 272 probe sets, or about 30% of all probe sets on the chip) were differentially expressed (CV > 15%) during seed development (Table S6). These included 190 legume-specific genes, 624 TFs and 293 transporters. These also included SSP genes such as VICILIN (e.g. Ljwgs_083584.1_s_at), LEGUMIN (e.g. Ljwgs_083584.1_s_at) and CONVICILIN genes (e.g. TM1755.26_s_at), which accounted for seven of the ten most highly expressed genes in seeds.

All differentially expressed seed genes were sorted into clusters, using a K–means algorithm and correlations between expression profiles (Figure 5). A figure of merit (FOM) calculation (Yeung et al., 2001) was then performed to determine the optimal number of clusters to classify gene expression profiles, which resulted in four clusters (Figure 5a). Cluster I consisted of 3988 genes/probe sets with maximal expression at 10 DAP during late embryogenesis; cluster II consisted of 1248 genes with peak activity at 16 DAP during seed filling; cluster III consisted of 6119 genes that were induced at 20 DAP at the onset of seed desiccation; cluster IV represented 4917 genes that were repressed at 20 DAP.
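The clustering step can be sketched with a small K-means routine. This is an illustration rather than the procedure used for Figure 5: profiles are standardized so that Euclidean distance tracks Pearson correlation, centroids are seeded by a deterministic farthest-first rule (an assumed choice), the number of clusters is supplied directly instead of being derived from a figure of merit, and all profiles are hypothetical.

```python
from statistics import mean, pstdev

def standardize(profile):
    """Scale a profile to mean 0 and unit variance (profile must not be flat)."""
    m, s = mean(profile), pstdev(profile)
    return [(v - m) / s for v in profile]

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, n_iter=50):
    """Return {gene: cluster index} after K-means on standardized profiles."""
    names = list(profiles)
    data = {n: standardize(profiles[n]) for n in names}
    # Farthest-first seeding: start from the first gene, then repeatedly
    # add the gene that is farthest from all centroids chosen so far.
    centroids = [data[names[0]]]
    while len(centroids) < k:
        far = max(names, key=lambda n: min(sq_dist(data[n], c) for c in centroids))
        centroids.append(data[far])
    labels = {}
    for _ in range(n_iter):
        new_labels = {
            n: min(range(k), key=lambda i: sq_dist(data[n], centroids[i]))
            for n in names
        }
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for i in range(k):
            members = [data[n] for n in names if labels[n] == i]
            if members:
                centroids[i] = [mean(col) for col in zip(*members)]
    return labels

# Hypothetical signals at 10, 12, 14, 16 and 20 DAP.
profiles = {
    "g1": [900, 700, 400, 200, 100], "g2": [800, 650, 380, 220, 120],  # early
    "g3": [100, 200, 500, 900, 300], "g4": [120, 250, 520, 880, 350],  # peak at 16 DAP
    "g5": [100, 110, 120, 150, 900], "g6": [90, 100, 130, 160, 850],   # induced at 20 DAP
    "g7": [800, 820, 790, 810, 100], "g8": [780, 800, 810, 790, 120],  # repressed at 20 DAP
}
labels = kmeans(profiles, 4)
print(labels)
```

In a full analysis the routine would be repeated for a range of k values, and a criterion such as the figure of merit would indicate where adding clusters stops improving the fit.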

Figure 5.

Expression profiles of the genes that were differentially regulated during Lotus seed development. (a) K–means clustering analysis of these genes resulted in four major clusters. Geometric means (or centroids) are indicated, with the total number of probe sets per cluster. (b) PageMan analysis of differentially expressed genes in each of the clusters. Genes present in each cluster were subjected to over-representation analysis with respect to KEGG functional classes using Fisher's exact test with Bonferroni correction. Red indicates enrichment in a functional class, and blue indicates depletion in a specific functional class.

Over-representation analysis using PageMan software (Usadel et al., 2006) was used to identify molecular, biochemical and cellular processes that are characteristic of the various stages of seed development (Figure 5b). Cluster I (maximal at 10 DAP) showed over-representation of genes involved in ligand–receptor interactions and the cell cycle (Figure 5b), consistent with the known cell division activity during embryogenesis (Dam et al., 2009). Cluster II (maximal at 16 DAP) showed over-representation of genes involved in starch and sucrose metabolism (Figure 5b). Consistent with the transcriptomic data indicating enhanced sugar metabolism during seed filling, we observed elevated levels of many sugars (galactose, mannose, fructose, glucose, xylose and pinitol) at 14 DAP compared to 10 DAP (embryogenesis) and 20 DAP (transition to desiccation phase; Figure 6a and Table S7). With respect to starch metabolism, it is known that starch accumulates transiently in seeds of Lotus and other legumes before being utilized during production of SSPs and storage lipids (Djemel et al., 2005; Stevenson et al., 2006; Dam et al., 2009). Cluster III (maximal at 20 DAP) showed over-representation of genes involved in pyruvate, fatty acid and amino acid metabolism, and of genes for genetic information processing (ribosome and DNA replication and repair; Figure 5b). Twenty days after pollination represents the tail end of the seed filling phase (Figure 4), during which SSPs and storage lipids are produced en masse. Mature Lotus seed consists of 7% lipids, with mainly C18:2 fatty acids, and approximately 43% protein, mainly legumins (Dam et al., 2009). Therefore, it is not surprising to find an over-representation of transcripts of genes involved in pyruvate, amino acid and lipid metabolism at this stage. Pyruvate is a precursor for both amino acid and fatty acid biosynthesis. 
The over-representation of ribosomal transcripts presumably reflects a greater capacity for protein synthesis during seed filling. Consistent with this, we observed a decrease in the level of several free amino acids between 14 and 20 DAP (Figure 6a). Cluster III contained most of the SSP genes, as well as oleosin genes involved in oil body formation, consistent with the above results (Table S6). This cluster also included genes for late embryogenesis abundant and seed maturation proteins, e.g. PM34. Thus, 20 DAP appeared to mark a transition period, with high expression of genes involved in the production of storage compounds (proteins and lipids) and of genes for stress-related proteins that are presumably required to maintain embryo viability during the subsequent desiccation phase.

Figure 6.

Sugar and amino acid content during seed and pod development. (a) HCA of row-normalized sugar and amino acid content during seed development. (b) Heatmap of the ratio between pod and seed content for the same sugars and amino acids. Ratios close to zero indicate a low level of metabolites in pod walls compared to seeds, whereas ratios close to two indicate higher levels in pod walls than in seeds.

Finally, cluster IV (minimal at 20 DAP), which consisted of genes that were repressed at the onset of desiccation and seed maturation, showed over-representation of genes involved in protein folding, sorting and degradation, and ATP synthesis (Figure 5b). The decreased activity of these genes prior to desiccation may precede the decreased demand for energy metabolism and protein processing during seed desiccation.
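The over-representation test applied to each cluster can be illustrated with a minimal sketch of a one-sided Fisher's exact test followed by Bonferroni correction. It is not the PageMan implementation: only enrichment (not depletion) is tested, and the gene counts and class names are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, cluster_size, category_size, total):
    """One-sided Fisher's exact test for enrichment.

    Probability of drawing at least k genes of a functional class
    (category_size genes out of total) in a cluster of cluster_size genes,
    i.e. the upper tail of the hypergeometric distribution.
    """
    upper = min(cluster_size, category_size)
    denom = comb(total, cluster_size)
    tail = sum(
        comb(category_size, i) * comb(total - category_size, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return tail / denom

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

# Hypothetical counts: 1000 differentially expressed genes, one cluster of 100.
total, cluster_size = 1000, 100
observed = {
    "cell cycle": (20, 50),         # 20 of the 50 class members fall in the cluster
    "ATP synthesis": (20, 200),     # 20 of 200, as expected by chance
}
raw = {
    name: fisher_enrichment_p(k, cluster_size, size, total)
    for name, (k, size) in observed.items()
}
adjusted = bonferroni(raw)
for name, p in adjusted.items():
    print(f"{name}: adjusted p = {p:.3g}")
```

The Bonferroni step is conservative, which suits a setting where many functional classes are tested against each of several clusters.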

Transcriptional dynamics of pod development

Pods consist largely of pod walls that feed and protect developing seeds. Although we did not collect data from isolated pod walls (minus seed), we were able to identify pod-enhanced transcripts by comparing transcriptome data of whole pods (including seeds) with data from isolated seeds. Pod/seed transcript ratios close to 1 indicate a similar level of gene expression in pod walls and seeds, while ratios substantially <1 indicate lower transcript levels in pod walls than in seeds. We selected genes with pod/seed transcript ratios >10 as those showing enhanced expression in pods. By combining genes identified in this way with the 114 genes identified as pod-specific (Table S3), we identified 558 pod wall-enhanced genes (Table S8). From a biotechnological perspective, pod walls are potential targets for genetic engineering in order to enhance their protective role by ectopic expression of genes encoding proteins that repel pests or pathogens. The promoter sequences of pod-specific genes may be used to target gene expression to this specific tissue, without affecting seed quality.

We organized pod wall-enhanced genes into three clusters according to their maximum transcript levels at 10, 14 and 20 DAP. The first cluster consisted of 180 genes with maximal expression at late embryogenesis (10 DAP), the second cluster contained 122 genes with maximal expression during seed filling (14 DAP), and the third cluster comprised 256 genes with maximal expression at the onset of seed desiccation/maturation (20 DAP). A Fisher exact test did not reveal any over-representation of cellular processes encoded by genes in the three clusters. However, several genes encoding late embryogenesis abundant proteins (nine probe sets), maturation proteins (11 probe sets) and SSPs (12 probe sets) were present in the third cluster (Table S8). Late embryogenesis abundant proteins, maturation proteins and SSPs are usually associated with seeds: SSPs as storage proteins, and maturation proteins and late embryogenesis abundant proteins for desiccation tolerance. However, consistent with our transcriptome data, a proteomic study recently found some SSP, late embryogenesis abundant and maturation proteins in the pod walls of Lotus (Nautrup-Pedersen et al., 2010).
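The ratio filter and the stage assignment can be sketched as follows. The text does not state at which stage the pod/seed ratio was evaluated, so the sketch assumes a gene qualifies if the ratio exceeds 10 at any of the three shared stages; gene names and signal values are hypothetical.

```python
STAGES = (10, 14, 20)  # DAP at which both whole pods and seeds were profiled

def pod_seed_ratios(pod, seed):
    """Pod/seed transcript ratio at each shared stage."""
    return {dap: pod[dap] / seed[dap] for dap in STAGES}

def classify_pod_gene(pod, seed, min_ratio=10.0):
    """Return the stage of maximal pod expression for a pod-enhanced gene.

    Assumption: a gene is pod wall-enhanced if its pod/seed ratio exceeds
    min_ratio at one or more stages. Returns None otherwise.
    """
    ratios = pod_seed_ratios(pod, seed)
    if max(ratios.values()) <= min_ratio:
        return None
    return max(STAGES, key=lambda dap: pod[dap])

# Hypothetical transcript levels keyed by DAP.
genes = {
    "pod_gene": ({10: 500.0, 14: 300.0, 20: 100.0}, {10: 20.0, 14: 25.0, 20: 30.0}),
    "shared_gene": ({10: 100.0, 14: 120.0, 20: 900.0}, {10: 90.0, 14: 100.0, 20: 800.0}),
}
stage_calls = {name: classify_pod_gene(pod, seed) for name, (pod, seed) in genes.items()}
print(stage_calls)
```

Because whole pods contain seeds, a high ratio is a conservative indicator: transcripts abundant in seeds dilute the apparent pod wall signal rather than inflate it.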

Past work indicates that the pod may coordinate grain filling by regulating the partitioning of nutrients from plant to seeds. In pea, for instance, 20% of the nitrogen accumulated in seeds was remobilized from adjacent pods (Schiltz et al., 2005), and in vitro studies revealed that detached pods were able to provide 60% of the seed assimilates to the growing embryo (Diepenbrock, 2000). The increase in pod weight preceding and during the increase of seed weight (Figure 4), and the accumulation of some sugars (e.g. sucrose, fructose, mannose and galactose), amino acids (e.g. phenylalanine and isoleucine; Figure 6b) and transcripts encoding SSPs in the pod, presumably reflect its role as a temporary storage tissue in Lotus. Pod wall-enhanced genes identified in this study represent potential targets for manipulation of resource partitioning and assimilation into seeds.

Transcriptional regulation of SSP genes in seeds

Regulation of gene expression during development is orchestrated, in large part, by transcription factors. We mapped annotated Lotus TFs to 1489 probe sets on the Affymetrix chip (Table S2). Using the ‘query’ tools of the LjGEA, we identified 13 putative SSP genes with seed-specific expression profiles. Sequence homology indicated that these genes encoded five legumins, four vicilins, two glycinins, one convicilin and one globulin-like protein (Table S9). The developmental expression profiles of these putative SSPs exhibited a common pattern, with steady increases of transcripts in seeds from 14 to 20 DAP. A geometric mean of these expression profiles was used as a query template to identify co-expressed genes, using the co-expression analysis tool of the LjGEA. A total of 194 genes were found with expression profiles similar to the mean of the SSPs, using a Pearson correlation coefficient threshold of >0.9 across all studied tissues (Figure 7 and Table S9). These included the original 13 SSP genes, as well as genes encoding oleosins (e.g. chr5.CM0909.72_at and chr2.TM0263.6_at), seed maturation proteins (chr1.CM0544.46_at and Ljwgs_035512.1_s_at) and a late embryogenesis abundant protein (Ljwgs_145133.1_at). Interestingly, 16 of the co-expressed genes were putative TF genes, encoding bZIP (chr5.CM0077.60.1_at, chr1.CM0010.32_at), ABI3/VP1 (chr1.CM0147.20.1_at) and AP2 (e.g. chr2.CM0002.11_at) TFs, amongst others (Table S9). Some of these genes encode proteins that are closely related to functionally characterized TFs, such as ABI3 (chr1.CM0147.20.1_at), ABI5 (chr1.CM0010.32_at) and ABI4 (chr2.CM0002.11_at), which interact to promote SSP synthesis in Arabidopsis (Nakamura et al., 2001; Kagaya et al., 2005; Reeves et al., 2011). Some of the cis-regulatory elements present in SSP promoters have been reported for Arabidopsis. 
Interactions between these regulatory boxes and specific TFs have been proven, such as the interaction between bZIP proteins and the G–box (Chern et al., 1996; Lara et al., 2003; Bensmihen et al., 2005), while others remain hypothetical, such as the MYB–AACA box interaction (Vicente-Carbajosa and Carbonero, 2005). Lotus homologs of several TFs regarded as potential interactors with SSP cis-regulatory elements were identified in our analysis (Figure 7 and Table S9). Thus, the set of 16 TFs co-expressed with SSP genes probably include several regulators of seed storage protein biosynthesis in Lotus. It will be interesting to test the function of these and other seed-specific TFs in the future, using the growing number of LORE1 insertion mutants that are being developed in this species (Fukai et al., 2012).
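The template-based co-expression query described above can be illustrated with a minimal sketch. It is not the LjGEA implementation: the function names, probe identifiers and expression values below are hypothetical, and only the two steps named in the text are reproduced (a geometric mean of the SSP profiles as template, then a Pearson correlation threshold of >0.9).

```python
import math

def geometric_mean_profile(profiles):
    """Geometric mean across genes for each tissue (all values must be positive)."""
    n = len(profiles)
    return [math.exp(sum(math.log(p[i]) for p in profiles) / n)
            for i in range(len(profiles[0]))]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpressed(template, expression, threshold=0.9):
    """Probe sets whose profile correlates with the template above the threshold."""
    return sorted(pid for pid, prof in expression.items()
                  if pearson(template, prof) > threshold)

# Hypothetical toy data for four samples: root, leaf, seed 14 DAP, seed 20 DAP.
ssp_profiles = [[10.0, 12.0, 400.0, 1600.0], [8.0, 9.0, 250.0, 900.0]]
expression = {
    "probe_A": [5.0, 6.0, 200.0, 820.0],    # seed-enhanced, SSP-like
    "probe_B": [900.0, 850.0, 20.0, 10.0],  # vegetative
    "probe_C": [50.0, 52.0, 49.0, 51.0],    # roughly constitutive
}

template = geometric_mean_profile(ssp_profiles)
hits = coexpressed(template, expression)
print(hits)  # ['probe_A']
```

In this toy example only the seed-enhanced probe set passes the threshold; a real query would run over all probe sets and all tissues in the atlas.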

Figure 7.

Genes co-expressed with Lotus SSPs. (a) Clustering visualization of the 194 genes co-expressed with SSP genes in various plant tissues. (b) Normalized expression profiles of the 16 TFs co-expressed with Lotus SSP genes in various plant tissues. All probe sets, annotations and correlation values are indicated in Table S9.

Experimental Procedures

Plant material and growth conditions

Lotus japonicus cv. Miyakojima line MG–20 seeds were scarified using concentrated sulfuric acid, rinsed, sterilized with 2% sodium hypochlorite, and incubated at 24°C for 3 days on agar plates (1% w/v) placed vertically in the dark. Young seedlings were transplanted to pots containing Turface MVP calcined (illite) clay (BWI, Texarkansas, Nash, TX, USA) and placed in a growth chamber under controlled conditions: a 16 h light/8 h dark regime, 200 μE m−2 sec−1 light irradiance, 21°C and 40% relative humidity. Plants were fertilized daily with half-strength B&D solution (Broughton and Dilworth, 1971) containing 2 mM KNO3 and 2 mM NH4NO3. Vegetative organs (roots, stems, petioles and leaves from uninoculated plants; nodules from inoculated plants) were harvested from multiple plants 28 days after planting and pooled to produce individual biological replicates in a completely randomized design. All experiments were performed with three biological replicates planted on separate days. For flowers and pods, plants were vernalized for 2 weeks. Flowers were harvested on the first day that they opened fully, while pods and seeds were collected at various stages of development from 10 to 20 days after pollination (DAP). Seed excision from pods was performed on ice prior to freezing in liquid nitrogen. Harvesting of all tissues occurred at the same time of day to avoid diurnal changes in gene expression. Nodule tissues were harvested from a subset of plants grown under low-nitrogen conditions (0.5 × B&D + 0.5 mM KNO3) and inoculated with Mesorhizobium loti strain MAFF303099 7 days post-transplantation. Control roots were harvested prior to inoculation, whereas mature nitrogen-fixing nodules were dissected from the root system at 21 days post-inoculation. Fresh seeds and pods were collected every day and weighed immediately after harvest. Weight curves were determined using triplicates collected from several plants.

RNA isolation and hybridization to Affymetrix GeneChips

Harvested material was frozen immediately in liquid nitrogen and stored at −80°C prior to RNA isolation. Total RNA was extracted from 0.5 g pulverized tissue, using a cetyltrimethyl ammonium bromide and LiCl method as described by Chang et al. (1993). Removal of genomic DNA was performed as described by Benedito et al. (2008). RNA was quantified using a Nanodrop ND–100 spectrophotometer (NanoDrop Technologies) and evaluated for purity using a Bioanalyzer 2100 (Agilent). The Affymetrix Lotus GeneChip® Array (Affymetrix) was used for expression analysis. RNA from three independent biological replicates was analyzed for each organ/tissue. Probe labeling using 10 μg RNA, array hybridization and scanning were performed according to the manufacturer's instructions (Affymetrix) for eukaryotic RNA, using a one-cycle protocol for cDNA synthesis.

Data extraction and normalization of transcriptomic data

Transcriptomic data normalization was performed as described by Benedito et al. (2008). Each Cel file from the hybridized Affymetrix array was exported from the GeneChip Operating Software version 1.4 (Affymetrix) and imported into Robust Multi-array Average Express (Irizarry, 2003) for global normalization. A presence/absence call for each probe set to remove background noise was obtained using dCHIP (Li and Wong, 2001). Raw microarray data were deposited at ArrayExpress as E–MEXP–1726.

Clustering analyses, normalization, Z–score and coefficient of variation

Hierarchical clustering analysis (HCA) was performed, and the results were visualized using MultiExperiment Viewer version 4 (MeV4, part of the TM4 microarray software suite). Distances were calculated using pairwise Pearson correlation, and clusters were joined using the average linkage method.
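For readers unfamiliar with the procedure, the following is a minimal sketch of average-linkage hierarchical clustering on a Pearson-based distance. It does not reproduce MeV4: the distance is assumed here to be 1 − r, and the gene names and profiles are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    names = list(profiles)
    # Assumed distance: 1 - Pearson r, so perfectly correlated genes are at distance 0.
    dist = {frozenset(pair): 1.0 - pearson(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(names, 2)}

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between members of two clusters.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical profiles: g1/g2 rise across samples, g3/g4 fall.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
merges = average_linkage(profiles)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

The two rising genes are joined first, then the two falling genes, and the two groups are joined last at a distance close to 2 (anti-correlated profiles).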

To aid comparisons between genes, we normalized gene expression by subtracting the mean transcript level for a given gene across all samples (Ta) from the mean transcript level in a specific sample (Ts), and dividing this by the standard deviation across all samples (SD). Thus, the normalized transcript level (Tn) is (Ts − Ta)/SD, which corresponds to the Z–score. K–means clustering analysis was also performed and visualized using MeV4. The optimum number of clusters was predicted using the FOM (figure of merit) algorithm in MeV4. The FOM is an estimate of the predictive power of a clustering algorithm (Yeung et al., 2001). K–means clustering analysis was used to cluster genes into the number of clusters determined by the FOM calculation, using Pearson correlation as the distance metric.
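The Z–score normalization defined above, Tn = (Ts − Ta)/SD, can be written as a short sketch. The function name and example values are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, as the text does not state which estimator was used.

```python
import statistics

def z_scores(sample_means):
    """Normalize one gene's per-sample mean transcript levels to Z-scores."""
    ta = statistics.mean(sample_means)   # Ta: mean across all samples
    sd = statistics.stdev(sample_means)  # SD: assumed sample SD (n - 1)
    return [(ts - ta) / sd for ts in sample_means]

# Hypothetical transcript levels of one gene in three samples.
profile = [2.0, 4.0, 6.0]
normalized = z_scores(profile)
print(normalized)  # [-1.0, 0.0, 1.0]
```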

The coefficient of variation (CV) was calculated as SD/mean, where the mean and SD are calculated from transcript levels across all organs and conditions.
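As a small worked example of CV = SD/mean, the sketch below compares a stably expressed gene with a variable one. The values are hypothetical, and the sample standard deviation is again an assumption.

```python
import statistics

def coefficient_of_variation(levels):
    """CV = SD / mean of transcript levels across organs and conditions."""
    return statistics.stdev(levels) / statistics.mean(levels)  # assumed sample SD

# Hypothetical transcript levels across three organs.
stable = [100.0, 102.0, 98.0]
variable = [10.0, 100.0, 190.0]

cv_stable = coefficient_of_variation(stable)
cv_variable = coefficient_of_variation(variable)
print(round(cv_stable, 3), round(cv_variable, 3))  # 0.02 0.9
```

A low CV indicates near-constant expression across samples, whereas a high CV indicates organ- or condition-dependent expression.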

Over-representation analysis

PageMan software was used to perform over-representation analysis of KEGG functional classes using Fisher's exact test and Bonferroni correction. According to statistical values, a color scale was used to show over-represented and under-represented functional classes (Usadel et al., 2006).
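The statistical idea behind this step can be sketched as follows. This is not PageMan code: it is a minimal one-sided (over-representation only) Fisher's exact test computed from the hypergeometric distribution, followed by a Bonferroni correction. The class names and counts are hypothetical.

```python
from math import comb

def fisher_over_representation(k, n, K, N):
    """One-sided Fisher's exact test: P(X >= k) when n genes are selected from N,
    of which K belong to the functional class and k of the selected are in the class."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

# Hypothetical example: 5 selected genes out of 20 on the array.
N, n = 20, 5
raw = {
    "class_A": fisher_over_representation(k=4, n=n, K=5, N=N),   # 4 of 5 class members selected
    "class_B": fisher_over_representation(k=2, n=n, K=10, N=N),  # 2 of 10 class members selected
}
adjusted = bonferroni(raw)
print(adjusted)
```

Here class_A remains significant after correction (adjusted P ≈ 0.0098), whereas class_B does not (adjusted P = 1). Under-representation would be tested with the opposite tail.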

Annotation and mapping of genes/probe sets

We used the annotated genome sequence of L. japonicus (Sato et al., 2008) to associate each probe set on the Affymetrix GeneChip with a gene and putative function where possible via reciprocal BLAST searches. As the genome sequence of L. japonicus is not yet complete, we extended this approach to cDNA sequences of the Lotus japonicus gene index (ljgi version 6), using reciprocal tblastx between probe sequences and tentative contigs or expressed sequence tags and selecting the best match in both directions. Using the Lotus gene database (ljgdb version 6), we mapped 36 599 probe sets to 31 438 gene locus IDs. Of these, 1489 probe sets were assigned to putative TF genes (from a total of 1616 TFs in ljgdb version 6; Table S2), and 597 probe sets were assigned to putative transporter genes (from a total of 1087 genes; Table S2). To facilitate in silico analysis of multiple genes, we mapped the bincode classifications of the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Gene Ontology (GO) consortium to each Lotus gene/probe set (Table S10). By assigning KEGG and GO annotations to each gene and probe set, we provide a standard representation of genes, gene products and metabolic pathways that may be used to describe putative gene function across species.

For identification of soybean and Medicago orthologs of Lotus genes, reciprocal blastp was performed using the reciprocal best hits method and a combination of soft filtering and Smith–Waterman alignment options, as described by Moreno-Hagelsieb and Latimer (2008).
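The reciprocal best hits criterion can be illustrated with a minimal sketch that operates on already-computed hit tables. It does not run BLAST, and it omits the soft filtering and Smith–Waterman options mentioned above; the identifiers and bit scores are hypothetical.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, bit_score). Returns the top subject per query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical blastp results (query, subject, bit score).
lotus_vs_medicago = [
    ("Lj1", "Mt1", 500.0), ("Lj1", "Mt2", 300.0),
    ("Lj2", "Mt2", 450.0), ("Lj3", "Mt2", 200.0),
]
medicago_vs_lotus = [
    ("Mt1", "Lj1", 490.0), ("Mt2", "Lj2", 440.0), ("Mt2", "Lj3", 210.0),
]

orthologs = reciprocal_best_hits(lotus_vs_medicago, medicago_vs_lotus)
print(orthologs)  # [('Lj1', 'Mt1'), ('Lj2', 'Mt2')]
```

Lj3 is excluded because its best hit, Mt2, has Lj2 rather than Lj3 as its own best hit.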

Creation of the LjGEA web server

The Lotus japonicus Gene Expression Atlas uses the same hardware and software as the Medicago truncatula Gene Expression Atlas web server (MtGEA, He et al., 2009). All the features and analytical tools of MtGEA are enabled on the LjGEA web server. In addition to the new dataset described in this paper (E–MEXP–1726), we have also incorporated other datasets (E–MEXP–2690, E–MEXP–2344, E–TABM–715 and E–MEXP–1204) from the ArrayExpress database (Parkinson et al., 2005). All the data were normalized and deposited into our MySQL database. We adopted a three-tier software architecture to implement a ‘biologist-friendly’ web server to facilitate public access and analysis of the data. To provide a highly interactive web interface, we used PHP and JavaScript programming combined with some additional GNU web development packages including overlib, tab pane and open flash chart.

Metabolite profiling of Lotus seeds

Polar metabolite extraction was performed as described by Broeckling et al. (2005). Seed and seed pod tissues were ground in liquid nitrogen and lyophilized using a freeze dryer. Six milligrams of dry tissue were extracted with 1.5 ml chloroform and 1.5 ml HPLC-grade water containing ribitol (a polar internal standard). Following extraction, the mixture was centrifuged at 2900 g for 30 min at 4°C, and 1 ml of the separated polar layer was collected and dried. Dried polar extracts were treated with methoxyamine HCl in pyridine, and then derivatized using MSTFA + 1% TMCS (Thermo Scientific). Samples were analyzed using an Agilent 6890 gas chromatograph coupled to a 5973 mass spectrometry detector. The injection split ratio was 15:1 for polar samples. Separation was achieved using a 60 m DB–5MS column (J&W Scientific). Compounds were identified with AMDIS software using commercial and in-house libraries, and quantified using MET–IDEA software (Broeckling et al., 2006).


This work was supported by the Samuel Roberts Noble Foundation and the United States Department of Agriculture (USDA) Cooperative State Research, Education, and Extension Service (CSREES) National Research Initiative (NRI) program.