Recent omics technologies provide information on multiple components of biological networks, and web-based data mining tools are continuously being developed. Because genes involved in specialized (secondary) metabolism are often co-ordinately regulated at the transcriptional level, a number of gene discovery studies have been successfully conducted using network analysis, especially by integrating gene co-expression network analysis and metabolomic investigation. In addition, next-generation sequencing technologies are currently utilized in functional genomics investigations of Arabidopsis and non-model plant species, including medicinal plants. Systems-based approaches are expected to gain importance in medicinal plant research. This review discusses network analysis in Arabidopsis and gene discovery in plant-specialized metabolism in non-model plants.
Network analysis depicts biological systems and behaviours using biological components that are generated from omics data. Multivariate transcriptome and metabolome data are processed by statistical methods and visualized by, for example, scatter plots, profile plots and heat maps. They are also subjected to gene co-expression and gene-to-metabolite network analysis, where over-represented networks are depicted as a result of ‘enrichment’ searches based on the scientist's interest. Network clustering and visualization, a type of graph theoretical analysis (graph clustering), describes the connectivity of a system's elements (nodes) and their interactions (edges) in an abstract representation. In the studies we review here, network nodes represent genes (usually equivalent to the corresponding transcripts and proteins) and small molecules (metabolites). The set of genes involved in a biological process is often co-regulated; this is detectable by omics technologies as gene co-expression, and functional inference from it is based on the so-called ‘guilt-by-association’ principle (Saito, Hirai & Yonekura-Sakakibara 2008). In order to annotate candidate genes and generate hypotheses about genes involved in specialized metabolism, the integration of microarray gene co-expression network analysis and metabolomic investigation is widely utilized in the model plant Arabidopsis. Metabolic pathway and flux analyses constitute other types of network analysis, where the edges are directed from one node to another as determined by enzymatic reactions. Metabolic networks are required for a complete description of a living organism (Kitano 2002), although they are not described in this review (see the other review articles in the Special Issue).
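The node-and-edge representation and the ‘guilt-by-association’ idea can be illustrated with a minimal sketch. It assumes that pairwise correlation values between genes have already been computed; the gene names, the correlation values and the threshold of 0.7 are hypothetical and chosen only for illustration, and the websites reviewed here use their own, more elaborate algorithms.

```python
from collections import defaultdict

# Hypothetical pairwise correlations between genes (illustrative values only).
CORRELATIONS = {
    ("geneA", "geneB"): 0.92,
    ("geneA", "geneC"): 0.81,
    ("geneB", "geneC"): 0.40,
    ("geneC", "geneD"): 0.75,
    ("geneD", "geneE"): 0.10,
}


def build_network(correlations, threshold):
    """Return an undirected adjacency map keeping edges with r >= threshold."""
    network = defaultdict(set)
    for (gene_1, gene_2), r in correlations.items():
        if r >= threshold:
            network[gene_1].add(gene_2)
            network[gene_2].add(gene_1)
    return dict(network)


def candidates_for(network, bait_genes):
    """Genes directly connected to any bait gene, excluding the baits."""
    baits = set(bait_genes)
    neighbours = set()
    for bait in baits:
        neighbours |= network.get(bait, set())
    return neighbours - baits


# The threshold is an assumption; real studies choose it from the data.
network = build_network(CORRELATIONS, threshold=0.7)
print(sorted(candidates_for(network, ["geneA"])))
```

Here a known pathway gene (the bait, geneA) retrieves its directly connected neighbours as candidates for functional annotation; genes with only weak correlations never enter the network.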
Examples of web-based gene co-expression network analysis tools and the statistical issues have been reviewed by Usadel et al. (2009). Movahedi et al. (2012) reported comparative co-expression analyses of Arabidopsis, poplar, and rice, where orthologous genes involved in the DNA replication system were similarly co-expressed. Recently, various websites with new functions have been released (reviewed by Gehlenborg et al. 2010; Bassel et al. 2012; Ruprecht & Persson 2012; Tohge & Fernie 2012; Yonekura-Sakakibara, Fukushima & Saito 2012a). As described in this review, recent websites visualize networks by mixing gene co-expression and other network components including information on the molecular relationship such as co-expression of genes from the other plant species, subcellular localization of proteins, physical interaction of proteins, mutant phenotypes, regulatory information, gene–gene associations, conserved motif sequences and enzymatic reaction (Fig. 1).
Plant metabolic engineering and synthetic biology have the potential to produce environmentally clean and low-input chemicals and biofuels. The production of specialized metabolites in crops enhances their nutritional value and resistance against insects and microorganisms. For this purpose, it is necessary to understand gene regulatory networks and the functioning of appropriate enzyme genes. Plants contain various genes involved in metabolic pathways. Specialized metabolic genes are thought to be derived from primary metabolic genes (De Kraker & Gershenzon 2011; Milo & Last 2012; Weng, Philippe & Noel 2012). Advances in functional genomics have facilitated the identification of a number of genes involved in specialized metabolism. In recent times, inexpensive high-throughput sequencing analysis has enabled us to obtain information on gene expression profiles from non-model (non-reference) plant species (reviewed by De Luca et al. 2012; Schilmiller, Pichersky & Last 2012). This is anticipated to enhance the discovery of the biosynthetic genes and pathways of specialized metabolites.
The volume of omics data is increasing. Therefore, it is necessary to take advantage of bioinformatics and statistical techniques to promote the successful implementation of complicated data mining and visualization processes. Here, we review advances in plant systems biology that facilitate gene discovery and functional genomics approaches in plant-specialized metabolism. The possibility of gene discovery by metabolomics-based omics approaches under environmental stress conditions is also discussed.
Gene Discovery in Specialized Metabolism in Arabidopsis
A functional genomics approach, which integrates >1000 sets of microarray gene expression data and metabolite profile data, is a powerful tool to identify novel genes involved in specialized metabolism (Saito et al. 2008; Fukushima et al. 2009; Moreno-Risueno, Busch & Benfey 2010; Tohge & Fernie 2010). Biosynthesis of plant-specialized metabolites often takes place in a tissue-specific manner. AtMetExpress provides the results of the metabolome dataset acquired during the development of the reference plant Arabidopsis (Matsuda et al. 2010). The dataset was acquired following an experimental design compatible with that of AtGenExpress (Schmid et al. 2005; Kilian et al. 2007; Goda et al. 2008). The dataset was illustrated in the same pictographic style as in the eFP browser (Brady & Provart 2009), and subjected to gene-to-metabolite correlation network analyses, resulting in the successful prediction of the functions of genes involved in the biosynthesis of hydroxycinnamoylspermidine and neolignan. Gene discovery in anthocyanin biosynthetic pathway is a typical example of successful integration of gene co-expression network analysis and metabolite profiles (Tohge et al. 2005). The general flavonoid/flavonol subgroup and anthocyanin subgroup were classified into different clusters by gene co-expression network analysis using all datasets in ATTED-II (Yonekura-Sakakibara et al. 2007, 2008). Metabolite analysis revealed the existence of the novel flavonol glycosides from Arabidopsis flowers. These findings led to the functional identification of 2 glycosyltransferase genes. Recently, independent component analysis (ICA, see Table 1) was used for the identification of other glycosyltransferase genes involved in anthocyanin biosynthesis (Yonekura-Sakakibara et al. 2012b). Glucosinolates are one of the specialized metabolites accumulated in Brassica plants, and involved in biotic stress responses. 
Advances in omics technologies and quantitative trait locus analysis have resulted in the identification of a number of metabolic genes involved in the biosynthesis of glucosinolates in Arabidopsis (reviewed by Sønderby, Geu-Flores & Halkier 2010). Gene co-expression network analysis predicted that a putative bile acid transporter family protein is involved in the transport of glucosinolate-related metabolites (Gigolashvili et al. 2009; Sawada et al. 2009). Three serine carboxypeptidase-like acyltransferases were predicted to be involved in glucosinolate biosynthesis (Lee et al. 2012). Two of them were subsequently proven to be involved in the accumulation of benzoylated and sinapoylated glucosinolates. Kerwin et al. (2011) combined circadian clock network and quantitative genetics, discussing the links between clock function and specialized metabolism including glucosinolates.
Table 1. Glossary of terms for data mining process in the field of transcriptome and metabolome as cited in this review
Independent component analysis (ICA) is an unsupervised analysis method that is useful when a gene expression profile can be considered a linear combination of independent components. In the cited studies, bi-functional genes became apparent (Yonekura-Sakakibara et al. 2012a, 2012b).
The Jaccard coefficient, defined as the size of the intersection of two sets divided by the size of their union, was utilized to assess the degree of node overlap among networks generated from different datasets (Fukushima et al. 2012; Yonekura-Sakakibara et al. 2012a).
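As a minimal sketch of how node overlap between two networks can be scored, the following computes the Jaccard coefficient of two node sets. The node identifiers and the set names `root_nodes` and `leaf_nodes` are hypothetical; the cited studies may have applied additional processing that is not shown here.

```python
def jaccard(nodes_a, nodes_b):
    """Jaccard coefficient: |A intersection B| / |A union B| for two node sets."""
    a, b = set(nodes_a), set(nodes_b)
    union = a | b
    if not union:
        # Convention assumed here: two empty sets have zero overlap.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical node sets from networks built on two different datasets.
root_nodes = {"g1", "g2", "g3", "g4"}
leaf_nodes = {"g2", "g3", "g5"}

print(jaccard(root_nodes, leaf_nodes))
```

Two of the five distinct genes are shared between the sets, so the coefficient is 2/5; a value of 1 would indicate identical node sets and 0 no shared nodes.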
Principal component analysis (PCA) projects the data into a new space defined by the principal components. Each successive principal component captures the maximum variance in the data that is not already accounted for by the previous components (Maruyama et al. 2009; Mounet et al. 2009; Matsuda et al. 2010; Caldana et al. 2011; Kusano et al. 2011a).
Analysis of variance (ANOVA) allows comparison of the effects of two or more levels of factors (Kusano et al. 2011a, 2011b, 2011c). It extends the t-test to multiple groups and avoids the inflation of type I error that arises from repeated pairwise tests.
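The type I error issue can be made concrete with a small calculation. The sketch below assumes, for simplicity, that all pairwise t-tests are independent and are each run at the same significance level; real pairwise tests share groups and are not strictly independent, so the result is only an approximation. The function name and the default level of 0.05 are illustrative choices.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Approximate chance of at least one false positive across all
    pairwise tests, assuming the tests are independent."""
    n_tests = comb(n_groups, 2)  # number of pairwise comparisons
    return 1 - (1 - alpha) ** n_tests


for n_groups in (2, 3, 5):
    print(n_groups, round(familywise_error_rate(n_groups), 4))
```

With two groups there is a single test and the rate equals the nominal level; as the number of groups grows, the number of pairwise tests and hence the chance of a spurious difference rises quickly, which is why a single ANOVA is preferred as the first step.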
Weighted correlation network analysis (WGCNA) is available as an R-based analytical tool to calculate gene co-expression networks (Langfelder & Horvath 2008; Caldana et al. 2011), and was used for the transcriptomics analysis of the heat stress responses of Arabidopsis, poplar and soybean (Weston et al. 2011).
Orthogonal projections to latent structures (OPLS) is a supervised data mining approach. Unlike PCA, metabolites or genes are separated based on the phenotype about which the researchers want to form a hypothesis (Wiklund et al. 2008; Kusano et al. 2011a, 2011b; Marti et al. 2012). From the data, genes or metabolites that account for the hypothesized difference can be found.
The Pearson correlation coefficient (PCC) is a parametric measure (Mounet et al. 2009) that is often used in gene co-expression analysis (Usadel et al. 2009). PCC is sensitive to spike peaks (outliers), which can be recognized in a scatter plot.
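The sensitivity to spike peaks can be seen in a minimal sketch. The expression values below are hypothetical: two profiles that are uncorrelated across five samples appear strongly correlated once a single sample with extreme values in both genes is added.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression profiles over five samples (no linear relationship).
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [2.0, 1.0, 2.0, 1.0, 2.0]

# The same profiles with one extra sample in which both genes spike.
spiked_x = gene_x + [50.0]
spiked_y = gene_y + [40.0]

print(pearson(gene_x, gene_y))      # close to 0
print(pearson(spiked_x, spiked_y))  # above 0.95, driven by one sample
```

Plotting the two profiles against each other would show a single isolated point far from the rest, which is how such spurious correlations are usually recognized.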
Hierarchical cluster analysis (HCA) produces a tree (dendrogram), often displayed with a heat map, in which more closely related pairs of genes have lower common ancestors (Clauset, Moore & Newman 2008; Caldana et al. 2011; Inzé et al. 2012).
The batch-learning self-organizing map (BL-SOM) is a modification of the conventional SOM (Mounet et al. 2009) that conducts the learning process independently of the order of data input and displays coloured feature maps on a lattice (Kanaya et al. 2001). Genes and metabolites with similar behaviour are classified into the same or neighbouring lattices (Matsuda et al. 2010).
Cytoscape is an open-source Java tool for network visualization and analysis that displays integrated biomolecular interaction networks, such as gene co-expression, protein–protein interaction and gene-to-metabolite networks (Shannon et al. 2003). Cytoscape Web is a web-based network visualization tool (Lopes et al. 2010).
In Table 2, web-based network analysis tools are summarized from the viewpoint of gene annotation in specialized metabolism. These sites have been released or updated in recent years and display networks of gene co-expression with other biological co-responses as user-friendly and attractive figures. PlaNet calculates various sizes of co-expression networks using a NetworkComparer pipeline under the Heuristic Cluster Chiseling Algorithm across seven plant species (Mutwil et al. 2011). Two examples of genes involved in photosynthesis and the polyketide synthase-related pathway were examined by the authors. As another example of comparative analyses, co-expression networks of genes from seven plant species that were involved in monolignol biosynthesis were illustrated (Ruprecht et al. 2011; Ruprecht & Persson 2012). In these instances, PlaNet represented the degree of mutant phenotype and overlap among the species of each gene using four different node colours. As shown in these papers, PlaNet is useful in finding genes of metabolic pathways that are conserved among plant species, such as those of flavonoids and lignin. CORNET displays Arabidopsis gene co-expression networks with information on putative protein–protein interaction networks, regulatory interactions and subcellular localization (De Bodt et al. 2010, 2012). It would be worth trying the various compendia of CORNET, such as the no-bias, abiotic stress and hormone-treatment expression conditions, because genes involved in specialized metabolism are often expressed in a specific tissue or under stress conditions. The LinkOut menu of this application is useful because it links to a variety of database websites such as interactome, reactome and Arabidopsis databases. Node colour in CORNET represents subcellular localization, which is useful because at least part of a specialized metabolite biosynthetic pathway often takes place in a particular compartment. 
We queried the websites using Arabidopsis de novo fatty acid biosynthetic genes (Andre et al. 2007; Li-Beisson et al. 2010; Fig. 2) in order to explain the features of these websites. Because they used different array datasets and graph-clustering algorithms of gene co-expression with other biological co-responses, the results were not identical to each other; for example, the genes encoding the plastidic pyruvate kinase complex (PK ALPHA and PK BETA) are co-expressed with the other fatty acid biosynthetic genes in Arabidopsis (Fig. 1a, PlaNet) and among four plant species (Fig. 1b, PlaNet). The gene co-expression network observed in seed (Fig. 1c, CORNET) was similar, but not identical, to that in leaf (Fig. 1d, CORNET), suggesting the existence of different regulatory mechanisms.
Table 2. Graphically oriented web-based tools of gene co-expression network analysis for gene discovery in specialized metabolism
Web-based tools for gene co-expression network analysis
ATTED-II displays the co-expression networks of Arabidopsis and rice (Obayashi et al. 2011). Users can easily move to correlated gene pages by clicking on the links on a given query gene page, move to pathway maps of the KEGG database from nodes of characterized genes, and follow links to the Cytoscape Web system, all of which promote better understanding of the network result. In this regard, ATTED-II is suitable for gene annotation by the non-targeted approach (Aoki, Ogata & Shibata 2007). Several genes involved in specialized metabolism were predicted using ATTED-II and subsequently proven with experimental evidence; these include an MYB transcription factor involved in aliphatic glucosinolate biosynthesis (Hirai et al. 2007), modifying enzymes involved in flavonoid biosynthesis (Yonekura-Sakakibara et al. 2007, 2008), and a uridine diphosphate (UDP)-glucose pyrophosphorylase involved in sulfolipid biosynthesis (Okazaki et al. 2009). VirtualPlant (Katari et al. 2010) displays gene co-expression and enzymatic reactions (substrates) in a network. VirtualPlant was used by Von Saint Paul et al. (2011) to represent a relationship between a glucosyltransferase and a metabolite in branched-chain amino acid degradation pathways. VirtualPlant is therefore useful in finding metabolic enzyme genes. GeneMANIA uses a complex implementation of Cytoscape Web, allowing users to handle nodes and edges on the resulting display (Warde-Farley et al. 2010). GeneMANIA represents information on gene co-expression, shared protein domains, putative physical interactions, tissue specificity in root (Brady et al. 2007) and orthologous genes (Lee et al. 2010, AraNet). AraNet calculates gene co-expression patterns, putative protein–protein interactions and gene–gene associations, which are transferred from publications of orthologous human, fly, worm and yeast genes (Lee et al. 2010). GeneMANIA connects edges between the query genes and genes that belong to the same protein family. 
This function of GeneMANIA would be useful for the annotation of enzyme families to which a large number of genes are assigned. There are many useful websites (Table 2) that we cannot introduce here. They use different graph-clustering algorithms and microarray datasets for gene co-expression network analysis. Therefore, it is recommended to try all of the websites and manually review the outputs.
Network Analysis in Arabidopsis and Tomato
Network analyses of transcriptome and metabolome data have been used for omics studies in a variety of plant species, and have been quite successful in Arabidopsis and tomato in particular. In this section, we summarize examples of systems biology approaches in these species. These studies are expected to lead to gene discovery in specialized metabolism. They used various statistical methods, which are briefly explained in Table 1.
Plants have evolved acclimation responses to local environmental stress. Systems biology approaches describe the global structure of plant responses to environmental stresses. Light- and temperature-inducible responses of metabolites and gene expression were monitored in 177 states (Caldana et al. 2011). Transcriptome and metabolome data were subjected to hierarchical cluster analysis (HCA), principal component analysis (PCA), over-representation analysis of gene ontology terms and gene-to-metabolite correlation analysis (see Table 1). Metabolic and transcriptomic changes were clearly detected from normal-/high-temperature samples in the dark. The authors found distinguishable responses under short-term environmental stresses, for example, changes in amino acid accumulation, leucine biosynthesis and phenylpropanoid pathway metabolic flow, suggesting a coordinated response for adaptation to the new environment (Caldana et al. 2011). Metabolic pathways involved in cold acclimation have also been analysed by integration of metabolites and transcripts (Maruyama et al. 2009). Samples were grouped by PCA based on metabolic profiles, and represented by pathway maps of starch degradation and sucrose metabolism. Mutant analysis suggested that the production of raffinose was regulated by the cold stress-responsive transcription factor dehydration-responsive element-binding (DREB) 1A. Abscisic acid (ABA)-regulated responses to dehydration were characterized by metabolomics and transcriptomics (Urano et al. 2009). Using an ABA-deficient mutant, an ABA-dependent increase of amino acids and ABA-independent raffinose accumulation were suggested. Global metabolite–metabolite correlation networks were depicted for the wild type and the mutant. The correlation network analyses annotated a large number of unknown metabolites. Kusano et al. (2011a) demonstrated that short- and long-term exposure to ultraviolet B (UV-B) irradiation induced a dynamic response of both primary and specialized metabolism, involving metabolites such as sugars, amino acids, α-tocopherol, anthocyanins and caffeate. The combined transcriptional and metabolic responses of different genotypes indicated that flavonoid-less mutants exhibited enhanced UV-B-inducible senescence.
Tomato plants as a commercial crop are extensively analysed by systems-based network analysis. Mounet et al. (2009) investigated the cell expansion phase during fruit development at the level of the transcriptome and metabolome. The results were represented by PCA, self-organizing maps (SOM), metabolite correlation calculated by the Pearson correlation coefficient and co-expression analysis visualized by using Networks cartography and Pajek software (see Table 1). Tomato fruit development was additionally examined by combined transcript/protein/metabolite analysis (Osorio et al. 2011). Expression analysis was visualized by PageMan (Usadel et al. 2006). The ethylene receptor mutant showed up-regulation of ethylene biosynthesis-related lipoxygenases. Post-transcriptional regulatory mechanisms have been suggested from transcriptome and proteome data. Kusano et al. (2011b) performed a metabolomics study and characterized the differences in tomato fruit metabolites among two transgenic lines, five reference cultivars, and the control line. Triterpene and flavonoid biosynthetic genes were classified by graph-clustering algorithms and gene ontology (GO)-enrichment analysis using more than 300 tomato microarray data (Fukushima et al. 2012). The root dataset showed a higher Jaccard coefficient (see Table 1) than leaf and fruit datasets, suggesting that gene co-expression data in roots contain similar networks to those in leaves and fruits.
Gene Discovery in Specialized Metabolism in Non-Model Plant Species
Specialized metabolites, including medicinal compounds, often accumulate only in a limited number of plant species or genera. Some plant cultivars have been bred for a long time by humans. In such plants, gene expression patterns and the amino acid sequences that define protein structure are anticipated to be optimized for a high yield of specific secondary (specialized) metabolites. Omics approaches on non-model plant species have been carried out by gene expression analyses such as high-throughput RNA sequencing (RNA-Seq), in-house cDNA microarrays, expressed sequence tag (EST) databases and differential screening methods, in which the researchers have compared different cultivars, tissues, and elicitor-treated cell suspension cultures and seedlings (Fig. 1). In this section, we introduce examples of gene identification in specialized metabolism in non-model plant species.
Papaver somniferum accumulates narcotic analgesics. Morphine biosynthetic genes were analysed by in-house cDNA microarrays, resulting in the identification of salutaridine reductase (Ziegler et al. 2006), salutaridine synthase (Gesell et al. 2009) and dioxygenases catalysing O-demethylation steps (Hagel & Facchini 2010). Desgagné-Penix et al. (2012) used RNA-Seq to compare the gene expression profiles of eight cultivars. The expression levels of the 12 known morphine biosynthetic genes were partially co-ordinately regulated. P. somniferum also produces an anti-tumour compound, noscapine. Winzer et al. (2012) annotated biosynthetic genes involved in the noscapine biosynthetic pathway using RNA-Seq. Six candidate genes were characterized by virus-induced gene silencing or biochemical methods. The authors also performed focused genome sequencing and found that the biosynthetic genes formed a gene cluster. Gene clusters, which are also observed in the biosynthetic pathways of terpenes in Arabidopsis (Field & Osbourn 2008; Field et al. 2011) and of cyanogenic glycosides in cassava, sorghum and Lotus japonicus (Takos et al. 2011; Takos & Rook 2012), have the potential to become a new clue for gene discovery in plant-specialized metabolism.
Cannabis sativa produces medicinal cannabinoids in the glandular trichome. Deep sequencing technology made it possible to read both the draft genome and transcriptome sequences of Can. sativa (Van Bakel et al. 2011). The expression of the known cannabinoid biosynthetic genes among different tissues and cultivars was compared. The results indicated that the genes involved in cannabinoid and hexanoate biosynthesis were highly expressed in the flowers of the cannabinoid-producing cultivar, where the medicinal compounds are biosynthesized. Genomic DNA sequencing of Can. sativa revealed that high copy numbers of a gene encoding an acyl-activating enzyme existed in the cannabinoid-producing cultivar, suggesting an unknown role for this enzyme in cannabinoid biosynthesis (Van Bakel et al. 2011). An acyl-activating enzyme involved in hexanoyl-CoA biosynthesis has been identified in the trichome-specific cDNA library from female flowers (Stout et al. 2012).
Catharanthus roseus accumulates the anti-tumour compounds vinblastine and vincristine. Rischer et al. (2006) depicted gene-to-gene and gene-to-metabolite networks based on transcriptome data obtained by cDNA-amplified fragment length polymorphism (AFLP) analysis and on metabolome data, where the cell suspension culture was treated with methyl jasmonate. The analysis of Cat. roseus expression mapping data in cDNA libraries sequenced by RNA-Seq has been used to facilitate the functional characterization of tabersonine oxidase (Giddings et al. 2011). An N-methyltransferase has been isolated and functionally characterized from the Cat. roseus cDNA library of elicitor-treated seedlings (Liscombe, Usera & O'Connor 2010), where the discovery was guided by information on tissue specificity and subcellular localization.
Lupinus angustifolius produces quinolizidine alkaloids (QA). Differential screening of cDNA libraries generated from QA-producing and non-producing cultivars was carried out using PCR-select cDNA subtraction, resulting in the isolation of a lysine/ornithine decarboxylase gene (Bunsupa et al. 2012). Coffea arabica seeds accumulate chlorogenic acids. The same coffee plant cultivar was grown in various natural environments to compare phenylpropanoid accumulation levels (Joët et al. 2010). Known phenylpropanoid genes were targeted and analysed by real-time PCR, and co-expression analysis was carried out to investigate the metabolic network in the phenylpropanoid pathway.
Terpenes are involved in the mechanism of plant defence against insects and insect-associated pathogens. Glandular trichomes accumulate volatile monoterpenes and sesquiterpenes. Gene expression levels are thought to be optimized for producing such compounds. Using RNA-Seq, a neryl diphosphate synthase has been identified from the tomato trichome (Schilmiller et al. 2009). Shotgun proteome analysis was aided by RNA-Seq data, and resulted in the annotation of volatile aldehyde and sesquiterpene biosynthetic genes (Schilmiller et al. 2010). Zerbe et al. (2012) functionally characterized new enzymes involved in diterpene biosynthesis in conifers by combining metabolite profiling, tissue-specific cDNA libraries and subsequent RNA-Seq. RNA-Seq analysis resulted in the discovery of the iridoid biosynthetic pathway in Cat. roseus (Geu-Flores et al. 2012). Glycyrrhiza plants accumulate triterpene saponins, including the pharmacologically active glycyrrhizin. Candidate genes were prioritized based on information regarding their tissue specificity and presence/absence among closely related species (Seki et al. 2008, 2011). As a result, two cytochrome P450 genes were identified in the in-house cDNA library. A similar strategy was applied to mining genes involved in triterpene saponin biosynthesis (Naoumkina et al. 2010).
Because of the increase in nucleotide sequence data, it is helpful to have manually annotated nucleotide sequence repositories where users can readily perform sequence homology searches (Table 2). Added-value nucleotide sequence repositories and web-based network analysis tools are expected to be developed for medicinal plant research. For example, the Medicinal Plant Genomic Resource (http://medicinalplantgenomics.msu.edu/) is a medicinal plant-specific deep sequencing consortium providing BLAST search tools for RNA-Seq data from 14 species. The data are linked to ClinicalTrials.gov, to educate users regarding the therapeutic effects of the medicinal plants, and to the metabolomics database of the Medicinal Plant Metabolomics Resource.
Conclusion and Future Directions
The biosynthesis of specialized metabolites is often restricted to a particular plant tissue or induced by biotic/abiotic stresses under the regulation of unique transcription factors that activate the expression of genes involved in a particular metabolic pathway. Also, the biosynthetic pathways in specialized metabolism often lack alternative routes. Therefore, there are many successful gene discovery studies in plant-specialized metabolism, in which the researchers used gene co-expression network analysis and mass spectrometry-based metabolite profiles (gene-to-metabolite correlation). Recently, Baerenfaller et al. (2012) described the investigation of Arabidopsis leaf growth and abiotic responses using systems-based omics analysis, by mining and displaying transcriptome, proteome and phenome data. As described in this paper, mining and visualization tools for omics data are steadily improving, which will stimulate gene discovery in plant-specialized metabolism in both model and non-model plant species. The involvement of specialized metabolites in plant biology has just begun to be elucidated. Plant-specialized metabolites such as terpenes, glucosinolates, flavonoids and alkaloids are known to be involved in plant defence systems (Dixon 2001). Herbivore-induced metabolites were identified from maize using untargeted metabolomics approaches (Marti et al. 2012). Huffaker et al. (2011) isolated four novel acidic sesquiterpenes from a large number of unknown terpenes. N-acetylornithine was identified as a novel jasmonate-induced metabolite in Arabidopsis (Adio et al. 2011). The biosynthetic gene was predicted from publicly available microarray data and functionally characterized by the absence of the compound in a T-DNA insertion knockout line (Adio et al. 2011). As described in this review, metabolomics studies revealed unknown compounds that responded to environmental stresses such as light, temperature and dehydration (Maruyama et al. 2009; Urano et al. 2009; Caldana et al. 2011). Gene discovery in plant-specialized metabolism via network analyses would provide new insights into the novel metabolic pathways of those compounds and the biological processes underlying plant responses.
This research was supported in part by JST, Strategic International Collaborative Research Program, SICORP.