Author for correspondence: Robert D. Hall Tel: +31 317477230 Fax: +31 317418094 Email: firstname.lastname@example.org
I. Introducing plant metabolomics
II. The technologies for data acquisition
III. Mind the gap: data analysis, bioinformatics and statistics
IV. Potential and limitations
V. Applications
VI. Data integration, metabolic networks and systems biology
VII. Where do we go from here? Bright prospects for the future
In a short time, plant metabolomics has gone from being just an ambitious concept to being a rapidly growing, valuable technology applied in the drive to gain a more global picture of the molecular organization of multicellular organisms. The combination of improved analytical capabilities with newly designed, dedicated statistical, bioinformatics and data mining strategies is beginning to broaden the horizons of our understanding of how plants are organized and how metabolism is both controlled and highly flexible. Metabolomics is predicted to play a significant, if not indispensable, role in bridging the phenotype–genotype gap and thus in assisting us in our desire for full genome sequence annotation as part of the quest to link gene to function. Plants are a fabulously rich source of diverse functional biochemicals and metabolomics is also already proving valuable in an applied context. By creating unique opportunities for us to interrogate plant systems and characterize their biochemical composition, metabolomics will greatly assist in identifying and defining much of the still unexploited biodiversity available today.
Biology is an informational science (Hood et al., 2004). As such, biologists are benefiting greatly from all the recently developed and information-rich functional genomics technologies. These technologies, which are ultimately aimed at giving us full coverage of, for example, total genome sequence or the complete transcriptional analysis of an organism, are already expanding research horizons well beyond those envisaged just 5–10 yr ago. The capacity to generate data on the molecular organization of living things has never been so great and, particularly with data coming from complementary, multivariate sources, this has resulted in a paradigm shift in our way of thinking regarding how we tackle complex biological questions. Multidisciplinary approaches have become the order of the day. Divergent fields of expertise are being combined to enable us to gain wholly new insights into, for example, cellular behaviour, how plants are organized and how they respond to genetic or environmental insult. Techniques such as DNA microarrays and proteomics have already expanded our knowledge greatly. Metabolomics will add to this and, indeed, will benefit considerably from the additional advantage of not being reliant upon having species-specific genomic information available for effective data analysis. However, while this switch from a reductionist to a more holistic approach has created many new opportunities, it has also raised a number of significant new challenges. This is particularly the case concerning the newest field of functional genomics, ‘metabolomics’, where attention is specifically focused on the biochemical complement of cells and tissues.
While developments in genome sequencing, transcript profiling (transcriptomics) and protein profiling (proteomics) were rapidly progressing in the 1990s, what was lagging behind for many years was the development of a complementary technology covering cell metabolites. Recent progress has begun to overcome this shortfall and indeed the plant community has played the leading role here, driven on by the richness of plant metabolism and the generic value of the information which can be gathered from such an approach. Since the term ‘metabolome’ was coined in 1998 to describe the metabolite complement of living tissues (Oliver et al., 1998), much has been written and even more has been speculated as to the potential value of having more detailed knowledge of the biochemical composition of biological materials in relation to activities concerning the other molecular targets in functional genomic studies. Metabolomics has been defined as the technology designed to give us the broadest, essentially nontargeted insight into the richly diverse population of small molecules present in living things (Table 1 and see Hall et al., 2005). Larger molecules (generally > 1500 Da) and typical amino acid and sugar polymers are, by definition, excluded, albeit for reasons of practicality related to extraction and detection limitations. As such, small molecule-omics might indeed have been a better name (Adams, 2003). In the early days after 1998, the paucity of published results was more than compensated for by a plethora of (speculative) reviews. This situation has now thankfully been reversed: the number of full scientific papers has been expanding exponentially in the last couple of years and even a dedicated Metabolomics journal has come into existence (http://www.springeronline.com).
Table 1. Some useful working definitions
The complete complement of small molecules present in an organism.
The technology geared towards providing an essentially unbiased, comprehensive qualitative and quantitative overview of the metabolites present in an organism
A nonplant term generally used to define the technology used to measure quantitatively the metabolic composition of body fluids following a response to pathophysiological stimuli or genetic modification
High throughput qualitative screening of the metabolic composition of an organism or tissue with the primary aim of sample comparison and discrimination analysis. Generally no attempt is initially made to identify the metabolites present. All steps from sample preparation, separation and detection should be rapid and as simple as is feasible. Often used as a forerunner to metabolic profiling
Identification and quantification of the metabolites present in an organism. For practical reasons this is generally only feasible for a limited number of components, which are generally chosen on the basis of discriminant analysis or on molecular relationships based upon molecular pathways/networks
Following broad-scale metabolomics analysis, or based upon previous knowledge, biochemical profiling can be performed in greater detail on selected groups of metabolites by using optimized extraction and dedicated separation/detection techniques
This review is aimed at introducing the topic and detailing progress, potentials and limitations. Bearing in mind the existence of many previous reviews, information in this paper, with just a few key exceptions, will specifically be extracted from examples from the published literature from 2004 to mid 2005, as a means to avoid overlap and emphasize the extent of recent progress. Readers are therefore immediately referred to several key publications for a detailed insight into the pioneering work of previous years (Harrigan & Goodacre, 2003; Fernie et al., 2004; Vaidyanathan et al., 2005; Saito et al., 2006).
Before embarking upon various specific aspects determining the success and current limitations of metabolomics technologies, it is useful first to place these in the context of plant metabolism as a whole. Being generally sessile, obligate autotrophs, plants have typically been forced to evolve complex multicellular structures of divergent function in order to survive in a continually changing and predominantly hostile environment. These organs (leaves, roots, stems, tubers, etc.), in turn, comprise multiple specialized cell types (epidermis, guard cells, parenchyma, glandular hairs, etc.), each of which has a dedicated metabolism in order to perform effectively its desired function. Local environmental, temporal and seasonal perturbations have to be compensated for through responses which are predominantly facilitated by metabolic change. In addition, abiotic (light, UV, water) and biotic (herbivory, parasitism and pathogen attack) stress factors continually have to be dealt with and, for this, plants have developed a complex metabolic arsenal of compounds, some of which are common but many of which are specialized to the level of genera or perhaps even species. Speculative estimations of the total number of metabolites which are produced within the plant kingdom, including the ‘secondary’ metabolites, vary considerably but the real number is likely to be in the range between 100 000 and 200 000 (Oksman-Caldentey & Inzé, 2004). This complexity can therefore be used to define plants at every level of genotype, phenotype, tissue and cell. Consequently, when we have a more complete insight into the metabolic composition of plant tissues, this knowledge can be exploited effectively for both its diagnostic and predictive value.
It is clear that there are still large gaps in our knowledge and understanding of biochemical pathways and their regulation (Jander et al., 2004). Much of plant biochemistry can, unfortunately, still be defined in terms of ‘unidentified compounds derived from undefined pathways and with unknown function’ (Oliver Fiehn, University of California, Davis, CA, USA; 1st Metabolomics Conference, Japan 2005). In the past, research in this area has been very effective in expanding our knowledge but for practical reasons it has inevitably been focused on tightly predefined biochemical questions. Nevertheless, in relation to the complete metabolic picture, this approach has recently been described, perhaps unfairly, as just a piecemeal elucidation of components and causal relationships within a very small and circumscribed subset of cellular pathways (Fridman & Pichersky, 2005). Clearly, our increasingly holistic viewpoint in other areas of molecular biology is fuelling the demand for a broader overview of plant biochemistry. However, this is only ever going to be of value if it can be achieved in a reliable and meaningful way. This represents the major challenge to plant metabolomics at present. It is perhaps also for this reason that, owing to our restricted knowledge of plant biochemistry, the essentially nontargeted (blind) approach of metabolomics is a particularly suitable initial approach as an hypothesis generator to use to provide early leads for future research.
Metabolomics represents a change in scientific philosophy (Trethewey, 2004). Nevertheless, despite (because of?) the immense metabolic diversity of plants, metabolomics has been quickly embraced by the plant community (Goodacre, 2005). The reasons for this are clear. Having more detailed information on the biochemical composition of plant tissues has huge potential value in a wide range of both scientific and applied fields (Hall et al., 2005). Furthermore, the information gained reflects much more of a biological endpoint (Lindon et al., 2004) than that obtained, e.g. from transcriptomics or proteomics and, as such, is perhaps more relevant to our understanding of how a plant exists, functions and responds within its own environment. This is of particular relevance considering the clear time displacement that there is between gene expression, enzyme translation, metabolite production and any physiological event (Nicholson et al., 2004).
The quality of crop plants is a direct function of their metabolite content (Memelink, 2005). Quality of plant tissues also determines their commercial value in relation to aspects of, for example, flavour, fragrance, shelf life, physical attributes, etc. Each of these can be fully defined in terms of the metabolic profile of the material concerned at a particular time. However, it is in fundamental research, and in particular that related to the current mammoth quest to annotate complete genome sequences with properly defined and fully testable gene-to-function relationships, where metabolomics will likely come into its own. Together, all these broad fields of application form the backdrop to the current initiatives. They also explain the major driving force behind the establishment and further development of metabolomics to become an equally valuable complementary technology to transcriptomics and proteomics. In this review most attention will be given to the potential and the limitations of the currently available technologies in relation to many of the envisaged applications in functional genomics and plant biology research.
II. The technologies for data acquisition
Many significant advances have been made in the area of technology development for metabolomics applications in recent years. Our capacity to analyse simultaneously a chemically diverse range of organic components in complex mixtures has never been greater. However, one thing should perhaps be made clear from the outset: the holy grail of plant metabolomics, namely a complete overview of the entire metabolic complement of a plant obtained in a single analysis or a small series of analyses, as was predicted in the late 1990s, is currently inconceivable, although future technological advances might, of course, prove us wrong. While we can already obtain remarkably broad metabolic profiles, even in single analyses (Kikuchi et al., 2004; von Roepenack et al., 2004; Tohge et al., 2005), complete coverage will always be hampered by the sheer biochemical complexity of the metabolites present in plant tissues, together with the broad dynamic range of the different components present at any one time. Within tissue extracts, individual key components can, for example, vary in concentration by as much as seven to nine orders of magnitude (Dunn & Ellis, 2005). Furthermore, within the plant kingdom c. 100 000 secondary metabolites have already been discovered (Oksman-Caldentey & Inzé, 2004). This total also includes large populations of key compound groups including, for example, 6000 different flavonoids (Schijlen et al., 2004) and 12 000 different alkaloids (Facchini et al., 2004). Such large groups of structurally closely related compounds represent just as big a challenge to separation and detection technologies as the diversity in molecular structures. Comprehensive coverage must therefore be achieved through multiparallel, complementary extraction/detection technology combinations using hyphenated systems which can be employed to build up the picture towards an ever-increasing degree of metabolic completeness.
Nevertheless, as a starting point, a carefully chosen pair of analysis protocols is proving an excellent initial strategy for gaining a first impression of a metabolic profile which can be used to identify key biochemical leads for further more focused attention (Hirai et al., 2004, 2005 (Fourier transform (FT)-mass spectrometry (MS) and capillary electrophoresis (CE)); Bino et al., 2005 (liquid chromatography (LC)-photodiode array detection (PDA)-MS and gas chromatography (GC)-MS); Tohge et al., 2005 (LC-PDA-MS and FT-MS)).
Many technical reviews and reports have already been written on the different strategies available for metabolomics data acquisition (Table 2; see Saito et al., 2006) and only a brief overview can be given here. Simplistically, success can be considered to be dependent on two key aspects: production of the biological material/sample preparation, and sample extraction/metabolite detection. Comparative analyses in particular rely on the detection of statistically significant differences between samples and thus the approaches chosen for both aspects are critical to subsequent success. A very useful, detailed review on appropriate strategies for the design of metabolomic experiments has been written by Gullberg et al. (2004) and could be used as an effective starting point.
With the exception of nondestructive nuclear magnetic resonance (NMR)-based, in vivo metabolic analyses (which nevertheless still have certain limitations to application with plants; Stitt & Fernie, 2003), essentially all metabolomic approaches give us what is basically a ‘metabolic snapshot’ which is specific to the time of sampling. This so-called ‘point-in-time-chemistry’ provides us with a hugely valuable metabolic insight but it is also important to bear in mind that failing to take full and proper care of the cultivation and sampling of the plants will result in potentially misleading and incorrect conclusions. Oscillations in plant homeostasis are commonplace (e.g. related to local environmental/developmental differences (sample location on the plant, plant position with regard to light/shading at the time of sampling, etc.), diurnal rhythm, tissue differences, etc.). Consequently, full consideration must be given to these temporal and spatial dynamics of plant metabolism when wishing to perform comparative analyses on contrasting samples. Producing and comparing metabolic profiles of any biological materials is always possible but determining the significance and biological relevance of the differences observed is another issue altogether. As with all functional genomics technologies, where the number of data points obtained per sample (in this case the number of metabolite peaks) will usually greatly exceed the number of samples that can be measured, great care has to be taken both in the way the data is generated and in the statistical treatment of the data obtained. Results to date suggest that technical variation is generally acceptably low (SD usually ≤ 10%) in comparison with the biological variation, which can vary several-fold (unpubl.). Care must therefore be taken to account for this in planning the number of replicates to be measured and/or to make samples from pooled materials from several plants.
In our tomato programme, we routinely use several replicates per sample and each sample comprises material from up to 18 fruits obtained from nine different plants (Tikunov et al., 2005). Pilot experiments are always required before planning a full-scale metabolomics analysis and employing the help of a statistician experienced in experimental design is to be strongly recommended. Ideally, all samples for comparison should be grown and harvested together under identical conditions as this will allow maximum biological relevance to be linked to the conclusions drawn from any discriminant analysis. Material coming from different sources/times, etc. can be compared, but conclusions should only ever be drawn in the context of the initial sample differences. The biological relevance will therefore be less clear and additional experiments will be required to confirm whether the differences observed are genotypically rather than environmentally relevant.
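The effect of pooling on the variation seen between samples can be illustrated with a small calculation. The following is a minimal sketch, not a procedure taken from the studies cited: it assumes that biological and technical errors are independent, that every individual contributes equally to the pool, and that the biological coefficient of variation (50% here) is a purely hypothetical figure. Only the technical value of 10% echoes the text above.

```python
import math


def expected_cv(bio_cv: float, tech_cv: float, n_pooled: int) -> float:
    """Approximate CV of one measurement made on a pool of n_pooled individuals.

    Assumes independent biological and technical errors, so variances add,
    and that averaging n individuals divides the biological variance by n.
    """
    if n_pooled < 1:
        raise ValueError("n_pooled must be at least 1")
    return math.sqrt(bio_cv ** 2 / n_pooled + tech_cv ** 2)


# Hypothetical biological CV of 50%; technical CV of 10% as quoted in the text.
for n in (1, 3, 9, 18):
    print(f"pool of {n:2d}: expected CV = {expected_cv(0.50, 0.10, n):.3f}")
```

Under these assumptions the expected variation falls quickly as more individuals are pooled, but it can never drop below the technical contribution, which is one reason why pilot experiments are needed to estimate both components before the design is fixed.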
2. Extraction, separation and detection
Chemical complexity, metabolic heterogeneity, dynamic range and ease of extraction remain the main challenges in developing an effective metabolomics technology platform (Goodacre et al., 2004). All current extraction and detection techniques, irrespective of their high level of sophistication, have unavoidable intrinsic bias towards certain metabolite groups (Hall et al., 2005). No single extraction/detection technique will therefore suffice and multiparallel technologies are required to gain the desired broad metabolic picture. Careful selection of suitable combinations of extraction, separation and detection protocols can however, lead to a rapid build-up of complementary biochemical data on the composition of biological samples (Bino et al., 2005 (LC-MS, GC-MS, high-pressure liquid chromatography (HPLC)-PDA); Broeckling et al., 2005 (GC-MS, LC-MS); Hirai et al., 2004 (HPLC-PDA, FT-ion cyclotron resonance (ICR)-MS); Sato et al., 2004 (CE-MS, CE-PDA)). Consequently, both polar/semipolar (e.g. methanol (MeOH)/water) and lipophilic (e.g. chloroform) extractions are usually analysed and, especially for plants, an additional analysis of the volatile components via solvent extraction (e.g. pentane) or through headspace extraction (e.g. solid phase (micro)extraction, SP(M)E) is often desirable (Tikunov et al., 2005).
For medical metabonomics, the use of robotics for sample preparation is proving a valuable approach (Lindon et al., 2004). In these analyses, nuclear magnetic resonance (NMR)-based techniques with relatively short measurement times are regularly used on large numbers of urine or plasma samples. However, while robotics offers opportunities for more uniform and reproducible protocols, widespread use for plants has not yet been established. Plant extracts generally have a more complicated biochemical composition and with MS detection being more widely used, this requires longer LC or GC run times for each analysis. As most laboratories do not have access to a battery of mass spectrometers the number of samples that can be run per day is limiting. Large-scale sample preparation can mean significant time differences between the moment of extraction and measurement of different samples and, even with −80°C storage, this can be reflected in the fingerprint obtained. This adds an undesired extra complication during data analysis. We therefore prefer to prepare and run samples for GC-MS in daily batches (with a common reference throughout the measurement period; Tikunov et al., 2005) although freeze-drying after extraction can be used to enhance long-term stability (Joachim Kopka, MPI/MPP-Golm, pers. comm.). Other sample preparation pitfalls, and how to avoid them are detailed in Dunn & Ellis (2005).
Mass spectrometry (MS) is the primary detection method of choice for plant metabolomics owing to its sensitivity, speed and broad application. Depending upon the type of extract made, gas chromatography (GC) or liquid chromatography (LC) is most routinely used for metabolite separation before the samples pass into the mass spectrometer. Capillary electrophoresis separation is, however, gaining interest (Sumner et al., 2003; Sato et al., 2004).
Gas chromatography–mass spectrometry (GC-MS) is currently proving to be the most popular global analysis method. Popularity stems primarily from the robustness of both the separation and the electron impact spectrometry technique and the availability of some excellent deconvolution and metabolite identification software. Together, these can be used for complete metabolite identification and quantification. Running standards and comparing with commercially available compound databases can be used to confirm metabolite identity. Gas chromatography-MS is the principal technique for separation and detection of metabolites that are naturally volatile at temperatures up to c. 250°C (e.g. alcohols, monoterpenes and esters), although thermolabile compounds will be missed. However, the technology is more broadly applicable to groups of nonvolatile, polar (mainly primary) metabolites, such as amino acids, sugars and organic acids, by converting these into volatile and thermostable compounds through chemical derivatization. These derivatized samples can then be analysed by GC-MS and detailed information on many of the key primary metabolites in plants can be obtained in a single chromatographic run (e.g. Desbrosses et al., 2005).
Liquid chromatography-MS is a particularly important additional, versatile technology for plants as it also provides a means to analyse many large groups of ‘secondary’ metabolites often present in plant tissues (Verhoeven et al., 2006). Advances in chromatographic technologies (e.g. the ultra-performance liquid chromatography (UPLC) system from Waters Corporation, Milford, MA, USA) together with advances in column chemistry (e.g. hydrophilic interaction chromatography (HILIC) and long monolithic columns) are yielding significantly improved separation potentials. The technology is inherently restricted to molecules which can be ionized, either as positively or negatively charged ions, before moving through the MS. A range of possible techniques is available in order to ensure the ionization of a broad range of (LC-separated) metabolites (e.g. electrospray ionization (ESI); atmospheric pressure chemical ionization (APCI); photo-ionization (PI)). The high analytical precision of modern LC techniques combined with the high sensitivity, mass accuracy and resolution of MS systems, in particular time-of-flight (TOF) and Fourier transform ion cyclotron resonance (FT-ICR) instruments, is proving very useful in the analysis of complex metabolite mixtures typified by plant extracts. Working in positive and negative ion modes can provide broader coverage of molecules which more readily either gain or lose a proton. Rather than performing separate analyses, some machines now have the capacity to switch continuously between positive and negative modes during each run. In contrast to GC-MS, few mass spectral libraries are available for LC-MS and this is a key topic being given considerable attention at present (Verhoeven et al., 2006).
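The relationship between a neutral metabolite and the ions observed in the two modes can be made concrete with a short calculation. This is an illustrative sketch only: it assumes singly charged ions formed by the gain or loss of a single proton and ignores other adducts, and the function names are hypothetical. The example mass is the monoisotopic mass of the flavonol quercetin (C15H10O7).

```python
PROTON_MASS = 1.007276  # Da; mass of a proton (a hydrogen atom minus one electron)


def mz_protonated(neutral_mass: float) -> float:
    """m/z of the singly charged [M+H]+ ion seen in positive mode."""
    return neutral_mass + PROTON_MASS


def mz_deprotonated(neutral_mass: float) -> float:
    """m/z of the singly charged [M-H]- ion seen in negative mode."""
    return neutral_mass - PROTON_MASS


quercetin = 302.042653  # Da, monoisotopic mass of C15H10O7
print(f"[M+H]+ = {mz_protonated(quercetin):.6f}")
print(f"[M-H]- = {mz_deprotonated(quercetin):.6f}")
```

The two signals differ by twice the proton mass, which is how a peak pair from a polarity-switching run can be traced back to one and the same neutral molecule.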
The strategy of multidimensional LC (MDLC)-MS, which is gaining increasing popularity for proteomics analyses (America et al., 2005), may, on modification, also have potentially valuable application in metabolomics but this has yet to be properly assessed.
Capillary electrophoresis (CE), a recently developed alternative separation technology, is growing in popularity when combined with MS for extra selectivity and sensitivity (Soga et al., 2003; Sato et al., 2004). The value of application with the less complex microbial extracts has already been clearly demonstrated and wider use with plants can be expected. High-resolution chromatographic separation and sensitive detection of water soluble extracts make a strong combination suitable for the analysis of a diverse range of primary and secondary metabolites (Sato et al., 2004). In Escherichia coli, for example, as approximately 80% of all the metabolites are charged, the combination of CE and MS gives significant coverage of the metabolome in a single analysis (M. Tomita, unpubl. conference commun., 2005). The same is also true for rice (Sato et al., 2004) and the speed and sensitivity of the technology bode well for broader application in the near future.
Fourier transform-ICR-MS (often shortened to FTMS) is a much discussed but, as yet, rarely used technology for plant metabolomic analysis (Brown et al., 2005). After a pause following the first paper on this topic by Aharoni et al. (2002), more recent applications are beginning to emerge (Hirai et al., 2004, 2005; Murch et al., 2004; Tohge et al., 2005). This technique has the advantage of providing more accurate molecular mass estimations through a high resolving power. The extremely high mass resolution means that high chromatographic resolution becomes less important and is often omitted (though isomers, which can be very abundant in plant extracts, consequently cannot be distinguished). The simultaneous analysis of up to 2400 compounds, of which 781 were putatively identified, has been reported (Murch et al., 2004). The recently introduced orbitrap FTMS instruments (Hu et al., 2005), which not only offer faster and more sensitive analyses but are also significantly less expensive than the cyclotron FTMS machines (Sumner et al., 2003), are also worthy of attention. The technology is predicted to have a great future in the field of metabolomics (Brown et al., 2005; Dunn & Ellis, 2005), although for many the inability to separate structural isomers which have identical mono-isotopic masses is still seen as a significant limitation to its application. Linking up to some sort of prior isomeric separation strategy will be required to overcome this.
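Why accurate mass alone cannot separate isomers, yet can separate compounds of the same nominal mass, can be shown with a small calculation. The sketch below is illustrative and is not taken from the cited studies: the element table is limited to four elements, the helper names are hypothetical, and the second composition (C7H16O5) is chosen only because it shares the nominal mass of 180 with the hexoses.

```python
# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def monoisotopic_mass(composition: dict) -> float:
    """Sum the monoisotopic masses for an elemental composition such as {"C": 6, ...}."""
    return sum(MONOISOTOPIC[element] * count for element, count in composition.items())


def ppm_difference(mass_a: float, mass_b: float) -> float:
    """Relative mass difference in parts per million, taking mass_b as the reference."""
    return (mass_a - mass_b) / mass_b * 1e6


glucose = {"C": 6, "H": 12, "O": 6}    # an aldohexose
fructose = {"C": 6, "H": 12, "O": 6}   # a ketohexose, structural isomer of glucose
same_nominal = {"C": 7, "H": 16, "O": 5}  # different formula, same nominal mass of 180

print(monoisotopic_mass(glucose) == monoisotopic_mass(fructose))  # isomers coincide
print(f"{ppm_difference(monoisotopic_mass(same_nominal), monoisotopic_mass(glucose)):.0f} ppm")
```

The two hexoses give exactly the same value because the calculation depends only on elemental composition, whereas the third composition lies roughly 200 ppm away and is therefore readily resolved by a high-resolution instrument. Separating the isomers still requires a chromatographic or other orthogonal step.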
Flow- or direct-injection/infusion (FI/DI)-MS is basically mass spectrometry without any effort being made for separation. A single mass spectrum is produced, usually using a TOF (Verhoeven et al., 2006) or FTMS instrument (Aharoni et al., 2002; Hirai et al., 2004) and this output represents a potentially valuable rapid qualitative screening tool. Run times of 1 min or even less have been reported (Goodacre et al., 2003). Despite potential limitations by matrix effects or ionization suppression, evidence is growing that such a technique can effectively be used, not as a quantification but as a comparative analysis screening tool appropriate for quality control, mutant screening, biodiversity analysis, etc.
Nuclear magnetic resonance seems currently to be the method of choice for medical metabolomics (metabonomics). For plants, its application has so far been limited because of its currently much poorer sensitivity compared to MS-based methods (Kaddurah-Daouk et al., 2004). Nuclear magnetic resonance analysis is a favoured choice for the major metabolites (Defernez et al., 2004) but many other important compounds will typically be missed as they occur in plant extracts at levels below the usual NMR detection threshold. It does, however, have the advantage that it is a more uniform detection system and can directly be used to identify and quantify metabolites, even in vivo. Ongoing improvements in instrumentation design will lead to increasing popularity of this approach and a full overview of the current potentials and limitations of NMR has recently been provided by Ratcliffe & Shachar-Hill (2005).
III. Mind the gap: data analysis, bioinformatics and statistics
Raw data is nothing but a poor relative of information and information is itself a giant leap away from knowledge (after Goodacre, 2005). With the capacity to interrogate biological systems never having been so great, the concomitant capacity for data generation has itself created a whole new field of science. What will inevitably prove pivotal to the success of plant metabolomics will be our ability to mine the data generated and to perform reliable, comparative analyses. In order to do this we have to develop novel tools for effective in silico handling of the data. This entails having the necessary software for (1) collecting and preprocessing the data (machine output) in a way which allows the direct comparison of datasets from comparative analyses; (2) processing and mining the data to extract those components of interest (contrasting components, significant quantitative differences, etc.); (3) being able to present complex data in a readily understandable way using dedicated visualization strategies; and (4) effective databasing for efficient data storage.
1. Data preprocessing
For the metabonomics analysis of metabolic perturbation in humans, data preprocessing has been described as being an essential prerequisite because pathophysiological disease generally involves complex marker patterns. As such, single biological markers (e.g. cholesterol) cannot be relied upon for effective diagnosis (van der Greef et al., 2004). The same holds true for plants where frequently the aim is to use metabolomics to assess broad developmental differences, phenotypic modifications or multifactorial responses of plants to abiotic or biotic stresses (see Fig. 1). Also, in the field of plant-based medicines, linking multicomponent preparations, such as traditional Chinese medicines, to a complex pattern of bioactivity will benefit greatly from a metabolomics approach (Wang et al., 2005).
Effective but cumbersome manual data handling is no longer feasible with the highly complicated, multidimensional metabolomic datasets now being produced. Consequently, dedicated bioinformatics and statistical tools are required to convert this data into information which will enable plant metabolomics to gain its true position as a major hypothesis generator on the way to us taking the last critical step towards expanding our knowledge of the chosen biological system. In addition, data visualization tools are essential if we are to cope effectively with, and properly understand, these complex datasets.
Intrinsic to basically any metabolomics approach are the unavoidable imperfections in data generation relating to chromatographic or instrumental shifts during or between measurements. These represent the first significant challenge to be overcome if we are to compare effectively large series of chromatograms or mass spectral outputs. Even multiple screening strategies involving relatively simple HPLC-PDA outputs require appropriate data preprocessing before reliable differential analysis. In their Intelligent Screening Strategy (ISS) for assessing underexploited biodiversity in fungi, Smedsgaard & Nielsen (2005) employed Correlation Optimized Warping (COW) for chromatographic matching before identifying discriminant metabolomic components between isolates. In plants, Hendriks et al. (2005) compared COW with other warping tools such as Dynamic Time Warping and Parametric Time Warping for their effectiveness when combined with weighted principal component analysis (PCA) for the analysis of willow bark extracts. Preprocessing tools are also required for GC/LC-MS datasets and different but comparable strategies are regularly appearing (Duran et al., 2003; Jonsson et al., 2004; Wiener et al., 2004; Broeckling et al., 2005; Goodacre, 2005; Katajamaa & Oresic, 2005). von Roepenack et al. (2004) have written their own software to replace manual data deconvolution of LC-MS datasets with an unsupervised automated approach for reliable comprehensive mass profiling and peak list generation without loss of reproducibility. SpecAlign (Wong et al., 2005) has also been reported as a graphical preprocessing computational alignment tool which can be effectively used for the simultaneous visualization and manipulation of multiple proteomics data sets but may be equally applicable for metabolomics output.
MetAlign software has been shown to be appropriate for both GC-MS and LC-MS datasets and, in addition to spectral alignment, this package also performs baseline correction and noise reduction, which are highly effective for reducing dataset size and thus also comparative analysis time (Vorst et al., 2005; Verhoeven et al., 2006).
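The idea behind the warping tools mentioned above can be illustrated with a minimal sketch of Dynamic Time Warping, one of the methods named in the comparison by Hendriks et al. (2005). This is not the code of any cited package, and COW itself works differently (it stretches and compresses segments to maximize correlation). The function name `dtw_path` and the toy signals are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def dtw_path(ref: Sequence[float],
             sample: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the DTW cost and the warping path aligning sample to ref."""
    n, m = len(ref), len(sample)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning ref[:i] with sample[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - sample[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a sample point
                                 cost[i][j - 1],      # repeat a reference point
                                 cost[i - 1][j - 1])  # advance both signals
    # Backtrack from the end to recover which points were matched.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy chromatograms: the same peak, eluting one time point earlier in the sample.
reference = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0]
shifted = [0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
total_cost, path = dtw_path(reference, shifted)
```

Each pair `(i, j)` in `path` matches time point `i` of the reference to time point `j` of the sample, so the peak apex in the two runs is paired despite the retention time shift. Real chromatograms would normally need baseline correction and a constraint on how far the path may stray from the diagonal.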
2. Data mining and visualization
In the quest to translate abstract machine output into biologically relevant leads, effective discriminant analysis following data preprocessing requires new concepts involving some kind of statistical filtering before differential comparisons can be made and reliable conclusions drawn. Unsupervised approaches for discriminatory analyses such as PCA and hierarchical clustering (HCA) or supervised approaches such as partial least squares (PLS) or SIMCA are the simplest and most widely used at present (see Fig. 2). More advanced and detailed approaches will, however, be required to extract fully all the wealth of information which is embedded in metabolomics data sets. Such high-dimensional data will benefit greatly from the evolutionary computation-based methods and genetic search algorithms, as described by Goodacre (2005). Deconvolution of data in terms of which differential metabolites are of key biological importance is essential if we are to generate new knowledge. It is developments in this area in particular which will make or break metabolomics as a widely applied technology and this has been the topic of several recent papers and reviews (Kell, 2004; see also various chapters in Harrigan & Goodacre, 2003; Vaidyanathan et al., 2005; Saito et al., 2006). For GC-MS datasets, one recent approach describes how to ‘reconstruct’ metabolites from the measured molecular fragments in mass spectra as a means to predict automatically which metabolites were present in the original sample (Tikunov et al., 2005). In this way it was predicted from a volatiles analysis of 94 tomato varieties that 322 volatile components were represented in the entire dataset generated. Checking these predictions using analysis of individual pure standard compounds confirmed a high degree of reliability. An alternative strategy for GC-MS analysis of poplar tissues has been presented by Jonsson et al. (2004).
Here, a procedure for data preprocessing is presented which could save time when comparing large numbers of datasets and could be performed as a useful preliminary procedure before deconvolution of preselected chromatographic regions.
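As a minimal illustration of the unsupervised analysis referred to above, the following sketch extracts the first principal component from a small samples-by-metabolites intensity table. It uses power iteration on the covariance matrix only to stay free of external dependencies; this is an illustrative choice, not the method of any cited study, and the function names and toy data are assumptions.

```python
import math
from typing import List


def column_means(data: List[List[float]]) -> List[float]:
    """Mean intensity of each metabolite (column) across samples (rows)."""
    n = len(data)
    return [sum(row[k] for row in data) / n for k in range(len(data[0]))]


def first_principal_component(data: List[List[float]],
                              iterations: int = 200) -> List[float]:
    """Unit-length loading vector of the first principal component."""
    n, p = len(data), len(data[0])
    means = column_means(data)
    centred = [[row[k] - means[k] for k in range(p)] for row in data]
    # Sample covariance matrix between metabolites (p x p).
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration: repeated multiplication converges on the dominant
    # eigenvector, provided the start vector is not orthogonal to it.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pc_scores(data: List[List[float]], loading: List[float]) -> List[float]:
    """Project each mean-centred sample onto the loading vector."""
    means = column_means(data)
    return [sum((row[k] - means[k]) * loading[k] for k in range(len(loading)))
            for row in data]


# Toy table: rows are samples, columns are three hypothetical metabolites.
samples = [
    [10.0, 5.0, 1.0],   # genotype A, replicate 1
    [11.0, 5.5, 1.1],   # genotype A, replicate 2
    [2.0, 1.0, 1.0],    # genotype B, replicate 1
    [1.0, 0.5, 0.9],    # genotype B, replicate 2
]
loading = first_principal_component(samples)
scores = pc_scores(samples, loading)
```

In this toy table the two genotypes fall on opposite sides of zero along the first component, and the loading vector indicates which metabolites drive that separation. Real datasets would normally be scaled first and analysed with an established statistics package.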
Data visualization is another underestimated topic which requires extensive attention. It is only when we have effective visualization tools that we shall be able to simplify and more easily comprehend the multidimensional complexity of the growing sets of data that we are generating (Kell, 2002). New concepts are needed and, in this regard, effective visualization of pathways, for example, using fully flexible, readily updateable models will be essential. Various tools have already been developed such as KEGG (Kanehisa et al., 2002), AraCyc (Mueller et al., 2003) and MetNet (Wurtele et al., 2003). Recently, Lange & Ghassemian (2005) described another visual interface, BioPathAtMAPS, which enables the integration of biochemical pathway maps with a gene function database, BioPathAtDB. Tools such as MapMan, developed initially for microarray data (Thimm et al., 2004; Usadel et al., 2005), are proving highly effective for the simple visualization of gene expression data and MapMan is already being used with metabolomics datasets. However, as with all such tools, success will remain limited to those metabolites which can be definitively identified in metabolomics outputs and for which the biochemical pathways have already been confirmed (see the Potential and limitations section below).
3. Data storage and database building
Key to effective data storage will be the ability to minimize the volume of data to enable rapid multiplex sample comparisons while still retaining all the data needed to describe fully the differences detected. Currently, various approaches are being tested for their effectiveness and attention is also being given to the extent and type of metadata which needs to be stored for current and future reference. While most experiments are usually performed with a specific aim in mind, the metabolomics data generated have a potentially much broader application. This data could be used for future investigations, provided that it is then still possible to access all associated information concerning, for example, type of material, growing conditions and method of sample preparation/extraction. A standard Minimum Information About a Microarray Experiment (MIAME)-type approach is to be recommended and a number of position papers have already appeared from the plant community as an effective starting point for this (Bino et al., 2004; Jenkins et al., 2004, 2005). The metabonomics community is also active and information from the SMRS (Standard Metabolic Reporting Structures) group has recently been published (Lindon et al., 2005). Despite working with highly contrasting biological systems, both communities could benefit greatly from collaborative efforts in this area and moves to this end are already in progress, as was recently illustrated at the First Metabolomics Congress in Japan in June 2005 and at the National Institutes of Health (NIH)-sponsored metabolomics standards workshop in Washington in August 2005.
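To make the metadata requirement concrete, the sketch below shows one possible shape for a per-sample record and a simple completeness check. The field names are hypothetical and are drawn only from the examples given in the text (type of material, growing conditions, preparation/extraction method); they do not reproduce any published reporting standard.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class SampleMetadata:
    """Hypothetical minimal description stored alongside each dataset."""
    sample_id: str
    material_type: str        # e.g. species, genotype, tissue
    growing_conditions: str   # e.g. glasshouse, light regime, temperature
    extraction_method: str    # sample preparation/extraction protocol
    analytical_platform: str  # e.g. GC-MS, LC-MS, NMR


def missing_fields(record: SampleMetadata) -> List[str]:
    """Names of fields left empty, so incomplete records can be flagged."""
    return [f.name for f in fields(record)
            if not str(getattr(record, f.name)).strip()]


record = SampleMetadata(
    sample_id="S001",
    material_type="tomato fruit, ripe",
    growing_conditions="",  # deliberately left blank
    extraction_method="methanol extraction",
    analytical_platform="LC-MS",
)
incomplete = missing_fields(record)
```

A check of this kind, run when data are deposited, is one simple way of ensuring that a dataset remains interpretable in later, unrelated investigations.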
Building chemical reference databases is also a topic of huge importance as metabolite identification is still considered a serious limitation (Fernie et al., 2004; see later). While LC-MS is lagging behind for various technology-based reasons, the first steps have been taken towards establishing an open access plant metabolite database for GC-MS (Kopka et al., 2005). In addition, following on from similar initiatives in the microarray and proteomics fields, Schauer and colleagues have also reported on the initiation of a supervised GC-MS database comprising 2000 fully evaluated datasets (Schauer et al., 2005a). The same group is currently further expanding their GMDB database of reference compounds (Schauer et al., 2005b). Input from multiple research groups is essential for the success of such initiatives, as only in this way can we compensate for the limited and focused input which single groups provide. National and cross-national (e.g. European Union (EU)) initiatives are, thankfully, already underway.
IV. Potential and limitations
As in many areas of science, the primary limitation at all levels of metabolomics application is ignorance: specifically, our basic lack of knowledge of many (secondary) metabolic pathways in plants and our current inability to identify probably the vast majority of metabolites synthesized in the plant kingdom (Fernie et al., 2004). A key factor in this regard is the lack of available reference standards for many of the (secondary) metabolites of which plants are an especially rich source. While metabolic fingerprinting will have broad application, in most instances scientists will wish to proceed to the identification of at least the key differential metabolites. As such, metabolomics, in the true sense of the word – the nontargeted analysis of complex biochemical mixtures – is an essentially ‘blind’ approach which lends itself very effectively as a starting point for the comparative analysis of such mixtures without the need for previous knowledge of their composition. Furthermore, our knowledge of pathways will also increase directly as metabolomics data itself is increasingly being used as a prediction tool for metabolic interrelationships. Correlation analyses will, in many cases, reveal direct biochemical pathway links. Metabolomics shall, therefore, assist in closing knowledge gaps and help us build up an ever improving picture of the pathways which go to form the complete metabolic network within an organism.
The two aspects which are key to all of this and which currently represent the two main bottlenecks in the field are first, our limited ability, as detailed above, to identify chemically many of the components which can now be detected. Second, and related to this, not having known reference compounds for all these components represents a major limitation to their definitive quantification. An ability to quantify discriminative components is essential if we are to exploit fully metabolomics datasets beyond profiling and particularly when we wish to investigate topics such as metabolic flux (Fernie et al., 2005).
Although the required reference standards are still lacking, a potentially effective circumvention strategy has been devised. Saturated in vivo labelling of reference samples, using stable isotopes such as 1H and 13C, is being used as an aid to relative component quantification in comparable samples. Such mass isotopomer ratios have been used by Birkenmeyer et al. (2005) and Kikuchi et al. (2004) for quantification of components, even those which are as yet unknown. Fernie et al. (2004) have also speculated on the potential of ICAT (isotope coded affinity tagging), which is increasingly being used for complex proteomics analyses, as an effective tool for metabolomics quantification.
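The arithmetic behind such isotope-based relative quantification can be sketched as follows, assuming that the same fully labelled reference extract is mixed into every sample before analysis. The function names, the component label and the intensities are hypothetical; the cited studies may compute their ratios differently.

```python
from typing import Dict, Tuple

# For each component: (unlabelled peak intensity from the sample,
#                      labelled peak intensity from the spiked reference)
PeakPair = Tuple[float, float]


def relative_abundance(unlabelled: float, labelled: float) -> float:
    """Sample signal expressed relative to the co-measured labelled reference."""
    if labelled <= 0.0:
        raise ValueError("labelled reference signal must be positive")
    return unlabelled / labelled


def fold_changes(sample_a: Dict[str, PeakPair],
                 sample_b: Dict[str, PeakPair]) -> Dict[str, float]:
    """Fold change (a relative to b) for components detected in both samples.

    Assumes a non-zero unlabelled signal in sample_b for every shared component.
    """
    result = {}
    for name in sample_a.keys() & sample_b.keys():
        ratio_a = relative_abundance(*sample_a[name])
        ratio_b = relative_abundance(*sample_b[name])
        result[name] = ratio_a / ratio_b
    return result


# The component need not be identified: a mass/retention-time label suffices.
run_a = {"unknown_m301_rt12.4": (300.0, 150.0)}
run_b = {"unknown_m301_rt12.4": (100.0, 200.0)}
changes = fold_changes(run_a, run_b)
```

Because the labelled reference is measured in the same run as the sample, differences in instrument response between runs cancel in each ratio, which is what allows unidentified components to be compared without an authentic standard.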
A final limitation that deserves a mention is that current methodologies require tissue extracts to be made in order to profile metabolites effectively. Ideally, we would like to be able to perform metabolomics analyses at the cellular and even subcellular levels. This is particularly important as subcellular compartmentation of metabolites plays a major role in determining pathway activity, metabolic flux, etc. and the total content of a metabolite, as is currently measured, takes no account of the potential physical separation of the major subcellular pool of a particular metabolite being present in, e.g. the vacuole and thus not being available to its corresponding modifying enzyme(s). Very significant technological advances in metabolomics hardware will be needed before such (sub)cellular metabolomics becomes feasible.
V. Applications
Plant metabolomics has a potentially broad field of application and this is reflected by the variety of scientific publications which have been emerging over the last year. As well as being a valuable tool for fundamental science, examples are already emerging where metabolomics is being used in applied situations concerning, for example, monitoring crop quality characteristics (Hall et al., 2005) or identifying potential biochemical markers to detect product contamination and adulteration (Reid et al., 2004). This section discusses a number of the main areas where recent publications have shown how metabolomics is expanding our general knowledge of plant materials and illustrating how organisms function as integrated biological systems.
1. Genotyping and phenotyping (metabotyping)
An area where metabolomics approaches will likely prove indispensable in the coming years concerns the need both to define plant genotypes better and to relate observations to phenotype. Considering our current capacity for full genome sequencing, this is particularly the case when we wish to proceed with genomic annotation and link gene sequences to function. Metabolomics can provide a useful complement to the existing functional genomics technologies in this regard. Furthermore, it will be essential when the aim is to characterize the majority of so-called ‘silent plant phenotypes’ (Weckwerth et al., 2004) or correlate function with ‘orphan genes’ (Goodacre, 2005). Such metabolomic approaches can also be made more effective when they are combined with supplementary analyses such as robotic large scale enzyme analysis (Gibon et al., 2004) in order to build up an overview of plant metabolism in ever-increasing detail. It can be anticipated that more directed analyses of specific groups will be developed (as in ‘Lipidomics’ in humans covering specifically the entire fatty acid and oil content: see http://www.lipidmaps.org) but, in all cases, the primary goal remains the same: to expand the battery of fundamental information concerning how plants are composed biochemically and how this is controlled.
Sato et al. (2004) have used CE/PDA/MS to characterize primary metabolism in rice and specifically concentrated on the tricarboxylic acid (TCA) cycle, amino acid metabolism, glycolysis and the pentose phosphate pathway. Using leaf material they were able to pick out and follow simultaneously, 88 of the primary metabolites involved in these processes. When profiling metabolites in potato tubers, Parr et al. (2005) were able to identify the presence of four phenolic amines previously undetected within the Solanaceae. Liquid chromatography-MS/MS was used to confirm their identity and subsequently it could be shown that these compounds were also present in tobacco and tomato. Tohge et al. (2005) used both metabolomics (FTMS and LC-MS) and transcriptomics to characterize Arabidopsis plants overexpressing the MYB transcription factor PAP1. They were able to putatively identify and follow c. 1800 metabolites and correlate these with gene expression. In total, 38 genes were identified which had a modified expression pattern associated with the transgene and some of these had putative functions correlating with the detected changes in metabolite profile. Slowly but surely the application of metabolomics technologies is gaining in popularity for genotyping/phenotyping studies and it is anticipated that this will continue in the near future.
2. Population screening
As many metabolic differences between genotypes are not visible externally, and minor qualitative changes in metabolite profile can still mean differences of major biological importance, there is a great demand for a rapid screening system which both covers a broad chemical range and has the capacity to cope with large to very large sample numbers (Weckwerth et al., 2004). In this way we can consider using metabolomics to screen breeding progeny as a tool for selection, or to screen natural or induced mutant populations or populations of randomly genetically modified plants to find the small number of individuals which have significantly modified biochemical composition. All the various detection technologies are proving appropriate. For example, Jander et al. (2004) have used a rapid LC-MS protocol to screen 10 000 Arabidopsis mutants in 6 months. Using ICP-MS, Lahner et al. (2003) screened trace element and mineral composition to identify 51 confirmed mutants in a population of 6000 Arabidopsis plants. Kikuchi et al. (2004) describe how advances in proton NMR will lead to rapid 5 min analyses of crude plant materials which will give information on 500–1000 different metabolites. Fourier transform-MS has also been used by Aharoni et al. (2002) and Murch et al. (2004) to identify putatively more than 2000 mass signals in strawberry fruits and the medicinally important Chinese species Scutellaria baicalensis, respectively. Short measurement times enable the rapid screening of multiple extracts giving an exceptionally broad overview of the chemistry. It is particularly in this area where the further development of data management and mining tools is most essential and will prove most effective (von Roepenack et al., 2004; Verhoeven et al., 2006).
3. Understanding physiological processes
Plant physiology is a complex phenomenon which is dependent on multiple genetic, physical and chemical environmental factors interacting at many levels in relation to the tissues and organs involved, the developmental stage of the plant, its cultivation history, etc. Many changes occur in an essentially nonindependent manner and as such are likely to be reflected in ways representing both direct and indirect chemical perturbations. How plants respond to biotic and abiotic stress is a clear illustration of this. To survive, a plant, on exposure to suboptimal growth conditions, must initiate protective responses at the subcellular level upwards. These are potentially of such a fundamental nature that multiple interrelated pleiotropic biochemical changes take place. Exactly which changes occur, and how, are important questions to be answered if we wish to understand stress responses fully and how strategies to make (crop) plants more stress resistant might be designed. Hirai et al. (2004, 2005) used FTMS in combination with HPLC to analyse the effect of sulphur/nitrogen (S/N) stress on Arabidopsis plants. They showed that many significant changes could be observed, some of which, such as those in glucosinolate metabolism, demonstrated a considerable degree of coordination. Nikiforova et al. (2004) describe how the effect of sulphur starvation on Arabidopsis can be studied using a combined transcriptomics–metabolomics approach to build up a more holistic picture of nutrient stress in plants. In vitro Medicago cultures have also been used to study and compare the effects of different stress-related biotic and abiotic elicitors on cell response (Broeckling et al., 2005). Marked changes in several primary metabolites were observed using both GC and LC-MS, some of which were common but differed in attenuation while others appeared specific to the type of elicitor used. These authors suggest that a ‘metabolic reprogramming’ occurs, which again points to some kind of coordinated response.
Other types of physiological perturbation can also be effectively analysed using multiplex metabolomics approaches. Bino et al. (2005) used both targeted and nontargeted GC-MS and LC-MS analyses of primary and secondary metabolites (including the natural volatiles) to define the pleiotropic effects of a single gene mutation in tomato fruit. Nunes-Nesi et al. (2005) used a transgenic approach to perturb metabolism in tomato plants and subsequent metabolomics analysis was used to help define the resulting changes in growth and plant physiology.
A final example concerns a field where plant metabolomics is predicted to become invaluable because of its nontargeted nature. Modes of action of herbicides determine how plants respond to these chemicals and can be used to predict their suitability for application. Discovering new herbicides with novel modes of action is a major goal in the quest to develop durable strategies for weed control. Grossman (2005) details how the search for new active herbicidal compounds will be greatly advanced by combining metabolomics approaches with arrays of functional bioassays in order to differentiate between responses and thus identify new lead compounds.
4. Biomarkers and bioactivity
The chemicals present in plants generally have an in vivo function related to their biochemical structure and thus their bioactivity. The most obvious example is the enzymic proteins but primary and secondary metabolites can also be bioactive in that they, for example, act as free radical scavengers (antioxidants), bind to proteins, stimulate olfactory cells of insects, etc. These bioactivities can be of great relevance regarding how plants are of importance to, or are used by, humans. Some may contain components which have potentially health-promoting properties (Chinese medicines, plant infusions with anticancer activity, antioxidant activity, etc.). Others can function, for example, as attractants or repellents. Interest in this area, and in particular in a potential link between food and health, is growing. This is partly the result of increasing consumer demand for healthier food products and partly due to companies wishing to link their products to health claims. In this regard, metabolomics has already been employed to characterize novel plant products such as mutant or modified tomatoes (Giovinazzo et al., 2005; Bino et al., 2005; Davuluri et al., 2005). Davuluri et al. (2005) reported how a targeted genetic modification in tomato can result in modifications to both health-related carotenoid and flavonoid-based antioxidants (Dixon, 2005). Giovinazzo et al. (2005) also used metabolite profiling to identify novel antioxidants in transgenic tomatoes supplied with the stilbene synthase gene from grape. Differences between green and black tea infusions have been characterized and key differences were found between the antioxidant flavonoids specifically of the flavan-3-ol type. These are likely to be directly related to the known difference in antioxidant properties between green and black tea (Del Rio et al., 2004).
Willow bark extracts and Chinese medicines have also been the subject of preliminary metabolomic analyses (Hendriks et al., 2005; Wang et al., 2005). The latter report that metabolomics will be key to new lead discovery and will help validate drug targets and efficacy. Consequently, it is through approaches like these that the concept of personalized medicines will approach reality.
An interesting expansion on the ‘standard’ analysis techniques is to link bioactivity to the metabolic profile directly ‘on line’. Beekwilder et al. (2005) described a system for determining antioxidant activity in plant extracts directly after LC separation, on line, so that in contrast to the standard measurement of total antioxidant activity, the activity of the individual components can be determined separately. In this way, antioxidant activity can be followed in parallel with the biochemical profile and thus we can ascertain which qualitative/quantitative differences between samples correlate most with changes in antioxidant capacity. While such a system is limited in terms of throughput it will prove very useful for preselected samples for which a more detailed bioactivity/biochemical analysis is desired.
5. Quality and breeding
The potential value of a metabolomics input in the field of plant product quality and the development of targeted breeding strategies for plant improvement was covered extensively in a previous paper (Hall et al., 2005). It is already evident that detailed information on biochemical composition will not only be of primary importance in defining what is actually meant by certain quality aspects of plants (taste, flavour, shelf-life and disease resistance) but also will have considerable predictive value. Broad metabolic screening creates the possibility to identify biochemical markers which are directly associated with particular traits and as such may be used as an additional tool in progeny selection (Overy et al., 2005; Schauer et al., 2005b). Lui et al. (2005) also report on how volatile metabolic profiling can discriminate potato tubers either free or infected with different spoilage pathogens. This approach can therefore be used as a predictive, management decision tool determining marketing strategy (the ‘keep or sell’ approach).
6. Substantial equivalence
Substantial equivalence is a concept which has been designed to help define the nature and extent of differences between samples. While it can also be used to characterize differences between genotypes, or processing methods, or plant material sources, etc., it is most associated with the desire to assess the degree of similarity/difference between transgenic and wild-type plants. Concern for ‘unseen or unexpected’ effects derived from a genetic modification (GM) event has given rise to the desire for untargeted analyses. Metabolomics is one approach which is being considered in this regard. Initial results indicate that few such changes are in fact observed. For example, Arabidopsis transgenics accumulating large amounts (4% DW) of dhurrin showed no detectable changes in the associated amino acid pools (Kristensen et al., 2005; Memelink, 2005). Similarly, Defernez et al. (2004) used NMR and HPLC-UV to compare transgenic and wild-type potato varieties and observed that the differences between standard varieties were always significantly greater than the differences between the wild-types and their respective transgenics. Similar conclusions were reached by Catchpole et al. (2005) using a different approach. The question remains, however, to what extent the whole concept is perhaps flawed in that it is difficult to link potential biological significance to any changes observed.
VI. Data integration, metabolic networks and systems biology
All of the applications and examples detailed above indicate how metabolomics is gaining popularity as a functional genomics tool useful in broadening our knowledge of biological systems. Despite essentially providing us only with metabolic snapshots, a consequence of this approach, when broadly applied, is that we can rapidly build up multiple datasets containing extensive information on the biochemical differences between genotypes, following different treatments, during development, after genetic modification, etc. A concomitant growing ability to integrate and collate this information is giving us a unique insight into how metabolic pathways are actually highly interactive rather than operating as separate units. In every biological system there is a metabolic network in place which is a highly flexible and inherently compensatory mechanism. This network is likely to be one of the primary reasons why, for example, many knock-out or antisense events often yield no immediate phenotype (Weckwerth et al., 2004). For this reason also, Trethewey (2004) raises the issue that the network model of plant metabolism implies that the whole concept of ‘biochemical engineering’ is fundamentally flawed and will only work effectively when we fully understand the metabolic pathway concerned and can design a dedicated GM strategy accordingly. Knowledge of how this network operates and how it might be influenced in a particular system is essential. A better ability to characterize pleiotropic effects will lead to an enhanced understanding of the organism as a whole. By comparing different source materials, or materials in functional and dysfunctional states, a picture of commonality can be constructed which may be used to predict metabolic relationships. Desbrosses et al. (2005) used a metabolomics approach to compare different tissues in Lotus japonicus and they were able to define discrete metabolic phenotypes for each.
In addition, they have also begun to build up a picture of the development of root nodules and thus help us towards a better understanding of symbiotic nitrogen fixation in legumes. Even when metabolic differences are subtle and, taken individually, are statistically insignificant, techniques such as multivariate clustering and the analysis of pairwise, linear metabolic correlations can still reveal significant metabolic perturbations in a network context. This network topology can be predicted to reflect the underlying regulatory pathway structure (Weckwerth et al., 2004).
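The analysis of pairwise, linear metabolite correlations can be sketched as follows: compute the Pearson correlation for every pair of metabolite profiles and keep the strongly correlated pairs as edges of a putative network. The threshold of 0.9, the function names and the toy profiles are illustrative assumptions, not values taken from the cited work.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(profiles: Dict[str, List[float]],
                      threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Metabolite pairs whose absolute correlation reaches the threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges


# Toy profiles: one value per sample for three hypothetical metabolites.
profiles = {
    "metabolite_A": [1.0, 2.0, 3.0, 4.0],
    "metabolite_B": [2.0, 4.0, 6.0, 8.0],   # tracks A across samples
    "metabolite_C": [1.0, -1.0, 1.0, -1.0],  # unrelated to A and B
}
edges = correlation_edges(profiles)
```

Edges found in this way are candidate relationships only. A strong correlation may reflect a direct pathway link, shared regulation or a common response to the sampling conditions, and with many metabolites some correction for multiple testing would be needed.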
This approach can now also be taken to another level by combining metabolomics data with directly comparable data from other functional genomics techniques, provided that it comes from identical plant material. This so-called ‘Systems Biology’ approach is gaining popularity, especially in fundamental research on Arabidopsis, where almost full genome microarrays are already available. Hirai and colleagues used targeted and nontargeted metabolomics together with transcriptomics, in an integrated analysis involving a batch learning, Self Organising Map (SOM) approach, to build up a more global picture of stress response in plants and to link genes to putative function (Hirai et al., 2004, 2005). Merke et al. (2004) also used a combination of microarray analysis and metabolomics to link gene to function in cucumber and in so doing were able to identify two new terpene synthases, confirm their function and define their products.
The umbrella concept of systems biology (Brown et al., 2005) clearly has huge potential regarding our desire for a full understanding of how organisms function. Information input from multidisciplinary sources such as statistics, functional genomics, physiology, etc., is needed to do this (van der Greef et al., 2004). Nevertheless, while the concept fits perfectly for microbes, a systems biology approach for multicellular organisms is considerably more complicated and brings with it an additional set of challenges which need to be tackled in the near future (Nicholson et al., 2004).
VII. Where do we go from here? Bright prospects for the future
One thing which is already evident is that, while we have come far, there is still a lot to be done before we can exploit fully the metabolomics data we can already produce. With continually advancing technologies and increasingly sophisticated bioinformatics tools at our disposal, data production capacity will only increase and, without care, we risk losing sight of the primary goal – to advance knowledge rather than generate data. Already there is talk of cellular and even subcellular metabolomics but we first need to establish proper strategies for technology application. New applications are also being envisaged in areas such as ecology, biotic interaction studies, climate change effects and evolutionary aspects of plant communities in relation to, for example, biodiversity and its desired preservation.
Metabolomics must become a full scale discovery platform which supports and feeds into complementary, parallel activities to advance our understanding of multidimensional plant systems. It has huge potential to have both diagnostic and prognostic value. The tasks ahead are considerable and, having learned from previous functional genomics activities, we are strongly encouraged to implement, as soon as possible, community-based, multinational efforts (Bino et al., 2004) to create the critical mass necessary to tackle the challenge and to facilitate the establishment of consensus regarding standard operating procedures (SOPs) for data acquisition, storage, handling, etc. Multidisciplinary and multiparallel approaches are required and these must be performed in a cohesive and coordinated manner. The plant, human and microbial metabolomics communities should be encouraged to interact more in this regard. We also urgently need tools for metabolite identification to enable a more defined, comprehensive coverage of the metabolome and this is perhaps where the greatest emphasis should lie. Full identification and quantification will also bring us to another level and shall enable, for example, detailed flux analyses in a network context. The challenge is there and many are ready and willing to take it up.
The author acknowledges funding support from Plant Research International and The Centre for BioSystems Genomics. The latter is part of the Netherlands Genomics Initiative (NGI) and the Netherlands Organization for Scientific Research. Raoul Bino and Ric de Vos are thanked for their comments on the manuscript.