Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data

Authors

  • Max Bylesjö,

    1. Research group for Chemometrics, Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden
    • These authors contributed equally to this work.

  • Daniel Eriksson,

    1. Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Umeå, Sweden
    • These authors contributed equally to this work.

  • Miyako Kusano,

    1. Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Umeå, Sweden
    2. RIKEN Plant Science Center, 1-7-22 Tsurumi, Yokohama City, Kanagawa, 230-0045, Japan
  • Thomas Moritz,

    1. Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Umeå, Sweden
  • Johan Trygg

    Corresponding author
    1. Research group for Chemometrics, Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden
      (fax +46 (0)90 138885; e-mail johan.trygg@chem.umu.se).

Summary

The technological advances in the instrumentation employed in life sciences have enabled the collection of a virtually unlimited quantity of data from multiple sources. By gathering data from several analytical platforms, with the aim of parallel monitoring of, e.g. transcriptomic, metabolomic or proteomic events, one hopes to answer and understand biological questions and observations. This ‘systems biology’ approach typically involves advanced statistics to facilitate the interpretation of the data. In the present study, we demonstrate that the O2PLS multivariate regression method can be used for combining ‘omics’ types of data. With this methodology, systematic variation that overlaps across analytical platforms can be separated from platform-specific systematic variation. A study of Populus tremula × Populus tremuloides, investigating short-day-induced effects at transcript and metabolite levels, is employed to demonstrate the benefits of the methodology. We show how the models can be validated and interpreted to identify biologically relevant events, and discuss the results in relation to a pairwise univariate correlation approach and principal component analysis.

Introduction

In the post-genomics era, the development of life science technologies that enable transcriptomic, proteomic and metabolomic events to be analyzed in detail in the same biological systems has revolutionized biological studies. Instead of relating biological observations to a small number of variables, it is now possible to study biological systems with a global analytical approach. This is sometimes called systems biology, and the purpose is to study organisms as integrated systems of genetic, protein, metabolic, cellular, and pathway events.

A systems biology approach based on data collected from many different analytical platforms involves advanced statistics to be able to interpret the data. Numerous examples exist in the literature regarding the use of data from parallel sources (e.g. Carrari et al., 2006; Clish et al., 2004; Gygi et al., 1999; Hirai et al., 2004, 2005; Kleno et al., 2004; Kolbe et al., 2006; Oresic et al., 2004; Rischer et al., 2006; Tohge et al., 2005). In the majority of studies, pairwise univariate correlations between measured variables in different datasets have been utilized to elucidate joint effects and underlying mechanisms. In the plant field Professor Saito’s group have made substantial contributions (Hirai et al., 2004, 2005; Tohge et al., 2005), primarily concerning the integration of transcript and metabolite data for Arabidopsis thaliana. The general approach has been to analyze and interpret the data sources in parallel, formulate hypotheses independently for each platform and finally outline a consensus theory based on prior knowledge of existing pathways to unravel novel trends and processes.

The main intricacies associated with a unified approach lie in the subsequent analysis and interpretation, i.e. in unwrapping systematic and biologically relevant information from noisy, high-order data structures collected at several levels. Here we describe a general methodology that is complementary to previously outlined approaches, with the potential to provide further information and insights into the field of plant biology. The strategy builds on recent advances in multivariate regression methods, specifically the O2PLS method (Trygg, 2002; Trygg and Wold, 2003), which has the capacity to integrate data from multiple sources (e.g. data collected from different analytical platforms). This enables one to separate joint (overlapping) information across multiple analytical platforms from systematic variation that is unique to each platform. The O2PLS method identifies systematic trends across any pair of datasets, for instance the transcript and metabolite data used in the present work. This process is schematically outlined in Figure 1. A recent study investigating metabonomic and proteomic correlations in mouse samples using this approach is available in the literature (Rantalainen et al., 2006).

Figure 1.

 Overview of the O2PLS method.
Different representations of the model structures of the O2PLS method and the practical set-up of the data are depicted. In (a), the six different model structures of the O2PLS model are described for the particular example of integrating transcript with metabolite data. Using the O2PLS methodology, it is possible to separate the predictive variation (e.g. used to predict metabolite levels from transcript profiles) from the variation that is unique to each platform as well as residual variation. In (b), different plant samples, exemplified by wild-type and mutant genotypes, are measured using several analytical platforms, exemplified by cDNA microarrays and GC/MS.

The fundamentals of the O2PLS method build on the work by Trygg and Wold (2002), who outlined orthogonal projections to latent structures (OPLS), a supervised multivariate regression method. OPLS combines the existing theory of partial least squares (PLS) regression (Wold et al., 1984, 2001) and orthogonal signal correction (OSC) (Wold et al., 1998). OPLS is primarily used when one has a single data source, e.g. transcript data from cDNA microarrays, and wants to study its relationship with some other property, for instance the concentration levels of a set of compounds. One can then separate information that is related to these concentration levels from other sources of variation that are unrelated to the problem formulation, for instance overall disparities for each microarray slide caused by production inaccuracies. OPLS can also be used for discriminant analysis (OPLS-DA) to differentiate between classes (Bylesjö et al., 2006).

The potential of the OPLS and O2PLS methodologies for integrating transcript and metabolite data becomes apparent when considering the following hypothetical example. Suppose that a dataset consisting of peak-resolved GC/MS data has been collected in order to investigate the metabolomic differences between two plant genotypes. During the data collection, however, treatment-independent systematic noise (baseline fluctuations) has affected the intensity values of the spectra, which introduces false univariate correlation patterns. Using OPLS, the data will be characterized by multivariate latent structures that describe these properties independently from one another, which is feasible because the baseline fluctuations are independent (orthogonal in mathematical notation) of the genotype effect. It will thus be possible to interpret the covariance (predictive) structure related to the genotype effect and the baseline effect separately. If we also generate transcriptomic data from the same biological samples, the O2PLS methodology would be useful to complement the analysis by discovering trends between the metabolite (GC/MS) and transcript (microarray) datasets (Figure 1a–b). This has the potential to provide additional information in order to enhance the understanding of a possibly complex biological process.

Characteristics of the O2PLS model

As previously stated, O2PLS identifies joint variation between two datasets, in this case transcript and metabolite data, as well as systematic variation that is unique to each dataset (Figure 1a). Each of these sources of variation is composed of smaller entities referred to as latent variables (Kvalheim, 1992), which describe independent effects in the data. The O2PLS method finds systematic trends across two datasets, which could be any pair of datasets. The effect of applying O2PLS to a problem of this form is described schematically in a simplified form in Figure 1(b) for the special case of integrating transcript and metabolite data for a plant experiment. Although the O2PLS method typically relies on terminology from the field of linear algebra, we will aim to avoid the use of complex mathematical notation whenever possible. When such notation is required to make a clear point, we will attempt a parallel generalized explanation for clarity. A more detailed explanation of the underlying mathematics is partly available in the Supplementary material, but is given more comprehensively in the works of Trygg (Trygg, 2002; Trygg and Wold, 2003).

As seen in Figure 1(a), the O2PLS model contains six distinctive structures. Without a loss of generality, we will assume that one dataset contains transcript data and the other contains metabolite data, although there is no general restriction towards usage of these particular data types. The transcriptomic dataset is decomposed into three distinct structures in the O2PLS model. One is transcript-predictive, i.e. describes gene expression profiles that are useful for predicting metabolite profiles. There is also a transcript-unique structure, which describes systematic effects in the transcript data that are not useful for predicting the metabolite profiles. As this structure describes systematic changes in the transcript levels that have no general correlation patterns with the metabolites, it could be linked, for example, to systematic array bias. Finally, the remaining variation in the transcript data ends up in the transcript residual: capturing stochastic or noise-effects. The corresponding structures for the metabolite dataset will be denoted metabolite predictive, metabolite unique and metabolite residual, respectively. The term joint variation, which will be used throughout the paper, refers to the transcript-predictive structure together with the metabolite-predictive structure (Figure 1a). The residual structures will generally be excluded from illustrations and elaborations regarding the results.

All of the different model structures maintain the dimensionality of the original datasets. This implies that one can interpret the most dominating correlation and covariance trends both in the sample directions (for instance, to identify interesting tendencies, clusters among samples or highly deviating biological replicates) and in the variable directions (for instance, to identify influential transcripts and potential metabolites). This applies to all six model structures individually. This model transparency is a unique property of the O2PLS method, and will be demonstrated in numerous ways throughout the paper.
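For readers who prefer a compact statement, the six structures can be written in matrix form. The notation below follows the general style used for O2PLS (Trygg and Wold, 2003), but the symbol names are assumptions introduced here for illustration: X is the transcript data and Y the metabolite data (samples as rows), T and U are score matrices, and W, C, P_o and Q_o are loading matrices.

```latex
\begin{aligned}
X &= \underbrace{T W^{\top}}_{\text{transcript predictive}}
   + \underbrace{T_{\mathrm{o}} P_{\mathrm{o}}^{\top}}_{\text{transcript unique}}
   + \underbrace{E}_{\text{transcript residual}} \\
Y &= \underbrace{U C^{\top}}_{\text{metabolite predictive}}
   + \underbrace{U_{\mathrm{o}} Q_{\mathrm{o}}^{\top}}_{\text{metabolite unique}}
   + \underbrace{F}_{\text{metabolite residual}}
\end{aligned}
```

Each term on the right-hand side has the same dimensions as the dataset it belongs to, which is what allows every structure to be inspected both by sample (scores) and by variable (loadings).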

O2PLS model estimation

O2PLS modeling requires estimation of the complexity of the different structures in Figure 1(a–b). Specifically, this determines what fraction of the total variation is dispersed into the predictive, unique and residual structures, respectively. This is equivalent to finding a suitable number of latent variables (Kvalheim, 1992) for each type of structure (Figure 1a–b). In essence, there are two different categories of unfavorable outcomes that should, optimally, be avoided. The first case is model underfitting, where the complexity of the model is too low compared with the complexity of the dataset. Systematic structures exist in the datasets but the model fails to capture them; thus hampering both predictions of future (unknown) samples and model interpretation. The second case is model overfitting, where the complexity of the model is set too high. Systematic structures as well as noise-related (stochastic) features will be incorporated, causing the generality of the solution to weaken as the stochastic features are unlikely to be representative for the data. Obviously, the two outcomes are mutually exclusive, thus requiring a suitable intermediate solution. This is basically a dataset-specific problem that is frequently approximated by means of resampling methods. Here, Monte Carlo cross-validation (MCCV) (Shao, 1993) has been utilized for this purpose; further details are available in the Experimental procedures and in the Supplementary material.
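The resampling idea can be sketched in a few lines of Python. This is a minimal illustration of class-balanced Monte Carlo cross-validation and not the procedure used in the study: the predictor `joint_predict` is a simplified stand-in for O2PLS, and the function names, the number of splits and the Q2-style score are all assumptions made for the example.

```python
import numpy as np

def joint_predict(X_tr, Y_tr, X_te, k):
    """Stand-in latent-variable predictor (NOT the full O2PLS algorithm)."""
    mx, my = X_tr.mean(axis=0), Y_tr.mean(axis=0)
    Xc, Yc = X_tr - mx, Y_tr - my
    # Leading right singular vectors of Yc^T Xc give k joint weight vectors for X.
    _, _, Vt = np.linalg.svd(Yc.T @ Xc, full_matrices=False)
    W = Vt[:k].T
    T = Xc @ W                                  # scores of the training samples
    B, *_ = np.linalg.lstsq(T, Yc, rcond=None)  # regress Y on the scores
    return (X_te - mx) @ W @ B + my

def mccv_q2(X, Y, classes, k, n_splits=100, n_test_per_class=1, seed=0):
    """Class-balanced Monte Carlo cross-validation for a model with k components."""
    rng = np.random.default_rng(seed)
    classes = np.asarray(classes)
    press, ss = 0.0, 0.0
    for _ in range(n_splits):
        # Draw the same number of held-out samples from every class.
        test = np.concatenate([
            rng.choice(np.flatnonzero(classes == c), n_test_per_class, replace=False)
            for c in np.unique(classes)])
        train = np.setdiff1d(np.arange(len(classes)), test)
        Y_hat = joint_predict(X[train], Y[train], X[test], k)
        press += np.sum((Y[test] - Y_hat) ** 2)
        ss += np.sum((Y[test] - Y[train].mean(axis=0)) ** 2)
    # Values near 1 indicate good prediction of held-out samples;
    # values near or below 0 indicate no improvement over the training mean.
    return 1.0 - press / ss
```

Evaluating `mccv_q2` for k = 1, 2, 3, ... and choosing the value after which the score stops improving is one simple way to steer between underfitting and overfitting.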

Study summary

In the present case study, we show how the O2PLS method can be used for integration of transcriptomics data, in the form of dual-channel cDNA microarray data (Schena et al., 1995), and metabolomics data in the form of GC/MS data (de Hoffmann and Stroobant, 2001). All biological samples used in this study originate from an experiment investigating short-day-induced effects on wild-type hybrid aspen (Populus tremula × Populus tremuloides). The trees have been grown under different light conditions: long day (LD) and short day (SD). Material has subsequently been collected at various time points, rendering the LD0, LD2, SD2 and SD6 sample categories, respectively (see the Experimental procedures for further details). Finally, the utilized biological samples have been measured in parallel for estimation of metabolite and transcript abundances. We will adopt a non-stringent use of the terms ‘transcript’ and ‘metabolite’ to refer to microarray elements and resolved peaks from the GC/MS spectra, respectively.

The study is outlined as a two-step procedure. In the first step, the predictive versus unique variation will be identified using O2PLS without incorporating prior knowledge of the classes (treatments). The modeled data will be visualized and interpreted primarily based on graph methods as a proof of concept, but no comprehensive biological investigations will be conducted as this is peripheral to the aim of the presented work. Because of the simplicity of this proof-of-principle study we can only expect to identify certain transcript–metabolite correlations. These effects are bound to be temporally linked (as the time-dimension is unavailable) and related across the different datasets and treatments (LD versus SD). See Rischer et al. (2006) for a demonstration of the potential differences in temporal response for transcripts and metabolites. Subsequently, the predictive systematic variation from the two datasets (from the O2PLS model) will be used to discriminate between the classes by means of OPLS-DA (Bylesjö et al., 2006), to show that the related structures capture the implicit class information. This process is depicted in Figure 2. In the second step, the two most diverse categories will be used for model training, whereas the remaining two categories of samples will be used for validation where class belonging is defined a priori.

Figure 2.

 Schematic view of the general strategy for integration and discrimination.
At the initial step, transcriptomic and metabolomic data from the transcript and metabolite sources are integrated using the O2PLS method without acknowledging class information. The predictive structure is subsequently subjected to discriminant analysis and is shown to capture the latent class information.

Results

Step 1: integration of the transcript and metabolite datasets using O2PLS

Variance dispersion of the six model structures in Figure 1(a) was determined using class-balanced MCCV (Shao, 1993) based on the six biological replicates from the LD0 and SD6 treatments (12 samples in total). Model statistics are available in the Supplementary material. In short, three latent variables were identified for the joint variation that is useful for predicting metabolite levels based on the transcript profiles, and vice versa. The joint variation can thus be considered to consist of three separate subsets with systematic properties. This roughly corresponds to independent clusters of transcripts and potential metabolites with a high within-cluster correlation, but with a minimal between-cluster correlation, which could be interpreted as identification of different unrelated but distinct biological patterns. The transcript-predictive structure accounts for 38.0% of the total variation in the transcriptomic dataset. For the metabolomic dataset, the corresponding metabolite-predictive structure accounts for 60.8% of the total variation in the dataset. No remaining transcript-unique structured variation was available for the transcript dataset, but one metabolite-unique structure was identified for the metabolomic dataset, accounting for 9.7% of the total variation in that dataset.
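To make the notion of "fraction of total variation accounted for by the predictive structure" concrete, the sketch below extracts joint components from two data matrices and reports that fraction for each. It assumes that the joint directions are taken from a singular value decomposition of the cross-product of the centered matrices, and it deliberately omits the estimation and removal of the unique structures, so it should be read as a simplified illustration and not as the O2PLS algorithm itself. All function and variable names are hypothetical.

```python
import numpy as np

def joint_variation(X, Y, k):
    """Joint (predictive) parts of X and Y from k shared components.

    Simplified illustration: the dataset-unique structures are not estimated.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Singular vectors of the cross-product matrix Yc^T Xc
    C_full, _, Wt = np.linalg.svd(Yc.T @ Xc, full_matrices=False)
    W = Wt[:k].T          # weights for the X variables (e.g. transcripts)
    C = C_full[:, :k]     # weights for the Y variables (e.g. metabolites)
    T, U = Xc @ W, Yc @ C # sample scores in each dataset
    X_pred = T @ W.T      # predictive structure, same shape as X
    Y_pred = U @ C.T      # predictive structure, same shape as Y
    frac_X = np.sum(X_pred ** 2) / np.sum(Xc ** 2)
    frac_Y = np.sum(Y_pred ** 2) / np.sum(Yc ** 2)
    return T, U, W, C, frac_X, frac_Y
```

Because the weight vectors are orthonormal, each fraction lies between 0 and 1 and can be read in the same way as the percentages quoted above.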

From each of the three latent variables forming the joint variation in Figures 1 and 2, significantly predictive transcripts and metabolites were subsequently identified using a permutation procedure described in the Experimental procedures. It is important to note that the transcripts and metabolites in the resulting subset exhibit strong multivariate correlation patterns, and can be considered candidates when connecting regulatory expression patterns to metabolite levels. Because of the simplicity of the study, we can only expect to find temporally linked transcript–metabolite effects as the time dimension is unavailable [see e.g. Rischer et al. (2006) for discussions regarding this issue]. The total number of identified transcripts and metabolites for each latent variable (cluster) is available in Table 1. A complete list of identified transcripts and metabolites is available in the Supplementary material.

Table 1.   The identified number of transcripts and metabolites for each latent variable (LV)
             LV 1   LV 2   LV 3   Total unique
Transcripts10490206410
Metabolites3392564

One interesting aspect of this type of analysis is the interpretation of the results. We have utilized graph theoretical methods for visualization purposes. For each of the latent variables in the joint variation, pairwise Pearson correlations were calculated between all identified transcripts and metabolites. Note that we are basing the pairwise correlations on the O2PLS model data, thus implicitly exploiting the exposed multivariate correlation patterns while ignoring the remaining patterns. For visualization clarity, the transcripts have been categorized based on the gene ontology biological process (GO-BP) annotation (http://www.geneontology.org). For further details on the construction of the graphs see the Supplementary material.
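The correlation step behind such a graph can be sketched as follows. The example computes Pearson correlations between every transcript and every metabolite from two matrices of modeled data and keeps the strongly correlated pairs as edges. The function name, the argument names and the cut-off `min_abs_r` are hypothetical; the actual construction of the graphs is described in the Supplementary material and may differ.

```python
import numpy as np

def cross_correlation_edges(X_pred, Y_pred, x_names, y_names, min_abs_r=0.8):
    """Pearson correlations between columns of X_pred and columns of Y_pred.

    Assumes that no column is constant (a constant column has no defined correlation).
    """
    n = X_pred.shape[0]
    Xz = (X_pred - X_pred.mean(axis=0)) / X_pred.std(axis=0)
    Yz = (Y_pred - Y_pred.mean(axis=0)) / Y_pred.std(axis=0)
    R = Xz.T @ Yz / n  # R[i, j] = correlation of transcript i with metabolite j
    # Keep only pairs whose absolute correlation reaches the cut-off.
    edges = [(x_names[i], y_names[j], float(R[i, j]))
             for i, j in zip(*np.nonzero(np.abs(R) >= min_abs_r))]
    return R, edges
```

The returned edge list can be passed to any graph-drawing tool, with the correlation value mapped to edge color as in Figure 3.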

We will exemplify the visualization and discuss the relevance of the identified transcripts and potential metabolites using mainly the second latent variable (cluster) of the joint variation, which is depicted in Figure 3. All identified GO-BP groups are related to nucleotide metabolism. Specifically, the GO-BP groups all belong to pyrimidine metabolism, with the exception of two groups belonging to purine metabolism (e.g. GTP metabolism). Nucleotide metabolism is known to play an important role in plant biochemical and developmental processes (Boldt and Zrenner, 2003). Purines and pyrimidines are generated from different amino acids and other small molecules via de novo pathways, and from nucleobases and nucleosides by the salvage pathways, and are frequently involved in joint pathways. Purines and pyrimidines are regulators of products such as sucrose, polysaccharides, sugars, phospholipids and secondary products (Stasolla et al., 2003), and thereby influence plant growth and development. Many of the identified metabolites are in fact carbohydrates, including sucrose, fructose and glucose derivatives (glucaric acid and gluconic acid), as well as undetermined carbohydrates. Although the metabolomics methodology used here does not cover the analysis of purine and pyrimidine derivatives, the metabolite data still support the putative connection between transcripts and metabolites suggested here.

Figure 3.

 Visualization of the identified transcripts and metabolites.
A graph of the identified elements of the second latent variable of the joint variation is illustrated. The transcripts are grouped using the gene ontology biological process (GO-BP) annotations (boxes) together with the identified metabolites (circles). The edges (connecting lines) are colored according to the correlation, where darker lines denote stronger correlations and vice versa. The displayed correlations span the interval 0.82–0.98. The label ‘UPSC unknown’ of potential metabolites refers to database matches that have been seen in previous (internal) studies, but the identity of which currently remains unknown.

The recent sequencing of the Arabidopsis genome has increased the possibilities for analyzing the roles of purines and pyrimidines in the biological functions of higher plants; these roles were previously assumed to be identical to those in microorganisms and yeast (Boldt and Zrenner, 2003; Kafer et al., 2004). The pyrimidines CTP and UTP seem to affect different areas of the metabolism (Dowhan, 1997; Ostrander et al., 1998). CTP is required for phospholipid metabolism, where one can spot the constituents of phosphoric acid and lipid compounds amongst the identified metabolites in the present study. UTP is required for sugar metabolism mainly in the form of UDP-glucose, which is a precursor for components in the cell wall, glycoproteins, glycolipids and sulfolipids (Kafer et al., 2004; Zrenner et al., 2006). We have shown that glucose derivatives as well as undetermined carbohydrates constitute a large portion of the identified metabolites (Figure 3). The purines (GTP and ATP) play an important role in the assimilation of nitrogen (Smith and Atkins, 2002), as well as in cell division and energy metabolism. Interestingly, in another latent variable (subgroup) of the joint variation, the majority of the GO-BP groups are related to mitosis effects (not shown). For instance, the anaphase in cell division is seen together with microtubule polymerization processes, which are known to be altered during mitosis. Other GO-BP groups are connected to various kinds of transport, two of which are coupled to the transport of purines and nucleobases, both important parts of the mitosis process. Such effects on cell division are frequently seen in dormancy-related experiments (Devitt and Stafstrom, 1995; Horvath et al., 2003), similar to the present experimental setup.

Step 2: classification of the biological samples

OPLS-DA model training was performed using the two most diverse sample categories, LD0 and SD6, as different classes, as illustrated in Figure 2 (step 2). As validation, the remaining samples, LD2 and SD2, were subsequently predicted and assigned to either of the two classes. Based on previous experiments we know that SD induces growth cessation, which suggests that (i) the LD2 class should be highly similar to the LD0 class, and (ii) the SD2 class should be an approximate intermediate of the LD0 and SD6 classes (Thomas and Vince-Prue, 1997; M. Kusano et al., unpublished data).

OPLS can superficially be seen as a one-way O2PLS; out of the six structures described in Figure 1(a), only three of these remain in the OPLS model. These will be denoted as predictive, unique and residual for consistency with the previous notation for the O2PLS model. Properties of the predictive and unique structures were determined using class-balanced MCCV (Shao, 1993), recommending one predictive and no unique latent variable for the OPLS-DA classification model. Model statistics are available in the Supplementary material. For the training samples, perfect classification was achieved for the LD0 and SD6 classes. All of the six LD2 samples were accurately predicted to belong to the LD0 class as an external test set. Out of the six SD2 samples, two were predicted to belong to the LD0 class, and the remaining four were predicted to belong to the SD6 class, roughly as expected. This is depicted in Figure 4, where densities of the class predictions are shown.

Figure 4.

 Predictions of class assignment by means of OPLS-DA.
In (a), densities of the observations predicted by the full OPLS-DA model for the SD6 (dashed line) and LD0 (solid line) classes are presented, showing distinct bimodal properties and little overlap, as expected. A theoretical separation point (decision boundary) between the classes is shown using a dotted vertical line. In (b), the SD2 (dashed line) and LD2 (solid line) classes are displayed as predicted by the same OPLS-DA model as in (a). LD2 behaves approximately as LD0, whereas SD2 exhibits a density that is roughly an intermediate between LD0 and SD6. The same decision boundary as in (a) is shown using a dotted vertical line.
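The prediction step can be illustrated with a small sketch. With one predictive and no unique latent variable, the classifier reduces to a regression onto a single latent variable, so the example below fits a one-component latent-variable regression on a 0/1 class coding and assigns new samples by comparing the predicted value with a decision boundary. The 0/1 coding, the boundary of 0.5 and all function names are assumptions made for illustration and are not taken from the study.

```python
import numpy as np

def fit_one_component_da(X, y):
    """One-component latent-variable regression on a 0/1 class vector."""
    y = np.asarray(y, dtype=float)
    mx, my = X.mean(axis=0), y.mean()
    Xc, yc = X - mx, y - my
    w = Xc.T @ yc                 # direction of maximum covariance with the class
    w /= np.linalg.norm(w)
    t = Xc @ w                    # predictive scores of the training samples
    b = (t @ yc) / (t @ t)        # regression of the class coding on the scores
    return mx, my, w, b

def predict_class(model, X_new, boundary=0.5):
    """Return predicted values and class labels (0 or 1) for new samples."""
    mx, my, w, b = model
    y_hat = (X_new - mx) @ w * b + my
    return y_hat, (y_hat > boundary).astype(int)
```

In this picture, training on LD0 (coded 0) and SD6 (coded 1) and then calling `predict_class` on the LD2 and SD2 samples would yield the predicted values whose densities are summarized in Figure 4; an intermediate class would be expected to produce values scattered around the boundary.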

Comparison with related methods

Comparison with univariate correlation studies.  A permutation test was utilized to identify significant transcript–metabolite connections based on Pearson’s correlation coefficient for all possible pairwise combinations. The interval of interest was determined by permutation to be (−0.943, 0.948), i.e. only the transcripts and metabolites with a pairwise correlation outside this interval were retained (see Supplementary material for details). This resulted in the identification of a total of 102 unique transcripts and 109 metabolites. When the transcripts are pooled by means of the GO-BP annotations, using the same procedure as described in the O2PLS example, the identified groups exclusively describe metal ion homeostasis. Only eight out of 102 (7.8%) of the identified transcripts overlap between the pairwise univariate correlation and the O2PLS approach. As for the potential metabolites, the identified entries contain a multitude of unknown hits with a slight dominance of carbohydrates (not shown). Consistent with the transcripts, only a small proportion of the identified variables overlap with the O2PLS approach (11 out of 109; 10.1%), as shown in Figure 5.
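A permutation-derived correlation interval of this kind can be sketched as follows. The example shuffles the sample order of one dataset to break the pairing between samples, collects the resulting pairwise correlations as a null distribution, and takes its extreme quantiles as the interval. The number of permutations, the significance level `alpha` and the function names are hypothetical; the exact procedure used in the study is given in the Supplementary material and may differ.

```python
import numpy as np

def pearson_cross(X, Y):
    """Pearson correlation between every column of X and every column of Y."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xz.T @ Yz / X.shape[0]

def permutation_interval(X, Y, n_perm=200, alpha=0.001, seed=0):
    """Interval of correlations expected when the sample pairing is random."""
    rng = np.random.default_rng(seed)
    null = np.concatenate([
        # Permuting the rows of Y destroys the sample-wise link to X.
        pearson_cross(X, Y[rng.permutation(Y.shape[0])]).ravel()
        for _ in range(n_perm)])
    lo, hi = np.quantile(null, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def significant_pairs(X, Y, lo, hi):
    """Index pairs (transcript, metabolite) with correlation outside (lo, hi)."""
    R = pearson_cross(X, Y)
    return np.argwhere((R < lo) | (R > hi))
```

With few samples the null distribution of correlations is wide, so the resulting interval is necessarily broad and only very strong pairwise correlations survive the filter.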

Figure 5.

 Venn diagram of the overlap of identified transcripts and metabolites across the methods.
An overview of the overlap between the different methods reveals a greater overlap between the results from O2PLS and PCA compared with the results based on univariate correlations.

Comparison with principal component analysis.  Principal component analysis (PCA; Jolliffe, 2002; Wold et al., 1987) is a multivariate projection method that has been employed in several systems biology studies for exploratory analysis of data from multiple sources (see e.g. Clish et al., 2004; Rischer et al., 2006, for examples of such usage). The general approach is to concatenate multiple datasets and project these according to the maximum variance, e.g. to find interesting trends or for the detection of outlying samples. For comparison, we have utilized this technique by concatenating the transcript and metabolite datasets, and we subsequently calculated three principal components. This was followed by the same permutation strategy to identify significant transcripts and metabolites as that used for the O2PLS method. See the Supplementary material for more details of the conceptual similarities and differences regarding the PCA and O2PLS methods.
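The concatenation approach can be sketched with a singular value decomposition. The example scales every variable to unit variance before joining the two tables, which is an assumption made here so that neither platform dominates purely through its measurement scale; the scaling used in the study is not restated in this section. Function and variable names are hypothetical.

```python
import numpy as np

def concatenated_pca(X, Y, n_components=3):
    """PCA of the column-wise concatenation of two datasets sharing samples."""
    def autoscale(A):
        # Center each variable and scale it to unit variance.
        return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

    Z = np.hstack([autoscale(X), autoscale(Y)])
    U_s, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U_s[:, :n_components] * s[:n_components]   # one row per sample
    loadings = Vt[:n_components].T                      # one row per variable
    explained = s[:n_components] ** 2 / np.sum(s ** 2)  # fraction of variance
    return scores, loadings, explained
```

The first `X.shape[1]` rows of `loadings` belong to the transcripts and the remaining rows to the metabolites. Because the components maximize variance in the joined table, variation that exists in only one of the two datasets can still shape them.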

The permutation test resulted in the identification of a total of 321 significant transcripts and 45 metabolites (Figure 5). By pooling the transcripts by means of the GO-BP annotations, using the same procedure as described in the O2PLS example, the identified groups belong to varying categories, such as response to wounding and cell-wall catabolism, with no distinct consensus. As for the potential metabolites, the identified entries contain a multitude of unknown hits, where the metabolites overlapping with the metabolites identified using O2PLS are mostly carbohydrates (not shown). Only 59 out of 321 (14.4%) of the identified transcripts are overlapping between the PCA and the O2PLS approach. For the metabolites, 28 out of 45 (62.2%) do overlap with the O2PLS approach, suggesting that the methods perform similarly for this particular dataset. The differences observed between PCA and O2PLS models are related to the difference in how they model the variation in the data tables. The PCA model will find the direction in the data with the highest variability. Part of the variability in one dataset is not necessarily related to the other dataset, and hence in that situation the outcome will be different from the results of the O2PLS model. Any systematic effects in the data, caused by technical issues in one dataset, for example, will negatively affect the estimated principal components. Such effects can be quite considerable; in the presented material, approximately 10% of the variation in the metabolite data is related to a systematic, platform-specific effect. As the datasets are concatenated in the PCA calculations, this effect will propagate and also influence the identified transcripts, as seen in Figure 5. In the O2PLS method this effect is identified and can be studied separately, which is beneficial from an interpretational perspective.

Summary of comparisons.  An overview of the relative overlap between the evaluated methods is available in Figure 5, showing a greater consensus between the respective results from O2PLS and PCA compared with the results based on univariate correlations. This is particularly apparent for the metabolite dataset. Additionally, PCA has the lowest relative number of unique transcripts and metabolites, suggesting that it is an approximate intermediate between univariate correlation analysis and O2PLS. Another striking feature is that only one transcript (vacuolar calcium-binding protein) and four metabolites (one identified as galactinol, whereas three remain of unknown identity) are found using all three methods.

Discussion

In the present study, we have employed the O2PLS multivariate regression method to identify joint systematic variation across transcriptomic and metabolomic datasets.

One could ask what is conceptually different when employing the presented methodology compared with traditional studies, which are almost exclusively based on pairwise univariate correlations between the different datasets. By utilizing the O2PLS method, all results are founded on an underlying multivariate model of the data. This implies that one could predict the properties of new (unknown) samples, as well as estimate the predictive performance of the model (exemplified here using cross-validation), which yields statistical measures of confidence so that not merely stochastic events are characterized. The most distinctive attribute of O2PLS is, however, the fact that one can separate systematic predictive variation from platform-unique systematic variation as well as non-systematic structures captured in the residuals. This is most clearly seen in the direct comparison with PCA, which is also a multivariate method, where this variation is incorporated into the model structures. The transparency of the O2PLS model enables straightforward interpretation of both the predictive and the unique sources of variation, with respect to samples as well as individual variables (e.g. genes, metabolites). Furthermore, unlike pairwise correlation studies, one can assess model outliers, which are fairly common in biological studies because of the technical issues involved.

As we introduce a novel methodology for the integration of multiple datasets in the plant field, it is natural to additionally compare the performance of the method directly to the current ‘gold standard’, which is the employment of pairwise univariate correlations. One can clearly see that the outcomes differ; in fact only 10% of the transcripts and potential metabolites intersect between the two variable selection strategies. The discrepancies observed between the two approaches (multivariate versus univariate) are related to the underlying basis of the correlation calculations. This subject is handled in more depth in the Experimental procedures section and in the Supplementary material.

The additional comparison to PCA reveals a greater overlap between the methods, in particular for the metabolite dataset, which is partly explained by the more closely related properties of PCA and O2PLS. Nonetheless, from Figure 5 we can see that the majority of the transcripts (82.6% on average) and metabolites (61.5% on average) are completely unique to each method. The evaluated methods are evidently quite dissimilar and can be seen as complementary for elucidating joint regulatory patterns, as it is not possible to claim the superiority of one methodology over another based solely on the evidence provided here. However, as demonstrated in the presented material, the O2PLS method has the distinctive capability to separate dataset-predictive from dataset-unique variation. This property is likely to be advantageous from an interpretational perspective in the general case.
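The overlap figures quoted above can be reproduced from the selected variable lists with simple set arithmetic. The following is a minimal sketch only: the identifiers in `toy` are hypothetical placeholders rather than results from this study, and averaging the per-method unique fractions is an assumption about how the "on average" values were obtained.

```python
def unique_fractions(selections):
    """Fraction of each method's selected variables found by no other method.

    selections: dict mapping a method name to a set of variable identifiers.
    """
    fractions = {}
    for name, chosen in selections.items():
        # Union of everything selected by the remaining methods.
        others = set().union(*(s for other, s in selections.items() if other != name))
        fractions[name] = len(chosen - others) / len(chosen) if chosen else 0.0
    return fractions

# Hypothetical toy selections, not data from the study.
toy = {
    "O2PLS": {"t1", "t2", "t3", "t4"},
    "PCA": {"t3", "t4", "t5"},
    "univariate": {"t4", "t6", "t7", "t8"},
}

fractions = unique_fractions(toy)
shared_by_all = set.intersection(*toy.values())
mean_unique = sum(fractions.values()) / len(fractions)
```

The same function can be applied separately to the transcript and metabolite selections.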

Although we only explicitly describe a situation where one wants to study the relationship between two different datasets, it is easy to imagine a case where three or more datasets are available. A natural extension would, for instance, be to complement the transcriptomic and metabolomic datasets with proteomic data (for example based on 2D-DIGE or LC/MS; de Hoffmann and Stroobant, 2001). As can be seen from the model structures in Figure 1, the O2PLS method does not support such a situation without proper modifications. A generalization of the O2PLS methodology to handle multiple datasets is something we intend to study in a future paper.

Conclusions

We have outlined a general methodology for integrating data structures from multiple sources based on the O2PLS multivariate regression method. The benefits of the methodology are demonstrated using a study investigating SD-induced effects in hybrid aspen (P. tremula × P. tremuloides) based on transcriptomic (dual-channel cDNA) as well as metabolomic (GC/MS) data. We show how one can identify transcripts and metabolites that exhibit strong multivariate correlation patterns, and relate these to the underlying molecular functions based on corresponding annotations. Identified effects are mainly related to changes in energy metabolism and mitosis-associated effects, which are biologically sound given the studied system. The methodology is compared with frequently utilized methods such as univariate correlations and PCA, where results and conclusions deviate across methods.

Experimental procedures

Plant material and sampling

Twenty-four hybrid aspen (P. tremula × P. tremuloides) trees were grown in a growth chamber under LD conditions with 12 h of photosynthetically active radiation (PAR) light (400 μEin m−2 s−1) and a 6-h daylength extension with low light (30 μEin m−2 s−1). After 3 months, leaves 15–17 (with the first leaf below the apex ≥ 1-cm long being designated as leaf 1) from six plants were sampled (LD0 samples). Two days later another six plants were sampled (LD2 samples), and the daylength was changed to SD conditions, with 12 h of PAR light. After both 2 and 6 days with short photoperiods six plants were sampled (denoted SD2 and SD6, respectively), as described above.

Microarray preparation

The POP2.3 microarray layout used consists of 27 648 single-spotted cDNA clones from a previous assembly of more than 100 000 expressed sequence tags (ESTs) from the Populus genus (Sterky et al., 2004). All sequence information is available in the PopulusDB (http://www.populus.db.umu.se) online sequence database; see also Tuskan et al. (2006) for further information regarding the Populus genome. A full array layout is available for download from the UPSC-BASE (http://www.upscbase.db.umu.se) online microarray database (Sjödin et al., 2006).

All microarray slides were printed using a QArray arrayer (Genetix, http://www.genetix.com). The preparation, labeling and hybridization of cDNA clones and mRNA samples were carried out according to the protocol described by Smith et al. (2004). The arrays were scanned on a ScanArray 4000 (Perkin-Elmer, http://www.perkinelmer.com) at 10-μm resolution to obtain raw image files for the Cy5 and Cy3 dye channels.

Microarray pre-processing

Gridding and segmentation of all images were performed in GenePix Pro 5.1 (Molecular Devices, http://www.moleculardevices.com). Quantification was based on median foreground intensity values, and data were subsequently within-slide normalized using the print-tip lowess method (Yang et al., 2002). All original image files as well as raw and normalized data are available online for download at the UPSC-BASE microarray database from experiment UMA-0068.

The four samples LD0, LD2, SD2 and SD6 were hybridized in a loop design (see Supplementary material) independently for each of the six biological replicates. Expression ratios were subsequently resolved using the generalized (Moore–Penrose) inverse. The resulting ratios for all four samples are thus calculated against a weighted mean, which can be seen as being compared against a virtual reference (in silico reference).
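As an illustration of how a loop design can be resolved with the generalized inverse, the following is a minimal NumPy sketch. The pairing order of the arrays and the dye orientation in `design` are assumptions made for illustration (the actual layout is given in the Supplementary material), and `resolve_loop` is a hypothetical helper name. In this balanced toy design the minimum-norm solution expresses each sample relative to the plain mean of the four samples, which plays the role of the in silico reference.

```python
import numpy as np

# Assumed loop: one row per array, columns ordered LD0, LD2, SD2, SD6.
# +1 marks one dye channel and -1 the other (orientation is an assumption).
design = np.array([
    [ 1, -1,  0,  0],   # LD0 vs LD2
    [ 0,  1, -1,  0],   # LD2 vs SD2
    [ 0,  0,  1, -1],   # SD2 vs SD6
    [-1,  0,  0,  1],   # SD6 vs LD0
], dtype=float)

def resolve_loop(log_ratios, design=design):
    """Estimate per-sample log expression relative to a virtual reference.

    log_ratios: shape (n_arrays,) or (n_arrays, n_genes), one row per array.
    The design matrix is rank deficient, so the Moore-Penrose inverse is used
    to pick the minimum-norm least-squares solution.
    """
    return np.linalg.pinv(design) @ log_ratios

# Toy check with made-up expression levels for a single array element.
true_levels = np.array([2.0, 1.0, -1.0, 0.5])
observed = design @ true_levels          # noise-free log ratios per array
estimated = resolve_loop(observed)       # levels relative to their mean
```

Passing a matrix with one column per array element resolves all elements in a single call.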

Metabolomics analysis

The samples were extracted with chloroform:MeOH:H2O, and their metabolite profiles were analyzed by GC/TOFMS essentially according to the method described by Gullberg et al. (2004). All non-processed MS files from the metabolic analysis were exported from the ChromaTOF software in NetCDF format to matlab™ version 7.0 (Mathworks, http://www.mathworks.com), in which all data pre-treatment procedures, such as baseline correction, chromatogram alignment, data compression and hierarchical multivariate curve resolution (H-MCR), were performed using in-house produced scripts according to Jonsson et al. (2005). All manual peak integrations were performed using in-house scripts.

O2PLS

Dataset pre-processing.  The transcript dataset is composed of 27 648 microarray in silico expression ratios for all six biological replicates from the LD0 and SD6 treatments, rendering 12 samples in total. The metabolite dataset consists of the peak-resolved GC/MS data (Jonsson et al., 2005): a total of 453 peaks for the same 12 samples. See Figure 2 (step 1) for a depiction of the different datasets. The transcript dataset has been mean-centered per microarray element, whereas the GC/MS data matrix has been both mean-centered and scaled to unit variance for each resolved peak prior to modeling. Scaling to unit variance implies dividing each potential metabolite by its standard deviation, so that differences in magnitude between metabolite concentrations do not dominate the model. Both datasets were subsequently scaled to an equal total sum of squares to avoid dominance of either matrix. Further details are available in the Supplementary material.
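The pre-processing steps described above can be sketched as follows. This is an illustrative outline rather than the scripts used in the study: `preprocess_blocks` is a hypothetical name, the common target sum of squares of 1 is an arbitrary choice (the text only requires the two totals to be equal), and the guard for constant peaks is an added assumption.

```python
import numpy as np

def preprocess_blocks(transcripts, metabolites):
    """Center and scale two data blocks that share the same samples (rows).

    transcripts: (n_samples, n_elements) expression ratios.
    metabolites: (n_samples, n_peaks) resolved GC/MS peak values.
    """
    # Transcripts: mean-center each microarray element (column) only.
    X = transcripts - transcripts.mean(axis=0)

    # Metabolites: mean-center and scale each peak to unit variance.
    Y = metabolites - metabolites.mean(axis=0)
    sd = Y.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                     # assumption: leave constant peaks at zero
    Y = Y / sd

    # Give both blocks the same total sum of squares (here 1).
    X = X / np.sqrt((X ** 2).sum())
    Y = Y / np.sqrt((Y ** 2).sum())
    return X, Y

# Toy data with deliberately different magnitudes in the two blocks.
rng = np.random.default_rng(0)
X, Y = preprocess_blocks(rng.normal(size=(12, 50)),
                         100.0 * rng.normal(size=(12, 8)))
```

After this step neither block can dominate a joint model purely because it has more variables or larger numerical values.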

Variable selection.  From the joint variation, influential variables (transcripts and metabolites) were subsequently identified using a permutation test. The basis of the variable selection process is the correlation-scaled loadings (see Supplementary material). The selection procedure is similar to the one described by Johansson et al. (2003) and can be briefly described as follows. The data are reshuffled multiple times so that the original correlation structures are partially or completely destroyed. O2PLS models are generated from each rearranged dataset to estimate what degree of correlation could be expected by chance from a dataset with equal properties. This degree of correlation is later used as a threshold to select transcripts and metabolites with unusually high correlations (significantly higher than chance). The result is a small collection of transcripts and metabolites that are of particular interest when studying the correlation patterns across the two datasets. Further details of the variable selection procedure are available in the Supplementary material.
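The logic of a permutation-based threshold can be sketched in a few lines. Note the assumptions: a full O2PLS fit is replaced here by a simplified stand-in (the first joint score obtained from the leading singular vector of X'Y), each transcript column is shuffled independently to destroy the correlation structure, and the threshold is taken as a quantile of the pooled null correlations. All function names and the toy data are hypothetical; the procedure actually used is described in the Supplementary material.

```python
import numpy as np

def joint_score_correlations(X, Y):
    """Correlation of each X variable with the first joint score (stand-in model)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    t = Xc @ U[:, 0]                       # joint score, zero mean by construction
    return (Xc.T @ t) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(t))

def shuffle_columns(A, rng):
    """Shuffle the sample order independently within every column."""
    return np.column_stack([rng.permutation(col) for col in A.T])

def permutation_threshold(X, Y, n_perm=100, quantile=0.99, seed=0):
    """Correlation level expected by chance, estimated from reshuffled data."""
    rng = np.random.default_rng(seed)
    null = [np.abs(joint_score_correlations(shuffle_columns(X, rng), Y))
            for _ in range(n_perm)]
    return float(np.quantile(np.concatenate(null), quantile))

# Toy data: 12 samples in two groups; the first few variables carry the signal.
rng = np.random.default_rng(1)
group = np.repeat([-1.0, 1.0], 6)
X = rng.normal(size=(12, 30))
X[:, :5] += 5.0 * group[:, None]
Y = rng.normal(size=(12, 10))
Y[:, :3] += 5.0 * group[:, None]

corr = joint_score_correlations(X, Y)
threshold = permutation_threshold(X, Y)
selected = np.abs(corr) > threshold        # boolean mask of retained variables
```

Swapping the roles of X and Y gives the corresponding selection for the metabolite block.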

Both the presented O2PLS method and the univariate approaches utilize some form of correlation to rank influential associations. In that context the discrepancies observed between the two approaches (multivariate versus univariate) stem from which correlations are calculated. In the univariate case, all possible pairwise correlations are calculated between a transcript profile i and a potential metabolite profile j. In the multivariate approach we select only one representative variable from each respective dataset (the latent variable) that is optimal for predicting the respective dataset in a least-squares sense. A latent variable corresponds to a weighted average of the original variables, and cannot generally be replaced by any single variable in the original datasets. Readers familiar with similar feature extraction strategies, such as PCA (Wold et al., 1987), should note that the latent variables in the O2PLS model are not principal components, as they are derived based on a different objective function (covariance/correlation versus variance). From these O2PLS latent variables, one can subsequently calculate the correlations between the latent variable and the original variables. This strategy reduces the number of correlation calculations, and will typically provide stability in the correlation estimations as the risk of spurious correlations is reduced. This procedure represents the outlined method to rank variable importance, but is generalized to several latent variables and incorporates the dataset-unique structures.
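To make the contrast concrete, the sketch below computes both kinds of correlation on toy data and counts how many correlations each strategy requires for the dataset dimensions reported above. It is an illustration under stated assumptions: the latent variables are approximated by the leading singular vectors of X'Y (one joint component, no dataset-unique components), which is a simplified stand-in rather than the full O2PLS model, and the function names are hypothetical.

```python
import numpy as np

def column_correlations(A, B):
    """Pearson correlation between every column of A and every column of B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    A = A / np.linalg.norm(A, axis=0)
    B = B / np.linalg.norm(B, axis=0)
    return A.T @ B

def latent_correlations(X, Y):
    """Correlate each original variable with the latent variable of its own block."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    t = Xc @ U[:, :1]                      # latent variable for the X block
    u = Yc @ Vt[:1].T                      # latent variable for the Y block
    return column_correlations(Xc, t).ravel(), column_correlations(Yc, u).ravel()

# Toy blocks standing in for transcripts (X) and metabolite peaks (Y).
rng = np.random.default_rng(2)
X = rng.normal(size=(12, 20))
Y = rng.normal(size=(12, 6))

pairwise = column_correlations(X, Y)       # univariate: one value per (i, j) pair
r_x, r_y = latent_correlations(X, Y)       # multivariate: one value per variable

# Number of correlations for the dimensions reported in this study.
n_transcripts, n_peaks = 27648, 453
n_pairwise = n_transcripts * n_peaks       # every transcript against every peak
n_latent = n_transcripts + n_peaks         # every variable against one latent variable
```

For the reported dimensions, the pairwise strategy involves 12 524 544 correlations, compared with 28 101 for a single pair of latent variables.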

Acknowledgements

This work was supported by grants from The Swedish Foundation for Strategic Research (MB, TM), The Knut and Alice Wallenberg Foundation (JT), The Swedish Research Council (MB, TM, JT), The Functional Genomics Initiative at Swedish University of Agricultural Sciences (DE), FORMAS (TM) and The KEMPE Foundation (TM).
