## Introduction

In the post-genomics era, the development of life science technologies that enable transcriptomic, proteomic and metabolomic events to be analyzed in detail in the same biological systems has revolutionized biological studies. Instead of relating biological observations to a small number of variables, it is now possible to study biological systems with a global analytical approach. This is sometimes called systems biology, and the purpose is to study organisms as integrated systems of genetic, protein, metabolic, cellular, and pathway events.

A systems biology approach based on data collected from many different analytical platforms requires advanced statistics to interpret the data. Numerous examples exist in the literature regarding the use of data from parallel sources (e.g. Carrari *et al.*, 2006; Clish *et al.*, 2004; Gygi *et al.*, 1999; Hirai *et al.*, 2004, 2005; Kleno *et al.*, 2004; Kolbe *et al.*, 2006; Oresic *et al.*, 2004; Rischer *et al.*, 2006; Tohge *et al.*, 2005). In the majority of studies, pairwise univariate correlations between measured variables in different datasets have been utilized to elucidate joint effects and underlying mechanisms. In the plant field, Professor Saito’s group have made substantial contributions (Hirai *et al.*, 2004, 2005; Tohge *et al.*, 2005), primarily concerning the integration of transcript and metabolite data for *Arabidopsis thaliana*. The general approach has been to analyze and interpret the data sources in parallel, formulate hypotheses independently for each platform and finally outline a consensus theory based on prior knowledge of existing pathways to unravel novel trends and processes.
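The pairwise univariate approach mentioned above can be illustrated with a minimal sketch. The gene and metabolite names and all values below are hypothetical toy data, not measurements from any of the cited studies; the code simply computes the Pearson correlation for every transcript and metabolite pair measured on the same samples.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: each list holds one variable measured on the
# same five biological samples (names and values are made up).
transcripts = {
    "gene_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_B": [5.0, 3.0, 4.0, 1.0, 2.0],
}
metabolites = {
    "met_1": [2.0, 4.0, 6.0, 8.0, 10.0],
    "met_2": [9.0, 7.0, 5.0, 3.0, 1.0],
}

# One univariate correlation per transcript-metabolite pair.
corr = {
    (gene, met): pearson(t_vals, m_vals)
    for gene, t_vals in transcripts.items()
    for met, m_vals in metabolites.items()
}

for pair, r in sorted(corr.items()):
    print(pair, round(r, 3))
```

Each pair is treated in isolation here, so the number of comparisons grows with the product of the two variable counts and no shared multivariate structure is modeled.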

The main intricacies associated with a unified approach lie in the subsequent analysis and interpretation, i.e. in extracting systematic and biologically relevant information from noisy, high-order data structures collected at several levels. Here we describe a general methodology that is complementary to previously outlined approaches, with the potential to provide further information and insights into the field of plant biology. The strategy builds on recent advances in multivariate regression methods, specifically the O2PLS method (Trygg, 2002; Trygg and Wold, 2003), which has the capacity to integrate data from multiple sources (e.g. data collected from different analytical platforms). This enables one to separate joint (overlapping) information across multiple analytical platforms from systematic variation that is unique to each platform. The O2PLS method identifies systematic trends across any pair of datasets, for instance the transcript and metabolite data utilized in the present work. This process is schematically outlined in Figure 1. A recent study investigating metabonomic and proteomic correlations in mouse samples using this approach is available in the literature (Rantalainen *et al.*, 2006).

The fundamentals of the O2PLS method build on the work by Trygg and Wold (2002), who outlined orthogonal projections to latent structures (OPLS), which is a supervised multivariate regression method. OPLS combines the existing theory of partial least squares (PLS) regression (Wold *et al.*, 1984, 2001) and orthogonal signal correction (OSC) (Wold *et al.*, 1998). OPLS is primarily used when one has a single data source, e.g. transcript data from cDNA microarrays, and wants to study its relationship with some other property, for instance concentration levels of a set of compounds. One can then separate information that is related to these concentration levels from other sources of variation that are unrelated to the problem formulation, for instance overall disparities for each microarray slide caused by production inaccuracies. OPLS can also be used for discriminant analysis (OPLS-DA) to differentiate between classes (Bylesjö *et al.*, 2006).

The potential of the OPLS and O2PLS methodologies for integrating transcript and metabolite data becomes clear when considering the following hypothetical example. Suppose we have collected a dataset consisting of peak-resolved GC/MS data in order to investigate the metabolomic differences between two plant genotypes. During the data collection, however, treatment-independent systematic noise (baseline fluctuations) has affected the intensity values of the spectra, which introduces false univariate correlation patterns. Using OPLS, the data will be characterized by multivariate latent structures that describe these properties independently from one another, which is feasible because the baseline fluctuations are independent of (in mathematical terms, orthogonal to) the genotype effect. It will thus be possible to interpret the covariance (predictive) structure related to the genotype effect and the baseline effect separately. If we also generate transcriptomic data from the same biological samples, the O2PLS methodology would be useful to complement the analysis by discovering trends between the metabolite (GC/MS) and transcript (microarray) datasets (Figure 1a–b). This has the potential to provide additional information that enhances the understanding of a possibly complex biological process.
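How a shared baseline can create false univariate correlations is easy to reproduce in a small simulation. The sketch below is illustrative only: the peak intensities, the baseline model (one additive Gaussian offset per sample) and all parameter values are assumptions made for the example, and the code demonstrates the problem rather than the OPLS solution.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n_samples = 2000

# Two hypothetical peaks whose true intensities are unrelated.
peak_a = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
peak_b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

# Assumed baseline model: one additive offset per sample that shifts
# every peak in that sample by the same amount.
baseline = [rng.gauss(0.0, 2.0) for _ in range(n_samples)]

observed_a = [a + c for a, c in zip(peak_a, baseline)]
observed_b = [b + c for b, c in zip(peak_b, baseline)]

r_true = pearson(peak_a, peak_b)              # close to zero
r_observed = pearson(observed_a, observed_b)  # strongly positive

print("correlation without baseline:", round(r_true, 2))
print("correlation with shared baseline:", round(r_observed, 2))
```

With these assumed variances the baseline accounts for about four fifths of the variance of each observed peak, so the observed correlation is expected to be near 0.8 even though the underlying peaks are independent.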

### Characteristics of the O2PLS model

As previously stated, O2PLS identifies joint variation between two datasets, in this case transcript and metabolite data, as well as systematic variation that is unique to each dataset (Figure 1a); in principle, any pair of datasets could be used. Each of these sources of variation is composed of smaller entities referred to as latent variables (Kvalheim, 1992), which describe independent effects in the data. The effect of applying O2PLS to a problem of this form is shown schematically, in simplified form, in Figure 1(b) for the special case of integrating transcript and metabolite data for a plant experiment. Although the O2PLS method typically relies on terminology from the field of linear algebra, we will aim to avoid the usage of complex mathematical notation whenever possible. When such usage is required to make a clear point, we will attempt a parallel generalized explanation for clarity. A more detailed explanation of the underlying mathematics is partly available in the Supplementary material, but is given more comprehensively in the works of Trygg (Trygg, 2002; Trygg and Wold, 2003).

As seen in Figure 1(a), the O2PLS model contains six distinctive structures. Without loss of generality, we will assume that one dataset contains transcript data and the other contains metabolite data, although there is no general restriction towards usage of these particular data types. The transcriptomic dataset is decomposed into three distinct structures in the O2PLS model. One is transcript-predictive, i.e. it describes gene expression profiles that are useful for predicting metabolite profiles. There is also a transcript-unique structure, which describes systematic effects in the transcript data that are not useful for predicting the metabolite profiles. As this structure describes systematic changes in the transcript levels that have no general correlation patterns with the metabolites, it could be linked, for example, to systematic array bias. Finally, the remaining variation in the transcript data ends up in the transcript residual, which captures stochastic or noise effects. The corresponding structures for the metabolite dataset will be denoted metabolite-predictive, metabolite-unique and metabolite residual, respectively. The term joint variation, which will be used throughout the paper, refers to the transcript-predictive structure together with the metabolite-predictive structure (Figure 1a). The residual structures will generally be excluded from illustrations and elaborations regarding the results.
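For readers who prefer a compact notation, the six structures can be summarized as an additive decomposition of the two data matrices. The symbols below are chosen here for illustration and are an assumption of this summary; the formulation and notation used by the original authors are given in Trygg and Wold (2003). Let $X$ denote the transcript data (samples in rows, transcripts in columns) and $Y$ the metabolite data for the same samples:

$$
\begin{aligned}
X &= \underbrace{T_p W^{\top}}_{\text{transcript-predictive}} + \underbrace{T_o P_o^{\top}}_{\text{transcript-unique}} + \underbrace{E}_{\text{transcript residual}} \\
Y &= \underbrace{U_p C^{\top}}_{\text{metabolite-predictive}} + \underbrace{U_o Q_o^{\top}}_{\text{metabolite-unique}} + \underbrace{F}_{\text{metabolite residual}}
\end{aligned}
$$

Here $T_p$, $T_o$, $U_p$ and $U_o$ hold sample scores (one column per latent variable), while $W$, $P_o$, $C$ and $Q_o$ hold the corresponding variable loadings. Each term on the right-hand side has the same dimensions as the matrix it decomposes.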

All of the different model structures maintain the dimensionality of the original datasets. This implies that one can interpret the dominant correlation and covariance trends both in the sample direction (for instance, to identify interesting tendencies, clusters among samples or highly deviating biological replicates) and in the variable direction (for instance, to identify influential transcripts and potential metabolites). This applies to all six model structures individually. This model transparency is a unique property of the O2PLS method, and will be demonstrated in numerous ways throughout the paper.

### O2PLS model estimation

O2PLS modeling requires estimation of the complexity of the different structures in Figure 1(a–b). Specifically, this determines what fraction of the total variation is distributed among the predictive, unique and residual structures, respectively. This is equivalent to finding a suitable number of latent variables (Kvalheim, 1992) for each type of structure (Figure 1a–b). In essence, there are two different categories of unfavorable outcomes that should, ideally, be avoided. The first case is model underfitting, where the complexity of the model is too low compared with the complexity of the dataset. Systematic structures exist in the datasets but the model fails to capture them, thus hampering both predictions of future (unknown) samples and model interpretation. The second case is model overfitting, where the complexity of the model is set too high. Systematic structures as well as noise-related (stochastic) features will be incorporated, causing the generality of the solution to weaken as the stochastic features are unlikely to be representative of the data. The two outcomes are mutually exclusive, so a suitable intermediate solution is required. This is essentially a dataset-specific problem that is frequently approached by means of resampling methods. Here, Monte Carlo cross-validation (MCCV) (Shao, 1993) has been utilized for this purpose; further details are available in the Experimental procedures and in the Supplementary material.
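The general idea of using MCCV to choose model complexity can be sketched as follows. This is not the procedure used in the study: as a stand-in for a latent-variable model, the example uses a hypothetical binned-mean regressor in which the number of bins plays the role of the number of latent variables, and the data, split counts and test fraction are all assumed values. Too few bins underfit the simulated curve, too many bins fit the noise, and the averaged held-out error is used to pick an intermediate setting.

```python
import math
import random


def mccv_splits(n_samples, n_splits, test_fraction, rng):
    """Yield (train, test) index lists from repeated random partitions."""
    n_test = max(1, int(round(n_samples * test_fraction)))
    indices = list(range(n_samples))
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def fit_bins(x, y, k):
    """Toy model: mean of y within each of k equal-width bins on [0, 1)."""
    overall = sum(y) / len(y)
    sums, counts = [0.0] * k, [0] * k
    for xi, yi in zip(x, y):
        b = min(int(xi * k), k - 1)
        sums[b] += yi
        counts[b] += 1
    # Empty bins fall back to the overall training mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def predict_bins(model, xi):
    """Predict with the mean of the bin that xi falls into."""
    k = len(model)
    return model[min(int(xi * k), k - 1)]


def mccv_error(x, y, k, n_splits=100, test_fraction=0.3, seed=0):
    """Mean squared prediction error averaged over all MCCV splits."""
    rng = random.Random(seed)  # same seed, so every k sees the same splits
    errors = []
    for train, test in mccv_splits(len(x), n_splits, test_fraction, rng):
        model = fit_bins([x[i] for i in train], [y[i] for i in train], k)
        sq = [(y[i] - predict_bins(model, x[i])) ** 2 for i in test]
        errors.append(sum(sq) / len(sq))
    return sum(errors) / len(errors)


# Simulated data: a smooth systematic trend plus stochastic noise.
data_rng = random.Random(1)
x = [data_rng.random() for _ in range(60)]
y = [math.sin(2 * math.pi * xi) + data_rng.gauss(0.0, 0.3) for xi in x]

candidates = [1, 2, 4, 8, 16, 32]  # candidate complexities
errors = {k: mccv_error(x, y, k) for k in candidates}
best_k = min(errors, key=errors.get)

for k in candidates:
    print(k, round(errors[k], 3))
print("selected complexity:", best_k)
```

Unlike k-fold cross-validation, each MCCV split is drawn independently, so a sample may appear in the test set of several splits; the number of splits and the test fraction are tuning choices.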

### Study summary

In the present case study, we show how the O2PLS method can be used for integration of transcriptomics data, in the form of dual-channel cDNA microarray data (Schena *et al.*, 1995), and metabolomics data in the form of GC/MS data (de Hoffmann and Stroobant, 2001). All biological samples used in this study originate from an experiment investigating short-day-induced effects on wild-type hybrid aspen (*Populus tremula* × *Populus tremuloides*). The trees were grown under different light conditions: long day (LD) and short day (SD). Material was subsequently collected at various time points, yielding the LD_{0}, LD_{2}, SD_{2} and SD_{6} sample categories, respectively (see the Experimental procedures for further details). Finally, the biological samples were measured in parallel for estimation of metabolite and transcript abundances. We will adopt a non-stringent use of the terms ‘transcript’ and ‘metabolite’ to refer to microarray elements and resolved peaks from the GC/MS spectra, respectively.

The study is outlined as a two-step procedure. First, the predictive and unique variation will be identified using O2PLS, without incorporating prior knowledge of the classes (treatments). The modeled data will be visualized and interpreted primarily based on graph methods as a proof of concept, but no comprehensive biological investigations will be conducted as this is peripheral to the aim of the presented work. Because of the simplicity of this proof-of-principle study we can only expect to identify certain transcript–metabolite correlations. These effects are bound to be temporally linked (as the time dimension is unavailable) and related across the different datasets and treatments (LD versus SD). See Rischer *et al.* (2006) for a demonstration of the potential differences in temporal response for transcripts and metabolites. Subsequently, the predictive systematic variation from the two datasets (from the O2PLS model) will be used to discriminate between the classes by means of OPLS-DA (Bylesjö *et al.*, 2006), to show that the related structures capture the implicit class information. This process is depicted in Figure 2. In the second step, the two most diverse categories will be used for model training, whereas the remaining two categories of samples will be used for validation, where class belonging is defined a priori.