Numerous domains of life science addressing the complexity of global biological systems, namely systems biology and related omics technologies, constitute fast-growing areas of research. They benefit from recent technological developments of analytical platforms, allowing extensive signal recording to characterise molecular events. Furthermore, the perspective of new insights on relations between genes, proteins and metabolic networks is offered in various domains including pharmacokinetics, metabolism or toxicology. Holistic strategies, related to the extended coverage of analytes and synergies between data layers, generate massive data structures that remain challenging to handle and fully exploit. Major changes regarding data dimensionality and complexity have recently caused an important paradigm shift in the knowledge discovery process. Extracting valuable information from omics approaches has therefore become crucial, and dedicated chemometric tools are needed to keep pace with the rapidly growing volume of data produced.
Metabolomics can be considered as the most functional part among the omics sciences, and since its early days, it has been promoted as a potent tool for assessing biochemical processes related to complex phenotypes. To benefit to the largest extent from the wealth of metabolomic data, several great challenges have to be handled by adapted data mining strategies, with respect to both biochemical knowledge and analytical methods (Figure 1). These include reducing data dimensionality for biomarker discovery, handling multiple data tables generated by several analytical platforms and analysing longitudinal metabolomic data with time-resolved models. These three key aspects of metabolomic data analysis are specifically discussed in the present article.
2 ASSESSING THE METABOLIC COMPLEXITY
Metabolomics is dedicated to the global evaluation of biological systems, performed from functional indicators of metabolic networks through the comprehensive and simultaneous monitoring of metabolite concentration levels. By providing global biochemical information about modifications of these levels related to pathological, environmental or genetic factors, metabolomics represents a promising way to monitor changes of naturally occurring compounds from the metabolism, that is, endogenous metabolites, and also xenobiotics and their transformation products in a holistic context. Untargeted approaches are generally used as a methodological starting point, to derive data-driven hypotheses through unbiased monitoring. Targeted (or confirmatory) methods are then applied for the reliable quantitation of relevant metabolites and the unequivocal confirmation of compound identities. Nowadays, metabolomic approaches play a key role in a wide range of research areas, including toxicology, disease diagnosis, cancer research, diabetes, responses to environmental stress, drug metabolism, natural product discovery and functional genomics.
As metabolites possess an extensive variety of physicochemical properties and concentration levels ranging from picomoles to millimoles, major efforts have been made to provide analytical tools able to detect, quantify or identify thousands of compounds in a single analysis of complex samples. Today, most untargeted metabolomic strategies are based on two main analytical platforms, namely nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS). Because of their respective advantages and limitations, their combination provides complementary information, which remains underexploited.
Nuclear magnetic resonance fingerprints can be obtained from the direct analysis of samples without complex sample preparation or prior separation, and quantitative information in the micromolar to millimolar range can be derived from simple and automated experimental protocols. NMR spectra are very reproducible and provide reliable fingerprints that are well suited for long-term studies. Therefore, the variability observed in NMR spectra can be easily related to biological effects. Because of its limited sensitivity, NMR is mainly useful for comparing samples on the basis of their main constituents.
Metabolite identification can be performed by the comparison of signals with reference compound data, measured under the same experimental conditions. However, because of the chemical complexity of biological samples, the identification process may be limited by overlapping signals. To cope with this limitation, the emergence of two-dimensional NMR experiments has provided valuable information for the de novo identification of unknown compounds. Such strategies lead to high-order data structures requiring adapted pre-processing and analysis.
While early metabolomic studies relied mainly on NMR, hyphenated methods involving separation techniques and MS (gas chromatography (GC)–MS, liquid chromatography (LC)–MS and capillary electrophoresis (CE)–MS) have now been demonstrated to be powerful and complementary analytical techniques. Although a prior separation step allows easier and more reliable metabolite quantitation, no single separative method is able to handle all classes of compounds possibly present in a biological sample. Cross-platform comparisons were performed to compare and associate data obtained from LC, GC and CE experiments.
Recent developments of ultra-high-pressure LC (UHPLC) have provided new perspectives regarding chromatographic performance. UHPLC has become a gold standard in less than 10 years, as it allows well-resolved peaks with narrow peak width leading to either an increased peak capacity or a shorter analysis time without the loss of resolution. These developments were simultaneously followed by a tremendous increase of reliable acquisition capacity by MS. The high mass-resolving power and acquisition rate of modern mass spectrometers such as time-of-flight, Orbitrap-based technology and Fourier transform platforms allow the simultaneous determination of elemental compositions of thousands of compounds from accurate mass measurements.
3 REDUCING DIMENSIONALITY FOR BIOMARKER DISCOVERY
Technical improvements associated with the analytical platforms described earlier lead to the generation of data structures of increasing size and complexity. Some typical examples of data structures generated by metabolomics experiments are presented in Table 1. Classical approaches are no longer applicable to such data collections, and dedicated data analysis strategies have to be developed. One of the major challenges is to mitigate the curse of dimensionality by selecting relevant signals from the raw data and decreasing redundancy. It has to be noted that the latter should be handled cautiously. As conservation relationships occur in networks of interrelated compounds, redundant signals may be informative to evaluate metabolic pathways comprehensively.
Table 1. Some examples of data structures in metabolomics
Direct injection MS
Time × m/z
Time × m/z
Time × m/z
LC × LC
Time × time
GC × GC
Time × time
Chemical shifts × chemical shifts
GC × GC-MS
Time × time × m/z
Time × m/z × m/z
Time × m/z × m/z
The information extracted from the raw data and the outputs generated depend heavily on the data analysis methodology. When investigating a biochemical pathway, a complete picture of all the actors involved in the chemical reactions is highly desirable. On the other hand, a limited series of robust biomarkers is often sufficient for clinical diagnosis or prognosis. It has to be underlined that the added value of many metabolomic studies and the associated models is primarily related to their biological relevance and to a lesser extent to their statistical significance.
The inspection of a statistical model should provide understandable metabolic information through the ability to disentangle the role of each variable. Finding relevant biomarkers, related to specific or common trends, constitutes the most evident aspect, but it needs to be followed by studies specifically elaborated for confirmatory purposes. A view of data relevance and redundancy is presented in Figure 2.
Literature regarding the unsupervised or supervised analysis of very large data tables resulting from metabolomic experiments already exists. Models based on latent variables, such as principal component analysis (PCA) and partial least squares (PLS) regression, constitute attractive solutions to provide compact data representations and diagnostic tools for the detection of biomarkers or to find common variation patterns in complex data. However, some crucial aspects such as dimensionality issues and proper model validation are still subjects of research and debate [14-16]. In numerous cases, feature selection is desirable. Methods dedicated to variable selection can be separated into three main categories, that is, filter methods based on selection criteria such as regression weights or variable importance in projection (VIP scores); wrapper methods including genetic algorithms or uninformative variable elimination; and algorithms with an embedded selection scheme such as sparse classifiers [17, 18]. Alternatively, an untargeted selection of molecules or classes of compounds can be performed from specific chemical characteristics of the raw data. For example, a priori information from lists of reference m/z values related to molecules of interest can be compared with accurate MS data. When knowledge-based data selection is successful, models can then be built on the basis of reduced sets of variables instead of the whole dataset. Today, metabolite set enrichment analysis is performed by searching for metabolites belonging to a given pathway among variables that are differentially expressed between case and control conditions. Compound identification remains the major bottleneck to provide sound biological knowledge from statistical models. This step is mandatory for both the biological validation of the results and the added value of the study.
Such an issue might be circumvented by the development of new approaches incorporating variable identification and using metabolic pathways information to build models. Metabolic pathways and chemical classes of compounds extracted from databases provide a priori information that can be used to assess altered patterns with respect to groups of specific related metabolites. These repositories constitute key tools to extract biological information from metabolomic data by linking observed features to existing knowledge about enzymatic reactions, proteins or genes. Unfortunately, the few databases providing this information are still incomplete.
Recent advances in data mining strategies include the incorporation of background knowledge, for example, dependencies between variables, to provide global tests able to assess subsets of metabolites. Information inferred from metabolic networks might be used to weight the relation between metabolites usually computed as a covariance. Such a strategy may help to integrate elements of hard modelling in metabolomic models. In that perspective, chemometric tools still need to adapt to more interpretation-oriented strategies. Great benefits are expected through the early integration of biological information in the analysis.
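The knowledge-based selection mentioned above, in which a list of reference m/z values is compared with accurate MS data, can be illustrated with a minimal sketch. The function name `match_reference_mz`, the 5 ppm tolerance and all numerical values are hypothetical and serve only as an illustration; adducts, isotopes and retention time are deliberately ignored.

```python
import numpy as np

def match_reference_mz(measured_mz, reference_mz, tol_ppm=5.0):
    """Return the indices of measured features within tol_ppm of any reference m/z."""
    measured = np.asarray(measured_mz, dtype=float)
    reference = np.asarray(reference_mz, dtype=float)
    # Pairwise relative mass error in ppm, shape (n_measured, n_reference)
    ppm_error = np.abs(measured[:, None] - reference[None, :]) / reference[None, :] * 1e6
    # A feature is kept if it matches at least one reference value
    return np.flatnonzero((ppm_error <= tol_ppm).any(axis=1))

# Hypothetical measured features and reference list (illustrative values only)
measured = [180.0634, 181.0707, 255.2330, 300.1000]
reference = [181.0707, 255.2324]
selected = match_reference_mz(measured, reference, tol_ppm=5.0)
```

The indices in `selected` can then be used to extract the corresponding columns of the data table, so that a model is built on the reduced set of variables only.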
4 MEASURING METABOLITES WITH MULTIPLE PLATFORMS
Because of the complexity and dynamic nature of the metabolome, multiple analytical platforms are mandatory to grasp the chemical space and provide full metabolome coverage. The combination of information collected from different data sources (e.g. NMR, GC-MS, CE-MS or LC-MS [22, 23]) constitutes the next challenge for the fusion, integration and comparison of these complex data. Prior to data fusion, it is useful to evaluate to what extent the different data blocks provide similar information. Some analytes may be detected with different analytical methods, and this redundant information may have an inappropriate impact on the analysis because of overrepresentation. On the other hand, when analytical devices measure different classes of compounds or concentration ranges, they can be considered as complementary.
Several statistical indices were proposed to measure the complementarity of sensors, such as the correlation coefficient, the coefficient of contingency or the similarity index. Correlation analysis shifts from the traditional point of view of sample-driven analysis, that is, a sample described by a set of variables (spectrum), to variable-driven analysis, that is, variables defined by a series of intensities across the samples. Additionally, probability values can be associated with observed correlations to check whether they are likely to occur by chance or reflect a biological origin. Non-parametric correlation indices such as Spearman's or Kendall's rank correlation coefficients constitute simple alternatives that avoid the linearity assumption. However, correlation does not necessarily imply causality, irrespective of the way it is evaluated. This calls into question the guilt-by-association principle that is widely used in the context of high-throughput approaches. While it remains quite intuitive to evaluate correlation between similar techniques (e.g. UV or fluorescence), it appears to be non-trivial when spectral and separative approaches are combined. Noda et al. proposed a generalised two-dimensional correlation method for the conjoint analysis of spectroscopic data. Similarly, statistical heterospectroscopy was proposed to combine NMR and LC-MS data through the evaluation of Pearson correlation profiles associated with highly significant probability values. An analogous strategy was recently applied to associate data from 1H-NMR and GC-MS experiments in the context of plant metabolomics. Correlation analysis of spectra from different biological sources, building links between NMR spectra from urine and serum, was tentatively implemented recently.
On the other hand, the use of a single number to globally characterise the link strength between high-dimensional data tables remains very useful for the feedback given to the analytical scientists. The RV coefficient, a generalisation of Pearson correlation designed to measure the similarity and quantify the common information between two matrices, constitutes an association measure that can be interpreted straightforwardly.
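As an illustration of such a global association measure, the RV coefficient between two column-centred tables X and Y sharing the same samples can be written as trace(XX'YY') / sqrt(trace((XX')^2) trace((YY')^2)). The following minimal sketch assumes this standard definition; the function name `rv_coefficient` and the simulated blocks `X_nmr` and `X_ms` are hypothetical.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two tables sharing the same rows (samples)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Column-centre each table
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Sample-by-sample cross-product (configuration) matrices
    Sx = Xc @ Xc.T
    Sy = Yc @ Yc.T
    # trace(Sx @ Sy) equals the sum of element-wise products because both are symmetric
    numerator = np.sum(Sx * Sy)
    denominator = np.sqrt(np.sum(Sx * Sx) * np.sum(Sy * Sy))
    return numerator / denominator

# Simulated blocks (random numbers, not real spectra): 20 samples, 50 and 200 variables
rng = np.random.default_rng(0)
X_nmr = rng.normal(size=(20, 50))
X_ms = rng.normal(size=(20, 200))

rv_self = rv_coefficient(X_nmr, 3.0 * X_nmr)   # identical configuration up to scale
rv_cross = rv_coefficient(X_nmr, X_ms)         # two unrelated random tables
```

The coefficient lies between 0 and 1 and is insensitive to rotation and global rescaling of either table. With few samples and many variables it tends to be high even for unrelated tables, so values should be interpreted with care.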
5 DATA INTEGRATION IN METABOLOMICS
The combination of multiple data sources can be undertaken by so-called high-level data fusion. It implies independent processing and modelling procedures applied to each dataset and chosen according to the data characteristics. The separate results, for example, model outputs from unsupervised or supervised approaches, are then associated to provide a global overview of the data. As a subset of the most valuable variables can be selected from each data source, the prediction performance can be increased when compared with individual analysis. The combination of equally performing models is expected to provide greater improvements in prediction accuracy than the combination of classifiers with unequal performance. The partial correlation between data tables, which is inherent to the analysis of metabolic networks, reduces the benefits of high-level data fusion more strongly in the context of discriminant analysis when dealing with unequally performing models. Such an issue may occur even if the measured metabolites are different between data tables. By highlighting the interconnections between features of interest, correlation networks constitute a solution for the combination of individual results. Such a data representation facilitates the interpretation of high-level data fusion and provides a global picture of the data. However, the complete network is often too complex to be displayed in a single image when handling massive data, for which new and original representations still need to be proposed.
As mentioned earlier, the information resulting from the separate analysis of several data tables is often incomplete and/or hardly interpretable. From an analytical point of view, integrating pieces of information from multiple sources at the data level constitutes one of the most fascinating perspectives offered by chemometrics. The simplest form of low-level data fusion, obtained by the horizontal concatenation of the data matrices combined with classical multivariate methods, for example, PCA or PLS–discriminant analysis [32, 33], remains insufficiently informative for most metabolomic studies. Because such an approach is greatly affected by disparate signal intensity ranges and variabilities between analytical platforms, particular attention has to be paid to the scaling procedure. Autoscaling is simple and has some advantages, but careful consideration is needed in the presence of naturally related groups of variables that could have too much influence. In this context, block scaling can be adequate, as it provides a way to balance the influence of blocks of variables in relation to their size. Block-wise scaling allows each group of variables to be considered independently as an entity with a specific variance. Therefore, we believe that scaling and generic/specific rules of block normalisation will constitute an important aspect of data integration in metabolomics.
Additionally, low-level data fusion further worsens the curse of dimensionality as it produces data tables with an excessive number of variables. In that perspective, the selection of the most relevant/predictive variables and the application of dimensionality reduction techniques can reduce the size of the fused data. Such strategies were coined intermediate or mid-level data fusion. The combination of multiple kernels constitutes another alternative to data fusion by taking advantage of the kernel trick to draw statistically sound inferences from several data tables. Kernels represent input data (input space) by means of a kernel function (usually non-linear), defining similarities between pairs of observations in a feature space of high dimension. The kernel matrices are then associated by linear combination to build a statistical model. A view of the main approaches to data fusion is provided in Figure 3.
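A minimal sketch of low-level fusion with block scaling is given below. It assumes one common convention, namely autoscaling each variable and then dividing each block by the square root of its number of variables, so that every block carries the same total variance; other block normalisation rules exist. The function name `block_scale` and the simulated blocks are hypothetical.

```python
import numpy as np

def block_scale(blocks):
    """Autoscale each variable, then give every block a total variance of one."""
    scaled = []
    for X in blocks:
        X = np.asarray(X, dtype=float)
        Xc = X - X.mean(axis=0)
        std = Xc.std(axis=0, ddof=1)
        std[std == 0] = 1.0                 # leave constant variables at zero
        Xa = Xc / std                       # unit variance for each variable
        # Dividing by sqrt(number of variables) balances blocks of different size
        scaled.append(Xa / np.sqrt(X.shape[1]))
    return scaled

# Simulated blocks with very different sizes and intensity ranges (random numbers)
rng = np.random.default_rng(0)
X_nmr = rng.normal(scale=1.0, size=(15, 40))
X_ms = rng.normal(scale=1e4, size=(15, 400))

scaled_blocks = block_scale([X_nmr, X_ms])
fused = np.hstack(scaled_blocks)            # low-level fusion by horizontal concatenation
block_variances = [B.var(axis=0, ddof=1).sum() for B in scaled_blocks]
```

Without the last scaling step, the larger block would dominate any subsequent PCA or PLS model simply because it contains ten times more variables.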
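The linear combination of kernel matrices can be sketched as follows. A Gaussian (RBF) kernel and equal weights are assumed here purely for illustration; in practice, the kernel function, its parameters and the block weights have to be chosen or optimised for the problem at hand. The names `rbf_kernel`, `combine_kernels` and the simulated blocks are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian kernel matrix: similarities between all pairs of observations."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances between observations
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)             # clip small negative rounding errors
    return np.exp(-gamma * d2)

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices, with weights normalised to sum to one."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

# Simulated blocks measured on the same 12 observations (random numbers)
rng = np.random.default_rng(0)
X_nmr = rng.normal(size=(12, 30))
X_ms = rng.normal(size=(12, 100))

# One kernel per block; gamma = 1 / number of variables is an arbitrary choice
K_nmr = rbf_kernel(X_nmr, gamma=1.0 / X_nmr.shape[1])
K_ms = rbf_kernel(X_ms, gamma=1.0 / X_ms.shape[1])
K_fused = combine_kernels([K_nmr, K_ms], weights=[1.0, 1.0])
```

Each kernel matrix has the size observations × observations, whatever the number of variables in the block, so the fused matrix `K_fused` can be passed to any kernel-based method without concatenating the original variables.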
6 MULTIBLOCK DATA MODELLING
Dedicated chemometric approaches, such as multiblock models, should provide new insights for the simultaneous analysis of multiple data tables, but their use often remains too complex to be handled by most analytical chemists. Multiblock modelling aims at assessing the relevance of each block and investigating the underlying relationships between different blocks of possibly related variables. The corresponding models usually provide two levels of analysis: (i) at an individual level, score and loading vectors are related to each data table independently; (ii) at a global level, consensus information is derived from the combination of all data tables. An overview of the major trends is provided to highlight information that is common and also differences between the blocks. Similar to single-block data analysis, two strategies can be further distinguished for multiblock modelling, that is, the unsupervised (descriptive) and supervised (predictive) approaches. Several methods were developed as multiblock extensions of the PCA algorithm, and the analysis of the concatenated blocks, namely SUM-PCA or consensus PCA, can be considered as the simplest form of multiblock data analysis. More complex approaches include hierarchical PCA, multivariate component models, multiple factor analysis, common component and specific weights analysis and multiple component analysis. These methods intend to determine global latent variables summarising the common structure shared by several data tables and still need to gain a broader audience. They should play a fundamental role in the near future, as they constitute an objective basis for the selection of the most informative analytical methods with respect to a specific biological question. From a technical point of view, this information is of utmost interest, as it can easily give an estimation of the real value of the chosen analytical process.
In the metabolomic context, where not only complementary but also competitive methods are used, it is important to know rapidly whether one method (e.g. hydrophilic interaction LC-MS in positive mode) is more informative than another (e.g. reversed-phase LC-MS in negative mode). Therefore, more proposals regarding multiblock data modelling are definitely welcome to broaden the use of chemometric tools in metabolomics.
Strategies for supervised multiblock modelling aiming to relate multiple blocks of explanatory variables to a response Y block originate from different frameworks, such as PLS regression and its extensions [42, 47], PLS path modelling, generalised canonical analysis and multiblock redundancy analysis. Initially, these methods were mainly applied to process control [51, 52] and sensometrics. It is regrettable that, to date, only a few metabolomic studies have implemented these advanced approaches for biomarker discovery. Recently, a consensus orthogonal PLS (O-PLS) strategy was developed for multiblock omics data fusion. In addition, the O2-PLS algorithm was proposed to link two data blocks in the context of systems biology. As an alternative to the joint and simultaneous fitting of all data blocks, the sequential consideration of data tables constitutes another very promising way of analysing multiblock data structures. With this end in view, an explorative path modelling strategy involving a sequential PLS fitting was presented by Naes et al. for relating several blocks.
Besides statistical criteria, chemistry-driven strategies are desirable to guide the way different data tables are associated in a model, with respect to the nature of the sample and the techniques used for its analysis. In the near future, integrated variable identification procedures should help to build reliable statistical models from heterogeneous data based on common biochemical information. As multiple analytical techniques could cover different chemical compartments, identified variables might drive the models toward a clever combination of data blocks, taking their complementarity, their predictive power and the accuracy of the measured signals into account. Conversely, depending on the degree of overlap between data tables, identification might be made easier by cross-linking information of different nature such as chemical shifts from NMR, m/z from MS or retention times from chromatography. Finally, visualising both the relation between the different blocks and the compounds detected in the different tables in a single picture might be very helpful in terms of interpretation. Such a graphical representation remains to be developed.
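To make the two levels of analysis more concrete, the following minimal sketch performs a PCA of concatenated blocks in the spirit of SUM-PCA. It assumes that each block is centred and scaled to unit Frobenius norm before concatenation, and it summarises the weight of each block on a global component by the sum of its squared loadings. The function name `sum_pca_sketch` and the simulated blocks are hypothetical, and published algorithms may differ in their normalisation and deflation steps.

```python
import numpy as np

def sum_pca_sketch(blocks, n_components=2):
    """PCA of concatenated, block-normalised tables with block-wise loadings."""
    prepared = []
    for X in blocks:
        Xc = np.asarray(X, dtype=float)
        Xc = Xc - Xc.mean(axis=0)
        prepared.append(Xc / np.linalg.norm(Xc))   # unit Frobenius norm per block
    X_cat = np.hstack(prepared)
    U, s, Vt = np.linalg.svd(X_cat, full_matrices=False)
    # Global (consensus) scores shared by all blocks
    global_scores = U[:, :n_components] * s[:n_components]
    # Split the loadings back into their blocks of origin
    edges = np.cumsum([0] + [B.shape[1] for B in prepared])
    block_loadings = [Vt[:n_components, edges[i]:edges[i + 1]].T
                      for i in range(len(prepared))]
    # Share of each global component carried by each block (columns sum to one)
    contributions = np.array([np.sum(P ** 2, axis=0) for P in block_loadings])
    return global_scores, block_loadings, contributions

# Simulated blocks sharing one common latent trend (random numbers)
rng = np.random.default_rng(0)
trend = rng.normal(size=(20, 1))
X_hilic = trend @ rng.normal(size=(1, 60)) + 0.5 * rng.normal(size=(20, 60))
X_rp = trend @ rng.normal(size=(1, 150)) + 0.5 * rng.normal(size=(20, 150))

scores, loadings, contributions = sum_pca_sketch([X_hilic, X_rp])
```

The rows of `contributions` indicate how strongly each block participates in each global component, which gives a first, rough indication of whether one analytical method is more informative than the other for the dominant trends.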
7 LONGITUDINAL DATA IN METABOLOMICS
Biochemical processes are intrinsically dynamic, and taking the temporal behaviour of the system under study into account constitutes a pivotal element for its understanding in many situations in life sciences. As it provides a phenotypic description of biological phenomena, the characterisation of time-resolved metabolic states is expected to reveal important biochemical information. In this perspective, data collection over time is especially relevant in metabolomics, and also in many areas of clinical research including toxicology and pathogenesis. For practical reasons, an ideal data collection (i.e. appropriate set of time points and a sufficient number of observations with balanced groups) is not always guaranteed. For example, the sampling duration and frequency are usually based on expected kinetics following the treatment, but problems may arise in the presence of quick or slow responders in the cohort under study. Time-resolved measurements generate multivariate metabolic time courses, and although several classical statistical approaches are dedicated to the analysis of time-series data, modelling high-dimensional multicollinear metabolomic data, usually characterised by a low number of time points, requires adapted strategies.
In the perspective of biologically driven data analysis, variable selection based on prior kinetic information may constitute a useful practice to extract signals related to a dynamic metabolic process. It also constitutes an excellent starting point for discussion between the chemical analyst and the chemometrician. In the case of exploratory metabolomic strategies, fundamental knowledge about the system under study is generally not available a priori, and dimension reduction strategies constitute relevant methods to provide insights into the data. As hundreds to thousands of variables can be measured during untargeted metabolomics experiments, finding a reduced data subspace is a potent way to grasp a dynamic behaviour. While two-dimensional data tables are generated in classical experimental setups (N observations × P variables), time-series data intrinsically involve a third temporal dimension. When considering the whole dataset, a three-dimensional data structure (N observations × P variables × Q time points), that is, a three-way tensor, is obtained. It is already recognised that matrix unfolding or matricisation, which flattens the data structure, is not well suited to such data. Taking multiple correlations into account in multiway data is complicated by the unfolding process, as elements that are adjacent in a tensor may become spatially separated when the data are unfolded according to specific modes. Hence, important structure and meaningful information may be lost during a subsequent projection step. Chemometric methods able to grasp the dynamic nature of the studied phenomenon are therefore needed in this context. The multiway structure of multivariate time-series data and two matricisation possibilities are presented in Figure 4.
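The two matricisation possibilities can be illustrated with a small numerical sketch. The array sizes are arbitrary, and the variable names `wide` and `long` are hypothetical labels for the observation-wise and variable-wise unfoldings.

```python
import numpy as np

# Toy three-way array: N observations x P variables x Q time points
N, P, Q = 4, 3, 5
X = np.arange(N * P * Q, dtype=float).reshape(N, P, Q)

# Observation-wise unfolding: one row per observation, N x (P*Q).
# Column p*Q + q holds variable p at time point q.
wide = X.reshape(N, P * Q)

# Variable-wise unfolding: one row per (observation, time point), (N*Q) x P.
# Row n*Q + q holds all variables of observation n at time point q.
long = X.transpose(0, 2, 1).reshape(N * Q, P)

# The same element can be traced in the three representations
n, p, q = 2, 1, 3
value_tensor = X[n, p, q]
value_wide = wide[n, p * Q + q]
value_long = long[n * Q + q, p]
```

In the wide table, successive time points of one variable sit in neighbouring columns but are treated as unrelated variables by PCA or PLS, whereas in the long table, the rows belonging to the same observation are treated as independent samples. In both cases, part of the three-way structure is ignored by the subsequent model.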
8 TIME-RESOLVED DATA ANALYSIS
In the context of time-series data, Smilde et al. reported that the covariance matrix of an unfolded table combines the covariances between different variables, between time points for the same variable (auto-covariance) and between different variables at different time points (cross-covariance). As a result, the specific influence of each aspect is hidden, and several approaches have been proposed to get around these difficulties. First, dedicated modelling approaches taking advantage of structured data were developed to disentangle the combined effects of several factors from an experimental design (e.g. treatment, dose and time), such as multilevel simultaneous component analysis (SCA), ANOVA-SCA (ASCA [59, 60]), ANOVA-PCA, scaled-to-maximum aligned and reduced trajectories (SMART) and principal response curves (PRC). All these methods intend to separate the different sources of variation with respect to the experimental protocol based on univariate linear models. A dimension reduction step is then applied to summarise the factors' effects. These approaches differ in the way that they handle the separate facets of the dataset and can provide different viewpoints on the data. While ASCA evaluates deviations between classes of observations across time, SMART assesses variations with respect to the first time point, and PRC focuses on differences with a reference control group. The choice of an adequate method therefore depends on the biological problem to be solved. A second possibility is to use a supervised modelling strategy to predict a response Y vector characterising the dynamic evolution of the process, but the interpretation is then limited to variables closely following the temporal trend imposed by the Y response. A third option relies on lagged variables in the X matrix obtained through the application of a lag operator to an element of the temporal series to generate its predecessor.
However, such a procedure may be detrimental as it increases the dimensionality of the data by building an extended data matrix. A fourth strategy is to evaluate differences between two successive time points by piecewise modelling, as recently proposed with O-PLS models. The temporal information is then obtained by summarising a series of individual models. It is to be noted that because of the diversity of the proposed approaches and their respective advantages and limitations, it remains difficult to establish their usefulness for a specific metabolomic study.
Alternatively, the multiway data structure can be seen as a special case in the multiblock framework, with an identical number of observations and variables in each slice. Tensor-based techniques were therefore developed to find structure-preserving projections more efficiently. They are usually designed to generalise singular value decomposition, and most of the existing approaches are unsupervised, including PARAFAC and Tucker algorithms, but supervised alternatives also exist, for example, N-PLS. The use of the tensor arrangement was recently underlined as a potent alternative to counteract the curse of dimensionality. As time-series data usually involve monitoring a number of parameters over time, such representation of the data and the associated chemometric techniques are particularly suited for the analysis of these complex data structures [19, 73]. We strongly believe that such approaches will be applied more extensively in the context of metabolomics. Finally, mixing prior knowledge of chemical reactions, that is, hard modelling, with soft-modelling chemometric approaches is expected to be highly beneficial to the temporal monitoring of metabolic processes. Instead of a pure abstract mathematical representation, soft-modelling techniques may take advantage of a rigorous model of the phenomena under investigation. The resulting hybrid model should be adjusted to fit the experimental data by optimising reaction parameters. This balance between theoretical and empirical information is needed to provide robust results with high generalisation abilities, while avoiding overfitting. Once again, the identification of the variables at hand constitutes a key point for the incorporation of physicochemical or biological knowledge into models and an appropriate evaluation of reaction kinetics. Moreover, as time-series data may help to identify compounds based on specific temporal profiles, the relation works both ways.
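The first family of methods, in which the data are partitioned according to the experimental design before a dimension reduction step, can be sketched as follows for an ASCA-like analysis. The sketch assumes a balanced design with two factors (treatment and time), estimates main effects by level means of the centred data and leaves the interaction in the residual. It is an illustration of the principle, not a complete ASCA implementation, and the names `effect_matrix` and `pca_scores` as well as the simulated data are hypothetical.

```python
import numpy as np

def effect_matrix(Xc, factor):
    """Replace each row of the centred data by the mean of its factor level."""
    factor = np.asarray(factor)
    effect = np.zeros_like(Xc)
    for level in np.unique(factor):
        rows = factor == level
        effect[rows] = Xc[rows].mean(axis=0)
    return effect

def pca_scores(M, n_components=2):
    """Scores and loadings of a matrix obtained by singular value decomposition."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :n_components] * s[:n_components], Vt[:n_components].T

# Balanced simulated design: 2 treatments x 3 time points x 2 replicates
treatment = np.repeat([0, 1], 6)
time = np.tile(np.repeat([0, 1, 2], 2), 2)
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 30))               # random numbers, not real data

# ANOVA-like partition of the centred data into main effects and residual
Xc = X - X.mean(axis=0)
X_treatment = effect_matrix(Xc, treatment)
X_time = effect_matrix(Xc, time)
residual = Xc - X_treatment - X_time        # interaction and unexplained variation

# Dimension reduction applied to one effect matrix at a time
time_scores, time_loadings = pca_scores(X_time)
```

Because the time effect matrix contains only three distinct rows with a zero mean, at most two components are needed to describe it, and the corresponding loadings point to the variables that change most over time irrespective of treatment.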
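As an illustration of a structure-preserving decomposition, the following minimal sketch computes a truncated higher-order singular value decomposition, which is one simple, non-iterative way to obtain a Tucker-type model. It is neither PARAFAC nor an optimised Tucker fit by alternating least squares; the function names `unfold`, `mode_product`, `hosvd` and `reconstruct`, the chosen ranks and the simulated data are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    """Matricise a tensor along one mode (that mode becomes the rows)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def hosvd(T, ranks):
    """Truncated higher-order SVD: one orthonormal factor matrix per mode and a core."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)    # project each mode onto its factors
    return core, factors

def reconstruct(core, factors):
    """Rebuild the tensor approximation from the core and the factor matrices."""
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

# Simulated array: 6 observations x 5 variables x 4 time points (random numbers)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5, 4))

core, factors = hosvd(X, ranks=(2, 2, 2))       # compressed model
X_hat = reconstruct(core, factors)
explained = 1.0 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

full_core, full_factors = hosvd(X, ranks=X.shape)   # no truncation
X_full = reconstruct(full_core, full_factors)
```

The three factor matrices describe the observation, variable and time modes separately, and the small core array describes how their components interact, so the temporal profiles remain directly interpretable instead of being scattered over the columns of an unfolded table.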
As a single analytical snapshot constitutes a simplistic view of metabolic phenomena, chemometric approaches including time-resolved analysis will certainly constitute key elements of future developments.
As it provides information about the biochemical state related to a biological system, metabolomics has become a highly valuable methodology in many research fields, such as biomedical diagnostics, plant science and nutrition research. The constant analytical developments of MS and NMR platforms are expected to provide broader functional information about metabolites, thanks to both improved resolution and sensitivity. Combining multiple data sources will also certainly acquire a growing importance, as merging information from different structures and extracting common or distinctive traits from several datasets will undoubtedly constitute a key element for a more comprehensive vision of scientific domains, including metabolomics. In that perspective, an in-depth understanding of the data integration strategies is mandatory to extract the information within massive amounts of data spread across different datasets. It should be noted that the data structures generated by these two major analytical techniques are now well established and, over the next few years, are expected to change in volume rather than in structure. Because LC-MS is now the gold-standard analytical technique for numerous laboratories, advances in data analysis will rapidly benefit the entire community. As presented here, several alternatives are already available, and others still need to be invented, taking into account that the choice of the appropriate methodology will depend on numerous aspects such as the analytical platforms, the intrinsic characteristics of the data, the biological questions to be answered, the quality of the collaboration and the statistical understanding of the researchers involved in the study. Finally, early or embedded compound identification within the chemometric workflow is highly desirable to improve model interpretation and biochemical relevance.