Automated resolution of overlapping peaks in chromatographic data

Parallel factor analysis 2 (PARAFAC2) has been shown to be a powerful tool for resolution of complex overlapping peaks in chromatographic analyses. It is particularly useful because of its ability to handle shifts in the elution time mode and peak shape changes. Like all curve resolution techniques, PARAFAC2 will only find chemically meaningful parameters (elution time profiles and mass spectra) if the correct number of factors are determined. So far, the primary way to determine an appropriate number of factors, when using PARAFAC2, is to calculate models with different number of factors and then inspect the models manually. This approach is time consuming, and the result may be biased because of the manual assessment of the model quality, making PARAFAC2 inaccessible for analytical chemists in general. Here, we develop a method that can determine an appropriate number of factors in an automated way. The automation is based on a number of model diagnostics (quality criteria) collected from models with different numbers of factors. Combining these diagnostics, it is possible to assess what the appropriate number of components is. In this work, only gas chromatography–mass spectrometry data are considered. However, it will most likely be fairly straightforward to expand the work to also cover liquid chromatography data (with a multivariate detector). Automating the model quality evaluation of the PARAFAC2 model enables both the inexperienced and trained user to perform comprehensive and advanced analysis of chromatographic data with a minimum of manual work. © 2013 The Authors. Journal of Chemometrics Published by John Wiley & Sons Ltd.


INTRODUCTION
Nowadays, the most widespread approach for chromatographic data analysis is using the software provided by the manufacturer of the instrument. However, it has been shown that the algorithms implemented in many commercial packages can result in suboptimal utilization of the information in the data, compared to what can be provided by curve resolution techniques such as multivariate curve resolution or parallel factor analysis (PARAFAC or PARAFAC2) [1,2]. A quite typical example is shown in Figure 1, where the deconvolution procedure from commonly used manufacturer's software has been applied to the data from 45 samples. The peaks in the total ion chromatogram (TIC) (left side of the figure) seem to be divided into two groups, indicating that the data represent two different chemical compounds. If the individual mass traces (not shown) are inspected, it is clearly seen that there are indeed two chemical compounds. The result from the manufacturer's software is shown in the middle plot in Figure 1. Several problems can be observed with this result. First of all, the software only finds a total of six peaks in all samples; this means that many of the peaks from the raw data are not described by this model. Furthermore, several of the peaks contain a shoulder (circle in Figure 1), indicating that the resolution of the two co-eluting peaks has not been successful. A third problem is that none of the peaks in the model have been separated from the baseline and that the level of the included baseline is fluctuating. On the other hand, by using PARAFAC2 on the dataset, these problems are seemingly solved.
The PARAFAC2 model is able to split the peak into three main contributions arising from the two chemical compounds and signal from the baseline (rightmost plot in Figure 1).
PARAFAC2 is a method able to handle many chromatographic artifacts (e.g., baseline drift, overlapping, and elution time shifts) [3]. It allows separating each source of variability in the data by using the spectral information gathered for each elution time.
The resulting model provides three important sets of parameters: an estimated elution time profile for each compound in each sample, an estimated pure spectrum for each compound, and relative concentrations for each of these chemical compounds [4].
Previous publications [5][6][7] have shown that the use of PARAFAC2 enables a comprehensive analysis of chromatographic data. However, a common practical issue in all curve resolution techniques is that the correct number of factors must be determined in order to obtain chemically meaningful profiles [8]. Unfortunately, it is not straightforward to determine the proper number of factors to include. This may be one reason why curve resolution methods are not usually seen in routine chemical analysis. Therefore, an automated selection of the appropriate PARAFAC2 model would provide a significant improvement in terms of allowing non-chemometrically skilled chemists to take advantage of modern data analysis solutions. When a PARAFAC2 model is calculated, several statistical and empirical diagnostics can be used to evaluate the reliability of the model. Traditionally, only explained variance, residuals, and mere observation of the obtained elution time and spectral profiles have been used when determining the appropriate number of factors in PARAFAC2. It has recently been suggested to use core consistency, which is a diagnostic tool indicating whether the model is overfitted [9], in the evaluation of PARAFAC2 models [10]. Nevertheless, there exist a number of additional diagnostics that can be used to evaluate the obtained PARAFAC2 model. In this manuscript, we suggest 102 different diagnostics that all aim to describe some aspect of the quality of a PARAFAC2 model. From these initial statistical and empirical diagnostics, we determine a classification model, which, in an automated way, can find an appropriate number of factors to include in the PARAFAC2 model. In order to obtain a good classification model, we determine which of the initial descriptive quality criteria are most important for the classification, and only those will be used in the final classification model. The proposed method is tested on four different gas chromatography-mass spectrometry (GC-MS) datasets originating from different chromatographic instruments and from different sample matrices (apples, wine, aroma standards, and cheese).
The manuscript will be initiated by a section describing the theory behind PARAFAC2 and a brief description of the diagnostics developed for assessing PARAFAC2 model complexity. The theory behind the classification will not be described in this paper but can be found elsewhere in the literature [11,12]. The description of the diagnostics will be followed by a section that describes how the classification model is optimized and a validation of the final classification model. Throughout this paper, the word compound refers to the chemical compounds contained in a sample, while the words factor and component refer to the outcomes of the model.

THEORY
PARAFAC was introduced simultaneously by Harshman [8] and Carroll and Chang [13] (who named it canonical decomposition).
Harshman based his PARAFAC model on the idea of parallel proportional profiles by Cattell [14]. The idea behind parallel proportional profiles is that if the relative amounts of overlapping phenomena are changing across samples, then it is possible to resolve the unique patterns for each of these phenomena.
The PARAFAC solution is uniquely identified (up to scaling and permutations) under mild conditions [15]. Essentially, PARAFAC is unique as long as the chemical compounds do not have identical spectra or identical elution profiles. This indirectly implies that a correctly specified PARAFAC model, applied to, for example, GC-MS data, can provide estimates of the pure mass spectra, concentration profiles, and pure elution time profiles when there are no elution time shifts (as illustrated in Figure 2).
Chromatographic data often contain shift in the elution time dimension, and PARAFAC is unable to handle shifted data efficiently without the data being aligned prior to modeling with, for example, correlation optimized warping as described by Skov et al. [1], as only one common elution time profile is estimated for each compound. Hence, PARAFAC assumes that the same elution profile can model the compound in all samples. However, it has been shown that PARAFAC2 can be used to solve the problem with shifts in retention time [4,16]. In PARAFAC2, an elution time profile is found for each compound in each sample as illustrated in Figure 2. As for PARAFAC, the PARAFAC2 solutions are also unique. A thorough description of PARAFAC2 has been made by Bro et al. [4,15].
For both PARAFAC and PARAFAC2, a model with too many factors will describe a variation that is not chemically meaningful. Sometimes, one or a few extra components do not disturb the model. For example, for PARAFAC, it has been shown that in some cases, more than one model can be chemically meaningful and provide estimates of the underlying patterns even though the models have different numbers of factors [8]. This implies that in some cases, there is no one optimal model but rather a range of appropriate models. In our experience, the same goes for PARAFAC2. Figure 3 shows two examples of under-fitted models obtained from the data presented in Figure 1. Both residuals and scaled elution profiles indicate that the models do not have enough factors included. The residuals are, especially in the one-factor model, behaving in a very systematic manner. This indicates that there is still chemical information in the data, which is not explained by the model. In the elution profiles, the under-fitting is indicated by the modeled peak shapes, which suggest that two (or more) co-eluting compounds are described by the same elution profile.
In Figure 4, a good model is shown to the left and an overfitted model to the right. Both models show how the residuals become smaller and more unsystematic when a sufficient number of factors are included. The over-fitted model, in this example, is characterized by having low core consistency (below zero) as well as an increase in the proportion of negative values in the obtained mass spectra, compared to the model with one less factor. Also, the number of iterations, which has increased considerably, compared to the model with one less factor, indirectly indicates that this model is over-fitted.
One disadvantage in the aforementioned evaluation of obtained models is that all of the diagnostics are subjective, and such evaluations will inevitably be biased by different personal opinions on which of the diagnostics one thinks is more important in the evaluation. Another disadvantage is that the evaluation is a time-consuming task. To counter these problems, an automated quality control is proposed by investigating 34 diagnostics (see supplementary material), as descriptors for quality of the PARAFAC2 model. All the diagnostics are calculated on models where the problem with sign indeterminacy has been solved with the method proposed by Bro et al. [17].
For each of the 34 diagnostics, also the difference ("diff") between the diagnostic value of the present model and the model with one factor less was included (diagnostics #35-68). This approach was inspired by the DIFFIT approach described by Timmerman and Kiers [18]. The diff value for a model with k factors is defined as the value at k factors minus the value at k À 1 factors. Also, the relative change is determined for all 34 diagnostics (diagnostics #69-102). The relative change is defined as the diff divided by the original value. This gives a total of 102 diagnostics.

Datasets
A total of four different GC-MS datasets have been included in this study: ), analysis of cheese derivatized with methoxamine (20 mg/mL in pyridine) followed by derivatization with N-methyl-N-(trimethylsilyl) trifluoroacetamide as described by Kanani et al. [20]. The derivatized cheeses were analyzed using an Agilent technologies 7890A GC system coupled to an Agilent technologies 5975C inert XL MSD with Triple-Axis detector.
As curve resolution techniques work in small and regional intervals, the four datasets were divided into intervals with an estimated maximum of five compounds in each interval. The TICs from the four datasets are shown in Figure 5, with blowups showing examples of these intervals. In total, 155 intervals were created. PARAFAC2 models with one to seven factors were calculated for each of these intervals, and the diagnostics described in the supplementary material were determined for each of these 1085 models. By working on baseline-separated intervals, the problems are typically much easier to analyze, and more unambiguous results are obtained. PARAFAC2 models were calculated without any constraints (such as non-negativity) as these constraints might disguise indications of over-fit.
The 'correct' number of factors for each model was determined by having a skilled GC-MS user evaluating the models by assessing not only the core consistencies, number of iteration used, residuals, and obtained elution and spectral profile (as described in the preceding section) but also the chemical information in the raw data.
The classification model was constructed with partial least squares discriminant analysis (PLS-DA) as follows: The 155 intervals were randomly divided into a calibration (75 intervals) and a validation (80 intervals) set. The PLS-DA model was optimized, Figure 2. Illustration of the differences between PARAFAC and PARAFAC2. In PARAFAC, one common elution profile is used to describe each compound in all samples, whereas in PARAFAC2, each compound in each sample is modeled with a distinct elution profile. Figure adapted with permission from Amigo et al. [3]. Copyright © 2010 American Chemical Society. and the number of variables (diagnostics) included reduced by using the calibration set with 75 intervals and the 102 diagnostics, and subsequently tested using the test set containing the rest of the intervals and the remaining diagnostics. Eleven of the intervals were not modeled well by PARAFAC2. The reason for that could be that two compounds were totally overlapping (embedded peaks), the peak shapes were severely changed, or that the spectra of different compounds were too similar. These intervals were removed from the calibration set before any variable selection was performed in order for them not to influence the selection.

Software
All algorithms and models have been developed using MATLAB R2012a (Mathworks, Inc., Natick, MA, USA). PLS_Toolbox (Eigenvector Research, Inc., Wenatchee, WA, USA) has been used for principal component analysis and PLS-DA models. PARAFAC2 models have been calculated using the algorithm available from www.models.life.ku.dk (April 2013).

RESULTS AND DISCUSSION
By visual inspection of the raw diagnostics, it became obvious that most of them were following an overall pattern. In Figure 6, the values of core consistency (diagnostic 20), the number of iterations (diagnostic 2), and the relation between the absolute area and the area of the spectral profile (diagnostic 10) are shown. The values are obtained from the PARAFAC2 models illustrated in Figures 3 and 4.
Small changes in the diagnostic value are observed as more and more factors are included until the model becomes overfitted. At this point, there is a significant change after which the diagnostic value only changes slowly again. The relative difference obtained from the first over-fitted model describes this jump in the raw diagnostic values, as illustrated in Figure 6. Therefore, the first over-fitted model will be distinguishable from the ones with fewer components. This approach is similar to what was described by Hoggard and Synovec [21] in their paper concerning automated determination of the number of factors to include in PARAFAC models. In order to determine which of the 102 selected diagnostics were useful in classification of the first over-fitted model, variable selection was conducted. By using a combination of three different variable selection techniques, the most important diagnostics were selected. The variable selection techniques used were variable importance for projection (VIP, which indicates the importance of a variable for the model), interval partial least squares regression (iPLS, which identifies a few number of variables that give better predictions than the other), and genetic algorithms (which selects the variable combination that gives the best model). All of these techniques are thoroughly described by Andersen and Bro [22].
The following seven diagnostics represent a good compromise between having a good classification power without too many variables (the numbers in parentheses refer to the table in the supplementary material, which includes all 102 diagnostics): core consistency (#20), change in the negative area in the elution profile (#43), change in negative correlation between spectra (#53), logarithm of the number of iterations (#3), the positive correlation between spectra (#18), the relative change in how systematic the residuals are (indicated with Durbin Watson) (#97), the relative change in the correlation between the TIC from raw data and the TIC from the obtained model (#94).
The specificity (0.97 for over-fit and 0.95 for not over-fit) obtained with a model including these seven diagnostics is very similar to the model including five additional diagnostics.
None of the seven final diagnostics could be removed without a significant loss of classification power (decrease in specificity close to 10%).
The regression vector obtained in the PLS-DA model (Figure 7) indicated that the core consistency, changes in the negative area in the elution profile, and changes in negative correlation between spectra are negatively correlated to over-fit. On the other hand, the following diagnostics were positively correlated with the first overfitted model: the logarithm of the number of iterations, the positive correlation between spectra, the relative change in how systematic the residuals are (indicated with Durbin Watson), and the relative change in the correlation between the TIC from raw data and the reconstructed TIC from the obtained model. The PLS-DA model based on these seven diagnostics was used to classify the models in the test set (a total of 80 intervals). Most of these intervals were correctly classified according to the manual evaluation. Furthermore, it was found that the majority of the cases where there was incongruence between manual and automated evaluation were either models that were misclassified by the manual evaluation or cases where PARAFAC2 was simply unable to describe the data in an appropriate way.
As an example of the wide application range of the presented method, the elution profiles from three over-fitted models, with very different characteristics, are shown at the right of Figure 8. These were all correctly classified as the first over-fitted models, leading to the models with one factor less, shown at the left of the figure, to be determined as being appropriate models. The diversity in these models indicates that the classification is performing well for many different types of GC-MS data.
In Table I, the cases of incongruence between manual and automated evaluation are listed. These cases can be divided into three categories; misclassified (eight models), intervals not well described by any PARAFAC2 model (10 models), and models where both the results of the manual and automated classifications can be assessed as correct (nine models). In the table, it can be seen which of these categories the individual models falls into. In the following, a thorough inspection of the three types will be given.

Intervals not well described by PARAFAC2
These are cases where it is not possible to obtain a model that looks meaningful from a chemical point of view. An example of this is interval 2 from the cheese data (Figure 9). In the model with two factors (left side), the two peaks have  not been separated. Furthermore, the factor that describes the baseline contains a number of small peaks. This indicates that this component is actually describing baseline plus at least one chemical compound. These observations indicate that additional factors should be included. In the model with three factors (right side), there is still one elution profile that describes both of the peaks. Additionally, the spectra from component 2 (green profiles) are exclusively negative, which indicate that the model is over-fitted. In models with more than three factors included, the signs of over-fit become even more obvious. When inspecting the mass spectra of the two peaks (not shown), it can be seen that the two compounds have many fragments in common but that they also have some differences. In this case, we must therefore conclude that this interval cannot be well described with PARAFAC2. Imposing non-negativity in the estimated spectral profile and concentrations did not improve the models. Note that traditional chromatographic software would also not be able to provide meaningful results of this interval.
It might be possible to identify infeasible intervals by including the correct diagnostics. The development of such a method is beyond the scope of this paper but should be investigated further in future work.

Intervals offering several solutions
In cases where the classification might be correct even though it differs from the manual classification, there exists more than one number of factors for which it could be argued that the model is appropriate. An example of this could be interval 24 from the apple data. In Figure 10, the PARAFAC2 models with two and three factors, respectively, are illustrated. In the model with two factors (left side), there is still a small amount of systematic behavior in the residuals, indicating that an additional factor should be included in the model. However, if an additional factor is included (right side), there is an increase in negative values in the spectral profile, indicating that too many factors have been included. By constraining the model with non-negativity in spectral profiles and concentration, the model with three factors no longer has indications of being over-fitted. This indicates that the model with three factors indeed is not over-fitted as also suggested by the classification model. A more thorough investigation was conducted by calculating new models on the same interval with different starting points. These new models indicated that the obtained model is representing a local minimum.
Another example of a model with seemingly conflicting results is from the cheese dataset, interval 22. Upon thorough inspection of the model and the raw data, there are indications that this three-factor model may have been correctly classified as not over-fitted and that it is the initial manual assessment that is wrong. The spectral profile obtained by the model (middle plot, Figure 11) shows that there are some masses that separate the two peaks. If these masses are inspected in the raw data (leftmost plot in Figure 11), it can be seen that there actually seems to be a shift, indicating that two chemical compounds are eluting. However, the two compounds are poorly resolved (as also shown by the modeled elution time profiles; leftmost plot in Figure 11), and this might be the reason that PARAFAC2 has some difficulties in modeling them. It is interesting that apparently, the results of a properly trained expert system can outperform a skilled analyst.

Misclassified
The incorrectly predicted models can be divided into two sub-categories: misclassified as not over-fitted and misclassified as over-fitted. However, there are no models that fall into the first category, indicating that the actual specificity of the classification model is very high. In the second category, there are eight models.
In the cheese dataset, interval 5, there is a small amount of tailing, which is not caught in the model identified as the correct one. However, neither the obtained spectral profile nor the concentration profiles (not shown) are affected by this. From all practical aspects, this will not affect any further interpretation of data. In the apple dataset, interval 13, a small peak is missing in the model identified as being appropriate. However, this peak is of such small magnitude that it does not affect the obtained concentration profiles and mass spectra, which are very similar in the two models. The same situation is observed for cheese, interval 20. In this case, the missing peak has a higher magnitude, but nevertheless, the compounds are modeled meaningfully and very similar to those in the appropriate model (illustrated in Figure 12). Furthermore, both spectra and concentration profiles (not shown) of the main peak (which is present in both models) are practically unaffected. This means that further interpretation will not be affected besides the fact that a compound is missed. The misclassifications of the remaining five intervals (apple, interval 1; cheese, intervals 10, 12, and 65; wine, interval 30) result in models that deviate more severely from the appropriate model.
Summing up, these observations actually show that over 90% of the models that are identified as correct with the suggested approach are describing real underlying chemical information (assuming that such a model can be found using PARAFAC2). This is to our best belief a significant improvement compared to the alternatives offered by the manufacturer's software. We have shown that the suggested automated approach can indeed be used to identify models that are describing the underlying chemistry. It may be interesting to compare automatically determined PARAFAC2 models to solutions obtained with alternative methods for quantification of overlapping signals in order to determine which methods are most suitable for different situations.
Additionally, it should be mentioned that the additional computation time needed for calculating the seven diagnostics included in the final PLS-DA model is insignificant compared to the time needed for calculation of the PARAFAC2 models. The step in the automated procedure that is most time consuming is the correction of the sign indeterminacies, and also, this step is considerably less time consuming than the calculation of the PARAFAC2 models. Figure 9. Interval 2 from the cheese data modeled with two and three factors, respectively. The model with two factors seems to have too few factors, whereas the model with three seems to have too many factors included. The colors in the figure refer to the different components; for example, blue in elution and spectral profile refers to the same component. Figure 10. Interval 24 from the apple data modeled with two and three factors, respectively. The model with two factors seems to have too few factors because there is still some systematic behavior in the residuals, but otherwise, it seems like an appropriate model. The model with three factors seems to have too many factors included, because of the increase in negative values in the spectra profile; otherwise, it is a nice model. The colors in the figure refer to the different components; for example, blue in elution and spectral profile refers to the same component. Figure 11. Illustration of the elution time profiles (left) and mass spectral profiles (middle) obtained from cheese, interval 22, modeled with three factors. The rightmost plot shows the mass traces of the masses, which separates the two peaks (indicated with circles in the middle plot). The colors in the figure refer to the different components; for example, blue in elution and spectral profile refers to the same component.

CONCLUSION
By the usage of different variable selection techniques, we have found that a PLS-DA model based on seven quality criteria can be used to automate selection of the number of components in PARAFAC2. The seven quality criteria that turned out to be most important for the classification are as follows: (1) core consistency, (2) change in the negative area in the elution profile, (3) change in negative correlation between spectra, (4) logarithm to the number of iterations, (5) the positive correlation between spectra, (6) the relative change in how systematic the residuals are (indicated with Durbin Watson), (7) the relative change in the correlation between the TIC from raw data and the TIC from the obtained model.
The obtained PLS-DA model was evaluated on an independent test set. This validation showed that over 90% of the models that were determined to be appropriate by the automated procedure describe the underlying chemistry. This makes the approach very useful for handling huge amounts of data, without the need of a skilled chemometrician to manually evaluate every model. It has been shown that the developed method can handle very different kinds of GC-MS data, and optimization should therefore only be necessary if the model is applied on other types of data (e.g., liquid chromatography-MS).
A MATLAB routine has been written, which calculates PARAFAC2 models and finds valid models using the approach described in this paper. The function, which is called AutoChrome, is available at www.models.life.ku.dk (April 2013). If the prediction of the number of compounds is combined with additional tools [2], identification can be accomplished using the open source software OpenChrom [23] combined with the NIST mass spectral library.