Implications of confounding from unmodeled interactions between explanatory variables when using latent variable regression models to make inferences

With linear dependency between the explanatory variables, partial least squares (PLS) regression is commonly used for regression analysis. If the response variable correlates to a high degree with the explanatory variables, a model with excellent predictive ability can usually be obtained. Ranking of variable importance is commonly used to interpret the model and sometimes this interpretation guides further experimentation. For instance, when analyzing natural product extracts for bioactivity, an underlying assumption is that the highest ranked compounds represent the best candidates for isolation and further testing. A problem with this approach is that in most cases, the number of compounds is larger than the number of samples (and usually much larger) and that the concentrations of the compounds correlate. Furthermore, compounds may interact as synergists or as antagonists. If the modeling process does not account for this possibility, the interpretation can be thoroughly wrong because unmodeled variables that strongly influence the response will give rise to confounding of a first‐order PLS model and send the experimenter on a wrong track. We show the consequences of this by a practical example from natural product research. Furthermore, we show that by including the possibility of interactions between explanatory variables, visualization using a selectivity ratio plot may provide model interpretation that can be used to make inferences.

Latent variable regression (LVR) methods, such as partial least squares (PLS),1 are extensively used to model relations between a suite of explanatory variables and a response variable. Often, there is no need for the explanatory variables to be responsible for, or causally linked to, the response. A model with satisfactory prediction ability can still be obtained if the explanatory variables correlate with the response. However, if the objective is to use the associations of the explanatory variables to the response to generate a hypothesis and to make inferences for increased understanding and/or to guide additional experiments, the situation is different.2 In this case, we assume that the explanatory variables are responsible for the systematic variation in the response and that interpretation of the association pattern between response and explanatory variables is meaningful and can be used to achieve the purpose of the investigation.
Analysis of natural product extracts to discover new bioactive compounds represents an application where the predictive performance of a model is necessary, but not sufficient, to be useful for its purpose. Such investigations are conducted on whole or fractionated botanical extracts. Chromatography in combination with mass spectrometry is used for profiling samples, and the measured peak areas or peak heights of the selected ions are used as explanatory variables to predict the measured bioactivity.3 But because the number of measured analytes is usually higher than the number of samples, or, more precisely, higher than the number of underlying latent variables necessary to establish a validated model with predictive ability, confounding presents a problem. Thus, bioactive candidates obtained by interpreting the association pattern of analytes to the measured bioactivity may be the result of confounding patterns and not point to true bioactive analytes.4 The problem is exacerbated by the possibility of strong synergistic or antagonistic behavior between analytes.5 Thus, models including interactions between analytes should also be examined.2 This increases the complexity of the model because the number of possible interactions may be high.4 This situation is very different from the situation with controllable explanatory variables, where confounding can be planned for and is accounted for by using statistical experimental design.6,7 For instance, in a screening phase, looking for the most important explanatory variables influencing the response, a saturated design is often chosen with the number of experiments N equal to the number of explanatory variables M plus 1.8 With variable levels chosen so that the explanatory variables are orthogonal to each other, a first-order regression model is calculated. By assuming that all the explanatory variables are unimportant and that their estimated effects therefore follow a normal distribution, a reduced model is obtained including only those explanatory variables that violate this assumption. An underlying assumption for such screening analyses is that interactions between explanatory variables are negligible. If this is not the case, it is sometimes still possible to detect the presence of interactions by inspecting the confounding pattern of the explanatory variables with the interactions, provided that few of the explanatory variables associate to the response variable.
The approach used for controllable explanatory variables is not easily carried over to situations with uncontrolled explanatory variables, for instance, when looking for bioactivity in a set of similar natural product extracts.5 There are two main differences between this situation and the case with controllable explanatory variables. First, the number of variables M is typically larger than the number of samples N, and often M is much larger than N. Second, even if M > N, the explanatory variables are correlated to different degrees so that the number of independent variables A with chemical information is often much less than both M and N. This shows up in a relatively low-dimensional LVR model between the response and the explanatory variables. For instance, a typical PLS regression model contains two to 10 PLS components even when the number of samples is in the hundreds. This means that confounding will present a problem when the model is interpreted to make inferences that guide further experimentation, even under the assumption of a first-order model for the response. The problem is exacerbated when the assumption of no interaction between compounds in the mixtures is violated.
The aim of this communication is to show through a practical example from natural product research4 how confounding from interactions not accounted for in the model may influence model interpretation and possibilities for correct inferences when the number of samples is small compared with the number of explanatory variables.

| Software
Sirius version 13 (Pattern Recognition Systems AS) was used for the analysis. Methods in this software have recently been implemented as an open-source R package called MVPA, which is available on GitHub (github.com/liningtonlab/mvpa).9 The MVPA R package is integrated into an R shiny graphical user interface called mvpaShiny (github.com/liningtonlab/mvpaShiny). A detailed description of how to install and use the packages is available on the associated documentation page (https://liningtonlab.github.io/mvpaShiny_documentation). The package can be used to solve different kinds of problems in multivariate regression. The basic PLS regression algorithm is from the chemometrics package of Filzmoser and Varmuza,10 but the validation of predictive PLS components uses the repeated Monte Carlo resampling algorithm of Kvalheim et al.11,12

| Data set
A thorough description of experimental procedures and access to the data set used here is provided in Vidar et al.4 Here, we only provide the necessary information for understanding the analysis and interpretation in Section 3.
The data set consists of nine mixtures for which the response variable is the antimicrobial activity for each mixture against Staphylococcus aureus. The mixtures were obtained by spiking a series of mixtures without antimicrobial activity (inactive mixtures) with the known antimicrobial berberine and the antimicrobial synergist piperine. The piperine molecule does not possess antimicrobial activity on its own but synergistically enhances the antimicrobial activity of berberine. The mixtures were profiled using liquid chromatography coupled to high-resolution electrospray ionization mass spectrometry on an Acquity UPLC System (Waters) interfaced to a Q-Exactive Plus Hybrid Quadrupole-Orbitrap Mass Spectrometer (Thermo), and the data were preprocessed using MZmine with methods described previously.4 After filtering and preprocessing, the abundances (as measured by chromatographic peak area) of 33 ions were obtained. This included one ion associated with berberine and five ions (redundant features) associated with multiple clusters of piperine.4 These clusters were the protonated species [M + H]+, the sodiated species [M + Na]+, the proton-bound dimer [2M + H]+, the sodium-bound dimer [2M + Na]+, and a sodiated acetonitrile cluster [M + ACN + Na]+. Table 1 shows the measured inhibition of the nine mixtures together with the concentrations of berberine and the synergist piperine. Note from Table 1 that the concentrations of berberine and piperine were spiked according to an experimental design to achieve good coverage of different degrees of inhibition for the mixtures. The abundance variables (peak areas) were standardized to unit variance prior to modeling.
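The scaling step mentioned above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code used in Sirius or MVPA; the function name `standardize_unit_variance`, the use of the sample standard deviation (`ddof=1`), and the small peak-area matrix are assumptions made for the example.

```python
import numpy as np

def standardize_unit_variance(X):
    """Divide each column (ion) by its sample standard deviation.

    Assumption: sample standard deviation (ddof=1); constant columns are
    left unscaled to avoid division by zero.
    """
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0
    return X / sd

# Hypothetical peak areas (4 samples x 3 ions), NOT the published data.
peak_areas = np.array([[1.0e5, 20.0, 3.0],
                       [2.0e5, 40.0, 3.0],
                       [4.0e5, 10.0, 3.0],
                       [3.0e5, 30.0, 3.0]])
scaled = standardize_unit_variance(peak_areas)
print(scaled.std(axis=0, ddof=1))
```

After this scaling, ions with very different absolute peak areas contribute on a comparable scale to the latent variables.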

| RESULTS AND DISCUSSION
Table 1 shows good coverage between 0% and 100% for the antimicrobial activity of the nine mixtures of inactive compounds spiked with the antimicrobial compound berberine and the synergist piperine. A three-component PLS model for activity including the 33 ions explained 97.3% of the variance in antimicrobial activity.
For designed experiments with orthogonal explanatory variables, the common approach to determine the significant variables is to assume as a null hypothesis that the regression coefficients follow a normal distribution. If this hypothesis is correct, the regression coefficients sorted in increasing order from most negative to most positive should be on a straight line in a normal probability plot. Figure 1A shows this plot for the PLS model with the association of antimicrobial activity with the peak areas. The horizontal axis in the normal probability plot is linear with the most negative regression coefficient on the left and the most positive regression coefficient on the right. The vertical axis, which represents cumulative probability for a theoretical normal distribution from zero to one, is constructed in such a way that if it is divided into equal intervals according to the number of explanatory variables, and each explanatory variable sorted in increasing order occupies the intervals in the same increasing order, the variables fit to a straight line in the plot if they obey the assumption of a normal distribution. We observe that several variables violate the null hypothesis of being normally distributed. The most important variable appears to be the one with m/z 195.0873, but this is an inactive compound. Second to this is the [M]+ ion for the active compound, berberine, with m/z 336.1224. Then follow three inactive compounds. Evidently, regression coefficients cannot be used to make inferences to guide further experimentation as commonly done in designed experiments with controllable explanatory variables.
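The construction of the normal probability plot described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `normal_probability_points`, the choice of the interval midpoint (i - 0.5)/M as plotting position, and the example coefficients are assumptions, not details taken from the software used here.

```python
from statistics import NormalDist

def normal_probability_points(coefficients):
    """Return (sorted value, cumulative probability, normal quantile) triples.

    The probability axis is split into M equal intervals and the i-th
    smallest coefficient is assigned the midpoint of the i-th interval
    (an assumed plotting position).
    """
    ordered = sorted(coefficients)
    m = len(ordered)
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / m                  # cumulative probability in (0, 1)
        z = standard_normal.inv_cdf(p)     # position on a probability-scaled axis
        points.append((value, p, z))
    return points

# Hypothetical regression coefficients, NOT values from the paper.
coefs = [-0.8, -0.1, 0.0, 0.05, 0.1, 1.5]
for value, p, z in normal_probability_points(coefs):
    print(f"{value:7.2f}  p={p:.3f}  z={z:+.3f}")
```

Plotting the sorted values against the quantiles z gives a straight line for normally distributed coefficients; points that fall off the line at either end are the candidates for important variables.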
FIGURE 1 (A) Normal probability plot of regression coefficients for the first-order PLS model of antimicrobial activity using abundance of the detected ions (as measured by chromatographic peak area) and (B) normal probability plot of correlation coefficients between antimicrobial activity and the abundance of the detected ions.
Correlation coefficients represent another possible route to explore the association between activity and the peak areas. We can assume normal distribution of the correlations and look for variables violating the null hypothesis. Figure 1B displays the normal probability plot of correlation coefficients. Again, ions associated with inactive analytes are ranked as most important, while the [M]+ ion for berberine ranks as number 5. Evidently, something is wrong in our approach.
Calculation of selectivity ratios (SRs) and interpretation using an SR plot13,14 represent a third route to rank explanatory variables according to their importance for the response variable. Comparative studies with other methods for ranking of variable importance have confirmed that SRs are well suited for this purpose.15,16 In particular, as pointed out by Andersen and Bro,17 SRs should provide good results when the variables do not overlap. The high-resolution data analyzed in this work fulfill this premise, but the SR plot (Figure 2) performs even worse than the normal probability plot of regression coefficients. Thus, the SR plot implies that the most important ions associated with antimicrobial activity are the two with m/z of 144.9876 and 157.035, which are both from inactive compounds. The [M]+ ion for berberine with m/z 336.1224 is ranked as number 10, and the [M + H]+ ion for the synergist piperine with m/z 286.1433 is ranked as number 8. Obviously, model interpretation to make inferences using SRs performs no better than the approach using regression coefficients.
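For readers unfamiliar with SRs, the following Python sketch shows one common way to compute them: the PLS regression vector is used to define a single target-projected component, and for each variable the ratio of explained to residual variance on that component is calculated. The formulas follow the usual target-projection formulation as we understand it; the NIPALS routine, the function names, the sign convention (sign of the target-projection loading), and the synthetic data are assumptions for illustration and do not reproduce the Sirius or MVPA implementation.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Regression vector of a PLS1 model (NIPALS) for centered X and y."""
    Xr, yr = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector
        t = Xr @ w                      # scores
        tt = t @ t
        p = Xr.T @ t / tt               # X loadings
        c = (yr @ t) / tt               # y loading
        Xr = Xr - np.outer(t, p)        # deflate X
        yr = yr - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def selectivity_ratios(X, b):
    """Signed selectivity ratio per variable from target projection."""
    w_tp = b / np.linalg.norm(b)            # normalized regression vector
    t_tp = X @ w_tp                         # target-projected scores
    p_tp = X.T @ t_tp / (t_tp @ t_tp)       # target-projected loadings
    X_tp = np.outer(t_tp, p_tp)             # part of X explained by the target component
    explained = (X_tp ** 2).sum(axis=0)
    residual = ((X - X_tp) ** 2).sum(axis=0)
    return np.sign(p_tp) * explained / residual   # assumed sign convention

# Synthetic data (assumption): only variable 0 drives the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)
X = X - X.mean(axis=0)
y = y - y.mean()

b = pls1_coefficients(X, y, n_components=2)
sr = selectivity_ratios(X, b)
print(np.round(sr, 2))
```

In this well-behaved synthetic case, with many more samples than variables and no interactions, the variable that generates the response obtains by far the largest SR. The real data discussed here violate both conditions.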
The problem is the assumption that interactions are negligible. This assumption must be validated if we want to make inferences from the modeling process. Thus, we must examine the possibility of interaction effects in our modeling approach to ensure a model interpretation that can be used to make inferences.2 To accomplish this, we augmented the data by including interactions between analytes.4 This led to a data set with 528 possible interaction terms, but we can still calculate a PLS model and see how this influences the ranking of variables and the resulting inferences.
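The number 528 is the number of distinct pairs among the 33 ions, 33 × 32 / 2. A minimal Python sketch of such an augmentation is given below. It assumes that each interaction term is the element-wise product of two ion abundance columns, which the text above does not state explicitly; the function name and the random placeholder matrix are likewise assumptions.

```python
from itertools import combinations
import numpy as np

def add_pairwise_interactions(X, names):
    """Append all two-variable product terms to X (assumed form of interaction)."""
    pairs = list(combinations(range(X.shape[1]), 2))
    interactions = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    interaction_names = [f"{names[i]} x {names[j]}" for i, j in pairs]
    return np.hstack([X, interactions]), list(names) + interaction_names

# Placeholder values with the same shape as the data set (9 mixtures, 33 ions);
# these are NOT the published measurements.
rng = np.random.default_rng(1)
X = rng.random((9, 33))
names = [str(k) for k in range(1, 34)]

X_aug, aug_names = add_pairwise_interactions(X, names)
print(X_aug.shape)        # (9, 561): 33 single ions + 528 interaction terms
print(aug_names[33])      # first interaction term: "1 x 2"
```

With nine samples and 561 explanatory variables, ordinary least squares is not applicable, but a PLS model with a few components can still be calculated.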
The normal probability plot of regression coefficients obtained from a three-component model including interactions (Figure 3A) shows that five interaction terms rank as the most important contributors to the antimicrobial activity. They are all interaction terms calculated for ions associated with berberine and piperine. Indeed, these interactions correlate much more strongly with activity than berberine itself. The problem is that these five interaction terms do not stand out from many other interactions with high importance for activity. This is highlighted in Table 2, which shows the explanatory variables with the highest regression coefficients. The five most important explanatory variables represent synergistic interactions between berberine and piperine, but the regression coefficient of the variable ranked just after them (the ion with m/z 195.0873) is similar in size to theirs, and so are the coefficients of many other explanatory variables.
FIGURE 2 Selectivity ratio plot for the first-order PLS model. Higher positive or negative bars imply higher importance of the explanatory variable for the response variable. Positive bars imply positive predictive association with the response, while negative bars imply negative associations.
FIGURE 3 (A) Normal probability plot of regression coefficients for the PLS model of antimicrobial activity using ion abundances (peak areas) and their interactions as predictors and (B) normal probability plot of correlation coefficients between antimicrobial activity and the ion abundances and their interactions.
TABLE 2 Variable importance for the five berberine-piperine interaction terms and for the explanatory variable ranked just after these interactions based on regression coefficients, correlation coefficients, and selectivity ratios.
A normal probability plot of the correlation coefficients (Figure 3B) shows the same picture as the normal probability plot of the regression coefficients: The five interaction terms calculated for ions associated with berberine and piperine appear most important, but there are many explanatory variables with a similar degree of correlation to the activity, for instance, the interaction 3 × 5 shown in Table 2.
In contrast, the SR plot (Figure 4 and Table 2) for this PLS model clearly implies that the five interaction terms associated with berberine and piperine are the most important predictors of antimicrobial activity. They stand out on the right side of the SR plot; three of these variables are very close to each other in the SR plot, and they are approximately twice as large as the explanatory variable 3 × 5 ranked just behind these interactions (Table 2). We can conclude that for situations where M > N, inferences made from interpretation of a first-order PLS model cannot be trusted unless the assumption of negligible contributions from interaction terms is validated by calculating a model with interactions included. When M > N, the first-order regression model builds association patterns among the explanatory variables that account for the missing or lurking variable(s) not included in the model.
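The last point can be illustrated with a small simulation. The Python sketch below uses synthetic random data, not the data analyzed here, and replaces PLS with a minimum-norm least-squares fit, which is an assumption made to keep the example short. The response is generated solely from the product of two variables, yet a first-order model with M > N reproduces it essentially perfectly by distributing the missing interaction over the available variables.

```python
import numpy as np

rng = np.random.default_rng(42)
N, M = 9, 33                                  # same shape as the data set, synthetic values
X = rng.normal(size=(N, M))

# The response depends ONLY on the interaction between variables 0 and 1.
y = X[:, 0] * X[:, 1]

Xc = X - X.mean(axis=0)                       # center predictors
yc = y - y.mean()                             # center response

# Minimum-norm first-order fit (stand-in for a fully fitted latent variable model).
b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
fitted = Xc @ b
r2 = 1.0 - np.sum((yc - fitted) ** 2) / np.sum(yc ** 2)

print(f"R^2 of first-order model without the interaction term: {r2:.6f}")
print("Variables with the largest absolute coefficients:",
      np.argsort(-np.abs(b))[:5])
```

The excellent fit says nothing about which variables generate the response: the coefficients describe a confounding pattern, and the top-ranked variables need not include the two that actually interact.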
The large number of possible interactions accompanying an increasing number of explanatory variables may be expected to lead to a problem with significant interaction terms possibly generated by chance. In our recent work on interaction metabolomics,4 this possibility was investigated by designing mixtures of inactive compounds spiked only with the active compound berberine. Only one interaction was deemed significant, and it was shown to be an interaction between an ion associated with berberine and an ion associated with a related berberine analogue. The concentrations of these two compounds correlated perfectly in the mixtures. This may imply that interactions appearing important just by chance may be a smaller problem than expected, but confounding due to high correlations between analytes may occur even when the number of samples is high compared with the number of analytes.

| CONCLUSIONS
By a practical example from the natural product area, we have shown how confounding resulting from omission of interactions between variables in the model building process can lead to incorrect inferences about which variables are important for a response. This is a special concern in application areas where the number of variables is larger (and often much larger) than the number of samples and where inferences are made with the purpose of guiding further experimentation. It cannot be ruled out that interactions can be ranked important simply by chance; thus, validation by further experimentation is mandatory.

FIGURE 4 Selectivity ratio plot for the PLS model including interactions.
TABLE 1 Mixtures of inactive compounds spiked with berberine and piperine.