Principal balances of compositional data for regression and classification using partial least squares

High‐dimensional compositional data are commonplace in the modern omics sciences, among others. Analysis of compositional data requires the proper choice of a log‐ratio coordinate representation, since their relative nature is not compatible with the direct use of standard statistical methods. Principal balances, a particular class of orthonormal log‐ratio coordinates, are well suited to this context as they are constructed so that the first few coordinates capture most of the compositional variability of data set. Focusing on regression and classification problems in high dimensions, we propose a novel partial least squares (PLS) procedure to construct principal balances that maximize the explained variability of the response variable and notably ease interpretability when compared to the ordinary PLS formulation. The proposed PLS principal balance approach can be understood as a generalized version of common log‐contrast models since, instead of just one, multiple orthonormal log‐contrasts are estimated simultaneously. We demonstrate the performance of the proposed method using both simulated and empirical data sets.


Introduction
Compositional data (CoDa) occur in plenty of research fields, such as geochemistry (Hron et al., 2021), metabolomics (Štefelová et al., 2021), microbiome data (Monti and Filzmoser, 2021), time use data (Dumuid et al., 2020), or ecology (Perujo et al., 2020).Let us consider a regression (or classification) task where y = (y 1 , . . ., y n ) is a vector of n observations of a continuous (or binary) response variable and X = (x ij ) 1≤i≤n,1≤j≤D is an associated matrix of a D-part compositional predictor.Analyzing CoDa requires careful consideration since such data do not carry relevant information in their absolute values, but rather in the ratios between the parts that constitute the composition.Moreover, in this paper we address the case of high-dimensional compositions consisting of a large number D of parts.
In the growing literature regarding omics sciences, most data are actually of relative (hence compositional) nature (Gloor et al., 2017) and the development of methods for high-dimensional compositions is of increasing interest.The challenge not only concerns compositions consisting of many parts, but also the fact that the number of samples n is usually substantially smaller than the number D of parts due to the nature of the technology and omics sciences in general (as for example hundreds of proteins or metabolites are examined).Such settings with more variables than observations immediately discard the use of the most popular regression and classification methods, including Least Squares (LS) regression and Linear or Quadratic Discriminant Analysis (LDA/QDA) models.
For regression analysis with high-dimensional compositions, logcontrast models (Aitchison and Bacon-Shone, 1984) have gained increasing popularity, see for instance Bates and Tibshirani (2019), Susin et al. (2020) and Gordon-Rodriguez et al. (2022).Logconstrasts are a building block in CoDa analysis through the log-ratio methodology (Aitchison, 1986).Given a D-part composition, a logconstrast is a loglinear combination D i=1 a i ln x i , with D i=1 a i = 0, a i ∈ R.
Any log-ratio coordinate representation of CoDa consists of D − 1 logcontrasts, with such number corresponding to the actual dimensionality of compositions.A typical feature of logcontrast models in Aitchison and Bacon-Shone (1984) or Rivera-Pinto et al. (2018) is that only one of the possible D − 1 logcontrasts corresponding to the dimensionality of D-part CoDa is estimated as predictor variable.Note that considering only one logconstrast might be unnecessarily restrictive as other logcontrasts can be of interest as predictor variables, this is something we investigate here.However, estimating the set of all possible D − 1 predictor logcontrasts could be computationally exhaustive or even unfeasible with high-dimensional compositions, definitely so when using existing logcontrast models.But, in fact, this would not be needed.Just having a few logcontrasts capturing most of the information contained in the original explanatory composition, while they appropriately relate to the response variable, would be necessary.We here introduce a model that sufficiently explains a response variable y using the matrix X of compositional predictors while performing dimension reduction through Partial Least either directly interpreted or used for further statistical analysis.In the following, we will refer to this new proposal as PLS-PB.
This manuscript is structured as follows.In Section 2, the basics of CoDa are presented together with the description of the proposed PLS-PB procedure.In Section 3, a simulation study is conducted to investigate the ability of PLS-PB to reflect and simplify the structure of PLS loadings as well as its prediction performance.The method is then applied to two real-world data sets in Section 4. Section 5 concludes with some final remarks and future outlook.

Compositional Data and PLS Principal Balances
When analyzing CoDa, their relative nature needs to be appropriately accounted for.The sample space of CoDa is formed by equivalence classes of proportional vectors (Barceló-Vidal and Martín-Fernández, 2016).
In practice, CoDa are typically (equivalently) represented in the form of percentages, proportions or parts per million (ppm), that is, as data with a constant sum constraint and living on a simplex.Given the scale invariance property of compositions (Aitchison, 1986) and the geometric structure of their sample space (the so-called Aitchison geometry), a well-principled way to conduct analysis of CoDa is to express them in the form of log-ratios of parts, or more generically as their logcontrasts, and then proceed to further statistical processing on these (Pawlowsky-Glahn et al., 2015;Filzmoser et al., 2018).
Log-contrast models have been recently used for regression or classification analysis with high-dimensional compositional covariates (Bates and Tibshirani, 2019;Monti and Filzmoser, 2021;Gordon-Rodriguez et al., 2022).The general form of these models is given by where y = (y 1 , . . ., y n ) is a univariate response variable, X = (x ij ) 1≤i≤n,1≤j≤D is a matrix of the compositional predictor, β represents the regression coefficients with D j=1 β j = 0, and ε stands for the ordinary random error term.It is common practice to interpret the regression coefficients directly in terms of the original parts like in a standard multiple regression model.Nevertheless, from a compositional perspective, just one logcontrast is estimated.
The model in Eq. ( 1) can, however, be immediately generalized to a setting where more orthonormal logcontrasts are estimated simultaneously.In order to reduce the dimension, it would be interesting to rank them according to decreasing relevance to explain or predict the response variable.Moreover, simplifying the interpretation of these logcontrasts would be beneficial.This is achieved by the novel PLS-PB method introduced in this work as showed in the following.

Log-ratio Representations of Compositional Data
As said, CoDa are commonly expressed in the form of log-ratios of parts (logcontrasts) for statistical analysis.
Thus, clr coefficients (Aitchison, 1986) are often used.For a D-part composition x = (x 1 , . . ., x D ) , its representation in the clr coefficients is defined as where g(x) denotes the geometric mean of the parts of the composition x.Clr coefficients and logcontrasts are closely linked together as it holds that (Martín-Fernández et al., 2018) where clr i (x) = ln xi g(x) .The clr coefficients impose a zero sum constraint and thus lead to a singular covariance matrix, which is undesirable for some statistical methods including LS regression or LDA/QDA.However, their construction and interpretation is appealing for some others including PLS regression.Note that, by taking Eq.(3) into account, the regression model in Eq. ( 1) could have been developed directly in clr coefficients, but we will do it separately later.The reason is methodological: while in Eq. (1) the zero-sum constraint is additionally imposed which needs to accommodate estimators of regression coefficients used, like in Monti and Filzmoser (2021), with applying clr coefficients first it is automatically incorporated to proceed with standard estimators, but care needs to be taken for the interpretation of this log-ratio coordinate representation.
An alternative way to map CoDa from their original sample space into the real space is through orthonormal log-ratio (olr) coordinates, also known as isometric log-ratio (ilr) coordinates, which are derived from the Euclidean vector space structure of the Aitchison geometry and overcome the singularity issue, hence being more generally applicable in statistical analysis (Egozcue et al., 2003;Filzmoser et al., 2018;Martín-Fernández, 2019).
Note that balances are one concrete instance of olr coordinates that are constructed by means of a sequential binary partition (SBP) of the parts of a composition (Egozcue and Pawlowsky-Glahn, 2005).This procedure sequentially splits parts into two non-overlapping groups.Thus, at the kth partition, k = 1, . . ., D − 1, the balance b k between two subgroups is given by where r k denotes the number of parts in the first group (numerator of the log-ratio) and s k denotes the number of parts in the second group (in the denominator of the log-ratio), with n 1 , .
Recently, the use of balances has been questioned by some (Greenacre, 2019;Greenacre et al., 2021;Hron et al., 2021).Nonetheless, their ability to represent the original information in terms of contrasts or comparisons between two groups of parts remains appealing when compared to using general logcontrasts, particularly so in high-dimensional settings.
However, as the number D of parts of a composition increases it becomes more challenging to build a SBP and obtain a collection of balances that are interpretable.It is then desirable to construct just a few balances which capture the majority of the information.In an unsupervised learning settings, principal balances have been proposed for this aim (Pawlowsky-Glahn et al., 2011).In Martín-Fernández et al. (2018) PBs are formally defined as follows.
Definition 2.1.Given a composition x = (x 1 , x 2 , . . ., x D ) , principal balances are logcontrasts D i=1 a ki • ln x i , k = 1, . . ., D − 1, such that a k = (a k1 , . . ., a kD ) are constant vectors which maximize the variances var D i=1 a ki • ln x i and: • for k = 1, . . ., D − 1 the coefficients a ki take one of the three values (−c 1 , 0, c 2 ), c 1 and c 2 being some strictly positive numbers, PBs facilitate dimension reduction using PCA in the clr space, but the interpretation of the resulting principal components is simplified through their expression in terms of balances.Accordingly, the first PB maximizes the sample variance and each subsequent balance then maximizes the remaining variance in the data, while satisfying the orthonormality constraint.Up to D − 1 PBs can be derived, although in practice much fewer are typically needed to capture the main modes of variability.In the following subsection, we embed such dimension reduction by PB coordinates into a regression setting using a PLS formulation.

PLS Regression and Classification
PLS is a multivariate method which is used to model a linear relationship between a response variable and a set of (non necessarily compositional) explanatory variables.The linear relationship is however not modeled directly, but via the construction of latent variables (PLS components).Values of new latent variables are called scores, and coefficients that determine the influence of each variable on the score are called loadings (Varmuza and Filzmoser, 2009).
More precisely, PLS regression aims to estimate the regression parameter vector b = (b 1 , . . ., b D ) in the linear regression model where e stands for a random error term.Both the response variable and covariates are centered prior to the analysis, so no intercept is included in model 4. As we deal with CoDa, the matrix X in Eq. ( 4) is expressed in the form of clr coefficients and this clr matrix is denoted clr(X) in the following.While the estimation of b in terms of the original variables (here clr coefficients as used for compositional PCA) is the final goal, the regression fit itself and prediction are performed on the PLS components, which are linear combinations of clr variables in the n × D matrix clr(X).Because of Eq. ( 3), these latent PLS components are just the logcontrasts we are searching for.
Namely, the matrix clr(X) is decomposed as where T is a score matrix, P is a loading matrix, and E X is an error matrix (Varmuza and Filzmoser, 2009).Both matrices T, P have k columns, k ≤ min(D, n), indicating the number of PLS components.The goal of PLS is then to maximize the covariance between the scores (coordinates corresponding to the latent variables) and y, under the constraint of uncorrelated scores (the most usual case) or orthogonal loading vectors (representing weights given to the original variables in the construction of the PLS components).Let p denote a (column) loading vector of matrix P. Then it holds for a score vector t that t = clr(X)p, and the maximization problem can then be written as where p is considered to be a weighting vector.The constraint of unit length ensures that the maximization problem is unique (Varmuza and Filzmoser, 2009).Such maximization problem results in the first score vector t.The subsequent score vectors are obtained in the same way, with the condition that they must be orthogonal to the previous ones.The score vector t = clr(X)p contains n observations of a logcontrast because the sum of elements of each loading vector p equals to zero in the clr coefficient representation of the composition acting as covariate.The resulting logcontrasts can then be used in the regression model with v = P b, for prediction purposes as well as to determine the number of logcontrasts that are sufficient for a good prediction of y.There exist several algorithms to find a solution to this optimization problem.
We resort to the well-known SIMPLS algorithm.
Similarly, PLS can be used for classification purposes.This method is then commonly called PLS Discriminant Analysis (PLS-DA).Here we will focus on binary response variables, typically using codes 0 for observations that do not belong to a certain group and 1 for those that do belong.
Beyond representing a generalization of logcontrast models, a step further with the proposed formulation is to simplify it in the form of balances, and then rely on PBs instead of PLS loadings.Thus, we can investigate which groups of parts contribute in positive or negative sense to their values.In doing so, a small price is paid in terms of prediction ability of the resulting regression model, but a benefit in interpretability of logcontrasts as latent variables is generally obtained.

Algorithmic Implementation of PLS Principal Balances
PLS-PB can be straightforwardly constructed by adapting the PCA-PB approach from Martín-Fernández et al. (2018).Specifically, we take the constrained PCs algorithm introduced in that work as reference.This algorithm builds PBs based on loadings obtained from PCA.Our proposal is to do it based on the loadings from PLS.
The core of the modification is as follows: instead of maximizing explained variance in accordance with PCA, we maximize the covariance between the response variable and the new established balance (through the respective balance coefficients) while keeping orthonormality of the new coordinate system.Thus, balances and balance coefficients play the role of score vectors and loadings respectively.In the relationship t = clr(X)p, balance coefficients are in the place of vector p.Accordingly, the first PLS loading vector is used to derive the first PB, and the other balances are then obtained by maximizing the absolute value of the covariance of subsequently derived balances with the response variable.Algorithm 1 summarizes this procedure for obtaining the PLS-PB.

Algorithm 1 CONSTRUCTION OF PLS-PB
Initiation: center response variable y, compute clr coefficients of composition in X and center them PB: 1. First PB: pb 1 (based on first PLS loading vector p 1 ): i. PLS regression of centered y on centered clr(X) ii.First loading vector: p 1 iii.Signs of values in p 1 : s sign = sign(p 1 ) iv.Using p 1 , derive D − 1 candidate sign of balances s 1 , . . ., s D−1 with codes {−1, 0, +1}: • for i = 1, . . ., D: • s j , j = 2, . . ., D − 1: copy codes in s j−1 , add +1 or −1 using s sign and the remaining components of p 1 (excluding the minimum and maximum values chosen in the previous step).
For i = 1, . . ., D : • result: matrix of signs S D×(D−1) ; sign of balances in columns ) matrix of balance coefficients: for i th row and j th column, where r j is number of +1 values in column j and s j is number of −1 values in columns j ii.Further partition: • Down the numerator: repeat steps 1.[i.-vi.] using variables in the numerator of pb 1 • Down the denominator: repeat steps 1.[i.-vi.] using variables in the denominator of pb 1 Final step: Sort PB: pb 1 , pb 2 , . . ., pb (D−1) such that

Numerical Assessment
In this section, we first introduce some examples to demonstrate that PLS-PBs help to arrive at a simplified structure of PLS loadings.In line with an application to metabolomics in Section 4, here we refer to (bio)markers (explanatory variables), i.e. biological measurements or signals most associated to some health or biological outcome/status of interest (response variable) (Štefelová et al., 2021).We therefore focus on the identification of meaningful biomarkers as the main purpose of the data analysis.Subsequently, we set up a simulation study to formally compare the predictive performance of the proposed PLS-PB with the original PCA-PB in Martín-Fernández et al. (2018).

Artificial Settings for Comparison: PLS-PB Against PLS Loadings
To assess the behavior of PLS-PB across various settings, we consider three artificial cases inspired by the study in Štefelová et al. (2021).The first case contains just one block of markers amongst a given collection of signals defining the explanatory composition.In the other two examples, we extend such setting to include several groups of markers.In all cases, we consider n = 250 samples and D = 100 signals in the composition.
The number of PLS-PB is then 99 (D −1).Following on (Štefelová et al., 2021), compositions were simulated using so-called pivot coordinates, an instance of olr coordinates (Fišerová and Hron, 2011) (see Appendix A for more details).The resulting covariance matrices are visualized in Figure 1.we adapt the covariance matrix defined in Eq. ( 8) of Appendix A to allow for multiple blocks of markers.
In the second case, we consider four blocks of markers (each block consisting of 20 markers) as visualized in Figure 1(b).Although the structure of the covariance matrix is the same as in Equation (8) (Appendix A), the elements in each block are generated following a decreasing sequence such that the elements further away from the diagonal have smaller values.This approach ensures that the scattering of pivot coordinates in each block differs.The effect of the first block of pivot coordinates is the strongest, the second block should produce the "weakest" markers (in the sense of pivot coordinates; covariance between variables in the second block is the lowest).Because of the way pivot coordinates are built, it can be said that the relevance of markers (in terms of the original parts) decreases in each subsequent block.
Finally, in the third case, we again consider four blocks of markers but this time of different size, as visualized in Figure 1(c).The first and third block consist of 30 markers, while the second and fourth consist of 10 markers each.The entries in the covariance matrix are in the same range in each block, with the diagonal elements being identical.Note that the second and third case not only contain 80 markers but also 20 irrelevant signals (i.e. the last 20 rows/columns in Figures 1(b) and 1(c)), which can be considered as noise).
We are now ready to compare PLS-PB to PLS loadings.For this, it is important to note that the structure of PLS-PB must not necessarily reflect the structure of the corresponding PLS loadings.This is due to the fact that orthonormality is required when constructing PLS-PB, while this is not the case for PLS loadings.These examples provide a first insight into the performance of PLS-PB to correctly identify markers in a collection of signals.
Figure 2 displays the comparison of PLS loadings and PLS-PB for the first case with one marker block.
As we are typically interested in the first few PBs most strongly related to the response, we focus our discussion on the first five PLS loadings and PLS-PBs.These are displayed in the Figure 2  In summary, these examples illustrate the potential of PB as a convenient counterpart to PLS loadings, as they deliver a simplified structure and the orthonormality constraint enables to, for example, perform interpretable regression analysis (Hron et al., 2021).

Simulation-based Assessment
We investigate the predictive ability of PLS-PB in comparison to PLS loadings, although it is important to note that this is not the main focus of the method proposed in this work.Moreover, a natural alternative principal component regression (PCR, Varmuza and Filzmoser, 2009), where the number of explanatory variables in a regression model is reduced using PCA.Both PLS and PCA regression are popular tools, for example, in chemometrics and molecular biology applications to cope with high-dimensionality and/or multicollinearity issues.We therefore devise a simulation study to compare the prediction performance of PLS PB, PCA PB and PLS loadings.It is known that PLS regression generally leads to better prediction performance than PCR when just a few latent components are involved.
We consider three scenarios for the simulation study based on the three cases introduced in the previous section, that is: only one block of markers, several blocks of markers of the same size, and several blocks of markers of different sizes.The computed PBs (either PLS or PCA) are used to fit linear regression models like Eq. ( 6), including only one PB up to all possible PBs, that is, 99.The root mean squared error of as demonstrated in Section 3.1.Moreover, note that for models with a large number of latent components, the performance of standard PLS worsens dramatically, as reflected by the large values of the RMSEP.
This might be caused by numerical instability resulting from applying the SIMPLS algorithm as it provides non-orthogonal components causing degeneration when increasing the number of components.
Lastly, we compare the proposed PLS-PB approach to the selbal algorithm introduced in Rivera-Pinto et al. ( 2018) to identify an optimal balance between parts in high-dimensional compositions for regression and classification problems in a microbiome analysis context.The latter cannot be included in Figure 5 as here we demonstrate prediction performance across all possible numbers of balances, and the selbal algorithm selects only one single balance.Nonetheless we can compare its performance to PLS-PB based on the first PB.The resulting mean values of the RMSEP for the three simulation scenarios are shown in Table 1.While selbal exhibits better performance in terms of prediction, its ability to reflect the actual structure of markers

Applications
We here demonstrate the application of the PLS-PB approach on two real-world data sets.Firstly, we consider a regression problem and then a classification task by simply accommodating the binary response into the PLS model.which plays the role of a continuous response variable that we aim to model in terms of the metabolite composition.

NMR Data Set
We apply the proposed PLS-PB method and analyze its ability to predict the response.Considering a varying number of PBs, we aim to detect an optimum that combines sensible prediction accuracy with preferably a small number of PB.Moreover, we investigate whether the PBs reflect the structure of PLS loadings while simplifying the interpretation.

Optimal Number of Principal Balances
Similar to the simulation study (Section 3.2), we compute PLS and PCA-PBs as well as standard PLS and compare prediction performance for models including several PBs (ranging from the first PB or PLS loading up to all D − 1 = 126 of them).Again, RMSEP is the measure used for comparison and its values are estimates based on 5-fold CV and averaged over 100 runs.
For each number of latent components (either PB or ordinary loadings), the mean and standard deviation of the RMSEP values were computed across the 100 runs.Then, given the lowest mean RMSEP, the model using the fewest number of balances within one standard error from such a minimum is chosen (one standard error rule; see Friedman et al., 2001).Accordingly, the most parsimonious model amongst those of best prediction performance is selected.
The results based on PLS-PB (blue) and PCA-PB (black), together with the results of standard PLS (red), are shown in Figure 7(a).We can observe that PLS-PB outperforms PCA-PB for most part of the range of PBs.Unlike in the previous simulation study, the effect of a possible overfitting issue can be observed here as the RMSEP increases with the number of PBs.Also, as expected, the RMSEP values coincide (disregarding the minimal numerical difference) again for both approaches at the maximum number of PBs.Similarly to the simulation study in Section 3.2, standard PLS outperforms PLS-PB for the lowest numbers of latent components.However, weaker prediction performance of PLS-PB is compensated by the interpretability, which can be later seen in Figure 8.Moreover, it can be clearly seen in Figure 7(a) that standard PLS performs worse with increasing the number of loadings considered.
It can be observed in Figure 7(a) that the RMSEP for all three methods decreases rapidly at the beginning of the range as PBs (or ordinary loadings) are aggregated.).The prediction performance was also assessed using the selbal algorithm.In this case, the resulting RMSEP is 3.227, much lower than for the other three methods.
However, it was shown in Section 3.2 that selbal has a rather poorer ability to capture the structure of the data compared to PLS-PB, which can be seen in Figure 8(c).
It is important to note that even if the PLS-PB approach does not outperform the PCA-PB approach, it is by construction expected to provide a more interpretable structure of PB, as these are tailored to maximize association with the response variable.The next section discusses the interpretative advantage of the PLS-PB approach.

Comparison of PLS-PB to PLS Loadings
We compare the PLS-PBs to PLS loadings to examine whether the former suitably reflect the latter (in terms of signs of coefficients) and, at the same time, facilitate interpretation.
In Figure 8 we display their values for the first 6 PBs, the optimal number determined in the previous section.The first PLS-PB reproduces the structure of signs of the coefficient values observed in the first PLS loading vector well, highlighting biologically meaningful markers identified in Štefelová et al. (2021), related to methane yield.For example, very distinct groups of identified markers are the group from Integral32 to Integral37 and a group from Integral67 to Integral83 (with Integral68 being picked in the third balance).
These markers are colored red and blue in Figure 8. Markers colored in red are those having a positive relationship with the response variable, whereas blue-colored markers are those that have a negative relationship with the response variable.The other PLS-PB, whose structure does not necessarily coincide with the structure of the respective PLS loadings due to the orthogonality constraint, capture some other patterns, related to both marker and non-marker variables.Moreover, PLS loadings also provide misleading information, as in the third loading there is a group of signals (Integral49 to Integral57) which is highlighted, but it is not biologically meaningful Štefelová et al. (2021).On the contrary, PLS-PB provides a neater and more parsimonious view.

Metabolomic Data Set
We now consider a metabolomic data set consisting of n = 46 observations and D = 209 metabolites, thus representing the common high-dimensional setting with n < D. The response variable is in this case dichotomous and states cancerous (y i = 1) or healthy (y i = 0) tissues, having 23 patients suffering from lung cancer and other 23 being healthy (Cífková et al., 2022).The aim is to enable classification of tissues, and in particular, to reveal pathobiochemical changes of the disease.The samples were analysed by a targeted metabolomic method based on HILIC liquid chromatography coupled with triple quadrupole mass spectrometry.This method allows to detect altogether 350 metabolites in different biofluids, tissues and cells and covers main metabolic pathways.
We applied the PLS-PB and PLS-PCA methods and assessed their relative performance by 5-fold CV over the range of possible numbers of PB as detailed previously.The RMSEP was replaced by the misclassification error, defined as where y i denotes the group number of the ith object, ŷi is the estimated group number, and the index function I gives 1 if the group numbers are not the same and 0 otherwise.
Figure 9 displays the results for the first 10 latent components.Similar to the NMR data study, PLS-PB generally outperforms PCA-PB.Even though the numerical difference in ME is not dramatic, it can be seen that the ME of PLS-PB slightly drops, whereas the ME of PCA-PB rather levels off.On the other hand, the  performance of standard PLS is considerably worse than the other methods.Their performance was again compared to the result of selbal, for which the ME was 0.343.Selbal thus shows worse performance than PLS-PB and PCA-PB.Moreover, this latter does not recover the structure of markers very well.
Figure 10 displays PLS loadings and PLS-PB for the first five balances.Again, PLS-PBs appear to be less noisy than the PLS loadings counterparts.The latter puts an unnecessarily large emphasis on absolute differences.PLS-PBs, in contrast, are easier to navigate in the outcome matrix and show more agreement with univariate statistical analysis (Cífková et al., 2022).We can see general trends in decreasing short and medium chain (Car.0 -Car.12)compared to increased very long chain (Car.20 -Car.22)acyl carnitines, which are closely metabolically connected.Furthermore, selected groups of metabolites such as glycine dipeptides (GLY.ALA -GLY.TYR) and pyrimidine nucleotides (UDP.glucuronate,UDP.AcGlcNH2 and CDP.choline) show systematic trends.The PLS-PB approach in fact splits acylcarnitines into two separate groups, which could be then subject of future research.

Conclusions
This manuscript introduces a new procedure to construct PBs within a log-ratio analysis framework for high-dimensional CoDa.We extend previous work in PBs by exploiting PLS as a dimension reduction tool that accounts for the relationship between a response variable of interest and a high-dimensional composition playing the role of predictor.The algorithm determines D − 1 data-driven PLS-PBs that maximize their covariance with the response variable.
The proposal is applicable to both regression and classification problems and our numerical experiments firstly demonstrate that the resulting PLS-PBs provide a simplified structure of PLS loadings and outperform the original PCA-PB in terms of prediction performance.Secondly, when compared with the recently proposed selbal algorithm, which targets the same goal as PLS-PB, it is shown that although the selbal method may perform better in terms of prediction, it shows poorer ability to capture the data structure.
Finally, PLS-PBs simplify the structure and enhance the interpretation of the results when compared with standard PLS.The method is further demonstrated on two real data sets regarding regression analysis with NMR spectral data and a classification task with metabolomic data.In both cases, the usefulness of the PLS-PB approach for variable selection and biomarker discovery is illustrated.
Building on the PLS-PB framework presented here, possibilities for further developments include its robustification to manage the potential influence of outlying samples in the results or the ability to deal with sparse data.

Figure 1
Figure 1(a) displays the covariance matrix in the first case.The generated compositions then contain one block of 20 markers and the remaining signals are regarded unimportant.For the other two exemplary cases,

Figure 1 :
Figure 1: Visual representation of covariance matrices used in the examples.Colors represent the covariance between pairs of pivot coordinates.Figure 1(a) shows a single block of 20 meaningful markers, Figure 1(b) 4 same-sized blocks of 20 markers each and Figure 1(c) 4 blocks of markers containing either 30 or 10 markers.
(a), and Figure2(b) respectively.It can be clearly seen that the structure of PLS-PBs is much more parsimonious than using PLS loadings.The first PB (i.e.column 1 in Figure2(b)) captures the information contained in the first loading vector very precisely, highlighting all the markers in the data (i.e.first 20 colored rows in Figure2(b)).The fourth and fifth balances then capture several differences between markers in the block.

Figure 3
Figure 3 displays the resulting PLS loadings and PLS-PB for case 2 with same-sized blocks of markers.The PLS-PBs in Figure 3(b) display a fairly neat structure, while the interpretation of the PLS loading in Figure 3(a) is less clear cut.The first PB reproduces the information in the first loading vector.Except for the second block of markers (signals from V21 to V40), the other three blocks are correctly identified.It is the third PB that highlights this block with the lowest covariances between variables.The second and fourth PB further stress the difference between the first and third block of markers, whereas the fifth PB captures some differences to the fourth block.Note that all PBs correctly exclude the signals corresponding to random noise (i.e. the last 20).Finally, Figure4displays the results for the case of varying block sizes (case 3).Similarly to the previous case, the largest blocks (i.e. the first and the third one; signals V1-V30 and V41-V70) are correctly highlighted by the first PB, while the second and the fourth (smaller) blocks are not included (signals V31-V40 and V71-V80 with the exception of signals number 33 and 34).However, markers from these blocks are identified by the third PB.

Figure 2 :
Figure 2: Comparison of the first five PLS loadings (left) and PLS-PB (right) for the first example with one block of markers (case 1).Signal variables generated are arranged by rows.Colors represent values of loading vectors (left) and coefficients of PLS-PB (right).Signals in a numerator of a balance get a positive value, in a denominator a negative value.Signals not included in a balance get 0. Block of markers (first 20 signals) is highlighted into a black frame.

Figure 5 :
Figure 5: Cross-validated root mean squared error of prediction using PB, PLS-PB (blue), PCA-PB (black), and ordinary PLS loadings (red) for each simulation scenario according to number of latent components used (either PB or ordinary loadings).
We use the data set fromŠtefelová et al. (2021)  consisting of high-throughput spectral profiles obtained by nuclear magnetic resonance (NMR).The data set involves a 127-part compositional predictor (metabolite signals; also called integrals) measured in n = 211 rumen fluid samples from cattle.This was collected along with individual measurements of animal methane yield (CH 4 in grams per kilogram of dry matter intake) Figure 7: Prediction performance of PLS-PB (blue), PCA-PB (black) and standard PLS (red) methods on NMR data set for different choices of latent components used (either PB or ordinary loadings).Optimal number determined according to one standard error rule from minimum CV RMSEP are indicated by vertical dashed lines for each method.
Figure 7(b) zooms in on the results for the range of the first ten latent components.While the smallest RMSEP for PLS-PB occurs for a model consisting of 11 PBs, a more parsimonious model using just 6 PBs provides comparable performance according to the one standard error rule (marked by a blue vertical dashed line in Figure 7(b)).For PCA-PB, the minimum occurs for a model containing the first 17 PBs, while the model with 8 PBs lies within a one standard error of this minimum (marked by a black vertical dashed line in Figure 7(b)).For standard PLS, the minimum was reached for 3 loadings and the optimal model determined by one standard error is the one with 2 loadings (red vertical dashed line in Figure 7(b)

Figure 8 :Figure 9 :
Figure 8: Comparison of the first 6 PLS loadings, 6 PLS-PB and a selbal balance from the NMR data set.Red and blue labels are used to highlight markers identified in previous studies.Red (blue) colored text is used for markers having a positive (negative) relationship with the response variable.

Figure 10 :
Figure 10: Comparison of the first five PLS loadings and PLS-PB.Red colored text is used to highlight metabolites which were marked as significant using p-values after Bonferroni correction.22 . ., n r k and d 1 , . . ., d s k being the indices of the parts of the first and second group, respectively.From a D-part composition, the number of balances derived is D − 1 which corresponds to the actual dimensionality of the composition.The interpretation of balances is straightforward: they represent the relative dominance of one group of parts with respect to the other group.They are orthonormal logcontrasts by construction, which means that the respective vectors of logcontrast coefficients are re-scaled to have unit norm and are mutually orthogonal:

Table 1 :
Comparison of mean RMSEP of the first PCA PB and PLS PB with mean RMSEP of selbal