A brief note on a new faster covariates selection (fCovSel) algorithm

Covariates selection (CovSel) is a variable selection technique from the domain of chemometrics, commonly used to extract variables carrying high covariance with the response variables. CovSel is a special case of partial least squares analysis in which, at each weight estimation step, the weight vector takes a binary form that selects a single variable. The earlier CovSel algorithm, which is nearly identical to the nonlinear iterative partial least squares (NIPALS) algorithm, contains a key predictor matrix deflation step that makes it time-consuming, particularly during tasks such as cross-validation or in multiblock and multiway CovSel scenarios where a large number of variable combinations are usually explored for model optimization. We present a new CovSel algorithm, called faster CovSel (fCovSel), which drops the need for predictor matrix deflation and is therefore naturally faster than the NIPALS-based CovSel. Mathematical and analytical comparisons of CovSel and fCovSel, showing that they achieve the same solution, together with their time requirements, are presented.


| INTRODUCTION
In the domain of chemometrics and analytical chemistry, multivariate signals are widely encountered [1,2]. Furthermore, multivariate signals generated by several analytical techniques, such as optical spectroscopy, are highly collinear [3]. Often, such multivariate multicollinear signals cannot be directly interpreted but demand chemometric approaches that can handle the multicollinearity and simultaneously allow meaningful information to be explored [3]. For example, to explore the data, variance-based subspace modeling techniques such as principal component analysis (PCA) are widely used [4]. In the presence of response variables, the common approach to subspace modeling is partial least squares (PLS), which models the subspaces as a function of the covariance between the predictor and the response variables [5,6]. Note that both PCA and PLS are bilinear modeling approaches, where the analysis results in a bilinear decomposition of the predictor matrix into a set of scores and loading vectors: the loadings are the key directions maximizing the covariance (in the case of PLS) or variance (in the case of PCA), and the scores are the data projected onto those directions.
Note that apart from subspace modeling, one of the key tasks when exploring multivariate multicollinear data sets is to extract information about the key variables of interest [7,8]. This task is often referred to as variable selection [9,10] in the domain of chemometrics or feature selection [11] in the domain of machine learning. The key variables of interest form an information-rich subset of variables that can be used for a wide range of tasks, such as the development of predictive models, data pattern interpretation, development of low-cost multispectral sensors in the case of optical spectroscopy, development of selective chemical sensing applications, and selection of key markers in omics data analysis [8-10, 12, 13]. In the domain of chemometrics, three distinct types of variable selection approaches exist, that is, filter, wrapper, and embedded methods [7]. The main difference between the methods is the amount of interaction the variable selection has with the underlying model for which the variables need to be extracted. For example, in the case of the filter-based methods, there is almost no direct interaction of the variable selection strategy with the underlying model [11]. An example of a filter-based method is the selection of variables based on their correlation with the response. The second case, the wrapper methods, involves the exploration of individual variables and their combinations in an exhaustive search to find the best possible subset of variables. Unlike the filter-based methods, in the wrapper methods, the variable selection is performed by evaluating the model performance; however, due to the random nature of such exhaustive searches, for example, genetic algorithms [14,15], the results are not repeatable.
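As a concrete illustration of the filter idea mentioned above, the following is a minimal sketch (synthetic data; all variable names are hypothetical) that ranks candidate variables by their absolute correlation with the response, with no interaction with any downstream model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))             # 100 samples, 20 candidate variables
y = X[:, 3] + 0.1 * rng.normal(size=100)   # response driven mainly by variable 3

# Filter method: score each variable by its absolute correlation with y
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
best = int(np.argmax(scores))              # index of the top-ranked variable
```

A filter of this kind is cheap but, unlike CovSel, ignores redundancy among the selected variables.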
The third case is the embedded methods, which perform variable selection as a joint step of model calibration by optimizing training criteria through successive addition, elimination, or a joint exploration of both [7,16]. More details on variable selection in the chemometric domain can be found in some recent review articles [9,10,12,13,17].
Of the three classes of methods, the embedded methods are of high interest as they undertake variable selection and modeling as a joint step to find an information-rich subset of variables that is highly predictive at the same time. CovSel is one such powerful method, allowing the extraction of informative variables through repeated steps of covariance maximization and orthogonalization [7]. CovSel is similar to the nonlinear iterative partial least squares (NIPALS) algorithm, which involves successive steps of covariance maximization and repeated orthogonalization of the predictor matrix [5,18]. In the later part of the manuscript, the similarity between CovSel and NIPALS will be highlighted in detail. A wide range of applications of CovSel can be found in the scientific literature [19-21], as can extensions of CovSel to multiblock [8,20] and multiway multiblock [22] cases. The main reason CovSel is so popular and extendable to frameworks such as multiblock and multiway is that CovSel is a special case of PLS, and almost any extension that is possible for PLS is also directly applicable to CovSel. However, just as NIPALS is a slow algorithm due to the requirement of deflating the predictor matrix [23,24], the same deflation makes the current CovSel [7] inherently slow. Due to the predictor matrix deflation step, extensions of CovSel such as multiblock and multiway can be even slower than CovSel on a single data block. In the domain of chemometrics, a major recent focus has been on fast and stable PLS algorithms; one of these is the nondeflation PLS approach, where the predictor matrix deflation step is replaced by Gram-Schmidt (GS) orthogonalization of the scores of the predictor matrix [23-25].
Such an algorithm is naturally faster as there is no predictor matrix deflation step [23-25]. Since CovSel is a special case of PLS, the concept of nondeflating PLS [23-25] can be used to derive a nondeflating version of CovSel, which will naturally be faster due to the absence of the predictor matrix deflation step. Furthermore, the extension of such a nondeflating CovSel to multiblock [8] and multiway [22] scenarios will also be naturally faster, as those methods involve the deflation of multiple predictor matrices or of the wide matrices obtained by unfolding multiway arrays, respectively. To the best of the author's knowledge, a nondeflating version of CovSel has never been developed, and the one developed in this study can have wide practical implications for single-block [7], multiblock [8], and multiway [22] CovSel modeling.
The aim of this study was to present a new CovSel algorithm, called faster CovSel (fCovSel), which drops the need for predictor matrix deflation and is therefore naturally faster than CovSel. Mathematical and analytical comparisons of CovSel and fCovSel, in terms of achieving the exact same solution and their time requirements, are presented.

| THEORY
This section supplies the theoretical basis for the fCovSel algorithm, which requires no deflation of the predictor matrix. All matrices are presented in bold upper case, all vectors in bold lower case, and constants in lower case. Let X be the predictor matrix of size n × p, where n is the number of samples and p the number of variables. Let Y be the multiresponse matrix of size n × k, where k is the number of responses. The CovSel algorithm proposed in the earlier studies [7,8,22] involves two main steps. The first step is to find the variable that carries the maximum covariance with the response variables, and the second step is to use the column vector of the selected variable to deflate both the predictor matrix and the response variables. The process is repeated until the desired number of variables is extracted; note that if n < p, the total number of variables that can be extracted is at most n. The deflation of the predictor matrix is a computationally expensive step in the traditional CovSel algorithm. Furthermore, CovSel is almost identical to the NIPALS algorithm, as shown in Table 1, except for the step where the score is a column vector of the predictor matrix, while for NIPALS it is estimated as the product of the predictor matrix and the normalized loading weight. One can assume that CovSel has loading weights of the form [0,0,0 … 1 … 0,0,0], so that multiplying the loading weight with the predictor matrix selects a column vector as the score vector. The CovSel algorithm presented in the earlier study [7] uses the squared covariance; however, in the algorithm presented in Table 1, that step is replaced with the absolute value (mod) to show the similarity between CovSel and NIPALS.
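The two-step loop described above (covariance-maximizing column selection followed by deflation of both matrices) can be sketched in NumPy as follows. This is an illustrative reading of the published algorithm, assuming mean-centered data; the function and variable names are my own:

```python
import numpy as np

def covsel_deflation(X, Y, n_var):
    """Deflation-based CovSel sketch: pick the column with maximum absolute
    covariance with the (deflated) responses, normalise it to a score vector,
    then deflate both the predictor matrix and the responses."""
    X, Y = X.astype(float).copy(), Y.astype(float).copy()
    selected, scores = [], []
    for _ in range(n_var):
        V = X.T @ Y                                # covariance with responses
        k = int(np.argmax(np.abs(V).sum(axis=1)))  # sum of |cov| for multiresponse
        t = X[:, k] / np.linalg.norm(X[:, k])      # normalised score = selected column
        X = X - np.outer(t, t @ X)                 # deflate predictor matrix (expensive)
        Y = Y - np.outer(t, t @ Y)                 # deflate responses
        selected.append(k)
        scores.append(t)
    return selected, np.column_stack(scores)

rng = np.random.default_rng(1)
Xd = rng.normal(size=(30, 10))
Yd = rng.normal(size=(30, 2))
sel, T = covsel_deflation(Xd, Yd, 4)
```

The rank-1 deflation of X costs on the order of n × p operations per extracted variable, which is the step fCovSel removes.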
Note that selecting a variable using the absolute value (mod) of the covariances leads to exactly the same selection as that based on squared covariances, as squaring merely makes all the covariances positive before identifying the maximum. For the multiresponse case of CovSel, the sum of the absolute covariances can be used to select the variable, and the column vector corresponding to the selected variable can then be used for deflating the predictor matrix. The algorithm for fCovSel is presented in Table 2. The key difference between the fCovSel and CovSel algorithms is the GS orthogonalization [26] of the selected column vector of the predictor matrix with respect to all the earlier selected variables, which removes any redundant information. Performing that avoids the need for the deflation of the predictor matrix, requiring only the GS orthogonalization [26] of the response variables. This idea is inspired by the nondeflating PLS [24,25] and the recently proposed response-oriented sequential alternation modeling [23]. The fCovSel process can then be repeated up to the desired number of variables. In the later part of the study, it is analytically shown that fCovSel and CovSel lead to the same solution in terms of the selected variables; fCovSel also leads to the same solution as CovSel in the multiresponse case.

TABLE 1 Algorithms of CovSel (for single response y or multiresponse Y) and NIPALS (for single response y), presented for comparative purposes.
The key difference between fCovSel and the traditional CovSel is that the computationally expensive predictor matrix deflation step is replaced by two main steps, that is, the deflation of the response and the reorthogonalization of the scores, as proposed in previous studies [23,25]. It can be mathematically shown that the predictor matrix deflation can be skipped with these two steps. The main idea behind CovSel is to use the covariance information between the predictor X (n × p) and the response Y (n × k) to select the variable carrying the maximum covariance. In successive steps, the predictor matrix is deflated to capture the remaining covariances. For a particular step, the covariance V_{i+1} (p × k) can be expressed as Equation (1):

V_{i+1} = X_i' Y    (1)

Furthermore, the X_i (n × p) at a particular step can be expressed as (I − T_{1:i} T_{1:i}') X, an (n × p) matrix, where T_{1:i} (n × i) is the matrix of orthonormal scores extracted up to the ith step. Note that I is the identity matrix of size (n × n). Substituting this into Equation (1), Equation (2) is achieved:

V_{i+1} = ((I − T_{1:i} T_{1:i}') X)' Y    (2)

Distributing the transpose in ((I − T_{1:i} T_{1:i}') X)' and using the symmetry of the projection operator (I − T_{1:i} T_{1:i}') leads to Equation (3):

V_{i+1} = X' Y_i, where Y_i = (I − T_{1:i} T_{1:i}') Y    (3)

From Equation (3), it can be noted that the covariance V_{i+1} can be estimated without the deflation of the predictor matrix X, needing instead the deflation of the responses Y. This is slightly different from the current CovSel algorithm (and NIPALS), where the deflation of the predictor matrix is needed while the deflation of the response does not contribute anything extra. Since the responses Y are already known, the only extra quantity needed to estimate the covariance according to Equation (3) is the orthonormal score matrix T. Such orthonormal scores can be obtained with the scores reorthogonalization step proposed in Indahl [25], as explained in Equation (4):

τ = (I − T_{1:i−1} T_{1:i−1}') x_k    (4)

where x_k (n × 1) is the column vector of the predictor matrix selected at step i. In Equation (4), the term (I − T_{1:i−1} T_{1:i−1}') of size (n × n) is the orthogonal projection operator spanning the variability orthogonal to the T_{1:i−1} (n × (i − 1)) scores. The multiplication of the orthogonal projection operator with the nonorthogonalized score removes all the variability explained by the T_{1:i−1} scores.
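The equivalence underlying this derivation, that deflating X or deflating Y yields the same covariance matrix because the orthogonal projection operator is symmetric, can be checked numerically. A small sketch on arbitrary synthetic data (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 8))
Y = rng.normal(size=(40, 2))

# Build an orthonormal score matrix T (here simply via QR of three columns)
T, _ = np.linalg.qr(X[:, :3])          # T is 40 x 3, with T.T @ T = I

P_orth = np.eye(40) - T @ T.T          # orthogonal projection operator (symmetric)
V_deflate_X = (P_orth @ X).T @ Y       # covariance after deflating X
V_deflate_Y = X.T @ (P_orth @ Y)       # covariance after deflating Y instead

same = np.allclose(V_deflate_X, V_deflate_Y)
```

Since the projector is symmetric, (P_orth X)'Y = X' P_orth Y, so both routes give the same covariance matrix; deflating Y (n × k, with k small) is much cheaper than deflating X (n × p).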
The nonnormalized score vector τ (n × 1) extracted with Equation (4) is the same as selecting the column as the score vector from the deflated X matrix. The difference is that in Equation (4), the column is selected before deflation and only the single nonnormalized score vector is orthogonalized at a particular time, while in the latter, the whole X matrix is first deflated and then the column vector is selected as the nonnormalized score for that step.
Please note that the τ (n × 1) in Equation (4) is the ith column vector of the predictor matrix carrying the maximum covariance with the response variables. Furthermore, Equation (4), followed by the normalization in Equation (5), defines a recurrence for computing the orthonormal scores t (n × 1):

t_i = τ / ‖τ‖    (5)

The associated nested sequence of vector equations can be solved from the starting score vector t_1 (n × 1), which can be estimated without the need for any deflation operation. All scores can be accumulated as column vectors in a score matrix (Equation 6), which can later be used for postprocessing operations such as the estimation of regression coefficients:

T = [t_1, t_2, …, t_i]    (6)
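The full recurrence (select the column via the deflated responses, Gram-Schmidt it against the earlier scores, normalize, and deflate only the responses) can be sketched as follows. This is a minimal illustrative implementation assuming mean-centered data, not the author's released code:

```python
import numpy as np

def fcovsel(X, Y, n_var):
    """Deflation-free CovSel sketch: the predictor matrix X is never modified;
    only the responses are deflated, and each selected column is
    re-orthogonalised (Gram-Schmidt) against the earlier scores."""
    n, p = X.shape
    Yd = Y.astype(float).copy()
    T = np.zeros((n, 0))
    selected = []
    for _ in range(n_var):
        V = X.T @ Yd                               # covariance via deflated Y
        k = int(np.argmax(np.abs(V).sum(axis=1)))  # sum of |cov| for multiresponse
        tau = X[:, k] - T @ (T.T @ X[:, k])        # Gram-Schmidt vs earlier scores
        t = tau / np.linalg.norm(tau)              # normalise to orthonormal score
        T = np.column_stack([T, t])                # accumulate scores
        Yd = Yd - np.outer(t, t @ Yd)              # deflate only the responses
        selected.append(k)
    return selected, T

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 10))
Y = X[:, [5]] + 0.05 * rng.normal(size=(30, 1))   # response driven by variable 5
sel, T = fcovsel(X, Y, 3)
```

Because a previously selected column lies entirely in the span of T, its covariance with the deflated responses is zero, so no variable is ever selected twice.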
Since fCovSel and CovSel are cases of NIPALS, it is possible to extract the regression vector using the traditional regression coefficient b (total selected variables × 1) estimation equation for the NIPALS approach (Equation 7), as reported in several earlier works [23,27]:

b = W (P' W)^{−1} q    (7)
Note that the weight matrix W in the case of fCovSel or CovSel is the identity matrix with size equal to the total number of selected variables, as the loading weights are of the form [0,0,0 … 1 … 0,0,0]. P is the loading matrix (total selected variables × total selected variables), which can be obtained by postprocessing as P = X' T, and q (total selected variables × 1) can be obtained by postprocessing as q = T' y, or Q = T' Y for the multiresponse case. Since the postprocessing operations are the same for fCovSel and CovSel, this study only compares the time requirements for variable extraction, as that is the main step where CovSel and fCovSel differ.
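The postprocessing step can be sketched numerically as follows (synthetic data; Xs is a hypothetical name for the matrix holding only the selected columns of X). With W an identity matrix, b = W (P'W)^{−1} q coincides with the ordinary least squares solution on the selected variables:

```python
import numpy as np

rng = np.random.default_rng(4)
n, a = 50, 3                                 # samples, number of selected variables
Xs = rng.normal(size=(n, a))                 # stand-in for the selected columns of X
y = Xs @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)

# Orthonormal scores spanning the selected columns (QR acts as Gram-Schmidt)
T, _ = np.linalg.qr(Xs)

W = np.eye(a)                                # loading weights are unit vectors
P = Xs.T @ T                                 # loadings, a x a
q = T.T @ y                                  # a x 1
b = W @ np.linalg.solve(P.T @ W, q)          # b = W (P'W)^{-1} q

# Reference: ordinary least squares restricted to the selected variables
b_ls = np.linalg.lstsq(Xs, y, rcond=None)[0]
```

Writing Xs = T R (QR factorization) shows b = R^{−1} T'y, which is exactly the least squares solution, so when all selected components are kept, the NIPALS-style formula and direct regression agree.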
Note that since the fCovSel algorithm leads to the same solution as the CovSel algorithm [7] presented in the earlier study, all the operations that were possible with the CovSel algorithm, such as the identification of key variables and the building of regression or classification predictive models, are also naturally possible with fCovSel. Furthermore, all extensions such as multiblock [8] and multiway [22] can directly incorporate the fCovSel algorithm without compromising any results, while achieving those results in less time due to the absence of the predictor matrix deflation step. The fCovSel algorithm will be made available at: https://github.com/puneetmishra2

| DATASETS FOR COMPARISON OF COVSEL AND FCOVSEL
The apricot data set was used to show the similarity of fCovSel with respect to CovSel and the benefits of fCovSel, such as the speed gain in the optimization of multiblock models like SO-CovSel [8]. The apricot data set is a two-block data set, where NIR (near-infrared, 800-2770 nm) and MIR (mid-infrared, 4000-650 cm−1) spectra were acquired on 750 apricot puree samples. As a reference property, SSC (soluble solids content, %) was used. More details on the measurement of the samples and the reference analysis can be found in an earlier study [28]. The NIR data block had 769 variables, while the MIR data block had 579 variables. The CovSel and fCovSel analyses aimed to select the variables that carry high covariance with the SSC.

| RESULTS AND DISCUSSION
The proposed fCovSel algorithm achieves the same solution as the CovSel approach but in a much faster way, as it requires no deflation of the predictor matrix. The similarity of the results can be noted from the CovSel and fCovSel analyses performed on the MIR data to explain the SSC. To verify that the order of extracted variables was the same, the variable indexes extracted by CovSel and fCovSel are plotted as correlation plots. A correlation of 1 indicates that the order of extracted variables was the same for CovSel and fCovSel, verifying identical variable selection (Figure 1).
To compare execution speeds, the CovSel and fCovSel analyses were performed as a function of an increasing number of variables, and the results are shown in Figure 2. It can be noted that fCovSel always performed faster than the CovSel algorithm requiring predictor matrix deflation. The effect of sample size was also explored, and the results are shown in Figure 3. The reduction of sample size from 750 to 579 had minimal effect on the execution speed of both the CovSel and fCovSel algorithms; the size was reduced to 579 to match the 579 variables in the data set. The analysis presented in Figures 2 and 3 shows that fCovSel performed faster than CovSel; however, the time gain was only a couple of seconds. This may appear to be a slight improvement, but the real benefit of the speed emerges when exhaustive operations are performed. For example, one of the main steps in CovSel-based variable selection is the cross-validation procedure used to select the optimal number of variables. A second, even more exhaustive task is to perform cross-validation in the multiblock variable selection scenario, such as for SO-CovSel [8], requiring all combinations of selected variables to be explored. In such a multiblock cross-validation scenario, repeated variable selection analyses are involved; hence, a faster algorithm such as fCovSel can show its real benefit in time gain. As an example, the SO-CovSel and SO-fCovSel analyses were performed on the two blocks of the apricot data set (NIR and MIR) to explain the variance in SSC. The analysis involved exploration of variable combinations in the range from (1,1) to (20,20); in Figure 4, the x-axis is labeled 1 to 20 but indicates (1,1) to (20,20). The result of the analysis clearly shows the potential benefit of fCovSel over CovSel in terms of time saving for exhaustive modeling tasks such as multiblock modeling.
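A timing comparison of this kind can be reproduced on synthetic data with a sketch like the following (not the author's benchmarking code; both variants are placed in one function so that only the deflation strategy differs, and absolute timings are machine-dependent):

```python
import time
import numpy as np

def covsel(X, Y, n_var, deflate_X=True):
    """One loop covering both variants: deflate X (classical CovSel) or
    deflate only Y with score re-orthogonalisation (fCovSel)."""
    Xd, Yd = X.copy(), Y.copy()
    T = np.zeros((X.shape[0], 0))
    sel = []
    for _ in range(n_var):
        M = Xd if deflate_X else X
        V = M.T @ Yd                               # covariance estimate
        k = int(np.argmax(np.abs(V).sum(axis=1)))
        tau = M[:, k] if deflate_X else X[:, k] - T @ (T.T @ X[:, k])
        t = tau / np.linalg.norm(tau)
        if deflate_X:
            Xd = Xd - np.outer(t, t @ Xd)          # the expensive n x p update
        Yd = Yd - np.outer(t, t @ Yd)              # cheap n x k update
        T = np.column_stack([T, t])
        sel.append(k)
    return sel

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 800))
Y = rng.normal(size=(300, 1))

t0 = time.perf_counter(); sel_slow = covsel(X, Y, 20, deflate_X=True)
t1 = time.perf_counter(); sel_fast = covsel(X, Y, 20, deflate_X=False)
t2 = time.perf_counter()
# elapsed: (t1 - t0) vs (t2 - t1); the point is identical selections with less work
```

The per-step saving is the n × p rank-1 update of X, which is what compounds in cross-validation and multiblock combination searches.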
In the same way, the algorithm can also speed up the multiway multiblock CovSel approach [22] as the most time-consuming step in the multiway CovSel is the deflation of a wide matrix achieved by unfolding the multiway array.
FIGURE 1 A summary of the order of covariates selection by CovSel and fCovSel.

FIGURE 2 Time requirements for CovSel and fCovSel for extracting variables. Note that the x-axis represents independent runs of CovSel and fCovSel. A maximum of 579 runs were made due to the 579 variables in the MIR data block of the apricot data set.

| CONCLUSIONS
We presented a new, faster algorithm for covariates selection, called fCovSel, which is naturally faster than the CovSel algorithm as no predictor matrix deflation step is involved. The comparison of execution speed based on a real data set showed that, for the same number of extracted variables, the execution time of fCovSel was lower than that of CovSel. Furthermore, the exploration of fCovSel in the exhaustive multiblock variable combination search showed a large time reduction compared with CovSel. fCovSel can become the preferred choice for covariates selection over CovSel when a faster execution speed is needed, and it particularly serves as the backbone of the multiblock and multiway extensions of CovSel, as their optimization is usually exhaustive and time-consuming.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1002/cem.3397.

ORCID
Puneet Mishra https://orcid.org/0000-0001-8895-798X

FIGURE 3 Effect of sample size on the execution speed of (A) CovSel and (B) fCovSel. Note that the x-axis represents independent runs of CovSel and fCovSel. A maximum of 579 runs were made due to the 579 variables in the MIR data block of the apricot data set.

FIGURE 4 A summary of the execution speed of CovSel and fCovSel in the sequential multiblock modeling scenario.