Sparse multivariate regression with missing values and its application to the prediction of material properties

In the field of materials science and engineering, statistical analysis and machine learning techniques have recently been used to predict multiple material properties from an experimental design. These material properties correspond to the response variables in a multivariate regression model. This study conducts a penalized maximum likelihood procedure to estimate the model parameters, including the regression coefficients and the covariance matrix of the response variables. In particular, we employ $l_1$-regularization to achieve a sparse estimation of the regression coefficients and the inverse covariance matrix of the response variables. In some cases, there may be a relatively large number of missing values in the response variables, owing to the difficulty of collecting data on material properties. To improve the prediction accuracy in the presence of missing values, we incorporate the correlation structure among the response variables into the statistical model. We construct an expectation-maximization (EM) algorithm, which enables application to data sets with missing values in the responses. We apply our proposed procedure to real data consisting of 22 material properties.


Introduction
The importance of data analysis applications using statistics and machine learning for materials science and engineering has been steadily increasing (cf. [1,2,4,6,7,16,17,23,25,27]). Owing to the recent development of machine learning methods, data-centric informatics applied to a sufficiently large amount of data is useful for identifying materials that have desirable properties, such as durability and flexibility. Desirable materials are often difficult to identify using only experiments and physical simulations.
The data structure in this field often has two features that introduce new challenges. The first is that we often predict multiple properties of the materials; i.e., we must construct a regression model with multiple responses (multivariate regression). For example, in a study on adhesion structure, the yield strength, ultimate tensile strength, fracture, and Young's modulus must be predicted, among many other material properties not listed here. However, it is difficult to find a material that satisfies multiple desired properties simultaneously, because there are typically trade-offs among these properties ([11,13]). These trade-offs are expressed through the correlation matrix of the response variables. This study assumes a correlation structure among the response variables and employs a likelihood procedure to estimate the regression coefficients and the covariance matrix of the responses. The estimated covariance matrix helps engineers interpret the relationships between properties and improves the prediction accuracy ([19,24]).
When the number of responses is relatively large, it may be difficult to estimate all correlation pairs, because the number of parameters is proportional to the square of the number of variables. In such cases, regularization methods are typically employed to achieve a stable estimation of the covariance matrix of the response variables. In particular, least absolute shrinkage and selection operator (LASSO)-type sparse estimation (cf. [22]) conducts simultaneous variable selection and model estimation, which enables an interpretation of the relationships among material properties. A wide variety of regularization methods that induce a sparse structure have been proposed, such as the elastic net ([28]), group LASSO ([20]), graphical LASSO ([9]), generalized LASSO ([18]), and overlapping group LASSO ([26]). Rothman et al. [19] introduced the multivariate regression with covariance estimation (MRCE) method, in which sparse regression coefficients and a sparse inverse covariance matrix of the response variables are estimated simultaneously. This is a generalization of LASSO regression to sparse multivariate regression analysis; the model parameters are estimated by a penalized maximum likelihood procedure with the LASSO penalty. In this study, we perform data analysis based on the MRCE method.
The second feature that introduces challenges is that values of the response variables are often missing, because it is difficult to observe all the properties of the materials owing to the scale of the experiments required. When the ratio of missing data is small, we can exclude the corresponding observations and conduct the data analysis; this is referred to as complete-case analysis ([10]). However, the responses tend to have many missing values. In fact, the data set used in this study comprises more than 50% missing values (see Figure 1 in Section 4). If we conducted a complete-case analysis, the number of observations would become extremely small, resulting in low prediction accuracy. The full information maximum likelihood (FIML) approach (see [8]) provides a means to handle a large number of missing values; Hirose et al. [12] showed that the FIML approach provides a good estimator even when the ratio of missing data is 90%. The FIML approach produces a consistent estimator, even when the number of missing values is large, under the missing at random (MAR) assumption (cf. [10]). Moreover, the FIML approach enables missing-value interpolation, which may assist with understanding hidden structures and relations. For multivariate data without response variables, Städler and Bühlmann [21] proposed the MissGLASSO method, a graphical LASSO for data with missing values; an $l_1$-regularized likelihood method is used to estimate the sparse inverse covariance matrix. Moreover, they proposed an efficient EM algorithm (see [3]) for the optimization, with provable numerical convergence properties. Städler and Bühlmann [21] also extended MissGLASSO to multiple (not multivariate) regression analysis. However, they only considered the case where the explanatory variables have missing values; MissGLASSO cannot be directly applied to multivariate regression analysis when there are missing values in the response variables.
As mentioned above, the MRCE simultaneously estimates the regression coefficients and the covariance matrix of the response variables. However, it is applicable only to complete multivariate data; thus, we cannot apply it directly to data with missing values and need a suitable extension. Notably, MissGLASSO can be applied to data with missing values, but not to multivariate regression. The aim of this study is to propose a multivariate regression model for data with missing values by combining these two methods.
In this study, we establish a new algorithm, called the sparse multivariate regression with missing data (SMRM) algorithm, to estimate the inverse covariance matrix and interpolate the data with missing values (see Section 3). To estimate the multivariate regression coefficients and the covariance structure, we solve a particular $l_1$-regularized likelihood optimization problem with two types of regularization parameter: one related to the correlation structure of the responses, and the other related to the matrix of regression coefficients. We note that multiple regularization parameters are assumed for the regression coefficients because the error variances vary among the response variables. For this optimization problem, we employ the EM algorithm. As in the MRCE method, the coordinate descent algorithm and the graphical LASSO algorithm are conducted in the maximization (M) step of the EM algorithm. Using sparse estimation, the SMRM algorithm achieves stable estimation even for a data set with a relatively large number of missing values. In addition, we can improve the prediction accuracy by using the correlation structure among the response variables. In our method, we estimate a sparse inverse covariance matrix rather than the covariance matrix itself, because spurious correlations among the responses may then be excluded ([15]). In the last section, we apply the SMRM algorithm to real data and investigate the influence of the regularization parameters. Furthermore, we compare the prediction accuracy obtained by our method with that of the LASSO.

Preliminaries
2.1. Conditional distribution. We briefly review some notions and facts from multivariate regression analysis. For a detailed explanation, refer to [5]. Let $x_j = (x_{1j}, \ldots, x_{nj})^T$ $(1 \le j \le p)$ be the predictor variables and $y_l = (y_{1l}, \ldots, y_{nl})^T$ $(1 \le l \le q)$ the response variables. (We consider $x_j$ and $y_l$ as column vectors.) Then, we set the matrices $X$, $\tilde{X}$, and $Y$ as

(2.1) $X = (x_1, \ldots, x_p)$, $\tilde{X} = (1_n, X)$, $Y = (y_1, \ldots, y_q)$,

where $1_n = (1, \ldots, 1)^T \in \mathbb{R}^n$. Let $x^i = (x_{i1}, \ldots, x_{ip})$, $\tilde{x}^i = (1, x_{i1}, \ldots, x_{ip})$, and $y^i = (y_{i1}, y_{i2}, \ldots, y_{iq})$ $(1 \le i \le n)$ be the $i$-th row vectors of $X$, $\tilde{X}$, and $Y$ in (2.1), respectively. (We consider $x^i$, $\tilde{x}^i$, and $y^i$ as row vectors.) We then consider the multivariate linear regression model of the form

(2.2) $Y = \tilde{X} B + E$,

where $B$ is the $(p+1) \times q$ matrix of regression coefficients and $E$ is the $n \times q$ error matrix. We denote the $i$-th row vector of $E$ $(1 \le i \le n)$ as $\varepsilon^i = (\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iq})$. We assume that the $n$ subjects are independent and that $(\varepsilon^i)^T \sim N_q(0_q, \Sigma)$ independently for $i = 1, \ldots, n$, where $0_q$ is the $q$-dimensional zero vector and $\Sigma$ is the $q \times q$ covariance matrix of the responses. Under these assumptions, $y^i \mid x^i$ follows

(2.3) $(y^i)^T \mid x^i \sim N_q(\mu^i, \Sigma)$, where $\mu^i = B^T (\tilde{x}^i)^T$.

We now consider a partition of this conditional distribution (cf. [5,14]). We divide $(y^i)^T$ into $(y^{i,1}, y^{i,2})^T$ and divide $\mu^i$ and $\Sigma$ accordingly for each $i$. (For example, if we divide $y^i$ into $y^{i,1} = (y_{i1}, \ldots, y_{il})$ and $y^{i,2} = (y_{i,l+1}, \ldots, y_{iq})$, then $\mu^{i,1}$, $\mu^{i,2}$, $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{21}$, and $\Sigma_{22}$ are an $l \times 1$ matrix, $(q-l) \times 1$ matrix, $l \times l$ matrix, $l \times (q-l)$ matrix, $(q-l) \times l$ matrix, and a $(q-l) \times (q-l)$ matrix, respectively.) Thus, it can be observed that

(2.4) $(y^{i,1})^T \mid y^{i,2} \sim N\big(\mu^{i,1} + \Sigma_{12}\Sigma_{22}^{-1}\big((y^{i,2})^T - \mu^{i,2}\big),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)$.

Let $K$ be a $q \times q$ matrix that satisfies $K\Sigma = I_q$, where $I_q$ is the $q \times q$ identity matrix. We call $K$ the precision matrix, and we partition $K$ into blocks $K_{11}$, $K_{12}$, $K_{21}$, and $K_{22}$ in the same manner as $\Sigma$. By (2.3) and (2.4), we obtain

(2.5) $(y^{i,1})^T \mid y^{i,2} \sim N\big(\mu^{i,1} - K_{11}^{-1} K_{12}\big((y^{i,2})^T - \mu^{i,2}\big),\ K_{11}^{-1}\big)$.

This relation is the key part of our algorithm.
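The covariance-based and precision-based forms of the conditional distribution can be checked numerically. The following sketch (Python with NumPy; all variable names are ours, not the paper's) verifies that both forms of the conditional mean and the conditional covariance agree for a random positive-definite $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric positive-definite covariance matrix for q = 5 responses
q, l = 5, 2                      # split: first l components vs. the rest
A = rng.normal(size=(q, q))
Sigma = A @ A.T + q * np.eye(q)
K = np.linalg.inv(Sigma)         # precision matrix, K @ Sigma = I_q

mu = rng.normal(size=q)          # conditional mean vector of the full response
y2 = rng.normal(size=q - l)      # an observed value of the second block

# Covariance form: mu_1 + Sigma_12 Sigma_22^{-1} (y2 - mu_2)
m_cov = mu[:l] + Sigma[:l, l:] @ np.linalg.solve(Sigma[l:, l:], y2 - mu[l:])

# Precision form: mu_1 - K_11^{-1} K_12 (y2 - mu_2)
m_prec = mu[:l] - np.linalg.solve(K[:l, :l], K[:l, l:] @ (y2 - mu[l:]))

assert np.allclose(m_cov, m_prec)

# The conditional covariances agree as well: Sigma_11.2 = K_11^{-1}
S_cond = Sigma[:l, :l] - Sigma[:l, l:] @ np.linalg.solve(Sigma[l:, l:], Sigma[l:, :l])
assert np.allclose(S_cond, np.linalg.inv(K[:l, :l]))
```

The precision form is the one the algorithm uses, because the blocks of $K$ corresponding to missing entries are available directly from the sparse estimate.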
2.2. The LASSO. We briefly review the LASSO. (For further details, refer to [22].) This method will be used in Section 4 to evaluate, on real data, the prediction accuracy obtained by our proposed method. Let $x_j = (x_{1j}, \ldots, x_{nj})^T$ $(1 \le j \le p)$ be the predictor variables and $y = (y_1, \ldots, y_n)^T$ the response variable. We set a matrix $X$ using the $x_j$, as in (2.1). We then consider the linear regression model

$y = \beta_0 1_n + X\beta + \varepsilon$,

where $\beta_0 \in \mathbb{R}$ and $\beta = (\beta_1, \ldots, \beta_p)^T \in \mathbb{R}^p$ are the regression parameters. In this case, the LASSO solves the optimization problem

(2.6) $\min_{\beta_0,\, \beta}\ \frac{1}{2n}\,\|y - \beta_0 1_n - X\beta\|_2^2 + \lambda \|\beta\|_1$,

where $\lambda > 0$. By solving (2.6), we obtain estimators of $\beta_0$ and $\beta$. Generally, the regularization parameter $\lambda$ of the LASSO is chosen to minimize the prediction error of each response; such a regularization parameter is typically called the 'best' regularization parameter.
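As a concrete illustration, the LASSO objective can be minimized by coordinate descent with soft-thresholding. The sketch below (Python with NumPy; a deliberately simplified stand-in for production solvers such as glmnet, and assuming the $1/(2n)$ scaling of the squared error) recovers ordinary least squares when $\lambda = 0$ and produces exact zeros for larger $\lambda$:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - b0 - X b||^2 + lam * ||b||_1.

    A minimal sketch; production solvers add convergence checks,
    warm starts, and active-set tricks.
    """
    n, p = X.shape
    b = np.zeros(p)
    b0 = y.mean()
    for _ in range(n_sweeps):
        b0 = (y - X @ b).mean()                      # unpenalized intercept
        for j in range(p):
            r_j = y - b0 - X @ b + X[:, j] * b[j]    # partial residual
            c_j = X[:, j] @ r_j / n
            a_j = (X[:, j] ** 2).mean()
            b[j] = soft_threshold(c_j, lam) / a_j
    return b0, b

# Sanity checks on simulated data with true coefficients (2, 0, -1):
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)

b0, b = lasso_cd(X, y, lam=0.0)                      # lam = 0 recovers OLS
beta_ols = np.linalg.lstsq(np.column_stack([np.ones(50), X]), y, rcond=None)[0]
assert np.allclose(np.r_[b0, b], beta_ols, atol=1e-6)

_, b_sparse = lasso_cd(X, y, lam=1.0)                # larger lam induces sparsity
assert b_sparse[1] == 0.0                            # the truly-zero coefficient is dropped
```

The same coordinate-wise update, applied column by column with the precision matrix held fixed, is what the MRCE-type M-step uses for the coefficient matrix $B$.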

Interpolation for data with missing values
3.1. Responses with missing values. Let $x_j \in \mathbb{R}^n$ $(1 \le j \le p)$ be the predictor variables and $y_l \in \mathbb{R}^n$ $(1 \le l \le q)$ the response variables. We assume that the relation (2.2) holds. We consider the case in which the matrix of responses $Y$, as in (2.1), has missing values. Then, we divide the $i$-th row vector $y^i$ of $Y$ into

(3.1) $(y^i)^T = \big((y^{i,\mathrm{obs}})^T, (y^{i,\mathrm{mis}})^T\big)^T$,

where $y^{i,\mathrm{obs}}$ is the vector that consists of the observed values and $y^{i,\mathrm{mis}}$ consists of the missing values. By (2.5), it follows that

$(y^{i,\mathrm{mis}})^T \mid y^{i,\mathrm{obs}} \sim N\Big(\mu^{i,\mathrm{mis}} - \big(K_{\mathrm{mis},\mathrm{mis}}\big)^{-1} K_{\mathrm{mis},\mathrm{obs}}\big((y^{i,\mathrm{obs}})^T - \mu^{i,\mathrm{obs}}\big),\ \big(K_{\mathrm{mis},\mathrm{mis}}\big)^{-1}\Big)$,

where, for each $i$, we divide the mean vector $\mu^i = B^T(\tilde{x}^i)^T$ and the precision matrix $K$ into the blocks $\mu^{i,\mathrm{obs}}$, $\mu^{i,\mathrm{mis}}$, $K_{\mathrm{obs},\mathrm{obs}}$, $K_{\mathrm{obs},\mathrm{mis}}$, $K_{\mathrm{mis},\mathrm{obs}}$, and $K_{\mathrm{mis},\mathrm{mis}}$ corresponding to the observed and missing entries of $y^i$. For the remainder of this paper, we assume that the response matrix $Y$ in (3.2) has no column whose entries are all missing.

3.2. Algorithm to interpolate the data with missing values. We derive an algorithm that performs multivariate regression and interpolates data with missing values. We assume the same conditions as in the previous subsection.
The density of $(y^i)^T \mid x^i$ is

$f(y^i \mid x^i) = (2\pi)^{-q/2}\,|\Sigma|^{-1/2} \exp\Big(-\tfrac{1}{2}\,\big(y^i - (\mu^i)^T\big)\,\Sigma^{-1}\,\big(y^i - (\mu^i)^T\big)^T\Big)$,

where $|\Sigma|$ is the determinant of $\Sigma$. Thus, with $K = \Sigma^{-1}$, the log-likelihood function can be expressed as

$l(B, K \mid Y) = \frac{n}{2}\log|K| - \frac{1}{2}\sum_{i=1}^{n}\big(y^i - \tilde{x}^i B\big)\,K\,\big(y^i - \tilde{x}^i B\big)^T + \mathrm{const.}$,

where $Y$ is the matrix given by (3.2). Using this function, we set the following $l_1$-regularization of $l$:

$Q(B, K) = -\frac{1}{n}\,l(B, K \mid Y) + \lambda_1 \sum_{k \neq l} |k_{kl}| + \sum_{j=0}^{p}\sum_{l=1}^{q} \lambda^{j}_{2,l}\,|b_{jl}|$,

where $\lambda_1 \ge 0$ and $\lambda^{j}_{2,l} \ge 0$ are regularization parameters, $k_{kl}$ denotes the $(k, l)$ entry of $K$, and $b_{jl}$ denotes the $(j, l)$ entry of $B$. We derive an algorithm that imputes the data $y^{i,\mathrm{mis}}$ by solving the optimization problem for $Q$ with the EM algorithm. We call this procedure the sparse multivariate regression for responses with missing data (SMRM) algorithm. First, we provide initial values $B^{(0)}$ and $K^{(0)}$. Then, we compute the E-steps and M-steps as follows (cf. [21]).

E-step: For each $i$, we denote the mean vector and the precision matrix at the $m$-th step $(m = 0, 1, 2, \ldots)$ by $\mu^{i,(m)} = B^{(m)T}(\tilde{x}^i)^T$ and $K^{(m)}$, respectively. By (2.5), we set

$c^{i,(m)} = \mu^{i,\mathrm{mis},(m)} - \big(K^{(m)}_{\mathrm{mis},\mathrm{mis}}\big)^{-1} K^{(m)}_{\mathrm{mis},\mathrm{obs}}\big((y^{i,\mathrm{obs}})^T - \mu^{i,\mathrm{obs},(m)}\big)$

for each $i$. We consider $c^{i,(m)}$ as a column vector and use it to impute the missing values $(y^{i,\mathrm{mis},(m)})^T$ at the $m$-th step for each $i$.

M-step: Let $Y^{(m)}$ be the completed response matrix obtained by replacing the missing entries with the conditional means $c^{i,(m)}$. We define the function $g$ as the objective $Q$ evaluated on $Y^{(m)}$; then, our aim corresponds to solving the following optimization problem:

$\big(B^{(m+1)}, K^{(m+1)}\big) = \operatorname{arg\,min}_{B, K}\ g(B, K)$.

Algorithms that numerically solve problems of this form are the MRCE algorithm ([19]) and MissGLASSO ([21]). By applying the MRCE-type and MissGLASSO-type updates (the coordinate descent algorithm for $B$ with $K$ fixed, and the graphical LASSO for $K$ with $B$ fixed), we solve the above problem and obtain the updates.
Remark 3.1. To compute the mean of $g$, we note the following: whether optimizing $B$ for fixed $K$ or $K$ for fixed $B$, we use $Y$ as the completed matrix whose rows contain the imputed values $c^{i,(m)}$ in place of the missing entries. For complete data, the SMRM algorithm performs similarly to the LASSO for large $\lambda_1$. Thus, we may consider the SMRM algorithm a generalization of the LASSO to multivariate regression analysis.
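To make the E- and M-steps concrete, the following toy sketch (Python with NumPy; all names are ours) iterates imputation by the conditional mean (2.5) with a deliberately simplified M-step: unpenalized least squares for $B$ and the inverse residual covariance for $K$, standing in for the paper's coordinate-descent and graphical-LASSO updates.

```python
import numpy as np

def smrm_em_sketch(X, Y, n_iter=50):
    """EM-style imputation for responses with missing values (NaN entries).

    Simplified sketch of the SMRM iteration: the E-step imputes each
    missing block by its conditional mean, as in (2.5); the M-step below
    is an unpenalized stand-in for the coordinate-descent and
    graphical-LASSO updates.
    """
    n, q = Y.shape
    Xt = np.column_stack([np.ones(n), X])                  # \tilde{X} with intercept
    Yc = np.where(np.isnan(Y), np.nanmean(Y, axis=0), Y)   # initial column-mean fill
    for _ in range(n_iter):
        # M-step (unpenalized stand-in): coefficients and precision matrix
        B = np.linalg.lstsq(Xt, Yc, rcond=None)[0]
        R = Yc - Xt @ B
        K = np.linalg.inv(R.T @ R / n + 1e-6 * np.eye(q))
        # E-step: conditional mean of the missing part given the observed part
        Mu = Xt @ B
        for i in range(n):
            mis = np.isnan(Y[i])
            if mis.any():
                obs = ~mis
                rhs = K[np.ix_(mis, obs)] @ (Y[i, obs] - Mu[i, obs])
                Yc[i, mis] = Mu[i, mis] - np.linalg.solve(K[np.ix_(mis, mis)], rhs)
    return B, K, Yc

# Toy data: correlated responses driven by common predictors
rng = np.random.default_rng(2)
n, p, q = 200, 3, 4
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p + 1, q))
Y_full = np.column_stack([np.ones(n), X]) @ B_true + 0.3 * rng.normal(size=(n, q))
Y = Y_full.copy()
Y[rng.random(size=(n, q)) < 0.3] = np.nan            # 30% missing at random

B_hat, K_hat, Y_completed = smrm_em_sketch(X, Y)
mask = np.isnan(Y)
err_em = np.abs(Y_completed - Y_full)[mask].mean()
err_mean = np.abs(np.nanmean(Y, axis=0) - Y_full)[mask].mean()
assert err_em < err_mean     # model-based imputation beats column means here
```

The sketch also illustrates why a column of entirely missing values must be excluded: `np.nanmean` (and the conditional-mean update) has nothing to condition on for such a column.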

Applying the algorithm to real data
In this section, we apply the SMRM algorithm to real data provided by Toray Industries, Inc. The data consist of physical/mechanical properties of particular polymer compounds. To maintain confidentiality, we cannot display the full data set; however, we describe its size and components. The sample size is n = 114, and the numbers of predictors and responses are p = 26 and q = 22, respectively. The predictor variables consist of compounding ratios of the source materials. The response variables consist of mechanical characteristics of the resulting compounds, such as Young's modulus, tensile strength, elongation at break, flexural modulus, flexural strength, and the Charpy impact strength. The responses have missing values, at rates ranging from 5% to 80% per observation. In particular, the total ratio of missing values in the responses is 59.7%, which is typical in the materials science field because the development process focuses on specific properties (see Figure 1).
During data analysis, we apply the SMRM algorithm to the data and compare the prediction accuracy of our proposed method with that of the LASSO. In this study, we use R version 4.0.2.
4.1. The procedure. Let $x_j = (x_{1j}, \ldots, x_{nj})^T$ $(1 \le j \le p)$ be the predictor variables, and let $y_l = (y_{1l}, \ldots, y_{nl})^T$ $(1 \le l \le q)$ be the response variables. We divide the original data into training and test data at a ratio of 8 : 2; that is, we partition $x_j$ and $y_l$ into $x_j^T = (x_{j,\mathrm{train}}^T, x_{j,\mathrm{test}}^T)$ and $y_l^T = (y_{l,\mathrm{train}}^T, y_{l,\mathrm{test}}^T)$, respectively, with a partition rate of 8 : 2 for each $j$ and $l$ (see Figure 1). Although $y_{l,\mathrm{train}}$ and $y_{l,\mathrm{test}}$ may have missing values, we assume that $x_j$, $x_{j,\mathrm{train}}$, and $x_{j,\mathrm{test}}$ are complete. We perform the following analysis to compare the SMRM algorithm and the LASSO.

Step 1: For each $l$, we set $y_{l,\mathrm{train,obs}}$ and $X_{l,\mathrm{train,obs}}$, where $y_{l,\mathrm{train,obs}}$ is the vector whose elements are the observed values in $y_{l,\mathrm{train}}$, and $X_{l,\mathrm{train,obs}}$ is the corresponding matrix of predictors. Then, we apply the LASSO to the data set $(y_{l,\mathrm{train,obs}}, X_{l,\mathrm{train,obs}})$ for each $l$.

Step 2: We apply the SMRM algorithm to the training data.

Step 3: For each $l$ and each method, we calculate the MSE on the observed test data, $\mathrm{MSE}_l = \|y_{l,\mathrm{test,obs}} - \hat{y}_{l,\mathrm{test,obs}}\|^2 / \mathrm{length}(y_{l,\mathrm{test,obs}})$, and set $\mathrm{MSE}^{\mathrm{lasso}} = \sum_l \mathrm{MSE}^{\mathrm{lasso}}_l$ and $\mathrm{MSE}^{\mathrm{SMRM}} = \sum_l \mathrm{MSE}^{\mathrm{SMRM}}_l$.
The above MSEs are primarily affected by the response variables with large variances. Thus, in (4.3) we define scale-free MSEs that are not affected by the variance of the response variables: each $\mathrm{MSE}_l$ is divided by the corresponding LASSO test MSE $\mathrm{MSE}^{\mathrm{lasso}}_l$, so that the LASSO contributes exactly 1 for each response, and the normalized values are summed. We then compare $\mathrm{MSE}^{\mathrm{lasso}}$ and $\mathrm{MSE}^{\mathrm{SMRM}}$ on this scale.
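As a numerical illustration of why the raw sums are misleading, consider hypothetical per-response test MSEs (the numbers below are invented, not taken from Table 1):

```python
import numpy as np

# Hypothetical per-response test MSEs for the two methods:
mse_lasso_l = np.array([12.0, 0.4, 300.0])
mse_smrm_l = np.array([10.0, 0.5, 240.0])

# Raw sums are dominated by the large-variance third response:
raw_lasso, raw_smrm = mse_lasso_l.sum(), mse_smrm_l.sum()

# Scale-free version: normalize each response by its LASSO test MSE,
# so the LASSO contributes exactly 1 per response.
norm_smrm = (mse_smrm_l / mse_lasso_l).sum()
norm_lasso = float(len(mse_lasso_l))
assert norm_smrm < norm_lasso    # SMRM wins on 2 of the 3 responses here
```

On the raw scale, the comparison is effectively decided by the third response alone; the normalized sum weighs each response equally.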
Remark 4.1. Because the SMRM algorithm is based on the multivariate normal distribution, the predicted $\hat{Y}^{\mathrm{SMRM}}_{\mathrm{test}}$ may contain negative values. However, the physical property $y_l$ cannot assume negative values in a real-world situation. To avoid this, we first take $\log(Y)$ and consider it as the response matrix. Then, applying the exponential to the predicted matrix $\log(\hat{Y})$, we obtain $\hat{Y}$.
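A minimal sketch of the log-transform device in Remark 4.1 (the property matrix below is hypothetical, and the "prediction" is simulated by a downward perturbation):

```python
import numpy as np

# Hypothetical positive-valued property matrix (stand-in for the real Y)
Y = np.array([[1.2, 3.5],
              [0.7, 2.0],
              [5.0, 0.9]])

log_Y = np.log(Y)             # fit the model to log(Y) instead of Y
# ... run the regression on log_Y; suppose it returns predictions log_Y_hat.
# Here we perturb log_Y to mimic a prediction that dips below zero on the
# original scale:
log_Y_hat = log_Y - 1.0
Y_hat = np.exp(log_Y_hat)     # the back-transform guarantees positivity
assert (Y_hat > 0).all()
```

Whatever the model predicts on the log scale, exponentiation maps it back to a strictly positive property value.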
In Step 2, we use the $\lambda_2$ matrix for the SMRM algorithm. If the responses are standardized, we may define the $\lambda_2$ matrix as $\lambda_2 = r\lambda$, where $r \in \mathbb{R} \setminus \{0\}$ and $\lambda$ is the matrix given by (4.4). However, when the responses are not standardized, the variance of each response variable affects the corresponding regularization parameter; $\lambda_2$ must therefore differ among the response variables. Thus, we conduct the following procedure to reduce the effect of the variances:

Step 1: For each $l \in \{1, \ldots, q\}$, we estimate $y_{l,\mathrm{train,obs}}$ using the LASSO with the regularization parameter $\lambda_{l,\mathrm{train}}$, which is chosen via cross-validation.
Step 2: For each $l$, we calculate the training MSEs, $t_l = \|y_{l,\mathrm{train,obs}} - \hat{y}^{\mathrm{lasso}}_{l,\mathrm{train,obs}}\|^2 / \mathrm{length}(y_{l,\mathrm{train,obs}})$, where $\hat{y}^{\mathrm{lasso}}_{l,\mathrm{train,obs}}$ is the LASSO estimate of $y_{l,\mathrm{train,obs}}$ from Step 1.
Step 3: Using $t_l$, which was obtained in Step 2, we define a vector

(4.5) $a = (a_1, \ldots, a_q)^T$, where $a_l = 1/t_l$.

Step 4: Using $a$ and the cross-validated parameters $\lambda_{l,\mathrm{train}}$, we define a $(p+1) \times q$ matrix

(4.6) $\lambda = (\lambda_{jl})$, where $\lambda_{jl} = \lambda_{l,\mathrm{train}}\, a_l$.

Step 5: We set $\lambda_2 = r\lambda$ $(r \in \mathbb{R} \setminus \{0\})$ and apply the SMRM algorithm using this matrix.
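Steps 3-5 above amount to elementwise arithmetic and can be sketched as follows (Python with NumPy; the numbers are hypothetical placeholders for the Table 1 values, and the construction of the $\lambda$ matrix follows our reading of (4.5)-(4.6)):

```python
import numpy as np

# Hypothetical per-response quantities (the paper's actual values are in Table 1):
lam_train = np.array([0.08, 0.02, 0.30])   # cross-validated LASSO parameters, Step 1
t = np.array([4.0, 0.5, 25.0])             # training MSEs t_l from Step 2

a = 1.0 / t                                # Step 3: a_l = 1 / t_l, as in (4.5)
p_plus_1 = 5                               # number of rows of B (intercept + p)
lam = np.tile(lam_train * a, (p_plus_1, 1))  # Step 4: (j, l) entry lam_{l,train} / t_l

r = 0.2
lam2 = r * lam                             # Step 5: lambda_2 = r * lambda
assert lam2.shape == (p_plus_1, len(t))
```

Dividing by $t_l$ cancels the scale of each response, which is why the resulting entries vary far less than the raw $\lambda_{l,\mathrm{train}}$ (compare the first and fourth rows of Table 1).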

4.2. Comparison between the SMRM algorithm and the LASSO. Following the procedure explained in the previous subsection, we compare our method (the SMRM algorithm) to the LASSO using the data provided by Toray Industries, Inc. The regularization parameters $\lambda_{l,\mathrm{train}}$ $(1 \le l \le 22)$ for the training data and the MSEs $\mathrm{MSE}^{\mathrm{lasso}}_l$ for the test data obtained by the LASSO are summarized in the first and second rows of Table 1. The second row of Table 1 shows that the mechanical characteristics A, G, H, K, L, M, N, and V have large MSE values. These values dominate $\mathrm{MSE}^{\mathrm{lasso}}$ and likewise affect $\mathrm{MSE}^{\mathrm{SMRM}}$. Therefore, it is better to compare the prediction accuracy of the LASSO and the SMRM algorithm using the MSE, as in (4.3), thereby avoiding the dependence on the variances of the responses.
We subsequently apply the SMRM algorithm. Because the responses of the data are not standardized, we calculate the vector $a$, as in (4.5) (a $22 \times 1$ vector), to obtain the $\lambda_2$ matrix. The vector $a$ is listed in the third row of Table 1. Using the regularization parameters $\lambda_{l,\mathrm{train}}$ $(1 \le l \le 22)$ in the first row of Table 1 and $a$, we obtain the $\lambda$ matrix, as given in (4.6). A row vector of $\lambda$ is shown in the fourth row of Table 1.
The values in the third row of Table 1 correspond to the reciprocals of the MSEs for each mechanical characteristic of the training data obtained by the LASSO; these MSE values vary significantly. The values in the fourth row of Table 1 can be considered modified regularization parameters derived from the LASSO in Step 1. A comparison of the first and fourth rows of Table 1 shows that the values in the fourth row vary only slightly. Thus, one can obtain a stable estimate using the SMRM algorithm with $\lambda_2$ constructed from (4.6) instead of the matrix given by (4.4).
In our observations, the values of $\mathrm{MSE}^{\mathrm{SMRM}}$ improve gradually as $\lambda_1$ is decreased for a fixed $r \le 1$ (see Figure 2 and Figure A.1 in Appendix A). However, when $\lambda_1$ becomes smaller than a particular value, the SMRM algorithm is unstable and $\mathrm{MSE}^{\mathrm{SMRM}}$ deteriorates. The following explanations can be considered:

• When the regularization parameter $\lambda_1$ for the SMRM algorithm is large, the correlation structure is not taken into account. In this case, the imputation of missing values may not be effective, because the imputation exploits the correlation structure of the responses. Hence, the prediction accuracy may be worse than that of the LASSO.

• When $\lambda_1$ is appropriately small, the missing values are well imputed using the correlation structure among the responses (or among the residuals). The prediction accuracy improves, owing to the contribution of the correlation structure.

• When $\lambda_1$ is too small, the estimated model overfits the data. Hence, the prediction error increases again.
In the case of $r > 1$, $\mathrm{MSE}^{\mathrm{SMRM}}$ deteriorates regardless of $\lambda_1$. When $r > 1$, the elements of $\lambda_2$ take large values, and for these data the elements of the row vectors of the estimator $\hat{B}$ tend to be shrunk to zero. This is why $\mathrm{MSE}^{\mathrm{SMRM}}$ is worse than in the case of $r \le 1$.
We next observe the influence of $r$ on the prediction accuracy. As $r$ gradually decreases, the prediction accuracy of the SMRM algorithm improves. In particular, $r = 0.2$, that is, $\lambda_2 = 0.2\lambda$, with $\log(\lambda_1) = -4.96$ provides the best prediction accuracy (see Figure 2). When $r = 0.1$, the prediction accuracy is worse than in the case of $r = 0.2$ (see Figure A.1 in Appendix A).
Furthermore, comparing $\mathrm{MSE}^{\mathrm{SMRM}}$ with $\lambda_2 = 0.2\lambda$ and $\mathrm{MSE}^{\mathrm{lasso}}$, we observe that $\mathrm{MSE}^{\mathrm{SMRM}} < \mathrm{MSE}^{\mathrm{lasso}}$ holds; that is, the SMRM algorithm is superior to the LASSO for a suitably small $\lambda_1$. Because $\lambda_1$ controls the estimated correlation structure among the responses, the prediction accuracy can be improved by estimating the responses jointly with an appropriate correlation structure instead of individually. Next, we consider the MSEs in the case of $\lambda_2 = 0.2\lambda$ for the mechanical characteristics individually (see Figure A.2). In this individual analysis, we consider the mechanical characteristics C and D, i.e., the third and fourth mechanical characteristics (see Figure 3). For $-4.35 \le \log(\lambda_1) \le 0$, $\mathrm{MSE}^{\mathrm{SMRM}}_l > \mathrm{MSE}^{\mathrm{lasso}}_l\ (= 1)$ holds for $l = 3, 4$; that is, the prediction accuracy obtained by the SMRM algorithm is lower than that of the LASSO. However, for $\log(\lambda_1) \le -4.38$, we find that $\mathrm{MSE}^{\mathrm{SMRM}}_l < \mathrm{MSE}^{\mathrm{lasso}}_l$ $(l = 3, 4)$ holds. Thus, the prediction accuracy is improved by using the correlation structure among the responses. Although $\log(\lambda_1) = -4.96$ gives the best overall $\mathrm{MSE}^{\mathrm{SMRM}}$, as mentioned above, $\mathrm{MSE}^{\mathrm{SMRM}}_3$ and $\mathrm{MSE}^{\mathrm{SMRM}}_4$ attain their best values at $\log(\lambda_1) = -4.99$ (the second best with respect to the overall $\mathrm{MSE}^{\mathrm{SMRM}}$).
To consider the reason for these differences, we use heat maps of the correlation structures (see Figure 4). Here, 'the best,' 'better1,' and 'better2' are labeled with respect to $\mathrm{MSE}^{\mathrm{SMRM}}$. (When we observe $\mathrm{MSE}_l$ individually, note that 'better2' represents the case with the best prediction accuracy for C and D.) We first note that C and D have a particular physical/mechanical relationship with each other.¹ It is noticeable that C and D have a strong positive correlation.
Therefore, it appears that if the positive correlation between C and D is gradually strengthened, the prediction accuracy improves. On the other hand, the SMRM algorithm estimates a correlation structure among C, D, and O (the fifteenth mechanical characteristic). Since C (resp. D) and O represent different mechanical properties, it is difficult to identify their correlation through experiments alone. Therefore, this may be considered a hidden relation among the mechanical properties, and it seems that the sparse multivariate regression method using a precision matrix contributes to identifying such relations. If a positive correlation among C, D, and O is suitably set, the prediction accuracies of C and D improve. For the mechanical characteristic O, we notice that the prediction accuracy is better when the positive correlation of O with C and D increases (see Figure 5).
From the above observations, a suitable positive correlation structure among C, D, and O affects the prediction accuracy of these mechanical characteristics for this data set. Furthermore, using our method, we can identify unexpected relations among the responses, similar to the characteristics above.
For other mechanical characteristics, see Appendix A.

¹ This was indicated by K. Nomura, S. Kobayashi, and K. Koyanagi, who provided the data.

Conclusion
In this study, we proposed a novel method, the sparse multivariate regression with missing values (SMRM) algorithm, with the intention of applying it to materials science. Data in materials science often have two features: (1) the material properties are multivariate, and (2) they often contain missing values. The MRCE and MissGLASSO methods address these features, but neither applies directly to our setting. By modifying and combining them within a suitable framework, we constructed the proposed algorithm (Section 3). Owing to the regularization of the correlation structure, we can improve the prediction accuracy and may find unexpected relations among the response variables. Indeed, in the real data analysis, we found an unexpected relation among the response variables (Section 4). Furthermore, we verified that our proposed method was superior to the LASSO for these data. For these reasons, we expect that our proposed method can contribute to the progress of materials science and related areas.
Our proposed procedure performed worse than the LASSO for some material properties. This poor performance is probably due to a large number of missing values: as the ratio of missing values increases, the prediction accuracy of our proposed method deteriorates. As future work, it would be interesting to investigate the influence of the ratio of missing values on our proposed method.