Multi‐Layer Feature Selection Incorporating Weighted Score‐Based Expert Knowledge toward Modeling Materials with Targeted Properties

Selecting proper descriptors or features is one of the central problems in exploring structure–activity relationships of materials using machine learning models. Current feature selection algorithms usually require tedious hyperparameter tuning and do not actively incorporate the prior knowledge of domain experts about the features. Here, this work proposes a data‐driven multi‐layer feature selection method incorporating domain expert knowledge, named DML‐FSdek, which is fully automated: users supply only the training data, with no manual tuning of hyperparameters. The domain expert knowledge is quantified by weighted scoring and integrated into the selection process to eliminate the risk of crucial features being removed. Tests on ten material properties datasets demonstrate that the approach automatically finds a reduced feature set with lower root mean square errors than those for the initial feature set. Essentially, the most relevant material features, whose number is much smaller than in the original feature set, are automatically selected to establish a closer and more accurate structure–activity relationship for the materials of interest. As a result, the method represents the targeted properties of materials with a smaller and more interpretable set of features while ensuring equal or better prediction accuracy.


Introduction
Adv. Theory Simul. 2020, 3, 1900215. DOI: 10.1002/adts.201900215

Machine learning (ML) is rapidly gaining popularity as an approach to accelerate the design and development of advanced materials. It has been used for material properties prediction and optimization, [1][2][3] new materials discovery, [4] and improvement of parameters for computational studies, [5] and has proved its efficiency and accuracy. The universal procedure of ML in material properties prediction is schematically illustrated in Figure 1. The representations of a material dataset, called "descriptors" or "features," not only uniquely define each material in the input dataset but also correlate with its target properties. [6][7][8] One of the critical aspects of constructing a machine learning model is to select appropriate descriptors that reflect material properties. [9] Ideally, only the relevant features are kept, while redundant and irrelevant features are discarded, as they reduce the prediction performance of the ML model and increase computational complexity.
For instance, in ion conductivity prediction of lithium battery materials, [10] crystal enthalpy does not correlate with ion conductivity and thus should be regarded as an irrelevant attribute for an ML model. A more complicated example is lattice constant prediction. [11] When three features, such as composition, average coordination number, and atomic valence, are present in the original feature set simultaneously, all of them strongly correlate with the target property, that is, the lattice constant. However, it is also known that the average coordination number can be predicted for a given composition via the known atomic radii and valence. In this case, the average coordination number can be eliminated as a redundant attribute. In summary, in predicting material properties, complex and unclear correlations exist not only between the features and target properties but also among the features themselves, and it is essential to identify and eliminate the irrelevant and redundant features and retain only the representative features of the original feature set.
Feature selection (FS) is one of the most important steps in the machine learning process for constructing quantitative structure–activity relationships, [12][13][14] as identification and ranking of the most relevant features greatly affect the computational speed, predictive ability, and interpretability of the model. [12,15] Recently, in a machine learning study of the thermodynamic stability of perovskite oxides, Morgan et al. applied three different FS methods, that is, stability selection, recursive feature elimination, and univariate feature selection based on mutual information. [16] Based on the basic principles of these algorithms, we will refer to them in this work as Wrapper, Embedded, and Filter methods, respectively. By applying these methods, the authors were able to reduce the initial 791 features to 70 and construct an ML model without significant overfitting. In order to avoid overfitting, Zhang et al. [17] employed an L1 (Wrapper) method for feature selection before model training. After a series of parameter adjustments and model selection steps, a model with good generalization performance was achieved.
In summary, many researchers in the field of computational materials science have begun to adopt a variety of FS algorithms to quantify the relevance of material descriptors to the properties of interest. Nevertheless, given the diversity of the available FS methods, domain experts usually face the problem of choosing an appropriate method. Additionally, even if an FS method suitable for a certain type of domain problem is determined, the hyperparameters and strategies involved require manual setting and adjustment, which is usually time-consuming and labor-intensive. For example, for the filter methods, users usually manually define the number of selected features and the filtering threshold, while the wrapper methods need manual input of the subset search strategy to generate a candidate feature subset. The embedded methods employ machine-learning algorithms, such as Lasso and gradient boosting regression (GBR), to measure the importance of the features, and the hyperparameters of the algorithm (e.g., the number and max-depth of decision trees in GBR) also need to be manually searched and optimized to achieve better performance. Consequently, domain experts may face difficulties in selecting easy-to-use but accurate methods, as parameter adjustment requires practical experience in using machine learning. On the other hand, the prior knowledge of domain experts on the importance or relevance of the features is typically ignored in the FS process, even though domain experts may know in advance which features are more important; this reduces the efficiency of model development and the predictive ability of the model.
Although such information can be introduced into some machine learning models, such as support vector machines (SVMs), to indirectly help select features, [18] in general domain experts' prior knowledge is rarely incorporated into the selection procedure for material features. Therefore, it would be very useful to develop an automatic feature selection approach combined with domain expert knowledge.
Herein, we propose a multi-layer approach that utilizes the intrinsic characteristics of the data to evaluate the importance of features from different perspectives and eliminate the irrelevant and redundant features from the original training set. The whole process is automated and does not require the user to have experience in feature selection. Moreover, the integration of domain expertise prevents key features from being ignored. In Table 1, we compare the filter, wrapper, and embedded methods with our approach. As we demonstrate below, our method has advantages over these methods, with the exception of slower processing speed, which can be mitigated in the future through parallelization. The results discussed in detail below indicate that the proposed method can successfully replace the time-consuming trial-and-error tuning of model hyperparameters and provide equal or better prediction performance with a smaller and more interpretable feature set. The remainder of the paper is structured as follows: Section 2 briefly reviews the existing feature selection methods (Filter, Wrapper, and Embedded) and their working principles. Section 3 describes the details of the method proposed in this work. Section 4 demonstrates the effectiveness and feasibility of the method on ten material property datasets. Finally, the conclusions of this study are given in Section 5.

Preliminaries
The overall goal of feature selection (FS) is to identify a subset of features that contains the most representative information on the properties of interest in the original data. In recent years, the FS process has attracted widespread attention due to its importance for further analysis and understanding of the data. Several approaches have thus been presented, which can generally be grouped into three categories: filter methods, wrapper methods, and embedded methods. [19] Figure 2 presents the working procedure of the filters. Filter methods [20,21] use evaluation criteria based on statistical theory and information theory, such as distance functions, [22] statistical correlation coefficients, [23] mutual information, [24] etc., to assess the relevance of the features and rank them according to their importance. The features with high scores are then used in the ML model. The advantages of the filter approach are its simplicity and efficiency. A common disadvantage, however, is that the selection process is decoupled from the model used to build the predictor and ignores the effect of the selected feature subset on the performance of the ML model, which in general lowers prediction accuracy. [25] In contrast to the filter approach, wrapper methods [25,26] use the prediction performance of a machine-learning model (e.g., a support vector machine (SVM) or neural network (NN)) as the criterion to evaluate the quality of a candidate feature subset. The workflow of wrappers is schematically illustrated in Figure 3. First, the wrapper generates an initial candidate feature subset based on predefined search strategies, such as sequential backward selection, sequential forward selection, sequential forward floating selection, or sequential backward floating selection; then an ML model is trained and tested to estimate the candidate feature subset.
This process is performed iteratively until the selected feature subset meets the specified requirement. The better results and higher prediction accuracy of the wrappers are achieved at the cost of computational time and complexity.
Embedded methods [27,28] utilize specific types of ML models, such as linear models (linear, Lasso, or ridge regression), support vector machines (SVMs), and random forests, to guide the feature selection process, defining a criterion that depends on a class of regression or classification functions.
In general, filter methods are faster than the wrapper and embedded methods in terms of processing speed but produce inferior results because they are independent of the specific ML algorithms.
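The filter strategy described above can be sketched in a few lines. The following is a minimal, illustrative ranking-based filter (not the paper's implementation): features are scored by their absolute Pearson correlation with the target and returned in order of relevance, after which a user would keep the top-scoring columns.

```python
import numpy as np

def filter_rank(X, y):
    """Rank features by absolute Pearson correlation with the target;
    return column indices from most to least relevant, plus the scores."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]
    return order, scores
```

Note that, exactly as the text observes, this scoring never consults the downstream predictor, which is both the source of its speed and of its inferior results.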

The Principles of the Proposed DML-FSdek Method
To our knowledge, many non-machine-learning experts rely on either a tedious trial-and-error process or personal, biased experience when choosing feature selection (FS) methods. Moreover, current FS algorithms tend to ignore the prior knowledge of domain experts about which features are more relevant, which may lead to the removal of some crucial features. Hence, we develop a novel multi-layer feature selection method, DML-FSdek, that incorporates domain expert knowledge.
The overall framework of our approach is illustrated in Figure 4. First, based on the domain expert knowledge applied to the initial features in the material database, an a priori model of feature relevance is constructed and used to drive the process of DML-FSdek. Then, in DML-FSdek, the problems of sparsity, irrelevance, and redundancy of the input data are hierarchically addressed by three corresponding processing layers, which ultimately ensures that the selected features are highly differentiated and highly correlated with the target attributes. To sum up, the key idea behind DML-FSdek is that statistical analysis and information theory (see Section 3.1.1) are employed to analyze the relationship between features and target attributes and to remove the features that are most interrelated (redundant) or least relevant to the target attributes. Further, in feature subset evaluation (see Section 3.1.2), each layer generates an initial subset of features (candidate feature subset) using a default initial filtering threshold, and then constructs a specific ML model to evaluate the subset. If it is better than the previous subset of features, the threshold is adaptively updated and a new subset of features is generated. This whole process is performed iteratively until a subset of features that meets the specific requirements is found. Finally, the best subset of features is picked considering domain expert knowledge and used for subsequent prediction of material properties.
The design of the DML-FSdek three-layer structure, the quantitative representation of domain expert knowledge, and the strategies to combine them are covered in the following subsections.

Design of the Trigger Conditions for the DML-FSdek Layers
The proposed DML-FSdek includes three processing layers: sparsity evaluation, correlation evaluation, and redundancy evaluation, which analyze the importance of features from different perspectives depending on the characteristics of the data. The trigger conditions for the DML-FSdek layers are designed as follows.
where X indicates the input features and x_i the i-th feature, and the three thresholds are the sparsity threshold, correlation threshold, and redundancy threshold, respectively. Layer1(X), Layer2(X), and Layer3(X) will be defined in the following Equations (2), (3), and (9), respectively. In the first layer (Layer1(X)), in order to address the problem of sparsity in discrete and continuous variables, the numerical statistics (NS) method and the variance score (VS) are adopted for preprocessing. For a continuous variable, a variance close to zero indicates that the variable fluctuates in a small range; its correlation with the target attribute then cannot be precisely assessed and the variable should be ignored. Similarly, if a single value of a discrete variable accounts for more than 95% of the samples, the discrete variable is sparse and can be disregarded. Thus, in this layer, the numerical type of each feature is first determined and then the corresponding evaluation criterion is used to calculate the sparsity value. If the sparsity value of a feature meets the sparsity threshold, the feature is discarded; otherwise, it is retained. The VS of a feature is its sample variance, VS(x_i) = (1/n) Σ_{j=1}^{n} (x_{ij} − x̄_i)², where n represents the total number of samples, x_i denotes the i-th feature vector, and x̄_i represents its mean value; the NS of a discrete feature is the fraction of samples taking its most frequent value.
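The first-layer test can be sketched as follows. This is an illustrative implementation, not the paper's code: the rule for deciding whether a column is discrete (here, at most 20 distinct values) and the threshold defaults are assumptions.

```python
import numpy as np

def is_sparse(col, var_eps=0.01, mode_frac=0.95, discrete_max_levels=20):
    """Layer-1 sparsity test: NS (dominant-value fraction) for discrete
    columns, VS (near-zero variance) for continuous ones."""
    vals, counts = np.unique(col, return_counts=True)
    if len(vals) <= discrete_max_levels:            # treated as discrete
        return counts.max() / len(col) > mode_frac  # NS exceeds 95%
    return np.var(col) < var_eps                    # VS close to zero
```

A feature for which `is_sparse` returns True would be removed before the correlation layer runs.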
After the sparse features are removed, the second layer (Layer2(X)) is used to eliminate irrelevant features that are only weakly correlated with the target attribute. Mutual information (MI) and the Pearson correlation coefficient (PCC) are chosen to measure the correlation between each feature and the target attribute. From Equation (3), it follows that the correlation evaluation method is selected according to the trigger conditions: if the number n of samples is less than or equal to k1 and the target attribute y is discrete, MI is used to calculate the correlation value; otherwise, the PCC is used. Finally, if the correlation between a feature and the target attribute is lower than the correlation threshold, the feature is discarded; otherwise, it is retained.
where x_i is the i-th feature, y is the target attribute, n is the number of training samples, k_1 is the threshold on the data size, Cov(·) is the covariance, and Var(·) is the variance.
In Equation (3), MI evaluates how much information each feature provides about the target. Equations (5)-(7) give the steps for calculating MI. Equation (5) calculates the uncertainty (information content) of the classes Y, H(Y) = −Σ_y p(y) log p(y). Equation (6) gives the uncertainty remaining in the output Y after observing a variable X, H(Y|X) = −Σ_x p(x) Σ_y p(y|x) log p(y|x). The decrease in uncertainty, Equation (7), MI(X;Y) = H(Y) − H(Y|X), is the mutual information between Y and X: if X and Y are independent, MI is zero; otherwise they are interdependent. Similarly, the PCC measures the linear correlation between two variables, PCC(x, y) = Cov(x, y)/√(Var(x)Var(y)), as defined in Equation (4).
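The two correlation measures of this layer can be computed directly from their definitions. The sketch below is a self-contained illustration for discrete variables (MI via the entropy decomposition H(Y) − H(Y|X)) and for the PCC; it is not the paper's implementation, which may use library estimators instead.

```python
import numpy as np

def entropy(y):
    """H(Y) = -sum p(y) log2 p(y) over the observed classes."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_info(x, y):
    """MI(X;Y) = H(Y) - H(Y|X) for discrete x and y."""
    h_cond = 0.0
    for v in np.unique(x):
        mask = (x == v)
        h_cond += mask.mean() * entropy(y[mask])
    return entropy(y) - h_cond

def pcc(x, y):
    """Pearson correlation: Cov(x, y) / sqrt(Var(x) Var(y))."""
    return np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))
```

For a feature identical to a balanced binary target, MI equals H(Y) = 1 bit; for an independent feature, MI vanishes, matching the statement after Equation (7).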
The irrelevant features are eliminated in Layer 2, so the remaining features correlate strongly with the target attributes. However, within the obtained subset, some features may still be correlated with other features. Hence, in the third layer (Layer3(X)), the distance correlation coefficient (DCC, calculated by Equation (8)) and the PCC are used to evaluate the redundancy among features. As shown in Equation (9), if the number n of samples is less than or equal to k1, or the number d of features is less than or equal to k2, the DCC is used to calculate the correlation coefficient (redundancy) among features; otherwise, the PCC is calculated. If the redundancy value is greater than the redundancy filtering threshold, one of the two features is removed; otherwise, both features are retained.
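A minimal sketch of the third-layer pass, using only the PCC branch of Equation (9): for each pair of columns whose absolute correlation exceeds the redundancy threshold (0.88 is the paper's initial value), the later column is dropped. The greedy drop-the-second-of-the-pair rule is an assumption; the paper does not specify which of the two redundant features is removed.

```python
import numpy as np

def drop_redundant(X, names, thr=0.88):
    """Keep a column only if its |PCC| with every already-kept column
    stays at or below the redundancy threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= thr for k in keep):
            keep.append(j)
    return [names[j] for j in keep]
```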

Feature Subset Evaluation Based on Machine Learning Model
The merits of the feature subsets obtained in each layer are evaluated by testing them in an ML model, and only the best set is passed on to the next layer. Figure 5 presents the procedure of feature subset evaluation. First, a candidate feature subset is generated; next, the subset evaluation step is performed, which estimates the quality of the current feature set. In this step, the learning model is chosen by the user according to the specific learning problem, for example support vector machines (SVMs), neural networks (NNs), or decision trees. Evaluation criteria are likewise chosen to fit the learning problem, such as root mean square error (RMSE) or mean absolute percentage error (MAPE). These steps are performed iteratively until a stopping criterion is met, which happens either when the results begin to deteriorate or when the number of features reaches a predetermined threshold.
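The evaluate-and-adapt loop described above can be sketched as follows. This is an illustrative reading, not the paper's code: here the per-layer relevance scores are assumed to be given, the threshold is raised by a fixed step (the step size is an assumption), and the loop stops as soon as the five-fold cross-validated RMSE of an SVR model stops improving.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def evaluate_subset(X, y, cols):
    """Score a candidate feature subset by five-fold CV RMSE of an SVR."""
    scores = cross_val_score(SVR(), X[:, cols], y,
                             scoring="neg_root_mean_squared_error", cv=5)
    return -scores.mean()

def adaptive_select(X, y, rank_scores, thr=0.4, step=0.05, max_iter=10):
    """Raise the filtering threshold while the CV RMSE keeps improving."""
    best_cols = [j for j, s in enumerate(rank_scores) if s >= thr]
    best_rmse = evaluate_subset(X, y, best_cols)
    for _ in range(max_iter):
        thr += step
        cols = [j for j, s in enumerate(rank_scores) if s >= thr]
        if not cols:
            break
        rmse = evaluate_subset(X, y, cols)
        if rmse >= best_rmse:   # results begin to deteriorate: stop
            break
        best_cols, best_rmse = cols, rmse
    return best_cols, best_rmse
```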

The Weighted Scoring for Domain Expert Knowledge
When DML-FSdek is applied for machine learning in practice, the importance of the features depends, among other things, on the domain knowledge of the user. We quantify this aspect by a score s that combines the importance score (weight) of the feature given by the user and the rating weight of the user. The importance score of a feature given by the current user, su, or by a historical user, sp, takes one of three values: su, sp = 0 indicates that the user considers the feature unimportant, su, sp = 0.5 indicates that the user is uncertain about the importance of the feature, and su, sp = 1 indicates that the user considers the feature crucial. The rating weight d of the user takes the value d = 2 if the user is a material expert, d = 1.5 if the user is a computer expert, and d = 1 if the user is neither a material expert nor a computer expert. The rating weight is determined by the user's expertise in the problem domain: the higher the weight, the more significant the suggestion of that user. For example, for a problem in the materials domain, material experts have the most abundant domain knowledge and experience and their judgments are the most authoritative, so the weight d is given the highest value of 2. A computer expert may also have some background in materials science, given the rapid development of machine learning in materials discovery and materials property prediction, so the weight d is set to 1.5. Finally, for users who are neither material experts nor computer experts, the judgment may not be well founded, and the weight d is set to 1.
When analyzing data using our method, any user can in principle participate. However, good prediction results require a substantial amount of high-level domain expertise: the greater the amount of domain expert knowledge and the higher the professional level, the better the expected results.

Collaborative Feature Selection between DML-FS and Domain Expert Knowledge
When we perform the feature selection procedure utilizing the proposed DML-FS layers, the importance score sa of each feature is binary: sa = 0 means removing the feature and sa = 1 means retaining it. That is, when a DML-FS layer considers that a feature has little or no effect on the result, the feature is scored 0; otherwise, it is scored 1.
On the other hand, based on the representation of domain expert knowledge proposed above, the expert experience score s of each feature can be described quantitatively as in Equation (13), where m is the total number of people who have scored the feature, n is the number of those m people who gave the feature the same score as the current user, and Σ(sp · d)/Σd is the historical users' score, that is, the weighted average of the scores of all historical users. The current user's score su is given the weight √(1 − c2(m − n)²/(c1 m²)) and the historical users' score is given the complementary weight, where c1 and c2 are constants with c1 ≥ c2. The specific values of c1 and c2 can be determined from experimental data and experts' experience to balance the credibility of the current user's score against the historical users' scores. Here, c1 and c2 are set to 9 and 8, respectively, because in the expert experience score s the weight of the current user's score (less than 1/2) should be less than the weight of the historical users' scores (more than 1/2): the minority is subordinate to the majority. From Equation (13) it follows that s ranges between 0 and 1.
Using the weight √(1 − c2(m − n)²/(c1 m²)), we can balance the credibility of the current user's score against the historical users' scores. When n is unchanged and m increases, the weight of the current user's score becomes smaller, but the degree of reduction decreases; this preserves the influence of the current user. Conversely, when m does not change and n becomes smaller, the weight of the current user's score decreases and the degree of reduction increases; this reflects the trustworthiness of the historical users.
Finally, we embed the domain expert knowledge into the three DML-FS layers. We define the comprehensive importance score of a feature (FCIS) as a combination of sa, the feature importance score obtained from the DML-FS layer, and s, the experience score rated by domain experts. If the FCIS of a feature is greater than 0.5, the feature is retained; otherwise it is removed. The FCIS embodies the following six core ideas: 1) the current user's experience, the historical users' experience, and the result of the DML-FS layer jointly determine whether to remove a feature; 2) the expert experience score accounts for the expert's problem domain and establishes weights that distinguish between fields, which improves the credibility of the scores; 3) the weights of the current user's score and the historical users' score sum to 1, integrating the experience of the current user with that of the historical users; 4) if the DML-FS layers conclude that a feature should be retained, the feature is retained; 5) if both the current user and the historical users consider a feature very important, the feature is not removed; 6) even if the current user believes a feature is irrelevant, the feature may still be retained based on the experience of the historical users.
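The closed forms of Equation (13) and of the FCIS did not survive extraction, so the following sketch is a reconstruction from the surrounding text, and every formula in it is an assumption: the current-user weight is taken verbatim as √(1 − c2(m − n)²/(c1 m²)), the historical score gets the complementary weight (core idea 3), and the FCIS is modeled as a plain average of sa and s with the stated retain-if-greater-than-0.5 rule.

```python
import numpy as np

def expert_score(su, sp_list, d_list, n_agree, c1=9, c2=8):
    """Reconstructed Equation (13): weight w for the current user's score su,
    complementary weight 1 - w for the d-weighted average of historical
    scores sp. m = number of historical scorers, n_agree = how many of them
    gave the same score as the current user."""
    m = len(sp_list)
    w = np.sqrt(max(0.0, 1.0 - c2 * (m - n_agree) ** 2 / (c1 * m ** 2)))
    hist = float(np.dot(sp_list, d_list) / np.sum(d_list))
    return w * su + (1.0 - w) * hist

def fcis(sa, s):
    """Assumed comprehensive importance score: equal-weight average of the
    DML-FS layer decision sa (0/1) and the expert score s; retain if > 0.5."""
    return 0.5 * (sa + s)
```

With c1 = 9 and c2 = 8, full disagreement (n_agree = 0) gives the current user a weight of exactly 1/3, consistent with the "less than 1/2" remark in the text.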

Experimental Section
In this section, we evaluate the proposed method on ten material properties datasets. We first introduce the datasets and the parameter setups of the experiments, and then provide the analysis and discussion of the experimental results.

Experimental Datasets
In order to validate the performance of the proposed method, we collected ten groups of material properties datasets from published references or online resources. Brief information on the datasets, covering both macroscopic and microscopic properties of materials, is given in Table 2. Since all ten datasets have been publicly released, the authenticity and reliability of the data are guaranteed. Moreover, the ten datasets cover various sample sizes and dimensionalities, which sufficiently verifies the adaptability of the method in different data scenarios. Datasets 1, 2, 7, 8, and 9 have a small amount of data and low dimensionality. The dimensionality of dataset 3 is also relatively low, but the amount of data is relatively large. The original feature

Experimental Setups
DML-FSdek is used to perform feature selection experiments on the collected datasets, in which the subset of features obtained from each layer serves as the input to the next layer, and support vector regression (SVR) is employed to evaluate the merit of each feature set. The details of the process are as follows. In the first layer, the sparsity filtering threshold is initially set to 0.01 and is automatically adjusted until the prediction accuracy of the model no longer improves. In the second layer, the correlation filtering threshold is initially set to 0.4 and is automatically adjusted until the prediction accuracy of the model no longer improves. Finally, in the third layer, the redundancy filtering threshold is initially set to 0.88 and is automatically adjusted until the prediction accuracy of the model no longer improves. To evaluate the generalization performance of the constructed machine learning model on unseen data and reduce the risk of overfitting, fivefold cross-validation is used to assess the comprehensive predictive power of the learning model. In addition, to achieve better prediction performance, search strategies such as random search and grid search can be used to find the optimal model parameters in the hyperparameter space; herein, grid search is used to optimize the hyperparameters of the model. All the algorithms covered in this paper are implemented in Python and call the scikit-learn toolkit. [36]
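The SVR-plus-grid-search evaluation setup can be expressed compactly in scikit-learn. The sketch below follows the described setup (Max-Min normalization, SVR, fivefold cross-validation, grid search scored by RMSE); the particular parameter grid is illustrative, since the paper does not list the grid it searched.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Max-Min normalization followed by SVR, tuned by grid search under
# fivefold cross-validation; the candidate values below are assumptions.
pipe = make_pipeline(MinMaxScaler(), SVR())
param_grid = {"svr__C": [0.1, 1, 10], "svr__gamma": ["scale", 0.1]}
search = GridSearchCV(pipe, param_grid,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0),
                      scoring="neg_root_mean_squared_error")
```

After `search.fit(X, y)`, `-search.best_score_` gives the cross-validated RMSE of the best hyperparameter combination, which is the quantity reported per feature subset in the experiments.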

Experimental Results and Analysis
Referring to the scoring rules in Section 3.2, we invited seven material experts from different fields to score the features contained in the ten datasets, and designed an expert scoring table of features, shown in Table 3, for each dataset (uploaded to https://github.com/wujunming1/material-attribute-datasets). The expert experience score of each feature in the ten material properties datasets is calculated through Equation (13) from the score records of the seven experts, as shown in Figure 6. In order to speed up the convergence of our model and eliminate the influence of dimensionality on the prediction accuracy, Max-Min normalization is first applied to all datasets. For comparison, we also conducted the material properties prediction without employing any feature selection method. For all ten datasets, support vector regression (SVR) is used as the predictor, and RMSE and MAPE are used to evaluate the prediction accuracy of the model. The results are listed in Table 4. The prediction accuracies of the models on datasets 2, 3, and 4 are relatively high, which indicates that the original features reflect the distribution of the data well. Datasets 1, 2, 3, 7, 8, 9, and 10 have fewer original features and thus a limited feature selection space. In addition, when the original feature set is used, the predictive ability of the model on datasets 1, 5, 6, and 8 is relatively poor, and their RMSEs are even two orders of magnitude higher than that for dataset 3. Therefore, there is still room for improvement in the prediction accuracy of the model on these four datasets. Finally, the results of the sparsity, correlation, and redundancy evaluations on the ten datasets are described and analyzed in the following subsections.

Results and Analysis of Sparsity Evaluation
Sparsity evaluation is conducted on the original data, and the results are listed in Table 5. It can be observed that the number of features in datasets 1, 2, 3, 8, 9, and 10 does not change, indicating that there are no sparse features in these six datasets. The number of features in dataset 4 is reduced significantly from the original 47 to 34: the thirteen features Y25, Y26, Y22, Y21, Y20, Y19, Y16, Y13, Y12, Y11, Y9, Y8, and Y6 fall below the sparsity filtering threshold and are eliminated. The number of features in datasets 5 and 6 is reduced from the original 18 (27) to 17 (25), respectively. Only the feature Y5 in dataset 5 is screened out, because its calculated variance is close to zero, and only the features Y (mass fraction of the Y element) and a_2time (the second-stage aging treatment time) in dataset 6 are removed, because their sparsity evaluation values do not meet the threshold requirement. On dataset 7, the number of features decreases from five to three, with the features X1 and X2 eliminated. The feature subset evaluation on the selected features shows that the prediction accuracies on datasets 4, 5, 6, and 7 are slightly improved, which indicates that the sparse features have little influence on the prediction accuracy of the model. Furthermore, when the collaborative selection based on the sparsity evaluation layer and domain expert knowledge is performed on datasets 4, 5, and 7, the number of selected features differs from that of the DML-FS layer, which indicates that some of the features that domain experts consider crucial for system modeling are retained.
For instance, for dataset 4, the number of features selected through the pure DML-FS layer is 34, while DML-FS with domain expert knowledge (DML-FSdek) picks out 38 features, retaining four key features, Y25, Y22, Y19, and Y16, that domain experts consider important (high expert experience scores, see Figure 6). Moreover, the RMSEs for the two selected feature sets (0.0611 and 0.0581) remain at the same level. As for dataset 5, the removed feature Y5 is likewise retained due to the integration of domain expert knowledge.

Results and Analysis of Correlation Evaluation
The results of the correlation evaluation of the features remaining from the first layer are listed in Table 6. No features are removed from datasets 2, 6, 8, and 10, indicating that these four datasets contain no irrelevant information. In the other six datasets, some features are discarded because of their weak correlation with the target attribute. For dataset 1, the features X3 and X6 are discarded; for dataset 9, the feature Specimen Thickness is removed; and for datasets 3 and 7, the feature X5 is eliminated in each case. In addition, the number of features decreases from 34 to 31 for dataset 4 and from 17 to 15 for dataset 5. Specifically, the three features Y18, Y10, and Y4 in dataset 4 are deleted because their correlations with the target property (density of organic materials) all fall below the correlation filtering threshold, and the two features G and Y23 are removed from dataset 5. Finally, the selected features are evaluated via the subset evaluation procedure in this layer. Compared with the previous feature subset, datasets 1, 3, 4, and 5 show improved prediction accuracy, reflected mainly in lower RMSE values; in particular, the prediction accuracy improves significantly for dataset 1, whose RMSE decreases by 19.2%. These results indicate that uncorrelated features degrade the prediction performance of the model. Furthermore, when the collaborative selection based on the correlation evaluation layer and domain expert knowledge is performed, the key feature X3 is preserved for dataset 1, and the key feature G is likewise retained for dataset 5 as a result of the domain expert knowledge integration.

Figure 6. Expert experience scores for each feature in the ten material properties datasets. The red dotted line (0.5) marks the dividing line for the experts' experience score: when the score for a feature exceeds 0.5, domain experts consider the feature so important that it cannot be removed from the model.
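The second layer's filtering rule can be sketched in the same spirit. The 0.3 default matches the updated correlation filtering threshold quoted later for dataset 6; the use of the Pearson coefficient and the score >= 0.5 override are simplifying assumptions standing in for the paper's collaborative strategy.

```python
import numpy as np

def correlation_layer(X, y, feature_names, expert_scores, corr_threshold=0.3):
    """Drop features that are weakly correlated with the target attribute.

    A feature is kept when the absolute Pearson correlation between it
    and the target reaches ``corr_threshold``, or when its expert
    experience score is at least 0.5 (simplified expert override).
    """
    keep = []
    for j, name in enumerate(feature_names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= corr_threshold or expert_scores.get(name, 0.0) >= 0.5:
            keep.append(name)
    return keep
```

Under this rule, a feature such as G in dataset 5, despite a weak target correlation, survives the layer because of its high expert experience score.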

Results and Analysis of Redundancy Evaluation
Redundancy assessment is conducted on the feature subset output by the previous layer to eliminate redundant features, and the results are listed in Table 7. The number of features for datasets 1, 2, 3, and 7 does not change, indicating that the features in these four datasets are mutually independent or only very weakly correlated. For datasets 4, 5, 6, 8, 9, and 10, however, the number of features decreases. Specifically, the feature count for dataset 4 is reduced from 31 to 20: the eleven features A, X3, X4, X5, X10, X15, X16, X17, X22, Y1, and Y24 are removed because of their correlation with other features. The feature subset of dataset 5 shrinks from 15 to 11 features, with A, X11, X13, and Y17 removed. For dataset 6, only the feature B (mass fraction of the B element) is discarded; for dataset 8, the feature X3 is removed; for dataset 9, the four features Ni, Al, W, and Ti are removed; and the features x, <r>, and K in dataset 10 are deleted because of their high correlation with other features. Finally, as shown in Table 7, the RMSEs of the prediction models for these six datasets decrease slightly, indicating that redundant features have little impact on prediction performance. (Table 4 lists the prediction accuracies of the models using the original feature set.) Furthermore, when the collaborative selection based on the redundancy evaluation layer and domain expert knowledge is performed, the corresponding RMSEs (… and 0.0545) remain at the same level. For dataset 5, the two features X11 and X13 are retained owing to their high expert experience scores (see Figure 6, index = 10 and index = 12), and the prediction accuracy is maintained at the same level (0.1789 and 0.1786). The features Ni and Al strongly influence the lattice misfit of Ni-based single-crystal superalloys, and domain experts also assign these two features high expert experience scores (see Figure 6, index = 0 and index = 1).
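The third layer can be sketched as a greedy pairwise filter. The default of 0.88 is the updated redundancy filtering threshold quoted for dataset 6; the tie-breaking rule (drop the member of a highly correlated pair with the weaker target correlation, unless experts protect it) is a plausible reading of the procedure rather than a verbatim reimplementation.

```python
import numpy as np

def redundancy_layer(X, y, feature_names, expert_scores, red_threshold=0.88):
    """Greedy pairwise redundancy filter with an expert override.

    For each pair of features whose absolute correlation reaches
    ``red_threshold``, the feature with the weaker target correlation is
    dropped unless its expert experience score is at least 0.5.
    """
    n = X.shape[1]
    target_corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n)]
    removed = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i in removed or j in removed:
                continue
            r = abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
            if r >= red_threshold:
                weaker = i if target_corr[i] < target_corr[j] else j
                if expert_scores.get(feature_names[weaker], 0.0) < 0.5:
                    removed.add(weaker)
    return [f for k, f in enumerate(feature_names) if k not in removed]
```

This is how, in the dataset 6 example discussed below, the pair C and B is resolved: only one member of the highly correlated pair is kept.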
Thus, our approach ultimately retains these two important features, consistent with accepted physical and chemical domain knowledge. To verify the superiority of our model in prediction performance and material interpretability, we compared our method with two existing sparsity methods (Lasso and Elastic net) on the ten collected material properties datasets; the experimental results are shown in Table 8. On all ten datasets, our method achieves a lower RMSE than Lasso and Elastic net, that is, better prediction performance. The number of features selected by our method exceeds that of the sparsity methods because the introduction of domain expert knowledge retains the features that domain experts consider important, in line with the accepted physical-chemical domain knowledge of materials. In particular, for dataset 2, Lasso and Elastic net select only one feature, while DML-FS dek, owing to the incorporation of domain expertise, selects six; this not only improves the material interpretability but also improves the predictive accuracy (the RMSE decreases by about 40%). Similarly, for dataset 10, Lasso and Elastic net select four features, whereas our method, through the assessment of expert experience scores, selects five, and the RMSE decreases by approximately 52%. Except for datasets 3 and 4, our method selects more features than the two sparsity methods in the remaining eight datasets (the features that domain experts consider important are retained), and the predictive performance of the model is better. In general, compared with the two sparsity methods, the proposed method improves prediction performance while ensuring that the selected features coincide with domain expert knowledge (i.e., the materials physics and chemistry information remains interpretable).
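The two sparsity baselines count a feature as "selected" when its coefficient is non-zero. The outline of such a comparison can be reproduced with scikit-learn; the synthetic dataset, the regularization strengths, and the train/test split below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def count_selected_and_rmse(model, X_train, X_test, y_train, y_test):
    """Fit a sparse linear model, count non-zero coefficients, return RMSE."""
    model.fit(X_train, y_train)
    n_selected = int(np.sum(model.coef_ != 0))
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    return n_selected, rmse

# Synthetic stand-in for one of the material properties datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 2.0 + X[:, 1] * 0.5 + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("Lasso", Lasso(alpha=0.1)),
                    ("Elastic net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    n, rmse = count_selected_and_rmse(model, X_tr, X_te, y_tr, y_te)
    print(f"{name}: {n} features selected, RMSE = {rmse:.4f}")
```

Unlike DML-FS dek, neither baseline offers a mechanism to force the retention of a feature that experts deem important but that regularization drives to zero, which is the behavior observed on datasets 2 and 10.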
Next, taking dataset 6 as an example, sparsity, correlation, and redundancy evaluations were conducted for all of its features. The sparsity of the features is shown in Figure 10. The variances of the two features Y (mass fraction of the Y element) and a_2time (the second-stage aging treatment time) are 0 and 0.0174, respectively, both below the updated filtering threshold of 0.02. Combining the experts' experience scores in Figure 6 with the collaboration strategy proposed in Section 3.3, the FCISs of these two features are below 0.5, so both were discarded in this step. Correlation evaluation was then conducted for the remaining features, with the result shown in Figure 11. The correlations between all retained features and the target attribute are strong, exceeding the updated correlation filtering threshold of 0.3; therefore, no features were removed in this layer. Finally, we conducted a redundancy analysis on the features remaining from the previous layer. Figure 12 presents the correlations between features, and the top five most highly correlated feature pairs are listed in Table 9. High correlation, or interdependence, is observed between the pairs C and B, B and a_2T, and Ni and Co, with correlation coefficients of 0.9353, 0.8692, and −0.8487, respectively. Further, the correlation coefficient between C and B exceeds the updated redundancy filtering threshold of 0.88, so feature C is retained while feature B is discarded (its FCIS is also below 0.5).
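The dataset 6 walk-through chains all three layers in sequence. A compact, self-contained sketch of that pipeline is given below; the thresholds are the updated values quoted for dataset 6, and the simple score >= 0.5 override is a simplified stand-in for the FCIS-based collaboration strategy of Section 3.3.

```python
import numpy as np

def dml_fs_dek(X, y, names, expert, var_t=0.02, corr_t=0.3, red_t=0.88):
    """Three-layer filter (sparsity -> correlation -> redundancy) with a
    simple expert override: a feature with an expert experience score of
    at least 0.5 is never removed (stand-in for the paper's FCIS rule)."""
    # Layer 1: drop sparse (low-variance) features.
    keep = [j for j in range(X.shape[1])
            if X[:, j].var() >= var_t or expert.get(names[j], 0) >= 0.5]
    # Layer 2: drop features weakly correlated with the target.
    keep = [j for j in keep
            if abs(np.corrcoef(X[:, j], y)[0, 1]) >= corr_t
            or expert.get(names[j], 0) >= 0.5]
    # Layer 3: for each highly correlated pair, drop the member with the
    # weaker target correlation unless experts protect it.
    tc = {j: abs(np.corrcoef(X[:, j], y)[0, 1]) for j in keep}
    removed = set()
    for a in keep:
        for b in keep:
            if a < b and a not in removed and b not in removed:
                if abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) >= red_t:
                    w = a if tc[a] < tc[b] else b
                    if expert.get(names[w], 0) < 0.5:
                        removed.add(w)
    return [names[j] for j in keep if j not in removed]
```

Run on a toy dataset containing one constant column, one target-uncorrelated column, and one duplicated column, the pipeline removes one feature per layer, mirroring the three stages of the dataset 6 example.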
Finally, we compiled statistics on the prediction accuracy of the above ten test models, as shown in Figure 7b. After passing through the three DML-FS layers, the prediction error clearly decreases, particularly for datasets 1 and 5. Moreover, Figure 7a shows the size of the feature subset output by each DML-FS layer: 17 features were filtered out for dataset 4, and analysis of the removed features revealed a large amount of sparse, irrelevant, and redundant information in that dataset. In contrast, the initial feature sets of datasets 1, 2, 3, and 8 had already been preprocessed by domain experts and therefore contained little sparse, irrelevant, or redundant information. Similarly, Figure 8 presents the number of selected features and the prediction accuracy after domain expert knowledge is introduced into each DML-FS layer. After the integration of the experts' experience, the risk of important features being removed is mitigated (the number of features selected by each layer differs from that of the DML-FS layers without expert input), and the prediction performance is equal to or better than that of the pure DML-FS layers. The computational time consumed by each DML-FS layer, shown in Figure 9, depends mainly on the size of the dataset, and no single layer dominates the overall cost.

Conclusion
A novel data-driven multi-layer feature selection mechanism integrating domain expert knowledge is proposed and tested. The method eliminates sparse, irrelevant, and redundant information from the original feature set through three layers: sparsity evaluation, correlation evaluation, and redundancy evaluation. The whole process is automatic and does not require the user to have professional knowledge of feature selection. Moreover, we present a method to quantify domain expert knowledge and integrate it into the feature selection process. First, the domain expert knowledge about each feature is quantified as feature weights (importance scores), and the expert experience score of each feature is computed as a weighted average. Second, the feature subset is optimized through a collaborative strategy between the expert experience scores and the DML-FS layers, which reduces the risk of removing features that domain experts consider important. The proposed method was tested on ten groups of material properties datasets. The results show that the mechanism can effectively select an optimal and interpretable feature subset while keeping the prediction accuracy unchanged or even improving it.
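The weighted-average quantification of expert knowledge can be stated in one line of code. The choice of weights (e.g., by expert seniority or track record) is an assumption here; the paper specifies weighted scoring but the exact weighting scheme is not reproduced in this sketch.

```python
def expert_experience_score(scores, weights):
    """Weighted average of the importance scores that several experts
    assign to a single feature. ``scores`` and ``weights`` are parallel
    lists; the weighting scheme itself is an illustrative assumption."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total
```

A feature whose resulting score exceeds 0.5 (the dividing line in Figure 6) is then protected from removal by the collaborative strategy.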
With the extensive use of DML-FS dek in the field of materials science, additional expert knowledge, that is, the importance scores assigned to features by different experts, will be stored and recorded internally to further improve the predictive performance of the models. Overall, the DML-FS dek implementation presented herein is expected to enable feature and correlation analysis of large-scale material properties datasets, which has so far been unattainable. Notably, as the data volume grows and richer expert knowledge is accumulated, the model's future predictions may carry uncertainty, that is, the prediction results may change accordingly. Additionally, novices using our automatic modeling method may produce models that look good yet have serious flaws and little utility. Therefore, domain expert knowledge is urgently needed to further regulate the model so that it meets the accuracy and reliability requirements of materials modeling.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.