Artificial Intelligence Guided Thermoelectric Materials Design and Discovery

Materials discovery from the near-infinite repository of candidate compounds is a major bottleneck for revolutionary technological progress: the conventional process is labor-intensive and time-consuming, which hinders the discovery of new materials. Although machine learning techniques show an excellent capability for speeding up materials discovery, obtaining effective material feature representations and making precise predictions of material properties remain challenging. This work develops an automatic material design and discovery framework enabled by data-driven artificial intelligence (AI) models. Multiple types of material descriptors are first developed to better represent and encode each material's uniqueness, resulting in improved performance across different material property prediction tasks. The prediction of thermoelectric (TE) properties is then used as a case study to demonstrate the investigation workflow. The proposed framework achieves more than 90% accuracy in predicting materials' TE properties. Furthermore, the developed AI models identify 6 promising p-type TE materials and 8 promising n-type TE materials. The predictions are evaluated by density functional theory calculations and agree with experimentally reported TE properties. The proposed framework is expected to accelerate the design and discovery of new functional materials.


Introduction
Material innovation has always played an essential role in science and technology revolutions. [1][2][3] In the discovery and design of next-generation materials, artificial intelligence (AI) technology appears to be one of the most promising approaches. [4][5][6][7] Presently, AI technologies focus on developing predictive functions through a data-driven approach: they utilize material information from experimental data or high-throughput calculations to assist material discovery, design, and optimization. Despite being more efficient than experimental and theoretical calculation approaches, the AI approach still faces considerable challenges.
Insufficient data is one of the most significant challenges in leveraging AI technology in materials science. [4,8,9] Material databases cover several categories of data, such as organic materials, metals, semiconducting materials, etc. [10] Unlike other AI-assisted disciplines, such as computer vision or natural language processing, the outputs of AI-assisted material discovery are not directly accessible and verifiable, which makes it increasingly difficult to obtain valuable datasets. Take the Materials Project platform, which records ≈150 000 inorganic materials, as an example: less than half of the entries cover thermoelectric (TE) properties, and only about two percent cover piezoelectric properties. [11] Among the different categories of material properties, the data representing superior performance typically comprise a minimal portion, which results in a scarcity of knowledge. Such insufficient and skewed datasets fundamentally limit the construction of powerful predictive AI models.
We propose an active learning-based automatic framework for materials discovery and design to address this issue, as shown in Figure 1. The AI-enabled material property prediction models are trained based on the materials database. To narrow the search range for high-performance materials, a sensitivity analysis approach was employed to perform model analysis and discover new materials. As a result, promising material candidates with unique chemical components and crystal structures may emerge via the active learning approach. The outcome with new knowledge can contribute to the existing databases as new data after further calculations or experimental validation. More adequate and informative data can yield more accurate predictive AI models. This proposed framework presents a viable technique for quickly identifying novel materials from infinite material repositories.

Figure 1. The proposed framework of this study. Learning from the existing database, AI technology can understand the material's properties based on the representative material's descriptor. Through sensitivity analysis, we can obtain information from AI models to identify and design high-performance materials. Candidates for high-performance materials will be identified and designed quickly via an active learning approach. This new material identification information will be re-entered into the database after the appropriate validation technique.
We utilize AI-assisted TE materials discovery to illustrate the proposed framework. The TE effect enables the generation of versatile electric energy from ubiquitous heat energy, effectively addressing the need for clean and renewable energy. [12][13][14] To maximize the TE effect, thermoelectric materials are desired to have high electrical conductivity but low thermal conductivity, [15][16][17] which is quite challenging since these two properties are usually positively correlated. AI-assisted materials discovery is an effective approach to finding high-performance thermoelectric materials. [18][19][20] The performance of TE materials is typically evaluated by a dimensionless figure of merit zT, defined as follows. [12]

zT = S²σT / k_T (1)

where S is the Seebeck coefficient, σ is the electrical conductivity, k_T is the total thermal conductivity, and T is the temperature. The thermal conductivity k_T is the sum of the electronic contribution (k_e, due to carrier transport) and the lattice contribution (k_l, due to phonon transport). The material's TE properties, such as the electrical conductivity, thermal conductivity, and Seebeck coefficient, can be accurately predicted using AI techniques. [21][22][23] However, previous studies have not focused on developing AI models that simultaneously predict all three TE properties, and only a limited number of studies focus on finding representative TE material descriptors and AI models for TE property prediction. To address these issues, we adopt a matrix encoding approach to represent the material's stoichiometry and structural information. In addition, to better represent the material's TE properties, we enrich the material descriptors by using the output of the dense layers of a trained network [24] as the material's high-level representation. Furthermore, we train a TE material screening model using different machine learning models.
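Equation (1) can be evaluated directly once the four quantities are in consistent SI units. The following minimal sketch illustrates the bookkeeping; the input values are assumed, order-of-magnitude numbers for a good TE material, not data from this work:

```python
def figure_of_merit(seebeck, sigma, k_total, temperature):
    """Dimensionless thermoelectric figure of merit zT = S^2 * sigma * T / k_T.

    seebeck     -- Seebeck coefficient S in V/K
    sigma       -- electrical conductivity in S/m
    k_total     -- total thermal conductivity k_T in W/(m*K)
    temperature -- absolute temperature T in K
    """
    return seebeck**2 * sigma * temperature / k_total

# Illustrative (assumed) values: S = 200 uV/K, sigma = 1e5 S/m,
# k_T = 1.5 W/(m K), evaluated at T = 600 K.
zt = figure_of_merit(200e-6, 1e5, 1.5, 600.0)  # -> 1.6
```

Because zT is dimensionless, any consistent unit system works; mixing, e.g., µV/K with S/m without conversion is the usual source of error.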
The TE materials screening model can identify promising TE materials from an AI-generated new material database. We use density functional theory (DFT) calculations to verify the AI model prediction results. The calculated TE properties of the newly discovered materials are well aligned with the AI-predicted results. To the best of our knowledge, this work is the first study to develop AI models that comprehensively analyze all of a material's TE features and accelerate the design and discovery of novel TE material crystal structures. Moreover, the proposed framework and technique can be implemented through an automatic upgrading process to discover a broader range of new materials. The novelty of our work is multi-fold:
• This work proposes an AI-guided TE material discovery and design pipeline that includes machine learning model training, AI-based material screening and design, and DFT validation.
• This work develops effective material descriptors with matrix encoding, high-level representations from deep networks, and radial distribution function analysis for materials property representation.
• To the best of our knowledge, this work presents the first systematic study of different machine learning models to predict and analyze eight different thermoelectric properties. At the same time, we innovatively utilize model predictions to discover new material components and structures.
• This work incorporates an active learning scheme, via variance decomposition-based sensitivity analysis, to identify materials that likely have good TE properties.

Results and Discussion
This section presents the results and analysis of using the proposed framework to discover novel high-performance TE materials. In Section 2.1, we present the material property prediction performance achieved with the developed materials encoding technique. In Section 2.2, we train machine learning models to predict material TE properties from the material characteristics, analyze the model performance for the different TE properties, and determine the TE property prediction model with the best performance. Section 2.3 presents the model analysis results obtained with Sobol sensitivity analysis. In Section 2.4, we employ the developed models and model analysis results to discover promising new high-performance TE materials. The prediction and materials discovery results are validated by theoretical calculation in Section 2.5.

Proposed Materials Descriptors for Materials Properties Representation
This work utilized the Materials Project (MP) and JARVIS-DFT (JARVIS) datasets to illustrate the proposed framework's material property prediction performance. We selected material properties commonly used in machine learning materials science work to validate the ability of our proposed method to represent and transform material information; [25][26][27][28] these properties are widely applied across different fields of machine learning in materials science. Using the proposed framework, users can retrieve the different material descriptors by Materials Project ID (mpid) for various machine learning tasks. Table S1, Supporting Information lists the material property regression performance on the testing dataset using the different types of descriptors. The overall prediction has a high correlation coefficient and low mean absolute error (MAE), except for the energy above the hull and the total magnetization. The poorer performance on these properties may be attributed to the limited data size and an unsatisfactory materials representation; further improvement will require algorithms and encoding methods designed specifically for these targets. Table S2, Supporting Information compares the proposed framework's performance with the benchmark framework. Most material properties are better understood by the proposed framework, and the predictions of properties such as the band gap, total energy, and formation energy reach the lowest MAE reported to the best of our knowledge. Figures S1-S10, Supporting Information show the prediction parity plots on the test set and the entire dataset. Figure S11 and Table S4, Supporting Information present the classification performance using the proposed framework. The designed structural descriptor significantly improves the crystal structure and oxide-type classification performance. In general, the proposed framework provides a powerful solution for encoding material properties for AI models. In addition, it provides flexibility and exploitability for various machine learning tasks related to materials discovery.

Machine Learning Model Performance for TE Properties Prediction
Five different machine learning models were trained to predict the TE properties. Tables 1 and 2 show the area under the receiver operating characteristic curve (AUC-ROC) on the MP and JARVIS datasets, respectively. The ROC curves can be found in Figures S12 and S13, Supporting Information. AUC-ROC indicates the prediction performance over all thresholds; the higher the score, the better the model can classify these properties. Based on Table 1, the TE property prediction model from the MP dataset can accurately predict the Seebeck coefficient and thermal electronic conductivity, with AUC-ROC scores higher than 0.95. Furthermore, the power factor can also be well predicted, with AUC-ROC scores around 0.9. Table 2 shows even better performance for TE property prediction from the JARVIS dataset: the model has AUC-ROC scores above 0.9 for all thermoelectric properties, and the scores for the Seebeck coefficient, electrical conductivity, and thermal electronic conductivity are close to 1. Differences between the datasets, in terms of computation setup and statistical distribution, affect the accuracy of the prediction models. The models trained on the MP and JARVIS datasets can both accurately predict the Seebeck coefficient; as reported by Choudhary et al., [29] the Seebeck coefficient calculations in the MP and JARVIS datasets are in good agreement. However, the models trained on the different datasets differ significantly when classifying the power factor and electrical conductivity. The power factor and electrical conductivity have extremely skewed distributions among the classes in both the MP and JARVIS datasets, which is harmful to model training. Both datasets were constructed using high-throughput DFT calculations; however, their relaxation time setups differ.
Moreover, the MP dataset adopted the GGA-PBE functional along with fixed k-points and cutoffs, [30] while the JARVIS dataset used the OptB88vdW functional and an automatic convergence procedure for k-points and cutoffs. More accurate transport property results, especially for vdW-bonded materials, are expected from the JARVIS dataset. Consequently, the results show differences in model performance on these two datasets. The machine learning algorithm also considerably influences the accuracy of TE property prediction. Based on the AUC-ROC score and the F1 score for the positive class (Tables S5 and S6, Supporting Information), the Random Forest (RF) algorithm yields the overall highest AUC-ROC for TE properties, followed by the Gradient Boosting (GB) algorithm. The Neural Network (NN) and K-nearest neighbors (KNN) algorithms provide similar performance, while the Decision Tree (DT) algorithm was inferior on the testing set due to over-fitting in the training stage. These observations remain consistent for the models trained on both the MP and JARVIS datasets. Therefore, we selected the TE property prediction model from the RF algorithm for further model analysis and new materials discovery. Table S7, Supporting Information compares our results with an existing TE property prediction framework. Using the same database and threshold, the Seebeck coefficient and power factor predictions exhibit significant improvement; in particular, for the power factor, the AUC-ROC scores of both the n-type and p-type classifiers increased from 0.74 to more than 0.9. In addition to representing the atomic structure, the proposed materials descriptor emphasizes the material's electronic structure information by extracting the high-level representation of a pre-trained neural network. Theoretical calculation of material properties normally requires information such as the atomic structure, potentials, and physical laws.
Cutting-edge machine learning models are not built on such laws of chemistry and physics. Thus, reusing the information in a pre-trained material prediction model can effectively improve the model's understanding of complex material TE properties. However, developing a TE property regression model remains a challenge: model performance is primarily limited by the data quantity and quality (in terms of data distribution). Nevertheless, the proposed method still provides effective measures for representing the material's transport properties, and the developed TE materials classifier can accelerate the design and discovery of novel high-performance materials.
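The evaluation protocol above (binary classifiers scored by AUC-ROC over all thresholds, plus the positive-class F1 score) can be sketched with scikit-learn. The synthetic data below is a stand-in for the real descriptor matrices and TE labels, so the numbers are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the descriptor matrix and binary TE labels
# (e.g., Seebeck coefficient above/below the screening threshold).
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# AUC-ROC is computed from the predicted probability of the positive class,
# so it summarizes performance over all decision thresholds.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
f1 = f1_score(y_te, clf.predict(X_te))
```

The same scoring calls apply unchanged to the other classifiers (GB, NN, KNN, DT) for the side-by-side comparison in the tables.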

Machine Learning Model Analysis
To investigate the contribution of each descriptor to the classifiers, the first-order Sobol indices of all descriptors were calculated. Figures 2 and 3 show the results of the Sobol sensitivity analysis on the best-performing RF classifiers. Sobol indices provide insight into how important the descriptors are to the classifier, so that the critical descriptors can receive greater attention during material design. The distribution of high-sensitivity descriptors shows that the designed descriptors contribute differently to the different TE properties. The chemistry descriptor presents the highest overall sensitivity. The structural descriptor also behaves differently when predicting the n-type and p-type TE properties, owing to the different transport carriers and mechanisms. The electronic structure descriptor from the pre-trained neural network contributes significantly to the classifiers. The sensitivity analysis results can substantially aid the discovery and design of new TE materials; according to the results, different descriptors must be adopted when developing machine learning models for different material properties.
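For intuition, the first-order index S_i = D_i/D can be estimated for any black-box model by Monte Carlo sampling. The sketch below is a minimal NumPy implementation of a Saltelli-style estimator applied to a toy additive model; it illustrates the idea only and is not the authors' implementation (inputs are assumed independent and rescaled to [0, 1], as in the Experimental Section):

```python
import numpy as np

def first_order_sobol(model, n_dims, n_samples=4096, rng=None):
    """Estimate first-order Sobol indices S_i = D_i / D for a black-box model
    whose inputs are independent and uniform on [0, 1]."""
    rng = np.random.default_rng(rng)
    A = rng.random((n_samples, n_dims))
    B = rng.random((n_samples, n_dims))
    fA, fB = model(A), model(B)
    total_var = np.var(np.concatenate([fA, fB]))
    indices = np.empty(n_dims)
    for i in range(n_dims):
        ABi = A.copy()
        ABi[:, i] = B[:, i]            # resample only the i-th input
        # Saltelli-style estimator: D_i ~ E[f(B) * (f(AB_i) - f(A))]
        indices[i] = np.mean(fB * (model(ABi) - fA)) / total_var
    return indices

# Toy model: x0 dominates, x2 is irrelevant, so S_0 > S_1 and S_2 ~ 0.
toy = lambda X: 4.0 * X[:, 0] + 1.0 * X[:, 1] + 0.0 * X[:, 2]
S = first_order_sobol(toy, 3, rng=0)
```

For the additive toy model the exact values are S_0 = 16/17 and S_1 = 1/17, so the ranking of inputs by importance is recovered even with modest sample sizes.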

TE Materials Discovery
We employed the developed TE materials screening engine to explore promising TE materials from the new materials repositories generated by the CubicGAN framework. [31] The TE properties of the materials in the new database were examined using the sensitivity analysis suggestions and the AI model prediction results (Table 3). The prediction consistency rate between the models trained on the MP dataset and the JARVIS dataset reports how frequently the two models agree when evaluating the thermoelectric properties of the new materials. Despite the differences in electrical conductivity, the two models gave highly consistent judgments on the thermoelectric properties of the new crystal structures, which indirectly demonstrates the accuracy of the models in determining the Seebeck coefficient, power factor, and thermal electronic conductivity. The mismatch in the electrical conductivity predictions results from the different relaxation time assumptions used for the TE property calculations. Another explanation is that the MP dataset may deliver insufficient electrical conductivity-related knowledge due to its compositional differences; models trained on different datasets may therefore make inconsistent judgments on the same materials. We consequently identified potential high-performance thermoelectric materials by combining the knowledge of the models from both the MP and JARVIS datasets. The validation of the TE materials discovery is presented in the following section.

Figure 2. The sensitivity analysis for predicting the p-type Seebeck coefficient, p-type power factor, p-type electrical conductivity, and p-type thermal electronic conductivity from the Sobol analysis. The yellow, green, and blue backgrounds stand for the chemistry descriptor, structural descriptor, and electronic structure descriptor, respectively.

Figure 3. The sensitivity analysis for predicting the n-type Seebeck coefficient, n-type power factor, n-type electrical conductivity, and n-type thermal electronic conductivity from the Sobol analysis. The yellow, green, and blue backgrounds stand for the chemistry descriptor, structural descriptor, and electronic structure descriptor, respectively.
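The prediction consistency rate reduces to simple label agreement between the two classifiers. A minimal sketch, using hypothetical screening labels rather than the actual model outputs:

```python
import numpy as np

def consistency_rate(labels_a, labels_b):
    """Fraction of candidate materials on which two classifiers agree."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    return float(np.mean(labels_a == labels_b))

# Hypothetical screening labels (1 = "high" TE property) for eight candidates,
# one array per model; these are illustrative values only.
mp_pred     = [1, 0, 1, 1, 0, 1, 0, 1]
jarvis_pred = [1, 0, 1, 0, 0, 1, 0, 1]
rate = consistency_rate(mp_pred, jarvis_pred)  # 7 of 8 agree -> 0.875
```

The same computation, applied per property, yields the per-property consistency rates of the kind reported in Table 3.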

Prediction Results Validation by DFT Calculation
We validated the 6 promising p-type TE materials and 8 promising n-type TE materials identified by the developed TE materials discovery framework. The criteria for high/low TE properties are exactly the same as in the TE property prediction model training, as shown in Table 1. Figure 4 compares the encoded materials for their TE property prediction. The similar configurations of the descriptors signify similar desired TE properties; the models can therefore learn from the encoded material descriptors and determine the high-performance TE materials. The formulas and information of the identified promising high-performance TE materials are listed in Tables S8 and S9, Supporting Information, respectively. We performed DFT calculations using the Quantum ESPRESSO (QE) package and the BoltzTraP package to validate the prediction results. [32,33] Figure 5 displays the DFT calculation results for the identified p-type TE materials. The validation results show that the power factor and the electrical conductivity of the materials identified by the TE materials screening engine all pass the expected criteria for high-performance TE materials. The predicted p-type Seebeck coefficient and p-type thermal conductivity of BaCaH 6 Rh differ slightly from the theoretical calculations, but in general, the predicted p-type TE properties agree with the DFT-validated results. The DFT validation results for the identified high-performance n-type TE materials are shown in Figure 6. Among the eight high-performance TE materials determined by the model, only one showed a discrepancy between the DFT-validated and model-predicted results. Since machine learning models are not inherently 100% accurate, the prediction results will conceivably have minor inconsistencies with the DFT-validated results.
Nevertheless, the proposed framework demonstrates the effectiveness of discovering new high-performance TE materials with minor errors. This process can also further enrich the information in the TE materials database, which will benefit the development of more accurate materials property prediction models. The proposed framework thus forms an automatic updating process to accelerate the design and discovery of new materials.

Conclusion
This work demonstrated a framework that automatically discovers functional materials using AI technology. The proposed materials encoding methods show improved material property representation performance when developing the material property models. We then utilized thermoelectric materials as an example to present the functionality of the proposed framework. The framework starts from the materials database: using the existing TE databases, we implemented 5 different machine learning models for TE property prediction. The RF model exhibited the best overall performance and was selected for further investigation; most TE properties can be predicted with an AUC-ROC score higher than 0.9. We utilized the sensitivity analysis results from the developed model to guide the design and discovery of novel high-performance TE materials. The DFT calculation results verified the TE properties of the identified novel materials, and the predicted TE properties are in good agreement with the calculated ones. We further performed machine learning model analysis using Sobol sensitivity analysis, which suggested the higher importance of the chemistry descriptors and the material graphical encoding to the machine learning models, benefiting the design of TE materials with desired properties.
In summary, the proposed framework is a promising approach for the large-scale discovery of new functional materials from an infinite materials repository.

Experimental Section
Descriptor Selection and Machine Learning Input Generation: The focus was on designing suitable material descriptors to help machine learning models better recognize the material's unique nature and TE properties. To maximize the universality of the proposed method, no additional experimental or computational effort should be required to create the material descriptors. [34] The descriptors developed for representing the material's TE properties consist of three parts: i) the chemical descriptors, ii) the structural descriptors, and iii) the electronic structure descriptors. Figure 7 demonstrates the material descriptor generation for machine learning model training. The chemical descriptors were designed to represent the material's stoichiometry information. Atomic features, such as the atomic number, electronegativity, atomic radius, etc., were adopted to represent the diverse chemical composition of each material. The statistical approach for controlling the input descriptor dimension comes from ref. [22]. A total of nineteen elementary properties were organized and processed to construct the final chemical descriptor. The structural descriptors utilized a matrix encoding approach to represent the material's atomic coordinate information. A radial distribution function (RDF) calculation was first performed using the material's atomic coordinates. The RDF results, which included the maximum value, maximum position, and skewness of the smeared RDF plot, were then arranged into a matrix using atomic numbers as coordinates. The matrices representing the materials' crystal structure were turned into structural descriptors using principal component analysis (PCA). [35] To enhance the TE property representation of the material descriptor, the trained machine learning model from ref. [24] was adopted.
The MEGNet model can successfully predict material properties, such as the total energy, density, bandgap, etc., from the material's atomic types and atomic coordinates. The material's TE properties are strongly correlated with the material's total energy, Fermi energy, and bandgap. An electronic structure descriptor was therefore developed by extracting high-level representations from the neural network trained to predict these TE-related properties. The electronic descriptor was generated from the second-to-last layer of the pre-trained graph neural network model from the MEGNet framework, which encodes the electronic structure information of the material. The chemical descriptors, structural descriptors, and electronic structure descriptors were then combined to form the input for training the material TE property prediction model.
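The idea of reusing the second-to-last layer of a pre-trained property-prediction network as a descriptor can be illustrated with a toy fully connected network. MEGNet itself is a graph network; the architecture and random weights below are purely illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pre-trained property-prediction network:
# encoded material -> hidden layer -> penultimate dense layer -> scalar property.
W1, b1 = rng.normal(size=(16, 32)), np.zeros(32)   # hidden layer
W2, b2 = rng.normal(size=(32, 8)),  np.zeros(8)    # penultimate dense layer
W3, b3 = rng.normal(size=(8, 1)),   np.zeros(1)    # output head (e.g., total energy)

relu = lambda z: np.maximum(z, 0.0)

def penultimate_features(x):
    """Run the network up to (and including) the second-to-last layer and
    return its activations as a learned high-level material representation."""
    h = relu(x @ W1 + b1)
    return relu(h @ W2 + b2)       # stop before the output head W3, b3

x = rng.normal(size=(1, 16))                        # one encoded material
electronic_descriptor = penultimate_features(x)     # shape (1, 8)
```

In the real pipeline the analogous step extracts the penultimate-layer activations of the trained MEGNet model and concatenates them with the chemical and structural descriptors.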
Training Data: Developing a TE property prediction model relied on a large volume of data, from which the learning process can extract reliable relations between the input and output pairs. The TE property prediction models were trained on the Materials Project (MP) database [11] and the JARVIS-DFT (JARVIS) database. [36] The TE properties in the datasets were calculated using first-principles DFT and post-processed using the Boltzmann transport equation (BTE); the details can be found in refs. [30,36]. Since the relaxation time setups for these two datasets differ, the TE property prediction models were trained separately. The 50th percentile served as the threshold for determining the classes when screening high-performance TE materials. To compare the results of the two models, the TE properties at a temperature of 600 K and a doping concentration of 10²⁰ cm⁻³ were chosen. The exact thresholds for the 8 TE properties (p-type and n-type Seebeck coefficient, p-type and n-type power factor, p-type and n-type electrical conductivity, and p-type and n-type thermal electronic conductivity) are shown in Tables 1 and 2.

Figure 6. The comparison between the DFT calculated results and the threshold of the TE properties classification for AI discovered n-type TE materials. The solid dots in the light area represent the results predicted by the model that match the DFT calculation results. Conversely, the hollow dots in the red area signify that the model predictions contradict the DFT calculation results.

TE Property Classification: The RF algorithm [37] was adopted with the extracted descriptors to train accurate TE property prediction models. The classification results were compared with those of a few other commonly used machine learning algorithms in terms of F1 scores and AUC-ROC. RF [37] was selected as the classifier to identify material TE properties qualitatively.
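The 50th-percentile screening threshold described above amounts to median binarization of each TE property. A minimal sketch, using hypothetical property values rather than the actual dataset:

```python
import numpy as np

def binarize_by_median(values):
    """Label each material 'high' (1) if its TE property exceeds the
    50th-percentile threshold of the training distribution, else 'low' (0)."""
    threshold = np.percentile(values, 50)
    return (values > threshold).astype(int), threshold

# Hypothetical power-factor values at 600 K (arbitrary units).
pf = np.array([0.2, 1.4, 0.9, 3.1, 0.5, 2.2])
labels, thr = binarize_by_median(pf)   # thr = 1.15, labels = [0, 1, 0, 1, 0, 1]
```

Applying this per property (and per carrier type) produces the eight binary targets whose exact thresholds appear in Tables 1 and 2.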
As an ensemble method, RF uses several different Decision Trees (DTs) to perform classification and then aggregates the results from the DTs via bagging. As the building blocks of RF, DTs are nonparametric supervised learning models that can perform both classification and regression tasks. In RF, each DT conducts classification by splitting the data based on certain features, such that within each split the data labels become more homogeneous. DT classifiers split the data based on the Gini impurity, defined as

Gini(D) = 1 − Σ_{i=1}^{k} p_i²

where D is the data on the current node in the tree and p_i is the probability of the samples in the dataset belonging to class i. The TE property classification task was formulated as a binary classification task, and thus k = 2. DTs split the dataset using the feature that provides the largest Gini gain. More specifically, when the dataset is split based on feature F into two subsets, D_1 and D_2, the corresponding Gini impurity after the split is

Gini_split(D) = (n_1/n) Gini(D_1) + (n_2/n) Gini(D_2)

where n_1, n_2, and n are the data sizes of D_1, D_2, and D, respectively. The Gini gain is then calculated as the difference between the Gini impurity before and after the split.

Figure 7. The generation of the materials input for developing the TE properties prediction model. The total input includes three parts: the chemistry descriptor, the structural descriptor, and the electronic structure descriptor. The chemistry descriptors focus on the material's stoichiometry information. The structural descriptors provide crystal structure information to the machine learning model. The electronic structure descriptor emphasizes the representation of the material's transport properties.
DTs finalize the split on the feature having the largest Gini gain and repeat this process until a stopping criterion is met (e.g., the maximum tree depth). Due to the limited information that the data can contain, a single DT tends to overfit and generalize poorly to unseen samples. RF addresses this issue by combining several DTs trained on different subsets of the data and features. As a result, RF models usually have lower model variance, so they generalize better to unseen data.
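The Gini impurity and Gini gain defined above can be computed directly. A short sketch on a toy binary node:

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def gini_gain(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]                       # balanced node: Gini = 0.5
gain = gini_gain(parent, [1, 1, 1], [0, 0, 0])    # perfectly pure split: gain = 0.5
```

A DT evaluates this gain for every candidate feature split and keeps the one with the largest value, which is exactly the greedy procedure described in the text.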
To optimize the model's performance, its hyperparameters were carefully tuned using a grid search method. Specifically, the number of trees in the forest, the criterion used for splitting nodes, and the minimum number of samples required to be at a leaf node were tuned. The optimal hyperparameters were selected based on their influence on the model's performance and their computational cost. The final optimization has been uploaded to the GitHub repo. To evaluate the performance of the model, a fivefold cross-validation method was employed.
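A grid search with five-fold cross-validation over the three hyperparameters named above can be sketched with scikit-learn as follows. The grid values here are hypothetical placeholders; the tuned configuration used in this work is in the authors' repository:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the descriptor matrix and binary TE labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical grid over the hyperparameters named in the text.
grid = {
    "n_estimators": [50, 100],          # number of trees in the forest
    "criterion": ["gini", "entropy"],   # node-splitting criterion
    "min_samples_leaf": [1, 3],         # minimum samples at a leaf node
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    grid,
    scoring="roc_auc",
    cv=5,                               # five-fold cross-validation, as in the text
).fit(X, y)
best = search.best_params_
```

`GridSearchCV` exhaustively refits the model for every grid combination and fold, so the grid size trades prediction performance against computational cost, as noted above.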
In this work, RFs were trained to classify the different TE properties. In addition, the model performance of RF was compared with that of other commonly used machine learning classifiers, namely DT, [38] GBDT, [39] KNN, [40] and NN. [41]

Sensitivity Analysis: To identify the usefulness and importance of each descriptor, a variance decomposition-based sensitivity analysis, Sobol sensitivity analysis (SSA), [42,43] was adopted. In SSA, the model of interest is regarded as a black-box model, from which only the inputs and corresponding outputs are utilized in the analysis. SSA provides first-order indices as well as total-order indices of the model input variables. The first-order indices account for the sensitivity contribution of the variables acting on their own, while the total-order indices indicate the sensitivity contribution of the variables including their interactions with each other. In this work, the first-order indices were calculated to represent the sensitivity contribution of each descriptor. The model is Y = f(X_1, …, X_m), where f is the trained data-driven model and X_1, …, X_m are the input descriptors.
More specifically, in a machine learning model, a set of mutually independent descriptors (features) x = (x_1, x_2, …, x_N) is used, each of which has a finite interval that can be rescaled to [0, 1]. The learned function can then be decomposed as

f(x) = f_0 + Σ_i f_i(x_i) + Σ_{i<j} f_ij(x_i, x_j) + … + f_{1,2,…,N}(x_1, x_2, …, x_N)

where f_0 is the mean value of f(x), and the lower-order terms are

f_i(x_i) = ∫ f(x) Π_{k≠i} dx_k − f_0
f_ij(x_i, x_j) = ∫ f(x) Π_{k≠i,j} dx_k − f_i(x_i) − f_j(x_j) − f_0

This decomposition of f(x) is called the analysis of variance (ANOVA) representation when every summand integrates to zero over each of its own variables, [42]

∫_0^1 f_{i_1…i_s}(x_{i_1}, …, x_{i_s}) dx_k = 0, for k = i_1, …, i_s

Because of this property, squaring both sides of the decomposition and integrating yields

D = Σ_i D_i + Σ_{i<j} D_ij + … + D_{1,2,…,N}

where D = ∫ f(x)² dx − f_0² is the model output variance, and D_{i_1…i_s} = ∫_0^1 … ∫_0^1 f_{i_1…i_s}²(x_{i_1}, …, x_{i_s}) dx_{i_1} … dx_{i_s} is the partial variance corresponding to the subset of parameters x_{i_1}, …, x_{i_s}. The Sobol sensitivity index of that subset of parameters is defined as

S_{i_1…i_s} = D_{i_1…i_s} / D

The integer s is called the order (or dimension) of the index. For instance, S_i = D_i/D is the first-order index of descriptor x_i.