Interpretable Machine‐Learning and Big Data Mining to Predict Gas Diffusivity in Metal‐Organic Frameworks

Abstract For gas separation and catalysis by metal-organic frameworks (MOFs), gas diffusion has a substantial impact on the overall process rate, so it is necessary to determine the molecular diffusion behavior within the MOFs. In this study, an interpretable machine learning (ML) model, the light gradient boosting machine (LGBM), is trained to predict the molecular diffusivity and selectivity of 9 gases (Kr, Xe, CH4, N2, H2S, O2, CO2, H2, and He). For these 9 gases, LGBM displays high accuracy (average R2 = 0.962) and superior extrapolation to the diffusivity of C2H6. Moreover, the model's calculations are five orders of magnitude faster than molecular dynamics (MD) simulations. Subsequently, using the trained LGBM model, an interactive desktop application is developed that can help researchers quickly and accurately calculate the diffusion of molecules in porous crystal materials. Finally, by combining the trained LGBM model with the Shapley additive explanation (SHAP), the authors find that the difference in molecular polarizability (ΔPol) is the key factor governing diffusion selectivity. Guided by the interpretable ML calculations, the optimal MOFs are selected for separating binary gas mixtures and for CO2 methanation. This work provides a new direction for exploring the structure-property relationships of MOFs and realizing the rapid calculation of molecular diffusivity.


Pearson correlation coefficient:
The Pearson product-moment correlation coefficient (PPMCC or PCC), which varies in value from -1 to 1, is used to calculate the correlation between two variables, X and Y. The Pearson product-moment correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:

$$\rho_{X,Y}=\frac{\mathrm{cov}(X,Y)}{\sigma_{X}\,\sigma_{Y}}$$
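A minimal Python sketch of this definition, using hypothetical sample arrays x and y, with SciPy's implementation as a cross-check:

```python
# Minimal sketch of the PCC definition; x and y are hypothetical sample data.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Covariance of X and Y divided by the product of their standard deviations.
pcc = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

print(pcc)                      # manual definition
print(stats.pearsonr(x, y)[0])  # SciPy's implementation, for comparison
```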
Section S3. Details of Model Training

Python 3.9.12 was used for all training tasks.
Overview of four machine learning algorithms
In this work, we utilize four different ML methods (RF, GBRT, XGBoost, and LightGBM), all of which are products of ensemble learning theory. Ensemble learning addresses the inherent flaws of a single model, or of a model with one particular set of parameters; its fundamental idea is to combine weak learners to establish a strong model. Most ensemble learning methods fall into the Bagging and Boosting families. The full name of Bagging is "bootstrap aggregation". Its working mechanism is shown in Figure S3: the basic idea is to train multiple classifiers with no strong dependence between them, and then average their results. The RF algorithm is an evolution of Bagging: in addition to the random sampling of Bagging, RF adds a random selection of features, so its basic idea does not depart from the Bagging category. The Boosting method, by contrast, is based on a serial strategy, in which each new learner is generated from the previous one. Its working mechanism is displayed in Figure S4: the same weight is first assigned to all training samples to build the first weak classifier, and the sample weights are then adjusted based on the results of the previous classification; points near the classification border receive greater weights, since they are more likely to be misclassified. GBRT, XGBoost, and LightGBM are all typical examples of Boosting. The main differences, advantages, and disadvantages of the four algorithms (RF, GBRT, XGBoost, and LightGBM) are described below. (The following summary is based on the regression models used in this work; classification variants are excluded.) A sketch contrasting Bagging and Boosting is given after the figure captions.

Figure S3. Bagging model.

Figure S4. Boosting model.
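To make the Bagging/Boosting contrast concrete, the following minimal scikit-learn sketch trains both kinds of tree ensembles; the synthetic data and hyperparameters are illustrative, not those used in this work:

```python
# Minimal sketch contrasting bagging (parallel, averaged trees) with
# boosting (sequential, reweighted trees); data and settings are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# Bagging: each tree is trained independently on a bootstrap sample.
bagging = BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                           n_estimators=50, random_state=0)

# Boosting: trees are trained serially, with samples reweighted toward
# the points the previous learners predicted poorly.
boosting = AdaBoostRegressor(DecisionTreeRegressor(max_depth=5),
                             n_estimators=50, random_state=0)

for model in (bagging, boosting):
    print(type(model).__name__, model.fit(X, y).score(X, y))
```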

Random Forest (RF) Algorithm
As an extension of the Bagging variants, built using decision trees as the base learners, random forest (RF) further integrates random feature selection into the decision-tree training process. It may be boiled down into four steps, as shown in Figure S5: (i) a new training sample set is made by repeatedly drawing k random samples, with replacement, from the initial training set of N samples; (ii) features are chosen at random (in this study, all the features are used in training); (iii) the predictions of a single decision tree are obtained from the extracted samples; (iv) the predictions from all decision trees are then averaged to provide the final prediction. [17] The random forest algorithm has the benefits of strong generalizability, the capacity to handle missing data, and the ability to be used without normalization. However, it can only generate predictions within the range of the training set, and it can overfit when modeling noisy data.
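A minimal scikit-learn sketch of the RF regressor described above; the synthetic data and hyperparameters are illustrative:

```python
# Minimal RF sketch; the synthetic data and hyperparameters are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,     # number of bootstrapped decision trees
    max_features="sqrt",  # random feature subset considered at each split
    oob_score=True,       # out-of-bag estimate of generalization performance
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)      # OOB R^2; final predictions average all trees
```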

Gradient Boosting Regression Tree Algorithm
Gradient boosting is the fundamental tenet of the gradient boosting regression tree (GBRT). Gradient boosting is a boosting strategy, and it differs from traditional boosting in that each iteration is performed to lower the residual of the previous one; to eliminate the residual, a new model is constructed in the direction of the gradient along which the residual decreases. The final prediction is derived by accumulating the learners fitted to the gradient of the loss function over all iterations. As illustrated in Figure S6, the "squared error" loss function used in this study produces a continuous error fit during the learning process: each regression tree is fit by learning from the conclusions and residuals of the earlier trees. The advantages of GBRT are evident: it can deal with all types of data flexibly, and the prediction accuracy is increased with relatively little parameter tuning, so that, compared with RF, the performance is further enhanced. However, because the base learners have serial dependencies under the boosting architecture, it is challenging to train the data in parallel.

Figure S6. GBRT model.
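A minimal scikit-learn sketch of a GBRT regressor with the squared-error loss discussed above; the synthetic data and hyperparameters are illustrative:

```python
# Minimal GBRT sketch with the "squared error" loss; data and settings are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

gbrt = GradientBoostingRegressor(
    loss="squared_error",  # each new tree fits the residuals of the current ensemble
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    random_state=0,
)
gbrt.fit(X, y)             # trees are built serially, one per boosting stage
print(gbrt.score(X, y))    # R^2 of the summed stage predictions
```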

Extreme Gradient Boosting Algorithm
Extreme Gradient Boosting, or XGBoost, was built by Tianqi Chen of the University of Washington to improve the GBDT-based boosting algorithm. Like GBDT, it approximates the residuals using the negative gradient of the loss on the data, but it uses a second-order Taylor expansion of the loss and adds a regularization term for model complexity. XGBoost performs better than GBDT because it also utilizes parallel CPU cores to run more quickly. As in GBDT, each iteration builds on the previous one (the objective of the t-th iteration contains the predictions of the first t-1 iterations). Before training, XGBoost pre-sorts the data and stores it in a block structure that is reused in later rounds; this block layout also makes parallelism feasible: to split a node, the gain of each feature must first be determined before selecting the feature with the highest gain, and the gains of the individual features can be computed by many threads. However, as shown in Figure S7, the level-wise method treats all leaf nodes in the current layer the same: even leaf nodes whose splits yield very little gain, and do not affect the outcome, are nonetheless split, raising the cost of computation. [18]

Figure S7. XGBoost model.
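A minimal sketch of an XGBoost regressor (assuming the xgboost package is installed); the synthetic data and hyperparameters are illustrative:

```python
# Minimal XGBoost sketch; data and settings are illustrative.
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

xgb = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,      # level-wise growth: all leaves of a layer are split
    reg_lambda=1.0,   # L2 regularization term on model complexity
    n_jobs=-1,        # per-feature split gains evaluated in parallel
    random_state=0,
)
xgb.fit(X, y)
print(xgb.score(X, y))
```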

LightGBM Algorithm
LightGBM is another algorithm utilizing the GBDT framework, built by Microsoft Research Asia. [19] It seeks to increase computational efficiency so as to address big-data prediction challenges more effectively. LightGBM uses the GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) methods for sample and feature reduction. EFB does not scan all features to find the best cut-off point; instead, it reduces the feature dimensionality by bundling mutually exclusive features, which lowers the cost of locating the best cut-off point. [19] GOSS does not use all sample points to estimate the gradients; instead, it retains the samples with large gradients and randomly subsamples those with small gradients. While accuracy is not compromised, and may even be improved in certain cases, processing the samples takes much less time.
LightGBM also utilizes a histogram algorithm to find split points efficiently. The fundamental principle of the histogram technique is to discretize the continuous feature values into k integers and build a histogram of width k, as shown in Figure S8(b). As the data are traversed, statistics are accumulated in the histogram using the discretized values as indices: the algorithm first traverses the data once to collect the required statistics and then traverses the discrete histogram bins to choose the best segmentation point. The histogram algorithm lowers the cost of both computation and storage: the pre-sorted algorithm that is the default in XGBoost requires O(#data) computations, but the histogram algorithm needs only O(#bins), a considerably smaller number than O(#data). Figure S8(a) shows that LightGBM uses the leaf-wise growth strategy, whereas XGBoost uses the level-wise split strategy: among all current leaves, leaf-wise finds the leaf with the largest split gain, splits it, and repeats. Therefore, with the same number of splits, leaf-wise can achieve greater accuracy and reduce the error more than level-wise. However, leaf-wise may result in overfitting when the sample size is small, so LightGBM offers the additional max_depth option to restrict the depth of the tree and prevent overfitting. [18]

Figure S8. LightGBM model.
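A minimal sketch of a LightGBM regressor exposing the histogram and leaf-wise controls discussed above (assuming the lightgbm package is installed); the synthetic data and hyperparameter values are illustrative:

```python
# Minimal LightGBM sketch; data and settings are illustrative.
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

lgbm = LGBMRegressor(
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,   # leaf-wise growth: split the leaf with the largest gain
    max_depth=7,     # depth limit that curbs leaf-wise overfitting
    max_bin=255,     # histogram width k: continuous features binned to k values
    random_state=0,
)
lgbm.fit(X, y)
print(lgbm.score(X, y))
```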

Algorithm Evaluation Index
In this work, the performance of each ML algorithm was evaluated by calculating the R² values and the root-mean-square error (RMSE). The R² value was calculated using eq. S1, where n is the number of MOFs, and yi, ui, and ū are, respectively, the simulated, predicted, and average values of the gas-molecule diffusion coefficient (or of the ideal binary-gas diffusion selectivity). The error values for each algorithm were calculated using eq. S2.
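Given these symbol definitions, eq. S1 and eq. S2 take the standard forms reconstructed below:

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_{i}-u_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{u}\right)^{2}}\quad(\mathrm{S1})$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-u_{i}\right)^{2}}\quad(\mathrm{S2})$$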

k-fold cross-validation
When cross-validation is used, it is no longer essential to split off a separate validation set, and the test set is reserved for the model's final assessment. The cross-validation approach used in this study is k-fold cross-validation (k-fold CV), which divides the training set into k equally sized subsets. As shown in Figure S9, in each round of the method, k-1 of the "folds" are used for model training, and the remaining fold is used to validate the model created in the previous phase (analogous to using a test set to determine the model's accuracy). The result of k-fold cross-validation, the average of the outcomes of the k rounds, serves as a representation of the model's performance. [20] In this work, 10-fold cross-validation is used.
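A minimal scikit-learn sketch of the 10-fold CV procedure; the LightGBM model and synthetic data are illustrative stand-ins:

```python
# Minimal 10-fold CV sketch; the model, data, and scoring are illustrative.
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)  # 10 folds, as in this work
scores = cross_val_score(LGBMRegressor(random_state=0), X, y, cv=cv, scoring="r2")
print(scores.mean())  # average R^2 over the 10 train/validate rounds
```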

Shapley additive explanation
The SHAP value of feature i is the classic Shapley value,

$$\phi_{i}(f,x)=\frac{1}{M!}\sum_{R\in\mathcal{R}}\left[f_{x}\!\left(P_{i}^{R}\cup\{i\}\right)-f_{x}\!\left(P_{i}^{R}\right)\right]$$

where ℛ is the set of all feature orderings, $P_{i}^{R}$ is the set of all features that come before feature i in ordering R, and M is the number of input features for the model. For tree-based models, this study uses the TreeExplainer algorithm built by Lundberg et al. [21] SHAP interaction values extend this definition to pairs of features: when i ≠ j,

$$\Phi_{i,j}(f,x)=\sum_{S\subseteq\mathcal{M}\setminus\{i,j\}}\frac{|S|!\,\left(M-|S|-2\right)!}{2\,(M-1)!}\,\nabla_{ij}(f,x,S),\qquad \nabla_{ij}(f,x,S)=f_{x}\!\left(S\cup\{i,j\}\right)-f_{x}\!\left(S\cup\{i\}\right)-f_{x}\!\left(S\cup\{j\}\right)+f_{x}(S)\quad(\mathrm{S5})$$

and when we set $\Phi_{0,0}(f,x)=f_{x}(\varnothing)$, $\Phi(f,x)$ sums to the output of the model:

$$\sum_{i=0}^{M}\sum_{j=0}^{M}\Phi_{i,j}(f,x)=f(x)$$

In equation S5, the interaction effect of features i and j is distributed equally between the two features, so $\Phi_{i,j}(f,x)=\Phi_{j,i}(f,x)$, and the overall interaction effect is $\Phi_{i,j}(f,x)+\Phi_{j,i}(f,x)$. The residual effect can be defined as the difference between the SHAP value of a feature and its off-diagonal SHAP interaction values, as calculated by equation S6:

$$\Phi_{i,i}(f,x)=\phi_{i}(f,x)-\sum_{j\neq i}\Phi_{i,j}(f,x)\quad(\mathrm{S6})$$

SHAP interaction values have properties similar to SHAP values. [22] Python implementations of these metrics are available online at https://github.com/suinleelab/treeexplainer-study.
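A minimal sketch of how these quantities are obtained with the shap package's TreeExplainer; the LightGBM model and synthetic data are illustrative stand-ins for the trained LGBM model:

```python
# Minimal TreeExplainer sketch; the model and data are illustrative stand-ins.
import shap
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
model = LGBMRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)               # phi_i: (n_samples, n_features)
interactions = explainer.shap_interaction_values(X)  # Phi_ij: (n_samples, n_features, n_features)
```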

The trajectories of Kr and Xe in MOFs
In statistical mechanics, the mean squared displacement (MSD, also written mean square displacement or average squared displacement) is the most common measure of the spatial extent of random motion. Einstein showed that the MSD of a particle in a liquid grows linearly with time, whereas for a stable solid this quantity remains constant over the course of the simulation. The MSD was calculated from the Einstein relation (Eq. S9),

$$D=\lim_{t\rightarrow\infty}\frac{\left\langle\left|\mathbf{r}(t)-\mathbf{r}(0)\right|^{2}\right\rangle}{n\,t}\quad(\mathrm{S9})$$

In this work, the coefficient n in the denominator is 4 for the diffusion of the gas molecules, and we focused on simulations with 10 gas molecules in FAPYEA, BUSQIQ, FUDQIF, and ELUQIM06 (a minimal numerical sketch of this estimate is given after the figure caption below).

Figure S20. Approximate distribution of VSA, LCD, and ρ of the top 100 MOFs for 36 mixed gas pairs. The colors represent different gas mixtures, and the white ball represents the median.
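A minimal NumPy sketch of the MSD and Einstein-relation estimate described above; the positions array (frames × molecules × 3) and the time step dt_ps are hypothetical inputs, not taken from the paper's workflow:

```python
# Minimal sketch of an MSD/diffusivity estimate from an MD trajectory;
# `positions` (shape: n_frames x n_molecules x 3) and `dt_ps` are
# hypothetical inputs, not the paper's actual data.
import numpy as np

def msd(positions):
    """Mean squared displacement relative to the first frame."""
    disp = positions - positions[0]              # displacement of each molecule
    return (disp ** 2).sum(axis=2).mean(axis=1)  # average over molecules

def diffusion_coefficient(positions, dt_ps, n=4):
    """Einstein-relation estimate D = MSD / (n * t), via a linear fit of MSD vs. t."""
    t = np.arange(len(positions)) * dt_ps
    slope = np.polyfit(t[1:], msd(positions)[1:], 1)[0]  # slope of MSD vs. time
    return slope / n                                     # n = 4, per the text above
```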