Interpretable machine learning methods for in vitro pharmaceutical formulation development

Background: Machine learning has become an alternative approach for pharmaceutical formulation development. However, many machine learning applications in pharmaceutics focus only on model performance rather than model interpretability. Aim: This study aims to propose an attention-based deep neural network (DNN) for pharmaceutical formulation development. Methods: An attention-based DNN, AttPharm, was proposed. AttPharm separately handled feature values and feature physical meaning by representation learning to successfully apply the attention mechanism to pharmaceutical tabular data. Furthermore, the distributions of the attention weights were computed using AttPharm. Two post hoc methods, local interpretable model-agnostic explanation (LIME) and TreeSHAP, were utilized to obtain post hoc model interpretability for lightGBM. Results: The results demonstrated that AttPharm significantly improved the performance of plain neural networks on a pharmaceutical cyclodextrin dataset because the attention mechanism could extract related features and detect minute variation. Notably, analysis of the attention weights illustrated global and local (feature-level and sample-level) model interpretability, thus providing insights for formulation design. Compared with post hoc methods, AttPharm can be used without concern about the faithfulness of the interpretation. Conclusion: This is the first step in applying attention-based DNNs to pharmaceutical formulation development. Considering the importance of model interpretability, the proposed approach may have a wide range of applications in pharmaceutics.

The application of machine learning algorithms in pharmaceutics and food research has improved the efficiency of drug and food development. However, current applications of machine learning in pharmaceutics and food mainly focus on model performance. In addition to accurate prediction, machine learning interpretability is increasingly critical and has gained widespread attention (Bibal & Frénay, 2016). Different dosage forms, foods, and formulations depend on different molecular mechanisms, and although researchers have devoted significant resources, many of these mechanisms remain unknown. Therefore, there is still room for improvement in machine learning predictions in drug and food research, both to gain insight into the mechanisms and to accelerate development.
Model interpretability has been increasingly recognized as an integral part of machine learning (Gentiluomo et al., 2019; Sheridan, 2019).
The key aspect of deep learning is the general automatic feature extractor (Abrol et al., 2021; Lecun et al., 2015). Attention-based DNNs can distinguish between inputs, leading to better performance than standard DNNs (Luong et al., 2015; Rush et al., 2015; Vaswani et al., 2017; Wu et al., 2016). Owing to the generated attention weights, attention-based DNNs are considered a kind of intrinsically interpretable model (Xu et al., 2015). The attention mechanism assigns an attention weight distribution to each sample to extract the relevant information.
The mean value of the attention weights for all samples can be considered as an estimate of the global feature importance. The attention weight distribution of a single sample can be considered as the local feature importance (sample level). The attention weights of a feature for all samples can be considered as the feature importance (feature level) that varies with the value of the feature. Recently, several studies have applied attention-based DNNs to biomedical research.
They used high-dimensional sequence and graph data as network input to predict molecular activity, molecular stability, and compound-protein affinity (Karimi et al., 2019; Li et al., 2019; Liu et al., 2018; Xiong et al., 2019). However, applying attention-based DNNs to two-dimensional tabular data remains an unresolved difficulty. In the pharmaceutical and food fields, structured tabular data are the most widely used data type, rather than high-dimensional sequence, graph, or image data. To our knowledge, no studies have applied attention-based DNNs to formulation development.
In industrial applications and data competitions, the gradient boosting decision tree (GBDT) framework is competitive among machine learning models, especially on structured datasets (Friedman, 2001). For example, lightGBM is an ensemble learning algorithm based on GBDT (Ke et al., 2017), further optimized with exclusive feature bundling and histogram-based learning, which give it good performance and high efficiency. However, GBDT models do not provide estimates of local feature importance (sample level and feature level). Additional post hoc interpretable methods can be used to provide local interpretability for trained models (Gilpin et al., 2019; Johansson et al., 2011). Local interpretable model-agnostic explanation (LIME) and Shapley additive explanation (SHAP) are two such approaches (Lundberg & Lee, 2017; Ribeiro et al., 2016). LIME uses a linear model as a surrogate: it obtains predictions of the original model by slightly perturbing the inputs and uses the surrogate model to fit the original model locally and linearly. However, there are concerns about the faithfulness of the surrogate model interpretation (Gilpin et al., 2019). SHAP, in turn, estimates the difference in model output when each feature is eliminated. TreeSHAP is an efficient implementation of SHAP for GBDT models, which truncates the nodes containing the features to be eliminated (Lundberg et al., 2018, 2019). TreeSHAP thus uses an elimination-based approach to estimate feature importance; however, accurate estimation requires the assumption that the input features are independent of each other, which is a strong assumption in practice (Lundberg et al., 2019). For these reasons, we believe that attention-based DNNs can be an alternative approach.
In this study, an attention-based DNN, AttPharm, was proposed.
AttPharm separately handled the feature values and the feature physical meaning by representation learning in order to apply the attention mechanism to structured tabular data. AttPharm was available in three versions: AttPharm, AttPharm-formula, and AttPharm-formula-ResNet. By integrating the attention mechanism, the normalization formula, and residual blocks, AttPharm-formula-ResNet gave accurate predictions with interpretable attention weight distributions.
The proposed methods were also compared with other commonly used algorithms, including SVM, a plain neural network, and lightGBM. Furthermore, two post hoc interpretable methods, TreeSHAP and LIME, were applied to the trained lightGBM to obtain post hoc interpretability for comparison. The pros and cons of these interpretable methods were discussed with a view to their further application in pharmaceutics and food research.

Dataset and data splitting strategy
Recently, pharmaceutical formulation data for the binding-free energy of inclusion complexes between 8 types of cyclodextrins (CD) and 1320 diverse guest molecules were extracted by the author's lab from research literature published from 1990 to 2018. All the literature concerned binary CD complexation systems. The original dataset was further checked and then used as the model dataset in this study. Five formulations were deleted due to incorrect descriptors, and three formulations were deleted due to unconvertible SMILES. The predicted logS and logP of the guest molecules were obtained from the ALOGPS 2.1 program (http://www.vcclab.org) and added to the dataset (Tetko et al., 2005).
Finally, a tabular dataset of 2992 records and 42 features was generated for machine learning modeling. The features comprised 19 molecular descriptors of the small guest molecules, 21 molecular descriptors of the CDs, and 2 process parameters (pH and temperature).
The range of the predicted target, the binding-free energy, was from −40 to 0 kJ/mol.
A three-subset data splitting strategy, widely applied in machine learning, was carried out. The dataset was split into training (2318 records), validation (339 records), and test (335 records) subsets. Stratified random sampling was used to keep the proportion of each CD type balanced. The training subset was used for constructing models, the validation subset for tuning the hyperparameters, and the test subset for testing model generalizability on unseen data.
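The stratified three-way split described above can be sketched as follows. This is a minimal illustration: the function name is hypothetical, and the per-subset fractions are chosen only to approximate the paper's 2318/339/335 split of 2992 records.

```python
import random
from collections import defaultdict

def stratified_three_way_split(samples, labels, fracs=(0.775, 0.113, 0.112), seed=42):
    """Split samples into train/validation/test subsets while keeping the
    proportion of each stratum (e.g. CD type) roughly constant.
    `fracs` approximates the paper's 2318/339/335 split (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s, l in zip(samples, labels):
        by_label[l].append(s)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n = len(group)
        n_train = round(fracs[0] * n)
        n_val = round(fracs[1] * n)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```

Splitting stratum by stratum guarantees that rare CD types are represented in all three subsets, which a plain random split cannot ensure.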

Evaluation criteria for machine learning models
The model performance was evaluated with the mean absolute error (MAE), the mean squared error (MSE), and the coefficient of determination ($R^2$):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert,\qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2,\qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

where $n$ is the number of formulations, $y_i$ is the $i$th real value, $\hat{y}_i$ is the $i$th predicted value, and $\bar{y}$ is the mean of the real values.
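Assuming the standard definitions of MAE, MSE, and R², these criteria can be computed with a few lines of plain Python:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot
```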

Hyperparameters of SVM, plain neural network, and lightGBM
Recent development and applications of machine learning have been driven both by data availability and by advances in learning algorithms. A variety of machine learning algorithms have been proposed to approximate different mapping types (Bishnupriya et al., 2019). Traditional machine learning relies on feature engineering, the process of extracting features using domain knowledge, whereas DNNs are regarded as universal feature extractors that automatically extract critical features from raw data through multi-layer architectures and gradient back-propagation (Abrol et al., 2021). In this study, SVM, a plain neural network, and lightGBM were introduced to predict the binding-free energy of CD inclusion complexes. Their hyperparameters are listed in Table 1. Note: ";" separates different hyperparameters; "," means the hyperparameter is composed of more than one element. Abbreviation: SVM, support vector machine.

A normalization formula
In machine learning, z-score normalization (also known as standardization) and min-max normalization are usually adopted as feature scaling methods for data processing. Z-score normalization subtracts the mean of each feature and then divides by its standard deviation. Min-max normalization scales the range of each feature to [0, 1].
However, the z-score normalization and the min-max normalization cannot properly represent the physicochemical properties of features.
In pharmaceutics and food, some features take on opposite physicochemical properties at different values. For example, a pH below 7 means acidic and above 7 means basic. Thus, a specific normalization suitable for pharmaceutics and food should be introduced to scale the range of such features.
where x is an original value, x′ is the normalized value, and shift is the shift constant. For AttPharm, the usual z-score standardization was adopted; for AttPharm-formula and AttPharm-formula-ResNet, the normalization formula was adopted.
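The three scaling schemes can be sketched as below. The paper's exact shift-based formula is not reproduced above, so `shifted_scale` is only one plausible realization of a shift-constant normalization (centering pH at 7 so the sign of the normalized value preserves the acid/base meaning); the division by the maximum absolute deviation is an assumption.

```python
import numpy as np

def zscore(x):
    """Z-score normalization: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

def minmax(x):
    """Min-max normalization: scales the feature into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def shifted_scale(x, shift):
    """Hypothetical shift-based normalization: center at a physically
    meaningful point (e.g. shift=7 for pH, the acid/base boundary) so the
    sign of x' keeps its physicochemical meaning. Scaling by the maximum
    absolute deviation is an assumption, not the paper's exact formula."""
    centered = x - shift
    return centered / np.abs(centered).max()
```

With `shift=7`, an acidic pH maps to a negative value and a basic pH to a positive one, which min-max scaling to [0, 1] would destroy.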
The attention weights are obtained by a softmax over the similarities:

$$\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{N}\exp(e_j)}$$

where $\alpha_i$ is an attention weight, $e_i$ is a similarity computed from the position vector $p_i$ and the drug vector $d$, and $N$ is the number of features.
The formulation vector is

$$\mathrm{formulation} = \sum_{i=1}^{N}\alpha_i\,\mathbf{i}_i$$

where formulation is the formulation vector, $N$ is the number of features, $\alpha_i$ is an attention weight, and $\mathbf{i}_i$ is an input feature vector. Finally, the cost function ($C$) and the objective function ($J$) are

$$C = \frac{1}{n}\sum_{k=1}^{n}(\hat{y}_k - y_k)^2,\qquad J = C + \lambda\sum w^2$$

where $\hat{y}$ are the predicted values, $y$ are the real values, $n$ is the number of data, $w$ are the network weights, and $\lambda$ is the coefficient of the l2 regularization.
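The attention step can be sketched in NumPy as follows. The dot-product similarity between the position vectors and the drug vector is an assumption (the section above does not specify the exact scoring function), so treat this as one plausible realization rather than AttPharm's implementation:

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over a 1-D array of similarities."""
    e = e - e.max()
    w = np.exp(e)
    return w / w.sum()

def attention_formulation_vector(feature_vecs, position_vecs, drug_vec):
    """Attention over tabular features (sketch).
    feature_vecs:  (N, d) input feature vectors i_i
    position_vecs: (N, d) position vectors p_i (feature physical meaning)
    drug_vec:      (d,)   drug vector d
    The dot-product scoring e_i = p_i . d is an assumption."""
    e = position_vecs @ drug_vec        # similarities e_i
    alpha = softmax(e)                  # attention weights, sum to 1
    formulation = alpha @ feature_vecs  # weighted sum: sum_i alpha_i * i_i
    return formulation, alpha
```

Because the weights sum to 1, the distribution `alpha` can be read directly as the per-feature importance discussed in the interpretability sections.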

Residual blocks
The residual blocks were applied to boost the generalization of the AttPharm model. Residual blocks solve the network degradation problem, which occurs as network depth increases (He et al., 2015). They bridge shortcut connections that skip one or more layers, so the network easily learns identity mappings, goes deeper, and performs better than DNNs without residual blocks. In this study, residual blocks were used, and when the input dimension did not match the output dimension, a linear projection was performed.
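A minimal forward-pass sketch of such a residual block, assuming two fully connected layers with ReLU activations (the layer count and activation placement are illustrative, not AttPharm's exact architecture):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, b1, W2, b2, W_proj=None):
    """Forward pass of a simple two-layer residual block.
    If the input and output dimensions differ, a linear projection W_proj
    maps the shortcut onto the output dimension (He et al., 2015)."""
    out = relu(x @ W1 + b1) @ W2 + b2          # main branch
    shortcut = x if W_proj is None else x @ W_proj  # identity or projection
    return relu(out + shortcut)                # add skip connection
```

When all branch weights are zero the block reduces to the (rectified) identity, which is precisely why residual networks are easy to train at depth.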

Hyperparameters of AttPharm
Three versions of AttPharm models were tested: AttPharm, AttPharm-formula, and AttPharm-formula-ResNet.

LIME and SHAP
The post hoc methods divided the modeling process into two parts: model prediction and post hoc interpretation. For example, small molecules were represented by descriptors such as molecular weight, logS, logP, and XLogP3 (a calculation of logP by machine learning techniques).

Performance comparison of different machine learning models
The MAE, MSE, and R² results of the machine learning models were obtained on the training, validation, and test subsets (Table 2). The SVM and the plain neural network were used as the baseline models.
On the final test subset, the SVM and the plain neural network obtained similar MAE and R² results. AttPharm demonstrated a better result than both. AttPharm-formula further enhanced the performance of AttPharm. Furthermore, AttPharm-formula-ResNet and lightGBM gave comparable results that were better than the rest. AttPharm-formula-ResNet significantly improved on the plain neural network, reducing the MAE from 1.90 to 1.58 kJ/mol and the MSE from 6.88 to 4.58, and raising the R² from 0.71 to 0.81. The results in Table 2 indicate that AttPharm-formula-ResNet and lightGBM had comparable, superior predictions with R² over 0.81 and MAE under 1.59 kJ/mol, which met the formulation development requirements.

lightGBM is a tree-based ensemble learning algorithm (Ke et al., 2017). Tree-based ensemble models combining tree-structured base learners usually demonstrate good performance on tabular data with pre-engineered features (Abrol et al., 2021; Friedman, 2001). However, SHAP's accurate estimation requires the assumption that the input features are independent of each other, which is a strong assumption in practice (Lundberg et al., 2019).
Deep learning is able both to learn task-specific representations from raw data and to determine decision boundaries (Abrol et al., 2021; Lecun et al., 2015). Powerful feature-extraction capabilities contribute to the superior performance of deep learning models. On structured tabular data, even though pre-engineered features deprive deep learning of part of its representation-learning advantage, AttPharm-formula-ResNet still performed on par with the cutting-edge tree-based ensemble model (R² > 0.8). The method has the potential to scale well because neural networks may benefit both from learning new representational patterns and from increasing data. Furthermore, attention-based deep learning models allow for reliable correlation estimates.
Attention-based DNNs do not depend on additional interpretable methods. The current limitation is the small domain data in the food and pharmaceutics fields. However, it is safe to expect that these samples will grow much larger in the coming years, due to the development of high-throughput screening and computational pharmaceutics. For these reasons, we believe that attention-based DNNs can be an alternative approach.

Results of the attention weights
In addition to predicting the CD binding-free energy, AttPharm-formula-ResNet provided attention weight distributions that could be recognized as global and local (feature-level and sample-level) interpretability.
For CD inclusion complexes, although impact factors such as CD type, pH, and temperature have been studied experimentally, it is still hard to investigate the influence of substance-innate features such as molecular weight, logP, and logS. Although the global feature importance ranking was studied in previous research with ensemble models, feature-level and sample-level interpretability and a general machine learning architecture are still necessary.

Figure 3. The global mean attention weights for the dataset.
First, the mean attention weights over the dataset are computed.
The features, comprising the API and CD descriptors and the process parameters, were ranked (Figure 3) according to the overall mean attention weights. The mean attention weights globally estimated feature importance, which helped determine the key factors influencing CD inclusion complexes. Formulating a drug into a CD inclusion complex can improve the solubility and bioavailability of insoluble drugs, enhance drug stability, mask unpleasant taste or odor, and reduce drug irritation and side effects (Carrier et al., 2007). The preparation process starts by adding the API into a saturated CD solution (at a given pH) at a proper temperature. The inclusion complex forms when the API molecule enters the cavity of the CD and binds noncovalently, assisted by stirring, milling, crystallizing, filtering, and drying (Zhao et al., 2018).
According to the feature list shown in Figure 3, the top six critical features were "XLogP3" (a prediction of logP by machine learning techniques) and "logP" of the API, "ring count" of the API, "temperature," "complexity" of the API, and "maximum projection radius" of the CD. Among these, the "XLogP3" and "logP" of the API were well-recognized impact factors on the binding-free energy of inclusion complexes.
Generally, the binding-free energy decreased as the "XLogP3" and "logP" of the API increased. Apart from these, the other substance-innate features, "ring count" and "complexity" of the API and "maximum projection radius" of the CD, were difficult to study in pharmaceutical experiments yet exhibited a significant impact on the binding-free energy. Because the formation of inclusion complexes is a self-assembly process between host and guest molecules, higher "ring count" and "complexity" of the API would increase steric hindrance, leading to a high energy barrier against inclusion complex preparation. As CD molecules are cylindrical and drugs with specific structures insert into the hydrophobic cavity to form the inclusion complexes, a larger "maximum projection radius" of the CD can enhance the degree of insertion, thus reducing the binding-free energy and improving the inclusion complex stability. Besides, "temperature" can affect the binding-free energy: in most CD inclusion complex forming processes, the enthalpy change (∆H) is negative.

Figure 4. The feature-level attention weights of API XLogP3 versus the feature values.
Raising the temperature is therefore unfavorable for inclusion complex formation; on the other hand, it increases the solubility of the CD, which helps obtain the inclusion complexes.
Feature-level and sample-level interpretability are two kinds of local model interpretability. The feature-level interpretability for the "XLogP3" of the API was obtained by plotting its attention weights against its feature values, which illustrated how the model's attention changed as drug lipophilicity increased (Figure 4). The "XLogP3" of the API was the most influential feature ranked by the attention weights in the CD dataset. When the "XLogP3" of the API was negative, the model paid relatively high attention to this feature, possibly because of the unstable binding of polar drugs under the hydrophobic interaction between host and guest molecules. As the cavity of the CD was hydrophobic, when a guest molecule entered the cavity its hydrophilic group stayed far from the cavity, and a polar molecule was not stable in the nonpolar cavity. When the "XLogP3" of the API was positive, the attention was relatively stable. The APIs concentrated between "XLogP3" values of 0 and 5, because insoluble, nonpolar drugs were typically preferred by researchers seeking improved CD inclusion complexes. This also suggests that when the "XLogP3" of the API is greater than 5, the confidence of the prediction is insufficient due to the sparse data distribution.
The sample-level interpretability for a specific example formulation was obtained by presenting the attention weight distribution for that formulation, which locally evaluated the impact of the features on this example (Figure 5). Similar to the global feature importance list shown above, the key factors for achieving this CD inclusion complex included the "XLogP3" and "logP" of the API, closely related to the lipophilic interaction between host and guest molecules in the system. The "complexity" of the API also played an important role in forming the inclusion complex prepared by the example formulation.
Additionally, the "maximum projection radius" and "minimum projection radius" of the CD can influence the preparation of the CD inclusion complex through the different cavity radii of CDs, because a CD selectively holds guest molecules whose molecular volume matches its cavity, which leads to stable inclusion complexes. Apart from these features, the "logP" of the CD reflects the lipophilicity of the CD.

Results of SHAP and LIME
The TreeSHAP method was applied to lightGBM to obtain feature-level post hoc interpretability. The SHAP values for the "XLogP3" of the API are shown in Figure 6. A high SHAP value indicated that the feature of the instance had a high impact on the prediction made by lightGBM. The negative values of the "XLogP3" of the API correlated with relatively high SHAP values, which is roughly consistent with the attention results above.
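The elimination-based idea behind SHAP can be illustrated with a toy exact Shapley computation that enumerates all feature coalitions; TreeSHAP exists precisely to avoid this exponential cost by truncating tree nodes. The `predict(subset)` interface here is a hypothetical stand-in for "model output when only these features are present":

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, features):
    """Exact Shapley values over a small feature set (illustration only:
    cost is exponential in the number of features). predict(subset)
    returns the model output when only the features in `subset` are
    'present' - a toy stand-in for TreeSHAP's node truncation."""
    n = len(features)
    values = {}
    for f in features:
        others = [g for g in features if g != f]
        phi = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # classic Shapley coalition weight |S|!(n-|S|-1)!/n!
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += weight * (predict(set(S) | {f}) - predict(set(S)))
        values[f] = phi
    return values
```

For a purely additive model the Shapley value of each feature recovers exactly its individual contribution, which is the sanity check SHAP implementations are built around.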
The LIME method was applied to lightGBM to obtain sample-level post hoc interpretability. The LIME interpretation for the example formulation is illustrated in Figure 7. Using a linear model as the surrogate, LIME interpreted the linear model's behavior as the underlying mechanism of lightGBM. For this formulation, the "logP" of the API, the "minimum projection radius" of the CD, and the "logS," "hydrogen bond donor count," and "aromatic ring count" of the API jointly led lightGBM to predict a relatively low binding-free energy. This also demonstrated a unique advantage of LIME: it provides the direction (sign) of feature importance.
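The surrogate-fitting procedure LIME uses can be sketched as a weighted local linear fit in NumPy. This is a minimal version: the real LIME library adds feature selection and categorical handling, and the Gaussian proximity kernel and its width here are assumptions.

```python
import numpy as np

def lime_local_weights(black_box, x0, n_samples=500, scale=0.1, seed=0):
    """Minimal LIME-style sketch: perturb the instance x0, query the
    black-box model, and fit a local linear surrogate by proximity-
    weighted least squares. The signs of the returned coefficients give
    the local direction of each feature's influence."""
    rng = np.random.default_rng(seed)
    X = x0 + scale * rng.standard_normal((n_samples, x0.size))
    y = np.array([black_box(x) for x in X])
    # proximity kernel: samples near x0 get more weight (width is assumed)
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * scale ** 2))
    A = np.hstack([X, np.ones((n_samples, 1))])  # add intercept column
    Aw = A * w[:, None]
    # solve the weighted normal equations for the surrogate coefficients
    coef, *_ = np.linalg.lstsq(Aw.T @ A, Aw.T @ y, rcond=None)
    return coef[:-1]  # per-feature local weights (intercept dropped)
```

When the black box is itself linear, the surrogate recovers its coefficients exactly; for a nonlinear model, the coefficients approximate the local gradient around `x0`, which is the faithfulness question raised earlier.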
The global mean SHAP values over the dataset were computed, and the features, comprising the API and CD innate features and the experimental parameters, were ranked (Figure 8). There were both consistencies and inconsistencies between AttPharm and TreeSHAP among the top six features. For the first, second, and sixth features, the two methods reached the same conclusion: the "logP" and "XLogP3" of the API impacted the binding-free energy of the inclusion complexes, and generally the binding-free energy decreased as they increased. The shared sixth feature was the "maximum projection radius" of the CD; because CD molecules are cylindrical, drugs with specific structures are inserted into the hydrophobic cavity to form the inclusion complexes. However, the third, fourth, and fifth features ranked by AttPharm were API properties and a process parameter ("ring count" of the API, "temperature," and "complexity" of the API), whereas TreeSHAP focused on the "logS" of the API, the "minimum projection radius," and the "aromatic ring count" of the API.

Pros and cons of current interpretable methods
Nowadays, various interpretable machine learning techniques have been proposed (Aben et al., 2016; Bibal & Frénay, 2016; Briesemeister et al., 2010a, 2010b; Du et al., 2020; Gentiluomo et al., 2019; Gilpin et al., 2019; Guidotti et al., 2018; Karimi et al., 2019; Kuz'min et al., 2011; Li et al., 2019; Liu et al., 2018; Lundberg & Lee, 2017; Lundberg et al., 2018; Marchese Robinson et al., 2017; Ribeiro et al., 2016; Rosenbaum et al., 2011; Xu et al., 2015). The recent progress of these methods is summarized in Table 3. According to the time at which model interpretability is obtained, the methods can be divided into three types: premodeling methods, intrinsically interpretable models, and post hoc methods. Premodeling methods operate after data cleaning and before model training, mainly through dimensionality reduction and visualization. Intrinsically interpretable models provide interpretability at prediction time by directly exposing the internal decision-making process of the model. Post hoc methods are applied to well-trained models, mainly through surrogate models or by exploiting model parameters; they split the complete modeling process into two parts, training and interpreting. On the basis of the application scope, model interpretability can be divided into global interpretability, which refers to a whole dataset or an entire model, and local interpretability, which refers to a single sample or feature.
Additionally, some post hoc interpretable methods are designed only for specific machine learning models (model-specific methods), whereas model-agnostic methods are available for any model. Considering the characteristics of specific tasks and the unique advantages of different interpretable methods, it is profitable to combine multiple interpretable methods across a series of continuous modeling steps.

CONCLUSION
In this study, AttPharm, a DNN architecture based on the attention mechanism, was proposed for formulation development. The results showed that AttPharm surpassed the plain neural network and achieved results comparable to lightGBM on the structured tabular dataset. Furthermore, the attention weight distributions provided global and local (feature-level and sample-level) model interpretability, offering insights for formulation design.