Development of a Gaussian Process – feature selection model to characterise (poly)dimethylsiloxane (Silastic®) membrane permeation

The current study aims to determine the effect of physicochemical descriptor selection on models of polydimethylsiloxane permeation.


Introduction
Predicting the permeation into and across the skin of exogenous chemical is of substantial interest for a number of industries, including pharmaceuticals, cosmetics, pesticides and the handling of industrial chemicals. Determining skin permeability is therefore an essential component of understanding the risks associated with exposure of the skin to exogenous chemicals. In general, in vitro experiments form a significant part of early-stage evaluation of pharmaceutical formulations or in risk assessment protocols for other topical exposures. While fresh human skin is the perceived 'gold standard' for in vitro testing, its consistent use is constrained by availability, which often means that certain compromises are commonly adopted, including the use of previously frozen human skin and tissue from other species; in the latter case, it is generally accepted that pig skin is the best model for human skin, with the pig ear being widely used despite differences in the lateral packing of stratum corneum lipids, compared to human skin. [1][2][3][4] Given the scientific and logistical constraints discussed above, artificial membranes have also found widespread use in early-stage assessment of percutaneous absorption, notably in the context of formulation optimisation and elucidation of permeation mechanisms. Polydimethylsiloxane (PDMS, Figure 1) is one of the most widely used membranes in such studies and has been shown to correlate well with mammalian skin studies [5] but differences in the distribution of permeation compared to mammalian skin have also been found. [6,7] Similarly, Gullick and co-workers found reasonable correlations between in vitro diffusion experiments using PDMS membranes and pig skin. [8] PDMS has also been widely used to investigate mechanisms of membrane transport. Permeability was generally related to the physicochemical properties of their penetrants, and solvents were taken up into the membrane, altering membrane properties and the flux of the permeants. [9][10][11][12] ATR-FTIR spectroscopy has also been used to evaluate diffusion across a PDMS membrane [13] with a rudimentary structure-activity relationship for permeability across a PDMS membrane being developed. [14] It was also reported that the permeation distribution of a range of chemicals across a PDMS membrane was Gaussian-normal, [15] in contrast to a number of studies which reported non-Gaussian (log-normal) distribution in mammalian skin. [6,9,[16][17][18][19][20][21] This difference was broadly attributed to the heterogeneity of biological membranes, including the possibility of multiple permeation pathways in mammalian skin, which is in contrast to the homogeneity of PDMS membranes. PDMS membranes are therefore pharmaceutically significant and provide an effective screen in early-stage formulation development and in the elucidation of mechanistic information for the permeation process.
The first major studies quantifying permeability across a PDMS membrane initially saw the development of empirical models for permeation across a PDMS membrane which related flux through the membrane to partial atomic charge, mole fraction solubility and molecular weight. [22] : log J mss ¼ À2:497 À 4:339Re þ À 1:531Re À þ 4:065ðRe þ :Re pÀ Þ þ 0:649 log C S À 0:00651MW À 0:640imidazole þ 0:689amine ð1Þ n = 103, r 2 = 0.966, s = 0.238, F = 386.5; where J mss is the maximum steady-state flux (lmol/s per cm 2 ); Σe + is the sum of the charge values of hydrogen atoms (with charge >0.1) and the positive charge of a nitrogen atom in a nitro group; Σe À is the sum of the absolute charge values of all other heteroatoms with unshared electron pairs in the same molecule. Equation 1 gave better predictions than their previous model, and they applied it to predict the flux of 171 new compounds which were not included in their previous study. [23] This yielded a slightly simplified model which omitted the imidazole descriptor: log J mss ¼ À2:497 À 4:339Re þ À 1:531Re À þ 4:065ðRe þ :Re À Þ þ 0:649 log C s À 0:00651MW þ 0:689amine These data [22,23] were re-analysed with the aim of developing a QSAR model based on readily calculable descriptors, unlike those used in the original studies, with greater mechanistic insight for the whole data set. [24] This resulted in a simple QSAR: log J ¼ À0:561HAÀ0:671HD À 0:801 6 vÀ0:383 ð3Þ n = 242, r = 0.900, s = 0.464, F = 338; where J is the steady-state flux (lmol/s per cm 2 ), HA and HD are, respectively, the number of hydrogen bond acceptor and donor groups present on a penetrant and 6 v is the sixthorder path molecular connectivity. This model describes permeability across a PDMS membrane in terms of hydrogen bonding and molecular topology. These data [22][23][24] were also modelled by in an artificial neural network (ANN) study. [25] They generated a 12-parameter non-linear QSAR model which, most significantly, found that log P was not significant, being attributed to the inability of log P to account for intramolecular interactions. Similar findings were also reported [26] although these were based on a very small data set (n = 16). In more recent years, Gaussian Processes Regression methods (GP, or GPR) have been shown to outperform QSARs in predicting the skin permeability of compounds. [27] GPs have recently been used for the prediction of the permeability of compounds across non-human skin and synthetic polydimethylsiloxane membranes, [28,29] and human skin. [30] Two of the major problems with developing valid and useful models of biological functions, such as skin permeability, are the selection of appropriate descriptors (features) and the size of the data set. The impact for a small data set may be more prominent as the solution (the set of selected descriptors) may change more significantly if one data point is replaced by another. These issues have been addressed in recently, [30][31][32] resulting in the models that are highly predictive and robust.
It is clear that a number of previous studies in this field focus on using existing data or collating as much available data as possible and that the descriptors of interest are often chosen based on empirical studies. Such approaches limit the value of models, and while a detailed understanding of data set size and distribution has been addressed previously, [32] this study focuses on the selection of objective relevant physicochemical descriptors, employing a Feature Selection methodology to do so. Therefore, the two aims of this work were, firstly, to find a set of highly relevant molecular descriptors that describe the process of membrane permeation and, secondly, to determine whether feature selection techniques can yield benefits in the prediction of chemical transport across PDMS membranes and to provide a more specific and nuanced understanding of the mechanistic nature of the permeation process. The novelty of this work is that this is the first time such methodology (that is, feature selection on a small data set with a large number of physicochemical descriptors) has been applied to a system of pharmaceutical interest and this clearly has implications for other permeation across other membranes, notably skin.

Description of the data set
A human skin data set was not used in this study due to previously reported inconsistencies in the available data, [33][34][35] which it was felt could inhibit the development of new models based on the approach used in this study. Thus, a PDMS membrane was considered more appropriate as its use is generally associated with a greater consistency. [22][23][24]36] The data set was collated from the existing scientific literature and is available as Appendix S2. It consists of 77 unique chemical compounds. For each compound, the permeability coefficient, k p (cm/h), across a PDMS membrane was used and 2942 physicochemical descriptors were determined using a range of software packages (Dragon Professional, Molecular Operating Environment (MOE), HYdrogen BOnd Thermodynamics (HYBOT) and WinMolconn). While it is more common to use flux (J) to describe membrane permeation, [22][23][24] the permeability coefficient, k p , which is effectively the concentration-corrected flux, is more widely used in the scientific literature and, notably, in the construction of skin permeation models. This is because the use of k p allows a comparison to be made between different chemicals as concentration differences are generally accounted for. [36][37][38] Data set pre-processing There are no missing values or inconsistent data in the data set used in this study. 772 descriptors had only one possible value and were removed from the data set. A further 194 descriptors were removed as their distributions were highly imbalanced (i.e., the number of one certain value is greater than a threshold set in this analysis, i.e. 90% or greater of the total number of data points). 613 further descriptors which were highly correlated within the data set (i.e. those whose correlation coefficients are greater than 0.99) were removed. The analysis in this study was conducted on a refined data set of 77 compounds, each with 1363 physicochemical descriptors.

Generation of four independent test sets
It has been demonstrated previously that there are no significant linear relationships with the target permeability data for the descriptors used. [37,39] Principal component analysis (PCA) indicated that no linear relationship was observed between the permeability coefficient and the compound descriptors combined in the first component, where 40.2% of the variance was due to the first two principal components, PC1 and PC2. Figure 2 shows a PCA plot of the original, complete, PDMS data set, indicating that the first two Eigenvectors capture 40.21% of the total data variance. In order to observe how the final selected descriptor set affects the performance of regression models on the test set, four different test sets were randomly constructed, each of which includes seven test compounds: • Set Aseven test compounds were selected from the boundary of the data cloud shown in the PCA (Figure 3a for PC1, and Figure 3b for PC2, denoted by crosses, '+' in these figures). These PCA plots further show the relationship between logK p and the first two principal components, indicating that, except for the selected compound with the largest logK p value and the two selected compounds with relatively small logK p values, the remaining four selected compounds, compounds with different PC projections but very similar logK p values can be found. This test set is denoted TstA and the rest of the chemical compounds as TrnA.
• Set Bthis test set includes the three smallest and the four largest logK p values. Figure 3c,d shows projections of these compounds on a PCA plot for PC1 and PC2, respectively; this test set is denoted as TstB and the rest of the chemical compounds as TrnB.
• Set Cseven compounds randomly selected from the data set. Figure 4a,b shows the projections of these compounds in the PCA plot for PC1 and PC2, respectively. This test set is denoted as TstC and the remaining chemical compounds as TrnC.
• Set Dseven compounds selected from the t-sne plot [40] and for PC1 and P2 is shown in Figure 4c,d, respectively. This is denoted as TstD and the rest of the chemicals as TrnD.

Constructing random training/validation sets
To obtain robust results for each training set the data has been shuffled 10 000 times. Each shuffle sees 60 chemicals randomly selected as the training set of regression models and the remaining compounds are used as the validation set. [28][29][30][31][32]37,39] Thus, there are 10 000 training subsets and 10 000 validation sets corresponding to each training set. The validations sets are used for feature selection rather than hyperparameter tuning. In using such a method, descriptors selected from different training sets will vary and those that are consistently selected are of most interest. The standard deviation is also used to describe the variance of the outputs.

Random Forest Trees (RFTs)
The Random Forests algorithm was coupled with the random selection of descriptors and bootstrap aggregation to the training sets. [52] In this study, an ensemble of decision trees for regression has been applied via MatLab TreeBagger. The 50 most important descriptors in each ensemble of decision trees for a given training set were identified. Thus, either a trained RFT or GP model has been developed using the most relevant physicochemical descriptors only. While normally it would not be necessary to have a validation set for the RFT experiments, we have, in this case, used a validation set in order to keep our results consistent with the other tests used in this study.

LASSO
In this study, the LASSO method [54] has been applied, with a five-fold cross-validation method, in order to remove redundant descriptors from the data set. To speed up the processing time the UseParallel mode has been enabled. All other parameters are set to the default mode in MATLAB.

(Joint) Mutual Information (MI, JMI)
In this study, the MatLab FEAST toolbox (http: www.cs.ma n.ac.uk/gbrown/fstoolbox) has been used to select descriptors using the JMI criterion by assigning labels to the responses required by the model. Those responses have been aligned into 10 categories.

Gaussian Processes
A Gaussian Process (GP) is defined as a collection of random variables which, jointly, have a Gaussian distribution. [57] GPs have also been successfully applied to the field of predicting percutaneous absorption. [30][31][32]37] GP methods have been described in detail previously and are also discussed in the Appendix S1.

Automatic Relevance Determination
To implement automatic relevance determination (ARD) in a GPR, the characteristic length scale matrix, M, is redefined as a diagonal matrix containing the elements of vector . . .l D on the diagonal are the characteristic length scales for each input dimension, determining how relevant an input is to the task. These characteristic length scales can be optimised from the data by Bayesian inference.

Performance measures
The mean squared error (MSE), improvement over Na€ ıve model (ION) and the Pearson correlation coefficient (r, or CORR) we are all used to evaluate the performance of each model. [30][31][32]37,39,59] MSE measures the average squared difference between model predictions and the corresponding targets. The ION measures the degree of improvement of the model performance over the performance of the Na€ ıve predictor, which is normally the mean of the output (e.g. logK p ) in the training set. The ION is thus defined as: CORR measures the correlation between predictions and targets. For comparison, a 'good' model should have a low value of MSE and high values of ION and CORR on a given test data set.

Results and Discussion
In this field, it is important to note that previous studies are either based on small data sets (e.g. 16 chemicals of limited structural diversity [26] ) or on different experimental conditions (e.g. an isopropylalcohol solvent system at 30°C [22][23][24] ). Benchmarking is therefore a challenge, and the decision to focus on a skin permeation QSAR [60] is limited, but offers a relevance in terms of comparison of PDMS relevance to skin. [39] A consideration of the ION values between the models and the benchmark suggests a limited ability of data from PDMS membranes to successfully model skin, and vice versa, emphasising the limitations of the PDMS membrane in acting as a substitute for skin. Table 1 also shows the results of applying GP models. [30,31] to these data and does so using the same parameters as the QSAR model in order to provide a comparable benchmark as well as a full set of 1363 descriptors. While it is usual for two-thirds of the data set to be used for training and the remaining one-third for the test in this work, in aiming to train the model with as many data points as possible and to repeat this 10 000 times to address concerns of inherent variance, we aimed for 10% test data, keeping most data in the training set so the estimates of permeability were as accurate as possible. We were concerned that if we removed more data from the training set the predictions would be significantly impaired, a finding which was apparent in previous work with similar types of data. [32,61,62] We have previously used several measures of model quality, including MSE and ION, rather than just the correlation coefficient. [28][29][30]37] This is important because in this study (e.g. Table 1) it is clear that the QSAR models have the lowest ION values but reasonably high correlation coefficients, suggesting a possible systematic error. The GP models have comparability improved ION values. It is interesting to note that the MSE and ION results are better than for the QSAR model but still very poor for the correlation coefficients. This might reflect the somewhat comparable nature of the linear QSAR model with the experimental findings that associate permeation across a PDMS membrane to be predominately linearthereby over-estimating permeability particularly at high lipophilicities. This is not reflected in the GP models which are more reflective of the non-linear nature of skin permeability in the context of molecular descriptors such as log P and molecular weight. [37] It also highlights the limitations of replacement membranesin this case, PDMS and mirrors similar outcomes reported previously for a range of mammalian species. [39] It should also be noted that the results for the GP models shown in Table 1 do indicate that ION and MSE values are significantly better for these models, even if the variance associated with each outcome is large when the full set of descriptors is used. This suggests that the full set of descriptors is responsible for noise and redundancy within the data, highlighting the importance of selecting the correct descriptors as well as the need to rationalise and optimise the descriptors used for our models in order to produce more accurate estimations of membrane permeability. Further, the ION values are comparable to those for the correlation coefficient and provide similar measurements of model quality. This is important as it allows readers from different disciplines to frame the outcomes of this study in their own contexts. Table 2 summarises the results of the first set of experiments on the PDMS data set. These results indicate that the best overall model is found using the Random Forests method, which outperforms other models for three of the four validation setsthe Mutual Information method performs better on validation set D. This is most likely because the Random Forests method embeds feature selection methodology in regression. It was also found that the results of the LASSO method were not stable and that the mean values of estimations on two validation sets -A and Cwere worse than the na€ ıve predictions. Further, when the GPARD + Full method is compared to GP it was observed that the use of ARD gives much better results. For example, all IONs of GPARD + Full are greater than 10%, whereas all ION values in Table 1 are around zero (for the GP method using the full feature set). This is because, rather than using an identical length scale over all features and treating all features equally, GPARD uses a different Table 1 Results on data sets with two features (MW and log P) using (a) the benchmark QSAR model of Potts and Guy for skin [60] ; (b) the Gaussian Process Regression model with two descriptors (MW and log P) and (c) the Gaussian Process Regression model with the full set of descriptors. Results are the mean value and standard deviation over 10 000 validation sets, which are denoted as ValA, ValB, ValC and ValD, respectively QSAR is quantitative structure-activity relationship, and the term QSPR (quantitative structure-permeability) relationship is often interchangeably used in modelling studies of skin permeation; MSE is mean squared error; ION, or ION%, is the improvement over the na€ ıve model; CORR, or r, is the correlation coefficient; MW is molecular weight; log P is the octanol : water partition coefficient. Results highlighted in bold are those which achieved the best outcome for each test. PDMS is polydimethylsiloxane (see Figure 1); RFT is random forest tree; GP is Gaussian Process, often used interchangeably with Gaussian Process Regression (GPR); LASSO is the least absolute shrinkage and selection operator; MI is mutual information (note that JMI, joint mutual information, has also been discussed in this study); GPARD is Gaussian Process Regression; GP + two fs is the Gaussian Process model run with two descriptors, log P and MW, to mimic the descriptors used in common QSAR models of skin permeation; Full refers to studies run with the full data set. Results highlighted in bold are those which achieved the best outcome for each test.
length scale for each feature to indicate the importance of each feature in its predictive ability. Figure 5a firstly shows that the benchmark QSAR models perform worst among all the models, which might be expected based on the above discussion and may reflect more widely the value of PDMS as an alternative to mammalian tissue. [39] Figure 5b shows the mean values of ION, and it can be seen that there is no improvement if the GP model is used with the full data set compared to the na€ ıve model. Interestingly, in Figure 5c the corresponding standard deviation is consistently low over all four validation sets, suggesting that the full feature data set contains a substantial amount of noise. Figure 5b also indicates that the performance of the LASSO models varies across the four validation sets; they are better than the na€ ıve model for two data sets (A and C) and worse for the other two. RFTs give better results than the na€ ıve model for all four validation sets. The value obtained using GP with the full feature selection set is the lowest for all sets of data. The QSAR models yields relatively good correlation coefficients, which addresses the issues described above and also previously in attempting to demonstrate the potential of the PDMS membrane to represent human skin. [39] Moreover, it indicates that use of the correlation coefficient aloneas is often the case in skin permeability studiesmay not be a reliable indicator of model performance and that use and consideration of a series of statistical measures may provide more suitable outcomes in terms of describing model quality.
A small number of descriptors were repeatedly found in at least 5000 of the repeated experiments. The features selected using RFTs are shown in Table 3. From this  Table 3 but, in general, they fall under three main categories in that they describe lipophilicity, partial charge and hydrogen bonding as key determinants of PDMS permeation. This is consistent with many previous findings in this field. [59,63,64] but provides a more specific molecular basis in determining specific structural features that contribute positively or negatively to skin permeation. Figure 6 indicates that a number of the significant descriptors demonstrate a reasonable correlation. In Figure 7, it can be seen that, using the MI method, b_1rotR (the fraction of rotatable single bonds) and b_rotR (the fraction of rotatable bonds) are highly correlated. The results of four independent test sets with different descriptor inputs, obtained using the RFT and MI methods ( Table 4) predominately suggest that the best results all come from the selected descriptors (i.e. using either RFTs or MI) These results suggest that the two best predictive models are those which are produced using the feature selection method to generate the most relevant descriptors for the data set analysed. The consideration of highly correlated descriptors (Table 5) indicates that statistically improved models are generated when highly correlated descriptors, for example b_1rotR and b_rotR, are removed from the model (that is, only one or the other, not both, are included in the final model). This highlights the need to consider the underlying statistical nature of the data and the potential pitfalls of using highly correlated or covariate descriptors in models simply because the data and particular descriptors are easily generated or readily available. [30,32,36] The consideration of highly correlated descriptors is shown in Table 5, where the results displayed are obtained from a MI analysis with and without highly correlated descriptors. The top line of Table 5 shows data obtained using six common descriptors, which indicates that this group of descriptors gives slightly better results over all test sets. This indicates that the 6-descriptor model outperforms the 8-descriptor model, yielding a significant improvement on test sets A and D.
The modelling approach in this study is generally perceived to be of a 'top-down' nature, in that it provides a gross estimate of skin permeability and, like related QSAR-type methods, offers limited mechanistic insight into the permeation process. [36] However, the addition of feature selection methodologies to this problem domain in order to precisely determine the molecular descriptors of interest offers an opportunity to expand this approach to generate more precise information about the membrane permeation process. However, a practical limitation to this approach is the size of the data set (n = 77). For example many previous studies, mostly in the 1990s, examined subsets of skin permeation generated from Flynn's data set. [33,[65][66][67][68][69][70][71] In many cases, these subsets were focused on specific molecular properties (such as non-electrolytes [71] ) or specific functional groups. [66,67] One perceived issue with these studies is that they resulted in poor statistical fits in many cases, which one might relate to the small size of the data sets used or, for example, the  absence of methods that can optimise outputs from small datasets [32] . Thus, while it is possible to speculatively identify specific molecular fragments that would improve permeation models, validation of such models might be difficult and impractical to develop due to the size of the data sets involved. This study raises a novel and highly significant finding in the field of modelling in percutaneous absorption. Modelling of skin permeation has been shaped historically by models that used as few as two descriptors and assumptions were subsequently made about the mechanism of skin permeation that shaped detrimentally subsequent developments in this field. It is known that descriptors such as melting point, ionisation and hydrogen bonding play a role in skin permeation. The main point, therefore, of this study was to propose a methodology that examines as many descriptors as possible in an unbiased manner and which then selects the most appropriate descriptors based on established methods, thus providing a platform for the development of models to predict permeation across, in this case, PDMS membranes. While such analogies cannot be automatically extended to fields other than percutaneous absorption, it should be noted that similar methods have been applied to a range of biological and environmental endpoints. Thus, the findings of this study clearly have a wider relevance beyond percutaneous absorption.

Conclusions
Two sets of molecular descriptors which can provide improved predictions, compared to the use of either two descriptors (log P and MW) or a full list of excess of two thousand descriptors, of permeability across a PDMS membrane, have been identified (Table 3). Using the GP method with either of these two sets of descriptors provides significantly better predictions of permeability than using a QSAR model, when compared with four independent test sets, although the use of such benchmarks in this study has clear limitations. The generation of two sets of descriptors echoes previous findings [30] that suggest certain permutations of different descriptors yield similarly relevant predictive models. More broadly in the development of reliable, robust and valid models it is also recommended that highly correlated descriptors are removed from the data set when using the GP method. It is also apparent that resampling sub-training sets helps with feature selection for small data sets, and it is again recommended that this approach be taken in the development of small data sets.
The implications of these results for building relevant predicative models are novel and significant. Models in the field of percutaneous absorption are based on a small number of significant descriptors which have been widely used and often assumed to be relevant based on historical experimental findings. While it has been shown in this study that some of these descriptors are not relevant, they are still employed routinely in model development. This study indicates that this practice should be discontinued and that future predictive models will be improved by the analysis adopted in this study of the descriptors employed, rather than simply using any and all available descriptors or using a smaller set of descriptors whose choice has been influenced empirically by previously published in vitro diffusion studies.

Conflicts of interest
The Authors have no conflicts of interest to report.