Transfer Learning as Tool to Enhance Predictions of Molecular Properties Based on 2D Projections

Images of molecules are widely used to predict molecule properties in teaching and chemical research. A trained chemist can easily derive molecule properties by analyzing a structure and evaluating its functional groups. To predict, for example, the water solubility of an organic compound, a chemist would intuitively count the number of polar groups, consider the size of the molecule, and estimate the water-molecule interaction by counting the number of H-bond donors and acceptors. Therefore, 2D molecule representations and their directly accessible features should provide enough information to predict a molecule's structure-dependent properties. To support this thesis, different image-based machine learning approaches such as dense neural networks, convolutional neural networks, clustering, data augmentation, and transfer learning are compared and evaluated in this work. The influence of the image size as well as the network size is discussed. Finally, a simple yet effective dense neural network trained on expert-preselected, visually accessible features is presented, and its efficiency and comparability to other, more complex methods are demonstrated.


Introduction
Deep learning in chemical research is gaining importance at a rapid pace, and the prediction of molecule properties like the free solvation energy [1] or the ionization energy [2] competes with or even outperforms modern ab initio calculations.
The spectrum of possible tasks for machine learning (ML) algorithms seems inexhaustible and ranges from classification tasks, such as sorting bioactive compounds into active and non-active categories for HIV replication inhibition, [3] to regression tasks where a specific value, such as the lipophilicity, is predicted from the input. [4] However, ML algorithms are not limited to the prediction of categories and values. They can generate new compound structures based on input structures or desired interaction targets and can therefore support the design of novel molecules in material sciences and medicinal chemistry. [5,6] Moreover, clustering processes divide large molecule datasets into groups of maximum congruence and can be utilized in medicinal chemistry to discover novel lead structures and accelerate their computer-aided selection. [7] For all these tasks, molecules must be converted into machine-readable inputs before feeding them to the ML algorithms. Starting from a structural formula, the molecule is often converted into a vector to be processable by most ML algorithms. Those vectors should be unique (one vector represents exactly one compound), invariant to atom numbering, and fixed in length. Several methods are available to obtain such inputs, and the most popular ones are discussed in the following.
A simple way to represent a molecular structure is the use of strings. A string is a series of numbers and letters, and often simplified molecular input line entry specification (SMILES) strings are utilized. In these strings, each molecule is described by its atoms and their adjacent bonds, and a molecule like n-propanol would be represented by the SMILES string "CCCO." [8] Because SMILES strings cannot adequately describe chirality (except central chirality) and tautomerism, the IUPAC proposed a different string representation, the International Chemical Identifier (InChI), which is not directly interpretable for a chemist. [9] Another method of representing a molecular structure is the use of molecular descriptors. Those descriptors can range from simple features directly accessible from the structure, such as the number of atoms, the number of heteroatoms, or the molecular weight, to complex calculated or simulated physical properties such as the dipole moment and the partial charges. [10] Those derived or calculated properties are then condensed in a defined order into a fixed-length vector.
More advanced featurization methods like molecular fingerprints are available and widely used. A molecular fingerprint is a fixed-length vector in which the absence of a feature is encoded by zeros and its presence by ones. Those features can again be directly derived from the structure of a molecule or be automatically obtained by using pre-defined queries of structural motifs. Structure-derived fingerprints are called substructure keys fingerprints and were originally developed for substructure and similarity search in compound databases. [11] Topological fingerprints are a different type of molecular fingerprint, and the most common one is the extended-connectivity fingerprint (ECFP), which uses the Morgan algorithm. To obtain an ECFP, each atom of a molecule is assigned an initial identifier encoding, for example, the atom type and its connectivity. After the initial assignment, each identifier is updated by including the identifiers of its direct neighbor atoms. By performing more update steps and varying the update radius, each atom identifier accumulates more information about its chemical surroundings. A vector of the identifiers is created and hashed to a fingerprint which is not directly interpretable on its own. [12] It is also possible to represent a molecule by a 3D representation of its structure using the coordinates of the atoms, the bond lengths, and the bond angles as well as the dihedral angles. One approach to this type of representation is the Coulomb matrix, which describes a molecular structure by its pair-wise Coulomb potentials. This representation method is often used for quantum chemical calculations and can be simplified by the bag of bonds (BOB) representation, where only the Coulomb elements representing a chemical bond are used and categorized by their bond type. [13] The main problem of representing a molecular structure in 3D is the accessibility of accurate 3D structures and the fact that, for a reasonable representation, many energetically comparable conformers would need to be considered.
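The iterative identifier update behind the ECFP described above can be illustrated with a toy implementation. This is a minimal sketch, not RDKit's actual algorithm: molecules are given as hand-built atom/bond lists, the initial identifier hashes only the atom symbol and degree, and the fold size of 64 bits is an arbitrary choice for illustration.

```python
import hashlib

def stable_hash(s):
    # deterministic hash (Python's built-in hash() is salted per process)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def morgan_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """atoms: {index: element symbol}; bonds: list of (i, j) index pairs."""
    nbrs = {i: [] for i in atoms}
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # initial identifier: atom type plus connectivity
    ids = {i: stable_hash(f"{atoms[i]}|{len(nbrs[i])}") for i in atoms}
    seen = set(ids.values())
    for _ in range(radius):
        # update each identifier with the sorted identifiers of its neighbours
        ids = {i: stable_hash(f"{ids[i]}|{sorted(ids[j] for j in nbrs[i])}")
               for i in atoms}
        seen |= set(ids.values())
    # fold all collected identifiers into a fixed-length bit vector
    fp = [0] * n_bits
    for ident in seen:
        fp[ident % n_bits] = 1
    return fp
```

Because the neighbour identifiers are sorted before hashing, the resulting fingerprint is invariant to atom numbering, one of the uniqueness requirements stated above.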
The most advanced and computing-intensive molecule representations are molecular graphs. In molecular graphs, the chemical structure of a molecule is directly used as an architecture for a neural network. Each atom of the molecule is described by a node of the resulting neural network, and each bond is described by a connection between those nodes. Therefore, each layer of the resulting neural network contains as many nodes as there are atoms in the molecule, and those layers are interlinked according to the bonds in the molecule. After the assignment of initial states to every node, each atom receives inputs from the previous atom layer and updates its own state. By making the resulting neural network deeper, each node is interconnected with more nodes and therefore obtains information about a larger part of its chemical environment and the molecule. This means that the representation of molecular structures is achieved by training one neural network per compound, and the actual representation is learned through machine learning. To obtain a chemically interpretable output of the molecular graphs, a readout function, often realized by another neural network, is used. [14,15] This work focuses on the usage of 2D representations of molecules in the form of gray scale images and compares and evaluates their performance in regression and clustering tasks on the MoleculeNet [16] ESOL dataset, [17] which includes the water solubility and SMILES strings of 1128 organic compounds.
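A single message-passing update of the kind described above can be sketched in plain Python. This is a deliberately simplified illustration: the node states are scalars (here, nuclear charges) and the mixing weights are fixed rather than learned, whereas real molecular graph networks use vector states, learned weight matrices, and nonlinearities.

```python
def message_passing_step(states, bonds, w_self=0.5, w_nbr=0.5):
    """states: {atom index: scalar state}; bonds: list of (i, j) pairs.
    Each atom mixes its own state with the sum of its neighbours' states."""
    nbrs = {i: [] for i in states}
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)
    return {i: w_self * states[i] + w_nbr * sum(states[j] for j in nbrs[i])
            for i in states}

def readout(states):
    # summing over all nodes makes the output invariant to atom numbering
    return sum(states.values())
```

Stacking several such steps lets each atom "see" a larger part of the molecule, mirroring the effect of building the network deeper.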
For the prediction of the water solubility of organic compounds, several artificial intelligence tools have been developed. Some of them use regression equations like the general solubility equation (GSE) to predict the water solubility of organic drug-like compounds and achieve impressive accuracies with a root mean square error (RMSE) of 0.62. [18] Nevertheless, the use of the GSE requires the measured octanol-water partition coefficient as well as the measured melting point of a compound to provide reliable results. Other tools use previously calculated topological indices like E-states to describe molecular structures in combination with a neural network and achieved an RMSE of 0.67 for fixed sizes of the training and validation set. [19] Even Monte Carlo simulations of solutes in water can be used to obtain regression equations which are able to predict the water solubility of organic compounds with an RMSE of 0.7. [20] In the literature, water solubility prediction accuracies of human experts are also available, who achieved an RMSE of 0.94. [21] The flagship of the solubility prediction algorithms for drug-like compounds are computationally intensive molecular graph based methods achieving RMSEs as low as 0.58. [22] In contrast to the more advanced and not directly interpretable representation methods used by other algorithms, 2D representations of molecules as used in this work are widely used to predict molecule properties in teaching and chemical research. A trained chemist can easily derive molecule properties by analyzing a structure and its functionalities. To predict, for example, the water solubility of an organic compound, a chemist would intuitively count the number of polar groups, consider the size of the molecule, and estimate the water-molecule interaction by counting the number of H-bond donors and acceptors.
All of these variables influence the water solubility, and empirical rules, such as Lipinski's "Rule of Five," take them into account. [23] Therefore, 2D molecule representations and their directly accessible features should provide enough information to predict a molecule's structure-dependent properties.
To validate this thesis, different types of artificial neural networks (ANN) will be evaluated on their performance in a water solubility regression task and a clustering task. For the clustering, a combination of a convolutional autoencoder [24] and UMAP [25] will be used for dimension reduction, followed by labeling via HDBSCAN. [26] The ESOL water solubility regression task will be performed using five different approaches:

1) a dense neural network (DNN), [27]
2) a convolutional neural network (CNN), [27]
3) a CNN and a DNN trained on augmented data, [28]
4) a neural network pretrained by transfer learning on GDB-17, [29]
5) a neural network trained on four structure-derived chemical descriptors.
Thereby, a conceptually simple method is presented that can easily be implemented by a chemist who is not explicitly trained in programming, while yielding a prediction accuracy comparable to other, more complex approaches. Using unique 2D representations of molecular structures in combination with a successfully applied transfer learning approach is a straightforward process, and the generation of 2D molecular representations is a common and computationally inexpensive step. In contrast to the implementation of a two-step pipeline, which first computes molecular fingerprints or descriptors for every compound and then feeds them into a ML algorithm, the direct use of molecular structures in the form of common 2D representations is a streamlined and desirable process. The resulting clusters will be critically evaluated by their chemical usefulness. The results of the regression task will be compared to the MoleculeNet [16] benchmark results. Furthermore, the influence of the ANN size and the image size will be discussed.

Datasets
All regression tasks were performed with the ESOL dataset published by John S. Delaney [17] and curated by MoleculeNet. [16] Among other information, the dataset contains the water solubility of 1128 organic small molecules in the form of the logarithmic maximum dissolved concentration (log(sol)). All compounds are represented as SMILES strings; on average, a compound contains 13 heavy atoms, has a molecular weight of 203.5 g mol−1, and has a logarithmic solubility of −3.11. Moreover, the log(sol) distribution approximately follows a Gaussian bell curve.
All transfer learning approaches were realized with the GDB-17 dataset [29] created by Reymond et al. This dataset contains SMILES string representations of 50 million organic small molecules and no molecule descriptors or labels.

Pre-Processing
All molecules in the MoleculeNet [16] ESOL dataset are represented by SMILES strings. The SMILES strings were converted to quadratic RGB images via the Python RDKit environment. [30] To reduce their dimensionality, the images were converted to gray scale images via OpenCV. [31] The gray scale images were then converted into NumPy [32] arrays, and the pixel values were scaled to the range of 0 to 1 by dividing the arrays by 255. The pre-processing is illustrated in Figure 1.
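The conversion and scaling steps can be sketched without the drawing stage (which requires RDKit). This is a minimal stdlib sketch: the input is assumed to be an H × W × 3 nested list of 0-255 channel values, and the luminance weights are the standard RGB-to-gray coefficients also used by OpenCV.

```python
def to_scaled_grayscale(rgb_image):
    # weighted luminance sum (0.299 R + 0.587 G + 0.114 B), then
    # division by 255 to scale every pixel into the range [0, 1]
    return [[(0.299 * r + 0.587 * g + 0.114 * b) / 255.0
             for r, g, b in row]
            for row in rgb_image]
```

In the actual pipeline the same arithmetic is applied to whole NumPy arrays at once, which is what makes this pre-processing computationally cheap.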

Clustering
The clustering was performed on 28 × 28 pixel images and therefore on 784-dimensional inputs. To reduce the dimensionality of the input images, an autoencoder was trained using a transfer learning approach. [33] Transfer learning approaches are a common way of pretraining and presetting the weights of deep neural networks to achieve high accuracies on small datasets. To achieve this, a large available dataset with features similar to the original, smaller dataset, such as the GDB-17 dataset, [29] is used to pretrain the desired network without using the original data. After sufficient training, the pretrained deep neural network can then be further trained on the original dataset, leading to higher accuracies and less overfitting. [33] The autoencoder consists of 12 layers, using convolutional layers for feature extraction and pooling layers for dimension reduction. An 84% reduction of dimensions was realized by utilizing three sequential blocks of convolution and pooling layers. Each block consisted of one 3 × 3 convolution layer followed by a 2 × 2 MaxPooling layer. The first block contained a 16-node convolution layer; the second and third blocks each contained an 8-node convolution layer. After dimension reduction to 128 dimensions, each image was expanded back to 784 dimensions for comparison with the input image. This was done by using three blocks of 3 × 3 convolution layers and 2 × 2 UpSampling layers. The architecture of the autoencoder is shown in Figure 2.
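The stated dimension reduction can be verified with a short calculation. This sketch assumes 'same' padding throughout, so the 3 × 3 convolutions preserve the spatial size and each 2 × 2 pooling layer halves it, rounding up; under those assumptions the three blocks map 28 × 28 × 1 inputs to a 4 × 4 × 8 bottleneck of 128 dimensions.

```python
import math

def encoder_shape(size=28, block_channels=(16, 8, 8)):
    """Trace the spatial size through blocks of [3x3 conv -> 2x2 max-pool]."""
    for channels in block_channels:
        # 'same'-padded 3x3 convolution keeps the spatial size;
        # 2x2 max-pooling halves it, rounded up
        size = math.ceil(size / 2)
    return size, size, channels
```

encoder_shape() returns (4, 4, 8), i.e., 128 dimensions, matching the stated reduction of the 784 input pixels by roughly 84%.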
The obtained autoencoder was trained on 54 000 molecules (validation split 0.1) from the GDB-17 [29] dataset for 600 epochs with an Adagrad optimizer, [34] resulting in a binary cross entropy loss value of 0.26. Further dimension reduction was implemented using UMAP, [25] decreasing the number of dimensions from 128 to plottable 2D outputs. The obtained outputs were then clustered via HDBSCAN, [26] and the resulting clusters were plotted in a colored scatter plot.

CNN/DNN Comparison
To evaluate the performance of different sized DNNs and CNNs on various sized quadratic images, multiple ANNs were built. The architecture of the ANNs is illustrated in Figure 3. The influence of image size was analyzed for three different DNNs and two different CNNs on three different image sizes (14 × 14, 28 × 28, 56 × 56). The influence of the ANN size was evaluated by varying the number of nodes in the dense layer (n_d) or the convolutional layers (n_conv.1, n_conv.2).
Each DNN consisted of a flatten layer to reshape the quadratic (p, p, 1) shaped images to a 1D tensor. The number of nodes in the hidden dense layer was varied from n_d = 32 to n_d = 128, and the layer was L2-kernel regularized [35] to prevent overfitting. All models used the Adam optimizer [36] and the mean squared error (MSE) loss function. [37] Activation of the dense layers was realized via the ReLU activation function. [38] The CNNs consisted of two sequential 3 × 3 convolution layers with varying numbers of nodes (n_conv.1 = 32, 64; n_conv.2 = 16, 32). A flatten layer was added to reshape the resulting multidimensional output into a 1D tensor before feeding it to the output layer.
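The parameter counts behind the size discussion can be reproduced with a one-line estimate for the DNN variant (flatten, one hidden dense layer of n_d nodes, one output node). This is a sketch under the assumption of Keras-style counting, one bias per node; other frameworks or configurations may count differently.

```python
def dnn_parameters(p, n_d):
    # hidden layer: p*p inputs times n_d weights, plus n_d biases;
    # output layer: n_d weights plus one bias
    return (p * p * n_d + n_d) + (n_d + 1)
```

Doubling the image edge length roughly quadruples the parameter count, which is the quadratic growth discussed in the results section below.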

Data Augmentation CNN/DNN
The influence of data augmentation [39] was evaluated for a DNN and a CNN model. The architecture of those models is shown in Figure 4. Augmentation of the dataset was done by randomly flipping the images horizontally and vertically and rotating them in a range from 0° to 15°. Each image was fed through the ANN only once per epoch and was randomly changed before each epoch. All dense layers in the ANNs are L2-kernel regularized [35] and activated via the ReLU activation function. [38] To prevent overfitting, additional dropout layers [40] were added with a dropout rate of 0.5. The augmented CNN is identical to the CNN (n_conv.1 = 64 and n_conv.2 = 32) shown before.
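The flip part of the augmentation can be sketched in plain Python (the small random rotation used here additionally would need an image library and is omitted). In this sketch the image is a 2D list of pixel rows and rand is any callable returning floats in [0, 1):

```python
def augment(image, rand):
    if rand() < 0.5:
        image = [row[::-1] for row in image]  # horizontal flip
    if rand() < 0.5:
        image = image[::-1]                   # vertical flip
    return image
```

Drawing a fresh random transformation before each epoch means the network never sees exactly the same input twice, even though each underlying image passes through the ANN only once per epoch.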

Transfer Learning Model
To improve the CNN prediction accuracy, a deep CNN model was constructed via transfer learning. For pretraining of the model, the GDB-17 dataset [29] was used. The model was pretrained on 360 000 compounds, predicting the exact molecular weight. After being trained for three training episodes ([60 000 images, 40 epochs], [120 000 images, 40 epochs], [180 000 images, 18 epochs]), a mean absolute error of ±3.65 g mol−1 was reached. The last eight layers of the pretrained model were truncated and substituted by a flatten layer, followed by a dense layer, a dropout layer, and a single-node dense layer. The resulting model was trained for 100 epochs on 80% of the ESOL dataset [17] and validated on 10%. The architectures of the pretrained model and the solubility prediction model are shown in Figure 5.

Model Trained on Expert Structure Derived Features
The last and smallest model was built by manually preselecting important features accessible by visually analyzing the molecular structure. The input consisted of four features: the molecular weight, the number of heteroatoms, the number of H-bond acceptor groups, and the number of H-bond donor groups. All the mentioned features should have an impact on the water solubility and are considered in Lipinski's "Rule of Five." [23] The architecture of the model is shown in Figure 6.
Scaling of the input data was achieved by dividing each feature by its maximum value, so that each feature lies in the range of 0 to 1. All dense layers, except the output layer, are L2-kernel regularized [35] and activated via the ReLU activation function. [38] The resulting model was trained for 4000 epochs on 80% of the ESOL dataset [17] and validated on 10%.
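The described max-scaling can be sketched as follows. The rows below are hypothetical feature vectors of [molecular weight, number of heteroatoms, H-bond acceptors, H-bond donors], and the maxima are assumed to be taken over the training set:

```python
def scale_by_maximum(rows):
    # divide every feature column by its maximum so all inputs lie in [0, 1]
    maxima = [max(column) for column in zip(*rows)]
    return [[value / m for value, m in zip(row, maxima)] for row in rows]
```

Note that this simple scheme assumes strictly positive column maxima and, unlike standardization, keeps zero counts at exactly zero.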

ESOL Clustering
The scatter plot of the dimension-reduced data is shown in Figure 7. All 1128 compounds of the ESOL dataset are represented as dots colored according to their assigned cluster. In total, 16 clusters of various sizes were identified, and their chemical usefulness was evaluated. It should be noted that the cluster assignment itself is an unsupervised process and therefore requires no labels.
Clusters far apart from each other are structurally more diverse than clusters in direct neighborhood. Small and condensed clusters like clusters 7 and 16 are more likely to be structurally uniform, in contrast to large and widely spread clusters such as cluster 9. Clusters 0 and 1 are separated from the rest of the clusters and in direct vicinity to each other; they should therefore be uniform among themselves and structurally different from the rest of the clusters. The same trend is observable, in a less distinct manner, for clusters 2 and 3.
Example compounds of the four distinct clusters (0, 5, 2, 15) are shown in Figure 8. Example compounds of cluster 9, a widely spread cluster, are also provided to examine their difference.
As illustrated in Figure 8, compounds assigned to cluster 2 are mainly linear hydrocarbon chains with few heteroatoms and only small, unbulky side chains. Cluster 2 is in direct neighborhood to cluster 3 and should therefore be structurally similar. Cluster 0 contains mainly 5- and 6-membered carbocycles with a 1- or 1,3,5-substitution pattern. Cluster 5 contains sphere-shaped molecules, mainly unsubstituted 5- and 6-membered carbocycles or heterocycles. Compounds similar in 2D shape, like condensed ring systems, are also observable. Cluster 15 was merged with cluster 16, containing larger cyclic systems, such as condensed carbocycles. Additionally, carbon-chain-linked bicyclic systems were labeled as cluster 15. The widely spread cluster 9 primarily contains Y-shaped molecules and is therefore sensible only in terms of 2D shape. The contained molecules are very diverse in structure and only uniform in their 2D shape. This illustrates a major drawback of using images as inputs for chemical clustering, namely clusters similar in 2D shape but not in chemical motifs or substructures.
Chemical descriptors and statistics of all assigned clusters are given in Table 1, which contains the mean molecular weight, the mean number of heteroatoms, the mean number of aromatic cycles, and the mean number of cyclic systems of each cluster. For every variable the standard deviation is given to evaluate its uniformity. The chosen descriptors are not part of the clustering process itself and only serve to evaluate the results of the unsupervised clustering.
As depicted in Table 1, the clustering is well suited for the differentiation of substituted benzenes: clusters 1, 4, 7, and 11 contain benzenes with different substitution patterns and different sized substituents. The distinction between linear and cyclic systems is also possible, and even condensed aromatic and condensed saturated cyclic systems can be distinguished (clusters 13, 12, and 15). This leads to low standard deviations and high cluster uniformity, as listed in Table 1.
Larger deviations in the number of aromatic systems or the number of cyclic systems can be explained by similar visual shape. As an example, the upper right compound of label 0 in Figure 8 resembles the shape of the monosubstituted cyclic systems and is therefore assigned to cluster 0, but in fact contains eight cyclic systems. Such wrong assignments have a major impact on the standard deviation of small clusters. The large deviation in the number of heteroatoms is conspicuous, and it seems that the clustering does not use the number of heteroatoms for the differentiation of the compounds.
Also problematic are the three clusters (8, 9, and 14) which are not chemically sensible. These clusters are only similar in 2D shape and not in substructures or chemical motifs. Therefore, they show no usefulness in chemical contexts.

Influence of the Image Size on DNN/CNN Accuracy
To evaluate the influence of the image size on the water solubility prediction accuracy, different sized images (14 × 14, 28 × 28, 56 × 56) were used. All DNN models were trained for 100 epochs and all CNN models for 40 epochs, resulting in almost converged loss and test mean absolute error (MAE) values. Using different sized images resulted in trained models of varying accuracy. It is shown that the assumption that larger images lead to a more accurate DNN prediction is wrong. In fact, an optimal size of 28 × 28 pixels was observed by evaluating the mean absolute and mean squared errors of the test set. The influence of the image size on test set MAE, loss function, and number of parameters for the DNN architectures is shown in Figure 9.
Considering the loss function (mean squared error) of the training set, the MSE of log(sol) decreases with increasing image size and increasing n_d. The intersection of the curves for n_d = 128 and n_d = 64 could be explained by statistical fluctuations, but the general trend remains clear.
A different influence of the image size, with varying minima, is shown for the test set MAE. The global minimum of the test set MAE (1.3) is reached at an image size of 28 × 28 pixels and the largest layer size of n_d = 128 nodes. For both models, n_d = 128 and n_d = 64, the test set MAE increases with increasing image size after passing their local minima.
The trend of the test set MAE and the loss function could be explained by the quadratic increase of the number of parameters with image size and therefore the rapidly increasing entropic capacity of the DNN models. The larger number of parameters gives the model more capacity to memorize the training set and impairs its generalization ability. [27] Only if the amount of information obtainable from the images scaled comparably with the number of parameters would an extension of the image size positively affect the prediction accuracy.
The influence of the layer size was further analyzed by increasing the number of nodes n_d of the best performing DNN from 128 to 1024, resulting in a test MAE of only 1.2. Therefore, it seems that the influence of n_d is not as significant as the influence of the image size.
The influence of the image size on the test set MAE and the loss function for the CNN models is shown in Figure 10. For the influence of the image size on the training set loss function, a similar trend as for the DNNs is observable: increasing image size, as well as increasing n_conv.1 and n_conv.2, leads to a decrease of the loss value. In comparison to the DNNs, the test set MAE decreases with increasing CNN capacity and image size as well. Considering the huge success of image classification CNNs such as ImageNet, [41] the fact that the CNNs are nevertheless outperformed by the DNNs is conspicuous and could be caused by the small number (902) of available training compounds of the ESOL dataset. To enhance the performance of the CNN and to compensate for the lack of available training compounds, alternative approaches such as data augmentation and transfer learning were studied.

Figure 8. Example compounds of the distinct clusters 2, 0, 5, and 15. Compounds of cluster 9 are shown to illustrate the difference between dense, narrow clusters and widely spread clusters. Label 2 = "linear hydrocarbon chains," Label 0 = "5- and 6-membered carbocycles with 1- and 1,3,5-substitution pattern," Label 5 = "monocyclic, unsubstituted cyclic systems and condensed carbocycles," Label 15 = "large condensed carbocycles and hydrocarbon chain linked bicyclic systems," Label 9 = "Y-shaped molecules."

Data Augmentation as Compensation for Small Datasets
Data augmentation for the DNN model resulted in a slight improvement of the prediction accuracy (test MAE = 1.28 compared to 1.30). The augmented DNN model was also able to learn for more epochs (2000 vs 100). Considering the minor improvement of the log(sol) prediction accuracy and the longer computing time, data augmentation seems impractical for this model and task. Evaluation of the CNN model and its performance with augmented data showed no improvement in solubility prediction accuracy; the accuracy even decreased with data augmentation (test MAE = 1.61 compared to 1.46). Figure 11 illustrates the training and validation curves of the augmented DNN and CNN.
The improvement of the DNN could be explained by its stronger dependency on feature orientation and the increased number of learnable features introduced by rotating and flipping the training images. The minor size of the improvement suggests that the orientation of chemical structures plays only a small role in feature extraction from images.
Considering the more orientation-independent feature extraction ability of CNNs, it is understandable that their prediction accuracy did not improve. The increase of the test set MAE for the CNN could, however, be explained by untypical representations in the training images: the compound images were all produced by the same RDKit algorithm, and the augmented training set may be too different from the unaugmented test set, leading to an increased test MAE.
Since data augmentation led to no major improvements of the solubility prediction accuracy in this scenario, a transfer learning approach was implemented.

Transfer Learned CNN
Creating a deep CNN without preselected weights and pretrained layers can be a challenge for small datasets. Transfer learning is one approach to avoid collecting more data while still achieving high performance models with little data. A deep CNN predicting the exact molecular mass was trained using the GDB-17 [29] dataset. The top layers were then replaced by untrained dense layers, and the obtained model was trained on 80% of a randomly split ESOL dataset. The lowest test MAE of 0.87 for logarithmic water solubility predictions was achieved after 100 epochs. The log(sol) prediction error distribution and the predicted-versus-true water solubility scatter plot are shown in Figure 12.
The most common prediction error is around 0, and the errors range from ≈−4 to 3. The shape of the error distribution resembles a Gaussian bell curve. Therefore, most of the predicted values closely correspond to the true log(sol) values, with outliers negatively affecting the test set MAE.

Expert Selected Feature Model
The last and simplest model was trained on chemist-preselected features important for water solubility. The input tensor consisted of the molecular mass, the number of heteroatoms, and the numbers of H-bond donor and acceptor functionalities. Consequently, only four scaled values (between 0 and 1) were used to represent each compound. The resulting DNN consisted of only 33 trainable parameters and was trained in under 15 min on a randomly split ESOL dataset (80/10/10, train/val./test) for 4000 epochs on a laptop CPU, in contrast to the Tesla K80 GPU used before. The lowest test set MAE achieved was 0.78, demonstrating the importance of expert supervision: the simplest DNN, trained on preselected chemical features visually obtainable by a chemist, outperforms the complex transfer learned CNN. The results of the expert selected feature model are shown in Figure 13.
Fewer outliers are observable in the scatter plot of the expert selected feature model in comparison to the transfer learning model. However, the most common prediction error shifted from around 0 to ≈−0.5, and a second common prediction error of 1 is noticeable.
In this context, the significance of expert supervision for machine learning is shown, demonstrating the role of ML algorithms as an expert tool that reaches its full potential when combined with expertise.

Conclusion
It was shown that 2D images of molecules, as used by chemists for decades, are feasible inputs for machine learning tasks. The combination of a transfer learned CNN and 2D molecule representations resulted in a model that can compete with other conventional methods but is less accurate in the prediction of the water solubility of organic compounds than more advanced graph-based methods. [16] Furthermore, the relevance of expert supervision was demonstrated by creating a very simple DNN (33 total parameters) trained on expert-preselected, visually accessible features. This model was able to outperform the more complex CNN in computing time and prediction accuracy. An overview of all discussed results is given in Table 2, supplemented by the results of the MoleculeNet ESOL benchmark (conventional methods: RMSE 0.99; graph-based methods: RMSE 0.58). [16]