Prediction of Nanoparticle Sizes for Arbitrary Methacrylates Using Artificial Neuronal Networks

Abstract Particle sizes represent one of the key factors influencing the usability and specific targeting of nanoparticles in medical applications such as vectors for drug or gene therapy. A multi‐layered graph convolutional network combined with a fully connected neuronal network is presented for the prediction of the size of nanoparticles based only on the polymer structure, the degree of polymerization, and the formulation parameters. The model is capable of predicting particle sizes obtained by nanoprecipitation of different poly(methacrylates). This includes polymers the network has not been trained with, indicating the high potential for generalizability of the model. By utilizing this model, a significant amount of time and resources can be saved in formulation optimization without extensive primary testing of material properties.


Materials and instrumentations
All monomers as well as the initiator (2,2′-azobis(2-methylpropionitrile), AIBN) and the RAFT agent (2-cyano-2-propyl benzodithioate) were purchased from Sigma-Aldrich. Methanol was received from Thermo Fisher Scientific, and toluene and dimethylformamide from Acros Organics. Dimethylformamide (DMF) was dried over a molecular sieve under a nitrogen atmosphere. The liquid monomers were destabilized over a short AlOx column (neutral AlOx, obtained from Molecula) before use. The dialysis tubing was purchased from Spectrum Labs (Spectra/Por™, pre-wetted tubing, 3.5 kDa) and was rinsed with water before use. For the nanoparticle preparation, poly(vinyl alcohol) (PVA, Mowiol® 4-88) was purchased from Sigma-Aldrich, and tetrahydrofuran (THF) was obtained from VWR and purified using a solvent purification system (SPS; Pure Solv EN, Innovative Technology).
Nuclear magnetic resonance spectra were measured using Bruker AC 300 (300 MHz) and Bruker AC 400 (400 MHz) spectrometers at 298 K unless stated otherwise. Chemical shifts are given in parts per million (ppm, δ scale) relative to the signal of the deuterated solvent.
The data evaluation as well as the model training was performed on a custom-built desktop computer with a 14-core Intel(R) Core(TM) i9-9940X CPU @ 3.30 GHz and an Nvidia GeForce RTX 2080 Ti. The model training was performed on the GPU via the Nvidia CUDA API.

Polymer preparation
For the preparation of the nanoparticles, different methacrylate-based homopolymers with variable side groups and different degrees of polymerization were utilized. All polymers were synthesized via RAFT polymerization using a standard procedure.

Nanoparticle formulation
The formulation of the nanoparticles via high-throughput nanoprecipitation was performed using a FasTrans liquid handling robot (Analytik Jena AG, Jena, Germany). The nonsolvent was either ultrapure water (Thermo Scientific GenPure ultrapure water system) or water containing 0.25% (w/v) PVA (see Table S3). The pipetted volume of the polymer solution was varied in order to keep the final nanoparticle concentration constant (below 1 mg mL−1). With this constant nanoparticle concentration, concentration-dependent destabilizing effects and enhanced aggregation of the nanoparticles were avoided. Furthermore, misleading DLS measurements due to high particle concentrations, which often result in multiple scattering and shifted size reports, were prevented.

Dynamic light scattering measurements
The

Data preprocessing
Since not all formulations (n = 3753) worked in a suitable manner, non-usable data and outliers had to be removed before training the model. For example, some measurements showed particle sizes of up to a few hundred µm, probably due to precipitation of the polymer material or impurities in the well. During preprocessing, all measurements with particle sizes above a fixed threshold (>500 nm) [1] were removed from the sample set (n = 3454).
Figure S11: Change of particle size distribution after removing measurements with large particle sizes.
Afterwards, all samples with a dispersity larger than a fixed threshold (>0.3) [2] were removed from the remaining set, resulting in a data set of nanoparticles that should meet the criteria of a quality formulation (n = 3093). Since each formulation was performed nine times, outliers can be detected via Grubbs' test, resulting in an outlier-free dataset (n = 3052). Some sampling groups contain only one or two samples, which might be prone to non-detectable outliers or are remainders of groups with poor-quality formulation results. All formulation groups containing fewer than three samples were removed (n = 3010) to prevent the model from training on such data.
Figure S15: Exemplary change of the dataset (P8) when the group size is smaller than 3. In this example, for the polymer with PVA, the sample groups at 3 g/L and 15 g/L were removed.
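With nine replicates per formulation, Grubbs' test is straightforward to apply. The following is a minimal sketch of a two-sided Grubbs' test, not the paper's actual preprocessing code; the function name is illustrative, and SciPy is assumed for the t-distribution quantile.

```python
import numpy as np
from scipy import stats

def grubbs_outlier(values, alpha=0.05):
    """Two-sided Grubbs' test; returns the index of the detected outlier,
    or None if no value exceeds the critical G at significance alpha."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    if n < 3:
        return None                      # the test is undefined below n = 3
    mean, sd = x.mean(), x.std(ddof=1)
    if sd == 0.0:
        return None                      # all replicates identical
    idx = int(np.argmax(np.abs(x - mean)))
    g = abs(x[idx] - mean) / sd          # largest studentized deviation
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx if g > g_crit else None

# nine replicate size measurements (nm); the 250 nm value is the outlier
print(grubbs_outlier([101, 99, 103, 98, 102, 100, 97, 250, 101]))
```

Applied per formulation group, any flagged replicate is dropped before the group-size filter described above.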
As a standard rule, particle diameters increase with increasing polymer concentration. [3] If the particle size at a low concentration is larger than at the next higher concentration, it can be assumed that these formulations did not work and no particles formed, but only artifacts were measured. Analogously, if particle sizes drop at higher concentrations, this can indicate precipitation of the polymer, and only small residual particles of the polymer were measured. Samples showing either of these characteristics were excluded from the set (n = 2823). Finally, all polymers with fewer than two valid concentrations were removed entirely from the training data (n = 2813), since this indicates that these polymers are not suitable for nanoparticle formulation under the given conditions.

In the graph convolution, a new feature vector is created for each node. Its values are calculated by combining the feature vectors of the previous node, the next node, and the node itself (via an aggregation function f, e.g., averaging), multiplying the result with a weight matrix W, and passing it into an activation function σ:

h_i' = σ(W · f(h_{i−1}, h_i, h_{i+1}) + b)    (1)

Since W is independent of the node index, equation 1 applies unchanged to every node of the infinite periodic graph representation. The same is true for the reduced cyclic representation, since the feature vector preceding the first node equals that of the last node due to the cyclic connection of the first and the last node, and, thus, equation 1 gives the same results for the circular and the linear representation.
As a practical proof of concept, a simple graph convolutional network (see notebooks/circular_proof.ipynb in the software repository) was designed, whose output is an array of 128 values. The network was not trained, and all weights and biases were left in their randomly initialized state. The resulting fingerprints have no intrinsic meaning, but the Euclidean distance between two fingerprints correlates with the similarity of the inputs (Figure S18). The Euclidean distance between the cyclic representation of an arbitrary linear polymer (PMMA was used in the example) and the full linear graph representation of the same polymer converges to 0 with an increasing number of repeating units. As a result, the cyclic representation is equivalent to an infinitely long linear chain. Furthermore, the computation times for the generation of the fingerprints were recorded: the cyclic representation offers a significant performance advantage over the linear one, whose computation time increases exponentially with the number of repeating units.
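The convergence argument can be illustrated with a small NumPy sketch (this is not the repository notebook; the weights are random, tanh is used as the activation, and the chain ends of the linear graph are zero-padded as a simplification). With identical repeating-unit features, the mean-pooled fingerprint of the cyclic chain is independent of the chain length, while its Euclidean distance to the linear-chain fingerprint shrinks as the number of repeating units grows, because the boundary effect is averaged over ever more nodes.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                    # feature length per node
WS = [rng.normal(size=(D, D)) * 0.3 for _ in range(3)]  # 3 untrained layers
h0 = rng.normal(size=D)                  # features of one repeating unit

def conv(H, Wk, cyclic):
    """One untrained convolution: average previous/self/next node features,
    multiply with the shared weight matrix, apply the activation. For the
    linear chain, the missing end neighbours are zero-padded."""
    prev, nxt = np.roll(H, 1, axis=0), np.roll(H, -1, axis=0)
    if not cyclic:
        prev[0] = 0.0                    # first node has no predecessor
        nxt[-1] = 0.0                    # last node has no successor
    return np.tanh((prev + H + nxt) / 3.0 @ Wk)

def fingerprint(n, cyclic):
    """Mean-pooled node features of a chain of n identical repeating units."""
    H = np.tile(h0, (n, 1))
    for Wk in WS:
        H = conv(H, Wk, cyclic)
    return H.mean(axis=0)

for n in (10, 50, 200):
    dist = np.linalg.norm(fingerprint(n, True) - fingerprint(n, False))
    print(n, dist)                       # distance shrinks with growing n
```

Only a fixed number of nodes near the chain ends ever differ from their cyclic counterparts, so the pooled distance decays like 1/n.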

Conversion of structural data to a node-featurized graph representation
To bring the chemical structures into a data structure that can be used as input for the model, simplified molecular-input line-entry system (SMILES) representations were initially used to create a machine-readable representation of the molecules. [4] For this, a diradical SMILES representation of each repeating unit was created in such a manner that a concatenation of the repeating units yields the linear polymer chain, and connecting the last repeating unit back to the first yields the corresponding cyclic representation with 6 repeating units.
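The concatenation scheme can be sketched with plain string operations. The diradical repeating-unit SMILES for MMA below and both function names are illustrative, not taken from the repository; the real pipeline additionally handles end groups and validates the strings with RDKit.

```python
# Sketch: building linear and cyclic polymer SMILES by string concatenation
# (assumption: MMA repeating unit -CH2-C(CH3)(COOCH3)- as a diradical SMILES).

def linear_polymer_smiles(repeat_unit: str, n: int) -> str:
    """Concatenate n diradical repeating-unit SMILES into a linear chain."""
    return repeat_unit * n

def cyclic_polymer_smiles(first: str, middle: str, last: str, n: int) -> str:
    """Close the chain via ring-bond label 1 between the first and the last
    backbone atom, giving the macrocyclic representation."""
    assert n >= 2
    return first + middle * (n - 2) + last

MMA = "CC(C)(C(=O)OC)"          # plain repeating unit
MMA_FIRST = "C1C(C)(C(=O)OC)"   # ring-bond label 1 on the first CH2
MMA_LAST = "CC1(C)C(=O)OC"      # label 1 closes on the last quaternary C

print(linear_polymer_smiles(MMA, 3))
print(cyclic_polymer_smiles(MMA_FIRST, MMA, MMA_LAST, 6))
```

Each fragment appends its first backbone carbon to the previous fragment's quaternary carbon, so the concatenated string directly encodes the backbone connectivity.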

Featurization of the cyclic polymer representations
The resulting macrocycle was then transformed into an rdkit.Mol object using the RDKit software. [5] For each atom, a feature vector was created using different featurization methods (Table S5), and the result was stored in a PyTorch Geometric [6] Data object, which was used as input to the graph convolutional layers. The featurization is repeated for every atom (including hydrogens not depicted in the structure) in the structural input, resulting in one feature vector per atom. All feature vectors are stacked into the feature matrix X. The feature matrix is multiplied with the symmetrically normalized adjacency matrix and the randomly initialized weight matrix; a bias is added to the resulting matrix, and the sum is passed into an activation function, resulting in a new set of features, which can be used as input for the next layer or pooled into the final output.
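The layer update described above can be written as H' = σ(D^(−1/2)(A + I)D^(−1/2) X W + b). A minimal NumPy sketch of this update, using a toy ring and one-hot features rather than the paper's featurization:

```python
import numpy as np

def gcn_layer(X, A, W, b):
    """One graph-convolutional layer: the feature matrix X is multiplied with
    the symmetrically normalized adjacency matrix (self-loops added) and a
    weight matrix, a bias is added, and an activation (ReLU) is applied."""
    A_hat = A + np.eye(len(A))                 # add self-loops
    d = A_hat.sum(axis=1)                      # node degrees
    A_norm = A_hat / np.sqrt(np.outer(d, d))   # D^(-1/2) A_hat D^(-1/2)
    return np.maximum(A_norm @ X @ W + b, 0.0)

# toy macrocycle: 4 atoms in a ring, one-hot atom features (illustrative)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.eye(4)
rng = np.random.default_rng(1)
H = gcn_layer(X, A, rng.normal(size=(4, 6)), np.zeros(6))
print(H.shape)   # each atom now carries a 6-value feature vector
```

Stacking such layers and pooling the rows of the final H yields the molecular fingerprint used downstream.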

The model design
The model was implemented using PyTorch [7] as the base framework and PyTorch Geometric [6] as an additional library for graph convolution. The base model consists of an arbitrarily large number of graph convolutional layers, each with a specific feature output length (the length of the feature vector of each node after the graph convolution).
Since each convolution kernel is applied to each node and its directly attached neighbors, it takes n convolutional layers to pass feature information over n edges. Together with the additional input data (the absence or presence of PVA, the degree of polymerization of the polymer, and the concentration of the polymer), the fingerprint is fed into a multi-layer fully connected neuronal network. The network consists of two hidden layers with output sizes of 32 and 16. The final output layer returns a single value, the predicted particle size. After each layer, a dropout of 0.1 was applied during training to prevent overfitting. A dropout larger than 0.1 results in underfitting: the training metrics are more than two times worse than the validation metrics and never converge.
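A simplified PyTorch sketch of this architecture follows. The hidden sizes (32, 16) and the dropout of 0.1 are from the text; the class name is invented, the graph convolution is reduced to dense matrix products instead of PyTorch Geometric layers, and the fingerprint length of 128 is taken from the proof-of-concept section.

```python
import torch
import torch.nn as nn

class NPSizeModel(nn.Module):
    """Sketch: graph-conv stack -> mean-pooled fingerprint -> MLP head."""

    def __init__(self, n_node_feats, fp_len=128, n_conv=3, n_extra=3):
        super().__init__()
        dims = [n_node_feats] + [fp_len] * n_conv
        self.convs = nn.ModuleList(
            [nn.Linear(a, b) for a, b in zip(dims, dims[1:])]
        )
        # extra inputs: PVA present/absent, degree of polymerization, conc.
        self.head = nn.Sequential(
            nn.Linear(fp_len + n_extra, 32), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(32, 16), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(16, 1),
        )

    def forward(self, X, A_norm, extra):
        H = X
        for conv in self.convs:
            H = torch.relu(conv(A_norm @ H))   # one message-passing step
        fp = H.mean(dim=0)                     # pool nodes -> fingerprint
        return self.head(torch.cat([fp, extra]))

model = NPSizeModel(n_node_feats=4)
X = torch.rand(6, 4)                     # 6 atoms, 4 features each (toy)
A_norm = torch.eye(6)                    # stand-in normalized adjacency
extra = torch.tensor([1.0, 100.0, 5.0])  # PVA flag, DP, conc. in g/L
size = model(X, A_norm, extra)           # predicted size (untrained)
print(size.shape)
```

With three convolutional layers, feature information travels over at most three edges before pooling, matching the locality argument above.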

Hyperparameter optimization
The initial hyperparameter optimization was performed via the Optuna framework and its implemented default pruning algorithm. [8] In each optimization step, the model was initialized with the respective hyperparameters and trained for a maximum of 200 epochs. The set of hyperparameters was optimized to minimize the mean absolute percentage error. An overview of the tunable hyperparameters, the range or set of each, and the optimized values is listed in Table 2.

Pretrained model overview
In the following section, figures of the training and prediction results of a few models are presented, which can be found in the repository. All overviews consist of:
a) A graph of the evaluation parameters during training, plotted against the training epochs.
b) A plot of the predicted particle sizes against the measured ones, grouped by the SMILES of the repeating unit. Here, polymers for which the model is not good at predicting the particle size can be identified.
c) A plot of the predicted particle sizes against the measured ones, grouped by the quantiles of all data sorted by the error. Here one can see that mostly only a small percentage (high quantiles) of the data shows insufficient prediction. Furthermore, the plot contains the test data as a separate group, including a fully removed polymer class, to evaluate the model's prediction capabilities.
d) Two plots of randomly selected data points, in which the measured and predicted particle sizes are plotted with the error in a bar graph, once relative to the measured values and once with absolute values.
e) A line plot of the testing data, sorted by the prediction error, in which the group with the strongest influence is highlighted in blue and the remainder in orange. The group with the strongest influence is determined by selecting all samples with the same polymer as the sample with the largest error. Subsequently, the mean error of the subgroups in which the degree of polymerization, the polymer concentration, or the additive is the same as in the sample with the largest error is determined, and the one with the highest mean error is marked as the worst-performing group.
f) The same as e) but in order of appearance in the dataset.
g) The same as c) but without the group identified as the worst-performing one.
h) A collection of prediction grids, one for each polymer in the test and training data. Here, particle sizes were predicted for the given polymer with a number of repeating units from 10 to 300 and concentrations from 0 to 30 g/L, with and without poly(vinyl alcohol) as the surfactant. The color at each point indicates the predicted particle size. Additionally, all available measurement data points in the prediction range were plotted as points, again with a color correlating to the particle size. Thus, it is easily identifiable in which regions the particle size predictions differ strongly from the true data.
The model parameters for each model can be found in the respective folder in the repository under /pretrained/np_model_X.
In this model, poly(ethyl methacrylate) (P7-P9) was excluded from the training data. As can be seen in e and f, the prediction error for PEMA is completely within the range of the training data.
In this model, poly(propyl methacrylate) (P10-P12) was excluded from the training data. As can be seen in e and f, the prediction error for PPMA is completely within the range of the training data.
In this model, poly(butyl methacrylate) (P13-P15) was excluded from the training data. As can be seen in e and f, the prediction error for PBMA is completely within the range of the training data. The separation in e is due to the measurement order; however, the groups in f show no separation.
In this model, poly(iso-butyl methacrylate) (P16-P18) was excluded from the training data.
As can be seen in e and f, the prediction error for PiBMA is completely within the range of the training data.
In this model, poly(phenyl methacrylate) (P28-P30) was excluded from the training data. As can be seen in e and f, the prediction error for PPhMA is entirely within the range of the training data.
In this model, poly(tert-butyl methacrylate) (P19-P21) was excluded from the training data. As can be seen in e and f, the prediction error for PtBMA is completely within the range of the training data.
In this model, poly(benzyl methacrylate) (P25-P27) was excluded from the training data. As can be seen in e and f, the prediction error for PBzMA is completely within the range of the training data.
In this case, the model was trained with a fully connected network with only a single hidden layer. As a result, the output is not as accurate but appears to be much smoother. Ideally, this smoothness should also be reachable by deeper networks with more trainable parameters, but many more data points are needed to achieve this.
In this model, only two graph convolutional layers were implemented. The prediction results are acceptable for our structural input but might worsen with increasing structure sizes or more functional units.
In this model, poly(tert-butyl methacrylate) (P19-P21) and poly(phenyl methacrylate) (P28-P30) were excluded from the training data. As can be seen in e and f, the prediction error for both is completely within the range of the training data.