Modeling Uptake of Polyethylenimine/Short Interfering RNA Nanoparticles in Breast Cancer Cells Using Machine Learning

Polyethylenimine (PEI) is one of the most promising nonviral vectors for delivery of short interfering RNA (siRNA) agents into cancer cells. A promising approach that increases the delivery efficiency of PEI is its modification with hydrophobic substitutions. However, the performance of modified PEIs depends on the nature and extent of substitutions. Herein, machine learning algorithms are used on the basis of quantitative structure activity relationship (QSAR) method to predict the cellular uptake of hydrophobically modified PEI/siRNA nanoparticles (NPs) into various cancer cell lines. To this end, 3 different regression models, namely, random forest (RF), multilayer perceptron (MLP), and linear regression (LR), are used. The results show that RF and MLP regression methods have a better performance than the LR method, suggesting that nonlinear models are better estimators when predicting the cellular uptake of PEI/siRNA NPs. Additionally, critical descriptors that have major contributions to cellular uptake are found to be PEI-to-siRNA weight ratio, type of hydrophobic substitution, as well as total numbers of Cs, unsaturated C, and thioester groups on substitutions in each PEI. This study is the first report that predicts cellular uptake with PEI-based carriers, which provides valuable insight into the design of performance-enhancing hydrophobic substituents on PEIs.
membranes to induce the gene silencing. Both viral and nonviral carriers can be used to deliver polynucleotides into cells. Among nonviral carriers, cationic polymers have the advantage of being easily modified with other functional groups, making it feasible to tailor their properties for different applications. [6] Polyethylenimine (PEI) is the most promising and extensively investigated cationic polymer for polynucleotide delivery. [7] The efficacy of PEI was found to increase with its molecular weight. [8] However, high molecular weight (HMW) PEIs (≈25 kDa) exhibited significant toxicity. [8] Low molecular weight (LMW) PEIs (<2 kDa) had an acceptable level of toxicity, but their gene delivery efficacy was low. [9] Modification of LMW PEIs with hydrophobic substitutions significantly increased their cellular uptake. [10,11] Various hydrophobic substitutions were used to enhance the efficacy of LMW PEIs, including cholesterol, phospholipids, hydrophobic alkyl groups such as ethyl and octyl, and aliphatic lipids including caprylic, stearic, and linoleic acids (LAs). [12-14] As a representative example, Neamnark et al. [14] studied the delivery and transfection efficiency of 2 kDa PEI modified with different hydrophobic substitutions. They observed that these modifications significantly increased cellular uptake compared with unmodified native PEI. [14] The beneficial effect of the hydrophobic modification was shown to depend on both the type and level of the substitution. [11] A number of studies attempted to address the role of lipids in the delivery efficiency of hydrophobically modified PEIs. [6,8,15,16] Using molecular dynamics (MD) and experimental tools, Meneksedag-Erol et al. [8] studied LMW PEI modified with short propionic acid (PrA), focusing on the role of the substitution level.
They observed the highest surface hydrophobicity and surface charge density of the PEI/siRNA nanoparticle (NP) at an intermediate substitution ratio; a substitution level beyond this optimal value induced migration of PrA toward the NP center and had a deleterious effect on both uptake and silencing. The experimental study of Neamnark et al. [14] showed that, at a similar substitution level, PEI modified with LA had a higher transfection efficiency than PEI modified with caprylic acid (CA). These two hydrophobes differ in their length, i.e., the number of hydrophobic carbons. These studies highlighted the dependence of PEI's delivery performance on the properties of the grafted lipids. However, an exhaustive parametric study, experimental or numerical, of all PEIs and their derivatives is time consuming and impractical. In addition, a significant amount of experimental data on modified PEIs already exists in the literature but has not been critically analyzed to explore universal relationships between the physicochemical features of the carriers and their efficiency in delivering polynucleotides.
Machine learning methods can help determine complex relationships between participating factors and desired targets, providing a means to predict the outcome without the need to perform extensive testing. In recent years, chemoinformatics methods including quantitative structure activity/property relationship (QSAR/QSPR) have been utilized to predict the activities/properties of a given compound as functions of its molecular substituents. [17] The core assumption in these methods is that variation in the biological activities of a compound is correlated with changes in its molecular structure. To date, there have been only two studies that utilized QSAR method to correlate molecular properties with cellular uptake of NPs. Both studies examined the cellular uptake data of cross-linked iron oxide (CLIO) NPs from Weissleder et al. [18] Fourches et al. [19] developed a QSAR model to predict the uptake of CLIO NPs, with a variety of small organic molecules decorating their surface, by human pancreatic cancer cells (PaCa2). The authors used 150 2D descriptors for 109 organic compounds, which included surface area, physicochemical properties (such as the net charge and hydrophilicity), Kier and Hall connectivity indices, kappa shape indices, atom and bond counts, adjacency and distance matrix descriptors, molecular charges, and pharmacophore feature descriptors. A fivefold cross validation k-nearest neighbors (kNN) regression was used as the prediction algorithm. The performance of their model under optimum condition resulted in a R 2 value of 0.77. It was proposed that higher cellular uptake was associated with higher lipophilicity of the organic molecules bound to the CLIO NP. Using the same data set, Winkler et al. [20] predicted the cellular uptake of CLIO NPs into PaCa2 and human umbilical vein endothelial cells (HUVECs) based on machine learning methods. 
The data set, containing 108 data points, was separated into a training set (87 data points) and a test set (21 data points). Two-dimensional DRAGON descriptors were used for the decorated CLIO NPs. Using these descriptors, linear and nonlinear nano-QSAR models were developed. Eleven DRAGON descriptors were utilized for the uptake of NPs into HUVEC cells, and the $R^2$ values for the best fits were 0.63 and 0.66, respectively, for the linear and nonlinear models. For the uptake of NPs into PaCa2 cells, 19 DRAGON descriptors were used, which resulted in $R^2$ values of 0.79 and 0.54, respectively, for the linear and nonlinear models. Different molecular descriptors were used to predict cellular uptake in different cell lines, which was related to different uptake mechanisms. As these two studies used a data set that involved only CLIO NPs, the predictive cellular uptake models cannot be applied to more complex systems such as polymeric-nucleotide NPs.
In this work, for the first time, we applied machine learning methods to develop a predictive model where clinical significance (cellular uptake) was used to optimize and improve delivery of PEI/siRNA NPs into breast cancer cells. The data set for this study originated from published studies in our lab since 2011 that utilized different types of breast cancer cells and an evolving pattern of hydrophobically modified LMW PEIs. We selected molecular descriptors that are easy to interpret by chemists, thereby providing informative guidance for the design of effective PEI carriers in gene delivery applications.

Data Set
The experimental cellular uptake data of PEI/siRNA NPs into various breast cancer cell lines including MDA MB231, MCF7, AU565, MDA 468, MDA 435, and MDA 231 were collected from previous publications by our group. [3,21-26] All NPs were tested under similar culture conditions in the same laboratory, thereby enabling direct comparisons across NP formulations. The basic methodology remained the same: a scrambled FAM-labeled siRNA was formulated with PEI carriers, and the uptake was determined after 24 h using flow cytometry. The data set was formed from 197 data points, where PEI in its native form was the base polymer, which was modified with 9 different hydrophobic substitutions including LA, alpha-LA (αLA), thioester linkage LA (tLA), CA, oleic acid (OA), palmitic acid (PA), lauric acid (Lau), stearic acid (StA), and cholesterol (Chol). Molecular structures of the aforementioned hydrophobic substitutions are shown in Figure 1. All these structures, except Chol, differed mostly in terms of their length, type of carbon bonds (saturated versus unsaturated), number of unsaturated carbon bonds, and presence of thioester bonds.

Descriptors
Molecular descriptors were selected to capture the key parameters in the preparation of PEI/siRNA NPs while ensuring easy interpretation by chemists. The descriptors considered and their symbolic notations are shown in Table 1. Three descriptors, namely, $N_c$, $N_{uns}$, and $N_{thio}$, were generated by multiplying two other descriptors, as shown in Table 1. The purpose of this approach was to add complexity and nonlinearity to the descriptors before modeling. The output was the logarithm of the cellular uptake of PEI/siRNA NPs, hereafter referred to as $y$ ($y = \log(\text{cellular uptake})$).
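As a rough illustration of how such product descriptors and the log-transformed output could be computed, the following sketch multiplies per-substitution counts by the substitution level. The function names, the argument names, and the base-10 logarithm are assumptions made here for illustration; the text specifies only that $N_c = n_c \times n_{sub}$ and that $y$ is a logarithm of the uptake, and the analogous product forms for $N_{uns}$ and $N_{thio}$ are assumed.

```python
import numpy as np

def product_descriptors(n_sub, n_c, n_uns, n_thio):
    """Build the three product descriptors from per-substitution counts.

    n_sub  : substitutions per PEI
    n_c    : carbons per substitution
    n_uns  : unsaturated carbons per substitution
    n_thio : thioester groups per substitution (assumed analogous form)
    """
    n_sub = np.asarray(n_sub, dtype=float)
    return {
        "N_c": n_sub * np.asarray(n_c, dtype=float),
        "N_uns": n_sub * np.asarray(n_uns, dtype=float),
        "N_thio": n_sub * np.asarray(n_thio, dtype=float),
    }

def log_uptake(uptake):
    """Log-transform the measured uptake (base 10 is an assumption)."""
    return np.log10(np.asarray(uptake, dtype=float))

# Hypothetical formulation: 2 substitutions per PEI of an 18-carbon lipid
# carrying 2 unsaturated carbons and no thioester group.
example = product_descriptors(n_sub=[2.0], n_c=[18], n_uns=[2], n_thio=[0])
```

Native PEI (no substitution) gives zero for all three products, so the unmodified polymer sits at the origin of this part of the descriptor space.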

Modeling
The data set was divided into training data (147 data points, 75%) and test data (50 data points, 25%), which is a common ratio when the number of data points is relatively low. Training data were used at the modeling stage, whereas the test data were used to evaluate model performance. Three different regression models were used, namely, random forest (RF), multilayer perceptron (MLP), and linear regression (LR). RF is an ensemble of a large number of individual decision trees, where the predicted output is the average of the predicted values from each decision tree. The MLP model was constructed using one input layer, one hidden layer, and one output layer (see Supporting Information, section S1). The activation function for the hidden layer was the "ReLU" function, whereas the activation function for the output layer was the identity function. [27] The nodes in each layer were connected to the nodes in the next layer by weights, which were iteratively adjusted during training to minimize the network error. The network error was calculated as $\frac{1}{2}\|y_{\text{predict}} - y_{\text{actual}}\|_2^2 + \frac{\alpha}{2}\|w\|_2^2$, where the subscript 2 denotes the $\ell_2$ norm, $\alpha$ is a non-negative hyperparameter that penalizes large weights ($\alpha = 0.0001$ in this work), and $w$ is the vector of weights. LR fitted a linear model between $y$ and the independent descriptors to minimize the model error, calculated as $\|y_{\text{predict}} - y_{\text{actual}}\|_2^2$. Using both nonlinear (RF and MLP) and linear (LR) models enabled us to map the relation between the descriptors and $y$ with different levels of complexity.
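A minimal sketch of this setup in Scikit-learn is given below. The descriptor matrix is synthetic, and the number of trees, the hidden layer size, the iteration limit, and the random seeds are placeholders chosen here rather than values reported in this work; only the 147/50 split, the ReLU hidden activation, and $\alpha = 0.0001$ follow the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the 197-point data set (6 made-up descriptors).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(197, 6))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(0.0, 0.05, size=197)

# Hold out 50 points for testing, leaving 147 for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0
)

models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    # One hidden layer with ReLU; MLPRegressor uses an identity output
    # activation and an L2 penalty weighted by alpha.
    "MLP": MLPRegressor(
        hidden_layer_sizes=(20,),  # hidden size is an assumption
        activation="relu",
        alpha=1e-4,
        max_iter=5000,
        random_state=0,
    ),
    "LR": LinearRegression(),
}

# Fit on the training split and score (coefficient of determination) on the test split.
test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_scores[name] = model.score(X_test, y_test)
```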
Table 1 (recovered rows; the remaining rows, including one with the range 0-7, were lost in extraction).
$S$: Type of hydrophobic substitution; none (native PEI) or one of the nine types in Figure 1.
$C_L$: Type of breast cancer cell line; one of the six types described earlier.

Three approaches were used for the modeling, each using a different number of descriptors. In Model type 1, all descriptors in Table 1 were directly used, where the two categorical descriptors, $S$ and $C_L$, were transformed using a dummy encoding scheme (see Figure 2). This created a binary column for each category (see Supporting Information, section S2). Using this approach, these two categorical descriptors were translated into 16 descriptors, 10 for $S$ and 6 for $C_L$. Adding the 10 numerical descriptors in Table 1, the total number of descriptors under Model type 1 was 26. In Model type 2, the total number of descriptors was reduced using a binary encoder, where $S$ and $C_L$ were encoded by 4-digit and 3-digit binary numbers, respectively (see Supporting Information, section S3). The 4-digit binary number for $S$ translated into 4 index descriptors denoted as $I_{\text{sub-1}}$, $I_{\text{sub-2}}$, $I_{\text{sub-3}}$, and $I_{\text{sub-4}}$. Similarly, the 3-digit binary number for $C_L$ translated into 3 index descriptors denoted as $I_{\text{cell-1}}$, $I_{\text{cell-2}}$, and $I_{\text{cell-3}}$. The total number of descriptors under Model type 2 was $10 + 4 + 3 = 17$. Although the smaller number of descriptors lacked the ability to independently describe each type of hydrophobic substitution or cell line, this approach reduced overfitting and potential dimensionality problems. When dimensionality increases, the volume of the descriptor space increases, causing the data points to become sparse, which can be problematic for any model. Using Model type 2, the most informative descriptors were identified as the ones showing more than 10% correlation with cellular uptake, and they were selected for modeling. In Model type 3, the number of descriptors was further reduced from Model type 2 using backward elimination. In this method, independent descriptors were entered into stepwise multilinear regression, followed by the trial deletion of each descriptor using a chosen model fit.
The descriptor (if any) whose removal gave statistically the most insignificant deterioration of the model fit was deleted, and this was repeated until no further descriptors could be eliminated without a statistically significant loss of fit. Afterward, the remaining descriptors were used for modeling. Figure 3 shows the selection of final descriptors for Model type 2 and Model type 3.
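The two encoding schemes used for the categorical descriptors can be sketched as follows. The category order, and therefore the specific bit pattern assigned to each substitution or cell line, is hypothetical here; the actual assignments are given in the Supporting Information and are not reproduced by this example.

```python
# Category lists follow the text; their ORDER is an assumption, so the
# binary codes below will not match the paper's index descriptors.
SUBSTITUTIONS = ["none", "LA", "aLA", "tLA", "CA", "OA", "PA", "Lau", "StA", "Chol"]
CELL_LINES = ["MDA MB231", "MCF7", "AU565", "MDA 468", "MDA 435", "MDA 231"]

def dummy_encode(value, categories):
    """One binary column per category (Model type 1)."""
    return [1 if value == c else 0 for c in categories]

def binary_encode(value, categories, n_bits):
    """Category index written as a fixed-width binary number (Model type 2)."""
    index = categories.index(value)
    return [int(bit) for bit in format(index, f"0{n_bits}b")]

n_numerical = 10  # numerical descriptors in Table 1
n_type1 = len(SUBSTITUTIONS) + len(CELL_LINES) + n_numerical  # dummy encoding
n_type2 = 4 + 3 + n_numerical                                  # binary encoding

chol_dummy = dummy_encode("Chol", SUBSTITUTIONS)
chol_binary = binary_encode("Chol", SUBSTITUTIONS, n_bits=4)
```

The binary scheme trades interpretability for compactness: a single bit is shared by several categories, which is why an individual index descriptor cannot identify one substitution on its own.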

Metrics of Model Performance
Among various statistical measures for the performance of regression models, we used two criteria for both the training and test data: the squared correlation coefficient, usually denoted as $R^2$, and the root mean square error (RMSE). In addition, the accuracy of the model was evaluated using an acceptable difference between the predicted and actual values of $y$. The overall model accuracy was calculated by dividing the number of accurate predictions (within 25% of the actual values) by the total number of predictions.
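The three measures can be written compactly as in the sketch below. It takes $R^2$ literally as the squared Pearson correlation between predicted and actual values, which can differ from the coefficient of determination returned by Scikit-learn's `r2_score`; which of the two was used is not stated, so this choice is an assumption, as are the function and argument names.

```python
import numpy as np

def regression_metrics(y_true, y_pred, tolerance=0.25):
    """Return (R^2, RMSE, accuracy) for a set of predictions.

    accuracy is the fraction of predictions within `tolerance`
    (as a fraction of the actual value) of the actual value.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_true, y_pred)[0, 1]
    r2 = r ** 2
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    within = np.abs(y_pred - y_true) <= tolerance * np.abs(y_true)
    accuracy = within.mean()
    return r2, rmse, accuracy

# Made-up values: the third prediction misses by 30% and is counted as inaccurate.
r2, rmse, accuracy = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 3.9, 4.0])
```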

Statistical Analysis
No transformation was performed on the data set except for the output ($y$), which was defined by applying a logarithmic transformation to the value of the cellular uptake of PEI/siRNA NPs. The data set was divided into training data (147 data points, 75%) and test data (50 data points, 25%). For the reduction of descriptors using the backward elimination method, a significance level of 5% was selected, corresponding to a p-value of 0.05. All modeling and statistical analyses were performed using the open source Scikit-learn library. [28] The programming language was Python 3.7. The Seaborn Python package [29] was used for data visualization.

Model Type 1
All 26 descriptors (see Figure 2) were directly used for modeling. The regression models were trained, and their performances are summarized in Table 2. Better model performance corresponds to higher values of accuracy and $R^2$, as well as lower values of RMSE. Among the three methods, the performances of RF and MLP were close, and both were better estimators than LR.

Model Type 2
Model type 2 had a reduced number of descriptors (17) compared with Model type 1 (26 descriptors) (see Figure 3). The correlation between these 17 descriptors and $y$ was calculated using the Pearson correlation coefficient (Table 3). The results show that 11 descriptors (highlighted in red in Table 3) had a correlation of more than 10% with $y$. The other six descriptors had negligible correlation and hence were removed from further modeling using Model type 2. The performances of the regression models using Model type 2 are shown in Table 2. Similar to Model type 1, RF and MLP were better performers than LR. Compared with Model type 1, for RF, $R^2$ and RMSE showed a minor improvement, whereas accuracy slightly decreased using Model type 2. For MLP and LR, the differences in performance metrics between Model type 1 and Model type 2 were insignificant.
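The correlation filter can be sketched as below, reading "more than 10% correlation" as an absolute Pearson coefficient above 0.10. That reading, the function name, and the toy data are assumptions for illustration.

```python
import numpy as np

def select_by_correlation(X, y, names, threshold=0.10):
    """Keep descriptors whose |Pearson r| with y exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    kept = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) > threshold:
            kept.append((name, r))
    return kept

# Toy data: column "a" tracks y exactly, column "b" is uncorrelated with y.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
X_toy = np.column_stack([y_toy, [1.0, -1.0, -1.0, 1.0]])
kept = select_by_correlation(X_toy, y_toy, names=["a", "b"])
```

Pearson correlation only captures linear association, so a descriptor with a strong but non-monotonic effect on $y$ could be discarded by this filter.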

Model Type 3
Using a lower number of descriptors (Model type 2), no significant decrease in model performance was observed. In Model type 3, where backward elimination was utilized to further remove insignificant descriptors, a significance level of 5% was selected, corresponding to a p-value of 0.05. The LR regressor was fitted using the 11 descriptors identified from Model type 2, and the descriptor with the highest p-value was examined. If its p-value was greater than the defined significance level of 0.05, the descriptor was removed and LR was implemented again using the remaining descriptors. The process continued until the highest p-value among the remaining descriptors was less than 0.05. The number of descriptors was reduced to six after backward elimination: $r$, $I_{\text{sub-1}}$, $I_{\text{sub-2}}$, $N_c$, $N_{uns}$, and $N_{thio}$. The regression modeling using Model type 3 showed comparable performance to the other two model types (Table 2), suggesting that the most significant descriptors were those in Model type 3. In addition, for all model types, the nonlinear models (RF and MLP) performed better than the linear model (LR). In the following, a physical interpretation of the most significant descriptors is presented.
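The elimination loop described above can be sketched as follows. Scikit-learn's `LinearRegression` does not report p-values, so this example computes them from an ordinary least squares fit with a two-sided t-test using NumPy and SciPy; that computation, the inclusion of an intercept, and all names are assumptions and may differ from how the p-values were obtained in this work.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test p-values for each slope of an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - k - 1
    sigma2 = resid @ resid / dof                  # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)         # coefficient covariance
    t_stat = beta / np.sqrt(np.diag(cov))
    p = 2.0 * stats.t.sf(np.abs(t_stat), dof)
    return p[1:]                                  # drop the intercept

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the descriptor with the highest p-value above alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break                                 # all remaining are significant
        del keep[worst]
    return [names[i] for i in keep]

# Synthetic check: only x0 actually drives y.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0.0, 0.1, size=200)
selected = backward_eliminate(X_demo, y_demo, names=["x0", "x1", "x2"])
```

Because each test is run at the 5% level, an irrelevant descriptor can occasionally survive by chance, so the retained set should be read as a statistical screen rather than a definitive list.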

Effect of Most Significant Descriptors
The descriptor $r$, the PEI-to-siRNA weight ratio, usually ranges between 2 and 10 in experiments. A higher $r$ leads to a higher positive charge on the PEI/siRNA NPs, causing stronger interaction with cell membranes. In addition, a higher $r$ increases the presence of free PEIs that are not bound to siRNAs, which might destabilize the membrane structure, thereby contributing to the uptake of PEI/nucleic acid NPs. [30] Boeckle et al. [30] attributed the increase in gene expression observed with free PEIs to their ability to facilitate the proton sponge effect. They proposed that free PEIs can be internalized into endosomes, which can then merge with vesicles containing PEI/DNA NPs. These free PEIs subsequently assisted in the buffering of vesicular pH and the accumulation of Cl⁻ in the endosome, leading to its rupture.
Unlike r, the two index descriptors I sub-1 and I sub-2 on a standalone basis are not physical attributes. These descriptors represent a compacted form of the type of hydrophobic substitution, suggesting that type of hydrophobic substitution plays a significant role in the delivery of siRNA into breast cancer cells. As a supporting evidence, Aliabadi et al. [31] explored LMW PEIs substituted by different lipids, as carriers for siRNA-mediated BCRP downregulation. BCRP is an efflux protein whose activity has been connected to multidrug resistance in breast cancer treatment. It was shown that the efficacy of siRNA delivery increased significantly with hydrophobic lipid substitutions ranging from C8 to C18, and that LA-and CA-substituted PEIs were the most effective in both cellular uptake and BCRP downregulation. Sun et al. [15] using MD simulations explored the role of lipid substitutions on PEI/siRNA NPs. It was found that the substituted lipids enhanced the stability of PEI/siRNA NPs via association among them. Such linkages between the lipid side groups formed and broke frequently for the short lipids, whereas the associations between long lipids were more stable.
$N_c$ ($= n_c \times n_{sub}$) is a measure of the hydrophobicity introduced to the PEI molecules. The same level of hydrophobicity can be obtained experimentally by adjusting either the level of substitutions per PEI ($n_{sub}$) or the number of C on each substitution ($n_c$). It is worth noting that backward elimination in Model type 3 eliminated the descriptor $n_c$. This indicates that, given $N_c$, $n_c$ itself does not play an important role in cellular uptake. Figure 4a shows the relation between $y$ and the two parameters $N_c$ and $n_c$, based on experimental data. Reading the plot horizontally at a fixed value of $N_c$, the variation in $y$ was small, suggesting its insensitivity to $n_c$. Reading the plot vertically, $y$ exhibited an overall increasing trend with $N_c$ for $n_c$ of 8 and 12, whereas no specific trend was observed for $n_c$ of 16 and 18. The descriptor $N_{uns}$, which is a measure of the unsaturation level in the lipid substitutions, was found to be significant to the prediction of cellular uptake. To further evaluate the role of $N_{uns}$, the value of $y$ as a function of $N_{uns}$ and $n_{uns}$ is shown in Figure 4b. The results suggested that for $n_{uns} = 2$, between 2 and 3 substitutions per PEI ($N_{uns} = (2\text{-}3) \times n_{uns} = 4\text{-}6$) resulted in a higher $y$ value, and a substitution level higher than 3 or lower than 2 caused a decrease in $y$. Because $n_{uns}$ was eliminated in backward elimination, the optimal $N_{uns}$ was expected to be in the same range for other $n_{uns}$. This cannot be seen directly from Figure 4b due to the lack of data points covering a wider range of $n_{uns}$ and $N_{uns}$, especially since some data points were not considered during training. Finally, $N_{thio}$ is another important descriptor, which represents the total number of thioester groups on hydrophobically modified PEIs. This functional group can increase the sensitivity of NPs to dissociation due to its labile nature under aqueous conditions.
Strong electrostatic interaction between cationic polymers and nucleic acids has been shown to benefit the formation of a stable NP to deliver nucleic acids across cell membranes; [32] however, it was also proposed to limit the efficacy of polymers due to insufficient unpacking of NPs in the cytoplasm. [33] K.C. et al. [34] showed that introducing electronegativity (with the addition of thioester groups) along with hydrophobic tail decreased the binding strength between lipid-grafted PEI and plasmid DNA (pDNA). In addition, it was shown that PEI-tLA polymers were easier to dissociate from pDNA than PEI-LA polymers, which resulted in better delivery performance of PEI-tLA polymers.

Closed-Form Predictions
Although the LR method exhibited poorer performance than the RF and MLP methods, a strength of the LR method is that it can provide predictions in closed form. This can be useful in practical settings where initial screening of the descriptors is done efficiently with relatively low but acceptable accuracy. Here, closed-form predictions of $y$ generated from the LR regressor are presented for each model type. The coefficient of each descriptor is shown in Table 4, and the absolute value of the coefficient represents the significance of the descriptor. For Model type 1 (26 descriptors in total, notations defined in Supporting Information, section S2), 6 descriptors had a coefficient whose absolute value was greater than 0.5. For Model type 2 (11 descriptors in total), 6 descriptors had a coefficient with absolute value greater than 0.13. For Model type 3 (6 descriptors in total), 3 descriptors had a coefficient with absolute value greater than 0.13. These coefficients are colored red in Table 4. All model types suggested that $N_{uns}$ and $N_{thio}$ were significant descriptors. This indicates that tLA is an important hydrophobic substitution, being the only substitution that carries a thioester group in its structure. Also, for both Model type 2 and Model type 3, $I_{\text{sub-1}}$ and $I_{\text{sub-2}}$ were found to be significant. Although these two descriptors cannot directly distinguish different types of hydrophobic substitutions, some information can be obtained from them. Examining the binary encoding (Table S3, Supporting Information), 7 substitutions had $I_{\text{sub-1}} = 1$ or $I_{\text{sub-2}} = 1$, including tLA, PA, StA, LA, αLA, Chol, and OA, suggesting that long lipids are better hydrophobic modifications than short-tail lipids (CA, Lau). Among the 6 cell lines, $I_{\text{AU565}}$ and $I_{\text{MDA 468}}$ had the smallest coefficients (in magnitude) generated by Model type 1, suggesting insensitivity of $y$ to these two cell lines.
The closed-form LR prediction of $y$ using Model type 3 is given as follows:

$$y = 0.10\,r + 1.24\,I_{\text{sub-1}} + 0.82\,I_{\text{sub-2}} + 0.02\,N_c - 0.05\,N_{uns} - 0.20\,N_{thio} \qquad (1)$$

This equation predicts that $y$ increases with $r$, $I_{\text{sub-1}}$, $I_{\text{sub-2}}$, and $N_c$, and decreases with $N_{uns}$ and $N_{thio}$. The coefficient of $N_{thio}$ was negative, which might seem contradictory to the aforementioned beneficial effect of the thioester group on cellular uptake. However, it should be pointed out that the descriptors are interconnected; an increase in $N_{thio}$ can also be associated with an increase in $I_{\text{sub-1}}$ or $I_{\text{sub-2}}$, which are accompanied by positive coefficients. Figure 5 shows representative model predictions for $y$ obtained by varying $N_c$, $N_{uns}$, and $N_{thio}$. In these plots, $r$ was set to 5, based on its most frequent value in the data set. The index descriptors $I_{\text{sub-1}}$ and $I_{\text{sub-2}}$ were assigned 1 and 0, respectively, because no hydrophobic substitution in the data set had a value of 1 for both indices (Table S4, Supporting Information). Figure 5 shows that, for a given $N_{thio}$, low $N_{uns}$ and high $N_c$ resulted in higher $y$ (log (cellular uptake)) into breast cancer cell lines. Such a parametric study will help in designing more potent PEI carriers for gene delivery applications.
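For quick screening, Equation (1) can be evaluated directly, as in the short sketch below. The coefficients are those printed in Equation (1), which shows no intercept term; the function and argument names are chosen here for illustration, and the second example's descriptor values are hypothetical.

```python
def predict_log_uptake(r, i_sub1, i_sub2, N_c, N_uns, N_thio):
    """Evaluate the Model type 3 linear prediction of y from Equation (1)."""
    return (
        0.10 * r
        + 1.24 * i_sub1
        + 0.82 * i_sub2
        + 0.02 * N_c
        - 0.05 * N_uns
        - 0.20 * N_thio
    )

# Settings used for the representative plots: r = 5, I_sub-1 = 1, I_sub-2 = 0.
baseline = predict_log_uptake(r=5, i_sub1=1, i_sub2=0, N_c=0, N_uns=0, N_thio=0)

# Hypothetical carrier with N_c = 36 and N_uns = 4, no thioester groups.
candidate = predict_log_uptake(r=5, i_sub1=1, i_sub2=0, N_c=36, N_uns=4, N_thio=0)
```

With these inputs the baseline evaluates to 1.74 and the hypothetical carrier to 2.26, reflecting the positive weight on $N_c$ outweighing the penalty on $N_{uns}$. Such predictions are meaningful only within the descriptor ranges covered by the data set.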

Limitation and Future Perspectives
A common problem in the modeling of biological effects is the limited number of data points and the high number of descriptors. This might lead to overfitting, where the model performs well on the training data but poorly on the test data. Many experimental studies are also typically performed under different conditions including cell medium, baseline polymer, polymer concentration, NP size and charge, number of cells, and the instrument used to measure cellular uptake, to name a few. This issue can be addressed by performing systematic experimental studies to generate a large data set, and by normalizing the data into unitless quantities to minimize differences in measurements. The problem of a large number of descriptors can be tackled by proper descriptor selection for model construction, as done in our study through Model type 3. The models generated here can be used to predict the cellular uptake of PEI/siRNA NPs where the parameters of the hydrophobic substitutions are new and untested. Reliable predictions can be made within the ranges of the utilized descriptors. The accuracy of the algorithms utilized in this work was about 60%, which leaves room for improvement. To enhance the model predictions, new experimental data should be added, especially on the types of hydrophobic substitutions with a low number of data points in the current data set (e.g., a wider range of unsaturated lipids with various levels of substitution (Figure 4b)), and the models then need to be retrained. Another approach is to find more informative descriptors that provide additional information regarding the uptake of PEI/siRNA NPs into cancer cell lines. An example of such a descriptor is the toxicity level of PEI/siRNA NPs. As new descriptors are added to the models, backward elimination should be used to test whether the new descriptors are statistically significant. In addition, our model suggested that cellular uptake of PEI/siRNA NPs was insensitive to the AU565 and MDA 468 cell lines, which needs to be verified with experiments.
Computational methods that give better model predictions are usually in some form of artificial neural networks, or other nonlinear methods such as the RF used in our study. Selection of methods must take into consideration factors such as the complexity of NPs, number of descriptors affecting cellular uptake, and complexity of the relation between descriptors and cellular uptake. The advantage of nonlinear methods is that they can find more complex relation between descriptors and the target (here, cellular uptake), thereby providing better prediction. However, the disadvantage is that such methods cannot separately describe the contribution of individual descriptors and can only be used as a black box algorithm. The simpler LR can provide direct information on the contribution of each descriptor, at the cost of losing accuracy. To improve the performance of linear methods, more complex descriptors can be generated by imposing nonlinearity. For example, through multiplication of two descriptors as utilized in our study (N c , N uns , and N thio ).
The current data set, originally generated from novel lipid modified PEIs developed in our lab, provided a basis for applying machine learning algorithms into the prediction of cellular uptake of PEI/siRNA NPs. This data set can be expanded to address more complex scientific questions. As a representative example, the purpose of delivering PEI/siRNA NPs is to downregulate specific mRNAs involved in signaling, thereby reducing the intracellular levels of associated proteins. [3] By altering the nucleotide sequence of siRNA molecules, one can silence a broad range of mRNAs for specific therapies. In the case of breast cancer cells, a wide range of targets were identified for silencing to control uncontrolled growth of the cells. [35,36] The efficacy of siRNA silencing depends on a functional delivery system that can transport siRNA into cells and release it into the cytoplasm to exert its effect. A new data set can be built with the efficacy of siRNA silencing as a new target. Our current descriptors can continue to be used, with the addition of cellular uptake as a new descriptor. Such a model can provide valuable information on what polymer modifications result in better delivery vector for silencing the target proteins.

Conclusions
Machine learning algorithms were used to model the uptake of PEI/siRNA NPs into breast cancer cells. A large data set was compiled composed of experimental values for the uptake of LMW PEI/siRNA NPs into six breast cancer cell lines, where the PEIs were modified with different hydrophobic substitutions. Three different regression models, namely, RF, MLP, and LR, as well as various descriptors were used to construct the models. RF and MLP regression methods were found to have a better performance, suggesting that nonlinear models were better estimators than the linear model to predict the cellular uptake of PEI/siRNA NPs. Descriptors that had major contribution to uptake included PEI-to-siRNA weight ratio, type of hydrophobic substitution, as well as the total number of C, thioester groups, and unsaturated C on the substitutions in each PEI. Our results suggested that the type of hydrophobic substitution on PEI molecules plays a crucial role in the delivery of PEI/siRNA NPs into breast cancer cells. Our study was the first report of quantitative modeling and prediction of cellular uptake of PEI/siRNA NPs, using chemically interpretable descriptors, which will facilitate the design of more efficient gene delivery systems that improve cancer therapeutics.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.