Machine learning of organic solvents reveals an extraordinary axis in Hansen space as indicator of spherical precipitation of polymers

Machine learning is an emerging tool in the field of materials chemistry for uncovering principles from large datasets. Here, we focus on the spherical precipitation behavior of polymers and computationally extract a hidden trend that is orthogonal to the availability bias in the chemical space. To construct a dataset, four polymers were precipitated from 416 solvent/nonsolvent combinations, and the morphology of the resulting precipitates was recorded. The dataset was subjected to computational investigations consisting of principal component analysis and machine learning based on a random forest model and a support vector machine. Thereby, we eliminated the effect of the availability bias and found a linear combination of Hansen parameters to be the most suitable variable for predicting precipitation behavior. The predicted appropriate solvents are those with low hydrogen bonding capability, low polarity, and small molecular volume. Furthermore, we found that the capability for spherical precipitation is orthogonal to the availability bias and forms an extraordinary axis in Hansen space, which is the origin of the conventional difficulty in identifying the trend. The extraordinary axis points toward a void region, indicating the potential value of synthesizing novel solvents located therein.


INTRODUCTION
Since the emergence of synthetic polymers, the solubility of polymers in organic solvents has been a topic of interest. In addition to the fundamental curiosity about the anomalous behavior of polymers in solvents, there was a pressing need for a practical guide to good and poor solvents. In 1967, C.M. Hansen made significant contributions to this field by introducing Hansen parameters and Hansen space. [1] He divided the Hildebrand solubility parameter into three components representing the chemical affinity in terms of dispersion force (δD), dipole interactions (δP), and hydrogen bonding (δH). He also proposed that the distance between the solute and solvent in the three-dimensional Hansen space represents the solubility. Solubility prediction with Hansen space is highly accurate and has gained popularity in polymer chemistry. [2] The three chemical affinity parameters are defined as orthogonal in Hansen space and have conventionally been treated as individual parameters. However, in practical chemical compounds, they often correlate with each other, and we cannot manipulate the three parameters independently. For instance, compounds with large δH typically contain a hydrogen bonding unit such as a hydroxy or amine group, which enhances δP simultaneously. Consequently, the statistical distribution of existing organic compounds in Hansen space is biased rather than homogeneous.
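The notion of a distance in Hansen space can be made concrete with a short sketch. The text above does not state the distance formula; the snippet assumes the commonly used form Ra^2 = 4(ΔδD)^2 + (ΔδP)^2 + (ΔδH)^2, and the function name and all coordinates are hypothetical placeholders rather than measured values.

```python
import math


def hansen_distance(a, b):
    """Distance Ra between two (dD, dP, dH) triples, all in MPa^0.5."""
    d_d, d_p, d_h = (x - y for x, y in zip(a, b))
    # Assumption: the conventional factor of 4 on the dispersion term.
    return math.sqrt(4.0 * d_d**2 + d_p**2 + d_h**2)


# Placeholder coordinates for illustration only (not literature values).
polymer = (18.0, 5.0, 4.0)
solvent_near = (18.0, 4.0, 4.0)
solvent_far = (15.0, 9.0, 20.0)

ra_near = hansen_distance(polymer, solvent_near)
ra_far = hansen_distance(polymer, solvent_far)
print(f"Ra (near) = {ra_near:.2f}, Ra (far) = {ra_far:.2f}")
```

Under this convention, a smaller Ra suggests a better solvent for the solute. The sketch treats the three coordinates as independent, which is exactly the assumption that the correlated distribution of real solvents calls into question.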
The biased distribution of available chemical compounds is particularly problematic in systematic research that seeks to identify trends in large datasets. [3] Recently, we faced this problem while studying the supramolecular chemistry of polymers and investigating their precipitation behaviors in a binary mixture of organic solvents. We precipitated a polymer slowly by mixing a poor solvent with the polymer solution via the vapor phase and found that the polymer occasionally formed micrometer-scale, highly spherical particles with smooth surface morphology, which has also been reported by other research groups. [4] The sphericity of the precipitates depended strongly on the choice of good and poor solvents. To clarify the origin of this dependency, we tested a series of common organic solvents for precipitation but could not find a clear relationship between the common physical and chemical parameters and the capability for forming spherical precipitates. Therefore, we conducted a machine learning study of the experimental results to eliminate the bias and extract the essential trend (Figure 1).
Here, we systematically construct a dataset from polymer precipitation experiments and computationally extract a hidden trend in the dataset. We conducted precipitation experiments with four polymers in 416 pairs of good and poor solvents and observed the morphology of the precipitates with a microscope. The results were binarized and subjected to computational investigations consisting of a statistical pretreatment (principal component analysis, PCA) [5] and machine learning with a random forest model (RF) [6] and a support vector machine (SVM). [7] Thereby, we extract a correlation between the precipitation behavior and the common physical and chemical properties of the solvents, among which a linear combination of Hansen parameters describes the trend well while the other conventional parameters do not. Notably, the trend is diagonal in Hansen space and orthogonal to the direction of the availability bias, so it cannot be found by analyzing the dataset straightforwardly along the individual Hansen parameters; this orthogonality is the origin of the conventional difficulty in predicting the precipitation tendency. The direction of the parameter in the feature space suggests a molecular design for enhancing the ability to form spherical precipitates. This study also reveals how currently available molecules are biased in terms of structures and properties and provides a guide to chemical space, along which chemically and physically anomalous solvents exist.
[4a-c] As a typical procedure, 20.0 mg of powdery polymer was added to 10.0 mL of a good solvent to obtain a clear homogeneous solution. Next, 500 μL of the solution was poured into a small glass vial with a volume of 2 mL, which was placed in a larger glass container with a volume of 20 mL that was partially filled with 3 mL of a poor solvent. The glass container was tightly sealed with a plastic cap and kept at 25 °C for 5 days to allow the poor solvent to slowly mix with the polymer solution via the vapor phase. Thereafter, the small vial was removed from the container and sonicated for 1 min to disperse the precipitates well. A portion of the dispersion was cast on a glass plate and left in air until completely dry. The morphology of the remaining solids was judged as either spherical or nonspherical based on optical microscopic images (Olympus model BX53 Upright Microscope). The spherical and nonspherical morphologies were converted into 1 and 0, respectively, and were utilized as the target variable (θ) in the computational study.

Machine learning of the dataset
The following solvent properties were utilized as explanatory variables in the computational study: molecular weight (M), density (ρ), boiling point (Tb), melting point (Tm), saturated vapor pressure at 21 °C (Psat), refractive index (n), viscosity at 25 °C (η), molecular dipole (μ), molar volume (V), and Hansen solubility parameters (δD, δP, and δH). The values were collected from the literature and reference books [1a,8] (Table S1) and were converted into training data in the form of $(g_1^{(i)}, \ldots, g_m^{(i)}, p_1^{(i)}, \ldots, p_m^{(i)})$, where g and p stand for good and poor solvents, respectively, $g_j^{(i)}$ is the j-th parameter of the good solvent in the i-th combination of a polymer and good/poor solvents, and m is the number of physical parameters utilized. The experimental results were converted into a target vector represented by $(\theta^{(1)}, \ldots, \theta^{(318)})$, where $\theta^{(i)}$ is the target variable of the i-th combination of a polymer and good/poor solvents. The dataset was summarized as a comma-separated values (.CSV) file and was subjected to machine learning with the RF implemented in the Scikit-learn (v1.1.2) Python library, yielding feature importance for the explanatory variables. The top-ranked features of good and poor solvents were chosen as explanatory variables for subsequent learning with SVM, which was conducted using the same package, yielding decision boundaries in the feature space. The hyperparameters C and γ of the SVM were optimized by grid search (see the Supporting Information for details). The model was evaluated using a five-fold cross-validation scheme [9] through the same program package, yielding learning curves.
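The workflow described above (RF feature importance, then an SVM on the top-ranked features with grid-searched C and γ, then five-fold cross-validation) can be sketched with Scikit-learn as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins for the .CSV file, the RBF kernel is inferred from the mention of C and γ, and the standardization step, the grid values, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
m = 4            # parameters per solvent (hypothetical)
n_pairs = 318    # number of polymer/good/poor combinations
names = [f"g{j + 1}" for j in range(m)] + [f"p{j + 1}" for j in range(m)]

# Synthetic stand-in for the CSV: each row is (g_1..g_m, p_1..p_m).
X = rng.normal(size=(n_pairs, 2 * m))
# Synthetic target: depends on g1 and p2 only, plus noise.
noise = 0.3 * rng.normal(size=n_pairs)
theta = (X[:, 0] + X[:, m + 1] + noise < 0).astype(int)

# Step 1: rank the explanatory variables by RF feature importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, theta)
ranked = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])
top = [names.index(name) for name, _ in ranked[:2]]

# Step 2: SVM on the two top-ranked features; C and gamma by grid search.
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1, 10]}
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(svm, param_grid, cv=5).fit(X[:, top], theta)

# Step 3: five-fold cross-validated accuracy of the tuned model.
scores = cross_val_score(grid.best_estimator_, X[:, top], theta, cv=5)
print([(n, round(float(v), 3)) for n, v in ranked[:2]])
print(grid.best_params_, f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

Restricting the SVM to two features keeps the decision boundary plottable in a plane, which matches the two-dimensional feature spaces described in the text.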
The same procedure was repeated by replacing the physical parameters with their standardized linear combinations, which were obtained through PCA. PCA was performed using Scikit-learn, with the input file provided in the .CSV format. PCA yields the principal components (PCs) and their explained variance ratios (λ) simultaneously.
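A minimal sketch of this PCA step is given below, assuming that each parameter is standardized to zero mean and unit variance before the decomposition. The solvent table is synthetic (the real input is the .CSV file), and the column meanings and correlations are chosen purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_solvents = 30  # hypothetical number of solvents

# Synthetic table with columns (dD, dP, dH, V). dP and dH are made to
# correlate and V to anti-correlate with them, for illustration only.
polarity = rng.normal(size=n_solvents)
table = np.column_stack([
    17.0 + 1.0 * rng.normal(size=n_solvents),
    8.0 + 4.0 * polarity + rng.normal(size=n_solvents),
    9.0 + 5.0 * polarity + rng.normal(size=n_solvents),
    100.0 - 20.0 * polarity + 10.0 * rng.normal(size=n_solvents),
])

z = StandardScaler().fit_transform(table)  # standardize each column
pca = PCA(n_components=4).fit(z)
scores = pca.transform(z)                  # PC1..PC4 score of every solvent

print("loadings (rows = PCs):")
print(np.round(pca.components_, 2))
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
```

Each row of `pca.components_` holds the coefficients of one linear combination of the standardized parameters, and `explained_variance_ratio_` corresponds to λ.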

Experiments on polymer precipitation
PS and PVK are nonpolar compounds and are therefore soluble in aprotic and moderately polar solvents such as Bz, Tol, DCB, CH2Cl2, CHCl3, DMF, and THF (Table S2). PVP is a highly polar polymer and is therefore soluble in MeOH, EtOH, PrOH, BuOH, Ace, MeCN, DMSO, CH2Cl2, CHCl3, DMF, and THF (Table S2). Accordingly, we divided the solvents into good and poor solvents and utilized all the combinations of the good and poor solvents for the slow precipitation of the polymers.
The microscopic images of the precipitates are tabulated in Figures S1-S3. The visual distinction between spherical and nonspherical is not quantitative, but it is valid for this study because the tendency toward spherical/nonspherical precipitates was very distinct. For instance, we took several images of the PVK precipitates obtained from a combination of DMF and MeCN and found that nearly all the solids were spherical with a diameter of several micrometers (Figure 2C). In contrast, the morphology of the PVK precipitates obtained from a combination of DMF and BuOAc was irregular (Figure 2D), and we could not find any spherical particles. Across the whole dataset, PVK formed micrometer-scale spherical particles in 54 out of 98 solvent combinations. PS formed spherical particles in 38 out of 110 combinations, while spherical PVP particles were obtained in 28 out of 110 combinations. We then tried to find a correlation between the precipitation behavior and the solvents by rearranging the table as a function of the properties listed in Table S1. However, we could not identify a clear tendency.

Machine learning of the dataset to extract a trend
To quantitatively assess the correlation between the parameters and the tendency toward spherical precipitation, we conducted machine learning based on the RF model and measured the feature importance. We first incorporated the Hansen parameters and V into the machine learning as the explanatory variables with the expectation that the solubility parameters would explain the tendency most clearly among the chemical and physical parameters. [1a] Among them, V of the good solvent and δD of the poor solvent scored the highest (Figure 3B).
We then trained the SVM with the highest-ranked features for the classification of the precipitation behavior. The optimized SVM decision boundaries are shown in Figure 3F together with the learning curve, where the red regions represent clusters of θ = 1 (producing spherical precipitates), and the blue regions represent θ = 0 (producing nonspherical precipitates). However, the resulting boundaries did not describe the experimental results well. The mean accuracy was 75.4% ± 7.1%, which is moderate. More importantly, multiple fragmented regions appeared in the feature space as a result of the large γ. Such fragments are undesirable because they increase the risk of overfitting. [10] The fragmented trend is also undesirable in terms of chemistry because the precipitation behavior should exhibit a monotonic correlation with a certain parameter rather than an oscillatory trend. [11] We conducted the same machine learning with the other physical parameters including/excluding Hansen parameters, but the results were unsuitable for the classification of the spherical precipitation, with mean accuracies lower than 76% (Figure 3A,E, and Figure S4).
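The link between a large γ and overfitting can be illustrated with a small synthetic experiment. The sketch below assumes an RBF-kernel SVM and uses hypothetical data with 20% label noise; the values of C and γ are arbitrary and are not those of this study.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] < 0).astype(int)   # simple, monotonic ground truth
flip = rng.random(300) < 0.2              # 20 % label noise
y = np.where(flip, 1 - y, y)

results = {}
for gamma in (0.1, 100.0):
    model = SVC(kernel="rbf", C=10.0, gamma=gamma)
    train_acc = model.fit(X, y).score(X, y)             # accuracy on seen data
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # accuracy on held-out folds
    results[gamma] = (train_acc, cv_acc)
    print(f"gamma={gamma}: train={train_acc:.2f}, cv={cv_acc:.2f}")
```

With a large γ, each training point influences only its immediate neighborhood, so the model can memorize noisy labels as small isolated regions. The training accuracy then rises while the cross-validated accuracy falls behind, which is the behavior a learning curve is meant to expose.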
We attributed the mismatch to the nonorthogonality of the parameters tested. As described in the Introduction, the Hansen parameters (δD, δP, and δH) and V have been regarded as independent of and orthogonal to each other, but the parameters of actual chemical compounds correlate with each other. Such hidden correlations among the parameters are usually uncovered by a mathematical transformation of the dataset. [11] We therefore utilized PCA, which is a statistical tool for finding the actual orthogonal axes in a parameter space.
We carried out PCA with the Hansen parameters and V and obtained four new axes (first to fourth principal components, PC1-4) given as linear combinations of δD, δP, δH, and V (Tables S3 and S4). With these new axes, we again conducted machine learning with the RF model and identified PC4 as the most decisive parameter for the classification of the spherical polymer precipitation (Figure 3D). We then trained the SVM with PC4 as the explanatory variable and obtained decision boundaries in a feature space spanned by PC4 of the good and poor solvents, as shown in Figure 3H. Unlike the original one in Figure 3F, the new feature space shows a simple and clear boundary in the third quadrant. Despite its simplicity, the mean accuracy of the model calculated by cross-validation is the highest (77.0% ± 7.0%) among all the models. For comparison, we conducted the same procedure by using, as the explanatory variables, the other physical and chemical parameters treated with PCA (Figures 3C,G, and S4, and Tables S4-S6). The resulting model provided fragmented boundaries with lower mean accuracy (<73%).
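The construction of the two-dimensional feature space (PC4 of the good solvent, PC4 of the poor solvent) can be sketched as follows. The solvent labels, the parameter table, and the good/poor assignment are hypothetical placeholders. Note also that the sign of a principal axis is arbitrary, so which quadrant corresponds to the region reported here depends on the sign convention of the PCA implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
solvents = ["S1", "S2", "S3", "S4", "S5", "S6"]   # hypothetical labels
table = rng.normal(size=(len(solvents), 4))       # stand-in for (dD, dP, dH, V)

# One PCA over all solvents; column index 3 of the scores is PC4.
z = StandardScaler().fit_transform(table)
scores = PCA(n_components=4).fit_transform(z)
pc4 = dict(zip(solvents, scores[:, 3]))

# Hypothetical good/poor assignment; every combination is one data point.
good, poor = ["S1", "S2"], ["S3", "S4", "S5", "S6"]
pairs = [(g, p) for g in good for p in poor]
features = np.array([[pc4[g], pc4[p]] for g, p in pairs])  # (PC4_good, PC4_poor)

# Pairs whose two PC4 scores are both negative lie in the third quadrant.
in_third_quadrant = (features < 0).all(axis=1)
for (g, p), flag in zip(pairs, in_third_quadrant):
    print(g, p, bool(flag))
```

The resulting `features` array is what a two-feature classifier such as the SVM would be trained on, with the binarized morphology as the target.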
To further test the validity of the decision boundaries shown in Figure 3H, we additionally conducted precipitation experiments with F8BT in a series of combinations of good/poor solvents (Table S2). F8BT barely assembled to form spherical particles, yielding microspheres in only one of 98 conditions (Figure S5). We projected the parameters of the successful solvent combination onto the feature space (Figure S6) and found that the data point is located at the center of the red region, supporting the validity of the boundaries. We therefore concluded that the linear combination of the Hansen parameters was the best indicator for the prediction of spherical polymer precipitation.

Chemical interpretation of the computationally extracted trend
The computationally provided boundaries and parameters should contain chemical insight into the spherical precipitation. To extract it, we interpret the components of PC1 and PC4 (Tables S3 and S4). In PC1, the abilities to form hydrogen bonds and polar interactions contribute positively, whereas molecular volume contributes negatively. Dispersion force has less influence on PC1. Accordingly, organic solvents are categorized into the three groups listed below:
1. Solvents that are highly capable of forming polar interactions and hydrogen bonds while featuring small molecular volume.
2. Solvents that are barely capable of forming polar interactions and hydrogen bonds while featuring large molecular volume.
3. Solvents in the middle of (1) and (2).
Likewise, we categorize the organic solvents along the axis of PC4 as follows:
4. Solvents that are highly capable of forming polar interactions and hydrogen bonds while featuring large molecular volume.
5. Solvents that are barely capable of forming polar interactions and hydrogen bonds while featuring small molecular volume.
6. Solvents in the middle of (4) and (5).
The feature importance in Figure 3D indicates that the value of PC1 and the location among categories (1)-(3) have little influence on the spherical precipitation. Instead, the value of PC4 and the location among categories (4)-(6) are the decisive factors. The decision boundaries in Figure 3H, which show a red region in the third quadrant, indicate that category (5) is the best for spherical precipitation. The predicted suitable good and poor solvents located in (5) include, for instance, furan, Bz, carbon disulfide, and carbon tetrachloride. These solvents commonly feature heavy heteroatoms or soft π-electrons. These chemical features are in line with the fundamental chemical understanding that such atoms and electrons contribute to an increase in δD without significantly enhancing δP, δH, and V.

Statistical interpretation of the decision boundaries for extracting the effect of availability bias
The study herein not only provides chemical insight into the spherical polymer precipitation but also proves the importance of eliminating the availability bias. The effect of the bias in this study can be statistically evaluated by using PCA.
Conceptually, PCA is a method that regards a dataset as a high-dimensional ellipsoid and extracts its major and minor axes. PC1 is the major axis of the ellipsoid, along which the data points are distributed most broadly. The other l-th PCs (PCl) are the l-th minor axes of the ellipsoid and are orthogonal to PC1. The distribution along PCl becomes narrower as l increases. λ quantifies this breadth and is the relative length of the ellipsoid along PCl. The distribution of the solvents in Hansen space and its projection onto each plane, together with the ellipsoid calculated through PCA, are shown in Figure 4. PC1 and PC4 are the major and minor axes of the ellipsoid, which are visualized in Figure 4 as red and blue arrows. The values of λ of PCl are summarized in Table S3. λ of PC1 (61.0%) is much larger than those of PC2-4 (25.8%, 8.9%, and 4.3%), meaning that the organic solvents are pseudo-linearly distributed along the axis of PC1. [1a] This bias coincides with the conventional understanding of organic solvents. For instance, popular polar solvents including H2O, MeOH, and MeCN feature small molecular volume. Conversely, organic solvents with large molecular volume such as alkanes and aromatic derivatives are typically less polar.
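The ellipsoid picture can be reproduced with a few lines of NumPy. The sketch below is an illustration under stated assumptions: it uses a synthetic three-dimensional point cloud, omits standardization for brevity, and computes the axes by eigen-decomposition of the covariance matrix, which is one standard way of performing PCA.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic, elongated 3-D point cloud standing in for a solvent distribution.
latent = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
data = latent @ rotation.T

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigenvectors of the covariance matrix give the ellipsoid axes;
# eigenvalues are the variances along those axes.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                    # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

lam = eigvals / eigvals.sum()                        # explained variance ratios
print("lambda:", np.round(lam, 3))
print("PC1 . PC3 =", round(float(eigvecs[:, 0] @ eigvecs[:, 2]), 12))
```

Here `lam` plays the role of λ. Strictly speaking it is a ratio of variances, so the semi-axis lengths of the ellipsoid scale with the square roots of the eigenvalues. As a consistency check on the reported values, 61.0% + 25.8% + 8.9% + 4.3% = 100.0%.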
The narrow distribution along PC4 explains why our conventional systematic study on the spherical precipitation failed. Since the common solvents are naturally distributed along PC1, any statistical rearrangement of the solvents contains this trend, and trends orthogonal to PC1 are hidden unless we computationally transform the dataset so that the bias from PC1 is canceled. Namely, PC1 is an ordinary axis in Hansen space, and its trend appears globally in most systematic studies. In contrast, PC4 is an extraordinary axis in the space and appears only upon appropriate statistical conversions.
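How a trend along a minor axis can stay hidden in the raw parameters is illustrated by the following toy example. It is a hypothetical two-parameter dataset, not the solvent data: the outcome depends only on the narrow direction, yet neither raw parameter correlates appreciably with it.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
t = rng.normal(scale=3.0, size=n)    # broad spread along the biased direction
e = rng.normal(scale=0.3, size=n)    # narrow spread along the minor direction
X = np.column_stack([t + e, t - e])  # two strongly correlated raw parameters
label = (e > 0).astype(float)        # outcome depends only on the minor direction

# PCA by eigen-decomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending order
minor_scores = Xc @ eigvecs[:, 0]    # smallest-variance axis
major_scores = Xc @ eigvecs[:, 1]    # largest-variance axis


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


print("raw x1   :", round(float(abs_corr(X[:, 0], label)), 2))
print("raw x2   :", round(float(abs_corr(X[:, 1], label)), 2))
print("major PC :", round(float(abs_corr(major_scores, label)), 2))
print("minor PC :", round(float(abs_corr(minor_scores, label)), 2))
```

Sorting the data by either raw parameter mostly sorts it along the dominant direction, so the outcome looks almost random. Only after rotating to the principal axes does the dependence on the minor axis become visible.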

CONCLUSION
We experimentally observed the morphology of polymer precipitates obtained in binary mixtures of a series of solvents and classified the results as either spherical or nonspherical. Machine learning was applied to investigate the correlation between the sphericity of the precipitates and the common chemical and physical properties of the solvents utilized. Among the chemical and physical properties tested, a linear combination of Hansen parameters described the tendency well. Moreover, the computationally suggested axis was the minor (extraordinary) axis of the data distribution ellipsoid, which was not distinguishable with conventional methods. Our machine learning model is also meaningful with respect to the concept of chemical space. Chemical space represents the diversity of the currently available compounds relative to the theoretically possible compounds. [12] Therein, the available compounds are inhomogeneously distributed and form clusters such as alkanes, saccharides, and peptides. Between the clusters exist large voids, representing the compounds that are currently unavailable due to the lack of synthetic routes. Synthetic organic chemists have recently taken this concept into consideration and are trying to establish reactions that fill the voids. However, promising directions to be explored and the actual benefits of exploring the voids are sometimes elusive. Our machine learning model finds category (5) as a promising region to be explored. Solvents therein are chemically and physically anomalous, featuring low polar cohesion interactions and small molecular volume, which may be fundamentally valuable for polymer chemistry, such as self-assembly, phase separation, and precipitation.

FIGURE 4 Scatter plot of the solvents utilized for the precipitation experiments in a space spanned by δP, δH, and V (A) and its projection onto the planes spanned by δH and V, δP and V, and δH and δP (B-D). The distribution ellipsoid obtained through principal component analysis (PCA) is projected onto the three-dimensional space spanned by PC1, PC2, and PC4 and is colored in pale blue. PC1 and PC4 and their projections are represented as red and white arrows in the plot.

ACKNOWLEDGMENTS
This work was supported by CREST (grant number: JPMJCR20T4) and ACT-X (grant number: JPMJAX201J) from Japan Science and Technology Agency (JST), Grant-in-Aid for Young Scientists (grant number: JP22K14656) from Japan Society for the Promotion of Science (JSPS) as well as New Energy and Industrial Technology Development Organization (NEDO), and The Kato Memorial Bioscience Foundation.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.

FIGURE 1 Schematic representation of the workflow consisting of experimental observation of polymer precipitations, collection of physical parameters of solvents, and computational analysis of the dataset.

FIGURE 2 (A) Molecular structures of the polymers utilized. (B) Schematic representation of the procedure for slow precipitation of polymers. (C and D) Microscopic images of polyvinylcarbazole (PVK) obtained from a mixture of dimethylformamide (DMF) and MeCN (C), and a mixture of DMF and BuOAc (D).

FIGURE 3 (A-D) Bar charts of feature importance obtained with the random forest model (RF) from datasets consisting of common physical parameters (ρ, M, Tb, Tm, Psat, n, η, and μ) with and without the treatment of principal component analysis (PCA) (C and A), and Hansen parameters (δD, δP, δH, and V) with and without the treatment of PCA (D and B). (E-H) Feature space obtained by support vector machine (SVM) and corresponding learning curves of training accuracy (blue curve) and validation accuracy (green curve) together with the values of hyperparameters and mean accuracy. The red and blue dots in the feature space represent experimentally obtained data points with θ = 1 and 0, respectively, and red and blue regions represent a class of θ = 1 and 0, respectively.