Predicting Highly Enantioselective Catalysts Using Tunable Fragment Descriptors

Catalyst optimization typically relies on inductive and qualitative assumptions made by chemists on the basis of screening data. While machine learning models using molecular properties or calculated 3D structures enable quantitative data evaluation, costly quantum chemical calculations are often required. In contrast, readily available binary fingerprint descriptors are time- and cost-efficient

Alternatively, 3D descriptors represent general structural information, and thus models built on 3D structures may identify correlations without requiring knowledge of the detailed reaction mechanism. 102][13][14][15] The Denmark group reported a modification of grid-based methods with non-binary features based on the distribution of conformations to account for conformational flexibility, and demonstrated that the model can predict higher selectivity even in the absence of such examples in the training dataset. 16 The disadvantage of such methods is the requirement of costly quantum chemical calculations and, in the case of grid-based methods, the necessity of aligning core structures. 2][23][24] Using 2D descriptors, therefore, has a clear advantage in speed. 6][27][28][29][30][31][32] Recently, multiple fingerprint features were employed together to build predictive models. Despite the relatively good performance on the test set for substrates, the validation sets for catalysts still had room for improvement, 33 which confirms the inherent difficulty of representing complicated catalyst structures by relying solely on binary 2D fingerprints.
Such unsatisfactory results likely stem in part from the binary nature of these fingerprints; they often do not contain enough structural information to construct a reasonable model, as the key corresponding to a given fragment is determined only by its presence. Nevertheless, simply employing conventional non-binary fingerprints is still insufficient for predictions; thus a relatively large amount of training data with multiple optimization cycles was required, as described by Belyk, Sherer, and co-workers. 32 Alternatively, fragment count descriptors, exemplified by the ISIDA (In Silico design and Data Analysis) platform, 34 encode fragments into non-binary vectors based on the number of their occurrences in a molecule, without limiting the number of features. ISIDA descriptors in particular offer a wide variety of possible representations of chemical structures, differing in topology (linear, atom-centered fragments, atom pairs, or triplets) and fragment size, which allows them to be fine-tuned to the problem at hand. Furthermore, the ISIDA platform enables the calculation of fragments for reaction schemes as a whole via Condensed Graphs of Reaction (CGR), 35 a special representation that combines reactants and products into a single pseudo-molecule containing dynamic (i.e., changing during the reaction) bonds, which is a unique property of this approach. Although these descriptors have been utilized in modeling pharmacological properties, 36,37 materials, 38 and chemical transformations, 39 their application to asymmetric catalysis has never been described.
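The difference between a binary key and a fragment count can be illustrated with a minimal sketch. The code below is not the ISIDA implementation; the function name `count_linear_fragments`, the toy molecules, and the choice of atom-label sequences of up to three atoms are assumptions made only for illustration.

```python
from collections import Counter

def count_linear_fragments(atoms, bonds, max_len=3):
    """Count linear atom-label sequences containing 2..max_len atoms.

    atoms: dict mapping atom index -> element symbol
    bonds: list of (i, j) index pairs (undirected, hydrogens omitted)
    """
    adj = {i: [] for i in atoms}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    counts = Counter()

    def walk(path):
        if len(path) >= 2 and path[0] < path[-1]:
            # Every path is reached from both ends; count one orientation only
            # and store the lexicographically smaller label sequence as the key.
            labels = [atoms[i] for i in path]
            counts["-".join(min(labels, labels[::-1]))] += 1
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in atoms:
        walk([start])
    return counts

# Toy heavy-atom skeletons (hypothetical examples, not from the dataset).
propanol = ({0: "C", 1: "C", 2: "C", 3: "O"}, [(0, 1), (1, 2), (2, 3)])
butanol = ({0: "C", 1: "C", 2: "C", 3: "C", 4: "O"},
           [(0, 1), (1, 2), (2, 3), (3, 4)])

propanol_counts = count_linear_fragments(*propanol)
butanol_counts = count_linear_fragments(*butanol)

# A binary fingerprint only records which fragments are present.
propanol_binary = {frag: 1 for frag in propanol_counts}
butanol_binary = {frag: 1 for frag in butanol_counts}
```

In this toy case the two binary vectors are identical because both molecules contain the same fragment types, whereas the count vectors differ, which is the extra information that count-based descriptors retain.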
Here, we report a predictive model for the enantioselectivity of structurally diverse and flexible catalysts using fragment count descriptors based on the ISIDA platform. For a more precise representation of cyclic or polyaromatic hydrocarbon substituents, which are common in asymmetric catalysis, another type of fragment is introduced.
Further, its application to an actual synthetic challenge is demonstrated; highly selective catalysts for a previously unsolved asymmetric catalytic reaction have been predicted from training data consisting of catalysts showing only moderate selectivities.
We initiated a preliminary study on the stereoselective hydroalkoxylation reaction of simple alkenes to construct 2,2-disubstituted tetrahydrofuran rings using imidodiphosphorimidate (IDPi) catalysts under several different conditions (35 examples, e.r. from 15:85 to 91:9; referred to hereafter as the THF set). 40 These IDPi catalysts possess a common scaffold, yet have a variety of substituents on the 3,3'-positions of the BINOLs and on the nitrogen atoms, represented as Ar and R, respectively (Figure 1A; only R = Tf in the THF set for the sake of simplicity). 41 The workflow is summarized in Figure 1B. The structural information of catalysts, substrates, and products is encoded by either fingerprints or fragment descriptors. In the case of fragments, the reaction schemes are transformed into CGRs.
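The idea behind a Condensed Graph of Reaction can be sketched in a few lines: with atom-mapped reactants and products, every bond is labeled by its order before and after the reaction, and bonds whose order changes are the dynamic ones. This is a conceptual illustration only, not the CGRtools or ISIDA API; the function `condense` and the simplified, hydrogen-free atom mapping of an intramolecular hydroalkoxylation are assumptions.

```python
def condense(reactant_bonds, product_bonds):
    """Merge atom-mapped reactant and product bonds into CGR-style bonds.

    Each input maps a pair of atom-map numbers to a bond order. The result
    maps each pair to (order_before, order_after); an order of 0 means the
    bond is absent on that side of the reaction.
    """
    def normalize(bonds):
        return {tuple(sorted(pair)): order for pair, order in bonds.items()}

    before = normalize(reactant_bonds)
    after = normalize(product_bonds)
    all_pairs = sorted(set(before) | set(after))
    return {pair: (before.get(pair, 0), after.get(pair, 0)) for pair in all_pairs}

# Hypothetical mapped skeleton: O1 adds across the C5=C6 double bond,
# closing a five-membered ring (O1, C2, C3, C4, C5).
reactant = {(1, 2): 1, (2, 3): 1, (3, 4): 1, (4, 5): 1, (5, 6): 2}
product = {(1, 2): 1, (2, 3): 1, (3, 4): 1, (4, 5): 1, (5, 6): 1, (1, 5): 1}

cgr = condense(reactant, product)

# Dynamic bonds are those whose order differs between reactant and product.
dynamic_bonds = {pair: change for pair, change in cgr.items()
                 if change[0] != change[1]}
```

Fragments that include such dynamic bonds describe the transformation itself rather than a single molecule, which is what allows one descriptor vector to represent the whole reaction scheme.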
Structures of catalysts can be represented either as full structures or only by the variable substituents depicted as Ar and R; the latter reduces noise from the common scaffold and makes it possible to apply different descriptors suited to each substituent. Physico-chemical parameters of solvents are also included in the model as descriptors. Other reaction conditions, namely temperature and concentration, are considered separately; all data points are calibrated to an identical condition before constructing a predictive model, and the predicted free energy differences are calibrated back to each experimental condition, providing the predicted enantiomeric ratio under the given conditions (see SI for details). Thus, this workflow represents a generalized approach to handling catalysts, reaction schemes, and reaction conditions simultaneously.
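The conversion between an enantiomeric ratio and a free energy difference that underlies this step can be sketched with the standard relation ΔΔG = RT ln(e.r.). The calibration procedure itself is described in the SI and is not reproduced here; the function names below are hypothetical, and only the temperature dependence is shown.

```python
import math

R_GAS = 8.314462618e-3  # gas constant in kJ/(mol*K)

def er_to_ddg(er_first, er_second, temperature_k):
    """Free energy difference (kJ/mol) implied by an e.r. at a temperature.

    The sign follows the order of the ratio: 15:85 gives a negative value,
    85:15 the corresponding positive one.
    """
    return R_GAS * temperature_k * math.log(er_first / er_second)

def ddg_to_er(ddg, temperature_k):
    """Enantiomeric ratio, normalized to a sum of 100, implied by ddg."""
    ratio = math.exp(ddg / (R_GAS * temperature_k))
    first = 100.0 * ratio / (1.0 + ratio)
    return first, 100.0 - first

# Round trip at one temperature (the temperature is an arbitrary choice).
ddg = er_to_ddg(91.5, 8.5, 298.15)
er_back = ddg_to_er(ddg, 298.15)

# Same free energy difference evaluated at a lower temperature.
er_cold = ddg_to_er(ddg, 253.15)
```

Under the simplifying assumption, made here only for illustration, that the free energy difference does not change with temperature, the same ΔΔG corresponds to a higher enantiomeric ratio at lower temperature.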
With these strategies, preliminary benchmark studies were performed to compare the performance of various fingerprints and ISIDA fragment counts, for which both atom-centered and linear fragments were considered. Due to the modest size of the dataset, the Support Vector Machine (SVM) method was chosen as a widely used nonlinear machine learning algorithm. 42 Descriptor space and model hyperparameters were optimized by genetic algorithms 43 in 10-fold cross-validation. In our preliminary studies, ISIDA fragments demonstrated better predictivity than any conventional fingerprint (Figure 2A). However, there is still a concern regarding their intrinsic linearity: for instance, atom-centered fragments are represented as a collection of linear fragments sharing the same central atom, and thus they do not explicitly account for cyclic structures. While this is normally considered to make the descriptors more general and insensitive to subtle changes in structure, 44 it may introduce ambiguity in the case of structures such as IDPi catalysts, which contain a large number of condensed aromatic rings, or certain substituents, such as polycyclic hydrocarbons.
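The ambiguity can be made concrete with a toy comparison. The sketch below enumerates the linear paths that start at a central atom and contrasts them with the full environment of that atom, in which every bond among the collected atoms is kept. It is a simplified illustration under our own assumptions (carbon-only skeletons, hypothetical function names) and does not reproduce any descriptor software.

```python
from collections import Counter, deque

def paths_from(adj, labels, center, max_bonds):
    """Count label sequences of all simple paths that start at `center`."""
    counts = Counter()

    def walk(path):
        if len(path) > 1:
            counts["-".join(labels[i] for i in path)] += 1
        if len(path) - 1 < max_bonds:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    walk([center])
    return counts

def environment(adj, center, max_bonds):
    """Atoms within `max_bonds` of `center` and every bond among them."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        a = queue.popleft()
        if dist[a] == max_bonds:
            continue
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    atoms = set(dist)
    bonds = {frozenset((a, b)) for a in atoms for b in adj[a] if b in atoms}
    return atoms, bonds

def ring_closures(atoms, bonds):
    """Number of independent rings in a connected environment."""
    return len(bonds) - len(atoms) + 1

# Six-membered carbon ring versus a seven-carbon chain (toy skeletons).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
ring_labels = {i: "C" for i in ring}
chain_labels = {i: "C" for i in chain}

ring_paths = paths_from(ring, ring_labels, center=0, max_bonds=3)
chain_paths = paths_from(chain, chain_labels, center=3, max_bonds=3)

ring_env = environment(ring, center=0, max_bonds=3)
chain_env = environment(chain, center=3, max_bonds=3)
```

Here the path-based description of a ring atom is indistinguishable from that of the central atom of the chain, while the full environments differ in both atom count and the presence of a ring closure.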
To account for this complexity, augmented substructure-type descriptors called CircuS (Circular Substructures) were developed in the framework of the CGRtools library. 45 They function similarly to ISIDA atom-centered fragments, yet explicitly consider the full substructures encountered, including closed rings, and thus represent complex structural motifs more specifically (Figure 2B). As expected, CircuS performed even better than the ISIDA descriptors; the fully optimized model built on CircuS fragments of catalyst substituents outperformed the others, with a satisfactory R² = 0.905 and MAE = 0.741 kJ/mol in leave-one-out (LOO) cross-validation (Figure 2A and 2C).
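For readers less familiar with how such cross-validated statistics are obtained, the following minimal sketch computes out-of-fold predictions and the resulting R² and MAE. It assumes a generic `fit_predict` callable; the nearest-neighbour learner and the synthetic data are stand-ins chosen to keep the example dependency-free, whereas the study itself uses SVM regression with genetic-algorithm optimization, which is not reproduced here.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_predict(fit_predict, X, y, k=10, seed=0):
    """Out-of-fold predictions; fit_predict(train_X, train_y, test_X) -> list."""
    preds = [None] * len(y)
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        fold_preds = fit_predict([X[i] for i in train],
                                 [y[i] for i in train],
                                 [X[i] for i in fold])
        for i, p in zip(fold, fold_preds):
            preds[i] = p
    return preds

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def nearest_neighbour(train_X, train_y, test_X):
    """Stand-in learner (1-NN); replace with the regression model of choice."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [train_y[min(range(len(train_X)),
                        key=lambda j: sq_dist(train_X[j], x))]
            for x in test_X]

# Synthetic descriptors and targets, for illustration only.
X = [[float(i), float(i % 3)] for i in range(30)]
y = [0.5 * row[0] + row[1] for row in X]

preds_10fold = cross_val_predict(nearest_neighbour, X, y, k=10)
preds_loo = cross_val_predict(nearest_neighbour, X, y, k=len(y))  # LOO: k = n
scores = {"R2": r2(y, preds_10fold), "MAE": mae(y, preds_10fold)}
```

Leave-one-out cross-validation is the special case in which the number of folds equals the number of samples, so each data point is predicted by a model that has never seen it.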
We then turned our attention to demonstrating the generality of the developed method and set out to apply it to the computer-aided design of more selective catalysts for tetrahydropyran (THP) synthesis (Figure 3A). Taking advantage of the modularity of IDPi catalysts, [46][47][48][49][50][51][52] a variety of catalysts were prepared, providing diversity at the 3,3'-positions of the BINOL backbones, represented as Ar, and in the sulfonyl groups on the nitrogen atoms, represented as R. For more efficient screening and data consistency, a synthesis robot was employed, streamlining the process from experiments to data generation (Figure S1). To build a predictive model for the THP synthesis, we then constructed a combined dataset comprising 35 examples for THF and 35 examples for THP to ensure the diversity of the catalyst structures (Figure S7).
The fully optimized model built on CircuS fragment descriptors provided reasonable predictivity, with cross-validated R² = 0.878 and MAE = 0.754 kJ/mol (Figure 3B). Based on this model, a virtual screening was conducted. As the training data do not cover all combinations of the Ar and R groups of IDPi catalysts, an additional 190 catalysts representing experimentally unseen combinations were prepared in silico, and their selectivities were virtually evaluated. As expected, among all the predictions, some catalysts were predicted to provide higher enantioselectivities than those experimentally obtained in the training set. Following these results, additional catalysts were prepared and evaluated as a validation set. Overall, the predictions worked well (Figure 3B), and the most selective catalyst 61 indeed provided an enantiomeric ratio of 91.5:8.5, which is significantly higher than any example from the training set (up to 82:18 e.r.). Additionally, by changing the temperature and concentration, an even higher enantiomeric ratio (96:4) was predicted, and this predicted value was almost identical to the experimentally observed value (Figure 3C).
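The enumeration step of such a virtual screening can be sketched as follows. The substituent labels, the set of tested pairs, and the scoring function are placeholders invented for illustration; in practice the scoring function would be the trained model, and none of the values below are data from the study.

```python
from itertools import product

def unseen_combinations(ar_groups, r_groups, tested):
    """All (Ar, R) pairs that do not occur in the training data."""
    return [pair for pair in product(ar_groups, r_groups) if pair not in tested]

def rank_candidates(candidates, predict):
    """Order candidates from highest to lowest predicted selectivity."""
    return sorted(candidates, key=predict, reverse=True)

# Hypothetical labels, not the substituents used in the paper.
ar_groups = ["Ar1", "Ar2", "Ar3"]
r_groups = ["R1", "R2"]
tested = {("Ar1", "R1"), ("Ar2", "R2"), ("Ar3", "R1")}

def toy_predict(pair):
    """Placeholder score standing in for a trained model's prediction."""
    ar, r = pair
    return ar_groups.index(ar) + 0.5 * r_groups.index(r)

candidates = unseen_combinations(ar_groups, r_groups, tested)
ranking = rank_candidates(candidates, toy_predict)
```

Because the candidates are recombinations of substituents already present in the training data, every fragment in their descriptors has been seen by the model, even though the full catalysts have not.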
Having been able to predict the optimal catalysts and conditions, we were keen on exploring other substrates.
Phenyl-substituted substrate 1b furnished an excellent isolated yield and enantioselectivity even on a larger scale.
This result is superior to the previously reported one, regarding both reactivity and selectivity, which underlines the efficiency of the presented method. 40 Naphthalene derivative 2c was also obtained with excellent yield and enantioselectivity. The more challenging aliphatic substrates, both with primary and secondary alkyl groups, underwent the desired cyclization to give products 2d and 2e in excellent yields and enantioselectivities (Figure 4).
The methodology delineated here enables quantitative evaluation of catalyst selectivities, including predictions for extrapolated examples. This approach is based solely on 2D structures and reaction conditions and does not require costly quantum mechanical calculations. Our approach appears to be useful for optimizing various catalytic reactions, enabling fast and robust predictions.

Figure 1. (A) The dataset (THF) used in the initial study. (B) The general modeling workflow. Each catalyst/reaction combination is encoded by different types of molecular descriptors, forming several descriptor pools. Each descriptor pool was used to build a corresponding Support Vector Regression model, the performance of which (coefficient of determination, R²) was optimized in 10-fold cross-validation. The best-performing model (the model with the maximum R²) is selected for external predictions.

Figure 2. (A) Descriptor benchmarking results (as MAE, kJ/mol) on the THF set in leave-one-out (LOO) cross-validation. The SVM model was optimized by genetic algorithms, and the best model's results are shown for several descriptor types (see Figure S9 for full data). (B) Fragments generated by CircuS at a given distance from the center atoms. (C) LOO cross-validation results for the best model (CircuS fragments, solvent descriptors, catalysts represented by the Ar substituent) on the THF dataset. Statistical parameters (MAE, R²) are indicated on the plot.

Figure 3. (A) Target reaction (THP synthesis) for which more selective catalysts are predicted. (B) 10-fold cross-validation results for the best model (CircuS fragments, solvent descriptors, catalysts represented by the Ar and R substituents, and the reaction represented as a CGR) on the THF dataset (black dots) and THP dataset (black triangles). The dashed line represents perfect prediction. Statistical parameters (MAE, R²) are indicated on the plot. Validation sets for THP are indicated by red triangles. (C) Selected examples comparing experimental and predicted values under different reaction conditions. ΔΔG is given in kJ/mol.