Machine learning activation energies of chemical reactions

Application of machine learning (ML) to the prediction of reaction activation barriers is a new and exciting field for these algorithms. The works covered here are specifically those in which ML is trained to predict the activation energies of homogeneous chemical reactions, where the activation energy is given by the energy difference between the reactants and transition state of a reaction. Particular attention is paid to works that have applied ML to directly predict reaction activation energies, the limitations that may be found in these studies, and where comparisons of different types of chemical features for ML models have been made. Also explored are models that have been able to obtain high predictive accuracies, but with reduced datasets, using the Gaussian process regression ML model. In these studies, the chemical reactions for which activation barriers are modeled include those involving small organic molecules, aromatic rings, and organometallic catalysts. Also provided are brief explanations of some of the most popular types of ML models used in chemistry, as a beginner's guide for those unfamiliar.

The β coefficients are the parameters in the model that are adjusted such that the model is fit to the data. This can be achieved by, for example, minimizing a loss function (L) of the mean squares error of the model with respect to the β coefficients, given by Equation (2).
where y j is the true value of the target property from the training data, f(x j ) is the predicted value from the linear model, n is the number of datapoints in the training set, and j runs over all training points with features x j and target values y j . It is also possible to introduce an extra term to the loss function that penalizes redundant features in the linear model. This is known as regularization, and two common types are called L2 and L1. L2 regularization adds the sum of the squares of the β coefficients to the loss function, as in Equation (3).
F I G U R E 1 A generic strategy for designing a ML model to predict a target chemical property, such as activation energy (E a , Y) of S N 2 reactions. Encoding of both the training and unseen reacting systems gives the input features (X). The model is trained to reproduce the target property of interest (Y train ), from the features of the training systems (X train ). The trained ML model can be applied to unseen systems with features X predict , and the model's performance is assessed by how accurately it predicts the property Y predict from the unseen system where i runs over all of the β coefficients for all the input features and λ is the regularization parameter and gives the "strength" of the L2 regularization. Larger λ values lead to an increase in the magnitude of the penalty and may lead to underfitting if too high. When L2 regularization is used in a LR model, it is known as ridge regression. 104,105 L1 regularization adds the magnitude of the β coefficients to the loss function, as in Equation (4).
The use of L1 regularization with an LR model is known as least absolute shrinkage and selection operator (LASSO) regression. 106 An interesting advantage of LASSO regression is that the L1 term in the loss function causes the β coefficients of low importance features to become zero, and it can thus be used for feature selection. When both L1 and L2 regularization terms are applied to a LR model, it is known as elastic net regression. 107 Note that regularization is a technique that is not only applicable to LR, the regularization terms can be included in the loss functions for many ML methods. LR models are easily trained and understood by the user, however, their predictions can be less reliable than other models if the key assumption behind a LR model is not present in the data; that a linear relationship exists between the features and the target property. 32

| Neural networks
A standard feedforward neural network (NN) is made up of layers of nodes (or neurons) with connections between all nodes in adjacent layers ( Figure 2). [29][30][31][32][108][109][110][111] Each node operates by taking the sum of all values that are input to it and outputting the activation function value of that sum. The activation function may be, for example sigmoid (Figure 2, right), hyperbolic tangent, or rectified linear functions. 112 A sigmoid activation function means that if the total input to a node is greater than a certain threshold, the node will output a larger value. Similarly, for inputs lower than the threshold, the node will output a smaller value. This is analogous to the process which neurons in the brain undergo; neurons receive electric signals, and if the total signal passes that neuron's threshold, it "fires" and transmits its signal to further neurons. 113 The NN's ability to learn the relationship between the input features and output variables of a dataset comes from the weight parameters between nodes. By adjusting these weights such that the error between the NN predictions and the actual values in the training data is minimized, the NN effectively learns the relationship between the input and output variables. NNs with larger numbers of layers are known as deep neural networks (DNNs) and are the algorithms behind "deep learning." 84,[114][115][116][117][118] DNNs are powerful predictors due to the extreme flexibility of the algorithm; however, they typically require much larger datasets to be reliable 119 and all but the simplest NNs become impossible for a F I G U R E 2 (Left) A representation of a simple NN. Starting from the input layer, the outputs of each node are multiplied by the weights for each connection before being passed to the next layer. Each connection has a unique weight associated with it, w k ij , where the weight corresponds to the connection between the ith and jth nodes in the kth and (k + 1)th layers respectively. An activation function is applied to the sum of the inputs to hidden layer nodes that produce the outputs of those nodes. (Right) The form of a sigmoid activation function human mind to grasp. Hence, NNs are often referred to as "black box" methods 117,[120][121][122] due to the difficulty in comprehending how or why the predictions of the model were made.

| Support vector regression
A support vector machine (SVM) is a tool for classifying data (a chemical example of which could be whether a particular compound will undergo a given reaction). This is done by constructing a hyperplane such that all datapoints from each class lie on opposite sides of the hyperplane. If the data are not linearly separable in the original feature space, a kernel function is used to map the data to a space where the categories are linearly separable by the hyperplane (Figure 3). The hyperplane is positioned so the distances between the hyperplane and its nearest datapoints (the support vectors, Figure 3) are maximized. In support vector regression (SVR), which considers continuous rather than discrete data (perhaps the percentage yield of a particular reaction), the support vectors between the hyperplane and the data are minimized, allowing the hyperplane to give the best possible description of the data. Should the data not be linearly related in the original feature space, a nonlinear kernel function can be applied to the data, and they become linearly related in the higher-dimensional space. 29,30,32,111,[123][124][125] However, there are some considerations that need to be made when utilizing a SVM or SVR. The choice of kernel function is very rarely obvious from the data alone (since the optimal separation or relationship between the datapoints may be found in a higher-dimensional space), and potentially several different kernel functions may need to be tested to maximize the predictive performance of the model. Furthermore, from a practical perspective, the SVR algorithm must store all of the support vectors for the datapoints in memory, and therefore the computational cost of the model increases with the size of the training dataset.

| Kernel ridge regression
The basic idea of kernel ridge regression (KRR) is to map the input features to a higher dimensional space and perform Ridge regression in that space. 29,31,[125][126][127][128][129] The KRR prediction for the features of a new instance, x, is given by a weighted sum of all examples in the training set, as in Equation (5).
where k is the kernel function, α i are the regression coefficients (cf. the β coefficients in LR), and x i are the feature vectors for the training examples where i runs over all examples in the training set. The loss function for KRR is then given by Equation (6).
F I G U R E 3 Illustration of the workings in a SVM. For example, a SVM could be attempting to classify whether a given compound will undergo a reaction (e.g., blue represents a compound that will react and red represents a compound that will not). Since the data are not linearly separable in the original feature space, a kernel function is applied that maps the data to a higher-dimensional space, where they are separable by the hyperplane. The distances between the datapoints and the hyperplane are known as the "support vectors" (purple arrows) where λ is the regularization parameter, j again runs over all the examples in the training set, y j is the true target value in the training set, and kfk is the norm of the function f from Equation (5) in the higher dimensional space. Since the square of the norm is added to the loss function, this is where the ridge regression part of KRR arises. Setting the first derivative of the loss function with respect to the α coefficients and solving gives the values of the α coefficients as in Equation (7).
where α is the vector containing the α coefficients, I is the identity matrix, y is the vector of the target values from the training set, and K is the kernel matrix with each element given by K ij = k(x i , x j ), the kernel function between x i and x j which are the feature vectors for two training set examples, where i and j run over all training points. A critical disadvantage of KRR is that the kernel function must be computed between the new instance and every datapoint in the training set each time a prediction is to be made. This can make KRR computationally expensive for large training sets.

| Random forest regression and gradient boosting
Random forests consist of decision trees which contain nodes that, for a given input, split the data according to the value of that input ( Figure 4). Each node is followed by two separate nodes that further split the data until the end of the tree is reached. In classification problems, categories within the data will have been separated by the end of the tree, so that any input will be placed into its proper category. In regression problems, the tree splittings formed during training minimize the root mean square error (RMSE) between the output of the decision tree and the true value corresponding to the input from the training set. Random forest regression (RFR) uses numerous decision trees and takes the average prediction of all the decision trees in the forest as its final output ("the wisdom of crowds"). 29,30,32,111,124,130,131 For example, if one attempts to predict the value of a property with RFR, each tree is provided with the set of descriptors, and each provides its prediction of the property, and the average value from all trees becomes the RFR prediction ( Figure 5). Random forests are generally considered to perform well for datasets of moderate size (perhaps a few hundred points) and increasing the number of decision trees within the model will typically improve model performance. However, if too many trees are used, the computational cost may become an intractable problem. Gradient boosting (GB), or tree boosting, is also constructed from decision trees, but rather than initializing the model with a certain number of trees and training them together, GB trains one tree at a time such that a subsequent tree minimizes the error from the previous ensemble of trees. 29,32,132 GB is a very similar ML method to RFR, and it also has the issue of computational cost when very large numbers of trees are used in the model. Also, one significant difference between RFR and GB is that the decision trees in RFR can be trained in parallel, while those in GB are created F I G U R E 4 A simple decision tree (right) to classify whether a point in a dataset (left) will be blue or red based on the values of its two features and trained in series. This means that, for the same number of trees at the same depth, GB will have a longer training time than RFR ( Figure 6).

| Gaussian process regression
In Gaussian process regression (GPR) a prior distribution of functions for the input features is sampled, such that the mean of the prior function distribution is zero for all values of all features. The forms of the functions are defined by a kernel function, which could be for example, sinusoidal or the radial distribution function. As training data are added to the model, the function distribution is updated so that each function passes through the training points, and this is known as the posterior distribution of functions. The GPR predictions for new inputs are given by the mean of the posterior distribution of functions for those input features. 31,[133][134][135][136][137] Empirically, GPR seems to have the advantage of not requiring a large training dataset 60,133,138 as in other models such as NNs. However, just as in KRR, it must take the entire training set into account for every prediction made, meaning its computational cost scales expensively relative to other ML models (the cost scales with the cube of the number of training points, 133,134 which presents a major bottleneck for the use of this algorithm when dealing with large datasets, as is common in problems for which ML is useful) (Figure 7).

| Other points for machine learning
There are a few other points with regard to ML that are worth mentioning. The first is that just as the parameters of the ML algorithm (e.g., the β coefficients in LR, or the weights in a NN) must be optimized in order to train the model, the F I G U R E 5 An illustration of RFR, which is constructed from the combination of many decision trees. Each tree is supplied with the same input features (X) and makes its (likely rather inaccurate) prediction of the desired output property. The splittings are based on the values of the input features (e.g., atomic charges or log K ow ). The values of the features determine which path on the tree is followed and thus which prediction the single tree produces. The average value of the predictions from all trees gives the final RFR prediction (y prediction ) F I G U R E 6 An illustration of the GB method. Training starts from an initial tree and subsequent trees are added such that the error from all tree predictions combined is minimized hyperparameters of the models may also need to be optimized to ensure that the model is best suited to the dataset at hand. These are the parameters that relate to the models themselves, rather than the ones used to fit the model to training data, for example, the number of hidden layers in a NN, or the number of decision trees in RFR.
Care must also be taken during training to avoid the occurrences of underfitting and overfitting ( Figure 8). Underfitting can occur when the model is not flexible enough to account for the complexity of the data. An underfit model will show large errors between its predictions and both training and testing errors. Overfitting occurs when the training data have been passed to the model too many times, or the model is too flexible for the dataset. The model thus becomes overfit towards the training data, and it has learned the specific trend in the training data and is unable to generalize. An overfit model is characterized by a low error for the training data, but a high error for testing data. Lastly, ML methods are interpolative not extrapolative. Generally, one cannot expect good performance from ML models for inputs outside of the range of the training data. For example, if a model were trained to predict the activation energies of reactions following the E2 mechanism, this model will not likely provide reliable predictions for reactions outside its training domain, such as S N 2 reactions.

| ML FOR ACTIVATION ENERGIES
Given the success of ML both inside and outside of chemistry, it has been considered for the prediction of reaction activation barriers. Several groups have made use of ML in this way, and this section reviews the work concerning ML predictions of homogeneous chemical reaction activation barriers. [139][140][141][142][143][144][145][146][147][148] Note, for works discussed in this review, the terms "activation barrier" and "activation energy" refer to those of elementary reactions, rather than the apparent barrier that would be observed over an entire reaction composed of multiple elementary steps.  creating a ML model to predict activation barriers. As well as obtaining a large dataset of reaction activation energies, a critical stage in creating these ML models is extracting the molecular representations, that is, the molecular features (or descriptors). Often used molecular descriptors in predicting activation barriers can be organized into different categories. "Physical organic" descriptors 139,[142][143][144]146,149 that are molecular and atomic properties such as atomic charges, bond orders, molecular orbital energies, measurements of molecular sterics, or thermodynamic quantities (e.g., free energy or enthalpy). Also used are geometric descriptors, for example, two-dimensional molecular representations known as molecular fingerprints [139][140][141]147,148 and descriptors of the three-dimensional atomic environments. 142 Physical organic features seem to require expensive QM calculations to be obtained, whereas the purely geometric features may not require such expensive calculations.
First, studies that apply ML models to directly predict activation energies will be covered, along with some aspects of the use and training of NNs in the context of these studies. Following this, attention will be turned to some of the limitations in these works; either the dataset used, or their use of reaction energy (the energy difference between products and reactants) as an input feature to the models. Further, some studies that address these issues will be covered.
Works that compare different types of features are examined, followed by those that produced models with reduced data requirements, and those that attempted to interpret their models. Finally, the uses of Δ-ML (see Section 3.4) to predict reaction activation energies will be discussed, and an interesting non-direct use of activation energy ML will be briefly mentioned. There also exists a fair amount of literature concerning ML predictions of the adsorption energies of small molecules onto catalytic surfaces, [150][151][152][153][154][155][156] but these are beyond the scope of the discussion here and a review on ML for heterogeneous small molecule activation has been published very recently. 157 We also do not include works where ML has been used for predicting activation energies in heterogeneous systems, [158][159][160][161] only homogeneous reactions are covered here. There are also a couple of works exploring ML for activation barriers from molecular dynamics simulations, 162,163 which are also not covered here.

| Direct predictions of activation energies with machine learning
The first step in building a ML model for chemical property prediction is the choice of algorithm. It may not be obvious which ML method is best suited for a given problem, but given the flexibility and predictive power of NNs, they should be expected to give good predictions of reaction activation barriers. In this section, the training and use of NNs are discussed, in the context of multiple studies that used NNs (among other models) for activation barrier predictions.
In work by Choi et al. in 2018, 139 an attempt was made to assess the feasibility of using ML for predicting activation energies of organic reactions (see Figure 9 for a few example reaction types). SVR, GB, and NNs with three to six hidden layers were trained on 12,704 (6078 unique) gas-phase reaction barriers from the RMG-py dataset. 164 The models were assessed by mean absolute error (MAE), RMSE, and coefficient of determination (R 2 ) between the ML-predicted and actual values of activation energy from the dataset. The NN reached a minimum MAE (for the test data) of roughly 2.5 kcal mol À1 at five hidden layers, indicating that overfitting was occurring with increased layers. However, none of the NNs were greater in accuracy than the GB method, which had the lowest MAE of just under 2.0 kcal mol À1 , despite the fact that one may expect the more complex and flexible NN model to show a better performance. Choi et al. state that this occurred due to an insufficient amount of data used in training the NNs. Although, all of the NNs had the Li et al. 142 used 15 ML methods (including, but not limited to, LR, SVR, RFR, KRR, GPR, GB [implemented in XGBoost], and NNs) to predict the regioselectivity of 3406 radical C-H functionalization reactions of heteroaromatic rings. Activation barriers of the reactions were determined from the differences between the Gibbs free energies of the minima and TSs, which were geometry optimized at the B3LYP/6-311+G(2d,p) level of theory, with single-point energies calculated at the M06-2X/aug-cc-pVTZ level of theory. The regioselectivity of the reactions was determined by the difference between free energy activation barriers (ΔΔG ‡ ) of the functionalization reaction for different positions on the heteroaromatic ring ( Figure 10). The ML models were trained to predict the values of ΔΔG ‡ , and several different sets of features were used as inputs (see Section 3.2). The NN attained the second lowest R 2 correlation coefficient of 0.967 between the ML-predicted and DFT-calculated values of ΔΔG ‡ , very close to the best, which was GB with an R 2 value of 0.968. However, further analysis was only carried out with GB, not the NN. This illustrates a practical consideration when facing model choice in ML; it is likely more worthwhile to use and explore a simpler model if, for a given dataset, it has roughly equal performance with a more complex model such as a NN.
A "frustrated Lewis pair" is a system in which a Lewis acid and a Lewis base are combined, but bulky groups surround the electron accepting and donating centers, forcing the centers to remain separated and preventing the transfer of electron density between them. Frustrated Lewis pairs are used in small molecule catalysis as they can be inserted between the Lewis centers, and subsequently activated for reactions. [165][166][167][168][169] Migliaro and Cundari undertook a study of methane activation by frustrated Lewis pairs formed from group 13 trihalides and group 15 pentahalides with ammonia ( Figure 11). 143 Note that simplified structures were used to reduce the impact of steric effects on the reaction barriers, and hence the effects of periodicity on the reaction could be analyzed directly. The group 13 elements in the Lewis acids were B, Al, Ga, In, and Tl, the group 15 elements were P, As, Sb, and Bi, and calculations were performed on the F, Cl, Br, and I halides of all nine elements. Calculations of the reactant and TS energies used the ωB97X-D density functional and the Def2-TZVPP basis set. The deprotonation of methane by the Lewis base (see Figure 11), rather than hydride abstraction by the Lewis acid, was calculated to be the more favorable mechanism both kinetically and thermodynamically. Migliaro and Cundari formulated a model for the activation energies of this mechanism. No single descriptor of F I G U R E 1 0 Illustration of how Li et al. determined regioselectivities for radical (R˙) C H functionalization reactions of heteroaromatic rings. The ΔG ‡ value for a single reaction is calculated by the difference in free energies of the reactants and the transition states. Regioselectivity is then given by the difference between ΔG ‡ values for two positions on the ring. The ML models were trained to predict these ΔΔG ‡ values F I G U R E 1 1 (Left) Transition state for deprotonation of methane with a group 13 trihalide (here AlCl 3 ). (Right) Transition state for deprotonation of methane by a group 15 pentahalide (here SbF 5 ). Note that the mechanism here involves the insertion of methane between the Lewis acid-base adduct and a proton from methane is transferred to the Lewis base (ammonia) the systems had strong linear correlation with activation energy, and hence a NN model was created. Only 35 datapoints were available (a TS for the reaction with BiF 5 was not found) so a very small NN with one hidden layer containing five nodes was trained to predict the activation energies for the insertion of methane into the Lewis pair. The importance of avoiding overfitting in NN training was also seen in this work. For a NN to become an accurate predictor, the training procedure must be repeated with the same data several times. The mean square error in the NN's prediction for the training, validation, and testing data decreased with each training iteration (or epoch). However, after nine iterations, overfitting towards the training data became apparent, and the errors for the validation and testing sets started to increase, while the training error continued to decrease. Thus, training was stopped at nine epochs in this case, so that the NN had its maximum possible general predictive accuracy. After training, the NN had an R 2 coefficient between the NN-predicted and DFT-calculated activation energies of 0.908 for all 35 datapoints, and an R 2 of 0.851 for the test data alone.
The works discussed above have demonstrated that NNs can be used to make strong predictions of activation energies based on input features that do not usually have strong linear correlations with activation energy. The flexibility of NNs gives rise to their strong predictive ability, but as seen in the above discussion, the user must tune the NN architecture and training process to avoid under/overfitting.

| Dataset limitations and the use of reaction energy as a feature
Grambow et al. highlight 141 that the results of Choi et al. 139 have some limitations. Most of the reactions in the dataset used to train and test the ML models were mostly composed of many similar reactions, which limits the general applicability of the ML models. Another unfortunate fact of the RMG-py dataset 164 used by Choi et al. is that many of the 12,704 gas-phase reaction energy barriers were different values from multiple sources, and thus, only 6078 unique reactions were actually considered. All 12,704 barriers were used to train the ML models, since there was no way of determining which value for each reaction was the more accurate. Therefore, on average, each unique reaction in the dataset would have two activation energies associated with it. It is not unreasonable to expect that activation energies for the same reaction from different sources may be similar, and the testing set would likely contain many reactions that were already included in the training set. Thus, the predictions of the ML models trained on this data would show on average, better predictive performance than if tested with reactions which the model had not already seen, albeit with data from a different source. Ideally, future workers should consider diversity of reaction datasets, such that ML models can be constructed with greater generality and lower costs of development.
Choi et al. used the reaction enthalpies and entropies (B3LYP/6-31G*) as features. This presents a practical issue and undermines the use of ML as a rapid predictor of chemical properties. For any reaction to be analyzed by this ML model, one requires thermodynamic properties of the products and reactants. These must be obtained via DFT calculations, which is not feasible on a high-throughput scale.
These same drawbacks can also be found in the study by Migliaro and Cundari. 143 As mentioned above, Migliaro and Cundari used only 35 methane deprotonation reactions, all of which follow identical mechanisms, and the only difference between structures is the periodic alteration of the halides and the central atom in the Lewis acid. However, it is not clear if this model could make accurate predictions for reactions involving more realistic frustrated Lewis pairs, with bulkier groups around the Lewis acid and base centers. The model also takes reaction energy as input, but also the dissociation energy of the acid-base adduct and the global electrophilicity index 170 (a measure of the level of Lewis acidity of a molecule), all of which had to be calculated at the expensive ωB97X-D/Def2-TZVPP level of theory.
A promising example comes from Palazzesi et al., 145 where they created "BIreactive," which uses Extra-Tree regression 171 (similar to RFR, but with small differences in how best splittings in the trees are chosen in the training algorithm) to predict activation energies for covalent drugs, which bind to their biological target through covalent bonds rather than non-bonded interactions. [172][173][174] The reactions of 2083 acrylamides and 2716 2-chloroacetamides with the methanethiolate anion (used as a less-complex stand-in for the peptide glutathione) were chosen to make up the dataset ( Figure 12). Conformational searches 175 were performed for each molecule, and the five lowest energy conformers of each were calculated at the ωB97XD/cc-pVDZ level of theory. Activation energies were calculated from the difference between the Boltzmann-averaged energies of the reactants and TSs. Separate BIreactive models were trained for the acrylamides and 2-chloroacetamides with 44 and 45 descriptors respectively for each system. Notably, none of the descriptors for the systems required QM calculations, and could be obtained from only the geometric information of the molecule. 176,177 The most important features to the two BIreactive models were: the number of terminal, primary sp 2 carbons atoms in the acrylamides, and whether chlorine was present five bonds from the nitrogen in the 2-chloroacetamides. The testing predictions from a five-fold cross-validation were combined and, for acrylamides and 2-chloroacetamides respectively, BIreactive was able to achieve R 2 coefficients between the predicted and DFTcalculated activation barriers of 0.85 and 0.69; Spearman rank correlation coefficients of 0.91 and 0.84 and MAEs of 0.75 and 0.88 kcal mol À1 . The values of the Spearman coefficients 178 show how well the ML model has managed to order its predictions, that is, how well the predicted and actual values of the target activation energies would fit to a monotonic function; the closer the value is to unity, the better the overall ordering. While the accuracies of BIreactive are impressive (notably the test MAEs are substantially less than 1 kcal mol À1 ), the software used to calculate the descriptors 179 is proprietary and has been discontinued at the time of writing, making use of these particular descriptors as well as a replication study rather difficult. Therefore, we would make the recommendation for the use of open-source alternatives for extracting chemical descriptors.
Choi et al. 139 suggested that, to make improvements in the quality of ML predictions of activation energies: "A deep learning approach with big reaction data may be a viable solution." This is the very approach Grambow et al. took, although without using any features that require QM calculations to extract. 141 The model begins with atom-mapped representations of the products and reactants; atom-mapped means that every atom in the reactants was labeled and its final position in the products is identified by that label. 180,181 These initial representations are passed to a directed message passing neural network 182 (D-MPNN) which creates its own learned representations of the molecules. The D-MPNN representation is a series of atomic vectors that contain the bonding information for each atom (collected from the atom type), the bonds with which it is directly involved and its neighboring bonds. The D-MPNN representation was chosen as it had been found that the D-MPNN model could frequently provide better predictions of chemical properties than using more traditional molecular descriptors as inputs. 182 For each atom in the products and reactants, the D-MPNN atomic vectors are subtracted from each other to obtain a difference fingerprint of the reaction. This was done so that the atoms that undergo the greatest changes to their environments (during the reaction) have the greatest influence on the activation energy prediction, since (as Grambow et al. state) the atoms whose environments change very little have only a small effect on activation energy. The difference fingerprint is then passed to another NN that takes the difference fingerprints from each atom and collects them into an encoding of the entire reaction. Lastly, the reaction encoding is input to a final NN, which is trained to predict the activation energies of reactions from their encodings.
The model was trained with the activation energies of 33,000 elementary reactions calculated at the B97-D3/ def2-mSVP level of theory, and 24,000 reactions at the ωB97X-D3/def2-TZVP level of theory. See Figure 13 for a few examples of the reactions considered. This dataset 183 included the forward and backward reactions, and structures contained up to a maximum of seven heavy atoms (carbon, nitrogen, and oxygen). It is not clear whether this model will be transferrable to larger systems, given the small sizes of the molecules in the training set. The DNNs in this model may have been able to learn the underlying principles behind the chemical reactions, and if this is the case, the model may show good performance on larger systems. Two levels of theory were chosen and a transfer learning approach was used to train the model. In this case, transfer learning 184 was performed by initially training the ML model with a large dataset of less accurate, more easily obtainable data and then the model was refined by further training with a more accurate and expensive dataset. The central idea is that the ML model can be well trained for the lower level of theory, given the larger amount of data available, and then a smaller amount of higher-level data can correct the model. This has a similar philosophy to the Δ-ML method (discussed later, Section 3.4), although in this case the low-level trained ML model is corrected by training with high level data, rather than using Δ-ML to correct lower-level calculated properties.
F I G U R E 1 2 Illustration of the TSs for the reactions between acrylamides (left) and 2-chloroacetamides (right) with the methanethiolate anion that made up the datasets for the training of the BIreactive models The model was able to achieve an impressive MAE of 1.7 kcal mol À1 and RMSE of 3.4 kcal mol À1 relative to the ωB97X-D3/def2-TZVP activation energies (the reactions spanned a range of 200 kcal mol À1 , and no QM-derived features were used as inputs for the model). The advantage of this model (i.e., that it can provide predictions of activation energies directly from 2-D molecular representations without the need for expensive QM calculations) also leads to a potential disadvantage. Unlike BIreactive, 145 this model does not consider the effects of conformational flexibility. Should a compound be particularly bulky, important steric effects may not be accounted for by the ML model, as they would be by QM. However, since the molecules included a maximum of seven heavy atoms, this likely did not cause a particularly significant amount of error. ML models that are able to consider the conformation of a chemical system have already been constructed for predicting potential energy surfaces. 66,118,[185][186][187] These methods can use, for example, atom-centered symmetry functions that describe the atomic environments and are invariant with respect to translation or rotation of the system. 65,188 Barring the likely massive data requirements from sampling the potential energy surfaces for thousands of reacting systems with accurate QM methods (ensuring both minima and TSs were well represented), there is no reason why models such as these could not be applied to predicting activation barriers. However, it would seem the approach from Palazzesi et al. of conformational searching the reactants and TSs is a more efficient way to construct a dataset. If this were explored further, reliable and accurate force fields [189][190][191] will become vital for the conformational searching involved in ML dataset construction.

| Comparisons of different feature types
Along with several ML models, Li et al. 142 tested how model performances varied with different sets of input features. Three types of local features were tested, as well as the combination of each type of local feature with global features. Local features refer to those that are specific to properties or environments of individual atoms, while global features refer to those that represent properties or the structure of an entire molecule. An example of a local feature is a single atom partial charge, whereas a global feature could be the molecular electrostatic polarizability. The local features that described atomic environments based on molecular geometries were atom-centered symmetry functions 188 ; smooth overlap of atomic positions (SOAPs), in which atoms have a Gaussian function associated with their positions; and the local environments are given by the integral of the sum of all other distance-weighted atomic-position Gaussians within a given cutoff radius. 135,192,193 Local physical organic (PhysOrg) features were calculated using the B3LYP/6-311+G (2d,p) level of theory and were C H bond order, atomic charges, and buried volume 194,195 (in this case, the percentage of a sphere, centered on a specific atom, of radius 3 Å that is occupied by the heteroaromatic molecule). The global geometric features were molecular fingerprints, 196,197 where molecules are described by vectors in which each element is a bit corresponding to a substructure of the molecule, and bag of bonds, 198 where a molecule is represented as a vector containing all possible pairwise interactions of atoms. The global PhysOrg features consisted of frontier molecular orbital energies and nucleus-independent chemical shifts 199 (NMR chemical shift at the center of an aromatic ring). Following training, the importances of the local and global PhysOrg features were analyzed, and it was found that both local and global were in the top 15 most important PhysOrg features. This suggests that a sensible direction for future chemical ML models may involve the inclusion of both molecular and atomic properties as descriptors, since features at both levels may contribute significantly to the predictions of activation barriers.
The model that achieved the greatest R 2 value (0.968) between the predicted and DFT-calculated ΔΔG ‡ values was a GB algorithm with SOAP local features and molecular fingerprints as global features. Since the SOAP/molecular fingerprint descriptors only require molecular geometries to be determined, the heteroaromatic rings were also optimized with the MMFF94 force field 200 and the PM7 semiempirical method, 201 as well as with B3LYP/6-311+G(2d,p). Using the SOAP/molecular fingerprint features, computed from the MMFF94 geometries, produced a slightly higher R 2 value (0.975) between the predicted and DFT-calculated ΔΔG ‡ values. This is rather promising, since these features could be used in a reliable ML model without the need for expensive QM calculations, as needed for the PhysOrg features. The GB model with SOAP/molecular fingerprint features and the RFR model with PhysOrg features were then applied to heteroaromatic systems with substituents that were not present in the training data. RFR was the method that gave the greatest R 2 for PhysOrg features.
After retraining the two models with the new systems, the PhysOrg model showed better predictions of ΔΔG ‡ (lower MAEs) for unseen systems from the new class of heteroaromatics, but the SOAP/molecular fingerprint model predictions did not improve. This suggests that the PhysOrg features are better able to capture the underlying influences that lead to differences in activation energies between the systems, whereas the description from the SOAP/molecular fingerprints could have reached a maximum possible accuracy. This brings up the discussion from the previous section, that is, the molecular features that tend to provide the most reliable predictions of activation energies require expensive QM calculations, which thus limits the model's feasibility for high-throughput applications. However, models trained with input features that are more easily produced may lose predictive accuracy. Perhaps there is the potential for "dual" ML models for the screening of large numbers of reactions: an initial ML model using cheaply obtainable features to provide an estimation of reaction activation energies, then a second model using QM-derived features to provide a better prediction for the reactions of interest from the first model.
Mikami 144 discovered that using so-called "interactive" quantum chemical descriptors (see definitions below) can lead to an improvement in the predictive abilities of ML models compared with using "classical" quantum chemical descriptors. In this case, the ML models were trained to predict the activation barriers of ethylene-metallocene coordination-insertion reactions (see the TS for the reaction in Figure 14). The classical descriptors were obtained from a lone metallocene (pre-reaction) and were LUMO energy, the charge on the metal (from natural population analysis 202 ), the metal cation-cyclopentadiene distance, buried volume, and sterimol parameters (that provide quantitative measures of ligand size 203,204 ). The "interactive" descriptors were obtained from an ethylene-metallocene complex ( Figure 15) and were stabilization energy (energy difference between the starting metallocene and the intermediate ethylene-metallocene complex); and the electrical interaction, charge transfer, and core repulsion energies from natural energy decomposition analysis [207][208][209] (each of these three energies are a portion of the energy difference between the ethylene-metallocene complex and the unrelaxed, separated ethylene and metallocene, see Figure 15).
Calculations were carried out with 68 group IV metallocenes (Ti, Zr, and Hf) with various cyclopentadiene-based ligands with the B3LYP density functional, the SDD basis set [210][211][212] for the metal and Si, and the 6-31++G(d,p) basis set for all other atoms. Ten ML models were trained to predict the activation energies of the ethylene-metallocene reactions using only classical features and five were retrained with the interactive descriptors. R 2 coefficients between the predicted and DFT-calculated activation energies increased for the training, validation, and testing datasets for all five models. The R 2 values for all five models also became much closer to each other. Although, once again, the descriptors used in this work require QM calculations to be obtained. It would be interesting to assess the use of interactive descriptors for other reaction classes, particularly if the descriptors were obtained at low-cost levels of theory.
Döntgen et al. have compared atomic partial charges to bond dissociation energies for their use as descriptors for activation barriers of unimolecular reactions. 149 Reaction activation barrier heights and bond dissociation energies were F I G U R E 1 4 A representation of the TS for the insertion-coordination reaction of ethylene with a cationic metallocene catalyst, studied by Mikami calculated at the DLPNO-CCSD(T)/CBS(cc-pVTZ, cc-pVQZ) level of theory (CBS refers to extrapolating cc-pVTZ and cc-pVQZ to the complete basis set limit 213 ) for 15 alkoxy roaming reactions, 10 keto-enol tautomerizations, 17 external radical H-atom abstractions, and 14 internal radical H-atom abstractions, with geometries optimized at the B3LYP-D3BJ/def2-TZVP level of theory. See Figure 16 for illustrations of the four reactions. One atomic partial charge was used for each reaction that corresponded to the carbon atom at the "reaction center." The bond dissociation energies and partial charges were fit to the activation barrier heights data with linear and/or quadratic equations (i.e., determining the functions that best expressed the relationships between barrier height and the two descriptors for each set of reactions). For the alkoxy roaming reactions and keto-enol tautomerizations, the model relating atomic partial charges to activation barrier was a much better fit than that for bond dissociation energies, as evidenced by the lower MAEs and RMSEs for the partial charge model than the bond dissociation energy model. However, for the two sets of H-atom abstraction reactions, neither the single partial charge nor the bond dissociation energy could be reasonably fit to a polynomial equation. Despite this, for the internal H-atom abstraction reactions where a peroxy radical formed an aldehyde or ketone and an OH˙radical, the product of the carbon and oxygen partial charges of the peroxide group had a strong linear correlation with the activation energy of the reactions (with R 2 correlation coefficient of 0.9). This indicates that the single partial charge provided insufficient information, but the two partial charges encoded enough information of the reaction mechanism that a better model for activation barrier was achieved. This perhaps indicates that a choice of features based on mechanistic understanding may provide a better description of activation energy. In this case, more than only the carbon atom in the peroxy group was significantly involved in the reaction mechanism and thus the combination of partial charges of atoms involved in the reaction was a superior descriptor.

| Reducing data requirements and interpreting models
Most ML models require reasonable amounts of data to ensure the accuracy of their predictions is as close as possible to the accuracy of the training data. However, highly accurate training data, whether from experiment, or high-level QM calculations, is very expensive to produce and in most instances these data will likely be of a limited scale. Therefore, a ML model that can attain a good performance with only very small amounts of training data will be very desirable indeed. GPR seems to be a good candidate for this task, 60,133,138 and in this section the papers that have used GPR for predicting activation barriers with reduced training sets are reviewed. Another attempt to address both reducing data requirements and interpreting ML models comes from Friederich et al. with their modeling of hydrogen activation on derivatives of Vaska's complex. 140 The derivatives of Vaska's complex 214,215 that would make up the dataset were defined from the combinations of a chosen set of ligands surrounding an iridium center ( Figure 17). See also Figure 18 for a representation of a TS for the hydrogen activation reaction at an example of Vaska's complex. In total, 2574 complexes were possible from the set of ligands (and their allowed positions around the complex), and the ML models were to be trained to predict activation energies of the reaction between the complexes and molecular hydrogen. However, the TSs for only 2197 complexes could be optimized using the PBE density functional, 216 the def2-SVP basis set, 217 and Grimme's D3 dispersion correction. 218 The first part of the study was to train NNs to predict the activation energies of the reactions. The NNs used autocorrelation and deltametric functions as features. These combine the values of an atomic property for all atoms within a certain number of bonds from the iridium center, through pairwise products (for autocorrelation) or differences (for deltametric). Thus, the functions give unique encodings of the environments within a complex as the number of bonds from the iridium center varies. The atomic properties used were electronegativities, atomic numbers, the number of atoms at a given number of bonds from the center, coordination numbers, and covalent radii. Thus, for example, one feature could be the deltametric function for electronegativities at a distance of two bonds, which would be calculated by the sum of the pairwise differences between the electronegativities of all atoms two bonds from the center. Note that none of these features require QM calculations to be determined, only a two-dimensional representation of the complex is required. The best NN model had a MAE for the test data of 1.12 kcal mol À1 when trained on 80% of the training data available; an impressive result. Ideally however, the ML could be able to make predictions with this accuracy (or better) with a reduced set of training data. Hence the second part of their study involved using GPR and feature selection, to ascertain if an accurate ML model could be constructed on a reduced training set. As well as the autocorrelation/ deltametric function features, Morgan fingerprints 197,219 (which provide string-based representations of atomic environments taking into account all atoms within a given circular cutoff) and molecular fingerprints from RDKit 220 were also included as potential features for the GPR model. A GB model was used to select which features had the most substantial contributions to activation energy. The GPR model trained on 80% of the data had a MAE of 0.59 kcal mol À1 , but perhaps more significantly, the GPR was able to reach a MAE below 1 kcal mol À1 with only 20% of the training data. This is a significant result; not only did the GPR model show excellent performance with very limited amounts of training data, but it did so with features that were not derived from expensive QM calculations.
After the ML models were trained, the GB method was used to analyze which input features had the greatest importance to the GPR and NN models, thereby allowing interpretation of the factors that lead the models to their predictions for any given complex. For the NN and GPR, the autocorrelation feature corresponding to the atomic numbers of atoms F I G U R E 1 7 The ligands composing the chemical space of the derivatives of Vaska's complex considered by Friederich et al. The R1 ligands are σ-donors and π-acceptors, the R2 and R4 ligands are σ-donors, and the R3 ligands are σ-donors and π-donors F I G U R E 1 8 An illustration of an example TS for hydrogen activation on the [Ir(PMe 3 ) 2 (CO)Cl] complex at a distance of two bonds from the center was the most important. In the NN, autocorrelation functions for atoms at other distances, and the deltametric functions for electronegativities, also had high importances. For GPR, the Morgan and RDKit fingerprints were the other high-importance features. Taken together, the analysis showed that reactivity around the complexes was more largely dependent on electronic effects, rather than steric effects from the ligands. Further in-depth analyses using the fingerprints from the GPR model allowed for interpretation of the effects of changing ligands on activation energy. For example, when a fluoride ligand was replaced with a cyanide, the predicted activation energy decreased by 14.4 kcal mol À1 which was due to the electron withdrawing character of the cyanide, leading to less repulsion between the iridium center and the approaching hydrogen. Friederich et al. note that this type of interpretation of ML models can be used for the screening and design of new potential catalysts. 140 Jorner et al. also utilized GPR in low data circumstances 146 when they created a "hybrid" model, in which a ML model was trained on experimental data, and DFT was used to calculate the features for the model. The rate constants for 443 nucleophilic aromatic substitution (S N Ar) reactions were obtained from the literature, see Figure 19 for an illustration of a selection of the reactions. GPR models were trained to reproduce these rate constants and thus, from the predictions, activation barriers could be calculated. Reactants and TSs for each reaction were conformationally searched with the GFN2-xTB semiempirical method, 221 and the lowest energy conformers from the searches were optimized at the ωB97X-D/6-31+G(d) level of theory, with single-point energies calculated at the ωB97X-D/6-311+G(d,p) level of theory. Electronic properties of the reactants and TSs were calculated with the B3LYP/6-31+G(d) level of theory and used as features for the models. Reactant features were ionization energies; electron attachment energies 222 ; the surface electrostatic potential and electrostatic potentials at reactive nuclei 223,224 ; local and global descriptors of nucleophilicity and electrophilicity 225 ; atomic charges; bond orders; solvent accessible surface areas 226 ; and a descriptor for dispersion. 227 TS features were the activation barrier of the reaction as calculated with the ωB97X-D/6-311+G(d,p) level of theory; TS atomic charges; TS bond orders; and TS nuclear electrostatic potentials. Separate GPR models were trained using different sets of the electronic (DFT-calculated) features. One set used all the electronic features from reactants and TSs, another without any TS features, and another using a small subset of the features, including the DFT calculated activation energy. GPR models were also trained with Morgan fingerprints, as well as "In Silico Design and Data Analysis" descriptors 228 (which encode molecules and reactants based on the fragments that make up the systems). The final feature set was a deep learning encoding of the reactions using a Bidirectional Encoder Representations from Transformers (BERT) classifier. 229 The DNN in the BERT classifier is trained to create fingerprint representations of the reactions, and these learned fingerprint representations were used as input features for the barrier ML models. Overall, GPR trained on the full set of electronic features performed best, with a MAE of 0.77 kcal mol À1 .
Since the hybrid model proposed by Jorner et al. uses experimental data, and most experimental datasets are likely to be of reduced size compared to those constructed from DFT, it will be crucial for the ML method in this type of model to be able to provide reliable results with restricted amounts of data. As also seen in the study by Friederich et al., 140 GPR seems to fill this limited-data role very well; it was able to reach a MAE less than 1 kcal mol À1 with less F I G U R E 1 9 Three of the S N Ar reactions modeled by Jorner et al. from their experimental dataset than 120 training samples and was also found to outperform LR, RFR, and SVR models trained with the same experimental rate constants with the full set of electronic features. The major advantage of this hybrid method is that the model is not confined to the accuracy of DFT; the limit of the model's accuracy is experiment itself. It also represents a useful method in the case when a limited amount of training data is available for ML (e.g., when using experimental data). In this case, the additional cost of DFT calculations for input features is justified by the greatly improved accuracy of the ML model.
Jorner et al. interpreted the model by calculating the Spearman coefficients between the model predictions and the features. This found, as may be expected, that the DFT calculated activation barrier was the most important feature to the predictions. However, the GPR model that did not include any features from the TSs still had a very respectable MAE of 0.86 kcal mol À1 , and the MAE was still lower than 1 kcal mol À1 with under 200 training samples. Thus, not including TS features did not lead to massively lowered accuracy for the S N Ar reactions. The model was still able to give chemical accuracy with only a modest amount of extra training data. The model based on the deep learning BERT features was also able to reach chemical accuracy, although with just over 350 training samples. In addition, the BERT features only require the connectivity of the systems in the reactions to be created, with no QM calculations needed. With a rather small number of datapoints, the risk of overfitting is apparent, although given the very strong performance of this model this may not be an issue. However, it is still worth guarding against overfitting when creating ML models, since this will deteriorate the general performance of the model, and one should always aim to use the largest dataset that is feasible to obtain.
There has recently been an increasing amount of attention paid to the interpretation of ML models, especially to avoid so-called "black box" approaches (especially DNNs), where the ML methods can make useful predictions, but their complexity means that it is not possible for humans to grasp exactly why the model came to its conclusions. 230 An argument can be made 231 that a better approach is to construct explainable, more easily understood models (such as LR), rather than to interpret feature importances after a more complex model has been trained. However, at least in the context of predicting activation barriers, LR models do not typically match the performance of more advanced algorithms such as NNs, GPR, and KRR. 140,142,144,146,158,159 This is not to say that LR models are irrelevant; their accuracies can be reasonable when trained with suitable features, but future workers attempting to predict activation barriers with ML should consider that a potential trade-off between model performance and interpretability may occur.

| Δ-ML for activation energy prediction
Δ-ML is a technique developed by Ramakrishnan et al. 232 where an ML model is trained to predict a correction for a property calculated at a low level of theory (known as the baseline). Rather than training the model to directly predict the high-level (targetline) value of the property, Δ-ML models are trained on the difference between the baseline and targetline values. The Δ-ML model is thus able to correct a property calculated at a low level of theory closer to its value at a higher level of theory (represented graphically in Figure 20).
Hammett's equation is notable for its simultaneous simplicity and predictive ability. [233][234][235][236][237] However, Hammett's original method is saddled with some significant disadvantages. The values of the substituent constants are highly dependent on the choice of reference reaction and may not be reliable for other reactions. [238][239][240] Hammett's method also easily overfits towards the reference reaction, meaning its predictions will be good for the reference reaction, but poorer for others. Thus, it was the work of Bragato et al. 148 to form a new Hammett method such that the substituent constants could become more transferable and the overall model be made more reliable for different reactions. This was achieved by calculating all reaction constants at the same time and calculating each substituent constant from its average over all reactions. This means the substituent constants give, generally speaking, better predictions for all reactions in the dataset rather than being biased towards the reference.
For a dataset of approximately 2400 S N 2 reactions (Figure 21), with MP2 calculated activation energies, Bragato et al. created Hammett models based on both the old and new methods as well as a ML and a Δ-ML model, to predict the activation energies of the S N 2 reactions. Note that the Hammett model is still applicable to activation energies since, from TS theory, activation energy is directly proportional to the negative logarithm of rate constant. Their chosen ML model was KRR, and it was trained on inputs of one-hot strings (zeros and ones only) encoded to represent the functional groups and their positions, with reaction activation energy as output. As may be expected, KRR was able to outperform both Hammett models. When plotted, the MAEs between the predicted and the MP2 activation energies were consistently lower for KRR than the new Hammett model and much lower than the original as the number of training points was increased. The greater performance of KRR is due to its greater flexibility; it can account for deviations from linearity in the data that the linear Hammett models cannot. However, KRR does have the disadvantage of being much more complex and requiring more data compared to the Hammett model. For the Hammett equation, only one parameter is needed per substituent, per possible position, but for KRR a parameter must be determined for every training point.
The Δ-ML also used KRR and was trained to predict the difference between the activation energies from the new Hammett method as a baseline, and the MP2 activation energies as the targetline. The "normal" ML and Δ-ML models eventually reached the same error (MAEs just greater than 1.5 kcal mol À1 ) with enough training points, but Δ-ML consistently had lower MAEs for all numbers of training points. This has been noted as an advantage of the Δ-ML approach. 232,241 The function that describes the error from a lower level of quantum chemical theory, relative to a higher level is apparently smoother 232 (and hence easier for ML algorithms to interpolate its form), compared to the function describing the relationship between the chemical system and its properties.
Δ-ML makes for an effective combination of QM and ML. It is able to reach the same level of accuracy as a ML model trained with molecular descriptors, but with reduced data requirements due to the smoother and more easily F I G U R E 2 0 Illustration of how Δ-ML can make predictions of chemical properties (E) in chemical feature space (R). Δ-ML is trained to predict the difference between the baseline (orange) and targetline (green) values of the property of interest. This predicted difference can then be used to correct the baseline value closer to the accuracy of the targetline (black arrow) F I G U R E 2 2 Illustration of how ReactionPredictor breaks down an overall reaction (top) into a sequence of elementary steps (bottom). For a given reaction, the most favorable elementary steps are selected by the DNN and the entire reaction is modeled from start to completion with DFT F I G U R E 2 1 S N 2 reactions with activation energies calculated at the MP2 level of theory used by Bragato   Note: The "reaction types" column briefly describes the chemical reactions for which barriers were predicted by each of the references. The "dataset details" column provides the total number of datapoints used as well as the proportions of those data points used for training, validation and testing (if provided). The source from which the activation barriers were extracted are given in the "barrier source" column, and the best ML model and its performance metrics (if provided) are in the columns "best model" and "best test accuracy." For each study, the most relevant molecular descriptors used in the best predictions of activation barriers are listed in the "important features" column. learnt relationship between the baseline and targetline properties. This allows for QM reaction modeling studies to be performed at increased speed and scale.

| QM-ML synergy
An interesting example of a somewhat indirect use of reaction activation energies with ML came from Sadowski et al. and their software ReactionPredictor. 147 The program was designed to be able to predict the likely outcomes of chemical reactions through a synergy of QM and ML. The procedure for this task was as follows: when given a set of reactants, electron sources and sinks are identified 242,243 and a DNN ranks all possible electron transfers between sourcesink pairs (corresponding to an elementary step in a reaction, see also Figure 22) in order of their favorability. After ranking by the DNN, the reactants, products, and TSs for the top few reaction pathways are optimized with DFT and the activation energies for the elementary reactions are calculated. Hence the overall mechanism of a reaction can be determined based on which elementary steps are most favorable, as ranked by the DNN. In this approach, the synergy between ML and QM occurs when the DNN has incorrectly ranked the reactions, as will be discovered after QM calculations. In that case the DNN is retrained with the correct ordering for these reactions, and thus its predictions are brought closer in line with QM. The ReactionPredictor system was later tested by Fooshee et al. 244 on a new dataset of reactions and achieved an accuracy of 80% for correctly predicting the products of an unseen benchmark dataset of multistep reactions.
The task of ReactionPredictor requires a great deal of flexibility from its ML model. Not only does the model have to interpret the input for a given reaction, but it also needs to be able to learn the underlying principles behind how organic reaction mechanisms proceed, based only on elementary reaction steps. DNNs are optimally suited to this type of problem. DNNs appear to gain some understanding of the underlying chemistry of the systems in their training sets. This is also seen when DNNs are used to predict potential energies of chemical systems 66,185,245 ; when trained on data from smaller systems, they are still able to make accurate predictions for larger systems.

| SUMMARY AND CONCLUSION
This review has examined the work using ML to predict activation energy barriers of homogeneous chemical reactions. Using ML models for activation barrier prediction is a relatively new, rapidly evolving area of study with all studies presented having being published within 5 years of the time of writing. Please see Table 1 for a summary of the key information from these studies. NNs have been seen to be flexible and effective ML techniques for this task, especially DNNs, and their use in the context of these works was discussed. A couple of potential limitations in some studies were noted. First, models using training data that consisted mostly of similar reactions would likely display very limited ability for generalization, but even more prominently was the use of input features derived from QM calculations. Although the ML models mean that difficult TS structures need not be found with QM to make predictions, the purpose of using ML in chemistry is to reduce the need for expensive QM calculations. However, also seen are models that exclusively use geometrical features as inputs, and still give very promising results. On the other hand, these models will not be able to offer the same level of mechanistic insight that could be obtained by analyzing feature importances in a model using QM-derived features.
As the use of ML in computational chemistry increases, the need for large datasets of chemical properties generated by QM calculations will also increase. It will therefore be a worthy endeavor to create tools that are capable of performing high-throughput QM calculations for a wide variety of different reaction types. 246 Attention has also been drawn to methods that have a reduced requirement for the amount of training data needed, and those that make attempts to interpret ML models. It is debatable whether it is better to use simpler models that can be easily interpreted, or to construct less comprehensible models that (at least in the context of activation energies) give more reliable results and interpret them by computing feature importances from the fully trained models. Δ-ML has also been seen to be a very effective technique for the acceleration of activation energy evaluation. Given the very recent influx of work using ML for reaction barrier modeling, there is little doubt that there is much work to be done towards the goal of general, rapid and chemically accurate activation energies.

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.