Assessing Conformer Energies using Electronic Structure and Machine Learning Methods

We have performed a large-scale evaluation of current computational methods, including conventional small-molecule force ﬁelds, semiempirical, density functional, ab initio electronic structure methods, and current machine learning (ML) techniques to evaluate relative single-point energies. Using up to 10 local minima geometries across ~700 molecules, each optimized by B3LYP-D3BJ with single-point DLPNO-CCSD(T) triple-zeta energies, we consider over 6,500 single points to compare the correlation between diﬀerent methods for both relative energies and ordered rankings of minima. We ﬁnd promise from current ML methods and recommend methods at each tier of the accuracy-time tradeoﬀ, particularly the recent GFN2 semiempirical method, the B97-3c density functional approximation, and RI-MP2 for accurate conformer energies. The ANI family of ML methods shows promise, particularly the ANI-1ccx variant trained in part on coupled-cluster energies. Multiple methods sug-gest continued improvements should be expected in both performance and accuracy. energies from the same set of density-functional optimized geometries, comparing multiple current methods to a high-quality coupled cluster baseline. We consider the mean absolute relative errors in energies (MARE), as well as the correlation of relative energies, reﬂected in the R 2 coeﬃcient of determination, and the ranking of single-point energies reﬂected in the Spearman ρ correlation. The use of correlation coeﬃcients and the Spearman correlation intend to consider whether methods exhibit systematic errors that may not aﬀect linear correlation or ranking of energetic stabilities.


Introduction
For almost all molecules, multiple geometrically-distinct conformers exist. Understanding and predicting thermodynamically accessible ensembles of molecular conformers is a key task underlying much of computational chemistry. [26,72,52] In principle, for each rotatable bond, the number of possible minima increases exponentially. Consequently, most conformer sampling methods [45] use classical small-molecule force fields to evaluate energies because of their fast performance, despite potentially poor correlation with quantum mechanical methods. [59] Multiple efforts have evaluated the success of wavefunction and density functional first-principles methods to compare the energetics of different conformers. [34,95,63,92,86,60,116] While experimental crystal structures and bioactive docked conformers are not always the lowest energy conformer, recent efforts have demonstrated only small energy differences when using quantum chemical methods instead of force fields. [88,23] Even for simple molecules such as 1,1'-biphenyl, use of large basis set coupled cluster methods are needed to accurately place the dihedral angle and barrier. [54] Other works have documented the need for accurate treatment of non-covalent interactions to model conformers in π-conjugated oligomers. [51] One common assumption is the presumed balance between increasing desired thermochemical accuracy and increased computational time. That is, more computationally intensive methods produce more accurate geometries and thermochemical properties. For example, the rise of composite ab initio thermochemical recipes such as G3, [15] G4, [16] and W1 [74,79] to W4 [61] seeks to provide highly accurate thermochemical predictions by separate estimates of basis set extrapolation and electron correlation. Still, such methods are largely limited to small molecules due to the high computational cost. [24] As mentioned above, efforts for conformer sampling have often focused on classical force fields or multi-level approaches using semiempirical methods. [45,46,58] In our previous paper, [59] we considered both the single-point energies and geometry optimizations of a range of common computational chemistry methods, including classical force fields, semiempirical quantum chemistry, and dispersion-corrected density functional methods. In general, due to the large differences in the potential energy surfaces predicted by force fields and quantum methods, we found poor correlation between both single point energies at the same geometry and optimized geometries using different methods.
In this work, in order to expand our range of computational methods, we only consider the relative single point energies from the same set of density-functional optimized geometries, comparing multiple current methods to a high-quality coupled cluster baseline. We consider the mean absolute relative errors in energies (MARE), as well as the correlation of relative energies, reflected in the R 2 coefficient of determination, and the ranking of single-point energies reflected in the Spearman ρ correlation. The use of correlation coefficients and the Spearman correlation intend to consider whether methods exhibit systematic errors that may not affect linear correlation or ranking of energetic stabilities.
While we find increased accuracy typically still requires exponential increases in computational time, several methods stand out as widely useful methods for ranking conformer energies. Future improvements in standard computational methods and machine learning surrogates suggest that both increased accuracy and efficiency are expected from further method development.

Computational Methods
Calculations were performed using Open Babel version 3.0 [77] for all force field calculations (MMFF94 [35,37,38,40,39] and UFF [90,13]) OpenMOPAC for PM7, [102]  Machine learning methods included "bag-of-features" representations and ANI-1x [97], ANI-1ccx [98], and ANI-2x [17] models. The Bag-of-Features representations chosen were Bag of Bonds [41](BOB), Bond Angle Torsion [47](BAT), and Bond Angle Torsion Typed (BATTY). BOB represents atoms and pair-wise interactions into sorted bags with BAT being a many-body expansion to include angles and torsions. Both of these representations were implemented using chemreps. [22] The BATTY representation takes inspiration from BAT in order to include minimal atom typing in all bond, angle, and torsion bags while excluding nonbonding interaction and nuclear charge bags in the final representation, as discussed below. scikit-learn [81] was used for kernel ridge regression of Bag-of-Features representations.
For this work, all timings are single-core CPU times using a 2.60 GHz Intel Skylake CPU (Intel Xeon Gold 6126) with 192GB RAM per node, through the University Pittsburgh Center for Research Computing.
Python scripts and Jupyter notebooks were used to compile all data into pandas [75] data frames, using numpy [104] and scipy [105] functions for analysis. 3DMol.js was used for interactive molecular visualization of conformers. [91] Plotly was used for interactive plots. [49] All scripts and data, including molecular geometries, are provided through GitHub (https://github.com/ghutchis/ conformer-benchmark) with the intent that additional computational methods can be added to these benchmark comparisons.

Test Set Selection
As in our previous work, [59] a dataset consisting of experimental crystal structures of 700 small molecules capable of multiple conformer geometries was provided to us by Ebejer [20] and were derived from the work of Hawkins et al. [46] along with ligands from the Astex Diverse Set. [44] These compounds have been repeatedly used to evaluate the quality of conformer generation. [46,20] Approximately half (320 molecules) consist solely of carbon, hydrogen, nitrogen, and oxygen (CHON) atoms, while the remainder are more complex drug-like compounds and ligands from the Protein Data Bank (PDB). [46] A list of Simplified Molecular Input Line Entry Specification (SMILES) [114] for all 700 molecules can be found in the Supporting Information.
For ab initio calculations using the cc-pVTZ basis sets, relativistic effective core potentials were not available for molecules containing iodine. Thus, for comparisons with DLPNO-CCSD(T) and RI-MP2 methods, such species were omitted. Similarly, the ANI-1x and ANI-1cxx methods only support molecules containing CHON atoms and evaluations were only performed on the subset of molecules supported. The ANI-2x method supports additional elements, but not bromine or iodine and thus evaluations were similarly only performed on the supported subset for that method.
For bag of feature ML testing, the training set was five conformers of each molecule, with the remaining conformers as test/validation. Any molecule with fewer than five conformers had the conformers added to the training set and was omitted from the test set.

Results
In this work, we focus on the evaluation of single point atomization energy calculations on a subset of~700 organic molecules. Conformers were initially created from a set of 250 diverse poses with maximal heavy-atom root mean squared deviation (RMSD) using Open Babel, and at most 10 poses were selected based on the lowest heat of formation calculated by PM7, followed by full geometry optimization using B3LYP-D3BJ with the def2-SVP basis set. [59] Using this set of DFT-optimized minima, in this work, single point atomization energies were computed using the DLPNO-CCSD(T) [69,33] method using the cc-pVTZ basis set. (Dunning 1989, Kendall 1992) This approach has been found to be a highly accurate method for calculating thermochemical properties and with a significantly lower computational cost for medium to large organic molecules, compared to canonical CCSD(T) methods. [80,68,69] Using only the set of molecules in which all standard (i.e., not machine-learning based) methods completed leaves 6511 entries. Of those, 9 molecules (out of 690) had 2 or fewer poses and were also removed, leaving 681 unique molecules and~6500 entries for comparison.
To our knowledge, this is the most extensive computational validation set, both in terms of the number of compounds, geometries, and computational methods for studying low energy molecular conformers. We provide all data and analysis scripts as open data and open source to allow future reuse via a GitHub repository.
As illustrated in Figure 2, each method is correlated with DLPNO-CCSD(T) / cc-pVTZ energies for each molecule (e.g., astex_1hwi in Figure 2). Since each molecule has several conformers, three metrics are compiled, the mean absolute relative energy (MARE) compared to the DLPNO-CCSD(T) atomization energies, the Pearson R 2 correlation, and the Spearman correlation ρ. The MARE metric gives an absolute measure of the energetic errors, but since different methods use different energy scales (e.g., heats of formation for PM7 and force fields), the statistical correlations use linear regression (R 2 ) and relative ordering (Spearman ρ) to remove sources of systematic energy differences. For each metric across each method, the median value was compiled as illustrated in Figure 2, to represent the overall quality of a given method.
Since the metrics are unlikely to reflect normal distributions (e.g., Figure 2 shows highly non-Gaussian distributions), determining confidence intervals cannot be established from analytical formulas. Consequently, we used bootstrap sampling to establish 95% confidence values for the medians, as reported below. For ease of discussion, we have given the confidence ranges in all tables and figures, but indicate ± errors using the average of the upper and lower bounds. In general, the asymmetry between upper and lower bounds are smalll.
By considering a large number of diverse organic molecules with many poses per molecule, we seek to sample a wide variety of conformer energy preferences (e.g., intramolecular hydrogen and halogen bonding, π-π stacking, electrostatic interactions, etc.). While using optimized low-energy conformers may under-estimate the accuracy of methods for high-energy structures, [95] we believe the current work is a challenging but useful comparison. In general, such high-energy geometries reflect steric repulsion more than the diverse types of interactions driving low-energy geometries.
Moreover, many computational predictions rely on Boltzmann-weighted averages of multiple thermally accessible conformers, including NMR prediction, [72,26] reactions, and even understanding the effects of dipole moments on solvent viscosity. [106] Consequently, deriving accurate relative energies of molecular conformers is a crucial task, as discussed below.

| Basis Set Effects
For the frequently-used B3LYP-D3BJ and PBE-D3BJ density functional methods, we considered both the def2-SVP and def2-TZVP basis sets. In both cases, the triple zeta basis set significantly improved correlation with the DLPNO-CCSD(T)/cc-pVTZ baseline, for example, the median R 2 scores improved from 0.868±0.064 to 0.920±0.025 for B3LYP-D3BJ and from 0.835±0.025 to 0.885±0.018 for PBE-D3BJ. There were comparable improvements in median Spearman rank correlation and reduced mean absolute relative errors, all statistically significant (i.e. p-values far below 0.001). The increased basis sets also roughly doubled the CPU time required.
Thus increasing basis set size for these density functional methods, at least from double zeta to triple zeta, does improve accuracy, albeit at a significant computational cost. In general, the B97-3c method provides accuracy comparable to popular dispersion-corrected DFT methods such as B3LYP-D3BJ with faster performance, and RI-MP2 provides greater accuracy at a very similar speed.

| Dispersion Corrections
Since the bonding is consistent across multiple conformers, the ranking of small energy differences is known to be dominated by non-bonding interactions. [73,32] Density functional methods are known to incorrectly account for dispersion interactions, which has led to a variety of empirical corrections. [31,9,55,56,11,12,27,57] Comparing un-corrected PBE, B3LYP, and ωB97X single-point energies to DLPNO-CCSD(T) illustrate a significant effect. The uncorrected median R 2 values drop by~0.12, and the median Spearman correlations drop by~0.08. For example, the median R 2 of B3LYP / TZ drops from 0.920±0.012 to 0.706±0.050 without the D3BJ dispersion correction.
On the time-scale of a density functional calculation, these empirical dispersion corrections require only a minuscule time, yet significantly improve the accuracy of the relative energies. Thus, even though this work is concerned with intramolecular interactions in conformers, dispersion-corrected density functional calculations should always be used. Continued efforts, such as the improved D3 methods [115] or the new D4 method [12,11] will hopefully improve their accuracy further.

| Comparison of Timing
As discussed above, a frequent concern for conformer screening is the relative computational performance. In general, classical molecular force field methods have been preferred since they allow the generation of hundreds of conformers per compound in seconds. While traditional high-level ab initio methods are considered a "gold standard" for thermochemical energies, the time required for a single point energy evaluation may be high. For this work, all timings are single-core CPU times using a 2.60 GHz Intel Skylake CPU (Intel Xeon Gold 6126) with 192GB RAM per node.
As indicated in Figure 3, hybrid density functional methods such as B3LYP-D3BJ require significant single-computational time for single-point energies of medium-sized organic molecules (median 26±0.3 minutes) compared to GGA methods such as PBE or approximate density functional tight binding methods such as GFN1 / GFN2 (median 2.6±0.06 s yields 600x speedup). Conventional density functional methods nevertheless represent a meaningful mid-point relative to DLPNO-CCSD(T) method, which may be faster than traditional coupled cluster methods but are still five to ten times slower than B3LYP (i.e., hours per single point energy).
Consequently, an important consideration is also the typical trade-off in computational chemistry between thermochemical accuracy and computational time. Since traditional MP2 and coupled-cluster methods exhibit high computational complexity, much research ignored them for medium to large organic molecules due to the time required.
Particularly in computational screening and conformer generation, fast molecular force fields such as MMFF94 and UFF, as well as semiempirical quantum chemical methods such as AM1, [18] PM3, [100] PM6, [101] and PM7 [102] were considered "good enough" to generate structures for further refinement with density functional and other methods. More recent methods, particularly the ANI machine learning methods and the GFN family of density functional tight binding appear to significantly improve on accuracy with only modest increases in the time required.
We find that consistent with common assumptions, even recent methods roughly adhere to the requirement of significant increases in computational (time) cost to gain increased thermochemical accuracy, as illustrated in Figure 4 with R 2 . Similar trends are found for MARE and Spearman ρ metrics. Since multiple studies have demonstrated the need for accurate treatment of noncovalent interactions including intramolecular electrostatic and dispersion effects for understanding conformer relative energies, it is not surprising that this benchmark illustrates the significant accuracy advantage of modern dispersion-corrected density functional and wavefunction methods.

| Use of Machine Learning Methods as Surrogates: ANI and Bag-of-Features
One possible solution to the trade-off between accuracy and computational cost would be the growing use of machine learning (ML) methods in chemistry, particularly as a surrogate for thermochemical parameters such as atomization energies. [94,42,21] Typically, these ML methods use deep neural networks (DNN) and have been trained to density functional calculations, particularly hybrid B3LYP or ωB97X atomization energies [89][96] although recent efforts have included training on coupled-cluster quality data as well. [98] In principle, since the evaluation of the DNN is fast, the time required for the prediction of an ML method is dominated by the time to generate the input descriptors -still only a small fraction of that required for a quantum calculation. Therefore, if an ML method could reproduce density functional or coupled-cluster energies at semiempirical or force field computational cost, it would dramatically change the conventional accuracy/time tradeoff.
While evaluation of DNN methods would be significantly faster on graphics processing units (GPUs), and may not be optimized for CPU evaluation, we note that many quantum chemistry methods are also accelerated on GPUs. Thus we retain the single-core CPU timings in Table 1 and Figure 4 but note that the actual speed of ML methods such as ANI would be faster when evaluated on a modern GPU.
| ANI methods Table 1 and Figure 4 show the ANI family ML methods, ANI-1x, ANI-1ccx, and ANI-2x, performing similarly to GFN tight binding semiempirical methods in both accuracy and speed. ANI-1cxx outperforms the ANI-1x model that does not contain dispersion corrections while performing slightly better than the ANI-2 model. The inclusion of dispersion correction for DFT methods is clearly beneficial as they improve upon their non-dispersion corrected counterparts, as seen in Table 2.
In principal, it is possible to perform post hoc addition of a D3 dispersion correction to both ANI-1x and ANI- 2x. Table 3 shows potentially improved performance over their non-dispersion corrected counterparts, although the differences are not statistically significant. Moreover, since the D3 dispersion correction for ωB97X-D3 cannot be calculated by standard tools, applying such a post hoc correction is challenging. For our set, one could calculate the dispersion correction from the ωB97X-D3 calculations performed on the same molecule, but without such density functional calculations, applying dispersion correction would be impossible.
While the newer D4 correction [12,11] can be calculated using the DFTD4 program, [1] we find adding D4 corrections worsen the median R 2 and Spearman metrics, although again the differences are not statistically significant. The variance of applying D3 and D4 corrections to the ANI models illustrates the challenge in current machine learning methods. Since they inherently add some error on top of the underlying data used for training the model, use of coupled-cluster or other highly accurate dispersion-corrected training is needed.

| Bag of Feature methods
The performance of the bag-of-features models, while faster than the ANI symmetry function models, were more comparable to the accuracy of force field methods. The inclusion of additional information to the descriptor such as three and four-body interactions and atom typing were beneficial to the bag-of-features models, the accuracy pales in comparison to the ANI symmetry function models.
Standard bag-of-features have at minimum a bag of nuclear charges and a bag of two-body interactions as seen in BOB and further bags are added that contain additional information such as angles and torsions with BAT. This approach was taken for the BATTY representation with the modification of using minimal atom typing (i.e., sp, sp 2 , sp 3 hybridization) to sort bags. Unlike other bag-of-features representations, the performance of BATTY was increased by removing the bags of nuclear charges and excluding the nonbonding interactions from the two-body interactions bag to create a bag of simple bonds. Since relative conformer energies are strongly dominated by non-bonded interactions, this finding is surprising, although perhaps separating bonding and two-body non-bonded interactions facilitate ML training. A recent example, BAND-NN, took the approach of separating the bonding and nonbonding information similarly to classical force fields and finds an improvement in performance. [66] ML commonly employs techniques to normalize the data, improving the model's training. [50,64] In this work, we ML methods, despite training on density functional and coupled-cluster energies, are still not as accurate as conventional quantum methods for predicting conformer energies. At present, the ANI family is comparable to the semiempirical GFN methods for accuracy on this task.

| Effects of Conformer Energy Ranges on Accuracy Metrics
Previous work has suggested that the poor correlations found between force field and semiempirical methods are derived from the small number of low-energy conformers considered in this benchmark. [95] Certainly, one might imagine that when considering multiple geometries with only small differences in energies, random errors are magnified. Figure 5 illustrates a histogram of the ranges in DLPNO-CCSD(T) energies across the molecules considered.
Despite the small ranges in energies, there is little correlation between the energy range of a molecule and the accuracy metrics of a particular method. This suggests no bias from the small energy windows used in this benchmark set.

| Dipole Moment Ranges
Since we generally find very small energy differences between the conformers considered in this work, one might wonder whether such differences have meaningful consequences. Due to Boltzmann statistics, many properties are dominated by the lowest energy geometry, even with small energy windows to other geometries. One recent example comes from understanding the effects of dipole moments on solvent viscosity. [106] Finding all conformers with proper weighting is thus crucial to predicting the dipole moment of an ensemble of different conformers.
We find over the set of molecules considered, over 140 molecules have a range of 3 D or more, and 75 molecules have a range of 4 Debye or above across multiple conformers in the study. Figure 9 illustrates the example of omegacsd_CNBPCT, in property prediction, as recently discussed with conformer and polarity effects on solvent viscosity. [106] Machine Learning Batch Evaluation An advantage for ML and force field predictions is the ability to batch evaluate by loading all conformers of a molecule at once and evaluating them as a batch opposed to evaluating one at a time, as with conventional quantum chemistry methods. Table 5 indicates the median sequential times from Table 1 and median time per single point in batch evaluation. Speedups range~70-170 times faster for both force field and ANI methods. We note that while the ANI methods improve performance in batch evaluation, traditional force field methods do as well, with similar speedups.

Conclusions
The current work extends previous efforts to consider the accuracy of modern computational chemistry methods to rank the energies of drug-like conformers. Since such energy differences are small, this poses a challenging benchmark Current ML methods show great promise, particularly the ANI-1ccx method trained in part on coupled-cluster energies, [98] since they provide accuracy comparable to the semiempirical GFN2 method and can be performed in batch and accelerated on GPUs. Despite claims of reaching and exceeding DFT accuracy, we do not find these methods yet meet the accuracy of modern dispersion-corrected methods. Nevertheless, we expect these methods will provide increased accuracy in the future. An important caveat is the need to train on accurate data, such as dispersion-corrected density functional, MP2, or coupled-cluster calculations.
We expect continued improvement from other methods, particularly multiple efforts to improve classical force fields, [108,103,93,43] inclusion of polarizable atomic charges, [70,48,111,53,119,87,84,71] novel force fields from experimental data, density functional and other quantum methods, [3,7,118,109,117,28] and continued development of approximate semiempirical quantum methods. [4] At present, we can highly recommend methods at each tier of the accuracy-time tradeoff, particularly the recent GFN2 semiempirical method, the B97-3c density functional approximation, and RI-MP2 for accurate conformer energies. Previous efforts to use a hierarchy of methods are still useful, for example, the use of GFN2 methods to refine initial conformer ensembles, followed by refinement of a smaller set of low-energy geometries with more accurate methods. Batch evaluation with ANI methods are also efficient, although they do not yet span the range of elements supported by semiempirical methods such as GFN2 or density functional methods.
The current benchmark reflects conformational preferences in a vacuum as judged by enthalpy differences alone.
Since free energy differences drive experimental conformers, introducing entropic considerations will be needed for further work. [54] Moreover, much chemistry is performed in solution, thus work on understanding conformer energies in solvation is also critical. [6,5] Acknowledgments GRH  TA B L E 5 Comparison of single-core median sequential time to median batch time (in seconds), and relative speedups for batch evaluation.