Uncertainty quantification in thermochemistry, benchmarking electronic structure computations, and Active Thermochemical Tables



The accepted convention for expressing uncertainties of thermochemical quantities, followed by virtually all thermochemical tabulations, is to provide earnest estimates of 95% confidence intervals. Theoretical studies frequently ignore this convention, and, instead, provide the mean absolute deviation, which underestimates the recommended thermochemical uncertainty by a factor of 2.5–3.5 or even more, and thus may vitiate claims that “chemical accuracy” (ability to predict thermochemical quantities within ±1 kcal/mol) has been achieved. Furthermore, copropagating underestimated uncertainties for theoretical values with uncertainties found in thermochemical compilations produces invalid uncertainties for reaction enthalpies. Two groups of procedures for determining the accuracy of computed thermochemical quantities are outlined: one relying on estimates that are based on experience, the other on benchmarking. Benchmarking procedures require a source of thermochemical data that is as accurate and reliable as possible. The role of Active Thermochemical Tables in benchmarking state-of-the-art electronic structure methods is discussed. Published 2014. This article is a U.S. Government work and is in the public domain in the USA. International Journal of Quantum Chemistry published by Wiley Periodicals, Inc.


New thermochemical determinations, such as enthalpies of formation, bond dissociation energies, or reaction free energies, can contribute constructively to the existing body of knowledge only if their fidelity level is also conveyed. The latter is normally achieved by providing associated uncertainties, which, to be valid, need to be properly quantified and unambiguous, that is, must follow a common convention.

Expression of uncertainties in thermochemistry

The standard for expressing uncertainties in thermochemistry is to provide earnest estimates of the 95% confidence intervals (u95%). The convention was introduced in the 1930s by Rossini,[1] who proposed that the quoted uncertainty should correspond to twice the standard deviation σ, once the components due to random errors have been augmented by realistic estimates of all suspected systematic errors. The convention was immediately adopted for all thermochemical work at the US National Bureau of Standards, influencing similar research at other laboratories, and by the mid-1950s it had become a de facto worldwide standard,[2] subsequently formalized by IUPAC recommendations[3] and matter-of-factly followed by virtually all major thermodynamic tabulations, such as CODATA,[4] JANAF,[5] or Gurvich et al.[6]

Accuracy versus precision and trueness

The role of uncertainty is to convey a quantitative indication of accuracy. Colloquially, accuracy and precision are incorrectly treated as synonyms; compounding the confusion is the fact that scholarly literature that emphatically distinguishes these two terms often treats accuracy as synonymous with trueness.

According to International Organization for Standardization (ISO) standards,[7] precision reflects the degree of consistency between repeated measurements and provides an account of the spread of the distribution of a series of measurements, whereas trueness reflects the bias between the mean of the measured values and the (unknown) true value, caused by errors that do not average out by repetition (Fig. 1). The spread is caused primarily by random errors, the bias primarily by systematic errors. Accuracy reflects the combined effect of the achieved precision and the perceived trueness.

Figure 1.

Schematic depiction of the relationship between accuracy, precision, and trueness. The blue vertical line is the central value of a series of measurements, the distribution of which can be described by the blue Gaussian; the red vertical line is the (unknown) true value of the measurand. The blue horizontal double arrow is the spread in the set of measurements (expressed here as two standard deviations) and reflects precision. The purple horizontal double arrow is the bias between the mean of the measurements and the true value, and reflects trueness. The green horizontal double arrow assesses accuracy, and—assuming that the sign of the bias is unknown—the corresponding uncertainty is obtained by combining in quadrature the spread of the determinations and the estimated bias.

Notably, by virtue of combining the account of random errors with best estimates of all conceivable systematic errors, the u95% uncertainty as used in thermochemistry intends to quantify the expected accuracy, rather than just precision or just trueness. If u95% uncertainties are properly quantified, then the true value should lie inside the quoted error bounds at least 19 times out of 20.

Type A and Type B uncertainties

The Guide to the Expression of Uncertainty in Measurement[8] (GUM, formerly ISO GUM) classifies the components of the uncertainty based on the method of their evaluation, rather than on their nature (random vs. systematic). Type A components are those that are evaluated by statistical analysis of a series of observations, Type B are those evaluated by other means, typically by estimation based on knowledge and experience of the evaluator. By formalizing Type B methods, ISO GUM provided for the first time a well-defined procedure for quantifying uncertainty components that arise from systematic errors. Irrespective of whether they are Type A or B, once the individual uncertainty components are evaluated, ISO GUM treats them as equally valid and makes no further distinctions in their use.

In ordinary cases, the uncertainty components are unsigned and essentially uncorrelated (correlation coefficients ρij ≈ 0), and the final uncertainty is obtained by adding the components in quadrature (square root of the sum of squares). If the sign of a significant component of the bias is known, then the appropriate correction to the measured value must also be worked out. If the uncertainty components are correlated, their summation becomes more involved: in the limiting case of a perfect correlation between two components (ρij = 1), their combined uncertainty corresponds to their simple sum.
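The combination rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `combine_uncertainties` and the dictionary layout for the correlation coefficients are assumptions made for this example, and all components are assumed to be expressed at the same coverage factor.

```python
import math

def combine_uncertainties(components, correlations=None):
    """Combine unsigned uncertainty components given at one coverage factor.

    components   -- list of non-negative uncertainty components u_i
    correlations -- optional dict mapping index pairs (i, j), i < j, to the
                    correlation coefficient rho_ij; pairs that are not
                    listed are treated as uncorrelated (rho_ij = 0)
    """
    correlations = correlations or {}
    # Uncorrelated part: sum of squares.
    variance = sum(u * u for u in components)
    # Correlated part: cross terms 2 * rho_ij * u_i * u_j.
    for (i, j), rho in correlations.items():
        variance += 2.0 * rho * components[i] * components[j]
    # Guard against tiny negative values from rounding when rho is near -1.
    return math.sqrt(max(variance, 0.0))

# Two uncorrelated components add in quadrature (close to 0.5 here).
print(combine_uncertainties([0.3, 0.4]))
# With rho = 1 the same components add linearly (close to 0.7 here).
print(combine_uncertainties([0.3, 0.4], {(0, 1): 1.0}))
```

The two calls reproduce the limiting cases named in the text: quadrature for ρij = 0 and a simple sum for ρij = 1.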

Evaluating the Accuracy of Theoretical Thermochemical Quantities

Quantifying the uncertainty can be a laborious undertaking, the complexities of which may exceed those of the respective measurement. This, however, is not a valid excuse for presenting a measurement without an indication of its fidelity. GUM explicitly states[8]: “When reporting the result of a measurement of a physical quantity, it is obligatory that some quantitative indication of the quality of the result be given so that those who use it can assess its reliability.” This applies equally to physical measurements (experimental determinations) and virtual measurements (theoretical computations).

Theoretical determinations of thermochemical quantities suffer primarily from systematic errors, although the role of random errors is not entirely nonexistent. An example of the latter would be a zero-point-energy obtained from geometry optimization followed by a frequency calculation; unless the convergence limits are set exceedingly tightly, this will tend to produce a slightly different result each time the initial geometry is slightly modified. Such random errors are small in comparison to systematic errors, entailed by the choice of theoretical method(s) (electronic structure method error) and basis set(s) (basis set convergence error).

Type B procedures of evaluating accuracy

Type B procedures require high expertise, usually gained only after years of experience. They are often the only feasible choice for theoretical computations that do not have a fixed protocol because the actual sequence of calculations is finely customized to the targeted chemical species and property, such as is frequently the case with FP[9] or FPD[10] approaches. An example would be a custom sequence that is centered around frozen-core CCSD(T) calculations using high-ζ correlation-consistent basis sets and extrapolated to complete basis set (CBS), with additional corrections for contributions due to core-valence, scalar relativistic effects, higher excitations, spin-orbit, first-order nonadiabatic effects, and anharmonic vibrational zero-point-energy. The process of elaborating the error budget consists of critically estimating the individual uncertainties of additive components of the final computational result. The individual uncertainties are obtained by exhaustively analyzing all pertinent details of each additive constituent of the final result, such as sensitivity to the exact geometry at which computations are made, residual basis set truncation error and the spread in values when different extrapolation formulas are used, applicability of a single reference method and the consequent effects of potential spin contamination and so forth, and may need to go as far as including the uncertainties of the fundamental physical constants and the atomic or nuclear masses. The line of reasoning behind each uncertainty component needs to be congruent, complete, and fully justifiable. 
A helpful generic step-by-step guide to the process of estimating Type B uncertainty components can be found in Taylor and Kuyatt.[11] Additional uncertainty components, which do not directly correspond to a computed additive constituent of the final result, may also need to be included, such as the effect of the additivity assumption, the exclusion of even higher excitations, and so forth.

The individual uncertainty components need to be converted to a common form (same coverage factor) before combining them. GUM[8] and National Institute of Standards and Technology (NIST) Guidelines[11] suggest conversion to standard deviations (a coverage factor of one), although an alternative would be to consistently keep all uncertainty components as 95% confidence intervals (a built-in coverage factor of two). The individual components are combined in quadrature, unless there is good reason to believe that some of them are significantly correlated. Finally, if the individual uncertainty components were not kept as estimates of 95% confidence intervals, the final uncertainty needs to be expanded to a 95% confidence interval by multiplying with the appropriate coverage factor (two, if individual components were at the level of a standard deviation).
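As a rough sketch of the bookkeeping just described, the following Python fragment converts each component to the level of a standard deviation, combines the components in quadrature, and expands the result to a 95% confidence interval. The function name `expanded_uncertainty`, the component labels, and all numerical values are hypothetical and serve only as an illustration; correlations between components are assumed to be negligible.

```python
import math

def expanded_uncertainty(components, target_k=2.0):
    """Combine components quoted at different coverage factors.

    components -- list of (value, k) pairs, where k is the coverage factor
                  at which the component was estimated (1 = standard
                  deviation, 2 = approximate 95% confidence interval)
    target_k   -- coverage factor of the result (2 for u95%)
    """
    # Step 1: bring every component to the level of a standard deviation.
    standard = [value / k for value, k in components]
    # Step 2: combine in quadrature (assumes negligible correlations).
    combined = math.sqrt(sum(s * s for s in standard))
    # Step 3: expand to the requested coverage factor.
    return target_k * combined

# Hypothetical error budget in kJ/mol; labels and numbers are placeholders.
budget = [
    (0.30, 2.0),  # basis-set extrapolation, estimated as a 95% interval
    (0.10, 1.0),  # higher excitations, estimated as a standard deviation
    (0.20, 2.0),  # zero-point energy, estimated as a 95% interval
]
print(round(expanded_uncertainty(budget), 3))
```

Keeping every component as a 95% confidence interval from the start, the alternative mentioned above, gives the same result, since the common factor of two can be moved outside the square root.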

Type A procedures of evaluating accuracy

Type A procedures are a good choice for electronic structure approaches that have a fixed protocol and can be applied to a wider selection of chemical species, such as, for example, the popular composite methods CBS-APNO,[12] G4,[13] or W1.[14] Type A procedures pivot on a set of comparisons of computed values to reliable benchmarks; the resulting deviations are used to estimate the uncertainty and the average bias in the computed values. If the benchmark set encompasses a wide range of chemical species, the inferred uncertainty represents a generic value for the method; a more targeted approach is to restrict the benchmark set to closely related species. A detailed description of the steps involved in Type A procedures was given by Irikura et al.[15]
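A simplified sketch of such a comparison is given below. It is not the procedure of Irikura et al., only an illustration under stated assumptions: the function name `type_a_summary` is hypothetical, the benchmark values are treated as exact, and the deviations are assumed to be roughly normally distributed so that a coverage factor of two approximates a 95% confidence interval.

```python
import math

def type_a_summary(computed, benchmark):
    """Return (mean bias, u95 without bias correction, u95 with correction).

    computed, benchmark -- equally long sequences of values for the same
                           species, in the same unit
    """
    if len(computed) != len(benchmark) or len(computed) < 2:
        raise ValueError("need at least two matched computed/benchmark pairs")
    deviations = [c - b for c, b in zip(computed, benchmark)]
    n = len(deviations)
    # Average signed deviation: the estimated bias of the method.
    bias = sum(deviations) / n
    # RMS of the raw deviations: spread and bias taken together.
    rms = math.sqrt(sum(d * d for d in deviations) / n)
    # Sample standard deviation about the mean bias (n - 1 degrees of freedom).
    s = math.sqrt(sum((d - bias) ** 2 for d in deviations) / (n - 1))
    return bias, 2.0 * rms, 2.0 * s
```

If the computed values are reported as they are, the second returned number is the relevant estimate; the third applies only when the mean bias is actually subtracted from each computed value.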

Type A-like procedures are quite popular for benchmarking new electronic structure methods, but in practice most cases found in the literature do not quite follow the discussed outline. First, the typical quantity offered as a measure of fidelity does not correspond to the u95% uncertainty expected by the convention (vide infra). Second, average bias, even when it is computed, is generally ignored as a potential correction term for the computed property. Third, in methods that involve additional empirical correctors, the same benchmark set is usually used to fit the values of empirical correctors and to test the outcome, weakening the statistical significance of the procedure.

Comments on mean absolute deviation

Assessments of accuracy of electronic structure methods nearly universally report the mean absolute deviation (MAD; a.k.a. mean unsigned error, average absolute deviation, etc.) as the central measure of achieved fidelity. For example, the very popular G3 theory has been reported to achieve a MAD of 1.02 kcal/mol for the 299 energies in the G2/97 test set, a significant improvement over the 1.48 kcal/mol of the predecessor G2 theory.[16] If there are good reasons to de-emphasize outliers (which occur either because the corresponding benchmark is unreliable, or because the species in question is a pathological case for the benchmarked theory, or both), there may be some advantage in using MAD to intercompare theoretical methods. The problem arises when MAD becomes equated with the expected uncertainty of the computed thermochemical property, or when it is used to adjudicate whether the benchmarked theory has achieved the fabled “chemical accuracy” (traditionally defined as the capability of consistently producing thermochemical quantities within an accuracy of ±1 kcal/mol).

MAD is by definition smaller than standard deviation σ. As opposed to the conventional uncertainty expected in thermochemistry, MAD allows the true value to be outside the bounds roughly half of the time. An additional complication is that the relationship between MAD and σ (and thus u95%) is distribution-dependent. In the limit of an ideal normal distribution, MAD is smaller than u95% by a factor of (2π)^(1/2), or 2.51 (Fig. 2). With error distributions that occur in practice, the conversion factor is frequently higher: for example, the aforementioned MAD of 1.02 kcal/mol for G3 translates to u95% = ±2.95 kcal/mol, which is ∼2.9 times larger. The conversion factor between MAD and u95% is always larger than two, and is generally between 2.5 and 3.5, occasionally even more.
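The Gaussian conversion factor follows from a short calculation, written out here for convenience. It assumes that the deviations are normally distributed with zero mean (no net bias) and that u95% is taken as 2σ:

```latex
\[
\mathrm{MAD}
  = \int_{-\infty}^{\infty} |x| \,
    \frac{1}{\sigma\sqrt{2\pi}} \, e^{-x^{2}/2\sigma^{2}} \, dx
  = \sigma \sqrt{\frac{2}{\pi}} \approx 0.798\,\sigma ,
\qquad
\frac{u_{95\%}}{\mathrm{MAD}}
  = \frac{2\sigma}{\sigma\sqrt{2/\pi}}
  = \sqrt{2\pi} \approx 2.51 .
\]
```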

Figure 2.

MAD seriously underestimates the conventional uncertainty expected in thermochemistry. The relationship between MAD (blue) and the standard deviation σ is distribution-dependent, but MAD is always smaller than σ. In a Gaussian distribution (gray underlying curve), MAD ≈ 0.80 σ. The conventional uncertainty associated with thermochemical quantities (red) is expected to correspond to a 95% confidence interval, u95% = 2σ. Thus, in a Gaussian distribution, u95% corresponds to 2.51 MAD. See text for further discussion.

Whereas older benchmarking efforts generally reported only MAD, the discrepancy between MAD and the conventional uncertainty attached to thermochemical quantities has recently started receiving attention,[17-20] and newer benchmarking efforts are now beginning to additionally report the root-mean-square (RMS) error.[13],[18-23] In these cases the corresponding u95% uncertainty can be approximately obtained by doubling the value of the reported RMS error.
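The difference between the two measures is easy to see on a small example. The sketch below computes MAD, the RMS error, and the approximate u95% (twice the RMS error) for a list of deviations; the function name `mad_rms_u95` and the deviations themselves are hypothetical and were chosen to include one outlier.

```python
import math

def mad_rms_u95(deviations):
    """Return (MAD, RMS error, approximate u95 = 2 * RMS) for a list of
    signed deviations (computed minus benchmark)."""
    n = len(deviations)
    mad = sum(abs(d) for d in deviations) / n
    rms = math.sqrt(sum(d * d for d in deviations) / n)
    return mad, rms, 2.0 * rms

# Hypothetical deviations in kcal/mol; the last entry is an outlier.
deviations = [0.2, -0.4, 0.3, -0.1, 0.5, -0.3, 0.1, 2.5]
mad, rms, u95 = mad_rms_u95(deviations)
print(f"MAD = {mad:.2f}, RMS = {rms:.2f}, u95 = {u95:.2f}, "
      f"u95/MAD = {u95 / mad:.1f}")
```

For this made-up set the ratio u95%/MAD comes out at about 3.4, above the Gaussian value of 2.51, because the RMS error weights the outlier more heavily than MAD does.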

Additional terms in the uncertainty

Assuming that the proper u95% uncertainty was obtained using a statistically sound approach, under some circumstances additional terms may need to be incorporated to fully quantify the accuracy of the computed thermochemical quantity.

For example, if the reported u95% pertains to 0 K total atomization energy (TAE0), but the final quantity is the 0 K enthalpy of formation, ΔfH°0, its uncertainty needs to additionally include the uncertainties of the enthalpies of formation of the atoms that were used to convert TAE0 to ΔfH°0, typically a nonnegligible contribution. If, instead of 0 K, the final quantity is the 298.15 K enthalpy of formation, ΔfH°298, then its uncertainty needs to also include the estimated uncertainties of the enthalpy increments, H°298 − H°0, for the targeted species and its constituent atoms (although the latter are normally quite small, at least for lighter atoms).
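The conversion from TAE0 to ΔfH°0 and the accompanying propagation of uncertainty can be sketched as follows. The function name `formation_enthalpy_0k` and the numbers in the example are placeholders rather than real thermochemical data. The sketch assumes that all uncertainties are 95% confidence intervals, that the atomic enthalpies of formation of different elements are uncorrelated with each other and with the computed TAE0, and that all atoms of one element share a single tabulated value.

```python
import math

def formation_enthalpy_0k(tae0, u_tae0, atoms):
    """Convert a total atomization energy to a 0 K enthalpy of formation.

    tae0, u_tae0 -- computed TAE0 and its u95%
    atoms        -- list of (count, dfH0_atom, u_atom) tuples, one entry
                    per element, all in the same energy unit
    Returns (dfH0, u95 of dfH0).
    """
    # dfH0(molecule) = sum over atoms of n * dfH0(atom), minus TAE0.
    dfh0 = sum(n * h for n, h, _ in atoms) - tae0
    # Atoms of the same element use one tabulated value, so their
    # contribution scales linearly with the count (n * u); the separate
    # elements and the TAE0 are combined in quadrature.
    variance = u_tae0 ** 2 + sum((n * u) ** 2 for n, _, u in atoms)
    return dfh0, math.sqrt(variance)

# Hypothetical triatomic AB2; all values are placeholders in kJ/mol.
dfh0, u95 = formation_enthalpy_0k(
    tae0=900.0, u_tae0=1.0,
    atoms=[(1, 250.0, 0.2), (2, 220.0, 0.1)],
)
print(round(dfh0, 1), round(u95, 2))
```

The linear scaling for repeated atoms matters for larger molecules: a small uncertainty in one atomic enthalpy of formation is multiplied by the number of such atoms in the species.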

Similarly, if the reported quantity is ΔfH°0 from isodesmic (or similar) reaction(s), its uncertainty needs to include not only the uncertainty of the computed reaction enthalpy, but also the uncertainties of the enthalpies of formation of the other species in the reaction that were taken as known a priori from external source(s). Furthermore, the uncertainty of the computed reaction enthalpy needs to be properly evaluated either by benchmarking (using a related set of known reaction thermochemistries) or by estimation, and cannot be simply determined using the spread between reaction enthalpies computed by different theoretical methods (a frequent mistake!), as such spread accounts only for the differences between systematic biases of these methods, rather than the actual biases.

The role of benchmarks and their accuracy

The role of reliable benchmarks in Type B procedures is not transparent, although it is difficult to imagine how sufficient experience can be built without their presence. In Type A procedures their role is obvious, and their accuracy directly affects the outcome: the obtained uncertainty, u95%,measured, is approximately equal to

\[ u_{95\%,\mathrm{measured}} \approx \sqrt{u_{95\%,\mathrm{true}}^{2} + u_{95\%,\mathrm{bench}}^{2}} \tag{1} \]

where u95%,true is the true uncertainty of the theoretical method, and u95%,bench is the average uncertainty of the benchmark set, which can be approximated by the RMS over the individual uncertainties in the benchmark set. Clearly, u95%,measured will fairly represent the desired u95%,true only if u95%,bench << u95%,true.
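The influence of the benchmark uncertainty can be illustrated numerically. The sketch below assumes that the measured uncertainty is the quadrature sum of the true uncertainty of the method and the average uncertainty of the benchmark set; the function name `measured_u95` and the numbers are hypothetical.

```python
import math

def measured_u95(u_true, u_bench):
    """Apparent u95% of a method when it is benchmarked against data
    that carry their own average uncertainty u_bench (quadrature sum)."""
    return math.hypot(u_true, u_bench)

# A method with a true u95% of 1.0 (arbitrary units), tested against
# benchmark sets of increasing average uncertainty.
for u_bench in (0.1, 1.0, 4.0):
    print(u_bench, round(measured_u95(1.0, u_bench), 2))
```

With a benchmark uncertainty one tenth of the true uncertainty, the measured value is essentially unaffected; with a benchmark uncertainty four times larger, the measured value mostly reflects the benchmark set rather than the method.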

Noting that attempts to recover u95%,true by retroactively correcting u95%,measured for the influence of u95%,bench generally produce statistically nonsignificant results, it becomes clear that the best option is to use a high-accuracy benchmark set. During early stages of development, the authors of HEAT[19, 21] tried to evaluate several variants of protocols using benchmarks extracted from conventional thermochemical compilations. This produced disappointingly indistinguishable RMS errors of ∼4.2 kJ/mol,[24] but when a highly accurate benchmark set extracted from Active Thermochemical Tables (ATcT) was used, the resulting RMS errors improved by well over an order of magnitude, allowing them to clearly grade the protocols.

Active Thermochemical Tables

ATcT[25] are a novel paradigm for obtaining accurate, reliable, and internally consistent thermochemical quantities. In contrast to traditional thermochemistry, which uses a sequential approach (A begets B, B begets C, etc.), ATcT are based on constructing, analyzing, correcting, and solving a thermochemical network (TN).[25-27] The TN contains all available experimental measurements relating to the targeted species (bond dissociation energies, constants of equilibria, reaction enthalpies, etc.), complemented by determinations extracted from high-accuracy theory (reaction energies, total atomization energies, etc.). The TN effectively represents a set of constraints that must be satisfied by the final thermochemistries of the involved species. Each determination has an initially assigned uncertainty, reflecting its perceived u95%. The uncertainties influence the weight by which each determination contributes to the knowledge content of the TN. ATcT perform an iterative statistical analysis of the TN that evaluates all determinations for mutual consistency, identifies potential “offenders” (determinations with too optimistic initial uncertainties), and adjusts the uncertainties until internal consistency is achieved across the entire TN. The final thermochemistry is obtained by solving a self-consistent TN simultaneously for all species. The TN approach enables ATcT to produce values that are both highly accurate and robust, because they simultaneously satisfy as many constraints as possible and thus are based on the cumulative knowledge content of the TN.
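The general idea of a thermochemical network can be conveyed with a toy example. The sketch below is not the ATcT algorithm, which is far more elaborate; it is a deliberately minimal illustration with two species, a weighted least-squares solution, and a crude consistency loop that repeatedly inflates the uncertainty of the worst "offender". The function names, the 10% inflation step, the consistency criterion (residual no larger than the assigned u95%), and all numbers are assumptions made for this example.

```python
def solve_two_species(determinations):
    """Weighted least squares for two unknowns (x_A, x_B).

    Each determination is (c_A, c_B, value, u95) and states that
    c_A * x_A + c_B * x_B equals value to within u95.
    """
    saa = sab = sbb = ta = tb = 0.0
    for ca, cb, y, u in determinations:
        w = 1.0 / (u * u)  # weight derived from the assigned uncertainty
        saa += w * ca * ca
        sab += w * ca * cb
        sbb += w * cb * cb
        ta += w * ca * y
        tb += w * cb * y
    det = saa * sbb - sab * sab  # 2x2 normal equations, Cramer's rule
    return (ta * sbb - tb * sab) / det, (saa * tb - sab * ta) / det

def make_consistent(determinations, inflate=1.1, max_iter=1000):
    """Inflate the uncertainty of the worst offender until every residual
    is no larger than the corresponding u95. Returns the solution and the
    final uncertainties."""
    dets = [list(d) for d in determinations]
    for _ in range(max_iter):
        xa, xb = solve_two_species(dets)
        ratios = [abs(y - (ca * xa + cb * xb)) / u for ca, cb, y, u in dets]
        worst = max(range(len(dets)), key=ratios.__getitem__)
        if ratios[worst] <= 1.0:
            return (xa, xb), [d[3] for d in dets]
        dets[worst][3] *= inflate  # too-optimistic uncertainty: enlarge it
    raise RuntimeError("network did not reach internal consistency")

# Hypothetical network: two direct determinations, one reaction linking
# the species, and one discordant determination of x_B.
network = [
    (1, 0, 10.0, 0.5),   # x_A
    (0, 1, 30.0, 0.5),   # x_B
    (-1, 1, 20.1, 0.4),  # x_B - x_A
    (0, 1, 33.0, 0.5),   # discordant x_B
]
solution, final_u = make_consistent(network)
print(solution, final_u)
```

In this toy case only the discordant determination has its uncertainty enlarged, and the solution settles close to the values supported by the three mutually consistent determinations, which is the qualitative behavior described in the text.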

Unprecedented accuracy and reliability make ATcT values very attractive for tweaking and benchmarking state-of-the-art electronic structure methods, such as HEAT,[19, 21] W4,[18] and others.[20, 23, 28] In addition, ATcT have introduced a number of useful features that were never before available in thermochemistry, such as painless updates with new knowledge,[25-27] or the availability of the full covariance matrix, which allows a detailed analysis of the distributed provenance of each value,[27] identification of additional determinations that would efficiently boost the knowledge content of the TN,[29] and so forth.


The accepted standard for expressing uncertainties of thermochemical quantities, which is uniformly followed by virtually all thermochemical tabulations, is to provide earnest estimates of the corresponding 95% confidence intervals. Theoretical studies frequently ignore this convention, and, instead, offer the MAD as a measure of achieved accuracy. MAD underestimates the recommended thermochemical uncertainty by a factor of 2.5–3.5 or even more, thus vitiating many claims that “chemical accuracy” has been achieved. The incongruence between the uncertainties attached to some theoretical quantities and those expected in thermochemistry (and followed by experimental reports) makes a fair comparison between competing determinations difficult. Furthermore, copropagation of uncertainties of thermochemical quantities extracted from standard thermochemical sources with underestimated uncertainties for theoretical values produces invalid uncertainties for reaction enthalpies and free energies. Two groups of procedures for determining the accuracy of computed thermochemical quantities have been outlined: one relying on estimates that are based on experience (Type B), the other on benchmarking (Type A). Benchmarking procedures require a source of thermochemical data that is as accurate and reliable as possible, or else the true accuracy of the benchmarked method cannot be recovered. The role of ATcT in benchmarking state-of-the-art electronic structure methods was briefly discussed.



    Branko Ruscic was born in Rijeka, Croatia in 1952. He graduated in Chemistry in 1975 and obtained his doctorate in 1979 from the University of Zagreb. As an undergraduate, he started scientific research at the Rugjer Bošković Institute in Zagreb, and continued there as a scientist. He performed postdoctoral research in the Physics Division of Argonne National Laboratory near Chicago, and joined the Chemistry Division in 1988, where he is currently a senior scientist. His present research interests are in experimental and theoretical methods that provide highly accurate and reliable thermochemistry.