Double hybrid DFT calculations with Slater type orbitals

Abstract On a comprehensive database with 1,644 datapoints, covering several aspects of main‐group as well as of transition metal chemistry, we assess the performance of 60 density functional approximations (DFA), among them 36 double hybrids (DH). All calculations are performed using a Slater type orbital (STO) basis set of triple‐ζ (TZ) quality and the highly efficient pair atomic resolution of the identity approach for the exchange‐ and Coulomb‐term of the KS matrix (PARI‐K and PARI‐J, respectively) and for the evaluation of the MP2 energy correction (PARI‐MP2). Employing the quadratic scaling SOS‐AO‐PARI‐MP2 algorithm, DHs based on the spin‐opposite‐scaled (SOS) MP2 approximation are benchmarked against a database of large molecules. We evaluate the accuracy of STO/PARI calculations for B3LYP as well as for the DH B2GP‐PLYP and show that the combined basis set and PARI‐error is comparable to the one obtained using the well‐known def2‐TZVPP Gaussian‐type basis set in conjunction with global density fitting. While quadruple‐ζ (QZ) calculations are currently not feasible for PARI‐MP2 due to numerical issues, we show that, on the TZ level, Jacob's ladder for classifying DFAs is reproduced. However, while the best DHs are more accurate than the best hybrids, the improvements are less pronounced than the ones commonly found on the QZ level. For conformers of organic molecules and noncovalent interactions where very high accuracy is required for qualitatively correct results, DHs provide only small improvements over hybrids, while they still excel in thermochemistry, kinetics, transition metal chemistry and the description of strained organic systems.

No satisfactory way to systematically improve the exchangecorrelation (XC) functional of KS-DFT is known [42] and in practice approximations are necessary. Density functional approximations (DFA) explicitly depending on both, the electron density and oneelectron orbitals resolve many deficiencies of (meta-)generalized gradient approximations (GGA). [5,[43][44][45][46][47] If only the exchange part of such DFAs depends on these orbitals, they are referred to as hybrid functionals. For an orbital-dependent treatment of correlation based on the KS reference Hamiltonian, a dependence of the DFA on virtual orbitals must be introduced. A particularly economic strategy to do so is via second order Møller-Plesset perturbation theory (MP2). As both, the exchange-and the correlation part of such DFAs are bridging standard DFAs and WF theory, they are commonly referred to as double-hybrid (DH) [48][49][50][51][52][53][54][55] functionals.
According to the famous Jacob's ladder classification scheme for DFAs, [56] the generally most accurate and robust functionals should be DH-DFAs. Indeed, rigorous validation on the comprehensive GMTKN55 [57] database by Goerigk, Grimme, Martin and coworkers has substantiated this conjecture. [57][58][59][60][61] Unfortunately, blending WF theory into KS-DFT not only goes along with increased accuracy but also with increased computational cost. The calculation of the KS matrix for hybrid functionals is commonly associated with a N 4 scaling operation count, [62] and the CPU time for the evaluation of the canonical MP2 correlation energy increases with N 5 as a function of system size N. [63] Given these scaling properties, the availability of cost-effective algorithms for both tasks is crucial to enable routine application of hybrid-and DH-DFAs to large molecules. For both, the Coulomb (J) and the exchange (K)-term, exploiting sparsity in the rank-4 electron repulsion integral (ERI) tensor is key to reducing the computational cost of the KS matrix construction for hybrid functionals. Algorithms based on integral prescreening techniques [64][65][66][67][68][69][70] offer favorable scalability but suffer from a high prefactor. To circumvent this issue, modern codes additionally rely on density-fitting (DF) [71][72][73][74][75][76][77][78][79][80][81][82] approximations being an efficient approach to evaluate the J-term, especially in conjunction with the J-engine technique. [83][84][85][86][87] Global density fitting is less efficient for the K-term and promising approaches to achieve better performance are the pseudo-spectral method, [88][89][90][91][92][93] the auxiliary density matrix method [94][95][96] or different flavors of local DF (LDF) [97][98][99] approximations. [100] The most promising variant of the latter approach might be the pair-atomic resolution of the identity (PARI) approximation [101][102][103] (Known as PARI-K when applied to the K-term [103] ). In this most extreme variant of LDF methods, following the treatment of Baerends et al. [71] for the J-term, each pair product of AOs is expanded in a set of auxiliary basis functions centered on the same two atoms as the target pair of primitives.
Following the latter approach, we have shown that the PARI approach can not only be used to accelerate the KS matrix construction but also to efficiently compute MP2 correlation energies. [144] While we have demonstrated that an MO based PARI-MP2 algorithm can easily compete with DF-MP2 in terms of accuracy, we have also formulated the Spin opposite-scaled [145] (SOS)-AO-PARI-MP2 algorithm, enabling to obtain accurate SOS-MP2 energies with quadratic scaling operation count. Implemented in the STO-based Amsterdam density functional (ADF) [79,146,147] code, the SCF, although extensively accelerated via PARI-K, represents the bottleneck (both CPU time and memory-wise) for single-point SOS-MP2 calculations on molecules of several hundreds of atoms using basis sets of TZ quality. We concluded that this enables quantum chemists to carry out DH calculations whenever a hybrid calculation is feasible too.
Employing any of the mentioned local approximations might lead to tremendous speed-up to the price of an additional source of error.
For an approximation to be useful in practice, the introduced errors must be small and well controlled and the inequalities where Δ m , Δ b , and Δ a denote the inherent error of the method, basis set and additional approximations, respectively, should be fulfilled.
The latter inequality is of high importance, as in most quantumchemistry packages algorithmic details will differ and it is often non-obvious to users how to tweak them. In this sense, the second inequality is crucial to make comparisons of results between different codes meaningful.
Even for very small Δ m and Δ b , (1) is certainly met by global DF approximations. Is this also the case for PARI? Concerns regarding robustness [103,148,149] and accuracy [100] of PARI-K, especially for the description of virtual orbital energies, [100,150] have recently been expressed. Additionally, the exact exchange energy from the PARI-K algorithm is unbounded from below which might result in convergence to artificial low-lying states. [103,149] This should not be confused with errors due to nonconvergence of the SCF, an algorithm for which convergence is not guaranteed. [151][152][153] The appearance of fully converged, but unphysical results is much more problematic.
These shortcomings have been found to get more pronounced with increasing number of AOs and molecular size. [100] Hybrid calculations with a QZ basis are unreliable for large molecules. For PARI-MP2, additional difficulties arise in the fitting of virtual orbitals, [144] leading to even larger numerical instabilities with increased number of primitives. Effectively, this limits the largest basis sets which can be safely be used to TZ quality with a moderate number of diffuse functions.
In practice, this might be the most severe limitation for DH calculations. Although some exceptions are known, [154] TZ basis sets are usually sufficient to fulfill the first inequality in (1) for GGAs and hybrids. [155] Due to the exceptionally slow convergence to the CBS limit of MP2, [156][157][158][159][160][161] considerably larger basis sets are needed for DHs. [162] As it only requires the evaluation of a subset of two-center Coulomb integrals, PARI is a crucial technique for efficient exact exchange algorithms in quantum chemistry packages based on Slater type orbitals (STO) or numerical atomic orbitals (NAOs) [163][164][165][166] relying on numerical integration techniques. Therefore, it is of utmost importance to investigate whether accurate, robust and efficient hybridand DH-KS calculations can be performed within the PARI- List of DFAs benchmarked in this work. The functional name is given together with the amount of HF-exchange (a x in (2)) and MP2 correlation energy (a c, os /a c, ss in (2), if only one number is given, a c, os = a c, ss ) as well as the variants of empirical dispersion correction with which it is combined. The corresponding parameters used herein are given in the ESI.
Different DFAs show different dependencies on the choice of basis set and numerical approximations. [154,167] The superior performance of a certain DFA under ideal conditions that is, using large basis sets and very small Δ a does not necessarily imply the same behavior when smaller basis sets and PARI are employed. Employing these ideal conditions, in the last years large numbers of DFAs have been ranked according to their performance on large databases. [57][58][59]154,168] It is certainly of great interest to investigate to what extend these rankings can be reproduced using STOs and PARI (STO/PARI in the following).
Against this background, we herein present benchmark calculations with 60 DFAs using a comprehensive database consisting of 1,644 data points, covering both, main-group and transition metal (TM) chemistry. This is the first comprehensive benchmark of DFAs using PARI and, to the best of our knowledge, the first one using STOs.
This paper is organized as follows: After giving an overview of our database and of the herein benchmarked DFAs we shorty describe our computational approach and give representative timings before we discuss the results of our benchmarks. We will investigate the accuracy of the STO/PARI approach in comparison to GTO-type basis sets for two exemplary DFAs and subsequently present benchmarks for all 60 assessed functionals. In the end, we conclude and summarize this work.

| Test sets
Large compilations of diverse datasets covering a wide range of maingroup chemistry have now a long-standing tradition in quantum chemistry. The importance of meta-databases like GMTKN55 [57] (and its predecessors GMTKN24 [169] and GMTKN30 [162] ) and MGCDB84 [168] compiled by Grimme and coworkers and Head-Gordon and coworkers, respectively, to assess the performance of DFAs and also wave-function methods [170] can barely be overemphasized.
We do not present any new database or test set in this work.
Instead, we selected nearly all subsets from GMTKN55 [57] and MGCDB84 [168] and additionally include two databases featuring energies and barrier heights of reactions between TM containing species. An overview of all categories in which we organize our database is given in Table 2, while a short description of all subsets is given in Table 3. In the latter table we also reference for each test set the publication where it was introduced first. The excellent papers of Grimme and coworkers [57] and Head-Gordon and coworkers [168] relief us from the burden of a detailed description of all subsets and we refer to them for details. Nevertheless, we will briefly introduce all six subcategories employed herein.
While there is considerable overlap between MGCDB84 and GMTKN55, the reference data in the latter one is generally newer and often more accurate. [226] In example, MGCDB84 contains much data from GMTKN30, [162] which has been updated in GMTKN55. [57,226] Consequently, for test sets contained in both databases, we always compared our results to the reference values presented in the latter one and in total we selected 45 out of the 55 subsets in GMTKN55 (two of them, (MCONF, BUT14DIOL) with reduced size [265] ) to benchmark all DFAs in Table 1 and included three more into our subcategory of large molecules.

| Transition metal containing systems
As they are not part of GMTKN55 or MGCDB84, we describe the TM containing subsets in our database in some detail. We employ two subsets with 66 reaction energies combined, capturing different aspects of TM chemistry.
The MOR23 test set is a subset of the MOR41 database recently presented by Grimme and coworkers, [227] consisting of 41 reactions involving large molecules with up to 120 atoms and including 13 different TMs. All reference values have been obtained on the DLPNO-CCSD(T) [23,31] [235][236][237][238] and contains 43 barrier heights of reactions catalyzed by TMs, 6 are catalyzed by the 4d-elements Zr [236] or Mo, [238] and 37 are catalyzed by one of the 5d-elements Pt, [235] Au, [235] Ir, [235] Re [237] or W, [238] out of these 22 by Au. All reference values have been obtained at the CCSD(T)/CBS level of theory. The average barrier height in TMBH43 is 11.08 kcal/mol.
The average relative energy of this category is with 4.52 kcal/mol very low. The electronic structures of these species are rather simple (isomerization easy [IE]) and are usually very well described already on the GGA + Dispersion level of theory. However, given the small energy differences in this subcategory, high accuracy is required to correctly reproduce the energetic ordering of conformers.

| Intermolecular noncovalent interactions
The fifth category of our database contains the energies of 274 noncovalently bounded complexes, relative to the monomers of which they consist. This category contains 12 subsets and features the prominent S66 [253] databases of dimerization reactions between organic molecules and biomolecules and the X40 [254] test set of interaction energies between halogene-containing organic molecules, both complied by Hobza and coworkers. While these systems are in their equilibrium geometry, Hobza and coworkers S66x8 [253] test set additionally contains all dimers in the S66 test set at eight different nonequilibrium geometries and is also part of MGCDB84. While it is certainly insightful to benchmark nonequilibrium geometries also, we only selected 10 out the 66 potential energy curves in order to keep our dataset well balanced, resulting in the S10x8 subset.
We note, that the very popular S22 [269] subset (and its extension to nonequilibrium geometries in the S22x5 database), which is also a part of GMTKN55 is not part of our database. Despite its usefulness, it is rather imbalanced and does not represent important interactions like single hydrogen bonds or aliphatic-aliphatic dispersion interactions. [253] These shortcomings have been addressed with the compilation of S66 and we argue that with its introduction S22 is obsolete.
The average energy of the reactions contained in this category is 11.84 kcal/mol.

| Large systems
The last category differs from all other categories in the sense that its subsets cannot be assigned to a certain type of chemical interactions.
We refer to the systems in this category as large in the sense that it contains mainly molecules for which a canonical DH-DFA calculation requires a considerable amount of CPU-time. To give an example, calculation of the 10 isomers in the C60ISO [261] test set on the B2GP- Benchmarking large systems is crucial to assess the robustness and accuracy of local approximations which do not come into play for small systems. Especially PARI-errors errors have already been reported to accumulate with increasing system size. [100] Furthermore, growing concern has been expressed that MP2 might diverge for weak interactions in large molecular complexes. [270,271] ISOL24, [260] C60ISO, and UPU23 [263] contain isomerization reactions and are already part of GMTKN55. Additionally, we selected Hobza and coworkers popular L7 [262] test set of noncovalently bounded dimers with up to 120 atoms as well as another benchmark set which we designate herein ENZYMES23. It is based on recent work of Goerigk an coworkers [264] were E GGA (E MP2 c,ss ) denotes the opposite-spin (os) (same-spin [ss]) contribution to the MP2 correlation energy and finally, E DISP empirical is an empirical dispersion correction term. While many other forms of dispersion corrections have been suggested, [272][273][274][275][276] we restrict ourselves to the popular atom-atom dispersion potentials DFT-D3 [173,255] and DFT-D4. [174,181] For (meta-)GGAs (a c, os = a c, ss = a x = 0) and hybrids (a c, os = a c, ss = 0) our selection is mainly based on their performance in the recent comprehensive benchmarks by Goerigk et al. on the GMTKN55 database [ 57,58] and Mardirossian et al. on the MGCDB84 database. [168] The only exceptions are the B3LYP and PBE0 hybrid functionals, which we included due to their continuing high popularity despite their now well established [9,57,162,168,264,277] average performance. As functionals based on the B95 [278] and B97 correlation functionals were not available to us (except for the hybrids ωB97-X and B97 which were available to us through libxc [279] ), we excluded the popular hybrids B97M-V [280] and ωB97X-V [281] as well as the DHs ωB97X-2, [282] ωB97M (2), [283] PWPB95, [162] and DSD-PBEB95 from this study. [214] The majority of the herein benchmarked DH-DFAs are of the DSD (Double hybrid, Spin-component scaled, Dispersion)-type. [212] These functionals are parametrized together with an empirical dispersion correction term as well as without any constraints on a c, ss , a c, os and a c, gga .
Consequently, in this approach different forms of dispersion correction give rise to different overall parametrizations. Examples are the recently published [60] reparametrizations of DSD-type functionals by the Martin group, both for the D4 [174,181] dispersion correction as well as for the older D3 [255] version with Becke-Johnson damping (D3[BJ]). [173] While we consistently assess all other functionals with and without an empirical-dispersion correction term, we benchmark DSD-functionals in their dispersion corrected form only. [284] We especially emphasize DOD-functionals in our study, a special variant of DSD-functionals with a c, ss = 0. E MP2 c,os can be evaluated efficiently using SOS-AO-PARI-MP2, and DOD-functionals can therefore be applied to molecules consisting of hundreds of atoms in a routine fashion. [144] As they can provide accuracies comparable to their DSDcounterparts, they are in our opinion the most-interesting DHs for practical applications. Table 1 list the DFAs assessed herein together with the employed dispersion correction as well as the parameters a x and a c from (2). In total, 122 unique combinations of DFA and dispersion-correction are evaluated in this study.

| Computational details
The majority of our DFT calculations as well as all empirical dispersion corrections have been performed with a locally modified development version of the ADF code. [79,146,147] For a variety of subsets, we also performed B3LYP and B2GP-PLYP calculations with PSI4. [285] For all test sets contained in GMTKN55, the structures and reference energies as available on the dedicated website [286] have been used. For all other datasets we employed the structures and reference energies from the ACCDB [226] database.

| ADF calculations
PARI-K has been used for all hybrid and DH calculations and, if not stated otherwise, PARI-MP2 for the post-SCF energy correction required for the latter ones, except for all test sets in the category LM, for which SOS-AO-PARI-MP2 has been used instead. The accuracy of the latter algorithm depends on the quality of the numerical approximation to an integral. [288] We use nine quadrature points, being the default in ADF and a number for which this approximation can be considered as sufficiently converged. [132,289] For the evaluation of the exact exchange as well as for all SOS-AO-PARI-MP2 calculations, the Normal tier of threshold qualities has been used. We refer to our recent work [144] for a detailed explanation and discussion of these algorithms as well as explicit threshold values.
The majority of the herein presented numbers have been calculated on the all-electron level using the TZ2P (TZ with two shells of polarization functions) STO-type basis set. [290] The Normal auxiliary fit set has been used for for PARI-K and all PARI-MP2 calculations with the only exception of the IDISP test set for which we employed the Very Good auxiliary fit set for both, PARI-K and PARI-MP2 in all DH calculations. For both, the numerical integration quality as well as the quality of the DF for the evaluation of the J-term, [291] we used Good quality. [291,292] For all systems containing fourth-row elements or heavier, relativistic effects have been treated with the Zero Order Regular Approximation (ZORA) [293][294][295][296] in conjunction with ZORA-optimized basis sets and the Minimum of neutral Atomic potential Approximation (MAPA).

| PSI4 calculations
All PSI4 calculations have been performed using def2-TZVPP [297,298] and employing DF for the J-and K-terms as well as for the evaluation of MP2 energies; default auxiliary fits sets have been employed. [299] For all subsets, calculations have been performed on the all-electron level. Default settings have been used for the integration grids and all other numerical settings.

| Representative timings
To give an estimate on the CPU-time requirements for a prototypical GGA, hybrid and double-hybrid, respectively, we herein report timings for our whole database. Calculations have been performed sequentially on a 2 × 12-core 2.4 GHz Intel Xeon E5-2695 v2 (ivy Bridge) with 64 GB of memory. Sequential calculation is obviously not appropriate for small molecules as parallelization will be inefficient. In practice, the best performance will be achieved by running many jobs in parallel on one core each and only parallelize individual calculations for which the memory available on a single core is not sufficient. For the present context, this is irrelevant as we only aim to compare the relative timings. Results for all calculations required to obtain values for our whole database of 1,644 datapoints without large molecules are given in Table 4. For BLYP, B3LYP and revDOD-BLYP we also give timings for the subset of c,os only consumes about 25% of the total CPU-time. Clearly, these timings show that CPU-time is hardly a concern when climbing Jacob's ladder to the top, provided a DODfunctional is used.

| RESULTS
In this section, the results of our calculations are presented. In the following discussion, we proceed in two steps. First, we investigate the basis set dependence of the STO/PARI-approach in detail. To do so, we present results for B3LYP as an exemplary hybrid, and B2GP-PLYP as an exemplary DH, respectively. Second, we analyze and discuss the results of our ADF calculations for all DFAs in Table 1 in order to identify the most accurate and robust DFAs. All numbers presented in the following figures can also be found in table form in the ESI.

| Comparison of STO/PARI approach to Gaussian type basis sets
As a first step in the analysis of our results, we compare our TZ2P calculations to calculations using the GTO-type basis set def2-TZVPP for a selection of 30 subsets from our database. def2-TZVPP is comparable in size to the TZ2P basis set but has more shells of polarization functions for the first three rows of the periodic table. Table 5 shows the sizes of the herein employed basis sets for selected elements.
For our analysis we use two exemplary DFAs, the hybrid B3LYP and the DH B2GP-PLYP. The results of the respective functional as calculated by Grimme and coworkers using an augmented def2-QZVP basis set (aug-def2-QZVP in the following, for details see Reference [57]) close to the CBS limit and available online [286] serve as reference values. This allows to investigate the basis set errors (BSE) without being flawed by cancellation between functional error and BSE.
While for hybrid-functionals basis sets of TZ quality are usually assumed to give results close to the CBS limit, a larger basis set dependence can be expected for B2-GPPLYP. Also due to possible difficulties in the description of virtual orbitals, the PARI approximation might introduce larger errors than for B3LYP. To this end, larger BSEs as well as larger differences between TZ2P and def2-TZVPP are to be expected.  Note: All timings are given in hours and have been obtained on a single node with 24 cores. In the first row, the CPU-times for our whole database except for the subcategory of large molecules are given while the timings for this subcategory are displayed in the second row. B2GP-PLYP calculations have been performed using PARI-MP2. For revDOD-BLYP, SOS-AO-PARI-MP2 has been used.

T A B L E 5
Composition and basis set sizes (in parentheses) for selected elements of the STO-type TZ2P and the GTO-type def2-TZVPP basis sets [298]  In the same way as Figure 1, Figure 2 shows the BSEs of TZ2P

| Discussion of the whole database and its categories
As we have performed well over 150.000 single point calculations in total, we can herein only highlight the most important trends and conclusions while we refer to Data S1 for details. We will discuss the performance of the assessed DFAs for each of the five subcategories.
Results for large molecules will also be discussed whenever appropriate. Arguably, the information given in the following is most relevant for users of the ADF code. However, having established that STO/PARI calculations offer accuracies comparable to ones attainable using GTO-type TZ basis sets, we hope the herein presented results can also serve as an important guideline for users of GTO-based quantum chemistry codes who are not always willing or able to afford QZ sized basis sets. Final recommendations will be given based on WTMAD-2 values as well as on a best-worse analysis, that is, we recommend functionals based on their accuracy but also on their robustness. Before we dive into the discussion of the benchmarked DFAs performance, both metrics will shortly be introduced.

| Weighted mean absolute deviations
We follow the approach of Grimme and coworkers [57] and base the discussion of our data on weighted mean absolute deviations (WTMAD-2), where the sum runs over all subsets in a subcategory, MAD i denotes the mean absolute deviation of subset i, ΔE i its average reaction energy and N i the number of datapoints in it. The benefits of this scheme have been adressed in detail by Grimme and coworkers. [57] We prefer this metric over simple MADs as it does not focus on absolute accuracies but on relative ones, being clearly of higher relevance to predict and reproduce chemical trends.

| Best-Worse-Analysis
For this we also adopt the approach of Grimme and coworkers [57] and present best-worse analyses for the total database and subcategories.
Instead of simply counting how often a functional yields the lowest (best) and the highest (worse) MAD, we rather count how often a certain functional yields a MAD which is not more than 15% higher (lower) than the MAD of the best (worse)-performing functional.
We consider an illustrative example to point out the advantage of this approach. The functional with the lowest MAD for the Amino20x4 subset is revDSD-PBEP86-D4 with 0.13 kcal/mol. For six other functionals we obtain MADs not more than 15% worse. All of these functionals perform well for this subset which means that there a seven reliable (within our arbitrary criterion) DHs to describe it. As another example, we consider the DARC test set. The best DH here is B2GP-PLYP-D4 with a MAD of 0.23 kcal/mol, followed by mPW2K-PLYP with a MAD of 0.57 kcal/mol. Thus, B2GP-PLYP-D4 is the only logical choice for this test set.

| Basic properties and reaction energies
To start with, we discuss the category of basic properties of small systems and bonded interactions. Figure 3 shows the WTAMD-2s being ranked 25th and having been parametrized for NCIs. To our surprise, B2π-PLYP-D3(BJ), parametrized for π-π interactions, is the best DH in this category. As it has been optimized for thermochemistry, [55] it is not surprising that B2T-PLYP-D3(BJ) performs well too.
Taking a closer look at the performance of the benchmarked hybrid-functionals reveals that the Minnesota functionals M06-2X, M08-SO and M08-HX with different flavors of dispersion correction are the methods of choice on the hybrid level. The amount of exact exchange contained in these functionals is with more than 50% rather large. This is in line with results for GMTKN55 where the two best performing functionals are M06-2X-D3(0) and M08-HX-D3(0) and in contrast to the common belief that, due to the ability of the SIE to mimic static electron correlation, [300,301] a smaller fraction of exact exchange should be ideal for thermochemistry. [302,303] After this general analysis we proceed by taking a closer look at  times among the best DH-DFAs and never among the worst. Also revDSD-BLYP-D3(BJ) reaches MADs not worse more than 15% than the best DH for five times and can be recommended as a reliable functional for thermochemistry. On the hybrid level, B97-D4 and M08-HX are five times among the best functionals. Again, this does not come as a surprise for the latter one, given its fourth position in the WTMAD-2 ranking, while the former only ranks tenth there. The hybrid yielding by far the lowest WTMAD-2, M08-SO ranks among the best hybrids four times, but once it can also be found among the worst ones.
In conclusion, we recommend B2T-PLYP-D3(BJ) as the method of choice for the description of bonded interaction and basic properties and we generally advise to use DH-DFAs whenever feasible. If CPU-time is a concern, we recommend the best DOD-DH, revDOD-BLYP-D3(BJ)/D4 as well as the Minnesota functional M06-2X-D3(0).

| Reaction barrier heights
The next category consists of nine subsets with 264 data points, containing barrier heights of 21.92 kcal/mol on average. The WTAMD-2 s for the best functionals of each rung are displayed in Figure 5. As F I G U R E 3 The best 25 DHs (blue) and hybrids (green) as well as the best 10 (meta-)GGAs (red) according to their WTMAD-2 s for the category of basic properties and reaction energies [Color figure can be viewed at wileyonlinelibrary.com] they suffer considerably from many-body self-interaction error, (meta-)GGAs show a devastating performance for stretched bonds [268,[303][304][305][306] and should never be used for the calculation of barrier heights. Replacing the D3(BJ) dispersion correction by its successor D4 results in essentially the same picture and also the pair revDOD-PBE-D3(BJ)/ revDSD-PBE-D3(BJ) shows a comparable performance.
As for TC, the best hybrid functionals are the dispersion uncorrected M08-SO, M08-HX and M06-2X functionals outperforming their dispersion corrected counterparts slightly. The bestworse analysis in Figure 6 substantiates the conclusions drawn from the inspection of the WTMAD-2s. revDOD-PBEP86-D3(BJ) and revDSD-PBE86-D3(BJ) are among the best functionals in four of nine cases. On the hybrid level, M08-SO is the leader in this category, three times being among the best hybrids. M08-HX and M06-2X, on the other hand only rank once among the best hybrids.
The picture drawn so far chances substantially when the TMBH43 test set, with only 15% of all datapoints in BH, is excluded from this analysis. In Data S1 we present a plot similar to Figure 5 which shows that revDOD-PBEP86-D3(BJ) and revDSD-PBE86-D3 (BJ) are still the best available DHs for this category, while M08-HX takes the lead from M08-SO. However, without TMBH43, no DH yields clear improvements over M08-HX. Although many DHs yield a comparatively small WTMAD-2, none of them can clearly be rec-  As it contains many barrier heights, we briefly discuss our results for ENZYMES23 here. We generally observe good performance of functionals with a large fraction of exact exchange. The best hybrid functional is M08-HX-D3(0). It is only slightly outperformed by revDOD-PBEP86-D3(BJ) with a MAD of roughly 1.2 kcal/mol. Note that Goerigk and coworkers obtained a only slightly better MAD of 1.07 kcal/mol with the herein not benchmarked SOS0-PBE-2-D3 (BJ) functional in their benchmark study on this test set. [264] In conclusion, we cannot unreservedly recommend to use DH-DFAs for the calculation of BHs unless TMs are involved. In this case, the method of choice is dispersion uncorrected B2K-PLYP, having been optimized for kinetic properties, and, when CPU time is a concern, revDOD-PBE-D4. Otherwise, hybrid functionals seem to provide the best price/performance-ratio for the calculation of barrier heights and we recommend M08-HX as the most robust and accurate alternative.

| Easy isomerization reactions
The availability of a plethora of benchmark sets containing isomers of organic molecules illustrates the great relevance of this type of systems. In total, our database contains 433 relative conformational energies of which only 15 are not organic. In this work, we consider two categories of isomers, and in this paragraph we are concerned with the category containing 274 relative energies of isomers with an easy electronic structures and an average isomerization energy of only 4.52 kcal/mol. Given these subtle energy differences, obtaining qualitatively correct results requires high precision and Figure 7 reveals large differences in the accuracies of the benchmarked DFAs. While the best DH has a WTMAD-2 of 5.4 kcal/mol, the WTMAD-2 of the 25th is with more than 9 kcal/mol already 71% higher, for hybrids, the corresponding ratio is 79%. This might be contrasted with the corresponding number for BH, where we obtain a ratio of 25% for DHs and of 29% for hybrids, receptively. We already mention at this point that the WTMAD-2s for IE are higher than for all other subcategories, that is, the largest relative errors are obtained here.
Going from the best hybrid, B97-D4, to the best DH, revDSD-PBEP86-D4, the WTMAD-2 only improves by roughly 10%, only three DHs yield improvements at all and only 16 perform better than the best meta-GGA, SCAN-D4. Thus, some care must be taken when selecting a DFA for these systems and with an uneducated choice of DH, one easily ends up with a worse performance than with a GGA.
F I G U R E 6 Best (full colored bars, positive x-range)-worse (shaded bars, negative x-range) analysis for the best 15 DHs (blue) and hybrids (green) for reaction barrier heights [Color figure can be viewed at wileyonlinelibrary.com] F I G U R E 7 The best 25 DHs (blue) and hybrids (green) as well as the best 10 (meta-)GGAs (red) according to their WTMAD-2 s for the category of easy isomerization reactions [Color figure can be viewed at wileyonlinelibrary.com] Hybrids, although the best of them performing slightly worse than the best DH, seem to be the most robust class of DFAs for IE. Also the best-worse analysis in Figure 8 suggests B97-D4 to be the most reliable DFA being never among the worst-performing DFAs, while each of the three best DHs, revDSD-PBEP86-D4, revDOD-PBEP86-D4 and revDSD-PBEP86-D3(BJ), can be found among the worst performing DHs in this subcategory twice.
The source of the difficulties of DHs for this category is most likely the BSE for MP2 as exemplified by the PCONF test set of relative energies of oligopeptide conformers. Recently, Goerigk et al.
investigated the basis set dependence of MP2 for parts of this subset and found a tremendous BSE for MP2 calculations on the TZ level. [245] Although to a lesser extent, the same is true for SCONF. [247] Indeed, eliminating the 35 data points of these two test set from our analysis, we observe that the WTMAD-2 of the best DH is now more than 17% smaller than the one of the best hybrid. Finally, we recommend the B97-D4 functionals as a reliable, robust and cheap DFA for easy isomerization reactions. Also revDSD-PBEP86-D3(BJ)/D4 and the cheaper alternative revDOD-PBEP86-D3(BJ) can be recommended, although some care must be taken for test sets for which a slow convergence to the CBS limit is documented.

| Difficult isomerization reactions
We proceed with our discussion of relative conformational energies by analyzing our results for the 159 isomerization reactions in the category of difficult isomerization reactions. As poined out by Head-Gordon and coworkers, the systems in this category, mainly consisting of strained and conjugated organic molecules with a difficult electronic structure and an average energy of 25.96, pose a big challenge for DFT methods. [168] For the corresponding subcategory in MGCDB84, essentially being identical to ours, no GGA or hybrid yields a RMSD lower than 2 kcal/mol, while for their dataset of easy isomerization reactions 23 functionals yield RMSDs lower than 0.5 kcal/mol. [168] The failure of common DFAs originates from the pronounced many-electron SIE in the systems contained in ID [168] and DHs, usually featuring large fractions of exact exchange, should provide a better description of their electronic structures. This expectation is immediately confirmed by the WTMAD-2s presented in Figure 9. As for barrier heights, with WTMAD-2s of up to more than 15 kcal/mol, the (meta-)GGAs benchmarked by us show a devastating performance and are clearly inadequate for ID. The best hybrids all feature a large amount of exact exchange-M05-2X with 56%, the range-separated ωB97-X with 100% in the long-range-regime, and M08-SO with 57%.  [307] that MP2 tends to overestimate the resonance effect in these systems and indeed, ωB97-X outperforms all DHs for C20C24. Furthermore, for reasons remaining elusive to us, compared to DFAs of lower rung, DHs show a devastating performance for the C60ISO subset, confirming previous results. [57,252] As all C60 isomers are nonaromatic, [308] the overestimation of the reso- These functionals are especially suitable for the description of strained ring systems, that is, Styrene45, C20C24 and also DIE60.
revDOD-PBE-D4, the functional with the smallest WTMAD-2, is three times among the best functional but never among the worst.
We also emphasize the excellent performance of revDOD-BLYP-D3 (BJ) being the best functional for EIE22, DIE60 and ISOMERIZA-TION20 but only showing a mediocre performance for the remaining subsets.

| Intermolecular noncovalent interactions
The interactions stabilizing the structures of biomacromolecules like DNA and proteins [269] or governing the self-assembly process of nanomaterials, but also interactions between a drug and a protein or intramolecular interactions in donor-acceptor complexes are mostly of noncovalent nature. [309] Consequently, with 11 subsets and a total of 375 datapoints, this interaction type comprises a large fraction of our database. The low average reaction energy of 11.84 kcal/mol nicely illustrates why NCIs are often denoted as weak interactions.
The fact that semi-local and hybrid-functionals do not reproduce the correct 1/r 6 -behaviour of the London-dispersion interaction [310][311][312] has often been used as an argument for the importance of explicit electron correlation to accurately describe NCIs. [48,132,134,135,144] Indeed, Engel et al. could show [313] that the exchange-correlation functional from second order Levy-Görling perturbation theory (LG-PT2) contains the leading contribution to the van der Waals interaction which might have served as the main motivation for the construction of an early DH which includes an MP2 term in the long-range regime only. [314] Modern DHs can be seen as semiempirical "corrections" of LG-PT2 and it seems to be reasonable to expect them to yield substantial improvements over hybrids and (meta-)GGAs for NCIs. In practice however, NCIs can be described very well by partnering a GGA or a hybrid with some form of empirical dispersion correction. [57] It has been pointed out by Goerigk et al. [311] that a plethora of DFAs performs well on the S66 and the S66x8 test sets if only some form of empirical dispersion correction is considered.
The WTMAD-2s we present in Figure 11 partly confirm these findings. While the analysis of the WTMAD-2s favors D4 over D3(BJ) for B3LYP, as shown in Figure 12, the latter combination is three times among the best hybrid functionals but the former only twice. Also B97-D4, CAM-B3LYP-D3(BJ) and PW6B95-D3(0) are among the best DFAs thrice. B2K-PLYP-D3(BJ) is the only DH being three times among the best ones of this rung on Jacob's ladder.
It is instructive to take a closer look at the important S66, S10x8 and 3B-69-TRIM subsets. With 215 datapoints in total, they comprise nearly two third of the NCI category and serve as a representative benchmark for NCIs between neutral organic molecules, including dimers in nonequilibrium as well as in their equilibrium geometry and trimers. B2K-PLYP-D3(BJ) describes noncovalently bounded complexes in their equilibrium geometry by far most accurately and its fifth position for S10x8 demonstrates that it is reliable for nonequilibrium geometries also. B3LYP-D3(BJ) is the best hybrid functional for 3B-69-TRIM and S66 but cannot be found among the best 15 DFAs for S10x8. On the GGA level, we also emphasize the great performance of BLYP-D3(BJ) for all three subsets.

| The entire database
To conclude the discussion in this subsection, we shortly comment on the performance of the benchmarked DFAs for all 1,644 datapoints in our database. As for all subcategories, Figure 13 shows the WTMAD-2 s of the 25 best performing DHs (blue) and hybrids (green) as well as the best 10 meta-GGAs (red) together with their empirical dispersion correction.
We find that, also on the TZ level, using STOs and PARI, Jacob's ladder is still reproduced-the best hybrid functionals clearly (2) outperforms all other functionals, it is closely followed by revDSD- DHs ever constructed, also performs surprisingly well in our benchmark, being only average in studies on GMTKN55. [61] Although our database is not identical to GMTKN55, it has considerable overlap. The highlighted differences show that care must be taken in the interpretation of benchmarks of DFAs and result obtained at the CBS limit are not necessarily transferable to smaller basis sets.
It is also interesting to observe, that the best three hybrid functionals, M05-2X, M06-2X and M08-S0 do not feature an empirical dispersion-correction term unlike, with the exception of LS1DH, the best 25 DHs. revDSD-BLYP-D4 outperforms all other functionals in this study, followed by B2GP-PLYP and revDOD-BLYP-D3(BJ). Using PARI-AO-MP2, the latter one is faster than the best hybridfunctionals M05-2X and M06-2X for medium systems and only negligibly slower for large molecules.
However, we hesitate to recommend it as a general purpose functional. The best-worse analysis in Figure 14  These BSEs strongly suggest that the use of QZ basis sets is clearly indispensable for DHs to exploit their full potential. This can be seen most clearly for ground state energies of conformers of organic molecules with easy electronic structure and NCIs where much higher accuracy than the often cited 1 kcal/mol criterion of chemical accuracy is crucial for predictive quantum chemistry. It is the most pressing problem of PARI-MP2, that QZ calculations are not doable at the moment. Research towards a strategy to overcome the limiting numerical issues is currently pursued in our group.
Nevertheless, we could demonstrate on a large and diverse database of 1,644 datapoints, that Jacob's ladder is still reproduced-DHs are the most accurate and robust DFAs available and many of them outperform the best hybrid functionals. Possibly due to the large BSEs for DHs and maybe also due to larger PARI-errors in the computation of four-center integrals, the increase in accuracy of DHs over hybrids is less pronounced than the one found in recent studies on GMTKN55.
Among the 14 DHs outperforming all hybrid functionals, also five DOD-functionals can be found which enable DH calculations to essentially the price of a hybrid calculation. For five test sets of large molecules we could show that their accuracy is also retained for molecules of more than 100 atoms, including biologically relevant enzymes and isomers of large organic molecules. Consequently, DODfunctionals are the method of choice when CPU-time and memory requirements are an issue and highly accurate energies are required.
In summary, the use of DHs is recommended for various chemical problems. On the TZ level, they especially excel in thermochemistry, the calculation of barrier heights and the calculation of relative conformational energies of organic molecules with challenging electronic structure. This is not only true for main group chemistry but especially when TMs are involved. In this case, highly accurate energies can be obtained from DHs in conjunction with the ZORA formalism.
As an important results of this work, we shortly recapitulate our recommendations for the individual subcategories of our database.
1 Basic properties and reaction energies: DHs should always be used here. B2T-PLYP-D3(BJ) provides the best balance between accuracy and robustness. If CPU-time is a concern, revDOD-BLYP-D3 (BJ) should be used.
2 Reaction barrier heights: revDOD-PBEP86-D3(BJ) can always be recommended, but in the absence of TMs, M08-HX is a robust and accurate alternative. When TMs are involved, B2K-PLYP provides excellent performance.
3 Simple isomerization reactions: DHs cannot be recommended here.
We recommend B97-D4 due to its excellent price/performance ratio. Apart from these findings, we could draw two other important conclusions from the results presented herein. First, a general purpose functional remains to be developed. None of the herein benchmarked functionals can safely be applied to all chemical problems assessed here. Even for a single subcategory, there is almost never a functional which can be recommended for all subsets. Thus, the recommendations given herein might serve as important guidelines but they should never be understood as black-box solutions for a problem at hand.
Second, in our benchmarks we could not reproduce the ranking of functionals according to their accuracy on GMTKN55 when large QZ basis sets are used. While we can confirm some important results presented previously, for example, the outstanding success of the Minnesota-functional M05-2X, M06-2X and M08-HX on the hybrid level, the great performance of revDSD-functionals [60,61] or the considerably worse performance of one-parameter functionals, [58] revDSD-PBEP86-D4 is not the frontrunner in our study but rather revDSD-BLYP-D4. On the other hand, the second best DFA in our study, B2GP-PLYP, only shows an average performance on GMTKN55 with QZ basis sets. This indicates that benchmark results obtained with QZ basis sets do not necessarily apply to the TZ level also, especially for DHs, where BSEs can become quite large.
As the certainly most important outcome of this study, we conclude this work by giving a definite answer to the question asked in the very beginning-is it possible to perform accurate, robust and efficient hybrid-  Lucas Visscher https://orcid.org/0000-0002-7748-6243