In analogy to the vastness and sparseness of outer space, we can loosely refer to the space of chemical systems as chemical compound space (CCS), some continuous observable space that is populated by all experimentally and theoretically possible chemicals with natural nuclear charges and real interatomic distances for which chemical interactions occur.1 Stated more precisely, CCS refers to the combinatorial set of all compounds that can be isolated and constructed from possible combinations and configurations of atoms and electrons in real space. In the absence of external fields and for given atom types and spatial configurations, not only covalent, ionic, and metallic bonding result, but also the much weaker hydrogen and van der Waals (vdW) bonding, responsible for the physics and chemistry of molecular crystals, liquids, and other supramolecular aggregates. Most research efforts in this first principles context are concerned with approximations and methods necessary for making property predictions for given compounds. By contrast, the focus of this tutorial is a first principles view on the compounds per se.
Notwithstanding chemical bonding or conformations and merely considering the number of possible stoichiometries, it is obvious that the size of CCS is unfathomably large for all but the smallest systems. Due to all the possible combinations of assembling many and various atoms, its size scales exponentially with compound size, roughly as N_Z^N. Here, N_Z is the number of possible atom types, that is, the maximal permissible nuclear charge in Mendeleev's table, and N is the number of atoms, which depends on the employed definition of “isolated system” but can certainly reach Avogadro's number scale for living organisms, chunks of unordered matter, or planets. Although many of such speculative compounds are likely to be unstable, the state of affairs worsens dramatically when accounting for the additional degrees of freedom which arise from distinguishable geometries due to differences in atom bonding or conformations. This combinatorial explosion with system size is the main motivation for advocating an ab initio, or first principles, view on CCS, namely a view that restricts us to use solely the nuclear charges {Z_I} and positions {R_I} as input variables* and, while maybe not free of empirical parameters, will not change in its parameterization as {Z_I} and {R_I} are freely varied.2 A major part of modern electronic structure theory and interatomic potential work is concerned with the development of improved methods and approximations for solving Schrödinger's equation (SE) within the Born–Oppenheimer approximation for systems relevant to materials, biological, or chemical research, and deriving properties thereof.3 Ab initio statistical mechanics efforts are dedicated to sampling the corresponding degrees of freedom from first principles.4 In the context of CCS, the electronic Hamiltonian H for solving the SE, HΨ = EΨ, of any compound with a given charge, q = N − N_e, is uniquely determined by its (unperturbed) external potential, v(r), that is, by its set {Z_I, R_I}. Here, N is the total number of protons in the system, the sum over all nuclear charges, N = Σ_I Z_I.
Due to the Hohenberg–Kohn theorem, we also know that the electron density, ρ(r), and all electronic properties derived thereof, are determined by the external potential v(r), up to a constant.5 Consequently, we can work directly with v(r).
In this tutorial, CCS is first briefly illustrated in terms of a rough energy scale in section Energy Hierarchy. In section Molecular Grand-Canonical Ensemble, we will review the notion of a molecular grand-canonical ensemble density functional theory (DFT) that accounts for fractional electrons and nuclear charges. Section Compound Pairs will deal with pairs of chemical compounds, and with efforts to exploit the arbitrariness of interpolating functions. It also details the challenge associated with a prize award of one ounce of gold. Finally, we will discuss in section Statistical Methods recent efforts to use intelligent data analysis methods [machine learning (ML)] to systematically infer analytical structure property relationships from previously calculated electronic structure data sets.
Considering CCS, it is useful to think of a variable system comparable to Mendeleev's table of the elements. Compounds, however, have many more dimensions than a single atom's nuclear charge, specifically 4M − 6 for M atoms (4M − 5 if linear). One way of thinking about CCS is in terms of an abstract Gedankenexperiment involving all theoretically existing compounds. For a set of N protons, subject to a varying amount of kinetic energy (or temperature) provided by a thermostat, regimes emerge with various familiar degrees of freedom. In such a “phase diagram” of CCS, these various regimes correspond to
i. Stoichiometrical isomers: a “very high” temperature regime: Let us assume such high temperatures that all bonds break, and that all spatial degrees of freedom can safely be neglected. Furthermore, we assume isomers to have the same number of elementary particles, N and N_e. How many such stoichiometrical isomers could be observed populating up to N sites with at least one proton? Mathematically speaking, this is a discrete number theory problem: this number is the integer partition of N, that is, the number of ways to write N as a sum of positive integers. For example, CH4, NH3, H2O, HF, and Ne represent only five out of all the 42 possible stoichiometrical isomers for N = 10. The total number of possible partitions corresponds to the partition function, which increases rapidly with N. This rapid increase and an illustration of the emerging stoichiometries according to Young–Ferrers diagrams are shown in Figure 1. These degrees of freedom are rarely explored in nature except when it comes to radioactive decay, nuclear fusion, or nucleosynthesis in the early stages of our universe. Through fictitious interpolation, however, we can meaningfully render this space continuous, as illustrated using DFT for the potential energies displayed in the inset of Figure 1.
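The integer-partition count mentioned above is easy to verify numerically. The following sketch (our own toy illustration, not code from the original work) counts the partitions of N with the standard dynamic-programming recurrence and reproduces the 42 stoichiometrical isomers for N = 10:

```python
# Count the stoichiometrical isomers for N protons: the number of integer
# partitions of N, i.e., the ways to write N as a sum of positive integers.
# Uses the standard dynamic-programming recurrence for partition numbers.

def partition_count(n):
    """Number of integer partitions of n (p(0) = 1 by convention)."""
    p = [1] + [0] * n
    for part in range(1, n + 1):          # allow parts of size `part`
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]

# N = 10 protons: CH4 (6+4), NH3 (7+3), H2O (8+2), HF (9+1), and Ne (10)
# are five of the p(10) = 42 stoichiometrical isomers.
print(partition_count(10))  # -> 42
```

The nested loops build up the count one allowed part size at a time, which keeps the complexity at O(N^2) rather than enumerating the partitions explicitly.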
ii. Constitutional isomers: a “high” temperature regime: At high temperatures, only strong chemical bonding (covalent, ionic, metallic) survives. Corresponding Lewis structures enumerate many (but not all) of the possible constitutional isomers distinguishable as possible topologies, or molecular graphs, that can be constructed. The enumeration (and canonization) of all possible constitutional isomers has been the focus of long-standing graph-theoretical efforts.8–11 The exponential scaling of their number is also evident in the recently published exhaustive list of small organic molecules.12 This is the regime in which isomerism occurs through conventional “chemistry,” that is, chemical reactions that lead from one constitutional isomer to another, usually under the influence of pressure, temperature, light, or some catalytic agent. We can model this and the subsequent regimes iii and iv using ab initio molecular dynamics methods.4 Universal or reactive force-fields can be used to accomplish similar sampling.13, 14
iii. Conformational isomers: “ambient” temperature: Folding and unfolding events, the sampling of intramolecular degrees of freedom, for example, along dihedral angles, and similar processes take place at “ambient” temperatures. These isomers are typically sampled using force fields that assume fixed molecular topologies and parameterized charges, dihedral and angular terms, in addition to the typical potentials used for the chemical bond, such as harmonic, Buckingham, or Morse potentials.
iv. Weakly interacting systems: “low” temperature: Supramolecular assemblies and soft aggregates condense to molecular liquids or solids. Biological systems, such as membranes, or even living cells and organisms, also fall in this regime. Such van der Waals dominated systems are typically modeled using classical simulations with effective Lennard-Jones-type potentials.
In this review, we will discuss recent contributions that are consistent with all of the four regimes, accounting for all the spatial and elemental degrees of freedom, {R_I} and {Z_I}. We will ignore the electron number N_e as an independent variable because N_e = N for most if not all possible scenarios. Furthermore, all discussions are based on the Born–Oppenheimer approximation, and assume that nuclear positions are well defined.
Molecular Grand-Canonical Ensemble
Much of conceptual DFT concerns the energy response to infinitesimal variations in the number of electrons, N_e, or in the external potential, v(r).15, 16 Although very important for interpreting orbitals, deriving reactivity indices, and even for redox processes, the diversity (and combinatorial scaling) of CCS is due to variations in the nuclear charge distribution rather than to variations in N_e. Consequently, in the following, we will mostly be concerned with changes in nuclear charges. To offer a rigorous framework for explicit changes in nuclear charges, molecular grand-canonical ensemble DFT was introduced,17 relying to a significant degree on preceding work.18, 19 Only a brief summary is given here; for more details, the reader is referred to the original contributions.
Assuming a classical nuclear charge distribution, ρ_n(r), we can introduce an auxiliary grand-canonical variational energy functional for the aforementioned fictitious “very high” temperature regime (i), Ω[ρ_e, ρ_n] = E[ρ_e, ρ_n] − μ_e ∫dr ρ_e(r) − μ_n ∫dr ρ_n(r),
where E, ρ_e, μ_e, and μ_n correspond to the usual total potential energy functional, the electron charge density, and the global electronic and nuclear chemical potentials, respectively. For high temperatures, entropy will prevail and the system would dissociate into hydrogen atoms (if biased toward natural, integer nuclear charges). For lower temperatures, the potential energy will dominate the free energy, and the energy of a single atom (Z = N) will dominate over the energy of many individual atoms that sum up to the same number of protons, N. Hence, the nuclear charge distribution would collapse onto a single site. For this we obviously assume the classical and fictitious self-repulsion of protons occupying the same nuclear site to be switched off.
For the “lower temperature” regimes (ii)–(iv), the following energy functional is more meaningful, Ω[ρ_e, ρ_n] = E[ρ_e, ρ_n] − μ_e ∫dr ρ_e(r) − ∫dr μ_n(r) ρ_n(r),
where ρ_n(r) corresponds to a spatially resolved nuclear point charge distribution. The nuclear chemical potential, μ_n(r), is now a locally defined Lagrange multiplier. Using an external potential that excludes the aforementioned intranuclear self-repulsion of protons (here through use of an error function that switches off the divergence at nuclear sites), we find from the corresponding Euler equation, μ_n(r) = δE/δρ_n(r)
—the electrostatic potential of the system. As such, starting with Q—a Legendre transformed energy functional of the intensive properties μ_e and μ_n(r)—one can derive the Gibbs–Duhem equation analogue for electrons and protons,
and obtain relationships between the electronic hardness, the molecular Fukui function,20 and the nuclear hardness kernel.17 While the nuclear chemical potential μ_n(r) is defined everywhere, its value at an atomic position R_I has special meaning: it quantifies the system's first-order energy response to a fractional change of that atom's nuclear charge. Consequently, we dub μ_n(R_I) the molecular “alchemical potential” of atom I.19
Ignoring potential applications to radioactive and nuclear processes, alchemical interpolations obviously do not describe reality. They offer, however, a rigorous mathematical way to render CCS continuous. Alchemical changes and potentials involving fractional nuclear charges are commonly used for two, often related, purposes: either for the evaluation of free energy differences between different compounds, for example, using thermodynamic integration,21 or for obtaining a set of gradients, {∂E/∂Z_I}, indicating the response of the system to a variation of the nuclear charge on every site.19, 22 In practice, we can calculate such changes through interpolation of nuclear charges in any basis set that is converged for all values of an interpolating order parameter, λ. For plane-wave pseudopotential implementations, the same can be accomplished by interpolation of pseudopotentials that replace the explicit treatment of the core electrons.23–28 The use of a plane-wave basis set is advantageous because it is independent of atomic position and type, and will not introduce Pulay forces.29 The manipulation of pseudopotentials for affecting electronic structure properties is nothing out of the ordinary. It has successfully been deployed for an array of properties including relativistic effects,30 self-interaction corrections,31, 32 exact-exchange and quantum mechanical/molecular mechanical (QM/MM) boundary effects,33, 34 van der Waals interactions,35, 36 and widening the band gap.37–39 For fractional nuclear charges, we can interpolate pseudopotentials, and evaluate properties as a function of the order parameter, λ. An interpolation of pseudopotential parameters as a function of nuclear charge is shown in Figure 2. Calculated properties as a function of such alchemical changes are illustrated in Figures 1 and 3 for total potential energies, and for protonation energies and polarizabilities, respectively.
Note that the former application is not physical because of the arbitrary energy offset of pseudopotentials. This, however, is inconsequential, because most of chemistry deals with energy differences, and differences thereof. As shown in Figure 3 for HCl→NH3, the use of pseudopotentials for alchemical changes can be particularly advantageous when it comes to transmuting elements from different rows of the periodic table while keeping constant the total number of valence electrons.
Free energy applications
Fractional charges were used to calculate free energies of solvation of ions in water.42 Sulpizi and Sprik43 rigorously explored the need for fractional nuclear charge calculations to predict pKa's of various organic and inorganic acids and bases. In the case of free energy differences, fractional charges can also be avoided altogether within a simple, alternative, and elegant interpolation scheme put forth by Alfè et al.44: atomic forces are evaluated at both endpoints (λ = 0 and λ = 1), and λ-dependent molecular dynamics trajectories are generated for atoms being propagated according to a linear combination of these forces, using instantaneous λ values as weights. This is to be compared to a trajectory that uses Hellmann–Feynman forces directly evaluated on the interpolated alchemical species, such as for the zero-temperature limit of relaxing the geometry of a reaction barrier.45 The limitations of Alfè's procedure are that (a) one requires twice as many self-consistent field calculations, namely for both endpoints instead of a single one when using an alchemical interpolation (assuming of course that for both approaches the free energy integrand, ∂λE, varies similarly with λ); and (b) the number of atoms must be kept constant during the interpolation, significantly restricting the possible number of stoichiometries that can be explored. Both of these disadvantages can be avoided within the compound pair scheme discussed below.
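As a schematic illustration of thermodynamic integration over an alchemical order parameter, the following toy calculation (our own construction; the one-dimensional harmonic system and its force constants are invented for illustration) integrates the exact canonical average ⟨∂U/∂λ⟩ along λ and compares the result against the known closed-form free energy difference:

```python
# Toy thermodynamic integration: classical free-energy difference between
# two 1-D harmonic "compounds" A and B whose potentials are linearly mixed,
# U_lam(x) = [(1-lam)*kA + lam*kB] * x^2 / 2.
# The exact canonical average is <dU/dlam> = (kB - kA)*kT / (2*k_lam),
# so the exact answer is dA = (kT/2) * ln(kB/kA).
import math

kA, kB, kT = 1.0, 4.0, 1.0

def dU_dlam_avg(lam):
    k_lam = (1 - lam) * kA + lam * kB
    return (kB - kA) * kT / (2.0 * k_lam)

# trapezoidal quadrature of the free-energy integrand over lambda
n = 1000
grid = [i / n for i in range(n + 1)]
dA = sum(0.5 * (dU_dlam_avg(a) + dU_dlam_avg(b)) * (b - a)
         for a, b in zip(grid[:-1], grid[1:]))

print(round(dA, 4), round(0.5 * kT * math.log(kB / kA), 4))
```

In a real alchemical calculation the analytic average would be replaced by a molecular dynamics or Monte Carlo estimate of ⟨∂λE⟩ at each λ; the quadrature over the order parameter is unchanged.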
Through use of a Taylor expansion, truncated after first order, we can also exploit the vector of alchemical potentials for gaining control over properties in the space of compounds whose nuclei are adjacent in the periodic table. Weigend et al.46 were probably the first to propose such an application in the context of predicting stability in binary atom clusters. Independently, it was applied to drug binding energies within a QM/MM study,19 demonstrated for hydrogen-bonded complexes, interconverting methane, ammonia, water, and hydrogen fluoride while bound to formic acid,22 and applied to the molecular Fukui function for tuning highest occupied molecular orbital (HOMO) eigenvalues of boron/nitride derivatives of benzene.20 In Ref. 43, this notion is exploited for the prediction of reaction barriers as well as oxygen adsorption energies on Pd-derived core-shell metal nanoclusters that are catalyst candidates for the oxygen reduction reaction. A subsequent application to the design of Ni-based nanocluster alloys with oxygen adsorption energies corresponding to the target value of 1.65 eV for maximal turn-over frequency has also been carried out.47 The molecular Fukui function, in particular when evaluated at the position of the atom, was also discussed more recently by Cardenas et al.48 Clearly, truncation of the Taylor expansion at second or higher order would be desirable to increase the accuracy of the alchemical predictions of the effect of atomic transmutations. Alas, higher order derivatives of the energy with respect to nuclear charges lead to computational overhead, as they require the calculation of the perturbed electronic structure. Nevertheless, based on coupled perturbed self-consistent field theory, the improved accuracy of higher order predictions has been demonstrated very recently.49
Instead of varying the proton distribution in a compound, we can interpolate the external potential just as well. Albeit mixing up spatial and compositional degrees of freedom, this is more in line with conventional thought in quantum chemistry and conceptual DFT15 where the nuclear charge distribution is hardly ever mentioned explicitly. The route via the external potential has been pursued within the Ansatz of a linear combination of atomic potentials in the research groups of Yang and Beratan,50 assigning atom-type-specific weights to every atom site. Using simplified Hamiltonians, impressive results were obtained for the control of molecular hyperpolarizabilities,51–55 furthering long-standing molecular design efforts for electronic properties well ahead of their time.56, 57 This approach has also been combined with genetic algorithms for the purpose of crystal structure design using DFT.58 The functional second-order derivatives with respect to external potentials have been published in Ref. 59. Analytical expressions for second-order derivatives and linear response functions have very recently been proposed by Yang et al.60 The same authors also derived important constraints for the electronic structure that must be met by the exact exchange-correlation functional. In analogy to using constraints obtained for variable N_e, such as piece-wise linear behavior and derivative discontinuities, to design improved density functionals,61–63 A. Cohen's current efforts are dedicated to variations in the external potentials that include fractional nuclear charges. The electronic structure of systems with fractional nuclear charges has also been explored by Constantin et al.63
The Ansatz discussed above, variational in a fractional nuclear charge distribution, defines an appealing, fully spatially resolved index, that is, a way to probe the sensitivity of a compound not only toward changes in any of its composing atoms but also with respect to adding new protons. However, for two reasons, this approach is limited.
First, severe constraints and preconceived insights are required to explore the space of all nuclear charge distributions. Either, if ρ_n is continuous, a bias potential toward integer nuclear charges is required, possibly using a fictitious temperature, in some analogy to the Fermi function for electrons. Or, if each site is a combination of various atom types, in line with the aforementioned linear combination of atomic potentials approach,50 the weight of one nuclear charge has to dominate so that it can safely be increased to one, while all others are decreased to zero. Furthermore, constraints due to overall charge conservation, and electronic structure, have to be taken into account. For example, consider an alchemical transmutation of H2N-OH into its isoelectronic stoichiometrical isomer hydrogen peroxide, HO-OH, through simultaneously and continuously decreasing and increasing by one the nuclear charges of a hydrogen and the nitrogen atom, respectively. At some point of this conversion, the ground state surface will turn into a triplet surface, therefore requiring the consideration of both spin states along the interpolation path.
Second, and more importantly, to carry out alchemical changes along columns in the periodic table, a path following Z would have to fill up the shell and go through the entire period before arriving at the desired element. This implies significant variations in electronic configurations just to arrive at a target compound with a configuration likely to be very similar to that of the starting compound. For example, consider a system of eight valence electrons, and Ne and Ar as starting and target compounds, respectively. Then, an isoelectronic path progressing with the Z of the central atom, and saturating with hydrogens accordingly, would have to proceed through the following series of compounds: NaH7, MgH6, AlH5, SiH4, PH3, H2S, and HCl, some of which are not even likely to be covalently bound. Hence, although Taylor expansions in Z are quite predictive for adjacent elements—as mentioned in the preceding section—it is not surprising that their predictive power decays dramatically when it comes to predictions up and down the columns of the periodic table. Obviously, matters will only become worse when d- or f-elements are to be included, or when trying to make predictions two or more rows downward or upward.
Albeit intuitive, the use of nuclear charges as interpolating variables is fortunately not mandatory. Instead, we can also use a generalized, and entirely arbitrary, interpolation procedure between any pair of compounds. As long as it is reversible and integrable, any path can be used to monitor any property that is a state function.4 In all of the following, we will only consider interpolations between isoelectronic compounds, that is, compounds with the same N_e in their Hamiltonian. As mentioned before, this is only a minor restriction because the diversity of CCS is due to differences in nuclear charge distribution rather than to differences in N_e. For example, we can linearly interpolate the Hamiltonians of any two isoelectronic compounds, A and B, H(λ) = (1 − λ) H_A + λ H_B, (5)
in order parameter λ. H_A and H_B denote the initial and final electronic Hamiltonians of the two compounds, obeying the corresponding boundary conditions H(0) = H_A and H(1) = H_B, respectively. For any isoelectronic Hamiltonian linear in λ, the potential energy is not necessarily linear. In fact, the electronic potential energy of a linearly interpolated Hamiltonian is likely to be concave or, more precisely, equal to or larger than the straight line connecting the energies of compounds A and B, E(λ) ≥ (1 − λ) E_A + λ E_B. This inequality follows from the variational principle and can easily be shown: Eq. (5) implies, E(λ) = ⟨H(λ)⟩_λ = (1 − λ) ⟨H_A⟩_λ + λ ⟨H_B⟩_λ = (1 − λ) E_A(λ) + λ E_B(λ),
where ⟨·⟩_λ now corresponds to the usual quantum mechanical bra-ket notation, denoting the expectation value with the wavefunction, or the density functional (in an orbital-free exact DFT world), evaluated for the Hamiltonian at λ, that is, Ψ_λ. E_A(λ) and E_B(λ) denote the energies of compounds A and B evaluated using the wave functions (or density in the case of orbital-free DFT) obtained at λ. Note that E_A(0) = E_A, and E_B(1) = E_B. Subtracting and regrouping yields, E(λ) − (1 − λ) E_A − λ E_B = (1 − λ) [E_A(λ) − E_A] + λ [E_B(λ) − E_B] ≥ 0,
where the prefactors of the energy differences are ≥ 0 by definition, and where E_A(λ) ≥ E_A and E_B(λ) ≥ E_B because of the variational principle. Consequently, analogous inequalities will hold for any property for which there is a variational principle, for example, also for the polarizability due to Pearson's maximum hardness principle.65 The corresponding concavity is on display for the static polarizability, fractionally transmuting a hydrogen chloride molecule into ammonia (Fig. 3). Similarly, potential energy inequalities between different molecules were proposed by Mezey in the eighties.66
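The concavity argument above can be checked numerically on toy Hamiltonians. The sketch below (random Hermitian matrices of our own choosing, purely illustrative) verifies that the lowest eigenvalue of the linearly interpolated Hamiltonian never falls below the straight line connecting the endpoint energies:

```python
# Numerical check of the concavity of the ground-state energy for a
# linearly interpolated Hamiltonian H(lam) = (1-lam)*H_A + lam*H_B:
# E(lam) >= (1-lam)*E(0) + lam*E(1), as implied by the variational
# principle (the minimum eigenvalue is an infimum of linear functions).
import numpy as np

rng = np.random.default_rng(0)

def rand_hermitian(n):
    m = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (m + m.conj().T) / 2

HA, HB = rand_hermitian(6), rand_hermitian(6)
E = lambda lam: np.linalg.eigvalsh((1 - lam) * HA + lam * HB)[0]

EA, EB = E(0.0), E(1.0)
for lam in np.linspace(0, 1, 21):
    chord = (1 - lam) * EA + lam * EB
    assert E(lam) >= chord - 1e-12   # curve lies on or above the chord
print("concavity holds on the sampled grid")
```

Because the ground-state energy is a minimum over expectation values that are each linear in the Hamiltonian, the same check succeeds for any pair of Hermitian matrices, not just this random example.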
Analytical first-order derivatives of the energy as a function of any isoelectronic change in the Hamiltonian can easily be calculated using the Hellmann–Feynman (HF) theorem,67 as proposed, and demonstrated for HOMO eigenvalues, in Ref. 66. For a linearly interpolating Hamiltonian, such as in Eq. (5), this leads to, ∂E/∂λ = ⟨H_B − H_A⟩_λ. (8)
The protonation energy, and its derivative, also feature in Figure 3 for the transmutational change HCl→NH3. As mentioned before, the use of pseudopotentials/valence electron densities fortunately renders straightforward the evaluation of the HF derivative according to Eq. (8) even for compound pairs that involve elements from differing rows of the periodic table.
Thermodynamic integration of ∂λE over λ yields free energy differences. In the case of compound design, the approach is slightly different: we would like to Taylor expand the energy of a new compound B in terms of a reference compound A and its derivatives, E_B = E_A + ∂λE|λ=0 + HOT, (9)
with HOT standing for higher order terms. Unfortunately, when making predictions with a linearly interpolated Hamiltonian, the first-order derivative term according to Eq. (8) is not necessarily predictive.68 While the inclusion of higher order derivatives in Eq. (9) is likely to improve the prediction, as found for statistical mechanical averages,69 it also requires the evaluation of the perturbed wavefunction, for example, through the use of linear response theory,33, 70 thereby defying the original purpose of predicting a new compound's energy without having to solve for its wave function. It is a coincidence that for some relative energies, such as the protonation energy shown in Figure 3, the higher order effects mostly cancel, thereby rendering the first-order HF derivative quite predictive.59, 60
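The limited predictive power of the first-order term can likewise be illustrated with toy matrices. In the sketch below (our own construction, not the systems of the original work), the Hellmann–Feynman derivative at λ = 0 yields a tangent prediction for E_B that, by the concavity shown above, always bounds the true ground-state energy from above; the gap is exactly the neglected higher order contribution:

```python
# Toy illustration of a first-order alchemical prediction: estimate E_B
# from E_A plus the Hellmann-Feynman derivative
#   dE/dlam|_{lam=0} = <psi_A| H_B - H_A |psi_A>
# for the linear interpolation H(lam) = (1-lam)*H_A + lam*H_B.
import numpy as np

rng = np.random.default_rng(1)
n = 6

def rand_herm():
    m = rng.normal(size=(n, n))
    return (m + m.T) / 2

HA = rand_herm()
HB = HA + 0.3 * rand_herm()            # a "nearby" target compound

wA, vA = np.linalg.eigh(HA)
psiA = vA[:, 0]                         # ground state of A
dE = psiA @ (HB - HA) @ psiA            # HF derivative at lam = 0

E_B_true = np.linalg.eigh(HB)[0][0]
E_B_pred = wA[0] + dE                   # first-order Taylor estimate

# Concavity of E(lam) makes the tangent an upper bound on E_B:
assert E_B_pred >= E_B_true - 1e-12
print(f"true {E_B_true:.4f}  predicted {E_B_pred:.4f}")
```

Shrinking the perturbation (the 0.3 prefactor) shrinks the gap quadratically, which mirrors the statement that the first-order term works best for adjacent, similar compounds.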
To improve the predictive power of the first-order term in Eq. (9), an empirical correction has been introduced that “linearizes” the energy through a global yet nonlinear Hamiltonian, H(λ).68 If we assume H(λ) to be a second-order polynomial in λ, two coefficients are determined by the boundary conditions H(0) = H_A and H(1) = H_B, leaving one additional degree of freedom. We can obtain the third degree of freedom as a parameter, c_CD, from an arbitrary second isoelectronic compound pair, C→D, such that the energy becomes linear in λ. The resulting expansion up to first order in Eq. (9) then becomes, E_B ≈ E_A + c_CD ∂λE|λ=0. (10)
Here, c_CD is the ratio between the energy difference and the HF derivative of the additional reference compound pair, C and D (its Hamiltonian being linearly interpolated); it is determined according to Eq. (8) as c_CD = (E_D − E_C)/∂λE_CD|λ=0. This bears resemblance to a long tradition in physical chemistry, namely the use of reference compounds for electrode potentials or enthalpies.
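A minimal numerical sketch of this reference-pair correction (toy symmetric matrices; the whole setup is our own and purely illustrative, with the reference pair sharing the same perturbation as the target pair) reads:

```python
# Sketch of the Eq. (10)-style reference-pair correction: rescale the
# Hellmann-Feynman derivative for A -> B by the empirical coefficient
# c = (E_D - E_C) / dE_CD obtained from a second, similar pair C -> D.
import numpy as np

rng = np.random.default_rng(2)
n = 6

def rand_herm():
    m = rng.normal(size=(n, n))
    return (m + m.T) / 2

def ground(H):
    w, v = np.linalg.eigh(H)
    return w[0], v[:, 0]

HA = rand_herm()
dH = 0.5 * rand_herm()
HB = HA + dH                    # target pair A -> B
HC = rand_herm()
HD = HC + dH                    # reference pair C -> D, same perturbation

EA, psiA = ground(HA); EB, _ = ground(HB)
EC, psiC = ground(HC); ED, _ = ground(HD)

dE_AB = psiA @ dH @ psiA        # HF derivative at lam = 0 for A -> B
dE_CD = psiC @ dH @ psiC        # same for the reference pair
c = (ED - EC) / dE_CD           # reference coefficient

plain = EA + dE_AB              # uncorrected first-order prediction
corrected = EA + c * dE_AB      # with reference coefficient
print(abs(plain - EB), abs(corrected - EB))
```

Whether the correction actually helps depends on how well the reference pair mimics the curvature of the target pair, which is precisely the transferability question raised below.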
The idea of using alternative, nonlinear interpolations is not new within molecular mechanics. In the context of electronic structure theory, nonlinear alchemical paths were also explored for chemical binding71 and nuclear quantum effects.72 Various open questions deserve further investigation, such as the transferability and choice of reference coefficients, isoelectronic changes using valence electrons only versus an all-electron description, nonisoelectronic changes, and the necessary accuracy when providing the input of target compound B, that is, also its geometry, the ionic forces of B, and so forth. The answers are likely to depend on systems and properties.
Control of ligand binding
In this section, we exemplify the use of the reference coefficients [Eq. (10)] for increasing the predictive power of the HF derivatives of linearly interpolated Hamiltonians. We resort to state-of-the-art van der Waals corrected DFT35, 73 to accurately estimate interaction energies with binding targets across CCS. We consider a small yet illustrative set of mutants of the ellipticine molecule. Ellipticine is a naturally occurring anticancer drug with various binding targets. As also illustrated in Figure 4, its dominant mode of binding to DNA is intercalation. Structural data as well as studies on drug analogues are readily available.74, 75 We will probe the versatility of the linearizing scheme for controlling ellipticine-derivative/DNA binding, isolated in the gas phase and with fixed geometry.76 Clearly, for the eventual control of ligand binding, the property of interest is not the potential energy of interaction but rather the free energy of binding: solvation or entropic contributions can be crucial, as is well known in general,77 and in the particular case of ellipticine.78 For example, Tidor,79 and Oostenbrink and van Gunsteren80, 81 have carried out similar work in the sense of interpolating ligand candidates, by calculating free energies of binding using molecular force fields. For this review, however, we will limit the discussion to the potential energy of interaction. Future work will deal with the inclusion of thermal and solvent effects, for instance using ab initio molecular dynamics techniques82 in conjunction with QM/MM83 calculations. Moreover, even at the mere potential energy electronic structure level of theory, the accurate quantification and control of intercalated ellipticine derivatives is challenging: vdW forces dominate the binding. Recent studies have already explored the binding of ellipticine and how its vdW forces can be accounted for at the employed electronic structure level.84–86
Let us consider the intercalation energy for the complex depicted in Figure 4 for mutations at the five sites indicated in the bottom panel. In analogy to protein or DNA sequences, an (arbitrary) relevant subspace of CCS is defined in Table 1 as a matrix that corresponds to an alphabet of isoelectronic (in valence electron number) functional groups at each of the selected sites. Note that variations in molecular combinations of letters of this alphabet are capable not only of reverting dipole moments, they can also act either as hydrogen bond acceptors (lone pair in OH/Cl) or donors (NH, proton in OH). Clearly, the alphabet can easily be extended to accommodate further effects, for example, with electron donating/withdrawing or hyperconjugating groups. Conformational degrees of freedom can be encoded explicitly, as is done for the hydroxyl groups in Table 1.
Table 1. Exemplary alphabet for mutants of ellipticine as oriented in Figure 4, defining a CCS with 6^4 × 4 = 5184 molecules. [Color table can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Site vs. group
Highlighted in red are all functional groups whose mutations have been considered. Predictions are displayed in Figure 5. The “wild-type” ellipticine drug is encoded as (21121), with three functional groups coming from the first column, and the functional groups at sites R1 and R4 coming from the second column.
Within this restricted CCS, any given molecule is represented by the sequence of functional groups distributed over the five sites. For example, the “wild-type” ellipticine in Figure 4 corresponds to (21121), that is, two for N at R1, one for CH at R2, R3, and R5, and two for NH at R4. Let us exemplify a DFT+vdW-based prediction of the binding energy of another mutant: is predicted to bind to the DNA cluster in Figure 4 with = kcal/mol.83 For predicting the single point mutation (21121) → (21125) (changing CH into F at R5), one would have to predict a true value of = kcal/mol. The derivative-based prediction according to the first-order term in Eq. (9) is calculated to be + = + 1.4 = kcal/mol. Inclusion of the reference coefficient [Eq. (10)], using compound pair (11121)/(11125) as a reference, yields + = + 1.3 × 1.4 = kcal/mol.
To gain a more representative idea of the predictive power of this method, Figure 5 features the outcome for a small subspace of the CCS highlighted in red in Table 1: eight compounds have been considered, involving permutations at three of the sites, each with two possible functional groups. Predictions based on all the possible derivatives among these compounds, with and without reference coefficients (as obtained from compound pairs not involved in the transmutation), are compared to calculated binding energies. Despite the several outliers that deviate substantially, the use of reference compounds dramatically improves the overall prediction.
For comparison, we also include predictions based on the additive assumption that the influence of the rest of the molecule cancels when considering an interpolation for the same pair of functional groups. Specifically, we estimate the binding energy of B simply by adding the difference in binding energy of a reference compound pair, C→D, to the binding energy of A, E_B ≈ E_A + (E_D − E_C),
As shown in Figure 5, this prediction also yields remarkable correlation, with less pronounced outliers. Analysis of the distribution of errors suggests, however, that in spite of the outliers the normal distribution of predictions around the ideal correlation is superior for predictions made with the product of derivative and reference coefficient (bottom of Fig. 5).
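The additive scheme just described amounts to a one-line arithmetic rule. A minimal sketch, using hypothetical binding energies rather than values from the study, reads:

```python
# Additive estimate of a binding energy: dE_B ≈ dE_A + (dE_D - dE_C),
# assuming the effect of swapping the same pair of functional groups
# is independent of the rest of the molecular scaffold.
# All numerical values below are hypothetical placeholders.

def additive_estimate(dE_A, dE_C, dE_D):
    """Estimate the binding energy of mutant B from compound A and a
    reference pair C -> D undergoing the same functional-group swap."""
    return dE_A + (dE_D - dE_C)

# hypothetical binding energies in kcal/mol
print(additive_estimate(-25.0, -20.0, -21.5))  # -> -26.5
```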
Win a prize
The numerical illustrations in the previous section, as well as in Ref. 68, represent promising evidence that efforts to linearize the property through alternative, nonlinear interpolations of the Hamiltonians are worthwhile. Strictly speaking, however, due to the use of reference compound pairs, the aforementioned interpolation no longer constitutes a first principles but rather an empirical and heuristic Ansatz. What is needed instead is an ab initio interpolating procedure that linearizes the energy (or other properties) in an order parameter, such that the first-order Taylor expansion based on the HF derivative is sufficiently accurate to predict properties of other compounds.2 In other words, if this effort were successful, numerical solutions of SE could be propagated from one system to the next (in close analogy to the Car-Parrinello idea of propagating orbitals together with nuclei87), foregoing the need to solve SE for each system from scratch.
Inspired by Erdős' habit of offering cash awards for solutions to outstanding mathematical problems, the author offers the equivalent of an ounce of gold to the first person who presents an ab initio solution to this problem. Specifically, the challenge reads: constructively find, or show nonexistence of, an ab initio (that is, valid for any external potentials) interpolating transform for which two different but isoelectronic molecular Hamiltonians, H_A and H_B with energies E_A and E_B, interconvert such that the electronic ground state potential energy, E(λ), is linear in the order parameter λ, and that consequently the HF derivative is given by, ∂E/∂λ = E_B − E_A.
Here, 0 ≤ λ ≤ 1, E(λ = 0) = E_A, and E(λ = 1) = E_B. Further details can be found in the footnote.†
We can exemplify this challenge for the nonrelativistic hydrogen-like atom with only one electron. In this case, E = −aZ², where a is a constant and the nuclear charge Z is a function of the interpolating parameter λ.‡ For an interpolation linear in λ, Z(λ) = Z_A + λ(Z_B − Z_A), the energy would be quadratic in λ. The desired behavior of a linearized energy would be, E(λ) = E_A + λ(E_B − E_A) = −a[Z_A² + λ(Z_B² − Z_A²)].
Equating this to −aZ(λ)² and solving for Z(λ) yields the corresponding interpolating function, Z(λ) = √(Z_A² + λ(Z_B² − Z_A²)). (14)
We can test this interpolation to confirm that we indeed find the desired slope for the linearized energy, ∂E/∂λ = E_B − E_A. Application of the chain rule, and insertion and differentiation of Eq. (14), confirms the expected result, ∂E/∂λ = (∂E/∂Z)(∂Z/∂λ) = −2aZ(λ) · (Z_B² − Z_A²)/(2Z(λ)) = −a(Z_B² − Z_A²) = E_B − E_A.
As such, Eq. (14) linearizes the energy in λ. The challenge of the prize consists of finding an analogous expression for molecules. If it exists, the solution is likely to be a spatially resolved and λ-dependent inhomogeneous transformation of the external potential, continuously and nonlinearly turning the external potential of compound A into the external potential of compound B while linearizing the potential energy.
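The single-atom case can be checked numerically. The sketch below assumes atomic units with a = 1/2 (reduced mass equal to the electron mass, per the footnote) and verifies both the linearity of E(λ) under Eq. (14) and the constant slope E_B − E_A:

```python
import math

# Verify that Z(lam) = sqrt(Z_A^2 + lam*(Z_B^2 - Z_A^2)) linearizes
# E(Z) = -a*Z^2 in the order parameter lam. We take a = 1/2 (atomic
# units, reduced mass equal to the electron mass).

a = 0.5

def Z(lam, Z_A, Z_B):
    # interpolating function of Eq. (14)
    return math.sqrt(Z_A**2 + lam * (Z_B**2 - Z_A**2))

def E(lam, Z_A, Z_B):
    return -a * Z(lam, Z_A, Z_B)**2

Z_A, Z_B = 1.0, 3.0  # e.g. H -> Li2+, isoelectronic one-electron systems
E_A, E_B = E(0.0, Z_A, Z_B), E(1.0, Z_A, Z_B)

# E(lam) coincides with the linear interpolation of the endpoint energies
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    linear = E_A + lam * (E_B - E_A)
    assert abs(E(lam, Z_A, Z_B) - linear) < 1e-12

# the finite-difference slope equals E_B - E_A everywhere
h = 1e-6
slope = (E(0.5 + h, Z_A, Z_B) - E(0.5 - h, Z_A, Z_B)) / (2 * h)
print(round(slope, 6), E_B - E_A)
```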
Note that a naive extension of Eq. (14) to assemblies of atoms, Z_I(λ) = √(Z_I,A² + λ(Z_I,B² − Z_I,A²)) for every atom I, does not constitute a practical approximate solution to the challenge. The HF derivative then involves the "alchemical" potential μ_I mentioned above, which corresponds to the electrostatic potential at R_I without the repulsion due to Z_I. μ_I will not necessarily cancel the square-root term in the denominator of the derivative ∂Z_I/∂λ, which consequently diverges at λ = 0 if Z_I,A equals 0.
Inductive reasoning from first principles
Within statistical mechanics, the numerical prediction of macroscopic observables from atomistic simulation requires repeatedly calculating microscopic states, using electronic structure theory or atomistic or coarse-grained force fields, and averaging in an appropriate ensemble. Philosophically speaking, the exercise of performing such computational "experiments" is an application of deductive reasoning to increase knowledge. Deductive reasoning is also at work when exploring CCS in terms of ensembles of potential energy hypersurfaces by repeatedly solving SE for N different compounds. Since the size of CCS (or phase space) is prohibitively large, its exhaustive sampling through brute-force screening with SE is impossible. Although some interpolating schemes use statistical mechanics for a preselected set of compounds,79, 80 a rigorous way to gain quantitative insights more systematically and generally is desirable. This task can be accomplished through the application of inductive reasoning.
Historically, the role of inductive reasoning in chemistry is considerable: Mendeleev's table, the Hammett equation,88, 89 and Pettifor's structure maps90 are all based on inferred phenomenological relationships. Further examples include widespread rules and notions of chemistry, such as the chemical bond, atomic charges, or aromaticity. Although popular and useful to the experimental chemist, conventional quantum chemistry, based on deductive reasoning, is still struggling to account for these notions.91 Recent advances in statistical data analysis methods92–95 and applications in other areas of science and engineering, such as searching the internet, automated locomotion (self-driving cars), algorithmic trading, or brain-computer interfaces, strongly suggest that they will also play an increasingly important role in quantum chemistry. Examples of first efforts to quantitatively infer laws for atomistic simulations include "Learning On The Fly"96 and "force-matching."97, 98 More sophisticated statistical learning methods have been applied to the training of exchange-correlation functionals in DFT,99, 100 and to parameterizing interatomic force fields.101–106 Support vector machines have been shown to quantify basis-set incompleteness.107 Gaussian kernel-based ML for the design of accurate and reactive force fields without predetermined functional form was introduced by Bartók et al.108 Contributions by Curtarolo, Hautier, and Ceder combine data mining with mean-field electronic structure theory.109–111 Even the learning of reorganization energies that enter Marcus charge transfer rates is promising.112, 113 Very recently, kernel-based ML models have also delivered promising results for learning kinetic energy density functionals within orbital-free DFT,114 and for dividing transition surfaces that determine reaction rates.115 Bayesian error estimates and cross-validation methods have also been applied to the development of exchange-correlation models with controlled transferability.116
Within the bioinformatics and cheminformatics communities, the development of quantitative structure property relationships (QSPRs) has a long tradition. QSPRs, relying on similar statistical frameworks (ML, cross-validated training, principal component analysis, etc.), deliberately attempt to circumvent solving the underlying laws of physics by directly correlating system features (so called descriptors) with macroscopic properties of interest. Conventionally, QSPRs are based on descriptors that explicitly forsake atomic resolution in the first principles sense. A large variety of such QSPR descriptors for various properties has been proposed.117–120 Two such descriptors, the molecular signature by Faulon et al.121 and a combination of HOMO eigenvalues of charged and neutral species, have recently yielded promising results for the QSPR modeling of a first principles property, the reorganization energy, in the CCS of polycyclic aromatic hydrocarbons (PAHs).113 PAHs form discotic liquid crystals which self-assemble into columnar liquid crystal structures, implying their usefulness for organic photovoltaic applications.122
In this section, we will discuss the application of ab initio statistical learning approaches to previously obtained first principles data for N compounds. Merely based on the data, QSPRs can be “learned,” and subsequently be used to avoid the cumbersome task of having to explicitly model all the underlying physical degrees of freedom of electrons and nuclei. As such, ML estimates solutions of SE for a new, that is, “unseen,” molecule B simply by evaluating an analytical expression that (explicitly or implicitly) encodes the results previously obtained for N other molecules. Obviously, all these inferred relationships are inherently limited in accuracy by the quality of the reference data used for training. At this point it should be stressed that in order to avoid statistical artefacts such as overfitting and lack of transferability, successful ML efforts should rely on (i) many-fold cross-validation, (ii) data stratification, and (iii) regularization. Active learning can also be used to remove selection bias and lead to error bars that are constant across all of the relevant variable domains—interatomic distances and nuclear charges in our case. All tests and results should be reported for out-of-sample predictions, exclusively.
Machine learning in CCS: The quantum machine meets Schrödinger
Recently, a first principles based ML approach to DFT atomization energies across CCS has been introduced.123 Unlike ordinary QSPR approaches, this ML model is free of any heuristics. It exactly encodes the supervised learning problem posed by SE: instead of solving SE by finding the wavefunction which maps the system's Hamiltonian to its energy, it directly infers the solution by mapping the system to its energy, based on the N examples given. In the limit of converged N, that is, sufficiently dense system coverage, the ML model is therefore a formally exact inductive equivalent to the deductive solution of SE achieved through the usual computational machinery of approximate wavefunctions (such as separability of nuclear and electronic wavefunctions or single Slater determinants), Hamiltonians (such as certain exchange-correlation potentials or tight-binding), and self-consistent field procedures to minimize the energy. In Ref. 123, numerical evidence is given for this idea. Specifically, for a diverse set of organic molecules, one can show that a ML model can be used instead of solving SE. After training, solutions to SE can be inferred for out-of-sample, that is, "unseen," compounds that differ in geometry, in composition, or in both. The evaluation of such an estimate comes at negligible computational cost, that is, milliseconds instead of hours on a conventional CPU, while yielding an accuracy competitive with the deductive approaches of modern electronic structure theory. As with any inductive approach, the accuracy is limited by the domain of applicability as defined by the data used for training; that is, robust results can only be expected in interpolating regimes with sufficient coverage. While the training data imposes limits on the ML model's accuracy, this arbitrariness also offers an appealing advantage: any level of theory, and even experimental data, could be used for training.
Within the Gaussian kernel model, the energy of a query molecule M is given as a sum over the N molecules in the training set, E^est(M) = Σ_i α_i exp(−d(M, M_i)²/(2σ²)). (17)
Each training molecule i contributes to the energy according to its specific weight α_i, scaled by a Gaussian in its distance to M, d(M, M_i). For given length scale σ and regularization parameter λ, the weights {α_i} are obtained by solving the regression problem, α = (K + λI)⁻¹ E^ref, where K_ij = exp(−d(M_i, M_j)²/(2σ²)) and E^ref collects the reference energies.
This regularized model limits the norm of the regression coefficients, thereby improving the transferability of the model to new compounds. All regression coefficients and hyperparameters are determined by cross-validation on stratified training sets.94, 95
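A minimal sketch of this kernel ridge regression procedure, using synthetic toy descriptors and energies rather than actual reference DFT data, may read:

```python
import numpy as np

# Kernel ridge regression sketch of Eq. (17):
#   E_est(M) = sum_i alpha_i * exp(-d(M, M_i)^2 / (2 sigma^2)),
# with alpha = (K + lam*I)^-1 E_ref. Descriptors and "energies"
# below are synthetic toy data, not reference quantum chemistry values.

def gaussian_kernel(X, Y, sigma):
    # matrix of Gaussian kernel values between all rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(50, 3))   # toy descriptors
E_train = np.sin(X_train).sum(axis=1)        # smooth toy "energies"

sigma, lam = 1.0, 1e-6
K = gaussian_kernel(X_train, X_train, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(K)), E_train)

# out-of-sample prediction for unseen queries
X_query = rng.uniform(-1, 1, size=(5, 3))
E_pred = gaussian_kernel(X_query, X_train, sigma) @ alpha
print(np.abs(E_pred - np.sin(X_query).sum(axis=1)).max())  # small error
```

In practice, σ and λ would be set by cross-validation on stratified splits, as described above.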
So far this model has been trained and validated only in its most rudimentary form, for atomization energies of a small set of interesting compounds. Specifically, molecular atomization energies at the hybrid DFT level of theory5, 124–127 have been used for training on up to 7165 molecules from the molecular generated data base (GDB)12 (see Fig. 6 for an illustration), for which mean absolute errors of less than 10 kcal/mol have been obtained. The choice of hybrid DFT is motivated by its relatively small errors (<5 kcal/mol) for thermochemistry data that include molecular atomization energies.128 Although 10 kcal/mol is still far from "chemical accuracy" (∼1 kcal/mol), more recent efforts have not only led to atomization errors of less than 3 kcal/mol,129 but also include other electronic properties, such as frontier eigenvalues, polarizability, and excitation energies.130
An appealing advantage of analytical models, independent of whether they are obtained from physical insight or statistical regression, is their amenability to analysis and interpretation. For example, otherwise ill-defined concepts in electronic structure theory, such as distance/neighborhood/similarity in CCS, can now be quantified within the "world" of the ML model. Specifically, Eq. (17) gives the energy of a query molecule M as an expansion in the compound space spanned by the reference molecules {M_i}: the regression weights are scaled by the similarity between query and reference compound as measured by a Gaussian of the distance. Hence, α_i assigns a positive or negative weight to molecule i. Within the compound space used as reference, molecules can therefore be ranked according to their α_i. We should note, however, that the α_i are merely statistical regression coefficients in a nonlinear model, that is, obtained after a nonlinear transformation of the training data; the resulting energy contributions are specific to the employed training set, without general implications for other properties or regions of compound space. The locality of the model is measured by σ, enabling the definition of a critical distance of locality, d_loc. Only if d(M, M_i) < d_loc will molecule i contribute to the energy with more than some threshold energy ε. Rearranging summands in Eq. (17) leads to d_loc = σ√(2 ln(|α_i|/ε)). For atomization energies and the chemical space considered in Ref. 123, that is, with a critical distance d_loc ≈ 400 Hartree (see top of Fig. 6), the ML results suggest that the model becomes local when σ ≈ 60 Hartree, for the average |α_i| and for ε = 1 kcal/mol. Such values are achieved when the number of molecules in the training set N exceeds 5000. In other words, for N < 5000, the model is global: all reference compounds contribute with more than 1 kcal/mol to any prediction made. See bottom of Figure 6 for the N dependence of σ and d_loc.
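The critical distance of locality follows from requiring that a single term of Eq. (17) exceed the threshold ε, that is, |α_i| exp(−d²/(2σ²)) > ε. A short sketch, with placeholder values for σ, |α_i|, and ε, reads:

```python
import math

# Critical distance of locality: molecule i contributes more than a
# threshold eps only if |alpha_i| * exp(-d^2 / (2 sigma^2)) > eps,
# i.e. d < d_loc = sigma * sqrt(2 * ln(|alpha_i| / eps)).
# The numerical values below are placeholders for illustration only.

def d_loc(sigma, alpha_abs, eps):
    return sigma * math.sqrt(2.0 * math.log(alpha_abs / eps))

# contribution of a reference molecule drops to exactly eps at d = d_loc
print(d_loc(sigma=60.0, alpha_abs=10.0, eps=1.0))
```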
Coulomb matrix descriptor
To represent compounds, a wide variety of "descriptors" is in use by statistical methods for cheminformatics and bioinformatics applications.117–121 The descriptor introduced by us123 is based solely on coordinates and nuclear charges, and is dubbed the "Coulomb matrix," M, a symmetric square matrix of atomic dimensions with elements M_II = 0.5 Z_I^2.4 and M_IJ = Z_I Z_J/|R_I − R_J| for I ≠ J.
The diagonal elements, 0.5 Z_I^2.4, correspond to a polynomial fit to free atom energies, inspired by the discussion of the "high temperature" regime above.131 The off-diagonal elements correspond to the Coulomb repulsion between atoms I and J, hence the name. For a data set containing molecules with differing numbers of atoms, the Coulomb matrices of all smaller systems are extended by zeros until they reach the dimensionality of the largest molecule in the training set. Within a multiscale sort of approach, the Coulomb matrix can easily be extended to account for large or condensed phase systems: let n be the number of atoms in the unit cell, and let N be the number of atoms in the unit cell plus a sufficiently large surrounding environment; then define M as above, except that all off-diagonal elements are set to zero for all I and J larger than n.
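A minimal construction of the zero-padded Coulomb matrix, assuming coordinates in Bohr, may look as follows:

```python
import numpy as np

# Coulomb matrix: 0.5 * Z_I**2.4 on the diagonal, Z_I*Z_J/|R_I - R_J|
# off the diagonal, zero-padded to the size of the largest molecule.

def coulomb_matrix(Z, R, n_max):
    """Z: nuclear charges (n,); R: Cartesian coordinates (n, 3) in Bohr;
    n_max: number of atoms of the largest molecule in the data set."""
    n = len(Z)
    M = np.zeros((n_max, n_max))
    for I in range(n):
        for J in range(n):
            if I == J:
                M[I, I] = 0.5 * Z[I]**2.4
            else:
                M[I, J] = Z[I] * Z[J] / np.linalg.norm(R[I] - R[J])
    return M

# H2 at ~1.4 Bohr separation, padded as if the largest molecule had 3 atoms
Z = np.array([1.0, 1.0])
R = np.array([[0.0, 0.0, 0.0], [1.4, 0.0, 0.0]])
M = coulomb_matrix(Z, R, n_max=3)
print(M)
```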
We can measure the distance between two molecules by the Euclidean norm of their diagonalized Coulomb matrices: d(M, M′) = √(Σ_I (ε_I − ε′_I)²), where ε_I are the eigenvalues of M in order of decreasing absolute value. The physical meaning of representing CCS in this way can easily be understood by considering the simplest of all molecules, homonuclear diatomics (i.e., Z_1 = Z_2 = Z and interatomic distance R). Any corresponding M is then simply defined by its two eigenvalues, the roots of its characteristic polynomial, ε± = 0.5 Z^2.4 ± Z²/R. When measuring similarity between two such diatomics with different interatomic distances, R_A and R_B, the measure of similarity reduces to d = √2 Z² |1/R_A − 1/R_B|; and the corresponding estimated potential energy curve for any new interatomic distance, R, as trained on N other interatomic distances, {R_i}, is given by E^est(R) = Σ_i α_i exp(−Z⁴(1/R − 1/R_i)²/σ²).
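The reduction of the eigenvalue distance to √2 Z² |1/R_A − 1/R_B| for homonuclear diatomics can be verified directly; the sketch below builds the 2 × 2 Coulomb matrix and compares both expressions:

```python
import numpy as np

# For a homonuclear diatomic, the Coulomb-matrix eigenvalues are
# 0.5*Z**2.4 ± Z**2/R, so the eigenvalue distance between two
# geometries reduces to sqrt(2) * Z**2 * |1/R_A - 1/R_B|.

def eig_sorted(Z, R):
    # 2x2 Coulomb matrix of a homonuclear diatomic (charge Z, distance R)
    M = np.array([[0.5 * Z**2.4, Z**2 / R],
                  [Z**2 / R, 0.5 * Z**2.4]])
    eps = np.linalg.eigvalsh(M)
    return eps[np.argsort(-np.abs(eps))]  # decreasing absolute value

Z, R_A, R_B = 1.0, 1.2, 2.0
d = np.linalg.norm(eig_sorted(Z, R_A) - eig_sorted(Z, R_B))
print(d, np.sqrt(2) * Z**2 * abs(1 / R_A - 1 / R_B))  # identical values
```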
In complete analogy, a ML model of the homonuclear dimer can also be understood analytically in terms of other homonuclear dimers with differing atomic numbers, heteronuclear dimers, or heteronuclear trimers. The ease of differentiation with respect not only to geometry (∂/∂R) but also to composition (∂/∂Z) illustrates further advantages of such a simple model.
The Coulomb matrix uniquely encodes any compound because stoichiometry as well as atomic configuration are explicitly accounted for. Even homometric molecules,132 see Figure 7, are uniquely encoded by M. Symmetrically equivalent atoms contribute equally, and the representation is rotationally and translationally invariant. To gain invariance of M with respect to the index ordering of atoms, one can either diagonalize, sort rows and columns according to their norm, or use sets of matrices with permuted rows and columns. As emphasized in a Comment by Moussa, using the eigenvalues of M yields an undercomplete representation.134 Obviously, as within any coarsened representation, the degrees of freedom represented by eigenvalues will fail to uniquely represent the full set of degrees of freedom of any nonlinear molecule with more than three atoms.134 Although sorting by the norm of rows (or columns) leads to an overcomplete, index-invariant, and unique representation, the sorted matrix is no longer differentiable at combinations of matrix entries where the sorting order changes through changes in geometry or in nuclear charges. Extending the representation by randomly permuted variants of Coulomb matrices is feasible and leads to dramatic improvement in predictive accuracy.128, 129 Encoding known invariances through such data extension has also been successful in improving the accuracy of handwritten digit recognition.136 Due to its disadvantageous scaling, however, this approach might prove problematic for larger systems. As discussed in Ref. 137, these are all crucial criteria for representing atomistic systems within statistical models.
Alternative descriptors for CCS
We shall now discuss more sophisticated alternatives to the Coulomb matrix. An intuitive extension is to assume a matrix with an interatomic potential form. This could be worthwhile as long as the incurred computational overhead is small by comparison to the method used to generate the reference data. For example, replacing the off-diagonal elements by a Lennard-Jones form, M_IJ ∝ |R_I − R_J|⁻¹² − |R_I − R_J|⁻⁶,
would correspond to the Lennard-Jones analog of the Coulomb matrix. Similarly, a Morse or Buckingham matrix could be constructed. One could even conceive of going beyond such pairwise approaches and introducing interatomic three- and higher-order terms in the form of molecular tensors. Electronic structure models can also be encoded in terms of such a representation, such as extended Hückel theory, semiempirical quantum chemistry, or tight-binding models. For example, an orbital-free Thomas–Fermi DFT representation138 is possible based on a database of frozen free atomic electron densities, {ρ_I}. The "Hartree" matrix is given by, J_IJ = ∫∫ dr dr′ ρ_I(r) ρ_J(r′)/|r − r′|, the "external" potential matrix is given by, V_IJ = −∫ dr Z_I ρ_J(r)/|r − R_I|, and the "kinetic" matrix is given, within the Thomas–Fermi approximation, by T_IJ = δ_IJ c ∫ dr ρ_I(r)^(5/3), where c is a constant, and atomic integrals are evaluated over all of space. If need be, the kinetic matrix could even be extended by the von Weizsäcker correction term, ∝ ∫ dr |∇ρ|²/ρ.138 Summation of all entries in these matrices and addition of the off-diagonal Coulomb-matrix entries would yield the corresponding DFT energy for frozen atomic electron densities. Preliminary training on atomization energies of the GDB-7 data set12 indicates that neither the Lennard-Jones nor the Thomas–Fermi matrices lead to any significant improvement in predictive accuracy compared to the original Coulomb-matrix representation of Ref. 123. A possible explanation for this surprising result is that these more sophisticated descriptors are no longer monotonic functions of interatomic distance, in contrast to the Coulomb matrix.
An alternative new descriptor, entirely consistent with the first principles view on CCS, has recently been proposed.139 Each atom I in the molecule is represented by its nuclear charge multiplied by a cosine term that contains a radial distribution function of atom I with respect to all other atoms J. Summing up the atomic contributions yields a Fourier series of atomic radial distribution functions which, because of the superposition principle, is not only unique for each compound but also invariant with respect to molecular rotations, translations, and atom indexing. Using a Gaussian radial distribution function, the descriptor reads,
where n and σ are hyperparameters that can be optimized. This descriptor has units of charge, d has units of distance, and the descriptor goes to zero beyond the largest interatomic distance. As in the case of the Coulomb matrix described above, the environment of large or condensed systems can be accounted for by choosing the number of environment atoms to be larger than the number of atoms in the unit cell. The reader is referred to the original paper for further details.139 These descriptors can be contrasted with other alternatives, for example those recently investigated by Bartók, Kondor, and Csányi.140
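As a rough illustration only, the sketch below implements a simplified Gaussian-smeared radial-distribution descriptor that shares the invariances discussed above; it is a hypothetical stand-in, not the exact Fourier-series form of Ref. 139:

```python
import math

# Hypothetical stand-in for a radial-distribution descriptor: each atom
# pair (I, J) contributes a charge-weighted Gaussian centered at the
# interatomic distance |R_I - R_J|. The result is invariant to rotation,
# translation, and atom indexing, and decays to zero beyond the largest
# interatomic distance. This is NOT the exact form of Ref. 139.

def rdf_descriptor(Z, R, d, sigma=0.2):
    g = 0.0
    n = len(Z)
    for I in range(n):
        for J in range(n):
            if I != J:
                r = math.dist(R[I], R[J])
                g += Z[I] * Z[J] * math.exp(-(d - r)**2 / (2 * sigma**2))
    return g

# N2-like dimer at ~2.07 Bohr: the descriptor peaks at the bond distance
Z = [7, 7]
R = [(0.0, 0.0, 0.0), (2.07, 0.0, 0.0)]
print(rdf_descriptor(Z, R, d=2.07))
```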
We have reviewed a notion of CCS that is consistent with any ab initio approach to atomistic simulations. Starting from an energy hierarchy, variations in nuclear charge distributions have been discussed, followed by order-parameter-based interpolation approaches and statistical learning methods. The concepts presented offer a seamless and rigorous framework to unify electronic structure theory with rational as well as combinatorial compound design efforts. This view of chemical space is advantageous for several reasons: (i) important fundamental questions can be tackled in the future, including rigorous definitions of diversity in CCS, property transferability, uncertainty, and selection bias in training sets; (ii) the transferability and applicability typical of the black-box character and accuracy of ab initio calculations can be achieved; (iii) a mathematically, physically, and chemically rigorous notion of relevant input variables enables the application of sophisticated property optimization algorithms. Ultimately, efforts along these lines promise to lead to "the right compound for the right reason," aiming to replace with systematic engineering protocols the heuristics and serendipity on which most, if not all, past compound discoveries have relied.
The author is thankful for helpful discussions with C. Anderson, K. Burke, M. Cuendet, R. A. DiStasio, Jr., F. Furche, J. R. Hammond, G. Henkelman, F. Kiraly, A. Knoll, G. Montavon, J. E. Moussa, K. R. Müller, B. C. Rinderspacher, M. Rupp, A. Tkatchenko, D. Truhlar, M. Tuckerman, A. Vazquez-Mayagoitia. The many participants of the 2011-program “Navigating Chemical Compound Space for Materials and Bio Design” at the Institute for Pure and Applied Mathematics, UCLA, are also greatly acknowledged.
Include the mass of the nuclei as an additional variable if thermal properties and nuclear quantum effects are also to be accounted for.
Prize (2011), A prize award of the equivalent of an ounce of gold was announced during the Navigating Chemical Compound Space program in spring 2011 at the Institute of Pure and Applied Mathematics, UCLA. The prize is for finding an interpolating transform of two isoelectronic Hamiltonians such that the potential energy becomes linear in the interpolating order parameter. The ounce of gold is currently held in the form of 100 shares of iShares Trust fund (NYSEARCA:IAU), and will be dispensed in cash at instantaneous exchange rate on recognition of a valid solution by a prize-committee. Apart from the author, the prize-committee consists of Profs. K. Burke, G. Henkelman, K. R. Müller, and M. E. Tuckerman. Contact the author regarding donations to increase the prize award. For more information, see http://www.alcf.anl.gov/∼anatole.
We assume the reduced mass to equal the mass of the electron.
O. Anatole von Lilienfeld will begin a Swiss National Science Foundation professorship in the department of chemistry at the University of Basel, Switzerland, in summer 2013. He has been an Assistant Computational Scientist at the Argonne Leadership Computing Facility at Argonne National Laboratory since 2011. In spring 2011 he chaired the 3-month program, "Navigating Chemical Compound Space for Materials and Bio Design", at the Institute for Pure and Applied Mathematics, UCLA. From 2007 to 2010 he was a Distinguished Harry S. Truman Fellow at Sandia National Laboratories, New Mexico. Prior to that he was a Swiss National Science Foundation postdoctoral scholar with Mark Tuckerman in the Chemistry Department of New York University. In 2005, he received his PhD in computational chemistry from EPF Lausanne under the supervision of Ursula Röthlisberger. His diploma thesis work was carried out in 2001 with Martin Quack at ETH Zürich and Nicholas Handy at the University of Cambridge, UK. Apart from the topics addressed in this tutorial, his research deals with van der Waals contributions to interatomic forces, nuclear quantum effects, defects in semiconductors, molecular dynamics, and high-performance computing.