• Open Access

First principles view on chemical compound space: Gaining rigorous atomistic control of molecular properties


  • O. Anatole von Lilienfeld

    Corresponding author
    1. Argonne Leadership Computing Facility, Argonne National Laboratory, Argonne, Illinois 60439 As of July 2013: Department of Chemistry, University of Basel, Klingelbergstr. 80, 4056 Basel, Switzerland


A well-defined notion of chemical compound space (CCS) is essential for gaining rigorous control of properties through variation of elemental composition and atomic configurations. Here, we give an introduction to an atomistic first principles perspective on CCS. First, CCS is discussed in terms of variational nuclear charges in the context of conceptual density functional and molecular grand-canonical ensemble theory. Thereafter, we revisit the notion of compound pairs, related to each other via “alchemical” interpolations involving fractional nuclear charges in the electronic Hamiltonian. We address Taylor expansions in CCS, property nonlinearity, improved predictions using reference compound pairs, and the ounce-of-gold prize challenge to linearize CCS. Finally, we turn to machine learning of analytical structure property relationships in CCS. These relationships correspond to inferred, rather than derived through variational principle, solutions of the electronic Schrödinger equation. © 2013 Wiley Periodicals, Inc.


In analogy to the vastness and sparseness of outer space, we can loosely refer to the space of chemical systems as chemical compound space (CCS), some continuous observable space that is populated by all experimentally and theoretically possible chemicals with natural nuclear charges and real interatomic distances for which chemical interactions occur.1 Stated more precisely, CCS refers to the combinatorial set of all compounds that can be isolated and constructed from possible combinations and configurations of equation image atoms and equation image electrons in real space. In the absence of external fields and for given equation image and equation image atom-types equation image and spatial configurations equation image, not only covalent, ionic, and metallic bonding result, but also the much weaker hydrogen and van der Waals (vdW) bonding, responsible for the physics and chemistry of molecular crystals, liquids, and other supramolecular aggregates. Most research efforts in this first principles context are concerned with approximations and methods necessary for making property predictions for given compounds. By contrast, the focus of this tutorial is a first principles view on the compounds per se.

Notwithstanding chemical bonding or conformations and merely considering the number of possible stoichiometries, it is obvious that the size of CCS is unfathomably large for all but the smallest systems. Due to all the possible combinations of assembling many and various atoms, its size scales exponentially with compound size as equation image. Here equation image is the number of possible atom types, that is, the maximal permissible nuclear charge in Mendeleev's table ( equation image), and equation image depends on the employed definition of “isolated system” but can certainly reach the scale of Avogadro's number for living organisms, chunks of unordered matter, or planets. Although many such speculative compounds are likely to be unstable, the state of affairs worsens dramatically when accounting for the additional degrees of freedom which arise from distinguishable geometries due to differences in atom bonding or conformations. This combinatorial explosion with system size is the main motivation for advocating an ab initio, or first principles, view on CCS, namely a view that restricts us to using solely equation image and equation image as input variables* and, while maybe not free of empirical parameters, will not change in its parameterization as equation image and equation image are freely varied.2 A major part of modern electronic structure theory and interatomic potential work is concerned with the development of improved methods and approximations for solving Schrödinger's equation (SE) within the Born–Oppenheimer approximation for systems relevant to materials, biological, or chemical research, and deriving properties thereof.3 Ab initio statistical mechanics efforts are dedicated to sampling the corresponding equation image degrees of freedom from first principles.4 In the context of CCS, the electronic Hamiltonian H for solving SE, equation image, of any compound with a given charge, equation image, is uniquely determined by its (unperturbed) external potential, equation image, that is, by its set equation image. Here, equation image is the total number of protons in the system, the sum over all nuclear charges. Due to the Hohenberg–Kohn theorem, we also know that the electron density equation image, and all electronic properties derived thereof, are determined by equation image, up to a constant, equation image.5 Consequently, we can work directly with equation image.

In this tutorial, CCS is first briefly illustrated in terms of a rough energy scale in section Energy Hierarchy. In section Molecular Grand-Canonical Ensemble, we will review the notion of a molecular grand-canonical ensemble density functional theory (DFT) that accounts for fractional electrons and nuclear charges. Section Compound Pairs will deal with pairs of chemical compounds, and with efforts to exploit the arbitrariness of interpolating functions. It also details the challenge associated with a prize award of one ounce of gold. Finally, we will discuss in section Statistical Methods recent efforts to use intelligent data analysis methods [machine learning (ML)] to systematically infer analytical structure–property relationships from previously calculated electronic structure data sets.

Energy Hierarchy

Considering CCS, it is useful to think of a variable system that is comparable to Mendeleev's table of the elements. Compounds, however, have many more dimensions than a single atom's nuclear charge, specifically equation image ( equation image if linear). One way of thinking about CCS is in terms of an abstract Gedankenexperiment involving all theoretically existing compounds. For a set of protons, subject to a varying amount of kinetic energy (or temperature), provided by a thermostat, regimes emerge with various familiar degrees of freedom. In such a “phase diagram” of CCS, these various regimes correspond to

  • i. Stoichiometrical isomers: “very high” temperature regime: Let us assume such high temperatures that all bonds break, and that all spatial degrees of freedom can safely be neglected. Furthermore, we assume isomers to have the same number of elementary particles, equation image and equation image. How many of such stoichiometrical isomers could be observed populating up to Np sites with at least one proton? Mathematically speaking, this is a discrete number theory problem: this number is the number of integer partitions of Np, that is, the number of ways to write Np as a sum of positive integers. For example, CH4, NH3, H2O, HF, and Ne represent only five out of all the 42 possible stoichiometrical isomers for Np = 10. The total number of possible partitions corresponds to the partition function, which increases exponentially with Np. The exponential increase and an illustration of the emerging stoichiometries according to Young-Ferrers diagrams are shown in Figure 1. These degrees of freedom are rarely explored in nature except when it comes to radioactive decay, nuclear fusion, or nuclear synthesis in the early stages of our universe. Through fictitious interpolation, however, we can meaningfully render this space continuous, as illustrated using DFT for the potential energies displayed in the inset of Figure 1.
  • ii. Constitutional isomers: “high” temperature regime: At high temperatures, only strong chemical bonding (covalent, ionic, metallic) survives. Corresponding Lewis structures enumerate many (but not all) of the possible constitutional isomers distinguishable as possible topologies, or molecular graphs, that can be constructed. The enumeration (and canonization) of all possible constitutional isomers has been the focus of long-standing graph-theoretical efforts.8–11 The exponential scaling of their number is also evident for the recently published exhaustive list of small organic molecules.12 This is the regime in which isomerism occurs through conventional “chemistry,” that is, chemical reactions that lead from one constitutional isomer to another, usually under the influence of pressure, temperature, light, or some catalytic agent. We can model this and the subsequent regimes iii and iv using ab initio molecular dynamics methods.4 Universal or reactive force-fields can be used to accomplish similar sampling.13, 14
  • iii. Conformational isomers: “ambient” temperature regime: Folding and unfolding events, sampling of intramolecular degrees of freedom, for example, along dihedral angles, and similar processes take place at “ambient” temperatures. These isomers are typically sampled using force fields that assume fixed molecular topologies and parameterized charges, dihedral and angular terms, in addition to the typical potentials used for the chemical bond, such as harmonic, Buckingham's, or Morse potentials.
  • iv. Weakly interacting systems: “low” temperature regime: Supramolecular assemblies and soft aggregates condense to molecular liquids or solids. Biological systems, such as membranes, or even living cells and organisms, also fall in this regime. Such van der Waals dominated systems are typically modeled using classical simulation with effective Lennard-Jones-type potentials.
Figure 1.

Exponential scaling of the total number of all possible N partitions, that is, stoichiometries, as a function of Np [Regime (i) in section Energy Hierarchy]. Inset upper left-hand side: Young-Ferrers diagrams illustrating the possible partitions (stoichiometries) for Np as a function of number of atoms, Ni. The color code corresponds to the total number of protons in the compound, Np = 1 (black), 2 (red), 3 (green), 4 (blue), and so on. Inset lower right-hand side: total potential energy of relaxed molecules as a function of Np using interpolated pseudopotentials in analogy to Figure 2 (all systems neutral, BLYP DFT level of theory,6, 7 arbitrary energy origin due to use of pseudopotentials). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
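The counting argument of regime (i) can be reproduced in a few lines. The following is a minimal sketch, not the procedure used to generate Figure 1: the function names count_partitions and stoichiometries are hypothetical, and each stoichiometry is represented simply as a non-increasing tuple of nuclear charges that sum to the total number of protons.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_partitions(n_protons: int, largest: int) -> int:
    """Count partitions of n_protons into positive integers no larger than `largest`."""
    if n_protons == 0:
        return 1  # exactly one way: the empty sum
    if n_protons < 0 or largest == 0:
        return 0
    # Either use one part equal to `largest`, or forbid parts of that size.
    return (count_partitions(n_protons - largest, largest)
            + count_partitions(n_protons, largest - 1))


def stoichiometries(n_protons: int, largest=None):
    """Yield every partition as a non-increasing tuple of nuclear charges."""
    if largest is None:
        largest = n_protons
    if n_protons == 0:
        yield ()
        return
    for z in range(min(n_protons, largest), 0, -1):
        for rest in stoichiometries(n_protons - z, z):
            yield (z,) + rest


isomers = list(stoichiometries(10))

# Group by number of atoms, in the spirit of the Young-Ferrers inset.
by_atoms = {}
for charges in isomers:
    by_atoms[len(charges)] = by_atoms.get(len(charges), 0) + 1

print(count_partitions(10, 10), len(isomers))
print((6, 1, 1, 1, 1) in isomers, (10,) in isomers)  # CH4 and Ne
```

In this representation CH4 is (6, 1, 1, 1, 1), NH3 is (7, 1, 1, 1), H2O is (8, 1, 1), HF is (9, 1), and Ne is (10,). The enumeration ignores whether a given stoichiometry is chemically stable, as does the counting in the text.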

In this review, we will discuss recent contributions that are consistent with all of the four regimes, accounting for all the spatial and elemental degrees of freedom equation image and equation image. We will ignore the electronic number as an independent variable because equation image for most if not all possible scenarios. Furthermore, all discussions are based on the Born-Oppenheimer approximation, and assume that nuclear positions are well defined.

Molecular Grand-Canonical Ensemble


Much of conceptual DFT concerns the energy response to infinitesimal variations in number of electrons equation image, or external potential, equation image.15, 16 Although such variations are very important for interpreting orbitals, deriving reactivity indices, and even for redox processes, the diversity (and combinatorial scaling) of CCS is due to variations in nuclear charge distribution rather than to variations in equation image. Consequently, for the following, we will mostly be concerned with changes in nuclear charges. To offer a rigorous framework for explicit changes in equation image, molecular grand-canonical ensemble DFT was introduced,17 relying to a significant degree on preceding work.18, 19 Only a brief summary is given here; for more details, the reader is referred to the original contributions.

Assuming a classical nuclear charge distribution, equation image, we can introduce an auxiliary grand-canonical variational energy functional for the aforementioned fictitious “very high” temperature regime (i),

equation image(1)

where equation image correspond to the usual total potential energy functional, the electron charge density, global electronic, and nuclear chemical potentials, respectively. For high temperatures, entropy will prevail and the system would dissociate into H equation image (if biased to natural equation image). For lower temperatures, the potential energy will dominate the free energy, and the energy of a single atom ( equation image) will dominate over the energy of many individual atoms that sum up to the same number of protons, equation image. Hence, the nuclear charge distribution would collapse onto a single site. For this we obviously assume the classical and fictitious self-repulsion of protons occupying the same nuclear site to be switched off.

For the “lower temperature” regimes (ii)–(iv), the following energy functional is more meaningful,

equation image(2)

where equation image corresponds to a spatially resolved nuclear point charge distribution. The nuclear chemical potential equation image is now a locally defined Lagrange multiplier. Using an external potential that excludes the aforementioned intranuclear self-repulsion of protons (here through use of an error function that switches off the divergence at nuclear sites), we find from the corresponding Euler equation,

equation image(3)

—the electrostatic potential of the system. As such, starting with Q—a Legendre transformed energy functional of intensive properties equation image and equation image—one can derive the Gibbs-Duhem equation analogue for electrons and protons,

equation image(4)

and obtain relationships between electronic hardness equation image, molecular Fukui function equation image,20 and nuclear hardness kernel, equation image.17 While the nuclear chemical potential is defined everywhere, its value at an atomic position has special meaning: It quantifies the system's first-order energy response to a fractional change of the atom's nuclear charge. Consequently, we dub equation image the molecular “alchemical potential” of atom I.19

Interpolating pseudopotentials

Ignoring potential applications to radioactive and nuclear processes, alchemical interpolations obviously do not describe reality. They offer, however, a rigorous mathematical way to render CCS continuous. Alchemical changes and potentials involving fractional nuclear charges are commonly used for two, often related, purposes: either for the evaluation of free energy differences between different compounds, for example, using thermodynamic integration,21 equation image; or for obtaining a set of gradients with dimension of equation image indicating the response of the system to a variation in nuclear charge on every site.19, 22 In practice, we can calculate such changes through interpolation of nuclear charges in any basis set that is converged for all values of an interpolating order parameter, equation image. For plane-wave pseudopotential implementations, the same can be accomplished by interpolation of pseudopotentials that replace the explicit treatment of the core electrons.23–28 The use of a plane-wave basis set is advantageous because it is independent of atomic position and type, and will not introduce Pulay forces.29 The manipulation of pseudopotentials for affecting electronic structure properties is nothing out of the ordinary. It has successfully been deployed for an array of properties including relativistic effects,30 self-interaction corrections,31, 32 exact-exchange and quantum mechanical/molecular mechanical (QM/MM) boundary effects,33, 34 van-der-Waals interactions,35, 36 and widening the band gap.37, 38, 39 For fractional nuclear charges, we can interpolate pseudopotentials, and evaluate properties as a function of order parameter, equation image. An interpolation of pseudopotential parameters as a function of nuclear charge is shown in Figure 2. Calculated properties as a function of such alchemical changes are illustrated in Figures 1 and 3 for total potential energies, and protonation energies and polarizabilities, respectively. 
Note that the former application is not physical because of the arbitrary energy offset of pseudopotentials. This, however, is inconsequential, because most of chemistry deals with energy differences, and differences thereof. As shown in Figure 3 for HCl→NH3, the use of pseudopotentials for alchemical changes can be particularly advantageous when it comes to transmuting elements from different rows of the periodic table while keeping constant the total number of valence electrons.

Figure 2.

Interpolation of Goedecker–Hutter pseudopotential parameters for BLYP-based DFT calculations.40, 41 Parameters, shown as a function of nuclear charge, become polynomial regressions of third degree in Z for intervals of Z connected such that the sum is continuous and differentiable everywhere.
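To illustrate what a continuous and differentiable piecewise cubic interpolation in Z can look like, the sketch below uses cubic Hermite segments between integer nuclear charges. This is an assumption made for illustration: the caption speaks of third-degree polynomial regressions, and the actual fitting procedure may differ. The parameter values are invented and do not correspond to any published pseudopotential.

```python
def hermite_interpolate(z_knots, values, z):
    """Piecewise cubic Hermite interpolation with finite-difference slopes.

    The result is continuous with a continuous first derivative at every knot.
    """
    n = len(z_knots)
    if not z_knots[0] <= z <= z_knots[-1]:
        raise ValueError("z lies outside the tabulated range")
    # Slope at each knot: central difference inside, one-sided at the ends.
    slopes = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        slopes.append((values[hi] - values[lo]) / (z_knots[hi] - z_knots[lo]))
    # Locate the interval [z_knots[i], z_knots[i + 1]] that contains z.
    i = 0
    while i < n - 2 and z > z_knots[i + 1]:
        i += 1
    h = z_knots[i + 1] - z_knots[i]
    t = (z - z_knots[i]) / h
    # Standard cubic Hermite basis functions.
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return (h00 * values[i] + h10 * h * slopes[i]
            + h01 * values[i + 1] + h11 * h * slopes[i + 1])


# Hypothetical values of one pseudopotential parameter for Z = 5 ... 9.
Z_KNOTS = [5.0, 6.0, 7.0, 8.0, 9.0]
PARAM = [0.43, 0.35, 0.29, 0.25, 0.22]

# Parameter for a fractional nuclear charge between carbon and nitrogen.
print(hermite_interpolate(Z_KNOTS, PARAM, 6.5))
```

At integer Z the interpolant returns the tabulated value, so the physical elements are recovered exactly, while fractional Z gives a smooth path between them.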

Figure 3.

Protonation energy, equation image, and static polarizability equation image of neutral species, as a function of order parameter equation image driving X = HCl into X = NH3. The inset shows the derivative of the protonation energy, once evaluated analytically according to Hellmann–Feynman in Eq. (8), and once from a quadratic fit to the protonation energy. Heavy atoms of the two end-point molecules were superimposed, their effective nuclear charge being ( equation image)7+ equation image 5, and hydrogens were placed in the xy-plane. The H+ was placed in z, 1 Å above the heavy atom. All values were calculated with the PBE functional and linearly interpolated analytical pseudopotentials.30 [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Free energy applications

Fractional charges were used to calculate the free energy of solvation of ions in water.42 Sulpizi and Sprik43 rigorously explored the need for fractional nuclear charge calculations to predict pKa's of various organic and inorganic acids and bases. In the case of free energy differences, fractional charges can also be avoided altogether within a simple alternative and elegant interpolation scheme put forth by Alfè et al.44: atomic forces are evaluated at both endpoints ( equation image), and equation image dependent molecular dynamics trajectories are generated for atoms being propagated according to a linear combination of these forces using instantaneous equation image values as weights. This is to be compared to a trajectory that uses Hellmann–Feynman forces directly evaluated on the interpolated alchemical species, such as for the zero-temperature limit of relaxing the geometry of a reaction barrier.45 The limitations of Alfè's procedure are that (a) one requires twice as many self-consistent field calculations, namely for both end-points instead of a single one when using an alchemical interpolation (assuming of course that for both approaches the free energy integrand equation image varies similarly with equation image); and (b) that the number of atoms must be kept constant during the interpolation, significantly restricting the possible number of stoichiometries that can be explored. Both of these disadvantages can be avoided within the compound pair scheme discussed below.
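The logic of mixing end-point forces and integrating over the order parameter can be illustrated on a model for which the answer is known in closed form. The sketch below is a toy under stated assumptions, not the scheme of Ref. 44: the two end points are classical one-dimensional harmonic springs with hypothetical force constants k_a and k_b, the order parameter is called lam, and the molecular dynamics average at each lam is replaced by its analytical equipartition value.

```python
import math


def mixed_force(x, lam, k_a, k_b):
    """Linear combination of the two end-point forces, weighted by lam."""
    force_a = -k_a * x
    force_b = -k_b * x
    return (1.0 - lam) * force_a + lam * force_b


def ti_integrand(lam, k_a, k_b, kT):
    """Ensemble average <U_B - U_A> sampled with the mixed forces.

    For the mixed spring constant k_lam, equipartition gives <x^2> = kT / k_lam,
    which stands in here for an average over a trajectory.
    """
    k_lam = (1.0 - lam) * k_a + lam * k_b
    mean_x2 = kT / k_lam
    return 0.5 * (k_b - k_a) * mean_x2


def simpson(f, a, b, n=200):
    """Composite Simpson rule with an even number n of sub-intervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


k_a, k_b, kT = 1.0, 4.0, 1.0
delta_f_ti = simpson(lambda lam: ti_integrand(lam, k_a, k_b, kT), 0.0, 1.0)
delta_f_exact = 0.5 * kT * math.log(k_b / k_a)  # classical harmonic result
print(delta_f_ti, delta_f_exact)
```

Both end-point forces are needed at every step, which mirrors limitation (a) above, and the same coordinate x enters both springs, which mirrors the fixed number of atoms in limitation (b).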

Design applications

Through use of a Taylor expansion, truncated after first order, we can also exploit the vector of alchemical potentials for gaining control over properties in the equation image-dimensional space of all the nuclei that are adjacent in the periodic table. Weigend et al.46 were probably the first to propose such an application in the context of predicting stability in binary atom clusters. Independently, it was applied to drug binding energies within a QM/MM study,19 demonstrated for hydrogen-bonded complexes, interconverting methane, ammonia, water, and hydrogen fluoride while bound to formic acid,22 and applied to the molecular Fukui function for tuning highest occupied molecular orbital (HOMO) eigenvalues of boron/nitride derivatives of benzene.20 In Ref. 43, this notion is exploited for the prediction of reaction barriers as well as oxygen adsorption energies on Pd equation image derived core-shell metal nanoclusters that are catalyst candidates for the oxygen reduction reaction. An application to the design of Ni-based nano-cluster alloys with oxygen adsorption energies corresponding to the target value of 1.65 eV for maximal turn-over-frequency has subsequently been carried out.47 The molecular Fukui function, in particular when evaluated at the position of the atom, was also discussed more recently by Cardenas et al.48 Clearly, truncation of the Taylor expansion at second or higher order would be desirable to increase the accuracy of the alchemical predictions of the effect of atomic transmutations. Alas, higher order derivatives of the energy with respect to nuclear charges lead to computational overhead because they require the calculation of the perturbed electronic structure. Nevertheless, based on coupled perturbed self-consistent field theory, the improved accuracy of higher order predictions has been demonstrated very recently.49
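In its simplest form, the first-order design strategy amounts to a dot product between the vector of alchemical potentials and a vector of nuclear charge changes. The sketch below is only meant to make that bookkeeping explicit: the function name, the reference energy, and the alchemical potentials are invented for illustration, and no electronic structure calculation is performed.

```python
def first_order_energy(e_ref, alchemical_potentials, delta_z):
    """First-order Taylor estimate of the energy after changing nuclear charges.

    e_ref                  energy of the reference compound
    alchemical_potentials  dE/dZ_I for each atom I of the reference compound
    delta_z                proposed change of nuclear charge on each atom
    """
    if len(alchemical_potentials) != len(delta_z):
        raise ValueError("one charge change is needed per atom")
    return e_ref + sum(mu * dz for mu, dz in zip(alchemical_potentials, delta_z))


# Hypothetical reference data for a three-atom system (arbitrary units).
E_REF = -10.0
MU = (-0.5, -0.2, -0.8)

# Candidate transmutations that conserve the total nuclear charge.
candidates = {
    "atom0 +1, atom1 -1": (1, -1, 0),
    "atom1 +1, atom2 -1": (0, 1, -1),
    "atom0 +1, atom2 -1": (1, 0, -1),
}

predicted = {name: first_order_energy(E_REF, MU, dz) for name, dz in candidates.items()}
best = min(predicted, key=predicted.get)
print(best, predicted[best])
```

All candidates are ranked from a single reference calculation, which is the appeal of the approach; the price is the first-order truncation discussed in the text.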

Other applications

Instead of varying the proton distribution equation image in a compound, we can interpolate the external potential equation image just as well. Albeit mixing up spatial and compositional degrees of freedom, this is more in line with conventional thought in quantum chemistry and conceptual DFT15 where the nuclear charge distribution is hardly ever mentioned explicitly. The route via the external potential has been pursued within the Ansatz of linear combination of atomic potentials in the research groups of Yang and Beratan,50 assigning atom-type specific weights to every atom site. Using simplified Hamiltonians, impressive results were obtained for the control of molecular hyperpolarizabilities,51–55 furthering long-standing molecular design efforts for electronic properties well ahead of their time.56, 57 This approach has also been combined with genetic algorithms for the purpose of crystal structure design using DFT.58 The functional second-order derivatives with respect to external potentials have been published in Ref. 59. Analytical expressions for second-order derivatives and linear response functions have very recently been proposed by Yang et al.60 The same authors also derived important constraints for the electronic structure that must be met by the exact exchange-correlation functional. In analogy to using constraints obtained for variable equation image, such as piece-wise linear behavior and derivative discontinuities, to design improved density functionals,61–63 A. Cohen's current efforts are dedicated to variations in the external potentials that include fractional nuclear charges. The electronic structure for systems with equation image has also been explored by Constantin et al.63

Compound Pairs


The above discussed Ansatz, variational in a fractional nuclear charge distribution, defines an appealing, fully spatially resolved, index, that is, a way to probe the sensitivity of a compound not only toward changes in any of its composing atoms but also with respect to adding new protons. However, for two reasons, this approach is limited.

First, severe constraints and preconceived insights are required to explore the equation image-dimensional space of all equation image. If equation image is continuous, a bias potential toward integer numbers is required, possibly using a fictitious temperature, in some analogy to the Fermi function for electrons. If, instead, equation image is a combination of various atom-types, in line with the aforementioned linear combination of atomic potentials approach,50 the weight of one nuclear charge has to dominate so that it can safely be increased to 1, while decreasing all others to zero. Furthermore, constraints due to overall charge conservation, and electronic structure, have to be taken into account. For example, consider an alchemical transmutation of H2N-OH into its isoelectronic stoichiometrical isomer hydrogen peroxide, HO-OH, through simultaneously and continuously decreasing and increasing by one the nuclear charge of a hydrogen and the nitrogen atom, respectively. At some point of this conversion, the ground state surface will turn into a triplet surface, therefore requiring the consideration of both spin states along the interpolation path.

Second, and more importantly, to carry out alchemical changes along columns in the periodic table, a path following Z would have to fill up the shell and go through the entire period before one arrives at the desired elements. This implies significant variations in electronic configurations just to arrive at a target compound with a configuration likely to be very similar to the starting compound. For example, consider a system of eight valence electrons, and Ne and Ar as starting and target compounds, respectively. Then, an isoelectronic path progressing with Z of the central atom, and saturating with hydrogens accordingly, would have to proceed through the following series of compounds, NaH7, MgH6, AlH5, SiH4, PH3, H2S, and HCl, some of which are not even likely to be covalently bound. Hence, although Taylor expansions in Z are quite predictive for adjacent elements, as mentioned in the preceding section, it is not surprising that their predictive power decays dramatically when it comes to predictions for changes up and down the columns in the periodic table. Obviously, matters will only become worse when d- or f-elements are to be included, or when trying to make predictions by two or more rows down or upward.


Albeit intuitive, the use of nuclear charges as interpolating variable is fortunately not mandatory. Instead, we can also use a generalized, and entirely arbitrary, interpolation procedure between any two compounds. As long as it is reversible and integrable, any path can be used to monitor any property that is a state function.4 In all of the following, we will only consider interpolations between isoelectronic compounds, that is, compounds with the same equation image in their Hamiltonian. As mentioned before, this is only a minor restriction because the diversity of CCS is due to differences in nuclear charge distribution rather than to differences in equation image. For example, we can linearly interpolate the Hamiltonians of any two isoelectronic compounds, A and B,

equation image(5)

in order parameter equation image. equation image and equation image denote the initial and final electronic Hamiltonian of the two compounds, obeying the corresponding boundary conditions equation image and equation image, respectively. For any isoelectronic Hamiltonian linear in equation image, the potential energy is not necessarily linear. In fact, the electronic potential energy of a linearly interpolated Hamiltonian is likely to be concave, or, more precisely, equal to or larger than the straight line connecting the energies of compounds A and B, equation image. This inequality follows from the variational principle and can easily be shown: Eq. (5) implies,

equation image(6)

where equation image now correspond to the usual quantum mechanical Bra-Ket notation, denoting the expectation value with the wavefunction, or the density functional (in an orbital-free exact DFT world), evaluated for the Hamiltonian at equation image, that is, equation image. equation image and equation image denote the energies of compound A or B evaluated using the wave functions (or density in the case of orbital free DFT) obtained at equation image. Note that equation image, and equation image. Subtracting equation image and regrouping yields,

equation image(7)

where the prefactors of the energy differences equation image 0 by definition, and where equation image and equation image because of the variational principle. Consequently, analogous inequalities will hold for any property for which there is a variational principle, for example, also for the polarizability due to Pearson's maximum hardness principle.65 The corresponding concavity is on display for the static polarizability when fractionally transmuting a hydrogen chloride molecule into ammonia (Fig. 3). Similarly, potential energy inequalities between different molecules were proposed by Mezey in the 1980s.66
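The inequality can be checked numerically on the smallest possible model. In the sketch below, two arbitrary real symmetric 2×2 matrices stand in for the end-point Hamiltonians of A and B; their entries, the names H_A and H_B, and the symbol lam for the order parameter are assumptions made for illustration. The ground-state energy of the linearly interpolated matrix is compared with the straight line connecting the two end-point energies.

```python
import math


def ground_state_energy(h):
    """Lowest eigenvalue of a real symmetric 2x2 matrix ((a, b), (b, d))."""
    (a, b), (_, d) = h
    return 0.5 * (a + d) - math.hypot(0.5 * (a - d), b)


def interpolate(h_a, h_b, lam):
    """Linear interpolation (1 - lam) * H_A + lam * H_B, element by element."""
    return tuple(
        tuple((1.0 - lam) * x + lam * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(h_a, h_b)
    )


# Arbitrary model Hamiltonians (hypothetical numbers, arbitrary units).
H_A = ((-1.0, 0.3), (0.3, 0.5))
H_B = ((-2.0, -0.8), (-0.8, 0.1))

e_a = ground_state_energy(H_A)
e_b = ground_state_energy(H_B)

excess = []
for step in range(11):
    lam = step / 10
    e_lam = ground_state_energy(interpolate(H_A, H_B, lam))
    chord = (1.0 - lam) * e_a + lam * e_b
    excess.append(e_lam - chord)  # non-negative if the inequality holds

print(min(excess), max(excess))
```

The excess vanishes at both end points and is positive in between, as expected from the variational argument: the ground state at intermediate lam is a trial state for both end-point Hamiltonians and can therefore only give energies at or above their true ground-state energies.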

Analytical first-order derivatives of the energy as a function of any isoelectronic change in the Hamiltonian can easily be calculated using the Hellmann–Feynman (HF) theorem,67 as proposed, and demonstrated for HOMO eigenvalues, in Ref.66, equation image. For a linearly interpolating Hamiltonian, such as in Eq. (5), this leads to,

equation image(8)

The protonation energy, and its derivative, also feature in Figure 3 for the transmutational change, HCl→NH3. As mentioned before, the use of pseudopotentials/valence electron densities fortunately renders straightforward the evaluation of the HF derivative according to Eq. (8) even for compound pairs that involve elements from differing rows in the periodic table.
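For a linearly interpolated Hamiltonian, the HF derivative is the expectation value of the difference between the two end-point Hamiltonians, taken with the ground state at the current value of the order parameter. The following sketch verifies this against a central finite difference on a 2×2 model; the matrices, the names H_A and H_B, and the symbol lam are again assumptions made for illustration.

```python
import math


def ground_state(h):
    """Lowest eigenvalue and normalized eigenvector of a symmetric 2x2 matrix."""
    (a, b), (_, d) = h
    energy = 0.5 * (a + d) - math.hypot(0.5 * (a - d), b)
    if b != 0.0:
        vec = (b, energy - a)  # solves (h - energy) v = 0
    else:
        vec = (1.0, 0.0) if a <= d else (0.0, 1.0)
    norm = math.hypot(vec[0], vec[1])
    return energy, (vec[0] / norm, vec[1] / norm)


def combine(h_a, h_b, w_a, w_b):
    """Element-wise w_a * H_A + w_b * H_B."""
    return tuple(
        tuple(w_a * x + w_b * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(h_a, h_b)
    )


def expectation(op, vec):
    """<vec| op |vec> for a 2x2 operator and a normalized 2-vector."""
    return sum(vec[i] * op[i][j] * vec[j] for i in range(2) for j in range(2))


H_A = ((-1.0, 0.3), (0.3, 0.5))
H_B = ((-2.0, -0.8), (-0.8, 0.1))
lam = 0.3

# Hellmann-Feynman: dE/dlam = <psi(lam)| H_B - H_A |psi(lam)>
_, psi = ground_state(combine(H_A, H_B, 1.0 - lam, lam))
hf_derivative = expectation(combine(H_A, H_B, -1.0, 1.0), psi)

# Central finite difference of the ground-state energy for comparison.
step = 1e-5
e_plus, _ = ground_state(combine(H_A, H_B, 1.0 - (lam + step), lam + step))
e_minus, _ = ground_state(combine(H_A, H_B, 1.0 - (lam - step), lam - step))
fd_derivative = (e_plus - e_minus) / (2.0 * step)

print(hf_derivative, fd_derivative)
```

Only the ground state at one value of lam is needed for the HF derivative, whereas the finite difference requires two additional diagonalizations.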

Thermodynamic integration of equation image over equation image yields free energy differences. In the case of compound design, the approach is slightly different: we would like to Taylor expand the energy of a new compound B in terms of a reference compound A and its derivatives,

equation image(9)

HOT standing for higher order terms, and equation image. Unfortunately, when making predictions with a linearly interpolated Hamiltonian, the first-order derivative term according to Eq. (8) is not necessarily predictive.68 While the inclusion of higher order derivatives in Eq. (9) is likely to improve the prediction, as found for statistical mechanical averages,69 it also requires the evaluation of the perturbed wavefunction, for example, through the use of linear response theory,33, 70 thereby defeating the original purpose of predicting a new compound's energy without having to solve for its wave function. It is a coincidence that for some relative energies, such as the protonation energy shown in Fig. 3, the higher order effects mostly cancel, thereby rendering the first-order HF derivative quite predictive.59, 60

To improve the predictive power of the first-order term in Eq. (9), an empirical correction has been introduced that “linearizes” the energy through a global yet nonlinear Hamiltonian, equation image.68 If we assume equation image to be a second-order polynomial in equation image, two coefficients are determined by the boundary conditions that equation image, and equation image, leaving one additional degree of freedom. We can obtain the third degree of freedom as a parameter from an arbitrary second isoelectronic compound pair, CD, such that the energy becomes linear in equation image. The resulting expansion up to first order in Eq. (9) then becomes,

equation image(10)

Here, equation image is the ratio between the energy difference and the HF derivative of the additional reference compound pair, C and D (its Hamiltonian being linearly interpolated). equation image is determined according to Eq. (8) as equation image. This bears resemblance to a long tradition in physical chemistry, namely the use of reference compounds for electrode potentials or enthalpies.
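Reading the description of Eq. (10) literally, the corrected estimate scales the HF derivative of the pair AB by the ratio of energy difference to HF derivative obtained for the reference pair CD. The sketch below spells out this arithmetic next to the uncorrected first-order estimate of Eq. (9). The function names and all numbers are hypothetical; in practice the energies and derivatives would come from electronic structure calculations.

```python
def first_order_prediction(e_a, hf_derivative_ab):
    """Uncorrected estimate of E_B: first-order Taylor term with unit step."""
    return e_a + hf_derivative_ab


def reference_coefficient(e_c, e_d, hf_derivative_cd):
    """Ratio of the true energy difference to the HF derivative for pair CD."""
    return (e_d - e_c) / hf_derivative_cd


def corrected_prediction(e_a, hf_derivative_ab, coefficient_cd):
    """Estimate of E_B with the HF derivative rescaled by the reference ratio."""
    return e_a + coefficient_cd * hf_derivative_ab


# Invented numbers (arbitrary units) for a target pair and a reference pair.
e_a, e_b_true, hf_ab = -1.00, -2.20, -0.90
e_c, e_d, hf_cd = -1.50, -2.80, -1.00

c_cd = reference_coefficient(e_c, e_d, hf_cd)
print(first_order_prediction(e_a, hf_ab))      # uncorrected estimate
print(corrected_prediction(e_a, hf_ab, c_cd))  # estimate using pair CD

# Consistency check: using AB as its own reference reproduces E_B exactly.
c_ab = reference_coefficient(e_a, e_b_true, hf_ab)
self_consistent = corrected_prediction(e_a, hf_ab, c_ab)
```

How well a coefficient transfers from one pair to another is exactly the open question raised in the text; the self-reference check only confirms that the formula is exact when the ratio is the right one.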

The idea to use alternative, nonlinear, interpolations is not new within molecular mechanics. In the context of electronic structure theory, nonlinear alchemical paths were also explored for chemical binding,71 and nuclear quantum effects.72 Various open questions deserve further investigation, such as transferability and choice of reference coefficients, isoelectronic changes using valence electrons only versus all electron description, nonisoelectronic changes, necessary accuracy when providing the input of target compound B, that is, also its geometry, ionic forces of B, and so forth. The answers are likely to depend on systems and properties.

Control of ligand binding

In this section, we exemplify the use of the reference coefficients [Eq. (10)] for increasing the predictive power of the HF derivatives of linearly interpolated Hamiltonians. We refer to state-of-the-art van der Waals corrected DFT35, 73 to accurately estimate interaction energies with binding targets across CCS. We consider a small yet illustrative set of mutants of the ellipticine molecule. Ellipticine is a naturally occurring anticancer drug with various binding targets. As also illustrated in Figure 4, its dominant mode of binding to DNA is intercalation. Structural data as well as studies on drug analogues are readily available.74, 75 We will probe the versatility of the linearizing scheme for controlling ellipticine-derivatives/DNA binding, isolated in gas phase and with fixed geometry.76 Clearly, for the eventual control of ligand binding, the property of interest is not the potential energy of interaction but rather the free energies of binding: solvation or entropic contributions can be crucial, as is well known in general,77 and in the particular case of ellipticine.78 For example, Tidor,79 and Oostenbrink and van Gunsteren80, 81 have carried out similar work in the sense of interpolating ligand candidates, by calculating free energies of binding, and using molecular force fields. For this review, however, we will limit the discussion to the potential energy of interaction. Future work will deal with the inclusion of thermal and solvent effects for instance using ab initio molecular dynamics techniques82 in conjunction with QM/MM83 calculations. Moreover, even at the mere potential energy electronic structure level of theory, the accurate quantification and control of intercalated ellipticine derivatives is challenging: vdW forces dominate the binding. Recent studies have already explored the binding of ellipticine and how its vdW forces can be accounted for at the employed electronic structure level.84–86

Figure 4.

TOP: cluster model of drug intercalated in between two Watson-Crick base-pairs connected by sugar puckers and phosphate groups. BOTTOM: neutral “wild-type” ellipticine, R equation image denote sites of groups permitted to mutate (see Table 1). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Let us consider the intercalation energy for the complex depicted in Figure 4 for mutations at the five sites indicated in the bottom panel. In analogy to protein or DNA sequences, an (arbitrary) relevant subspace of CCS is defined in Table 1 as a matrix that corresponds to an alphabet of isoelectronic (in valence electron number) functional groups at each of the selected sites. Note that molecular combinations of letters of this alphabet can not only reverse dipole moments but also act either as hydrogen bond acceptors (lone pair in OH/Cl) or donors (NH equation image, proton in OH). Clearly, the alphabet can easily be extended to accommodate further effects, for example, with electron donating/withdrawing or hyperconjugating groups. Conformational degrees of freedom can be encoded explicitly, as is done for the hydroxyl groups in Table 1.

Table 1. Exemplary alphabet for mutants of ellipticine as oriented in Figure 4, defining a CCS with 4 × 6⁴ = 5184 molecules. [Color table can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Site vs. group    1    2    3    4    5    6
  1. Highlighted in red are all functional groups whose mutations have been considered. Predictions are displayed in Figure 5. The “wild-type” ellipticine drug is encoded as (21121) with three functional groups coming from the first column, and the functional groups at site R1 and R4 coming from the second column.


Within this restricted CCS, any given molecule is represented by the sequence of functional groups distributed over the five sites. For example, the “wild-type” ellipticine in Figure 4 would correspond to (21121), that is, two for N at R equation image, one for CH equation image at R equation image, R equation image, and R equation image, and two for NH equation image at R equation image. Let us exemplify a DFT+vdW-based prediction of the binding energy of another mutant: equation image is predicted to bind to the DNA cluster in Figure 4 with equation image = equation image kcal/mol.83 For predicting the single point mutation (21121) → (21125) (changing CH equation image into F at R equation image), one would have to predict a true value of equation image = equation image kcal/mol. The derivative based prediction according to the first-order term in Eq. (9) is calculated to be equation image + equation image = equation image + 1.4 = equation image kcal/mol. Inclusion of the reference coefficient [Eq. (10)], using compound pair (11121)/(11125) as a reference, yields equation image + equation image = equation image + 1.3 × 1.4 = equation image kcal/mol.
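The sequence encoding of this restricted CCS can be sketched in a few lines. The size 5184 = 4 × 6⁴ implies that one site offers four groups and the other four sites offer six; which site has only four groups is not stated here, so assigning it to the first site is an assumption made purely for illustration, as are the function names.

```python
from itertools import product

# Assumption: the first site offers four groups, the other four sites six each.
GROUPS_PER_SITE = (4, 6, 6, 6, 6)


def enumerate_ccs(groups_per_site=GROUPS_PER_SITE):
    """Iterate over every mutant as a tuple of 1-based group indices, one per site."""
    return product(*(range(1, n + 1) for n in groups_per_site))


def single_point_mutants(code, groups_per_site=GROUPS_PER_SITE):
    """Return all codes that differ from `code` at exactly one site."""
    mutants = []
    for site, n_groups in enumerate(groups_per_site):
        for group in range(1, n_groups + 1):
            if group != code[site]:
                mutants.append(code[:site] + (group,) + code[site + 1:])
    return mutants


wild_type = (2, 1, 1, 2, 1)  # the "wild-type" ellipticine code (21121)
n_molecules = sum(1 for _ in enumerate_ccs())
neighbours = single_point_mutants(wild_type)
print(n_molecules, len(neighbours), (2, 1, 1, 2, 5) in neighbours)
```

Each single point mutation returned by `single_point_mutants` corresponds to one interpolation path along which a first-order derivative could be evaluated.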

To gain a more representative idea of the predictive power of this method, Figure 5 features the outcome for a small subspace of the CCS highlighted in red in Table 1: eight compounds have been considered involving permutations at R equation image, R equation image, and R equation image, each with two possible functional groups. Predictions based on all the possible derivatives among these compounds, with and without reference coefficients (as obtained from compound pairs not involved in the transmutation), are compared to calculated binding energies. Despite several outliers that deviate substantially, the use of reference compounds dramatically improves the overall prediction.

Figure 5.

TOP: correlation for eight compounds from alphabet in Table 1. Predictions made using first-order derivatives only [ equation image, Eq. (9)], energy difference of reference compound pairs [ equation image, Eq. (11)], or equation image, Eq. (10). BOTTOM: normalized histogram and corresponding normal distribution of error over 72 predictions, equation image calculated E equation image − predicted E equation image. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

For comparison, we also include predictions based on the additive assumption that the influence of the rest of the molecule cancels when considering an interpolation for the same pair of functional groups. Specifically, we estimate the binding energy of B simply by adding the difference in binding energy of a reference compound pair, CD, to the binding energy of A,

equation image(11)

As shown in Figure 5, this prediction also yields remarkable correlation, with less pronounced outliers. Analysis of the distribution of errors, however, suggests that, in spite of the outliers, the normal distribution of predictions around the ideal correlation is superior for predictions made with the product of derivative and reference coefficient (Bottom of Fig. 5).

Win a prize

The numerical illustrations in the previous section, as well as in Ref.68, suggest that efforts to linearize the property through alternative, nonlinear interpolations of the Hamiltonian are worthwhile. Strictly speaking, however, due to the use of reference compound pairs, the aforementioned interpolation no longer constitutes a first principles Ansatz but rather an empirical and heuristic one. What is needed instead is an ab initio interpolating procedure that linearizes the energy (or other properties) in the order parameter, such that the first-order Taylor expansion based on the HF derivative is sufficiently accurate to predict properties of other compounds.2 In other words, if this effort were successful, numerical solutions of SE could be propagated from one system to the next (in close analogy to the Car-Parrinello idea of propagating orbitals together with nuclei87), forgoing the need to solve SE for each system from scratch.

Inspired by Erdős' habit of offering cash awards for solutions to outstanding mathematical problems, the author offers the equivalent of an ounce of gold to the first person who presents an ab initio solution to this problem. Specifically, the challenge reads: constructively find, or show the nonexistence of, an ab initio (that is, valid for any external potentials) interpolating transform equation image for which two different but isoelectronic molecular Hamiltonians with energies equation image and equation image interconvert such that the electronic ground state potential energy equation image is linear in the order parameter equation image, and that consequently the HF derivative is given by,

equation image(12)

Here, 0 equation image 1, and equation image, and equation image. Further details can be found in the footnote.

We can exemplify this challenge for the nonrelativistic hydrogen-like atom with only one electron. In this case, equation image, where a is a constant, and the nuclear charge Z is a function of the interpolating parameter equation image. For an interpolation linear in λ, equation image, the energy would be quadratic in equation image. The desired behavior of a linearized energy would be,

equation image(13)

Equating this to equation image and solving for equation image yields the corresponding interpolating function:

equation image(14)

We can test this interpolation to confirm if indeed we find the desired slope for the linearized energy, equation image. Application of the chain rule, and insertion and differentiation of Eq. (14) confirms the expected result,

equation image(15)

As such Eq. (14) linearizes the energy in equation image. The challenge of the prize consists of finding an analogous expression for molecules. If existent, the solution is likely to be a spatially resolved and equation image dependent inhomogeneous transformation of the external potential, continuously and non-linearly turning the external potential of compound A into the external potential of compound B while linearizing the potential energy.
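The hydrogen-like example can be checked numerically. The sketch below assumes E = aZ² with a = −1/2 Hartree (the nonrelativistic ground state of a one-electron ion), so that equating aZ(λ)² to the linear energy E_A + λ(E_B − E_A) gives Z(λ) = [Z_A² + λ(Z_B² − Z_A²)]^(1/2). This closed form is reconstructed from the definitions in the text and the function names are hypothetical.

```python
import math

A = -0.5  # Hartree; assumed value of the constant a in E = a * Z**2


def energy(z):
    """Ground-state energy of a hydrogen-like atom with nuclear charge z."""
    return A * z * z


def z_linear(lam, z_a, z_b):
    """Nuclear charge interpolated linearly in lambda."""
    return z_a + lam * (z_b - z_a)


def z_linearizing(lam, z_a, z_b):
    """Square-root interpolation that makes the energy linear in lambda."""
    return math.sqrt(z_a**2 + lam * (z_b**2 - z_a**2))


z_a, z_b = 1.0, 2.0  # H -> He+
e_a, e_b = energy(z_a), energy(z_b)
lams = [i / 10 for i in range(11)]


def max_deviation(path):
    """Largest deviation of E(lambda) from the straight line between E_A and E_B."""
    return max(abs(energy(path(l, z_a, z_b)) - (e_a + l * (e_b - e_a))) for l in lams)


dev_linear_path = max_deviation(z_linear)
dev_sqrt_path = max_deviation(z_linearizing)
print(dev_linear_path, dev_sqrt_path)
```

With the square-root path the first-order Taylor expansion from A reproduces E_B exactly, which is the behavior the prize challenge asks for in the molecular case.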

Note that a naive extension of Eq. (14) to assemblies of atoms,

equation image(16)

does not constitute a practical approximate solution to the challenge. equation image denotes the “alchemical” potential mentioned above, which corresponds to the electrostatic potential at equation image without the repulsion due to equation image. equation image will not necessarily cancel the square root term in the denominator of the derivative in Eq. (14), which consequently diverges at equation image = 0 if equation image equals 0.

Statistical Methods

Inductive reasoning from first principles

Within statistical mechanics, the numerical prediction of macroscopic observables from atomistic simulation requires repeatedly calculating microscopic states, using electronic structure theory, atomistic or coarse-grained force fields, and averaging in an appropriate ensemble. Philosophically speaking, the exercise of performing such computational “experiments” is an application of deductive reasoning to increase knowledge. Deductive reasoning is also at work when exploring CCS in terms of ensembles of potential energy hypersurfaces by repeatedly solving SE for N different compounds. Since the size of CCS (or phase space) is prohibitively large, its exhaustive sampling through screening by solving SE is impossible. Although some interpolating equation image schemes use statistical mechanics for a preselected set of compounds,79, 80 a rigorous way to more systematically and generally gain quantitative insights is desirable. This task can be accomplished through the application of inductive reasoning.

Historically, the role of inductive reasoning in chemistry is considerable: Mendeleev's table, the Hammett equation,88, 89 and Pettifor's structure maps90 are all based on inferred phenomenological relationships. Further examples include widely spread rules and notions of chemistry, such as the chemical bond, atomic charges, or aromaticity. Although these notions are popular and useful to the experimental chemist, conventional quantum chemistry, based on deductive reasoning, is still struggling to account for them.91 Recent advances in statistical data analysis methods92–95 and applications in other areas of science and engineering, such as searching the internet, automated locomotion (self-driving cars), algorithmic trading, or brain-computer interfaces, strongly suggest that they will also play an increasingly important role in quantum chemistry. Examples of first efforts to quantitatively infer laws for atomistic simulations include “Learning On The Fly,”96 or “force-matching.”97, 98 More sophisticated statistical learning methods have been applied to the training of exchange correlation functionals in DFT,99, 100 or to parameterizing interatomic force fields.101–106 Support vector machines have been shown to quantify basis-set incompleteness.107 Gaussian kernel-based ML for the design of accurate and reactive force-fields without predetermined functional form was introduced by Bartók et al.108 Contributions by Curtarolo, Hautier, and Ceder combine data-mining with mean-field electronic structure theory.109–111 Even the learning of reorganization energies that enter Marcus charge transfer rates is promising.112, 113 Very recently, kernel-based ML models have also delivered promising results for learning the kinetic electron density functionals within orbital free DFT,114 or dividing transition surfaces that determine reaction rates.115 Bayesian error estimates and cross-validation methods have also been applied to the development of exchange-correlation models with controlled transferability.116

Within the bioinformatics and cheminformatics communities, the development of quantitative structure property relationships (QSPRs) has a long tradition. QSPRs, relying on similar statistical frameworks (ML, cross-validated training, principal component analysis, etc.), deliberately attempt to circumvent solving the underlying laws of physics by directly correlating system features (so called descriptors) with macroscopic properties of interest. Conventionally, QSPRs are based on descriptors that explicitly forsake atomic resolution in the first principles sense. A large variety of such QSPR descriptors for various properties has been proposed.117–120 Two such descriptors, the molecular signature by Faulon et al.121 and a combination of HOMO eigenvalues of charged and neutral species, have recently yielded promising results for the QSPR modeling of a first principles property, the reorganization energy, in the CCS of polycyclic aromatic hydrocarbons (PAHs).113 PAHs form discotic liquid crystals which self-assemble into columnar liquid crystal structures, implying their usefulness for organic photovoltaic applications.122

In this section, we will discuss the application of ab initio statistical learning approaches to previously obtained first principles data for N compounds. Merely based on the data, QSPRs can be “learned,” and subsequently be used to avoid the cumbersome task of having to explicitly model all the underlying physical degrees of freedom of electrons and nuclei. As such, ML estimates solutions of SE for a new, that is, “unseen,” molecule B simply by evaluating an analytical expression equation image that (explicitly or implicitly) encodes the results previously obtained for N other molecules. Obviously, all these inferred relationships are inherently limited in accuracy by the quality of the reference data used for training. At this point it should be stressed that in order to avoid statistical artefacts such as overfitting and lack of transferability, successful ML efforts should rely on (i) many-fold cross-validation, (ii) data stratification, and (iii) regularization. Active learning can also be used to remove selection bias and lead to error bars that are constant across all of the relevant variable domains—interatomic distances and nuclear charges in our case. All tests and results should be reported for out-of-sample predictions, exclusively.
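Cross-validation on stratified data can be sketched as follows. The stratification rule used here (sort by the property, then deal each consecutive block of k samples randomly into the k folds) is one common choice and is an assumption, not necessarily the protocol of the cited work. The function names are hypothetical.

```python
import random


def stratified_folds(energies, k, seed=0):
    """Split sample indices into k folds with similar energy distributions.

    Samples are sorted by energy; every consecutive block of k samples
    contributes one randomly chosen member to each fold.
    """
    rng = random.Random(seed)
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    folds = [[] for _ in range(k)]
    for start in range(0, len(order), k):
        block = order[start:start + k]
        rng.shuffle(block)
        for fold, index in zip(folds, block):
            fold.append(index)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for held_out, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Hypothetical property values for 23 compounds (illustration only).
energies = [-(100.0 + 7.3 * i % 41) for i in range(23)]
folds = stratified_folds(energies, k=5)
splits = list(train_test_splits(folds))
print([len(f) for f in folds])
```

Errors would then be reported only on the held-out fold of each split, in line with the requirement of out-of-sample testing.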

Machine learning in CCS: The quantum machine meets Schrödinger

Recently, a first principles based ML approach to DFT atomization energies across CCS has been introduced.123 Unlike ordinary QSPR approaches, this ML model is free of any heuristics. It exactly encodes the supervised learning problem posed by SE: instead of solving SE by finding the wavefunction equation image which maps the system's Hamiltonian to its energy, equation image, it directly infers the solution by mapping system to energy, based on N examples given. In the limit of converged N, that is, sufficiently dense system coverage, the ML model is therefore a formally exact inductive equivalent to the deductive solution of SE achieved through the usual computational machinery of approximate wave-functions (such as separability of nuclear and electronic wavefunction or single Slater determinants), Hamiltonians (such as certain exchange-correlation potentials or tight-binding), and self-consistent field procedure to minimize the energy. In Ref.123, numerical evidence is given for this idea. Specifically, for a diverse set of organic molecules, one can show that an ML model can be used instead of solving SE, equation image. After training, solutions to SE can be inferred for out-of-sample, that is, “unseen,” compounds that differ either in geometry or in composition or in both. The evaluation of an estimate is ordinarily negligible in terms of computational cost, that is, milliseconds instead of hours on a conventional CPU, while yielding an accuracy competitive with the deductive approaches of modern electronic structure theory. As within any inductive approach, the accuracy is limited by the domain of applicability as defined by the data used for training, that is, robust results can only be expected in interpolating regimes with sufficient coverage. While the training data imposes limits on the ML model's accuracy, this arbitrariness can also offer an appealing advantage: any level of theory, and even experimental data, could be used for training.

Within the Gaussian kernel model, the energy of a query molecule equation image is given as a sum over N molecules in the training set,

equation image(17)

Each training molecule i contributes to the energy according to its specific weight equation image, scaled by a Gaussian in its distance to equation image, equation image. For given length-scale equation image and regularization parameter equation image, equation image are obtained by solving the regression problem,

equation image(18)

This regularized model limits the norm of regression coefficients, equation image, thereby improving the transferability of the model to new compounds. All regression coefficients and hyperparameters are determined by cross-validation on data stratified training sets.94, 95
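A Gaussian kernel model of this kind can be sketched with NumPy. The sketch assumes the kernel exp(−d²/2σ²) and uses the standard closed-form kernel ridge regression solution α = (K + λI)⁻¹E, which may differ in detail from the regularizer of Eq. (18). The one-dimensional toy descriptor and the sine "energies" are stand-ins for molecular distances and atomization energies, and all names are hypothetical.

```python
import numpy as np


def gaussian_kernel(distances, sigma):
    """Gaussian of the pairwise distances, exp(-d^2 / (2 sigma^2))."""
    return np.exp(-distances**2 / (2.0 * sigma**2))


def train(train_distances, energies, sigma, lam):
    """Regression weights from the regularized linear system (K + lam*I) alpha = E."""
    kernel = gaussian_kernel(train_distances, sigma)
    return np.linalg.solve(kernel + lam * np.eye(len(energies)), energies)


def predict(query_train_distances, alpha, sigma):
    """Energy estimates as weighted sums of Gaussians centered on the training set."""
    return gaussian_kernel(query_train_distances, sigma) @ alpha


# Toy data: a scalar descriptor x stands in for a molecule, sin(x) for its energy.
x_train = np.linspace(0.0, 3.0, 16)
e_train = np.sin(x_train)
x_query = np.array([1.1, 2.3])

sigma, lam = 0.5, 1e-8
alpha = train(np.abs(x_train[:, None] - x_train[None, :]), e_train, sigma, lam)
e_fit = predict(np.abs(x_train[:, None] - x_train[None, :]), alpha, sigma)
e_query = predict(np.abs(x_query[:, None] - x_train[None, :]), alpha, sigma)
print(np.max(np.abs(e_query - np.sin(x_query))))
```

In practice σ and λ would be selected on held-out folds rather than fixed by hand as done here.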

So far this model has been trained and validated only in its most rudimentary form for atomization energies of a small set of interesting compounds. Specifically, molecular atomization energies at the hybrid DFT level of theory5, 124–127 have been used for training on up to equation image molecules from the molecular generated data base (GDB)12 (see Fig. 6 for an illustration), for which mean absolute errors of less than 10 kcal/mol have been obtained. The choice of hybrid DFT is motivated by relatively small errors (<5 kcal/mol) for thermochemistry data that includes molecular atomization energies.128 Although 10 kcal/mol is still far from “chemical accuracy” ( equation image 1 kcal/mol), more recent efforts have not only led to atomization errors with less than 3 kcal/mol accuracy,129 but also include other electronic properties, such as frontier eigenvalues, polarizability, and excitation energies.130

Figure 6.

TOP: distance versus atomization energy difference in GDB-13 for all molecules with five heavy atoms (excluding pairs containing S) using data from Ref.121, equation image versus equation image. BOTTOM: N dependence of equation image and equation image.

An appealing advantage of analytical models, regardless of whether they are obtained from physical insight or statistical regression, is their amenability to analysis and interpretation. For example, otherwise ill-defined concepts in electronic structure theory, such as distance/neighborhood/similarity in CCS, can now be quantified within the “world” of the ML model. Specifically, Eq. (17) gives the energy of a query molecule equation image as an expansion in compound space spanned by reference molecules equation image: The regression weights equation image are scaled by the similarity between query and reference compound as measured by a Gaussian of the distance. Hence, equation image assigns a positive or negative weight to molecule i. Within the compound space used as reference, molecules therefore can be ranked according to their equation image. We should note, however, that equation image are merely statistical regression coefficients in a nonlinear model, that is, after a nonlinear transformation of the training data, the resulting energy contributions are specific to the employed training set without general implications for other properties or regions of compound space. The locality of the model is measured by equation image, enabling the definition of a critical distance of locality, equation image. Only if equation image will equation image contribute to the energy of equation image more than some threshold energy equation image. Rearranging summands in Eq. (17) leads to equation image. For atomization energies, and the chemical space considered in Ref.123, that is, with a critical distance equation image 400 Hartree (see TOP of Fig. 6), the ML results suggest that the model becomes local when equation image 60 Hartree, for the average equation image, and for equation image = 1 kcal/mol. Such equation image values are achieved when the number of molecules in training set N exceeds equation image5000. In other words, for equation image 5000, the model is global. All reference compounds contribute with more than 1 kcal/mol to any prediction made. See BOTTOM of Figure 6 for the N dependence of equation image and equation image.
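The notion of a critical distance of locality can be made concrete under an assumed kernel normalization. If a training molecule contributes α exp(−d²/2σ²), then requiring |α| exp(−d²/2σ²) > ε gives d < σ [2 ln(|α|/ε)]^(1/2). This expression follows from that assumption and is not copied from the source. Function names and numbers are hypothetical.

```python
import math


def contribution(alpha_i, distance, sigma):
    """Energy contribution of one training molecule at a given distance."""
    return alpha_i * math.exp(-distance**2 / (2.0 * sigma**2))


def critical_distance(alpha_i, eps, sigma):
    """Distance at which |alpha_i| times the Gaussian drops to the threshold eps.

    Returns 0.0 when the weight never exceeds the threshold.
    """
    ratio = abs(alpha_i) / eps
    if ratio <= 1.0:
        return 0.0
    return sigma * math.sqrt(2.0 * math.log(ratio))


# Hypothetical weight and threshold in the same energy unit; sigma in distance units.
alpha_i, eps, sigma = 10.0, 1.0, 60.0
d_c = critical_distance(alpha_i, eps, sigma)
print(d_c, contribution(alpha_i, d_c, sigma))
```

A model is local in this sense when d_c is small compared with the largest distance occurring between molecules in the training set.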

Coulomb matrix descriptor

To represent compounds, a wide variety of “descriptors” is in use by statistical methods for cheminformatics and bioinformatics applications.117–121 The descriptor introduced by us123 is based solely on coordinates and nuclear charges, and dubbed “Coulomb-matrix,” equation image, a symmetric square matrix of equation image dimensions,

equation image(19)

The diagonal elements, equation image, correspond to a polynomial fit to free atom energies, inspired by the discussion of the “high temperature” regime above.131 The off-diagonal elements correspond to the Coulomb repulsion between atoms I and J, and hence the name. For a data set containing molecules with differing number of atoms, all the equation image of all the smaller systems are extended by zeros until they reach the dimensionality of the largest molecule in the training set. Within a multi-scale sort of approach, the Coulomb-matrix can easily be extended to account for large or condensed phase systems: let equation image be the number of atoms in the unit cell, and let equation image be the number of atoms in unit cell plus sufficiently large surrounding environment, then define equation image as above except that all off-diagonal elements are set to zero for all I and J larger than equation image.
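A Coulomb matrix with zero padding can be built as in the following sketch. Since Eq. (19) is not reproduced here, the matrix elements are taken in the form commonly quoted for this descriptor, 0.5 Z_I^2.4 on the diagonal and Z_I Z_J / |R_I − R_J| off the diagonal, in atomic units. This form should be read as an assumption, and the function name is hypothetical.

```python
import numpy as np


def coulomb_matrix(charges, coords, size=None):
    """Coulomb matrix of one molecule, zero-padded to `size` rows and columns.

    charges: nuclear charges Z_I; coords: Cartesian positions in Bohr.
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    n_atoms = len(z)
    size = n_atoms if size is None else size
    if size < n_atoms:
        raise ValueError("size must be at least the number of atoms")
    matrix = np.zeros((size, size))
    for i in range(n_atoms):
        for j in range(n_atoms):
            if i == j:
                matrix[i, i] = 0.5 * z[i] ** 2.4  # assumed fit to free atom energies
            else:
                matrix[i, j] = z[i] * z[j] / np.linalg.norm(r[i] - r[j])
    return matrix


# H2 at a bond length of 1.4 Bohr, padded to the size of a three-atom molecule.
m_h2 = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]], size=3)
print(m_h2)
```

Padding with zeros corresponds to adding non-interacting dummy atoms of zero charge, so molecules of different size share one input dimension.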

We can measure the distance between two molecules by the Euclidean norm of their diagonalized Coulomb matrices: equation image, where equation image are the eigenvalues of equation image in order of decreasing absolute value. The physical meaning of representing CCS in this way can easily be understood by considering the simplest of all molecules, homonuclear diatomics (i.e., equation image and equation image). Any corresponding equation image is then simply defined by its two eigenvalues, the roots of its characteristic polynomial, equation image. When measuring similarity between two such diatomics with different interatomic distances, equation image and equation image, the measure of similarity reduces to equation image; and the corresponding estimated potential energy curve for any new interatomic distance, equation image, as trained on N other interatomic distances, equation image, is given by

equation image(20)

In complete analogy, an ML model of the homonuclear dimer can also be understood analytically in terms of other homonuclear dimers with differing atomic numbers, heteronuclear dimers, or heteronuclear trimers. The ease of differentiation with respect to not only geometry ( equation image) but also with respect to composition ( equation image) illustrates further advantages of such a simple model.
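The eigenvalue-based distance and its diatomic limit can be checked numerically. The sketch again assumes diagonal elements 0.5 Z^2.4 and off-diagonal elements Z²/r. For two homonuclear diatomics of equal Z, the eigenvalues are 0.5 Z^2.4 ± Z²/r, so the distance works out to √2 Z² |1/r₁ − 1/r₂| under these assumptions. Function names are hypothetical.

```python
import numpy as np


def sorted_spectrum(matrix):
    """Eigenvalues of a symmetric matrix, ordered by decreasing absolute value."""
    eigenvalues = np.linalg.eigvalsh(matrix)
    return eigenvalues[np.argsort(-np.abs(eigenvalues))]


def spectrum_distance(m1, m2):
    """Euclidean norm of the difference between two sorted eigenvalue spectra."""
    return float(np.linalg.norm(sorted_spectrum(m1) - sorted_spectrum(m2)))


def diatomic_matrix(z, r):
    """2x2 Coulomb matrix of a homonuclear diatomic (assumed matrix elements)."""
    diagonal = 0.5 * z ** 2.4
    off_diagonal = z * z / r
    return np.array([[diagonal, off_diagonal], [off_diagonal, diagonal]])


z, r1, r2 = 1.0, 1.4, 2.0  # two H2 geometries, bond lengths in Bohr
d_numeric = spectrum_distance(diatomic_matrix(z, r1), diatomic_matrix(z, r2))
d_analytic = np.sqrt(2.0) * z * z * abs(1.0 / r1 - 1.0 / r2)
print(d_numeric, d_analytic)
```

For this simplest case the distance depends on geometry only through 1/r, which makes the resulting potential energy curve a sum of Gaussians in the inverse bond length.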

The Coulomb matrix uniquely encodes any compound because stoichiometry as well as atomic configuration are explicitly accounted for. Even homometric molecules,132 see Figure 7, are uniquely encoded by equation image. Symmetrically equivalent atoms will contribute equally, and the representation is rotationally and translationally invariant. To gain invariance of equation image with respect to the index ordering of atoms, one can either diagonalize, sort rows and columns according to their norm, or use sets of matrices with permuted rows and columns. As emphasized in a Comment by Moussa, using the eigenvalues of equation image will yield an undercomplete representation.134 Obviously, as within any coarsened representation, the equation image degrees of freedom represented by eigenvalues will fail to uniquely represent the full set of equation image degrees of freedom for any nonlinear molecule with more than three atoms.134 Although sorting by the norm of rows (or columns) leads to an overcomplete, index invariant, and unique representation, the matrix is no longer differentiable for any combination of matrix entries that could be achieved through changes in geometry or in nuclear charges. Extending the representation by randomly permuted variants of Coulomb matrices is feasible, and leads to dramatic improvement in predictive accuracy.128, 129 Encoding known invariances through such data extension has also been successful for improving the accuracy of handwritten digit recognition.136 Due to disadvantageous scaling, this approach might prove problematic, however, when it comes to larger systems. As discussed in Ref.137, these are all crucial criteria for representing atomistic systems within statistical models.
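Two of the strategies for index invariance, sorting by row norm and extending the data by random permutations, can be sketched as follows. The matrix entries are hypothetical numbers chosen so that all row norms differ; ties between symmetry-equivalent atoms would need extra care. Function names are hypothetical.

```python
import numpy as np


def sort_by_row_norm(matrix):
    """Reorder rows and columns simultaneously by decreasing row norm."""
    order = np.argsort(-np.linalg.norm(matrix, axis=1), kind="stable")
    return matrix[np.ix_(order, order)]


def random_permutations(matrix, n_copies, seed=0):
    """Return n_copies of the matrix with randomly permuted atom indices."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        perm = rng.permutation(matrix.shape[0])
        copies.append(matrix[np.ix_(perm, perm)])
    return copies


# Hypothetical symmetric matrix for a three-atom molecule (illustration only).
m = np.array([[36.9, 5.0, 1.2],
              [5.0, 0.5, 0.3],
              [1.2, 0.3, 53.4]])
relabeled = m[np.ix_([2, 0, 1], [2, 0, 1])]  # same molecule, atoms renumbered

same_after_sorting = np.allclose(sort_by_row_norm(m), sort_by_row_norm(relabeled))
augmented = random_permutations(m, n_copies=4)
print(same_after_sorting, len(augmented))
```

Sorting yields a single canonical matrix per molecule, whereas the permuted copies enlarge the training set and leave it to the regression to learn the invariance.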

Figure 7.

Sketch of two homometric molecules (same stoichiometry, same sum of interatomic distances) from Ref.133. The Coulomb-matrix (sorted by norm of its rows, or a set of its permutants) can distinguish these two molecules.133–135

Alternative descriptors for CCS

We shall now discuss more sophisticated alternatives to the Coulomb-matrix. An intuitive extension is to assume a matrix with an interatomic potential form. This could be worthwhile as long as the incurred computational overhead is small by comparison to the method used to generate the reference data. For example,

equation image(21)

would correspond to the Lennard-Jones analog to the Coulomb-matrix. Similarly, a Morse or Buckingham matrix could be constructed. One could even conceive of going beyond such pair-wise approaches and introducing interatomic three and higher order terms in the form of molecular tensors. Electronic structure models, such as extended Hückel theory, semiempirical quantum chemistry, or tight-binding models, can also be encoded in terms of such a representation. For example, an orbital free Thomas–Fermi DFT representation138 is possible based on a data-base of frozen free atomic electron densities, equation image. The “Hartree” matrix is given by,

equation image(22)

the “external” potential matrix is given by,

equation image(23)

and the “kinetic” matrix is given within

equation image(24)

where equation image is a constant, and atomic integrals are evaluated over all of space. If need be, the kinetic matrix could even be extended by the von Weizsäcker correction term, equation image.138 Summation of all entries in the matrix and addition of the off-diagonal Coulomb-matrix entries would yield the corresponding exact DFT energy for frozen atomic electron densities. Preliminary training on atomization energies of the GDB-7 data set12 indicates that neither use of the Lennard-Jones nor of the Thomas–Fermi matrices leads to any significant improvement in predictive accuracy when compared to the original Coulomb-matrix representation in Ref.123. A possible explanation for this surprising result is that these more sophisticated descriptors are no longer monotonic functions in interatomic distances, in contrast to the Coulomb matrix.

An alternative new descriptor, entirely consistent with the first principles view on CCS, has recently been proposed.139 Each atom I in the molecule is represented by its nuclear charge multiplied with a cosine term that contains a radial distribution function of atom I with respect to all other atoms J. Summing up the atomic contributions yields a Fourier series of atomic radial distribution functions which, because of the superposition principle, is not only unique for each compound, but also invariant with respect to molecular rotations, translations, and atom indexing. Using a Gaussian radial distribution function the descriptor reads,

equation image(25)

where equation image, and n and equation image are hyperparameters that can be optimized. This descriptor has units of charge equation image; d has units of distance and goes from zero to beyond the largest interatomic distance. As in the case of the Coulomb-matrix described above, the environment of large or condensed systems can be accounted for by choosing equation image to be larger than equation image. The reader is referred to the original paper for further details.139 These descriptors can be contrasted to other alternatives, for example those recently investigated by Bartók, Kondor and Csányi.140

Concluding Remarks

We have reviewed a notion of CCS that is consistent with any ab initio approach to atomistic simulations. Starting from an energy hierarchy, variations in nuclear charge distributions have been discussed, followed by order-parameter-based interpolation approaches and statistical learning methods. The concepts presented offer a seamless and rigorous framework to unify electronic structure theory with rational as well as combinatorial compound design efforts. This view of chemical space is advantageous for several reasons: (i) important fundamental questions can be tackled in the future, including rigorous definitions of diversity in CCS, property transferability, uncertainty, and selection bias in training sets; (ii) transferability and applicability typical for the black-box characteristics and the accuracy of ab initio calculations can be achieved; (iii) a mathematically, physically, and chemically rigorous notion of relevant input variables enables the application of sophisticated property optimization algorithms. Ultimately, efforts along these lines promise to lead to “the right compound for the right reason,” aiming to replace by systematic engineering protocols the heuristics and serendipity on which most, if not all, of past compound discoveries have relied.


The author is thankful for helpful discussions with C. Anderson, K. Burke, M. Cuendet, R. A. DiStasio, Jr., F. Furche, J. R. Hammond, G. Henkelman, F. Kiraly, A. Knoll, G. Montavon, J. E. Moussa, K. R. Müller, B. C. Rinderspacher, M. Rupp, A. Tkatchenko, D. Truhlar, M. Tuckerman, and A. Vazquez-Mayagoitia. The many participants of the 2011 program “Navigating Chemical Compound Space for Materials and Bio Design” at the Institute for Pure and Applied Mathematics, UCLA, are also gratefully acknowledged.

  • *

    Include the mass of the nuclei as an additional variable if also thermal properties and nuclear quantum effects are to be accounted for

  • Prize (2011): A prize equivalent to an ounce of gold was announced during the Navigating Chemical Compound Space program in spring 2011 at the Institute for Pure and Applied Mathematics, UCLA. The prize is for finding an interpolating transform of two isoelectronic Hamiltonians such that the potential energy becomes linear in the interpolating order parameter. The ounce of gold is currently held in the form of 100 shares of iShares Trust fund (NYSEARCA:IAU), and will be dispensed in cash at the instantaneous exchange rate on recognition of a valid solution by a prize committee. Apart from the author, the prize committee consists of Profs. K. Burke, G. Henkelman, K. R. Müller, and M. E. Tuckerman. Contact the author regarding donations to increase the prize award. For more information, see http://www.alcf.anl.gov/~anatole.

  • We assume the reduced mass to equal the mass of the electron.

Biographical Information

O. Anatole von Lilienfeld will begin a Swiss National Science Foundation professorship in the department of chemistry at the University of Basel, Switzerland, in summer 2013. He has been an Assistant Computational Scientist at the Argonne Leadership Computing Facility at Argonne National Laboratory since 2011. In spring 2011 he chaired the three-month program “Navigating Chemical Compound Space for Materials and Bio Design” at the Institute for Pure and Applied Mathematics, UCLA. From 2007 to 2010 he was a Distinguished Harry S. Truman Fellow at Sandia National Laboratories, New Mexico. Prior to that he was a Swiss National Science Foundation postdoctoral scholar with Mark Tuckerman in the Chemistry Department of New York University. In 2005, he received his PhD in computational chemistry from EPF Lausanne under the supervision of Ursula Röthlisberger. His diploma thesis work was carried out in 2001 with Martin Quack at ETH Zuerich and Nicholas Handy at the University of Cambridge, UK. Apart from the topics addressed in this tutorial, his research deals with van der Waals contributions to interatomic forces, nuclear quantum effects, defects in semiconductors, molecular dynamics, and high-performance computing.