QCTFF: On the Construction of a Novel Protein Force Field

In this perspective, we explain the strategy behind QCTFF, the current name for a novel atomistic protein force field. The atoms are constructed using Quantum Chemical Topology (QCT). These topological atoms determine how a system's energy is partitioned. We give a brief account of the results hitherto obtained, and a glimpse of unpublished results. Combining this QCT partitioning with the universal quantum expression of energy leads to four types of fundamental energy contributions. The first of these is intra-atomic and the remaining three interatomic: (i) atomic self-energy, (ii) Coulomb energy, (iii) exchange energy, and (iv) correlation energy. All structural and dynamic effects emerge from the interplay of these contributions. The machine learning method kriging captures well how they change in response to a change in nuclear configuration. Moreover, the Coulomb energy is represented by a converging multipolar series expansion when the nuclei are sufficiently far apart. © 2015 The Authors. International Journal of Quantum Chemistry Published by Wiley Periodicals, Inc. DOI: 10.1002/qua.24900


Introduction
A force field is a short-cut: when used in geometry optimization or molecular simulation, it bypasses the Schrödinger equation. A force field mimics the behavior of molecules (and molecular assemblies) governed by this powerful equation. Let us take a diatomic molecule as an example. If its bond is stretched beyond the equilibrium bond length then the molecular energy will increase. This effect can be accurately reproduced (or predicted) by fitting an appropriate function capturing the relationship between energy and geometry. This is what a force field achieves, and while doing so, it ignores the mechanism behind this energy change. A force field focuses only on the link between the input (geometry) and the output (energy). This strategy appears effective because it delivers what we want: a computationally cheap prediction of energy as a function of geometry. Predicting the molecule's energy using a force field is easily several orders of magnitude faster than by an ab initio calculation, even for a diatomic molecule. This advantage comes with major concerns. Can this force field be safely generalized to large molecules? Can we still sample enough molecular configurations? Will accuracy suffer? It is well-known that the number of configurations blows up as the number of degrees of freedom increases. In a diatomic, there is only one degree of freedom and it is perfectly feasible to sample the energy at, say, 10 different bond lengths. However, given that there are 3N − 6 degrees of freedom for a general nonlinear polyatomic, can one feasibly sample 10^(3N−6) configurations when N is about 30, as is the case for a typical peptide-capped amino acid? A second major concern is whether a force field should necessarily be ignorant about the atomic and quantum mechanical mechanisms that make a molecule's energy respond to a change in its geometry. In other words, should the relationship between a molecule's energy and its geometry remain a black box?
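To make the sampling concern concrete, the arithmetic can be sketched in a few lines of Python (a purely illustrative back-of-the-envelope count, with an assumed 10 grid points per degree of freedom):

```python
# Degrees of freedom of a general nonlinear N-atom molecule: 3N - 6.
# A naive grid of 10 points per degree of freedom then needs 10**(3N - 6)
# energy evaluations -- the combinatorial blow-up described in the text.
def dof(n_atoms):
    return 3 * n_atoms - 6

def naive_grid_size(n_atoms, points_per_dof=10):
    return points_per_dof ** dof(n_atoms)

print(dof(30))              # 84 degrees of freedom for a ~30-atom system
print(naive_grid_size(30))  # 10**84 configurations: far beyond brute force
```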
or should one probe into the reason why a molecule prefers a certain geometry, find and then model the cause for an energy rise?
If one ignores the two concerns discussed above, one will soon be stuck when setting up a force field for a particular protein in the same way as for a given diatomic. Fragmentation is the way forward. Here, one takes advantage of the fact that quantum mechanics is kind enough to create localizable properties. Indeed, functional groups emerge naturally in the electron density: an imidazole group in an isolated histidine is very similar to one occurring in histidine in the protein crambin, for example. This similarity underpins the concept of transferability, the first cornerstone of any force field. Transferability can be quantified and systematically investigated. However, in spite of being so pivotal, this concept remains remarkably under-researched. There is a need to demonstrate quantitatively that the molecular fragments that are parameterized during the construction of a force field are indeed transferable units, within an accepted energy threshold.
Force fields such as AMBER and CHARMM continue to dominate the world of biomolecular simulation. The architecture of these force fields is very old and based on approximations that can be abandoned thanks to present-day computing power. A detailed motivation as to why this architecture should be overhauled in next-generation force fields has been made before (e.g., Refs. [1][2][3][4][5]) and cannot be repeated here due to space restrictions. One convincing argument in favor of this overhaul, which has again been made in detail, [6] is that point charges are inherently limited (except at very long range) and that multipolar electrostatics should, therefore, be standard.
The extension [7] of CHARMM to handle drug-like molecules is a straightforward example of how old approximations prevent accuracy enhancement. Here, charge optimization was carried out at the original HF/6-31G(d) level of theory for the sake of consistency, while it is well-known that higher levels of theory are mandatory for an accurate treatment of hydrogen bonding. Another recent example is the extension [8] of OPLS to deal with halogen bonding, beyond the model of one charge for each nucleus. Here, the addition of extra sites (X-sites) is chosen as a route forward to model the inherent electrostatic features of chlorides, bromides, and iodides. However, the newly developed OPLS-AAx force field cannot be used in conjunction with the TIP5P water model [9] because the X-sites and the water lone-pair sites can fuse.
Fortunately, several research groups work with multipole moments, which avoid such problems. Many of these groups [10][11][12][13][14][15][16][17][18][19][20] have made tremendous progress in going beyond the paradigm of the current force fields used in standard molecular dynamics simulations, by advancing polarization methodologies and by improving the electrostatics beyond the point-charge framework. In particular, kriging, which is identical to "Gaussian process regression," was also used early on in the construction of potentials. [18] The force field AMOEBA, [21] both as an approach and as a software product, is a well-developed culmination of such progress beyond the traditional AMBER/CHARMM methodology. With the development of the Quantum Chemical Topology Force Field (QCTFF) we joined this journey but started from a framework that differs dramatically from even AMOEBA, let alone AMBER/CHARMM. In the next section, we explain the strategy behind QCTFF and highlight the choices made about its core architecture.

The Strategy Behind QCTFF

Generalities
Instead of reporting the chronological development of QCTFF, we start with an outline of the complete strategy including the parts that are currently still being implemented. We only mention that QCTFF started with the interatomic electrostatic interaction, which is hence the best understood part. It is possible to combine this part, the details of which are unique to QCTFF, with the traditional expressions for the nonelectrostatic part (e.g., Lennard-Jones and Hooke-like or Morse-like covalent bond potentials). However, this path is not followed.
A future-proof strategy defines all energy contributions from scratch and develops them together. As a result, one accurate energy term cannot be "swamped" by another less accurate term.
QCTFF is built on a constellation of four resolutions, which we each discuss in subsequent subsections below. Each of these resolutions is a research challenge by itself and hence not necessarily closed at this stage. First, we choose the so-called topological atom (or QCT atom) as the carrier of chemical information. More precisely, the system's total energy can be predicted from the information that these atoms are endowed with. An important question that accompanies this decision is the size of the environment this atom needs to be aware of, which is again the question of transferability. Second, the total energy is partitioned into four types of energy contributions, each of which has a chemical meaning: intra-atomic self-energy, and interatomic Coulomb, exchange, and correlation energy. Third, any 1/r type of interaction can be expanded by means of the well-known Laplace expansion, which introduces spherical harmonic functions. This expansion generates the familiar multipole moments, which express the way electron density is distributed, in this case within a (topological) atom. This is of great practical use for the Coulomb energy, provided the multipole expansion converges. Fourth, a variation in nuclear configuration causes a change in a given atom's energies and multipole moments. The machine learning method kriging has the capacity to capture the link between the coordinates of the atom's surrounding nuclei (input) and this atom's energies and multipole moments (output).

Quantum chemical topology
The "Quantum Theory of Atoms in Molecules (QTAIM)," [22] which is embedded in the more modern extension called Quantum Chemical Topology (QCT), [23] rigorously defines atoms by means of the gradient vector field of the electron density. In other words, QTAIM is a special case of QCT, in which the central idea of partitioning by the gradient is applied to the electron density or to its Laplacian. The gradient also partitions other [24] three-dimensional functions, such as the Electron Localization Function, the electrostatic potential, the virial field, the nuclear potential, the magnetically induced molecular current distribution, the Ehrenfest force field, and the intracule density. There are five conceptual and technical advantages [3] associated with a topological partitioning of the electron density. If a given molecule is embedded in a condensed matter state then each atom it contains is fully bounded by (sharp) topological boundaries. The atoms have finite (i.e., not infinite) volumes, they do not overlap, and there are no interatomic gaps. However, when the molecules are much separated from each other, as in a gaseous state, it is practical to replace the outer boundaries of the atoms by a constant electron density envelope. Figure 1 shows how QCT partitions the porphyrin ring in the cytochrome P450 BM3 protein (PDB code 2IJ2) into (topological) atoms.
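For intuition only, the idea of partitioning by the gradient vector field can be mimicked in one dimension with a toy model density (two Gaussians standing in for two nuclei; this is a conceptual sketch, not QCT software): following the density gradient uphill assigns each starting point to the maximum, that is, the "nucleus," where its path terminates.

```python
import math

# Toy 1D "electron density": two Gaussian maxima (model nuclei at x = 0 and x = 2).
def rho(x):
    return math.exp(-(x - 0.0) ** 2) + 0.6 * math.exp(-(x - 2.0) ** 2)

def gradient_ascent(x, step=0.01, n_steps=2000, h=1e-5):
    # Follow the density gradient uphill (numerical central-difference gradient)
    for _ in range(n_steps):
        g = (rho(x + h) - rho(x - h)) / (2 * h)
        x += step * g
    return x

def assign_atom(x, nuclei=(0.0, 2.0)):
    # A point belongs to the basin of the nucleus its gradient path terminates near
    top = gradient_ascent(x)
    return min(nuclei, key=lambda n: abs(top - n))

print(assign_atom(0.4), assign_atom(1.8))  # 0.0 2.0
```

The set of points flowing to the same maximum is that atom's basin; in three dimensions the same construction yields the sharp interatomic surfaces described above.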

Four types of energy contribution
Atomic self-energy. The first energy is the atomic self-energy, denoted E_intra, which measures the intrinsic stability of an atom. This energy is not interatomic but intra-atomic. It should be emphasized that an intra-atomic energy is still precisely informed about how the atom at hand interacts with its full environment, even without explicit interatomic character. Indeed, the manner in which the system's electrons distribute themselves inside the atom must be literally self-consistent with the rest of the system. Hence, an atomic self-energy can be regarded as a fingerprint of the whole system. Broadly speaking, this energy type features in stereo-electronic effects, rotation barriers, steric hindrance, or the anomeric effect.
A QCT atom, designated X, has the desirable property that it possesses a well-defined kinetic energy T^X (see Ref. [26] and refs. therein). This energy is the first part of the intra-atomic energy of X. To complete the intra-atomic energy one must add the intra-atomic potential energy. Our lab was early to compute [27] the latter quantity, which needs the second-order reduced density matrix ρ₂(r₁,r₂). We can now define the electron-electron repulsion energy between two atoms X_i and X_j,

V_ee^{X_i X_j} = ∫_{X_i} dr₁ ∫_{X_j} dr₂ ρ₂(r₁,r₂)/r₁₂   (1)

This equation covers both the intra-atomic (X_i = X_j = A) and interatomic (X_i = A ≠ X_j = B) case, the latter of which features in the subsection "Interatomic electrostatic energy." Only the electron-nucleus attraction energy is needed to fully complete E_intra of atom A,

V_en^{AA} = −Z_A ∫_A ρ(r₁)/r₁A dr₁   (2)

where r₁A is the distance between an electron and the nucleus of atom A, so that we can finally write

E_intra^A = T^A + V_ee^{AA} + V_en^{AA}   (3)

This intra-atomic energy E_intra^A is the energy that a single atom possesses inside a system, regardless of whether this system is a single molecule or a cluster of molecules (including even ions). Unpublished work from our lab shows that an oxygen (or nitrogen) atom has the same energy, to within at most 0.8 kJ mol⁻¹, when appearing in a tripeptide compared with appearing in a pentapeptide, with their common nuclear skeleton in exactly the same configuration. This energetic transferability is based on observations made in at least four test cases, that is, the homo-oligopeptides of Ala, Ser, Gly, and Leu. The generation of energy-transferability data on nonprotein systems (e.g., water clusters, ions in water, and carbohydrates) is planned but we already know such systems can be successfully kriged. This very high degree of transferability of QCT atoms has also been detected and conceded [20] in terms of atomic charges by those who develop alternative partitioning schemes (such as the Hirshfeld partitioning [28] scheme and all its variants).
Such favorable assessment makes us hypothesize that high transferability is also a hallmark of topological atoms in nonprotein matter, including drugs and natural ligands.
Interatomic electrostatic energy. The second energy type is interatomic in nature and builds on (but is not equal to) the purely electronic electron-electron Coulomb repulsion energy V_ee^{X_i X_j} between two different atoms, or V_ee^{AB}. This quantity can only be calculated from Eq. (1) provided the pure Coulomb part is filtered out of ρ₂(r₁,r₂). This can be done by knowing the fine-structure of ρ₂,

ρ₂(r₁,r₂) = ρ₂^coul + ρ₂^exch + ρ₂^corr = ρ(r₁)ρ(r₂) − ρ₁(r₁;r₂)ρ₁(r₂;r₁) + ρ₂^corr(r₁,r₂)   (4)

where the first term refers to the quantum-mechanically uncorrelated Coulomb-like pair density, the second term to the Fock-Dirac exchange (which is dominant and associated with the Fermi hole), while the third term is at least an order of magnitude smaller [29,30] than the second, and connected with the Coulomb hole. Each term in Eq. (4) is associated with a type of potential energy, so that the corresponding fine-structure of V_ee^{AB} automatically follows,

V_ee^{AB} = V_ee,coul^{AB} + V_ee,exch^{AB} + V_ee,corr^{AB}   (5)

The first term in Eq. (5) is what we are after. Note that it is useful to complete the electron-electron Coulomb energy V_ee,coul^{AB} with the three extra terms that cover the nuclear contributions: the electron-nucleus attraction (potential) energy, denoted V_en^{AB} [which is the interatomic generalization of Eq. (2)], its dual, denoted V_en^{BA}, and the classical nucleus-nucleus repulsion, V_nn^{AB}. When added, these terms lead to the full electrostatic interaction between two atoms A and B, V_elec^{AB}, or

V_elec^{AB} = V_ee,coul^{AB} + V_en^{AB} + V_en^{BA} + V_nn^{AB}   (6)

This fundamental energy type accounts for the ubiquitous electrostatics, including charge transfer and polarization effects. (The picture in Figure 1 was generated by in-house software called IRIS, which incorporates a version of MORPHY. [25])
Exchange-delocalization energy V_x: a measure of covalency.
The second term in Eq. (5) represents the exchange-delocalization energy, V_ee,exch^{AB}, which is (already) present at the Hartree-Fock level. This term teases out the interaction that keeps bonded atoms together.
To what extent atoms are bonded can be estimated by a non-energy measure, which is typically a quantum-mechanical bond order. QCT offers such a measure. [31] However, it was shown by our lab [32] that this bond order is only the first term of the multipolar expansion of V_x. Hence, the latter contains more information than a bond order.
A recent article [33] studied the behavior of V_x with increasing distance for a sizeable number of bond types. It is clear that V_x is always stabilizing and, therefore, always negative. This is why one invariably works with its absolute value, which can be conveniently represented on a logarithmic scale. The larger the absolute value of V_x, or |V_x|, the stronger the bond it is associated with. Figure 2 illustrates how |V_x| decreases exponentially with increasing internuclear distance (yellow band) in the water dimer. The four covalent O–H bonds (marked as 1,2 in Fig. 2) found in the two water monomers are the strongest bonds in the dimer, as expected. Their |V_x| values are of the order of 400 kJ mol⁻¹ (red disk in Fig. 2), which is in line with the listed mean O–H bond enthalpy of 463 kJ mol⁻¹. The V_x energy measure is a novel and exciting gauge to detect unusual interactions, in an orbital- and parameter-free manner. In the long term, it will be instrumental in designing a QCTFF version for chemical reactions, given its ability to quantify bond breaking and bond making quantum mechanically. This type of energy governs chemical bonding and (hyper)conjugation effects. Indeed, the latter are easily revealed from a graph of V_x versus distance, again as deviations from the general "yellow band" trend. A fine example of this can be found in Ref. [33], where V_x values for (C)H···H(C) interactions are always anomalously large if the hydrogens are in trans position with respect to each other. One does not have to explain this effect (or trend, or even rule) by hyperconjugation but can instead simply name it the "trans effect" and use it to make predictions independently from hyperconjugation. What matters is that the effect is consistent and clearly spotted, in a good number of saturated hydrocarbons, and that it emerges from the relationship between two well-defined quantities: V_x and internuclear distance.
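The near-exponential decay of |V_x| with distance (the "yellow band") is easy to model. On synthetic data constructed to follow |V_x| = A exp(−b r) (the numbers below are made up for illustration and are not published V_x values), a least-squares fit of ln|V_x| against r recovers the decay constant:

```python
import math

def fit_exponential(rs, vx_abs):
    # Fit |Vx| ~ A * exp(-b * r) by linear least squares on ln|Vx| = ln A - b r
    n = len(rs)
    ys = [math.log(v) for v in vx_abs]
    mr, my = sum(rs) / n, sum(ys) / n
    slope = sum((r - mr) * (y - my) for r, y in zip(rs, ys)) / \
            sum((r - mr) ** 2 for r in rs)
    b = -slope                     # decay constant
    lnA = my + b * mr              # intercept at r = 0
    return math.exp(lnA), b

# Synthetic, illustrative points lying on |Vx| = 400 exp(-2 r) (arbitrary units)
rs = [1.0, 1.5, 2.0, 3.0]
vx = [400 * math.exp(-2 * r) for r in rs]
A, b = fit_exponential(rs, vx)
print(round(A), round(b, 3))  # 400 2.0
```

This is the same logarithmic representation used in Figure 2; deviations of real data points from such a fit are precisely the "anomalous" interactions discussed above.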
Dispersion energy treated as dynamic correlation. Dispersion is an important effect in biomolecular modeling but its accurate quantification continues to be a challenge. [34] The four natural amino acids that contain an aromatic residue have already been successfully treated by our methodology in terms of pure multipolar electrostatics, [35] but the next big challenge is to integrate dispersion in a rigorous and streamlined manner. The key to achieving this is being able to calculate the third term in Eq. (5), or V_ee,corr^{AB}, which embodies dynamic correlation. This is the general quantum mechanical effect that governs dispersion, and it allows dispersion to be included as an energy contribution without the context of Rayleigh-Schrödinger (long-range) perturbation theory. As proven a long time ago by London, one then recovers the familiar C6 r^−6 expression given in textbooks for dipolar dispersion, and higher powers (C8, C9, C10, . . .) for higher-rank multipole moments. This perturbation treatment is only valid for intermolecular interactions but, represented by the third term in Eq. (5), dispersion can also be included intramolecularly. Represented this way, dispersion then also benefits from the fact that QCT does not suffer from penetration effects. Hence, there is no need for damping functions in QCTFF. We note that obtaining satisfactory expressions for damping functions is problematic in view of the complexity of atom typing. [36] The dynamic correlation part of QCTFF is currently under preliminary investigation, and is expected to internally resolve the atom-typing issue through kriging (see Machine learning: kriging tackles complexity of all energy terms responding to geometrical change).
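For orientation, the textbook long-range dispersion series mentioned above can be evaluated directly; the coefficients below are invented for illustration and are not QCTFF parameters:

```python
def dispersion(r, c6=75.0, c8=1500.0, c10=40000.0):
    # Textbook long-range dispersion series (illustrative, made-up coefficients):
    # E_disp(r) = -(C6/r^6 + C8/r^8 + C10/r^10); always stabilizing (negative)
    return -(c6 / r ** 6 + c8 / r ** 8 + c10 / r ** 10)

# Dispersion is more stabilizing at shorter separation and vanishes at long range
print(dispersion(5.0) < dispersion(10.0) < 0)  # True
```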

Interatomic electrostatic energy and multipole moments
In 2014, we wrote a perspective [6] entitled "Multipolar Electrostatics," which collects dozens of case studies from the literature, systematically proving the superior accuracy of multipole moments over the traditional point-charge approach (of AMBER, CHARMM, OPLS, etc.). Even David Case, a main developer of the point-charge based force field and program AMBER, urged for further research into the accuracy of force-field potentials, particularly the electrostatic terms. [37] Applying the Laplace multipole expansion leads to

V_elec^{AB} = Σ_{lA lB mA mB} T_{lA mA lB mB} Q_{lA mA} Q_{lB mB}   (7)

where the Q_lm represent the atomic multipole moments of rank l, while T is a purely geometrical interaction tensor. There are three conceptual and technical advantages associated with QCT multipole moments: (i) spherical tensor moments are more compact than Cartesian ones (no redundancies), (ii) excellent convergence at short range, and (iii) no penetration effects because of nonoverlapping atoms.
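Why such an expansion converges at sufficient separation can be seen in a deliberately simple on-axis toy model (invented charges, not the spherical-tensor machinery of Eq. (7)): for charges at positions z_i interacting with a unit probe charge at R, the identity 1/(R − z) = Σ_n z^n/R^(n+1) means the rank-n term involves only the axial moment Σ_i q_i z_i^n (monopole, dipole, quadrupole, . . .):

```python
def exact_coulomb(charges, R):
    # Exact interaction of on-axis point charges [(q_i, z_i)] with a unit
    # probe charge at z = R (illustrative numbers, atomic-style units)
    return sum(q / (R - z) for q, z in charges)

def multipole_approx(charges, R, max_rank=2):
    # Truncated on-axis multipole expansion about the origin:
    # rank-n term = (sum_i q_i * z_i**n) / R**(n+1)
    return sum(sum(q * z ** n for q, z in charges) / R ** (n + 1)
               for n in range(max_rank + 1))

site_A = [(0.3, 0.1), (-0.1, -0.1)]   # two charges forming one "atom"
R = 10.0
exact = exact_coulomb(site_A, R)
for rank in (0, 1, 2):
    print(rank, abs(multipole_approx(site_A, R, rank) - exact))
```

The printed truncation error shrinks by roughly two orders of magnitude per added rank here, illustrating the rapid convergence claimed for well-separated sites; at short range (R comparable to the charge spread) the same series degrades, which is why convergence must be monitored.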
Machine learning: kriging tackles complexity of all energy terms responding to geometrical change

Our lab pioneered the use of kriging in the context of potential design. [38] Kriging [39] is excellent in handling the high-dimensional complexity of configurational change in condensed matter. Kriging learns the mapping between an atomic quantity (as output, e.g., a multipole moment or self-energy) and the coordinates of the neighboring atoms (as input, called features F = {f_k}). Figure 3 illustrates this principle on a capped amino acid. In the formula of Figure 3, μ is the mean of the observed values, N_ex is the number of training examples used to build the model (typically of the order of a thousand), w_i is the kriging weight (computed from the correlation matrix), d is the number of (geometric) features (or dimensionality of the feature space), and N_feat is the number of features. By (analytically) maximizing the (statistical) likelihood function, the vectors of parameters θ_k and p_k are optimized. The likelihood is maximized in feature space by particle swarm optimization. [40] The division of the data set into a training and a test set is made less critical by k-fold cross-validation. [41] The four advantages of kriging are: (i) ranking of feature values according to importance, which yields chemical insight, (ii) performance scales linearly with the dimension of feature space, (iii) a trained kriging model is analytical and differentiable, so forces [42] can be computed quickly and accurately, and (iv) the knowledge of how an environment influences an atom is stored in the kriging parameters θ_k and p_k. Hence, there is no need to perform iterative calculations to reach self-consistent energies during a molecular dynamics simulation.
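The kriging predictor itself is compact enough to sketch in plain Python. The fragment below implements simple kriging with a known mean and fixed, invented hyperparameters θ and p (a conceptual sketch, not our production code or the full likelihood optimization), and shows the interpolation property: at a training input, the model reproduces the training output exactly.

```python
import math

def corr(xa, xb, theta=1.0, p=2.0):
    # Kriging correlation kernel for one feature: exp(-theta * |x - x'|**p);
    # theta and p are fixed here, whereas in practice they are optimized
    return math.exp(-theta * abs(xa - xb) ** p)

def solve(A, b):
    # Gaussian elimination with partial pivoting (small systems only)
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def kriging_predict(xs, ys, mu, x_new, theta=1.0, p=2.0):
    # Simple kriging with known mean mu:
    # weights w solve R w = (y - mu); prediction = mu + r(x_new)^T w
    R = [[corr(xi, xj, theta, p) for xj in xs] for xi in xs]
    w = solve(R, [y - mu for y in ys])
    return mu + sum(wi * corr(x_new, xi, theta, p) for wi, xi in zip(w, xs))

# Toy training data: an atomic property (output) against one geometric feature
xs = [0.9, 1.0, 1.2]
ys = [-0.52, -0.50, -0.41]
mu = sum(ys) / len(ys)
print(kriging_predict(xs, ys, mu, 1.0))  # reproduces the training value -0.50
print(kriging_predict(xs, ys, mu, 1.1))  # smooth prediction at an unseen geometry
```

Because the trained model is this explicit analytical expression, its derivatives with respect to the features, and hence atomic forces, follow in closed form, which is advantage (iii) above.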
In Figure 4, we give an impression of the success of kriging. [43] The sigmoidal curves show which percentage of an external test set of two representative doubly peptide-capped amino acids (isoleucine and cysteine) have all their atoms predicted within a certain error. For example, one can read off from the red curve in Figure 4 that 70% of the intramolecular electrostatic energies of isoleucine are predicted to within 4 kJ mol⁻¹ (≈1 kcal mol⁻¹), an often quoted energy threshold of "chemical accuracy."
The training set of molecules is obtained by distorting a particular energy minimum according to calculated normal modes. An alternative generator for a training set is sampling local environments from PDB files. Yet another generator can be a crude (point-charge) molecular dynamics simulation, in case the training focuses on water clusters or hydrated peptides. How to generate a training data set is still an active topic of research in our group, with the possibility of distorting around configurations that are not energy minima. More details will be published as applied to carbohydrates. All of the 20 natural amino acids have been successfully trained by kriging, yielding mean errors comparable to those of the two amino acids illustrated in Figure 4. Figure 5 shows a flow chart of the parameterization protocol for QCTFF, encompassing in-house FORTRAN90 programs, external programs, and a number of scripts. A central program is TYCHE, which generates the training set. It was originally developed around the so-called "normal-mode distortion," which needs local energy minima as a feed. The structure in Figure 5 has been implemented from scratch, streamlined, CPU-optimized, documented, and systematically tested. Finally, the training procedure is being made user-friendly such that users can train QCTFF for their own systems of interest. The full details behind this flow chart cannot be given here due to space restrictions but can be found in Ref. [35].
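A minimal caricature of "normal-mode distortion" (hypothetical mode vectors and amplitudes; not the TYCHE implementation) generates training geometries by adding random linear combinations of mode displacement vectors to an equilibrium geometry:

```python
import random

def distort(eq_coords, modes, scale=0.05, n_samples=5, seed=1):
    # Sketch of normal-mode distortion: each training geometry is the
    # equilibrium geometry plus a random linear combination of mode vectors.
    # 'modes' are illustrative displacement vectors, not computed eigenvectors.
    random.seed(seed)
    samples = []
    for _ in range(n_samples):
        geom = list(eq_coords)
        for mode in modes:
            amp = random.gauss(0.0, scale)   # thermal-like random amplitude
            geom = [x + amp * d for x, d in zip(geom, mode)]
        samples.append(geom)
    return samples

# Toy 1D triatomic: three coordinates and two invented "modes"
eq = [0.0, 1.0, 2.0]
modes = [[-1.0, 0.0, 1.0],    # symmetric-stretch-like displacement
         [-1.0, 2.0, -1.0]]   # bend-like displacement
train = distort(eq, modes)
print(len(train), train[0] != eq)  # 5 distorted copies of the equilibrium geometry
```

Each distorted geometry would then be submitted to an ab initio calculation, and the resulting atomic properties become one training example for the kriging models.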
A more advanced training protocol is currently being investigated, with a deeper understanding of the performance of particle swarm optimization, dynamic training set construction, parallelization and computational efficiency, feature reduction, feature interpretation and scrutiny, and safe default settings.

Parameterization Procedure
The very first tests of geometry optimization, as implemented in the simulation package DL_POLY_4, [44] using all fundamental energy types (except dynamic correlation), are being carried out right now.

Conclusions
The driving force behind QCTFF's development is a step-change enhancement of realism. This step-change should be achieved in a bottom-up fashion, from small to large molecules, and by systematic validation. Constrained by the journal-imposed brevity of a perspective, we outlined the strategy behind the design of QCTFF, a novel protein force field, based on malleable atoms that adjust their electronic structure to the coordinates of the neighboring atoms. Each atom is endowed with this knowledge in terms of four fundamental types of energy contributions (both intra- and interatomic), making the mimicking (through the traditional expressions) of their combined, overall effects obsolete.

Acknowledgments
The author is very grateful to Dr Tim Fletcher for providing the flow chart of Figure 5 and to Joseph Thacker for preparing Figure 1.

Keywords: quantum chemical topology · force field design · potential energy types · multipole moment · quantum theory of atoms in molecules · kriging · polarization