Computational prediction of protein–protein binding affinities

Protein–protein interactions form central elements of almost all cellular processes. Knowledge of the structure of protein–protein complexes but also of the binding affinity is of major importance to understand the biological function of protein–protein interactions. Even weak transient protein–protein interactions can be of functional relevance for the cell during signal transduction or regulation of metabolism. The structure of a growing number of protein–protein complexes has been solved in recent years. Combined with docking approaches or template‐based methods, it is possible to generate structural models of many putative protein–protein complexes or to design new protein–protein interactions. In order to evaluate the functional relevance of putative or predicted protein–protein complexes, realistic binding affinity prediction is of increasing importance. Several computational tools ranging from simple force‐field or knowledge‐based scoring of single protein–protein complexes to ensemble‐based approaches and rigorous binding free energy simulations are available to predict relative and absolute binding affinities of complexes. With a focus on molecular mechanics force‐field approaches the present review aims at presenting an overview on available methods and discussing advantages, approximations, and limitations of the various methods.


| INTRODUCTION
The interaction of proteins is of fundamental importance for basically all processes in living systems. Most functions in a cell are mediated by the assembly of proteins to form transient dimers or oligomers in order to act as enzymes, transporters, or to stabilize the shape of the cell. 1,2 Numerous interactions between proteins in a cell are in principle possible but only a fraction of putative complexes and assemblies is indeed formed and of functional relevance. 3 The associated binding free energy of protein-protein interactions determines the stability of association and the conditions for complex formation. Hence, a full understanding of cellular processes requires not only knowledge of all possible protein-protein interactions but also quantitative insight into the structure and stability of the formed complexes. 2,4 This also includes the effect of mutations in proteins that can modulate or even disrupt the binding to partner proteins. Protein-protein interactions are often mediated by reoccurring subdomains. 5,6 Within different protein families, these domains can differ in sequence resulting in different binding specificities and in different combinations of possible interactions. Predicting and understanding the significance of putative domain-domain interactions requires a quantitative understanding of protein-protein binding affinity.
In recent years, the possibility to design new synthetic protein-protein complexes with a desired function has gained significant momentum. 7,8 One goal is to modify existing natural proteins in such a way that the geometry and affinity of a known protein-protein interaction may change or the interaction with a different protein surface on another protein partner becomes possible. On the longer run, it is desired to create completely new protein partners with programmed surface properties to allow for new interactions with selected candidate partners. 8,9 Such efforts may form the basis for synthetic complexes that can act as designed molecular machines with a desired new function. A prerequisite for the successful rational design of such interactions and new complexes is the detailed understanding of protein-protein interaction and affinity.
The driving force for a protein binding process corresponds to the associated change in free energy that can be related to structural and physicochemical properties of the protein binding partners. Initially proposed by Fischer 10 the "lock and key" concept of binding emphasizes the importance of optimal complementarity of binding partners at the interface as the decisive element for high binding affinity and specificity. Nevertheless, proteins and other biological macromolecules (e.g., RNA and DNA) can undergo various types of conformational fluctuations at physiological temperatures and, hence, are not rigid objects (illustrated in Figure 1). Often significant conformational changes of the binding partners upon association have been observed leading to the induced fit binding concept of partner proteins. 10,11 During binding, proteins appear to induce conformational changes in the partners that are a prerequisite for specific complex formation. In principle, all protein binding processes require some conformational adaptation but these changes can in certain cases be less than 1-2 Å for the root mean square deviation (RMSD) between bound and unbound conformations. 12 Besides of the induced-fit concept, the idea of a pre-existing ensemble of several interconvertible conformational states of proteins at equilibrium has been postulated. 11,13 Within this ensemble, there are structures close to the bound and unbound forms and the process of binding to partner molecules shifts the ensemble toward the bound form. The mechanism of conformational selection versus induced fit has been systematically studied for many known protein-protein interactions. 14 However, every conformation is, in principle, accessible even in the unbound state albeit with a potentially very low statistical weight. Hence, the original induced fit concept is a special case of conformational ensemble selection where only the presence of a ligand gives rise to an appreciable concentration of the bound partner structure. For an accurate calculation of protein-protein binding free energies, it is desirable to account for conformational changes and also for changes in the conformational freedom (conformational entropy) upon binding. 15,16 F I G U R E 1 Protein-protein association can induce different types of conformational changes. Conformational changes can involve side chain flips (a) indicated for a tyrosine side chain flip in the complex of RNAseA (yellow) inhibitor (green: unbound structure, pink: bound structure) complex, pdb1DFJ). For a protease-inhibitor complex (pdb1GL1), a loop refolding is observed (b, blue: unbound inhibitor; yellow: bound inhibitor). Besides local changes also global adaptations can accompany binding (c) demonstrated for the bound (pink cartoon) and unbound (green cartoon) of an RNAseA inhibitor (yellow cartoon: RNAseA) A great variety of experimental techniques is available to determine the structure of protein-protein complexes and assemblies. Protein X-ray crystallography is still the most common high-resolution technique to determine the structure of protein complexes. However, it requires the formation of sufficiently well-ordered crystals that can be difficult or impossible to obtain especially for low-affinity transient complexes and for large assemblies containing multiple partners. For the latter cases, recent improvements of CryoEM (electron microscopy under cryogenic conditions) resulted in solving the structure of many new large and transient biomolecular assemblies achieving often atomic or near-atomic resolution. 17,18 The CryoEM method is limited to complexes above a certain size limit (~100 kD) to achieve sufficient contrast of the image relative to a noisy background but no crystals of the complex are necessary. It requires only a sufficiently large number of object images from different viewpoints that can be combined to solve the three-dimensional (3D) structure. 17 Nuclear-magnetic resonance (NMR) spectroscopy in solution can also be used for structure determination of mostly small dimeric protein-protein complexes. 19,20 In addition, several NMR techniques can be used to quantify protein-protein affinity and also help in modeling complexes if the structure of partner proteins is known. 21 A significant fraction of protein interactions is mediated by protein domains and for many domain-domain pairs 3D structures are available and also data on the range of domain-domain affinities. In such cases, the prediction of domain-domain interaction affinities is often sufficient to estimate the affinity of whole complexes. Databases of protein domain-domain interactions such as 3DiD, 22 SCOPPI, 23 or PIBASE 24 are available that allow identification of interfaces and are helpful to predict the range of binding affinities for domain-mediated protein complexes.
Despite the great progress in recent years, the experimental determination of protein-protein complexes remains a challenging task and it will be impossible to determine experimentally all putative and transient protein-protein complexes of a cell in the foreseeable future. 6,25 An additional difficulty arises because many protein-protein interactions involve conformational changes or even the coupled folding or refolding of disordered segments. In particular, transient protein-protein interactions in cells frequently include disordered protein segments. Such cases complicate both the structure prediction of complexes but in addition also the prediction of binding affinities.
Computational prediction of protein-protein binding affinity requires typically the three-dimensional (3D) structure of the complex or at least a model of the complex structure. As first part of the present review, we will first briefly outline the computational methods to generate structural models of protein-protein complexes (see also Figure 2).
A second prerequisite for evaluating methods to calculate and predict binding free energies are accurate experimental binding affinity data. Several different methods can be used to measure experimental binding equilibria and one can distinguish between methods that require separation of the bound complex from the isolated partners such as gel filtration, electrophoretic separation, or ultracentrifugation approaches or methods that directly measure the concentration of bound and unbound proteins. 26 Methods based on separation of complex and unbound partners require long lifetimes of the complex (small dissociation rate constants) beyond the time scale of the experiment. In direct methods, complex formation can be detected by changes in heat capacity of the solution upon varying partner concentration (e.g., in isothermal titration calorimetry 27 ) or optically using for example an associated change in absorbance or fluorescence. 26,28 In the surface plasmon resonance technique, one partner is immobilized on a sensor surface. Addition of a partner protein that binds to the sensor surface results in a change of the refractive index that can be optically detected. 29 With the method, it is also possible to analyze the time dependence of binding and to determine kinetic constants of protein-protein association and dissociation. For a critical comparison of methods to predict protein-protein binding affinities, the accuracy and consistency of the experimental reference data are of critical importance. It has been found that experimental protein-protein binding affinities can depend significantly on the experimental method. 30,31 In addition, binding data for protein complexes is obtained under different experimental conditions F I G U R E 2 Computational protein-protein docking starting from separate unbound partners typically results in several putative sterically possible complex geometries. Task of a scoring step is to identify a most realistic geometry and to estimate the relative binding affinity of putative docked geometries (variation in ionic strength, pH, and temperature) whereas computational approaches typically assume the same conditions for all evaluated complexes. However, curated protein-protein binding affinity benchmark sets are available for which also structures of the complexes and the unbound protein partners have been determined. [30][31][32] These sets can serve as references for evaluating computational efforts; however, one should always be aware of experimental conditions and the method for binding affinity determination compared to the computational setup.
In the present review on calculating and predicting protein-protein binding affinity, we will first give a brief overview on modeling and predicting the structure of protein-protein complexes. These methods can also be used to predict which proteins in a cellular system may interact. 33 In the second and third sections, we will focus on rapid approaches to rank predicted complexes and to estimate binding affinities based on single complex structures or ensembles of structures with a focus on force field-based methods. In the last sections, we introduce the application of ensemble-based methods and of rigorous free energy simulations either for calculating relative binding free energies, for example, for predicting the effect of mutations on binding affinity, or to calculate the absolute binding free energy of protein-complexes. As will be discussed below, such methods can also include the contribution of conformational changes to binding. The last sections will also include the application of different advanced sampling approaches and a discussion of the possibility to use such approaches for systematic applications of evaluating docked protein-protein complex geometries and possible future developments.

| PREDICTING THE STRUCTURE OF PROTEIN-PROTEIN COMPLEXES
The prediction of the binding affinity of putative protein-protein interactions is related to the computational modeling of the structure of protein-protein complexes and the ranking of possible predicted solutions. 30,34 Besides of experimentally determined complexes, one can distinguish two main computational techniques to provide structures of putative protein-protein complexes. First, computational protein-protein docking methods can be considered as "ab initio" approaches for generating complex geometries although in practice experimental (or other bioinformatics) data can be included to restrict the search for possible solutions. 15,16,25,35 In addition, the de novo design of protein-protein interactions requires docking or modeling of new protein-protein interfaces. 7 Second, for the majority of natural stable protein-protein interactions one often finds homologous complex geometries in the data base of experimentally determined complexes. 36 Hence, many interactions especially those mediated by reoccurring protein domains can be modeled based on similarity to an already known interaction. 32 In the following, we briefly introduce existing protein-protein docking approaches and also discuss template-based protein-protein complex prediction.

| Protein-protein docking
Aim of computational protein-protein docking is the prediction of the structure of protein-protein complexes based on the structure of the isolated protein partners ( Figure 2). The great majority of protein-protein docking algorithms use the optimal complementarity as the main target criteria for predicting interactions. 37 A variety of computational methods exists to efficiently generate a large number of putative binding geometries typically with an initial systematic docking search keeping partner structures rigid. Subsequently, one or more refinement and scoring steps of a set of preselected rigid docking solutions are added to achieve closer agreement with the native geometry and to recognize near-native docking solutions. Among the most common are geometric hashing methods to rapidly match geometric surface descriptors of proteins 38,39 and fast Fourier transform (FFT) correlation techniques 16,37,40,41 to efficiently locate overlaps between complementary protein surfaces. Alternatively, molecular dynamics (MD), Brownian dynamics, Monte Carlo, or multistart energy minimization can be used to generate locally optimized protein-protein docking complexes. 16 To enhance the speed, often coarse-grained (CG) instead of atomistic protein models are used. These methods have in principle the capacity to introduce conformational flexibility of binding partners already at the initial search step but are, nevertheless, slower than FFT-based correlation methods or geometric hashing. Some approaches included conformational changes during docking already during the initial search including soft collective normal mode (NM) directions as additional variables during docking by energy minimization 42,43 or in using swarm optimization. 44 If experimental data or data based on bioinformatics analysis are available, this information can be included either directly during the docking search, for example, in the HADDOCK, 45 ATTRACT, 43 or RosettaDock 46,47 approaches, or can be used to screen and rerank the docking solutions obtained from rapid FFT-based methods. 37 The final stage of a protein-protein docking protocol consists of a structural refinement at atomic resolution of the binding partners and often rescoring of the various docking solutions. Often the final scoring step employs potentials that are intended to also provide an estimate of the binding affinity of the predicted complex (discussed below). Protein-protein docking performance is regularly evaluated in the community-wide blind docking prediction challenge Critical Assessment of PRedicted Interactions (CAPRI). 12,48,49

| Prediction of protein-protein complexes based on homology to known structures
Due to the growing number of solved protein-protein complex structures, it is often possible to generate a structural model of a complex by using a known complex structure as a template. 50 In fact, it has been found that the majority of stable natural protein-protein interactions can in principle be modeled by a template-based approach. 36 The main task is then to find an appropriate and realistic alignment of the target protein sequence to the template sequence. In combination with an accurate prediction of the binding free energy for a template-based complex model, one could then predict if the putative proteinprotein interaction is likely to really occur.
In general, in order to enhance the impact of protein-protein docking in structural biology, it is highly desirable to be able to use partner protein structures obtained by comparative (homology) modeling. The accuracy of such comparative models depends on the correct alignment of target and template sequence. Even in cases of significant average target-template similarity, the quality of the alignment is often not uniform along the whole protein sequence, for example, due to insertions or deletions in the aligned sequences which can result in structural inaccuracies. Overlap of such inaccurate structural segments with the protein region in contact with binding partners may interfere with the possibility to produce near-native complexes using template-based modeling or rigid docking methods. This is also reflected in the fact that docking cases that involve homology modeled protein partners belong to the most difficult cases in the CAPRI docking challenge. 12,49

| FORCE FIELD AND KNOWLEDGE-BASED SCORING METHODS FOR RANKING AND TO PREDICT BINDING AFFINITIES
The comparison of known native protein-protein interfaces indicates typically a well packed interaction region with high shape complementarity between protein partners and few cavities some of which might be occupied by water molecules. [51][52][53][54] Polar side chain or backbone groups buried at the interface are forming hydrogen bonds or other polar contacts. Assuming that both nonpolar as well as polar contacts contribute on average favorably to protein-protein interactions one early simple idea is to relate binding affinity to the size of the buried solvent accessible surface area (buried SASA: BSA) upon complex formation. The BSA is obtained by subtracting the SASA of the complex from the SASA of the two protein partners. Surprisingly, the BSA without considering the detailed nature of the interface correlates already quite well with the experimentally measured binding affinity (R~−.55), if one excludes structures that undergo large changes upon complex formation. 31,54 More sophisticated physics-based or knowledge-based approaches often result in correlation coefficients that are not much larger. However, interestingly, in systematic docking searches, one frequently obtains incorrect solutions with similar BSA as the near-native docked complexes. Apparently, often for pairs of proteins, one can generate complex geometries that have a similar or larger BSA than the native geometry. It indicates that the assumption that polar and nonpolar interface regions all contribute on average favorably to binding might be reasonable for the native interface but not for alternative non-native interfaces. Hence, these incorrect interfaces are either not well packed or contain other unfavorable contacts that prevent favorable association. Recently, it was also found that residues outside of the interface (defined by the BSA) can contribute to binding. 55 In an extension of the BSA approach for binding prediction, Vangone and Bonvin 56 suggested a contact-based scoring with optimal weights on various types of contacts at an interface (e.g., polar-polar, polar-nonpolar, etc.) that showed improved correlation with experimental binding affinity data on a benchmark set (R = −.73).
More sophisticated statistical-or knowledge-based potentials can be designed that either are based on the statistics of residue or atom contacts at interfaces or can even include the distance and orientation of residues (atoms) around interfaces. As the term "knowledge-based" indicates such potentials are extracted from known protein-protein complexes. The underlying concept is to relate the observed frequency of atom-atom or chemical group-group contacts (or distances) to the corresponding expected frequency assuming a random distribution. Overrepresentation or underrepresentation of a given pair of atoms or residues relates to favorable or unfavorable interactions. In most cases, the inverse-Boltzmann statistics with an appropriate reference state is used to derive an effective potential in terms of group distances and possibly also group-group orientation. The idea to extract effective interaction energies between groups or atoms based on contact frequencies in known protein structures dates back to Tanaka and Scheraga 57 and was further pioneered by Miyazawa and Jernigan 58 as well as Sippl and Weitckus. 59 For evaluating protein-protein complexes, a variety of knowledge-based statistical potentials have been designed in recent years. [60][61][62][63][64] The potentials differ in the resolution of the interface description. Several potentials are based on interatomic distances and possibly also orientation 62,[65][66][67][68] or representing the protein on the level of whole residues or chemical groups. [69][70][71][72] The potentials also vary in the number of atom or pseudo atom types or the reference state from which the expected contact or distance probability for atom-pairs are derived. Although mostly used for scoring and ranking docked protein-protein complexes, statistical potentials can also be optimized for predicting binding affinities. In this latter case, one should keep in mind that many contributions to binding ranging from solvent effects, restriction of conformational, rotational and translational mobility of partners (that can all be different for individual complexes) are all averaged over many (training) complexes and merged into atom (or group) pair potentials to estimate binding affinities of predicted complexes. In addition, typically a knowledge-based scoring includes only interaction terms between partners and does not account for possible internal energy changes of each partner. Given these approximations, it is surprising that often quite reasonable correlations to experimental data are observed. 56,62 Besides of using knowledge-based statistical potentials for ranking of docked complexes, it can also be based on a molecular mechanics (MM) type force field description of binding partners. 45,46,73 Similar to the statistical potentials discussed above, these scoring potentials are mostly used to provide a relative ranking of single predicted complex structures but can also be optimized to predict the binding affinity of protein-protein complexes. 32,49 For the purpose of ranking, various terms of the force field are weighted to give an optimal correlation with respect to experimental data on a training set of complexes. 74,75 The majority of force field scoring potentials neglect any intramolecular energy changes of the binding partners and therefore do not include changes in intramolecular interactions or changes in conformational entropy (e.g., restriction of conformational fluctuations upon binding).
For a binding process one can, however, distinguish several energetic and entropic contributions. Binding results in an interaction between binding partners that can involve electrostatic and van der Waals intermolecular interactions. The partners will also change their average conformation (see Figure 1) that is accompanied by a change in intramolecular interactions which is typically an unfavorable contribution. In addition, the interaction of each partner with the solvent (and surrounding ions) will change upon complex formation. For the change in solvation, one typically distinguishes between a nonpolar contribution (related to hydrophobic effect) and a polar (electrostatic solvation or reaction field) contribution. [76][77][78][79] For rapid evaluation of single complex structures, usually explicit solvent molecules are not included. Hence, solvation contributions are calculated using an implicit solvent model (see Figure 3).
It is more realistic to represent a complex geometry not by a single structure but by an ensemble of relevant conformations and to estimate binding affinities from evaluation of the ensemble of conformations. In the linear interaction energy (LIE) method, this is achieved by taking appropriately weighted averages of interactions between partners from MD simulations. 80,81 More frequently, approaches are used that involve a reevaluation of explicit solvent MD trajectories after replacement of the surrounding environment by an implicit solvent model (see next paragraph). In recent years, full atomic resolution MM approaches for the binding partners combined with an implicit continuum solvent model have been applied to evaluate protein-protein complexes. In contrast to the scoring of single complexes, an ensemble of complex conformations is evaluated in MM Poisson-Boltzmann/surface area (MM-PBSA) or MM-GBSA (using the generalized Born method instead of the Poisson-Boltzmann) approaches. In most cases, the ensemble is generated using MD simulations with an explicit solvent representation. To limit the computational demand and to also keep a narrow distribution of conformations near an initial state, usually simulations of a few nanoseconds are performed. 79 Due to the large energy fluctuations of the explicit solvent molecules, the reanalysis of the trajectory (ensemble) is performed after removing the explicit water molecules (sometimes interface waters are retained) employing an implicit solvent model.
For each evaluated complex structure, the mean partner interaction energy can be calculated. In the most basic single trajectory approach, this is achieved by taking the ensemble average energies of the complex and subtract the corresponding energies of the partners from the same trajectory. This approximation implies that ligand and receptor do not undergo significant conformational changes upon binding and changes in intra-molecular energies can be neglected. The mean interaction energy consists of pairwise electrostatic Coulomb interactions and electrostatic (polar) solvation contributions obtained using the finite-difference Poisson-Boltzmann or generalized Born (GB) equations. Nonpolar solvation (or desolvation) is usually calculated from the BSA surface upon complex formation using an empirical surface tension parameter that represents both cavity creation and van der Waals interaction of the protein with the solvent. Alternatively, the nonpolar part can also be split further into a surface area dependent cavity or hydrophobic term and a change in van der Waals interaction between proteins and solvent. The latter contribution can be estimated from a solvent grid representation around the complex 79 or from a surface integral approach. 82 In order to include changes in intramolecular contributions, it is possible to run MD simulations not only for the complex but also for separate (unbound) partners and evaluate each ensemble separately. However, the single trajectory approach is used much more frequently and gives typically better convergence of the mean energies due to cancellation of the intramolecular contributions. Nevertheless, interaction energies obtained from MM-PBSA or MM-GBSA can show significant statistical errors due to numerically small interaction energies that need to be calculated from subtraction of numerically large and slowly converging mean energies (of the complex and the individual partners).
In addition to interaction energies, changes in the conformational entropy of binding partners can be estimated from a NM analysis of the complex and the isolated partners. This term is often neglected due to its large computational costs or replaced using alternative approaches based on the energy fluctuations within the ensemble or a quasi-harmonic (QH) analysis of the trajectories. Methods have also been developed to just obtain the change in translational and orientational (external) entropy of one partner with respect to the other. 83,84 MM-GBSA and MM-PBSA have been used both for the calculation of absolute protein-protein binding affinities and to evaluate docked protein-protein complex structures (reviewed in Reference 79 ). For example, Gohlke et al. 85,86 investigated the Ras-Raf and Ras-RalGDS complex and reported a binding free energy in good agreement with experiment, however, depending on how conformational entropy was estimated and with errors of about several kcal/mol. MM-GBSA and MM-PBSA were successfully used in several studies for reranking protein-protein docking solutions. 87-91 A very systematic study to predict the free energy of binding and to score docked complexes was recently performed by Chen et al. 89 The authors compared various force fields, protocols for performing MD simulations and using the Poisson-Boltzmann or the GB solvation model (not including conformational entropy effects) for 46 protein-protein complexes. The highest correlation between the predicted binding affinities and the experimental data was −0.64 using MM-GBSA, a low interior dielectric constant of 1 and the AMBER ff02 force field. This correlation was better than using MM-PBSA for which the highest correlation was −0.523.
Often water molecules form specific water mediated contacts at protein-protein interfaces that may not be accurately represented in an implicit solvent model. 92 It is straightforward to include interface water molecules in the calculations and only treat the bulk water as a dielectric continuum. 90 The inclusion of an explicit water model has been proven to give good results in various protein-protein binding affinity predictions for single complexes (e.g., Ulucan et al. 93 ). For example, the correlation of MM-GBSA results to experimental binding affinities of 20 native protein-protein complexes was shown to increase significantly (up to 30%) by inclusion of 30 explicit water molecules at the binding interface. 90 On a smaller test set of four proteins, crystal water molecules were added in the evERdock approach that improved water mediated contacts so that the identification of near-native binding decoys could be improved. 94 The changes in conformational entropy upon protein-protein complex formation are usually neglected in the MM-PBSA evaluation due to the large computational costs to perform NM analysis on protein-protein complexes and on the isolated partners. However, alternative methods to estimate the conformational entropy contribution have recently been introduced and tested also for evaluating predicted protein-protein complexes. In the interaction entropy (IE) approach, the protein-ligand or protein-protein interaction energy fluctuations during the MD trajectories are used to estimate the conformational entropy. 95 This method does not allow to calculate absolute entropy values but is applicable to calculate relative entropy changes, for example, after protein-ligand binding. As no extra computational effort is needed, it has been concluded that for receptorligand binding affinity prediction using IE is superior to the standard NM analysis for estimating entropy effects. IE has been successfully applied in studies in combination with MM-PBSA and MM-GBSA calculations of protein-protein 96,97 or protein-ligand 95,97 binding affinities. Interestingly, quite substantial differences in the resulting free energies of binding using NM and IE were encountered in several studies. 95,97,98 In a recent study of Sun et al., 97 entropy effects on the performance of endpoint methods of over 1,500 protein-ligand systems were assessed. The best correlation to the experimental binding affinities was gained with IE, whereas the absolute binding free energy values had the highest correspondence to experimental values using NM calculations. Recently, however, Kohut et al. 98 proposed that the reproducibility of IE is less robust than that of NM or QH, especially for flexible systems. The calculated entropy value is mainly determined by the highest spikes of interaction energy and it is argued that the calculated entropies are difficult to converge as the simulations are prolonged. Aldeghi et al. 99 also found a higher sensitivity of the IE term to the simulated ensemble than the other MM-PBSA terms for three sets of bromodomain-inhibitor pairs. Hence, further testing could be useful to check the robustness of the IE approach. In a study of 20 protein-protein systems using IE with MM/GBSA, the mean absolute error to experimental binding affinities could be substantially reduced by optimizing the residue type-specific dielectric constants, the errors were especially lower than with NM analysis using a standard dielectric constant of 1. 91 Formally, the solute entropy change during association can be split into an external entropic contribution due to the reduction of motion in external degrees of freedom (relative position and orientation of ligand and receptor) and internal entropy (conformational) upon complex formation. Although a full decoupling of external and internal entropy is not in general possible, one can still compute the lowest upper bound of the external entropy. 83 The number of configurations needed to obtain converged results is, however, quite high using the approach with simulations of over 1 μs required for a Barnase-Barstar complex. Furthermore, an external entropy correction alone has been shown to not necessarily improve the correlation to experimental binding affinities for protein-ligand systems. 100

| Mutations influencing protein-protein binding affinities
Mutagenesis of residues at protein-protein interfaces has demonstrated that the contributions to binding affinity are not uniformly distributed but can often be attributed to a small number of residues called hot spots. 101,102 For protein engineering, it is of significant interest to predict changes in binding affinity of protein-protein complexes due to mutations and to identify important residues for the interaction (hotspots). Several approaches based on just single complex structures are available to estimate the effect of interface mutations (recently reviewed in WIREs Computational Molecular Science 103 ). The ensemblesbased MM/PBSA or MM/GBSA approaches can also be used to identify hot spots and to calculate the change in the binding free energy upon mutation of interface residues. 104 The hot spots of 15 protein-protein complexes were calculated recently with MM/PBSA using residue type specific dielectric constants (11 for charged residues and 7 for nonpolar and polar residues). 77 In this study, a mean SE of 1.1 kcal/mol was achieved in 210 mutations after geometry optimization and subsequent MD simulations in explicit solvent. Using an extension of the MM/PBSA method with residue specific dielectric constants, Petukh et al. 105 achieved a high correlation (correlation coefficient of −0.62) with experiment for a set of 1,300 mutations in 43 protein-protein complexes (several other applications are reviewed in Reference 79).
Besides of estimating binding free energy changes due to mutations using single complex conformations or changes in mean interaction energies obtained from the MM/PBSA or MM/GBSA methods, it is also possible to perform alchemical transformations to mutate residues in silico. In alchemical free energy simulations, one represents the selected amino acid side chain by two force fields, one representing the wild type (State A) and the other the mutated residue (State B). During a series of MD simulations, the force field for State A is switched off (decoupled from the interaction with other parts of the system) whereas the force field of State B is switched on. The changes in free energy can be calculated by integrating the generalized force along the switching pathway (thermodynamic integration [TI] 106 ), free energy perturbation (FEP) 107 or using alternative methods such as Bennett' acceptance ratio (BAR) method. 108 In order to obtain the effect of a mutation on binding affinity, the transformation needs to be performed in the complex and for the unbound solvated partner. The advantage of the approach is that all energetic as well as entropic contributions that may influence the change in binding affinity are accurately included (within the limits of a molecular mechanic force field description of the system). The disadvantage is the typically higher computational cost compared to the above described end point methods. Due to methodological progress and increased computational power, alchemical free energy methods are increasingly being used to study the effect of mutations on protein stability and protein-protein binding. [109][110][111][112][113][114] Although typically performed in explicit solvent, it is also possible to perform alchemical transformations in implicit solvent 115 with computational costs similar to MM/PBSA but avoiding the endpoint approximations inherent to endpoint ensemble approaches. A recent systematic assessment of more than 100 mutations (including charge changing mutations) in four protein-protein complexes resulted in a better performance and higher correlation of the alchemical FEP approach (root-mean-square error [RMSE] = 1.2 kcal/mol) than MM/GBSA (RMSE = 1.5 kcal/mol) in reproducing experimental affinities. 116

| Analyzing protein-protein binding by multiple simulations and advanced sampling approaches
In principle, MD simulations allow studying protein-protein association at full atomic detail, including full flexibility of binding partners and explicit inclusion of surrounding water molecules and ions. 87,88 The most straight forward approach is then to start from separate protein molecules and follow binding during sufficiently long MD simulations. Indeed, in early applications starting from separate components (in bound conformation) of the Barnase-Barstar complex relatively short MD simulations (<100 ns) were sufficient to observe complex formation to form a complex in close agreement with the native geometry. 117 However, such association simulations do not allow direct extraction of the associated binding free energy. To obtain the free energy of binding from unrestraint simulations requires both association to the native binding site but also sampling of dissociation from the bound complex. The dissociation constants of typical protein-protein complexes are in the nanomolar range and the dissociation rate can reach minutes or even larger times much beyond the timescale of current MD simulations. Hence, it seems that direct extraction of binding free energies from counting association and dissociation events during sufficiently long MD simulations is in principle only possible for weakly bound complexes with fast association and dissociation characteristics. However, several approaches have been developed to overcome this sampling dilemma in recent years. Using a large number of simulations starting from various initial placements of the Barnase-Barstar system, the Noe group observed multiple binding events. 118 With a further adaptive selection of new starting configurations and addition of a biasing potential to promote also dissociation of the bound complex in Hamiltonian replica exchange (H-REMD) simulations 119 it was possible to sample sufficient association and dissociation events to generate a Markov model for the binding process. 118 Such Markov model allows the extraction of kinetic rates for transitions between various transient states of the system and also of the associated thermodynamic quantities. 120,121 It includes not only the sampling of the native binding but allows also the characterization of alternative and intermediate encounter binding states. For the Barstar-Barnase system, the kinetic rates and the free energy of binding could be extracted in good agreement with experimental data. As an alternative to multiple unrestraint simulations to obtain a Markov model for the binding process, it is possible to use weighted ensemble (WE) methods to study binding processes along a preset reaction coordinate (RC). 122,123 Briefly, in this technique, the space along one or more RCs is discretized in intervals and simulations are distributed along the coordinates such that each interval is populated by a fixed number of simulations. Each simulation is assigned a statistical weight that can be transported to neighboring intervals and new simulations are then started or eliminated such that the total statistical weight and the number of simulations per interval remains constant. The WE technique allows eventually the extraction of both kinetic and thermodynamic data along the binding RC. 123 Although so far only applied to study small-ligand binding to proteins, 123,124 in principle, it could also be used to study protein-protein binding.
Finally, the idea of destabilizing the bound state with an added biasing potential during otherwise unrestraint MD simulations has been utilized in recent simulated tempering simulations applied to a set of six different protein-protein complexes. 125 During the simulated tempering, different levels of the biasing potential that weakens the protein-protein interaction are applied. Still extremely long MD simulations in explicit solvent were required (>100 μs) to allow reversible association and dissociation of the protein-protein complexes. 125 It is also possible to extract kinetics and binding free energies from these simulations. Despite the rapid increase in computational resources, it is, however, unlikely that such methods will be used as routine methods to predict protein-protein binding affinities in the near future. One should keep in mind that the costs for the electricity to run such simulations alone exceed by far any experimental effort to determine the corresponding protein-protein binding affinities. An approach using a restraining potential to avoid trapping of protein-protein complexes in nonspecific transient states but at the same time restraining the partners not to separate has been introduced to refine docked complexes. 126 Different levels of the biasing potential are applied in an H-REMD simulation with one replica running under the control of the original force field allowing rapid identification of putative complex geometries and possibly also an estimate of the binding free energy.

| Binding free energies from advanced sampling including geometrical restraints
Complex formation of two partner molecules leads to the restriction of the translational and rotational degrees of freedom of the partner molecules relative to each other (Figure 4). In addition, the conformation and conformational freedom of the partners can change and the interaction with the surrounding solvent and ions is affected. Finally, van der Waals and electrostatic interactions may stabilize or destabilize a bound state. All these contributions influence the affinity or binding free energy of the complex. In binding free energy simulations, one ultimately aims at calculating free energies of binding including accurately all the above contributions.
Using Monte Carlo or MD simulations, it is indeed possible to calculate these contributions rigorously and to obtain absolute binding free energies of protein-protein complex formations. In principle, an alchemical transformation pathway to annihilate completely one partner protein in the complex and in the unbound state is possible (as described in the previous section). The difference in free energy changes for these two calculations (and inclusion of standard state conditions) gives the absolute binding free energy of the two proteins. However, the annihilation of a complete protein force field in an explicit solvent simulation box may suffer from insufficient convergence. The resulting binding free energy is a small number obtained from the subtraction of large free energies of annihilating (or creating) the protein force field in the bound and unbound states.
F I G U R E 4 Binding free energy calculations including geometrical restraints. The conformational flexibility as well as the relative rotational/axial degrees of freedom of the binding partners are restrained during the calculation of the potential of mean force (PMF) along a separation coordinate (typically a distance r). This limits the necessary sampling of relevant states during the PMF calculation (third row). The contributions of the restricted mobility can be calculated at the end points (bound and fully separated states) of the PMF simulation using either analytical or free energy perturbation (FEP) methods. The absolute binding free energy between the unrestrained proteins ΔG bind (top row) is calculated by accounting for several free energy contributions through the illustrated thermodynamic cycle. It requires the separate calculation of ΔG conf (site,bulk) (second row), orientation ΔG orient (site,bulk), and direction of separation ΔG axial_site (third row) Instead of an alchemical pathway, it is also possible to use a spatial coordinate to dissociate (or associate) a complex during an MD or MC simulation and record the associated free energy change. Since dissociation of a high-affinity complex typically does not occur during unrestraint MD simulations, a key element is to induce such dissociation (and/or association) along the preselected RC. In the majority of cases, this RC is a distance, for example, between the center of mass of binding partners or involving other subsets of atoms near or around the binding sites on the partners. By adding a biasing potential (umbrella potential) along this RC during the MD simulations, it is possible to induce dissociation or guide association of the complex. In most applications, this is achieved in discrete steps by adding a quadratic umbrella potential with reference values for the RC change in steps of 0.5-2 Å in case of a distance RC. The weighted-histogram analysis method 127,128 or related algorithms are then used to calculate a potential of mean force (PMF) along the RC. This PMF represents the free energy change associated with the dissociation or association of the partners along the RC. Sufficient sampling of relevant states within each umbrella window and overlap of sampled states between neighboring windows is of critical importance for convergence of the calculated free energy profile. 111,129,130 Significant improvements can often be achieved by coupling Umbrella sampling with H-REMD allowing exchanges of sampled states between separate U.S. windows. 129,131,132 Another possibility is to calculate first an approximate PMF and add this to bias the simulations along the RC to smooth the effective interaction for calculating a precise PMF in a second step or iteratively in several steps. 133 Alternatively, it is possible to use Umbrella integration, 134 metadynamics, [135][136][137] or to apply an adaptive force (ABF) 135,136 along the RC to offset the effective interaction between partners. In case of metadynamics, this is achieved by adding a series of biasing potentials in the form of Gaussian functions along the RC. 137 The process is continued until a uniform diffusive sampling along the RC is observed. The added sum of Gaussian biasing potentials represents the free energy change along the RC. In the ABF approach, the added biasing force along the RC can be integrated to obtain an associated free energy change. Another recent approach to efficiently obtain free energy changes for reversible dissociation and association of a protein-protein complex is the perturbed distance restraints approach. 128 In this case, the bound and unbound states are characterized by different sets of distances that are included as distance restraints and a coupling parameter lambda is employed to transform between the sets of restraints. The free energy along the coupling parameter can be extracted by TI or BAR methods and H-REMD can be used to improve the convergence of the binding free energy calculation. 138 The calculation of the PMF along an RC for dissociating a complex without any other restraints up to a state where the interaction of the partners can be neglected gives directly the free energy of binding (after accounting for standard state conditions of the binding partners). However, complete freedom of the conformation and relative orientation of the binding partners means also that in order to achieve converged free energy results sufficient sampling of relevant states at every position along the RC is required. This includes all possible orientations, placements, and conformations within each window in case of Umbrella sampling. The convergence of calculating the free energy of dissociation/association along an RC can be enhanced by restricting the relative orientation and conformational freedom of the binding partners. Woo and Roux 139 devised a method by using a series of simulations including restraints on the spatial arrangement and conformation of the binding partners to further reduce the necessary sampling at every step of the PMF simulation (Figures 4 and 5).
The contributions due to restricting orientation and placement as well as conformation need to be calculated only at the end points of the dissociation/association process (bound state vs. noninteracting separated state). Hence, much fewer states need to be sampled within each U.S. window resulting in enhanced convergence of the PMF calculation. The restriction of orientation and axial placement in the separated state can be calculated analytically whereas commonly a FEP approach is employed to obtain the corresponding contributions in the bound complex.
The contribution due to restriction of the conformation of the partners typically using a restraining term based on the RMSD of partners from a reference conformation is more difficult to estimate. Frequently, a single-step perturbation approach is used with an unrestrained simulation of the isolated partners and the complex as a reference and treating the conformational restraining potential as a (single-step) perturbation. The treatment assumes that an unrestraint simulation samples all relevant conformations in case of the complex but also for the unbound partners. If significant conformational changes are associated with complex formation (e.g., refolding processes), this may not be sufficient to estimate the restriction of conformational space due to binding. More accurate but also more demanding is the possibility to perform free energy simulations along an RMSD coordinate with respect to a reference conformation. 139 Free energy calculations based on PMFs along a spatial (not alchemical) coordinate have been widely used to calculate absolute binding free energies. 130 However, most of these applications focused on drug binding, ion binding, or other small organic molecule binding to biological macromolecules. Typically, such PMF simulations are performed in explicit solvent that results in large system sizes and requires also long simulation times to achieve reasonable convergence. Only recently, applications to calculate protein-protein affinities have been published. 138,140,141 These examples indicate that the methodology, in principle, allow accurate calculation of protein-protein binding free energies if the bound complex structure is known. In practice, it is, however, desirable to use accurate free energy calculations to predict if a given docking geometry or designed interaction is realistic and to predict the affinity of the predicted interaction. Since PMF simulations in explicit solvent can take several days for a single binding mode depending on the system size an application for systematic evaluation of many docked protein-protein complexes is out of reach with current resources. The representation of both binding partners by a CG model is one possibility to drastically reduce the computational demand. 142 This was indeed successfully employed on T-cell receptor interactions with MHC-peptide complexes. 142 However, one should keep in mind that a CG model misses many details of intermolecular interactions such as hydrogen bonds and several conformational degrees of freedom are merged into effective interaction potentials. The agreement with experiment will in general dependent on how it is parameterized with respect to experimental binding data or effective interactions between CG centers.
Instead of using an explicit solvent representation, it is also possible to use an implicit, for example, GB, solvent model during U.S.-based binding free energy simulations at atomic resolution. On modern graphical processing units (GPUs), GB simulations can be performed very efficiently. Due to the instantaneous solvent response (at every time point in equilibrium with the solute structure) and the possibility to use a low viscosity, it also allows for faster convergence than explicit solvent simulations (in each U.S. window). In a recent study, it has been demonstrated that it is possible to directly employ such binding free energy simulations to score a reasonable set of 50 decoy complexes for 20 test complexes within about a day on a GPU cluster. 143 The calculated binding free energies for the near-native complex geometries were in reasonable agreement with experiment. Also, improved ranking of near-native docking solutions compared to simple single complex structure evaluation after energy minimization or short MD simulations was observed. 143 The study demonstrated the feasibility for systematic evaluation of predicted complexes based on calculated binding free energies instead of single point interaction, knowledge-based scores, or mean interaction energies. Further developments of the approach that preferably employ an explicit solvent representation are desirable for evaluating predicted complexes and allowing a clear judgment if a given geometry is thermodynamically stable or may represent only a transient state or represent an unfavorable binding geometry.

| CONCLUSION
In recent years, the number of experimentally determined protein-protein complex structures and structures of large multiprotein assemblies has grown rapidly. In addition, bioinformatics data, data on residue conversation at putative interfaces, low-resolution experimental data, and the increasing number of protein-protein template structures allows to create structural models of many putative complexes and interface geometries. Hence, judging if a predicted complex geometry or putative domain-domain interaction is stable and of functional relevance is of increasing importance. This is also of importance for the F I G U R E 5 Illustration of the rotational and axial angles that are typically used to restrain the rotation of one partner and to restrain the axial placement (relative to the second partner) in PMF-based free energy simulations. For the definition, three centers in each partner need to be selected (indicated as blue and green spheres). These can be centers-of-mass of groups of atoms. The three Euler angles α, χ, and γ are used to restrain the rotation of one partner and the two axial angles θ and ϕ restrict the direction of r to separate the protein partners. Additionally, the conformation of both partner proteins is restrained, preferably via root mean square deviation (RMSD) restraints with respect to a reference structure. The distance r is typically used as reaction coordinate for the PMF calculation to induce dissociation (or guide association) of the complex design of new protein-protein interactions preferably with a controlled affinity. For protein-protein interactions in the cell, one should keep in mind that the effective interaction is influenced by the distribution of proteins in the different compartments of a cell and the crowded environment. Nevertheless, an accurate estimate of the affinity of a putative complex structure is already valuable for the isolated complex. Rigorous binding free energy calculations employing a spatial coordinate for protein-protein binding and unbinding on a so far limited number of cases show promise and may offer a route to obtain the desired accuracy to offer reliable predictions (that do not need to be controlled by experiment). However, if the computational effort is far larger than an experimental affinity measurement the practical value is limited. The combination of fast scoringtype approaches either based on single structures or ensembles to preselect likely putative natural or designed interfaces and limit the application of the most time-consuming and accurate methods to a small subset might be a reasonable route for practical applications. It is important to note, however, that rigorous binding free energy calculations even on known complexes with known binding affinity allow one to distinguish and characterize different energetic and entropic contributions to binding often difficult to obtain experimentally. Hence, these studies give valuable insights into the mechanism of specific proteinprotein recognition. It is important to note that in vivo not all interactions reach always equilibrium and hence not only the binding affinity but also the kinetics of interactions are of increasing relevance. 144,145 Finally, with increasing computational resources and development of improved advanced sampling methods, the option to directly follow protein-protein association and dissociation offers the opportunity for an in depth understanding of transient nonspecific and specific binding.