Measuring the successes and deficiencies of constant pH molecular dynamics: A blind prediction study

A constant pH molecular dynamics method has been used in the blind prediction of pKa values of titratable residues in wild type and mutated structures of the Staphylococcal nuclease (SNase) protein. The predicted values have been subsequently compared to experimental values provided by the laboratory of García-Moreno. CpHMD performs well in predicting the pKa of solvent-exposed residues. For residues in the protein interior, the CpHMD method encounters some difficulties in reaching convergence and predicting the pKa values for residues having strong interactions with neighboring residues. These results show the need to accurately and sufficiently sample conformational space in order to obtain pKa values consistent with experimental results.


INTRODUCTION
It is well established that the structure and function of a protein are highly dependent on the pH of its surrounding environment. The pK a of a titratable residue, which is heavily influenced by interactions with neighboring residues within a protein, governs the protonation state of that residue for a given solution pH. Changes in protonation state within a protein manifest as alterations to the charge distribution of the titratable residue, influencing the electrostatics of the protein environment. Protonation equilibria are thus closely linked with protein conformation, evidence of which is the sensitivity of proteins to denaturation at extreme pH.
The interplay between protonation state and protein conformation is not accounted for in conventional molecular dynamics (MD) simulations. Currently, these simulations employ fixed, predetermined protonation states for titratable residues, which are generally chosen according to the pK a value of the respective residue when isolated in solution. This method of protonation state assignment can be a severe approximation, as ABSTRACT A constant pH molecular dynamics method has been used in the blind prediction of pK a values of titratable residues in wild type and mutated structures of the Staphylococcal nuclease (SNase) protein. The predicted values have been subsequently compared to experimental values provided by the laboratory of García-Moreno. CpHMD performs well in predicting the pK a of solvent-exposed residues. For residues in the protein interior, the CpHMD method encounters some difficulties in reaching convergence and predicting the pK a values for residues having strong interactions with neighboring residues. These results show the need to accurately and sufficiently sample conformational space in order to obtain pK a values consistent with experimental results. coupled with free energy perturbation techniques for pK a calculations. 11-13 A drawback of these techniques is their high reliance on the resolution of the input structure, which renders these methods incapable of calculating pK a shifts where protonation is accompanied by large conformational change.
Another class of methods incorporates the important coupling of conformation and protonation state through the use of computational simulations that employ pH as an external thermodynamic parameter. 14-26 These methods are often described as either continuous or discrete constant pH methods, contingent on how titratable protons are considered within the simulation. The former treats protonation state as a continuous titration parameter that advances simultaneously with the atomic coordinates of the system. 19-21 Originally, the implementation of this method used a mean-field approximation, and protonation sites could exist as fractionally occupied. More recently, Lee et al., have developed methods to overcome issues with fractional protonation states, using k-dynamics with an artificial titration barrier to discourage fractional protonation. 22 Extensions to the work of Lee et al., have incorporated proton tautomerism 23 and enhanced sampling methods to improve convergence. 24 Discrete constant pH methods avoid non-physical intermediate charge states. These methods use MD simulations for conformational sampling, while sampling different discrete protonation states with periodic Monte Carlo (MC) steps interspersed throughout the MD trajectory. 14-18 The methods employed in this article utilize the constant pH MD (CpHMD) method, originally developed by Mongan et al., 14 which uses generalized Born (GB) implicit solvent. Differences among these methods arise from choice of solvation model and protocols for updating protonation states within the simulation. Although these methods have achieved good results for small protein systems, they can be computationally expensive, and long convergence times have been reported for systems with multiple titration sites. In an attempt to overcome these issues of convergence, use of enhanced sampling methods coupled with constant pH MD, such as constant pH accelerated MD (CpHaMD) 25 and constant pH replica-exchange MD (REX-CPHMD) 26 have been investigated. Results from simulations employing these methods indicate the increased sampling provides improvement over the conventional method.
The previous paragraphs provide only a brief summary of the computational methods available for pK a prediction. Further details of these and other methods can be found in the literature, and several reviews have been published. [27][28][29] In this study, the successes and deficiencies of the CpHMD method have been investigated in the blind prediction of pK a values of titratable residues of the WT and mutant forms of the Staphylococcal nuclease (SNase) enzyme based upon comparison to experimental results released after submission to the pK a cooperative 30 by Gar-cía-Moreno and coworkers. 31-37 Particular attention is paid to the differences in electrostatics and, consequently, acid/base properties of exterior and interior residues.

Constant pH molecular dynamics-Theory background
CpHMD employs MD with GB implicit solvent. 14 Within the simulation, the MD simulation is periodically halted, and a MC step is taken, randomly considering a titratable residue for change in protonation. The transition energy corresponding to this MC step is evaluated according to Eq. (1), which calculates pK a with respect to a reference compound for the residue of interest. Reference compounds are the isolated titratable residues solvated in water (reference pK a values are 3.8 for ASP, 4.3 for GLU, 6.8 for HIS, 9.6 for TYR, and 10.5 for LYS). 14,38,39 In Eq. (1), k B is the Boltzmann constant, T is the temperature, pH is the specified solvent pH, pK a,ref is the pK a of the reference compound, DG elec is the electrostatic energy change for protonation state change of the titratable residue, and DG elec,ref is the corresponding electrostatic transition energy for the reference compound. The same GB electrostatics employed in the MD is used for calculating this transition energy, with acceptance of the change in protonation determined by the Metropolis criterion. If the MC move is accepted, the protonation state of the residue will change to the new state, and MD is continued. If not, the simulation will continue with the residue remaining in the unchanged protonation state. CpHMD has been successfully applied in the pK a prediction of titratable residues in the Hen Egg White Lysozyme (HEWL) enzyme. 14 Titration curve construction and pK a calculation The predicted pK a values are calculated from performing CpHMD simulations over a range of solution pH values. Assuming the system is ergodic, we assume fractional protonation is given by the amount of time a particular titratable residue spends in its protonated state. 15 Thus the fraction of deprotonated species, s, for a residue at a specific pH value can be used to predict the pK a from a Hill plot [Eq. (2)]. 16,20,22,40 Fits to this curve allow for estimation of both the pK a value as a midpoint of titration, as well as the Hill coefficient, n, which describes the cooperativity of various sites with respect to titration. 41 Illustrated in Reference 40, for example, the usefulness of the Hill equation resides in its ability to provide a good prediction of the midpoint pK a value, even when the fit is inaccurate at the tails of the titration curve. 40

Test system: Staphylococcal nuclease
Staphylococcal nuclease (SNase) is a highly charged protein, which has generated difficulty in obtaining accurate structure-based pK a predictions. 42 The structures of the wild-type and mutant proteins of the SNase system are provided for this study by the lab of Garcia-Moreno et al., who measured the pK a shifts of the titratable residues using NMR spectroscopy. 31-37 Along with other computational groups, we have computed blind pK a predictions for residues of wild-type SNase (PDB ID: 1STN or 1SNC), the SNase mutant D1PHS (PDB ID: 3bdc), and various mutants from the D1PHS parent protein (referred to as calculated results in this study). D1PHS is unique in that it is a hyperstable, acid-resistant SNase mutant with five substitutions (G50F, V51N, P117G, H124L, and S128A) and a deletion of residues 44-49. 31,32 Garcia-Moreno directed this effort, holding experimental pK a determinations from those making predictions and picking residues of interest for pK a prediction (hereafter referred to as experimental results in this study). In total, approximately 93 structures of the wildtype (WT) and mutant SNase have been provided. 30 Owing to time constraints, however, our CpHMD pK a predictions have not been carried out on the entire set of provided structures, the subset of which were studied and submitted as blind predictions shown in Table I.

CpHMD simulations
The standard CpHMD method has been implemented in AMBER10 molecular dynamics program. All simulations are conducted with the AMBER99SB force field 43 and the GB solvent model igb52, 44-46 using a 30 Å cutoff value for nonbonded interactions and computation of effective Born radii calculations. Similar to experimental conditions, salt concentrations are set to either 0.1 or 1.0M. The SHAKE algorithm constrains all bonds involving hydrogen with a time step of 2 fs, 47 and temperature is maintained at 300 K using the Berendsen temperature coupling method with a time constant of 2 ps. 48 A period of 10 fs of MD separates the MC trials. With these parameters, a 10 ns CpHMD simulation takes approximately 72 h using 16 Xeon X5650 2.67GHz processors.
All simulations begin from the crystal structure coordinates of the WT and D1PHS SNase systems provided by Garcia-Moreno et al., from which specific titratable residues were chosen for the blind prediction study (Table I). For the blind predictions performed on SNase systems where a single ASP or GLU residue has been highlighted as the residue of interest, CpHMD simulations of 10 ns in length have been performed in the solution pH range 2.0-7.0 at 0.5 pH unit intervals, titrating only acidic residues. For these simulations, HIS residues are allowed to titrate from pH 4.5 to pH 7.0. In systems where a LYS or TYR is highlighted as the residue of interest, simulations have been carried out in the pH range 7-10.5, where HIS, LYS, and TYR residues are set to titrate. The exclusion of HIS residues from the most acidic simulations is justified, as the pK a of the HIS reference is around 6-7. 39 In most cases, it is safe to assume all ASP and GLU residues are deprotonated above pH 7 and all LYS and TYR are protonated below pH 7, allowing exclusion of these residues from titration in these respective pH regions. Models for the terminal residues have not yet been developed for this system, so these res- idues are set to their most likely protonation states at neutral pH, with the N-terminus protonated and the C-terminus deprotonated. All non-titrating residues are set to their expected protonation states.

Simulations conducted after publishing of experimental results
To understand why some predictions fail to reproduce the experimental results, further CpHMD simulations have been conducted, as indicated in the proceeding sections. In particular, the pH range at which simulations were originally performed is extended to account for residues that deviate the most from their reference value. In cases where convergence has been determined to be problematic, extended simulations (>10 ns) do not appear to improve predictions (results not shown).

Titration curves
Titration curves are obtained from CpHMD simulations for 32 titratable residues of the WT, D1PHS, and mutant D1PHS SNase systems. The experimental pK a values for the WT protein were published prior to the predictions, but those for the mutant proteins were withheld until blind predictions were made (Table I). From Eq. (2), pK a values are calculated along with the standard errors of regression for curve fits to the Hill equation in Eq. (2) ( Table I). 49 In the instances of ASP21 and L37D, pK a values cannot be computed due to the lack of transitions between protonated and deprotonated forms.
A representative plot of calculated pK a over time is given in Figure 1 for both a surface residue for which the CpHMD predicts pK a accurately (D1PHS, GLU52) and the D1PHS mutant L36D, for which the pK a prediction of the interior residue ASP36 deviates by more than three pK a units from the experimental result. In the former case, the pK a converges rapidly, whereas ASP36 in D1PHS L36D is indicated to not achieve convergence over the duration of the simulations.
In assessing the convergence of a system, it is interesting to observe the trend in the number of transitions between protonated and deprotonated states as a function of pH. In systems that are well converged (e.g., D1PHS, GLU52), the greatest number of transitions between deprotonated and protonated states within the CpHMD scheme are found for the simulation conducted at a pH nearest to the calculated pK a value. Simulations conducted at pH values far from the predicted pK a encounter fewer transitions between protonation states, as is to be expected from the acceptance criteria defined in Eq. (1). This is not the case for certain systems (e.g., D1PHS L36D), where convergence is a problem. Thus the presence of a clear distribution of transitions across the different pH values simulated, peaked at the pH nearest the predicted pK a , may be an indicator of how well converged the system is.

Experimental validation
Following the blind predictions, García-Moreno and coworkers have released experimental results for comparison to predicted pK a values (Table I). 30 In summary, CpHMD simulations calculate the pK a values of 17 residues to within 1 pK a unit of the experimental value, 9 residues within 2 units, 1 residue within 3 units, and 2 residues within 4 units. Residues in the D1PHS protein chosen for analysis are surface residues. All of these residues remain solvent-exposed throughout the CpHMD simulation and are generally well predicted with respect to experiment (Table I). From Figure 2, it is clear that the predictions with the largest deviation from the experimental values are for D1PHS variants that have residues  located within the hydrophobic interior of the protein (e.g., L37D, L36D). These residues have also been found experimentally to have the largest shifts in pK a from their reference values (Table I). Errors with respect to the experimental pK a are shown in Table I, with predictions within ranges of experimental pK a considered to have zero error.
As described in Table I, CpHMD simulations have correctly predicted the experimental trends for the majority of these buried residues, three within 1 pK a unit of the experimental result (G20E, V23E and V23K). The simulations predict the pK a values of F34E/K, L25D, L36D, and V23E/K to be shifted from their model values in the direction of favoring the neutral residue at physiological pH; although, for some of these residues, the shift in the predicted pK a is not as large as that found experimentally (Table I).

DISCUSSION
Residues with pK a predictions greater than 1 pK a unit from experimental Since the release of experimental results, further simulations have been carried out to investigate why our methods predict pK a values that deviate more than 1 pK a unit from experimental results. For this article, we have chosen a selection of residues that illustrate problems with the application of the CpHMD method to these specific systems.

D1PHS: ASP21
In the D1PHS mutant, the residue ASP21 is a notable exception to the good performance of CpHMD in pre-dicting pK a values of surface residues. This problem arises from a lack of transitions between protonated and deprotonated states. In this case, longer simulations fail to alleviate the problem, likely due to the existence of a strong, charged hydrogen bond interaction between ASP19 and ASP21 preventing changes in protonation state from occurring. Consistent with our results, García-Moreno and coworkers have needed to apply two-site binding isotherms to properly describe the experimental titration of these interacting residues and have also noted the difficulty in predicting the pK a for ASP21 computationally. 31 Similar problems arise in the simulation of L37D, indicating that sampling of protonation states is critical to the performance of the CpHMD method. Use of enhanced sampling techniques to allow the system to sample other protonation states may be necessary for accurate pK a predictions in conventional simulations where strong interactions persist.

D1PHS G20K
For the mutant D1PHS G20K, the CpHMD method predicts a pK a of 8.6, nearly two pK a units lower than the experimental value (>10.4). 33 This lysine residue sufficiently sampled protonation space, encountering more than 600 transitions over the duration of each simulation. The trajectories of these simulations incur large motions indicative of protein instability. The root mean square distances (RMSD) with respect to the starting structure for these simulations do not converge at any solution pH, largely influenced by the winding and helical motion of the last 20 residues of C-terminus (Fig. 3). To further probe this conformational change, we performed conventional MD simulations with set protonation states computed by the program PRO pK a 50 for pH 7-10 at intervals of 0.5. The conventional MD simulations similarly suffer from protein instability near neutral pH (pH 7 and pH 8), although take longer to encounter it than CpHMD simulations. At higher pH, the terminal helix in the G20K protein does not incur the same motion observed at neutral pH and in CpHMD simulations. These simulations show the sensitivity of the G20K protein toward change in protonation state, and in order to achieve results closer to experiment with the CpHMD method, it may be necessary to spatially constrain the termini. These findings indicate that although increased sampling is desirable and may be achieved in certain systems, it is important that the correct conformational space is sampled to attain an accurate prediction of pK a .
In further probing the problems involving predicting the pK a for G20K, it is noteworthy that other mutations at site 20 generate similar instabilities (e.g., G20D and G20E). Having an acidic residue at site 20, however, does not affect the pK a prediction to the same extent. Visual analysis of trajectories for the G20D protein reveals that hydrogen bonds from Thr-29 persist throughout the sim- Plot of predicted versus experimental pK a values for WT SNase (l), D1PHS (x, exterior residues), and D1PHS mutants (D, internal residues). 31-37 The line y 5 x represents accurate prediction of the experimental pK a .
ulations, likely lowering its pK a (Table I). Similarly, G20E forms transient hydrogen bonds with Thr-29. All mutated residues at site 20 sample conformational space that is solvent-exposed, in addition to time spent buried in the protein interior. While this explains the propensities for G20D and G20E to exist in their charged states, it fails to explain the shift in pK a for G20K.

D1PHS F34E
CpHMD simulations performed on the D1PHS F34E mutant consistently obtain a predicted pK a (5.7) lower than experiment (7.30). 32 The stability of this particular mutant shows great sensitivity to the pH of the simulation, with large conformational changes occurring at acidic pH (Fig. 4). Nevertheless, the pK a values calculated at neutral pH-closer to the pK a of the residue-still underestimate the experimental pK a despite undergoing a large number of transitions between protonated and deprotonated states. Upon visualization of this structure, it is notable that the carboxylate of GLU34 forms salt bridges with an adjacent arginine residue (ARG81), which causes this residue to favor its deprotonated state. The simulations may not sample enough conformational space owing to the persistence of this salt bridge, therefore leading to a predicted pK a value lower than the experimental result. Enhanced sampling techniques may provide the means to allow the system to escape this GLU34-ARG81 salt bridge and give a more representative prediction of pK a .

D1PHS L36D
The mutant L36D suffers from sampling problems, both of conformational and protonation space (Fig. 1). While at certain pH values CpHMD simulations correctly predict the pK a , which experimentally is found to be 7.90, 36 there is no clear trend in pK a prediction for simulations conducted at different levels of pH (Fig. 1). From visualization of the various trajectories, it is suggested that ASP36 may form a strong hydrogen bond with ASP21 in the MD simulations, stabilizing the deprotonated form. This scenario is seen at pH 4.5, where the simulation more accurately predicts a pK a of 7.4. In other cases, ASP36 becomes buried in the hydrophobic interior of the protein, again leading to insufficient sampling of different protonation states. It is therefore likely that L36D needs to better sample conformational space in order to more effectively predict the pK a of ASP36.

Analysis of CpHMD performance
It is clear the CpHMD method performs better at predicting the pK a values of solvent-exposed residues, which possess pK a values closer to their reference compounds (Table I). This is evident from the calculation of the root mean square error (RMSE) of predicted pK a values, measured against the experimental work of García-Moreno to quantify this result, showing that residues on the surface of D1PHS deviate from experiment with an RMSE of 1.23, whereas the RMSE for interior residues of the various D1PHS mutants is 2.42 (Table II). 31 Most residues found at the surface of the protein encounter an increased number of transitions between protonated and deprotonated forms, and tend to converge relatively   quickly ($6-8 ns). The counterexample to this trend is ASP21, which likely fails to transition due to sampling problems derived from the persistence of its hydrogen bond with ASP19.
The importance of selecting a suitable pH range for titration and difficulties in achieving proper sampling of conformational space are illustrated in some pK a predictions of interior residues for the various D1PHS mutants. Given that many internal residues are found experimentally to have pK a values shifted considerably from their reference pK a , it is thus important to set up simulations over a wide pH range to conduct the titration. For example, in the case of L36D, simulations were performed at acidic pH under the assumption that the pK a of the aspartic acid would exist closer to its reference value of 3.8. In fact, experimental results show this residue to titrate at a pK a of 7.80. The selection of the pH range is also important for the stability of the system when performing CpHMD simulations, as illustrated by D1PHS F34E.
Analyses of computed pK a over time show internal residues to be far less converged compared to surface residues, making the prediction of accurate pK a values more challenging. Although the CpHMD method applied in this study usually predicts the direction of the pK a shift from the reference compounds correctly, there is still room for improvement in accurately predicting pK a for internal residues.
Residues buried within the protein environment experience dielectric environments quite different from those at the surface of the protein, with their pK a properties very susceptible to the nature of the residues in their vicinity and thus more difficult to treat computationally. 31,42 This difficulty is illustrated by the D1PHS L36D and F34E proteins, where strong hydrogen bonds or salt bridges involving these titratable residues affect their protonation equilibria. Simulations of L36D do not contain any transitions between deprotonated and protonated forms owing to the persistence of an interaction between the ASP36 and ASP21 residues. For residues such as this, the use of an enhanced sampling method, such as accelerated MD, may assist in the sampling of relevant conformations and thus protonation states. The requirement for increased sampling is also highlighted in instances where salt bridges persist throughout the simulation, as in the case of ARG81-GLU34 in the F34E mutant protein. The CpHMD method severely under-predicts the pK a of this glutamate, suggesting it spends more time in its deprotonated form than experiment predicts. 31 It is possible, although not proven in our studies, that the strength of salt bridges sampled in our CpHMD method is overestimated under the GB implicit solvation, thus leading to error in predicting protonation state. 51 Although not specifically quantified in this study, errors likely exist in CpHMD simulations due to the use of implicit solvation and conventional (non-polarizable) force fields. With regards to implicit solvation, issues regarding global protein movements would likely be dampened by the presence of explicit solvent molecules. Despite this generally accepted point, we believe CpHMD simulations employing implicit solvation still merit further study due to the simplicity of protonation changes and transition energy calculations. With regards to force fields, polarizable force fields would likely better capture the sensitivity of neighboring groups to changes in the protonation state. The topic of force field effects on constant pH MD simulations is further investigated by others in this issue. 52 Although there exist problems with both implicit solvation and conventional force fields, the CpHMD method has been successful in predicting the pK a of a significant number of residues from the test set from García-Moreno. This study has highlighted areas that may add significant improvement in the pK a prediction capability of the method, such as enhanced conformational sampling and implementation of an improved solvation model. Future work will focus on testing other solvation models and the implementation of different accelerated molecular dynamics techniques, with the goal of achieving better sampling of physically meaningful conformations and protonation states. Errors are computed with zero error if the predicted pK a falls within the bounds of experimental pK a with limiting values. Residues that do not incur transitions (ASP21 and L37D) are omitted from this calculation.