In recent years the energy landscape theory of protein folding1–5 has been validated through its application to protein folding,6–10 oligomerization,11–14 functional transitions,15–20 and structure prediction.21, 22 The theory states that proteins are minimally frustrated, that their energy landscape is funnel shaped and that the folded state of the protein is at the bottom of the funnel. Because of the shape of the landscape there is a strong energetic bias towards the folded state of the protein with relatively infrequent trapping caused by non-native interactions. The resulting heterogeneity observed during folding is due to the geometric constraints of the native structure. Thus, models of proteins that have only the native structure encoded have had great success in determining folding mechanisms. Until recently, most such models were coarse-grained; these are very useful for understanding global folding dynamics. In commonly used structure-based potentials,9 each residue is represented by a bead centered at the location of the Cα atom [Fig. 1(b)] and only native interactions are stabilizing.
On the other end of the spectrum of structural and energetic details are the computationally intensive all-atom empirical forcefields.25–30 These forcefields include an atomistic representation of a protein either with an implicit or an explicit solvent. In these potentials, the parameters which determine the interaction between atoms, such as partial charges and van der Waals radii, are fit to experimental measurements and quantum mechanical calculations. With accurate calibration, a single parameter set may be applied to any protein and, with sufficient computing resources, the dynamics of a protein can be calculated on a computer. The physics-based representation of atom–atom interactions automatically includes electrostatic interactions as well as any non-native interactions that may be present. In principle, these models render knowledge of a native structure unnecessary. A major limitation of these potentials is that they are often too computationally expensive to fold any but the smallest proteins.30–39 The timescales that can currently be calculated vary from hundreds of nanoseconds to microseconds, depending on the size of the protein. Biological timescales are usually several orders of magnitude longer, so these dynamics cannot currently be accessed using all-atom empirical forcefields. In addition, sensitivity analysis of the dynamics to the parameters is not possible with these all-atom empirical forcefields.
In all-atom empirical forcefields an observed specificity of (i.e., preference for) native interactions is seen as a consequence of many energetic contributions. Because of the complex formulation of these potentials, it is impossible to partition geometric effects from energetic ones. There is a similar restriction in coarse-grained models because of their simplicity. Partitioning these effects is often impossible because geometry is included implicitly through energetic interactions. By studying all-atom models with structure-based potentials,40–44 because atomic geometry is explicitly included, we can ask to what extent energetics contribute to the apparent native specificity in protein structure, folding, and function. In contrast to enzyme catalysis where specific atomic interactions directly control the chemical reactions, in most cases the energetic specificity required in protein folding is less stringent.
Providing a complete picture of specificity in protein folding and function will require the study of many proteins and many parametric variations. In this article, we lay the foundation for this line of investigation through systematic characterization of a completely specific (only, and all, native interactions are stabilizing) AA structure-based model. We study the effect of varying the parameters of the model on folding barriers, mechanisms, contact formation, and side chain dynamics. The test proteins, B domain of Protein A, SH3 domain of C-Src Kinase, and Chymotrypsin Inhibitor 2 (CI2) (Fig. 2), have been well characterized experimentally47–49 and computationally.8, 50–52 Additionally, they possess two-state folding dynamics and represent different secondary and tertiary structures. The present model is energetically unfrustrated, with an explicit representation of all non-hydrogen atoms and homogeneous interaction strengths. We find that the folding mechanisms in the model are robust to parameter changes and that the dynamics agree well with both the Cα model and an all-atom empirical forcefield with explicit solvent. Further, side chain ordering can be probed explicitly and the effect of prolines can be calculated. This study and model will serve as a basis for future AA models which incorporate nonspecific contributions of energetic frustration, electrostatics, and hydration.
Folding mechanisms are robust to parameter changes
We use a model where the potential energy function is defined by the native state and all heavy (non-hydrogen) atoms are explicitly represented. Any two atoms that are close in the native structure are said to form a native contact. We describe the folding process by using the fraction of native residue pairs in contact QAA (see “Methods” section). Figure 3(a) shows QAA, QCA (fraction of Cα contacts, see “Methods” section) and radius of gyration Rg as functions of time for an AA simulation of CI2, near folding temperature. Because QAA captures the same collapse events as Rg and QCA [Fig. 3(b)], QAA is a useful measure of backbone folding in addition to side chain packing.
It is crucial to understand the parameter dependence of a model before it can be used to make reliable predictions of folding mechanisms. The robustness of the folding mechanism is probed here by characterizing Protein A, SH3, and CI2 for variants of the AA structure-based energy function. Because of the debate about the balance between secondary and tertiary interactions, we vary the ratio of nonlocal contact energy to dihedral angles RC/D and the relative strength of backbone dihedral angles to side chain dihedral angles RBB/SC (see “Methods” section).
To characterize the folding mechanism for different parameter sets we computed the probability of contacts formed as a function of the folding process P(Qi,QAA). P(Qi,QAA) is the probability that the contacts involving residue i, Qi, are formed as a function of QAA. P(Qi,QAA) was calculated for the three proteins for 16 different parameter sets (all combinations of RC/D = 1.0, 2.0, 3.0, 4.0 and RBB/SC = 1.0, 2.0, 3.0, 4.0). Figure 4 shows the folding mechanisms for four parameter sets. The difference in folding mechanism between parameter sets i and j can be quantified by the root mean squared deviation in P(Qi,QAA) over all QAA and Qi (Prms). The largest values of Prms for Protein A, SH3, and CI2 were 0.057, 0.097, and 0.077. SH3 is a complicated fold, Protein A a simple fold, and CI2 an intermediate fold.53 Thus, it is not surprising that energetic modifications have the largest effects on Protein A and the smallest effects on SH3.
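The comparison between two parameter sets can be illustrated with a minimal sketch. It assumes that each mechanism is stored as a two-dimensional array of P(Qi,QAA) values (residues by QAA bins) on a shared grid and that all entries are weighted equally; the function name `p_rms` and the toy numbers are illustrative and are not taken from the study.

```python
import numpy as np

def p_rms(p_a, p_b):
    """Root mean squared deviation between two P(Q_i, Q_AA) maps.

    Both maps are arrays of shape (n_residues, n_q_bins) on the same grid.
    Equal weighting of all entries is an assumption of this sketch.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    if p_a.shape != p_b.shape:
        raise ValueError("maps must share the same (residue, Q bin) grid")
    return float(np.sqrt(np.mean((p_a - p_b) ** 2)))

# Toy maps: 3 residues x 4 Q_AA bins (illustrative numbers only).
map_i = np.array([[0.0, 0.2, 0.6, 1.0],
                  [0.0, 0.1, 0.5, 1.0],
                  [0.0, 0.3, 0.7, 1.0]])
map_j = map_i + 0.05  # a uniformly shifted mechanism

print(p_rms(map_i, map_j))
```

In this form, two identical mechanisms give a value of 0, and a uniform shift of every probability by 0.05 gives a value of 0.05.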
Figure 4 shows that proline-containing regions are less stable to parametric modifications. Regions with prolines, and regions interacting with prolines, form structure earlier (at lower Q) with increased contact strength. This is because contact strength is increased at the expense of dihedral strength. Prolines possess a covalent bond between the side chain and the backbone nitrogen, which limits the mobility of the ϕ dihedral. Removing energy from the dihedrals therefore does not increase flexibility in prolines. However, adding energy to contacts increases structure formation around prolines. For this reason, increasing RC/D stabilizes and promotes earlier formation of proline-containing regions.
Fully folded backbone allows for disordered side chains
Although QAA and QCA capture the same cooperative folding events, at folding temperature, QCA is higher than QAA for the folded ensemble. This suggests that although the backbone structure is native (QCA ≈ 0.8), many of the native residue interactions form as temperature is decreased [Fig. 3(c,d)]. To account for this structurally and quantitatively, we calculated the difference between the probability of Cα contacts being formed P(QiCA,QCA) and AA contacts being formed P(QiAA,QCA) (Fig. 5). A value of 0 indicates that, on average, the Cα atoms of a residue pair are near their native distance when the side chains are in contact. Positive values are seen when extended side chains are interacting, resulting in the Cα atoms being far from their native distance. Negative values indicate that backbone folding precedes side chain ordering.* Side chains in Protein A seem to be well packed, in that there is concomitant side chain and backbone folding. In SH3, the turns have negative values, and are thus underpacked. In CI2, underpacking is primarily found in the active site loop and the C-terminal tail. These results reveal a signature of complicated folds:52, 53 a small subset of native contacts is sufficient to constrain the backbone to its native orientation, resulting in significantly underpacked regions in the native state. This occurs in complicated folds because an individual contact can impose a high level of order on the system. To form contacts that are distant in sequence, a large number of residues must also order. In Protein A, many contacts are local and only constrain single helical turns. In SH3 and CI2, fewer contacts are required to constrain the entire backbone (including the turns and loops).
Figure 3(c,d) shows the dynamics of a typical underpacked contact. As T is lowered below Tf, the underpacked contact's average distance and distance fluctuations smoothly decrease. This results in a gradual increase in Q without a noticeable free-energy barrier [see Fig. 6(e)]. We hope that these subtle dynamics will be experimentally probed and tested in the future.
Understanding free-energy profiles through parametric variation: free-energy profiles can be altered through parametric changes
Although the folding mechanisms are stable, the free-energy barriers associated with folding and the locations of the folded basins vary systematically with parameters. Figure 6 shows free-energy profiles for SH3, CI2, and Protein A for several values of RC/D with RBB/SC = 2.0. There are four distinct, interrelated trends shared by all three proteins. First, there are two folding processes: backbone collapse and side chain packing. Second, the free-energy minimum for the folded state moves to lower Q with increasing RC/D. Third, the free-energy barrier decreases with increasing RC/D. Finally, increasing RBB/SC has effects similar to those of increasing RC/D (not shown).
The free-energy basins for the folded states are located at QCA ≈ 0.8 and QAA ≈ 0.5 [Fig. 6(d)], indicating that the backbone orders while many native atom–atom interactions remain extended. Thus, the entropy loss during the cooperative folding transition is likely dominated by backbone ordering. Side chain packing occurs both concomitantly with, and after, backbone ordering.
There are likely two major factors that lead to the observed trends. First, increasing RC/D increases contact strength. As seen in other simplified models,54 when each contact is stronger, a smaller number of contacts is required (lower Q) to provide an equal amount of stabilizing energy. The second contributing factor is the change in side chain entropy. Although entropy loss in the backbone dominates the collapse transition, the gradual side chain packing can also lead to shifting basins. Increasing RBB/SC or RC/D reduces the strength of side chain dihedrals, resulting in more mobile unfolded side chains. Therefore, there is an increased entropy loss per side chain upon folding ΔSsc when RC/D or RBB/SC is increased. Because side chains can pack independently of the collapse transition, when ΔSsc increases, a fraction of the side chain interactions extend, while leaving the overall fold intact. Because the folded basin shifts to lower Q, the overall structure required to form a stable fold is reduced. A reduced barrier height naturally results when the folded basin is less ordered.
Free-energy barriers, in conjunction with diffusion constants, provide a direct connection to experimental folding rates.55–57 We find that the relative barrier heights calculated using our AA model are similar to those from a Cα model [Fig. 6(f)]. The relative barrier heights calculated from the Cα model are known to correlate well with experimental rates.57 We note in passing that the absolute free-energy barriers in the AA model can be parametrically changed by up to a factor of two for a given protein and that the relative barrier heights between proteins remain constant. Thus, although the magnitude of the rates will be determined by the diffusion constant, the correlation between experimental folding rates and theoretical barriers is independent of the choice of parameters.
Next, we compare the backbone folding mechanisms of our AA model and a commonly used Cα model.9 The Cα representation has been successful at capturing experimentally determined protein folding mechanisms.8, 9, 11 The first column in Figure 7 shows the differences in folding mechanisms between the AA model and an energetically homogeneous Cα model. Every contact and dihedral in the homogeneous Cα model has the same interaction strength. Because the AA model distributes contact energy inhomogeneously between residue pairs, it is not surprising that the mechanisms differ.
To remove differences arising from energetic homogeneity in the Cα model, we modified it such that each contact is weighted by the number of contacts between each residue pair in the AA model (Fig. 7, second column). For Protein A this modification improves agreement. The remaining difference is in a single turn-to-tail contact [Gln1 with Ser31, Fig. 2(a)] that rarely forms in Cα simulations. In SH3, agreement improves around residues Asp34 and Asn52, while differences persist in Gln45 and the tails. The overall effect is increased formation around Gln45 at the expense of the tails. In CI2, there is significant agreement in the tails, though the mini-core still forms earlier (in the AA model), at the expense of the helix. For all three proteins, several regions of disagreement possess proline residues, whose bond is not included in the Cα model.
To eliminate effects specific to proline, we repeated the AA simulations with all prolines mutated to alanines. The third column of Figure 7 shows the Pro-Ala mutants compared to the inhomogeneous Cα model. Improved agreement is observed in Pro-Ala mutants of SH3 and CI2. In both proteins Pro-Ala mutations delay folding of proline regions, in agreement with proline effects on model stability. In SH3 the tails still form slightly earlier in the AA model, at the expense of residues 35–55. In CI2, the balance between minicore and helix formation is clearly improved, highlighting the importance of prolines in the folding process. Pro-Ala mutations have almost no effect on the folding mechanism of P12 and P30 in Protein A and P25 in CI2. This is likely because these prolines are located in turn regions. In our model, turns are highly constrained by short range contacts, and the reduced dihedral constraint (imposed by a proline) acts as a small perturbation. The remaining differences between the Pro-Ala AA mutants and the inhomogeneous Cα model demonstrate, to no surprise, that the inclusion of side chains alters the relative entropy of residues.
Native basin dynamics of AA structure-based model correlate with the dynamics of an all-atom empirical forcefield with explicit solvent
Two common measures of native state dynamics are native contact formation and root mean squared deviations in structure (rmsd). Figure 8 shows the average contact formation in the native ensemble for the structure-based model and an all-atom empirical forcefield with an explicit solvent. Although the average contacts are not identical, no major differences in contact formation are observed. The overlaps between the AA maps and the all-atom empirical forcefield maps of Protein A, SH3, and CI2 are 0.85, 0.97, and 0.84, respectively. An overlap of 1 indicates identical maps, and 0 indicates the two maps have no contacts in common.
In a uniquely defined native state, the probability of each contact being formed is 1. Because we sample the native ensemble at finite temperatures, atom mobility leads to additional contacts being formed. In the structure-based model, these additional interactions are strictly repulsive. In an all-atom empirical forcefield these interactions can be attractive, yet they are observed more frequently in the structure-based model.† These contacts are likely due to increased mobility in the structure-based simulations. In all-atom empirical forcefields, hydration shells can result in less mobile side chains, and hence a narrower distribution of contacts.
The increased mobility is quantified by the structural rmsd. The magnitude of fluctuations in all-atom empirical simulations is much lower than in structure-based simulations (not shown). For the all-atom empirical forcefield at 300 K, the average rmsd values for Protein A, SH3, and CI2 are 1.53, 1.00, and 0.97 Å, respectively. The corresponding rmsd values of the Cα atoms are 1.23, 0.66, and 0.74 Å. The same values are obtained in structure-based simulations at around T = 0.55Tf. In real temperature units, 0.55Tf corresponds to temperatures significantly less than 300 K. A likely cause of the larger structural fluctuations in the structure-based model is the absence of the hydration effects that explicit solvent molecules provide in the all-atom empirical forcefield. To compare the distribution of rmsd fluctuations between models, correlation coefficients (r) were computed for the rmsd by atom in the all-atom empirical forcefield and the structure-based potential. For all parameter sets of the structure-based potential, r ≈ 0.7 for CI2 and SH3 and r ≈ 0.8 for Protein A.‡
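The per-atom comparison can be sketched as follows. The sketch assumes that each trajectory has already been aligned to a common reference structure and is stored as an array of shape (frames, atoms, 3), and it uses the Pearson coefficient for r; the function names and the toy coordinates are illustrative only.

```python
import numpy as np

def per_atom_rmsd(traj, ref):
    """Per-atom rmsd over a trajectory.

    traj: array (n_frames, n_atoms, 3), assumed pre-aligned to ref.
    ref:  array (n_atoms, 3), the reference (native) coordinates.
    """
    diff = np.asarray(traj, dtype=float) - np.asarray(ref, dtype=float)
    sq_dev = np.sum(diff ** 2, axis=2)          # squared displacement per frame and atom
    return np.sqrt(np.mean(sq_dev, axis=0))     # average over frames, then root

def pearson_r(x, y):
    """Pearson correlation coefficient between two per-atom profiles."""
    return float(np.corrcoef(x, y)[0, 1])

# Toy system: 2 atoms, 2 frames (illustrative numbers only).
ref = np.zeros((2, 3))
traj = np.array([[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                 [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]])
profile = per_atom_rmsd(traj, ref)

# Two hypothetical per-atom profiles from different models.
model_a = np.array([0.5, 0.8, 1.2, 0.6])
model_b = 2.0 * model_a + 1.0
print(profile, pearson_r(model_a, model_b))
```

Because r is insensitive to a uniform scaling of the fluctuations, it compares how mobility is distributed over the atoms rather than the overall magnitude, which differs between the two models.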
In this article, we describe a systematic analysis of an AA structure-based model which bridges the gap between coarse-grained models and all-atom empirical forcefields. We show that in our Cα and AA structure-based models the global folding mechanisms agree and the main differences are largely due to energetic heterogeneity and the explicit representation of prolines in the AA model. Also, the native basin dynamics are similar in the AA structure-based model and an all-atom empirical forcefield with explicit solvent. In agreement with previous studies, the folding mechanisms in complicated folds are stable to parametric variation. On the other hand, the free-energy barriers associated with folding vary systematically with parameters. Because free-energy barriers are not a robust feature of this model, understanding the interplay between barrier heights and diffusion will be important before attempting to predict folding rates.55, 58, 59
Using this model we characterized two folding processes: one associated with backbone collapse and the other with side chain packing. We observed that backbone collapse is accompanied by partial side chain packing in a cooperative transition and residual side chain packing occurs as temperature is reduced below the global folding temperature. One explanation for the partial separation of backbone folding and side chain ordering may be that mobility in specific residues is necessary for the functional properties of proteins. Proteins are selected for their function. Orthogonal networks of residues responsible for stability and function have been proposed.60, 61 The observation in our model that some residues are not necessary to maintain the backbone structure is consistent with this proposal. In CI2, the backbone of the active site loop is in the native orientation, yet the side chains are not packed. In SH3, several turns are also disordered. Because binding sites are often found in loops, flexible loops may be more easily adapted to new sequences and functions.
Gradual side chain packing can also allow for proteins to functionally respond to cellular stress by affecting side chain orientations, without denaturing the entire protein. This is consistent with the prediction that localized unfolding, or cracking, is important for biological function of kinases and motor proteins.15, 18, 62–67
The current model explicitly includes the effects of topological contributions to protein folding, and the role of energetic contributions may now be elucidated. Our results are a significant step forward in understanding protein dynamics from the Cα to the all-atom level. In the coming years, it will be interesting to probe the effects of electrostatics, nonnative interactions, water, and explicit mutations in this model.
MODELS AND METHODS
In our AA model of the protein, only heavy (non-hydrogen) atoms are included. Each atom is represented as a single bead of unit mass. Bond lengths, bond angles, improper dihedrals, and planar dihedrals are maintained by harmonic potentials. Nonbonded atom pairs that are in contact in the native state between residues i and j, where i > j + 3, are given a Lennard-Jones potential, whereas all other nonlocal interactions are repulsive. All contacts identified by the Contact of Structural Units software package (CSU)68 were included. The potential energy is the sum of these bonded, dihedral, contact, and repulsive terms, with εr = 100, εθ = 20, εχ = 10, and εNC = 0.01. ro, θo, χo, ϕo, and σij are given the values found in the native state and σNC = 2.5 Å. When assigning dihedral strengths, we first group dihedral angles that share the middle two atoms. For example, in a protein backbone, one can define up to four dihedral angles that possess the same C − Cα covalent bond as the central bond. Each dihedral is given the interaction strength of 1/ND, where ND is the number of dihedral angles in the group. The backbone and side chain dihedral strengths, εBB and εSC, are then scaled so that their ratio equals RBB/SC. Next, dihedral strengths and contact strengths are scaled such that our other system parameter, the ratio of total contact energy to total dihedral energy RC/D, is satisfied. The total stabilizing energy (contacts plus dihedrals) is equal for all parameter sets.
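One functional form consistent with the terms described in this section can be written as follows. The form of the dihedral function F_D, the absence of 1/2 prefactors on the harmonic terms, and the symbol ε_C for the contact strength are assumptions of this sketch; the 12-6 contact term is chosen so that its minimum lies at σij.

```latex
\begin{align}
V ={}& \sum_{\text{bonds}} \epsilon_r (r - r_o)^2
     + \sum_{\text{angles}} \epsilon_\theta (\theta - \theta_o)^2
     + \sum_{\text{improper/planar}} \epsilon_\chi (\chi - \chi_o)^2 \nonumber \\
   & + \sum_{\text{backbone}} \epsilon_{BB}\, F_D(\phi)
     + \sum_{\text{side chains}} \epsilon_{SC}\, F_D(\phi) \nonumber \\
   & + \sum_{\text{contacts}} \epsilon_C
       \left[ \left( \frac{\sigma_{ij}}{r} \right)^{12}
            - 2 \left( \frac{\sigma_{ij}}{r} \right)^{6} \right]
     + \sum_{\text{non-contacts}} \epsilon_{NC}
       \left( \frac{\sigma_{NC}}{r} \right)^{12}, \\
F_D(\phi) ={}& \left[ 1 - \cos(\phi - \phi_o) \right]
     + \tfrac{1}{2} \left[ 1 - \cos\bigl(3(\phi - \phi_o)\bigr) \right], \\
R_{BB/SC} ={}& \frac{\epsilon_{BB}}{\epsilon_{SC}}, \qquad
R_{C/D} = \frac{\sum \epsilon_C}{\sum \epsilon_{BB} + \sum \epsilon_{SC}}.
\end{align}
```

With this contact term the energy of a native pair is −ε_C at r = σij, and every dihedral term is minimized at its native value ϕo.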
As reaction coordinates we use QAA and QCA. QAA is the fraction of natively interacting residue pairs that are in contact. Two residues are considered in contact if any native atom–atom interactions between the residues are within 1.2 times the native distance σij. At 1.2σij the potential energy of a native pair is approximately half of the minimum. Similarly, QCA is the fraction of natively interacting residue pairs whose Cα atoms are within 1.2 times their native distance.
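The definition of QAA can be sketched in a few lines. The sketch assumes that the native atom–atom pairs are supplied as parallel arrays of current and native distances, each labeled with its residue pair, and it assumes a 12-6 contact potential with its minimum at σij when checking the energy at the cutoff; the function names and toy distances are illustrative.

```python
import numpy as np

def fraction_native(dist, native_dist, pair_residues, cutoff=1.2):
    """Fraction of native residue pairs with at least one atom-atom contact formed.

    dist, native_dist: current and native distances, one entry per native atom pair.
    pair_residues: (i, j) residue indices, one tuple per native atom pair.
    """
    formed = np.asarray(dist) <= cutoff * np.asarray(native_dist)
    in_contact = {}
    for pair, ok in zip(pair_residues, formed):
        # A residue pair counts as formed if any of its atom pairs is formed.
        in_contact[pair] = in_contact.get(pair, False) or bool(ok)
    return sum(in_contact.values()) / len(in_contact)

def lj_12_6(r, sigma, eps=1.0):
    """12-6 contact energy with minimum -eps at r = sigma (assumed form)."""
    x = sigma / r
    return eps * (x ** 12 - 2.0 * x ** 6)

# Toy example: four native atom pairs spread over three residue pairs.
native = np.array([4.0, 5.0, 6.0, 4.5])
current = np.array([4.2, 7.0, 9.0, 5.0])
pairs = [(1, 10), (1, 10), (2, 20), (3, 30)]

q_aa = fraction_native(current, native, pairs)
print(q_aa, lj_12_6(1.2, 1.0))
```

In the toy example two of the three residue pairs are formed, and the assumed contact energy at 1.2σij is about −0.56 of the well depth, in line with the statement that it is approximately half of the minimum.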
Proline to alanine mutations
To investigate the role of proline residues in the AA model, proline to alanine mutants were constructed. This was achieved by removing the Cγ and Cδ atoms of each proline. Native contacts formed with the Cγ and Cδ of a proline were included as contacts with the Cβ of the corresponding alanine. This ensured the energetics of the system were unperturbed, and only topology was modified.
All-atom structure-based simulations were performed using the GROMACS software package.26 No modifications to the source code were necessary. Reduced units were used. The timestep τ was 0.0005. The Berendsen algorithm69 was used§ with the coupling constant of 1. For all folding results in this article several constant temperature runs were performed, with temperatures that corresponded to the protein being always folded to always unfolded. The Weighted Histogram Analysis Method70, 71 was used to combine data from multiple temperatures into single free-energy profiles.
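For orientation, the relation between sampled values of Q and a free-energy profile can be sketched for a single temperature as F(Q) = −kT ln P(Q). This is a plain histogram estimate, not the Weighted Histogram Analysis Method, which additionally reweights and combines runs at several temperatures; the function name, bin count, and toy samples are illustrative.

```python
import numpy as np

def free_energy_profile(q_samples, bins=20, kT=1.0):
    """Single-temperature estimate F(Q) = -kT ln P(Q), shifted so min(F) = 0.

    Empty bins receive an infinite free energy. This is not WHAM.
    """
    counts, edges = np.histogram(q_samples, bins=bins, range=(0.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    prob = counts / counts.sum()
    with np.errstate(divide="ignore"):
        f = -kT * np.log(prob)
    f -= f[np.isfinite(f)].min()
    return centers, f

# Toy two-state samples: unfolded near Q = 0.22, folded near Q = 0.82,
# with a sparsely populated transition region near Q = 0.52.
q = np.array([0.22] * 60 + [0.52] * 5 + [0.82] * 35)
centers, f = free_energy_profile(q)
print(f[4], f[10], f[16])
```

The barrier height in this toy profile is the difference between the transition-region bin and the lowest basin, here ln(0.60/0.05) in units of kT.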
All-atom empirical forcefield simulations
All-atom empirical forcefield simulations were performed using GROMACS,26, 72 with the OPLS-AA forcefield73 with TIP3P water molecules.74 Each protein was simulated for 10 ns at T = 300 K and a pressure of 1 atm. A timestep of 2 fs was used in conjunction with the LINCS75, 76 algorithm for constraining covalent bonds with hydrogen. Protein A, SH3, and CI2 were simulated with 2810, 3617, and 4644 water molecules in cubic boxes of initial dimensions 45.15 Å, 48.98 Å, and 53.07 Å, respectively. Temperature was maintained using the Berendsen algorithm.69 One nanosecond was allowed for equilibration. For the remaining 9 ns, structures were saved at 1 ps intervals.
Comparison of contacts
In the all-atom empirical forcefield simulations contacts were determined for each saved structure using CSU.68 The average number of contacts 〈Q〉 was calculated for each protein. The probability of individual contacts being formed was averaged over all structures with Q = 〈Q〉. With the all-atom empirical potential 〈Q〉 was 80, 135, and 146 for Protein A, SH3, and CI2. This analysis was repeated for folded simulations with our AA structure-based simulations. For the structure-based simulations 〈Q〉 was 80, 138, and 144. To compare contact maps, the dot product of the two maps was taken.
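The map comparison can be sketched as a normalized dot product. Normalizing by the two vector norms is an assumption of this sketch, chosen so that identical maps give an overlap of 1 and maps with no contacts in common give 0; the function name and toy maps are illustrative.

```python
import numpy as np

def map_overlap(p_a, p_b):
    """Normalized dot product of two contact-probability maps of equal shape.

    Normalization by the vector norms is an assumption of this sketch.
    """
    a = np.asarray(p_a, dtype=float).ravel()
    b = np.asarray(p_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("maps must have the same shape")
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("at least one map contains no contacts")
    return float(np.dot(a, b) / denom)

# Toy maps over four possible contacts (illustrative numbers only).
map_sb = np.array([0.9, 0.8, 0.0, 0.0])    # structure-based model
map_emp = np.array([0.0, 0.0, 0.7, 0.6])   # empirical forcefield, disjoint contacts

print(map_overlap(map_sb, map_sb), map_overlap(map_sb, map_emp))
```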
We would like to thank Angel Garcia and Peter G. Wolynes for useful discussions regarding all-atom modeling.
* QAA is a generous definition of side chain packing, because a side chain is “packed” when one or more atom–atom contacts are formed. Thus, “underpacked” residues clearly have very little native structure.
† In Figure 8, only interactions present more than 2.5% of the time are shown.
‡ Comparison of rmsd of the Cα atoms yields similar values of r.
§ When using the Berendsen thermostat, numerical instabilities can arise when the bath-molecule coupling timescale is shorter than the timescale for internal energy diffusion. In our experience, these problems tend to surface when simulating weakly interacting domains with implicit solvation. Because the present study investigates folding of single domain proteins under weak temperature coupling, these features are not likely a source of significant errors. Nonetheless, future work will also employ Langevin or Nosé-Hoover temperature coupling.