An all-atom structure-based potential for proteins: Bridging minimal models with all-atom empirical forcefields


  • Paul C. Whitford,

    1. Center for Theoretical Biological Physics and Department of Physics, University of California at San Diego, La Jolla, California 92093
  • Jeffrey K. Noel,

    1. Center for Theoretical Biological Physics and Department of Physics, University of California at San Diego, La Jolla, California 92093
  • Shachi Gosavi,

    1. Center for Theoretical Biological Physics and Department of Physics, University of California at San Diego, La Jolla, California 92093
  • Alexander Schug,

    1. Center for Theoretical Biological Physics and Department of Physics, University of California at San Diego, La Jolla, California 92093
  • Kevin Y. Sanbonmatsu,

    1. Theoretical Biology and Biophysics, Theoretical Division, Los Alamos National Laboratory, MS K710, Los Alamos, New Mexico 87545
  • José N. Onuchic

    Corresponding author
    1. Center for Theoretical Biological Physics and Department of Physics, University of California at San Diego, La Jolla, California 92093
    • Center for Theoretical Biological Physics and Department of Physics, University of California at San Diego, 9500 Gilman Drive, LaJolla, CA 92093


Protein dynamics take place on many time and length scales. Coarse-grained structure-based (Gō) models utilize the funneled energy landscape theory of protein folding to provide an understanding of both long time and long length scale dynamics. All-atom empirical forcefields with explicit solvent can elucidate short time dynamics with high energetic and structural resolution. Thus, structure-based models with atomic details included can be used to bridge our understanding between these two approaches. We report on the robustness of folding mechanisms in one such all-atom model. Results for the B domain of Protein A, the SH3 domain of C-Src Kinase, and Chymotrypsin Inhibitor 2 are reported. The interplay between side chain packing and backbone folding is explored. We also compare this model to a Cα structure-based model and an all-atom empirical forcefield. Key findings include: (1) backbone collapse is accompanied by partial side chain packing in a cooperative transition and residual side chain packing occurs gradually with decreasing temperature, (2) folding mechanisms are robust to variations of the energetic parameters, (3) protein folding free-energy barriers can be manipulated through parametric modifications, (4) the global folding mechanisms in a Cα model and the all-atom model agree, although differences can be attributed to energetic heterogeneity in the all-atom model, and (5) proline residues have significant effects on folding mechanisms, independent of isomerization effects. Because this structure-based model has atomic resolution, this work lays the foundation for future studies to probe the contributions of specific energetic factors on protein folding and function. Proteins 2009. © 2008 Wiley-Liss, Inc.


In recent years the energy landscape theory of protein folding1–5 has been validated through its application to protein folding,6–10 oligomerization,11–14 functional transitions,15–20 and structure prediction.21, 22 The theory states that proteins are minimally frustrated, that their energy landscape is funnel shaped and that the folded state of the protein is at the bottom of the funnel. Because of the shape of the landscape there is a strong energetic bias towards the folded state of the protein with relatively infrequent trapping caused by non-native interactions. The resulting heterogeneity observed during folding is due to the geometric constraints of the native structure. Thus, models of proteins that have only the native structure encoded have had great success in determining folding mechanisms. Until recently, most such models were coarse-grained; these are very useful for understanding global folding dynamics. In commonly used structure-based (Gō) potentials,9 each residue is represented by a bead centered at the location of the Cα atom [Fig. 1(b)] and only native interactions are stabilizing.

Figure 1.

CI2 (Protein Data Bank entry 1YPA23) shown in (a) cartoon representation, (b) Cα representation, and (c) all-atom (AA) representation. Structures are colored red (C-terminus) to blue (N-terminus). The sizes of the atoms in the Cα and AA representations correspond to the excluded volume radii used in the Cα9 and AA models studied in this article. Structures visualized using VMD.24

On the other end of the spectrum of structural and energetic details are the computationally intensive all-atom empirical forcefields.25–30 These forcefields include an atomistic representation of a protein with either an implicit or an explicit solvent. In these potentials, the parameters that determine the interactions between atoms, such as partial charges and van der Waals radii, are fit to experimental measurements and quantum mechanical calculations. With accurate calibration, a single parameter set may be applied to any protein and, with sufficient computing resources, the dynamics of a protein can be calculated on a computer. The physics-based representation of atom–atom interactions automatically includes electrostatic interactions as well as any non-native interactions that may be present. In principle, these models render knowledge of a native structure unnecessary. A major limitation of these potentials is that they are often too expensive to fold anything but small proteins.30–39 The timescales that can currently be calculated vary from hundreds of nanoseconds to microseconds, depending on the size of the protein. Biological timescales are usually several orders of magnitude larger and these dynamics cannot be accessed using all-atom empirical forcefields. In addition, sensitivity analysis of the dynamics to the parameters is not possible with these all-atom empirical forcefields.

In all-atom empirical forcefields an observed specificity of (i.e., preference for) native interactions is seen as a consequence of many energetic contributions. Because of the complex formulation of these potentials, it is impossible to partition geometric effects from energetic ones. There is a similar restriction in coarse-grained models because of their simplicity. Partitioning these effects is often impossible because geometry is included implicitly through energetic interactions. By studying all-atom models with structure-based potentials,40–44 because atomic geometry is explicitly included, we can ask to what extent energetics contribute to the apparent native specificity in protein structure, folding, and function. In contrast to enzyme catalysis where specific atomic interactions directly control the chemical reactions, in most cases the energetic specificity required in protein folding is less stringent.

Providing a complete picture of specificity in protein folding and function will require the study of many proteins and many parametric variations. In this article, we lay the foundation for this line of investigation through systematic characterization of a completely specific (only, and all, native interactions are stabilizing) AA structure-based model. We study the effect of varying the parameters of the model on folding barriers, mechanisms, contact formation, and side chain dynamics. The test proteins, B domain of Protein A, SH3 domain of C-Src Kinase, and Chymotrypsin Inhibitor 2 (CI2) (Fig. 2), have been well characterized experimentally47–49 and computationally.8, 50–52 Additionally, they possess two-state folding dynamics and represent different secondary and tertiary structures. The present model is energetically unfrustrated, with an explicit representation of all non-hydrogen atoms and homogeneous interaction strengths. We find that the folding mechanisms in the model are robust to parameter changes and that the dynamics agree well with both the Cα model and an all-atom empirical forcefield with explicit solvent. Further, side chain ordering can be probed explicitly and the effect of prolines can be calculated. This study and model will serve as a basis for future AA models which incorporate nonspecific contributions of energetic frustration, electrostatics, and hydration.

Figure 2.

Structures of (a) Protein A, (b) SH3, and (c) CI2 (PDB entries 1BDD,45 1FMK,46 and 1YPA23) colored red (C-terminus) to blue (N-terminus). These three proteins represent differing structural content and topological complexity. Protein A is a three-helix bundle, SH3 is composed of multiple β strands, and in CI2 an α helix flanks a β sheet. Proline residues are shown as gray spheres. In Protein A, Gln1 and Ser31 are shown as colored spheres. In SH3, Val4 and Trp35 are shown as spheres. The mini-core of CI2 is circled.


Folding mechanisms are robust to parameter changes

We use a model where the potential energy function is defined by the native state and all heavy (non-hydrogen) atoms are explicitly represented. Any two atoms that are close in the native structure are said to form a native contact. We describe the folding process by using the fraction of native residue pairs in contact QAA (see “Methods” section). Figure 3(a) shows QAA, QCA (fraction of Cα contacts, see “Methods” section) and radius of gyration Rg as functions of time for an AA simulation of CI2, near folding temperature. Because QAA captures the same collapse events as Rg and QCA [Fig. 3(b)], QAA is a useful measure of backbone folding in addition to side chain packing.

Figure 3.

(a) Fraction of Cα contacts QCA(t), AA contacts QAA(t), and Radius of Gyration Rg(t) as functions of time for a representative trajectory of CI2 with the AA model. (b) Average structure formation for several reaction coordinates. A contact between residues is formed when a single atom–atom contact between them is formed. An atom–atom contact is considered formed when the pair is at a distance r < γσ where σ is the native pair distance. The fraction of native residue contacts formed QXAA is shown for γ = 1.2 (black) and γ = 1.5 (red). A Cα contact is formed when the Cα atoms are within 1.2 times their native distance (green). All three coordinates capture the same folding events. (c) Atom–atom distance for a contact in the active loop of CI2 versus time at Tf (red) and T < Tf (green). Large changes in distance (>20 Å) coincide with folding transitions. Side chain rearrangements in the folded state (R < 10 Å) occur on much faster time scales than folding of the entire protein. (d) Same as Figure (c) with time scale decreased by a factor of 100. Horizontal lines correspond to σ (yellow), 1.2σ (blue), and 1.5σ (purple). As temperature is decreased, distance fluctuations and average distances decrease.

It is crucial to understand the parameter dependence of a model before it can be used to make reliable predictions of folding mechanisms. The robustness of the folding mechanism is probed here by characterizing Protein A, SH3, and CI2 for variants of the AA structure-based energy function. Because of the debate about the balance between secondary and tertiary interactions, we vary the ratio of nonlocal contact energy to dihedral angles RC/D and the relative strength of backbone dihedral angles to side chain dihedral angles RBB/SC (see “Methods” section).

To characterize the folding mechanism for different parameter sets we computed the probability of contacts formed as a function of the folding process P(Qi,QAA). P(Qi,QAA) is the probability that the contacts involving residue i, Qi, are formed as a function of QAA. P(Qi,QAA) was calculated for the three proteins for 16 different parameter sets (all combinations of RC/D = 1.0, 2.0, 3.0, 4.0 and RBB/SC = 1.0, 2.0, 3.0, 4.0). Figure 4 shows the folding mechanisms for four parameter sets. The difference in folding mechanism between two parameter sets can be quantified by the root mean squared deviation in P(Qi,QAA) over all QAA and Qi, denoted Prms. The largest values of Prms for Protein A, SH3, and CI2 were 0.097, 0.057, and 0.077. SH3 is a complicated fold, Protein A a simple fold, and CI2 an intermediate fold.53 Thus, it is not surprising that energetic modifications have the largest effects on Protein A and the smallest effects on SH3.
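The root mean squared comparison of two mechanism maps can be illustrated with a minimal sketch. The data layout (nested lists indexed by residue and then by QAA bin) and the function name `p_rms` are assumptions made for illustration; the source does not specify how P(Qi,QAA) was stored or binned.

```python
import math


def p_rms(p_a, p_b):
    """Root mean squared difference between two mechanism maps.

    p_a and p_b are nested lists indexed as [residue][q_bin], holding
    P(Q_i, Q_AA) for two parameter sets. This layout is hypothetical.
    """
    if len(p_a) != len(p_b):
        raise ValueError("maps must cover the same residues")
    total, count = 0.0, 0
    for row_a, row_b in zip(p_a, p_b):
        if len(row_a) != len(row_b):
            raise ValueError("maps must use the same Q_AA bins")
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return math.sqrt(total / count)


# Toy maps for two residues and three Q_AA bins (made-up numbers).
map_1 = [[0.0, 0.5, 1.0], [0.0, 0.2, 1.0]]
map_2 = [[0.0, 0.6, 1.0], [0.0, 0.4, 1.0]]
print(round(p_rms(map_1, map_2), 4))
```

Averaging over every (residue, bin) cell with equal weight is one possible choice; weighting by the population of each QAA bin would be another.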

Figure 4.

The left column shows the probability of contacts being formed for each residue P(Qi,QAA) as a function of QAA for RC/D = 1.0 and RBB/SC = 1.0. The three right columns show P(Qi,QAA) for different Hamiltonians relative to RC/D = 1.0 and RBB/SC = 1.0. Blue indicates a decrease in formation, relative to RC/D = 1.0 and RBB/SC = 1.0, and red an increase. Proline containing regions are often sensitive to contact energy. In Protein A, both P12 and P30 fold earlier with increased contact strength. In SH3, the increase in formation of Val4 may be attributed to interactions with Pro56, though Pro50 and Trp35 do not exhibit increased formation. In CI2, both Pro6 and Pro61 exhibit increased formation with increased contact strength. Residues that lack native contacts are shown in gray.

Figure 4 shows that proline-containing regions are less robust to parametric modifications. Regions with prolines, and regions interacting with prolines, form structure earlier (at lower Q) with increased contact strength. This is because contact strength is increased at the expense of dihedral strength. Prolines possess a covalent Cδ–N bond, which limits the mobility of the ϕ dihedral. Removing energy from the dihedrals does not increase flexibility in prolines. However, adding energy to contacts increases structure formation around prolines. For this reason, increasing RC/D stabilizes and promotes earlier formation of proline-containing regions.

Fully folded backbone allows for disordered side chains

Although QAA and QCA capture the same cooperative folding events, at folding temperature, QCA is higher than QAA for the folded ensemble. This suggests that although the backbone structure is native (QCA ≈ 0.8), many of the native residue interactions form as temperature is decreased [Fig. 3(c,d)]. To account for this structurally and quantitatively, we calculated the difference between the probability of Cα contacts being formed P(QiCA,QCA) and AA contacts being formed P(QiAA,QCA) (Fig. 5). A value of 0 indicates that, on average, the Cα atoms of a residue pair are near their native distance when the side chains are in contact. Positive values are seen when extended side chains are interacting, resulting in the Cα atoms being far from their native distance. Negative values indicate backbone folding precedes side chain ordering. Side chains in Protein A seem to be well packed, in that there is concomitant side chain and backbone folding. In SH3, the turns have negative values, and are thus underpacked. In CI2, underpacking is primarily found in the active site loop and the C-terminal tail. These results reveal a signature of complicated folds:52, 53 a small subset of native contacts is sufficient to constrain the backbone to its native orientation, resulting in significantly underpacked regions in the native state. This occurs in complicated folds because an individual contact can impose a high level of order on the system. To form contacts that are distant in sequence, a large number of residues must also order. In Protein A, many contacts are local and only constrain single helical turns. In SH3 and CI2, fewer contacts are required to constrain the entire backbone (including the turns and loops).

Figure 5.

Difference in AA contact formation and Cα contact formation P(QiAA,QCA) − P(QiCA,QCA) for (a) Protein A, (b) SH3, and (c) CI2. Positive values (red) indicate that residues are interacting without the Cα atoms being near. Negative values (blue) indicate the residues are “underpacked”: the Cα atoms are near each other without the side chains interacting. Residues that lack native contacts are shown in gray. (d–f) Underpacked (blue spheres) and well packed (orange spheres) residues are shown on the native structures. In Protein A, to order the backbone of a helix the side chains must be packed around it. Beta sheets are stabilized by nonlocal interactions. Thus, a small number of contacts can maintain the tertiary structure of SH3 without the side chains in the turn regions interacting, hence the underpacking. In CI2, the active site loop is significantly underpacked.

Figure 3(c,d) shows the dynamics of a typical underpacked contact. As T is lowered below Tf, the underpacked contact's average distance and distance fluctuations smoothly decrease. This results in a gradual increase in Q without a noticeable free-energy barrier [see Fig. 6(e)]. We hope that these subtle dynamics will be experimentally probed and tested in the future.

Figure 6.

Free energy barriers in the AA model for (a) Protein A, (b) SH3, and (c) CI2. Profiles in (a–c) are for RBB/SC = 2.0 with RC/D = 1.0 (black), RC/D = 2.0 (red), RC/D = 3.0 (green), and RC/D = 4.0 (blue). In SH3 and CI2, barrier height decreases and the folded basins move to lower Q with increasing RC/D and increasing RBB/SC. (d) F(QCA(t)) and F(QAA(T)) for a typical parameter set demonstrate that the folded basins in (a–c) correspond to collapsed states. (e) Two distinct folding processes observed in our model: backbone collapse and side chain packing. (f) Free energy barriers obtained from Cα structure-based simulations for Protein A, SH3, and CI2. Barrier heights in the Cα simulations are greater than in AA simulations. Both models predict the largest barriers for SH3 and smallest for Protein A.

Free-energy profiles can be altered through parametric changes

Although the folding mechanisms are stable, the free-energy barriers associated with folding and the locations of the folded basins vary systematically with parameters. Figure 6 shows free-energy profiles for SH3, CI2, and Protein A for several values of RC/D with RBB/SC = 2.0. There are four distinct, interrelated trends shared by all three proteins. First, there are two folding processes: backbone collapse and side chain packing. Second, the free energy minimum for the folded state moves to lower Q with increasing RC/D. Third, the free energy barrier decreases with increasing RC/D. Finally, increasing RBB/SC has effects similar to those of increasing RC/D (not shown).

The free-energy basins for the folded states are located at QCA ≈ 0.8 and QAA ≈ 0.5 [Fig. 6(d)], indicating that the backbone orders while many native atom–atom interactions remain extended. Thus, the entropy loss during the cooperative folding transition is likely dominated by backbone ordering. Side chain packing occurs both concomitantly with, and after, backbone ordering.

There are likely two major factors that lead to the observed trends. First, increasing RC/D increases contact strength. As seen in other simplified models,54 when each contact is stronger, a smaller number of contacts is required (lower Q) to provide an equal amount of stabilizing energy. The second contributing factor is the change in side chain entropy. Although entropy loss in the backbone dominates the collapse transition, the gradual side chain packing can also lead to shifting basins. Increasing RBB/SC or RC/D reduces the strength of side chain dihedrals, resulting in more mobile unfolded side chains. Therefore, there is an increased entropy loss per side chain upon folding ΔSsc when RC/D or RBB/SC is increased. Because side chains can pack independently of the collapse transition, when ΔSsc increases, a fraction of the side chain interactions extend, while leaving the overall fold intact. Because the folded basin shifts to lower Q, the overall structure required to form a stable fold is reduced. A reduced barrier height naturally results when the folded basin is less ordered.

Free-energy barriers, in conjunction with diffusion constants, provide a direct connection to experimental folding rates.55–57 We find that the relative barrier heights calculated using our AA model are similar to those from a Cα model [Fig. 6(f)]. The relative barrier heights calculated from the Cα model are known to correlate well with experimental rates.57 We note in passing that the absolute free-energy barriers in the AA model can be parametrically changed by up to a factor of two for a given protein and that the relative barrier heights between proteins remain constant. Thus, although the magnitude of the rates will be determined by the diffusion constant, the correlation between experimental folding rates and theoretical barriers is independent of the choice of parameters.

All-atom structure-based simulations capture Cα folding mechanism

Next, we compare the backbone folding mechanisms of our AA model and a commonly used Cα model.9 The Cα representation has been successful at capturing experimentally determined protein folding mechanisms.8, 9, 11 The first column in Figure 7 shows the differences in folding mechanisms between the AA model and an energetically homogeneous Cα model. Every contact and dihedral in the homogeneous Cα model has the same interaction strength. Because the AA model distributes contact energy inhomogeneously between residue pairs, it is not surprising that the mechanisms differ.

Figure 7.

Comparison of backbone folding in Cα and AA structure-based models. The probability of contacts being formed in a Cα model, minus the probability of Cα contacts being formed in an AA model, is shown for (a–c) Protein A, (d–f) SH3, and (g–i) CI2. (a, d, g) Comparison of AA simulation to a Cα model with homogeneous contact strength. (b, e, h) Comparison of AA results to an energetically inhomogeneous Cα model. Regions of increased formation in the AA representation correspond largely to proline-containing regions, or regions that interact with proline, such as the mini-core in CI2 (black arrows indicate mini-core residues), the tails of SH3 and turn 2 of Protein A. Increased formation in the tails of CI2 can largely be accounted for by the large number of contacts between GLU4 and ARG62. (c, f, i) The inhomogeneous Cα model compared to the AA model with all prolines mutated to alanines. Mutating proline to alanine improved agreement between models. Residues that lack native contacts are shown in gray.

To remove differences arising from energetic homogeneity in the Cα model, we modified it such that each contact is weighted by the number of contacts between each residue pair in the AA model (Fig. 7, second column). For Protein A this modification improves agreement. The remaining difference is in a single turn-to-tail contact [Gln1 with Ser31, Fig. 2(a)] that rarely forms in Cα simulations. In SH3, agreement improves around residues Asp34 and Asn52, while differences persist in Gln45 and the tails. The overall effect is increased formation around Gln45 at the expense of the tails. In CI2, there is significant agreement in the tails, though the mini-core still forms earlier (in the AA model), at the expense of the helix. For all three proteins, several regions of disagreement possess proline residues, whose Cδ–N bond is not included in the Cα model.

To eliminate effects specific to proline, we repeated the AA simulations with all prolines mutated to alanines. The third column of Figure 7 shows the Pro-Ala mutants compared to the inhomogeneous Cα model. Improved agreement is observed in Pro-Ala mutants of SH3 and CI2. In both proteins Pro-Ala mutations delay folding of proline regions, in agreement with proline effects on model stability. In SH3 the tails still form slightly earlier in the AA model, at the expense of residues 35–55. In CI2, the balance between minicore and helix formation is clearly improved, highlighting the importance of prolines in the folding process. Pro-Ala mutations have almost no effect on the folding mechanism of P12 and P30 in Protein A and P25 in CI2. This is likely because these prolines are located in turn regions. In our model, turns are highly constrained by short range contacts, and the reduced dihedral constraint (imposed by a proline) acts as a small perturbation. The remaining differences between the Pro-Ala AA mutants and the inhomogeneous Cα model demonstrate, to no surprise, that the inclusion of side chains alters the relative entropy of residues.

Native basin dynamics of AA structure-based model correlate with the dynamics of an all-atom empirical forcefield with explicit solvent

Two common measures of native state dynamics are native contact formation and root mean squared deviations in structure (rmsd). Figure 8 shows the average contact formation in the native ensemble for the structure-based model and an all-atom empirical forcefield with an explicit solvent. Although the average contacts are not identical, no major differences in contact formation are observed. The overlaps between the AA maps and the all-atom empirical forcefield maps of Protein A, SH3, and CI2 are 0.85, 0.97, and 0.84, respectively. An overlap of 1 indicates identical maps, and 0 indicates the two maps have no contacts in common.

Figure 8.

Probability of contacts being formed P(i,j) at T ≈ 0.8 Tf for the AA structure-based potential (top left) and an all-atom empirical forcefield (bottom right) for (a) Protein A, (b) SH3, and (c) CI2. Dark red indicates that residue i (x axis) and residue j (y axis) are always in contact under native conditions. Dark blue indicates the contact is formed rarely (less than 10% of the time). White indicates P(i,j) < 0.025. In all three proteins, contacts are more broadly distributed (higher number of low probability contacts) in the structure-based simulations than in all-atom empirical forcefield simulations (fewer contacts, but with higher probabilities). There are approximately four times as many contacts with P(i,j) < 0.01 in the structure-based simulations as in the all-atom empirical simulations, indicating more mobile dynamics.

In a uniquely defined native state, the probability of each contact being formed is 1. Because we sample the native ensemble at finite temperatures, atom mobility leads to additional contacts being formed. In the structure-based model, these additional interactions are strictly repulsive. In an all-atom empirical forcefield these interactions can be attractive, yet they are observed more frequently in the structure-based model. These contacts are likely due to increased mobility in the structure-based simulations. In all-atom empirical forcefields, hydration shells can result in less mobile side chains, and hence a narrower distribution of contacts.

The increased mobility is quantified by the structural rmsd. The magnitude of fluctuations in all-atom empirical simulations is much lower than in structure-based simulations (not shown). For the all-atom empirical forcefield at 300 K, the average rmsds for Protein A, SH3, and CI2 are 1.53, 1.00, and 0.97 Å. The rmsds of the Cα atoms are 1.23, 0.66, and 0.74 Å. The same values are obtained in structure-based simulations at around T = 0.55Tf. In real temperature units, 0.55Tf corresponds to temperatures significantly less than 300 K. A likely cause of the reduced structural fluctuations in the all-atom empirical forcefield is hydration by explicit solvent molecules. To compare the distribution of rmsd fluctuations between models, correlation coefficients (r) were computed for the rmsd by atom in the all-atom empirical forcefield and the structure-based potential. For all parameter sets of the structure-based potential, r ≈ 0.7 for CI2 and SH3 and r ≈ 0.8 for Protein A.
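The per-atom comparison of fluctuations can be sketched as a Pearson correlation between two lists of rmsd values, one entry per atom and in the same atom order for both models. The text does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the numbers in the example are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical per-atom rmsd values (in Angstrom) from the two models.
rmsd_empirical = [0.6, 0.8, 1.4, 0.7, 2.1]
rmsd_structure_based = [0.9, 1.0, 1.9, 1.2, 2.4]
print(round(pearson_r(rmsd_empirical, rmsd_structure_based), 3))
```

Because r is insensitive to an overall scale factor, it compares how fluctuations are distributed over atoms rather than their absolute magnitude, which is why the two models can correlate well even though one is more mobile.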


In this article, we describe a systematic analysis of an AA structure-based model which bridges the gap between coarse-grained models and all-atom empirical forcefields. We show that in our Cα and AA structure-based models the global folding mechanisms agree and the main differences are largely due to energetic heterogeneity and the explicit representation of prolines in the AA model. Also, the native basin dynamics are similar in the AA structure-based model and an all-atom empirical forcefield with explicit solvent. In agreement with previous studies, the folding mechanisms in complicated folds are stable to parametric variation. On the other hand, the free-energy barriers associated with folding vary systematically with parameters. Because free-energy barriers are not a robust feature of this model, understanding the interplay between barrier heights and diffusion will be important before attempting to predict folding rates.55, 58, 59

Using this model we characterized two folding processes: one associated with backbone collapse and the other with side chain packing. We observed that backbone collapse is accompanied by partial side chain packing in a cooperative transition and residual side chain packing occurs as temperature is reduced below the global folding temperature. One explanation for the partial separation of backbone folding and side chain ordering may be that mobility in specific residues is necessary for the functional properties of proteins. Proteins are selected for their function. Orthogonal networks of residues responsible for stability and function have been proposed.60, 61 The observation in our model that some residues are not necessary to maintain the backbone structure is consistent with this proposal. In CI2, the backbone of the active site loop is in the native orientation, yet the side chains are not packed. In SH3, several turns are also disordered. Because binding sites are often found in loops, flexible loops may be more easily adapted to new sequences and functions.

Gradual side chain packing can also allow for proteins to functionally respond to cellular stress by affecting side chain orientations, without denaturing the entire protein. This is consistent with the prediction that localized unfolding, or cracking, is important for biological function of kinases and motor proteins.15, 18, 62–67

The current model explicitly includes the effects of topological contributions to protein folding, and the role of energetic contributions may now be elucidated. Our results are a significant step forward in understanding protein dynamics from the Cα to the all-atom level. In the coming years, it will be interesting to probe the effects of electrostatics, nonnative interactions, water, and explicit mutations in this model.


Energy function

In our AA model of the protein, only heavy (non-hydrogen) atoms are included. Each atom is represented as a single bead of unit mass. Bond lengths, bond angles, improper dihedrals, and planar dihedrals are maintained by harmonic potentials. Nonbonded atom pairs that are in contact in the native state between residues i and j, where i > j + 3, are given a Lennard-Jones potential, whereas all other nonlocal interactions are repulsive. All contacts identified by the Contact of Structural Units software package (CSU)68 were included. The functional form of the potential is,

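$$V = \sum_{\mathrm{bonds}} \epsilon_r (r - r_o)^2 + \sum_{\mathrm{angles}} \epsilon_\theta (\theta - \theta_o)^2 + \sum_{\mathrm{impropers/planar}} \epsilon_\chi (\chi - \chi_o)^2 + \sum_{\mathrm{backbone}} \epsilon_{BB} F_D(\phi) + \sum_{\mathrm{side\ chains}} \epsilon_{SC} F_D(\phi) + \sum_{\mathrm{contacts}} \epsilon_C \left[ \left( \frac{\sigma_{ij}}{r} \right)^{12} - 2 \left( \frac{\sigma_{ij}}{r} \right)^{6} \right] + \sum_{\mathrm{non\text{-}contacts}} \epsilon_{NC} \left( \frac{\sigma_{NC}}{r} \right)^{12} \qquad (1)$$

where

$$F_D(\phi) = \left[ 1 - \cos(\phi - \phi_o) \right] + \frac{1}{2} \left[ 1 - \cos\left( 3 (\phi - \phi_o) \right) \right] \qquad (2)$$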

and εr = 100, εθ = 20, εχ = 10, and εNC = 0.01. ro, θo, χo, ϕo, and σij are given the values found in the native state and σNC = 2.5 Å. When assigning dihedral strengths, we first group dihedral angles that share the middle two atoms. For example, in a protein backbone, one can define up to four dihedral angles that possess the same C–Cα covalent bond as the central bond. Each dihedral is given the interaction strength of 1/ND, where ND is the number of dihedral angles in the group. εBB and εSC are then scaled so that RBB/SC = εBB/εSC. Next, dihedral strengths and contact strengths are scaled such that our other system parameter, the ratio of total contact energy to total dihedral energy RC/D = ΣεC/(ΣεBB + ΣεSC), with εC the strength of a native contact, is satisfied. The total stabilizing energy is equal for all parameter sets (i.e., ΣεC + ΣεBB + ΣεSC is constant).
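The two-step scaling can be made concrete with a small sketch. It reflects one reading of the rules above: every dihedral group carries a total weight of 1 (each member 1/ND), backbone and side chain groups are weighted in the ratio RBB/SC, and contacts and dihedrals then share a fixed total energy in the ratio RC/D. The function name, argument names, and the example counts are hypothetical.

```python
def scale_strengths(n_contacts, n_bb_groups, n_sc_groups, r_cd, r_bbsc, total):
    """Return (eps_c, eps_bb, eps_sc) under the assumed scaling scheme.

    eps_c  : strength of one native contact
    eps_bb : total strength of one backbone dihedral group
    eps_sc : total strength of one side chain dihedral group
    Each dihedral in a group of N_D members would receive eps / N_D.
    """
    # Split the fixed total energy so that contacts / dihedrals = r_cd.
    dihedral_energy = total / (1.0 + r_cd)
    contact_energy = total - dihedral_energy
    # Within the dihedral share, enforce eps_bb / eps_sc = r_bbsc.
    eps_sc = dihedral_energy / (r_bbsc * n_bb_groups + n_sc_groups)
    eps_bb = r_bbsc * eps_sc
    eps_c = contact_energy / n_contacts
    return eps_c, eps_bb, eps_sc


# Made-up counts for a small protein.
eps_c, eps_bb, eps_sc = scale_strengths(
    n_contacts=100, n_bb_groups=40, n_sc_groups=20,
    r_cd=2.0, r_bbsc=2.0, total=150.0,
)
print(eps_c, eps_bb, eps_sc)
```

Holding the total fixed means that raising RC/D necessarily weakens the dihedrals, which is the trade-off the parameter scans in this article explore.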

As reaction coordinates we use QAA and QCA. QAA is the fraction of natively interacting residue pairs that are in contact. Two residues are considered in contact if any native atom–atom interaction between them is within 1.2 times its native distance σij. At 1.2σij the potential energy of a native pair is approximately half of the minimum. Similarly, QCA is the fraction of natively interacting residue pairs whose Cα atoms are within 1.2 times their native distance.
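A minimal sketch of the QCA calculation and of the energy at the 1.2σij cutoff follows. It assumes a 12-6 Lennard-Jones form with its minimum at r = σij; the function names, the dictionary layout, and the coordinates are hypothetical.

```python
import math


def lj_native(r, sigma, eps=1.0):
    """12-6 Lennard-Jones energy with minimum -eps at r = sigma (assumed form)."""
    x = (sigma / r) ** 6
    return eps * (x * x - 2.0 * x)


def q_ca(coords, native_pairs, cutoff=1.2):
    """Fraction of native residue pairs whose C-alpha atoms are within
    `cutoff` times their native distance.

    coords: dict mapping residue index -> (x, y, z) of its C-alpha atom
    native_pairs: dict mapping (i, j) -> native C-alpha distance
    """
    formed = 0
    for (i, j), native_dist in native_pairs.items():
        if math.dist(coords[i], coords[j]) < cutoff * native_dist:
            formed += 1
    return formed / len(native_pairs)


# Energy of a native pair at the contact cutoff, in units of eps.
energy_at_cutoff = lj_native(1.2, 1.0)

# Hypothetical three-residue example (distances in arbitrary units).
example_coords = {1: (0.0, 0.0, 0.0), 5: (6.0, 0.0, 0.0), 9: (0.0, 10.0, 0.0)}
example_native = {(1, 5): 5.5, (1, 9): 6.0, (5, 9): 11.0}
example_q = q_ca(example_coords, example_native)
```

Under this assumed form the energy at 1.2σij is about −0.56 of the well depth, in line with the statement that it is approximately half of the minimum. QAA would follow the same pattern, with a residue pair counted as formed when any of its native atom-atom pairs passes the distance test.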

Proline to alanine mutations

To investigate the role of proline residues in the AA model, proline to alanine mutants were constructed. This was achieved by removing the Cγ and Cδ atoms of each proline. Native contacts formed with the Cγ and Cδ of a proline were included as contacts with the Cβ of the corresponding alanine. This ensured the energetics of the system were unperturbed, and only topology was modified.
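The mutation procedure can be sketched as follows. The data layout (atoms as `(residue, name)` tuples, contacts as pairs of atoms) and the function name are hypothetical, and keeping every remapped contact as a separate entry is an assumption made so that the number of contacts is unchanged.

```python
def proline_to_alanine(atoms, contacts, proline_residues):
    """Convert the listed proline residues to alanine.

    atoms: list of (residue_index, atom_name) tuples
    contacts: list of native atom pairs, each ((res, name), (res, name))
    proline_residues: set of residue indices to mutate

    The CG and CD atoms of each proline are removed, and any native
    contact involving them is reassigned to the CB of the same residue.
    """
    removed_names = {"CG", "CD"}

    def remap(atom):
        res, name = atom
        if res in proline_residues and name in removed_names:
            return (res, "CB")
        return atom

    new_atoms = [
        a for a in atoms
        if not (a[0] in proline_residues and a[1] in removed_names)
    ]
    # Contacts are remapped, not dropped, so their count is preserved.
    new_contacts = [(remap(a), remap(b)) for a, b in contacts]
    return new_atoms, new_contacts


# Hypothetical example: proline at residue 10 contacting residue 20.
example_atoms = [(10, "N"), (10, "CA"), (10, "CB"), (10, "CG"),
                 (10, "CD"), (10, "C"), (10, "O"), (20, "CB")]
example_contacts = [((10, "CG"), (20, "CB")), ((10, "CA"), (20, "CB"))]
mut_atoms, mut_contacts = proline_to_alanine(
    example_atoms, example_contacts, {10}
)
```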

Simulation details

All-atom structure-based simulations were performed using the GROMACS software package.26 No modifications to the source code were necessary. Reduced units were used. The timestep τ was 0.0005. The Berendsen algorithm69 was used§ with a coupling constant of 1. For all folding results in this article, several constant temperature runs were performed, at temperatures ranging from those at which the protein is always folded to those at which it is always unfolded. The Weighted Histogram Analysis Method70, 71 was used to combine data from multiple temperatures into single free-energy profiles.

All-atom empirical forcefield simulations

All-atom empirical forcefield simulations were performed using GROMACS,26, 72 with the OPLS-AA forcefield73 and TIP3P water molecules.74 Each protein was simulated for 10 ns at T = 300 K and a pressure of 1 atm. A timestep of 2 fs was used in conjunction with the LINCS algorithm75, 76 for constraining covalent bonds involving hydrogen. Protein A, SH3, and CI2 were simulated with 2810, 3617, and 4644 water molecules in cubic boxes of initial dimensions 45.15 Å, 48.98 Å, and 53.07 Å, respectively. Temperature was maintained using the Berendsen algorithm.69 One nanosecond was allowed for equilibration. For the remaining 9 ns, structures were saved at 1 ps intervals.

Comparison of contacts

In the all-atom empirical forcefield simulations, contacts were determined for each saved structure using CSU.68 The average number of contacts 〈Q〉 was calculated for each protein. The probability of individual contacts being formed was averaged over all structures with Q = 〈Q〉. With the all-atom empirical potential, 〈Q〉 was 80, 135, and 146 for Protein A, SH3, and CI2, respectively. This analysis was repeated for folded-state simulations with our AA structure-based model, for which 〈Q〉 was 80, 138, and 144. To compare contact maps, the dot product of the two maps was taken.
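The contact map comparison can be sketched as a dot product between two maps of contact probabilities. Normalizing by the magnitudes of the two maps, so that identical maps give 1, is an assumption made here for illustration; the function name and the dictionary layout are also hypothetical.

```python
import math


def contact_map_overlap(map_a, map_b):
    """Normalized dot product of two contact probability maps.

    map_a, map_b: dict mapping a contact (i, j) -> probability that the
    contact is formed. Contacts missing from a map count as probability 0.
    """
    shared = set(map_a) & set(map_b)
    dot = sum(map_a[k] * map_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in map_a.values()))
    norm_b = math.sqrt(sum(v * v for v in map_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("each contact map needs at least one nonzero entry")
    return dot / (norm_a * norm_b)


# Hypothetical maps sharing one of their two contacts.
example_map_a = {(1, 5): 1.0, (2, 8): 0.5}
example_map_b = {(1, 5): 1.0, (3, 9): 0.5}
example_overlap = contact_map_overlap(example_map_a, example_map_b)
```

With this normalization the result lies between 0 (no shared contacts) and 1 (identical maps), which makes values comparable across proteins with different numbers of contacts.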


We would like to thank Angel Garcia and Peter G. Wolynes for useful discussions regarding all-atom modeling.

  • * QAA is a generous definition of side chain packing, because a side chain is “packed” when one or more atom–atom contacts are formed. Thus, “underpacked” residues clearly have very little native structure.

  • In Figure 8, only interactions present more than 2.5% of the time are shown.

  • Comparison of rmsd of the Cα atoms yields similar values of r.

  • When using the Berendsen thermostat, numerical instabilities can arise when the bath-molecule coupling timescale is shorter than the timescale for internal energy diffusion. In our experience, these problems tend to surface when simulating weakly interacting domains with implicit solvation. Because the present study investigates folding of single domain proteins under weak temperature coupling, these features are not likely to be a source of significant errors. Nonetheless, future work will also employ Langevin or Nosé-Hoover temperature coupling.