Intermediates and the folding of proteins L and G


  • Scott Brown,

    1. Department of Bioengineering, University of California (UC), Berkeley, Berkeley, California 94720-1762, USA
    • Present address: Abbott Laboratories, 1401 Sheridan Road, North Chicago, IL 60064-4000, USA.

  • Teresa Head-Gordon

    Corresponding author
    1. Department of Bioengineering, University of California (UC), Berkeley, Berkeley, California 94720-1762, USA
    • Department of Bioengineering, 472 Donner Laboratory, University of California, Berkeley, Berkeley, CA 94720-1762, USA; fax: (510) 486-6632.


We use a minimalist protein model, in combination with a sequence design strategy, to determine the differences in primary structure between proteins L and G that are responsible for the two proteins folding through distinctly different mechanisms. We find that the folding of proteins L and G is consistent with a nucleation-condensation mechanism, described as helix-assisted β-1 hairpin formation for protein L and helix-assisted β-2 hairpin formation for protein G. We determine that the model for protein G exhibits an early intermediate that precedes the rate-limiting barrier of folding. This intermediate draws together misaligned secondary structure elements that are stabilized by hydrophobic core contacts involving the third β-strand, and it presages the later transition state in which the correct strand alignment of these same secondary structure elements is restored. Finally, the validity of the targeted intermediate ensemble for protein G was analyzed by fitting the kinetic data to a two-step first-order reversible reaction. The fit indicates that protein G folding involves an on-pathway early intermediate that should be populated and therefore observable by experiment.

Although the thermodynamics and kinetics of small proteins that fold in a two-state manner are reasonably well understood (Gruebele 2002b; Myers and Oas 2002; Daggett and Fersht 2003a), understanding how (and why!) proteins fold through intermediates will be especially relevant for larger proteins, for more complicated topologies, and for their possible connection to the aggregation processes that are responsible for disease (Speed et al. 1997). Open questions surrounding intermediates include how to detect so-called “hidden” intermediates by kinetic experiments; whether intermediates can occur earlier than the rate-limiting step in folding, that is, whether free energy barriers exist that precede the rate-limiting nucleation barrier of the folding reaction; and whether such intermediates are “off-pathway” and therefore obstruct the functionally important progress of folding (Gruebele 2002a; Ozkan et al. 2002; Qin et al. 2002; Sanchez and Kiefhaber 2003a).

This work examines the question of intermediates by simulating the folding of two members of the ubiquitin fold class, Ig-binding proteins L and G. Proteins L and G make excellent targets for theoretical study as their folding attributes have been extensively studied by experiment (Gu et al. 1995, 1997; Park et al. 1997, 1999; Scalley et al. 1997; Kim et al. 2000; McCallister et al. 2000; Krantz et al. 2002). These two single-domain proteins have little sequence identity but identical fold topologies, consisting of a central α-helix packed against a four-strand β-sheet composed of two β-hairpins. Experimental evidence indicates that protein L folds in a two-state manner through a transition state ensemble involving a native-like β-hairpin 1, and largely disrupted β-hairpin 2 (Gu et al. 1997; Scalley et al. 1997; Kim et al. 2000). Protein G, on the other hand, folds through a possible early intermediate (Park et al. 1997, 1999; Speed et al. 1997), followed by a rate-limiting step that involves formation of β-hairpin 2. They therefore provide a perfect contrast to understand features that give rise to protein folding intermediates, while controlling for size and topology.

There have been a number of recent simulations of coarse-grained models of proteins L and/or G using different forms of minimalist models (Head-Gordon and Brown 2003). Shimada and Shakhnovich (2002) have used ensemble dynamics to characterize the kinetics of protein G using an all-atom Gō potential. Karanicolas and Brooks (2002) use a Gō potential bead model supplemented with sequence-dependent MJ statistical potentials to differentiate the folding of G and L. They found the origin of asymmetry in the folding of protein L and G to be in concurrence with that found by Nauli et al. (2001), who used a computer-based design strategy to reengineer the protein G sequence to include more stabilizing interactions for the first β-hairpin turn, producing a protein more faithful to the mechanism of folding for protein L.

Our recent work, inspired by early efforts of Thirumalai and co-workers (Honeycutt and Thirumalai 1990; Guo et al. 1992; Guo and Thirumalai 1996), develops physics-based potentials that make the connection between free energy landscapes and amino acid sequence, allowing us to engineer sequences that fold into α-helical, β-sheet, and mixed α/β protein topologies (Sorenson and Head-Gordon 1999, 2000, 2002a,b; Brown et al. 2003). These coarse-grained protein models provide the right emphasis of the most relevant native state features (Head-Gordon and Brown 2003) by capturing the correct spatial distribution of local and nonlocal contacts that are considered to be possibly the most important in governing the overall kinetics of protein folding (Plaxco et al. 1998; Alm et al. 2002). We have previously explored its use for members of the ubiquitin α/β fold class including proteins L and G and ubiquitin (Sorenson and Head-Gordon 2000, 2002a,b; Brown et al. 2003). Recently, we have verified that the use of a three-letter sequence code is capable of translating the differences in primary sequence for proteins L and G into the experimentally observed differences in thermodynamic and kinetic properties of folding (Brown et al. 2003).

In this work, we analyze the kinetics of folding for these distinct sequences to characterize the dynamics of navigating the free-energy landscape from unfolded to native state. Pfold simulations (Du et al. 1998) and contact map analysis have been used to characterize the folding landscape. The folding of our protein L model follows two-state kinetics, and shows the presence of a transition-state ensemble with a well-formed β-hairpin 1. Similar analysis of protein G shows that it folds through at least two pathways, which we label the fast and slow pathways. The fast pathway exhibits two-state kinetics, and folds through a transition-state ensemble with a well-formed β-hairpin 2. In both cases of protein L and fast-folding protein G, these secondary structural elements are assisted by the α-helix, and the overall folding mechanism seems consistent with a nucleation-condensation mechanism observed for other proteins (Daggett and Fersht 2003a,b).

The slow pathway for protein G is what gives rise to three-state kinetics, and involves an early intermediate, that is, an intermediate that precedes the rate-limiting step in folding. The intermediate is characterized by hydrophobic contacts between the third β-strand and β-strands 1 and 2, although the associated strands are misaligned relative to our model of the folded state. The transition state that follows the intermediate and precedes folding to the native state is characterized by native-like registering of these same β-strand pairings. The tractability of the simulation model allows us to fit the kinetic data to a unimolecular two-step kinetic model to summarily characterize the kinetics of protein G folding. The fit indicates that a barrier separates the unfolded state from the early folding intermediate, that the intermediate is lower in free energy than the unfolded state, and that the intermediate should therefore be populated and observable by experiment.

Results and Discussion

One of the differences between our L and G model proteins is manifested in the relative thermodynamic stability of the different elements of secondary structure. Figure 1 shows free-energy projections along χβ−1 and χβ−2 for L and G. From these projections there is a minimum free-energy path connecting the unfolded ensemble to the folded ensemble that involves either sequential formation of β-hairpin 1 followed by β-hairpin 2 (protein L), or β-hairpin 2 followed by β-hairpin 1 (protein G). However, we appear to only be getting part of the picture in Figure 1, as the barrier height separating the unfolded and folded ensembles is insufficiently high (relative to kBT) to justify locating the rate-limiting transition state solely on this surface.
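The free-energy projections discussed here come from the multi-histogram analysis described in Materials and methods. As a much simpler illustration of how a projection relates to a population histogram, the sketch below applies the single-temperature relation F = −T ln P (reduced units, kB = 1) to a toy histogram. The function name, the bin counts, and the use of a single temperature are assumptions for illustration and are not the authors' procedure.

```python
import math

def free_energy_profile(counts, temperature):
    """Return F(bin) = -T ln P(bin), shifted so the lowest bin is zero.

    counts: histogram counts along one order parameter (hypothetical input).
    Empty bins are assigned infinite free energy.
    """
    total = sum(counts)
    profile = []
    for c in counts:
        if c == 0:
            profile.append(math.inf)
        else:
            profile.append(-temperature * math.log(c / total))  # k_B = 1
    lowest = min(profile)
    return [f - lowest for f in profile]

# Toy histogram with two basins separated by a sparsely populated bin.
profile = free_energy_profile([400, 50, 5, 45, 500], temperature=0.41)
# Barrier height measured from the first basin to the highest interior bin.
barrier = max(profile[1:4]) - profile[0]
```

A barrier obtained this way is only meaningful relative to the temperature, which is the comparison the text makes when it calls the projected barrier insufficiently high relative to kBT.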

The folding kinetics at the folding temperature, shown in Figure 2, illustrate the difference in folding mechanism between L and G (fit parameters are given in Tables 2 and 3). The kinetic data for protein L are fit well by a single exponential, consistent with what is reported in the literature for protein L (Scalley et al. 1997; Kim et al. 2000). Thus, our protein L model folds in a cooperative two-state manner, and possibly through the initial formation of β-hairpin 1.

For protein G, the story is not as straightforward. We find that protein G folds more slowly than protein L by a factor of 2, qualitatively consistent with experiment (McCallister et al. 2000); however, the kinetic data for protein G are better fit by at least a double exponential (parameters shown in Table 2). From this fit we find two populations: a fast-folding population, comprising roughly 80% of the ensemble, that folds cooperatively, and a slow-folding remainder that folds by a different mechanism that we analyze further below. The time scale that roughly delineates these two populations is 2 × 10⁶ time steps. After this many time steps the majority of the fast-folding trajectories have folded, while only a tiny fraction of the slow folders have folded. Using 2 × 10⁶ time steps as a cutoff, we refit the kinetic data of the fast-pathway population for protein G and obtain a single exponential, while the fit to the remaining 20% of slow folders gives a double exponential, suggestive of an intermediate state in the slow-folding trajectories.
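One quick way to see why a single exponential is inadequate for a mixed population is that the logarithm of the unfolded fraction is a straight line in time only for single-exponential decay. The sketch below fits such a line by least squares and reports the residual. The curves are synthetic, the rate constants 1.0 and 0.1 are arbitrary, and the 80/20 split merely echoes the populations described above; none of this is the fitting procedure or the data used in the paper.

```python
import math

def fit_log_linear(times, unfolded_fraction):
    """Least-squares line through ln(unfolded fraction) versus time.

    Returns (rate, rms_residual): rate is minus the slope, and the residual
    measures the departure from single-exponential decay.
    """
    ys = [math.log(p) for p in unfolded_fraction]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    rms = math.sqrt(sum((y - (intercept + slope * t)) ** 2
                        for t, y in zip(times, ys)) / n)
    return -slope, rms

# Synthetic, illustrative decay curves in arbitrary time units.
times = [0.1 * i for i in range(1, 51)]
single = [math.exp(-1.0 * t) for t in times]
double = [0.8 * math.exp(-1.0 * t) + 0.2 * math.exp(-0.1 * t) for t in times]

rate_single, rms_single = fit_log_linear(times, single)
rate_double, rms_double = fit_log_linear(times, double)
```

The single-exponential curve returns its rate with essentially zero residual, whereas the mixed curve leaves a clear systematic residual, which is the signature that motivates a double-exponential fit.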

If we examine the folding ensemble at both early and late times for the two populations, we see that the fast pathway involves a collapse concomitant with folding scenario (Fig. 3A). The fast pathway also involves a greater degree of native α-helix formation relative to the slower population. This is an important difference between the two pathways, and we will return to it later. The slower population is characterized by more relative ordering of both β-sheet regions 1 and 2 (Fig. 3B). Note that both kinetic pathways exhibit a more developed β-2 region relative to the β-1 region, reflecting what is seen in the thermodynamic analysis. The picture from both kinetics and thermodynamics appears to be consistent, and points to a folding mechanism that involves formation of β-hairpin 2 at some rate-limiting step prior to that of β-hairpin 1.

Shown in Figure 4 is a two-dimensional free-energy surface projected onto the radius of gyration, Rg, and native-state similarity parameter, χ. Figure 4A shows the relationship between L collapse and native-state formation, which appears to occur by a single pathway leading from expanded, nonnative to the minimum on the surface corresponding to collapsed and native-like. This is consistent with the picture from kinetic data of a collapse concomitant with folding scenario (Plaxco et al. 1999). In contrast to this, for protein G there appear to be two pathways for collapse, with two separate minima for each pathway, as illustrated by the arrows in Figure 4B. One pathway involves collapse to a largely nonnative structure, whereas the other pathway reflects a collapse concomitant with folding scenario, as seen with protein L. The barrier separating these two minima in Figure 4B again has insufficient height to account for observed kinetic data.

Recent work has strongly emphasized that the choice of reaction coordinate for monitoring folding progress is important for the observation of intermediates and general interpretation of kinetic data (Qin et al. 2002). A potential pitfall of choosing a reaction coordinate is illustrated in Figure 5, which shows the potential of mean force for protein G as a function of native-state similarity in going from the unfolded (χ ≈ 0) to folded (χ ≈ 1) states for a range of temperatures spanning the folding temperature. In producing our kinetic data we use this same native-state similarity parameter to determine the extent of folding during folding trajectories. Note that the folding temperature for the protein G sequence is T* ≈ 0.41. Jumping from the free-energy surface at T* = 0.5 to the surface at T* = 0.35 would involve a downhill rearrangement in the distribution of the unfolded ensemble. These results represent an alternative interpretation of ultrafast folding experiments that remains consistent with overall evidence for two-state folding (Parker and Marqusee 1999).

One of the benefits of coarse-grained models is the ability to fully characterize ensemble kinetics on the free energy landscape by investigating transition state ensembles, and putative intermediates, provided we can find suitable reaction coordinates for their description. We examined a number of reaction coordinates before determining ones that adequately capture the folding events in our model. These include contact order parameters α, β-1, β-2, β-2α, β-3α, β-1β-2, β-1β-4, β-2β-3, β-3β-4, β-3β-4α, β-2β-3α, β-1β-2β-3, as well as a “diffuse” order parameter that is an expanded native state.

Several additional folding-trajectory analyses were performed to obtain more extensive kinetic characterization, in which progress during folding was monitored along a variety of these chosen order parameters. Structures with desired values along these order parameters were saved and served to form a set of putative transition states, which were then subsequently used as starting structures in trajectories for Pfold analysis. From the Pfold simulations we obtained a subset of successful order parameters (shown in Table 2), which correlate well with the definition of transition state ensemble. Through this procedure we determined a transition state ensemble for protein L, a transition state ensemble for the fast pathway in protein G, and the late transition state ensemble for the slow folding pathway of protein G. During Pfold simulations the trajectories either fold or do not fold, by definition. By saving the structures for those trajectories that did not fold in the Pfold simulations of structures corresponding to the transition state ensemble of the slow folding pathway for protein G, we were able to isolate the structural characteristics of the ensemble of early intermediates.

Figure 6A shows a contact map with reference lines indicating native-state contacts (black line) and the contacts that are present across at least 90% of the transition-state ensemble for protein L (gray line). The transition-state contacts show that the model for protein L folds through structures with a helix-assisted β-1 hairpin nucleus. Figure 6B shows a similar map of contacts to delineate the transition state ensemble in the fast pathway of protein G. In the case of protein G's fast pathway the transition-state ensemble involves formation of a helix-assisted β-2 hairpin nucleus.

Figure 7A shows the contact map for the contacts present in at least 90% of the structures in our intermediate ensemble. Figure 7B shows the contacts present in at least 90% of the late transition-state ensemble structures for the slow folding pathway of protein G. The intermediate is characterized by associated helix with β-strands 2 and 3, with a smaller amount of associated β-strands 1 and 3; however, these strands are misaligned relative to the native state. The subsequent transition state ensemble is in large part characterized by an alignment correction of this same strand association pattern exhibited in the intermediate, followed by more robust association of the other β-strands.

Finally, we test whether the intermediate occurs early on the pathway by fitting the data to a two-step reversible first-order UIN mechanism. Given the characterization of the intermediate for the slow pathway of protein G, we can monitor individual folding trajectories and record when states enter and leave the U, I, and N designations. Provided we observe a large number of trajectories, we can assemble a picture of the pathway the folding population follows as a function of time, and fit the corresponding data to the UIN mechanism:

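\[
\mathrm{U} \;\underset{k_{-1}}{\overset{k_{1}}{\rightleftharpoons}}\; \mathrm{I} \;\underset{k_{-2}}{\overset{k_{2}}{\rightleftharpoons}}\; \mathrm{N} \tag{1}
\]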

The solutions for the time rate of change of concentration of each species, expressed as functions of the rate constants k1, k−1, k2, and k−2, are given in Appendix A. As far as we are aware, this is the first time a solution of the full UIN mechanism without simplifying approximations has appeared in the protein literature, although this mechanism is very often invoked in the analysis of protein folding reactions in its various simplified limiting forms. When our data are fit to these equations, the fit yields values of the rate constants and associated estimates of relative free-energy minima, which are given in Table 4. We also show the quality of the fit of the slow-folding protein G data to the UIN model in Figure 8. Note that in Figure 8 we have enforced a restriction in which we eliminate all trajectories that fold prior to 2 × 10⁶ time steps. This allows us to focus exclusively on the trajectories folding via the slow pathway, but leads to a slight anomaly in Figure 8 for the populations immediately prior to 2000τ. By removing greater than 95% of the fast-folding trajectories we have excluded a small fraction of the slow-folding trajectories. In summary, the kinetic model shows that a barrier separates the unfolded state from the early-folding intermediate, that the intermediate is lower in free energy than the unfolded state, and that it should therefore be populated and observable by experiment.
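For readers without access to Appendix A, the sketch below gives one standard closed-form solution of the reversible two-step mechanism U ⇌ I ⇌ N for a population that starts entirely in U. It is derived here from the initial conditions and is not claimed to match the form the authors use. The function names and the rate constants in the example are illustrative assumptions, and the expressions require the two nonzero relaxation rates to be distinct (guaranteed when k−1 and k2 are both positive).

```python
import math

def uin_populations(t, k1, km1, k2, km2):
    """Populations (U, I, N) at time t for U <-> I <-> N with U(0) = 1."""
    s = k1 + km1 + k2 + km2              # lam1 + lam2
    p = k1 * k2 + k1 * km2 + km1 * km2   # lam1 * lam2
    root = math.sqrt(s * s - 4.0 * p)
    lam1, lam2 = 0.5 * (s + root), 0.5 * (s - root)
    u_eq, n_eq = km1 * km2 / p, k1 * k2 / p   # equilibrium populations
    # Amplitudes follow from U(0) = 1, dU/dt(0) = -k1, N(0) = 0, dN/dt(0) = 0.
    a1 = (k1 - lam2 * (1.0 - u_eq)) / (lam1 - lam2)
    a2 = (lam1 * (1.0 - u_eq) - k1) / (lam1 - lam2)
    b1 = n_eq * lam2 / (lam1 - lam2)
    b2 = -n_eq * lam1 / (lam1 - lam2)
    e1, e2 = math.exp(-lam1 * t), math.exp(-lam2 * t)
    u = u_eq + a1 * e1 + a2 * e2
    n = n_eq + b1 * e1 + b2 * e2
    return u, 1.0 - u - n, n   # I follows from mass balance

def intermediate_stability(k1, km1, temperature):
    """Free energy of I relative to U, -T ln(k1 / k-1), with k_B = 1."""
    return -temperature * math.log(k1 / km1)

# Illustrative rate constants only; the fitted values are those in Table 4.
u, i, n = uin_populations(0.7, k1=2.0, km1=1.0, k2=0.5, km2=0.1)
```

Because I is obtained from 1 − U − N, the three populations satisfy mass balance by construction, and a negative value from `intermediate_stability` corresponds to an intermediate that is lower in free energy than the unfolded state.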


We find that protein L is a two-state folder, in agreement with existing experiments (Gu et al. 1997; Scalley et al. 1997; Kim et al. 2000). As such, it provides a unique reference system for understanding intermediates by comparing its folding to protein G, a structurally homologous protein of similar length for which continuous flow fluorescence experiments support the population of an early intermediate along the folding pathway (Park et al. 1997, 1999). This, by definition, involves the presence of an additional free energy barrier preceding the rate-limiting barrier in folding. It is important to note that the stopped flow experiments cannot resolve any early intermediates in the folding of protein L, unlike the better time-resolved continuous flow experiments for protein G. Although continuous flow results have been called into question as a problem of suspect interpretation of ultrafast folding events in general (Krantz et al. 2002), our model supports the view that protein G folds through an early intermediate while protein L does not.

Protein L's transition state ensemble is composed of helix-assisted β-1 hairpin formation. We conclude that protein G folds through at least two pathways: a fast pathway involving roughly 80% of the folding population, with a transition state composed of a helix-assisted β-2 hairpin nucleus, and a slow-folding pathway through which the remaining folding population proceeds in a three-state mechanism. In our model, the slow pathway involves an early intermediate involving the third β-strand. This intermediate is separated from the unfolded state by a significant barrier (relative to kBT) and is lower in free energy than the unfolded state (Table 4), and therefore should be populated and observable by experiment. Our model thus strongly supports the interpretation of the continuous flow experiments by Park et al. (1997) as evidence of an early-folding intermediate.

Our results also emphasize that the choice of reaction coordinate used experimentally is very important to avoid conflicting conclusions concerning the presence of intermediates, as was found to be the case for the reexamination of the presence of an intermediate in ubiquitin (Qin et al. 2002). Similar conclusions concerning the proper determination of reaction coordinates that monitor folding were also reached by Shimada and Shakhnovich (2002). Their simulation of protein G found that folding occurred through multiple pathways, each of which passes through an on-pathway intermediate. They showed that when folding is monitored by using burial of the lone tryptophan in protein G as the reaction coordinate, the ensemble kinetics shows a significant burst phase, while alternative reaction coordinates reveal the presence of different folding pathways. They make the point that ensemble averaging can mask the presence of multiple pathways when nonideal reaction coordinates are used. We required a variety of different order parameters, coupled with Pfold analysis, to characterize the reaction coordinates for protein L and protein G folding and to find all intermediates and transition states. Furthermore, we fit our kinetic data to a UIN-type mechanism, and provide estimates of the rate constants and relative free-energy minima to fully characterize the folding pathways.

The work reported by Shimada and Shakhnovich (2002) using an all-atom Gō potential most closely parallels the analysis of protein G folding described here. They observe three pathways, each involving its own intermediate, I1 (helix-hairpin 1), I2 (helix-hairpin 2), and I3 (β-1–β-4), with each pathway converging to the same transition state. Our physics-based α-carbon trace model finds two major pathways, each with its own transition state, with only one pathway exhibiting an intermediate, characterized by β-2β-3α. At this point it is difficult to say more about the structural nature of the experimental intermediate given the nonspecific nature of the reaction coordinate used in the experimental study (the tryptophan on the third β-strand). However, we expect that the structural details of the intermediate are potentially more reliably predicted by the all-atom simulation, because our coarse-grained model inadequately describes β-sheet structure and instead forms a β-strand bundle for proteins L and G.

However, owing to the low computational cost of our bead model, we are able to perform various analyses of thermodynamics as well as Pfold analyses along entire trajectories to isolate structures belonging to the transition-state and intermediate ensembles. This level of detailed investigation is not possible (or is not pursued) in more complicated models. For example, the total number of trajectories examined in Shimada and Shakhnovich (2002) is only 50, whereas we examine 1000 folding trajectories. With only 50 trajectories we found that we were unable to reliably analyze our data and comment on ensemble folding properties. We found this to be a particularly significant problem when fitting to our postulated three-state reaction mechanism. The primary advantage of physics-based bead models is that the kinetics and thermodynamics are fully characterizable with high-quality statistics, and the overall qualitative agreement with experiment is very good.

The differences in the folding properties of L and G, for the fast pathways, are consistent with a nucleation-condensation model (Abkevich et al. 1994) or nucleation-collapse mechanism (Guo and Thirumalai 1995) that has been used to analyze kinetic data on two-state folders (Fersht 1997; Daggett and Fersht 2003a,b; Sanchez and Kiefhaber 2003b). Whereas the fast pathway mechanisms for L and G involve the contact-assisted formation of secondary structure to create a folding nucleus at the transition state, the slow pathway in protein G involves an obligatory intermediate that precedes the rate-limiting step, a result that may seem inconsistent with a nucleation-based mechanism. However, as is seen in the case of barnase (Daggett and Fersht 2003a), the intermediate for protein G assists in formation of the folding nucleus. It has been pointed out that increasing the hydrophobicity may lead to a shift in folding mechanism towards a molten globule-like intermediate (Daggett and Fersht 2003a); that does not appear to be the case here. It should be noted that in this model the sequences for proteins L and G have an identical number of L and B beads, and thus have an identical global hydrophobicity. However, the third β-strand is significantly more hydrophobic in protein G relative to protein L, and hence, the intermediate certainly arises due to stabilization by hydrophobic contacts.

This greater hydrophobicity for protein G helps stabilize an intermediate that draws together the secondary structure elements of β-strand 3 in association with β-strands 1 and 2, although these secondary structure elements are out of register relative to the native state. However, this helps set up the final step in folding, which now involves a transition-state ensemble that corrects for the misalignment of this core nucleus of associated strand elements. Recent work has suggested that intermediates that are higher in free energy relative to the unfolded state (perhaps hidden from experimental view) can accelerate folding (Wagner and Kiefhaber 1999; Sanchez and Kiefhaber 2003a). Protein G folding involves an intermediate that is more stable than the unfolded state, and in fact, slows down folding relative to protein L, all of which is supported by experiment as well as the coarse-grained model examined here. Perhaps hydrophobic-stabilized intermediates are a concession to certain amino acid sequences, designed by nature for other functional reasons, that would otherwise fold by enthalpic barriers that are simply too high.

Materials and methods

The protein model has been described previously (Sorenson and Head-Gordon 1999, 2000, 2002a,b). The protein chain is modeled as a sequence of beads of three flavors (hydrophilic, hydrophobic, and neutral), designated by L, B, and N, respectively. In general, the pair-wise interaction between beads is attractive for hydrophobic–hydrophobic (B–B) interactions, and repulsive for all other bead pairs (although the strength of the repulsive interactions depends on the bead types involved). In addition to pair-wise nonbonded interactions, the other contributions to the potential energy function include bending and torsional degrees of freedom. The total potential energy function is given by

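\[
V = \sum_{\theta} \tfrac{1}{2} k_{\theta} (\theta - \theta_{0})^{2} + \sum_{\phi} \left[ A(1 + \cos\phi) + B(1 - \cos\phi) + C(1 + \cos 3\phi) + D\left(1 + \cos\left(\phi + \tfrac{\pi}{4}\right)\right) \right] + \sum_{i,\, j \ge i+3} 4 \varepsilon_{H} S_{1} \left[ \left(\frac{\sigma}{r_{ij}}\right)^{12} - S_{2} \left(\frac{\sigma}{r_{ij}}\right)^{6} \right] \tag{2}
\]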

where εH determines the energy scale and sets the strength of the hydrophobic interactions. The bond angle energy term is a stiff harmonic potential with force constant kθ = 20εH/rad² and θ0 = 105°. The second term in the potential energy designates the torsional, or dihedral, potential, and is given by one of the following: helical (H), with A = 0, B = C = D = 1.2εH; extended (E), favoring β-strands, with A = 0.9εH, C = 1.2εH, B = D = 0; or turn (T), with A = B = D = 0, C = 0.2εH. For each dihedral angle potential the global minimum is the specified secondary structure type, but stable local minima exist for the other secondary structure angles. This aspect of the potential sits between a Gō model and a purely ab initio energy function because the dihedral angle potential is assigned for each bead based on the known native state. However, we use no explicit secondary or tertiary structure template to define any aspect of the potential, and hence, the form and parameters are transferable to any protein. The nonbonded interactions are determined by S1 = S2 = 1 for B–B interactions; S1 = 1/3 and S2 = −1 for L–L and L–B interactions; and S1 = 1 and S2 = 0 for all N–L, N–B, and N–N interactions. For convenience all simulations are performed in reduced units, with mass m, length σ, energy εH, and kB all set equal to unity. Note that although the nonbonded potential is symmetric with respect to inversion, that is, Vnonbonded(rij) = Vnonbonded(rji), this is not true for the dihedral interactions, as ϕ = f(ri, ri+1, ri+2, ri+3). Thus, the total energy function is not symmetric with respect to index permutations.
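To make the role of the coefficients S1 and S2 concrete, the sketch below evaluates a pair energy for each bead-pair class. It assumes a Lennard-Jones-like pair form, 4 εH S1 [(σ/r)^12 − S2 (σ/r)^6], which is consistent with the description above (B–B attractive, all other pairs repulsive) but is an assumption as far as this example goes. The function and dictionary names are hypothetical.

```python
def nonbonded_energy(r, s1, s2, epsilon_h=1.0, sigma=1.0):
    """Assumed pair form: 4 * eps_H * S1 * [(sigma/r)^12 - S2 * (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon_h * s1 * (sr6 * sr6 - s2 * sr6)

# (S1, S2) values quoted in the text for each class of bead pair.
PAIR_COEFFICIENTS = {
    "B-B": (1.0, 1.0),         # hydrophobic pair: attractive well
    "L-L": (1.0 / 3.0, -1.0),  # hydrophilic pair: repulsive at all distances
    "L-B": (1.0 / 3.0, -1.0),
    "N-any": (1.0, 0.0),       # neutral bead with anything: r^-12 repulsion only
}

r_min = 2.0 ** (1.0 / 6.0)  # location of the B-B minimum in reduced units
energies_at_r_min = {
    pair: nonbonded_energy(r_min, s1, s2)
    for pair, (s1, s2) in PAIR_COEFFICIENTS.items()
}
```

Under this assumed form the B–B well depth equals εH, which matches the statement that εH sets the strength of the hydrophobic interactions, while the negative S2 for L–L and L–B pairs turns the r⁻⁶ term into an additional repulsion.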

We perform constant-temperature simulations using Langevin dynamics in the low friction limit for characterizing the thermodynamics and kinetics of folding. Bond lengths are held rigid using the RATTLE algorithm (Andersen 1983). The free-energy landscape is characterized using the multiple, multidimensional weighted histogram analysis technique (Ferrenberg and Swendsen 1989; Kumar et al. 1995; Ferguson and Garrett 1999). We collect multidimensional histograms over a number of different order parameters, including energy E, radius of gyration Rg, and various native-state similarity parameters χ,

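\[
\chi = \frac{1}{M} \sum_{i} \sum_{j \ge i+4} h\left(\varepsilon - \left| r_{ij} - r_{ij}^{\mathrm{native}} \right|\right) \tag{3}
\]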

where the double sum is over beads on the chain, and rij and rij^native are the distances between beads i and j in the state of interest and the native state, respectively. h is the Heaviside step function, with ε = 0.2 to account for thermal fluctuations away from the native state structure. M is a constant that satisfies the conditions that χ = 1 when the chain is identical to the native state and χ ≈ 0 in the random coil state. The remaining χ parameters are specific to their respective elements of secondary structure. That is, χα involves summation over beads in the helix, and χβ−1 and χβ−2 involve summation over beads in the first β-sheet region and second β-sheet region, respectively, etc.
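A minimal sketch of the native-state similarity parameter follows. It counts bead pairs whose separation lies within ε of the native separation and normalizes by the number of pairs considered. Two choices are assumptions made for this example rather than details stated in the text: pairs closer than four beads apart along the chain are skipped (`min_separation=4`), and M is taken to be the total number of pairs counted, which guarantees χ = 1 for the native structure. The function name and toy coordinates are hypothetical.

```python
import math

def native_similarity(coords, native_coords, eps=0.2, min_separation=4):
    """Fraction of bead pairs whose distance is within eps of the native one."""
    n = len(native_coords)
    hits = 0
    pairs = 0
    for i in range(n):
        for j in range(i + min_separation, n):
            d = math.dist(coords[i], coords[j])
            d0 = math.dist(native_coords[i], native_coords[j])
            pairs += 1
            if abs(d - d0) <= eps:  # Heaviside step h(eps - |d - d0|)
                hits += 1
    return hits / pairs  # M is assumed equal to the number of pairs counted

# Toy "native" structure: ten beads on a helix-like curve.
native = [(math.cos(i), math.sin(i), 0.5 * i) for i in range(10)]
# A uniformly expanded copy, in which every pair distance is tripled.
stretched = [(3 * x, 3 * y, 3 * z) for x, y, z in native]

chi_native = native_similarity(native, native)
chi_stretched = native_similarity(stretched, native)
```

Restricting the loops to the beads of one secondary structure element would give the element-specific parameters such as χα or χβ−1 in the same way.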

From the histogram method we get the density of states, Ω, as a function of these order parameters, which can be used to calculate thermodynamic quantities. One useful quantity is the native-state population as a function of temperature,

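\[
P_{\mathrm{Nat}}(T) = \frac{\sum_{E} \sum_{\chi \ge \chi_{\mathrm{NBA}}} \Omega(E, \chi)\, e^{-E/k_{B}T}}{\sum_{E} \sum_{\chi} \Omega(E, \chi)\, e^{-E/k_{B}T}} \tag{4}
\]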

where χNBA indicates the boundary of the native-state basin of attraction (NBA; Nymeyer et al. 1998). In constructing the free energy surfaces we collect histograms at 15 different temperatures: 1.20, 0.90, 0.70, 0.62, 0.60, 0.55, 0.50, 0.48, 0.46, 0.44, 0.42, 0.41, 0.40, 0.39, and 0.38. We run three independent trajectories at each temperature, and collect 10,000 data points per trajectory.
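As an illustration of how a density of states yields a native-state population, the sketch below Boltzmann-weights each (E, χ) bin and sums the weight of bins inside the native basin. The dictionary layout, the function name, and the convention that the basin is χ ≥ χNBA are assumptions for this example. The two-state density of states is a toy constructed so that the populations cross at T = 0.4 and is not the model's actual Ω.

```python
import math

def native_population(density_of_states, temperature, chi_nba=0.40):
    """Boltzmann-weighted fraction of states with chi >= chi_nba (k_B = 1).

    density_of_states: dict mapping (energy, chi) -> Omega (hypothetical layout).
    """
    native = 0.0
    total = 0.0
    for (energy, chi), omega in density_of_states.items():
        weight = omega * math.exp(-energy / temperature)
        total += weight
        if chi >= chi_nba:
            native += weight
    return native / total

# Toy two-state system: one low-energy native bin and a highly degenerate
# unfolded bin. The degeneracy exp(25) places the midpoint at T = 10/25 = 0.4.
toy_dos = {(-10.0, 0.9): 1.0, (0.0, 0.1): math.exp(25.0)}

populations = {t: native_population(toy_dos, t) for t in (0.35, 0.40, 0.50)}
```

The temperature at which this population equals one half is a common operational definition of the folding temperature.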

The kinetics of the folding process can be characterized by calculating a large number of first-passage times (the time required for a folding trajectory to first enter the native basin of attraction, defined to be χNBA = 0.40). The first-passage times are calculated by taking an initial high-temperature random-coil structure and evolving it at the temperature of interest until recording the time that it first enters the native basin of attraction. We subtract off an initial correlation time in which the high-temperature chain is briefly equilibrated at the target temperature (this is the computational dead time during the kinetics run).
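The first-passage bookkeeping described above can be sketched as follows. A trajectory is represented simply as a list of χ values recorded at successive steps; that representation, the function names, and the treatment of the dead time as a number of initial steps to discard are assumptions made for illustration.

```python
def first_passage_time(chi_series, chi_nba=0.40, dead_time=0):
    """Steps until chi first reaches chi_nba, measured after the dead time.

    Returns None if the trajectory never enters the native basin.
    """
    for step, chi in enumerate(chi_series):
        if step >= dead_time and chi >= chi_nba:
            return step - dead_time
    return None

def unfolded_fraction(passage_times, t):
    """Fraction of trajectories that have not yet folded at time t."""
    unfolded = sum(1 for fpt in passage_times if fpt is None or fpt > t)
    return unfolded / len(passage_times)

# Toy chi traces (not simulation output).
traces = [
    [0.10, 0.20, 0.45, 0.30],
    [0.05, 0.10, 0.15, 0.20],
    [0.10, 0.41, 0.50, 0.60],
]
times = [first_passage_time(trace) for trace in traces]
```

Evaluating `unfolded_fraction` on a grid of times gives the decay curve to which single- or double-exponential forms are fit.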

To accurately characterize the proper transition-state ensemble in our analysis of protein L and G folding, we employed the Pfold method proposed by Du et al. (1998). The method assigns a value, Pfold, to a particular structure corresponding to the probability that it will first fold to the native state before unfolding. Structures with Pfold values equal to 0.5 correspond to the transition-state ensemble for the model. To apply this method, we first sampled structures from our simulations corresponding to putative transition-state structures. “Putative” transition-state structures were originally isolated by requiring various combinations of order parameters to correspond to their maximum free-energy values in a one-dimensional projection of free energy against these order parameters. From this procedure, an ensemble of structures with 0.4 ≤ Pfold ≤ 0.6 was isolated during multiple kinetic runs, and these structures were defined as members of the transition state ensemble. By analyzing these structures we are able to postulate new reaction coordinates.
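The Pfold estimate itself is a simple counting procedure: launch many independent trial runs from the same structure and record the fraction that reach the native basin before the unfolded basin. The sketch below shows that logic with a toy stand-in for the dynamics, an unbiased random walk on the integers 0 to 10, for which the exact answer from position x is x/10. The walk, the function names, and the number of trials are illustrative assumptions; in the actual method each trial would be a short simulation of the protein model.

```python
import random

def estimate_pfold(start, step, is_folded, is_unfolded, n_trials, rng):
    """Fraction of trial runs from `start` that fold before they unfold."""
    folded = 0
    for _ in range(n_trials):
        state = start
        while not (is_folded(state) or is_unfolded(state)):
            state = step(state, rng)
        if is_folded(state):
            folded += 1
    return folded / n_trials

# Toy dynamics: 0 plays the role of the unfolded basin, 10 the native basin.
def walk_step(x, rng):
    return x + rng.choice((-1, 1))

def reached_native(x):
    return x >= 10

def reached_unfolded(x):
    return x <= 0

rng = random.Random(0)
pfold_mid = estimate_pfold(5, walk_step, reached_native, reached_unfolded, 2000, rng)
pfold_low = estimate_pfold(2, walk_step, reached_native, reached_unfolded, 2000, rng)

# Membership test using the 0.4 <= Pfold <= 0.6 window quoted in the text.
in_transition_state_ensemble = 0.4 <= pfold_mid <= 0.6
```

The statistical error of the estimate scales as the inverse square root of the number of trials, so a window as wide as 0.4 to 0.6 can be resolved with a modest number of runs per structure.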

Identifying the transition state ensemble also allowed us to define an intermediate ensemble. By identifying those configurations that have Pfold ≈ 0.5 we can save structures for the trajectories that fail to fold, thus allowing us to postulate an intermediate ensemble. The set of structures obtained in this way can be characterized by analyzing the contacts that are present across all members of the ensemble. Using the defining contacts we can test our definition of intermediates through the direct analysis of kinetic runs. The final test for the validity of any definition of an intermediate ensemble is an analysis performed by fitting to a two-step first-order reversible reaction, or UIN mechanism, U ↔ I ↔ N, where U is the unfolded state, I is the intermediate state, and N is the native state. The formal solution to this kinetic mechanism is given in Appendix A. The data to which we fit is obtained from simulation by monitoring the progress of the folding trajectories and marking each time a state fits our definition of U, I, or N. We note that the kinetic analysis reported by Shimada and Shakhnovich fits the decay of the unfolded population separately from any of the I1, I2, or I3 intermediate populations; that is, the fit violates mass balance.

Next, we discuss our sequence design procedure. Theoretical work (Bryngelson and Wolynes 1989; Sali et al. 1994; Onuchic et al. 1997) has elucidated a criterion for heteropolymers to be foldable: there should be a significant energy gap between the native-state energy and the average misfold energy. Our sequence design strategy makes use of this concept. We create a library of misfolds (obtained from simulation of multiple trajectories), and then maximize the energy gap, ΔEdesign = |〈Emisfold〉 − Enative|, through favorable mutations of the sequence. We start with a sequence that adopts the protein L/G target topology as given in Sorenson and Head-Gordon (2002b), and build upon it through sequence mutation to produce new sequences that comprise distinct members, protein L and G, within a target fold class (Brown et al. 2003). The sequence for protein L was determined by aligning it against the real protein L sequence (after mapping the 20-letter code to the three-letter code as described in Brown et al. 2003) and proposing new mutations that moved the original sequence toward being more L-like. For protein G, all possible single mutations were investigated during the design process, and the mutations finally selected turned out to be at beads corresponding to errors in the protein G alignment (Brown et al. 2003). This is interesting in that it appears to hint at potential criteria for performing sequence mapping onto our minimalist code, which could allow for study of novel proteins whose structure is not yet known. Two of the five point mutations for L and G are shared (B18L and B47L); these serve to make the proteins more foldable and to clean up certain thermodynamic aspects of the original L/G sequence (Brown et al. 2003). The other three mutations distinguish the sequence of protein L from that of protein G.
Table 1 lists the sequences for L and G used in this study, which differ at six bead positions. The energy of the initial L/G sequence is −32.4εH, while for the new protein L the native-state energy is −28.8εH, and for protein G the native-state energy is −26.9εH. For all native states we find that the energy distribution of the misfold library is well separated from the native-state energies.
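The gap criterion can be illustrated with a short sketch that scores a sequence against a native structure and a misfold library and then scans all single mutations, as was done for protein G. This is a schematic only: the energy function is passed in by the caller and stands in for the model Hamiltonian, the bead alphabet "BLN" is an assumption of this example, and the function names are hypothetical.

```python
def design_gap(sequence, native, misfolds, energy):
    """Energy gap |<E_misfold> - E_native| for one sequence.

    energy : callable (sequence, structure) -> float, supplied by the caller
    """
    if not misfolds:
        raise ValueError("misfold library is empty")
    e_native = energy(sequence, native)
    e_misfold = sum(energy(sequence, m) for m in misfolds) / len(misfolds)
    return abs(e_misfold - e_native)

def best_single_mutation(sequence, native, misfolds, energy, alphabet="BLN"):
    """Scan every single-bead mutation and keep the one with the largest gap.

    Returns (sequence, gap); the input sequence is returned unchanged when
    no single mutation strictly improves the gap.
    """
    best_seq = sequence
    best_gap = design_gap(sequence, native, misfolds, energy)
    for i, old in enumerate(sequence):
        for new in alphabet:
            if new == old:
                continue
            trial = sequence[:i] + new + sequence[i + 1:]
            gap = design_gap(trial, native, misfolds, energy)
            if gap > best_gap:
                best_seq, best_gap = trial, gap
    return best_seq, best_gap
```

Applying `best_single_mutation` repeatedly gives a simple greedy design loop; whether a greedy search is adequate for a given misfold library is a separate question that this sketch does not address.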

Finally, we compare the structural similarity of the native state of our protein L model with the experimental structure. The root mean square distance (RMSD) between the native quenched structure of the protein L model and the protein L set of NMR solution structures (2PTL, residues 20–78) was found to be approximately 4.4 Å. This measure of RMSD was generated by the Combinatorial Extension (CE) Web server (Shindyalov and Bourne 1998). Calculating an RMSD between an α-carbon bead model and a natural protein structure requires certain assumptions because of the difference in the chain representation (all-atom versus bead) and the number of amino acids (the protein L model has fewer beads in some of the turn regions). The CE tool was particularly applicable for our purposes for two reasons. First, it compares only the α-carbon positions of the two structures when calculating the structural alignment, and second, the CE algorithm can exclude certain α-carbon positions to align the model and solution structures despite the different lengths of the loop regions. It should be noted that the insertion of gaps in the structural alignment did not result in a spurious alignment. The z-score for the structural alignment was 3.1. This measure indicates that an alignment of that quality with a random structure would occur 1 in 10^3 times, showing that the protein L bead model has high topological similarity to the protein L natural fold.

Appendix A: Solution of UIN mechanism

For the two-step reversible mechanism given in equation 1 we have the following differential equations describing the time rate of change in concentration of each species,

equation image((5))

This set of coupled first-order differential equations can be straightforwardly solved by a Laplace transform, given by

equation image((6))

Taking the Laplace transform of the differential equations we have

equation image((7))

where [U]0, [I]0, and [N]0 are the initial concentrations at time t = 0. Rearranging gives the set of linear equations

equation image((8))

In matrix form these equations can be expressed as

equation image((9))

For the mechanism we are interested in investigating here, we have the initial conditions that [I]0 = [N]0 = 0. Using Cramer's rule we can express the solutions to the above set of equations (Strang 1988):

equation image((10a))
equation image((10b))

Finding the solutions to the determinants and simplifying gives

equation image((11a))
equation image((11b))

where r1 and r2 are given by

equation image((12a))


equation image((12b))

Taking the inverse Laplace transforms gives us solutions for [U] and [I] as a function of time:

equation image((13a))
equation image((13b))

The condition of mass balance, [U] + [I] + [N] = [U]0, gives the final equation for [N],

equation image((14))

Thus, we obtain solutions for [U], [I], and [N] as functions of time.
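For readers who wish to evaluate the UIN populations numerically, the following sketch implements a closed-form solution of the two-step reversible mechanism obtained by the same Laplace-transform route. The rate-constant names (k_ui, k_iu, k_in, k_ni) and the decay rates lam1 and lam2 are notation chosen for this example and need not match the symbols used in the equations above. The sketch assumes [I]0 = [N]0 = 0 and strictly positive rate constants, and obtains [N] from conservation of the total population.

```python
import math

def uin_populations(t, k_ui, k_iu, k_in, k_ni, u0=1.0):
    """Populations ([U], [I], [N]) at time t for U <-> I <-> N.

    k_ui: U->I, k_iu: I->U, k_in: I->N, k_ni: N->I (illustrative names).
    Assumes [I]0 = [N]0 = 0 and all rate constants > 0.
    """
    if min(k_ui, k_iu, k_in, k_ni) <= 0:
        raise ValueError("all rate constants must be positive")
    total = k_ui + k_iu + k_in + k_ni
    prod = k_ui * k_in + k_ui * k_ni + k_iu * k_ni
    # Discriminant equals (k_ui + k_iu - k_in - k_ni)^2 + 4*k_iu*k_in > 0,
    # so the two nonzero decay rates are real and distinct.
    root = math.sqrt(total * total - 4.0 * prod)
    lam1 = 0.5 * (total + root)
    lam2 = 0.5 * (total - root)

    def num_u(s):
        # Numerator of the transformed [U]; denominator is s(s+lam1)(s+lam2).
        return s * s + (k_iu + k_in + k_ni) * s + k_iu * k_ni

    def num_i(s):
        # Numerator of the transformed [I], same denominator.
        return k_ui * (s + k_ni)

    def invert(num):
        # Partial-fraction inversion at the poles s = 0, -lam1, -lam2.
        return (num(0.0) / (lam1 * lam2)
                + num(-lam1) / (lam1 * (lam1 - lam2)) * math.exp(-lam1 * t)
                + num(-lam2) / (lam2 * (lam2 - lam1)) * math.exp(-lam2 * t))

    u = u0 * invert(num_u)
    i = u0 * invert(num_i)
    n = u0 - u - i  # conservation of total population
    return u, i, n
```

Because all three populations come from one set of four rate constants, a fit based on such a function respects mass balance by construction, in contrast to fitting the decay of each population separately.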

Table 1. Parameters obtained from fits to kinetic data
a The data are fit to the equation A0 exp(−t/τ0) + (1 − A0) exp(−t/τ1).

Table 2. Order parameters, Q, used for characterizing folding mechanisms in proteins L and G, along with the contacts used to define them
Q | ith:jth bead contacts
Protein L transition state
β1β2α | 6:16 6:17 6:18 7:15 7:16 7:17 8:13 8:14 8:15 8:16 8:17 9:1 9:14 9:15 10:14 10:15 20:24 23:27 29:33 30:34
Protein G transition state (fast pathway)
β3β4α | 10:14 20:24 20:27 23:27 24:28 27:31 36:53 36:54 36:55 37:53 38:52 38:53 39:51 41:48 41:50 41:51 42:48 42:49 43:47 43:48 43:49 44:48
Protein G transition state (slow pathway)
β1β3α2 | 8:14 8:15 8:16 8:17 8:36 8:37 8:38 9:13 9:14 9:15 9:36 10:14 14:36 15:36 17:38 19:40 19:41 19:42 20:41 20:42 20:49 21:41 21:42 21:43 43:48
Protein G intermediate
β2β3α | 8:14 9:13 9:14 9:15 10:14 18:40 18:41 19:39 19:40 19:41 23:27 27:31 31:35 42:48 43:47 43:48 43:49 44:48
Table 3. Parameters obtained from the fit to the UIN kinetic model outlined in Appendix A to characterize the slow folding pathway of protein G
1.1 × 10^−3 | 1.3 × 10^4 | 2.3 × 10^3 | 4.0 × 10^6 | −2.0 kBT | −1.7 kBT
Table 4. Sequences for the minimalist models of proteins L and G
Protein L
Protein G
a Differences between the sequences are shown in bold.
Figure 1.

Free-energy projections onto order parameters χβ−1 and χβ−2 for L and G. (A) Free-energy contour plot for protein L as a function of native-state similarity of the second (C-terminal) β-sheet region χβ−2 and first (N-terminal) β-sheet region χβ−1 at the folding temperature. Note the minimum free-energy path connecting the unfolded and folded ensembles proceeds through a transition state in which the β-1 region is native and the β-2 region is largely disrupted. (B) Free-energy contour plot for protein G as a function of native-state similarity of the second (C-terminal) β-sheet region χβ−2 and first (N-terminal) β-sheet region χβ−1 at the folding temperature. For G, the minimum free-energy path connecting the unfolded and folded ensembles proceeds through a transition state in which the β-2 region is native-like and the β-1 region is disrupted. Contour lines are spaced kBT apart.

Figure 2.

Folding kinetics for proteins L and G. (A) Fraction of folded states Pfold as a function of time t for protein L at the folding temperature. The best fit to the data is by a single exponential. (B) Fraction of folded states Pfold as a function of time t for protein G at its folding temperature. The best fit for this data is to a double exponential. All fit parameters are given in Table 2.

Figure 3.

Shows the presence of two folding pathways for protein G. The fast pathway corresponds to a collapse concomitant with folding scenario, while the slow pathway corresponds to nonproductive collapse and a longer process of finding the native structure.

Figure 4.

Free-energy surface projected onto order parameters Rg and χ. (A) Free-energy contour plot for protein L as a function of radius of gyration Rg and native-state similarity χ. In this plot there is only a single dominant minimum that corresponds to a collapsed, largely native structure. (B) Free-energy contour plot for protein G as a function of radius of gyration Rg and native-state similarity χ. In this plot there appear to be two dominant minima, one corresponding to collapsed nonnative structures and the other to collapsed native-like structures.

Figure 5.

Potential of mean force vs. native state similarity as a function of temperature for protein G. The folding temperature is T* = 0.41. Based on this projection we might conclude that there is a shift in the unfolded population as we approach folding conditions. There is also evidence for a small barrier.

Figure 6.

Contact map comparing native state (black) to contacts that are found present in the transition-state ensemble for 90% or greater of the structures (gray) for (A) protein L and (B) fast folding pathway of protein G.

Figure 7.

Contact map comparing native state (black) to contacts that are found present in the (A) intermediate ensemble and (B) transition-state ensemble, for 90% or greater of the structures for the slow folding pathway of protein G (gray).

Figure 8.

Kinetic data and fits for the UIN folding mechanism scenario.


We acknowledge financial support from UC Berkeley, and a subcontract award under the National Science Foundation Grant No. CHE-0205170. We also thank Nick Fawzi for calculating the RMSD between native states of the protein L model and the experimental structure.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.