Relationships between unfolded configurations of proteins and dynamics of folding to the native state


  • Attila Gursoy,

    1. College of Engineering and Center for Computational Biology and Bioinformatics, Koç University, İstanbul 34450, Turkey
  • Ozlem Keskin,

    1. College of Engineering and Center for Computational Biology and Bioinformatics, Koç University, İstanbul 34450, Turkey
  • Metin Turkay,

    1. College of Engineering and Center for Computational Biology and Bioinformatics, Koç University, İstanbul 34450, Turkey
  • Burak Erman

    Corresponding author
    1. College of Engineering and Center for Computational Biology and Bioinformatics, Koç University, İstanbul 34450, Turkey


We compare folding trajectories of chymotrypsin inhibitor 2 (CI2) using a dynamic Monte Carlo scheme with Go-type potentials. The model considers the four backbone atoms of each residue and a sphere centered on Cβ, the diameter of which is chosen according to the type of the side group. Bond lengths and bond angles are kept fixed. Folding trajectories are obtained by giving random increments to the φ and ψ torsion angles with some bias toward the native state. Excluded volume effects are considered. Two sets of 20 trajectories are obtained, with different initial configurations. The first set is generated from random initial configurations. The initial configurations of the second set are generated according to knowledge-based, neighbor-dependent torsion probabilities derived from triplets in the Protein Data Bank. Compared to chains with randomly generated initial configurations, those generated with neighbor-dependent probabilities (i) fold faster, (ii) have better defined secondary structure elements, and (iii) have fewer non-native contacts during folding. © 2006 Wiley Periodicals, Inc. J Polym Sci Part B: Polym Phys 44: 3667–3678, 2006


The unfolded state of proteins plays an important role in the folding process and in the stability of the folded state.1 The unfolded state is now being recognized as the ‘other half’ of the protein folding equation.2 In this paper, we show, by a dynamic Monte Carlo and statistical mechanical analysis and by using knowledge-based torsion potentials, that the unfolded state is not random. Important correlations exist between the torsion angles in the unfolded state. These correlations matter because they stabilize fragments of secondary structure and reduce the number of initial conformations available to the protein chain. We show that the conformations that survive in the unfolded state have a significantly larger number of native-like contacts than those in chains subject to random potentials, and that this affects the folding rate significantly.

There is growing evidence, both theoretical and experimental, that the unfolded configurations of proteins are not random, and that a significant amount of structure is present that resembles the structure of the folded native state.2–4 Recent fluorescence resonance energy transfer experiments on nascent proteins indicate that proteins have native-like conformations even in the ribosome.5 Several earlier studies pointed out that the unfolded state is not as heterogeneous as it was thought to be.6–8 The presence of native-like features in the unfolded protein has important consequences for its folding into the native state. As pointed out by several authors,2, 3, 6 the configuration space of the unfolded state is significantly reduced compared to that of a random chain. In our previous work, we showed that this reduction is mainly determined by the neighbor-dependent correlations over the successive torsion angles of the chain and that there is a strong relationship between amino acid sequence and backbone torsion angle preferences.9 Recently, Avbelj and Baldwin showed that the backbone angle φ of a residue is more negative when its preceding neighbor is class L (aromatic and β-branched: F, H, W, Y and I, T, V, respectively).10 They showed that peptide backbone solvation is the probable cause. Avbelj and Baldwin10 also discuss preferences for helices and β-structures and their dependence on the residue type. These observations are significant indicators of the context dependence and nonrandomness of the configurations of nascent proteins.

Our aim in the present article is to analyze the effect of initial conformations on the folding process. It is shown that initial conformations chosen according to energetically appropriate conditions fold faster to the native state, whereas random initial conditions end up mostly in dead ends along the folding pathway or at best in slow folding pathways. Specifically, we use a dynamic Go model. We show that unfolded proteins whose initial conformations are generated by knowledge-based potentials fold to the native state much faster than those generated randomly. Recently, Zaman et al.11 showed that the populations and transitions between Ramachandran basins for monomers, dimers, and trimers of amino acids are influenced by the conformation and identity of the nearest neighbors. The present work is the first treatment, at the full protein scale, of the effect of initial conformation and residue identity on the folding dynamics. We choose a widely studied protein, chymotrypsin inhibitor 2 (CI2), as our example.

In the next section, we describe the protein CI2 in detail and summarize previous experimental and computational work on its folding dynamics. In the section following that, we describe our computational model and outline the calculation scheme. In the Results of Calculations section, we present the results of the study and show how the folding dynamics differ between initial configurations generated from random and from knowledge-based potentials.


Chymotrypsin Inhibitor 2 (CI2) is a 64-residue protein that consists of an α-helix and a four-strand β-sheet as shown in Figure 1. The crystal structure has a resolution of 2.0 Å (PDB code 2CI2). The main hydrophobic core is formed by the packing of the α-helix against the β-sheet. It was the first protein shown to fold by a two-state mechanism.7

Figure 1.

The native conformation of chymotrypsin inhibitor 2 (CI2).

CI2 contains one α-helix and four β-strands as shown in Table 1, where the secondary structure elements are identified using the PDB.12 The hydrophobic core is formed by the packing of the α-helix against β-strand 2. In the native state, the residues in contact at the hydrophobic core are 5, 8, 16, 19, 20, 27, 29, 47, 49, 51, 57, and 61. The coordinates for the atoms in the first 18 residues are missing in the PDB file; therefore, we renumbered the residues in our work so that residue 19 of the original PDB file is residue 1.

Table 1. Secondary Structure Elements and Corresponding Residue Numbers for CI2
Residue Numbers    Type of Secondary Structure
3–5                β-Strand 1
28–34              β-Strand 2
35–44              Recognition loop
45–52              β-Strand 3
60–64              β-Strand 4

The folding of CI2 is well studied both experimentally13–18 and computationally.17, 18 Work on this protein was a turning point that gave insight into protein folding and led to new experimental techniques. CI2 was found to fold and unfold as a simple two-state system with no kinetic intermediates.19 Mutational (Φ-value) analysis19, 20 allows the investigation of the role of individual residues in the transition state (TS) for folding. This method has also been used by many theoreticians to understand the TS structure and the folding mechanism of simple protein models.21–27 Thus, the method has stimulated interactions between experimentalists and theoreticians, as it allows the understanding of folding mechanisms.27 All-atom molecular dynamics (MD) simulations of unfolding trajectories of CI2 at high temperatures gave insights into the nature of the TS and the denatured state ensemble.17, 28 The folding dynamics and kinetics of CI2 were also studied using all-atom MD with a Go potential.28 In that study, the TS ensembles were separated into two sets: (1) with α-helices disrupted and β-strands intact, and (2) with α-helices intact and β-strands disrupted. The first set was found to have a much higher success rate of folding. The unfolding MD trajectories of CI2 were also analyzed to clarify the TS.24 It was observed that the TS region for folding and unfolding occurs early, with only 25% of the native contacts. Another interesting result from the latter simulations was the emergence of a statistically preferred unfolding pathway. A Cα model was also used to investigate CI2.21 The simulations revealed that a few well-defined contacts were formed with high probability in the TS, and that these drive the protein into its folded conformation.

Experimental and theoretical studies suggest that CI2 folds by a nucleation-condensation/collapse mechanism.7 NMR experiments suggested that the earliest folding events are the clustering of hydrophobic residues (19–21, 29–35).17, 28 Molecular dynamics simulations also suggested that residues 49–51 may participate in the clustering of hydrophobic residues.29 The main event during folding was suggested to be the closing of the α-helix onto β-strand 3.30 A combination of NMR experiments and molecular dynamics simulations also suggested nucleation around residues 16, 49, and 57.17 In addition, the root-mean-square deviation (rmsd) from the native structure during folding suggests a sudden collapse after contacts between the α-helix and β-strand 1 are formed.31

Monitoring the native contacts with targeted molecular dynamics methods may help in understanding the folding mechanism. Ferrara et al. performed 128 folding and 45 unfolding molecular dynamics runs on CI2 and reported that the majority of the native contacts are formed at the end of folding.32 Similar results were obtained for wild-type and circular-permuted CI2.33


Description of Chain Kinematics

We adopt the five-atom representation of the protein chain, i.e., the four atoms N, Cα, C, and carbonyl O in the backbone, and a Cβ atom representing the side group. The total number of atoms of the model, N, is equal to 5n, where n is the number of residues. The side groups are considered as hard spheres with different van der Waals radii centered on the Cβ atoms. The atomic radii given in Rowland and Taylor for each amino acid are used in the model.34 The bond lengths of the atoms in the protein backbone are assumed to be constant during folding and the default values for these bond lengths are used.35 In addition, the bond angles of the amino acids are also assumed to be fixed. The (i − 1)th, ith, and (i + 1)th residues are shown in Figure 2.

Figure 2.

Five-atom representation of the (i – 1)th, ith, and (i + 1)th residues of the protein chain.

The torsion angle around the bond between atoms Ci and Ni+1, denoted ωi, is fixed at π radians because of the partial double-bond character of this bond. Only the two torsion angles φi and ψi change.

The position of atom j is given by rj = [xj yj zj]T in the Cartesian coordinate system. Configurations of the protein chains are generated according to the rotational isomeric state formalism and successive matrix multiplication scheme.36 The position vector rj of the jth atom can be expressed as,

equation image(1)


equation image(2)

In eq 2, Tk is the transformation matrix for the kth bond, lk is the kth bond vector expressed in the coordinate frame of the bond, and 0 is the row vector [0 0 0]. The position vector is expressed in the internal coordinate system of the protein which is established by setting the x-axis along the direction of the first bond between N1 and Cα1 atoms. The torsion angle of the first bond is fixed at 0° and the y-axis is chosen in the plane defined by the bonds 1 and 2. The z-axis of the internal coordinate system is chosen to complete the right-handed system.

The rotational isomeric state representation of the protein chain and the internal coordinate system adopted in the model are computationally efficient: the precise location of any atom in space can be calculated for any given set of torsion angles, φi, i = 2,…, n, and ψi, i = 1,…, n − 1, giving 2n − 2 degrees of freedom for a protein chain.
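The successive matrix multiplication scheme can be illustrated with a minimal Python sketch. It assumes the common Flory-style form of the transformation matrix, in which the torsion angle is measured so that 0 corresponds to trans and the angle θ is the supplement of the bond angle; the article's own matrices (eqs 1 and 2) are not reproduced here, and the bond length and angle values are illustrative only.

```python
import math

def transform(theta, phi):
    """Rotation taking a vector from the frame of bond k+1 into the frame of bond k.

    theta is the supplement of the bond angle between bonds k and k+1;
    phi is the torsion about bond k, measured so that phi = 0 is trans.
    """
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(phi), math.sin(phi)
    return [[ct, st, 0.0],
            [st * cp, -ct * cp, sp],
            [st * sp, -ct * sp, -cp]]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[r][k] * v[k] for k in range(3)) for r in range(3)]

def chain_coordinates(lengths, thetas, phis):
    """Return the positions of atoms 0..n for a chain of n bonds.

    Each bond lies along the x-axis of its own frame; `frame` accumulates the
    product of transformation matrices that expresses it in the frame of bond 1.
    """
    frame = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    positions = [[0.0, 0.0, 0.0]]
    for length, theta, phi in zip(lengths, thetas, phis):
        step = matvec(frame, [length, 0.0, 0.0])
        positions.append([p + s for p, s in zip(positions[-1], step)])
        frame = matmul(frame, transform(theta, phi))
    return positions

# Illustrative values only (not the parameters used in the article).
n_bonds = 6
lengths = [1.5] * n_bonds
thetas = [math.radians(70.0)] * n_bonds
all_trans = chain_coordinates(lengths, thetas, [0.0] * n_bonds)
```

With all torsions set to trans the chain is a planar zigzag in the xy-plane, which is a convenient sanity check before varying individual torsion angles.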

The Go-Model and Simulation of Folding Trajectories

We use a Monte Carlo-Go model for generating the folding trajectories.37–39 The model assumes the native state of the protein as the most stable configuration and drives the chain from a given initial state to the native state. Since the native conformations of the protein are resolved in water, the effect of water is implicit in the Go-potentials used. The trajectories are generated by applying small perturbations to randomly chosen φ and ψ angles. At any step, a small number of angles (10% of the total number of angles) are subjected to perturbations with a maximum magnitude of 0.05 radians. This small magnitude is chosen to reduce the number of atom pairs that violate the excluded volume condition at each move. Each perturbation of the torsion angles contains a random component and a bias, so that the dihedral angles are driven to their native values. At the end of each step, the atom pairs are checked for excluded volume given by the van der Waals radii. The calculation of the distances between all atom pairs, of the order n² in number, is computationally expensive. We employed a grid-based range search algorithm to speed up the collision detection. The algorithmic complexity of this method is, on average, linear in the number of neighboring atoms.40 If the distance between two atoms, i and j, is less than the sum of their van der Waals radii, then three torsion angles along the backbone between atoms i and j are modified until all atom pairs satisfy the excluded volume constraints given by the van der Waals radii. When all collisions among atom pairs are eliminated, the simulation of the folding trajectory continues with the next time step in the Go model. Simulations were terminated when the rmsd between the native and computed structure was less than 0.3 Å.
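A single biased move of this kind might look like the following Python sketch. The text fixes the fraction of perturbed angles (10%) and the maximum step (0.05 radians) but not how the random and biased components are combined, so the mixing weight `bias` and the function names are hypothetical; the excluded volume check and the clash-relief step are omitted.

```python
import math
import random

def wrap(angle):
    """Map an angle or angle difference into [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def go_step(angles, native, rng, fraction=0.10, max_step=0.05, bias=0.2):
    """Perturb a random subset of torsion angles in place.

    Each selected angle moves by at most `max_step` radians. The move mixes a
    uniform random part with a part pointing toward the native value; the
    mixing weight `bias` is an assumption, not a value from the article.
    Returns the indices that were perturbed.
    """
    count = max(1, round(fraction * len(angles)))
    chosen = rng.sample(range(len(angles)), count)
    for k in chosen:
        toward = math.copysign(1.0, wrap(native[k] - angles[k]))
        delta = max_step * (bias * toward + (1.0 - bias) * rng.uniform(-1.0, 1.0))
        angles[k] = wrap(angles[k] + delta)
    return chosen

def max_deviation(angles, native):
    """Largest absolute torsion deviation from the native values."""
    return max(abs(wrap(a - b)) for a, b in zip(angles, native))

rng = random.Random(0)
n_angles = 2 * 64 - 2          # 2n - 2 torsional degrees of freedom for n = 64
native = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
start = max_deviation(angles, native)
for _ in range(20000):
    go_step(angles, native, rng)
end = max_deviation(angles, native)
```

Without steric constraints every such toy chain drifts to the target, so this sketch shows only the move set; in the model described above it is the excluded volume condition that makes some trajectories slow or unsuccessful.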

Determination of the Initial Conformations

Initial configurations were generated in two ways. (i) In the first method, each initial configuration was generated by choosing each φ and ψ angle randomly. All initial configurations were relaxed to eliminate steric clashes. (ii) The second method was a knowledge-based approach in which triplets from Protein Data Bank structures were used to generate energy maps for the ψi−1 − φi, φi − ψi, and ψi − φi+1 pairs of dihedral angles. Keskin et al. give the details of the method for extracting pairwise dependent probabilities.9

Residue-specific conformational potentials are developed on the basis of the probabilities of observations, PXYZ(φi, ψi) and PXYZ(ψi, φi+1), of the configurations of each triplet XYZ. Here, PXYZ(φi, ψi) is the probability of observing the middle residue (type Y) to be in state (φi, ψi) and PXYZ(ψi, φi+1) is the probability of observing residue type Y to be in state (ψi) and Z to be in state (φi+1). PXYZ(φi, ψi) is a measure of the intraresidue correlation of torsion angles and PXYZ(ψi, φi+1) is a measure of the interresidue correlation of two successive torsion angles. We use the discrete state formalism for the torsion angles. Each φ and ψ angle is divided into intervals of 30°; thus, 12 representative torsion states are assigned to each bond.41–43 The statistics for the φiψi and ψiφi+1 pairs could as well be obtained from pairs of residues, YZ, without incorporating the (i − 1)th residue, X, along the sequence. This would improve the statistics in the nonredundant database significantly. However, energy maps obtained for doublets and triplets exhibit significant differences. Most importantly, energy maps based on doublets contain different numbers of minima than those based on triplets.

The pairwise dependent probabilities PXYZ(φi, ψi) and PXYZ(ψi, φi+1) are calculated according to

$$P_{XYZ}(\phi_i, \psi_i) = \frac{N_{XYZ}(\phi_i, \psi_i)}{\sum N_{XYZ}} \qquad (3)$$

where NXYZ(φi, ψi) is the number of residue triplets observed with the indicated values of the argument. NXYZ(ψi, φi+1) is defined similarly and yields PXYZ(ψi, φi+1) in the same way. The term ΣNXYZ in the denominator is the total number of observed triplets XYZ in all possible states.
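As a concrete illustration of the counting behind these probabilities, the sketch below bins torsion angles into the twelve 30° states and normalizes the pair counts for one triplet type. The function names and the sample angles are made up for the example; they are not taken from the Protein Data Bank statistics used in the article.

```python
from collections import Counter

BIN_WIDTH = 30.0                      # degrees; gives 12 states per torsion angle
N_STATES = int(360.0 / BIN_WIDTH)

def state_index(angle_deg):
    """Map a torsion angle in degrees to one of the 12 discrete states (0..11)."""
    return int(((angle_deg + 180.0) % 360.0) // BIN_WIDTH)

def pair_probabilities(observations):
    """Estimate P(state_a, state_b) = N(state_a, state_b) / sum(N).

    `observations` is a list of (angle_a, angle_b) pairs in degrees, all taken
    from occurrences of a single triplet type XYZ.
    """
    counts = Counter((state_index(a), state_index(b)) for a, b in observations)
    total = sum(counts.values())
    return {states: n / total for states, n in counts.items()}

# Made-up (phi, psi) observations for a single hypothetical triplet type.
observed = [(-62.0, -41.0), (-65.0, -38.0), (-61.0, -47.0), (-120.0, 130.0)]
probabilities = pair_probabilities(observed)
```

The same function applies unchanged to (ψi, φi+1) pairs; only the choice of which two angles are extracted from each triplet differs.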

We define a conformational energy for a given residue Y (in the triplet XYZ) along the primary sequence of the protein as:

equation image(4)

where P⁰(φi), P⁰(ψi), and P⁰(φi+1) are the uniform distribution probabilities, that is, those valid when all angles are equally probable. In continuous space, they are equal to 1/2π; in the discrete state formalism, they are directly proportional to the size of the angular intervals of the states (1/12).41

The statistical weight matrix uηζ;i for a given bond pair i − 1 and i is given as:44–46

equation image(5)

Here the ηζth element of Ui indicates the statistical weight for bond i when it is in state ζ while the bond i − 1 is in state η.

The partition function, Z, of a chain that is n residues long is obtained according to,

equation image(6)

where J* = [1 0 … 0]; J = column [1 1 … 1]

The probability Pζ;i that bond i will be in state ζ is given by

equation image(7)

Here, Uζ;i is the matrix obtained by equating the entries of all of its columns to zero except those of column ζ. The joint probability Pηζ;i – 1,i that bond i − 1 is in state η and bond i is in state ζ is given by,

equation image(8)

The conditional probability qηζ;i − 1,i that bond i will be in state ζ, given that bond i − 1 is in state η is obtained from,

$$q_{\eta\zeta;i-1,i} = \frac{P_{\eta\zeta;i-1,i}}{P_{\eta;i-1}} \qquad (9)$$

An initial conformation for the CI2 molecule with knowledge-based potentials is generated by adopting the conditional probabilities given in eq 9. Random initial configurations, on the other hand, are generated by randomly assigning each torsion angle into one of the 12 intervals between –180° and 180°.
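The route from statistical weight matrices to conditional probabilities and a sampled chain can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the weights of the first bond are passed as a separate vector `first`, all weights are strictly positive, and the two-state matrices are invented for the example (the article uses 12 states per bond with weights from eq 5).

```python
import random

def backward_vectors(n_states, matrices):
    """b[i][eta]: summed weight of all states of later bonds, given bond i in state eta."""
    b = [[1.0] * n_states]
    for U in reversed(matrices):
        nxt = b[0]
        b.insert(0, [sum(U[eta][zeta] * nxt[zeta] for zeta in range(n_states))
                     for eta in range(n_states)])
    return b

def partition_function(first, matrices):
    """Sum of the statistical weights of all chain configurations."""
    b = backward_vectors(len(first), matrices)
    return sum(w * x for w, x in zip(first, b[0]))

def conditional_probabilities(first, matrices):
    """Return (p_first, q) with q[i][eta][zeta] = P(bond i+1 = zeta | bond i = eta)."""
    m = len(first)
    b = backward_vectors(m, matrices)
    z = sum(first[s] * b[0][s] for s in range(m))
    p_first = [first[s] * b[0][s] / z for s in range(m)]
    q = [[[U[eta][zeta] * b[i + 1][zeta] / b[i][eta] for zeta in range(m)]
          for eta in range(m)]
         for i, U in enumerate(matrices)]
    return p_first, q

def sample_states(first, matrices, rng):
    """Draw one chain configuration, bond by bond, from the conditional probabilities."""
    p_first, q = conditional_probabilities(first, matrices)
    m = len(first)
    states = rng.choices(range(m), weights=p_first)
    for table in q:
        states += rng.choices(range(m), weights=table[states[-1]])
    return states

# Invented two-state weights for a three-bond chain (illustration only).
first = [1.0, 0.5]
matrices = [[[1.0, 0.2], [0.7, 1.5]],
            [[0.3, 1.0], [2.0, 0.1]]]
z = partition_function(first, matrices)
states = sample_states(first, matrices, random.Random(1))
```

Sampling each bond from its conditional probability given the previous bond reproduces the full joint distribution of the chain, which is why no rejection step is needed when generating the knowledge-based initial conformations this way.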


The method of initial chain conformation generation described in the preceding section is used to obtain 10⁴ initial conformations for the CI2 molecule. The statistical analysis of the initial conformations is made on this set of 10⁴ conformations. This set is referred to as the conditionally generated (CG) set. Out of these 10⁴ initial conformations, 20 are chosen randomly and the folding trajectory is generated for each, using the Go-model described above. The same analysis is repeated for 10⁴ initial conformations generated randomly, which we refer to as the randomly generated (RG) set.

Initial Conformations

The rmsd values of the initial structures from the native structure range between 8 and 36 Å for both sets of initial conformations, RG and CG. Both sets show similar distributions. The CG set has a mean of 16.2 ± 3.2 Å and the RG set has a mean of 17.0 ± 3.4 Å. The CG and RG chains have a mean-squared radius of gyration (Rg²) of 356 ± 150 Å² and 323 ± 144 Å², respectively. For both sets, the squared radii of gyration range between 75 and 1225 Å². Although the rmsd and Rg² distributions of the initial structures are comparable for the RG and CG populations, the persistence lengths for the two sets differ significantly. The persistence length is the magnitude of the persistence vector, an, defined as

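$$\mathbf{a}_n = \langle x \rangle \mathbf{i} + \langle y \rangle \mathbf{j} + \langle z \rangle \mathbf{k} \qquad (10)$$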

where, the terms in brackets represent the x, y, and z components of the averaged end-to-end vector, and i, j, and k are the unit vectors along the x, y, and z axes, respectively. The persistence vector reflects the anisotropy of chain dimensions when the first two bonds are held fixed in the xy plane.36, 45–47 In Figure 3, we compare the different components of the persistence vector for RG and CG. Calculations are performed as follows: the first bond of each chain (the bond between the first and second atoms, N and Cα atoms) is fixed along the x-axis of a Cartesian coordinate system Oxyz. The second bond (the bond between the Cα and C atoms along the backbone) is fixed in the xy plane. The persistence vector, 〈r0i〉, between the zeroth atom fixed at the origin and the ith atom is obtained by averaging over the whole ensemble of chains generated. In Figure 3, the ordinate values are the components of the persistence vector between the zeroth and the ith atom of the chain, in angstroms, plotted as a function of atom indices. Filled circles represent results for the CG chains, the three components of the vector being shown on each curve. Similarly, the empty circles denote results for the RG chains. The x, y, and z components of the RG chains are 3.59, 4.98, and 0.09 Å, respectively. For the CG chains these values are 8.66, 11.35, and –4.98 Å. One sees that for the CG chains, the components of the persistence vector exhibit significant differences, indicating different preferences along different directions relative to the three Cartesian coordinates. Differences in preference are much less pronounced for the RG chains. The effects of holding the first and second bonds in the xy-plane are exhibited in the x and y components. This effect dies off approximately when the 20th atom is reached for the RG chains, corresponding to four residues. However, the effect of fixing the first two bonds in the xy-plane reaches up to 25–75 bonds, or 8–25 residues, for the CG chains. 
Thus, correlations between residues along the CG protein chain extend over very large distances. The z component of the persistence vector dies off readily for the RG chains, indicating that correlations are rapidly randomized in this direction. However, the z-component strongly persists, in the negative direction, for the CG chains. The magnitude of the persistence vector for the full CG chains is 15.1 Å while it is 6.1 Å for the RG chains.
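The quoted magnitudes follow directly from the three components given above, as the short Python check below shows. The ensemble-averaging helper is a sketch with a hypothetical name; it assumes every chain is already expressed in the frame with the first bond along x and the second bond in the xy-plane.

```python
import math

def persistence_vector(chains, atom_index):
    """Average the position of atom `atom_index` over an ensemble of chains.

    Each chain is a list of (x, y, z) positions with atom 0 at the origin.
    """
    n = len(chains)
    return [sum(chain[atom_index][axis] for chain in chains) / n
            for axis in range(3)]

def magnitude(vector):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(c * c for c in vector))

# Full-chain components quoted in the text, in angstroms.
rg_components = [3.59, 4.98, 0.09]
cg_components = [8.66, 11.35, -4.98]
rg_length = magnitude(rg_components)   # about 6.1
cg_length = magnitude(cg_components)   # about 15.1
```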

Figure 3.

Components of the persistence vector for RG (○) and CG (•).

In Figure 4, the mean-squared end-to-end distances between the zeroth atom and the ith atom of the RG (empty circles) and CG (filled circles) chains are shown. The characteristic ratios, Ci, given on the ordinate are obtained by dividing the mean-squared end-to-end distances by il², where i is the atom index and l is the bond length, taken as 1.47 Å. The characteristic ratio for the RG set rapidly smooths out, indicating the absence of local correlations. The characteristic ratio for the CG set exhibits fluctuations. The characteristic ratios C for the whole RG and CG chains are 4.79 and 5.38, respectively. These two values are strikingly close to each other. The apparently unexpected result that the persistence vectors differ significantly between RG and CG chains while the characteristic ratios are similar is due to the presence of strong torsion angle correlations in the CG chains, as a result of which they are not yet in the scaling regime. One may suitably analyze the chain in terms of the worm-like or Porod-Kratky chain, for which the mean-squared end-to-end distance 〈r²〉 and the projection a of the persistence vector along the first bond are related according to the expression45
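The definition of the characteristic ratio used for Figure 4 can be written as a short Python sketch. The function name is hypothetical and the rigid-rod input is only a sanity check, for which the ratio must grow linearly with the atom index.

```python
def characteristic_ratios(chains, bond_length):
    """C_i = <r_0i^2> / (i * l^2) for i = 1..n, averaged over the ensemble.

    Each chain is a list of (x, y, z) atom positions; r_0i is the vector from
    atom 0 to atom i.
    """
    n_atoms = len(chains[0])
    ratios = []
    for i in range(1, n_atoms):
        mean_sq = sum(
            sum((chain[i][axis] - chain[0][axis]) ** 2 for axis in range(3))
            for chain in chains
        ) / len(chains)
        ratios.append(mean_sq / (i * bond_length ** 2))
    return ratios

# Sanity check on a straight rod with the bond length quoted in the text.
rod = [(1.47 * i, 0.0, 0.0) for i in range(5)]
rod_ratios = characteristic_ratios([rod], 1.47)
```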

equation image(11)

where L is the contour length and the subscript zero indicates the absence of long-range correlations. The contour length L for CI2 is 276 Å. For the RG chains, 〈r²〉0 = 4.79. Substitution of these values into eq 11 and solving for a yields aRG = 3.64 Å, which agrees well with the value of 3.59 Å calculated from the simulations. For the CG chains, 〈r²〉0 = 5.38 Å. Substitution into eq 11 and solving for the persistence length yields aCG = 4.10 Å. Calculations from the simulations give, however, aCG = 8.66 Å, which is more than twice the value for the Porod-Kratky chain. This indicates that the CI2 chain is not yet in the scaling regime and that strong conformational correlations inside the chain dominate the statistics of the overall chain.

Figure 4.

Characteristic ratio for the ith atom of the chains. See legend for Figure 3.

Bond Correlations

In this section, we investigate the statistical correlations between two bonds of the chain. The Flory Isolated Residue Pair hypothesis45, 48, 49 states that the conformations of a given residue are statistically independent of those of the neighboring residues in the random configurations of the protein. Subsequent work, however, showed that certain conformations of neighboring bonds of proteins are correlated.9, 11, 50 The presence of such correlations (or anticorrelations) is important since they may possibly affect the folding of the peptide chain and the stabilization of secondary structure.

The state η of bond i and the state ζ of bond j are uncorrelated if the joint probability Pηζ;ij is equal to the product of the corresponding singlet probabilities, Pη;i and Pζ;j. The ratio

$$\frac{P_{\eta\zeta;ij}}{P_{\eta;i}\,P_{\zeta;j}} \qquad (12)$$

is a very sensitive measure of correlations in this respect. If the occurrence of state η in bond i enhances (decreases) the occurrence of state ζ in bond j, then the ratio in eq 12 is larger (smaller) than unity and the two states of the two bonds are correlated (anticorrelated). In a random chain, the ratio is unity for all pairs of states.

The neighbor-dependent potentials adopted in this study lead to strong dependences between the specific states of neighboring bonds. To answer the more meaningful and more general question of whether two bonds of the chain are correlated or not, one needs to sum eq 12 over all states of the two bonds. For this purpose, we adopted the χ² test for estimating the degree of correlation between the ith and jth bonds. To emphasize the effect of secondary structure on correlations, we grouped each ψ angle into three states: State 1 is the helix state with –150° ≤ ψ ≤ –70°. State 2 is the β state with 90° ≤ ψ ≤ 180°. State 3 comprises all the remaining values of the ψ angle. All the states have –180° ≤ φ ≤ –70°. The χ² test is then performed according to the expression50

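$$\chi^2_{ij} = \sum_{\zeta}\sum_{\eta} \frac{\left(n_{\zeta\eta;ij} - \epsilon_{\zeta\eta;ij}\right)^2}{\epsilon_{\zeta\eta;ij}} \qquad (13)$$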

where χ²ij denotes the value for the two bonds i and j, nζη;ij is the number of configurations for which the ith bond is in state ζ and the jth bond is in state η, and ϵζη;ij = n pζ;i pη;j. In the last expression, n is the total number of configurations. The total number, 10⁴, of configurations generated was sufficient to ensure51 that ϵζη;ij ≥ 5 for all ζη, ij pairs of the chain, both for the RG and the CG sets. The critical value of χ²ij is 18.7 for a confidence level of 99.9% for the system of three states: α-, β-, and all others.51 The χ²ij values ranged between 0.022 and 18.27 for all bond pairs of the RG set. Therefore, within this confidence level, any two bonds of the random set are independent. For the CG set, the χ²ij values ranged between 0.016 and 4654. The values of χ²ij between 0.016 and 18.7 indicate that the corresponding bonds i and j are uncorrelated at the confidence level of 99.9%. The spectrum of χ²ij values larger than 18.7 was approximately continuous up to 600 and showed a large gap between 600 and 1000. The strongest correlations are indicated by χ²ij values larger than 1000. In Figure 5, the regions with χ²ij values larger than 1000 are shown as a function of residue index. The upper band shows the six secondary structures of the native protein (β1, α, β2, recognition loop, β3, and β4). The lower four black strips indicate the groups of residues having significant correlations among the ψ-bonds. The first group of correlations is among the ψ-bonds between residues 18 and 24. This region corresponds to the α-helix. The second group of correlations is among the ψ-bonds between residues 29 and 33 in the β2 structure. The third group of correlations is among the ψ-bonds between residues 37 and 41 in the recognition loop. The fourth group of correlations is among the ψ-bonds between residues 46 and 53, comprising the β3 structure of CI2. 
Thus, correlations among torsion angles in the unfolded state are harbingers of secondary structure regions of the native protein. Furthermore, the most probable values of more than 90% of the correlated ψ-bonds, obtained using eq 7, coincide with the native value of the corresponding angles.9 We therefore conclude that correlations among bonds in the unfolded state have the effect of stabilizing the secondary structures.
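The χ² statistic for one pair of bonds can be computed from the observed states as in the Python sketch below. The function name and the toy state sequences are illustrative only; the expected count in each cell is n pζ;i pη;j, estimated from the marginal counts of the same sample.

```python
from collections import Counter

def chi_square(states_i, states_j):
    """Chi-square statistic for independence of the states of two bonds.

    states_i[k] and states_j[k] are the discrete states of bonds i and j in
    configuration k.
    """
    n = len(states_i)
    joint = Counter(zip(states_i, states_j))
    count_i = Counter(states_i)
    count_j = Counter(states_j)
    total = 0.0
    for zeta, n_zeta in count_i.items():
        for eta, n_eta in count_j.items():
            expected = n_zeta * n_eta / n      # n * p_zeta;i * p_eta;j
            total += (joint[(zeta, eta)] - expected) ** 2 / expected
    return total

# Toy data: perfectly correlated bonds, and perfectly independent bonds.
correlated = [0, 1, 2] * 10
independent_i = [z for z in range(3) for _ in range(3)] * 4
independent_j = [e for _ in range(3) for e in range(3)] * 4
chi_correlated = chi_square(correlated, correlated)
chi_independent = chi_square(independent_i, independent_j)
```

For three states the statistic is bounded above by twice the number of configurations, so values in the thousands, as reported for the CG set, are only reachable with ensembles of several thousand chains.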

Figure 5.

Correlations among ψ-bonds (shown by the lower black strips), and secondary structure regions of the native protein (shown by the upper black strips). Rec. loop denotes the recognition loop.

Simulation Times

We performed 20 simulations starting with CG chains and 20 simulations starting with RG chains. Figure 6 displays the simulation times in increasing order, for each set. Solid bars represent the simulation time for CGs and empty bars represent those for RGs.

Figure 6.

Folding times for 20 different conditional and random simulations.

Three of the 20 simulations in the CG set were unsuccessful (they were nonfolders). In contrast, 8 of the 20 simulations of the RG set did not reach the folded structure. The success rates are 0.85 and 0.60 for the two cases. The average folding time for the first set is 11 × 10⁶ time steps. This number increases to 20 × 10⁶ time steps for the second set of simulations. These results indicate that structures starting from conditional probabilities are favored in the folding process compared to those starting from random structures, and suggest that the conditional probability matrices are important in guiding the initial coordinates.


The twenty simulations starting with CG chains display distinct behaviors along the folding trajectory. Two examples are shown in Figure 7, where the rmsd from the native structure is presented as a function of simulation time. Part (A) is a typical example in which a sudden decrease in the rmsd is observed early in the simulation. Eleven of the simulations show such behavior. There is then a pseudoequilibrium region where the residues fluctuate and all the structures have an rmsd around 6 Å. Finally, there is another sudden decrease in the rmsd toward the end of the simulation, where the native state is reached. The final folding is always observed to be very fast, taking only a few simulation time steps. Usually the rmsd stays around 6 Å and then suddenly drops to the rmsd (0.3 Å) at which the simulations are terminated. Final adjustments are completed just after the folding nucleus is formed during the pseudoequilibrium region. Part (B) is another example, in which the structure exhibits several transitions along its folding pathway. Large deviations from the native structure result from attempts to relieve local contacts of residue pairs, as discussed in more detail later.

Figure 7.

Rmsd from the native structure during simulations starting from the conditionally generated (CG) set.

Residue Contacts

Figure 8 displays the behavior of contacts during two of the simulations. The x-axis is the simulation time during a trajectory and the y-axis represents the residue numbers in CI2. A dot is placed in Figure 8 for residue i if it is in contact with another residue at a given simulation time. Results from the two trajectories are presented in Figure 8(A,B). In Figure 8(A), the four bands around residues 13–17, 28–32, 47–51, and 56–59 are the most discernible ones. In Figure 8(B), the three bands around residues 28–32, 47–51, and 56–59 are the most persistent ones. All other trajectories conform to either Figure 8(A) or Figure 8(B). Fersht and Daggett reported that there is no compact structure until residues 49–57 pack against each other and stabilize the native helix.5 These residues then make contact with residue 16 to form the folding core. Thus, residues 16, 49, and 57 form the folding nucleus.45 It is interesting to see that residues 49 and 57 are always in contact during the simulation. This is just one example; in most cases (13 out of 17 folders) these residues form their contacts in the early stages of folding and stay in contact during the simulations.

Figure 8.

Residues that are in contact during the simulation.
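
As an illustration of how a plot such as Figure 8 can be assembled, the following Python sketch marks the residues that are in contact in each frame and measures how persistently a given pair (for example, residues 49 and 57) stays in contact. It is not the code used in this work. The function names, the representation of each residue by a single point, and the 7 Å cutoff are assumptions made for the example; the minimum sequence separation is borrowed from the definition used for cyclization probabilities.

```python
import math

MIN_SEPARATION = 3  # at least two residues lie between i and j
CUTOFF = 7.0        # Angstrom; hypothetical contact distance, not from the paper


def contacts(coords, cutoff=CUTOFF, min_sep=MIN_SEPARATION):
    """Return the set of 1-based residue pairs (i, j), i < j, that are in contact.

    coords is a list of (x, y, z) tuples, one point per residue (an assumption;
    the model in the paper uses backbone atoms plus a side-group sphere).
    """
    pairs = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                pairs.add((i + 1, j + 1))
    return pairs


def residues_in_contact(pairs):
    """Residues that would receive a dot at this time in a Figure 8-style plot."""
    return {residue for pair in pairs for residue in pair}


def pair_occupancy(trajectory, pair, cutoff=CUTOFF, min_sep=MIN_SEPARATION):
    """Fraction of frames in which the given residue pair is in contact."""
    hits = sum(pair in contacts(frame, cutoff, min_sep) for frame in trajectory)
    return hits / len(trajectory)
```

With a trajectory stored as a list of frames, `pair_occupancy(trajectory, (49, 57))` would give the fraction of the simulation during which residues 49 and 57 are in contact under these assumptions.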

Cyclization Probabilities

Another observation from our simulations is that the RG chains exhibit a higher number of non-native contacts throughout their folding trajectories, whereas in the CG chains the native contacts outnumber the non-native ones. The fraction of native contacts, Q, can be used to describe the folding process.52 Q is very close to zero for highly denatured protein chains and reaches unity in the native state. Contacts between two residues separated by at least two other residues are used in defining cyclization probabilities. Cyclization probabilities may be divided into two types: those relating to native contacts and those relating to non-native contacts.

The total number of contacts (both native and non-native) in a chain at any time is $n_c = n_n + n_{nn}$, where $n_c$, $n_n$, and $n_{nn}$ are the numbers of total, native, and non-native contacts in the chain, respectively. The value of Q is then $Q = n_n / n_c$. In Figure 9, we compare the evolution of Q along the folding trajectory for an RG chain (empty circles) and a CG chain (filled circles) whose folding times are close to each other. In the initial state, the fraction of native contacts, Q, is the same for the two chains, around 0.57. However, the Q value of the CG chain exceeds 0.9 at the beginning of the folding simulation. The points for the RG chain remain below those of the CG chain until the late stages of the trajectory.

Figure 9.

The ratio of native contacts, Q, for a random (RG) and a conditional (CG) simulation as a function of time step.
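
The bookkeeping behind Q can be sketched in a few lines of Python. The sketch assumes that Q is the native share of the contacts present at a given time, i.e. the number of native contacts divided by the total number of contacts, which is how the definitions above read. Contacts are represented as sets of residue pairs; the function names and the convention of returning zero for a chain with no contacts are choices made for the example.

```python
def native_fraction(current, native):
    """Q = n_n / n_c for one configuration.

    current: set of (i, j) residue pairs in contact in this configuration
    native:  set of (i, j) residue pairs in contact in the native structure
    """
    n_n = len(current & native)   # native contacts present
    n_nn = len(current - native)  # non-native contacts present
    n_c = n_n + n_nn              # total contacts
    if n_c == 0:
        return 0.0  # convention assumed here for a chain with no contacts
    return n_n / n_c


def mean_q(frames, native):
    """Average of Q over a trajectory given as a list of contact sets."""
    return sum(native_fraction(frame, native) for frame in frames) / len(frames)
```

Applying `mean_q` to every trajectory of a set would give one trajectory-averaged Q value per simulation.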

In Figure 10, the average of Q is presented for each trajectory. The abscissa indexes the twenty trajectories. Each point is obtained by averaging the Q values over a full trajectory. The filled and empty circles correspond to the CG and RG simulations, respectively. The CG and RG Q values are each arranged in increasing order to illustrate the difference between the two sets. The figure shows that the CG Q values are always above those of the RG chains. The means of the CG and RG Q values are 0.88 (±0.06) and 0.81 (±0.10), respectively. We also performed the Wilcoxon nonparametric rank test to see whether the difference between the medians of these two datasets is statistically significant. We obtained a P-value of 0.02, which is below the significance level α = 0.05. We therefore conclude that the CG chains behave significantly differently from the RG chains during the simulations.

Figure 10.

The average ratio of native contacts for the 20 random (RG) and 20 conditional (CG) simulations. The simulations are sorted in order of increasing Q.
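
A rank-based comparison of two sets of trajectory-averaged Q values can be sketched as follows. The text does not state which variant of the Wilcoxon test was used; the sketch assumes the rank-sum form for two independent samples, with a normal approximation for the P-value and no tie correction in the variance. The function names are illustrative, and no data from the paper are reproduced.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    k = 0
    while k < len(order):
        m = k
        while m + 1 < len(order) and values[order[m + 1]] == values[order[k]]:
            m += 1
        shared = (k + m) / 2 + 1  # mean of the 1-based ranks k+1 .. m+1
        for idx in order[k:m + 1]:
            ranks[idx] = shared
        k = m + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, assumed variant).

    Returns (W, z, p), where W is the sum of the ranks of sample x.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w, z, p
```

For samples of 20 values each, the normal approximation is usually considered adequate; for very small samples an exact test would be preferable.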

The differences between the average native contact probabilities presented in Figure 10 result from two important factors. First, neither the RG nor the CG initial configurations have knots. The CI2 chain has too few repeat units to form knots while being generated from the amino to the carboxyl terminus. The persistence length of the RG chains, 6.1 Å, is approximately equal to that of polyethylene chains.47 The persistence length of the CG chains, on the other hand, is 15.1 Å, which is significantly higher than that of the random chains. The value obtained for the CG chains is somewhat smaller than the persistence length of 22.5 Å calculated for polyalanine by Conrad and Flory.53 This results from the specific φ − ψ propensities inherent in the non-native state of the CG chains. As a result of their shorter persistence length, dense nonlocal non-native contacts form in the initial stages of the RG trajectories and persist throughout the complete folding trajectory. Second, our simulations indicate that, for CG chains, native-like blocks of several consecutive amino acids form at the early stages of the trajectories as a result of local rearrangements. During the remaining stages of the trajectories, these blocks move toward the relative positions they occupy in the native state. Motions of this second type dominate most of the trajectories, since they are large-scale motions and require more time. The time of native contact formation between two residues i and j increases linearly with the number of residues |j − i + 1| between them.54 Regions connecting such blocks play crucial roles in folding. In RG chains, the connecting regions form tight turns that, once formed, require a long time to break and transform into the native configuration. The most time-consuming motions in the Go-model simulations are those that transform one tightly formed region into another. Here, we use the term ‘tightly formed’ to indicate a configuration that contains local contacts. In CG chains, on the other hand, the connecting regions generally do not contain local contacts in the earlier stages of their trajectories and therefore require less time to change from one configuration to another.
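
For readers who wish to estimate a persistence length from an ensemble of generated chains, a minimal Python sketch is given below. It assumes one common definition, the average projection of the end-to-end vector onto the direction of the first bond, evaluated over finite chains; the definition actually used in this work may differ. Representing each residue by a single point and the function name are also assumptions of the example.

```python
import math


def persistence_length(chains):
    """Mean projection of the end-to-end vector on the first bond direction.

    chains: list of chains, each a list of (x, y, z) points along the backbone
    (for example, one point per residue; an assumption of this sketch).
    """
    total = 0.0
    for chain in chains:
        # Vector of the first (virtual) bond and the end-to-end vector.
        first = [b - a for a, b in zip(chain[0], chain[1])]
        end_to_end = [b - a for a, b in zip(chain[0], chain[-1])]
        norm = math.sqrt(sum(c * c for c in first))
        total += sum(f * e for f, e in zip(first, end_to_end)) / norm
    return total / len(chains)
```

A stiffer ensemble, in which the chain direction is retained over more bonds, yields a larger value under this definition.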


The following are the major conclusions of the present study of CI2:

  1. CG chains fold at least twice as fast as RG chains. There are eight nonfolders in the RG set compared with three in the CG set. These numbers indicate the importance of the initial structures in folding simulations.
  2. Although the characteristic ratios of the RG and CG chains are close to each other, their persistence lengths differ significantly. This results from the neighbor-dependent correlations between torsion potentials in the CG chains, which are absent in the RG chains.
  3. The numbers of native and non-native contacts during folding are controlled by the initial conformations. The number of native contacts in the chains generated with the knowledge-based potentials is significantly larger than that in the random chains, both in the initial configurations and throughout the folding process.
  4. Specific contacts (e.g., between residues 49 and 57) appear throughout most of the trajectories.
  5. The dominant bottlenecks during folding are due to local non-native contacts in the regions of the chain that join two native-like blocks that have formed but are not in the correct position relative to each other.
  6. The neighbor-dependent torsion potentials built into the CG chains result in strong correlations among bonds, which stabilize fragments of secondary structure at the onset of folding.