Predicting protein folding rates from geometric contact and amino acid sequence


  • Zheng Ouyang,

    1. Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois 60607, USA
  • Jie Liang

    Corresponding author
    1. Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois 60607, USA
    • Department of Bioengineering, University of Illinois at Chicago, 851 South Morgan Street, Room 218, Chicago, IL 60607, USA; fax: (312) 996-5921.


Protein folding speeds are known to vary over more than eight orders of magnitude. Plaxco, Simons, and Baker (see References) first showed a correlation of folding speed with the topology of the native protein. That and subsequent studies showed that, if the native structure of a protein is known, its folding speed can be predicted reasonably well through a correlation with the “localness” of the contacts in the protein. In the present work, we develop a related measure, the geometric contact number, Nα, which is the number of nonlocal contacts that are well packed by a Voronoi criterion. We find, first, that in 80 proteins, the largest such database of proteins yet studied, Nα is a consistently excellent predictor of the folding speeds of both two-state fast folders and more complex multistate folders. Second, we show that folding rates can also be predicted directly from amino acid sequences, without the need to know the native topology or other structural properties.

In 1998, Plaxco and colleagues made the important observation that the folding rates of two-state-folding proteins correlate with the native topologies of those proteins: Faster-folding proteins tend to have more local α-helical structure, and slower-folding proteins tend to have more nonlocal β-sheet structure. Plaxco and colleagues characterized the native topology using the average relative contact order (RCO), a measure of the relative fraction of local vs. nonlocal noncovalent contacts (Plaxco et al. 1998).

Many variations of this idea have since been studied, indicating that folding rates also correlate with the long-range order (LRO) (Gromiha and Selvaraj 2001), the effective contact order (ECO) (Dill et al. 1993; Fiebig and Dill 1993), the total contact distance (TCD) (Zhou and Zhou 2002), a chain topology parameter (CTP) (Nolting et al. 2003), and the effective length of the protein, Leff (Ivankov and Finkelstein 2004). A few of these quantities, such as the absolute contact order (ACO), have predictive power beyond two-state-folding rates (Ivankov et al. 2003): they also predict the rates of more complex multistate folders. Interestingly, although the protein's chain length was originally found to be poorly correlated with the rates of two-state folders, chain length (sometimes raised to a fractional exponent) was later shown to correlate well with folding rates (Thirumalai 1995; Finkelstein and Badretdinov 1997; Koga and Takada 2001; Galzitskaya et al. 2003; Shao et al. 2003; Naganathan and Munoz 2005).

However, these results were obtained with relatively small data sets and often started from knowledge of the native structure of the protein (Plaxco et al. 1998; Gromiha and Selvaraj 2001; Zhou and Zhou 2002; Ivankov et al. 2003; Nolting et al. 2003). There have been several reports of predicting folding rates from protein sequences (Shao and Zeng 2003; Kuznetsov and Rackovsky 2004; Gromiha 2005; Punta and Rost 2005; Galzitskaya and Garbuzynskiy 2006), but these all require some structural information, for example, knowledge of the structural class, or are based on prior predictions of the native secondary structure.

Our aim here is to develop a general method that predicts the folding rates of proteins from diverse classes based only on the amino acid sequence, without knowledge of the tertiary or secondary structure or of the structural class, and without the aid of any other computational prediction of structural properties (e.g., secondary structures or contact order). We first use the concept of “geometric contact” (defined below) to study the correlation between native structure and folding rate (Li et al. 2003). Using a large set of proteins, including both two-state and multistate folders, we find that folding rates correlate well with the number of residues that form geometric contacts. The correlation coefficients are −0.86, −0.86, and −0.83 for two-state proteins, multistate proteins, and all proteins combined, respectively. When a reduced alphabet of only two differently weighted amino acid types is used, all of these correlation coefficients improve. In leave-one-out jackknife tests, the folding rate predicted from structure correlates with the measured folding rate with a coefficient of 0.86. Based on the propensities of the different residue types to form geometric contacts, estimated from a protein structural database, we further develop a simple algorithm that predicts folding rates from amino acid sequences alone, without any additional structural information. The predicted values correlate well with the experimental values, with a coefficient of 0.82. Our results suggest that both simple and complex proteins, over all the fold classes, may fold by a single mechanism in which spatial packing and zipping interactions are important determinants of the folding rate.

Materials and Methods

Model and data

Data set

A collected data set of experimentally determined folding rates for 80 proteins, of which 45 are two-state folders and 35 are multistate folders, was a generous gift from Ken Dill and Dr. Ke Fan (University of California at San Francisco). We slightly modified this data set by removing structures that contain large hetero groups, such as iron protoporphyrins, or irregular amino acids, and by incorporating additional data from the literature. These proteins belong to different structural classes: 18 are all-α proteins, 32 are all-β proteins, and 30 are αβ proteins. For multistate folders we took the slowest rate, since the faster rates are due to kinetic traps; the slowest rate corresponds to the appearance of native protein and is therefore most directly comparable with the folding rate of two-state folders. The folding rates of these proteins range over more than eight orders of magnitude, from lnkf = −6.9 for ribonucleotide isomerase (1qo2) to lnkf = 12.9 for albumin-binding domain (1prb). Tables 1 and 2 give the Protein Data Bank names and experimentally measured folding-rate values for two-state and multistate proteins, respectively. Supplemental material is available at (

Table 1. The set of 45 two-state proteins

Table 2. The set of 35 multistate proteins

Defining geometric contacts

In most studies, a pairwise contact is declared if two residues are within a specific cutoff distance. Such definitions can include residue pairs that have no steric interactions (Taylor 1997; Bienkowska et al. 1999). We take the view here that a more refined definition of geometric contact may be more useful (Li et al. 2003).

We used a contact definition based on a Voronoi criterion. Voronoi diagrams have been widely used in protein structure and folding analysis (Richards 1977; Poupon 2004). Here we illustrate our contact definition using a simple two-dimensional picture of a molecule formed by a collection of disks of uniform size (Fig. 1A). In the diagram, each Voronoi cell contains one atom, and every point inside a Voronoi cell is closer to this atom than to any other atom. A Voronoi cell is defined by its boundary edges (shown as broken lines in Fig. 1A), which are perpendicular bisectors of the line segments connecting two atom centers. For each Voronoi edge, this line segment is called the corresponding Delaunay edge (Fig. 1B). In this study, residues i and j are defined to form a geometric contact if they are connected by a Delaunay edge and the corresponding Voronoi edge intersects the protein body. In addition, we require that contacting residues be at least four residues apart in the primary sequence and no more than 6.5 Å apart in space. Our parameter Nα, the geometric contact number, is simply the total number of residues in a protein that form such contacts. We first test Nα as a predictor of folding rates against other measures. The RCO was introduced by Plaxco et al. (1998):

$$\mathrm{RCO} = \frac{1}{L \cdot N} \sum^{N} \Delta S_{i,j}$$

where N is the total number of contacts, ΔSi,j is the sequence separation between contacting residues i and j, and L is the total number of residues. RCO measures the relative importance of local and distant contacts. The ACO was also introduced by Plaxco et al. (2000):

$$\mathrm{ACO} = \frac{1}{N} \sum^{N} \Delta S_{i,j}$$

where ACO is the average sequence separation of contacting residues, not normalized by the chain length as RCO is. Finally, chain length (L) has also been used for correlating with folding rates (Thirumalai 1995; Finkelstein and Badretdinov 1997; Koga and Takada 2001; Galzitskaya et al. 2003; Ivankov et al. 2003; Naganathan and Munoz 2005).
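The two contact-order measures described above can be sketched in a few lines. This is a minimal illustration rather than the code used in this study: it takes ACO as the mean sequence separation of contacting pairs and RCO as that mean divided by the chain length, as described in the text. The function name `contact_orders` and the toy contact list are hypothetical.

```python
from typing import List, Tuple


def contact_orders(contacts: List[Tuple[int, int]], chain_length: int) -> Tuple[float, float]:
    """Return (RCO, ACO) for a list of contacting residue index pairs (i, j)."""
    if not contacts or chain_length <= 0:
        raise ValueError("need at least one contact and a positive chain length")
    n_contacts = len(contacts)
    # Sum of sequence separations |i - j| over all contacts.
    total_separation = sum(abs(i - j) for i, j in contacts)
    aco = total_separation / n_contacts   # average separation, not normalized
    rco = aco / chain_length              # normalized by chain length L
    return rco, aco


# Toy input (hypothetical): four contacts in a 20-residue chain.
toy_contacts = [(1, 6), (2, 12), (5, 15), (3, 20)]
rco, aco = contact_orders(toy_contacts, chain_length=20)
print(f"RCO = {rco:.3f}, ACO = {aco:.2f}")
```

Because RCO is simply ACO divided by L, two proteins with the same average separation but different lengths receive different RCO values, which is the normalization that distinguishes the two measures.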

Figure 1. Voronoi diagram of a simple 2D molecule. (A) The molecule is formed by disks of uniform size. The dashed lines represent the Voronoi diagram, in which each region contains one atom. (B) The Delaunay edges of the molecule.

Results and Discussion

The Voronoi-based geometric contact definition gives an improved correlation with protein folding rates

The results of correlating folding rates lnkf with Nα and other measures of native topology are summarized in Table 3. As others have found previously (Ivankov et al. 2003), the RCO correlates poorly with folding rates for this set of 80 proteins. A better measure is the ACO: its correlation with folding rates is R = −0.83 for two-state proteins, R = −0.64 for more complex proteins, and R = −0.76 for both sets combined. Previous results suggested that the protein's chain length correlates well with the folding rate (Naganathan and Munoz 2005). Using this enlarged data set, we find that chain length correlates strongly with folding rate for multistate proteins (R = −0.79) but more weakly for two-state proteins (R = −0.72). Although fractional powers of the length (e.g., L^(1/2), L^(2/3), or L^(3/5)) or the logarithm ln(L) can improve the correlation for multistate proteins (Naganathan and Munoz 2005), they bring little improvement for two-state proteins (see Table 3). In contrast, the quantity Nα introduced here correlates well in all cases (R = −0.86 for two-state proteins, R = −0.86 for multistate proteins, and R = −0.83 for all 80 proteins). Figure 2 shows how these various measures correlate with the folding rates of the combined set of proteins. These data indicate that an accurate description of geometric contacts improves the correlation of native protein structures with folding rates.

Table 3. Correlation coefficients of structure-derived parameters with protein folding rates

Figure 2. Relationship between different structural parameters and folding rates of two-state (open squares) and multistate (solid squares) proteins. (A) Relative contact order, RCO (R = −0.15); (B) absolute contact order, ACO (R = −0.77); (C) chain length (R = −0.72); and (D) Nα (R = −0.83).

Comparing our geometry-based contact definition with distance-based definitions

We compare our measure using the geometric definition of contact with the following distance-based measure: We declare a pair of residues to be in contact if the distance between their Cα atoms is no greater than 6.5 Å. The results are shown in Table 4. The geometry-based definition gives a slightly better correlation than the distance-based definition for relative contact order and for our parameter of total contact number Nα, and gives the same correlation as the distance measure when using absolute contact order. More importantly, there are 8384 and 5234 pairwise contacts by the distance-based and geometry-based measures, respectively, hence 38% of the distance-based contacts either are unnecessary or degrade the correlation.
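As an illustration of the distance-based definition, the sketch below declares a contact whenever two Cα atoms lie within 6.5 Å, and then reproduces the 38% figure from the two contact totals quoted above. It is a minimal sketch, not the analysis code of this study: the function name, the toy hairpin coordinates, and the choice to apply the same four-residue sequence separation used for geometric contacts are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_contacts(ca_coords: Sequence[Coord],
                      cutoff: float = 6.5,
                      min_separation: int = 4) -> List[Tuple[int, int]]:
    """Residue pairs (i, j) whose C-alpha atoms lie within `cutoff` angstroms.

    min_separation=4 is an assumption borrowed from the geometric definition.
    """
    contacts = []
    for i in range(len(ca_coords)):
        for j in range(i + min_separation, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts


# Toy hairpin (hypothetical coordinates): two antiparallel strands 5 A apart.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
              (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
toy_contacts = distance_contacts(toy_coords)
print(len(toy_contacts), "contacts:", toy_contacts)

# Fraction of distance-based contacts absent from the geometry-based set,
# using the totals reported in the text.
n_distance, n_geometric = 8384, 5234
removed_fraction = (n_distance - n_geometric) / n_distance
print(f"{100 * removed_fraction:.1f}% of distance-based contacts are not geometric contacts")
```

A Voronoi-based filter would further discard pairs from such a list when no Delaunay edge connects them or when the shared Voronoi edge lies outside the protein body; that step requires a Delaunay triangulation and is not shown here.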

Table 4. Comparing distance-based and geometry-based definitions of contacts, for correlating with folding rates

Nα is a better predictor of folding rate than chain length. Although chain length and Nα are highly correlated (R = 0.91), subset testing shows that Nα correlates better with folding rates than simple chain length does. We randomly selected a subset of 30 proteins from the 80 proteins and, for this subset, recorded the correlation coefficient between the folding rate lnkf and the geometric contact number Nα, and that between lnkf and the chain length L. This procedure was repeated seven times. As can be seen in Figure 3, the chain length L is not a consistently good predictor of protein folding rates: the correlation R is stronger than −0.50 for only two subsets, and the best R-value is −0.67. Depending on the proteins selected, the R-value can be as weak as −0.04. In contrast, Nα gives consistently good correlations: all are stronger than −0.58, with the best value being −0.79. These results suggest that Nα is more informative than chain length for understanding protein folding mechanisms.
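The subset test can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure's original code: the data are synthetic (80 made-up predictor values with a built-in negative trend), and the function names and the fixed random seed are hypothetical choices made only for reproducibility.

```python
import math
import random
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def subset_correlations(predictor: Sequence[float], ln_kf: Sequence[float],
                        subset_size: int = 30, n_trials: int = 7,
                        seed: int = 0) -> List[float]:
    """Correlate predictor with ln(kf) on repeated random subsets."""
    rng = random.Random(seed)
    indices = list(range(len(ln_kf)))
    results = []
    for _ in range(n_trials):
        chosen = rng.sample(indices, subset_size)  # draw without replacement
        results.append(pearson_r([predictor[i] for i in chosen],
                                 [ln_kf[i] for i in chosen]))
    return results


# Synthetic stand-in data (NOT the 80 proteins of this study).
n_alpha = [10.0 + i for i in range(80)]
ln_kf = [12.0 - 0.2 * n + 2.0 * math.sin(i) for i, n in enumerate(n_alpha)]
r_values = subset_correlations(n_alpha, ln_kf)
print([round(r, 2) for r in r_values])
```

Running the same function with chain length as the predictor, and comparing the spread of the seven R-values, gives the kind of robustness comparison plotted in Figure 3.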

Figure 3. The geometric contact number Nα is more robust than chain length L in correlating with protein folding rate. Results of subset testing in which 30 proteins are drawn from the original data set to form a subset. Correlation coefficients of folding rates with Nα and with the chain length L for seven such subsets are plotted.

Different geometric contacts contribute differently to folding rates

Here, we allowed each residue type i to have a weighted contribution wi, leading to the following model for protein folding rates:

$$\ln k_f = a + \mathbf{n}_\alpha \cdot \mathbf{w} \qquad (1)$$

where lnkf is the folding rate of a protein, a is a constant, nα is a 20-dimensional vector recording the number counts of the 20 residue types in geometric contact, and w is the 20-dimensional weight vector whose values are to be determined. Using singular value decomposition for the data set of 80 proteins, we obtain the optimal weight vector w, and the baseline constant a, that minimize the residual error of the predicted lnkf with the experimentally determined lnkf values, by a Euclidean distance measure (Noble and Daniel 1988). The optimal weights for the 20 amino acid types are listed in Table 5. Interestingly, Val, Ile, Trp, and Tyr appear to slow down folding by the greatest extent, whereas Glu and Phe accelerate folding.

Table 5. The weight parameters for the different residue types in determining protein folding rates

Upper bound of protein folding speed

Based on the 20 optimized weight parameters, we can estimate an upper bound for the folding speeds of the fastest proteins. In general, small proteins are fast folders: a foldable protein sequence with only 20 residues has been reported (Qiu et al. 2002). If we consider such a 20-mer and assume that it consists entirely of our predicted fastest-folding residue, Glu, with every residue in geometric contact (recognizing, however, that Glu alone would not lead to a stable fold), Equation 1 suggests that no protein or peptide is likely to fold faster than lnkf = 10.29 + 20 × 0.451 ≈ 19.3, which corresponds to a folding time of roughly 4 nsec.
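The arithmetic behind this bound can be checked directly. The short sketch below uses only the numbers quoted above (baseline 10.29, Glu weight 0.451, 20 residues); the variable names are hypothetical, and the conversion to a time assumes that kf is expressed in units of 1/sec.

```python
import math

BASELINE_A = 10.29   # constant term quoted in the text
W_GLU = 0.451        # weight of Glu quoted in the text
N_RESIDUES = 20      # length of the hypothetical all-Glu 20-mer

# Equation 1 with all 20 residues in contact and of type Glu.
ln_kf_max = BASELINE_A + N_RESIDUES * W_GLU

# Convert the rate to a characteristic folding time (assumes kf in 1/sec).
k_max = math.exp(ln_kf_max)
t_min_ns = 1e9 / k_max

print(f"ln kf upper bound = {ln_kf_max:.2f}")
print(f"fastest folding time = {t_min_ns:.1f} nsec")
```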

Folding rates and structures on a reduced alphabet of amino acids

To avoid overfitting, we use a reduced alphabet of amino acids containing only two types of residues, and allow these two types to contribute differently to the folding rate. After exhaustive tests using different combinations of residue types, we chose the following grouping of amino acids as our reduced alphabet A = (A1, A2), with A1 = (A, C, E, F, M, N, R, G, H, K, L, P, T) and A2 = (D, I, Q, S, V, W, Y). When the number counts (n1, n2) of residues with geometric contacts for these two reduced residue types are weighted differently, with w1 = 0.015 and w2 = −0.324, the correlation coefficients for folding rates improve to R = −0.87, −0.87, and −0.87 for the two-state, multistate, and combined sets, respectively.

The resulting model lnkf = 10.192 + nα · w also predicts protein folding rates well. Here nα = (n1, n2) is the vector of number counts of geometric contact, w = (w1, w2) is the vector of weights. Results from jackknife tests show that predicted and measured folding rates are strongly correlated, with a correlation coefficient of 0.86 (Fig. 5A, see below).
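The two-letter model can be written out as a short function. This is a minimal sketch using the grouping, weights, and constant quoted above; it assumes the residues that form geometric contacts have already been identified from the native structure (that Voronoi step is not shown), and the function name and the example input string are hypothetical.

```python
# Reduced two-letter alphabet and parameters as quoted in the text.
GROUP_1 = set("ACEFMNRGHKLPT")   # A1, weight w1
GROUP_2 = set("DIQSVWY")         # A2, weight w2
A_CONST = 10.192
W1 = 0.015
W2 = -0.324


def predict_ln_kf(contact_residues: str) -> float:
    """Predict ln(kf) from one-letter codes of residues in geometric contact."""
    n1 = sum(1 for res in contact_residues if res in GROUP_1)
    n2 = sum(1 for res in contact_residues if res in GROUP_2)
    if n1 + n2 != len(contact_residues):
        raise ValueError("input contains non-standard residue codes")
    # ln kf = a + n_alpha . w, with n_alpha = (n1, n2) and w = (w1, w2)
    return A_CONST + W1 * n1 + W2 * n2


# Hypothetical example: seven residues found to form geometric contacts.
example_ln_kf = predict_ln_kf("AVLIWEK")
print(f"predicted ln kf = {example_ln_kf:.3f}")
```

Because w2 is more than 20 times larger in magnitude than w1, the prediction is dominated by the number of contacting residues from the second group.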

Figure 4. Propensity of residues for forming geometric contact. (A) Distribution of the number of native geometric contacts of 20 amino acids in the PDB select data set; (B) the propensity values of residues for forming geometric contact.

Predicting protein folding rates from sequences

As proteins are generally tightly packed, one may assume, to first approximation, that each residue of a specific type has the same probability of a geometric contact as any other residues of the same type. With this assumption, the folding rate of a protein can be determined from knowledge of its sequence and each amino acid's general ability to form geometric contact.

The geometric contacting propensity can be estimated from known protein structures. Here, we used PDB-SELECT (2002 version), a nonredundant protein structure data set containing 1670 structures with pairwise sequence identity <25% (Boberg et al. 1992). The distribution of geometric native contacts for the 20 amino acid types and the corresponding relative values are shown in Figure 4A, and the propensity values are obtained after correction for residue composition (Fig. 4B). These propensity values collectively form the 20-dimensional contact propensity vector p. We can derive the following model for correlating protein folding rates:

Figure 5. Scatter plots of the predicted and experimentally measured values of lnkf in jackknife leave-one-out tests: (A) using weighted geometric contact number; (B) using sequence information only; and (C) using chain length.

$$\ln k_f = a + \mathbf{n}_2 \cdot P(\mathbf{p} \circ \mathbf{w})$$

where n2 is the two-dimensional vector of the simple number counts of the two simplified residue types for a protein, p is the 20-dimensional geometric contacting propensity vector, w is conceptually the 20-dimensional weight vector of the different contributions of the residues, “∘” denotes the component-wise vector product, and P(p ∘ w) denotes the “projection” of the 20-vector of the component-wise product p ∘ w onto the two-dimensional space of the reduced alphabet, namely,

equation image

We can denote the projection of the component-wise vector product as ws = P(p ∘ w). It integrates both the propensity of a residue type to form geometric contact and its relative contribution to folding rate. The resulting model for predicting protein folding rates is:

$$\ln k_f = a + \mathbf{n}_2 \cdot \mathbf{w}_s$$

where lnkf is the folding rate of a protein, and n2 is the two-dimensional vector of number count of reduced residue types in the sequence of the given protein. The optimal reduced two-alphabet and values of ws are listed in Table 6.
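The sequence-only prediction amounts to counting the two reduced residue types in the sequence and taking a weighted sum. The sketch below shows only these mechanics. All numeric values in the example call (the constant a and the weights ws) are hypothetical placeholders, not the fitted values of Table 6, and the grouping passed in is likewise only an example; the function names are also assumptions.

```python
from typing import Iterable, Tuple


def reduced_counts(sequence: str, group_1: Iterable[str],
                   group_2: Iterable[str]) -> Tuple[int, int]:
    """Count residues of each reduced type, n2 = (n1, n2), in a sequence."""
    g1, g2 = set(group_1), set(group_2)
    n1 = sum(1 for aa in sequence if aa in g1)
    n2 = sum(1 for aa in sequence if aa in g2)
    if n1 + n2 != len(sequence):
        raise ValueError("sequence contains residues outside both groups")
    return n1, n2


def predict_ln_kf_from_sequence(sequence: str, group_1: Iterable[str],
                                group_2: Iterable[str], a: float,
                                w_s: Tuple[float, float]) -> float:
    """Evaluate ln kf = a + n2 . ws for a given sequence and parameters."""
    n1, n2 = reduced_counts(sequence, group_1, group_2)
    return a + n1 * w_s[0] + n2 * w_s[1]


# Hypothetical sequence and PLACEHOLDER parameters (not from Table 6).
counts = reduced_counts("MKVLAWDE", "ACEFMNRGHKLPT", "DIQSVWY")
example = predict_ln_kf_from_sequence("MKVLAWDE", "ACEFMNRGHKLPT", "DIQSVWY",
                                      a=10.0, w_s=(0.01, -0.1))
print(counts, round(example, 3))
```

Unlike the structure-based model, every residue in the sequence is counted here; the contact propensities are absorbed into the sequence weights ws.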

Table 6. Predicting protein folding rates using reduced alphabets of amino acids

We find excellent agreement between experimentally determined and predicted folding rates. The effectiveness of the model can be demonstrated in a jackknife test, in which the coefficients ws of the model are calculated with one protein omitted and the folding rate of the omitted protein is then computed. The result is shown in Figure 5B; it is significantly better (R = 0.82) than the prediction obtained using chain length (R = 0.69, Fig. 5C). As can be seen from the large amount of scattering in the right portion of Figure 5C, chain length correlates poorly with folding rate for fast folders, as the folding rates of proteins of similar length (X-axis) can differ significantly. This phenomenon is well studied in a recent theoretical work (Kachalo et al. 2006).
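A leave-one-out jackknife test of this kind can be sketched as follows. For brevity the sketch swaps in a one-variable least-squares line in place of the two-weight model, and it runs on synthetic data; the function names and the data are hypothetical, and the sketch is meant only to show the omit, refit, predict loop.

```python
import math
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def jackknife_predictions(x: List[float], y: List[float]) -> List[float]:
    """Predict each y[k] from a model fitted with point k left out."""
    predictions = []
    for k in range(len(x)):
        a, b = fit_line(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        predictions.append(a + b * x[k])
    return predictions


def pearson_r(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv / math.sqrt(sum((a - mu) ** 2 for a in u) *
                           sum((b - mv) ** 2 for b in v))


# Synthetic stand-in data (NOT the proteins of this study).
x_data = [10.0 + 2.0 * i for i in range(20)]
y_data = [12.0 - 0.2 * xi + math.sin(i) for i, xi in enumerate(x_data)]
preds = jackknife_predictions(x_data, y_data)
r_jackknife = pearson_r(preds, y_data)
print(f"jackknife R (predicted vs. measured) = {r_jackknife:.2f}")
```

Because each prediction comes from a fit that never saw the omitted point, the resulting R between predicted and measured values is a fairer estimate of predictive power than the in-sample correlation.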

The deviation of sequence weights, ws, from structural weights, w, can be thought of as an implicit correction by assuming some average structural information for specific residue types. Our results suggest that even models with two residue types can capture a significant amount of information about protein folding rates. This is reminiscent of the well-known HP model for studying protein stability and folding (Chan and Dill 1989; Ozkan et al. 2001; Kachalo et al. 2006).


We introduce here a quantity, Nα, which is a count of the number of well-packed nonlocal contacts in the native structure of a protein, where “well packed” is defined by a Voronoi criterion. The quantity Nα is highly anti-correlated with the folding rates of 80 proteins, both two-state and multistate folders. This quantity gives a better and more consistent correlation with folding rates over this broad set of proteins than several other quantities, including the RCO, the ACO, and the chain length L. For example, simple chain length does not correlate well with the folding rates of two-state folders. In addition, the chain-length correlation is not robust, as a different choice of protein samples results in a large variation in correlation (Fig. 3). The overall correlations using either RCO or ACO are not as good as that obtained from Nα. The measure Nα is not biased against shorter loops, as long as they are longer than a threshold of three residues, whereas both RCO and ACO give more weight to contacts that close long loops. We believe that the physical basis for this correlation is that proteins fold via a mechanism of zipping and assembly. Contacts among monomers that are more widely separated in the sequence are more difficult to form because their conformational search is more costly in chain entropy, and folding is likely to proceed through a local zipping mechanism (Dill et al. 1993; Fiebig and Dill 1993; Weikl and Dill 2003a,b; Weikl et al. 2004; Merlo et al. 2005).

The present work goes beyond predicting folding rates from known native structures or from known or predicted secondary structures (Ivankov and Finkelstein 2004; Gromiha 2005), and instead predicts rates from the amino acid sequences of these proteins alone. Our prediction works even when protein sequences are based on alphabets of only two residue types. Although several previous studies correlate protein folding rates well with sequences, they are based on smaller data sets and require additional structural knowledge of proteins in the form of general structural class (Kuznetsov and Rackovsky 2004; Gromiha 2005) or secondary structure information (Ivankov and Finkelstein 2004). We find that different amino acids contribute differently to folding speed. Folding is slowed most by Val, Ile, Trp, and Tyr forming geometric contacts, and accelerated most by Glu and Phe.


This work is supported by grants from the National Science Foundation (DBI-0646035) and the National Institutes of Health (GM079804-01A1 and GM081682 GM68958). We thank Dr. Ken Dill for stimulating discussions and for sharing the collected data of folding rates, and Dr. Martin Gruebele for helpful discussions.