Guided by the recent success of an empirical model that predicts the folding rates of small two-state folding proteins from the relative contact order (CO) of their native structures, by a theoretical model of protein folding that predicts that the logarithm of the folding rate decreases with the protein chain length L as $L^{2/3}$, and by the finding that the folding rates of multistate folding proteins correlate strongly with their sizes and very poorly with CO, we reexamined the dependence of folding rate on CO and L in an attempt to find a structural parameter that determines folding rates for the totality of proteins. We show that Abs_CO = CO × L predicts folding rates rather accurately for both two-state and multistate folding proteins, as well as for short peptides, and that Abs_CO scales with the protein chain length as $L^{0.70 \pm 0.07}$ for the totality of studied single-domain proteins and peptides.
Many proteins fold and unfold by a simple two-state transition that lacks observable intermediates under any solvent conditions (Jackson 1998). Many other proteins exhibit a more complicated multistate transition; namely, they have observable folding intermediates under physiological conditions. However, the boundary between these two groups of proteins is not well defined.
It is known that some proteins can be switched from two-state to multistate folding, and vice versa, by point mutations or even by changing conditions such as the salt concentration or temperature (Jackson 1998). In addition, multistate folding is observed only far from the point of thermodynamic equilibrium between the native and denatured states, whereas, close to this point, all proteins fold without any observable intermediates (Privalov 1979; Jackson 1998; Finkelstein and Ptitsyn 2002).
Small two-state folding proteins have attracted particular attention from experimentalists and theorists. It was demonstrated that the logarithms of the in-water folding rates of these proteins correlate with a gross topological parameter called relative contact order (CO; Plaxco et al. 1998b). The latter is defined as
$$\mathrm{CO} = \frac{1}{L \cdot N} \sum^{N} \Delta L_{ij},$$
where N is the number of contacts (within 6 Å) between nonhydrogen atoms in the protein, L is the length of the protein in amino acid residues, and $\Delta L_{ij}$ is the number of residues separating the interacting pair of nonhydrogen atoms (adjacent residues are assumed to be separated by one residue, etc.).
CO is a renormalization of the perhaps more intuitive measure, absolute contact order (Abs_CO),
$$\mathrm{Abs\_CO} = \frac{1}{N} \sum^{N} \Delta L_{ij} = \mathrm{CO} \times L,$$
which, however, was found to be less correlated than CO with the folding rates of two-state folders (Plaxco et al. 1998b; Grantcharova et al. 2001).
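The two measures can be illustrated with a minimal sketch, reading CO as the mean sequence separation of contacting nonhydrogen atom pairs divided by L, and Abs_CO as the same mean without that division (so that Abs_CO = CO × L). The function name, the input layout, the treatment of L as the number of distinct residue indices, and the exclusion of atom pairs within the same residue are assumptions made for this illustration; they are not taken from the original calculation.

```python
import math


def contact_orders(atoms, cutoff=6.0):
    """Return (CO, Abs_CO) for a list of (residue_index, x, y, z) tuples.

    Each tuple describes one nonhydrogen atom. Assumptions of this sketch:
    L is the number of distinct residue indices, and atom pairs inside the
    same residue (sequence separation 0) are not counted as contacts.
    """
    length = len({atom[0] for atom in atoms})
    n_contacts = 0
    total_separation = 0
    for i in range(len(atoms)):
        res_i, *xyz_i = atoms[i]
        for j in range(i + 1, len(atoms)):
            res_j, *xyz_j = atoms[j]
            separation = abs(res_i - res_j)
            if separation == 0:
                continue  # same residue: skipped by assumption
            if math.dist(xyz_i, xyz_j) <= cutoff:
                n_contacts += 1
                total_separation += separation
    if n_contacts == 0:
        raise ValueError("no contacts found within the cutoff")
    abs_co = total_separation / n_contacts  # mean sequence separation
    co = abs_co / length                    # normalized by chain length
    return co, abs_co
```

In practice the atom list would be read from a PDB entry; the quadratic loop is kept for clarity rather than speed.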
CO was introduced to compare differences in topology (rather than in size) between proteins of different lengths. This parameter is small for proteins stabilized mainly by local interactions and is large when residues in a protein interact frequently with partners far away in the protein sequence. The latter should lead to slower folding (Plaxco et al. 1998b; Fersht 2000). Indeed, the negative correlation between CO and the logarithm of the folding rate was found to be very strong, ∼ −0.8 (Plaxco et al. 1998b; Fersht 2000), for two-state folding proteins, and it holds for all two-state folding proteins studied to date (Fig. 1, circles).
However, examining the whole set of proteins studied to date (Table 1), we see that CO, although it still gives good results for two-state folding proteins, fails to predict the folding rates of short peptides and large multistate folding proteins (Fig. 1). The reason seems to be that CO takes into account topology only and pays no explicit attention to protein size.
A number of basic correlations between protein size and folding rate have been suggested (Thirumalai 1995; Gutin et al. 1996; Finkelstein and Badretdinov 1997a,b). All of them stress that, as might be expected, folding rate decreases monotonically with protein size, but all indicate different scaling laws for this decrease. It should be noted that some recent simulations of folding of off-lattice protein models with simplified potentials (Koga and Takada 2001) indicate that the logarithm of the protein folding rate decreases with the chain length as $L^{0.61 \pm 0.18}$, which is in accordance with both Finkelstein and Badretdinov's (1997a,b) and Thirumalai's (1995) theories.
It has been shown, however, that protein size by itself determines the folding rates of multistate folding proteins only and fails to predict those of two-state folders (Galzitskaya et al. 2003): For multistate folders, the negative correlation between $L^{P}$ (L being the number of residues in the chain and P a free parameter) and the logarithm of the folding rate is as high as −0.80 over the broad range of power P from zero to one, whereas for two-state folders any correlation between folding rate and size is virtually absent.
This study aims to develop a general parameter for predicting the folding rates of two-state folding proteins, multistate folding proteins, and small peptides. Such a general estimate, if found, would be useful for two reasons: (1) Attribution of proteins to two-state or multistate folders is somewhat arbitrary, at least for proteins that can be switched from two-state to multistate behavior by point mutations or by changing solvent conditions, and (2) it is useful to estimate the folding rate of a protein when one does not know a priori whether it is a two-state or a multistate folding protein.
Results and Discussion
The simplest way to obtain such a parameter is to take into account both the protein topology and its size, that is, to combine a length-based theory with the empirical topology effect (Plaxco et al. 1998b). Here we describe such a combination.
Specifically, the theory of Finkelstein and Badretdinov (1997a,b) predicted that, in the vicinity of the thermodynamic midtransition, the folding rates of all single-domain proteins should decrease with their length L as $\exp(-C L^{2/3})$, where the size-independent coefficient C, ranging from 0.5 to 1.5, depends on the topology of the protein: C is close to 0.5 when a protein is stabilized mainly by local interactions, so that the semifolded protein does not contain closed loops protruding from the folding nucleus, and C is close to 1.5 when a protein has many long-range contacts, so that many closed loops protrude from the nucleus. Later it was shown (Galzitskaya et al. 2001) that the range of folding times $1/k_f$ from $\exp(0.5 L^{2/3}) \times 10$ ns to $\exp(1.5 L^{2/3}) \times 10$ ns is valid for all the studied peptides and single-domain proteins of a great variety of lengths, topologies, and folding behaviors.
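To give a feeling for the width of this range, the bounds can be evaluated for a given chain length. The sketch below reads the quoted range as a folding time (it carries units of nanoseconds) with a 10 ns prefactor; the function and constant names are hypothetical and introduced only for this illustration.

```python
import math

TAU0_NS = 10.0  # prefactor of 10 ns quoted in the text


def folding_time_range_ns(length):
    """Return (fastest, slowest) folding times in ns for a chain of `length` residues.

    Uses tau = TAU0_NS * exp(C * L**(2/3)) with C = 0.5 (mostly local
    contacts) and C = 1.5 (many long-range contacts).
    """
    scale = length ** (2.0 / 3.0)
    fastest = TAU0_NS * math.exp(0.5 * scale)
    slowest = TAU0_NS * math.exp(1.5 * scale)
    return fastest, slowest


def ln_kf_range(length):
    """Return (max, min) of ln(k_f), with k_f in 1/s, for the same range."""
    fastest, slowest = folding_time_range_ns(length)
    return -math.log(fastest * 1e-9), -math.log(slowest * 1e-9)
```

For any L the two bounds on ln(k_f) differ by exactly $L^{2/3}$, so the window that topology has to account for widens as the chain grows.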
Although Finkelstein and Badretdinov did not give an algorithm to compute their coefficient C from protein structure, it is clear that the physical meaning of C is similar to that of the CO of Plaxco et al. Both are small for proteins with mainly local contacts (e.g., α-helical proteins), and both are large for proteins with predominantly long-range contacts, which cannot avoid having many loops in a semifolded state. Therefore, the values of C and CO should correlate.
The simplest combination of CO and L that seems to follow from the theories of Plaxco et al. and of Finkelstein and Badretdinov would be CO × $L^{2/3}$. However, because we observe that CO is not independent of chain length (as the value C of Finkelstein and Badretdinov should be) but anticorrelates with the chain length L (Fig. 2) for the totality of proteins and peptides, we combine CO and L in a general parameter, the “size-modified contact order” (SMCO), as
$$\mathrm{SMCO} = \mathrm{CO} \times L^{P}.$$
One can see that P = 0 corresponds to SMCO = CO, whereas P = 1 corresponds to SMCO = Abs_CO.
The correlation between SMCO and $\ln(k_f)$ as a function of the power P is presented in the inset of Figure 3. One can see that although any P > 0.7 results in approximately the same correlation for the totality of proteins and peptides, the best correlation is achieved at P ≈ 1, that is, when SMCO ≈ Abs_CO. The correlation between Abs_CO and $\ln(k_f)$ is presented in Figure 3.
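A scan of this kind can be sketched as follows, assuming the form SMCO = CO × L^P, which reproduces the limits P = 0 (CO) and P = 1 (Abs_CO) stated above. The function names are hypothetical, and the input lists stand for per-protein values of CO, L, and ln(k_f) such as those collected in Table 1; no values from the table are reproduced here.

```python
import math


def smco(co, length, power):
    """Size-modified contact order, assumed here to be CO * L**P."""
    return co * length ** power


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def scan_power(cos, lengths, ln_kfs, powers):
    """Return a list of (P, r), where r is the correlation of SMCO(P) with ln(k_f)."""
    results = []
    for power in powers:
        values = [smco(co, n, power) for co, n in zip(cos, lengths)]
        results.append((power, pearson(values, ln_kfs)))
    return results
```

Calling `scan_power` with a grid such as `[0.0, 0.1, ..., 1.5]` and picking the most negative r gives the optimal P for a chosen subset of proteins, which is how curves like those in the inset of Figure 3 could be regenerated.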
It should be mentioned, however, that for the two-state folders, the best correlation between $\ln(k_f)$ and SMCO is achieved when P is between 0 and 0.5 rather than at P = 1 (Fig. 3, inset).
This difference between the scaling laws observed for two-state folders and for the other proteins correlates, to a certain extent, with the finding (Fig. 2) that CO is independent of the chain length for the two-state folders, whereas it decreases with the chain length L in proportion to $L^{-0.4}$ for multistate folders; for the totality of proteins and peptides, CO decreases with the chain length L in proportion to $L^{-0.30 \pm 0.07}$ on average.
It is noteworthy that CO scales precisely as $L^{-0.30 \pm 0.07}$ for the totality of proteins and peptides (Fig. 2, dashed line). This means that the value Abs_CO = CO × L (which has the highest correlation with $\ln(k_f)$ for the totality of proteins and peptides; Fig. 3, inset) scales with the chain length as $L^{0.70 \pm 0.07}$. This is in very good agreement with the general scaling law $L^{2/3}$ predicted by Finkelstein and Badretdinov (1997a,b; although Thirumalai's scaling law $L^{0.5}$ correlates only slightly worse with experiment and thus cannot be ruled out; Fig. 3, inset), and agrees with the empirical scaling $L^{0.61 \pm 0.18}$ resulting from the simplified off-lattice folding simulations of Koga and Takada (2001).
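The step from the exponent of CO to the exponent of Abs_CO is simple arithmetic on a log-log plot: multiplying by L adds one to the slope, so −0.30 becomes 0.70 with the same uncertainty. A minimal sketch of such a fit is given below; it uses an ordinary least-squares line through (ln L, ln CO), which is an assumption of this illustration and not necessarily the fitting procedure behind Figure 2, and the function name is hypothetical.

```python
import math


def loglog_slope(lengths, values):
    """Least-squares slope of ln(values) against ln(lengths).

    For a power law values ~ L**a this returns an estimate of the exponent a.
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def abs_co_exponent(co_exponent):
    """Exponent of Abs_CO = CO * L given the exponent of CO."""
    return co_exponent + 1.0
```

Fitting CO and Abs_CO separately against L should therefore give slopes that differ by exactly one, which is a quick consistency check on any reimplementation.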
Table 1. List of proteins and polypeptides a
The columns in this table are as follows: Protein, name of the protein; Ref, reference to the original article on folding and unfolding kinetics; PDB, Protein Data Bank entry (Bernstein et al. 1977); L, number of residues in the protein used in the experimental study, and (in parentheses) the number of residues that have defined three-dimensional coordinates and contribute to the relative contact order (CO) calculations; $\ln(k_f)$, natural logarithm of the experimental folding rate in water; and Abs_CO, absolute contact order.
a The list of single-domain proteins and peptides that lack both disulfide bonds and covalent bonds to ligands is taken from Galzitskaya et al. (2003). If the folding of a protein was investigated at different temperatures, the experiment at the temperature closest to 25°C is presented in the table; we took the slowest phase that is not considered a cis/trans proline isomerization phase in the original paper. If the three-dimensional structure of a protein whose folding was studied experimentally was absent from the PDB, but the PDB contains the structure of its mutant or of a very close homolog, the latter was used in our CO calculations; this is mentioned in a corresponding footnote. If several PDB entries are available for a protein, the best refined full-length X-ray structure is used in our CO calculation; in the absence of an X-ray structure, the averaged NMR structure is used; in the absence of such, CO was averaged over all NMR models (in this case, the standard deviation is given). Nos. 1–3 indicate short peptides; 4–33, proteins with two-state folding within the whole range of experimental conditions; and 34–57, proteins with multistate folding in water.
b There is no PDB entry for the Ala-rich 21-residue α-helix studied; the ideal (Ala)21 α-helix was used in our contact order calculation.
c The $\ln(k_f)$ value in water also refers to the midtransition point at 24°C.
d The small WW domain, consisting of one β-sheet, is treated as a peptide. The $\ln(k_f)$ value refers to a temperature of 41.7°C.
e The $\ln(k_f)$ value is the investigators' extrapolation of the folding rate to 25°C.
f Two-state folding is assumed on the basis of a long extrapolation made by the investigators.
g Although the investigators of the experimental paper reported that the SH3 domain from PI3 kinase is 84 amino acids long, it was actually refolded by them with the additional two N-terminal residues and four C-terminal residues. The latter four are absent in the PDB entry.
h The folding of mutant protein Y34W was studied experimentally; we used the available PDB structure of wild type in our calculation of CO.
i The folding of mutant protein Y47W was studied experimentally; we used the available PDB structure of this mutant in our calculation of CO.
j The folding of mutant protein F56W was studied experimentally; we used the available PDB structure of mutant Y31H/Q36R in our calculation of CO.
k The folding of mutant protein C21S was studied experimentally; we used the available PDB structure of wild type protein in our calculation of CO.
l We used the available PDB structure of a holoform of myoglobin (but without heme) in our calculation of CO.
m We used the available PDB structure of mutant protein T118S from pig in our calculation of CO instead of the wild-type protein from rat.
n The folding of protein from Escherichia coli was studied experimentally. We used the available PDB structure of the same protein from Salmonella typhimurium in our calculation of CO.
o The folding of mutant protein C40A/C82A was studied experimentally; we used the available PDB structure of this mutant in our calculation of CO.
p The folding of mutant protein C13A/C63A/C133A was studied experimentally; we used the available PDB structure of wild type protein in our calculation of CO.
q The folding of the wild-type protein was studied. We used the available PDB structure of mutant protein N37D in our calculation of CO. The $\ln(k_f)$ value refers to the summed rate of the two parallel refolding pathways of DHFR.
r The folding of mutant protein W290Y was studied experimentally. We used the available PDB structure of wild type in our calculation of CO.
s The folding of Cys-free mutant was studied experimentally. We used the available PDB structure of wild-type protein in our calculation of CO.
t The folding of bovine protein F45W mutant was studied experimentally. We used the available PDB structure of WT human protein in our calculation of CO.
u There is only a strand-exchanged form of suc1 dimer in PDB. We used a concatenation of fragment 2–88 of chain C and fragment 89–102 of chain A as a tentative structure of monomeric protein in our calculation of CO.
We are grateful to Blake Gillespie and Oxana Galzitskaya for discussions and some computations, and to David Thirumalai for discussions and his results on the correlation of $\ln(k_f)$ with CO × $L^{1/2}$. This work was supported in part by the Russian Foundation for Basic Research, by an International Research Scholar's Award to A.V.F. from the Howard Hughes Medical Institute, and by the Institute of Theoretical Physics (Santa Barbara University, ITP work no. NSF-ITP-01-173).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.