Contact order revisited: Influence of protein size on the folding rate

Abstract

Guided by the recent success of the empirical model that predicts the folding rates of small two-state folding proteins from the relative contact order (CO) of their native structures, by a theoretical model of protein folding that predicts that the logarithm of the folding rate decreases with the protein chain length L as L^{2/3}, and by the finding that the folding rates of multistate folding proteins correlate strongly with their sizes and very poorly with CO, we reexamined the dependence of folding rate on CO and L in an attempt to find a structural parameter that determines folding rates for the totality of proteins. We show that Abs_CO = CO × L predicts folding rates rather accurately for both two-state and multistate folding proteins, as well as for short peptides, and that Abs_CO scales with the protein chain length as L^{0.70 ± 0.07} for the totality of studied single-domain proteins and peptides.

Many proteins fold and unfold by a simple two-state transition that lacks observable intermediates under any solvent conditions (Jackson 1998). Many other proteins exhibit a more complicated multistate transition; namely, they have observable folding intermediates under physiological conditions. However, the boundary between these two groups of proteins is not well defined.

It is known that some proteins can be switched from two-state to multistate folding, and vice versa, by point mutations or even by changing conditions such as the salt concentration or temperature (Jackson 1998). In addition, multistate folding is observed only far from the point of thermodynamic equilibrium between the native and denatured states, whereas, close to this point, all proteins fold without any observable intermediates (Privalov 1979; Jackson 1998; Finkelstein and Ptitsyn 2002).

Small two-state folding proteins have attracted particular attention from experimentalists and theorists. It was demonstrated that the logarithms of the in-water folding rates of these proteins correlate with a gross topological parameter called the relative contact order (CO; Plaxco et al. 1998b). The latter is defined as

$$\mathrm{CO} = \frac{1}{L \cdot N} \sum^{N} \Delta L_{ij} \tag{1}$$

where N is the number of contacts (within 6 Å) between nonhydrogen atoms in the protein, L is the length of the protein in amino acid residues, and ΔL_{ij} is the number of residues separating the interacting pair of nonhydrogen atoms (adjacent residues are counted as separated by one residue, and so on).

CO is a renormalization of the perhaps more intuitive measure, absolute contact order (Abs_CO),

$$\mathrm{Abs\_CO} = \mathrm{CO} \times L = \frac{1}{N} \sum^{N} \Delta L_{ij} \tag{2}$$

which, however, was found to be less correlated than CO with folding rates of the two-state folders (Plaxco et al. 1998b; Grantcharova et al. 2001).
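The two definitions above can be sketched in code. The following Python function is a minimal illustration, not the authors' program. It assumes that the nonhydrogen atoms are supplied as (residue_index, x, y, z) tuples, counts every atom pair from different residues lying within the 6 Å cutoff as one contact, and returns both CO (Equation 1) and Abs_CO (Equation 2). The function name, the input layout, and the choice to skip atom pairs within the same residue are assumptions made for this example.

```python
import math


def contact_orders(atoms, n_residues, cutoff=6.0):
    """Return (CO, Abs_CO) for a list of nonhydrogen atoms.

    atoms      -- list of (residue_index, x, y, z) tuples (hypothetical layout)
    n_residues -- chain length L in amino acid residues
    cutoff     -- contact distance in angstroms (6 A in the text)
    """
    total_separation = 0  # running sum of Delta L_ij over all contacts
    n_contacts = 0        # N in Equations 1 and 2
    for a in range(len(atoms)):
        res_i, *xyz_i = atoms[a]
        for b in range(a + 1, len(atoms)):
            res_j, *xyz_j = atoms[b]
            separation = abs(res_i - res_j)
            if separation == 0:
                continue  # assumption: pairs within one residue are not contacts
            if math.dist(xyz_i, xyz_j) <= cutoff:
                total_separation += separation
                n_contacts += 1
    if n_contacts == 0:
        raise ValueError("no contacts found within the cutoff")
    abs_co = total_separation / n_contacts  # Equation 2
    co = abs_co / n_residues                # Equation 1
    return co, abs_co


if __name__ == "__main__":
    # Toy three-residue chain with one atom per residue (not real protein data).
    toy = [(1, 0.0, 0.0, 0.0), (2, 3.0, 0.0, 0.0), (3, 5.0, 0.0, 0.0)]
    print(contact_orders(toy, n_residues=3))
```

Multiplying the returned CO by 100 gives the percentage form used in Table 1.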

CO was introduced to compare differences in topology (rather than in size) between proteins of different length. This parameter is small for proteins stabilized mainly by local interactions and is large when residues in a protein interact frequently with partners far away in the protein sequence. The latter should lead to slower folding (Plaxco et al. 1998b; Fersht 2000). Indeed, the negative correlation between CO and the logarithm of folding rates was found to be very strong, ∼ −0.8 (Plaxco et al. 1998b; Fersht 2000), for two-state folding proteins, and it still holds for all two-state folding proteins studied to date (Fig. 1, circles).

However, examining the whole set of proteins studied to date (Table 1), we see that CO, although it still gives good results for two-state folding proteins, fails to predict the folding rates of short peptides and large multistate folding proteins (Fig. 1). The reason seems to be that CO takes only topology into account and pays no explicit attention to protein size.

A number of basic correlations between protein size and folding rate have been suggested (Thirumalai 1995; Gutin et al. 1996; Finkelstein and Badretdinov 1997a,b). All of them stress that, as might be expected, folding rate decreases monotonically with protein size, but all indicate different scaling laws for this decrease. It should be noted that some recent simulations of folding of off-lattice protein models with simplified potentials (Koga and Takada 2001) indicate that the logarithms of protein folding rate decrease with the chain length as L^{0.61 ± 0.18}, which is in accordance with both Finkelstein and Badretdinov's (1997a,b) and Thirumalai's (1995) theories.

It has been shown, however, that protein size by itself determines the folding rates only of multistate folding proteins and fails to predict those of two-state folders (Galzitskaya et al. 2003): For multistate folders, the negative correlation between L^P (L being the number of residues in the chain and P a free parameter) and the logarithm of folding rates is as high as −0.80 over the broad range of the power P from zero to one, whereas for two-state folders any correlation between folding rate and size is virtually absent.

This study aims to develop a general parameter for predicting the folding rates of two-state folding proteins, multistate folding proteins, and small peptides. Such a general estimate, if found, would be useful for two reasons: (1) Attribution of proteins to two-state or multistate folders is somewhat arbitrary, at least for proteins that can be switched from two-state to multistate behavior by point mutations or by changing solvent conditions, and (2) it is useful to estimate the folding rate of a protein when one does not know a priori whether it is a two-state or a multistate folding protein.

Results and Discussion

The simplest way to obtain such a parameter is to take into account both the protein topology and its size, that is, to combine a length-based theory with the empirical topology effect (Plaxco et al. 1998b). Here we describe such a combination.

Specifically, the theory of Finkelstein and Badretdinov (1997a,b) predicted that, in the vicinity of the thermodynamic midtransition, the folding rates of all single-domain proteins should decrease with their length L as exp[−C × L^{2/3}], where the size-independent coefficient C = 0.5 to 1.5 depends on the topology of the protein: C is close to 0.5 when a protein is stabilized mainly by local interactions, so that the semifolded protein does not contain closed loops protruding from the folding nucleus, and C is close to 1.5 when a protein has many long-range contacts, so that many closed loops protrude from the nucleus. Later it was shown (Galzitskaya et al. 2001) that the range 1/kf = exp(0.5 L^{2/3}) × 10 ns to exp(1.5 L^{2/3}) × 10 ns is valid for all the studied peptides and single-domain proteins of a great variety of lengths, topologies, and folding behaviors.
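To make the size of this window concrete, the short Python sketch below evaluates it for a given chain length. It reads the quoted range as a folding-time window (the 10 ns prefactor has units of time), which is an interpretation made for this example, and it converts the two bounds into ln(kf) with kf in s^-1. The function and constant names are hypothetical. Because the theory refers to the midtransition point, the resulting numbers are not directly comparable with the in-water rates listed in Table 1.

```python
import math

TAU0_NS = 10.0  # prefactor quoted in the text: 10 ns


def folding_time_window_ns(length, c_min=0.5, c_max=1.5):
    """Return (fastest, slowest) folding time in ns for a chain of `length` residues.

    Implements tau = 10 ns * exp(C * L**(2/3)) for C between c_min and c_max.
    """
    x = length ** (2.0 / 3.0)
    return TAU0_NS * math.exp(c_min * x), TAU0_NS * math.exp(c_max * x)


def ln_kf_window(length):
    """Return (lowest, highest) ln(kf), with kf in s^-1, for the same window."""
    fastest_ns, slowest_ns = folding_time_window_ns(length)
    # kf = 1 / tau, with tau converted from ns to s
    lowest = -math.log(slowest_ns * 1e-9)
    highest = -math.log(fastest_ns * 1e-9)
    return lowest, highest


if __name__ == "__main__":
    for n in (20, 64, 150):
        low, high = ln_kf_window(n)
        print(f"L = {n:3d}: ln(kf) between {low:6.1f} and {high:6.1f}")
```

The width of the window in ln(kf) is simply (1.5 − 0.5) × L^{2/3}, so it grows with chain length; this is why a topology-dependent coefficient is needed in addition to size.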

Although Finkelstein and Badretdinov did not give an algorithm to compute their coefficient, C, from protein structure, it is clear that the physical meaning of C is similar to that of the CO of Plaxco et al. Both are small for proteins with local contacts (e.g., α-helical proteins), and both are large for proteins with predominantly long-range contacts, which cannot avoid having many loops in a semifolded state. Therefore, the values of C and CO should correlate.

The simplest combination of CO and L that seems to follow from the theories of Plaxco et al. and Finkelstein and Badretdinov would be CO × L^{2/3}. However, because we observe that CO is not a chain length–independent parameter (as the value C of Finkelstein and Badretdinov should be) but anticorrelates with the chain length, L (Fig. 2), for the totality of proteins and peptides, we combine CO and L in a general parameter, the “size-modified contact order” (SMCO), as

$$\mathrm{SMCO} = \mathrm{CO} \times L^{P} \tag{3}$$

One can see that P = 0 corresponds to SMCO = CO, whereas P = 1 corresponds to SMCO = Abs_CO.

The correlation between SMCO and ln(kf), as a function of the power P, is presented in the inset in Figure 3. One can see that although any P > 0.7 results in approximately the same correlation for the totality of proteins and peptides, the best correlation is achieved at P ≈ 1, that is, when SMCO ≈ Abs_CO. The correlation between Abs_CO and ln(kf) is presented in Figure 3.

It should be mentioned, however, that for the two-state folders, the best correlation between ln(kf) and SMCO is achieved when P is between 0 and 0.5 rather than at 1 (Fig. 3, inset).
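The scan over the power P can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: it uses a plain Pearson correlation, takes CO in percent as listed in Table 1, and runs on a small hand-picked subset of Table 1 rows chosen only for illustration, so its numbers will not reproduce the coefficients shown in the inset of Figure 3. The names pearson, smco, scan_power, and ROWS are hypothetical.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def smco(co_percent, length, power):
    """Size-modified contact order, SMCO = CO * L**P (Equation 3)."""
    return (co_percent / 100.0) * length ** power


def scan_power(rows, powers):
    """Correlation between SMCO and ln(kf) for each power P.

    rows -- iterable of (L, ln_kf, CO_percent) tuples
    """
    ln_kf = [row[1] for row in rows]
    return {
        p: pearson([smco(row[2], row[0], p) for row in rows], ln_kf)
        for p in powers
    }


# Illustrative subset of Table 1: (L, ln(kf), CO in %).
ROWS = [
    (21, 15.5, 10.4),   # alpha-helix
    (34, 9.5, 19.0),    # WW domain
    (86, 6.6, 14.3),    # ACBP
    (93, 0.4, 20.3),    # Twitchin
    (164, 6.6, 15.7),   # Cyclophilin A
    (151, 1.1, 8.4),    # Apomyoglobin
    (136, -3.2, 13.8),  # CRABP I
    (219, -3.5, 8.0),   # C-terminal domain from PGK
]

if __name__ == "__main__":
    for p, r in scan_power(ROWS, [0.0, 0.25, 0.5, 0.75, 1.0]).items():
        print(f"P = {p:4.2f}: r = {r:+.2f}")
```

Setting P = 0 recovers CO and P = 1 recovers Abs_CO, so the same routine covers both limiting cases discussed in the text.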

This difference between the scaling laws observed for two-state folders and for the other proteins correlates, to a certain extent, with the finding (Fig. 2) that CO is independent of the chain length for the two-state folders, whereas it decreases with the chain length, L, in proportion to L^{−0.4} for multistate folders; for the totality of proteins and peptides, CO decreases with chain length in proportion to L^{−0.30 ± 0.07} on average.

It is noteworthy that CO scales as L^{−0.30 ± 0.07} for the totality of proteins and peptides (Fig. 2, solid line). This means that the value Abs_CO = CO × L (which has the highest correlation with ln[kf] for the totality of proteins and peptides; Fig. 3, inset) scales with the chain length as L^{0.70 ± 0.07}. This is in very good agreement with the general scaling law L^{2/3} predicted by Finkelstein and Badretdinov (1997a,b; although Thirumalai's [1995] scaling law L^{0.5} correlates only a little worse with experiment and thus cannot be ruled out; Fig. 3, inset), and it agrees with the empirical scaling L^{0.61 ± 0.18} resulting from the simplified off-lattice folding simulations of Koga and Takada (2001).
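The step from the scaling of CO to the scaling of Abs_CO is a one-line consequence of the definition Abs_CO = CO × L. Written out, under the assumption that the fitted exponent and its error carry over unchanged when the exact factor L is multiplied in:

```latex
\begin{align*}
\mathrm{CO} &\propto L^{-0.30 \pm 0.07}, \\
\mathrm{Abs\_CO} = \mathrm{CO} \times L &\propto L^{-0.30 \pm 0.07} \times L^{1} = L^{0.70 \pm 0.07}.
\end{align*}
```

The exponent 2/3 ≈ 0.67 lies inside the interval from 0.63 to 0.77 spanned by 0.70 ± 0.07.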

Table 1. List of proteins and polypeptides [a]
No. | Protein | Reference | PDB | L | ln(kf) | CO, % | Abs_CO
1 | α-helix [b] | Thompson et al. 1997 | [b] | 21 | 15.5 | 10.4 | 2.2
2 | β-hairpin [c] | Munoz et al. 1997 | 1PGB | 16 | 12 | 25.8 | 4.1
3 | WW domain [d] | Jager et al. 2001 | 1PIN | 34 | 9.5 | 19.0 | 6.5
4 | E3/E1-binding domain of dihydrolipoyl acyltransferase [e] | Spector and Raleigh 1999 | 2PDD | 41 | 9.8 | 11.0 ± 0.4 | 4.5 ± 0.2
5 | ACBP | Kragelund et al. 1995 | 2ABD | 86 | 6.6 | 14.3 ± 0.3 | 12.3 ± 0.3
6 | Cytochrome b562 [f] | Wittung-Stafshede et al. 1999 | 256B | 106 | 12.2 | 7.5 | 7.9
7 | Colicin E9 immunity protein | Ferguson et al. 1999 | 1IMQ | 86 | 7.3 | 12.1 | 10.4
8 | λ-Repressor | Burton et al. 1996 | 1LMB | 80 | 8.5 | 9.4 | 7.5
9 | Fibronectin ninth FN3 module | Plaxco et al. 1997 | 1FNF | 90 | −0.9 | 18.1 | 16.3
10 | Twitchin | Clarke et al. 1999 | 1WIT | 93 | 0.4 | 20.3 | 18.9
11 | Tenascin (short form) | Clarke et al. 1997 | 1TEN | 90 (89) | 1.1 | 17.4 | 15.4
12 | SH3 domain (α-spectrin) | Viguera et al. 1996 | 1SHG | 62 (57) | 1.4 | 19.1 | 10.9
13 | SH3 domain (src) | Grantcharova and Baker 1997 | 1SRL | 64 (56) | 4 | 19.6 | 11.0
14 | SH3-domain (PI3 kinase) [g] | Guijarro et al. 1998 | 1PNJ | 90 (86) | −1.1 | 16.1 | 13.9
15 | SH3-domain (fyn) | Plaxco et al. 1998a | 1SHF | 67 (59) | 4.5 | 18.3 | 10.8
16 | Photosystem I accessory protein | P. Bowers and D. Baker, unpubl. | 1PSF | 69 | 3.2 | 17.0 | 11.7
17 | CspB (Bacillus subtilis) | Schindler et al. 1995 | 1CSP | 67 | 7.0 | 16.4 | 11.0
 | | Perl et al. 1998 | | | 6.5 | |
18 | CspB (B. caldolyticus) | Perl et al. 1998 | 1C9O | 66 | 7.2 | 7.5 | 7.9
19 | CspB (Thermatoga maritima) | Perl et al. 1998 | 1G6P | 66 | 6.3 | 17.5 ± 0.4 | 11.4 ± 0.3
20 | CspA | Reid et al. 1998 | 1MJC | 69 | 5.3 | 16.0 | 11.0
21 | Cyclophilin A | Ikura et al. 2000 | 1LOP | 164 | 6.6 | 15.7 | 25.7
22 | DNA-binding protein [h] | Guerois and Serrano 2000 | 1C8C | 63 | 7 | 12.7 | 8.0
23 | IgG binding domain of streptococcal protein L [i] | Kim et al. 2000 | 1HZ6 | 62 | 4.1 | 16.1 | 10.0
24 | Protein G | McCallister et al. 2000 | 1PGB | 57 (56) | 6 | 17.3 | 9.7
25 | FKBP12 | Main et al. 1999 | 1FKB | 107 | 1.5 | 17.7 | 18.9
26 | Ci2 | Jackson and Fersht 1991 | 2CI2 | 64 | 3.9 | 15.7 | 10.0
27 | Activation domain procarboxypeptidase A2 | Villegas et al. 1995 | 1AYE | 80 | 6.8 | 16.7 | 13.4
28 | Spliceosomal protein U1A [j] | Silow and Oliveberg 1997 | 1URN | 102 (96) | 5.8 | 16.9 | 16.2
29 | Muscle-AcP [k] | Van Nuland et al. 1998a | 1APS | 98 | −1.5 | 21.7 ± 0.6 | 21.2 ± 0.6
The columns in this table are as follows: Protein, name of protein; Reference, reference to the original article on folding and unfolding kinetics; PDB, Protein Data Bank entry (Bernstein et al. 1977); L, number of residues in the protein used in the experimental study and, in parentheses, the number of residues that have defined three-dimensional coordinates and contribute to the relative contact order (CO) calculations; ln(kf), natural logarithm of the experimental folding rate in water; CO, relative contact order in percent; and Abs_CO, absolute contact order.
a The list of single-domain proteins and peptides that lack both disulfide bonds and covalent bonds to ligands is taken from Galzitskaya et al. (2003). If the folding of a protein was investigated at different temperatures, the experiment at the temperature closest to 25°C is presented in the table; we took the slowest phase that is not considered a cis/trans proline isomerization phase in the original paper. If the three-dimensional structure of a protein whose folding was studied experimentally was absent from the PDB, but the PDB contains the structure of its mutant or a very close homolog, the latter was used in our CO calculations; this is mentioned in the corresponding footnote. If several PDB entries are available for a protein, the best refined full-length X-ray structure is used in our CO calculation; in the absence of an X-ray structure, the averaged NMR structure is used; in the absence of such, CO was averaged over all NMR models (in this case, the standard deviation is given). Nos. 1–3 indicate short peptides; 4–33, proteins with two-state folding within the whole range of experimental conditions; and 34–57, proteins with multistate folding in water.
b There is no PDB entry for the Ala-rich 21-residue α-helix studied; the ideal (Ala)21 α-helix was used in our contact order calculation.
c ln(kf) value in water refers also to the midtransition point at 24°C.
d Small WW domain consisting of one β-sheet is considered as a peptide. ln(kf) value refers to the temperature 41.7°C.
e ln(kf) value is the investigators' extrapolation of folding rate to 25°C.
f Two-state folding is assumed by long extrapolation made by investigators.
g Although the investigators of the experimental paper reported that the SH3 domain from PI3 kinase is 84 amino acids long, it was actually refolded by them with the additional two N-terminal residues and four C-terminal residues. The latter four are absent in the PDB entry.
Table 1 (continued).
No. | Protein | Reference | PDB | L | ln(kf) | CO, % | Abs_CO
30 | S6 | Otzen and Oliveberg 1999 | 1RIS | 101 (97) | 5.9 | 18.9 | 18.4
31 | His-containing phosphocarrier protein | Van Nuland et al. 1998b | 1POH | 85 | 2.7 | 17.6 | 15.0
32 | N-terminal domain from L9 | Kuhlman et al. 1998 | 1DIV | 56 | 6.1 | 12.7 | 7.1
33 | Villin 14T | Choe et al. 1998 | 2VIK | 126 | 6.8 | 12.3 | 15.4
34 | Apomyoglobin [l] | Cavagnero et al. 1999 | 1A6N | 151 | 1.1 | 8.4 | 12.7
35 | Colicin E7 immunity protein | Ferguson et al. 1999 | 1CEI | 87 (85) | 5.8 | 10.8 | 9.2
36 | Cro protein | Laurents et al. 2000 | 2CRO | 71 (65) | 3.7 | 11.2 | 7.3
37 | P16 protein | Tang et al. 1999 | 2A5E | 156 | 3.5 | 5.3 | 8.3
38 | Twitching Ig repeat 27 | Fowler and Clarke 2001 | 1TIT | 89 | 3.6 | 17.8 | 15.8
39 | CD2, 1st domain | Parker et al. 1997 | 1HNG | 98 (95) | 1.8 | 16.9 | 16.0
40 | Fibronectin tenth FN3 module | Cota and Clarke 2000 | 1FNF | 94 | 5.5 | 16.5 | 15.5
41 | IFABP from rat | Burns et al. 1998 | 1IFC | 131 | 3.4 | 13.5 | 17.7
42 | ILBP [m] | Dalessio and Ropson 2000 | 1EAL | 127 | 1.3 | 12.3 ± 0.5 | 15.7 ± 0.6
43 | CRBP II | Burns et al. 1998 | 1OPA | 133 | 1.4 | 14.0 | 18.7
44 | CRABP I | Burns et al. 1998 | 1CBI | 136 | −3.2 | 13.8 | 18.8
45 | Tryptophan synthase α-subunit [n] | Ogasahara and Yutani 1994 | 1QOP | 268 (267) | −2.5 | 8.3 | 22.3
46 | GroEL apical domain (191–345) | Golbik et al. 1998 | 1AON | 155 | 0.8 | 13.7 | 21.2
47 | Barstar [o] | Schreiber and Fersht 1993 | 1BRS | 89 | 3.4 | 11.8 | 10.5
48 | Che Y | Munoz et al. 1994 | 3CHY | 129 (128) | 1 | 8.7 | 11.2
49 | Ribonuclease HI [p] | Parker and Marqusee 1999 | 2RN2 | 155 | 0.1 | 12.4 | 19.3
50 | DHFR (dihydrofolate reductase) [q] | Jennings et al. 1993 | 1RA9 | 159 | 4.6 | 14.0 | 22.3
51 | Tryptophan synthase β2-subunit [n] | Goldberg et al. 1990 | 1QOP | 396 (390) | −6.9 | 8.3 | 32.5
52 | N-terminal domain from PGK | Parker et al. 1995 | 1PHP | 175 | 2.3 | 11.5 | 20.2
53 | C-terminal domain from PGK [r] | Parker et al. 1996 | 1PHP | 219 | −3.5 | 8.0 | 17.4
54 | Barnase | Matouschek et al. 1990 | 1BNI | 110 (108) | 2.6 | 11.4 | 12.3
55 | T4 lysozyme [s] | Parker and Marqusee 1999 | 2LZM | 164 | 4.1 | 7.1 | 11.6
56 | Ubiquitin [t] | Khorasanizadeh et al. 1996 | 1UBQ | 76 | 5.9 | 15.1 | 11.5
57 | Suc 1 [u] | Schymkowitz et al. 2000 | 1SCE | 113 (101) | 4.2 | 11.8 | 11.9
h The folding of mutant protein Y34W was studied experimentally; we used the available PDB structure of wild type in our calculation of CO.
i The folding of mutant protein Y47W was studied experimentally; we used the available PDB structure of this mutant in our calculation of CO.
j The folding of mutant protein F56W was studied experimentally; we used the available PDB structure of mutant Y31H/Q36R in our calculation of CO.
k The folding of mutant protein C21S was studied experimentally; we used the available PDB structure of wild type protein in our calculation of CO.
l We used the available PDB structure of a holoform of myoglobin (but without heme) in our calculation of CO.
m We used the available PDB structure of mutant protein T118S from pig in our calculation of CO instead of the wild type protein from rat.
n The folding of protein from Escherichia coli was studied experimentally. We used the available PDB structure of the same protein from Salmonella typhimurium in our calculation of CO.
o The folding of mutant protein C40A/C82A was studied experimentally; we used the available PDB structure of this mutant in our calculation of CO.
p The folding of mutant protein C13A/C63A/C133A was studied experimentally; we used the available PDB structure of wild type protein in our calculation of CO.
q The folding of wild type protein was studied. We used the available PDB structure of mutant protein N37D in our calculation of CO. ln(kf) value refers to the summary rate of two parallel pathways of refolding of DHFR.
r The folding of mutant protein W290Y was studied experimentally. We used the available PDB structure of wild type in our calculation of CO.
s The folding of Cys-free mutant was studied experimentally. We used the available PDB structure of wild-type protein in our calculation of CO.
t The folding of bovine protein F45W mutant was studied experimentally. We used the available PDB structure of WT human protein in our calculation of CO.
u There is only a strand-exchanged form of suc1 dimer in PDB. We used a concatenation of fragment 2–88 of chain C and fragment 89–102 of chain A as a tentative structure of monomeric protein in our calculation of CO.
Figure 1.

Natural logarithm of observed folding rate in water, ln(kf), versus relative contact order (CO) for various proteins and peptides: proteins having two-state folding kinetics at all denaturant concentrations (circles), proteins having multistate folding kinetics in water (and at low denaturant concentrations; triangles), and short peptides (crosses). The figure includes the peptides and proteins listed in Table 1; CO is computed according to Equation 1 from the PDB coordinates (Bernstein et al. 1977). If several folding rates are observed for a protein (see Table 1), ln(kf) is the mean value of their natural logarithms. The dashed line represents the best linear fit for two-state folders only (the negative correlation coefficient is as significant as −0.75; the fitted dependence is y = 16.94 − 0.76x); the dotted line represents the best linear fit for multistate folders only (the correlation coefficient is +0.26; namely, it has the opposite sign compared with that for the two-state folders; y = −1.55 + 0.26x); the solid line represents the best linear fit for the totality of all peptides and proteins presented (the correlation coefficient is insignificant, only +0.10; y = 2.37 + 0.10x).

Figure 2.

Logarithm of relative contact order versus logarithm of chain length. See legend to Figure 1 for specification of the symbols and other details. The dashed line represents the best linear fit for two-state folders only (the correlation coefficient is 0.02; y = 2.68 + 0.01x); the dotted line represents the best linear fit for multistate folders only (the correlation coefficient is −0.54; y = 4.41 − 0.40x); the solid line represents the best linear fit for the totality of all peptides and proteins (the correlation coefficient is −0.50; y = 3.95 − 0.30x). The linear regression coefficients 0.01, −0.40, and −0.30 are determined with errors ±0.16, ±0.13, and ±0.07, respectively.

Figure 3.

Logarithm of observed folding rate in water, ln(kf), versus Abs_CO = CO × L. See legend to Figure 1 for specification of the symbols and other details. The dashed line represents the best linear fit for two-state folders only (the fitted dependence is y = 9.44 − 0.36x; the correlation coefficient is −0.51); the dotted line represents the best linear fit for multistate folders only (the fitted dependence is y = 8.56 − 0.44x; the correlation coefficient is −0.78); the solid line represents the best linear fit for the totality of all peptides and proteins presented (the fitted dependence is y = 11.15 − 0.54x; the correlation coefficient is −0.74). (Inset) Correlation coefficients between the logarithm of experimental folding rate in water and the value of CO × L^P depending on the value of the power P. Error bars, standard errors in correlation coefficients. The curve “2-STATE” concerns the two-state folding proteins only, and the curve “ALL” concerns the totality of all studied peptides and proteins; the curve for the multistate folding proteins is not shown because it is close to the curve “ALL” up to the standard error in correlation coefficients.
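As an illustration of how the fitted lines in this legend could be used, the Python sketch below turns a CO value and a chain length into a rough ln(kf) estimate. It is an example under stated assumptions rather than a validated predictor: it assumes that x in the fitted dependences is Abs_CO in the units of Table 1 (CO as a fraction times L) and that y is ln(kf). The names FITS and predict_ln_kf are hypothetical.

```python
# Intercept and slope of the linear fits quoted in the Figure 3 legend.
FITS = {
    "all": (11.15, -0.54),        # totality of peptides and proteins
    "two_state": (9.44, -0.36),   # two-state folders only
    "multistate": (8.56, -0.44),  # multistate folders only
}


def predict_ln_kf(co_percent, length, fit="all"):
    """Estimate ln(kf) from CO (in percent) and chain length L.

    Uses y = intercept + slope * Abs_CO with Abs_CO = (CO / 100) * L.
    """
    intercept, slope = FITS[fit]
    abs_co = co_percent / 100.0 * length
    return intercept + slope * abs_co


if __name__ == "__main__":
    # ACBP from Table 1: CO = 14.3 %, L = 86, observed ln(kf) = 6.6.
    print(f"estimated ln(kf) for ACBP: {predict_ln_kf(14.3, 86):.1f}")
```

With a correlation coefficient of −0.74 for the full data set, individual estimates can differ from the measured values by several units of ln(kf), so the output is best read as an order-of-magnitude guide.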

Acknowledgements

We are grateful to Blake Gillespie and Oxana Galzitskaya for discussions and some computations, and to David Thirumalai for discussions and his results on correlation of ln(kf) with CO × L^{1/2}. This work was supported in part by the Russian Foundation for Basic Research, by an International Research Scholar's Award to A.V.F. from the Howard Hughes Medical Institute, and by the Institute of Theoretical Physics (Santa Barbara University, ITP work no. NSF-ITP-01-173).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
