Keywords:

  • Native state ensemble;
  • threading and fold-recognition;
  • protein structure prediction;
  • residue thermodynamics;
  • protein stability

Abstract

  1. Top of page
  2. Abstract
  3. Results
  4. Discussion
  5. Conclusion
  6. Materials and methods
  7. Electronic supplemental material
  8. Acknowledgements
  9. References
  10. Supporting Information

An amino acid sequence, in the context of the solvent environment, contains all of the thermodynamic information necessary to encode a three-dimensional protein structure. To investigate the relationship between an amino acid sequence and its corresponding protein fold, a database of thermodynamic stability information was assembled that spanned 2951 residues from 44 nonhomologous proteins. This information was obtained using the COREX algorithm, which computes an ensemble-based description of the native state of a protein. It was observed that amino acid types partitioned unequally into high, medium, and low thermodynamic stability environments. Furthermore, these distributions were reproducible and were significantly different than those expected from random partitioning. To assess the structural importance of the distributions, simple fold-recognition experiments were performed based on a 3D-1D scoring matrix containing only COREX residue stability information. This procedure was able to recover amino acid sequences corresponding to correct target structures more effectively than scoring matrices derived from randomized data. High-scoring sequences were often aligned correctly with their corresponding target profiles, suggesting that calculated thermodynamic stability profiles have the potential to encode sequence information. As a control, identical fold-recognition experiments were performed on the same database of proteins using DSSP secondary structure information in the scoring matrix, instead of COREX residue stability information. The comparable performance of both approaches suggested that COREX residue stability information and secondary structure information could be of equivalent utility in more sophisticated fold-recognition techniques. The results of this work are a consequence of the idea that amino acid sequences fold not into single, rigidly stable structures but rather into thermodynamic ensembles best represented by a time-averaged structure.

It is a longstanding idea that protein structures are the result of an amino acid chain finding its global free energy minimum in the solvent environment (Anfinsen 1973). Several exceptions to this so-called “thermodynamic control” have been discovered in recent years, including examples of proteins whose folding may be under “kinetic control” (Baker et al. 1992; Cohen 1999) and proteins requiring information not completely contained in the amino acid sequence (e.g., chaperone-assisted folding [Feldman and Frydman 2000; Fink 1999]). Although thermodynamic control is widely accepted as the default behavior for correct folding (Jackson 1998), a detailed understanding of the forces involved in thermodynamic control and how atomic interactions relate amino acid sequence to the folding and stability of the native structure has still proven elusive.

In this work, a computational method is used to address the issue of how sequence specifies structure in the energetic minimum around the native state. This is accomplished by an approach that determines the stability differences within a protein structure using the COREX algorithm (Hilser and Freire 1996). The COREX algorithm generates an ensemble of states using the high-resolution structure as a template. Based on the relative probability of the different states in the ensemble, different regions of the protein are found to be more stable than others. Thus, the COREX algorithm provides access to residue-specific free energies of folding. Perhaps the most important feature of the COREX algorithm is that the results have been verified by comparison with protection factors obtained from native state hydrogen exchange experiments. The good agreement between the calculated and experimental protection factors suggests that the calculated native state ensemble provides a reasonable representation of the actual native state ensemble (Hilser and Freire 1996; Hilser et al. 1998).

The goal of this work is to determine whether residue types have distinct preferences for thermodynamic environments in the folded native structure of a protein, and whether a scoring matrix based solely on thermodynamic information (independent of explicit structural constraints) can be used to identify correct sequences that correspond to a particular target fold. The strategy is to calculate the regional stabilities for a large database of COREX-analyzed proteins, and to determine whether amino acid types are distributed uniformly across the different thermodynamic environments. The strength of this approach is that it ignores traditional amino acid properties such as hydrophobicity and secondary structural propensity, and focuses only on the thermodynamics of the microscopic states in the native state ensemble. The results are validated using previously described fold-recognition profiling methods (Gribskov et al. 1987; Bowie et al. 1991). Control data sets of randomized data and secondary structure information are used to assess the significance of the fold-recognition results.

Results

COREX residue stability constants

A database of 44 nonhomologous proteins (Table 1) was analyzed using the COREX algorithm. Briefly, COREX generates an ensemble of partially unfolded microstates using the high-resolution structure of each protein as a template (Hilser and Freire 1996). This is facilitated by combinatorially unfolding a predefined set of folding units (i.e., residues 1–5 are in the first folding unit, residues 6–10 are in the second folding unit, etc.). By means of an incremental shift in the boundaries of the folding units, an exhaustive enumeration of the partially unfolded species is achieved for a given folding unit size. The entire procedure is shown schematically in Figure 1A for ovomucoid third domain (OM3), one of the proteins in the database (PDB accession code 2ovo).
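The combinatorial enumeration described above can be sketched as follows. This is a minimal illustration and not the COREX implementation: the function names, the treatment of short units at the chain ends, and the toy chain length are assumptions, and no energies are computed.

```python
from itertools import product


def folding_units(n_residues, window, offset):
    """Split residues 0..n-1 into contiguous folding units.

    `offset` shifts the unit boundaries; short units at either end of the
    chain are kept as they are (an assumption of this sketch).
    """
    first_boundary = offset if offset > 0 else window
    bounds = [0, *range(first_boundary, n_residues, window), n_residues]
    return [range(a, b) for a, b in zip(bounds[:-1], bounds[1:])]


def enumerate_microstates(n_residues, window):
    """Yield one tuple of booleans per microstate (True = residue folded).

    Every unit is folded or unfolded in all combinations, and the procedure
    is repeated for each incremental shift of the unit boundaries. States
    shared by several partitions (for example, the fully folded state) are
    yielded more than once; wrap the result in set() to remove duplicates.
    """
    for offset in range(window):
        units = folding_units(n_residues, window, offset)
        for pattern in product((True, False), repeat=len(units)):
            state = [False] * n_residues
            for unit, folded in zip(units, pattern):
                for residue in unit:
                    state[residue] = folded
            yield tuple(state)


if __name__ == "__main__":
    states = set(enumerate_microstates(n_residues=12, window=5))
    print(len(states), "distinct microstates for a toy 12-residue chain")
```

The number of states per partition grows as 2 to the power of the number of folding units, which is why an exhaustive calculation becomes expensive for longer chains.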

For each microstate i in the ensemble, the Gibbs free energy is calculated from the surface area-based parameterization described previously (D'Aquino et al. 1996; Gomez et al. 1995; Xie and Freire 1994; Baldwin 1986; Lee et al. 1994; Habermann and Murphy 1996 [see also Fig. 1B and Materials and Methods]). The Boltzmann weight of each microstate, Ki = exp(−ΔGi/RT), is used to calculate its probability:

  • $$P_i = \frac{K_i}{\sum_i K_i} \qquad (1)$$

in which the summation in the denominator is over all microstates. From the probabilities calculated in equation 1, an important statistical descriptor of the equilibrium can be evaluated for each residue in the protein. Defined as the residue stability constant, κf,j, this quantity is the ratio of the summed probability of all states in the ensemble in which a particular residue, j, is in a folded conformation (ΣPf,j) to the summed probability of all states in which j is in an unfolded (i.e., non-folded) conformation (ΣPnf,j):

  • $$\kappa_{f,j} = \frac{\sum P_{f,j}}{\sum P_{nf,j}} \qquad (2)$$

From the stability constant, a residue-specific free energy can be written as:

  • $$\Delta G_{f,j} = -RT \ln \kappa_{f,j} \qquad (3)$$

Equation 3 reflects the energy difference between all microscopic states in which a particular residue is folded and all such states in which it is unfolded.
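As an illustration of how the Boltzmann weights, state probabilities, and residue stability constants fit together, the following sketch works through a toy three-state ensemble. The function names, the toy energies, and the value of the gas constant are assumptions made for the example; the residue free energy is taken here as −RT ln κf,j.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K); assumed value, not from the text


def microstate_probabilities(delta_g, temperature=298.15):
    """Probability of each microstate from its Gibbs energy (kcal/mol)."""
    rt = R_KCAL * temperature
    weights = [math.exp(-dg / rt) for dg in delta_g]  # K_i = exp(-dG_i / RT)
    total = sum(weights)
    return [w / total for w in weights]


def residue_stability(states, probabilities, temperature=298.15):
    """Return (kappa_f, dG_f) for every residue position.

    `states` holds one tuple of booleans per microstate (True = folded).
    The ensemble must contain at least one state in which each residue is
    unfolded, otherwise the ratio is undefined.
    """
    rt = R_KCAL * temperature
    results = []
    for j in range(len(states[0])):
        p_folded = sum(p for s, p in zip(states, probabilities) if s[j])
        p_unfolded = sum(p for s, p in zip(states, probabilities) if not s[j])
        kappa = p_folded / p_unfolded
        results.append((kappa, -rt * math.log(kappa)))
    return results


if __name__ == "__main__":
    # Toy ensemble: fully folded, residue 2 unfolded, fully unfolded.
    toy_states = [(True, True), (True, False), (False, False)]
    toy_energies = [0.0, 1.0, 3.0]  # kcal/mol relative to the folded state
    probs = microstate_probabilities(toy_energies)
    for j, (kappa, dg) in enumerate(residue_stability(toy_states, probs), start=1):
        print(f"residue {j}: ln(kappa) = {math.log(kappa):.2f}, dG = {dg:.2f} kcal/mol")
```

In this toy case the first residue is folded in two of the three states and therefore has the larger stability constant, even though no energy term is assigned to that residue individually.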

The algorithmic basis of the residue stability constant (equation 2) warrants discussion. Although the stability constant is defined for each position, the value obtained at each residue is not the energetic contribution of that residue. Instead, the stability constant is a property of the ensemble as a whole. The energy difference between each partially unfolded microstate and the fully folded reference state is determined by the energetic contributions of all amino acids comprising the folding units that are unfolded in each microstate, plus the energetic contributions associated with exposing additional (complementary) surface area on the protein (Fig. 1B). The stability constant thus provides the average thermodynamic environment of each residue, wherein surface area, polarity, and packing are implicitly considered. This is important because it provides a thermodynamic metric wherein each of these static structural properties is weighted according to its energetic impact at each position.

As stated above, the most significant feature of the ensemble-based description provided by COREX is that the residue stability constant can be compared to hydrogen exchange protection factors. This represents an experimentally verifiable description of the energetic information calculated by the COREX algorithm. An example of such a comparison is shown for OM3 in Figure 2. The agreement in the location and relative magnitude of the protection factors with the stability constants for this and other proteins suggests that the calculated native state ensemble provides a good description of the actual ensemble (Hilser and Freire 1996). It naturally follows that the residue stability constants of a particular protein provide a good description of the thermodynamic environment of each residue in that structure.

Further inspection of Figure 2 reveals another important feature in the pattern of residue stability constants: The stability constants may vary significantly across a given secondary structural element, as observed for α helix 1 of OM3. The protection factors (and stability constants) are high at the N-terminal region of helix 1, but decrease over the length of the helix. This indicates that structural classifications, such as secondary structure type, do not obligatorily coincide with thermodynamic classifications. This result has potentially important consequences for cataloging propensities of amino acids in different environments. For example, in OM3 two threonine residues are located in different structural environments: Thr 47 is part of the loop that follows α helix 1, whereas Thr 49 is part of β strand 3. Despite the different structural environments for the two threonine residues, the stability constants show that both residues, to a first approximation, share the same thermodynamic environment.

Thermodynamic propensities of residue types

The individual residue stability constants for the 2951 residues comprising the 44 proteins in the database were sorted and empirically binned into three classifications of approximately equal size: high, medium, and low stability, as described in Materials and Methods. Statistics of amino acid type as a function of each of these stability categories were tabulated (Table 2), and normalized histograms of these numbers are shown in Figure 3. Striking asymmetries are often observed for the histograms of certain amino acids across the three stability environments. These asymmetries are well outside the standard deviation of the average of three random data sets, as indicated by the data points and error bars in Figure 3. For example, the aromatic amino acids Phe, Trp, and Tyr are mostly found in high stability environments, whereas Gly and Pro are overwhelmingly found in low stability environments. In contrast, residues such as Ala, Met, and Ser show distributions that do not significantly differ from randomized data.

Although the acidic residues Asp and Glu share a slight tendency to be found in medium stability environments, it is observed that several amino acid pairs with nominally similar chemical characteristics partition differently in the stability environments. For example, the basic residues Arg and Lys show opposite stability characteristics: The counts for Arg increase as the stability class is incremented from low to high, but the counts for Lys decrease. Asn is found less often in high stability environments, whereas Gln is found more often in high stability environments. Although the distribution for Ser does not differ significantly from the randomized data, Thr occurs more often in low stability environments and less often in high stability environments. Somewhat surprisingly, the aliphatic amino acids Ile, Leu, and Val do not show a general pattern, except perhaps a slight disfavoring of low stability environments.

To determine whether the trends in these data are simple functions of the side-chain surface area exposure in the different regions of the protein, probability histograms (identical to Fig. 3) were generated using side-chain surface area alone, averaged over a window size of five residues as described in Materials and Methods. As shown in Figure 4, frequencies of amino acids found in COREX stability environments are not correlated to frequencies of amino acids in exposed surface area environments. This is important as it suggests that the thermodynamic information calculated by the COREX algorithm is not simply monitoring a static property of a single X-ray or NMR structure, but instead is capturing a property of the native state ensemble as a whole.

Fold-recognition experiments based on thermodynamic propensities of residue types

As the distribution of amino acids in the different stability environments is nonrandom, it was hypothesized that these propensities could be used to correctly identify sequences corresponding to structures whose thermodynamic environments are known. In other words, if the thermodynamic environments within a structure are given, can the correct sequence of that structure be identified? To test this, simple fold-recognition experiments were performed based on the information contained in the distributions. The profiling method of Eisenberg and coworkers (Gribskov et al. 1987; Bowie et al. 1991) was used, and stability profiles as a function of residue position were created for each of the 44 proteins in the database. A 3D-1D scoring matrix for each protein that contained the log-odds probabilities of finding amino acid types in the high, medium, or low stability classes was calculated from the data in Table 2 as described in Materials and Methods. An analogous scoring matrix was also calculated using only DSSP secondary structure information (Kabsch and Sander 1983). This provided a control data set of identical dimensionality (i.e., COREX matrix: 20 amino acids × three stability categories [low, med., high] vs. DSSP matrix: 20 amino acids × three secondary structure categories [α, β, other]). It is important to note that no target protein was included in the scoring matrix used for its respective fold-recognition experiment, thus negating the possibility that target information could have directly biased the results.

The scoring matrices derived from COREX stability and secondary structure, averaged over all 44 target proteins, are shown in Table 3, A and B, respectively. The stability matrix scores faithfully reflect the histograms shown in Figure 3; for example, Gly and Pro score unfavorably in high stability environments but score favorably in low stability environments. Similarly, the secondary structure matrix scores follow intuitive notions of secondary structure propensity; for example, Ala scores positively in helical environments, the aromatics score positively in β environments, and Gly and Pro score negatively in both α and β environments. The standard deviations in both matrices were generally small as compared to the magnitude of the scores, suggesting that the scores were not affected by the removal of any one protein from the database.

Fold-recognition experiments were performed using a local alignment algorithm based on the Smith-Waterman algorithm (Smith and Waterman 1981) as implemented in PROFILESEARCH (Bowie et al. 1991). The 44 amino acid sequences corresponding to the structures used in the database were each threaded against the 44 target profiles of COREX stability or DSSP secondary structure. The local alignment algorithm was used to optimally score the target profile versus each amino acid sequence. For comparison, three randomized datasets of COREX stability and secondary structure were used to construct scoring matrices in an identical manner, and the results from these control experiments were averaged for comparison. The fold-recognition protocol was therefore designed so that the only variable between any set of experiments was the scoring matrix.

The results of the fold-recognition experiments are shown in Figure 5, and at least two conclusions can be drawn from these data. First, scoring matrices composed of either COREX stability or DSSP secondary structure data performed better than randomized data sets in matching a structural target to its amino acid sequence. In Figure 5, the results for COREX data are stacked toward the left (successful) side of the rankings, whereas the randomized data approach a bell-shaped distribution with a maximum near the median of the size of the sequence datasets (approximately 22 for the library size of 44 sequences). Second, the number of targets falling in the most successful bin was similar (approximately one-third to one-half of the total number of targets) for both the COREX stability and secondary structure matrices, suggesting that COREX stability propensities alone contain comparable fold-specification information to secondary structure propensities.

Because the local alignment algorithm used here computes a score without returning the complete alignment of profile to sequence, high scores may have been possible from nonstructurally significant local alignments. In other words, it is possible that a correct sequence may have scored well against its corresponding target structure without having placed the individual amino acids in their correct positions within the structure. To assess the extent of local alignments that were structurally significant, minor modifications were made to the PROFILESEARCH source code that saved the traceback of the alignment matrix. It was found that for targets scoring poorly in the fold-recognition rankings, local alignments of the corresponding sequence were often not significant. However, sequences that scored in the top two bins were often found to be completely and correctly aligned with their target profiles, even though not all of their residues contributed to the overall score because of the rules of the local algorithm. Three examples of successful alignment based on COREX stability data alone are shown in Figure 6, A, B, and C, and Table 4, A, B, and C, for the targets neuro-oncological ventral antigen 1 fragment (1dt4:a), protein G (1igd), and tendamistat (2ait), respectively. The alignments calculated using the local algorithm were correct, despite the fact that no sequence information about the target was used, and that only a subset of the amino acid sequence was used in the scoring. In addition, it is noteworthy that the success of these examples does not arise from only a small fragment of the sequence, as the cumulative 3D-1D matrix scores steadily increased over the entire length of the sequence.
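The idea of scoring a profile against a sequence by local alignment, and then saving the traceback to see which positions were actually paired, can be sketched as below. This is not the PROFILESEARCH code: it is a plain Smith-Waterman recursion with a single linear gap penalty (an assumption; the package uses separate opening and extension penalties), and the function name and the toy scoring matrix are hypothetical.

```python
def local_align(profile, sequence, matrix, gap=5.0):
    """Smith-Waterman alignment of an environment profile to a sequence.

    profile  : list of environment labels, one per structural position
    sequence : string of one-letter amino acid codes
    matrix   : dict mapping (amino_acid, environment) to a score
    Returns the best local score and the aligned (profile, sequence) index pairs.
    """
    n, m = len(profile), len(sequence)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = matrix.get((sequence[j - 1], profile[i - 1]), 0.0)
            h[i][j] = max(0.0, h[i - 1][j - 1] + s, h[i - 1][j] - gap, h[i][j - 1] - gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Traceback from the best-scoring cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = matrix.get((sequence[j - 1], profile[i - 1]), 0.0)
        if h[i][j] == h[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] - gap:
            i -= 1  # gap in the sequence
        else:
            j -= 1  # gap in the profile
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    toy_matrix = {("G", "low"): 1.0, ("G", "high"): -1.0,
                  ("F", "high"): 1.0, ("F", "low"): -1.0}
    score, pairs = local_align(["low", "high", "high", "low"], "AAGFFG", toy_matrix)
    print(score, pairs)
```

An alignment is structurally correct in the sense used above when every returned pair matches a sequence residue to its own position in the target profile.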

To test whether the results were dependent on the particular choice of proteins used in the database, a second, independent database was assembled from 50 proteins found in the PDBSelect database (Hobohm and Sander 1994) that were not contained in the original 44 protein database used above. The 3304 residues from these additional proteins were binned into high, medium, and low stability classes, and the normalized frequencies of stability as a function of residue type were compared to those shown in Figure 3. A scatterplot of these results, given in Figure 7, clearly shows that the two databases are correlated. The major outliers in the correlation are attributable to residue types with low statistics, such as His. These results are interpreted as evidence that the COREX stability data calculated in this work were not fundamentally biased by the original choice of 44 proteins.

Discussion

The interpretation of a protein as a single, static structure is incomplete (Frauenfelder et al. 1988). An increasing amount of data from NMR spectroscopy and molecular dynamics simulation shows that it is more correct to view the native state of a protein as an ensemble of interconverting conformations, with an average structure approximating that of an X-ray or NMR structure (Kay 1998; Philippopolous and Lim 1999). Although it is possible in favorable cases to assign a macroscopic stability difference between the native and denatured states of a protein, not all residues in a protein share the same microscopic stability, as is evident from hydrogen exchange experiments (Bai and Englander 1996; Chamberlain et al. 1996; Hilser et al. 1998; Huyghues-Despointes et al. 1999; Jaravine et al. 2000; Llinas et al. 1999; Sadqi et al. 1999). The COREX algorithm provides a means of estimating this energetic variability in the native state, and it is possible that this information may illuminate the relation between amino acid sequence and protein structure.

The most significant finding of this work is that amino acid types have different propensities to occur in high, medium, or low stability environments within a protein structure. It is remarkable that the distributions for many residues are far from random, despite the facts that the observed stability distributions are massive averages over many million microstates from different proteins, and certain nonsurface area-related interactions such as electrostatics are ignored in the COREX free energy parameterization. Furthermore, Figures 5 and 6 indicate that the simple three-dimensional stability classification is able to provide sufficient signal for correct sequence to structure alignments.

In general, the success of any fold-recognition experiment will depend on whether the scoring function is representative of the true determinants of the folds being predicted. For the current case, there are two relevant questions: First, is enough structural-energetic information retained to identify a fold by distilling thermodynamic stability down to three dimensions per amino acid? Second, is the database large enough to sample completely over all stability categories? Here we have implemented two strategies to address these issues. The first is the use of a three-dimensional secondary structure matrix. As the utility of secondary structure information in fold-recognition algorithms is well documented (Jones et al. 1999; Rice and Eisenberg 1997; Bowie et al. 1991), it provides a good metric for the performance of the COREX-derived data in fold recognition. The fact that the overall fold-recognition success rate is similar for COREX and secondary structure (Fig. 5) suggests that the COREX stability matrix indeed contains significant structural information. It is likely that breaking the overall stability into its component enthalpies and entropies would increase the discrimination of the scoring by introducing additional categories to describe a target profile. The impact on fold-recognition experiments of further dividing the stability data is currently under investigation.

The use of a global alignment scheme, as opposed to a local alignment scheme, provides a second means of assessing the information contained in the COREX-derived scoring matrix. For an ideal scoring function, the correct sequence should always score highly when each amino acid is aligned with its corresponding structural position. This is indeed the case for the majority of the structures studied here (data not shown). Although the local alignment scores are generally lower than the global scores, suggesting that information is lost in the local alignment, correct sequences that score highly in local alignments are often completely and correctly aligned with the structural target. This result indicates that even the rudimentary partitioning of residue stabilities described here is sufficient to identify correct sequence/structure pairs. It may be possible to improve the local alignment scheme by assessing scores based on more than one high-scoring subsequence, similar to the logic used in the BLAST algorithm (Altschul et al. 1990).

It is important to bear in mind that residues with functional instead of structural significance are not accounted for in the statistics, and these residues may therefore score poorly in this scheme. However, a surprising implication of these results for protein design is that careful consideration of specific interactions between some amino acids may not be necessary for determination of a given fold. A reflection of this phenomenon may be found in recent work in which almost half of the residues in BPTI were replaced by alanine, yet the resulting structure was stable and almost indistinguishable from the original protein in terms of activity (Kuroda and Kim 2000). The COREX matrix given in Table 3A would similarly predict that, in general, alanine replacements would have a greater probability of being tolerated in most positions throughout the protein structure, as the scores for alanine in any stability environment are near zero.

As noted, absolutely no sequence information about the target was used in the threading calculations. The implication of this observation is that it might be possible to routinely generate a correct sequence to structure alignment on a purely energetic basis. This result is particularly exciting because energetic information may be able to increase the chances of successful fold recognition in cases in which identity between the target sequence and a homologous sequence is below the threshold of the best sequence matching algorithms (Park et al. 1998). Indeed, preliminary results from our laboratory suggest that homologous structures with low sequence identity share similar features in their COREX stability profiles.

Conclusion

Historically, proteins have been classified according to their structural features (i.e., helix, sheet, turn) and/or their physico-chemical properties (i.e., hydrophobicity, polarity, packing). The underlying principle that unifies both the structural and physico-chemical properties is that they are all manifestations of the free energy of the system. Whether a sequence is helix, sheet, or turn is determined largely by the energetics of backbone and side-chain interactions within the different conformers, and the hydrophobicity, polarity, and packing are determined by the energetic interactions of the different groups with solvent and with each other. This work shows that some proteins can be classified in purely thermodynamic terms. In effect, this approach classifies both the structural and physico-chemical properties within a unifying framework, wherein the importance of these properties is implicitly considered at each site. This finding has potentially far-reaching consequences. As each amino acid sequence adopts the specific fold that is a thermodynamic minimum for that sequence, it is probable that the true determinants for each fold are thermodynamic in nature and universal for all proteins. In other words, thermodynamic rules represent a more fundamental characterization of a protein's fold.

Materials and methods

Selection of proteins used in dataset

A database of 44 proteins, 2951 residues in total (Table 1), was selected from the Protein Data Bank (Berman et al. 2000) on the basis of biological and computational criteria. Two biological criteria were that the proteins be globular and nonhomologous with every other member of the set as ascertained by SCOP (Murzin et al. 1995). One computational criterion was that the proteins be small (fewer than approximately 90 residues), because the CPU time and data storage needs of an exhaustive COREX calculation increased exponentially with the chain length. A second computational criterion was that the structures be mostly devoid of ligands, metals, or cofactors, as the COREX energy function was not parameterized to account for the energetic contributions of nonprotein atoms. The database comprised 21 X-ray structures, whose average resolution was 1.62 ± 0.07 Å (mean ± S.D.), and 23 NMR structures. An independent database of 50 proteins, 3304 residues in total, which were not included in the above set, was created from the PDBSelect database (Hobohm and Sander 1994). This second database was used as a control to check the results obtained from the first database, as shown in Figure 7.

Computational details

The COREX algorithm (Hilser and Freire 1996) was run with a window size of five residues on each protein in the database, and the simulated temperature was 25°C. The calculated equilibrium constant for stability of residue j in the native structure, κf,j, was computed from the statistical weights of each microstate i in the folded and unfolded subensembles according to equation 2.

The Gibbs energy for each microstate i, relative to the fully folded structure, was calculated using equation 4:

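  • $$\Delta G_i = \Delta H_{i,\mathrm{solvation}} - T\left(\Delta S_{i,\mathrm{solvation}} + W\,\Delta S_{i,\mathrm{conformational}}\right) \qquad (4)$$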

In equation 4, the calorimetric enthalpy and entropy of solvation were parameterized from polar and apolar surface exposure, and the conformational entropy was determined as described previously (Hilser and Freire 1996). The maximum stability for each protein was normalized to a common arbitrary value of approximately 6.2 kcal/mol (max lnκf = 10.4) by adjusting its conformational entropy factor, W. The average entropy factor required for the normalization was 0.852 ± 0.098 (mean ± S.D.) over the 44 proteins. It was found that adjustment of a stable protein's conformational entropy factor did not change the relative patterns of high and low stability regions in the structure.
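As a quick consistency check on the two normalization values quoted above, the magnitude of the free energy corresponding to max lnκf = 10.4 at 25°C can be worked out directly. The gas constant value used here is the standard one and is an assumption in the sense that it is not stated in the text.

```latex
\begin{aligned}
RT &= \left(1.987 \times 10^{-3}\ \mathrm{kcal\,mol^{-1}\,K^{-1}}\right)\left(298.15\ \mathrm{K}\right) \approx 0.592\ \mathrm{kcal\,mol^{-1}} \\
RT \ln \kappa_f^{\max} &\approx 0.592 \times 10.4 \approx 6.16\ \mathrm{kcal\,mol^{-1}} \approx 6.2\ \mathrm{kcal\,mol^{-1}}
\end{aligned}
```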

Binning of residue stability constants

Values of lnκf were binned into three stability classes: high, medium, and low. The cutoffs for each stability class were adjusted so that an approximately equal number of residues in the database fell into each class (Table 2). The low stability category was defined as lnκf ≤ 3.91, the medium stability category was defined as 3.91 < lnκf ≤ 7.08, and the high stability category was defined as lnκf > 7.08.
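The binning and tabulation step can be sketched in a few lines. The cutoffs are the ones given above; the function names and the input layout (pairs of one-letter residue code and lnκf) are assumptions made for the example.

```python
from collections import Counter

LOW_MAX = 3.91     # ln(kappa_f) at or below this value is "low"
MEDIUM_MAX = 7.08  # ln(kappa_f) above LOW_MAX and up to this value is "medium"


def stability_class(ln_kappa):
    """Map a residue's ln(kappa_f) to its stability class."""
    if ln_kappa <= LOW_MAX:
        return "low"
    if ln_kappa <= MEDIUM_MAX:
        return "medium"
    return "high"


def tabulate(residues):
    """Count (amino_acid, stability_class) occurrences.

    `residues` is an iterable of (amino_acid, ln_kappa) pairs.
    """
    return Counter((aa, stability_class(lk)) for aa, lk in residues)


if __name__ == "__main__":
    toy = [("G", 1.2), ("P", 3.0), ("F", 9.5), ("W", 8.1), ("A", 5.0)]
    for (aa, cls), n in sorted(tabulate(toy).items()):
        print(aa, cls, n)
```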

Calculation of average native state side-chain surface area exposure

The average side-chain surface area exposure of residue j over a window size of five residues, ASAaverage,j, was calculated using equation 5:

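  • $$\mathrm{ASA}_{\mathrm{average},j} = \frac{\mathrm{ASA}_{j-2} + \mathrm{ASA}_{j-1} + \mathrm{ASA}_{j} + \mathrm{ASA}_{j+1} + \mathrm{ASA}_{j+2}}{5} \qquad (5)$$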

Because equation 5 was undefined for the first and last two residues in each protein, these four residues were ignored in the binning. The cutoffs for each side-chain area class were adjusted so that an approximately equal number of residues fell into each class. The low exposure category was defined as ASAaverage,j ≤ 40.81 Å², the medium exposure category was defined as 40.81 Å² < ASAaverage,j ≤ 57.77 Å², and the high exposure category was defined as ASAaverage,j > 57.77 Å².
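A minimal sketch of the five-residue window average follows, assuming the per-residue side-chain exposures are already available as a list of numbers (how they were computed is not covered here). The function name is hypothetical; positions where the window does not fit are returned as None so that they can be skipped in the binning.

```python
def windowed_average(side_chain_asa, window=5):
    """Centered moving average of per-residue side-chain exposure.

    Positions closer than window // 2 to either chain end are left as None,
    mirroring the exclusion of the first and last two residues.
    """
    half = window // 2
    averaged = [None] * len(side_chain_asa)
    for j in range(half, len(side_chain_asa) - half):
        averaged[j] = sum(side_chain_asa[j - half:j + half + 1]) / window
    return averaged


if __name__ == "__main__":
    print(windowed_average([10, 20, 30, 40, 50, 60]))
```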

Random data sets

For comparison to the COREX and DSSP data sets from the 44 nonhomologous proteins in the database, control data sets were constructed by randomizing (i.e., shuffling) the calculated stability and the secondary structure data. The random data sets therefore contained the same amino acid composition, counts of high, medium, and low stabilities, and types of secondary structure, as the real data sets. However, any correlation between residue type and stability or secondary structural class was presumably destroyed by randomization. To assess internal variability of these data caused by differing numbers of counts of each residue type, the results from three randomized data sets were averaged and standard deviations calculated; these data are plotted as points and error bars in Figure 3.
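The randomization control can be illustrated as follows. This is a sketch under stated assumptions: the function names and seeds are hypothetical, shuffling is done over the whole pooled list of class labels, and the sample standard deviation is used because the text does not say which estimator was applied.

```python
import random
import statistics
from collections import Counter


def shuffled_counts(amino_acids, classes, seed):
    """Shuffle the class labels against the residues and count the pairs.

    Amino acid composition and the number of residues per class are
    preserved; only the pairing between the two is randomized.
    """
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    return Counter(zip(amino_acids, shuffled))


def randomized_summary(amino_acids, classes, seeds=(1, 2, 3)):
    """Mean and standard deviation of each count over several shuffles."""
    runs = [shuffled_counts(amino_acids, classes, seed) for seed in seeds]
    keys = set().union(*runs)
    return {
        key: (statistics.mean(run[key] for run in runs),
              statistics.stdev(run[key] for run in runs))
        for key in keys
    }


if __name__ == "__main__":
    aas = list("GGGGFFFF")
    labels = ["low"] * 4 + ["high"] * 4
    for key, (mean, sd) in sorted(randomized_summary(aas, labels).items()):
        print(key, round(mean, 2), round(sd, 2))
```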

Fold-recognition details

Fold-recognition experiments were based on the profile method pioneered by Eisenberg and coworkers (Gribskov et al. 1987; Bowie et al. 1991). Briefly, the method characterizes each residue position of a target protein in terms of a structural environment score derived from analysis of a database of known structures. The resulting profile of the target protein is then optimally aligned to each member of a library of amino acid sequences by maximizing the score between the sequence and the profile. In this work, two structural environment scoring schemes were developed: one based on calculated COREX stability, and one based on DSSP secondary structure (Kabsch and Sander 1983) as contained in each target protein's PDB file. Each scoring scheme had three dimensions as a function of the 20 amino acids: high, medium, and low stability for COREX scoring (as described above), or α, β, and other for secondary structure scoring. Two alignment algorithms were used: a local scheme (Smith and Waterman 1981) as implemented in the PROFILESEARCH software package (Bowie et al. 1991), and a global scheme. The global alignment scheme simply paired the first residue of an amino acid sequence with the first position of a target profile with no allowance for gaps. No attempt was made to optimize the gap opening and extension penalties for the local algorithm; in all cases these were the default values given in the PROFILESEARCH package, 5.00 and 0.05, respectively. The library of amino acid sequences was composed of the sequences corresponding to each of the 44 structures in the database (Table 1).
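
The global scheme described above amounts to summing matrix scores position by position. The following is a minimal sketch: the three matrix rows are the average values for Gly, Trp, and Ala from Table 3A, while the profile, the library sequences, and the function names are hypothetical. Scoring only the overlapping positions when lengths differ is an assumption of this sketch, because the text does not state how length mismatches were handled.

```python
# Average 3D-1D scores for three residue types (values from Table 3A).
SCORES = {
    "G": {"L": 0.53, "M": -0.03, "H": -1.17},
    "W": {"L": -0.28, "M": -1.18, "H": 0.65},
    "A": {"L": 0.06, "M": 0.01, "H": -0.07},
}


def global_score(sequence, profile, scores):
    """Ungapped score: residue i of the sequence is paired with position i.

    zip() stops at the shorter input, so only overlapping positions are
    scored (an assumption of this sketch).
    """
    return sum(scores[res][env] for res, env in zip(sequence, profile))


def rank_library(profile, library, scores):
    """Return library sequence names sorted from best to worst score."""
    return sorted(
        library,
        key=lambda name: global_score(library[name], profile, scores),
        reverse=True,
    )


# Hypothetical target profile and sequence library.
profile = "LLHH"
library = {"seq1": "GGWW", "seq2": "WWGG", "seq3": "AAAA"}

for name in rank_library(profile, library, SCORES):
    print(name, round(global_score(library[name], profile, SCORES), 2))
```

In this toy case the sequence that places Gly in low stability positions and Trp in high stability positions ranks first, mirroring the propensities in Table 3A.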

Construction of scoring matrices

The scoring matrices were calculated as log-odds probabilities of finding residue type j in structural environment k, as described below and in Bowie et al. (1991). The matrix score, Sj,k, was defined as:

  S_{j,k} = \ln\left( \frac{P_{j|k}}{P_k} \right) \qquad (6)

In equation 6, Pj|k is the probability of finding a residue of type j in stability class k (i.e., number of counts of residue type j in stability class k divided by the total number of counts of residue type j) and Pk is the probability of finding any residue in the database in stability environment k (i.e., number of residues in stability class k, regardless of amino acid type, divided by the total number of residues in the entire database, regardless of amino acid type). The structural environment was described by either COREX stability information (high, medium, or low lnκf), or DSSP secondary structure (α, β, or other) as given in the target's PDB entry. The fold-recognition target was removed from the database, and the remaining 43 proteins were used to calculate the scores; therefore, information about the target was never included in the scoring matrix. The values in Table 3, A and B, are the average ± standard deviation of all 44 individual scoring matrices.
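
The score definition can be checked numerically against the counts in Table 2. The sketch below assumes a natural logarithm, which reproduces the Table 3A averages to within the reported spread (for example, about 0.06 for Ala in the low stability class). The function names are illustrative, and `remove_target` stands in for the leave-one-out step with per-target counts that are not tabulated in this article.

```python
import math

CLASSES = ("L", "M", "H")

# Counts of (low, medium, high) stability residues from Table 2.
TABLE2_COUNTS = {
    "Ala": (94, 89, 82), "Arg": (33, 38, 67), "Asn": (45, 48, 40),
    "Asp": (49, 68, 41), "Cys": (40, 54, 34), "Gln": (22, 32, 51),
    "Glu": (70, 86, 65), "Gly": (126, 71, 22), "His": (18, 13, 12),
    "Ile": (37, 54, 58), "Leu": (61, 76, 90), "Lys": (89, 86, 59),
    "Met": (21, 18, 17), "Phe": (11, 20, 59), "Pro": (70, 43, 22),
    "Ser": (45, 47, 61), "Thr": (76, 54, 36), "Trp": (10, 4, 26),
    "Tyr": (15, 29, 51), "Val": (51, 73, 72),
}


def remove_target(counts, target_counts):
    """Subtract one protein's counts (leave-one-out); inputs are hypothetical."""
    return {
        res: tuple(c - t for c, t in zip(row, target_counts.get(res, (0, 0, 0))))
        for res, row in counts.items()
    }


def log_odds_matrix(counts):
    """S[j][k] = ln(P(j|k) / P(k)) as defined in the text.

    Assumes every count is nonzero; a zero count would need special handling.
    """
    class_totals = [sum(row[k] for row in counts.values()) for k in range(3)]
    grand_total = sum(class_totals)
    matrix = {}
    for residue, row in counts.items():
        residue_total = sum(row)
        matrix[residue] = {
            CLASSES[k]: math.log(
                (row[k] / residue_total) / (class_totals[k] / grand_total)
            )
            for k in range(3)
        }
    return matrix


matrix = log_odds_matrix(TABLE2_COUNTS)
print({k: round(v, 2) for k, v in matrix["Ala"].items()})
```

Because this uses the full database rather than the 43-protein subsets, the values differ slightly from the leave-one-out averages in Table 3A.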

Electronic supplemental material


This includes COREX lnκf values for 2951 residues in 44 proteins comprising the database (lnkf.txt), 44 3D-1D scoring matrices based on COREX lnκf values (matrices.txt), and the 44 amino acid sequences contained in the threading library (sequence.txt).

Table 1. SCOP classifications and sequence data for 44 proteins used in the database

Number | PDB ID | PDB length (a) | SCOP fold (b) | Sequence length (c)
--- | --- | --- | --- | ---
1 | 1a7i | 60 | Glucocorticoid receptor like DNA binding domain | 81
2 | 1a7w | 68 | Histone | 69
3 | 1a8o | 70 | Acyl carrier protein like | 70
4 | 1aa3 | 63 | Anti LPS factor/RecA domain | 63
5 | 1adr | 76 | Lambda repressor like DNA binding domains | 76
6 | 1aiw | 62 | WW domain like | 62
7 | 1ak8 | 76 | EF hand like | 76
8 | 1b9g:a | 57 | Insulin like | 57
9 | 1bdd | 60 | Bacterial Ig/albumin binding | 60
10 | 1bdo | 80 | Barrel sandwich hybrid | 80
11 | 1bo9:a | 73 | Annexin | 73
12 | 1chc | 68 | RING finger domain C3HC4 | 68
13 | 1cnr | 46 | Crambin like | 46
14 | 1ctf | 68 | Ribosomal protein L7/12 C-terminal fragment | 74
15 | 1doq:a | 69 | SAM domain like | 69
16 | 1dt4:a | 73 | KH domain | 73
17 | 1fgp | 70 | N-terminal domains of the minor coat protein g3p | 70
18 | 1hyp | 75 | Bifunctional inhibitor/lipid transfer protein | 80
19 | 1igd | 61 | Beta-grasp (ubiquitin-like) | 61
20 | 1iro | 53 | Rubredoxin like | 54
21 | 1kjs | 74 | Anaphylotoxins (complement system) | 74
22 | 1kp6:a | 79 | Ferredoxin like | 79
23 | 1mjc | 69 | OB fold | 69
24 | 1mkn:a | 59 | Midkine | 59
25 | 1nhm | 79 | HMG box | 81
26 | 1nkl | 78 | Saposin | 78
27 | 1ptf | 87 | HPr proteins | 87
28 | 1ptx | 64 | Knottins | 64
29 | 1qa4:a | 56 | HIV-1 Nef protein fragments | 57
30 | 1qqv:a | 67 | Thermostable subdomain from chicken villin | 67
31 | 1r1b:a | 56 | SIS/NS1 RNA binding domain | 59
32 | 1sem:a | 58 | SH3 like barrel | 58
33 | 1vcc | 77 | DNA topoisomerase I domain | 77
34 | 1vmp:a | 71 | IL8 like | 71
35 | 2a3d:a | 73 | De novo designed single chain 3 helix bundle | 73
36 | 2ait | 74 | Alpha amylase inhibitor tendamistat | 74
37 | 2ci2:I | 65 | CI2 family of serine protease inhibitors | 83
38 | 2erl | 40 | Protozoan pheromone proteins | 40
39 | 2ezh | 65 | DNA/RNA binding 3-helical bundle | 75
40 | 2ovo | 56 | Ovomucoid/PCI-1 like inhibitors | 56
41 | 2spg:a | 66 | Beta clip | 66
42 | 3ebx | 62 | Snake toxin like | 62
43 | 3ncm:a | 92 | Immunoglobulin like beta sandwich | 92
44 | 6pti | 56 | BPTI like | 58

a The number of residues for which coordinates are reported in the PDB entry.
b The structural classification for determining extent of homology as found in the SCOP database (Murzin et al. 1995).
c The number of residues in the entire amino acid sequence as given in the PDB entry.

Table 2. Statistics of lnκf values for 2951 residues in the database (a)

Residue type | Low (lnκf ≤ 3.91) | Medium (3.91 < lnκf ≤ 7.08) | High (7.08 < lnκf) | Row total
--- | --- | --- | --- | ---
Ala | 94 | 89 | 82 | 265
Arg | 33 | 38 | 67 | 138
Asn | 45 | 48 | 40 | 133
Asp | 49 | 68 | 41 | 158
Cys | 40 | 54 | 34 | 128
Gln | 22 | 32 | 51 | 105
Glu | 70 | 86 | 65 | 221
Gly | 126 | 71 | 22 | 219
His | 18 | 13 | 12 | 43
Ile | 37 | 54 | 58 | 149
Leu | 61 | 76 | 90 | 227
Lys | 89 | 86 | 59 | 234
Met | 21 | 18 | 17 | 56
Phe | 11 | 20 | 59 | 90
Pro | 70 | 43 | 22 | 135
Ser | 45 | 47 | 61 | 153
Thr | 76 | 54 | 36 | 166
Trp | 10 | 4 | 26 | 40
Tyr | 15 | 29 | 51 | 95
Val | 51 | 73 | 72 | 196
Column total | 983 | 983 | 985 | 2951

a The values in this table were used to compute the normalized histograms shown in Fig. 3. In addition, these values (minus the values for a given target protein) were used to compute the lnκf scoring matrices, as described in the text.

Table 3. Average 3D-1D scoring matrices derived from lnκf and secondary structure information (a)

A. COREX stability (b)

Class | W | F | Y | L | I | V | M | A | G | P
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Low | −0.28 ± 0.06 | −0.98 ± 0.16 | −0.73 ± 0.12 | −0.21 ± 0.04 | −0.29 ± 0.05 | −0.24 ± 0.04 | 0.12 ± 0.03 | 0.06 ± 0.02 | 0.53 ± 0.08 | 0.43 ± 0.07
Medium | −1.18 ± 0.20 | −0.40 ± 0.07 | −0.08 ± 0.03 | 0.01 ± 0.01 | 0.08 ± 0.02 | 0.11 ± 0.02 | −0.04 ± 0.03 | 0.01 ± 0.01 | −0.03 ± 0.02 | −0.04 ± 0.02
High | 0.65 ± 0.10 | 0.66 ± 0.10 | 0.46 ± 0.07 | 0.17 ± 0.03 | 0.15 ± 0.03 | 0.09 ± 0.02 | −0.09 ± 0.03 | −0.07 ± 0.02 | −1.17 ± 0.18 | −0.70 ± 0.11

Class | C | T | S | Q | N | E | D | H | K | R
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Low | −0.06 ± 0.02 | 0.31 ± 0.05 | −0.12 ± 0.03 | −0.45 ± 0.08 | 0.02 ± 0.02 | −0.05 ± 0.02 | −0.07 ± 0.02 | 0.22 ± 0.05 | 0.13 ± 0.03 | −0.32 ± 0.05
Medium | −0.22 ± 0.04 | −0.02 ± 0.02 | −0.08 ± 0.02 | −0.09 ± 0.03 | 0.08 ± 0.02 | 0.15 ± 0.03 | 0.25 ± 0.04 | −0.09 ± 0.03 | 0.10 ± 0.02 | −0.19 ± 0.04
High | 0.23 ± 0.04 | −0.42 ± 0.07 | 0.17 ± 0.03 | 0.37 ± 0.06 | −0.10 ± 0.03 | −0.12 ± 0.03 | −0.25 ± 0.04 | −0.18 ± 0.05 | −0.27 ± 0.05 | 0.37 ± 0.06

B. DSSP secondary structure (c)

Class | W | F | Y | L | I | V | M | A | G | P
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Alpha | −0.28 ± 0.04 | 0.04 ± 0.03 | −0.15 ± 0.03 | 0.36 ± 0.01 | 0.22 ± 0.02 | −0.18 ± 0.02 | 0.13 ± 0.03 | 0.39 ± 0.01 | −0.95 ± 0.03 | −1.16 ± 0.04
Beta | 0.78 ± 0.03 | 0.75 ± 0.02 | 0.72 ± 0.02 | −0.47 ± 0.03 | 0.58 ± 0.02 | 0.65 ± 0.01 | −0.39 ± 0.06 | −0.63 ± 0.03 | −0.40 ± 0.02 | −0.50 ± 0.04
Other | −0.33 ± 0.04 | −0.60 ± 0.03 | −0.37 ± 0.02 | −0.16 ± 0.01 | −0.62 ± 0.02 | −0.27 ± 0.01 | 0.03 ± 0.02 | −0.16 ± 0.02 | 0.43 ± 0.01 | 0.48 ± 0.01

Class | C | T | S | Q | N | E | D | H | K | R
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Alpha | −0.03 ± 0.03 | −0.54 ± 0.03 | −0.37 ± 0.02 | 0.41 ± 0.01 | −0.65 ± 0.04 | 0.35 ± 0.01 | −0.00 ± 0.02 | −0.57 ± 0.05 | 0.13 ± 0.01 | 0.23 ± 0.02
Beta | −0.02 ± 0.04 | 0.50 ± 0.02 | −0.16 ± 0.03 | 0.22 ± 0.02 | −0.31 ± 0.03 | −0.37 ± 0.03 | −0.60 ± 0.04 | 0.24 ± 0.04 | −0.67 ± 0.03 | −0.19 ± 0.03
Other | 0.03 ± 0.02 | 0.04 ± 0.01 | 0.23 ± 0.01 | −0.57 ± 0.02 | 0.35 ± 0.01 | −0.18 ± 0.01 | 0.16 ± 0.01 | 0.18 ± 0.02 | 0.09 ± 0.01 | −0.11 ± 0.02

a Each of the 44 targets used in the fold-recognition experiments had an individual 3D-1D scoring matrix that did not include information about the target. Consequently, matrix scores are reported as the average ± standard deviation of 44 values.
b The boundaries for the stability categories were defined as follows: low stability was lnκf ≤ 3.91, medium stability was 3.91 < lnκf ≤ 7.08, high stability was 7.08 < lnκf, as described in the text.
c DSSP secondary structure (Kabsch and Sander 1983) was used as given in each protein's PDB entry.

Table 4A. Local alignment score of 1dt4:a sequence to 1dt4:a stability profile

Residue number | Residue type | Stability environment (a) | 3D-1D matrix score (b) | Cumulative local alignment score (c, d) | Residue number | Residue type | Stability environment (a) | 3D-1D matrix score (b) | Cumulative local alignment score (c, d)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
4 | M | L | 0.10 | 0.10 | 41 | I | M | 0.05 | 3.92
5 | K | L | 0.12 | 0.22 | 42 | S | M | −0.10 | 3.82
6 | D | L | −0.08 | 0.14 | 43 | K | M | 0.11 | 3.93
7 | V | L | −0.25 | −0.11 | 44 | K | L | 0.12 | 4.05
8 | V | L | −0.25 | −0.36 | 45 | G | L | 0.53 | 4.58
9 | E | L | −0.06 | −0.42 | 46 | E | L | −0.06 | 4.52
10 | I | M | 0.05 | −0.37 | 47 | F | L | −1.08 | 3.44
11 | A | M | 0.00 | −0.37 | 48 | L | L | −0.23 | 3.21
12 | V | M | 0.10 | −0.27 | 49 | P | L | 0.44 | 3.65
13 | P | M | −0.05 | −0.32 | 50 | G | L | 0.53 | 4.18
14 | E | M | 0.15 | −0.17 | 51 | T | L | 0.31 | 4.49
15 | N | M | 0.05 | −0.12 | 52 | R | L | −0.32 | 4.17
16 | L | M | 0.00 | −0.12 | 53 | N | M | 0.05 | 4.22
17 | V | M | 0.10 | −0.02 | 54 | R | H | 0.35 | 4.57
18 | G | M | −0.01 | −0.03 | 55 | K | H | −0.28 | 4.29
19 | A | M | 0.00 | −0.03 | 56 | V | H | 0.11 | 4.40
20 | I | M | 0.05 | 0.02 | 57 | T | M | −0.01 | 4.39
21 | L | L | −0.23 | −0.21 | 58 | I | M | 0.05 | 4.44
22 | G | L | 0.53 | 0.32 | 59 | T | L | 0.31 | 4.75
23 | K | L | 0.12 | 0.44 | 60 | G | L | 0.53 | 5.28
24 | G | L | 0.53 | 0.97 | 61 | T | L | 0.31 | 5.59
25 | G | L | 0.53 | 1.50 | 62 | P | L | 0.44 | 6.03
26 | K | L | 0.12 | 1.62 | 63 | A | L | 0.08 | 6.11
27 | T | L | 0.31 | 1.93 | 64 | A | M | 0.00 | 6.11
28 | L | M | 0.00 | 1.93 | 65 | T | M | −0.01 | 6.10
29 | V | M | 0.10 | 2.03 | 66 | Q | H | 0.34 | 6.44
30 | E | M | 0.15 | 2.18 | 67 | A | H | −0.08 | 6.36
31 | Y | H | 0.45 | 2.63 | 68 | A | H | −0.08 | 6.28
32 | Q | M | −0.07 | 2.56 | 69 | Q | H | 0.34 | 6.62
33 | E | L | −0.06 | 2.50 | 70 | Y | H | 0.45 | 7.07
34 | L | L | −0.23 | 2.27 | 71 | L | H | 0.18 | 7.25
35 | T | L | 0.31 | 2.58 | 72 | I | H | 0.14 | 7.39
36 | G | L | 0.53 | 3.11 | 73 | T | H | −0.42 | 6.97
37 | C | L | −0.07 | 3.04 | 74 | Q | H | 0.34 | 7.31
38 | R | H | 0.35 | 3.39 | 75 | R | H | 0.35 | 7.66
39 | I | H | 0.14 | 3.53 | 76 | I | H | 0.14 | 7.80
40 | Q | H | 0.34 | 3.87 | | | | |

a H, M, and L denote high, medium, and low stability as defined in the text and in footnote “b” of Table 3.
b Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the 1dt4:a amino acid sequence given in the “Residue type” column to the 1dt4:a stability profile given in the “Stability environment” column. These values are highly similar, but not identical, to the average values given in Table 3A because these values are from the scoring matrix produced when the target protein was removed from the database, as described in the text.
c Sum of all the values in the “3D-1D matrix score” column up to and including the indicated residue number. Values in boldface were used by the local alignment algorithm (Smith and Waterman 1981) to compute the optimal sequence to profile alignment.
d Data in the “Cumulative local alignment score” column were used to generate Fig. 6A.

Table 4B. Local alignment score of 1igd sequence to 1igd stability profile

Residue number | Residue type | Stability environment (a) | 3D-1D matrix score (b) | Cumulative local alignment score (c, d) | Residue number | Residue type | Stability environment (a) | 3D-1D matrix score (b) | Cumulative local alignment score (c, d)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | M | L | 0.09 | 0.09 | 32 | E | H | −0.14 | 2.79
2 | T | L | 0.30 | 0.39 | 33 | K | H | −0.30 | 2.49
3 | P | L | 0.44 | 0.83 | 34 | A | H | −0.09 | 2.40
4 | A | L | 0.07 | 0.90 | 35 | F | H | 0.66 | 3.06
5 | V | L | −0.26 | 0.64 | 36 | K | H | −0.30 | 2.76
6 | T | L | 0.30 | 0.94 | 37 | Q | H | 0.37 | 3.13
7 | T | L | 0.30 | 1.24 | 38 | Y | H | 0.47 | 3.60
8 | Y | M | −0.09 | 1.15 | 39 | A | H | −0.09 | 3.51
9 | K | H | −0.30 | 0.85 | 40 | N | H | −0.11 | 3.40
10 | L | H | 0.17 | 1.02 | 41 | D | M | 0.23 | 3.63
11 | V | M | 0.10 | 1.12 | 42 | N | M | 0.06 | 3.69
12 | I | M | 0.07 | 1.19 | 43 | G | M | −0.05 | 3.64
13 | N | M | 0.06 | 1.25 | 44 | V | M | 0.10 | 3.74
14 | G | M | −0.05 | 1.20 | 45 | D | M | 0.23 | 3.97
15 | K | L | 0.13 | 1.33 | 46 | G | M | −0.05 | 3.92
16 | T | L | 0.30 | 1.63 | 47 | V | M | 0.10 | 4.02
17 | L | L | −0.22 | 1.41 | 48 | W | H | 0.65 | 4.67
18 | K | L | 0.13 | 1.54 | 49 | T | H | −0.47 | 4.20
19 | G | L | 0.56 | 2.10 | 50 | Y | H | 0.47 | 4.67
20 | E | L | −0.05 | 2.05 | 51 | D | M | 0.23 | 4.90
21 | T | L | 0.30 | 2.35 | 52 | D | M | 0.23 | 5.13
22 | T | L | 0.30 | 2.65 | 53 | A | M | 0.01 | 5.14
23 | T | L | 0.30 | 2.95 | 54 | T | M | 0.02 | 5.16
24 | K | L | 0.13 | 3.08 | 55 | K | M | 0.12 | 5.28
25 | A | L | 0.07 | 3.15 | 56 | T | H | −0.47 | 4.81
26 | V | L | −0.26 | 2.89 | 57 | F | H | 0.66 | 5.47
27 | D | L | −0.06 | 2.83 | 58 | T | H | −0.47 | 5.00
28 | A | M | 0.01 | 2.84 | 59 | V | H | 0.11 | 5.11
29 | E | M | 0.16 | 3.00 | 60 | T | H | −0.47 | 4.64
30 | T | M | 0.02 | 3.02 | 61 | E | H | −0.14 | 4.50
31 | A | H | −0.09 | 2.93 | | | | |

a H, M, and L denote high, medium, and low stability as defined in the text and in footnote “b” of Table 3.
b Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the 1igd amino acid sequence given in the “Residue type” column to the 1igd stability profile given in the “Stability environment” column. These values are highly similar, but not identical, to the average values given in Table 3A because these values are from the scoring matrix produced when the target protein was removed from the database, as described in the text.
c Sum of all the values in the “3D-1D matrix score” column up to and including the indicated residue number. Values in boldface were used by the local alignment algorithm (Smith and Waterman 1981) to compute the optimal sequence to profile alignment.
d Data in the “Cumulative local alignment score” column were used to generate Fig. 6B.

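Table 4C. Local alignment score of 2ait sequence to 2ait stability profile

Residue number | Residue type | Stability environment (a) | 3D-1D matrix score (b) | Cumulative local alignment score (c, d) | Residue number | Residue type | Stability environment (a) | 3D-1D matrix score (b) | Cumulative local alignment score (c, d)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | D | L | −0.07 | −0.07 | 38 | E | M | 0.16 | 4.91
2 | T | L | 0.32 | 0.25 | 39 | D | M | 0.25 | 5.16
3 | T | L | 0.32 | 0.57 | 40 | D | M | 0.25 | 5.41
4 | V | L | −0.23 | 0.34 | 41 | T | M | −0.06 | 5.35
5 | S | L | −0.14 | 0.20 | 42 | E | M | 0.16 | 5.51
6 | E | L | −0.07 | 0.13 | 43 | G | M | −0.04 | 5.47
7 | P | L | 0.44 | 0.57 | 44 | L | M | 0.01 | 5.48
8 | A | M | 0.01 | 0.58 | 45 | C | M | −0.25 | 5.23
9 | P | M | −0.06 | 0.52 | 46 | Y | H | 0.44 | 5.67
10 | S | M | −0.06 | 0.46 | 47 | A | H | 0.01 | 5.68
11 | C | M | 0.25 | 0.71 | 48 | V | M | 0.11 | 5.79
12 | V | M | 0.11 | 0.82 | 49 | A | M | 0.01 | 5.80
13 | T | M | −0.06 | 0.76 | 50 | P | M | −0.06 | 5.74
14 | L | M | 0.01 | 0.77 | 51 | G | M | −0.04 | 5.70
15 | Y | H | 0.44 | 1.21 | 52 | Q | M | −0.08 | 5.62
16 | Q | H | 0.36 | 1.57 | 53 | I | M | 0.09 | 5.71
17 | S | H | 0.18 | 1.75 | 54 | T | M | −0.06 | 5.65
18 | W | H | 0.65 | 2.40 | 55 | T | M | −0.06 | 5.59
19 | R | H | 0.35 | 2.75 | 56 | V | M | 0.11 | 5.70
20 | Y | H | 0.44 | 3.19 | 57 | G | M | −0.04 | 5.66
21 | S | H | 0.18 | 3.37 | 58 | D | M | 0.25 | 5.91
22 | Q | H | 0.36 | 3.73 | 59 | G | M | −0.04 | 5.87
23 | A | H | −0.09 | 3.64 | 60 | Y | M | −0.05 | 5.82
24 | D | H | −0.24 | 3.40 | 61 | I | L | −0.32 | 5.50
25 | N | M | 0.08 | 3.48 | 62 | G | L | 0.55 | 6.05
26 | G | L | 0.55 | 4.03 | 63 | S | L | −0.14 | 5.91
27 | C | L | −0.07 | 3.96 | 64 | H | L | 0.21 | 6.12
28 | A | L | 0.07 | 4.03 | 65 | G | L | 0.55 | 6.67
29 | E | L | −0.07 | 3.96 | 66 | H | M | −0.12 | 6.55
30 | T | L | 0.32 | 4.28 | 67 | A | H | −0.09 | 6.46
31 | V | M | 0.11 | 4.39 | 68 | R | H | 0.35 | 6.81
32 | T | M | −0.06 | 4.33 | 69 | Y | H | 0.44 | 7.25
33 | V | H | 0.09 | 4.42 | 70 | L | H | 0.17 | 7.42
34 | K | H | −0.29 | 4.13 | 71 | A | H | −0.09 | 7.33
35 | V | H | 0.09 | 4.22 | 72 | R | H | 0.35 | 7.68
36 | V | H | 0.09 | 4.31 | 73 | C | H | 0.25 | 7.93
37 | Y | H | 0.44 | 4.75 | 74 | L | H | 0.17 | 8.10

a H, M, and L denote high, medium, and low stability as defined in the text and in footnote “b” of Table 3.
b Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the 2ait amino acid sequence given in the “Residue type” column to the 2ait stability profile given in the “Stability environment” column. These values are highly similar, but not identical, to the average values given in Table 3A because these values are from the scoring matrix produced when the target protein was removed from the database, as described in the text.
c Sum of all the values in the “3D-1D matrix score” column up to and including the indicated residue number. Values in boldface were used by the local alignment algorithm (Smith and Waterman 1981) to compute the optimal sequence to profile alignment.
d Data in the “Cumulative local alignment score” column were used to generate Fig. 6C.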

Fig. 1. Schematic description of the COREX algorithm applied to the crystal structure of the ovomucoid third domain, OM3 (2ovo). (A) The partitioning strategy of the COREX algorithm. Eleven partially folded microstates are explicitly shown out of the 9208 total in the calculated ensemble. Every one of the 9208 microstates contributes to the κf,j value (equation 2). Each microstate is composed of combinations of folded units (shown in gray) and unfolded units (shown in red) of the protein that are predetermined by the partitioning scheme. Five partitions are created, each composed of several consecutive five-residue units of the entire primary sequence. For example, the units of partition 1 consist of residues 1–5, 6–10, 11–15, etc. until the entire primary sequence of the protein is exhausted. Each partition unit is then incremented by one residue to increase the degrees of freedom available to the system without counting identical units repeatedly. If any unit happens to contain fewer than four residues, it is included as part of an adjacent unit; for example, the first unit of partition 2 contains residues 1–6 instead of residue 1 alone. (B) Illustration of the solvent-exposed surface area (ASA) contributions to the energetics of microstate 32. Microstate 32 consists of the five residues of unit 6 in partition 1 unfolded (Ser 26, Asp 27, Asn 28, Lys 29, Thr 30), and all other residues folded. The ASA of the five residues comprising unit 6 (shown in red, ASAunf) plus the complementary ASA created when unit 6 is removed from the protein structure (shown in orange, ASAcomp) all contribute to the ASA for this microstate. ASAunf is calculated based on each residue's exposed surface area in a structureless tripeptide conformation (Luque et al. 1996). The total ΔASA for microstate 32 is the difference between the ASA for unit 6 in the fully unfolded state (the sum of the red and orange areas) and the ASA for unit 6 in the fully folded reference state (shown in black, ASAnative).

Fig. 2. Comparison of hydrogen exchange protection factors predicted from COREX data with experimental values for ovomucoid third domain (2ovo). Unfilled and filled vertical bars denote predicted and experimental values (Swint-Kruse and Robertson 1996), respectively. The solid line denotes lnκf values. The simulated temperature of the COREX calculation was set at 30°C to match the experimental conditions. Secondary structure is given by labeled horizontal lines. Asterisks show the positions of Thr 47 and Thr 49 referred to in the text.

Fig. 3. Normalized frequencies of COREX stability data as a function of amino acid type. In each histogram, the low stability bin is on the left, the medium stability bin is in the middle, and the high stability bin is on the right. The data used in each histogram were taken from the 2951 residue data set, as given in Table 2. Error bars refer to the average ± standard deviation of the results of binning three randomizations of the data set, as described in Materials and Methods. The size of the error bars is proportional to the number of counts of each residue type in each stability class and therefore gives an estimate of the sampling error for the real data set.

Fig. 4. Scatterplot of normalized frequencies of COREX stability data versus normalized frequencies of average side-chain surface area exposure. Average side-chain exposure in the native structure was calculated by using a moving window of five residues, similar to the basis of the COREX algorithm, as described in Materials and Methods. These values were then binned into high, medium, and low surface area exposure so that the normalized frequencies could be compared with the normalized frequencies obtained from the high, medium, and low COREX stability data shown in Fig. 3. The low R² value indicates that there is no correlation between the COREX lnκf value and surface area exposure in the native structure. The dashed line denotes a perfect correlation.

Fig. 5. Summary of fold-recognition results for COREX stability and DSSP secondary structure scoring matrices for 44 targets. Black bars denote real data (either lnκf or secondary structure), and striped bars denote the average of three random data sets as described in the text. For example, a fold-recognition experiment in which the highest score against a target fold resulted from an amino acid sequence that corresponded to the target would be counted in bin 1 to 5. A perfect fold-recognition algorithm would return 44 counts in bin 1 to 5.

Fig. 6. Examples of successful local alignment for three targets: (A) 1dt4:a, neuro-oncological ventral antigen 1 fragment; (B) 1igd, protein G; and (C) 2ait, tendamistat. The amino acid sequence corresponding to the target scored highest out of all library sequences in each example and was perfectly aligned to the target profile. The thin black line represents COREX calculated stability data (lnκf) for the protein target. The filled circles connected by a thick black line correspond to the cumulative matrix score contributed by each residue. Scores that did not contribute to the final score owing to the rules of the local alignment algorithm (Smith and Waterman 1981) are shown as unfilled circles connected by a thick dashed line. Sequence and scoring data for these examples are explicitly shown in Table 4A, B, and C.

Fig. 7. Correlation between stability data derived from the database of 44 proteins used in this work and stability data derived from an independent database of 50 proteins. Data on the X-axis are taken from the normalized histograms in Fig. 3. Data on the Y-axis are derived from an identical COREX analysis of an independent database of 3304 residues from 50 PDB structures not contained in the original database. Three open circles denote the high, medium, and low stability class frequencies for His, a residue type with low statistics in both databases. The high R² value (calculated excluding the three outlying His data points) indicates that these two data sets are highly correlated, suggesting that the stability distributions shown in Fig. 3 would be relatively unaffected by the exact choice of proteins used in the analysis. The dashed line represents a perfect correlation.

Acknowledgements

We thank Jim Bowie and UCLA DOE-MBI for a copy of the PROFILESEARCH software, Robert O. Fox for stimulating discussions, and Josephine Chu-Ferreon, Jim Hamburger, Roberto Galletto, Reza Razeghifard, Steve Whitten, and John Wooll for critically reading the manuscript. This work was supported by grants from GSE, Inc. (Columbia, MD), NSF (9875689), and the Welch Foundation (H-1461).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

References

Supporting Information

Filename | Format | Size | Description
--- | --- | --- | ---
Wrabl_lnkf.txt | | 126K | Supplementary Materials (ESM) for Wrabl et al. (2000) Thermodynamic propensities of amino acids in the native state ensemble: Implications for fold recognition.
Wrabl_matrices.txt | | 25K | Supplementary Materials (ESM) for Wrabl et al. (2000) Thermodynamic propensities of amino acids in the native state ensemble: Implications for fold recognition.
Wrabl_sequence.txt | | 5K | Supplementary Materials (ESM) for Wrabl et al. (2000) Thermodynamic propensities of amino acids in the native state ensemble: Implications for fold recognition.
