COREX residue stability constants
A database of 44 nonhomologous proteins (Table 1) was analyzed using the COREX algorithm. Briefly, COREX generates an ensemble of partially unfolded microstates using the high-resolution structure of each protein as a template (Hilser and Freire 1996). This is facilitated by combinatorially unfolding a predefined set of folding units (i.e., residues 1–5 are in the first folding unit, residues 6–10 are in the second folding unit, etc.). By means of an incremental shift in the boundaries of the folding units, an exhaustive enumeration of the partially unfolded species is achieved for a given folding unit size. The entire procedure is shown schematically in Figure 1A for ovomucoid third domain (OM3), one of the proteins in the database (PDB accession code 2ovo).
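The enumeration described above can be sketched in a few lines. This is a minimal illustration, not the COREX implementation: the function names `partition` and `enumerate_microstates` are hypothetical, residues are zero-indexed, and the handling of the shortened first and last folding units is an assumption about how the boundary shift is applied.

```python
from itertools import product


def partition(n_residues, window, offset):
    """Split residues 0..n_residues-1 into consecutive folding units.

    With offset > 0 the first unit is shortened to `offset` residues,
    which shifts every later boundary (assumed boundary handling).
    """
    bounds = [0]
    if offset > 0:
        bounds.append(min(offset, n_residues))
    while bounds[-1] < n_residues:
        bounds.append(min(bounds[-1] + window, n_residues))
    return [range(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]


def enumerate_microstates(n_residues, window):
    """Return the set of microstates as tuples of booleans (True = folded).

    For each boundary offset, every combination of folded/unfolded
    folding units is generated; duplicates (for example the fully
    folded state, which occurs in every partition) are merged by the set.
    """
    states = set()
    for offset in range(window):
        units = partition(n_residues, window, offset)
        for mask in product((True, False), repeat=len(units)):
            folded = [True] * n_residues
            for unit, unit_is_folded in zip(units, mask):
                if not unit_is_folded:
                    for residue in unit:
                        folded[residue] = False
            states.add(tuple(folded))
    return states
```

For a real protein the number of states grows as 2 to the power of the number of folding units per partition, so this brute-force set is practical only for small window counts.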
For each microstate i in the ensemble, the Gibbs free energy is calculated from the surface area-based parameterization described previously (D'Aquino et al. 1996; Gomez et al. 1995; Xie and Freire 1994; Baldwin 1986; Lee et al. 1994; Habermann and Murphy 1996 [see also Fig. 1B and Materials and Methods]). The Boltzmann weight of each microstate, Ki = exp(−ΔGi/RT), is used to calculate its probability:

$$P_i = \frac{K_i}{\sum_{k} K_k} \qquad (1)$$
in which the summation in the denominator is over all microstates. From the probabilities calculated in equation 1, an important statistical descriptor of the equilibrium can be evaluated for each residue in the protein. Defined as the residue stability constant, κf,j, this quantity is the ratio of the summed probability of all states in the ensemble in which a particular residue, j, is in a folded conformation (ΣPf,j) to the summed probability of all states in which j is in an unfolded (i.e., non-folded) conformation (ΣPnf,j):

$$\kappa_{f,j} = \frac{\sum P_{f,j}}{\sum P_{nf,j}} \qquad (2)$$
From the stability constant, a residue-specific free energy can be written as:

$$\Delta G_{f,j} = -RT \ln \kappa_{f,j} \qquad (3)$$
Equation 3 reflects the energy difference between all microscopic states in which a particular residue is folded and all such states in which it is unfolded.
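The chain from microstate free energies to residue stability constants can be illustrated with a short sketch. It assumes that the free energy of each microstate (relative to the fully folded reference state, in kcal/mol) has already been computed by the surface area-based parameterization, which is not reproduced here. The function name `residue_stability` and the default temperature of 298.15 K are illustrative assumptions.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def residue_stability(states, delta_g, temperature=298.15):
    """Compute microstate probabilities, residue stability constants,
    and residue-specific free energies.

    states  : list of tuples of booleans, True where a residue is folded
    delta_g : list of microstate free energies in kcal/mol (assumed given)
    """
    rt = R_KCAL * temperature
    # Boltzmann weight of each microstate: K_i = exp(-dG_i / RT)
    weights = [math.exp(-dg / rt) for dg in delta_g]
    partition_function = sum(weights)
    probs = [w / partition_function for w in weights]

    kappa = []
    for j in range(len(states[0])):
        p_folded = sum(p for s, p in zip(states, probs) if s[j])
        p_unfolded = sum(p for s, p in zip(states, probs) if not s[j])
        # Nonzero as long as the ensemble contains a state with j unfolded
        kappa.append(p_folded / p_unfolded)

    # Residue-specific free energy: -RT ln(kappa)
    dg_residue = [-rt * math.log(k) for k in kappa]
    return probs, kappa, dg_residue
```

In a toy two-residue ensemble with states (folded, folded), (folded, unfolded), and (unfolded, unfolded), the first residue is folded in two of the three states and therefore receives the larger stability constant, even though no energy was assigned to that residue individually.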
The algorithmic basis of the residue stability constant (equation 2) warrants discussion. Although the stability constant is defined for each position, the value obtained at each residue is not the energetic contribution of that residue. Instead, the stability constant is a property of the ensemble as a whole. The energy difference between each partially unfolded microstate and the fully folded reference state is determined by the energetic contributions of all amino acids comprising the folding units that are unfolded in each microstate, plus the energetic contributions associated with exposing additional (complementary) surface area on the protein (Fig. 1B). The stability constant thus provides the average thermodynamic environment of each residue, wherein surface area, polarity, and packing are implicitly considered. This is important because it provides a thermodynamic metric wherein each of these static structural properties is weighted according to its energetic impact at each position.
As stated above, the most significant feature of the ensemble-based description provided by COREX is that the residue stability constant can be compared to hydrogen exchange protection factors. This represents an experimentally verifiable description of the energetic information calculated by the COREX algorithm. An example of such a comparison is shown for OM3 in Figure 2. The agreement in the location and relative magnitude of the protection factors with the stability constants for this and other proteins suggests that the calculated native state ensemble provides a good description of the actual ensemble (Hilser and Freire 1996). It naturally follows that the residue stability constants of a particular protein provide a good description of the thermodynamic environment of each residue in that structure.
Further inspection of Figure 2 reveals another important feature in the pattern of residue stability constants: The stability constants may vary significantly across a given secondary structural element, as observed for α helix 1 of OM3. The protection factors (and stability constants) are high at the N-terminal region of helix 1, but decrease over the length of the helix. This indicates that structural classifications, such as secondary structure type, do not obligatorily coincide with thermodynamic classifications. This result has potentially important consequences for cataloging propensities of amino acids in different environments. For example, in OM3 two threonine residues are located in different structural environments: Thr 47 is part of the loop that follows α helix 1, whereas Thr 49 is part of β strand 3. Despite the different structural environments for the two threonine residues, the stability constants show that both residues, to a first approximation, share the same thermodynamic environment.
Thermodynamic propensities of residue types
The individual residue stability constants for the 2951 residues comprising the 44 proteins in the database were sorted and empirically binned into three classifications of approximately equal size: high, medium, and low stability, as described in Materials and Methods. Statistics of amino acid type as a function of each of these stability categories were tabulated (Table 2), and normalized histograms of these numbers are shown in Figure 3. Striking asymmetries are often observed for the histograms of certain amino acids across the three stability environments. These asymmetries are well outside the standard deviation of the average of three random data sets, as indicated by the data points and error bars in Figure 3. For example, the aromatic amino acids Phe, Trp, and Tyr are mostly found in high stability environments, whereas Gly and Pro are overwhelmingly found in low stability environments. In contrast, residues such as Ala, Met, and Ser show distributions that do not significantly differ from randomized data.
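The binning and tabulation step can be sketched as follows. The actual class boundaries are defined in Materials and Methods and are not reproduced here; this sketch simply assumes rank-based tertiles, and the function names `classify_by_tertile` and `count_by_class` are hypothetical.

```python
from collections import Counter

CLASSES = ("low", "medium", "high")


def classify_by_tertile(values):
    """Assign each value to the low, medium, or high class by rank,
    giving three classes of approximately equal size (assumed cutoffs)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for rank, index in enumerate(order):
        labels[index] = CLASSES[(3 * rank) // n]
    return labels


def count_by_class(residue_types, labels):
    """Tabulate how often each amino acid type occurs in each class."""
    counts = {c: Counter() for c in CLASSES}
    for aa, label in zip(residue_types, labels):
        counts[label][aa] += 1
    return counts
```

A randomized control of the kind described above could be built by shuffling `labels` before calling `count_by_class`, which preserves class sizes while destroying any association with residue type.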
Although the acidic residues Asp and Glu share a slight tendency to be found in medium stability environments, it is observed that several amino acid pairs with nominally similar chemical characteristics partition differently in the stability environments. For example, the basic residues Arg and Lys show opposite stability characteristics: The counts for Arg increase as the stability class is incremented from low to high, but the counts for Lys decrease. Asn is found less often in high stability environments, whereas Gln is found more often in high stability environments. Although the distribution for Ser does not differ significantly from the randomized data, Thr occurs more often in low stability environments and less often in high stability environments. Somewhat surprisingly, the aliphatic amino acids Ile, Leu, and Val do not show a general pattern, except perhaps a slight disfavoring of low stability environments.
To determine whether the trends in these data are simple functions of the side-chain surface area exposure in the different regions of the protein, probability histograms (identical to Fig. 3) were generated using side-chain surface area alone, averaged over a window size of five residues as described in Materials and Methods. As shown in Figure 4, frequencies of amino acids found in COREX stability environments are not correlated with frequencies of amino acids in exposed surface area environments. This is important because it suggests that the thermodynamic information calculated by the COREX algorithm is not simply monitoring a static property of a single X-ray or NMR structure, but instead is capturing a property of the native state ensemble as a whole.
Fold-recognition experiments based on thermodynamic propensities of residue types
As the distribution of amino acids in the different stability environments is nonrandom, it was hypothesized that these propensities could be used to correctly identify sequences corresponding to structures whose thermodynamic environments are known. In other words, if the thermodynamic environments within a structure are given, can the correct sequence of that structure be identified? To test this, simple fold-recognition experiments were performed based on the information contained in the distributions. The profiling method of Eisenberg and coworkers (Gribskov et al. 1987; Bowie et al. 1991) was used, and stability profiles as a function of residue position were created for each of the 44 proteins in the database. A 3D-1D scoring matrix for each protein that contained the log-odds probabilities of finding amino acid types in the high, medium, or low stability classes was calculated from the data in Table 2 as described in Materials and Methods. An analogous scoring matrix was also calculated using only DSSP secondary structure information (Kabsch and Sander 1983). This provided a control data set of identical dimensionality (i.e., COREX matrix: 20 amino acids × three stability categories [low, med., high] vs. DSSP matrix: 20 amino acids × three secondary structure categories [α, β, other]). It is important to note that no target protein was included in the scoring matrix used for its respective fold-recognition experiment, thus negating the possibility that target information could have directly biased the results.
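A leave-one-out scoring matrix of this kind might be built as sketched below. The exact formula is given in Materials and Methods and is not reproduced here; this sketch assumes a score of the form ln[P(residue type | class) / P(residue type)], in the spirit of the 3D-1D profile method, and adds a small pseudocount (an assumption, not taken from the text) so that unobserved combinations do not produce a logarithm of zero. The function name `log_odds_matrix` and the input layout are hypothetical.

```python
import math
from collections import Counter

CLASSES = ("low", "medium", "high")


def log_odds_matrix(proteins, exclude=None, pseudocount=1.0):
    """Build a (residue type, class) -> log-odds score table.

    proteins : dict mapping protein name -> list of (residue type, class)
    exclude  : name of the target protein to leave out of the statistics
    """
    class_counts = {c: Counter() for c in CLASSES}
    for name, residues in proteins.items():
        if name == exclude:
            continue  # the target never contributes to its own matrix
        for aa, cls in residues:
            class_counts[cls][aa] += 1

    total = sum(sum(class_counts[c].values()) for c in CLASSES)
    aa_types = sorted({aa for c in CLASSES for aa in class_counts[c]})

    scores = {}
    for aa in aa_types:
        p_aa = sum(class_counts[c][aa] for c in CLASSES) / total
        for c in CLASSES:
            class_total = sum(class_counts[c].values())
            # Pseudocount distributed according to the background frequency
            p_aa_given_class = (class_counts[c][aa] + pseudocount * p_aa) / (
                class_total + pseudocount
            )
            scores[(aa, c)] = math.log(p_aa_given_class / p_aa)
    return scores
```

Calling `log_odds_matrix(proteins, exclude=target)` once per target yields 44 slightly different matrices, and averaging them gives a table comparable in form to an averaged scoring matrix with standard deviations.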
The scoring matrices derived from COREX stability and secondary structure, averaged over all 44 target proteins, are shown in Table 3, A and B, respectively. The stability matrix scores faithfully reflect the histograms shown in Figure 3; for example, Gly and Pro score unfavorably in high stability environments but score favorably in low stability environments. Similarly, the secondary structure matrix scores follow intuitive notions of secondary structure propensity; for example, Ala scores positively in helical environments, the aromatics score positively in β environments, and Gly and Pro score negatively in both α and β environments. The standard deviations in both matrices were generally small as compared to the magnitude of the scores, suggesting that the scores were not affected by the removal of any one protein from the database.
Fold-recognition experiments were performed using a local alignment algorithm based on the Smith-Waterman algorithm (Smith and Waterman 1981) as implemented in PROFILESEARCH (Bowie et al. 1991). The 44 amino acid sequences corresponding to the structures used in the database were each threaded against the 44 target profiles of COREX stability or DSSP secondary structure. The local alignment algorithm was used to optimally score the target profile versus each amino acid sequence. As a control, three randomized datasets of COREX stability and secondary structure were used to construct scoring matrices in an identical manner, and the results from these control experiments were averaged. The fold-recognition protocol was therefore designed so that the only variable between any set of experiments was the scoring matrix.
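The threading step can be illustrated with a simplified Smith-Waterman scorer. This is not the PROFILESEARCH implementation: it assumes a single linear gap penalty (the value 2.0 is arbitrary), treats the profile as a list of class labels, and looks up scores in a dictionary keyed by (residue type, class). The function names are hypothetical.

```python
def local_alignment_score(profile, sequence, scores, gap=2.0):
    """Best local alignment score of a class profile against a sequence.

    profile  : list of class labels, one per structural position
    sequence : amino acid sequence (string of one-letter codes)
    scores   : dict mapping (residue type, class) -> score
    """
    previous = [0.0] * (len(sequence) + 1)
    best = 0.0
    for i in range(1, len(profile) + 1):
        current = [0.0] * (len(sequence) + 1)
        for j in range(1, len(sequence) + 1):
            pair = scores.get((sequence[j - 1], profile[i - 1]), 0.0)
            current[j] = max(
                0.0,                     # start a new local alignment
                previous[j - 1] + pair,  # align residue j to position i
                previous[j] - gap,       # skip a profile position
                current[j - 1] - gap,    # skip a sequence residue
            )
            best = max(best, current[j])
        previous = current
    return best


def rank_of_native(profile, native_name, library, scores, gap=2.0):
    """Rank (1 = best) of the native sequence among all library sequences."""
    ranked = sorted(
        library,
        key=lambda name: local_alignment_score(profile, library[name], scores, gap),
        reverse=True,
    )
    return ranked.index(native_name) + 1
```

Because only the `scores` table changes between the COREX, DSSP, and randomized experiments, the same two functions would serve all three, which mirrors the single-variable design described above.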
The results of the fold-recognition experiments are shown in Figure 5, and at least two conclusions can be drawn from these data. First, scoring matrices composed of either COREX stability or DSSP secondary structure data performed better than randomized data sets in matching a structural target to its amino acid sequence. In Figure 5, the results for COREX data are stacked toward the left (successful) side of the rankings, whereas the randomized data approach a bell-shaped distribution with a maximum near the median of the size of the sequence datasets (approximately 22 for the library size of 44 sequences). Second, the number of targets falling in the most successful bin was similar (approximately one-third to one-half of the total number of targets) for both the COREX stability and secondary structure matrices, suggesting that COREX stability propensities alone contain comparable fold-specification information to secondary structure propensities.
Because the local alignment algorithm used here computes a score without returning the complete alignment of profile to sequence, high scores may have been possible from nonstructurally significant local alignments. In other words, it is possible that a correct sequence may have scored well against its corresponding target structure without having placed the individual amino acids in their correct positions within the structure. To assess the extent of local alignments that were structurally significant, minor modifications were made to the PROFILESEARCH source code to save the traceback of the alignment matrix. It was found that for targets scoring poorly in the fold-recognition rankings, local alignments of the corresponding sequence were often not significant. However, sequences that scored in the top two bins were often found to be completely and correctly aligned with their target profiles, even though not all of their residues contributed to the overall score because of the rules of the local algorithm. Three examples of successful alignment based on COREX stability data alone are shown in Figure 6, A, B, and C, and Table 4, A, B, and C, for the targets neuro-oncological ventral antigen 1 fragment (1dt4:a), protein G (1igd), and tendamistat (2ait), respectively. The alignments calculated using the local algorithm were correct, despite the fact that no sequence information about the target was used, and that only a subset of the amino acid sequence was used in the scoring. In addition, it is noteworthy that the success of these examples is not attributable merely to a small fragment of the sequence, as the cumulative 3D-1D matrix scores steadily increased over the entire length of the sequence.
To test whether the results were dependent on the particular choice of proteins used in the database, a second, independent database was assembled from 50 proteins found in the PDBSelect database (Hobohm and Sander 1994) that were not contained in the original 44 protein database used above. The 3304 residues from these additional proteins were binned into high, medium, and low stability classes, and the normalized frequencies of stability as a function of residue type were compared to those shown in Figure 3. A scatterplot of these results, given in Figure 7, clearly shows that the two databases are correlated. The major outliers in the correlation are attributable to residue types with low statistics, such as His. These results are interpreted as evidence that the COREX stability data calculated in this work were not fundamentally biased by the original choice of 44 proteins.
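The database comparison amounts to correlating two sets of normalized frequencies. The sketch below assumes that normalization means the fraction of each residue type falling in each stability class; the normalization actually used for Figure 3 is not specified in this section, and the function names are hypothetical.

```python
import math


def normalized_frequencies(counts):
    """Convert per-residue-type counts into fractions per class.

    counts : dict mapping residue type -> (low, medium, high) counts
    Returns a dict mapping (residue type, class index) -> fraction.
    """
    freqs = {}
    for aa, triple in counts.items():
        total = sum(triple)
        for k, n in enumerate(triple):
            freqs[(aa, k)] = n / total
    return freqs


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def compare_databases(counts_a, counts_b):
    """Correlate normalized frequencies over the keys shared by both databases."""
    fa, fb = normalized_frequencies(counts_a), normalized_frequencies(counts_b)
    keys = sorted(set(fa) & set(fb))
    return pearson([fa[k] for k in keys], [fb[k] for k in keys])
```

Residue types with few observations contribute noisy fractions, which is consistent with low-count types such as His appearing as outliers in a scatterplot of this kind.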