Protein flexibility and intrinsic disorder


  • Predrag Radivojac,

    1. Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
  • Zoran Obradovic,

    1. Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
  • David K. Smith,

    1. Department of Biochemistry, University of Hong Kong, Hong Kong
  • Guang Zhu,

    1. Department of Biochemistry, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong
  • Slobodan Vucetic,

    1. Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
  • Celeste J. Brown,

    1. School of Molecular Biosciences, Washington State University, Pullman, WA 99164-4630, USA
    • Present addresses: IBEST, Department of Biological Sciences, University of Idaho, Moscow, ID 83844, USA;

  • J. David Lawson,

    1. School of Molecular Biosciences, Washington State University, Pullman, WA 99164-4630, USA
    • Concurrent Pharmaceuticals, 502 W. Office Center Drive, Fort Washington, PA 19034, USA;

  • A. Keith Dunker

    Corresponding author
    1. School of Molecular Biosciences, Washington State University, Pullman, WA 99164-4630, USA
    • Center for Computational Biology and Bioinformatics, Indiana University, Indianapolis, IN 46202, USA; fax: (317) 274-4686.
    • Center for Computational Biology and Bioinformatics, Indiana University, Indianapolis, IN 46202, USA.


Comparisons were made among four categories of protein flexibility: (1) low-B-factor ordered regions, (2) high-B-factor ordered regions, (3) short disordered regions, and (4) long disordered regions. Amino acid compositions of the four categories were found to be significantly different from each other, with high-B-factor ordered and short disordered regions being the most similar pair. The high-B-factor (flexible) ordered regions are characterized by a higher average flexibility index, higher average hydrophilicity, higher average absolute net charge, and higher total charge than disordered regions. The low-B-factor regions are significantly enriched in hydrophobic residues and depleted in the total number of charged residues compared to the other three categories. We examined the predictability of the high-B-factor regions and developed a predictor that discriminates between regions of low and high B-factors. This predictor achieved an accuracy of 70% and a correlation of 0.43 with experimental data, outperforming the 64% accuracy and 0.32 correlation of predictors based solely on flexibility indices. To further clarify the differences between short disordered regions and ordered regions, a predictor of short disordered regions was developed. Its relatively high accuracy of 81% indicates considerable differences between ordered and disordered regions. The distinctive amino acid biases of high-B-factor ordered regions, short disordered regions, and long disordered regions indicate that the sequence determinants for these flexibility categories differ from one another, whereas the significantly-greater-than-chance predictability of these categories from sequence suggests that flexible ordered regions, short disorder, and long disorder are, to a significant degree, encoded at the primary structure level.

The B-factor of the α-carbon and the B-factor averaged over the four backbone atoms have both been used as measures of residue flexibility of folded proteins (Karplus and Schulz 1985; Vihinen et al. 1994; Kundu et al. 2002). In crystal structures of macromolecules, the B-factor reflects the uncertainty in atom positions in the model and often represents the combined effects of thermal vibrations and static disorder (Rhodes 1993).

B-factors have been studied from a variety of viewpoints. Karplus and Schulz (1985) determined normalized α-carbon B-factors for each amino acid from which flexibility indices were calculated and subsequently used in a sliding-window prediction of the B-factor. Vihinen et al. (1994) and Smith et al. (2003) further developed the method of Karplus and Schulz (1985) and improved the correlation between predicted and experimentally determined B-factors. These flexibility indices do not indicate inherent amino acid plasticity, but rather correlate with the tendency of the side chain to be buried or exposed (Sheriff et al. 1985), which can explain, among other behaviors, the midrange index value for glycine and the high value for proline (Vihinen 1987). Indeed, Halle (2002) showed that the B-factor is inversely proportional to the atomic packing density and argued that little information on polypeptide chains is contained in B-factors apart from the atom coordinates. This theory was supported by Kundu et al. (2002), who achieved significant improvement in predicting experimental B-factors when atomic coordinates were known. Other researchers studied statistical properties of the B-factor (Altman et al. 1994; Wampler 1997) or aspects such as reliability of B-factors (Carugo and Argos 1999), use of B-factors for predicting biologically active sites (Ragone et al. 1989; Carugo and Argos 1998), and use of B-factors for characterizing protein regions (Carugo 2001).

Intrinsically disordered proteins

In addition to regions with high B-factors, crystallized proteins often contain disordered regions characterized by a lack of associated electron density. Some missing density may correspond to wobbly, ordered domains rather than to intrinsically disordered ensembles. However, the amino acid compositions of long regions of missing electron density are very similar to the amino acid compositions of disordered ensembles characterized by NMR; furthermore, predictors based on NMR-characterized disorder for the most part predict disorder for the long regions of missing electron density. Thus, as an explanation of long regions of missing electron density, wobbly, ordered domains are probably the exception rather than the rule (Garner et al. 1998).

Many other apparently noncrystallizable proteins are composed mostly of similar disordered regions, with some of these proteins lacking persistent 3-D structure along their entire lengths. Following the work of Ptitsyn and Uversky (1994), we proposed that native proteins may exist in ordered (folded, structured) and/or disordered (unfolded, unstructured) form, where the existence of disorder is determined by overall protein dynamics rather than by local secondary structure. Thus, α-helix, β-sheet, and coil, the three types of secondary structure that are characteristic of ordered chains, may also occur in regions of intrinsic disorder.

Given the strong association of disorder with function (Dunker et al. 2002a), disordered proteins are becoming the subject of increased interest (Wright and Dyson 1999; Dunker et al. 2002a; Dyson and Wright 2002; Uversky 2002b). The predictability of disordered regions from amino acid sequence (Obradovic et al. 2003), the observed compositional biases of such regions (Romero et al. 2001), the typically faster rates of evolution (Brown et al. 2002), and the distinctive amino acid substitution patterns during evolution (Radivojac et al. 2002) combine to strongly indicate that intrinsic protein disorder is generally encoded by the amino acid sequence (Dunker et al. 2002b).

Flexible ordered regions versus intrinsically disordered regions

We and others previously found significant amino acid compositional differences between regions of order and long regions of intrinsic disorder. However, regions of intrinsic disorder and regions of high B-factors (Ringe and Petsko 1986; Smith et al. 1986; Rhodes 1993) could both be associated with large thermal vibrations of individual atoms and with high intramolecular flexibility, so it is important to examine whether high-B-factor regions more closely resemble disorder or low-B-factor regions. Here, we have extended our studies to four flexibility categories: (1) low-B-factor ordered regions, (2) high-B-factor ordered regions, (3) short disordered regions, and (4) long disordered regions. In addition to comparing the local amino acid compositions, we also developed predictors of high- versus low-B-factor regions and short disordered versus ordered regions. These two predictors were compared with a predictor of long disordered regions (Vucetic et al. 2003). The results of our study indicate that the high-B-factor regions are more similar to disorder than to low-B-factor regions. Sequence determinants for the high-B-factor regions and intrinsically disordered regions are correlated, but significant differences exist between them as well.


Comparing ordered and intrinsically disordered regions

In this study, an ordered residue is considered to have a high B-factor if its normalized B-factor (Materials and Methods) is 2.0 or higher; otherwise a residue is considered to have a low B-factor. Residues of both low- and high-B-factor ordered sets were extracted from Dataset-O (Materials and Methods). Short disordered residues, that is, the disordered residues occurring in short stretches, were extracted from Dataset-SD, and long disordered residues were extracted from the previously collected Dataset-LD (Vucetic et al. 2003). The short disordered set was assembled to be similar in its length distribution to the high-B-factor ordered set, and the long disordered set was formed from unrelated proteins having disordered regions of length ≥ 30 residues.

The amino acid compositions of the low-B-factor ordered, the high-B-factor ordered, and the two intrinsically disordered sets were compared to the compositions of a reference ordered set, Globular-3D (Romero et al. 2001), in order to gain insight into the differences among these data sets (Fig. 1). Because the low- and high-B-factor sets contain about 91% and 9% of the ordered amino acids, respectively, low-B-factor order has amino acid compositions very similar to those of the reference ordered set. However, the differences from the reference ordered set, although small, are not random: Low-B-factor order is slightly enriched in almost all of the more buried residues (Fig. 1, left) and slightly depleted in three particular surface residues (Fig. 1, right): serine, glutamic acid, and lysine.

The high B-factor, short disorder, and long disorder sets exhibit similar depletions of the typically buried tryptophan, phenylalanine, tyrosine, and isoleucine, and similar enrichments in the typically exposed glutamine, glutamic acid, and lysine. The long disorder set shows much less depletion compared to the high-B-factor and short disorder sets for cysteine, valine, and leucine. The high-B-factor order set is especially enriched in asparagine and aspartic acid, the short disorder set is slightly enriched in these two residues, and the long disorder set is significantly depleted in asparagine, but not in aspartic acid. The high-B-factor and short disorder sets are both enriched in glycine, whereas the long disorder set is not. Finally, the long disorder set is more enriched in proline compared to the high-B-factor order and short disorder sets.

The four distributions can also be compared using a more rigorous statistical approach. Because there is little higher-order Markov dependence in proteins (Nevill-Manning and Witten 1999), all segments from each group can be concatenated to form four distinct samples, Sk (k = 1…4). Each sample Sk can be considered a realization of an independent and identically distributed random process that emits symbols from an alphabet of 20 amino-acid codes. To compare the four amino-acid frequency distributions, we calculated the Kullback-Leibler (KL) distance between each pair of distributions p1 and p2 as

$$d_{KL}(p_1, p_2) = \sum_{i=1}^{20} p_1(i) \, \log \frac{p_1(i)}{p_2(i)}$$

where p1(i) and p2(i) represent relative frequencies of amino acid i in samples S1 and S2. In all cases, the reference distribution p2 was chosen to be the one with fewer observations. Table 1 presents the six non-zero KL-distances among these four distributions.
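
A minimal sketch of this comparison, assuming the standard Kullback-Leibler form with p2 as the reference distribution. The base-2 logarithm, the pseudocount that guards against zero reference frequencies, and the function names are assumptions made for illustration; they are not taken from the study.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Relative frequency of each of the 20 standard amino acids in a string."""
    counts = Counter(c for c in sequence if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def kl_distance(p1, p2, pseudo=1e-9):
    """Sum over i of p1(i) * log2(p1(i) / p2(i)), with p2 as the reference.

    Assumptions of this sketch: log base 2, and a tiny pseudocount replacing
    zero reference frequencies so the ratio stays finite.
    """
    total = 0.0
    for aa in AMINO_ACIDS:
        if p1[aa] > 0:
            total += p1[aa] * math.log2(p1[aa] / max(p2[aa], pseudo))
    return total

# Toy usage: concatenated segments of each group would replace these strings.
p_small = composition("AC")      # sample with fewer observations
p_large = composition("AAAC")
distance = kl_distance(p_large, p_small)
```

Because the measure is asymmetric, the choice of reference matters; the toy call above follows the convention stated in the text of using the smaller sample as p2.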

KL-distance was also used as a test statistic to evaluate the significance of the differences between the pairs of underlying sample distributions. Using bootstrapping, we tested the null hypothesis that each pair of samples was generated from the same distribution (also given in Table 1). For the pair with the smallest KL-distance, that is, high-B-factor regions and short disordered regions, we rejected the null hypothesis with a P-value of 0.0053; the P-values for rejection of the null hypothesis for all other pairs of distributions were significantly lower. Consequently, the estimated probability distributions from Figure 1 between all four data sets are different with high confidence. Furthermore, the distances suggest that the two most similar sets are high-B-factor order and short disorder, but that these two, together with long disorder, are all closer to one another than any is to the low-B-factor order set.
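
One way such a test could be set up is sketched below. The text does not give the resampling details, so the scheme here (pool both samples, draw two pseudo-samples of the original sizes with replacement, and count how often the resampled KL-distance reaches the observed one) is an assumption, as are the function names, the iteration count, and the pseudocount.

```python
import math
import random
from collections import Counter

def kl_from_residues(sample1, sample2, pseudo=1e-9):
    """KL distance between residue frequencies of two samples (sample2 = reference)."""
    c1, c2 = Counter(sample1), Counter(sample2)
    n1, n2 = len(sample1), len(sample2)
    return sum((c / n1) * math.log2((c / n1) / max(c2[aa] / n2, pseudo))
               for aa, c in c1.items())

def bootstrap_p_value(sample1, sample2, iterations=1000, seed=0):
    """Estimate P(KL >= observed) under the null of one shared distribution."""
    rng = random.Random(seed)
    observed = kl_from_residues(sample1, sample2)
    pooled = list(sample1) + list(sample2)   # null hypothesis: a single source
    hits = 0
    for _ in range(iterations):
        r1 = rng.choices(pooled, k=len(sample1))
        r2 = rng.choices(pooled, k=len(sample2))
        if kl_from_residues(r1, r2) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (iterations + 1)
```

With this scheme the smallest reportable P-value is 1 / (iterations + 1), so resolving a value near 0.0053 needs at least a few thousand iterations.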

To further understand the distinctions among the sets, five averages were determined: segment length, flexibility index value, hydropathy, net charge, and total charge (Table 2). Flexibility indices were compared because these are the focus of the present study, whereas average hydropathy and charge were compared because these two properties have been shown to be an indicator of natively unfolded proteins (Williams 1979; Uversky 2002b). The results in Table 2 indicate, surprisingly, that high-B-factor ordered regions have a higher average flexibility index, a higher average hydrophilicity, a higher average absolute net charge, and a higher total charge than do either short or long disordered regions. The low-B-factor ordered regions are significantly enriched in hydrophobic residues and depleted in the total number of charged residues compared to the other three classes. Finally, long disordered regions differ noticeably from both short disordered and high-B-factor ordered regions as their total charge is relatively high, but their (absolute) net charge is low with high variance. This indicates an overall balance of positively and negatively charged residues in the set of long disordered segments. Further analysis, however, indicates that individual segments often have significant net positive or negative charge, which contributes to the large variance in the bootstrapping experiment, with a slightly greater occurrence of negatively charged regions.
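
The distinction between net and total charge can be illustrated with a short sketch. It assumes that lysine and arginine count as +1, aspartic acid and glutamic acid as -1, that histidine is ignored, and that values are normalized per residue; these conventions and the function name are assumptions for illustration, not the definitions used for Table 2.

```python
def charge_summary(segment):
    """Per-residue net, absolute net, and total charge of a segment.

    Assumed convention: K and R are +1, D and E are -1, all others 0.
    """
    if not segment:
        raise ValueError("segment must not be empty")
    positive = sum(segment.count(aa) for aa in "KR")
    negative = sum(segment.count(aa) for aa in "DE")
    n = len(segment)
    return {
        "net": (positive - negative) / n,
        "abs_net": abs(positive - negative) / n,
        "total": (positive + negative) / n,
    }

balanced = charge_summary("KKDDEERR")   # highly charged, but charges cancel
biased = charge_summary("KKKKAAAA")     # fewer charges, all positive
```

The first toy segment has the maximum total charge with zero net charge, which is the pattern described for the pooled long disordered set; the second shows how an individual segment can still carry a substantial net charge.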

Correlation between B-factor values

We investigated the correlation of B-factors between aligned pairs (without gaps) of highly similar protein sequences from Dataset-EO (Materials and Methods). In each iteration of our bootstrap resampling strategy, we randomly selected a set of 195 clusters of homologous sequences and drew no more than three protein pairs from each cluster. Correlation coefficients between the B-factor data for each selected pair were calculated and then averaged over all pairs classified into three ranges of sequence identity. The final estimate of correlation was obtained as the average over all bootstrap iterations within each range (Table 3). The correlation between B-factor values at aligned residues clearly decreases as sequence identity decreases, which is expected. Table 3 also illustrates the extent to which experimental conditions and crystal packing may influence B-factor values. Homologous pairs crystallized within the same space group have more highly correlated B-values than homologous pairs crystallized in different space groups.

In the next experiment, we studied the effect of normalization on discrimination between low- and high-B-factor residues and approximated the upper limit on predictability of the high B-values. Raw data and data normalized using the method of Smith et al. (2003) were dichotomized into class ‘high’ if the B-value was at least 32 Å² (raw data) or 2.0 (normalized data), and into class ‘low’ otherwise. These thresholds provided equal class ratios in both cases. For all pairs of identical sequences selected from Dataset-EO, we then compared the proportion of superimposed residues with the same class and confirmed that the normalization process significantly improves agreement between the residues (data not shown). Because experimental reproducibility limits our ability to predict B-factors, we believe that the average of the agreement for class ‘high’ (65.2%) and class ‘low’ (96.8%) sets the upper limit on predictability of the B-factor from amino acid sequence alone at approximately 81%.
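
The averaging step can be made concrete with a small sketch. The per-class agreement function below is one plausible reading of "proportion of superimposed residues with the same class" (it conditions on the class in the first structure); that definition and the function names are assumptions. The final line only reproduces the arithmetic on the two percentages quoted in the text.

```python
def class_agreement(first, second, threshold=2.0):
    """Fraction of 'high' and of 'low' residues in `first` that keep their
    class in `second`, for two aligned normalized B-factor profiles."""
    high_total = high_same = low_total = low_same = 0
    for x, y in zip(first, second):
        if x >= threshold:
            high_total += 1
            high_same += int(y >= threshold)
        else:
            low_total += 1
            low_same += int(y < threshold)
    if high_total == 0 or low_total == 0:
        raise ValueError("both classes must be present in the first profile")
    return high_same / high_total, low_same / low_total

def predictability_limit(high_agreement, low_agreement):
    """Unweighted mean of the two per-class agreements."""
    return (high_agreement + low_agreement) / 2.0

# Arithmetic on the values reported in the text: (65.2 + 96.8) / 2
limit = predictability_limit(65.2, 96.8)
```

Averaging the two classes with equal weight, rather than pooling all residues, prevents the much larger ‘low’ class from dominating the estimate.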

Predicting B-factor values

Despite the problems that arise from differences in crystal environments, B-factors show correlation with amino acid sequence, which suggests that they should be predictable from amino acid sequence. To test this hypothesis, three logistic regression models based on different attribute sets were trained to discriminate between high and low B-factors. The models were systematically evaluated for various window sizes, win and wout, and the best results were in all cases obtained for win = 1 for structural attributes, win = 5 for nonstructural attributes, and wout = 5. The three models are called the ‘NS predictor,’ which uses no structural information, the ‘KS predictor,’ which uses known secondary structure, and the ‘PS predictor,’ which uses predicted secondary structure.

The NS predictor reached 64.5% accuracy (sn = 62.8 ± 0.9, sp = 66.1 ± 0.3), the PS predictor reached 67.0% accuracy (sn = 66.8 ± 0.9, sp = 67.2 ± 0.4), and the KS predictor reached 67.8% accuracy (sn = 65.3 ± 0.8, sp = 70.3 ± 0.3). The disparity in confidence intervals is due to the difference in sizes between the two classes. Construction of nonlinear models only marginally improved prediction accuracy (64.5% for the NS, 67.2% for the PS, and 68.3% for the KS predictor). Although the models were trained only to discriminate between high- and low-B-factor regions, we found that the approximated probability that the residue has a high B-factor is well correlated with the experimental B-values. The observed correlation coefficients for the experimental data versus the raw outputs of the NS, PS, and KS predictors reached 0.34 ± 0.02, 0.38 ± 0.02, and 0.41 ± 0.02, respectively.
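
The reported accuracies coincide, to within rounding, with the mean of sensitivity and specificity, which is a natural summary when the two classes differ greatly in size. The check below uses only the numbers quoted above; treating accuracy as this balanced mean is an inference from those numbers rather than a definition stated in the text.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of per-class accuracies (sensitivity for 'high', specificity for 'low')."""
    return (sensitivity + specificity) / 2.0

# (sensitivity %, specificity %, reported accuracy %) for the linear models
reported = {
    "NS": (62.8, 66.1, 64.5),
    "PS": (66.8, 67.2, 67.0),
    "KS": (65.3, 70.3, 67.8),
}

for name, (sn, sp, acc) in reported.items():
    print(name, round(balanced_accuracy(sn, sp), 2), acc)
```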

The prediction accuracies and correlation coefficients of our B-factor predictors were compared with a predictor based only on flexibility indices by Vihinen et al. (1994), which was previously found to outperform other similar methods. The method of Vihinen et al. achieved 63.8% accuracy, and the correlation coefficient with the experimental data was 0.32 ± 0.02. Thus, our PS single-sequence predictor attained an improvement of 3.4 percentage points (5.3%) in prediction accuracy and 0.06 (19%) in correlation coefficient compared to the values obtained by Vihinen et al.

B-factor predictor with evolutionary modeling

It is well known that adding evolutionary information in the form of sequence alignments leads to improved secondary structure prediction (Benner et al. 1992; Levin et al. 1993; Rost 2001). In recent examples of this principle, Jones (1999) and Przybylski and Rost (2002) improved single-sequence prediction accuracy by 2–4 percentage points. Using a similar reasoning for B-factor prediction, we constructed protein families using PSI-BLAST and enhanced the performance of our models (Materials and Methods). The average improvement of the prediction results was 2.0 percentage points for the NS predictor and 2.5 percentage points for the PS predictor. Thus, the overall prediction accuracy reached 69.7%. We note that the higher the number of available homologs, the higher the prediction accuracy. For example, in the case in which 30 or more nonredundant homologs can be found, the average prediction accuracy reaches 70.8%. In terms of average correlation coefficients, PSI-BLAST-enhanced NS and PS predictors reached 0.36 ± 0.02 and 0.43 ± 0.02, respectively. Thus, the overall improvement over the predictor based only on flexibility indices by Vihinen et al. reached 5.9 percentage points (9.2%) in prediction accuracy and 0.11 (34.4%) in correlation coefficient. The quality of our predictions can be verified from the figure presented in the Supplemental Material.
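
The absolute and relative improvements quoted here follow directly from the reported figures, as the short check below shows. The helper name is hypothetical. The third call suggests that the single-sequence improvement of 3.4 percentage points quoted earlier appears to correspond to the 67.2% accuracy of the nonlinear PS model; that correspondence is an inference from the arithmetic, not a statement in the text.

```python
def improvement(new, baseline):
    """Return (absolute difference, relative difference in percent)."""
    absolute = new - baseline
    return absolute, 100.0 * absolute / baseline

# Baseline: flexibility-index predictor (63.8% accuracy, correlation 0.32)
acc_abs, acc_rel = improvement(69.7, 63.8)    # PSI-BLAST-enhanced PS accuracy
cor_abs, cor_rel = improvement(0.43, 0.32)    # PSI-BLAST-enhanced PS correlation
ss_abs, ss_rel = improvement(67.2, 63.8)      # single-sequence nonlinear PS accuracy

print(f"accuracy: +{acc_abs:.1f} points ({acc_rel:.1f}%)")
print(f"correlation: +{cor_abs:.2f} ({cor_rel:.1f}%)")
print(f"single sequence: +{ss_abs:.1f} points ({ss_rel:.1f}%)")
```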

Predictor-based analysis of the ordered and disordered data

To further explore the relationship between the ordered and disordered data sets that was suggested by the amino-acid frequency data, we used two predictors of intrinsic disorder: (1) a previously constructed predictor of long disordered regions, VL2 (Vucetic et al. 2003) and (2) a logistic-regression-based predictor developed here to discriminate between short disordered regions and ordered regions. The short disorder predictor, named XS1 according to our conventions (Obradovic et al. 2003), was developed from Dataset-SD and used the same set of attributes as our PS high-B-factor predictor. The maximum performance of 80.6% was obtained using win = 9 and wout = 7; the structural attributes were averaged in a window of 5.

The high-B-factor predictor, short disorder predictor, and long disorder predictor were all applied to three data sets (Dataset-O, Dataset-SD, and Dataset-LD) and the prediction results are shown in Table 4. This experiment confirmed that high-B-factors and short disorder are the most similar phenomena among the three data sets. On the other hand, VL2 performance on both B-factor and short disorder data sets was weak, in part caused by longer averaging (win = wout = 41). Correlation coefficients between predictor outputs were: 0.26 ± 0.02 between VL2 and the high-B-factor predictor, 0.31 ± 0.02 between VL2 and the short disorder predictor, and 0.88 ± 0.02 between high-B-factor and the short disorder predictor.


Properties of flexibility data

Comparing the B-factor values from highly similar pairs of crystallized chains provides evidence that flexibility is encoded at the amino-acid sequence level to a significant degree and therefore should be predictable, at some level, from the amino acid sequence (Table 3). However, because of variations that result from experimental conditions, crystal contacts, or refinement procedures, the B-factor data are noisy.

Crystal packing effects can be viewed as a special case of nonlocal interactions. Given the dependence of the B-factor on packing density (Halle 2002) and hence on nonlocal interactions, crystal packing would be expected to exert large effects on B-factor values. In agreement with this, previous comparisons indicated that different crystal forms of myohemerythrin (Sheriff et al. 1985) and myoglobin (Phillips Jr. 1990) exhibited rather low correlations in their B-values, with further confirmation on additional protein pairs (Kundu et al. 2002). Our comparisons of many similar and identical proteins in the same and different space groups show that crystal packing effects generally perturb B-factor values, and the effects can be very significant (Table 3). Overall, the B-factor perturbations arising from crystal packing effects are probably the largest source of noise in the B-factor data.

Prediction accuracy

Prediction of B-factors cannot exceed the accuracy with which B-factors can be experimentally reproduced; thus, the noise in the B-factor data sets an upper limit to prediction of flexibility. To estimate this upper limit, we collected pairs of B-factor sets from identical proteins and subjected the data to the same analysis used to compare the predicted and observed B-factor values. The results suggest that the upper limit on prediction accuracy is approximately 81%. In terms of the agreement between raw predictions and experimental values, the upper limit on the correlation coefficient is about 0.8 (Table 3). From this perspective, our achievement of about 70% accuracy and a correlation coefficient of 0.43 seems quite reasonable.

Our predictor of high B-factors joins many other machine learning tools that attempt to predict protein features from amino acid sequence (Lund et al. 1997; Blom et al. 1999; Jones 1999; Pollastri et al. 2002; Obradovic et al. 2003). Its prediction accuracy is comparable to the 64%–77% accuracies for coordination number, two-class interresidue distances, or relative solvent accessibility, and lower than the 75%–80% prediction accuracy of secondary structure or long regions of intrinsic disorder. Because flexible ordered and short disordered protein regions are frequently involved in important biological functions and they were not previously predictable from the sequence using our old predictors, we expect this B-factor predictor to be an advanced practical tool to aid in the automated discovery of short molecular recognition regions and possibly even the active sites. Moreover, the raw outputs of this predictor can be utilized in semi-automated detection of flexible ordered regions (see Supplemental Material). The correlation of the high-B-factor regions with short disordered regions may prove important in high-throughput genomewide characterization of novel proteins with unknown structure and function.

The improvement in B-factor prediction from adding either known (KS predictor) or predicted (PS predictor) secondary structure is small but significant. This improvement is related to the differences in average flexibility observed over the three structural categories (data not shown). Addition of evolutionary information obtained by PSI-BLAST alignments improves prediction of B-factors, for both the NS and PS predictors. The improvement of about three percentage points matches the increase in secondary structure prediction (Przybylski and Rost 2002). The fact that the evolutionary information improved prediction results and that the PSI-BLAST-enhanced PS predictor outperformed the KS predictor is further support for the predictability of B-factor values from amino acid sequence.

In terms of correlation coefficients, results achieved in this study exceed those obtained with other methods from the literature. Predictors by Karplus and Schulz (1985), Vihinen et al. (1994), and Smith et al. (2003) reach correlation coefficients between 0.30 and 0.33, and some earlier methods (Bhaskaran and Ponnuswamy 1988; Ragone et al. 1989) cannot surpass 0.3. On the other hand, our PS predictor reached 0.38 without the presence of evolutionary information, and, on average, homologous sequences boost the correlation coefficient to 0.43. However, the gap of 0.23 between sequence-based methods and the 0.66 found using the methods of Kundu et al. (2002), which includes known atom coordinates, is still significant.

The gap between sequence-based approaches and approaches based on atomic coordinates is likely to narrow further in the future. An immediate route is noise reduction, which can be achieved by identifying residues that are involved in crystal contacts and excluding them from model training. We believe that an improvement similar to that seen in methods based on atomic coordinates can result (Kundu et al. 2002). Additionally, because of the imbalance between the sizes of the low- and high-B-factor classes, our model was constructed using balanced data, which, in turn, led to a significant overprediction of high B-factors. In our future research, we will study ways to detect locally flexible regions based on their local and nonlocal neighborhoods and thus reduce the number of false positives output by our model.

Comparing compositions of high-B-factor ordered and intrinsically disordered proteins

Our original hypothesis was that amino acid composition determines whether a protein folds into specific 3-D structure or not. Although early indications of this idea were developed from structural studies on protein sequences (Williams 1978), we missed this original work and developed our version of this hypothesis from prior studies of lattice models of protein structure by Shakhnovich and Gutin (1993). In those lattice studies, the determination whether a lattice-model protein folds or not depended on the polar/nonpolar ratio, which corresponds to the amino acid composition in real proteins. Given a folding polar/nonpolar ratio (composition), the detailed arrangement of the amino acids indicated which fold was stabilized. Here we suggest that, not only foldability, but also flexibility is determined, to a significant degree, by the amino acid composition.

Comparison of the amino acid compositions of experimentally characterized regions of protein disorder with regions of order (Romero et al. 2001) showed that disordered proteins generally have more of the flexible amino acids as defined by the scale of Vihinen et al. (1994), suggesting that disordered regions and high-B-factor regions might be quite similar to each other. Furthermore, Romero et al. (1997) also indicated that disordered regions of different lengths might have different amino acid compositions, but the original data sets were quite small. Here, comparisons of the amino acid compositions of low- and high-B-factor regions and short and long disordered regions indicate that all four categories are distinct (Fig. 1; Tables 1, 2). Although the compositional distinctions among the high-B-factor order, short disorder, and long disorder sets might change as more data are added, we expect the overall trends indicated in Tables 1 and 2 to be maintained. This expectation is based on the observation that the current data sets are large enough already to show statistically significant distinctions.

Just as amino acid compositions vary for different types of secondary structure (Nakashima et al. 1986; Liu and Chou 1999; Cai et al. 2002), compositional differences might distinguish different types of intrinsic disorder or different types of flexible regions. For example, regions of extended disorder might be expected to be more hydrophilic than either regions of collapsed disorder or regions corresponding to the premolten-globule, if indeed this form is distinctive (Uversky 2002a). Also, there could be compositional biases in subsets of intrinsically disordered proteins that correlate with function such as enrichments in lysine and arginine for nucleic acid binding regions. Indeed, recently published work provides some support for this conjecture (Vucetic et al. 2003).

Previously we found significant amino-acid compositional differences between ordered protein and long regions of intrinsic disorder. If structure–sequence relationships existed on a continuum, then one would expect to observe monotonic increases or decreases in the various amino acid compositions as the set of interest is changed from low-B-factor regions, to high-B-factor regions, to short disordered regions and to long disordered regions. However, almost none of the amino acids exhibit monotonic changes in the order indicated. Even the global averages of Table 2 do not exhibit monotonic changes across the different flexibility/disorder classes in the order indicated. Thus, the amino acid compositions that specify flexibility and intrinsic disorder are evidently distinct and not merely quantitative differences on a continuum.

Materials and methods

Data sets

The first set of protein chains, Dataset-O, consists of 290 nonredundant sequences from the PDB (Berman et al. 2000) selected in the study of Smith et al. (2003). All crystallized chains, consisting of at least 80 amino acids, were required to have a resolution of ≤ 2 Å, and an R-factor ≤ 20%. Sequence identity within the set was limited to 25%, and only chains without nonstandard residues and missing backbone or side chain atoms were chosen, making a database of 67,552 residues in total.

The second set of protein chains, Dataset-EO, contains 1287 sequences from the PDB divided into 195 disjoint clusters of similar sequences. For each chain in a cluster there is at least one other chain with ≥ 50% sequence identity. Minimum and maximum cluster sizes are 2 and 205, and the total number of residues is 238,133. All proteins in the data set were required to have at least 50 residues and a resolution of ≤ 2 Å.

The third data set, Dataset-SD, was extracted from the PDB and contains nonredundant chains with stretches of missing coordinates no longer than 10 consecutive residues. The length limitation of 10 residues was chosen in order to make the average segment length and standard deviation comparable to the high-B-factor regions from Dataset-O. All chains from Dataset-SD were required to be at least 80 residues in length, and the maximum sequence identity between any two chains was limited to 25%. Dataset-SD contains 511 sequences with 3216 disordered residues in short stretches out of 174,301 total residues.
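The chain-level selection for Dataset-SD can be illustrated with a short sketch. It assumes a hypothetical per-residue list of booleans marking whether coordinates are present, and it assumes that a chain must contain at least one missing stretch to be included; the 25% sequence-identity filter is not shown.

```python
from itertools import groupby


def missing_runs(observed):
    """Return (start, length) for each run of residues lacking coordinates.

    `observed` is a hypothetical per-residue list of booleans:
    True if the residue has coordinates in the PDB entry, False otherwise.
    """
    runs, pos = [], 0
    for has_coords, group in groupby(observed):
        n = len(list(group))
        if not has_coords:
            runs.append((pos, n))
        pos += n
    return runs


def qualifies_for_short_disorder_set(observed, min_chain_len=80, max_run=10):
    """Chain is long enough and all of its missing stretches are short."""
    runs = missing_runs(observed)
    return (
        len(observed) >= min_chain_len
        and len(runs) > 0  # assumption: at least one disordered stretch
        and all(length <= max_run for _, length in runs)
    )
```

For example, an 85-residue chain with a single 5-residue gap would pass, whereas the same chain with a 12-residue gap would not.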

All data sets are publicly available at our Web site:

Data representation and types of predictors

To build a predictor, a machine-learning example (data point) was constructed for each residue: the corresponding C-α atom B-factor was quantized into two classes, high and low, according to a threshold and included as a binary target. To compensate for the large variability of average B-factors among proteins, C-α B-factors were normalized using the method of Smith et al. (2003) prior to quantization.
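The normalize-then-quantize step might look as follows. This is only a sketch: a plain per-chain z-score is used as a stand-in and is not necessarily the normalization of Smith et al. (2003), and the threshold value is a hypothetical placeholder.

```python
import statistics


def normalize_b_factors(b_factors):
    """Per-chain z-score normalization (a stand-in, not the cited method)."""
    mean = statistics.fmean(b_factors)
    sd = statistics.pstdev(b_factors)
    if sd == 0:
        return [0.0] * len(b_factors)
    return [(b - mean) / sd for b in b_factors]


def quantize(normalized, threshold=1.0):
    """Binary targets: 1 = high B-factor, 0 = low B-factor.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return [1 if z > threshold else 0 for z in normalized]


# Toy chain of five C-alpha B-factors with one flexible residue.
targets = quantize(normalize_b_factors([10.0, 10.0, 10.0, 10.0, 50.0]))
print(targets)  # prints [0, 0, 0, 0, 1]
```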

An attribute vector for each position in a protein was constructed from the neighboring amino acids within a symmetric input window of size win. The window was centered at the given position except near the N and C termini, where it was allowed to expand and collapse, respectively, and was no longer centered, as described in more detail previously (Vucetic et al. 2003). The first 21 attributes were the relative frequencies of the 20 amino acids within win and the K2 entropy, a measure of sequence complexity (Wootton and Federhen 1996). The last set of attributes used in the present study exploits secondary structure information. Because each residue may belong to one of three structural forms (α-helix, β-sheet, or coil), we included three structural attributes, constructed in the same way as the compositional attributes. The NMR- or X-ray-determined structures of a query sequence were used for the KS (known structure) predictor, the first of the three models built in this work. For proteins whose structure was unknown, the raw PHD secondary structure predictions (Rost et al. 1997) for the query amino acid sequence were used; we refer to the predictor using PHD scores as the PS (predicted structure) predictor. Finally, the NS (no structure) predictor, which does not exploit secondary structure information, was used for comparison purposes. It is possible to optimize the size of the input window for each attribute; however, because of the high computational requirements, the window size for the structural attributes was optimized separately from that of the remaining attributes.
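The 21 compositional attributes can be sketched as below. Two simplifications are assumed: the window is simply truncated at the termini (the expand/collapse handling described above is not reproduced), and K2 entropy is computed as the Shannon entropy, in bits, of the window composition over the 20 standard residues. The default window size is a hypothetical value.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_attributes(sequence, i, win=9):
    """Return 20 relative frequencies plus K2 entropy for position i."""
    half = win // 2
    start = max(0, i - half)                   # truncate at the N terminus
    end = min(len(sequence), i + half + 1)     # truncate at the C terminus
    window = sequence[start:end]
    counts = Counter(window)
    n = len(window)
    freqs = [counts.get(aa, 0) / n for aa in AMINO_ACIDS]
    # Shannon entropy of the window composition, in bits.
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    return freqs + [entropy]
```

The structural attributes described in the text would be built the same way, with three helix/sheet/coil frequencies computed over their own window.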

After predictions were made for each residue in a protein, the raw outputs were smoothed using a moving average postfiltering. The size of the smoothing (output) window wout was also subject to optimization.
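A moving-average postfilter of this kind can be written in a few lines. The sketch assumes a centered window that is truncated at the chain ends; the source does not state how the ends were handled, and the default wout is a hypothetical value.

```python
def moving_average(scores, wout=5):
    """Smooth raw per-residue outputs with a centered moving average."""
    half = wout // 2
    smoothed = []
    for i in range(len(scores)):
        # Window shrinks near the ends of the chain (assumed behavior).
        window = scores[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Smoothing spreads an isolated high score over its neighbors, which suppresses single-residue spikes in the raw predictor output.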

Model choice, training, and evaluation criteria

We used logistic regression for linear modeling and bagged neural networks (Breiman 1996) for nonlinear modeling. To train a predictor we applied the following procedure: The original set of 290 proteins was first randomly split into training and testing sets in the ratio 75% : 25%. From the set of training proteins we constructed examples for all available residues and then fed them to the model, which learned from a class-balanced data set. After the model was trained, we evaluated its performance on all examples from the test set. The whole process of splitting, training, and testing was repeated 30 times in all experiments.
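The split-and-balance procedure can be outlined as follows. The split is done at the protein level, as described above; balancing by random undersampling of the majority class is an assumption, since the text does not say how class balance was achieved. Function names are hypothetical.

```python
import random


def split_proteins(protein_ids, train_fraction=0.75, rng=None):
    """Randomly split proteins (not residues) into training and test sets."""
    rng = rng or random.Random()
    ids = list(protein_ids)
    rng.shuffle(ids)
    cut = int(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


def balance_classes(examples, rng=None):
    """Undersample the majority class (assumed balancing strategy).

    `examples` is a list of (attributes, label) pairs with label 0 or 1.
    """
    rng = rng or random.Random()
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced
```

Repeating `split_proteins` with 30 different random states and retraining each time would mirror the 30 repetitions reported above.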

To evaluate the performance of the predictors, we measured sensitivity (sn) and specificity (sp) for a given set of parameters. Sensitivity is defined as the percentage of high B-factors correctly predicted, and specificity is the percentage of low B-factors correctly predicted (Hastie et al. 2001). This type of model evaluation is commonly used in cases of class imbalance (Kubat et al. 1998). Assuming the class sizes are equal, the accuracy of prediction (acc) is expressed as the arithmetic mean of sensitivity and specificity. Therefore, random predictors or models that always output only one class will have an accuracy of 50%. Together with sensitivity and specificity, we also report their 95% confidence intervals calculated as ±2·s/√n, where s is the standard deviation of the estimate (sn or sp) and n is the number of experimental repetitions.
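The evaluation measures defined above translate directly into code. In this sketch, label 1 denotes a high B-factor and 0 a low B-factor; using the sample standard deviation for s is an assumption, as the text does not specify which estimator was used.

```python
import math
import statistics


def sensitivity_specificity(y_true, y_pred):
    """Return (sn, sp) in percent; class 1 = high B-factor, 0 = low."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 100.0 * tp / n_pos, 100.0 * tn / n_neg


def accuracy(sn, sp):
    """Accuracy under equal class sizes: the mean of sn and sp."""
    return (sn + sp) / 2.0


def ci_half_width(estimates):
    """Half-width of the 95% interval, 2*s/sqrt(n), over repetitions."""
    s = statistics.stdev(estimates)  # sample standard deviation (assumed)
    return 2.0 * s / math.sqrt(len(estimates))


# A model that always outputs "low" scores 50% regardless of class sizes.
sn, sp = sensitivity_specificity([1, 0, 0, 0], [0, 0, 0, 0])
print(accuracy(sn, sp))  # prints 50.0
```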

Prediction averaging over evolutionary data

Families of homologous proteins were built using PSI-BLAST queries of GenBank (Benson et al. 1999). The PSI-BLAST queries used the BLOSUM62 scoring matrix (Henikoff and Henikoff 1992) with 11/1 gap penalties, an E-value threshold of 0.0002 for including a sequence in a profile, and an E-value threshold of 0.01 for accepting it as a family member. The maximum number of iterations was limited to three in order to constrain the influence of potential false positives. Construction of profiles usually incorporates some form of weight assignment in order to limit the influence of very similar hits as well as of sequences from the “twilight zone.” As noted in the study of Altschul et al. (1997), several intuitive weighting schemes usually yield similar results. Based on these previous studies, the following simple scheme was devised: All sequences with sequence identity above 70% or below 30% in the region of the local alignment to the query sequence were discarded from the family. Additionally, no pair of homologs within a family was allowed to exceed the 70% sequence identity threshold. Pairwise sequence alignments were performed using the Smith-Waterman algorithm (Smith and Waterman 1981) with the BLOSUM62 scoring matrix and 11/1 gap penalties. The remaining sequences in each family were all assigned equal weights, and the prediction of the B-factor for the query sequence at position i was formed as an average over all proteins in the family that do not have a gap at that position.
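The identity filter and the gap-aware averaging can be sketched as follows. The data layout is hypothetical: each family member's predictions are assumed to be already mapped onto query positions, with None marking a gap. The pairwise 70% redundancy filter among homologs and the alignment step itself are not shown.

```python
def filter_family(homologs, low=30.0, high=70.0):
    """Keep homologs whose identity to the query is within [low, high] %.

    `homologs` is a list of (name, percent_identity_to_query) pairs.
    """
    return [h for h in homologs if low <= h[1] <= high]


def average_over_family(aligned_predictions):
    """Equal-weight average per query position, skipping gapped members.

    `aligned_predictions` is a list of per-position lists of equal length;
    None marks a position where that family member has a gap.
    """
    n_positions = len(aligned_predictions[0])
    averaged = []
    for i in range(n_positions):
        values = [row[i] for row in aligned_predictions if row[i] is not None]
        averaged.append(sum(values) / len(values) if values else None)
    return averaged
```

A position where every member has a gap yields None here; whether the query's own prediction was included in the average is not stated in the text, so that choice is left to the caller.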

Table 1. Kullback-Leibler distance (P-value) between estimated probability distributions of four data sets (a)

| | High-B-factor order | Short disorder | Long disorder |
|---|---|---|---|
| Low-B-factor order | 0.181 (P < 10⁻⁴) | 0.142 (P < 10⁻⁴) | 0.102 (P < 10⁻⁴) |
| High-B-factor order | | 0.012 (P = 0.0053) | 0.051 (P < 10⁻⁴) |
| Short disorder | | | 0.033 (P < 10⁻⁴) |

a Estimates of P-values were calculated using 50,000 bootstrap iterations. As a reference, KL-distances between the four distributions and the uniform distribution are: 0.16 for low-B-factor order, 0.42 for high-B-factor order, 0.40 for short disorder, and 0.32 for long disorder.

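For readers unfamiliar with the measure used in Table 1, the sketch below computes the standard Kullback-Leibler divergence between two discrete distributions, together with a symmetrized variant. Which form, and which logarithm base, underlies the tabulated values is not stated here, so both choices in the code are assumptions.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits for two distributions over the same alphabet.

    Requires q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized variant: the mean of D(p || q) and D(q || p)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Distance of a 20-letter composition from the uniform distribution.
uniform = [1.0 / 20] * 20
print(kl_divergence(uniform, uniform))  # prints 0.0
```

A bootstrap P-value of the kind reported in the table would require resampling segments from the pooled data sets, which is not shown.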
Table 2. Properties of proteins from four data sets

| | Segment length (s.d.) | Flexibility (a, b) | Hydropathy (c) | Net charge | Total charge (d) |
|---|---|---|---|---|---|
| Low B-factor order | 34.2 (35.4) | 0.996 ± 0.001 | −0.125 ± 0.041 | −0.008 ± 0.006 | 0.207 ± 0.007 |
| High B-factor order | 4.3 (3.5) | 1.027 ± 0.002 | −1.310 ± 0.084 | −0.059 ± 0.018 | 0.326 ± 0.016 |
| Short disorder | 4.6 (2.2) | 1.024 ± 0.002 | −1.175 ± 0.106 | −0.038 ± 0.023 | 0.310 ± 0.019 |
| Long disorder | 127.8 (231.7) | 1.015 ± 0.002 | −0.853 ± 0.091 | −0.005 ± 0.024 | 0.294 ± 0.017 |

a The per segment means and 95% confidence intervals for flexibility, hydropathy, and charge were calculated using bootstrapping. All regions of length 1, methionine at the N terminus, and His-tags were excluded from each data set.
b Vihinen et al. (1994).
c Kyte and Doolittle (1982).
d Calculated as the fraction of charged residues in each segment.

Table 3. Relationship between B-factors of highly similar sequences as a function of sequence identity and space groups

| Sequence identity (si) | Correlation coefficients (a): all pairs | Same space group | Different space group |
|---|---|---|---|
| si ∈ [70, 90] % | 0.59 ± 0.07; 0.56; 23 | 0.63 ± 0.09; 0.65; 10 | 0.59 ± 0.06; 0.56; 21 |
| si ∈ [90, 100] % | 0.76 ± 0.04; 0.81; 122 | 0.82 ± 0.03; 0.88; 93 | 0.61 ± 0.04; 0.61; 66 |
| si = 100% | 0.79 ± 0.02; 0.86; 290 | 0.81 ± 0.02; 0.86; 286 | 0.63 ± 0.05; 0.66; 50 |

a Average correlation coefficient ± 95% confidence intervals; median; average number of pairs.

Table 4. Prediction accuracies ± 95% confidence intervals for the PS B-factor predictor, PS predictor of short disorder, and VL2 long disorder predictor on three data sets (a)

Prediction accuracy [%]

| | Dataset-O | Dataset-SD | Dataset-LD |
|---|---|---|---|
| B-factor predictor | 65.1 ± 2.3; 68.7 ± 1.0; 66.9 | 67.4 ± 2.2; 85.2 ± 0.7; 76.3 | 59.6 ± 3.4; 64.9 ± 1.0; 62.3 |
| Short disorder predictor | 41.6 ± 2.4; 83.8 ± 0.9; 62.7 | 78.1 ± 2.8; 83.0 ± 0.6; 80.6 | 51.5 ± 4.0; 80.3 ± 1.1; 65.9 |
| Long disorder predictor | 17.7 ± 3.0; 87.2 ± 2.2; 52.6 | 33.6 ± 3.8; 82.2 ± 1.8; 57.9 | 76.2 ± 5.0; 83.8 ± 2.4; 80.0 |

a All accuracies were estimated on a per protein basis; i.e., sensitivity and specificity were calculated for all proteins and then averaged. Prediction accuracy was obtained as an average of estimated sensitivity and specificity.

Figure 1. Amino acid compositions of various data sets. The composition of each amino acid of a reference data set of ordered proteins, Globular-3D, is subtracted from the composition of the four sets described herein; thus, negative peaks indicate depletions compared to the ordered reference set, and positive peaks represent enrichments. The order of the amino acids along the x-axis is from the most buried (left) to the most exposed (right) in typical globular proteins. Error bars indicate one standard deviation. Methionine at the N terminus and His-tags were not included in calculations.


This work was supported by the following grants: NIH 1R01 LM06916, awarded to A.K.D. and Z.O.; NSF no. CSE9711532 and no. IIS-0196237 to Z.O. and A.K.D.; University of Hong Kong CRCG grant 10202779 to D.K.S.; Research Grants Council of Hong Kong grants HKUST6208/00M and 6124/02M to G.Z. We acknowledge the help of T.R. O'Connor and C.J. Oldfield, undergraduate students at Washington State University. We thank the two anonymous reviewers for their many detailed suggestions.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.