Comparing ordered and intrinsically disordered regions
In this study, an ordered residue is considered to have a high B-factor if its normalized B-factor (Materials and Methods) is 2.0 or higher; otherwise a residue is considered to have a low B-factor. Residues of both low- and high-B-factor ordered sets were extracted from Dataset-O (Materials and Methods). Short disordered residues, that is, the disordered residues occurring in short stretches, were extracted from Dataset-SD, and long disordered residues were extracted from the previously collected Dataset-LD (Vucetic et al. 2003). The short disordered set was assembled to be similar in its length distribution to the high-B-factor ordered set, and the long disordered set was formed from unrelated proteins having disordered regions of length ≥ 30 residues.
The amino acid compositions of the low-B-factor ordered, the high-B-factor ordered, and the two intrinsically disordered sets were compared to the composition of a reference ordered set, Globular-3D (Romero et al. 2001), in order to gain insight into the differences among these data sets (Fig. 1). Because the low- and high-B-factor sets contain about 91% and 9% of the ordered amino acids, respectively, the low-B-factor ordered set has an amino acid composition very similar to that of the reference ordered set. However, the differences from the reference ordered set, although small, are not random: Low-B-factor order is slightly enriched in almost all of the more buried residues (Fig. 1, left) and slightly depleted in three particular surface residues (Fig. 1, right): serine, glutamic acid, and lysine.
The high B-factor, short disorder, and long disorder sets exhibit similar depletions of the typically buried tryptophan, phenylalanine, tyrosine, and isoleucine, and similar enrichments in the typically exposed glutamine, glutamic acid, and lysine. The long disorder set shows much less depletion compared to the high-B-factor and short disorder sets for cysteine, valine, and leucine. The high-B-factor order set is especially enriched in asparagine and aspartic acid, the short disorder set is slightly enriched in these two residues, and the long disorder set is significantly depleted in asparagine, but not in aspartic acid. The high-B-factor and short disorder sets are both enriched in glycine, whereas the long disorder set is not. Finally, the long disorder set is more enriched in proline compared to the high-B-factor order and short disorder sets.
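The enrichment and depletion comparisons above can be illustrated with a minimal sketch. It assumes that each set is compared to the reference through the relative difference (C_set - C_ref) / C_ref per amino acid; the exact measure plotted in Figure 1 is not stated in this section, and the function names and toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences):
    """Relative frequency of each of the 20 standard amino acids."""
    counts = Counter()
    for seq in sequences:
        # Ignore any character that is not a standard residue code.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def relative_difference(sample, reference):
    """(sample - reference) / reference; positive means enriched.

    Assumed measure, not necessarily the one used for Figure 1.
    """
    return {aa: (sample[aa] - reference[aa]) / reference[aa]
            for aa in AMINO_ACIDS if reference[aa] > 0}

# Toy data: a uniform reference and a sample with extra K and E.
reference = composition(["ACDEFGHIKLMNPQRSTVWY" * 2])
sample = composition(["ACDEFGHIKLMNPQRSTVWY", "KKEE"])
diff = relative_difference(sample, reference)
for aa in sorted(diff, key=diff.get):
    print(aa, round(diff[aa], 3))
```

In the toy data, lysine and glutamic acid come out enriched and every other residue slightly depleted, which is the kind of pattern read off Figure 1 for the real sets.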
The four distributions can also be compared using a more rigorous statistical approach. Because there is little higher-order Markov dependence in proteins (Nevill-Manning and Witten 1999), all segments from each group can be concatenated to form four distinct samples, Sk (k = 1…4). Each sample Sk can be considered a realization of an independent and identically distributed random process that emits symbols from an alphabet of 20 amino-acid codes. To compare the four amino-acid frequency distributions, we calculated the Kullback-Leibler (KL) distance between each pair of distributions p1 and p2 as
$$D_{KL}(p_1 \,\|\, p_2) = \sum_{i=1}^{20} p_1(i) \log \frac{p_1(i)}{p_2(i)}$$
where p1(i) and p2(i) represent relative frequencies of amino acid i in samples S1 and S2. In all cases, the reference distribution p2 was chosen to be the one with fewer observations. Table 1 presents the six non-zero KL-distances among these four distributions.
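As a minimal sketch of this calculation, the following assumes the usual definition of the KL distance, the sum over i of p1(i) log(p1(i)/p2(i)), with the natural logarithm; the base of the logarithm and any handling of zero counts are not given in this section. The pseudocount parameter and all function names are assumptions added for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequencies(sample, pseudocount=0.0):
    """Relative amino acid frequencies of a concatenated sample string."""
    counts = Counter(ch for ch in sample if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]

def kl_distance(p1, p2):
    """Sum of p1(i) * ln(p1(i) / p2(i)); terms with p1(i) = 0 contribute 0."""
    total = 0.0
    for a, b in zip(p1, p2):
        if a > 0:
            if b == 0:
                raise ValueError("p2 is zero where p1 is positive")
            total += a * math.log(a / b)
    return total

def kl_between_samples(sample_a, sample_b, pseudocount=1.0):
    """Use the sample with fewer observations as the reference p2."""
    if len(sample_a) < len(sample_b):
        sample_a, sample_b = sample_b, sample_a
    return kl_distance(frequencies(sample_a, pseudocount),
                       frequencies(sample_b, pseudocount))

print(kl_distance([0.5, 0.5], [0.9, 0.1]))
print(kl_between_samples("ACDKKKEEE" * 10, "ACDKE" * 5))
```

Because the KL distance is not symmetric, fixing which distribution serves as the reference is what reduces the twelve ordered pairs to six values.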
KL-distance was also used as a test statistic to evaluate the significance of the differences between the pairs of underlying sample distributions. Using bootstrapping, we tested the null hypothesis that each pair of samples was generated from the same distribution (also given in Table 1). For the pair with the smallest KL-distance, that is, high-B-factor regions and short disordered regions, we rejected the null hypothesis with a P-value of 0.0053; the P-values for all other pairs of distributions were considerably lower. Consequently, the amino acid distributions estimated for the four data sets (Fig. 1) differ from one another with high confidence. Furthermore, the distances suggest that the two most similar sets are high-B-factor order and short disorder, but that these two, together with long disorder, are all closer to one another than any is to the low-B-factor order set.
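One way such a bootstrap test could be set up is sketched below. The resampling scheme is an assumption, since the details are not given in this section: both samples are pooled, two new samples of the original sizes are drawn with replacement, and the P-value is the fraction of resampled KL-distances at least as large as the observed one. The pseudocount, the seed, and the function names are hypothetical.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def kl_from_samples(s1, s2, pseudocount=1.0):
    """KL distance between the smoothed frequencies of two samples.

    Assumes every symbol in s1 and s2 is one of the 20 standard codes.
    """
    c1, c2 = Counter(s1), Counter(s2)
    n1 = len(s1) + pseudocount * len(ALPHABET)
    n2 = len(s2) + pseudocount * len(ALPHABET)
    total = 0.0
    for aa in ALPHABET:
        p1 = (c1[aa] + pseudocount) / n1
        p2 = (c2[aa] + pseudocount) / n2
        total += p1 * math.log(p1 / p2)
    return total

def bootstrap_p_value(s1, s2, iterations=1000, seed=0):
    """Estimate P(KL >= observed) under the pooled null distribution."""
    rng = random.Random(seed)
    observed = kl_from_samples(s1, s2)
    pooled = list(s1) + list(s2)
    hits = 0
    for _ in range(iterations):
        r1 = rng.choices(pooled, k=len(s1))  # drawn with replacement
        r2 = rng.choices(pooled, k=len(s2))
        if kl_from_samples(r1, r2) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (iterations + 1)

different = bootstrap_p_value("A" * 200 + "K" * 200, "A" * 390 + "K" * 10,
                              iterations=200)
print(different)
```

With this estimator the smallest reportable P-value is 1/(iterations + 1), so resolving a value such as 0.0053 would need well over a thousand iterations.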
To further understand the distinctions among the sets, five averages were determined: segment length, flexibility index value, hydropathy, net charge, and total charge (Table 2). Flexibility indices were compared because these are the focus of the present study, whereas average hydropathy and charge were compared because these two properties have been shown to be indicators of natively unfolded proteins (Williams 1979; Uversky 2002b). The results in Table 2 indicate, surprisingly, that high-B-factor ordered regions have a higher average flexibility index, a higher average hydrophilicity, a higher average absolute net charge, and a higher total charge than do either short or long disordered regions. The low-B-factor ordered regions are significantly enriched in hydrophobic residues and depleted in the total number of charged residues compared to the other three classes. Finally, long disordered regions differ noticeably from both short disordered and high-B-factor ordered regions as their total charge is relatively high, but their (absolute) net charge is low with high variance. This indicates an overall balance of positively and negatively charged residues in the set of long disordered segments. Further analysis, however, indicates that individual segments often have significant net positive or negative charge, which contributes to the large variance in the bootstrapping experiment, with a slightly greater occurrence of negatively charged regions.
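The distinction between net and total charge can be made concrete with a short sketch. Several choices here are assumptions rather than the definitions behind Table 2: charges are normalized per residue, K and R count as +1, D and E as -1, histidine is treated as neutral, and hydropathy uses the Kyte-Doolittle scale. The function name and the toy segments are hypothetical.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KR")
NEGATIVE = set("DE")

def segment_properties(segment):
    """Per-residue averages for one segment of standard residues."""
    n = len(segment)
    if n == 0:
        raise ValueError("empty segment")
    pos = sum(ch in POSITIVE for ch in segment)
    neg = sum(ch in NEGATIVE for ch in segment)
    return {
        "length": n,
        "hydropathy": sum(KYTE_DOOLITTLE[ch] for ch in segment) / n,
        "net_charge": (pos - neg) / n,    # signed balance of charges
        "total_charge": (pos + neg) / n,  # fraction of charged residues
    }

# Two toy segments with opposite net charge but equal total charge.
segments = ["KKKKRRGG", "DDDDEEGG"]
props = [segment_properties(s) for s in segments]
mean_net = sum(p["net_charge"] for p in props) / len(props)
mean_total = sum(p["total_charge"] for p in props) / len(props)
print(mean_net, mean_total)
```

The two toy segments are strongly charged individually, yet their signed net charges cancel in the set average while the total charge stays high, which mirrors the pattern described for long disordered regions.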
Correlation between B-factor values
We investigated the correlation of B-factors between aligned pairs (without gaps) of highly similar protein sequences from Dataset-EO (Materials and Methods). In each iteration of our bootstrap resampling strategy, we randomly selected a set of 195 clusters of homologous sequences and drew no more than three protein pairs from each cluster. Correlation coefficients between the B-factor data for each selected pair were calculated and then averaged over all pairs classified into three ranges of sequence identity. The final estimate of correlation was obtained as the average over all bootstrap iterations within each range (Table 3). The correlation between B-factor values at aligned residues clearly decreases as sequence identity decreases, which is expected. Table 3 also illustrates the extent to which experimental conditions and crystal packing may influence B-factor values. Homologous pairs crystallized within the same space group have more highly correlated B-values than homologous pairs crystallized in different space groups.
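The resampling procedure can be sketched roughly as follows. This is an illustration under stated assumptions: clusters are resampled with replacement, pairs within a cluster are drawn without replacement, and the binning by sequence identity is omitted for brevity. The data layout (a list of clusters, each a list of aligned B-factor pairs) and the function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def bootstrap_mean_correlation(clusters, iterations=100, max_pairs=3, seed=0):
    """Average, over bootstrap iterations, of the mean pairwise correlation."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        # Assumption: clusters are resampled with replacement.
        sampled = rng.choices(clusters, k=len(clusters))
        values = []
        for cluster in sampled:
            # No more than max_pairs pairs are drawn from each cluster.
            pairs = rng.sample(cluster, min(max_pairs, len(cluster)))
            values.extend(pearson(b1, b2) for b1, b2 in pairs)
        estimates.append(sum(values) / len(values))
    return sum(estimates) / len(estimates)

toy_clusters = [
    [([10.0, 20.0, 35.0, 15.0], [12.0, 19.0, 30.0, 18.0])],
    [([8.0, 9.0, 30.0], [10.0, 14.0, 25.0]),
     ([22.0, 11.0, 13.0], [20.0, 15.0, 12.0])],
]
print(round(bootstrap_mean_correlation(toy_clusters), 3))
```

Capping the number of pairs per cluster keeps large families from dominating the average, which is presumably the motivation for the limit of three.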
In the next experiment, we studied the effect of normalization on discrimination between low- and high-B-factor residues and approximated the upper limit on predictability of the high B-values. Raw data and data normalized using the method of Smith et al. (2003) were dichotomized into class ‘high’ if the B-values were at least 32 Å² (raw data) or 2.0 (normalized data), and into class ‘low’ otherwise. These thresholds provided equal class ratios in both cases. For all pairs of identical sequences selected from Dataset-EO, we then compared the proportion of superimposed residues with the same class and confirmed that the normalization process significantly improves agreement between the residues (data not shown). Because experimental reproducibility limits our ability to predict B-factors, we believe that the average of the agreement between class ‘high’ (65.2%) and class ‘low’ (96.8%) sets the upper limit on predictability of the B-factor only from amino acid sequence to approximately 81%.
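The agreement calculation can be sketched as below. The per-class definition used here is an assumption: for residues labeled with a given class in one structure, it is the fraction that receive the same class in the other structure. The function names and toy B-values are hypothetical; only the threshold of 2.0 and the percentages 65.2 and 96.8 come from the text.

```python
def dichotomize(b_values, threshold):
    """True marks class 'high' (value at or above the threshold)."""
    return [b >= threshold for b in b_values]

def class_agreement(labels_a, labels_b):
    """Per-class fraction of residues in A that keep their class in B.

    Returns (high_agreement, low_agreement); an entry is None when
    structure A contains no residue of that class.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("structures must be superimposed residue by residue")
    high_total = sum(labels_a)
    low_total = len(labels_a) - high_total
    high_same = sum(1 for a, b in zip(labels_a, labels_b) if a and b)
    low_same = sum(1 for a, b in zip(labels_a, labels_b) if not a and not b)
    high = high_same / high_total if high_total else None
    low = low_same / low_total if low_total else None
    return high, low

# Toy normalized B-values for two structures of the same sequence.
a = dichotomize([0.5, 2.5, 2.1, -1.0], threshold=2.0)
b = dichotomize([0.4, 2.2, 1.5, -0.5], threshold=2.0)
print(class_agreement(a, b))

# Upper limit quoted in the text: mean of the two class agreements.
upper_limit = (65.2 + 96.8) / 2
print(round(upper_limit, 1))
```

Averaging the two class agreements weights 'high' and 'low' equally despite their very different sizes, so the 81% figure is a class-balanced bound rather than a per-residue one.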
Predicting B-factor values
Despite the problems that arise from differences in crystal environments, B-factors show correlation with amino acid sequence, which suggests that they should be predictable from amino acid sequence. To test this hypothesis, three logistic regression models based on different attribute sets were trained to discriminate between high and low B-factors. The models were systematically evaluated for various window sizes, win and wout, and the best results were in all cases obtained for win = 1 for structural attributes, win = 5 for nonstructural attributes, and wout = 5. The three models are called the ‘NS predictor,’ which uses no structural information, the ‘KS predictor,’ which uses known secondary structure, and the ‘PS predictor,’ which uses predicted secondary structure.
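How the two window sizes could enter such a model is sketched below. The interpretation is an assumption, since the definitions are in Materials and Methods: win is taken as the width over which an input attribute is averaged, and wout as the width of a moving average applied to the model outputs. The sketch uses a single attribute with hypothetical, untrained weights, whereas the actual predictors combine many attributes.

```python
import math

def window_average(values, width):
    """Centered moving average; the window is truncated at the ends."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def logistic(z):
    """Standard logistic function mapping a score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_high_b(attribute, weight, bias, w_in=5, w_out=5):
    """Approximate per-residue probability of a high B-factor.

    weight and bias are hypothetical; in practice they would be fitted
    by logistic regression on labeled residues.
    """
    smoothed_in = window_average(attribute, w_in)
    raw = [logistic(weight * x + bias) for x in smoothed_in]
    return window_average(raw, w_out)

# Toy per-residue flexibility-like attribute (hypothetical values).
attribute = [0.2, 0.3, 0.9, 1.1, 1.0, 0.4, 0.1, 0.2]
probabilities = predict_high_b(attribute, weight=3.0, bias=-2.0)
print([round(p, 2) for p in probabilities])
```

Smoothing the outputs reflects the expectation that high B-factors occur in contiguous stretches rather than at isolated residues.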
The NS predictor reached 64.5% accuracy (sn = 62.8 ± 0.9, sp = 66.1 ± 0.3), the PS predictor reached 67.0% accuracy (sn = 66.8 ± 0.9, sp = 67.2 ± 0.4), and the KS predictor reached 67.8% accuracy (sn = 65.3 ± 0.8, sp = 70.3 ± 0.3). The disparity in confidence intervals is due to the difference in sizes between the two classes. Construction of nonlinear models only marginally improved prediction accuracy (64.5% for the NS, 67.2% for the PS, and 68.3% for the KS predictor). Although the models were trained only to discriminate between high- and low-B-factor regions, we found that the approximated probability that the residue has a high B-factor is well correlated with the experimental B-values. The observed correlation coefficients for the experimental data versus the raw outputs of the NS, PS, and KS predictors reached 0.34 ± 0.02, 0.38 ± 0.02, and 0.41 ± 0.02, respectively.
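The reported accuracies coincide, to within rounding, with the mean of sensitivity and specificity, which suggests a class-balanced accuracy; this is an inference from the numbers and is not stated in this section. The sketch below shows the calculation, with hypothetical function names and toy labels.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sn, sp) for binary labels, where 1 marks class 'high'."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t and p)
    fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t and not p)
    tn = sum(1 for t, p in zip(true_labels, predicted_labels) if not t and not p)
    fp = sum(1 for t, p in zip(true_labels, predicted_labels) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return tp / (tp + fn), tn / (tn + fp)

def balanced_accuracy(sn, sp):
    """Mean of sensitivity and specificity."""
    return (sn + sp) / 2

# Compare with the values quoted in the text (all in percent).
reported = [("NS", 62.8, 66.1, 64.5),
            ("PS", 66.8, 67.2, 67.0),
            ("KS", 65.3, 70.3, 67.8)]
for name, sn, sp, accuracy in reported:
    print(name, round(balanced_accuracy(sn, sp), 2), accuracy)

sn, sp = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(sn, sp)
```

A balanced measure is a natural choice here because only about 9% of ordered residues fall in the high-B-factor class, so a plain per-residue accuracy would reward always predicting 'low'.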
The prediction accuracies and correlation coefficients of our B-factor predictors were compared with a predictor based only on flexibility indices by Vihinen et al. (1994), which was previously found to outperform other similar methods. The method of Vihinen et al. achieved 63.8% accuracy, and the correlation coefficient with the experimental data was 0.32 ± 0.02. Thus, our PS single-sequence predictor attained an improvement of 3.4 percentage points (5.3%) in prediction accuracy and 0.06 (19%) in correlation coefficient compared to the values obtained by Vihinen et al.
B-factor predictor with evolutionary modeling
It is well known that adding evolutionary information in the form of sequence alignments leads to improved secondary structure prediction (Benner et al. 1992; Levin et al. 1993; Rost 2001). In recent examples of this principle, Jones (1999) and Przybylski and Rost (2002) improved single-sequence prediction accuracy by 2–4 percentage points. Using similar reasoning for B-factor prediction, we constructed protein families using PSI-BLAST and enhanced the performance of our models (Materials and Methods). The average improvement of the prediction results was 2.0 percentage points for the NS predictor and 2.5 percentage points for the PS predictor. Thus, the overall prediction accuracy reached 69.7%. We note that the higher the number of available homologs, the higher the prediction accuracy. For example, in the case in which 30 or more nonredundant homologs can be found, the average prediction accuracy reaches 70.8%. In terms of average correlation coefficients, PSI-BLAST-enhanced NS and PS predictors reached 0.36 ± 0.02 and 0.43 ± 0.02, respectively. Thus, the overall improvement over the predictor based only on flexibility indices by Vihinen et al. reached 5.9 percentage points (9.2%) in prediction accuracy and 0.11 (34.4%) in correlation coefficient. The quality of our predictions can be verified from the figure presented in the Supplemental Material.
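The improvement figures quoted here and in the preceding section can be rechecked with simple arithmetic. The helper below is a hypothetical convenience function; all numbers are taken from the text.

```python
def improvement(new, baseline):
    """Return (absolute difference, relative difference in percent)."""
    absolute = new - baseline
    return absolute, 100.0 * absolute / baseline

# Baseline: predictor of Vihinen et al. (1994).
BASE_ACCURACY, BASE_CORRELATION = 63.8, 0.32

checks = {
    # PSI-BLAST-enhanced PS predictor.
    "accuracy, with homologs": improvement(69.7, BASE_ACCURACY),
    "correlation, with homologs": improvement(0.43, BASE_CORRELATION),
    # Single-sequence PS predictor (nonlinear model, 67.2% accuracy).
    "accuracy, single sequence": improvement(67.2, BASE_ACCURACY),
    "correlation, single sequence": improvement(0.38, BASE_CORRELATION),
}
for name, (absolute, relative) in checks.items():
    print(f"{name}: {absolute:.2f} absolute, {relative:.1f}% relative")
```

The single-sequence gain of 3.4 percentage points matches the 67.2% accuracy of the nonlinear PS model rather than the 67.0% of the logistic regression model, and the 69.7% overall accuracy likewise equals 67.2% plus the 2.5-point PSI-BLAST gain.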
Predictor-based analysis of the ordered and disordered data
To further explore the relationship between the ordered and disordered data sets that was suggested by the amino-acid frequency data, we used two predictors of intrinsic disorder: (1) a previously constructed predictor of long disordered regions, VL2 (Vucetic et al. 2003) and (2) a logistic-regression-based predictor developed here to discriminate between short disordered regions and ordered regions. The short disorder predictor, named XS1 according to our conventions (Obradovic et al. 2003), was developed from Dataset-SD and used the same set of attributes as our PS high-B-factor predictor. The maximum performance of 80.6% was obtained using win = 9 and wout = 7; the structural attributes were averaged in a window of 5.
The high-B-factor predictor, short disorder predictor, and long disorder predictor were all applied to three data sets (Dataset-O, Dataset-SD, and Dataset-LD), and the prediction results are shown in Table 4. This experiment confirmed that high B-factors and short disorder are the most similar phenomena among the three data sets. On the other hand, VL2 performance on both the B-factor and short disorder data sets was weak, in part because of its longer averaging windows (win = wout = 41). Correlation coefficients between predictor outputs were 0.26 ± 0.02 between VL2 and the high-B-factor predictor, 0.31 ± 0.02 between VL2 and the short disorder predictor, and 0.88 ± 0.02 between the high-B-factor predictor and the short disorder predictor.