An iterative self-refining and self-evaluating approach for protein model quality estimation

Authors

  • Zheng Wang
    1. Department of Computer Science, University of Missouri, Columbia, Missouri 65211
  • Jianlin Cheng (corresponding author)
    1. Department of Computer Science, University of Missouri, Columbia, Missouri 65211
    2. Informatics Institute, University of Missouri, Columbia, Missouri 65211
    3. Christopher S. Bond Life Science Center, University of Missouri, Columbia, Missouri 65211

  • ZW is supported in part by the Paul K. and Diane Shumaker Endowment in Bioinformatics.

Abstract

Evaluating or predicting the quality of protein models (i.e., predicted protein tertiary structures) without knowing their native structures is important for selecting and properly using protein models. We describe an iterative approach that improves the performance of protein model quality assessment programs (MQAPs). Given the initial quality scores that a MQAP assigns to a list of models, the method iteratively refines the scores until the ranking of the models no longer changes. We applied the method to the model quality assessment data generated by 30 MQAPs during the Eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8). To various degrees, our method increased the average correlation between predicted and real quality scores for 25 of the 30 MQAPs and reduced the average loss (i.e., the difference in real quality between the top-ranked model and the best model) for 28 MQAPs. In particular, for MQAPs with low average correlations (<0.4), the correlation was increased severalfold. Similar experiments conducted on the CASP9 MQAPs also demonstrated the effectiveness of the method. Our method is a hybrid method that combines the original method of a MQAP with pair-wise comparison clustering. It can achieve an accuracy similar to that of a full pair-wise clustering method, but with much less computation time when evaluating hundreds of models. Furthermore, without knowing native structures, the iterative refining method can evaluate the performance of a MQAP by analyzing its model quality predictions.

Introduction

Nowadays, computer programs can generate a large number of protein models in a relatively short time, which makes protein model quality evaluation/assessment indispensable. Protein model quality assessment programs (MQAPs) can predict the qualities of protein models before their experimental structures are known, which is essential to the proper usage of the models.1–3 Current model quality assessment programs can predict both global and local qualities of one or multiple models. The methods used to predict global quality can be categorized as multiple-model (clustering) methods and single-model methods.

Multiple-model methods assess the quality of a model by measuring its similarity with other models for the same protein target through full pair-wise structural comparisons.4–10 Single-model methods directly predict the quality of a model from its structural features using machine learning, statistical, or physical methods.11–16, 21, 22 According to recent CASP experiments,17 multiple-model methods are currently more accurate than single-model methods, although they do not work well if only a small number of models are available or the structures of the input models are largely different. Another drawback is that clustering methods usually require relatively long computation time, which makes them less efficient and less practical for daily research. To address these problems, a hybrid quality assessment method18 was recently developed to integrate the strengths of the two approaches. The hybrid method first uses a single-model quality assessment method16 to generate initial quality scores for the input models, and then compares the structure of each model with those of the top-ranked models. It uses the average structural similarity score with the top-ranked models as the predicted quality score.

Here we generalize the hybrid approach and use it to refine the quality scores predicted by any MQAP. The iterative self-refining approach can consistently improve single-model MQAPs in almost all situations in about three iterations. Our results showed that instead of performing full pair-wise comparisons between models, partial pair-wise comparisons against a few top models can achieve similarly high accuracy, but with much less computation time. Moreover, for the first time, the iterative method can help evaluate the performance of a MQAP before the experimental structures are known. Although our algorithm can also generate local quality scores, in this article we mainly focus on its performance in improving global quality assessment.

Results and Discussion

We applied our iterative refinement approach to each of the MQAPs that participated in the Eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8, 2008) and the Ninth Critical Assessment of Techniques for Protein Structure Prediction (CASP9, 2010). Taking CASP8 as an example, we downloaded the predicted quality scores of more than 50,000 tertiary structure (TS) models associated with 120 CASP8 targets from the CASP8 web site. We also downloaded all the TS models and compared each of them with its true experimental structure using the tool TM-Score.19 The GDT-TS score20 resulting from the comparison is taken as the real quality score of the model. The real quality scores were used to evaluate whether the iterative quality assessment method improved the initial quality scores predicted by the CASP8 MQAPs.

We evaluated the iterative quality assessment method using the following criteria: the average and overall correlations between predicted and real GDT-TS scores, and the average loss of GDT-TS score on the top-ranked models. The average correlation is the average of the per-target Pearson correlations between predicted quality scores and real GDT-TS scores. The overall correlation is the Pearson correlation between the predicted quality scores and the real GDT-TS scores of all models of all CASP8 or CASP9 targets. The loss on a target is the difference between the real GDT-TS score of the best model and the real GDT-TS score of the top-ranked model. The average loss over all targets measures the ranking ability of a MQAP; it ideally equals zero, indicating that the program always ranks the best model first.
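As an illustration of these three criteria, the following Python sketch computes them from hypothetical dictionaries `predicted` and `real` that map each target name to parallel lists of per-model scores (an assumed data layout for illustration, not the CASP assessment code):

```python
# Minimal sketch of the three evaluation criteria. `predicted` and `real`
# are assumed to map each target name to parallel lists of per-model
# scores (hypothetical data layout, not the CASP assessment code).
from scipy.stats import pearsonr

def average_correlation(predicted, real):
    """Mean of the per-target Pearson correlations."""
    rs = [pearsonr(predicted[t], real[t])[0] for t in predicted]
    return sum(rs) / len(rs)

def overall_correlation(predicted, real):
    """Pearson correlation pooled over all models of all targets."""
    pooled_pred = [s for t in predicted for s in predicted[t]]
    pooled_real = [s for t in predicted for s in real[t]]
    return pearsonr(pooled_pred, pooled_real)[0]

def average_loss(predicted, real):
    """Mean gap between the real score of the best model and the real
    score of the model ranked first by the MQAP (0 is ideal)."""
    losses = []
    for t in predicted:
        top1 = max(range(len(predicted[t])), key=lambda i: predicted[t][i])
        losses.append(max(real[t]) - real[t][top1])
    return sum(losses) / len(losses)
```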

Table I reports the average correlation, overall correlation, and average loss of 30 CASP8 MQAPs before and after applying our refinement algorithm. The average (overall) correlations of 25 (24) of the 30 MQAPs were increased. The average losses of 28 MQAPs were reduced. According to t-tests, the P-values of observing the differences before and after refinement in average correlation, overall correlation, and average loss are less than 0.0001, 0.0001, and 0.01, respectively. The correlations of MQAPs with low initial correlation scores (<0.4), such as qa-ms-torda-server and ProtAnG_s, were increased severalfold. After refinement, the correlations of all MQAPs except one were improved to above 0.80, and the average losses of all MQAPs except two were reduced to below 0.10. One extreme example is qa-ms-torda-server, whose average correlation was improved from 0.012 to 0.767. However, we noticed that the refinement method did not improve the correlations of several clustering-based methods, probably because they had already used structural comparisons in their model evaluation process. In contrast, all the single-model methods that do not utilize structural comparisons were improved by the iterative refinement method. The same experiment was performed on 107 valid CASP9 targets (Table II). Our method improved the average correlation, overall correlation, and average loss of almost all CASP9 MQAPs that did not use structural comparisons, such as PconsR, PconsD, PRECORS-QA, ProQ, MetaMQAP, Baltymus, Distill_NNPIF, ProQ2, ConQuass, MULTICOM-NOVEL, and QMEAN. However, our method rarely improved clustering-based MQAPs that used structural comparisons, such as MULTICOM-CLUSTER, MUFOLD-QA, and QMEANclust, although it slightly reduced the average losses of ModFOLDclust2 and Pcons, two of the top pair-wise comparison methods. According to t-tests, the P-value of the improvements in average correlation and overall correlation is less than 0.1 for the CASP9 MQAPs, which is less significant than on the CASP8 data. This may be because a larger portion of the CASP9 MQAPs used structural comparisons. However, the P-value of the improvements in loss is still below the 0.05 significance level.

Table I. Average Correlation, Overall Correlation, and Average Loss of CASP8 MQAPs Before and After Iterative Refinements

| MQAP | Avg. corr. (bef.) | Avg. corr. (aft.) | Overall corr. (bef.) | Overall corr. (aft.) | Avg. loss (bef.) | Avg. loss (aft.) |
|---|---|---|---|---|---|---|
| qa-ms-torda-server | 0.012 | 0.767 | 0.110 | 0.730 | 0.483 | 0.149 |
| ProtAnG_s | 0.145 | 0.823 | 0.100 | 0.878 | 0.130 | 0.070 |
| MODCHECK-HD | 0.284 | 0.826 | 0.501 | 0.858 | 0.141 | 0.081 |
| Fiser-QA-COMB | 0.476 | 0.836 | 0.484 | 0.856 | 0.214 | 0.092 |
| Fiser-QA-FA | 0.485 | 0.822 | 0.287 | 0.834 | 0.183 | 0.105 |
| Fiser-QA | 0.523 | 0.857 | 0.506 | 0.879 | 0.176 | 0.063 |
| ModFOLD | 0.597 | 0.835 | 0.681 | 0.868 | 0.132 | 0.076 |
| SELECTpro | 0.608 | 0.805 | 0.432 | 0.844 | 0.138 | 0.093 |
| SIFT_SA | 0.623 | 0.840 | 0.459 | 0.858 | 0.102 | 0.074 |
| MUFOLD-QA | 0.633 | 0.832 | 0.576 | 0.872 | 0.108 | 0.067 |
| Pcons_ProQ | 0.652 | 0.860 | 0.652 | 0.882 | 0.114 | 0.055 |
| SIFT_consensus | 0.658 | 0.850 | 0.673 | 0.869 | 0.097 | 0.068 |
| MULTICOM-RANK | 0.665 | 0.838 | 0.705 | 0.867 | 0.069 | 0.061 |
| QMEANfamily | 0.678 | 0.847 | 0.733 | 0.869 | 0.080 | 0.058 |
| GS-MetaMQAP | 0.681 | 0.843 | 0.771 | 0.856 | 0.124 | 0.079 |
| Circle | 0.683 | 0.862 | 0.658 | 0.881 | 0.098 | 0.055 |
| QMEAN | 0.699 | 0.859 | 0.740 | 0.877 | 0.081 | 0.060 |
| MULTICOM-REFINE | 0.710 | 0.848 | 0.772 | 0.871 | 0.085 | 0.061 |
| MULTICOM-CMFR | 0.721 | 0.836 | 0.734 | 0.869 | 0.075 | 0.066 |
| Mariner2 | 0.730 | 0.813 | 0.877 | 0.889 | 0.126 | 0.068 |
| FAMSD | 0.825 | 0.856 | 0.661 | 0.880 | 0.060 | 0.058 |
| selfQMEAN | 0.833 | 0.842 | 0.893 | 0.892 | 0.071 | 0.063 |
| GS-MetaMQAPconsII | 0.838 | 0.866 | 0.829 | 0.882 | 0.074 | 0.053 |
| GS-MetaMQAPconsI | 0.860 | 0.870 | 0.855 | 0.883 | 0.072 | 0.051 |
| MULTICOM-CLUSTER | 0.865 | 0.847 | 0.878 | 0.871 | 0.064 | 0.066 |
| LEE-SERVER | 0.866 | 0.882 | 0.778 | 0.878 | 0.062 | 0.056 |
| MULTICOM | 0.879 | 0.869 | 0.891 | 0.886 | 0.050 | 0.049 |
| QMEANclust | 0.886 | 0.864 | 0.919 | 0.909 | 0.062 | 0.056 |
| ModFOLDclust | 0.894 | 0.856 | 0.891 | 0.878 | 0.053 | 0.049 |
| Pcons_Pcons | 0.900 | 0.840 | 0.886 | 0.870 | 0.055 | 0.057 |

Note: Bold fonts in the original table denote improvements. According to t-tests, the P-values of observing differences in average correlation, overall correlation, and average loss are less than 0.0001, 0.0001, and 0.01, respectively. The method "ModFOLDclust"10 is a full pair-wise clustering method that can serve as a baseline predictor for reference purposes. Our refinement method improved the performance of some single-model MQAPs, such as QMEAN, to a level close to that of ModFOLDclust.

Table II. Average Correlation, Overall Correlation, and Average Loss of CASP9 MQAPs Before and After Iterative Refinements

| MQAP | Avg. corr. (bef.) | Avg. corr. (aft.) | Overall corr. (bef.) | Overall corr. (aft.) | Avg. loss (bef.) | Avg. loss (aft.) |
|---|---|---|---|---|---|---|
| PconsR | 0.052 | 0.629 | 0.026 | 0.743 | 0.155 | 0.102 |
| PconsD | 0.119 | 0.649 | −0.158 | 0.605 | 0.168 | 0.120 |
| PRECORS-QA | 0.260 | 0.676 | 0.065 | 0.694 | 0.155 | 0.124 |
| ProQ | 0.415 | 0.777 | 0.665 | 0.684 | 0.140 | 0.092 |
| MetaMQAP | 0.583 | 0.783 | 0.744 | 0.883 | 0.143 | 0.098 |
| Baltymus | 0.586 | 0.810 | 0.573 | 0.888 | 0.117 | 0.085 |
| Distill_NNPIF | 0.601 | 0.757 | 0.626 | 0.833 | 0.128 | 0.096 |
| ProQ2 | 0.627 | 0.798 | 0.781 | 0.901 | 0.074 | 0.072 |
| ConQuass | 0.656 | 0.837 | 0.722 | 0.853 | 0.134 | 0.093 |
| MULTICOM-NOVEL | 0.662 | 0.795 | 0.767 | 0.890 | 0.101 | 0.082 |
| QMEAN | 0.685 | 0.777 | 0.808 | 0.889 | 0.108 | 0.097 |
| QMEANfamily | 0.697 | 0.805 | 0.805 | 0.904 | 0.111 | 0.088 |
| Modcheck-J2 | 0.730 | 0.799 | 0.820 | 0.884 | 0.145 | 0.093 |
| Gws | 0.769 | 0.772 | 0.868 | 0.893 | 0.110 | 0.100 |
| MQAPsingle | 0.810 | 0.766 | 0.926 | 0.906 | 0.100 | 0.097 |
| Splicer_QA | 0.818 | 0.827 | 0.885 | 0.914 | 0.079 | 0.073 |
| MULTICOM-CONSTRUCT | 0.832 | 0.806 | 0.903 | 0.898 | 0.078 | 0.078 |
| ModFOLDclustQ | 0.832 | 0.849 | 0.929 | 0.898 | 0.062 | 0.066 |
| QMEANdist | 0.833 | 0.854 | 0.788 | 0.863 | 0.066 | 0.071 |
| MULTICOM-REFINE | 0.866 | 0.821 | 0.929 | 0.918 | 0.086 | 0.083 |
| Pcomb | 0.870 | 0.862 | 0.929 | 0.892 | 0.063 | 0.061 |
| MULTICOM | 0.885 | 0.860 | 0.933 | 0.925 | 0.060 | 0.059 |
| PconsM | 0.885 | 0.838 | 0.930 | 0.893 | 0.066 | 0.066 |
| IntFOLD-QA | 0.887 | 0.870 | 0.940 | 0.912 | 0.060 | 0.058 |
| ModFOLDclust2 | 0.888 | 0.863 | 0.944 | 0.915 | 0.061 | 0.058 |
| Pcons | 0.893 | 0.851 | 0.933 | 0.881 | 0.069 | 0.066 |
| MQAPmulti | 0.895 | 0.855 | 0.932 | 0.920 | 0.064 | 0.061 |
| MetaMQAPclust | 0.896 | 0.835 | 0.936 | 0.919 | 0.064 | 0.065 |
| MULTICOM-CLUSTER | 0.916 | 0.872 | 0.947 | 0.912 | 0.059 | 0.060 |
| MUFOLD-QA | 0.920 | 0.874 | 0.941 | 0.914 | 0.062 | 0.062 |
| MUFOLD-WQA | 0.920 | 0.865 | 0.896 | 0.888 | 0.057 | 0.058 |
| QMEANclust | 0.921 | 0.865 | 0.950 | 0.917 | 0.059 | 0.061 |

Note: Bold fonts in the original table denote improvements. According to t-tests, the P-values of observing differences in average correlation, overall correlation, and average loss are less than 0.1, 0.1, and 0.05, respectively. The method "MULTICOM-CLUSTER"23 is a full pair-wise clustering method that can serve as a baseline predictor for reference purposes.

To investigate how fast the iterative QA method converges, we plotted the average loss against the iteration number for each CASP8 MQAP (Fig. 1) and CASP9 MQAP (Fig. 2). Most methods converged within the first one or two iterations (Figs. 1 and 2); at most, convergence took about five iterations. The number of iterations depends on the quality of the initial ranking: better initial rankings require fewer refinement iterations.

Figure 1.

The plot of the average losses against iterations for CASP8 MQAPs. The method "ModFOLDclust"10 is a full pair-wise clustering method that can serve as a baseline predictor for reference purposes.

Figure 2.

The plot of the average losses against iterations for CASP9 MQAPs. The method "MULTICOM-CLUSTER"23 is a full pair-wise clustering method that can serve as a baseline predictor for reference purposes.

To investigate how the number of reference models influences refinement performance, and also the efficiency of our method, we created a random MQAP on the CASP9 dataset (Fig. 3). The predicted model quality scores of this random MQAP were randomly generated; they had an average correlation of −0.00357, an overall correlation of 0.0021, and an average loss of 0.161 with respect to the true model quality scores. Models were then initially ranked by these randomly generated quality scores. After a single iteration of refinement using the top-ranked model as the reference model, our method substantially improved the average correlation to 0.667 and the overall correlation to 0.738. Moreover, when the top three ranked models were used as reference models, the average and overall correlations improved to 0.814 and 0.862, respectively, after only one iteration. The improvement continued as the number of reference models increased and began to saturate after 15–25 reference models. When the top 25 models were used as reference models, the average correlation, overall correlation, and average loss improved to 0.896, 0.940, and 0.067, respectively, much better than the initial ranking generated by the random MQAP. This performance was also close to the average correlation of 0.916, overall correlation of 0.947, and average loss of 0.059 achieved by the full pair-wise comparison method MULTICOM-CLUSTER,23 which was developed by our group and was ranked as one of the top MQAPs in CASP9 (see Table II).

Figure 3.

The average correlation, overall correlation, average loss, and average computational time under different numbers of reference models. This experiment was conducted on a MQAP whose predicted quality scores of CASP9 models were randomly generated. The predicted model quality scores had an average correlation of −0.0036 with the true model quality scores. Different numbers of reference models were tested under a single round of refinement.

We studied some cases in which our refinement method worked well or failed in the experiment on the random MQAP described above. We found that it worked well on template-based modeling (TBM) targets whose models are largely of good quality. For example, the predictions of the random MQAP had a correlation of −0.071 on the easy TBM target T0522, for which 218 of 371 models had true GDT-TS scores20 above 0.9. The GDT-TS score is a structural similarity score that ranges from 0 to 1, where 1 indicates the model is identical to the native structure and 0 indicates it is completely different. After one round of refinement using the top-ranked model as the reference model, the correlation improved to 0.985. In contrast, our refinement method did not work well on some hard targets whose models are mostly of low quality. For example, the random MQAP had an initial correlation of −0.013 on the models of target T0537, a hard target containing two free modeling (FM) domains. The best CASP9 model of this target has a GDT-TS score of 0.32, whereas all other models have GDT-TS scores below 0.3. After one round of refinement using the top-ranked model as the reference model, the correlation became −0.067. These two extreme examples suggest that, like clustering methods, the iterative refinement method works better when a large portion of the input models are of reasonable quality.

Moreover, to investigate how model rankings change during the refinement process, we calculated the average Kendall tau rank correlation and the average Spearman's rank correlation. The Kendall tau rank correlation coefficient is defined as

\tau = \frac{n_c - n_d}{\tfrac{1}{2}\, n (n - 1)}

where n_c is the number of concordant pairs of models (pairs whose relative order is the same in the two rankings), n_d is the number of discordant pairs, and n is the total number of models in the ranking. The Kendall tau rank correlation measures the level of agreement between two rankings and ranges from −1 to 1, where 1 indicates the two rankings are identical, −1 indicates one ranking is the reverse of the other, and 0 indicates the two rankings are completely independent. The Spearman's rank correlation coefficient is defined as

\rho = 1 - \frac{6 \sum_i d_i^2}{n (n^2 - 1)}

where d_i = x_i − y_i is the difference between the ranking orders of model i in the two rankings, and n is the number of models in the rankings.
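As an illustration, both coefficients can be computed directly from the formulas above. The sketch below assumes rankings without tied ranks; in practice, library routines such as scipy.stats.kendalltau and scipy.stats.spearmanr handle ties as well:

```python
# Direct implementations of the two formulas above, assuming no tied
# ranks. rank_a[i] and rank_b[i] are the positions of model i in the
# two rankings being compared.
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    n = len(rank_a)
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if s > 0:
            nc += 1  # concordant: pair ordered the same way in both rankings
        elif s < 0:
            nd += 1  # discordant: pair ordered oppositely
    return (nc - nd) / (n * (n - 1) / 2)

def spearman_rho(rank_a, rank_b):
    n = len(rank_a)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```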

The average Kendall tau and Spearman's rank correlations are plotted against the iteration number in Figures 4 and 5 for CASP8, and Figures 6 and 7 for CASP9. As with the average correlation, it took about three iterations to converge on average. In almost all cases, the biggest change in ranking occurred during the first iteration of refinement. The rank correlation between the rankings before and after the first iteration of refinement (RCBAF) is particularly interesting, since it reports the degree to which a ranking is changed by the refinement. The RCBAF of initially less accurate MQAPs (e.g., qa-ms-torda-server) is much lower than that of initially more accurate MQAPs (e.g., Pcons_Pcons, ModFOLDclust, QMEANclust, and MULTICOM). Tables III and IV report the average Spearman's and Kendall tau rank correlations before and after the first and last iterations. The RCBAF of a less accurate MQAP is relatively low (e.g., <0.5 for Spearman's and <0.4 for Kendall tau). These results suggest that the RCBAF can be used to assess the performance of a MQAP.

Figure 4.

The Kendall tau rank correlations of the rankings before and after each round of refinement for CASP8 MQAPs.

Figure 5.

The Spearman's rank correlations of the rankings before and after each round of refinement for CASP8 MQAPs.

Figure 6.

The Kendall tau rank correlations of the rankings before and after each round of refinement for CASP9 MQAPs.

Figure 7.

The Spearman's rank correlations of the rankings before and after each round of refinement for CASP9 MQAPs.

Table III. The Average Kendall Tau Rank Correlation and Average Spearman's Rank Correlation Before and After the First and Last Iterations, Tested on CASP8 MQAPs

| MQAP | Kendall tau (bef. & aft. first iter.) | Kendall tau (bef. & aft. last iter.) | Spearman (bef. & aft. first iter.) | Spearman (bef. & aft. last iter.) |
|---|---|---|---|---|
| qa-ms-torda-server | 0.153 | 0.991 | 0.214 | 0.999 |
| ProtAnG_s | 0.221 | 0.993 | 0.320 | 0.999 |
| MODCHECK-HD | 0.254 | 1.000 | 0.358 | 1.000 |
| Fiser-QA-COMB | 0.348 | 0.998 | 0.482 | 1.000 |
| Fiser-QA | 0.348 | 0.989 | 0.482 | 0.999 |
| Fiser-QA-FA | 0.375 | 0.991 | 0.521 | 0.999 |
| Pcons_ProQ | 0.438 | 0.992 | 0.597 | 0.999 |
| SIFT_SA | 0.468 | 0.987 | 0.630 | 0.998 |
| MUFOLD-QA | 0.476 | 0.996 | 0.638 | 1.000 |
| SIFT_consensus | 0.479 | 0.989 | 0.642 | 0.998 |
| SELECTpro | 0.499 | 0.987 | 0.654 | 0.999 |
| GS-MetaMQAP | 0.503 | 0.988 | 0.670 | 0.999 |
| ModFOLD | 0.504 | 0.991 | 0.668 | 0.999 |
| circle | 0.506 | 0.988 | 0.676 | 0.998 |
| MULTICOM-RANK | 0.518 | 0.990 | 0.689 | 0.998 |
| MULTICOM-CMFR | 0.528 | 1.000 | 0.696 | 1.000 |
| MULTICOM-REFINE | 0.541 | 0.983 | 0.717 | 0.997 |
| QMEAN | 0.542 | 0.992 | 0.713 | 0.999 |
| QMEANfamily | 0.567 | 0.984 | 0.741 | 0.998 |
| Mariner2 | 0.601 | 0.990 | 0.753 | 0.998 |
| selfQMEAN | 0.665 | 0.992 | 0.830 | 0.998 |
| GS-MetaMQAPconsII | 0.671 | 0.986 | 0.838 | 0.999 |
| FAMSD | 0.681 | 0.991 | 0.848 | 0.999 |
| GS-MetaMQAPconsI | 0.705 | 0.999 | 0.862 | 1.000 |
| LEE-SERVER | 0.837 | 1.000 | 0.941 | 1.000 |
| Pcons_Pcons | 0.840 | 0.989 | 0.951 | 0.998 |
| ModFOLDclust | 0.847 | 0.969 | 0.955 | 0.995 |
| QMEANclust | 0.884 | 0.961 | 0.967 | 0.994 |
| MULTICOM-CLUSTER | 0.919 | 0.978 | 0.977 | 0.999 |
| MULTICOM | 0.958 | 0.970 | 0.993 | 0.998 |

Table IV. The Average Kendall Tau Rank Correlation and Average Spearman's Rank Correlation Before and After the First and Last Iterations, Tested on CASP9 MQAPs

| MQAP | Kendall tau (bef. & aft. first iter.) | Kendall tau (bef. & aft. last iter.) | Spearman (bef. & aft. first iter.) | Spearman (bef. & aft. last iter.) |
|---|---|---|---|---|
| PconsD | 0.170 | 0.992 | 0.246 | 0.999 |
| PconsR | 0.201 | 0.975 | 0.282 | 0.994 |
| ProQ | 0.343 | 0.989 | 0.478 | 0.998 |
| PRECORS-QA | 0.411 | 0.996 | 0.555 | 1.000 |
| ConQuass | 0.424 | 0.981 | 0.578 | 0.997 |
| Baltymus | 0.438 | 0.984 | 0.595 | 0.997 |
| ProQ2 | 0.441 | 0.987 | 0.596 | 0.998 |
| MetaMQAP | 0.442 | 0.984 | 0.595 | 0.998 |
| QMEAN | 0.489 | 0.998 | 0.655 | 1.000 |
| QMEANfamily | 0.500 | 0.995 | 0.669 | 0.999 |
| Distill_NNPIF | 0.505 | 0.989 | 0.673 | 0.998 |
| MULTICOM-NOVEL | 0.512 | 0.996 | 0.682 | 1.000 |
| Modcheck-J2 | 0.571 | 0.990 | 0.725 | 0.999 |
| QMEANdist | 0.654 | 0.999 | 0.822 | 1.000 |
| Splicer_QA | 0.684 | 0.996 | 0.842 | 0.999 |
| Pcomb | 0.734 | 0.995 | 0.884 | 0.999 |
| ModFOLDclustQ | 0.763 | 0.997 | 0.897 | 1.000 |
| Gws | 0.783 | 0.999 | 0.905 | 1.000 |
| IntFOLD-QA | 0.806 | 1.000 | 0.929 | 1.000 |
| ModFOLDclust2 | 0.810 | 0.999 | 0.932 | 1.000 |
| MQAPmulti | 0.812 | 0.996 | 0.932 | 1.000 |
| Pcons | 0.820 | 0.996 | 0.941 | 0.999 |
| PconsM | 0.828 | 0.996 | 0.944 | 1.000 |
| MULTICOM-CLUSTER | 0.830 | 0.999 | 0.936 | 1.000 |
| MUFOLD-WQA | 0.839 | 0.998 | 0.951 | 1.000 |
| MetaMQAPclust | 0.847 | 0.996 | 0.951 | 1.000 |
| QMEANclust | 0.848 | 0.998 | 0.954 | 1.000 |
| MUFOLD-QA | 0.856 | 0.999 | 0.961 | 1.000 |
| MQAPsingle | 0.867 | 0.991 | 0.948 | 0.999 |
| MULTICOM-REFINE | 0.899 | 0.998 | 0.970 | 1.000 |
| MULTICOM | 0.918 | 1.000 | 0.983 | 1.000 |
| MULTICOM-CONSTRUCT | 0.936 | 1.000 | 0.984 | 1.000 |

To further verify this, we plotted the RCBAF values against the average per-target correlations of the 30 CASP8 MQAPs (Figs. 8 and 9). The average per-target correlation of a MQAP indicates its actual performance or accuracy. Figures 8 and 9 show that the RCBAF values correlate strongly with the actual accuracies of the MQAPs, with Pearson correlations of 0.965 and 0.911, respectively. These results indicate that the iterative refinement method can be used to estimate the performance of a MQAP, or the accuracy of a model ranking list, without knowing the real quality scores of the models. This could serve as a useful pre-assessment procedure for a MQAP that requires no data other than its own ranking.
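A minimal sketch of this self-assessment idea is given below, reusing the spearman_rho function defined earlier; gdt_ts(i, j) is a hypothetical helper (not provided here) that runs TM-Score on models i and j and returns their GDT-TS score:

```python
# Sketch of MQAP self-assessment via RCBAF: correlate the ranking implied
# by a MQAP's own scores with the ranking after one refinement iteration.
# A low RCBAF suggests an unreliable MQAP. `gdt_ts` is hypothetical.
def rcbaf(initial_scores, gdt_ts, n_ref=5):
    m = len(initial_scores)
    order_before = sorted(range(m), key=lambda i: initial_scores[i],
                          reverse=True)
    refs = order_before[:n_ref]  # top-ranked models become references
    refined = [sum(gdt_ts(i, r) for r in refs) / n_ref for i in range(m)]
    order_after = sorted(range(m), key=lambda i: refined[i], reverse=True)
    # Convert the two orderings into per-model rank positions.
    rank_before, rank_after = [0] * m, [0] * m
    for pos, i in enumerate(order_before):
        rank_before[i] = pos
    for pos, i in enumerate(order_after):
        rank_after[i] = pos
    return spearman_rho(rank_before, rank_after)
```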

Figure 8.

The plot of the Spearman's RCBAF values against the average per-target correlation of the 30 CASP8 MQAPs. Their Pearson's correlation is 0.965.

Figure 9.

The plot of the Kendall tau RCBAF values against the average per-target correlation of the 30 CASP8 MQAPs. Their Pearson's correlation coefficient is 0.911.

Another finding of value to the community is that instead of performing full pair-wise comparisons, partial pair-wise comparisons against a few top models can achieve similarly high accuracy. This decreases the computational complexity from the O(n²) of full pair-wise comparisons to linear O(n). This efficiency makes our method a fast and accurate alternative to full pair-wise comparison methods, particularly when evaluating a large number of models.

Conclusions

We described an iterative refinement method for improving the initial ranking quality and prediction accuracy of a MQAP. The method can improve the performance of MQAPs in terms of average correlation, overall correlation, and loss, and it is particularly effective for single-model MQAPs. Moreover, the iterative refinement method can be used to estimate the performance and accuracy of a MQAP by analyzing how much the initial ranking changes during the refinement process. Since native structures are usually unknown in practice, this unique property makes it a useful tool for self-assessing a MQAP.

Materials and Methods

The iterative quality assessment (IQA) method starts from the initial quality scores of a set of protein models. In the first round of refinement, the initial scores are used to rank all models. The top n models are selected as reference models and compared with every model using the structural comparison tool TM-Score,19 which generates a GDT-TS score20 for each comparison. The average GDT-TS score over the n reference models is used as the refined global quality score of a model. The new, presumably better, quality scores are then used to generate a new ranking of the models for the next round of refinement. The same refinement process is executed iteratively until it converges, that is, until the ranking of the models no longer changes. The average GDT-TS scores generated in the last round are used as the final global quality scores. When comparing a model to each of the n reference models in each round, TM-Score superimposes the two models and outputs the superimposed coordinates of each pair of residues. These coordinates are used to calculate residue-specific distances. The residue-specific distances averaged over the n reference models are used as the refined local quality scores, and those generated in the last round are used as the final local quality scores. The only parameter of the iterative quality assessment is n, the number of reference models, which was set to five in all of our experiments except those for Figure 3.
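The following Python sketch summarizes our reading of the IQA loop for global quality scores; gdt_ts(i, j) is the same hypothetical TM-Score wrapper as above, and max_iter is a safety bound added for the sketch, not part of the described method:

```python
# Sketch of the IQA loop for global scores. `gdt_ts(i, j)` stands in for
# running TM-Score on models i and j and reading the GDT-TS score;
# `max_iter` is an added safety bound for the sketch.
def iterative_qa(initial_scores, gdt_ts, n_ref=5, max_iter=50):
    m = len(initial_scores)
    scores = list(initial_scores)
    prev_ranking = None
    for _ in range(max_iter):
        # Rank all models by the current scores (best first).
        ranking = sorted(range(m), key=lambda i: scores[i], reverse=True)
        if ranking == prev_ranking:   # converged: ranking no longer changes
            break
        prev_ranking = ranking
        refs = ranking[:n_ref]        # top-n models become reference models
        # Refined score: average GDT-TS against the reference models.
        scores = [sum(gdt_ts(i, r) for r in refs) / len(refs)
                  for i in range(m)]
    return scores
```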

Acknowledgements

The authors thank Dr. Anna Tramontano for suggesting the use of this approach to self-evaluate the performance of a MQAP.
