Keywords:

  • CASP;
  • protein structure prediction;
  • template-based protein modeling;
  • numerical evaluation measures

Abstract

The strategy for evaluating template-based models submitted to CASP has continuously evolved from CASP1 to CASP5, leading to a standard procedure that has been used in all subsequent editions. The established approach includes methods for calculating the quality of each individual model, for assigning scores based on the distribution of the results for each target and for computing the statistical significance of the differences in scores between prediction methods. These data are made available to the assessor of the template-based modeling category, who uses them as a starting point for further evaluations and analyses. This article describes the detailed workflow of the procedure, provides justifications for a number of choices that are customarily made for CASP data evaluation, and reports the results of the analysis of template-based predictions at CASP8. Proteins 2009. © 2009 Wiley-Liss, Inc.


INTRODUCTION

The CASP experiments have been instrumental in fostering the development of novel prediction methods and in establishing reliable measures for numerical assessment of the submitted three-dimensional models of proteins. Different evaluation criteria have been tested in CASP throughout the years; some of those have been identified as suitable for an automated standard analysis. The Protein Structure Prediction Center performs numerical evaluation of the CASP models according to these established criteria [1] and makes the results available to the community via the CASP web site. These data are usually the assessors' starting point for the official analysis of the structure prediction results.

Several numerical evaluation measures can give a reasonable estimate of the similarity between a model and the corresponding experimental structure. However, they cannot always be used directly and automatically to rank models according to their accuracy. For example, models of targets for which no clear evolutionarily related templates can be identified might be quite far from the experimental structure and thereby achieve very low scores. On the other hand, careful visual inspection might highlight cases where these models, although far from perfect, do correctly reproduce important features of the target protein: the overall fold, proper secondary structure arrangements, correct inter-residue contacts, and so forth. For template-based predictions, though, numerical scores are sufficiently informative to confidently compare the quality of the models and therefore evaluate the effectiveness of the corresponding prediction methods.

This article discusses the standard measures that the template-based modeling (TBM) assessors used in previous CASPs to assess model quality and compare group performance. We also describe here the results of their application to the CASP8 predictions for the TBM category.

METHODS: STANDARD EVALUATION MEASURES AND PROCEDURES

The most relevant issue that every CASP assessor has to deal with is the choice of a scoring scheme and of the appropriate metrics for comparing models and targets. Although no measure is better than the others in all cases, a number of them are sufficiently reliable to provide correct model quality estimates and have indeed been extensively used in CASP.

RMSD

The root mean square deviation (RMSD) was the metric used in CASP1-3 [2-4] and its use is still widespread among computational biologists due to its conceptual simplicity. It is a very effective measure for comparing rather similar conformations, such as different experimental determinations of the same protein under different conditions, or different models in an NMR ensemble. RMSD is, however, not ideal for comparing substantially different structures, for several reasons. First, its quadratic nature penalizes errors severely, that is, a few local structural differences can result in high RMSD values. Second, it depends on the number of equivalent atom pairs and thus tends to increase with protein size. Finally, and probably most importantly, the end user of a model is typically more interested in which regions are sufficiently close to the native structure than in how incorrect the very wrong parts of the model are, yet it is the latter that affect the RMSD most dramatically.
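The quadratic penalty is easy to see in a minimal sketch (Python). Note the simplifying assumption: the two coordinate sets are taken as already optimally superposed, whereas a real evaluation would first perform a least-squares superposition (e.g., with the Kabsch algorithm).

```python
import math

def rmsd(coords_model, coords_target):
    """RMSD between two equal-length lists of (x, y, z) coordinates,
    assumed already optimally superposed."""
    if len(coords_model) != len(coords_target):
        raise ValueError("coordinate lists must have equal length")
    sq = sum((a - b) ** 2
             for m, t in zip(coords_model, coords_target)
             for a, b in zip(m, t))
    return math.sqrt(sq / len(coords_model))

# A single 3 A error on one of four residues already dominates the score:
print(rmsd([(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)],
           [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 3, 0)]))  # 1.5
```

Three perfectly placed residues and one 3 Å error still yield an RMSD of 1.5 Å, illustrating how a few wrong regions dominate the measure.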

GDT-TS and GDT-HA

To overcome the RMSD shortcomings, a new threshold-based measure, GDT-TS [5], was developed and first used by the comparative modeling (CM) assessor in CASP4 [6,7]. GDT-TS is the average of the maximum percentages of residues in the prediction deviating from the corresponding residues in the target by no more than a specified Cα distance cut-off, computed over four different LGA [8] sequence-dependent superpositions with distance thresholds of 1, 2, 4, and 8 Å. By averaging over a relatively wide range of distance cut-offs, GDT-TS rewards models with a roughly correct fold, while scoring highest those perfectly reproducing the target main chain conformation. For the purpose of automatic evaluation of the overall quality of a model, GDT-TS proved to be one of the most appropriate measures and has been used by the assessors of all CASP experiments after CASP4. In CASP6 and CASP7, a modification of GDT-TS, GDT-HA, was also used by the assessors for the analysis of high accuracy template-based modeling targets [9,10]. GDT-HA uses thresholds of 0.5, 1, 2, and 4 Å, thus allowing better detection of small differences in model backbone quality.
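For illustration only, the scoring step can be sketched as follows. This is a simplification: the real GDT-TS maximizes each per-threshold residue count over many LGA superpositions, whereas this sketch assumes a single fixed superposition and takes the per-residue Cα distances as given.

```python
def gdt_ts(dists):
    """Simplified GDT-TS: `dists` are per-residue model-to-target Ca
    distances (A) under ONE fixed superposition. The real measure
    maximizes each count over many LGA superpositions."""
    counts = [sum(d <= cut for d in dists) for cut in (1, 2, 4, 8)]
    return 100.0 * sum(counts) / (4 * len(dists))

def gdt_ha(dists):
    """High-accuracy variant: same average with the thresholds halved."""
    counts = [sum(d <= cut for d in dists) for cut in (0.5, 1, 2, 4)]
    return 100.0 * sum(counts) / (4 * len(dists))

dists = [0.4, 0.9, 1.5, 3.0, 7.0]  # five residues
print(gdt_ts(dists))  # 70.0
print(gdt_ha(dists))  # 50.0
```

The same five distances score 70.0 under GDT-TS but only 50.0 under GDT-HA, showing how the tighter thresholds expose small backbone inaccuracies.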

AL0

Another historical accuracy measure in CASP is the AL0 score, representing the percentage of correctly aligned residues after the LGA sequence-independent superposition of the model and the experimental structure with a threshold of 5 Å. A residue in the model is considered correctly aligned if its Cα atom is within 3.8 Å from the position of the corresponding experimental atom and no other Cα atom is closer. Even though conceptually different from GDT-TS, these two measures are highly correlated.

Other evaluation measures

In recent years, other measures [11-14] have been developed that take into account the peculiarities of the comparison between a model and a structure as opposed to the comparison of two experimental structures. Each of these measures has its value and indeed some of them have been used in CASP6-8 assessments.

Z-scores

In the numerical evaluation procedure of the CASP models, GDT-TS, GDT-HA, AL0, and other related parameters are computed for each model. Each prediction method could therefore be ranked after combining the values of the submitted models over all targets. The weakness of such a procedure is that it treats all targets equally: targets differ in difficulty, and the same difference in scores between models should not be given the same weight for two targets of different difficulty. The problem was addressed by the CM assessor in CASP4 by introducing Z-scores [7]. This strategy implicitly takes into account the predictive difficulty of a target, as the normalized score reflects the relative quality of the model with respect to the results of other predictors. Notably, Z-scores can also be computed for non-normal distributions, although in this case the standard normal probability table cannot be used, and it is indeed not used in CASP. The use of Z-scores instead of raw scores proved very effective for analyzing relative model quality, although the results should be taken with a grain of salt for targets for which very few groups generated good models, as this can lead to an overestimation of those groups' performance.
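Per-target Z-scores can be sketched as follows; the use of the sample standard deviation here is an illustrative choice, not something mandated by the CASP procedure.

```python
import statistics

def z_scores(scores):
    """Z-score of each model's score relative to all models submitted
    for the same target (sample standard deviation used here)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - mean) / sd for s in scores]

# GDT-TS values of three models for one target:
print(z_scores([70.0, 60.0, 50.0]))  # [1.0, 0.0, -1.0]
```

The same 10-point GDT-TS gap translates into a large Z-score difference on a target where all models cluster tightly, and a small one on a target where the scores are widely spread, which is exactly the difficulty weighting described above.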

Ranking procedures

Although using Z-scores for analyzing model quality and relative group performance became a common practice in CASP, the specific details of the scoring schemes are left to the assessors. In previous CASPs, the approaches used by the TBM assessors—formerly CM and fold recognition (FR) assessors—slightly differed in the choices for the following alternatives:

  1. Use all submitted models for calculating the means and standard deviations needed for the Z-score computations, or ignore outliers (and if so, how are outliers defined?).
  2. Set negative Z-scores to zero, or not.
  3. Rank by the sum of Z-scores, or by their average over the number of predicted targets.
  4. Base the ranking scheme on Z-scores from a single evaluation measure, or combine Z-scores from independent evaluation measures.

There are both advantages and potential pitfalls in these choices as we will briefly discuss below.

  1. One of the potential problems in the use of Z-scores is that the basic statistical parameters of the distribution of the selected evaluation score might be influenced by some extremely bad models. These can arise, for example, because of bugs in some of the servers participating in the experiment or because of unintentional human errors. In particular, very short “models” consisting of just a few residues can be found among the CASP predictions. To eliminate the effect of these unrealistic models on the scoring system, outliers might be excluded from the datasets used for calculating the final mean and standard deviation values. The CASP6 FR assessor considered models shorter than 20 residues as outliers [15]. All other TBM assessors (starting from CASP5) chose to curate the data by removing models whose score is lower than the mean of the distribution of all the values for the specific target by more than two standard deviations.
  2. One of the aims of CASP is to foster the development of novel methods in the field. Previous assessors judged that some scoring schemes might be less appropriate than others for encouraging predictors to test riskier approaches. For example, a scoring scheme based on summing all Z-scores can discourage predictors from submitting models for more challenging targets. Indeed, incorrect models, more likely to appear in these cases, would obtain negative Z-scores, leading to a lower overall score for the submitting group. One way to avoid this potential problem is to set negative Z-scores to 0, in other words, to assign incorrect models the average score for that target. This technique was suggested by the CM assessor in CASP4 [7], and has been used by all assessors since, except the CASP6 FR assessor.
  3. For ranking purposes, the Z-scores of the models submitted by each group need to be summed or averaged over the number of predicted domains. This choice is clearly irrelevant if all groups predict the same set of targets. When this is not the case, the ranking can be affected by the choice: summing penalizes groups who did not submit models for all targets, while averaging might penalize those who submit models for a larger number of targets, even if negative Z-scores are set to 0. The CM assessors in CASP4-7 [7,9,16,17] preferred averaging the scores (not considering groups who submitted a very small number of predictions), while the FR assessors in CASP5 [18] and CASP6 [15] tried both averaging and summing.
  4. A combination of the Z-scores derived from several measures was used by the FR assessors in CASP5 [18] and CASP6 [15], while Z-scores from a single measure, always GDT-TS, were used by the CM assessors in CASP4-7 [7,9,16,17]. The GDT-TS, AL0, and GDT-HA measures are all strongly correlated, and the value of computing all of them mostly resides in highlighting potential inconsistencies among them.
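Alternatives 1-3 can be sketched on invented numbers. The two-standard-deviation trimming, the clipping of negative Z-scores, and the sum-versus-average choice are all shown; the score values are toy data, and the sketch assumes the kept scores are not all identical.

```python
import statistics

def trimmed_z(scores):
    """Z-scores with the outlier handling used by most TBM assessors:
    models more than two standard deviations below the mean are removed
    before the final mean/SD are computed; the removed models still
    receive their (very negative) Z-scores."""
    mean, sd = statistics.mean(scores), statistics.stdev(scores)
    kept = [s for s in scores if s >= mean - 2 * sd]
    mean, sd = statistics.mean(kept), statistics.stdev(kept)
    return [(s - mean) / sd for s in scores]

def group_score(z_by_target, clip_negative=True, average=True):
    """Combine one group's per-target Z-scores into a ranking score."""
    zs = [max(z, 0.0) if clip_negative else z for z in z_by_target]
    return sum(zs) / len(zs) if average else sum(zs)

# Averaging can penalize a group that also tries hard targets,
# even with negative Z-scores clipped to zero:
cautious = [0.5, 0.6]        # predicted only two easy targets
brave = [0.5, 0.6, -0.4]     # the same two, plus one hard target
print(round(group_score(cautious), 3), round(group_score(brave), 3))  # 0.55 0.367
```

The toy run reproduces the trade-off discussed in point 3: with averaging, the group that attempted the hard target is ranked below the cautious one despite identical results on the shared targets.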

Model_1

CASP rules allow up to five models to be submitted for the same target. Predictors are informed that only the model designated as first will be used in the standard ranking, as any other choice would lead to unfair comparisons. A “select the best of the five models” strategy would give an advantage to groups submitting more predictions, as they would be more likely to submit a better model simply because of larger sampling. On the other hand, an “average over all predictions” strategy might disadvantage groups using the additional models to test novel and riskier methods.

Statistical comparison of group performance

A sensitive and important issue concerns the evaluation of the statistical significance of the differences in the scores of different groups. The CASP5 CM assessor introduced the use of a paired t-test between the results of each pair of groups [16]. Notice that groups are not ranked according to the t-test and each pair is compared independently; therefore, there is no multiple-testing issue. One potential problem is that the t-test is based on an assumption of normality of the distributions being compared, and one should verify that this holds in the experiment. If not, a nonparametric test, such as the Wilcoxon signed rank test, should be used.
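The paired t statistic on the per-target score differences of two groups can be computed directly; the sketch below stops at the statistic, which would then be compared with a Student-t critical value for n-1 degrees of freedom. The two score lists are invented GDT-TS values on five common targets.

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired t statistic for two groups' scores on the same targets.
    Compare |t| with the Student-t critical value for n-1 degrees of
    freedom; if the differences are clearly non-normal, a Wilcoxon
    signed rank test is the safer choice."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean / se

group_a = [70.0, 65.0, 80.0, 75.0, 60.0]  # hypothetical GDT-TS values
group_b = [68.0, 60.0, 77.0, 70.0, 59.0]
print(paired_t(group_a, group_b))  # about 4.0, above 2.78 (df = 4, two-sided 5%)
```

Because the test is paired, only the per-target differences matter: group A is consistently a few GDT-TS points ahead, so the difference is significant even though the raw score ranges overlap.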

CASP8 evaluation of template-based models

The overall evaluation procedure is summarized in Figure 1. Once the parameters used in the evaluation (highlighted in italics) are selected, the calculations are straightforward, and the results are provided to the template-based modeling assessor as soon as the target structures and their dissection into prediction units [19] are available.

Figure 1. Flowchart of the procedure used for evaluation. Steps in italics depend on the assessor's preferences.

In the analysis of CASP8 template-based models described here, we adopted the parameters most often used by the assessors in the previous CASPs.

  1. The GDT-TS measure was used as the basic measure for comparing models and experimental structures. The GDT-TS values are computed using LGA in sequence-dependent mode.*
  2. Models shorter than 20 residues were removed from the dataset. If several independent segments were submitted for the same prediction unit, the frame with the largest number of residues was selected as the representative model.
  3. Z-scores were calculated based on the GDT-TS (and other) measures without further data curation (data reported on the web). The Z-scores reported in this article were calculated after removal of the models with values more than two standard deviations below the mean.
  4. Negative Z-scores were set to zero.
  5. Groups were ranked according to the average of the GDT-TS-based Z-scores for the models designated as first by the predictors.
  6. The normality of the GDT-TS distributions for each target was evaluated using the Shapiro-Wilk test [20].
  7. The statistical significance of the differences between the GDT-TS values of the models was assessed with a suitable paired test of hypothesis for all pairs of groups on the common set of predicted targets.
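Steps 1-5 above can be strung together on toy data. The input layout (a mapping from target to per-group model length and first-model GDT-TS) and the target names are hypothetical, chosen for illustration; the real pipeline works from LGA output for every submitted model.

```python
import statistics

def rank_groups(gdt, min_len=20):
    """End-to-end sketch of steps 1-5. `gdt` maps
    target -> {group: (model_length, GDT-TS of the model designated
    as first)}; this data layout is an assumption for illustration."""
    z_by_group = {}
    for target, models in gdt.items():
        # Step 2: drop models shorter than `min_len` residues.
        scored = {g: s for g, (length, s) in models.items() if length >= min_len}
        scores = list(scored.values())
        # Step 3: Z-scores after removing models > 2 SD below the mean.
        mean, sd = statistics.mean(scores), statistics.pstdev(scores)
        kept = [s for s in scores if s >= mean - 2 * sd] or scores
        mean, sd = statistics.mean(kept), statistics.pstdev(kept)
        for g, s in scored.items():
            z = (s - mean) / sd if sd else 0.0
            z_by_group.setdefault(g, []).append(max(z, 0.0))  # step 4
    # Step 5: rank by the average Z-score over predicted targets.
    return sorted(z_by_group, key=lambda g: statistics.mean(z_by_group[g]),
                  reverse=True)

toy = {
    "T1-D1": {"A": (120, 80.0), "B": (120, 60.0), "C": (120, 40.0)},
    "T2-D1": {"A": (95, 70.0), "B": (95, 70.0), "C": (10, 90.0)},
}
print(rank_groups(toy))  # C's 10-residue fragment for T2-D1 is ignored
```

The population standard deviation is used here; that is a sketch-level choice, as is falling back to the untrimmed scores when trimming would empty a target's dataset.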

It should be noted that in CASP8 targets were split into two categories: (1) targets for prediction by all groups (human/server targets) and (2) targets for server prediction only (server-only targets). In all, the TBM category encompassed 154 assessment units [19], 64 of which were human/server domains while the remaining 90 were server-only. All groups (server and human-expert) were ranked according to their results on the subset of 64 human/server domains, while server groups were also ranked on the complete list of 154 domains.

RESULTS

As an illustration of the evaluation strategy described in Methods, we show here the results of the automatic analysis performed on the template-based predictions in CASP8. Since these data are reported here, they will not be included in the TBM assessor paper [21], which will instead concentrate on more detailed evaluations of the structural features of the submitted models.

Table I shows the correlation between the Z-scores obtained using GDT-TS, GDT-HA, and AL0 for the groups participating in CASP8. The measures are highly correlated for both sets of targets (“Human and Server” and “Server only”); therefore, in the following we discuss only the GDT-TS results. The results obtained using the other scoring schemes are available on the CASP web site.

Table I. Agreement Between Group Rankings Based on Different Model Quality Measures

Dataset                        Measure 1             Measure 2             ρ
All groups,                    Mean AL0 Z-score      Mean GDT-TS Z-score   0.97
“Human and server” targets     Mean AL0 Z-score      Mean GDT-HA Z-score   0.96
                               Mean GDT-TS Z-score   Mean GDT-HA Z-score   0.99
Server groups,                 Mean AL0 Z-score      Mean GDT-TS Z-score   0.97
all targets (human and         Mean AL0 Z-score      Mean GDT-HA Z-score   0.95
server plus server only)       Mean GDT-TS Z-score   Mean GDT-HA Z-score   0.98

Note: Spearman's correlation (ρ) between the Z-scores obtained by each group using different measures. The data are reported for both the “human and server” subset and for the complete set of targets.

Table II illustrates the results obtained by all the groups submitting predictions. The server results are evaluated on the complete set of assessment units, while the results of all groups are computed for the subset of “Human and Server” targets.

Table II. Average Z-Scores Based on GDT-TS for Individual Prediction Groups

Rank | Group name | Group id | No. targets (H/S) | Mean Z (H/S) | No. targets (all) | Mean Z (all) | Rank (servers, all targets)

Note: Mean Z-score of the participating groups after setting negative Z-scores to 0. Data for human predictors are computed on the subset of “Human and server” targets, while the results of the servers are reported both for this subset (to allow a proper comparison with human groups) and for the whole set of assessment units. Rows are ordered by the Z-scores on the “Human and Server” subset; the rank of servers on the complete set of targets is reported in the last column.

1 | IBT_LT | 283 | 64 | 1.11 | - | - | -
2 | DBAKER | 489 | 64 | 1.03 | - | - | -
3 | Zhang | 71 | 64 | 0.94 | - | - | -
4 | Zhang-Server | 426 | 64 | 0.84 | 154 | 0.89 | 1
5 | KudlatyPredHuman | 267 | 18 | 0.83 | - | - | -
6 | TASSER | 57 | 64 | 0.83 | - | - | -
7 | fams-ace2 | 434 | 64 | 0.83 | - | - | -
8 | ZicoFullSTP | 196 | 64 | 0.81 | - | - | -
9 | SAM-T08-human | 46 | 62 | 0.80 | - | - | -
10 | Zico | 299 | 64 | 0.78 | - | - | -
11 | MULTICOM | 453 | 64 | 0.78 | - | - | -
12 | GeneSilico | 371 | 64 | 0.76 | - | - | -
13 | ZicoFullSTPFullData | 138 | 64 | 0.75 | - | - | -
14 | LEE-SERVER | 293 | 39 | 0.75 | 97 | 0.80 | 2
15 | McGuffin | 379 | 63 | 0.73 | - | - | -
16 | 3DShot1 | 282 | 64 | 0.73 | - | - | -
17 | Sternberg | 202 | 64 | 0.72 | - | - | -
18 | Jones-UCL | 387 | 64 | 0.72 | - | - | -
19 | mufold | 310 | 61 | 0.71 | - | - | -
20 | FAMS-multi | 266 | 64 | 0.70 | - | - | -
21 | Elofsson | 200 | 64 | 0.68 | - | - | -
22 | Chicken_George | 81 | 64 | 0.67 | - | - | -
23 | 3DShotMQ | 419 | 64 | 0.66 | - | - | -
24 | Bates_BMM | 178 | 64 | 0.65 | - | - | -
25 | SAMUDRALA | 34 | 53 | 0.63 | - | - | -
26 | HHpred5 | 12 | 64 | 0.61 | 154 | 0.64 | 5
27 | LevittGroup | 442 | 62 | 0.61 | - | - | -
28 | BAKER-ROBETTA | 425 | 64 | 0.60 | 154 | 0.57 | 8
29 | RAPTOR | 438 | 64 | 0.59 | 154 | 0.69 | 3
30 | LEE | 407 | 64 | 0.59 | - | - | -
31 | MidwayFolding | 208 | 63 | 0.57 | - | - | -
32 | Phyre_de_novo | 322 | 64 | 0.56 | 154 | 0.67 | 4
33 | Ozkan-Shell | 485 | 24 | 0.55 | - | - | -
34 | HHpred4 | 122 | 64 | 0.54 | 154 | 0.56 | 10
35 | ABIpro | 340 | 64 | 0.54 | - | - | -
36 | sessions | 139 | 4 | 0.52 | - | - | -
37 | MUSTER | 408 | 64 | 0.51 | 154 | 0.47 | 20
38 | METATASSER | 182 | 64 | 0.51 | 154 | 0.62 | 7
39 | Pcons_multi | 429 | 62 | 0.50 | 151 | 0.51 | 13
40 | pro-sp3-TASSER | 409 | 64 | 0.50 | 154 | 0.63 | 6
41 | TsaiLab | 230 | 4 | 0.49 | - | - | -
42 | fais@hgc | 198 | 51 | 0.48 | - | - | -
43 | A-TASSER | 149 | 64 | 0.47 | - | - | -
44 | ricardo | 403 | 12 | 0.46 | - | - | -
45 | circle | 396 | 61 | 0.45 | 150 | 0.40 | 25
46 | HHpred2 | 154 | 64 | 0.45 | 154 | 0.50 | 15
47 | MULTICOM-CLUSTER | 20 | 64 | 0.43 | 154 | 0.56 | 11
48 | SAM-T08-server | 256 | 64 | 0.43 | 154 | 0.48 | 17
49 | YASARA | 147 | 15 | 0.42 | 74 | 0.41 | 24
50 | FEIG | 166 | 64 | 0.41 | 154 | 0.47 | 18
51 | GS-KudlatyPred | 279 | 63 | 0.41 | 153 | 0.49 | 16
52 | Phyre2 | 235 | 64 | 0.40 | 154 | 0.34 | 34
53 | SHORTLE | 253 | 42 | 0.40 | - | - | -
54 | CBSU | 353 | 36 | 0.39 | - | - | -
55 | FAMSD | 140 | 64 | 0.39 | 154 | 0.47 | 19
56 | MULTICOM-REFINE | 13 | 64 | 0.39 | 154 | 0.56 | 9
57 | POEMQA | 124 | 63 | 0.38 | - | - | -
58 | MUProt | 443 | 64 | 0.38 | 154 | 0.54 | 12
59 | CpHModels | 193 | 59 | 0.38 | 146 | 0.33 | 37
60 | COMA-M | 174 | 63 | 0.37 | 153 | 0.45 | 22
61 | Phragment | 270 | 64 | 0.37 | 154 | 0.32 | 40
62 | FFASsuboptimal | 142 | 60 | 0.36 | 150 | 0.36 | 32
63 | EB_AMU_Physics | 337 | 61 | 0.35 | - | - | -
64 | Jiang_Zhu | 369 | 64 | 0.35 | - | - | -
65 | MULTICOM-RANK | 131 | 64 | 0.35 | 154 | 0.51 | 14
66 | TJ_Jiang | 384 | 64 | 0.35 | - | - | -
67 | reivilo | 22 | 1 | 0.34 | - | - | -
68 | FALCON | 351 | 64 | 0.34 | 154 | 0.39 | 26
69 | 3D-JIGSAW_AEP | 296 | 63 | 0.34 | 153 | 0.33 | 38
70 | PS2-manual | 23 | 61 | 0.34 | - | - | -
71 | PSI | 385 | 64 | 0.34 | 154 | 0.35 | 33
72 | NirBenTal | 354 | 11 | 0.33 | - | - | -
73 | Pcons_dot_net | 436 | 59 | 0.32 | 144 | 0.37 | 28
74 | PS2-server | 48 | 61 | 0.32 | 151 | 0.42 | 23
75 | 3DShot2 | 427 | 64 | 0.32 | 154 | 0.34 | 35
76 | nFOLD3 | 100 | 63 | 0.32 | 151 | 0.31 | 42
77 | AMU-Biology | 475 | 59 | 0.32 | - | - | -
78 | FrankensteinLong | 172 | 45 | 0.31 | - | - | -
79 | MULTICOM-CMFR | 69 | 64 | 0.31 | 154 | 0.46 | 21
80 | jacobson | 470 | 1 | 0.31 | - | - | -
81 | FALCON_CONSENSUS | 220 | 63 | 0.31 | 153 | 0.32 | 41
82 | Softberry | 113 | 64 | 0.30 | - | - | -
83 | Poing | 186 | 64 | 0.30 | 154 | 0.29 | 45
84 | fais-server | 116 | 59 | 0.29 | 148 | 0.37 | 27
85 | keasar-server | 415 | 58 | 0.29 | 140 | 0.37 | 29
86 | Frankenstein | 85 | 56 | 0.28 | 131 | 0.28 | 48
87 | FFASstandard | 7 | 60 | 0.28 | 148 | 0.33 | 39
88 | taylor | 356 | 12 | 0.28 | - | - | -
89 | COMA | 234 | 63 | 0.28 | 153 | 0.34 | 36
90 | Bilab-UT | 325 | 64 | 0.27 | - | - | -
91 | FFASflextemplate | 247 | 59 | 0.27 | 147 | 0.29 | 46
92 | pipe_int | 135 | 60 | 0.26 | 143 | 0.36 | 30
93 | Hao_Kihara | 284 | 62 | 0.26 | - | - | -
94 | GeneSilicoMetaServer | 297 | 59 | 0.26 | 147 | 0.27 | 51
95 | Pcons_local | 143 | 60 | 0.26 | 145 | 0.28 | 47
96 | 3D-JIGSAW_V3 | 449 | 63 | 0.26 | 153 | 0.31 | 43
97 | mGenTHREADER | 349 | 64 | 0.26 | 154 | 0.30 | 44
98 | Abagyan | 458 | 6 | 0.25 | - | - | -
99 | SAINT1 | 119 | 35 | 0.25 | - | - | -
100 | GS-MetaServer2 | 153 | 60 | 0.24 | 146 | 0.27 | 49
101 | PRI-Yang-KiharA | 39 | 64 | 0.24 | - | - | -
102 | BioSerf | 495 | 64 | 0.23 | 152 | 0.36 | 31
103 | keasar | 114 | 63 | 0.22 | - | - | -
104 | Kolinski | 493 | 64 | 0.22 | - | - | -
105 | mti | 289 | 6 | 0.22 | - | - | -
106 | POEM | 207 | 64 | 0.21 | - | - | -
107 | ACOMPMOD | 2 | 60 | 0.20 | 143 | 0.17 | 58
108 | FUGUE_KM | 19 | 55 | 0.20 | 141 | 0.15 | 60
109 | SAM-T02-server | 421 | 60 | 0.19 | 148 | 0.19 | 56
110 | Zhou-SPARKS | 481 | 40 | 0.19 | - | - | -
111 | tripos_08 | 83 | 27 | 0.19 | - | - | -
112 | fleil | 70 | 64 | 0.18 | - | - | -
113 | SAM-T06-server | 477 | 64 | 0.18 | 154 | 0.21 | 53
114 | 3Dpro | 157 | 58 | 0.17 | 147 | 0.18 | 57
115 | JIVE08 | 330 | 40 | 0.17 | - | - | -
116 | RBO-Proteus | 479 | 63 | 0.16 | 153 | 0.19 | 55
117 | Wolfson-FOBIA | 10 | 7 | 0.15 | - | - | -
118 | mumssp | 345 | 5 | 0.14 | - | - | -
119 | FOLDpro | 164 | 64 | 0.14 | 154 | 0.09 | 64
120 | forecast | 316 | 64 | 0.13 | 151 | 0.23 | 52
121 | Fiser-M4T | 394 | 25 | 0.12 | 93 | 0.27 | 50
122 | Sasaki-Cetin-Sasai | 461 | 40 | 0.12 | - | - | -
123 | Pushchino | 243 | 47 | 0.10 | 127 | 0.21 | 54
124 | SMEG-CCP | 14 | 62 | 0.10 | - | - | -
125 | panther_server | 318 | 48 | 0.10 | 129 | 0.13 | 62
126 | LOOPP_Server | 454 | 56 | 0.09 | 135 | 0.17 | 59
127 | Wolynes | 93 | 27 | 0.08 | - | - | -
128 | Handl-Lovell | 29 | 18 | 0.07 | - | - | -
129 | ProtAnG | 110 | 38 | 0.07 | - | - | -
130 | huber-torda-server | 281 | 42 | 0.07 | 92 | 0.13 | 63
131 | xianmingpan | 463 | 54 | 0.06 | - | - | -
132 | MUFOLD-MD | 404 | 62 | 0.06 | 150 | 0.09 | 65
133 | DelCLab | 373 | 60 | 0.05 | - | - | -
134 | mariner1 | 450 | 58 | 0.04 | 143 | 0.07 | 67
135 | MUFOLD-Server | 462 | 64 | 0.04 | 154 | 0.15 | 61
136 | StruPPi | 183 | 63 | 0.03 | - | - | -
137 | TWPPLAB | 420 | 64 | 0.03 | - | - | -
138 | RPFM | 5 | 10 | 0.02 | - | - | -
139 | OLGAFS | 213 | 43 | 0.02 | 125 | 0.08 | 66
140 | NIM | 255 | 10 | 0.02 | - | - | -
141 | POISE | 170 | 11 | 0.01 | - | - | -
142 | rehtnap | 95 | 48 | 0.01 | 131 | 0.04 | 68
143 | FLOUDAS | 236 | 36 | 0.01 | - | - | -
144 | Distill | 73 | 62 | 0.01 | 152 | 0.02 | 69
145 | ProteinShop | 399 | 6 | 0.01 | - | - | -
146 | MeilerLabRene | 211 | 45 | 0.01 | - | - | -
147 | schenk-torda-server | 262 | 56 | 0.01 | 136 | 0.00 | 70
148 | DistillSN | 272 | 59 | 0.00 | - | - | -
149 | mahmood-torda-server | 53 | 39 | 0.00 | 73 | 0.00 | 71
150 | Scheraga | 324 | 35 | 0.00 | - | - | -
151 | psiphifoldings | 63 | 30 | 0.00 | - | - | -
152 | igor | 188 | 13 | 0.00 | - | - | -
153 | ShakAbInitio | 104 | 7 | 0.00 | - | - | -
154 | dill_ucsf | 41 | 47 | 0.00 | - | - | -
155 | Linnolt-UH-CMB | 38 | 25 | 0.00 | - | - | -
156 | HCA | 40 | 25 | 0.00 | - | - | -
157 | PHAISTOS | 459 | 5 | 0.00 | - | - | -
158 | BHAGEERATH | 274 | 3 | 0.00 | 5 | 0.00 | 72
159 | PZ-UAM | 18 | 2 | 0.00 | - | - | -

For conciseness, the average Z-score presented in the table refers to the case where negative values were set to 0. However, the overall conclusions are not affected by this choice (data not shown).

The Shapiro-Wilk test established that only seven of the 154 GDT-TS distributions were likely to be normal at the 1% significance level. A non-Gaussian distribution of the GDT-TS scores might arise if groups of predictors used different templates for building their models, or if some groups were unable to detect a suitable template and resorted to less reliable template-free methods. The TBM assessor manuscript discusses this point in more detail [21].

We applied both the t-test and the Wilcoxon test to the data, and the results were essentially identical: groups found statistically indistinguishable by one test were also indistinguishable by the other (data not shown). We report in Tables III and IV the results of the Wilcoxon signed rank test for the 20 best ranking groups in the “Human and Server” and “All targets” categories, respectively.

Table III. Statistical Comparisons Among the Top 20 Groups on the “Human and Server” Subset of Targets

Table IV. Statistical Comparisons Among the Top 20 Server Groups on all CASP8 TBM Targets

The overall conclusions of the automatic evaluation of the first model for each human and server group can be summarized as follows.

Several groups (283 IBT_LT, 489 DBAKER, 71 Zhang, 426 Zhang-Server, 57 TASSER, 434 fams-ace2, 196 ZicoFullSTP, 46 SAM-T08-human, 299 Zico, 453 MULTICOM, 371 GeneSilico, 138 ZicoFullSTPFullData, 379 McGuffin, 282 3DShot1) performed well on the subset of “Human and server” targets and are statistically indistinguishable. Among the top predictors, only group 426 (Zhang-server) has officially registered as a server, although it is entirely possible that some of the other “human” groups used a completely automatic procedure.

When servers are compared to each other, group 426 (Zhang-Server) is by far the best performing one. It is statistically indistinguishable from group 293 (LEE-SERVER), but the latter submitted predictions for only 97 of the 154 possible TBM domains. The next three best performing servers are 438 RAPTOR, 322 Phyre_de_novo, and 12 HHpred5, which compare less favorably with human predictors on the “Human and Server” target subset. This can reflect a genuinely better performance of human groups, but it could also reflect different server performance on the human/server subset, which is biased rather than randomly selected [22].

DISCUSSION

CASP has been providing the assessors with the results of the automatic evaluation carried out by the Prediction Center at UC Davis for quite some time now. The procedure has been extensively tested and sufficiently standardized to be recommended for future CASPs, and is described in detail here. We also show here the results of the application of the procedure to the CASP8 data.

Deriving overall conclusions from the data provided is the duty and the privilege of the assessors and therefore the ranking provided here should be regarded as a starting point for the subsequent analysis of the outcome of the experiment.

The results of comparing server groups on all targets show that Zhang-Server outperforms the rest of the completely automatic methods. It is the only fully automatic method that appears in the list of the 20 best performing CASP8 predictor groups. The results obtained on the “Human and Server” target subset are not particularly informative about the relative quality of the different methods, since most of them are statistically indistinguishable. This can be due to one of two reasons (or a combination of them): either the number of “Human and Server” targets is not sufficiently high for deriving conclusions, or most methods are genuinely very similar. The choice of selecting a subset of targets for nonserver predictors originated from the understandable difficulty of human groups in handling a large number of predictions in a short period of time. On the other hand, it is a fact that, at least for homology-based models, most groups tend to rely on the same methodology, using state-of-the-art sequence similarity search tools (such as HMMs or profile-profile methods) and well performing programs such as Modeller [23] for building the final set of atomic coordinates.

We strongly encourage the prediction community to take advantage of the FORCASP forum for discussing these issues before the next experiment starts. This is important to ensure that the CASP effort in setting up the experiment, in standardizing the effective and reliable comparative measures of success described here, and in discussing their shortcomings will foster further advances in the protein structure prediction field.

  • * Results for other evaluation measures for each model are also reported on the CASP web site.

  • The Lee-server group submitted too few predictions on human/server targets and was not considered in the analysis.

REFERENCES
