Evaluation of template-based models in CASP8 with standard measures


  • Domenico Cozzetto,

    1. Department of Biochemical Sciences, Sapienza-University of Rome, P. le A. Moro, 5, 00185 Rome, Italy
    • Domenico Cozzetto and Andriy Kryshtafovych contributed equally to this work.

  • Andriy Kryshtafovych,

    Corresponding author
    1. Genome Center, University of California, Davis, California 95616

  • Krzysztof Fidelis,

    1. Genome Center, University of California, Davis, California 95616
  • John Moult,

    1. Center for Advanced Research in Biotechnology, University of Maryland, Rockville, Maryland 20850
  • Burkhard Rost,

    1. Department of Biochemistry and Molecular Biophysics, Columbia University, Northeast Structural Genomics Consortium (NESG) and New York Consortium on Membrane Proteins (NYCOMPS), Columbia University, New York, New York 10032
  • Anna Tramontano

    1. Department of Biochemical Sciences, Sapienza-University of Rome, P. le A. Moro, 5, 00185 Rome, Italy
    2. Istituto Pasteur-Fondazione Cenci Bolognetti, Sapienza-University of Rome, P. le A. Moro, 5, 00185 Rome, Italy

  • This article is dedicated to the memory of our friend and colleague Angel Ortiz.

  • The authors state no conflict of interest.


The strategy for evaluating template-based models submitted to CASP has continuously evolved from CASP1 to CASP5, leading to a standard procedure that has been used in all subsequent editions. The established approach includes methods for calculating the quality of each individual model, for assigning scores based on the distribution of the results for each target and for computing the statistical significance of the differences in scores between prediction methods. These data are made available to the assessor of the template-based modeling category, who uses them as a starting point for further evaluations and analyses. This article describes the detailed workflow of the procedure, provides justifications for a number of choices that are customarily made for CASP data evaluation, and reports the results of the analysis of template-based predictions at CASP8. Proteins 2009. © 2009 Wiley-Liss, Inc.


The CASP experiments have been instrumental in fostering the development of novel prediction methods and in establishing reliable measures for the numerical assessment of the submitted three-dimensional models of proteins. Different evaluation criteria have been tested in CASP throughout the years; some of them have been identified as suitable for an automated standard analysis. The Protein Structure Prediction Center performs the numerical evaluation of the CASP models according to these established criteria [1] and makes the results available to the community via the CASP web site. These data are usually the assessors' starting point for the official analysis of the structure prediction results.

Several numerical evaluation measures can give a reasonable estimate of the similarity between a model and the corresponding experimental structure. However, they cannot always be used directly and automatically to rank models according to their accuracy. For example, models of targets for which no clear evolutionarily related templates can be identified might be quite far from the experimental structure and thereby achieve very low scores. On the other hand, careful visual inspection might highlight cases where these models, although far from perfect, do correctly reproduce important features of the target protein—overall fold, proper secondary structure arrangements, correct inter-residue contacts, and so forth. For template-based predictions, though, numerical scores are sufficiently informative to confidently compare the quality of the models and therefore evaluate the effectiveness of the corresponding prediction methods.

This article discusses the standard measures that the template-based modeling (TBM) assessors used in previous CASPs to assess model quality and compare group performance. We also describe here the results of their application to the CASP8 predictions for the TBM category.


CM, comparative modeling; FR, fold recognition; RMSD, root mean square deviation; TBM, template-based modeling.


The most relevant issue that every CASP assessor has to deal with is the choice of a scoring scheme and of the appropriate metrics for comparing models and targets. Although no measure is better than the others in all cases, a number of them are sufficiently reliable to provide correct model quality estimates and have indeed been extensively used in CASP.


The root mean square deviation (RMSD) was the metric used in CASP1-3 [2-4], and its use is still widespread among computational biologists because of its conceptual simplicity. It is a very effective measure for comparing rather similar conformations, such as different experimental determinations of the same protein under different conditions, or different models in an NMR ensemble. RMSD is, however, not ideal for comparing substantially different structures, for several reasons. First, its quadratic nature penalizes errors very severely, that is, a few local structural differences can result in high RMSD values. Second, it depends on the number of equivalent atom pairs and thus tends to increase with protein size. Finally, and probably most importantly, the end user of a model is typically more interested in which regions are sufficiently close to the native structure than in how incorrect the worst parts of the model are, yet it is precisely these parts that affect the RMSD most dramatically.
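The RMSD between two conformations is computed after finding the superposition that minimizes it. As an illustrative sketch (not the CASP evaluation code), the Kabsch algorithm gives the optimal rotation between two sets of equivalent Cα coordinates:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two Nx3 coordinate arrays after optimal superposition
    (Kabsch algorithm): center both sets, find the rotation that minimizes
    the squared deviation, then measure the residual."""
    P = P - P.mean(axis=0)                # center both coordinate sets
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)     # SVD of the 3x3 covariance matrix
    d = np.sign(np.linalg.det(U @ Vt))
    U[:, -1] *= d                         # guard against an improper rotation
    R = U @ Vt                            # optimal rotation matrix
    diff = P @ R - Q
    return np.sqrt((diff ** 2).sum() / len(P))
```

Because a few badly placed atoms contribute quadratically, a single misplaced loop can dominate the value returned here, which is exactly the weakness discussed above.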


To overcome the RMSD shortcomings, a new threshold-based measure, GDT-TS [5], was developed and first used by the comparative modeling (CM) assessor in CASP4 [6,7]. GDT-TS is the average, over four distance thresholds (1, 2, 4, and 8 Å), of the maximal percentage of residues in the prediction that deviate from the corresponding residues in the target by no more than the threshold, as determined by the LGA [8] sequence-dependent superposition for that threshold. By averaging over a relatively wide range of distance cut-offs, GDT-TS rewards models with a roughly correct fold, while scoring highest those perfectly reproducing the target main chain conformation. For the purpose of automatic evaluation of the overall quality of a model, GDT-TS proved to be one of the most appropriate measures and has been used by the assessors of all CASP experiments after CASP4. In CASP6 and CASP7, a modification of GDT-TS, GDT-HA, was also used by the assessors for the analysis of high accuracy template-based modeling targets [9,10]. GDT-HA uses thresholds of 0.5, 1, 2, and 4 Å, thus allowing better detection of small differences in model backbone quality.
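The idea behind GDT-TS can be illustrated with a simplified sketch. The real score relies on LGA to search, for each threshold, for the superposition maximizing the residue count; the toy version below (not the CASP implementation) assumes the two coordinate sets are already superimposed:

```python
import numpy as np

def gdt_ts(model_ca, target_ca, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS: percentage of Ca atoms within each distance
    threshold, averaged over the four thresholds.  The official score
    instead maximizes the count under a separate LGA superposition for
    every threshold; here a single fixed superposition is assumed."""
    dist = np.linalg.norm(model_ca - target_ca, axis=1)  # per-residue Ca-Ca distance
    return 100.0 * np.mean([(dist <= t).mean() for t in thresholds])

def gdt_ha(model_ca, target_ca):
    """GDT-HA: same average, with the tighter 0.5, 1, 2, 4 A thresholds."""
    return gdt_ts(model_ca, target_ca, thresholds=(0.5, 1.0, 2.0, 4.0))
```

A perfect model scores 100; a model that places only half of its residues within 8 Å of the correct positions cannot exceed 50, regardless of how far away the remaining residues are, which is what makes the score robust to a few badly wrong regions.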


Another historical accuracy measure in CASP is the AL0 score, representing the percentage of correctly aligned residues after the LGA sequence-independent superposition of the model and the experimental structure with a threshold of 5 Å. A residue in the model is considered correctly aligned if its Cα atom is within 3.8 Å from the position of the corresponding experimental atom and no other Cα atom is closer. Even though conceptually different from GDT-TS, these two measures are highly correlated.
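A simplified sketch of the AL0 idea, again assuming the superposition has already been computed (the official score uses the LGA sequence-independent superposition with a 5 Å threshold):

```python
import numpy as np

def al0(model_ca, target_ca):
    """Simplified AL0: residue i counts as correctly aligned when its
    model Ca lies within 3.8 A of target Ca i and no other target Ca is
    closer to it.  Both arrays are Nx3 and assumed superimposed.
    Returns a percentage."""
    # pairwise distances between every model Ca and every target Ca
    d = np.linalg.norm(model_ca[:, None, :] - target_ca[None, :, :], axis=2)
    nearest = d.argmin(axis=1)            # closest target residue to each model Ca
    correct = (np.diag(d) <= 3.8) & (nearest == np.arange(len(d)))
    return 100.0 * correct.mean()
```

Since a residue modeled close to its true position usually also has its own target Cα as the nearest one, high AL0 and high GDT-TS tend to go together, consistent with the strong correlation noted above.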

Other evaluation measures

In recent years, other measures [11-14] have been developed that take into account the peculiarities of the comparison between a model and a structure as opposed to the comparison of two experimental structures. Each of these measures has its value and indeed some of them have been used in the CASP6-8 assessments.


In the numerical evaluation procedure of the CASP models, GDT-TS, GDT-HA, AL0, and other related parameters are computed for each model. Each prediction method could therefore be ranked by combining the values of the submitted models over all targets. The weakness of such a procedure is that it treats all targets equally: targets differ in difficulty, and the same difference in raw scores should therefore carry different weight for different targets. The problem was addressed by the CM assessor in CASP4 by introducing Z-scores [7]. This strategy implicitly takes into account the predictive difficulty of a target, as the normalized score reflects the relative quality of the model with respect to the results of the other predictors. Notably, Z-scores can also be computed for non-normal distributions, although in this case the standard normal probability table cannot be used, and indeed it is not used in CASP. The use of Z-scores instead of raw scores proved very effective for analyzing relative model quality, although the results should be taken with a grain of salt for targets for which very few groups generated good models, as this can lead to an overestimation of those groups' performance.
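The normalization itself is the standard one; a minimal sketch for a single target, where the input is the list of per-group GDT-TS values:

```python
import numpy as np

def target_zscores(gdt_ts_scores):
    """Z-scores of the per-group GDT-TS values for one target.  The
    normalization makes scores comparable across targets of different
    difficulty: a Z-score of 1 means one standard deviation above the
    average model for that target."""
    scores = np.asarray(gdt_ts_scores, float)
    return (scores - scores.mean()) / scores.std()
```

On an easy target where everyone scores around 90, a raw GDT-TS of 95 can yield a larger Z-score than a 60 on a hard target where most models score 55, which is exactly the intended weighting.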

Ranking procedures

Although using Z-scores for analyzing model quality and relative group performance became a common practice in CASP, the specific details of the scoring schemes are left to the assessors. In previous CASPs, the approaches used by the TBM assessors—formerly CM and fold recognition (FR) assessors—slightly differed in the choices for the following alternatives:

  • 1. Use all submitted models for calculating the means and standard deviations needed for the Z-score computations, versus ignore outliers in the datasets (and if so, how are outliers defined?).
  • 2. Set negative Z-scores to zero, or not.
  • 3. Use the sum of Z-scores versus the average over the number of predicted targets for ranking.
  • 4. Use Z-scores from a single evaluation measure as the basis for the ranking scheme, versus combine Z-scores from independent evaluation measures.

There are both advantages and potential pitfalls in these choices as we will briefly discuss below.

  • 1. One of the potential problems in the use of Z-scores is that the basic statistical parameters of the distribution of the selected evaluation score might be influenced by some extremely bad models. These can arise, for example, because of bugs in some of the servers participating in the experiment or because of unintentional human errors. In particular, very short “models” consisting of just a few residues can be found among the CASP predictions. To eliminate the effect of these unrealistic models on the scoring system, outliers can be excluded from the datasets used for calculating the final mean and standard deviation values. The CASP6 FR assessor considered models shorter than 20 residues as outliers [15]. All other TBM assessors (starting from CASP5) chose to curate the data by removing models whose score is lower than the mean of the distribution of all the values for the specific target by more than two standard deviations.
  • 2. One of the aims of CASP is to foster the development of novel methods in the field. Previous assessors recognized that some scoring schemes might be less appropriate than others for encouraging predictors to test riskier approaches. For example, a scoring scheme based on combining all Z-scores can deter predictors from submitting models for more challenging targets. Indeed, incorrect models—more likely to appear in these cases—would obtain negative Z-scores, leading to a lower overall score for the submitting group. One way to avoid this potential problem is to set negative Z-scores to 0, in other words to assign incorrect models the average score for that target. This technique was suggested by the CM assessor in CASP4 [7], and has been used by all but the CASP6 FR assessor since.
  • 3. For ranking purposes, the Z-scores of the models submitted by each group need to be summed or averaged over the number of predicted domains. This choice is clearly irrelevant if all groups predict the same set of targets. When this is not the case, the ranking can be affected by it. Summing penalizes groups that did not submit models for all targets, while averaging might penalize those that submit a larger number of targets, even if negative Z-scores are set to 0. The CM assessors in CASP4-7 [7,9,16,17] preferred averaging the scores (not considering groups that submitted a very small number of predictions), while the FR assessors in CASP5 [18] and CASP6 [15] tried both the averaging and summing approaches.
  • 4. A combination of the Z-scores derived from several measures was used by the FR assessors in CASP5 [18] and CASP6 [15], while Z-scores from a single measure, always GDT-TS, were used by the CM assessors in CASP4-7 [7,9,16,17]. The GDT-TS, AL0, and GDT-HA measures are all strongly correlated, and the value of computing all of them mostly resides in highlighting potential inconsistencies among them.
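The curation and ranking choices discussed above can be made concrete in a short sketch (illustrative only; the exact rules varied between assessors, and the two-sigma curation and averaging shown here follow the most common choices):

```python
import numpy as np

def curated_zscores(scores):
    """Z-scores for one target with the curation most often used in CASP:
    the mean and standard deviation are re-estimated after discarding
    models scoring more than two standard deviations below the initial
    mean (choice 1), and negative Z-scores are reset to zero (choice 2)."""
    s = np.asarray(scores, float)
    kept = s[s >= s.mean() - 2 * s.std()]   # drop extremely bad models
    z = (s - kept.mean()) / kept.std()
    return np.clip(z, 0.0, None)            # negative Z-scores -> 0

def rank_groups(score_table):
    """score_table maps group -> {target: GDT-TS}.  Groups are ranked by
    the mean Z-score over the targets each group predicted (choice 3),
    using a single measure (choice 4)."""
    targets = sorted({t for d in score_table.values() for t in d})
    z = {g: [] for g in score_table}
    for t in targets:
        groups = [g for g in score_table if t in score_table[g]]
        zt = curated_zscores([score_table[g][t] for g in groups])
        for g, v in zip(groups, zt):
            z[g].append(v)
    return sorted(((g, float(np.mean(v))) for g, v in z.items()),
                  key=lambda x: -x[1])
```

Averaging rather than summing means a group is not penalized for skipping targets, which is why groups submitting very few predictions were set aside before ranking.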


CASP rules allow up to five models to be submitted for the same target. Predictors are informed that only the model designated as first will be used in the standard ranking, as any other choice would lead to unfair comparisons. A “select the best of the five models” strategy would give an advantage to groups submitting more predictions, as they would be more likely to submit a better model simply because of larger sampling. On the other hand, an “average over all predictions” strategy might disadvantage groups using this possibility to test novel and riskier methods.

Statistical comparison of group performance

A sensitive and important issue concerns the evaluation of the statistical significance of the differences in the scores of different groups. The CASP5 CM assessor introduced the use of a paired t-test between the results of each pair of groups [16]. Notice that groups are not ranked according to the t-test and each pair is compared independently; there is therefore no multiple-testing issue. One potential problem is that the t-test assumes normality of the distributions to be compared, and one should verify that this is the case in the experiment. If not, a nonparametric test—such as the Wilcoxon signed rank test—should be used.
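A sketch of such a pairwise comparison on the common set of targets, using SciPy (an illustration of the statistical idea, not the assessors' actual code; the 1% threshold is the level used later in this article):

```python
import numpy as np
from scipy import stats

def compare_groups(scores_a, scores_b, alpha=0.01):
    """Paired comparison of two groups' scores on their common targets.
    The paired t-test assumes normally distributed differences; the
    Wilcoxon signed-rank test is the usual nonparametric fallback when
    that assumption fails."""
    a = np.asarray(scores_a, float)
    b = np.asarray(scores_b, float)
    t_p = stats.ttest_rel(a, b).pvalue
    w_p = stats.wilcoxon(a, b).pvalue
    return {"t_test_p": t_p, "wilcoxon_p": w_p,
            "distinguishable": w_p < alpha}
```

Because each pair of groups is tested independently on the targets both predicted, the result is a matrix of pairwise verdicts rather than a total ordering.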

CASP8 evaluation of template-based models

The overall evaluation procedure is summarized in Figure 1. Once the parameters used in the evaluation (highlighted in italics) are selected, the calculations are straightforward and the results are provided to the template-based modeling assessor as soon as the target structures and their dissection into prediction units [19] are available.

Figure 1.

Flowchart of the procedure used for evaluation. Steps in italics depend on the assessor's preferences.

In the analysis of CASP8 template-based models described here, we adopted the parameters most often used by the assessors in the previous CASPs.

  • 1. The GDT-TS measure was used as the basic measure for comparing models and experimental structures. The GDT-TS values are computed using LGA in sequence-dependent mode.*
  • 2. Models shorter than 20 residues were removed from the dataset. If several independent segments were submitted for the same prediction unit, the frame with the largest number of residues was selected as the representative model.
  • 3. Z-scores were calculated based on the GDT-TS (and other) measures without further data curation (data reported on the web). The Z-scores reported in this article were calculated after removal of the models with values more than two standard deviations below the mean.
  • 4. Negative Z-scores were set to zero.
  • 5. Groups were ranked according to the average of the GDT-TS-based Z-scores for the models designated as first by the predictors.
  • 6. The normality of the GDT-TS distributions for each target was evaluated using the Shapiro-Wilk test [20].
  • 7. The statistical significance of the differences between the GDT-TS values of the models was assessed with a suitable paired test of hypothesis for all pairs of groups on the common set of predicted targets.

It should be noted that in CASP8 targets were split into two categories: (1) targets for prediction by all groups (human/server targets) and (2) targets for server prediction only (server-only targets). All in all, the TBM category encompassed 154 assessment units [19], 64 of which were human/server domains while the remaining 90 were server only. All groups (server and human-expert) were ranked according to their results on the subset of 64 human/server domains, while server groups were also ranked on the complete list of 154 domains.


As an illustration of the evaluation strategy described in Methods, we show here the results of the automatic analysis performed on the template-based predictions in CASP8. Since they are reported here, these data will not be included in the TBM assessor paper [21], which will instead concentrate on more detailed evaluations of the structural features of the submitted models.

Table I shows the correlation between the Z-scores obtained using GDT-TS, GDT-HA, and AL0 for the groups participating in CASP8. They are highly correlated for both sets of targets (“Human and Server” and “Server only”), and therefore in the following we only discuss the GDT-TS results. The results obtained using the other scoring schemes are available on the CASP web site.
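The agreement between two such rankings is measured with Spearman's rank correlation; a minimal example with SciPy, using illustrative (not CASP) mean Z-score values for five hypothetical groups:

```python
from scipy import stats

# Mean Z-scores of the same five hypothetical groups under two measures;
# the last two groups swap places between the two rankings.
gdt_ts_z = [1.8, 1.5, 1.1, 0.9, 0.4]
al0_z    = [1.7, 1.6, 1.0, 0.5, 0.8]

# rho depends only on the ranks, so it directly quantifies how far the
# two group orderings agree, regardless of the Z-score magnitudes.
rho, p = stats.spearmanr(gdt_ts_z, al0_z)
```

With a single swap of adjacent ranks among five groups, ρ is 0.9; values of 0.95-0.99 as in Table I indicate near-identical orderings.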

Table I. Agreement Between Group Rankings Based on Different Model Quality Measures

Dataset                                            Measures compared                              ρ
All groups,                                        Mean AL0 Z-score vs. mean GDT-TS Z-score       0.97
human and server targets                           Mean AL0 Z-score vs. mean GDT-HA Z-score       0.96
                                                   Mean GDT-TS Z-score vs. mean GDT-HA Z-score    0.99
Server groups,                                     Mean AL0 Z-score vs. mean GDT-TS Z-score       0.97
all targets (human and server plus server only)    Mean AL0 Z-score vs. mean GDT-HA Z-score       0.95
                                                   Mean GDT-TS Z-score vs. mean GDT-HA Z-score    0.98

Spearman's correlation (ρ) between the Z-scores obtained by each group using different measures. The data are reported both for the “human and server” subset and for the complete set of targets.

Table II illustrates the results obtained by all the groups submitting predictions. The server results are evaluated on the complete set of assessment units, while the results of all groups are computed for the subset of “Human and Server” targets.

Table II. Average Z-Scores Based on GDT-TS for Individual Prediction Groups

Columns: Rank; Group name; Group id; “Human and Server” target subset (No. of targets, Mean Z-score); All targets (No. of targets, Mean Z-score, Rank (servers only)).

Mean Z-score of the participating groups after setting negative Z-scores to 0. Data for human predictors are computed on the subset of “Human and server” targets, while the results of the servers are reported both for this subset (to allow a proper comparison with human groups) and for the whole set of assessment units. Data are ranked according to the Z-scores on the “Human and Server” subset; the rank of servers on the complete set of targets is reported in the last column.

For conciseness, the average Z-score presented in the table refers to the case where negative values were set to 0. However, the overall conclusions are not affected by this choice (data not shown).

The Shapiro-Wilk test established that only seven of the 154 GDT-TS distributions were likely to be normal at the 1% significance level. A non-Gaussian distribution of the GDT-TS scores might arise if groups of predictors used different templates for building their models, or if some groups were unable to detect a possible template and resorted to less reliable template-free methods. The TBM assessor manuscript discusses this point in more detail [21].

We applied both the t-test and the Wilcoxon test to the data, and the results were essentially identical: groups that were statistically indistinguishable according to one test were also indistinguishable according to the other (data not shown). We report in Tables III and IV the results of the Wilcoxon signed rank test for the 20 best ranking groups in the “Human and Server” and “All targets” categories, respectively.

Table III. Statistical Comparisons Among the Top 20 Groups on the “Human and Server” Subset of Targets
Table IV. Statistical Comparisons Among the Top 20 Server Groups on all CASP8 TBM Targets

The overall conclusions of the automatic evaluation of the first model for each human and server group can be summarized as follows.

Several groups (283 IBT_LT, 489 DBAKER, 71 Zhang, 426 Zhang-Server, 57 TASSER, 434 fams-ace2, 196 ZicoFullSTP, 46 SAM-T08-human, 299 Zico, 453 MULTICOM, 371 GeneSilico, 138 ZicoFullSTPFullData, 379 McGuffin, 282 3DShot1) performed well on the subset of “Human and server” targets and are statistically indistinguishable. Among the top predictors, only group 426 (Zhang-server) has officially registered as a server, although it is entirely possible that some of the other “human” groups used a completely automatic procedure.

When servers are compared with each other, group 426 (Zhang-server) is by far the best performing one. It is statistically indistinguishable from group 293 (Lee-server), but the latter group submitted predictions for only 97 of the 154 possible TBM domains. The next three best performing servers are 438 Raptor, 322 Phyre_de_novo, and 12 HHpred5, which compare less favorably with human predictors on the “Human” target subset. This can reflect a genuinely better performance of human groups, but it could also be due to a different performance of the servers on the biased subset of human targets, which are not randomly selected [22].


CASP has been providing the assessors with the results of the automatic evaluation carried out by the Prediction Center at UC Davis for quite some time now. The procedure has been extensively tested and sufficiently standardized to be recommended for future CASPs, and is described in detail here. We also show here the results of the application of the procedure to the CASP8 data.

Deriving overall conclusions from the data provided is the duty and the privilege of the assessors and therefore the ranking provided here should be regarded as a starting point for the subsequent analysis of the outcome of the experiment.

The results of comparing server groups on all targets show that Zhang-server outperforms the rest of the completely automatic methods. It is the only fully automatic method that appears in the list of the 20 best performing CASP8 predictor groups. The results obtained on the “Human and Server” target subset are not particularly informative about the quality of the different methods, since most of them are statistically indistinguishable. This can be due to one of two reasons (or a combination of them): either the number of “Human and server” targets is not sufficiently high for deriving conclusions, or most methods are genuinely very similar. The choice of selecting a subset of targets for nonserver predictors originated from the understandable difficulty of human groups in handling a large number of predictions in a short period of time. On the other hand, it is a fact that, at least for homology-based models, most groups tend to rely on the same methodology, using state-of-the-art sequence similarity search tools (such as HMMs or profile–profile methods) and well performing programs such as Modeller [23] for building the final set of atomic coordinates.

We strongly encourage the prediction community to take advantage of the FORCASP forum for discussing these issues before the next experiment starts. This is important to ensure that the CASP effort in setting up the experiment, in standardizing the effective and reliable comparative measures of success described here, and in discussing their shortcomings will foster further advances in the protein structure prediction field.

  • * Results for other evaluation measures for each model are also reported on the CASP web site.

  • The Lee-server group submitted too few predictions on human/server targets and was not considered in the analysis.