Evaluation of template‐based modeling in CASP13

Abstract Performance in the template‐based modeling (TBM) category of CASP13 is assessed here, using a variety of metrics. Performance of the predictor groups that participated is ranked using the primary ranking score that was developed by the assessors for CASP12. This reveals that the best results are obtained by groups that include contact predictions or inter‐residue distance predictions derived from deep multiple sequence alignments. In cases where there is a good homolog in the wwPDB (TBM‐easy category), the best results are obtained by modifying a template. However, for cases with poorer homologs (TBM‐hard), very good results can be obtained without using an explicit template, by deep learning algorithms trained on the wwPDB. Alternative metrics are introduced, to allow testing of aspects of structural models that are not addressed by traditional CASP metrics. These include comparisons to the main‐chain and side‐chain torsion angles of the target, and the utility of models for solving crystal structures by the molecular replacement method. The alternative metrics are poorly correlated with the traditional metrics, and it is proposed that modeling has reached a sufficient level of maturity that the best models should be expected to satisfy this wider range of criteria.


| INTRODUCTION
Even though the worldwide Protein Data Bank 1 (wwPDB) continues to expand quickly, growth in this database is outpaced by the growth in genomic information, leading to an escalation in the need for protein structure modeling. Around the turn of the century the wwPDB archive was doubling every 3 to 4 years with apparent exponential growth, but in recent years it has taken about 7 years to double 2 ; in contrast, genome databases are currently doubling about once every 7 months. 3 A large fraction of proteins lacking an experimental structure will be at least distantly related to a protein of known structure, which can serve as a template for modeling. Template-based modeling (TBM) thus plays a key role in leveraging genomic data, but as the level of sequence identity drops, TBM becomes progressively more challenging. In the early days of the CASP experiments, it is probably fair to say that many attempts to improve on the best template actually turned it into a worse model. However, great advances have been made over the years, aided in part by improved understanding of the energetics of protein folding 4 but also largely by taking advantage of the growing databases. In CASP12, substantial improvements over CASP11 were attributed to several factors: better use of multiple templates; improved model refinement methods; and better methods for estimating model accuracy, which allowed the best alternative model to be chosen and limited computational resources to be focused on the regions most in need of refinement. 5 It was of interest to see whether CASP13 would reveal continued progress and, if so, what was driving it.
We have also taken the opportunity to look at some less conventional measures of quality. The traditional scoring metrics are defined primarily based on the deviation between model and target in Cartesian space, and for historical reasons are somewhat lenient: in early CASP rounds, simply getting most Cα positions reasonably close to the target was a substantial achievement. 6,7 However, as the field matures, the number of groups achieving high scores on any given model is steadily increasing. It is sensible to start considering more stringent measures of model quality, preferably orthogonal to those in current use. Thus, we have developed some measures based on how well the torsion angles describing the conformation of the structure are reproduced in the model. While torsion-based analyses have been used previously in assessing CASP rounds 3, 4, and 9, 8-10 they have not been widely adopted, perhaps because such metrics only become truly meaningful once the majority of the fold is essentially correct.
The conventional metrics, and the new torsion metrics, evaluate respectively the correctness of the predicted folds and the adherence of predicted fine-scale features to those observed in the target structures.
However, users of such models will primarily be interested in their utility for particular purposes, such as providing targets for the design of new therapeutics or explaining the impact of mutations found in inherited diseases. In addition to such applied research, predicted structures can also be very useful as initial models when determining new experimental structures. Arguably, the most common example of this is in X-ray crystallography structure phasing through molecular replacement (MR). 11 In MR, an atomic model derived from a related protein structure is rotated and translated in a search for the position occupied by the true structure in the crystal; phases calculated from the model are combined with the data to produce an electron density map that reveals new features, if the model is sufficiently accurate. MR, when it succeeds, allows a structure to be determined from a data set from a single native crystal, without requiring the preparation of heavy-atom derivatives or the accurate measurement of anomalous scattering data. 12 TBM methods that improve significantly on the original template can therefore shortcut the process of structure determination and improve throughput in X-ray crystallography. In recognition of this, in CASP7 we introduced a metric scoring individual model predictions based on their usefulness in MR. 6 As a result of continuing improvement in structure prediction, the use of TBM to improve MR models has been expanding greatly in recent years. 13-16 Although TBM is the focus of this work, it should also be noted that free modeling of whole proteins or fragments can yield useful models for MR, under favorable circumstances of relatively small proteins and high-resolution data. 17-20

2 | MATERIALS AND METHODS

| Target classification and scope of this work
Target classification is described in detail elsewhere in this volume.
Briefly, in CASP13 as in earlier exercises, targets for structural modeling were divided when appropriate into evaluation units, which were categorized by difficulty. The difficulty category addressed here, TBM, broadly covers cases in which a good template can be found in the PDB. It is further subdivided into TBM-easy and TBM-hard. For the most part, we will not be discussing the targets lacking good templates, which are categorized as free modeling (FM) or, for borderline cases, TBM/FM. For CASP13, there were 40 evaluation units defined as TBM-easy and 21 defined as TBM-hard.

| Traditional evaluation measures
Over the years, a large number of evaluation measures have been developed to assess different aspects of model quality. A detailed description, classification, and review of a number of these metrics has been published recently 21 ; they differ, for instance, in whether or not they depend on structure superposition and whether they depend on global or local measures. Most of these metrics are computed, collated, and analyzed by the Prediction Center (http://predictioncenter.org), 22 making them much more convenient for assessors and others.
In this work, our primary ranking has adopted the same overall ranking score used for TBM models in CASP12, 5 which is based on five metrics computed by the Prediction Center.
GDT_HA is the high-accuracy version of the Global Distance Test, which assesses the overall fold in a way that gives greater reward for parts of the target reproduced with high precision. 23 The local difference distance test, lDDT, evaluates how well models reproduce an all-atom distance map. 7 The contact area difference score, CADaa, is based on comparing residue contact surface areas. 24 Sphere-Grinder (SG) measures how well the local environment is conserved between the model and target. 25 Finally, the accuracy self-estimate measure, ASE, evaluates the degree to which the coordinate error estimates predict positional differences from the target. 22 The overall ranking score combining these measures is:

S_CASP12 = z_GDT_HA + (z_lDDT + z_CADaa + z_SG)/3 + z_ASE

This scheme assigns equal overall weight to global fold quality (GDT_HA), local structure quality (split over lDDT, CADaa, and SG), and quality of model accuracy estimates (ASE). In the ranking equation, z indicates the adjusted z-score over all models under consideration for a given target, with the subscript denoting the particular underlying evaluation measure. The adjusted z-score (essentially the number of standard deviations (SD) above the mean of the full set of models) is computed using the following protocol, common to recent CASPs. A set of initial z-scores is evaluated based on the mean and SD of scores from all the models under consideration. All models yielding initial z-scores below −2 are omitted as potential outliers, and the z-scores are recomputed using the mean and SD from the pruned set of models. Finally, negative z-scores are reset to zero, with the goal of reducing the penalty on predictors who test novel methods.
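The adjusted z-score protocol described above can be sketched in a few lines of Python. This is a minimal illustration of the prune-and-clamp procedure, not the Prediction Center's actual implementation:

```python
import statistics

def adjusted_z_scores(scores):
    """Adjusted z-scores as described in the text:
    1) initial z-scores from the mean/SD of all models;
    2) models with initial z < -2 are dropped as potential outliers
       and the mean/SD are recomputed from the pruned set;
    3) negative z-scores are reset to zero."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    initial = [(s - mean) / sd for s in scores]
    pruned = [s for s, z in zip(scores, initial) if z >= -2.0]
    mean2 = statistics.mean(pruned)
    sd2 = statistics.stdev(pruned)
    # z-scores are still reported for pruned models, but clamped at zero
    return [max(0.0, (s - mean2) / sd2) for s in scores]
```

In practice the per-metric z-scores produced this way would then be combined with the weights given in the ranking equation.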
When comparing with other metrics it was sometimes more sensible to exclude the ASE component, leading to:

S_CASP12-ASE = z_GDT_HA + (z_lDDT + z_CADaa + z_SG)/3

| Torsion-based evaluation measures

For each residue present in both model (or template) and target, the backbone score S_backbone was defined as a function of the angular differences Δφ, Δψ, and Δω, where φ and ψ are the characteristic Ramachandran torsion angles 27 and ω is the torsion across the peptide bond. Instances where the ω torsion was more than 30° from planar, or flipped relative to the target (ie, trans to cis or vice versa), were separately recorded.
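The bookkeeping for ω outliers follows directly from the definitions above. A minimal sketch (function names are illustrative; angles in degrees, with trans near 180° and cis near 0°):

```python
def delta_angle(a, b):
    """Smallest signed difference between two torsion angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0

def omega_flags(omega_model, omega_target, twist_cutoff=30.0):
    """Flag a peptide bond that is twisted more than `twist_cutoff`
    degrees from planarity, or flipped between cis and trans relative
    to the target."""
    # deviation from the nearest planar value (0 or 180 degrees)
    twist = min(abs(delta_angle(omega_model, 0.0)),
                abs(delta_angle(omega_model, 180.0)))
    model_cis = abs(delta_angle(omega_model, 0.0)) < 90.0
    target_cis = abs(delta_angle(omega_target, 0.0)) < 90.0
    return {"twisted": twist > twist_cutoff,
            "flipped": model_cis != target_cis}
```

The same periodic difference function applies equally to φ, ψ, and χ torsions.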
Devising a useful and fair score for sidechain conformations was somewhat more challenging. In experimental models, the sidechain conformation is often inherently less certain than the backbone; in highly solvent-exposed locations there is often effectively no experimental evidence for any given configuration, and hence no true correct answer. Further, the certainty of a given sidechain torsion tends to decrease with distance from the backbone. To complicate matters further, the relevance of each torsion depends on those preceding it: if the first torsion is completely wrong, then the values of the remaining torsion angles are effectively meaningless. The sidechain score S_sidechain for rotameric residues was thus defined in terms of the differences Δχ_i, where χ_i is the ith sidechain torsion from the backbone, n_χ is the number of sidechain torsions in the given target residue, τ defines the contribution of χ_2 to the score as a function of Δχ_1, and β is a "burial score" computed from n_close, the number of heavy atoms from other residues within 4 Å of any heavy atom in the given residue (based on the target structure), and n_sc, the number of heavy atoms expected in the sidechain.
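The neighbor count n_close underlying the burial score is a straightforward distance search. The sketch below computes only n_close; the exact mapping from n_close and n_sc to the weight β is not reproduced here:

```python
def burial_count(residue_atoms, other_atoms, cutoff=4.0):
    """n_close: number of heavy atoms from other residues that fall
    within `cutoff` Å of any heavy atom of the given residue.
    Coordinates are (x, y, z) tuples in Å, taken from the target
    structure. Each neighboring atom is counted at most once."""
    c2 = cutoff * cutoff
    n_close = 0
    for ox, oy, oz in other_atoms:
        for rx, ry, rz in residue_atoms:
            dx, dy, dz = ox - rx, oy - ry, oz - rz
            if dx * dx + dy * dy + dz * dz <= c2:
                n_close += 1
                break  # this neighbor is close; move to the next one
    return n_close
```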
Sidechains with no χ torsions in the target (ie, glycine or alanine residues and truncations) did not receive a score. For ranking of models, we found it convenient to define two different combined scores: a "torsion-only" score, and a "geometry-weighted" score combining torsion differences with more standard metrics. In both cases, z-scores were calculated as described in the previous section.
These analyses were implemented as Python scripts using tools from ChimeraX 28 and ISOLDE. 29

| Evaluation by utility in molecular replacement calculations
In the high-accuracy TBM assessment for CASP7, we tested each model for selected targets to determine whether it could have been used to solve the structure by molecular replacement. 6 Since CASP7, it has become much more common for predictors to submit local error estimates in the B-factor fields of submitted models.
As suggested at the time of the CASP7 evaluation, 6 and as demonstrated in tests with models from CASP10, 33 such error estimates can be used to weight the model in the MR calculation; as a control, a third calculation interpreted the B-factor column as a conventional B-factor. The MR calculation is relatively inexpensive compared to modeling algorithms, and crystallographers tend to test multiple alternative models in practice. In accordance with this, we tested all five submitted models for each target and chose the best for each group. Preliminary trials indicated that the calculations could be unstable for models that reproduced the structure very poorly or when the coordinate error estimates were infeasibly large. For this reason, models with a GDT_TS score less than 30 or with a median coordinate error estimate greater than 3 Å were omitted from the calculations and assigned an LLG score of zero.
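The pre-filtering rules described above amount to a simple screen applied before any MR run. A sketch, using hypothetical dictionary keys 'gdt_ts' and 'median_err' for each submitted model:

```python
def select_models_for_mr(models, gdt_min=30.0, err_max=3.0):
    """Split models into those passed on to the MR calculation and
    those pre-emptively assigned LLG = 0, following the filters in
    the text: GDT_TS < 30 or median coordinate-error estimate > 3 Å.
    `models` is a list of dicts with keys 'gdt_ts' and 'median_err'
    (key names are illustrative, not from any CASP file format)."""
    runnable, skipped = [], []
    for m in models:
        if m["gdt_ts"] < gdt_min or m["median_err"] > err_max:
            skipped.append(dict(m, llg=0.0))  # unstable: assign zero score
        else:
            runnable.append(m)
    return runnable, skipped
```

All five submitted models per target would be screened this way, with the best surviving LLG taken for each group.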
An additional complication was encountered for two of the targets for which diffraction data were available, T0960 and T0963. The crystal structures for these targets display translational noncrystallographic symmetry (tNCS), in which more than one unique copy of a molecule is found in a similar orientation in the crystal. The presence of tNCS leads to a systematic modulation of the diffraction intensities, as the contributions from the tNCS-related molecules can interfere constructively or destructively. If not accounted for, this seriously degrades the reliability of the MR calculations. Phaser has been adapted to characterize and account for the effects of tNCS, 37 but because the Python scripts we were using have not yet been updated to take advantage of this new feature, we omitted these two targets from our calculations.

| Model visualization
Models were visualized in ChimeraX, 28 using validation markup provided by ISOLDE. 29

3 | RESULTS

| Assessment of progress
As noted in previous CASP rounds, it is difficult to assess progress, in no small part because the targets differ each time. This adds random noise to any comparison, but there are potential systematic effects as well. Most notably, the operational definition of an evaluation unit has a subjective element that could possibly mask some of the improvements being made: one of the considerations in defining an evaluation unit is whether or not any predictors succeeded in finding a good relative orientation between two segments of structure that might otherwise have been classified as separate domains. As predictors improve in this respect, evaluation units will tend to become larger and correspondingly harder.
With those provisos in mind, a consistent measure of target difficulty can be used to compare results with different targets. In line with previous rounds including CASP12, we use here a linear combination of coverage by the best structural template and the sequence identity between the target and the best template. 5 Using this measure of difficulty, progress in improving overall fold accuracy as judged by GDT_TS had seemed to stall around the time of CASP11. 38 However, substantial improvements were seen again in CASP12, 5 and Figure 1 shows that this progress has continued for CASP13. Note that the marked outlier from CASP13 in Figure 1D, T0999-D2 (a TBM-hard target), is an example of the challenges involved in designating evaluation units. This target is from a family of proteins that undergo a large conformational change upon ligand binding; since good templates exist in both ligand-bound and -free states, the definition of the resulting evaluation unit was ambiguous.

FIGURE 1 Overall trends in model difficulty and accuracy over time. A, The average difficulty of TBM targets in CASP13 was somewhat lower than in CASP12, with templates of both higher sequence identity and coverage available. B, The distribution of GDT_TS scores for TBM models has shifted toward higher values since the first four rounds of CASP, with a further substantial shift from values below 50 to very good values above 80 between CASP12 and CASP13. C, The accuracy of sequence alignments has improved significantly since CASP11, particularly for low homology templates. D, In keeping with (C), GDT_TS scores appear to still be improving for harder targets. T0999-D2 is an outlier due to ambiguity in the definition of a "domain," as discussed in the main text. In (C) and (D), individual data points are shown for CASP12 and CASP13, with only trend lines shown for earlier rounds. Each point represents the best model submitted by any group for a given target.
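The difficulty measure used here is a linear combination of best-template coverage and target-template sequence identity. A sketch, with equal weights as an illustrative assumption (the actual CASP weighting is defined in the assessment papers):

```python
def target_difficulty(coverage, seq_identity, w_cov=0.5, w_id=0.5):
    """Difficulty proxy for a target: linear combination of coverage by
    the best structural template and target-template sequence identity,
    both expressed as fractions in [0, 1]. Higher values correspond to
    easier targets. The 0.5/0.5 weights are an assumption for
    illustration, not the official CASP coefficients."""
    return w_cov * coverage + w_id * seq_identity
```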

| Group rankings
For the primary rankings, predictions were scored by the S CASP12 score discussed above. For group rankings, we considered only the scores for the "model 1" models rather than the best of the potential five models submitted for each target. This is the approach generally taken in CASP assessment, and the ability of the group to rank their models forms an implicit part of the ranking score. Any ranking score that assigns comparable weights to a combination of metrics measuring global fold, local fold, and estimated model accuracy is likely to lead to a similar overall ranking, as the metrics within these general categories tend to be highly correlated to one another. 21 Because the ASE accuracy self-estimate score measures an orthogonal characteristic of the models (and to assess the possibility that a good ASE score could be attained by assigning large errors to poor models), we also tested the effect on ranking of excluding the ASE measure. Figure 3A shows an overview of the group rankings across all TBM targets, while Figure 3B focuses on the 20 top-ranked groups. It is also interesting to note that neither score correlates particularly well with the suitability of models for MR (discussed below).
While the top 10 groups by this MR score (highlighted in red) are found for the most part in the upper-right quadrant of the plot, they share this space with a much larger number of similarly scoring groups whose models were not as effective for MR.
The disparity between Cartesian and torsional scoring metrics is illustrated further in Figure 6 using T0981-D5 (TBM-hard) as an example. By S CASP12 it appeared that A7D and Zhang-Server did comparably well on this model, with z-scores of 1.76 and 1.55 respectively. By S CASP12-ASE these remained the leading models, but the gap widened markedly (z-scores of 1.77 and 0.98 respectively). In torsion space, however, the A7D model remained the leader (z = 1.49; S backbone = 0.123; S sidechain = 0.234) while the Zhang-Server model received the minimum possible z-score (z = 0; S backbone = 0.211; S sidechain = 0.370). Closer inspection revealed that many sites in the latter were not simply incorrect relative to the target but were physically highly implausible: 36 (28%) of peptide bonds were either twisted >30° from planar or flipped into the cis conformation; 33 sidechains (26%) were rotamer outliers; and 65 residues (51%) were outside of favored Ramachandran space.
A separate issue concerned the interpretation of coordinate error estimates for multi-domain targets where the relative domain orientation is uncertain: errors can be expressed relative to each domain individually, or relative to the full target, in which case they are dominated by the uncertainty in domain placement. In this round, group A7D (quite reasonably) applied this latter interpretation, leading to error estimates that were in many cases 1-2 orders of magnitude larger than expected. For future CASP rounds, the definition of coordinate error estimates should be more carefully specified.

| Geometry-weighted scoring metric
As we have shown, under the current-standard S CASP12 scheme it is possible to generate a top-scoring model that nevertheless contains a very large number of physically implausible or impossible local conformational features. It thus seems reasonable to suggest an alternative scoring scheme incorporating these. The weighting of these must be carefully considered, however. Relying purely on torsions is inadvisable: as a simple illustration, for a target consisting of a bundle of alpha helices a model built as a single long helix will score almost as well as the correct fold by this metric. It is also wise to consider that the experimental structure itself is not perfect, and that torsions (particularly in elements without defined secondary structure) tend to be substantially more error-prone than Cα and overall sidechain positions. On the other hand, inclusion of torsion-based scoring will reduce the effect of the (not uncommon) case where some portion of the target is likely flexible in solution but has been captured in a single conformation. In such cases, distance-based metrics will unfairly reward models which happen by chance to replicate the specific location of the flexible element, whereas torsion-based metrics remain largely unaffected.
In this CASP round, 1651 individual models were identified as templates in the submitted model files. Of these, 15 could not be retrieved from the wwPDB: 11 had been obsoleted and replaced with newer models, one had been withdrawn entirely, one was apparently nonexistent, and two were unreleased at the time of writing. Model 1wb1, for example, was used as a template for target T1022 by five groups, despite having been obsoleted and replaced by 4ac9 more than 5 years before the beginning of this CASP round. This problem is likely to be compounded in future by the (otherwise positive) recent decision by the wwPDB steering committee to begin allowing authors to deposit updated versions of coordinates under their original accession ID: given that the wwPDB is a rapidly growing and ever-more-dynamic database, any static template library is doomed to quickly become outdated.
Nine templates had resolutions poorer than 10 Å. For the remainder that could be retrieved from the wwPDB, a plot of MolProbity score vs resolution is shown in Figure 10. For the most extreme low-resolution and/or low-quality templates in this cohort, we searched for better models of similar or identical sequence with release dates of 2017 or earlier (ie, those that were available for this CASP round). Of the approximately 50 cases inspected, we were able to find demonstrably better models for 21 (Table 1). Eighteen of these were >90% identical in sequence to the template, and 12 were never used as templates by any group. In Figure 10A, red lines connect each template to the identified alternative.
Our observations here suggest that an easy way for some groups to significantly improve their TBM results will be to introduce some simple extra heuristics in the selection of templates.

FIGURE 10 Importance of considering model quality when selecting templates. A, Scatter plot of resolution vs MolProbity score for all PDB entries identified as templates used in this CASP round (excluding models with resolutions poorer than 10 Å). Red lines connect selected templates to similar models with significantly better resolution and/or MolProbity score. Alternative models were selected from those with better than 90% (solid lines), 70% (dashed lines), or 50% (dotted lines) sequence identity to the template chain. B, Representative fragment of chain F from 5mqf (5.9 Å resolution cryo-EM model used as a template for T0954 by six groups). The density is uninterpretable on the atomic scale; this chain is a homology model, truncated to poly-Ala and rigid-body docked into patchy density. C, Equivalent region from the 100% sequence-identical 5xjc (3.6 Å cryo-EM model, used by only two groups). All sidechains are present and for the most part modeled into strong, convincing density. The lower MolProbity score for 5mqf arises simply because truncated sidechains do not contribute to clashscore nor count as rotamer outliers.

Our recommendations are:
• Only use templates with resolution poorer than 5 Å with extreme caution, and if no other option exists. At these resolutions the model is likely to be little more than a Cα trace and/or a set of rigid-body fitted homology domains (see for example Figure 10B). Sidechains are effectively invisible, and the backbone path is typically extremely vague, with the possible exception of long helices. Replace with a higher-resolution template wherever possible, even at the expense of significantly lower sequence homology.
• Resolution 3.5-5 Å: while most of the fold is typically correct at these resolutions, it is common for stretches of up to a few dozen residues to be out of register (ie, systematically shifted one or more positions toward their N- or C-terminus). MolProbity scores higher than ~3 are cause for substantial caution. If an alternative exists with higher resolution but lower sequence homology, consider using that instead. Favor models with complete sidechains over those with truncated ones.
• Resolution 2.5-3.5 Å: a "transition zone" where most of the model is usually correct. Most sidechains are at least partially visible and regions with defined secondary structure will usually be well-modeled, but loop regions are often problematic.
• Resolution <2.5 Å: except in very rare cases (or very old models) these are generally trustworthy.
• Favor newer models over older ones: data collection, computational methods, and validation statistics have all improved dramatically, particularly over the past 15 years. The two oldest models used in this round were 2hhb and 4hhb, two models of human deoxyhemoglobin deposited in 1984; 209 newer, and 28 higher-resolution, experimental models of this protein exist.
• All else being equal, choose the model with the highest resolution, followed by the lowest MolProbity score. Wherever possible, avoid models with MolProbity scores greater than about 3 and ideally aim for those with scores below 2. Developers might also consider using properties at the residue level such as difference from the mean B-factor for that structure (a useful proxy for local effective resolution and/or coordinate error), Ramachandran and rotamer probabilities, and local clashes to assign finer-grained confidence scores to templates.
• Reduce or remove reliance on static template libraries in favor of selection directly from the wwPDB. Software and server developers should consider making use of the extensive query APIs provided by the RCSB PDB 41 and/or PDBe 42 to select up-to-date templates directly from the master wwPDB database.
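The resolution and MolProbity thresholds in the recommendations above can be captured as a simple screening heuristic. A sketch (the tier labels and the tie-breaking rule are illustrative, not part of any published protocol):

```python
def template_quality_tier(resolution):
    """Coarse quality tier for a candidate template, using the
    resolution bands from the recommendations above (resolution in Å;
    larger numbers mean poorer resolution)."""
    if resolution > 5.0:
        return "avoid"        # likely Ca trace / rigid-body fitted domains
    if resolution > 3.5:
        return "caution"      # register errors common; check MolProbity
    if resolution > 2.5:
        return "transition"   # mostly correct; loops often problematic
    return "trustworthy"      # generally reliable, barring rare cases

def prefer_template(a, b):
    """All else being equal: highest resolution first (smallest Å value),
    then lowest MolProbity score. Each template is a
    (resolution, molprobity) tuple."""
    return a if (a[0], a[1]) <= (b[0], b[1]) else b
```

Finer-grained versions could incorporate per-residue properties (relative B-factor, Ramachandran and rotamer probabilities, clashes) as suggested in the text.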

| MR score could inspire a more general metric
The MR LLG score has a substantial potential advantage in providing a numerical measure of the utility of a model for one of the purposes to which models are put, that is, solving new crystal structures. It not only assesses all-atom accuracy but also provides a tangible reward for good estimates of coordinate accuracy.
However, there are serious drawbacks to this metric as it stands.
The most obvious is that it can only be used when diffraction data have been made available. Even when diffraction data are available, the scores for models of different targets are not on the same scale, because the LLG values depend on the number of reflections in the data set, and the sensitivity to model errors depends on the resolution to which the data extend. When targets are subdivided into evaluation units, different fractions of the full structure will provide the background for the LLG calculation covering an evaluation unit; because of the quadratic dependence of the LLG on the completeness of the model, 32 the calculations will all be on different scales. Data pathologies such as anisotropic diffraction or tNCS add further complications. What is needed is a metric that reflects utility in MR but can be calculated for any target; one possibility would be a score in which the agreement of Fourier terms computed from model and target is weighted by the square of the inverse resolution (to account for the density of Fourier terms as a function of resolution). In future work, we hope to implement and test such a metric. Because both crystallography and cryo-EM methods involve fitting 3D atomic models to maps, either by correlating their electron density (X-ray crystallography) or their electric potential (cryo-EM), this metric would be highly useful in optimizing modeling procedures that assist in experimental structure determination.

TABLE 1 Details of alternative templates for the cases pictured in Figure 10A
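As a concrete illustration of the proposed weighting, a correlation between model-derived and target-derived quantities in which each Fourier term at resolution d is weighted by 1/d² might look as follows. This is a sketch of the general idea only, not an implementation of any published score:

```python
import math

def weighted_correlation(x, y, d):
    """Weighted Pearson correlation between quantities x and y (eg,
    per-term structure-factor agreement from model and target), where
    the term at resolution d[i] (in Å) carries weight 1/d[i]**2 to
    account for the increasing density of Fourier terms with
    resolution. All three sequences must have equal length."""
    w = [1.0 / (di * di) for di in d]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)
```

A metric of this family could be evaluated against any target structure, crystallographic or cryo-EM, without requiring deposited diffraction data.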

| CONCLUSIONS
The progress in the TBM category that was seen in CASP12 has continued in CASP13. As noted for CASP12, the use of contact (or inter-residue distance) predictions derived from deep multiple sequence alignments is an important ingredient. One surprise, found in the work from the group A7D that used deep convolutional neural networks, is that explicit use of template models is not essential, though it is still valuable when there are closely related templates available in the wwPDB.
We were also surprised to find that few predictors appear to be taking account of measures of experimental structure reliability when choosing templates to use as starting models (or presumably to train structure prediction methods in general). Applying some simple rules when choosing templates should have an immediate impact on model quality.
Given the progress that has been achieved in the TBM category since the inception of the CASP experiments, we believe that this is a good time to raise the level of expectations for good quality models. The results from some predictors show that it is possible not only to predict the general outline of a protein fold but also to predict more of the details in terms of the main-chain and side-chain torsion angles, as well as to evaluate the local reliability of their models. To encourage development along these lines, we present a new suggested ranking score that future assessors might wish to consider as a basis for their work.

ACKNOWLEDGMENTS
We thank the experimentalists who provided diffraction data for a number of targets and Gábor Bunkóczi for advice on his tools for evaluating models for molecular replacement. This research was supported