Evaluation of model refinement in CASP13

Abstract

Performance in the model refinement category of the 13th round of Critical Assessment of Structure Prediction (CASP13) is assessed, showing that some groups consistently improve most starting models whereas the majority of participants continue to degrade the starting model on average. Using the ranking formula developed for CASP12, it is shown that only 7 of 32 groups perform better than a "naïve predictor" who just submits the starting model. Common features in their approaches include a dependence on physics‐based force fields to judge alternative conformations and the use of molecular dynamics to relax models to local minima, usually with some restraints to prevent excessively large movements. In addition to the traditional CASP metrics that focus largely on the quality of the overall fold, alternative metrics are evaluated, including comparisons of the main‐chain and side‐chain torsion angles, and the utility of the models for solving crystal structures by the molecular replacement method. It is proposed that the introduction of these metrics, as well as consideration of the accuracy of coordinate error estimates, would improve the discrimination between good and very good models.


| INTRODUCTION
The refinement category was introduced in CASP8 to assess potential strategies for further improving the quality of some of the best models produced by existing structure prediction pipelines. Although these strategies could in principle be introduced into the pipelines that they follow on from, having a separate refinement category allows focus on the endgame when models are already reasonably accurate. It also allows the exploration of what becomes possible when significantly greater computing resources can be devoted to a smaller number of starting models.
Over the years, there have been signs of progress but there have also been recurring themes in the assessments of this category. [1][2][3][4][5] It has always been true that, considering all submissions in total, more of the refined models become worse than the starting model rather than better. This reflects considerations that there are many degrees of freedom in the space of incorrect models, so that there are more ways to degrade a model than to improve it; the search space has many local minima with a relatively narrow convergence radius around the true structure; and many groups use this category (as well as other categories in CASP) as a way to experiment with novel ideas. As early as CASP8, 1 it was recognized that it is much easier to improve the agreement of a model with physics (geometric criteria including torsion angles and clashes, as measured for instance by MolProbity 6 ) than the overall fidelity of the fold, and that for distant models the two measures do not tend to be correlated. Because of problems with the dimensionality of the search, relatively conservative strategies that restrain shifts from at least the better parts of the starting model tend to be more successful because they avoid serious degradation of the model; as a result, the refined structures are almost always closer to the starting model than to the experimental structure.
Nonetheless, there has been real progress in this category. In CASP8, 1 only one group (Lee) succeeded in improving the average global distance test total score (GDT_TS) 7 from the starting models, whereas by the time of CASP12, 8 of 39 groups succeeded in improving the more stringent high-accuracy GDT_HA score. 5

| MATERIALS AND METHODS

| Target classification
A total of 31 refinement targets were chosen, with two exceptions, from among the best server models for evaluation units from the various structure prediction categories, comprising the easy and hard versions of template based modeling (TBM), free modeling (FM) and the intermediate TBM/FM. The exceptions were the refinement models for the two subunits of target T0986, that is, R0986s1 and R0986s2; both of these models were submitted by group A7D and were substantially better than the best server models. Two targets were subsequently canceled because of unexpectedly early publication of the experimental structures, leaving 29 for evaluation (Table 1). Feedback from CASP12 suggested that refinement targets larger than about 200 residues were too demanding of computational resources, so targets were restricted to domains ranging from 59 to 204 residues.
Visual inspection was used to confirm that the starting models were of reasonable quality in at least some regions of the structure, but also that there was room for improvement by refinement of aspects such as sequence register, choices of conformer, or relative orientations of subdomains or secondary structure elements.
Continuing a trend first seen in CASP12, 5 a substantial number of refinement targets came from modeling targets initially categorized as TBM/FM (5 targets) or even FM (6 targets), with 13 from TBM-easy and 5 from TBM-hard (Table 1). Figure 1 shows that there is a correlation between the original target category and the quality of the starting model judged by GDT_HA, but with substantial overlap between categories. In particular, the best TBM/FM starting model has a higher GDT_HA than the average TBM-easy starting model. Although there was an attempt to choose starting models from a variety of servers to avoid bias toward particular initial structure prediction methods, ultimately more than half of the starting models derived from just two labs. Seven starting models each were derived from models submitted by groups 324 (RaptorX-DeepModeller) and 368 (Baker-RosettaServer), while two more came from other Xu lab groups: one each from groups 221 (RaptorX-TBM) and 498 (RaptorX-Contact) (Table 1). For each starting model, predictors were given the GDT_HA score as an indication of difficulty. They were also given some information about which residues were not visible in the experimental structure and occasionally other hints listed in Table 1.

| Evaluation measures
Many of the evaluation measures, particularly the utility of models for use in molecular replacement (MR) calculations, are discussed in another contribution on the topic of template-based modeling (Croll et al., this volume). For consistency with the previous round, our primary ranking score was taken from the CASP12 refinement assessment, where relative weights of several metrics were determined by a machine-learning algorithm trained to reproduce manual rankings. 5 We also checked whether the ranking would have been affected by choosing the TBM ranking score used in CASP12 9 and in CASP13 (Croll et al., this volume).
Both ranking scores can readily be computed with results and tools on the Prediction Center website (http://predictioncenter.org). 10 The refinement ranking score from CASP12 is given by the following:

S CASP12 = 0.46 z RMS_CA + 0.17 z GDT_HA + 0.20 z SG + 0.15 z QCS + 0.02 z MP,

where the z-scores (SD above the mean from all predictions) for each model are computed according to the usual CASP conventions, as described in more detail in the TBM assessment (Croll et al., this volume). RMS_CA is the sequence-dependent Cα root-mean-square deviation between the superposed model and target computed with local-global alignment (LGA), 7 GDT_HA is the high-accuracy version of the GDT score, 7 SG is the SphereGrinder score that measures conservation of local environment, 11 the quality control score (QCS) combines measures of the relative length, position, and orientations of secondary structure elements with Cα-Cα distances, 12 and MP is the MolProbity score reflecting the stereochemical quality of the model. 6 The TBM ranking score from CASP12, S TBM, is a similar weighted combination of z-scores that adds the following measures: lDDT, the local distance difference test, 13 a measure based on comparing all-atom distance maps; contact area difference, all atoms (CADaa), a measure comparing residue contact surface areas; 14 and the accuracy self-assessment (ASE) measure, which differs qualitatively in measuring not the accuracy of the model but rather the accuracy of the modelers' estimates of local coordinate error. 10 Presumably because the accuracy of error estimates has not been evaluated for refinement models in previous rounds of CASP, some predictors did not provide them, even though they are defined as parameters that should be included in any submitted model. To assess the impact of the ASE measure within the TBM ranking score, we also ranked models by a modified score that did not include it.

TABLE 1 Source of refinement targets and information given to predictors
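As a concrete sketch, the weighted combination above can be computed from raw metric values as follows. This is a minimal illustration, not the Prediction Center implementation; the function names are hypothetical, and the sign convention that lower-is-better metrics (RMS_CA and MolProbity) contribute negated z-scores is an assumption made here so that higher S CASP12 means a better model.

```python
from statistics import mean, pstdev

# Weights from the CASP12 refinement ranking formula; the second element
# flips the sign of z for lower-is-better metrics (an assumption).
WEIGHTS = {"RMS_CA": (0.46, -1), "GDT_HA": (0.17, +1),
           "SG": (0.20, +1), "QCS": (0.15, +1), "MP": (0.02, -1)}

def z_scores(values):
    """z-score of each value against the mean/SD of all predictions."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

def s_casp12(models):
    """models: list of dicts of raw metric values; returns one score per model."""
    scores = [0.0] * len(models)
    for metric, (weight, sign) in WEIGHTS.items():
        for i, z in enumerate(z_scores([m[metric] for m in models])):
            scores[i] += weight * sign * z
    return scores
```

With two submissions, the model that is better on every metric receives a score of +1.0 (the weights sum to 1) and the other −1.0, illustrating that the score is relative to the field of predictions rather than absolute.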
In assessing the high-accuracy TBM category in CASP7, we introduced an MR score 15 measuring the utility of models for solving X-ray crystal structures by MR. 16

| Group rankings
A total of 32 groups participated in the refinement category. In the group rankings, we compared their results with those that would have been achieved by a "naïve predictor," defined as a group that simply resubmits the starting model. Figure 2 shows that, on average, predictors are still degrading the quality of the starting model by the ranking score: the majority of groups (25 of 32) score below the naïve predictor overall, while 24 of 32 degrade more models than they improve. A number of the seven groups that ranked above the naïve predictor also did well in CASP12. The Baker group (as Baker and also as Baker-Autorefine) was in first and third positions, having been in third position for CASP12. Feiglab, ranked second, was ranked sixth in CASP12. The Seok group (as Seok-server and also as Seok) ranked fourth and fifth, having ranked second (Seok) and fourth (Seok-server) in CASP12. Jones-UCL and MUFold_server, groups that did not appear in the top 10 ranking from CASP12, were in positions 6 and 7, respectively. Notably, two server groups were among the top seven: Seok-server at position 4 and MUFold_server at position 6.

FIGURE 1 Distribution of GDT_HA values seen in starting models for refinement derived from different initial modeling categories
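The comparison underlying Figure 2B reduces to counting, per group, how often the submitted model 1 beats the starting model. A minimal sketch (hypothetical function name and scores; any higher-is-better score such as GDT_HA could be used):

```python
def fraction_improved(submitted, starting):
    """Fraction of targets on which the submitted model 1 beats the
    starting model, judged by a higher-is-better score (e.g., GDT_HA).
    Ties count as not improved, matching a strict "better than"."""
    assert len(submitted) == len(starting)
    better = sum(1 for s, s0 in zip(submitted, starting) if s > s0)
    return better / len(starting)
```

A naïve predictor that resubmits every starting model scores exactly 0.0 by this measure, so any group above zero has improved at least one model.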
For a more direct comparison with the TBM assessment, it is useful to see how the refinement groups would fare when judged by the S TBM score as well. Figure 3 presents the groups in this order, showing in addition the S TBM′ score (from which the ASE metric is omitted) and the S CASP12 score. The ordering changes significantly, although the same groups occupy the top five places (with Feiglab moving up to first place and the Baker groups down the ranking). A comparison with the S TBM′ scores shows that this difference in ranking arises primarily from the inclusion of the ASE metric, with the top five groups appearing in the same order as for S CASP12 . The overall correlation between S CASP12 and S TBM′ is very high (.974), whereas the correlation between S CASP12 and S TBM is somewhat lower (.944). However, it must be noted that this difference arises primarily because some groups did not actually provide coordinate error estimates in this category and therefore score below average for the ASE component of S TBM . Inspection of submitted coordinates shows that Baker, Baker-Autorefine, and Zhang-Refinement provided constant error estimates of zero or one.
MUFold_server, on the other hand, provided numbers on a scale of tens to hundreds; these numbers were carried over from a step in the pipeline that used MODELLER 18 (Junlin Wang, personal communication), which uses the B-factor column to store violations of the target function (https://salilab.org/modeller/9.21/manual/node256.html). It seems reasonable to believe that some groups did not provide error estimates because they have not been used traditionally to assess the refinement category. The dependence of the detailed ranking order on the choice of ranking score suggests that there is little to separate the performance of the top few groups by these criteria.

FIGURE 2 Performance of refinement groups according to the default ranking score, S CASP12 . A. Sum of positive z-scores for all "model 1" submissions. The red bar indicates the score that would be achieved by a "naïve predictor" resubmitting each starting model. B. Fraction of times the submitted model 1 was better than the starting model for each group. It is notable that the two leading groups by this metric are automated servers

| Assessment of progress
The assessment of progress in the refinement category is particularly difficult because refinement is a rapidly moving target. The servers generating the starting models themselves are continually improving their methods, effectively leaving refinement with fewer ways to improve a given model and just as many ways to degrade it. The improvement in server prediction methods can come, at least in principle, from lessons learned in earlier rounds of the refinement category. Furthermore, each CASP round attracts a different cohort of new groups and novel methods, not all of which will be successful.
Finally, with each round the set of targets is of course completely different, inevitably introducing a large amount of noise in this measure.
One class of measure typically used to assess progress is the fraction of all submitted refinement models that improve on the GDT_HA and Cα RMSD metrics. [1][2][3]5 Histograms of the overall change in these metrics are shown in Figure 4A,B, suggesting that progress has stalled or even reversed. However, any measure that looks at all submitted models will be particularly sensitive to which new groups choose to participate. Restricting attention to the best-performing groups (Figure 4C-E), we indeed see that performance according to GDT_HA has held steady or slightly increased over the last few CASP rounds. Progress can also be assessed by looking at the performance of the top-ranked groups as judged by the S CASP12 score. From Figure 2B, we see that 8 of 32 groups have succeeded in improving the majority of starting models. Three groups (Baker-Autorefine, Seok-server, and Feiglab) are able to yield better models for more than 70% of refinement targets.

| Improvement over starting and TBM models
The improvement that can be achieved through refinement can also be evaluated by comparing scores for the best model 1 submission with those from the starting model. This is illustrated for the GDT_TS score in Figure 5, which shows that every starting model has been improved.
This improvement in scores from the starting model could be taken as an indication that the more computer-intensive algorithms used in the refinement category truly yield better models than the algorithms used in the TBM and FM categories. Given that almost all starting models have been produced by servers, it is also possible that the involvement of human predictors is the key factor in the improvement. This can be assessed by comparing the best initial model 1 from any predictor with the starting and best refined models, also shown in Figure 5. For most cases, the best refined model is better than the best initial TBM or FM model. Judged instead by agreement of main-chain and side-chain torsion angles (S torsion ), many groups failed to improve on the starting models (Figure 6). This is not surprising for models that reproduce the fold poorly, in which case the structural context required to choose the correct conformer is lacking. It was more surprising to see that the restrained molecular dynamics methods of the Feig lab led to substantial improvements in S CASP12 (enough to place them second overall by this measure), yet their aggregate score according to S torsion was essentially identical to that of the naïve predictor (possible explanations for this observation are discussed below). On the other hand, the Baker-Autorefine method that includes more aggressive conformational searching led to substantial improvements in both metrics. 23 The Seok and Seok-server groups (which combined molecular dynamics approaches similar to Feig with local rebuilding) yielded a somewhat more modest improvement. Overall rankings according to this metric are shown in Figure 7. Changes made to the starting models and their differences from the targets for the top five groups (by S CASP12 ) are explored in more detail in Figure 8. We performed separate analyses for "good" regions where the starting model essentially agreed with the target (defined as residues with average backbone torsion angle differences <30°) and the remainder where the conformation differed substantially.
Importantly, all five groups made only small changes to the backbone conformation in the "good" regions, suggesting that recognition and preservation of correct folds are quite robust. Changes to backbone conformation in the remaining residues were much larger, and all five groups did in fact improve overall agreement with the target by this metric. All five groups made significant changes (and improvements) to sidechain conformations. The Feig group was much more conservative in its changes than the other four.
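The split into "good" and remaining regions rests on circular differences of backbone torsion angles. A minimal sketch of that classification (the 30° cutoff follows the text; the function names and the detail of averaging over a residue's φ and ψ angles are illustrative assumptions):

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0-180),
    handling the wrap-around at ±180."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def is_good_region(model_angles, target_angles, cutoff=30.0):
    """True if the average backbone torsion-angle difference (e.g., a
    residue's phi and psi) between model and target is below the cutoff."""
    diffs = [angle_diff(m, t) for m, t in zip(model_angles, target_angles)]
    return sum(diffs) / len(diffs) < cutoff
```

The wrap-around step matters: naively subtracting 170° and −170° gives 340°, whereas the true angular separation is 20°.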

| Notable successes
Two of the more impressive successes (and one less successful case study) we saw in this round are pictured in Figure 9. A different kind of failure is illustrated by R1002-D2 (Figure 10): Baker-Autorefine apparently recognized a register error in the starting model, but although burying exposed hydrophobic residues would normally be a quite sensible move, it incorrectly opted to bury the "solvent-exposed" Tyr63 and Trp65. This led to a one-residue shift rather than the needed two-residue shift in the offending β-strand, while the introduction of two bulky sidechains into the hydrophobic core significantly disrupted the packing of the domain, resulting in an arguably worse model than those from other groups that did not change the sequence register in this region.

| MR model quality
Diffraction data were available for 11 of the 31 refinement targets.
LLG scores were computed using Phaser 19 both with and without error weighting, as discussed above. To put the results from different targets on the same scale, z-scores were computed, carrying out the calculation separately for LLG values obtained with and without error weighting. Computing the z-scores separately for each target helps to correct for differences among the targets in quality of diffraction data, the number of copies of a target in the asymmetric unit of the crystal, and the presence of unmodeled components, as discussed in more detail in the paper on TBM assessment (Croll et al., this volume). Groups were ranked, as shown in Figure 11, by mean z-score. There is considerable overlap between the top groups by this ranking and S CASP12 , with Baker, Feiglab, Baker-Autorefine, and Seok-server all appearing in the top five of both lists. However, the group AWSEM, which is in position 17 by S CASP12 , appears in third place by the MR ranking, but only when the LLG score computed by using error weighting is considered. This is a very striking example of how much value can be added to the MR calculation when good estimates of coordinate error can be provided. Feiglab moves into first place when error weighting is considered, but Baker, which failed to provide error estimates, drops from first to third in the ranking. In every case, at least one model gives an improved LLG score compared to the starting model (Figure 12).

FIGURE 9 Comparison of Baker and Feig group results for three interesting cases. A-C. Refinement target R0974s1 was a globular domain of five α-helices, the first four of which were correct in the starting model. A. Starting model (gray) compared to target. The C-terminal helix is tilted and shifted from its true position, with Ile62 packed into the core in place of Phe66. Equivalent Cβ atoms are connected by dashed yellow lines. The remainder of the target is shown in surface representation. B. The Baker (tan) and Feiglab (white) models matched the target essentially perfectly. C. The Baker-Autorefine result improved upon the starting model, but did not quite reach the target conformation. D-F. Refinement target R0981-D4 was a particularly notable success for the Baker group. D. While the starting model (white) closely matched the main β-sheet in the target, the helix spanning residues 434 (Cα shown in blue) to 441 (Cα in red) was shifted about 7.5 Å from its true position. E. The Baker method shifted this helix to within 2 Å of its true position, and correctly predicted the conformations of the entering and exiting turns. F. The next best result (from the Feig group) brought the helix to within 5 Å of the target, but added a spurious extra turn to the N-terminus. The first 17 residues of this domain were not correctly predicted by any group, and are not shown. G-I. The N-terminus of R0997, in contrast, highlights a potential pitfall of the use of fragment-based sampling methods in refinement. G. In the starting model the first helix was essentially correctly folded, but turned almost 45° from its true configuration. Additionally, the somewhat large loops flanking the second helix were poorly modeled. H. The Baker group unfolded the N-terminal helix, added two spurious extra turns to the N-terminus of the second helix, and partially unwrapped the C-terminal turn of the second helix in order to fold the following loop into a helix, a significant degradation of the model quality. On the other hand, the more conservative Feig method kept the secondary structure elements correctly folded and slightly improved the disposition of the N-terminal helix and flanking loop geometry. Cα atoms equivalent to those constituting the N-terminus and C-terminus of the first two helices in the target are shown, colored in blue, cyan, pink, and red in order of residue number
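The per-target normalization described above can be sketched as follows. Group and target labels are hypothetical, and this is only an illustration of the z-score bookkeeping, not the assessors' actual pipeline:

```python
from collections import defaultdict
from statistics import mean, pstdev

def per_target_z(llg):
    """llg: {(group, target): LLG value}. Returns {group: mean z-score},
    with z-scores computed separately within each target so that targets
    with very different diffraction quality contribute on the same scale."""
    by_target = defaultdict(dict)
    for (group, target), value in llg.items():
        by_target[target][group] = value
    z_by_group = defaultdict(list)
    for target, scores in by_target.items():
        mu, sd = mean(scores.values()), pstdev(scores.values())
        for group, value in scores.items():
            z_by_group[group].append((value - mu) / sd if sd > 0 else 0.0)
    return {g: mean(zs) for g, zs in z_by_group.items()}
```

Because each target is standardized independently, a group that dominates one easy target cannot ride that single result to the top of the mean z-score ranking.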
Figure 13 shows that the top groups improve on the starting model in most, but not all, cases.
FIGURE 10 Many refinement failures arise from a lack of context. Like many targets, R1002-D2 was a single domain excised from a larger multidomain protein.
Here the target and models are shown in ribbon/stick format (with foreground loop 84-90 hidden for clarity), while the remainder of the experimental model is shown in surface representation. A. In the experimental model (green), Trp65 and Tyr63 are buried in the interface with an adjoining domain, but shorn of this context they appear to be entirely solvent-exposed. In the starting model for refinement (white), the N-terminal β-strand spanning residues 59-63 was shifted by two residues N-terminal to its true position. B. The result from Baker-Autorefine suggests that they correctly identified the presence of a register error here, but attempted to correct it by (sensibly, given the information available) burying these two bulky residues in the hydrophobic core, shifting the register by a single position rather than the needed two.

FIGURE 11 MR LLG z-scores for top groups, sorted by the maximum z-score obtained either with error-weighted or unweighted models. Note that, although the LLG values will be unchanged when groups provide constant coordinate error estimates, the z-scores become lower because of improved performance from other groups.

Some force fields used in refinement include statistical terms intended to improve stereochemical quality (for example, terms based on the distribution of residues on the Ramachandran plot). 21 Strengthening these terms has the effect of pushing residues in marginal or disallowed conformations toward the nearest "favored" region of Ramachandran space. As has been learned in the field of experimental model building, such "Ramachandran restraints" are often counterproductive. 22 The problem in essence is that in any physically realistic force field, the nearest favored conformation to a stable outlier is rarely the correct conformation. The more common scenario is that the offending residue's backbone is sterically trapped in a conformation where one or both of its flanking peptide bonds is flipped close to 180° from its true low-energy state. In such situations, the net effect of Ramachandran restraints is to push the conformation "uphill" into a high-energy state which, while achieving a lower MolProbity score, is not necessarily any more correct than the starting point.

Another recurring problem arises for targets excised from a larger structural context, where residues buried in an interface in the full structure become exposed in the isolated domain (Figure 10). Removing the context causes such residues to appear solvent exposed, leading to large conformational changes in MD simulations and confusing conformational search algorithms. Providing the true (experimental) context is not a satisfying solution: not only is this unrealistic in terms of most real-world uses, but it would also allow most targets to yield only a single domain for refinement.
One possible solution would be to provide the entire server model as starting coordinates, with instructions specifying which portion is to be focused on for refinement.

| CONCLUSIONS
Progress in the model refinement task is difficult to measure: it inevitably becomes more difficult from one round of CASP to the next, as the predictors providing the starting models become increasingly sophisticated and leave only more subtle errors that are more difficult to address. By one of the measures that had shown improvements in past rounds of CASP (the fraction of all submitted models that improve on the starting model), progress in the general refinement community might appear to have stalled or even reversed. We feel that this conclusion would be too pessimistic: the fact that some of the refinement groups are still consistently able to improve on the best of the models provided in the initial predictions shows that the best refinement methods are matching the more easily measured improvements in the initial modeling methods.

FIGURE 13 Scatter plot comparing the increase in LLG obtained by adding the starting model to a background comprising the rest of the crystal structure with that obtained using the best refined model from each of the three top-ranked groups
For consistency, we used the score developed for CASP12 as our primary ranking score. However, we believe that in the future this should incorporate metrics that make greater demands, including agreement with main-chain and side-chain torsion angles. Even though the TBM and FM predictors are now largely providing coordinate error estimates, it seems that many participants in the refinement category fail to do so because this has not typically been used in assessment. Because good error estimates are, in fact, an essential part of a useful model, we find it unfortunate that they have been neglected traditionally in this category and strongly believe that they should be required here as well. It might also be interesting to evaluate, along with the refined model, some annotation of which parts of the model the predictor believes have been improved.
By convention, the starting models for refinement come with hints about the target. Some of these (such as the oligomeric state of the molecule or the presence of a ligand or bound metal ion) are facts that would frequently be known in a real-life modeling scenario. On the other hand, one would be unlikely to know the GDT_HA score of an intermediate model, yet this can be (and is) used to decide between more and less conservative approaches. The starting model is almost always the best server model provided in the initial modeling round. In principle, the knowledge of which server models were not chosen could be exploited, though it is difficult to know if it is. Perhaps a more random choice from among the better server models should be used.
Finally, the nature of available targets in this round of CASP reflected the move in structural biology toward larger assemblies, assisted in part by recent dramatic improvements in the capabilities of cryo-EM. A number of the targets were, in fact, components from very large assemblies determined by cryo-EM. As a result, many of the evaluation units for TBM and refinement targets chosen from them are small components divorced from their structural context. In a number of cases, knowledge of the context would be essential to making an accurate prediction. Some consideration should be given to how refinement targets can be chosen and presented to provide a better indication of their context.