Evaluation of model refinement in CASP14

We report here an assessment of the model refinement category of the 14th round of Critical Assessment of Structure Prediction (CASP14). As before, predictors submitted up to five ranked refinements, along with associated residue-level error estimates, for targets spanning a wide range of starting quality. The ability of groups to rank their submissions accurately and to predict coordinate error varied widely. Overall, only four groups out-performed a "naïve predictor" corresponding to resubmission of the starting model. Among the top groups, there are interesting differences of approach and in the spread of improvements seen: some methods are more conservative, others more adventurous. Some targets were "double-barreled," for which predictors were offered a high-quality AlphaFold 2 (AF2)-derived prediction alongside another of lower quality. The AF2-derived models were largely unimprovable, many of their apparent errors being found to reside at domain and, especially, crystal lattice contacts. Refinement is shown to have a mixed impact overall on structure-based function annotation methods that predict nucleic acid binding, spot catalytic sites, and dock protein structures.


| INTRODUCTION
The Critical Assessment of Structure Prediction (CASP) refinement category ran for the first time at CASP8 in 2008. 1 The aim was to systematically test methods that could push initial structure predictions, at that stage deriving from template-based modeling alone, closer to the native structure. At the time, it was particularly envisaged that molecular dynamics (MD)-based methods could have a significant role. At CASP9, refinement was found to have a distinct beneficial effect on model geometry, 2 although coordinate refinement remained modest and sporadic. As recognized from the beginning, 1 such geometric improvement and elimination of atomic clashes are easier than systematic improvement of coordinate accuracy: the former can be achieved by local conformational sampling, while larger-scale shifts require an algorithm that can avoid trapping in local energy minima and distinguish the correct direction of travel from the much larger number of ways in which a model structure can be degraded. Nevertheless, impressive results by the FEIG group at CASP10 demonstrated that most models could be systematically improved by restrained MD. 3 In more recent CASPs, such MD-based approaches have been profitably adopted and adapted by other groups (e.g., see Reference 4), sometimes with a specific focus such as loops. 5 Alternative approaches, most notably from the BAKER group, 6 have emerged as rivals.
It is recognized that the refinement category is something of a special case in CASP in taking as targets selected products of another category, namely the primary structure prediction exercise. This means that as the original prediction algorithms improve, including by harnessing explicit refinement steps, refinement groups need to improve every time merely to stand still in terms of the headline statistics. 7 Targets have also been observed to differ in their refinability, 7 so the obviously different selections made for each exercise might influence difficulty in unappreciated ways. Here in CASP14, AlphaFold 2 (AF2)-derived refinement targets, selected alongside poorer-quality models as "double-barreled" targets, proved to be a special case. Even the best methods failed to drive them closer to the experimental structures, but detailed analysis suggests they were, to a large extent, not meaningfully improvable since their deviations lay mainly at crystal lattice contacts where the experimental structure is potentially unrepresentative of biologically relevant conformations. As with other CASP categories, accurate model quality assessment is fundamental here, since alternative strategies can be employed for higher- or lower-quality models (e.g., see Reference 6) and refinement effort can be productively focused on areas that are predicted to be inaccurately modeled. Here we show, however, that groups still differ widely in their ability to rank submissions by overall quality and to predict local coordinate error at a residue level.

(Adam J. Simpkin and Filomeno Sánchez Rodríguez contributed equally to the work.)
It is important to remember that the value of a model, refined or otherwise, lies not only in the overall fold and what that may reveal about evolution and function, but also in its use, for example, for more detailed structure-based function prediction, 8 for structure-based in silico ligand screening, and as a search model in molecular replacement (MR), for example see References 9 and 10. Here we show that refinement affects, often positively but not exclusively so, the readout of catalytic site recognition and prediction of nucleic acid binding ability. A similarly mixed picture is obtained from comparing the protein-protein docking of unrefined and refined models with that of the experimental structures.
Less ambivalently, we show elsewhere in this issue 11 that refinement often significantly improves performance in MR, frequently converting an unsuccessful starting model into a structure that succeeds.

| Target selection and characteristics
Refinement targets were selected on a continuous basis during the CASP experiment. When a target closed for regular prediction, consideration was given to whether a submission (or occasionally two; see "double-barreled" targets below) might be suitable. This decision factored in its size (a target should be tractable even for compute-intensive methods based on MD) and quality (it should be neither irredeemably poor nor so good that significant improvement would be difficult). In addition to available quantitative measures of coordinate quality, potential targets were examined visually to be sure that their errors were plausibly refinable and, in particular, did not lie predominantly at interfaces between domains or chains. This latter selection criterion was designed to address the previous observation 12 that missing structural context hampers refinement. Table 1 indicates the characteristics of the final set of refinement targets.
Compared with previous CASPs, two new classes of refinement target were introduced. With the first, indicated in Table 1, groups were allowed 6 weeks for refinement rather than the usual 3 weeks. The 6-week extended versions bore names such as R1034x1, the regular 3-week submissions being R1034, and so forth.
The second innovation was what we refer to as double-barreled targets. As CASP14 progressed, it became obvious that one group, ultimately revealed to be AF2, performed significantly better than all others. Although the AF2 submissions typically had less, and sometimes very little, room for improvement, we considered that perfecting them further represented an interesting and potentially important challenge. Certain proteins, the "double-barreled" targets, were therefore represented by both an AF2 prediction and a prediction from another group. There were seven targets of this type and they were named, for example, R1074v1 and R1074v2, the labeling as v1 or v2 being random between targets. As an unforeseen consequence of this, for three targets one group submitted (unpublished communications) derivatives of the AF2 models as "refinements" of the non-AF2 target.
In certain places indicated below we chose to exclude these points from our analysis. Table 2 compares the sizes and categories of the CASP14 refinement targets with those of CASP13, while Figure 1 illustrates their range of quality, expressed as GDT_HA, alongside the previous two CASPs. In terms of quality, this set of refinement targets is comparable to those of previous CASPs, but the mean target size has clearly crept up from 134 to 149 residues since CASP13. There has also been a change of distribution between template-based modeling (TBM) and free modeling (FM) categories, with a shift toward more difficult targets: the latter outnumbered the former around 2:1 in CASP14, a reversal of the CASP13 distribution.

| Overall ranking
In order to allow ready comparisons with other CASPs, we used the CASP12 refinement ranking score. This score was derived using a machine learning approach to reproduce automatically the expertly assigned scores of four independent assessors. 13 For a single target it comprises five weighted z-scores (SDs above the mean of all submissions). Three of these assess atomic positional accuracy: RMS_CA is the local-global alignment (LGA 14 ) sequence-dependent root-mean-square deviation between the superposed model and target; GDT_HA is the high-accuracy variant of the GDT score; 14 and SG is the SphereGrinder score, which captures the local similarity of model and target at each residue within a sphere of 6 Å. 15 The quality control score (QCS) assesses the correctness of secondary structure elements and their relative arrangements, 16 while the MolProbity score assesses stereochemical parameters of backbone and side chains, as well as measuring atomic clashes. 17 Groups are also asked to include per-residue error estimates in the B-factor column of their submissions. These are scored at the CASP website using the accuracy self-estimate (ASE) score, which captures in a single value between 0 and 100 how well the error estimates and actual errors align for a given prediction. It should be mentioned, though, that the ASE score can be considered only a supplementary measure, since a good ASE score can correspond to a very poor structural model for which the authors "correctly predicted" large local deviations for the vast majority of atoms.
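As an illustration of how such a composite z-score ranking can be assembled, the sketch below standardizes each metric across all submissions for a target and sums signed, weighted z-scores. The equal weights and the sign convention (lower is better for RMS_CA and MolProbity) are assumptions for illustration only; the actual weights were fitted by machine learning to reproduce the assessors' scores.

```python
import statistics

# Illustrative sketch of a CASP12-style composite ranking score.
# The real weights were fitted by machine learning (Reference 13);
# the equal weights and sign conventions here are assumptions.
METRICS = ("GDT_HA", "RMS_CA", "SG", "QCS", "MolProbity")
WEIGHTS = {m: 1.0 for m in METRICS}  # assumed, not the fitted values
HIGHER_IS_BETTER = {"GDT_HA": True, "RMS_CA": False, "SG": True,
                    "QCS": True, "MolProbity": False}

def z_scores(values):
    """z-scores (SDs above the mean) of per-submission metric values."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [0.0] * len(values) if sd == 0 else [(v - mean) / sd for v in values]

def composite_scores(submissions):
    """submissions: list of dicts mapping metric name -> raw value.
    Returns one weighted, signed z-score sum per submission, so that
    a higher composite score always means a better model."""
    totals = [0.0] * len(submissions)
    for m in METRICS:
        sign = 1.0 if HIGHER_IS_BETTER[m] else -1.0
        for i, z in enumerate(z_scores([s[m] for s in submissions])):
            totals[i] += WEIGHTS[m] * sign * z
    return totals
```

A "naïve predictor" baseline can be scored by passing the starting model's metric values through the same function alongside the refined submissions.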

| Function prediction
In order to assess the impact of refinement on the readout of structure-based function prediction methods, targets that were enzymes and/or nucleic acid binding proteins were identified. Catalytic sites from the Catalytic Site Atlas (CSA 18 ) were then sought using the 3D-motif matching methods implemented in CatsId 19 and ProFunc. 20 Nucleic acid binding capacity was predicted with the structure-based methods DNA_bind 21 and BindUP. 22

| Docking assessment for function prediction
In order to assess the impact of model refinement on the ability to predict protein-protein interactions, ClusPro 23 was used to dock the subunits of targets involved in this kind of interaction. In cases where preexisting mutagenesis evidence implicated specific residues in the interaction, contact restraints inferred from these experimental data were provided. All other parameters were left at their default values. The quality of the resulting docked subunits was then assessed using PPDbench, 24 which was used to calculate the fraction of native contacts (Fnat), ligand RMSD (L-RMSD), and interface RMSD (I-RMSD) between the docked pose obtained with ClusPro and the ground truth observed in the crystal structures. These values were then used to determine the quality of the docking using the CAPRI assessment protocol (Table S1). 25
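The CAPRI-style classification of a docked pose from Fnat, L-RMSD, and I-RMSD can be sketched as below. The cutoffs follow the commonly cited CAPRI criteria but are reproduced here as an assumption; Table S1 gives the values actually used in the assessment.

```python
def capri_quality(fnat, l_rmsd, i_rmsd):
    """Classify a docked pose as High/Medium/Acceptable/Incorrect.

    fnat is the fraction of native contacts recovered; L-RMSD and
    I-RMSD are in angstroms. Cutoffs follow commonly cited CAPRI
    criteria (an assumption here; Table S1 is authoritative).
    """
    if fnat >= 0.5 and (l_rmsd <= 1.0 or i_rmsd <= 1.0):
        return "High"
    if fnat >= 0.3 and (l_rmsd <= 5.0 or i_rmsd <= 2.0):
        return "Medium"
    if fnat >= 0.1 and (l_rmsd <= 10.0 or i_rmsd <= 4.0):
        return "Acceptable"
    return "Incorrect"
```

Note that each tier requires both a minimum Fnat and at least one of the two RMSD criteria, so a pose recovering many native contacts but with a grossly misplaced ligand is still penalized.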

| Assessment of proximity of modeling errors and interfaces
In order to assess whether error regions present in the AF2 models selected for refinement were located in the vicinity of intermolecular interfaces that were not represented during refinement, and which could therefore preclude successful refinement of such local errors, the models were analyzed as follows. Error regions were defined as comprising at least three consecutive residues with a five-residue-window rolling average LGA distance (between target and experimental structure, superimposed using the sequence-dependent algorithm) of at least 3 Å. If the residues within an error region were, on average, in contact with residues of a crystal symmetry mate or of another domain, the region was classified as lying at a lattice contact or a domain contact, respectively (see Table 3).

| RESULTS AND DISCUSSION

| Overall group rankings
For comparability with previous CASP rounds we employed the CASP12 scoring for overall ranking of groups (see Materials and Methods). This score was derived using a machine learning approach to reproduce automatically the expertly assigned scores of four independent assessors. 13 It includes (see Materials and Methods) five weighted terms: three assess Cα positional accuracy, the QCS 16 assesses secondary structure elements, and the MolProbity score 17 covers stereochemical analysis. Since the CASP12 score terms are z-scores, and more groups degrade model quality overall than improve it, it is useful to compare the overall ΣS CASP12 score of each group with a "naïve predictor" corresponding simply to resubmission of the starting structures. Figure 2A shows that, across all regular targets, only four groups out-performed the "naïve predictor": the human FEIG group and its server equivalent FEIG-S, the overall top-scoring group BAKER, and the DellaCorteLab. This, along with the observation that only the FEIG group managed to improve more than half the targets (Figure 2B), is testimony to the continuing difficulty of consistently refining target structures. Quite distinct methods lie behind the most successful approaches. The FEIG and FEIG-S approaches are based on MD with flat-bottom harmonic restraints. New for CASP14 was additional sampling by the generation of multiple alternative initial models using Modeller 26 and templates identified by HHsearch. 27 The DellaCorteLab uses a modified version of the FEIG group's MD-based approach from CASP13, differing in details of salt concentration, equilibration, and restraint application. In contrast, the BAKER group carries out all-atom refinement in Rosetta using information from a deep learning framework that estimates per-residue accuracy and residue-residue distances.

FIGURE 2 Overall group ranking according to the ΣS CASP12 score (A) and proportion of models improved by each group (B). The "naïve predictor" corresponding to resubmission of the starting models is shown in pink in (A). The data used to generate these figures are from the regular refinement targets, that is, excluding the extended targets but including the double-barreled targets.
While bearing in mind that the sample size is relatively small, some differences in performance on different groups of targets can be tentatively proposed. Figure 3 shows, unsurprisingly, that more groups perform well with smaller proteins, where conformational sampling is more tractable, than with larger targets. Ten groups, including the four overall top performers, outperform the "naïve predictor" on the four small targets with fewer than 100 residues. With these small targets, the DellaCorteLab performs best, followed by FEIG and FEIG-S, which are similarly based on MD. The overall winner, the BAKER group, ranks only eighth for these targets. On the other hand, only the BAKER group beats the "naïve predictor" for the eight targets longer than 200 residues; DellaCorteLab, FEIG, and FEIG-S rank 9, 12, and 7, respectively, on these largest targets. Overall, the results suggest that MD-based approaches, at least as currently configured, perform best on the smallest targets, but for larger targets their relative performance drops and the BAKER group approach would be preferred.

FIGURE 3 Overall group ranking according to the ΣS CASP12 score for targets subdivided according to size, from small (top) to large (bottom). The "naïve predictor" corresponding to resubmission of the starting models is shown in pink in each panel. The data used to generate these figures are from the regular targets, that is, excluding the extended targets but including the double-barreled targets.

Figure 4 illustrates group rankings on targets classified by quality, as measured by their starting GDT_HA. While again remembering the rather small numbers in each category, there appears to be an overall trend in the number of groups out-performing the "naïve predictor", from seven for the lowest-quality starting structures to none where the targets were already of reasonable quality with GDT_HA > 70: evidently gross errors are generally easier to correct than the final incorrect details.
Viewed by target starting quality, there does not seem to be any observable overall difference among the top four performers between the MD-based methods and the BAKER group results. Interestingly, the JLU_Comp_Struct_Bio submission performs best in both the 60 < Starting GDT_HA ≤ 70 and the Starting GDT_HA > 70 categories. It employs a neural network implementation of generalized solvation free energy 28 to allow rapid structure refinement by differentiation rather than more expensive conformational sampling. 29

FIGURE 4 Overall group ranking according to the ΣS CASP12 score for targets subdivided according to starting quality, from poor (top) to good (bottom). The "naïve predictor" corresponding to resubmission of the starting models is shown in pink in each panel. The data used to generate these figures are from the regular targets, that is, excluding the extended targets but including the double-barreled targets.

Figure 5 shows the distribution of ΔGDT_HA and ΔRMS_CA values for submissions by refinement groups; positive ΔGDT_HA and negative ΔRMS_CA values indicate refinement toward the experimental structure.
The overall percentages of improved models are no better than, and perhaps somewhat worse than, those in recent CASP experiments. However, the AF2-derived refinement targets had some special properties that materially influence these numbers, as discussed later. Figure 5 shows that the overall picture clearly improves when the AF2-derived targets are excluded.

| Refinability
Since even the best-performing groups clearly struggle with some targets, we thought it interesting to study which kinds of targets could be refined, and which consistently confounded the refinement groups.
We therefore devised a simple metric of refinability (see Materials and Methods), which sums the improvements (or deteriorations) seen on a per-target basis. The basic refinability scoring concept can be applied to selected or all groups and to selected or all submissions.
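A minimal sketch of such a per-target refinability score, here the GDT_HA variant summed over whichever set of submissions is supplied (all groups, the top groups, best models, and so on):

```python
def refinability(starting_gdt_ha, refined_gdt_ha):
    """Sum of per-submission GDT_HA changes for one target.

    starting_gdt_ha: GDT_HA of the refinement target (starting model).
    refined_gdt_ha: GDT_HA values of the chosen set of refined
    submissions. A positive total means the target was, on balance,
    improved; a negative total, degraded.
    """
    return sum(r - starting_gdt_ha for r in refined_gdt_ha)
```

The same skeleton covers the other variants simply by changing which submissions are passed in or by substituting another accuracy measure for GDT_HA.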
Analysis (Figure S1) of the six variant scores we trialed shows that refinability is significantly correlated only with starting model quality (Figure 9). This suggests that the best groups achieve similar performance across the range of target difficulties.

FIGURE 9 Correlation between target refinability, defined as the sum of the differences of GDT_HA before and after refinement, and three different factors: the starting GDT_HA of the refinement target, the target's percentage of regular secondary structure, and its total number of residues. The top row corresponds to data obtained across all submissions from all groups, the bottom row to data across the top four groups' best submissions. A linear model was fitted to the data in each panel and is shown as a line, together with the resulting R² value. Shaded bands around the regression line depict the 95% confidence interval for the regression estimate. Each point represents a different refinement target; those colored orange are refinement targets derived from AF2 modeling results. Only refinement target accuracy is correlated significantly with refinability, and the correlation is weaker for the top groups than for all groups.

Turning to the error regions of the AF2-derived targets, Table 3 presents a summary of the analysis described in Materials and Methods. It is evident that the error regions in the initial AF2-derived targets are quite commonly found at crystal lattice contacts (eight regions, 64 residues), only rarely at interfaces with other domains of the same protein (one region, five residues), and, in this set, not at all at interfaces with other chains. The remainder, which we term uncomplicated errors, fall into none of these categories for at least one chain in the asymmetric unit: these comprise five error regions encompassing 35 residues.
Some cases (italicized in Table 3) are categorized differently in different chains and are counted as described in the note below.

TABLE 3 Numbers of residues in error regions of the AF2-derived refinement targets: lattice contacts, 64; domain contacts, 5; uncomplicated errors, 35. Note: Error regions are classified (for each chain where appropriate) according to whether they predominantly lie near other symmetry mates in the crystal lattice, other domains in the native protein containing the refinement target sequence, or neither. We considered the possibility of contacts with other chains in the asymmetric unit but there were no such cases. Each cell contains the ranges of residues considered as error regions in the AF2-based refinement target. Numbers in parentheses correspond to the average number of contacting residues (in a symmetry mate or another domain) for residues in the error region. Where a region is categorized differently in different chains (italicized) it is excluded from the lattice contact and domain contact totals but included in the uncomplicated error column. a These two error regions have residues missing in chain B; it is thus not clear whether they should be classified as a domain contact or as uncomplicated, and they are therefore excluded from the counts.
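The error-region detection described in Materials and Methods (a five-residue rolling average of per-residue LGA distances, with runs of at least three consecutive flagged residues reported) can be sketched as follows; the centred-window treatment of chain termini is an assumption.

```python
def error_regions(distances, window=5, cutoff=3.0, min_len=3):
    """Identify error regions from per-residue superposition distances.

    distances: per-residue LGA distances (Å) between refinement target
    and experimental structure. A residue is flagged when the rolling
    average over a `window`-residue window centred on it is >= cutoff;
    flagged runs of at least `min_len` consecutive residues are
    reported as (start, end) index pairs. Edge handling (shrinking
    windows at chain termini) is an assumption of this sketch.
    """
    n = len(distances)
    half = window // 2
    flagged = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        win = distances[lo:hi]
        flagged.append(sum(win) / len(win) >= cutoff)
    regions, start = [], None
    for i, f in enumerate(flagged + [False]):  # sentinel closes last run
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    return regions
```

Each reported region could then be classified as a lattice contact, domain contact, or uncomplicated error by counting contacting residues in symmetry mates or other domains, as in Table 3.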
FIGURE 11 Comparison of error regions in (A) R1067v2, an AlphaFold 2-derived target with a starting GDT_HA of 79, and (B) R1091-D2, deriving from a tFold-IDT prediction with a starting GDT_HA of 61. Error regions are colored according to whether they are at lattice contacts (red) or not (green). The remainder of the refinement target is colored cyan and is superimposed on the complete chain of the experimental structure (dark blue), with symmetry mates shown in gray.

Figure 11 illustrates the error regions determined for AF2-derived R1067v2 and how they are each positioned near a crystal lattice contact that could plausibly distort the structure and thereby provide an explanation for the error. For comparison, we also show the non-AF2 target R1091-D2, which contains error regions that are uncomplicated by contact with crystal symmetry mates, other chains, or other domains.
The data appear to show a significant co-location of AF2 target errors and crystal lattice contacts: significantly more residues in error regions are found at crystal lattice interfaces than not. Remembering the extremely high overall quality of AF2 models in general, the question arises as to which of the structures, the AF2 prediction or the crystal structure, should be considered the more authentic in these cases. Ordinarily, the structure based on experimental data would immediately be preferred, but at crystal lattice contacts, where unnatural distortions can occur, the crystal structure should not necessarily be trusted to the same extent. Since crystal lattices take no part in the AF2 calculations (to the best of our knowledge), the resulting models do not suffer from this disadvantage. Naturally, they are only predictions, yet for the bulk of many targets they are as close a match to the native structure as would be another crystal structure of the same protein (see elsewhere in this issue). It seems we are forced to consider the prediction as not necessarily less useful or authentic than the experimental structure in these regions.
Returning to the question of refinability, overall the results suggest that the apparent unrefinability of AF2-derived targets can partly be explained by the fact that many remaining small errors lie at crystal lattice contacts. Thus, the "correct," experimental structure used as a reference for refinement assessment may not necessarily be fully representative of the conformation(s) accessible in solution. This means that parts of the reference structure might not be accessible to or targeted by a refinement protocol that seeks a global energy minimum and/or a structure that satisfies covariance information deriving from residue contact constraints on natural conformations.

| Self-assessments
In addition to submitting coordinates, refinement groups reported their own assessment of model accuracy in two ways: firstly at the global level, by ranking models from 1 to 5 in decreasing order of accuracy, and secondly at the local level, by submitting per-residue error estimates that are scored with the ASE score (see Materials and Methods).

The ASE values also allow an analysis, by target, of features that are associated with the ability to estimate errors accurately. We found no association with secondary structure class (all-α, all-β, mixed), percentage of regular secondary structure, or number of residues (not shown).
However, there was a strong correlation between the mean ASE of a target (across all the groups shown in Figure S2 and for all refinements) and its starting GDT_HA. Curiously the AF2-derived targets again performed differently, having lower ASE values than other targets of similar starting GDT_HA. Evidently, it is harder to predict residue error for AF2-derived targets than for other comparable proteins.
This is presumably because the AF2-derived targets were generally high quality throughout, not following the typical pattern of lower accuracy in exposed loops.

| Extended targets
At CASP14, for the first time, refinement groups were invited to submit results for a subset of targets after 6 weeks of work, in addition to submissions after the usual period of 3 weeks. The rationale was that some refinement methods, especially those based on MD, are quite compute-intensive and so can benefit from a longer window, particularly when dealing with larger targets. Figure S3A shows the groups ordered by overall performance (Figure 2) and illustrates the sum of all improvements made, expressed as ΣΔGDT_HA, over model_1 submissions for all targets. Somewhat surprisingly, the 6-week submissions are as often worse (12 groups) as they are improved (also 12). For the remaining three groups (DellaCorteLab, BAKER-experimental, and MULTICOM_CLUSTER), the 3- and 6-week scores are identical, reflecting repeat submissions. Figure S3B shows the variation of scores on each of the seven extended targets. Again, equal numbers of targets benefit or suffer overall from the additional 3 weeks, while R1029 scores similarly at the two time points. Taken together, these results suggest that there is little benefit from the extended 6-week submission window.

| Structure-based function prediction
A major application of protein modeling lies in the better interpretation and prediction of function. Function prediction in CASP is a separate category reported elsewhere in this issue, 35 but we wished to assess here what impact model refinement had on the ability to read out function from protein structure. We focused on servers that are readily accessible to the community. Inspection of the information provided to CASP predictors was combined with some initial analysis and literature review to identify functions encoded within the refinement targets that would be interpretable using structure-based methods. This produced four enzymes (R1053, a PI3 kinase; R1056, a metalloprotease; R1057, a methyltransferase; R1067, an LD-transpeptidase) with catalytic sites potentially discoverable by structural motif matching in ProFunc 20 or CatsId. 19 R1057, along with the non-catalytic R1068, was a DNA-binding protein, a function potentially discoverable using DNA_BIND 21 or BindUP. 22 Finally, we identified three targets that contribute to protein-protein interactions and considered testing their performance in docking using ClusPro. 23 In order to be able to measure the impact of refinement we required, for at least one criterion, that the experimental structure give a positive prediction while the refinement target yield a negative result. Any positive impact of the refinement would then be evident in the function annotation emerging from the refined version. Unfortunately, only one of the four enzymes (R1057, an N4-cytosine methyltransferase) fulfilled these criteria. For one of the docking candidates, the structure of an interacting DNA polymerase was available in the PDB. 32 Unfortunately, even with mutagenesis evidence implicating specific residues on each partner in the interaction, 32 no plausible binding mode between the two structures was obtained.
The refinement targets that could be used were both chains of T1065 which are described by the submitters as two subunits of Serratia marcescens N4-cytosine methyltransferase (although our own unpublished analysis suggests they may be a toxin-antitoxin pairing).
We did pairwise docking between crystal structures, unrefined targets, and the model_1 refinements of the top 5 groups, looking at the top predicted binding mode in each case. We defined the receptor as the larger T1065s1 and the ligand as T1065s2. As Table 5 and Figure S4A show, the crystal structures can be docked by ClusPro to closely capture the native interaction. Replacing the crystal structure of the ligand with the refinement target still yields good results (Figure S4B), but the refinement target version of the receptor is not successfully docked to the ligand crystal structure (Figure S4C). Nevertheless, the pair of refinement targets dock well. The impact of refinement here is again mixed. Positively, refinement of the receptor structure prediction by three of the four groups tested improved the results significantly, giving native-like poses where the unrefined target did not (e.g., Figure S4D). On the other hand, the good-quality result between ligand crystal structure and receptor refinement target is lost upon any of the tested refinements of the latter.

TABLE 4 Note: Included are the five models of the top four groups along with model_1 from the other six groups that were ranked in the top 10 for both targets. CatsId identifies structural matches to catalytic sites among all Protein Data Bank proteins; scores listed are for methyltransferase hits. Scores above 0.02 are an indication of correct assignment of catalytic function. No models surpassed this threshold for a methyltransferase hit, but scores for any methyltransferase hits are displayed (bold). It should be noted also that where a methyltransferase hit was recorded, other hits with unrelated catalytic sites were also observed. For ProFunc scores, the higher the score of an active-site template match the greater the confidence in a hit: methyltransferase hit scores are again highlighted in bold. DNAbind predicts DNA-binding ability even from low-resolution, Cα-only protein models: proteins with scores above the 0.5313 threshold are predicted to bind DNA (bold). BindUP predicts nucleic acid binding function given the protein's three-dimensional structure.

Notably, the top-ranked BAKER approach incorporates Deep Learning, which has revolutionized protein structure prediction in recent years, using it to estimate errors and thereby guide the diversification and optimization of refined derivatives of the refinement target. Also notable is the use of a neural network by JLU_Comp_Struct_Bio, 28 the best-performing group for refinement of higher-quality starting models with GDT_HA > 60.

| CONCLUSIONS
The CASP organizers introduced two new features to the refinement challenge this time. Some targets were allowed an additional 3 weeks of time, with submissions at a 6-week checkpoint in addition to the usual 3-week one. Though well motivated by the compute-intensive nature of many refinement protocols, the results were disappointing: the quality of the extended-target refinements was as likely to be worse as to be better, even among submissions from the best groups.
Also new this year were "double-barreled" targets, where groups were challenged to refine lower- and higher-quality predictions for the same target. The higher-quality predictions were from a single group, later revealed to be AF2. Despite containing regions differing from the experimental structure, these proved to be essentially unimprovable by two orthogonal measures of protein quality. Digging deeper, we found that a majority of the structural differences to the reference experimental structure lay at crystal lattice interfaces. Bearing in mind the potential distortion introduced by formation of the crystal lattice, it seems possible that the failure to "improve" the quality of these error regions in the AF2 models may simply reflect that the experimental reference structures are in non-natural conformations at these points. The code we developed to categorize error regions as lying at lattice or other interfaces may prove useful to future CASP refinement assessors for the selection of targets with uncomplicated and improvable errors.

TABLE 5 Note: Different combinations of structures were tested using, for each subunit, the crystal structure, the structure provided to the groups as the refinement target, and the model_1 submitted by each of the top four refinement groups. These top predictions are indicated simply by the refinement group name in the table. For each docking exercise, the ClusPro cluster size and lowest reported energy were recorded. Additionally, the top cluster was selected for further docking quality assessment, where the fraction of native contacts (Fnat), ligand RMSD (L-RMSD), and interface RMSD (I-RMSD) were recorded and used to estimate the docking quality based on the CAPRI assessment protocol (see Materials and Methods and Table S1).
Remembering that structure predictions are frequently used by biologists for interpretation or prediction of function, we looked at the impact of refinement on structure-based function annotation methods for catalytic sites, nucleic acid binding capacity, and proteinprotein docking. Although only a small number of refinement targets were suitable, and although the picture was mixed, it is clear that refinement can sometimes yield a correct structure-based function read-out for a refinement target that did not give a positive result.
Importantly, the server FEIG-S was among the groups whose refinements behaved in this way suggesting that biologists should consider structure-based hypotheses from server-refined models in addition to analyzing the original structure predictions. We also looked at the impact of refinement on the prospects for use of structure predictions in Molecular Replacement 36 where the picture was very strongly encouraging: we frequently observed success with a refined version where the original prediction failed.
Finally, in the post-AF2 era, it is relevant to consider whether and in what form the refinement category should persist in the CASP experiment. Clearly, if all structures can be computationally predicted by readily available software with the same accuracy as they can be experimentally determined, then there is no refinement to be done and the category dies. However, we are not yet in that position, despite the remarkable performance of AF2. 37 Firstly, AF2 did produce some lower-quality models for which refinement would potentially be of use. Secondly, AF2 is not yet available to the community, and we have clearly shown the benefits of refinement of others' models. Finally, it is not yet clear that AF2, or any future packages inspired by it, performs equally well on all molecular architectures of interest. Nevertheless, it is probably fair to say that the space available for refinement groups to innovate and have impact is diminishing as the latest deep learning-based methods, allied to the ongoing incorporation of refinement protocols into the original predictive pipelines, ramp up starting model quality and reduce the potential for meaningful refinement. Part of the future may be a reconfiguration of the refinement category away from single-domain proteins toward more challenging multi-domain proteins or multi-chain assemblies. Another trend may be toward refining an initial prediction not against a single, potentially unrepresentative structure, but against the experimental data. As noted elsewhere, 33 MD-based methods may be particularly well suited to refining against data representing an ensemble of states: future refinement exercises could therefore include efforts to produce ensembles that better explain the experimental data than the initial submitted structure(s).

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.