Biological and functional relevance of CASP predictions

Abstract Our goal is to answer the question: compared with experimental structures, how useful are predicted models for functional annotation? We assessed the functional utility of predicted models by comparing the performances of a suite of methods for functional characterization on the predictions and the experimental structures. We identified 28 sites in 25 protein targets to perform functional assessment. These 28 sites included nine sites with known ligand binding (holo‐sites), nine sites that are expected or suggested by experimental authors for small molecule binding (apo‐sites), and ten sites containing important motifs, loops, or key residues with important disease‐associated mutations. We evaluated the utility of the predictions by comparing their microenvironments to the experimental structures. Overall structural quality correlates with functional utility. However, the best‐ranked predictions (global) may not have the best functional quality (local). Our assessment provides an ability to discriminate between predictions with high structural quality. When assessing ligand‐binding sites, most prediction methods have higher performance on apo‐sites than holo‐sites. Some servers show consistently high performance for certain types of functional sites. Finally, many functional sites are associated with protein‐protein interaction. We also analyzed biologically relevant features from the protein assemblies of two targets where the active site spanned the protein‐protein interface. For the assembly targets, we find that the features in the models are mainly determined by the choice of template.

Interfaces between proteins, or interfaces between a protein and small molecules, are critical to understanding function.
Official CASP structural assessments include global and local metrics that evaluate atomic level similarity of the structural features of proteins. [2][3][4] The root mean square deviation (RMSD) was the first metric used in the CASP evaluations and it is still reported in the automatic evaluation system. The global distance test (GDT) score is effective for the automatic evaluation of predictions as it reflects absolute and relative accuracy of models for a wide range of target difficulty. In addition to GDT, several other similarity measures are used. Structural quality often tracks with functional quality, but the details of this correlation need to be further explored.
The physicochemical environments within functional sites in experimentally solved structures are strongly associated with the functional properties of proteins. Therefore, a predicted structure that contains a similar physicochemical environment to an experimentally solved structure may be the most useful one for functional annotation.
Previous studies have used a structural prediction protocol on a set of proteins and then compared the results of functional predictions with those from experimental structures. [5][6][7][8] In this work, we perform a systematic assessment that compares the ensembles of predictions of a target protein from different modeling algorithms to quantify the utility of predictions for inferring or recognizing function.
We address one simple question: to what extent do the CASP predictions accurately provide protein function information (compared to experimental structures)? To help define the term "protein function", we asked the experimentalists why they were motivated to solve the structures. Based on the experimentalists' stated motivations, we defined regions or sites for assessment, including nine sites with known ligand binding (holo-sites), nine sites that were expected or suggested by experimental authors to have small molecule binding (apo-sites), and 10 sites containing motifs, loops, or key residues with important disease-associated mutations. We evaluated the physical features of the predicted structure sites and the degree to which they shared similarity with the experimental structure sites. We previously developed PocketFEATURE (PF), an algorithm that evaluates similarity between two functional sites in terms of their physicochemical features. [9][10][11][12] As part of this work, we applied the PF algorithm to assess the extent to which physicochemical features that are observed in experimental structures can be replicated by predicted structures. We also analyzed features of quaternary structure assemblies in two oligomeric proteins and disease-causing variants, which often play an important role in protein function.

| Define sites
The biological rationale for determining a protein's structure provides a key perspective from which we evaluate the utility of predicted models. That is, what functional information should be provided by predictions from the viewpoint of the experimental authors? The answers we obtained from experimental authors varied in detail. Examples include: "First structure <in this family>. . .might help identify its function"; "Putative peptide-binding site: D1154, F1147, I1162, M1163. . ."; "Interface: 46-48, 76-82, 104-120, 218-224"; "His204 of T0894 (CdiA-CT) is involved in catalysis"; "Cys:His dyad as per other LD-TP enzymes"; "It binds ADP".
Based on the answers, we defined three categories of functional sites by manually curating these answers and inspecting experimentally solved structures. The three categories are: (1) nine holo sites: pockets based on observed ligand binding in experimental structures, (2) nine apo sites: sites based on (a) critical residues provided by experimental authors, or (b) known motifs relevant to ligand or substrate binding, and/or (c) site finding algorithms, and (3) ten critical patches: patches centered at the key residues provided by experimental authors, including functionally critical residues, loops and mutations (Table 1 and Supporting Information Table S1). We evaluated the similarity of the three categories of pockets to the experimental sites.

| Overall assessments
We compared our assessment (using PF) of the functional environment to the CASP assessments of overall structure quality (Figure 1). We aim to provide two references for users who are considering structural models for functional annotation: (1) Can model-1 (the model ranked best by predicted structural quality) provide robust functional insights? (2) Can the server (the average of all models) provide models with good functional features? PF measures the similarity between two sites in terms of their physicochemical features. The chosen official CASP assessments include the CASP ranking (see Methods), the global distance test (GDT), the template modeling score (TM), and root mean square deviation (RMSD). Figure 1 shows the correlation between PF and official CASP assessments (analyses for individual targets are available at https://simtk.org/projects/casp12funassess/). In general, the correlation between PF and TM is lower than that between PF and GDT or PF and RMSD. This corresponds with the fact that TM is often considered a more accurate measure of the quality of full-length protein structures (compared with RMSD and GDT), 13 while PF assesses local characteristics and may not reflect the quality of full structures.
CASP predictor teams could submit up to five models, ranked by their predicted quality. For each site, we either averaged scores over all submitted models ("all-models"; Figure 1 top panel) or considered only the first model ("model-1"; Figure 1 bottom panel). When we focus on the correlation between PF ranking and CASP ranking, the correlation coefficients for model-1 are consistently higher than the all-models average, indicating that predictions with higher overall structure quality often have good functional features (Supporting Information Table S2).
For example, the correlation between PF ranking and CASP ranking for T0911 is 0.4423 for all-models and 0.8878 for model-1 (predictor teams know which of their structures are likely the best). It is interesting to note that the assessments on holo sites generally have lower correlation coefficients than those on apo sites and critical patches.
The correlation between our functional assessments and the structural assessments has two modes: (1) High correlation: predictions with high overall structure quality often have good local structure quality at their functional sites. This is reflected in the higher correlation on model-1 assessments. (2) Low correlation: some predictions with excellent structure quality at local functional sites may not have good overall structure quality. For some targets, we found that PF-scores do not track with the structural assessment, resulting in low correlation coefficients (Supporting Information Table S2).
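The comparison between functional and structural rankings can be illustrated with a short sketch. It assumes a Spearman rank correlation, which is our choice for illustration only; the text does not state which correlation coefficient was used. The ranks and PF-zscores below are hypothetical values, not data from the assessment.

```python
from scipy.stats import spearmanr

# Hypothetical values for five predictions of a single site (not real data).
casp_rank = [1, 2, 3, 4, 5]                # 1 = best overall structure
pf_zscore = [-1.8, -2.0, -0.4, 0.3, 1.1]   # more negative = more similar site

# Spearman correlation compares the two orderings rather than the raw values
# (assumed here; the correlation type is not specified in the text).
rho, p_value = spearmanr(casp_rank, pf_zscore)
print(f"rank correlation: {rho:.2f}")
```

A coefficient near 1 corresponds to the "high correlation" mode described above, where structural rank and functional rank agree; a coefficient near 0 corresponds to the "low correlation" mode.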
Two servers (server-220 GOAL 14 and server-005 Baker-ROSETTASERVER 15) made predictions on all 28 sites; this provides enough data to allow a comparison between servers (Supporting Information Table S3). Both servers showed fairly good performance in structural assessments. Table 2 shows our functional assessments and CASP assessments on model-1 only. Using the CASP ranking, 12 of 28 model-1 sites predicted by ROSETTASERVER were ranked in the top 30 models, whereas 11 of 28 model-1 sites predicted by GOAL were ranked in the top 30 models. Using [...] (Table 3). When comparing only model-1, the correlation coefficients improved to 0.49-0.89 (with eight sites above 0.5). For all nine sites, the correlation coefficients between functional and structural assessments for model-1 were higher than those for all-models taken together.
That is, for holo sites, the first ranked model (the best predicted model in terms of structure quality) contained better functional characterization.
We evaluated six sites that had more than 10 predictions that were within 5 Å RMSD compared to the experimental structures (T0861, T0873, T0889, T0891, T0910, and T0911). For these six sites we selected the top 30 predictions based on functional assessments (Table 4 and Supporting Information section 2). We highlight one example to show how local functional environments can have characteristics that an overall structural assessment may not recognize. One example, T0891, is a heme binding protein. More than 70% of predictions have a GDT score better than 80. The experimental structure was solved with a heme-binding molecule.
For T0891, we compared the local features in the best PF ranked model with those observed in the best structure model (best GDT model) in Figure 2. The model-2 from server GOAL (220-2) has the best GDT score (91.74) among all the predictions, while its PF-zscore is -1.466. PF estimates similarities by matching similar microenvironments between two sites. Microenvironment refers to the local, spherical region in the protein structure that may encompass residues discontinuous in sequence and structure (see Methods). A higher number of matched microenvironments and a more negative PF-zscore suggest better similarity. The model-1 from HHPred (349-1) was ranked best by our functional assessment with a PF-zscore of -2.035, but its GDT score was 86.61. When aligning microenvironments surrounding the heme-binding site, the best structural model (220-2) shared five similar microenvironments with the experimental structure. We noticed that the secondary structures near the binding site were slightly different from those in the experimental structure. The top PF ranked model matched an additional two microenvironments to the experimental structure due to better positioning of the heme-binding motifs.

| Apo sites
The nine apo-sites were defined based on the information provided by experimental authors combined with a ligand-binding site searching program (Fpocket 16). The assessments compared sites in predicted structures (apo) to the corresponding sites in experimental structures (apo). The correlation coefficients between CASP rank and our functional assessments ranged from 0.28 to 0.75 (with five sites above 0.5) (Table 3). When comparing only model-1, the correlation coefficients improved to 0.63-0.87 (with all nine sites above 0.5). For eight of the nine sites, the correlation coefficients between functional and structural assessments for model-1 were higher than those for all-models. Notably, the average correlation between functional assessments and CASP assessments was higher than that for holo sites (Figure 1 and Table 3). [...] Table 1). [...] (Figure 4 and Table 5).
[Figure caption] The correlation coefficients for model-1 are consistently higher than those for all-models, suggesting that predictions with higher overall structure quality often have good functional features. The performance on holo sites is different from that on apo sites and critical patches (key residues): the overall (all-models) correlation coefficients for holo sites are lower than those of apo sites or critical patches.

| Critical patches
The 10 critical patches were defined based on the information provided by experimental authors and resources, such as sequence analysis and a literature review. We compared the microenvironments surrounding the patches in predicted structures with those in experimental structures. Table 3 shows the correlation coefficients between CASP rank and our functional assessments ranging from 0.40 to 0.96 (with seven sites above 0.5). When comparing only model-1, the correlation coefficients ranged from 0.41 to 0.85 (with eight sites above 0.5). In this category, model-1 (the best predicted model in terms of structure quality) and other models have similar levels of functional characterizations.
We evaluated four sites that had >10 predictions that were within 5 Å RMSD compared to the experimental structures (T0860, T0882, T0920-0, T0920-1). For these four sites we selected the top 30 predictions based on functional assessments (Supporting Information section 2). In this category, functional information is often not available to predictors (in contrast to ligand binding sites); hence, we observe greater deviation between structural quality and functional quality. For example, when we ranked model-1 for the critical patch T0920-1, the best functionally characterized prediction was 220-1 (GOAL), whose official CASP rank was 108 in terms of its overall structural quality (Table 2 and Supporting Information section 2).
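The site selection described above can be sketched as a filter followed by a sort. This is a minimal illustration under stated assumptions: the function name `top_functional`, the record fields, and the toy data are hypothetical, and we assume that the top 30 predictions are drawn from those within the 5 Å RMSD cutoff, which the text does not state explicitly.

```python
def top_functional(predictions, rmsd_cutoff=5.0, min_count=10, top_n=30):
    """Select the best functionally ranked predictions for one site.

    predictions: list of dicts with keys 'model', 'rmsd' (in angstroms) and
    'pf_zscore' (more negative = more similar to the experimental site).
    """
    # Keep predictions that are structurally close to the experimental structure.
    close = [p for p in predictions if p["rmsd"] <= rmsd_cutoff]
    # A site is evaluated only if more than `min_count` predictions qualify.
    if len(close) <= min_count:
        return []
    # Rank by functional assessment and keep at most `top_n` predictions.
    return sorted(close, key=lambda p: p["pf_zscore"])[:top_n]

# Toy data: 40 hypothetical predictions with increasing RMSD.
preds = [
    {"model": f"{i:03d}-1", "rmsd": i / 4, "pf_zscore": float(-i)}
    for i in range(40)
]
selected = top_functional(preds)
print(len(selected), selected[0]["model"])
```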
We applied PocketFEATURE to analyze patches surrounding mutations in two targets: T0948 (four patches) and T0945 (20 patches).
The four patches in T0948 cluster together and were treated as one functional site for overall assessment on critical patches, as discussed above (Tables 1-3). We analyzed the 20 mutation patches and found that the functional ranking tracks with the overall structure quality, but with great deviations (Supporting Information Table S11). Figure 5 shows

| Assessments from other research groups

| Functional prediction in dimeric targets (Capitani research group)
Two target assemblies contained a pocket at the protein interface: CckA histidine kinase (T0893) and STRA6 receptor (T0930). CckA is a histidine kinase, a dimeric bifunctional enzyme mediating both phosphorylation and dephosphorylation of downstream targets. 17 The most important features of the quaternary structure are (1) the conserved, [...] 18 A total of eight groups submitted dimeric models with acceptable oligomeric quality for T0893 (Supporting Information Figure S2 and Table S9). These were manually inspected for the presence of the three features. All the models exposed the phosphate acceptor histidine, four models correctly reproduced the connectivity of the four helices of the DHp domain, and two models predicted the correct position of the CA domain for cis autophosphorylation. However, no model included all three features.
STRA6 is a dimeric integral membrane receptor for retinol uptake that associates with the retinol binding protein (RBP) and translocates the retinol molecule into the lipid bilayer. 19 The two features of the STRA6 receptor dimer important for its function are the geometry of the cleft in the dimeric interface, which bends the outer membrane outwards, and the coordination of residues from both subunits to create the RBP-binding motif.
Unfortunately, STRA6 had no sequence similarity to any known membrane transporter, channel, or receptor at the time of the CASP12 experiment, and the prediction of its tertiary structure and assembly was unsuccessful. Therefore, no predictions were of sufficient quality to provide biologically relevant information about the function.

| Predictions at missense mutation sites (Mooney research group)
To evaluate whether structure predictions can be interpreted as an indicator of the pathogenicity status of missense mutations, we assessed secondary structure and solvent accessibility predictions.
[Figure caption] The experimental structure of T0891 has a heme binding site. Local features in the best PF ranked model are compared with those observed in the best structure model (best GDT model). The model-2 from server GOAL (220-2) has the best GDT score (91.74) among all the predictions, while its PF-zscore is -1.466 (a more negative PF-zscore suggests better similarity). The model-1 from HHPred (349-1) was ranked best by our functional assessment with a PF-zscore of -2.035, but its GDT score is 86.61. When aligning microenvironments surrounding the heme-binding site, the best structural model (220-2) shares five similar microenvironments with the experimental structure. The best PF ranked model shares seven similar microenvironments with the experimental structure.
The mutation databases ClinVar 20 and HGMD 21 [...] intermediate (0.09 ≤ RelAcc < 0.36), or exposed (RelAcc ≥ 0.36). 24 (5) The distribution of pathogenic variants in highly conserved residues as reported through ConSurf. 25 For residues affected by pathogenic variants, the average RMSD and standard deviation of relative solvent accessibility are 0.14 and 0.20, respectively (Supporting Information Figure S3). We did not find a significant difference between these values and the corresponding metrics for residues affected by VUS. Hence, a suspected correlation between pre- [...]
[Figure caption] We compare the side chains near these four residues among the experimental structures, the best GDT ranked model-1 (004-1, GDT 46.3, PF-zscore -0.93), the best GDT ranked all-models (060-2, GDT 54.5, PF-zscore -1.01), the best PF ranked model-1 (016-1, GDT 38.6, PF-zscore -1.50) and the best PF ranked all-models (303-4, GDT 39.8, PF-zscore -1.79). The best PF-ranked models share similar side-chain arrangements with the experimental structures, while the best GDT-ranked models do not distribute similarly.
[...] (Median: 0. 16

| Physicochemical properties in microenvironments carry functionally critical information
We have previously reported a system, FEATURE, 12 for representing protein "microenvironments" as statistical descriptions of physicochemical and structural features in a spherical volume of 7.5 Å radius.
A single ligand site is often composed of 10 to 20 microenvironments, each centered on one of the key residues. PocketFEATURE employs a matching system that aligns similar microenvironments, or physicochemical properties, between sites or even entire proteins (instead of sequence alignments). PocketFEATURE can distinguish statically and dynamically between similar sites, between homologs, 26 and even between unrelated proteins. 9 PocketFEATURE is able to distinguish aspects of the drug-binding pocket in FtsZ structures from different species that are not evident with other comparison methods such as RMSD. PocketFEATURE can also detect the effects of mutations in protein pockets. 26 In addition, it can detect key functional changes driving molecular dynamic trajectories. Our analysis is based on the evidence that PocketFEATURE can distinguish more finely grained physicochemical differences associated with protein function, including ligand binding or mutation effects, between sites with very similar structural properties. Of course, other methods (SiteCompare 27 and SMAP 28) that share similar characteristics could also be used.

| Differential performance on holo and apo sites
We observed that predictions using holo sites differ in quality from those using apo sites and critical patches. In all-models assessments, the correlation coefficients for holo sites are lower than those for the other two categories. Given a sequence with templates that have bound ligand(s), predictors generate "apo models" that do not take the ligand information into account (They may consider ligand information implicitly if they use templates that contain a bound ligand

| Method limitations
In general, functional utility correlates with the quality of structure predictions, but there are interesting deviations. Predictions with higher overall structural quality (model-1) often have good functional utility.
However, some predictions with good structural quality may not have the best local functional sites, and sometimes these are significantly worse. Using PocketFEATURE to evaluate physicochemical properties at local functional sites provides reasonably good discrimination between predictions with similar structural quality.
The major uncertainty of our assessment originates from the ill-defined nature of functional sites and functional centers. Even with communication with the experimentalists, it was difficult for us to achieve an undisputed functional site definition. In future CASPs, it would be useful to have a more structured and systematic procedure to retrieve biological relevance from the experimental contributors.
Nonetheless, our evaluations still suggest substantial biological utility despite some partial site definitions. We found that scoring a SNP alone (very local) does not track with the overall structure quality, but scoring patches surrounding a SNP provides more insight into functional relevance (Figure 5). We also compared defined site residues' PF assessments with local RMSD measurements (Supporting Information Figure S1). The local RMSD correlates well with overall RMSD, but not with PF-scores, suggesting that PF evaluates physicochemical properties beyond structural features. In addition, the local estimation also depends on the site definition, which is one of the key limitations of functional assessment methods.
Based on the experimental authors' answers (on 43 targets), we selected 25 targets for which sufficient information was provided. We assigned targets to three groups to assess their utility in functional annotation. Our previous work demonstrates that functional properties of a critical region can be extracted by describing their physicochemical environments. 12 We have developed the FEATURE system that computes a set of 80 physicochemical properties collected over six concentric spherical shells (total 480 properties = 80 properties × 6 shells) centered on a predefined functional center.
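The layout of such a shell-based descriptor can be sketched as follows. This is an illustration only and not the FEATURE implementation: it assumes six equal-width shells spanning the 7.5 Å sphere, sums hypothetical per-atom property values within each shell, and uses random coordinates and property values as placeholders. The function name `shell_features` is ours.

```python
import numpy as np

N_SHELLS = 6
RADIUS = 7.5                      # sphere radius in angstroms, as quoted in the text
SHELL_WIDTH = RADIUS / N_SHELLS   # assumption: shells of equal width

def shell_features(center, coords, props):
    """Sum per-atom property values within each concentric shell.

    center: (3,) coordinates of the functional center
    coords: (n_atoms, 3) atom coordinates
    props:  (n_atoms, n_props) per-atom property values (placeholders here)
    Returns a flat vector of length N_SHELLS * n_props.
    """
    center = np.asarray(center, dtype=float)
    coords = np.asarray(coords, dtype=float)
    props = np.asarray(props, dtype=float)
    dist = np.linalg.norm(coords - center, axis=1)
    shell = np.floor(dist / SHELL_WIDTH).astype(int)
    out = np.zeros((N_SHELLS, props.shape[1]))
    for s, p in zip(shell, props):
        if s < N_SHELLS:          # atoms outside the sphere are ignored
            out[s] += p
    return out.ravel()

# Placeholder data: 200 random atoms, 80 random property values per atom.
rng = np.random.default_rng(0)
coords = rng.uniform(-8, 8, size=(200, 3))
props = rng.random((200, 80))
vec = shell_features([0, 0, 0], coords, props)
print(vec.shape)  # 6 shells x 80 properties
```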

| Compare sites in predictions and experimental structures
PocketFEATURE contains two essential modules to evaluate and compare physicochemical properties of a single or a cluster of functional centers. 9 The two modules are:
1. Given two centers (each can be an atom or the average coordinates of multiple atoms) from two structures, we use the term "microenvironment" to refer to the local, spherical region in the protein structure that may encompass residues discontinuous in sequence and structure. We then measure the similarity between the two microenvironments by a Tanimoto-based approach (see Supporting Information: method description).
2. Given two binding sites (or two clusters of functional centers), we exhaustively calculate the similarities between all permissible microenvironment pairs. We then search for the mutually most similar microenvironment pairs between the two binding sites and assign alignments and similarity scores between the two binding sites (see Supporting Information Section 4: method description).
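The two modules can be sketched in a few lines. This is a simplified illustration, not the PocketFEATURE implementation: it assumes that each microenvironment is reduced to a binary feature vector, uses the classic Tanimoto coefficient in place of the Tanimoto-based score described in the Supporting Information, and treats every pair as permissible. The function names and toy vectors are hypothetical.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient of two binary feature vectors (module 1, simplified)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def mutual_best_pairs(site_a, site_b):
    """Return (i, j, similarity) for pairs that are each other's best match (module 2)."""
    # Exhaustively score every microenvironment pair between the two sites.
    sim = np.array([[tanimoto(x, y) for y in site_b] for x in site_a])
    best_for_a = sim.argmax(axis=1)   # best partner in site B for each row of A
    best_for_b = sim.argmax(axis=0)   # best partner in site A for each column of B
    return [
        (i, int(j), float(sim[i, j]))
        for i, j in enumerate(best_for_a)
        if best_for_b[j] == i
    ]

# Two toy sites, each a list of binary microenvironment vectors.
site_exp = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
site_pred = [[0, 0, 1, 1], [1, 1, 0, 1]]
pairs = mutual_best_pairs(site_exp, site_pred)
print(pairs)
```

In this toy case the third experimental microenvironment stays unmatched because its best partner in the predicted site prefers a different experimental microenvironment, which is how a prediction can end up with fewer aligned microenvironments than the experimental site contains.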
We applied the two modules of PocketFEATURE to assess the physicochemical environments of a single or cluster of functional residue centers.
For apo and holo sites, the challenge was to evaluate how well the binding sites are predicted, in terms of the pocket's physicochemical environments, given the similar quality of the overall predicted structures. We applied PocketFEATURE to compare experimental sites to the corresponding microenvironment centers in the predicted structures. The similarity between the two sites provides an estimate of the probability of a ligand binding to the predicted site, which is the biological relevance of apo and holo sites. [...] physicochemical environments of the critical regions. We adopted the procedure above with modifications based on the shape and the size of the critical regions.
[Figure caption] [...] The 15 residues surrounding this SNP are the microenvironments associated with the functional effects of the SNP. However, in the best GDT model (220-1, GDT 59.27, PF-zscore -1.32) these microenvironments are not clustered near 376 H. This is because one of the key loops was predicted away from the functional center. In the best PF ranked model (324-1, GDT 54.07, PF-zscore -1.55), the corresponding microenvironments form one cluster, with the functional loops predicted in the right position.

| Compare functional assessments and CASP structure assessments
CASP predictions were downloaded from the assessors' section of the CASP website. In the assessors' section, under the predictions folder, there was a gzipped folder for each target containing all predictions from all servers. CASP rankings and other measurements, including GDT, TM, and RMSD (official assessments), were obtained from the CASP website (CASP12 result section).
We performed two assessments: "all-models" and "model-1". For each target, each prediction server may generate one to five models, with its best model labeled as model-1 before submission to the CASP assessment committee. For "all-models", we calculated PocketFEATURE z-scores (PF-zscore) of all server models for each of the 28 sites.
Specifically, scores of all predictions of a given target from each server were treated as independent predictions. PocketFEATURE scores across these models for each site were then normalized to obtain the z-scores using the scipy.stats.zscore function. For "model-1", we applied the same procedure to models labeled model-1 by the predictors.
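The normalization step can be sketched as below. The raw scores and model labels are hypothetical placeholders, and the helper `site_zscores` is our own naming; only the use of scipy.stats.zscore comes from the text.

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical raw PocketFEATURE scores for one site, keyed by "group-model".
pf_scores = {
    "220-1": -4.1,
    "220-2": -3.2,
    "349-1": -5.0,
    "005-1": -2.7,
    "005-3": -1.9,
}

def site_zscores(scores):
    """Normalize the raw scores of one site to z-scores (mean 0, unit variance)."""
    labels = list(scores)
    values = np.array([scores[k] for k in labels], dtype=float)
    z = zscore(values)  # uses the population standard deviation (ddof=0) by default
    return dict(zip(labels, z))

# "all-models": every submitted model is treated as an independent prediction.
all_models_z = site_zscores(pf_scores)

# "model-1": keep only the models labeled model-1, then normalize again.
model1_scores = {k: v for k, v in pf_scores.items() if k.endswith("-1")}
model1_z = site_zscores(model1_scores)

print(min(all_models_z, key=all_models_z.get))  # label with the most negative z-score
```

Because the z-scores are computed separately for the two sets, a given model generally receives a different PF-zscore in the "all-models" and "model-1" assessments.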