Assessment of the assessment—All about complexes

Predicting model quality is a fundamental component of any modeling procedure, and blind assessment of these methods constitutes a crucial aspect of the Critical Assessment of Protein Structure Prediction (CASP) experiment. Historically, the main focus was on assessing methods that predict global and per‐residue accuracies in tertiary structure models. This focus shifted with the community's increased efforts in modeling complexes and assemblies. We asked the community to process the models from the CASP15 assembly category and provide estimates of the accuracy of the predicted quaternary structure, both globally and at the local interface level. Besides identifying the remarkable accuracy of modeling groups in assessing their own predictions, we set up a benchmarking pipeline to highlight different aspects of quaternary structure models and introduced a simple consensus EMA method as a baseline. While participating methods showed commendable performance, the baseline was difficult to surpass. It is important to point out that prediction performance varies for the individual CASP targets, highlighting potential areas of improvement and challenges ahead.

AlphaFold2, 13 which participated as a modeling group with the same name. Besides building models of excellent accuracy, AlphaFold2 provided very reliable quality estimates for its own models, a trend that was also observed for other modeling groups. This goes in the desired direction of modeling pipelines with fully integrated quality estimation.
Given the successes in tertiary structure modeling and the community's shifting focus to quaternary structure modeling, we aim to assess the current state of the art in estimating the accuracy of protein complex models. In addition to evaluating self-assessment in tertiary structure modeling, where we see excellent performance of participants, we processed protein complex models, which we furthermore split into core, interface, and surface residues, and observed differences in quality estimation accuracy according to residue location. CASP15 EMA participants were asked to process all models from the CASP15 multimeric prediction category 14 and to provide estimates of quaternary structure accuracy on a global level, as well as for local interfaces. We related the predictions to common reference values in the field and developed new ones where required. Similar to "Davis-EMAconsensus", we introduced a consensus method as a baseline. While there was good overall performance, the consensus baseline was difficult to surpass. A noteworthy observation is that the effectiveness of the methods varied considerably depending on the specific target being evaluated. This observation provides valuable insights into the areas that require further investigation and improvement.

| Overview of the EMA experiment performed in CASP15
Predictors were asked to assess the models from the multimeric category in CASP15. 14 There were a total of 11 129 models for 41 targets. Six hundred and fifty models with a different stoichiometry than the target were filtered out and 7 did not fulfill the strict input requirements of the assessment pipeline, resulting in 10 472 models from 40 targets for global analysis. Target T1192 was removed from the assessment as the underlying target structure only resolved nine chains of the expected A10 stoichiometry. For local assessment, another 143 models were dropped as they did not exhibit any interface contacts, resulting in 10 329 models for 40 targets with a total number of 3 046 566 interface residues.
As described above, three distinct types of scores were requested: SCORE, QSCORE and Local, for which predictions were returned by 24, 19, and 14 groups, respectively. To enable a fair comparison and avoid cherry picking of favorable targets, only groups that returned a sufficient amount of data points (80%) for at least 80% of the targets were considered for evaluation. This resulted in 22 groups considered for SCORE (Figure 1A), 17 groups for QSCORE (Figure 1B) and 13 groups for Local (Figure 2). SCORE and QSCORE were complemented by an assembly consensus baseline predictor ("AC") which is described in a separate section. One special case were the local predictions of "Manifold", which specified interface residues with 0-based indexing instead of the expected 1-based indexing. The assessors added the index-corrected data as a new group, "Manifold_2".

Additionally, all predictors in the tertiary structure and assembly categories were tasked to provide confidence estimates in the temperature factor field of their models. For most CASP editions, the confidence estimates were meant to be distances in Å. As of CASP15, percentage-scale confidence values ranging from 0 to 100%, as made popular by AlphaFold2, 11 had to be provided. The confidence estimates were meant to assess the accuracy of the relative positions of atoms in the neighborhood of each residue, including other chains in the case of complexes.

| Methods for assessing global structure accuracy estimation
TM-score 15 and a special variant of GDT, 16 designated as Oligo-GDTTS, were used as reference values for SCORE predictions. TM-score is a metric for assessing the topological similarity of protein structures. It includes a scaling factor, making it inherently independent of protein size. The TM-score in this work has been computed by the prediction center with US-align. 17 Oligo-GDTTS is conceptually similar to GDTo, which was used in the CASP13 assembly assessment, 18 and has been implemented in OpenStructure. 19 It uses Cα positions of all mapped residues from all chains to first derive a Kabsch superposition. 20 Upon superposition, these positions are used to derive the Oligo-GDTTS score, which is a GDT score with distance thresholds of 1, 2, 4, and 8 Å.
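As an illustration, the Oligo-GDTTS computation described above can be sketched as follows. This is a simplified sketch, not the OpenStructure implementation; it assumes the Kabsch superposition has already been applied and that model and target Cα coordinate arrays are mapped one-to-one:

```python
import numpy as np

def oligo_gdtts(model_ca, target_ca):
    """Average fraction of mapped, already-superposed Ca positions
    within 1, 2, 4 and 8 A of their target counterparts (GDT-TS)."""
    dist = np.linalg.norm(np.asarray(model_ca) - np.asarray(target_ca), axis=1)
    return sum((dist <= t).mean() for t in (1.0, 2.0, 4.0, 8.0)) / 4.0
```

With per-residue deviations of 0.5, 1.5, 3 and 10 Å, for instance, the four threshold fractions are 0.25, 0.5, 0.75 and 0.75, giving a score of 0.5625.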
The superposition independent QS-score 21 as well as a variant of DockQ 22 were used as reference values for QSCORE predictions. QS-score quantifies the similarity between interfaces as a function of shared interface contacts and compares full complexes at once. In contrast, DockQ strictly processes single interfaces. Due to the requirement of having one full model score also for higher order complexes, we introduced DockQ-wave. It scores full complexes as a weighted average of per-interface DockQ scores. The weight of an interface is the number of its native contacts (contacts in the target) as computed by DockQ, normalized by the total number of native contacts. QS-score is symmetric, and added contacts in the model reduce the score. This is problematic as some target structures are not fully resolved, with significant N-terminal truncations in H1111 as an example. This work assumes that most models exhibit full target sequence coverage and therefore uses a variant of QS-score which only considers contacts between residues that are present in both model and target.
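The DockQ-wave aggregation reduces to a weighted mean. A minimal sketch, where the per-interface DockQ scores and native contact counts are assumed to come from an external DockQ run:

```python
def dockq_wave(interfaces):
    """Weighted average of per-interface DockQ scores, each interface
    weighted by its number of native contacts.
    interfaces: iterable of (dockq_score, n_native_contacts) pairs."""
    total = sum(n for _, n in interfaces)
    return sum(q * n for q, n in interfaces) / total
```

For example, a perfect interface carrying 10 native contacts combined with a completely wrong interface carrying 30 native contacts yields a DockQ-wave of 0.25.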
Evaluation has been performed on all targets with at least one model having a TM-score > 0.6 for SCORE and a QS-score > 0.6 for QSCORE, which gives 39 targets in both cases. There are contributions from several metrics M: Pearson (P) and Spearman (S) correlation coefficients, ROC AUC (R) as well as loss (L). For each predictor p, the average of its per-target performances for a certain metric M given a reference value r is computed (x(r,p,M)) and transformed into Z-scores considering the average of all predictors (x(r,M)) and the respective standard deviation (σ(r,M)). Negative Z-scores are set to 0, and there is no penalty if a predictor did not return predictions for certain targets as long as the specified 80% of targets are covered. In the case of loss (L), the sign is inverted so that a low loss results in a high Z-score. The overall SCORE ranking score (RS_SCORE) for predictor p combines the per-reference-value ranking scores for the two reference values, TM-score and Oligo-GDTTS.

FIGURE 1 SCORE and QSCORE data collection. (A) Groups that returned 80% of expected SCORE data points for at least 80% of the targets (dashed line) are considered for evaluation. (B) The QSCORE equivalent. "exp" represents the total number of evaluated targets and "AC" is the assembly consensus baseline. Grayed out methods do not fulfill the 80% threshold and are not evaluated.
FIGURE 2 Local score data collection. Groups that returned 80% of expected Local data points for at least 80% of the targets (dashed line) are considered for evaluation. "exp" represents the total number of evaluated targets and "Manifold_2" represents index-corrected data for "Manifold". Grayed out methods do not fulfill the 80% threshold and are not evaluated.
The QSCORE ranking score (RS_QSCORE) is defined equivalently, using QS-score and DockQ-wave as reference values.
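The Z-score transformation described above can be sketched as follows. This is a sketch of our reading of the scheme: per-target averages per predictor are standardized, the sign is flipped for loss where lower is better, and negative Z-scores are clipped to 0:

```python
import numpy as np

def ranking_zscores(per_predictor_avgs, lower_is_better=False):
    """Standardize predictors' per-target averages for one metric and
    one reference value; negative Z-scores are set to 0."""
    x = np.asarray(per_predictor_avgs, dtype=float)
    z = (x - x.mean()) / x.std()
    if lower_is_better:  # e.g. loss: a small loss should rank high
        z = -z
    return np.clip(z, 0.0, None)
```

Note that clipping negative Z-scores to 0 means below-average predictors all contribute equally (nothing) to the ranking sum, so the ranking rewards above-average performance only.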

| Methods for assessing local structure accuracy estimation
Given a chain mapping, one can establish a correspondence between each residue in the model and its counterpart in the target structure.
This correspondence enables the classification of model interface residues as either true, if they are also part of an interface in the target structure, or false, if they are not. The task outlined for CASP15 EMA local scores is to evaluate the predictors' capacity to distinguish between these two classes. However, in addition to addressing whether an interface residue is truly located in an interface in the native structure, we have chosen to expand the scope of the challenge to assess the predictors' capacity to estimate overall local interface accuracy. Put simply, we not only want to know whether a true interface residue is correctly identified, that is, located in some interface in the native structure, but also whether its close environment is modeled correctly, including specific interactions with neighboring chains.
This work uses the lDDT score 23 and the CAD score (AA-variant) 24 as reference values to evaluate model interface residues. As was done for global assessment, evaluation metrics are computed on a per-target basis. That means all interface residues from all models of a certain target are concatenated into one dataset for evaluation.
Local interface residue identification is evaluated using the previously described classification of model interface residues. ROC AUC measures the ability of a predictor to separate the two classes. The finally reported value for a predictor p is the average over all targets for which a sufficient amount of data has been returned.
Local interface accuracy is evaluated using three metrics: Pearson (P) and Spearman (S) correlation coefficients as well as ROC AUC (R), defined as for the global evaluation. Again, metrics are evaluated on a per-target basis; the per-target performances are averaged and then transformed into Z-scores as described for the global evaluation. The overall ranking score (RS_Local) for a reference value r and predictor p is: RS_Local(r, p) = 0.5 · P(r, p) + 0.5 · S(r, p) + R(r, p).

| Methods for evaluating self-assessment
For the self-assessment evaluation, model "1" was selected for each participating group, resulting in a total of 7713 tertiary structure models and 2359 assembly models from various groups. To be included in the evaluation, groups had to evaluate more than 80% of all targets, resulting in 90 out of 132 groups for tertiary structures and 46 out of 87 groups for assemblies.
The per-residue confidence estimates for all targets provided by the participants were matched with the evaluated per-residue lDDT scores, and four scores were calculated: Pearson's r correlation coefficient, the Accuracy Self Estimate score rescaled to the range [0,1] (ASE/100), the AUC of the receiver operating characteristic curve (ROC AUC), and the AUC of the precision-recall curve (PR AUC). Compared to previous CASP editions, the formula for ASE in CASP15 changed to ASE = 100 × (1 − (1/N) Σ_i |pLDDT_i/100 − lDDT_i|), where pLDDT_i and lDDT_i are the confidence estimate and the lDDT score for residue i, respectively, and N is the total number of residues.
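If ASE is computed as 100 × (1 − mean |pLDDT_i/100 − lDDT_i|), as we read the description (the original formula rendering was lost in extraction, so this is a reconstruction with pLDDT on a 0–100 scale and lDDT in [0,1]), it can be sketched as:

```python
def ase(plddt, lddt):
    """ASE = 100 * (1 - mean |pLDDT_i/100 - lDDT_i|); a perfect
    self-assessment yields 100, maximal deviation yields 0."""
    n = len(plddt)
    return 100.0 * (1.0 - sum(abs(p / 100.0 - l) for p, l in zip(plddt, lddt)) / n)
```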
For the tertiary structure targets, the models were additionally evaluated with a consensus baseline (described in a separate section below), which could benefit from knowing all models for the same target from all participating groups. For assembly targets, the residues were further split into core, interface, and surface residues according to the target structure. Core residues were defined as the ones having a relative solvent accessible surface area ≤ 25%; the non-core residues were further split into interface residues (Cβ-Cβ distance ≤ 8 Å to another chain, Cα in case of glycine) and surface residues (the rest).
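The core/interface/surface split can be sketched as follows, with hypothetical inputs: the relative solvent accessible surface area as a fraction and the minimal inter-chain Cβ–Cβ distance, both assumed precomputed (with Cα substituted for glycine):

```python
def residue_class(rel_sasa, min_interchain_cb_dist):
    """Classify a target residue following the definitions in the text."""
    if rel_sasa <= 0.25:               # relative SASA <= 25% -> core
        return "core"
    if min_interchain_cb_dist <= 8.0:  # contact with another chain
        return "interface"
    return "surface"
```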

| Chain mapping
Chain mapping is a prerequisite to compare a model and a target structure. If the number of chains in model and target is ≤ 12, the full solution space is enumerated and a mapping with optimal QS-score is guaranteed. For larger complexes, a greedy heuristic is used instead.

| Tertiary structure baseline
Local per-residue accuracies for tertiary structure models have been estimated with a consensus baseline similar to "Davis-EMAconsensus", which was introduced in CASP10. 5 As opposed to "Davis-EMAconsensus", which is based on Cα distances, the baseline used in this work uses lDDT scores computed on Cα atoms. Given an ensemble of N models of the same target, the predicted accuracy for residue i in model x is the average per-residue lDDT score when using all other models y as reference: pred_lDDT_i(x) = (1/(N−1)) Σ_{y≠x} lDDT_i(x, y).
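A sketch of this consensus computation, assuming a hypothetical precomputed array L where L[x, y, i] holds the Cα-lDDT of residue i in model x scored against model y as reference:

```python
import numpy as np

def consensus_per_residue(L):
    """For each model x, average the per-residue lDDT over all
    reference models y != x; returns an (N, n_res) array."""
    L = np.asarray(L, dtype=float)
    N = L.shape[0]
    off_diag = ~np.eye(N, dtype=bool)  # mask out the x == y diagonal
    return np.array([L[x][off_diag[x]].mean(axis=0) for x in range(N)])
```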

| Assembly consensus baseline
We also introduced a consensus baseline ("AC") to estimate quaternary structure accuracies of full assemblies for both SCORE and QSCORE.
Again, the accuracy estimate S for model x is the average of a scoring function f when using all other models y as reference: S(x) = (1/(N−1)) Σ_{y≠x} f(x, y). In the case of SCORE, f is Oligo-GDTTS; for QSCORE, f is QS-score. This all-versus-all comparison becomes computationally demanding for large targets due to chain mapping. AC thus uses a faster version similar to the QS-align tool 25 that iteratively adds chain pairs close in space after an initial superposition, while updating that superposition with every added pair.
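The same all-versus-all scheme, written generically with score_fn standing in for Oligo-GDTTS or QS-score (a sketch that ignores the chain-mapping cost discussed above):

```python
def assembly_consensus(models, score_fn):
    """Estimate S(x) for each model as the mean of score_fn(x, y)
    over all other models y in the ensemble."""
    estimates = []
    for x in models:
        others = [y for y in models if y is not x]
        estimates.append(sum(score_fn(x, y) for y in others) / len(others))
    return estimates
```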

| Availability
Code and documentation for data collection, processing, scoring, baselines and evaluation are available as a Git repository: https://git.scicore.unibas.ch/schwede/casp15_ema. The code allows injecting results from custom methods and directly comparing them in the context of CASP15 (see <GIT_ROOT>/custom_analysis). The code relies on chain mapping and scoring procedures that have been integrated into the OpenStructure computational structural biology framework 19 (https://openstructure.org/).

| Assessment of accuracy estimation
In case of global analysis, the "ModFOLDdock" family of predictors in general and "MULTICOM_qa" in case of SCORE perform best in the overall ranking (Figure 5A). "Venclovas", in addition to showing strong performance in the overall ranking, performs particularly well in model selection, that is, Loss (Figure 5C). However, as in previous CASP-EMA iterations, a simple consensus method represented by "AC" is among the top performing methods in SCORE and top in QSCORE.

FIGURE 4 CASP15 target H1114. (A) Reference structure. (B) Model H1114TS360_1. The reference structure with stoichiometry A8B8C4 is dominated by four subunits with stoichiometry A2B2, which are connected by four short central chains with stoichiometry C4. The model accurately represents the four subunits with all their interfaces but misses the small but critical interfaces in the center. As a consequence, the QS-score is high (0.862) but the TM-score, which evaluates topology, is low (0.329), highlighting the requirement of considering both aspects in global model evaluation: overall interface accuracy as well as overall topology.
Prediction performance is far from constant over all targets, and target-specific trends can be observed when using the Pearson correlation between QS-score and the QSCORE predictions as a proxy (Figure 5E).
One example is a set of five nanobody targets bound to their antigens: H1140, H1141, H1142, H1143 and H1144 (marked as red dots in Figure 5E). H1143 has numerous high quality models that tightly cluster around the native structure, simplifying model selection for "AC" and other methods employing consensus information. However, the other targets proved to be challenging to model, highlighting general difficulties for nanobodies. 14 EMA predictors also have a poor ranking performance there (Figure 5E). However, in terms of loss, the top model selected by "Venclovas" was close to the target in four out of five cases (QS-score loss < 0.1). The pure single model method "GuijunLab-RocketX" managed to do that for three out of five cases, whereas "AC" only succeeded for H1143.
Other examples are related to structural flexibility, like H1171 and H1172, for which several quaternary structure states are resolved, or the particularly intriguing target T1121o (marked as a blue dot in Figure 5E). Many models cluster around a conformation that is clearly distinct from the target structure. As a consequence, models from this cluster are strongly preferred by any method that uses consensus information. But pure single model methods such as "VoroIF" and "GuijunLab-RocketX" also score them favorably. The target represents an inactive closed conformation of the DNA interacting complex JetD in P. aeruginosa, which requires significant structural flexibility to explain its observed biochemical activities. 26 It may well be that the observed cluster resembles a relevant intermediate state. More significantly, these examples highlight the impact of structural flexibility also from a quaternary structure perspective, which is not necessarily covered if only one static target structure is available.
Global accuracy estimates are typically employed to select the most suitable models from a set of alternatives. However, such estimates fail to capture the nuances of individual interfaces of interest, thereby resulting in the loss of crucial information. Despite not achieving the highest performance in global quality estimates, the single model method "GuijunLab-RocketX" exhibits exceptional performance when assessing local interface accuracy, that is, per-residue lDDT and CAD, which assess the accuracy of relative atom positions in the neighborhood, including neighboring chains, as well as PatchQS and PatchDockQ, which primarily consider inter-chain information (Figure 6B). Two methods that are not top performers in local interface accuracy assessment show a strong capacity in identifying true interface residues: "ModFOLDdock" and "Manifold_2" (Figure 6A).
We performed a specialized analysis on the antibody CASP prediction targets H1166, H1167, and H1168. They all consist of the heavy and light antibody chains as well as a bound antigen. Global scores are largely dominated by the interface between the antibody chains and poorly reflect the main question of biological interest, which is correct antigen placement. We therefore re-ran the local score analysis pipeline on the three antibody targets and only included the 10 predictors that submitted data for all of them. We excluded data from model interface residues that are part of the heavy/light antibody interface to deliberately direct our focus towards interactions occurring specifically at the antibody/antigen interface. "GuijunLab-RocketX" still ranks best in the local interface accuracy ranking (Figure 6D), which can largely be attributed to its commendable performance with respect to lDDT and CAD. Antibodies themselves are often accurately modeled while the antigen is at the wrong location, making antibody targets particularly susceptible to issues with added model contacts that are not penalized by these reference values (see Section 2). It is therefore worth allocating additional attention to PatchQS and PatchDockQ, for which "ModFOLDdockR" and "ModFOLDdock" particularly stand out. These two methods also perform best for local interface residue identification (Figure 6C), meaning they are best suited to identify the correct antigen binding site.

| Self-assessment evaluation
For the tertiary structure targets, the top-performing groups based on the sum of the four evaluation scores (Pearson's r, ASE/100, ROC AUC and PR AUC) were "ColabFold", "colabfold_human", "FoldEver", the AlphaFold2 baseline ("NBIS-AF2-standard"), and "MUFold" (Figure 7A). However, the differences among the top groups were minimal, with the top 20 groups all reaching high summed scores, ranging from 3.56 to 3.66 (with 4.0 being the perfect score).
The self-assessment performance of these groups was on par with or better than the consensus baseline evaluated on each group's models, which further shows that current methods can reach an excellent level of accuracy in their confidence estimates. When comparing the self-assessment performance with the accuracy of the predictions, measured as the average per-residue lDDT (Figure 7B), a trend was observed among the top methods: the methods with the most accurate models had slightly worse self-assessment scores. This trend was particularly visible in the highest Pearson's r score for "Agemo", which may, in part, be due to a wider distribution of lDDT values.
For the assembly targets, the top-performing groups based on the sum of the four evaluation scores were "ColabFold", "Kiharalab_Server", and "colabfold_human", with summed scores of 3.59, 3.51, and 3.49, respectively (Figure 7C). The group with the highest prediction accuracy ("Yang") was ranked 19th with a score of 3.27. The scores for the assembly targets were generally lower than those for the tertiary structure targets, and the differences between the top-performing groups were larger. For assemblies, we also compared the self-assessment performance on different parts of the protein complexes by splitting the target structures into core, interface and surface residues (Figure 7D). Here, we observe a consistent trend among the groups to reach the highest self-assessment accuracy for core residues and the lowest for interface residues. This may be caused by a lower accuracy of the target structures in the interface regions, 14 but it likely points to the fact that there is still room for improvement for more accurate predictions of protein complexes.

The current CASP EMA iteration has established the state of the art in model quality estimation, with a particular emphasis on quaternary structures. Moreover, the self-assessment of both tertiary and quaternary structure predictors was thoroughly analyzed.
A promising trend has been noted in self-assessment. The top-performing modeling groups have demonstrated the ability to provide precise per-residue accuracy estimates that are comparable to the consensus baseline, without the requirement of a full ensemble of models as input. This trend is presumably a result of most top-performing groups utilizing AlphaFold2 in some capacity, which had already demonstrated commendable self-assessment performance in CASP14. It is worth noting that a straightforward consensus baseline has historically produced favorable results, particularly for local per-residue estimates. However, the accuracy of self-estimates depends on the location of the residue in question. Notably, interface residues are particularly challenging to assess accurately. While it is not possible to completely eliminate the influence of the accuracy of the underlying target structure or assessment procedures, this indicates that there is still room for improvement in this area.
For the quaternary structure quality estimates, a commendable level of performance can be observed for the individual methods.
However, the simple consensus baseline method "AC" is difficult to surpass. While "AC" performs well in general, it fails in the discussed nanobody example, which highlights the importance of methods that are capable of selecting individual high-quality models that are not part of well-defined clusters. Another observed challenge is structural flexibility. The target metrics used rely on a single static reference structure, which does not fully capture the range of conformational changes that a protein can undergo. This is a well-known problem for tertiary structure models but should now gain renewed attention in the context of quaternary structures.
Two requested model reliability scores in the range [0.0, 1.0] reflect global, full model accuracy. The first, denoted SCORE, reflects the similarity of the full model/complex to the target upon global superposition. The second, denoted QSCORE, solely evaluates the accuracy of interfaces. Additionally, local scores in the range [0.0, 1.0] were requested for each model interface residue. They reflect the probability of a model interface residue being in the actual interface of the native structure. Interface residues are defined as having a contact with at least one residue from another chain (Cβ-Cβ distance ≤ 8 Å, Cα in case of glycine).

ROC AUC requires a threshold to label reference values. We avoid fixed thresholds and instead use the 75th quantile, which conceptually turns the evaluation into the question of how well a method can separate the top 25% of the data points. Loss is the difference between the best reference value observed in a dataset and the reference value of the model with the best score as assigned by the predictor. If several models have the same score assigned, the first is selected. The ranking score (RS) for one reference value r and predictor p is: RS(r, p) = 0.5 · P(r, p) + 0.5 · S(r, p) + R(r, p) + L(r, p).
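The loss definition can be sketched as follows (a sketch; reference_values and predicted_scores are assumed to be parallel lists over all models of one target):

```python
def loss(reference_values, predicted_scores):
    """Best reference value in the dataset minus the reference value of
    the model the predictor ranked first (ties: first occurrence wins)."""
    top = max(range(len(predicted_scores)),
              key=lambda i: (predicted_scores[i], -i))
    return max(reference_values) - reference_values[top]
```

A loss of 0 means the predictor's top pick is the best available model; larger values quantify how much quality was left on the table.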

Both reference values are conceptually contact based, which allows evaluating intrachain contacts (tertiary structure specific evaluation) or interchain contacts (interface specific evaluation). This work deliberately includes both types of contacts to assess the relative positions of atoms in the neighborhood, including neighboring chains, given a chain mapping. However, lDDT and CAD do not explicitly penalize added contacts, which hampers the assessment of model residues that should be on the surface but are buried in a wrongly modeled interface. Additionally, lDDT and CAD may be dominated by intrachain contacts. Despite the described shortcomings, we consider lDDT and CAD useful measures to describe the local full atomic environment of a residue; they help to give a comprehensive picture of local interface accuracy when complemented with appropriate alternatives. We thus introduced two additional reference values, which are local variants of QS-score and DockQ and strictly assess interchain interactions: PatchQS and PatchDockQ. They are both evaluated for each model interface residue r. Given a model interface residue r in chain A, two local interface patches are first extracted, as illustrated in Figure 3D:

• Patch one: (cname = A and 8 Å <> r) and (12 Å <> cname != A)
• Patch two: (cname != A and 8 Å <> r_min) and (12 Å <> A)

with:

• distances based on Cβ-Cβ (Cα for GLY)
• r_min: the closest residue to r in any chain != A
• <>: within

In words, patch one consists of all residues in chain A within 8 Å of residue r that at the same time are reasonably close (12 Å) to any other chain. Patch two uses r_min as a reference point. It consists of all residues of any chain other than A within 8 Å of r_min that at the same time are reasonably close (12 Å) to chain A.
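As an illustration of the patch-one definition, a toy sketch with hypothetical helper signatures (residue identifiers, a chain lookup and a Cβ–Cβ distance function; the actual implementation in the repository differs):

```python
def patch_one(residues, r, dist, chain_of):
    """Residues of r's chain within 8 A of r that are also within
    12 A of at least one residue of another chain."""
    a = chain_of(r)
    return [x for x in residues
            if chain_of(x) == a
            and dist(x, r) <= 8.0
            and any(dist(x, y) <= 12.0
                    for y in residues if chain_of(y) != a)]
```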
The two patches and their mapped counterparts in the target represent dimers on which QS-score and DockQ were computed. PatchDockQ and PatchQS are highly correlated (Figure 3A); the described shortcomings of lDDT/CAD are visible as outliers in Figure 3B,C.
Chain mapping aims to automatically detect a one-to-one assignment of model and target chains. From a computational perspective, the problem has factorial complexity. In the example of target T1176 with A8 stoichiometry, there are 8! = 40 320 possible mappings, rendering exhaustive enumeration of the solution space infeasible for large targets. Problematic targets in CASP15 are H1111, H1114 and T1115 with stoichiometries A9B9C9, A4B8C8 and A16, respectively. All comparison metrics used in this work require such a mapping as input, with the exception of TM-score, which inherently provides a heuristically derived chain mapping as part of its output. 17 However, that mapping is based on a rigid superposition of full complexes, which hampers analysis on a local/per-interface basis, for which superposition independent scores that operate on internal contacts should be preferred. One such example, concerning the quaternary structure prediction of model H1114TS360_1 for target H1114, is illustrated in Figure 4. The majority of interfaces are modeled accurately despite a poor global superposition resulting from the mismodeling of small but topologically critical interfaces. All scores except TM-score thus use a custom chain mapping that aims to optimize the superposition independent QS-score.

FIGURE 3 Local patch score characteristics. (A) Density of all model interface residues with their respective PatchDockQ and PatchQS scores. They exhibit similar characteristics and correlate with a Pearson R of 0.95. (B) The equivalent with PatchDockQ and lDDT scores, exhibiting a modest Pearson R of 0.72. There are two main categories of outliers: (1) lDDT of 0.0: residues with stereochemical irregularities get penalized by lDDT; no such stereochemistry check is applied for PatchDockQ/PatchQS. (2) High lDDT, low PatchDockQ/PatchQS: lDDT does not penalize added contacts, which is relevant for incorrect interfaces. Other examples include lDDT scores that are dominated by intrachain interactions, whereas PatchDockQ/PatchQS strictly process interchain interactions. (C) PatchDockQ and CAD scores exhibit a modest Pearson R of 0.64. CAD has similar characteristics as lDDT, with the exception that CAD applies no stereochemistry checks. (D) Example patches for residue r (residue number 213 in chain A) in model H1140TS037_1 on which PatchDockQ/PatchQS are computed. r is the anchor for patch one (red). r_min, the closest residue in any chain != A, is the anchor of patch two (blue).

FIGURE 5 Global score evaluation. (A) Overall Z-score based SCORE ranking using the superposition dependent reference values TM-score and Oligo-GDTTS. (B) Overall Z-score based QSCORE ranking using the interface centric reference values QS-score and DockQ-wave. (C) Fraction of targets where predictors succeed in selecting a model close to optimal, with TM-score as reference value. (D) The same with QS-score as reference value. (E) Target specific trends in ranking performance as measured by the Pearson correlation between QSCORE estimates and QS-score for a selected subset of predictors. Nanobody targets that were generally hard to model are marked with red dots. Target T1121o, which likely exhibits large structural flexibility, is marked with a blue dot.
FIGURE 6 Local score evaluation. (A) Local interface residue identification ranking, assessing a predictor's capacity to identify model interface residues that are true interface residues, that is, part of an interface in the native structure. (B) Local interface accuracy ranking based on Z-scores using the four described reference values. The superposition independent all-atom lDDT and CAD scores take into account the entire residue environment, including residues from other chains. In contrast, PatchQS and PatchDockQ primarily consider inter-chain information. (C, D) The same analysis, but only considering antibody/antigen interface residues from the antibody CASP prediction targets H1166, H1167 and H1168.

FIGURE 7 Self-assessment evaluation. (A) Top 20 performing groups for tertiary structure targets, ranked by the sum of four evaluation scores and compared with the equivalent performance of the consensus predictor. (B) Comparison between self-assessment performance and the accuracy of the predictions, measured as average per-residue lDDT. (C) Top 20 performing groups for assembly targets, ranked by the sum of four scores. (D) Comparison of the self-assessment performance on different parts of the protein complexes (core, interface, and surface residues in the target).