Predicting residue‐specific qualities of individual protein models using residual neural networks and graph neural networks

Abstract The estimation of protein model accuracy (EMA) or model quality assessment (QA) is important for protein structure prediction. An accurate EMA algorithm can guide the refinement of models or pick the best model or best parts of models from a pool of predicted tertiary structures. We developed two novel methods, MASS2 and LAW, for predicting residue‐specific or local qualities of individual models, which incorporate residual neural networks and graph neural networks, respectively. These two methods use similar features extracted from protein models but different neural network architectures to predict the local accuracies of single models. MASS2 and LAW participated in the QA category of CASP14, and according to our evaluations based on CASP14 official criteria, MASS2 and LAW are the best and second‐best methods based on the Z‐scores of ASE/100, AUC, and ULR‐1.F1. We also evaluated MASS2, LAW, and the residue‐specific predicted deviations (between model and native structure) generated by AlphaFold2 on CASP14 AlphaFold2 tertiary structure (TS) models. LAW achieved comparable or better performance than the predicted deviations generated by AlphaFold2 on AlphaFold2 TS models, even though LAW was not trained on any AlphaFold2 TS models. Specifically, LAW performed better on AUC and ULR scores, and AlphaFold2 performed better on ASE scores. This means that AlphaFold2 is better at predicting deviations, but LAW is better at classifying accurate and inaccurate residues and detecting unreliable local regions. MASS2 and LAW can be freely accessed from http://dna.cs.miami.edu/MASS2-CASP14/ and http://dna.cs.miami.edu/LAW-CASP14/, respectively.

angstroms, S-score, or LDDT, 2 are usually used to describe the per-residue deviation and guide the refinement of the model, whereas global quality scores that indicate the quality of the entire model, such as the global distance test total score (GDT_TS), 3 are used to identify the best model. As AlphaFold2 4 achieved great success in TS prediction, the CASP14 community has started to discuss the future role of QA. 5 Although the TS models generated by AlphaFold2 achieved promising accuracies, the confidence measures or residue-specific deviations generated by AlphaFold2 were not benchmarked or compared with other QA methods in CASP14. This was because AlphaFold2 was registered as a human TS predictor in CASP14, and only the TS models from server predictors were sent to CASP14 QA methods for quality-assessment predictions. To fill this gap, in this work we benchmarked the performance of the confidence measures generated by AlphaFold2 and compared it with our QA methods.
Most of the recent single-model EMA methods use machine learning algorithms. DeepAccNet 6 applies 2D convolution to the global environment and 3D convolution to the atomic environment to estimate the distance error. QDeep 7 combines residue-level ensemble classifications into local accuracy scores. Specifically, QDeep trains multiple residual neural networks (ResNets) based on different distance-error thresholds. DeepQA 8 applies a deep belief network on structural and physicochemical characteristics to estimate the quality of models. SMOQ 9 is an SVM-based method using secondary structures (SS), solvent accessibilities, and residue-residue contacts to predict the per-residue distance deviation with a single-model input. In our previous study, 10 we designed four single-model methods based on stacked denoising autoencoders, SVM, or their combination, which achieved relatively better performance for residues with native deviations greater than 5 Å. ProQ3D 11 improved the Pearson correlation coefficient of local model quality from 0.72 as in ProQ2 12 to 0.77 benchmarked on the CASP11 dataset by changing the SVM to multi-layer perceptrons. The authors of QAcon 13 found that the performance of its two-layer neural network was improved by adding the novel feature of residue-residue contact information, which illustrates the value of using contact information in protein QA. Our previously developed MASS 14 is a single-model method that only predicts global quality scores using random forests based on six protein statistical potentials that indicate structural or energetic properties of protein models. In comparison, one of the methods presented in this article, MASS2, was trained to predict residue-specific qualities and was based on completely different methodologies.
In CASP14, a few methods applied graph neural networks (GNNs) to EMA for the first time. GraphQA 15 represents a protein molecule as a graph by constructing edges between either spatially close or sequentially consecutive residues. GraphQA encodes spatial and sequential distance as edge features and adopts amino acid sequences, multiple sequence alignment (MSA), dihedral angles, and SS as node features.
GraphQA then applies a GCN 16 to predict local and global quality scores. GraphQA co-optimizes two local and five global quality scores with weighted summed mean squared losses. However, the co-optimization of GraphQA reduces the performance of local scores. 15 VoroCNN 17 constructs a graph from the model using a Voronoi diagram tool 18 and predicts local scores by employing a graph convolutional neural network (GCN). ProteinGCN 19 generates features from structural and inter-atomic orientations and distances and applies another type of GCN 20 to the graph of k-nearest residues.
The methods discussed in this paper, named MASS2 and LAW (LAW is an abbreviation of "local assessment of protein models using graph networks"), also participated in CASP14. These two methods performed better than the other CASP14 methods using similar approaches, according to our implementation of the CASP official evaluation criteria. MASS2 and LAW are implemented with ResNet and GNN, respectively. These two methods use similar features but different neural networks. MASS2 utilizes 3D closeness information by combining the features of the spatially close but sequentially distant residues of each residue. In contrast, LAW captures the 3D structure of a predicted model by creating a graph of the residues within a distance cutoff of the target residue. Both methods predict per-residue local quality scores.
We used the models from CASP7 to CASP12 for training and validation, where models are trimmed into evaluation units (EUs). 21

| The architecture of MASS2
The deep learning network architecture of MASS2 contains two branches. The input of the first branch is the concatenated features of the five spatially closest residues (sequential distances > 6) for each target residue. The features pass through three convolutional layers.
In contrast, the input of the second branch is the per-residue features of a model; this branch contains a convolutional layer and six residual blocks.
The concatenated outputs of the two branches are fed into another 18 residual blocks and a fully connected layer, which reduces the feature dimension to one. The output after the sigmoid function is the predicted S-score, which is further converted to a residue-specific deviation as follows:
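This conversion can be sketched as follows, assuming the S-score definition used in our evaluations, S = 1 / (1 + (d/d0)^2) with d0 = 5 Å; the function names are illustrative, not the actual MASS2 code:

```python
import math

D0 = 5.0  # reference distance d0 in angstroms (d0 = 5 A in our evaluations)

def s_score(deviation: float, d0: float = D0) -> float:
    """S-score of a residue given its deviation from the native structure (in A)."""
    return 1.0 / (1.0 + (deviation / d0) ** 2)

def deviation_from_s(s: float, d0: float = D0) -> float:
    """Invert the S-score to recover the residue-specific deviation in angstroms."""
    return d0 * math.sqrt(1.0 / s - 1.0)
```

For example, a predicted S-score of 0.5 corresponds to a deviation of exactly d0 = 5 Å.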

| The architecture of LAW
The architecture of the GNN used in LAW includes five graph blocks followed by three convolutional layers. A graph is generated for each model by adding an edge between any two residues with a 3D Euclidean distance ≤ 8 Å. The edges are unweighted and bidirectional, so each node is both a sender and a receiver.
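The graph construction step can be sketched as below; this is a simplified illustration that assumes one 3D coordinate per residue (e.g., the Cα atom), and the function name is ours, not LAW's actual code:

```python
import numpy as np

def build_residue_graph(coords: np.ndarray, cutoff: float = 8.0):
    """Return a bidirectional, unweighted edge list (senders, receivers) connecting
    every pair of residues whose 3D Euclidean distance is <= cutoff (in angstroms)."""
    n = len(coords)
    # Pairwise distance matrix via broadcasting: (n, n)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Keep pairs within the cutoff, excluding self-loops; both (i, j) and (j, i)
    # appear, so each node acts as both a sender and a receiver.
    senders, receivers = np.nonzero((dist <= cutoff) & ~np.eye(n, dtype=bool))
    return senders, receivers
```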
The inputs of the graph network are nodes V, edges E, and global features u. In the GNN framework, 16 each graph block is a computation unit, which has "update" functions φ and "aggregation" functions ρ.
The "update" functions are implemented with fully connected layers and a ReLU function, and the "aggregation" functions are implemented with scatter mean functions. 23 The details in a computation unit are described below in three steps: The first step is to update the edge features. We concatenate sender features, receiver features, edge features, and global features.
Then we apply a Linear-ReLU-Linear layer to the concatenated features for updating edge features, in which the linear part is implemented as a fully connected layer in PyTorch. 23 The second step is to update the node features. We pass the concatenated sender features and edge features into a Linear-ReLU-Linear layer. After that, we use the "aggregation" function ρ^(e→v) to average the edge features over receivers. The message from the edge features is passed to node features. Then the concatenated node, edge, and global features are passed into another Linear-ReLU-Linear layer.
The third step is to update the global features. Before that, we use ρ^(e→u) to average edge features over receivers and then over graphs, and we use ρ^(v→u) to average node features over graphs.
We concatenate the aggregated edge and node features with global features and update the global features.
We let the input features pass through five graph blocks. After that, we average the edge features over receivers and concatenate the aggregated edge features with node features and global features.
Then we apply three convolutional layers to reduce the node feature dimension to one.
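The three-step computation unit can be sketched schematically as below. This is a simplified NumPy illustration of the update/aggregation pattern, not the exact LAW layer shapes; the `phi_*` arguments stand in for the Linear-ReLU-Linear "update" functions, and `scatter_mean` stands in for the "aggregation" functions:

```python
import numpy as np

def scatter_mean(values, index, num_targets):
    """Average rows of `values` grouped by `index` (a simple 'aggregation' rho)."""
    out = np.zeros((num_targets, values.shape[1]))
    counts = np.zeros(num_targets)
    for v, i in zip(values, index):
        out[i] += v
        counts[i] += 1
    counts[counts == 0] = 1  # avoid division by zero for isolated nodes
    return out / counts[:, None]

def graph_block(V, E, u, senders, receivers, phi_e, phi_v, phi_u):
    """One GNN computation unit: update edges, then nodes, then global features."""
    n = len(V)
    # Step 1: update edge features from sender, receiver, edge, and global features.
    u_e = np.repeat(u[None, :], len(E), axis=0)
    E = phi_e(np.concatenate([V[senders], V[receivers], E, u_e], axis=1))
    # Step 2: aggregate updated edge features over receivers (message passing),
    # then update node features together with global features.
    agg_e = scatter_mean(E, receivers, n)
    u_v = np.repeat(u[None, :], n, axis=0)
    V = phi_v(np.concatenate([V, agg_e, u_v], axis=1))
    # Step 3: update global features from aggregated edge and node features.
    u = phi_u(np.concatenate([E.mean(axis=0), V.mean(axis=0), u]))
    return V, E, u
```

In LAW, the φ functions are fully connected layers with a ReLU in between and the aggregations are implemented with PyTorch scatter-mean functions; the sketch above only shows the data flow.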
The node features for LAW contain one-hot encoding, PSSM,

| Positional encoding
We use four sinusoidal positional encodings as per-residue features so that the deep learning network is aware of the sequence order.
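One common choice consistent with this description is transformer-style sinusoidal encodings with four channels per residue; the exact frequencies used in LAW are an assumption in this sketch:

```python
import numpy as np

def positional_encoding(seq_len: int, dim: int = 4) -> np.ndarray:
    """Transformer-style sinusoidal encodings: `dim` values per residue position.
    Even channels use sine and odd channels cosine, at geometric wavelengths."""
    pos = np.arange(seq_len)[:, None]           # residue positions 0..L-1
    i = np.arange(dim // 2)[None, :]            # channel pair indices
    angle = pos / (10000 ** (2 * i / dim))
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```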
FIGURE 2 The ULR evaluation example on the same target and based on the same QA predictor (QA209) as in Figure 5A of the CASP14 official QA evaluation paper. 5 (A and B) Highlight the ULRs defined in two ways, ULR-1 and ULR-2, respectively, on the model T1076TS198_3-D1 superposed with the native structure of target T1076. The red-highlighted regions in (A) represent the ULRs defined in the same way as in the CASP14 official QA evaluations, which is called ULR-1 in our paper. The red-highlighted regions in (B) represent the ULRs defined by the second way of merging neighboring ULRs, named ULR-2 in this paper. (C) Shows the amino acid position (AA pos), native deviation (Native dev), true ULR-1, true ULR-2, predicted deviation (Pred dev), predicted ULR-1 (Pred ULR-1), and predicted ULR-2 (Pred ULR-2). All deviations greater than 3.8 Å are highlighted in red. In the lines of true ULR-1 and true ULR-2, the native deviations that belong to true ULRs are highlighted with a green background. In the lines of predicted ULR-1 and predicted ULR-2, the predicted deviations belonging to ULRs are highlighted with a red background. Right (✔) and wrong (✖) symbols indicate the correctness of predicted ULRs. The ULR-1.F1 and ULR-2.F1 are calculated at the bottom of the figure on all models.

The second and third sets were not used in CASP official evaluations. We evaluated the models by converting local distances into S-scores, calculated as S = 1 / (1 + (distance/d0)^2), where distance is the native deviation in Å and d0 = 5 Å. All CASP14 QA groups submitted predicted local or residue-specific deviations in angstroms except the groups QA074, QA066, and QA280, which submitted scores between 0 and 1. We treated the submitted scores from these three QA predictors as S-scores during our evaluations.

| Averaged residue-wise S-score error
The averaged residue-wise S-score error (ASE) is defined as:
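A sketch of this metric, assuming the CASP14 convention ASE = 100 × (1 − ⟨|S_pred − S_true|⟩), where S is the S-score computed from the predicted or native deviation (the exact formula is an assumption here, as the equation above is not reproduced):

```python
import numpy as np

D0 = 5.0  # reference distance for the S-score, in angstroms

def s_score(dev):
    """Convert deviations (in angstroms) to S-scores in (0, 1]."""
    return 1.0 / (1.0 + (np.asarray(dev, dtype=float) / D0) ** 2)

def ase(pred_dev, true_dev):
    """Averaged residue-wise S-score error; assumed form:
    ASE = 100 * (1 - mean |S(pred) - S(true)|). Higher is better."""
    err = np.mean(np.abs(s_score(pred_dev) - s_score(true_dev)))
    return 100.0 * (1.0 - err)
```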

| Area under the curve
After the superposition of the predicted and native structures, each residue in the predicted structure is classified into two classes: accurately modeled or inaccurately modeled. The residues are considered accurately modeled if the deviations to the native counterparts are less than 3.8 Å. By comparing the classes of residues generated from the predicted local quality scores and the classes of residues based on the superposition, we calculated the area under the curve (AUC). The final AUC scores were averaged over all models and then all EUs.
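This computation can be sketched with a rank-based (Mann–Whitney) AUC; predicted deviations are negated so that a lower predicted deviation means a higher "accurately modeled" score, and the function name is illustrative:

```python
import numpy as np

THRESHOLD = 3.8  # angstroms: residues below this deviation are accurately modeled

def residue_auc(pred_dev, true_dev, threshold=THRESHOLD):
    """AUC for classifying accurately vs. inaccurately modeled residues.
    True labels come from native deviations; scores come from predicted deviations."""
    labels = np.asarray(true_dev) < threshold
    scores = -np.asarray(pred_dev, dtype=float)  # lower deviation -> higher score
    pos, neg = scores[labels], scores[~labels]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```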

| ULR-1 and ULR-2
Unlike the AUC, the unreliable local region (ULR) criterion is designed to detect and evaluate continuous inaccurately modeled residues. If the deviation of a residue is less than 3.8 Å, the residue is accurately modeled in the TS model. A ULR is defined as a sequence of at least three continuous inaccurately modeled residues. We then used two different ways of merging smaller ULRs into longer ULRs.
We defined the first way of merging, named ULR-1, in the same way as in the official CASP14 evaluations (see "2.3 Methods for assessing local structure accuracy estimation" on page 3 of Ref 5): if only one accurately modeled residue exists between two ULRs, then these two ULRs, together with the one residue in between, are merged into a longer ULR.
We defined the second way of merging, named ULR-2, which is less stringent than ULR-1: if only one accurately modeled residue exists between one ULR and another inaccurately modeled residue or ULR, then these ULRs or residues are merged into a longer ULR.
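The detection of ULRs and the ULR-1 merging rule can be sketched as follows (illustrative function names, 0-based inclusive indices):

```python
import numpy as np

def find_ulrs(deviations, threshold=3.8, min_len=3):
    """Runs of >= min_len consecutive inaccurately modeled residues,
    returned as (start, end) index pairs (inclusive, 0-based)."""
    bad = np.asarray(deviations) >= threshold
    ulrs, start = [], None
    for i, b in enumerate(list(bad) + [False]):  # sentinel flushes the last run
        if b and start is None:
            start = i
        elif not b and start is not None:
            if i - start >= min_len:
                ulrs.append((start, i - 1))
            start = None
    return ulrs

def merge_ulr1(ulrs):
    """ULR-1 merging: two ULRs separated by exactly one accurately modeled
    residue are merged (together with that residue) into one longer ULR."""
    if not ulrs:
        return []
    merged = [ulrs[0]]
    for s, e in ulrs[1:]:
        ps, pe = merged[-1]
        if s - pe == 2:  # exactly one residue between the two ULRs
            merged[-1] = (ps, e)
        else:
            merged.append((s, e))
    return merged
```

ULR-2 would apply the same one-residue-gap rule less stringently, also merging a ULR with isolated inaccurately modeled residues; only ULR-1 is sketched here.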
We compared the ULRs derived from the predicted local quality scores (referred to as predicted ULRs) and the ULRs parsed from the native deviations (referred to as true ULRs). If the start and end positions of a predicted ULR are within two residues of the start and end positions of the true ULR, then it is considered a correctly predicted ULR. This is also what CASP14 organizers have done in their official evaluation (see "2.3 Methods for assessing local structure accuracy estimation" on page 3 of Ref 5).
The F1 score of ULRs was calculated as F1 = (2 × precision × recall) / (precision + recall), and the final ULR.F1 scores were averaged over all models and then over all EUs, in the same way as described in the CASP14 official evaluations. 5 Figure 2 shows the example of ULRs on a CASP14 target for a QA predictor. This example is also shown in Figure 5A of the CASP14 official evaluation paper. 5 Figure 2A,B show the predicted ULR-1 and ULR-2 (red), respectively, and the details of defining ULRs are shown in Figure 2C. According to our observation and based on the data shown in Figure 2, the ULR-1.F1 score for this example should be 0.33 ((2 × 0.33 × 0.33) / (0.33 + 0.33)) instead of 1. 5 This may explain the differences between the evaluation results that we and the CASP14 official evaluators generated.
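The matching and F1 computation can be sketched as below. This is a simplified version in which each matched predicted ULR counts toward both precision and recall; the ±2-residue tolerance on start and end positions follows the text:

```python
def match_ulrs(pred_ulrs, true_ulrs, tol=2):
    """Count predicted ULRs whose start and end are each within `tol`
    residues of some true ULR's start and end positions."""
    matched = 0
    for ps, pe in pred_ulrs:
        if any(abs(ps - ts) <= tol and abs(pe - te) <= tol
               for ts, te in true_ulrs):
            matched += 1
    return matched

def ulr_f1(pred_ulrs, true_ulrs, tol=2):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if not pred_ulrs or not true_ulrs:
        return 0.0
    tp = match_ulrs(pred_ulrs, true_ulrs, tol)
    precision = tp / len(pred_ulrs)
    recall = tp / len(true_ulrs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With three predicted ULRs, three true ULRs, and one match, precision = recall = 0.33 and F1 = 0.33, as in the Figure 2 example.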
Based on our observation and the data shown in Figure 2, we believed that our evaluations accurately followed the descriptions of how ULRs were defined and evaluated in the CASP14 official evaluation paper. 5 To further prove that we strictly and accurately followed the same way as described by CASP14 official evaluators, we listed the GDT_TS, ASE, AUC, ULR-

| SOV_refine
As discussed in Section 2.4.4, SOV_refine scores can measure the similarity of two strings. Here, we used it to measure the similarity between the predicted and true classes (correctly or incorrectly modeled) of all residues. We used 0 to represent an accurately modeled residue and 1 for an inaccurately modeled residue. In this way, we generated a string of 0s and 1s from the predicted residue-specific deviations of a QA group and another string from the true deviations. The similarity of these two strings was measured by the SOV_refine score. This is an evaluation criterion that was not used by the CASP14 official evaluations.

| RESULTS

Figure 3 shows the ranking of 31 CASP14 single-model groups according to Z-1 scores. Z-1 scores are the sum of averaged Z-scores of ASE/100, AUC, and ULR-1.F1 over all EUs. This is the same way of calculating Z-1 scores as in the CASP14 official paper. 5 Based on our evaluation, MASS2 and LAW are the first and second methods in CASP14. Table 1 shows the details of the top 10 ranked groups plus Bhattacharya-QDeep and GraphQA based on Z-1 scores. Figure S1 shows the ranking of 31 CASP14 groups according to Z-2 scores. Z-2 scores are the sum of averaged Z-scores of ASE/100, AUC, ULR-2.F1, and SOV_refine over all EUs. Notice that the Z-2 score was an evaluation criterion that the CASP14 official evaluations did not use, as it uses ULR-2 instead of ULR-1.
MASS2 and LAW are the first and third methods based on this criterion. Figure 5A shows the distance for each residue, and Figure 5B shows the superposition of the model and native structure. Figure 6 shows the ULR-1s for T1065s2TS075_1-D1 based on the estimated deviations generated by six CASP14 QA predictors, including our MASS2 and LAW. We also benchmarked the performance of MASS2, LAW, and the predicted deviations generated by AlphaFold2 on the TS models of AlphaFold2 in CASP14. Although LAW did not perform as well as MASS2 on the TS models of server predictors in CASP14, LAW outperformed MASS2 when evaluating the TS models of AlphaFold2 in CASP14. The QA performance of LAW on the models of AlphaFold2 is comparable to or even better than the QA performance of AlphaFold2 itself.

| DISCUSSION AND CONCLUSION

When users utilize our web servers, we recommend using LAW for AlphaFold2 TS models and MASS2 for all other TS models.
We believe that the inclusion of spatial properties of protein models contributes to the promising performances of MASS2 and LAW. For MASS2, a unique type of feature that we have used is the concatenated features of the five spatially closest residues (sequential distances > 6) of each target residue. These features were processed by three convolutional layers and then combined with other residue-specific features by residual blocks. In this way, MASS2 not only sees residue-specific features but also, to some degree, integrates the 3D structure of the TS model.
We had the same goal in mind when we designed LAW but on an even larger scale, that is, using a graph to represent the global 3D structure of the entire TS model, not only the five spatially closest residues as in MASS2. Since each edge in the graph connects two residues having a 3D Euclidean distance ≤ 8 Å, the graph topology is, to some extent, a representation of the key 3D structural properties of the model. The advantage of using a graph is that it does not include all of the 3D coordinates of the 3D structures, which decreases the complexity of modeling and reduces computational time.