Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13

Abstract Predicting residue‐residue distance relationships (eg, contacts) has become the key direction to advance protein structure prediction since 2014 CASP11 experiment, while deep learning has revolutionized the technology for contact and distance distribution prediction since its debut in 2012 CASP10 experiment. During 2018 CASP13 experiment, we enhanced our MULTICOM protein structure prediction system with three major components: contact distance prediction based on deep convolutional neural networks, distance‐driven template‐free (ab initio) modeling, and protein model ranking empowered by deep learning and contact prediction. Our experiment demonstrates that contact distance prediction and deep learning methods are the key reasons that MULTICOM was ranked 3rd out of all 98 predictors in both template‐free and template‐based structure modeling in CASP13. Deep convolutional neural network can utilize global information in pairwise residue‐residue features such as coevolution scores to substantially improve contact distance prediction, which played a decisive role in correctly folding some free modeling and hard template‐based modeling targets. Deep learning also successfully integrated one‐dimensional structural features, two‐dimensional contact information, and three‐dimensional structural quality scores to improve protein model quality assessment, where the contact prediction was demonstrated to consistently enhance ranking of protein models for the first time. The success of MULTICOM system clearly shows that protein contact distance prediction and model selection driven by deep learning holds the key of solving protein structure prediction problem. However, there are still challenges in accurately predicting protein contact distance when there are few homologous sequences, folding proteins from noisy contact distances, and ranking models of hard targets.


| INTRODUCTION
The major breakthrough in protein structure prediction, particularly template-free (ab initio) prediction, is the drastic improvement of the accuracy of residue-residue contact distance prediction in the recent years, leading to the correct folding of some template-free modeling (FM) targets in CASP11 and CASP12 experiment. [1][2][3][4] The accurate prediction of inter-residue contacts and distances has become a key intermediate step and driving force to predict protein three-dimensional (3D) structure from sequence. The breakthrough in contact distance prediction was driven by two key advances: residue-residue coevolutionary analysis popularized in Reference 5 and demonstrated in CASP11 and CASP12 experiment 4,6 and deep learning introduced in Reference 7 and enhanced in  The coevolutionary analysis is based on the observation that two amino acids in contact (or spatially close according to a distance threshold such as 8 Å) must coevolve in order to maintain the contact relationship during evolution, that is, if one amino acid is mutated to a positively charged residue, the other one must change to a negatively charged one to be in contact. A number of coevolutionary methods of calculating direct rather than indirect/accidental correlated mutation scores has been developed and shown to improve contact prediction. [13][14][15][16] Moreover, the coevolutionary scores can be used as input for machine learning methods to further improve contact prediction.
Deep learning, the currently most powerful machine learning method, was introduced into the field in 2012 and demonstrated as the best method for protein contact prediction in 2012 CASP10 experiment. 7 Different variants of deep learning methods-convolutional neural networks and residual networks-were combined with coevolutionary features to substantially improve contact prediction. [8][9][10][11][12] The improved contact prediction led to the significant improvement of template-FM in CASP12 experiment, in which contact predictions were used with different ab initio modeling methods such as fragment assembly and distance geometry to build protein structural models from scratch. 1 To prepare for 2018 CASP13 experiment, we focused on enhancing our MULTICOM protein structure prediction system [17][18][19] with our latest development in contact distance prediction empowered by deep learning and its application to template-FM and protein model ranking, 17,20,21 while having a routine update on its other components such as template library, template identification, and template-based modeling. Our experiment demonstrates that contact distance prediction empowered by the advanced deep learning architecture can accurately predict a large number of contacts for some template-free or hard template-based targets, which are sufficient to fold them correctly by the distance geometry and simulated annealing from scratch without using any template or fragment information. Our experiment also shows that directly translating predicted contacts into tertiary structures by satisfying distance restraints can fold large proteins with complicated topologies better than using contacts indirectly to guide traditional fragment assembly approaches. Moreover, we demonstrate that deep learning can integrate one-dimensional (1D), two-dimensional (2D), and 3D structural features to improve protein model ranking. Particularly, we show that, for the first time, improved contact prediction can consistently improve protein model ranking. Therefore, contact distance prediction and deep learning are the key driving force that made our MULTICOM predictor rank third in the CASP13 experiment in both template-based and template-FM. The success of MULTICOM human and server predictors (MULTICOM_CLUSTER, MULTICOM-CONSTRUCT, and MULTICOM-NOVEL) in CASP13 clearly proves that deep learning holds the key for protein contact distance prediction and folding, even though there are still significant challenges in contact/distance prediction for targets with few homologous sequences, translation of noisy or sparse contact distances into 3D models, and selecting a few good protein structural models from a large pool of low-quality ones for a hard target.

| MATERIALS AND METHODS
In this section, we first provide an overview of the MULTICOM server and human prediction system, followed with the detailed description of several key new components that we added into the MULTICOM system in CASP13, such as the protein contact distance prediction empowered by deep learning, ab initio protein structure prediction driven by predicted contact distances, and large-scale protein quality assessment (QA) enhanced by deep learning and contacts. Figure 1 is an overview of our MULTICOM server and human prediction systems. Once the server received a target protein sequence, MULTICOM searched it against protein sequence databases such as the nonredundant sequence database to collect its homologous sequences to generate multiple sequence alignments, which were used to build sequence profiles such as position specific scoring matrices 22 and hidden Markov models (HMMs). 23 The sequence was also used to predict 1D structural features including secondary structure, solvent accessibility, and disorder regions. 24,25 The profile or sequence of the target was searched against the template profile/sequence library by a number of sequence alignment tools (eg, BLAST, 26 CSI-BLAST/CS-BLAST, 27 PSI-BLAST, 22 COMPASS, 28 FFAS, 29 HHSearch, 30 HHblits, 23 HMMER, 31 JackHMMER, 32 SAM, 33 PRC, 34 RaptorX, 35 I-TASSER/MUSTER 36,37 ) to identify protein templates whose structures were known and build pairwise target-template sequence alignments. DeepSF-a deep learning method of classifying protein sequences into folds was also used to identify templates for the target. 38 In parallel to the template identification, the multiple sequence alignments of the target were also used to generate coevolutionary features by CCMpred, 14 FreeContact 39 and PSICOV, 16 which were used together with other sequential and structural features such as predicted secondary structure and solvent accessibility as input for DNCON2 8 to predict residue-residue contacts at multiple distance thresholds (ie, 6, 7.5, 8, 8.5, and 10 Å).

| An overview of the MULTICOM system
The target-template sequence alignment was used to identify domain boundaries, that is, the region of the target not aligned with any significantly homologous template was treated as a template-FM domain, otherwise a template-based domain. The contact prediction for template-free domains was made by DNCON2 and combined with the contact prediction of the full-length target.
The pairwise target-template alignments were combined into the multi-template alignments between the target and the multiple templates if the structures of the templates were consistent. 40 The alignments and the structures of templates were fed into Modeller 41 to build the structural models for the target. Generally, more than 100 template-based models were constructed for a target.
In parallel to the template-based modeling, predicted contacts were used with several ab initio modeling tools such as CONFOLD2, 42 Rosetta, 43 UniCon3D, 44 and FUSION 45 to build structural models for a template-free target or domain. Both the template-based models and/or template-free models were added into a model pool for model ranking.
The MULTICOM human predictor also used all CASP13 server models as input. The incomplete server models or highly similar models (eg, GDT-TS > 0.95) from the same server group were filtered out. The side chains of the remaining models were repacked by SCWRL 46 in order to have the consistent side chain packing before they were evaluated, which was shown to improve the performance F I G U R E 1 The pipeline of MULTICOM server and human prediction systems of model QA. 47 If the target was identified as multiple-domain protein, the server models were divided into individual domain models.
The structural models from either MULTICOM human predictor or server predictors were compared with 1D structural features (eg, predicted secondary structure, solvent accessibility) to generate 1D matching scores and with 2D contacts to generate 2D matching scores (ie, the percentage of predicted contacts existing in a model of the target).
The models were also assessed by a number of 3D QA tools to generate 3D quality scores. The 1D, 2D, and 3D quality scores (features) were used by DeepRank-our deep learning-based model QA tool-to predict the accuracy of the models. This QA method was also applied to individual domains if a target had multiple domains. It is worth noting that our three server predictors used different QA methods for model selection.
MULTICOM_CLUSTER ranked models primarily based on pairwise similarity scores between models using APOLLO, 48  Third, the side chains of CASP13 server models were repacked by the MULTICOM human predictor before they were evaluated in order to make the side chains of the models predicted by different CASP13 server predictors consistent, while the MULTICOM server predictors did not have this step. And fourth, the final selected models were further refined by 3Drefine in the MULTICOM human predictor, whereas the MULTICOM server predictors did not use refinement.

| Deep convolutional neural network for contact distance prediction
We used DNCON2 to generate the 2D contact map for an input sequence. 8 As shown in Figure 2, a target sequence was searched against Uniprot20 database (version: 2016_02) by HHblits 23 to collect homologous sequences and generate multiple sequence alignments. If there is not a sufficient number of homologous sequences (eg, <5L sequences; L sequence length), the target was further searched against Uniref90 database (released by April 2018) by JackHMMER 32 to collect more homologous sequences whose multiple sequence alignments were combined with the results of HHblits search. The multiple sequence alignments were used by CCMPred, 14 FreeContact, 39 and PSICOV 16 to generate residue-residue coevolution features. The pairwise coevolution features together with other pairwise information (eg, secondary structure, solvent accessibility, and mutual information for each pair of residues) were stored in the L × L input matrices (L: sequence length or domain length).
The input feature matrices were used by the first-level convolutional neural networks in DNCON2 to predict the contact probability maps (ie, contact distance distribution) at multiple distance thresholds 6, 7.5, 8, 8.5, and 10 Å. The distance distribution and the original input matrices were concatenated as input for the secondlevel convolutional neural networks to predict a final contact probability map at 8 Å distance threshold.

| Contact distance-based ab initio folding
We used predicted contacts with a pure contact distance-based ab initio modeling tool-CONFOLD2 and several fragment-assembly tools to build 3D models for targets or domains without significant templates being identified. CONFOLD2 42 used only predicted contacts and secondary structures to build structural models without leveraging any other information such as structural fragments ( Figure 3). Top x × L contacts (x: a ratio ranging from 0.1 to 4; L: length of the protein) ranked by probabilities were used to generate distance restraints between C β atoms (or C α atom for glycine). The predicted secondary structures were used to generate torsion angle restraints, atom-atom distance restraints, and hydrogen-bond restraints, 52 which were important for building good local secondary structures in the model. These restraints were used by the distance geometry and simulated annealing optimization implemented in CNS 53 to build tertiary structure models by satisfying the restraints as well as possible. In this round of modeling, some local structures, particularly beta-sheets, are often not well-formed due to lack of restraints or noisy restraints. To remedy the problem, the potential beta-sheets were detected in the models generated by the first round of modeling. More angular, hydrogen bond, and atom-atom distance restraints were added in order to improve the pairing between the beta strands. Moreover, the contact distance restraints that were not realized in the models were removed from the list. The new set of restraints was used by the distance geometry again to build 3D models. Usually, a few hundred of models were constructed by using different numbers of contact distance restraints (ie, 0.1L, 0.2L, …, 3.9L, 4L), which were then clustered. Top models from the clusters were selected as final models. The key feature of this approach is that contacts play a dominant and direct role in building structural models. If there are a sufficient amount of accurate distance restraints, high-quality 3D models can be constructed.
As an alternative, we also used predicted contacts as distance or contact restraints with three fragment assembly methods-Rosetta, 43 UniCon3D, 44 and FUSION 45 to build models. Contacts were used as a part of the energy function of these methods to guide the assembly of protein structure. Rosetta used the structure fragments drawn from a fragment library to assemble the structure, while UniCon3D and FUSION used HMMs to generate conformations for fragments of variable length.
In contrast to the CONFOLD approach, 42,52 extra information such as fragments and energy terms is used in this kind of approach, in which contacts only play an indirect or auxiliary role in structural modeling. Therefore, the fragment assembly approach may fail if its conformation sampling cannot generate correct topologies, which often happens for relatively larger proteins with complicated topologies, even though there is a good amount of accurately predicted contacts. To assist the fragment-assembly with contacts, we selected top L/5 predicted contacts of short-range, medium-range and long-range, which were translated into the distance constraints between pairs of C β − C β as additional energy terms. Rosetta and FUSION used the bounded potential for a distance d, which is defined as follows: The parameters "lb" and "ub" are lower and upper bounds for atom-atom distance, which had been optimized and set to 3.5 and 8 Å F I G U R E 2 The pipeline of DNCON2 for protein residueresidue contact distance prediction. The input volume has 56 channels (matrices) containing various pairwise residue-residue features F I G U R E 3 Automated contact distance-based ab initio protein structure prediction by CONFOLD2 in our experiment. Unicon3D adopted a square well function with the exponential decay to account for the contact distance energy and is defined as: where P is the predicted contact probability for a pair of atoms. In CASP13, the contact-based ab initio structure prediction was run for up to two days to generate decoys for model selection. Pcons, 64  The second-level network was also trained on the all models of CASP8-11 targets, where the models were randomly split into the training and validation data with ratio 9 to 1. The details of the network configuration are reported in Supporting Information Table S1. This method was also blindly tested as "MULTICOM_CLUSTER" in the CASP13 QA category and ranked as one of the best predictors in selecting models and estimating the absolute error (see Supporting Information Table S2 for details). We also developed a simplified DeepRank method (called DeepRank_avg) by averaging the predictions from the 10 trained networks in the first level as the final quality score.

| Performance of MULTICOM human and server predictors in CASP13
We evaluate the performance of MULTICOM methods on 104 "all groups" domains that were used in CASP13 official evaluation. Based  Figure 5 shows the performance of MULTICOM human predictor and our three server predictors based on the TM-score metric. 66 According to the evaluation, as shown in Figure  Supporting Information Figure S1 compares MULTICOM with other top ranked CASP13 groups. MULTICOM (group number: "089") is consistently ranked among the top three predictors according to all metrics on the three domain sets. For instance, it is ranked 3rd according to zscore on all 104 domains. Figure S2 shows the performance of our three MULTICOM server predictors and other top ranked server groups on the 112 "all groups" and "server only" domains. MULTICOM-CONSTRUCT ranked 7th among all server groups on all the targets, followed by MUL-TICOM_CLUSTER and MULTICOM-NOVEL. The performance of the global and local quality metrics defined by GDT-TS, 66 and LDDT score 67 are also summarized in Figures S3 and S4. We also analyzed the performance of the different alignment tools used by our server predictors. The results are summarized in Supporting Information Table S3. Prior to CASP13, we assessed how much the deep learning and contact prediction improved the QA in CASP12 dataset. After the quality scores were generated using individual QA methods, two baseline combination strategies (eg, the average score of raw feature scores and z-scores, respectively) were compared with the deep learning. Supporting

| Performance of DeepRank and individual QA methods used by MULTICOM
Information Table S4 shows that the z-score based consensus worked better than the average score consensus, while the deep neural network of integrating all features except contacts further reduced the loss from 0.064 of the z-score based consensus to 0.054. Furthermore, the deep learning with contact features performed best (correlation = 0.853 and loss = 0.048), and the improvement was significant compared to the averaging approach (loss = 0.067) according to the P-value (0.007751).
The average loss of the deep learning with contacts is 0.051 on the 74 CASP13 targets, lower than 0.059 of the deep learning without contacts that is lower than both the average score consensus (loss = 0.073) and z-score consensus (loss = 0.057). The improvement is also consistent with the results in the blind CASP13 experiment (Supporting Information Table S5). This further validated the deep learning and contact prediction's positive contribution to model selection.   Figure 7B is the plot of the true GDT-TS scores of models against their predicted ranking by DeepRank. It successfully ranked the model with highest GDT-TS score (0.6103) as no.1 ( Figure 7D). MULTICOM generated a refined model by combining the top 1 selected model with the other top ranked models, which had a GDT-TS score of 0.6113 ( Figure 7E). The ranking of individual QA methods for this target is shown in Figure S6. The other three such successful cases for DeepRank are also reported in Figures S7-S9.
To assess how contact predictions can help model ranking, we evaluated DeepRank with/without contact features on targets with low contact prediction precision and ones with high contact prediction precision, respectively ( Figure S10). The consistent, significant improvement in model selection has been observed when the contact prediction of short-range, medium-range, and long-range has high precision (precision >0.5). However, the less accurate contact prediction led to the slightly worse performance on model selection than not using contact prediction.
We also analyzed the effect of side-chain repacking on model evaluation. The results show that repacking the side chains of models before they were evaluated reduced the loss of modeling ranking. The detailed results are reported in Supporting Information Table S6 and Figure S11.

| Comparison of different contact-based ab initio modeling methods on FM targets
To evaluate how predicted contact distances improved template-FM, we collected the top 5 models predicted by five ab initio modeling methods (CONFOLD2, RosettaCon-Rosetta with contacts, UniCon3D with contacts, FUSION with contacts, and Rosetta without contacts) for all domains that MULTICOM considered them as "hard." Figure 8 shows that the GDT-TS scores of the ab initio models generally increase as the accuracy of contact prediction becomes higher for each method. This upward trend is most significant for CONFOLD2 and the correlation between the contact accuracy and the GDT-TS score of CONFOLD2 models is 0.578. This is expected because CONFOLD2 is the only pure contact distance-driven modeling method in the group and contact distances play a direct and dominant role in its modeling, while they only play an indirect role in the other three modeling methods assisted by contact predictions.
The average GDT-TS score and TM-score were also calculated for each method on the FM targets. The models generated by RosettaCon has the highest average GDT-TS score of 0.376 and CONFOLD2 has the second highest average score of 0.356, followed by Rosetta, FUSION, and UniCon3D. It is interesting to note that CONFOLD2 started to work better than RosettaCon when top L/5 contact predictions reached a high accuracy (eg,~80%). When the accuracy of contact prediction was lower, RosettaCon worked somewhat better than CON-FOLD2 probably because the extra structural fragment information and its advanced energy function made some difference. The comparison of RosettaCon and Rosetta shows a 15.3% increase of GDT-TS score by using contact distance restraints, demonstrating that predicted contacts can significantly improve the fragment-assembly modeling.   Table S7 reports the scores of the models using or not using domain parsing.
For the server predictors, the performance of structure prediction was substantially improved in terms of GDT-TS, TM-score and RMSD after the domain-based modeling was applied. For the human predictor, the quality of final predictions was also slightly improved when domain information was considered. And almost all the improvement is significant.

| What went right?
In CASP13, a main progress was to apply contact distance prediction and deep learning to improve ab initio modeling. Predicted contacts were successfully utilized to guide ab initio structure modeling for several hard targets that could never be modeled correctly before.
Supporting Information Figure S12 shows the models and scores of nine hard targets that were folded into correct topology when the predicted contacts generated by DNCON2 were rather accurate.
Remarkably, a pure contact distance-driven modeling method-  Figure S13).
Another main progress is that MULTICOM performed better in ranking the models in CASP13 than in CASP12 due to the application of deep learning and contact prediction. MULTICOM successfully selected models that are identical or close to the best models for 28 targets (see the distribution of loss of model selection for all the targets and two good examples in Supporting Information Figure S14).  68 The contacts in the model matched well with the contacts predicted by DNCON2 domain by domain, confirming that both domain parsing and structure modeling was largely correct ( Figure 10). This contact-based validation approach was applied to all CASP13 targets during CASP13, providing a complementary validation for structure modeling.

| What went wrong?
Despite the significant progress of MULTICOM in CASP13, it has its several limitations. The first limitation is in contact distance prediction.
DNCON2 sometime failed to generate a sufficient amount of accurate contact predictions to predict correct folds. The problem is particularly severe when the number of effective homologous sequences for a target is small (see Supporting Information Figure S15 for an example-T0998). One possible reason is that it did not use a metagenomics sequence database 69 that contains sequences not present in the nonredundant protein sequence database and the latest HHblits database 23 to collect homologous sequences. Another possible reason is the convolutional architecture used by DNCON2 is not deep enough in comparison with some other approaches. 10,12,70 The F I G U R E 1 0 The successful modeling of a large multidomain target T0996 and the contactbased validation. The contacts (red) predicted by DNCON2 match with the contacts (blue) in the template-based models domain by domain second limitation is that only the coarse distance restraints derived from binary contacts at 8 Å threshold were used with CONFOLD2 for ab initio modeling, without taking advantage of the more detailed distance distribution spanning multiple distance thresholds predicted by DNCON2, which limited its capability to build quality models. 71 The third limitation is that the deep learning-based QA failed on some targets. As shown in Supporting Information Figure S14B, DeepRank method performed poorly with loss >0.1 on 14 "all groups" targets. The failed rankings are summarized in Supporting Information - Table S8 and Figures S16-S29. The results show that its performance was worse on the FM targets or hard-template targets than on other targets. A possible reason is that a large portion of low-quality models in the pool and less accurate features of measuring model quality (eg, contact predictions) for the hard targets hinders the performance of the deep learning ranking.

| CONCLUSION AND FUTURE WORK
Our CASP13 results demonstrate that residue-residue contact prediction, more generally distance prediction, is the key direction to advance protein structure prediction, particularly ab initio prediction, and deep learning is the key technology to solve it. Not only do accurate contact distance prediction and deep learning enhance ab initio structure folding, but also model ranking for both template-based and FM. In the future, we will develop more advanced deep learning methods to directly predict real-value distances between residues and/or classify them into much finer intervals than DNCON2 currently does. The more detailed distance predictions will be used to more accurately fold proteins by the distance geometry, 42,52 simulated annealing and advanced gradient descent optimization 72,73 as well as to rank protein models.
ACKNOWLEDGMENT This work is supported by an NIH grant (R01GM093123), an NSF IIS grant (IIS1763246), and an NSF DBI grant (DBI1759934) to J.C.