Through the CASP experiments, various template-based modeling methods have been proposed, and the importance of successful template-based modeling has been steadily increasing. One of the important issues in this category is to improve the accuracy of modeling to the level of experiment. The emergence of the high-accuracy template-based modeling (TBM/HA) category in CASP7 reflects this.
In this CASP7 experiment, we have systematically applied a powerful global optimization method, conformational space annealing (CSA),1 to the whole procedure at three levels: multiple alignment, backbone modeling, and side-chain modeling. The success of this approach depends on the harmony of a powerful optimization method with accurate score functions. For this purpose, we have developed a consistency-based score function for multiple alignment so that the more we optimize, the more the consistency is satisfied. For chain building and side-chain remodeling, we used the MODELLER energy function and an in-house score function similar to that of SCWRL3.0, to which we add rotamers generated by consensus analysis.
We provide an overall analysis of the submitted models for 100 domains in the TBM category, focusing especially on the 26 domains in the TBM/HA category. On average, excellent backbone modeling as well as side-chain modeling is achieved for TBM/HA targets.
For the prediction of the 3D structures of 100 CASP7 targets, we have developed a procedure based on global optimization of score functions in three levels: multiple alignment, backbone modeling, and side-chain optimization. The effectiveness of the current approach relies on the global optimization method used, the CSA. The CSA method searches the whole conformational space in its early stages and narrows the search to smaller regions with low energy. The greatest advantage of the CSA is that it always finds distinct families of low-energy conformations. Details and successful applications of the CSA can be found elsewhere.1–3 The whole prediction procedure is described below:
The first step is fold recognition. To collect fold candidates for a given target sequence, we considered the top scoring templates from the meta-server 3D-jury, as well as the top scoring templates from an in-house method called FoldFinder.4 FoldFinder is a profile–profile alignment method utilizing predicted secondary structures. We used a fold database of 17,930 protein chains from PISCES5 culled at the 99% sequence identity (SeqID) level. After collecting templates, we performed structural clustering, which typically led to 2 or 3 template sets.
The second step is to perform multiple sequence/structure alignment for each template set with MSACSA,* which is the most crucial and computationally time-consuming part of the whole procedure. Unlike the heuristic (progressive) alignment methods in the literature, MSACSA applies the CSA1 to a consistency-based score function similar to that of COFFEE.6 The score function gives a higher score to an alignment that is more consistent with the pairwise restraint library. The library is generated from profile–profile alignments between the query sequence and the template sequences and from structure–structure alignments between templates using TM-align.7 The score function for a given multiple alignment A of N sequences is defined as:
where M is the alignment length of A and δ(A) is 1 or 0. If the aligned residues (excluding indels) between the ith and the jth sequences at the kth column of A are present in the library, δ(A) = 1, otherwise 0. Lij and wij are the pairwise alignment length and the SeqID, respectively, between the ith and the jth sequences in the library. For each list, the top scoring alignment among the final 100 is used for the next 3D backbone modeling step.
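The consistency score can be illustrated with a short sketch. It assumes a COFFEE-like normalized form, S(A) = Σ_{i<j} w_ij Σ_k δ_ij^k(A) / Σ_{i<j} w_ij L_ij, which is an assumption consistent with the definitions of M, δ, L_ij, and w_ij above rather than a formula quoted from the source. The data layout (gapped strings for the alignment, a dictionary of residue-pair sets for the library) and the function names are hypothetical.

```python
from itertools import combinations

def residue_indices(aligned_seq):
    """Map each alignment column to a residue index, or None for a gap ('-')."""
    idx, out = 0, []
    for ch in aligned_seq:
        if ch == "-":
            out.append(None)
        else:
            out.append(idx)
            idx += 1
    return out

def consistency_score(alignment, library):
    """Assumed COFFEE-like score of a multiple alignment.

    alignment: list of equal-length gapped strings (one per sequence).
    library:   dict keyed by (i, j) with i < j; each value holds
               'pairs' (set of (res_i, res_j) from the pairwise alignment),
               'L' (pairwise alignment length) and 'w' (SeqID weight).
    """
    cols = [residue_indices(s) for s in alignment]
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        entry = library[(i, j)]
        # delta = 1 for each column whose aligned residue pair is in the library
        matched = sum(
            1
            for ri, rj in zip(cols[i], cols[j])
            if ri is not None and rj is not None and (ri, rj) in entry["pairs"]
        )
        num += entry["w"] * matched
        den += entry["w"] * entry["L"]
    return num / den if den else 0.0
```

Under this form, the score reaches 1 only when every residue pair in every weighted pairwise library alignment is reproduced by the multiple alignment, which is why further optimization can only increase consistency.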
The 3D structures of target proteins are constructed by optimizing the MODELLER8 energy function again using the CSA, which we call ModellerCSA.9 For each multiple alignment (containing up to 20 templates), a total of 100 models are generated and they are used for the list-selection procedure. This is the second most computationally time-consuming part of the method.
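The way the CSA keeps a diverse bank of low-energy solutions while gradually narrowing the search can be sketched on a toy problem. This is a heavily simplified illustration, not the authors' implementation: the one-dimensional energy function, the crude local minimizer, the trial-generation rule, and all parameter values are hypothetical stand-ins.

```python
import random

def energy(x):
    # Hypothetical double-well function standing in for a real score function.
    return (x * x - 4.0) ** 2 + x

def local_minimize(x, step=0.01, iters=500):
    # Crude derivative-free descent; a stand-in for a real local minimizer.
    e = energy(x)
    for _ in range(iters):
        for cand in (x - step, x + step):
            ec = energy(cand)
            if ec < e:
                x, e = cand, ec
    return x, e

def update_bank(bank, trial, d_cut):
    """Bank update: compare the trial with its nearest bank member."""
    x, e = trial
    nearest = min(range(len(bank)), key=lambda i: abs(bank[i][0] - x))
    if abs(bank[nearest][0] - x) < d_cut:
        # Same family as an existing member: keep the lower-energy one.
        if e < bank[nearest][1]:
            bank[nearest] = trial
    else:
        # New family: replace the worst member if the trial is better.
        worst = max(range(len(bank)), key=lambda i: bank[i][1])
        if e < bank[worst][1]:
            bank[worst] = trial

def csa(bank_size=10, stages=20, trials_per_stage=5, shrink=0.8, seed=0):
    rng = random.Random(seed)
    bank = [local_minimize(rng.uniform(-4.0, 4.0)) for _ in range(bank_size)]
    initial_best = min(e for _, e in bank)
    # Start with a large distance cutoff (wide search) and anneal it down.
    d_cut = 0.5 * (max(x for x, _ in bank) - min(x for x, _ in bank))
    for _ in range(stages):
        for _ in range(trials_per_stage):
            a, b = rng.sample(bank, 2)
            t = rng.random()
            # Trial = mixture of two bank members plus a random perturbation.
            trial_x = t * a[0] + (1.0 - t) * b[0] + rng.gauss(0.0, 0.3)
            update_bank(bank, local_minimize(trial_x), d_cut)
        d_cut *= shrink
    return bank, initial_best
```

Because a bank member is only ever replaced by a trial of lower energy, the best energy in the bank cannot get worse, while the distance cutoff controls how many distinct families survive at each stage.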
In most cases, we used more than one list of templates. We selected the best performing list (if one exists) by assessing the average quality of the 100 3D models of each list with an in-house neural network (NN)-based procedure.10 For the winning list, we applied the clustering method SPICKER11 to find the center models of the 2–3 largest clusters. The best scoring models in terms of the MODELLER energy and/or the DFIRE12 energy were also selected. Typically, the center model of the largest cluster served as model 1. When there were competing lists, we used more than one list to select the five models.
The NN was quite successful in selecting the winning list. The network consists of five input nodes, three hidden nodes, and one output node. The inputs are the MODELLER energy, the DFIRE energy, and the consistencies of a model with three predicted properties (secondary structure, solvent accessibility, and hydrophobicity). Details will be published elsewhere. The network was trained so that the output predicts the TM-score of a model. For training and testing, 1,600 models for six early-released CASP7 targets were used. Because of the time constraints of the CASP7 schedule and the additional computational resources required, we could not include additional proteins in the training set. The network was trained using the back propagation algorithm with a fixed number of epochs, which was determined by applying all 5+1 cross-validations.
For the side-chain modeling of a model, say p, we consider all 100 final models obtained from ModellerCSA, which we define as a set P such that p ∈ P. A target-specific rotamer library is constructed based on the consistency of the side chains in P. More precisely, for each residue i, we calculate the average m_i and the standard deviation σ_i of the χ angles over P. If σ_i ≤ 15°, we add the 10 sets of all χ_i angles closest to m_i to the rotamer library. If σ_i > 15°, we add a backbone-dependent and sequence-specific rotamer library derived from SCWRL3.013 using refined bins. Using this rotamer library, we optimize an energy function E = E_SCWRL + E_DFIRE (RotamerCSA), where E_SCWRL and E_DFIRE are the SCWRL3.0 energy and the DFIRE energy, respectively. Figure 1 shows the flowchart of the overall procedure.
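A network of the stated size (five inputs, three hidden nodes, one output) trained by back propagation can be sketched as follows. Only the layer sizes come from the text; the sigmoid activations, squared-error loss, weight initialization, learning rate, and the assumption that the five inputs are already scaled to comparable ranges are illustrative choices, and the class and method names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """5-3-1 feed-forward network; activations and init are assumptions."""

    def __init__(self, n_in=5, n_hidden=3, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, y

    def predict(self, x):
        # The sigmoid output lies in (0, 1), the same range as a TM-score.
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.1):
        """One back-propagation update for the loss 0.5 * (y - target)^2."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden-layer error terms use the output weights before they change.
        delta_hid = [delta_out * self.w2[j] * h[j] * (1.0 - h[j])
                     for j in range(len(h))]
        for j in range(len(h)):
            self.w2[j] -= lr * delta_out * h[j]
            for i in range(len(x)):
                self.w1[j][i] -= lr * delta_hid[j] * x[i]
            self.b1[j] -= lr * delta_hid[j]
        self.b2 -= lr * delta_out

    def n_parameters(self):
        return sum(len(r) for r in self.w1) + len(self.b1) + len(self.w2) + 1
```

With only 22 adjustable parameters, such a network is small enough to be trained on the order of a thousand models, and a list could then be ranked by the mean predicted TM-score of its 100 models.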
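The consensus rule for building the target-specific rotamer library can be sketched for a single residue. The 15° threshold and the choice of 10 sets come from the text; the use of circular statistics for the periodic χ angles, the requirement that every χ angle of the residue pass the threshold, and the distance measure used to rank sets are assumptions, and the function names are hypothetical.

```python
import math

def ang_diff(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def circ_mean(angles):
    """Circular mean of angles given in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))

def circ_std(angles):
    """Root-mean-square wrapped deviation about the circular mean."""
    m = circ_mean(angles)
    return math.sqrt(sum(ang_diff(a, m) ** 2 for a in angles) / len(angles))

def consensus_rotamers(chi_sets, sigma_max=15.0, n_keep=10):
    """chi_sets: one tuple of chi angles (degrees) per model, for one residue.

    Returns the n_keep sets closest to the mean if the models agree,
    or None to signal that a backbone-dependent library should be used.
    """
    n_chi = len(chi_sets[0])
    means = [circ_mean([cs[k] for cs in chi_sets]) for k in range(n_chi)]
    sigmas = [circ_std([cs[k] for cs in chi_sets]) for k in range(n_chi)]
    if max(sigmas) > sigma_max:
        return None

    def dist(cs):
        return sum(abs(ang_diff(cs[k], means[k])) for k in range(n_chi))

    return sorted(chi_sets, key=dist)[:n_keep]
```

Wrapping the differences matters for rotamers near ±180°, where a naive arithmetic average of, say, 179° and −179° would wrongly give 0°.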
RESULTS AND DISCUSSION
Among a total of 104 domains in TBM targets, we have analyzed the results of 100 domains. Our models for T0295_1 and T0295_2 are void due to the enforced soft-deadline. In addition, our model 1 for T0335 is obtained by template-free modeling, and the experimental structure of T0284 is not available yet. All data shown here are from model 1.
Table I shows the official GDT-HA and GDT-TS scores of our models for all TBM/HA targets. The averages of the GDT-HA and GDT-TS scores for the 26 TBM/HA domains are 72.47 and 87.35, respectively. The TM-scores of our models (TM1) and the TM-scores of the best templates obtained by structure alignment (TMBT) are also shown. The best template (the first one in the table) is the template that gives the highest TM-score to the native structure among all templates used in MSACSA (TM-align is used for structure–structure alignment).
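For reference, the three backbone measures used here can be computed from per-residue Cα distances as in the sketch below. It uses the standard cutoffs (1, 2, 4, 8 Å for GDT-TS; 0.5, 1, 2, 4 Å for GDT-HA) and the standard TM-score length scale d0. As a simplifying assumption, the distances are taken from one fixed superposition, whereas the official programs search over superpositions to maximize each score; the guard for short chains is also an assumption of this sketch.

```python
def gdt(distances, cutoffs):
    """Average percentage of residues within each distance cutoff (angstroms)."""
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

def gdt_ts(distances):
    return gdt(distances, (1.0, 2.0, 4.0, 8.0))

def gdt_ha(distances):
    return gdt(distances, (0.5, 1.0, 2.0, 4.0))

def tm_score(distances, target_length):
    """TM-score for aligned-residue distances, normalized by the target length."""
    if target_length > 21:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5  # assumed floor for very short chains
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

GDT-HA uses cutoffs half as large as GDT-TS, which is why it is the more discriminating measure for high-accuracy targets.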
Table I. Summary of the Backbone Accuracy of 26 TBM/HA Targets
Templates used (first 5 are shown; best templates appear first in the list):
2fe5A, 1qavA, 2fcfA, 1ihjA, 1n7eA
1ihgA, 1a330, 1dywA, 1qngA, 1z81A
1jpaA, 1p4oA, 1lufA, 1p14A, 1xkkA
2bfxA, 1phk0, 1tkiA, 1fotA, 1xjdA
1agrE, 2af0A, 2crpA, 2bv1B, 1zv4X
2ah5A, 1x42A, 2fdrA, 2gfhA, 1te2A
2fh7A, 1larA, 1rpmA, 1yfoB, 2g59A
2a5dA, 1mr3F, 1mozA, 1z6xA, 1fzqA
1adr0, 1b0nA, 1rioA, 1zz6A, 1y9qA
1bg20, 1gojA, 1f9vA, 1sdmA, 1x88A
1j6oA, 1yixA, 1xwyA, 1zzmA, 1pscA
1vhrA, 1wrmA, 1yz4A, 1mkp0, 1m3gA
2ah5A, 2fi1A, 1te2A, 2gfhA, 1lvhA
2gvkA, 1t0tV, 1vdhD, 2cb2A
1zjrA, 1ipaA, 1v2xA, 1x7oA, 1gz0A
2aqjA, 1el5A, 1pj5A, 1vrqB, 1y56B
1eg5A, 1p3wA, 1t3iA, 1jf9A, 1elqA
1g9oA, 1i92A, 1gq5A, 1gq4A, 1q3oA
2f8aA, 1gp1A, 2gs3A, 2ggtA, 1kngA
2gw2A, 1a580, 1ihgA, 1qngA, 1cynA
2bygA, 1wf8A, 2fneA, 2fe5A, 1qlcA
2fneA, 2bygA, 1ujuA, 1whaA, 1uhpA
1ufbA, 1wolA, 1o3uA
TM1 values better than TMBT are shown in bold face.
DL, domain length of the native structure; GDT_HA, the official GDT_HA score; GDT_TS, the official GDT_TS score; TM1, TM-score between the native structure and the model; N, the number of templates used; TMBT, the TM-score of the best template aligned by structure alignment; AL, aligned length of the best template to the native structure; SeqID, SeqID of the best template.
TMBT, AL, and SeqID are calculated by TM-align.
The SeqID between the best template and the native structure is also shown. The average SeqID is 39.8%, and the best templates are well aligned to the native structures. The alignment length (AL) covers 97.5% of the domain length (DL) on average. For TBM/HA targets, it is rather easy to align target sequence to the best template after the native structures are obtained, i.e., a posteriori. However, the challenge is to align the target sequence to the best template without the native structure, i.e., a priori.
To examine how well we performed the alignment a priori, TM1 is plotted against TMBT for the 100 TBM targets in Figure 2, where the 26 TBM/HA targets are marked as circles. In Figure 2, data below the diagonal line correspond to models better than the best models one can construct with the best templates. A straight-line fit to the data gives y = 0.73x + 0.23, which crosses the diagonal line at TM1 = 0.88.
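The crossing point follows from setting y = x in the fitted line, which gives x = b / (1 − a) for slope a and intercept b. The short sketch below evaluates this and also counts value-added models. It reads the fit as TMBT = 0.73 TM1 + 0.23, an axis assignment inferred from the statement that points below the diagonal are models better than the best template; the function names are hypothetical.

```python
def diagonal_crossing(slope, intercept):
    """x at which the line y = slope * x + intercept meets y = x."""
    return intercept / (1.0 - slope)

def count_value_added(tm_model, tm_best_template):
    """Number of targets whose model TM-score exceeds that of the best template."""
    return sum(1 for m, t in zip(tm_model, tm_best_template) if m > t)

crossing = diagonal_crossing(0.73, 0.23)  # about 0.85 with the rounded coefficients
```

With the two-digit coefficients as printed, the crossing evaluates to roughly 0.85; the quoted value of 0.88 presumably reflects the unrounded fit parameters. Either way, the fit indicates that models tend to surpass the best template only when both TM-scores are already high.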
The total number of value-added models against the best templates is 33 of 100 domains. The fact that the predicted model outperforms the best template (corresponding to symbols below the diagonal line) implies that the current method can extract more relevant information out of multiple templates when accurate multiple (sequence) alignment is provided. On the other hand, the relatively poor performance of the current method (corresponding to symbols above the diagonal line) is believed to be intrinsic in the sense that the discrepancy between sequence alignment and structure alignment increases for hard targets.* Obviously, even when the identity of the best template is known, finding its structure alignment to the native structure is not possible without the native structure.
Among the predicted 100 domains of TBM category, 30 models are ranked as top 5 and 11 models are ranked as top 1. If we consider 26 domains of TBM/HA category, the numbers of top 5 and top 1 are 13 and 5, respectively.
Figure 3 shows the accuracies of χ1 and χ1 + χ2 for the 26 targets of the TBM/HA category, where χ1 + χ2 is considered correct if both χ1 and χ2 are within 30° of their native values. The average accuracies of χ1 and χ1 + χ2 are 70.0% and 48.4%, respectively. T0302, which is an NMR model, is excluded from the averages.
Side chains of T0340, T0345, and T0346 are predicted with particularly high χ1 accuracies, nearly 85%. These are easy targets with high SeqID, as shown in Table I. The best performing side-chain predictions are shown as filled circles; our models are the best for 11/26 targets in χ1 accuracy and for 9/26 targets in χ1 + χ2 accuracy. These results mean that the all-atom accuracy of our prediction is significantly improved. Generally, the accuracy of side-chain modeling increases as the SeqIDs between the templates and the target sequence increase. This tendency can be observed in Table I. In addition, the accuracy of the final multiple alignment between the target and the templates obviously affects the side-chain modeling. To compare the efficiency of the current side-chain modeling against MODELLER and SCWRL, we plot in Figure 4 the χ1 accuracies of the three methods utilizing identical multiple alignments. For χ1 accuracy > 70%, a systematic improvement is achieved except for T0290. For reasons that we do not understand, MODELLER performs best for T0290.
T0303 is a Joint Center for Structural Genomics (JCSG) target, a novel predicted phosphatase from Haemophilus somnus 129PT (PDB code: 2hsz). For modeling, we first collected 10 templates from 3D-jury and FoldFinder with, on average, 25% SeqID; their multiple alignment is shown in Figure 5. T0303_1 is a discontinuous domain comprising residues 1–17 and 95–224 according to the CASP7 domain definition. In Figure 5(a), the N-terminal part (residues 1–17) is inserted into this domain, forming a parallel β-sheet together with the other strands. This model is exceptionally well predicted compared to those of the other predictors. Since this domain is discontinuous, obtaining a good multiple alignment was relatively more difficult. However, we managed to obtain a model significantly better than the best model one can construct from the best template 2ah5A (see Table I). The accuracies of χ1 and χ1 + χ2 are 76.4% and 56%, respectively, which are better than those of the second best predictor's model by 11.4% and 15.0%.
T0332 is a Structural Genomics Consortium (SGC) target, a methyltransferase domain of human TAR (HIV-1) RNA binding protein (PDB code: 2ha8). In this case, we used six templates, and the best template 1zjrA shares 23% SeqID with the native structure. Interestingly, there are three templates whose SeqID is greater than that of the best template. In terms of SeqID, T0332 is in the twilight region. As shown in Table I, we obtained a more accurate model than the best template. The χ1 and χ1 + χ2 accuracies are 70.8% and 50%, respectively. The Z-scores of GDT-HA and χ1 accuracy of our model are remarkably high compared to those of all submitted models. As shown in Figure 5(b, d), there are ambiguous regions, which are marked in yellow. The model is well predicted in the long loop region (the upper right part of Figure 5(b)), which is marked in magenta in Figure 5(d). This magenta region has about 50% SeqID and is well aligned with the templates without gaps.
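The side-chain accuracy measure can be sketched as below. The 30° tolerance and the rule that χ1 + χ2 counts as correct only when both angles agree come from the text; applying the same tolerance to χ1 alone, the wrapped angle difference, and the input layout (one (χ1, χ2) pair per residue, with None for a missing angle) are assumptions of this sketch.

```python
def ang_diff(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def chi_accuracy(model, native, tol=30.0):
    """Return (chi1 accuracy, chi1+chi2 accuracy) in percent.

    model, native: lists of (chi1, chi2) in degrees, one entry per residue;
    chi2 (or chi1) is None when the residue has no such angle.
    """
    n1 = ok1 = n12 = ok12 = 0
    for (m1, m2), (x1, x2) in zip(model, native):
        if m1 is None or x1 is None:
            continue
        n1 += 1
        good1 = abs(ang_diff(m1, x1)) <= tol
        ok1 += good1
        if m2 is not None and x2 is not None:
            n12 += 1
            # chi1+chi2 is correct only if both angles are within tolerance
            ok12 += good1 and abs(ang_diff(m2, x2)) <= tol
    acc1 = 100.0 * ok1 / n1 if n1 else 0.0
    acc12 = 100.0 * ok12 / n12 if n12 else 0.0
    return acc1, acc12
```

Because χ1 + χ2 requires both angles to be right, its accuracy can never exceed the χ1 accuracy computed over the same residues, consistent with the lower average reported for χ1 + χ2.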
In the cases of T0303_1 and T0332, we note that probable sources of value added to the best templates are (a) good separation between well aligned regions and ambiguous regions, (b) good backbone modeling by global optimization based on good alignment, and (c) good side-chain modeling utilizing many templates.
For TBM/HA targets, finding good templates including the best was relatively easy, and consequently, we were able to generate more accurate models. Nevertheless, generating value-added models better than one can get from the best template is a challenging task.
What went wrong
The models that were poorly predicted compared to more successful ones by other predictors are mostly for early targets that expired prior to July 12, 2006. For these targets, the NN list-selection technique was not yet ready to be employed. We have examined the reasons for poor modeling, especially for T0308 and T0313. In the case of T0308, we failed to perform proper structural clustering when preparing the template lists. At the time of prediction, we knew via sequence analysis that 2a5dA was the obvious best template, which was also recognized by many other predictors. However, without the assistance of the NN list-selection method, we wanted to include as many templates as possible in the MSACSA stage. A total of 14 templates were used for T0308, among which the best template 2a5dA was structurally quite different from the rest. We should have used 2a5dA as a single template, which could have been identified as the winner by the NN list selection.
In the case of T0313, there are two template candidates in the PDB (1ii6 and 1x88) with 100% SeqID. We wrongly assumed that two proteins with 100% SeqID would take more or less the same 3D structure, and consequently, our fold database culled by PISCES at the 99% SeqID level happened to retain 1x88 rather than 1ii6. However, our a posteriori analysis reveals that the two templates differ significantly in their 3D structures, with a TM-score of 0.91. The failure simply came from the fact that the best template 1ii6 was missing from our fold database.
Our overall method was quite new and evolving rapidly throughout the CASP7 season, and we could not spare extra time and computational resources for systematic benchmarking in many aspects. It should also be noted that the current method does not include any kind of refinement protocol, which was not considered simply because of the urgency of developing the method itself.
For high-accuracy template-based modeling of CASP7 targets, we have applied a new procedure based on the rigorous optimization of score functions at three stages: multiple alignment, chain building, and side-chain modeling. We applied the CSA to a newly developed consistency-based score function for multiple alignment. For chain building and side-chain modeling, we optimized the MODELLER energy and a SCWRL-like energy using a target-specific rotamer library, respectively. Significant improvement in backbone as well as side-chain modeling is achieved for TBM and TBM/HA targets. For most TBM/HA targets (17/26), the predicted model was more accurate than the model one can construct a posteriori, when the native structure is available, by structure alignment using the best template. Among the 100 predicted domains of the TBM category, 30 models are ranked in the top 5 and 11 models are ranked as top 1. For the 26 domains of the TBM/HA category, the numbers of top 5 and top 1 models are 13 and 5, respectively. Among others, T0303_1 and T0332 are predicted exceptionally well in terms of GDT-TS, GDT-HA, and side-chain accuracies.
Although the modeling for TBM/HA targets is quite good on average, there are many targets where further improvement is necessary for better protein structure prediction. Especially for a target where the final multiple alignment contains many regions of long loops, we need to implement a sensible loop modeling procedure in the current method.
We thank CASP7 organizers and assessors. We also thank KIAS for providing Linux cluster supercomputers dedicated for CASP7.
Joo K, Lee J, Kim I, Lee S, Lee J. Multiple sequence alignment using the conformational space annealing, submitted.