Three different modeling scenarios are considered. First, we address the traditional loop modeling problem in crystal structures where the structure surrounding the loop is known. We next address InsEnds modeling as applied in the CASP9 blind prediction competition, where the InsEnds are as large as 45 aa regions in template-based models generated by Xu's RAPTOR-X algorithm.22 The third scenario is for the CASP9 refinement category in which the InsEnds algorithm is applied to the best structure from the server predictions and where the starting model and boundaries are specified by the organizers.
Loops in crystal structures
In order to demonstrate the applicability to larger loops in crystal structures, 26 loops of lengths 8–12 have been randomly selected from standard loop benchmarking studies.15 Loop boundaries for each target are taken as previously specified, and the loops are modeled using our method. Figure 4 illustrates the process of selecting the top five predictions, and Table I presents the best and the remaining four top predictions. After the predictions are clustered according to the RMSD between the loop structures, the largest five clusters are ranked using a linear combination of the Z-scores for the cluster tightness (RMSD between structures in the cluster), size, and average DOPE-PW energy, defined as Zt, Zs, and ZE, respectively,
where the Z-score for property i is Zi = (Xi − <Xi>)/σi, and <Xi> and σi are the mean and standard deviation of that property, respectively. After ranking the clusters, one representative from each cluster is selected by using a combination of the DOPE-PW energy and the SASA to obtain the top five predictions. Although DOPE-PW is very successful in guiding the protein backbone into a proper conformation based on the orientation of the Cα–Cβ vectors, it is unable to resolve the details of solvation at an atomistic level because it is parameterized only at the Cβ level. Hence, explicit SASA calculations are necessary to properly account for the solvation energy.
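The cluster ranking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights of the linear combination, the sign convention (lower combined score is better), the use of the population standard deviation, and the names `z_scores` and `rank_clusters` are all assumptions introduced here.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return Z_i = (X_i - <X>) / sigma for each value (zeros if sigma is 0)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def rank_clusters(clusters, w_tight=1.0, w_size=1.0, w_energy=1.0):
    """Order cluster names from best to worst by a weighted sum of Z-scores.

    Each cluster is a dict with keys:
      'name'      - label for the cluster
      'tightness' - RMSD between structures in the cluster (lower is better)
      'size'      - number of members (higher is better)
      'energy'    - average DOPE-PW energy (lower is better)
    The weights are hypothetical placeholders, not values from the paper.
    """
    zt = z_scores([c["tightness"] for c in clusters])
    zs = z_scores([c["size"] for c in clusters])
    ze = z_scores([c["energy"] for c in clusters])
    scored = []
    for c, t, s, e in zip(clusters, zt, zs, ze):
        # Assumed sign convention: size enters negatively so that
        # larger clusters lower (improve) the combined score.
        score = w_tight * t - w_size * s + w_energy * e
        scored.append((score, c["name"]))
    scored.sort()
    return [name for _, name in scored]


clusters = [
    {"name": "A", "tightness": 1.0, "size": 50, "energy": -100.0},
    {"name": "B", "tightness": 2.0, "size": 30, "energy": -80.0},
    {"name": "C", "tightness": 3.0, "size": 10, "energy": -60.0},
]
ranking = rank_clusters(clusters)
```

With these illustrative inputs, cluster A is tightest, largest, and lowest in energy, so it ranks first under any positive weights.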
Figure 3. The ligation terms close the loop. Constraints are placed on the distance between the two ends of the loop and the distance between the free end of the loop and the anchor residue. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Figure 4. Selection of the top 5 loop predictions for 1xnb. After clustering, the largest 5 clusters are ranked based on Z-scores with respect to cluster tightness, size, and average DOPE-PW energy. Once ranked, a selection is made from each of the 5 clusters using DOPE-PW + SASA. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Table I. Prediction of Loops of 8–12 Residues in Crystal Structures
|Target||Loop length||Local loop RMSD (Align Loop, RMSD loop)||Global loop RMSD (Align rest, RMSD loop)|
|1thw||12||1.84||4.61||2.4|| || || ||4.53||5.94||5.87|| || || |
|1scs||12||2.27||3.48||5.2||4.91|| || ||3.1||4.22||7.87||6.95|| || |
|4i1b||11||1.17||1.13||3.72||3.64||3.11|| ||2.03||2.03||9.71||10.3||10.83|| |
|1cbs||8||0.74||3.65||0.74|| || || ||1.74||5.41||1.74|| || || |
As discussed in the Methods section, the SASA scores are determined from a combined ranking of the hydrophilic and hydrophobic ASAs. The structures are likewise ranked by their DOPE-PW energy. The structure with the lowest total DOPE-PW + SASA rank in a given cluster is taken as the predicted structure from that cluster. Models are discarded when the distance between the free end and the anchor point fails to return to within 1.5 Å of the initial distance. If the largest cluster contains fewer than 5% of the structures, the scoring for the top five candidates uses only the sum of DOPE-PW and SASA scores. As shown in Supporting Information Figure 1, adding SASA to the DOPE-PW energy improves the selection of the top structure in most cases compared with using the DOPE-PW energy alone.
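The selection of one representative per cluster can be illustrated with a short rank-sum sketch. It assumes that each model carries a single SASA score for which lower is better, and that `end_drift` holds the absolute change in the free-end-to-anchor distance relative to the initial distance; the field names and the function `pick_representative` are hypothetical and chosen only for this example.

```python
def ranks(values):
    """Rank values in ascending order (0 = lowest value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def pick_representative(models, max_end_drift=1.5):
    """Return the id of the model with the lowest DOPE-PW + SASA rank sum.

    models: list of dicts with keys 'id', 'dope_pw', 'sasa', 'end_drift'.
    Models whose end-to-anchor distance drifts by more than max_end_drift
    (in angstroms) are discarded before ranking.
    """
    kept = [m for m in models if m["end_drift"] <= max_end_drift]
    if not kept:
        return None
    dope_rank = ranks([m["dope_pw"] for m in kept])
    sasa_rank = ranks([m["sasa"] for m in kept])
    best = min(range(len(kept)), key=lambda i: dope_rank[i] + sasa_rank[i])
    return kept[best]["id"]


cluster = [
    {"id": "m1", "dope_pw": -10.0, "sasa": 3.0, "end_drift": 0.5},
    {"id": "m2", "dope_pw": -12.0, "sasa": 1.0, "end_drift": 2.0},  # discarded
    {"id": "m3", "dope_pw": -11.0, "sasa": 6.0, "end_drift": 1.0},
    {"id": "m4", "dope_pw": -9.0, "sasa": 5.0, "end_drift": 0.2},
]
chosen = pick_representative(cluster)
```

Summing ranks rather than raw scores avoids having to place the DOPE-PW energy and the SASA score on a common scale.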
A residue-specific deviation quantifies the local confidence score of the prediction for each residue individually in each of the top five predictions. The local confidence scores are illustrated by color and thickness in Figure 4. The thicker (redder) portions in the predicted local region correspond to residues displaying the largest deviation within the cluster.
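One plausible way to compute such a residue-specific deviation is the root-mean-square distance of each residue from its mean position over the cluster members. The exact definition used for Figure 4 is not given in this section, so the sketch below is an assumption; it also assumes the cluster members have already been superposed in a common frame, and the name `per_residue_deviation` is hypothetical.

```python
from math import sqrt


def per_residue_deviation(cluster):
    """Return one deviation per residue for a cluster of superposed models.

    cluster: list of models; each model is a list of (x, y, z) tuples,
    one per residue, all expressed in the same coordinate frame.
    """
    n_models = len(cluster)
    n_residues = len(cluster[0])
    deviations = []
    for r in range(n_residues):
        # Mean position of residue r across the cluster members.
        cx = sum(m[r][0] for m in cluster) / n_models
        cy = sum(m[r][1] for m in cluster) / n_models
        cz = sum(m[r][2] for m in cluster) / n_models
        # Mean squared distance of residue r from that mean position.
        msd = sum(
            (m[r][0] - cx) ** 2 + (m[r][1] - cy) ** 2 + (m[r][2] - cz) ** 2
            for m in cluster
        ) / n_models
        deviations.append(sqrt(msd))
    return deviations


model_a = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
deviation = per_residue_deviation([model_a, model_b])
```

Residues with larger values would correspond to the thicker, redder segments described above.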
The simulations for loops of length 12 and 8–11 residues generate conformations with global loop RMSDs of 2.76 and 1.93 Å, respectively, where the RMSD is calculated for the loop residues after aligning the structures without the loop regions (Table II). These results can be compared with Table II of Lee et al.,8 which presents the minimum backbone RMSDs found using different existing loop sampling protocols. Simulations for 12- and 8-residue loops in crystal structures with the cyclic coordinate descent (CCD) protocol9 generate minimum RMSDs of 3.05 and 1.59 Å, the CJSD8 method obtains 2.34 and 1.01 Å, the self-organizing algorithm using an alternating scheme of pairwise distance adjustments (SOS)23 yields 2.29 and 1.19 Å, and the FALC8 scheme finds 1.84 and 0.78 Å, respectively.
Table II. Statistics for Loops in Crystal Structures
|Length||Local loop RMSD (Å)||Global loop RMSD (Å)|
Table III. Prediction of 20 End Residues in Crystal Structures
|Target||Type||Ends length||Local ends RMSD (Align Ends, RMSD Ends)||Global ends RMSD (Align rest, RMSD Ends)|
The average RMSDs of our top ranked predicted loops are 3.98 and 3.13 Å for loops with 12 and 8–11 amino acids, respectively (Table II). These results are comparable to those from two different methods, RAPPER14 and FALC,8 ranked by DFIRE16 as listed in Table IV in Lee et al.,8 where the average RMSDs of the top ranked 12 residue loop decoys for RAPPER and FALC are 4.32 and 3.84 Å, respectively. Rossi et al.24 compare four commercial loop modeling packages—Prime (Schrödinger, LLC), MODELLER (Accelrys Software), ICM (Molsoft, LLC), and Sybyl (Tripos)—which obtain RMSD values ranging from 3 to 5 Å for loops with 10–12 amino acids. Our performance is comparable to these methods.
We also compare our results to a recent atomic level loop modeling study which has sub-angstrom level accuracy.25 Although our Cβ level modeling certainly limits us in terms of obtaining sub-angstrom models, we still are able to obtain models for some of the same benchmark proteins that are better than or comparable to the high-resolution Kinematic Closure (KIC) protocol (Supporting Information Table 1 in Ref. 26). For instance, our top predicted model for 1hfc with 3.69-Å RMSD outperforms KIC's 8.2 Å prediction for the same loop. Similarly, our top predictions for the other targets 4i1b, 1msc, 1cyo, and 1pmy from our benchmark set in Table I yield RMSDs of 2.03, 5.5, 2.47, and 2.97 Å that are better than or comparable to the high-resolution KIC method's top predictions of 3.8, 3.2, 5.2, and 2.6 Å, respectively, for the same 10–12 amino acid loops. The results demonstrate that a Cβ level representation of the protein chain without a costly analytical closure constraint is sufficient to achieve accuracy comparable with existing methods for relatively long loops in the context of crystal structures.
Ends in crystal structures
Another challenge involves modeling the termini of protein structures, a challenge that has attracted only limited study.27, 28 Unlike loops, end regions require no loop closure. To demonstrate that our method is also applicable to end regions, we have refolded the termini for six proteins (Table III). In each case, the 20 residues at the C-terminal end of the native protein are first randomized while the rest of the structure is kept fixed. Starting from these pseudo-random structures, the end residues are sampled and clustered using the loop modeling protocol. Because no loop closure is required, the termini are folded using only the DOPE-PW energy function.
In three out of six cases (1af7, 1o2f, 1r69), the best and the predicted structures have a global RMSD of under 3 Å (Table III), with the best local RMSD under 5 Å. Although direct comparisons are unavailable for the same proteins, the results are comparable to another method28 for refolding of terminal secondary structures where the average RMSDs of 4.6 and 2.0 Å are obtained for 10–23 residue ends after three minimizations using the DFIRE and dDFIRE energy functions. We select the last 20 residues in each of the proteins for modeling irrespective of where the secondary structure boundaries lie. This protocol better mimics the situation encountered in authentic template-based modeling where the number of unknown residues that require modeling is determined by the gaps in the sequence alignment and where reliable information about secondary structure type or boundaries is often unavailable.
CASP9 blind InsEnds predictions
Methods designed for predicting the structure of internal loops may be inappropriate for termini of proteins because the energy functions and sampling generally used for loop modeling assume both ends are fixed. Furthermore, InsEnds can encompass whole secondary structure elements. The existing loop modeling methods have been benchmarked for loops in crystal structures where the remaining structure and loop boundaries are known. The situation for homology modeling, however, is more complex, being highly dependent on the quality of the sequence alignments, template identification, and boundary determination. Consequently, the starting point for InsEnds modeling is imperfect and inexact.
The biennial CASP experiments present a unique platform for testing new methods and benchmarking established ones through blind predictions. Our participation in CASP9 as MidwayFolding (groups TS435 and TS477) focused on testing our local structure prediction method and on improving poorly predicted local regions in template-based models. Our analysis begins with models generated by the program RAPTOR-X, which utilizes homology to identify template structures appropriate to a target sequence through sophisticated sequence/structure alignments. The templates are processed by MODELER to generate our starting model. We also use the sequence alignments of RAPTOR-X to identify the InsEnds regions in the models. Five entries may be submitted to CASP9 for each target, and Figure 5 displays the best of the five blind models submitted to CASP9 for each target.
Figure 5. CASP9 Ins&Ends blind predictions. Numbers indicate improvement from MODELER (Red) to our model (blue), when compared with the native structure (green) after modeling the regions enclosed by the boxes. RMSD changes are for the whole structure.
The CASP9 targets serve to illustrate several strengths of our method. Several of the insertion regions contain secondary structure elements in the targets. The target T0464 from CASP8 presents a case where the insertion region is a helix, which our method predicts correctly, improving the model's RMSD from 9.6 to 4.5 Å, as exhibited in Figure 5(A). Another target, T0623, has a 25-residue gap in a region that is, in fact, a hairpin, which our method also predicts correctly (8.2-Å RMSD improved to 6.3 Å).
The largest InsEnds contains 45 residues (T0585), and the RAPTOR-X+MODELER programs model it as a large loop. Our method correctly identifies that the missing region corresponds to three helices that pack into the protein core, thereby improving the model substantially from 15.1 to 9.1 Å overall RMSD as depicted in Figure 5(H). The target TR606 presents an example where the local modeling is performed for both termini simultaneously to form a pair of beta strands, thereby improving the overall RMSD from 4.9 to 3.8 Å for the target as a result of modeling the ends [Fig. 5(G)].
Other CASP targets contain InsEnds that are loops connecting different secondary structures. For instance, the targets T0520, T0594, and T0612 yield initial models with loops containing as many as 17 residues (identified from the gap boundaries in the sequence alignments). Use of our InsEnds protocol for these three loop regions improves the overall RMSDs from 3.2 to 2.6 Å, 2.2 to 1.7 Å, and 7.3 to 6.6 Å for T0520, T0594, and T0612, respectively [Fig. 5(C,D)]. The demonstration that we successfully model various types of InsEnds with the same protocol without any prior knowledge of whether they are loops or contain secondary structure elements highlights the robustness of the method.
Blind prediction of refinement targets in CASP9
The judges for the refinement category in the CASP experiment select the best of all submitted (template-based) models from all participating groups. The local regions that deviate most from the native structure are identified to the predictors as the refinement targets. From our perspective, the refinement category is distinct because the starting model is guaranteed to be the best of all the CASP server models rather than one of RAPTOR-X's models and because the boundaries for InsEnds are specified based on where the server model differs from the native structure (as identified by the organizers) rather than from RAPTOR-X's sequence alignment.
On average for the 12 refinement targets, the 24 different refinement methods in CASP8 yield no net improvement over the starting models.29 Table IV lists the RMSD as well as the Global Distance Test (GDT) changes from the starting models along with the ranking of our method with respect to all the other refinement methods. Our method proceeds by first initializing the InsEnds regions to a completely random conformation, so that no structural information about the InsEnds is retained from the starting model.
Table IV. Blind InsEnds Prediction of Refinement Targets in CASP9
|CASP9 refinement target||GDT starting||RMSD starting||GDT MidwayFolding||RMSD MidwayFolding||Rank of MidwayFolding|
|TR576||48.91||10.926||Ignored since initial GDT<50|
Unlike the RMSD, which relies on a single alignment, the GDT scores reflect the structural similarities at different distance cutoffs and, therefore, are generally better at assessing improvements in local regions.30 We have attempted 11 targets for refinement in CASP9 (Fig. 6) and improve the GDT scores for 7 of them. Among all groups participating in CASP9 refinement, four of our 11 predictions (targets TR517, TR568, TR569, and TR517) fall in the top 10% of all submissions, and eight of the 11 reside in the top 25% of all submissions, thereby outperforming several of the more costly all-atom refinement methods. The improvements are achieved for targets with a wide range of starting GDTs (>50). The GDT/RMSD for TR569 improves from 73.1/3.01 Å to 76.58/2.24 Å, and our method ranks fourth out of the 121 total submissions for this target. The starting values for TR568 are lower at 53.35/6.39 Å, and we improve them to 56.7/5.1 Å with an overall ranking of sixth out of the 127 submissions for the target. Our method performs much worse than the rest of the methods for one target, TR592, presumably because the starting structure is already extremely good (91.2/1.2 Å) so that our Cβ level representation is inadequate, and, consequently, an all-atom side chain representation is required to improve the model further. Moreover, we have not refined the side chains in any of the cases, something that probably would have improved the results even more.
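The idea of scoring at several distance cutoffs can be illustrated with a simplified GDT_TS-style calculation, which averages the percentage of residues lying within 1, 2, 4, and 8 Å of their native positions. The sketch below takes a single list of per-residue Cα distances as given; the official score is computed by the LGA program, which searches for the most favorable superposition at each cutoff, so this single-superposition version is only an approximation. The function name `gdt_ts` is introduced here for illustration.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues within each distance cutoff.

    distances: per-residue model-to-native distances (in angstroms)
    measured after ONE superposition (a simplification of the LGA search).
    """
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# One badly placed residue (10 A) lowers the score but does not dominate it.
score = gdt_ts([0.5, 1.5, 3.0, 10.0])
```

Because a residue beyond the largest cutoff simply counts as a miss, a few large local errors cannot dominate the score in the way they dominate an RMSD.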
Figure 6. InsEnds predictions for CASP9 refinement targets. Difference between CA/CA distance across the sequence of the initial (starting) /native and final (refined using InsEnds method)/native after superposition using sequence-dependent LGA protocol. Official data from CASP9 official website (http://predictioncenter.org/casp9/). For each target, the arrows indicate the regions where the InsEnds modeling has been performed. The blue to green color change designates regions where the InsEnds modeling improves upon the given target based on LGA superposition to native structures.
Figure 6 displays all our predictions for the refinement targets in CASP9. The figure illustrates how well the model aligns to the native structure before refinement (initial) and after refinement (after) when superposed using the LGA program.30 The improvements introduced into the local region also help to align the remainder of the protein in several cases. For example, in TR614, even though the actual regions modeled are an insertion from 33 to 50 and the C-terminal residues from 106 to 121, the local alignment of the N-terminal residues improves over the starting model as indicated with blue in the LGA alignment for TR614 in Figure 6.
Molecular replacement results for CASP9 refinement targets
One of the CASP9 refinement metrics assesses how well the predicted models reproduce the experimental data.31 Recently, models generated by the structure prediction methods have been inserted into the molecular replacement likelihood algorithms for X-ray crystallographic refinement to solve the phase problem.32, 33 The assessors for CASP9 refinement judge the quality of each submitted model in this regard by calculating the Z-score of the best orientation of the model in the unit cell of the crystal compared to placing it in a set of random orientations. Only models with Z-scores above 6 are considered good enough to solve the phase problem. Table 3 in Ref. 31 summarizes how often various groups improve the Z-score of the targets from likely unrefinable (<7) to likely refinable (>7). Our method performs as well as or better than all the other groups in this test, with positive results in two of three cases attempted. As our approach uses a backbone + Cβ model with the side chains either missing or added simply using SCWRL4.0 with no further refinement, some of our submitted models were discarded in the analysis by assessors. Regardless, the fact that our method ranks at the top in the molecular replacement test demonstrates its practical value in X-ray crystal structure refinement.
In contrast to most other methods that expend considerable computing resources on including all-atom interactions, our method lacks explicit side chain atoms. This difference highlights the distinction between the refinement of crystal structures and template-based models. All-atom refinement of crystal structures benefits from having high-resolution information for the rest of the structure, whereas homology models are usually far from perfect. It is unclear whether the expensive modeling of all the atoms in an imperfect environment provides a computationally efficient strategy. In contrast, the first step of our approach is designed to obtain the proper backbone structure and orientation for the local region by using a coarse level of modeling that is less sensitive to the atomic level details for the rest of the homology model. Once the coarse level model is obtained for the local region, side chains may be added, and more detailed all-atom refinement can proceed.
Global InsEnds RMSD versus local InsEnds RMSD
RMSDs are calculated in three ways to help quantify the quality of the modeling of local InsEnds regions:
Local InsEnds RMSD: Align the InsEnds region and calculate the RMSD of only that region.
Global InsEnds RMSD: Align all the residues besides the InsEnds, and then calculate the RMSD of the InsEnds region.
Global structure RMSD: Optimally align all the residues in the protein and calculate the RMSD of the full chain.
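The three definitions above differ only in which residues are used for the superposition and which are used for the RMSD. A minimal sketch using the standard Kabsch superposition is given below; it assumes one coordinate per residue (for example Cα) stored in NumPy arrays, and the function names `kabsch`, `rmsd`, and `insends_rmsds` are introduced here for illustration rather than taken from the authors' code.

```python
import numpy as np


def kabsch(mobile, target):
    """Rotation R and translation t that best map mobile onto target."""
    mc, tc = mobile.mean(axis=0), target.mean(axis=0)
    H = (mobile - mc).T @ (target - tc)       # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tc - R @ mc
    return R, t


def rmsd(a, b):
    """Root-mean-square distance between paired coordinates."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def insends_rmsds(model, native, region):
    """Return (local InsEnds, global InsEnds, global structure) RMSDs.

    model, native: (N, 3) coordinate arrays with matching residue order.
    region: boolean mask, True for residues in the InsEnds region.
    """
    model = np.asarray(model, dtype=float)
    native = np.asarray(native, dtype=float)
    region = np.asarray(region, dtype=bool)

    def fit_on(mask):
        # Superpose the whole model using only the residues in mask.
        R, t = kabsch(model[mask], native[mask])
        return model @ R.T + t

    everything = np.ones(len(model), dtype=bool)
    local = rmsd(fit_on(region)[region], native[region])
    global_insends = rmsd(fit_on(~region)[region], native[region])
    global_structure = rmsd(fit_on(everything), native)
    return local, global_insends, global_structure
```

As a usage example, a model whose InsEnds region has the correct internal shape but is rigidly shifted by 5 Å relative to an otherwise exact framework would give a local InsEnds RMSD near zero and a global InsEnds RMSD of 5 Å, which is the distinction between internal accuracy and placement that the two measures are meant to capture.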
The local InsEnds RMSD is a measure of how well the InsEnds region itself is modeled, and the global InsEnds RMSD provides a measure of how well the modeled InsEnds is oriented with respect to the rest of the protein. The global InsEnds RMSD is the ideal measure of loop quality when predicting loops in crystal structures because the only difference between the native structure and the model can appear in the loop region. In contrast, InsEnds modeling of homology models begins from inexact structures; therefore, assessing the refinements requires accounting for the RMSD of the rest of the structure (besides the InsEnds) with respect to the native structure. If the starting homology model deviates significantly from the native structure, the alignment of the non-InsEnds region necessarily must skew the anchor regions, and therefore the global InsEnds RMSD would not provide as good a metric for reporting the accuracy of InsEnds modeling as does either the local InsEnds RMSD or the overall RMSD of the structure.
This utility of the different RMSDs is illustrated for six targets from CASP8 for which the initial RAPTOR models have variable RMSDs to the native structures. The 11–12 residue InsEnds regions in those models are chosen for (post-dictum) prediction using our method (Table V). Not surprisingly, the global InsEnds RMSD is highly dependent on the quality of the initial model (i.e., the RMSD of all but the InsEnds region in the initial model). For target T0478D1, the RMSD of the non-InsEnds region in the starting model is 8.07 Å; the best local InsEnds RMSD decreases from 2.9 Å to 1.58 Å, whereas the best global InsEnds RMSD decreases from 12.2 Å to 8.4 Å.
Table V. Prediction of InsEnds in CASP8 structures (postdiction)
|CASP8 target||InsEnds length||RMSD of InsEnds region||Local InsEnds RMSD to native (align InsEnds, RMSD of InsEnds)||Global InsEnds RMSD to native (align non InsEnds, RMSD of InsEnds)||Global protein RMSD to native (align all, RMSD of all)|
|Initial model (RAPTOR + Modeler)||InsEnds best||InsEnds predicted||Initial model (RAPTOR + Modeler)||InsEnds best||InsEnds predicted||Initial model (RAPTOR + modeler)||InsEnds best||InsEnds predicted|
Target T0411D1 has a non-InsEnds RMSD of the starting model much closer to the native structure at 2.74 Å, and our local InsEnds RMSD improves from 3.53 to 1.85 Å, similar to the local InsEnds RMSD improvement in T0478D1 (2.9 to 1.58 Å). However, the global InsEnds RMSD for this target improves from 10.2 to 2.78 Å, a much larger improvement than that of the global InsEnds RMSD in T0478D1 (12.2 to 8.4 Å). The difference can be attributed to T0411D1's starting model having the non-InsEnds region much closer to the native structure when compared with T0478D1. Figure 7 illustrates this behavior and indicates that the local InsEnds RMSD remains relatively unaffected, whereas the global InsEnds RMSD for the same targets is quite severely affected by the RMSD of the remaining region. The successes of the modeling also support our previous contention from protein structure predictions that the neighbor-dependent ϕ,ψ distributions capture local interactions reasonably well.20
Figure 7. Global versus local RMSD. The RMSD of non InsEnds region is plotted against the global InsEnds RMSD (red) and local InsEnds RMSD (blue) for six CASP8 targets. The global InsEnds RMSD is affected severely by the quality of the homology model. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Applications to protein folding simulations
Although loop modeling is often called the “mini-folding problem,” traditional approaches to loop modeling do not consider the folding mechanism when predicting loops. Our method, on the other hand, views local modeling in a fashion that fits naturally into the larger problem of protein folding.
Experimental studies indicate that proteins fold through sequential stabilization of tertiary structure elements or foldons.34–37 Often, long-range contacts form early in the folding pathway and produce intermediate species where some entrained local regions are not yet folded. Hence, a computational scheme designed to predict structure by mimicking the natural stepwise fashion of folding pathways should encounter the problem of folding inside of loops.
Our InsEnds algorithm is well suited to address this problem because the undetermined local regions in the structure that arise during the folding pathway can correspond to distinct secondary structures, loops, or to combinations thereof. As a proof of principle, we test our method by predicting native structures of possible intermediates in the pathways for folding two proteins: ubiquitin and barnase.
The late-folding intermediate in ubiquitin lacks the 3₁₀ helix and the β5 strand, while the rest of the structure is well formed34, 38 [Fig. 8(B)]. Starting from a native-like structure for the intermediate, the InsEnds algorithm is used to fold the 18-residue insertion. The InsEnds refinement procedure successfully recovers the native structure to a global RMSD of 1.6 Å [Fig. 8(C)]. This illustrates an example where the local region is neither a loop nor a continuous secondary structure. Nevertheless, we still obtain the right topology, essentially completing the last step of the folding pathway to predict the native structure.
Figure 8. InsEnds algorithm applied to protein folding pathways. A: The β5 and 3₁₀ helix in ubiquitin are the last structures to form in the pathway. Their structures are depicted as disordered in the model (B) of the folding intermediate and (C) predicted using the InsEnds algorithm. D: Barnase native structure highlighting the two hairpin loops that are part of two different cores, and (E) and (F) predictions of the loops using InsEnds algorithm, respectively. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Barnase is a 108-residue protein that is atypical for a small protein because it contains three distinct hydrophobic cores. The two hairpin loops depicted in Figure 8(D) are crucial to the structure because they are involved in formation of the protein's cores, and, therefore, the correct prediction of the loops is essential for the prediction of the global structure. Experiments indicate that loop 2 is the last structure to form in the folding pathway.36 When the InsEnds method is used to fold both the 10 and 15 residue loops in barnase [Fig. 8(E,F)], our best predictions in both cases lie in the top clusters, and the best global RMSDs are 2.03 Å and 1.27 Å for loops 1 and 2, respectively.
The problem of folding inside of loops highlights two aspects of our method. The first is that our approach treats local and global structure prediction similarly by mimicking the natural protein folding mechanism. The second aspect is the demonstration that given the correct boundaries, our method is able to reconstruct the local structures irrespective of whether the local regions are well defined secondary structures or loops.
Simultaneous folding of multiple InsEnds
One crucial feature of our approach is the ability to simultaneously model multiple local regions. When the regions are interacting, simultaneous modeling can be essential because the context provided by one local region may be important in guiding the other into place. A good example is the CASP target TR606, where the InsEnds correspond to the two termini that form a hydrogen-bonded pair of β strands. The initial template model fails to identify the ends as strands, and, therefore, the ends are wrongly placed. Accurate modeling requires that they be folded simultaneously. Guided only by the orientationally dependent DOPE-PW energy function, we have modeled the free termini and correctly predicted the pair of strands in our top submission [Fig. 5(G)].
Protein structure prediction pipeline
One of our goals is to combine the respective strengths of free modeling with template-based modeling for an integrated structure prediction pipeline. This goal is realized through an automated server, created for CASP9, that integrates the InsEnds, RAPTOR-X, and ItFix methods. Given a sequence, the pipeline begins by performing homology modeling using RAPTOR-X. If no templates are identified, the pipeline directs the sequence for free modeling using our existing ItFix algorithm for secondary and tertiary structure prediction. If RAPTOR-X is able to build a template-based model, the InsEnds are modeled to obtain a final structure. The pipeline has been used for the CASP9 structure predictions of the MidwayFolding group (CASP9 group numbers 435 and 477).
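The control flow of the pipeline can be summarized in a few lines. The sketch below is schematic only: the three callables are hypothetical stand-ins for RAPTOR-X, ItFix, and the InsEnds protocol, and their signatures are assumptions made for this illustration rather than the interfaces of the actual programs or server.

```python
def predict_structure(sequence, build_template_model, free_model, model_insends):
    """Route a sequence through template-based or free modeling.

    build_template_model: stand-in for RAPTOR-X; returns a model, or None
                          when no template is identified (assumed convention).
    free_model:           stand-in for ItFix free modeling of the sequence.
    model_insends:        stand-in for InsEnds modeling of a template model.
    """
    template_model = build_template_model(sequence)
    if template_model is None:
        # No template found: fall back to free modeling.
        return free_model(sequence)
    # Template-based model available: remodel its InsEnds regions.
    return model_insends(template_model)


# Toy stand-ins that only tag strings, to show the two branches.
no_template = predict_structure(
    "ACD", lambda s: None, lambda s: "free:" + s, lambda m: "refined:" + m
)
with_template = predict_structure(
    "ACD", lambda s: "tmpl:" + s, lambda s: "free:" + s, lambda m: "refined:" + m
)
```

Passing the three stages in as arguments keeps the routing logic independent of how each program is actually invoked.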