The four groups of this article were selected because their template-based predictions at CASP8 scored especially well on two aspects: First, they showed a good match to the target H-bonds and target side-chain positions and rotamers.1 Second, they were physically realistic, with few all-atom clashes and good bond lengths, bond angles, and Ramachandran plot. Although the latter features are not among the most important ones in structure prediction, getting them right is both essential and difficult: essential because the fine-grained energy functions for high-resolution refinement depend on correct stereochemistry, and difficult because a simple energy minimization not only improves the look of the model, but also tends to move it away from the target.2
The authors' prediction methods use distinctly different ways to obtain these results, and this article is an attempt to synthesize some commonality from these different approaches. There are two basic approaches to getting the details right: correctness by construction and optimization of cost functions (also called energy functions).
The correct-by-construction approach makes sure that all parameters being considered are set correctly in initial models, and that conformation-change operators do not change these parameters. For example, in Rosetta's initial stages, all backbone bond angles and bond lengths are set to ideal values, and only torsion angles are modified. In undertaker, all backbone fragments and side-chains are copied from PDB files, and bond angles and bond lengths are not changed by conformation-change operators.
The cost function approach requires attention to two issues: the accuracy of the function and the efficiency of the conformational search. Many terms are included in the cost function, and these terms often are in conflict, so setting the weights of the different components is important when optimizing the weighted sum. Rosetta and YASARA have put considerable effort into optimizing their energy functions so that low-energy protein models are much more likely to be correct predictions, and experimental crystal structures have very low energy. Rosetta has also put effort into improving its Monte-Carlo-based search strategy to sample conformation space sufficiently. The Lee method focuses on improved conformational search by conformational space annealing,3 using the standard Modeller4 cost function, which includes competing constraints from “many” templates and CHARMM.5
One problem with the cost-function approach is that some terms of the cost function (such as the Lennard-Jones potential for clash detection) are much stiffer than other terms (such as bond length and bond angle), so that poor bond lengths and bond angles are accepted to remove clashes, rather than doing more difficult combinatorial searches that remove the clashes without damaging the bond angles and bond lengths. One solution (used in YASARA and Rosetta) is to keep bond lengths and angles fixed until the worst clashes are gone. Another solution is to perform straightforward (but difficult) global optimization of the function considering all degrees of freedom as done by the LEE server. Undertaker ramps up the weight of the clash terms as the optimization progresses.
All four methods extract considerable information from the templates, using them to provide initial starting models and, for the LEE server and undertaker, constraints for the cost function. None of the four methods uses the old “frozen-core” approach, in which portions of the backbone copied from the templates are not allowed to change.
The next sections will describe in more detail the approaches of each of the four methods.
The self-parameterizing knowledge-based YASARA force field
Improving the physical correctness of protein models looks like the ideal task for a widely used physics-based method: all-atom molecular dynamics simulation with explicit solvent. Still, this approach has traditionally had a hard time at CASP: computers are too slow to simulate the folding of the CASP targets from a random initial conformation, and empirical force fields are usually too crude to really improve models built with other methods. The latter problem is related to the observation that high resolution X-ray structures “jump away” during the first picoseconds of a simulation, accompanied by a deterioration of knowledge-based indicators like Ramachandran plot quality.
The NMR community, being haunted by the poor quality of structures obtained from molecular dynamics refinement (Ramachandran plot Z-scores of −7 were common6), soon came up with a solution: the force field was augmented with knowledge-based torsional potentials that were extracted from high-resolution X-ray structures and ensured that the resulting models looked the same.7
Interestingly, the models often looked even better than X-ray structures, raising the question whether these knowledge-based potentials really improved the accuracy, or just created artificially good-looking models.
The YASARA force field described here addresses these issues by combining the AMBER all-atom force field equation8 with multi-dimensional knowledge-based torsional potentials† (see Fig. 1) and with a consistent set of force field parameters to maximize the accuracy: this is achieved by making a random change to one or more parameters (e.g. a certain van der Waals radius, a charge, or the weight of a knowledge-based potential, see Table I), energy-minimizing a training set of 25 high-resolution X-ray structures, measuring the damage done, and rejecting or accepting the new force field based on a Monte Carlo criterion.9 To ensure that all forces responsible for the experimentally observed structure are considered, minimizations are done in crystal space, using complete unit cells.2 As a result, one obtains a force field that has stable energy minima as close as possible to native structures. And as shown before, this is essentially equivalent to a force field that moves models closer to native structures during a simulation.9
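The self-parameterization loop can be sketched as a Metropolis Monte Carlo walk over the parameter set. In the sketch below, the `damage` function is a stand-in for the real measure (energy-minimizing the 25 training crystal structures in crystal space and scoring coordinate drift plus validation-score deterioration), and the parameter names are purely illustrative:

```python
import math
import random

def damage(params):
    # Stand-in for the real measure: minimize a training set of 25
    # high-resolution X-ray structures with the candidate force field and
    # score the damage done.  Here, a simple quadratic whose optimum sits
    # at 1.0 for every parameter.
    return sum((v - 1.0) ** 2 for v in params.values())

def self_parameterize(params, steps=1000, temperature=0.05, seed=0):
    """Metropolis Monte Carlo over force-field parameters: randomly change
    one parameter, re-measure the damage, accept or reject the new force
    field, and track the best parameter set seen so far."""
    rng = random.Random(seed)
    current, cur_dmg = dict(params), damage(params)
    best, best_dmg = dict(current), cur_dmg
    for _ in range(steps):
        trial = dict(current)
        key = rng.choice(sorted(trial))        # pick one parameter
        trial[key] += rng.gauss(0.0, 0.1)      # random change
        dmg = damage(trial)
        if dmg < cur_dmg or rng.random() < math.exp((cur_dmg - dmg) / temperature):
            current, cur_dmg = trial, dmg      # Monte Carlo acceptance
        if cur_dmg < best_dmg:
            best, best_dmg = dict(current), cur_dmg
    return best, best_dmg

# hypothetical parameters: a van der Waals radius scale, a charge scale,
# and the weight of a knowledge-based potential
start = {"vdw_scale": 1.3, "charge_scale": 0.7, "kbp_weight": 0.5}
tuned, tuned_damage = self_parameterize(start)
```

The same accept/reject logic applies whether the perturbed parameter is a van der Waals radius, a charge, or the weight of a knowledge-based potential; only the damage evaluation (a full minimization of the training set) makes the real procedure expensive.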
Table I (excerpt): knowledge-based potential (KBP) parameters
38. Scaling factor for those (Y)AMBER torsional potentials that are also covered by KBPs (there are no KBPs involving terminal hydrogens or atoms inside rings).
39. Height of the average 1D KBP energy barrier.
40. Height of the average 2D KBP energy barrier.
41. Ratio of the 2D PhiPsi and Psi−1Phi KBPs.
42. Height of the average 3D KBP energy barrier.
The parameter optimization procedure is computationally intensive and took about half a year using the Models@Home distributed computing system.10 After convergence, the contributions of 1D:2D:3D potentials (Table I) were 2.6:0.33:3.82 kcal/mol. So the highest weight was assigned to the 3D potentials, which makes sense as they contain the most information. A simple explanation for the surprising result that 2D potentials came out last could be that they are (except for Psi−1Phi) fully contained in the 3D potentials, whereas the 1D potentials have a higher resolution (256 instead of 64 bins, see footnote).
The cross-validated results are shown in Table II. Obviously, the knowledge-based potentials helped a lot: First, the damage done to crystal structures during an energy minimization (RMSD column) is noticeably smaller with the YASARA force field than with YAMBER2 (which used the same parameter optimization approach, but without knowledge-based potentials2) or AMBER. Although these RMSD differences look small, they translate to much larger differences during longer simulations.2 And second, the old modeler's rule of thumb to “never hurt a protein by energy minimization if it can be avoided” is no longer an issue: the deterioration of structure validation Z-scores is gone, and the minimized structures even look a bit better according to WHAT IF (third column in Table II). This holds not only for those checks that are related to the knowledge-based potentials described here (Ramachandran plot, backbone conformation quality), but also for the independent 3D packing quality check (which improves from −0.583 to −0.539).
Table II columns: force field, RMSD (damage done to crystal structures during minimization), and structure-quality Z-score.
Regarding the practical application of the new YASARA force field during CASP8, two results are noteworthy: First, extensive parallel molecular dynamics simulations‡ aided by Concoord12 won three of the 12 refinement targets (TR429, TR454 and TR469, based on GDT_TS scores for Model_1). And second, short energy minimizations with a solvent shell helped to improve the physical realism of homology models1 built for the main CASP8 targets. The initial models, and thus the starting points for the energy minimizations, were obtained using the following protocol: PSI-BLAST13 was run to identify the five closest templates in the PDB, then for each template up to five stochastic alignments were created14 using SSALIGN scoring matrices.15 For each of the at most 25 template/alignment combinations, a 3D model was built using loop conformations extracted from the PDB16 and the SCWRL side-chain placement algorithm.17 After the minimization, the models were ranked by quality Z-score (Table II), and the top five were submitted.
The YASARA force field and the homology modeling protocol have been implemented as part of the molecular modeling program YASARA Structure, available from www.YASARA.org. A web-server can be found at www.YASARA.org/minimizationserver.
Protein 3D modeling by global optimization (LEE)
Reconciling two requirements—extracting as much information as possible from the templates while maintaining good stereochemistry—is a difficult task. The LEE and LEE-SERVER models of CASP8 targets are physically realistic while agreeing well with native H-bonds and native side-chain conformations. This is achieved firstly by incorporating “many” templates into the standard Modeller4 energy function, so that the function contains many competing energy terms, and secondly by simply applying rigorous/extensive global optimization to the highly frustrated energy function.
An ideal protein structure modeling method would require two conditions. The first is the availability of an accurate energy function that can distinguish a good protein model from bad ones solely on the basis of the models' energy values. The other is the availability of an efficient sampling method that is guaranteed to generate a wide spectrum of low-energy conformations, including the global minimum energy conformation of a given energy function, on a reasonable time scale. Unfortunately, neither condition is going to be met in the near future, and consequently state-of-the-art protein modeling is carried out with not-so-ideal energy functions and conformational search methods.
Using a given (and preferably commonly used) energy function, the LEE approach focuses on investigating the effect of a more rigorous optimization method on the quality of protein models. Generally, more efficient optimization of an energy function does not guarantee more accurate protein modeling, owing to the inherent inaccuracy of currently available energy functions. To circumvent this, we resort to score functions constructed from templates based on consistency, which has been shown to be effective for multiple alignment18 and three-dimensional (3D) protein chain building.19 In the LEE method, a global optimizer, the conformational space annealing (CSA) method,3, 20 is applied to the three steps of protein 3D modeling—multiple alignment, 3D chain building, and side-chain re-modeling.21
CSA searches the conformational space of local minima. In addition, CSA combines the general scheme of genetic algorithm with the concept of annealing in the conformational space. CSA maintains a population of conformations and controls its conformational diversity using a distance measure between them. Annealing is achieved by using a large value for the distance measure at the early stages of optimization, and reducing its value in later stages.
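The CSA bookkeeping described above can be illustrated on a toy one-dimensional landscape: keep a bank of local minima, generate trials by combining two parents and re-minimizing, then either replace a similar bank member (if the trial falls within the distance cutoff) or the worst member, while annealing the cutoff downward. The landscape, the crude gradient-descent minimizer, and the schedule below are all illustrative, not the actual CSA implementation:

```python
import random

def energy(x):
    # toy rugged one-dimensional landscape with competing minima
    return x ** 4 - 8 * x ** 2 + x

def local_minimize(x, lr=0.01, iters=200):
    # crude gradient descent to the nearest local minimum
    for _ in range(iters):
        g = (energy(x + 1e-5) - energy(x - 1e-5)) / 2e-5
        x -= lr * g
    return x

def csa(bank_size=6, rounds=40, seed=1):
    rng = random.Random(seed)
    # bank: a population of energy-minimized conformations
    bank = [local_minimize(rng.uniform(-3.0, 3.0)) for _ in range(bank_size)]
    d_cut = 2.0                                    # large initial diversity cutoff
    for _ in range(rounds):
        a, b = rng.sample(bank, 2)                 # combine two parents
        trial = local_minimize(a + rng.uniform(0.2, 0.8) * (b - a))
        nearest = min(bank, key=lambda m: abs(m - trial))
        if abs(nearest - trial) < d_cut:
            # similar to an existing minimum: replace it only if better
            if energy(trial) < energy(nearest):
                bank[bank.index(nearest)] = trial
        else:
            # a new, diverse minimum: replace the worst member if better
            worst = max(bank, key=energy)
            if energy(trial) < energy(worst):
                bank[bank.index(worst)] = trial
        d_cut *= 0.95                              # anneal the cutoff down
    return min(bank, key=energy)

best = csa()
```

The large early cutoff forces the bank to stay diverse (trials mostly displace similar members), while the shrinking cutoff lets the population concentrate around the deepest basins late in the run.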
For a given set of templates§, we perform a multiple alignment of the query and templates by generating a pair-wise restraint library from profile–profile alignments between the query and templates and structure–structure alignments between templates. Unlike the heuristic (progressive) alignment methods popular in the literature, we apply a more thorough global optimization to a consistency-based score function by using CSA.18 The more consistent an alignment is with the restraint library, the higher its score. Typically, CSA provides 100 alternative solutions, from which a few good alignments are selected by screening. To evaluate an alignment, we generate 25 3D models using the standard Modeller package, and these models are used to measure the quality of the alignment. For the quality assessment of a 3D model, we use a support vector regression machine trained on decoy structures generated by the LEE method in the CASP7 experiments.
Using the alignment selected and templates' 3D structures as input, we generate 3D structures of the query protein by straightforward optimization of the standard Modeller energy function using CSA, which we call Modeller-CSA.19 To apply CSA to the MODELLER energy function, one should be equipped with the following three ingredients: (a) a local minimizer for a given input structure, (b) a distance measure between two given energy-minimized structures, and (c) ways to combine two parent structures to generate a daughter structure which will be energy-minimized subsequently.
For local energy minimization, we have used what is already implemented in the MODELLER package. Root-mean-square-deviation between Cα atoms of two structures is used as the distance measure. To explore the conformational space of the neighborhood of a parent structure P1, we generate a daughter structure by replacing a part of P1 by the corresponding part of another parent structure P2. The actual replacement is performed using internal variables such as bond angles, bond lengths, and dihedral angles. As a result, daughter structures partially inherit bond angles, bond lengths, and backbone and side-chain dihedral angles of their parents. To generate the initial population of conformations, Cartesian coordinates of a trial MODELLER structure are randomly perturbed within 2 Å, and subsequently energy-minimized using MODELLER version 8v2.
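The segment-swap move that generates daughters can be sketched as follows. The flat list of dihedrals and the segment-length choices are illustrative stand-ins for MODELLER's actual internal-coordinate representation:

```python
import random

def make_daughter(p1, p2, rng):
    # The daughter inherits internal coordinates from parent P1, with one
    # randomly chosen contiguous segment copied from parent P2 (here the
    # internal coordinates are modeled as a flat list of dihedrals).
    n = len(p1)
    start = rng.randrange(n)
    length = rng.randrange(1, max(2, n // 3))
    daughter = list(p1)
    for i in range(start, min(start + length, n)):
        daughter[i] = p2[i]
    return daughter

rng = random.Random(0)
p1 = [0.0] * 10   # parent P1's internal coordinates (illustrative)
p2 = [1.0] * 10   # parent P2's internal coordinates
d = make_daughter(p1, p2, rng)
```

In the real method the daughter is subsequently energy-minimized with MODELLER, so the copied segment is reconciled with the inherited remainder of the structure.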
The Modeller energy function has 35 restraint energy terms for 12 features such as distances between atoms, bond angles, dihedral angles, etc. Without additional user-provided restraints, the number of non-zero energy components is typically 15. Besides the spatial restraints, CHARMM energy terms are included in Modeller to enforce proper stereochemistry, which in general adds further frustration to the energy function. Models with lower Modeller energies can therefore be considered to satisfy more restraints than those with higher energies: the lower a model's Modeller energy, the more information from the template structures and their alignment is utilized in the model, while satisfying as much stereochemistry as the Modeller energy allows; steric clashes are removed accordingly. When we compare the accuracy of standard Modeller models with that of Modeller-CSA models for 140 TBM domains, the latter are improved by 2.9% (backbone by GDT-HA), 11.0% (χ1), 18.3% (χ2), and 7.3% (H-bond), demonstrating the positive effect of rigorous optimization of the standard Modeller energy function for protein 3D modeling.
The Modeller energy function is a collection of many competing/contradicting restraints arising partly from an alignment containing more than one template and partly from the protein stereochemistry. Generally, it is not possible to satisfy all restraints, which naturally sets up a combinatorial optimization problem. The LEE approach deliberately includes many more (up to 20) templates than a typical template-based protein modeling method, consequently generating a highly frustrated energy function for optimization. For this kind of problem, a powerful optimizer such as CSA has been shown to be appropriate, providing physically reliable models while efficiently incorporating maximal information from many templates.
Comparison of 3D modeling accuracy between various methods is a tricky procedure, since identifying proper template(s) can have a more significant effect than the rest of the modeling procedure. This is more so for medium to difficult template-based modeling (TBM) targets, and less so for high-accuracy (HA) TBM targets. When we compare the model accuracies for HA-TBM targets, LEE models are excellent in all aspects of model quality.
Finally, we remark that additional side-chain improvement (10.8% for χ1 and 20.5% for χ2) is achieved by side-chain remodeling, for which a rotamer library is constructed based on the consistency of the side chains from the Modeller-CSA models. To this library we have added a backbone-dependent and sequence-specific rotamer library similar to that of SCWRL3.0.17 Again using CSA, we optimized a scoring function containing energy terms from SCWRL and DFIRE.22
Rosetta all-atom refinement
The Rosetta CASP8 models are physically realistic because they were all refined in the physically realistic all-atom Rosetta force field.23 Lennard-Jones repulsive interactions are not damped and all backbone and side-chain atoms are modeled explicitly, so refined models have essentially no steric clashes. Explicit orientation-dependent hydrogen bonding potentials derived from high-level quantum chemistry calculations are used to refine interactions between hydrogen bond donors and acceptors; this is a considerable improvement over the purely electrostatic treatment of polar interactions in molecular mechanics force fields such as CHARMM5 and AMBER.24 The hydrophobic effect and electrostatic desolvation are treated using an implicit solvation model which captures both entropic and enthalpic contributions. Protein-structure-derived torsional potentials restrict backbone torsion angles to populated regions of the Ramachandran plot and side-chain χ angles to within a standard deviation of significantly populated “rotameric” values. Bond lengths and angles are restrained close to ideal values. The net result of the above is that Rosetta refined models are physically realistic both in terms of local geometry and global structural properties.
Because the Rosetta force field is reasonably accurate, the native structure almost always has lower (free) energy than non-native models. The structure prediction problem in Rosetta is thus the problem of searching for the lowest energy structure for a given amino acid sequence. This is carried out using the Rosetta high resolution sampling/refinement methodology.
The development of the Rosetta sampling methodology has been a continued battle with the challenges posed by refining models in such a high resolution force field. Because the repulsive interactions are not damped and the hydrogen bonding interactions are sensitive to small changes in distance and angles between hydrogen bonding atoms, the total system energy can change drastically with even slight perturbations of the structure, which produces an extremely bumpy free energy landscape riddled with local minima. The key aspects of the Rosetta sampling methodology that allow effective search for the global minimum despite the ruggedness of the landscape are described in the following paragraph.
First, Rosetta uses Monte Carlo Minimization25 rather than traditional Monte Carlo moves to explore backbone conformational space—after a perturbation of the backbone torsion angles, gradient-based quasi-Newton minimization is used to locate the nearest local minimum, and the decision about whether or not to accept the move is made based on the energy difference between the starting energy and the energy after minimization. Without the minimization step, essentially no moves would be accepted, as the chance of introducing a clash is extremely high. Second, side-chain torsional barriers are hopped over by combinatorial optimization of side-chain rotamer conformations after the backbone torsion perturbation but before the minimization (which is carried out with respect to all side-chain and backbone degrees of freedom). This discrete optimization of side-chain conformations provides a big speedup over continuous minimization in, for example, molecular dynamics simulations, in which quite substantial torsional barriers must be overcome. Third, large scale sampling is carried out initially using a low resolution representation in which side-chain degrees of freedom are effectively integrated out; this greatly reduces the computational cost and enables rapid and broad sampling. In the subsequent all-atom refinement stage, the goal is to identify the lowest energy local minimum in the vicinity of the starting low resolution model, and hence this more computationally intensive refinement is limited to a relatively small region of conformational space. Fourth, the Lennard-Jones repulsive interactions in both the continuous quasi-Newton optimization and the discrete side-chain optimization are initially damped considerably and then ramped slowly to full weight—this allows the gradual working out of clashes in starting models generated by the low resolution Rosetta methodology and in general greatly smooths the landscape, which facilitates the early part of the search.
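The Monte Carlo Minimization loop at the heart of this strategy can be sketched on a toy one-dimensional "energy"; the real Rosetta moves perturb backbone torsions and minimize over all torsional degrees of freedom, whereas everything below (the landscape, the minimizer, the schedule) is illustrative:

```python
import math
import random

def energy(x):
    # toy rugged landscape standing in for the all-atom energy
    return math.sin(5 * x) + 0.1 * x * x

def local_minimize(x, lr=0.01, iters=300):
    # gradient descent to the nearest local minimum (numerical derivative)
    for _ in range(iters):
        g = (energy(x + 1e-5) - energy(x - 1e-5)) / 2e-5
        x -= lr * g
    return x

def monte_carlo_minimization(x0, steps=100, kT=0.5, seed=0):
    """Perturb, minimize to the nearest local minimum, then apply the
    Metropolis criterion to the MINIMIZED energy; track the best minimum."""
    rng = random.Random(seed)
    x = local_minimize(x0)
    e = energy(x)
    best_x, best_e = x, e
    for _ in range(steps):
        trial = local_minimize(x + rng.gauss(0.0, 1.0))   # perturb + minimize
        et = energy(trial)
        if et < e or rng.random() < math.exp((e - et) / kT):
            x, e = trial, et                              # Metropolis accept
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

bx, be = monte_carlo_minimization(3.0)
```

Because acceptance is judged on the minimized energy, even a perturbation that momentarily introduces a clash can be accepted once minimization has resolved it—this is exactly why the bare (unminimized) Monte Carlo moves would almost never be accepted.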
Fifth, loop regions and other regions variable in a population of starting models are randomly selected for complete rebuilding by cutting the chain, remodeling the variable region, and then resealing the break to restore near ideal geometry. Following the stochastic rebuilding of randomly selected segments of the chain using the low resolution Rosetta representation, all-atom refinement of both the rebuilt region and the remainder of the protein is carried out. The cutting of the chain allows the traversal of huge barriers that would be nearly insurmountable if the chain were kept intact. This “rebuild and refinement” protocol has proven quite effective in refining comparative models and NMR structures26 and was used extensively in CASP8. Sixth, rather than carrying out a small number of very long trajectories, the standard approach with Rosetta is to carry out very large numbers of short Monte Carlo Minimization trajectories starting with a diverse collection of starting models generated either from the Rosetta de novo modeling method or from alternative alignments to alternative templates. The advantage of carrying out many short trajectories is that the space of possible structures can be covered much more effectively than in a single long trajectory. Though each full-atom refinement trajectory typically takes tens of minutes (rather than the days or weeks of long-time-scale MD simulations), because many independent trajectories are carried out, Rosetta refinement is quite CPU-intensive.
Because of the factors described earlier, the Rosetta high resolution sampling methodology is reasonably effective at locating low-energy near-native structures, provided the number of degrees of freedom is not too large. However, because the size of the search space grows exponentially with chain length, for larger proteins (>200 amino acids) Rosetta rarely happens upon the native free energy minimum unless there is considerable information from homologous structures or from experimental data to guide sampling to this region. The sampling problem—finding the global minimum—remains the most formidable obstacle to consistent high resolution structure prediction with Rosetta.
It is important to recognize that Rosetta models are physically realistic because of the combination of the physically realistic energy function with the powerful sampling methodology. The physically realistic forcefield ensures that predicted structures, which are at energy minima, are physically realistic, but it is not in itself sufficient—without an adequate sampling methodology it would be very difficult to locate deep local minima, let alone the global minimum.
Undertaker keeps the good parts
The SAM-T08-human method at CASP8 builds models using the program undertaker,27 developed at the University of California, Santa Cruz. Exactly the same protocol is used for template-based and template-free modeling.
First, a random, all-atom conformation is built by joining 1-to-4-residue long fragments from PDB files. This conformation is invariably terrible, but it is complete, and all bond angles, bond lengths, and torsion angles are locally reasonable.
Next, incomplete models are built from alignments to templates. These models copy the backbone from the template in aligned positions. Side-chains are also copied when the residues in the target are identical to the aligned residues in the template. The side-chains for aligned residues that differ are filled in by SCWRL 3.0.17
Each of the incomplete models is inserted into the random conformation, to make complete models based on the models from the alignments. These models are scored with a cost function, and the best-scoring one is kept. The incomplete models are then inserted into that model, and the process is repeated a total of nine times. The best model generated from each round of inserting incomplete models is kept to seed a pool for genetic optimization. For targets with good templates, most of the models are very similar and are based on one of the templates, with only small loop regions left from the initial random conformation.
For the SAM-T08-server submissions, these initial models were submitted as the second model and the final optimized models as the first model, so the effect of subsequent optimization can easily be measured. We noticed that essentially all the correct hydrogen bonds are present in the initial models, so undertaker is not creating them during optimization. Indeed for all the measures of accuracy that we have looked at, the difference between the initial and final models is very small (much less than a standard deviation).
Undertaker does optimization using an adaptive genetic algorithm that has 39 different conformation-change operators. Initially, operators that remove clashes and close chain breaks tend to get most heavily applied, whereas towards the end of the optimization operators that make small tweaks to torsion angles tend to be more successful. For example, in the last 300 generations of the optimizations for the server-only targets, the TweakPsiSegment, which adjusts one psi angle by a small amount, succeeds on average 26% of the time (Tables III and IV).
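One simple way such adaptive operator scheduling can work is to pick operators in proportion to their observed success rates, so that clash-removal moves dominate early and small torsion tweaks dominate late. The class below is a hypothetical illustration—the operator names are taken from the text, but the bookkeeping and update rule are not undertaker's actual code:

```python
import random

class AdaptiveOperatorPool:
    """Illustrative scheduling for an adaptive genetic algorithm: operators
    that succeed (improve the cost) more often are chosen more often."""

    def __init__(self, names, seed=0):
        self.rng = random.Random(seed)
        self.success = {n: 1.0 for n in names}   # optimistic prior
        self.tries = {n: 2.0 for n in names}

    def rate(self, name):
        # empirical success rate with the prior folded in
        return self.success[name] / self.tries[name]

    def pick(self):
        names = sorted(self.success)
        weights = [self.rate(n) for n in names]
        return self.rng.choices(names, weights=weights)[0]

    def record(self, name, improved):
        self.tries[name] += 1
        if improved:
            self.success[name] += 1

pool = AdaptiveOperatorPool(["RemoveClash", "TweakPsiSegment"])
# simulate an early phase in which clash removal keeps succeeding
for _ in range(50):
    pool.record("RemoveClash", improved=True)
    pool.record("TweakPsiSegment", improved=False)
```

As success statistics shift over the run—clash-removal moves stop finding clashes to remove, while small tweaks keep paying off—the sampling weights shift with them, without any explicit phase schedule.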
Fairly large improvements in the physics-like terms of the cost function are made during optimization, particularly removing clashes and breaks, but also improving the N-CA-C bond angle, hydrogen-bond geometry, and denseness of packing (measured with several different terms). These changes do not improve the measures of correctness of the model, but produce less ugly models of about the same accuracy.
The conformation-change operators are designed to preserve bond lengths and bond angles (or to change them to values copied from an example in PDB), except at breaks in the chain. One special case is that several operators modify peptide planes, either by rotating them around the CA-CA axis or by inserting a canonical peptide plane between the two CA atoms. Both of these types of operations can modify the N-CA-C bond angle, and so the cost function includes explicit scoring of the N-CA-C bond angles—the only bond angle included in the cost function.
The presence of explicit chain breaks means that the cost function needs to include a penalty for chain breaks, so that the optimization process will attempt to close the breaks. The cost function does not increase quadratically with the size of the gap (as is common in physics-based cost functions), but linearly. This ensures that the optimizer does not try spreading a large gap into many smaller gaps, but leaves it in one place, so that combinatorial optimization can try to close the gap with bigger movements of the backbone.
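The incentive difference between linear and quadratic break penalties is easy to see in two lines (the weight is illustrative):

```python
def break_cost_linear(gap, weight=1.0):
    # linear in the gap size: one 4 Å gap costs the same as four 1 Å gaps,
    # so the optimizer gains nothing by smearing a break along the chain
    return weight * gap

def break_cost_quadratic(gap, weight=1.0):
    # quadratic: splitting one gap into many small ones lowers the total
    # cost, tempting the optimizer to spread a single large break around
    return weight * gap ** 2

one_big = break_cost_linear(4.0)
four_small = 4 * break_cost_linear(1.0)
```

Under the quadratic form, a 4 Å gap costs 16 while four 1 Å gaps cost only 4, so the optimizer is rewarded for distributing the break; under the linear form both configurations cost 4, leaving the gap in one place where a combinatorial move can close it.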
One of the most important tasks for optimizers like undertaker is to remove steric clashes from the conformation. Undertaker's clash cost function does not use Lennard-Jones potentials, but a simple “softstep” function that is 0 when the atoms are far enough apart and rapidly rises to 1 as the atoms get too close. A table of minimum acceptable distances is used, derived from a training set of PDB files. By not having enormous costs for bad clashes, the optimizer does not destroy everything in a neighborhood to remove a clash. Clash removal is only done when it is possible to do so without damaging the good aspects of the conformation.
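A bounded clash term of this kind can be sketched as follows; the smoothstep shape, the width, and the cutoff value are illustrative, while in undertaker the minimum acceptable distance for each atom pair comes from a table derived from PDB training structures:

```python
def softstep_clash(d, d_min, width=0.5):
    """Clash score that is 0 when atoms are far enough apart (d >= d_min)
    and rises smoothly to a CAP of 1 as they get much too close, instead of
    exploding like a Lennard-Jones repulsion."""
    if d >= d_min:
        return 0.0
    if d <= d_min - width:
        return 1.0
    t = (d_min - d) / width            # 0 at d_min, 1 at d_min - width
    return t * t * (3 - 2 * t)         # smoothstep interpolation
```

Because the cost saturates at 1, a terrible clash is only somewhat worse than a mild one, so the optimizer is not driven to wreck the surrounding structure just to relieve a single bad contact.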
During the optimization process the weight of the break and clash costs is increased, so that later stages of optimization try harder to remove the breaks and clashes.
Another component of undertaker's cost function that is very useful for polishing almost-right models is the hydrogen-bond cost function. Undertaker implements several different H-bond cost functions of differing complexity, from a very simple one that just counts how many donors are close enough to acceptors to potentially form H-bonds, to one that takes into account several different geometric features of the H-bond. For CASP8, the most sophisticated of these H-bond cost functions was used, so that optimization could improve the quality of the modeled H-bonds, and not just the number of them. Somewhat unusually, undertaker's H-bond scoring is done without explicitly representing the hydrogens—all the geometric terms are parameterized to use just the heavy atoms that are included in X-ray models. Although undertaker does introduce some H-bonds and improve others during optimization, it does not appear to be increasing the overall number or quality of correct H-bonds.
The insignificant gain in accuracy from our initial models to our final optimized models indicates that we need to focus more on generating good alignments to templates and creating the initial models from the alignments—subsequent optimization is making the models prettier without really making them better. The big improvement in CASP8 over older versions of undertaker is that polishing the models no longer decreases their accuracy.