In protein structure prediction, it is often the case that a protein segment must be adjusted to connect two fixed segments. This occurs during loop structure prediction in homology modeling as well as in ab initio structure prediction. Several algorithms for this purpose are based on the inverse Jacobian of the distance constraints with respect to dihedral angle degrees of freedom. These algorithms are sometimes unstable and fail to converge. We present an algorithm developed originally for inverse kinematics applications in robotics. In robotics, an end effector in the form of a robot hand must reach for an object in space by altering adjustable joint angles and arm lengths. In loop prediction, dihedral angles must be adjusted to move the C-terminal residue of a segment to superimpose on a fixed anchor residue in the protein structure. The algorithm, referred to as cyclic coordinate descent or CCD, involves adjusting one dihedral angle at a time to minimize the sum of the squared distances between three backbone atoms of the moving C-terminal anchor and the corresponding atoms in the fixed C-terminal anchor. The result is an equation in one variable for the proposed change in each dihedral. The algorithm proceeds iteratively through all of the adjustable dihedral angles from the N-terminal to the C-terminal end of the loop. CCD is suitable as a component of loop prediction methods that generate large numbers of trial structures. It succeeds in closing loops in a large test set 99.79% of the time, and fails occasionally only for short, highly extended loops. It is very fast, closing loops of length 8 in 0.037 sec on average.
To characterize biological processes both in physiological and pathological conditions, knowledge of the three-dimensional structures of the proteins involved is of great importance. The number of unique sequences in the Protein Data Bank (PDB; Berman et al. 2000) of experimentally determined structures is now >12,000, and the number of sequences in the nonredundant protein sequence database is >1.2 million (Wheeler et al. 2002). Homology modeling remains the most accurate structure prediction method for bridging the gap between the number of sequences and available structures. At least one-third of protein sequences in most genomes are homologous at least in part to proteins in the PDB (Sauder and Dunbrack Jr. 2000), and are therefore candidates for homology modeling. Ab initio folding simulations have also made gains in recent years as the necessary computational resources have become cheaper and more plentiful (Simons et al. 1999). For domains without representatives in the PDB, these methods may provide at least a preliminary model that can be tested experimentally.
Homology modeling usually proceeds via a number of steps: (1) identification of a homolog of known structure (the “parent”) for the sequence of interest (the “target”); (2) refinement of the target–parent sequence alignment through application of varied alignment methods or manual adjustment in light of the known structure; (3) backbone modeling by borrowing of core secondary structures and loops of conserved length from the parent structure, and loop modeling in regions of the alignment that contain insertions and deletions, or in which the sequence has diverged substantially; (4) side-chain modeling onto the backbone given the target sequence; (5) refinement and validation of the model. In practice, backbone modeling and side-chain modeling are interdependent. Some methods proceed by defining constraints from the known structure and the target–parent sequence alignment, rather than borrowing Cartesian coordinates directly (Sali and Blundell 1993).
Although improvements in identification (Karplus et al. 1998; Jones et al. 1999), alignment quality (Sauder et al. 2000) and side-chain prediction methods (Bower et al. 1997; Dunbrack Jr. 1999; Mendes et al. 2001; Xiang and Honig 2001; Liang and Grishin 2002) have been significant in recent years, loop modeling remains a difficult task for several reasons. For long loops, the number of available conformations is enormous, and rapid and thorough sampling is a challenge. Defining an energy function that can select the right loop structure in an environment that is only approximately correct for the target sequence is also quite difficult. Most loop methods have not been tested in true homology modeling situations, but rather in the more artificial situation of replacing a loop back into its own native structure.
Modeling a loop requires satisfying the constraint of connecting the two protein segments on either end of the loop with a physically reasonable peptide conformation. The fixed residues on either side of the loop to be modeled are termed the N- and C-terminal anchor residues. The anchors place a constraint on the available conformations, thus reducing the size of the conformational space, but satisfying the constraint presents an algorithmic challenge. The “loop closure” problem arises in nearly all loop-modeling methods, regardless of their nature. This is the case whether loop closure takes place in the context of homology modeling or in ab initio protein structure prediction, in which, for instance, secondary structures may be moved as a whole and new loops constructed to connect them. Database methods (van Vlijmen and Karplus 1997) that borrow loops from unrelated structures that approximately fit the anchors must refine the loop structure to fit the actual anchors of the target model. Ab initio loop-modeling methods (Bruccoleri and Karplus 1987) may generate large numbers of random conformations. If these are built starting from the N-terminal anchor in the model, then the loop must be adjusted to connect the C-terminal residue of the loop to the C-terminal anchor of the model. Some methods build randomly from both the N- and C-terminal anchors, and the resulting segments must be connected in the middle (Moult and James 1986).
Several solutions have been presented to solve the “loop closure” problem. Wedemeyer and Scheraga (1999) have solved the problem analytically for tripeptides with 6 degrees of freedom. Shenkin et al. described an algorithm based on the Jacobian matrix of first derivatives of distances between atoms of the terminal residues of the loop with respect to the dihedral degrees of freedom (Fine et al. 1986; Shenkin et al. 1987). Their method, referred to as “random tweak,” uses Lagrange multipliers to minimize changes in the dihedral angles while satisfying the constraints on the interatomic distances of the end residues. Starting from a random conformation, all the dihedral angles are modified at once in each step of the iteration until the distance constraints between the end residues are satisfied. Because of the matrix inversion required, tweak is sometimes numerically unstable, if the matrix loses rank (i.e., has determinant 0 and is therefore uninvertible). Tweak requires that the resulting loop be rotated into place, because the algorithm attempts to satisfy distance constraints between the N- and C-terminal anchor atoms, rather than between the last residue of the moving loop and the fixed anchor. It also does not allow imposing additional constraints on individual residues because modifications to all dihedral angles are computed at once, with strong dependence of each dihedral change on all of the others. It has been used in a number of loop-modeling programs, including Drawbridge (Ring et al. 1992), the Biopolymer program (Tripos, Inc.), and Loopy (Xiang et al. 2002).
Both Modeller (Fiser et al. 2000) and the “scaling relaxation” method of Zheng et al. (1992) build initial conformations that connect the anchors by scaling the size of an initial conformation to fit the anchors, and then gradually returning the loop to normal size through an energy minimization or molecular dynamics procedure. In Modeller, the backbone atoms are built in a straight line from one anchor to the other, whereas in the scaling relaxation method a database loop is used.
Our implementation of the random tweak algorithm and analysis of its limitations led us to examine a variety of so-called inverse kinematics algorithms used in robotics and computer-generated character animation. Forward kinematics methods calculate the positions of object components based on internal and external degrees of freedom, whereas inverse kinematics methods calculate the necessary changes in internal and external degrees of freedom in order for an object component to reach a desired position. Inverse kinematics algorithms are designed to move an “end effector” (e.g., a robotic gripper) to a specific location, for example to pick up an object, by changing joint angles and modifying segment lengths. The inverse kinematics problem is therefore essentially the same problem as loop closure in proteins or other molecules, as originally pointed out by Manocha and Zhu (1994; Manocha et al. 1995). Many inverse kinematics algorithms are based on computing the Jacobian and its inverse or pseudoinverse, and hence, like tweak, are computationally expensive and sometimes numerically unstable (Lander 1998). They rely on changing all joint variables at the same time along a path that will move the end effector toward the target object. In robotics, the problem of singularities in Jacobian-based methods has been studied extensively (Maciejewski 1990; Merlet 1992). Besides the computational difficulties, one major drawback of Jacobian-based methods is that placing constraints on some degrees of freedom may produce unpredictable results. Capping or zeroing out certain elements of the proposed vector of the changes in degrees of freedom may result in motion of the end effector away from rather than toward the target object.
One algorithm used in robotics that is flexible in allowing constraints to be placed at each step, easy to program, conceptually simple, and computationally fast is “cyclic coordinate descent” (CCD). This algorithm was originally developed as an improved method for solving inverse kinematics problems in robotics (Wang and Chen 1991). CCD is a member of a class of iterative relaxation algorithms known as Jacobi or Gauss-Seidel methods (Briggs et al. 2000). It involves adjusting one degree of freedom at a time to move the end effector toward the target object. This results in one equation in one unknown for each degree of freedom, and hence is analytically very simple and computationally fast. The method is free of singularities and it does not include matrix inversion. It proceeds in iterative fashion along a chain of degrees of freedom, modifying each joint so that the end effector gets as close as possible to the desired position. The equations are able to provide both an optimum setting for the variable and the first and second derivatives of the change at the current position so that small increments can be made in preference to large changes, if desired. Because the optimal change in a parameter of one joint depends only on the current values of the other joint parameters, one can place constraints on any degree of freedom, either restricting its allowed values or placing a probability distribution on it.
In this paper, we describe the cyclic coordinate descent algorithm, which we have modified for the problem of loop closure in protein structure prediction. In robotics, the end effector is usually a single point and the target position is a single point. In protein modeling, the end effector may be the three backbone atoms of the C-terminal residue of the loop that must be superimposed onto the backbone atoms of the fixed C-terminal anchor of the target model. Therefore, we must also consider the orientation of the end effector as well as its position. In the Materials and Methods section, we describe the algorithm and derive the necessary equations for dihedral angle degrees of freedom and the orientation constraint. There are several possibilities for choice of end effector and target, and we describe one such possible choice and its implementation. In the Results section, we show that CCD can close loops from nearly any starting configuration as long as the chain is long enough to reach from the N-terminal anchor to the C-terminal anchor. We have also explored the use of Ramachandran probability maps as constraints in the CCD closure procedure. This is accomplished by using CCD as a proposal step in a Monte Carlo simulation. The CCD equations provide the move to new values of the backbone dihedrals, and we use Ramachandran map probabilities to determine whether to accept the move. We show that using the Ramachandran constraint does not affect the success rate of loop closure by CCD.
Materials and methods
Our implementation of the CCD method for protein loop closure is an iterative procedure that modifies sequentially each backbone dihedral angle (ϕ and ψ) in each residue that is part of the loop. We define “N-anchor” and “C-anchor” to be the N- and C-terminal anchor residues, respectively, that bracket the loop and remain fixed throughout the calculation. These are illustrated in Figure 1A. The residues are numbered from 0 to n, where the 0-th residue is the N-anchor and the n-th residue is the C-anchor. The calculation begins with some initial configuration of the loop consisting of residues 0 through n. Residue 0 of the loop coincides exactly with the N-anchor, whereas the position of residue n of the loop will not coincide with the position of the fixed C-anchor. The goal is to adjust the backbone dihedrals ϕ and ψ of residues 1 to n so that the backbone atoms N, Cα, and C of the moving residue n are superimposed on the corresponding backbone atoms of the fixed residue n (i.e., the C-anchor).
The initial structure of the loop can be from any source. For instance, it might be constructed from random values for the backbone dihedrals and standard bond lengths and bond angles, or it might be obtained from a database search for loops that approximate the anchors. In either case, we have initial values for all dihedrals, bond lengths, and bond angles for residues 1 through n. In our implementation, starting from residue 0, we build the N atom of residue 1 from the known (or chosen) value of ψ of residue 0, and the Cα atom from ω of residue 1. These atoms remain fixed through the rest of the calculation. The remaining atoms of the loop, beginning with C of residue 1 through C of residue n are built from values of the dihedral angles ϕ,ψ and the initial bond lengths and angles.
Once the initial loop is constructed, the procedure involves changing the values of the backbone dihedrals ϕ and ψ iteratively until the backbone atoms of residue n are superimposed on the fixed backbone atoms of the C-anchor residue. The progress of the loop closing process is assessed by the distances between the backbone atoms of the moving C-terminal residue of the loop and their desired positions in the C-anchor.
As shown in Figure 1B, F1, F2, and F3 are vectors that represent the fixed target positions for the atoms of the C-terminal residue of the loop. The positions of the moving C-terminal residue atoms are represented by M01,M02,M03, and M1,M2,M3, before and after a change, respectively, in a dihedral angle of any residue in the loop. The rotation axis (containing O1, O2, O3) is given by the direction of the bond corresponding to the dihedral angle that is modified (N—Cα for ϕ, Cα—C for ψ), where O1, O2, and O3 are the footpoints of vectors from the rotation axis to the three atoms of the moving C-terminal anchor.
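The geometry described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the function names `footpoint` and `rotate_about_axis` are hypothetical, points are plain 3-tuples, and the rotation axis is assumed to be given by the two bonded atoms (N and Cα for ϕ, Cα and C for ψ).

```python
import math

def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def add(a, b): return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
def scale(a, k): return (a[0] * k, a[1] * k, a[2] * k)
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def footpoint(p, axis_start, axis_end):
    """Foot of the perpendicular from p onto the line through the two axis atoms (O_i in the text)."""
    u = unit(sub(axis_end, axis_start))
    return add(axis_start, scale(u, dot(sub(p, axis_start), u)))

def rotate_about_axis(p, axis_start, axis_end, theta):
    """Rotate p by theta radians about the axis (right-handed about axis_start -> axis_end)."""
    u = unit(sub(axis_end, axis_start))
    o = footpoint(p, axis_start, axis_end)
    r = sub(p, o)        # radial vector from the footpoint to the atom
    s = cross(u, r)      # perpendicular to both the axis and r, same length as r
    return add(o, add(scale(r, math.cos(theta)), scale(s, math.sin(theta))))
```

With θ = 0 the atom stays at its starting position M0; any other θ moves it on a circle around its footpoint, which is why each atom's distance to its target depends on a single variable.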
To accomplish the overlap between the current and desired positions of the atoms, the sum of squared distances, S, should be minimized:
From equations 2 and 3, it follows that
with similar equations for the second and third atoms.
Calculating the squared distances between the moving atoms and the fixed target atoms, we obtain:
The first-order derivative for S is
where for i = 1, 2, 3 we have
Setting dS/dθ = 0, we obtain tan α, where α is the rotation angle that will yield an extreme value for the sum of square distances described above.
Inverting the tangent will produce two values for α that are π radians apart. The correct one is that which produces a positive value of the second derivative of S, which is easily derived from equations 6 and 7.
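The derivation outlined above can be written out explicitly. What follows is a sketch in notation assumed for this purpose, not a reproduction of the numbered equations cited in the text: u is the unit vector along the rotation axis, O_i is the footpoint of moving atom i on that axis, r_i is the length and r̂_i the direction of the vector from O_i to the starting position M0_i, ŝ_i = u × r̂_i, and f_i = F_i − O_i.

```latex
\begin{align*}
S(\theta) &= \sum_{i=1}^{3} \left| \mathbf{F}_i - \mathbf{M}_i(\theta) \right|^2 \\
\mathbf{M}_i(\theta) &= \mathbf{O}_i + r_i\cos\theta\,\hat{\mathbf{r}}_i + r_i\sin\theta\,\hat{\mathbf{s}}_i \\
\mathbf{F}_i - \mathbf{M}_i(\theta) &= \mathbf{f}_i - r_i\cos\theta\,\hat{\mathbf{r}}_i - r_i\sin\theta\,\hat{\mathbf{s}}_i \\
\left| \mathbf{F}_i - \mathbf{M}_i(\theta) \right|^2 &= r_i^2 + \left|\mathbf{f}_i\right|^2
   - 2 r_i\cos\theta\,(\mathbf{f}_i\cdot\hat{\mathbf{r}}_i)
   - 2 r_i\sin\theta\,(\mathbf{f}_i\cdot\hat{\mathbf{s}}_i) \\
\frac{dS}{d\theta} &= \sum_{i=1}^{3} 2 r_i \left[ \sin\theta\,(\mathbf{f}_i\cdot\hat{\mathbf{r}}_i)
   - \cos\theta\,(\mathbf{f}_i\cdot\hat{\mathbf{s}}_i) \right] \\
\tan\alpha &= \frac{\sum_{i} r_i\,(\mathbf{f}_i\cdot\hat{\mathbf{s}}_i)}
                  {\sum_{i} r_i\,(\mathbf{f}_i\cdot\hat{\mathbf{r}}_i)} \\
\frac{d^2S}{d\theta^2} &= \sum_{i=1}^{3} 2 r_i \left[ \cos\theta\,(\mathbf{f}_i\cdot\hat{\mathbf{r}}_i)
   + \sin\theta\,(\mathbf{f}_i\cdot\hat{\mathbf{s}}_i) \right] \\
S(\theta) &= a - b\cos\theta - c\sin\theta, \qquad
   a = \sum_{i}\left(r_i^2 + \left|\mathbf{f}_i\right|^2\right),\;
   b = 2\sum_{i} r_i\,(\mathbf{f}_i\cdot\hat{\mathbf{r}}_i),\;
   c = 2\sum_{i} r_i\,(\mathbf{f}_i\cdot\hat{\mathbf{s}}_i)
\end{align*}
```

The cross term in the squared distance vanishes because r̂_i and ŝ_i are orthogonal unit vectors. Since θ = 0 corresponds to the current conformation, α is the proposed change in the dihedral rather than its absolute value.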
In practice we obtain α in a different way. With S of the form
multiplying the last two terms by
When θ = α, S is a minimum. This is the same solution as equation 8, except that we now have sine and cosine explicitly defined. We use the atan2(y, x) function of the C programming language to return θ in the correct quadrant, rather than making use of the second-order derivative test.
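The quadrant issue can be illustrated with a short sketch. It assumes S has the one-variable form S(θ) = a − b cos θ − c sin θ, where the coefficient names a, b, and c are chosen here for illustration, and compares the two ways of selecting the minimum: the arctangent followed by the second-derivative test, and a single call to `atan2`. The function names are hypothetical.

```python
import math

def s_value(theta, a, b, c):
    """Sum of squared distances as a function of the rotation angle."""
    return a - b * math.cos(theta) - c * math.sin(theta)

def alpha_by_atan2(b, c):
    """Minimizing angle directly: cos(alpha) ~ b and sin(alpha) ~ c fix the quadrant."""
    return math.atan2(c, b)

def alpha_by_second_derivative(b, c):
    """Same angle via arctangent plus the second-derivative test (assumes b != 0)."""
    alpha = math.atan(c / b)               # one of two roots that are pi apart
    second = b * math.cos(alpha) + c * math.sin(alpha)   # d2S/dtheta2 at alpha
    if second < 0.0:                       # this root is the maximum, take the other
        alpha += math.pi
    return alpha
```

Both routes give the same rotation up to a multiple of 2π, and the minimum value is a − sqrt(b² + c²). Python's `math.atan2` follows the same argument order and quadrant convention as the C function mentioned in the text.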
To test our algorithm, we selected a set of 2752 loops from 366 X-ray crystallographic structures in the PDB. The selected loops belong to structures that have been solved to a resolution better than 1.6 Å and have mutual sequence identity <20%. The list of structures was obtained from the PISCES server (formerly the CulledPDB server, now available at http://www.fccc.edu/research/labs/dunbrack/pisces). The chosen loops were identified as having coil structure by the Stride program (Frishman and Argos 1995). None of the loops in the test set are adjacent to disordered residues as determined by our validation program S2C (http://www.fccc.edu/research/labs/dunbrack/s2c).
Ramachandran probability maps
We derived Ramachandran probabilities from data used to build our backbone-dependent rotamer libraries (May 2002 release; Dunbrack Jr. and Cohen 1997). Pairs of ϕ,ψ dihedral angles were weighted with a Gaussian function, and counts were produced in 10° bins in ϕ and ψ over the entire Ramachandran map.
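One way such a map could be built is sketched below. This is an interpretation under stated assumptions, not the procedure actually used: the text does not give the width of the Gaussian, so `sigma=10.0` degrees is a placeholder, each observation is assumed to spread its weight over the bin centers with periodic wrap-around, and the function names are hypothetical. Only the 10° bin size comes from the text.

```python
import math

BIN = 10                    # bin width in degrees, as stated in the text
NBINS = 360 // BIN          # 36 bins along each axis

def angular_diff(a, b):
    """Smallest signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def ramachandran_map(phi_psi_pairs, sigma=10.0):
    """Gaussian-weighted counts on a 36 x 36 grid, normalized to sum to 1."""
    centers = [-180.0 + BIN * (k + 0.5) for k in range(NBINS)]
    grid = [[0.0] * NBINS for _ in range(NBINS)]
    for phi, psi in phi_psi_pairs:
        w_phi = [math.exp(-0.5 * (angular_diff(c, phi) / sigma) ** 2) for c in centers]
        w_psi = [math.exp(-0.5 * (angular_diff(c, psi) / sigma) ** 2) for c in centers]
        for i in range(NBINS):
            for j in range(NBINS):
                grid[i][j] += w_phi[i] * w_psi[j]
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

def lookup(grid, phi, psi):
    """Probability of the bin containing (phi, psi), angles in degrees."""
    i = int((phi + 180.0) % 360.0 // BIN)
    j = int((psi + 180.0) % 360.0 // BIN)
    return grid[i][j]
```

In practice one map per amino acid type would be built from the corresponding subset of observations.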
We have implemented CCD in object-oriented C++ under RedHat Linux 7.3. All calculations were performed on an AMD1800+ MP processor.
Results
For each loop in our test set, we generated 100 random loops by drawing values for the ϕ and ψ dihedral angles randomly from PDB structures used to build our backbone-dependent rotamer library (Dunbrack Jr. and Cohen 1997). We used random ϕ,ψ from the PDB rather than random values from the interval 0°–360° to produce more protein-like starting conformations. Loops were built starting from the N-anchor residue Cartesian coordinates from the structure as described in Materials and Methods. The last residue of the random loop structure was built with the crystallographic bond lengths and bond angles, so that the RMS of the superposition of the moving and fixed C-anchors would not depend on differences in these parameters. We consider the loop closed when the RMS of the N, Cα, and C atoms of the moving and fixed C-anchor residues is <0.08 Å. The maximum number of CCD iterative cycles was limited to 5000.
The first implementation of CCD entailed using equation 10 to calculate the change in each dihedral along the chain and accepting the proposed move with probability 1. This algorithm is denoted “CCD No Constraint” in Table 1. Of the 275,200 loops closed (100 random conformations of each of 2752 loops in the test set), 99.79% of them closed to within 0.08 Å RMS in fewer than 5000 steps, where each step consists of a single cycle through all dihedral angles of the loop. Most loops closed in fewer than 200 steps (see below).
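The unconstrained cycle can be sketched on a toy chain. This is a simplified illustration under several assumptions, not the authors' C++ code: the chain is a list of 3-tuples, every bond is treated as a rotatable axis (a real backbone keeps ω fixed and rotates only about ϕ and ψ), the last three points play the role of the moving C-anchor atoms, and the name `ccd_close` is hypothetical. The default cutoff of 0.08 Å and the limit of 5000 cycles are taken from the text.

```python
import math

def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def add(a, b): return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
def scale(a, k): return (a[0] * k, a[1] * k, a[2] * k)
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rms(moving, fixed):
    total = sum(dot(sub(m, f), sub(m, f)) for m, f in zip(moving, fixed))
    return math.sqrt(total / len(moving))

def radial(p, start, u):
    """Footpoint of p on the axis (start, unit direction u) and the vector from it to p."""
    o = add(start, scale(u, dot(sub(p, start), u)))
    return o, sub(p, o)

def ccd_close(chain, target, cutoff=0.08, max_cycles=5000):
    """Return (chain, cycles used); one cycle visits every bond from the first to the last."""
    chain = list(chain)
    n = len(chain)
    for cycle in range(max_cycles):
        if rms(chain[-3:], target) < cutoff:
            return chain, cycle
        for k in range(n - 2):                  # bond k -> k+1 moves points k+2 onward
            start = chain[k]
            axis = sub(chain[k + 1], start)
            u = scale(axis, 1.0 / math.sqrt(dot(axis, axis)))
            b = c = 0.0
            for m, f in zip(chain[-3:], target):
                o, r = radial(m, start, u)      # atoms on the axis give r = 0
                b += 2.0 * dot(sub(f, o), r)
                c += 2.0 * dot(sub(f, o), cross(u, r))
            alpha = math.atan2(c, b)            # rotation that minimizes S for this bond
            cos_a, sin_a = math.cos(alpha), math.sin(alpha)
            for j in range(k + 2, n):
                o, r = radial(chain[j], start, u)
                chain[j] = add(o, add(scale(r, cos_a), scale(cross(u, r), sin_a)))
    return chain, max_cycles
```

Because each single-bond update is the exact minimum of S over that one angle, the RMS to the target never increases from one update to the next. The caller should still check the final RMS, since the function also returns when the cycle limit is reached without closure.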
To examine the effect of adding a constraint to CCD, we used Ramachandran probability maps to determine whether moves proposed by equation 10 would be accepted in each step. The algorithm proposed a change in ϕ and, based on the new ϕ, proposed a new value for ψ. The new ϕ,ψ position was accepted with probability 1.0 if the probability of the new ϕ,ψ was higher than the current value. It was accepted with probability pnew/pold if the new probability was lower than the current value. The results over the same set of 275,200 random conformations are also shown in Table 1 (labeled “CCD Ramachandran Map”), and demonstrate that the Ramachandran constraint has essentially no effect on the loop closure rate.
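The acceptance rule described above is a Metropolis-style test and can be sketched as follows. The function name `accept_move` and the injectable random number generator are assumptions made for illustration; `p_new` and `p_old` stand for the Ramachandran map probabilities of the proposed and current ϕ,ψ pair.

```python
import random

def accept_move(p_new, p_old, rng=random):
    """Accept with probability 1 if p_new >= p_old, otherwise with probability p_new / p_old."""
    if p_new >= p_old:
        return True
    # Here p_old > p_new >= 0, so the ratio is well defined and lies in [0, 1).
    return rng.random() < p_new / p_old
```

Seeding `rng` differently in each run gives different accept and reject decisions, and therefore different closed structures from the same starting conformation.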
We investigated situations in which the generated loops did not close within 5000 steps. For each unclosed loop, the RMS of the N, Cα, and C atoms of the C-terminal residue in the simulated loop and the C-anchor was calculated. The distributions of the number of loops as a function of the RMS are shown in Figure 2. The figure shows that the great majority of the loops that did not close within the imposed error margin (0.08 Å) are within 0.1 Å RMS of the target position, in both versions of the algorithm. There is also no significant difference between the Ramachandran probability version of the algorithm and the unconstrained one with respect to the RMS distributions.
The lowest closure rates were for very short loops of 4 or 5 amino acids. We examined loops that failed to close, and in every case these were extended conformations. CCD converged in a local minimum for every dihedral of the loop. A simple Monte Carlo step that proposes moves that increase the RMS can be added in these situations to move the algorithm out of the local minimum. It should also be noted that for these extended loops, most trials actually closed, and only a small number failed to converge. Histograms for the number of steps required to close the loops in the input data set are shown in Figure 3. The large majority of trials close within 100–500 steps. For all loop lengths, the number of steps required with the Ramachandran map constraint is higher than without the constraint, as expected.
CCD has been designed to be used with any loop prediction algorithm that generates reasonable trial structures, coupled with an energy function to identify the best loops. As such, it is not a prediction method on its own, nor is it a sampling algorithm per se but, rather, a component of one. Nevertheless, we were interested to determine how many random loop structures would need to be built and closed with the generation procedure described in Materials and Methods to obtain a reasonable RMS to the real structure. For this purpose, we chose 10 loops at lengths 4, 8, and 12, for a total of 30 loops. For each of the loops we generated 5000 random loops and closed them, using the CCD Ramachandran Map algorithm. In Table 2, we show the minimum RMS achieved for these 30 loops. The average for loops of 4, 8, and 12 amino acids is 0.56, 1.59, and 3.04 Å, respectively. Examples of these minimum RMS conformations are shown in Figure 4. Although the RMS is low in Figure 4C for the 12-amino-acid loop, other samples might better reflect the positions of backbone carbonyls and NH groups relative to the whole protein structure.
We investigated the CCD Ramachandran map constraint method to determine whether loops closed from the same starting conformation would converge to the same structure, cluster into groups, or be distributed randomly. The sequence of random numbers used to determine whether moves are accepted or not was seeded differently in each run, resulting in different closed structures from the same initial structure. We used a loop from PDB entry 1egu (Li et al. 2000), residues 508–519 of length 12, closed it 500 times from the same conformation, and calculated the RMS between each pair in this set. It is useful to compare the distributions of RMS values among this set, with the RMS values among a set of 500 closures of the same loop, but starting from different initial conformations for each trial. The comparison is shown in Figure 5. Loops starting from the same conformation do cluster, as demonstrated by the peak at RMS near 0 Å. Loops starting from different conformations do not converge to the same structure. The RMS values approximately follow a gamma distribution.
We compared CCD to the random tweak method (Shenkin et al. 1987) used in some loop prediction algorithms (Xiang et al. 2002). In our version of random tweak, we set constraints of 0 Å on the distances between the moving C-terminal anchor N, Cα, and C atoms and the corresponding atoms of the fixed C-anchor. Note that this is different from the original description, which fixes distances between the N- and C-terminal Cα atoms. We ran the same test set shown in Table 2 for 8-residue loops, consisting of 500 initial configurations of 10 loops, for a total of 5000 trials for both tweak and CCD No Constraint. We used the same final RMS criterion of 0.08 Å for a closed loop and a maximum of 5000 rounds of each algorithm. CCD was able to close all 5000 of these loops in ∼7 min, whereas random tweak closed 4841 loops in 40 min, failing on 159 loops.
Finally, we investigated the computation time needed by the CCD algorithm itself, excluding the time needed to read the initial conformation and the full Ramachandran probability maps for the 20 amino acids from disk for each closure. We used the same sets of 10 loops of lengths 4, 8, and 12 listed in Table 2, but this time generated 500 random conformations within the program, thus reading in only a single initial conformation and reading in the probability maps only once. We closed these conformations with two different RMS cutoffs. With an error margin of 0.08 Å, the average computing time was 0.031, 0.037, and 0.023 sec per loop for loops of lengths 4, 8, and 12, respectively. With a looser cutoff of 0.16 Å, the times were lower; for instance, for 8-amino-acid loops, the average computing time was 0.026 sec per loop. It is interesting to note that CCD takes less time for the 12-residue loops than for the shorter ones, because longer loops have more degrees of freedom and more solutions to the loop closure problem.
Discussion
Loop closure is a component of a number of protein structure prediction problems, including various approaches to homology modeling as well as Monte Carlo simulations of protein folding. Because most structure prediction methods proceed by producing large numbers of trial configurations, a fast loop closure method is of significant importance. Although several algorithms have been presented previously, each suffers from a number of drawbacks. These include lack of convergence, numerical instability, and cumbersome implementations. We have described a very simple method for loop closure borrowed from robotics that is easy to understand and implement.
As mentioned in the Results section, CCD is not meant to be a sampling strategy on its own but, rather, is to be used with any method that generates unclosed trial conformations. In this paper, we have generated trial conformations by drawing random values for ϕ,ψ from X-ray data from the PDB. These data were taken from loop residues and were specific for each amino acid type. The procedure as described here does not have any steric bump checks or internal energy evaluations of any kind, other than the Ramachandran probabilities. Nevertheless, this procedure was able to generate loops with average minimum RMS values of 0.56, 1.59, and 3.04 Å from the native structure for loops of 4, 8, and 12 residues, respectively. This compares favorably with some other sampling strategies, such as that of Tosatto et al. (2002), who used a divide-and-conquer method to generate average minimum RMS values of 1.0, 2.22, and 3.5 Å for loops of lengths 4, 8, and 12. Sudarsanam et al. (1995) used a database of ψi,ϕi+1 pairs to model loops and achieved lowest RMSD values of 1.2–1.3 Å in 10,000 simulations for an 8-amino-acid loop. However, these loops were not closed. By including a CHARMM-based energy function during loop construction and simulation, Fiser et al. (2000) were able to achieve lowest RMSDs (regardless of energy) of 0.70, 0.93, and 1.93 Å for three 8-amino-acid loops. It remains to be seen whether, in designing a loop prediction strategy incorporating CCD, it is better to include energy evaluations during the CCD closure procedure or to build large numbers of samples and evaluate their energy in the context of the protein afterward. CCD can be easily modified to include other constraints such as avoidance of collisions. Techniques from robot motion planning would probably be most suitable for this purpose (Singh et al. 1999).
CCD's advantages compared with Jacobian-based methods are its simplicity, ease of implementation, speed, and lack of singularities (Welman 1993). One disadvantage in our present implementation is that the algorithm favors large changes in the first residues of the loop. If the loop can be nearly closed with manipulations of the first few residues, then the other residues will barely move at all. This probably occurs fairly rarely. In any case, to preserve similarity to the initial configuration and to even out the changes in the dihedrals across the whole loop, one can impose limits on the change in dihedral angles at each step. Because we have an expression for the distance (and its derivative) of the moving C-anchor to the fixed C-anchor, we can choose to make small moves toward an RMS of 0, rather than propose the full CCD step to the minimum value of S in equation 11. Another disadvantage is that CCD may occasionally get stuck in a local minimum, when solving equation 11 for each dihedral results in no change in configuration. The method can be modified to check for this and to add a step that will move the moving C-anchor residue away from the target.
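Limiting the change in a dihedral at each step can be sketched as a simple clamp on the proposed rotation. The function name `capped_step` and the default limit of 10° are assumptions made for illustration; the text does not specify a value.

```python
import math

def capped_step(alpha, max_step=math.radians(10.0)):
    """Clamp the proposed rotation alpha (radians, in (-pi, pi]) to +/- max_step."""
    return max(-max_step, min(max_step, alpha))
```

For a one-variable S of the form a − R cos(θ − α), S decreases monotonically as θ moves from 0 toward α when |α| ≤ π, so a clamped step still reduces S, only by less than the full step. This spreads the changes over more residues at the cost of more cycles.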
CCD is a good example of the crossover of algorithms from one field to another that is a hallmark of bioinformatics and computational biology. The inverse kinematics problem is a staple of robotics, and the cyclic coordinate descent algorithm described here is one of several methods that are likely to be borrowed from this field by structural biology. Other recent examples include the use of robot motion planning and probabilistic roadmaps in protein folding (Singh et al. 1999; Apaydin et al. 2002; Brutlag et al. 2002), analytical inverse kinematics approaches applied to protein loop closures of six dihedral degrees of freedom or fewer (Manocha and Zhu 1994; Manocha et al. 1995), and a randomized kinematics search for loop closure in the drug design problem (Lavalle et al. 2000). It is likely that there will be more interactions between these disciplines in the future.