The number of protein three-dimensional structures is rapidly increasing and has resulted in a situation in which approximate conformations for a large percentage of existing proteins are probably represented in the Protein Data Bank (PDB; Berman et al. 2000). To be of value, models of proteins produced from other known structures should have an accuracy approaching that of a protein structure determined by x-ray crystallography. Present comparative modeling techniques often offer little advantage over copying the core regions from homologs (Martin et al. 1997). The core regions within homologous protein families, mainly comprising elements of secondary structure, are nearly superimposable, but there are often significant variations in nonregular regions, more particularly in peripheral loops (Chothia and Lesk 1986). Predicting the conformations of these structurally variable regions (SVRs) in the environment of the complete protein is one of the most difficult challenges for comparative modeling (Johnson et al. 1994; Moult et al. 1997). These SVRs relate not only to loop regions between two defined pieces of secondary structure, but also to fragments involving secondary structure. The prediction of such conformers is also an important objective in protein design and engineering (Chothia and Lesk 1987; Blundell et al. 1988).
CODA is a methodology for improving the accuracy of prediction of the SVRs of comparative models so that it can approach that of an experimentally determined protein structure. There have been two main types of methods for the prediction of the SVRs of proteins: those that adopt a knowledge-based approach (Jones and Thirup 1986; Blundell et al. 1988; Topham et al. 1993; Rufino et al. 1997; Bates and Sternberg 1999) and those that use an ab initio or conformation searching method (Fine et al. 1986; Bruccoleri and Karplus 1987; Higo et al. 1992; Mas et al. 1992; Collura et al. 1993; Abagyan and Totrov 1994; Abagyan et al. 1994; Fidelis et al. 1994; Zheng and Kyle 1996; Zhang et al. 1997). CODA clusters the results from FREAD, a knowledge-based approach, and PETRA (Deane and Blundell 2000), an ab initio method.
The knowledge-based approach (use of protein fragments) was first used for modeling SVRs as part of interactive rather than automatic procedures. In a study of serine proteinases, Greer (1980) showed that SVRs were best selected from the structures of homologous proteins. Chothia and Lesk (1987) analyzed the sequences and structures of the variable regions from a large number of antibodies and showed that the correct conformer could often be identified by “key residues” found in the loop itself or in spatially close regions that played an important structural role; this approach has been used successfully in antibody engineering. The first attempt to derive rules for conformers outside homologous families was by Sibanda and Thornton (1985), who showed that short loops connecting elements of secondary structure, in this case β-hairpins, often belong to well-defined classes with recognisable sequence patterns. Such patterns can be seen in target sequences and used to suggest possible conformers in modeling (Sibanda et al. 1989). However, these approaches are not general, being confined either to families in which large numbers of homologs have known structures or to cases in which short loops between elements of secondary structure are to be modeled.
A more general method was developed by Jones and Thirup (1986), initially for electron density fitting by selecting appropriate fragments from structures in the PDB. A similar approach was applied in comparative modeling by selecting fragments from the PDB that overlap the framework at both ends (Sutcliffe et al. 1987; Blundell et al. 1988). Methods of this type have the advantage of guaranteeing rapid results that have physically reasonable main chain conformations. However, Fidelis et al. (1994) concluded that the use of fragments from the PDB was useful only for sections up to four residues in length as the completeness of the database was found to degrade rapidly with increasing length.
Recently, van Vlijmen and Karplus (1997) have described a method for the selection of loops from a fragment databank, followed by the optimization and ranking of the possible fragments using the CHARMM nonbonded energy function (Brooks et al. 1983). This has the advantage over direct searches of conformational space in that the number of fragments required to be minimized with the CHARMM function increases slowly with loop length rather than exponentially. Their method for loop selection and optimization predicted eight out of 18 loops of up to nine residues in length to a Cartesian root-mean-square deviation (rmsd) better than 1.07 Å relative to the crystal structure. This is not the only way to utilise the information from the PDB. Sudarsanam et al. (1995) describe an SVR modeling procedure that uses a database of all ϕ(i +1), ψ(i) dimers from a nonredundant version of the PDB—approximately 50,000. These dimers were put into 400 pools representing all possible dimers that can result from the 20 naturally occurring amino acids. Unknown fragments are predicted by matching each dimer in the sequence with the corresponding pool in the database and generating all possible conformations based on the real ϕ(i +1), ψ(i) pairs in those pools. The resulting fragments are then filtered using the geometric restraints of the flanking residues.
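The dimer pooling described for the Sudarsanam et al. (1995) procedure can be pictured with a short sketch. It only enumerates the 400 pool labels and splits a sequence into overlapping dimers; the function names are hypothetical, and the storage and recombination of the ϕ(i +1), ψ(i) values themselves are not shown.

```python
from itertools import product

# The 20 naturally occurring amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dimer_pool_keys():
    """Return one label per ordered amino acid pair (20 x 20 = 400 pools)."""
    return ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def sequence_dimers(seq):
    """Split a fragment sequence into its overlapping dimers, in order."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

pools = dimer_pool_keys()
dimers = sequence_dimers("GSAK")  # each dimer names the pool to be searched
```

Each dimer of the unknown fragment would then be looked up in the pool with the same label.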
The second type of methodology proposed by loop modelers is ab initio construction. In this case, the conformers can be generated by computer methods, either on the fly or separately from modeling. Go and Scheraga (1970) were the first to develop a procedure for predicting the conformation of a fragment joining two polypeptides of known structure. Moult and James (1986) later considered the feasibility of a systematic search for polypeptide segments within a protein. Pedersen and Moult (1995) describe an ab initio method for predicting short fragments of proteins using genetic algorithms. Refinement of closed structures of fragments has been achieved using Monte Carlo simulated annealing (Collura et al. 1993; Zhang et al. 1997) or generalized Born models of solvation effects (Rapp and Friesner 1999).
A recent example of computer conformer generation, PETRA (Deane and Blundell 2000), is a fast ab initio method for the prediction of local conformations in proteins. PETRA selects polypeptide fragments from a computer-generated database (APD) encoding all possible peptide fragments up to 12 amino acids long. Each fragment is defined by a representative set of eight ϕ/ψ pairs, obtained iteratively from a trial set by calculating how well fragments generated from them represent the PDB. Ninety-six percent of length five fragments in crystal structures with a resolution better than 1.5 Å and less than 25% pairwise identity have a conformer in this database with less than 1 Å rmsd. To select segments from APD, PETRA uses a set of simple rule-based filters, thus reducing the number of potential conformations to a manageable total. This reduced set is scored and sorted using rmsd fit to the anchor regions and a knowledge-based energy function dependent on the sequence to be modelled. The average rmsd ranges from 1.4 Å for three residue loops to 3.9 Å for eight residue loops.
An example of a combined algorithm is that of Martin et al. (1989), in which antibody hypervariable loops were predicted using a database search followed by reconstruction of sections of the predicted loops ab initio and addition of side chains using the CONGEN conformational searching algorithm (Bruccoleri and Karplus 1987). The generated structures were evaluated using a subset of the CHARMM potential in vacuo.
Mas et al. (1992) generated a model for an antibody specific for the carcinoembryonic antigen (CEA) using a method that combines the concept of canonical structures with conformational search. They use a conformational search technique that couples random generation of backbone loop conformations to a simulated annealing method for assigning side chain conformations. This technique was used both to verify conformations selected from the set of known canonical structures and to explore conformations available to the H3 loop in CEA ab initio. Analysis of the results of conformational search resulted in three equally probable conformations for H3 loop in CEA.
These methods differ entirely from the combined algorithm proposed here, in which ab initio and knowledge-based procedures are run separately and the independent results combined for the prediction.
In the present study we predict the conformations of polypeptide fragments using two algorithms: FREAD (a knowledge-based method) and PETRA (Deane and Blundell 2000) (an ab initio method). We describe the development of the FREAD algorithm and the environmentally constrained substitution tables, which were developed as a selection filter. FREAD also selects fragments on the basis of anchor rmsd and a knowledge-based energy function. FREAD was tested on more than 3000 fragments to ensure accurate and objective parameterization. CODA extracts the results from the two programs and then clusters them and selects a fragment from these clusters using several rule-based filters. CODA was parameterized and tested using two independent and nonhomologous sets of proteins: a total of 300 loops.
The procedure was then tested on five model SVRs to examine the utility of the program under comparative modeling conditions. A Web server for the program is available at http://www-cryst.bioc.cam.ac.uk/∼charlotte/Coda/search_coda.html
Results and discussion
Environmentally constrained substitution tables
The Ramachandran areas by which the tables are constrained are shown in Figure 1.
The tables developed are similar to those of Topham et al. (1993) in that they are also constrained by Ramachandran angle. There are, however, two fundamental differences. First, Topham et al. (1993) included α-helix and β-strand as specific categories of structural definition separate from Ramachandran areas, a distinction that is not used here. The tables here measure only whether the structure is similar, not whether it is involved in longer range interactions. Second, Topham et al. (1993) included hydrogen bonding possibilities and solvent accessibility in their tables as unconstrained environmental considerations; these were not considered here as they added sparsity to the data and did not appear to improve prediction.
The four rule-based filters used by FREAD were Dc(F), the difference between anchor Cα separations of the target and the prediction; R(F), the rmsd between the backbone atoms of the anchor region of the target and the prediction; Ek(F), the energy of the predicted fragment in the target structure; and Sc(F), the environmentally constrained substitution score.
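As an illustration of the simplest of these filters, Dc(F) can be sketched as follows. This is a minimal sketch, assuming that each anchor is represented by a single Cα coordinate at either end of the fragment and that the difference is taken as an absolute value; neither detail is stated above, and the function name is hypothetical.

```python
import math

def dc(target_anchor_ca, fragment_anchor_ca):
    """Difference between anchor C-alpha separations (assumed absolute).

    Each argument is a pair of (x, y, z) coordinates: the C-alpha atom of
    the N-terminal anchor and that of the C-terminal anchor.
    """
    target_sep = math.dist(target_anchor_ca[0], target_anchor_ca[1])
    fragment_sep = math.dist(fragment_anchor_ca[0], fragment_anchor_ca[1])
    return abs(target_sep - fragment_sep)

# Target anchors 5 A apart, database fragment anchors 6 A apart.
example = dc(((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)),
             ((0.0, 0.0, 0.0), (0.0, 0.0, 6.0)))
```

Because it needs only two distances and no superposition, a quantity of this kind is cheap enough to apply to every fragment in a database before any fitting is attempted.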
The filters and cut-offs selected for FREAD were developed using a jack-knife process (i.e., by prohibiting each target loop from selecting its real structure), predicting the L20 set of SVRs (Table 1) using the F20 database of protein structures. The development of the L20 set and the F20 database is described in Materials and Methods. The F20 database and L20 SVR set contain virtually no homologs. They were used for parameterization because this corresponds to the true modeling situation more closely than the use of a set containing homologs: SVRs, by definition, are those regions not found in the structurally homologous parents.
The algorithm CONTEST was used to identify optimal cut-offs and sorting functions for FREAD. In an iterative process, it predicted the entire L20 (Table 1) test set using FREAD with the F20 database.
Discrete values for all the filters across large ranges coupled with all choices of sorting function were input. The algorithm then identified those values that gave predictions for over or close to 90% of the L20 set. The discrete steps for the filters across these identified regions were then decreased in size and the process repeated. These steps were performed until no further improvement for the average rmsd of prediction was seen. This process was performed for each of the loop lengths separately.
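The coarse-to-fine search can be illustrated for a single cut-off. The sketch below is not the CONTEST program: it assumes one filter (an anchor rmsd cut-off), a toy data layout in which each target has a list of (anchor rmsd, true rmsd) candidates, and a hypothetical stopping tolerance. It shows only the loop of scanning a grid, keeping values that meet the coverage requirement, and narrowing the grid around the best of them.

```python
def evaluate(targets, cutoff):
    """Return (coverage, mean rmsd) when candidates need anchor rmsd <= cutoff."""
    rmsds = []
    for candidates in targets:
        passed = [c for c in candidates if c[0] <= cutoff]
        if passed:
            rmsds.append(min(passed)[1])  # top-ranked candidate by anchor rmsd
    if not rmsds:
        return 0.0, float("inf")
    return len(rmsds) / len(targets), sum(rmsds) / len(rmsds)

def refine_cutoff(targets, lo, hi, min_coverage=0.9, steps=10,
                  tol=1e-3, max_rounds=20):
    """Shrink the grid around the best cut-off until the mean rmsd stops improving."""
    best_cut, best_rmsd = None, float("inf")
    for _ in range(max_rounds):
        width = (hi - lo) / steps
        grid = [lo + i * width for i in range(steps + 1)]
        ok = []
        for cut in grid:
            coverage, mean_rmsd = evaluate(targets, cut)
            if coverage >= min_coverage:
                ok.append((mean_rmsd, cut))
        if not ok:
            break
        round_rmsd, round_cut = min(ok)
        improvement = best_rmsd - round_rmsd
        if round_rmsd < best_rmsd:
            best_rmsd, best_cut = round_rmsd, round_cut
        if improvement < tol:
            break
        lo, hi = max(lo, round_cut - width), round_cut + width
    return best_cut, best_rmsd

# Ten toy targets, one candidate each: (anchor rmsd, true rmsd).
toy = [[(0.1 * k, 0.2 * k)] for k in range(1, 11)]
best_cut, best_rmsd = refine_cutoff(toy, lo=0.0, hi=2.0)
```

In this toy set the search settles on a cut-off that excludes only the worst target, which is the most stringent value still compatible with 90% coverage.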
The four filters tested using CONTEST were implemented as follows. Dc(F) and Sc(F) were the rapid initial selection procedure. Dc(F) is obviously a prefilter to R(F). The final sorting of the possible predictions for a target SVR was performed by R(F).
Ek(F) was used to filter the list of predictions (the output consists of the top 10 results) at this final stage. If a fragment is to be on the list of predictions, it must have an Ek(F) value below a cut-off. The values of the cut-offs used are given in Table 2.
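The order in which the four filters act can be summarised in a short sketch. The cut-off values, the direction of the Sc(F) threshold (higher taken as better), and the application of the Ek(F) cut-off before truncation to 10 results are assumptions made for illustration; the class and function names are hypothetical and the values in Table 2 are not reproduced.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    dc: float  # Dc(F): difference in anchor C-alpha separation
    sc: float  # Sc(F): environmentally constrained substitution score
    r: float   # R(F): anchor rmsd
    ek: float  # Ek(F): knowledge-based energy

def select_fragments(cands, dc_max, sc_min, ek_max, top_n=10):
    """Prefilter on Dc and Sc, sort on R, then keep fragments with Ek below the cut-off."""
    prefiltered = [c for c in cands if c.dc <= dc_max and c.sc >= sc_min]
    ranked = sorted(prefiltered, key=lambda c: c.r)
    return [c for c in ranked if c.ek < ek_max][:top_n]

cands = [
    Candidate("a", dc=0.2, sc=5.0, r=0.4, ek=0.1),
    Candidate("b", dc=0.1, sc=6.0, r=0.2, ek=3.0),   # rejected by the Ek cut-off
    Candidate("c", dc=2.5, sc=7.0, r=0.1, ek=-0.5),  # rejected by the Dc prefilter
    Candidate("d", dc=0.3, sc=4.0, r=0.3, ek=0.2),
]
picked = [c.name for c in select_fragments(cands, dc_max=1.0, sc_min=0.0, ek_max=1.0)]
```

Fragment "c" has the best anchor rmsd but never reaches the sorting stage, which is the sense in which Dc(F) acts as a prefilter to R(F).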
The selection criteria, for instance, Ek, often had a length-dependent variation with the degree of similarity of conformation that they could predict. Ek cut-offs ranged from 0.5 for three residues to 2.0 for eight residues (Table 3).
Using more stringent cut-offs can significantly increase the accuracy of FREAD; however, the price of this is a drop in coverage. Figure 2 shows the effect on coverage and accuracy of the rmsd cut-off on length seven loops. During identification of parameters for FREAD, only those conditions at which over 90% of the set were predicted were selected. More trust could be placed in FREAD answers that had passed more stringent cut-offs, and this information could be used in the development of CODA (Table 3).
FREAD: Prediction quality
The average rmsds calculated across all the backbone atoms (Cα, C, O, and N) using the final rules and filters are shown in Table 4. The rmsd reported here is not the internal rmsd (calculated after superposition of the polypeptide fragment alone), which is used in many papers in the literature. Instead, we report the actual rmsd of the predicted fragment as it is built into the protein in the anchor frame of reference, a far more demanding criterion. This is because it is possible for a fragment to be incorrectly orientated with respect to the target protein and still yield a good ‘rmsd’, according to the standards of bare loop superposition.
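The difference between the two measures can be seen in a toy calculation. The sketch below assumes that both fragments are already expressed in the anchor frame of reference, so that the reported rmsd needs no refitting. For the internal measure it removes only the centroid offset, a simplified stand-in for a full superposition, which would also optimise rotation.

```python
import math

def rmsd(a, b):
    """rmsd between two equal-length coordinate lists, with no refitting."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))

def centred(coords):
    """Translate coordinates so that their centroid lies at the origin."""
    n = len(coords)
    centre = [sum(atom[i] for atom in coords) / n for i in range(3)]
    return [tuple(atom[i] - centre[i] for i in range(3)) for atom in coords]

target = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
# Same conformation, but built 2 A away from where it belongs in the protein.
misplaced = [(x + 2.0, y, z) for x, y, z in target]

anchor_frame_rmsd = rmsd(target, misplaced)
refitted_rmsd = rmsd(centred(target), centred(misplaced))
```

The misplaced fragment scores 2 Å in the anchor frame but essentially zero once it is refitted on its own atoms, which is why the anchor-frame value is the more demanding criterion.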
A further jack-knife test of FREAD was performed using the L90 SVR set and the F90 database of fragments. This database and SVR set will contain many homologs. The set of filters and cut-offs developed by CONTEST on the L20 SVR set were used for prediction. This was performed to assess the effect of the addition of homologs (Martin et al. 1997) and to allow comparison to database loop prediction methods such as Oliva et al. (1997), Rufino et al. (1997), van Vlijmen and Karplus (1997), Li et al. (1999), and Wojcik et al. (1999), whose databases have either no percentage identity cut-off or one at around 90%. The level of accuracy of FREAD is higher for the L90 test set than for the L20 test set as shown in Table 4. This would be expected as the F90 database used for prediction contains over two times as many proteins, which can also be homologous to the proteins from which the L90 fragment for prediction is taken.
Table 5 shows the average rmsd of the best loop taken from the Top ten (one to 20) results, and it shows that the filters and cut-offs used in FREAD are not able to pinpoint the best answer. When the top 10 answers are considered, the program performs significantly better, but there is little increase in performance if more answers are used. The Top ten results for PETRA (Table 4; Deane and Blundell 2000) also show how the selection procedure does not place the best loop in position number one every time. Thus, the use of CODA to combine the two methods of prediction should improve the results.
The different filters used in FREAD all proved to be of some value in selection. For example, the environmentally constrained substitution tables were found to offer significant discriminating power, particularly in the selection of homologous or other fragments that were very similar to the target (Table 3). This contrasts with the findings of van Vlijmen and Karplus (1997) and Deane and Blundell (2000) that propensity tables such as BLOSUM (Henikoff and Henikoff 1992) were very weak or ineffective positive discriminators in the loop modeling problem. The power of the environmentally constrained substitution tables probably arises because they provide quantitative information about the existence of an amino acid in a structural environment and the possibility of its replacement by any other amino acid while retaining that structure. However, in the case of fragments less similar to the target, environmentally constrained substitution tables were found to act in a similar way to propensity tables in that they only discriminated against impossible conformations for certain residues.
The percentage identity of the prediction sequences to the target was calculated for the L20 set to see if, as expected (van Vlijmen and Karplus 1997; Deane and Blundell 2000), there was little or no correlation with percentage identity. At all lengths, on average, less than one residue is conserved in the prediction fragments. In over 50% of cases, the percentage identity between prediction and target is zero. Table 3 shows the average rmsd values for these. The results are close to the general results achieved for the L20 set (Table 4).
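For clarity, the percentage identity referred to here can be computed as in the sketch below. It assumes an ungapped, position-by-position comparison of two fragments of equal length; the sequences shown are invented for illustration.

```python
def percent_identity(target_seq, fragment_seq):
    """Percentage of positions with the same residue in two equal-length fragments."""
    if len(target_seq) != len(fragment_seq):
        raise ValueError("fragments must have the same length")
    matches = sum(1 for a, b in zip(target_seq, fragment_seq) if a == b)
    return 100.0 * matches / len(target_seq)

identity = percent_identity("GSAKT", "GTAKS")  # three of five positions agree
```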
Ek, the knowledge-based potential that has been used previously as a selection filter in Deane and Blundell (2000), was found again to be a weak positive indicator. This is likely to be a consequence of both the lack of side chain information and the relatively high-energy values gained even if the template loop orientation is close to the target orientation. A few slightly displaced atoms can increase the energy significantly and compromise the prediction. However, it could be used for identification of a very good answer in a small number of cases in which a significant negative value was calculated, but this number was too small to force a negative cut-off for Ek(F) to be used; instead, this information could be passed on to CODA to gain greater confidence in these predictions (Table 3).
The most useful filter in FREAD was found to be R(F), the rmsd between the anchor residues of the prediction and the target, as has been described previously (Lessel and Schomburg 1999). Different loop modeling studies based on the PDB have used various numbers and schemes of anchor residues at each end of the loop to select the loops: two before and one after (Fidelis et al. 1994); four and four (Tramontano and Lesk 1992); three and three (Topham et al. 1993). In this study, we used two anchor residues on either side, from Cα to Cα. Different anchor lengths from one to five were investigated, but there appeared to be little advantage in extending the length superimposed beyond two, because fewer targets gain a prediction. With only one residue in the anchor region, the orientation of the loop was less well predicted (results not shown).
As the PDB increases in size the utility of FREAD should also increase, particularly if the total number of loop conformations in all proteins is smaller than the combinatorial number. Such a limited set of conformations (families) has so far only been proposed for entire proteins (Chothia 1992). The clustering of loops and development of loop family databases (Donate et al. 1996; Oliva et al. 1997; Li et al. 1999) where some families are clusters from nonhomologous protein structures may indicate such a distribution. However, these databases also show a very large number of single member families.
CODA: Parameterization and test sets
A list of 1179 independent protein chains was extracted from the PDB (March 2000 version), comprising structures that were solved by x-ray crystallography to a resolution better than 2.5 Å and that shared less than 20% pairwise identity (Hobohm et al. 1992). The first stage was to remove all those that are contained within the F90 database, leaving 311 new structures. The next stage was to remove homologs to the F90 list from these. This was achieved by running each of these 311 sequences against a sequence database of the F90 list using PSI-BLAST (Altschul et al. 1997). Homologs were identified as those with an E value of 0.001 or less. The remaining list contained 156 independent protein chains. The loop fragments within these chains were identified using the SLoop method (Donate et al. 1996), and 50 loops of each length from 3 to 8 were selected and split randomly into two sets, Lparam and Ltest, to be used for CODA parameterization and testing, respectively. Thus, two sets of loops, significantly different and independent from one another and largely nonhomologous to the F90 database, were generated for the testing and parameterization of CODA.
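The last two steps of this construction can be sketched as follows. The sketch assumes that the best PSI-BLAST E value against the F90 list has already been tabulated for each chain (None where there was no hit), and that the random split is made in equal halves within each length; the text does not state how the split was balanced, and all names and identifiers are hypothetical.

```python
import random

def remove_homologs(best_evalue, threshold=0.001):
    """Keep chains whose best E value against F90 is above the threshold."""
    return [chain for chain, e in best_evalue.items()
            if e is None or e > threshold]

def split_loops(loops_by_length, seed=0):
    """Randomly split the loops of each length into two equal halves."""
    rng = random.Random(seed)
    lparam, ltest = [], []
    for length in sorted(loops_by_length):
        loops = list(loops_by_length[length])
        rng.shuffle(loops)
        half = len(loops) // 2
        lparam += loops[:half]
        ltest += loops[half:]
    return lparam, ltest

kept = remove_homologs({"1abc": 1e-30, "2xyz": 0.5, "3pqr": None, "4lmn": 0.001})
loops = {n: [f"loop{n}_{i}" for i in range(50)] for n in range(3, 9)}
lparam, ltest = split_loops(loops)
```

With 50 loops at each of six lengths, this yields the 300 loops in total, divided into two sets with no loop in common.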
CODA: Consensus method
In the CODA algorithm, all possible pairs of predictions are generated across the outputs of PETRA and FREAD (in other words, each pair contains a PETRA and a FREAD prediction), and for each pair the values of all the filters are calculated.
The filters are Dϕ,ψ (the difference between backbone torsion angles ϕ and ψ of the fragments in a pair), sum of Ek (the energy of the predicted fragment in the target structure) for the two fragments in the pair, Ed (the rmsd difference between the pair of fragments as superposed by their anchors on the target), Hp (the sum of their positions in the prediction lists from PETRA and FREAD), and the original selection filters from both programs.
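The pairwise quantities can be sketched as below. This is an illustration under stated assumptions, not the CODA code: Dϕ,ψ is taken as the mean absolute torsion difference with wrap-around at ±180°, Ed as an rmsd with no refitting (both fragments are assumed to be already superposed on the target anchors), and each prediction is a hypothetical dictionary holding an energy, coordinates, and torsion angles.

```python
import math
from itertools import product

def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def d_phipsi(tors_a, tors_b):
    """Mean absolute phi/psi difference (one plausible definition)."""
    diffs = [angle_diff(x, y)
             for pa, pb in zip(tors_a, tors_b) for x, y in zip(pa, pb)]
    return sum(diffs) / len(diffs)

def rmsd(a, b):
    """rmsd between two coordinate lists in the same frame, with no refitting."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))

def pair_filters(fread, petra):
    """Build every FREAD/PETRA pair; inputs are lists ordered by rank."""
    pairs = []
    for (i, f), (j, p) in product(enumerate(fread, 1), enumerate(petra, 1)):
        pairs.append({
            "fread_rank": i,
            "petra_rank": j,
            "Hp": i + j,
            "Ek_sum": f["ek"] + p["ek"],
            "Ed": rmsd(f["xyz"], p["xyz"]),
            "Dphipsi": d_phipsi(f["torsions"], p["torsions"]),
        })
    return pairs

fread = [
    {"ek": -0.2, "xyz": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "torsions": [(-60.0, -45.0)]},
    {"ek": 0.5, "xyz": [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)], "torsions": [(-120.0, 130.0)]},
]
petra = [
    {"ek": 0.3, "xyz": [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)], "torsions": [(-60.0, -35.0)]},
]
pairs = pair_filters(fread, petra)
```

With 10 predictions from each program, the same loop would produce 100 pairs, each carrying its own set of filter values.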
This led to a large choice of possible filter values and sorting functions for the pairs. The program FINDIT was written to investigate optimal cut-offs and sorting functions for CODA.
FINDIT operates in an analogous way to CONTEST, described above, in that it calculates using an iterative process the filters and sorting functions at each length that give the lowest average rmsd on the Lparam set using the CODA procedure and achieving over 85% coverage. These results are given in Table 2.
There was also one final step after a pair has been selected: to decide whether the FREAD or PETRA answer from the pair should be taken as the single prediction from CODA. If the cluster is tight, very little variation is shown by selection of either the FREAD or PETRA answer. However, when FINDIT attempted to find stable selection criteria based on all these filters, a marked preference for the FREAD or PETRA answer at each particular length was shown. To develop stable selection criteria, FINDIT was given various discrete values for the different filtering options set against different sorting functions. Not all changes in the parameters for the filters affect the overall ability for prediction. The coverage is also affected by different filtering choices as was found in the FREAD analysis described above.
As with FREAD, a minimal coverage level was set for parameterization, this time 85%. Parameters for CODA were developed on the Lparam set, and when the final rules and filters were developed, these were used to predict the Ltest set (Tables 6 and 7). The results for the Ltest set are very similar to those achieved for the Lparam set, showing that the algorithm has not been overoptimized.
Dϕψ and Ed were generally the most useful selection criteria between the pairs in CODA, Ed indicating the closeness of two SVRs in the frame of reference of the anchor regions and Dϕψ more a measure of the similarity between the pair in isolation. The sum of Ek was found to be a very weak indicator for selection except for very specific cases in which it removed false positives. Of the original selection filters from FREAD and PETRA, Sc was found to be by far the most useful. A high Sc score coupled with a negative Ek value was used as a strong positive indicator that the selection of that prediction would be better than invoking the general CODA procedure, because a prediction that has a very high Sc score in FREAD is generally far closer to the true structure than a consensus selection using CODA (Table 8). This follows from the conclusion drawn during FREAD parameterization about the ability of Sc to identify correctly very close predictions in some cases. The removal of incorrect predictions using Sc was no longer pertinent as this had already been achieved in FREAD.
Tables 6 and 7 show that CODA has improved over the results for the individual programs and that the standard deviation of the results has also dropped, indicating a more consistent predictive ability than that shown by PETRA or FREAD. The CODA prediction, along with the top FREAD and PETRA predictions, is shown superposed on the real structure in Figure 3. For the figure, CODA predictions that differ from the top FREAD or PETRA predictions were selected; this is, of course, not always the case.
Examination of Tables 6 and 7 shows that FREAD predicts better for the shorter lengths and PETRA for the longer lengths. Thus, it was not surprising to discover that CODA generally selects a PETRA prediction from the clusters for lengths six to eight and FREAD for the shorter lengths. CODA outperforms either of the individual methods at all lengths. The fact that FREAD gives better predictions than PETRA at short lengths is probably a function of the increased occurrence of more strained ϕ/ψ dihedral angles in shorter loops. PETRA is limited to only eight ϕ/ψ pairs, and none of these are very unusual conformations, as by definition these pairs were selected for maximal coverage of protein space. At longer lengths, PETRA, which is designed to be an exhaustive database, should cover more space than the PDB fragments. Clustering of loops from the PDB has shown that length four has the maximum number of classes of loops, and the number of classes drops dramatically past six (Oliva et al. 1997; Li et al. 1999). Fidelis et al. (1994) concluded that database searches of real PDB fragments are limited to loops of four residues as the completeness of the database degrades rapidly with increasing length. However, van Vlijmen and Karplus (1997) found template loops up to nine residues in length in the PDB with a main chain rmsd of less than 1 Å. From our work here, we find that the PDB fragment database search method is overtaken by the computer-generated fragments at around six residues.
At all lengths other than three, CODA clusters are ordered by Ed. On length three SVRs, after cut-offs imposed by the other selection criteria, CODA orders the clusters on the basis of Hp, the position of the pair members in the FREAD and PETRA prediction lists, rather than on tightness of the cluster. This follows from the results in Table 5, as at length three there is very little improvement with the addition of further answers. So the pressure to select the top answers from either FREAD or PETRA prediction lists is higher.
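The length-dependent ordering can be expressed compactly. The sketch assumes that each surviving pair is a dictionary carrying its Ed and Hp values and that lower values rank first in both cases; the pair identifiers and values are invented, and the cut-offs applied beforehand are not shown.

```python
def order_pairs(pairs, length):
    """Order pairs by Hp for length three SVRs and by Ed otherwise."""
    key = "Hp" if length == 3 else "Ed"
    return sorted(pairs, key=lambda pair: pair[key])

pairs = [
    {"id": "x", "Ed": 0.4, "Hp": 5},  # tight pair, but low in both lists
    {"id": "y", "Ed": 0.9, "Hp": 2},  # looser pair made of the two top answers
]
first_at_length_three = order_pairs(pairs, 3)[0]["id"]
first_at_length_six = order_pairs(pairs, 6)[0]["id"]
```

The same two pairs are ranked in opposite orders at the two lengths, reflecting the greater weight given to list position for the shortest SVRs.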
Comparison to other methods
The comparison of the accuracy of our algorithm with that of others is made difficult by several factors. As has been argued by Martin et al. (1989), the different methods used in the literature for superposition can give rise to divergent calculations of rmsd (Fidelis et al. 1994), and these are very seldom specified in publications. The number of test examples also varies greatly in the literature. Our algorithm is designed to be more generally applicable by predicting the unknown conformations for any polypeptide in a structure or comparative model rather than relying on the correct secondary structure assignments of the flanking regions as is found in some studies (Rufino et al. 1997; Wojcik et al. 1999).
In comparative modeling, the fragments missing after building the structurally conserved core are very rarely pure loops. Moreover, only the backbone of the core is known, rather than the whole protein (including side chains), which is often required for methods using energy calculation (Bruccoleri and Karplus 1987; Zheng and Kyle 1996; Zhang et al. 1997; Rapp and Friesner 1999).
The difficulty of comparing methods is illustrated by the results of Wojcik et al. (1999). Their method is based on selection of loop fragments from a database of loop structures taken from the PDB. These loops are clustered into families. Although the loops used to test their predictions were not from their original database, they were taken from proteins with a percentage identity of up to 90% with those in the database. Their rmsd, calculated in the anchor frame of reference and on the backbone atoms N, Cα, and C, varied from 1.1 Å for three residues to 3.3 Å for eight residues. Thus, in comparing these results to those given here, two major differences must be considered. First, their measure of nonrelation to the original database was far less strict than in this study. Second, they have not included the backbone oxygen atom in their calculation of rmsd. The results in Table 7, which are lower, are, therefore, disadvantaged in two ways.
A comparison was made with the work of van Vlijmen and Karplus (1997), a method based on a fragment database combined with the use of a molecular mechanics energy function to improve predictions. Only 13 of their 14 example loops were predicted, as one was of length nine (PETRA only predicts up to length eight). Two from an obsolete PDB structure 3tln were replaced by those from PDB 8tln. Of the 13 cases, CODA performed better than the van Vlijmen and Karplus (1997) method in eight and identically well in one. The loops used in this comparison and the results of both methods are given in Table 9.
The van Vlijmen and Karplus (1997) method would probably perform better now than it did a few years ago because the database of known structures, on which the method depends, is now significantly larger. On the other hand, the authors themselves point out that their results were better than would be obtained in realistic applications because the candidate loops for energy minimization were the 50 loops closest to the real loop structure, not the 1000 loops with best fitting anchors. The running time for an SVR prediction is significantly different between the two algorithms, with the van Vlijmen and Karplus (1997) method taking several hours on an R10000 SGI compared with between 3 and 20 minutes (depending on length) to run FREAD, PETRA, and then CODA consecutively.
All the tests reported above on CODA use loop regions only. The algorithm may also be used when secondary structure is within the SVR and/or the anchor regions are loop conformations. To test this, a length six loop, situated between an alpha helix and a beta strand, was selected from the final test set and seven length six peptides were predicted, six of which included residues from the elements of secondary structure as described in Table 10. The rmsd values obtained for these fragments compare very favourably with the average rmsd for length six loops of 1.95 Å from Ltest. Those containing alpha helix appear to be better predicted, possibly because of the lower variability of ϕ and ψ in alpha-helical compared with beta-strand structure. These results indicate the efficacy of CODA at predicting SVRs that contain secondary structure and also show its use when the ends of the loop have not been precisely predicted.
CODA on models
To test the application of CODA in the modeling of SVRs in comparative models, four models were built using MODELLER (Sali and Blundell 1993). The models were built on the basis of structural alignments found in HOMSTRAD (Mizuguchi et al. 1998; http://www-cryst.bioc.cam.ac.uk/∼homstrad). A single member of the family was chosen as a target and the other members were used as the basis structures or templates (Table 11). SVRs were defined using SCORE (C. Deane, unpubl.), and from these, five SVRs of length three to eight were selected (Table 12). The results (Table 12) show that accuracy of prediction depends on the overall rmsd between the model and the correct structure.
In this test, CODA has been challenged by the inaccuracy of the overall structure into which the fragment must fit which will affect the filter Ek and by the inaccuracy of the anchor regions that will affect the selection ability of Dc and R(F). The problem of the deviation of the whole model has been cited as a major problem for methods of loop prediction that use a molecular mechanics energy function (Sudarsanam et al. 1995), as none of the true interactions of the loop with the rest of the structure will necessarily be present in the model. The second problem of deviation of the anchor regions has been highlighted previously by Lessel and Schomburg (1999) as a difficulty in fragment database search methods. CODA appears to be more robust to overall model deviation than anchor deviation as would be expected for an algorithm of this type.
The SVRs built on model structures were built on correct alignments. The use of bad alignments, which would be encountered when building a target from a more distant template, would cause severe problems. Any loop modeling method, even if perfect, would be unable to overcome this problem. This means that loop modeling methods will be most effective for easy to medium hard comparative modeling targets (Jones and Kleywegt 1999).
The CODA web site
The algorithm is available online at http://www-cryst.bioc.cam.ac.uk/∼charlotte/Coda/search_coda.html. Using a given sequence and coordinates of a protein framework, the CODA algorithm is implemented as described above. The structure given may be an X-ray crystallographic or NMR structure or a model. If the structure does not contain the SVR, the full sequence of the protein must be submitted along with the PDB file. The SVR as defined by the user is then predicted independently with both FREAD and PETRA and a consensus prediction is made by CODA. The output combines two tables of data relating to the FREAD and PETRA results with the scores for the various filters for each predicted fragment listed. The CODA prediction is highlighted in red. All these predictions are superposed onto the anchor regions of the structure given and are available to view, so that the predictions can be visually inspected. CODA predictions can be made for protein fragment pieces up to eight residues in length. At lengths greater than eight, but less than 30, a FREAD prediction alone will be made.
CODA generates a consensus prediction from two separate algorithms, both based on the search of a database of peptide fragments: one of real fragments, FREAD, and one of computer generated fragments, PETRA. Over 3300 protein loops have been used to test various aspects of the approach.
The results for CODA compare very favorably with those of other protein loop prediction methods. Unlike many programs (Martin et al. 1989; Martin and Thornton 1996; Oliva et al. 1997; Rufino et al. 1997; Li et al. 1999; Wojcik et al. 1999), it is not limited to loop prediction but can be used for any missing peptide fragment in a protein structure or model. Unlike others (Bruccoleri and Karplus 1987; Zheng and Kyle 1996; Rapp and Friesner 1999; Zhang et al. 1997), the program also does not require side chains. Running FREAD, PETRA, and then CODA consecutively is rapid, taking from three minutes for length three to around 20 minutes for length eight on an SGI R10000.
These features make CODA particularly suitable for comparative modeling (for review, see Sanchez and Sali 1997), particularly when the templates are distantly related to the target structure, such that the SVRs are likely to contain pieces of secondary structure. This type of modeling for more distant parents will occur more and more often in the era of genome sequences and threading algorithms.
Materials and methods
Strategy for FREAD database and test set generation
A FREAD fragment database contains Cα intervals between all non-terminal Cαs separated by more than three residues and fewer than 30. Two such databases were generated for use in this analysis, labeled F20 and F90. The F20 database contained 1010 protein chains with up to 20% pairwise identity (Hobohm et al. 1992). F90 is an analogous database containing 2107 chains with up to 90% pairwise percentage identity. All the chains included were selected from the PDB (August 1999 version), solved by x-ray crystallography with a resolution better than 2.5Å.
F20 is, therefore, a database of accurate and largely nonhomologous structures. Even though sequence and structure are less well conserved in loops than in the core region (Chothia and Lesk 1986; Chelvanayagam et al. 1994), some loops, particularly those of short lengths, may be conserved in homologous proteins. Thus, we used the pairwise 20% identity cut-off, which is a stricter definition of nonhomology than is found in most loop prediction software testing (Wojcik et al. 1999). Homologous proteins are removed from the test set because, in good modeling practice, loops that could be built from homologous parents would be copied from the parent structures used for modeling, so a general SVR modeling program would not be used to predict them.
To complement these two databases, two test sets of loops were built. All the loop fragments in F20 and F90 were extracted by comparison to the SLoop database (http://www-cryst.bioc.cam.ac.uk/∼sloop). These two sets of loops (one from each database) do not include all loops in each of the databases (F20 and F90) as the SLoop January 1999 release was used. The definition of a loop in the SLoop database is a region connecting two secondary structure elements, both of which must be of at least three residues in length (Donate et al. 1996). Such loop definition is highly dependent on the flanking region assignment, which is known to vary from one program to another (Colloch et al. 1993). Despite these drawbacks, the use of comparison to the SLoop database for loop extraction from the two fragment databases does provide a large number of SVRs for testing the predictive power of FREAD (Table 1).
The shortest loops (one and two residues) were discarded, as these do not usually require advanced modeling. Loops longer than eight residues were also not considered, as many of these longer loops include regular secondary structure elements, as can be seen from the alignment of complete families such as those found in HOMSTRAD (Mizuguchi et al. 1998). In the current release of the SLoop database, over 75% of loops are length eight or less.
In all, 2355 loops were selected from the F90 database and 734 from the F20 database to create the L90 and L20 loop test sets, respectively. The numbers of loops at various lengths in L20 and L90 are given in Table 1.
Environmentally constrained substitution tables
Selection of environmental constraints
The substitution tables were constrained only by Ramachandran areas (Ramachandran and Sasisekharan 1968), which were defined by a close examination of the PDB. Several PDB sets with different pairwise percentage identities (25%, 35%, and 40%, to measure the effect of homology) and resolution cut-offs (1.5Å, 1.8Å, 2.0Å, and 2.5Å; higher resolution structures show tighter behaviour in the Ramachandran plot; Morris et al. 1992) were constructed. These different sets of structures were then mapped onto Ramachandran propensity plots by dividing the Ramachandran plot into three by three degree bins. The propensity was calculated for each bin, which gives a ‘propensity surface’ (Fig. 1).
The propensity of an amino acid for a region i of the Ramachandran plot is given by $P_i = n_{Oi}/n_{O}$, where $n_{Oi}$ is the total number of all types of amino acids found in the region and $n_{O}$ is the total number of amino acids.
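The binning and propensity calculation can be illustrated with a minimal sketch. It assumes ϕ,ψ values in degrees and computes, for each three by three degree bin, the count in that bin divided by the total count. The function names `bin_index` and `propensity_surface` are hypothetical and are not part of any published program.

```python
from collections import Counter

BIN_WIDTH = 3  # degrees; three by three degree bins as described in the text


def bin_index(angle):
    """Map an angle in degrees onto a bin index in 0..119 (wraps at +/-180)."""
    return int(((angle + 180.0) % 360.0) // BIN_WIDTH)


def propensity_surface(phi_psi_pairs):
    """Return {(phi_bin, psi_bin): count in bin / total count}."""
    counts = Counter((bin_index(phi), bin_index(psi)) for phi, psi in phi_psi_pairs)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}
```

Empty bins are simply absent from the returned dictionary; a plotted surface would treat them as zero.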
As the general objective here is to model structurally variable polypeptide fragments that are mostly found outside regions of regular secondary structure, Ramachandran propensity plots were also generated by excluding amino acids involved in secondary structure, using the Kabsch and Sander (1983) definition. This allowed the lower density regions to be more easily seen.
The peaks and valleys found in the Ramachandran plot guided its partitioning into six conformational classes (Fig. 1). The areas of the Ramachandran plot selected for the environmentally constrained substitution tables are different from those that have been used before (Fig. 1). They follow neither the classic Efimov areas (Efimov 1980; Swindells et al. 1995), nor the areas that have been used previously in the development of environmental substitution tables, either constrained (Topham et al. 1993) or unconstrained (Overington et al. 1992; Topham et al. 1997), nor those used for definition of the PETRA angles (Deane and Blundell 2000). Figure 1 shows the distribution of ϕ,ψ pairs guiding the boundaries of the selected areas. The cis conformation found in some proline residues is ignored in the definition of the main chain conformational class used here as these constitute less than 5% of all proline residues and less than 0.04% of all peptide bonds (Reimer et al. 1998).
Calculation of substitution probabilities
The raw environmentally constrained substitution tables were constructed by accumulating substitutions observed in all the homologous pairwise alignments from a high-resolution database. This database was extracted from HOMSTRAD (date 18/02/00; Mizuguchi et al. 1998) and contained 320 homologous families with 859 proteins all solved to a resolution better than 2.5Å. In this case, the environment of the replaced and substituted residues was taken into account, and the six ϕ,ψ areas described above defined the six environmentally constrained tables.
Environmentally constrained amino acid substitution tables are derived as follows using SUBST (K. Mizuguchi, unpubl.). Observed amino acid replacements at structurally aligned positions are counted in terms of the local environment of both the aligned structures. $A^{E}_{ab}$ is the unnormalized frequency of observing amino acid a in environment E replaced with amino acid b. If amino acid a in environment E(a) and amino acid b in environment E(b) are found at a structurally aligned position, both $A^{E(a)}_{ab}$ and $A^{E(b)}_{ba}$ are increased by one. To derive ‘constrained’ matrices, alignment positions that do not satisfy E(a) = E(b) are discarded. No restriction is imposed on ‘unconstrained’ matrices. By definition, $A^{E}_{ab} = A^{E}_{ba}$ is true for constrained matrices.
To reduce multiple contributions from closely related members, sequences are clustered in an analogous way to that used in the construction of the BLOSUM matrices (Henikoff and Henikoff 1992). Unlike BLOSUM, however, clustering is performed for the entire sequence using the overall percentage identities. For example, if the percentage threshold is set at 60% and the alignment contains three sequences (A, B, and C) with the pairwise percentage identities between A and B being 70, between A and C being 30, and between B and C being 30, then A and B are clustered. Counts are made only between A and C and between B and C and the contributions from the pairs (A, C) and (B, C) are averaged. In general, if one cluster includes m sequences and another includes n sequences, each count between these two clusters is assigned the weight of 1/(mn). The unnormalized frequencies $A^{E}_{ab}$ henceforth mean the weighted sum of counts.
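The three-sequence example above can be reproduced with a short sketch. It assumes single-linkage clustering and an inclusive threshold (identity ≥ threshold), neither of which is stated in the text; the function names and the dictionary layout are hypothetical.

```python
from itertools import combinations


def cluster_sequences(names, identity, threshold):
    """Single-linkage clustering (assumed): join sequences with identity >= threshold."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links up to the cluster representative.
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if identity[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def pair_weights(clusters):
    """Weight 1/(m*n) for every sequence pair drawn from two different clusters."""
    weights = {}
    for c1, c2 in combinations(clusters, 2):
        w = 1.0 / (len(c1) * len(c2))
        for a in c1:
            for b in c2:
                weights[frozenset((a, b))] = w
    return weights
```

With identities of 70 (A, B), 30 (A, C), and 30 (B, C) and a threshold of 60, A and B fall into one cluster, no count is made between them, and the pairs (A, C) and (B, C) each receive a weight of 1/2.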
The raw substitution counts are converted into substitution probabilities $P(b|a,E)$. $P(b|a,E)$ is the probability that amino acid a in environment E is substituted by amino acid b, and it can be derived from $A^{E}_{ab}$ as
$$P(b|a,E) = \frac{A^{E}_{ab}}{\sum_{b'} A^{E}_{ab'}}$$
Finally, a matrix of log-odds scores can be obtained from $P(b|a,E)$ as
$$s(a,E \to b) = \log\frac{P(b|a,E)}{q_b}$$
where $q_b$ is the background probability of observing amino acid b. If the substitution matrices are used for comparing structure with sequence, the score matrices must represent the odds ratio of the pair (a in structure, b in sequence) occurring in an alignment, as opposed to this match occurring by chance. The background probabilities, therefore, must be those for observing each amino acid residue in a sequence and can be given by
$$q_b = \frac{\sum_{E}\sum_{a} A^{E}_{ab}}{\sum_{E}\sum_{a}\sum_{b'} A^{E}_{ab'}}$$
Note that the environment-specific substitution matrices are generally asymmetric, namely,
$$s(a,E \to b) \neq s(b,E \to a)$$
If, on the other hand, the matrices are constrained and we assume that each amino acid residue in sequence would be in the same environment as that of the aligned residue in structure, the background probabilities must depend on the environment and are given by
$$q^{E}_{b} = \frac{\sum_{a} A^{E}_{ab}}{\sum_{a}\sum_{b'} A^{E}_{ab'}}$$
This treatment produces a symmetric matrix ($s(a,E \to b) = s(b,E \to a)$).
Elements of the log-odds matrices are multiplied by a scaling factor of 3/log 2 and rounded to the nearest integer value (i.e., the log-odds scores are expressed in 1/3 bit units).
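The conversion from weighted counts to integer scores for a single constrained environment can be sketched as follows. This is an illustration under stated assumptions, not the SUBST implementation: the background probability is taken as the environment-specific column share of the counts, the scale factor is taken as 3/ln 2 (which is what 1/3 bit units imply), and pairs with zero counts are skipped instead of being smoothed. The function name and input layout are hypothetical.

```python
import math

SCALE = 3.0 / math.log(2.0)  # natural-log odds -> 1/3 bit units


def log_odds_scores(counts):
    """counts[a][b]: weighted substitution counts for one environment E.

    Returns integer scores s[a][b] = round(SCALE * ln(P(b|a,E) / q_b)),
    with q_b taken as the environment-specific background (column share).
    Pairs with zero counts are skipped; real tables would need smoothing.
    """
    residues = list(counts)
    row_sum = {a: sum(counts[a].values()) for a in residues}
    total = sum(row_sum.values())
    col_sum = {b: sum(counts[a].get(b, 0.0) for a in residues) for b in residues}
    scores = {}
    for a in residues:
        scores[a] = {}
        for b in residues:
            n_ab = counts[a].get(b, 0.0)
            if n_ab > 0:
                p = n_ab / row_sum[a]      # P(b|a,E)
                q = col_sum[b] / total     # background for b in E
                scores[a][b] = round(SCALE * math.log(p / q))
    return scores
```

Because the counts of a constrained table are symmetric, the resulting scores are symmetric as well, which the toy input `{"A": {"A": 8, "G": 2}, "G": {"A": 2, "G": 4}}` makes easy to check.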
Parameterization and testing of FREAD
To parameterize FREAD, the F20 database was used to predict the L20 loop set with a jack-knife process (i.e., by prohibiting each target loop from selecting its real structure). The filters and sorting functions used by FREAD are Dc(F), R(F), Sc(F), and Ek (described below). Different anchor lengths were also investigated, in which the use of two residues on either side was found to be the most successful (results not shown). The optimal filters and cut-offs were investigated using the program CONTEST. The selected parameters were then further tested using the F90 database to predict the L90 loops by a jack-knife process.
Description of the FREAD rules and filters
Anchor region selection filters
The anchor residues were defined as the two residues adjacent to the polypeptide fragment on the N and C termini sides in the target protein. These anchor residues are also present in the prediction fragment. A set of m target Cα separations T(i,j) was calculated between Cα atoms i and j of the anchor residues, separated by the n residue gap in the target protein. The equivalent set of distances P(F;k,l) for every n residue fragment F in the database are read using FREAD. The individual differences between distances T(i,j) and P(F;k,l) were calculated and used as selection filters as well as Dc(F), the difference between the set of target distances and the fragment distances.
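A minimal sketch of the anchor distance filter follows. The text does not give the exact form of Dc(F); the sketch assumes it is the root-mean-square difference between matched target and fragment Cα separations, and the cut-off parameters `max_single` and `max_dc` are hypothetical placeholders with no values taken from the paper.

```python
import math


def cross_anchor_distances(n_anchor, c_anchor):
    """C-alpha separations between every N-side and every C-side anchor atom."""
    return [math.dist(p, q) for p in n_anchor for q in c_anchor]


def dc(target_dists, fragment_dists):
    """Assumed form of Dc(F): rms difference between matched distance sets."""
    diffs = [t - p for t, p in zip(target_dists, fragment_dists)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def passes_anchor_filter(target_dists, fragment_dists, max_single, max_dc):
    """Keep a fragment only if each distance and the overall Dc are close enough."""
    singles_ok = all(abs(t - p) <= max_single
                     for t, p in zip(target_dists, fragment_dists))
    return singles_ok and dc(target_dists, fragment_dists) <= max_dc
```

With two anchor residues on each side, `cross_anchor_distances` yields four separations per fragment, so a database scan can reject most candidates on these distances alone before any superposition is attempted.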
R(F) is the rmsd of the backbone atoms of the anchor regions (C, Cα, N and O) of the fragment superposed on the backbone atoms of the anchor regions of the target protein (Kearsley 1989a, b).
Ek and Sc
Ek is calculated using the all-atom, distance-dependent conditional probability function (out to the Cβ; Samudrala and Moult 1998). It is the energy of the fragment in the overall structure of the target. Calculation of Ek does not include the first and last residue of the fragment, because strongly repulsive overlaps with the template protein may result as the fragment is fitted without optimization. Sc(F) is the sum of the log odds scores from the environmentally constrained substitution tables for each amino acid of the predicted fragment.
CODA consensus method
CODA takes the top ten predictions from PETRA and FREAD for a structurally variable region. Several methods of selection were tested: clustering on Dϕ,ψ, the difference between backbone torsion angles ϕ and ψ of all residues of the two predicted fragments; on Ed, the rmsd between the predictions when they are superimposed by their anchor regions onto the anchor regions of the target; or on Hp, the sum of their positions in the prediction lists from PETRA and FREAD. Any of these scores, as well as the sum of the Ek values for the two fragments that make up the cluster, could then filter the predictions.
The Sc value for the FREAD predictions is also checked. Depending on the length of the SVR, either the PETRA or FREAD fragment could be selected as the top prediction. All these selection procedures are length dependent.
In Dϕ,ψ, the per-residue differences are taken around the circle: dϕ = |ϕik − ϕjk| if |ϕik − ϕjk| ≤ 180 and 360 − |ϕik − ϕjk| otherwise, and dψ = |ψik − ψjk| if |ψik − ψjk| ≤ 180 and 360 − |ψik − ψjk| otherwise.
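The wrap-around rule for dϕ and dψ can be written as a small helper. This sketch assumes angles in degrees and fragments supplied as lists of (ϕ, ψ) pairs; the function names are hypothetical. How the per-residue values are combined into Dϕ,ψ is not specified here, so the sketch stops at the per-residue differences.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return 360.0 - d if d > 180.0 else d


def torsion_differences(frag_i, frag_j):
    """Per-residue (d_phi, d_psi) for two fragments given as [(phi, psi), ...]."""
    return [(angular_difference(p1, p2), angular_difference(s1, s2))
            for (p1, s1), (p2, s2) in zip(frag_i, frag_j)]
```

For example, ϕ values of 170 and −170 differ by 20 degrees under this rule, not 340.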