Tertiary structural information is critical for our understanding of a protein's biological function. However, experimental structure determination is far too expensive and time-consuming to be applied to all proteins of interest. Computational approaches are, thus, expected to play a major role in determining protein structures in the future.1 Over the last two decades, great strides have been made in exploiting distant evolutionary relationships to known structures to derive spatial restraints for comparative models.2–4 One of the remaining major challenges is the refinement of such models to near-experimental accuracy.5 This challenge, in turn, demands the development of more accurate force fields that can be deployed in molecular mechanics simulations.
Physicochemical force fields such as CHARMM,6 AMBER,7 and GROMOS,8 parameterized for use in protein simulations, are routinely applied to the refinement of comparative models. However, overall improvement in the accuracy of comparative models by such methods has not been achieved.5 Knowledge-based potential energy functions are derived from either statistical analysis of observed protein structures9–16 or optimization of parameters such that native structures are discriminated from non-native decoys.17–20 They usually outperform9, 21 physicochemical force fields that lack some physical terms such as cation-π interactions and entropic effects. However, the discrete nature of statistical energy functions makes them difficult to use directly in energy minimization or molecular dynamics for protein-structure refinement. Moreover, most knowledge-based energy functions derived from parameter optimization are coarse grained (i.e., at the residue level or using simplified side chains) to minimize the number of adjustable parameters. Parameter optimization was considered inappropriate for deriving distance-dependent energy functions of all atom types,13 not to mention orientation dependence. Thus, it is more practical to optimize a small number of weights for mixing physicochemical terms with statistics-based potentials.22, 23 As more and more experimental protein structures become available, knowledge-based potential energy functions derived from parameter optimization may prove optimal even for all-atom force fields.
Any complicated function, including the force fields between atoms in a protein, can be decomposed as a mathematical series. For example, power series expansions of a diatomic potential energy function are the most useful means for its analytical representation in quantum chemistry.24 Miyazawa and Jernigan used series expansions of spherical harmonic functions to represent the fully anisotropic distribution of the relative orientation of two residues and increased the discrimination power in fold recognition.25 Here, we expanded atomic force fields as series. The parameters were optimized by maximizing the gap between native and non-native side chain conformations and by minimizing the root mean square deviation (RMSD) of low-energy rotamers. A total of 5798 nonhomologous proteins were used for optimizing 1889 parameters. The energy functions with optimized parameters were used to predict side chain conformations for 218 independent test proteins. The prediction accuracies of χ1 and χ1 + 2 were improved by 2.2 and 4.0%, respectively, compared with the next best side chain modeling program. Because the expansions used here are continuous, the resulting energy functions can be used directly in gradient-based search algorithms to address the comparative model refinement problem.
Training and Test Sets
Thirty nonhomologous proteins are used as the first test set, as described previously.23 The training proteins were chosen according to the following criteria: the sequence identity between any pair of chains was less than 30%, the resolution was less than 2.5 Å, and the R-factor was less than 1.0. A total of 6254 chains that met the above criteria were downloaded from the Dunbrack Lab website (http://dunbrack.fccc.edu/PISCES.php) in June 2008. A protein was discarded if more than 5% of its residues had incomplete side chain atomic coordinates or if its sequence identity with any of the 30 test proteins was more than 50% following local alignment. As a result, the training set contains 5798 proteins. We also compiled a second test set. A total of 5279 chains with sequence identity less than 30%, resolution less than 2.0 Å, and R-factor less than 0.25 were downloaded from the Dunbrack Lab website in June 2009. Proteins were discarded if they met any of the following conditions: the same PDB ID as one of the 6254 protein chains downloaded in June 2008, more than 5% of residues with incomplete side chain atomic coordinates, fewer than 100 residues (excluding Gly, Ala, and residues with incomplete side chains) available for the prediction accuracy assessment, or a sequence identity of more than 50% with any of the training proteins following local alignment. This leads to the second test set of 218 proteins. Hydrogen atoms were added with the REDUCE program for all protein structures.26
The rotamer library is from Dunbrack and Cohen.27 We generated subrotamers by perturbing each dihedral angle of the rotamer: the quantity (f1 + f2 + f3 + f4 + f5) × σ is added to the original dihedral angle, where fi is a random number in the range (−1, 1) and σ is the standard deviation of the dihedral angle given in the library. Bond lengths and angles from Engh and Huber28 are used to build the rotamer library. Polar hydrogen atoms are added because they are absent from the Dunbrack library but are considered explicitly in this study. Each χ2 of Ser and Thr and each χ3 of Tyr (denoted θ) is assigned three possible values: −60°, 60°, and 180°. For the subrotamers, this dihedral angle is evenly distributed between θ − 30° and θ + 30°.
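The perturbation rule can be illustrated with a minimal Python sketch. The function name `make_subrotamers`, the use of a uniform random number generator, and the wrapping of angles into [−180°, 180°) are assumptions made for illustration; the default of 60 subrotamers per rotamer follows the number used in the prediction procedure described in the text.

```python
import random


def make_subrotamers(chis, sigmas, n_sub=60, rng=None):
    """Generate subrotamers by perturbing each dihedral angle (degrees).

    Each angle receives (f1 + f2 + f3 + f4 + f5) * sigma, where every fi is
    drawn uniformly from (-1, 1) and sigma is the library standard deviation.
    Drawing fi uniformly and wrapping the result are assumptions.
    """
    rng = rng or random.Random()
    subrotamers = []
    for _ in range(n_sub):
        perturbed = []
        for chi, sigma in zip(chis, sigmas):
            f_sum = sum(rng.uniform(-1.0, 1.0) for _ in range(5))
            angle = chi + f_sum * sigma
            # Wrap into [-180, 180) so angles stay comparable.
            angle = (angle + 180.0) % 360.0 - 180.0
            perturbed.append(angle)
        subrotamers.append(tuple(perturbed))
    return subrotamers
```

Because five terms are summed, a perturbed angle can deviate from the library value by at most 5σ, and small deviations are far more likely than large ones.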
Rotamer Internal Energy
The internal energy of each side chain dihedral angle is expanded as a third-order Fourier series,
$$E_{\mathrm{rot}}(\alpha) = \sum_{k=1}^{3} \left[ t_{2k-1} \cos(k\alpha) + t_{2k} \sin(k\alpha) \right] \qquad (1)$$
where α is a dihedral angle of the side chain rotamer, and t1–6 are optimized parameters. Two hundred and fifty-eight parameters are used for the 43 dihedral angles of the 20 amino acids. The rotamer internal energy is summed over all dihedral angles of the modeled side chain. The interactions between bonded atoms are not calculated. Atomic interaction energies beyond 1,4 interactions are calculated as for typical nonbonded atoms.
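As an illustration, the rotamer internal energy can be sketched in Python under the assumption that it is a third-order Fourier series whose first-order part is t1 × cos α + t2 × sin α, as stated later in the text. The ordering of t3–t6 between the cosine and sine terms and the function name `rotamer_internal_energy` are assumptions.

```python
import math


def rotamer_internal_energy(alphas_deg, params):
    """Sum a third-order Fourier series over the side chain dihedral angles.

    alphas_deg: dihedral angles in degrees, one per rotatable bond.
    params: one (t1, ..., t6) tuple per dihedral angle; the assumed layout is
            t1, t3, t5 for cos(k*alpha) and t2, t4, t6 for sin(k*alpha).
    """
    total = 0.0
    for alpha_deg, t in zip(alphas_deg, params):
        alpha = math.radians(alpha_deg)
        for order in (1, 2, 3):
            cos_coeff = t[2 * (order - 1)]
            sin_coeff = t[2 * (order - 1) + 1]
            total += cos_coeff * math.cos(order * alpha)
            total += sin_coeff * math.sin(order * alpha)
    return total


# Six coefficients for each of the 43 dihedral angles gives the 258 parameters.
N_ROTAMER_PARAMS = 43 * 6
```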
Distance-Dependent Energy Function
The distance-dependent optimized side chain atomic energy (OSCAR-d) is calculated by
$$E_{\text{OSCAR-d}}(d) = \sum_{i=1}^{4} a_i\, d^{-2i} \qquad (2)$$
where d is the distance between two atoms and a1–4 are optimized parameters. We define 16 atom types for the 20 amino acids and use a total of 544 parameters. Atoms with a similar charge and radius according to CHARMM are defined as the same type. The distance cutoff is set to 10 Å for any two interacting atoms. The definition of the atom types can be found in the Supporting Information.
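A minimal Python sketch of the pairwise distance-dependent term follows, assuming the even inverse power series Σ ai d^(−2i) with i = 1–4 that is discussed later in the text. Treating the energy as exactly zero at or beyond the cutoff, and counting parameters as four coefficients per unordered pair of atom types, are inferences made for illustration; the function name is hypothetical.

```python
from itertools import combinations_with_replacement

CUTOFF = 10.0  # distance cutoff in Å, from the text


def oscar_d_pair_energy(d, a):
    """Evaluate sum_i a_i * d**(-2*i) for one atom pair.

    d: interatomic distance in Å (must be positive).
    a: coefficients (a1, a2, a3, a4) for this pair of atom types.
    Returning 0.0 at or beyond the cutoff is an assumption.
    """
    if d <= 0.0:
        raise ValueError("distance must be positive")
    if d >= CUTOFF:
        return 0.0
    inv_d2 = 1.0 / (d * d)
    return sum(a_i * inv_d2 ** i for i, a_i in enumerate(a, start=1))


# 16 atom types give 136 unordered type pairs; 4 coefficients each gives 544.
N_TYPE_PAIRS = len(list(combinations_with_replacement(range(16), 2)))
N_DISTANCE_PARAMS = 4 * N_TYPE_PAIRS
```

Since the series is a smooth function of d inside the cutoff, its derivative with respect to d is available in closed form, which is what makes this kind of expansion suitable for gradient-based searches.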
Orientation-Dependent Energy Function
The orientation-dependent optimized side chain atomic energy (OSCAR-o) is calculated by
$$E_{\text{OSCAR-o}} = E_{\text{OSCAR-d}}(d) \times \left[ E(\theta,\varphi,\psi) + C \right] \qquad (3)$$
where E(θ,φ,ψ) is an orientation-dependent function, and C is a constant. θ, φ, and ψ are Euler angles of two interacting dipoles (Fig. 1). The dipole points to the interacting atom from the center of its base atoms. E(θ,φ,ψ) is given by
There are 1087 optimized parameters for the 16 atom types, including b1–9 in eq. (4) and the constant C in eq. (3). For sp2 hybridized atoms, the dipole is perpendicular to the hybridization plane, and the parameters of the related first-order terms are set to 0 so that the calculated energy is not affected by inversion of the dipole direction. Here, we assume that the interaction energy consists of a distance-dependent term and an orientation-dependent term. In the extreme case, when E(θ,φ,ψ) equals 0 and C equals 1, eq. (3) reduces to a distance-dependent energy function only. We optimized the parameters (b1–9 and C) simultaneously, so that the interaction energy could be calculated correctly even if the distance-dependent and orientation-dependent terms overlap somewhat.
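The way the two terms combine can be sketched in Python. Two assumptions are made here. First, the distance-dependent energy is multiplied by the factor E(θ,φ,ψ) + C, which is consistent with the limiting case described above. Second, because eq. (4) is not reproduced here, the orientation function is replaced by the simpler three-cosine form mentioned later in the text, so `b` holds three coefficients rather than b1–9. The function names are hypothetical.

```python
import math


def orientation_factor(theta, phi, psi, b, c):
    """Stand-in orientation factor: b1*cos(theta) + b2*cos(phi) + b3*cos(psi) + C.

    Angles are in radians. This is the simpler formula discussed in the text,
    not the nine-parameter eq. (4).
    """
    return (b[0] * math.cos(theta)
            + b[1] * math.cos(phi)
            + b[2] * math.cos(psi)
            + c)


def oscar_o_pair_energy(e_distance, theta, phi, psi, b, c):
    """Scale a distance-dependent pair energy by the orientation factor.

    The product form is an assumption based on the limiting case
    E(theta, phi, psi) = 0 and C = 1.
    """
    return e_distance * orientation_factor(theta, phi, psi, b, c)
```

With all b coefficients set to 0 and C set to 1, the function returns the distance-dependent energy unchanged, which matches the initialization used when the orientation-dependent parameters are optimized.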
Optimizing Parameters for the Distance-Dependent Energy Functions and Dihedral Angle Potential Functions
The parameters are initialized with random values. The sum of eqs. (1) and (2) is used to calculate energies for the native side chain conformation and rotamers at a specific position. The side chain conformations of the other residues are fixed at the observed atomic coordinates. Energies for 12,000 residues from the training proteins are calculated for each of the 18 residue types (excluding Gly and Ala). Residues from high-resolution proteins are given priority. Similar to our previous study,29 Monte Carlo (MC) simulated annealing is used to optimize the parameters by minimizing the following objective function:
where N is the number of rotamers, E(r) is the energy of the native conformation r, E(i) is the energy of rotamer i, and M is the total number of calculated residues (18 × 12,000 = 216,000).
Optimizing Parameters for Orientation-Dependent Energy Functions
First, for each of the N backbone-dependent rotamers at the modeled position, we generate 60 subrotamers and select the one with the lowest energy by the distance-dependent and rotamer internal energy functions. The parameters of the orientation-dependent functions are optimized so that the native conformation has a lower energy than the selected N subrotamers by minimizing eq. (5). The optimized parameters of the distance-dependent functions are fixed during this procedure. The parameters of the rotamer internal energy functions are initialized to previously optimized values and then reoptimized. For the parameters of the orientation-dependent functions, b1–9 in eq. (4) are initialized to 0 and C in eq. (3) is initialized to 1.
In the next step, we increase the number of residues used in training to as many as 40,000 for each residue type. For rare residue types such as Cys, Met, Trp, and His, fewer than 40,000 residues are used. Instead of eq. (5), which is continuous and easy to minimize, the rmsd value of the lowest-energy subrotamer, averaged over all training residues, is adopted as the objective function to minimize. For each modeled position, the lowest-energy subrotamers calculated by the distance-dependent and rotamer internal energy functions are selected for four rotamers; those with a relatively high energy are not used. Similarly, we select the four lowest-energy subrotamers calculated by the orientation-dependent functions with the parameters optimized in the first step. A total of eight subrotamers are thus considered at each position. The parameters are initialized to the values optimized in the first step. After optimization, we use the new parameters to select an additional four lowest-energy subrotamers and reoptimize the parameters with 12 subrotamers at each modeled position. This procedure is repeated three more times, based on the observation that the results improve slightly with each iterative optimization.
Predicting Side Chain Conformation of a Single Residue
We do not use any information about the native side chain conformation. To predict the side chain conformation, we generate 60 subrotamers for each of the N rotamers, and the subrotamer with the lowest energy among the 60N candidates constitutes the prediction.
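The selection step amounts to an exhaustive search over all subrotamers, which can be sketched as follows. The callables `make_subrotamers` and `energy` are hypothetical placeholders for a subrotamer generator and an energy function; neither signature comes from the text.

```python
import math


def predict_side_chain(rotamers, make_subrotamers, energy):
    """Return the lowest-energy subrotamer over all rotamers, with its energy.

    rotamers: iterable of rotamer descriptions (for example, chi-angle tuples).
    make_subrotamers: callable returning the subrotamers of one rotamer.
    energy: callable returning the energy of one subrotamer.
    Ties are resolved in favor of the first candidate encountered.
    """
    best, best_energy = None, math.inf
    for rotamer in rotamers:
        for sub in make_subrotamers(rotamer):
            e = energy(sub)
            if e < best_energy:
                best, best_energy = sub, e
    return best, best_energy
```

With 60 subrotamers per rotamer, the loop evaluates 60N candidates for a residue with N rotamers, so the cost grows linearly with the size of the rotamer library.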
Side Chain Modeling of the Whole Protein
We predict side chain conformations of entire proteins by combining a genetic algorithm with MC simulation as follows: (1) generate a pool of 20 structures with the same native backbone structure but with randomly initialized side chain conformations; (2) exchange side chain conformations among those with lower energy values; (3) optimize side chain conformations for all of the 20 protein structures by the MC method; and (4) repeat steps 2 and 3 for 30 cycles, during which the MC simulation temperature decreases after every cycle. The energy values of the final 20 structures are compared, and the structure with the lowest energy is the predicted structure.
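The four steps can be sketched in Python on a toy problem in which each residue takes one of a few discrete states. Only the pool size of 20 and the 30 cycles come from the text. The uniform crossover between two parents from the lower-energy half, the Metropolis acceptance rule, the geometric cooling schedule, the number of MC steps, and all names are assumptions chosen for illustration.

```python
import math
import random


def model_side_chains(n_residues, n_states, energy, pool_size=20, cycles=30,
                      mc_steps=200, t_start=2.0, t_end=0.05, seed=0):
    """Toy genetic-algorithm / Monte Carlo search over discrete side chain states."""
    rng = random.Random(seed)
    # (1) Pool of structures sharing a backbone, with random side chain states.
    pool = [[rng.randrange(n_states) for _ in range(n_residues)]
            for _ in range(pool_size)]
    for cycle in range(cycles):
        # Temperature decreases after every cycle (geometric schedule assumed).
        temp = t_start * (t_end / t_start) ** (cycle / max(cycles - 1, 1))
        # (2) Exchange conformations among the lower-energy structures:
        # the worse half is replaced by crossovers of the better half.
        pool.sort(key=energy)
        parents = pool[: pool_size // 2]
        children = []
        for _ in range(pool_size - len(parents)):
            p1, p2 = rng.sample(parents, 2)
            children.append([rng.choice(pair) for pair in zip(p1, p2)])
        pool = [list(p) for p in parents] + children
        # (3) Metropolis MC optimization of every structure in the pool.
        for conf in pool:
            e_old = energy(conf)
            for _ in range(mc_steps):
                pos = rng.randrange(n_residues)
                old_state = conf[pos]
                conf[pos] = rng.randrange(n_states)
                e_new = energy(conf)
                if e_new <= e_old or rng.random() < math.exp((e_old - e_new) / temp):
                    e_old = e_new
                else:
                    conf[pos] = old_state  # reject the move
    # The lowest-energy structure in the final pool is the prediction.
    return min(pool, key=energy)
```

A real implementation would evaluate only the energy change of the moved residue rather than the full energy, since recomputing every pairwise interaction at each MC step dominates the running time.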
The methods for accuracy evaluation are similar to those described previously.23 Residues with <20% solvent accessibility are considered as core residues. The χ1 angle of a residue is correctly predicted if it is within 40° of the experimental value. The χ1 + 2 angle is correctly predicted when both χ1 and χ2 are within 40° of their experimental values. For residues with multiple side chain conformations in the observed structure, we compare with the first conformation in the PDB file only; other conformations are not considered. Residues with incomplete side chain atomic coordinates are modeled but not evaluated.
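The χ1 and χ1 + 2 criteria can be expressed as a short Python sketch. The periodic treatment of angle differences, the inclusive 40° threshold, and the representation of each residue as a (χ1, χ2) tuple with `None` for a missing χ2 are assumptions. Restricting the χ1 + 2 statistic to residues that have a χ2 angle is also an assumption.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def chi_accuracy(predicted, native, tol=40.0):
    """Return (chi1 accuracy, chi1+2 accuracy) as fractions.

    predicted, native: lists of (chi1, chi2) tuples in degrees; chi2 is None
    for residues without a second dihedral angle.
    """
    chi1_ok = chi12_ok = chi12_total = 0
    for (p1, p2), (n1, n2) in zip(predicted, native):
        ok1 = angle_diff(p1, n1) <= tol
        chi1_ok += ok1
        if p2 is not None and n2 is not None:
            chi12_total += 1
            chi12_ok += ok1 and angle_diff(p2, n2) <= tol
    chi1_acc = chi1_ok / len(native) if native else 0.0
    chi12_acc = chi12_ok / chi12_total if chi12_total else 0.0
    return chi1_acc, chi12_acc
```

The periodic difference matters for angles near ±180°: a prediction of 170° against an experimental value of −175° differs by 15°, not 345°.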
Performance of Distance-Dependent Energy Functions and Rotamer-Dependent Internal Energy Functions
We decomposed atomic distance-dependent energy functions and dihedral angle potentials as power and Fourier series, respectively. The parameters of the series were initially assigned random values. The MC-optimized objective function [eq. (5)] converged to similar values (0.088–0.093) with different starting parameter values. As an example, Figure 2 shows the resulting optimized, distance-dependent component of the CH3CH3 and HH interaction energy functions. CH3 (the terminal methyl carbon) and H (polar hydrogen from noncharged residues) are two of the 16 defined atom types. The HH interaction energy is unfavorable in the range of 0–10 Å, whereas the CH3CH3 interaction has an attractive well around 4 Å, similar to the van der Waals energy. Table 1 tabulates the performance of the optimized energy functions for the training and 30-protein test sets. The accuracies of χ1 and χ1 + 2 are 90.3 and 81.3%, respectively, for the 30 test proteins. The accuracy is much higher for core residues (96.4% for χ1 and 92.5% for χ1 + 2). For different residue types, the prediction accuracy varies significantly from the lowest (80.1% for Glu χ1 and 68.1% for Glu χ1 + 2) to the highest (100% for Phe χ1 and 98.9% for Phe χ1 + 2). For core residues, the prediction accuracy of the training proteins is slightly (∼1% for χ1 or χ1 + 2) better than for the 30 test proteins. However, the accuracies are nearly identical for the two sets of proteins when both surface and core residues are included in the evaluation, implying that the training set is sufficiently large, despite the large number of adjustable parameters.
Table 1. Prediction of Side Chain Conformations for Single Residues.
A total of 12,000 residues of each residue type (excluding Gly and Ala) were calculated for the training proteins.
The prediction accuracy of each residue type was calculated and averaged over 18 amino acids for the 30 test proteins.
Fraction of correctly predicted χ1 for all residues.
Fraction of residues with both χ1 and χ2 correctly predicted.
The root mean square deviation between predicted and native side chain conformations.
Prediction for core residues.
Performance of Orientation-Dependent Energy Functions
The distance-dependent energy functions are multiplied by an orientation-dependent factor. The parameters of the orientation-dependent functions are optimized as described in the “Methods” section, and the performance is shown in Table 1. For the 30 test proteins, the prediction accuracies of χ1 and χ1 + 2 are further improved by 1.7 and 3.0%, respectively, compared to the distance-dependent energy functions. The improvement is mostly due to hydrophilic residues that are responsible for specific interactions. For example, the prediction accuracies of χ1 and χ1 + 2 for Asp increased significantly from 88.7 and 76.6% to 94.5 and 88.4%, respectively. The χ1 accuracy of Cys also improved from 96.2 to 98.1%, due to the orientation-dependent nature of the disulfide bridge. The prediction accuracy in χ1 and χ1 + 2 of the training set is slightly higher than that of the test set (Table 1), indicating some degree of overoptimization.
We compared our energy functions with the two most popular force fields, AMBER and CHARMM, to predict side chain conformations of a single residue. This procedure was already described by Wilson and coworkers30 and Petrella and coworkers31 to test AMBER and CHARMM, respectively. We used the same protein as Wilson and coworkers (PDB identifier 2alp) and the 10 proteins of Petrella and coworkers, for comparison. As shown in Table 2, our results on single side chain prediction are significantly more accurate than those from the physics-based force fields despite the fact that Petrella and coworkers have used native bond lengths and angles in their predictions. Rather than using rotamers, Petrella and coworkers rotated χ1 and χ2 in native side chains at intervals of 5° or 10°, which made the prediction easier.
Table 2. Single Side Chain Prediction by AMBER, CHARMM, and OSCAR-o.
Columns: Average rmsd (Å); Overall rmsd (Å); % χ1 × χ2 Correct.
χ1 × χ2, the number of residues with both dihedral angles correct, or single angle correct in the cases of valine, threonine, serine, and cysteine. Results for the AMBER and CHARMM force fields were obtained from Wilson and coworkers30 and Petrella and coworkers,31 respectively.
Side Chain Modeling for Whole Proteins
We compared side chain modeling programs using the built-in distance-dependent force fields (OSCAR-d) and orientation-dependent force fields (OSCAR-o) with our previous side chain modeling program LGA23 on the 30 test proteins (Table 3). The χ1 accuracy of OSCAR-d is only slightly higher (0.5%) than that of LGA, whereas the χ1 + 2 accuracy is 3.2% higher. OSCAR-o has the highest accuracy (89.4% for χ1 and 80.8% for χ1 + 2) among the three methods. The rmsd values of core residues in the current predictions are much smaller than those of LGA, in part due to the use of the subrotamer model.
Table 3. The Accuracy of Side Chain Modeling for Whole Proteins by Various Energy Functions.
The overall rmsd of each of the 30 test proteins was calculated and averaged. The rows correspond to the three energy functions optimized in a similar way: our previous method LGA resulted from combining a knowledge-based term and multiple physics-based terms,23 the distance-dependent optimized side chain atomic energy derived by series expansions (OSCAR-d), and the orientation-dependent version (OSCAR-o).
Comparison to Other Algorithms
To provide an additional test for our methods, we collected 218 recently released nonhomologous proteins (see the “Methods” section). Three side chain modeling programs, NCN,32 OPUS_Rota,33 and LGA,23 with top prediction accuracy as ranked by Lu and coworkers,33 and the recently updated SCWRL434 were compared. The prediction accuracy of OSCAR-d is similar to that of other side chain modeling programs, whereas significant improvement is achieved by OSCAR-o. The accuracies of χ1 and χ1 + 2 are improved by 2.2 and 4.0%, respectively (Table 4), compared to the next best side chain modeling program, OPUS_Rota. The improvement for χ1 is remarkable considering the moderate improvement (2.9%) that resulted from combining a knowledge-based term and multiple physics-based terms instead of a simple van der Waals energy.32 For the prediction of individual residue types, OSCAR-o has a higher accuracy than OPUS_Rota for χ1 and χ1 + 2 in all cases except Pro (Fig. 3). We also modeled side chain conformations for an additional 595 proteins. These proteins were selected according to the same criteria as the 218 test proteins but with a maximum sequence identity to the training proteins in the range of 50.1–100.0%. The prediction accuracy (88.8% for χ1 and 79.5% for χ1 + 2), which is almost the same as that for the 218 proteins, indicates that the results are not biased toward proteins with a high sequence identity to the training proteins.
Table 4. Performance of Different Side Chain Modeling Programs on Recently Released PDB Entries.
Columns: All χ1 + 2; Core χ1 + 2.
Data are shown for three third-party methods (SCWRL4,34 NCN,32 and OPUS_Rota33), our previous method LGA,23 and current OSCAR methods.
We have expanded energy functions as mathematical series and improved prediction accuracy for side chain modeling. The energy functions were first expanded as distance-dependent power series and further enhanced by an orientation-dependent factor. The large number of optimized parameters and training proteins is the main reason for the improved performance. In our previous study,23 we combined contact surface, volume overlap, backbone dependency, electrostatic interactions, and desolvation energy; the weights of the energy terms were optimized in a manner similar to the current study with 15 training proteins. The prediction accuracy was not improved when more proteins were used for training. Here, by series expansion, more parameters are available for optimization, which makes it possible to improve the prediction accuracy by using a very large number of training proteins. The lowest-energy subrotamer model also plays a key role in optimizing the parameters for OSCAR-o. We used OSCAR-d and the rotamer internal energy functions with optimized parameters to select the subrotamer with the lowest energy out of 60 possible states for each rotamer at the modeled position. The parameters of OSCAR-o were optimized so that the native conformation had a lower energy than the selected lowest-energy subrotamers. If only one subrotamer was generated and used, that is, if OSCAR-d was not used to select the lowest-energy subrotamer, there was little improvement by OSCAR-o over OSCAR-d.
It takes approximately 15 CPU (central processing unit) hours for OSCAR-o or NCN32 to model the side chain conformations for the 30 test proteins. By comparison, OPUS_Rota33 or SCWRL434 takes only five CPU minutes. Both OPUS_Rota and SCWRL4 use rigid rotamers. As a result, all rotamer-rotamer or rotamer-backbone interaction energies can be pre-calculated and employed directly in modeling the whole protein. Using a flexible rotamer model prevents OSCAR-o from employing pre-calculated values. Nevertheless, our energy functions are continuous and fast to calculate. In future studies, we will explore more efficient gradient-based search algorithms for modeling side chain conformations on a flexible backbone.
We also expanded the series to different orders for the distance-dependent energy functions, $\sum_{i=1}^{n} a_i d^{-2i}$. The accuracies of χ1 in predicting single residues were 85.0% (n = 2), 89.6% (n = 3), 90.3% (n = 4), 90.1% (n = 5), and 90.3% (n = 6), respectively, for the 30 test proteins. The prediction accuracy was not improved by a higher-order expansion (n > 4) and was even lower (<85%) when different sets of terms were used (i = 3, 4 or i = 1, 4). The Fourier series used to calculate the rotamer internal energy is expanded up to the third order because the rotatable bonds are connected to at least one sp3 hybridized atom. The prediction accuracy of χ1 decreased to 88.8% with a first-order expansion (t1 × cos α + t2 × sin α). For the orientation-dependent functions, we tried a simpler formula (b1 × cos θ + b2 × cos φ + b3 × cos ψ + C). The accuracy was the same as with eq. (4) in predicting side chain conformations for single residues. However, when the simple formula was used to model side chain conformations for the whole protein, the accuracy decreased slightly for the 218 test proteins (88.5% for χ1 and 79.1% for χ1 + 2). We preferred eq. (4) for the side chain conformation search because the simpler formula reduced the running time by only 10%. Nevertheless, it may be appropriate to use the simpler formula if the energy functions are used with gradient-based optimization methods in the future.
Including the native conformation among the generated subrotamers improves the prediction accuracy. For the 30 test proteins, the prediction accuracies of χ1 and χ1 + 2 for single-residue conformations by OSCAR-o increase by 0.9 and 1.7%, respectively. On the other hand, the prediction accuracy does not improve with more sampling (for example, generating 200 subrotamers instead of 60 for each rotamer and excluding the native conformation). This indicates that it is the energy function that limits the final accuracy.
In addition to energy functions and sampling techniques, other factors can affect the prediction accuracy. The accuracy of OPUS_Rota in Table 4 is lower than the reported values (89.0% for χ1 and 79.1% for χ1 + 2).33 This is mainly due to different evaluation methods. For residues with multiple conformations, we only compared the predicted conformation with the first conformation in the PDB file. In the report of Lu and coworkers, the predicted conformation was considered correct if it matched any of the alternative positions. Using the same evaluation method as this study, the prediction accuracy of OPUS_Rota decreased to 88.0% for χ1 and 77.8% for χ1 + 2 for their 65 test proteins. In addition, the prediction accuracy is protein dependent. For the same 65 test proteins, which also included our 30 test proteins, OSCAR-o achieved a relatively high accuracy (90.0% for χ1 and 81.1% for χ1 + 2). For the 43 small proteins, which were selected according to the same criteria as the 218 test proteins but with fewer than 100 evaluated residues, the prediction accuracy of OSCAR-o is only 85.0% for χ1 and 73.4% for χ1 + 2. Nevertheless, the prediction accuracy for core residues is still high (96.0% for χ1 and 92.0% for χ1 + 2). The decreased overall prediction accuracy is mostly due to the relatively low percentage of core residues in these small proteins.
Protein tertiary structure prediction is now more important than ever. However, the prediction accuracy is limited by the availability of high-quality energy functions. A complicated function, such as that describing the forces between atoms in a protein, can be decomposed as a mathematical series even if we do not know the analytical form in detail. We derived orientation-dependent knowledge-based atomic force fields (OSCAR-o) by series expansions. The solvation energy and entropic effects, which are among the most difficult terms to calculate, are not considered explicitly because the energy functions, in principle, include all known or unknown interactions. The parameters were optimized by discriminating the native side chain conformations from non-native conformations. When OSCAR-o was used to predict side chain conformations for single residues, the prediction accuracy in χ1 and χ1 + 2 was >5% higher than that of the AMBER or CHARMM force fields. We also used OSCAR-o to model side chain conformations of entire proteins. For 218 independent test proteins, the prediction accuracy was significantly higher (2% for χ1 and 4% for χ1 + 2) than that of the next-best performing side chain modeling program. Since the OSCAR-o energy functions are continuous, accurate, and fast to calculate, we expect a wide range of applications in protein structure prediction and protein design.
S.L. is grateful for the illuminating discussions with Haruki Nakamura (Osaka University).