Protein structure prediction by homology modeling or ab initio methods remains a difficult challenge. An important component of any modeling method is the prediction of side-chain conformations. Side-chain prediction usually entails placing side chains onto fixed backbone coordinates either obtained from a parent structure or generated from ab initio modeling simulations or a combination of these. Many side-chain prediction methods have been presented in the past 15 years (Summers and Karplus 1989; Holm and Sander 1991; Lee and Subbiah 1991; Tuffery et al. 1991; Desmet et al. 1992, 2002; Dunbrack and Karplus 1993; Wilson et al. 1993; Kono and Doi 1994; Laughton 1994; Hwang and Liao 1995; Koehl and Delarue 1995; Bower et al. 1997; Samudrala and Moult 1998; Dunbrack 1999; Mendes et al. 1999a; Xiang and Honig 2001; Liang and Grishin 2002).

Nearly all of these are based on using a rotamer library of discrete side-chain conformations (McGregor et al. 1987; Ponder and Richards 1987; Tuffery et al. 1991; Dunbrack and Karplus 1993, 1994; De Maeyer et al. 1997; Dunbrack and Cohen 1997; Lovell et al. 2000; Xiang and Honig 2001; Dunbrack 2002) obtained from statistical analysis of the Protein Data Bank (PDB; Berman et al. 2000). Rotamer libraries can either be backbone-independent or backbone-dependent. Backbone-dependent rotamer libraries contain information on side-chain dihedral angles and rotamer populations as a function of the backbone dihedral angles ϕ and ψ (Dunbrack and Karplus 1993, 1994; Dunbrack and Cohen 1997; Dunbrack 2002), whereas backbone-independent libraries ignore such dependence.

In side-chain prediction methods, rotamers are chosen based on the desired protein sequence and the given backbone coordinates, by using a defined energy function and search strategy. Energy functions that have been used in side-chain prediction include simple steric energy functions combined with log probabilities of backbone-dependent rotamer populations (Dunbrack 1999), as well as molecular mechanics potential energy functions with complex solvation free energy terms (Mendes et al. 2001a; Xiang and Honig 2001; Liang and Grishin 2002).

For some emerging uses of side-chain prediction, fast calculations are a requirement. These include Web-based homolog and fold-recognition programs that align query sequences to proteins of known structure and model proteins based on these identifications and alignments. For instance, 3D-PSSM (Kelley et al. 2000) uses SCWRL (Bower et al. 1997) to generate models of proteins from structure-derived profile alignments. As the numbers of available sequences and experimentally determined protein structures both increase, improved models of proteins can be constructed on the basis of a number of parent structures and many alternative alignments, as demonstrated at the recent CASP5 meeting. In some cases, hundreds of models can be used to determine the best parent and alignment for a single query sequence, thus necessitating fast side-chain prediction. The protein design problem, in which a sequence is designed that will fold into a given defined set of backbone coordinates (Desjarlais and Handel 1995; Dahiyat and Mayo 1996) also requires efficient side-chain search methods, because several residue types and many rotamers must be tested at every site of the protein. In this case, the combinatorial problem can be quite severe.

A search method used in side-chain prediction can be classified as either exact or approximate, depending on whether it is guaranteed to find the global minimum energy given the rotamer library and energy function, or instead whether it finds a low-energy conformation without such a guarantee. The many versions of the dead-end elimination algorithm (Desmet et al. 1992, 1997; Lasters and Desmet 1993; Goldstein 1994; Keller et al. 1995; Lasters et al. 1995; Gordon and Mayo 1999; Pierce et al. 1999; De Maeyer et al. 2000; Voigt et al. 2000; Looger and Hellinga 2001) are designed to find the global minimum energy, when they are able to converge. Conversely, Monte Carlo methods (Holm and Sander 1991; Liang and Grishin 2002), cyclical search methods (Dunbrack and Karplus 1993; Xiang and Honig 2001), and some other algorithms (Desmet et al. 2002) are not guaranteed to find a global minimum, but they will almost always find a low-energy conformation in a reasonable time.

We have continued to develop the SCWRL program, originally written by Michael Bower, over a number of years (Bower et al. 1997; Dunbrack 1999). SCWRL uses a backbone-dependent rotamer library (Dunbrack and Karplus 1993, 1994; Dunbrack and Cohen 1997; Dunbrack 2002) and an energy function based on log probabilities of these rotamers and a simple repulsive steric energy term. In this article, we will present a new algorithm for SCWRL based on graph theory, which is considerably faster than the original algorithm. The new algorithm will enable new uses such as protein design and ab initio structure prediction, as well as increases in accuracy with new energy functions and side-chain flexibility.

Before describing the new algorithm, it is useful to summarize the previous method used to solve the side-chain combinatorial problem in SCWRL as well as its drawbacks. The original SCWRL algorithm initially places side chains onto the backbone according to the backbone-dependent rotamer probabilities and steric interactions with the nonlocal backbone. Any side chains that clash with other side chains according to the steric energy function are then defined as “active.” These active residues are placed in each of their possible rotamers in turn, and if any rotamer clashes with any currently “inactive” residues, these residues become active as well. This procedure continues until a set of active residues is identified. The combinatorial problem is reduced to finding the minimum energy configuration of these active residues, because the inactive residues are in their minimum energy conformation. This is de facto a dead-end elimination of all rotamers except one for each inactive residue.

We can represent the side-chain combinatorial problem in terms of graph theory by making each residue a vertex in an undirected graph. If at least one rotamer of residue *i* interacts with at least one rotamer of residue *j*, then there is an edge between vertices *i* and *j* of the graph. In the original SCWRL algorithm, the active residues can be represented as a graph, one that is not necessarily connected. The active residues can therefore be grouped into interacting clusters. Residues in different clusters do not have contacts with one another. Each cluster is a connected subgraph of the entire graph. The combinatorial problem is therefore reduced to enumerating the combinations of rotamers for the residues in each connected graph. In SCWRL, the lowest energy of each cluster is found by using a backtracking algorithm with pruning (see Materials and Methods).

Some clusters are too large to be solved in a reasonable period of time. The solution to this problem (Bower et al. 1997) is to locate one residue (the “keystone”) that when removed from the cluster graph, breaks the graph into two separate subgraphs. This procedure is shown in Figure 1A. Assuming each residue has the same number of rotamers, *n*_{rot}, the number of combinations in this graph, is *n*^{11}_{rot}. The graph in Figure 1A has one residue labeled as a dark gray vertex that breaks the graph into two pieces when removed, as shown on the lower part of the figure. The global minimum of the energy can be found by identifying the minimum energy configuration for each subgraph once for each rotamer of the keystone residue. The solution is the one that finds the minimum energy using the equation

where *E*_{L}(*r*_{i}) = ^{min}*E*({*r*_{j}}|*r*_{i}) is the lowest energy combination of rotamers in the left subgraph (*L*) with the keystone residue *i* fixed in rotamer *r*_{i}, and *E*_{R} is the term for the right subgraph. *E*_{self} (*r*_{i}) is the energy of interaction of the side chain with the backbone and other fixed non–side-chain atoms, such as ligands. Splitting the graph results in *n*^{5}_{rot} + *n*^{7}_{rot} combinations. This process may result in subgraphs that are also too large to be solved in a reasonable time. SCWRL breaks these up by using a similar procedure, so that each subgraph is broken up for each rotamer of the first keystone residue and then for each rotamer of the second keystone residue, and so on.

In addition to the challenges described above for side-chain prediction, there are a number of other reasons for developing a new algorithm for SCWRL. First, when the backbone used for modeling is not the native backbone structure for the protein, the connectedness of side chains (i.e., the average number of edges per node in the graph) may be higher than for native backbones, because in general, there may not be sufficient volume available for all side chains. This is especially true at low sequence identity and in ab initio modeling, when backbone structures very different from the native are generated. For a small proportion of target structures to be modeled, SCWRL may not converge in any reasonable amount of time. Second, to improve SCWRL accuracy, it is likely to be necessary to implement an energy function that includes favorable van der Waals interactions, as well as electrostatic and solvation energy terms (Liang and Grishin 2002). Flexibility of side chains may also be necessary to improve the accuracy of large hydrophobic residues (Mendes et al. 1999a). These changes will make the connectedness problem even worse, and the current SCWRL algorithm is not robust enough to find a solution.

Third, SCWRL also uses a number of heuristic steps to reduce the number of rotamers and the number of interactions considered in the cluster-solving algorithm. These include removing rotamers with backbone (local and nonlocal) energies >5.0 kcal/mole, and not forming edges between residues in the graph unless there is some pair of rotamers with interaction energy >10.0 kcal/mole. The effect of using these heuristics is that SCWRL does not locate the global energy minimum of its own energy function, because many interactions are ignored when setting up the combinatorial searches to make the calculations tractable.

In this article, we propose a new algorithm for SCWRL based on graph theory that overcomes the combinatorial problem in a novel way, as illustrated in Figure 1B. Instead of recursively breaking up graphs into pieces as in previous versions of SCWRL (Fig. 1A), the new algorithm breaks up clusters of interacting side chains into the biconnected components of an undirected graph. Biconnected graphs are those that cannot be broken apart by removal of a single vertex. Biconnected graphs are cycles, nested cycles, or a single pair of residues connected by an edge (sometimes referred to as a bridge). In Figure 1B, the graph in Figure 1A is shown, before and after being broken up into biconnected components. This graph can be broken up into five biconnected components: A is residues V10, R15, Y21, and K17; B is residues K17 and T30; C is residues T30, Q25, M32, H29, and W11; D is residues W11 and F35; and E is residues W11 and D39.

Vertices that appear in more than one biconnected component are called “articulation points.” Removing an articulation point from a graph breaks the graph into two separate subgraphs. In this graph, vertices K17, T30, and W11 are articulation points. The order of an articulation point is the number of biconnected components of which it is a member minus one. A first-order articulation point connects only two components. Both K17 and T30 are first-order articulation points (Fig. 1B, light gray), whereas residue W11 is a second-order articulation point (Fig. 1B, dark gray). Finding biconnected components and their articulation points is easily accomplished by using an algorithm developed by Tarjan (1972). The algorithm uses a standard depth-first search algorithm from graph theory contained in many computer science textbooks.

To solve the side-chain combinatorial problem with biconnected components, we use the following simple procedure, illustrated in Figure 2. For each biconnected component with only one articulation point, we find the minimum energy over all combinations of rotamers of the residues in the component for each rotamer of the articulation point. This energy includes all interactions among these residues and between these residues and the fixed rotamer of the articulation point. So in the first panel of Figure 2, we calculate the minimum energy over all combinations of rotamers of residues V10, R15, and Y21 for each rotamer of residue K17. Once this is accomplished, we can collapse the biconnected component onto the articulation point, obtaining the graph in the second panel of Figure 2. That is, we can now consider residue K17 a superresidue with a number of superrotamers. If residue K17 has 10 rotamers, superresidue {K17, V10, R15, Y21} also has 10 rotamers. The energies of these superrotamers include both the self-energies of residue K17 and the minimum energy of the biconnected component consisting of residues V10, R15, and Y21. After collapse, we reduce the order of articulation point K17 by one, so it is now no longer an articulation point in the second panel (i.e., is of zeroth order).

We now proceed to another component with only one articulation point, for instance, component E. We find the minimum energy of E for each rotamer of W11 and collapse the component onto W11, making it the superresidue {W11, D39}, resulting in the third panel of Figure 2. After collapse, W11 is now of first order. We can also solve component D for each rotamer of articulation point W11, resulting in the superresidue {W11, D39, F35}. W11 is now no longer an articulation point (Fig. 2, fourth panel).

At this point, after collapsing several biconnected components, we have two components that now have only one articulation point, whereas before they had two each. We solve the first one, B, for each rotamer of residue T30 and form the superresidue {T30, K17, V10, R15, Y21}. We are left with one biconnected component with no articulation points. We find the minimum energy of this component over all combinations of its rotamers.

The procedure as just described always converges into a single biconnected component or a single residue, each rotamer of which now contains information about all other residues in the cluster as well as the total energy. The order of complexity of the side-chain problem is now reduced to the size of the largest biconnected component in any cluster.

To achieve reasonably sized biconnected components, the new program, referred to here as SCWRL3.0, begins with a dead-end elimination (DEE) step, based on the simple Goldstein criterion (Goldstein 1994). This DEE step replaces the heuristics in SCWRL previously used to remove many high-energy rotamers not likely to be part of the minimum-energy configuration. Because the simple DEE criterion is not sufficient to solve the combinatorial problem on its own for reasonably sized proteins, a number of recent papers have presented complex DEE schedules that remove clusters of rotamers. However, the new SCWRL algorithm based on biconnected components stands in contrast to recent DEE implementations (Pierce et al. 1999; De Maeyer et al. 2000; Looger and Hellinga 2001) in its simplicity and speed.

We show that the new algorithm is extremely fast even for very large proteins. We evaluate the accuracy and speed of the new algorithm, as implemented in SCWRL3.0. The new algorithm will allow for new applications and for the development of more accurate energy functions and consideration of side-chain flexibility without a significant impact on computational times.