Energy functions for protein design I: Efficient and accurate continuum electrostatics and solvation


  • Navin Pokala,

    Corresponding author
    1. Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, California 94720-3206, USA
    • Department of Molecular and Cell Biology, University of California, Berkeley, 237 Hildebrand Hall, Berkeley, CA 94720-3206, USA; fax: (510) 643-9321.
  • Tracy M. Handel

    Corresponding author
    1. Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, California 94720-3206, USA
    • Department of Molecular and Cell Biology, University of California, Berkeley, 237 Hildebrand Hall, Berkeley, CA 94720-3206, USA; fax: (510) 643-9321.


Electrostatics and solvation energies are important for defining protein stability, structural specificity, and molecular recognition. Because these energies are difficult to compute quickly and accurately, they are often ignored or modeled very crudely in computational protein design. To address this problem, we have developed a simple, fast, and accurate approximation for calculating Born radii in the context of protein design calculations. When these approximate Born radii are used with the generalized Born continuum dielectric model, energies calculated by the 10⁶-fold slower finite difference Poisson-Boltzmann model are faithfully reproduced. A similar approach can be used for estimating solvent-accessible surface areas (SASAs). As an independent test, we show that these approximations can be used to accurately predict the experimentally determined pKas of >200 ionizable groups from 15 proteins.

Protein design algorithms seek to identify amino acid sequences that have low energies for target structures. Although the sequence-conformation space that needs to be searched is large, the most challenging requirement for computational protein design is a fast yet accurate energy function that can distinguish optimal sequences from similar suboptimal ones (for review, see Pokala and Handel 2001; Mendes et al. 2002). Here and elsewhere (N. Pokala and T.M. Handel, in prep.), we describe the development and testing of the EGAD (Egad! A Genetic Algorithm for Protein Design!) program and energy function for protein design.

The energy function that describes the predicted stability of a sequence threaded onto a structure is:

$$\Delta G = E_{\mathrm{forcefield}} + \Delta G_{\mathrm{solvation}} - G_{\mathrm{reference}} \tag{1}$$

where ΔG is the predicted stability, Eforcefield is the molecular mechanics force-field energy (van der Waals, torsion, and Coulombic electrostatics; equivalent to enthalpy in constant temperature, pressure, and volume simulations), ΔGsolvation is the solvation energy, and Greference is the reference (unfolded) state energy, which includes the enthalpy and conformational entropy of the unfolded state. The Eforcefield term describes the interactions between protein atoms, and is parameterized with quantum calculations and experiments performed on small molecules in vacuo (Jorgensen et al. 1996). Given that proteins are macromolecules dissolved in water, the energy function must also take into account the energy required to solvate the molecule in water (ΔGsolvation). The solvation energy has two primary components, the hydrophobic effect and the solvation of charged/polar groups, and is the primary topic of the work presented here.

As Ponder and Richards described several years ago, the protein design problem can be greatly simplified by keeping the target backbone in a fixed conformation, and by restricting side-chain conformations to discrete, experimentally observed rotamers (Ponder and Richards 1987). Despite these simplifications, protein design calculations require many energy evaluation steps; stochastic optimization methods such as simulated annealing can require in excess of 1 million molecular energy calculations.

The time required for building structures and calculating interatom distances for energy calculations during sequence/rotamer optimization is significant and is often not practical for larger problems. A far more efficient approach is to decompose the total energy into rotamer–rotamer and rotamer–backbone energies, and add up these precalculated pairwise partial energies as needed during the optimization:

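$$\Delta G = \sum_{i} \left( \Delta G_{i\_\mathrm{internal}} + \Delta G_{i\_\mathrm{bkbn}} \right) + \sum_{i} \sum_{j>i} \Delta G_{ij} \tag{2}$$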

where ΔGi_internal is the sum of the solvation, reference, and intrinsic energies of the rotamer at position i, ΔGi_bkbn is the interaction energy between the rotamer at i and the backbone, and ΔGij is the interaction energy between the rotamers at positions i and j. This strategy reduces the time required for a design calculation from weeks or days to minutes or seconds. Although this decomposition simply makes calculations more efficient for stochastic methods, it is a requirement for methods such as dead-end elimination and self-consistent mean field optimization (SCMF), which need energies that are decomposable into the sum of pair energies (Voigt et al. 2000).
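The table lookup described above can be illustrated with a minimal Python sketch. It assumes the decomposition is a plain sum of per-rotamer internal and backbone terms plus one term per pair of positions; the function name, the dictionary layout, and the toy energies are hypothetical and are not taken from EGAD.

```python
from itertools import combinations


def total_energy(assignment, internal, bkbn, pair):
    """Sum precalculated energies for one rotamer assignment.

    assignment: dict mapping position -> rotamer id
    internal, bkbn: dicts mapping (position, rotamer) -> energy
    pair: dict mapping (pos_i, rot_i, pos_j, rot_j) -> energy, pos_i < pos_j;
          missing entries are treated as non-interacting pairs (0.0).
    """
    total = 0.0
    # Single-position terms: internal energy plus rotamer-backbone energy.
    for pos, rot in assignment.items():
        total += internal[(pos, rot)] + bkbn[(pos, rot)]
    # Pair terms: each unordered pair of positions is counted once.
    for i, j in combinations(sorted(assignment), 2):
        total += pair.get((i, assignment[i], j, assignment[j]), 0.0)
    return total


# Toy tables with made-up energies (arbitrary units).
internal = {(0, "a"): -1.0, (0, "b"): -0.5, (1, "a"): 0.25, (2, "a"): -2.0}
bkbn = {(0, "a"): 0.5, (0, "b"): -1.0, (1, "a"): -0.75, (2, "a"): 1.0}
pair = {(0, "a", 1, "a"): -0.25, (0, "b", 1, "a"): 0.5, (1, "a", 2, "a"): -1.5}

energy_a = total_energy({0: "a", 1: "a", 2: "a"}, internal, bkbn, pair)
energy_b = total_energy({0: "b", 1: "a", 2: "a"}, internal, bkbn, pair)
```

Because every term is read from a table, swapping the rotamer at one position changes only that position's single-body terms and its pair terms, which is what makes repeated evaluation during optimization cheap.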

At no time during the calculation of rotamer pair or rotamer–backbone energies are side-chain atoms at other positions present, because their identities have not yet been determined. This poses a problem for the calculation of electrostatics energies. Individual charges placed in a protein affect the stability in a manner sensitive to the local environment. Electrostatic interactions between atoms are also dependent on the local environment; although buried charge pairs often have more favorable interaction energies than equivalent solvent-exposed pairs, they are frequently offset by the unfavorable charge desolvation energies (Hendsch and Tidor 1994; Waldburger et al. 1995; Hendsch et al. 1996; Daune 1999). Because a complete molecule never exists during the pair-energy calculations, it is not straightforward to define the environment for a given atom.

As a first approximation, conventional models of electrostatics and solvation simply forbid polar groups from occupying buried positions, or attach a large penalty for burying them without hydrogen bonds to the backbone (Raha et al. 2000; Marshall and Mayo 2001). Interactions between atom pairs are described by distance-dependent dielectric constants.

Because these models are only dependent on the distances between atom pairs, but not their environments, they are readily decomposable into pairwise terms. However, they have poor agreement with the finite-difference Poisson-Boltzmann (FDPB) continuum model, and do not reflect experimentally measured stabilities well (Fersht and Sternberg 1989; Mehler 1996; Edinger et al. 1997; David et al. 2000; Marshall et al. 2002). Nevertheless, because they reflect the well-established observation that burial of polar groups often destabilizes proteins, these simple models have proven useful for protein design (Stites et al. 1991; Lim et al. 1992; Blaber et al. 1993; Hendsch and Tidor 1994; Waldburger et al. 1995; Hendsch et al. 1996). Sequences that share features of natural sequences for a given fold have been designed using simple solvation and electrostatics models (Koehl and Levitt 1999; Raha et al. 2000; Jaramillo et al. 2002). Experimentally, some protein variants designed with these simple models are even more stable than the natural protein (Dahiyat and Mayo 1997; Malakauskas and Mayo 1998; Marshall and Mayo 2001). It has also been shown that simple models are sufficient for discriminating native structures from near-native decoys (Morozov et al. 2003), although there is some controversy about the significance of this (Felts et al. 2002). These results, as well as the problem of pairwise-decomposability, raise the question of whether modeling electrostatics and solvation accurately in an environment-dependent manner is really that important for protein design.

Despite the remarkable successes of conventional models, it may be better to use the more accurate environment-dependent models for describing solvation energies. Stabilizing and destabilizing interactions must be balanced to achieve specificity in folding and molecular recognition, and this balance is often achieved by the burial of polar groups. For example, a buried asparagine has been shown to be important for specifying unique topologies in leucine zippers. Replacing this asparagine with nonpolar groups “stabilizes” the folded state with respect to the unfolded state, but the folded state becomes an ensemble of dimers, trimers, and tetramers of various orientations, thus destabilizing the specific parallel dimer structure (Lumb and Kim 1995; Gonzalez et al. 1996a,b). Similar buried polar interactions were designed to switch a parallel coiled coil to an antiparallel structure, and appear to be important for the structural specificity of thioredoxin (Oakley and Kim 1998; Bolon and Mayo 2001). However, many present design protocols exclude polar residues from core positions (Marshall and Mayo 2001).

In other situations, buried polar groups are required for conformational switching. For example, the EF-hand family of Ca²⁺-binding proteins falls into two classes: those that undergo conformation changes upon Ca²⁺ binding, and those that do not. At two buried positions, the former have polar residues, whereas the latter usually have hydrophobic groups. Upon Ca²⁺ binding, the polar residues become exposed. As expected, replacement of polar residues with hydrophobic residues in a protein that normally undergoes a conformational change stabilizes the molecule, but the ability to undergo a ligand-dependent conformational change is curtailed. The energy required for conformational switching is “paid for” by exposing the previously buried polar groups, relieving the destabilization due to desolvation. However, the destabilization cannot be too great, otherwise the protein would not have folded in the first place (Ababou and Desjarlais 2001).

Changes in the accessibility of polar groups have also been suggested as a source of binding energy in protein–protein complexes. In the unbound form, an intramolecular interaction on the surface of one of the binding partners may be neutral or only slightly stabilizing, owing to the large shielding effect of solvent. However, when the interaction is desolvated in the complex, the interaction energy increases in magnitude, because the solvent shielding decreases. In some situations, this favorable interaction energy may be able to overcome the desolvation penalty associated with burying the polar groups (Chong et al. 1998).

It is also important to model electrostatics at protein surfaces accurately. For example, although there are ion pairs between the two subunits in the fos–jun coiled-coil heterodimer, these interactions are not particularly strong in the heterodimer structure. However, they destabilize the alternative homodimer structures, thus driving heterodimer formation (O'Shea et al. 1992). This strategy was mimicked in part to rationally design coiled-coil heterodimers (O'Shea et al. 1993) and heterotrimers (Nautiyal et al. 1995). A similar strategy was used to rationally design heterodimer variants of Arc repressor by charge substitutions at the dimer interface; the FDPB model was shown to accurately predict the relative homo- and heterodimerization specificities for this series of variants (Nohaile et al. 2001). Whereas conventional models often fail to accurately predict electrostatic energies at protein surfaces, the continuum model is able to do so (Marshall et al. 2002; Havranek and Harbury 2003).

To design systems that are as delicately balanced as these examples, accurate and quantitative models of electrostatics are required. Because conventional models severely penalize or exclude buried polar groups, and fail to capture interatom electrostatic energies accurately, it is unlikely that these models would be able to design such systems.

As discussed above, unlike the conventional environment-independent electrostatics models, environment-dependent models are not readily decomposed into sums of rotamer pair energies, because they are dependent on multibody interactions. To address this issue, we have developed a conceptually simple approach for capturing the environment dependence and for estimating the surface area of all atoms in all rotamers during the rotamer–backbone energy precalculation step: Pseudoatoms are used to occupy volume that would likely be occupied by real atoms in a full structure. This strategy approximates FDPB energies and solvent-accessible surface areas (SASA) quite well, and extends the utility of the continuum electrostatic model to deterministic search methods that require rotamer pair energies. As an independent test, we show that these approximations retain enough accuracy for calculating pKas of ionizable groups in proteins, an important experimental benchmark for evaluating protein electrostatic models (Antosiewicz et al. 1994).


Additive solvent-accessible surface areas

The transfer energies of compounds from vacuum to water or from a nonpolar solvent to water are described by an SASA-dependent energy function (Chothia 1975; Eisenberg and McLachlan 1986; Sitkoff et al. 1994; Street and Mayo 1998):

$$\Delta G_{\mathrm{SASA}} = \sum_{i} \gamma_i A_i \tag{3}$$

where ΔGSASA is the SASA-dependent hydrophobic solvation energy, γi is the atomic solvation parameter for atom i, and Ai is the SASA of i.
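As a small worked illustration, the following Python sketch evaluates an SASA-dependent energy, assuming equation 3 is the usual linear sum of each atom's solvation parameter times its accessible area. The function name and the numerical values are hypothetical placeholders, not published parameters.

```python
def sasa_energy(areas, gammas):
    """Return sum_i gamma_i * A_i for matched lists of atomic values.

    areas: solvent-accessible surface area per atom (e.g., in square angstroms)
    gammas: atomic solvation parameter per atom (energy per unit area)
    """
    if len(areas) != len(gammas):
        raise ValueError("areas and gammas must have the same length")
    return sum(g * a for g, a in zip(gammas, areas))


# Two hypothetical atoms: one penalized and one rewarded for exposure.
example_energy = sasa_energy(areas=[40.0, 8.0], gammas=[0.25, -0.5])
```

A fully buried atom has an area of zero and therefore contributes nothing, regardless of its solvation parameter.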

The SASA for a given atom is dependent on other atoms in a molecule. For pair energy calculations, a complete molecule never exists. As first described by Wodak and Janin (1980) and later elaborated by Street and Mayo (1998), one can approximate the surface area by adding up the surface area buried between a rotamer and the backbone and individual rotamer pairs, using empirical scale factors to account for the overcounting of areas buried by multiple atoms. We describe here an approximation that is additive, and is therefore faster than the pairwise methods, because the number of surface-area calculations scales linearly rather than quadratically with the number of positions (see Discussion).

As shown schematically in Figure 1, the SASA of a given rotamer in a complete molecule can be estimated by calculating the SASA of that rotamer in the context of a molecule in which the “missing” side-chain atoms are mimicked with enlarged backbone pseudoatoms. In this scheme, the SASA of a rotamer is calculated once during the rotamer–backbone calculation step, and ΔGSASA is calculated with equation 3 and added to ΔGi_internal in equation 2.

The radii of the enlarged backbone pseudoatoms were determined empirically, as described below (Materials and Methods; Supplemental Table 1). Despite the simplicity of this approximation, it appears to work reasonably well for atomic SASAs and protein side-chain energies (Fig. 2).

The additive method is admittedly not as accurate as the pairwise method (R = 0.95 versus R = 1.0; Street and Mayo 1998). However, because the ΔGSASA energies are small compared with other terms in the energy function, and the errors are comparable in magnitude to the error in our electrostatics model, the increase in speed offsets the decreased accuracy.

Generalized Born model for electrostatics

The Poisson-Boltzmann (PB) continuum dielectric model continues to be the gold standard for environment-dependent electrostatics and solvation calculations (for review, see Honig et al. 1993). This model assumes that the protein and water system can be treated as a low dielectric charged object (protein) in a high dielectric medium (water). In contrast to many other protein electrostatic models, this model does not require any empirical parameters, other than the dielectric constant, yet it can predict many experimental values that are dependent on electrostatics and solvation, such as transfer free energies, pKas of protein side chains (Bashford and Karplus 1990; Antosiewicz et al. 1994; Sitkoff et al. 1994), and the effects of charge substitutions on protein stability (Spector et al. 2000; Marshall et al. 2002). Unfortunately, numerically solving the PB model by the finite-difference method (FDPB) is too slow for use in protein design calculations (see Discussion). Nevertheless, it provides a benchmark to calibrate and compare faster methods.

A fast approximation to FDPB is the generalized Born (GB) family of models. A Born radius of an atom is analogous to an ionic radius, and can be thought of as a measure of the depth of an atom within a molecule; atoms buried more deeply in a molecule have larger Born radii than atoms near the solvent-accessible surface. Briefly, this model uses a Born radius α for each atom to confer environment-dependence on self-energies (ΔGself) and charge pair energies (ΔGCoulomb + ΔGpair; Still et al. 1990):

equation image((4a))
equation image((4b))
equation image((4c))


equation image

kc is Coulomb's constant, r is the distance between atoms i and j, εp is the dielectric constant of the molecule (protein), εw is the dielectric constant of the solvent (water), qi is the charge of atom i, and fGB is some f(r, α) for describing the screening of pair interactions.

To estimate the environment-dependence of interatom energies, Still et al. (1990) proposed to use some arbitrary fGB = f(r, α) as the denominator for the pairwise shielding term. Although it may not be possible to define an optimal general purpose form of fGB (Onufriev et al. 2002), an effective and commonly used form is the one originally proposed by Still and coworkers (Still et al. 1990):

$$f_{GB} = \sqrt{r^{2} + \alpha_i \alpha_j \exp\left(-\frac{r^{2}}{4 \alpha_i \alpha_j}\right)} \tag{5a}$$

However, we and others have found that a simpler form appears to work marginally better for estimating interatom energies (Supplemental Fig. 1), and does not require the computationally expensive exponential function (Grycuk 2003):

$$f_{GB} = \sqrt{r^{2} + \alpha_i \alpha_j} \tag{5b}$$
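The two screening functions and the energy terms they enter can be compared with a short Python sketch. It assumes the standard generalized Born expressions of Still et al. (1990) for the self, pair, and Coulomb terms, and reads the simpler screening function as the Still form with the exponential factor dropped; both are assumptions, since the equations themselves are not reproduced in this text. The value of Coulomb's constant (in kcal·Å/(mol·e²)) and all function names are also assumptions.

```python
import math

K_C = 332.06  # assumed Coulomb constant, kcal*angstrom/(mol*e^2)


def f_gb_still(r, a_i, a_j):
    """Screening function of Still et al. (1990), with the exponential."""
    aa = a_i * a_j
    return math.sqrt(r * r + aa * math.exp(-r * r / (4.0 * aa)))


def f_gb_simple(r, a_i, a_j):
    """Assumed simpler form: the same expression without the exponential."""
    return math.sqrt(r * r + a_i * a_j)


def gb_self(q, alpha, eps_p, eps_w):
    """Born self-energy of charge q with Born radius alpha."""
    return -0.5 * K_C * (1.0 / eps_p - 1.0 / eps_w) * q * q / alpha


def gb_pair(q_i, q_j, r, a_i, a_j, eps_p, eps_w, f_gb=f_gb_simple):
    """Solvent-screening correction for a charge pair."""
    return -K_C * (1.0 / eps_p - 1.0 / eps_w) * q_i * q_j / f_gb(r, a_i, a_j)


def coulomb(q_i, q_j, r, eps_p):
    """Coulomb energy of a charge pair in the protein dielectric."""
    return K_C * q_i * q_j / (eps_p * r)
```

Both forms reduce to the geometric mean of the two Born radii at zero separation and approach r at large separation, where the Coulomb and pair terms together approach the Coulomb energy in water. They differ only at intermediate distances.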

Methods to analytically calculate Born radii directly from structure have been developed (Hawkins et al. 1995; Schaefer and Karplus 1996; Qiu et al. 1997; Dominy and Brooks 1999; Onufriev et al. 2000). In these implementations, the Born radii are dependent on the volumes and distances of neighboring atoms via empirical factors fit to FDPB-calculated energies.

The self-energy of a charged isolated atom i of radius α (ΔGisolated) is given by the classical Born model (the ΔGself term in equation 4c). However, when the atom is in a molecule, the other atoms j ≠ i displace water, reducing the favorable r⁻⁴-dependent interaction energy between the solvent dipoles and the charged atom (ΔGwater_displaced_by_j; Daune 1999). Therefore, the energy of atom i in the presence of other atoms j can be approximated by:

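$$\Delta G_{\mathrm{self},i} = \Delta G_{\mathrm{isolated}} + \sum_{j \neq i} \Delta G_{\mathrm{water\_displaced\_by\_}j} \tag{6a}$$

$$\Delta G_{\mathrm{water\_displaced\_by\_}j} \propto \frac{V_j}{r_{ij}^{4}} \tag{6b}$$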

and Vj is the van der Waals volume of atom j (j ≠ i).

The self-energy of atom i can be analytically calculated from the structure, and the Born radius determined from the ΔGself term in equation 4c. Qiu et al. (1997) proposed that the proportionality constant in equation 6b could be empirically determined by fitting to atomic self-energies numerically calculated by FDPB. It was found that good agreement between the numerical Born radii and the analytical Born radii was not possible unless bond connectivity was taken into account. Special treatment was also required for atom pairs that were overlapped. The implementation used in EGAD is the variation described by Dominy and Brooks (1999):

equation image((7))


equation image

k1–6 are empirical parameters, and Ri is the van der Waals radius of atom i.
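The step from an atomic self-energy back to a Born radius, mentioned above, can be sketched in Python as an inversion of the classical Born expression. The form of the self-energy, the value of Coulomb's constant, and the function names are assumptions made for illustration. The sketch does not reproduce the parameterized expression of equation 7.

```python
K_C = 332.06  # assumed Coulomb constant, kcal*angstrom/(mol*e^2)


def born_self_energy(q, alpha, eps_p, eps_w):
    """Classical Born self-energy of charge q with effective radius alpha."""
    return -0.5 * K_C * (1.0 / eps_p - 1.0 / eps_w) * q * q / alpha


def born_radius_from_self_energy(q, dg_self, eps_p, eps_w):
    """Invert the Born expression to recover the effective radius."""
    if q == 0.0 or dg_self == 0.0:
        raise ValueError("radius is undefined for zero charge or zero energy")
    return -0.5 * K_C * (1.0 / eps_p - 1.0 / eps_w) * q * q / dg_self


# Round trip for a hypothetical atom with charge 0.5 e and radius 2.5 angstroms.
dg = born_self_energy(0.5, 2.5, eps_p=8.0, eps_w=80.0)
radius = born_radius_from_self_energy(0.5, dg, eps_p=8.0, eps_w=80.0)
```

A less favorable (less negative) self-energy maps to a larger radius, consistent with the description of deeply buried atoms having larger Born radii.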

Because a set of parameters consistent with the OPLS-AA force field was not available, we determined the empirical parameters in equation 7 using free amino acid, protein, and peptide structures in the basis set (see Materials and Methods). As shown in Figure 3A, the fitting procedure worked well to estimate the self-energies in the basis set, with an error comparable to that found by others (Qiu et al. 1997; Dominy and Brooks 1999; Onufriev et al. 2000).

Calculation of approximate Born radii

As shown in equation 7, the self-energy (and therefore the Born radius) of an atom is dependent on other atoms in the structure. However, as discussed above, if pairwise-decomposability is required, the only atoms present for any given energy calculation are the rotamer and backbone atoms, or the atoms for a pair of rotamers.

As shown schematically in Figure 1, one can estimate the Born radii that the atoms in a given rotamer will have in a complete molecule by calculating the Born radii for that rotamer in the context of a molecule in which the “missing” side-chain atoms have been replaced with pseudoatoms, and by scaling the effect backbone atoms will have on an atom's Born radius by a factor kBB:

equation image((8))

As described for the approximate SASAs, the Born radius of each atom in each rotamer is calculated once during the rotamer–backbone calculation step, and the self-energy calculated with equation 8 and added to ΔGi_internal in equation 2. These Born radii are saved and used for calculating the shielding energy for atom pair interactions during rotamer pair calculations.

As shown in the inset of Figure 3B, not including pseudoatoms with an optimal kBB results in very poor agreement with the complete GB model. kBB and the side-chain pseudoatom radius were fit to atomic self-energies in a manner similar to that used to fit the parameters in equation 7 (Fig. 3B). Importantly, there is excellent agreement between the approximate Born method and FDPB for rotamer self- and rotamer pair energies (Fig. 4).

pKa predictions for assessing the accuracy of the approximate GB model

The prediction of pKa shifts is a rigorous test of the accuracy of a protein electrostatics model (Antosiewicz et al. 1994). The pKas of ionizable groups in proteins are often shifted from the value they would have in isolation (pKa0). The magnitude and direction of this shift (ΔpKa) for a group is thermodynamically coupled to the electrostatic environment of that group:

equation image((9))


equation image

z is the charge of the group, R is the gas constant, and the ΔGenvironments are the electrostatic energies of the group in the protein (ΔGprotein) or free in water (ΔGwater). By calculating the ΔGenvironments, it is possible to calculate the pKa of a group in the protein. If one assumes that ionization has little effect on van der Waals interactions, the pKa shift of a group in the protein can be assumed to be completely attributable to the difference in electrostatic free energy between the charged and neutral states in the protein (Tanford and Kirkwood 1957; Bashford and Karplus 1990; Antosiewicz et al. 1994).

Although this calculation is straightforward for a single ionizable site, multiple-site calculations are complicated by the fact that the electrostatic energy for each ionizable group is dependent on the ionization state of the other groups. To calculate the pKas of ionizable groups in such a system, the protonation state for each side chain must be calculated as a function of pH. The pKa can then be defined as the pH at which the charged and neutral states are populated equally. Usually, this is done using a static structure and variable charge states for each group that are optimized by self-consistent mean field optimization (Tanford and Roxby 1972). Alternatively, charge states can be Monte Carlo sampled at each pH. Conventionally, these calculations are performed using the FDPB model to calculate the required energies. Because it is such a time-consuming method, static structures are usually used.
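The definition of the pKa as the pH at which the charged and neutral states are equally populated can be illustrated with a minimal Python sketch. In an actual multiple-site calculation the population curve would come from the mean-field or Monte Carlo sampling described above; here a single-site Henderson-Hasselbalch curve stands in for it, which is an assumption made only to keep the example self-contained. The function names are hypothetical.

```python
def fraction_charged_acid(ph, pka):
    """Charged (deprotonated) fraction of an isolated acidic group."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def half_titration_ph(ph_values, charged_fractions):
    """Linearly interpolate the pH at which the charged fraction crosses 0.5.

    Returns None if the curve never crosses 0.5 on the sampled grid.
    """
    for k in range(len(ph_values) - 1):
        f0, f1 = charged_fractions[k], charged_fractions[k + 1]
        if f0 != f1 and (f0 - 0.5) * (f1 - 0.5) <= 0.0:
            ph0, ph1 = ph_values[k], ph_values[k + 1]
            return ph0 + (0.5 - f0) * (ph1 - ph0) / (f1 - f0)
    return None


# Sample a titration curve every 0.25 pH units from pH 0 to 14.
ph_grid = [0.25 * k for k in range(57)]
curve = [fraction_charged_acid(ph, 4.1) for ph in ph_grid]
estimate = half_titration_ph(ph_grid, curve)
```

The interpolated estimate falls close to the input value of 4.1. With interacting sites the curve need not follow the single-site shape, which is why the half-population point is taken from the sampled curve rather than from a closed-form expression.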

The self-consistent mean field (SCMF) method for rotamer optimization indicates a simple way to take into account protein side-chain dynamics for predicting pKas (Koehl and Delarue 1994; Havranek and Harbury 1999). Rotamers for both the charged and neutral forms are considered, and each form is assigned the energy required to transfer it from the reference state (free terminally blocked amino acid) to the protein. Based on the pH, the energies of rotamers for the unfavored form are assigned an additional intrinsic energy that takes into account the chemical energy ΔGchemical required to occupy that state:

equation image((10))

Like the static structure charge optimization described above, the rotamer optimization is done as a function of pH. Until recently, such calculations were not tractable, because the FDPB model is too time-consuming for rotamer-based calculations. However, the advent of faster models such as our approximate GB model and the modified Tanford-Kirkwood model permits this (Havranek and Harbury 1999).

We used a subset of five proteins to identify an optimal protein dielectric (εp) for the scheme outlined here (Fig. 5A). Although the optimal εp = 8 is lower than the εp = 20 needed for an equivalent level of accuracy with FDPB using static structures (Antosiewicz et al. 1994), it is higher than the εp = 4, which is commonly thought to be the best estimate for a protein's dielectric response behavior, based on the bulk dielectric of acetamide and dry protein powders (Bashford and Karplus 1990; Honig et al. 1993; Havranek and Harbury 1999). However, as discussed at length by Schutz and Warshel (2001), the value of a protein dielectric in continuum models may be nothing more than a model and algorithm-dependent empirical factor that has little physical meaning. They point out that for detailed free energy perturbation simulations that include explicit waters, the relevant dielectric is εp = 1; if the larger values of εp used in continuum calculations actually reflect the true polarizability of proteins, then a larger εp should also be required for the all-atom calculations (Schutz and Warshel 2001). Because our calculations model side-chain movements, it makes sense that the optimal εp = 8 required here is lower than the εp = 20 required for static calculations. However, because the side-chain movements are modeled discretely, and backbone movements are not considered at all, it is not surprising that the εp required here is >1.

As shown in Figure 5B and Supplemental Table 2, the method described here predicts pKas accurately, with an RMSD of 0.92 pH units for all 226 data points from 15 proteins. If a few outliers (17 data points) are removed, the error drops by 23% to 0.71 pH units. These results are comparable to FDPB-based calculations, and in contrast to some electrostatic models, are significantly superior to the trivial Null (no-shift) model (Antosiewicz et al. 1994). Note that Lys residues were not included in Figure 5B, as these can exaggerate the apparent accuracy (see Materials and Methods).

Significantly, our model accurately predicts residues with large pKa shifts. Because electrostatically strained residues are often functionally important, being able to predict these energies accurately may be especially important for the design of functional proteins, and for the prediction of functional residues in known protein structures (Shoichet et al. 1995; Elcock 2001; Ondrechen et al. 2001; Forsyth et al. 2002). It is also worth noting that this analysis assumes that the backbone structure is unperturbed over the pH range of the experiment, an assumption that may not be completely valid. Structural perturbations due to a change in ionization of a single group can be propagated, affecting chemical shifts elsewhere. In some cases, there are highly cooperative or multiphase pH-dependent transitions for which pKas are poorly defined. Based on these caveats, the level of agreement that exists between the calculated and experimental values is quite good, validating our model.


Comparison with other models

It is often intractable to explicitly consider individual water molecules, especially for protein design calculations. However, a useful approximation is to treat the solvent implicitly as a high dielectric continuum, and the protein as a low dielectric charged object within it (Tanford and Roxby 1972). As discussed above, the environment-dependency of the continuum model quantitatively reflects the experimentally observed properties of charged groups in proteins.

There are two major classes of environment-dependent electrostatics/solvation models: empirical and continuum. As the names imply, empirical models are dependent on experimental data for parameterization, whereas continuum models, such as GB, are approximations to the FDPB continuum dielectric model. The empirical models treat desolvation energies with empirical metrics that implicitly reflect the continuum description of desolvation, whereas continuum models do so explicitly.

The so-called Effective Energy Function 1 (EEF1) model defines desolvation of an atom as dependent on the volumes of neighboring atoms. The desolvation energies are defined by parameters derived from experimental small-molecule transfer energy data (Lazaridis and Karplus 1999). The successful design of native-like sequences using the desolvation energy model as part of a protein-design energy function has been reported (Kuhlman and Baker 2000; Nauli et al. 2001; Dantas et al. 2003). Although this model does not describe electrostatic pair energies in an environment-dependent manner per se, an empirical effective dielectric function that uses the desolvation information to approximate FDPB-calculated pair energies has recently been described (Mallik et al. 2002).

Two models have recently been described that depend on parameterization with experimental, structural, and thermodynamic protein data. The FOLD-X model treats interatom interactions and solvation as being dependent on a scaled atom occlusion term, and has parameters fit with protein mutant data simultaneously with other nonelectrostatic energies, such as van der Waals (Guerois et al. 2002). The model described by Wisz and Hellinga (2003) is parameterized with experimental pKa shift data. It divides a protein into core, boundary, and surface regions, uses several empirical coefficients and dielectric constants to calculate solvation and interaction energies in an environment-dependent manner, and has been used as part of an energy function to design ligand-binding proteins (Looger et al. 2003; Wisz and Hellinga 2003).

A continuum model that has been used successfully for protein design is the Tanford-Kirkwood model. This model approximates the solvent-excluded volume of a protein as a sphere, an object for which the PB model can be solved analytically (Tanford and Kirkwood 1957). Although simple, this model has been used by Makhatadze and colleagues to identify electrostatically strained sites in ubiquitin, which were then stabilized by generating appropriate mutations (Loladze et al. 1999). A major improvement has been a better mapping of protein charges to equivalent locations in a sphere. This modified Tanford-Kirkwood (MTK) model has even better agreement with FDPB, enables accurate prediction of pKa shifts in proteins, and has been used for the automated design of specific coiled-coil heterodimers (Havranek and Harbury 1999, 2003).

Continuum models such as the MTK and GB that approximate FDPB energies may be more broadly applicable for protein design than the empirical models. The continuum model is theoretical, conceptually simple, and except for the dielectric constant, charges, and atomic radii (the last two can be provided by the molecular mechanics force field), does not require any additional empirical parameters. The empirical models are much more complex and heuristic. For example, the Wisz and Hellinga (2003) model requires 24 parameters fit to 260 experimental pKa values, and each of these is dependent on only a small subset of the data. In contrast, the GB model described here has only eight parameters, and each of these is constrained by all ∼9000 FDPB-calculated theoretical values in the basis set. Nevertheless, our theoretical model is able to predict pKas as accurately as the empirical model parameterized directly with pKa data.

Of course, the even more theoretically rigorous MTK model does not require any fit parameters (Havranek and Harbury 1999). If the goal is to stay as close as possible to the underlying physics, the MTK method is clearly superior. However, as discussed below, because the method described here is approximately three orders of magnitude faster than the MTK model (which is itself three orders of magnitude faster than FDPB), the approximate GB model can more easily address larger systems with modest computational resources.

Computational efficiency

The total time required for calculating pairwise-decomposable energies is:

$$ t = n R \, t_{\mathrm{rotamer\_bkbn}} + \frac{n(n-1)}{2} R^{2} \, t_{\mathrm{pair}} \qquad (11) $$

where t is the total time; t_rotamer_bkbn is the time required to calculate the solvation, internal, and backbone interaction energy for a single rotamer; t_pair is the time required to calculate the interaction energy of a rotamer pair; n is the number of variable side-chain positions for design; and R is the number of rotamers allowed at each position.

On a 500-MHz Pentium II CPU, we find t_rotamer_bkbn ∼ 0.1 sec and t_pair ∼ 3 × 10^−4 sec. For the total sequence design of ribonuclease H (2RN2; n = 136 non-proline/glycine residues), allowing R = 365 residue rotamers at each position requires 103 CPU hours to generate the energy table. Doing the same calculation with the MTK model would require, based on the reported values (t_rotamer_bkbn ∼ 11 sec and t_pair ∼ 0.13 sec), 5 CPU years, whereas the FDPB model would require 4.1 CPU millennia (t_rotamer_bkbn ∼ 111 sec and t_pair ∼ 106 sec; Havranek and Harbury 1999). In reality, equation 11 provides an upper bound, because not all rotamer pairs interact with each other; the actual time required by EGAD for this example was found to be 86 CPU hours, corresponding to ∼4.25 CPU years for the MTK model, and ∼3.4 CPU millennia for FDPB. Parallelization can speed up these calculations; a small cluster of 10 CPUs requires only a few hours to perform the calculations described here.
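The upper-bound estimates quoted above can be checked with a few lines of arithmetic. The sketch below assumes that equation 11 has the usual pairwise-decomposable form, one term per rotamer (n·R) plus one term per rotamer pair at distinct positions (n(n−1)/2 · R²); the function and variable names are illustrative and are not taken from EGAD.

```python
def energy_table_seconds(n, rotamers, t_rotamer_bkbn, t_pair):
    """Upper-bound time in seconds to fill a pairwise energy table.

    Assumed form: n*R single-rotamer terms plus n(n-1)/2 * R^2 pair terms.
    """
    single_terms = n * rotamers
    pair_terms = n * (n - 1) // 2 * rotamers ** 2
    return single_terms * t_rotamer_bkbn + pair_terms * t_pair


HOUR = 3600.0
YEAR = 365.25 * 24 * HOUR

N_POSITIONS = 136  # variable positions in ribonuclease H (2RN2)
N_ROTAMERS = 365   # rotamers allowed at each position

# Per-term timings (seconds) as quoted in the text for each model.
gb_hours = energy_table_seconds(N_POSITIONS, N_ROTAMERS, 0.1, 3e-4) / HOUR
mtk_years = energy_table_seconds(N_POSITIONS, N_ROTAMERS, 11.0, 0.13) / YEAR
fdpb_millennia = (
    energy_table_seconds(N_POSITIONS, N_ROTAMERS, 111.0, 106.0) / (1000.0 * YEAR)
)

print(f"approximate GB: {gb_hours:.0f} CPU hours")
print(f"MTK:            {mtk_years:.1f} CPU years")
print(f"FDPB:           {fdpb_millennia:.1f} CPU millennia")
```

Under this assumed form the three estimates come out at roughly 103 CPU hours, 5 CPU years, and 4.1 CPU millennia, in line with the values quoted above. The pair term dominates in every case, which is why the speed of the pair-energy calculation matters most.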

A possible hybrid method that uses MTK or some other parameter-free numerical method to calculate Born radii, while using the faster GB formalism for the quadratically scaling rotamer pair-energy calculations, could potentially use the strengths of both models (theoretical rigor and speed, respectively). For the example discussed here, such a method would require 211 CPU hours instead of 4 CPU years. Nevertheless, the approximate GB method would still be ∼twofold faster, yet have similar accuracy with respect to experimental values. Therefore, at this time, the GB model with the approximation we describe appears to be a good balance between speed, accuracy, and simplicity.

Materials and methods

Force field

Charges and van der Waals radii are from the OPLS-AA force field (Jorgensen et al. 1996). Atoms >10 Å apart are defined as noninteracting (Desjarlais and Handel 1995; Dominy and Brooks 1999). Because the force field defines σ, but not van der Waals radii (R) per se, two alternative definitions of van der Waals radii were examined: the distance at which the interaction energy between identical atoms is 0, and the distance at which the energy is at a minimum. From the form of the Lennard-Jones potential used in OPLS-AA, it follows that R = 0.5 σ for the first definition, and R = 2^(−5/6) σ for the second definition. It was found that the first definition gave the best predictions of experimental solvation energies of amino acid side-chain analogs. Subsequently, it was found that defining radii for nitrogens (except for the Trp Nε) using the second definition improved the predictions (data not shown).
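The two radius definitions follow directly from the 12-6 Lennard-Jones form, E(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The energy between identical atoms is zero at r = σ and minimal at r = 2^(1/6) σ, and the radius is half of that separation in each case. A minimal numerical check is sketched below; the σ and ε values are illustrative only and the function names are not from the original work.

```python
def lennard_jones(r, sigma, epsilon):
    """12-6 Lennard-Jones energy between two identical atoms at separation r."""
    x = (sigma / r) ** 6
    return 4.0 * epsilon * (x * x - x)


def radius_zero_energy(sigma):
    """First definition: half the separation at which the energy is zero."""
    return 0.5 * sigma


def radius_energy_minimum(sigma):
    """Second definition: half the separation at the energy minimum."""
    return 2.0 ** (-5.0 / 6.0) * sigma


SIGMA = 3.5      # Angstrom, illustrative value
EPSILON = 0.066  # kcal/mol, illustrative value

r_zero = 2.0 * radius_zero_energy(SIGMA)
r_min = 2.0 * radius_energy_minimum(SIGMA)

print(lennard_jones(r_zero, SIGMA, EPSILON))  # zero-crossing of the potential
print(lennard_jones(r_min, SIGMA, EPSILON))   # well depth, equal to -EPSILON
```

The second definition always gives the larger radius, by a factor of 2^(1/6), or about 12%.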

Parameter optimization

The basis set for parameterization and testing of the approximate SASA and Generalized Born models consisted of the following protein structures (PDB IDs): 1PGB, 1UBQ, 2CRO, 2CI2, 1POH, 1A57, 3RN3, 2RN2, 1RIS, 1SRL, 1STN, 2YGS, 1B2X, 1MJC, 1FNF, 1HB6, 1FKB, 1LMB, 1CSP, 1SHF, 1DC2, 1XNB, 1TUR, 4PTI, 193L, 1BU4, all with added hydrogens. Free amino acid structures and 15 pentapeptides culled from 1L63 were also included.

Parameter optimization was accomplished by using the downhill simplex algorithm to minimize the RMSD error function (Press et al. 1999). To check for bias, optimization was repeated several times with different subsets of atoms. In all cases, the solutions were similar to the values reported in Supplemental Table 1.
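As an illustration of this kind of fit, the sketch below minimizes an RMSD error function with a simplified downhill simplex (Nelder-Mead) search written in plain Python. It is not the authors' implementation: the two-parameter linear model, the synthetic reference values, and all function names are hypothetical stand-ins for the actual SASA and Born radius models, and the search uses only reflection, expansion, inside contraction, and shrink steps.

```python
import math


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equal-length sequences."""
    squares = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squares) / len(squares))


def downhill_simplex(f, start, step=0.5, tol=1e-9, max_iter=5000):
    """Simplified Nelder-Mead minimizer; returns the best vertex found."""
    n = len(start)
    simplex = [list(start)]
    for i in range(n):
        vertex = list(start)
        vertex[i] += step
        simplex.append(vertex)
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Stop once every vertex lies within tol of the best one.
        size = max(abs(a - b) for p in simplex[1:] for a, b in zip(p, best))
        if size < tol:
            break
        # Centroid of all vertices except the worst.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]

        def along(t):
            """Point on the line through the centroid (t=0) and worst (t=1)."""
            return [c + t * (w - c) for c, w in zip(centroid, worst)]

        reflected = along(-1.0)
        if f(reflected) < f(best):
            expanded = along(-2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best one.
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)


# Hypothetical toy problem: recover scale = 2 and offset = 1 of a linear model.
xs = [0.0, 1.0, 2.0, 3.0]
reference = [2.0 * x + 1.0 for x in xs]


def error(params):
    scale, offset = params
    return rmsd([scale * x + offset for x in xs], reference)


fitted = downhill_simplex(error, [0.0, 0.0])
print(fitted)
```

Repeating such a fit on different subsets of the reference data, as described above, gives a simple check that no small group of atoms dominates the fitted parameters.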

Solvent-accessible surface area (SASA) calculations

SASAs are calculated using the algorithm of LeGrand and Merz (1993), using the van der Waals radii defined above. Water has a radius of 1.4 Å.

Optimal radii for backbone pseudoatoms were determined by using the downhill simplex algorithm to minimize the RMSD error between the actual and approximate SASAs for atoms in the basis set. A side-chain pseudoatom was also included initially, but was found to be unnecessary.

Born radii parameterization

The parameters in equation 7 were determined using a procedure similar to that described by Qiu et al. (1997) and Dominy and Brooks (1999). Briefly, the reaction field self-energy was calculated with DelPhi version 2.5 for each atom in the basis set (Gilson et al. 1988; Nicholls and Honig 1991). The atom of interest was assigned a unit charge, with all other atoms kept neutral. Internal dielectric constants were εp = 4.0 for proteins, and εp = 2.0 for small molecules. The dielectric constant for water was εw = 80.0, and the ionic strength was set to zero. The grid spacing was 0.5 Å, with a 20.0 Å border region, permitting the full Coulombic approximation for edge effects (Nicholls and Honig 1991). These self-energies were used to determine the parameters in equation 7. Atom volumes in equation 7 took into account overlaps between bonded atoms (Qiu et al. 1997).
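For orientation, a reaction field self-energy can be related to an effective Born radius through the standard Born expression, ΔG_self = −166.03 (1/εp − 1/εw) q²/α, with energies in kcal/mol, charges in units of e, and radii in Å. The sketch below assumes this standard form and these units; it does not reproduce equation 7, and the function names are illustrative.

```python
COULOMB_HALF = 166.03  # kcal*Angstrom/(mol*e^2); half of 332.06


def born_self_energy(alpha, charge=1.0, eps_protein=4.0, eps_water=80.0):
    """Born self-energy (kcal/mol) of a charge with effective radius alpha."""
    tau = 1.0 / eps_protein - 1.0 / eps_water
    return -COULOMB_HALF * tau * charge ** 2 / alpha


def effective_born_radius(dg_self, charge=1.0, eps_protein=4.0, eps_water=80.0):
    """Invert the Born expression: the radius that reproduces dg_self.

    dg_self is assumed to be a negative reaction field self-energy in kcal/mol.
    """
    tau = 1.0 / eps_protein - 1.0 / eps_water
    return -COULOMB_HALF * tau * charge ** 2 / dg_self


# A unit charge with a 2 Angstrom effective radius, protein dielectric of 4.
example_energy = born_self_energy(2.0)
example_radius = effective_born_radius(example_energy)
print(example_energy, example_radius)
```

With a unit charge on the atom of interest, as in the parameterization described above, the self-energy depends only on the dielectric constants and the effective radius, so each FDPB self-energy maps onto a single target radius.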

For the calculation of approximate Born radii, backbone atoms are assumed to have the same Born radii they have in the template structure containing the wild-type side chains. For side-chain atoms, the self-energies are calculated using equation 8. The virtual side-chain radius and kBB were optimized as described above, using ΔGself calculated with the GB model. The centers of side-chain pseudoatoms are placed 2.29 Å away from Cα along the Cα−Cβ vector to a position near the location of the residue-frequency-weighted average center of mass for all amino acid side chains. However, we have observed that the approximate Born radii are not particularly sensitive to the exact locations of the pseudoatoms (data not shown). A single pseudoatom was placed for each variable side chain.
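The pseudoatom placement described above is a simple vector operation: normalize the Cα→Cβ vector and step 2.29 Å along it from Cα. A minimal sketch follows; the function name and the example coordinates are hypothetical.

```python
import math


def place_pseudoatom(ca, cb, distance=2.29):
    """Return the point `distance` Angstroms from CA along the CA->CB direction."""
    direction = [b - a for a, b in zip(ca, cb)]
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("CA and CB coordinates coincide")
    return tuple(a + distance * c / norm for a, c in zip(ca, direction))


# Hypothetical coordinates (Angstroms) for one residue.
ca_xyz = (1.0, 2.0, 3.0)
cb_xyz = (2.0, 3.0, 3.5)

pseudoatom = place_pseudoatom(ca_xyz, cb_xyz)
print(pseudoatom)
```

Because only the backbone-dependent Cα and Cβ positions are needed, the pseudoatom centers can be computed once per template structure, before any rotamers are placed.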

pKa shift calculations

The reference states for pKa shift calculations are ensembles of terminally blocked amino acids. The backbone structures were culled from a nonredundant subset of the PDB, and side chains were modeled by rotamers (Dunbrack Jr. and Cohen 1997). Energies were calculated for each structure and averaged. Intrinsic pKa values of model compounds are Asp, 4.0; Glu, 4.4; Lys, 10.4; His, 6.3; and Tyr, 9.6 (Antosiewicz et al. 1996).

SCMF rotamer optimizations (Koehl and Delarue 1994) were performed from pH 2.0 to pH 12.0, in increments of 0.25, using a backbone-independent rotamer library (Dunbrack Jr. and Cohen 1997). Rotamers at all positions were allowed to move. At each pH, rotamers for the less favored state (based on the model compound pKa) were assigned a penalty (ΔGchemical in eq. 10; Havranek and Harbury 1999). The pKa of a side chain was defined as the pH at which the SCMF probabilities of the protonated and unprotonated states are equal (Bashford and Karplus 1990). pH steps where the ratio of protonated to deprotonated rotamer in the final ensemble changes from being less than to greater than unity were identified, and the pKa estimated by linear interpolation between these two points.
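The interpolation step can be sketched as follows. The text does not state which quantity is interpolated, so this sketch assumes linear interpolation of the protonated-state probability to 0.5 between the two bracketing pH steps. The function name is hypothetical, and the titration curve is synthetic Henderson-Hasselbalch data standing in for SCMF output.

```python
def interpolate_pka(ph_values, p_protonated):
    """Estimate the pH at which the protonated probability crosses 0.5.

    Scans adjacent pH steps for a crossing in either direction and
    interpolates linearly between them. Returns None if there is no crossing.
    """
    points = list(zip(ph_values, p_protonated))
    for (ph_lo, p_lo), (ph_hi, p_hi) in zip(points, points[1:]):
        crosses = (p_lo - 0.5) * (p_hi - 0.5) <= 0.0
        if crosses and p_lo != p_hi:
            fraction = (p_lo - 0.5) / (p_lo - p_hi)
            return ph_lo + fraction * (ph_hi - ph_lo)
    return None


# pH 2.0 to 12.0 in increments of 0.25, as in the protocol above.
ph_grid = [2.0 + 0.25 * i for i in range(41)]

# Synthetic stand-in for SCMF probabilities: an ideal site with pKa = 4.1.
TRUE_PKA = 4.1
curve = [1.0 / (1.0 + 10.0 ** (ph - TRUE_PKA)) for ph in ph_grid]

estimated_pka = interpolate_pka(ph_grid, curve)
print(estimated_pka)
```

For an ideal titration curve the interpolation error on a 0.25-unit pH grid is far smaller than the grid spacing, so the step size mainly limits resolution for strongly non-ideal, coupled sites.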

Because there are very few values available, Lys residues were not considered when assessing the accuracy of our predictions. Owing to their much higher intrinsic pKas, the inclusion of these residues with the far more numerous Asp, Glu, and His data can exaggerate the apparent accuracy. For example, their inclusion increases the correlation coefficient for the Null model from 0.61 to 0.95, while leaving the RMSD essentially unchanged (data not shown).

Electronic supplemental material

Supplemental Table 1 lists the fitted parameters. Supplemental Table 2 lists the experimental and predicted pKa values. Supplemental Figure 1 compares two forms of fGB with FDPB-calculated pair energies.

Figure 1.

Schematic of approximate Born radii and SASA calculation strategy. During rotamer–backbone energy precalculation, scaling the effect of the backbone atoms (thick gray curve) on a given rotamer's (black lines) Born radii (via kBB in eq. 8) or SASA (via swollen backbone radii; light gray circles on the right panel) compensates for the “missing” side chains (light gray lines). For Born radii calculations, virtual side-chain pseudoatoms (dark gray circles on the right panel) are also necessary to fully mimic the volume of the “missing” side-chain atoms.

Figure 2.

Comparison of approximate and actual SASAs and ΔGSASAs. The lines are y = x. (A) Parameterization of virtual atom radii with atomic surface areas. (B) Side-chain ΔGSASA calculated with eq. 3; γ = 0.04 kcal/mol/Å² for hydrophobic atoms (C, H, S) and 0 for all others. These energies are shown here for demonstration, and were not used for parameterization.

Figure 3.

Parameterization of equations 7 and 8 for calculating analytical and approximate Born radii. The lines are y = x. As described in Materials and Methods, for the parameterization, each atom was assigned a unit charge; energies and absolute errors with the partial charges used in actual design calculations are much lower. (A) Parameterization of equation 7 with FDPB-calculated self-energies. (B) Parameterization of kBB and side-chain pseudoatom radius for equation 8 with GB-calculated self-energies. The inset shows self-energies calculated with no pseudoatoms and kBB = 1.

Figure 4.

Comparison of side-chain self- and side-chain pair shielding energies calculated by FDPB and the approximate Born method. Side-chain atoms are assigned partial charges from OPLS-AA. These energies are shown here for demonstration, and were not used for parameterization. The lines are y = x. (A) Self-energies. (B) Pair shielding energies.

Figure 5.

pKa predictions. (A) Quality of pKa predictions as a function of dielectric constant (εp) for a subset of proteins. The smooth curve is for visualization, and has no other meaning. (B) Comparison of predicted pKas with experimental values. (Circles) Calculated predictions, (open circles) outliers, (crosshatches) null model predictions. Statistics values in parentheses include the outliers. As discussed in Materials and Methods, Lys pKas are not included. The line is y = x.


We thank Mark Voorhies and other members of the Handel lab; Susan Marqusee and members of her lab; and other members of the Stanley/Hildebrand Hall community for their valuable discussions, advice, and critical readings of the manuscript. We thank Scott LeGrand for making his SASA source code available to us. This work was funded by an NIH training grant, NSF grant MCB9458201, and the UCOP-CLC program.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.