Keywords:

  • de novo structure prediction;
  • loop modeling;
  • metalloproteins;
  • zinc binding

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Results
  5. Discussion
  6. Materials and Methods
  7. Acknowledgements
  8. References
  9. Supporting Information

Metal ions play an essential role in stabilizing protein structures and contributing to protein function. Ions such as zinc have well-defined coordination geometries, but it has not been easy to take advantage of this knowledge in protein structure prediction efforts. Here, we present a computational method to predict structures of zinc-binding proteins given knowledge of the positions of zinc-coordinating residues in the amino acid sequence. The method takes advantage of the “atom-tree” representation of molecular systems and modular architecture of the Rosetta3 software suite to incorporate explicit metal ion coordination geometry into previously developed de novo prediction and loop modeling protocols. Zinc cofactors are tethered to their interacting residues based on coordination geometries observed in natural zinc-binding proteins. The incorporation of explicit zinc atoms and their coordination geometry in both de novo structure prediction and loop modeling significantly improves sampling near the native conformation. The method can be readily extended to predict protein structures bound to other metal and/or small chemical cofactors with well-defined coordination or ligation geometry.

Introduction

Zinc is one of the most abundant and important metal ions in biology, playing an indispensable role in a broad range of cellular processes, such as DNA replication and transcription,1 apoptosis,2 and metabolism.3 Catalytically, zinc acts as the critical electrophile in many hydrolases4; structurally, it stabilizes many protein domains, for example, “zinc-finger” proteins.5 Genome analysis studies have revealed thousands of potential zinc-binding protein sequences6; however, only a small percentage of them have been structurally characterized.7 It is therefore of substantial interest to develop computational structure prediction methods that can generate three-dimensional models of zinc-binding proteins from their sequences, with accuracy in both the overall topology and the atomic details around the zinc-binding site.

Many previous studies have reviewed and classified the coordination geometry and amino acid preferences of zinc-binding sites in known zinc-binding proteins.8–10 Patel et al. estimated that a majority (82%) of zinc ions in proteins are tetrahedrally coordinated, with the rest pentahedrally or hexahedrally coordinated.10 For a structural zinc-binding site, cysteine (Cys) and histidine (His) are the preferred coordinating residues, and usually there are no water molecules in the primary coordination sphere.9 In a recent comprehensive survey of zinc-binding proteins, Grishin and coworkers structurally classified zinc finger domains into eight distinct fold groups with three dominant categories: C2H2-like finger, treble clef finger, and zinc ribbon.11 Torrance et al. studied the evolutionary divergence of metal-binding sites in proteins and identified two “archetypal” zinc-binding site structures, Cys-Cys-Cys-Cys and His-Cys-Cys-Cys, each of which appears to have evolved independently multiple times.12 These sequence and structural patterns have led to the development of methodologies that predict potential zinc-binding sites from either protein primary sequences or structural information. Hovmoller and coworkers combined support vector machine and homology-based approaches to predict zinc-coordinating Cys and His from protein sequences with a success rate of 86%.13 METSITE, developed by Sodhi et al., uses neural network classifiers to distinguish metal-binding sites from nonsites in protein structure models of moderate quality, with a mean accuracy of 94.5%.14 The empirical force field Fold-X, developed by Serrano and coworkers, can predict from high-resolution crystal structures the positions of single-atom ligands, including zinc, with an overall deviation of less than 1.0 Å.15 Recently, the Montelione group showed that zinc-coordinating cysteine residues can also be identified from NMR 13Cα and 13Cβ chemical shift data.16 Despite this progress, information on likely zinc-binding sites has not generally been incorporated into protein tertiary structure prediction methods to generate models of zinc-binding proteins. Previous studies have included zinc in docking metalloprotein–ligand complexes17 and in modeling the active sites of metalloenzymes by molecular dynamics (MD) simulations18, 19; however, modeling in these cases starts from existing protein structures, and only a narrow range of protein conformational space is searched.

The two key components of computational protein structure prediction methods are the procedure for carrying out the conformational search (sampling) and the free energy function used for evaluating possible conformations (scoring).20 Challenges in both areas have hindered modeling metal binding explicitly in protein structure prediction. First, conformational sampling is generally limited to the protein backbone and sidechain torsional degrees of freedom, and it is difficult to simultaneously sample the rigid-body degrees of freedom of the metal ion during folding. Second, to reduce computational complexity, nonbonded physical interactions among multiple atoms are simplified by treating the total energy as the sum of pairwise additive distance-dependent interactions. However, this two-body approximation does not suffice to model metal–protein interactions because metal coordination geometries around the favored coordination sphere have angular and multibody dependencies. In the example of zinc, the tetrahedral coordination of the four liganding residues requires distances, angles, and dihedrals among multiple atoms to be satisfied simultaneously. New algorithms must be developed to address such challenges to model metal-binding sites explicitly in protein structure prediction.
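
The limitation of a distance-only description can be illustrated with a small numerical sketch. The harmonic Zn–ligand term below is a hypothetical stand-in, not a Rosetta energy term, and only the zinc-centered distances are scored. Under that assumption, a tetrahedral and a square-planar arrangement of four ligands at 2.20 Å receive the same energy, although their ligand–Zn–ligand angles differ.

```python
import math

BOND = 2.20  # Zn-ligand distance in angstroms (value used for the zinc residue in this article)

def scaled(vectors, length):
    """Scale each direction vector to the requested length."""
    out = []
    for v in vectors:
        norm = math.sqrt(sum(c * c for c in v))
        out.append(tuple(length * c / norm for c in v))
    return out

def angle_deg(a, b):
    """Angle a-Zn-b in degrees, with the zinc at the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (na * nb)))))

def pairwise_zn_energy(ligands, k=1.0):
    """Hypothetical distance-only term: harmonic in each Zn-ligand distance."""
    return sum(k * (math.sqrt(sum(c * c for c in l)) - BOND) ** 2 for l in ligands)

def ligand_angles(ligands):
    """Sorted list of the six ligand-Zn-ligand angles, rounded to 0.1 degree."""
    return sorted(round(angle_deg(ligands[i], ligands[j]), 1)
                  for i in range(4) for j in range(i + 1, 4))

tetrahedral = scaled([(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)], BOND)
square_planar = scaled([(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)], BOND)

e_tet = pairwise_zn_energy(tetrahedral)
e_sq = pairwise_zn_energy(square_planar)
print(e_tet, e_sq)                   # both are (numerically) zero
print(ligand_angles(tetrahedral))    # six angles near 109.5 degrees
print(ligand_angles(square_planar))  # four 90-degree and two 180-degree angles
```

Distinguishing the two arrangements therefore requires angular or multibody information in addition to the Zn–ligand distances.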

The de novo structure prediction and homology modeling methods in the Rosetta software suite use a Monte Carlo strategy to assemble short fragments of known protein structures into compact conformations, followed by gradient-based refinement with respect to all backbone and sidechain torsional angles in a detailed all-atom force field.21, 22 The power of these methods has been demonstrated by the generation of structural models with atomic accuracy for a handful of benchmark and blind prediction targets in the last few years.23–25 Recently, a “fold-tree” representation26 of the molecular system has been developed in Rosetta that seamlessly integrates torsional and rigid-body degrees of freedom, which has allowed explicit treatment of backbone flexibility in protein–protein docking27 and protein–ligand docking.28 Taking advantage of this new capability, we developed an approach for predicting the structures of proteins with ion-binding sites of known coordination geometry. In this new method, zinc ions are explicitly represented and are tethered to their liganding residues with naturally observed geometries to maintain the integrity of the zinc-binding site and drive the folding of the protein chain. We show that in both de novo structure prediction and loop modeling, the explicit incorporation of zinc ions significantly improves sampling toward the native protein conformation, and we expect that this method can be readily extended to predict protein structures bound to other metal ions and other ligands/cofactors with known coordination geometries.

Results

Incorporation of zinc into the molecular system for protein structure prediction

It has been well established that the majority of structural zinc-binding sites are arranged in a tetrahedral coordination, and that the most preferred zinc-liganding residues in these sites are cysteines and histidines.9, 10 To capture this coordination geometry, zinc is represented as a ligand with five atoms forming the center and vertices of a tetrahedron. The actual zinc atom is centered in the tetrahedron and each of the four virtual atoms occupies a vertex. The distance between zinc and a virtual atom, 2.20 Å [Fig. 1(A)], is given by the average bond lengths of Zn-Sγ of Cys and Zn-Nδ/Nε of His in a set of structural zinc-binding sites. The four virtual atoms defined in the zinc residue serve to (1) set a reference frame for calculating the internal rigid-body transformation from the protein to the zinc, and (2) define constraints consistent with the coordination geometry between the zinc and the coordinating residues. A related “dummy-atom” approach was implemented in an MD study of zinc-bound farnesyltransferase.29
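
The geometry of the five-atom zinc residue can be sketched as follows. The function and atom names are illustrative and do not come from the Rosetta source. The only values taken from the text are the 2.20 Å Zn–virtual-atom distance and the tetrahedral arrangement.

```python
import math

ZN_V_DISTANCE = 2.20  # angstroms, zinc to each virtual atom (value given in the text)

def build_zinc_residue(center=(0.0, 0.0, 0.0), d=ZN_V_DISTANCE):
    """Return {'ZN': xyz, 'V1'..'V4': xyz} with virtual atoms on tetrahedron vertices."""
    # Alternating corners of a cube point toward the vertices of a regular tetrahedron.
    corners = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
    scale = d / math.sqrt(3.0)  # each corner vector has length sqrt(3)
    atoms = {"ZN": tuple(center)}
    for i, corner in enumerate(corners, start=1):
        atoms[f"V{i}"] = tuple(center[k] + scale * corner[k] for k in range(3))
    return atoms

def angle_deg(a, vertex, b):
    """Angle a-vertex-b in degrees."""
    u = [a[k] - vertex[k] for k in range(3)]
    v = [b[k] - vertex[k] for k in range(3)]
    cos_t = sum(x * y for x, y in zip(u, v)) / (math.dist(a, vertex) * math.dist(b, vertex))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

zinc = build_zinc_residue()
for name in ("V1", "V2", "V3", "V4"):
    print(name, round(math.dist(zinc["ZN"], zinc[name]), 2))
print(round(angle_deg(zinc["V1"], zinc["ZN"], zinc["V2"]), 1))
```

Every V–Zn–V angle in this construction equals the ideal tetrahedral angle, about 109.5°, which is the θ2 value listed in Table I.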

Figure 1. Incorporation of zinc coordination geometry into structure prediction. (A) Zinc is modeled as a tetrahedral ligand with four virtual atoms on the vertices and one zinc atom in the center. The distance between the zinc atom and each of the virtual atoms is 2.20 Å. (B) Rigid-body transformation (RT) from zinc-coordinating residue Cys (left) and His (right) to zinc. Coordinate frames are defined by N, Cα, C in the protein backbone and Zn, V1, V2 in the zinc. The spatial relationship between the sidechain and zinc can be described by six internal coordinate parameters, each of which can take a range of possible values (Table I). Combining these possibilities with the different possible sidechain rotamer conformations results in a discrete set of rigid-body transformations (a “jump” library) that can be used to sample the rigid-body degrees of freedom of the zinc during structure prediction. (C) Low-resolution de novo structure prediction. The fold tree is set up so that backbone conformational changes are propagated from the protein N-terminus to the C-terminus upon changes in the backbone torsion angles by fragment insertion, while a through-space transform (curly arrow) is established from one zinc-coordinating residue to the zinc ligand that samples alternative rigid-body transforms from the “jump” library created in (B). Distance constraints (single dash) tether the Cβ atoms of the remaining zinc-coordinating residues to the zinc atom. (D) High-resolution refinement. The fold tree is unchanged, but with all atoms represented, more precise constraints (double dash, see Materials and Methods) are defined to maintain the integrity of the zinc-coordination site. Virtual atoms are included in the constraint definitions to enforce tetrahedral coordination arrangements, which would otherwise be very complicated to realize with the single zinc atom alone.

Rosetta's “fold-tree” representation of the molecular system integrates torsional degrees of freedom with rigid-body degrees of freedom so that they can be optimized simultaneously.26, 27 In the fold tree, the zinc ligand is attached to one of its coordinating protein residues via a through-space connection (“jump”), and protein backbone conformational changes propagate through this rigid-body jump to determine the new position of the zinc ligand [Fig. 1(C,D)]. While protein backbone torsion degrees of freedom are sampled by inserting short fragments from known protein structures, the rigid-body degrees of freedom of the jump are sampled using a precomputed library of sidechain–zinc transforms. To generate this library, the rigid-body relationships between Cys and His sidechain atoms and zinc were first characterized in structures of proteins with natural zinc-binding sites. Based on this analysis, allowed ranges for each of the six rigid-body degrees of freedom [d, θ1, θ2, ϕ1, ϕ2, and ϕ12 in Fig. 1(B)] relating the sidechain and the zinc were defined (Table I). Combining these possible sidechain–zinc interactions with all Cys and His sidechain rotamers (χ1 and χ2) yields a set of rigid-body transformations from the backbone “N-Cα-C” triplet in the zinc-coordinating residue to the “Zn-V1-V2” triplet in the zinc ligand [Fig. 1(B), Table I]. Our library contains about 1300 different transformations (jumps) from the backbone of Cys residues to zinc and 2000 jumps from the backbone of His residues to zinc.
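
The way a jump ties the zinc to the chain can be illustrated with a toy tree. This is a plain-Python sketch, not Rosetta's FoldTree class. It assumes the root is residue 1, that each peptide bond is stored as its own edge, and that the zinc is numbered as one extra residue after the protein.

```python
from collections import defaultdict

PEPTIDE = "peptide"
JUMP = "jump"

def zinc_fold_tree(n_res, anchor):
    """Edges for a chain 1..n_res plus a zinc residue n_res+1 hung off `anchor` by a jump."""
    if not 1 <= anchor <= n_res:
        raise ValueError("anchor must be a protein residue")
    edges = [(i, i + 1, PEPTIDE) for i in range(1, n_res)]
    edges.append((anchor, n_res + 1, JUMP))
    return edges

def downstream(edges, residue):
    """Residues reached from `residue` when walking edges away from the root."""
    children = defaultdict(list)
    for start, stop, _kind in edges:
        children[start].append(stop)
    seen, stack = [], list(children[residue])
    while stack:
        node = stack.pop()
        seen.append(node)
        stack.extend(children[node])
    return sorted(seen)

# Hypothetical 10-residue chain with the zinc (residue 11) anchored at residue 4.
edges = zinc_fold_tree(n_res=10, anchor=4)
print(downstream(edges, 2))  # residues 3-10 and the zinc (11) follow a change at residue 2
print(downstream(edges, 6))  # only residues 7-10 follow; the zinc stays with residue 4
```

In this picture, a backbone change upstream of the anchor carries the zinc along with the chain, whereas a change downstream of the anchor leaves the zinc in place.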

Table I. The Six Internal Coordinate Parameters Defining Local Interactions Between Cys/His Sidechains and Zinc

Residue-Zn           d        θ1            θ2           ϕ1               ϕ2               ϕ12
Cys-Zn      atoms    Sγ-Zn    Cβ-Sγ-Zn      Sγ-Zn-V1     Cα-Cβ-Sγ-Zn      Cβ-Sγ-Zn-V1      Sγ-Zn-V1-V2
            value    2.20 Å   112.0°        109.5°       −180°:180°:30°   −180°:180°:30°   120.0°
His(D)-Zn   atoms    Nδ1-Zn   Cγ-Nδ1-Zn     Nδ1-Zn-V1    Cβ-Cγ-Nδ1-Zn     Cγ-Nδ1-Zn-V1     Nδ1-Zn-V1-V2
            value    2.20 Å   120.0°        109.5°       0.0°             −180°:180°:30°   120.0°
His(E)-Zn   atoms    Nε2-Zn   Cδ2-Nε2-Zn    Nε2-Zn-V1    Cγ-Cδ2-Nε2-Zn    Cδ2-Nε2-Zn-V1    Nε2-Zn-V1-V2
            value    2.20 Å   120.0°        109.5°       180.0°           −180°:180°:30°   120.0°

The parameters consist of one bond length (d), two bond angles (θ1 and θ2), and three torsion angles (ϕ1, ϕ2, and ϕ12), and their defining atoms are listed. For each parameter, the sampling range is indicated, for example, ϕ1 in Cys-Zn is sampled every 30° in a full rotatable manner, but the same torsion in His-Zn is fixed at either 0 or 180° to keep zinc aligned in the plane of the imidazole ring as observed in natural zinc-binding sites. These parameters are used in combination with sidechain rotamers to create a set of rigid-body transformations from residue backbone to the zinc ligand [Fig. 1(B)]. They also serve as a reference for all-atom constraints, which enforce optimal zinc-coordination geometries.
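
The internal coordinates in Table I can be turned into zinc positions with a standard internal-to-Cartesian placement. The sketch below enumerates the (ϕ1, ϕ2) grid and places the zinc relative to a Cys sidechain. The Cα, Cβ, and Sγ coordinates are hypothetical, −180° and 180° are assumed to count as a single torsion value, and the combination with sidechain rotamers is left out.

```python
import math
from itertools import product

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])
def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def place_atom(a, b, c, bond, angle, torsion):
    """Place d with |c-d| = bond, angle b-c-d = angle, dihedral a-b-c-d = torsion (degrees)."""
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))  # normal of the a-b-c plane
    m = cross(n, bc)                # in-plane axis pointing toward the side of a
    th, ph = math.radians(angle), math.radians(torsion)
    x = -bond * math.cos(th)
    y = bond * math.sin(th) * math.cos(ph)
    z = bond * math.sin(th) * math.sin(ph)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def dihedral(p0, p1, p2, p3):
    """Signed dihedral p0-p1-p2-p3 in degrees."""
    b0, b1, b2 = sub(p0, p1), unit(sub(p2, p1)), sub(p3, p2)
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))

def torsion_grid(start=-180, stop=180, step=30):
    """Values of a start:stop:step range, counting -180 and 180 once (an assumption)."""
    return list(range(start, stop, step))

# d (angstroms), theta1, theta2, phi1 values, phi2 values, phi12, all from Table I
TABLE_I = {
    "Cys":    (2.20, 112.0, 109.5, torsion_grid(), torsion_grid(), 120.0),
    "His(D)": (2.20, 120.0, 109.5, [0],            torsion_grid(), 120.0),
    "His(E)": (2.20, 120.0, 109.5, [180],          torsion_grid(), 120.0),
}

def internal_coordinate_sets(residue):
    """All (d, theta1, theta2, phi1, phi2, phi12) combinations for one residue type."""
    d, t1, t2, phi1s, phi2s, phi12 = TABLE_I[residue]
    return [(d, t1, t2, p1, p2, phi12) for p1, p2 in product(phi1s, phi2s)]

# Hypothetical Cys sidechain atoms (angstroms), for illustration only.
CA, CB, SG = (0.0, 0.0, 0.0), (1.53, 0.0, 0.0), (2.15, 1.70, 0.0)
d, theta1 = TABLE_I["Cys"][0], TABLE_I["Cys"][1]
zinc_positions = {phi1: place_atom(CA, CB, SG, d, theta1, phi1) for phi1 in torsion_grid()}

print(len(internal_coordinate_sets("Cys")), len(internal_coordinate_sets("His(D)")))
```

Only d, θ1, and ϕ1 are needed to position the zinc atom itself. The remaining three parameters orient the virtual atoms V1 and V2, which complete the rigid-body frame of the zinc residue.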

The rigid-body jump described in the previous paragraph anchors the zinc to the protein. Constraints between the zinc and the other three zinc-coordinating residues are included during folding to maintain the integrity of the zinc-binding site. In the low-resolution search stage, in which protein sidechain atoms are approximated by centroids,21 a distance constraint term defined from the zinc atom to the Cβ atom of each of the remaining zinc-coordinating residues favors the formation of a protein topology that can accommodate a zinc-binding site [Fig. 1(C)]. In the subsequent high-resolution refinement stage, in which all atoms are represented, the Rosetta all-atom energy function is supplemented with distance, angular, and dihedral constraints derived from structures of zinc-binding sites, so that the low-energy models contain a zinc coordination site with correct geometry [Fig. 1(D)]. The distance constraints are defined between a protein zinc-coordinating atom and a virtual atom with a target distance of zero (see Materials and Methods). This enforces the overall tetrahedral coordination geometry around the zinc because, by construction of the zinc ligand, the four virtual atoms occupy the four vertices of the tetrahedron centered at the zinc atom [Fig. 1(A)]. Such treatment allows generation of correct zinc coordination geometries without complicated computation of nonpair-additive interactions between zinc-liganding residues.
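
A minimal sketch of the zero-target distance constraint follows. The harmonic form and the unit weight are assumptions made for illustration; the functional form actually used is described in Materials and Methods. Each coordinating atom is paired with one virtual atom, and the penalty vanishes only when the atoms sit on the tetrahedron vertices.

```python
import math

ZN_V = 2.20  # angstroms, zinc to virtual atom distance from the text

def tetrahedron(d=ZN_V):
    """Virtual atom positions around a zinc at the origin."""
    s = d / math.sqrt(3.0)
    return [(s, s, s), (s, -s, -s), (-s, s, -s), (-s, -s, s)]

def constraint_score(coordinating_atoms, virtual_atoms, weight=1.0):
    """Assumed harmonic penalty on each atom/virtual-atom distance, target distance 0."""
    return sum(weight * math.dist(a, v) ** 2
               for a, v in zip(coordinating_atoms, virtual_atoms))

virtual = tetrahedron()
ideal = list(virtual)  # coordinating atoms exactly on the vertices
planar = [(ZN_V, 0.0, 0.0), (-ZN_V, 0.0, 0.0), (0.0, ZN_V, 0.0), (0.0, -ZN_V, 0.0)]

print(constraint_score(ideal, virtual))   # 0.0
print(constraint_score(planar, virtual))  # positive, although every Zn-ligand distance is 2.20
```

Because the virtual atoms already encode the tetrahedral arrangement, four pairwise terms are enough to penalize a square-planar site, which a Zn–ligand distance term alone would not.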

With incorporation of zinc into the fold-tree framework of the molecular system by (1) defining a zinc ligand residue with virtual atoms, (2) creating a jump library sampling rigid-body transformations from protein to zinc, and (3) adding constraint energy terms maintaining the geometry of zinc coordination, previously developed Rosetta structure prediction methods can be seamlessly adapted to perform various tasks to generate structure models for zinc-binding proteins. In the next sections, we present results from implementing this new method in de novo structure prediction and loop modeling of zinc-binding proteins.

De novo structure prediction

Starting from the amino acid sequence only, Rosetta de novo structure prediction and high-resolution refinement have generated structure models with atomic-level accuracy for a handful of benchmark and blind prediction cases.23, 25 In this study, a benchmark set of nine zinc-binding proteins was constructed to test the performance of the new method with explicit modeling of zinc (Table II). The set represents six of the eight fold groups of zinc fingers as defined by Grishin and coworkers,11 including two targets from each of the three major zinc-finger fold groups: classical C2H2-like zinc finger, treble clef finger, and zinc ribbon. Models were generated for each protein using the Rosetta de novo structure prediction method without and with zinc incorporation. Energy versus RMSD plots are shown for the lowest-energy 5% of predictions in Figure 2(A). For six of nine cases (1co4, 1d0q, 1ef4, 1fv5, 1ncs, and 2b9d), improved sampling toward near-native conformations is observed. For four cases (1ef4, 1fv5, 1wjb, and 2b9d), the overall “energy-funnel” character is improved when zinc is explicitly incorporated in the process of structure prediction. In both cases of the classical C2H2-like zinc finger proteins (1fv5 and 1ncs), near-native models (<2 Å backbone RMSD) were identified among the five best energy-ranked predictions [Fig. 2(A), Table II]. Accurately predicted models of 1fv5, 2b9d, and 1wjb are shown in Figure 2(B); for 1fv5 the prediction has atomic-level accuracy, with backbone RMSD less than 1.0 Å and all-atom RMSD less than 2.0 Å.

Table II. Benchmark Set of Zinc-Binding Proteins from Various Fold Groups for Testing De Novo Structure Prediction

PDB    Length (residue range)   Structural classification   Ligands               BL5 control (Å)   BL5 zinc (Å)
1co4   1–34                     Zn2/Cys6-like finger        C11, C14, C23, H25    6.25              6.14
1d0q   2–103                    Zinc ribbon                 C40, H43, C61, C64    11.99             5.20
1dsv   54–84                    Gag knuckle                 C58, C61, H66, C71    6.33              7.31
1ef4   1–55                     Treble clef                 C6, C9, C43, C44      4.06              3.71
1fv5   10–33                    C2H2-like finger            C11, C14, H27, C32    3.82              0.70
1irn   1–53                     Zinc ribbon                 C6, C9, C39, C42      5.90              7.35
1ncs   21–60                    C2H2-like finger            C34, C39, H52, H56    8.54              1.59
1wjb   1–46                     TAZ2 domain-like            H12, H16, C40, C43    2.89              2.07
2b9d   42–93                    Treble clef                 C52, C55, C85, C88    6.46              3.04

The lowest RMSD of the lowest energy five models (BL5) is reported for the test without zinc (control) and the test with zinc (zinc).
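
The BL5 measure reported in Table II can be computed from a list of scored models as sketched below. The function name and the (energy, RMSD) tuple layout are illustrative choices, and the example values are hypothetical.

```python
def bl5(models, n_best=5):
    """Lowest RMSD among the n_best lowest-energy models.

    `models` is an iterable of (energy, rmsd) pairs.
    """
    models = list(models)
    if not models:
        raise ValueError("no models given")
    lowest_energy = sorted(models, key=lambda m: m[0])[:n_best]
    return min(rmsd for _energy, rmsd in lowest_energy)

# Hypothetical (energy, RMSD in angstroms) pairs for illustration.
example = [(-95.0, 4.1), (-99.0, 3.2), (-97.5, 0.9), (-80.0, 0.5),
           (-98.1, 6.0), (-96.2, 2.2), (-70.0, 0.4)]
print(bl5(example))  # 0.9
```

In the example, the models with RMSD 0.5 and 0.4 Å do not count because their energies rank them outside the best five, which is why BL5 reflects energetic discrimination as well as sampling.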

Figure 2. De novo prediction of the structures of zinc-binding proteins. (A) Energy (y-axis) versus RMSD (x-axis) plots of the lowest energy 5% of models generated without (left) and with (right) explicit zinc modeling. The red line in each plot indicates the lowest RMSD value of the lowest energy five models (“BL5” in Table II). (B) Accurate predictions of 1fv5 (left, 0.7 Å RMSD), 2b9d (middle, 3.04 Å RMSD), and 1wjb (right, 2.07 Å RMSD) from low-energy models with zinc incorporated. The predicted model (pink) is superimposed onto the native structure (green). Backbone traces are drawn in cartoon, zinc ions are drawn in spheres, and zinc-coordinating sidechains are drawn in sticks.

De novo structure prediction using NMR chemical shift information

It was demonstrated recently that the robustness of the Rosetta de novo structure prediction method can be improved by using a fragment library generated with NMR chemical shift data (CS-fragments).30, 31 For metal-binding proteins, chemical shift information can further provide valuable insights into structural features such as metal ligation. Montelione and coworkers recently showed that the overlapped 13Cβ chemical shift distributions of zinc-liganding and nonmetal-liganding cysteine residues are largely resolved by the inclusion of the corresponding 13Cα chemical shift information.16 Here, we take the nine proteins in that study whose chemical shift data were used to identify their zinc-liganding cysteines16 and generate models using the Rosetta de novo structure prediction and refinement protocols. Four protocols, with/without zinc and with/without CS-fragments, were tested, and the results are summarized in Figure 3(A) and Table III. Compared to the control protocol (black curve, without zinc and without CS-fragments), the new protocol (blue curve, with explicit zinc and with CS-fragments) has RMSD distributions shifted toward near-native conformations in seven of nine cases [Fig. 3(A)], and the energy-based ranking of modeled structures is improved for six cases (Table III). Incorporating zinc and using a chemical shift-based fragment library produce different levels of improvement for different protein targets. In 1lv3 and 1r9p, improvements mainly come from incorporating zinc, whereas in 1m3v and 1iym, CS-fragments play a more important role in creating near-native models. In 1exk, including both zinc and CS-fragments has a synergistic impact. Significantly improved results are obtained for both 1r9p (with one zinc-binding site) and 1m3v (with two zinc-binding sites), with predictions of 2.08 and 2.66 Å backbone RMSD identified in the best five energy-ranked models. As illustrated in Figure 3(B), both overall protein topology and zinc positions are predicted accurately.

Figure 3. Prediction of the structures of zinc-binding proteins using NMR chemical shift information. (A) Comparison of RMSD distribution of low-energy models generated with and without explicit zinc modeling. Backbone heavy-atom RMSD values of the lowest energy 5% of models for each of nine NMR protein structures were grouped into 0.5 Å bins. Histograms are shown for calculations without zinc using standard fragments (black), without zinc using CS-fragments (green), with zinc using standard fragments (red), and with zinc using CS-fragments (blue). (B) Accurate predictions of 1r9p (left, 2.08 Å RMSD) and 1m3v (right, 2.66 Å RMSD, two zinc-binding sites) from low-energy models predicted with zinc incorporated. The predicted model (pink) is superimposed onto the native structure (green). Backbone traces are drawn in cartoon and zinc ions are drawn in spheres.
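
The histograms in panel (A) follow from two simple steps, selecting the lowest-energy 5% of models and grouping their RMSD values into 0.5 Å bins. The sketch below uses illustrative function names and hypothetical input values.

```python
import math
from collections import Counter

def lowest_energy_fraction(models, fraction=0.05):
    """Keep the given fraction of (energy, rmsd) models with the lowest energies."""
    models = sorted(models, key=lambda m: m[0])
    n_keep = max(1, round(len(models) * fraction))
    return models[:n_keep]

def rmsd_histogram(rmsds, bin_width=0.5):
    """Map the lower edge of each RMSD bin (angstroms) to the number of models in it."""
    counts = Counter(math.floor(r / bin_width) * bin_width for r in rmsds)
    return dict(sorted(counts.items()))

# Hypothetical RMSD values (angstroms) for illustration.
print(rmsd_histogram([0.7, 0.9, 1.2, 2.08, 2.4]))  # {0.5: 2, 1.0: 1, 2.0: 2}
```

Comparing such histograms between protocols shows whether the low-energy population as a whole shifts toward the native structure, which complements the single-number BL5 summary.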

Table III. Benchmark Set of Zinc-Binding Proteins for Testing Structure Prediction Using Chemical Shift Information

                                                               BL5 (Å)
PDB    Length (residue range)   Ligands                   Control   Control-CS   Zinc    Zinc-CS
1exk   13–75                    C14, C17, C67, C70        13.25     11.38        11.66   11.96
                                C31, C34, C53, C56
1f62   1–48                     C3, C6, H26, C29          7.06      7.67         4.07    4.86
                                C18, C21, C44, C47
1g47   8–66                     C10, C13, H32, C35        6.78      5.97         9.53    5.04
                                C38, C41, C59, H61
1iym   133–179                  C134, C137, H158, C161    2.85      4.73         3.95    3.71
                                C153, H155, C172, C175
1lv3   5–40                     C9, C12, C28, C32         5.85      6.80         5.42    7.11
1m3v   7–67                     C8, C11, H29, C32         8.80      6.88         8.88    2.66
                                C35, C38, C58, D61
1nku   1–184                    C4, H17, H175, C179       14.68     11.45        14.79   11.37
1r9p   26–122                   C37, C63, H105, C106      7.04      6.70         1.93    2.08
1t3k   1–132                    H39, C120, C122, C127     15.04     15.11        6.33    14.86

The best RMSD value of the lowest energy five models (BL5) is reported for tests without zinc using normal fragments (control), tests without zinc using CS-fragments (control-CS), tests with zinc using normal fragments (zinc), and tests with zinc using CS-fragments (zinc-CS). A second row of ligands indicates a second zinc-binding site.

Loop modeling

One of the important goals of computational structural biology is to model protein structures accurately from homologues of known structure. A critical step in this process is the modeling of structurally divergent regions using “loop modeling” methods. Several loop modeling methods have been developed in Rosetta27, 32 and have been applied in CASP blind predictions to create accurate models.24, 25 In the current test, 16 crystal structures of zinc-binding proteins were selected, each of which has at least two zinc-coordinating residues residing in one or more loop regions (Table IV). These loop regions were built using a previously published protocol27 that couples the cyclic coordinate descent (CCD) algorithm33 with Monte Carlo energy minimization.34 For each test case, 6000 models were generated with or without the explicit incorporation of zinc. Distributions of the global loop RMSD values from the 300 lowest energy models are plotted in Figure 4(A), and the best global loop RMSD value from the five lowest energy models (BL5) is reported in Table IV. When loops are modeled in the presence of zinc, 7 of 16 cases show improved results, while the performance for the rest of the cases does not become significantly worse. For 1d0q, 2ayd, 2ioi, and 2orw, the accuracies of modeled loop conformations and the energetic discrimination between near-native and incorrect models are dramatically improved, as evidenced by both a substantial RMSD distribution shift toward the native loop conformation [Fig. 4(A)] and the significantly lower RMSD values among the five lowest energy models (Table IV). For 2ayd, two loops containing all four zinc-coordinating residues are modeled simultaneously with an RMSD of 1.34 Å, whereas for 2orw, a 15-residue loop accommodating two zinc-coordinating residues is predicted with an RMSD of 1.29 Å. In both cases, as illustrated in Figure 4(B), accurate predictions are achieved not only for loop backbone conformations but also for the sidechain conformations of the zinc-coordinating residues and for the zinc position, which would not be possible without explicit incorporation of the zinc into the modeling process.

Table IV. Benchmark Set of Zinc-Binding Proteins for Testing Loop Modeling

PDB    Length (residue range)   Ligands                       Loops        Nres   BL5 control (Å)   BL5 zinc (Å)
1d0q   2–103                    C40, H43, C61, C64            38–49        4      3.86              1.56
                                                              61–70
1ee8   1–266                    C238, C241, C258, C261        238–242      4      1.21              1.12
                                                              257–264
1kk1   6–198                    C60, C62, C72, C75            60–80        4      8.48              8.43
1oqj   90–179                   C113, H170, C174, C178        107–120      3      3.31              3.73
                                                              162–177
1v33   1–346                    C106, H108, C114, C117        97–115       3      5.42              3.27
1vsr   23–156                   C66, H71, C73, C117           64–82        4      4.33              3.79
                                                              115–127
1zin   1–217                    C130, C133, C150, C153        130–134      4      8.78              6.96
                                                              138–165
2ayd   293–368                  C332, C337, H361, H363        332–340      4      1.75              1.34
                                                              358–366
2d5b   1–287                    C127, C130, C144, H147        127–152      4      4.35              6.12
2gmw   24–205                   C112, H114, C127, C129        112–135      4      7.69              6.83
2ioi   1097–1283                C1173, H1176, C1235, C1239    1233–1248    2      3.12              0.88
2j6a   1–136                    C11, C16, C112, C115          8–33         4      5.30              5.29
                                                              112–116
2olm   3–135                    C29, C32, C49, C52            22–50        3      6.47              6.00
2orw   2–181                    C140, C143, C173, C176        135–149      2      7.89              1.29
2pq8   177–305                  C210, C213, H226, C230        210–214      3      2.10              2.49
                                                              229–238
2znr   270–436                  H362, C402, H408, H410        401–431      3      5.79              3.93

The number of zinc-coordinating residues residing in the defined loop regions (Nres) is indicated. The lowest RMSD of the lowest energy five models ranked by energy (BL5) is reported for tests without zinc (control) and tests with zinc (zinc).
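
The Nres column can be recomputed from the ligand positions and loop boundaries listed in Table IV. The following sketch does so for all 16 targets. The dictionary layout and function name are illustrative, and loop boundaries are treated as inclusive.

```python
# PDB code -> (zinc-coordinating residue numbers, loop ranges), copied from Table IV
TABLE_IV = {
    "1d0q": ([40, 43, 61, 64], [(38, 49), (61, 70)]),
    "1ee8": ([238, 241, 258, 261], [(238, 242), (257, 264)]),
    "1kk1": ([60, 62, 72, 75], [(60, 80)]),
    "1oqj": ([113, 170, 174, 178], [(107, 120), (162, 177)]),
    "1v33": ([106, 108, 114, 117], [(97, 115)]),
    "1vsr": ([66, 71, 73, 117], [(64, 82), (115, 127)]),
    "1zin": ([130, 133, 150, 153], [(130, 134), (138, 165)]),
    "2ayd": ([332, 337, 361, 363], [(332, 340), (358, 366)]),
    "2d5b": ([127, 130, 144, 147], [(127, 152)]),
    "2gmw": ([112, 114, 127, 129], [(112, 135)]),
    "2ioi": ([1173, 1176, 1235, 1239], [(1233, 1248)]),
    "2j6a": ([11, 16, 112, 115], [(8, 33), (112, 116)]),
    "2olm": ([29, 32, 49, 52], [(22, 50)]),
    "2orw": ([140, 143, 173, 176], [(135, 149)]),
    "2pq8": ([210, 213, 226, 230], [(210, 214), (229, 238)]),
    "2znr": ([362, 402, 408, 410], [(401, 431)]),
}

def ligands_in_loops(ligands, loops):
    """Zinc-coordinating residues that fall inside any of the (inclusive) loop ranges."""
    return [r for r in ligands if any(first <= r <= last for first, last in loops)]

nres = {pdb: len(ligands_in_loops(*entry)) for pdb, entry in TABLE_IV.items()}
print(nres)
```

The recomputed counts agree with the tabulated Nres values, for example 3 for 1oqj, where C178 lies just outside the 162–177 loop, and 2 for 2ioi and 2orw.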

Figure 4. Loop modeling near zinc-binding sites. (A) Comparison of RMSD distribution of low-energy models generated with and without explicit zinc modeling. Histograms of loop backbone heavy-atom RMSD of the lowest energy 5% of models are shown for tests without zinc (black) and with zinc (red). (B) Accurate predictions of 2orw (left, 1.29 Å RMSD) and 2ayd (right, 1.34 Å RMSD) from low-energy models with zinc incorporated. The predicted model (pink) is superimposed onto the native structure (green). Backbone traces are drawn in cartoon, zinc ions are drawn in spheres, and zinc-coordinating sidechains are drawn in sticks.

Discussion

Metal ions are essential for maintaining the function, structure, and stability of proteins and, as the second most abundant metal ion found in eukaryotic organisms, zinc plays important roles in many biological processes. About 10% of the structures deposited in the Protein Data Bank have zinc listed as a ligand in their structure records.7 Despite the flourishing development of computational tools to generate protein structure models either from sequence alone or from structures of close homologues, few methods take the binding of metal cofactors into account explicitly. The method presented in this article is a step toward overcoming this limitation.

The development of the current method has greatly benefited from both the representation of the molecular system by a “fold tree”26 and the recent effort to reshape Rosetta software with a modular object-oriented design (Leaver-Fay A, Baker D, and Bradley P, unpublished). The fold tree lays out a general kinematic framework for a wide spectrum of structure modeling tasks in which torsional degrees of freedom and rigid-body degrees of freedom can be integrated seamlessly and optimized simultaneously. The power and generality of this framework is illustrated by predictions of the structures of 1m3v (protein folding with two zinc ions) and 2ayd (two loops built to coordinate the zinc ion) presented here. The new modular architecture allows easy integration into protein structure prediction and design calculations of nonprotein molecules with well-defined coordination geometries, such as metal ions and clusters, water molecules, and other small molecules with well-defined hydrogen bond acceptors and donors. Once the coordination/ligation geometry is specified, the Rosetta3 framework enables fast and efficient development of modeling methodologies with these compounds based on existing protocols (e.g., de novo structure prediction and loop modeling).

In conventional protein structure prediction methods, interactions among protein atoms are often approximated using pair-additive distance-dependent potentials. This approach is problematic for modeling metal–protein interactions because formation of correct metal coordination geometries requires simultaneous satisfaction of distance, angle, and dihedral geometric constraints from multiple protein atoms surrounding the ion. Our method addresses this challenge by representing the zinc ion with a tetrahedron-shaped residue. In this pseudo-residue, the actual zinc atom is positioned at the center and the four virtual atoms occupy the vertexes of the tetrahedron, mimicking zinc coordination spheres in native protein structures. By enforcing distance constraints between zinc-coordinating atoms from proteins and these virtual atoms, as well as angular/dihedral constraints, correct zinc-coordinating geometries are favored by the energy function without incurring the complexity of computing multibody interactions.

The approach to zinc incorporation in this article can easily be extended to model protein structures bound with other metals, such as calcium, iron, and magnesium. Metal ion coordination geometry and sidechain preferences have been extensively studied,35 and once such information is encoded as done here for zinc (creation of ligand residue, “jump” library, and coordination constraints), these metals (or more broadly, small chemical ligands) can be readily incorporated into existing methods to predict structures of other metalloproteins and/or dock metalloproteins with protein or ligand partners.

Our method currently relies on knowledge of the locations of the zinc-coordinating residues in the protein primary sequence. Such information may be obtained from analysis of consensus metal-binding sequence patterns in genome sequences,6, 13 alignments with other known homologues,11 and experimental data such as NMR chemical shifts16 or mutagenesis around metal-binding sites. With the incorporation of zinc-coordinating constraints in prediction, the conformational space to be searched is certainly reduced; however, we still have several test cases (1dsv, 1t3k, 1nku, 2d5b, etc.) in which near-native conformations are not sampled, highlighting the importance of developing algorithms to better search conformational space. Although the method described in this article mainly focuses on structurally bound zinc metals in proteins, the catalytic role of zinc binding in many metalloenzymes should not be overlooked. To model their structures, energy functions need to be improved to monitor electrostatic interactions among cationic metals, protonated waters, and more acidic residues in the active site such as Asp and Glu.36

Metal binding promotes protein stability and catalytic activity, and attention has been increasingly focused on designing interactions between proteins and metal ions.37, 38 Previous studies have explored the introduction of zinc- and iron-binding sites into static protein scaffolds39–43 such as four-helix bundles; however, as suggested by recent work on de novo protein structure design44 and enzyme design,45 successful creation of metalloproteins with novel structure and function will likely require iterative rounds of design and prediction of protein scaffolds with structural and/or catalytic metal-binding sites. The method described in this article can serve to create an initial structure model for sequence optimization and to refine designed sequences and structures.

Materials and Methods


Datasets

Nine protein targets were selected from Krishna et al.11 to test the de novo structure prediction protocol; they represent six of the eight defined classes of zinc-binding proteins. Two targets were selected for each of the three major fold groups: C2H2-like finger, treble clef finger, and zinc ribbon. The nine protein targets used to test structure prediction with NMR chemical shift information were selected from the set of Kornhaber et al.16 The first model in the NMR ensemble, with the flexible terminal residues removed, was used as the native conformation. To test the loop modeling protocol, 16 crystal structures with resolution better than 2.5 Å, each containing loop regions with at least two zinc-coordinating residues, were selected from the Protein Data Bank.7 Information on the four residues coordinating the zinc was extracted from the structures and used to guide de novo structure prediction and loop modeling.

De novo structure prediction

The Rosetta de novo structure prediction method has been described in detail.21, 46 Models are built from fragments starting from an extended chain and then subjected to all-atom refinement. Two sets of torsional fragment libraries were tested: one created with standard procedures based on local sequence similarity46 and the other created with additional chemical shift information31 retrieved from the Biological Magnetic Resonance Data Bank (BMRB, http://www.bmrb.wisc.edu/published/). When zinc is incorporated, it is treated as an additional ligand residue and is attached to one coordinating residue (the one closest to the protein N-terminus) via a long-range connection in the fold tree. A library of rigid-body transformations from the backbone of the coordinating residue to the zinc ligand is generated by combining all the degrees of freedom listed in Table I. During fragment assembly, a “jump” fragment from this library can be inserted and accepted or rejected using a Monte Carlo strategy to sample the rigid-body orientation of zinc with respect to the protein backbone. All backbone and sidechain torsional degrees of freedom and the zinc rigid-body degrees of freedom are optimized simultaneously in the all-atom refinement stage. In both the low-resolution folding stage and the high-resolution stage, tethering constraints are applied to favor the zinc-coordination geometry (see the “Energy function” section). For each protein, 50,000 models were generated and the 2500 lowest-energy models (5%) were selected for further analysis. With the incorporation of zinc, the computational cost of Rosetta de novo structure prediction generally increases by about 20–50%, depending on the size of the protein being modeled.
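The jump-library sampling described above can be illustrated with a small sketch. This is not the Rosetta implementation: the grids below are placeholder values (Table I is not reproduced here), each "jump" is simplified to a tuple of internal-coordinate parameters rather than a full rigid-body transformation, and the function names `build_jump_library`, `metropolis_accept`, and `sample_jump` are hypothetical.

```python
import itertools
import math
import random

# Placeholder grids (assumptions); the actual values are those of Table I.
DISTANCES = [2.1, 2.2, 2.3]        # Å, zinc to coordinating atom
ANGLES = [100.0, 110.0, 120.0]     # degrees
DIHEDRALS = [-120.0, 0.0, 120.0]   # degrees


def build_jump_library(distances, angles, dihedrals):
    """Enumerate every combination of the sampled degrees of freedom."""
    return list(itertools.product(distances, angles, dihedrals))


def metropolis_accept(old_energy, new_energy, temperature, rng):
    """Metropolis criterion: downhill moves are always accepted,
    uphill moves with probability exp(-dE / T)."""
    delta = new_energy - old_energy
    if delta <= 0.0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def sample_jump(current_jump, library, energy_fn, temperature, rng):
    """Propose a random jump from the library; keep it only if accepted."""
    candidate = rng.choice(library)
    if metropolis_accept(energy_fn(current_jump), energy_fn(candidate),
                         temperature, rng):
        return candidate
    return current_jump
```

In this simplified picture, `energy_fn` stands in for the full scoring of a model after the jump is inserted; in practice the jump move would be interleaved with the torsional fragment insertions.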

Loop modeling

The loop modeling method used in this article and the fold-tree setup were described in detail by Wang et al.27 The method couples the cyclic coordinate descent (CCD) algorithm33 with Monte Carlo energy minimization34 to build loops onto protein template structures. The native loop conformations were removed from the template structures before modeling. Multiple loops are constructed in the low-resolution stage in a randomly selected order and then optimized simultaneously in the high-resolution refinement stage. The zinc ligand is allowed to move freely in space, and its interactions with the four coordinating residues are rewarded by the constraint terms defined in the energy function. Six thousand models were generated for each test case, and the 300 lowest-energy models (5%) were selected for further analysis.
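The energy-based filtering used in both protocols (2500 of 50,000 de novo models, 300 of 6000 loop models) amounts to keeping the lowest-energy 5%. A minimal sketch, assuming each model is represented as a hypothetical `(name, energy)` pair:

```python
def select_lowest_energy(models, fraction=0.05):
    """Return the lowest-energy fraction of (name, energy) pairs,
    sorted from lowest to highest energy."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, round(len(models) * fraction))
    return sorted(models, key=lambda m: m[1])[:n_keep]
```

For example, passing 6000 scored loop models with the default fraction returns the 300 lowest-energy ones.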

Energy function

Standard Rosetta low-resolution and all-atom energy functions were used in generating and ranking models.21, 22 The virtual atoms in the zinc ligand have no physical interactions with protein atoms; they are included only for the purpose of defining zinc-coordination constraints. The zinc atom is treated as a backbone Cα atom in the low-resolution stage. In the all-atom stage, force field parameters for the zinc ion from CHARMM27 (ref. 47) were used to model its interaction with the rest of the protein. Additional constraint energies are defined to favor the formation of zinc-coordination sites with satisfactory geometry. In the low-resolution stage, a distance constraint is defined between the zinc atom and the Cβ atom of each zinc-coordinating residue with a penalty function of the form

  • [Equation image not reproduced: low-resolution distance penalty expressed in terms of d, lb, ub, and Δ]

where d is the actual distance between zinc and Cβ and Δ is a constant of 0.2 Å. The lower and upper bounds, lb and ub, are 2.8 and 3.8 Å for Cys-zinc coordination and 3.2 and 4.0 Å for His-zinc coordination. In the all-atom refinement stage, the constraint energy for each zinc-residue coordination is composed of three terms:

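  • [Equation image not reproduced: all-atom constraint energy with bond-angle (θ, θ0, Δθ), dihedral (ϕ, ϕ0, Δϕ), and distance (d, d0, Δd) terms]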

where θ/θ0 and ϕ/ϕ0 are the actual/optimal values of the bond angles and dihedral angles, respectively, for zinc coordination as defined in Table I, with Δθ and Δϕ both equal to 20°. d is the distance between the zinc-coordinating atom and one of the virtual atoms, d0 is 0.0 Å, and Δd is 0.2 Å. The purpose of defining the distance constraints using virtual atoms instead of the actual zinc atom is to explicitly favor tetrahedral zinc coordination while keeping the coordination distance optimal. As the virtual atoms are tethered to four unique zinc-coordinating residues in the protein sequence, the zinc atom essentially becomes a chiral center, with (A1/V1, A2/V2, A3/V3, A4/V4) and (A1/V1, A2/V2, A3/V4, A4/V3) corresponding to two different zinc-coordination sites (V1, V2, V3, and V4 are the four virtual atoms in the zinc ligand, and A1, A2, A3, and A4 are the zinc-coordinating atoms from the four residues ordered from N-terminus to C-terminus). When the modeling process enters all-atom refinement, both “chiral” constraint sets are available and one is chosen at random to generate the final model. All constraint penalties for each zinc-residue interaction are summed and added to the total energy of the model with weights of 0.01 and 0.1 for the low-resolution and high-resolution energy functions, respectively.
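The constraint terms described in this section can be sketched in code. The exact functional forms are not reproduced in this text, so the sketch assumes a flat-bottomed harmonic penalty for the low-resolution distance term and a sum of three harmonic terms for the all-atom stage; both forms are assumptions, as are all function names. The bounds, tolerances, and weights are the values given in the text, with 2.8/3.8 Å and 3.2/4.0 Å read as (lower, upper) bounds.

```python
# Zinc-Cβ distance bounds in Å, read as (lower, upper).
CB_BOUNDS = {"CYS": (2.8, 3.8), "HIS": (3.2, 4.0)}

# Constraint weights for the two energy functions, from the text.
WEIGHTS = {"low": 0.01, "high": 0.1}


def bounded_penalty(d, lb, ub, delta=0.2):
    """Assumed form: zero inside [lb, ub], quadratic outside."""
    if d < lb:
        return ((d - lb) / delta) ** 2
    if d > ub:
        return ((d - ub) / delta) ** 2
    return 0.0


def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def all_atom_penalty(theta, theta0, phi, phi0, d,
                     d_theta=20.0, d_phi=20.0, d0=0.0, d_d=0.2):
    """Assumed form: harmonic bond-angle, dihedral, and distance terms.
    d is the distance from the coordinating atom to its virtual atom."""
    angle_term = ((theta - theta0) / d_theta) ** 2
    dihedral_term = (angle_diff(phi, phi0) / d_phi) ** 2
    distance_term = ((d - d0) / d_d) ** 2
    return angle_term + dihedral_term + distance_term


def weighted_constraint_energy(penalties, stage):
    """Sum the per-residue penalties and apply the stage weight."""
    return WEIGHTS[stage] * sum(penalties)
```

Wrapping the dihedral difference keeps values such as 179° and -179° only 2° apart; whether the original implementation treats periodicity this way is not stated in the text.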

Evaluation of model accuracy

To evaluate model accuracy in the loop modeling test, the RMSD is calculated over all backbone heavy atoms in the loop region between the model and the native structure after the backbones of nonloop regions of the two proteins are superimposed. To evaluate the accuracy of models generated in de novo structure prediction tests, the RMSD is calculated over all backbone heavy atoms in the entire protein chain after the model and native structure are optimally superimposed.
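The two RMSD measures can be sketched as follows. The text does not say which superposition algorithm was used; the sketch assumes the standard Kabsch algorithm for the optimal superposition and uses NumPy. The function names are hypothetical, and both functions expect matching (N, 3) arrays of backbone heavy-atom coordinates.

```python
import numpy as np


def fixed_frame_rmsd(model, native):
    """RMSD without refitting, for loop atoms whose structures were
    already superimposed on the nonloop backbone."""
    diff = np.asarray(model, dtype=float) - np.asarray(native, dtype=float)
    return float(np.sqrt((diff ** 2).sum() / len(diff)))


def superimposed_rmsd(model, native):
    """RMSD after optimal rigid-body superposition (Kabsch algorithm)."""
    P = np.asarray(model, dtype=float)
    Q = np.asarray(native, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("expected two (N, 3) arrays of equal shape")
    # Remove translation by centering both coordinate sets.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip the last axis if needed so the result is a proper rotation.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    return fixed_frame_rmsd((R @ P.T).T, Q)
```

For the loop test, the nonloop backbone would first be superimposed and the resulting transformation applied to the whole model before calling `fixed_frame_rmsd` on the loop atoms only.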

Plots and figures

R (http://www.r-project.org/) was used to make energy versus RMSD plots and RMSD distributions, and PyMOL (http://www.pymol.org) was used to produce figures of protein models.

BOINC and Rosetta@Home

Rosetta@Home (http://boinc.bakerlab.org/rosetta/), a distributed computing project that runs the Rosetta software on the personal computers of volunteers from all over the world using the Berkeley Open Infrastructure for Network Computing (BOINC) technology, was critical to the method development and model production described in this article. This substantial computing resource allowed us to rapidly test and improve the new methodology at a level not possible with only in-house computing resources.

Software availability

The software described in this article is available free for academic use at http://www.rosettacommons.org/ as part of the Rosetta software suite release 3.1 (SVN#33180) or newer. The command line options used for this study are provided in the electronic Supporting Information available over the internet as part of the Electronic Edition of Protein Science.

Acknowledgements


The authors thank the many scientists who have participated in the development of the suite of computational tools used in the Baker laboratory for computations on the structure of proteins. In particular, Philip Bradley and Andrew Leaver-Fay made key contributions to reshaping Rosetta's software architecture into an advanced modular design, which paves the way for efficient software development. David Kim built and maintained the Rosetta@Home project. Keith Laidig and Darwin Alonso maintained reliable, state-of-the-art computing resources. We thank all the Rosetta@Home users worldwide for generously donating their computer time to scientific research.

References


Supporting Information


Additional Supporting Information may be found in the online version of this article.

Filename: PRO_327_sm_suppinfo.doc
Size: 28K
Description: Supporting Information.
