Data-driven docking for the study of biomolecular complexes


A. M. J. J. Bonvin, Department of NMR Spectroscopy, Bijvoet Center for Biomolecular Research, Utrecht University, 3584CH, Utrecht, the Netherlands
Fax: +31 (0) 30 2537623
Tel: +31 (0) 30 2532652


With the amount of genetic information available, a lot of attention has focused on systems biology, in particular biomolecular interactions. Considering the huge number of such interactions, and their often weak and transient nature, conventional experimental methods such as X-ray crystallography and NMR spectroscopy cannot by themselves provide structural insight into all of them. A wealth of biochemical and/or biophysical data can, however, readily be obtained for biomolecular complexes. Combining these data with docking (the process of modeling the 3D structure of a complex from its known constituents) should provide valuable structural information and complement the classical structural methods. In this review we discuss and illustrate the various sources of data that can be used to map interactions, and their combination with docking methods to generate structural models of the complexes. Finally, a perspective on the future of this kind of approach is given.


Abbreviations

AIR, ambiguous interaction restraint; CAPRI, critical assessment of predicted interactions; CSP, chemical shift perturbation; HADDOCK, high ambiguity driven docking; HSQC, heteronuclear single quantum coherence; RDC, residual dipolar coupling; SAXS, small angle X-ray scattering.


With the available amount of genetic information, a lot of attention is focused on systems biology. Here a central question is: how do the various biomolecular units work together to fulfil their tasks? To answer this question, structural information on complexes is needed. Biochemical and biophysical experiments are widely used to gain insight into biomolecular interactions. The information generated in this way can in principle be used to model the structure of the complex under study. Taking the step from data to modeling (docking) is, however, not common practice. Docking approaches allow models of a biomolecular complex to be generated using as starting information the known structure of its constituents. Combining experimental data with docking makes sense considering that the number of single proteins, domains thereof, or other biomolecules whose 3D structures have been solved is much larger than the number of solved structures of complexes and is steadily increasing as a result of the worldwide structural genomics initiatives. The advantages of docking approaches over conventional structural techniques are the speed and the possibility of studying complexes that could otherwise be studied only with considerable effort (or not at all). One class for which this is the case is that of weak or transient, short-lived complexes; this is all the more interesting as these are often of the utmost biological importance. Other examples are the biologically highly relevant complexes of membrane or membrane-associated proteins, which are also notoriously difficult to study by NMR spectroscopy or X-ray crystallography.

Conventional crystallographic and NMR structural biology techniques have proven their value and will continue to do so. There are, however, problems associated with these techniques that are not likely to be completely overcome, especially when dealing with complexes. For crystallography, the main bottleneck is crystallization, which can be a daunting task. For NMR, large complexes cause severe line broadening, which, at present, limits NMR to systems below about 100 kDa. Moreover, to solve a structure by NMR in a conventional way, complete chemical shift assignment and collection of structural restraints such as NOEs are challenging tasks, especially for large systems such as complexes.

In this review, we wish to highlight the use of biochemical and biophysical data in docking approaches not only because of the general interest in docking as explained above, but also because it is still common practice to experimentally map interfaces without taking the next step of generating a structural model of the complex. We review only part of the docking field, namely approaches that rely on the use of additional biochemical and/or biophysical data. Generally, docking approaches that do not use any kind of experimental data have difficulty in generating consistently reliable structures of complexes. Nevertheless, clear progress has been achieved in the field of ‘ab-initio docking’, as reviewed in [1–4], and illustrated by the critical assessment of predicted interactions (CAPRI) experiment [5], a ‘blind’ docking competition in which participants have a limited time to predict the structure of a complex given only the structures of the constituents. Our discussion will be limited to biomolecular complexes, omitting protein–small ligand complexes; however, much of what is presented here will also be valid for that class of complexes. For a review on ‘guided docking’ for studying protein–ligand complexes, see reference [6].

The review is organized as follows. We will first discuss the various kinds of biochemical and biophysical data that can be combined with docking. For each of these, examples will be given, and their strengths and weaknesses for use in docking will be discussed. We will then describe the basics of current docking methodologies and highlight our newly developed data-driven docking method HADDOCK [7]. We will end with conclusions and give a broader perspective on what could be the future of data-supported docking.

Sources of experimental data to define interfaces

Data from biochemical and/or biophysical experiments that provide information on residues located at the interface of a complex are potential sources to be used in docking. Critical issues are the level of detail that can be obtained (e.g. is the information residue-specific or not?) and the reliability of the data. Here we discuss, with those issues in mind, the techniques that have been used to obtain interface information for docking. In Fig. 1 we present an overview of the most common methods. For a selected set of examples, we will also discuss how these data relate to the experimental high-resolution structure solved by conventional methods (Table 4). Other experimental methods such as small angle X-ray scattering (SAXS) or electron microscopy and tomography can also provide valuable information about the ‘shape’ and organization of biomolecular complexes. As these are rather different kinds of approaches, we will not review them here, but only briefly mention their potential in our conclusions and perspectives. A general review of structural perspectives on protein–protein interactions can be found in reference [8].

Figure 1.

Illustration of the various data sources used in combination with docking. Left: advantages (+) and disadvantages (–); right: pictorial representation of the data source: the green and red shapes represent the two components of the complex. Mutagenesis: the blue star indicates a mutated residue; cross-linking: the black line indicates a cross-link; H/D exchange: ‘D’ and ‘H’ indicate residues where exchange can and cannot take place, respectively; CSP (chemical shift perturbation): HSQC spectrum showing one peak that does not shift and one peak that shifts on complex formation (the corresponding residues are indicated on the protein shapes); RDC, relaxation: the axis system indicates the tensor which provides orientational information.

Table 4.  Comparison of experimental information defining interfaces with the experimental X-ray or NMR structures (CSP, chemical shift perturbation; DMC, double mutant cycles; SAT, saturation transfer).
Complex | Information used | Reference
Mutagenesis data
Barnase–barstar | DMC: coupling energy decreases as distance increases | [176]
Antibody D1.3–antibody E5.2 | DMC: of 13 identified, 9 in interface and 4 not in interface showing significant coupling, but lower than the contacting residues | [177]
Cyt c–peroxidase | Mutations: sites coincide with X-ray defined sites; DMC: couplings for residues that are more than 10 Å apart, concluded to be due to small rearrangements | [178]
Cyt c2–RC | DMC: coupling approximately inversely proportional to distances | [179]
MS data
DnaA domain 4–DnaA box | Cross-linking data correctly locate the interaction site to a six residue peptide fragment identified previously by X-ray/NMR | [180]
Ribosome | Comparison of > 2500 experimental distance restraints (cross-linking, footprinting and cleavage data) with X-ray structure showing good agreement | [144]
NMR data
Lysozyme–antibody | H/D: of 15 perturbed: 5 on epitope, 5 at edge, 5 far away | [181]
OMTKY3–Ctr | CSP fully consistent with X-ray | [182]
rNTF2–FxFG-containing Nsp1-P30 | High affinity X-ray site seen by NMR; NMR also finds low affinity site → NMR data better able to identify weak interactions | [183]
Zf1–3 (TFIIIA)–15 bp DNA | CSP data do not correspond exactly to the interface, but arise from a number of effects | [184]
CAD–ICAD | NOE and SAT defined interface is quite consistent with X-ray; CSP defined interface is a bit different | [82]
Nova1–RNA | Cross-saturation defined residues match closely the X-ray interface; CSP data define the same residues and a few additional ones | [185]
RNAse E S1 homodimer | CSP used to assess validity of crystallography dimer; data match the contacting residues seen in the crystal | [186]


When using mutagenesis to derive information for docking, one considers as candidates only the residues that are on the surface of the partner proteins. The general idea then is that mutation of an interface residue will influence the interaction, whereas for non-interface residues the mutation will have no effect. A variety of methods can be used to find out whether complex formation is affected by mutations, such as surface plasmon resonance [9], MS, yeast two-hybrid systems [10] and phage display libraries [11]. Target residues for mutagenesis can be selected based on knowledge such as conservation (see below), but it is also possible to perform an in-depth systematic scan as in alanine scanning mutagenesis studies [12,13]. An online database with results from alanine scanning mutagenesis, ASEdb, has been established [14]. These methods indicate which residues are in the interface, but do not give information about the contacts that are made across the interface. More detailed information can be obtained using so-called double mutant cycles [15]. Here one creates a series of mutants for both proteins. By measuring the Kd values for combinations of mutants, one can assess whether the influence of mutation X in protein A on the complex formation depends on mutation Y in protein B. If this is the case, the mutations are coupled, and one infers that the residues are close in space, i.e. that they are in contact or close proximity across the interface.
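
The coupling test in a double mutant cycle can be sketched numerically. The snippet below is a minimal illustration, not a published protocol: it assumes binding free energies are obtained as ΔG = RT ln Kd, and the function names and Kd values are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature=298.0):
    """Binding free energy from a dissociation constant: dG = RT ln(Kd)."""
    return R_KCAL * temperature * math.log(kd_molar)

def coupling_energy(kd_wt, kd_x, kd_y, kd_xy, temperature=298.0):
    """Coupling energy of a double mutant cycle.

    ddG_int = ddG(XY) - ddG(X) - ddG(Y), each ddG taken relative to wild type.
    A value near zero means the two mutations act independently.
    """
    g_wt = binding_free_energy(kd_wt, temperature)
    dd_x = binding_free_energy(kd_x, temperature) - g_wt
    dd_y = binding_free_energy(kd_y, temperature) - g_wt
    dd_xy = binding_free_energy(kd_xy, temperature) - g_wt
    return dd_xy - dd_x - dd_y

# Hypothetical Kd values (molar), for illustration only.
# Case 1: 10-fold and 100-fold losses combine to 1000-fold, so no coupling.
additive = coupling_energy(1e-9, 1e-8, 1e-7, 1e-6)
# Case 2: the double mutant is no worse than mutant Y alone, so the sites are coupled.
coupled = coupling_energy(1e-9, 1e-8, 1e-7, 1e-7)
```

A coupling energy clearly different from zero is what one would translate into a proximity restraint between the two mutated residues; how large "clearly different" must be depends on the experimental error and is not addressed here.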

A general warning when using mutagenesis data is that it is unsound to assume that residues for which no effect is seen on mutation do not participate in an important interaction, unless it can be demonstrated that water, or nearby side chains, do not effectively substitute for the deleted atoms [13]. Another point is that one should, in principle, always check whether the mutants do not affect the 3D structure of the free components themselves, i.e. whether or not the native structures are preserved. Mutagenesis approaches, when carried out extensively, are able to generate a fairly detailed map of the interface of a biomolecular complex. In Table 1 we give an overview of complexes for which mutagenesis data have been used in docking.

Table 1.  Examples of complexes docked using mutagenesis data (GST, glutathione S-transferase; SPR, surface plasmon resonance; CSP, chemical shift perturbation). –, Data were taken from the literature without giving any experimental details.
Complex | Information used | Reference
FAK FAT domain–paxillin-derived LD2 peptide | GST domain fusion | [89]
TF/fVIIa/fXa | Charge altering mutations | [152]
RIIα–Cα subunits of PKA | Neutron scattering, mutagenesis | [110]
Glycophorin A dimer | – | [45]
Phospholamban pentamer | – | [44,154,155]
Staphylokinase–microplasmin | Phage display | [156]
Gα–Gβγ-receptor | G-protein activation assay | [157]
30S ribosomal subunit–colicin E3 | Immunoblotting | [70,71]
EmrE dimer | Cysteine mutagenesis, cross-linking | [78]
Hsc70–auxilin | Rescue-mutant pair, CSP | [158]
Kv1.3 K+ channel αIIb – six different scorpion toxins | Comparison of electrostatic energy with binding affinity | [63]
Integrin αIIb TM domain homodimer | CAT-ELISA | [47]
C1q–C-reactive protein/IgG | – | [49]
Antibody fragment–α bungarotoxin | CDR on antibody; epitope mapping | [159]
Malonyl-CoA–COT/CPT | Enzyme activity assay, immunoblotting | [160]
Protein–DNA complexes of 434 cro and lac headpiece | Ethylation interference | [34]
LexA DBD–DNA | Ethylation interference | [72]
Repressor–protein–DNA | DNA footprinting | [162]
Fis–DNA | Chemical interference, nuclease DNA cleavage site | [163]
EnvZ dimer | Cysteine substitutions and disulfide cross-linking detection | [164]
Subunit c oligomer of H+-transporting ATP synthase | Cysteine substitutions and disulfide cross-linking detection | [165]
Yeast cofactor A–β-tubulin | Two-hybrid assay | [166]
FOG-ZF3KRA–TACC3 | Two-hybrid assay; NMR CSP | [90]
Double mutant cycles
BgK–Kv1.1 | Electrophysiological experiments, dose–response curve | [74]
Agitoxin–shaker K+ channel | – | [75]
IFN-α2–ifnar2 | Reflectometric interference spectroscopy | [77]
α-Cobratoxin–α7 receptor | Binding competition | [76]

Mass spectrometry

There has been increasing interest in MS as a tool in structural biology in general, and also specifically to obtain information about biomolecular complexes [16,17]. One approach that can be used is H/D exchange. Here the rate of exchange gives information about the accessibility of the residue in question; rate differences between free and bound forms indicate that a given residue is protected on complex formation and thus probably involved in the interaction [18,19]. Another possibility is cross-linking, where residues close in space are detected by first covalently linking two molecules by the use of a cross-linking reagent, and then subjecting the resulting material to peptide mass fingerprinting or other protein identification methods [20]. Although these methods are promising, the cross-linking reaction is problematic, and the information is often not easy to interpret. The detection of cross-linked residues is especially nontrivial. To date, MS data have only rarely been combined with docking approaches (Table 2).

Table 2.  Examples of complexes docked using MS data.
Complex | Information used | Reference
Aminoacylase-1 dimer | Proteolysis, cross-linking | [111]
PKA–C and R subunit | H/D exchange | [50]
C1r (γ-B)2 | Cross-linking | [167]
IL-6 homodimer | Cross-linking | [112]


Conventional NMR methods have been used for more than a decade to study biomolecular complexes. In the classical approach, one first has to perform a resonance assignment that is as complete as possible, and then collect structural restraints such as NOEs, which can be detected between protons that are close in space (< 5 Å), and residual dipolar couplings that provide orientational information. Using such restraints, one can accurately define the structure of a biomolecule or a biomolecular complex. In addition to its conventional use in structure determination, NMR is very well suited to map interfaces of biomolecular complexes with so-called chemical shift perturbation (CSP) experiments [21]. Here, easily obtainable heteronuclear single quantum coherence (HSQC) spectra of one (15N-labeled) partner in the complex are recorded in the absence and presence of increasing amounts of the partner protein (‘titration experiments’). Changes in chemical shifts of one molecule on addition of a second molecule allow assessment of which residues of the labeled molecule are perturbed by the formation of the complex. One then repeats this procedure with the second molecule labeled. Under the assumption that the perturbed residues correspond to the interacting residues, a detailed map of the interface is obtained.
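
As an illustration of how perturbed residues might be picked from two HSQC peak lists, the sketch below combines the 1H and 15N shift changes into one value per residue and flags those above a cutoff. The weighting of the 15N axis (`n_scale`), the mean-plus-one-standard-deviation cutoff, and all peak positions are assumptions made for the example; different groups use different conventions.

```python
import math

def combined_csp(d_h, d_n, n_scale=0.2):
    """Combined 1H/15N shift change in ppm; n_scale down-weights the 15N axis (assumed value)."""
    return math.sqrt(d_h ** 2 + (n_scale * d_n) ** 2)

def perturbed_residues(free, bound, n_sigma=1.0):
    """Return residues whose combined shift change exceeds mean + n_sigma * std.

    free, bound: dicts mapping residue number -> (1H ppm, 15N ppm).
    """
    csp = {res: combined_csp(bound[res][0] - free[res][0],
                             bound[res][1] - free[res][1])
           for res in free if res in bound}
    values = list(csp.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    cutoff = mean + n_sigma * std
    return sorted(res for res, v in csp.items() if v > cutoff)

# Hypothetical peak positions for five residues, free and in the complex.
free = {10: (8.20, 120.0), 11: (7.90, 118.5), 12: (8.55, 122.1),
        13: (8.10, 115.0), 14: (7.65, 119.3)}
bound = {10: (8.21, 120.05), 11: (7.90, 118.5), 12: (8.80, 123.3),
         13: (8.11, 115.0), 14: (7.66, 119.35)}
hits = perturbed_residues(free, bound)
```

In this toy data set only residue 12 moves appreciably. In practice one would also restrict the list to solvent-exposed residues, since shifts of buried residues more likely reflect indirect effects.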

Two other NMR techniques that are able to give similar information are H/D exchange and cross-saturation or saturation transfer [22]. As in MS, NMR can also easily be used to perform H/D exchange experiments; again, differences in exchange rates when comparing uncomplexed and complexed forms point to protected residues that are assumed to be at the interface. In cross-saturation experiments, the observed protein is perdeuterated and 15N-labeled, with its amide deuterons exchanged back to protons, while the other ‘donating’ partner protein is unlabeled. Saturation of the unlabeled protein leads by cross-relaxation mechanisms to signal attenuation (again typically monitored by 15N-HSQC spectra) of those residues in the labeled protein that are in close proximity. The labeling scheme can be reversed to map the other interface. Deuteration is a requisite here. Cross-saturation experiments are believed to give a more reliable picture of the interface than CSP data, which can suffer from ‘false positives’ because of conformational changes.

Other relatively easily obtainable NMR parameters are residual dipolar couplings (RDCs) [23]. These provide information about the orientation of the components with respect to each other, and can be used in addition to CSP data in docking approaches. Comparable information can be extracted from relaxation experiments in the case of diffusion anisotropy [24].

An NMR parameter that can also be useful is the pseudocontact shift. It results from residual electron–nuclei dipolar interactions in molecules [21]. The use of paramagnetic tags attached to a protein can induce this phenomenon [25,26]. As pseudocontact shifts contain long-range information, they can be very useful in docking approaches. It is also possible to use paramagnetic ions as probes, as they induce broadening of the NMR signals for the residues they contact. In a complex, the interface residues will be protected from such effects, allowing a reliable detection of the interface [27]. An overview of complexes for which NMR data have been used in docking approaches is given in Table 3.

Table 3.  Examples of complexes docked using NMR data (CSP, chemical shift perturbation; PC, pseudocontact shifts; SAT, saturation transfer).
Complex | Information used | Reference
Cyt c–cyt f | CSP | [56]
Cyt c–cyt c peroxidase | CSP | [54]
Plastocyanin–cyt f | PC, CSP | [80,81]
Myoglobin–cyt b5 | CSP, 15N relaxation | [57]
Ubiquitin–hHR23A UBA1, UBA2 | CSP | [93]
hHR23A (four linked domains) | CSP, RDC | [168]
Ubiquitin–p47 UBA domain | CSP | [96]
Di-ubiquitin | CSP, RDC | [169,170]
UbcH5B–CNOT4 | CSP, mutagenesis | [88]
EIN–HPrᵃ, IIA(Glc)–HPrᵃ, IIA(Mtl)–HPrᵃ | CSP, RDC | [84]
Bem1 PB1–Cdc24 PB1 | CSP, mutagenesis | [95]
RPA70A–Rad51N | CSP, mutagenesis | [94]
EIN–HPrᵃ, E2A–HPrᵃ | CSP | [7]
Atx1–Ccc2 domain | CSP | [92]
FcεRIα–IgE Cε2 | CSP | [172]
FcεRI–peptide | CSP, mutagenesis, NOE | [66]
LpxA–acyl carrier protein | CSP, RDC, mutagenesis | [91]
Tri,hexa saccharide–antibody | SAT | [173]
(Glycosylated) PDTRP–antibody SM3 | SAT | [174]
Fibronectin (13,14)F3–heparin | CSP | [62]
Protein–nucleic acids
NS1A(1–73)–16 bp dsRNA | CSP | [40]
UvrC CTD–junction DNA | CSP | [39]
XPA-MBD–9 bp ssDNA | CSP | [175]
Rom–RNA kissing hairpin | CSP | [41]
Pf3 ssDBP–ssDNA | CSP | [83]
CylR2–22 bp DNA | CSP | [73]
ᵃ These complexes were also solved using the classical NOE-based approach.

Reliability issues

It should be clear that there is a wealth of experimental data, not all of them having been discussed here, that can be used to define interface residues. The question of the reliability of this information is of course very important. In Table 4 we give an overview of some complexes for which the experimental data have been compared explicitly with the (at that time available) corresponding 3D structures. In Fig. 2, as an example, experimental data for the antibody D1.3–antibody E5.2 complex are mapped on to the surfaces of the two proteins. Although these are only a few examples, the general trend indicates that the experimental sources discussed above provide quite reliable information on interface residues. Sometimes the observed effects result from small rearrangements and secondary effects rather than from direct contacts, but as long as these ‘false positives’ are not too numerous, they can be dealt with in computational approaches (see below). If conformational changes are too large, however, docking approaches are probably bound to fail. It is not simple to predict a priori from the data if such effects should be expected. Sometimes, clustering of predicted interface residues on the surface can give a good indication that the mapped interface is very likely to be the correct one.

Figure 2.

Mapping of the mutagenesis data [177] on to the structure of the antibody D1.3–antibody E5.2 complex [187] (pdb entry 1dvf). Top: structure of the complex; bottom: interaction surface of E5.2 (left) and D1.3 (right) color coded according to the measured ΔΔG value [177] in mutagenesis experiments. Red: ΔΔG > 4.0 kcal·mol−1; orange: ΔΔG 2.1–4.0 kcal·mol−1; yellow: ΔΔG 1.1–2.0 kcal·mol−1; green: ΔΔG < 1.0 kcal·mol−1. Figures were prepared using MOLSCRIPT [188] and Raster3D [189].

Computational docking approaches using experimental data

In the docking literature one often finds the distinction between ‘bound’ and ‘unbound’ docking: the former refers to docking using the structures of the single proteins as they are present in the complex, and the latter to docking using the structures of the free proteins. As only the latter is of biological relevance, here ‘docking’ will refer to ‘unbound docking’ (although in some cases a method is, as a first, easier step, tested in bound docking).

As defined in the introduction, docking methods generate a model of a complex based on the known 3D structures of its free components. To do this in a computer, two things are needed: a way to generate structures of the complex, i.e. a sampling method, and a way to decide which of the generated structures are ‘good’, i.e. a scoring method. The output typically consists of a large number of solutions, some of which get a high ranking and are accordingly considered to correspond to the ‘real’ structure, whereas others get a lower ranking and are discarded.

Docking methods vary in the way sampling and scoring are implemented, and also in the representation of the molecules in the calculations. An important choice to be made is whether the proteins are kept rigid or whether flexibility is needed. Flexibility can be introduced in various ways, e.g. by using an ensemble of rigid structures (experimental or generated for example by molecular dynamics methods) corresponding to static snapshots of possible conformational changes, by allowing some interpenetration of the docked molecules (sometimes called ‘soft’ rigid body docking, as opposed to ‘hard’ rigid body docking, where no overlap is allowed at all), or by allowing explicit side-chain and/or backbone flexibility during the docking. The type of sampling depends on the way in which the molecules are represented. When a grid representation of the molecules is used, rigid body docking can be done by calculating correlations (e.g. surface complementarity) using fast Fourier transform methods [28–33]. When the protein is explicitly represented using an atomic model, one can use various sampling methods such as Monte Carlo [34–36] and molecular dynamics methods [7] or genetic algorithms [36] in combination with simulated annealing schemes. The scoring is typically based on some kind of force field [37], which assigns an energy to atom–atom (or residue–residue) pairs, and subsequently adds all these together to get the energy of a given configuration. Often, terms such as buried surface area and desolvation energy are added. Force fields can have a physical basis or can be knowledge based (derived by counting how often a given pair occurs in a database of experimental structures).

Using biochemical and/or biophysical data in docking approaches has advantages for both the sampling and scoring stages. During the sampling, more ‘relevant’ configurations are produced, whereas in the scoring, the ranking of true positives (i.e. correct solutions) can be improved compared with ab initio docking, where typically tens to hundreds of false positives are scored at the top. An important difference between various methods is whether the experimental data are only introduced in the scoring (i.e. to filter the solutions that have been generated) or whether they are also used during sampling. In the following we will discuss a number of methods that have been proposed, first the procedures that only use experimental data for scoring, and next those that incorporate experimental data into the sampling itself. In Fig. 3 a graphical representation is given of the choices to make in the various docking approaches with respect to the incorporation of experimental data and the treatment of flexibility.
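
Using the data only at the scoring stage amounts to a filter applied to solutions that were generated without them. The following is a minimal sketch of such a filter under stated assumptions: each model is represented simply by its energy and the set of residues at its interface, and the 70% threshold, the data structure and all values are hypothetical rather than taken from any of the programs discussed here.

```python
def fraction_satisfied(model_interface, experimental_residues):
    """Fraction of experimentally implicated residues found in the model's interface."""
    if not experimental_residues:
        return 1.0
    hits = sum(1 for res in experimental_residues if res in model_interface)
    return hits / len(experimental_residues)

def filter_and_rank(models, experimental_residues, min_fraction=0.7):
    """Keep models consistent with the data, then rank survivors by energy (lowest first)."""
    kept = [m for m in models
            if fraction_satisfied(m["interface"], experimental_residues) >= min_fraction]
    return sorted(kept, key=lambda m: m["energy"])

# Hypothetical docking output: (chain, residue number) pairs at each model's interface.
models = [
    {"name": "m1", "energy": -42.0, "interface": {("A", 12), ("A", 15), ("B", 40)}},
    {"name": "m2", "energy": -55.0, "interface": {("A", 80), ("B", 3)}},
    {"name": "m3", "energy": -38.5,
     "interface": {("A", 12), ("A", 15), ("A", 16), ("B", 40), ("B", 44)}},
]
data = {("A", 12), ("A", 15), ("B", 40), ("B", 44)}
ranked = [m["name"] for m in filter_and_rank(models, data)]
```

Here the model with the best energy (m2) is discarded because none of the experimentally implicated residues are in its interface, which is the situation in which a data-based filter is most useful. A threshold below 100% leaves room for a few false positives in the data.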

Figure 3.

Some choices to be made in docking. (A) When to introduce the data? Here the complex structures resulting from a hypothetical docking method are shown, and the scoring is represented in a simplified way, discarding the complexes that do not satisfy the experimental restraints (indicated by the black crosses); (B) How to deal with flexibility: using an ensemble of starting structures; by soft rigid body docking; and explicitly during the docking by allowing side chain and/or main chain flexibility.

Although computer-based approaches should be preferred in terms of reproducibility, it is also possible to ‘manually’ build models of complexes based on experimental information. In fact there are quite a few examples where this has been done [38–42], some of which have been compared with pure ab initio docking results [43].

We should point out here that each docking approach has its own advantages and disadvantages, and the ‘docking problem’ is still unsolved: no single docking method will always give the right answer. The docking field is still in active development, and various approaches to the problem are being pursued, as will be discussed below.

Docking methods using experimental data only in the scoring stage

A large variety of docking methods exist and have been used before applying a filter based on experimental data. One approach consists of a systematic grid search for all possible orientations (three translations and three rotations). This is only feasible for small systems and simplified models, as otherwise scoring all possible configurations becomes intractable. Such a method has been used for probing transmembrane helix multimers, e.g. the dimeric transmembrane region of glycophorin A and the phospholamban pentamer. The low-energy structures resulting from the grid search were filtered using mutagenesis data [44–47].
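
A rough count shows why a systematic six-dimensional search becomes intractable so quickly. The sketch below assumes a cubic translational box and a regular Euler-angle grid; the box size and step sizes are hypothetical, and a uniform Euler grid is itself a simplification because it oversamples some orientations.

```python
import math

def grid_search_size(box_edge, trans_step, rot_step_deg):
    """Rough number of rigid-body poses on a regular 6D grid.

    Translations: a cube of edge box_edge sampled every trans_step (same length unit).
    Rotations: three Euler angles, two spanning 360 degrees and one spanning 180 degrees.
    """
    n_trans = math.ceil(box_edge / trans_step) ** 3
    n_full = math.ceil(360 / rot_step_deg)
    n_half = math.ceil(180 / rot_step_deg)
    return n_trans * n_full * n_full * n_half

# Hypothetical settings: a 40 Å box, coarse versus fine sampling.
coarse = grid_search_size(box_edge=40, trans_step=2, rot_step_deg=30)
fine = grid_search_size(box_edge=40, trans_step=1, rot_step_deg=10)
```

Halving the translational step and going from 30° to 10° in rotation raises the count from about 7 million to about 1.5 billion poses, each of which would still have to be scored.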

When studying larger systems, and especially if one wants to introduce a substantial amount of flexibility in the docking, exhaustive grid searches become unrealistic. A fast method to perform grid calculations based on spherical Fourier correlations is implemented in the program Hex [48]. It has been combined with mutagenesis data [49]. Fast Fourier transform methods have often been used in docking. For example, the docking program DOT [29] has been used in combination with MS H/D data to filter solutions [50]. Other examples of fast Fourier transform based methods are the soft docking program GRAMM [30], which has been used in combination with mutagenesis data [51], and FTDOCK [28], which was originally tested on several complexes using experimental data (e.g. active-site information in the case of enzyme–inhibitor complexes) and was recently combined with NMR data (CSP and RDCs) to filter solutions [52]. Another grid approach, which uses Boolean-type operations and was optimized heuristically for speed, is the docking program BIGGER [53]. This program allows soft rigid body docking (hard and soft docking are compared in [54]). BIGGER is often used in combination with NMR CSP data [55–59].

There are several docking approaches that do not use a grid but rather an explicit search in the configurational space, e.g. DOCK [60,61], AutoDock [36], which was used in combination with CSP data [62], and other methods based on Brownian dynamics simulations followed by molecular dynamics refinement of the initial models [63]. NMR CSP data have also been used in a more quantitative way for filtering docking solutions, by back-calculating chemical shift changes from the models with programs such as SHIFTS [64] or SHIFTX [65] and comparing them with the experimental values [66]. This approach has also been combined with RDCs [67]. The above methods have been successfully applied to model various biomolecular complexes (Tables 1–3).

Docking methods using experimental data to drive the docking

The advantage of using the data in the sampling stage of docking is that ‘correct’ or ‘near-correct’ configurations should be enriched, compared with approaches in which the data are only used in the scoring stage, provided of course that the experimental information is correct. This becomes especially important when the number of configurations is too large to be adequately sampled, as is often the case when flexibility is introduced.

As will be clear from the following discussion, there are different ways to incorporate the experimental data during the sampling stage. This partly depends on the kind of data used (e.g. the level of detail and the amount of inherent ambiguity) and the sampling method. ‘Geometric’ methods might limit the number of orientations selected for docking rather than adding experimental terms to an energy function. The search space is thus reduced on the basis of the available experimental data. The subsequent docking and scoring stages then proceed as in ab initio docking [68]. Other approaches use anchor points based on experimental data, e.g. TreeDock [69], or incorporate the experimental data by up-weighting given residues in fast Fourier transform-based rigid body docking approaches (‘weighted geometric docking’) [32,70,71]. Another popular possibility is to use some kind of distance restraints. This means that an additional energy term is created, which is high if residues that, according to the data, should be at the interface (i.e. close to each other) are far apart in the proposed complex, and, conversely, low if they are near.
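
One common functional form for such a restraint term is a flat-bottomed harmonic penalty, sketched below. This is an illustrative choice only: the functional form, the force constant and the 5 Å upper bound are assumptions, not the settings of any particular program mentioned in this review.

```python
def restraint_energy(distance, upper_bound, force_constant=10.0):
    """Flat-bottomed harmonic penalty: zero within the bound, quadratic beyond it."""
    violation = max(0.0, distance - upper_bound)
    return force_constant * violation ** 2

def total_energy(force_field_energy, distances, upper_bound=5.0):
    """Force-field energy plus the sum of restraint penalties over all restrained pairs."""
    return force_field_energy + sum(restraint_energy(d, upper_bound) for d in distances)

# Hypothetical configurations with the same force-field energy (arbitrary units).
near = total_energy(-50.0, [3.2, 4.8])   # both restrained pairs within the bound
far = total_energy(-50.0, [3.2, 12.0])   # one pair 7 Å beyond the bound
```

Because the penalty is added to the function being minimized, configurations that violate the data are disfavored during sampling itself rather than removed afterwards.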

Ethylation interference and mutagenesis data have been used as experimental input for protein–DNA docking in the early data-driven Monte-Carlo docking program MONTY [34,72,73], which allows side-chain flexibility and DNA deformations. Double mutant cycle data, giving information about residue–residue contacts, have been incorporated as distance restraints in various applications [74–77]. A comparable approach was used to incorporate cross-linking data for a dimer of a four-transmembrane helix protein [78]: here a total of 10 distance restraints could be defined with quite small error bounds because of the rigid nature of the linker. There are several examples of the combination of NMR information with rigid body docking. Rigid body docking in X-PLOR [79] has been used to model the dynamic complex between plastocyanin and cytochrome f based on upper bound distance restraints derived from pseudo-contact shifts and CSP data, and lower bound distance restraints for residues assumed not to be in the interface [80,81]. Saturation transfer and RDC restraints have been combined with energy minimization to model the CAD–ICAD complex (complex between the CAD domain of caspase activated deoxyribonuclease and the CAD domain of its inhibitor) [82]. The nucleoprotein superhelix–DNA complex was modeled using CSP restraints in a grid search [83].

Some experimental data are highly ambiguous and only provide information about interface residues, but not about the specific contacts they make. Docking approaches should thus be capable of incorporating such ambiguity. Typical examples here would be the CSP data obtained from NMR titration experiments or mutagenesis data. With this in mind, we developed an information-driven semiflexible docking approach called HADDOCK [7] in which any kind of information about interface residues can be incorporated as highly ambiguous interaction restraints (AIRs) (see below). Related approaches have been described in [84], where NMR CSP data and RDCs were used, and in [85] for cross-linking information detected by MS.


The method

As is clear from the discussion above, there is a wealth of experimental sources that can provide information about interfaces of biomolecular complexes. These data are, however, generally not used for modeling. Our docking approach HADDOCK, an acronym for high ambiguity driven docking [7], makes use of such information to drive the docking while allowing various degrees of flexibility. The information is encoded in AIRs similar to the ambiguous restraints commonly used in NMR structure determination [86]. The ambiguity here refers to the way in which the restraints are defined: between any residue which, based on experimental data, is believed to be an interface residue (called active residue), and all such residues (plus surface neighbors, called passive residues) on the partner molecule. An AIR is defined as an ambiguous intermolecular distance ($d_{iAB}$) with a maximum value of typically 2 Å between any atom $m$ of an active residue $i$ of protein A ($m_{iA}$) and any atom $n$ of both active and passive residues $k$ ($N_{\mathrm{res}}$ in total) of protein B ($n_{kB}$) (and inversely for protein B). The effective distance $d_{iAB}^{\mathrm{eff}}$ for each restraint is calculated using the equation:

$$d_{iAB}^{\mathrm{eff}} = \left( \sum_{m_{iA}=1}^{N_{\mathrm{atoms}}} \; \sum_{k=1}^{N_{\mathrm{res}}} \; \sum_{n_{kB}=1}^{N_{\mathrm{atoms}}} \frac{1}{d_{m_{iA} n_{kB}}^{6}} \right)^{-1/6}$$

where $N_{\mathrm{atoms}}$ indicates all atoms of a given residue and $N_{\mathrm{res}}$ the sum of active and passive residues for a given molecule. The definition of passive residues ensures that residues that are at the interface but are not detected (e.g. no CSP when using NMR, or no change in binding on mutation) are still able to satisfy the AIRs, i.e. contact active residues of the partner molecule. The $1/r^{6}$ summation [87] is used to mimic the attractive part of a Lennard-Jones potential and ensures that the AIRs are satisfied as soon as any two atoms of the two proteins are in contact. The AIRs are incorporated as an additional term in the energy function that is minimized during the sampling. The docking proceeds in three stages during which increasing amounts of flexibility are introduced. In the first stage, the molecules are considered as rigid bodies, and a large number of solutions are generated. In the second stage, a limited amount of flexibility is introduced, first into the side chains and subsequently into both side chains and backbone of predefined flexible segments encompassing the active and passive residues. Finally, the solutions are refined in explicit solvent. The final structures are clustered and scored using a combination of energy terms (mainly intermolecular van der Waals and electrostatic energies and restraint energies); for details see [7,88]. Note that fully flexible models can also be defined, for example for the docking of an unstructured peptide on to a protein.
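To make the behavior of the effective distance concrete, the following Python sketch enumerates AIRs and evaluates the $1/r^{6}$ sum (raised to the power −1/6) for one active residue. It is a minimal illustration under stated assumptions, not the HADDOCK implementation: the function names `build_airs` and `effective_distance`, the residue labels and the coordinate layout (lists of x, y, z tuples in Å) are all hypothetical.

```python
import math


def build_airs(active_a, passive_a, active_b, passive_b):
    """List one ambiguous restraint per active residue, in both directions.

    Each restraint pairs an active residue of one molecule with ALL active
    and passive residues of the partner (residue labels are hypothetical).
    """
    airs = [("A", res, list(active_b) + list(passive_b)) for res in active_a]
    airs += [("B", res, list(active_a) + list(passive_a)) for res in active_b]
    return airs


def effective_distance(active_atoms, partner_residues):
    """Effective distance of one AIR: (sum over atom pairs of 1/d^6)^(-1/6).

    active_atoms     -- atom coordinates of one active residue
    partner_residues -- one list of atom coordinates per active or passive
                        residue of the partner molecule
    """
    total = 0.0
    for atom_a in active_atoms:
        for residue in partner_residues:
            for atom_b in residue:
                total += math.dist(atom_a, atom_b) ** -6
    return total ** (-1.0 / 6.0)


# A single atom pair 4 A apart gives an effective distance of 4 A.
one_pair = effective_distance([(0.0, 0.0, 0.0)], [[(4.0, 0.0, 0.0)]])

# A second pair at the same separation shortens the effective distance.
two_pairs = effective_distance(
    [(0.0, 0.0, 0.0)],
    [[(4.0, 0.0, 0.0)], [(0.0, 4.0, 0.0)]],
)

# Hypothetical residue labels: one active and one passive residue on A,
# two active residues on B.
airs = build_airs(["A10"], ["A11"], ["B5", "B7"], [])
```

Because every additional atom pair adds a positive term to the sum, the effective distance is never longer than the shortest individual interatomic distance, and it decreases as more atoms come into proximity. This is why a short upper bound can be met by residues that are merely in contact.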


Several groups have used HADDOCK to generate models of biomolecular complexes in combination with different sources of information such as mutagenesis [89–91] or NMR CSP data [88,89,91–96]. A common problem resulting from the highly ambiguous nature of the interaction restraints is that symmetrical solutions are often obtained, corresponding, for example, to a 180° rotation of one molecule with respect to the other. In cases where energy considerations cannot distinguish between the symmetrical solutions, additional information should ideally be obtained. This was the case for the UbcH5–Not4 complex [88] (Fig. 4A). To solve the symmetry problem, the HADDOCK models were used for structure-directed mutagenesis. Reverse mutants could be produced in which two residues of opposite charge across the interface were swapped, thereby restoring binding. This provided unique, unambiguous information to select the correct solution.

Figure 4.

Two examples of structures calculated using HADDOCK. (A) The UbcH5–Not4 complex (pdb entry 1ur6) [88]. In a first docking run using only NMR CSP data, two models were obtained (top left and top right). Based on these, mutagenesis experiments were performed to discriminate between the two models: the charge-reversing double mutant E49K,K63E did restore the complex (red box), whereas the double mutants including K4E or K8E did not restore complex formation. Only the left solution is consistent with this information. (B) TBE virus envelope glycoprotein E trimer (CAPRI target 10), for which epitope, conservation and protection from enzymatic digestion data were introduced in HADDOCK, resulting in a docking model (left) within 2.9 Å ligand–RMSD from the crystal structure [190] (pdb entry 1urz, right). The three subunits are color-coded; note that two segments (residues 148–159 and 204–209) are missing from the crystal structure.

In the case of the transient complex between the yeast copper chaperone Atx1 and the first soluble domain of the copper-transporting ATPase Ccc2, a copper ion was explicitly introduced into the docking calculations based on NMR CSP data and found to move from Atx1 to Ccc2, consistent with the physiological direction of transfer [92]. The copper-transfer intermediate was a result of the flexible docking protocol, as no restraints were introduced to force the copper ion to move. This example indicates that flexible data-driven docking can be used to investigate not only ‘static’ structures but also more ‘dynamic’ aspects of biomolecular complexes. When available, classical NMR data such as NOEs can also be incorporated into HADDOCK, as was the case for generating the solution structure of a nonspecific protein–DNA complex [97].

Recently, we participated in the fourth and fifth rounds of the ‘blind docking competition’ CAPRI. As CAPRI is not especially meant for data-supported docking, we had to search the literature and databases and use sequence conservation criteria (predicted via a neural network [98]) to define AIRs. Using HADDOCK, we were able to generate structures that are close to the experimentally determined structures even with low-resolution, ‘fuzzy’ data such as epitope mapping and protection from enzymatic digestion. As an example, we successfully predicted the trimeric form of the TBE virus envelope glycoprotein E within 2.9 Å ligand–RMSD (Fig. 4B) (the ligand–RMSD is defined as the RMSD calculated on one component after superposition of the other components). Our participation in the CAPRI experiment has, however, taught us that in some cases our docking methods, as well as others, can fail.
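The ligand–RMSD definition given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by CAPRI or HADDOCK. The function names `superposition` and `ligand_rmsd` are hypothetical, atoms are assumed to be supplied in matching order as (x, y, z) tuples in Å, and the least-squares rotation is obtained with a quaternion fit solved by simple power iteration so that the example needs only the standard library. In practice an established structural-biology library would be preferable.

```python
import math


def _centroid(points):
    n = len(points)
    return [sum(p[a] for p in points) / n for a in range(3)]


def superposition(mobile, target, n_iter=500):
    """Least-squares rotation R and centroids so that
    R (mobile - c_mobile) + c_target best matches target."""
    cm, ct = _centroid(mobile), _centroid(target)
    # Correlation matrix S[a][b] = sum_i mobile_i[a] * target_i[b] (centered).
    s = [[sum((m[a] - cm[a]) * (t[b] - ct[b]) for m, t in zip(mobile, target))
          for b in range(3)] for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Symmetric 4x4 matrix whose top eigenvector is the optimal unit quaternion.
    n = [[sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
         [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
         [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
         [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz]]
    # Shift the spectrum so that power iteration converges to the largest
    # eigenvalue; the row-sum bound keeps all shifted eigenvalues non-negative.
    shift = max(sum(abs(v) for v in row) for row in n) + 1e-9
    q = [1.0, 0.3, 0.2, 0.1]  # arbitrary start vector with all components set
    for _ in range(n_iter):
        q = [sum(n[i][j] * q[j] for j in range(4)) + shift * q[i]
             for i in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    q0, qx, qy, qz = q
    rot = [[q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
           [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
           [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]]
    return rot, cm, ct


def ligand_rmsd(model_rec, model_lig, ref_rec, ref_lig):
    """RMSD of the ligand after superimposing the model on the reference
    using the receptor atoms only."""
    rot, cm, ct = superposition(model_rec, ref_rec)
    sq = 0.0
    for p, r in zip(model_lig, ref_lig):
        c = [p[a] - cm[a] for a in range(3)]
        moved = [sum(rot[a][b] * c[b] for b in range(3)) + ct[a]
                 for a in range(3)]
        sq += sum((moved[a] - r[a]) ** 2 for a in range(3))
    return math.sqrt(sq / len(ref_lig))


# Toy data: the "model" is the reference rotated by 90 degrees about z and
# translated; in the second case its ligand is also displaced by 3 A.
def _move(p):
    x, y, z = p
    return (-y + 5.0, x - 2.0, z + 1.0)

ref_rec = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
ref_lig = [(4.0, 0.0, 0.0), (4.0, 1.0, 0.0), (5.0, 0.0, 1.0)]
model_rec = [_move(p) for p in ref_rec]
exact = ligand_rmsd(model_rec, [_move(p) for p in ref_lig], ref_rec, ref_lig)
shifted = ligand_rmsd(
    model_rec, [_move((x + 3.0, y, z)) for x, y, z in ref_lig], ref_rec, ref_lig
)
```

Fitting on the receptor only means that any misplacement of the ligand relative to the receptor shows up fully in the reported value, rather than being averaged out by a global fit over all atoms.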

Conclusions and perspectives

The combination of biochemical and biophysical data with docking has many different applications. Docking models can obviously be used to select residues to be targeted for mutagenesis, for example. One interesting point is that it becomes possible, when flexibility is explicitly introduced, to investigate structural changes at the interface on complex formation, or even dynamic events as shown above for the copper-transfer complex. Here we discuss what the future of this kind of approach might be.

Perspectives on data used in docking

One interesting development is the use of conservation data to define interface residues (reviewed in [99]). Several methods have been developed for this purpose; examples are the use of a neural network [98,100], the determination of invariant polar residues [101], 3D cluster analysis [102], the use of phylogenetic trees [103], the Evolutionary Trace method [104,105] and the Promate approach where conservation is combined with general interface characteristics [106]. Information from predicted interfaces has been used to model several complexes, for example, the Hsp90-p23 [107] and Gαβγ trimer–receptor complexes [42] based on predictions obtained with the Evolutionary Trace method, and the complex between the α1 and β2 subunits of hemoglobin and the FtsA homodimer [43] based on conservation data and correlated mutations [46]. With the increasing amount of genomic data available, this kind of analysis can be expected to become more and more important. In addition, protein interaction networks can be compared using PathBLAST [108]; homologies based on this may provide additional information. Similarly, homology modeling, which has been improving over the years [109], in addition to being used to generate starting structures, could be combined with docking approaches, as illustrated with mutagenesis and neutron-scattering data [110] and MS data [111,112]. An interesting example of the combination of homology modeling and docking is the Multiprospector multimeric threading approach [113], which has been applied to the Saccharomyces cerevisiae proteome [114]: Multiprospector threads the sequences of the single chains of a target complex; if a template is found that is part of a complex, both chains of the target are rethreaded, now also incorporating an interfacial energy term.

Two experimental techniques that are very promising in combination with docking are cryo-electron microscopy or tomography and SAXS. Both techniques provide ‘shape’ information into which the structures of known constituents of a complex can be fitted. Cryo-electron microscopy has been used for a large number of yeast complexes [115] and for the 80S ribosome from S. cerevisiae [116]. For further discussion see reference [8]. SAXS data have been applied in docking to a variety of systems [117–124]. Specific examples are the twinfilin-capping protein complex [125], for which models of the single components were fitted to the SAXS data and compared with mutagenesis data, and the FixJ response regulator, where the rotation angle between the two domains was probed [126].

Another technique that can potentially be used is fluorescence. Interface information could be obtained, for example, for the complex of HscA with IscU LPPVK motif-containing peptides [127]: the ability of Trp residues at the N-terminus or C-terminus of the peptides to quench the fluorescence of labeled HscA was measured, and this allowed the substrate-binding orientation to be defined. In another example, docking simulations of HLA-1 dimers and complexes of those with CD8 and TCR were compared with fluorescence resonance energy transfer data [128]. The use of fluorescence resonance energy transfer to study protein–DNA interactions has been reviewed [129]. Infrared spectroscopy might also become useful. For example, it was possible to define the tilt and relative orientation of transmembrane helices in the pentameric phospholamban [130] and the tetrameric M2 protein complex [131] based on infrared data.

With respect to the techniques discussed above, at least for MS and NMR, improvements can be expected. An example of a new MS approach for mapping interfaces is the modification of solvent-accessible side chains by hydroxyl radicals from millisecond exposure of aqueous solutions to X-rays; the modification sites can be identified by MS and differences between complexed and uncomplexed forms indicate the location of the binding interface [132,133]. In NMR, new approaches are emerging that might overcome the assignment problem. Comparison of experimental and back-calculated unassigned 1D ¹H spectra of a complex has been proposed as a means of filtering docking solutions; the feasibility of this approach has been demonstrated for four complexes [134]. Other methods that do not require chemical shift assignments but rely on the combination of amino acid-specific labeling with saturation transfer or titration experiments have been reported as well [135,136]. Provided that selective labeling can be efficiently performed, such methods should clearly speed up interface mapping by NMR.

Considering that information-driven docking will be much faster than conventional structural methods, it makes sense to invest some time and effort in making sure that the experimental data are reliable and really reveal interface residues. Therefore, whatever experimental technique is preferred, it is worth combining information from various sources.

Perspectives on docking methods

Improvements are needed, and can be expected, not only on the data side but also on the methodological side. It may one day be possible to perform reliable ab initio docking, in which case no data will be needed at all, but this is probably not within reach in the coming years. Still, active developments in the ab initio docking field will definitely benefit data-driven docking approaches. Next to the need for proper scoring schemes, another important aspect is the handling of flexibility during docking. Although several methods exist that perform reasonably well in this respect, many still only use rigid body (soft) docking. Potential improvements might include a more widespread use of energy-driven sampling methods, such as molecular dynamics, before docking to generate ensembles of starting structures, during docking to allow induced conformational changes, and/or after docking to refine the (rigid body) solutions. Other advanced computational methods are emerging that aim to identify parts of a molecule that are likely to be flexible and undergo conformational changes on complex formation [137,138]. Another kind of flexibility which, in our opinion, has undeservedly received little attention is that complexes themselves might be dynamic. As the forces that hold together noncovalently linked complexes are, in most cases, weaker than those involved in covalent interactions, one would expect mobility to play a bigger role here. This will be particularly true in the case of weak and transient complexes. Methods should be developed that take this into account.

Perspectives on experimental systems amenable to data-driven docking

Finally, the range of systems studied with docking approaches can also be extended. Although it might not strictly speaking be docking, it is interesting to note that the kind of methods that we have discussed here in the context of biomolecular complexes can also be applied to generate structures of single proteins by docking structural elements. This was done using cross-linking data to refine a homology model of FGF-2 [139] and with distance restraints for the lactose permease, which consists of 12 transmembrane helices [140]. In another example, dipolar EPR distances, disulfide mapping distances and electron cryomicroscopy data were used in a special kind of exhaustive search using a graph-theory algorithm to generate models of rhodopsin [141]. Docking-like approaches are particularly interesting for modeling transmembrane helical proteins, as these typically contain considerable helical content already in their unfolded state; this means that docking approaches can be applied using helical segments as structural entities, as described for example in reference [142]. A general review about helix–helix interactions in the folding of membrane proteins can be found in reference [143].

At the other extreme, data have become available for many giant multisubunit complexes such as the ribosome [144] or the regulatory complex of the Drosophila 26S proteasome [145], but docking approaches have not often been used for them. A combinatorial approach such as CombDock [146] may be useful here, but HADDOCK or other docking methods can also easily be extended to deal with multiple subunits (as shown for the trimer example above), although, for large assemblies, computational requirements might become a limiting factor. Another kind of biological system for which data are now becoming available is the protein–lipid assembly. Using EPR, the orientation of phospholipase A2 [147,148] with respect to the surface of phospholipid vesicles was studied. For the C2 domain of protein kinase C, fluorescence and EPR data were used to elucidate the surface of the protein that contacts the membrane and to generate a model for the protein attached to a membrane [149]. NMR spin label data have also been used to provide the depth and angle of micelle insertion of the FYVE domain of early endosome antigen 1 [150]. Finally, one interesting type of system to which increasing attention is given consists of proteins that, in their monomeric form, are unstructured and only fold during complex formation. A docking approach was used to study the complex of the (prefolded) actin with the (only folding upon binding) thymosin β4, using a combination of NMR data, mutation data and cross-linking data as restraints in the docking [151].

In conclusion, we have shown that docking methods can provide valuable biological insight when combined with a limited amount of experimental data. Such a combination will, without doubt, become more widely used in the near future.


Financial support from the Netherlands Organization for Scientific Research (N.W.O.) through a Jonge Chemici grant to A.M.J.J.B. (grant number 700.50.512) is acknowledged. We thank Cyril Dominguez and Sjoerd de Vries (Utrecht University) for helpful discussions.