A Branch-and-Prune algorithm for the Molecular Distance Geometry Problem



This article is corrected by:

  1. Errata: A note on “A branch-and-prune algorithm for the molecular distance geometry problem” Volume 18, Issue 6, 751–752, Article first published online: 1 March 2011


The Molecular Distance Geometry Problem consists in finding the positions in $\mathbb{R}^3$ of the atoms of a molecule, given some of the inter-atomic distances. We show that under an additional requirement on the given distances this can be transformed to a combinatorial problem. We propose a Branch-and-Prune algorithm for the solution of this problem and report on very promising computational results.

1. Introduction

We present a discrete formulation and a very fast and accurate solution method for a subclass of instances of the Molecular Distance Geometry Problem (MDGP) (Moré and Wu, 1997, 1999; Crippen and Havel, 1988; Dong and Wu, 2002; Hendrickson, 1995). The MDGP is related to the determination of the three-dimensional structure of a molecule based on knowledge of some distances between pairs of atoms. The three-dimensional structure is very important because it is associated with the physical and chemical properties of the molecule.

The MDGP can be seen as finding a distance-preserving immersion in $\mathbb{R}^3$ of a given undirected weighted graph $G = (V, E, d)$, so it can be very naturally cast as a continuous search problem. Under four additional assumptions that are satisfied by most proteins (a very interesting and rich class of molecules), we transform the MDGP to a discrete search problem. The assumptions are:

  1. covalent bond lengths and angles are known;
  2. the molecule has the shape of a protein backbone, i.e. it is a sequence of $n$ atoms such that there is a covalent bond between every pair of consecutive atoms;
  3. all distances between atoms separated by three covalent bonds are known;
  4. no bond angle is equal to $k\pi$, for $k \in \mathbb{Z}$.

Naturally, distances between atoms separated by two covalent bonds can be easily calculated from the covalent bond lengths and bond angles. We remark that if Assumption 3 above is strengthened to require knowledge of the distances between atoms separated by four covalent bonds, then the problem becomes polynomially solvable (Dong and Wu, 2002; Eren et al., 2004). Further considerations on the complexity of the MDGP under Assumption 3 are given in Lavor et al. (2006).

It should be noted that we refer to the MDGP as a precisely formalized decision problem, and not as a practical chemical problem. We therefore make three assumptions that in real life may be easily challenged: (a) a subset of exact (as opposed to approximate) distances is given as part of the input; (b) no measurement errors occur; (c) the optimal three-dimensional (3D) embedding of the graph is not influenced by a potential energy minimization term on the objective function. We refer to Klepeis et al. (1999) for a more realistic formulation.

In Section 2, we show a discrete formulation for the problem. In Section 3, we describe the Branch-and-Prune algorithm, which will be applied to the solution of the MDGP. The computational results are discussed in Section 4. Section 5 concludes the paper.

2. Molecular Distance Geometry Problem

Formally, the MDGP can be defined as the problem of finding Cartesian coordinates $x_1, \dots, x_n \in \mathbb{R}^3$ of the atoms of a molecule such that for all $(i,j) \in S$,

$$\|x_i - x_j\| = d_{ij},$$

where $S$ is the set of pairs of atoms $(i,j)$ whose Euclidean distances $d_{ij}$ are known. If all distances are given, the problem can be solved in linear time (Dong and Wu, 2002). If there is an order on the atoms such that the given distances form cliques on each set of five contiguous atoms, the problem is polynomially solvable (Eren et al., 2004). In general, however, the problem is NP-hard (Saxe, 1979).

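The MDGP is usually formulated as a continuous least-squares minimization problem, where the objective function is as follows:

$$f(x_1, \dots, x_n) = \sum_{(i,j) \in S} \left( \|x_i - x_j\|^2 - d_{ij}^2 \right)^2. \tag{1}$$

Obviously, $x_1, \dots, x_n \in \mathbb{R}^3$ solve the problem if and only if $f(x_1, \dots, x_n) = 0$.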
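As an illustration, the following minimal Python sketch evaluates a least-squares objective of this kind, assuming the usual squared-distance form of the residuals. The function name `mdgp_objective`, the zero-based atom indices, and the dictionary layout of the distance set are choices made for this example only and are not part of the original implementation.

```python
import numpy as np

def mdgp_objective(x, dist):
    """Sum of squared residuals between squared distances.

    x    : (n, 3) array of candidate coordinates (zero-based atom indices)
    dist : dict mapping a pair (i, j) in S to the known distance d_ij
    """
    total = 0.0
    for (i, j), d in dist.items():
        total += (np.sum((x[i] - x[j]) ** 2) - d ** 2) ** 2
    return total

# Usage: a right triangle with unit legs reproduces its own distances.
x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
dist = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): np.sqrt(2.0)}
print(mdgp_objective(x, dist) < 1e-12)  # True
```

In floating-point arithmetic the objective of an exact solution is only approximately zero, so a small threshold has to replace the equality test.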

Note that, as stated above, the MDGP bears no connection whatsoever with molecules. In fact, the MDGP appears in such diverse application fields as 3D graph drawing (Cruz and Twarog, 1996) and network design (Eren et al., 2004). Our assumption that all distances between atoms separated by one, two, and three covalent bonds are known can be expressed as an additional condition on the set $S$ of distances, namely that $S$ can be partitioned into two disjoint sets $E$, $F$ of distances where

$$E = \{ (i,j) \in S : |i - j| \le 3 \},$$

$$F = S \setminus E.$$

We also assume that for all pairs of atoms in $F$, the distances are shorter than a given cut-off value $\Delta$ (usually this is taken to be 5 Å using, for example, NMR analysis; Creighton, 1993; Schlick, 2002), that is, $d_{ij} \le \Delta$ for all $(i,j) \in F$.
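The condition on $S$ can be checked mechanically. The sketch below reads $E$ as the pairs of atoms at most three bonds apart and $F$ as the remaining pairs, which is our reading of the condition above. The helper names and the convention that pairs are stored as zero-based tuples `(i, j)` with `i < j` are assumptions of this example.

```python
def partition_distances(dist, cutoff):
    """Split S into E (|i - j| <= 3) and F (the rest) and test the cut-off on F."""
    E = {p: d for p, d in dist.items() if abs(p[0] - p[1]) <= 3}
    F = {p: d for p, d in dist.items() if abs(p[0] - p[1]) > 3}
    within_cutoff = all(d <= cutoff for d in F.values())
    return E, F, within_cutoff

def has_all_backbone_distances(n, E):
    """True if E holds every pair (i, i + k), k = 1, 2, 3, of an n-atom chain."""
    return all((i, i + k) in E for k in (1, 2, 3) for i in range(n - k))
```

An instance qualifies for the discrete formulation only if `has_all_backbone_distances` holds; the distances in `F` are then used solely for pruning.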

As we shall show in Section 2.1, for each group of four consecutive atoms, if we know all the distances between them and fix the first three, with probability 1 (see Section 2.2) the fourth atom can only have two possible symmetric placements. This allows us to give a discrete formulation for the considered problem.

2.1. Discrete formulation

Consider a molecule as being a sequence of $n$ atoms with Cartesian coordinates given by $x_1, \dots, x_n \in \mathbb{R}^3$ and such that there is a covalent bond between every pair of atoms $(i, i+1)$, for $i = 1, \dots, n-1$. The bond length $r_i$ is the Euclidean distance between atoms $i-1$ and $i$ (i.e. $r_i = d_{i-1,i}$ for all $i = 2, \dots, n$). The bond angle $\theta_i \in [0, \pi]$ is the angle between the segments joining atoms $i-2$, $i-1$ and $i-1$, $i$ (for all $i = 3, \dots, n$). The torsion angle $\omega_i \in [0, 2\pi]$ is the angle between the normals to the planes defined by the atoms $i-3$, $i-2$, $i-1$ and $i-2$, $i-1$, $i$ (for all $i = 4, \dots, n$) (see Fig. 1).

Figure 1.

 Definitions of bond lengths, bond angles, and torsion angles.

In most molecular conformation calculations, all covalent bond lengths and bond angles are assumed to be known a priori (Phillips et al., 1996). Thus, the first three atoms in the sequence can be fixed and the fourth atom is determined by the torsion angle $\omega_4$. The fifth atom can be determined by the torsion angles $\omega_4$ and $\omega_5$, and so on. So, given all bond lengths $r_2, r_3, \dots, r_n$, bond angles $\theta_3, \theta_4, \dots, \theta_n$, and torsion angles $\omega_4, \omega_5, \dots, \omega_n$ of a molecule with $n$ atoms, the Cartesian coordinates $x_i = (x_{i1}, x_{i2}, x_{i3})$ for each atom $i$ in the molecule can be obtained using the following formulae (Phillips et al., 1996):






for $i = 4, \dots, n$. We call $B_i$ the torsion matrices and denote by $C_i = B_1 B_2 \cdots B_i$ the cumulative torsion matrices. For every four consecutive atoms $x_i$, $x_{i+1}$, $x_{i+2}$, $x_{i+3}$ we can express the cosine of the torsion angle $\omega_{i+3}$ in terms of the distances $r_{i+1}$, $d_{i+1,i+3}$, $d_{i,i+3}$ and the bond angles $\theta_{i+2}$, $\theta_{i+3}$ by using the cosine law for torsion angles (Pogorelov, 1987, p. 278), as follows:


Hence, if we know all the bond lengths ($r_i$), bond angles ($\theta_i$), and distances between atoms separated by three covalent bonds ($d_{i,i+3}$), we can calculate the cosine of the torsion angles defined by the atoms $i$, $i+1$, $i+2$, $i+3$ for $i = 1, \dots, n-3$. We note in passing that in order for (4) to hold, we obviously need the denominator to be nonzero.

Using the bond lengths $r_2$, $r_3$ and the bond angle $\theta_3$, we can determine the torsion matrices $B_2$ and $B_3$ and obtain


fixing the first three atoms of the molecule. Because we also know the distance $d_{14}$, by (4) we can obtain the value $\cos\omega_4$. Thus, the sine of the torsion angle $\omega_4$ can have only two possible values: $\sin\omega_4 = \pm\sqrt{1 - \cos^2\omega_4}$. Consequently, we obtain only two possible positions $x_4, x'_4$ for the fourth atom:


along with the respective torsion matrices $B_4, B'_4$ such that


where $C_3$ is a cumulative torsion matrix. This dichotomy, shown pictorially in Fig. 2, is the basic reason why this problem can be formulated combinatorially.

Figure 2.

 Discretization of the problem. The atom $i+3$ can only be in the two shown positions in order to be feasible with the distance $d_{i,i+3}$.
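The two placements of the fourth atom can be reproduced numerically. The sketch below is an illustration only: the layout of `torsion_matrix` is an assumed homogeneous-coordinate parametrization of the kind commonly used for such chain models, not a transcription of Eq. (3), and `cos_torsion` is derived directly from that assumed layout rather than copied from Eq. (4). All function names are hypothetical.

```python
import numpy as np

def torsion_matrix(r, theta, omega):
    """Assumed 4x4 homogeneous torsion matrix for bond length r,
    bond angle theta and torsion angle omega."""
    ct, st = np.cos(theta), np.sin(theta)
    co, so = np.cos(omega), np.sin(omega)
    return np.array([
        [-ct,     -st,      0.0, -r * ct],
        [st * co, -ct * co, -so,  r * st * co],
        [st * so, -ct * so,  co,  r * st * so],
        [0.0,      0.0,     0.0,  1.0],
    ])

def cos_torsion(r1, r2, r3, th2, th3, d):
    """Cosine of the torsion angle of four consecutive atoms with bond
    lengths r1, r2, r3, bond angles th2, th3 and end-to-end distance d
    (derived from the matrix layout above)."""
    num = (r1 * r1 + r2 * r2 + r3 * r3
           - 2.0 * r1 * r2 * np.cos(th2) - 2.0 * r2 * r3 * np.cos(th3)
           + 2.0 * r1 * r3 * np.cos(th2) * np.cos(th3) - d * d)
    den = 2.0 * r1 * r3 * np.sin(th2) * np.sin(th3)  # zero if an angle is k*pi
    return num / den

def fourth_atom_candidates(r2, r3, r4, th3, th4, d14):
    """Return the two mirror-image positions of atom 4."""
    y = np.array([0.0, 0.0, 0.0, 1.0])
    B2 = np.array([[-1.0, 0.0, 0.0, -r2], [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, -1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
    C3 = B2 @ torsion_matrix(r3, th3, 0.0)  # first three atoms lie in z = 0
    c = np.clip(cos_torsion(r2, r3, r4, th3, th4, d14), -1.0, 1.0)
    s = np.sqrt(1.0 - c * c)
    return [(C3 @ torsion_matrix(r4, th4, np.arctan2(sign * s, c)) @ y)[:3]
            for sign in (1.0, -1.0)]
```

Under this parametrization the two candidates differ only in the sign of their third coordinate, i.e. they are reflections of each other through the plane of the first three atoms.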

For the fifth atom, we will obtain four possible positions, one for each combination of $\pm\sqrt{1 - \cos^2\omega_4}$ and $\pm\sqrt{1 - \cos^2\omega_5}$. By an easy induction argument, we can see that for the $i$th atom we obtain $2^{i-3}$ possible positions. Hence, for a molecule shaped as a sequence (a linear chain) of $n$ atoms, we get $2^{n-3}$ possible sequences of torsion angles $\omega_4, \omega_5, \dots, \omega_n$, each defining a different three-dimensional structure. Using the matrices $B_i$ defined above, we can convert a sequence of torsion angles into Cartesian coordinates $x_1, \dots, x_n \in \mathbb{R}^3$. Thus, this problem has a finite search space. To test a candidate solution we simply use the function $f$ defined in (1); the candidate solution $(x_1, \dots, x_n)$ will be a valid solution if and only if $f(x_1, \dots, x_n) = 0$.
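The size of this search space is easy to make concrete. The short Python sketch below enumerates the sign choices for $\sin\omega_4, \dots, \sin\omega_n$; the function name is ours and the snippet only counts candidates, it does not build coordinates.

```python
from itertools import product

def sign_sequences(n):
    """Iterate over every choice of sign for sin(omega_4), ..., sin(omega_n)."""
    return product((+1, -1), repeat=n - 3)

n = 11
print(sum(1 for _ in sign_sequences(n)))  # 256 = 2**(11 - 3)
```

Exhaustive enumeration doubles in cost with every additional atom, which is why pruning partial sequences early matters.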

The discussion above can be summarized in the following theorem.

Theorem 2.1. Consider a sequence $M$ of $n$ atoms such that:

  (i) atom $i$ is covalently bonded to atom $i+1$ for all $i \le n-1$;
  (ii) all bond angles and bond lengths are known;
  (iii) no bond angle is a multiple of $\pi$;
  (iv) all distances between atom $i$ and $i+3$ are known, for all $1 \le i \le n-3$.

Then there is a finite number of distinct immersions $p : M \to \mathbb{R}^3$ such that:

  (a) $p(1) = (0, 0, 0)$, $p(2)_1 = 0$, $p(2)_2 = 0$, $p(3)_1 = 0$ (where $p(i)_k$ is the $k$th coordinate of $p(i)$ for $k \le 3$, $i \le n$);
  (b) for all atoms $i$, $j$ with known atomic distance $d_{ij}$ we have $\|p(i) - p(j)\| = d_{ij}$.

We remark that the idea that three distances suffice to determine at most two possible positions of the fourth atom, which is the basis on which our algorithm rests, was also very recently used in Wu et al. (in press) in the framework of the geometric build-up algorithm (Dong and Wu, 2002).

2.2. Undiscretizable instances

As has been remarked, the instances of the considered problem have a finite number of valid solutions with probability 1. The only case where an instance does not admit a discrete formulation is when there is a subsequence of three consecutive atoms $i$, $i+1$, $i+2$ whose bond angle $\theta_{i+2}$ is $k\pi$ for some $k \in \mathbb{Z}$: because $\omega_{i+3}$ is an angle between two normal vectors to given planes, $\omega_{i+3}$ is undefined when at least one of the planes is undefined, i.e. if the two vectors defining the plane are collinear. In other words, if the bond angle $\theta_{i+2}$ is a multiple of $\pi$, we have the situation depicted in Fig. 3, where $d_{i,i+3}$ is feasible for every position of atom $i+3$ on the circle shown in the figure. Because the set $\{0, \pi\}$ has measure 0 in $[0, \pi]$, the probability that any given instance is discretizable is 1. In any case the ‘undiscretizable cases’ do not often occur in practice.

Figure 3.

 An instance that cannot be discretized. Atom $i+3$ can be at any position on the circle shown without affecting the feasibility of the distance $d_{i,i+3}$.

3. The algorithm

In this section we present a Branch-and-Prune (BP) algorithm designed for solving the considered problem. The approach is very simple and mimics the structure of the problem closely: at each step we can place the $i$th atom in two possible positions $x_i, x'_i$. We then branch the search and prune away the infeasible branches. More precisely, for each position we check feasibility with all distance pairs $(j,i) \in F$ by checking that the distance between atom $j$ and the candidate position agrees with $d_{ji}$ to within a given tolerance $\varepsilon > 0$. There are four possible outcomes:

  1. both $x_i$ and $x'_i$ are feasible: in this case we store both positions and explore both branches in a depth-first fashion;
  2. only $x_i$ is feasible: we only store the feasible position $x_i$ and prune the infeasible branch $x'_i$;
  3. only $x'_i$ is feasible: we only store the feasible position $x'_i$ and prune the infeasible branch $x_i$;
  4. neither position is feasible: we prune both branches and backtrack the search.

Notice that this algorithm, as described, will find all solutions to the problem. If we are only interested in one, we can stop the search as soon as we have placed the last atom in a feasible position.

Let $T$ be a graph representation of the search tree. $T$ is initialized to 1→2→3 because the first three atoms can be fixed to feasible positions $x_1$, $x_2$, $x_3$ as explained earlier. By the current rank of the search tree we mean the index of the atom being placed at the current node. At each search tree node of rank $i$ we store:

  • the position $x_i \in \mathbb{R}^3$ of the $i$th atom;
  • the cumulative product $C_i = B_1 B_2 \cdots B_i$ of the torsion matrices;
  • a pointer to the parent node $P(i)$;
  • pointers to the subnodes $L(i)$, $R(i)$ (initialized to a dummy value PRUNED if infeasible).

Notice that the edge structure of the graph $T$ is encoded in the operators $P()$, $L()$, $R()$ defined at each node. The recursive procedure at rank $i-1$ is given in Algorithm 1. Let $y = (0, 0, 0, 1)$, let $\varepsilon > 0$ be a given tolerance, and let $v$ be a node with rank $i-1$ in the search tree $T$.

3.1. Detailed example

We now discuss the application of Algorithm 1 to a simple example (artificially generated as explained in Lavor (2006), see also Section 4.2).

The instance in question (called lavor11_7), with all bond lengths set to 1.526 Å and bond angles set to 1.91 radians, has 11 atoms with distances in F given by


where $\delta(i)$ indicates the atoms $j$ such that $d_{ij} \le 4$ Å (the cut-off value). The distances in $E$ are of course $\delta(i) = \{i+1, i+2, i+3\}$ for all $i \le n-3$, $\delta(n-2) = \{n-1, n\}$, $\delta(n-1) = \{n\}$. The vector of the distances in $E$ is:


where the $i$th line contains the distances between atom $i$ and atoms $i+1$, $i+2$, $i+3$. Of course, the last two lines contain the distances between atom $n-2$ and atoms $n-1$ and $n$, and the distance between atoms $n-1$ and $n$, respectively.

Algorithm 1. BP algorithm
0: BranchAndPrune($T$, $v$, $i$)
if ($i \le n-1$) then
  Compute the possible placements for the $i$th atom:
  calculate the torsion matrices $B_i$, $B'_i$ via Eq. (3);
  retrieve the cumulative torsion matrix $C_{i-1}$ from the parent node $P(v)$;
  compute $C_i = C_{i-1} B_i$, $C'_i = C_{i-1} B'_i$, and $x_i$, $x'_i$ from $C_i y$, $C'_i y$;
  let $\lambda = 1$, $\rho = 1$;
  Check feasibility:
  for all $(j,i) \in F$ do
    let $\delta_{ji}$ and $\delta'_{ji}$ be the deviations of $\|x_j - x_i\|$ and $\|x_j - x'_i\|$ from $d_{ji}$;
    if ($\delta_{ji} > \varepsilon$) then
      let $\lambda = 0$;
    end if
    if ($\delta'_{ji} > \varepsilon$) then
      let $\rho = 0$;
    end if
  end for
  Create subnodes as required:
  if ($\lambda = 1$) then
    create a node $z$, store $C_i$ and $x_i$ in $z$, let $P(z) = v$ and $L(v) = z$;
    set $T \leftarrow T \cup \{z\}$;
    BranchAndPrune($T$, $z$, $i+1$);
  else
    set $L(v)$ = PRUNED;
  end if
  if ($\rho = 1$) then
    create a node $z'$, store $C'_i$ and $x'_i$ in $z'$, let $P(z') = v$ and $R(v) = z'$;
    set $T \leftarrow T \cup \{z'\}$;
    BranchAndPrune($T$, $z'$, $i+1$);
  else
    set $R(v)$ = PRUNED;
  end if
else
  Rank $n$ reached, a solution was found:
  solution stored in parent nodes ranked $n$ to 1, output by back-traversal;
end if
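The recursion of Algorithm 1 can be sketched compactly in Python. This is a simplified illustration under several assumptions, not the authors' C++ code: a single bond length `r` and bond angle `theta` are used for the whole chain, the torsion-matrix layout and the torsion-cosine expression are an assumed parametrization rather than Eqs. (3) and (4), feasibility is tested on the absolute deviation of the distance, and the current partial placement is kept in one array instead of an explicit tree $T$. Atom indices are zero-based and `dist` maps pairs `(j, i)` with `j < i` to distances.

```python
import numpy as np

def torsion_matrix(r, theta, omega):
    """Assumed 4x4 homogeneous torsion matrix (illustrative layout)."""
    ct, st, co, so = np.cos(theta), np.sin(theta), np.cos(omega), np.sin(omega)
    return np.array([[-ct, -st, 0.0, -r * ct],
                     [st * co, -ct * co, -so, r * st * co],
                     [st * so, -ct * so, co, r * st * so],
                     [0.0, 0.0, 0.0, 1.0]])

def cos_torsion(r, theta, d):
    """Torsion cosine for equal bond lengths/angles and distance d = d_{i,i+3}."""
    c, s = np.cos(theta), np.sin(theta)
    return (3.0 * r * r - 4.0 * r * r * c + 2.0 * r * r * c * c - d * d) / (2.0 * r * r * s * s)

def branch_and_prune(n, r, theta, dist, eps=1e-3, first_only=False):
    y = np.array([0.0, 0.0, 0.0, 1.0])
    B2 = np.array([[-1.0, 0.0, 0.0, -r], [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, -1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
    C3 = B2 @ torsion_matrix(r, theta, 0.0)
    x = np.zeros((n, 3))                  # current partial placement
    x[1], x[2] = (B2 @ y)[:3], (C3 @ y)[:3]
    prune = {}                            # pruning distances, keyed by later atom
    for (j, i), d in dist.items():
        if i - j > 3:
            prune.setdefault(i, []).append((j, d))
    solutions = []

    def place(i, C_prev):
        if i == n:                        # every atom placed: record a solution
            solutions.append(x.copy())
            return
        c = np.clip(cos_torsion(r, theta, dist[(i - 3, i)]), -1.0, 1.0)
        s = np.sqrt(1.0 - c * c)
        for sign in (1.0, -1.0):          # the two candidate positions
            C = C_prev @ torsion_matrix(r, theta, np.arctan2(sign * s, c))
            cand = (C @ y)[:3]
            if all(abs(np.linalg.norm(cand - x[j]) - d) <= eps
                   for j, d in prune.get(i, [])):
                x[i] = cand
                place(i + 1, C)           # depth-first descent
                if first_only and solutions:
                    return

    place(3, C3)
    return solutions
```

Because only the current branch is stored, this variant needs memory linear in $n$, whereas Algorithm 1 as stated keeps the explored tree. The recursion depth equals the number of atoms, so very long chains would need an iterative formulation or a raised recursion limit.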

As can be seen from the BP tree given in Fig. 4 (this is actually the output of Algorithm 1 on the given instance), this instance has four solutions: the leaf nodes at rank 11 (the rank is given by the number of the leftmost node in each row). Notice that the earliest node at which some pruning occurs is at rank 7, i.e. no pruning occurs before the placement of the 8th atom. This happens because there are no distances $(j,k) \in F$ with $k < 8$, so each position for atoms with index $i < 8$ is feasible (by construction of $x_i, x'_i$) with the distances in $E$. The only symmetry-breaking distances are in fact those in $F$. Again, there is pruning at ranks 8, 9, and 10, i.e. during the placement of atoms 9, 10, and 11, because there are distances $(j,k) \in F$ with $k = 9, 10, 11$. One of the solutions is shown in Fig. 5.

Figure 4.

 The BP tree of the instance of Section 3.1.

Figure 5.

 One of the possible solutions of the lavor11_7 instance.

4. Computational experiments

In order to test the viability of the proposed method, we tested a class of randomly generated MDGP instances described in Lavor (2006). We present comparative results of BP and another existing MDGP software called dgsol (Moré and Wu, 1999). It turns out that BP is superior to dgsol for speed and solution accuracy, and inferior as regards memory requirements and running time reliability.

4.1. Software testbeds

The software code dgsol (Moré and Wu, 1999) (version 1.3) can be freely downloaded from http://www.mcs.anl.gov/more/dgsol/. The algorithm implemented by the dgsol code is very different from ours. First, it targets a more general problem class: the Molecular Distance Geometry Problem with Distance Bounds. In this problem, lower and upper bounds to atomic distances are known, rather than the exact atomic distances. Because these are usually estimated through NMR techniques, it is realistic to assume that there is an experimental error in the measurements (our approach does not consider this issue yet). Secondly, dgsol needs to make no assumption whatsoever about the distances of triplets and quadruplets of consecutive atoms being known. Thirdly, dgsol is based on a continuous smoothing of the original problem to a form that has fewer local minima. An ordinary NLP optimization method is then applied to the modified problem, and the optimum is traced back to the original problem. This is a fully continuous optimization algorithm, whereas BP is a discrete method.

It turns out that the main advantages of BP over dgsol are:

  1. tractability of larger instances;
  2. higher solution accuracy;
  3. BP can potentially find all feasible solutions, not just one.

By contrast, the main advantages of dgsol over BP are:

  1. it targets a larger class of problems;
  2. its running time seems to increase very slowly (and regularly) as a function of the number of atoms in the molecule, at least when the set of given distances is comparatively small;
  3. the amount of memory needed to complete the search is negligible.

The BP algorithm behaves very unpredictably with respect to the amount of memory needed, sometimes requiring over 1 GB RAM for relatively small molecules (40 atoms), sometimes solving 1000-atom instances in a few seconds requiring very little memory.

4.2. Lavor instances

These instances, described in Lavor (2006), are based on the model proposed by Phillips et al. (1996), whereby a molecule is represented as a linear chain of atoms. Bond lengths and angles are kept fixed, and a set of likely torsion angles is generated randomly. Depending on the initial choice of bond lengths and angles, the Lavor instances give rather more realistic models of proteins than other randomly generated instances do (see for example the instances described in Moré and Wu, 1999). Figure 5 gives an example of a Lavor instance. In the numerical tables, we labeled the Lavor instances by lavor$n$_$m$ (e.g. lavor11_7), where $n$ is the number of atoms in the molecule and $m$ is an instance ID (because there is a random element of choice in the generation of the Lavor instances, many different instances can be generated having the same atomic size).
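A generator in the spirit of these instances can be sketched as follows. This is not the generator of Lavor (2006): torsion angles are drawn uniformly at random as a stand-in for the "likely" torsion angles, and atoms are placed by a plain vector construction from the three preceding atoms rather than by torsion matrices. The default bond length, bond angle, and cut-off are the values quoted for lavor11_7 in Section 3.1; the function name and zero-based indexing are ours.

```python
import numpy as np

def generate_instance(n, r=1.526, theta=1.91, cutoff=4.0, seed=0):
    """Random linear chain (n >= 3) with fixed bond length and bond angle.

    Returns the coordinates and a dict of known distances: all pairs at
    most three bonds apart, plus every other pair closer than the cut-off.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros((n, 3))
    x[1] = [-r, 0.0, 0.0]
    x[2] = [r * np.cos(theta) - r, r * np.sin(theta), 0.0]
    for i in range(3, n):
        omega = rng.uniform(0.0, 2.0 * np.pi)   # stand-in torsion angle
        a, b, c = x[i - 3], x[i - 2], x[i - 1]
        bc = (c - b) / np.linalg.norm(c - b)    # unit vector along last bond
        nrm = np.cross(b - a, bc)               # normal to the plane (a, b, c)
        nrm /= np.linalg.norm(nrm)
        m = np.cross(nrm, bc)                   # completes the local frame
        x[i] = c + r * (-np.cos(theta) * bc
                        + np.sin(theta) * np.cos(omega) * m
                        + np.sin(theta) * np.sin(omega) * nrm)
    dist = {}
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(x[i] - x[j]))
            if j - i <= 3 or d <= cutoff:
                dist[(i, j)] = d
    return x, dist
```

Changing `seed` yields different instances of the same size, which corresponds to the instance ID $m$ in the labels above.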

We generated 10 different Lavor instances for each size $n = 10, \dots, 70$ (‘small set’), and four different Lavor instances for each size $n$ in $\{100i \mid 1 \le i \le 10\}$ (‘large set’).

4.3. Hardware and memory considerations

All tests have been carried out on an Intel Pentium IV 2.66 GHz with 1 GB RAM, running Linux. The code implementing the BP algorithm has been compiled by the GNU C++ compiler v.3.2 with the -O2 flag. As mentioned above, BP can be very memory demanding. We deliberately took the choice of employing a low-end PC with just 1 GB RAM to show just how powerful this technique can be even with modest hardware.

The BP algorithm is in general very fast, because all it does is test feasibility with the known distances at each branched node. However, exploring the search space may require a lot of memory, especially if no pruning occurs early in the run. Consequently, when the physical RAM of the test machine is exhausted and the operating system starts swapping to disk, the total elapsed time becomes unmanageable. Thus, it was decided to kill all jobs requiring more than 1 GB RAM. In particular, we solved almost all the Lavor instances in the ‘small set’ and found one solution for each of the Lavor instances in the ‘large set’.

4.4. Comparative results

The full results table for the complete test suite includes 655 instances and spans 14 pages: thus, only a sample will be presented in detail. The averages, however, are taken with respect to the whole suite. The $\varepsilon$ parameter of Algorithm 1 was set to $1 \times 10^{-3}$ for all tests.

Table 1 contains detailed results for the sample. The instances are described by their name, their atomic size $n$ and the number of given distances $|S|$. Note that in order to use dgsol, the lower and upper bounds to these distances were set to $d_{ij} \pm 5 \times 10^{-4}$. Other than this, dgsol was used with all default parameter values. The results refer to three methods: dgsol, BP stopped after the first solution was found (BP-One), and BP run to completion (BP-All). For dgsol and BP-One, the user CPU time (in seconds) was reported, as well as the Largest Distance Error (LDE), defined as


employed as a measure of solution accuracy (the lower, the better). We remark that we employed the LDE rather than the objective function value (1) as a solution quality measure because we wanted our results to be comparable with dgsol, which uses the LDE (see file dgerr.f in the dgsol 1.3 distribution http://www-unix.mcs.anl.gov/more/dgsol/). For the BP-All method, we reported the user CPU time and the number of solutions found (#Sol). Missing values are due to excessive memory requirements (over 1 GB RAM).

Table 1. 
Computational results for a sample of small and large Lavor instances
  1. Missing values are due to excessive memory requirements (>1 GB RAM).


It is immediately noticeable that whereas dgsol always finds a solution, BP sometimes fails to find one within 1 GB RAM. It is instructive, however, to look at the solution accuracy (taken over the whole test suite): whereas dgsol ranges from $4.5 \times 10^{-7}$ to 0.875 (excepting a couple of out-of-scale values clearly due to some numerical instability), BP scores a rather more impressive $4.74 \times 10^{-11}$ to $5.62 \times 10^{-6}$. On average, the solution accuracy obtained by dgsol is $9.55 \times 10^{-2}$ whereas BP averages $4.56 \times 10^{-8}$. Furthermore, all the instances in the Lavor ‘large set’ are solved by dgsol to a solution accuracy of order $10^{-1}$: given that in BP pruning often occurs for feasibility differences of order $10^{-1}$ and even $10^{-2}$, such a slack solution accuracy may mean that dgsol is not actually finding the correct solution.

Table 2 reports the averages of the same parameters as in Table 1 taken over 10 Lavor instances in a sample of the ‘small set’ and over 4 Lavor instances in a sample of the ‘large set’. It appears clear from these data that BP's strong points are indeed speed and accuracy. A graphical representation of the averages taken over the whole Lavor test set is shown in Fig. 6 (average user CPU time taken by dgsol and BP-One to solve the instances, as a function of the molecular size) and Fig. 7 (average accuracy of the solution attained by dgsol and BP-One). We chose not to show the curves in the same plot because the huge scale difference on the ordinate axis ‘pushed’ the BP-One performance towards zero.

Table 2. 
Average statistics for Lavor instances (over 10 instances for the set of small instances and over 4 for the set of large instances).
Column groups: dgsol (average), BP-One (average), BP-All (average).

Figure 6.

 Average user CPU time (plotted against molecular size) taken by dgsol (top) and BP-One (bottom).

Figure 7.

 Average accuracy (plotted against molecular size) attained by dgsol (top) and BP-One (bottom).

5. Final remarks

In this paper we presented a new discrete formulation for a particular subclass of the MDGP. We proposed a Branch-and-Prune algorithm and tested it against dgsol, an existing software package for the MDGP. It appears that our method is faster and more accurate than dgsol by several orders of magnitude, albeit less predictable as concerns the running time, and requiring much more memory.


The authors would like to thank FAPESP, FAPERJ, and CNPq for their support.