Visualizing energy landscapes with metric disconnectivity graphs

The visualization of multidimensional energy landscapes is important, providing insight into the kinetics and thermodynamics of a system, as well the range of structures a system can adopt. It is, however, highly nontrivial, with the number of dimensions required for a faithful reproduction of the landscape far higher than can be represented in two or three dimensions. Metric disconnectivity graphs provide a possible solution, incorporating the landscape connectivity information present in disconnectivity graphs with structural information in the form of a metric. In this study, we present a new software package, PyConnect, which is capable of producing both disconnectivity graphs and metric disconnectivity graphs in two or three dimensions. We present as a test case the analysis of the 69-bead BLN coarse-grained model protein and show that, by choosing appropriate order parameters, metric disconnectivity graphs can resolve correlations between structural features on the energy landscape with the landscapes energetic and kinetic properties.


Introduction
The potential energy surface, UðrÞ, of an N atom chemical system represents the potential energy as a function of 3N atomic coordinates. The topography of UðrÞ, or energy landscape, determines its structure, kinetics, and thermodynamics [1,2] and its analysis has proved useful in studying a range of physical systems and phenomena, including glasses, [3] biomolecules, [4][5][6] and clusters. [7][8][9] For all but the simplest cases, UðrÞ has many more degrees of freedom than it is possible to visualize conventionally, making it impossible to assess the surface topography directly. One solution to the visualization problem is to partition the landscape into discrete regions, and then hierarchically cluster these regions according to some similarity measure. This clustering can then be represented as a tree-graph in either two or three dimensions (2D or 3D). There are a number of examples of hierarchical clustering methods in the literature, broadly based on either geometry, energetic barriers, or local ergodicity.
In geometrical clustering, regions are clustered according to their structural similarity, which is usually defined by the rootmean-square deviation (RMSd) between them. In this context, regions can either correspond to minima on UðrÞ [10] or points along a molecular dynamics trajectory. [11,12] The structures are clustered either by an iterative process, by which each structure is joined to its nearest neighbor until only a single cluster remains, [11] or clustering structures that are within a critical distance of one another. [10] When clustering according to energetic barriers, the landscape is partitioned into basins of attraction whereby each point on the landscape UðrÞ, is mapped onto a local minimum a, with coordinate, r a by a steepest-descent path. [3,13] Alternatively, the landscape can be partitioned using a lumping approach, [14] in which energy thresholds are used to group connected regions below the threshold. The similarity measure used for hierarchical clustering is the barrier energy that separates any two regions. Starting from the energy of the global minimum, U 0 , regions are clustered together if they are separated by a barrier with an energy lying in the interval U i11 2U i , where U i11 5U i 1DU and DU is the width of the interval. This clustering is repeated until a particular energy threshold, U t , is reach, or all the minima are clustered together. Such graphs have come to be referred to as disconnectivity graphs, [13,15] and have been used in a number of studies to visualize energy landscapes. [13,[15][16][17] Disconnectivity graphs retain both the energies of minima on UðrÞ, and the barriers that separate them, making them a useful diagnostic in visually assessing the thermodynamic and kinetic behavior of a system. [18,19] They can also be used to represent free-energy surfaces by estimating the vibrational entropy of minima and transition states on the landscape from the harmonic superposition approximation. [20][21][22] Clustering landscapes by local ergodicity involves partitioning the landscape into basins about local minima. Equilibration between basins is determined by comparing forward and backward transition rates between states [23] or the time-dependent probability distributions of connected basins. [24] A weakness of the disconnectivity graph method is that it does not retain any structural information on the minima and thus neglects a large portion of the information contained in the energy landscape. Metric disconnectivity graphs capture some of this structure by defining a metric, and then calculating an order parameter from the metric for each minimum of interest on the landscape. The minima can then be plotted along a metric axis perpendicular to the energy axis. Metric information can be included in a number of other ways, such as by changing the color, or thickness of the nodes and edges. [25,26] In this article, we will refer to metric disconnectivity graphs as those for which the nodes are organized along a metric axis. A judicious choice of metric captures overall structural trends in the system, while ignoring noisy or irrelevant information.
Here, we demonstrate the use of metric disconnectivity graphs, using several metrics, to visualize the energy landscapes of coarse-grained proteins. These disconnectivity graphs are plotted with our new energy landscape visualization package, PyConnect. [27] Methodology BLN model Metric disconnectivity graph analysis was performed on a database of stationary points for a BLN model protein This database was generated with discrete path sampling [8,28] as implemented in PATHSAMPLE. [29] The BLN model [30,31] is a coarse-grained, off-lattice protein model in which each protein residue is represented by one of three types of bead: hydro-phoBic, hydrophiLic, or Neutral. Here, we use a version of the BLN potential in which the interresidue distances and angles are restrained with stiff springs. [32] The beads interacts with each other according to where R ij is the distance between two beads i and j. The first term is a harmonic bond restraint with K r 5231:2r 22 and R e 5 r. The second term represents a harmonic angle constraint the K h 520 rad 22 and h e 51:8326 rad. The third term takes into account torsional angles along the chain and is defined by four consecutive beads. If two or more beads are N, then A 5 0 and B 5 0.2, else A 5 B 5 1.2. The fourth term represents long range, water-mediated hydrophobic interactions between nonbonded pairs. If both beads are B, then C 5 D 5 1. If one residue is L and the other is L or B, then C5 2 3 and D521. If either residue is N, then C 5 1 and D 5 0. [32] Though other sequences exist and have been studied, we consider here BLN-69, which consists of 69 beads with the sequence [33] B 9 N 3 (LB) 4 N 3 B 9 N 3 (LB) 4 N 3 B 9 N 3 (LB) 5 L. BLN-69 has been designed to exhibit a frustrated energy landscape, with a 6-strand b-barrel structure as its global minimum. The model has been shown to have a number of low-energy b-barrel-like structures, which differ from the global minimum by a chain slip along the length of the barrel, [4] but are separated by large barriers. Such frustration is absent when considering the "G o" version of the model (G o-69), where attractive interactions between pairs of residues that are not in contact in the native state (i.e., the global minimum) are neglected. [34,35] Metric disconnectivity graphs Disconnectivity graphs and metric disconnectivity graphs are plotted using PyConnect. [27] The PyConnect package comprises two components: PCA, which calculates the principal components of molecular systems from PATHSAMPLE [29] databases, and PyConnect, which constructs and displays metric disconnectivity graphs. Both of these programs were written in Python. The disconnectivity graphs are rendered with Mat-PlotLib, [36] and users can choose to create disconnectivity graphs and metric disconnectivity graphs in 2D or 3D. PyConnect also provides some cosmetic features, including the ability to label minima, color minima according to an order parameter or according to their basin of residence. PyConnect can also be used to modify graphs interactively using the iPython [37] virtual environment. In the disconnectivity graphs produced by PyConnect, the position of nodes and minima along the x axis are determined by algorithms similar to those used in DISCONNECT, [38] another program for producing disconnectivity graphs from databases of minima and transition states. Full details of the algorithms used can be found on the PyConnect website. [27] Two-dimensional metric disconnectivity graphs are plotted with the position of the minima on the x axis defined according to a metric. In 3D disconnectivity graphs, two metrics are used. The positions of nodes on the disconnectivity graphs are defined as the mean of the metrics for all minima connected to that node. Native Contact Metric. The native contact metric evaluates for each minimum, a, the ratio N a =N NC , where N NC is the number of native contact pairs, and N a is the number of contact pairs in minimum a that are also present in the native conformation.
Here, contacts are defined as those beads which are within 1.167r of each other, excluding pairs that are within three beads of each other in the peptide sequence. [4] Hydrogen bonding is important in protein folding, and native contact analysis can provide a useful analogy for coarse-grained protein models. N a =N NC is commonly used as a progress variable in computational studies of protein folding to distinguish between the different degrees of partially folded protein. [39] RMSd Metric. The RMSd, d ab , measures the distance between the conformation of minimum a and b; r a and r b respectively, according to Invariance under global translations and rotations is implicit if structures are represented in internal coordinates. When working in Cartesian coordinates, the Kabsch algorithm [40] was used to align structures to minimize d ab . In the RMSd metric, d ab is calculated between the conformation of each minimum and the conformation of the global minimum, r GM .
Principal Component Metric. The principal component metric is based on principal component analysis (PCA), a statistical procedure used to analyze large, high-dimensional data sets, which is commonly used in dimensional reduction and, or when the relevant degrees of freedom in a data set are not clear. [41] PCA attempts to reexpress a data set in terms of a new basis set, the principal components, which are a linear transform of the data sets original basis set. The principal components lie along the axes of greatest sample variance, with the first principal component, PC1, capturing the axis of greatest variance, the second principal component, PC2, capturing the axis of second greatest variance (orthogonal to the first) and so on. [42] We performed PCA on the set of N sp stable configurations fr a g Ut , where fr a g Ut are all local minima connected to the global minimum below a certain threshold energy U t . The initial basis sets employed were the 3N dimensional external Cartesian basis set, fe i g, and an internal basis set of dihedral angles, fw i g. To remove the periodicity of fw i g, we used the sines and cosines of the internal dihedrals, fcosw i ; sinw i g. [43] Rotational and translational invariance of fr a g Ut was enforced by implementing McLachlan's best fit procedure [44] ;

1.
A reference configuration defined as the ensemble average, hri of fr a g was calculated, where fr a g is the set of N sp minima of interest, and where each configuration in fr a g has its centroid centered on the origin. 2. Define a new set fr 0 a g, rotate each configuration about its origin to be as close to hri as possible using the Kabsch algorithm, and thus minimize 3. Replace fr a g with fr 0 a g. 4. Repeat steps 1-3 until the ensemble average converges to some threshold criterion.
In our study, we used the threshold criterion defined by Komatsuzaki et al., [25] s 10 28 .
Hereafter, whether discussing Cartesian or internal coordinates, we define fr a g as the translation-free, rotation-free set of configurations, and fq i g as the basis set, where i is the coordinate index.
To perform PCA, we begin with defining the 3N 3 N sp mean-centered configuration matrix, R where each column of R is a 3N dimensional vector corresponding to a stable configuration in the set fr a g Ut . The PCs are the eigenvectors of the 3N 3 3N covariance matrix, C and are thus the basis set fQ i g, where i is the coordinate index, in which C is diagonalized. The PCs are calculated using the sin-gular value decomposition method, which states that a 3N 3 N sp configuration matrix, R can be written as the product where W is the 3N 3 3N matrix; whose columns are the PCs of C and S is a 3N3N sp matrix with diagonal elements, S ii , where (dropping the double index for clarity) S 2 i is the variance associated with Q i . The PCs are ordered so that Q 1 has the greatest variance, Q 2 has the second greatest variance, and so on. The ith principal component metric is calculated by transforming each member of fr a g Ut into the basis set of fQ i g, and using the value of the ith PC for each minimum as the order parameter.
One can visualize the PCs of a given fr a g Ut , by choosing a reference structure, r ref , and adding the Q i of interest to it where k is a progress variable.
Isomap Metric. The Isomap metric is based on the Isomap algorithm, [45,46] a nonparametric, nonlinear dimensionality reduction technique. The aim of the Isomap algorithm is to define a low-dimensional embedding that as accurately as possible preserves geodesic distances between all pairs of points in the data cloud. The geodesic distance between a pair of points that lie on a manifold is the length of the shortest path between them that lies along that manifold. Isomap assumes that such a low-dimensional manifold exists, and that its shape can be estimated from the distribution of points in the data cloud. The Isomap algorithm approximates the geodesic distance between a given pair of points on the manifold by calculating the shortest possible path between them that can be found by stepping from one point to its neighbor. We applied Isomap to the set of N sp stable configurations fr a g Ut using the Isomap implementation in the scikit-learn machine learning package. [47] As with the principal component metric, rotational and translational invariance of fr a g Ut was enforced by implementing McLachlan's best fit procedure. The Isomap algorithm works in three steps;

1.
A weighted graph, G, of fr a g Ut is built, where each conformation is a node and where the k nearest-neighbors of each conformation a are joined by an edge with weight d ab . Isomap has been shown to be fairly robust to the choice of k, [46] and in this study we took k 5 15. 2. The shortest path between each conformation through the graph G is determined and a distance matrix, D, is computed, where D ab 5minfd ab ; d ac 1d cb g for c51; . . . ; N sp . The elements D ab , are the approximate geodesics between conformations a and b. 3. Classical multidimensional scaling is applied to the matrix D, producing a low-dimensional embedding of the conformational coordinates that best preserves geodesic distances on the manifold.
The ith Isomap metric corresponds to the ith embedded dimension of the low-dimensional manifold of fr a g Ut .

BLN-69
For BLN-69, a database containing 141,835 minima and 173,692 transition states was used in the study. The first three Cartesian and dihedral principal components for the sets, fr a g Ut , connected to the global minimum below energy U t , where 295:0e ! U t ! 298:0e are shown in Table 1.   Table 2. The beads are colored from red to blue (N-terminus to C-terminus).

FULL PAPER WWW.C-CHEM.ORG
For the Cartesian PCs, PC1 captures significantly more of the variance than PC2 for all data sets considered. PC1 for the sublevel set of minima connected to the global minimum below U t 5297:0e; fr a g Ut 5297:0e has the largest fractional variance and therefore this threshold was selected for all disconnectivity graphs. The dihedral PCs have a more uniform variance dis-tribution than the Cartesian PCs, with S di 1 % 1 2 S cart 1 , for all fr a g Ut considered in both BLN-69 and G o-69. The dihedral PCs are thus not appropriate metrics for studying these systems, and have not been used to create metric disconnectivity graphs. The set of minima where U t 5297:0e is represented as a disconnectivity graph in Figure 1. Figure 2 shows the low-energy minima labeled a-e in Figure 1. Minima b-e are all structurally similar to one another with each adopting compact b-barrel geometries and differing from global minimum a by either a chain-slip, chain-reptation, or twist in the turn regions, with further details given in Table 2.
The native contact metric (Fig. 3) splits the two largest funnels, with each having distinct fractions of native contacts (the mean fraction of native contacts for the funnels containing minima a and b is 0.91 and 0.81, respectively). Though the native contact metric differentiates between kinetically separated minima, it does not differentiate according to their    The RMSd metric (Fig. 4) is capable of distinguishing between the different major funnels on the surface, with each having its own mean value of the metric (mean RMSd for the funnels containing minima a, c, and b 0:24r; 0:31r, and 0:45r, respectively). There is also some relation to the minima energy in the green and red funnels, where lower energy corresponds to RMSd metric values closer to 0. The RMSd metric differentiates the basin minima into four groups, with minimum c being most similar to the global minimum, which is as expected from a visual inspection of the structures.    The PC1 metric (Fig. 5) splits the blue funnel (mean 3:29r) from the red and green funnels, which sit on top of one another (mean 1:59r and 1:60r, respectively). The purple funnel lies at the boundary of the two, with a mean of 20:86r. Minima a and c have almost identical values of Q 1 , while minima d, b, and e have increasingly dissimilar values. Given that PC1 corresponds to a chain-slip between the C and N termini, and that minima d, b, and e have chains that have shifted relative to the global minimum in the same direction, it gives confidence that PCA is capable of identifying structural features of the energy landscape.
The PC2 metric (Fig. 6) does not reveal any obvious correlation between structure and energetics or kinetics, with no distinction made between the funnels and with the points reasonably evenly distributed along the order parameter. Thus, this PC corresponds to variations within all of the funnels rather than structural differences between the funnels.
The progression of PC1 of fr a g Ut 5297:0e from k525:2r to k 54:7r (Fig. 7) corresponds to a chain-slip between the C and N termini. The progression of PC2 of fr a g Ut 5297:0e from k52 3:7r to k53:4r (Fig. 8) corresponds to a twisting of the internal chain sequences.
The first embedded dimension of the Isomap metric ( Fig. 9) clearly differentiates between all the colored funnels on the land-scape. The mean value of the blue, purple, red, and green funnels are 210:23r; 22:96r; 4:12r, and 8.80r, respectively. The structure of the graph is similar to the PC1 graph (Fig. 5), with the order of the colored funnels and labeled low-energy minima along the metric axis matching. The agreement between these two metrics suggests that the first embedded dimension of the Isomap metric is fairly linear, and that, as with the PC1 metric, it corresponds to a chain-slip between the C and N termini.
As with the PC2 metric, the disconnectivity graph for the seconded embedded dimesnion of the Isomap metric (Fig. 10) is difficult to interpret. The overlapping of the colored funnels suggests that the second embedded dimension corresponds to some structural variation common to each funnel.
The information in Figures 5 and 6 is visualized on a single 3D metric disconnectivity graph of fr a g Ut 5297:0e projected onto the plane of maximal variance in Figure 11. The plot shows fr a g Ut 5297:0e for BLN-69 plotted against its first two principal components. Clear separation of minima a-e is discernible in this 3D metric disconnectivity graph.

G o-69
For G o-69, a database containing 75,666 minima and 113,101 transition states was used. The first three Cartesian principal components for the sets, fr a g Ut , connected to the global Figure  11. 3D metric disconnectivity graph of BLN-69, U t 5297:0e; N sp 51611, plotted with the first two principal components of fr a g Ut 5297:0e , Q 1 and Q 2 , used as order parameters in units of r. The color scheme and labels are as used in Figure 1. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]   minimum below energy U t , where 252:0e ! U t ! 258:0e are shown in Table 3.
As with the Cartesian PCs of BLN-69, PC1 captures significantly more of the variance than PC2 for all data sets considered. PC1 for the sublevel set of minima connected to the global minimum below U t 5252:0e and U t 5253:0e, have the largest fractional variance, though these large variances are due to a comparatively small number of unstable, high-energy minima in which one end of the chain has peeled away from the barrel and become unbound. For these systems, PC1 is no longer representative of the distribution of minima on U r . For this reason, we consider the sublevel set of minima connected to the global minimum below U t 5254:0e, for which all the minima have densely packed geometries. This set is represented as a disconnectivity graph in Figure 12. The results for the Isomap metric were fairly ambiguous for G o, with no obvious pattern correlation between the embedded dimensions and the kinetic or energetic structure of the graph, so they have not been included in this work.
As there is only a single funnel on the G o-69 landscape, there are no large kinetic barriers for any of the metrics to differentiate between. The native contact metric (Fig. 13) is able to partially distinguish between the structures of high-and low-energy minima. As with BLN-69, minima across the whole energy range examined were able to satisfy nearly full native contacts, including unstable, high-energy minima with energetically unfavorable turns in the flexible N bead regions. The converse is not true; however, as all low-energy minima have a high number of native contacts and low numbers of native contacts are only found for high-energy minima.
For the RMSd metric (Fig. 14), similar behavior to BLN-69 is exhibited, albeit with a single funnel, with RMSd from the global minimum increasing with increasing energy.
The metric disconnectivity graphs in Figure 15 use PC1 and PC2 as order parameters. In the PC1 graph, the majority of minima are centered about the global minimum, with a smaller number of high-energy minima extending to higher values of PC1. PC1 and the fraction of native contacts are wellcorrelated, as can be seen in the 3D metric disconnectivity graph (Fig. 16). The PC2 metric orders all but a few minima tightly in a rough column about Q 2 % 0:1. Those unstable, higher energy minima that are not in that column are structures with a partly unbound C-terminus chain-portion.   The use of color allows an additional metric to be included on a metric disconnectivity graph. For example, Figure 16 shows the PC1, native contact, and RMSd metrics for G o-69. Figure 17 shows the progression of PC1 of fr a g Ut 5254:0e from k522:6r to k56:1r, with a view along the axis of the barrel and corresponds to a sweeping action of the red chainportion across the face of the white chain-portion. Figure 18 shows the progression of PC2 of fr a g Ut 5254:0e from k527:6r to k58:1r, and corresponds to a "can-can" like sweeping motion of the free C-terminus end of the chain.

Conclusions
In this study, we have demonstrated how an appropriate order parameter can elucidate the connection between structures in the energy landscape of BLN-69 and G o-69, such as funnels, with certain structural motifs of the protein, including chain slips and twists in the turn regions. However, there are still shortcomings to the metrics proposed. Fraction of native contacts and RMSd metrics relied on having prior knowledge of the system. PCA provides a means to study systems without resorting to chemical intuition, but still assumes that the point cloud is approximately linear, and cannot be directly implemented for angular coordinates. Also, it considers all structures to be of equal importance, regardless of energy, leading to situations such as with G o-69, where all the variance in structure was provided by a small number of high energy, unstable minima. Isomap allows one to discern low-dimensional, nonlinear manifolds in the data, and does not make the same assumptions of linearity as PCA. This is clearly a successful strategy, with Isomap distinguishing between the different kinetic structures on the landscape. A useful feature of PCA is the ease Figure 17. Different values of Q 1 for U t 5254:0e projected onto the structure of the global minimum of G o-69. For the global minimum, Q 1 520:6r. The beads are colored from red to blue (N-terminus to C-terminus). An animated version of this projection is available as Supporting Information. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.] Figure 18. Different values of Q 2 for U t 5254:0e projected onto the structure of the global minimum of G o-69. For the global minimum, Q 2 520:9r. The beads are colored from red to blue (N-terminus to C-terminus). An animated version of this projection is available as Supporting Information. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary. com.] with which one can project the principal components back into the original space, making it possible to visualize what these directions correspond to. In principal, it should be possible to do the same with Isomap, projecting the approximate geodesics of the manifold back into the original space, though we have not implemented this in the work presented here. Other nonlinear dimensionality reduction methods exist in the literature, such as sketch-map, [48] locally scaled diffusion map, [46,49] and spectral methods, [50] which are good candidate metrics for further study.
Equally, though the data produced by PyConnect is of a highquality, the data analysis is still fairly qualitative, and further efforts are being taken to quantify the observations, such as using graphtheoretic techniques to analyze and compare tree graphs. [51] Future work should also focus on investigating more realistic, small protein systems, such G-protein [52] or cyclic peptides. [5,22]