Keywords:

  • molecular dynamics;
  • trajectory analysis;
  • proteins;
  • base excision repair;
  • Fpg;
  • open-source software;
  • graphical user interface

Abstract

Most existing software for the analysis of molecular dynamics (MD) simulation results is based on command-line, script-guided processes that require researchers to be familiar with the programming language constructs used, which are often specific to a single product. Here, we describe an open-source cross-platform program, MD Trajectory Reader and Analyzer (MDTRA), that performs a large number of MD analysis tasks through a graphical user interface. The program has been developed to facilitate the search for and visualization of results. MDTRA handles trajectories as sets of Protein Data Bank files and provides tools and guidelines for converting some other trajectory formats into such sets. The parameters analyzed by MDTRA include interatomic distances, angles, dihedral angles, angles between planes, one-dimensional and two-dimensional root-mean-square deviation, solvent-accessible area, and so on. As an example of using the program, we describe the application of MDTRA to analyze the MD of formamidopyrimidine-DNA glycosylase, a DNA repair enzyme from Escherichia coli. © 2012 Wiley Periodicals, Inc.


Introduction

Molecular dynamics (MD) simulations of biopolymers on meaningful time scales produce large trajectories that are rarely amenable to manual analysis. To deal with this problem, a wide variety of postprocessing software has been developed, either as parts of molecular modeling packages (for example, ptraj[1] within Amber Tools, or VMD,[2] which is highly integrated with the NAMD molecular dynamics package[3]) or as standalone programs (carma,[4] Wordom,[5] MD-TRACKS,[6] Simulaid,[7] etc.). Most of them provide an extensive set of analytic features but sacrifice ease of use, simplicity, and clarity. Another trend is the implementation of a scripting language instead of a graphical user interface (GUI). Some packages (for example, MDAnalysis[8]) are actually libraries that expose the necessary functionality to the user's Python scripts. Although scripting commands are highly flexible in many cases, this approach has several disadvantages. First, it requires a researcher to master scripting languages. Second, scripting commands are not portable between software solutions, which devalues the programming experience when moving to another program and thus ties a researcher to the software initially selected. GUI programs are typically much more ergonomic and easier to master. Thus, portable analytic software that does not require users to have programming skills would be very useful, especially for nonexperts in computational biochemistry. We have developed a GUI-based program, MDTRA (Molecular Dynamics Trajectory Reader and Analyzer), that addresses these problems and that we use widely in our own MD simulations.

The main design principles for MDTRA are: (1) an ergonomic GUI (Supporting Data 1) that does not require scripting for most routine tasks (yet supports scripting for users' custom tasks) and is expandable for further improvements; (2) the ability to quickly plot the analysis results in a form suitable for use even in published work; (3) minimization of the random access memory (RAM) requirements of the program, thus relying on intensive (e.g., fast hard drives, multithreaded CPUs with advanced features, GPGPU) rather than extensive (e.g., increases in RAM and CPU frequency) progress of computer hardware.

MD analysis programs often perform the actual calculations but rely on third-party plotting software for data visualization. Yet quick visual inspection of the data is sometimes of vital importance. It is convenient to view plots in place, right after the calculation, and MDTRA offers this functionality.

Another crucial problem of analysis software is its memory footprint. Some programs, such as VMD, load all the trajectory files into RAM, which results in huge memory requirements for large trajectories. In most cases, MDTRA loads files on demand, and its memory footprint is rather small. The downside of this approach is a slowdown of the calculations, as hard disk access is much slower than RAM access; but modern hardware such as solid-state drives should mostly eliminate this problem in the near future.

MDTRA has been developed as part of the BISON modeling and analysis suite; it therefore integrates with the BioPASED/GUI-BioPASED molecular dynamics modeling package[9, 10] and can analyze its specific output, for example, atomic forces.

In this article, the main features of MDTRA are described in terms of both their implementation and their usage. For illustration, we show how MDTRA can be used for analysis of trajectories generated during MD of E. coli Fpg, a DNA repair protein.

Materials and Methods

Key features of MDTRA

MDTRA is written in C++ using the cross-platform Qt GUI libraries developed by Trolltech/Nokia. MDTRA works with trajectories stored as sets of Protein Data Bank (PDB)[11] files, each set representing a single MD simulation run. A single trajectory resource in the program is called a “stream.” Some MDTRA tools require one or two valid streams to be launched. At present, MDTRA cannot directly analyze binary DCD trajectory files, the output of NAMD and some other popular MD packages; these must first be converted to PDB files. For example, a DCD file can be converted into a single PDB file with VMD and then split into a set of PDB files with a shell script provided within the MDTRA package. After the streams are registered, “data sources” are added. A data source is an instruction that tells the program which trajectory parameters to extract. The following types of data sources are available: (1) root-mean-square deviation (RMSD) and root-mean-square fluctuation (RMSF) of the backbone (N–Cα–C–O for proteins, O1P–O2P–P–O5′–C5′–C4′–O4′–C3′–O3′–C2′–O2′–C1′ for nucleic acids); (2) RMSD and RMSF of a selection; (3) radius of gyration; (4) distance between two atoms; (5) angle between three atoms; (6) angle between two segments (each segment is defined by two atoms); (7) dihedral angle defined by four atoms in the range [0; 2π]; (8) absolute value of dihedral angle defined by four atoms in the range [0; π]; (9) angle between planes defined by three atoms each; (10) force applied to an atom; (11) resultant force for a pair of atoms; (12) solvent-accessible surface area (SAS); (13) SAS of a selection; (14) occluded surface area; (15) occluded surface area of a selection; (16) user-defined types. Data sources 10 and 11 require trajectories that contain a specific output, the value of the atomic force in each snapshot. Such trajectories are produced, for example, by the BioPASED molecular dynamics modeling program.[10]
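As an illustration of the geometric data sources listed above (distance, three-atom angle, and the two dihedral variants), the following minimal Python sketch computes them from Cartesian coordinates. It is not MDTRA code: the function names are hypothetical, and the dihedral uses the common atan2 formulation, whose sign convention may differ from the one MDTRA uses.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def distance(p, q):
    """Distance between two atoms."""
    return norm(sub(p, q))

def angle(p, q, r):
    """Angle (radians) at atom q formed by atoms p-q-r."""
    u, v = sub(p, q), sub(r, q)
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error

def dihedral_signed(p0, p1, p2, p3):
    """Signed dihedral in (-pi, pi] via the standard atan2 formula."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2))

def dihedral_full(p0, p1, p2, p3):
    """Dihedral mapped onto [0, 2*pi] (data source type 7)."""
    return dihedral_signed(p0, p1, p2, p3) % (2.0 * math.pi)

def dihedral_abs(p0, p1, p2, p3):
    """Absolute value of the dihedral, in [0, pi] (data source type 8)."""
    return abs(dihedral_signed(p0, p1, p2, p3))
```

The two dihedral variants differ only in how the signed angle is folded: a torsion of -90° is reported as 3π/2 by the first and as π/2 by the second.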

The result of data source processing is an array of floating point values. These arrays are combined in groups named “result collectors.” Each collector may include one or more data sources. Some collectors may be either time-based or residue-based: for example, RMSD collectors are residue-based, and others are time-based. Results are represented as raw data, plotted onto a graph, and statistically analyzed. Before calculation, each snapshot is superposed with the first trajectory element using the Kabsch algorithm.[12] The following statistical parameters are calculated for each data source in the result collector: (1) mean values (arithmetic, geometric, harmonic, quadratic); (2) minimum and maximum values; (3) range and midrange; (4) median; (5) sample variance; (6) sample standard deviation and standard error. If there are several data sources in a collector, a linear (Pearson) correlation between each data pair is calculated. In addition, at any time point available in the multistream collector, a combined snapshot of all structures superimposed for the best match can be viewed in an external viewer supported by MDTRA (RasMol,[13] VMD), if installed.
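The statistical parameters listed above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than the MDTRA implementation; the function names `summarize` and `pearson` are hypothetical, and the geometric and harmonic means assume strictly positive values.

```python
import math
import statistics

def summarize(values):
    """Summary statistics for one data source (a list of floats)."""
    n = len(values)
    lo, hi = min(values), max(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "arithmetic_mean": statistics.fmean(values),
        "geometric_mean": statistics.geometric_mean(values),  # needs values > 0
        "harmonic_mean": statistics.harmonic_mean(values),    # needs values > 0
        "quadratic_mean": math.sqrt(sum(v * v for v in values) / n),
        "min": lo,
        "max": hi,
        "range": hi - lo,
        "midrange": (hi + lo) / 2.0,
        "median": statistics.median(values),
        "sample_variance": statistics.variance(values),
        "sample_sd": sd,
        "standard_error": sd / math.sqrt(n),
    }

def pearson(xs, ys):
    """Linear (Pearson) correlation between two equally long data sources."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```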

The solvent-accessible surface used in processing several data sources is calculated using the Shrake–Rupley algorithm[14] from the van der Waals radii of the atoms expanded by a user-selectable solvent probe radius (the default value is 1.4 Å, the approximate radius of a water molecule). The dot density depends on the accuracy selected and corresponds to the number of subdivisions of the initial icosahedron using a geosphere algorithm, which produces a high-quality uniform distribution. The default number of dots per atom is 162. Occluded area may also be calculated; it is defined as the part of the molecular surface occluded by a selected atom set (the “occluder”). For example, in protein-DNA complexes, some amino acid residues closely contact the DNA, while in the absence of DNA they are exposed to the solvent. Their total surface area is defined as the area of the protein's molecular surface occluded by DNA. Solvent-accessible surface calculations are accelerated using general-purpose computations on graphics processing units (GPGPU). At present, a massively parallel algorithm is implemented in MDTRA using the NVIDIA Compute Unified Device Architecture (CUDA), resulting in a speed-up factor of up to 14× compared with a single-threaded CPU implementation.
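The Shrake–Rupley procedure can be sketched as follows. This is a simplified, single-threaded Python illustration, not the MDTRA or CUDA implementation: the test points are generated with a Fibonacci spiral as a stand-in for the subdivided icosahedron, and the function names and the input layout (a list of `(x, y, z, vdw_radius)` tuples) are assumptions made for the example. The default of 162 points matches the vertex count of an icosahedron subdivided twice (10·4² + 2).

```python
import math

def sphere_points(n):
    """n roughly uniform points on the unit sphere (Fibonacci spiral)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = k * golden
        pts.append((r * math.cos(phi), r * math.sin(phi), z))
    return pts

def shrake_rupley(atoms, probe=1.4, n_points=162):
    """Per-atom solvent-accessible area (in squared coordinate units).

    atoms: list of (x, y, z, vdw_radius) tuples.
    """
    unit = sphere_points(n_points)
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe  # van der Waals radius expanded by the probe
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            buried = False
            for j, (xj, yj, zj, rj) in enumerate(atoms):
                if j == i:
                    continue
                rj_exp = rj + probe
                d2 = (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2
                if d2 < rj_exp * rj_exp:
                    buried = True  # point lies inside a neighbor's sphere
                    break
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * big_r * big_r * exposed / n_points)
    return areas
```

The inner loop over all other atoms makes this sketch quadratic in the number of atoms; a neighbor list would be the usual optimization. Because each test point is evaluated independently, the method also maps naturally onto a massively parallel implementation.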

Some MDTRA instruments extend beyond the general data organization (data sources, result collectors) and therefore are referred to as trajectory-related search tools. They include the Distance Search tool, the H-Bonds Search tool, and the two-dimensional (2D)-RMSD calculation.

The Distance Search tool is a cross-trajectory tool for evaluating interatomic distances, which may be helpful in a procedure of data source extraction. The main use of the tool is to analyze the distance between pairs of atoms in two different trajectories and to decide whether the difference is significant or not. Distance Search results are suggestions that can be added to result collectors or ignored. The tool requires at least two valid trajectories registered in the current project.

The H-Bonds Search tool is designed to find hydrogen bonds that stabilize the structure at a meaningful time scale. It allows the user to specify a significance criterion to either accept or discard a bond based on its estimated energy and occurrence along the trajectory. The process of hydrogen bond search consists of: (1) building a set of all possible X–H···Y triplets (where X is a donor, Y is an acceptor, H is a hydrogen, the dash means a covalent bond, and dots mean a hydrogen bond) based on the known hydrogen bond-forming properties of protein and DNA residues; (2) calculating the energy for each triplet and for each trajectory snapshot; (3) collecting all triplets with the hydrogen bond occurrence greater than the value defined in significance criterion (if any) into a table of results. Hydrogen bond energy is calculated using the following equation:

  • equation image

where E is the hydrogen bond energy, r is the distance between the Y and H atoms (in angstroms), θ is the triplet angle, Rm and Em are the parameters that depend on the AMBER force field code of the particular Y and H atoms, and s is a constant of softness. If the energy is not zero and its absolute value is greater than the value defined by the significance criterion (if any), the bond is assumed to exist in the current snapshot. H-Bonds Search results can also be added to result collectors. The tool requires a valid trajectory registered in the current project.
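The filtering stage of the search (deciding whether a bond exists in a snapshot and whether it occurs often enough along the trajectory) can be illustrated with a short Python sketch. The per-snapshot energies are assumed to be precomputed; the energy function itself is not reproduced here. The function names, the dictionary layout, and the use of "at least" (rather than "greater than") for the occurrence cutoff are assumptions made for this example.

```python
def bond_occurrence(energies, min_energy=0.0):
    """Fraction of snapshots in which the bond is counted as present.

    A bond is present when its energy is nonzero and its absolute value
    exceeds the significance threshold.
    """
    present = sum(1 for e in energies if e != 0.0 and abs(e) > min_energy)
    return present / len(energies)

def significant_bonds(energy_table, min_energy, min_occurrence):
    """Keep triplets whose occurrence reaches min_occurrence.

    energy_table: dict mapping a triplet label to per-snapshot energies.
    Returns a dict mapping each retained label to its occurrence.
    """
    result = {}
    for label, energies in energy_table.items():
        occ = bond_occurrence(energies, min_energy)
        if occ >= min_occurrence:  # assumption: cutoff is inclusive
            result[label] = occ
    return result

# Hypothetical labels and energies (kcal/mol), for illustration only.
table = {
    "donor1-H...acceptor1": [-1.0, -0.9, -0.2, -1.1],
    "donor2-H...acceptor2": [-0.1, 0.0, 0.0, -0.5],
}
kept = significant_bonds(table, min_energy=0.4, min_occurrence=0.75)
```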

With the 2D-RMSD tool, special time-based RMSD maps, combining the relative deviations between the snapshot pairs, can be built. Every pixel of a 2D-RMSD map with coordinates (n, m) represents the average RMSD, for the atom set selected, between the trajectory snapshots n and m:

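  \[ \mathrm{RMSD}(n, m) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert \mathbf{r}_i(n) - \mathbf{r}_i(m) \right\rVert^{2}} \]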

where r_i(n) and r_i(m) are vectors representing the Cartesian coordinates of atom i in snapshots n and m, respectively, and N is the total number of atoms in the selection. Because the value is the same for the pairs (n, m) and (m, n), the plot is symmetric about its diagonal. The 2D-RMSD tool is the only MDTRA tool that requires all the trajectory snapshots to be loaded into RAM and therefore has high memory requirements for large plots.
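A minimal Python sketch of building such a map from the pairwise RMSD described above is given below. It assumes that the snapshots have already been superposed and reduced to the selected atoms, and that every snapshot lists the atoms in the same order; the function names are hypothetical and the code is not taken from MDTRA.

```python
import math

def rmsd(frame_a, frame_b):
    """RMSD between two snapshots given as equally long lists of (x, y, z)."""
    n = len(frame_a)
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(frame_a, frame_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / n)

def rmsd_map(frames):
    """Square matrix whose (n, m) entry is the RMSD between snapshots n and m."""
    k = len(frames)
    grid = [[0.0] * k for _ in range(k)]
    for n in range(k):
        for m in range(n + 1, k):
            # The matrix is symmetric, so each pair is computed only once.
            grid[n][m] = grid[m][n] = rmsd(frames[n], frames[m])
    return grid
```

All snapshots must be held in memory at once because every pair is compared, which is consistent with the higher memory requirements of this tool noted above.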

Some MDTRA tools operate on a selected part of the molecule. The selection syntax is designed to be as close as possible to that used in RasMol; however, some notable differences exist. A dedicated utility, the Select Atoms tool, lets the user quickly test selection terms and practice using them. The selection parser of MDTRA is written using the Bison and Flex tools.[15]

Although there are many different predefined data sources and statistical parameters, MDTRA introduces an extension mechanism, programmable data sources, suitable for user's custom analysis of any type. Custom scripts in Lua programming language[16] can be written, within the limits of MDTRA programming interface, and immediately executed. A detailed description of programmable data sources is provided in the User's Guide (available at http://bison.niboch.nsc.ru/mdtra.html).

Molecular dynamics of Fpg

To illustrate the analysis capabilities of MDTRA, MD modeling of the E. coli formamidopyrimidine-DNA glycosylase (Fpg) protein was performed. The atomic structure of this enzyme (PDB ID 1K82, chain A[17]) was taken for modeling. Missing residues 217–224 were restored as described.[18] A 2-ns MD simulation (1 fs integration interval) was performed with the BioPASED molecular dynamics modeling software[10] using the Amber95 force field and an analytical implicit solvent model without overall artificial restraints; four noncovalent bonds between Zn2+ and the cysteines of the zinc finger of Fpg were modeled with a 2.1 kcal/Å² artificial restraint to maintain the geometry of this structural motif. Use of a high-quality implicit solvent model allows a more effective sampling over the essential conformational space of a protein by excluding the dynamic friction inherent in the explicit solvent model.[19, 20] There were two trajectories resulting from two modeling modes: without (as described[18]) and with the energy contribution of possible hydrogen bonds (Model 1 and Model 2, respectively). Each trajectory consisted of 1015 PDB files (2 ps per snapshot) with an overall size of 672 Mb. Both trajectories were analyzed in a single MDTRA project with the default analysis parameters. The autocorrelation function was built using the Lua scripting module provided with MDTRA (the script is given in Supporting Data 2). The autocorrelation coefficient Rh for each time lag (h) was calculated as

  \[ R_h = \frac{C_h}{C_0} \]

where Ch is the autocovariance function and C0 is the variance function,

  • equation image

and plotted against h.
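For readers who want to reproduce this kind of plot outside MDTRA, the calculation can be sketched in Python as follows. This is not the Lua script from Supporting Data 2; it assumes the common sample estimator in which the autocovariance at lag h is the sum of products of mean-centered values h steps apart, divided by the series length, and the function name is hypothetical.

```python
def autocorrelation(series, max_lag):
    """Autocorrelation coefficients R_0 .. R_max_lag of a time series.

    Assumes the standard estimator: C_h is the lag-h autocovariance and
    C_0 the variance, both normalized by the series length.
    """
    n = len(series)
    mean = sum(series) / n
    c0 = sum((y - mean) ** 2 for y in series) / n
    coeffs = []
    for h in range(max_lag + 1):
        ch = sum((series[t] - mean) * (series[t + h] - mean)
                 for t in range(n - h)) / n
        coeffs.append(ch / c0)
    return coeffs
```

With this normalization the coefficient at lag 0 is always 1, and periodic motion in the input series would show up as pronounced peaks at lags equal to multiples of the period.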

Results and Discussion

Fpg is a key enzyme involved in the repair of oxidatively damaged purines in bacterial DNA. It was previously shown[18] that some segments of the molecule have mobility well above the baseline. The primary interest of that study was identification of protein regions that likely give rise to changes in the fluorescence of Fpg during the reaction. Therefore, special attention was paid to the segments that contain tryptophan residues, which may contribute to the fluorescence changes observed in pre-steady-state kinetics experiments.[18]

Time-based RMSD of the backbone showed that the models were stable over the total 2-ns trajectory (Fig. 1). As expected, the model without explicit hydrogen bonds (Model 1) was more mobile, with a median RMSD value of 4.01 Å; the model stabilized by hydrogen bonds (Model 2) displayed a median RMSD of 1.92 Å. Residue-based molecular surface area (Fig. 2) displays high correlation between the models (Pearson correlation coefficient 0.90). This means that the overall stability of the protein shape is maintained in both models; however, some residues are more exposed to the solvent when not stabilized by hydrogen bonds.


Figure 1. Time-based RMSD of backbone atoms of Fpg in two models, without (Model 1; black) and with (Model 2; grey) explicit hydrogen bonds. The plot image was exported by MDTRA.


Figure 2. Residue-based solvent-accessible area of Fpg in two models, without (Model 1; black) and with (Model 2; grey) explicit hydrogen bonds. The plot image was exported by MDTRA.


As an additional measure of model quality, as well as to demonstrate the custom scripting module of MDTRA, we have calculated time-based autocorrelation of RMSD of the backbone (Fig. 3 and Supporting Data 2). Lack of pronounced peaks shows no significant oscillations of the structure along the trajectory.


Figure 3. Dynamic autocorrelation of RMSD of backbone atoms of Fpg in two models, without (Model 1; black) and with (Model 2; grey) explicit hydrogen bonds. The plot image was exported by MDTRA.


Residue-based RMSD of heavy atoms (i.e., all atoms except hydrogen) describes how mobile each residue in each model is (Fig. 4A). One can see that some segments in Model 1 have an increased mobility (defined as those with RMSD more than mean RMSD plus one standard deviation of RMSD), which agrees with the previously published data.[18] There were even more residues and segments with increased mobility in Model 2 (residues 26, 29–34, 81, 83–87, 117, 122, 161, 215–218, 223, 228, 231, 239, 253–254, 256–258). The high mobility of Trp34 was reproduced in both models, without and with hydrogen bonds, with amplitudes of 7.47 and 3.77 Å, respectively. Interestingly, many of these residues belong to two protein regions, the zinc finger and the β1/β2 loop (the latter including Trp34), which are inserted into the major and minor DNA groove, respectively, when the enzyme scans DNA for damage,[17] and this mobility could be important for structural accommodation of DNA geometry changes during lesion search and catalysis. Another extended segment with high RMSD, the β5/β6 loop, lies at the surface of the protein far away from the DNA-binding groove, and its high mobility is unlikely to be of functional significance. Besides these, both models rendered no lengthy segments with high mobility. Interestingly, some individual residues, for example, Lys217, which plays an important role in damaged base recognition,[21, 22] had a comparable mobility in both models (4.8 Å in Model 1, 4.6 Å in Model 2). RMSD of Lys217 was 1.4-fold greater than the mean in Model 1 and 2.2-fold greater than the mean in Model 2. This observation is perhaps not surprising given that Lys217 forms hydrogen bonds not with other protein residues but with the damaged DNA base,[21] and both our models lacked DNA. The calculation of residue-based RMSF identified the same regions as highly mobile and confirmed that Trp34 has the highest mobility of all Trp residues (Supporting Data 3).
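The mobility criterion used above (RMSD greater than the mean plus one standard deviation) is simple to apply to exported residue-based data. The sketch below is an illustration only: the function name and the input dictionary are hypothetical, the numbers in the example are made up, and the sample standard deviation is assumed.

```python
import statistics

def highly_mobile(rmsd_by_residue):
    """Residue numbers whose RMSD exceeds mean + one sample standard deviation."""
    values = list(rmsd_by_residue.values())
    cutoff = statistics.fmean(values) + statistics.stdev(values)
    return sorted(res for res, v in rmsd_by_residue.items() if v > cutoff)

# Made-up values for four residues: mean 2.0, sample SD 2.0, cutoff 4.0.
example = {1: 1.0, 2: 1.0, 3: 1.0, 4: 5.0}
mobile = highly_mobile(example)
```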


Figure 4. A: Residue-based RMSD of heavy atoms of Fpg in two models, without (Model 1; dark blue) and with (Model 2; purple) explicit hydrogen bonds. Positions of the tryptophan residues are labeled. Horizontal colored lines represent mean values, color filled areas represent one standard deviation of the respective samples. B, C: Hydrogen bond parameters along the MD trajectory of Fpg modeled with explicit hydrogen bonds (Model 2): distances between heavy atoms (B) and three-atom angles (C). Five bonds with the highest energy were taken. The plot images were exported by MDTRA.


Other tryptophan residues in both models did not display high RMSD values; however, this does not mean that they do not contribute to the observed change in fluorescence. To further investigate the mobility of these residues (Trp34, Trp66, Trp113, Trp115, and Trp156) along the trajectory, a 2D-RMSD analysis was performed. A 2D-RMSD plot was built for each tryptophan residue in the model with hydrogen bonds, taking into account only the fluctuations of heavy atoms (Fig. 5). All plot scales were remapped to the maximum RMSD value of 12 Å. It is immediately clear that Trp66, Trp113, Trp115, and Trp156 do not fluctuate significantly. However, some minor movements with up to ∼3.0 Å amplitude can be noticed (Fig. 5, panel F). In contrast, Trp34 is very mobile. According to the 2D-RMSD plot, it undergoes two distinct conformational changes (which correspond to a wide blue band and a wide green band in the plot), and one transient leap (narrow red band). All these residues appear to contribute to the fluorescence dynamics of enzyme–DNA complexes; however, Trp34 is the most significant one, which agrees with the previous study.[18]


Figure 5. 2D-RMSD maps of tryptophan residues of Fpg modeled with explicit hydrogen bonds (Model 2). (A) Trp34; (B) Trp66; (C) Trp113; (D) Trp115; (E) Trp156. All RMSD scales were remapped to a maximum value of 12 Å. (F) Trp66 with RMSD scale remapped to a maximum value of 6 Å. The plot images were exported by 2D-RMSD Tool of MDTRA.


To further investigate the dynamics of Fpg, the MDTRA H-Bonds Search tool was used. Both trajectories were searched for hydrogen bonds, discarding all bonds with an energy less than 0.4 kcal/mol. The search results were refined to display only those bonds existing in at least 95% of the trajectory snapshots. No significant bonds were found in Model 1 (without the contribution of hydrogen bonds during the dynamics). In Model 2, 72 bonds were found. The list of those bonds ordered by energy can be found in Supporting Data 4. Most of the bonds found were typical bonds stabilizing α-helices and β-sheets. Among others, many were in or near the linker connecting the domains of Fpg and around the zinc finger (see below). Monitoring of distances and angles along the trajectory using MDTRA shows that these bonds are quite stable and indeed participate in the secondary structure formation: the mean distance is ∼3.0 Å, and the mean angle in the triplets X–H···Y is ∼166°, which agrees with the common hydrogen bond geometry. Plots of the distances and angles of the five bonds with the highest energy are shown in Figures 4B and 4C, respectively.

The Fpg protein has a structural motif, the zinc finger, which contains four cysteine residues coordinating a single Zn2+ ion. As the correct positioning of this motif relative to the protein globule is very important in the enzymes of the Fpg family,[23] the contact area between the zinc finger and the rest of the protein was investigated. The zinc finger was defined either as the stretch of the protein (Cys243–Cys266) between the first and the last of the four cysteine residues coordinating Zn2+ (Fig. 6A), or as the stretch of the protein (Gln234–Lys268) comprising the β-hairpin and the surrounding regions without a definite secondary structure (Fig. 6B). The contact area for both models was calculated as the surface area occluded by the zinc finger. In Model 2, contacts between the zinc finger and the rest of the protein increased and stabilized faster than in Model 1 for both zinc finger definitions. This is consistent with the existence of an extensive network of hydrogen bonds tying the zinc finger to the helix–two turn–helix motif in the C-terminal domain of Fpg, as inferred from the static X-ray structure.[17]


Figure 6. Contact area between the zinc finger and the rest of the protein calculated as the surface area occluded by the zinc finger. The zinc finger is defined as either the stretch of the protein (Cys243–Cys266) between the first and the last of the four Cys residues coordinating Zn2+ (A) or as the stretch of the protein (Gln234–Lys268) comprising the β-hairpin and the surrounding regions without a definite secondary structure (B). Model 1, black; Model 2, grey.


Thus, we have shown an example of the practical use of MDTRA, a new MD trajectory analysis tool. Its architecture is based on the principle of a “conveyor”, which delivers results from “streams” (trajectories) through “data sources” to “result collectors.” Each conveyor stage is adjustable and expandable at any time. The whole analysis project, including the calculated data but excluding the trajectories themselves, can be saved to an MDTRA Project file and loaded by MDTRA the next time one wants to work with it. This means that raw data, statistics, and plots may be exported to external files on demand. After a single build process, data and plots can be analyzed interactively multiple times. Subsequent modifications cause only the affected data sources to be rebuilt. Such project files can be accessed even without the trajectories, although only in a read-only mode (still amenable to interactive plot analysis). This project organization is lacking in many widely used MD trajectory analysis packages, which makes MDTRA a good choice for versatile and structured analysis.

Conclusion

MDTRA is a versatile MD trajectory analysis program that facilitates many tasks of advanced structure analysis. Although designed for integration with the MD software BioPASED as a part of the BISON package (http://bison.niboch.nsc.ru), it can be used as a standalone resource. The program proved useful in the study of the E. coli DNA repair enzyme Fpg and should be equally useful in analyzing the outcome of other MD experiments. MDTRA is available at http://bison.niboch.nsc.ru/mdtra.html. The distribution includes the documentation, data files, and executables for the selected platform. The authors welcome any user feedback, including suggestions, bug reports, and feature requests.

Supporting Information

Additional Supporting Information may be found in the online version of this article.

Filename                      Format   Size     Description
JCC_23135_sm_SuppFig1.tif     tif      5892K    Supporting Information Figure 1.
JCC_23135_sm_SuppFig2.pdf     pdf      13K      Supporting Information Figure 2.
JCC_23135_sm_SuppFig3.eps     eps      275K     Supporting Information Figure 3.
JCC_23135_sm_SuppFig4.pdf     pdf      15K      Supporting Information Figure 4.
