Basic Use of the Library: Trajectory I/O and Selections
Once installed, MDAnalysis is imported in the standard Python fashion as:
import MDAnalysis
or
from MDAnalysis import *
For the following discussion and the examples, we will assume that the second form has been used, which makes a number of useful classes and modules available in the top-level name space (Fig. 1).
The most important class is Universe, which represents the complete simulation system. It is initialized from a topology file, which defines the atoms in the system, and a trajectory, which lists the position of each atom for a number of time frames:
universe = Universe(topology_file, trajectory_file)
MDAnalysis can read a range of popular file formats. These include trajectories produced by CHARMM, NAMD, LAMMPS, Amber, and Gromacs and the generic XYZ format (also compressed with gzip or bzip2). It can write DCD (CHARMM/NAMD/LAMMPS) and XTC/TRR (Gromacs) trajectories. Furthermore, it can read and write a number of single frame formats such as Brookhaven PDB, CHARMM coordinates, and GROMOS96 coordinates, and it can read PQR files as used in electrostatics calculations.15 Periodic boundary conditions are not automatically taken into account and hence trajectories should be appropriately preprocessed. To define the atoms in the system, a topology file is required. Typically, this is a CHARMM/XPLOR PSF file or an Amber PARMTOP, but it is also possible to use any of the single frame formats instead and make use of the information stored there. For instance, using a PQR file sets the charge attribute of each atom to the charge stored in the file. There are no restrictions on the type or number of atoms; for instance, PDB files with large numbers of atoms are handled gracefully and coarse-grained simulations can be analyzed in the same way as atomistic ones.
The main purpose of Universe is to gather all atoms in the system in its Universe.atoms attribute and provide the Universe.selectAtoms() method to create arbitrary groupings of atoms (Fig. 1). All such groups of atoms are instances of the AtomGroup class and a number of simple methods are automatically defined for such a group (Fig. 2). For instance, AtomGroup.totalMass() computes the total mass of these atoms and AtomGroup.principalAxes() the principal axes by diagonalizing the tensor of inertia. Other examples of these types of functions include computing the center of mass, center of geometry, radius of gyration, or the total charge. Individual properties of all atoms in an AtomGroup can also be conveniently accessed in a fashion typical for Python (“Pythonic”) by calling methods that return lists or arrays of the property. For example, AtomGroup.coordinates() returns a NumPy array of the coordinates, which can be processed further in Python code:
protein = universe.selectAtoms('protein')
coords = protein.coordinates()
shifted = coords + array([10.,0,0])
The example above calculates the coordinates of all protein atoms shifted by the vector (10,0,0) along the x-axis. Other methods return lists of residue numbers, residue names, partial charges (if defined in the topology file), atom masses, or B-factors if loaded from a PDB file (Fig. 2).
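As an illustration of what such group methods compute, the following minimal sketch evaluates the center of mass and the radius of gyration from plain lists of coordinates and masses. It does not use MDAnalysis; the function names center_of_mass and radius_of_gyration and the two-particle test system are hypothetical and chosen only for this example.

```python
import math

def center_of_mass(coords, masses):
    """Mass-weighted mean position of a set of (x, y, z) coordinates."""
    total = sum(masses)
    return tuple(
        sum(m * xyz[k] for m, xyz in zip(masses, coords)) / total
        for k in range(3)
    )

def radius_of_gyration(coords, masses):
    """Mass-weighted root-mean-square distance from the center of mass."""
    com = center_of_mass(coords, masses)
    total = sum(masses)
    weighted_sq = sum(
        m * sum((xyz[k] - com[k]) ** 2 for k in range(3))
        for m, xyz in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total)

# Hypothetical system: two equal masses 2 length units apart on the x-axis
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
masses = [1.0, 1.0]
com = center_of_mass(coords, masses)
rgyr = radius_of_gyration(coords, masses)
```

For this symmetric pair the center of mass lies midway between the particles and the radius of gyration equals half their separation.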
From an AtomGroup, one can also write different types of coordinate files such as trajectory or single coordinate files (Fig. 2). A single coordinate file such as a PDB can be written with
universe.selectAtoms("byres (resname SOL and
around 3.5 protein)").write("solvation
The related write_selection() method writes selection commands to an output file; by loading these commands in an external program such as VMD or PyMOL one can visualize the MDAnalysis selection. It is also possible to use the selection commands as input for simulations in CHARMM or Gromacs.
To write trajectory files, one obtains a trajectory Writer and then writes every individual timestep, typically by iterating over an input trajectory. Coordinates can be manipulated before writing them, or only a selection of the whole system can be written. In the next example, a new trajectory in Gromacs XTC format is written, in which the whole system (universe.atoms) is rotated by 360° around an axis parallel to the z-axis that contains the center of geometry of residues 100 to 105:
writer = MDAnalysis.Writer('rotating.xtc',\
universe.atoms.numberOfAtoms())
for ts in universe.trajectory:
angle = 360.*(ts.frame-1)/\
universe.trajectory.numframes
universe.atoms.rotateby(angle, axis=(0,0,1),\
point=universe.selectAtoms(\
The object-oriented design of the library allows the convenient use of selections (i.e., AtomGroups) in places where one could also provide an explicit Cartesian coordinate; if a point in space is required, the centroid of the AtomGroup is substituted. Vectors can be replaced by a tuple of two AtomGroups and the vector is calculated from the centroid of the first group to the second one.
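The geometry behind such a rotation can be sketched without the library. The example below is an illustration under stated assumptions, not the MDAnalysis implementation: it rotates points around an axis that passes through a pivot point using Rodrigues' rotation formula, and the pivot is taken as the centroid of a second set of points, mirroring the substitution of a group's centroid for an explicit point. The names rotate_about_axis and centroid are hypothetical.

```python
import math

def centroid(points):
    """Unweighted mean position (center of geometry) of a set of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def rotate_about_axis(points, angle_deg, axis, origin):
    """Rotate points by angle_deg around the line through origin along axis."""
    norm = math.sqrt(sum(a * a for a in axis))
    ux, uy, uz = (a / norm for a in axis)
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    rotated = []
    for p in points:
        # Work in coordinates relative to the pivot point
        x, y, z = (p[k] - origin[k] for k in range(3))
        dot = ux * x + uy * y + uz * z
        # Cross product of the unit axis with the relative position
        cx = uy * z - uz * y
        cy = uz * x - ux * z
        cz = ux * y - uy * x
        # Rodrigues' formula: v*cos + (u x v)*sin + u*(u.v)*(1 - cos)
        rx = x * c + cx * s + ux * dot * (1 - c)
        ry = y * c + cy * s + uy * dot * (1 - c)
        rz = z * c + cz * s + uz * dot * (1 - c)
        rotated.append((rx + origin[0], ry + origin[1], rz + origin[2]))
    return rotated

# Pivot: centroid of a hypothetical two-point "selection"
pivot = centroid([(0.0, 1.0, 0.0), (2.0, 1.0, 0.0)])
# Quarter turn around an axis parallel to z through the pivot
moved = rotate_about_axis([(2.0, 1.0, 0.0)], 90.0, (0.0, 0.0, 1.0), pivot)
```

Calling the function once per frame with an angle that grows with the frame number gives the kind of full turn described in the text.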
Writing of selections and trajectories is also object-aware. For example, to write a Cα–only trajectory, the appropriate Cα–selection is passed to the write() method:
ca = universe.selectAtoms("name CA")
writer = MDAnalysis.Writer("ca.dcd", len(ca))
for ts in universe.trajectory:
    # write only the C-alpha selection for the current frame
    writer.write(ca)
Typical frame-based analysis of a trajectory follows the same simple iterator pattern: while looping through the frames of the trajectory, data are collected at each time step. In the next example, the end-to-end distance of a protein is calculated and printed at each time step:
from numpy.linalg import norm
protein = universe.selectAtoms('protein')
# N atom of first residue:
nterm = protein[0].residue.N
# C atom of last residue:
cterm = protein[-1].residue.C
for ts in universe.trajectory:
    d = norm(cterm.pos - nterm.pos)
    print "End-to-end distance %f at frame %d"\
        % (d, ts.frame)
The example also demonstrates the important fact that coordinates of selections or individual atoms change dynamically when the trajectory is moved to a different frame. The iterator approach is robust and because MDAnalysis only loads individual frames into memory when needed, it is possible to handle large systems. So far MDAnalysis has been tested successfully with systems up to 4.5 million particles (Daniel Parton, personal communication).
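The iterator pattern itself can be illustrated in plain Python. The sketch below is not MDAnalysis code: iter_frames is a hypothetical generator that stands in for a trajectory reader and produces one frame at a time, so that only the current frame exists in memory while the per-frame observable is collected.

```python
import math

def iter_frames(n_frames):
    """Yield one frame at a time, so only the current frame is held in memory.

    Hypothetical two-atom system: the second atom moves along x by 1 per frame.
    """
    for index in range(n_frames):
        yield {
            "frame": index + 1,
            "positions": [(0.0, 0.0, 0.0), (3.0 + index, 4.0, 0.0)],
        }

def end_to_end(positions):
    """Distance between the first and the last position of a frame."""
    first, last = positions[0], positions[-1]
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(first, last)))

# Collect (frame number, distance) pairs while looping over the frames
distances = []
for frame in iter_frames(3):
    distances.append((frame["frame"], end_to_end(frame["positions"])))
```

Because the generator never builds a list of all frames, the memory footprint is independent of the trajectory length.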
The MDAnalysis.analysis Module and Advanced Usage
A typical approach to testing convergence of a property is to calculate the property of interest over blocks of a trajectory and analyze the standard deviation of the block averages as function of the block size.16 Accumulating block results can be almost trivially carried out in MDAnalysis by iterating over blocks and appending calculated observables to a list. The following example defines a function blocked() that takes a universe, the number of blocks, and a function operating on a universe as arguments:
def blocked(universe, nblocks, analyze):
    size = universe.trajectory.numframes/nblocks
    blocks = []
    for block in xrange(nblocks):
        a = []  # observable for each frame of this block
        for ts in universe.trajectory[\
                block*size:(block+1)*size]:
            a.append(analyze(universe))
        blocks.append(numpy.average(a))
    blockaverage = numpy.average(blocks)
    blockstd = numpy.std(blocks)
    return nblocks, size, blockaverage, blockstd
To calculate the block average of the radius of gyration of a protein one would define an analysis function such as
def rgyr(universe):
    return universe.selectAtoms('protein').\
that can be directly used with blocked() for a range of block numbers:
results = []
for nblocks in xrange(2,10):
    results.append(blocked(universe, nblocks, rgyr))
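The statistics behind block averaging can be checked on a plain list of numbers. The following sketch uses only the Python standard library and a hypothetical function block_average; it assumes that the observable has already been evaluated for every frame and uses the population standard deviation, which corresponds to the default of numpy.std().

```python
import statistics

def block_average(values, nblocks):
    """Split values into nblocks equal blocks and summarize the block means.

    Frames left over after the last full block are ignored.
    """
    size = len(values) // nblocks
    block_means = [
        statistics.mean(values[b * size:(b + 1) * size])
        for b in range(nblocks)
    ]
    return (nblocks, size,
            statistics.mean(block_means),
            statistics.pstdev(block_means))

# Hypothetical per-frame observable for six frames
series = [1.0, 1.0, 3.0, 3.0, 2.0, 2.0]
results = [block_average(series, n) for n in (2, 3)]
```

Plotting the fourth element of each tuple against the block size shows how the spread of the block averages changes as blocks become longer.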
MDAnalysis already contains a number of analysis applications that are user-friendly “best-practice” examples for tool development (Fig. 1). One example is the align module, which contains code to fit a trajectory to a reference structure by minimizing the RMSD of a selection between the current frame and the reference. The implementation uses the fast QCP method for calculating the minimum RMSD between two structures17 and determining the optimal least-squares rotation matrix.18 The rms_fitting() function can be used to re-orient the trajectory based on an atom selection and/or to concatenate trajectories together while extracting an arbitrary subset of frames or atoms. An example for this tool, rmsfit_alignment.py, is located within the examples directory.
Time Series Analysis
A number of often-used geometric calculations over time series of atoms are predefined and written in C. Instead of looping over individual frames, the so-called Timeseries analysis functions directly operate on the underlying trajectory. First a collection is populated with the desired analysis tasks and then the analysis is carried out by passing the collection to the correl method of the trajectory object:
MDAnalysis.collection.addTimeseries(\
Timeseries.Dihedral(atomselection))
data = universe.trajectory.correl(MDAnalysis.collection)
The current implementation (for DCD only) includes functions for Angle, Bond, Dihedral, CenterOfGeometry, CenterOfMass, Atom, Distance, and WaterDipole moment.
An example is shown in Figure 3 where a KALP19 peptide in a lipid membrane is analyzed. The script below shows how to set up the calculation of a single ψ peptide backbone dihedral; Figure 3B shows values for all ψi:
psi_sel = universe.selectAtoms("atom KALP 21 N", \
"atom KALP 21 CA", "atom KALP 21 C", \
a = collection.TimeseriesCollection()
a.addTimeseries(Timeseries.Dihedral(psi_sel))
data = universe.trajectory.correl(a, skip=10)*180./pi
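For reference, the quantity evaluated per frame by a dihedral time series can be written out for four points. The sketch below is a plain-Python illustration, not the compiled MDAnalysis code; dihedral_deg and the helper functions are hypothetical names, and the angle is obtained with the usual atan2 form from the three bond vectors.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Central bond along x; the outer points are placed cis, trans, and perpendicular
cis = dihedral_deg((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
trans = dihedral_deg((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, -1.0, 0.0))
perp = dihedral_deg((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0))
```

For a backbone ψ angle the four points would be the N, CA, and C atoms of one residue followed by the N atom of the next residue.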
The structure of a lipid bilayer can be analyzed by the average deuterium order parameter, $S_{CD} = \frac{1}{2}\left(3\langle\cos^2\theta\rangle - 1\right)$;19 θ is the angle between carbon-hydrogen bonds in the fatty acid tail and the bilayer normal. $S_{CD}$ can range from −0.5 to 1.0, representing completely disordered chains to fully ordered chains. To analyze $S_{CD}$ of DMPC lipid tails in the KALP system (Fig. 3C), the C-H vectors for a given carbon on each aliphatic chain of every lipid molecule are extracted and stored in a NumPy array. The data can be extracted in different formats depending on the analysis; for example, the timeseries is generated in “afc” format for “atom, frame, coordinate.” $S_{CD}$ is then calculated using NumPy's powerful arithmetic:
carbon = 3 # analyze the 3rd carbon
selection = "resname DMPC and (name C2%d or name H%dR or \
name H%dS or name C3%d or name H%dX or name H%dY)" % \
((carbon,)*6) # selects C23 H3R H3S C33 H3X H3Y
group = universe.selectAtoms(selection)
data = universe.trajectory.timeseries(group, format="afc", skip=skip)
cd = numpy.concatenate((data[1::3]-data[0::3], \
    data[2::3]-data[0::3]), axis=0)
cd_r = numpy.sqrt(numpy.sum(numpy.power(cd,2), axis=-1))
# cosine of the angle to the bilayer normal, assumed here to be the z-axis
cos_theta = cd[...,2]/cd_r
S_cd = -0.5*(3.*numpy.square(cos_theta)-1)
S_cd.shape = (S_cd.shape[0], S_cd.shape[1]/4, -1)
order_param = numpy.average(S_cd)
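The formula for the order parameter can be verified on a handful of vectors. The sketch below is a plain-Python illustration with a hypothetical function order_parameter; it uses the sign convention of the formula given in the text and assumes, as a default, that the bilayer normal is the z-axis.

```python
import math

def order_parameter(ch_vectors, normal=(0.0, 0.0, 1.0)):
    """Average of (3 cos^2(theta) - 1) / 2 over a set of C-H vectors.

    theta is the angle between each vector and the normal.
    """
    nlen = math.sqrt(sum(c * c for c in normal))
    total = 0.0
    for v in ch_vectors:
        vlen = math.sqrt(sum(c * c for c in v))
        cos_theta = sum(a * b for a, b in zip(v, normal)) / (vlen * nlen)
        total += 0.5 * (3.0 * cos_theta ** 2 - 1.0)
    return total / len(ch_vectors)

# Limiting cases: bonds along the normal and bonds perpendicular to it
s_parallel = order_parameter([(0.0, 0.0, 1.1), (0.0, 0.0, -1.1)])
s_perpendicular = order_parameter([(1.1, 0.0, 0.0), (0.0, 1.1, 0.0)])
```

Vectors along the normal give 1 and vectors perpendicular to it give −0.5, which are the two ends of the range quoted above.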
Distances
The distance module contains functions to calculate distances between atoms in selections. The versatile distance_array() function (implemented as a fast compiled library) can be used to rapidly calculate all distances dij between two groups of atoms. The following example shows how to use it in order to calculate the radial distribution function, g(r), of water molecules around the amino group of all lysine residues. In the example below, g(r) is stored in the variable rdf:
from MDAnalysis.analysis.distances import distance_array
group = universe.selectAtoms("resname LYS and name NH*")
water = universe.selectAtoms("resname TIP3 and name OH2")
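rdf, edges = numpy.histogram([0], bins=100, range=(dmin, dmax))
rdf *= 0  # start from an empty histogram
for ts in universe.trajectory:
    amino_coor = group.coordinates()
    water_coor = water.coordinates()
    # box (unit cell lengths of the current frame) is assumed to be defined
    dist = distance_array(amino_coor, water_coor, box)
    new_rdf, edges = numpy.histogram(numpy.ravel(dist), \
        bins=100, range=(dmin, dmax))
    rdf += new_rdf  # accumulate counts over all frames
numframes = \
    universe.trajectory.numframes/universe.trajectory.skip
# density (bulk number density of water) is assumed to be defined
vol = (4./3.)*numpy.pi*density*\
    (numpy.power(edges[1:],3) - numpy.power(edges[:-1], 3))
rdf = rdf/(vol*numframes)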
Distances under periodic boundary conditions can be taken into account via a minimum-image convention by providing the unit cell information to the distance_array() function. At the moment, this functionality is limited to cubic or orthogonal unit cells but additional periodic boundary processing will be implemented in the future. In the general case, it is advisable to preprocess trajectories to center the system on the molecule of interest and remap solvent molecules into the primary unit cell using native tools.
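The minimum-image convention for an orthogonal cell can be stated in a few lines. The following sketch is a plain-Python illustration rather than the compiled library routine; minimum_image_distance is a hypothetical name and the box is assumed to be given as three edge lengths.

```python
import math

def minimum_image_distance(a, b, box):
    """Distance between a and b using the nearest periodic image of b.

    box holds the three edge lengths of an orthogonal unit cell.
    """
    d2 = 0.0
    for k in range(3):
        delta = b[k] - a[k]
        # Shift by whole box lengths so that |delta| <= box[k] / 2
        delta -= box[k] * round(delta / box[k])
        d2 += delta * delta
    return math.sqrt(d2)

box = (10.0, 10.0, 10.0)
# Two particles close to opposite faces of the cell along x
d_periodic = minimum_image_distance((0.5, 0.0, 0.0), (9.5, 0.0, 0.0), box)
# Two particles well inside the cell are unaffected by the convention
d_inside = minimum_image_distance((2.0, 2.0, 2.0), (5.0, 6.0, 2.0), box)
```

The first pair is 9 length units apart inside the cell but only 1 unit apart through the periodic boundary.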
Density
The density package contains functions to calculate a density of a selection of atoms. The following example shows how to use the density_from_Universe() function to calculate the three-dimensional (3D) density map for the oxygen atom for all water molecules in the system on a grid with spacing 1 Å:
from MDAnalysis.analysis.density import density_from_Universe
D = density_from_Universe(universe, delta=1.0, \
atomselection="name OH2")
D.convert_density("water") # measure relative to bulk water
The resulting OpenDX file can be viewed in a molecular viewer such as VMD or PyMOL. The density object can be processed further and analyzed within a Python script.
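Conceptually, such a density map is a three-dimensional histogram of atom positions averaged over frames. The sketch below illustrates this idea in plain Python and is not the implementation of density_from_Universe(); the function density_grid, the sparse dictionary representation, and the sample positions are assumptions made for the example.

```python
import math
from collections import Counter

def density_grid(frames, delta=1.0):
    """Average number density per cubic cell of edge length delta.

    frames is an iterable of position lists; cells are indexed by integer
    triples and empty cells are simply absent from the result.
    """
    counts = Counter()
    nframes = 0
    for positions in frames:
        nframes += 1
        for x, y, z in positions:
            cell = (math.floor(x / delta),
                    math.floor(y / delta),
                    math.floor(z / delta))
            counts[cell] += 1
    cell_volume = delta ** 3
    return {cell: n / (nframes * cell_volume) for cell, n in counts.items()}

# Two hypothetical frames with water oxygen positions
frames = [
    [(0.2, 0.2, 0.2), (1.5, 0.1, 0.1)],
    [(0.7, 0.9, 0.1)],
]
density = density_grid(frames, delta=1.0)
```

Dividing each value by the bulk number density of water would express the map relative to bulk, in the spirit of the convert_density() call above.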
Implementation of the LeafletFinder Algorithm for Lipid Bilayer Analysis
The tight integration between MDAnalysis and NumPy together with the availability of powerful specialized libraries enables one to express algorithms in a concise and efficient manner. As an example, we present the LeafletFinder algorithm that returns the groups of atoms that make up the two leaflets of a bilayer. Such information is required for the automated analysis of lipid-protein interactions or lipid exchange between leaflets. For small, planar bilayer patches (e.g., with the bilayer normal assumed parallel to the z axis), it is not difficult to implement a simple algorithm that (1) collects specific head group atoms (for instance, phosphorus atoms for phospholipids), and then (2) assigns them to a leaflet, depending on the atom's z coordinate being above or below the center of geometry of the bilayer. Such an approach breaks down when the bilayer shows strong undulations or if it is not planar, as is the case for vesicles. LeafletFinder follows a different approach: it first builds a network of neighboring head groups from a distance cutoff (step 1) and then analyzes this network with graph-theoretic methods, building a graph and finding its connected components (steps 2 and 3); the two largest components are the leaflets.
Figure 4D shows the algorithm applied to a large coarse-grained bilayer system with 24,056 lipids and a total system size of over 1.5 million particles. Here, membrane undulations have amplitudes larger than the bilayer thickness, thus rendering the simple approach useless whereas LeafletFinder reliably distinguishes the two leaflets as shown in the closeup in Figure 4E.
The implementation in MDAnalysis starts with a selection of lipid head group atoms, e.g.,
headgroup_atoms = universe.selectAtoms("name P*")
coord = headgroup_atoms.coordinates()
Step 1 of the algorithm takes the coordinates of the selected head group atoms and builds the adjacency matrix, which contains True for any distance smaller than the cutoff and False otherwise; this only requires a single line of code because distance_array() returns a NumPy array that can be directly transformed using NumPy's powerful Boolean array constructors:
from MDAnalysis.analysis.distances import distance_array
adj = (distance_array(coord,coord) < cutoff)
Steps 2 and 3 make use of the NetworkX12 package. The networkx.Graph class can directly build a graph from an adjacency matrix and the networkx.connected_components() function returns the connected components of a graph, sorted by size. Thus only a single line of code is required to analyze the network of neighbors of all lipids:
import networkx as NX
leaflets = NX.connected_components(NX.Graph(adj))
leaflets[0] and leaflets[1] contain the indices of the two leaflets that can be mapped back to the atoms by indexing the selection; the corresponding residues (i.e., the lipids) are obtained from the residues attribute of the AtomGroup:
A_lipids = headgroup_atoms[leaflets[0]].residues
B_lipids = headgroup_atoms[leaflets[1]].residues
The “top” or “bottom” leaflet of a planar membrane can be assigned by, for instance, comparing the centers of mass of the leaflets. Membrane normals can be computed via singular value decomposition of the head group coordinates, and leaflet-resolved diffusion coefficients and order parameters can then be computed with standard techniques.
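The three steps can also be written without NumPy or NetworkX, which makes the logic explicit. The following sketch is a plain-Python illustration and not the MDAnalysis implementation: find_leaflets is a hypothetical function that builds the neighbor lists from a cutoff, finds the connected components with a breadth-first search, and returns them sorted by size.

```python
import math
from collections import deque

def find_leaflets(coords, cutoff):
    """Group points into connected components of the cutoff-neighbor network."""
    n = len(coords)
    # Step 1: neighbor lists, equivalent to a Boolean adjacency matrix
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((coords[i][k] - coords[j][k]) ** 2
                              for k in range(3)))
            if d < cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    # Steps 2 and 3: connected components by breadth-first search
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            i = queue.popleft()
            component.append(i)
            for j in neighbors[i]:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
        components.append(sorted(component))
    # Largest components first; the first two are taken as the leaflets
    components.sort(key=len, reverse=True)
    return components

# Hypothetical head groups: three in an upper layer, two in a lower layer
heads = [(0.0, 0.0, 20.0), (5.0, 0.0, 20.0), (10.0, 0.0, 20.0),
         (0.0, 0.0, -20.0), (5.0, 0.0, -20.0)]
leaflets = find_leaflets(heads, cutoff=8.0)
```

The pairwise loop scales quadratically with the number of head groups, so for large systems a compiled distance routine such as distance_array() is the more practical choice for step 1.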
Native Contacts Analysis
Trajectories between known end states such as folding/unfolding or a macromolecular transition between “closed” and “open” conformations are frequently simulated. A useful metric to describe the progression of such a trajectory is the fraction of native contacts $q_M(t)$ at each time t.20 A contact exists if an atom pair (i,j) has a distance $d_{ij} < R_c$, where $R_c$ is a cut-off. A native contact relative to state M is said to exist if $d_{ij}(t) < R_c$ and $d_{ij}^{M} < R_c$. The fraction of native contacts is the ratio between the number of native contacts at time t and the total number of contacts in state M.
The following example implements a “Q1–Q2” analysis,20 applied to the simulated transition of the enzyme adenylate kinase (AdK) from a closed to an open conformation.21 Contacts are defined between Cα atoms with Rc=8 Å. q1 is the fraction of contacts relative to the closed state (pdb:1AKE), the starting conformation of the trajectory, and q2 is the fraction of native contacts relative to the open conformation (based on pdb:4AKE). The outline of the algorithm only requires a few lines of code; a full implementation can be found in MDAnalysis.analysis.contacts.ContactAnalysis. To calculate the fraction of native contacts q1 relative to a native structure with Cα coordinates
from MDAnalysis.analysis.distances\
import self_distance_array
native_contacts_1 = (self_distance_array(\
ca = universe.selectAtoms("name CA")
for ts in universe.trajectory:
    contacts = (self_distance_array(\
        ca.coordinates()) < Rcut)
    native_contacts = numpy.logical_and(\
        contacts, native_contacts_1)
    # fraction relative to the number of contacts in the reference state
    q1 = native_contacts.sum()/\
        float(native_contacts_1.sum())
A plot of q1 versus q2 is shown in Figure 5, indicating that the transition progresses in a nonlinear fashion and switches from a “closed”-like conformation to an “open”-like conformation.20, 21
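The definition of the fraction of native contacts can be illustrated on a toy system. The sketch below uses plain Python sets of index pairs instead of Boolean NumPy arrays; contact_pairs and fraction_native are hypothetical names and the three-particle coordinates are made up for the example.

```python
import math
from itertools import combinations

def contact_pairs(coords, rcut):
    """Set of index pairs (i, j), i < j, that are closer than rcut."""
    pairs = set()
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) < rcut:
            pairs.add((i, j))
    return pairs

def fraction_native(coords, native_pairs, rcut):
    """Fraction of the reference contacts that are also present in coords."""
    current = contact_pairs(coords, rcut)
    return len(current & native_pairs) / len(native_pairs)

rcut = 8.0
# Hypothetical reference state with two contacts: (0, 1) and (1, 2)
reference = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
native = contact_pairs(reference, rcut)
# In this frame the third particle has moved away, breaking contact (1, 2)
frame = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
q = fraction_native(frame, native, rcut)
```

Evaluating the same frame against the contact sets of two different reference states yields the pair of values that is plotted in a Q1–Q2 analysis.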
Benchmark
Analysis of MD simulations can be time consuming and thus performance is a concern when using any analysis program. We benchmarked MDAnalysis together with three other software packages (CHARMM,6 VMD,7 and Gromacs1) on a number of representative tasks that are readily available in each package. For a 10-ns trajectory with 10,000 frames of fatty acid binding protein (pdb: 1IFC) in water (13,051 atoms), we computed time series for three distances, two dihedral angles, the RMSD relative to the starting conformation, and the radius of gyration of the protein. We also calculated the density of water oxygens around the protein from the same trajectory. The radial distribution of 3476 water oxygens in a urea solution was calculated from 2000 frames of a trajectory with 11,492 atoms. The MDAnalysis code used was similar to the examples shown above. Runs were timed on an Intel Xeon E5420 CPU at 2.5 GHz. Results are shown in Table 1.
Table 1. Benchmark of Standard Analysis Tasks Using Various Analysis Software Packages.
| Task | MDAnalysis | VMD | CHARMM | Gromacs |
|---|---|---|---|---|
| Distances (3x) | 1.04 s | 4.81 s | 2.22 s | 14.71 s |
| Dihedrals (2x) | 0.80 s | 2.40 s | 2.15 s | 11.92 s |
| RMSD | 29.10 s | 4.29 s | 5.09 s | 32.18 s |
| Radius of gyration | 10.16 s | 12.58 s | 12.10 s | 16.87 s |
| Radial distribution | 21 m 24 s | 10 m 38 s | 7 m 1 s | 6 m 6 s |
| 3-dimensional density | 1 m 32.31 s | 5 m 52.37 s | –a | 16.93 s |
In general, MDAnalysis performs fairly well and is even faster than most programs when performing simple calculations such as the analysis of distances, dihedrals, generating density maps, and the radius of gyration. Very compute-intensive tasks that have to operate on large arrays, such as the radial distribution function, can be substantially slower. Profiling of the code with Python's cProfile module revealed that in this case the drop in performance arose mostly from the NumPy histogram() function, which requires about three times as much time as the MDAnalysis distance calculation. If necessary, performance could likely be improved by implementing the whole analysis task in Cython or C without using NumPy, although this would require many more lines of code and hence more implementation and testing time.