The GROMACS molecular simulation package[1–4] is widely used in the field of (bio)molecular simulation. The most common setup of a simulation system in GROMACS, as in most other major molecular simulation software, assumes a system with a fixed composition of molecules. In addition, for each type of molecule, a fixed, predefined chemical connectivity is assumed. This setup readily allows neither the breaking of chemical bonds nor the addition or removal of atoms or molecules to or from the system, as required, for example, for simulations in the grand-canonical ensemble. A common workaround involves shell scripting to apply modifications to the simulated system setup, followed by (re)starting of the simulation engine. Needless to say, this adds significant overhead and subtracts from the overall efficiency. Moreover, such approaches show unfavorable scaling of the computational effort, for example, when increasing the number of insertions/removals or the system size.
The grand-canonical ensemble can be used for studying systems where one is interested in the average number of molecules as a function of the external chemical potential and temperature. This renders it a suitable ensemble, for example, for exploring the adsorption behavior of a given molecular species to the system of interest. In a grand-canonical Monte Carlo (GCMC) simulation, one imposes the chemical potential μi of species i, the system volume V, and the temperature T. During simulation, particles of type i are removed or inserted as a result of the imposed chemical potential. At equilibrium, the number of removals equals the number of insertions, and one can sample the average number of molecules i. The main computational advantage over molecular dynamics (MD) and NVT Monte Carlo (MC) is that both the equilibration times and the required system sizes are drastically reduced. It is also possible to combine MD with GCMC.[5, 6] The result is a “hybrid” scheme that alternates short MD trajectories for particle translations of a system containing Ni particles of type i with trial particle removals (Ni ← Ni − 1) or insertions (Ni ← Ni + 1).
The main application of GROMACS is as an engine to perform MD simulations, from which dynamic system properties of interest can be determined. In addition, the MD trajectories also contain nonequilibrium thermodynamic properties of molecular systems. GROMACS comes with a range of applications that facilitate the analysis of simulation outcomes. With respect to specific simulation options (e.g., GCMC) or to data analysis, it would, however, be useful to have the GROMACS data structures accessible to the user via interpreted, high-level programming languages such as python.
In contrast to C or Fortran, python is suitable for rapid prototyping and is easy to read and learn. Moreover, the python user community is active and growing, and several python packages such as BioPython and PyCogent have become standards. A python interface would therefore broaden the range of users that can contribute to and use the flexibility of the GROMACS simulation package.
In this work, we describe an approach that makes the GROMACS data structures available to the user via the python module GromPy acting as an application programming interface (API) to the GROMACS C-library. The module allows access to the desired GROMACS data structures in memory from the python interpreter that can then be used to implement analysis tools and new simulation schemes. Here, we illustrate the use of the GromPy API by implementing a GCMC simulation scheme for which we use GROMACS C-library functions to perform energy calculations.
The GromPy python interface
The GROMACS package is written in the C programming language. We base our development tree on GROMACS version 4.0.5, which will be ported to the latest development branch in the near future.
To implement the interface, we choose the python programming language. Python is a high-level, interpreted, object-oriented, multiplatform programming language. It provides a large standard library and is easy to write. We use the free and open-source CPython implementation of python. Apart from the standard library, python has excellent extensions for numerical data analysis and data display.[11–16] CPython is written in C and compiles python programs into intermediate code that is executed by a virtual machine. The CPython implementation also allows modules to be implemented in C and (precompiled) libraries to be interfaced.
In our setup, we use the ctypes module as the interface between python and the GROMACS C-library. The ctypes module contains python equivalents of all basic C data types and allows compound C structures to be mapped onto python classes. As soon as the GROMACS data structures are accessible via ctypes, we can pass them to external GROMACS functions and access the results from the python interpreter during execution via the GromPy module.
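As an illustration, mapping a compound C structure to a python class with ctypes looks roughly as follows. The struct layout below is a simplified, hypothetical stand-in for the actual GROMACS state structure, not the real header definition, and the library call shown in comments is likewise illustrative.

```python
import ctypes

# Hypothetical, simplified stand-in for a GROMACS state struct; the real
# structure in the GROMACS headers has many more fields.
rvec = ctypes.c_float * 3  # GROMACS-style 3-vector of single-precision floats

class t_state(ctypes.Structure):
    _fields_ = [
        ("natoms", ctypes.c_int),     # number of atoms
        ("x", ctypes.POINTER(rvec)),  # coordinate array
        ("v", ctypes.POINTER(rvec)),  # velocity array
    ]

# Build a two-atom state entirely in python to show field access:
coords = (rvec * 2)(rvec(1.0, 2.0, 3.0), rvec(4.0, 5.0, 6.0))
state = t_state(natoms=2,
                x=ctypes.cast(coords, ctypes.POINTER(rvec)),
                v=None)

# Loading the shared library and declaring a signature would then look like:
# libmd = ctypes.CDLL("libmdrun.so")
# libmd.mdr_init.argtypes = [ctypes.POINTER(t_state)]

print(state.natoms, state.x[1][2])  # fields are now accessible from python
```

Once such a mapping exists, instances can be passed by reference to any exported C function that expects a pointer to the corresponding struct.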
The initial GromPy implementation can be used for the analysis of trajectories, for example, using GROMACS' periodic boundary condition removal and structure fitting routines. GromPy can also read in index groups and topologies and was applied in the prototyping of GROMACS tools, which were later implemented in C. Recently, GromPy was applied to design a combined MD/MC approach to simulate FRET experiments and aid in the distance reconstruction. The present work extends GromPy with a GCMC simulation mode. The GromPy source code is publicly available at https://github.com/GromPy.
Hybrid GCMC/MD Simulations
In GCMC, the simulation box is in chemical equilibrium with an external bath. Hence, the chemical potential μ of both systems is equal. One therefore imposes the chemical potential of a particular molecular species, upon which molecules are exchanged between the external reservoir and the simulation box. In practice, this means that molecules are inserted into or removed from the simulation box during simulation. The MC acceptance rule for insertion of a molecule reads

\[ P_{\text{acc}}(N \rightarrow N+1) = \min\left[1, \frac{V}{\Lambda^3 (N+1)}\, e^{\beta(\mu - \Delta U)}\right] \quad (1) \]

where N is the number of molecules, V is the box volume, \(\Lambda = h/\sqrt{2\pi m k_B T}\) is the thermal de Broglie wavelength (h denotes Planck's constant, m is the molecular mass, kB Boltzmann's constant, and T is the temperature), β = 1/(kBT) is the inverse temperature, and ΔU = U(N + 1) − U(N) is the energy difference of adding one molecule at a random position in the simulation box. For removal of a molecule, we use the following acceptance rule

\[ P_{\text{acc}}(N \rightarrow N-1) = \min\left[1, \frac{\Lambda^3 N}{V}\, e^{-\beta(\mu + \Delta U)}\right] \quad (2) \]

where ΔU = U(N − 1) − U(N) is the potential energy difference associated with the removal of a randomly selected molecule.
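The two acceptance rules can be sketched as plain python functions. This is a minimal sketch in consistent units: the thermal de Broglie wavelength Λ, inverse temperature β, and energy difference ΔU are supplied by the caller.

```python
import math

def p_accept_insert(N, V, Lam, beta, mu, dU):
    """Acceptance probability for a trial insertion, N -> N + 1 (eq. (1))."""
    return min(1.0, V / (Lam**3 * (N + 1)) * math.exp(beta * (mu - dU)))

def p_accept_remove(N, V, Lam, beta, mu, dU):
    """Acceptance probability for a trial removal, N -> N - 1 (eq. (2))."""
    return min(1.0, Lam**3 * N / V * math.exp(-beta * (mu + dU)))
```

A trial move is then accepted when a uniform random number on [0, 1) falls below the returned probability.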
To simulate thermal motion, we apply several MD steps at constant NVT, using the velocity rescale thermostat, which generates a canonical ensemble, in between the GCMC moves. The nature of the MC move (i.e., a trial insertion/removal or an MD move) during an MC cycle is chosen at random based on a user-defined list of probabilities for each type of MC move.
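Drawing the move type from such a user-defined probability list can be sketched as follows; the probabilities shown are placeholders, not values used in this work.

```python
import random

# Placeholder move-type probabilities; in a real run these are user input.
MOVE_PROBABILITIES = [("MD", 0.6), ("GCMC", 0.4)]

def pick_move(moves, rng=random):
    """Draw one MC move type according to its listed probability."""
    r = rng.random()
    cumulative = 0.0
    for name, p in moves:
        cumulative += p
        if r < cumulative:
            return name
    return moves[-1][0]  # guard against floating-point round-off
```

Over many cycles, each move type is selected with a frequency matching its listed probability.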
Extending GromPy and modifying the GROMACS source code
This work involves an extension of GromPy, enabling GCMC using the GROMACS C-library. The general setup is shown in Figure 1. When used in GCMC mode, GromPy needs a starting configuration with a number of molecules Ni,start of type i in the form of a GROMACS tpr file stored on disk. Such a tpr file serves as input for a GROMACS simulation and contains all simulation parameters and a configuration of a system. The tpr file range Ni ∈ [Ni,min,Ni,max] is generated in the preprocessing stage, where Ni,min ≤ Ni,start and Ni,max ≥ Ni,start are the extrema of the Ni sampling range. By imposing a chemical potential μi of this molecule type, GromPy samples the Ni range via the hybrid GCMC/MD algorithm.
All MC moves in our hybrid MD/GCMC module require having the current state sc:[Ni,c,r,v] and associated total potential energy Uc in memory, compare Figure 2. This state is a member of the grand-canonical ensemble and thus comprises the current number of molecules Ni,c of type i, the coordinates r, and the velocities v (we use the rN and vN shorthand notation for the coordinate and velocity arrays consisting of N elements). The GCMC module uses two MC move types: one that performs several MD steps on sc to simulate thermal motion of the molecular system and one that performs a GCMC move that tries to modify sc by inserting or removing a molecule. For computational efficiency, the MD move is always accepted, as the resulting configuration is already part of the correct statistical mechanical ensemble. After the MD move, we update the coordinates, velocities, and total potential energy. Inside the GCMC move, we select either the removal or the insertion of a molecule, each with probability Pinsert = Premove = 1/2. For insertion, we generate a trial state st that has Ni,t = Ni,c + 1 molecules. The first Ni,c elements of the coordinate and velocity arrays are copied from sc. The last element is filled with a random molecular position r′ inside the box and a molecular velocity v′ drawn at random from the Maxwell–Boltzmann velocity distribution associated with the imposed temperature T. This step requires having st in memory. If this is not the case, we first read a tpr file with Ni = Ni,t from disk. A molecular removal involves generating a trial state st that has Ni,t = Ni,c − 1 molecules. We randomly select a molecule (k) from the list and copy the Ni,c elements of the coordinate and velocity arrays from sc to st, while excluding the kth element. Again, we require having st in memory and read from disk otherwise. Trial insertions or removals with associated Ut are accepted according to eq. (1) or eq. (2) (where ΔU = Ut − Uc), respectively.
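The construction of the trial arrays described above can be sketched in pure python as follows. This is a simplified sketch that stores per-molecule 3-vectors in plain lists; in GromPy these operations act on the ctypes-wrapped GROMACS arrays instead.

```python
import math
import random

def trial_insert(r, v, box, mass, kT, rng=random):
    """Trial state with one extra molecule: a random position inside the box
    and a velocity drawn from the Maxwell-Boltzmann distribution."""
    r_prime = [rng.random() * b for b in box]
    sigma = math.sqrt(kT / mass)  # per-component MB standard deviation
    v_prime = [rng.gauss(0.0, sigma) for _ in range(3)]
    return r + [r_prime], v + [v_prime]

def trial_remove(r, v, rng=random):
    """Trial state with a randomly selected molecule k excluded."""
    k = rng.randrange(len(r))
    return r[:k] + r[k + 1:], v[:k] + v[k + 1:], k
```

Both functions leave the current-state arrays untouched, so the current state sc survives a rejected trial move unchanged.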
If accepted, we update st to sc and the associated potential energy Ut becomes Uc. Otherwise, we keep sc and Uc. After each MC move, we update the averages and increment the MC loop iterator.
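The accept/update bookkeeping can be exercised with a toy model: for an ideal gas, ΔU = 0 for every trial move, and the GCMC loop should sample ⟨N⟩ = e^{βμ}V/Λ³. The sketch below is not GromPy code; it omits MD moves entirely and only demonstrates the loop over trial moves and the acceptance step.

```python
import math
import random

def gcmc_ideal_gas(mu, V, beta, Lam, n_steps, rng):
    """Toy GCMC on an ideal gas: every trial has dU = 0, so only the
    prefactors in the acceptance rules matter."""
    N = 0
    total = 0
    z = math.exp(beta * mu) / Lam**3  # activity
    for _ in range(n_steps):
        if rng.random() < 0.5:  # trial insertion, N -> N + 1
            if rng.random() < min(1.0, z * V / (N + 1)):
                N += 1
        elif N > 0:             # trial removal, N -> N - 1
            if rng.random() < min(1.0, N / (z * V)):
                N -= 1
        total += N
    return total / n_steps

# For mu = 0, V = 50, beta = 1, Lam = 1, the exact average is <N> = 50.
```

Recovering the known ideal-gas average is a quick sanity check that the accept/update logic is implemented correctly before any real energy evaluations are wired in.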
As described above, the GCMC module uses the current and trial states (sc and st) to sample the grand-canonical ensemble. For this, energy evaluations are needed to obtain Uc and Ut, which serve as input for the acceptance rules for insertion [eq. (1)] and removal [eq. (2)]. At run time, the states are stored in memory by interfacing with specific GROMACS library functions. The associated energies Uc and Ut are computed by calls to the GROMACS library. Both operations are performed using the python ctypes module. To achieve the interfacing, we modified the GROMACS 4.0.5 source code as shown in Figure 3. Although the modifications were performed for the serial implementation of GROMACS, we intend to make them compatible with the parallel parts of the code, which we expect to require relatively little effort. The GROMACS function mdrunner() loads a tpr file and can perform an MD simulation on a given system. This function is called by the GROMACS mdrun executable. As ctypes can load only shared object libraries, we compile the mdrun executable as a shared object library: libmdrun.so. During a GCMC run, we generate trial states st by copying the current state sc to st and adding a trial position (and velocity) for insertion or excluding a randomly selected molecule for removal. To achieve this flexibility, we have split up the mdrunner() function into three parts: mdr_init(cs), mdr_int(cs), and mdr_fin(cs). We added a new data structure cs for the current state that enables communication between the subfunctions. For our purposes, the most important member of cs is the state s. By calling the three functions in sequence (and without modifying cs in between), the behavior of the original mdrunner() function is reproduced exactly. Function mdr_init(cs) reads a tpr file from disk and stores the state s in cs. Function mdr_int(cs) performs an MD calculation of NMD steps; NMD is also a member of cs and can be set from within GromPy.
For an MD move, the number of MD steps is set to NMD > 0 and for energy evaluations in a GCMC move it is set to NMD = 0 (which results in a single point energy calculation). Computational performance of the simulation is calculated by function mdr_fin(cs). The gain in total computational time is realized by keeping cs in python memory once initialized by a disk read. In this way, cs can be (re)used efficiently for MD or GCMC moves.
Note that the Ni,start configuration should be an equilibrated one. However, this is not a precondition for the other Ni ≠ Ni,start tpr files that the user wishes to use for sampling, as these files are merely used to fill the coordinate and velocity arrays in a trial move. During simulation, sc will always be part of the correct ensemble.
To summarize, once in memory, cs can be manipulated for any intended purpose and can serve as input for mdr_int(cs). Our purpose is GCMC, and we therefore manipulate the cs members s and NMD. Obviously, the same behavior could be achieved by a shell script that calls the necessary GROMACS executables, that is, grompp and mdrun. The downside of such an approach is that most of the time the GCMC shell will perform file I/O and/or system calls, mainly invoked by the necessary consecutive execution of the GROMACS grompp and mdrun applications. Having the relevant GROMACS data structures in memory, combined with the modified GROMACS source code, drastically reduces the time spent on file I/O and renders GromPy an efficient GCMC application, with less than 6% of run time spent in overhead. This overhead involves logging to disk, reading tpr files from disk, iterating over the MC loop, replacing the rN and vN arrays for trial insertions/removals, and the associated evaluations of eqs. (1) and (2).
Validation of the GCMC module
We aim to validate the GROMACS-GCMC scheme by comparing equations of state (EOS) determined by GCMC and NVT MD. For this, it is necessary to simulate a single phase. We therefore choose to simulate supercritical fluids. The validation is performed for two model systems. The first system consists of single Lennard-Jones (LJ) particles of the same type. For this, we use water particles of the MARTINI coarse-grained force field, which are modeled as single LJ particles. For this system type, we approximate the critical properties by Gibbs ensemble simulation results. For the second system, with polar SPC water, we also need to account for charges and insertions/removals of multi-atomic molecules, rendering it a more complicated and challenging test case. The critical properties for the SPC model are taken from the literature: Tc,SPC = 587 K and ρc,SPC = 15 mol/l.
In the LJ simulations, we use a shift potential for the nonbonded interactions with a switch radius of rs = 0.9 nm. The nonbonded interactions were truncated at rc = 1.2 nm. In the SPC simulations, all nonbonded interactions were calculated up to a cut-off distance of rc = 0.9 nm (corrections to the total energy and pressure due to truncation are taken into account) and the Coulombic interactions are calculated by the particle mesh Ewald method with a spacing of the Fourier grid of 0.12 nm.
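In GROMACS mdp terms, these nonbonded settings correspond roughly to the fragments below. This is a paraphrase for illustration: the option spellings follow the GROMACS 4.x mdp format, and only values stated in the text are filled in.

```
; LJ system: shifted van der Waals potential
vdwtype        = Shift
rvdw-switch    = 0.9
rvdw           = 1.2

; SPC system: plain cut-off LJ with long-range corrections, PME electrostatics
vdwtype        = Cut-off
rvdw           = 0.9
DispCorr       = EnerPres
coulombtype    = PME
rcoulomb       = 0.9
fourierspacing = 0.12
```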
The NVT EOS for both systems are determined at T = 773 K and T = 900 K. The simulation parameters are summarized in Table 1. We perform a separate simulation for each density ρ; the density ranges covered correspond to the x-values in Figures 4a and 4b for the LJ and SPC models, respectively. These density ranges are obtained by changing the box volume while keeping the number of molecules constant. A pilot experiment showed that NVT results are consistent whether the box volume V is varied at constant N or N is varied at constant V. We average the total pressure p and hence obtain a pressure profile as a function of density ρ.
Table 1. MD parameters used in this work for the LJ and SPC models
The MD time step is denoted by Δt, the total simulation time for each NVT simulation by tNVT, the simulation time per MD move in each μVT simulation by tMD,μVT, and the “simulation time” for a single point energy calculation needed for a GCMC trial move by tGCMC,μVT. We apply the velocity rescale thermostat, which ensures a canonical ensemble. The associated coupling frequency is represented by τ.
[Table 1 body garbled in extraction; recoverable row labels: Δt (ps), tMD,μVT (ps), tGCMC,μVT (ps); recoverable values: 2 × 10⁻², 2 × 10⁻³, and 2 × 10³.]
The μVT EOSs at T = 773 K and T = 900 K are obtained by imposing a range of chemical potentials μ on fixed-volume systems of either LJ particles or SPC water molecules. The simulation parameters of the μVT simulations can be found in Table 2. The MD parameters used in MD moves are listed in Table 1. The density ranges studied correspond to the x-values in Figures 4a and 4b for the LJ and SPC models, respectively. For this type of simulation, we obtain a density profile as a function of μ.
Table 2. GCMC parameters used in this work for the LJ and SPC models
The length of the cubic simulation box is denoted by b and the number of MC cycles is denoted by Ncycles. Each MC cycle consists of Nmoves trial MC moves where the MC move type is chosen randomly with probabilities PMD and PGCMC for an MD move and GCMC move, respectively.
The Gibbs–Duhem equation is used to validate the μVT results,

\[ \left(\frac{\partial p}{\partial \mu}\right)_{T} = \rho, \quad (3) \]

from which the pressure profile

\[ p(\mu) = p(\mu_0) + \int_{\mu_0}^{\mu} \rho(\mu')\,\mathrm{d}\mu' \quad (4) \]

is derived. The excess part of the chemical potential μex is calculated as

\[ \mu_{\text{ex}} = \mu - \mu_{\text{id}}, \quad (5) \]

where μid is the ideal part of the chemical potential. The pressure as a function of density, p(ρ), in eq. (4) is determined from a numerical least-squares fit of a sixth-degree polynomial to the μVT data of Ndat = 1000 data points.
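The fit-and-integrate step can be sketched numerically as follows. This is a sketch on synthetic data, not the actual simulation output; numpy's polyfit/polyint stand in for the least-squares procedure described in the text.

```python
import numpy as np

# Synthetic stand-in for the mu-VT output: density as a function of mu.
mu = np.linspace(-2.0, 2.0, 1000)
rho = 5.0 + 2.0 * mu + 0.3 * mu**2  # hypothetical rho(mu) data

# Sixth-degree least-squares polynomial fit of rho(mu), as in the text.
coeffs = np.polyfit(mu, rho, deg=6)

# Gibbs-Duhem at constant T: dp = rho dmu, so integrate the fit over mu.
p_poly = np.polyint(coeffs)                             # antiderivative
p = np.polyval(p_poly, mu) - np.polyval(p_poly, mu[0])  # p(mu) - p(mu0)

# Pairing p with rho(mu) then yields the p(rho) equation of state.
```

Since the synthetic ρ(μ) here is itself a polynomial, the fitted pressure profile can be checked against the analytic integral.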
Results and Discussion
A supercritical LJ system
For the supercritical LJ system, we used a system of single particle MARTINI water (W) molecules. A system consisting of just this molecule type involves only nonbonded LJ interactions, rendering it a relatively simple test system. We calculated the critical temperature of this system as Tc,W = 647.2 K and its associated critical density as ρc,W = 4.99 mol/l by Gibbs ensemble simulations.
The NVT results are shown in Figure 4a (left and bottom axes). The GCMC insertion/removal acceptance probabilities are listed in Table 3. We examined whether the NVT EOS differs when varying the number of molecules instead of the simulation box volume; this was found not to be the case. The results of the LJ μVT simulations are shown in Figure 4a (right and bottom axes). The NVT and μVT EOS are completely equivalent.
Table 3. Insertion/removal acceptance probabilities Pacc,GCMC at various densities for the LJ and SPC models
Note that at equilibrium Pacc(N → N + 1) = Pacc(N → N −1) = Pacc,GCMC.
A supercritical SPC water system
Apart from nonbonded LJ interactions between the water oxygen atoms, the SPC model involves Coulomb interactions between the partially charged hydrogen and oxygen atoms. The relative orientation of the hydrogen and oxygen atoms within a water molecule is assumed constant, that is, bond stretching and bond bending are constrained during simulation using the SETTLE algorithm.
In Figure 4b (left and bottom axes), we show the NVT results. The GCMC insertion/removal acceptance probabilities are listed in Table 3. We again verified that the NVT EOS obtained by varying the number of molecules agrees with that obtained by varying the simulation box volume. The results of the μVT simulations are shown in Figure 4b (right and bottom axes).
As we can see from Figure 4b, the μVT data at T = 773 K are in excellent agreement with the NVT results. The μVT data at T = 900 K also agree with the NVT data. Although still within the NVT error bars, the μVT p(ρ) profile at the highest particle densities slightly overestimates the NVT one. This could be explained by GCMC sampling difficulties at extreme simulation conditions. SPC molecule insertions are performed by generating a random position in the simulation box for the oxygen atom, followed by randomly orienting the hydrogens while meeting the bond-angle and bond-length constraints. More efficient sampling at higher densities could be achieved by applying the configurational bias MC (CBMC) technique. In CBMC, one selects the most favorable insertion configuration from a set of trial configurations and appropriately corrects for this bias. It should be kept in mind that for both temperatures, the conditions at the higher densities can be considered extreme, for example, pressures of over 5 × 10⁸ Pa.
Computational performance and accuracy
To get an impression of the computational performance of GromPy in GCMC mode, we again determined the EOS for the LJ system at T = 773 K. For the GCMC case, the simulation parameters are the same as above. For each data point in Figure 5, the number of particles in the NVT simulation was set equal to the average number of particles calculated from the μVT simulation series; in this way, a fair comparison can be made between the two simulation modes. Per μVT or NVT simulation, we used a total of 749,700 integration steps, of which the first 16.7% were used for equilibration.
Both EOSs were determined on a 32-bit Linux machine with the applications running on a single CPU. The total simulation time for the NVT EOS is 2400 s (8 s spent on system calls and 2392 s spent on “real” CPU time). The total simulation time for the μVT EOS is 2547 s (149 s spent on system calls and 2397 s spent on real CPU time). The ∼150 s difference between the two simulation modes comes from the limited amount of time spent on system calls and can be considered as “python overhead” as described in Extending GromPy and modifying the GROMACS source code section. Note that this also involves the evaluations of eqs. (1) and (2) in python.
The μVT and NVT EOSs are completely equivalent, compare Figure 5. The uncertainty bandwidth in the pressure profile based on the standard deviations in the μVT data is the area between the thin solid lines in Figure 5. The μVT EOS uncertainty is well within the error bars of the pressure sampled in the NVT ensemble.
To illustrate the file I/O overhead problem of a shell approach that does not use direct calls to the GROMACS library, we implemented such a shell that can also sample the grand-canonical ensemble but uses the GROMACS executables to perform the necessary MC moves. We simulated the LJ system at T = 900 K at a chemical potential yielding an average number of 〈N〉 ≈ 377 (〈ρ〉 ≈ 13 mol/l) using both GCMC approaches. For both simulations, the parameters are listed in Table 2 (and Table 1 for the parameters of the MD moves). The “shell” GCMC module requires 10,800 s for 2500 MC cycles, whereas GromPy in GCMC mode does the same 14 times faster (771 s). It thus turns out that the shell approach spends over 90% of the computation time on system calls and disk operations at this particular chemical potential. For this LJ system, we found that the computation time tCPU scales as tCPU ∝ N1.48. Assuming that the file I/O overhead remains constant, the shell approach would be two times slower for system sizes of 〈N〉 ≈ 2700 and only 1.01 times slower for systems of size 〈N〉 ≈ 11,000. We tested this assumption and found that for a system of size 〈N〉 ≈ 11,850, again at 〈ρ〉 ≈ 13 mol/l, the shell implementation is in fact 39 times slower than GromPy/GCMC. Thus, whereas tCPU for GromPy/GCMC scales with N via the above relation, the shell approach shows a more unfavorable scaling of tCPU with N. Clearly, grompp and mdrun contribute unfavorably to the scaling behavior of tCPU with system size and render the shell GCMC implementation infeasible for large systems, owing to the large number of grompp and mdrun executions (and associated file I/O) needed. GromPy/GCMC does not suffer from this effect, as the necessary data structures are kept in memory at run time.
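The constant-overhead extrapolation used above can be sketched as follows. The calibration numbers are the measurements quoted in the text, and the prediction is only as good as the constant-overhead assumption, which the 〈N〉 ≈ 11,850 test shows to break down in practice.

```python
# Constant-overhead model: shell time = t_cpu(N) + O, GromPy time = t_cpu(N),
# with t_cpu proportional to N**1.48 and overhead O assumed independent of N.

ALPHA = 1.48          # measured scaling exponent for t_cpu
N_REF = 377           # reference system size from the text
SLOWDOWN_REF = 14.0   # measured shell/GromPy slowdown factor at N_REF

def predicted_slowdown(N):
    """Shell-vs-GromPy slowdown factor if the file I/O overhead stayed constant."""
    overhead_ratio = (SLOWDOWN_REF - 1.0) * (N_REF / N) ** ALPHA
    return 1.0 + overhead_ratio
```

Because the fixed overhead is diluted by the growing tCPU, the predicted slowdown factor decays monotonically toward 1 with increasing N, which is exactly the trend the measured 39-fold slowdown contradicts.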
We have successfully implemented and extended the GromPy module (available at https://github.com/GromPy) and enabled simulations in the grand-canonical ensemble using the GROMACS C-library. To this end, only minor modifications to the GROMACS source code were needed, and these do not in any way affect the operation, efficiency, or performance of the GROMACS applications built from the GROMACS source. To the best of our knowledge, GromPy is the first reported interface to the GROMACS library and MD engine that uses direct library calls. It can be used to further extend the current GROMACS simulation and analysis modes.
We validated our grand-canonical scheme for two system types. For the simplest one that involves only LJ interactions, the μVT results are in complete agreement with those of NVT MD simulations performed with GROMACS. For a second, more complicated, system that also involves Coulombic interactions and insertions of multi-atomic molecules, the μVT results agree completely with the NVT results at T = 773 K, but seem to slightly overestimate the high density region of the NVT EOS at T = 900 K (although still within the NVT error). This deviation is explained by sampling difficulties at this high temperature and density. Sampling efficiency might be enhanced by implementing CBMC for multiatomic molecules.
The computational performance of GromPy in GCMC mode is comparable to the GROMACS mdrun executable. The accuracy of the μVT data is well within that of conventional MD in the NVT ensemble.
Our work is compatible with the 4.0.7 version of GROMACS, and only minor modifications are needed for version 4.5 and higher. For the near future, we plan to merge the necessary code changes into the main development tree, in consultation with the GROMACS developer community, which will make our GCMC module compatible with the latest GROMACS releases. In addition, our minor modifications to the serial implementation of the GROMACS source code should be made compatible with the parallel implementation. We expect that GCMC and hybrid MD/MC are of interest to the GROMACS user community. Our modifications to the source code are minor and do not stand in the way of “normal” use of the MD engine. Additionally, a python interface to GROMACS will contribute significantly to the flexibility of the package.
This work is dedicated to Wilfred van Gunsteren for his pioneering work in biomolecular simulations and being a personal inspiration to many in the field.
RP, JH and KAF would like to thank Mohammed El-Kebir and Nicola Bonzanni of the IBIVU group for their contributions to modifying the GROMACS source code and Sanne Abeln for scientific ideas and discussions.
This work was part of the BioRange programme (project number: 2.3.1) of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).