ChemShell—a modular software package for QM/MM simulations

ChemShell is a modular computational chemistry package with a particular focus on hybrid quantum mechanical/molecular mechanical (QM/MM) simulations. A core set of chemical data handling modules and scripted interfaces to a large number of quantum chemistry and molecular modeling packages underpin a flexible QM/MM scheme. ChemShell has been used in the study of small molecules, molecular crystals, biological macromolecules such as enzymes, framework materials including zeolites, ionic solids, and surfaces. We outline the range of QM/MM coupling schemes and supporting functions for system setup, geometry optimization, and transition‐state location (including those from the open‐source DL‐FIND optimization library). We discuss recently implemented features allowing a more efficient treatment of long range electrostatic interactions, X‐ray based quantum refinement of crystal structures, free energy methods, and excited‐state calculations. ChemShell has been ported to a range of parallel computers and we describe a number of options including parallel execution based on the message‐passing capabilities of the interfaced packages and task‐farming for applications in which a number of individual QM, MM, or QM/MM calculations can performed simultaneously. We exemplify each of the features by brief reference to published applications.


INTRODUCTION
C ombined quantum mechanical (QM) and molecular mechanical (MM) modeling is a widely used technique to perform atomistic simulations of large systems as encountered in materials modeling and biochemistry. In a QM/MM calculation, the QM description is restricted to the central region where information about local electronic properties is required, while the environment is modeled with a much cheaper classical molecular mechanics force field. In this way, the simulation becomes tractable without compromising significantly on the accuracy of the final result, whereas a QM simulation of the whole system may have an unnecessarily high or even prohibitive computational cost.
In order to perform a QM/MM calculation, the software package must be able to run both QM and MM calculations and be able to combine the results in a meaningful way, including an appropriate treatment of both through space (nonbonded) interactions and covalent bonds between the QM and MM regions. Historically, most QM and MM programs have been developed as separate packages. More recently, a number of QM packages have had MM routines added to them, and vice versa. An alternative approach, which we believe offers greater flexibility, is to combine the output of separate fully featured QM and MM packages using a software wrapper. The advantage of this modular approach is that the user may in principle mix and match QM and MM codes that implement the features of interest to them. The ChemShell computational chemistry environment is an implementation of this approach. 1 ChemShell performs a QM/MM calculation by decomposing the system to be modeled into QM and MM subsystems, interfacing to external programs to perform the QM and MM calculations and then combining the resulting energies and gradients to obtain a QM/MM energy and gradient. ChemShell can also be used to drive higher level chemical simulation tasks such as geometry optimization or molecular dynamics (MD), using QM, MM, or QM/MM Hamiltonians.
In this article, we describe the capabilities of version 3.5 of ChemShell, released in May 2012, except where features are noted as under development. Original references for methodological developments are not in general given, but may be found in the implementation and application works we cite. The selected applications presented here are chosen to illustrate the described features and are in no way exhaustive. A more detailed introduction to the different flavors of QM/MM and applications of the method (not only with ChemShell) can be found elsewhere. 2,3

THE CHEMSHELL FRAMEWORK
ChemShell is based on the Tcl scripting language and as such it can be run interactively or presented with a script file describing the sequence of calculations required. All the expected features of a scripting language (string handling, conditional statements, loops, etc.) are augmented with additional commands to perform molecular modeling operations. Tasks such as defining a molecular structure or performing a geometry optimization are initiated by simple commands, with options specified as keyword/value pairs. Data (such as coordinates) can be provided in-line as part of the script, or in separate files. Additional data structures can be used to store and manipulate matrices and data computed on special grids (e.g., spin densities). Although it is not necessary to know more than the basics of Tcl scripting to perform a calculation in ChemShell, the full power of the language is available to write highly sophisticated input scripts.
Behind the Tcl user interface are a collection of modules written in C and Fortran, performing such tasks as managing data structures, geometry optimization, or interfacing to external programs. In particular, any work that involves significant computational expense benefits from being implemented in C or Fortran. The modular approach allows developers and users to extend ChemShell without having to understand everything it does at a coding level.
When a calculation using an interfaced package is performed, ChemShell creates the relevant input files based on the control parameters presented in the user's script. These parameters (which define such choices as basis set, level of theory, force field, etc.) are defined in a consistent way between similar types of external packages, allowing the user to switch between alternative packages in a relatively straightforward manner. The list of QM and MM codes that can be interfaced to ChemShell is shown in Table 1. Each code has a corresponding interface module in ChemShell, mostly written in Tcl to allow for extensions to support additional features in the external codes as they are developed.

ESTABLISHED QM/MM METHODOLOGY
ChemShell can be (and is often) used to script standalone QM or MM calculations. However, its primary strength lies in combined QM/MM calculations. It is helpful to divide QM/MM schemes by the way the total energy is defined. We use the term 'additive' for schemes combining a QM calculation on the inner region, an MM calculation on the outer region, and the QM/MM coupling terms. Most calculations performed with ChemShell are of this type. Additive QM/MM calculations may require modifications to the force field terms applied (e.g., in the junction region) and this restricts the choice of MM packages. Alternatively, a QM calculation on the inner region can be combined with a MM calculation on the entire system with a third MM calculation on the inner region being performed to subtract out the terms that are being replaced by the QM Hamiltonian. Such 'subtractive' QM/MM calculations can be performed in ChemShell using any QM and MM codes, or indeed can be used to combine two different QM Hamiltonians in an analogous way.
Additive QM/MM calculations can be performed with three QM/MM coupling schemes. The most basic is termed 'mechanical embedding', where the QM calculation is not coupled electrostatically to the MM environment. The second level is 'electrostatic embedding', where classical partial charges from the MM description enter the QM Hamiltonian, in effect polarizing the QM region. The third level is 'polarized MM embedding', where the MM region is in turn polarized by the QM region. Polarized MM embedding may place additional restrictions on the choice of QM and MM codes (see Table 1). The level of embedding used depends on how accurately the electrostatic interaction between the QM and MM 1 Scheduled for release in ChemShell 3.6. 2 Not maintained by Daresbury Laboratory due to licensing restrictions. 3 Internal module with integrated support for CHARMM and AMBER force fields. 4 Possible but there is no automatic deletion of duplicate FF terms and so the user is responsible for providing a suitable topology.
region should be described, and whether the preferred MM description incorporates polarization. There are two approaches to handling the QM/MM boundary: covalent and ionic. The covalent approach is the obvious choice whenever covalent bonds connect the QM and MM regions, as is usually the case in studies of enzymes and inorganic networks with covalent character, such as zeolites. ChemShell implements a 'link atom' approach in which the broken bonds are terminated by adding hydrogen atoms to the QM region while force field terms in the MM calculation are selectively deleted to ensure that no interactions are double counted. Further, the charges on the MM atoms closest to the QM region are shifted to neighboring atoms to avoid overpolarization of the QM system. Extra dipoles are then added to compensate for the dipoles created by the charge shifting scheme. Finally, the forces on the link atoms are projected onto real atoms after each gradient evaluation. For further details see Ref 1. For some dense materials, such as metal oxides, an ionic approach (as described below) is more appropriate. Link atoms are not necessary as there are no covalent bonds to break. Instead atomic, largecore pseudopotentials are placed on the MM atoms immediately surrounding the QM region in order to localize electrons in the QM region, preventing them from spilling out into the MM region. 4 Finally, if there are no covalent or ionic bonds that cross the QM and MM regions, no specific boundary handling method is required, as is the case in many solvation studies and molecular crystals 5 (assuming that the QM/MM boundary does not divide any of the molecules).

Biomolecular Calculations
The recommended MM interface for biomolecular QM/MM calculations is the internal DL_POLY module, based on the DL_POLY 2 (now 'DL_POLY Classic') code developed by W. Smith. 6 Routines are provided to convert force field parameter and topology files from formats such as CHARMM and AMBER into an internal format suitable for calculations with DL_POLY. A common workflow is to set up and equilibrate the system initially in CHARMM or AMBER before importing the coordinates and force field into ChemShell to perform a QM/MM optimization.
Most biochemical simulations to date using ChemShell have used the electrostatic level of embedding, where the QM region is polarized by partial charges as defined in the standard biomolecular force fields (illustrated in Figure 1). This approach is exemplified by an extensive series of studies of the cytochrome P450 superfamily. Although these enzymes share the same active species, they differ in the immediate protein environment. The QM/MM methodology is an ideal approach to explore the differences as it allows these environmental variations to be compared at little extra cost. After over 10 years of intensive research, QM/MM results have provided descriptions of the various species present in the catalytic cycle and an understanding of their molecular and chemical properties. Numerous mechanisms by which the native enzymes, and mutants thereof, catalyze the oxygenation reactions have been used to rationalize experimental findings and predictive models of mechanisms and reactivity have been established. One of the key intermediates of the catalytic cycle (Cpd I) was predicted in QM/MM studies a long time before its experimental discovery and the simulations and predictions about the selenocysteine mutant also preceded experimental preparation. A thorough review of the simulated results from the first QM/MM study in 2002 up to 2010 is available. 7 However, investigations on similar systems do and will continue. [8][9][10] In the area of biomolecular modeling, ChemShell has been applied to a broad variety of enzymes to investigate their reaction mechanisms, for example, studies on the C-F bond-forming step in fluorinase, catalyzing the reaction between Sadenosyl-L-methionine and fluoride ions to form 5 -fluoro-5 -deoxyadenosine and L-methionine, 11 and the metabolic oxidation of xanthine to uric acid performed by xanthine oxidase. 12 In addition to obtaining structural features of intermediates, QM/MM calculations can be used to predict spectroscopic properties, like experimentally easily accessible 1 H or 13 C NMR shifts, 13 NMR shifts of transition metal centers like vanadium, 14,15 UV-vis, Raman, and IR data 16 or EPR tensors. 17

Solid-State Systems
ChemShell implements a finite cluster embedding model for QM/MM solid state calculations of bulk defects and surface reactions, as illustrated in Figure 2. In this approach, a spherical or hemispherical cluster is cut from a periodic MM model of the material. Point charges are then placed in a shell around the cluster and their charges set to reproduce as closely as possible the missing classical periodic electrostatic environment. A QM region is then defined in the center of the cluster and additional reactants added as appropriate. There are several advantages to using a finite representation of the solid in QM/MM calculations: spurious periodic interactions between defects are avoided; molecular QM codes, which lack support for periodic boundary conditions can still be used; and charged species are straightforward to model.
As of version 3.4, ChemShell contains integrated setup routines for extended covalent, ionic, and molecular crystal systems. These have a common interface and differ mainly in how the outer boundary  of the MM cluster is handled: for example, the molecular crystal routines ensure that only whole molecules are included in the cluster. The type of system also determines how the QM/MM boundary should be handled. For molecular crystals and semi-ionic solids such as zeolites, 18,19 the link atom approach may be used. The DL_POLY interface is recommended for these calculations. Solid-state classical force field parameters may be defined explicitly using this interface. For calculations on zeolites, the MM charges may be adjusted automatically to account for the usually nonzero charge of the QM region (in addition to the normal charge shift procedure). In order to model molecular crystals accurately, it is necessary to obtain appropriate point charges for the MM region. A protocol has been developed to obtain point charge values based on DFT calculations updated self-consistently in the field of point charges, which is available as a ChemShell script. 20 The ionic approach to embedding, 21 used in materials such as metal oxides, may be employed using either electrostatic embedding (using a rigid-ion force field) or polarized MM embedding using a shell model potential. Because of the strong electrostatic interactions in an ionic material it is often necessary to use a shell model potential to obtain an accurate QM/MM result. The shell model splits the charges on classical atoms into core and shell components, connected by a spring force. The shell component represents the valence electrons that can be polarized by the QM region. In a QM/MM calculation, the shell positions are relaxed self-consistently by multiple alternating QM and MM calculations at every step. The GULP MM package 22 is used to perform shell model calculations in ChemShell, and QM/MM shell model force fields may be defined from a standard GULP format force field using an automated script.
Applications of the ionic embedded cluster model include the study of point defects in zinc oxide and catalysis on zinc oxide surfaces (reviewed elsewhere 23 ) and the formation of water on interstellar olivine grains. 24

Geometry Optimization with DL-FIND
Characterizing chemical reaction paths using geometry optimization algorithms is one of the most fundamental tasks in computational chemistry. A number of optimization modules are available in ChemShell, but the recommended optimizer is DL-FIND. 25 DL-FIND offers a range of algorithms to explore potential energy surfaces and can be applied to the location of energy minima, transition states (including the dimer method), reaction path optimization (nudged elastic band), 12, 26 conical intersection search, 27,28 and global optimization through stochastic search and a genetic algorithm. Several coordinate systems are available internally in DL-FIND, including the efficient hybrid delocalized coordinate system originally developed for the HDLCOpt optimizer. 29 Large-scale systems optimizations may be performed using the low memory BFGS algorithm. A microiterative minimization scheme for efficient QM/MM optimization has also been developed in ChemShell's HDLCOpt optimizer, 30 and the method is currently being reimplemented in DL-FIND with extensions to transition state and reaction path optimization. In microiterative optimization, the MM environment is relaxed after every QM step in the optimization, thereby reducing the number of expensive QM evaluations at the cost of a larger number of cheaper MM evaluations, giving a lowered overall computational expense.
DL-FIND has also recently been extended to perform reaction rate calculations. Reactant and transition state Hessian calculations may be used to calculate rates by harmonic transition-state theory, or instanton theory may be used to optimize the most likely quantum tunneling path at a given temperature. 31 DL-FIND is distributed with ChemShell but is also available as a standalone open source library with a well-defined interface for use with other codes. It has been interfaced to QM codes such as GAMESS-UK, CRYSTAL, and CASTEP as well as MM routines from DL_POLY 32 and Tinker. 33

Treatment of Long-Range Electrostatic Interactions
Long-range electrostatic interactions within a macromolecule/solvent system may be modeled efficiently using one of two boundary potential methods implemented in ChemShell: the generalized solvent boundary potential (GSBP) 34 and the solvated macromolecule boundary potential (SMBP). 35 In these approaches, an inner atomistic region is defined surrounded by an outer region which is represented implicitly by the boundary potential. The GSBP requires modifications to the QM Hamiltonian and is only implemented in the MNDO semiempirical code, and is targeted at MD simulations where the high initial setup overhead is outweighed by the large number of subsequent calculations. By contrast the SMBP represents the boundary potential due to the outer parts of the investigated macromolecular system and bulk solvent solely by point charges calculated on-the-fly, and is therefore suited to single point calculations and geometry optimizations, providing an alternative to simply 'chopping' down the system by means of a distance criterion. The SMBP can be used with any QM code that supports electrostatic embedding.
The effect of the SMBP method has been analyzed 36 by measuring the influence of its usage on the energetics of previously obtained reaction mechanisms, namely an intramolecular Claisen rearrangement in chorismate mutase 37 and a hydroxylation reaction in p-hydroxybenzoate hydroxylase. 38 Results show that there is a need for SMBP when the reaction under consideration is accompanied by a significant rearrangement of the charge distribution.
Recently, a polarized MM embedding scheme has been implemented in ChemShell based on the charge-on-spring/Drude oscillator approach, which modifies the MM force field to include terms that are polarized by the electrostatic potential of the QM region. This is conceptually very similar to the shell model described above for solid-state studies. It has further been combined with SMBP and GSBP leading to a fully polarizable three-layer method. 39

X-Ray Based Quantum Refinement of Crystal Structures
The QM/MM driver in ChemShell has been extended to include crystallographic X-ray data through an interface to the CNS (Crystallography & NMR System) package. 40 Standard crystal structure refinement, especially for protein structures, is often carried out based on classical force fields. As the main part of proteins consist of amino acids for which thoroughly tested parameters are available, this is of little concern. The force field parameters for cofactors, substrates, inhibitors, metals centers, and so on are however much less reliable. Rather than optimizing the crystal structure based on a MM energy expression, the refinement can be performed in ChemShell using QM/MM information. The method has been applied to investigate the chromophore structure of the red fluorescent protein DsRed.M1 40 focusing on the cis and trans conformers of the acylimine chromophore, providing additional insight into the hydrogen bonding network within the protein.

MD and Free Energy Calculations
The MD module in ChemShell is based on routines from DL_POLY 2. Standard MD functionality such as NVE, NVT, and NPT ensembles, constraints using the SHAKE procedure, and rigid groups are available. An MD calculation may be performed with any code interfaced to ChemShell so that QM, MM, and QM/MM MD simulations are all possible. The QM/MM MD method has been outlined for the case of the hydroxylation reaction in the enzyme p-hydroxybenzoate hydroxylase 38 using a semiempirical QM method, in which free energies of activation were obtained by thermodynamic integration.
Free energy calculations may also be performed using the QM/MM free energy perturbation (FEP) method. 41 When applied to QM/MM the QM region is normally kept frozen and approximated by point charges representing the electrostatic potential of the QM electron density distribution during the MD runs to reduce computational cost, and so only the entropic contribution from the MM environment is measured. The entropic effect of the QM part is estimated at energy minima and transition states using harmonic vibrational frequency calculations. The QM/MM-FEP method is now available as a module in ChemShell, which automates much of the lengthy setup process.
FEP calculations are normally based on top of mechanistic QM/MM investigations with the aim of obtaining a more realistic energy profile. 41 As such, they have been combined with additional high level ab initio calculations, for example, local coupled cluster, for purely organic systems such as the hydroxylation reaction in p-hydroxybenzoate hydroxylase 38,42 as well as transition metal containing enzymes like aldehyde oxidoreductase. 43,44

Excited-State Calculations
Given the methodology for ground states described above, obtaining QM/MM energies of excited states using methods like TD-DFT and CASSCF/CASPT2 is straightforward, providing these methods are available in an interfaced QM code. With the availability of gradients for excited states, the exploration of an exited-state energy surface becomes possible. Therefore, vertical and adiabatic excitation energies as well as excited state minima can be obtained, as has been demonstrated by several groups. Publications working on flavin-based blue light photoreceptors including the Blue Light Using Flavin-Adenin-Dinucleotide 45,46 and the Light-Oxygen-Voltage domain 47 provide an insight into what has been done in this area.
Although studies of a single excited electronic state can be carried out using conventional methods, reactions that involve changes of electronic state require more sophisticated techniques. The conical intersection optimization methods in DL-FIND, for example, optimize the position where two electronic states intersect, which acts as a funnel for ultrafast relaxation to the lower state. ChemShell's data structures, interfaces and QM/MM routines have been extended in order to support QM/MM multiple state calculations, 25 including the calculation of interstate coupling gradients, and used to investigate the behavior of retinal models in solution. QM/MM conical intersection optimizations have subsequently been used to investigate the energy surfaces of DNA bases in solution 48 and the DNA environment. 49 The list of external codes that support coupled multiple state calculations through ChemShell is given in Table 1. An uncoupled interface is under development for states of different multiplicity (where the interstate coupling gradient is formally zero). In this case, any QM code that supports excited-state calculations may be used.
The multiple state interface is also being used to implement a surface hopping MD module for ChemShell based on the surface hopping routines in the MNDO semiempirical package. 50 A development version of this module has been used to simulate the dynamical behavior of DNA bases in aqueous solution and the DNA environment. 48,49,51

HIGH-PERFORMANCE COMPUTING
There are two levels of parallelism possible in a ChemShell calculation: firstly, at the level of the energy or gradient evaluation, in which parallelism in external codes can be exploited; and secondly, at the higher level of tasks such as geometry optimization, which are handled by ChemShell.
ChemShell is parallelized using the message passing interface (MPI) formalism. It is not necessary to build ChemShell in parallel in order to exploit the parallelism of separate external programs, as they can simply be launched in parallel. However, running two separate binaries in an HPC environment can be inefficient and it is often not even possible to launch parallel executables in this way. ChemShell therefore offers the option to directly link some external codes as libraries so they behave as one parallel executable. This requires some modification to the external codes and so not all interfaces support this mode of operation: those that do are listed in Table 1.
The second level of parallelism has been implemented in ChemShell using a task-farming approach. 52 In task-farming parallelism, the overall set of MPI processors is split into smaller workgroups, each of which is assigned tasks such as energy evaluations. In a finite difference Hessian calculation, for example, each finite difference evaluation is independent and so may be calculated simultaneously using task-farming. Task-farming requires further modifications to external codes in order to share and make use of the MPI communicator handle that ChemShell uses to identify workgroups. The interfaces that support task-farming via a directly linked interface are listed in Table 1. It also possible in principle to perform task-farming calculations with other codes by launching separate binaries, but again, this is platform dependent.
Task-farming may also be used with the DL-FIND optimizer for algorithms where many independent energy evaluations are required, for example, the nudged elastic band method and population-based optimization using a genetic algorithm. The task-farm parallel nudged elastic band method has been used to investigate the mechanisms of hydrogen adsorption 52 and carbon dioxide activation 53 on zinc oxide surfaces, precursor steps in the industrial synthesis of methanol.

CONCLUSION
The basic methodology of QM/MM is now well established and has proven to be an efficient way of performing large-scale simulations in biology, chemistry, and materials science. In this article, we have detailed the flexible, modular environment for QM/MM calculations provided by ChemShell, allowing users to mix and match QM and MM packages through a consistent interface. Applications using the established methodology have demonstrated the wide va-riety of systems that QM/MM can be applied to, whereas recent developments in ChemShell highlight the continuing refinement of the QM/MM model to treat the environment more accurately, calculate free energies and study excited state reactions, and the use of QM/MM data to improve the crystallographic refinement process. Using the integrated DL-FIND optimizer, ChemShell has access to a range of powerful algorithms for exploring potential energy surfaces, and both codes have been modified to take advantage of massively parallel computing technology.
For further information about ChemShell and instructions for obtaining the code, visit the website www.chemshell.org.