## Introduction

Dynamical simulations using molecular-mechanics force fields are among the most valuable tools for studying biomolecular systems. The electrostatic model used in such simulations often determines both their computational cost and scientific reliability. The long-range nature of electrostatics makes simple, accurate implementations very costly, and this has led to the development of many different approximation schemes, variously based on simple distance-based cutoffs, the multipole approximation, or the Ewald decomposition.1–3

### PME in GROMACS

This work discusses improvements made to the implementation of the “smooth particle-mesh Ewald” method3 (SPME, or hereinafter just PME) in GROMACS 4.4,5 This algorithm and its implementation are more fully discussed in related work.6 The part of the method that concerns this work is the evaluation of energies and forces for the real-space part of the PME approximation to the infinite periodic Coulomb sum. For a given list of pairs (*i,j*) of particles with inter-atomic distance *r*_{ij} less than some given value *r*_{c}, and partial charges *q*_{i} and *q*_{j} respectively, the real-space part of the electrostatic energy *E*_{r} is evaluated with the sum

$$E_r = \sum_i \sum_{j>i} q_i q_j \, \frac{\operatorname{erfc}(\beta r_{ij})}{r_{ij}} \tag{1}$$

for some value of a parameter β that controls the relative importance of the real- and reciprocal-space parts of the PME approximation. Interparticle forces are evaluated from the derivatives of this function with respect to *r*_{ij}.

### Attributes of BlueGene

The 32-bit PowerPC processors used in the IBM BlueGene series of supercomputers have a number of key features which must be considered in optimizing the code for PME in GROMACS. These attributes include7–9

- that the processor has multiple independent units for different kinds of operations and can issue from two different instruction families simultaneously (i.e., a “dual-issue multiple-unit” processor),
- that rounding a floating-point number to an integer requires instructions from multiple different instruction families (i.e., load/store, integer operation and floating-point operation), and
- that there is a dual floating-point unit (FPU), each unit having a fully independent set of 32 registers.

The discussion of the optimization will refer to these attributes.

The processor possesses instructions that execute solely on the primary FPU, and others that execute a similar or identical instruction on both FPUs. In ideal cases, this allows the dual FPU to perform twice the work of a single FPU, achieving a kind of “single-instruction multiple-data” (SIMD) parallelism. Moreover, the dual-issue multiple-unit nature of the processor allows the simultaneous dispatch of both a SIMD FPU operation and a memory operation that fills a new register on each FPU for a future operation. If the relevant memory locations can be made available in the first of the three cache levels, the processor can achieve its peak floating-point performance.

A twofold unroll of the loop corresponding to the inner summation of eq. (1) would be the most straightforward way to take advantage of the SIMD capability; this was implemented in GROMACS 4 by Mathias Puetz of IBM. In this approach, each FPU handles one entire interaction independently of the other. That is, the primary FPU handles one value of *j* in eq. (1) and the secondary FPU handles *j* + 1. Under suitable conditions, IBM's XLC compiler is able to generate dual-FPU machine code from normal C or Fortran. Two critical conditions are that memory arrays are suitably aligned and that data accesses have unit stride through the data. Satisfying this second constraint during MD simulations would require significant overhead. Alternatively, through IBM SIMD extensions to the C syntax,7 the compiler can be given explicit directions to generate dual-FPU code, which can be overruled during optimization where necessary. Puetz's implementation made heavy use of the syntax extensions and depended little on the SIMD-optimization capabilities of the compiler because of the constraint mentioned earlier.