Several improvements to the previously optimized GROMACS BlueGene inner loops that evaluate nonbonded interactions in molecular dynamics simulations are presented. The new improvements yielded an 11% decrease in running time for both PME and other kinds of GROMACS simulations that use nonbonded table look-ups. Some other GROMACS simulations will show a small gain. © 2011 Wiley Periodicals, Inc. J Comput Chem, 2011
Among the valuable tools for studying biomolecular systems are dynamical simulations using molecular-mechanics force fields. The electrostatic model used in such simulations often determines both their computational cost and scientific reliability. The long-range nature of electrostatics makes simple, accurate implementations very costly, and this has led to the development of many different approximation schemes. These are sometimes based on simple distance-based cutoffs, or the multipole approximation, or the Ewald decomposition.1–3
PME in GROMACS
This work discusses improvements made to the implementation in GROMACS 4 of the “smooth particle-mesh Ewald” method3 (SPME, or hereinafter just PME).4,5 This algorithm and its implementation are more fully discussed in related work.6 The part of the method that concerns this work is the evaluation of energies and forces for the real-space part of the PME approximation to the infinite periodic Coulomb sum. For a given list of pairs (i,j) of particles with inter-atomic distance rij less than some given value rc, and partial charges respectively qi and qj, the real-space part of the electrostatic energy Er is evaluated with the sum
$$E_r = \sum_{i} \sum_{j} q_i q_j \, \frac{\operatorname{erfc}(\beta r_{ij})}{r_{ij}} \qquad (1)$$
for some value of a parameter β that controls the relative importance of the real- and reciprocal-space parts of the PME approximation. Interparticle forces are evaluated from the derivatives of this function with respect to rij.
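As a worked step, the derivative mentioned above can be written out for a single pair. This assumes the pair term has the standard Ewald real-space form $q_i q_j \operatorname{erfc}(\beta r)/r$ with $r = r_{ij}$, and it omits any unit-conversion prefactor; the symbol $F_{ij}$ for the force magnitude is introduced here only for illustration.

```latex
\frac{d}{dr}\left[\frac{\operatorname{erfc}(\beta r)}{r}\right]
  = -\frac{\operatorname{erfc}(\beta r)}{r^{2}}
    - \frac{2\beta}{\sqrt{\pi}}\,\frac{e^{-\beta^{2} r^{2}}}{r},
\qquad
F_{ij} = -\frac{\partial E_r}{\partial r_{ij}}
  = q_i q_j \left[\frac{\operatorname{erfc}(\beta r_{ij})}{r_{ij}^{2}}
    + \frac{2\beta}{\sqrt{\pi}}\,\frac{e^{-\beta^{2} r_{ij}^{2}}}{r_{ij}}\right]
```

Both the energy and the force therefore need erfc() and an exponential for every pair.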
Attributes of BlueGene
The 32-bit PowerPC processors used in the IBM BlueGene series of supercomputers have a number of key features which must be considered in optimizing the code for PME in GROMACS. These attributes include7–9
- that the processor has multiple independent units for different kinds of operations and can issue from two different instruction families simultaneously (i.e., a “dual-issue multiple-unit” processor),
- that rounding a floating-point number to an integer requires instructions from multiple different instruction families (i.e., load/store, integer operation and floating-point operation), and
- that there is a dual floating-point unit (FPU), each unit having a fully independent set of 32 registers.
The discussion of the optimization will refer to these attributes.
The processor possesses instructions that execute solely on the primary FPU, and others that execute a similar or identical instruction on both FPUs. In ideal cases, this allows the dual FPU to perform twice the work of a single FPU, achieving a kind of “single-instruction multiple-data” (SIMD) parallelism. Moreover, the dual-issue multiple-unit nature of the processor allows the simultaneous dispatch of both a SIMD FPU operation, and a memory operation that fills a new register on each FPU for a future operation. If the relevant memory locations can be made available in the first level of three caches, the processor can achieve the peak floating-point performance.
A twofold unroll of the loop corresponding to the inner summation of eq. (1) would be the most straightforward way to take advantage of the SIMD capability; this was implemented in GROMACS 4 by Mathias Puetz of IBM. In this approach, each FPU handles one entire interaction independently of the other. That is, the primary FPU handles one value of j in eq. (1) and the secondary FPU handles j + 1. Under suitable conditions, IBM's XLC compiler is able to generate dual-FPU machine code from normal C or FORTRAN. Two critical conditions are that memory arrays are suitably aligned and that data accesses have unit stride through the data. Satisfying this second constraint during MD simulations would require significant overhead. Alternatively, through IBM SIMD extensions to the C syntax,7 the compiler can be given explicit directions to generate dual-FPU code, which can be overruled during optimization where necessary. Puetz's implementation made heavy use of the syntax extensions, and depended little on the SIMD-optimization capabilities of the compiler because of the constraint mentioned earlier.
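The shape of a twofold unroll can be sketched in portable C. This is only an illustration, not the GROMACS kernel: the function names are hypothetical, the pair term is a bare q/r placeholder, and the two accumulators merely stand in for the primary and secondary FPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair term: a bare Coulomb energy qq / r.
 * The real inner loops do far more work per pair. */
static double pair_energy(double qq, double r)
{
    return qq / r;
}

/* Sum the pair term over n neighbours, two per iteration. */
double inner_loop_unrolled(const double *qq, const double *r, size_t n)
{
    double e0 = 0.0; /* accumulator standing in for the primary FPU   */
    double e1 = 0.0; /* accumulator standing in for the secondary FPU */
    size_t j;

    assert(n == 0 || (qq != NULL && r != NULL));

    /* Unrolled body: neighbour j on one lane, j + 1 on the other. */
    for (j = 0; j + 1 < n; j += 2) {
        e0 += pair_energy(qq[j], r[j]);
        e1 += pair_energy(qq[j + 1], r[j + 1]);
    }
    /* Epilogue: an odd n leaves one interaction for a single lane. */
    if (j < n) {
        e0 += pair_energy(qq[j], r[j]);
    }
    return e0 + e1;
}
```

The epilogue is the single-FPU code path that the unrolled loop needs whenever the neighbour count is odd.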
Improvements to GROMACS Code
Minor Improvements to the GROMACS Code for BlueGene
Upon inspecting the code, several opportunities for minor improvements were observed:
- In general, atoms have contributions to energies and forces from both van der Waals and Coulomb interactions calculated in the inner loops. One quantity is calculated first, stored in a register and the other added to it. For some versions of these loops, such as solvent-optimized loops, only one or other quantity is calculated for a given pair of atoms. In such cases, the previous implementation sometimes assigned zero to a register, and then later added the other contribution. The compiler did not optimize away the unnecessary operation. Accordingly, the code was changed to either add or assign efficiently in a case-dependent way.
- The FPU is only capable of double-precision floating-point operations.7 Single-precision data is automatically extended to double when loaded into a register, and rounded back with a separate operation when required. In the previous implementation, the unrolled loops made use of the double-precision dual-FPU instructions, and rounded back to single-precision only at the end of each inner loop. However, the epilogue code for the unrolled loop required the compiler to round to single precision at several points, which was inconsistent and slightly wasteful. This was rectified.
- At the lowest compiler optimization level, the code generated for the previous implementation always de-referenced several pointer arguments, even though they were not used in some versions of the code. This led to segmentation faults, and so preprocessor directives were used to allow these de-references only when the pointer should be valid.
- Because many of the important arrays are correctly aligned in memory to take advantage of dual-FPU instructions, compiler assertions about these facts were used to help the compiler generate code for the dual FPU.
The improved code passed the GROMACS regression test suite, and its speed was compared to the original version on 64 processors over 50,000 steps of a 51,000-atom peptide-in-water PME simulation. The PME parameters were chosen so that the evaluation of the inner loops was rate-determining (see Table I). The improved code (version A) was over 2% faster, which can be attributed mostly to the first improvement listed (data not shown), due to the reduced competition for registers during compiler optimization. Some of these advantages will also accrue to BlueGene GROMACS simulations that use algorithms other than PME.
Table I. Total Execution Times for Different Implementations of GROMACS Inner Loops Using Compiler Options -O3 -qhot
Table Look-Ups for PME in GROMACS
The explicit evaluation of erfc() functions for evaluating eq. (1) is quite slow, so GROMACS tabulates values and estimates the function through cubic spline interpolation.10 In order to evaluate the energy and force magnitude for a single iterate of the inner loop given
- r: the distance between two atoms (i.e., rij),
- qq: the product of their charges,
- tablescale: the number of table points per unit distance, and
- tablewidth: the distance in memory between adjacent points in the table array,
the C code of Listing 1 could be used. The goal is to parallelize that code so that the execution time of two iterates, one on each FPU with arbitrary r and qq, is comparable with one iterate solely on the primary FPU.
A critical feature of this code is the calculation of eps, which contains the fractional part by which rt exceeds the next lowest integer n0. The code looks up values and derivatives suitable for n0 from the tables, and then applies corrections based on the size of eps. The simplest way to obtain the fractional part of a number is to use the C language requirement that a floating-point-to-integer conversion must round towards zero. When the result of such an operation in line 2 is subtracted from rt in line 3, the fractional part is calculated.
Listing 1. C code for GROMACS table lookups
    1   rt     = r * tablescale;
    2   n0     = (int) rt;
    3   eps    = rt - (double) n0;
    ...
    9   GHeps  = G + H * eps;
    10  R      = F + eps * GHeps;
    11  VV     = Y + eps * R;
    12  GHeps2 = GHeps + GHeps;
    13  S      = GHeps2 + eps * H;
    14  FF     = F + eps * S;
    15  force  = qq * FF;
    16  energy = qq * VV;
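A self-contained version of this look-up can be sketched as follows. The listing as shown has no lines 4 to 8, so the offset computation and the four loads below are an assumed reconstruction: the array name `table`, the pointer `p`, and the function name and signature are hypothetical, while the (Y,F,G,H) order within each table point follows the text.

```c
#include <assert.h>

/* One table look-up in the style of Listing 1. */
void table_lookup(const double *table, int tablewidth, double tablescale,
                  double r, double qq, double *energy, double *force)
{
    double rt, eps, Y, F, G, H, GHeps, R, VV, GHeps2, S, FF;
    const double *p;
    int n0;

    assert(r >= 0.0);

    rt  = r * tablescale;      /* position in units of table points        */
    n0  = (int) rt;            /* C conversion truncates towards zero      */
    eps = rt - (double) n0;    /* fractional part, in [0, 1)               */

    /* Assumed form of the missing lines 4-8: locate the point and load
     * its four spline coefficients. */
    p = table + n0 * tablewidth;
    Y = p[0];
    F = p[1];
    G = p[2];
    H = p[3];

    GHeps  = G + H * eps;
    R      = F + eps * GHeps;
    VV     = Y + eps * R;      /* Y + F*eps + G*eps^2 + H*eps^3            */
    GHeps2 = GHeps + GHeps;
    S      = GHeps2 + eps * H;
    FF     = F + eps * S;      /* F + 2*G*eps + 3*H*eps^2                  */

    *force  = qq * FF;
    *energy = qq * VV;
}
```

A table holding the exact cubic x^3 at unit spacing has (Y,F,G,H) = (n^3, 3n^2, 3n, 1) at point n, which gives a convenient check that VV reproduces the function and FF its derivative with respect to eps.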
Table Look-Ups and BlueGene
It is clear upon inspecting the compiler listing for the above code that a major bottleneck is the conversion process from double to int to double in lines 2 and 3 (data not shown). n0 is used in calculating the floating-point value eps, and in computing the (integer) effective memory address for the table loads. Integer, floating-point, and memory operations occur on different units of the BlueGene processor.7 Accordingly, the value of n0 is required to be computed in both integer and floating-point form on the different units. The dual FPU can round a number and produce the result in integer form in matching floating-point registers in a single cycle. Thereafter, the only way on this processor to transfer those values to the integer unit for use there is to write them to memory and re-load them on the integer unit (effectively, a “load-hit-store”). Once the values are in integer registers, obtaining the dual floating-point form of the two integer values requires some integer bit-masking, a further store to memory, loading to a floating-point register (i.e., a second load-hit-store), further manipulation on the primary FPU, and then transfer of one value to the secondary FPU. (64-bit versions of this processor have an instruction that will convert a value in integer form in a floating-point register to floating-point form;11 however, the present processor lacks one.) The requirement for a double load-hit-store (and more!) before values in both forms are available limits the speed of the code significantly, although the compiler uses the processor's multiple-issue capability to alleviate some of this limitation. Once the effective addresses are computed on the integer unit, single loads of the table values can begin, and these run in parallel with the manipulations required to convert n0 back to floating-point and then calculate eps.
Faster Calculation of eps
An approximate pure-floating-point form for this conversion process is well known,12 and this can be implemented with the dual FPU. To illustrate, if some hypothetical computer could represent only four significant decimal digits of a number, then rounding of the number 21.64 to the nearest integer could be achieved by adding 1000 to give 1021.64 (which the processor rounds to 1022), and then subtracting 1000 to give 22. Now the nonintegral part of −0.36 can be computed by subtraction. The same kind of process can be used with signed binary floating-point numbers with d significand bits, where the “magic number” to use is $2^{d} + 2^{d-1}$. Addition and subtraction of the magic number will round the result to an integer according to the prevailing rounding mode, with the result still in floating-point form. This requires that the value being rounded is less than $2^{d-1}$, which is known to be true for rt in GROMACS.
The C language requires that conversion of floating-point numbers to integers must round towards zero. The code in Listing 1 exploits this requirement, so that eps in line 3 is non-negative. Unfortunately, the above conversion trick does not readily compute an integer rounded towards zero. In the decimal example above, addition and subtraction of 999.5 would create the effect of round-towards-zero. This does not generalize to binary numbers, as correct representation of $2^{d} + 2^{d-1} - 2^{-1}$ requires an impossible d + 1 significand bits. At least five approaches exist for managing this situation, including
- (A) continuing to convert float-int-float in the previous manner,
- (B) switching the processor rounding mode to “round towards zero” before using the magic number, and then switching it back,
- (C) switching the processor rounding mode for the whole evaluation of eq. (1),
- (D) performing the rounding operation of line 2 in the prevailing rounding mode, and changing the generation and use of tables accordingly, and
- (E) subtracting 0.5 from rt before adding the magic number.
Solution A is expected to be slow. Solution B is very slow on many processors, as it can require two complete flushes of the floating-point pipeline. Solution C is not as expensive as B, but it does introduce rounding artifacts. However, those artifacts are very unlikely to lead to significant errors. Solution D is accurate, elegant and maximally fast, and would be best in the abstract, but requires significant recoding effort in other parts of GROMACS to produce tables suited to the new conditions. Solution E is always accurate, and costs only a single floating-point operation and a matched pair of registers. As the constant 0.5 is used in the Newton-Raphson iterations that compute r immediately before this code fragment, the register space required for E might come at little cost. Similarly, if the performance of the code for the inescapable first load-hit-store from line 2 is still limiting, the extra floating-point operation for implementing line 3 might be without cost.
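The magic-number rounding and solution E can be sketched in portable C. The sketch assumes IEEE-754 doubles (d = 52 stored fraction bits, so the magic number is 2^52 + 2^51) and the default round-to-nearest mode; the function and macro names are hypothetical, and the `volatile` qualifier is a portable stand-in for compiler options such as -qstrict that keep the add and subtract from being optimized away.

```c
#include <assert.h>

#define MAGIC 6755399441055744.0   /* 2^52 + 2^51 */
#define LIMIT 2251799813685248.0   /* 2^51: upper bound on the input */

/* Round x to an integer in the prevailing rounding mode, staying in
 * floating-point form throughout. */
double round_with_magic(double x)
{
    volatile double t = x + MAGIC;  /* the sum has no room for fraction bits */
    return t - MAGIC;
}

/* Solution E: shift rt down by 0.5 before rounding, so that the result
 * behaves like truncation for non-negative rt.  Returns n0 in
 * floating-point form and stores the fractional part in *eps. */
double split_rt(double rt, double *eps)
{
    double n0;

    assert(rt >= 0.0 && rt < LIMIT);
    n0 = round_with_magic(rt - 0.5);
    *eps = rt - n0;
    return n0;
}
```

For example, `round_with_magic(21.64)` gives 22, matching the decimal illustration, while `split_rt(21.75, &eps)` gives 21 with eps equal to 0.75. Under round-to-nearest-even, an exactly integral rt can land on a tie and return n0 one lower with eps equal to 1; whether that matters depends on the table and is not examined here.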
Solutions A and E were implemented in GROMACS 4.0.7 in combination with the other improvements described previously, and tested on BlueGene/L. It should be necessary to direct the compiler with -qstrict to avoid optimizations that would remove the operations involving the magic number. However, it was observed that -qstrict was only necessary for the single-FPU epilogue code for the unrolled loop, not for the dual-FPU unrolled loop. Thus it appears that the optimizer cannot achieve the same results for otherwise-equivalent single- and dual-FPU code. Further, in that epilogue code section the rounding trick was slightly slower than the straightforward conversion, so the latter was retained. The resulting code also passed the GROMACS regression tests. Compared with the original implementation, a total speed improvement of over 10% was observed for solution E, shown in Table I. Part of the rounding operation now takes three fully-SIMD floating-point instructions compared with a larger number of non-SIMD instructions from three families; this must explain some of the benefit. Some further benefit accrues from reduced instruction latency from eliminating the second load-hit-store. This approach is also beneficial on the BlueGene/P architecture.
BlueGene Dual FPU Operations
As with many modern processors, the BlueGene dual FPU possesses fused multiply-add (FMA) instructions that perform both a multiplication and an addition in the same time as a normal floating-point operation,7 i.e., $d = a \times b + c$.
Such an instruction is faster than separate multiplication and addition instructions would be, and avoids loss of precision from rounding of the intermediate stage. Listing 1 is structured to make it clear that after the table look-up, five FMA can be used to do the bulk of the computation. Puetz's implementation does this. A few other operations are also required. All of the required arithmetic operations are available in instructions suitable for executing on the dual FPU. However, the process in lines 5–8 of looking up the table values from memory is interesting. In Puetz's implementation, four separate load-from-memory instructions are issued for each FPU. However, dual load-from-memory instructions exist that will load two suitably aligned adjacent memory locations into a matching pair of registers on each FPU in the same time as a single load. To benefit from the dual load, all such data needs to reside in the 32 KB first-level cache,7 which will be true in the present case.
Before the rounding optimization described above, lines 4–8 effectively ran in parallel with line 3 already, so little gain could arise from using dual loads to parallelize lines 5–8. However, with the conversion bottleneck alleviated, it becomes worth considering the merits of using dual loads. For aqueous MD simulations in GROMACS with solvent-optimized inner loops, the speed of this code segment is limited by the number of available floating-point instruction-issue opportunities. Two forms of dual load exist; however, using them does not automatically place the data in the right registers for further computation. So further floating-point “shuffle” operations can be required, and these add to the pressure on floating-point instruction-issue opportunities. There are some “cross” forms of dual FMA that can be used to address part of the necessary shuffling at no cost, but only if the wrongly placed operand is used in the FMA multiplication. Any benefit from converting four single loads to two dual loads and some floating-point instructions will only be seen if the inner loops are not being limited by floating-point instruction opportunities. This might be true for GROMACS non-solvent-optimized inner loops on BlueGene. By contrast, on newer Intel SIMD cores register-shuffle instructions do not compete with other floating-point instructions, and so such parallel loads and shuffles would show a benefit there. The key point here is that there is not (yet) a generic approach available for SIMD code-generation, because there are too many hardware differences between architectures.
Various strategies for coping with the need to shuffle registers on BlueGene exist. eps is required as a multiplicative operand in every FMA, and so if one or two of (Y,F,G) are wrongly placed with respect to eps there will not be a cross FMA suitable for implementing all of lines 9–14. However, H is used only as a multiplicative operand, so (if other conditions are suitable) two cross FMA can be used for lines 9 and 13, and normal FMA for the rest. Alternatively, if sufficient register pairs are available to store both eps and a copy of eps of opposite locality, then after two dual and two cross loads, spare registers and fewer register shuffles would be required if a suitable set of dual and cross FMA were used subsequently.
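The case in which only H ends up crossed can be modeled in portable C. This is a loose emulation under stated assumptions, not BlueGene code: the `fpair` struct stands in for one register pair, and `dual_load`, `cross_load`, `fma_pair` and `fma_cross` are hypothetical helpers that are not the IBM built-ins and may differ from them in operand order.

```c
#include <assert.h>
#include <stddef.h>

/* One dual-FPU register pair: p on the primary unit, s on the secondary. */
typedef struct { double p, s; } fpair;

static fpair make_pair(double p, double s) { fpair v = { p, s }; return v; }

/* Dual load: adjacent values go to (primary, secondary). */
static fpair dual_load(const double *a)  { return make_pair(a[0], a[1]); }
/* Cross load: adjacent values go to (secondary, primary). */
static fpair cross_load(const double *a) { return make_pair(a[1], a[0]); }

/* Normal FMA: c + a * b on each unit. */
static fpair fma_pair(fpair c, fpair a, fpair b)
{
    return make_pair(c.p + a.p * b.p, c.s + a.s * b.s);
}

/* Cross FMA: the multiplicative operand a is read from the other unit. */
static fpair fma_cross(fpair c, fpair a, fpair b)
{
    return make_pair(c.p + a.s * b.p, c.s + a.p * b.s);
}

/* Lines 9-14 of Listing 1 for two table points at once.  row1 feeds the
 * primary unit and row2 the secondary; each row holds Y, F, G, H. */
void lookup_pair(const double *row1, const double *row2, fpair eps,
                 fpair *VV, fpair *FF)
{
    fpair Y, F, G, H, buf3, buf4, GHeps, GHeps2, R, S;

    assert(row1 != NULL && row2 != NULL && VV != NULL && FF != NULL);

    /* Single loads place Y and F on the correct units. */
    Y = make_pair(row1[0], row2[0]);
    F = make_pair(row1[1], row2[1]);

    /* One dual and one cross load fetch G and H for both units. */
    buf3 = dual_load(row1 + 2);           /* (G1, H1) */
    buf4 = cross_load(row2 + 2);          /* (H2, G2) */
    G = make_pair(buf3.p, buf4.s);        /* (G1, G2): correctly placed */
    H = make_pair(buf4.p, buf3.s);        /* (H2, H1): crossed          */

    GHeps  = fma_cross(G, H, eps);        /* line 9: H is multiplicative  */
    R      = fma_pair(F, eps, GHeps);     /* line 10                      */
    *VV    = fma_pair(Y, eps, R);         /* line 11                      */
    GHeps2 = make_pair(GHeps.p + GHeps.p, GHeps.s + GHeps.s); /* line 12 */
    S      = fma_cross(GHeps2, H, eps);   /* line 13: H is multiplicative */
    *FF    = fma_pair(F, eps, S);         /* line 14                      */
}
```

Because H appears only as a multiplicand, the two cross FMA absorb its misplacement and no separate shuffle is needed in this model. Had F been crossed instead, lines 10 and 14 would use it as the additive operand and a swap would be unavoidable.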
I considered the possibility of reordering the table generation so that the order in memory was different from (Y,F,G,H) but could perceive no advantage from so doing.
Implementation of Further Improvements to Table Look-Ups
Several implementations based on the above ideas are described below, and their performance was compared in the same manner described earlier:
- (E) Eight single loads as implemented by Puetz.
- (F) Two dual and two cross loads together with four dual “floating-point select” instructions, several extra register pairs, some register shuffling and several cross FMA (suggested in code comments by Puetz, but acknowledged there to be slower than single loads).
- (G) Two dual and two cross loads, copying eps to another register suitable for the other form of FMA, more register shuffling, and several cross FMA.
- (H) Two dual and two cross loads, copying F to another register suitable for the other form of FMA, more register shuffling and several cross FMA.
- (I) Four single loads for (Y,F), and a cross and dual load for (G,H) followed by same-FPU register pairing and several cross FMA.
Pseudo-code for implementations E, G and I may be found in Listing 2. All implementations E–I used the improvements for E described earlier. At low compiler optimization levels, implementation H worked correctly and passed the GROMACS regression test suite. At high levels it failed all regression tests and the underlying problem led to exploding simulations. Apparently some compiler defect exists that is revealed by case H. All other implementations passed the regression test suite at all optimization levels. The total time taken for identical 64-processor GROMACS simulations using all the successful inner loop implementations can be seen in Table I.
Listing 2. C pseudo-code for different implementations of table look-ups. The eps pair, and table offsets n1 and n2 for the respective FPU, have been previously computed. Function calls with a double-underscore prefix are IBM extensions to C syntax that are compiled to the matching dual-FPU instruction. Other function calls indicate an abstract process, for which the actual implementation requires the use of a clumsy syntax. The interested reader is referred to the freely-available GROMACS source code for more information.
    /* Implementation E */
    ...

    /* Implementation G */
    ...
    Y = pair(primary(buf1), secondary(buf2));
    F = pair(primary(buf2), secondary(buf1));
    G = pair(primary(buf3), secondary(buf4));
    H = pair(primary(buf4), secondary(buf3));
    ...

    /* Implementation I */
    ...
    G = pair(primary(buf3), secondary(buf4));
    H = pair(primary(buf4), secondary(buf3));
It appears that the eight loads of E are close to the best choice, and that the extra registers and operations of F and G outweigh any benefits of dual loads. Using the dual loads in I only for a variable pair including H showed a slight benefit. Inspecting the corresponding compiler listing showed that the compiler generated a cross load, a same-FPU move, and two single loads to achieve the result of two matched register pairs containing G and a crossed version of H. This could indeed be slightly superior to four single loads in a dual-issue processor context where the FPU-only move instruction fills an otherwise wasted instruction-issue opportunity. Presumably code could be constructed that resulted in a dual load, a same-FPU move, and two single loads if the subsequent FMA were all of normal form, and this would be equally fast. I have tested the improved inner loops also in GROMACS 4.5.1 and observed similar gains.
Although the two-fold loop unroll is an obvious way to take advantage of the dual-FPU, an alternative was considered. Because the dual FPU uses a 5-stage execution pipe,7 constructing intermediate arrays considerably longer than 5 might be able to saturate the floating-point pipeline, which might show enough gain to pay the cost of constructing the intermediate array. This requires that the algorithm loop over a “dual load, calculate, dual store” motif. Unfortunately, the nature of the calculation requires significant interaction with the integer and memory units (even with the above improvements) and many spare registers for intermediate results. The loop-unrolling approach was greatly superior to this one, though perhaps with a much larger L1 cache and more registers the result could have been different.
The previously optimized BlueGene implementation of the various GROMACS inner loops was improved modestly by removing some redundant floating-point operations. The GROMACS inner loops that use table look-ups require the use of a float-int-float conversion process; this was significantly improved on this hardware through the use of an established pure floating-point technique for part of the round-to-integer process. These table look-ups were also slightly improved by judicious use of a dual load. Although the 11% speed improvement illustrated in the timing results above tested only a handful of the GROMACS inner loops, the correctness tests were satisfied for all such loops, and all loops are likely to benefit slightly from the removal of the redundant operations. GROMACS simulations using “switched” Coulomb functions use the same table look-ups, and will show the full benefit. BlueGene PME simulations with well-balanced separate real- and reciprocal-space processor groups will show only part of the improvement reported here. I note the observation in related work6 that separating these groups may not be best on BlueGene. These improvements also apply to BlueGene/P, because the relevant attributes of the processor have not changed.13 I am working with the GROMACS developers to make these improvements available in a future release of GROMACS.
Access to the IBM BlueGene/L system at the University of Canterbury Supercomputing Center is gratefully acknowledged. The author is particularly indebted to one of the reviewers for testing the key performance improvement on BlueGene/P, for his view on the reason underlying this improvement, and for his observation about register-shuffle instructions on Intel cores. He also thanks the other reviewers for their helpful comments.