Performance modeling of microsecond scale biological molecular dynamics simulations on heterogeneous architectures


Correspondence to: Pratul K. Agarwal, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.



Performance improvements in biomolecular simulations based on molecular dynamics (MD) codes are widely desired. Unfortunately, the factors that drove past performance improvements, particularly microprocessor clock frequencies, are no longer increasing. Hence, novel software and hardware solutions are being explored for accelerating the performance of widely used MD codes. In this paper, we describe our efforts in porting, optimizing and tuning the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), a popular MD framework, on heterogeneous architectures: multi-core processors with graphics processing unit (GPU) accelerators. Our implementation is based on accelerating the most computationally expensive non-bonded interaction terms on the GPUs and overlapping the computation on the CPU and GPUs. This functionality is built on top of the Message Passing Interface (MPI), which allows multi-level parallelism to be extracted even at the workstation level with multi-core CPUs and allows the implementation to be extended to GPU-enabled clusters. We hypothesize that the optimal benefit of heterogeneous architectures for applications will come from utilizing all possible resources (for example, CPU cores and GPU devices on GPU-enabled clusters). Benchmarks for a range of biomolecular system sizes are provided, and an analysis is performed on four generations of NVIDIA's GPU devices. On GPU-enabled Linux clusters, by overlapping and pipelining computation and communication, we observe up to 10-fold application acceleration in multi-core and multi-GPU environments, illustrating significant performance improvements.
A detailed analysis of the implementation is presented that identifies bottlenecks in the algorithm and indicates that code optimization and improvements on GPUs could allow microsecond-scale simulation throughput on workstations and inexpensive GPU clusters, putting widely desired biologically relevant simulation time-scales within reach of a large user community. To systematically optimize simulation throughput and to enable performance prediction, we have developed a parameterized performance model that will allow developers and users to explore the performance potential of future heterogeneous systems for biological simulations. Copyright © 2012 John Wiley & Sons, Ltd.
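
To make the idea of a parameterized performance model concrete, the following minimal Python sketch estimates simulation throughput when GPU and CPU work overlap. It is an illustration under stated assumptions, not the model developed in this paper: the class `StepModel`, its four timing parameters, and the rule that the per-step time is the longer of the two overlapped phases plus the non-overlapped transfer and communication costs are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class StepModel:
    """Hypothetical per-timestep cost model (all times in seconds)."""
    t_nonbonded_gpu: float  # non-bonded force evaluation on the GPU
    t_cpu_other: float      # remaining CPU work, assumed to overlap with the GPU
    t_transfer: float       # host-device data transfer, assumed not overlapped
    t_comm: float           # inter-process communication, assumed not overlapped

    def step_time(self) -> float:
        # Overlapped phases cost as much as the slower of the two;
        # the non-overlapped phases are added on top.
        overlapped = max(self.t_nonbonded_gpu, self.t_cpu_other)
        return overlapped + self.t_transfer + self.t_comm

    def ns_per_day(self, timestep_fs: float) -> float:
        # Steps completed in one day times the simulated time per step
        # (1 fs = 1e-6 ns).
        steps_per_day = 86400.0 / self.step_time()
        return steps_per_day * timestep_fs * 1e-6

    def days_per_microsecond(self, timestep_fs: float) -> float:
        # Wall-clock days needed to simulate 1 microsecond (1000 ns).
        return 1000.0 / self.ns_per_day(timestep_fs)


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the paper.
    model = StepModel(t_nonbonded_gpu=0.004, t_cpu_other=0.003,
                      t_transfer=0.0005, t_comm=0.0005)
    print(f"step time: {model.step_time():.4f} s")
    print(f"throughput: {model.ns_per_day(2.0):.2f} ns/day")
    print(f"days per microsecond: {model.days_per_microsecond(2.0):.1f}")
```

In a sketch like this, reducing `t_nonbonded_gpu` below `t_cpu_other` brings no further gain, because the CPU phase then becomes the bottleneck; this is the kind of question a parameterized model lets one explore before hardware is available.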