## 1 INTRODUCTION

Recent advances in parallel computing environments based on graphics processing units (GPUs) have significantly accelerated scientific computation [1, 2]. The combination of a GPU and NVIDIA's compute unified device architecture (CUDA) [3] provides even a personal computer (PC) with high-performance computing ability, owing both to its reasonable performance/cost ratio and to the ease with which the CUDA C language can be mastered. Significant performance improvements with GPUs have been obtained in *N*-body problems [4-8], both with and without fast algorithms such as the fast multipole method (FMM) [9, 10] and the tree method. Such GPU-accelerated *N*-body interaction calculators can be embedded in the iterative linear system solver required by a boundary element method (BEM) and can thereby improve the computational performance of the BEM. Using a GPU without fast algorithms, Takahashi [11] accelerated a BEM for the Helmholtz equation, and Lezer [12] accelerated a method of moments. Using the tree method, Stock [13] accelerated a BEM for the vortex particle method with GPUs. Using the FMM, Yokota [14] accelerated a special-purpose BEM for analyzing biomolecular electrostatics with up to 512 GPUs. The calculation times and speed-up ratios reported in [11, 13, 14] were based on single-precision floating-point arithmetic, and the FMM used in [14] was based on the rotation-coaxial translation-rotation (RCR) decomposition of the multipole-to-local (M2L) translation operator [10].

In a previous study [15], using the Laplace-kernel FMM with RCR decomposition, the author accelerated an indirect BEM with GPUs on the basis of double-precision floating-point arithmetic with CUDA. The BEM is tailored to electrostatic field analysis in voxel models, treating the square walls of cubic voxels as boundary surface elements [16]. Using this BEM, three-dimensional fields were analyzed in human voxel models derived from anatomical images. The quality of the fields calculated by the original CPU codes was similar to that of fields obtained by the scalar-potential finite-difference method, the impedance method, and related techniques [17], indicating the method's practical applicability and usefulness in dosimetric studies of field exposure and in electro- and magneto-encephalogram signal analyses.

In the current study, on the basis of two conventional CPU codes, three GPU codes are programmed with CUDA in search of higher computing performance. These GPU codes are based on double-precision floating-point arithmetic, and their specifications are as follows: GPU code 1 calculates the direct and far fields in the FMM simultaneously on the GPU and the CPU, respectively; GPU code 2 calculates both fields on the GPU, using M2L translation with RCR decomposition (R-M2L) for the far-field calculation; and GPU code 3 calculates both fields on the GPU, using diagonal forms of the M2L translation operators [9] (D-M2L). GPU code 3 is the first GPU version of this type of FMM-BEM using D-M2L. Electric fields in human models, generated either by applying a 50-Hz magnetic field or by injecting DC current through two electrodes, were analyzed successfully using a PC with three GPUs (NVIDIA GTX480) and six CPU cores (Intel Core i7-980X). Three types of human voxel model, containing up to 3.9 million boundary elements, were analyzed. By comparing calculation times, speed-up ratios, and required GPU memory, the advantages and disadvantages of the tested codes were investigated.