
Performance comparison of three types of GPU-accelerated indirect boundary element method for voxel model analysis

Authors

Shoji Hamada, Department of Electrical Engineering, Kyoto University

Correspondence to: S. Hamada, Department of Electrical Engineering, Kyoto University, Kyoto-Daigaku-Katsura, Nishikyo-ku, Kyoto 615–8510, Japan.

E-mail: shamada@kuee.kyoto-u.ac.jp

SUMMARY

An indirect boundary element method that is geared to electrostatic field analysis in voxel models is accelerated by graphics processing units (GPUs). The method considers square walls on cubic voxels as boundary surface elements and uses the fast multipole method (FMM) to analyze large-scale models. On the basis of two conventional CPU codes, three GPU codes are programmed in search of higher computing performance. These GPU codes are designed as follows: in GPU code 1, direct and far fields in the FMM are simultaneously calculated on the GPU and the CPU, respectively; in GPU code 2, both fields are calculated on the GPU with a rotation-coaxial translation–rotation decomposition algorithm; and in GPU code 3, both fields are calculated on the GPU with a diagonal translation scheme. The electric fields in human models, generated by applying a 50-Hz magnetic field or by injecting direct current (DC) through two electrodes, were calculated successfully using a personal computer with three GPUs and six CPU cores. An analysis with 3.9 million surface elements took 89.4 s to solve its governing linear system with double-precision floating-point arithmetic. GPU codes 1, 2, and 3 demonstrated the least memory usage, the greatest speed-up ratio, and the fastest calculation time, respectively. These results exemplify the trade-off relationships among computational performance measures on a heterogeneous CPU–GPU system. Copyright © 2013 John Wiley & Sons, Ltd.

1 INTRODUCTION

Recent advances in parallel computing environments with graphics processing units (GPUs) have significantly promoted progress in scientific computation performance [1, 2]. The combination of a GPU and NVIDIA's compute unified device architecture (CUDA) [3] provides even a personal computer (PC) with high-performance computing ability because of both its reasonable performance/cost ratio and the ease with which the CUDA C language can be mastered. Significant performance improvements with GPUs have been obtained in N-body problems [4-8] both with and without fast algorithms such as the fast multipole method (FMM) [9, 10] and the tree method. Such GPU-accelerated N-body interaction calculators can be embedded in an iterative linear system solver required by a boundary element method (BEM) and can improve the calculation performance of the BEM. Using a GPU without using fast algorithms, Takahashi [11] accelerated a BEM for the Helmholtz equation and Lezer [12] did so for a method of moments. Using the tree method, Stock [13] accelerated a BEM for the vortex particle method with GPUs. Using the FMM, Yokota [14] accelerated a special-purpose BEM for analyzing biomolecular electrostatics with up to 512 GPUs. Calculation times and speed-up ratios in [11, 13, 14] were based on single-precision floating-point arithmetic, and the FMM used in [14] was based on the rotation-coaxial translation–rotation (RCR) decomposition of the multipole-to-local (M2L) translation operator [10].

In a previous study [15], using the Laplace kernel FMM with RCR decomposition, the author accelerated an indirect BEM with GPUs on the basis of double-precision floating-point arithmetic with CUDA. The BEM is geared to electrostatic field analysis in voxel models, and it considers square walls on cubic voxels as boundary surface elements [16]. Using the BEM, three-dimensional fields were analyzed in human voxel models derived from anatomical images. The quality of the fields calculated by the original CPU codes was similar to that of fields obtained by the scalar-potential finite-difference method, impedance method, and so on [17], indicating the practical applicability and usefulness of the method in dosimetric studies of field exposure and in electro- and magneto-encephalogram signal analyses.

In the current study, on the basis of two conventional CPU codes, three GPU codes are programmed with CUDA in search of higher computing performance. These GPU codes are based on double-precision floating-point arithmetic and their specifications are as follows: GPU code 1 simultaneously calculates direct and far fields in the FMM on the GPU and the CPU, respectively; GPU code 2 calculates both fields on the GPU with M2L translation with RCR decomposition (R-M2L) for the far-field calculation; and GPU code 3 calculates both fields on the GPU with diagonal forms of M2L translation operators [9] (D-M2L). GPU code 3 is the first GPU version of this type of FMM–BEM using D-M2L. In human models, the electric fields are generated by applying a 50-Hz magnetic field or by injecting direct current (DC) through two electrodes, and these fields were analyzed successfully using a PC with three GPUs (NVIDIA GTX480) and six CPU cores (Intel Core i7-980X). Three types of human voxel model were analyzed, which contained up to 3.9 million boundary elements. By comparing calculation times, speed-up ratios, and required GPU memories, the advantages and disadvantages of the tested codes were investigated.

2 INDIRECT FMM–BEM FOR VOXEL MODEL ANALYSIS

2.1 Indirect boundary element method for electric field analysis of biological samples

Basic equations that describe magnetically induced, low-frequency, faint currents in a biological sample were provided by, for example, Dawson [18], under the assumption that the secondary magnetic fields induced by the primary induced current are negligibly small. When an external magnetic flux density B0 and a vector potential A0, which satisfy B0 = ∇ × A0, are applied, the magnetically induced electric field E and current density J satisfy the following equations:

E = −jωA0 − ∇ϕ,  J = σE,  ∇ · J = 0    (1)

where j, ω, σ, and ϕ are the imaginary unit, angular frequency, conductivity, and scalar potential, respectively. Equations (1) also describe electrically induced DC fields when ωA0 is equal to zero. In this case, the fields are generated by electric current injection via two electrodes attached to the analyzed model. An indirect BEM can be utilized to solve the Laplace equation in Equations (1) [19-21]. The BEM discretizes the model surfaces into N boundary surface elements, each with an individual unknown charge density q. All of the element charges generate ϕ and −∇ϕ according to Coulomb's law. When flat elements with homogeneous q are used, the surface integrals of ϕ and −∇ϕ, respectively, on the jth element are formulated as follows.

display math(2)
display math(3)

where math formula yields the surface integral for an element, S is the area of an element, n is a unit normal vector of the jth element, whereas the subscripts ± indicate the plus or minus side with respect to n. The following boundary equation based on the weighted residual method holds for an element that is not used as an electrode

display math(4)

Elements used as electrodes can be classified into four subsets: (i) a designated element of electrode 1; (ii) the other elements of electrode 1; (iii) a designated element of electrode 2; and (iv) the other elements of electrode 2. The following boundary equations are imposed on these elements [15]: math formula; math formula; math formula; math formula. Subscripts (i)–(iv) indicate that the quantities are related to the elements of the subsets (i)–(iv), whereas the side indicated with the subscript ‘in’ is the tissue side of each electrode element.

These N equations form simultaneous linear equations, which should be solved for q. The superimposed ϕ, E, and J at arbitrary positions are calculated by consolidating the Coulomb fields generated by every surface charge element.

2.2 Indirect boundary element method for voxel model analysis

The indirect BEM developed specifically for voxel model analysis [15, 16] considers a rectangular solid region composed of Nx × Ny × Nz cubic voxels, where each voxel position is indicated by a set of integer indices (i,j,k). Each voxel has homogeneous conductivity σ and the voxels define a conductive volume model. Another integer index ℓ identifies a surface among the six surfaces of a voxel as follows: ℓ = 1, 2, or 3 corresponds to the surface with the maximum x-coordinate, y-coordinate, or z-coordinate value of the surface center, respectively. The unit normal vectors of these surfaces are i, j, and k, respectively. Thus, a set (i, j, k, ℓ) identifies a voxel surface in the region. When two voxels with different conductivities are in contact with a surface, the surface is considered to be a square boundary element. These N surface elements produce a unique surface model, which can be analyzed using the indirect BEM. After solving for the N element charge densities, the electric field is determined at each voxel center.
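As a concrete illustration of this indexing scheme, the following CUDA C++ host-side sketch enumerates the boundary elements of a voxel model by comparing the conductivities of the two voxels in contact with each candidate surface. The array layout of sigma, the SurfaceElement structure, and the treatment of the region exterior as air are assumptions introduced here for illustration, not the author's implementation.

#include <vector>

// One boundary surface element identified by (i, j, k, l), where l = 1, 2, 3
// selects the voxel surface with the maximum x-, y-, or z-coordinate of the
// surface center, as described in Section 2.2.
struct SurfaceElement { int i, j, k, l; };

// sigma is assumed to hold one conductivity value per voxel in x-fastest
// order; positions outside the region are treated as air (sigma = 0).
std::vector<SurfaceElement> enumerateBoundaryElements(
    const std::vector<double>& sigma, int Nx, int Ny, int Nz)
{
    auto at = [&](int i, int j, int k) -> double {
        if (i < 0 || j < 0 || k < 0 || i >= Nx || j >= Ny || k >= Nz) return 0.0;
        return sigma[i + Nx * (j + Ny * k)];
    };
    std::vector<SurfaceElement> elements;
    for (int k = 0; k < Nz; ++k)
        for (int j = 0; j < Ny; ++j)
            for (int i = 0; i < Nx; ++i) {
                // A surface becomes a boundary element when the two voxels
                // in contact with it have different conductivities.
                if (at(i, j, k) != at(i + 1, j, k)) elements.push_back({i, j, k, 1});
                if (at(i, j, k) != at(i, j + 1, k)) elements.push_back({i, j, k, 2});
                if (at(i, j, k) != at(i, j, k + 1)) elements.push_back({i, j, k, 3});
            }
    return elements;
}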

Because the number of possible element positions is 3NxNyNz, the number of possible relative positions of source and target elements is 3²(2Nx − 1)(2Ny − 1)(2Nz − 1). The latter number corresponds to the number of independent values of F and V generated in the region by a source element with a unit charge density. These unit source response values of F and V can be preliminarily calculated and stored as f and v. By using these unit responses, the consolidated F and V that involve the contributions of N elements are calculated with the following multiply-and-accumulate operations:

display math(5)

where symbols ‘s’ and ‘t’ denote the role of a surface as source or target, respectively, and math formula denotes the summation of the contributions from all related sources.
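The multiply-and-accumulate form of Equation (5) can be sketched as follows in CUDA C++ for a single target surface. The lookup table f, its layout f[ℓs][ℓt][Δk][Δj][Δi], and the argument names are hypothetical and serve only to show how the precomputed unit responses replace on-the-fly Coulomb integrations.

#include <cstddef>

// Consolidated normal-field value F on one target surface (it, jt, kt, lt),
// accumulated from N source surfaces via precomputed unit responses
// (the multiply-and-accumulate structure of Equation (5)).
// f is a hypothetical table laid out as f[ls][lt][dk][dj][di] over the
// relative-position ranges of Section 2.2; q holds the charge densities.
double consolidateF(int it, int jt, int kt, int lt,
                    const int* si, const int* sj, const int* sk, const int* sl,
                    const double* q, std::size_t N,
                    const double* f, int Nx, int Ny, int Nz)
{
    const int Dx = 2 * Nx - 1, Dy = 2 * Ny - 1, Dz = 2 * Nz - 1;
    double F = 0.0;
    for (std::size_t s = 0; s < N; ++s) {
        const int di = si[s] - it + (Nx - 1);   // shifted to be non-negative
        const int dj = sj[s] - jt + (Ny - 1);
        const int dk = sk[s] - kt + (Nz - 1);
        const std::size_t idx =
            (((std::size_t)((sl[s] - 1) * 3 + (lt - 1)) * Dz + dk) * Dy + dj) * Dx + di;
        F += f[idx] * q[s];                     // multiply-and-accumulate
    }
    return F;
}

The consolidated V is obtained in exactly the same way with the table v in place of f.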

2.3 Interaction calculation using the fast multipole method

The FMM algorithm calculates the N-element interaction with O(N) complexity by separating N elements into element groups corresponding to hierarchical oct-tree boxes [9, 10]. The interaction is also separated into two components by considering the distance between boxes: the near-field component is directly calculated via element-to-element interactions, and the far-field component is indirectly calculated via group-to-group interactions. The finest-level box, which is called a leaf box, is defined by c³ cubic voxels in the voxel model analysis, and c ranges from 5 to 7 in this paper.

Direct-field calculations in the FMM are performed between elements in neighboring leaf boxes, and the number of possible relative positions of source and target elements is 3²(4c − 1)³. Unit responses f and v of this size are required in Equation (5) for this direct-field calculation.

Indirect far-field calculations in the FMM are based on multipole expansion coefficients M and local expansion coefficients L defined on each box. An index ic indicates their individual coefficients as Mic and Lic, and ic ranges from 0 to (p + 1)² − 1 when the expansion is truncated at p. The FMM algorithm generates the M defined on each leaf box from the charge elements involved in the box (Q2M), and it evaluates the far components Ffar and Vfar on elements in each leaf box from the L defined on the box (L2F and L2V). The number of possible element positions in a leaf box is 3c³, and thus Mic generated via Q2M by a unit source element has 3c³ values. Ffar and Vfar generated via L2F and L2V, respectively, by a unit coefficient Lic also have 3c³ values. These unit responses can be preliminarily calculated and stored as mic, math formula, and math formula, respectively. By using these unit responses, the consolidated Mic, Ffar, and Vfar are calculated with the following multiply-and-accumulate operations:
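The same table-lookup pattern applies to the leaf-box operations. The sketch below forms the multipole coefficients of one leaf box from the charges it contains, following the structure of Equation (6); the array m of unit responses and its layout m[ic][local position] are assumptions for illustration, and the coefficients are treated as real-valued for simplicity. L2F and L2V in Equation (7) have the same structure with the roles of coefficients and elements exchanged.

#include <cstddef>

// Forms the numCoeff = (p + 1)^2 multipole coefficients M of one leaf box from
// the ne elements it contains (the Q2M structure of Equation (6)). m is a
// hypothetical table of precomputed unit responses m[ic][localPos], where
// localPos encodes the numLocalPos = 3*c^3 possible element positions inside
// a leaf box, and q holds the element charge densities.
void q2mLeafBox(double* M, int numCoeff,
                const int* localPos, const double* q, int ne,
                const double* m, int numLocalPos)
{
    for (int ic = 0; ic < numCoeff; ++ic) {
        double acc = 0.0;
        for (int e = 0; e < ne; ++e)
            acc += m[(std::size_t)ic * numLocalPos + localPos[e]] * q[e];
        M[ic] = acc;                               // multiply-and-accumulate
    }
}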

display math(6)
display math(7)

where the prime mark indicates the local position in a leaf box.

The other FMM processes in the far-field calculation are identical to those in [9] or [10]. Multipole-to-multipole (M2M) and local-to-local (L2L) translation operators are decomposed by RCR decomposition, and the M2L translation operator is decomposed on the basis of RCR decomposition or its diagonal forms. Here, these two kinds of M2L are referred to as R-M2L and D-M2L, respectively (see also Appendixes A and B). The complexities of R-M2L and D-M2L are approximately O(p³) and O(p²), respectively. Although the latter has better calculation performance, its algorithm is more complex than that of the former. Note that the R-M2L used here is based on the standard stencils composed of 189 translations per box.

3 GPU-ACCELERATED INDIRECT FMM–BEM

3.1 Hardware and software

The hardware and software specifications of the PC used are now summarized. The operating system, CPU, and GPUs are 64-bit Microsoft Windows 7, an Intel Core i7-980X (six CPU cores, 3.33 GHz), and three NVIDIA GTX480 cards (a total of 1440 CUDA cores), respectively. The GTX480 has 1536 MB of GDDR5 global memory per GPU, 15 multiprocessors (480 CUDA cores) per GPU, and up to 48 KB of shared memory per multiprocessor. CUDA 3.2 is used to program the FMM codes that execute on the GPUs with double-precision floating-point arithmetic. The Open Multi-Processing (OpenMP) application program interface is used to manage the parallel execution of CPU cores. Although Intel Hyper-Threading Technology enables efficient 12-thread operation on a six-core CPU, the number of threads was fixed at six for numerical operations, because unvarying calculation times were obtained with up to six threads in this study. To manage the parallel execution of the three GPUs, three OpenMP threads are used.

3.2 CPU codes

Two CPU codes were used as references to measure speed-up ratios, and the specifications of these codes are shared by the GPU codes except for the GPU-accelerated FMM routine. The CPU-FMM subroutine written in FORTRAN was compiled by the Intel Visual Fortran Compiler v11 with the /O3, /QxHost, and /Qopenmp options, which enabled optimization, SSE4.2, and OpenMP multi-threading, respectively. CPU codes 1 and 2 are indirect FMM–BEMs based on R-M2L and D-M2L, respectively, and use six OpenMP threads to perform numerical calculations in both the linear system solver and the embedded FMM. When using OpenMP, the addition order of floating-point arithmetic is fixed to obtain deterministic results. The solver used is the Bi_IDR(s) method [22] with the setting s = 3. Note that this solver requires one FMM operation per iteration step. Convergence was judged when the relative residual norm of the solution became less than 10⁻⁶. The expansion of multipole and local coefficients was truncated at p = 10, and the actual relative accuracy of the FMM based on R-M2L was estimated at about 10⁻⁷ [23]. The parameter c, which ranged from 5 to 7 in this study, changes both the average number of elements in a leaf box and the level of the leaf boxes in the oct-tree structure. Thus, c controls the proportion of direct-field and far-field calculation amounts, and it thereby controls the performance of the FMM. Although automated tuning of c before calculation would be helpful from a practical standpoint, it is beyond the scope of this paper.
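The fixed addition order mentioned above can be realized, for example, by accumulating thread-private partial sums over statically assigned contiguous chunks and then adding the partial sums in thread order; the following generic CUDA C++/OpenMP sketch (not the author's code) illustrates the idea.

#include <cstddef>
#include <vector>
#include <omp.h>

// Deterministic parallel sum: each OpenMP thread accumulates a private partial
// sum over a statically assigned contiguous chunk, and the partial sums are
// then added in fixed thread order, so repeated runs give bitwise-identical
// results regardless of thread scheduling.
double deterministicSum(const double* a, std::size_t n, int numThreads)
{
    std::vector<double> partial(numThreads, 0.0);
    #pragma omp parallel num_threads(numThreads)
    {
        const int t = omp_get_thread_num();
        const std::size_t chunk = (n + numThreads - 1) / numThreads;
        const std::size_t begin = t * chunk;
        const std::size_t end = (begin + chunk < n) ? begin + chunk : n;
        double s = 0.0;
        for (std::size_t i = begin; i < end; ++i) s += a[i];  // fixed order within the chunk
        partial[t] = s;
    }
    double total = 0.0;
    for (int t = 0; t < numThreads; ++t) total += partial[t]; // fixed order across threads
    return total;
}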

The following is a brief summary of the D-M2L procedure [9] (see also Appendixes A and B). After selecting a translation direction from ±x, ±y, and ±z, the following operations are sequentially performed: rotate all related M toward the selected direction using Equation (19); translate all M to exponential expansion coefficients W using Equations (14) and (15); translate W to W′, which corresponds to L, using Equation (16); translate W′ to L with Equations (17) and (18); and reversely rotate all related L using Equation (20). D-M2L is completed after performing all of these processes in all directions. In this study, D-M2L is performed using a collection of boxes as illustrated in Figure 1, which is a set of double-layered source M boxes and double-layered target L boxes. D-M2L performed in one such collection is independent of D-M2L performed in other collections; thus, D-M2L can be divided into many independent sectional subprocesses. The use of such collections in the CPU code contributes to reduced usage of workspace memory. The parameter kmax used in D-M2L [9] was set to 18. The parameters M(k), k = 1 to kmax, were set to 6, 8, 12, 16, 20, 26, 30, 34, 38, 44, 48, 52, 56, 60, 60, 52, 4, and 2. Odd numbers were avoided for M(k) to improve the calculation efficiency [16]. The sum of M(k) was 568. The actual relative accuracy of the FMM based on D-M2L was estimated at around 10⁻⁷ [23].
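The per-direction sequence summarized above can be outlined as follows. The callables are placeholders for Equations (19), (14)–(18), and (20), and the whole structure is a schematic CUDA C++ sketch rather than the author's implementation.

#include <functional>

// Hypothetical interface for the per-direction D-M2L sequence. Directions
// 0..5 correspond to +x, -x, +y, -y, +z, and -z.
struct DM2LSteps {
    std::function<void(int dir)> rotateM;        // Eq. (19): rotate M toward dir
    std::function<void(int dir)> multipoleToW;   // Eqs. (14), (15): M -> W
    std::function<void(int dir)> translateW;     // Eq. (16): diagonal W -> W'
    std::function<void(int dir)> wToLocal;       // Eqs. (17), (18): W' -> L
    std::function<void(int dir)> rotateLBack;    // Eq. (20): rotate L back
};

// D-M2L for one collection of boxes (Figure 1); collections are independent,
// so calls to this driver can be distributed over threads or GPUs.
void dM2LForCollection(const DM2LSteps& steps)
{
    for (int dir = 0; dir < 6; ++dir) {
        steps.rotateM(dir);
        steps.multipoleToW(dir);
        steps.translateW(dir);
        steps.wToLocal(dir);
        steps.rotateLBack(dir);
    }
}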

Figure 1.

A collection of fast multipole method boxes in which D-M2L can be performed independently of other collections.

3.3 Graphics processing unit codes

Three GPU codes are programmed as follows: GPU code 1 simultaneously calculates direct and far fields in the FMM on the GPU and CPU, respectively; GPU code 2 calculates both fields on the GPU with R-M2L; GPU code 3 calculates both fields on the GPU with D-M2L. GPU code 1 uses D-M2L for the far-field calculation on the CPU. Added and modified items compared with those in the previous code [15], which corresponds to GPU code 2, are summarized in Table 1.

Table 1. Added and modified items compared with those in the previous code [15].

Items | Previous code | Added in current code
Far-field calculation in the FMM | On the GPU | On the CPU in GPU code 1
M2L on the GPU | R-M2L | D-M2L in GPU code 3
Operation allocation among GPUs | Subprocesses in line 01 in Table 2 | FMM boxes in line 03 in Table 2
Applied field | Field III | Fields I and II
Voxel model | Models A and B | Model C

Items | Previous code | Modified in current code
The number of GPUs (Ng) | 4 | 3
Multiprocessors per GPU (Nmp) | 30 | 15
CUDA cores per GPU | 240 (= 8Nmp) | 480 (= 32Nmp)
B in line 03 in Table 2 | 30 (= Nmp) | 90 or 120
Ng GPU execution | Direct field and M2L | All FMM processes
CPU cores (Ncc) | 4 | 6
OpenMP threads for numerical operation | 8 (= 2Ncc) | 6 (= Ncc)

FMM, fast multipole method; GPU, graphics processing unit; CUDA, compute unified device architecture; OpenMP, Open Multi-Processing.

The outline of GPU code 2 is summarized as follows. All FMM processes of the far-field calculation (Q2M, M2M, R-M2L, L2L, L2F, and L2V) and direct-field calculation are programmed as CUDA kernels using the pseudocode template shown in Table 2. This template adopts the strategy of one CUDA block per FMM box. Each FMM process is divided into P independent subprocesses to achieve both fine-grained parallelization and efficient use of shared memory. The value of P is up to 972 for the direct-field calculation and 316 for R-M2L. B CUDA blocks process related FMM boxes, and a CUDA block always processes the same box to avoid race conditions. T CUDA threads process all targets in the box and consolidate source contributions related to each target. Such procedures are designed to fix the addition order of floating-point arithmetic, and thus, they generate deterministic numerical results.

Table 2. Pseudocode template for CUDA kernels of the fast multipole method.
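Because Table 2 is referenced throughout the following subsections, a schematic CUDA kernel that follows the one-block-per-box strategy described in this subsection is sketched below. All array names, the dense weight-table layout, and the assumption that the per-subprocess data fit in shared memory are illustrative simplifications; this is not the author's Table 2 pseudocode.

#include <cstddef>

// Schematic FMM kernel: B blocks (gridDim.x) repeatedly handle the same boxes,
// and T threads (blockDim.x) consolidate the contributions to the targets of a
// box in a fixed order, which makes the results deterministic.
__global__ void fmmTemplateKernel(const double* __restrict__ weights, // per-subprocess weight tables
                                  const double* __restrict__ src,     // source values per box
                                  double*       __restrict__ dst,     // target values per box
                                  const int*    __restrict__ pairBox, // paired source box per (subprocess, box)
                                  int numBoxes, int numSub,
                                  int numSrcPerBox, int numTgtPerBox)
{
    extern __shared__ double sh[];
    double* shWeights = sh;                                  // numTgtPerBox * numSrcPerBox values
    double* shSrc     = sh + numTgtPerBox * numSrcPerBox;    // numSrcPerBox values

    for (int sub = 0; sub < numSub; ++sub) {                 // For Loop 1: subprocesses (line 01)
        // Step 2: load the weight table of this subprocess into shared memory.
        for (int i = threadIdx.x; i < numTgtPerBox * numSrcPerBox; i += blockDim.x)
            shWeights[i] = weights[(std::size_t)sub * numTgtPerBox * numSrcPerBox + i];
        __syncthreads();

        for (int box = blockIdx.x; box < numBoxes; box += gridDim.x) {  // line 03: B blocks over boxes
            const int srcBox = pairBox[(std::size_t)sub * numBoxes + box];
            if (srcBox < 0) continue;                        // paired box empty or undefined

            // Step 4: load the source values of the paired box into shared memory.
            for (int i = threadIdx.x; i < numSrcPerBox; i += blockDim.x)
                shSrc[i] = src[(std::size_t)srcBox * numSrcPerBox + i];
            __syncthreads();

            // Steps 5-7: one thread per target gathers all source contributions
            // (For Loop 2) in a fixed order and stores the result.
            const int t = threadIdx.x;
            if (t < numTgtPerBox) {
                double acc = 0.0;
                for (int s = 0; s < numSrcPerBox; ++s)
                    acc += shWeights[t * numSrcPerBox + s] * shSrc[s];
                dst[(std::size_t)box * numTgtPerBox + t] += acc;
            }
            __syncthreads();
        }
        __syncthreads();   // shWeights may be overwritten in the next subprocess
    }
}

A launch of the form fmmTemplateKernel<<<B, T, sharedBytes>>>(...) then corresponds to choosing B CUDA blocks and T threads per block as described in this subsection, with sharedBytes covering the weight table and the source buffer.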

Added and modified items in GPU code 2 are as follows (see Table 1). The current numbers of GPUs, Ng, and multiprocessors per GPU, Nmp, are 3 and 15, respectively. B is set empirically and is at most 120. All FMM processes are executed with Ng GPUs. To allocate operations to the Ng GPUs, the FMM boxes are exclusively divided into Ng subsets that contain almost equal numbers of boxes. Line 03 in Table 2 is performed by the Ng GPUs according to these subsets.
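A minimal sketch of this allocation, assuming one OpenMP thread per GPU and a contiguous split of the box list, is given below; launchBoxKernel is a placeholder for the kernel launch of line 03.

#include <omp.h>
#include <cuda_runtime.h>

// Schematic allocation of FMM boxes to Ng GPUs: the boxes are split into Ng
// contiguous subsets of nearly equal size, and one OpenMP thread drives each GPU.
void runOnAllGpus(int Ng, int numBoxes,
                  void (*launchBoxKernel)(int firstBox, int lastBox))
{
    #pragma omp parallel num_threads(Ng)
    {
        const int g = omp_get_thread_num();
        cudaSetDevice(g);                                   // bind this thread to GPU g
        const int chunk = (numBoxes + Ng - 1) / Ng;
        const int first = g * chunk;
        const int last  = (first + chunk < numBoxes) ? first + chunk : numBoxes;
        launchBoxKernel(first, last);                       // process this GPU's subset of boxes
        cudaDeviceSynchronize();
    }
}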

As an example of the use of the pseudocode template provided in Table 2, details of the CUDA kernel for R-M2L based on Equations (19), (13), and (20) are described in the following. Note that these equations have a shared structure, that is, multiple left-hand side coefficients defined in an FMM box are calculated by single multiply-and-accumulate operations, which facilitates the use of the shared pseudocode template.

(Step 1)

R-M2L is divided into 316 subprocesses, because the number of possible relative arrangements of source boxes and a target box in R-M2L is 7 × 7 × 7 − 3 × 3 × 3 = 316 (these offsets are enumerated in the sketch after Step 7). Each subprocess is handled by For Loop 1 in Table 2. Steps 2–7 shown in the following text are repeated three times, that is, once each for the forward-rotation, coaxial translation, and backward-rotation operations.

(Step 2)

The weight coefficients for the fixed subprocess and integer arrays, which are used for array address calculations, are loaded into the shared memory.

(Step 3)

B is set to 120. B CUDA blocks handle boxes in parallel. A CUDA block always handles the same boxes. The Lic of a handled box act as targets, whereas the Mic of a paired box specified by the current subprocess act as sources. If the paired box is empty or undefined, the subsequent steps are skipped.

(Step 4)

The values of the source Mic are loaded into the shared memory.

(Step 5)

The number of targets is (p + 1)² = 121. T is set to 128 and one thread handles one target.

(Step 6)

Each thread gathers related source contributions via For Loop 2.

(Step 7)

The calculated Lic values are stored in the global memory.
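For reference, the 316 relative arrangements of Step 1 can be enumerated as all source-box offsets in the 7 × 7 × 7 neighborhood of a target box excluding the 3 × 3 × 3 near neighbors (343 − 27 = 316); a short CUDA C++ host sketch follows, with an illustrative BoxOffset structure.

#include <vector>

struct BoxOffset { int dx, dy, dz; };

// Enumerates the 316 relative source-box positions used by R-M2L (Step 1):
// offsets within the 7x7x7 neighborhood of a target box, excluding the
// 3x3x3 block of near neighbors.
std::vector<BoxOffset> rm2lSubprocessOffsets()
{
    std::vector<BoxOffset> offsets;
    for (int dz = -3; dz <= 3; ++dz)
        for (int dy = -3; dy <= 3; ++dy)
            for (int dx = -3; dx <= 3; ++dx)
                if (dx < -1 || dx > 1 || dy < -1 || dy > 1 || dz < -1 || dz > 1)
                    offsets.push_back({dx, dy, dz});
    return offsets;   // offsets.size() == 316
}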

GPU code 3 performs all of the FMM processes on the GPUs, as does GPU code 2; however, it utilizes D-M2L instead of R-M2L. The processes required by D-M2L are based on Equations (19), (14), (15), (16), (17), (18), and (20). With the exception of Equation (16), these equations have a shared structure, that is, the multiple left-hand side coefficients defined in an FMM box are calculated using single multiply-and-accumulate operations. Therefore, the CUDA kernels of these processes can share the template in Table 2. However, the diagonal-form translation based on Equation (16) required a modification of the template: lines 05 and 06 were exchanged irregularly, that is, ‘For loop 2’ specified source boxes, whereas CUDA threads handled the 568 pairs of target and source coefficients that correspond to the one-to-one mapping in the diagonal-form translation. In the template, lines 03–11 were repeated seven times, that is, once for each of these processes.

A subprocess of D-M2L is defined as a sectional D-M2L performed in a collection of boxes as illustrated in Figure 1. Line 01 in Table 2 is performed by the Ng GPUs according to Ng subsets of such subprocesses. These procedures were designed to generate deterministic numerical results. The number of real coefficients in math formula and math formula was (p + 1)² = 121 in this study. By contrast, it was 568 for math formula and math formula, which is the sum of all M(k). Because of this size difference, the weight coefficients for the multiply-and-accumulate operations, which could be loaded into the shared memory for R-M2L, could not be fully loaded into the shared memory for D-M2L. Although this reduces the speed-up ratio of the latter compared with the former, numerical experiments are required to evaluate the ratios quantitatively.
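A schematic CUDA kernel for the diagonal-form translation of Equation (16) is sketched below: one block handles a target box and each thread handles one of the 568 exponential coefficients, performing the one-to-one complex multiply-and-accumulate described above. The coefficient and factor array layouts and the srcBox pairing table are hypothetical.

#include <cuda_runtime.h>
#include <cstddef>

// Diagonal-form translation sketch (Equation (16)): each thread updates one
// exponential coefficient of a target box by a one-to-one multiply-and-
// accumulate with the corresponding coefficient of each source box. Complex
// values are stored as double2 (re, im) pairs.
__global__ void diagonalM2LKernel(const double2* __restrict__ W,      // source coefficients per box
                                  double2*       __restrict__ Wout,   // translated coefficients per box
                                  const double2* __restrict__ factor, // per (source offset, coefficient) factor
                                  const int*     __restrict__ srcBox, // source boxes of each target box
                                  int numSrcPerTarget, int numCoeff /* = 568 */)
{
    const int box = blockIdx.x;          // one CUDA block per target box
    const int k   = threadIdx.x;         // one thread per coefficient pair
    if (k >= numCoeff) return;

    double2 acc = Wout[(std::size_t)box * numCoeff + k];    // accumulate onto stored value
    for (int s = 0; s < numSrcPerTarget; ++s) {             // For Loop 2: source boxes
        const int sb = srcBox[(std::size_t)box * numSrcPerTarget + s];
        if (sb < 0) continue;                               // empty or undefined source
        const double2 w = W[(std::size_t)sb * numCoeff + k];
        const double2 f = factor[(std::size_t)s * numCoeff + k];
        acc.x += f.x * w.x - f.y * w.y;                     // complex multiply-accumulate
        acc.y += f.x * w.y + f.y * w.x;
    }
    Wout[(std::size_t)box * numCoeff + k] = acc;
}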

3.4 Multiple graphics processing unit execution

GPU codes 1, 2, and 3 perform data transfers between the CPU host and GPU devices (see Figure 2). To allocate Q2M, L2F, L2V calculations to Ng GPUs, Ng sets of exclusively allocated leaf boxes are defined. Each set determines both a set of ancestor level boxes used in M2M and L2L and another set of leaf boxes used in the direct-field calculation. Each GPU receives charge densities q involved in related leaf boxes from the CPU. In the far-field calculation, each GPU generates M on allocated leaf boxes (Q2M) and generates M on related ancestor level boxes (M2M) and sends results to the CPU. All M are completed on the CPU and are sent to the GPUs. Each GPU has another set of exclusively allocated boxes to separate M2L calculations, performs M2L to calculate L on these boxes without L2L contributions, and sends results to the CPU. All L are consolidated on the CPU and are sent to the GPUs. Each GPU completes L by adding L2L contribution in reverse order of M2M, and each calculates Ffar and V far on elements in allocated leaf boxes (L2F and L2V) and sends results to the CPU. F and V are completed on the CPU by adding calculated direct and far fields.
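The data flow of Figure 2 can be summarized by the following host-side CUDA C++ outline, in which each callable stands for one of the phases described above (per-GPU kernels or CPU consolidation); it is a schematic sketch, not the author's code.

#include <functional>

// Phases of one FMM operation on the heterogeneous CPU-GPU system.
struct FmmPhases {
    std::function<void()> sendChargesToGpus;   // q of allocated leaf boxes -> each GPU
    std::function<void()> directFieldOnGpus;   // element-to-element near-field contributions
    std::function<void()> q2mAndM2mOnGpus;     // M on leaf and ancestor boxes (per GPU subset)
    std::function<void()> consolidateMOnCpu;   // complete all M on the CPU, send back to the GPUs
    std::function<void()> m2lOnGpus;           // L without L2L contributions (per GPU subset)
    std::function<void()> consolidateLOnCpu;   // consolidate all L on the CPU, send back to the GPUs
    std::function<void()> l2lAndL2fOnGpus;     // add L2L, then L2F/L2V on allocated leaf boxes
    std::function<void()> sumFieldsOnCpu;      // F and V = direct + far fields on the CPU
};

void runFmmIteration(const FmmPhases& p)
{
    p.sendChargesToGpus();
    p.directFieldOnGpus();      // near field (may overlap with the far-field phases)
    p.q2mAndM2mOnGpus();
    p.consolidateMOnCpu();
    p.m2lOnGpus();
    p.consolidateLOnCpu();
    p.l2lAndL2fOnGpus();
    p.sumFieldsOnCpu();
}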

Figure 2.

Diagram of the fast multipole method with multiple graphics processing units (GPUs).

Table 3. Specifications of voxel models.

 | Model A | Model B | Model C
Region | Head | Head | Whole body
Origin | NICT TARO | Original | NICT TARO
Voxel side length (mm) | 2 | 1 | 2
Nx × Ny × Nz | 104 × 180 × 120 (= 1,347,840) | 184 × 232 × 241 (= 10,287,808) | 260 × 137 × 868 (= 30,918,160)
The number of tissue voxels | 583,109 | 4,625,755 | 7,977,906
The number of boundary elements | 499,629 | 1,458,813 | 3,921,953
The number of different conductivities | 14 | 8 | 19

NICT, National Institute of Information and Communications Technology.

Although the GPU-accelerated FMM codes that are embedded into the BEM leave room for improvement in both memory savings and GPU performance optimization, continuous improvement is expected with the help of the various studies of general FMMs on GPUs found in [4, 5, 14] and references therein.

4 VOXEL MODELS AND APPLIED FIELDS

A voxel model is an integer array composed of NxNyNz indices, which denote the types of tissue conductivity. Details of the three voxel models used are listed in Table 3. Model A is the head part of a Japanese adult male model TARO that was developed by the National Institute of Information and Communications Technology (NICT) [15, 24] (see Figure 3). Model B is an original head model derived from magnetic resonance images of a Japanese adult male [15]. Model C is the complete TARO model (see Figure 4). The numbers of surface elements are approximately 0.5, 1.5, and 3.9 million, respectively. The conductivities of tissues are set identical to those used by Hirata et al. [17].

Figure 3.

NICT TARO model as model A. The color gradation shows the magnitude of E induced by field II.

Figure 4.

NICT TARO model as model C. The color gradation shows the magnitude of E induced by field I.

Three types of external fields are applied. Applied fields I and II are 50-Hz AC magnetic fields B0. Applied field I is a homogeneous magnetic field B0 parallel to the z-axis defined by the vector potential A0 = 0.5B0(−yi + xj) with B0 = 1 μT. Applied field II is an inhomogeneous magnetic field B0, which is generated by a 50-Hz AC current flowing through a figure-eight coil. The coil location is shown in Figure 3, and the side length of the square outline is 28 mm. The vector potential A0 is calculated by numerical integration of the partial vector potentials generated by line elements that collectively approximate the coil path. A supplemental comment related to A0 is given in Appendix C. Applied field III is an external DC current, which is applied through two circular disk electrodes attached to the skin surface [15]. The radius of each electrode is 5 mm and the total number of electrode elements is 65.

5 RESULTS

5.1 Analyses using model A

The induced electric fields in model A were calculated using the three types of applied fields. The FMM for fields I and II calculates F for all elements, whereas the FMM for field III calculates V for the electrode elements and F for all elements. However, the single-FMM calculation times are almost independent of the applied field because the number of electrode elements is negligibly small, that is, 65. The direct-field and far-field calculation times per single FMM calculation and the one-step calculation times were measured by averaging over the three types of applied field. These times and the speed-up ratios relative to the CPU codes are listed in Table 4. When c is fixed, the direct-field calculation times for the three GPU codes are nearly identical because they use the same algorithm for the GPUs. The maximum direct-field speed-up ratio was 26.3. By contrast, the far-field calculation times depend on the type of code used. The maximum far-field speed-up ratios with GPU codes 2 and 3 were 9.6 and 4.2, respectively. The one-step calculation time for GPU code 1 is approximately equal to the longer of the direct-field and far-field calculation times, whereas the one-step calculation times for GPU codes 2 and 3 are approximately equal to the sums of the direct-field and far-field calculation times. In this study, the longer time in GPU code 1 was always the far-field calculation time. The minimum one-step calculation times for GPU codes 1, 2, and 3 were 0.164, 0.137, and 0.107 s, respectively. The maximum one-step speed-up ratios for GPU codes 1, 2, and 3 were 11.5, 13.9, and 13.0, respectively. Thus, GPU code 2 scored the greatest speed-up ratios, which demonstrates the better GPU-acceleration performance of R-M2L compared with D-M2L.

Table 4. Direct field calculation time, far field calculation time, one step calculation time, and speed-up ratios for model A.

Code | c | Direct field time (s) | Direct field speed-up | Far field time (s) | Far field speed-up | One step time (s) | One step speed-up
CPU code 1 with R-M2L (c1) | 5 | 0.750 | — | 1.207 | — | 1.966 | —
CPU code 1 with R-M2L (c1) | 6 | 1.160 | — | 0.747 | — | 1.914 | —
CPU code 1 with R-M2L (c1) | 7 | 1.731 | — | 0.505 | — | 2.245 | —
CPU code 2 with D-M2L (c2) | 5 | 0.762 | — | 0.308 | — | 1.078 | —
CPU code 2 with D-M2L (c2) | 6 | 1.181 | — | 0.197 | — | 1.386 | —
CPU code 2 with D-M2L (c2) | 7 | 1.722 | — | 0.155 | — | 1.885 | —
GPU code 1 (ratio to c2) | 5 | 0.045 | 16.9 | 0.314 | — | 0.323 | 3.3
GPU code 1 (ratio to c2) | 6 | 0.046 | 25.5 | 0.201 | — | 0.210 | 6.6
GPU code 1 (ratio to c2) | 7 | 0.096 | 17.9 | 0.155 | — | 0.164 | 11.5
GPU code 2 with R-M2L (ratio to c1) | 5 | 0.045 | 16.7 | 0.126 | 9.6 | 0.178 | 11.0
GPU code 2 with R-M2L (ratio to c1) | 6 | 0.046 | 25.2 | 0.083 | 9.0 | 0.137 | 13.9
GPU code 2 with R-M2L (ratio to c1) | 7 | 0.096 | 18.0 | 0.061 | 8.2 | 0.165 | 13.6
GPU code 3 with D-M2L (ratio to c2) | 5 | 0.045 | 16.9 | 0.072 | 4.2 | 0.125 | 8.6
GPU code 3 with D-M2L (ratio to c2) | 6 | 0.045 | 26.3 | 0.054 | 3.6 | 0.107 | 13.0
GPU code 3 with D-M2L (ratio to c2) | 7 | 0.096 | 17.9 | 0.044 | 3.5 | 0.149 | 12.6

One step ≃ one FMM operation. FMM, fast multipole method; GPU, graphics processing unit.

The requisite number of iteration steps, the time required before starting the iteration T0, and the total calculation time are shown in Table 5. T0 contains both the setup time for the Bi_IDR(s) method and the memory allocation times on the GPUs. The total calculation time is approximately equal to the sum of T0 and the product of the one-step calculation time and the number of iteration steps. The required iteration steps and total calculation time depended on the type of field applied. The minimum total calculation times for GPU codes 1, 2, and 3 were 12.1, 10.5, and 8.4 s, respectively. Thus, GPU code 3 achieved the fastest times in these calculations, which demonstrates the better algorithmic performance of D-M2L compared with R-M2L, even after considering GPU acceleration.

Table 5. Total iteration steps, time required before starting iteration (T0), and total calculation time for model A with fields I, II, and III.

Code | c | Field I: steps / T0 (s) / total (s) | Field II: steps / T0 (s) / total (s) | Field III: steps / T0 (s) / total (s)
CPU code 1 with R-M2L | 5 | 67 / 0.203 / 129.748 | 72 / 0.193 / 143.997 | 111 / 0.195 / 218.260
CPU code 1 with R-M2L | 6 | 77 / 0.373 / 147.920 | 74 / 0.183 / 140.210 | 79 / 0.191 / 152.971
CPU code 1 with R-M2L | 7 | 74 / 0.208 / 163.984 | 85 / 0.173 / 192.077 | 89 / 0.200 / 201.226
CPU code 2 with D-M2L | 5 | 67 / 0.204 / 72.715 | 72 / 0.188 / 76.937 | 93 / 0.194 / 100.985
CPU code 2 with D-M2L | 6 | 77 / 0.192 / 106.581 | 74 / 0.189 / 102.421 | 82 / 0.192 / 114.527
CPU code 2 with D-M2L | 7 | 73 / 0.191 / 137.148 | 85 / 0.178 / 160.917 | 72 / 0.195 / 136.076
GPU code 1 | 5 | 67 / 0.419 / 21.973 | 72 / 0.429 / 23.740 | 102 / 0.414 / 33.356
GPU code 1 | 6 | 77 / 0.395 / 16.445 | 74 / 0.395 / 16.157 | 79 / 0.383 / 16.776
GPU code 1 | 7 | 73 / 0.384 / 12.473 | 85 / 0.390 / 14.259 | 72 / 0.387 / 12.075
GPU code 2 with R-M2L | 5 | 67 / 0.419 / 12.381 | 72 / 0.413 / 13.246 | 111 / 0.430 / 20.177
GPU code 2 with R-M2L | 6 | 77 / 0.393 / 10.995 | 74 / 0.388 / 10.526 | 79 / 0.404 / 11.238
GPU code 2 with R-M2L | 7 | 74 / 0.387 / 12.632 | 85 / 0.413 / 14.453 | 72 / 0.415 / 12.389
GPU code 3 with D-M2L | 5 | 67 / 0.445 / 8.830 | 72 / 0.441 / 9.456 | 110 / 0.441 / 14.247
GPU code 3 with D-M2L | 6 | 77 / 0.411 / 8.631 | 74 / 0.432 / 8.356 | 81 / 0.327 / 8.961
GPU code 3 with D-M2L | 7 | 73 / 0.440 / 11.282 | 85 / 0.416 / 13.057 | 89 / 0.420 / 13.739

GPU, graphics processing unit.

Table 6 lists the required memory usages per GPU in applied field I. The maximum memory usages of codes 1, 2, and 3 were 31, 73, and 70 MB, respectively, for model A. GPU code 1 required the least memory usage, because it only performs direct-field calculation on the GPUs.

Table 6. Memory usage in MB per graphics processing unit required by GPU codes 1, 2, and 3 in field I.

 | GPU code 1 (CPU and GPU): c = 5 | c = 6 | c = 7 | GPU code 2 (R-M2L): c = 5 | c = 6 | c = 7 | GPU code 3 (D-M2L): c = 5 | c = 6 | c = 7
Model A | 28.2 | 28.7 | 30.9 | 73.4 | 63.5 | 60.9 | 69.5 | 58.2 | 54.6
Model B | 88.2 | 81.7 | 80.0 | 282.2 | 212.3 | 176.7 | 283.7 | 212.8 | 176.1
Model C | 209.9 | 196.6 | 263.6 | 584.3 | 449.2 | 449.2 | 629.2 | 484.7 | 477.2

GPU, graphics processing unit.

In general, GPU code 3 with D-M2L showed the best total performance. However, each GPU code has its own advantages and none has critical disadvantages. A supplemental comparison related to GPU code 2 in field III is given in Appendix D.

5.2 Analyses of models B and C

Induced electric fields in models B and C were calculated with field I to investigate calculation performance in larger scale models. Calculations with fields II and III were omitted because FMM execution times observed in the previous subsection were not affected by the type of applied field.

For model B, calculation times, speed-up ratios compared with CPU codes, and the number of iteration steps are listed in Table 7. The maximum far-field speed-up ratios of GPU codes 2 and 3 were 10.4 and 5.2, respectively. The maximum one-step speed-up ratios of GPU codes 2 and 3 were 10.7 and 8.0, respectively. Thus, GPU code 2 scored the greatest ratio. The minimum one-step calculation times of GPU codes 1, 2, and 3 were 0.822, 0.604, and 0.399 s, respectively. The minimum total calculation times of GPU codes 1, 2, and 3 were 50.8, 37.5, and 26.7 s, respectively. Thus, GPU code 3 achieved the fastest time. Table 6 lists the required memory usages per GPU for model B. The maximum memory usages of codes 1, 2, and 3 were 88, 282, and 283 MB, respectively, and GPU code 1 required the least memory usage.

Table 7. Calculation times in seconds, speed-up ratios, and iteration steps for model B in field I.

Code | c | Direct field time (s) | Direct field speed-up | Far field time (s) | Far field speed-up | One step time (s) | One step speed-up | Iteration steps | Time before starting iteration (s) | Total time (s)
CPU code 1 with R-M2L (c1) | 5 | 1.528 | — | 7.171 | — | 8.725 | — | 59 | 0.364 | 515.165
CPU code 1 with R-M2L (c1) | 6 | 2.010 | — | 4.513 | — | 6.549 | — | 65 | 0.373 | 426.026
CPU code 1 with R-M2L (c1) | 7 | 2.690 | — | 3.041 | — | 5.756 | — | 61 | 0.539 | 351.646
CPU code 2 with D-M2L (c2) | 5 | 1.550 | — | 1.773 | — | 3.350 | — | 59 | 0.381 | 198.011
CPU code 2 with D-M2L (c2) | 6 | 2.016 | — | 1.134 | — | 3.176 | — | 65 | 0.548 | 206.994
CPU code 2 with D-M2L (c2) | 7 | 2.787 | — | 0.814 | — | 3.627 | — | 61 | 0.373 | 221.598
GPU code 1 (ratio to c2) | 5 | 0.173 | 8.9 | 1.745 | — | 1.777 | 1.9 | 59 | 0.951 | 105.794
GPU code 1 (ratio to c2) | 6 | 0.139 | 14.5 | 1.099 | — | 1.129 | 2.8 | 65 | 0.710 | 74.114
GPU code 1 (ratio to c2) | 7 | 0.264 | 10.6 | 0.791 | — | 0.822 | 4.4 | 61 | 0.665 | 50.778
GPU code 2 with R-M2L (ratio to c1) | 5 | 0.177 | 8.6 | 0.689 | 10.4 | 0.890 | 9.8 | 59 | 0.829 | 53.343
GPU code 2 with R-M2L (ratio to c1) | 6 | 0.137 | 14.6 | 0.447 | 10.1 | 0.609 | 10.7 | 65 | 0.749 | 40.358
GPU code 2 with R-M2L (ratio to c1) | 7 | 0.263 | 10.2 | 0.315 | 9.6 | 0.604 | 9.5 | 61 | 0.687 | 37.537
GPU code 3 with D-M2L (ratio to c2) | 5 | 0.176 | 8.8 | 0.344 | 5.2 | 0.544 | 6.2 | 59 | 0.994 | 33.119
GPU code 3 with D-M2L (ratio to c2) | 6 | 0.137 | 14.7 | 0.236 | 4.8 | 0.399 | 8.0 | 65 | 0.745 | 26.683
GPU code 3 with D-M2L (ratio to c2) | 7 | 0.264 | 10.6 | 0.179 | 4.5 | 0.470 | 7.7 | 61 | 0.698 | 29.369

GPU, graphics processing unit.

For model C, calculation times, speed-up ratios, and the number of iteration steps are listed in Table 8. The maximum far-field speed-up ratios of GPU codes 2 and 3 were 9.7 and 5.2, respectively. Their maximum one-step speed-up ratios were 12.1 and 11.3, respectively. Thus, GPU code 2 scored the greatest ratio. The minimum one-step calculation times of GPU codes 1, 2, and 3 were 1.715, 1.215, and 0.838 s, respectively. Their minimum total calculation times were 183.9, 129.0, and 89.4 s, respectively. Thus, GPU code 3 achieved the fastest time. Table 6 lists the required memory usages per GPU for model C. The maximum memory usages of codes 1, 2, and 3 were 264, 584, and 629 MB, respectively, and GPU code 1 required the least memory usage. These usages are up to 40% of the 1536-MB hardware capacity.

Table 8. Calculation times in seconds, speed-up ratios, and iteration steps for model C in field I.

Code | c | Direct field time (s) | Direct field speed-up | Far field time (s) | Far field speed-up | One step time (s) | One step speed-up | Iteration steps | Time before starting iteration (s) | Total time (s)
CPU code 1 with R-M2L (c1) | 5 | 4.988 | — | 12.51 | — | 17.56 | — | 107 | 0.823 | 1880.06
CPU code 1 with R-M2L (c1) | 6 | 6.862 | — | 7.792 | — | 14.72 | — | 105 | 0.818 | 1546.55
CPU code 1 with R-M2L (c1) | 7 | 9.534 | — | 5.448 | — | 15.05 | — | 106 | 0.828 | 1596.04
CPU code 2 with D-M2L (c2) | 5 | 5.114 | — | 3.515 | — | 8.699 | — | 107 | 0.845 | 931.653
CPU code 2 with D-M2L (c2) | 6 | 7.119 | — | 2.297 | — | 9.485 | — | 105 | 0.836 | 996.767
CPU code 2 with D-M2L (c2) | 7 | 9.909 | — | 1.685 | — | 11.66 | — | 106 | 0.833 | 1237.19
GPU code 1 (ratio to c2) | 5 | 0.353 | 14.5 | 3.377 | — | 3.459 | 2.5 | 107 | 1.541 | 371.705
GPU code 1 (ratio to c2) | 6 | 0.305 | 23.3 | 2.227 | — | 2.314 | 4.1 | 105 | 1.364 | 244.279
GPU code 1 (ratio to c2) | 7 | 0.676 | 14.7 | 1.632 | — | 1.715 | 6.7 | 106 | 2.049 | 183.854
GPU code 2 with R-M2L (ratio to c1) | 5 | 0.357 | 14.0 | 1.290 | 9.7 | 1.706 | 10.3 | 107 | 1.596 | 184.167
GPU code 2 with R-M2L (ratio to c1) | 6 | 0.301 | 22.8 | 0.847 | 9.2 | 1.215 | 12.1 | 105 | 1.438 | 129.010
GPU code 2 with R-M2L (ratio to c1) | 7 | 0.676 | 14.1 | 0.607 | 9.0 | 1.351 | 11.1 | 106 | 2.059 | 145.313
GPU code 3 with D-M2L (ratio to c2) | 5 | 0.359 | 14.3 | 0.673 | 5.2 | 1.095 | 7.9 | 107 | 1.578 | 118.733
GPU code 3 with D-M2L (ratio to c2) | 6 | 0.303 | 23.5 | 0.468 | 4.9 | 0.838 | 11.3 | 105 | 1.402 | 89.382
GPU code 3 with D-M2L (ratio to c2) | 7 | 0.674 | 14.7 | 0.361 | 4.7 | 1.104 | 10.6 | 107 | 2.171 | 120.312

GPU, graphics processing unit.

The ratio of the numbers of boundary elements in models C and A was 3,921,953 / 499,629 = 7.85, and the ratio of the minimum one-step calculation times for models C and A was 7.68 for CPU code 1. The corresponding ratios for CPU code 2 and GPU codes 1, 2, and 3 were 8.04, 10.3, 8.99, and 7.69, respectively. These ratios, obtained for two models derived from the same TARO model, are roughly consistent with the expected O(N) complexity of the FMM algorithm.

5.3 Comparison of performance

In all the test cases, GPU codes 1, 2, and 3 achieved the least GPU memory usage, the greatest speed-up ratios, and the fastest calculation times, respectively. These results are summarized in Table 9. The algorithmic compatibility with the GPU architecture in this table roughly represents the ease of both designing the CUDA kernels and approaching the ideal GPU computing performance. Note that GPU code 1 has the advantage of a shorter development period, which is important in practical situations. In addition, a wider setting range of the c value might have yielded a slightly faster calculation time for GPU code 1. Although GPU code 3 demonstrated the fastest calculation times in all test cases, every code may become the preferred choice depending on both the number of unknowns and the future balance of CPU and GPU performance. The choice of code also depends on the following items: parameter settings related to the calculation accuracy, for example, the selection of p; the stencils of the FMM boxes used in M2L translation; and the specific code implementation with consideration for symmetric, cyclic, and conjugate relationships. Ultimately, maintaining such optional FMM–BEM codes and adjusting the FMM parameters will stably provide updated computation performance on heterogeneous CPU–GPU systems.

Table 9. Comparison of tested codes.

Compared items | GPU code 1 | GPU code 2 | GPU code 3 | CPU codes 1 and 2
Algorithmic compatibility to GPU architecture | Better | Medium | Worse | —
Period to develop GPU code | Shorter | Medium | Longer | —
Speed-up ratio of direct-field calculation | Better than far-field calculation (common to GPU codes 1–3) | | | —
Speed-up ratio of far-field calculation | — | Better | Worse | —
Speed-up ratio of one-step calculation | Worse | Better | Medium | —
Calculation speed on the GPUs | Worse | Medium | Faster | —
GPU memory usage | Lesser | Greater | Greater | Zero
Total performance in tested cases | — | — | The best | —
Applicability to larger scale model analysis | Medium | Worse | Worse | Better
Chance in case of greater CPU advance than GPU | Better | — | — | Better
Chance in case of greater GPU advance than CPU | — | Better | Better | —

GPU, graphics processing unit.

6 CONCLUSION

An indirect FMM–BEM that is geared to field analysis in voxel models is accelerated by GPUs with double-precision floating-point arithmetic on a PC. Three types of GPU-accelerated code are programmed in search of higher computing performance. These codes are designed as follows: GPU code 1 simultaneously calculates direct and far fields in the FMM on the GPU and the CPU, respectively; GPU code 2 calculates both fields on the GPU with R-M2L; and GPU code 3 calculates both fields on the GPU with D-M2L. The electric fields in human models induced by three kinds of applied fields are successfully calculated on a PC with three GPUs and six CPU cores. When a homogeneous AC magnetic field was applied to models with 0.5, 1.5, and 3.9 million boundary elements, GPU code 3 took 8.6, 26.7, and 89.4 s, respectively, to solve the governing linear system. With respect to the speed-up ratios of the one-step calculation time, GPU code 2 scored 14.1, 10.7, and 12.1, respectively, relative to the CPU code times obtained with six OpenMP threads. Although the embedded GPU-accelerated FMM codes leave room for improvement in both memory savings and optimization of the GPU performance, the obtained times and speed-up ratios demonstrate the usefulness and efficiency of the tested BEM codes. GPU codes 1, 2, and 3 exhibited the least memory usage, the greatest speed-up ratio, and the fastest calculation time, respectively. The observed trade-off relationship supports the common but practical strategy of preparing such optional codes to secure high-performance computing on heterogeneous CPU–GPU systems in a variety of situations.

APPENDIX A: MATHEMATICAL FORMULAE FOR M2L TRANSLATION

The mathematical formulae required for M2L translation [9] are briefly summarized for the readers’ convenience. The spherical harmonics of degree n are defined as follows.

Y_n^m(θ, ϕ) = [(n − |m|)! / (n + |m|)!]^(1/2) P_n^|m|(cos θ) e^{imϕ}    (8)

where θ and ϕ are spherical polar angles, math formula is the associated Legendre function, and i is the imaginary unit. The degree n ranges from 0 to ∞, whereas the order m ranges from −n to +n. The scalar potential Φ at (r,θ,ϕ), which is a solution of the Laplace equation, is described as follows.

Φ(r, θ, ϕ) = Σ_{n=0}^{∞} Σ_{m=−n}^{n} (M_n^m / r^{n+1}) Y_n^m(θ, ϕ),  r > R    (9)
Φ(r, θ, ϕ) = Σ_{n=0}^{∞} Σ_{m=−n}^{n} L_n^m r^n Y_n^m(θ, ϕ),  r < R    (10)

where R is the convergence radius of the infinite series, whereas math formula and math formula are multipole and local expansion coefficients. The following relations hold: math formula, math formula, and math formula, where the overline indicates the complex conjugate. Thus, math formula, where math formula indicates the imaginary part of the complex number. If the infinite series is truncated and n ranges up to p, the number of independent real numbers in math formula, math formula, or math formula is (p + 1)². If math formula and math formula are defined on different spherical coordinates, that is, ‘1’ and ‘2’ respectively, the following M2L translation formula holds:

display math(11)
display math(12)

where (ρ, α, β) is the point in coordinate ‘1’ that indicates the origin of coordinate ‘2’. The ‘≅’ is replaced by ‘=’ when p = ∞. Equation (11) requires O(p²) operations to calculate one math formula, so it requires (p + 1)² × O(p²) = O(p⁴) operations to calculate all local expansion coefficients.

A rotation-coaxial translation–rotation decomposition algorithm reduces the complexity of M2L. The coaxial translation is a special case of Equation (11), which assumes that (ρ,α,β) = (ρ, 0 rad, 0 rad).

display math(13)

which requires O(p) numerical operations to calculate one math formula, so all local expansion coefficients are calculated with (p + 1)² × O(p) = O(p³) complexity. Before using Equation (13), the z-axis of coordinate ‘1’ has to be rotated and oriented in the (ρ, α, β) direction. After using Equation (13), math formula have to be rotated back to the original direction. The complexity of these rotation translations is also O(p³) (see Appendix B), so the overall complexity remains O(p³) (R-M2L).

By contrast, a diagonal-form translation scheme (D-M2L) is based on the following equations.

display math(14)
display math(15)
display math(16)
display math(17)
display math(18)

where λk, ωk, and M(k) depend on k, whereas j ranges from 1 to M(k) and k ranges from 1 to kmax. The total complexity of Equations (14), (15), (17), and (18) is O(p³), and these equations are independent of ρ, θ, and ϕ. By contrast, Equation (16) has O(p²) complexity because it requires no summation, and it depends on ρ, θ, and ϕ. Therefore, after math formula is translated into math formula, the math formula is repeatedly available for many translations using Equation (16), and the most significant part of the M2L operation is reduced to O(p²). The one-to-one mapping of math formula to math formula is known as the diagonal form. Although Equation (16) holds as long as θ ⩽ 0.5π rad, a narrower range of θ reduces kmax and M(k) for a fixed accuracy. Thus, D-M2L is performed in six steps that correspond to the six directions ±x, ±y, and ±z. Before starting each step, the ±x, ±y, or −z axis is rotated and oriented in the +z direction. After using Equations (14)–(18), math formula are rotated back to their original direction.

In summary, R-M2L is composed of three O(p³) operations, that is, forward rotation, Equation (13), and backward rotation. D-M2L is composed of six O(p³) operations, that is, forward rotation, Equations (14), (15), (17), and (18), and backward rotation, and one O(p²) operation, Equation (16).

APPENDIX B: OUTLINE OF THE ROTATION OPERATION [10, 16]

Two spherical coordinate systems, 1 and 2, share an origin but have axes with different directions. A set of multipole or local coefficients math formula defined for coordinate 1 is translated into math formula, which is defined for the other coordinate 2, as follows.

display math(19)

where math formula are the coefficients for forward rotation. When the degree n is fixed, only 2n + 1 coefficients of math formula are required to calculate the 2n + 1 coefficients of math formula. Calculating all of the coefficients of math formula requires (p + 1)² × O(p) = O(p³) operations if n is truncated to p. The backward rotation is described in the same form using other coefficients math formula as follows.

display math(20)

If rotation operations are used in the FMM, the multiple left-hand side coefficients defined for an FMM box are calculated by single multiply-and-accumulate operations on the right-hand side.
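The per-degree block structure implied by Equations (19) and (20) can be sketched as follows; real-valued coefficients and a hypothetical packed layout of the rotation coefficients are assumed for simplicity, and the nested loops make the (p + 1)² × O(p) = O(p³) cost explicit.

#include <cstddef>

// Rotation of an expansion exploiting its per-degree block structure: for each
// degree n, the 2n + 1 rotated coefficients are linear combinations of the
// 2n + 1 original coefficients of the same degree. Coefficients of degree n
// are assumed to start at index n*n, and R is packed degree by degree.
void rotateExpansion(const double* R,      // rotation coefficients, packed per degree
                     const double* C1,     // coefficients in coordinate 1
                     double* C2,           // coefficients in coordinate 2
                     int p)
{
    std::size_t rOffset = 0;
    for (int n = 0; n <= p; ++n) {
        const int width = 2 * n + 1;
        const int base = n * n;                       // first coefficient of degree n
        for (int k = 0; k < width; ++k) {
            double acc = 0.0;
            for (int m = 0; m < width; ++m)
                acc += R[rOffset + (std::size_t)k * width + m] * C1[base + m];
            C2[base + k] = acc;
        }
        rOffset += (std::size_t)width * width;
    }
}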

APPENDIX C: CONSTANT VECTOR OF THE LINEAR SYSTEM IN FIELDS I AND II

To calculate the constant excitation column vector of the linear system in fields I and II, ∫ A0 · n dS is numerically evaluated on every element. The evaluation in field I is equivalently performed by calculating the product of the element area and A0 · n at the element center, and thus it requires a negligible calculation cost. The evaluation in field II additionally requires line integration along the coil path, and thus it is usually computationally expensive. However, by applying a kind of tree method with Equation (7) [25], the calculation time on the CPU was reduced to 0.46 s for the analysis of model A.
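For field I, with A0 = 0.5B0(−yi + xj), the excitation entries therefore reduce to (element area) × (A0 · n at the element center); a minimal CUDA C++ sketch with illustrative array names follows.

#include <cstddef>

// Field-I excitation entries: b[e] = area * (A0 . n) at the element center,
// with A0 = 0.5*B0*(-y, x, 0). cx, cy hold element-center coordinates, nx, ny
// the x and y components of the unit normals (the z-component of A0 is zero,
// so nz is not needed); all names and layouts are illustrative.
void fieldIExcitation(const double* cx, const double* cy,
                      const double* nx, const double* ny,
                      double area, double B0, std::size_t N, double* b)
{
    for (std::size_t e = 0; e < N; ++e) {
        const double a0x = -0.5 * B0 * cy[e];     // x-component of A0 at the element center
        const double a0y =  0.5 * B0 * cx[e];     // y-component of A0 at the element center
        b[e] = area * (a0x * nx[e] + a0y * ny[e]);
    }
}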

APPENDIX D: COMPARISON TO PREVIOUS GPU-ACCELERATED FMM–BEM CODE

In a previous study [15], the induced electric field in model A with applied field III was calculated with a GPU code corresponding to GPU code 2 in this study. Here, the calculation times obtained in the current and previous studies are compared at the setting c = 6. The previously used GPUs, CPU, and linear system solver were NVIDIA GTX295 cards (two cards, four GPUs, and a total of 960 CUDA cores), an Intel Core i7-975 (four cores), and the BiCGsafe method, respectively (see also Table 1). Note that BiCGsafe requires two FMM operations per iteration step and that the ratio of the total numbers of CUDA cores in the current and previous studies is 1440 / 960 = 1.5. In the previous study, the direct-field calculation time, far-field calculation time, one-FMM calculation time, total number of FMM operations, and total time required for convergence were 0.106 s, 0.186 s, 0.309 s, 100, and 31.3 s, respectively. In the current study, they are 0.048 s, 0.082 s, 0.138 s, 79, and 11.2 s, respectively; these are the measurements on which Tables 4 and 5 are based. Their respective ratios are 2.21, 2.27, 2.24, 1.27, and 2.79.

ACKNOWLEDGEMENTS

This study was partially supported by a Grant-in-Aid for Scientific Research (C) (No. 22560419) from the Japan Society for the Promotion of Science. The author sincerely thanks T. Yamamoto, T. Sasayama, and T. Kobayashi for their contributions to this study.

Biography


    Shoji Hamada received his BS degree from Kyoto University, Japan, in 1987, and his MS and PhD degrees in Electrical Engineering from the University of Tokyo, Japan, in 1989 and 1992, respectively. He was with Tokyo Denki University, Japan, from 1992 to 1997. He joined the Faculty of Engineering, Kyoto University as a lecturer in 1997 and has been an associate professor with the Department of Electrical Engineering, Graduate School of Engineering at the same University since 2005. His research interests include electromagnetic field analysis and inverse problems.
