An MPI/GPU parallelization of an interior penalty discontinuous Galerkin time domain method for Maxwell's equations



[1] In this paper we discuss our approach to the MPI/GPU implementation of an Interior Penalty Discontinuous Galerkin Time Domain (IPDGTD) method for solving the time dependent Maxwell's equations. We exploit the inherent DGTD parallelism and describe a combined MPI/GPU and local time stepping implementation. This combination is aimed at increasing efficiency and reducing computational time, especially for multiscale applications. The CUDA programming model was used, together with non-blocking MPI calls to overlap communications across the network. A 10× speedup of one GPU over one CPU core is observed for double precision arithmetic. Finally, for p = 1 basis functions, good scalability was achieved on the Ohio Supercomputer Center's Glenn cluster, with a parallelization efficiency of 85% for up to 40 GPUs and 80% for up to 160 CPU cores.

1. Introduction

[2] Practical applications are often highly complex and usually result in long simulation times. To keep simulation times manageable, especially in the time domain, it is desirable to have a numerical method with a high degree of parallelism that can run on fast hardware. Discontinuous Galerkin (DG) finite element methods [Hesthaven and Warburton, 2002; Fezoui et al., 2005; Montseny et al., 2008; Dosopoulos and Lee, 2010a] have the desired properties for complex time domain simulations. They support various types and shapes of elements, non-conformal meshes and non-uniform degrees of approximation. Because they are discontinuous, they also offer freedom in the choice of basis functions, which provides a great deal of flexibility. Additionally, the resulting mass matrix is block diagonal, with the block size equal to the number of degrees of freedom in the element. Hence the method can lead to a fully explicit time-marching scheme. Another feature of DG is that a large part of the computation is local to each element. Moreover, information exchange is required only between neighboring elements, regardless of the order of the polynomial approximation and the element shape. Thus, as also shown in section 2, DG methods possess what is usually called the locality property.

[3] In this paper we present an approach that exploits this locality property to achieve an efficient parallelization. To benefit from it, DG methods should run on hardware that supports parallelization. At least two choices are available in current hardware technology: multicore CPUs and GPUs. For DGTD, current generation multicore CPUs can offer up to 32 threads (8 quad core sockets) for parallel computation. If more threads are required, multithreading can be applied, but the operating system will have to switch between threads, which could be expensive. Moreover, CPUs offer fast memory access time, but their memory bandwidth could limit performance. The other candidate for DGTD computation is GPUs. Current generation GPUs, e.g., the Tesla 2050, have 14 streaming multiprocessors (SMs). Each SM has 32 streaming processors (SPs), or CUDA cores, and can support up to 1,024 threads. Memory access time on GPUs is usually slower than on the CPU, but GPUs have about an order of magnitude higher memory bandwidth and floating point throughput than their CPU counterparts. Therefore, one can argue that GPUs appear to be a better candidate for DGTD methods.

[4] In the context of DG methods, GPU computing was initially introduced by Klöckner et al. [2009], where a single precision GPU implementation of DGTD was presented. The authors reported a 40×–60× speedup when comparing one GPU against one CPU core. Gödel et al. [2010a] presented their approach to implementing DGTD on GPU clusters and reported a 20× speedup in the solution time for a single precision implementation. In that work the GPU communication was done in a shared memory architecture, since all GPU devices physically reside on the same cluster node, and scalability results were reported for up to 8 GPUs. Furthermore, Gödel et al. [2010b] presented a multirate GPU implementation of DGTD, but it was restricted to only two classes for time marching. In this paper the number of classes is arbitrary, so no such restriction applies to the local time stepping.

[5] The main contributions of this paper are summarized as follows: (a) we present our approach for an MPI/GPU implementation of DGTD on conformal meshes with uniform degrees of approximation. The proposed approach is applicable to large GPU clusters with distributed-memory architecture. Moreover, we report a 10× speedup (one GPU against one CPU core) in double precision arithmetic using Quadro FX 5800 cards; (b) we combine the local time stepping (LTS) algorithm [Montseny et al., 2008] with MPI/GPU to increase efficiency and reduce the computational time for multiscale applications; and (c) we show good scalability and parallelization efficiency, of 85% up to 40 GPUs for the hybrid CPU/GPU implementation and of 80% up to 160 CPU cores for the CPU-only implementation, on the Glenn cluster at the Ohio Supercomputer Center. Note that all reported scalability results are for linear basis functions (p = 1); the scalability for higher order bases will be investigated in the future. Furthermore, the MPI/GPU implementation of DGTD presented in this article is not an optimal one, and a number of improvements (e.g., high orders, mixed orders, higher order time integrators) can be added. Here we simply present our initial, and in our opinion promising, findings for an MPI/GPU implementation of DGTD.

[6] This paper is organized as follows. In sections 2.1 and 2.2 we briefly describe the DGTD space and time discretization that we employ. Next, in section 3 we discuss the computational layout used to map DGTD to a CUDA framework. Following that, in section 4 we describe our MPI/GPU implementation for distributed memory architecture. Finally, in section 5 we present our performance analysis and give numerical examples to illustrate the potential of the proposed MPI/GPU/LTS strategy.

2. DGTD Methodology

[7] In this section we briefly outline the fundamental concepts of our DGTD method, based on an interior penalty formulation by Arnold [1982].

2.1. DGTD-Space Discretization

[8] Let Ω be the computational domain of interest and $\mathcal{T}_h$ the discretization of Ω into tetrahedra K. Following the interior penalty approach described by Arnold [1982] and Dosopoulos and Lee [2010b] we can derive the DGTD formulation in equation (1). Assuming constant material properties within each element we can define the following finite-dimensional discrete trial space: $V_h^k = \{\mathbf{v} \in [L^2(\Omega)]^3 : \mathbf{v}|_K \in [P^k(K)]^3,\ \forall K \in \mathcal{T}_h\}$. Denote by $\mathcal{F}_h$ the set of all faces of $\mathcal{T}_h$, by $\mathcal{F}_h^I$ the set of all interior faces and by $\mathcal{F}_h^B$ the set of all boundary faces such that $\mathcal{F}_h = \mathcal{F}_h^I \cup \mathcal{F}_h^B$. Next, define the tangential trace and projection operators, $\gamma_\tau(\cdot)$ and $\pi_\tau(\cdot)$ respectively, as $\gamma_\tau(\mathbf{u}_i) = \hat{n}_i \times \mathbf{u}_i|_{\partial K_i}$ and $\pi_\tau(\mathbf{u}_i) = \hat{n}_i \times (\mathbf{u}_i \times \hat{n}_i)|_{\partial K_i}$. We define the following trace operators: $\{\{\mathbf{u}\}\} = (\pi_\tau(\mathbf{u}_i) + \pi_\tau(\mathbf{u}_j))/2$, $[\![\mathbf{u}]\!]_\gamma = \gamma_\tau(\mathbf{u}_i) + \gamma_\tau(\mathbf{u}_j)$ and $[\![\mathbf{u}]\!]_\pi = \pi_\tau(\mathbf{u}_i) - \pi_\tau(\mathbf{u}_j)$ on $\mathcal{F}_h^I$. The final formulation can be formally stated as:

[9] Find $(\mathbf{H}, \mathbf{E}) \in V_h^k \times V_h^k$ such that

[Equation (1)]

where the material tensors ε and μ are symmetric positive definite. For the choice e = f = 0, one obtains an energy conservative formulation, with a suboptimal $O(h^p)$ rate of convergence as discussed by Fezoui et al. [2005] and Dosopoulos and Lee [2010a]. If we choose e = 1/(2ZΓ) and f = 1/(2ZΓ), with ZΓ = (1/2)(equation image + equation image), ZΓ = (1/2)(equation image + equation image), we get a lossy formulation. However, in this case an optimal $O(h^{p+1})$ rate of convergence is attained as discussed by Hesthaven and Warburton [2002] and Dosopoulos and Lee [2010a].

2.2. DGTD-Time Discretization

[10] Let us expand the electric and magnetic fields within element $K_i$ in terms of basis functions $\mathbf{w}, \mathbf{v} \in V_h^k$ as $\mathbf{E}(\mathbf{r}, t)|_{K_i} \approx \sum_{n=1}^{d_i} e_n^i(t)\,\mathbf{v}_n^i(\mathbf{r})$ and $\mathbf{H}(\mathbf{r}, t)|_{K_i} \approx \sum_{n=1}^{d_i} h_n^i(t)\,\mathbf{w}_n^i(\mathbf{r})$, where $d_i$ is the number of degrees of freedom in element $K_i$. Separating the w, v testing in equation (1), we obtain a semi-discrete system in matrix form for each element:

[Equations (2) and (3)]

where $e_i$ and $h_i$ are the time dependent coefficient vectors for the electric and magnetic field respectively in element $K_i$, and $j \in \mathrm{neigh}(i)$ indexes its neighboring elements. The above system of first-order differential equations is discretized in time with a leapfrog scheme, which is second-order accurate. The electric field unknowns are evaluated at $t_n = n\,\delta t$ and the magnetic field unknowns are evaluated at $t_{n+1/2} = (n + 1/2)\,\delta t$. The first derivatives are approximated using central differences. Moreover, for the two extra penalty terms arising from the upwind flux formulation, we use the backward approximations $e_{i(j)}^{n+1/2} \approx e_{i(j)}^{n}$ and $h_{i(j)}^{n+1} \approx h_{i(j)}^{n+1/2}$. With an average approximation, i.e., $e_{i(j)}^{n+1/2} \approx (e_{i(j)}^{n+1} + e_{i(j)}^{n})/2$ and $h_{i(j)}^{n+1} \approx (h_{i(j)}^{n+3/2} + h_{i(j)}^{n+1/2})/2$, the system would become globally implicit due to the coupling terms from the neighboring elements. Therefore, the backward approximations in the upwind flux formulation are necessary if we want the time marching scheme to remain explicit. In this way, the fully discretized local system of equations is written as:

[Equations (4) and (5)]

The resulting update scheme is conditionally stable. A stability condition is derived by Fezoui et al. [2005] and Montseny et al. [2008] for a DGTD method based on centered fluxes and linear interpolation for the approximation of the fields: ∀i, ∀j ∈ N(i), ciδti[equation image + equation image max(equation image, equation image)] < equation image, where N(i) is the set of indices of the neighboring elements of Ki. Moreover, Vi is the volume of element Ki, Pi is the perimeter of Ki defined as $P_i = \sum_{f_i} S_{f_i}$ ($S_{f_i}$ is the area of face $f_i$ of Ki) and ci = equation image. As equation image becomes smaller, the stability condition provides a smaller δt. In practical applications, locally refined and/or distorted meshes will result in a very small time step δt. For a standard leapfrog scheme, to guarantee stability for all the elements we must choose $\delta t = \delta t_{\min} = \min_i(\delta t_i)$, i.e., the minimum of all the local δti. Consequently, CPU time will significantly increase. To mitigate this problem, a local time stepping strategy proposed by Montseny et al. [2008] is applied to increase efficiency and reduce the computational time. The set of elements is partitioned into N classes, before the time-marching begins, based on the stability condition. For the kth class, $\delta t_k = (2m + 1)^k\,\delta t_{\min}$. In our case we use m = 1. Finally, a detailed description of LTS is beyond the scope of this paper and can be found in work by Montseny et al. [2008].
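As an illustration of the class partition, the following minimal Python sketch assigns an element to the largest class k whose step $(2m+1)^k\,\delta t_{\min}$ does not exceed the element's local stable step. This assignment rule and the function names `lts_class` and `num_classes` are assumptions made for the example, not the implementation used in this paper; the rule is, however, consistent with the number of classes reported for the two examples in section 5.

```python
def lts_class(dt_local, dt_min, m=1):
    """Return the largest k with (2m+1)**k * dt_min <= dt_local.

    Assumed assignment rule for illustration; classes are numbered from 0.
    """
    ratio = 2 * m + 1
    k = 0
    while ratio ** (k + 1) * dt_min <= dt_local:
        k += 1
    return k


def num_classes(dt_min, dt_max, m=1):
    """Number of LTS classes needed when local steps span [dt_min, dt_max]."""
    return lts_class(dt_max, dt_min, m) + 1


# Time-step bounds quoted in section 5 for the two examples.
print(num_classes(4.6e-12, 2.99e-11))       # coated sphere   -> 2
print(num_classes(1.1166e-14, 5.3834e-13))  # cloaking device -> 4
```

With m = 1 the class steps grow by a factor of 3, so a ratio δtmax/δtmin of about 6.5 needs two classes and a ratio of about 48 needs four.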

3. CUDA Implementation

[11] In this section we describe the approach we follow on the GPU side of our implementation. Our approach is based on ideas similar to the ones discussed by Klöckner et al. [2009] and Gödel et al. [2010a]. The computational layout of the GPU implementation consists mainly of three steps, which are described below and summarized in Figure 1. In what follows, let Mi denote the ith partition of an initial large finite element mesh.

Figure 1.

Mapping of DGTD to CUDA programming model.

3.1. FEM Mesh to CUDA Grid Mapping

[12] Firstly, a CUDA grid is mapped to the finite element mesh of partition Mi. According to the CUDA programming model [Nvidia, 2010], a CUDA grid, dim3 grid(dimX, dimY, dimZ), can be at most two dimensional (dimZ must be equal to 1) although it is declared as dim3. When a CUDA grid with dimensions dimX and dimY is declared, a total of dimX × dimY CUDA thread blocks will be scheduled to run on the hardware at execution time. Therefore, the dimensions dimX and dimY are chosen such that dimX × dimY = Ni, where Ni is the number of elements in partition Mi. The requested thread blocks will be assigned by the CUDA model to run in parallel on the available streaming multiprocessors (SMs) of the GPU device. This constitutes the first level of parallelism. With this mapping we have one thread block for each element, which naturally leads to the next step. Finally, according to Nvidia [2010], each grid dimension is limited to 1 ≤ dimX, dimY ≤ 65,535. Hence, for a CUDA grid with more than 65,535 thread blocks, a two dimensional grid is necessary. Note, however, that up to 65,535² thread blocks can be scheduled to run on the GPU.
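The choice of dimX and dimY can be sketched on the host side as follows. This is a minimal Python illustration, not the code used in this paper: the helper names `grid_dims` and `element_index` are hypothetical, the per-dimension limit is taken as 65,535 (the value for compute capability 1.x devices), and the grid is allowed to be slightly larger than Ni, with surplus blocks doing no work, since Ni does not always factor exactly.

```python
def grid_dims(n_elements, max_dim=65535):
    """Pick (dimX, dimY) with dimX * dimY >= n_elements and both <= max_dim."""
    if n_elements < 1:
        raise ValueError("partition must contain at least one element")
    dim_x = min(n_elements, max_dim)
    dim_y = (n_elements + dim_x - 1) // dim_x  # integer ceiling division
    if dim_y > max_dim:
        raise ValueError("partition too large for a two dimensional grid")
    return dim_x, dim_y


def element_index(block_x, block_y, dim_x, n_elements):
    """Map a block coordinate to an element index, or None for a surplus block."""
    idx = block_y * dim_x + block_x
    return idx if idx < n_elements else None


print(grid_dims(1000))    # fits in one dimension: (1000, 1)
print(grid_dims(100000))  # needs a second dimension: (65535, 2)
```

Inside a kernel the same guard would appear as an early return for blocks whose linear index is not smaller than Ni.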

3.2. Finite Element to CUDA Thread Block Mapping

[13] Secondly, each thread block is mapped to a finite element. Each thread block is responsible for completely updating one element of the FEM mesh of partition Mi. A CUDA thread block can be declared with three dimensions as dim3 block(dimX, dimY, dimZ). The maximum number of threads per block depends on the compute capability of the GPU device. The Quadro FX 5800 cards used in this paper have a compute capability of 1.3. According to Nvidia [2010], cards of compute capability 1.3 allow at most 512 threads per block. Hence we have 1 ≤ dimX × dimY × dimZ ≤ 512. In our case, a one dimensional declaration is sufficient. Therefore, a thread block is chosen as dim3 block(1, BlockSize), where BlockSize is equal to the number of DOFs of the element.

3.3. DOF to CUDA Thread Mapping

[14] After establishing the previous two steps, the third and final step follows naturally. Each thread of a thread block is now responsible for updating one DOF of the finite element. All threads within each thread block are executed in parallel, and this constitutes the second level of parallelism on the GPU side. As mentioned before, $\mathbf{E}(\mathbf{r}, t)|_{K_i} \approx \sum_{n=1}^{d_i} e_n^i(t)\,\mathbf{v}_n^i(\mathbf{r})$ and $\mathbf{H}(\mathbf{r}, t)|_{K_i} \approx \sum_{n=1}^{d_i} h_n^i(t)\,\mathbf{w}_n^i(\mathbf{r})$, so each thread $t_n$ updates the values of $e_n^i$ and $h_n^i$.
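The two levels of parallelism can be emulated serially to make the mapping concrete. In the Python sketch below, which is only an illustration under simplifying assumptions, the outer loop plays the role of the CUDA grid (one iteration per element, i.e., per thread block) and the inner loop plays the role of the threads of a block (thread n produces entry n of a local block matrix-vector product). The names `update_element` and `update_partition` are hypothetical, and the product stands in for any one of the matrix-vector terms of the update equations.

```python
def update_element(block_matrix, field):
    """Emulate one thread block: thread n computes row n of the local product."""
    n_dofs = len(field)
    result = [0.0] * n_dofs
    for thread_n in range(n_dofs):      # these iterations run in parallel on a GPU
        acc = 0.0
        for col in range(n_dofs):
            acc += block_matrix[thread_n][col] * field[col]
        result[thread_n] = acc          # each thread writes exactly one DOF
    return result


def update_partition(matrices, fields):
    """Emulate the grid: block b handles element b, independently of the others."""
    return [update_element(a, x) for a, x in zip(matrices, fields)]


matrices = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
fields = [[3.0, 4.0], [1.0, 5.0]]
print(update_partition(matrices, fields))  # [[3.0, 4.0], [10.0, 2.0]]
```

Because no thread writes a DOF owned by another thread, no synchronization is needed for the writes in this emulated layout.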

3.4. Description of the Leapfrog Update Kernels

[15] Now that the computational layout is established, we continue with a description of the kernels used in our implementation. Every thread block performs the matrix-vector multiplications in the update equations (4) and (5). All the matrix data needed in the update equations are precomputed on the CPU during pre-processing and copied to the GPU's global memory once, before the LTS begins. Memory is also allocated in the GPU's global memory for the time stepping vectors. In this way all the computations in the update equations are performed directly on the GPU. The update at every LTS step is performed by calling one of two functions (LeapFrogE or LeapFrogH) that perform the leapfrog update for the E and H field respectively [Montseny et al., 2008; Dosopoulos and Lee, 2010a]. Two kernel functions are needed for an LTS update step (E or H), as shown in Figure 2. First, a volume kernel LE_Vol_kernel is executed to compute the local contributions. Then, a surface kernel LE_Surf_kernel is launched to compute the neighboring contributions. In both volume and surface kernels, the field data are reused many times in the matrix-vector multiplications. Hence it is beneficial to copy them from global to shared memory, since shared (on-chip) memory has a much faster access time. The Quadro FX 5800 cards that we use have 16 kB of shared memory per SM. Since each SM can run at most 8 thread blocks, we have up to 2 kB per thread block to store field data, which is enough even for high order elements. Thus, both volume and surface kernels use shared memory for the field data in the update equations, while the matrix data are read directly from global memory. Moreover, while the volume kernel is executing, an asynchronous copy with cudaMemcpyAsync transfers the excitation data calculated on the CPU to the GPU, so that the copy overlaps with computation. This is achieved by using the CUDA streams functionality, as shown in Figure 2. In this way, the E update is completed. Likewise, two kernels LH_Vol_kernel and LH_Surf_kernel perform the update for the H field. In the hardware available at OSC (Quadro FX 5800) only one kernel can be executed at a time, so volume and surface kernels are not executed concurrently. However, more recent GPUs like the Tesla 2050 support concurrent kernel execution, and features like this can potentially improve the performance of CUDA codes even further. Finally, data are copied from the GPU to the CPU, through the PCIe2 bus, every m × δtN of simulation time for post-processing and output to disk, where m is some integer and N is the number of classes.
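The shared memory budget quoted above can be checked with a short calculation. The Python sketch below uses the figures given in the text (16 kB per SM, at most 8 resident thread blocks, 8-byte double precision values); the function name `doubles_per_block` is hypothetical, and the element sizes passed to it in the usage lines are illustrative assumptions rather than values from this paper.

```python
def doubles_per_block(shared_bytes_per_sm=16 * 1024, blocks_per_sm=8,
                      bytes_per_value=8):
    """Double precision values that fit in one block's share of shared memory."""
    bytes_per_block = shared_bytes_per_sm // blocks_per_sm  # 2 kB with the defaults
    return bytes_per_block // bytes_per_value


def fits_in_shared(n_field_values, **kwargs):
    """True if n_field_values doubles fit in one block's shared memory share."""
    return n_field_values <= doubles_per_block(**kwargs)


print(doubles_per_block())   # 256
print(fits_in_shared(24))    # illustrative small element: True
print(fits_in_shared(300))   # illustrative oversized request: False
```

A surface kernel that stages both the local and the neighboring field data would count both sets of values against the same 256-value budget.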

Figure 2.

Implementation of one update step of the LTS algorithm on the GPU. Here, Nelements is the number of elements of partition Mi that belong to the class currently being updated.

4. MPI/GPU Parallelization

[16] As described in the previous section, the matrix data are stored in the GPU global memory. Consequently, the size of the problem one can solve on a single GPU is limited by the available memory on the device, and only a few complex applications fit on a single GPU. The Quadro FX 5800 graphics cards used in this paper have 4 GB of global memory (newer cards like Nvidia's Tesla 2070 are reported to have up to 6 GB). If one wishes to solve problems with large numbers of unknowns on a single GPU, then data must be transferred between the CPU and the GPU many times, and these transfers through the PCIe bus will eventually become a bottleneck. Therefore, a domain partition strategy for data parallelism on a multi-GPU platform is required to provide solutions to complicated applications. Moreover, scalable solution strategies need to be considered. For problems for which one GPU device is enough, no communication overhead exists and all optimizations can be made in the GPU part of the code. On the other hand, when using multiple GPUs, neighboring partitions need to share data to complete their computations. These data exchanges between neighboring partitions must be handled properly so that they do not introduce too much latency and restrict scalability. In this article we achieve that by using non-blocking MPI calls, which overlap communications over the network and contribute to a scalable implementation.

4.1. Combined MPI/GPU/LTS Strategy

[17] In this section we describe our strategy to combine the LTS algorithm with an MPI parallelization, as shown in Figure 3. The implementation described below is in some respects a hybrid CPU/GPU implementation. The computationally intensive and highly parallel parts of the implementation are ported to the GPUs. The less intensive parts, as well as the pre-processing and post-processing, are handled on the CPU side. Since the most expensive part is the LTS algorithm itself, our approach focuses on how to speed up the LTS update using GPUs. Our implementation of the LTS follows the approach by Montseny et al. [2008]. The LTS algorithm uses two functions, LeapFrogE(class i, dti) and LeapFrogH(class i, dti), that perform the time updates. These functions are called in a recursive fashion to update each class. The goal and contribution of our approach is to combine the benefits of the LTS approach with the GPU compute capabilities and MPI parallelization. We retain the LTS algorithm unchanged and use multiple processes (CPUs/GPUs) to speed up the computations within each class, as shown in Figure 3. Our approach consists of the following steps:

Figure 3.

Proposed MPI/LTS strategy. A number of MPI processes P0, P1, …, PN work in parallel for every step in the LTS algorithm. Here LE denotes a leapfrog E update and LH denotes a leapfrog H update.

[18] 1. The initial finite element mesh is partitioned using METIS [Karypis and Kumar, 1998] into M partitions, in order to obtain a balanced partition. Each partition Mi is then mapped to an MPI process, and each MPI process is associated with a CPU core or a GPU.

[19] 2. After partitioning, each MPI process will have elements that belong to more than one LTS class. When an LTS update step (either LeapFrogE or LeapFrogH) is performed, all MPI processes having elements that belong to the current class perform the update in parallel, as illustrated in Figure 3.

[20] 3. For every LTS update step, all communications between neighboring partitions are handled with non-blocking MPI calls to improve communication time and scalability.
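The second step above can be illustrated with a small serial sketch: given the LTS class of every element owned by each MPI process, it lists which processes take part in the update of a given class and how many elements each of them updates. The function names `class_counts` and `active_ranks` and the sample data are hypothetical and serve only to illustrate the bookkeeping, not the code used in this paper.

```python
from collections import Counter


def class_counts(element_classes):
    """Number of elements per LTS class on one MPI process."""
    return Counter(element_classes)


def active_ranks(per_rank_classes, lts_class):
    """Ranks owning at least one element of lts_class, with their element counts."""
    active = {}
    for rank, element_classes in enumerate(per_rank_classes):
        count = class_counts(element_classes)[lts_class]
        if count > 0:
            active[rank] = count
    return active


# Hypothetical example: three processes, LTS class of each local element.
per_rank_classes = [
    [0, 0, 1, 1, 1],   # rank 0
    [1, 1, 1, 1, 1],   # rank 1 holds no class-0 elements
    [0, 1, 2, 2, 2],   # rank 2
]
print(active_ranks(per_rank_classes, 0))  # {0: 2, 2: 1}
print(active_ranks(per_rank_classes, 2))  # {2: 3}
```

A partition that balances the total number of elements per process does not necessarily balance the number of elements per class, which is visible in the uneven counts of the sample data.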

4.2. MPI System Design

[21] In this section we continue by describing the coarse-grained level of the parallelization, which is essentially the MPI part of the implementation. For this part, as also shown in Figures 4 and 5, one has at least two possible design choices:

Figure 4.

Use MPI and run one MPI process per GPU.

Figure 5.

Use host threads to run multiple GPUs on each cluster node, and MPI for inter-node communications.

[22] 1. Approach 1: One choice is that every MPI process operates one GPU device or a CPU core. In this design, the size of the MPI communicator is equal to the number of available GPU devices in the cluster.

[23] 2. Approach 2: The second choice is a hybrid MPI/OpenMP approach. We could use OpenMP threads within each node to operate the GPU devices or CPU cores and MPI only for communication between nodes. In this case, the size of the MPI communicator is equal to the number of nodes, and the number of OpenMP threads per node is equal to the number of GPU devices per node.

[24] One might argue that approach two has a potential advantage over approach one. In the second approach, communication between GPUs residing on the same cluster node is done implicitly in a shared memory space. Conversely, in the first case, communication is done using explicit MPI calls, which potentially could be more time consuming. However, the MVAPICH implementation of MPI at the Ohio Supercomputer Center will use shared memory to communicate between MPI processes that reside on the same cluster node. Therefore there is no overhead for Approach 1 compared to Approach 2. Moreover, the first approach requires only one API compared to two APIs for the second approach. Hence, in this paper we present results based on the first approach.

4.3. Partitioning and Communication Setup

[25] Our MPI implementation can be summarized in the steps shown in Figures 6 and 7. First, METIS is used to partition the mesh into M partitions. Then each partition Mi is mapped to an MPI process, with each MPI process handling one GPU device for the MPI/GPU case and one CPU core for the MPI/CPU case, as described in Approach 1 in section 4.2. After the partitioning is done, we set up the communication data structures needed to exchange data between neighboring partitions. All communications are done at the CPU level. After the field data are updated on the GPU, they are transferred to the CPU through the PCIe2 bus. These field data are then communicated, with non-blocking MPI calls, only to the MPI processes that need them to perform the next step in the LTS algorithm. They are then transferred back to the GPU to be ready for processing at the next LTS time step. We use two functions to set up the communications. First, the function ExchangeGhostInfo() identifies the neighbors of each partition and sets up the buffer sizes according to the information shared between neighboring partitions. These buffers are indexed by MPI rank and LTS class. Next, SetupSubdomainLinkage() fills in those buffers. Also, the matrix data in equations (4) and (5) are computed on the CPU during pre-processing and copied to the GPU's global memory once, before the LTS begins. When all communication data structures are set up, the pre-processing is complete and we can proceed to the LTS update. Two communication functions, mpiShareE() and mpishareH(), then use these buffers at each LTS step to share the DOFs of the field data between neighbors, if necessary, as shown in Figure 7. All communication uses the non-blocking MPI calls MPI_Isend and MPI_Irecv to keep communication time low. This is critical, since a good implementation of the communication part leads to better scalability of the MPI code. Finally, each process writes its own local set of data to disk. When the solution process is finished, the post-processing step uses .pvtu files and vtkMergeCells from VTK to merge the partitions back together.
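The indexing of the communication buffers by MPI rank and LTS class can be sketched as a dictionary keyed by (neighbor rank, class). The Python example below is a hypothetical illustration of that bookkeeping only: the function name `ghost_buffer_sizes`, the input format (one entry per interface face) and the number of values exchanged per face are assumptions, not the internals of ExchangeGhostInfo().

```python
from collections import defaultdict


def ghost_buffer_sizes(shared_faces, values_per_face):
    """Size of each send buffer, keyed by (neighbor_rank, lts_class).

    shared_faces: iterable of (neighbor_rank, lts_class) pairs, one per face
    that this partition shares with a neighboring partition.
    values_per_face: assumed number of field values exchanged per shared face.
    """
    sizes = defaultdict(int)
    for neighbor_rank, lts_class in shared_faces:
        sizes[(neighbor_rank, lts_class)] += values_per_face
    return dict(sizes)


# Hypothetical interface of one partition: four shared faces.
shared_faces = [(1, 0), (1, 0), (2, 1), (1, 1)]
sizes = ghost_buffer_sizes(shared_faces, values_per_face=12)
print(sizes)  # {(1, 0): 24, (2, 1): 12, (1, 1): 12}
```

Keying the buffers by class as well as by rank means that, during the update of a given class, only the buffers for that class need to be exchanged with each neighbor.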

Figure 6.

Pre-processing step to set up all necessary buffers for communication between subdomains.

Figure 7.

Flowchart of how one leapfrog E or leapfrog H update of the LTS algorithm shown in Figure 2 is performed in the MPI/GPU implementation.

5. Numerical Examples

[26] In this section we present the results of our performance analysis, as well as a numerical example that shows the capability of the proposed implementation to handle complex problems. For all simulations we used the hardware available at the Ohio Supercomputer Center (OSC). The OSC facilities provide GPU-capable nodes on the Glenn cluster, connected to Quadro Plex S4 CUDA-enabled graphics devices. Each Quadro Plex S4 contains 4 Quadro FX 5800 GPUs, with 240 cores per GPU and 4 GB of global memory per card. The GPU compute nodes in Glenn also contain dual socket quad core 2.5 GHz AMD Opterons and 24 GB of RAM; they communicate through a 20 Gb/s Infiniband ConnectX host channel adapter (HCA). In the OSC Glenn cluster configuration, each compute node has access to two Quadro FX 5800 graphics cards. All implementations were done in C++ on a Red Hat Linux OS. For both the CPU and the GPU versions the MVAPICH 1.1 implementation of MPI at OSC was used, and floating point operations were done in double precision arithmetic. Compilation was done with gcc version 4.1.2, with −O3 optimization. Additionally, BLAS routines from the ATLAS library were used to perform the matrix-vector multiplications in equations (4) and (5) for the MPI/CPU code. For the MPI/GPU code, we used CUDA version 2.3.

5.1. Performance Analysis

[27] In this section we analyze the performance of our implementation. As an example we use a coated sphere with inner radius a = 2 m, outer radius b = 2.25 m, and εr = 2.0, μr = 1, as shown in Figure 8. The sphere is illuminated with a Gaussian pulse with 3 dB bandwidth at f3dB = 300 MHz. The mesh under consideration consists of approximately 1.6M elements, which results in 38M unknowns (p = 1 elements were used). For this particular example, the maximum time step given by the stability condition was δtmax = 2.99 × 10⁻¹¹ s, the minimum was δtmin = 4.6 × 10⁻¹² s, and application of the LTS algorithm resulted in 2 classes. In this study we concern ourselves with a strong scalability analysis, in which the number of unknowns remains the same while the number of processes (and consequently the number of partitions) is gradually increased. To give a more complete picture, we present scalability results for both an MPI/CPU and an MPI/GPU implementation.

Figure 8.

Coated sphere used for the performance analysis.

[28] The scalability results for MPI/CPU are shown in Figure 9. The average iteration wall-clock time for one complete LTS update is plotted against the number of MPI processes. In the MPI/CPU case, as shown also in Figure 9, each MPI process handles one CPU core. In Figure 9 the values for the average iteration time for the ideal curve are calculated as AverageIterationTime(N) = T1/N, where T1 is the average iteration time with one MPI process, and N is the number of MPI processes. One can observe that the ideal values are in good agreement with the actual MPI/CPU average iteration time data. Moreover, for larger numbers of MPI partitions, the MPI average iteration time slowly starts to diverge from the ideal curve. To complete the analysis for the MPI/CPU case, we present the parallel efficiency results shown in Figure 9. A satisfactory parallelization efficiency of 80% is achieved for up to 160 CPU cores.
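The ideal curve and the parallel efficiency can be written as a short calculation. The Python sketch below takes the efficiency as the ratio of the ideal time to the measured time, with the ideal time scaled from a baseline run on n_base processes (n_base = 1 gives the T1/N expression above). The function names are hypothetical, this definition of efficiency is the usual one and is assumed here, and the timings of Table 1 (22 nodes with 1, 2, 4 and 8 CPU cores per node) are used as sample input.

```python
def ideal_time(t_base, n_base, n):
    """Ideal average iteration time on n processes, scaled from the baseline run."""
    return t_base * n_base / n


def parallel_efficiency(t_base, n_base, t_n, n):
    """Ratio of ideal to measured iteration time (1.0 means perfect scaling)."""
    return ideal_time(t_base, n_base, n) / t_n


# Table 1 timings in seconds, keyed by the total number of CPU cores used.
timings = {22: 135.5, 44: 70.30, 88: 38.0, 176: 21.8}
efficiency = {n: round(100 * parallel_efficiency(135.5, 22, t, n))
              for n, t in timings.items()}
print(efficiency)  # {22: 100, 44: 96, 88: 89, 176: 78}
```

For the 2 and 4 cores per node columns this reproduces the 96% and 89% listed in Table 1; for 8 cores per node the same formula gives about 78%, slightly below the 80% listed there.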

Figure 9.

Scalability and parallelization efficiency of MPI/CPU implementation.

[29] Next we consider the results for the MPI/GPU case. Each MPI process controls one GPU device. As mentioned before, each compute node has access to 2 GPUs, so we have two MPI processes per node. Due to the relatively large number of unknowns and the use of double precision, at least 6 GPUs were necessary to run this example. Therefore in Figure 10 the values of the average iteration time for the ideal curve are calculated relative to the 6-GPU run as AverageIterationTime(N) = 6 T6/N, where T6 is the average iteration time with 6 MPI processes (6 GPUs) and N is the number of MPI processes. We can observe that the ideal values are in good agreement with the actual MPI/GPU average iteration time data. Again, for larger numbers of MPI partitions the MPI average iteration time slowly starts to diverge from the ideal curve. To complete the analysis for the MPI/GPU case, we present the parallel efficiency results shown in Figure 10. A satisfactory parallelization efficiency of 85% is achieved for up to 40 GPU devices.

Figure 10.

Scalability and parallelization efficiency of MPI/GPU implementation.

[30] Furthermore, in Figure 11 we report the resulting speedup in the solution time. A 10× speedup is achieved. Note that the speedup values reported in Figure 11 and throughout this paper compare 1 GPU against 1 CPU core. In summary, the proposed MPI/GPU implementation maintains a 10× speedup with 40 GPUs, which is a quite satisfactory performance result. Additionally, it indicates that the proposed MPI/GPU approach can be used to solve complex examples efficiently, as will be shown in the following section. Finally, according to Nvidia [2010], new generation GPUs like the Tesla 20 series have a 7× improvement in double precision performance compared to the previous Tesla 10 series. Thus with future hardware there could potentially be further improvements in speedup.

Figure 11.

Speed up in iteration time when comparing one GPU against one CPU core.

5.2. 3D SRR Metamaterial Cloaking Device

[31] In this section we present our results from a full wave time domain simulation of a metamaterial cloaking device that was originally designed and presented by Schurig et al. [2006]. The device consists of 5 cylinders as shown in Figure 12. Each cylinder is made of split ring resonators (SRRs) whose design parameters are documented by Schurig et al. [2006]. For all the cylinders the SRRs are printed on a RTDuroid 5780 substrate with relative permittivity εr = 2.33. The design frequency for this example is 8.5 GHz as discussed by Schurig et al. [2006]. This example is characterized by a multiscale geometry with complex features, which makes it challenging for time domain simulations. The volume concealed by the cloak in this case has free space material properties. The generated mesh consists of 6,685,671 elements resulting in approximately 150 million unknowns (p = 1 elements were used). A first order ABC was used for the domain truncation. For this example the minimum time step is 1.1166 × 10−14s, the maximum time step is 5.3834 × 10−13s, and the application of the LTS algorithm in this example results in 4 classes. A total of 22 compute nodes, each having 8 CPU cores, 2 Quadro FX 5800 GPUs and 24GB of RAM, were used, all part of the Ohio Supercomputer Center (OSC) Glenn cluster. We ran the same simulation using 2 CPUs per node and also using 2 GPUs per node. All simulations were done in double precision arithmetic. The average LTS time-update iteration time with MPI/GPU was approximately 7.085 seconds. In contrast, the average time step iteration time with MPI/CPU was approximately 70.30 seconds, resulting in a 10× speedup in the solution time as shown in Table 1. The simulation was performed for 6000 LTS updates resulting in 12 hours and 117 hours computational times for MPI/GPU and MPI/CPU respectively. The results of the simulation using the MPI/GPU approach are shown in Figure 13. 
The cloaking device was illuminated with a Neumann pulse with a 3 dB bandwidth of 8–11 GHz (X-band); the field is polarized along the cylinder axis. Figure 13 shows snapshots of the Neumann pulse as it propagates through the cloaking device. The following comments can be made about the performance of the cloaking device. First, there is a small but noticeable scattered field in the forward region induced by the cloak, as seen in Figures 13b–13f. Moreover, in Figure 13f one can clearly observe ripples when the pulse exits the cloaking device. These ripples are caused by the time delay due to the long traveling paths through the SRRs at each cylinder, which distorts the pulse shape. This time delay is consistent with the fact that SRRs are quite narrow-band and that metamaterials built with SRRs are dispersive. The distortion in the pulse shape can therefore be detected and indicates the presence of the cloak as a scatterer, which reduces the effectiveness of the cloaking. Thus the cloak plus the concealed volume appear both to scatter waves and to distort the transmitted field. Consequently, one could conclude that true cloaking does not seem to be achieved in the time domain when the structure is illuminated by a pulse with spectral content in the X-band (8–11 GHz).
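The time step figures quoted for this example can be related to the reported number of LTS classes with a short calculation. The sketch below is illustrative only: it assumes that consecutive LTS classes differ in time step by a fixed factor of 3, which is a common choice for leap-frog-based local time stepping but is an assumption here, not something this section states. The names `CLASS_RATIO`, `lts_class` and `class_time_steps` are hypothetical. Under that assumption, the quoted minimum and maximum time steps yield 4 classes, in agreement with the text.

```python
# Values quoted in the text for the cloaking example.
DT_MIN = 1.1166e-14  # s, minimum time step in the mesh
DT_MAX = 5.3834e-13  # s, maximum time step in the mesh

# ASSUMPTION: each LTS class uses a time step 3x larger than the previous one.
# The factor is not given in this section; it is chosen here for illustration.
CLASS_RATIO = 3


def lts_class(dt_local, dt_min=DT_MIN, ratio=CLASS_RATIO):
    """Return the largest k with ratio**k * dt_min <= dt_local."""
    k = 0
    while dt_min * ratio ** (k + 1) <= dt_local:
        k += 1
    return k


def class_time_steps(n_classes, dt_min=DT_MIN, ratio=CLASS_RATIO):
    """Time step used by each class, from the finest (k = 0) upward."""
    return [dt_min * ratio ** k for k in range(n_classes)]


# The element with the largest stable time step falls in the highest class.
n_classes = lts_class(DT_MAX) + 1
steps = class_time_steps(n_classes)

for k, dt in enumerate(steps):
    # Number of steps class k takes during one step of the coarsest class.
    substeps = CLASS_RATIO ** (n_classes - 1 - k)
    print(f"class {k}: dt = {dt:.4e} s, substeps per full LTS update = {substeps}")
```

With these assumptions, one complete LTS update advances the coarsest class by one step while the finest class takes several smaller steps, which is why the iteration times in this section are reported per complete LTS update rather than per global time step.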

Figure 12.

Geometry and partial view of the mesh of the 3D SRR cloaking device.

Figure 13.

(a–f) A pulse with 8–11 GHz 3 dB bandwidth propagating through the cloaking device. A small but noticeable scattered field in the forward region and ripples, due to time delay, as the pulse exits the cloak reduce the cloaking effectiveness.

Table 1. Average Iteration Time for One Complete LTS Update (4 Classes)
(22 nodes in all cases; the first four columns give CPU cores per node, the last column gives GPUs per node)

                   | 1 core  | 2 cores (a) | 4 cores | 8 cores | 2 GPUs
Iteration Time     | 135.5 s | 70.30 s     | 38 s    | 21.8 s  | 7.085 s
Percent Efficiency | 100%    | 96%         | 89%     | 80%     | -
GPU Gain           | 19      | 9.9         | 5.3     | 3       | -

(a) Boldfacing indicates cases when two CPU cores per node are compared with two GPUs per node, which is the most important comparison for this study.
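The derived rows of Table 1 can be checked from the iteration times alone. The sketch below assumes that the percent efficiency is measured relative to the one-core-per-node run, and that the GPU gain is the ratio of a CPU iteration time to the 2-GPU iteration time. These definitions are inferred from the table values rather than stated in the text, and the function names are hypothetical. Because the reported times are rounded, the recomputed values agree with the table only approximately; the 8-core efficiency in particular comes out a little below the tabulated 80%.

```python
# Average iteration times from Table 1 (22 nodes, one complete LTS update).
CPU_TIMES = {1: 135.5, 2: 70.30, 4: 38.0, 8: 21.8}  # cores per node -> seconds
GPU_TIME = 7.085   # seconds, 2 GPUs per node
N_UPDATES = 6000   # number of LTS updates in the full simulation


def efficiency(cores, times=CPU_TIMES):
    """Parallel efficiency relative to 1 core per node (assumed definition)."""
    return times[1] / (cores * times[cores])


def gpu_gain(cores, times=CPU_TIMES, t_gpu=GPU_TIME):
    """Ratio of CPU iteration time to the 2-GPU iteration time."""
    return times[cores] / t_gpu


def total_hours(t_iter, n_updates=N_UPDATES):
    """Wall-clock time in hours for n_updates iterations of t_iter seconds."""
    return t_iter * n_updates / 3600.0


for cores in sorted(CPU_TIMES):
    print(f"{cores} core(s)/node: efficiency = {100 * efficiency(cores):.0f}%, "
          f"GPU gain = {gpu_gain(cores):.1f}")

print(f"MPI/GPU total: {total_hours(GPU_TIME):.1f} h")
print(f"MPI/CPU total (2 cores/node): {total_hours(CPU_TIMES[2]):.1f} h")
```

The two totals correspond to the roughly 12 hours and 117 hours quoted in the text for 6000 LTS updates.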

6. Conclusion

[32] In this article, we have presented our approach to an MPI/CUDA implementation of the IP-DGTD method for Maxwell's equations. Moreover, we combined the local time stepping (LTS) algorithm with MPI to efficiently address the multiscale nature of most complex applications. The CUDA programming model was used together with non-blocking MPI calls. A 10× speedup compared to CPU clusters is observed for double precision arithmetic. Furthermore, the proposed approach provides satisfactory scalability for both the MPI/CPU and MPI/GPU implementations: a parallelization efficiency of 80% was achieved for up to 160 CPU cores and of 85% for up to 40 GPUs on the Ohio Supercomputer Center's Glenn cluster. Finally, a full wave simulation of a 3D SRR metamaterial cloaking device was performed in the time domain, which shows the ability of the proposed approach to handle complicated, challenging examples. In that simulation the cloak both scatters the incident pulse and distorts the transmitted field, which provides better insight into the time domain behavior of electromagnetic cloaking.