## 1. Introduction

[2] Practical applications often exhibit a high level of geometric and material complexity and usually result in long simulation times. To achieve efficient numerical simulations, especially in the time domain, it is desirable to have a numerical method with a high degree of parallelism that can run on fast hardware. Discontinuous Galerkin (DG) finite element methods [*Hesthaven and Warburton*, 2002; *Fezoui et al.*, 2005; *Montseny et al.*, 2008; *Dosopoulos and Lee*, 2010a] have the desired properties for complex time domain simulations. They support various types and shapes of elements, non-conformal meshes, and non-uniform degrees of approximation. Because the approximation is discontinuous across element interfaces, they also offer freedom in the choice of basis functions. In this way a great amount of flexibility is available. Additionally, the resulting mass matrix is block diagonal, with each block's size equal to the number of degrees of freedom in the corresponding element. Hence the method can lead to a fully explicit time-marching scheme for the solution in time. Another feature of DG is that a large part of the computation is local to each element. Moreover, information exchange is required only between neighboring elements, regardless of the order of the polynomial approximation and the element shape. Thus, as shown also in section 2, DG methods possess what is usually called the *locality* property.
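To make the consequence of the block-diagonal mass matrix concrete, consider the following minimal sketch (not the authors' code; element count, block size, and the forward-Euler stepping are illustrative assumptions). Because the global mass matrix decouples into one small dense block per element, "inverting" it amounts to inverting each block independently, so an explicit update `u_e += dt * M_e^{-1} r_e` can be carried out element by element, which is exactly the parallelism the text describes.

```python
# Illustrative sketch only: 2 elements, each with 2 local degrees of freedom.
# Each element owns its own small mass block M_e; no global solve is needed.

def invert_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def explicit_step(fields, mass_blocks, rhs, dt):
    """One explicit (forward-Euler) step: u_e += dt * M_e^{-1} r_e,
    performed independently for every element -- the loop body touches
    only element-local data, so each iteration could run in parallel."""
    new_fields = []
    for u_e, m_e, r_e in zip(fields, mass_blocks, rhs):
        m_inv = invert_2x2(m_e)  # block inverse, local to the element
        du = [sum(m_inv[i][j] * r_e[j] for j in range(2)) for i in range(2)]
        new_fields.append([u_e[i] + dt * du[i] for i in range(2)])
    return new_fields

fields = [[1.0, 0.0], [0.0, 1.0]]                      # current solution
mass = [[[2.0, 0.0], [0.0, 2.0]],                      # M_0
        [[1.0, 0.0], [0.0, 1.0]]]                      # M_1
rhs = [[4.0, 2.0], [1.0, 1.0]]                         # residual r_e
print(explicit_step(fields, mass, rhs, 0.1))
```

In a real DGTD solver the residual `r_e` would gather flux contributions from the faces shared with neighboring elements, which is the only inter-element communication required.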

[3] In this paper we present an approach that exploits this *locality* property to achieve an efficient parallelization. To do so, DG methods should run on hardware that supports parallel execution. At least two possible choices are available in current hardware technology: multicore CPUs and GPUs. For DGTD, current generation multicore CPUs can offer up to 32 threads (8 quad core sockets) for parallel computation. If more threads are required, multithreading can be applied, but the operating system will have to switch between threads, which can be expensive. Moreover, CPUs offer fast memory access times, but memory bandwidth is a factor that can limit performance. The other candidate for DGTD computation is the GPU. Current generation GPUs, e.g., the Tesla C2050, have 14 streaming multiprocessors (SMs). Each SM has 32 streaming processors (SPs), or CUDA cores, and can support up to 1,024 threads. Furthermore, in GPUs, memory access time is usually slower than in the CPU, but GPUs also offer roughly an order of magnitude higher memory bandwidth and floating point throughput than their CPU counterparts. Therefore, one can argue that GPUs are a better candidate for DGTD methods.

[4] In the context of DG methods, GPU computing was initially introduced by *Klöckner et al.* [2009], who presented a single precision GPU implementation of DGTD and reported a 40×–60× speedup when comparing one GPU against one CPU core. Furthermore, *Gödel et al.* [2010a] presented their approach to implementing DGTD on GPU clusters and reported a 20× speedup in the solution time for a single precision implementation. In that work, inter-GPU communication relied on a shared memory architecture, since all GPU devices physically resided on the same cluster node, and scalability results were reported for up to 8 GPUs. In addition, *Gödel et al.* [2010b] presented a multirate GPU implementation of DGTD, but their scheme was restricted to only two classes for time marching. In this paper the number of classes is arbitrary, so no such restriction applies to the local time stepping.

[5] The main contributions of this paper are summarized as follows: (a) we present our approach to an MPI/GPU implementation of DGTD on conformal meshes with uniform degrees of approximation. The proposed approach is applicable to large GPU clusters with a distributed-memory architecture. Moreover, we report a 10× speedup (one GPU against one CPU core) in double precision arithmetic using Quadro FX 5800 cards; (b) we combine the local time stepping (LTS) algorithm of *Montseny et al.* [2008] with MPI/GPU to increase efficiency and reduce the computational time for multiscale applications; and (c) we show good scalability and parallelization efficiency of 85% up to 40 GPUs for the hybrid CPU/GPU implementation, and of 80% up to 160 CPU cores for the CPU-only implementation, on the Glenn Cluster at the Ohio Supercomputer Center. Note that all reported scalability results are for linear bases (*p* = 1); scalability for higher order bases will be investigated in the future. Furthermore, the MPI/GPU implementation of DGTD presented in this article is not claimed to be optimal, and a number of improvements (e.g., higher orders, mixed orders, higher order time integrators) can be added. Here we simply present our initial, and in our opinion promising, findings for an MPI/GPU implementation of DGTD.

[6] This paper is organized as follows. In sections 2.1 and 2.2 we briefly describe the DGTD space and time discretization that we employ. Next, in section 3 we discuss the computational layout used to map DGTD onto a CUDA framework. Following that, in section 4 we describe our MPI/GPU implementation for distributed-memory architectures. Finally, in section 5 we present our performance analysis and give numerical examples to illustrate the potential of the proposed MPI/GPU/LTS strategy.