A Proof of Concept for Very Fast Finite Element Poisson Solvers on Accelerator Hardware

It is demonstrated that modern accelerator hardware specialized in AI, e.g., “next gen GPUs” equipped with Tensor Cores, can be profitably used in finite element simulations by means of a new hardware‐oriented method to solve linear systems arising from Poisson's equation in 2D. We consider the NVIDIA Tesla V100 Tensor Core GPU with a peak performance of 125 TFLOP/s, that is only achievable in half precision and if operations with high arithmetic intensity, such as dense matrix multiplications, are executed, though. Its computing power can be exploited to a great extent by the new method based on “prehandling” without loss of accuracy. We obtain a significant reduction of computing time compared to a standard geometric multigrid solver on standard x64 hardware.


Introduction
Graphic cards with AI components, particularly Tensor Cores (TC) specialized in accelerating dense matrix multiplications, have recently gained importance in computer systems. As an example, the NVIDIA Tesla V100 SXM2 promises 125 TFLOP/s in half precision (HP) due to its TCs (31 TFLOP/s in HP without TCs, 16 TFLOP/s in single (SP) and 8 TFLOP/s in double precision (DP)) according to the manufacturer specifications 1 . It is of course desirable to exploit this high performance in HP in basic components of matrix-based finite element (FE) simulations. FE discretizations generally result in very sparse stiffness matrices. If matrix-vector or matrix-matrix products with the Poisson matrix stored in standard CSR format are computed on the V100, the observed performance (10-100 GFLOP/s) is far below the peak rates, as the results in [1,2] show. If, however, dense matrices are used, 1 up to 100 TFLOP/s in HP are attained [1,2]. In short, there is a large gap between actual and theoretical performance but also a huge potential if the hardware is appropriately used, e.g., by the hardware-oriented method explained below. We stick with Poisson's equation (in 2D), which is an important component in many flow simulations, and consider the case of one and many (i.e., a matrix valued linear system AX = B) right-hand sides. Due to currently examined time-simultaneous [3] Navier-Stokes solvers, that allow for solving the pressure Poisson problems for each time step at once, the latter is relevant in practice.

Prehandling of Finite Element Matrices and a Direct Poisson Solver
To construct a solver that is capable of exploiting the hardware mentioned above, there are two conditions to be met. Firstly, the condition numbers of involved matrices need to be reduced because the condition number O h −2 (if h denotes the grid width) of the standard FE Poisson matrix causes high computational errors that prohibit the use of low precision, which is necessary to achieve the V100's peak flop rate in HP, though. To this end, we introduced the concept of "prehandling" [4], an explicit preconditioning without excessively increasing the density of the matrix. The hierarchical finite element method [5], that is based on successive refinement of an initial coarse grid, proved to be successful in the 2D case because it reduces the condition number to O log 2 1 h while preserving sparsity and also accuracy in lower precision as shown in [1,2,4]. Secondly, the solver needs to consist of multiplications with small, dense matrices (or large, sparse ones containing such as blocks) to exploit the TCs of the V100. For this purpose, the nodes of the hierarchical mesh are grouped into three types, the prehandled matrix is renumbered accordingly and a two-step Schur complement is applied (for details see [2]). Consequently, we obtain a direct solver, which we also refer to as PSC method (Prehandling Schur Complement), for Poisson's equation that is based on dense matrix-vector or matrix-matrix products (for many right-hand sides) and thus tailored for next gen GPUs.

Numerical Results Compared to Standard Solvers on Standard Hardware
As a proof of concept, we consider Poisson's equation on the unit square discetized by bilinear finite elements (Q1) on an equidistant mesh of grid width h ∈ 1 256 , 1 512 , 1 1024 (the latter leads to N ≈ 10 6 unknowns). The performance of the direct PSC method on the V100 GPU measured in GFLOP/s is shown in Fig. 1a. The case of many right-hand sides allows for more dense matrix-matrix multiplications and thus yields the highest performance, due to the TCs especially in HP, namely 60 TFLOP/s if h = 1 1024 . This clearly shows the outstanding performance achieved by exploiting the TCs, which would not be available with common ALUs in HP.
However, one can only limitedly draw conclusions on the efficiency of the method from this high performance. Hence, we use the measure "million degrees of freedom solved per second" (MDof/s) that enables a comparison with a standard approach. As such, we consider a FE geometric multigrid (MG) method (implemented in the FEAT3 2 software) including sparse matrixvector operations and in DP -due to the high condition number -on the x64 (multicore) AMD EPYC 7542 CPU 3 . The results of both approaches for many right-hand sides are depicted in Fig. 1b. The MG method requires O(N ) (or as a rule of thumb for the meshes at hand approx. 1, 000N ) operations, whereas, according to an estimate in [2], approx. 12N 3 2 operations are needed for the PSC method if the variable coarse grid width is chosen optimally; an amount of N = 1 million unknowns thus yields 12, 000N operations. The results confirm that the 12 times higher complexity is compensated by the remarkable hardware exploitation of the new method on the V100 GPU leading to lower computing times. In the case of one right-hand side the solution works 8 times and for a matrix of right-hand sides 600 times faster than the standard MG approach while accuracy is preserved.
In preliminary numerical tests on the successor model of the V100, the NVIDIA Ampere A100 TC GPU promising 300 TFLOP/s in HP 4 , we were able to further accelerate the solution by approx. 50% using the direct PSC method [2].

Conclusion and Outlook
We provided a proof of concept of an efficient algorithm resulting from knowledge of modern hardware that benefits from the high performance of GPUs powered by TCs. The presented studies only treat Poisson problems on a simple mesh so far, but this equation is crucial within flow simulations and, according to recent tests, the PSC method is also applicable in the case of partially unstructured, refined triangular coarse grids.
Further research will be focused on extensions to the 3D case, other differential operators and finite element spaces and less storage intensive semi-direct variants of the PSC method.