A framework for dense triangular matrix kernels on various manycore architectures

We present a new high-performance framework for dense triangular Basic Linear Algebra Subroutines (BLAS) kernels, ie, triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This work extends the authors' previous work on a single GPU, presented at the EuroPar'16 conference, in which we demonstrated the effectiveness of recursive formulations in enhancing the performance of these kernels. In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized in-place CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion. In addition, a multi-GPU implementation of TRMM and TRSM is proposed, and we show an almost linear performance scaling as the number of GPUs increases. Finally, the algorithmic recursive formulation of these triangular BLAS kernels is in fact oblivious to the targeted hardware architecture. We, therefore, port these recursive kernels to homogeneous x86 hardware architectures by relying on vendor-optimized BLAS implementations. Results reported on various hardware architectures highlight a significant performance improvement against state-of-the-art implementations. These new kernels are freely available in the KAUST BLAS (KBLAS) open-source library at https://github.com/ecrc/kblas.


INTRODUCTION
Basic Linear Algebra Subroutines (BLAS) are fundamental dense matrix operation kernels for implementing high-performance computing applications, eg, ranging from dense direct solvers to eigenvalue solvers and singular value decomposition. Although they are extensively optimized by hardware vendors 1-3 and often available in open-source libraries, 4-6 some BLAS kernels operating on triangular matrix structures, arising from symmetry or from the upper/lower storage formats, do not run at the sustained bandwidth (Levels 1 and 2 BLAS) 7 or peak performance (Level 3 BLAS) of the underlying hardware. Kernels from the latter BLAS family have often been taken for granted in extracting hardware performance, thanks to their high arithmetic intensity, since most of their work reduces to matrix-matrix multiplication operations.
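The arithmetic-intensity argument can be made concrete with a back-of-the-envelope flops-per-byte model. The sketch below is a simplified estimate (it counts each operand as moved exactly once; caches and blocking change the constants) contrasting a Level 3 kernel (GEMM) with a Level 2 kernel (GEMV):

```python
# Rough arithmetic intensity (flops per byte of data moved) for
# double-precision BLAS kernels: Level 3 is compute-bound, Level 2 is
# bandwidth-bound. Each operand is assumed to be moved exactly once.

def gemm_intensity(m, n, k, bytes_per_elem=8):
    """Level 3: C <- A*B + C with A (m x k), B (k x n), C (m x n)."""
    flops = 2.0 * m * n * k
    data = bytes_per_elem * (m * k + k * n + 2 * m * n)  # read A, B; read+write C
    return flops / data

def gemv_intensity(m, n, bytes_per_elem=8):
    """Level 2: y <- A*x + y with A (m x n)."""
    flops = 2.0 * m * n
    data = bytes_per_elem * (m * n + n + 2 * m)  # read A, x; read+write y
    return flops / data

print(gemm_intensity(1024, 1024, 1024))  # 64.0 flops/byte: compute-bound
print(gemv_intensity(1024, 1024))        # ~0.25 flops/byte: memory-bound
```

With hundreds of flops available per byte of bandwidth on modern accelerators, only the Level 3 kernel has enough data reuse to approach peak performance.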
In fact, it turns out that the dense triangular Level 3 BLAS kernels, ie, the triangular matrix-matrix multiplication (TRMM: B = A B) and the triangular solve with multiple right-hand sides (TRSM: A X = B), suffer from performance losses, as demonstrated in a previous work from the authors. 8 To remedy these losses, current high-performance implementations 3,9 rely on an out-of-place (OOP) design of these kernels, which helps expose more parallelism by weakening the thread synchronizations encountered during the Write After Read (WAR) and Read After Write (RAW) data dependency hazards of the TRMM and TRSM kernels, respectively. In particular, we addressed these losses in Charara et al 8 with in-place recursive formulations that cast most of the computations as GEMM calls. In this paper, we extend the work in Charara et al 8 with customized in-place CUDA kernels at the bottom of the recursion, a multi-GPU implementation, and ports to x86 architectures. Recursive formulations have a long history in dense linear algebra; 19,20 recursion has also been applied in the context of speeding up large bandwidth-limited workloads seen during the panel factorizations of dense matrices.

BACKGROUND ON GENERAL RECURSIVE TRMM AND TRSM KERNELS
In this section, we briefly recall the uniform design strategy for recursive high-performance TRMM and TRSM BLAS kernels and motivate the need for further enhancements of these kernels.

Recursive formulations
In Charara et al, 8 we addressed the performance optimization of triangular Level 3 BLAS operations (TRMM and TRSM) on a single GPU and proposed a recursive formulation that converts most of the computations into GEMM calls. The recursive TRMM operation (RecTRMM) can be expressed as follows. Consider the left/lower/transpose case, and let M and N be the numbers of rows and columns of B, respectively. We use the following notation for submatrices: X(r1:r2, c1:c2) is the submatrix of X whose row indices range from r1 to r2 inclusive and whose column indices range from c1 to c2 inclusive (array indices starting at 1). Consider m = M/2, with m preferably rounded to the largest power of 2 less than M, for smoother subsequent recursive splits. The operation B = A^T B then decomposes into B(1:m, 1:N) = A(1:m, 1:m)^T B(1:m, 1:N) + A(m+1:M, 1:m)^T B(m+1:M, 1:N) and B(m+1:M, 1:N) = A(m+1:M, m+1:M)^T B(m+1:M, 1:N), ie, two smaller TRMMs on the diagonal blocks and one GEMM that carries most of the flops.
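The left/lower/transpose recursion can be sketched in pure Python as follows. This is an illustrative CPU sketch only: in KBLAS, the off-diagonal update is a cuBLAS GEMM call and the recursion stops at a tuned block size with a native kernel rather than at single rows.

```python
# Pure-Python sketch of the recursive left/lower/transpose TRMM,
# B <- tril(A)^T @ B, operating in place on the row range [lo, hi).
def rec_trmm_llt(A, B, lo, hi):
    M = hi - lo
    if M == 1:  # leaf: 1x1 triangular block
        B[lo] = [A[lo][lo] * b for b in B[lo]]
        return
    m = 1 << ((M - 1).bit_length() - 1)  # largest power of 2 < M
    mid = lo + m
    # B1 <- A11^T @ B1 (recursive TRMM; must run before B1 is updated below)
    rec_trmm_llt(A, B, lo, mid)
    # B1 += A21^T @ B2 (the bulk of the flops: a plain GEMM in the library)
    for i in range(lo, mid):
        for j in range(len(B[0])):
            B[i][j] += sum(A[r][i] * B[r][j] for r in range(mid, hi))
    # B2 <- A22^T @ B2 (recursive TRMM on the trailing diagonal block)
    rec_trmm_llt(A, B, mid, hi)
```

The ordering matters for in-place correctness: the top block B1 is fully updated (TRMM, then GEMM) while B2 still holds its input values, and only then is B2 overwritten.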

Motivation for further enhancements
As recalled above, the recursive formulation of the TRSM and TRMM kernels converts most of the computations into GEMM calls. Figure 2 illustrates how the innermost calls to the TRSM or TRMM routines at the leaves of the recursion operate on small triangular blocks located on the diagonal, along with a thin and wide portion of the output matrix (ie, a general matrix for TRMM or a matrix containing the right-hand sides for TRSM), where the number of columns N may be much larger than the number of rows M. Profiling shows that RecTRSM loses performance when operating on a small number of right-hand sides (see Figure 3C). This is mostly due to a low hardware occupancy engendered by a limited concurrency. A similar profiling trend for RecTRMM, although less significant, has also been identified. These observations motivate the need for optimizing the native TRMM and TRSM CUDA kernels for small triangular diagonal blocks with thin and wide output matrices, which may potentially improve the overall RecTRSM and RecTRMM performance.

IMPLEMENTATION DETAILS
In this section, we describe the design criteria and implementation of our CUDA-based native TRSM and TRMM kernels to better handle short and wide output matrices, which we then use to enhance the overall RecTRSM and RecTRMM performance on GPUs. We also describe our multi-GPU implementation of these kernels and port the same recursive formulation to Intel x86 CPU architectures using various BLAS libraries.

Data dependencies
We briefly recall the TRSM and TRMM operations to properly design a CUDA kernel for each. Assume M and N are the number of rows and columns of matrix B, respectively. In a left/lower/non-transpose TRSM, A X = B is solved by forward substitution on blocks of rows: each block of the solution X overwrites the corresponding block of B and is then read by the updates of subsequent blocks, giving rise to the RAW data dependencies mentioned earlier.
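The forward-substitution structure, and how the recursive split exposes the RAW dependency through a GEMM update, can be sketched in pure Python (illustrative only; the library version dispatches the update to GEMM and stops the recursion at a tuned block size):

```python
# Pure-Python sketch of the recursive left/lower/non-transpose TRSM,
# solving tril(A) @ X = B in place (X overwrites B) on rows [lo, hi).
def rec_trsm_lln(A, B, lo, hi):
    M = hi - lo
    if M == 1:  # leaf: 1x1 triangular solve
        B[lo] = [b / A[lo][lo] for b in B[lo]]
        return
    m = 1 << ((M - 1).bit_length() - 1)  # largest power of 2 < M
    mid = lo + m
    # Solve A11 @ X1 = B1 (recursive TRSM on the leading block)
    rec_trsm_lln(A, B, lo, mid)
    # B2 -= A21 @ X1 (GEMM; reads X1 only after it is written: the RAW hazard)
    for i in range(mid, hi):
        for j in range(len(B[0])):
            B[i][j] -= sum(A[i][p] * B[p][j] for p in range(lo, mid))
    # Solve A22 @ X2 = B2 (recursive TRSM on the trailing block)
    rec_trsm_lln(A, B, mid, hi)
```

The GEMM update cannot start before X1 is complete, which is exactly the dependency that an out-of-place design relaxes and that the in-place kernels must synchronize around.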

Hardware occupancy
In the context of CUDA programming best practices, 23 the native TRSM and TRMM kernels are designed to maximize hardware occupancy when operating on small diagonal blocks with thin and wide output matrices.

Pseudocodes
Given the design considerations discussed above, Algorithms 1 and 2 present the pseudocodes of our native TRSM and TRMM CUDA kernels, respectively.

KBLAS multi-GPU RecTRSM and RecTRMM
The purpose of a multi-GPU API for TRSM and TRMM is to provide a function call that can replace an existing CPU function call with as little code intrusion as possible, as may be required by some GPU-based libraries. As such, the data is assumed to reside on CPU memory.
The multi-GPU routine should transparently manage the data transfer from/to GPU main memory, thus relieving the developer from this burden, especially since multi-GPU systems represent a challenging platform for high-performance computing.
In fact, the recursive breakdown of our proposed RecTRMM and RecTRSM operations naturally simplifies this process. Since the invocation of GPU kernels is recursively handled in the CPU code, we can hide the data transfer (from/to CPU to/from GPU memory) behind the computations. For that purpose, dedicated CUDA streams, separate from the computation streams, are created to handle data transfer to/from each GPU. Synchronization between data-transfer streams and computation streams is handled by CUDA events. Data transfer from the CPU is invoked at the beginning, based on the recursive splitting depicted in Figure 1, while the sequence of operations is statically predetermined. As soon as the portions of matrices A and B needed for the next computation kernel reach the GPU's main memory, that kernel is launched on its corresponding device. However, this scheme requires all data to fit in GPU memory, which may limit the problem size that can be computed.
To resolve this limitation, we design a two-stage algorithm. The first stage computes the operations using the recursive scheme described above, as long as the GPU memory can fit the data. The second stage is invoked when the data of the next GEMM cannot fit in the aggregate memory of the GPUs, at which point an out-of-core multi-GPU GEMM is invoked for the subsequent GEMM calls. After that, the execution returns to the recursive scheme to process the remaining portions of the recursion. Highly efficient multi-GPU GEMM implementations can be used for this purpose and are available from cuBLAS-XT, 3 BLASX, 6 and KBLAS. 13 To amortize the cost of memory allocation (for subsequent GEMM and TRSM or TRMM calls), we pre-allocate a memory pool large enough for all data needed on the GPUs. The size of the data needed on each GPU is determined by the number of participating GPU devices, across which matrix B is distributed, and by the lower or upper portion of matrix A. With this scheme in place, the scaling of multi-GPU TRSM/TRMM calls is not affected by any cross-GPU communications and may, therefore, be almost linear.
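Because the recursion is deterministic, the full sequence of leaf kernels and GEMM updates can be enumerated before execution, which is what makes prefetching on separate transfer streams possible. A minimal sketch (the function name, tuple format, and leaf size are illustrative, not the KBLAS API):

```python
# Enumerate, in execution order, the operations generated by the recursive
# left-sided split of an M x M triangular operation with N columns in B.
# Since this schedule is known statically, the host can prefetch the
# operands of each upcoming GEMM on dedicated transfer streams while the
# current kernel computes.
def recursion_schedule(M, N, leaf=256):
    ops = []
    def rec(m):
        if m <= leaf:
            ops.append(("LEAF", m, N))  # native TRSM/TRMM kernel
            return
        m1 = 1 << ((m - 1).bit_length() - 1)  # largest power of 2 < m
        rec(m1)                               # leading diagonal block
        ops.append(("GEMM", m - m1, N, m1))   # off-diagonal update, (m, n, k)
        rec(m - m1)                           # trailing diagonal block
    rec(M)
    return ops

sched = recursion_schedule(1024, 512)
# yields 4 LEAF kernels interleaved with GEMMs of shapes
# (256, 512, 256), (512, 512, 512), (256, 512, 256)
```

Each GEMM's memory footprint can be read directly off the schedule, which is how the two-stage algorithm above decides when to switch to the out-of-core GEMM path.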

On Intel x86 CPU architectures
We port the same recursive formulation to Intel x86 CPU architectures, relying on vendor-optimized and open-source BLAS implementations (eg, MKL, OpenBLAS, and BLIS) for the GEMM calls and for the triangular kernels at the bottom of the recursion.

EXPERIMENTAL RESULTS
In this section, we present the performance results of our KBLAS implementations.

Environment settings
Experiments have been conducted on single and multiple NVIDIA K40 GPUs; all subsequent GPU performance plots refer to this hardware unless noted otherwise.

Performance on single GPU
Figure 5 shows the performance gain obtained with the RecTRSM from KBLAS using the new native customized TRSM kernel introduced in this paper, against the previous RecTRSM from KBLAS 8 (Figure 5: performance comparison of TRSM kernels; Figure 6: performance comparison of TRMM kernels on single NVIDIA K40 GPU). The gain for large square matrices, reported in Figure 5C,D, respectively, is minimal due to the high concurrency already engendered by large matrix sizes.
By the same token, the new customized KBLAS TRMM kernel brings an additional performance gain on top of RecTRMM in DP, and mostly in SP (up to 30%), with a small number of right-hand sides, as reported in Figure 6B,A, respectively. The gain is, however, limited for square matrices, as shown in Figure 6C,D, for the same aforementioned reasons.
For all subsequent GPU performance graphs, we refer to RecTRSM and RecTRMM as the KBLAS recursive kernels containing the new customized TRSM and TRMM CUDA kernels, respectively. We also compare the new RecTRSM and RecTRMM kernels, in SP and DP, with a small number of right-hand sides, against the previous corresponding kernels from Charara et al. 8 The gain in performance ranges from 40% to 70% for STRSM, from 10% to 30% for DTRSM, from 20% to 60% for STRMM, and from 60% to 140% for DTRMM. The main goal of this comparison is to show the backward and forward compatibility of our recursive implementations, besides the additional performance improvements.

Performance on multiple GPUs
We report next the performance gains of both RecTRSM and RecTRMM multi-GPU implementations from KBLAS against those from cuBLAS-XT, 3 MAGMA, 9,10 and BLASX. 6 The performance numbers for the latter have been extracted from the paper by Wang et al, 6 which only reports performance on three GPUs. The multi-GPU implementation assumes the matrix data resides in host-CPU memory, and thus, the reported performance numbers account for both data transfer and computation times. Figure 8A,B demonstrates a significant performance gain of KBLAS DTRSM against multi-GPU cuBLAS-XT (up to 2×) and the recent BLASX library (up to 30%), respectively. Figure 8C reports a similar performance gain against the MAGMA library (up to 76%), which is an OOP implementation and does not account for data transfer from CPU memory. On the other hand, Figure 9A,B shows that KBLAS multi-GPU DTRMM can match the OOP implementation from cuBLAS-XT (which also requires twice the memory allocations and data transfers), as well as the performance of the more sophisticated dynamic runtime system from BLASX.

On x86 architectures, the tuning parameter for stopping the recursion has been experimentally determined and set from 256 up to 512 on the CPU, depending on the matrix size. On the Intel Xeon Phi, the tuning parameter is much higher and has been set from 1000 up to 1500 to be the most effective. In particular, Figure 10A-D highlights the performance speedups of KBLAS RecTRMM against the corresponding native TRMM kernels (square/tall-and-skinny) from MKL-CPU (up to 40%/250%), MKL-KNC (up to 100%/300%), OpenBLAS (up to 20%/100%), and BLIS (up to 70%/55%), respectively. Figure 10E,F shows the performance speedups of KBLAS RecTRSM against the corresponding native TRSM kernels (square/tall-and-skinny) from MKL-CPU (up to 50%/70%) and MKL-KNC (up to 23%/120%), respectively.

CONCLUSION AND FUTURE WORK
We have presented new high-performance in-place TRSM and TRMM BLAS CUDA kernels for matrices with a small number of right-hand sides on a single GPU, which enhance the performance of RecTRSM and RecTRMM by up to 70% and 140%, respectively, while outperforming the corresponding cuBLAS kernels by more than 2× and 10×, respectively.
We have also highlighted a multi-GPU implementation of these kernels, which outperforms the corresponding cuBLAS, MAGMA, and BLASX implementations by up to 100%, 76%, and 30%, respectively. We have, in addition, ported the recursive formulations to x86 architectures, where they outperform state-of-the-art vendor and open-source implementations. For future work, we plan to extend our approach to other triangular BLAS kernels and also to investigate the impact of recursive DLA schemes on the recent batched BLAS trend in the scientific community, 24 in the context of machine/deep learning applications. 25 The triangular matrix shapes and the in-place nature of such operations, combined with the need for a batched execution mode, which consists of processing thousands of small and independent operations in parallel, pose a problem that tends to be memory-bound and may greatly benefit from our recursive formulations and low-level optimizations on various manycore architectures.