Diagonally-Addressed Matrix Nicknack: How to improve SpMV performance

We suggest a technique to reduce the storage size of sparse matrices at no loss of information. We call this technique Diagonally-Adressed (DA) storage. It exploits the typically low matrix bandwidth of matrices arising in applications. For memory-bound algorithms, this traffic reduction has direct benefits for both uni-precision and multi-precision algorithms. In particular, we demonstrate how to apply DA storage to the Compressed Sparse Rows (CSR) format and compare the performance in computing the Sparse Matrix Vector (SpMV) product, which is a basic building block of many iterative algorithms. We investigate 1367 matrices from the SuiteSparse Matrix Collection fitting into the CSR format using signed 32 bit indices. More than 95% of these matrices fit into the DA-CSR format using 16 bit column indices, potentially after Reverse Cuthill-McKee (RCM) reordering. Using IEEE 754 double precision scalars, we observe a performance uplift of 11% (single-threaded) or 17.5% (multithreaded) on average when the traffic exceeds the size of the last-level CPU cache. The predicted uplift in this scenario is 20%. For traffic within the CPU's combined level 2 and level 3 caches, the multithreaded performance uplift is over 40% for a few test matrices.


Introduction
An operation is called memory-bound, if its performance is limited by the memory bandwidth [Byte/s] of the executing hardware.In that context, it holds that where P i denotes performance [FLOP/s], and traffic i [Byte] accounts for all the memory involved.
Hence, reducing the traffic should directly lead to a performance improvement.It is often assumed that computing the Sparse Matrix Vector (SpMV) product is memory-bound; see, e.g., [5,7,9].The biggest contributor to the overall SpMV traffic is the matrix.Therefore, in the following, we present a technique that allows for the reduction of the storage size of a sparse matrix at no loss of information.
The key ingredient of the new storage technique is the observation that many matrices arising in, e.g., finite element simulations have a very low matrix bandwidth (under certain permutations).That means, potentially after permutation, all non-zero matrix entries are located close to the matrix diagonal.This motivates storing the indices of these entries relative to the matrix diagonal rather than as an absolute position, which we call Diagonally-Adressed (DA) storage.Due to the small matrix bandwidth, the relative indices may be stored in a smaller (integer) data type.This technique is easily applicable to many sparse formats, e.g., Compressed Sparse Rows (CSR), Block CSR (BSR), or (one of the vectors of) Coordinate (COO) storage.In this paper we apply DA storage to the CSR format; obtaining the Diagonally-Adressed CSR (DA-CSR) format; and compare the SpMV performance against our implementation of CSR as well as Intel Math Kernel Library (MKL) [6].
DA storage differs from Diagonal (DIA) storage in that the new technique still requires one index per non-zero, depending on the underlying technique, instead of one index per diagonal.Also, it does not impose a diagonal-major order of the entries, or require a potentially padded and full/dense storage for each of the diagonals.The Recursive Sparse Blocks (RSB) format [10] is another data structure for sparse matrices motivated by a cache-efficient and parallel implementation of the SpMV product.It divides a sparse matrix into a tree structure of sparse blocks, whose leaves are iterated in a Z-or Morton-order.The leaf blocks are stored in the COO, CSR, or Compressed Sparse Columns (CSC) format.
RSB was designed for arbitrary sparse matrices, in particular, matrices without an inherent low bandwidth (under certain permutations).RSB allows 16 bit indices as well, but only for its leaf matrices.Meanwhile, DA-CSR has a conceptually simpler non-recursive design, allowing 16 bit indices throughout, which leads to a much lower overhead in terms of Byte per non-zero.Therefore, DA storage does not directly compete with the RSB format, but could be used in the leaf blocks within the RSB format, to allow for an even smaller index type.
The remainder of this paper is structured as follows.Section 2 applies DA storage to the CSR format.Section 3 describes the selection of test matrices.Section 4 describes how to compute the SpMV product using that new DA-CSR format, and measures the performance of SpMV.We conclude the paper in Section 5.

Diagonally-Addressed Storage
The CSR storage of a matrix A ∈ F nrows×ncols comprises three vectors, cf.Listing 1, where nnz denotes the number of non-zero entries, oindex_t and iindex_t are integer data types, and (right) from the SuiteSparse Matrix Collection [4].
scalar_t is an approximation of F, e.g., IEEE 754 double or float for F = R.The "row pointers" stored in rowptr and "column indices" stored in colids do the bookkeeping imposed by only storing non-zero matrix entries.
The rth entry 0 ≤ rowptr[r] < nnz is the index into colids and values corresponding to the first non-zero of row r, 0 ≤ r ≤ nrows. 1 The ith entry colids[i] is the column index and values[i] is the value of the ith non-zero, 0 ≤ i < nnz.Let w denote the matrix bandwidth of A = (a rc ), i.e. the farthest distance of a non-zero matrix entry from the matrix diagonal, For CSR storage, colids[i] = c i , which lies in the range [0, ncols).For DA-CSR storage, while the DA storage replaces colids to become Observe that colids covers its full range in either storage scheme: [0, ncols − 1] for CSR and [−w, w] for DA-CSR.
Sometimes it is necessary to reduce the bandwidth of a matrix before DA storage can be applied effectively, which we observe in the next example.
Example 2. The matrix GHS_posdef/ldoor from the SuiteSparse Matrix Collection [4] has dimension 952 203 and 46 522 475 pattern entries,2 distributed over a bandwidth of 686 979.On average, this matrix has 49 pattern entries per row.See Figure 2 (left) for its sparsity pattern.Due to its block structure, the original matrix bandwidth is fairly large.Still, a Reverse Cuthill-McKee (RCM) reordering [2] reduces the bandwidth to about 9100, 3 which is only about 1 % of the matrix dimension.Ignoring the colors, the corresponding sparsity pattern would look almost identical to Figure 2 (right).The reduced bandwidth is less than 2 15 = 32 768 and therefore allows for the usage of 16 bit column indices in DA storage, while both the matrix dimension (for plain CSR storage) and the original bandwidth (for naive DA-CSR storage) would require 32 bit indices.(dense) Following Listing 1, the matrix-related traffic amounts to If w may be stored in a smaller (integer) data type than ncols, this allows for a smaller iindex_t to be used.Using an index type half the size nearly halves the bookkeeping traffic.
Example 3. The matrix Janna/Bump_2911 from the SuiteSparse Matrix Collection [4] has dimension 2 911 419 and 127 729 899 non-zeros, distributed over a bandwidth of only 31 343 < 2 15 = 32 768, which is only about 1 % of the matrix dimension.On average, this matrix has 44 non-zeros per row.See Figure 2 for its sparsity pattern.Therefore, standard CSR storage requires 32 bit indices for both oindex_t and iindex_t, which require 11 MiB and 487 MiB in total, respectively.DA-CSR allows for 16 bit iindex_t to be used, which requires only 244 MiB, thus reducing the bookkeeping traffic by 1 − 11+244 11+487 ≈ 48.8 %, or from 4.09 to 2.09 Byte per nnz, irrespective of scalar_t.For 64 bit and 32 bit scalar_t, e.g., IEEE 754 double and float, which in total require 975 MiB and 487 MiB, using DA-CSR instead of CSR results in an overall matrix-related traffic reduction of 1 − 11+244+975 11+487+975 ≈ 16.5 % and 1 − 11+244+487 11+487+487 ≈ 24.7 %, respectively.For matrices with more than a few non-zeros per row, it is therefore reasonable to ignore the effect of oindex_t, i.e. to assume sizeof(oindex_t) = 0.The final percentages of the previous example would then be estimated by 1  6 and 1 4 .Figure 3 shows this approximate reduction in matrix-related traffic by means of formula (5).Note how smaller iindex_t; i.e. lower bookkeeping traffic; yield better approximations of the factor 1  2 observed for dense storage. 4Recall that by equation ( 1) a traffic reduction is tightly coupled with expected performance gains for memory-bound operations.
While the goal of multi-precision algorithms is to exchange scalar_t for a smaller data type (as in, e.g., [1]), i.e. traversing the rows of Figure 3, DA storage focuses on iindex_t, i.e. traversing the columns of Figure 3.However, as the main motivation of multi-precision algorithms is the memory bottleneck, DA storage is expected to enable even larger speedups in that context.CSR using 64 bit scalars and 32 bit indices merely allows for a 3  2 × performance speedup when switching to 32 bit scalars.Meanwhile, if the matrix has a representation in DA-CSR using 16 bit indices, the expected speedup is 5  3 ×.This speedup is much closer to the 2× possible for dense storage, when using a scalar type half the size.

Selection of Matrices
The SuiteSparse Matrix Collection [4] contains 1367 square matrices having a CSR representation using 32 bit indices and a full structural rank. 5Only 993 of these matrices (72.6 %) have a dimension less than 2 15 , i.e. fit into CSR using 16 bit column indices.However, for 1302 matrices (95.2 %) there exists a permutation that reduces the matrix bandwidth to below 2 15 , such that these matrices fit into DA-CSR using 16 bit column indices.These are the matrices we select for further investigation.Some of the investigated matrices are already stored in a bandwidth-reduced way.We applied an RCM reordering [2] to the ones that are not.Unfortunately, our implementation of the RCM permutation has not been able to sufficiently reduce the bandwidth of two matrices (Janna/Long_Coup_dt0 and Janna/Long_Coup_dt6), which reduces the number of matrices to 1300 (95.1 %).

Sparse Matrix Vector Product
Let A denote a matrix, x and y be vectors, and α and β be scalars.The SpMV product denotes the operation y ← αAx + βy, which requires floating-point operations of work W [FLOP].The performance P [FLOP/s] is then defined as the ratio of work W and runtime t, where t denotes the runtime measured in elapsed time.The relative performance w.r.t.some baseline is computed via assuming W candidate = W baseline .The traffic [Byte] of computing the SpMV accounts for x and y on top of the three components of A in (DA-) CSR storage, cf.Listing 1 and formula (5).The throughput [Byte/s] is given by the ratio of traffic and t, and the relative throughput is then computed via traffic candidate traffic baseline which is a scaled form of the relative performance.Refer to Figure 3 for typical and approximate expected traffic ratios.In the following, we aim to verify the predicted6 5 × speedup when replacing 32 bit column indices by 16 bit ones.

Implementation Details and Methodology
A prototypical implementation of the SpMV product for a matrix in DA-CSR format is given in Listing 2. Instead of reversing the index translation in the innermost loop, i.e. computing oindex_t col = row + colids[i], we instead compute a shifted view xshift into the factor x one level up.This replaces nnz oindex_t-additions by nrows pointer-additions.Recall that in C/C++ the memory access x[row + col] is equivalent to *(x + (row + col)).Applying associativity to the computation of the pointer address, we see that this access is also equivalent to *((x + row) + col) and xshift[col].
Remark 4 (Non-Square Matrices).For tall matrices, i.e. nrows > ncols, xshift points to memory outside x, and must therefore never be dereferenced directly.Within Listing 2 however, it will only be dereferenced at an offset col that yields a memory address within x.
The code has been compiled with GCC 10.3.0 using -O3 -NDEBUG -mavx2 -mfma.The benchmarks have been run on an Intel Xeon Skylake Silver 4110 running CentOS 7. threads pinned using taskset7 .Runtime measurements have been taken using nanobench [8] with minEpochTime set to 100 ms, minEpochIterations and warmup both set to 10, using the minimum over 11 epochs.
We measured the performance of a naive implementation (with nnz additions instead of nrows) as well as several implementations akin to Listing 2, optionally using OpenMP with 2, 4, 6, or 8 threads, both for CSR and DA-CSR matrices.Among all implementations executed on the given hardware, the best performing ones were the naive implementation, the one using 3 accumulators, and the one using a single AVX2 accumulator (4 scalars wide) accessing values using unaligned load instructions.In the following, for each matrix, and each storage format tested, we select the implementations and number of threads yielding the highest performance.
For the MKL [6] implementation of the CSR format, we measured single-threaded as well as multithreaded performance.Again, for each matrix we select the number of threads yielding the highest performance.

Numerical Results
For traffic within the size of the L1 cache, a single thread yields the best performance.Up to about 100 KiB, which is well within the size of a single L2 cache, the optimum number of threads increases gradually.For traffic larger than that, the maximum number of threads yields the best performance.This behavior is irrespective of the matrix format and the implementation vendor (ourselves or MKL [6]).
Our implementation of the SpMV product for the CSR format using 32 bit indices performs about the same as the MKL [6], see Table 1. Figure 4 shows the comparison of DA-CSR using 16 bit column indices to CSR.Using DA-CSR shows almost no change for traffic within the combined size of the L2 caches, i.e. up to 8 • 1 MiB.For traffic larger than that, up to the combined size of all caches, i.e. up to about 8 + 11 MiB, we observe a larger than 1.4× speedup.For traffic beyond that, we observe an average speedup of about +17.5 %, which is reasonably close to the expected +20 %.However, the throughput drops slightly, indicating some unused potential on the given hardware.See Table 2 for the comparison of DA-CSR to MKL [6].
Remark 5 (CSR using 16 bit column indices).Recall that 933 matrices have a direct representation in CSR using smaller column indices.The DA-CSR format performs on par with CSR using the same index types for these matrices.

Conclusion and Outlook
Diagonally-Adressed (DA) storage allows to nearly halve the bookkeeping traffic in sparse matrix storage formats, when the matrix bandwidth allows for an index type half the size.On the hardware used, DA-CSR storage with 16 bit column indices improves the single-threaded SpMV performance over CSR storage with 32 bit column indices by more than 17 %, for both our implementation and MKL [6] if the traffic exceeds the size of the L3 cache of the CPU.Meanwhile, DA-CSR performs no worse than CSR when using the same data types.

Example 1 .
which instead lies in the range [−w, w].We illustrate this transformation in the following example.The sample matrix shown in Figure 1 has nrows = ncols = 4, nnz = 5, and w = 1.Its CSR representation is given by

Figure 3 :
Figure 3: Approximate matrix-related traffic reduction when exchanging the data types used to store the matrix scalars (horizontally) or column indices (vertically) of (DA-) CSR storage, as well as dense storage (no indices required).

Figure 4 :
Figure 4: Relative performance and throughput of SpMV using the DA-CSR format with 16 bit column indices w.r.t.CSR using 32 bit column indices as the baseline (iso-scalar).The sizes of the L1d, L2, and L3 CPU caches are marked with vertical lines (left to right).

Table 2 :
[6]rage relative performance of our best SpMV implementation for DA-CSR using 16 bit column indices w.r.t.MKL[6]as the baseline.Values > 1 mean we are faster.