Impact of mixed precision and storage layout on additive Schwarz smoothers

The growing discrepancy between CPU computing power and memory bandwidth drives more and more numerical algorithms into a bandwidth-bound regime. One example is the overlapping Schwarz smoother, a highly effective building block for iterative multigrid solution of elliptic equations with higher-order finite elements. Two options for reducing the required memory bandwidth are sparsity-exploiting storage layouts and representing matrix entries with reduced precision in floating point or fixed point format. We investigate the impact of several options on storage demand and contraction rate, both analytically in the context of subspace correction methods and numerically on an example from solid mechanics. Both perspectives agree on the favourite scheme: fixed point representation of Cholesky factors in nested dissection storage.


INTRODUCTION
The numerical solution of elliptic partial differential equations by finite element methods is an essential building block and one of the cornerstones of scientific computing, with applications in diffusion processes, solid mechanics, and incompressible fluids. 1,2 If more than moderate accuracy is requested, higher-order finite elements are indispensable for an efficient solution process, with polynomial ansatz orders typically in the range from three to twelve, depending on the problem's regularity properties. Despite the higher-order discretization, the arising linear systems tend to be rather large and sparse, such that, in particular for three-dimensional (3D) problems, iterative solvers are preferable due to the superlinearly growing memory demand of direct sparse solvers.
The natural iterative solver for elliptic problems is the conjugate gradient (CG) method, preconditioned with multigrid or domain decomposition methods, leading to preconditioned CG (PCG) solvers of optimal complexity. Several variants of constructing multigrid hierarchies and smoothers exist. 3,4 Here we focus on one particular scheme, where the hierarchy consists of a step from high-order finite elements to linear finite elements on the same grid, followed by a classical h-multigrid hierarchy. For this to work, the top level smoother must be highly effective. We consider a vertex-centered overlapping additive Schwarz smoother that solves a small to moderately large and rather dense system on the patch around each grid vertex and is therefore a generalized block-Jacobi method. The optimal complexity of this approach has been shown in Reference 5 on quadrilaterals, and extended to simplicial meshes 6 and spectral elements. 7 While this additive Schwarz smoother is highly effective in terms of contraction rate, its application is significantly more expensive than a sparse matrix-vector multiplication with the problem's stiffness matrix, and is therefore the most expensive part of the solver. Over the last decades, the computing power of CPU cores has been growing faster than the memory bandwidth, 8 and thus the smoother application is a bandwidth-bound operation. An acceleration of the smoother application therefore depends mainly on the reduction of the smoother storage size, and to some extent also on the access patterns.
Two of the main aspects affecting smoother storage size are sparsity of the blocks and the representation of matrix entries. These are the aspects we investigate from both a theoretical and an experimental point of view here. Exploitation of sparsity is possible using standard sparse solver techniques, 9 but, as the blocks are only moderately sparse, the trade-offs can be different from usual sparse solver situations.
Inexact representation of the blocks offers additional or alternative means of reducing storage demands. One line of approach comprises structured and low-rank matrix approximations such as hierarchical (ℋ-)matrices. 10 Another one is mixed precision arithmetic, leading to different representations of matrix entries as real numbers. This is attractive since the accuracy provided by the smoother need not be particularly high. Consequently, the use of mixed precision arithmetic has been studied for heterogeneous (see, e.g., Reference 11) and HPC computing in general, and for the iterative solution of linear equation systems in various contexts, 12-16 recently in particular for nonoverlapping block-Jacobi solvers. 17 Beyond mixed precision representations, more sophisticated lossy compression schemes based on transform coding have been proposed for floating point values, see, e.g., References 18,19. These do, however, incur significant computational work for decompression, which can outweigh the bandwidth savings. Therefore, we restrict our attention to simpler mixed precision representations here.
After defining the problem setting and the overlapping Schwarz smoother in Section 2, we discuss different smoother representations in Section 3, exploiting both sparsity and mixed precision and providing some theoretical insight for different ansatz orders. Finally, numerical experiments and performance evaluations are reported in Section 4.

VERTEX-CENTERED ADDITIVE SCHWARZ PRECONDITIONERS
As a model problem, we consider the stationary elliptic PDE −div(σ∇u) = f in Ω on a polyhedral domain Ω ⊂ R^d, with a uniformly positive definite and bounded diffusion tensor σ and a uniformly positive and bounded Dirichlet part α of the Robin boundary condition, such that the associated bilinear form a(u, v) = ∫_Ω ∇u^T σ∇v dx + ∫_{∂Ω} α uv ds is H^1(Ω)-elliptic. Let 𝒯 be a simplicial triangulation of Ω, with vertices 𝒱 and mesh size h = min_{T∈𝒯} diam(T). On 𝒯 we define the ansatz space V_p = {u ∈ H^1(Ω) | ∀T ∈ 𝒯: u|_T ∈ P_p} of globally continuous and piecewise polynomial functions of order at most p. Choosing ansatz functions with smallest support as a basis to be used in a Galerkin discretization transforms (1) into a large, sparse, symmetric positive definite, and ill-conditioned linear equation system Ax = b.

Sparse direct solvers produce superlinear fill-in, 20 in particular for 3D problems, and therefore induce a huge computational effort and often a prohibitive memory demand on large problems. If approximate solutions are sufficient, iterative solvers such as the conjugate gradient method are a viable alternative, but suffer from the large condition number of A. For that reason, symmetric positive definite preconditioners B ≈ A are required, transforming (2) into a system with smaller condition number κ(B⁻¹A) ≪ κ(A). In the optimal case, the condition number is independent of both h and p, such that the number of CG iterations required to reach a given accuracy is bounded independently of the problem size.
One such preconditioner is the overlapping additive Schwarz method with coarse space, analyzed first for quadrilaterals in Reference 5. Let ω_v = ⋃_{T∋v} T denote the patch of simplices incident to the vertex v ∈ 𝒱, and V_v = {u ∈ V_p | supp u ⊂ ω_v} the subspace of finite element functions with support contained in the patch ω_v. Moreover, let the prolongation P_v be a matrix representation of the embedding of V_v into the ansatz space V_p. Local stiffness matrices A_v = P_v^T A P_v are defined by Galerkin projection. Similarly, let P_1 represent the embedding V_1 ⊂ V_p and A_1 be the corresponding stiffness matrix. Then the overlapping additive Schwarz preconditioner B⁻¹ = P_1 A_1⁻¹ P_1^T + ∑_{v∈𝒱} P_v A_v⁻¹ P_v^T solves the problem simultaneously in all subspaces V_v, similar to a block-Jacobi method. Its optimality, that is, the independence of the condition number κ(B⁻¹A) of h and p, has been shown in Reference 6 in the framework of subspace correction methods, 21 see also Reference 1.

Theorem 1. Assume that the following conditions are satisfied.
Stability. There is a constant K_1 such that every u ∈ V_p admits a decomposition u = u_1 + ∑_{v∈𝒱} u_v with u_1 ∈ V_1, u_v ∈ V_v, and a(u_1, u_1) + ∑_{v∈𝒱} a(u_v, u_v) ≤ K_1 a(u, u) holds.
Strengthened Cauchy-Schwarz inequality. There are constants ε_{vw} ∈ [0, 1] with a(u_v, u_w) ≤ ε_{vw} a(u_v, u_v)^{1/2} a(u_w, u_w)^{1/2} for all u_v ∈ V_v and u_w ∈ V_w, defining K_2 = 1 + ||(ε_{vw})||.
Then, the preconditioned system has bounded condition number κ(B⁻¹A) ≤ K_1 K_2. The number of PCG iterations required to reach a relative accuracy of TOL < 1 in the energy norm is bounded by n_it = −(√(K_1 K_2) + 1) log(TOL/2).
The P_1 A_1⁻¹ P_1^T part of the preconditioner is responsible for the independence of h, while ∑_{v∈𝒱} P_v A_v⁻¹ P_v^T leads to the independence of p, see Figure 1. In contrast, the effectivity of nonoverlapping point or block Jacobi and Gauss-Seidel methods, used as smoothers in a multigrid setting or as standalone preconditioners, quickly degrades for decreasing mesh size h or growing p. Despite their smaller computational complexity, the resulting solution time is larger than for the additive Schwarz smoother.

Remark 1. Instead of an exact solve on the coarse space V_1 by A_1⁻¹, a suitable preconditioner B_1⁻¹ could be used. Examples are multigrid preconditioners like BPX introduced by Bramble, Pasciak, and Xu 22 or multilevel domain decomposition methods. 23 Similarly, instead of the exact local solves with A_v, a finer decomposition of the ansatz space into more but smaller subspaces, associated not only with patches around vertices but also with patches around edges and faces, can be used. This reduces the size of the local systems significantly, but tends to lead to larger though still bounded 6 or slowly growing 24 condition numbers. Corresponding overlapping Schwarz smoothers have also been investigated for H(div)- and H(curl)-elliptic problems with conforming discretizations, 25 for grad-div stabilized Navier-Stokes equations, 26 and for Vanka-type smoothers for Stokes systems. 27 The mixed precision storage schemes considered in Section 3 could also be applied to those problems, but require a different analysis than given in Section 3.2.1.
The local matrices A_v are moderately sparse, with a dimension growing as 𝒪(p^d). For two-dimensional (2D) problems, on average six triangles incident to a vertex lead to an expected size of 3p² (in leading order) and (3/2)p⁴ nonzero entries in A_v, which leads to an occupation of the matrices of around 25% for common values of p ≤ 10, see Figure 2 and the Appendix. In 3D problems with about 20 incident tetrahedra, the size is (10/3)p³ with 0.56p⁶ nonzero entries, corresponding to a density of around 10%.
Already for moderate p, factorizing the A_v is therefore the main computational effort during preconditioner construction, and the matrix-vector multiplication involving A_v⁻¹ is the main effort during preconditioner application. In comparison, the effort for construction and application of P_1 A_1⁻¹ P_1^T is negligible. In the subsequent investigations of computational effort, we will therefore concentrate on the Schwarz preconditioner without coarse space.

SMOOTHER STORAGE ORGANIZATION AND DATA REPRESENTATION
The application of the overlapping Schwarz preconditioner (7) consists of matrix-vector products A_v⁻¹ r_v. On today's multicore CPUs, this is typically a memory-bound operation if several threads are active, such that the computing power of floating point units and SIMD (single instruction, multiple data) vector extensions such as SSE or AVX is largely unused. The actual execution time therefore depends to a large extent on the amount of data and the storage organization used for representing A_v⁻¹.
Several different storage schemes are possible: (i) direct storage of all A_v⁻¹ using, for example, the LAPACK formats GE, SY, and P; (ii) storage of ∑_{v∈𝒱} P_v A_v⁻¹ P_v^T as a sparse matrix in, for example, CRS format; (iii) storing Cholesky factors L_v with L_v L_v^T = A_v in one of the LAPACK formats; (iv) storing sparse Cholesky factors along with the pivoting table, using either CRS or a special purpose nested dissection format. Since the local stiffness matrices A_v are of moderate sparsity and size, depending on p and d, we focus on factorizations and do not consider the iterative computation of A_v⁻¹ r_v.
A second dimension of matrix storage that affects the amount of data is the real number representation. Widely used are the IEEE standard floating point representations of double (FP64) or single (FP32) precision, or the half precision (FP16) format that has become popular in machine learning. Besides those, special purpose floating point or fixed point formats with less accuracy and further reduced size may be considered. In contrast to the storage layout variants above, the arithmetic precision affects the accuracy of the preconditioner result, and may lead to an increased number of iterations or even divergence.

Storage layout
The different options of representing A_v⁻¹ sketched above lead to different sizes of the data to be stored, but also to different memory access patterns. While the data size and its impact on run time and energy consumption can be estimated theoretically, the effect of different access patterns is hard to model. We will therefore only estimate the amount of data here and leave the impact of access patterns to Section 4 below. As theoretical models of the mesh connectivity within a patch we use a hexagon in 2D and an icosahedron in 3D. Thus, we have matrix dimensions n = 3p² in 2D and n = (10/3)p³ in 3D, see the Appendix. All data sizes given in the text are in leading order of p only, but are exact in Table 3 and Figures 3 and 4.

Dense storage of A_v⁻¹
The simplest approach is to compute the inverse, which is dense, and store it in one of the common LAPACK formats for symmetric matrices. The number of entries to be stored is 9p⁴ for the GE layout and 4.5p⁴ for the packed P format in 2D. In 3D, the respective numbers are 11.1p⁶ and 5.6p⁶. Symmetric storage SY is in between, as the memory demand is the same as for GE but only half of the entries are actually accessed, as in P. As may be expected, this naive approach is rather inefficient: the memory requirement of GE storage is 6-10 times larger than that for storing the stiffness matrix A itself in 2D, and 15-35 times larger in 3D, see Figure 3. Packed storage improves this by a factor of about two, still incurring a considerable overhead. Since in the CG method the number of applications of B_p⁻¹ and A is the same, and the application is usually memory-bound, reducing the storage size of B_p⁻¹ can be expected to speed up the solution process significantly.

The local matrices A_v overlap each other as submatrices of A, such that storing the inverses A_v⁻¹ individually wastes memory. Computing the sum B_p⁻¹ = ∑_{v∈𝒱} P_v A_v⁻¹ P_v^T explicitly and storing it in a sparse matrix format fuses several entries and thus reduces the memory demand. Moreover, the application of B_p⁻¹ can be parallelized without locking on NUMA systems, as the overlap has already been resolved, and fewer accesses to the right hand side and solution vectors are performed.
The memory savings, however, are quantitatively small, see Figure 3. The storage size in terms of matrix entries is reduced by about 10% in 2D and even less in 3D. Moreover, indices have to be stored for the entries, which adds some overhead.

Dense storage of the Cholesky factor L_v
The dense storage of Cholesky factors in the LAPACK formats GE, SY, and P involves the same amount of memory as the dense storage of A_v⁻¹. However, the entries need to be accessed twice in order to solve the system. If the factor is too large to fit into a cache, this incurs a larger amount of data to be read from memory. This makes dense storage of L_v less attractive than dense packed storage of A_v⁻¹, except that it is more robust with respect to inexact data representation, see Section 3.2.1.

Sparse storage of the Cholesky factor L_v
In contrast to A_v⁻¹, the Cholesky factor L_v is relatively sparse. The popular CRS format, however, incurs a certain overhead of indices to be stored alongside the entries of L_v, and leads to irregular access patterns that may make the use of vector instructions less efficient. Since the local matrices A_v and hence L_v are only moderately sparse, special purpose block-structured approaches might be an attractive alternative, exploiting most of the potential sparsity while preserving SIMD-friendly access patterns. One such method is nested dissection, 28 with just one or two levels, decomposing the degrees of freedom into two disjoint sets of roughly the same size, separated by a small third set, which induces a large zero block in the reordered matrix and its Cholesky factor. The number of entries in L_v for a single level of dissection is then in leading order 2.3p⁴ in 2D and 2.8p⁶ in 3D, reducing the memory demand by about a factor of two compared to packed storage of A_v⁻¹ and reducing the bandwidth demand to the same level.
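Schematically, ordering the two disjoint index sets first and the separator last gives the following block pattern of the reordered A_v and of its Cholesky factor (a sketch with generic block names, one dissection level):

A_v = [ A_11   0     A_13 ]        L_v = [ L_11              ]
      [ 0      A_22  A_23 ]              [ 0      L_22       ]
      [ A_31   A_32  A_33 ]              [ L_31   L_32  L_33 ]

Since the two sets are not directly coupled, the zero off-diagonal block is preserved by the factorization, and fill-in is confined to the separator rows.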
Note that a suitable permutation into nested dissection format can be stored with two bits per degree of freedom and dissection level, by assigning each degree of freedom to one of the three groups and keeping the relative ordering within each group, as sketched below. If the blocks S_1 and S_2 are treated as dense, the inverse L_v⁻¹ has the same sparsity structure, such that L_v⁻¹ can be stored with the same amount of data.
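As an illustration of the two-bit encoding, the following short sketch (hypothetical helper names, not taken from the paper's implementation) packs the group labels with two bits each and reconstructs the permutation by a stable ordering within the three groups:

#include <cstdint>
#include <cstddef>
#include <vector>

// Build the dissection permutation from per-DOF group labels
// (0 = first block S_1, 1 = second block S_2, 2 = separator).
// Keeping the relative order within each group makes the permutation
// reconstructible from the labels alone.
std::vector<int> permutationFromGroups(const std::vector<uint8_t>& group)
{
  std::vector<int> perm;
  perm.reserve(group.size());
  for (uint8_t g = 0; g < 3; ++g)                      // groups in order S_1, S_2, separator
    for (int i = 0; i < static_cast<int>(group.size()); ++i)
      if (group[i] == g)
        perm.push_back(i);                             // stable within each group
  return perm;
}

// Pack the labels with two bits per degree of freedom and dissection level.
std::vector<uint8_t> packGroups(const std::vector<uint8_t>& group)
{
  std::vector<uint8_t> packed((group.size() + 3) / 4, 0);
  for (std::size_t i = 0; i < group.size(); ++i)
    packed[i / 4] |= static_cast<uint8_t>((group[i] & 0x3) << (2 * (i % 4)));
  return packed;
}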
A second option is static condensation, 29 essentially a non-nested dissection approach with vertex, edge, and facet degrees of freedom forming the separator. The elimination of the cell-associated degrees of freedom does not induce any fill-in. This leads to a block-sparse pattern of L_v where fill-in is created only when degrees of freedom associated with edges or faces are eliminated. The resulting storage size is optimal in the two leading orders of p: 0.75p⁴ in 2D and 0.28p⁶ in 3D. While this is asymptotically optimal for large p, the asymptotics set in rather late, and the scheme creates many rather small dense blocks, for each of which indices need to be stored. For small to moderate p, the access pattern is therefore less suited for SIMD processing than the nested dissection approach. Combining static condensation and one level of nested dissection can reduce the required storage further, in particular for small orders p, at the expense of creating even more and smaller blocks in L_v, as well as increasing the implementation complexity.
Compared to the naive dense storage of A_v⁻¹, the sparse representations of L_v promise a significant reduction of storage size as shown in Figure 4, and hence shorter run times and lower energy consumption. Still, the amount of data to be read from memory is about 2-5 times larger than for the stiffness matrix.

Mixed precision storage on CPUs
The sparser storage schemes for B_p reduce the amount of data simultaneously with the number of floating point operations, but still require 3-5 times the memory size of the global stiffness matrix A for usual ansatz orders. A further reduction in memory size is possible with smaller representations of the matrix entries, leading to mixed precision approaches. As in Reference 17, we focus on the storage of the preconditioner in reduced precision, assuming that the factorization of A_v is performed in exact arithmetic, or at least sufficiently accurately such that the errors are negligible. Unfortunately, reduced precision storage affects the quality and hence the effectiveness of the preconditioner.

Impact of reduced precision storage
In the following we will derive bounds on the representation error such that the efficiency of the preconditioner, in terms of a bound on the iteration count, deteriorates only by a certain factor. Unless stated otherwise, matrix norms will be the spectral norm.

Lemma 1.
In addition to the assumptions of Theorem 1 (existence of K_1 and of the strengthened Cauchy-Schwarz constants ε_{vw}), let a, b ≥ 1 be such that the two-sided bound (8) on the inexact local solves holds for all vertices v and all u ∈ V_v. Then there are θ, η ∈ [0, 1] such that the preconditioner B̃ leads to a condition number bounded by (9), where K_2 = 1 + ||(ε_{vw})|| as in Theorem 1. The bound on the iteration count is ñ_it ≤ s n_it with s given by (10).

Proof. In order to apply Theorem 1 to B̃ instead of B, we bound the stability term (11). Depending on the relative magnitude of the two terms, it can be bounded by a convex combination of K_1⟨v, Av⟩ and bK_1⟨v, Av⟩. Thus, there exists some θ ∈ [0, 1] such that (11) is bounded by (θ + (1 − θ)b) K_1⟨v, Av⟩. For the strengthened Cauchy-Schwarz condition, we obtain a corresponding bound involving the factor a. Applying Theorem 1 yields the condition number bound and thus (9), where we again have the existence of some convex combination parameter η ∈ [0, 1] weighting 1 and a||(ε_{vw})||. From that, (10) follows directly from Theorem 1. ▪

In Lemma 1 above, the condition number bound (9) always holds for θ = η = 1. Depending on the problem, its discretization, and the space splitting, smaller admissible values may exist. These will, however, be hard to obtain a priori, but may be reconstructed a posteriori, see Section 4.2.
Remark 2. The factor s of Lemma 1, which will also show up in Theorems 2 and 3 below, is to be interpreted with care. It does not describe a bound or estimate for the increase in iteration count due to inexact storage, but a bound for the increase of the theoretical upper bound of the iteration count. While this indeed provides a bound for the total iteration count, it does not bound the ratio of the actual iteration counts.

Error bounds for storing A_v⁻¹
Instead of storing A_v⁻¹, we store Ã_v⁻¹ = A_v⁻¹ + E_v with a symmetric rounding error matrix E_v.

Theorem 2. Let the representation errors E_v be bounded according to (12) for some ε < 1. Then the preconditioner B̃ is positive definite, and there are θ, η ∈ [0, 1] such that its bound on the PCG iteration count satisfies ñ_it ≤ s n_it with s given by (13).

Proof. For brevity, we drop the subscript v. Due to A being positive definite and E being symmetric, Ã⁻¹ has a real spectrum. Assumption (12) yields two-sided bounds on ⟨w, Ã⁻¹w⟩ in terms of ⟨w, A⁻¹w⟩ for all w. Thus, (8) is satisfied for all v. Applying Lemma 1 yields (13) and hence the claim. ▪

For a small predicted increase in iteration count, that is, s ≈ 1, Assumption (12) is roughly a relative accuracy requirement in terms of the condition number of A_v, similar to the findings in Reference 17.
We observe, however, that the accuracy is relative to ||A_v⁻¹||, and not component-wise. This suggests that a fixed point number representation is sufficient for storing A_v⁻¹, and is even more memory-efficient since no individual exponents need to be stored.

3.2.3 Error bounds for storing L_v

Now we store L̃_v = L_v + E_v instead of L_v, again with a rounding error matrix E_v.

Theorem 3. Let the representation errors E_v = L̃_v − L_v be bounded according to (15) for some ε < 1. Then the preconditioner B̃ is positive definite, and there are θ, η ∈ [0, 1] such that its bound on the PCG iteration count satisfies ñ_it ≤ s n_it, with s given by (16) containing the factor 1 + ε(2 + ε).
Proof. Again we omit the subscript v. A two-sided bound on ⟨w, L̃L̃^T w⟩ in terms of ⟨w, LL^T w⟩ holds for all w and implies (8) with v = L^{-T}w. Due to (15), the equivalence factor is bounded by 1 + ε(2 + ε), which yields (16). ▪

As before, for s ≈ 1 the required normwise relative precision is governed by the condition number, as expressed in (17).
Due to κ(A_v) = κ(L_v)² ≥ 1, the accuracy requirement (17) is less restrictive than (14). Since κ(A_v) grows at least like 𝒪(p²), using Cholesky factors should gain efficiency over using inverses in terms of reduced precision storage for growing ansatz order p.
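This relation between the two condition numbers follows directly from the factorization; in the spectral norm,

||A_v|| = ||L_v L_v^T|| = ||L_v||²,    ||A_v⁻¹|| = ||L_v^{-T} L_v⁻¹|| = ||L_v⁻¹||²,

and hence κ(A_v) = ||A_v|| ||A_v⁻¹|| = (||L_v|| ||L_v⁻¹||)² = κ(L_v)².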

3.2.4 Floating point representation

The IEEE 754 floating point standard specifies the number of bits in the exponent (R) and the mantissa (P) to be R_s = 8 and P_s = 23 for the single type (float), and R_d = 11 and P_d = 52 for the double type (double), respectively. Besides the cases where the accuracy provided by either the single or the double type is insufficient, there are those where not all bits are needed to hold a given value, for example, if a loss of accuracy is acceptable within some limits. Based on the single/double type, any representation FP R,P with R ≤ R_s/d and/or P ≤ P_s/d can be easily implemented by a sequence of arithmetic and bit operations, together with an optional rescaling to cope with the range of FP R,P. On most general purpose computer platforms, however, these floating point representations cannot be processed in hardware. Implementing the conversion between the single/double type and FP R,P adds to the actual calculation and has to be integrated on a fine-grained level, because otherwise data is moved twice. The benefit thus lies only in the reduced memory footprint as a result of the implicit data compression. Table 1 gives a brief overview of the supported floating point formats on x86 (Intel), ARM, and (GP)GPU platforms/architectures. While the latter two architecturally allow for half type processing (depending on the actual model), support for the half type on x86 platforms is restricted to conversion combined with load/store, and can be used through intrinsic functions only.
For general FP R ≤ 11, P ≤ 52 formats, bits can be packed into suitably large fundamental C/C++ data types or bitstreams to implement the load/store operation. Listing 1 shows a possible implementation of the compress (and store) operation that processes a sequence of n floating point numbers of type T, packs their FP R, P representations into 64-bit words in groups of size 64/(1 + R + P), and writes these packs to the output until all words are done. The down_convert<T, R, P> function implements the elementwise conversion (or recoding) from T to FP R, P . The (load and) decompress operation works the other way around using the up_convert<T, R, P> function to convert elementwise from FP R, P to T.
A reduced exponent range due to R ≪ 8 can lead to overflow and hence huge representation errors. Despite requiring large exponents, the entries may nevertheless be of comparable magnitude: local mesh widths or PDE coefficients, for instance, tend to affect all entries in a similar way. Thus, a rescaling of the input F = {f_1, … , f_n | f_i ∈ R} can be applied before and after the elementwise conversion in the compress and decompress functions, respectively. For that to work, the input has to be scanned first in the compress function to find the maximum absolute value f_max = max_{f∈F} |f|. The rescaling in the compress function then happens with s = l_{R,P}/f_max (not implemented in Listing 1), where l_{R,P} is the largest number that can be represented in FP R,P. In the decompress function the rescaling factor 1/s is used.

Listing 1: Pseudo code for the compression of an input data stream of n floating point numbers of type T into an output data stream using 64-bit word packs, each holding multiple compressed words.
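The listing itself is not reproduced here; the following is a minimal sketch of such a compress routine, specialized to T = float for brevity. The recoding in down_convert (mantissa truncation, exponent narrowing with saturation, no rounding, subnormal, or NaN handling) is an assumption illustrating the idea rather than the authors' implementation.

#include <cstdint>
#include <cstring>
#include <cstddef>

// Recode an IEEE single precision value into a (1 + R + P)-bit FP R,P word:
// keep the top P mantissa bits and narrow the exponent to R bits with
// saturation (P <= 23, R <= 8 assumed).
template <int R, int P>
uint64_t down_convert(float value)
{
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));           // bitwise view of the float
  uint64_t sign = bits >> 31;
  int exponent  = int((bits >> 23) & 0xff) - 127;     // unbias the IEEE exponent
  uint64_t mant = (bits & 0x7fffff) >> (23 - P);      // truncate to P mantissa bits

  int bias = (1 << (R - 1)) - 1;                      // bias of the narrow format
  int e = exponent + bias;
  if (e < 0) { e = 0; mant = 0; }                     // underflow: flush to zero
  if (e > (1 << R) - 1) e = (1 << R) - 1;             // overflow: saturate

  return (sign << (R + P)) | (uint64_t(e) << P) | mant;
}

// Compress n floats into 64-bit packs, each holding 64 / (1 + R + P)
// compressed words, and write the packs to the output.
template <int R, int P>
void compress(const float* input, std::size_t n, uint64_t* output)
{
  constexpr int bits = 1 + R + P;                     // bits per compressed value
  constexpr int perWord = 64 / bits;                  // values per 64-bit pack
  for (std::size_t i = 0; i < n; )
  {
    uint64_t pack = 0;
    for (int j = 0; j < perWord && i < n; ++j, ++i)
      pack |= down_convert<R, P>(input[i]) << (j * bits);
    *output++ = pack;
  }
}

The rescaling with s = l_{R,P}/f_max described above would be applied to each value before down_convert and is omitted here, as in Listing 1.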
From a performance perspective, Listing 1 can serve as the default case for general R and P values. However, it will most likely suffer from the recoding of the exponent and mantissa (in [up|down]_convert) using bit operations, and specializations for specific R and P values that enable simplifications and/or advanced optimizations are needed to gain performance over using T. Listing 2 implements the conversion from a C/C++ single type (float) to FP 8,7 (or bfloat16 in the field of AI and deep learning) by removing the lower 16 bits of the mantissa. The elementwise down-conversion and the packing of the FP 8,7 representations into 64-bit words is replaced by accessing the input sequence with stride 2 and writing to the output with stride 1. In the corresponding decompress function, the strided data access is the opposite, and the lower 16 bits of each output element are zero-filled.
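Listing 2 is likewise only described in the text; a minimal sketch of the float-to-bfloat16 compress and decompress pair under the stated assumptions (little-endian layout; the strided 16-bit view is realized here by type punning, which a production version would replace by memcpy or std::bit_cast):

#include <cstdint>
#include <cstddef>

// Compress: keep only the upper 16 bits of each IEEE single precision value
// (sign, 8 exponent bits, top 7 mantissa bits), i.e. its bfloat16
// representation. On little-endian systems the upper half of input[i] is the
// 16-bit word at index 2*i + 1, so the input is read with stride 2 and the
// output written with stride 1.
void compress_bf16(const float* input, std::size_t n, uint16_t* output)
{
  const uint16_t* in16 = reinterpret_cast<const uint16_t*>(input);
  for (std::size_t i = 0; i < n; ++i)
    output[i] = in16[2 * i + 1];
}

// Decompress: place the stored 16 bits into the upper half of each float and
// zero-fill the lower 16 mantissa bits.
void decompress_bf16(const uint16_t* input, std::size_t n, float* output)
{
  uint16_t* out16 = reinterpret_cast<uint16_t*>(output);
  for (std::size_t i = 0; i < n; ++i)
  {
    out16[2 * i]     = 0;          // lower mantissa bits
    out16[2 * i + 1] = input[i];   // sign, exponent, top mantissa bits
  }
}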

Fixed point representation
According to the analysis in Section 3.2.3, fixed point representations might be a preferable alternative to floating point types. For a fixed point representation with k > 0 bits, we define the quantization function q in (18a) and its dequantization q⁺ in (18b). The quantized values can be stored as integers with k bits, and the quantization error is bounded by |q⁺(q(c)) − c| ≤ 2^{−(k+1)} (c_max − c_min). Using the fixed point approximation C̃ := q⁺(q(C)) with c_min = min_ij c_ij and c_max = max_ij c_ij for any dense matrix C ∈ R^{m×m}, we obtain the worst case estimate (19). Note that the factor one between the spectral norm and the Frobenius norm in the first inequality is sharp, but on average a much smaller ratio of 𝒪(m^{−1/2}) is expected, leading to the better average case estimate (20).
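The precise quantization rule (18a), (18b) is not reproduced above; the following is a minimal sketch of one affine k-bit quantization consistent with the stated error bound (truncation to 2^k levels, reconstruction at bin midpoints). The names and the exact rounding convention are assumptions, not necessarily the authors' choice.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstddef>
#include <vector>

// Affine k-bit fixed point codec for values in [cmin, cmax]: quantize by
// truncation to {0, ..., 2^k - 1} and reconstruct at the bin midpoint, so
// that |dequantize(quantize(c)) - c| <= 2^-(k+1) * (cmax - cmin).
struct FixedPointCodec
{
  double cmin, cmax;   // value range of the matrix (or block), cmax > cmin
  int k;               // bits per entry

  uint32_t quantize(double c) const
  {
    double scale = std::ldexp(1.0, k) / (cmax - cmin);            // 2^k / range
    double idx = std::floor((c - cmin) * scale);
    return static_cast<uint32_t>(std::clamp(idx, 0.0, std::ldexp(1.0, k) - 1.0));
  }

  double dequantize(uint32_t i) const
  {
    double step = (cmax - cmin) * std::ldexp(1.0, -k);            // range / 2^k
    return cmin + (i + 0.5) * step;                               // bin midpoint
  }
};

// Example: quantize a dense matrix stored row-major, using the global
// min/max of its entries as the fixed point range.
std::vector<uint32_t> quantizeMatrix(const std::vector<double>& c, FixedPointCodec& codec)
{
  auto [lo, hi] = std::minmax_element(c.begin(), c.end());
  codec.cmin = *lo;
  codec.cmax = *hi;
  std::vector<uint32_t> q(c.size());
  for (std::size_t i = 0; i < c.size(); ++i)
    q[i] = codec.quantize(c[i]);
  return q;
}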

NUMERICAL EXPERIMENTS
In the following, we will investigate how the expected speedup translates into practice. First we analyze the run times of BLAS level 2 operations using reduced precision representations, and then turn to run times and iteration counts of additive Schwarz preconditioners.

BLAS level 2 operations with reduced precision
Among the performance-critical matrix-vector operations used for the implementation of the preconditioner are general matrix-vector multiplication (gemv), triangular (packed) matrix-vector multiplication (tpmv), symmetric (packed) matrix-vector multiplication (spmv), and triangular (packed) solve (tpsv). Library implementations of these BLAS level 2 operations, for example, Intel MKL, provide interfaces that allow for single and double data type processing. Using these implementations together with the reduced precision floating or fixed point representations therefore requires an upconversion (decompression) step before the actual application of the matrix-vector operation.
In order not to spill the cache during decompression, a block representation of the compressed matrix C̃ is required. That is, C̃ is partitioned into blocks C̃^□_ij ∈ R^{m_□ × m_□} for i, j = 1, … , ⌈m/m_□⌉, see (21). BLAS operations on C̃ then become a sequence of BLAS operations on these blocks. The block size m_□ should be of moderate size so that the blocks fit into the CPU cache; otherwise, data would be moved from main memory and back again while decompressing, thereby rendering the benefit of the compression zero. The block representation has the additional benefit that the range c_max − c_min of the entry values can be stored individually for each block, resulting in higher accuracy in (19) compared to a global quantization, at the expense of storing two additional floating point numbers per block. However, blocks should not be too small, as the overhead for calling into the BLAS library increases with the number of blocks, and processing too small blocks might drop the performance. The block size m_□ is hence a target for machine- and context-specific optimization.
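To illustrate the blockwise processing on the floating point path, here is a minimal sketch assuming bfloat16-compressed square blocks (as in Listing 2) and the standard CBLAS interface; handling of partial edge blocks, per-block rescaling, and the packed or triangular variants is omitted:

#include <cstdint>
#include <cstring>
#include <cstddef>
#include <vector>
#include <cblas.h>

// Expand one m_b x m_b block of bfloat16 entries (upper 16 bits of the
// corresponding float) into a double precision buffer.
void decompressBlock(const uint16_t* block, int m_b, std::vector<double>& dense)
{
  for (int i = 0; i < m_b * m_b; ++i)
  {
    uint32_t bits = uint32_t(block[i]) << 16;   // zero-fill the lower mantissa bits
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    dense[i] = f;
  }
}

// y += C~ * x applied block by block: each compressed block is expanded into
// a small cache-resident buffer and handed to a standard BLAS level 2 call.
// Full square blocks are assumed (m divisible by m_b); y must be initialized
// by the caller.
void blockedGemv(const uint16_t* compressed, int m, int m_b,
                 const double* x, double* y)
{
  std::vector<double> block(std::size_t(m_b) * m_b);
  int nBlocks = m / m_b;
  for (int i = 0; i < nBlocks; ++i)
    for (int j = 0; j < nBlocks; ++j)
    {
      const uint16_t* cij = compressed
          + (std::size_t(i) * nBlocks + j) * std::size_t(m_b) * m_b;
      decompressBlock(cij, m_b, block);
      cblas_dgemv(CblasRowMajor, CblasNoTrans, m_b, m_b,
                  1.0, block.data(), m_b,
                  x + std::size_t(j) * m_b, 1,
                  1.0, y + std::size_t(i) * m_b, 1);
    }
}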
For BLAS level 2 operations it is known that they are bandwidth-bound, so implementing them ourselves should not hurt performance much compared to optimized library implementations. For fixed point representations with k bits, we can directly operate on the integers if k matches their word size. The matrix-vector multiplication a ← C̃b (with the blocking induced by (21)) then reads as (22), with q⁺ according to (18b), and a^□_i and b^□_j being the segments of a and b needed for the blockwise BLAS level 2 operations.

Figure 5 shows the performance of the aforementioned BLAS level 2 operations on an Intel Xeon Gold 6138 (Skylake) compute node (2 × 40 logical CPU cores, 192 GiB main memory, and 80 threads running) for matrices of size m × m with m = 447 (a small matrix size that is not a power of 2) and m = 2048 (a large matrix), varying block size m_□, and different floating point and fixed point representations. Floating point representations use Intel's MKL 2019 for double type BLAS level 2 operations on the blocks (after an explicit upconversion from either FP 8,7 or FP 8,23 to FP 11,52), whereas an implementation of (22) is used for the fixed point representations.
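For comparison with the floating point path, a minimal sketch of the fixed point path in the spirit of (22): the dequantization is fused with the product, so only the compressed block is streamed from memory. The per-block range storage and the midpoint reconstruction follow the quantization sketch above and are assumptions, not the authors' exact layout.

#include <cstdint>
#include <cstddef>
#include <vector>

// One m_b x m_b block stored as 16-bit fixed point integers together with
// its own value range [cmin, cmax], as in the block layout (21).
struct QuantizedBlock
{
  std::vector<uint16_t> q;   // row-major entries, size m_b * m_b
  double cmin, cmax;
};

// y_i += q+(C_ij) * x_j with the dequantization fused into the product:
// q+(q) = cmin + (q + 0.5) * step is never formed explicitly; instead the
// affine transform is applied once per row (one extra FMA per row).
void fusedBlockGemv(const QuantizedBlock& c, int m_b, const double* x, double* y)
{
  const double step = (c.cmax - c.cmin) / 65536.0;    // range / 2^16
  for (int i = 0; i < m_b; ++i)
  {
    double rowSum = 0.0, xSum = 0.0;
    for (int j = 0; j < m_b; ++j)
    {
      rowSum += (c.q[std::size_t(i) * m_b + j] + 0.5) * x[j];
      xSum   += x[j];
    }
    y[i] += step * rowSum + c.cmin * xSum;            // fused rescaling
  }
}

A full matrix-vector product loops over all blocks C_ij and calls this kernel with the corresponding segments of x and y, exactly as in the blockwise scheme above.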
The GFLOPS values refer to the overall performance on our Skylake compute node with all threads executing separate matrices of the considered sizes in parallel. Additional floating point operations that originate from the decompression are not included, thereby allowing for a direct and fair (black box-like) performance comparison between the different representations. The gains over reference executions on the entire matrices using Intel's MKL 2019 are given as well. FP R,P representations that use the packing scheme presented in Listing 1 give at most a factor 1.6 performance gain over the MKL reference, independently of the values of R and P; they are not included in Figure 5 for that reason. For the other representations, there are some interesting observations. First, the block representation of the matrix using the double type results in a small performance drop compared to the reference execution if block sizes are either too small or too large. Choosing 16 ≤ m_□ ≤ 128 on our Skylake compute node, the additional calling overhead and the decompression, which for the double type is a simple memory copy, can be overcompensated by increased data locality. However, in most cases we observe no significant performance increase over the reference.
Second, the reduced precision floating point representations suffer to some extent from the explicit decompression of the blocks to FP 11, 52 before applying the actual BLAS operation. While both the compressed block and the decompressed block have to fit into the CPU cache in this case, for the fixed point representations only the compressed block is cached as the decompression happens on the fly with our implementation. The latter results in a better cache utilization that allows for larger blocks without causing performance drops due to spilling. In some cases, we see performance gains larger than the expected ones with the 16-bit fixed point representation for sufficiently large matrices.
Third, the 8-bit fixed point representation does not improve performance significantly beyond the 16-bit case, which for matrices of size 2048 × 2048 reaches 200-380 GFLOPS (or 8-16% of peak performance) on our Skylake compute node. The actual number of FLOPS executed is higher, as this value does not include the rescaling, which adds one fused multiply-add (FMA) operation for each row within a block and an integer-to-floating point conversion per matrix element. Taking that into account, a more realistic GFLOPS estimate is above 16% of peak performance. The decreased memory footprint together with the blocking thus pushes the BLAS level 2 operations significantly toward the compute-bound regime. As the actual BLAS operation happens with FP 11,52 floating point numbers, a performance gain over the 16-bit fixed point representation can result only from the further reduced memory footprint and/or an improved upconversion. The latter, however, involves almost the same number of data-parallel (SIMD) operations in both the 8-bit and the 16-bit case. Significant gains over the reference are thus not to be expected with the 8-bit instead of the 16-bit fixed point representation.

Summarizing, the block size m_□ has a major impact on the achievable performance of the BLAS level 2 operations. For the reduced floating point representations, too small blocks mean more calling overhead into and reduced performance of the BLAS calls, whereas too many large blocks might not fit into the CPU cache. Our implementations operating directly on the fixed point representations avoid the explicit decompression step before the BLAS call, but combine it with the actual computation. A minimum block size of m_□ = 32 seems to be necessary on our Skylake compute node to get roughly a 4x gain with 16-bit words instead of the 64-bit reference in that case.

4.2 Additive Schwarz preconditioners with reduced precision storage

Test problem setup
As a simple yet sufficiently interesting example for the application of an additive Schwarz smoother we consider the Lamé-Navier equation of linear solid mechanics. As material parameters we use those of steel, that is, μ = 79.3 ⋅ 10⁹ and λ = 107.1 ⋅ 10⁹. Boundary conditions are natural conditions on the right face of the unit cube, a constant normal force applied to the left face, and Dirichlet conditions on the remaining sides, the latter implemented as a penalty. The coarse grid is a symmetric splitting of the unit cube into 24 tetrahedra, and is uniformly refined three times, resulting in 2465 vertices. For a first-order discretization this grid would be rather coarse, but for higher order finite elements up to p = 8, the resolution is actually quite fine. The resulting deformation is shown in Figure 6.
As storage schemes, five versions have been implemented for the numerical experiments. As a reference we use dense storage of A_v⁻¹ in double precision GE format, using standard BLAS gemv. In detail, we consider block formats with various block sizes m_□ and different representations as discussed in Section 4.1 for (a) dense nonsymmetric storage of A_v⁻¹, (b) symmetric packed storage of A_v⁻¹, (c) packed storage of L_v, and (d) single level (and hence, strictly speaking, non-nested) dissection with packed storage of the resulting blocks in L_v.
As mentioned in Remark 1, a BPX preconditioner would have been a natural choice for the coarse problem of P1 finite elements in (4). The overall contraction rate, however, is then dominated by the BPX part, such that inaccuracies in the overlapping Schwarz smoother are hidden. We therefore use a direct solver for the coarse problem, which yields an overall contraction rate of around 0.31. The PCG method used for solving (23) is started with a zero initial iterate and terminated based on the estimated energy error, compare Reference 1. Test setup and preconditioner have been implemented using the Kaskade 7 toolbox. 30

Iteration count
First we check the theoretical impact of inaccurate storage on PCG iterations. The increase of upper bounds on iteration counts given in Theorems 2 and 3 depends on the relative accuracy of storage versus the condition number of the local matrices. The distribution of the latter is shown in Figure 7, left. The apparent bimodality of the distribution is due to the penalty treatment of Dirichlet boundary conditions, and becomes more distinct for larger penalty factors.
Based on the maximum condition number observed for different ansatz orders p, we derive relative accuracies of roughly 2(s − 1)/κ according to (14) and (17), respectively, for the conservative choice θ = η = 1. Relative accuracies sufficient to guarantee an increase of the iteration count bound by 1% and 10%, corresponding to s = 1.01 and s = 1.1, respectively, are shown in Figure 7, right, as decreasing curves. The relative accuracy expected from storing entries in k-bit fixed point representation according to (20) is plotted for k = 16 and k = 20. These curves increase since the local matrix size grows as p³. The figure shows that up to p = 6, a negligible iteration count increase of less than 1% can be expected when storing Cholesky factors L_v with 20-bit fixed point accuracy, whereas 16-bit accuracy can be expected to lead to at most a 10% increase in iteration count. Note that the block representation of stored matrices according to (21) with local fixed point range usually achieves a higher accuracy, such that in practice similar results can be achieved with fewer bits per entry. Obviously, due to the slower growth of the condition number, storing L_v is preferable to storing A_v⁻¹.
In order to investigate the impact of inaccuracy on actual iteration numbers, we store A_v⁻¹ + δE_v in double precision, where the entries of E_v are uniformly and independently distributed in [−1, 1], and the scaling factor δ, chosen depending on ε, ||A_v||, and ||E_v||, ensures that the assumptions of Theorem 2 are satisfied. The distribution of the entries of E_v mimics the quantization error of a fixed point representation. For different values of ε ∈ [0, 1[ and orders p ∈ {3, 4}, the iteration count ñ_it for reaching a relative accuracy of TOL = 10⁻¹² in solving (23) has been recorded and is shown in Figure 8, left. The parameters θ and η arising in the theoretical upper bound (13) have been chosen to give a best fit. Apparently, the convergence behavior is well captured by the theory, as long as the parameters θ and η are chosen appropriately.

When storing L_v, the results are similar. In contrast to storing A_v⁻¹, uniformly distributed errors in [−1, 1] have almost no impact on the iteration count, but uniformly distributed errors (E_v)_ij ∈ [−(i − j + 1)⁻¹, 0] have. The latter setup has been used to compare actual iteration numbers with the theoretical prediction (16) in Figure 8, right, again identifying the parameters θ and η from the data. As before, the theoretical bound provides a reasonable description of the observed iteration numbers. We note that the identified values for θ and η differ not only between different ansatz orders, which is to be expected, but also between A_v⁻¹ and L_v storage. The reason can be that either the particular choice of inaccuracy does not represent the worst case covered by the theory, or that the theoretical bounds are not sharp, or both.

From these results we may draw two conclusions. First, the increase factor s for the iteration count bounds given in (13) and (16) describes the observed increase in iteration counts reasonably well, but only if appropriate values for the parameters θ and η are known. They are, however, hard to obtain a priori. Using the always feasible choice θ = η = 1 for an adaptive selection of storage precision, similar to the adaptive scheme proposed in Reference 17, will be overly pessimistic. Second, storage of A_v⁻¹ appears to be much more susceptible to the type of error arising from entry-wise quantization due to reduced precision representation. Though not explained by the theory, this observation suggests that this error structure is far from the worst case when storing Cholesky factors L_v. This makes the latter more attractive compared to storing A_v⁻¹ even beyond the smaller condition number entering the relative accuracy requirement (17).
For real program executions, A_v⁻¹ and L_v are stored using any of the reduced precision floating point or fixed point representations, and E_v captures the resulting quantization error. Observed iteration counts ñ_it that exceed the reference iteration count n_it = 39 are reported in Table 2 for different combinations of storage scheme, matrix entry representation, and block size m_□; blank fields correspond to executions that did not converge. Executions using the FP 8,7 floating point representation all have the same per-block quantization. This does not hold for the fixed point representations, where quantization errors and PCG iterations tend to increase with the block size. Apparently, the convergence is unaffected for all but the very least accurate representations, in agreement with the theoretical results above. Deviations occur for blocks associated with boundary vertices. Blocking of matrices with m not being a multiple of m_□ results in some outer blocks having a rectangular shape with sizes smaller than m_□.

As the smoother application time is dominated by the execution time of the BLAS level 2 operations used, our observations from Section 4.1 can essentially be transferred. The application of the preconditioner is parallelized using standard C++ threading, storing the results in local vectors. Scattering the local vectors into the global result vector is serialized. As the execution time is dominated by the local matrix-vector products, this serialization has a minor impact on performance. Up to ansatz order p = 6, all configurations can be compared with each other, and performance gains over the original program version storing A_v⁻¹ as a dense matrix can be deduced on our Skylake system, see Figure 9.
Going beyond p = 6, the increased memory footprint prevents the full storage of A_v⁻¹ in double precision, such that no speedups over this scheme are reported in Figure 9 for p > 6. Storing A_v⁻¹ in packed format or with reduced precision representations still fits into memory, such that run times can be reported.
Packed symmetric storage of A_v⁻¹ improves performance over storing the dense matrix by halving the amount of data being processed; calls to gemv are replaced by calls to spmv. Consequently, the second column in Figure 9 shows a factor two performance gain already for the FP 11,52 floating point representation if p ≥ 4 and m_□ ≥ 32. Storing L_v in packed format, however, cannot benefit from symmetry, and falls behind packed storage of A_v⁻¹. The greater robustness of Cholesky factors with respect to inexact storage is visible only for p = 8.
For higher orders p ≥ 4, the (single level and hence non-nested) dissection format cuts down the number of nonzeros to be stored by almost a factor of two compared to packed storage of L_v, and is therefore on par with packed storage of A_v⁻¹. We expect that, in particular for higher orders, adding one or two more levels of nested dissection would improve the speedup further, even though the possible gains are limited by the moderate sparsity of A_v, more complex memory access patterns, and smaller block sizes to be exploited by vector instructions.

CONCLUSIONS
In preconditioned conjugate gradient solvers for higher-order finite element discretizations of elliptic equations, the preconditioner often dominates the run time. One effective preconditioner bounding the condition number independently of the order p involves an overlapping Schwarz smoother, which is memory bandwidth bound. Its storage size can be reduced by exploiting symmetry and block sparsity, by reduced precision storage, or both. Theoretical accuracy requirements for reduced precision have been derived using subspace correction theory, and show a greater robustness of storing Cholesky factors than of storing inverses. Being worst case bounds, these results are far from sharp, though, and a better theoretical understanding of the impact of quantization errors on convergence is required for an effective adaptive choice of precision. Numerical experiments show that quite low precision, down to 16 bits per number, is often sufficient for undisturbed PCG convergence, and reduces the preconditioner execution time roughly by the same factor as the storage size, as one might expect in a bandwidth-bound situation. For even lower precision of eight bits per number, the computation is apparently no longer bandwidth-bound, such that the additional gain diminishes.
In particular, for low precision storage, fixed point representations turn out to be more accurate and preferable compared to floating point storage. The results also demonstrate that the overhead for decompression is not negligible, as can be seen in the unsatisfactory results for small block size m □ .
The combination of sparse factorization and controlled low precision fixed point representation is a promising way to adapt overlapping Schwarz smoothers to modern computer architectures with growing imbalance of memory bandwidth and CPU performance.

APPENDIX A. COMBINATORICS OF DEGREES OF FREEDOM
For estimating the number of entries in the global stiffness matrix A and the local stiffness matrices A_v when using simplicial Lagrange finite elements of order p, we first need to count the number n_i of degrees of freedom on each mesh entity of dimension i, that is, vertices, edges, triangles, and, in 3D, tetrahedra. These are given in the first row of Table A1 (restated below for reference) and can easily be verified by induction.

Table A1: Number of degrees of freedom for simplicial Lagrange finite elements on different mesh entities, and number of mesh entities in vertex-centered patches and (approximately) in a mesh with N vertices.

Multiplying these numbers by the number of corresponding mesh entities occurring in the hexagonal and icosahedral model patches yields the total number of degrees of freedom within the patch and hence the size of A_v, as reported in the last column of Table A1. When computing these numbers not for patches but for the whole mesh, we need to take into account that the patches overlap, that is, the relative frequency of mesh entity types is different. Assuming the same incidence numbers as in the model patches, for example, that each vertex is incident to 30 triangular faces in 3D, the total number of degrees of freedom in the mesh and hence the size of A as given in the last column of Table A1 results. We point out that these numbers are estimates and neglect the domain boundary as well as the fact that various and different incidence numbers can exist in a concrete mesh.

For counting the number of matrix entries in A_v, we also need to consider which degrees of freedom interact with each other. Grouping them by mesh entity, the numbers of interactions between the different mesh entities have to be counted. For the model patches, these are given as r_ij for entities of dimension i and j in Table A2. For example, the ansatz functions associated with an edge have support on the incident cells, and thus interact with all edges incident to one of these cells. In the 2D hexagon, each edge therefore interacts with three edges, namely with itself and with its two neighboring edges, making 6 × 3 = 18 interactions in total. Remember that the boundary edges do not belong to the patch. Similarly, in the icosahedron, each edge interacts with the five incident faces and the five opposing faces, in total 12 ⋅ (5 + 5) = 120 edge-triangle interactions. Finally, the number of entries in the local stiffness matrix A_v is nnz(A_v) = ∑_{i,j} r_ij n_i n_j.
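For reference, the standard interior degree-of-freedom counts per mesh entity of dimension i for simplicial Lagrange elements of order p (the first row of Table A1) are

n_0 = 1,    n_1 = p − 1,    n_2 = (p − 1)(p − 2)/2,    n_3 = (p − 1)(p − 2)(p − 3)/6.

For the 2D hexagonal model patch (1 interior vertex, 6 interior edges, 6 triangles) this gives

n = 1 + 6(p − 1) + 6 (p − 1)(p − 2)/2 = 3p² − 3p + 1 ≈ 3p²,

consistent with the leading-order patch size used in Section 3.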
For the hexagon, we obtain nnz(A ) = 1