Providing performance portable numerics for Intel GPUs

With discrete Intel GPUs entering the high‐performance computing landscape, there is an urgent need for production‐ready software stacks for these platforms. In this article, we report how we enable the Ginkgo math library to execute on Intel GPUs by developing a kernel backend based on the DPC++ programming environment. We discuss conceptual differences between the CUDA and DPC++ programming models and describe workflows for simplified code conversion. We evaluate the performance of basic and advanced sparse linear algebra routines available in Ginkgo's DPC++ backend against the hardware‐specific performance bounds, and compare against routines providing the same functionality that ship with Intel's oneMKL vendor library.


INTRODUCTION
To complement the arrival of the first discrete Intel GPUs, Intel has teamed up with partners from academia and industry to create the oneAPI open standard for a unified application programming interface intended to be used across different compute accelerator architectures, including GPUs, AI accelerators, and field-programmable gate arrays (FPGAs). oneAPI relies on the SYCL-based DPC++ programming language, and despite Intel currently being the driving force, the intention of the oneAPI standard is to enable platform portability across hardware from Intel and other vendors. A challenge in this context is that parallelization concepts like workgroup size can have very different characteristics for a hardware zoo that spans from multicore CPUs to manycore GPUs and special function units.
In Tsai et al., 1 the authors reported how they prepared the math library Ginkgo 2 to execute on Intel GPUs by adding a backend containing kernels written in the DPC++ programming language. This article is an extended version of Tsai et al., 1 building upon the previous effort, and carrying it further with the development of a kernel selection scheme that allows for compiling kernels for different hardware characteristics and automatically selecting an appropriate kernel via a unique kernel signature. Furthermore, we now also provide DPC++ kernels for sophisticated algorithms and preconditioners, including parallel incomplete factorizations and mixed-precision block-Jacobi and sparse approximate inverse preconditioners.
This article is structured in the following way. We first introduce the oneAPI ecosystem in Section 2 before reviewing the GINKGO library design in Section 3. Section 4 is the main contribution of the article, which contains the porting strategy as well as the kernel selection strategy enabling performance portability. We evaluate the performance of sparse linear algebra workflows in Section 5 before concluding in Section 6.
F I G U R E 1 The GINKGO library design overview.
DPC++ is a community-driven (open-source) language based on the ISO C++ and Khronos' SYCL standards. The concept of DPC++ is to enhance the SYCL 3 ecosystem with several additions that aim at improving the performance on modern hardware, improving usability, and simplifying the porting of classical CUDA code to the DPC++ language. Two relevant features originally introduced by the DPC++ ecosystem and now also integrated into the SYCL standard are † : (1) A new subgroup concept that can be used inside kernels. This concept is equivalent to CUDA warps (or SIMD on CPUs) and allows optimized routines such as subgroup-based shuffles. In the GINKGO library, we make extensive use of this capability to boost performance. (2) A new unified shared memory model which provides new malloc_host and malloc_device operations to allocate memory which can be accessed both by host and device (malloc_host) or by a device only (malloc_device). Additionally, the new SYCL queue extensions facilitate the porting of CUDA code as well as memory control. Indeed, in pure SYCL, memory copies are entirely asynchronous and hidden from the user, since the SYCL programming model is based on tasking with automatic discovery of task dependencies.
Another important aspect of oneAPI and DPC++ is that they adopt platform portability as the central design concept. Already the fact that DPC++ is based on SYCL (which leverages the OpenCL's runtime and SPIRV's intermediate kernel representation) provides portability to a variety of hardware. On top of this, DPC++ introduces a plugin API that allows the development of new backends and switches dynamically between them ‡ .
Currently, DPC++ supports the standard OpenCL backend, a new Level Zero backend which is the backend of choice for Intel hardware § , and an experimental CUDA backend for targeting CUDA-enabled GPUs. As our goal is to provide high-performance sparse linear algebra functionality on Intel GPUs, we focus on the Intel Level Zero backend of DPC++.

GINKGO DESIGN
GINKGO 2 is a GPU-focused cross-platform math library focusing on sparse linear algebra. The library design is guided by combining ecosystem extensibility with heavy, architecture-specific kernel optimization using the platform-native languages CUDA (NVIDIA GPUs), HIP (AMD GPUs), or OpenMP (Intel/AMD/ARM multicore). 4 The software development cycle ensures production-quality code by featuring unit testing, automated configuration and installation, Doxygen code documentation, as well as a continuous integration and continuous benchmarking framework. GINKGO provides a comprehensive set of sparse BLAS operations, iterative solvers including many Krylov methods, standard and advanced preconditioning techniques, and cutting-edge mixed-precision methods.
A high-level overview of GINKGO 's software architecture is visualized in Figure 1. The library design collects all classes and generic algorithm skeletons in the "core" library which, however, cannot be used without the driver kernels available in the "omp," "cuda," "hip," and "reference" backends. We note that "reference" contains sequential CPU kernels designed to validate the correctness of the parallel algorithms and for the unit tests realized using the Googletest framework. We note that the "cuda" and "hip" backends are very similar in kernel design, so we have "shared" kernels that are identical for the NVIDIA and AMD GPUs up to kernel configuration parameters. 5 Extending GINKGO 's scope to support Intel GPUs via the DPC++ language, we add the "dpcpp" backend containing corresponding kernels in DPC++.

PORTING TO THE DPC++ ECOSYSTEM
Though porting GINKGO to a new hardware ecosystem requires acknowledging the hardware-specific characteristics, the GINKGO design exposed in Section 3 promotes a natural porting workflow: (1) As a first step, core library infrastructure needs to acknowledge the addition of a new backend.
This includes the GINKGO Executor which allows transparent and automatic memory management as well as the execution of kernels on different devices. Another example of manual porting in this preparatory step is the cooperative group and other shared kernel helper interfaces used for writing portable kernels and simplifying advanced operations. (2) A set of scripts can be used to generate non-working definitions of all kernels for the new backend. The completion of this step creates a compilable backend for the new hardware ecosystem. (3) For an initial kernel implementation, we rely whenever possible on existing tools to facilitate the automatic porting of kernel implementations from one language to the target language, doing only manual fixes when appropriate. The successful completion of this step provides a working backend. (4) Finally, we analyze and validate the observed performance for the ported kernels. Often, simple kernels already provide competitive performance, but advanced kernels require either manual tuning or algorithmic adaptation to reach the hardware limits.
We will expose Steps 1-3 of this workflow in Section 4.1. The later sections introduce three of the aspects of Step 4 of the workflow, where we discuss challenges and optimization opportunities specific to the oneAPI/SYCL ecosystem. In Section 4.2, we focus on the SYCL concept of subgroup, discuss its relation and mapping to CUDA and propose viewing it as CUDA subwarp instead of CUDA warp. In Section 4.3, we introduce some performance and portability aspects on Intel oneAPI of GINKGO'S advanced preconditioners which are some of the most complex algorithms we support. Finally, in Section 4.4, we introduce a new C++ concept and relevant compiler interactions which allow our kernels to be portable to all Intel hardware while being able to be tuned for performance.

Porting CUDA code to DPC++
To facilitate an initial porting of the functionality without optimization, we can rely on the Intel "DPC++ Compatibility Tool" (DPCT), which converts CUDA code into compilable DPC++ code. DPCT is not expected to automatically generate "production-ready" DPC++ executable code, but rather "ready-to-compile" code, and it requires the developer's attention and effort in fixing conversion issues and tuning it to reach performance goals. However, with oneAPI still being in its early stages, DPCT has some flaws and failures, and we develop a customized porting workflow using the DPC++ Compatibility Tool at its core, but embedding it into a framework that weakens some DPCT prerequisites and prevents incorrect code conversion. In general, DPCT requires not only knowledge of the functionality of a to-be-converted kernel, but also knowledge of the complete library and its design. This requirement is hard to fulfill in practice, as for complex libraries, the dependency analysis may exceed the DPCT capabilities. Additionally, many libraries do not aim at converting all code to DPC++, but only a subset to enable the dedicated execution of specific kernels on DPC++-enabled accelerators. Thus, we employ a strategy where we first isolate the kernels we want to convert and then re-integrate them into the library.

Isolated kernel modification
DPCT converts all files related to the target file containing any CUDA code that is in the target (sub)folders. To prevent DPCT from converting files that we do not want to be converted, we have to artificially restrict the conversion to the target files. We achieve this by copying the target files into a temporary folder and treating the rest of the GINKGO software as a system library. After the successful conversion of the target file, we copy the file back to the correct destination in the new DPC++ submodule. By isolating the target files, we indeed avoid additional changes and unexpected errors, but we also lose the DPCT ability to transform CUDA kernel indexing into the DPC++ nd_item<3> equivalent. As a workaround, we copy simple headers containing the thread_id computation helper functions of the CUDA code to the working directory, such that DPCT can recognize them and transform them into the DPC++ equivalent. For complicated kernels where DPCT fails in the kernel conversion, we need a fake interface that enables DPCT to apply the code conversion to nd_item<3>.

Fake interface: Workaround for cooperative groups
While DPC++ provides a subgroup interface featuring shuffle operations, this interface is different from CUDA's cooperative group design, as it requires the subgroup size as a function attribute and does not allow for different subgroup sizes in the same global group. As GINKGO implementations aim at performing close to the hardware-induced limits, we make heavy use of cooperative group operations. Based on the DPC++ subgroup interface, we implement our own DPC++ cooperative group interface. Specifically, to remove the need for an additional function attribute, we add the item_ct1 function argument to the group constructor. As the remaining function arguments are identical to the CUDA cooperative group function arguments, we therewith achieve a high level of interface similarity. This workflow resolves the porting not only of the cooperative group functionality but also of other custom kernels that replace the automated DPCT conversion.
A notable difference to CUDA is that DPC++ does not yet support subgroup vote functions like "ballot" or other group mask operations. We emulate these vote functions with a subgroup reduction provided by oneAPI. This lack of native support may affect the performance of kernels relying on these subgroup operations.
F I G U R E 2 Summary of the workflow used to port the cooperative group functionality and of the isolation effort that yields correctly converted DPC++ code.
In Figure 2, we visualize the porting workflow for cooperative group functionality via four steps:
1. Origin: We prepare an alias to the cooperative group function such that DPCT does not catch the keyword. We create this alias in a fake cooperative group header we only use during the porting process.
2. Adding interface: As explained previously, we isolate the files to prevent DPCT from changing other files. We also add the simple interface including threadIdx.x and make use of the alias function. For the conversion to succeed, it is required to return the same type as the original CUDA type, which we need to extract from the CUDA cooperative group function this_thread_block.
3. DPCT: Apply DPCT on the previously prepared files. Adding threadIdx.x indexing to the function allows DPCT to generate the nd_item<3> indexing.
4. Recovering: During this step, we change the related cooperative group functions and headers to the actual DPC++ equivalent. We implement a complete header file that ports all the cooperative group functionality to DPC++.
In Figure 3, we show the final result of the porting workflow on a toy example with cooperative groups. For the small example code in Figure 3A, if we do not isolate the code, DPCT throws an error like that in Figure 3B once it encounters the cooperative group keyword. A manual implementation of the equivalent cooperative group kernel is shown in Figure 3C. Our porting workflow generates the code shown in Figure 3D, which is almost identical to the original CUDA code in Figure 3A.

Pushing for backend similarity
To simplify the maintenance of the platform-portable GINKGO library, our customized porting workflow uses some abstraction to make the DPC++ code look similar to CUDA/HIP code ¶ . In particular, we not only add the customized cooperative group interface, but also a dim3 implementation layer for DPC++ kernel launches that uses the same parameter order as CUDA and HIP kernel launches. One fundamental difference remaining between the CUDA or HIP ecosystems and DPC++ is that the latter handles both static and dynamic shared memory allocation in the main (host) component, whereas CUDA and HIP handle the allocation of static shared memory inside the kernel and the allocation of dynamic shared memory in the kernel launch parameters. Another difference is the kernel invocation syntax, since DPC++ relies on a hierarchy of calls, first to a queue, then to a parallel instantiation. For consistency, we add another layer that abstracts the combination of DPC++ memory allocation and DPC++ kernel invocation away from the user. This enables a similar interface for CUDA, HIP, and DPC++ kernels in the main component, and shared memory allocations can be perceived as a kernel feature. We extend this design to support different device configurations in Section 4.4 and use both of these approaches in GINKGO. If a kernel requires a subgroup or workgroup size selection, we add the configuration selection into the call; otherwise, we use the simple kernel interface without configuration input.

SYCL subgroups compared to CUDA warps and subwarps
In SYCL and DPC++, subgroups are equivalent to the CUDA warp concept. However, there are two differences between SYCL subgroups and CUDA warps: (1) the SYCL subgroup size is adjustable per kernel, and (2) SYCL does not have a sub-subgroup (subwarp) concept for the moment. In this situation, we decide to rather match subgroups to CUDA subwarps. This allows the usage of small subgroups similar to subwarps as long as there is no major communication outside of the given subgroup. In addition, choosing a suitable subgroup size is important because it gives better performance than the automatic SYCL choice, see the small reduction example in Figure 4.
F I G U R E 4 The performance of subset reduction on 10,000,000 elements on the GEN12 GPU.
For the original SYCL-1.2.1 standard, Intel oneAPI proposed a DPC++ extension that introduces the subgroup concept along with propagation rules for the kernel attribute from deep inside a kernel call nesting to the outer-most level, such that the compiler recognizes this setting and generates a tailored version of the kernel. This allows a very similar usage of cooperative groups in DPC++ and CUDA and a smooth conversion of CUDA kernels to DPC++ via our porting workflow with DPCT. The conversion process, as well as the cooperative group porting process, is described in Section 4.1.
However, the new SYCL-2020 standard changes the attribute propagation for the subgroup setting such that all kernel attribute propagation is disabled. This effectively prevents the strategy of setting the subgroup size inside the cooperative group. From the standard's perspective, it might be reasonable to ignore subgroup settings from deep inside the kernel, because a subgroup is essentially considered like a CUDA warp and not a subwarp. An overhead-free workaround is given by re-using an intermediate layer that manages the subgroup settings, see Section "Pushing for backend similarity."
 6     Jacobi<ValueType>::build()
 7         .with_max_block_size(32u)
 8         .with_storage_optimization(gko::precision_reduction::autodetect())
 9         .on(exec);
10 // Create IR solver with the adaptive block-Jacobi as triangular solver
11 auto trisolve_factory =
12     ir::build()
13         .with_solver(share(bj_factory))
14         .with_criteria(
15             gko::stop::Iteration::build().with_max_iters(sweeps).on(exec))
16         .on(exec);
17 // Create the ILU preconditioner and use the IR to solve lower/upper triangular systems
18 auto ilu_pre_factory =
19     gko::preconditioner::Ilu<ir, ir>::build()
20         .with_l_solver_factory(gko::clone(trisolve_factory))
21         .with_u_solver_factory(gko::clone(trisolve_factory))
22         .on(exec);
...
30         .with_criteria(gko::share(iter_stop), gko::share(tol_stop))
31         // without the following line, it will be plain CG
32         .with_generated_preconditioner(gko::share(ilu_preconditioner))
33         .on(exec);
34 // Generate preconditioned solver for a system matrix A
35 auto ilu_cg = ilu_cg_factory->generate(A);
36 // Solve the system
37 ilu_cg->apply(lend(b), lend(x));
Listing 1: ParILU IR PCG example

Porting GINKGO 's advanced preconditioners on Intel oneAPI
GINKGO contains a set of advanced preconditioners designed to leverage the compute power of modern GPUs by allowing for fine-grained parallelism and using lower precision formats for parts of the computations. We have ported the following preconditioner functionality to the DPC++ backend: (1) adaptive precision block-Jacobi preconditioning, 6 (2) routines for generating incomplete factorization preconditioners, in particular the parallel incomplete LU factorization (ParILU 7 ), the parallel incomplete Cholesky factorization (ParIC), and their respective threshold versions (ParILUT/ParICT 8,9 ), and (3) incomplete sparse approximate inverse preconditioners (ISAI 10 ). In comparison to the vendor libraries for AMD and NVIDIA GPUs, oneMKL is at an early stage and still missing some key functionality, such as the sparse triangular solves that we use for incomplete factorization preconditioning on NVIDIA and AMD GPUs. However, incomplete factorization preconditioning is still possible in the DPC++ backend by using approximate triangular solves that employ a relaxation method for solving the triangular systems. 11 Listing 1 demonstrates the use of approximate triangular solves: we first build a ParILU factorization factory with its settings (lines 2-3). Next, we declare an adaptive block-Jacobi factory, which is used as the solver in an iterative refinement (IR) method replacing the missing triangular solver (lines 7-18). Then the ILU preconditioner is built using all the previous pieces (lines 20-26). Finally, the incomplete factorization preconditioner using approximate triangular solves is passed as a preconditioner to a CG solver (lines 28-37).

Challenges with ParIC/ILU(T)
A challenge when porting the ParILUT 8,9 algorithm to DPC++ is the limitation of using only one fixed-size subgroup per kernel (i.e., the lack of a CUDA subwarp equivalent). ParILUT employs divide and conquer to handle the bitonic sorting, such that the working space is: block -> block/2 -> · · · -> warp -> subwarp(16) -> subwarp(8) -> · · · -> single thread. Its implementation relies on cooperative groups to efficiently handle each subwarp level, which is not possible in DPC++ due to the fixed subgroup size. Acknowledging that all subwarps execute the same algorithm, we can still use the algorithm with all subgroups (subwarps) having the same size. However, this workaround does not follow the cooperative group concept.
That is, the DPC++ kernels apply the divide-and-conquer strategy on the workload view and stop when reaching the subgroup level. When reaching the subgroup size, for example, 32, the divide-and-conquer logic is implemented within the kernel without changing the subgroup size. To make this work, the kernel needs to compute or count indices carefully. For example, ffs returns the (1-based) position of the first set bit in a given mask; emulating it as len(mask) - clz(mask) gives the wrong answer when len(mask) does not match the subgroup size, so we use ctz(mask) + 1 to ensure correctness.

Challenges with adaptive precision block-Jacobi
Also, the adaptive precision block-Jacobi 6 algorithm makes heavy use of subwarps, as matching diagonal blocks to subwarps requires adjusting the subwarp size to the size of the diagonal block. The original CUDA implementation checks the diagonal blocks for the precision requirements in a subwarp and then uses the strongest requirement for all subwarps of the same warp. This scheme means that we again face the issue that SYCL/DPC++ does not support variable sub-subgroup sizes. It is possible to always use the same subgroup size to handle the blocks, which means that blocks need to be considered individually, but this would produce different block precisions compared to the other backends. As long as oneAPI does not allow for flexibility in the subgroup size, we limit the block-Jacobi preconditioner to a uniform block size of 32.

Improving performance and portability of GINKGO 's DPC++ backend
An important aspect of oneAPI is hardware portability. In effect, oneAPI is expected to execute not only on Intel architectures (CPUs, GPUs, and FPGAs), but also on hardware from other vendors.

Encoding kernel parameters with the ConfigSet
We propose a compile-time structure ConfigSet which supports compile-time encoding of multiple parameters into a single bitset as well as both compile-time and runtime decoding of the configuration. The architecture of the ConfigSet is detailed in Figure 5. First, the type needs to be declared; it specifies the sizes and order of the encoded numbers, for example, Config<3, 11, 7> indicating that the first element uses up to 3 bits, the second element up to 11 bits, and the last element up to 7 bits. The user can then encode a configuration (e.g., 4, 256, 16) into a number at compile time. This number can be used as a template parameter value for the kernel, which greatly simplifies the implementation of the strategy, as all relevant numerical parameters are specified and checked against this single encoded result. Inside the kernel, the ConfigSet is decoded and the kernel parameters such as subgroup size and workgroup size are extracted. The ConfigSet concept allows selecting a valid and well-performing kernel at runtime without significantly increasing the development effort. In Listing 3, we present the full implementation selection via the ConfigSet for a reduction kernel. In this example, we consider (256, 16) a valid (workgroup, subgroup) configuration for the GEN9 GPU and (512, 32) for the GEN12 GPU (lines 1-3) due to the devices' maximum workgroup size limitation and other hardware characteristics. We note that because get_first_cfg returns the first valid config, the ordering of the list entries affects which configuration is selected. We also note that it is possible to extend the ConfigSet and the implementation selection process to account for other hardware characteristics.

4.4.4 Overhead of the runtime selection
To evaluate the overhead of the runtime selection, we take the kernel performing a reduction per 256 elements in single precision from Figure 4 as an example, because the reduction is a common kernel requiring subgroup functionality to achieve good performance. We generate a certain number of kernels at compile time and select the last generated kernel at runtime. Our experiment runs one warmup and measures the average of 100 runs, where each run selects kernels on the GEN12 GPU, and we check compile-time lists containing from 1 to 768 kernels. In Figure 7, there is a time increase when we choose the last kernel in a list with more than 512 kernels, but it is still very small with only up to 0.3% of the time (i.e., 4.84 μs) || .
F I G U R E 7 The overhead of runtime selection on the Gen12LP GPU. "x" is the number of kernels in the selection list generated at compile time, and we select the last kernel in the list at runtime. "time" includes the selection and the kernel run. "relative difference" is the relative difference (in percentage) between the current case and the 1-selection case.
Thus, the overhead from the selection is not a major issue for the kernel performance.

4.4.5 Compiling GINKGO in the oneAPI ecosystem
GINKGO integrates with the oneAPI and SYCL toolchains by changing the default C++ compiler. There are two compilation modes for DPC++ functionality: with just-in-time (JIT) compilation, the specific kernel code is automatically generated at runtime, whereas ahead-of-time (AOT) compilation pre-generates the kernel code for a specific architecture ** . The selection of AOT or JIT compilation is controlled via the -fsycl-targets compiler option † † . The default mode spir64 enables JIT, whereas spir64_gen enables AOT compilation for specific hardware types (in this case, a GPU). In addition, a specific device generation, such as the Intel GEN9, can be targeted by using the flag -Xsycl-target-backend. Whether AOT or JIT is used, and in which manner, has important implications for the previous ConfigSet and implementation selection concepts: when exhaustively generating the kernel configurations for all potential hardware settings, the kernel compilation may fail because a configuration is not supported for the target hardware type and device pair. JIT compilation can work transparently thanks to the compiler option -fsycl-device-code-split, which controls the granularity of device kernel code modules: the default auto uses a heuristic to choose the splitting, per_source compiles each source file as a device code module, and per_kernel generates a separate device code module for every kernel (including template parameters). Setting per_kernel in combination with JIT implies that code paths for which the hardware is not compatible are never evaluated at runtime, and therefore never compiled. On the other hand, per_source raises a runtime JIT error if an invalid kernel configuration is encountered.
However, the convenience of the per_kernel compilation comes with an important cost: all kernel dependencies (like cooperative group and other pieces of code) are bundled in every kernel module separately, which can significantly increase the compilation overhead as well as library size, particularly in debug mode. To address this downside and enable more flexibility, we also support the JIT per_source mode and AOT if the user explicitly specifies the relevant hardware parameters of the target device, and this information is then used to automatically generate the configuration list like in line 2 of Listing 3 from within CMake.
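For reference, using only the options discussed above, the two modes might be invoked as follows (a sketch, not a verified recipe: the compiler driver name and the exact device string passed to the backend are assumptions to be checked against the oneAPI documentation):

```shell
# JIT (default): portable SPIR-V, specialized at runtime; with per_kernel
# splitting, incompatible kernel variants are never JIT-compiled.
dpcpp -fsycl-targets=spir64 \
      -fsycl-device-code-split=per_kernel -o app_jit main.cpp

# AOT: pre-generate device code for a concrete GPU generation (the
# "-device gen9" string is an assumed example value).
dpcpp -fsycl-targets=spir64_gen \
      -Xsycl-target-backend "-device gen9" -o app_aot main.cpp
```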

PERFORMANCE ASSESSMENT OF GINKGO'S DPC++ BACKEND
In this section, we evaluate all aspects of the newly ported GINKGO backend on available Intel hardware. In Section 5.1, we present the platforms we consider and the experimental setup. We then start with evaluating the performance of the SPMV kernel in Section 5.2 and its performance portability on Intel, NVIDIA, and AMD GPUs. The performance portability of the Krylov solvers and the performance of the advanced preconditioners are evaluated in Sections 5.3 and 5.4, respectively. All performance results, including the above tests, are available in the dataset. 13

Experimental setup and platform overview
In this article, we consider two Intel GPUs: the generation 9 (GEN9 ) integrated GPU UHD Graphics P630 with a theoretical bandwidth of 41.6 GB/s and the generation 12 Intel® Iris® Xe Max discrete GPU (GEN12 ) ‡ ‡ which features 96 execution units and a theoretical bandwidth of 68 GB/s.
We start by evaluating the performance and bandwidth of the Intel GPUs using bandwidth tests, performance tests, and sparse linear algebra kernels. We note that the GEN12 architecture lacks native support for IEEE 754 double-precision arithmetic, and can only emulate double-precision arithmetic with significantly lower performance. Given that native support for double-precision arithmetic is expected for future Intel GPUs and using the double-precision emulation would artificially degrade the performance results while not providing insight into whether GINKGO's algorithms are suitable for Intel GPUs, we use single-precision arithmetic in all performance evaluations on the GEN12 architecture § § .
The DPC++ version we use in all experiments is the Intel oneAPI DPC++ compiler 2021. All experiments were conducted on hardware that is part of the Intel DevCloud. Bandwidth tests and an experimental performance roofline analysis were presented in the previous article. 1 We use the BabelStream 14 benchmark to evaluate the peak bandwidth, and the mixbench 15 benchmark to evaluate the arithmetic performance.

SPMV performance analysis and portability
An important routine in sparse linear algebra is the sparse matrix-vector product (SPMV). This kernel reflects how a discretized linear operator acts on a vector, and therewith plays the central role in the iterative solution of linear problems and eigenvalue problems. We consider two sparse matrix formats: (1) the "COOrdinate format" (COO) that stores all nonzero entries of the matrix along with their column and row indices, and (2) the "Compressed Sparse Row" (CSR) format that further reduces the memory footprint of the COO format by replacing the row indices with pointers to the first element in each row of a row-sorted COO matrix. We focus on these popular matrix formats not only because of their widespread use but also because Intel's oneMKL library provides an optimized CSR-SPMV routine for Intel GPUs.
In Figure 8, we report the SPMV kernel performance of GINKGO and Intel's oneMKL library on the GEN9 and GEN12 GPUs; the same data in terms of percentage of peak bandwidth is exposed in Figure 9.
We also evaluate the hardware efficiency of the GINKGO DPC++ backend compared to the other backends. For that, we focus on the relative performance the functionality achieves on GPUs from AMD, NVIDIA, and Intel, taking the theoretical performance limits reported in the GPU specifications as the baseline. This approach reflects the fact that the GPUs differ significantly in their performance characteristics, and that Intel's oneAPI ecosystem and GPU architectures are still under active development and have not yet reached the maturity level of other GPU computing ecosystems. At the same time, reporting the performance relative to the theoretical limits allows us both to quantify the suitability of GINKGO's algorithms and to estimate the performance we can expect from GINKGO's functionality when scaling up the GPU performance.
F I G U R E 8 SPMV kernel performance for GINKGO and Intel's oneMKL library on GEN9 (left) and GEN12 (right) using double and single precision, respectively.
F I G U R E 9 SPMV performance relative to the hardware bounds on various GPUs.
In Figure 9, we report the relative performance of different SPMV kernels on the AMD MI100 ("hip" backend), the NVIDIA A100 ("cuda" backend), and the Intel GEN9 and GEN12 GPUs (both "dpcpp" backend). As expected, the achieved bandwidth heavily depends on the SPMV kernel and the characteristics of the test matrix. Overall, the performance figures indicate that the SPMV kernels achieve about 80% of peak bandwidth on the A100, 90% of peak bandwidth on GEN12, and about 60%-70% of peak bandwidth on the MI100 and GEN9. On all hardware, GINKGO's SPMV kernels are competitive with the vendor libraries, indicating the validity of the library design and demonstrating good performance portability.

Krylov solver performance analysis and performance portability
We now turn to advanced numerical algorithms typical of scientific simulation codes. The Krylov solvers we consider-CG, BiCGSTAB, CGS, FCG, and GMRES 16 -are all iterative methods popular for solving large sparse linear systems. They all have the SPMV kernel as their central building block. We use GINKGO's COO SPMV kernel and test matrices from the SuiteSparse Matrix Collection 17 that differ in their characteristics and origin; the key characteristics are listed in Table 1. We run the solver experiment for 1000 solver iterations after a warm-up phase.
We report the relative performance of the different Krylov solvers on the different vendors' GPUs in Figure 10; in this comparison, we do not have vendor libraries as a reference. A100 achieves the highest fraction of peak bandwidth (50%-85%), similar to the SPMV case in Figure 9. GEN12 achieves the second-highest fraction, while GEN9 and MI100 achieve similar fractions of their devices' bandwidth. GEN12 and A100 achieve a similar fraction of bandwidth for the short-recurrence Krylov solvers. GMRES on GEN12 achieves lower performance than the short-recurrence Krylov solvers, which highlights that the GMRES kernels require specific tuning, as mentioned in Tsai et al. 1

Evaluation of advanced preconditioners
For evaluating the performance of GINKGO 's preconditioners, we select a list of test matrices from the SuiteSparse Matrix Collection, see Table 1 for the matrices' key characteristics. For the preconditioner performance evaluation, we embed GINKGO 's DPC++ preconditioner kernels into a CG iterative solver. We apply the preconditioned CG to linear systems that complement the test matrices with an all-ones right-hand side. The CG uses an all-zeros initial guess and stops once the relative residual reported by CG falls below 10 −7 , allowing for at most 1000 CG iterations. The preconditioners are executed in the following configurations:
• CG-plain conjugate gradient without any preconditioner.
• Adaptive block-Jacobi PCG-PCG using GINKGO 's adaptive precision block-Jacobi preconditioner with block size 32.
FIGURE 11 Effectiveness and efficiency of GINKGO 's preconditioners on the Intel GEN12 GPU.
FIGURE 12 Adaptive precision block-Jacobi speedup over the fixed precision block-Jacobi on the Intel GEN9 GPU.
• ParILU IR PCG-PCG using the ParILU preconditioner. The triangular systems are solved using 5 sweeps of IR with adaptive precision block-Jacobi (block size 32) as shown in Listing 1. Except for the block size, this configuration is identical to the configuration used in Goebel et al. 11
• ParIC IR PCG-Identical to ParILU IR PCG, except for using the symmetric (ParIC) incomplete factorization.
• SPD ISAI PCG-PCG using the (symmetric positive definite) incomplete sparse approximate inverse (ISAI) preconditioner.
Figure 11 reveals that plain CG does not converge within the iteration limit for the thermal1 and ted_B problems, but adding a preconditioner enables convergence. Adaptive precision block-Jacobi PCG provides a significant iteration improvement for the ted_B matrix, but it is less effective for the Dubcova3 and thermal1 matrices. ParILU and ParIC generally provide a larger convergence improvement, see, for example, the finan512, thermal1, ted_B, and Muu matrices. This larger convergence improvement comes at higher preconditioner generation and application costs. The time-to-solution comparison in Figure 11 reveals that the ParILU/ParIC-preconditioned CG, despite the lowest iteration count, is often not the fastest choice. In fact, the ISAI-preconditioned CG, providing moderate convergence improvement at low cost, is the fastest choice for all test problems we consider in this evaluation.
We next evaluate the performance of GINKGO 's adaptive precision block-Jacobi in comparison to the fixed precision block-Jacobi. For this experiment, we use IEEE double precision and run on the GEN9 GPU. We use a relative residual threshold of 10 −12 , start the iterations with an all-zeros initial guess, and solve for an all-ones right-hand side. In Figure 12, we visualize the speedup we observe when running on the GEN9 architecture. For the finan512 and thermal1 problems, adaptive precision block-Jacobi executes about 15% faster than fixed-precision block-Jacobi. For the remaining matrices, adaptive precision block-Jacobi performs similarly to the fixed precision variant. This demonstrates that the DPC++ kernels for the adaptive precision block-Jacobi are working as intended and can offer performance advantages over the fixed precision preconditioning.

SUMMARY AND OUTLOOK
We have prepared the GINKGO open-source math library for Intel GPUs by developing a DPC++ backend. We presented practical strategies to accommodate the design differences between the CUDA/HIP and oneAPI ecosystems. Acknowledging that the oneAPI ecosystem spans from general-purpose CPUs to GPUs and special function units, we also discussed a strategy that enables the automatic selection of a valid and well-performing kernel configuration out of a set of pre-compiled kernels when running on a specific DPC++-enabled architecture. Using these porting strategies, we added DPC++ kernels for basic building blocks, Krylov solvers, and advanced preconditioners to the GINKGO library. This makes GINKGO the first math library able to run complete simulation workflows on Intel GPUs. We evaluated the efficiency of the new functionality in terms of translating hardware performance into algorithm performance. Comparing the execution time of basic building blocks with kernels available in Intel's oneMKL library, we demonstrated that GINKGO 's kernels are competitive with the vendor implementation while providing more flexibility.
Future research will focus on evaluating GINKGO 's DPC++ backend on GPUs from other vendors.

ACKNOWLEDGMENTS
This work was supported by the "Impuls und Vernetzungsfond" of the Helmholtz Association under Grant VH-NG-1241, and the US Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors want to acknowledge the access to the Intel® DevCloud early access development platform. Open Access funding enabled and organized by Projekt DEAL.