Toward a new Linpack-like benchmark for heterogeneous computing resources

This work describes some first efforts to design a new Linpack-like benchmark useful to evaluate the performance of Heterogeneous Computing Resources. The benchmark is based on the Schur Complement reformulation of the solution of a linear equation system. Details about its implementation and evaluation, mainly in terms of performance scalability, are presented for a computing environment based on nodes with multiple NVIDIA GP-GPUs connected by an InfiniBand network.

F I G U R E 1 The evolution of the Linpack benchmark from 1993 to now. Figure Credits: Tan et al.22

which could become much more expensive than computation in terms of both throughput and energy consumption. As in Yamazaki et al.,7 the term communication is used to include both horizontal data movement between parallel processing units and vertical data movement between memory hierarchy levels. In fact, in their original formulation, these methods are based on level 1 BLAS operations (i.e., vector products, products of a scalar by a vector, etc.).1 These operations have a low granularity and they fail to guarantee good performance, especially in high-performance computing contexts. In parallel computing, the term granularity of a task is a metric of the amount of work (or computation) performed,8 and it refers to the ratio of computation time to communication time.
Considering the above, the Linpack Benchmark (based on level 3 BLAS operations) seems to have lost its relevance in guiding the community toward the development of benchmarks for HPC systems. Indeed, new tools such as the High Performance Conjugate Gradient (HPCG) benchmark,9-11 which is based on implementations of KM, are now more representative of the computing patterns of real applications.
Nevertheless, paraphrasing the authors of the new benchmarks cited above,9 we state "Presently Linpack-like benchmarks remain tremendously valuable as a measure of historical trends, and as a stress test … Furthermore, it provides the HPC community with a valuable outreach tool, understandable to the outside world". Furthermore, we cannot overlook the fact that, for the Krylov methods (KM) solvers to efficiently utilize extreme-scale hardware, a lot of work has been dedicated over the last three decades to redesigning both the Krylov methods algorithms and their implementations (e.g., see Yamazaki et al.,7 Bai et al.,12 Hoemmen,13 Carracciuolo et al.,14 Laccetti et al.15) to address challenges like extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. All the redesign approaches base the algorithms on BLAS 2 and 3 operations and on computational patterns more similar to those of the direct solvers of the problem (1).
That being said, in alignment with the authors of HPCG, 16 we believe that benchmarks for new computing patterns should be seen as a complement to the Linpack benchmark rather than a replacement.
The structure of this work is as follows: Section 2 provides a "State of the Art" regarding Linpack-like benchmarks for HPC and heterogeneous computing systems; Section 3 offers details about the usage of the Schur Complement in solving linear systems (1); Section 4 describes issues related to the initial implementation and evaluation, mainly in terms of performance scalability, of the benchmark based on the Schur reformulation; Section 5 summarizes the content of the present work and outlines some ideas for our future endeavors.

RELATED WORKS AND MOTIVATION
In the field of Scientific Computing (SC), high-level benchmarks are designed to test the overall system performance, including the utilization of the CPU, memory, and hard drive, in conjunction with all available computing devices (i.e., GPUs).19,20 The tests performed by these tools are often used both for assessing the overall system performance and for comparing the performance of different systems. For instance, the Linpack benchmark is a high-level benchmark used to evaluate the performance of computing systems in terms of their ability to process large-scale problems. The Linpack benchmark was originally developed in the 1970s2,3 and is based on an algorithm for solving linear systems that uses direct methods and whose computational complexity is of the order of n³ (where n is the dimension of the matrices involved).
Linpack has played a crucial role in the analysis of computing systems' performance for SC because it provides a way to compare systems of different architectures and sizes on the operation (that is, the solution of linear systems with dense matrices) which is at the base of many algorithms of interest for SC. That allows scientists to evaluate, quite accurately, the efficiency of the use of computing systems, as evidenced by the use of the benchmark in the Top 500 ranking.21 Indeed, the use of Linpack, and of its evolutions, is linked to the creation of the Top 500 ranking, which lists the most powerful supercomputers in the world and uses the Linpack score as one of the main classification criteria. Figure 1 shows the evolution of the Linpack direction since the first stable version was released in 1993: hardware architecture evolution is the main driving source of the benchmark optimization.22 Over the years, Linpack has evolved into new versions such as the HPL (High Performance Linpack) benchmark.4 HPL uses an algorithm to solve dense linear systems enabling a more efficient and effective use of distributed memory and network connectivity on modern computing systems. The HPL algorithm was designed to be scalable and usable on a wide range of HPC systems. The algorithm implemented by the Linpack benchmark is based on a block organization of the linear system matrix, where each block is processed separately, using level 3 BLAS operations. The HPL algorithm (like any Linpack-like benchmark) employs a technique called "LU decomposition" to solve linear systems.23 This technique consists of factoring the matrix A as A = LU (by algorithms whose computational complexity is of the order of n³), where L and U, respectively, are lower triangular and upper triangular matrices. This decomposition allows us to efficiently solve (1), especially if several systems with the same matrix A must be solved. Indeed, if the LU decomposition of a matrix A is already available, the linear system (1) can be solved by 1) the solution of Lz = y, followed by 2) the solution of Ux = z (with a total computational complexity of O(2n²)).
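The two triangular solves above can be sketched in a few lines of numpy. This is a minimal illustration only: it uses Doolittle elimination without the pivoting that a production HPL code performs, and the helper names are ours, not part of any benchmark code.

```python
import numpy as np

def lu_decompose(A):
    """Doolittle LU decomposition WITHOUT pivoting (illustrative only).
    Returns L (unit lower triangular) and U (upper triangular) with A = L @ U."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

def lu_solve(L, U, y):
    """Solve L U x = y: forward substitution (Lz = y), then back substitution (Ux = z)."""
    n = len(y)
    z = np.zeros(n)
    for i in range(n):
        z[i] = y[i] - L[i, :i] @ z[:i]          # L has unit diagonal
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# Once A = LU is available (O(n^3)), each extra right-hand side costs only O(2n^2).
A = np.array([[4.0, 3.0], [6.0, 3.0]])
y = np.array([10.0, 12.0])
L, U = lu_decompose(A)
x = lu_solve(L, U, y)
```

The O(n³) cost is paid once in `lu_decompose`; `lu_solve` performs only the two O(n²) triangular sweeps described above.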
HPL-MxP (High Performance Linpack Mixed-Precision benchmark, formerly known as HPL-AI) implements a new version of the HPL algorithm based on a mixed-precision approach. HPL-MxP is designed to test the performance of heterogeneous computing systems that use a combination of CPUs and GPUs.5,24 The basic algorithm of HPL-MxP is similar to that of HPL and makes use of mixed-precision techniques to better exploit the characteristics of GPU resources. Mixed precision consists of using a lower floating-point precision for the calculations on the GPUs, which are faster but less precise devices than CPUs. HPL-MxP is designed to be more efficient and scalable than the HPL algorithm. Indeed, in the intentions of its authors, it should more effectively use the system's memory and bandwidth and could be easily adapted to a wider range of system architectures. To maximize GPU resource usage, HPL-MxP uses a series of GPU-specific optimizations, such as the CUDA version of the BLAS library25 or other numerical libraries such as cuSOLVER.
26 It also uses several efficient communication methods between the CPU and GPU to minimize data transfer time between the two processing units. The algorithm of HPL-MxP was originally designed to test the performance of systems specifically planned for Artificial Intelligence and Machine Learning workloads, which often require a combination of high-performance CPU and GPU resources. HPL-MxP seeks to underline the link between the computational paradigms of HPC and of the AI workloads based on machine learning (ML) and deep learning (DL): while traditional HPC focuses on simulation runs for modeling phenomena in a variety of scientific disciplines, mostly requiring 64-bit precision, the ML/DL methods that are at the basis of advances in AI achieve the desired results at 32-bit or even lower precision. The performance of the HPL-MxP benchmark on the supercomputer Fugaku was the world's first to exceed the Exascale wall in a floating-point arithmetic benchmark.27

The CUDA-Aware HPL benchmark28 implemented the first CUDA-based version of HPL for NVIDIA GPU clusters. It uses CUDA libraries to accelerate the HPL benchmark on heterogeneous clusters, where both CPUs and GPUs are used, with minor or no modifications to the source code of HPL. A host library intercepts the calls to the most computationally intensive BLAS operations (i.e., the DGEMM and DTRSM procedures) that form the basis of the LU decomposition and executes each of them while distributing computation on both GPUs and CPU cores. In the CUDA-Aware HPL benchmark, the computational load distribution of BLAS operations between CPU and GPU is automatically determined by the benchmark thanks to some metrics related to 1) the bandwidth for data transfer on the PCI-e bus from/to the host to/from the device and 2) the sustained performance of BLAS operations on the GPU/CPU. Since the sustainable bandwidth from the host to the device (and vice versa) plays a key role in the acceleration of a single DGEMM or DTRSM call, the CUDA-Aware HPL benchmark exploits the CUDA fast transfer mode, which is enabled when page-locked memory (sometimes called pinned memory29) is used.
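The mixed-precision idea behind HPL-MxP can be sketched as follows: perform the expensive O(n³) factorization in low precision, then recover full double-precision accuracy with a few cheap O(n²) iterative-refinement steps. The numpy sketch below stands in float32 for a GPU's low-precision units; it is our illustration of the general technique, not code from the benchmark (a real implementation would factor once and reuse the low-precision factors instead of calling `solve` repeatedly).

```python
import numpy as np

def mixed_precision_solve(A, y, iters=5):
    """Solve Ax = y in the spirit of HPL-MxP: low-precision factorization,
    double-precision iterative refinement (illustrative sketch only)."""
    A32 = A.astype(np.float32)            # "fast but less precise" copy of A
    x = np.linalg.solve(A32, y.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = y - A @ x                     # residual in full (64-bit) precision
        # correction solved in low precision (a real code reuses the LU factors of A32)
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
y = rng.standard_normal(n)
x = mixed_precision_solve(A, y)
```

For well-conditioned matrices, a handful of refinement steps drives the residual down to the float64 level while almost all flops are spent in the low-precision solve.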
Since the publication date of Fatica,28 a lot of work has been done on Linpack optimization for CPU-GPU heterogeneous systems of older technologies (see, for example, the "Related works" sections of Kim et al.30 and Tan et al.22). New technologies present new challenges for programming heterogeneous systems. Indeed, the most important challenge is the widening gap between GPU computing speed, CPU computing speed, and data transfer speed (PCIe and inter-node network). From 2010 to 2019, one CPU's double-precision floating-point calculation speed increased from 30 GFlops to 1 TFlops, while one GPU's speed increased from 250 GFlops to 7 TFlops,22 forcing software developers to implement strategies that favor an appropriate load balancing to prevent the resources from being idle. Regarding the work done with the same goal in more recent technological contexts, the following two papers are worth mentioning.
In Kim et al.,30 SnuHPL is described: an optimized HPL-based benchmark for modern clusters with inter-node heterogeneity (different GPUs on different nodes). A performance model is used to optimize SnuHPL execution, generating information about the best data distribution for a given cluster configuration by considering computing power, memory capacity, and network performance. In the authors' first intentions, SnuHPL was to be only an open-source HPL implementation optimized for clusters of modern homogeneous GPUs; only later did they make it a tool capable of adapting its executions, in terms of data and load distribution, to distributed memory systems with non-homogeneous GPUs. SnuHPL takes the data distribution generated by its data distribution generation framework, which consists of a performance profiler, the SnuHPL simulator, and a greedy heuristic generation algorithm. The performance profiler samples various performance parameters of the cluster, from which the SnuHPL simulator determines the data distribution. However, even though SnuHPL can adapt its execution to non-homogeneous systems, it does not seem able to use all the system's computational resources (CPUs and GPUs) at once.
In Guangming Tan et al.,22 a reformulation of the LU decomposition algorithm is described which better implements strategies for the overlapping of computation and communication phases where CPU-GPU data transfers use a PCIe-based bus. In particular, considering that the major part of such an algorithm proceeds through multiple iterations of four consecutive steps (panel factorization (PF), panel broadcast (PB), row swapping (RS), and trailing-matrix updating (TU)) on the matrix A in a block-by-block way, the authors implement and evaluate four different heterogeneous algorithms that organize such steps in diverse pipelines trying to optimize the overlapping actions.
In addition to the two above-listed papers, it is also worth mentioning the recent effort spent on deploying rocHPL,31 AMD's open-source implementation of the HPL benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer.21 That implementation of the original HPL benchmark leverages the AMD GPU accelerators on the node via AMD's ROCm platform, runtime, and toolchains. The rocHPL code is written using the HIP programming language and is based on linear algebra routines, highly optimized for AMD's latest discrete GPUs, available from the rocBLAS math library.32

Together with Tan et al.,22 we advocate that it is time for the community to release a new version of the Linpack benchmark for CPU-GPU heterogeneous architectures. Therefore, in a context confirming the community's interest in the Linpack benchmark and its evolution, this work aims to lay the groundwork for a Linpack-like benchmark that can make the most of the heterogeneity of computing systems with solutions that:
1. use both the CPUs and GPUs present on the individual nodes,
2. exploit all the most performing communication channels available,
3. employ CUDA-Aware (or, more generally, GPU-aware) message passing libraries and innovative BLAS implementations (for example, the Software for Linear Algebra Targeting Exascale (SLATE) library33), or innovative approaches that use a reformulation of the problem (1) no longer relying just on the LU decomposition (for example, the HPL-MxP Mixed-Precision Benchmark5).

THE "SCHUR COMPLEMENT"-BASED REFORMULATION OF A LINEAR EQUATION SYSTEM
The Schur complement is a fundamental and versatile tool in mathematical research and applications.34 It can be considered a fundamental tool for the analysis and solution of the so-called "Saddle Point Problem", which arises in a wide variety of technical and scientific applications. For example, the ever-increasing popularity of mixed finite element methods in engineering fields such as fluid and solid mechanics has been a major source of saddle point systems. Another reason for this surge in interest is the extraordinary success of interior point algorithms in both linear and nonlinear optimization, which require at their core the solution of a sequence of systems in saddle point form.35

Suppose the matrix A is partitioned in blocks as

A = [ A11  A12 ]
    [ A21  A22 ]

where A11 is n1 × n1 and A22 is n2 × n2. If A11 is invertible, the Schur complement of the block A11 of the matrix A is the n2 × n2 matrix S defined by

S = A22 − A21 A11⁻¹ A12.    (4)

The Schur complement arises naturally in solving a system of linear equations such as

A11 x1 + A12 x2 = y1
A21 x1 + A22 x2 = y2    (5)

Assuming that the sub-matrix A11 is invertible, we can eliminate x1 from the equations, as follows:

x1 = A11⁻¹ (y1 − A12 x2).

Substituting this expression into the second equation yields

(A22 − A21 A11⁻¹ A12) x2 = y2 − A21 A11⁻¹ y1.

We refer to this as the reduced equation obtained by eliminating x1 from the original equation. The matrix appearing in the reduced equation is the Schur complement S of the block A11. Solving the reduced equation, we obtain

x2 = S⁻¹ (y2 − A21 A11⁻¹ y1).    (8)

Substituting this into the first equation yields

x1 = A11⁻¹ (y1 − A12 x2).    (9)

From Equations (8) and (9) and from definition (4), it follows that the linear system (1) can be solved by the procedure described in Algorithm 1, where A, x, and y are partitioned as in (5).

Algorithm 1. The "Schur complement"-based algorithm for the linear system Ax = y solution
1: procedure SCHURCOMPLEMENTSOLUTION(A, y, x)
2:   Input: A, y
3:   Output: x
4:   Factorize A11 = L_A11 U_A11    ⊳ Task n. 1
5:   Solve A11 z = y1 by means of L_A11 and U_A11    ⊳ Task n. 2
6:   Solve A11 E = A12 by means of L_A11 and U_A11    ⊳ Task n. 3
7:   Compute w = y2 − A21 z    ⊳ Task n. 4. See Equation (8)
8:   Compute S = A22 − A21 E    ⊳ Task n. 5. See Equation (4)
9:   Factorize S = L_S U_S    ⊳ Task n. 6
10:  Solve S x2 = w by means of L_S and U_S    ⊳ Task n. 7. See Equation (8)
11:  Compute u = E x2    ⊳ Task n. 8. See Equation (9)
12:  Compute x1 = z − u    ⊳ Task n. 9. See Equation (9)
13: end procedure
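The Schur-complement solution scheme can be sketched in a few lines of numpy. This is a minimal serial illustration of the block reformulation (with `np.linalg.solve` standing in for the LU-based solves, and the function name chosen by us); in the benchmark each block operation runs as a separate task on CPUs or GPUs.

```python
import numpy as np

def schur_solve(A, y, n1):
    """Solve Ax = y via the Schur complement of the leading n1 x n1 block A11.
    Each step corresponds to a block operation of the Schur-based scheme."""
    A11, A12 = A[:n1, :n1], A[:n1, n1:]
    A21, A22 = A[n1:, :n1], A[n1:, n1:]
    y1, y2 = y[:n1], y[n1:]
    z = np.linalg.solve(A11, y1)      # A11 z = y1
    E = np.linalg.solve(A11, A12)     # A11 E = A12
    w = y2 - A21 @ z                  # reduced right-hand side
    S = A22 - A21 @ E                 # Schur complement, Equation (4)
    x2 = np.linalg.solve(S, w)        # reduced equation, Equation (8)
    x1 = z - E @ x2                   # back substitution, Equation (9)
    return np.concatenate([x1, x2])

rng = np.random.default_rng(1)
n, n1 = 12, 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # keeps A11 and S well conditioned
y = rng.standard_normal(n)
x = schur_solve(A, y, n1)
```

Note that x1 = A11⁻¹(y1 − A12 x2) = z − E x2, so the solves against A11 and the computations on the Schur complement S are independent block operations, which is what makes the scheme attractive for heterogeneous mapping.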
Concerning Algorithm 1, we can observe that the approach is particularly attractive because of its block-based formulation: each operation can be computed on different computational devices (CPUs and GPUs) depending on its computational cost, also considering that some operations (differently from what happens with algorithms based on the LU decomposition, which is strongly recursive) are independent of the others. In what follows, n2 = n − n1 and O(⋅) represents the computational cost (i.e., the order of magnitude of the number of floating point operations).

The Benchmark implementation details
This section describes the strategies used in the implementation of Algorithm 1. These strategies, which drive the choice of which computing device (CPUs or GPUs) to use in the allocation of each task, must take into account:
1. the dependency between tasks,
2. the computational cost of each task,
3. the balancing of the total computational load on the computing devices,
4. the number and size of the communications needed to guarantee the availability of the task input data on the allocated computing devices,
5. the preservation, as far as possible, of data locality, that is, data retention on the allocation devices.

TA B L E 1 The mapping of tasks in Algorithm 1.
Regarding point 1 of the above list, Figure 2 depicts the Dependency Diagram of the tasks listed in Algorithm 1, from which we can observe that some tasks are independent of the others, suggesting which of them can be executed, possibly concurrently, on the different computational resources (CPUs and GPUs) of the system.
Under the assumption that n1 < n2, and to account for the tasks' dependencies, computational costs, and data locality (see points 1, 2, and 5), the mapping described in Table 1 is considered: tasks with higher computational costs are mapped on the GPU, also in consideration of the required data exchange between CPUs and GPUs (see point 4).
If the mapping of the tasks described in Table 1 is adopted, the two computing parts of the system (CPUs and GPUs) share the total computational cost of Algorithm 1:

CompCost_Schur = CompCost_Schur_CPU + CompCost_Schur_GPU

where CompCost_Schur_CPU and CompCost_Schur_GPU (Equations (10) and (11)) collect the costs of the tasks mapped, respectively, on the CPUs and on the GPUs. To evaluate how the values of n1 and n2 could condition the balance of the computational load between CPU and GPU (see point 3), we have to consider that the balance can be expressed as the relation (12).

F I G U R E 3 The trend of the function Γ(α_CPU) in Equation (15).

CompCost_Schur_CPU / (NCores_CPU ⋅ NClock_CPU) = CompCost_Schur_GPU / (NCores_GPU ⋅ NClock_GPU)    (12)

where NCores_CPU and NCores_GPU represent, respectively, the numbers of cores of the CPUs and of the GPUs, and NClock_CPU and NClock_GPU are the numbers of theoretical flops per second executed, respectively, by one CPU core and by one GPU core.
Using (10) and (11), the relation (12) is valid if and only if a condition (Equation (13)) relating n1, n2, and the computing system features holds. If n1 is defined in terms of a fraction of n, that is, n1 = α_CPU ⋅ n (14), then Equation (13) can be rewritten as

Γ(α_CPU) = F_Sys(NCores_*, NClock_*)    (15)

so the computational load balance depends just on α_CPU, where 0 < α_CPU < 1, and on the computing system features. In Figure 3, the trend of the function Γ(α_CPU) in Equation (15) is shown.
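The balancing idea can be sketched numerically. Since Equations (10)-(11) are not reproduced here, the sketch below uses an assumed cost split (the O(n1³) factorization of A11 on the CPU side; the O(n1²n2), O(n1n2²), and O(n2³) work on E and S on the GPU side) and hypothetical device rates; it illustrates the technique of choosing α_CPU so that per-device times match, not the paper's exact formulas.

```python
import numpy as np

# Assumed cost model (NOT the paper's Equations (10)-(11)):
# CPU side dominated by the LU factorization of A11,
# GPU side by the work on E and on the Schur complement S.
def cpu_flops(n, a):                      # a = alpha_CPU, so n1 = a * n
    n1 = a * n
    return (2.0 / 3.0) * n1**3

def gpu_flops(n, a):
    n1 = a * n
    n2 = n - n1
    return 2 * n1**2 * n2 + 2 * n1 * n2**2 + (2.0 / 3.0) * n2**3

def balance_alpha(n, cpu_rate, gpu_rate, grid=10_000):
    """Grid search for the alpha_CPU equalizing CPU and GPU times (cf. Equation (15))."""
    alphas = np.linspace(1e-3, 0.999, grid)
    t_cpu = cpu_flops(n, alphas) / cpu_rate
    t_gpu = gpu_flops(n, alphas) / gpu_rate
    return alphas[np.argmin(np.abs(t_cpu - t_gpu))]

# Hypothetical rates: ~1 TFlop/s aggregate CPU, ~28 TFlop/s for 4 GPUs per node
alpha = balance_alpha(n=86_000, cpu_rate=1e12, gpu_rate=28e12)
```

With these assumed costs and rates, the balanced α_CPU falls well inside (0, 1); the same search applied to the paper's actual cost functions would yield the n1 discussed in Section 4.3.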
It is noteworthy that the relation (15) can be used to determine, starting from the characteristics of the computing resources and the problem dimension n, the value of n1 which should guarantee the balancing of the computational load (see Section 4.3 for an example of usage).
F I G U R E 4 Tasks allocation on a distributed memory computing infrastructure.
We note that the execution of the BLAS operations related to each task can be performed by versions of the BLAS library that are optimized for each computing infrastructure. For example, if a cluster of computing nodes equipped with some GPUs is used, each task can be performed by a Distributed Memory based BLAS library (e.g., ScaLAPACK,36 SLATE,33 PETSc,37 etc.): in such a case, the allocation of a task to the CPU or GPU devices means that "the local part Task x.p" of Task x is assigned, that is, on each node the computation is set on the local data of the same task (see Figure 4).
To define the amount of communication needed by the task allocations (see point 4 above), we note that:
• one communication is needed, at the end of task 1, to send to the GPU (task 3) the factorization of A11; the order of the data amount to be sent is O(n1²),
• one communication is needed, at the end of task 4, to send to the GPU (task 7) the vector w; the order of the data amount to be sent is O(n2),
• one communication is needed, at the end of task 8, to send to the CPU (task 9) the vector u; the order of the data amount to be sent is O(n1).
So, if the input data is already available on the computing devices, just three communications are performed (between GPU and CPU), for a total of

CommCost_Schur_CPU↔GPU = O(n1² + n1 + n2)

data transferred. However, it is important to observe that other communications can be hidden in the implementations of the BLAS operations of each task: for example, in the BLAS implementations for distributed memory systems where the exchange of messages takes place through non-CUDA-Aware libraries (see the next subsection for a definition of a CUDA-Aware Message Passing Library).

The time CommTime_Schur_CPU↔GPU spent during data transfer between GPU and CPU could be modeled as:

CommTime_Schur_CPU↔GPU = CommCost_Schur_CPU↔GPU / BW_CPU↔GPU

where BW_CPU↔GPU represents the bandwidth (i.e., the amount of data transferred per second) of the data transfer channel between CPU and GPU.
To define a model of performance for Algorithm 1, the total time to solution can be modeled (Equation (18)) by combining CommTime_Schur_CPU↔GPU with CompTime_Schur = CompTime_Schur_CPU + CompTime_Schur_GPU. Equation (18) can be used (see Section 4.3 for an example of use) to "predict" performance starting from the characteristics of the computing resources (NCores_*, NClock_*), from the problem dimension n, and from the value of n1 (defined by α_CPU).
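A sketch of such a performance model, under stated assumptions: the paper's exact Equation (18) is not reproduced here, so we assume devices compute concurrently (total compute time = max of the two) and then pay the three CPU↔GPU transfers at bandwidth BW. The cost functions and rates below are the same illustrative ones used earlier, not the paper's.

```python
def predicted_time(n, alpha, cpu_rate, gpu_rate, bw_bytes):
    """Illustrative time-to-solution model in the spirit of Equation (18):
    concurrent CPU/GPU compute plus the three CPU<->GPU transfers."""
    n1 = int(alpha * n)
    n2 = n - n1
    comp_cpu = (2.0 / 3.0) * n1**3 / cpu_rate                 # assumed CPU-side cost
    comp_gpu = (2 * n1**2 * n2 + 2 * n1 * n2**2
                + (2.0 / 3.0) * n2**3) / gpu_rate             # assumed GPU-side cost
    comm = 8 * (n1**2 + n1 + n2) / bw_bytes                   # three transfers, doubles
    return max(comp_cpu, comp_gpu) + comm

# Hypothetical node: 1 TFlop/s CPU, 28 TFlop/s GPUs, 12.5 GB/s CPU<->GPU link
t = predicted_time(n=86_000, alpha=0.1, cpu_rate=1e12, gpu_rate=28e12,
                   bw_bytes=12.5e9)
```

For these sizes the O(n1²) transfer of the A11 factors dominates the communication term, but the model is still compute-bound: the communication time is a small fraction of the total.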

F I G U R E 5 The layered architecture of the computing resource.

The Benchmark evaluation details
In this sub-section, we describe the computing environment used during evaluation tests of Algorithm 1 implementation.
We utilized a heterogeneous computational resource 38,39 equipped with 128 GPUs and approximately 1600 physical cores distributed across 32 nodes.These nodes are interconnected using InfiniBand and NVLink technologies.
The architecture of the computing resources can be depicted as a set of multiple layers (Figure 5). The highest layer of the architecture consists of the application layer, which is exposed to users. The lowest one consists of the hardware resources and comprises the 32 computing nodes: 128 NVIDIA Volta GPUs and about 1600 physical cores (from Intel Gen 2 Xeon Gold CPUs) distributed over the 32 nodes, whose connections are based on InfiniBand40 and NVLink241 technologies. The efficient use of cluster technologies is made possible by a software layer interposed between the lowest and the highest levels, namely the middleware, which is based on a combination of the following technologies: 1. OpenFabrics Enterprise Distribution (OFED)42 that makes available the drivers and libraries needed by the Mellanox InfiniBand network cards.
2. CUDA Toolkit 43 that makes available drivers, libraries, and development environments enabling NVIDIA GP-GPU usage.
3. "MPI-CUDA aware"44 implementation of OpenMPI45 through the UCX open-source framework.46

Bandwidth and latency in message exchange among processes are critical factors that hinder the full utilization of GPU potential.
In addressing this challenge, NVIDIA has introduced two important technologies: CUDA Inter-Process Copy (IPC)47 and GPUDirect Remote Direct Memory Access (RDMA).48 These technologies are designed for intra- and inter-node GPU process communications and are particularly valuable for InfiniBand-based clusters. Additionally, for optimizing inter-node GPU-to-GPU communications for small messages, NVIDIA offers NVIDIA gdrcopy.49 To integrate these technologies with communication libraries (i.e., OpenMPI), we used the UCX open-source framework. UCX is a communication framework optimized for modern, high-bandwidth, low-latency networks. It exposes a set of abstract communication primitives that automatically choose the best available hardware resources. Supported technologies include RDMA (both InfiniBand and RoCE), TCP, GPU, shared memory, and atomic network operations.
Table 2 shows the hardware and software features of the cluster nodes.
All the BLAS operations listed in Algorithm 1 are implemented using SLATE library procedures. The SLATE (Software for Linear Algebra Targeting Exascale) library is actively under development to provide essential capabilities for dense linear algebra on current and future distributed high-performance systems. This includes systems based on both CPU+GPU or just on CPU. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS and the solution of linear systems using LU and Cholesky. In this respect, it will serve as a replacement for ScaLAPACK, which, after two decades of operation, cannot adequately be retrofitted for modern accelerated architectures. SLATE uses modern techniques such as communication-avoiding algorithms, look-ahead panels to overlap communication and computation, and task-based scheduling, along with a modern C++ framework.33

While the BLAS operations provided by SLATE can be utilized on Distributed Memory computing platforms, they lack strategies to fully exploit the heterogeneous capabilities of nodes by simultaneously leveraging both CPUs and GPUs. The SLATE procedures offer a CPU execution mode that can be specified using the execution target option: if such a target is defined as slate::Target::HostTask, the execution will happen on the CPU cores using OpenMP tasks,50 allowing the exploitation of the multicore architecture of modern computing nodes.

The benchmark evaluation results
In this sub-section, we illustrate some results about the evaluation of the implementation of Algorithm 1 in the software module schur_solve.
The developed module schur_solve uses the procedures offered by the SLATE library and uses double precision floating point numbers (i.e., sizeof(float) = 8 bytes). The bandwidth BW_CPU↔GPU of the communication channel between the CPU and GPU was measured with the CUDA bandwidthTest utility. As a term of comparison, we also show the results related to the execution of the SLATE module slate::gesv, which solves the linear system (1) exclusively using the CPU or the GPU and whose computational cost is CC_LUBased = O((2∕3)n³).
In Figures 6-8, we show the results of the module schur_solve executed on some nodes of the described cluster: the number of total MPI tasks is 4P, where P is the number of involved nodes, and the number N_OpenMPTasks of OpenMP tasks used for the CPU execution of SLATE procedures is fixed to N_OpenMPTasks = 12. With this configuration, the value of F_Sys(NCores_*, NClock_*) in (15) is of the order of 10⁻³ (19). It follows, from (15) and (19), that to get computational load balance a good choice for α_CPU (and then of n1) should be such that O(Γ(α_CPU)) = 10⁻³. From Figure 3, we can observe that this happens when 0.06 < α_CPU < 0.18 (see the zoomed part of the plot).

F I G U R E 6 Tests results, NB = 500: the Execution Time T(P, n) (A), the Scaled Execution Time T(P, Pn) (B), the Speed-Up S(P, n) (C), the Scaled Speed-Up SS(P, n) (D), the Sustained Performance SP(P, n) (E), the Scaled Sustained Performance SP(P, Pn) (F), the fraction of Peak Performance SPF(P, n) (G), and the Scaled fraction of Peak Performance SPF(P, Pn) (H). Red, green, and blue lines respectively represent the implementation of the Schur-based algorithm, slate::gesv on GPU, and slate::gesv on CPU.
The tests, whose main aim is to verify the scalability of the Algorithm 1 implementation, are performed using different values of the SLATE block dimension NB, which SLATE uses to distribute matrices over a distributed memory computational resource.51 The plots in Figures 6-8 show the metrics listed in the captions, where PP(P) is the Peak Performance of P nodes when, for each node, all four GPU devices and all the CPU cores are considered (see Table 2 for the considered values of PP_CPU and PP_GPU).
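The plotted metrics can be derived from raw timings as sketched below; the timing and peak-performance values in the example are hypothetical placeholders, not the Table 2 numbers.

```python
def benchmark_metrics(T, flops, P, pp_node):
    """Derive the plotted metrics from measured execution times.
    T: dict mapping node count -> execution time T(P, n);
    flops: operation count of the solver;
    pp_node: peak performance of one node (flop/s, CPUs + GPUs)."""
    S = T[1] / T[P]                 # Speed-Up S(P, n)
    SP = flops / T[P]               # Sustained Performance SP(P, n), flop/s
    SPF = SP / (P * pp_node)        # fraction of Peak Performance SPF(P, n)
    return S, SP, SPF

# Hypothetical timings for a fixed problem size (strong scalability)
times = {1: 100.0, 2: 55.0, 4: 30.0}
S, SP, SPF = benchmark_metrics(times, flops=4.2e14, P=4, pp_node=30e12)
```

The scaled variants in the figures apply the same formulas with the problem size grown together with P (weak scalability).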
The following values for n are considered: n = 86,000 (strong scalability tests52) and n = 30,000 ⋅ P (weak scalability tests52). From the plots we can observe that:
• the implementation of the Schur-based algorithm (the software module schur_solve) seems to be less sensitive to variations of the parameter NB;
• times to solution and speed-up values of the software module schur_solve are similar to (and, in the case of NB = 500, better than) those obtained with the module slate::gesv executed on GPUs;
• when the metrics related to the number of operations per unit of time are considered, the module schur_solve does not reach the same performance as the module slate::gesv executed on the GPU. This could be because the number of operations CC_Schur performed by the former could be different from the number of operations CC_LUBased performed by the latter.
About the fraction of Peak Performance, please note that the performance percentages reported by the Top500 are related to the performance measurement R_max, which is obtained using the largest problem size n_max fitting the memory of the computing system.2 Intending to use the proposed new benchmark implementation similarly to the one already used to draw up the Top500, we have identified, for different numbers of nodes P, the largest problem size n_max(P) fitting the memory of the computing system.

F I G U R E 9 "Heatmap" of the Theoretical Sustainable Performance TSP(P, n(P), α_CPU) as a function of α_CPU and P for different values of the problem dimension on one node n(1). The value of the problem dimension n(P) on P nodes is scaled according to the following rule: n(P) = n(1)√P.
In Table 3 the Execution Time T(P, n max (P)) and the Sustained Performance SP(P, n max (P)) are shown for the values of n max (P) already identified.
The results listed in Table 3 confirm the very low fraction of Peak Performance obtained using the described implementation of Algorithm 1. Yet Algorithm 1 should be able to achieve performances very close to the peak ones (see Figures 9 and 10). Such figures show, respectively, 1. the Theoretical Sustainable Performance TSP(P, n(P), α_CPU) (see Equation (18)), evaluated using the computing system features listed in Table 2, and 2. the corresponding Theoretical Peak Performance Fraction TSPF(P, n(P), α_CPU). The low level of performance could potentially be attributed to implementation issues. To identify such problems, profiling the developed software module schur_solve could be useful. For this reason, thanks to the tool nvprof available in the CUDA Toolkit, a representation of this profiling has been generated (see Figure 11). The figure represents the profiling data of one of the four MPI tasks executed on one node to solve a problem whose dimensions are n = 130,000 and n1 = 13,000, and where each MPI task uses N_OpenMPTasks = 2 OpenMP threads. The profiling view (see Figure 11A) reveals that a significant portion of the execution time (see the time range highlighted by the green box) was spent not in computing or CPU-GPU communication actions but in some other action type that seems related to memory access and/or management (see the very large number of calls to the CUDA driver function cuPointerGetAttributes). The Summary of GPU activities (see Figure 11B) demonstrates a relatively well-balanced distribution between computational and communication activities on the GPU, with a slight prevalence of the latter.
It's important to investigate these implementation issues, considering that many intricate details are concealed within the procedures of the SLATE library upon which the described implementation is built.

CONCLUSIONS AND FUTURE WORK
This work outlines our initial efforts in designing a new Linpack-like Benchmark, based on the Schur Complement reformulation of the solution of a linear equation system, useful to evaluate the performance of Heterogeneous Computing Resources.
Our objective is not to replace the legacy HPL benchmark, which serves as the de facto standard for evaluating HPC platforms.
Nonetheless, there seems to be a very heated discussion on the opportunity to supplement historical benchmarks with new tools capable not only of responding better to the availability of new technology but also of being more representative of real system workloads (e.g., benchmarks based on sparse solvers). Therefore, this work aims to contribute to this discussion, in the hope that it can prove useful for a (re-)formulation, considered necessary by many, of the historically consolidated tools.
We provide in-depth insights into the implementation and evaluation, with a primary focus on performance scalability, of our revamped Linpack Benchmark. These details pertain to a computing environment based on nodes with multiple NVIDIA GP-GPUs interconnected by an InfiniBand network.
However, it is worth noting that our proposed approach is adaptable to various accelerator technologies, such as ROCm for AMD GPUs 32 or oneAPI for Intel accelerators. 54 Test results reveal that the benchmark's performance is on par with tools that predominantly emphasize the computational aspect linked to the GPU.
We anticipate that, by enhancing the distribution of tasks across computational components and addressing the aforementioned implementation issues, we can improve the benchmark's quality in measuring the performance of heterogeneous systems, especially in the context of scientific computing. Additionally, we envision that further performance improvements can be realized through full utilization of the potential offered by CUDA-aware (or, more generally, GPU-aware) MPI implementations. Some of our future work will be dedicated to these endeavors.

TA B L E 2 (fragment) BW_CPU↔GPU: 12.5 GB/s ∕ sizeof(float) (number of floats per second). The data bandwidth was obtained by the CUDA bandwidthTest utility.

F I G U R E 10 "Heatmap" of the Theoretical Peak Performance Fraction TSPF(P, n(P), α_CPU) as a function of α_CPU and P for different values of the problem dimension on one node n(1). The value of the problem dimension n(P) on P nodes is scaled according to the rule n(P) = n(1)√P.
F I G U R E 11 Profiling of the implementation of Algorithm 1 visualized by the NVIDIA Visual Profiler (A) and Summary of GPU activities (B). The figure represents the profile data, obtained by the nvprof tool, of one of the four MPI tasks executed on one node to solve a problem where n = 130,000 and n1 = 13,000. Each MPI task uses N_OpenMPTasks = 2 OpenMP threads.
A L G O R I T H M 1 (fragment) Solve A11 y1 = b1 by means of L_A11 and U_A11; Solve A11 E = A12 by means of L_A11 and U_A11; Solve S x2 = w by means of L_S and U_S ⊳ Task n. 2
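The recoverable steps above belong to the Schur complement solve at the core of the benchmark. As a hedged illustration only (a NumPy sketch of the standard block elimination, not the SLATE-based MPI/GPU implementation described in this work; the sizes and the test matrix are illustrative assumptions):

```python
import numpy as np

# Sketch of the Schur complement solve (illustrative only; the paper's
# implementation works with the LU factors L_A11, U_A11, L_S, U_S from SLATE).
rng = np.random.default_rng(0)
n, n1 = 10, 4  # toy sizes; the paper uses e.g. n = 130,000, n1 = 13,000

A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)

# Block partition: A = [[A11, A12], [A21, A22]], b = [b1, b2]
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]
b1, b2 = b[:n1], b[n1:]

# Task 1: solves involving A11 (in the paper, by means of L_A11 and U_A11)
y1 = np.linalg.solve(A11, b1)   # A11 y1 = b1
E = np.linalg.solve(A11, A12)   # A11 E = A12

# Task 2: form the Schur complement S and solve S x2 = w
S = A22 - A21 @ E               # in the paper, factored into L_S and U_S
w = b2 - A21 @ y1
x2 = np.linalg.solve(S, w)

# Back-substitution for the first block of unknowns
x1 = y1 - E @ x2
x = np.concatenate([x1, x2])
```

Eliminating the first block row gives (A22 − A21 A11⁻¹ A12) x2 = b2 − A21 A11⁻¹ b1, so the result x agrees with a direct solve of A x = b.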
To give indications about its performance in terms of the number of floating point operations per second, we propose the Theoretical Sustainable Performance metric TSP(n, α_CPU, NCores*, NClock*), defined as in Equation (18).
TA B L E 2 Hardware and software specs of cluster nodes. Please note that the value of the data bandwidth was obtained by the CUDA bandwidthTest utility.
F I G U R E (*) Tests results, NB = 1000: the Execution Time T(P, n) (A), the Scaled Execution Time T(P, Pn) (B), the Speed-Up S(P, n) (C), the Scaled Speed-Up SS(P, n) (D), the Sustained Performance SP(P, n) (E), the Scaled Sustained Performance SP(P, Pn) (F), the fraction of Peak Performance SPF(P, n) (G), and the Scaled fraction of Peak Performance SPF(P, Pn) (H). Red, green, and blue lines respectively represent the implementation of the Schur-based algorithm, slate::gesv on GPU, and slate::gesv on CPU test results.
F I G U R E (*) Tests results, NB = 2000: same panels (A)-(H) and color coding as above.
The metrics reported in the figures are defined as follows:
• T(P, n): the execution time (in seconds) of the module schur_solve as a function of the number P of nodes, for some values of n;
• S(P, n): the Speed-Up of the execution as a function of the number P of nodes, for some values of n;
• SS(P, n): the Scaled Speed-Up of the execution as a function of the number P of nodes, for some values of n;
• SP(P, n): the Sustained Performance (expressed in TeraFLOPS) obtained during the execution as a function of the number P of nodes, for some values of n; it represents the number of floating point operations CC(n) executed by the algorithm in a time range;
• SPF(P, n): the fraction of Peak Performance obtained during the execution as a function of the number P of nodes, for some values of n.
For all the tests, the values n1 = 2000 (strong scalability tests) and n1 = 2000 · P (weak scalability tests) are considered, in line with the considerations previously made about better choices for α_CPU. From the tests:
• weak scalability tests seem to confirm behavior already described for the ancestor of SLATE named DPLASMA 53 (for example, lines in Figures *-(F) show a trend similar to those in fig. 7 in Bosilca et al. 53);
• the role of the blocking factor NB is decisive in obtaining performance: the most appropriate value of this parameter (in this specific case NB = 1000) makes it possible to reduce the time for solving the problem by up to 50% (i.e., see Figures *-(A) and *-(B));
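The metrics above can be sketched in a few lines. This is a minimal illustration under one stated assumption: the operation count CC(n) ≈ (2/3)n³, the standard cost of a dense LU-based solve, which the text does not give explicitly:

```python
# Benchmark metrics as defined above (sketch). CC(n) ~ (2/3) n^3 is an
# assumption standard for dense LU solvers, not taken from the paper.

def cc(n):
    """Floating point operation count CC(n) for problem size n (assumed)."""
    return (2.0 / 3.0) * n ** 3

def sustained_performance(n, t_pn):
    """SP(P, n) = CC(n) / T(P, n), in FLOPS (T given in seconds)."""
    return cc(n) / t_pn

def speed_up(t_1n, t_pn):
    """S(P, n) = T(1, n) / T(P, n)."""
    return t_1n / t_pn

def peak_fraction(n, t_pn, p, peak_per_node):
    """SPF(P, n) = SP(P, n) / (P * Peak Performance per node)."""
    return sustained_performance(n, t_pn) / (p * peak_per_node)
```

For instance, a run with n = 130,000 completing in T seconds on P nodes would report sustained_performance(130_000, T) / 1e12 TeraFLOPS.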

TA B L E 3 Execution time T(P, n) and Sustained Performance SP(P, n) for some values of α_CPU (α_CPU = 0.10, α_CPU = 0.20, α_CPU = 0.40). If n_max(1) is the value of the largest problem size on one node, then the value of n_max(P) is obtained by scaling n_max(1) according to the rule n_max(P) = n_max(1)√P.
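The scaling rule in the caption keeps the per-node memory footprint roughly constant: a dense n × n matrix occupies O(n²) storage spread across P nodes, so choosing n(P) = n(1)·√P leaves n(P)²∕P unchanged. A minimal sketch:

```python
import math

def scaled_problem_size(n1, p):
    """Weak-scaling rule n(P) = n(1) * sqrt(P): keeps the per-node share
    n(P)^2 / P of a dense matrix constant when distributed over P nodes."""
    return n1 * math.sqrt(p)
```

For example, going from 1 to 4 nodes doubles the problem dimension while each node still holds the same number of matrix elements.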

F I G U R E 9 (caption fragment) The Theoretical Sustainable Performance TSP(P, n(P), α_CPU), evaluated using the computing system features listed in Table 2, as a function of α_CPU and P for different values of the problem dimension on one node n(1) (the value of the problem dimension n(P) on P nodes is scaled according to the same rule used above).
From both Figures 9 and 10, it turns out that for small values of α_CPU (i.e., where α_CPU < 0.15) the proposed algorithm based on the Schur reformulation of problem (1) can exploit a very large fraction of the Peak Performance.