Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems

In this paper, we present the StarNEig library for solving dense nonsymmetric standard and generalized eigenvalue problems. The library is built on top of the StarPU runtime system and targets both shared and distributed memory machines. Some components of the library have support for GPU acceleration. The library is currently in an early beta state and supports only real matrices. Support for complex matrices is planned for a future release. This paper is aimed at potential users of the library. We describe the design choices and capabilities of the library, and contrast them to existing software such as ScaLAPACK. StarNEig implements a ScaLAPACK compatibility layer which should assist new users in the transition to StarNEig. We demonstrate the performance of the library with a sample of computational experiments.


INTRODUCTION
In this paper, we present the StarNEig library 1 for solving dense nonsymmetric standard and generalized eigenvalue problems. StarNEig differs from existing libraries such as LAPACK 2 and ScaLAPACK 3 by relying on a modern task-based approach (see, e.g., 4 and references therein) in a manner similar to the already well-established PLASMA library 5 . Specifically, StarNEig is built on top of the StarPU runtime system 6 . This allows StarNEig to target both shared memory and distributed memory machines. Furthermore, some components of StarNEig have support for GPU acceleration. The library is currently in an early beta state and under continuous development. Currently, StarNEig applies to real matrices with real and complex eigenvalues and all calculations are done using real arithmetic. This paper is an extended version of a conference paper by the authors 7 .
This paper is addressed to potential users of the StarNEig library. We hope that readers, who are already familiar with ScaLA-PACK, will be able to decide if StarNEig is suitable for them. In particular, we wish to communicate the changes needed to integrate existing ScaLAPACK style (or LAPACK style) software with StarNEig. Central to this integration is the ScaLAPACK compatibility layer implemented in StarNEig. This compatibility layer allows users to keep their existing two-dimensional block cyclic distribution of the data and call StarNEig routines directly to perform the computations. The authors hope to start a discussion which will help guide and prioritize the future development of the library.
Dense nonsymmetric eigenvalue problems are usually solved using three phases: reduction to condensed Hessenberg form, reduction to (real) Schur form and computation of eigenvectors. Additionally, a fourth phase, called eigenvalue reordering, can be performed to acquire an invariant subspace that is associated with a given subset of eigenvalues. All four phases are implemented in StarNEig in a task-based manner. Performance-wise, the Hessenberg reduction phase in StarNEig is comparable to the LAPACK, ScaLAPACK and MAGMA 8 libraries, while the Schur reduction and the reordering phases in StarNEig are significantly faster than the ScaLAPACK implementations. Moreover, StarNEig can compute the eigenvectors directly from any Schur form without suffering from floating-point overflow, i.e., the implementation in robust. This functionality simply does not exist in ScaLAPACK and the implementation in StarNEig is significantly faster than the LAPACK implementation in a parallel setting. We refer the reader to 9 for more comprehensive performance and accuracy evaluations.
The rest of this paper is organized as follows: Section 2 provides a brief summary of the solution of dense nonsymmetric eigenvalue problems using the four phases mentioned earlier. Section 3 introduces the task-based approach and explains why the task-based approach can potentially lead to superior performance when compared to older, well-established techniques. Section 4 introduces the reader to some of the inner workings of StarNEig. In particular, the current state of the library and various limitations are explained in this section. Section 5 presents a sample of computational results which demonstrate the expected performance of StarNEig in both shared and distributed memory. Finally, Section 6 concludes the paper.

SOLUTION OF DENSE NONSYMMETRIC EIGENVALUE PROBLEMS
Given a matrix ∈ ℝ × , the standard eigenvalue problem consists of computing eigenvalues ∈ ℂ and matching eigenvectors ∈ ℂ , ≠ 0, such that Similarly, given matrices ∈ ℝ × and ∈ ℝ × the generalized eigenvalue problem for the matrix pair ( , ) consists of computing generalized eigenvalues ∈ ℂ ∪ {∞} and matching generalized eigenvectors ∈ ℂ , ≠ 0, such that = . ( If the matrices and are sparse, then the well-known SLEPc library 10 is one the better tools for solving the eigenvalue problems (1) and (2). Similarly, if the matrices and are is symmetric, then algorithms and software that take advantage of the symmetry are preferred (see, e.g., 5,11,12,13,14 ). Otherwise, if the matrices are both dense and nonsymmetric, then the route of acquiring the (generalized) eigenvalues and the (generalized) eigenvectors usually includes the following three phases: Hessenberg(-triangular) reduction: The matrix or the matrix pair ( , ) is reduced to Hessenberg form or Hessenbergtriangular form ( , ) by a similarity transformation where is upper Hessenberg, is a upper triangular, and 1 and 1 are orthogonal matrices.

Schur reduction:
The Hessenberg matrix or the Hessenberg-triangular matrix pair ( , ) is reduced to real Schur form or generalized real Schur form ( , ) by a similarity transformation where is upper quasi-triangular with 1 × 1 and 2 × 2 blocks on the diagonal, is a upper triangular, and 2 and 2 are orthogonal matrices. The eigenvalues or generalized eigenvalues can be determined from the diagonal blocks of or ( , ).
Eigenvectors: Finally, we solve for vectors ∈ ℂ from and backtransform to the original basis by Additionally, a fourth phase can be performed to acquire an invariant subspace of or ( , ) that is associated with a given subset of eigenvalues or a given subset of generalized eigenvalues: Eigenvalue reordering: The real Schur form or the generalized real Schur form ( , ) is reordered, such that a selected set of eigenvalues or generalized eigenvalues appears in the leading diagonal blocks of an updated real Schur form̂ or an updated generalized real Schur form (̂ ,̂ ), by a similarity transformation where 3 and 3 are orthogonal matrices.
See 15 for a detailed explanation of the underlying mathematical theory.

A CASE FOR THE TASK-BASED APPROACH
A task-based algorithm functions by cutting the computational work into self-contained tasks that all have a well defined set of inputs and outputs. In particular, StarNEig divides the matrices into disjoint (square) tiles and each task takes a set of tiles as its input and produces/modifies a set of tiles as its output. The main difference between tasks and regular function/subroutine calls is that a task-based algorithm does not call the associated computation kernels directly. Instead, the tasks are inserted into a runtime system which then derives the data dependences between the tasks from the supplied input and output information. The runtime system then schedules the tasks to computational resources, such as CPUs and GPUs, in a sequentially consistent order as dictated by the data dependences. The main benefit from this is that as long as the cutting is carefully done, the underlying parallelism is exposed automatically as the runtime system gradually traverses the resulting task graph. In particular, the runtime system can detect and tap into previously unexplored avenues of parallelism that are hidden within the task graphs. This leads to significantly more powerful algorithms that are able to adapt to different inputs and changing hardware configurations. Other benefits of the task-based approach include, for example, better load balancing and resource utilization due to dynamic scheduling, task priorities and implicit MPI communications that are automatically derived from the task graph.
The following subsections briefly discuss the four phases in the algorithm stack from the point of view of task parallelism. We do not attempt to explain the details of each algorithm, rather we focus on the key steps and explain how they benefit from task parallelism. We will use the standard eigenvalue problem as an illustration.  We begin by discussing the Hessenberg reduction phase. We emphasize that this phase does not benefit as much from the task-based approach at the other phases. Now, the so-called standard algorithm 16 for reducing a nonsymmetric matrix to upper Hessenberg form first divides the matrix into disjoint panels as illustrated in Figure 1a. Each panel is then reduced to upper Hessenberg form as summarized below:

GPU-accelerated Hessenberg reduction
1. Reduce the panel. The th column in the panel is reduced by constructing and applying a suitable Householder reflector − , where ∈ ℝ and ∈ ℝ . All involved reflectors are initially applied only inside the current panel and accumulated into a compact WY representation − , where ∈ ℝ × is upper triangular and = [ 1 2 … ]. In tandem, one of the necessary intermediate results is also accumulated into a matrix = . The construction of the matrix allows us to reduce the entire panel without updating the other sections of the matrix. However, the update formula includes a large matrix-vector multiplication involving the section of the matrix that trails the current column (see the shaded area in Figure 1b) and the non-zero part of the vector . Although these matrix-vector multiplications constitute approximately only 20% of the total number of flops, they are significantly more expensive than the remaining 80% due to the fact that matrix-vector multiplication is a memory bound operation. This is an important factor to consider when analysing the performance.
2. Update the trailing matrix. Update the section of the matrix that trails the current panel (see the shaded area in Figure 1c) by the update formula where − and are the final compact WY representation and the final intermediate result matrix from the panel reduction step, respectively.
3. Update the top part from the right. Update the section of the matrix above the current panel (see the shaded area in Figure  1d) by the update formula Critical path (Reduce the panel and Update the trailing matrix) CPUs GPU CPUs FIGURE 2 An illustration of the two scheduling contexts.
In StarNEig, each one of the aforementioned three steps is formulated as a task. In particular, for task scheduling overhead related reasons, the entire reduce the panel step is implemented as one monolithic task. Implementing each column reduction as a set of separate tasks would lead to an excessive number of lightweight tasks and thus to an unmanageable amount of scheduling overhead. The computational resources are divided into two scheduling contexts: Parallel scheduling context contains a subset of the available CPU cores and a GPU, if one exists in the system. Each inserted task is executed in parallel by all included CPU cores or by the GPU, depending on which resource is predicted to be the best option by the runtime system. In order to accomplish this, the runtime system uses calibrated performance models to predict the execution and data transfer times for each task. The reduce the panel and update the trailing matrix tasks form the critical path of the algorithm and are scheduled to this scheduling context as illustrated in Figure 2. The CPU implementation of the reduce the panel task copies the trailing matrix to memory buffers that are allocated from the local NUMA islands. That is, each involved CPU core has its own local memory buffer and the obtainable memory bandwidth is thus significantly higher compared to a situation where the CPU cores would access the memory across the NUMA islands. Note that the trailing matrix is copied only once in the beginning of each reduce the panel task. This is an another reason for implementing reduce the panel step as one monolithic task.
Sequential scheduling context contains the remaining CPU cores and will inherit computational resources from the parallel scheduling context once the critical path has been completed. Each inserted task is executed sequentially by one of the available computational resources. The update the top part from the right tasks are scheduled to this context as illustrated in Figure 2. Note that the update the top part from the right tasks never feeds back into the critical path and can therefore be scheduled independently. The tasks that compute the matrix 1 in (3) are also scheduled to this context and given lower priority.
The main benefit that comes from the task-based approach here is that the runtime system is allowed to schedule the work to the GPU when it predicts this will improve the performance. In particular, the runtime system usually schedules reduce the panel tasks to the GPU because most modern GPUs have a much higher memory bandwidth than CPUs. This means that the time that is spent computing the large matrix-vector multiplications is significantly reduced. The runtime system will handle the necessary data transfer between main memory and GPU memory, including prefetching of the data when necessary. The end result is that the task-based implementation will naturally behave very similarly to the implementation available in MAGMA library 8,17 but provides some additional flexibility as the scheduling decisions are done dynamically. We will now use the Schur reduction and eigenvalue reordering phases to illustrate some of the more notable benefits of the task-based approach. The modern approach for obtaining a Schur form of a nonsymmetric matrix is to apply the multi-shift QR algorithm with Aggressive Early Deflation (AED) to the upper Hessenberg form (see 18,19,20,21 and references therein). The algorithm is a sequence of steps of two types, AED and bulge chasing, as illustrated in Figure 3 and summarized below:

Aggressive Early Deflation, Bulge Chasing and Eigenvalue Reordering
AED step first reduces a small diagonal window (a.k.a AED window) to real Schur form via a recursive application of the QR algorithm. When the sections of the matrix outside the AED window are updated from the left, a spike is induced to the left of the AED window and we will thus temporarily deviate from the upper Hessenberg form. All diagonal blocks in the reduced AED window are then systematically evaluated in order to identify those eigenvalues that can be safely deflated without introducing significant perturbations. If the corresponding element in the spike is found to be small enough (i.e., the so-called deflation condition is satisfied), then the element is set to zero and the eigenvalue that corresponds to the diagonal block is deflated. On the other hand, if the corresponding entry in the spike is found to be too large, then the remaining AED window is reordered such that the diagonal block that failed the deflation check is moved to the upper left corner of the window thus pushing the remaining unevaluated diagonal blocks downwards along the diagonal. After all diagonal blocks have been evaluated, the remaining spike is eliminated by performing a small-sized Hessenberg reduction.
Bulge chasing step chases a set of 3×3 bulges down the diagonal. The eigenvalues of the diagonal blocks that failed the deflation condition, 1 , 2 , … , , are used as shifts to generate the bulges. That is, the first column of the matrix is transformed to the first column of the matrix ( − 1 )( − 2 ) via a small-sized Householder reflector. When applied from the both sides, the reflector creates fill-in in the form a 3 × 3 bulge that appears in the upper left corner of the matrix. At this point, the bulge could be eliminated by chasing it down the diagonal with a sequence of overlapping small-sized Householder reflectors and this would complete one implicit QR iteration. However, a multi-shift QR algorithm will instead chase the bulge just enough so that a second bulge can be introduced using the shifts 3 and 4 . The same procedure is then repeated until a total of ∕2 bulges have been introduced. The bulges are then chased in groups down the diagonal to complete one pipelined QR iteration. The bulge chasing step is then followed by a second AED step and the same procedure is repeated until all eigenvalues have been deflated or an iteration limit is reached.
Similarly, the eigenvalue reordering phase is based on applying sequences of overlapping Givens rotations and small-sized Householder reflectors to . The Schur form is essentially reordered in a bubble sort manner by using kernels that swap two adjacent diagonal blocks 22 .   If the Givens rotations and small-sized Householder reflectors are applied one by one, then memory is accessed as shown in Figure 4a. This is grossly inefficient for two reasons: i) the transformations are so localized that parallelizing them would not produce any significant speedup and ii) the matrix elements are touched only once thus leading to very low arithmetic intensity.
The modern approach (see 18,19,20,21,23,24 and references therein) groups together a set of local transformation and initially applies them to a relatively small diagonal window as shown in Figure 4b. The localized transformations are accumulated into an accumulator matrix and later propagated as BLAS level-3 operations acting on the off-diagonal sections of the matrix as shown in Figure 4c. This leads to much higher arithmetic intensity and enables proper parallel implementations as multiple diagonal windows can be processed concurrently. These are the main reason why the multi-shift QR algorithm introduces several 3 × 3 bulges to the diagonal. The 3 × 3 bulges are divided into groups and each group is chased separately down the diagonal. A set, or a chain, of overlapping diagonal windows is associated with each group. Similarly, several selected diagonal blocks can be grouped and moved together in the eigenvalue reordering phase 23,24,25 .
The Schur reduction and eigenvalue reordering phases are implemented in ScaLAPACK as PDHSEQR 20 and PDTRSEN 24 subroutines, respectively. Following the ScaLAPACK convention, the matrices are distributed in a two-dimensional block cyclic fashion 26 . The resulting memory access pattern is illustrated in Figure 4d for a 3 × 3 MPI process mesh. In this example, three diagonal windows can be processed simultaneously. The related BLAS level-3 off-diagonal updates require careful coordination since the left and right hand side updates must be performed in a sequentially consistent order. In practice, this means (global or row/column communicator broadcast) synchronization after each set of BLAS level-3 off-diagonal updates have been applied. In addition, each AED step introduces a global synchronization point.
In StarNEig, the Schur reduction and eigenvalue reordering phases are implemented with the following tasks types: Window task generates and applies a set of local transformations inside a diagonal window. Takes the intersecting tiles as input, and produces updated tiles and an accumulator matrix as output.
Right update task applies accumulated right-hand side updates using BLAS level-3 operations. Takes the intersecting tiles and an accumulator matrix as input, and produces updated tiles as output.
Left update task applies accumulated left-hand side updates using BLAS level-3 operations. Takes the intersecting tiles and an accumulator matrix as input, and produces updated tiles as output.

FIGURE 5
A hypothetical task graph arising from a situation where a set of three bulges is chased down the diagonal. We have simplified the graph by omitting dependences between the left (L) and right (R) update tasks as these dependences are enforced through the diagonal bulge chasing tasks (W). Note that the actual bulge chasing step involves several sets of bulges and the resulting task graph is therefore significantly more complex than the simplified graph presented here.
The tasks are inserted into the runtime system in a sequentially consistent order and each chain of overlapping diagonal windows leads to a task graph like the one shown in Figure 5. Note that real live task graphs are significantly more complex than shown here, but also enclose more opportunities for parallelism. It is also critical to realize that the runtime system guarantees that the tasks are executed in a sequentially consistent order. In particular, there is no need for synchronization and different computational steps are allowed to overlap (see Figure 4e) as the runtime system merges the corresponding sub-graphs together. This can lead to a much higher concurrency since idle time can be reduced by delaying low priority tasks until computational (a) The first bulge chasing step with delayed right-hand side updates.
(b) Overlapping AED and bulge chasing steps.

FIGURE 6
Snapshots taken during the first two iterations of the multi-shift QR algorithm with AED. The numbering matches the numbering in Figure 7. That is, (1) is the bulge chasing step from the first iteration, (2) is the small-sized Hessenberg reduction from the second AED step and (3) is the bulge chasing step from the second iteration. Note that the three steps are merged in (b).

Aggressive Early Deflation
First iteration Second iteration 1 2 3

FIGURE 7
An illustration of the first two iterations of the multi-shift QR algorithm with AED. The steps that are concurrently active in Figure 6b are highlighted in bold and numbered correspondingly.
resources start becoming idle. This is exactly what can be observed from Figure 6a where a subset of the right-hand side updates are given lower priority and are therefore scheduled only after the diagonal bulge chasing tasks and the left-hand side update tasks cannot saturate all available computational resources. This is possible because, as seen in Figure 5, most right-hand side update tasks do not feed back to the other sections of the task graph and can therefore be scheduled independently. The AED step can also be overlapped with the bulge chasing steps as shown in Figures 6b and 7. This improves the concurrency significantly compared to ScaLAPACK because the ScaLAPACK implementation will effectively synchronize before and after each AED step. This means that in ScaLAPACK most MPI ranks will become idle while a subset of the MPI ranks are involved in the AED step. Actually, StarNEig can even overlap two bulge chasing steps with each other as seen in Figures 6b. See 9,27 for further information.

Robust computation of eigenvectors
Mathematically, the problem of computing a single eigenvector of, say, a quasi-triangular matrix is trivial. The standard algorithm is a variant of substitution. However, substitution is very vulnerable to floating-point overflow. In particular, there exist triangular linear systems which are well-conditioned in the sense of Skeel for which the solution grows exponentially and rapidly exceeds the representational range of any floating-point number system ? . We say that an algorithm or a subroutine is robust if all intermediate and final results are in the representational range, i.e., floating point overflow is avoided. In LAPACK there exists a robust family xLATRS of subroutines for solving triangular linear system = 28 . They dynamically scale the entire right-hand side and return a scaling factor and a vector such that = . The purpose of the scaling factor is to extend the floating-point representational range. In LAPACK, the solvers for computing eigenvectors from Schur forms are all descended from xLATRS. They are scalar codes which compute the eigenvectors one by one. The back-transformation to the original basis can of course be done using BLAS level-3 operations. In ScaLAPACK, there is no parallel implementation of xLATRS and the solvers for computing eigenvectors are not robust.
In contrast, StarNEig implements novel algorithms for computing eigenvectors which are tiled, parallel and robust. In StarNEig, each matrix of eigenvectors is partitioned into tiles . Every tile = 1 2 ⋯ is augmented with a vector of scaling factors ∈ ℝ with one scaling factor per column. The augmented tile ⟨ , ⟩ represents the matrix = 1 2 ⋯ given by = ∕ . The StarNEig solvers for computing eigenvectors accept and produce augmented tiles. This allows StarNEig to obtain a representation of the eigenvectors without exceeding the representational range. A final post-processing step ensures that all segments of each eigenvector are consistently scaled. In reality, StarNEig is exploiting a principle which is familiar to every scientist: Two results can be combined if we know how to convert between the different system of measurements, say, the metric system and the imperial system.
The use of augmented tiles is compatible with the linear update ← − and the vast majority of the arithmetic operations can be completed using BLAS level-3 operations. Given a matrix and two augmented tiles ⟨ , ⟩ and ⟨ , ⟩, such that − is defined, StarNEig produces an augmented tile ⟨ , ⟩ which represents the intended result, i.e., This is achieved as follows. A preprocessing step ensures that each pair ( , ) of columns of and are not only consistently scaled, but the linear update = − can be computed without exceeding the overflow threshold. If , and are × matrices, then this preprocessing step requires ( 2 ) operations. Then the linear update ← − is executed using a single BLAS level-3 operation. We emphasize that the cost of the preprocessing step is insignificant compared with ( 3 ) arithmetic operations required for the linear update. Rescaling vectors to obtain a consistent scaling requires that the scaling factors are nonzero. Otherwise, a division-by-zero is attempted. In StarNEig, the possibility of the scaling factors underflowing is significantly reduced by using scaling factors which are integer powers of 2. Currently StarNEig uses at least 32 bit signed integers, for which the smallest scaling factor is = 2 −2 31 +1 ≈ 10 −6.4646×10 6 , but this can easily be extended to the point where underflow is a practical impossibility using 64 bit unsigned integers.
The main benefits that stem from the task-based approach are the merging of different computational steps and the improved load balancing. Additional information can be found in the existing literature. In particular, the fundamental principles for solving triangular linear systems in parallel without suffering from overflow are discussed in 29,30 . The StarNeig solver for computing generalized eigenvectors is the subject of a separate paper 31 .

STARNEIG LIBRARY
StarNEig is a C-library that runs on top of the StarPU task-based runtime system. StarPU handles low-level operations such as heterogeneous scheduling, data transfers and replication between memory spaces and MPI communication between compute nodes. In particular, StarPU is responsible for managing the various computational resources such as CPU cores and GPUs. The support for GPUs and distributed memory were the main reasons why StarPU was chosen as the runtime system.
StarPU manages a set of worker threads; usually one thread per computational resource. In addition, one thread is responsible for inserting the tasks into StarPU and tracking the state of the machine. If necessary, one additional thread is allocated for MPI communication. For these reasons, StarNEig must be used in a one process per node (1ppn) configuration, i.e., several CPU cores should be allocated for each MPI process (a node can be a full node, a NUMA island or some other reasonably large collection of CPU cores). The current status of StarNEig is summarized in Table 1. The library is currently in an early beta state. The Experimental status indicates that the software component has not been tested as extensively as those software components that are considered Complete. In particular, the GPU functionality requires some additional involvement from the user (performance model calibration). At the time of writing this paper, only real arithmetic is supported and certain interface functions are implemented as LAPACK and ScaLAPACK wrapper functions. However, we emphasize that StarNEig supports real valued matrices that have complex eigenvalues and eigenvectors. Additional distributed memory functionality and support for complex data types are planned for a future release.    StarNEig distributes the matrices in rectangular blocks of uniform size (excluding the last block row and column) as illustrated in Figure 8a. The data distribution, i.e., the mapping from the distributed blocks to the MPI process rank space, can be arbitrary as illustrated in Figure 8b. A user has three options:

Distributed memory
1. Use the default data distribution. This is recommended for most users and leads to reasonable performance in most situations.
2. Use a two-dimensional block cyclic distribution (see Figure 8c). In this case, the user may select the MPI process mesh dimensions and the rank ordering.
The library implements distribution agnostic copy, scatter and gather operations. Users who are familiar with ScaLAPACK are likely accustomed to using relatively small distributed block sizes (between 64-256). In contrast, StarNEig functions optimally only if the distributed blocks are relatively large (at least 1000 but preferably must larger). This is due to the fact that StarNEig further divides the distributed blocks into tiles and a tiny tile size leads to excessive task scheduling overhead because the tile size is closely connected to the task granularity. Furthermore, as mentioned in the preceding section, StarNEig should be used in 1ppn configuration as opposed to a one process per core (1ppc) configuration which is common with ScaLAPACK.

ScaLAPACK compatibility
StarNEig is fully compatible with ScaLAPACK and provides a ScaLAPACK compatibility layer that encapsulates BLACS contexts and descriptors 26 inside transparent objects, and implements a set of bidirectional conversion functions. The conversions are performed in-place and do not modify any of the underlying data structures. Users can mix StarNEig interface functions with ScaLAPACK routines without intermediate conversions. The use of the ScaLAPACK compatibility layer requires the use of either the default data distribution or the two-dimensional block cyclic data distribution. Listing 1 illustrates how a distributed StarNEig matrix is converted to an equivalent BLACS context and a matching local buffer. The matrix is then reduces to upper Hessenberg form by calling the PDGEHRD routine from ScaLAPACK. This is actually how the ScaLAPACK wrapper functions are implemented in StarNEig, see Table 1.

PERFORMANCE EVALUATION
Computational experiments were performed on the Kebnekaise system, located at the High Performance Computing Center North (HPC2N), Umeå University. Kebnekaise is a heterogeneous systems consisting of many different types of compute nodes. The compute node types relevant for this paper are: Broadwell compute node contains 28 Intel Xeon E5-2690v4 cores organized into 2 NUMA islands with 14 cores each and 128 GB memory. The nodes are connected with FDR Infiniband. All distributed memory experiments were performed on these nodes.
Skylake compute node contains 28 Intel Xeon Gold 6132 cores organized into 2 NUMA islands with 14 cores each and 192 GB memory. All Hessenberg reduction experiments were performed on these nodes.
Large memory node contains 72 Intel Xeon E7-8860v4 cores organized into 4 NUMA islands with 18 cores each and 3072 GB memory. All eigenvector experiments were performed on these nodes.
V100 GPU node contains 28 Intel Xeon Gold 6132 cores organized into 2 NUMA islands with 14 cores each and 192 GB memory. Each NUMA island is connected to a single NVIDIA Tesla V100 GPU. All GPU experiments were performed on these nodes.

/ / c r e a t e a 2D b l o c k c y c l i c d a t a d i s t r i b u t i o n ( pm x pn p r o c e s s mesh ) s t a r n e i g _ d i s t r _ t d i s t r = s t a r n e i g _ d i s t r _ i n i t _ m e s h ( pm , pn , STARNEIG_ORDER_DEFAULT ) ;
/ / c r e a t e a n x n d i s t r i b u t e d m a t r i x ( bn x bn b l o c k s ) s t a r n e i g _ d i s t r _ m a t r i x _ t dA = s t a r n e i g _ d i s t r _ m a t r i x _ c r e a t e ( n , n , bn , bn , STARNEIG_REAL_DOUBLE , d i s t r ) ; . . . In most experiments, StarNEig version 0.1-beta.2 was used. However, the Schur reduction and eigenvalue reordering experiments in distributed memory were performed using an older and unpublished version of StarNEig. The main difference between this unpublished version and the version 0.1-beta.2 is the deflation condition used in the Schur reduction phase. The deflation condition is used to decide when an eigenvalue can be deflated and thus impacts the convergence rate of the algorithm. The unpublished version uses a deflation condition that is identical to the one used in LAPACK and PDHSEQR where as the version 0.1-beta.2 uses the so-called norm stable deflation condition 18,19 . The latter deflation condition is less strict and could thus potentially lead to faster convergence. The latest version of StarNEig (0.1-beta.4) allows the user to choose between these two deflation conditions. Table 2 shows how the Hessenberg reduction routine in StarNEig compares against LAPACK (with parallel BLAS), ScaLA-PACK (in single node) and MAGMA 8,17 . Since the implementations in all four libraries are based on the standard algorithm, the performance is limited by the throughput of the matrix-vector multiplication routine. It is therefore important that the memory access pattern is optimized. In particular, a memory access to a NUMA island that is not local to a particular core is very likely to reduce the performance since the obtainable memory bandwidth is significantly lower compared to a local access. A memory access pattern that is close to optimal is easy to achieve with ScaLAPACK since each MPI process accesses its own local buffer and this buffer can be allocated from the NUMA island that is closest to the core to which the MPI process is bound. For the other CPU experiments, the memory was allocated in interleaved mode across the two NUMA islands in order to evenly distribute   Table 3 shows a comparison between ScaLAPACK and StarNEig. All distributed memory experiments were performed using a square MPI process grid. This was done because the maximum number of diagonal bulge chasing and reordering windows in the PDHSEQR and PDTRSEN routines is given by min( , ), where and are the height and width of the process mesh, respectively. A square process mess thus maximizes the degree of parallelism in the ScaLAPACK routines. We always map each StarNEig process to a full node (28 cores) and each ScaLAPACK process to a single CPU core, the latter being done for the same reason as the choice to use a square process mesh. Since it is not straightforward to find a CPU core allocation that would lead to a square MPI process grid in both configurations, the number of CPU cores in each ScaLAPACK experiment is always equal or larger than the number of CPU cores in the corresponding StarNEig experiment. The upper Hessenberg matrices for the Schur reduction experiments were computed from random matrices (entries uniformly distributed over the interval [−1, 1]). In the ScaLAPACK experiments, the matrices were distributed in 160 × 160 blocks, and in the StarNEig experiments, the library default block size was used. In the eigenvalue reordering experiments, 35% of the diagonal blocks were randomly selected. From Table 3, we note that StarNEig is between 1.6 and 2.9 times faster than ScaLAPACK (PDHSEQR) when computing the Schur form and between 2.8 and 5.0 times faster than ScaLAPACK (PDTRSEN) when reordering the Schur form. The distributed memory experiments were initially reported in the technical report 9 10 give some idea of how well the library is expected to scale in distributed memory. We note that StarNEig scales reasonably when computing the Schur form and almost linearly when reordering the Schur form. The iterative nature of the QR algorithm makes the Schur reduction results less predictable because different matrices require different number of bulge chasing steps. That is, with some matrices, the algorithms ends up performing more consecutive AED steps between the bulge chasing steps, thus leading to faster convergence rate but weaker scalability. Figure 11 demonstrates that StarNEig can indeed take advantage of the available GPUs as long as the matrices are reasonably large. The introduction of a single V100 GPU gives a speedup of up to 1.65 and the introduction of two V100 GPUs gives a speedup of up to 2.84. In the best case, the speedup when moving from single V100 GPU to two V100 GPUs is 1.83. Table 4 shows how the eigenvector computation routine in StarNEig compares against LAPACK (with parallel BLAS). In all experiments, 35% of the eigenvalues were randomly selected. In single core experiments, StarNEig is between 3.4 and 4.9 times faster than LAPACK. This demonstrates demonstrate how efficient the tiled approach is compared to the scalar implementation that exists in LAPACK. Furthermore, the multi-core experiments demonstrate the poor scalability of the LAPACK implementation. In particular, the initial solve phase (5) is executed sequentially and the back transformation phase (6), which is executed   98  53  38  30  60 000  -----4231 315  167 116 90  80 000  -----9221 693  359 252 194  100 000  ------1325 698 483 369  120 000  ------2226 1194 833 647 in parallel, constitutes only a small fraction of the total execution time. The implementation in StarNEig, on the other hand, demonstrates a very good scalability as for the larger matrices ( ≥ 40 000) the parallel efficiency stays above 70%. In the most extreme case StarNEig is over 190 times faster than LAPACK.

SUMMARY
This paper presented a new library called StarNEig. StarNEig aims to provide a complete task-based software stack for solving dense nonsymmetric standard and generalized eigenvalue problems. StarNEig support both shared and distributed memory machines and some routines in the library can take advantage of the available GPUs. The paper is mainly aimed at potential users of the library. Various design choices were explained and contrasted to existing software. In particular, users who are already familiar with ScaLAPACK should know the following: • StarNEig expect that the matrices are distributed in relatively large blocks compared to ScaLAPACK.
• StarNEig should be used in a one process per node (1ppn) configuration as opposed to a one process per core (1ppc) configuration which is common with ScaLAPACK.
The performance of the library was demonstrated with a set of computational experiments. The presented results show the following: In the Hessenberg reduction phase, StarNEig is competitive with LAPACK, ScaLAPACK (in single node) and MAGMA (single GPU). In the Schur reduction phase, StarNEig is between 1.6 and 2.9 times faster than ScaLAPACK. In the eigenvector computation phase, StarNEig's parallel and robust implementation significantly outperforms LAPACK in both single-core and multi-core settings with recorded speedups as large as 190. In the eigenvalue reordering phase, StarNEig is between 2.8 and 5.0 times faster than ScaLAPACK and scales nearly linearly.
The library is still incomplete. Future work with StarNEig includes the implementation and integration of the missing software components. Support for complex valued matrices is also planned. The GPU support, and the multi-GPU support in particular, are still under active development. The authors hope to start a discussion which would help guide and prioritize the future development of the library.