A staggered-grid multilevel incomplete LU for steady incompressible flows

Algorithms for studying transitions and instabilities in incompressible flows typically require the solution of linear systems with the full Jacobian matrix. Other popular approaches, like gradient-based design optimization and fully implicit time integration, also require very robust solvers for this type of linear system. We present a parallel fully coupled multilevel incomplete factorization preconditioner for the 3D stationary incompressible Navier-Stokes equations on a structured grid. The algorithm and software are based on the robust two-level method developed by Wubs and Thies. In this paper, we identify some of the weak spots of the two-level scheme and propose remedies such as a different domain partitioning and recursive application of the method. We apply the method to the well-known 3D lid-driven cavity benchmark problem, and demonstrate its superior robustness by comparing with a segregated SIMPLE-type preconditioner.


Introduction
The incompressible Navier-Stokes equations accurately describe flow of Newtonian fluids like water and air at low Mach numbers.In this paper we consider the numerical solution of these equations after spatial discretization on a structured grid.The discretization we use does not introduce artificial diffusion and is therefore popular for direct numerical simulations of turbulent flow.In this flow regime, one is typically interested in accurately resolving the temporal evolution of the flow.The relatively small time steps in such a simulation allow simplifications in the solution process, in particular, Picard linearization (instead of Newton's), and a segregated solution scheme that treats the velocities and pressure separately Verstappen and Veldman [36].
On the other end of the spectrum are Stokes flows and flows at very low Reynolds numbers.Here it is common to solve the coupled linear systems with a segregated preconditioner, see e.g.Elman et al. [16].This class of linear solvers tries to reduce the problem to scalar linear systems that can be solved by multigrid methods.
In this paper, we are interested in the intermediate flow regimes, where the focus lies on accurately desribing flow transitions occuring when model parameters are varied (or uncertain).In such situations it may be beneficial to directly compute steady state solutions, or at least take very large implicit time steps in order to quickly reach a statistical equilibrium solution.Another application where the solution of linear systems with the full Jacobian (and its adjoint) play a key role is design optimization with gradient-based optimization methods Brandenburg et al. [8].In that application, however, unstructured grids may be preferable, which we do not address here.
To study the stability of steady states in fluid flow problems as a function of a parameter, e.g. the Reynolds number (see Charru and de Forcrand-Millard [10] for an introduction to such problems), a continuation method [22] can be used.To apply such a method to high-dimensional systems, we require an efficient and robust method for solving linear systems associated with the discretized incompressible Navier-Stokes equations.An elegant way of solving these systems is by replacing the complete LU factorization by an accurate incomplete one, which can be used as a preconditioner in an iterative procedure.By an appropriate ordering technique and dropping procedure, one can construct an incomplete LU (ILU) factorization that yields grid independent convergence behavior and approaches an exact factorization as the amount of allowed fill is increased.During the continuation process, this preconditioner can also be used to find approximate smallest eigenvalues and eigenvectors of the Jacobian matrix of the incompressible Navier-Stokes equations Sleijpen and Wubs [29].These eigenvalues are of interest since a switch in sign may indicate a bifurcation.
The incompressible Navier-Stokes equations can be written as where Re = ρU D µ is the Reynolds number, ρ is the density and µ is the dynamic viscosity, and D and U are characteristic length and velocity scales of the flow.These equations are discretized using a second-order symmetry-preserving finite volume method on an Arakawa C-grid Arakawa and Lamb [2]; see Figure 1.The discretization leads to a system of ordinary differential equations (ODEs) where u and p now represent the velocity and pressure in each grid point, N (•, •) is the bilinear form arising from the convective terms, L is the discretized Laplace operator, G is the discretized gradient operator, M is the mass matrix, which contains the volumes of the grid cells on its diagonal, and f contains the known part of the boundary conditions.Our method will exploit the property that the divergence operator is given by −G T .The term N (u, v) is the discretized form of u • ∇v, and is bilinear.Hence for given u the expression is linear in v: there exists a matrix N 1 (u) such that N (u, v) = N 1 (u)v.Similarly, for given v, there exists a matrix N 2 (v) such that N (u, v) = N 2 (v)u, which is the discretized form of v(∇ • u).For the contribution of N (u, v) to the Jacobian, we consider the derivative of N (u, u) in the direction ∆u, which is found from the following expression by taking the limit → 0: [N (u + ∆u, u + ∆u) − N (u, u)]/ = N (u, ∆u) + N (∆u, u) + N (∆u, ∆u) = N 1 (u)∆u + N 2 (u)∆u + N (∆u, ∆u), where we used only the bilinearity of the expression N (•, •).The last term becomes zero when taking the limit.
Using this notation, the linear system of saddle point type [4] that has to be solved within each Newton step is given by It is known [36] that, on closed domains, dissipation of kinetic energy only occurs by diffusion.Our discretization preserves this property, which has the consequence that for divergence-free u, the operator N 1 (u) is skew-symmetric.A popular simplification is to omit N 2 (u), which replaces the Newton process by a Picard iteration [16].The approximate Jacobian then becomes negative definite (on the space of discrete divergence-free velocities), which greatly simplifies the solution process since definiteness is a condition under which many standard iterative methods converge.We remark, however, that Picard iteration works well only for fairly low Reynolds numbers and far away from bifurcation points where steady solutions become unstable, and seriously impairs the convergence rate of the nonlinear iteration [9,17].Furthermore, some authors use time dependent approaches to study the stability of steady states [15].This approach, however, also requires some tricks to obtain the desired information.Since we want to study precisely the phenomena where the above methods experience difficulties, we would rather use the full Jacobian matrix of the nonlinear equations, applying Newton's method.
There are many methods, based on segregation of physical variables, that can solve the linear equations that arise in every Newton iteration.In this approach the system is split into subproblems of an easier form for which standard methods exist.The segregation can already be done at the discretization level, e.g. by doing a time integration and solving a pressure-correction equation independently of the momentum equations [36,35].Another class of methods performs the segregation during the linear system solve, often in a preconditioning step.Physics based preconditioners [13,23,11,19] try to split the problem into subsystems which capture the bulk of the physics.The subsystems are again solved by iterative procedures, e.g.algebraic multigrid (AMG) for Poisson-like equations.These methods consist of nested loops for: the nonlinear iteration, iterations over the coupled system, and iterations over the subsystems.The stopping criteria of all these different iterations have to be tuned to make the solver efficient.Furthermore, the total number of iterations in the innermost loop is given by the product of the number of iterations performed on all three levels of iteration and thus easily becomes excessive.This is a major problem when trying to achieve extreme parallelism, because each innermost iteration typically requires global communication in the inner products.The number of levels of nested iteration may increase even more if additional physical phenomena are added [13,31].Our ultimate aim is to remove the inner iterations altogether and to solve the nonlinear equations using a subspace accelerated inexact Newton method.In Sleijpen and Wubs [29] we did this for simple eigenvalue problems using the Jacobi-Davidson method, which is in fact an accelerated Newton method.The method we present here will make this feasible for the 3D Navier-Stokes equations, even at high Reynolds numbers and close to bifurcation points, for which we have so far failed to find other scalable parallel solution methods.
To achieve this, fully aggregated approaches are important.In this category, multigrid and multilevel ILU methods for systems of PDEs exist (see Trottenberg et al. [32], Wathen [37] and references therein).The former is attractive, but for those methods smoothers may fail due to a loss of diagonal dominance for higher Reynolds numbers, for which a common solution is to use time integration [18].The latter comprise AMG algorithms and multilevel methods like MRILU [7] and the methods available in ILUPACK [6].ILUPACK has been successful in many fields since it uses a bound to preclude very unstable factorizations.However, this method does not show grid independence for Navier-Stokes problems and is difficult to parallelize [1].It works well for large problems, but not yet beyond a single shared memory system.
In this paper, we present a novel multilevel preconditioning method which is specially designed for the 3D Navier-Stokes equations.In Section 2, we first describe the two-level ILU preconditioner as introduced in Wubs and Thies [38] and Thies and Wubs [30].After this, we generalize the two-level method to a multilevel method in Section 3. To make this method work for the 3D Navier-Stokes equations, we introduce a skew partitioning method in Section 4. In Section 5 we discuss the scalability and general performance of the method, and compare it to a popular physics based method, after which we summarize the paper in Section 6.

The two-level ILU preconditioner
In Wubs and Thies [38] a robust parallel two-level method was developed for solving Ax = b, with A ∈ R n×n for a class of matrices known as F-matrices.An F-matrix is a matrix of the form with K symmetric positive definite and B such that it has at most two entries per row and the entries in each row sum up to 0, as is the case for our gradient matrix G [33,14].It has been shown that the two-level preconditioner leads to grid-independent convergence for the Stokes equations on an Arakawa C-grid, and that the method is robust even for the Navier-Stokes equations, which strictly speaking do not yield F-matrices because K becomes nonsymmetric and possibly indefinite.
Rather than reviewing the method and theory in detail, we only briefly present it here.For details, see Wubs and Thies [38] and Thies and Wubs [30].
To simplify the discussion, we focus on the special case of the 3D incompressible Navier-Stokes equations in a cube-shaped domain, discretized on an Arakawa C-grid (see Figure 1).We refer to the velocity variables, which are located on the cell faces as V -nodes, and to the pressure, which is located in the cell center, as P -node.The variables are numbered cell-by-cell, i.e. the first three variables are the u/v/w-velocity at the north/east/top face of the cell in the bottom south west corner of the domain, and variable four is the P -node in its center.Appropriate boundary conditions (e.g.Dirichlet conditions) are easily implemented in this situation.We remark that the algorithm (and software) can handle more general situations like rectangular domains, interior boundary cells, etc., and could be generalized to arbitrary domain shapes and partitionings.
First we describe the initialization phase of the preconditioner, which is the necessary preprocessing that has to be done only once for a series of linear systems with matrices with the same sparsity pattern.Then we describe the factorization phase, which has to be done separately for every matrix.Finally we describe the solution phase, which has to be performed for every application of the preconditioner.

Initialization phase
First the system is partitioned into N non-overlapping subdomains Ω α , with α = 1, . . ., N .These subdomains may be distributed over different processes, which allows for parallelization of the computation.The partitioning may be done based on the graph of the matrix, as implemented for instance in METIS [21], or by a geometric approach, e.g. by dividing the domain into rectangular subdomains.An example of a Cartesian partitioning of a square domain is shown in Figure 2. The non-overlapping subdomains are denoted by the black lines.
Based on the non-overlapping decomposition, we can define a global minimal set of separator nodes Γ such that it decouples the linear systems associated with the remaining variables in any two subdomains Ω α,β .This Γ can straightforwardly be defined based on either geometric information or the graph of the matrix.Let I α = Ω α \Γ denote the set of interior variables of subdomain Ω α , then we have by construction Furthermore, we define the set of separator variables associated with subdomain Ω α as S α = {j ∈ Γ : ∃i ∈ I α : A i,j = 0}.Note that the interior variables of two subdomains can be eliminated independently on a parallel computer, and that the union of the sets I α and S α defines an overlapping partitioning.Although we use the non-overlapping domain decomposition to define our incomplete factorization, we formally introduce the overlapping subdomains Ω α = I α ∪ S α .In the remainder of the paper, we will refer to the variables in the sets I α and S α , α = 1, . . ., N as interior nodes, and separator nodes, respectively.One separator is defined as a set of separator nodes that are coupled with the same set of subdomains (geometrically, separators comprise faces, edges and corners).One separator group is defined as the variables on the same separator that have the same variable type (u, v, w or p).Note that because separators are defined by the set of subdomains they are coupled to, separator nodes can never be in multiple separator groups at the same time.In Figure 2 the interior nodes are shown in gray and the different separator groups are denoted by different colors.
We can now reorder the matrix A such that the interiors (I) and separators (S) are grouped.This gives us the system where A II consists of independent diagonal blocks.This submatrix is invertible because on each subdomain we deal with a discretized Navier-Stokes problem on an Arakawa C-grid which is known to be well-posed if the normal and tangential velocities are specified on the boundaries and the level of the pressure is fixed by setting it in one grid point inside the domain.The velocities are indeed specified on the surrounding boundaries and separators of each subdomain and one pressure is fixed in each subdomain.Since A II consists of independent diagonal blocks, applying A −1 II is easy and trivial to parallelize.For this reason, we aim to solve where S is the Schur complement given by S = A SS − A SI A −1 II A IS .Whether a variable is coupled to a different subdomain could be determined directly from the graph of the matrix, as discussed earlier.In our implementation, however, we determine this geometrically by defining the overlapping subdomains during the partitioning step, and checking what overlapping subdomains a node of the non-overlapping subdomain belongs to.
There are three types of separators: faces (in 3D), edges and corners.For the Navier-Stokes problem on a C-grid, these separators only consist of V -nodes.The P -nodes are only connected to V -nodes that belong to the same overlapping subdomain, so these should never lie on a separator.We arbitrarily choose one P -node in every interior which we also define to be its own separator group to make sure the Schur complement stays nonsingular.
We want to eliminate velocities on a separator in such a way that the velocities that remain on a separator face provide an approximation of the collective flux through that face.It is therefore important that the variables are correctly scaled before the factorization in a way that they represent physical fluxes.In the matrix in (2) this gives a G-part with entries of constant magnitude.In this case we can define a Householder transformation which exactly decouples all but one V -node from the P -nodes [38].This transformation is of the form for some separator group g j with and 1 if node k is in separator group g j 0 if node k is not in separator group g j for all k = 1, . . ., n.We call the one V -node that is still coupled to the P -nodes a V Σ node.Since the Householder transformation can be applied independently for every separator group, we may collect all these transformations in one single transformation matrix H.

Factorization phase
For every overlapping subdomain Ω i , i = 1, . . ., N , where N is the total number of overlapping subdomains, we can define a local matrix .

After computing the factorization
II , the local contribution to the Schur complement is given by and the global Schur complement is given by Here we take the sum of the A (i) SS contributions over the non-overlapping subdomains to make sure that contributions from separators are not counted multiple times.
We now apply the Householder transformation which can be done separately for every separator group or subdomain, or on the entire Schur complement.We choose to apply the transformation separately for every subdomain, since H is very sparse, and sparse matrix-matrix products are very expensive and memory consuming.
We now drop all connections between V Σ and non-V Σ nodes, and between non-V Σ nodes and non-V Σ nodes in different separator groups.The dropping that is applied here is what makes our factorization inexact.Not applying the dropping gives us a factorization that can still be partially parallelized, but is also exact.
Our transformed Schur complement is now reduced to a block-diagonal matrix with blocks of non-V Σ nodes for every separator, and one block for all V Σ nodes, which we call S ΣΣ .Separate factorizations can again be made for all these diagonal blocks, which can again be done in parallel.For the non-V Σ blocks, a sequential LAPACK solver can be used, and for S ΣΣ we can employ a (distributed) sparse direct solver.We denote the factorization that is computed by these solvers as L S U S .
The full factorization obtained by the method is given by

Solution phase
After the preconditioner has been computed, it can be applied in each step of a preconditioned Krylov subspace method, for which we use the well-known GM-RES method [26].Communication has to occur in the application of A IS and A SI , and when solving the distributed V Σ system.The latter was adressed by using a parallel sparse direct solver in Thies and Wubs [30], but in the next section we propose a different road that does not rely on the availability of such a solver.

The multilevel ILU preconditioner
The main issue with the above two-level ILU factorization that prevents us from scaling up to very large problem sizes in three space dimensions is that, as the problem size increases and the subdomain size remains constant, the size of S ΣΣ increases steadily, and any (serial or parallel) sparse direct solver eventually limits the feasible problem sizes.Increasing the subdomain size, on the other hand, leads to more iterations and therefore more synchronization points.One of the strong points, on the other hand, is the fact that it preserves the structure of the original problem in the sense that, when applied to an F-matrix, it produces a strongly reduced matrix S ΣΣ which is again an F-matrix.It is therefore intuitive to try to apply the scheme recursively and avoid the problem of the growing sparse matrix that has to be factorized.From the structure preserving properties of the algorithm, it is expected that such a recursive scheme again leads to grid-independent convergence if the number of recursions is kept constant as the grid size is increased.
From now on we refer to the number of recursions, or the number of times a reduced matrix S ΣΣ is computed, as the number of levels.Note that this means that the method, which was previously referred to the two-level method is in fact the first level of the multilevel method.Applying a direct solver to S T from ( 6) is level zero.In this case the preconditioner is a direct solver and GMRES will converge in 1 iteration.
In order to apply the method to the reduced matrix S ΣΣ , we require a coarser partitioning, in which a subdomain consists of multiple subdomains from the original partitioning.In case we have a regular partitioning like a rectangular partitioning, this may be done by multiplying the separator length by a certain coarsening factor.Having a coarsening factor of 2, for instance, means that in 3D the separator length is increased by a factor 2, and the number of subdomains is reduced by a factor 8.
As stated in the previous section, we require the velocity variables to be correctly scaled to be able to apply the Householder transformation.However, the V Σ -variables from the previous level that lie on one separator have a different number of variables in their separator groups.In case of a regular partitioning, an edge separator, for instance, consists of V Σ -nodes from two edges and one corner from the previous level This leads to a different scaling of the V Σ -nodes and thus to non-constant entries in the G-part of S ΣΣ .Instead of re-scaling the S ΣΣ matrix on every level, we instead use a test vector t.The test vector is defined initially as a constant vector of ones, and is multiplied by the Householder transformation H at each consecutive level.The Householder transformation is as defined in ( 4) and ( 5), but now with for all k = 1, . . ., n.After applying the Householder transformation, we can again apply dropping to remove connections between V Σ and non-V Σ nodes and between non-V Σ nodes and non-V Σ nodes in different separator groups.When the matrix S ΣΣ is sufficiently small, a direct solver is applied to factorize it.
Instead of just having one separator group per variable per separator, we may also choose to have multiple separator groups, meaning that instead of retaining only one V Σ node per variable per separator we retain multiple V Σ nodes.This in turn means that less dropping is applied, and therefore the factorization is more accurate.Retaining all nodes in this way, possibly only after reaching a certain level, gives us an exact factorization, which, in terms of iterations for the outer iterative solver, should give the same results as using any other direct solver at that level.A visual representation of this process is given in Figure 3.

Skew partitioning in 2D and 3D
Looking at Figure 2, we see that there are pressures that are located at the corners of the subdomains that are surrounded by velocity separators.This means that if we add these pressures to the interior, as was suggested above, we get a singular interior block.We call these pressure nodes that are surrounded by velocity separators isolated pressure nodes.For the two-level preconditioner in 2D, it was possible to solve this problem by adding these pressures to the Schur complement as single-element separator groups.This unfortunately does not work for the multilevel method, since in this case velocity nodes around the isolated pressure nodes get eliminated.It also does not work in 3D because there the isolated pressure nodes also exist inside of the edge separators, where they form tubes of isolated nodes.
A possible way to solve this for the multilevel case and in 3D is to also turn all velocity nodes around the isolated pressure nodes into separate separator groups.This means that they are all treated as V Σ nodes and are never eliminated until they are in the interior of the domain at a later level.This, however, has the downside that a lot more nodes have to be retained at every level, resulting in much larger S ΣΣ matrices at every level.Furthermore, a lot of bookkeeping is required to keep track of all the extra separator groups, and their elimination leads to long-range connections and thus a more irregular matrix structure.
In 2D, we can resolve these problems by rotating the Cartesian partitioning by 45 degrees.The result is shown in Figure 4a.It is easy to see that in this case, no pressure nodes are surrounded by only velocity separators.We call this partitioning method skew partitioning.In Figure 4, we also show the workings of the multilevel method, with all the steps being displayed in the different subfigures.
For the skew partitioning to work with our multilevel method, we have two requirements on the shape of the subdomains: (1) it should be space-filling, meaning that it can be used to fill the entire domain and (2) it should be selfsimilar, meaning that we can construct a larger subdomain of the same shape from multiple smaller subdomains.It is easy to see that these two properties hold for the 2D skew partitioning.
The most basic idea for generalizing the rotated square shape to a 3D setting is to use octahedral subdomains.Partitioning with this design turned out to be unsuccessful, but it is still briefly discussed here since it led to some new insights.Since regular octahedra (the dual to cubes, having their vertices at the centers of the cube faces) by themselves are not space-filling, the octahedra can be slightly distorted to make them fit within a single cubic repeat unit.The resulting subdomains are space-filling, but only by using three mutually orthogonal subdomain types.The fact that it requires the use of three types of subdomains increases the programming efforts significantly since it introduces a lot of edge cases that should be considered, e.g. for subdomains located at the boundary of the domain.Velocities of the same (non-gray) color together form a separator group (but only if they point in the same direction and are in neighboring grid cells).The red circles are pressure nodes that are retained.This example uses a coarsening factor of 2, i.e. the separators on the next level have twice the length of those on the previous level.
The major problem with the octahedral subdomains, however, is that they are not self-similar, meaning that we cannot construct a larger octahedral subdomain from multiple smaller octahedral subdomains.However, self-similarity can be achieved by splitting the octahedra into four smaller tetrahedra, of which six different types are required to fill 3D space.This introduces additional separator planes that are similar to the 2D skew case and hence it increases the risk of isolating a pressure node when such planes intersect.Especially planar intersections which are parallel to any of the Cartesian axes have a high risk of producing isolated pressure nodes.
We did indeed not manage to find any self-similar tiling using tetrahedra that would not give rise to any isolated pressure nodes.Moreover, we would like to have a single subdomain shape that we can use instead of six, since this would greatly simplify the implementation.A lesson we learned is that isolated pressure nodes always seem to arise when having faces that are aligned with the grid.Therefore, we looked into a rotated parallelepiped shape that does not have any faces that are aligned with the grid [34].This shape is shown in Figure 5a, where the cubes represent a set of s x × s x × s x grid cells.A welcome property of this domain is that its central cross section resembles the proposed 2D skew partitioning method.
A schematic view of the position of the separator nodes is shown in Figure 5b.One important detail to note is that on the side that is facing towards us, only half of the w-nodes are displayed.This is because the w-nodes have to be positioned in an alternating fashion on the inside and outside of the nonoverlapping subdomain to prevent isolated pressure nodes from appearing.A consequence of this alternating property is that the w-planes have to be divided into two separate separator groups; one for the w-nodes that are located inside the non-overlapping subdomain, and one for w-nodes that are only present in the overlapping subdomain.
Another advantage of using a skew domain partitioning is that the amount of communication that is required is reduced when compared to a square partitioning.In Bisseling [5], it is estimated that for the Laplace problem, the communication is asymptotically reduced by a factor of √ 2 for the 2D diamond shape.If we instead compare the diamond shape to a rectangular domain with the same number of nodes (having the same number of nodes with a square domain is impossible), we find that communication is reduced by a factor of 3  2 .In the same manner, we can compare a 3D domain of size s x × s x × s x /2 to the rotated parallelepiped, and find a factor of 4  3 .We remark that the truncated octahedron that is used in Bisseling [5] for the 3D domain and has a factor of 1.68 cannot be used for our multilevel method, since truncated octahedra are not self-similar.

Numerical results
A common benchmark in fluid dynamics is the lid-driven cavity problem.We consider an incompressible Newtonian fluid in a square three-dimensional domain of unit length, with a lid at the top which moves at a constant speed U .The equations are given by (1).No-slip boundary conditions are applied at the walls, meaning that they are Dirichlet boundary conditions, and the equations are discretized on an Arakawa C-grid as described before.We take n x = n y = n z grid cells in every direction.At first, however, we will only look at the Stokes equations of the form where we take f to be random.This is because our preconditioner is constructed in such a way that memory usage and time cost for both computation and application of the preconditioner should not be influenced by inclusion of the convective term.After this, we further investigate the behavior of the method on the liddriven cavity problem for increasing Reynolds numbers, which constitute harder problems.Therefore we expect an increase in iterations of the iterative solver, but otherwise the same behavior.
For obtaining the exact memory usage, we developed a custom library which overrides all memory allocation routines when linking against it.The library contains a hash table in which the amount of memory that is allocated is stored by its memory address.We keep track of the total amount of memory that is allocated, which is increased on memory allocation, and reduced by the amount that is stored in the hash table when memory is freed.The reason we developed this library is that existing methods rely on rough estimates of the memory usage of the used data structures, use the data that is available from /proc/meminfo, which is inaccurate, or actually count memory usage in a similar way as we do (e.g.valgrind), but have a large performance impact.
We perform every experiment twice: once to determine the memory usage, and once to determine the run time, without linking to the memory usage library.This means that when reporting timing results, we are not affected by the performance impact of tools to determine memory usage.The reason that we are still concerned about their performance impact, even though we developed our own library, is that it adds roughly a constant amount of time per process, which impacts scalability results.The memory usage that we report is the exact difference in memory usage before and after a certain action is performed, e.g.before and after the construction of the preconditioner.
For the timing results, we do not include the time it takes to compute the partitioning, because we did not optimize this step.The partitioning is computed by first constructing one full overlapping subdomain of the prescribed subdomain size sequentially, which is then mapped to the correct position of every overlapping subdomain to determine the interiors and separator groups.However, since the initial full overlapping subdomain contains all (possibly already eliminated) nodes, every time we increase the number of levels by 1, the amount of time required to compute the partitioning increases by a factor 8, which is the worst possible scenario.The reason we do not include this in the timing results is that this may be resolved by only computing the partitioning for nodes that are still present in the Schur complement at that level.This requires a full rewrite of the existing partitioning code, and will be addressed in a future version of the software.It does, however, not have any impact on the timing results of the remainder of the implementation, since this is completely decoupled.This means that even though the partitioning method does not scale at all, the preconditioning method itself can be studied reliably without including the timing of the partitioning.
For the implementation of the preconditioner and solver, we use libraries from the Trilinos project [20].The libraries we use are Epetra (vector and matrix data structures), IFPACK (direct solver and preconditioning interfaces) [27], Amesos (direct solvers) [28] and Belos (iterative solvers) [3].As iterative solver we use GMRES(250) [26], as parallel sparse direct solver on the coarsest level we use Su-perLU DIST 6.1 [25], and as direct solver for the interior blocks we use KLU [12] with the fill-reducing ordering from de Niet and Wubs [14].The implementation of our preconditioner can be found on GitHub 1 .
The benchmark is performed on the SuperMUC-NG cluster at LRZ, Munich2 .The nodes contain two Intel Xeon "Skylake" Platinum 8174 processors with 24 cores and have 96 GB of memory.For all experiments, we disable multithreading inside the MPI processes, since the use of OpenMP in Epetra is not compatible with our need for very small objects (e.g.sparse matrices living on a very small subdomain).

Weak scalability
First, we look at results obtained when increasing the grid size n x at the same rate as the number of used cores n p , i.e. the problem size on each core is kept constant.The exact configurations that we use are 1 core for n x = 16, 1 core for n x = 32, 8 cores for n x = 64, 64 cores for n x = 128, 512 cores for n x = 256 and 4096 cores for n x = 512.The size of the subdomains (the size of the cubes in Figure 5a) at the first level is s x = 8, while we choose the coarsening factor to be 2, meaning that we increase s x by a factor of 2 at each level.We perform experiments where we keep the number of levels constant at L = 2, and experiments where we increase the number of levels by 1 when doubling the grid size.For the latter, we look at three cases where we retain a different number of separator nodes starting at level 2: 1, 4, and all.Since we start retaining more nodes after two levels, results with only two levels (n x = 16) will be the same for all configurations.When no result is shown for n x = 512, this means the method ran out of memory, unless otherwise stated.
For the fixed number of levels, we expect the number of iterations of the iterative solver to converge to a constant number as the grid is refined.For the case where we increase the number of levels as the domain size increases, we expect the number of iterations to only increase mildly, and we expect that retaining more separator nodes starting at level 2 decreases the number of iterations until it again becomes constant as we retain more and more nodes per separator.
The results are shown in Figure 6.We see that indeed the number of iterations becomes constant when fixing the number of levels or when retaining all separator nodes.When increasing the number of levels with the grid size, we see that the number of iterations keeps increasing gradually.What is interesting is that when we retain 4 instead of 1 separator node per separator group after level 2, the number of iterations that is required decreases drastically, even though this does not have a significant impact on the memory usage as we will see later.
The computational time of both computing the preconditioner, as well as the

retain all
Figure 6: Number of iterations of GMRES for the Stokes problem on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.A relative residual of 10 −8 was used as convergence tolerance.application of the preconditioner would ideally become constant when keeping the problem size at each core constant while increasing the problem size.However, in practice this is not possible since increasing the number of cores also increases the required amount of communication.The results are shown in Figure 7 and Figure 8.   7: Time to compute the preconditioner for the Stokes problem on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.
When computing the preconditioner, we see that if we increase the number of levels with the grid size, the computational time rises only slightly.Ideally, there would be no rise at all, but since an increased the number of computational nodes also means more communication, there will always be an increase in practice.In our case, the communication mainly happens at the point where contributions of neighboring subdomains have to be added up in the Schur complement.Since retaining more nodes per level means an increase in the amount of communication, we also see that retaining 4 or all nodes takes more time than retaining only 1 node per separator at every level.It must be noted that due to system variability, we also found results that were about 10% better or worse than the results displayed here.This could be improved by disabling frequency adaptation in the CPUs, but we are aware that at very large core counts the synchronous nature of our algorithm may become a scalability problem.
If we keep the number of levels constant, the computational time required to compute the factorization at the last level increases rapidly, and for larger grid sizes the amount of memory that is required for the factorization is too large for the computation to complete.We also notice that retaining all nodes after level 2 is much less efficient than just using SuperLU DIST at level 2, meaning that our preconditioner performs very poorly as a direct solver.This is mainly due to the fact that the Schur complement at the last level is quite large and full.The last level Schur complement for a grid of size 256 3 has size 20 961 × 20 961, and its factorization is filled with 72% nonzeros.Computing the factorization of this matrix and using it in a forward/backward substitution is therefore very expensive.Using a parallel dense solver instead of SuperLU DIST might help to make it more efficient.Moreover, since we choose n ∼ n p , the cost of computing the factorization by a sparse direct solver grows at best with x ) [25], and we expect that the cost of computing the preconditioner when keeping the number of levels constant, or when retaining all separator nodes, increases with n 3 x .In Figure 8, we show both the time required to solve the linear system after computation of the preconditioner, and the time of 200 applications of the preconditioner, which is not influenced by the number of iterations of the iterative solver and does not include the time spent on, for instance, matrix-vector products and orthogonalization.First of all, we again observe that retaining all separator nodes is a bad idea, since the computational time goes off the chart.For the case where we only use 2 levels, we also see the unwanted behavior of a time that keeps increasing linearly for the total solution time, and superlinearly for the application of the preconditioner, where we would actually expect O(n 4/3 /n p ) = O(n x ) behavior from the triangular solve [25].This may be caused by disabling multithreading support, which results in an increased amount of communication inside of SuperLU DIST.
For both cases where we retain 1 and 4 separator nodes after 2 levels, we see that the computational time only slightly increases for larger grid sizes.We also note that for the case where we retain 4 nodes, the application of the preconditioner is slightly slower, but the total solution time is much smaller due to the lower number of iterations that is required, as can also be seen in Figure 6, and   8: Time to solve the linear system with GMRES (lines), and time of 200 applications of the preconditioner (dashed lines).This is for the Stokes problem on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.A relative residual of 10 −8 was used as convergence tolerance.
the fact that applying the preconditioner is relatively cheap.

retain all
Figure 9: Memory usage of the preconditioner per core for the Stokes problem on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.
In Figure 9, we see the average memory usage of the preconditioner per core.
Here we again observe that the 2 level case, and the case where we retain all separator nodes after level 3, perform poorly.The cases where we retain 1 or 4 separator nodes per separator group after level 2 show similar behavior in terms of memory usage, and the memory usage becomes constant as expected.

Strong scalability
In this section, we look at a problem of size n x = 128, with 6 levels and retaining only one node per separator group.We use 1 to 128 cores with steps of a factor of 2 for all cases except the application time, where we use 2 to 128 cores.The reason we do not look at 1 core for this timing is that this configuration caused a memory allocation error in the iterative solver (Belos).We first look at the total memory usage in Figure 10.We observe that there is a large difference between the memory usage on one core, and the memory usage on two cores.This is because especially Epetra uses different implementations of its data structures for serial and parallel computations.
We also observe that the total memory usage increases slightly when using more cores.This can be explained by the overlap that exists between different processes.Since the difference in communication between a cubical domain and a parallelepipedal domain is a constant factor of 3  4 , we may instead look at a cubical domain to explain this behavior.If we have a subdomain of size s x × s y × s z , then communication is required for 2s x s y + 2s x s z + 2s y s z grid cells.If the subdomain is split in the x-direction, communication for 2s x s y + 2s x s z + 4s y s z grid cells is required.For the case s x = s y = s z , communication increases with a factor 4   3   when doubling the number of cores.This explains the slight increase in memory usage we measure.
In Figure 11 we look at the speedup when using more cores.Ideally, we would see that using twice the number of cores would mean half of the computational time.We plotted this ideal line for reference.We see that the speedup of both the computation and application of the preconditioner is very close to this ideal line.There is one outlier at 2 cores which is above the ideal line, which may be explained by system variability.Other than that the behavior is as expected.

Lid-driven cavity
In the previous sections we determined the weak and strong scalability properties of the preconditioner, which means that we can now continue with the robustness of the solver on the lid-driven cavity problem with increasing Reynolds numbers.We perform a continuation with steps of Re = 100 starting from the solution of the Stokes problem.We show the results of the first iteration of Newton at Re = 500 and Re = 2000.The reason we go up to Re = 2000 is that a Hopf bifurcation is located between Re = 1900 and Re = 2000 [18,24], which is of interest because it changes the qualitative behavior of the solution and makes the Jacobian indefinite.
In Figure 12 and Figure 13, we show the results at Reynolds number 500, which show good correspondence with the results on the Stokes problem.The main difference is that more iterations are needed, since higher Reynolds numbers constitute harder problems.We see that the number of iterations that is required when keeping the number of levels constant actually decreases, which is due to the grid refinement having a positive effect on the mesh Péclet number, i.e. the coefficients of the diffusion increase with respect to those of the convection.
The results at Reynolds number 2000 are shown in Figure 14 and Figure 15.We again observe that the number of iterations is much larger.What is odd, however, is that retaining more nodes now actually gives worse convergence.This may be because we use GMRES(250) instead of GMRES due to memory limitations, and therefore do not preserve the convergence properties of GMRES.This effect is more prevalent for this problem because of the large number of iterations that is required.We did confirm that for the first 250 iterations, retaining 4 nodes after level 2 gives rise to better convergence, as we expected.For Reynolds Figure 12: Number of iterations of GMRES for the lid-driven cavity problem at Re = 500 on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.A relative residual of 10 −8 was used as convergence tolerance.As initial guess we used the solution at Re = 400.
Figure 13: Time for GMRES to converge for the lid-driven cavity problem at Re = 500 on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.A relative residual of 10 −8 was used as convergence tolerance.As initial guess we used the solution at Re = 400.number 2000, we do not show results with a grid of size 512 3 , because the method does not converge within 10 000 GMRES iterations, which we set as the maximum.It does converge, however, and extrapolating the results, we expect that around 15000 iterations are required to meet the tolerance.Alternatively, instead of increasing the number of levels with the grid size, we could also have used 6 levels as we did for a grid of size 256 3 , in which case we would have expected it to actually use fewer iterations due to the effect of grid refinement on the mesh Péclet number as explained earlier.
Figure 14: Number of iterations of GMRES for the lid-driven cavity problem at Re = 2000 on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.A relative residual of 10 −8 was used as convergence tolerance.As initial guess we used the solution at Re = 1900.
Figure 15: Time for GMRES to converge for the lid-driven cavity problem at Re = 2000 on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.A relative residual of 10 −8 was used as convergence tolerance.As initial guess we used the solution at Re = 1900.
In Table 1 we show results for Reynolds number 500 using the LSC-SIMPLEC block preconditioner implemented in Teko [11].We also tried preconditioners from other packages, as well as other preconditioning strategies implemented in the Teko package, but those unfortunately converged even slower or did not yield convergence at all.The stopping criterion of the linear solver is 10 −8 , as before.1: Performance of Teko with LSC-SIMPLEC preconditioner for the liddriven cavity problem at Re = 500 on a grid of size n x × n x × n x .Here n p is the number of cores, L is the number of levels, its is the number of iterations, t c is the time to compute the preconditioner and t s is the time for solving the linear system.
Compared to our method, we see that Teko has much more difficulty with the grid refinement, leading to a huge computational cost at a grid size of only 64 3 .A crude computation shows that per grid point the method becomes about 65 times more expensive per iteration.This must be attributed to slow convergence of algebraic multigrid on the subblocks.One of these blocks is the L + N block from (2), with N indefinite, and this is something that is difficult for a standard AMG method.We choose the number of levels to be 2 for grid sizes 16 3 and 32 3 , and 3 levels for 64 3 since these seem to give the optimal results.For Reynolds number 2000, we do not observe convergence past the 16 3 grid.

Summary and Discussion
We presented a robust method for solving the steady, incompressible Navier-Stokes equations, which makes use of parallelepiped shaped overlapping subdomains.The interiors of these overlapping subdomains can be eliminated in parallel.On the interfaces of the subdomains, Householder transformations are applied to decouple all but one velocity node from the pressure nodes, after which all decoupled nodes can also be eliminated in parallel.The key to the multilevel approach is the resulting reduced Schur complement matrix, which has the same structure as the original matrix.We can therefore recursively apply our method to this matrix.The shape of the subdomains makes sure that pressure nodes are not isolated in the factorization process, which would result in a singular Schur complement matrix.
Our weak scalability experiments show that if the number of levels of the multilevel approach is kept constant while increasing the problem size, grid in-dependent convergence is obtained.We also show that by increasing the number of levels and processors as the problem size increases, we only require a small amount of additional time and memory for both the factorization and solution process.Moreover, by increasing the number of nodes that is retained per separator after level 2, we can further decrease the time that is required to solve the system.
Our strong scalability experiments show that the time that is required to both compute and apply the preconditioner scales very well.The same holds for the memory usage, which behaves as expected.
We also showed that the preconditioner leads to convergence of GMRES for the lid-driven cavity problem at high Reynolds numbers, where other methods, such as the LSC-SIMPLEC preconditioner that is implemented in Teko, fail to do so.This leads us to conclude that we implemented a robust solution method for the Navier-Stokes equations on staggered grids which shows good parallel scalability.In this paper, we only showed results for Arakawa C-grids.We have already succeeded to apply the idea to discretization on other staggered grids, e.g.Arakawa B-grids and mixtures of B-and C-grids as used in oceanography.This will allow us to use it as preconditioner in ocean-climate models, which is our application goal.The approach is, however, not limited to structured grids.The essence of the method is that we can iterate in the divergence-free space in a cheap way.As far as we can see this is the case when we have discrete conservation of mass.The C-grid discretizaton leads to one mass conservation law per subdomain (the B-grid has two).By uniting two neighbouring subdomains, we can find, by a simple combination of the two laws, a single new one for the united subdomains.This is also possible for finite volume discretizations on unstructured grids.For this reason, we think that the method can be applied to such discretizations in general and to some (discontinuous) Galerkin methods that also have the discrete conservation property.In the future we may generalize the approach to unstructured grid discretizations for which we will need graph based partitioning methods.

Figure 1 :
Figure 1: Positioning of unknowns in an Arakawa C-grid.

Figure 2 :
Figure 2: Cartesian partitioning of a 12 × 12 C-grid discretization of the Navier-Stokes equations into 9 subdomains of size s x × s y with s x = s y = 4.The interiors are shown in gray.Velocities of the same (non-gray) color together form a separator group (but only if they point in the same direction and are in neighboring grid cells).The red circles are pressure nodes that are retained.

Figure 3 :
Figure 3: One-dimensional representation of the process of retaining multiple nodes per separator.Each separator group has its own color and shape.
After elimination of interiors only the separator groups remain (c) The Householder transformations leave one V Σ -node per separator group (d) Partitioning on next level

Figure 4 :
Figure 4: Skew partitioning of a 12 × 12 C-grid discretization of the Navier-Stokes equations into 12 subdomains.The interiors are shown in gray.Velocities of the same (non-gray) color together form a separator group (but only if they point in the same direction and are in neighboring grid cells).The red circles are pressure nodes that are retained.This example uses a coarsening factor of 2, i.e. the separators on the next level have twice the length of those on the previous level.
Shape of the parallelepiped inside of two cubes that consist of m×m×m grid cells y x z (b) Schematic view of the position of the separator nodes from the same point of view as the figure on the left The u-nodes are depicted as red faces, the v-nodes are depicted in green and the wnodes in yellow.

Figure 5 :
Figure 5: Parallelepiped shape for partitioning the domain.

Figure
Figure 7: Time to compute the preconditioner for the Stokes problem on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.

Figure
Figure8: Time to solve the linear system with GMRES (lines), and time of 200 applications of the preconditioner (dashed lines).This is for the Stokes problem on a grid of size n x × n x × n x , while keeping the number of levels at L = 2, and when increasing the number of levels by 1 when n x is doubled.A relative residual of 10 −8 was used as convergence tolerance.

Figure 10 :
Figure 10: Total memory usage of the preconditioner for the Stokes problem on a grid of size 128 3 , with 1 to 128 cores.

Figure 11 :
Figure 11: Speedup of computation and application of the preconditioner for the Stokes problem on a grid of size 128 3 , with 2 to 128 cores.
p n x L its