An algebraic multigrid method for elasticity based on an auxiliary topology with edge matrices

This article introduces a novel approach to algebraic multigrid methods for large systems of linear equations arising from finite element discretizations of certain elliptic second-order partial differential equations. Based on a discrete energy made up of edge and vertex contributions, we are able to develop coarsening criteria that guarantee two-level convergence even for systems of equations such as linear elasticity. This energy also allows us to construct prolongations with prescribed sparsity pattern that still preserve kernel vectors exactly. These allow for a straightforward optimization that simplifies parallelization and reduces communication on coarse levels. Numerical experiments demonstrate the efficiency and robustness of the method and the scalability of the implementation.


INTRODUCTION
Algebraic multigrid (AMG) methods are one of the most popular classes of methods for the iterative solution of large sparse systems of linear equations. Besides the classical approach, 1-3 various other AMG methods have been developed. For an overview of the mainstream methods, see, for example, Reference 4. The ones that are most relevant to the discussion here are smoothed aggregation algebraic multigrid (SAAMG), [5][6][7][8] and also algebraic multigrid based on computational molecules (AMGm). [9][10][11] The latter is an attempt to combine the classical approach with algebraic multigrid based on element interpolation (AMGe). 12 It expands on the idea of element preconditioning introduced in Reference 13, where an AMG method was applied not to the system matrix itself, but to an auxiliary matrix obtained from preconditioning of the element matrices. AMGm is based on the concept of small edge matrices that can be assembled into an auxiliary matrix. In contrast to AMGe, which is based on element matrices that need additional topological information on coarse levels, AMGm only needs topological information stored in the matrix itself. In Reference 11, a coarsening strategy for linear elasticity problems based on the solution of small eigenvalue problems obtained from these edge matrices was introduced.
In this article we propose a method of smoothed aggregation type that is, just as AMGm, based on the concept of edge matrices. On the finest level, an auxiliary topology of vertices and edges, with associated vertex and edge matrices, is set up. These can be assembled into an auxiliary matrix that is proven to be spectrally equivalent to the finite element matrix but has a simpler structure. Coarsening is then performed on the auxiliary topology, using weights computed from small eigenvalue problems assembled from vertex and edge matrices. We prove that these weights indeed lead to a good two-grid constant and present a coarsening algorithm based on successive matching. By subassembling edge contributions, one can form a new variation of the classical smoothed prolongation used in SAAMG. It is possible to prescribe the sparsity pattern of this auxiliary smoothed prolongation while still preserving rigid body modes. The validity of the approach and the efficiency of the implementation are demonstrated through numerical computations.
The method is suited for both scalar H^1 problems and linear elasticity and, in fact, on the theoretical level, hardly makes a distinction between them at all. For linear elasticity, this means that coarse spaces contain both displacement and rotational degrees of freedom. The introduction of edge matrices that explicitly include these rotational degrees of freedom is the crucial ingredient of the method proposed here. This is done by imposing an energy on explicitly calculated local deviations from zero-energy functions, essentially separating the deviation from the energy it induces.
Besides the works on AMGm, there is also a strong relation to References 14 and 15, where, based on a theoretical result from Reference 16, an aggregation algorithm for scalar problems that is based on similar eigenvalue problems was introduced. In order to formulate the eigenvalue problems in that method, a particular nonnegative splitting of the global matrix is required. While finding such a splitting of the actual system matrix is relatively straightforward only for scalar problems, there is a canonical splitting of the auxiliary matrix introduced here. In fact, the eigenvalue problems in the coarsening algorithm proposed here are similar to ones that appear in the algorithm from Reference 15 when applied to the auxiliary matrix with this splitting; however, they feature an additional stabilization that turns out to be essential for elasticity problems. In that respect, they are more reminiscent of a coarsening criterion used in AMGm that is based on computational molecules, introduced in Reference 11.

Outline
Section 2 is dedicated to preliminaries, that is to stating the equations and discretizations we consider. In Section 3, we discuss aggregation-based AMG methods and find a representation of the tentative prolongation based on transformation of coordinates of zero-energy functions given at one point to ones given at another. These coordinate transformation matrices give us a way to explicitly calculate deviation from zero-energy functions and to introduce the auxiliary matrix in Section 4. Based on these, coarsening criteria and a coarsening algorithm will be developed in Section 5. The auxiliary smoothed prolongation will be discussed in Section 6. Finally, numerical results are presented in Section 7.

PRELIMINARIES
We briefly summarize the considered equations and the chosen discretizations; see Reference 17 for a more detailed description of finite element methods in general. From now on, the dimension d will be 2 or 3 and the domain Ω ⊂ R^d will have the Dirichlet boundary Γ_D ⊆ ∂Ω. On the continuous level, we consider linear equations of the form (1). On the discrete level we only consider discretizations with the standard, lowest-order, conforming finite element method. For a shape-regular triangulation of Ω that resolves Γ_D, with set of vertices 𝒱 = {v_1, …, v_n}, in the scalar case the space V_h is given in (2), where P_1(T) is the set of polynomials of order 1 on T. This space is spanned by the standard hat functions (3) associated to the vertices. For the vector case, V_h as in (4) is spanned by the basis functions (5), which are just the scalar ones times the unit vectors e_l. By Galerkin projection of (1) onto V_{h,0} := V_h ∩ V_0, we obtain the system of linear equations (6) that is to be solved, where A_ij = a(φ_i, φ_j) and b_i = g(φ_i) for the basis functions φ_i of V_{h,0}. As a(⋅, ⋅) is symmetric and elliptic, A is a symmetric positive definite matrix. The problem we consider for the scalar case is: Problem 2. Solve the system of linear equations (6) with V_h as in (2) and the bilinear form as in (7). Here, the diffusion coefficient is positive and bounded away from zero, and either |Γ_D| > 0 or the zero-order coefficient is positive in some part of Ω.
In the vector-valued case we are interested in linear elasticity: Problem 3. Solve the system of linear equations (6) with V_h as in (4) and the bilinear form as in (8), with the symmetric gradient ε(u) = (1/2)(∇u + ∇u^T). Now the leading coefficient is positive and bounded away from zero, and the remaining coefficients are nonnegative. Additionally, we exclude the almost incompressible case, in which, as is well known, the problem becomes extremely ill-conditioned (see Reference 17).
For the coordinate vector of a function u or u_h in V_h with respect to the basis functions given in (3) or (5) above, we write u. Owing to the clear association of basis functions and vertices, we will, for the elasticity problem, consider A not as an element of R^{nd×nd} but rather of (R^{d×d})^{n×n}, so a "block entry" A_ij ∈ R^{d×d} stands for the d × d submatrix of A corresponding to the basis functions associated to vertices v_i and v_j. Therefore, by saying that a matrix is diagonal, we mean it to be block-diagonal with respect to vertices. Similarly, vectors u ∈ R^{nd} are interpreted as elements of (R^d)^n, so a "block entry" u_i ∈ R^d is the subvector of u consisting of the entries associated to vertex v_i. There will also be matrices and vectors (on coarse levels) that are to be understood as having a bigger "block size" than d, which will be clear from context. For the most part, vectors and matrices that are to be understood this way are written bold, A, x, to differentiate them from generic ones, A, x. For symmetric matrices A and B, we write A ≤ B if x^T Ax ≤ x^T Bx for all x, and A < B if the inequality is strict for all x ≠ 0. Finally, A ∼ B, or || ⋅ ||_A ∼ || ⋅ ||_B, means that there exist constants c_1, c_2 > 0 such that c_1 A < B < c_2 A.
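The block-entry convention can be illustrated with a short NumPy sketch; the reshape pattern is an implementation detail of this illustration, not taken from the article:

```python
import numpy as np

# Illustration of the block notation: an (n*d) x (n*d) matrix viewed as an
# n x n grid of d x d vertex blocks, i.e. as an element of (R^{d x d})^{n x n}.
n, d = 4, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((n * d, n * d))
A = A + A.T  # symmetric, like a finite element matrix

# Ablk[i, j] is the "block entry" A_ij coupling vertices v_i and v_j.
Ablk = A.reshape(n, d, n, d).transpose(0, 2, 1, 3)

assert np.array_equal(Ablk[1, 3], A[2:4, 6:8])
```

With this view, "diagonal" in the sense of the article means that only the blocks Ablk[i, i] are nonzero.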

ALGEBRAIC MULTIGRID METHODS
Multigrid methods for the solution of (6) have two components: smoothers, some sort of computationally cheap iterative methods, and prolongations. A prolongation matrix is the representation of an embedding operator of some coarse space of smaller dimension into the original, fine space. A simple two-grid method (Algorithm 1) consists of one application of a smoother, restriction of the residual to the coarse space, the exact solution of the projected system on the coarse space, embedding of the coarse solution into the fine space, and finally application of the transposed smoother, which makes the whole iteration symmetric. For a thorough discussion of multilevel methods in general, see Reference 18. The iteration matrix of the two-grid method is E_TG = (I − M^{−T} A)(I − P (P^T A P)^{−1} P^T A)(I − M^{−1} A). For algebraic multigrid methods, a simple smoother M is used; (block-)Gauss-Seidel-type and polynomial smoothers are popular. Then, clearly, the key ingredient is the prolongation matrix P, for which it is of great importance that its range spans the near-nullspace of A (without boundary conditions and disregarding L2 terms, its kernel). In aggregation-based methods, a first, tentative prolongation is constructed by finding some partition of the fine degrees of freedom into so-called agglomerates, and then defining coarse degrees of freedom by the restrictions of kernel vectors to these agglomerates. In order to motivate the definition of edge matrices in the next section, we give a somewhat roundabout formulation of this tentative, or unsmoothed, prolongation. For that, suppose that for any two vertices v_i and v_j there exists a matrix Q_{v_i→v_j} ∈ R^{k×k}, where k is the number of degrees of freedom per vertex, such that every global kernel vector u satisfies u_j = Q_{v_i→v_j} u_i. When a global kernel vector can be identified by its block entry at a single vertex, Q_{v_i→v_j} represents the transform of coordinates associated to one vertex into those an energy-free extension would have at another vertex.
For example, for the scalar case (Problem 2), take Q_{v_i→v_j} = 1 ∈ R^{1×1}, and the above expression simply states that kernel vectors (constants) have equal entries at all vertices. The zero-energy functions of the elasticity problem are the rigid body modes: in two dimensions the translations and one rotation, and in three dimensions the translations and three rotations. For this reason, let us take a detour and suppose we were dealing with some discretization that has exactly k degrees of freedom per vertex, which are nodal in the sense that they represent function values at the position x_i of vertex v_i. In this scenario, every vertex would have associated rotational degrees of freedom in addition to displacement degrees of freedom. Then Q_{v_i→v_j} is given explicitly by (9), where t_ij = x_j − x_i is the vector connecting vertices v_i and v_j. As we make extensive use of the matrices Q_{v_i→v_j} throughout the rest of the article, we summarize their properties here. Lemma 1. For the matrices Q_{v_i→v_j} in (9) the following holds for all v_i, v_j, v_k ∈ 𝒱: (i) Q_{v_i→v_j} is invertible with inverse Q_{v_j→v_i}; (ii) Q_{v_i→v_k} = Q_{v_j→v_k} Q_{v_i→v_j}; (iii) with v_ij denoting an additional vertex placed at the midpoint of v_i and v_j, the corresponding statements hold for the midpoint transfer matrix Q_ij := Q_{v_i→v_ij}. Proof. (i) can be shown by a simple computation, (ii) follows from (i), and (iii) is shown by analogous elementary calculations. ▪ Remark 1. In light of Lemma 1 it is inconsequential whether one considers Q_{v_i→v_j} to exist for any arbitrary pair of vertices or just neighboring ones. In the latter case, an appropriate matrix Q_{v_i→v_j} can be defined for non-neighbors by a product of Q matrices along a path connecting v_i and v_j.
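Equation (9) is not reproduced above; a plausible concrete form of the transfer matrices for 3D elasticity (k = 6), consistent with the description via t_ij = x_j − x_i and rigid body modes, can be checked against Lemma 1 numerically. The function names and the sign convention are assumptions of this sketch:

```python
import numpy as np

def skew(t):
    """Cross-product matrix: skew(t) @ w == np.cross(t, w)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def Q(t):
    """Assumed transfer matrix Q_{v_i -> v_j} for 3D elasticity (k = 6),
    with t = x_j - x_i: a rigid body mode u(x) = a + r x (x - x_i) has
    displacement a + r x t at x_j, while the rotation r is unchanged."""
    M = np.eye(6)
    M[:3, 3:] = -skew(t)   # u_j = u_i + r x t_ij = u_i - skew(t_ij) r_i
    return M

rng = np.random.default_rng(1)
xi, xj, xk = rng.standard_normal((3, 3))

# Lemma 1 (i): Q_{v_i->v_j} is invertible with inverse Q_{v_j->v_i}.
assert np.allclose(Q(xj - xi) @ Q(xi - xj), np.eye(6))
# Lemma 1 (ii): composition along a path gives the direct transfer.
assert np.allclose(Q(xk - xj) @ Q(xj - xi), Q(xk - xi))
```

Since skew is linear in t, composition of transfers simply adds the connecting vectors, which is exactly the path-independence used in Remark 1.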
Notation 1. The number of degrees of freedom per vertex will be referred to as k. It should be clear from context whether k = d or k = d + d(d−1)/2. Expressing the change in coordinates from one vertex to another by such matrices, we can formulate a tentative prolongation, as used in aggregation-based AMG methods.

Definition 1 (tentative prolongation). For the set of fine vertices
, define a coarse vertex v_I at some position x_I for every agglomerate C_I. Then the tentative prolongation P ∈ (R^{k×k})^{|𝒱_f|×|𝒱_c|} is defined blockwise by P_iI = Q_{v_I→v_i} if v_i ∈ C_I and P_iI = 0 otherwise. Notation 2. In order to more easily differentiate between fine and coarse levels, we use lower-case letters for indices associated with the fine level and upper-case letters for those associated with the coarse level.
Remark 2. Usually, aggregation-based algebraic multigrid methods orthogonalize the restrictions of kernel vectors to agglomerates. This has a similar effect to moving x_I, which amounts to choosing a central point with respect to which coarse grid rotations are defined. Therefore, it is reasonable to choose coarse vertices somehow towards the "center" of agglomerates, for example, as the barycenter of the vertices of C_I. When using discretization (4) for linear elasticity, there are no rotational degrees of freedom on the finest level, so Definition 1 cannot be used immediately. However, we can express a prolongation from a coarse level that does include rotational degrees of freedom by simply restricting the above one to rows corresponding to displacement degrees of freedom: Definition 2 (finest-level elasticity prolongation). Let P ∈ (R^{k×k})^{|𝒱|×|𝒱_c|} be a tentative prolongation for Problem 3 as in Definition 1 with Q_{v_i→v_j} defined as in (9), and let R be the restriction to displacement degrees of freedom. Then the finest-level tentative prolongation P_f ∈ (R^{d×k})^{|𝒱|×|𝒱_c|} for Problem 3 is given by P_f := RP. (11) A potential issue with (11) is that, without further restrictions to the partition, RP does not necessarily have full column rank. This happens in three dimensions when all vertices in an agglomerate lie on a straight line, and in both two and three dimensions when an agglomerate consists of only one vertex. This means that coarse-level matrices can be singular if no extra care is taken during the coarsening process to make sure that none of the cases above occur. As we aim to provide a general-purpose coarsening algorithm in this article, we consider an alternative solution to this problem. In principle, one could modify the prolongations themselves: for a series of prolongations P_0, …, P_{L−1}, let, for l > 0, the columns of the matrix E_l consist of an orthonormal basis of ker(RP_0 P_1 ⋯ P_{l−1})^⟂ and set E_0 := R^T. Then the "restricted" prolongations have full column rank. However, the number of degrees of freedom per vertex is not necessarily constant anymore.
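Definition 1 can be sketched for the 2D elasticity case (k = 3: two displacements and one rotation); the concrete form of Q2d and the barycenter placement of coarse vertices (Remark 2) are assumptions of this illustration:

```python
import numpy as np

def Q2d(t):
    """Assumed 2D transfer matrix (k = 3): a rigid body mode with
    rotation r moves the displacement by r * (-t[1], t[0])."""
    return np.array([[1.0, 0.0, -t[1]],
                     [0.0, 1.0,  t[0]],
                     [0.0, 0.0,  1.0]])

def tentative_prolongation(x, agglomerates, Q, k):
    """Block tentative prolongation as in Definition 1 (sketch).
    x: (n, d) vertex positions; agglomerates: list of vertex index lists.
    The coarse vertex of each agglomerate is placed at its barycenter."""
    n, nc = len(x), len(agglomerates)
    P = np.zeros((n * k, nc * k))
    for I, C in enumerate(agglomerates):
        xI = np.mean(x[C], axis=0)                       # coarse vertex position
        for i in C:
            P[i*k:(i+1)*k, I*k:(I+1)*k] = Q(x[i] - xI)   # block Q_{v_I -> v_i}
    return P

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
P = tentative_prolongation(x, [[0, 1], [2, 3]], Q2d, k=3)
```

A coarse coordinate vector of a global rigid body mode prolongates exactly to the fine coordinate vector of the same mode, which is the kernel-preservation property the tentative prolongation is built for.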
Lemma 2 characterizes the kernel of the coarse grid matrices and shows that, unsurprisingly, the matrices E l are feasible to compute.

Lemma 2. Let
where m = | l | is the number of vertices on level l and E l,ii ∈ R k×n i with n i ≤ k.
prolongation of the type of Definition 1. It is therefore enough to consider only the case where there is only one coarse grid matrix A_1 = (RP)^T A RP. To prove the statement, we need to find a basis of ker A_1 = ker RP where every basis vector only has a single nonzero block entry. As RP only has (at most) one column block entry per row, for any vector u ∈ ker RP also its restriction to a single vertex is in the kernel. For any basis of ker RP, therefore, a maximal linearly independent subset of the restrictions of these vectors to single vertices also forms a basis. Orthogonalization and normalization thereof gives a basis as claimed to exist in the lemma. The block structure of E follows immediately, where n_i is the number of basis vectors associated to vertex v_i. ▪ Therefore, E_l can be computed by finding the kernel of the diagonal block entries of A_l. The disadvantage of this approach is that, due to the varying number of degrees of freedom per vertex on coarse levels, we cannot make use of especially efficient sparse matrix formats that store entries in blocks of fixed, static size. We propose to use it for theory only, and to deal with singular coarse grid matrices in a different way. Given some convergent smoother M̃_l for Ã_l, we can define a "pseudo-smoother" M_l for A_l, where we make use of the pseudo-inverse (⋅)^† to define the iteration. This can be seen as merely an implementation of the smoothing iteration on the restricted matrix Ã_l using M̃_l. The advantage is that when M̃_l is the Jacobi or Gauss-Seidel smoother, M_l is the Jacobi or Gauss-Seidel smoother for A_l, where diagonals are pseudo-inverted instead of inverted.
On the coarsest level, where we have to invert Ã_l, a chain of identities, in which the first and third equalities follow from the definition of E_l and the second and last from (ker B)^⟂ = ran B^T, shows that the coarse grid correction is not affected. It is therefore justified to regularize the coarsest-level matrix with the term I − E_l E_l^T, which, according to Lemma 2, is feasible to compute and, as just seen, does not disturb the coarse grid correction. This allows us to work with the aforementioned sparse matrices with static block size, which should be seen as an implementation of the AMG method defined by the restricted prolongations P̃_l.
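The pseudo-smoother idea, block Jacobi where singular diagonal blocks are pseudo-inverted instead of inverted, might be sketched like this; the damping parameter omega is an assumption of the sketch, not taken from the article:

```python
import numpy as np

def block_jacobi_pinv(A, b, x, k, sweeps=1, omega=0.5):
    """Damped block-Jacobi sweeps in which every k x k diagonal block is
    pseudo-inverted, mirroring the suggested treatment of (possibly
    singular) coarse level matrices."""
    n = A.shape[0] // k
    Dinv = np.zeros_like(A)
    for i in range(n):
        s = slice(i * k, (i + 1) * k)
        Dinv[s, s] = np.linalg.pinv(A[s, s])   # (.)^dagger per block
    for _ in range(sweeps):
        x = x + omega * Dinv @ (b - A @ x)
    return x
```

For a regular SPD matrix this is ordinary damped block Jacobi; for a singular block-diagonal matrix with a consistent right-hand side, the pseudo-inverse still produces a valid update, which is the point of the construction.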

DISCRETE ENERGY
The idea of constructing an approximate splitting of the matrix A into edge contributions has already been explored in References 9-11. There, edge matrices of size (2d) × (2d) were used. These edge matrices have to preserve the kernel of the global differential operator. Using the explicit representation of these kernel vectors from the previous section, we can define edge matrices in an arguably simpler and more intuitive way:

Definition 3 (auxiliary matrix and discrete energy). For
v_i, v_j ∈ 𝒱, let Q_ij be defined as in Lemma 1 (iii), and let E_ij = E_ji ∈ R^{k×k} and M_i ∈ R^{k×k} be symmetric and positive semidefinite. The auxiliary matrix Â ∈ (R^{k×k})^{|𝒱|×|𝒱|} is defined by assembling these edge and vertex matrices; Â induces the discrete energy (13). We call E_ij edge weight and M_i vertex weight matrices. While the second term on the right in (13) captures the energy induced by the differential operator (e.g., ‖ε(u)‖²_{L²}), the first term is supposed to represent the L²-term in (7), (8) when the corresponding coefficient is positive, or, as discussed shortly, naturally arises on coarse levels in the presence of Dirichlet boundary conditions. Remark 3. The use of Q_ij instead of Q_{v_i→v_j} imparts symmetry to expression (13) and makes it independent of the orientation of the edges.
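A minimal assembly sketch of the auxiliary matrix from edge and vertex weight matrices. Since the displays of Definition 3 are not reproduced here, the precise form of the edge term, a midpoint mismatch that vanishes on rigid body modes, is an assumption of this sketch:

```python
import numpy as np

def assemble_auxiliary(n, k, edges, E, M, Qmid):
    """Assemble an auxiliary matrix from vertex and edge contributions
    (Definition 3, sketch). Qmid(i, j) is assumed to return the transfer
    matrix from vertex i's coordinates to the edge midpoint; the edge
    energy penalizes the mismatch Qmid(i, j) u_i - Qmid(j, i) u_j, which
    is zero whenever u is the coordinate vector of a rigid body mode."""
    Ahat = np.zeros((n * k, n * k))
    for i in range(n):
        Ahat[i*k:(i+1)*k, i*k:(i+1)*k] += M[i]        # vertex weight M_i
    for (i, j) in edges:
        G = np.zeros((k, n * k))
        G[:, i*k:(i+1)*k] = Qmid(i, j)
        G[:, j*k:(j+1)*k] = -Qmid(j, i)
        Ahat += G.T @ E[(i, j)] @ G                    # edge weight E_ij
    return Ahat

# Scalar sanity check (k = 1, all Q = 1): with unit edge weights and zero
# vertex weights, the auxiliary matrix is the graph Laplacian of the topology.
edges = [(0, 1), (1, 2)]
E = {e: np.eye(1) for e in edges}
M = [np.zeros((1, 1)) for _ in range(3)]
Ahat = assemble_auxiliary(3, 1, edges, E, M, lambda i, j: np.eye(1))
assert np.allclose(Ahat, [[1, -1, 0], [-1, 2, -1], [0, -1, 1]])
```

The scalar check also shows why constants (the k = 1 kernel vectors) lie in the kernel of the assembled matrix.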
For Problem 2, where k = 1 and Q_ij = 1, under the condition that A is an M-matrix, the choice E_ij = |A_ij| with M_i = A_ii − ∑_{j≠i} |A_ij| gives an exact splitting of A into edge and vertex contributions. Now, A is not always an M-matrix; however, in practice there seem to be no issues with using this choice for E_ij anyway and setting M_i = max{0, A_ii − ∑_j |A_ij|}. Either way, the edge contributions simplify accordingly. For Problem 3 in three dimensions, where k = 6, when splitting displacement and rotational degrees of freedom, an edge contribution is built from the block entries [u] − t_ij × {r} and [r], where [⋅] denotes the jump and {⋅} the average along an edge. This expression is independent of the orientation, as exchanging i and j not only negates [⋅] but also flips the tangent vector, and therefore negates the t × {r} term. The second block entry, [r], stands for a "bending" or "twisting" along an edge, while the first, [u] − t_ij × {r}, describes compression/expansion (its tangential part, "stretching") and a kind of "wiggling" in its normal part (see Figure 1).

FIGURE 1: Visualization of the components of the last term in (15): "bending", "twisting", "stretching", "wiggling".

Remark 4. An energy functional made up of contributions similar to (14) is proposed in Reference 19 as a component of the construction of the interpolation in the element-free AMGe method. An energy functional based on contributions of the form (15) might be a good candidate with which to extend that approach also to elasticity problems.
On the finest level (where k = d), when the zero-order term vanishes, theorem 3.8 and the following corollary 3.11 in Reference 10 show that an approximate splitting of A into a sum of edge matrices of the form (16) exists, with positive constants c_ij ∈ R and again the vector t_ij = x_j − x_i connecting vertices v_i and v_j. Theorem 1 shows that there is a choice of E_ij in Definition 3 that reproduces these edge matrices, and thus that, in this case, the resulting auxiliary matrix Â is equivalent to A. Theorem 1 (discrete energy for elasticity). Let A be the finite element matrix for Problem 3 with vanishing zero-order term, let Q_ij be defined as in (9), and let E ∈ (R^{d×k})^{|𝒱|×|𝒱|} be the block-diagonal restriction to displacement degrees of freedom.
Then there exist c_ij ≥ 0 such that the corresponding choice of E_ij in Definition 3 defines an auxiliary matrix Â satisfying the equivalence (17), with constants depending only on the mesh regularity. Alternatively, the second choice given in Definition 3 yields an auxiliary matrix Ã such that (18) holds. Proof. As shown in theorem 3.8 in Reference 10 for element matrices, and corollary 3.11 for A, there exists an approximate splitting of A into edge matrices Ẽ_ij of the form (16) such that (19) holds. It remains to show that any edge contribution in (19) can be represented by one from (17) for some E_ij. Indeed, the choice of E_ij given in the theorem, using the same c_ij, yields the identical contribution. The second statement of the theorem follows from E^T Â E = Ã. ▪ Remark 5. We have not studied in detail the constants involved in equivalence (17) or how to choose c_ij so that they become optimal. Establishing sharp bounds for a particular choice of c_ij would be a topic of interest; however, for the problems considered in this article, the rather crude choice c_ij = (1/d²) ∑_{l,m=1}^{d} |(A_ij)_{lm}| seems to suffice. A more well-reasoned approach is used in Reference 11, where c_ij are computed from off-diagonal block entries of the finite element matrix A.
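The crude choice of c_ij from Remark 5, the mean absolute value of the entries of the off-diagonal block A_ij, is a one-liner:

```python
import numpy as np

def crude_cij(Aij):
    """The crude edge scaling from Remark 5: the mean absolute value of
    the entries of the d x d off-diagonal block A_ij of the finite
    element matrix."""
    d = Aij.shape[0]
    return np.abs(Aij).sum() / d**2
```

For example, for a 2 × 2 block with entries 1, −2, 3, −4, this gives (1 + 2 + 3 + 4)/4 = 2.5.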
So far, Theorem 1 is no real improvement over the results from Reference 10. The edge matrices E_ij are rank-1 matrices, and only the upper left block of the transformation matrices Q_ij is used. However, combining this with Definition 1 leads to a natural way of finding a splitting into edge matrices for the coarse grid matrix.

Theorem 2.
Let the fine-level matrix A be equivalent to the auxiliary matrix Â, given as in Definition 3, and let the set of Dirichlet vertices, the partition of the remaining vertices with its index set, and the unsmoothed prolongation P be as in Definition 1. Finally, for coarse indices I, J, define Q_IJ as in property (iii) of Lemma 1 and write Q_{IJ→ij} := Q_ij Q_{v_I→v_i} Q_JI. Then, with coarse edge and vertex matrices E_IJ and M_I obtained by summing the correspondingly transformed fine-level contributions, the auxiliary matrix Â_c induced by these matrices as in Definition 3 is equivalent to the coarse grid matrix, with constants that are no worse than those in the equivalence between A and Â.
Proof. The statement immediately follows from P^T Â P = Â_c. To see this, first consider the vertex contributions. For the edge contributions, first note an identity that follows from elementary computations. With this, summing both sides of the resulting equalities and defining E_IJ and M_I as proposed in the theorem gives P^T Â P = Â_c. ▪ In the following, we assume appropriate E_ij and M_i to be given on the finest level, and coarse edge and vertex matrices to be computed according to Theorem 2. As mentioned before, it seems that in practice, for many cases, very crude choices, based on the absolute values of off-diagonal block entries, suffice for the discretizations considered here.
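For the scalar case, the Galerkin coarsening of the auxiliary topology in Theorem 2 reduces to simple summations; the following sketch assumes k = 1 and all Q matrices equal to 1 (for k > 1 every term would additionally be transformed by Q matrices):

```python
def coarse_weights(edges, E, M, agg_of):
    """Scalar (k = 1, Q = 1) sketch of Theorem 2: intra-agglomerate edges
    drop out of the Galerkin product, crossing edges accumulate into
    coarse edge weights, and vertex weights are summed per agglomerate.
    agg_of[i] is the agglomerate index of fine vertex i."""
    Ec, Mc = {}, {}
    for i, I in enumerate(agg_of):
        Mc[I] = Mc.get(I, 0.0) + M[i]
    for (i, j) in edges:
        I, J = agg_of[i], agg_of[j]
        if I != J:
            key = (min(I, J), max(I, J))
            Ec[key] = Ec.get(key, 0.0) + E[(i, j)]
    return Ec, Mc
```

Note that only the auxiliary topology (edges, weights) is needed, which is why the whole hierarchy can be coarsened before any coarse-level system matrix is formed.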

COARSENING
The next goal is to introduce a coarsening algorithm that is directly based on the auxiliary matrix Â. As the finite element matrix A itself is not needed for this step, in the multilevel case the entire hierarchy of auxiliary matrices, represented by their topology of vertices and edges as well as edge and vertex matrices, can be built according to Theorem 2 before computing any coarse-level matrices. We therefore suggest the name auxiliary topology AMG for the method. First, we prove that a coarsening that produces a good two-grid method for Â also produces one for A. This comes at the cost of an additional factor in the two-grid estimate that consists of the constants in the equivalence between these two matrices, for which, as addressed in Remark 5, we do not currently have sharp estimates. Lemma 3 is a slight extension of a result from References 16 and 18 that bounds the condition number of the two-grid method by the constant K_TG of (22) if a convergent and SPD smoother M is used. If M is not symmetric, but equivalent to a symmetric matrix, it can be replaced in that expression at the cost of an additional constant. For a thorough discussion of this, see Reference 18. We take extra care to also cover the case where Â is singular, as can happen for the elasticity problem.

Lemma 3. Let Â be an auxiliary matrix as in Definition 3 and let the tentative prolongation
for some c_1, c_2 > 0, and let M be a convergent smoother for A satisfying the corresponding smoothing assumption. Let D̃ be the diagonal of Ã and D = E D̃ E^T the corresponding diagonal matrix of Â. Then the two-grid method defined by the prolongation P̃ = E^T P E_c and the smoother M produces an SPD coarse grid matrix P̃^T A P̃, and the bound (22) holds. Proof. Firstly, P̃^T A P̃ is positive definite due to the definition of E_c, and under these prerequisites theorem 3.19 and the following corollary 3.20 from Reference 18 give an estimate of the stated form. It remains to bound the quantity K̂. Because A and Ã are equivalent according to (20), as a result of an elementwise estimate involving the operators χ_i that zero out all block entries of a vector except the i-th one (with the analogue holding for Ã and D̃), this equivalence also holds between their diagonal parts. Also observe that, due to the block-diagonal structure of E, there holds E^T D E = D̃, and of course E^T E = I. According to Lemma 2, the matrix I − E E^T is diagonal, and per definition its range is the kernel of Â. With this, the range of I − E E^T is a subset of the kernel of D, and thus (I − E E^T) D = 0. Therefore, finally, we obtain the wanted estimate for the numerator in (22). As for the denominator, an analogous computation together with the above estimate shows the claim. ▪ We now proceed to show an estimate for K_TG, which should be seen as a generalization of one originally formulated for scalar problems in Reference 14, theorem 3.2. From this we will derive a criterion for agglomerates that implies boundedness of K_TG. As checking this criterion can be computationally expensive in practice, in Section 5.1 we give simplified coarsening criteria for "matchings," that is, coarsenings where every agglomerate consists of only one or two vertices. Based on these, in Section 5.2, a coarsening algorithm based on successive rounds of matching, first introduced in Reference 15, will be modified to work for both Problems 2 and 3.
We stress that two-grid convergence is guaranteed even for problems that have deliberately been constructed to defeat coarsening algorithms based on traditional, scalar criteria, as demonstrated in Section 7.
Theorem 3. For every agglomerate C, let D_C be the diagonal block of D associated to the vertices in C, and let P_C ∈ (R^{k×k})^{|C|×1} be the block of P associated to C. Now, for the parameter values 0 and 1, let g_0(C) and g_1(C) be the smallest constants such that (23) holds for all vectors, and let g denote the corresponding maximum over all agglomerates. Then (25) holds. Proof. First, we find an estimate from below for the numerator in (22) by omitting edge terms coming from connections between exterior vertices, scaling those coming from connections between exterior vertices and vertices in some agglomerate by one half, and reordering the rest of the terms.
On the other hand, an analogous estimate holds for the denominator, and thus separating the terms in the above sums proves (25). ▪ Computing g(C) involves finding the largest eigenvalue of a generalized eigenvalue problem, where M̂_C, Â_C ∈ (R^{k×k})^{|C|×|C|} are the matrices that induce the seminorms on the left- and right-hand sides, respectively. The role of the parameter in (23) is important for elasticity, where the value 0 leads to a criterion that is too restrictive in practice. The reason is that Â_C then often has a lower rank than M̂_C, due to the fact that both edge and vertex matrices can be singular (and indeed vertex matrices are often zero), so the inequality cannot hold for any finite g_0(C). This results in a coarsening algorithm producing very large coarse grids and eventually, when applied on multiple levels, completely stalling. Setting the parameter to 1 fixes the rank deficiency in enough cases to obtain good coarsening rates.
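Computing g(C) as the largest generalized eigenvalue might be sketched as follows; the use of a pseudo-inverse, valid only when the kernel of Â_C is contained in the kernel of M̂_C (the rank condition discussed above), is an assumption of this illustration:

```python
import numpy as np

def g_measure(M_C, A_C):
    """Sketch of computing g(C): the smallest g with
    |v|_{M_C}^2 <= g * |v|_{A_C}^2 for all v, i.e. the largest
    generalized eigenvalue of the pair (M_C, A_C). A pseudo-inverse is
    used because A_C may be singular; this is only meaningful when
    ker(A_C) is contained in ker(M_C)."""
    vals = np.linalg.eigvals(np.linalg.pinv(A_C) @ M_C)
    return float(np.max(vals.real))
```

For diagonal matrices the result is simply the largest ratio of corresponding entries over the range of A_C, which matches the seminorm interpretation.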
Remark 6. When the parameter in (23) is set to 1, the local energy on the right-hand side is induced by a (generalized, because edge matrices can be singular) Schur complement matrix. This is similar to the way the strength of connection between two vertices is computed in AMGm, 11 based on a generalized Schur complement of a so-called computational molecule, a small subassembled diagonal block of the auxiliary matrix, assembled from edge contributions between the two vertices and their common neighbors.
As already proposed in Reference 15 for their coarsening algorithm for scalar problems, alternatively, in order to simply check that g(C) is bounded by some fixed parameter, the correspondingly scaled difference of Â_C and M̂_C can be checked for positive semidefiniteness by computing a Cholesky factorization. It should be noted that when the parameter in (23) is 1, because no connections between outside neighbors are included, the right-hand-side matrix can be computed efficiently by eliminating neighbors separately instead of all at once.
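The Cholesky-based check can be sketched like this; the parameter name theta and the small tolerance shift (to accept exactly semidefinite matrices) are assumptions of the sketch:

```python
import numpy as np

def satisfies_bound(A_C, M_C, theta, shift=1e-12):
    """Check g(C) <= theta by testing theta * A_C - M_C for positive
    semidefiniteness via a Cholesky factorization. The tiny shift makes
    exactly semidefinite matrices factorizable; it can also mask equally
    tiny negative eigenvalues, which is acceptable for a screening test."""
    S = theta * A_C - M_C
    try:
        np.linalg.cholesky(S + shift * np.eye(S.shape[0]))
        return True
    except np.linalg.LinAlgError:
        return False
```

A failed factorization signals an indefinite matrix, i.e. a candidate agglomerate that violates the bound, without ever solving the eigenvalue problem.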

5.1 Coarsening criteria for matchings
Lemma 4 introduces a measure that bounds g from above. While cheaper to compute, it is only applicable to pairs of vertices. It reduces the eigenvalue problem that has to be solved from size 2k to size k, which makes it completely feasible for an eventual coarsening algorithm to compute it exactly, instead of just checking whether it is bounded by some constant.
Lemma 4. Let the pairwise measure p({v_i, v_j}) be defined via the harmonic mean H(A, B) = A(A + B)^† B of the transformed diagonal blocks. Then there holds g({v_i, v_j}) ≤ p({v_i, v_j}).

Proof.
Let v_H be the coarse vertex representing the agglomerate C (for which there holds (P_C w)_l = Q_{v_H→v_l} w). Set Q_Hh := Q_ij Q_{v_H→v_i}, which is independent of the orientation as Q_ij Q_{v_H→v_i} = Q_ji Q_{v_H→v_j} due to property (i) of Lemma 1, and which represents a transformation of coordinates given at v_H to ones given at the midpoint between v_i and v_j. With this we show an identity for the left-hand side of (23). For readability, set D̃_ij = Q^{ji,T} D_ii Q^{ji} and D̃_ji = Q^{ij,T} D_jj Q^{ij}. These are just diagonal block entries, which take coordinates at the vertices, transformed so that they take coordinates at the midpoint. Elementary calculations identify the minimizer of the infimum. For a short detour, consider two symmetric, positive semidefinite matrices A and B. Clearly, ker(A + B) = ker A ∩ ker B, so ker(A)^⟂ ⊆ ker(A + B)^⟂. Now, inserting the solution for w into the terms in the infimum above gives the first expression and, analogously, the second. As further elementary calculations show, the corresponding identity holds for any two symmetric, positive semidefinite matrices A, B. As for the right-hand side of (23), we need to identify the analogous infimum. All terms for vertices v_l that are neighbors of only one vertex of the pair are zero, which is attained for the choice w = Q_{v_k→v_l} v_k.
It is therefore enough to sum over all common neighbors v_l: similarly to before, with Ẽ_il = Q^{li,T} E_il Q^{li} and Ẽ_jl = Q^{lj,T} E_jl Q^{lj}, elementary calculations identify the infimum. Inserting this into one of the terms in the infimum above and applying (27) allows for further simplification; analogous computations for the second term give the corresponding expression. Summing these two and using (28), we have found the infimum; summing over all common neighbors v_l then yields the estimate. Finally, as g({v_i, v_j}) is the smallest constant for which this last inequality holds, the claimed bound follows. ▪ In fact, as can be seen in the last step of the proof, the bound is attained. In Reference 15, a similar simplification for pairs of vertices is made for the scalar case, which results in a scalar expression that is slightly different from p in that it also includes diagonal contributions analogous to M_i, M_j. For elasticity problems, where k > 1, it would be convenient to find an additional simplification of p which results in some kind of scalar criterion. During the coarsening process, it could be used to cheaply pick a candidate for agglomeration among viable neighbors. As long as that choice is then confirmed by p, it does not even necessarily need to be an upper bound for p. We use a heuristic measure inspired by a criterion that is used in classical scalar AMG methods for M-matrices to gauge the strength of connection between two degrees of freedom i and j, based on entries of the finite element matrix A (see also, e.g., the overview paper 4). Instead of using entries of A, or even Â, it is even simpler to use the vertex and edge matrices M_i, E_ij, which are guaranteed to have nonnegative trace:

Definition 4. With the trace of a matrix, tr A := Σ_i A_ii,
Note that s is in no way guaranteed to produce high-quality agglomerates. As an example, consider Figures 2 and 3. In the former case, Algorithm 2, described in Section 5.2, uses s only as an initial guess for a suitable coarsening, which is then double-checked with g and p, yielding a well-conditioned two-grid method. By contrast, in the latter case, a much worse condition number is obtained when the double-check is removed and the coarsening is based purely on s. It should be said, however, that this example was specifically designed to make correct coarsening more difficult; in many, or even most, cases s alone is sufficient.
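As a concrete illustration of a trace-based scalar strength measure, consider the following sketch. The paper does not reproduce the exact formula for s here, so the particular normalization below (edge trace relative to the geometric mean of the vertex traces) and all names are our assumptions, not the authors' definition.

```python
import numpy as np

def scalar_strength(E_ij, M_i, M_j, eps=1e-30):
    """Hypothetical trace-based strength of the edge {v_i, v_j}.

    E_ij : (k x k) edge matrix; M_i, M_j : (k x k) vertex matrices.
    All are assumed symmetric with nonnegative trace, as guaranteed
    in the text, so the quotient below is well defined.
    """
    return np.trace(E_ij) / np.sqrt(max(np.trace(M_i) * np.trace(M_j), eps))

# Toy 2x2 blocks: an edge that is "strong" relative to its vertex matrices
E = np.diag([4.0, 4.0])
Mi = np.diag([1.0, 1.0])
Mj = np.diag([4.0, 4.0])
print(scalar_strength(E, Mi, Mj))  # 8 / sqrt(2 * 8) = 2.0
```

A measure of this form is cheap to evaluate per edge, which is exactly the role the text assigns to s: a fast pre-filter that is later confirmed by the robust criteria.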

A successive matching algorithm
The coarsening algorithm presented here (Algorithm 2) is based on the successive pairwise aggregation algorithm from Reference 15, which matches up vertices in multiple passes. In the first pass, vertices are matched up and coarse grid vertex and edge matrices are computed according to Theorem 2 in line 7. As no agglomerates containing more than two vertices can be formed at this stage, this can be done based on p and s alone. On subsequent passes, the (coarse grid) vertices that are paired up represent agglomerates formed by the previous pass. As larger agglomerates now exist, g (which still depends on the fine grid matrices M_i, E_ij) also has to be considered. Note that on later passes, in line 9, the edge and vertex matrices for the next pass are computed from the previous coarse grid matrices, not from the initial fine grid matrices.
In the matching process itself, candidates have to be selected among unmatched neighbors. As Theorem 3 shows that the measure g enters the estimate for K_TG directly, it would be preferable to pick the neighbor that minimizes g. On the first pass, the neighbor that minimizes p (which is an upper bound for g) would also be a reasonable choice. However, on later passes, computing g for every single neighbor would be expensive, as increasingly large eigenvalue problems would have to be solved. Therefore, as suggested in Reference 15, instead of the "optimal" neighbor, a candidate is chosen that maximizes either p or s (computed from the coarse grid matrices E_IJ, M_I). If the initial choice is made based on s, it is first confirmed by also computing p. Where necessary (i.e., if the resulting agglomerate would contain more than three vertices), it is then finally confirmed by checking that g stays below the threshold, which, as mentioned before, can be done by computing a Cholesky factorization. The advantage of basing the initial choice on s is that the more expensive p ideally has to be computed for only one candidate. On the other hand, we might get a better initial candidate from p, as it is an upper bound for g on the first pass and an upper bound for what g computed from the coarse grid matrices would be on later passes. While we have not observed a considerable difference in the quality of the generated agglomerates, it seems reasonable to offer the choice between these two approaches as an option (the parameter R in Algorithm 2). One detail that has a surprisingly large effect on the quality of the generated agglomerates is the choice of C_i in line 15. As again suggested in Reference 15, on the first pass a Cuthill-McKee permutation of the fine grid vertices is computed and the vertices are iterated through in that order.

F I G U R E 3 Agglomerates formed by Algorithm 2 for the same problem as in Figure 2 when only using the scalar measure s
On subsequent passes, vertices are iterated through in the reverse of the order in which they were added to the coarse grid on the previous pass. The same ordering is also used to break potential ties in line 20.
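The core of one such pass can be sketched as a greedy matching loop. This is only an illustration of the idea behind Algorithm 2 under our own simplified data layout; the function names, the data structures, and the bare "largest measure wins" selection (without the confirmation steps described above) are our assumptions.

```python
def matching_pass(vertices, neighbors, measure, order):
    """One pass of greedy pairwise matching (sketch, not the paper's code).

    vertices  : iterable of vertex ids
    neighbors : dict vertex -> set of neighboring vertex ids
    measure   : callable (i, j) -> float; larger means "prefer merging"
    order     : traversal order (e.g., Cuthill-McKee on the first pass,
                reverse creation order on later passes)
    Returns a list of agglomerates (1- or 2-element tuples).
    """
    matched = set()
    agglomerates = []
    for i in order:
        if i in matched:
            continue
        candidates = [j for j in neighbors[i] if j not in matched]
        if candidates:
            # pick the unmatched neighbor with the largest measure
            j = max(candidates, key=lambda j: measure(i, j))
            matched.update((i, j))
            agglomerates.append((i, j))
        else:
            # no free neighbor: the vertex stays a singleton
            matched.add(i)
            agglomerates.append((i,))
    return agglomerates

# A path graph 0-1-2-3 with uniform strength pairs up as (0,1), (2,3)
nbrs = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
aggs = matching_pass(range(4), nbrs, lambda i, j: 1.0, [0, 1, 2, 3])
print(aggs)  # [(0, 1), (2, 3)]
```

Running several such passes, with the measure evaluated on the coarse grid matrices of the previous pass, yields the successively larger agglomerates described in the text.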

AUXILIARY SMOOTHED PROLONGATION
In the smoothed aggregation AMG method, a smoothing step is applied to the tentative prolongation matrix P to obtain the smoothed prolongation matrix P_s = (I − ω D^{-1} A) P (see, e.g., Reference 5). Here D is the diagonal of A and ω is a damping parameter. In order to control the sparsity of P_s, a filtered, sparser version of A could be used here instead of A itself. As P_s still has to preserve kernel vectors, care must be taken in how this filtering is done. For scalar problems, where the only kernel vector is the constant one, off-diagonal entries can be removed and subtracted from the diagonal entries such that the row sum remains zero. For elasticity problems, eliminating off-diagonal entries of A without disturbing the rigid body modes is less easy to do; one such approach can be found in Reference 20 (algorithm 5). Eliminating off-diagonal block entries from the auxiliary matrix Â, however, is quite straightforward: in order to eliminate Â_ij, we only need to not assemble the contribution of the edge connecting v_i and v_j. This motivates the following definition:

Definition 5 (auxiliary smoothed prolongation). Let P be a tentative prolongation as in Definition 1. Let the set of non-Dirichlet vertices be split into two sets, and for each vertex v_i in the first of these define some set of "filtered neighbors." Then, define the filtered auxiliary matrix by The auxiliary smoothed prolongation is then given as

Remark 9. For the elasticity problem, the filtered neighbor sets can be chosen such that Â^0_ii is singular. Instead of forcing them all to contain sufficiently many neighbors, it is feasible in practice to use (D^0)† instead of (D^0)^{-1}.
A final option would be to require every nonempty filtered neighbor set to contain at least one neighbor v_j that is in the same agglomerate as v_i, and to then regularize E_ij such that it has full rank, which makes D^0_ii regular.
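The construction above can be sketched in a scalar setting: assemble the filtered auxiliary matrix from the kept edges only, then apply one damped smoothing step to the tentative prolongation. The block structure of the real method is collapsed to scalars here, the edge-to-stencil assembly convention and the damping value are our assumptions, and the pseudoinverse of the diagonal stands in for (D^0)† as in Remark 9.

```python
import numpy as np

def filtered_auxiliary(n, edges, keep):
    """Scalar stand-in for the filtered auxiliary matrix: only edges
    {i, j} with j in keep[i] (or i in keep[j]) contribute. Each kept
    edge adds the graph-Laplacian stencil w * [[1, -1], [-1, 1]], so
    row sums stay zero and the constant vector stays in the kernel."""
    A0 = np.zeros((n, n))
    for (i, j), w in edges.items():
        if j in keep.get(i, ()) or i in keep.get(j, ()):
            A0[i, i] += w; A0[j, j] += w
            A0[i, j] -= w; A0[j, i] -= w
    return A0

def smoothed_prolongation(P, A0, omega=0.5):
    """P_s = (I - omega * D^+ A0) P, with D the diagonal of A0.
    The pseudoinverse D^+ covers vertices whose filtered row is empty."""
    d = np.diag(A0)
    dinv = np.where(d > 0, 1.0 / np.where(d > 0, d, 1.0), 0.0)
    return P - omega * (dinv[:, None] * (A0 @ P))

# 4 vertices, tentative P aggregating {0, 1} and {2, 3}
P = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
keep = {1: {2}}          # only the cross-aggregate edge is smoothed over
Ps = smoothed_prolongation(P, filtered_auxiliary(4, edges, keep))
print(np.allclose(Ps.sum(axis=1), 1.0))  # True: constant vector preserved
```

Because only kept edges enter the filtered matrix, dropping an edge never disturbs the zero row sums, which is the scalar analog of preserving kernel vectors exactly.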
In a distributed parallel setting, the domain, and thus the mesh, is partitioned, with every process assigned its own subdomain. Since coarse grid matrices become less sparse due to smoothed prolongations, they will usually also have many "off-processor" entries connecting degrees of freedom in different subdomains. In this way, even processors that do not share a boundary on the finest level can come into contact with one another on coarse levels, which increases not only the communication volume but also the number of processor pairs that need to communicate with one another. One approach to combat this effect is to remove entries from the coarse grid matrices themselves, resulting in non-Galerkin methods such as those in References 21-23. By contrast, when using the auxiliary smoothed prolongation, the filtered neighbor sets can be freely chosen. Normally, one would like them to contain the neighbors that are most strongly connected to v_i. If we additionally restrict them to neighbors that are "more parallel," we can avoid generating additional off-processor entries and remain in the Galerkin setting without increasing the coarse-level communication volume. More precisely, v_j is admitted as a filtered neighbor of v_i only if the set of processes that share the coarse grid vertex representing the agglomerate v_j belongs to is a superset of the set of processes that share v_i. For example, this allows coarse grid basis functions associated with subdomain boundaries (shared by two or more processes) to grow into the interior of the attached subdomains, but restricts those in the interior of a subdomain (local to one process) to remain within it. We call this kind of auxiliary smoothed prolongation "hierarchic," as it only transfers information from more to less parallel vertices.
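The "more parallel" restriction is a simple superset test on owning-process sets. The following sketch expresses it as a filter; the maps, names, and data layout are ours, not the paper's.

```python
def hierarchic_filter(candidates, procs_of, agg_of, v_i):
    """Keep only neighbors v_j whose coarse vertex (the agglomerate that
    v_j belongs to) is shared by a superset of the processes sharing v_i,
    so smoothing never creates new off-processor coarse connections.

    procs_of : maps a (fine or coarse) vertex to its set of process ranks
    agg_of   : maps a fine vertex to its coarse vertex
    """
    p_i = procs_of[v_i]
    # set inequality '<=' is the subset test: p_i must be contained
    # in the process set of the candidate's agglomerate
    return {v_j for v_j in candidates if p_i <= procs_of[agg_of[v_j]]}

# v0 is interior to process 0; a neighbor whose agglomerate lies on the
# subdomain boundary (shared by processes 0 and 1) remains admissible,
# while one whose agglomerate is local to process 1 is filtered out.
procs = {"v0": {0}, "C1": {0, 1}, "C2": {1}}
agg = {"v1": "C1", "v2": "C2"}
print(hierarchic_filter({"v1", "v2"}, procs, agg, "v0"))  # {'v1'}
```

With this predicate, basis functions on process boundaries may grow inward, but interior basis functions never cross a subdomain boundary, exactly as described above.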
At first glance, there seems to be a major drawback to this approach. As the number of levels increases, and thus the number of degrees of freedom per subdomain decreases, this additional restriction becomes very strict, to the point where most filtered neighbor sets can be empty and the smoothed prolongation is nearly the same as the tentative one. This can be overcome by merging and redistributing subdomains to fewer processes, which frees up additional connections to be taken into account by the tentative prolongation. This is convenient in practice anyway, since with a number of degrees of freedom per core as small as can otherwise occur on coarse levels, the cost of communication can dominate the cost of computation.
Finally, there is one aspect of using auxiliary smoothed prolongations that cannot be addressed with the theory in this article. Theorem 2 shows how to define coarse grid edge matrices such that A_c = P^T A P ∼ Â_c = P^T Â P. However, when using P_s instead of the tentative P, the coarse matrix is A_c = P_s^T A P_s ≠ P^T A P. Unfortunately, we have at present no other way of computing a coarse grid auxiliary matrix, let alone one that provides sharp estimates for the constants in the equivalence A_c ∼ Â_c. As these constants factor into the condition number of the two-grid method, their degeneration over many levels could, in principle, lead to very poor multigrid condition numbers. Possibly, techniques such as those used in Reference 11 to find the correct scaling of finest-level edge matrices based on off-diagonal entries of the finite element matrix could be used to calculate better coarse grid edge matrices. On the positive side, the coarsening algorithm might even profit from the distinction: since A_c is much denser than Â_c, and the effect of repeated smoothing of prolongations is compounded over many levels, a coarsening algorithm based on Â_c alone has to consider far fewer connections and only ever matches coarse grid vertices that are actually "geometrically" next to one another.
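The mismatch can be seen already in a tiny scalar example: the Galerkin products built from the tentative and the smoothed prolongation differ, so the coarse edge matrices of Theorem 2 no longer correspond to A_c exactly. The matrices and the damping value below are illustrative choices of ours.

```python
import numpy as np

# 1D Dirichlet Laplacian on 4 vertices and a tentative prolongation
# aggregating {0, 1} and {2, 3}
A = 2 * np.eye(4) - np.eye(4, k=1) - np.eye(4, k=-1)
P = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])

# one damped-Jacobi smoothing step applied to P
omega = 0.5
Ps = P - omega * np.diag(1 / np.diag(A)) @ (A @ P)

Ac_tent = P.T @ A @ P      # coarse matrix from the tentative P
Ac_smooth = Ps.T @ A @ Ps  # coarse matrix actually used with P_s
print(np.allclose(Ac_tent, Ac_smooth))  # False: the two products differ
```

Any coarse auxiliary matrix assembled per Theorem 2 is equivalent to Ac_tent, not to Ac_smooth, which is precisely the gap in the theory discussed here.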
Ultimately, this is a gap in the theory established in this article and a topic of further investigation. We can only refer to Section 7, where we demonstrate that, in practice, the method does in fact not break down for many levels.
Remark 10. An approach better supported by theory might be to use tentative prolongations in an algebraic multilevel iteration (AMLI) cycle such as in Reference 15. If sharp bounds for the equivalence between finite element and auxiliary matrix could be established on the finest level, this could lead to an algebraic multilevel method that can be proven to have condition number independent of the number of levels even for the linear elasticity case.

NUMERICAL RESULTS
We now present computational results in an attempt to demonstrate both the robustness of the coarsening algorithm presented in Section 5 and the scalability of our implementation. In all cases, the conjugate gradient method, preconditioned with one V-cycle of the auxiliary topology AMG method, was used to reduce the error by a factor of 10^6. The smoother used is a block version of the ℓ1 smoother introduced in Reference 24, which was chosen for its parallelizability.
Blocks are made up of all degrees of freedom associated with a single vertex. We consider three problems, the first two of which are of a simple nature, intended to show the scalability of the implementation in both the scalar and the vector case. The last problem is the three-dimensional analog of the problem used in Figure 2 and is as such explicitly constructed to make traditional coarsening algorithms fail. It is intended to demonstrate the robustness of the coarsening algorithm even for difficult problems at large scale.
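The solver setup just described corresponds to a standard preconditioned CG loop with a relative-reduction stopping test. The sketch below uses a plain Jacobi step as a placeholder for the AMG V-cycle preconditioner; the loop structure and variable names are ours.

```python
import numpy as np

def pcg(A, b, precond, reduction=1e6, maxit=500):
    """Preconditioned CG; 'precond' stands in for one AMG V-cycle.
    Iterates until the preconditioned residual norm has dropped by the
    given factor (the 10^6 reduction used in the experiments)."""
    x = np.zeros_like(b)
    r = b.copy()
    z = precond(r)
    p = z.copy()
    rz0 = rz = r @ z
    for it in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = precond(r)
        rz_new = r @ z
        # compare norms squared, hence the squared reduction factor
        if rz_new <= rz0 / reduction**2:
            return x, it + 1
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxit

# 1D Laplacian with a Jacobi "preconditioner" as a weak V-cycle stand-in
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, its = pcg(A, b, lambda r: r / 2.0)
print(np.linalg.norm(A @ x - b) < 1e-3 * np.linalg.norm(b))
```

Swapping the placeholder for a real V-cycle changes only the `precond` callable; the iteration counts reported in the tables refer to exactly this kind of loop.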
The AMG method has been implemented as an extension library to the C++ open source mesh generator and finite element software Netgen/NGSolve, 25,26 see also the webpage of that project. 27 It is available as research code through the "NgsAMG" project on GitHub. 28 The computations were performed on the Vienna Scientific Cluster in its newest 2020 iteration (VSC4). The partition used for these computations consists of 700 nodes, each featuring 48 cores. For all cases, a series of unstructured tetrahedral meshes of varying size was generated for the same geometry, and the number of vertices per processor was kept as constant as feasible. This means that for computations where the number of processes is not a multiple of 48, as is usually the case, one node in the computation is not fully utilized. Due to technical limitations, the larger meshes could not be generated beforehand; instead, a coarser mesh had to be generated, distributed, and then refined in parallel at execution time. For these reasons we were unable to keep the number of vertices per processor perfectly constant; however, it generally lies within 5% of the target value.

Scalar model problem
First, in order to demonstrate the scalability of the implementation in the simplest scalar case, we consider Poisson's equation, Problem 2, on the unit cube Ω = [0, 1]^3 with unit coefficient and Dirichlet conditions on the entire boundary. The coarsening algorithm uses four passes on the first two levels and three passes after that, and uses the scalar criterion s only.
Coarsening is stopped when the number of vertices reaches the minimum of a static threshold (1600) and the initial number of vertices divided by a fixed factor (1250). The number of block entries per row of the first three auxiliary smoothed prolongations is capped at four. For later ones, the cap is four for vertices where edge matrices are used, and six (for the fourth and fifth) or eight (for any later ones) for those where actual matrix entries are used. The results are listed in Table 1.
For each computation we give the number of nodes (#N), processors (#P), and degrees of freedom (as a multiple of the number of vertices, #V), as well as the number of vertices per processor (#V/#P). The AMG iteration itself is described by the number of AMG levels (#L), the number of vertices on the coarsest one (n_c), the grid or vertex complexity (VC; the sum of the number of vertices on all levels divided by the initial number of vertices), and the operator complexity (OC; the number of nonzero entries of the matrices on all levels divided by that of the finest-level matrix). Finally, the number of CG iterations necessary to reduce the error by a factor of 10^6 (#its) and the times for the setup of the AMG levels (tsup) and for the CG iterations (tsol) are complemented by eff sup and eff sol, the sequential time divided by tsup or tsol, respectively. Results for the vector model problem are listed in Table 2. Note that, with a number of degrees of freedom per process comparable to that of the scalar model problem, and although the matrix is denser and the more expensive criteria p and s are used for coarsening, the setup time is of the same order of magnitude.

A difficult vector problem
The final problem is the three-dimensional analog of the problem used in Figures 2 and 3 to illustrate the necessity of the robust criteria g and p. It is as such specifically constructed to defeat traditional, scalar criteria and features a large jump in the coefficients. The domain Ω = [0, 1]^3 is split into a subdomain Ω_1 and its complement Ω_2. As depicted in Figure 4, Ω_2 is the union of eleven "boxes" that touch at their corners. With the chosen parameters we are still far away from the almost incompressible case; however, the large coefficient jump between 1 in Ω_1 and 10^4 in Ω_2 is problematic. The difficulty comes from the fact that the coarsening algorithm has to correctly identify that no agglomerates spanning multiple of these boxes may be formed. The exception is that, once all vertices within the two boxes at the right end of the chain are contained within one agglomerate each, those two can be combined. On the next level, another single box can be added, and so on, until all but the first one, which touches the Dirichlet boundary, are agglomerated.

F I G U R E 4 The geometry for the difficult vector problem, Ω_1 in green and Ω_2 in red

That first box is different in that its connection to the Dirichlet boundary Γ_D = {0} × [0, 1]^2, once all (non-Dirichlet) vertices in it are agglomerated together, is represented by an edge weight matrix that has six large eigenvalues, whereas the edge weight matrices for any connection between two such box agglomerates have three large and three small eigenvalues. Only the robust criteria g and p can correctly identify this. Parameters for the AMG method are the same as for the vector model problem, except that, as this is a hard problem, only four passes were done on the first level and three afterward. Results are listed in Table 3.

CONCLUSIONS
We have introduced a new notion of discrete energy made up of vertex and edge contributions based on local penalization of deviation from zero-energy functions. The existence of suitable edge matrices on the finest level is a result from Reference 11, and coarse grid edge and vertex matrices have been identified by us. From this energy, we derived criteria