On the use of suboptimal matchings for scaling and ordering sparse symmetric matrices

The use of matchings is a powerful technique for scaling and ordering sparse matrices prior to the solution of a linear system Ax = b. Traditional methods, such as that implemented by the HSL software package MC64, use the Hungarian algorithm to solve the maximum weight maximum cardinality matching problem. However, with advances in the algorithms and hardware used by direct methods to parallelize the factorization and solve phases, the serial Hungarian algorithm can represent an unacceptably large proportion of the total solution time for such solvers. Recently, auction algorithms and approximation algorithms have been suggested as alternatives for achieving near-optimal solutions to the maximum weight maximum cardinality matching problem. In this paper, the efficacy of auction and approximation algorithms as replacements for the Hungarian algorithm is assessed in the context of sparse symmetric direct solvers when used on problems arising from a range of practical applications. High-cardinality suboptimal matchings are shown to be as effective as optimal matchings for the purposes of scaling. However, matching-based ordering techniques require matchings that are much closer to optimality before they become effective. The auction algorithm is demonstrated to be capable of finding such matchings significantly faster than the Hungarian algorithm, but our 1/2-approximation matching approach fails to consistently achieve a sufficient cardinality.


INTRODUCTION
Our aim is to efficiently solve the large sparse linear system $Ax = b$. Our main interest is the use of direct solvers when $A$ is symmetric indefinite. In this case, it is necessary to incorporate numerical pivoting to maintain stability. This can mean that the pivot sequence chosen during the analyse phase on the basis of sparsity has to be modified as the numerical factorization proceeds. In particular, some pivots have to be delayed until they satisfy the stability criteria. Our recent studies comparing scaling algorithms [1,2] demonstrate that the number of delayed pivots is a good predictor of the effectiveness of a scaling at reducing the wall-clock time for the factorization. This is because the number of delayed pivots corresponds to the amount of additional work performed to meet the requirements of numerical pivoting. Our work also demonstrates that, for tough indefinite problems, the widely used MC64 scaling algorithm [3] is particularly effective compared with the other scalings and techniques tested.
The MC64 algorithm applies a transformation to $a_{ij}$ to give an associated (positive) edge weight
$w_{ij} = \log c_j - \log|a_{ij}|$, where $c_j = \max_i |a_{ij}|$. (2)
The maximization in (1) (the maximum product matching problem) is then equivalent to
$\min \sum_{(i,j)\in\mathcal{E}} w_{ij}\,\sigma_{ij}$, (3)
where $\sigma_{ij} = 1$ if edge $(i,j)$ is in the matching and $\sigma_{ij} = 0$ otherwise. By way of a sign change, this is classified as a maximum weight maximum cardinality matching problem (also known as an assignment problem). In MC64, it is solved using the Hungarian algorithm [5]. From standard theory, at optimality, the following conditions are satisfied for some row and column dual variables $u$ and $v$:
$w_{ij} - u_i - v_j = 0 \quad \forall (i,j) \in \mathcal{M}$, (4)
$w_{ij} - u_i - v_j \geq 0$ otherwise.
The matching $\mathcal{M}$ provides a permutation that, in the unsymmetric case, can be used to achieve a zero-free diagonal with large entries. In the symmetric case, it can be used to permute large entries onto the subdiagonal [6-9]. The dual variables $u$ and $v$ can be used to calculate a scaling as follows. Define the diagonal scaling matrices $D_r$, $D_c$ and $S$ with diagonal entries $(D_r)_{ii} = e^{u_i}$, $(D_c)_{jj} = e^{v_j}/c_j$ and $S_{ii} = \sqrt{(D_r)_{ii}(D_c)_{ii}}$. Then $D_r A D_c$ is such that the largest entry in each row and column is exactly one, and all other entries are less than or equal to one. If $A$ is symmetric, $SAS$ has the same property. The permutation and scaling can be used independently if required, and it is especially common in the symmetric case to use only the scaling because of the unsymmetric nature of the permutation. However, more recently, techniques have been developed to permute the large entries onto the subdiagonal [6,7]. We consider both approaches in this paper. There are two potential problems with MC64: (i) the runtime is hard to predict and can vary significantly when the data are permuted; and (ii) an application of MC64 can represent a significant fraction of the total factorization time when using a direct solver, particularly when the solver is run in parallel (see Table VI in Section 4). The latter point is compounded by Amdahl's law, as MC64 is a serial code, while the factorization obtains good parallel speedups on a modest number of cores. The main issue lies with the Hungarian algorithm that MC64 uses to solve the assignment problem. This seeks optimal augmenting paths through the matrix from an unmatched row to an unmatched column. In those cases where performance is poor, it is because a significant portion of the entire matrix must be scanned to prove the optimality of each augmenting path.
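To make this concrete, the following Python sketch (a small hypothetical example; the matrix, variable names and helper function are ours, not part of MC64 or HSL) computes the weights of (2) and shows how the scaling just described could be recovered from a given pair of dual vectors.

import math

# Hypothetical 3x3 example; 0.0 marks a structural zero (no edge).
A = [[4.0, 0.0, 0.5],
     [2.0, 3.0, 0.0],
     [0.0, 1.0, 8.0]]
n = len(A)

# Column maxima c_j = max_i |a_ij| and weights w_ij = log c_j - log|a_ij| >= 0.
c = [max(abs(A[i][j]) for i in range(n) if A[i][j] != 0.0) for j in range(n)]
w = {(i, j): math.log(c[j]) - math.log(abs(A[i][j]))
     for i in range(n) for j in range(n) if A[i][j] != 0.0}

def scaling_from_duals(u, v):
    # For matched entries w_ij = u_i + v_j, so e^{u_i} |a_ij| e^{v_j} / c_j = 1;
    # by (4), every other scaled entry is <= 1. S symmetrizes the scaling.
    Dr = [math.exp(u[i]) for i in range(n)]
    Dc = [math.exp(v[j]) / c[j] for j in range(n)]
    S = [math.sqrt(Dr[i] * Dc[i]) for i in range(n)]
    return Dr, Dc, S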
We note that it may be possible to parallelize the Hungarian algorithm using similar techniques to those used for the unweighted case [10,11]. However, we expect the speedups to be significantly more limited because at each stage, optimal independent augmenting paths must be found, whereas in the unweighted case, any augmenting path will do.
In this paper, we relax the requirement for a maximum cardinality matching to allow us to use algorithms that deliver near-optimal results in weight and cardinality. The solution hence does not provide the zero-free diagonal often desired by unsymmetric solvers, but does allow the majority of large entries to be permuted to the subdiagonal for symmetric matrices and, as we shall demonstrate, still provides a high-quality scaling for most of our test matrices, which are taken from practical applications.
The main contribution of this paper is a comparison of the performance and effectiveness of two alternative algorithms for the relaxed maximum weight maximum cardinality matching problem with those of the MC64 implementation of the Hungarian algorithm when the resulting scaling is used prior to the factorization of sparse symmetric matrices. We do not consider unsymmetric factorizations in this paper, as the application of the matching and the subsequent factorization are substantially different from the symmetric case. The alternatives are an auction algorithm and a 1/2-approximation algorithm. Both solve the problem approximately while claiming to offer significantly better parallel speedups than the Hungarian algorithm for large problems [12,13]. We assess performance in terms of the time taken to find a matching and the effectiveness of that matching when used as a scaling and/or ordering heuristic for a sparse symmetric direct linear solver.
The remainder of this paper is laid out as follows. In Section 2, we describe the auction algorithm and associated work; both serial and parallel versions are discussed. Then, in Section 3, we briefly describe the approximation algorithm. Section 4 compares these algorithms with the Hungarian algorithm, both in terms of performance and of the numerical improvement to the scaled matrix. Finally, in Section 5, our conclusions are presented.

THE AUCTION ALGORITHM
The auction algorithm for the maximum weight matching problem was first proposed by Bertsekas [14] and has since been studied in a number of papers, including [15-17]. Most recently, Sathe et al. [13] showed that the algorithm can quickly find high-quality matchings and is readily parallelizable.
The auction algorithm solves the following maximum weight matching problem:
$\max_{\mathcal{M}} \sum_{(i,j)\in\mathcal{M}} w_{ij}$. (5)
To transform (1) to (5), in place of (2), we use the related transformation
$w_{ij} = \log|a_{ij}| + (\alpha - c_j) + \alpha$, (6)
where $c_j = \max_i \log|a_{ij}|$ and $\alpha = \max_{i,j}\{c_j - \log|a_{ij}|\}$. The transformation $\log|a_{ij}| + (\alpha - c_j)$ is sufficient to transform the maximum product into a scaled maximum sum over positive weights. The extra $\alpha$ term transforms the objective from maximum weight to maximum weight maximum cardinality: because $\alpha$ is at least as large as each individual $\log|a_{ij}| + (\alpha - c_j)$ term, only a maximum cardinality solution can be optimal (this trick has been used by other authors previously, e.g. in LEDA [18]). A corresponding unsymmetric scaling of $A$ is then given by $(D_r)_{ii} = e^{\alpha - u_i}$ and $(D_c)_{jj} = e^{\alpha - v_j - c_j}$, which can again be symmetrized using $S_{ii} = \sqrt{(D_r)_{ii}(D_c)_{ii}}$. For each nonzero entry $a_{ij}$ of $A$, $w_{ij} - u_i$ is the increase in the objective (5) obtained by augmenting $\mathcal{M}$ with $(i,j)$, displacing any edge currently in $\mathcal{M}$ that contains row $i$ or column $j$. The auction algorithm starts with the row dual variables $u = \{u_i\}$ initialised to zero and proceeds by scanning each unmatched column $j$ to find the row index $i = \arg\max_k (w_{kj} - u_k)$. If $w_{ij} - u_i > 0$, column $j$ 'bids' for row $i$. The highest bid for row $i$ wins; say this is in column $j_1$. The matching $\mathcal{M}$ is augmented by adding the edge $(i, j_1)$, and any column previously matched with row $i$ is returned to the pool of unmatched columns. The dual variable $u_i$ is updated to be the cost of using the second best row, that is, $u_i$ becomes the (first-order) reduction in the objective if $j_1$ were not matched to $i$. For this reason, the dual vector $u$ is also referred to in some contexts as the vector of 'reduced costs'. By adding $\epsilon > 0$ to $u_i$, a minimum increase in the objective can optionally be required. This accelerates convergence of the algorithm by ignoring opportunities for trivial improvement. $\epsilon$ is chosen to be small but much larger than machine precision, and it is increased as the algorithm proceeds.
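As an illustration, a minimal Python sketch of this transformation (our reconstruction of (6); the function name is ours) is as follows.

import math

def auction_weights(A):
    # Sketch of (6): w_ij = log|a_ij| + (alpha - c_j) + alpha, with
    # c_j = max_i log|a_ij| and alpha = max_{i,j} (c_j - log|a_ij|).
    # The extra alpha rewards every matched edge, so only a maximum
    # cardinality matching can be optimal. Assumes no empty columns.
    n = len(A)
    logs = {(i, j): math.log(abs(A[i][j]))
            for i in range(n) for j in range(n) if A[i][j] != 0.0}
    c = [max(val for (i, j), val in logs.items() if j == col)
         for col in range(n)]
    alpha = max(c[j] - val for (i, j), val in logs.items())
    return {(i, j): val + (alpha - c[j]) + alpha
            for (i, j), val in logs.items()}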
Once the matching M and row dual variables u have been computed, the column dual variables v may be obtained directly from (4).
We have implemented both the serial and OpenMP versions of the auction algorithm, which we outline as Algorithms 1 and 2, respectively. In Algorithm 1, U is the set of columns that have been found to be unmatchable. In Algorithm 2, we use pseudo-code modelled after OpenMP: DEFAULT(private) and SHARED specify data locality, and BARRIER indicates that no thread continues past that point until all threads have reached it.
For the serial algorithm, the (average) number of iterations and cost per iteration is reduced by treating every bid as immediately winning. This reduces the cost per iteration, as there is no longer a need to determine the highest bid (each bid wins) and the data needed for the resulting update are already in cache. Further, if a bid by column j for row i wins immediately, any existing k such that .i; k/ 2 M becomes available for rematching in the current iteration. However, in the parallel version, the work must be split into separate bid generation and reconciliation phases, as bids must now be communicated between threads. As the process is memory bound, this additional phase essentially doubles the time, so significant speedup is required for the parallel code to outperform the serial code. This is illustrated in Section 4.1.
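The following Python sketch illustrates the serial, immediate-win variant just described. It is a simplification of Algorithm 1, not a transcription: the $\epsilon$ schedule mirrors the one shown in Algorithm 2, displaced columns are deferred to the next iteration rather than rematched immediately, and all function and variable names are ours.

def serial_auction(n, w, maxitr=1000):
    # w maps (row, col) -> positive weight (e.g. from auction_weights).
    # Returns the matching as a col -> row map, plus the row duals u.
    u = [0.0] * n                        # row duals ('reduced costs')
    row_of, col_of = {}, {}              # col -> row and row -> col
    rows_in_col = {}
    for (i, j) in w:
        rows_in_col.setdefault(j, []).append(i)
    unmatched = list(range(n))
    eps = 0.01
    for itr in range(maxitr):
        if not unmatched:
            break                        # every column matched: converged
        eps = min(1.0, eps + 1.0 / (n + 1))
        deferred = []
        for j in unmatched:
            vals = sorted((w[i, j] - u[i], i)
                          for i in rows_in_col.get(j, []))
            if not vals or vals[-1][0] <= 0:
                continue                 # treat as unmatchable (the set U)
            best, i = vals[-1]
            second = vals[-2][0] if len(vals) > 1 else 0.0
            old = col_of.get(i)          # the bid wins immediately
            if old is not None:          # displace the previous owner
                del row_of[old]
                deferred.append(old)
            col_of[i], row_of[j] = j, i
            u[i] += (best - second) + eps  # price rises past second-best bid
        unmatched = deferred
    return row_of, u

Applied to the weights from auction_weights, this yields a (possibly suboptimal) matching and the duals from which a scaling can be derived.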
The termination conditions for both algorithms are the same, and the basis for these is illustrated in Figures 1 and 2. These show the convergence of the serial auction algorithm for two symmetric problems taken from our test set (see Section 4 for details) that are chosen to demonstrate typical behaviour (one converges almost immediately in fewer than 10 iterations, while the other takes over 200). We define the effectiveness of a matching $\mathcal{M}$ as the reduction in the number of delayed pivots compared with the reduction for an optimal matching $\mathcal{M}^*$ calculated using MC64. That is, we use the formula
$\mathrm{effectiveness}(\mathcal{M}) = \dfrac{ndelay(\emptyset) - ndelay(\mathcal{M})}{ndelay(\emptyset) - ndelay(\mathcal{M}^*)}$,
where $ndelay(\cdot)$ is the number of delayed pivots and $\emptyset$ is the empty set and denotes no scaling. The figures demonstrate that $|\mathcal{M}|$ and the effectiveness of the matching do not correlate well. However, in all our tests, we observed that a matching of high cardinality was sufficient to achieve high effectiveness (although we could stop much earlier in some cases). With the further observation that later iterations are much cheaper than earlier iterations, as they involve many fewer unmatched columns, convergence to a high-cardinality matching is a good stopping criterion.

Algorithm 1 Serial auction algorithm
Input: size n, positive weights w_ij, iteration limit maxitr
Output: matching M, dual variables u
…

Algorithm 2 Parallel auction algorithm
Input: size n, positive weights w_ij, iteration limit maxitr
Output: matching M, dual variables u
Running on P threads, DEFAULT(private), SHARED(n, w_ij, maxitr, M)
Initialise M = ∅, u = 0, ε = 0.01
Partition the columns equally between the threads
for itr = 1, maxitr do
    if ( terminate() ) exit
    ε = min(1.0, ε + 1/(n + 1))
    generate_bids()
    BARRIER (wait for all bids to be made)
    determine_winners()
    BARRIER (wait for winners to be found)
    Reassign columns so that each thread has approximately (n − |M|)/P columns
end for
generate_bids():
for each unmatched column j owned by this thread do
    …
Our preliminary experiments showed the quality of the matching to be sensitive to the choice of $\epsilon$ at each iteration. Sathe et al. [13] describe a strategy for choosing $\epsilon$ and, as we also found this to be effective, we employ their approach (except that, in place of initializing $\epsilon$ to $16/(n+1)$, we use the constant initial value 0.01).
The question arises as to how unmatched rows and columns should be scaled. The transformation (6) can lead to large entries in $D_c$ and small entries in $D_r$ that (by construction) cancel to give a moderate scaling of $A$ for entries in matched rows and columns. However, for unmatched rows and columns, we cannot simply choose $u_i = 0$ or $v_j = 0$ without taking into account the $w_{ij}$ transformation: if an unmatched row contains an entry in a matched column (or vice versa), that entry ends up very poorly scaled. Instead, we consider values in $w_{ij}$ space and transform back to $a_{ij}$ space to obtain the relevant scaling. We found that most reasonable values for such dual variables work but, for simplicity, our code initialises $u_i = 0$ and $v_j = \max_i w_{ij}$ (the choice $u_i = 0$, $v_j = 0$ performed considerably less well in our tests).
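A short Python sketch of this choice (the function name is ours; it assumes the outputs of serial_auction above) is as follows.

def column_duals(n, w, row_of, u):
    # Matched columns take v_j = w_ij - u_i, the equality condition in (4);
    # unmatched columns take v_j = max_i w_ij, and unmatched rows keep
    # u_i = 0, so entries shared with matched rows/columns stay well scaled.
    v = [0.0] * n
    for j in range(n):
        if j in row_of:
            i = row_of[j]
            v[j] = w[i, j] - u[i]
        else:
            col = [w[i, j] for i in range(n) if (i, j) in w]
            v[j] = max(col) if col else 0.0
    return v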
Finally, we observe that, for an unmatched column $j$, the value $pval = w_{ij} - u_i$ calculated most recently makes a good guess for $v_j$, as it corresponds to the value of the dual variable for a recent trial matching. As each (non-empty) column has such a $pval$ calculated at least once during the execution of the algorithm, these values are always available. However, storing this value creates additional memory traffic that may not be desirable, particularly in the parallel case, where false sharing may be an issue. Further, no similar value is available for unmatched rows. For these reasons, we avoid this approach.

THE APPROXIMATION ALGORITHM
We start with some definitions that we require in our description of the approximation algorithm. We assume that all edge weights $w_{ij}$ are positive and define $W(\mathcal{M}) = \sum_{(i,j)\in\mathcal{M}} w_{ij}$ to be the total weight of a matching $\mathcal{M}$ for a graph $G$. Let $\mathcal{M}^*$ be an optimal matching; then an $\alpha$-approximation matching algorithm is defined to be a matching algorithm that guarantees to find $\mathcal{M}$ such that $W(\mathcal{M}) \geq \alpha W(\mathcal{M}^*)$. An edge $(i,j)$ of $G$ with weight $w_{ij}$ is defined to be locally dominant if $w_{ij} = \max_k w_{kj} = \max_k w_{ik}$, that is, if it is the heaviest edge incident to both of its endpoints. The greedy approach of Avis [19] provides a simple 1/2-approximation algorithm: it matches the heaviest edges in decreasing order if they are locally dominant (this is roughly equivalent to a single round of the auction algorithm). In this paper, we use the more advanced parallel implementation of Halappanavar et al. [12]. Rather than the sorting-based approach of Avis, a queue-based mechanism is used, as originally proposed by Preis [20]. The central concept is that, at each iteration, locally dominant edges are added to the matching; the matched edges and their vertices are removed, and the resulting reduced graph is considered at the next iteration.
The algorithm has two phases. In the first, a list of the locally dominant edges in $G$ is made. This is performed in parallel by passing through the graph data twice. On the first pass, for each vertex $j$, the neighbour $p_j$ that maximises $w_{kj}$ is found (ties are broken by taking the lowest such $p_j$). The edge $(j, p_j)$ is held as the candidate locally dominant edge for vertex $j$. A second pass confirms whether a candidate edge is locally dominant by checking whether $p_{p_j} = j$. Each locally dominant edge $(j, p_j)$ is added to the matching, and the vertex $j$ is added to the list $Q$ of vertices to be removed from $G$ for the next iteration. Observe that $k = p_j$ will be added to this list when vertex $k$ is processed.
The second phase of the algorithm consists of a number of iterations, each of which removes vertices from $G$ and, for each remaining vertex $i$ that is a neighbour of a vertex in $Q$, updates its candidate locally dominant edge. The list of vertices for removal can be iterated over in parallel as long as updates to the list of candidate edges and additions to the vertex removal list $Q'$ for the next iteration are performed atomically. The unmatched neighbours of each vertex $j$ that is to be removed must be checked. Specifically, any unmatched neighbour $i$ for which $(i, j)$ was the candidate edge must have its candidate edge updated to the edge $(i, p_i)$, where $p_i$ is the unmatched neighbour of $i$ that maximises $w_{ki}$. If the edge $(i, p_i)$ is locally dominant in the reduced graph, it is added to the matching, and the vertices $i$ and $p_i$ are included in the removal list that is to be used in the next iteration.
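A toy serial Python sketch of this locally-dominant scheme (quadratic-time and much simpler than the parallel MATCHBOX implementation; all names are ours) is as follows.

def half_approx_matching(n, w):
    # w maps (i, j) -> weight for an undirected graph; both (i, j) and
    # (j, i) are assumed present. Repeatedly matches mutually-best
    # (locally dominant) edges, then requeues only affected neighbours.
    def best_neighbour(i, alive):
        cands = [(w[i, k], -k, k) for k in range(n)
                 if (i, k) in w and k in alive]
        return max(cands)[2] if cands else None  # heaviest; ties -> lowest k

    alive = set(range(n))        # vertices still in the reduced graph
    mate = {}
    queue = list(alive)
    while queue:
        requeue = []
        for i in queue:
            if i not in alive:
                continue
            p = best_neighbour(i, alive)
            # (i, p) is locally dominant if i is also p's best neighbour.
            if p is not None and best_neighbour(p, alive) == i:
                mate[i], mate[p] = p, i
                alive.discard(i)
                alive.discard(p)
                # Neighbours of i and p must recheck their candidate edges.
                requeue.extend(k for k in range(n)
                               if ((i, k) in w or (p, k) in w) and k in alive)
        queue = requeue
    return mate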
An outline of the algorithm, together with a simple example to illustrate it, is given in [21]. Recall that our interest is in problem (1), but the approximation algorithm addresses the maximum weight maximum cardinality problem. Applying the transformation $w_{ij} = \log|a_{ij}| + (\alpha - c_j)$, where $\alpha = \max_{i,j}\{c_j - \log|a_{ij}|\}$ and $c_j = \max_i \log|a_{ij}|$, we obtain $w_{ij} \geq 0$, and the final matching provides an approximate solution to (1). Note that the extra $\alpha$ term present in (6) is omitted because the algorithm works with relative rather than absolute edge weights: its inclusion would have no effect. As the approximation algorithm does not use dual variables, we must define $u$ and $v$ ourselves. In our implementation, for each matched edge $(i,j) \in \mathcal{M}$ we choose $u_i$ and $v_j$ to satisfy $w_{ij} - u_i - v_j = 0$ (7), while for unmatched columns $j$ we take $v_j = c_j$ (8). The first choice guarantees the equality condition of (4) for edges in $\mathcal{M}$; the second ensures that the largest entry in an unmatched column is scaled towards one, so that the scaling remains appropriate once the $w_{ij}$ transformation is reversed. Note that the choice (7)-(8) is not unique.

COMPUTATIONAL EXPERIMENTS
For the purposes of our experiments, we use four sets of symmetric indefinite test problems drawn from the University of Florida Sparse Matrix Collection [22] and detailed in Table I. Test sets 1 and 2 comprise matrices that do not significantly benefit from an MC64 scaling compared with no scaling or with the application of a cheap norm equilibration algorithm. The purpose of these sets is to assess the cost of applying a scaling algorithm when scaling is not actually needed. For the problems in test set 1, the time to run MC64 is high, while for those in test set 2, MC64 represents a much smaller overhead in the solver time. Test sets 3 and 4 are drawn from our recent paper on pivoting techniques for difficult problems [2]. Test set 3 is a set of problems for which using the MC64 scaling is sufficient to reduce the number of delayed pivots to reasonable levels, while test set 4 comprises those problems that require a matching-based ordering and scaling to achieve this (further details on matching-based orderings are given in Section 4.3). All our tests are performed on the 16-core machine detailed in Table II. All times and results for the Hungarian algorithm are obtained using HSL_MC64 version 2.4.0; the sparse direct solver used is HSL_MA97 version 2.2.0 [23,24]. Both HSL codes are run with default settings, except where otherwise stated. In particular, this means that a heuristic choice is made between approximate minimum degree (AMD) and nested dissection orderings, and the pivot threshold parameter is 0.01. Results for the 1/2-approximation algorithm were obtained with the MATCHBOX software [25]. We use OOM to indicate that a problem ran out of memory during the factorization phase of HSL_MA97 because of the generation of too many delayed pivots.

Scalability
Figure 3 shows the speedup of our implementation of the parallel auction algorithm over our implementation of the serial auction algorithm. The problems in the four test sets have been amalgamated into a single set and then arranged in order of increasing number of entries in $A$. It is clear that the matrix must have a large number of entries before parallelization is worthwhile (a significant speedup is required to overcome the overhead of the separate bid generation and reconciliation phases). Further analysis of the results shows that the quality of the scalings computed using the serial and parallel versions is comparable (see [21] for detailed results). Thus, with our current implementation, we only recommend the parallel algorithm if there are more than $2 \times 10^6$ entries in $A$.
We found that no appreciable parallel speedup was achieved by running the approximation algorithm in parallel (indeed, a slowdown was observed on the smallest problems).
On the basis of these findings, in the rest of this section, we present results for the serial implementations only.

Effectiveness of algorithms: scaling only
Table III reports the quality of the matching achieved by each of the matching algorithms in terms of cardinality, while Table IV measures the effectiveness of the associated scaling by counting the number of delayed pivots when the resulting scalings are used with the HSL_MA97 solver. Table V compares the runtimes of the algorithms. These tables show that, while the approximation algorithm is the fastest, it fails to provide an alternative to the Hungarian algorithm, both in terms of finding a high-cardinality matching and of reducing the number of delayed pivots. On the other hand, both our serial and parallel auction codes lead to a similar number of delayed pivots as the Hungarian algorithm on all but one problem (GHS_indef/ncvxqp1), where they perform slightly worse.
The ncvxqp1 discrepancy is an example where our stopping conditions for the auction algorithm cause it to terminate with a 96.3% match after 101 iterations, taking approximately 0.004 s. If we instead run for 383 iterations (which takes 0.007 s), we achieve a 96.6% match, resulting in only 10 986 delayed pivots, which is comparable to the Hungarian algorithm. However, this run includes 268 iterations during which the matching is stuck at 96.3%. Note that, for this problem, a complete matching requires 12 368 iterations and takes 0.013 s.
Table VI summarises the numbers in Table V by showing the fraction of the total factorization time spent in the scaling for each algorithm. The total factorization time is taken to be the time to compute the scaling and then to factorize the scaled matrix (the time for pre-processing and post-processing the matrix data is not included, but is relatively small and easily parallelized). It shows that the use of the auction algorithm generally reduces the proportion of the time spent in scaling the matrix, especially for problems in test sets 1 and 3. The approximation algorithm spends a very small proportion of the total time in scaling because the factorization time is so much larger.

Effectiveness of algorithms: ordering and scaling
We now consider the effectiveness of using a matching-based ordering combined with the matching-based scaling. As these techniques are known to be expensive [2], we only consider their application to problems in test sets 3 and 4. Rather than the default choice between AMD and nested dissection, a matching-based ordering uses a matching to identify 2 × 2 pivots and compresses the adjacency graph of the matrix so that the sparsity patterns of both members of each 2 × 2 pivot are merged into a single column before a fill-reducing ordering is run on the compressed graph [6,7]. There are thus three times to consider: (i) the time to run the matching algorithm (given in Table V of the previous section); (ii) the time to run the whole matching-based ordering routine, including the pre-processing, matching algorithm, graph compression and ordering; and (iii) the factorization time using the calculated scaling and ordering. Table VII reports the latter two times, while Table VIII demonstrates their ability to reduce the number of delayed pivots required during the factorization.
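As an illustration of the compression step, consider the following Python sketch (hypothetical names; the pre-processing in an actual solver is more involved), which merges the adjacency patterns of the two members of each 2 × 2 pivot into a single supervertex.

def compress_for_2x2_pivots(adj, match):
    # adj[i] is the set of neighbours of vertex i; match[i] = j (and
    # match[j] = i) marks a 2x2 pivot pair identified from the matching.
    # Returns compressed adjacency sets plus an old -> new vertex map.
    n = len(adj)
    new_id, groups = {}, []
    for i in range(n):
        if i in new_id:
            continue
        j = match.get(i)
        members = [i, j] if j is not None and j != i else [i]
        for m in members:
            new_id[m] = len(groups)
        groups.append(members)
    cadj = [set() for _ in groups]
    for i in range(n):
        for k in adj[i]:
            if new_id[k] != new_id[i]:
                cadj[new_id[i]].add(new_id[k])  # union of the pair's patterns
    return cadj, new_id

A fill-reducing ordering (AMD or nested dissection) computed on the compressed graph is then expanded back to the original numbering, keeping the two members of each pivot adjacent.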
We again see that the approximation algorithm does not provide a sufficiently good matching for this approach to be effective. For most of our test problems, the Hungarian algorithm and the serial and parallel auction algorithms give comparable results and are extremely effective in substantially reducing the number of delayed pivots. However, for the ncvxqp/cvxqp problems, the Hungarian algorithm gives the best results, even on those problems for which the auction algorithms gave scalings of comparable quality (Table IV). These ncvxqp/cvxqp problems correspond exactly to those for which the cardinality of the auction algorithm matching was less than 99% (Table III). Additional experiments show that, by running the serial auction algorithm until a 100% cardinality matching is reached, results comparable to the Hungarian algorithm can be obtained, while still offering a substantial time saving.

CONCLUSIONS
We have demonstrated that the auction algorithm fulfils its promise: it provides quality comparable to the Hungarian algorithm in the context of scaling and ordering sparse symmetric matrices for use with direct solvers while being significantly faster. By contrast, the very fast 1/2-approximation algorithm does not at present represent a reasonable alternative. We believe there is substantial room for improvement in how the scaling is derived from the matching, and this is an obvious direction for future work.
Our results further show that high-quality scalings can be obtained using a suboptimal matching. However, the matching-based orderings generally require the matching to be of high cardinality to be fully effective in limiting the number of delayed pivots.
As the parallel auction algorithm requires additional work compared with the serial version, we recommend that the choice between them be left to the user. In our tests, we were able to achieve consistent speedups with the parallel version on matrices that have in excess of two million entries; for smaller problems, it is more efficient to use the serial code.
In this paper, the emphasis has been on sparse symmetric systems. However, matchings are commonly used in the unsymmetric case to permute the matrix to obtain a zero-free diagonal of large elements and hence reduce the need for pivoting (see, e.g., the parallel solver SuperLU_DIST [26]). We suspect that this will exhibit behaviour similar to that found in the symmetric case when permuting large entries to the subdiagonal: specifically, the suboptimal termination required to obtain a small runtime for the matching algorithm may not provide a matching of sufficient quality to avoid pivoting. Further, the failure to obtain a matching of maximum cardinality could necessitate some additional manipulation to ensure pivot candidates exist at the end of the factorization. A future objective is to investigate how effective the auction algorithm is for unsymmetric solvers and, in particular, whether a parallel implementation can reduce the time for the scaling and ordering without having a detrimental effect on the subsequent factorization (see also [17]).
Finally, we remark that an efficient implementation of the Hungarian algorithm is complicated, whereas that of both the serial and parallel versions of the auction algorithm is much more straightforward. We plan to include such implementations within our mathematical software libraries.