Hierarchical adaptive low-rank format with applications to discretized PDEs

A novel compressed matrix format is proposed that combines an adaptive hierarchical partitioning of the matrix with low-rank approximation. One typical application is the approximation of discretized functions on rectangular domains; the flexibility of the format makes it possible to deal with functions that feature singularities in small, localized regions. To deal with time evolution and relocation of singularities, the partitioning can be dynamically adjusted based on features of the underlying data. Our format can be leveraged to efficiently solve linear systems with Kronecker product structure, as they arise from discretized partial differential equations (PDEs). For this purpose, these linear systems are rephrased as linear matrix equations and a recursive solver is derived from low-rank updates of such equations. We demonstrate the effectiveness of our framework for stationary and time-dependent, linear and nonlinear PDEs, including the Burgers' and Allen-Cahn equations.


Introduction
Low-rank based data compression can sometimes lead to a dramatic acceleration of numerical simulations. A striking example is the solution of two-dimensional elliptic PDEs on rectangular domains with smooth source terms. In this case, the (structured) discretization of the source term and the solution lead to matrices that allow for excellent low-rank approximations. Under suitable assumptions on the differential operator, one can recast the corresponding discretized PDE as a matrix equation [20,24]. In turn, this yields the possibility to facilitate efficient algorithms for matrix equations with low-rank right-hand side [7,4]. However, in many situations of interest the smoothness property is not present in the whole domain. A typical instance are solutions that feature singularities along curves, while being highly regular elsewhere. This renders a global low-rank approximation ineffective. Adaptive discretization schemes, such as the adaptive finite element method, are one way to handle such situations. In this work, we will focus on a purely algebraic approach.
During the last decades, there has been significant effort in developing hierarchical low-rank formats that apply low-rank approximation only locally. These formats recursively partition the matrix into blocks that are either represented as a low-rank matrix or are sufficiently small to be stored as a dense matrix. These techniques are usually applied in the context of operators with a discretization known to feature low-rank off-diagonal blocks, such as integral operators with singular kernel [5]. The use of these formats for representing the solution itself has also been proposed [9,18] but its applicability is limited by the fact that the location of the singularities needs to be known beforehand in order to define a suitable admissibility criterion [10,11,5]. This makes the format too inflexible to treat time-dependent problems for which the region of non-smoothness evolves over time. In the context of tensors, it has been recently proposed a bottom-up approach to identify a partitioning of the domain and perform a piecewise compression of a target tensor by means of local high-order singular value decompositions [8]. A very different and promising approach proceeds by forming high-dimensional tensors from a quantization of the function and applying the so called QTT compression format; see [14] and the references therein.
In this paper, we propose a new format that automatically adapts the choice of the hierarchical partitioning and the location of the low-rank blocks without requiring the use of an admissibility criterion. The admissibility is decided on the fly by the success or failure of lowrank approximation techniques. We call this format Hierarchical Adaptive Low-Rank (HALR) matrices.
This work focuses on the application of HALR matrices to the following class of timedependent PDEs: where Ω ⊂ R 2 is a rectangular domain, L is a linear differential operator, f is nonlinear and (1) is coupled with appropriate boundary conditions in space. Discretizing (1) in time with the IMEX Euler method [1] and in space with, e.g., finite differences leads to (I − ∆tL n )u n, +1 = u n, + ∆t(f n, + b n, ), where L n represents the discretization of the operator L, u n, and f n, are the discrete counterparts of u and f at time t := ∆t, ∈ N, and b n, accounts for the boundary conditions. When using finite differences on a tensor grid, it is natural to reshape the vectors u n, , f n, , b n, into matrices U n, , F n, , B n, . In our examples, the matrix L n will often take the form L n = I ⊗ A 1,n + A 2,n ⊗ I, a structure that is sometimes referred as having splitting-rank 2 [24] and which allows to rephrase the linear system (2) as a linear matrix equation.
As a more specific guiding example, let us consider the two-dimensional Burgers' equation over the unit square: with K > 0. Under suitably chosen boundary conditions, the solution of (3) is given by u(x, y, t) = 1 + exp x+y−t 2K −1 ; see [17,Example 3]. For a fixed time t, the snapshot u t := u(·, ·, t) describes a transition between two levels across the line x + y = t. For a small coefficient K the transition becomes quite sharp, see Figure 1. Let U sol n, be the matrix collecting the samples of u t on an equispaced 2D lattice; U sol n, has a time dependent rank structure. More specifically, the submatrices of U sol n, corresponding to subdomains which are far away from x + y = t are numerically low-rank because they contain samples of a smooth function over a rectangular domain; see the lower part of Figure 1. Therefore, an efficient representation strategy for the solution of (3) needs to adapt the block low-rank structure of U sol n, according to . In this work, we develop techniques for: Bottom: Corresponding block low-rank structure of U sol n, for n = 4096; the numbers indicate the rank of the corresponding block while full rank blocks are colored in blue.
(i) Computing a HALR representation for the discretization of a function explicitly given in terms of a black-box evaluation function.
(ii) Solving the linear system (2) by exploiting the HALR structure in the right-hand-side and the decomposition L n = I ⊗ A 1,n + A 2,n ⊗ I.
Task (i) yields structured representations for the initial condition u n,0 and the source term f n, . Taken together, Tasks (i) and (ii) allow to efficiently compute the matricized solution U n, +1 of (2). The assumption on the discretized operator in (ii) is satisfied for L = ∂ 2 ∂x 2 + ∂ 2 ∂y 2 and enables us to rephrase (2) as the matrix equation The paper is organized as follows; in Section 2 we introduce HALR matrices and discuss their arithmetic. Section 2.4 focuses on solving matrix equations of the form (4) where the right-hand-side is represented in the HALR format. There, we propose a divide-and-conquer method whose cost scales comparably to the memory resources used for storing the right-handside. In Section 3 we address the problems of constructing and adapting HALR representations. In particular, Section 3.3 considers the following scenario: given a parameter maxrank, determine the partitioning that provides the biggest reduction of the storage cost and uses low-rank blocks of rank bounded by maxrank. In Section 4 we incorporate HALR matrices into integration schemes for PDEs and we perform numerical tests that demonstrate the computational benefits of our approach. Conclusions are drawn in Section 5.

Notation
To simplify the statements of some definitions we introduce the following compact notation for intervals of consecutive integers: In addition, we write i l , i r < i l , i r if i r < i l and we use the symbol to indicate the union of disjointed sets.

HALR
We are concerned with matrix partitioning described by quad-trees, i.e. trees with four branches at each node. More explicitly, given a matrix A we consider the block partitioning where the blocks A ij can be either dense blocks, low-rank matrices, or recursively partitioned. The cases of interest are those where large portions of the matrix are in low-rank form. This is in the spirit of well established hierarchical low-rank formats such as H-matrices [11] and H 2 -matrices [5]. To formalize our deliberations, we first provide the definition of a quad-tree cluster.
• each node I is a subset of 1, m × 1, n of the form I = I r × I c := m 1 , m 2 × n 1 , n 2 .
• each non leaf node I = I r × I c has 4 children I 11 , I 12 , I 21 , I 22 , that are of the form I ij = I ri × I cj such that I r = I r1 I r2 , I c = I c1 I c2 , and I r1 < I r2 , I c1 < I c2 .
• Each leaf node is labeled either as dense or low-rank.
The depth of T is the maximum distance of a node from the root.
An example of a quad-tree cluster of depth 4 is given in Figure 2; this induces the block structure of a 16 × 16 matrix shown in the bottom part of the figure. This block structure is formalized in the following definition.
Definition 2.2. Let A ∈ C m×n and T be a quad-tree cluster for 1, m × 1, n .
1. Given k ∈ N, A is said to be a (T , k) Hierarchical Adaptive Low-Rank (HALR) matrix, in short (T , k)-HALR, if for every leaf node I r × I c of T labeled as low-rank, the submatrix A(I r , I c ) has rank at most k.
2. The smallest integer k for which A is (T , k)-HALR is called the T -HALR rank of A.   Figure 2: Example of a quad-tree cluster of depth 4 and the induced partitioning on the matrix. The leaf nodes labeled as dense correspond to dense blocks colored in blue. The leaf nodes labeled as low-rank are taken of rank 1 and correspond to the blocks colored in gray.
There are close connections between (T , k)-HALR matrices and H-matrices [11]. More precisely, any H-matrix with low-rank blocks of rank at most k and with binary row and column cluster trees is a (T , k)-HALR matrix. In this case the quad-tree cluster is obtained from the Cartesian product of the row and column cluster trees. The HODLR format [10] is a special case discussed in more detail in Section 2.3. On the other hand, Definition 2.1 allows to build quad-tree clusters that can not be written as subsets of any Cartesian product of a row and a column cluster trees. For instance, we might have two nodes (not having the same father) with column indices I c , I c such that I c ∩ I c = ∅ and I c ⊆ I c , I c ⊆ I c . This makes the HALR class slightly more general than H-matrices.
In the next sections we will describe operations involving HALR matrices and we will tacitly assume to have access to their structured representations, i.e. the quad-tree clusters and the low-rank factors of the low-rank leaves. How to retrieve the HALR representation of a given matrix will be discussed in Section 3.

Matrix-vector product
In complete analogy with the H-matrix arithmetic, the HALR structure allows to perform the matrix-vector product efficiently by relying on the block-recursive procedure described in Algorithm 1.
In the particular case when the cluster only contains the root, A itself is either low-rank (of rank k) or dense and Algorithm 1 requires O((m + n)k) and O(mn) flops, respectively. We remark that this cost corresponds to the memory required for storing A. This statement holds if A is a leaf node then 3: return Av Exploiting low-rank structure if present 4: else 5: end if 8: end procedure in more generality. Proof. The result is shown by induction on the depth of T . By the discussion above, the claim is true when T consists of a single node. If the result holds for trees of depth up to d, and T has depth d + 1, the cost for Av is dominated by the cost of the 4 recursive calls to HALR MatVec. Using the induction assumption, it follows that the cost for these calls sums up to O(S).

Arithmetic operations
We proceed by analyzing the interplay between the quad-tree cluster partitioning and the usual matrix operations. If A is a given (T , k)-HALR we can define the transpose of T as the natural cluster tree for A T . Definition 2.4. Let m, n ∈ N and T a quad-tree cluster for 1, m × 1, n . The transposed quad-tree cluster T T is defined as the quad-tree cluster on 1, n × 1, m obtained from T by: (i) replacing each node I r × I c with I c × I r (ii) swapping the subtrees at I 12 and I 21 for every non-leaf node.
Clearly, A is (T , k)-HALR if and only if A T is (T T , k)-HALR.
Remark 2.5. In the following, we want to regard a subtree of T again as a quad-tree cluster. For such a subtree to satisfy Definition 2.1, we tacitly shift its root m 1 , m 2 × n 1 , n 2 to 1, m 2 − m 1 × 1, n 2 − n 1 and, analogously, all other nodes in the subtree. In the opposite direction, when connecting a tree to a leaf of T , we shift the root (and the other nodes) of the tree such that it matches the index set of the leaf.
Now, let us focus on arithmetic operations between HALR matrices. When dealing with binary operations, we need to ensure some compatibility between the sizes of the hierarchical partitioning in order to unambiguously define the partitioning of the result. To this aim, we introduce the notions of row and column compatibility, which will be used in the next section for characterizing matrix products and additions. Definition 2.6. Given m A , n A , m B , n B ∈ N, let T A , T B be quad-tree clusters for 1, m A × 1, n A and 1, m B × 1, n B , respectively.
• T A and T B , with roots I A and I B , are said to be row-compatible if one of the following two conditions are satisfied: (i) T A or T B only contains the root, and m A = m B .
(ii) For every i, j = 1, 2 the subtrees at (I A ) ij and (I B ) ij are row-compatible.
• T A and T B are said column-compatible if T T A and T T B are row-compatible. • T A and T B are said compatible if they are both row-and column-compatible.
Intuitively, two quad-trees T A and T B are row (resp. column) compatible if taking the same path in T A and T B yields index sets with the same number of row (resp. column) indices.
According to Definition 2.6, compatibility does not depend on the labeling of the leaf nodes. Moreover, two clusters can be compatible even if they have different depths (or contain subtrees of different depths). The following definition introduces a partial ordering among compatible trees. This will be used to define the intersection between quad-tree clusters, which in turn allows us to characterize the natural partitioning of binary matrix operations involving A and B.
Definition 2.7. Let T A , T B be compatible quad-tree clusters for 1, m × 1, n . We write T A ≤ T B if one of the following conditions is satisfied (i) T A only contains the root labeled as low-rank.
(ii) T B only contains the root labeled as dense.
(iii) For every i, j = 1, 2 the subtrees (T A ) ij and (T B ) ij at (I A ) ij and (I B ) ij , respectively, verify The idea behind Definition 2.7 is that T A ≤ T B implies that a (T A , k)-HALR matrix has a stronger structure than an (T B , k)-HALR one. In fact, any (T A , k)-HALR is also a (T B , k)-HALR for all T B ≥ T A . A low-rank matrix itself corresponds to the format with the strongest structure. Based on this, we define the intersection between T A and T B as the strongest structure among the ones which are weaker than both T A and T B . Definition 2.8. Let T A , T B be compatible quad-tree clusters for 1, m × 1, n . Their intersection T := T A ∩ T B is defined recursively as follows: (ii) If T A or T B only contain the root labeled as dense then T A ∩ T B is a tree that only contains the root labeled as dense.
(iii) If T A and T B contain more than one node then their intersection is constructed by connecting the subtrees Remark 2.9. The neutral element for the intersection is given by the quad-tree T only containing the root labeled as low-rank, that is, a low-rank matrix.
We now make use of the notions defined above to infer the structure of A + B from the ones of A and B.
Proof. We recall that the sum of two matrices of rank at most k A and k B , respectively, has rank at most k A + k B . The statement follows from traversing the tree T A ∩ T B ; for every leaf in the tree for which both submatrices of A and B are low rank, the resulting submatrix in A + B will have rank at most k A + k B .
It is instructive to consider two special cases. First, if T A = T B , then A + B shares the same quad-tree cluster (with higher rank). Second, in view of Remark 2.9, if A is low rank then A + B has the same structure as B, with a rank increase by (at most) the rank of A.
The proof of Lemma 2.10 suggests a recursive procedure that is summarized in Algorithm 2. An inductive argument analogous to the one used for Lemma 2.3 shows that the complexity of Algorithm 2 is bounded by two times the cost of storing a (T A ∩ T B , k A + k B )-HALR matrix. Note that this estimate can be reduced by exploiting the fact that Line 5 is executed at no cost by simply appending the low-rank factors of A, B. For example, when A is a rank-k A matrix the cost reduces to k A times the number of entries in the dense blocks of B, which equals the storage needed for a (T B , 0)-HALR matrix If A (resp. B) is a low-rank leaf, partition it according to B (resp. A) 8: return end if 10: end procedure For a matrix product A · B of HALR matrices, it is natural to assume that A T and B are row compatible. Assuming that T A and T B denote, as usual, the quad-tree clusters of A and B, the matrix product A · B stored in the HALR format is computed with the following procedure: (i) If T A (resp. T B ) only contains the root labeled as low-rank, then the resulting tree only contains the root labeled as low-rank and its factorization is obtained efficiently from the one of A (resp. B); otherwise (ii) if T A (resp. T B ) only contains the root labeled as dense, then A · B is computed in dense arithmetic and the resulting tree only contains the root labeled as dense; otherwise (iii) we partition We note that it is difficult to predict a priori the quad-tree cluster of AB because even if A and B contain many low-rank blocks, the structure may be completely lost in AB; see the example reported in Figure 3. Also computing A · B may cost significantly more than the storage cost of the outcome, e.g., when A and B are dense matrices. On the other hand, in Section 2.3 we will show that if one of the two factors happens to be a HODLR matrix then the cost of computing A · B and its quad-tree structure are predictable. Figure 3: Example of loss of structure when computing the matrix-matrix multiplication. The blue region correspond to nodes labeled as dense, and the empty regions to nodes labeled as low-rank.

HODLR matrices
HODLR matrices are special cases of HALR matrices; all the off-diagonal blocks have low rank.
To formalize this notion, we adopt the definition given in [15], rephrased in the formalism of quad-tree clusters. • I 12 and I 21 are leaf nodes labeled as low-rank.
• the subtrees at I 11 and I 22 are HODLR clusters of depth p − 1.
We say that a matrix is (T An example of a HODLR cluster is reported in Figure 4. A crucial property of HODLR matrices is that they are block diagonal up to a low-rank correction. This allows to predict the structure of a product of HALR matrices whenever one of the factors is, in fact, a HODLR matrix.
Proof. We prove only the first statement, the second can be obtained by transposition. We proceed by induction on p A ; if p A = 1, then A is composed of a single dense block. Since p B ≤ p A , B is also composed of a single block, either labeled as low-rank or dense. Both structures are preserved when multiplying with A.
Suppose that the claim is valid for p A − 1 ≥ 1. If T is composed of a single node, the claim is valid. Otherwise, by decomposing A in its diagonal and off-diagonal parts, we may write In view of the induction step, each block of The complexity of multiplying by a HODLR matrix can be bounded in terms of the storage of the other factor.

Lemma 2.13. Under the assumptions of Lemma 2.12 the cost of computing
where S is the storage cost of B and n min is an upper bound on the size of the dense diagonal blocks of A.
Proof. For p A = 1, A is a square matrix of size at most n min while B has low rank or is dense. In both cases, it directly follows that the cost of multiplication is O(Sn min ).
For the induction step, we recall the splitting (6) of A · B into two terms M D and M O . The term M O is a product between B and a matrix of rank (at most) 2k A , which, according to Lemma 2.3, requires c v Sk A operations for some constant c v . The term M D consists of four products A ii B ij , where A ij is a HODLR matrix of depth p A − 1 and B ij is a HALR matrix. By induction, there is a constant c ≥ c v + 2 such that the cost for each of these four multiplications is bounded by cS ij (n min + k A (p A − 2)) operations, where S ij denotes the storage cost of B ij . Adding the corresponding rank-k A submatrix of M O requires at most 2S ij k A operations, as discussed after Lemma 2.10. Therefore, the cost for computing the block (i, j) of the product A · B is bounded by cS ij (n min + k A (p A − 1)). Summing over i, j concludes the proof.
Remark 2.14. When performing arithmetic operations between HALR matrices, or HODLR and HALR matrices, it is often observed that the numerical rank of the blocks in the outcome is significantly less than the worst case scenario depicted in Lemma 2.10 and 2.13. Hence, it is advisable to perform a recompression stage, see [11, Algorithm 2.17, p. 33], when expanding low-rank factorizations, such as in line 5 of Algorithm 2.

Solving Sylvester equations with HODLR coefficients and HALR right-hand-side
As pointed out in the introduction, when dealing with PDEs defined on a rectangular twodimensional domain, one frequently encounters linear matrix equations of the form with square matrices A, B and a right-hand side C of matching size. To simplify the discussion we will assume that A and B are of equal size n. As A, B stem from the discretization of a 1D differential operator, they are typically (T (H) p , k)-HODLR for some small k. In contrast to our previous work [15], where we assumed C to be HODLR as well, we now consider the more general setting when C is (T , k C )-HALR. In the following, we require that T The particular case when C is a dense matrix will be discussed in further detail in Section 2.4.1. For the moment, we let DenseRHS Sylv denote the algorithm chosen for this case. If, instead, C is low-rank, well-studied low-rank solvers are available, such as Krylov subspace methods and ADI (see [23] for a survey). Under suitable conditions on the spectra of A and B and given a low-rank factorization of C, these solvers return an approximation to the solution X in factorized low-rank format. Since the specific choice of the low-rank solver is not crucial for the following discussion, we refer to this routine as LowRankRHS Sylv.
The equation (7) can be solved recursively using an extension of our divide-and-conquer approach [15] for HODLR matrices C. If T only contains the root and, hence, C is composed of a single block, we use either DenseRHS Sylv (if C is dense) or LowRankRHS Sylv (if C is low-rank). Otherwise, we partition (7) according to the four children of the root of T : In the spirit of [15], we first solve the equation associated with the diagonal blocks of A and B: which is equivalent to solving the four decoupled equations where, by recursion, X can be represented in the T -HALR format. Letting δX := X − X and subtracting (8) from (7), we obtain which is a Sylvester equation with right-hand-side of rank at most 4k. In turn, δX is computed using LowRankRHS Sylv, and X = X + δX is retrieved performing a low-rank update. Note that the Sylvester equations in (9) have again HODLR coefficients and HALR right-hand-side, with the depth decreased by one. Applying this step recursively yields the divide-and-conquer scheme reported in Algorithm 3. Note that, the approximate solution returned by Algorithm 3 retains the HALR format, with the quad-tree cluster T inherited from C. for i, j = 1, 2 do 10: end for 12: In practice, LowRankRHS Sylv in lines 4 and 13 uses low-rank factorizations of the matrices C and C, and returns the solutions in factorized form. The low-rank factors of C at line 4 are given as C is a leaf node of an HALR matrix. At line 4 they are easily retrieved using the low-rank factorizations of the off-diagonal blocks of A and B that are stored in their HODLR representations; see [15, Section 3.1] for more details. When X is assembled by its blocks in line 14, an HALR structure with the appropriate tree is created.

Sylvester equation with dense right-hand-side
We now consider the solution of a Sylvester equation (7) with dense C and HODLR coefficients A, B. This is needed in Line 6 of Algorithm 3, but it may also be of independent interest.
For small n (say, n ≤ 200), it is most efficient to convert A and B to dense matrices, and use a standard dense solver, such as the Bartels-Stewart method or RECSY [13], requiring O(n 3 ) operations.
For large n, we will see that it is more efficient to use a recursive approach instead of a dense solver. For this purpose, we partition C into a block matrix in accordance with the row partition of A and the column partition of B. More specifically, if the size of the minimal blocks in the partitioning of A and B is n min and n = 2 p n min , we represent C as a n nmin × n nmin block matrix, that is, a (T , 0)-HALR of depth p with all leaf nodes labeled as dense. Then (7) is solved recursively in analogy to Algorithm 3. The resulting procedure is summarized in Algorithm 4, where DenseSolver Sylv indicates the standard dense solver.

Complexity analysis of the D&C Sylvester solvers
In order to perform a complexity analysis we need to make a simplifying assumption on the convergence of the low-rank Sylvester solver, which usually depends on several features of the problem, such as the spectrum of A and B. Partition C according to the partitioning of A, B: for i, j = 1, 2 do 8: end for 10: end if 14: end procedure Assumption 1 The computational cost of LowRankRHS Sylv for AX + XB = C is O(k C kn log n + k 2 n log 2 n), where n is the size of A, B, and k their HODLR rank, and k C is the rank of C. The rank of X is O(k C ).
Assumption 1 is satisfied, for example, if the extended Krylov subspace method [22] converges to fixed (high) accuracy in O(1) iterations and the LU factors of A and B are HODLR matrices of HODLR rank O(k) 1 . This requires the solution of a linear system with A and B in each iteration, via precomputing accurate approximations of the LU decompositions of A, B at the beginning with cost O(k 2 n log 2 n). In other situations, e.g., when the number of steps and/or the number of linear systems per step depend logarithmically on n in order to reach a fixed accuracy, Assumption 1 and the following discussion can be easily adjusted by adding log n factors.
Before analyzing the more general Algorithm 3, it is instructive to first focus on Algorithm 4. We note that Algorithm 4 solves ( n nmin ) 2 dense Sylvester equations of size n min and, at each level j = 0, . . . , p − 1, as well as 4 j Sylvester equations of size n 2 j and with right-hand-sides of rank at most 4k. In addition, computing the low-rank factorization at line 10 requires O( n 2 4 j k) operations, amounting to a total cost of O(n 2 ). Under Assumption 1, LowRankRHS Sylv solves the equations at level j with a cost bounded by O( k 2 n log 2 n 2 j ). Hence, the total computational cost is O(n 2 (n min + k 2 log 2 n)). For large n and moderate k, we can therefore expect that Algorithm 4 is faster than a dense solver of complexity O(n 3 ).
The following lemma estimates the cost of Algorithm 3 for a general HALR matrix C, which reduces to our previous estimates in the two extreme cases: O(k C kn log n + k 2 n log 2 n) if C is low-rank and O(n 2 n min + k 2 n 2 log 2 n) algorithm if C is dense. , k)-HODLR matrices A, B ∈ C n×n and a (T , k C )-HALR matrix C ∈ C n×n , with a quad-tree cluster T that is compatible with T (H) p and has depth p C ≤ p. Suppose that p ∼ log(n), and let n min denote the size of minimal blocks in A, B. If Assumption 1 holds, then the cost of Algorithm 3 for computing the solution X is O(S(n min + k 2 log 2 n)), where S is the storage required for C.
Proof. We prove the result by induction on p C . For p C = 1, the result holds following the discussion above, because S = n 2 if C is dense and S = 2k C n if C is low-rank.
As the induction step is similar to the proof of Lemma 2.13, we will keep it briefer. When p C > 1, Algorithm 3 consists of four stages: By the induction hypothesis, each solve is O(S ij (n min + k 2 log 2 n)), where S ij denotes the storage of C ij and, hence, the total cost is O(S(n min + k 2 log 2 n)).

Computation of the right-hand-side in line 12.
This computation involves 4k matrix-vector products with X. After p − 1 recursive steps, the storage for X is at most the one for C plus the one for storing the p − 1 low-rank updates, which amounts to O(S + kpn) according to Assumption 1. Hence, by Lemma 2.3, the cost of this step is O(Sk + k 2 pn).

Solution of Sylvester equation in line 13.
Because the rank of the right-hand side is bounded by 4k, the cost of this step step is O(k 2 n log 2 n).

Update of X in line 14.
The cost of this step is O(Sk + k 2 pn), the storage of X times the rank of δX.
The total cost is dominated by the cost of Step 1, because one can easily prove by induction that pn is bounded by O(S). This completes the proof.
Note that it is not the HALR rank k C but the storage cost S of the right-hand-side C that appears explicitly in the complexity bound of Lemma 2.15. The advantage of using S instead of an upper bound induced by k C is that it allows us to better explain why isolated relatively high ranks can still be treated efficiently.
We remark that when A and B are banded, e.g. when they arise from the discretization of 1D differential operators, Algorithm 3 can be executed without computing the HODLR representations of A and B. Indeed, the low-rank factorizations of the off-diagonal blocks at line 12 are easily retrieved on the fly and one can implement a solver of Sylvester equations that exploits the sparse structure of A, B, in LowRankRhs Sylv.

HALR matrices in hm-toolbox
The hm-toolbox [19] available at https://github.com/numpi/hm-toolbox is a MATLAB toolbox for working conveniently with HODLR and HSS matrices via the classes hodlr and hss, respectively. We have added functionality for HALR matrices to the toolbox. A new class halr has been introduced, which stores a (T , k)-HALR matrix A as an object with the following properties: • sz is a 1 × 2 vector with the number of rows and columns of the matrix A.
• F contains a dense representation of A if it corresponds to a leaf node labeled as dense.
• U, V contain the low-rank factors of A if it corresponds to a leaf node labeled as low-rank.
• admissible is a Boolean flag that is set to true for leaf nodes labeled as low-rank.
The toolbox implements operations between halr objects, such as Algorithms 1-2 and matrix multiplication, as well as the Sylvester equation solver described in Section 2.4. Similar to hodlr and hss, when arithmetic operations are performed recompression (e.g., low-rank approximation) is applied in order to limit storage while ensuring a relative accuracy. More specifically, an estimate of the norm of the result C = A op B is computed beforehand, and recompressions are performed using a tolerance · C , where is a global tolerance.

Construction via adaptive detection of low-rank blocks
In this section we deal with task (i) described in the introduction, that is: given a function handle f : 1, m × 1, n → C construct an HALR representation of A = (a ij ) i,j ∈ C m×n such that a ij = f (i, j).

Low-rank approximation
We start by considering a simpler problem, the (global) approximation of A with a low-rank matrix. Several methods have been proposed for this problem [12,2,25,21], which target different scenarios. In the following sections we will often need to determine if a matrix is sufficiently low-rank in the sense that it can be approximated, within a certain accuracy, with a matrix of rank bounded by maxrank. For this purpose, we assume the availability of a procedure (U, V, flag) = LRA(A, maxrank, ) that returns a low-rank factorization A ≈ U V T , of rank at most maxrank. The returned flag indicates whether the approximation verifies A − U V T . In our implementation, we will rely on the adaptive cross approximation (ACA) algorithm with partial pivoting [2], which only requires the evaluation of a few matrix rows and columns selected by the algorithm. The parameter is used in the heuristic stopping criterion of the method, which in practice usually ensures the requirement on the absolute error stated above. When aiming at a relative accuracy rel , we need to set = rel A ; if A is not available, it is estimated during the first ACA steps. The cost of ACA for returning an approximation of rank k is O((k 2 + kc A )(m + n)) where c A is the cost of evaluating one entry of A. The approximation is returned in factorized form as a product of m × k and k × n matrices and therefore the storage cost is O(k(m + n)).
Depending on the features of A, other choices for the procedure LRA might be attractive. For instance, if the matrix-vector product by A and A T can be performed efficiently (for instance when A is sparse), then a basis for the column range of A can be well-approximated by taking matrix-vector products with a small number of random vectors, and this can be used to construct an approximate low-rank factorization as described in [12]. The methodology described in the following sections can be adapted to this context, by replacing the procedure LRA.

HALR approximation with prescribed cluster
Letting T denote a prescribed quad-tree cluster on 1, m × 1, n , we consider the problem of approximating A within a certain tolerance , with a (T , k)-HALR A for some, hopefully small k. A straightforward strategy for building A is to perform the following operations on its blocks: (i) for a leaf node labeled low-rank, run LRA (without limitation on the rank) to approximate the block in factored form; (ii) for a leaf node labeled dense, assemble and explicitly store the whole block; (iii) for a non-leaf node, proceed recursively with its children.
To avoid an overestimation of the ranks for blocks of relatively small norm, we first approximate the norm of the entire matrix with the norm of a rough approximation of A obtained by running LRA for a small value of maxrank.

HALR approximation with prescribed maximum rank
We now discuss the problem at the heart of HALR: Given an integer maxrank determine a quadtree cluster T and an (T , k)-HALR matrix A such that k ≤ maxrank and A − A ≤ . Moreover, we ideally want T to be minimal in the sense that ifÂ is another (T ,k)-HALR approximating A within the tolerance , andk ≤ maxrank, thenT < T . In this context, we consider all the trees (or subtrees) that contain only dense leaves to be equivalent to a single dense node. We propose to compute T and A with the following greedy algorithm: (i) We apply LRA limited by maxrank to the matrix A. If this is successful, as indicated by the returned flag, then T is set to a tree with a single node that is labeled low-rank and contains the approximation returned by LRA.
(ii) If LRA fails and the size of A is smaller than a fixed parameter n min then T is set to a tree with a single node labeled as dense and the matrix is formed explicitly.
(iii) Otherwise we split A in 4 blocks of nearly equal sizes and we proceed recursively on each block. Then: • If the 4 blocks are all leaves labeled as dense, then we merge them into a single dense block. • Otherwise, we attach to the root of T the four subtrees resulting from the recursive calls.
The whole procedure is summarized in Algorithm 5.

Refining an existing partitioning
As operations are performed on a (T , k A )-HALR matrix A, its low-rank properties may evolve and it can be beneficial to readjust the tree T accordingly by making use of Algorithm 5. More specifically, we refine T by performing the following steps from bottom to top: (i) Algorithm 5 with maximum rank maxrank is applied to each leaf node and the leaf node is replaced with the outcome.
(ii) A node with 4 children that are dense leaf nodes is merged into a single dense leaf node.
(iii) For a node with 4 children that are low-rank leaf nodes, we form the low-rank matrix obtained by merging them. If its numerical rank is bounded by maxrank, we replace the node with a low-rank block. Otherwise, the node remains unchanged.
The procedure is summarized in Algorithm 6; to decide whether to merge four low-rank blocks in (iii), we make use of the method CompressFactors that computes a reduced truncated singular value decomposition of U V T ; this requires O(k 2 (m + n) + k 3 ) flops, where k is the number of columns of U, V , see [11,Algorithm 2.17,p. 33].
In the next sections, Algorithm 6 is regularly used to deal with situations where a matrix B is obtained from operating with HALR matrices A 1 , . . . , A and its tree is initially set to the intersection of the cluster trees of A 1 , . . . , A . A relevant special case is the one where only A 1 is a general HALR matrix and all the other matrices are low-rank; in this case the initial tree for B is the one of A 1 .
Algorithm 5 Approximation of a matrix A using the greedy construction of the quad-tree cluster T . The (absolute) approximation accuracy is determined by . if A ij are labeled as dense for i, j = 1, 2 then

Numerical examples
In Sections 2 and 3 we have developed all the tools needed to implement an efficient implicit time integration scheme for the reaction diffusion equation (1), provided that the discretization of the operator L has the Kronecker sum structure I ⊗ A n + B n ⊗ I. In the following, we describe in detail how to put all pieces together for the representative case of the Burgers' equation. Then we provide numerical tests for other problems that can be treated similarly. The experiments have been run on a server with two Intel(R) Xeon(R) E5-2650v4 CPU with 12 cores and 24 threads each, running at 2.20 GHz, using MATLAB R2017a with the Intel(R) Math Kernel Library Version 11.3.1. In all case studies, the relative truncation threshold has been set to rel = 10 −8 .

Burgers' equation
We consider the following Burgers' equation [17,Example 3] with Dirichlet boundary conditions: for K = 0.001. We make use of a uniform finite differences discretization in space, with step h = 2 n−1 , combined with a Euler IMEX method for the discretization in time with step ∆t = 5 · 10 −4 .

2:
if A is leaf node then 3: A ← HALR Adaptive(A, maxrank, ) 4: if A.A ij are dense leaf nodes for i, j = 1, 2 then 7: if A.A ij are low-rank leaf nodes for i, j = 1, 2 then 10: if rank(U V T ) ≤ maxrank then I − ∆tA n U n, +1 + U n, +1 1 2 I − ∆tA n = U n, + ∆t(F n, + B n, ), where, denoting with • the Hadamard (component-wise) product, we have set F n, := U n, • D n, U n, + U n, D T n, + (e n v T n, + v n, e T n )/h , (v n, ) j = u(jh, 2, ∆t), B n, := (e 1 w T n, +1 + w n, +1 e T 1 + e n v T n, +1 + v n, +1 e T n )/h 2 , (w n, ) j = u(jh, 0, ∆t) and Note that rank(B n, ) ≤ 4. The time stepping procedure is reported in Algorithm 7. If Algorithm 7 is executed with standard dense numerical linear algebra each time step requires O(n 3 ) flops and O(n 2 ) storage. In order to exploit the additional structure observed in Figure 1 we propose to maintain the HALR representations of the matrices F n, and U n, . In particular: (i) At line 3 we employ Algorithm 5 to retrieve a quad-tree cluster tree T and a (T , k)-HALR representation of U n,0 . The rank k satisfies k ≤ maxrank.
F n, ← U n, • D n, U n, + U n, D T n, + (e n v T n, + v n, e T n )/h 7: B n, ← (e 1 w T n, +1 + w n, +1 e T 1 + e n v T n, +1 + v n, +1 e T n )/h 2

8:
R ← U n, + ∆t(F n, + B n, ) 9: Solve the Lyapunov equation (10) 10: t ← t + ∆t, ← + 1 11: end while 12: end procedure (ii) In place of lines 6-8 we compute a (T , k R )-HALR representation for R using the algorithm described in Section 3.2. More specifically, we force the quad-tree cluster to be the one of U n, . We remark that an efficient handle function for evaluating the entries of R is obtained by leveraging the HALR structure of F n, , U n, and the low-rank structure of B n, .
(iii) We refine the cluster tree of R using Algorithm 6. During this process, the truncation is performed according to a relative threshold rel = 10 −5 , comparable with the accuracy of the time integration method. This avoids taking into account the increase of the rank caused by the accumulation of the errors.
(iv) Since the Lyapunov equation (10) has HODLR coefficients and HALR right-hand-side we employ Algorithm 3 for its solution at line 9. Consequently, the matrix U n, +1 inherits the same quad-tree cluster of R.
Note that the refinement of the cluster at step (iii) is the only operation that can modify the quad-tree cluster used to represent the solution. The test has been repeated for maxrank = 25, 50, 75, 100. In Table 1 we report the total computational time (labeled as T tot ), and the maximum memory consumption for storing the solution in each run, measured in MB. We also report the total time spent solving Lyapunov equations (phase (iv), labeled as T lyap ) and approximating the right-hand-side and adapting the HALR structure (phases (ii)-(iii), labeled as T adapt ). These two phases accounts for most of the computational cost (between 85% and 90%); the solution of the Lyapunov equation is the most expensive operation. The ratio T lyap /T adapt seems to grow with n, and is around 2 at n = 16384. Figure 5 describes in detail the case maxrank = 50. The solution at time t = 0 has a lowrank structure; the region where the shock happens is confined to the origin in [0, 2] 2 . After some iterations, the shock moves causing an increase in the rank required to approximate the solution, and the method switches to the HALR structure. When the time approaches t = 3.75, the solution becomes numerically low-rank again, because the shock moves close to top-right corner of [0, 2] 2 . This progression is reported in the top part of Figure 5, which shows the time required for each iteration, and the structure adopted by the method. We remark that since the 1D Laplacian can be diagonalized via the sine transform, Algorithm 7 can be efficiently implemented also without exploiting the local low-rank structure. In particular, the iteration cost becomes O(n 2 log(n)). In the left part of Table 2   for this case, we have also reported the average time for solving the Lyapunov equation via fast diagonalization; we see that leveraging the HALR structure makes the algorithm faster since dimension 8192. The bottom plot of Figure 5 shows the absolute approximation error in the discrete l 2 -norm, computed comparing the numerical solution with the true solution u(x, y, t) = 1 + exp x+y−t 2K −1 .
The error curve associated with the implementation of Algorithm 7 in dense arithmetic matches the one reported in Figure 5 confirming that the low-rank truncations have negligible effects on the computed solution. We remark that the displayed errors come from the discretization, and are not introduced by the low-rank approximations in the blocks: we have verified the computations using dense unstructured matrices, obtaining the same results.

Allen-Cahn equation
The Allen-Cahn equation is a reaction-diffusion equation which describes a phase separation process. It takes the following form: where ν = 5 · 10 −5 is the mobility, Ω = [0, 1] 2 and the source term is the cubic function g(u) := u(u − 0.5)(1 − u). This test problem is described in [6]. For a fixed choice of (x, y), the solution u(x, y, t) converges either to 1 or 0 for most points inside the domain as t → ∞.
We discretize the problem with the IMEX implicit Euler method in time and centered finite differences in space, exactly as in the Burger's equation example. The only difference is that in this problem we are considering Neumann boundary conditions instead of Dirichlet.
In this example we choose the initial (discrete) solution randomly, distributed as u(x i , y j , 0) ∼ N ( 1 2 , 1), with every grid point independent of the others. Integrating the system yields a model for spinodal decompositions [6]. We remark that with this choice the matrix U n,0 has no low-rank structure, and will be treated as a dense matrix. On the other hand, during the time evolution, the smoothing effect of the Laplacian makes the solution U n, well-approximable by low-rank matrices, at least locally (see Figure 6). For even larger , the solution converges to either 0 or 1, giving rise to several "flat regions", which can be approximated by low-rank blocks, and the structure U n, can be efficiently memorized using a (T , k)-HALR representation.
We have integrated the solution for t ∈ [0, 40], using ∆t = 0.1, and grid sizes from 1024 up to 16384. The simulation has been run for maxrank = 25, 50, 75, 100. The time and storage used for the integration has been reported in Table 3, analogously to the Burgers' equation case. Note that here the maximum memory consumption is always attained at t = 0, where the solution is stored as a dense matrix. Figure 6 and 7 focus on the case n = 4096 and maxrank = 50. The evolution in time of the solution and of the corresponding HALR structure are reported in Figure 6. The initial structure is a tree with a single node labeled as dense, and the HALR representation can be used already at time 0.3. Then, the solution becomes (numerically) low-rank at time 6.7 (with rank approximately 50); the third image in Figure 6 shows the low-rank structure at time t = 18. Later, as the phase separation happens, the representation becomes again HALR, and stabilizes at the format shown in the fourth figure. The time required for each iteration, and the structure adopted is reported in Figure 7. Analogously to the Burgers' example, the 1D Laplacian with Neumann boundary conditions can diagonalized via the cosine transform providing a dense method with iteration cost O(n 2 log(n)). In the right part of Table 2, we have reported the times required by the dense method for integrating (11) and the average time for solving the Lyapunov equation via fast diagonalization; we see that exploiting the structure makes the HALR approach faster from dimension 16384.

Inhomogeneous Helmholtz equation
The chosen coefficient k(x, y) is negligible outside of a semi annular region centered in (0, −1); the source term f is concentrated around the origin. The numerical solution of (12), reported in Figure 8, is well approximated in the HALR format which refines the block low rank structure in the region where k takes the larger values. The usual finite difference discretization of (12) provides the n 2 × n 2 linear system (A ⊗ I + I ⊗ A + D k ) vec(X) + vec(F ) = 0, (13) where A is the 1D Laplacian matrix with Neumann boundary conditions, D k is the diagonal matrix containing the evaluations of k(x, y) at the grid points and F contains the analogous evaluations of the source term. Note that, omitting the matrix D k , (13) can be solved as a Lyapunov equation. In the spirit of numerical methods for generalized matrix equations [3], we propose to solve (13) with a structured GMRES iteration using the Lyapunov solver as preconditioner; more specifically we store all the (matricized) vectors generated by the GMRES in the HALR format. If necessary (when the rank grows) we adjust the partitioning of the latter via Algorithm 6. The inner product between vectors are computed using the block recursive procedure described in Algorithm 9, which returns the trace of A T B for two HALR matrices A and B. Finally, the solution is constructed as a linear combination of HALR matrices. The whole procedure is reported in Algorithm 8.  return j y j U j 19: end procedure Equation (12) has been solved for different grid sizes with maxrank = 50. The time and memory consumption are reported in Table 4. The storage needed for the solution scales linearly with n. In addition, the table contains the number of iterations needed by the preconditioned GMRES to reach the relative tolerance tol = 10 −4 . We note that the number of iterations grows very slowly as the grid size increases. The time required depends on many factors, such as the distribution of the ranks and the complexity of the structure in the basis generated by the GMRES; we just remark that it grows subquadratically for this example. In future work we plan to explore the use of restarting mechanisms and other truncation strategies in order to optimize the approach. if A and B are leaf nodes or at least one is low-rank then 3: return Trace(A T B) Exploiting the low-rank structure of A or B, if any 4: else 5: if A is dense or B is dense then 6: Set the partitioning of A and B equal to the finest of the two.   (12), for different values of n and maxrank = 50. The reported memory is measured in Megabytes (MB), and refers to the storage of the solution.

Conclusions
In this work, we have proposed a new format for storing matrices arising from 2D discretization of functions which are smooth almost everywhere, with localized singularities. Low-rank decompositions, which are effective for globally smooth functions, become ineffective in this case. The proposed structure automatically adapts to the matrix, and requires no prior information on the location of the singular region. The storage and complexity interpolates between dense and low-rank matrices, based on the structure, with these two cases as extrema. We demonstrated techniques for the efficient adaptation of the structure in case of moving singularities, with the aim of tracking time-evolution of 2D PDEs; the examples show that the proposed techniques can effectively detect changes in the structure, and ensure the desired level of accuracy. We developed efficient Lyapunov and Sylvester solvers for matrix equations with HALR right-hand-side and HODLR coefficients. This case is of particular interest, as it often arises in discretized PDEs. Several numerical experiments demonstrate both the effectiveness and the flexibility of the approach.
The proposed format may be generalized to discretization of 3D functions by swapping quadtrees with octrees, making the necessary adjustments, and choosing a suitable low-rank format for the blocks. Similar ideas to the ones presented in this work may be used to detect the hierarchical structure in an adaptive way, and to adjust the structure in time. However, devising an efficient Sylvester solver remains challenging. Despite the existence of low-rank solvers for linear systems with Kronecker structure in the Tucker format [16] exploiting the hierarchical structure introduces major difficulties, which we plan to investigate in future work.