Parameter Estimation in the Hermitian and Skew-Hermitian Splitting Method Using Gradient Iterations

This paper presents enhancement strategies for the Hermitian and skew-Hermitian splitting method based on gradient iterations. The spectral properties of these iterations are exploited for parameter estimation, often resulting in better convergence. In particular, steepest descent with early stopping can generate a rough estimate of the optimal parameter, which is preferable to an arbitrary choice since the latter often causes stability problems or slow convergence. Additionally, lagged gradient methods are considered as inner solvers for the splitting method. Experiments show that they are competitive with conjugate gradient at low precision.


Introduction
We are interested in solving the linear system

Ax = b, (1)

where A is a non-Hermitian positive definite matrix of size N. It has been observed that splitting methods can be used with success. The traditional alternating direction implicit method [19] has inspired the construction of the alternate two-step splittings A = M_1 − N_1 and A = M_2 − N_2, which leads to an iteration called Hermitian and skew-Hermitian splitting (HSS) [3], in which a shifted Hermitian system and a shifted skew-Hermitian system are solved alternately. HSS has received much attention [4,2,6,23,26,27,20], possibly due to its guaranteed convergence and mathematical beauty. Let H and S denote the Hermitian and skew-Hermitian parts of A, respectively, and let A^H be the conjugate transpose of A. It follows that

H = (A + A^H)/2,    S = (A − A^H)/2.

Let I be the identity matrix. In short, the HSS method is defined as follows:

(γI + H)x_{n+1/2} = (γI − S)x_n + b,
(γI + S)x_{n+1} = (γI − H)x_{n+1/2} + b, (2)

with γ > 0. It can be regarded as a stationary iterative process x_{n+1} = T x_n + p, where x_0 is a given vector. Following the notation of Bai et al. [3], let us set

M_1 = γI + H,    N_1 = γI − S,    M_2 = γI + S,    N_2 = γI − H.

The operators T and p can then be expressed as

T = M_2^{-1} N_2 M_1^{-1} N_1,    p = M_2^{-1} (N_2 M_1^{-1} + I) b.

Let σ(·) be the spectrum of a matrix and let ρ(·) be the spectral radius. The convergence result for (2) in the non-Hermitian positive definite case was established by Bai et al. [3]:

ρ(T) ≤ ‖N_2 M_1^{-1}‖ = max_{λ∈σ(H)} |γ − λ| / (γ + λ) < 1, (3)

where ‖·‖ denotes the 2-norm. This shows that the spectral radius of the iteration matrix T is less than 1. As a result, HSS has guaranteed convergence, and its speed depends only on the Hermitian part H. Let λ_i(·) be the ith eigenvalue of a matrix in ascending order. The key observation here is that choosing

γ* = sqrt(λ_1(H) λ_N(H)) (4)

leads to the well-known upper bound

ρ(T) ≤ (sqrt(κ(H)) − 1) / (sqrt(κ(H)) + 1), (5)

where κ(·) denotes the condition number. It is noteworthy that inequality (5) is similar to the convergence result for conjugate gradient (CG) [17,25] in terms of the A-norm error. As mentioned by Bai et al.
[3], γ* minimizes the upper bound of ρ(T) but not ρ(T) itself. In some cases the right-hand side of (3) may not be an accurate approximation to the spectral radius. Since very little theory is available on direct minimization, we instead approximate the optimal parameter γ* indirectly.
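As a concrete illustration, the two half-steps of the HSS iteration can be sketched as follows. This is a minimal dense-matrix sketch with direct inner solves (the function name and tolerances are our own choices), not the inexact variant discussed later in the paper.

```python
import numpy as np

def hss(A, b, gamma, tol=1e-6, max_iter=500):
    """HSS iteration with direct inner solves for A x = b."""
    N = A.shape[0]
    H = (A + A.conj().T) / 2          # Hermitian part
    S = (A - A.conj().T) / 2          # skew-Hermitian part
    I = np.eye(N)
    x = np.zeros(N, dtype=complex)
    for n in range(max_iter):
        # first half-step: (gamma*I + H) x_{n+1/2} = (gamma*I - S) x_n + b
        x_half = np.linalg.solve(gamma * I + H, (gamma * I - S) @ x + b)
        # second half-step: (gamma*I + S) x_{n+1} = (gamma*I - H) x_{n+1/2} + b
        x = np.linalg.solve(gamma * I + S, (gamma * I - H) @ x_half + b)
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            break
    return x, n + 1
```

With gamma chosen as the geometric mean of the extreme eigenvalues of the Hermitian part, the iteration count follows the bound discussed above.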
In this paper we exploit spectral properties of gradient iterations in order to make the estimation feasible. In Section 2, we focus on the asymptotic analysis of the steepest descent method. In Section 3, we discuss some strategies for estimating the parameter in HSS based on gradient iterations, and give a comparison of lagged gradient methods and CG for solving Hermitian positive definite systems at low precision. Numerical results are shown in Section 4 and some concluding remarks are drawn in Section 5.

Asymptotic analysis of steepest descent
In this section we consider the Hermitian positive definite (HPD) linear system

Hx = b (6)

of size N. The solution x* is the unique global minimizer of the convex quadratic function

f(x) = (1/2) x^H H x − Re(b^H x). (7)

For n = 0, 1, . . ., the gradient method is of the form

x_{n+1} = x_n + α_n g_n, (8)

where g_n = b − Hx_n. This gives the updating formula

g_{n+1} = g_n − α_n H g_n. (9)

The steepest descent (SD) method proposed by Cauchy [8] defines the sequence of steplengths

α_n^{SD} = (g_n^H g_n) / (g_n^H H g_n), (10)

which is the reciprocal of a Rayleigh quotient. It minimizes the quadratic function f, or equivalently the H-norm error of system (6), at each step, where e_n = x* − x_n. This classical method is known to behave badly in practice: the search directions tend to alternate asymptotically between two orthogonal directions, resulting in slow convergence [1].
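A minimal sketch of the method (the function name and stopping rule are our own choices):

```python
import numpy as np

def steepest_descent(H, b, tol=1e-6, max_iter=100000):
    """Steepest descent for the HPD system H x = b.
    Cauchy steplength: reciprocal of the Rayleigh quotient of the gradient."""
    x = np.zeros_like(b)
    g = b - H @ x                        # gradient (residual)
    nb = np.linalg.norm(b)
    for n in range(max_iter):
        if np.linalg.norm(g) <= tol * nb:
            break
        Hg = H @ g
        alpha = (g.conj() @ g).real / (g.conj() @ Hg).real
        x = x + alpha * g
        g = g - alpha * Hg               # residual updating formula
    return x, n
```

On a moderately conditioned system the slow, zigzagging convergence described above is easy to observe by counting iterations.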
The motivation for this paper arose during the development of efficient gradient methods. We notice that SD generally converges much more slowly than CG for HPD systems. However, the spectral properties of the former can be beneficial for parameter estimation. Akaike [1] provided a probability distribution model for the asymptotic analysis of SD. It appears that standard techniques used in linear algebra are not very helpful in this case. The so-called two-step invariance property led to the work of Nocedal et al. [18], in which further asymptotic results are presented. Let v_i(·) be the eigenvector corresponding to the eigenvalue λ_i(·). Relevant properties from Nocedal et al. [18], which will be exploited in the following text, are briefly described in Lemma 1. Note that a symmetric positive definite real matrix was used by Nocedal et al. [18]; we therefore extend this result and present a new lemma and its proof for Hermitian positive definite systems.
Lemma 1. Assume that v_1(H)^H g_0 ≠ 0 and v_N(H)^H g_0 ≠ 0. Consider the gradient method (8) with steplength (10) being used to solve (6), where H is Hermitian positive definite. Then the limits (11) to (14) hold for some constant c.
Proof. Let Re(·) and Im(·) denote the real and imaginary parts, respectively. The coefficients of system (6) can be written as H = Re(H) + ι Im(H) and b = Re(b) + ι Im(b), where ι denotes the imaginary unit. It is possible to rewrite system (6) in the real equivalent form (15). By Lemma 3.3 and Theorem 5.1 in Nocedal et al., 2002 [18], it is known that results (11) to (14) hold in the real case. To prove the desired result in the Hermitian case, it suffices to show that SD applied to (15) is equivalent to SD applied to (6), namely, that the two iterations yield the same sequences of gradient vectors and steplengths. Assume that the two blocks of x̃_n equal the real and imaginary parts of x_n, respectively. Then, from (15), one obtains that the two blocks of g̃_n equal the real and imaginary parts of g_n, so that ‖g̃_n‖ = ‖g_n‖. Along with (15), this implies that g_n^H H g_n = g̃_n^T H̃ g̃_n, according to which one finds that α̃_n = α_n. Hence, the SD iterations for the Hermitian system (6) and for the 2-by-2 block real form yield exactly the same sequence of solutions. Since properties (11) to (14) in the real case have been proved by Nocedal et al. [18], we arrive at the desired conclusion.
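The equivalence used in the proof is easy to verify numerically: SD applied to a Hermitian system and to its 2-by-2 block real form produces identical steplength sequences. A small sketch (all names are our own):

```python
import numpy as np

def sd_steps(H, b, iters):
    """Return the first `iters` Cauchy steplengths of SD applied to H x = b."""
    x = np.zeros_like(b)
    steps = []
    for _ in range(iters):
        g = b - H @ x
        a = (g.conj() @ g).real / (g.conj() @ (H @ g)).real
        steps.append(a)
        x = x + a * g
    return np.array(steps)

rng = np.random.default_rng(1)
N = 8
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
H = M @ M.conj().T + N * np.eye(N)        # Hermitian positive definite
b = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# real equivalent form: [[Re H, -Im H], [Im H, Re H]] [Re x; Im x] = [Re b; Im b]
Hr = np.block([[H.real, -H.imag], [H.imag, H.real]])
br = np.concatenate([b.real, b.imag])

# the two iterations produce identical steplength sequences
assert np.allclose(sd_steps(H, b, 20), sd_steps(Hr, br, 20))
```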
Concerning the assumption used in Lemma 1, if there exist repeated eigenvalues, then we can choose the eigenvectors so that the corresponding gradient components vanish [13]. If v_1(H)^H g_0 = 0 or v_N(H)^H g_0 = 0, then the second condition can be replaced by inner eigenvectors, with no effect on the theoretical results.
It took some time before the spectral properties described by Nocedal et al. [18] were applied to solving linear systems. De Asmundis et al. [12] proposed an auxiliary steplength which can be used in efficient implementations of gradient methods. The major result is a direct consequence of (11) and (12). We state the lemma without proof; see De Asmundis et al., 2013 [12] for further discussion.
Lemma 2. Under the assumptions of Lemma 1, the following result holds.
Another direction of approach was based on a delicate derivation by Yuan [29]. Let us write α_n^{RA} = (α_n^{A})^{-1}. Yuan [29] developed a new auxiliary steplength which leads to some two-dimensional finite termination methods for solving system (6) [29].
Let us now introduce an alternative steplength (20) and the quantities (21) and (22). The spectral properties of (20), (21) and (22) are shown in Lemma 3. Note that equations (23) and (24) appeared in De Asmundis et al., 2014 [11] for the real case. Below, we extend (23) and (24) to the Hermitian case and add one new equation.
Lemma 3. Under the assumptions of Lemma 1, the following limits hold.
Proof. Combining (11) and (12) implies the first relation. Combining (11) to (14), one can deduce the next relation, from which the first equation follows by combining (26) and (27). Along with (19), this implies the remaining relation, which yields the desired limits (24) and (25).
It is noteworthy that steplengths (21) and (22) can be expressed as the roots of the quadratic function Q_n(α) in (28), from which one can observe that Γ_n > 0. As mentioned by Yuan [29], a slightly shortened steplength improves the efficiency of steepest descent. This is one reason why the Yuan steplength can be fruitfully used in alternate gradient methods [10,11].
As an example, assume that x_0 = 0 and that b is constructed as b = Hx*, where x* is a vector of all ones. We plot in Figure 1 the curves of (28) for a few representative iteration numbers. The figure shows that the curves of Q_n(α) corresponding to steepest descent converge to their limit, as proved in Lemmas 2 and 3.
Application to HSS iterations

Preliminary considerations
In this section we first try to compute estimates for the parameter γ in the HSS method. One possible solution is simply to choose γ = 1 without resorting to special techniques, but experience shows that this often leads to very slow convergence or even divergence, depending on the system being solved. Another approach is based on the observation that γ was introduced to ensure the bounded convergence, as seen in (3), and it is possible to express it differently. As an example, consider a positive definite diagonal matrix D, for which the iteration matrix takes the form (30). Notice that (30) is a special case of preconditioned HSS [7] obtained by choosing γ = 1 and P = D.
On the basis of reasoning similar to that for HSS [3], the spectral radius is bounded accordingly. A natural idea is to seek D so that the upper bound is small. At first glance we may choose D as the diagonal of H. Inspired by the diagonally weighted matrix in Freund, 1992 [14], the Euclidean norms of the column vectors could also be exploited. However, the common experience is that these strategies may lead to stagnation of convergence, and they sometimes perform much worse than choosing γ = 1. We will not pursue them further in this paper.

Parameter estimation based on gradient iterations
It is observed that (23) leads to a straightforward estimation of the parameter γ* in (4). From Figure 1 we can deduce that the optimal parameter in HSS can actually be approximated by steepest descent iterations, as shown in the following theorem.
Theorem 4. Assume that the matrix H in system (6) is the Hermitian part of A in system (1). If steepest descent is used for solving (6), then the following limit holds:

lim_{n→∞} sqrt(Γ_n) = γ*. (31)

Proof. Combining (4), (23) and the fact that Γ_n > 0, observed from (28), the desired conclusion follows.
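To illustrate the direct approach, the sketch below runs SD and forms a running estimate of γ* = sqrt(λ_1(H) λ_N(H)). Rather than assembling Γ_n from the quadratic (28), it uses two identities valid in the two-component limit of Lemma 1: 1/α_n + 1/α_{n+1} → λ_1 + λ_N, and (λ_1 + λ_N)·(g^H H g / g^H g) − g^H H^2 g / g^H g → λ_1 λ_N. The function name and iteration budget are our own choices; this is a sketch, not necessarily the paper's exact estimator.

```python
import numpy as np

def estimate_gamma(H, b, iters=200):
    """Rough estimate of gamma* = sqrt(lambda_1 * lambda_N) from SD iterates."""
    x = np.zeros_like(b)
    inv_alpha_prev = None
    est = None
    for _ in range(iters):
        g = b - H @ x
        Hg = H @ g
        gg = (g.conj() @ g).real
        gHg = (g.conj() @ Hg).real
        gHHg = (Hg.conj() @ Hg).real          # g^H H^2 g
        inv_alpha = gHg / gg                  # 1/alpha_n (Rayleigh quotient)
        if inv_alpha_prev is not None:
            s = inv_alpha_prev + inv_alpha    # -> lambda_1 + lambda_N
            prod = s * inv_alpha - gHHg / gg  # -> lambda_1 * lambda_N
            if prod > 0:
                est = np.sqrt(prod)
        inv_alpha_prev = inv_alpha
        x = x + (1.0 / inv_alpha) * g         # Cauchy step
    return est
```

Since only a rough estimate is needed in HSS, a modest iteration budget (early stopping) already suffices.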
Another approach is to compute the approximation by combining Lemmas 2 and 3, in which case γ* can be estimated without explicit access to the operator H. This approach is shown in Theorem 5.
Theorem 5. Assume that the matrix H in system (6) is the Hermitian part of A in system (1). If steepest descent is used for solving M_1 x = b, then the following limit holds.
Proof. Recall that M_1 = γI + H, so that σ(M_1) = γ + σ(H). Combining (19) and (23) implies the desired limit. This completes our proof.
Remark. In practice, obtaining γ* by (32) requires a predetermined parameter γ. One could choose γ = 1 and prescribe an integer k as the maximum number of iterations, in which case the HSS algorithm might be executed at reduced cost.
Another direction of approach is based on the minimal gradient (MG) steplength

α_n^{MG} = (g_n^H H g_n) / (g_n^H H^2 g_n),

the spectral properties of which have been discussed by the present authors, along with several new gradient methods, in a separate paper.

Theorem 6. Assume that the matrix H in system (6) is the Hermitian part of A in system (1). If minimal gradient is used for solving (6), then the following limit holds.
Proof. The proof is similar to that of Theorem 4.
Theorem 7. Assume that the matrix H in system (6) is the Hermitian part of A in system (1). If minimal gradient is used for solving M_1 x = b, then the following limit holds.
Proof. The proof is similar to that of Theorem 5.

Solution based on lagged gradient iterations
Although steepest descent has remarkable spectral properties, its popularity as an iterative method has been overshadowed by CG. Akaike [1] showed that the zigzag behavior almost always leads to slow convergence, except when the initial gradient approaches an eigenvector. This drawback can be cured with a lagged strategy, first proposed by Barzilai and Borwein [5], later called the Barzilai-Borwein (BB) method. The idea is to provide a two-point approximation to the secant condition of quasi-Newton methods, with Δx = x_n − x_{n−1} and Δg = ∇f(x_n) − ∇f(x_{n−1}), yielding

α_n^{BB1} = (Δx^H Δx) / (Δx^H Δg).

The convergence analysis was given in Raydan, 1993 [21] and Dai and Liao, 2002 [9]. A Q-linear result, however, has never been proved, due to the nonmonotone convergence. It seems overall that the effect of this irregular behavior is beneficial.
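A minimal sketch of BB1 for an HPD system follows. With g_n denoting the residual b − Hx_n, the BB1 steplength reduces to the Cauchy steplength delayed by one step, which is the "lagged" interpretation used here; the first steplength is our own choice.

```python
import numpy as np

def barzilai_borwein(H, b, tol=1e-6, max_iter=50000):
    """BB1 gradient method for the HPD system H x = b.
    With dx = alpha*g and dy = H dx, the BB1 steplength dx^H dx / dx^H dy
    equals the Cauchy steplength evaluated at the previous gradient."""
    x = np.zeros_like(b)
    g = b - H @ x
    nb = np.linalg.norm(b)
    # first steplength: plain Cauchy step (a common choice)
    alpha = (g.conj() @ g).real / (g.conj() @ (H @ g)).real
    for n in range(max_iter):
        if np.linalg.norm(g) <= tol * nb:
            break
        Hg = H @ g
        x = x + alpha * g
        g_new = g - alpha * Hg           # residual update
        # next steplength: lagged Cauchy steplength (BB1)
        alpha = (g.conj() @ g).real / (g.conj() @ Hg).real
        g = g_new
    return x, n
```

The residual history is nonmonotone, as discussed above, yet the iteration converges for HPD systems.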
For the HSS method, two inner iterative procedures are needed at each step. Since the solution of the subproblems in (2) is sometimes as difficult as that of the original system (1), inexact solvers with rather low precision are often considered, especially for ill-conditioned problems. In practice, the first equation of (2) is usually solved by CG, and the second equation of (2) can be solved by CGNE [22]. Friedlander et al. [15] observed that BB can often be competitive with CG when low precision is required. It is known that CG is sensitive to rounding errors, while lagged gradient methods can remedy this issue [13,24] at a lower computational cost per iteration. Additionally, although BB suffers from requiring an increasing number of iterations as the condition number grows, its low-precision behavior tends to be less sensitive to ill-conditioning.
A similar method obtained by symmetry [5] uses

α_n^{BB2} = (Δx^H Δg) / (Δg^H Δg),

which likewise imposes a quasi-Newton property. Notice that α_n^{BB2} = α_{n−1}^{MG}. In the last three decades, much effort has been devoted to developing new lagged gradient methods; see De Asmundis et al., 2014 [11] and the references therein.
An example is illustrated in Figure 2. We solve (6) with different residual thresholds ε, where H is chosen as a diagonal matrix of size 10^3 and b is a vector of all ones. The diagonal entries are logarithmically distributed between 10^{-3} and 1 in ascending order, with the first and last entries equal to these limits, so that κ(H) = 10^3. The plot shows fairly efficient behavior of BB.
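The setup of this comparison can be reproduced along the following lines; the diagonal matrix is stored as a vector so that a matrix-vector product is an elementwise multiply. The implementations and thresholds are our own sketches.

```python
import numpy as np

def cg_count(d, b, tol):
    """Iterations for CG to reach ||r|| <= tol*||b|| on diag(d) x = b."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    nb = np.linalg.norm(b)
    for n in range(1, 10**6):
        dp = d * p                        # matvec with diag(d)
        a = rr / (p @ dp)
        x += a * p
        r -= a * dp
        rr_new = r @ r
        if np.sqrt(rr_new) <= tol * nb:
            return n
        p = r + (rr_new / rr) * p
        rr = rr_new

def bb_count(d, b, tol):
    """Iterations for BB1 (lagged Cauchy steplength) to reach the same threshold."""
    x = np.zeros_like(b)
    g = b - d * x
    nb = np.linalg.norm(b)
    alpha = (g @ g) / (g @ (d * g))       # first step: Cauchy
    for n in range(1, 10**6):
        dg = d * g
        x += alpha * g
        g_new = g - alpha * dg
        alpha = (g @ g) / (g @ dg)        # lagged Cauchy steplength
        g = g_new
        if np.linalg.norm(g_new) <= tol * nb:
            return n

# diagonal entries logarithmically distributed in [1e-3, 1], kappa = 1e3
d = np.logspace(-3, 0, 1000)
b = np.ones(1000)
```

At a loose threshold such as ε = 10^{-1} the two iteration counts are typically of the same order, while at tight thresholds CG pulls ahead; we do not claim to reproduce the figure's exact numbers here.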

Numerical experiments
In this section we perform some numerical tests. All iterative algorithms are started from the zero vector. The global stopping criterion in HSS is determined by the threshold ε = ‖b − Ax_n‖/‖b‖ with a fixed convergence tolerance 10^{-6}. The inner stopping thresholds ε_1 and ε_2 for the two half-steps of (2) are defined in the same way. For gradient iterations applied to system (6), the stopping criterion is similarly defined by the threshold ε = ‖b − Hx_n‖/‖b‖ with the same tolerance. All tests are run in double precision.

Asymptotic results of gradient iterations
The goal of the first experiment is to illustrate how the spectral properties described earlier can provide a rough estimate of the parameter γ*. We have implemented steepest descent and minimal gradient iterations for several real matrices of size 1000 generated by the MATLAB routine sprandsym. The right-hand side is chosen to be a vector of ones. In Figure 3, the parameter estimate γ is plotted versus the iteration number, and a red dotted line marks the position of γ*. It is clear that γ tends to γ* asymptotically, as expected. As can be seen, steepest descent with limits (31) and (32) turns out to be a better strategy than minimal gradient in all cases. The indirect approximations based on (32) and (34) yield faster convergence for both steepest descent and minimal gradient. This test confirms Theorems 4 to 7. Recall that choosing γ* as parameter leads to an upper bound on ρ(T), for which it is not necessary to obtain an exact estimate. Experience shows that this choice may sometimes cause overfitting, resulting in slow convergence or even divergence, especially when γ* is small. One simple remedy is to use early stopping in the gradient iterations. In the following, steepest descent is used for parameter estimation in HSS, called preadaptive iterations, and we consider only the direct approach (31).

HSS with different parameters
In this test we generate some matrices obtained from a classical problem in order to understand the convergence behavior of HSS enhanced by steepest descent iterations.
Example 1. Consider system (1), where A arises from the discretization of a partial differential equation on the unit cube Ω = [0, 1]^3, with θ a positive constant. Assume that u satisfies homogeneous Dirichlet boundary conditions. A finite difference discretization on a uniform m × m × m grid with mesh size h = 1/(m + 1) is applied to this model, yielding a linear system with N = m^3.
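The PDE itself is not reproduced above; assuming the standard three-dimensional convection-diffusion model −Δu + θ(u_x + u_y + u_z) = f often used in the HSS literature, the centered-difference matrix can be assembled with Kronecker products as follows (a sketch under that assumption; the scaling convention is our own).

```python
import numpy as np

def convection_diffusion_3d(m, theta):
    """Centered-difference discretization of -Laplace(u) + theta*(u_x+u_y+u_z)
    on a uniform m^3 grid with homogeneous Dirichlet conditions; N = m^3."""
    h = 1.0 / (m + 1)
    I = np.eye(m)
    # 1-D second-difference (diffusion) and centered first-difference (convection)
    D2 = (np.diag(2 * np.ones(m)) - np.diag(np.ones(m - 1), 1)
          - np.diag(np.ones(m - 1), -1)) / h**2
    D1 = (np.diag(np.ones(m - 1), 1) - np.diag(np.ones(m - 1), -1)) / (2 * h)
    T = D2 + theta * D1                   # 1-D operator
    # A = T (x) I (x) I + I (x) T (x) I + I (x) I (x) T
    A = (np.kron(np.kron(T, I), I)
         + np.kron(I, np.kron(T, I))
         + np.kron(I, np.kron(I, T)))
    return A
```

With centered differences the convection term is skew-symmetric, so the Hermitian part of A is exactly the (positive definite) diffusion operator, which is the structure the HSS convergence theory relies on.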
In the following we use the centered difference scheme for discretization. The right-hand side b is generated with random complex values in [−10, 10] + ι[−10, 10]. The thresholds for inner iterations are chosen as ε_1 = 10^{-4} and ε_2 = 10^{-4}. CG is used for solving the Hermitian inner system, while CGNE is used for the skew-Hermitian part. Figure 4 shows the convergence behavior of HSS for different values of the parameter. Here we set γ ∈ [0.5, 3.5] and m ∈ [9, 21]. The optimal parameters γ* for m = 9, 12, 15, 18, 21 are located by red lines. Notice that a path that zigzags through the bottom of the valley corresponds to the best parameters. As already noted, the parameter estimates need not be accurate, and thus the red lines are good enough in practice. Approximating γ* by inexact steepest descent iterations then yields the preadaptive HSS method (PAHSS). Let η denote the number of preadaptive iterations. The convergence behavior and total computing times are illustrated in Figure 5. The left four plots show the residual curves for several typical choices of η when m = 16, 32, 64, 128, namely N = 4096, 32768, 262144, 2097152. Two observations can be made for all dimensions: first, larger η yields faster convergence of HSS; second, η = 100 does not lead to significant gains in efficiency compared with η = 50. The right four plots show total wall-clock times of PAHSS iterations, measured in seconds, for η ranging from 10 to 160. It can be seen that substantial gains are made in the beginning, followed by a long period of stagnation. Experience shows that a small number of steepest descent iterations is sufficient, and it is therefore appropriate to use early stopping.

CG and BB as low-precision inner solvers
In order to verify that BB can be an efficient alternative to CG as a low-precision inner solver for HSS, some tests proceed along the same lines as above but consider both CG and BB as inner solvers for the Hermitian part. Numbers of total iterations and wall-clock times are reported in Table 1. Since the optimal parameters γ* for (35) with m = 40, 60, 80 are less than 1, which may lead to stability problems, we choose γ = 1 for all tests. We conduct 10 repeated experiments and report only the average computation times. We also add the results of ORTHODIR [28] for solving (1) for the purpose of comparison. The comparison of costs is shown in Table 2. As expected, BB is less efficient than CG in terms of computation times, but BB shows a clear advantage in storage requirements and resistance to perturbation [13,24]. In addition, BB and CG used within HSS make the HSS method better than ORTHODIR. The major drawback of ORTHODIR is that the computational work and storage requirements per iteration rise linearly with the iteration number. This drawback can be reduced with a restarted version of ORTHODIR, but restarting increases the total number of iterations and in the end does not reduce the computation time. The choice between HSS and Krylov subspace methods depends on how expensive the sparse matrix-vector multiplications are in comparison with the vector updates, and on how much storage is available.

Conclusion
Gradient iterations provide a versatile tool in linear algebra. Apart from parameter estimation related to spectral properties, steepest descent variants have also recently been applied with success as iterative methods [12,11,16]. This paper extends the spectral properties of gradient iterations and gives an application in the Hermitian and skew-Hermitian splitting method. Note that this approach can be extended without difficulty to other splitting methods (see Section 1 and the references therein) in which a parameter γ needs to be computed. Our experiments confirm that the gradient-enhanced HSS method can be an attractive option in practice.

Figure 2: Comparison of CG and BB for solving system (6), where H is a diagonal matrix of size 10^3 with κ(H) = 10^3 and b is a vector of all ones.

Table 2: Summary of operations per iteration and storage requirements. In the HSS row, the term "solver" represents an inner solver such as CG, BB or CGNE, while the storage comprises the Hermitian and skew-Hermitian parts of A and the residual vector.