Objective acceleration for unconstrained optimization

Acceleration schemes can dramatically improve existing optimization procedures. In most of the work on these schemes, such as nonlinear generalized minimal residual (N-GMRES), acceleration is based on minimizing the $\ell_2$ norm of some target on subspaces of $\mathbb{R}^n$. There are many numerical examples that show how accelerating general-purpose and domain-specific optimizers with N-GMRES results in large improvements. We propose a natural modification to N-GMRES, which significantly improves the performance in a testing environment originally used to advocate N-GMRES. Our proposed approach, which we refer to as O-ACCEL (objective acceleration), is novel in that it minimizes an approximation to the objective function on subspaces of $\mathbb{R}^n$. We prove that O-ACCEL reduces to the full orthogonalization method for linear systems when the objective is quadratic, which differentiates our proposed approach from existing acceleration methods. Comparisons with the limited-memory Broyden-Fletcher-Goldfarb-Shanno and nonlinear conjugate gradient methods indicate the competitiveness of O-ACCEL. As it can be combined with domain-specific optimizers, it may also be beneficial in areas where limited-memory Broyden-Fletcher-Goldfarb-Shanno and nonlinear conjugate gradient methods are not suitable.


Introduction
Gradient-based optimization algorithms normally iterate based on tractable approximations to the objective function at a particular point. Acceleration algorithms aim to combine the strengths of existing solvers with information from previous iterates. We propose an acceleration scheme that can be used on top of existing optimization algorithms, which generates a subspace from previous iterates, over which it aims to optimize the objective function. We call the algorithm O-ACCEL, short for Objective Acceleration.
Our idea closely resembles the work of De Sterck [7], which introduced the preconditioned nonlinear GMRES (N-GMRES) algorithm for optimization. By using a more appropriate target to accelerate the optimization than N-GMRES does, we show, with numerical examples, how O-ACCEL more efficiently accelerates the steepest descent algorithm. When optimizing an objective $f$, N-GMRES is used as an accelerator from the point of view of solving the nonlinear system $\nabla f(x) = 0$, which arises from the first-order condition of optimality. It uses the idea of Krylov subspace acceleration from Washio and Oosterlee [24] and Oosterlee and Washio [19] for solving nonlinear equations that arise from discretizations of partial differential equations. The name N-GMRES arises from the fact that steepest descent preconditioned N-GMRES is equivalent to the standard GMRES procedure for linear systems of equations [7,24]. A similar idea, also arising from nonlinear equations, was described in Anderson [2] in 1965. See Walker and Ni [23] for a note on the similarities of the methods, and Fang and Saad [11], which puts Anderson acceleration in the context of a Broyden-type approximation of the inverse Jacobian. Brune et al. [4] show, with many numerical examples, that N-GMRES and Anderson acceleration can greatly improve convergence on nonlinear systems, when combined with an appropriate preconditioner (nonlinear solver). In the setting of optimization, De Sterck [6] and De Sterck and Howse [8] show large improvements in convergence by applying N-GMRES acceleration to the computation of tensor decompositions.
More recently, Scieur et al. [22] have developed another acceleration method for convex optimization denoted regularized nonlinear acceleration (RNA), which Cartis and Geleta [5] have extended to the nonconvex case. Acceleration techniques differ from one another in several ways, but, for convex quadratic objectives, the Anderson, N-GMRES and Scieur et al. algorithms all coincide [5]. These methods all minimize the $\ell_2$ norm of some objective in $\mathbb{R}^n$, the space of the decision variable. The proposed algorithm in this manuscript instead aims to minimize the objective function over a subspace of $\mathbb{R}^n$. We believe this is a natural target to accelerate against, especially when the optimization procedure is seeking descent directions. For convex, quadratic functions we prove that O-ACCEL with a steepest descent preconditioner reduces to the full orthogonalization method (FOM [21]), a Krylov subspace procedure for solving linear systems. This differentiates our method from the other acceleration techniques, which are related to the GMRES algorithm for linear systems.
Due to the close similarity between the proposed algorithm and N-GMRES, this manuscript focuses on numerical comparisons to N-GMRES under the same testing conditions as used by De Sterck [7]. On the test set from De Sterck [7], our acceleration scheme compares favourably to N-GMRES, as well as implementations of the nonlinear conjugate gradient (N-CG) and limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) methods [18]. Further tests on the CUTEst test problem set [13] show that L-BFGS is more applicable to these problems; however, O-ACCEL again performs better than N-GMRES.
The manuscript is organized as follows. Motivation for the algorithm, and discussion around it, is covered in Section 2. Numerical tests that show the efficiency of our proposed acceleration procedure applied to steepest descent are presented in Section 3. We conclude and discuss further potential work in Section 4.

Optimization acceleration with O-ACCEL
To fix notation, consider a twice continuously differentiable function $f \in C^2(\mathbb{R}^n)$ that is bounded below and has at least one minimizer. We aim to find a local minimum of the optimization problem

$$\min_{x \in \mathbb{R}^n} f(x). \tag{1}$$

Let $M(f, x)$ denote an optimization procedure for $f$ with initial guess $x \in \mathbb{R}^n$. This optimization procedure can, for example, be the application of one steepest descent, or Newton, step. We will refer to $M$ as the preconditioner, because it is applied in the same fashion as a right preconditioner for iterative procedures of linear systems [4]. Given a sequence of previously explored iterates $x^{(1)}, \dots, x^{(k)}$, and a proposed new guess $x^P = M(f, x^{(k)})$, we will try to accelerate the next iterate $x^{(k+1)}$ towards a minimizer. Define

$$K_k^O(x^P) = \operatorname{span}\{x^{(1)} - x^P, \dots, x^{(k)} - x^P\}. \tag{2}$$

The acceleration step aims to minimize $f$ over the subset $x^P + K_k^O(x^P)$, which can be interpreted as a generalisation from a line search to a hyperplane search. Let $\alpha \in \mathbb{R}^k$, and set

$$x^A(\alpha) = x^P + \sum_{l=1}^{k} \alpha_l \left(x^{(l)} - x^P\right). \tag{3}$$

Note that, when $k = 1$, minimizing $f$ over $x^P + K_k^O(x^P)$ is equivalent to the standard line search problem of minimizing $\lambda \mapsto f(x^{(1)} + \lambda(x^P - x^{(1)}))$. The first-order condition for $\alpha$ to be a minimizer of the function $\alpha \mapsto f(x^A(\alpha))$ is

$$\nabla_\alpha f(x^A(\alpha)) = 0. \tag{4}$$

Define the gradient $g(x) = \nabla_x f(x)$. The first-order condition then reads

$$g(x^A(\alpha))^\top \left(x^{(l)} - x^P\right) = 0, \qquad l = 1, \dots, k, \tag{5}$$

where superscript $\top$ denotes the transpose. The O-ACCEL algorithm aims to linearize the first-order condition $\nabla_\alpha f(x^A(\alpha)) = 0$ in the following way. Let $H(x)$ denote the Hessian of $f$ at $x$. By linearizing $\alpha \mapsto g(x^A(\alpha))$, we get

$$g(x^A(\alpha)) \approx g(x^P) + H(x^P)(X - X^P)\alpha, \tag{6}$$

where we use the matrices $X = [x^{(1)}, \dots, x^{(k)}] \in \mathbb{R}^{n \times k}$ and $X^P = [x^P, \dots, x^P] \in \mathbb{R}^{n \times k}$. Given this linearization we aim to find an $\alpha \in \mathbb{R}^k$ that approximately satisfies the first-order condition. We can do this by combining (5) and (6), and then look for an $\alpha \in \mathbb{R}^k$ that solves

$$\alpha^\top (X - X^P)^\top H(x^P)\left(x^{(l)} - x^P\right) = -g(x^P)^\top \left(x^{(l)} - x^P\right), \qquad l = 1, \dots, k. \tag{7}$$
In matrix form, the system of equations becomes

$$(X - X^P)^\top H(x^P)(X - X^P)\,\alpha = -(X - X^P)^\top g(x^P). \tag{8}$$

There are cases where we may not wish to compute the Hessian of $f$ explicitly, for example, if $M$ does not use it. We can instead use an approximation $\tilde{H}(x^P)$ of the Hessian $H(x^P)$, or its action on vectors in $K_k^O(x^P)$. The iterative Hessian approximation algorithms that are used in quasi-Newton methods can provide one avenue of research. In the numerical experiments provided in this manuscript, we instead focus on approximating the action of the Hessian on $K_k^O(x^P)$ to first order by

$$H(x^P)\left(x^{(l)} - x^P\right) \approx g(x^{(l)}) - g(x^P), \qquad l = 1, \dots, k. \tag{9}$$

Let $g(X) = [g(x^{(1)}), \dots, g(x^{(k)})]$, and define $g(X^P)$ similarly. This gives a second approximation to the first-order conditions,

$$(X - X^P)^\top \left(g(X) - g(X^P)\right)\alpha = -(X - X^P)^\top g(x^P). \tag{10}$$

In this manuscript, we investigate the performance of the objective-based acceleration using (10).
To contrast our work with the N-GMRES optimization algorithm in De Sterck [7], note that N-GMRES instead aims at minimizing the $\ell_2$ norm of the linearized gradient,

$$\min_{\alpha \in \mathbb{R}^k} \left\| g(x^P) + \left(g(X) - g(X^P)\right)\alpha \right\|_2. \tag{11}$$

Its solution can be found from the normal equation

$$\left(g(X) - g(X^P)\right)^\top \left(g(X) - g(X^P)\right)\alpha = -\left(g(X) - g(X^P)\right)^\top g(x^P). \tag{12}$$

We argue that the O-ACCEL algorithm is more appropriate for an optimization problem than N-GMRES. When we are restricted to subsets of the decision space, reduction in the value of the objective is a better indicator of moving towards a minimizer than reduction in the gradient norm. In effect, N-GMRES ignores the extra information provided by $f$. This is better illustrated in the case when $k = 1$, where it is standard to perform a line search on the objective rather than the gradient norm.
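The difference between the two acceleration targets can be made concrete in a few lines. The following is a minimal sketch in plain Python, not the MATLAB implementation used for the experiments: the helper names (`dot`, `sub`, `solve`, `accelerate`), the three-dimensional quadratic, and the chosen points are all assumptions made for illustration. With `objective_based=True` the function assembles the O-ACCEL system (10); otherwise it assembles the N-GMRES normal equation (12).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def solve(A, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    T = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(T[r][c]))
        T[c], T[p] = T[p], T[c]
        for r in range(c + 1, n):
            m = T[r][c] / T[c][c]
            T[r] = [a - m * t for a, t in zip(T[r], T[c])]
    sol = [0.0] * n
    for i in reversed(range(n)):
        sol[i] = (T[i][n] - sum(T[i][j] * sol[j] for j in range(i + 1, n))) / T[i][i]
    return sol

def accelerate(xs, gs, x_p, g_p, objective_based=True):
    """Return x_A = x_P + sum_l alpha_l (x_l - x_P).

    objective_based=True  -> alpha from system (10)  (O-ACCEL)
    objective_based=False -> alpha from system (12)  (N-GMRES)
    """
    dx = [sub(x, x_p) for x in xs]      # columns of X - X^P
    dg = [sub(g, g_p) for g in gs]      # columns of g(X) - g(X^P)
    left = dx if objective_based else dg
    k = len(xs)
    A = [[dot(left[i], dg[j]) for j in range(k)] for i in range(k)]
    b = [-dot(left[i], g_p) for i in range(k)]
    alpha = solve(A, b)
    return [p + sum(a * d[i] for a, d in zip(alpha, dx)) for i, p in enumerate(x_p)]

# Illustrative quadratic: f(x) = 1/2 (x - 1)^T D (x - 1), D = diag(1, 2, 3).
D = [1.0, 2.0, 3.0]
f = lambda x: 0.5 * sum(d * (xi - 1.0) ** 2 for d, xi in zip(D, x))
grad = lambda x: [d * (xi - 1.0) for d, xi in zip(D, x)]

xs = [[0.0, 0.0, 0.0], [2.0, 0.5, 0.0]]   # previous iterates (hypothetical)
x_p = [0.5, 1.5, 0.25]                    # preconditioner proposal (hypothetical)
gs, g_p = [grad(x) for x in xs], grad(x_p)

x_o = accelerate(xs, gs, x_p, g_p, objective_based=True)
x_n = accelerate(xs, gs, x_p, g_p, objective_based=False)
```

For a quadratic objective the gradient is linear, so the approximation (9) is exact: `x_o` minimizes $f$ over $x^P + K_k^O(x^P)$, whereas `x_n` minimizes the gradient norm over the same set.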

Algorithm
The proposed acceleration procedure, which we call O-ACCEL, is described in Algorithm 1. The number of stored previous iterates $w$ denotes the history size. Setting an upper bound on the history size can be necessary due to storage constraints, or to prevent the local approximations of (6) and (9) from using iterates far away from $x^P$. If the direction from $x^P$ to the accelerated step $x^A$ is not a descent direction, it indicates that the linearized approximation around $x^P$ is bad for the currently stored iterates. For simplicity, we therefore choose to reset the history size to $w = 1$ when we encounter such cases.
To prevent re-computation of $g(x^{(j)})$ for $j = 1, \dots, w$ in each application of the procedure, we store these vectors for later use. The computational cost of the algorithm is approximately the same as $w$-history L-BFGS with two-loop recursion [7]. In terms of storage, O-ACCEL and L-BFGS both store $2w$ vectors of size $n$. In addition, our implementation of O-ACCEL, as described in Algorithm 2 below, reduces the number of flops required by storing a $w \times w$ matrix of previously calculated values. For the numerical experiments we have used $w = 20$, in accordance with De Sterck [7]. It was, however, shown by De Sterck [7] that N-GMRES can already provide good results with $w = 3$. Tests using O-ACCEL with $w = 5$, although not included here, provide results almost as good as those reported in Section 3. Note that, if the Hessian is sparse, it may be more storage efficient to find $\alpha$ from the linear system in (8) than using a large $w$.

O-ACCEL as a full orthogonalization method (FOM)
The optimality condition (5) for the function $\alpha \mapsto f(x^A(\alpha))$ is equivalent to $g(x^A(\alpha)) \perp K_k^O(x^P)$. Hence, we look for $x^A \in x^P + K_k^O(x^P)$ such that $g(x^A) \perp K_k^O(x^P)$. This condition reduces to FOM [21] when $g(x)$ is linear and $M(f, x)$ is a steepest descent algorithm. When the Hessian is symmetric positive-definite, FOM is mathematically equivalent to the conjugate gradient method. We can therefore think of O-ACCEL as an N-CG method that approximates the orthogonality condition with a larger history size.
For convex, quadratic objectives $f(x) = \tfrac{1}{2} x^\top A x - x^\top b$, the gradient $g(x) = Ax - b$ is linear and the optimum must satisfy the equation $Ax = b$. The residuals $r^{(k)} = b - Ax^{(k)}$ are equal to the negative gradient $-g(x^{(k)})$. Therefore, O-ACCEL with a steepest descent preconditioner yields $x^P = x^{(k)} + \lambda^{(k)} r^{(k)}$ for some step length $\lambda^{(k)}$.

Theorem 1. Let the O-ACCEL algorithm take the step $x^{(w+1)} = x^A$ in Line 9 of Algorithm 1. Then the iterates of the O-ACCEL algorithm form the FOM sequence of the linear system $Ax = b$.
We shall shortly prove the theorem after deriving new expressions for $\mathcal{K}_k(A, r^{(1)})$. First, note that for any $x$, a reordering of terms can show that x This motivates the next lemma, which connects the space on the right hand side of (14) to $\mathcal{K}_{k+1}(A, r^{(1)})$.
Proof of Theorem 1. We prove the result by induction on the sequence $x^{(1)}, \dots, x^{(k)}$ arising from the O-ACCEL algorithm. Let $k = 2$, then $x^{(2)} \in x^P + \operatorname{span}\{x^{(1)} - x^P\} = x^{(1)} + \operatorname{span}\{r^{(1)}\}$, and so $\operatorname{span}\{x^{(2)} - x^{(1)}\} = \mathcal{K}_1(A, r^{(1)})$. From (5) the residual $b - Ax^{(2)} \perp x^P - x^{(1)} = \lambda^{(k)} r^{(1)}$, and thus $x^{(2)}$ is the second FOM iterate. This establishes the base case for the induction proof. The inductive step follows from Lemma 1 together with (14) and (15), and hence proves that the O-ACCEL iterates are the FOM iterates for $Ax = b$.
Remark. The connection to the FOM differentiates O-ACCEL from N-GMRES, Anderson acceleration, and RNA, which reduce to GMRES for quadratic objectives.
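The statement of Theorem 1 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions rather than part of the original experiments: it uses exact rational arithmetic from Python's `fractions` module, a hypothetical $3 \times 3$ symmetric positive definite matrix, a fixed steepest descent step length `lam`, and the system (10) to compute $\alpha$. Because FOM coincides with the conjugate gradient method for symmetric positive definite systems, the O-ACCEL iterates are compared against a textbook CG loop.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def solve(M, rhs):
    """Gaussian elimination with partial pivoting (exact for Fraction entries)."""
    n = len(rhs)
    T = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(T[r][c]))
        T[c], T[p] = T[p], T[c]
        for r in range(c + 1, n):
            m = T[r][c] / T[c][c]
            T[r] = [a - m * t for a, t in zip(T[r], T[c])]
    sol = [Fraction(0)] * n
    for i in reversed(range(n)):
        sol[i] = (T[i][n] - sum(T[i][j] * sol[j] for j in range(i + 1, n))) / T[i][i]
    return sol

def oaccel_quadratic(A, b, x1, steps, lam=Fraction(1, 10)):
    """O-ACCEL on f(x) = 1/2 x^T A x - x^T b, taking x^(k+1) = x^A in every step."""
    grad = lambda x: [a - bi for a, bi in zip(matvec(A, x), b)]
    xs = [x1]
    for _ in range(steps):
        x_k = xs[-1]
        x_p = [xi - lam * gi for xi, gi in zip(x_k, grad(x_k))]  # steepest descent
        g_p = grad(x_p)
        dx = [[a - p for a, p in zip(x, x_p)] for x in xs]
        dg = [[a - p for a, p in zip(grad(x), g_p)] for x in xs]
        k = len(xs)
        M = [[dot(dx[i], dg[j]) for j in range(k)] for i in range(k)]  # system (10)
        rhs = [-dot(dx[i], g_p) for i in range(k)]
        alpha = solve(M, rhs)
        xs.append([p + sum(a * d[i] for a, d in zip(alpha, dx))
                   for i, p in enumerate(x_p)])
    return xs

def cg(A, b, x1, steps):
    """Textbook conjugate gradient iteration, returning all iterates."""
    xs, x = [x1], x1
    r = [bi - a for bi, a in zip(b, matvec(A, x))]
    p = r
    for _ in range(steps):
        Ap = matvec(A, p)
        step = dot(r, r) / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r_new = [ri - step * api for ri, api in zip(r, Ap)]
        beta = dot(r_new, r_new) / dot(r, r)
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]
        r = r_new
        xs.append(x)
    return xs

A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]]   # hypothetical SPD matrix
b = [1, 2, 3]
x1 = [Fraction(0)] * 3

oaccel_iterates = oaccel_quadratic(A, b, x1, steps=3)
cg_iterates = cg(A, b, x1, steps=3)
```

In this sketch the history is never truncated, so after $n = 3$ steps both sequences arrive at the exact solution of $Ax = b$.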

Numerical experiments
In order to investigate the performance of the proposed algorithm, we implement it with two preconditioners $M$. The first is steepest descent with line search, and the second is steepest descent with a fixed step length. They are compared to the N-GMRES algorithm with the same preconditioners, and implementations of the nonlinear conjugate gradient (N-CG) variant with the Polak-Ribière update formula, and the two-loop recursion version of the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method [18]. The test problems considered in Sections 3.1 to 3.4 are the same eight problems that were used in De Sterck [7] to advocate N-GMRES. We also include experiments from 33 CUTEst problems to further test the applicability of the algorithms. The results are presented in the form of performance profiles, as introduced by Dolan and Moré [9], based on the number of function/gradient evaluations.
The main focus of this manuscript is to compare the performance of the proposed algorithm to the N-GMRES algorithm. To this end, we have used the MATLAB implementation of this algorithm, available online. The O-ACCEL implementation, and the rest of the code required to generate the test result data, is also made available by the author. Our implementation of O-ACCEL follows the exact same steps, only replacing the calculations needed to solve the N-GMRES system in (12) with those of the linear system in (10). The implementation is detailed in Algorithm 2. It closely follows the instructions from Washio and Oosterlee [24], including a regularization for the linear system.
The regularization is used to prevent the direct linear solver we use to find $\alpha$ from crashing when $A$ is ill-conditioned or singular, which can happen if the vectors $g(x^{(k)}) - g(x^P)$ are linearly dependent. Let $A \in \mathbb{R}^{w \times w}$ denote the system matrix $(X - X^P)^\top \left(g(X) - g(X^P)\right)$. Then, for some tolerance $\epsilon_0 > 0$, the regularization parameter $\epsilon$ is defined using a max term, which is used to scale the regularization in accordance with the optimisation problem. With $I \in \mathbb{R}^{w \times w}$ the identity matrix, we solve the linear problem $(A + \epsilon I)\alpha = b$ rather than the linear problem $A\alpha = b$ as defined in Algorithm 1. This is a Tikhonov type regularization [17], often employed to regularize ill-conditioned problems. Washio and Oosterlee [24] show that the error in the resulting $\alpha$ is negligible for the N-GMRES problem (12) provided $\epsilon$ is much smaller than the smallest non-zero eigenvalue of the system matrix. The error for the O-ACCEL system can be analysed within a general Tikhonov regularization framework, see, for example, Neumaier [17]. We do not investigate the impact of the regularization parameter further in this manuscript, and use the value $\epsilon_0 = 10^{-14}$ that was used in the N-GMRES code by De Sterck [7].
Algorithm 2 Implementation of O-ACCEL algorithm. Indentation and curly brackets denote scope.
Input: $f$, $g$, $M$, $x$, $w_{\max}$, $\epsilon_0$, tolerance description
Output: $x$ satisfying tolerance description
1: while Not reached tolerance do
2: while reset is false do
6: x ← M(f, x) ; r ← g(x)
7: if reached tolerance then
9: η ← xᵀr
10:
: if dᵀr ≥ 0 then
19: x ← linesearch(x + λd)
20: w ← min(w + 1, w_max)
21: j ← (k mod w_max) + 1
22: x_j ← x
23: r_j ← g(x)
24: for i = 1, . . ., w { q_ij ← x_iᵀr_j ; q_ji ← x_jᵀr_i }

For the remainder of the section, we present the test problems, provide details for the parameter choices, and discuss the test results.

Test problems from De Sterck
We describe the seven test problems from De Sterck [7]. All the functions are defined as $f : \mathbb{R}^n \to \mathbb{R}$, and the matrices mentioned are all in $\mathbb{R}^{n \times n}$.
Problem A. Quadratic objective function with symmetric, positive definite diagonal matrix $D$,

$$f(x) = \tfrac{1}{2}(x - x^*)^\top D (x - x^*),$$

where $D = \operatorname{diag}(1, 2, \dots, n)$, and $x^* = [1, \dots, 1]^\top$. The minimizer $x^*$ of Problem A is unique, with $f(x^*) = 0$. The gradient is given by $g(x) = D(x - x^*)$.
Problem B. Problem A with paraboloid coordinate transformation, $x^* = [1, \dots, 1]^\top$, and The minimizer is again $x^*$, with $f(x^*) = 0$. The gradient is g where $Q$ is a random orthogonal matrix. As in Problems A and B, the minimizer is (21) from Moré et al. [16], 2 , where $n$ is even, odd), and The unique minimum $f(x^*) = 0$ is attained at 2 , where $n$ is a multiple of 4, The unique minimum $f(x^*) = 0$ is attained at $x^* = 0$.
Problem F. The Trigonometric function, Problem (26) from Moré et al. [16], 2 , where The unique minimum $f(x^*) = 0$ is attained at $x^* = 0$. Note that in De Sterck [7], a minus sign is used in front of $j(1 - \cos x_j)$. We follow the original formulation of Moré et al. [16].
Problem G. Penalty function I, Problem (23) from Moré et al. [16], 2 , where x 2 j , and The minimum is not known explicitly for Problem G, and depends on the value of $n$.

Experiment design
We test the N-GMRES and O-ACCEL algorithms with two steepest descent preconditioners $M_Z$, with $Z = A, B$. The two preconditioners only differ in the choice of step length. Option A employs a globalizing strategy with a chosen line search, whilst option B takes a predetermined step length. By choosing a short, predetermined step length $\delta > 0$, we expand the subspace to search for $\alpha$ and stay close to the previous iterate $x^{(k)}$, hopefully improving the linearizations in (6) and (9). For the experiments, we use the line search algorithm by Moré and Thuente [15], which satisfies the Wolfe conditions [18]. It is both employed for $M_A$, and in the line search $x^P + \lambda(x^A - x^P)$ between the preconditioned step $x^P$ and the accelerated step $x^A$ of the N-GMRES and O-ACCEL routines.
To closely follow the testing conditions of De Sterck [7], we use the N-CG, L-BFGS and Moré-Thuente line search implementations from the Poblano toolbox by Dunlavy et al. [10]. These may not be state-of-the-art implementations; however, the main focus of this manuscript is to investigate the performance of the N-GMRES and O-ACCEL algorithms. Future work will include testing the O-ACCEL algorithm with appropriate preconditioners on more comprehensive test sets, against state-of-the-art implementations of gradient-based optimization algorithms.
All optimization procedures employ the Moré-Thuente line search with the following options: decrease tolerance $c_1 = 10^{-4}$ and curvature tolerance $c_2 = 0.1$ for the Wolfe conditions, starting step length $\lambda = 1$, and a maximum of 20 $f/g$ evaluations. The N-GMRES and O-ACCEL history lengths are set to $w_{\max} = 20$, and the regularization parameter is set to $\epsilon_0 = 10^{-12}$. For $M_B$, the fixed step length is set to $\delta = 10^{-4}$. The L-BFGS history size is set to 5. Larger history sizes were found by De Sterck [7] to be harmful for the L-BFGS performance on this test set.
Note that our choice of curvature tolerance $c_2 = 0.1$ is different from De Sterck [7], where $c_2 = 0.01$ was used. There are two reasons for this. First, our choice is often used in practice, see Nocedal and Wright [18, Ch. 3.1], and it reduces the number of function evaluations for all the solvers considered. Second, we are interested in comparing the outer solvers, whereas smaller values of $c_2$ move work from the outer solvers to the line search algorithms.
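The role of the two tolerances can be illustrated with a small check of the Wolfe conditions. The sketch below assumes the strong form of the conditions as stated in Nocedal and Wright [18]; the function name `satisfies_strong_wolfe` and the one-dimensional example are hypothetical and are not taken from the Poblano toolbox.

```python
def satisfies_strong_wolfe(f, grad, x, d, lam, c1=1e-4, c2=0.1):
    """Check the strong Wolfe conditions for the step x + lam * d.

    Sufficient decrease: f(x + lam d) <= f(x) + c1 * lam * g(x)^T d
    Curvature:           |g(x + lam d)^T d| <= c2 * |g(x)^T d|
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    x_new = [xi + lam * di for xi, di in zip(x, d)]
    slope0 = dot(grad(x), d)                 # directional derivative at x
    sufficient_decrease = f(x_new) <= f(x) + c1 * lam * slope0
    curvature = abs(dot(grad(x_new), d)) <= c2 * abs(slope0)
    return sufficient_decrease and curvature

# Hypothetical example: f(x) = x^2 / 2, starting at x = 1 along d = -1.
f = lambda x: 0.5 * x[0] ** 2
grad = lambda x: [x[0]]

exact_step = satisfies_strong_wolfe(f, grad, [1.0], [-1.0], lam=1.0)
half_step_tight = satisfies_strong_wolfe(f, grad, [1.0], [-1.0], lam=0.5, c2=0.1)
half_step_loose = satisfies_strong_wolfe(f, grad, [1.0], [-1.0], lam=0.5, c2=0.9)
```

In this example the half step is rejected with $c_2 = 0.1$ but accepted with $c_2 = 0.9$, which illustrates why a smaller $c_2$ shifts work into the line search.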
We test Problems A-C for both problem sizes $n = 100$ and $n = 200$. Problem D is tested with $n = 500$, 1000, 50 000, 100 000. Problem E is tested with $n = 100$, 200, 50 000, 100 000. Problem F is called with $n = 200$, 500, and finally, Problem G with $n = 100$, 200. Each combination of problem and problem size is run 1000 times, with the components of the initial guess drawn uniformly at random from the interval $[0, 1]$. For Problem C, each instance of the problem generates a new, random, orthogonal matrix $Q$. This results in 18 000 individual tests for the comparison. To evaluate performance, we count the number of objective evaluations required for the algorithms to reach an iterate $x$ such that $f(x) - f^* < 10^{-10}\left(f(x^{(0)}) - f^*\right)$. A solver run is labelled as failed if it does not reach tolerance within 1500 iterations. The minimum value $f^*$ is known for Problems A-F; however, for Problem G we estimate $f^*$ using the lowest value attained across all the optimisation procedures. The results on the collection of 18 000 test instances are discussed in Section 3.3, whilst the Appendix provides tables of results on the individual problems and problem sizes.
Note that our reporting of the numerical experiments differs from that of De Sterck [7] in two ways. First, we run each problem combination 1000 times, instead of 10 times. Second, we evaluate the results based on performance profiles and tables of quantiles, instead of solely reporting the average number of evaluations to reach tolerance. We believe the high number of test runs is important for more consistent values of the statistics reported in the Appendix across computers, further stabilised by using quantiles rather than averages.

Performance profiles
In order to evaluate the performance of optimizers on test sets with problems of varying size and difficulty, Dolan and Moré [9] proposed the use of performance profiles. For completeness, we first define the performance profile for our chosen metric of objective evaluations. Let $\mathcal{P}$ denote the test set of the $n_p = 18\,000$ problems, and $n_s$ the number of solvers. For each problem $p \in \mathcal{P}$, and solver $s$, define

$$t_{p,s} = \text{number of } f \text{ evaluations required to reach tolerance}. \tag{31}$$

In the numerical tests we say that the solver has reached tolerance for the problem when the relative decrease in the objective value is at least $10^{-10}$, that is

$$f(x) - f^* < 10^{-10}\left(f(x^{(0)}) - f^*\right). \tag{32}$$

Remark. Note that the numbers of objective and gradient calls are the same for each of the optimizers considered in this manuscript. This is due to the use of the Moré-Thuente line search algorithm.
Let $t_p$ denote the lowest number of $f$ evaluations needed to reach tolerance for problem $p$ across all the solvers,

$$t_p = \min_{s} t_{p,s}. \tag{33}$$

The performance ratio measures the performance on problem $p$ by solver $s$, as defined by

$$\rho_{p,s} = \frac{t_{p,s}}{t_p}. \tag{34}$$

The value is bounded below by 1, and $\rho_{p,s} = 1$ for at least one solver $s$. If solver $s$ does not solve problem $p$, then we set $\rho_{p,s} = \infty$. We define the performance profile $p_s : [1, \infty) \to [0, 1]$, for solver $s$, by

$$p_s(\tau) = \frac{1}{n_p}\left|\{p \in \mathcal{P} : \rho_{p,s} \le \tau\}\right|. \tag{35}$$

The performance profile for a solver $s$ can be viewed as an empirical, cumulative "distribution" function representing the probability of the solver $s$ reaching tolerance within a ratio $\tau$ of the fastest solver for each problem. In particular, $p_s(1)$ gives the proportion of problems for which solver $s$ performed best.
For large values of τ , the performance profile p s (τ ) indicates robustness, that is, what proportion of all the test problems were solved by the solver.
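The definition translates directly into code. The following sketch computes performance profiles from a table of evaluation counts; the function name `performance_profiles`, the data layout `t[p][s]`, and the toy numbers are assumptions for illustration, and failures are encoded as `math.inf`. It assumes that every problem is solved by at least one solver, so that $t_p$ is finite.

```python
import math

def performance_profiles(t, taus):
    """Return profiles[s][i] = p_s(taus[i]).

    t[p][s] is the number of f evaluations solver s needs on problem p,
    or math.inf if the solver failed on that problem.
    """
    n_p, n_s = len(t), len(t[0])
    best = [min(row) for row in t]                 # t_p, assumed finite
    profiles = []
    for s in range(n_s):
        ratios = [t[p][s] / best[p] for p in range(n_p)]   # inf stays inf
        profiles.append([sum(r <= tau for r in ratios) / n_p for tau in taus])
    return profiles

# Toy data: three problems, two solvers; solver 0 fails on the last problem.
t = [[10, 20],
     [30, 15],
     [math.inf, 40]]
profiles = performance_profiles(t, taus=[1, 2])
```

Here `profiles[0]` and `profiles[1]` hold the values of $p_s(1)$ and $p_s(2)$ for the two hypothetical solvers: solver 0 is fastest on one of three problems, and solver 1 solves all three within a factor 2 of the best.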
Figure 1: Performance profiles, defined in (35), for Problems A-G. O-ACCEL preconditioned with a fixed-step steepest descent (B) and L-BFGS mostly outperform the rest, except for higher factors of $\tau$. They are also more robust, solving the largest proportion of the problems when the computational budget is large.
Figure 1 plots the performance profile of the $n_s = 6$ solvers considered: N-CG, L-BFGS, and N-GMRES and O-ACCEL with steepest descent preconditioning using both a line search (A) and a fixed step size (B). It is clear that O-ACCEL-B and L-BFGS are the best performers across the test set. For 44 % of the test problems they reach tolerance in the fewest $f$ evaluations, and they also solve the largest proportion of problems within higher factors $\tau$ of the best performance ratio. There is also a region where N-CG does particularly well, solving the largest proportion of problems within two to three times the highest performing solver. The worst performers are N-GMRES-A and O-ACCEL-A, mainly due to the high amount of work that the line search must do to satisfy the Wolfe conditions along the steepest descent directions.
It is notable that O-ACCEL-B is competitive with L-BFGS on the test set. Tests, not presented in this work, indicate that the L-BFGS performance improves by using a line search with Wolfe curvature condition parameter $c_2 = 0.9$, rather than $c_2 = 0.1$ as used in this manuscript. The main focus of this manuscript is, however, to investigate the potential improvement of minimizing the objective rather than an $\ell_2$ norm of the gradient. Thus, we are more interested in the comparison between N-GMRES and O-ACCEL. The two plots in Figure 2 show the performance profiles comparing N-GMRES and O-ACCEL, and in both cases show a significant improvement by minimizing the objective. In fact, O-ACCEL reaches tolerance first on 63 % to 71 % of the test problems. The instances where N-GMRES does better are primarily in Problems E, F, and G, as can be seen from Table 2 in the Appendix. One of the findings of De Sterck [7] was that N-GMRES with line search-steepest descent often stagnated or converged very slowly. From the left plot of Figure 2, we see that this issue is reduced with the O-ACCEL acceleration. It also turns out that O-ACCEL-A has a larger success rate over the test set than N-GMRES-A. Note that the lines of O-ACCEL-A and N-GMRES-A cross in Figure 1, but not in the left plot of Figure 2, because the performance profiles change depending on the set of solvers considered.

The tensor optimization problem from De Sterck
The original motivation for N-GMRES was to improve convergence for a tensor optimization problem [6]. De Sterck [6] and De Sterck [7] show that using N-GMRES with a domain-specific ALS preconditioner is better than generic optimizers such as L-BFGS and N-CG. De Sterck [7] states that "In this problem, a rank-three canonical tensor approximation (with 450 variables) is sought for a three-way data tensor of size 50 × 50 × 50. The data tensor is generated starting from a canonical tensor with specified rank and random factor matrices that are modified to have prespecified column colinearity, and noise is added. This is a standard canonical tensor decomposition test problem [1]." For this manuscript, we run the 1000 realisations of the test problem using the code provided by De Sterck [7] with the parameter values described in Section 3.2. The algorithms tested for this problem are vanilla ALS, N-GMRES-ALS, O-ACCEL-ALS, N-CG, and L-BFGS. Figure 3a and Figure 3b show the performance profiles and quantiles for the number of $f$ evaluations required to reach tolerance. We see that O-ACCEL-ALS and N-GMRES-ALS perform better than the other algorithms, which underscores the advantage of applying these acceleration methods to domain-specific algorithms.

CUTEst test problems
The test problems we have considered so far were taken from De Sterck [7] and originally used to promote N-GMRES. We finish by presenting results from a numerical experiment using problems from the CUTEst problem set [13]. For this experiment, we compare the solvers O-ACCEL-B, L-BFGS, and N-GMRES-B, with the parameter values described in Section 3.2. The minima are not known for many of the CUTEst problems, and so we change the tolerance criterion to be defined in terms of the relative decrease of the gradient norm, which defines the performance measure used for this experiment. A solver run is labelled as failed if it does not reach tolerance within 2000 iterations. We run the experiment using implementations of the solvers from the package Optim [14] of the Julia programming language [3]. To be sure, we have also verified that the Optim code yields the same results as the MATLAB code for Problems A-G.
The 33 problems we consider are listed in Tables 3 and 4 of the Appendix together with the results of the numerical experiment. We selected the problems with dimension $n = 50$ to 10 000 that satisfy two criteria: (i) the objective type is in the category "other", and (ii) at least one of the solvers succeeds in reaching tolerance. Figure 4 shows performance profiles from the experiment. L-BFGS reaches tolerance first for most of the problems; however, O-ACCEL reaches tolerance within 2000 iterations for more of the test problems. In the problems where L-BFGS does not reach tolerance it stops because it fails prematurely, whilst N-GMRES-B only fails due to reaching 2000 iterations. We believe the poorer performance of the acceleration algorithms for the CUTEst problems, compared to the previous experiments, is due to the poor performance of the steepest descent preconditioner on these problems. Again, O-ACCEL-B performs better than N-GMRES-B, which underscores our claim that accelerating based on the objective function is better than accelerating based on the gradient norm.

Conclusion
We have proposed a simple acceleration algorithm for optimization, based on the nonlinear GMRES (N-GMRES) algorithm by De Sterck [7] and Washio and Oosterlee [24]. N-GMRES for optimization aims to accelerate a solver step when solving the nonlinear system $\nabla f(x) = 0$ by minimizing the residual in the $\ell_2$ norm over a subspace from previous iterates. The acceleration step consists of solving a small linear system that arises from a linearization of the gradient.
We propose to take advantage of the structure of the optimization problem and instead accelerate based on the objective value $f(x)$. This new approach, labelled O-ACCEL, shows a significant improvement to the original N-GMRES algorithm in numerical tests when accelerating a steepest descent solver. The first test problems are taken from De Sterck [7] and run under the same conditions that proved to be beneficial for N-GMRES. Further tests on a selection of CUTEst problems strengthen the conclusion that O-ACCEL outperforms N-GMRES. Another strength of these acceleration algorithms is that they can be combined with many types of optimizers. We have seen O-ACCEL's efficiency with steepest descent, and accelerating quasi-Newton methods, Newton methods, and domain-specific methods has the potential to reduce costs for more expensive algorithms. For example, in De Sterck [7] it is shown that N-GMRES significantly accelerates the alternating least squares algorithm (ALS), which already without acceleration performs much better than L-BFGS and N-CG on a standard canonical tensor decomposition problem. Our numerical tests show that O-ACCEL further improves the ALS convergence for this problem.
There are two particular paths of interest to improve the proposed acceleration scheme. The first is to reduce the cost by not using a line search between the proposed steps by the solver and O-ACCEL. One can instead rely on heuristics along the lines of those proposed by Washio and Oosterlee [24]. The second is to find better heuristics for choosing previous iterates to use in the acceleration step. Currently, no choices are made, other than discarding all iterates when problems appear. Better guidelines for the number of previous iterates to store are another topic of interest, especially when memory storage is limited.
We would like to investigate connections between the proposed O-ACCEL acceleration step and other optimization procedures, in the same fashion that Fang and Saad [11] put Anderson acceleration in the context of a family of Broyden-type approximations of the inverse Jacobian (Hessian).The preliminary analysis presented in this manuscript shows that, for convex quadratic objectives, O-ACCEL with a gradient descent preconditioner is equivalent to FOM for linear systems.As FOM is equivalent to CG for symmetric positive definite systems, we can view O-ACCEL in the context of N-CG methods using a larger history size than usual.There are many new ideas for improving step directions based on previous iterates, such as the acceleration scheme by Scieur et al. [22], and Block BFGS by Gao and Goldfarb [12].A better understanding of the overlaps between these and more classical optimization procedures can provide useful guidance for further research.
Further work is needed to test O-ACCEL on a wider range of problems, with comparisons to other state-of-the-art implementations of solvers and accelerators, in order to provide guidance as to when a method is appropriate.For example, on Problems A-G, O-ACCEL accelerating steepest descent is superior to N-CG and slightly better than L-BFGS.These results may, however, be due to implementations from De Sterck [7] and test problems favouring the acceleration algorithms.They are still indicative of the power of objective value based optimization, a research track that is worth pursuing further.

Figure 2: Performance profiles comparing N-GMRES and O-ACCEL with steepest descent with line search (A, left) and without (B, right). O-ACCEL outperforms N-GMRES in both cases on our test set. Note that the lines of O-ACCEL-A and N-GMRES-A cross in Figure 1, but not in the left figure here, because the performance profiles change depending on the set of solvers considered.

Figure 3: Numerical results from the tensor optimization test problem.O-ACCEL and N-GMRES perform significantly better than the other solvers.

Figure 4: Performance profiles for the CUTEst test problems from Tables 3 and 4. L-BFGS is the highest performing most of the time; however, O-ACCEL-B reaches tolerance for more problems.

Table 4: Results from the CUTEst tests.