This paper is followed by discussions and a rejoinder.

Original Article

# A Brief Survey of Modern Optimization for Statisticians

Article first published online: 17 FEB 2014

DOI: 10.1111/insr.12022

© 2014 The Authors. International Statistical Review © 2014 International Statistical Institute

#### How to Cite

Lange, K., Chi, E. C. and Zhou, H. (2014), A Brief Survey of Modern Optimization for Statisticians. International Statistical Review, 82: 46–70. doi: 10.1111/insr.12022

#### Publication History

- Issue published online: 22 APR 2014
- Article first published online: 17 FEB 2014
- Manuscript Accepted: 20 APR 2013
- Manuscript Revised: 14 JAN 2013
- Manuscript Received: 10 SEP 2012


### Keywords:

- Block relaxation;
- Newton's method;
- MM algorithm;
- penalization;
- augmented Lagrangian;
- acceleration

### Summary


Modern computational statistics is turning more and more to high-dimensional optimization to handle the deluge of big data. Once a model is formulated, its parameters can be estimated by optimization. Because model parsimony is important, models routinely include non-differentiable penalty terms such as the lasso. This sober reality complicates minimization and maximization. Our broad survey stresses a few important principles in algorithm design. Rather than view these principles in isolation, it is more productive to mix and match them. A few well-chosen examples illustrate this point. Algorithm derivation is also emphasized, and theory is downplayed, particularly the abstractions of the convex calculus. Thus, our survey should be useful and accessible to a broad audience.

### Introduction


Modern statistics represents a confluence of data, algorithms, practical inference, and subject area knowledge. As data mining expands, computational statistics is assuming greater prominence. Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization. Penalties serve as priors and steer parameter estimates in realistic directions. In classical statistics, estimation usually meant least squares and maximum likelihood with smooth objective functions. In a search for sparse representations, mathematical scientists have introduced non-differentiable penalties such as the lasso and the nuclear norm. To survive in this alien terrain, statisticians are being forced to master exotic branches of mathematics such as convex calculus (Hiriart-Urruty & Lemarechal, 1996, 2001). Thus, the uneasy but productive relationship between statistics and mathematics continues, but in a different guise and mediated by new concerns.

The purpose of this survey article is to provide a few glimpses of the new optimization algorithms being crafted by computational statisticians and applied mathematicians. Although a survey of convex calculus for statisticians would certainly be helpful, our emphasis is more concrete. The truth of the matter is that a few broad categories of algorithms dominate. Furthermore, difficult problems require that several algorithmic pieces be assembled into a well-coordinated whole. Put another way, from a handful of basic ideas, computational statisticians often weave a complex tapestry of algorithms that meets the needs of a specific problem. No algorithm category should be dismissed a priori in tackling a new problem. There is plenty of room for creativity and experimentation. Algorithms are made for tinkering. When one part fails or falters, it can be replaced by a faster or more robust part.

This survey will treat the following methods: (a) block descent, (b) steepest descent, (c) Newton's method, quasi-Newton methods, and scoring, (d) the majorize–minimize (MM) and expectation–maximization (EM) algorithms, (e) penalized estimation, (f) the augmented Lagrangian method for constrained optimization, and (g) acceleration of fixed point algorithms. As we have mentioned, often the best algorithms combine several themes. We will illustrate the various themes by a sequence of examples. Although we avoid difficult theory and convergence proofs, we will try to point out along the way a few motivating ideas that stand behind most algorithms. For example, as its name indicates, steepest descent algorithms search along the direction of fastest decrease of the objective function. Newton's method and its variants all rely on the notion of local quadratic approximation, thus correcting the often poor linear approximation of steepest descent. In high dimensions, Newton's method stalls because it involves calculating and inverting large matrices of second derivatives.

The MM and EM algorithms replace the objective function by a simpler surrogate function. By design, optimizing the surrogate function sends the objective function downhill in minimization and uphill in maximization. In constructing the surrogate function for an EM algorithm, statisticians rely on notions of missing data. The more general MM algorithm calls on skills in inequalities and convex analysis. More often than not, concrete problems also involve parameter constraints. Modern penalty methods incorporate the constraints by imposing penalties on the objective function. A tuning parameter scales the strength of the penalties. In the classical penalty method, the constrained solution is recovered as the tuning parameter tends to infinity. In the augmented Lagrangian method, the constrained solution emerges for a finite value of the tuning parameter.

In the remaining sections, we adopt several notational conventions. Vectors and matrices appear in boldface type; for the most part, parameters appear as Greek letters. The differential $df(\boldsymbol\theta)$ of a scalar-valued function $f(\boldsymbol\theta)$ equals its row vector of partial derivatives; the transpose $\nabla f(\boldsymbol\theta)$ of the differential is the gradient. The second differential $d^2 f(\boldsymbol\theta)$ is the Hessian matrix of second partial derivatives. The Euclidean norm of a vector $\boldsymbol b$ and the spectral norm of a matrix $\boldsymbol A$ are denoted by $\|\boldsymbol b\|$ and $\|\boldsymbol A\|$, respectively. All other norms will be appropriately subscripted. The $n$-th entry $b_n$ of a vector $\boldsymbol b$ must be distinguished from the $n$-th vector $\boldsymbol b_n$ in a sequence of vectors. To maintain consistency, $b_{ni}$ denotes the $i$-th entry of $\boldsymbol b_n$. A similar convention holds for sequences of matrices.

### Block Descent


Block relaxation (either block descent or block ascent) divides the parameters into disjoint blocks and cycles through the blocks, updating only those parameters within the pertinent block at each stage of a cycle (de Leeuw, 1994). For the sake of brevity, we consider only block descent. In updating a block, we minimize the objective function over the block. Hence, block descent possesses the desirable descent property of always forcing the objective function downhill. When each block consists of a single parameter, block descent is called cyclic coordinate descent. The coordinate updates need not be explicit. In high-dimensional problems, implementation of one-dimensional Newton searches is often compatible with fast overall convergence. Block descent is best suited to unconstrained problems where the domain of the objective function reduces to a Cartesian product of the subdomains associated with the different blocks. Obviously, exact block updates are a huge advantage. Non-separable constraints can present insuperable barriers to coordinate descent because parameters get locked into place. In some problems, it is advantageous to consider overlapping blocks.

Example 1. Non-negative least squares

For a positive definite matrix $\boldsymbol A = (a_{ij})$ and vector $\boldsymbol b = (b_i)$, consider minimizing the quadratic function
$$f(\boldsymbol\theta) = \tfrac12 \boldsymbol\theta^t \boldsymbol A \boldsymbol\theta + \boldsymbol b^t \boldsymbol\theta$$
subject to the constraints $\theta_i \ge 0$ for all $i$. In the case of least squares, $\boldsymbol A = \boldsymbol X^t \boldsymbol X$ and $\boldsymbol b = -\boldsymbol X^t \boldsymbol y$ for some design matrix $\boldsymbol X$ and response vector $\boldsymbol y$. Equating the partial derivative of $f(\boldsymbol\theta)$ with respect to $\theta_i$ to 0 gives
$$\sum_j a_{ij}\theta_j + b_i = 0.$$
Rearrangement now yields the unrestricted minimum
$$\theta_i = -\frac{1}{a_{ii}}\Bigl(b_i + \sum_{j \ne i} a_{ij}\theta_j\Bigr).$$
Taking into account the non-negativity constraint, this must be amended to
$$\theta_{n+1,i} = \max\Bigl\{0,\ -\frac{1}{a_{ii}}\Bigl(b_i + \sum_{j < i} a_{ij}\theta_{n+1,j} + \sum_{j > i} a_{ij}\theta_{nj}\Bigr)\Bigr\}$$
at stage $n + 1$ to construct the coordinate descent update of $\theta_i$.
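The coordinate descent update above can be implemented in a few lines of NumPy. The function and the small test problem below are our own illustration, not code from the paper; each update takes the non-negative part of the unrestricted one-dimensional minimum.

```python
import numpy as np

def nnls_coordinate_descent(A, b, theta0, n_sweeps=500):
    """Cyclic coordinate descent for min (1/2) theta' A theta + b' theta
    subject to theta >= 0, with A positive definite."""
    theta = theta0.astype(float).copy()
    for _ in range(n_sweeps):
        for i in range(len(theta)):
            # b_i plus the off-diagonal part of the i-th inner product
            residual = b[i] + A[i] @ theta - A[i, i] * theta[i]
            # unrestricted one-dimensional minimum, clamped at the boundary
            theta[i] = max(0.0, -residual / A[i, i])
    return theta

# hypothetical least squares instance: A = X'X, b = -X'y
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.uniform(size=5) + 0.1 * rng.standard_normal(100)
A, b = X.T @ X, -X.T @ y
theta = nnls_coordinate_descent(A, b, np.zeros(5))
```

At convergence the iterate satisfies the Karush–Kuhn–Tucker conditions: the gradient $\boldsymbol A\boldsymbol\theta + \boldsymbol b$ is non-negative, and it vanishes in every coordinate with $\theta_i > 0$.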

Example 2. Matrix factorization by alternating least squares

In the 1960s, Kruskal (1965) applied the method of alternating least squares to factorial analysis of variance. Later, the subject was taken up by de Leeuw & colleagues (1990). Suppose $\boldsymbol U$ is an $m \times q$ matrix whose columns $\boldsymbol u_1, \ldots, \boldsymbol u_q$ represent data vectors. In many applications, it is reasonable to postulate a reduced number of prototypes $\boldsymbol v_1, \ldots, \boldsymbol v_p$ and write
$$\boldsymbol u_j \approx \sum_{k=1}^p w_{kj} \boldsymbol v_k$$
for certain non-negative weights $w_{kj}$. The matrix $\boldsymbol W = (w_{kj})$ is $p \times q$. If $p$ is small compared with $q$, then the representation $\boldsymbol U \approx \boldsymbol V \boldsymbol W$ compresses the data for easier storage and retrieval. Depending on the circumstances, one may want to add further constraints (Ding *et al*., 2010). For instance, if the entries of $\boldsymbol U$ are non-negative, then it is often reasonable to demand that the entries of $\boldsymbol V$ and $\boldsymbol W$ be non-negative as well (Lee & Seung, 1999; Paatero & Tapper, 1994). If we want each $\boldsymbol u_j$ to equal a convex combination of the prototypes, then constraining the column sums of $\boldsymbol W$ to equal 1 is indicated.

One way of estimating $\boldsymbol V$ and $\boldsymbol W$ is to minimize the squared Frobenius norm
$$\|\boldsymbol U - \boldsymbol V \boldsymbol W\|_F^2.$$
No explicit solution is known, but alternating least squares offers an iterative attack. If $\boldsymbol W$ is fixed, then we can update the $i$-th row of $\boldsymbol V$ by minimizing the sum of squares $\sum_j \bigl(u_{ij} - \sum_k v_{ik} w_{kj}\bigr)^2$. Similarly, if $\boldsymbol V$ is fixed, then we can update the $j$-th column of $\boldsymbol W$ by minimizing the sum of squares $\sum_i \bigl(u_{ij} - \sum_k v_{ik} w_{kj}\bigr)^2$. Thus, block descent solves a sequence of least squares problems, some of which are constrained.
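In the unconstrained case, each block update is an ordinary least squares solve. The following sketch is our own illustration with hypothetical dimensions; it alternates the two updates using `numpy.linalg.lstsq` and omits the optional non-negativity and simplex constraints.

```python
import numpy as np

def als_factorize(U, p, n_iter=200, seed=0):
    """Alternating least squares for U ~ V W in the Frobenius norm."""
    m, q = U.shape
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((m, p))   # random starting prototypes
    for _ in range(n_iter):
        # with V fixed, each column of W solves a least squares problem
        W = np.linalg.lstsq(V, U, rcond=None)[0]
        # with W fixed, each row of V solves a least squares problem
        V = np.linalg.lstsq(W.T, U.T, rcond=None)[0].T
    return V, W

# a rank-2 test matrix is recovered essentially exactly
rng = np.random.default_rng(1)
U = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 10))
V, W = als_factorize(U, 2)
```

Because each block update minimizes the objective over its block, the residual norm decreases monotonically, in keeping with the descent property of block relaxation.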

### Steepest Descent


The first-order Taylor expansion
$$f(\boldsymbol\gamma) = f(\boldsymbol\theta) + df(\boldsymbol\theta)(\boldsymbol\gamma - \boldsymbol\theta) + o(\|\boldsymbol\gamma - \boldsymbol\theta\|)$$
of a differentiable function $f(\boldsymbol\theta)$ around $\boldsymbol\theta$ motivates the method of steepest descent. In view of the Cauchy–Schwarz inequality, the choice
$$\boldsymbol\gamma = \boldsymbol\theta - \frac{\nabla f(\boldsymbol\theta)}{\|\nabla f(\boldsymbol\theta)\|}$$
minimizes the linear term $df(\boldsymbol\theta)(\boldsymbol\gamma - \boldsymbol\theta)$ of the expansion over the sphere of unit vectors $\boldsymbol\gamma - \boldsymbol\theta$. Of course, if $\nabla f(\boldsymbol\theta) = \boldsymbol 0$, then $\boldsymbol\theta$ is a stationary point. The steepest descent algorithm iterates according to
$$\boldsymbol\theta_{n+1} = \boldsymbol\theta_n - s\,\nabla f(\boldsymbol\theta_n) \tag{1}$$
for some scalar $s > 0$. If $s$ is sufficiently small, then the descent property $f(\boldsymbol\theta_{n+1}) < f(\boldsymbol\theta_n)$ holds. The most sophisticated version of the algorithm determines $s$ by searching for the minimum of the objective function along the direction of steepest descent. Among the many methods of line search, the methods of false position, cubic interpolation, and golden section stand out (Lange, 2012). These are all local search methods, and unless some guarantee of convexity exists, confusion of local and global minima can occur.

The method of steepest descent often exhibits zigzagging and a painfully slow rate of convergence. For these reasons, it was largely replaced in practice by Newton's method and its variants. However, the sheer scale of modern optimization problems has led to a re-evaluation. The avoidance of second derivatives and Hessian approximations is now viewed as a virtue. Furthermore, the method has been generalized to non-differentiable problems by substituting the forward directional derivative
$$d_{\boldsymbol\nu} f(\boldsymbol\theta) = \lim_{t \downarrow 0} \frac{f(\boldsymbol\theta + t\boldsymbol\nu) - f(\boldsymbol\theta)}{t}$$
for the gradient (Tao *et al*., 2010). Here, the idea is to choose a unit search vector $\boldsymbol\nu$ to minimize $d_{\boldsymbol\nu} f(\boldsymbol\theta)$. In some instances, this secondary problem can be attacked by linear programming. For a convex problem, the condition $d_{\boldsymbol\nu} f(\boldsymbol\theta) \ge 0$ for all $\boldsymbol\nu$ is both necessary and sufficient for $\boldsymbol\theta$ to be a minimum point. If the domain of $f(\boldsymbol\theta)$ equals a convex set $C$, then only tangent directions $\boldsymbol\nu = \boldsymbol\mu - \boldsymbol\theta$ with $\boldsymbol\mu \in C$ come into play.

Steepest descent also has a role to play in constrained optimization. Suppose we want to minimize $f(\boldsymbol\theta)$ subject to the constraint $\boldsymbol\theta \in C$ for some closed convex set $C$. The projected gradient method capitalizes on the steepest descent update (1) by projecting it onto the set $C$ (Goldstein, 1964; Levitin & Polyak, 1966; Ruszczyński, 2006). It is well known that for a point $\boldsymbol x$ external to $C$, there is a closest point $P_C(\boldsymbol x)$ to $\boldsymbol x$ in $C$. Explicit formulas for the projection operator $P_C(\boldsymbol x)$ exist when $C$ is a box, Euclidean ball, hyperplane, or half-space. Fast algorithms for computing $P_C(\boldsymbol x)$ exist for the unit simplex, the $\ell_1$ ball, and the cone of positive semidefinite matrices (Duchi *et al*., 2008; Michelot, 1986).

Choice of the scalar $s$ in the update (1) is crucial. Current theory suggests taking $s$ to equal $r/L$, where $L$ is a Lipschitz constant for the gradient $\nabla f(\boldsymbol\theta)$ and $r$ belongs to the interval $(0,2)$. In particular, the Lipschitz inequality
$$\|\nabla f(\boldsymbol\theta) - \nabla f(\boldsymbol\gamma)\| \le L \|\boldsymbol\theta - \boldsymbol\gamma\|$$
is valid for $L = \sup_{\boldsymbol\theta} \|d^2 f(\boldsymbol\theta)\|$, whenever this quantity is finite. In practice, the Lipschitz constant $L$ must be estimated. Any induced matrix norm $\|\cdot\|_\dagger$ can be substituted for the spectral norm $\|\cdot\|$ in the defining supremum and will give an upper bound on $L$.
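As a sketch of the projected gradient recipe just described (our own code, not the paper's), consider non-negative least squares again, where $C$ is the non-negative orthant and the projection is a coordinatewise clamp. The step size is $s = r/L$ with $L$ taken as the spectral norm of $\boldsymbol X^t \boldsymbol X$.

```python
import numpy as np

def projected_gradient_nnls(X, y, r=1.0, n_iter=1000):
    """Projected gradient for min (1/2)||y - X theta||^2 over theta >= 0."""
    A, c = X.T @ X, -X.T @ y           # gradient at theta is A theta + c
    L = np.linalg.norm(A, 2)           # spectral norm = Lipschitz constant
    s = r / L                          # step size with r in (0, 2)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # steepest descent step followed by projection onto the orthant
        theta = np.maximum(theta - s * (A @ theta + c), 0.0)
    return theta

# hypothetical test instance
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.uniform(size=5) + 0.1 * rng.standard_normal(100)
theta = projected_gradient_nnls(X, y)
```

The clamp `np.maximum(. , 0.0)` is exactly the projection $P_C$ for the box $C = [0, \infty)^p$; swapping in a different projection operator adapts the same loop to other constraint sets.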

Example 3. Coordinate descent versus the projected gradient method

As a test problem, we generated a random $100 \times 50$ design matrix $\boldsymbol X$ with independent and identically distributed (i.i.d.) standard normal entries, a random $50 \times 1$ parameter vector $\boldsymbol\theta$ with i.i.d. uniform $[0,1]$ entries, and a random $100 \times 1$ error vector $\boldsymbol e$ with i.i.d. standard normal entries. In this setting, the response is $\boldsymbol y = \boldsymbol X \boldsymbol\theta + \boldsymbol e$. We then compared coordinate descent, the projected gradient method (for $L$ equal to the spectral radius of $\boldsymbol X^t \boldsymbol X$ and $r$ equal to 1.0, 1.75, and 2.0), and the MM algorithm explained later in Example 6. All computer runs start from the common point $\boldsymbol\theta_0$ whose entries are filled with i.i.d. uniform $[0,1]$ random deviates. Figure 1 plots the progress of each algorithm as measured by the relative difference
$$\frac{f(\boldsymbol\theta_n) - f(\boldsymbol\theta_\infty)}{|f(\boldsymbol\theta_\infty)|} \tag{2}$$
between the loss at the current iteration and the ultimate loss $f(\boldsymbol\theta_\infty)$ at convergence. It is interesting how well coordinate descent performs compared with projected gradient descent. The slower convergence of the MM algorithm is probably a consequence of the fact that its multiplicative updates slow down as they approach the 0 boundary. Note also the importance of choosing a good step size in the projected gradient algorithm. Inflated steps accelerate convergence, but excessively inflated steps hamper it.

### Variations on Newton's Method


The primary advantage of Newton's method is its speed of convergence in low-dimensional problems. Its many variants seek to retain its fast convergence while taming its defects. The variants all revolve around the core idea of locally approximating the objective function by a strictly convex quadratic. At each iteration, the quadratic approximation is optimized subject to safeguards that keep the iterates from overshooting and veering towards irrelevant stationary points.

Consider minimizing the real-valued function $f(\boldsymbol\theta)$ defined on an open set $S \subset \mathbb{R}^p$. Assuming that $f(\boldsymbol\theta)$ is twice differentiable, we have the second-order Taylor expansion
$$f(\boldsymbol\gamma) = f(\boldsymbol\theta) + df(\boldsymbol\theta)(\boldsymbol\gamma - \boldsymbol\theta) + \tfrac12 (\boldsymbol\gamma - \boldsymbol\theta)^t d^2 f(\boldsymbol\alpha)(\boldsymbol\gamma - \boldsymbol\theta)$$
for some $\boldsymbol\alpha$ on the line segment $[\boldsymbol\theta, \boldsymbol\gamma]$. This expansion suggests that we substitute $d^2 f(\boldsymbol\theta)$ for $d^2 f(\boldsymbol\alpha)$ and approximate $f(\boldsymbol\gamma)$ by the resulting quadratic. If we take this approximation seriously, then we can solve for its minimum point $\boldsymbol\gamma$ as
$$\boldsymbol\gamma = \boldsymbol\theta - d^2 f(\boldsymbol\theta)^{-1} \nabla f(\boldsymbol\theta).$$
In Newton's method, we iterate according to
$$\boldsymbol\theta_{n+1} = \boldsymbol\theta_n - s\, d^2 f(\boldsymbol\theta_n)^{-1} \nabla f(\boldsymbol\theta_n) \tag{3}$$
for step length constant $s$ with default value 1. Any stationary point of $f(\boldsymbol\theta)$ is a fixed point of Newton's method.

There is nothing to prevent Newton's method from heading uphill rather than downhill. The first-order expansion
$$f(\boldsymbol\theta_{n+1}) = f(\boldsymbol\theta_n) - s\, df(\boldsymbol\theta_n)\, d^2 f(\boldsymbol\theta_n)^{-1} \nabla f(\boldsymbol\theta_n) + o(s)$$
makes it clear that the descent property holds provided $s > 0$ is small enough and the Hessian matrix $d^2 f(\boldsymbol\theta_n)$ is positive definite. When $d^2 f(\boldsymbol\theta_n)$ is not positive definite, it is usually replaced by a positive definite approximation $\boldsymbol H_n$ in the update (3).

Backtracking is crucial to avoid overshooting. In the step-halving version of backtracking, one starts with *s* = 1. If the descent property holds, then one takes the Newton step. Otherwise, *s* ∕ 2 is substituted for *s*, *θ*_{n + 1} is recalculated, and the descent property is rechecked. Eventually, a small enough *s* is generated to guarantee *f*(*θ*_{n + 1}) < *f*(*θ*_{n}).
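A minimal sketch of Newton's method with step halving, on a hypothetical strictly convex test function of our own choosing:

```python
import numpy as np

def newton_step_halving(f, grad, hess, theta, n_iter=50):
    """Newton's method safeguarded by step halving: halve s until
    the descent property f(new) < f(old) holds, then take the step."""
    for _ in range(n_iter):
        direction = np.linalg.solve(hess(theta), grad(theta))
        s = 1.0
        while f(theta - s * direction) >= f(theta) and s > 1e-12:
            s /= 2.0
        candidate = theta - s * direction
        if f(candidate) >= f(theta):
            break                    # no further descent possible
        theta = candidate
    return theta

# strictly convex test function with its minimum at the origin
f = lambda t: np.cosh(t[0]) + t[1] ** 2
grad = lambda t: np.array([np.sinh(t[0]), 2.0 * t[1]])
hess = lambda t: np.diag([np.cosh(t[0]), 2.0])
theta = newton_step_halving(f, grad, hess, np.array([2.0, 1.0]))
```

Because the Hessian here is positive definite everywhere, full Newton steps are accepted immediately, and the iterates converge quadratically to the origin.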

In the next two examples, we adopt standard statistical language. The outcome of a statistical experiment is summarized by a log likelihood $L(\boldsymbol\theta)$. Its gradient $\nabla L(\boldsymbol\theta)$ is called the score, and its second differential $d^2 L(\boldsymbol\theta)$, after a change in sign, is called the observed information. In maximum likelihood estimation, one maximizes $L(\boldsymbol\theta)$ with respect to the parameter vector $\boldsymbol\theta$.

Example 4. Newton's method for binomial regression

Consider binomial regression with $m$ independent responses $y_1, \ldots, y_m$. Each $y_i$ represents a count between 0 and $k_i$ with success probability $\pi_i(\boldsymbol\theta)$ per trial. The log likelihood, score, and observed information amount to
$$L(\boldsymbol\theta) = \sum_{i=1}^m \bigl[y_i \ln \pi_i(\boldsymbol\theta) + (k_i - y_i)\ln\{1 - \pi_i(\boldsymbol\theta)\}\bigr] + \text{constant},$$
$$\nabla L(\boldsymbol\theta) = \sum_{i=1}^m \Bigl[\frac{y_i}{\pi_i(\boldsymbol\theta)} - \frac{k_i - y_i}{1 - \pi_i(\boldsymbol\theta)}\Bigr] \nabla \pi_i(\boldsymbol\theta),$$
$$-d^2 L(\boldsymbol\theta) = \sum_{i=1}^m \Bigl[\frac{y_i}{\pi_i(\boldsymbol\theta)^2} + \frac{k_i - y_i}{\{1 - \pi_i(\boldsymbol\theta)\}^2}\Bigr] \nabla \pi_i(\boldsymbol\theta)\, \nabla \pi_i(\boldsymbol\theta)^t - \sum_{i=1}^m \Bigl[\frac{y_i}{\pi_i(\boldsymbol\theta)} - \frac{k_i - y_i}{1 - \pi_i(\boldsymbol\theta)}\Bigr] d^2 \pi_i(\boldsymbol\theta).$$
Dropping the final sum, whose expectation vanishes, gives one positive semidefinite approximation to the observed information. Because $\mathrm{E}(y_i) = k_i \pi_i(\boldsymbol\theta)$, the observed information can alternatively be approximated by
$$\sum_{i=1}^m \frac{k_i}{\pi_i(\boldsymbol\theta)\{1 - \pi_i(\boldsymbol\theta)\}}\, \nabla \pi_i(\boldsymbol\theta)\, \nabla \pi_i(\boldsymbol\theta)^t.$$
Because we seek to maximize rather than minimize $L(\boldsymbol\theta)$, we want $-d^2 L(\boldsymbol\theta)$ to be positive definite. Fortunately, both approximations fulfil this requirement. The second approximation leads to the scoring algorithm discussed later.

Example 5. Poisson multigraph model

In a graph, the number of edges between any two nodes is 0 or 1. A multigraph allows an arbitrary number of edges between any two nodes. Multigraphs are natural structures for modelling the internet and gene and protein networks. Here, we consider a multigraph with a random number of edges *x*_{ij} connecting every pair of nodes {*i*,*j*}. In particular, we assume that the *x*_{ij} are independent Poisson random variables with means *μ*_{ij}. As a plausible model for ranking nodes, we take *μ*_{ij} = *θ*_{i}*θ*_{j}, where *θ*_{i} and *θ*_{j} are non-negative propensities (Ranola *et al*., 2010). The log likelihood of the observed edge counts *x*_{ij} = *x*_{ji} amounts to

$$L(\boldsymbol\theta) = \sum_{\{i,j\}} \bigl(x_{ij} \ln \theta_i \theta_j - \theta_i \theta_j\bigr) + \text{constant}.$$

The score vector has entries
$$\frac{\partial}{\partial \theta_i} L(\boldsymbol\theta) = \sum_{j \ne i} \Bigl(\frac{x_{ij}}{\theta_i} - \theta_j\Bigr),$$
and the observed information matrix has entries
$$-\frac{\partial^2}{\partial \theta_i^2} L(\boldsymbol\theta) = \sum_{j \ne i} \frac{x_{ij}}{\theta_i^2}, \qquad -\frac{\partial^2}{\partial \theta_i \partial \theta_j} L(\boldsymbol\theta) = 1 \quad (j \ne i).$$
For $p$ nodes, the matrix $-d^2 L(\boldsymbol\theta)$ is $p \times p$, and inverting it seems out of the question when $p$ is large. Fortunately, the Sherman–Morrison formula comes to the rescue. If we write $-d^2 L(\boldsymbol\theta)$ as $\boldsymbol D + \boldsymbol 1 \boldsymbol 1^t$ with $\boldsymbol D$ diagonal, then the explicit inverse
$$(\boldsymbol D + \boldsymbol 1 \boldsymbol 1^t)^{-1} = \boldsymbol D^{-1} - \frac{1}{1 + \boldsymbol 1^t \boldsymbol D^{-1} \boldsymbol 1}\, \boldsymbol D^{-1} \boldsymbol 1 \boldsymbol 1^t \boldsymbol D^{-1}$$
is available. This makes Newton's method trivial to implement as long as one respects the bounds $\theta_i \ge 0$. More generally, it is always cheap to invert a low-rank perturbation of an explicitly invertible matrix.

In maximum likelihood estimation, the method of steepest ascent replaces the observed information matrix $-d^2 L(\boldsymbol\theta)$ by the identity matrix $\boldsymbol I$. Fisher's scoring algorithm makes the far more effective choice of replacing the observed information matrix by the expected information matrix $J(\boldsymbol\theta) = \mathrm{E}[-d^2 L(\boldsymbol\theta)]$ (Osborne, 1992). The alternative representation $J(\boldsymbol\theta) = \mathrm{Var}[\nabla L(\boldsymbol\theta)]$ of $J(\boldsymbol\theta)$ as a variance matrix demonstrates that it is positive semidefinite. Usually it is positive definite as well and serves as an excellent substitute for $-d^2 L(\boldsymbol\theta)$ in Newton's method. The inverse matrices $J(\hat{\boldsymbol\theta})^{-1}$ and $-d^2 L(\hat{\boldsymbol\theta})^{-1}$ immediately supply the asymptotic variances and covariances of the maximum likelihood estimate $\hat{\boldsymbol\theta}$ (Rao, 1973).

The score and expected information simplify considerably for exponential families of densities (Bradley, 1973; Charnes *et al*., 1976; Green, 1984; Jennrich & Moore, 1975; Nelder & Wedderburn, 1972). Recall that the density of a vector random variable $\boldsymbol Y$ from an exponential family can be written as
$$f(\boldsymbol y \mid \boldsymbol\theta) = g(\boldsymbol y)\, e^{\beta(\boldsymbol\theta) + h(\boldsymbol y)^t \gamma(\boldsymbol\theta)} \tag{4}$$
relative to some measure $\nu$ (Dobson, 1990; Rao, 1973). The function $h(\boldsymbol y)$ in (4) is the sufficient statistic. The maximum likelihood estimate of the parameter vector $\boldsymbol\theta$ depends on an observation $\boldsymbol y$ only through $h(\boldsymbol y)$. Predictors of $\boldsymbol Y$ are incorporated into the functions $\beta(\boldsymbol\theta)$ and $\gamma(\boldsymbol\theta)$. If $\gamma(\boldsymbol\theta)$ is linear in $\boldsymbol\theta$, then $J(\boldsymbol\theta) = -d^2 L(\boldsymbol\theta) = -d^2 \beta(\boldsymbol\theta)$, and scoring coincides with Newton's method. If in addition $J(\boldsymbol\theta)$ is positive definite, then $L(\boldsymbol\theta)$ is strictly concave and possesses at most a single local maximum, which is necessarily the global maximum.

Both the score vector and expected information matrix can be expressed succinctly in terms of the mean vector $\mu(\boldsymbol\theta) = \mathrm{E}[h(\boldsymbol Y)]$ and the variance matrix $\Sigma(\boldsymbol\theta) = \mathrm{Var}[h(\boldsymbol Y)]$ of the sufficient statistic. Standard arguments show that
$$\nabla L(\boldsymbol\theta) = d\gamma(\boldsymbol\theta)^t \bigl[h(\boldsymbol y) - \mu(\boldsymbol\theta)\bigr], \qquad J(\boldsymbol\theta) = d\gamma(\boldsymbol\theta)^t\, \Sigma(\boldsymbol\theta)\, d\gamma(\boldsymbol\theta).$$
These formulas have had an enormous impact on non-linear regression and fitting generalized linear models. Applied statistics as we know it would be nearly impossible without them. Implementation of scoring is almost always safeguarded by step halving and upgraded to handle linear constraints and parameter bounds. The notion of quadratic approximation is still the key, but each step of constrained scoring must solve a quadratic programme.

In parallel with developments in statistics, numerical analysts sought substitutes for Newton's method. Their efforts a generation ago focused on quasi-Newton methods for generic smooth functions (Dennis & Schnabel, 1996; Nocedal & Wright, 2006). Once again, the core idea was successive quadratic approximation. A good quasi-Newton method (a) minimizes a quadratic function $f(\boldsymbol\theta)$ from $\mathbb{R}^p$ to $\mathbb{R}$ in $p$ steps, (b) avoids evaluation of $d^2 f(\boldsymbol\theta)$, (c) adapts readily to simple parameter constraints, and (d) exploits inexact line searches.

*θ*Quasi-Newton methods update the current approximation *H*_{n} to the second differential *d*^{2}*f*(** θ**) of an objective function

*f*(

**) by a rank-one or rank-two perturbation satisfying a secant condition. The secant condition captures the first-order Taylor approximation**

*θ*If we define the gradient and argument differences

then the secant condition reads *H*_{n + 1}*d*_{n} = *g*_{n}. Davidon (1959) discovered that the unique symmetric rank-one update to *H*_{n} satisfying the secant condition is

where the constant *c*_{n} and the vector *v*_{n} are determined by

When the inner product (*H*_{n}*d*_{n} − *g*_{n})^{t}*d*_{n} is too close to 0, there are two possibilities. Either the secant adjustment is ignored, and the value *H*_{n} is retained for *H*_{n + 1}, or one resorts to a trust region strategy (Nocedal & Wright, 2006).
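The symmetric rank-one update can be written in a few lines (our own sketch), with the customary skip when the inner product in the denominator is too close to 0.

```python
import numpy as np

def sr1_update(H, d, g, tol=1e-8):
    """Davidon's symmetric rank-one update: the unique symmetric
    rank-one change to H that enforces the secant condition H_new d = g."""
    v = g - H @ d
    denom = v @ d
    if abs(denom) <= tol * np.linalg.norm(v) * np.linalg.norm(d):
        return H                       # ignore the secant adjustment
    return H + np.outer(v, v) / denom

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
H = M + M.T                            # arbitrary symmetric starting matrix
d = rng.standard_normal(4)
g = rng.standard_normal(4)
H_new = sr1_update(H, d, g)
```

Unlike the BFGS update, nothing forces the updated matrix to stay positive definite, which is why the update is usually paired with a trust region safeguard.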

In the trust region method, one minimizes the quadratic approximation to $f(\boldsymbol\theta)$ subject to the spherical constraint $\|\boldsymbol\theta - \boldsymbol\theta_n\|^2 \le r^2$ for a fixed radius $r$. This constrained optimization problem has a solution regardless of whether $\boldsymbol H_n$ is positive definite. Working within a trust region prevents absurdly large steps in the early stages of minimization. With appropriate safeguards, some numerical analysts (Conn *et al*., 1991; Khalfan *et al*., 1993) consider Davidon's rank-one update superior to the widely used BFGS update, named after Broyden, Fletcher, Goldfarb, and Shanno. This rank-two perturbation is guaranteed to maintain positive definiteness and is better understood theoretically than the symmetric rank-one update. Also of interest is the Davidon, Fletcher, and Powell (DFP) rank-two update, which applies to the inverse of $\boldsymbol H_n$. Although the DFP update ostensibly avoids matrix inversion, the consensus is that the BFGS update is superior to it in numerical practice (Dennis & Schnabel, 1996).

### The MM and EM Algorithms


The numerical analysts Ortega & Rheinboldt (1970) first articulated the MM principle; de Leeuw (1977) saw its potential and created the first MM algorithm. The MM algorithm currently enjoys its greatest vogue in computational statistics (Hunter & Lange, 2004; Lange *et al*., 2000; Wu & Lange, 2010). The basic idea is to convert a hard optimization problem into a sequence of simpler ones. In minimization, the MM principle majorizes the objective function $f(\boldsymbol\theta)$ by a surrogate function $g(\boldsymbol\theta \mid \boldsymbol\theta_n)$ anchored at the current point $\boldsymbol\theta_n$. Majorization combines the tangency condition $g(\boldsymbol\theta_n \mid \boldsymbol\theta_n) = f(\boldsymbol\theta_n)$ and the domination condition $g(\boldsymbol\theta \mid \boldsymbol\theta_n) \ge f(\boldsymbol\theta)$ for all $\boldsymbol\theta$. The next iterate of the MM algorithm is defined to minimize $g(\boldsymbol\theta \mid \boldsymbol\theta_n)$. Because
$$f(\boldsymbol\theta_{n+1}) \le g(\boldsymbol\theta_{n+1} \mid \boldsymbol\theta_n) \le g(\boldsymbol\theta_n \mid \boldsymbol\theta_n) = f(\boldsymbol\theta_n),$$
the MM iterates generate a descent algorithm driving the objective function downhill. Strictly speaking, the descent property depends only on decreasing $g(\boldsymbol\theta \mid \boldsymbol\theta_n)$, not on minimizing it. Constraint satisfaction is automatically enforced in finding $\boldsymbol\theta_{n+1}$. Under appropriate regularity conditions, an MM algorithm is guaranteed to converge to a local minimum of the objective function (Lange, 2010). In maximization, we first minorize and then maximize. Thus, the acronym MM does double duty in the forms majorize–minimize and minorize–maximize.

When it is successful, the MM algorithm simplifies optimization by (a) separating the variables of a problem, (b) avoiding large matrix inversions, (c) linearizing a problem, (d) restoring symmetry, (e) dealing with equality and inequality constraints gracefully, and (f) turning a non-differentiable problem into a smooth problem. The art in devising an MM algorithm lies in choosing a tractable surrogate function $g(\boldsymbol\theta \mid \boldsymbol\theta_n)$ that hugs the objective function $f(\boldsymbol\theta)$ as tightly as possible.

The majorization relation between functions is closed under the formation of sums, non-negative products, limits, and composition with an increasing function. These rules allow one to work piecemeal in simplifying complicated objective functions. Skill in dealing with inequalities is crucial in constructing majorizations. Classical inequalities such as Jensen's inequality, the information inequality, the arithmetic–geometric mean inequality, and the Cauchy–Schwarz inequality prove useful in many problems. The supporting hyperplane property of a convex function and the quadratic upper bound principle of Böhning & Lindsay (1988) also find wide application.

Example 6. An MM algorithm for non-negative least squares

Sha *et al*. (2003) devised an MM algorithm for Example 1. The diagonal terms $a_{ii}\theta_i^2$ they retain as presented. The off-diagonal terms $a_{ij}\theta_i\theta_j$ they majorize according to the sign of the coefficient $a_{ij}$. When the sign of $a_{ij}$ is positive, they apply the majorization
$$xy \le \frac{y_n}{2x_n} x^2 + \frac{x_n}{2y_n} y^2,$$
which is just a rearrangement of the inequality
$$0 \le \Bigl(\sqrt{\tfrac{y_n}{x_n}}\, x - \sqrt{\tfrac{x_n}{y_n}}\, y\Bigr)^2,$$
with equality when $x = x_n$ and $y = y_n$. When the sign of $a_{ij}$ is negative, they apply the majorization
$$-xy \le -x_n y_n \Bigl(1 + \ln \frac{xy}{x_n y_n}\Bigr),$$
which is just a rearrangement of the simple inequality $z \ge 1 + \ln z$ with $z = xy/(x_n y_n)$. The value $z = 1$ gives equality in the inequality. Both majorizations separate parameters and allow one to minimize the surrogate function parameter by parameter. Indeed, if we define matrices $\boldsymbol A^+$ and $\boldsymbol A^-$ with entries $\max\{a_{ij}, 0\}$ and $-\min\{a_{ij}, 0\}$, respectively, then the resulting MM algorithm iterates according to
$$\theta_{n+1,i} = \theta_{ni}\, \frac{-b_i + \sqrt{b_i^2 + 4 (\boldsymbol A^+ \boldsymbol\theta_n)_i (\boldsymbol A^- \boldsymbol\theta_n)_i}}{2 (\boldsymbol A^+ \boldsymbol\theta_n)_i}.$$
All entries of the initial point $\boldsymbol\theta_0$ should be positive; otherwise, the MM algorithm stalls. The updates occur in parallel. In contrast, the cyclic coordinate descent updates are sequential. Figure 1 depicts the progress of the MM algorithm on our non-negative least squares problem.
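A sketch of the multiplicative update (our own code, on the same hypothetical test problem used earlier); note the strictly positive start.

```python
import numpy as np

def nnls_mm(A, b, theta, n_iter=100):
    """Multiplicative MM updates for min (1/2) theta' A theta + b' theta
    over theta >= 0. All coordinates are updated in parallel."""
    Ap = np.maximum(A, 0.0)            # A+ with entries max{a_ij, 0}
    Am = np.maximum(-A, 0.0)           # A- with entries -min{a_ij, 0}
    theta = theta.astype(float).copy()
    for _ in range(n_iter):
        p, m = Ap @ theta, Am @ theta
        theta *= (-b + np.sqrt(b ** 2 + 4.0 * p * m)) / (2.0 * p)
    return theta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.uniform(size=5) + 0.1 * rng.standard_normal(100)
A, b = X.T @ X, -X.T @ y
f = lambda t: 0.5 * t @ A @ t + b @ t
theta0 = np.ones(5)
```

Each sweep drives the objective downhill, as the MM descent property guarantees, though convergence slows for coordinates approaching the 0 boundary.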

Example 7. Locating a gunshot

Locating the time and place of a gunshot is a typical global positioning problem (Strang & Borre, 2012). In a certain city, $m$ sensors located at the points $\boldsymbol x_1, \ldots, \boldsymbol x_m$ are installed. A signal, say a gunshot sound, is sent from an unknown location $\boldsymbol\theta$ at unknown time $\alpha$ and known speed $s$ and arrives at location $j$ at time $y_j$, observed with random measurement error. The problem is to estimate the vector $\boldsymbol\theta$ and the scalar $\alpha$ from the observed data $y_1, \ldots, y_m$. Other problems of this nature include pinpointing the epicentre of an earthquake and the detonation point of a nuclear explosion. This estimation problem can be attacked by a combination of block descent and the MM principle.

If we assume Gaussian random errors, then maximum likelihood estimation reduces to minimizing the criterion
$$f(\boldsymbol\theta, \alpha) = \sum_{j=1}^m \Bigl(y_j - \alpha - \frac{1}{s}\|\boldsymbol\theta - \boldsymbol x_j\|\Bigr)^2 = \frac{1}{s^2} \sum_{j=1}^m \bigl(s y_j - s\alpha - \|\boldsymbol\theta - \boldsymbol x_j\|\bigr)^2.$$
The equivalence of the two representations of $f(\boldsymbol\theta, \alpha)$ shows that it suffices to solve the problem with speed $s = 1$. In the remaining discussion, we make this assumption. For a fixed $\boldsymbol\theta$, estimation of $\alpha$ reduces to a least squares problem with the obvious solution
$$\alpha = \frac{1}{m} \sum_{j=1}^m \bigl(y_j - \|\boldsymbol\theta - \boldsymbol x_j\|\bigr).$$
To update $\boldsymbol\theta$ with a fixed $\alpha$, we rewrite $f(\boldsymbol\theta, \alpha)$ as
$$f(\boldsymbol\theta, \alpha) = \sum_{j=1}^m \bigl[(y_j - \alpha)^2 - 2(y_j - \alpha)\|\boldsymbol\theta - \boldsymbol x_j\| + \|\boldsymbol\theta - \boldsymbol x_j\|^2\bigr].$$
The middle terms $-2(y_j - \alpha)\|\boldsymbol\theta - \boldsymbol x_j\|$ are awkward to deal with in minimization. Depending on the sign of the coefficient $-2(y_j - \alpha)$, we majorize them in two different ways. If the sign is negative, then we employ the Cauchy–Schwarz majorization
$$-\|\boldsymbol\theta - \boldsymbol x_j\| \le -\frac{(\boldsymbol\theta - \boldsymbol x_j)^t (\boldsymbol\theta_n - \boldsymbol x_j)}{\|\boldsymbol\theta_n - \boldsymbol x_j\|}.$$
If the sign is positive, then we employ the more subtle majorization
$$\|\boldsymbol\theta - \boldsymbol x_j\| \le \frac{\|\boldsymbol\theta - \boldsymbol x_j\|^2}{2\|\boldsymbol\theta_n - \boldsymbol x_j\|} + \frac{\|\boldsymbol\theta_n - \boldsymbol x_j\|}{2}.$$
To derive this second majorization, note that $\sqrt{u}$ is a concave function on $(0, \infty)$. It therefore satisfies the dominating hyperplane inequality
$$\sqrt{u} \le \sqrt{u_n} + \frac{1}{2\sqrt{u_n}}(u - u_n).$$
Now substitute $\|\boldsymbol\theta - \boldsymbol x_j\|^2$ for $u$ and $\|\boldsymbol\theta_n - \boldsymbol x_j\|^2$ for $u_n$. These manoeuvres separate parameters and reduce the surrogate to a sum of linear terms and squared Euclidean norms. The minimization of the surrogate yields the MM update
$$\boldsymbol\theta_{n+1} = \Biggl[\sum_{j=1}^m \boldsymbol x_j + \sum_{j: y_j > \alpha} (y_j - \alpha) \frac{\boldsymbol\theta_n - \boldsymbol x_j}{\|\boldsymbol\theta_n - \boldsymbol x_j\|} + \sum_{j: \alpha > y_j} \frac{\alpha - y_j}{\|\boldsymbol\theta_n - \boldsymbol x_j\|} \boldsymbol x_j\Biggr] \Bigg/ \Biggl[m + \sum_{j: \alpha > y_j} \frac{\alpha - y_j}{\|\boldsymbol\theta_n - \boldsymbol x_j\|}\Biggr]$$
of $\boldsymbol\theta$ for a fixed $\alpha$. The condition $\alpha > y_j$ in this update is usually vacuous. By design, $f(\boldsymbol\theta, \alpha)$ decreases after each cycle of updating $\alpha$ and $\boldsymbol\theta$.
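The full block descent scheme, with exact $\alpha$ updates interleaved with MM updates of $\boldsymbol\theta$, can be sketched as follows. This is our own illustration on simulated noiseless data from a hypothetical sensor layout, with speed $s = 1$.

```python
import numpy as np

def locate_source(x, y, theta, n_iter=1000):
    """Block descent for the gunshot criterion (speed s = 1):
    exact least squares update of alpha, then one MM update of theta."""
    m = len(y)
    for _ in range(n_iter):
        dist = np.linalg.norm(theta - x, axis=1)
        alpha = np.mean(y - dist)            # exact update of alpha
        w = y - alpha
        pos, neg = w > 0, w <= 0             # which majorization applies
        u = (theta - x) / dist[:, None]      # unit vectors from x_j to theta
        c = -w[neg] / dist[neg]              # weights (alpha - y_j)/||theta - x_j||
        num = (x.sum(axis=0) + (w[pos, None] * u[pos]).sum(axis=0)
               + (c[:, None] * x[neg]).sum(axis=0))
        theta = num / (m + c.sum())          # MM update of theta
    return theta, alpha

# noiseless simulated data: y_j = alpha + distance to sensor j
sensors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, -1.0]])
truth, t0 = np.array([1.2, 2.3]), 5.0
times = t0 + np.linalg.norm(truth - sensors, axis=1)
theta, alpha = locate_source(sensors, times, np.array([2.0, 2.0]))
```

With noiseless data and sensors surrounding the source, the iterates drive the criterion to its global minimum of 0 and recover both the location and the emission time.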

*θ*The celebrated EM algorithm is one the most potent optimization tools in the statistician's toolkit (Dempster *et al*., 1977; McLachlan & Krishnan, 2008). The E step in the EM algorithm creates a surrogate function, the *q* function in the literature, that minorizes the log likelihood. Thus, every EM algorithm is an MM algorithm. If ** Y** is the observed data and

**is the complete data, then the**

*X**q*function is defined as the conditional expectation

where *f*(**X** ∣ **θ**) denotes the complete data log likelihood, upper case letters indicate random vectors, and lower case letters indicate corresponding realizations of these random vectors. In the M step of the EM algorithm, one calculates the next iterate *θ*_{n + 1} by maximizing *Q*(**θ** ∣ *θ*_{n}) with respect to **θ**.

Example 8. MM versus EM for the Dirichlet-multinomial distribution

When multivariate count data exhibit overdispersion, the Dirichlet-multinomial distribution is preferred to the multinomial distribution. In the Dirichlet-multinomial model, the multinomial probabilities **p** = (*p*_{1}, … , *p*_{d}) follow a Dirichlet distribution with parameter vector **α** = (*α*_{1}, … , *α*_{d}) having positive components. For a multivariate count vector **x** = (*x*_{1}, … , *x*_{d}) with batch size | **x** | = *x*_{1} + ⋯ + *x*_{d}, the probability mass function is accordingly

- (5)

where Δ_{d} is the unit simplex in *d* dimensions, | **α** | equals *α*_{1} + ⋯ + *α*_{d}, and (*a*)_{k} = *a*(*a* + 1)⋯(*a* + *k* − 1) denotes a rising factorial. The last equality in (5) follows from the factorial property Γ(*a* + 1) ∕ Γ(*a*) = *a* of the gamma function. Given independent data points *x*_{1}, … , *x*_{m}, the log likelihood is

The lack of concavity of *L*(** α**) may cause instability in Newton's method when it is started far from the optimal point. Fisher's scoring algorithm is computationally prohibitive because calculation of the expected information matrix involves numerous evaluations of beta-binomial tail probabilities. The ascent property makes EM and MM algorithms attractive.

In deriving an EM algorithm, we treat the unobserved multinomial probabilities *p*_{j} in each case as missing data. The complete data likelihood is then the integrand in the integral (5). A straightforward calculation shows that **p** possesses a posterior Dirichlet distribution with parameters *α*_{1} + *x*_{i1} through *α*_{d} + *x*_{id} for case *i*. If we now differentiate the identity

with respect to *α*_{j}, then the identity

emerges, where Ψ(*z*) = Γ ′ (*z*) ∕ Γ(*z*) is the digamma function. It follows that up to an irrelevant additive constant, the surrogate function is

Maximizing *Q*(**α** ∣ *α*_{n}) is non-trivial because it involves special functions and intertwining of the *α*_{j} parameters.

Directly invoking the MM principle produces a more malleable surrogate function (Zhou & Lange, 2010). Consider the logarithm of the third form of the likelihood function (5). Applying Jensen's inequality to ln(*α*_{j} + *k*) gives

Likewise, applying the supporting hyperplane inequality to − ln( | **α** | + *k*) gives

Overall, these minorizations yield the surrogate function

which completely separates the parameters *α*_{j}. This suggests the simple MM updates

The positivity constraints are always satisfied when all initial values *α*_{0j} > 0. Parameter separation can be achieved in the EM algorithm by a further minorization of the lnΓ( | **α** | ) term in *Q*(**α** ∣ *α*_{n}). This action yields a viable EM–MM hybrid algorithm. The study of Zhou & Yang (2012) contains more details and a comparison of the convergence rates of the three algorithms.
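To make the separated MM updates concrete, the sketch below (our own illustration, with made-up counts) divides a Jensen-based numerator by a shared supporting-hyperplane denominator for each *α*_{j} and relies on the ascent property for numerical stability:

```python
import math

def dm_loglik(alpha, data):
    """Dirichlet-multinomial log likelihood, ignoring the constant
    multinomial coefficients."""
    a = sum(alpha)
    ll = 0.0
    for x in data:
        m = sum(x)
        ll += math.lgamma(a) - math.lgamma(a + m)
        for aj, xj in zip(alpha, x):
            ll += math.lgamma(aj + xj) - math.lgamma(aj)
    return ll

def dm_mm_update(alpha, data):
    """One MM update: each numerator comes from the Jensen minorization of
    ln(alpha_j + k); the shared denominator comes from the supporting
    hyperplane minorization of -ln(|alpha| + k)."""
    a = sum(alpha)
    denom = sum(1.0 / (a + k) for x in data for k in range(sum(x)))
    new = []
    for j, aj in enumerate(alpha):
        numer = sum(aj / (aj + k) for x in data for k in range(x[j]))
        new.append(numer / denom)
    return new

# toy overdispersed count vectors (hypothetical data)
data = [(5, 1, 0), (3, 2, 1), (6, 0, 2), (4, 3, 0)]
alpha = [1.0, 1.0, 1.0]
for _ in range(200):
    alpha = dm_mm_update(alpha, data)
```

Each sweep touches every observation once, and the updates keep all components positive automatically, in line with the discussion above.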

Finally, let us mention various strategies for handling exceptional cases. In the MM algorithm, it may be impossible to optimize the surrogate function *g*(**θ** ∣ *θ*_{n}) explicitly. There are two obvious remedies. One is to institute some form of block relaxation in updating *g*(**θ** ∣ *θ*_{n}) (Meng & Rubin, 1993). There is no need to iterate to convergence because the purpose is merely to improve *g*(**θ** ∣ *θ*_{n}) and hence the objective function *f*(**θ**). Another obvious remedy is to optimize the surrogate function by Newton's method. It turns out that a single step of Newton's method suffices to preserve the local rate of convergence of the MM algorithm (Lange, 1995). The ascent property is sacrificed initially, but it kicks in as one approaches the optimal point. In an unconstrained problem, this variant MM algorithm can be phrased as

where the substitution of ∇ *f*(*θ*_{n}) for ∇ *g*(*θ*_{n} ∣ *θ*_{n}) is justified by the tangency and domination conditions satisfied by *g*(**θ** ∣ *θ*_{n}) and *f*(**θ**).

A more pressing concern in the EM algorithm is intractability of the E step. If *f*(**X** ∣ **θ**) denotes the complete data likelihood, then in the stochastic EM algorithm (Jank, 2006; Robert & Casella, 2004; Wei & Tanner, 1990), one estimates the surrogate function by a Monte Carlo average

- (6)

over realizations *x*_{i} of the complete data **X** conditional on the observed data **Y** = **y** and the current parameter iterate *θ*_{n}. Sampling can be performed by rejection sampling, importance sampling, Markov chain Monte Carlo, or quasi-Monte Carlo. The next iterate *θ*_{n + 1} should maximize the average (6). The sample size *m* should increase as the iteration count *n* increases. Determining the rate of increase of *m* and setting a reasonable convergence criterion are both subtle issues. The ascent property of the EM algorithm fails because of the inherent sampling noise. The combination of slow convergence and Monte Carlo sampling makes the stochastic EM algorithm unattractive in large-scale problems. In smaller problems, it fills a useful niche.

The stochastic EM algorithm generalizes the Robbins–Monro algorithm (Robbins & Monro, 1951) for root finding and the Kiefer–Wolfowitz algorithm (Kiefer & Wolfowitz, 1952) for function maximization. In unconstrained maximum likelihood estimation, one seeks a root of the likelihood equation, so both methods are relevant. Under suitable assumptions, the Kiefer–Wolfowitz algorithm converges to a local maximum almost surely. Because this cluster of topics is tangential to our overall emphasis on deterministic methods of optimization, we refer readers to the books of Chen (2002), Kushner & Yin (2003), and Robert & Casella (2004) for a fuller discussion.

### Penalization


Penalization is a device for imposing parsimony. For purposes of illustration, we discuss two penalized estimation problems of considerable utility in applied statistics. Both of these examples generate convex programmes with non-differentiable objective functions. In the interests of accessibility, we will derive estimation algorithms for both problems without invoking the machinery of convex analysis.

Example 9. Lasso penalized regression

Lasso penalized regression has been pursued for a long time in many application areas (Chen *et al*., 1998; Claerbout & Muir, 1973; Donoho & Johnstone, 1994; Santosa & Symes, 1986; Taylor *et al*., 1979; Tibshirani, 1996). Modern versions consider a generalized linear model where *y*_{i} is the response for case *i*, *x*_{ij} is the value of predictor *j* for case *i*, and *θ*_{j} is the regression coefficient corresponding to predictor *j*. When the number of predictors *p* exceeds the number of cases *m*, **θ** cannot be uniquely estimated. In an era of big data, this quandary is fairly common. One remedy is to perform model selection by imposing a lasso penalty on the loss function *ℓ*(**θ**). In least squares estimation, *ℓ*(**θ**) is half the sum of squared residuals. For a generalized linear model (Park & Hastie, 2007), *ℓ*(**θ**) is the negative log likelihood of the data. Lasso penalized estimation minimizes the criterion

where the non-negative weights *w*_{j} and the tuning constant *ρ* > 0 are given. If *θ*_{j} is the intercept for the model, then its weight *w*_{j} is usually set to 0. For the remaining predictors, the choice *w*_{j} = 1 is reasonable provided the predictors are standardized to have mean 0 and variance 1. To improve the asymptotic properties of the lasso estimates, the adaptive lasso (Zou, 2006) defines the weights in terms of any consistent estimate of *θ*_{j}. In a Bayesian context, imposing a lasso penalty is equivalent to placing a Laplace prior with mean 0 on each *θ*_{j}. The elastic net adds a ridge penalty to the lasso penalty (Zou & Hastie, 2005).

The primary difference between lasso and ridge regression is that the lasso penalty forces most parameters to 0, whereas the ridge penalty merely reduces them. Thus, the ridge penalty relaxes its grip too quickly for model selection. Unfortunately, the lasso penalty tends to select one predictor from a group of correlated predictors and ignore the others. The elastic net ameliorates this defect. To overcome severe shrinkage, many statisticians discard penalties after the conclusion of model selection and re-estimate the selected parameters. Cross-validation and stability selection are effective in choosing the penalty tuning constant and the selected predictors, respectively (Hastie *et al*., 2009; Meinshausen & Bühlmann, 2010).

Coordinate descent works particularly well when only a few predictors enter a model (Friedman *et al*., 2007; Wu & Lange, 2008). Consider what happens when we visit parameter *θ*_{j} and the loss function is the least squares criterion. If we define the amended response by subtracting from *y*_{i} the contributions of the remaining predictors, then the problem reduces to minimizing

Now divide the domain of *θ*_{j} into the two intervals ( − ∞ ,0] and [0, ∞ ). On the right interval, elementary calculus suggests the update

This update is invalid when it is negative and must then be replaced by 0. Likewise, on the left interval, we have the update

unless it is positive. On both intervals, shrinkage pulls the usual least squares estimate towards 0. In underdetermined problems with just a few relevant predictors, most parameters never budge from their starting values of 0. This circumstance plus the complete absence of matrix operations explains the speed of coordinate descent. It inherits its numerical stability from the descent property enjoyed by any coordinate descent algorithm.
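The two half-interval updates combine into a single soft-thresholding step. The sketch below (our own illustration) implements cyclic coordinate descent for the least squares lasso, maintaining the residual so that each coordinate visit costs one inner product:

```python
import numpy as np

def lasso_cd(X, y, rho, w=None, n_sweeps=100):
    """Cyclic coordinate descent for 1/2 ||y - X theta||^2 + rho * sum_j w_j |theta_j|.
    Each coordinate update soft-thresholds the least squares solution for
    theta_j with the other coordinates held fixed."""
    n, p = X.shape
    w = np.ones(p) if w is None else w
    theta = np.zeros(p)
    r = y - X @ theta              # current residual
    col_ss = (X ** 2).sum(axis=0)  # column sums of squares
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * theta[j]          # remove predictor j's contribution
            b = X[:, j] @ r                  # inner product with amended response
            theta[j] = np.sign(b) * max(abs(b) - rho * w[j], 0.0) / col_ss[j]
            r -= X[:, j] * theta[j]          # restore contribution
    return theta
```

Because each update solves its one-dimensional problem exactly, every sweep decreases the penalized criterion, which is the descent property invoked above.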

With a generalized linear model, say logistic regression, the same story plays out. Now, however, we must institute a line search for the minimum on each of the two half-intervals. Newton's method, scoring, and even golden section search work well. When *f*(**θ**) is convex and *θ*_{j} = 0, it is prudent to check the forward directional derivatives along the current coordinate direction *e*_{j} and its negative. If both forward directional derivatives are non-negative, then no progress can be made by moving off 0. Thus, a parameter parked at 0 is left there. Other computational savings are possible that make coordinate descent even faster. For example, computations can be organized around the linear predictor for each case *i*. When *θ*_{j} changes, it is trivial to update this inner product. Wu *et al.* (2009) and Wu & Lange (2008) illustrate the potential of coordinate descent on some concrete genetic examples.

Example 10. Matrix completion

The matrix completion problem became famous when the movie distribution company Netflix offered a million dollar prize for improvements to its movie rating system (ACM SIGKDD and Netflix, 2007). The idea was that customers would submit ratings on a small subset of movie titles, and from these ratings, Netflix would infer their preferences and recommend additional movies for their consideration. Imagine therefore a very sparse matrix **Y** = (*y*_{ij}) whose rows are individuals and whose columns are movies. Completed cells contain a rating from 1 to 5. Most cells are empty and need to be filled in. If the matrix is sufficiently structured and possesses low rank, then it is possible to complete the matrix in a parsimonious way. Although this problem sounds specialized, it has applications far beyond this narrow setting. For example, filling in missing genotypes in genome scans for disease genes benefits from matrix completion (Chi *et al*., 2013).

Following Cai *et al.* (2008), Candès & Tao (2009), Mazumder *et al.* (2010), and Chen *et al.* (2012), let Δ denote the set of index pairs (*i*,*j*) such that *y*_{ij} is observed. The Lagrangian formulation of matrix completion minimizes the criterion

- (7)

with respect to a compatible matrix **X** = (*x*_{ij}) with singular values *σ*_{k}. Recall that the singular value decomposition

represents **X** as a sum of outer products involving a collection of orthogonal left singular vectors *u*_{i}, a corresponding collection of orthogonal right singular vectors *v*_{i}, and a descending sequence of non-negative singular values *σ*_{i}. Alternatively, we can factor **X** in the form *U**Σ**V*^{t} for orthogonal matrices **U** and **V** and a rectangular diagonal matrix **Σ**.

The nuclear norm plays the same role in low-rank matrix approximation that the *ℓ*_{1} norm plays in sparse regression. For a more succinct representation of the criterion (7), we introduce the Frobenius norm

induced by the trace inner product tr(*U**V*^{t}) and the projection operator *P*_{Δ}(**Y**) with entries

In this notation, the criterion (7) becomes

To derive an algorithm for estimating **X**, we again exploit the MM principle. The general idea is to restore the symmetry of the problem by imputing the missing data (Mazumder *et al*., 2010). Suppose *X*_{n} is our current approximation to **X**. We simply replace a missing entry *y*_{ij} of **Y** by the corresponding entry *x*_{nij} of *X*_{n} and add the term 1 ∕ 2 (*x*_{nij} − *x*_{ij})^{2} to the criterion (7). Because the added terms majorize 0, they create a legitimate surrogate function and lead to an MM algorithm. One can rephrase the problem in matrix terms by defining the orthogonal complement of *P*_{Δ}(**Y**) according to the rule *P*_{Δ}^{⊥}(**Y**) = **Y** − *P*_{Δ}(**Y**). The matrix *Z*_{n} = *P*_{Δ}(**Y**) + *P*_{Δ}^{⊥}(*X*_{n}) temporarily completes **Y** and yields the surrogate function

*Y*At this juncture, it is helpful to recall some mathematical facts. First, the Frobenius norm is invariant under left and right multiplication of its argument by an orthogonal matrix. Thus, depends only on the singular values of ** X**. The inner product − tr(

*Z*_{n}

*X*^{t}) presents a greater barrier to progress, but it ultimately succumbs to a matrix analogue of the Cauchy–Schwarz inequality. Fan's inequality says that

for the ordered singular values *ω*_{k} of *Z*_{n} (Borwein & Lewis, 2000). Equality is attained in Fan's inequality if and only if the right and left singular vectors for the two matrices coincide. Thus, in minimizing *g*(** X** ∣

*X*_{n}), we can assume that the singular vectors of

**coincide with those of**

*X*

*Z*_{n}and rewrite the surrogate function as

Application of the forward directional derivative test

for all tangent directions ** ν** identifies the shrunken singular values

as optimal. In practice, one does not have to extract the full singular value decomposition of *Z*_{n}. Only the singular values *ω*_{k} > *ρ* are actually relevant in constructing *X*_{n + 1}.
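Under the shrinkage rule above, each MM iteration completes the matrix and soft-thresholds its singular values. Here is a minimal sketch (our own illustration) that uses a full singular value decomposition for simplicity, although in practice only the singular values exceeding *ρ* matter:

```python
import numpy as np

def soft_impute(Y, mask, rho, n_iter=100):
    """MM iteration for nuclear norm regularized matrix completion.
    mask is True where Y is observed.  Each step fills in the missing
    entries from the current iterate, then soft-thresholds the singular
    values of the completed matrix by rho."""
    X = np.zeros_like(Y)
    for _ in range(n_iter):
        Z = np.where(mask, Y, X)                     # temporarily complete Y
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - rho, 0.0)                 # shrink singular values
        X = (U * s) @ Vt
    return X
```

Because each step minimizes the surrogate exactly, the penalized criterion decreases monotonically from the starting value **X** = **0**.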

In many applications, the underlying structure of the observation matrix **Y** is corrupted by a few noisy entries. This tempts one to approximate **Y** by the sum of a low-rank matrix **X** plus a sparse matrix **W**. To estimate **X** and **W**, we introduce a positive tuning constant *λ* and minimize the criterion

by block descent. We have already indicated how to update **X** for a fixed **W**. To minimize *f*(**X**, **W**) for a fixed **X**, we set *w*_{ij} = 0 for any pair (*i*, *j*) ∉ Δ. Because the remaining **W** parameters separate in *f*(**X**, **W**), the shrinkage updates

are trivial to derive.

### Augmented Lagrangians


The augmented Lagrangian method is one of the best ways of handling parameter constraints (Hestenes, 1969; Nocedal & Wright, 2006; Powell, 1969; Rockafellar, 1973). For the sake of simplicity, we focus on the problem of minimizing *f*(**θ**) subject to the equality constraints *g*_{i}(**θ**) = 0 for *i* = 1, … , *q*. We will ignore inequality constraints and assume that *f*(**θ**) and the *g*_{i}(**θ**) are smooth. At a constrained minimum, the classical Lagrange multiplier rule

- (8)

holds provided the gradients ∇ *g*_{i}(**θ**) are linearly independent. The augmented Lagrangian method optimizes the perturbed function

with respect to **θ**. It then adjusts the current multiplier vector **λ** in the hope of matching the true Lagrange multiplier vector. The penalty term (*ρ* ∕ 2) *g*_{i}(**θ**)^{2} punishes violations of the equality constraint *g*_{i}(**θ**) = 0. At convergence, the gradient *ρg*_{i}(**θ**) ∇ *g*_{i}(**θ**) of (*ρ* ∕ 2) *g*_{i}(**θ**)^{2} vanishes, and we recover the standard multiplier rule (8). This process can only succeed if the degree of penalization *ρ* is sufficiently large.

Thus, we must either take *ρ* initially large or gradually increase it until it hits the finite transition point where the constrained and unconstrained solutions merge. Updating **λ** is more subtle. If *θ*_{n} furnishes the unconstrained minimum of the augmented Lagrangian, then the stationarity condition reads

The last equation motivates the standard update
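To make the outer multiplier iteration concrete, here is a minimal sketch on a hypothetical toy problem: minimizing *f*(*x*, *y*) = *x*^{2} + 2*y*^{2} subject to *x* + *y* = 1. The inner minimization by plain gradient descent and all step sizes are our own illustrative choices, and the sign convention follows the Lagrangian *f* + *λg* + (*ρ* ∕ 2)*g*^{2}.

```python
import numpy as np

# Augmented Lagrangian sketch for min f(theta) subject to g(theta) = 0 on
# the toy problem f(x, y) = x^2 + 2 y^2 with g(x, y) = x + y - 1.
# The exact constrained minimum is (2/3, 1/3).

def grad_aug_lagrangian(theta, lam, rho):
    x, y = theta
    g = x + y - 1.0
    # gradient of f + lam * g + (rho / 2) * g^2
    return np.array([2 * x + lam + rho * g, 4 * y + lam + rho * g])

theta, lam, rho = np.zeros(2), 0.0, 10.0
for _ in range(50):                      # outer multiplier iterations
    for _ in range(2000):                # inner minimization by gradient descent
        theta -= 0.01 * grad_aug_lagrangian(theta, lam, rho)
    lam += rho * (theta[0] + theta[1] - 1.0)   # standard multiplier update
```

The iterates settle on the constrained minimum without sending *ρ* to infinity, which is precisely the advantage of adjusting the multiplier.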

The alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976; Glowinski & Marrocco, 1975) minimizes the sum *f*(**θ**) + *h*(**γ**) subject to the affine constraints *A***θ** + *B***γ** = **c**. Although the objective function is separable in the block variables **θ** and **γ**, the affine constraints frustrate a direct attack. However, the problem is ripe for a combination of the augmented Lagrangian method and a single round of block descent per iteration. The augmented Lagrangian is

Minimization is performed over **θ** and **γ** by block descent before updating the multiplier vector **λ** via

Introduction of block descent simplifies the usual augmented Lagrangian method, which minimizes jointly over **θ** and **γ**. This modest change keeps the convergence theory intact (Boyd *et al*., 2011; Fortin & Glowinski, 1983) and has led to a resurgence in the popularity of ADMM in machine learning (Bien & Tibshirani, 2011; Boyd *et al*., 2011; Chen *et al*., 2012; Qin & Goldfarb, 2012; Richard *et al*., 2012; Xue *et al*., 2012).

Example 11. Fused lasso

The ADMM is helpful in reducing difficult optimization problems to simpler ones. The easiest fused lasso problem minimizes the criterion (Tibshirani *et al*., 2005)

The *ℓ*_{1} penalty on the increments *θ*_{i + 1} − *θ*_{i} favours piecewise constant solutions. Unfortunately, this twist on the standard lasso penalty renders coordinate descent inefficient. We can reformulate the problem as minimizing the criterion subject to the constraint **γ** = *D***θ**, where *D* is the matrix of first differences.

In the augmented Lagrangian framework, updating **θ** amounts to minimizing 1 ∕ 2 ∥ **θ** − **y** ∥^{2} + (*ρ* ∕ 2) ∥ **γ** − (1 ∕ *ρ*)**λ** − *D***θ** ∥^{2}. It is straightforward to solve this least squares problem. Updating **γ** involves minimizing (*ρ* ∕ 2) ∥ *D***θ** − **γ** ∥^{2} + *μ* ∥ **γ** ∥_{1}, which is a standard lasso problem. Thus, ADMM decouples the problematic linear transformation *D***θ** from the lasso penalty.
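A compact sketch of this ADMM scheme follows, with a first difference matrix, a least squares θ-update, a soft-threshold γ-update, and the standard multiplier update; the step parameter *ρ* and iteration count are illustrative choices of our own:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fused_lasso_admm(y, mu, rho=1.0, n_iter=500):
    """ADMM sketch for the easiest fused lasso problem
    1/2 ||theta - y||^2 + mu * sum_i |theta_{i+1} - theta_i|,
    reformulated with gamma = D theta for the first difference matrix D."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)        # (n-1) x n first difference matrix
    theta = y.copy()
    gamma = D @ theta
    lam = np.zeros(n - 1)
    A = np.eye(n) + rho * D.T @ D         # system matrix for the theta update
    for _ in range(n_iter):
        theta = np.linalg.solve(A, y + D.T @ (rho * gamma - lam))
        gamma = soft_threshold(D @ theta + lam / rho, mu / rho)  # lasso step
        lam += rho * (D @ theta - gamma)  # multiplier update
    return theta
```

Note how the awkward composition of the *ℓ*_{1} norm with *D* never appears: the θ-update sees only a least squares problem, and the γ-update sees only a separable lasso problem.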

### Algorithm Acceleration


Many MM and block descent algorithms converge very slowly. In partial compensation, the computational work per iteration may be light. Even so, diminishing the number of iterations until convergence by one or two orders of magnitude is an attractive proposition (Berlinet & Roland, 2007; Jamshidian & Jennrich, 1997; Kuroda & Sakakihara, 2006; Lange, 1995; Roland & Varadhan, 2005; Zhou *et al*., 2011). In this section, we discuss a generic method for accelerating a wide variety of algorithms (Zhou *et al*., 2011). Consider a differentiable algorithm map *θ*_{n + 1} = *A*(*θ*_{n}) for optimizing an objective function *f*(**θ**), and suppose stationary points of *f*(**θ**) correspond to fixed points of *A*(**θ**). Equivalently, stationary points correspond to roots of the equation *B*(**θ**) = **θ** − *A*(**θ**) = **0**. Within this framework, it is natural to apply Newton's method

- (9)

to find the root and accelerate the overall process. This is a realistic expectation because Newton's method converges at a quadratic rate in contrast to the linear rates of MM and block descent algorithms.

There are two principal impediments to implementing algorithm (9) in high dimensions. First, it appears to require evaluation and storage of the Jacobi matrix *dA*(**θ**), whose rows are the differentials of the components of *A*(**θ**). Second, it also appears to require inversion of the matrix **I** − *dA*(**θ**). Both problems can be attacked by secant approximations. Close to the optimal point *θ*_{ ∞ }, the linear approximation

is valid. This suggests that we take two ordinary steps and gather information in the process on the matrix **M** = *dA*(*θ*_{ ∞ }). If we let **v** be the vector *A* ∘ *A*(*θ*_{n}) − *A*(*θ*_{n}) and **u** be the vector *A*(*θ*_{n}) − *θ*_{n}, then the secant condition reads *M***u** = **v**. In practice, it is advisable to exploit multiple secant conditions *M**u*_{i} = *v*_{i} as long as their number does not exceed the number of parameters *p*. The secant conditions can be generated one per iteration over the current and previous *q* − 1 iterations. Let us represent the conditions collectively in the matrix form *M***U** = **V** for **U** = (*u*_{1}, … , *u*_{q}) and **V** = (*v*_{1}, … , *v*_{q}).

The principle of parsimony suggests that we replace **M** by the smallest matrix satisfying the secant conditions. If we pose this problem concretely as minimizing the squared Frobenius norm of **M** subject to the constraints *M***U** = **V**, then a straightforward exercise in Lagrange multipliers gives the solution **M** = **V**(*U*^{t}**U**)^{ − 1}*U*^{t} (Lange, 2010). The matrix **M** has rank at most *q*, and the Sherman–Morrison formula yields the explicit inverse

Fortunately, it involves inverting just the *q* × *q* matrix *U*^{t}**U** − *U*^{t}**V**. Furthermore, the Newton update (9) boils down to

The advantages of this procedure include the following: (a) it avoids large matrix inverses, (b) it relies on matrix times vector multiplication rather than matrix times matrix multiplication, (c) it requires only storage of the small matrices **U** and **V**, and (d) it respects linear parameter constraints. Non-negativity constraints may be violated. The number of secants *q* should be fixed in advance, say between 1 and 15, and the matrices **U** and **V** should be updated by substituting the latest secant pair generated for the earliest secant pair retained. If an accelerated step fails the descent test, then one can revert to the ordinary MM or block descent step.
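The following sketch implements this secant-based acceleration for a smooth algorithm map *A*, retaining the latest *q* secant pairs; the fallback to ordinary iterations when the small linear system is singular mirrors the reversion strategy just described, and the map used in the demonstration is a made-up linear contraction:

```python
import numpy as np

def qn_accelerate(A, theta0, q=2, n_iter=20):
    """Quasi-Newton acceleration of a smooth algorithm map A.  Two ordinary
    steps per iteration supply a secant pair (u, v); the Newton step
    theta - (I - M)^{-1} B(theta) is evaluated through the Sherman-Morrison
    form, which inverts only a small q x q matrix."""
    theta = theta0
    U, V = [], []
    for _ in range(n_iter):
        a1 = A(theta)
        a2 = A(a1)
        U.append(a1 - theta)
        V.append(a2 - a1)
        U, V = U[-q:], V[-q:]            # retain the latest q secant pairs
        Um, Vm = np.column_stack(U), np.column_stack(V)
        try:
            small = Um.T @ Um - Um.T @ Vm
            theta = a1 + Vm @ np.linalg.solve(small, Um.T @ (a1 - theta))
        except np.linalg.LinAlgError:
            theta = a2                   # revert to two ordinary steps
    return theta

# hypothetical linearly convergent algorithm map A(theta) = c + G theta
G = np.diag([0.9, 0.5])
c = np.ones(2)
A = lambda theta: c + G @ theta
theta_acc = qn_accelerate(A, np.zeros(2), q=2, n_iter=2)
```

For a linear map in two dimensions, two secant pairs recover the Jacobian exactly, so the accelerated sequence lands on the fixed point after two iterations, whereas the plain iterates crawl in at the linear rate 0.9.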

Acceleration of non-smooth algorithms is more problematic (Hiriart-Urruty & Lemaréchal, 2001). For gradient descent and its generalizations (Combettes & Wajs, 2005) to non-smooth problems, Nesterov (2007) has suggested a potent acceleration. As noted by Beck & Teboulle (2009), the accelerated iterates in ordinary gradient descent depend on an intermediate scalar *t*_{n} and an intermediate vector **ϕ** according to the formulas

with initial values *t*_{1} = 1 and **ϕ** = *θ*_{0}. In other words, instead of taking a steepest descent step from the current iterate, one takes a steepest descent step from the extrapolated point **ϕ**, which depends on both the current iterate *θ*_{n} and the previous iterate *θ*_{n − 1}. This mysterious extrapolation algorithm can yield impressive speedups for essentially the same computational cost per iteration as gradient descent.
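A minimal sketch of the extrapolated scheme in the Beck–Teboulle form, assuming a step size compatible with the gradient's Lipschitz constant; the quadratic test function in the accompanying check is our own illustration:

```python
import numpy as np

def nesterov_gd(grad, theta0, step, n_iter=100):
    """Accelerated gradient descent: take a gradient step from the
    extrapolated point phi rather than from the current iterate."""
    theta_old = theta0
    phi = theta0
    t = 1.0
    for _ in range(n_iter):
        theta = phi - step * grad(phi)          # gradient step from phi
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        phi = theta + ((t - 1.0) / t_new) * (theta - theta_old)  # extrapolate
        theta_old, t = theta, t_new
    return theta_old
```

On badly conditioned problems, the extrapolation improves the worst-case error decay from *O*(1 ∕ *n*) to *O*(1 ∕ *n*^{2}) at the cost of one extra vector operation per iteration.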

### Discussion


The fault lines in optimization separate smooth from non-smooth problems, unconstrained from constrained problems, and small-scale from large-scale problems. Smooth, unconstrained, and small-scale problems are easy to solve. Mathematical scientists are beginning to tackle non-smooth, constrained, large-scale problems at the opposite end of the difficulty spectrum. The most spectacular successes usually rely on convexity. We can expect further progress because some of the best minds in applied mathematics, computer science, and statistics have taken up the challenge. What is unlikely to occur is the discovery of a universally valid algorithm. Optimization is apt to remain as much art as science for a long time to come.

We have emphasized a few key ideas in this survey. Our examples demonstrate some of the possibilities for mixing and matching the different algorithm themes. Although we cannot predict the future of computational statistics with any certainty, the key ideas mentioned here will not disappear. For instance, penalization is here to stay, the descent property of an algorithm is always desirable, and quadratic approximation will always be superior to linear approximation for smooth functions. As computing devices hit physical constraints, the importance of parallel algorithms will also likely increase. This argues that block descent and parameter-separated MM algorithms will play a larger role in the future (Zhou *et al*., 2010). Although we have de-emphasized convex calculus, readers who want to devise their own algorithms are well advised to learn this inherently subtle subject. There is a difference, after all, between principled algorithms and ad hoc procedures.

### Acknowledgement


This research was supported in part by USPHS grants HG006139 and GM53275.

### References


- ACM SIGKDD and Netflix. (2007). Proceedings of KDD Cup and Workshop. Available online at http://www.cs.uic.edu/liub/Net-flix-KDD-Cup-2007.html.
- Beck, A. & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2, 183–202.
- Berlinet, A. & Roland, C. (2007). Acceleration schemes with application to the EM algorithm. Comp. Statist. Data Anal., 51, 3689–3702.
- Bien, J. & Tibshirani, R. (2011). Sparse estimation of a covariance matrix. Biometrika, 98(4), 807–820.
- Böhning, D. & Lindsay, B.G. (1988). Monotonicity of quadratic approximation algorithms. Ann. Instit. Stat. Math., 40, 641–663.
- Borwein, J.M. & Lewis, A.S. (2000). Convex Analysis and Nonlinear Optimization: Theory and Examples. New York: Springer.
- Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1), 1–122.
- Bradley, E.L. (1973). The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. J. Amer. Statist. Assoc., 68, 199–200.
- Cai, J.-F., Candès, E.J. & Shen, Z. (2008). A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20, 1956–1982.
- Candès, E.J. & Tao, T. (2009). The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inform. Theory, 56, 2053–2080.
- Charnes, A., Frome, E.L. & Yu, P.L. (1976). The equivalence of generalized least squares and maximum likelihood in the exponential family. J. Amer. Stat. Assoc., 71, 169–171.
- Chen, C., He, B. & Yuan, X. (2012). Matrix completion via an alternating direction method. IMA J. Numer. Anal., 32, 227–245.
- Chen, H.F. (2002). Stochastic Approximation and its Applications. Dordrecht: Kluwer.
- Chen, S.S., Donoho, D.L. & Saunders, M.A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20, 33–61.
- Chi, E.C., Zhou, H., Chen, G.K., Ortega Del Vecchyo, D. & Lange, K. (2013). Genotype imputation via matrix completion. Genome Res., 23, 509–518.
- Claerbout, J.F. & Muir, F. (1973). Robust modeling with erratic data. Geophys., 38, 826–844.
- Combettes, P.L. & Wajs, V.R. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul., 4, 1168–1200.
- Conn, A.R., Gould, N.I.M. & Toint, P.L. (1991). Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math. Prog., 50, 177–195.
- Davidon, W.C. (1959). Variable metric methods for minimization. AEC Research and Development Report ANL–5990, Argonne National Laboratory, USA.
- de Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics, Eds. Barra, J.R., Brodeau, F., Romier, G. & Van Cutsem, B., pp. 133–146. Amsterdam: North Holland Publishing Company.
- de Leeuw, J. (1994). Block relaxation algorithms in statistics. In Information Systems and Data Analysis, Eds. Bock, H.H., Lenski, W. & Richter, M.M. New York: Springer.
- Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. B, 39, 1–38.
- Dennis, J.E. & Schnabel, R.B. (1996). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Philadelphia: SIAM.
- Ding, C., Li, T. & Jordan, M.I. (2010). Convex and semi-nonnegative matrix factorizations. IEEE Trans. Pattern Anal. Mach. Intell., 32, 45–55.
- Dobson, A.J. (1990). An Introduction to Generalized Linear Models. London: Chapman & Hall.
- Donoho, D.L. & Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425–455.
- Duchi, J., Shalev-Shwartz, S., Singer, Y. & Chandra, T. (2008). Efficient projections onto the *l*_{1}-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), New York: ACM.
- Fortin, M. & Glowinski, R. (1983). Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems. Amsterdam: North-Holland.
- Friedman, J., Hastie, T., Höfling, H. & Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat., 1, 302–332.
- Gabay, D. & Mercier, B. (1976). A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Comp. Math. Appl., 2, 17–40.
- Gifi, A. (1990). Nonlinear Multivariate Analysis. Hoboken, NJ: Wiley.
- Glowinski, R. & Marrocco, A. (1975). Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Rev. Française d'Aut. Inf. Rech. Opér., 2, 41–76.
- Goldstein, A.A. (1964). Convex programming in Hilbert space. Bull. Amer. Math. Soc., 70, 709–710.
- Green, P.J. (1984). Iteratively reweighted least squares for maximum likelihood estimation and some robust and resistant alternatives (with discussion). J. Roy. Stat. Soc. B, 46, 149–192.
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer.
- Hestenes, M.R. (1969). Multiplier and gradient methods. J. Optim. Theory Appl., 4, 303–320.
- Hiriart-Urruty, J.-B. & Lemaréchal, C. (1996). Convex Analysis and Minimization Algorithms: Part 1: Fundamentals. New York: Springer.
- Hiriart-Urruty, J.-B. & Lemaréchal, C. (2001). Convex Analysis and Minimization Algorithms: Part 2: Advanced Theory and Bundle Methods. New York: Springer.
- Hunter, D.R. & Lange, K. (2004). A tutorial on MM algorithms. Amer. Statist., 58, 30–37.
- Jamshidian, M. & Jennrich, R.I. (1997). Quasi-Newton acceleration of the EM algorithm. J. Roy. Stat. Soc. B, 59, 569–587.
- Jank, W. (2006). Implementing and diagnosing the stochastic approximation EM algorithm. J. Comput. Graph. Statist., 15, 803–829.
- Jennrich, R.I. & Moore, R.H. (1975). Maximum likelihood estimation by means of nonlinear least squares. In *Proceedings of the Statistical Computing Section: Amer. Stat. Assoc.*, Atlanta, Georgia, pp. 57–65.
- Khalfan, H.F., Byrd, R.H. & Schnabel, R.B. (1993). A theoretical and experimental study of the symmetric rank-one update. SIAM J. Optim., 3, 1–24.
- Kiefer, J. & Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Stat., 23, 462–466.
- Kruskal, J.B. (1965). Analysis of factorial experiments by estimating monotone transformations of the data. J. Roy. Stat. Soc. B, 27, 251–263.
- Kuroda, M. & Sakakihara, M. (2006). Accelerating the convergence of the EM algorithm using the vector epsilon algorithm. Comput. Statist. Data Anal., 51, 1549–1561.
- Kushner, H.J. & Yin, G.G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. New York: Springer.
- Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. J. Roy. Stat. Soc. B, 57, 425–437.
- Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm. Statist. Sinica, 5, 1–18.
- Lange, K. (2010). Numerical Analysis for Statisticians, 2nd ed. New York: Springer.
- Lange, K. (2012). Optimization, 2nd ed. New York: Springer.
- Lange, K., Hunter, D.R. & Yang, I. (2000). Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Statist., 9, 1–59.
- Lee, D.D. & Seung, H.S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
- Levitin, E.S. & Polyak, B.T. (1966). Constrained minimization problems. USSR Comput. Math. and Math. Physics, 6, 1–50.
- Mazumder, R., Hastie, T. & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11, 2287–2322.
- McLachlan, G.J. & Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd ed. Hoboken, NJ: Wiley.
- Meinshausen, N. & Bühlmann, P. (2010). Stability selection. J. Roy. Stat. Soc. B, 72, 417–473.
- Meng, X.-L. & Rubin, D.B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80, 267–278.
- Michelot, C. (1986). A finite algorithm for finding the projection of a point onto the canonical simplex in R^{n}. J. Optim. Theory Appl., 50, 195–200.
- Nelder, J.A. & Wedderburn, R.W.M. (1972). Generalized linear models. J. Roy. Stat. Soc. A, 135, 370–384.
- Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. CORE Discussion Papers.
- Nocedal, J. & Wright, S.J. (2006). Numerical Optimization, 2nd ed. New York: Springer.
- Ortega, J.M. & Rheinboldt, W.C. (1970). Iterative Solutions of Nonlinear Equations in Several Variables. New York: Academic.
- Osborne, M.R. (1992). Fisher's method of scoring. Int. Stat. Rev., 60, 99–117.
- 1994). Positive matrix factorization: a non-negative factor model with optimal utilization of error. Environmetrics, 5, 111–126. & (
- 2007).
*ℓ*_{1}-regularization path algorithm for generalized linear models. J. Roy. Stat. Soc. B 69, 659–677. & ( - 1969). A method for nonlinear constraints in minimization problems. In Optimization, Ed. Fletcher R., pp. 283–298. New York: Academic Press. (
- 2012). Structured sparsity via alternating direction methods. J. Mach. Learn. Res.,
**98888**, 1435–1468. & ( - 2010). A Poisson model for random multigraphs. Bioinformatics, 26, 2004–2011. , , , & (
- 1973).Linear Statistical Inference and its Applications, 2nd ed. Hoboken, NJ: Wiley. (
- 2012. Estimation of simultaneously sparse and low rank matrices. In
*Proceedings of the 29th International Conference on Machine Learning (ICML 2012)*, pp. 1351–1358. Edinburgh, Scotland, UK. , & - 2004). Monte Carlo Statistical Methods. New York: Springer. & (
- 1951). A stochastic approximation method. Ann. Math. Stat. 22, 400–407. & (
- 1973). The multiplier method of Hestenes and Powell applied to convex programming. J. Optim. Theory Appl., 12, 555–562. (
- 2005). New iterative schemes for nonlinear fixed point problems, with applications to problems with bifurcations and incomplete-data problems. Appl. Numer. Math., 55, 215–226. & (
- 2006).Nonlinear Optimization. Princeton, NJ: Princeton University Press. (
- 1986). Linear inversion of band-limited reflection seimograms. SIAM J. Sci. Stat. Comput., 7, 1307–1330. & (
- 2003). Multiplicative updates for nonnegative quadratic programming in support vector machines. In Advances in Neural Information Processing Systems, Vol. 15, Eds. Becker S., Thrun S. & Obermayer K., pp. 1065-1073. Cambridge, MA: MIT Press. , & (
- 2012). Algorithms for Global Positioning. Wellesley, MA: Wellesley-Cambridge Press. & (
- 1979). Deconvolution with the
*ℓ*_{1}norm. Geophys., 44, 39–52. , & ( - 2010). Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 11, 311–365. , , & (
- 1996). Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc., Ser. B, 58, 267–28. (
- 2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. B, 67, 91–108. , , , & (
- 1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. JASA, 85, 699–704. & (
- 2009). Genomewide association analysis by lasso penalized logistic regression. Bioinformatics, 25, 714–721. , , , & (
- 2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat., 2, 224–244. & (
- 2010). The MM alternative to EM. Stat. Sci., 25, 492–505. & (
- 2012). Positive definite
*ℓ*_{1}penalized estimation of large covariance matrices. JASA, 107, 1480–1491. , & ( - 2012). EM vs MM: a case study. Comp. Stat. Data Anal., 56, 3909–3920. & (
- 2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statist. Comput., 21, 261–273. , & (
- 2010). MM algorithms for some discrete multivariate distributions. J. Comput. Graph. Statist., 19, 645–665. & (
- 2010). Graphics processing units and high-dimensional optimization. Stat. Sci., 25, 311–324. , & (
- 2006). The adaptive lasso and its oracle properties. JASA, 101, 1418–1429. (
- 2005). Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B, 67, 301–320. & (