Keywords:

  • Block relaxation;
  • Newton's method;
  • MM algorithm;
  • penalization;
  • augmented Lagrangian;
  • acceleration

Summary


Modern computational statistics is turning more and more to high-dimensional optimization to handle the deluge of big data. Once a model is formulated, its parameters can be estimated by optimization. Because model parsimony is important, models routinely include non-differentiable penalty terms such as the lasso. This sober reality complicates minimization and maximization. Our broad survey stresses a few important principles in algorithm design. Rather than view these principles in isolation, it is more productive to mix and match them. A few well-chosen examples illustrate this point. Algorithm derivation is also emphasized, and theory is downplayed, particularly the abstractions of the convex calculus. Thus, our survey should be useful and accessible to a broad audience.

Introduction


Modern statistics represents a confluence of data, algorithms, practical inference, and subject area knowledge. As data mining expands, computational statistics is assuming greater prominence. Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization. Penalties serve as priors and steer parameter estimates in realistic directions. In classical statistics, estimation usually meant least squares and maximum likelihood with smooth objective functions. In a search for sparse representations, mathematical scientists have introduced non-differentiable penalties such as the lasso and the nuclear norm. To survive in this alien terrain, statisticians are being forced to master exotic branches of mathematics such as convex calculus (Hiriart-Urruty & Lemarechal, 1996, 2001). Thus, the uneasy but productive relationship between statistics and mathematics continues, but in a different guise and mediated by new concerns.

The purpose of this survey article is to provide a few glimpses of the new optimization algorithms being crafted by computational statisticians and applied mathematicians. Although a survey of convex calculus for statisticians would certainly be helpful, our emphasis is more concrete. The truth of the matter is that a few broad categories of algorithms dominate. Furthermore, difficult problems require that several algorithmic pieces be assembled into a well-coordinated whole. Put another way, from a handful of basic ideas, computational statisticians often weave a complex tapestry of algorithms that meets the needs of a specific problem. No algorithm category should be dismissed a priori in tackling a new problem. There is plenty of room for creativity and experimentation. Algorithms are made for tinkering. When one part fails or falters, it can be replaced by a faster or more robust part.

This survey will treat the following methods: (a) block descent, (b) steepest descent, (c) Newton's method, quasi-Newton methods, and scoring, (d) the majorize–minimize (MM) and expectation–maximization (EM) algorithms, (e) penalized estimation, (f) the augmented Lagrangian method for constrained optimization, and (g) acceleration of fixed point algorithms. As we have mentioned, often the best algorithms combine several themes. We will illustrate the various themes by a sequence of examples. Although we avoid difficult theory and convergence proofs, we will try to point out along the way a few motivating ideas that stand behind most algorithms. For example, as their name indicates, steepest descent algorithms search along the direction of fastest decrease of the objective function. Newton's method and its variants all rely on the notion of local quadratic approximation, thus correcting the often poor linear approximation of steepest descent. In high dimensions, Newton's method stalls because it involves calculating and inverting large matrices of second derivatives.

The MM and EM algorithms replace the objective function by a simpler surrogate function. By design, optimizing the surrogate function sends the objective function downhill in minimization and uphill in maximization. In constructing the surrogate function for an EM algorithm, statisticians rely on notions of missing data. The more general MM algorithm calls on skills in inequalities and convex analysis. More often than not, concrete problems also involve parameter constraints. Modern penalty methods incorporate the constraints by imposing penalties on the objective function. A tuning parameter scales the strength of the penalties. In the classical penalty method, the constrained solution is recovered as the tuning parameter tends to infinity. In the augmented Lagrangian method, the constrained solution emerges for a finite value of the tuning parameter.

In the remaining sections, we adopt several notational conventions. Vectors and matrices appear in boldface type; for the most part, parameters appear as Greek letters. The differential df(θ) of a scalar-valued function f(θ) equals its row vector of partial derivatives; the transpose ∇ f(θ) of the differential is the gradient. The second differential d2f(θ) is the Hessian matrix of second partial derivatives. The Euclidean norm of a vector b and the spectral norm of a matrix A are denoted by ∥ b ∥ and ∥ A ∥ , respectively. All other norms will be appropriately subscripted. The n-th entry bn of a vector b must be distinguished from the n-th vector bn in a sequence of vectors. To maintain consistency, bni denotes the i-th entry of bn. A similar convention holds for sequences of matrices.

Block Descent


Block relaxation (either block descent or block ascent) divides the parameters into disjoint blocks and cycles through the blocks, updating only those parameters within the pertinent block at each stage of a cycle (de Leeuw, 1994). For the sake of brevity, we consider only block descent. In updating a block, we minimize the objective function over the block. Hence, block descent possesses the desirable descent property of always forcing the objective function downhill. When each block consists of a single parameter, block descent is called cyclic coordinate descent. The coordinate updates need not be explicit. In high-dimensional problems, implementation of one-dimensional Newton searches is often compatible with fast overall convergence. Block descent is best suited to unconstrained problems where the domain of the objective function reduces to a Cartesian product of the subdomains associated with the different blocks. Obviously, exact block updates are a huge advantage. Non-separable constraints can present insuperable barriers to coordinate descent because parameters get locked into place. In some problems, it is advantageous to consider overlapping blocks.

Example 1. Non-negative least squares

For a positive definite matrix A = (aij) and vector b = (bi), consider minimizing the quadratic function

$$f(\theta) = \frac{1}{2}\,\theta^t A\,\theta + b^t\theta$$

subject to the constraints θi ≥ 0 for all i. In the case of least squares, A = XtX and b = − Xty for some design matrix X and response vector y. Equating the partial derivative of f(θ) with respect to θi to 0 gives

$$\frac{\partial}{\partial\theta_i} f(\theta) = \sum_j a_{ij}\theta_j + b_i = 0.$$

Rearrangement now yields the unrestricted minimum

$$\theta_i = -\frac{1}{a_{ii}}\Big(b_i + \sum_{j\neq i} a_{ij}\theta_j\Big).$$

Taking into account the non-negativity constraint, this must be amended to

$$\theta_{n+1,i} = \max\Big\{0,\; -\frac{1}{a_{ii}}\Big(b_i + \sum_{j<i} a_{ij}\theta_{n+1,j} + \sum_{j>i} a_{ij}\theta_{nj}\Big)\Big\}$$

at stage n + 1 to construct the coordinate descent update of θi.
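
To make the update concrete, the following sketch implements cyclic coordinate descent for this problem. It is a minimal illustration, not the authors' code; the function name and the fixed iteration count (standing in for a real convergence test) are our assumptions.

```python
# Cyclic coordinate descent for min 0.5*theta'A theta + b'theta, theta >= 0.
import numpy as np

def nnls_coordinate_descent(A, b, iters=100):
    p = len(b)
    theta = np.zeros(p)
    for _ in range(iters):
        for i in range(p):
            # unrestricted coordinate minimizer, then clamp at the boundary 0
            residual = b[i] + A[i] @ theta - A[i, i] * theta[i]
            theta[i] = max(0.0, -residual / A[i, i])
    return theta
```

For least squares, one calls the routine with A = XtX and b = −Xty.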

Example 2. Matrix factorization by alternating least squares

In the 1960s, Kruskal (1965) applied the method of alternating least squares to factorial analysis of variance. Later, the subject was taken up by de Leeuw & colleagues (1990). Suppose U is an m × q matrix whose columns u1, … ,uq represent data vectors. In many applications, it is reasonable to postulate a reduced number of prototypes v1, … ,vp and write

$$u_j \approx \sum_{k=1}^p w_{kj}\, v_k$$

for certain non-negative weights wkj. The matrix W = (wkj) is p × q. If p is small compared with q, then the representation U ≈ VW compresses the data for easier storage and retrieval. Depending on the circumstances, one may want to add further constraints (Ding et al., 2010). For instance, if the entries of U are non-negative, then it is often reasonable to demand that the entries of V be non-negative as well (Lee & Seung, 1999; Paatero & Tapper, 1994). If we want each uj to equal a convex combination of the prototypes, then constraining the column sums of W to equal 1 is indicated.

One way of estimating V and W is to minimize the squared Frobenius norm

$$\|U - VW\|_F^2 = \sum_{i=1}^m\sum_{j=1}^q\Big(u_{ij} - \sum_{k=1}^p v_{ik}w_{kj}\Big)^2.$$

No explicit solution is known, but alternating least squares offers an iterative attack. If W is fixed, then we can update the i-th row of V by minimizing the sum of squares

$$\sum_{j=1}^q\Big(u_{ij} - \sum_{k=1}^p v_{ik} w_{kj}\Big)^2.$$

Similarly, if V is fixed, then we can update the j-th column of W by minimizing the sum of squares

$$\sum_{i=1}^m\Big(u_{ij} - \sum_{k=1}^p v_{ik} w_{kj}\Big)^2.$$

Thus, block descent solves a sequence of least squares problems, some of which are constrained.
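
A compact sketch of the alternating scheme appears below. It leans on scipy's nnls routine for the non-negative subproblems; the random initialization and fixed iteration count are illustrative assumptions, not part of the original presentation.

```python
# Alternating least squares for U ≈ V W with non-negative weights W.
import numpy as np
from scipy.optimize import nnls

def als_factor(U, p, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, q = U.shape
    V = rng.uniform(size=(m, p))
    for _ in range(iters):
        # update each column of W with V fixed (non-negative least squares)
        W = np.column_stack([nnls(V, U[:, j])[0] for j in range(q)])
        # update V with W fixed (ordinary least squares, solved all at once)
        V = np.linalg.lstsq(W.T, U.T, rcond=None)[0].T
    return V, W
```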

Steepest Descent


The first-order Taylor expansion

$$f(\theta + s\gamma) = f(\theta) + s\, df(\theta)\gamma + o(s)$$

of a differentiable function f(θ) around θ motivates the method of steepest descent. In view of the Cauchy–Schwarz inequality, the choice

$$\gamma = -\frac{\nabla f(\theta)}{\|\nabla f(\theta)\|}$$

minimizes the linear term df(θ)γ of the expansion over the sphere of unit vectors. Of course, if ∇ f(θ) = 0, then θ is a stationary point. The steepest descent algorithm iterates according to

$$\theta_{n+1} = \theta_n - s\,\nabla f(\theta_n) \tag{1}$$

for some scalar s > 0. If s is sufficiently small, then the descent property f(θn + 1) < f(θn) holds. The most sophisticated version of the algorithm determines s by searching for the minimum of the objective function along the direction of steepest descent. Among the many methods of line search, the methods of false position, cubic interpolation, and golden section stand out (Lange, 2012). These are all local search methods, and unless some guarantee of convexity exists, confusion of local and global minima can occur.

The method of steepest descent often exhibits zigzagging and a painfully slow rate of convergence. For these reasons, it was largely replaced in practice by Newton's method and its variants. However, the sheer scale of modern optimization problems has led to a re-evaluation. The avoidance of second derivatives and Hessian approximations is now viewed as a virtue. Furthermore, the method has been generalized to non-differentiable problems by substituting the forward directional derivative

$$d_\nu f(\theta) = \lim_{t\downarrow 0}\frac{f(\theta + t\nu) - f(\theta)}{t}$$

for the gradient (Tao et al., 2010). Here, the idea is to choose a unit search vector ν to minimize dνf(θ). In some instances, this secondary problem can be attacked by linear programming. For a convex problem, the condition dνf(θ) ≥ 0 for all ν is both necessary and sufficient for θ to be a minimum point. If the domain of f(θ) equals a convex set C, then only tangent directions ν = μ − θ with μ ∈ C come into play.

Steepest descent also has a role to play in constrained optimization. Suppose we want to minimize f(θ) subject to the constraint θ ∈ C for some closed convex set. The projected gradient method capitalizes on the steepest descent update (1) by projecting it onto the set C (Goldstein, 1964; Levitin & Polyak, 1966; Ruszczyński, 2006). It is well known that for a point x external to C, there is a closest point PC(x) to x in C. Explicit formulas for the projection operator PC(x) exist when C is a box, Euclidean ball, hyperplane, or half-space. Fast algorithms for computing PC(x) exist for the unit simplex, the ℓ1 ball, and the cone of positive semidefinite matrices (Duchi et al., 2008; Michelot, 1986).

Choice of the scalar s in the update (1) is crucial. Current theory suggests taking s to equal r ∕ L, where L is a Lipschitz constant for the gradient ∇ f(θ) and r belongs to the interval (0,2). In particular, the Lipschitz inequality

$$\|\nabla f(\theta) - \nabla f(\gamma)\| \le L\,\|\theta - \gamma\|$$

is valid for L =  supθ ∥ d2f(θ) ∥ , whenever this quantity is finite. In practice, the Lipschitz constant L must be estimated. Any induced matrix norm ∥ ⋅ ∥  †  can be substituted for the spectral norm ∥ ⋅ ∥ in the defining supremum and will give an upper bound on L.
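
The following sketch combines the update (1) with projection onto the non-negative orthant, the setting of Example 3 below; the step size s = r ∕ L uses the spectral norm bound just described. Function names and the iteration count are illustrative.

```python
# Projected gradient descent for non-negative least squares.
import numpy as np

def nnls_projected_gradient(X, y, r=1.0, iters=500):
    A, b = X.T @ X, -X.T @ y
    L = np.linalg.norm(A, 2)                    # Lipschitz constant of the gradient
    s = r / L
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = A @ theta + b                    # gradient of 0.5 theta'A theta + b'theta
        theta = np.maximum(theta - s * grad, 0.0)   # project onto theta >= 0
    return theta
```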

Example 3. Coordinate descent versus the projected gradient method

As a test problem, we generated a random 100 × 50 design matrix X with independent and identically distributed (i.i.d.) standard normal entries, a random 50 × 1 parameter vector θ with i.i.d. uniform [0,1] entries, and a random 100 × 1 error vector e with i.i.d. standard normal entries. In this setting, the response y = Xθ + e. We then compared coordinate descent, the projected gradient method (for L equal to the spectral radius of XtX and r equal to 1.0, 1.75, and 2.0), and the MM algorithm explained later in Example 6. All computer runs start from the common point θ0 whose entries are filled with i.i.d. uniform [0,1] random deviates. Figure 1 plots the progress of each algorithm as measured by the relative difference

$$\frac{f(\theta_n) - f(\theta_\infty)}{|f(\theta_\infty)| + 1} \tag{2}$$

between the loss at the current iteration and the ultimate loss at convergence. It is interesting how well coordinate descent performs compared with projected gradient descent. The slower convergence of the MM algorithm is probably a consequence of the fact that its multiplicative updates slow down as they approach the 0 boundary. Note also the importance of choosing a good step size in the projected gradient algorithm. Inflated steps accelerate convergence, but excessively inflated steps hamper it.


Figure 1. Comparing the rate of convergence of three algorithms on a non-negative least squares problem. CD, coordinate descent; PG, projected gradient; MM, majorize–minimize.


Variations on Newton's Method


The primary advantage of Newton's method is its speed of convergence in low-dimensional problems. Its many variants seek to retain its fast convergence while taming its defects. The variants all revolve around the core idea of locally approximating the objective function by a strictly convex quadratic. At each iteration, the quadratic approximation is optimized subject to safeguards that keep the iterates from overshooting and veering towards irrelevant stationary points.

Consider minimizing the real-valued function f(θ) defined on an open set S ⊂ Rp. Assuming that f(θ) is twice differentiable, we have the second-order Taylor expansion

$$f(\gamma) = f(\theta) + df(\theta)(\gamma - \theta) + \frac{1}{2}(\gamma - \theta)^t\, d^2 f(\alpha)\,(\gamma - \theta)$$

for some α on the line segment [θ,γ]. This expansion suggests that we substitute d2f(θ) for d2f(α) and approximate f(γ) by the resulting quadratic. If we take this approximation seriously, then we can solve for its minimum point γ as

$$\gamma = \theta - d^2 f(\theta)^{-1}\,\nabla f(\theta).$$

In Newton's method, we iterate according to

$$\theta_{n+1} = \theta_n - s\, d^2 f(\theta_n)^{-1}\,\nabla f(\theta_n) \tag{3}$$

for step length constant s with default value 1. Any stationary point of f(θ) is a fixed point of Newton's method.

There is nothing to prevent Newton's method from heading uphill rather than downhill. The first-order expansion

$$f(\theta_{n+1}) = f(\theta_n) - s\,\nabla f(\theta_n)^t\, d^2 f(\theta_n)^{-1}\,\nabla f(\theta_n) + o(s)$$

makes it clear that the descent property holds provided s > 0 is small enough and the Hessian matrix d2f(θn) is positive definite. When d2f(θn) is not positive definite, it is usually replaced by a positive definite approximation Hn in the update (3).

Backtracking is crucial to avoid overshooting. In the step-halving version of backtracking, one starts with s = 1. If the descent property holds, then one takes the Newton step. Otherwise, s ∕ 2 is substituted for s, θn + 1 is recalculated, and the descent property is rechecked. Eventually, a small enough s is generated to guarantee f(θn + 1) < f(θn).
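
The safeguarded iteration is easy to express in code. The sketch below assumes the user supplies callables for the objective, gradient, and Hessian; it illustrates step halving only and is not a production implementation.

```python
# Newton's method with step-halving backtracking.
import numpy as np

def newton_step_halving(f, grad, hess, theta, iters=25):
    for _ in range(iters):
        direction = np.linalg.solve(hess(theta), grad(theta))
        s = 1.0
        # backtrack until the descent property f(new) < f(old) holds
        while f(theta - s * direction) >= f(theta) and s > 1e-10:
            s /= 2.0
        theta = theta - s * direction
    return theta
```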

In the next two examples, we adopt standard statistical language. The outcome of a statistical experiment is summarized by a log likelihood L(θ). Its gradient ∇ L(θ) is called the score, and its second differential d2L(θ), after a change in sign, is called the observed information. In maximum likelihood estimation, one maximizes L(θ) with respect to the parameter vector θ.

Example 4. Newton's method for binomial regression

Consider binomial regression with m independent responses y1, … ,ym. Each yi represents a count between 0 and ki with success probability πi(θ) per trial. The log likelihood, score, and observed information amount to

$$L(\theta) = \sum_{i=1}^m\big[y_i\ln\pi_i(\theta) + (k_i - y_i)\ln\{1 - \pi_i(\theta)\}\big]$$
$$\nabla L(\theta) = \sum_{i=1}^m\Big[\frac{y_i}{\pi_i(\theta)} - \frac{k_i - y_i}{1 - \pi_i(\theta)}\Big]\nabla\pi_i(\theta)$$
$$-d^2L(\theta) = \sum_{i=1}^m\Big[\frac{y_i}{\pi_i(\theta)^2} + \frac{k_i - y_i}{\{1 - \pi_i(\theta)\}^2}\Big]\nabla\pi_i(\theta)\, d\pi_i(\theta) - \sum_{i=1}^m\Big[\frac{y_i}{\pi_i(\theta)} - \frac{k_i - y_i}{1 - \pi_i(\theta)}\Big] d^2\pi_i(\theta).$$

Because E(yi) = kiπi(θ), the observed information can be approximated by

$$\sum_{i=1}^m\Big[\frac{y_i}{\pi_i(\theta)^2} + \frac{k_i - y_i}{\{1 - \pi_i(\theta)\}^2}\Big]\nabla\pi_i(\theta)\, d\pi_i(\theta) \quad\text{or}\quad \sum_{i=1}^m\frac{k_i}{\pi_i(\theta)\{1 - \pi_i(\theta)\}}\,\nabla\pi_i(\theta)\, d\pi_i(\theta).$$

Because we seek to maximize rather than minimize L(θ), we want − d2L(θ) to be positive definite. Fortunately, both approximations fulfil this requirement. The second approximation leads to the scoring algorithm discussed later.

Example 5. Poisson multigraph model

In a graph, the number of edges between any two nodes is 0 or 1. A multigraph allows an arbitrary number of edges between any two nodes. Multigraphs are natural structures for modelling the internet and gene and protein networks. Here, we consider a multigraph with a random number of edges xij connecting every pair of nodes {i,j}. In particular, we assume that the xij are independent Poisson random variables with means μij. As a plausible model for ranking nodes, we take μij = θiθj, where θi and θj are non-negative propensities (Ranola et al., 2010). The log likelihood of the observed edge counts xij = xji amounts to

$$L(\theta) = \sum_{\{i,j\}}\big[x_{ij}\ln(\theta_i\theta_j) - \theta_i\theta_j - \ln x_{ij}!\big].$$

The score vector has entries

$$\frac{\partial}{\partial\theta_i}L(\theta) = \sum_{j\neq i}\Big(\frac{x_{ij}}{\theta_i} - \theta_j\Big),$$

and the observed information matrix has entries

$$-\frac{\partial^2}{\partial\theta_i^2}L(\theta) = \sum_{j\neq i}\frac{x_{ij}}{\theta_i^2}, \qquad -\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}L(\theta) = 1 \quad (j\neq i).$$

For p nodes, the matrix − d2L(θ) is p × p, and inverting it seems out of the question when p is large. Fortunately, the Sherman–Morrison formula comes to the rescue. If we write − d2L(θ) as D + 11t with D diagonal, then the explicit inverse

$$(D + \mathbf{1}\mathbf{1}^t)^{-1} = D^{-1} - \frac{D^{-1}\mathbf{1}\mathbf{1}^t D^{-1}}{1 + \mathbf{1}^t D^{-1}\mathbf{1}}$$

is available. This makes Newton's method trivial to implement as long as one respects the bounds θi ≥ 0. More generally, it is always cheap to invert a low-rank perturbation of an explicitly invertible matrix.
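
As a sketch of the linear algebra involved, the Sherman–Morrison formula lets one solve the Newton system (D + 11t)x = g in O(p) operations; the helper below is illustrative, with names of our choosing.

```python
# Solve (diag(d) + ones*ones') x = g via the Sherman-Morrison formula.
import numpy as np

def solve_diag_plus_rank_one(d, g):
    Dinv_g = g / d                    # D^{-1} g
    Dinv_1 = 1.0 / d                  # D^{-1} 1
    correction = Dinv_1 * (Dinv_g.sum() / (1.0 + Dinv_1.sum()))
    return Dinv_g - correction
```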

In maximum likelihood estimation, the method of steepest ascent replaces the observed information matrix − d2L(θ) by the identity matrix I. Fisher's scoring algorithm makes the far more effective choice of replacing the observed information matrix by the expected information matrix J(θ) = E[ − d2L(θ)] (Osborne, 1992). The alternative representation J(θ) = Var[ ∇ L(θ)] of J(θ) as a variance matrix demonstrates that it is positive semidefinite. Usually it is positive definite as well and serves as an excellent substitute for − d2L(θ) in Newton's method. The inverse matrices J(θ̂)−1 and − d2L(θ̂)−1 immediately supply the asymptotic variances and covariances of the maximum likelihood estimate θ̂ (Rao, 1973).

The score and expected information simplify considerably for exponential families of densities (Bradley, 1973; Charnes et al., 1976; Green, 1984; Jennrich & Moore, 1975; Nelder & Wedderburn, 1972). Recall that the density of a vector random variable Y from an exponential family can be written as

$$f(y \mid \theta) = g(y)\, e^{\beta(\theta) + h(y)^t\gamma(\theta)} \tag{4}$$

relative to some measure ν (Dobson, 1990; Rao, 1973). The function h(y) in (4) is the sufficient statistic. The maximum likelihood estimate of the parameter vector θ depends on an observation Y only through h(y). Predictors of Y are incorporated into the functions β(θ) and γ(θ). If γ(θ) is linear in θ, then J(θ) = − d2L(θ) = − d2β(θ), and scoring coincides with Newton's method. If in addition J(θ) is positive definite, then L(θ) is strictly concave and possesses at most a single local maximum, which is necessarily the global maximum.

Both the score vector and expected information matrix can be expressed succinctly in terms of the mean vector μ(θ) = E[h(y)] and the variance matrix Σ(θ) = Var[h(y)] of the sufficient statistic. Standard arguments show that

$$\nabla L(\theta) = d\gamma(\theta)^t\big[h(y) - \mu(\theta)\big], \qquad J(\theta) = d\gamma(\theta)^t\,\Sigma(\theta)\, d\gamma(\theta).$$

These formulas have had an enormous impact on non-linear regression and fitting generalized linear models. Applied statistics as we know it would be nearly impossible without them. Implementation of scoring is almost always safeguarded by step halving and upgraded to handle linear constraints and parameter bounds. The notion of quadratic approximation is still the key, but each step of constrained scoring must solve a quadratic programme.

In parallel with developments in statistics, numerical analysts sought substitutes for Newton's method. Their efforts a generation ago focused on quasi-Newton methods for generic smooth functions (Dennis & Schnabel, 1996; Nocedal & Wright, 2006). Once again, the core idea was successive quadratic approximation. A good quasi-Newton method (a) minimizes a quadratic function f(θ) from Rp to R in p steps, (b) avoids evaluation of d2f(θ), (c) adapts readily to simple parameter constraints, and (d) exploits inexact line searches.

Quasi-Newton methods update the current approximation Hn to the second differential d2f(θ) of an objective function f(θ) by a rank-one or rank-two perturbation satisfying a secant condition. The secant condition captures the first-order Taylor approximation

$$\nabla f(\theta_n) \approx \nabla f(\theta_{n+1}) + d^2 f(\theta_{n+1})(\theta_n - \theta_{n+1}).$$

If we define the gradient and argument differences

$$g_n = \nabla f(\theta_n) - \nabla f(\theta_{n+1}), \qquad d_n = \theta_n - \theta_{n+1},$$

then the secant condition reads Hn + 1dn = gn. Davidon (1959) discovered that the unique symmetric rank-one update to Hn satisfying the secant condition is

$$H_{n+1} = H_n + c_n v_n v_n^t,$$

where the constant cn and the vector vn are determined by

$$v_n = g_n - H_n d_n, \qquad c_n = \frac{1}{(g_n - H_n d_n)^t d_n}.$$

When the inner product (Hndn − gn)tdn is too close to 0, there are two possibilities. Either the secant adjustment is ignored, and the value Hn is retained for Hn + 1, or one resorts to a trust region strategy (Nocedal & Wright, 2006).

In the trust region method, one minimizes the quadratic approximation to f(θ) subject to the spherical constraint ∥ θ − θn ∥ 2 ≤ r2 for a fixed radius r. This constrained optimization problem has a solution regardless of whether Hn is positive definite. Working within a trust region prevents absurdly large steps in the early stages of minimization. With appropriate safeguards, some numerical analysts (Conn et al., 1991; Khalfan et al., 1993) consider Davidon's rank-one update superior to the widely used BFGS update, named after Broyden, Fletcher, Goldfarb, and Shanno. This rank-two perturbation is guaranteed to maintain positive definiteness and is better understood theoretically than the symmetric rank-one update. Also of interest is the Davidon, Fletcher, and Powell (DFP) rank-two update, which applies to the inverse Hn−1 of Hn. Although the DFP update ostensibly avoids matrix inversion, the consensus is that the BFGS update is superior to it in numerical practice (Dennis & Schnabel, 1996).

The MM and EM Algorithms


The numerical analysts Ortega & Rheinboldt (1970) first articulated the MM principle; de Leeuw (1977) saw its potential and created the first MM algorithm. The MM algorithm currently enjoys its greatest vogue in computational statistics (Hunter & Lange, 2004; Lange et al., 2000; Wu & Lange, 2010). The basic idea is to convert a hard optimization problem into a sequence of simpler ones. In minimization, the MM principle majorizes the objective function f(θ) by a surrogate function g(θ ∣ θn) anchored at the current point θn. Majorization combines the tangency condition g(θn ∣ θn) = f(θn) and the domination condition g(θ ∣ θn) ≥ f(θ) for all θ. The next iterate of the MM algorithm is defined to minimize g(θ ∣ θn). Because

$$f(\theta_{n+1}) \le g(\theta_{n+1} \mid \theta_n) \le g(\theta_n \mid \theta_n) = f(\theta_n),$$

the MM iterates generate a descent algorithm driving the objective function downhill. Strictly speaking, the descent property depends only on decreasing g(θ ∣ θn), not on minimizing it. Constraint satisfaction is automatically enforced in finding θn + 1. Under appropriate regularity conditions, an MM algorithm is guaranteed to converge to a local minimum of the objective function (Lange, 2010). In maximization, we first minorize and then maximize. Thus, the acronym MM does double duty in the forms majorize–minimize and minorize–maximize.

When it is successful, the MM algorithm simplifies optimization by (a) separating the variables of a problem, (b) avoiding large matrix inversions, (c) linearizing a problem, (d) restoring symmetry, (e) dealing with equality and inequality constraints gracefully, and (f) turning a non-differentiable problem into a smooth problem. The art in devising an MM algorithm lies in choosing a tractable surrogate function g(θ ∣ θn) that hugs the objective function f(θ) as tightly as possible.

The majorization relation between functions is closed under the formation of sums, non-negative products, limits, and composition with an increasing function. These rules allow one to work piecemeal in simplifying complicated objective functions. Skill in dealing with inequalities is crucial in constructing majorizations. Classical inequalities such as Jensen's inequality, the information inequality, the arithmetic-geometric mean inequality, and the Cauchy–Schwarz inequality prove useful in many problems. The supporting hyperplane property of a convex function and the quadratic upper bound principle of Böhning & Lindsay (1988) also find wide application.

Example 6. An MM algorithm for non-negative least squares

Sha et al. (2003) devised an MM algorithm for Example 1. The diagonal terms aiiθi2 ∕ 2 they retain as presented. The off-diagonal terms aijθiθj they majorize according to the sign of the coefficient aij. When the sign of aij is positive, they apply the majorization

$$xy \le \frac{1}{2}\Big(\frac{x_n}{y_n}\,y^2 + \frac{y_n}{x_n}\,x^2\Big),$$

which is just a rearrangement of the inequality

$$0 \le \frac{1}{2}\Big(\sqrt{\frac{x_n}{y_n}}\,y - \sqrt{\frac{y_n}{x_n}}\,x\Big)^2,$$

with equality when x = xn and y = yn. When the sign of aij is negative, they apply the majorization

$$xy \ge x_n y_n\Big(1 + \ln\frac{xy}{x_n y_n}\Big),$$

which is just a rearrangement of the simple inequality z ≥ 1 + ln z with z = xy ∕ (xnyn). The value z = 1 gives equality in the inequality. Both majorizations separate parameters and allow one to minimize the surrogate function parameter by parameter. Indeed, if we define matrices A +  and A −  with entries max{aij,0} and − min{aij,0}, respectively, then the resulting MM algorithm iterates according to

$$\theta_{n+1,i} = \theta_{ni}\,\frac{-b_i + \sqrt{b_i^2 + 4\,(A^+\theta_n)_i\,(A^-\theta_n)_i}}{2\,(A^+\theta_n)_i}.$$

All entries of the initial point θ0 should be positive; otherwise, the MM algorithm stalls. The updates occur in parallel. In contrast, the cyclic coordinate descent updates are sequential. Figure 1 depicts the progress of the MM algorithm on our non-negative least squares problem.
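
A sketch of the parallel multiplicative updates, following the iteration displayed above, appears next; it assumes A has positive diagonal entries (true for A = XtX with no zero columns) so that the denominators stay positive. Names and iteration counts are ours.

```python
# MM algorithm of Sha et al. (2003) for non-negative quadratic programming.
import numpy as np

def nnls_mm(A, b, theta0, iters=1000):
    Aplus, Aminus = np.maximum(A, 0.0), np.maximum(-A, 0.0)
    theta = np.asarray(theta0, dtype=float).copy()   # all entries positive
    for _ in range(iters):
        p, m = Aplus @ theta, Aminus @ theta
        # parallel multiplicative update derived from the separated surrogate
        theta *= (-b + np.sqrt(b ** 2 + 4.0 * p * m)) / (2.0 * p)
    return theta
```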

Example 7. Locating a gunshot

Locating the time and place of a gunshot is a typical global positioning problem (Strang & Borre, 2012). In a certain city, m sensors located at the points x1, … ,xm are installed. A signal, say a gunshot sound, is sent from an unknown location θ at unknown time α and known speed s and arrives at location j at time yj observed with random measurement error. The problem is to estimate the vector θ and the scalar α from the observed data y1, … ,ym. Other problems of this nature include pinpointing the epicentre of an earthquake and the detonation point of a nuclear explosion. This estimation problem can be attacked by a combination of block descent and the MM principle.

If we assume Gaussian random errors, then maximum likelihood estimation reduces to minimizing the criterion

$$f(\theta,\alpha) = \sum_{j=1}^m\Big(y_j - \alpha - \frac{1}{s}\,\|\theta - x_j\|\Big)^2 = \frac{1}{s^2}\sum_{j=1}^m\big(s\,y_j - s\,\alpha - \|\theta - x_j\|\big)^2.$$

The equivalence of the two representations of f(θ,α) shows that it suffices to solve the problem with speed s = 1. In the remaining discussion, we make this assumption. For a fixed θ, estimation of α reduces to a least squares problem with the obvious solution

$$\hat{\alpha} = \frac{1}{m}\sum_{j=1}^m\big(y_j - \|\theta - x_j\|\big).$$

To update θ with a fixed α, we rewrite f(θ,α) as

$$f(\theta,\alpha) = \sum_{j=1}^m\big[(y_j - \alpha)^2 - 2(y_j - \alpha)\|\theta - x_j\| + \|\theta - x_j\|^2\big].$$

The middle terms − 2(yj − α) ∥ θ − xj ∥ are awkward to deal with in minimization. Depending on the sign of the coefficient − 2(yj − α), we majorize them in two different ways. If the sign is negative, then we employ the Cauchy–Schwarz majorization

$$-\|\theta - x_j\| \le -\frac{(\theta - x_j)^t(\theta_n - x_j)}{\|\theta_n - x_j\|}.$$

If the sign is positive, then we employ the more subtle majorization

$$\|\theta - x_j\| \le \frac{\|\theta - x_j\|^2 + \|\theta_n - x_j\|^2}{2\,\|\theta_n - x_j\|}.$$

To derive this second majorization, note that √u is a concave function on (0, ∞ ). It therefore satisfies the dominating hyperplane inequality

$$\sqrt{u} \le \sqrt{u_n} + \frac{1}{2\sqrt{u_n}}\,(u - u_n).$$

Now substitute ∥ θ − xj ∥ 2 for u. These manoeuvres separate parameters and reduce the surrogate to a sum of linear terms and squared Euclidean norms. The minimization of the surrogate yields the MM update

$$\theta_{n+1} = \Big[\sum_{j=1}^m\Big(1 + \frac{\max\{\alpha - y_j, 0\}}{\|\theta_n - x_j\|}\Big)\Big]^{-1}\sum_{j=1}^m\Big[\Big(1 + \frac{\max\{\alpha - y_j, 0\}}{\|\theta_n - x_j\|}\Big)x_j + \frac{\max\{y_j - \alpha, 0\}}{\|\theta_n - x_j\|}\,(\theta_n - x_j)\Big]$$

of θ for a fixed α. The condition α > yj in this update is usually vacuous. By design, f(θ,α) decreases after each cycle of updating α and θ.

The celebrated EM algorithm is one of the most potent optimization tools in the statistician's toolkit (Dempster et al., 1977; McLachlan & Krishnan, 2008). The E step in the EM algorithm creates a surrogate function, the Q function in the literature, that minorizes the log likelihood. Thus, every EM algorithm is an MM algorithm. If Y is the observed data and X is the complete data, then the Q function is defined as the conditional expectation

$$Q(\theta \mid \theta_n) = \mathrm{E}\big[\ln f(X \mid \theta) \mid Y = y, \theta_n\big],$$

where f(X ∣ θ) denotes the complete data likelihood, upper case letters indicate random vectors, and lower case letters indicate corresponding realizations of these random vectors. In the M step of the EM algorithm, one calculates the next iterate θn + 1 by maximizing Q(θ ∣ θn) with respect to θ.

Example 8. MM versus EM for the Dirichlet-multinomial distribution

When multivariate count data exhibit overdispersion, the Dirichlet-multinomial distribution is preferred to the multinomial distribution. In the Dirichlet-multinomial model, the multinomial probabilities p = (p1, … ,pd) follow a Dirichlet distribution with parameter vector α = (α1, … ,αd) having positive components. For a multivariate count vector x = (x1, … ,xd) with batch size | x | = x1 + ⋯ + xd, the probability mass function is accordingly

$$h(x \mid \alpha) = \binom{|x|}{x}\int_{\Delta_d}\prod_{j=1}^d p_j^{x_j}\,\frac{\Gamma(|\alpha|)}{\prod_{j=1}^d\Gamma(\alpha_j)}\prod_{j=1}^d p_j^{\alpha_j - 1}\, dp = \binom{|x|}{x}\,\frac{\Gamma(|\alpha|)}{\Gamma(|\alpha| + |x|)}\prod_{j=1}^d\frac{\Gamma(\alpha_j + x_j)}{\Gamma(\alpha_j)} = \binom{|x|}{x}\,\frac{\prod_{j=1}^d(\alpha_j)_{x_j}}{(|\alpha|)_{|x|}} \tag{5}$$

where Δd is the unit simplex in d dimensions, | α | equals α1 + ⋯ + αd, and (a)k = a(a + 1)⋯(a + k − 1) denotes a rising factorial. The last equality in (5) follows from the factorial property Γ(a + 1) ∕ Γ(a) = a of the gamma function. Given independent data points x1, … ,xm, the log likelihood is

$$L(\alpha) = \sum_{i=1}^m\Big[\ln\binom{|x_i|}{x_i} + \sum_{j=1}^d\sum_{k=0}^{x_{ij}-1}\ln(\alpha_j + k) - \sum_{k=0}^{|x_i|-1}\ln(|\alpha| + k)\Big].$$

The lack of concavity of L(α) may cause instability in Newton's method when it is started far from the optimal point. Fisher's scoring algorithm is computationally prohibitive because calculation of the expected information matrix involves numerous evaluations of beta-binomial tail probabilities. The ascent property makes EM and MM algorithms attractive.

In deriving an EM algorithm, we treat the unobserved multinomial probabilities pj in each case as missing data. The complete data likelihood is then the integrand in the integral (5). A straightforward calculation shows that p possesses a posterior Dirichlet distribution with parameters α1 + xi1 through αd + xid for case i. If we now differentiate the identity

$$1 = \int_{\Delta_d}\frac{\Gamma(|\alpha| + |x_i|)}{\prod_{j=1}^d\Gamma(\alpha_j + x_{ij})}\prod_{j=1}^d p_j^{\alpha_j + x_{ij} - 1}\, dp$$

with respect to αj, then the identity

$$\mathrm{E}\big[\ln p_j \mid x_i, \alpha\big] = \Psi(\alpha_j + x_{ij}) - \Psi(|\alpha| + |x_i|)$$

emerges, where Ψ(z) = Γ ′ (z) ∕ Γ(z) is the digamma function. It follows that up to an irrelevant additive constant, the surrogate function is

$$Q(\alpha \mid \alpha_n) = \sum_{i=1}^m\Big\{\ln\Gamma(|\alpha|) - \sum_{j=1}^d\ln\Gamma(\alpha_j) + \sum_{j=1}^d(\alpha_j - 1)\big[\Psi(\alpha_{nj} + x_{ij}) - \Psi(|\alpha_n| + |x_i|)\big]\Big\}.$$

Maximizing Q(α ∣ αn) is non-trivial because it involves special functions and intertwining of the αj parameters.

Directly invoking the MM principle produces a more malleable surrogate function (Zhou & Lange, 2010). Consider the logarithm of the third form of the likelihood function (5). Applying Jensen's inequality to ln(αj + k) gives

$$\ln(\alpha_j + k) \ge \frac{\alpha_{nj}}{\alpha_{nj} + k}\,\ln\alpha_j + c_{njk}, \qquad c_{njk} \text{ a constant.}$$

Likewise, applying the supporting hyperplane inequality to − ln( | α | + k) gives

$$-\ln(|\alpha| + k) \ge -\ln(|\alpha_n| + k) - \frac{|\alpha| - |\alpha_n|}{|\alpha_n| + k}.$$

Overall, these minorizations yield the surrogate function

$$g(\alpha \mid \alpha_n) = \sum_{j=1}^d\Big[\sum_{i=1}^m\sum_{k=0}^{x_{ij}-1}\frac{\alpha_{nj}}{\alpha_{nj} + k}\Big]\ln\alpha_j - \Big[\sum_{i=1}^m\sum_{k=0}^{|x_i|-1}\frac{1}{|\alpha_n| + k}\Big]\sum_{j=1}^d\alpha_j + c_n,$$

which completely separates the parameters αj; here cn is an irrelevant constant. This suggests the simple MM updates

$$\alpha_{n+1,j} = \alpha_{nj}\,\frac{\sum_{i=1}^m\sum_{k=0}^{x_{ij}-1}(\alpha_{nj} + k)^{-1}}{\sum_{i=1}^m\sum_{k=0}^{|x_i|-1}(|\alpha_n| + k)^{-1}}.$$

The positivity constraints are always satisfied when all initial values α0j > 0. Parameter separation can be achieved in the EM algorithm by a further minorization of the lnΓ( | α | ) term in Q(α ∣ αn). This action yields a viable EM–MM hybrid algorithm. The study of Zhou & Yang (2012) contains more details and a comparison of the convergence rates of the three algorithms.
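
For readers who want to experiment, here is a direct transcription of the MM updates just displayed; the function name and the fixed iteration count are our illustrative choices.

```python
# Separated MM updates for the Dirichlet-multinomial log likelihood.
import numpy as np

def dirichlet_multinomial_mm(X, alpha0, iters=200):
    alpha = np.asarray(alpha0, dtype=float)   # all entries must be positive
    batch = X.sum(axis=1)                     # batch sizes |x_i|
    for _ in range(iters):
        total = alpha.sum()                   # |alpha_n|
        num = np.zeros_like(alpha)
        for j in range(alpha.size):
            for x in X[:, j]:                 # sum over k = 0, ..., x_ij - 1
                num[j] += (alpha[j] / (alpha[j] + np.arange(x))).sum()
        den = sum((1.0 / (total + np.arange(mi))).sum() for mi in batch)
        alpha = num / den
    return alpha
```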

Finally, let us mention various strategies for handling exceptional cases. In the MM algorithm, it may be impossible to optimize the surrogate function g(θ ∣ θn) explicitly. There are two obvious remedies. One is to institute some form of block relaxation in updating g(θ ∣ θn) (Meng & Rubin, 1993). There is no need to iterate to convergence because the purpose is merely to improve g(θ ∣ θn) and hence the objective function f(θ). Another obvious remedy is to optimize the surrogate function by Newton's method. It turns out that a single step of Newton's method suffices to preserve the local rate of convergence of the MM algorithm (Lange, 1995). The ascent property is sacrificed initially, but it kicks in as one approaches the optimal point. In an unconstrained problem, this variant MM algorithm can be phrased as

$$\theta_{n+1} = \theta_n - d^2 g(\theta_n \mid \theta_n)^{-1}\,\nabla f(\theta_n),$$

where the substitution of ∇ f(θn) for ∇ g(θn ∣ θn) is justified by the tangency and domination conditions satisfied by g(θ ∣ θn) and f(θ).

A more pressing concern in the EM algorithm is intractability of the E step. If f(X ∣ θ) denotes the complete data likelihood, then in the stochastic EM algorithm (Jank, 2006; Robert & Casella, 2004; Wei & Tanner, 1990), one estimates the surrogate function by a Monte Carlo average

$$\hat{Q}(\theta \mid \theta_n) = \frac{1}{m}\sum_{i=1}^m \ln f(x_i \mid \theta) \tag{6}$$

over realizations xi of the complete data X conditional on the observed data Y = y and the current parameter iterate θn. Sampling can be performed by rejection sampling, importance sampling, Markov chain Monte Carlo, or quasi-Monte Carlo. The next iterate θn + 1 should maximize the average (6). The sample size m should increase as the iteration count n increases. Determining the rate of increase of m and setting a reasonable convergence criterion are both subtle issues. The ascent property of the EM algorithm fails because of the inherent sampling noise. The combination of slow convergence and Monte Carlo sampling makes the stochastic EM algorithm unattractive in large-scale problems. In smaller problems, it fills a useful niche.

The stochastic EM algorithm generalizes the Robbins–Monro algorithm (Robbins and Monro, 1951) for root finding and the Kiefer–Wolfowitz algorithm (Kiefer & Wolfowitz, 1952) for function maximization. In unconstrained maximum likelihood estimation, one seeks a root of the likelihood equation, so both methods are relevant. Under suitable assumptions, the Kiefer–Wolfowitz algorithm converges to a local maximum almost surely. Because this cluster of topics is tangential to our overall emphasis on deterministic methods of optimization, we refer readers to the books of Chen (2002), Kushner & Yin (2003), and Robert & Casella (2004) for a fuller discussion.

Penalization


Penalization is a device for imposing parsimony. For purposes of illustration, we discuss two penalized estimation problems of considerable utility in applied statistics. Both of these examples generate convex programmes with non-differentiable objective functions. In the interests of accessibility, we will derive estimation algorithms for both problems without invoking the machinery of convex analysis.

Example 9. Lasso penalized regression

Lasso penalized regression has been pursued for a long time in many application areas (Chen et al., 1998; Claerbout & Muir, 1973; Donoho & Johnstone, 1994; Santosa & Symes, 1986; Taylor et al., 1979; Tibshirani, 1996). Modern versions consider a generalized linear model where yi is the response for case i, xij is the value of predictor j for case i, and θj is the regression coefficient corresponding to predictor j. When the number of predictors p exceeds the number of cases m, θ cannot be uniquely estimated. In an era of big data, this quandary is fairly common. One remedy is to perform model selection by imposing a lasso penalty on the loss function ℓ(θ). In least squares estimation,

$$\ell(\theta) = \frac{1}{2}\sum_{i=1}^m\Big(y_i - \sum_{j=1}^p x_{ij}\theta_j\Big)^2.$$

For a generalized linear model (Park & Hastie, 2007), ℓ(θ) is the negative log likelihood of the data. Lasso penalized estimation minimizes the criterion

$$f(\theta) = \ell(\theta) + \rho\sum_{j=1}^p w_j\,|\theta_j|,$$

where the non-negative weights wj and the tuning constant ρ > 0 are given. If θj is the intercept for the model, then its weight wj is usually set to 0. For the remaining predictors, the choice wj = 1 is reasonable provided the predictors are standardized to have mean 0 and variance 1. To improve the asymptotic properties of the lasso estimates, the adaptive lasso (Zou, 2006) defines the weights wj = | θ̂j |  − 1 for any consistent estimate θ̂j of θj. In a Bayesian context, imposing a lasso penalty is equivalent to placing a Laplace prior with mean 0 on each θj. The elastic net adds a ridge penalty proportional to Σj θj2 to the lasso penalty (Zou & Hastie, 2005).

The primary difference between lasso and ridge regression is that the lasso penalty forces most parameters to 0, whereas the ridge penalty merely reduces them. Thus, the ridge penalty relaxes its grip too quickly for model selection. Unfortunately, the lasso penalty tends to select one predictor from a group of correlated predictors and ignore the others. The elastic net ameliorates this defect. To overcome severe shrinkage, many statisticians discard penalties after the conclusion of model selection and re-estimate the selected parameters. Cross-validation and stability selection are effective in choosing the penalty tuning constant and the selected predictors, respectively (Hastie et al., 2009; Meinshausen & Bühlmann, 2010).

Coordinate descent works particularly well when only a few predictors enter a model (Friedman et al., 2007; Wu & Lange, 2008). Consider what happens when we visit parameter θj and the loss function is the least squares criterion. If we define the amended response ri = yi − Σk ≠ j xikθk, then the problem reduces to minimizing

$$\frac{1}{2}\sum_{i=1}^m(r_i - x_{ij}\theta_j)^2 + \rho\,w_j\,|\theta_j|.$$

Now divide the domain of θj into the two intervals ( − ∞ ,0] and [0, ∞ ). On the right interval, elementary calculus suggests the update

$$\theta_j = \frac{\sum_{i=1}^m x_{ij}\, r_i - \rho\, w_j}{\sum_{i=1}^m x_{ij}^2}.$$

This is invalid when it is negative and must be replaced by 0. Likewise, on the left interval, we have the update

$$\theta_j = \frac{\sum_{i=1}^m x_{ij}\, r_i + \rho\, w_j}{\sum_{i=1}^m x_{ij}^2},$$

unless it is positive. On both intervals, shrinkage pulls the usual least squares estimate towards 0. In underdetermined problems with just a few relevant predictors, most parameters never budge from their starting values of 0. This circumstance plus the complete absence of matrix operations explains the speed of coordinate descent. It inherits its numerical stability from the descent property enjoyed by any coordinate descent algorithm.
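
The two half-interval updates combine into a single soft-threshold step, as the following sketch shows; residual bookkeeping replaces explicit formation of the amended responses. The routine is an illustration rather than an optimized solver.

```python
# Cyclic coordinate descent for lasso penalized least squares.
import numpy as np

def lasso_coordinate_descent(X, y, rho, w, iters=100):
    m, p = X.shape
    theta = np.zeros(p)
    resid = y - X @ theta                  # current residual
    col_ss = (X ** 2).sum(axis=0)          # column sums of squares
    for _ in range(iters):
        for j in range(p):
            resid += X[:, j] * theta[j]    # amended response: remove predictor j
            z = X[:, j] @ resid
            # soft threshold merges the two half-interval updates
            theta[j] = np.sign(z) * max(abs(z) - rho * w[j], 0.0) / col_ss[j]
            resid -= X[:, j] * theta[j]
    return theta
```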

With a generalized linear model, say logistic regression, the same story plays out. Now, however, we must institute a line search for the minimum on each of the two half-intervals. Newton's method, scoring, and even golden section search work well. When f(θ) is convex, and θj = 0, it is prudent to check the forward directional derivatives d_{e_j}f(θ) and d_{−e_j}f(θ) along the current coordinate direction ej and its negative. If both forward directional derivatives are non-negative, then no progress can be made by moving off 0. Thus, a parameter parked at 0 is left there. Other computational savings are possible that make coordinate descent even faster. For example, computations can be organized around the linear predictor Σj xijθj for each case i. When θj changes, it is trivial to update this inner product. Wu et al. (2009) and Wu & Lange (2008) illustrate the potential of coordinate descent on some concrete genetic examples.

Example 10. Matrix completion

The matrix completion problem became famous when the movie distribution company Netflix offered a million dollar prize for improvements to its movie rating system (ACM SIGKDD and Netflix, 2007). The idea was that customers would submit ratings on a small subset of movie titles, and from these ratings, Netflix would infer their preferences and recommend additional movies for their consideration. Imagine therefore a very sparse matrix Y = (yij) whose rows are individuals and whose columns are movies. Completed cells contain a rating from 1 to 5. Most cells are empty and need to be filled in. If the matrix is sufficiently structured and possesses low rank, then it is possible to complete the matrix in a parsimonious way. Although this problem sounds specialized, it has applications far beyond this narrow setting. For example, filling in missing genotypes in genome scans for disease genes benefits from matrix completion (Chi et al., 2013).

Following Cai et al. (2008), Candès & Tao (2009), Mazumder et al. (2010), and Chen et al. (2012), let Δ denote the set of index pairs (i,j) such that yij is observed. The Lagrangian formulation of matrix completion minimizes the criterion

$$f(X) = \frac{1}{2}\sum_{(i,j)\in\Delta}(y_{ij} - x_{ij})^2 + \rho\sum_k\sigma_k \tag{7}$$

with respect to a compatible matrix X = (xij) with singular values σk. Recall that the singular value decomposition

$$X = \sum_i \sigma_i\, u_i v_i^t$$

represents X as a sum of outer products involving a collection of orthogonal left singular vectors ui, a corresponding collection of orthogonal right singular vectors vi, and a descending sequence of non-negative singular values σi. Alternatively, we can factor X in the form UΣVt for orthogonal matrices U and V and a rectangular diagonal matrix Σ.

The nuclear norm ∥ X ∥  *  = Σk σk plays the same role in low-rank matrix approximation that the ℓ1 norm ∥ θ ∥ 1 = Σj | θj | plays in sparse regression. For a more succinct representation of the criterion (7), we introduce the Frobenius norm

$$\|M\|_F = \sqrt{\mathrm{tr}(MM^t)} = \Big(\sum_i\sum_j m_{ij}^2\Big)^{1/2}$$

induced by the trace inner product tr(UVt) and the projection operator PΔ(Y) with entries

$$P_\Delta(Y)_{ij} = \begin{cases} y_{ij}, & (i,j)\in\Delta \\ 0, & (i,j)\notin\Delta. \end{cases}$$

In this notation, the criterion (7) becomes

$$f(X) = \frac{1}{2}\,\big\|P_\Delta(Y) - P_\Delta(X)\big\|_F^2 + \rho\,\|X\|_*.$$

To derive an algorithm for estimating X, we again exploit the MM principle. The general idea is to restore the symmetry of the problem by imputing the missing data (Mazumder et al., 2010). Suppose Xn is our current approximation to X. We simply replace a missing entry yij of Y by the corresponding entry xnij of Xn and add the term 1 ∕ 2(xnij − xij)2 to the criterion (7). Because the added terms majorize 0, they create a legitimate surrogate function and lead to an MM algorithm. One can rephrase the problem in matrix terms by defining the orthogonal complement P⊥Δ(X) of PΔ(X) according to the rule P⊥Δ(X) = X − PΔ(X). The matrix Zn = PΔ(Y) + P⊥Δ(Xn) temporarily completes Y and yields the surrogate function

$$g(X \mid X_n) = \frac{1}{2}\,\|Z_n - X\|_F^2 + \rho\,\|X\|_*.$$

At this juncture, it is helpful to recall some mathematical facts. First, the Frobenius norm is invariant under left and right multiplication of its argument by an orthogonal matrix. Thus, inline image depends only on the singular values of X. The inner product − tr(ZnXt) presents a greater barrier to progress, but it ultimately succumbs to a matrix analogue of the Cauchy–Schwarz inequality. Fan's inequality says that

$$\mathrm{tr}(Z_n X^t) \le \sum_k \omega_k\,\sigma_k$$

for the ordered singular values ωk of Zn (Borwein & Lewis, 2000). Equality is attained in Fan's inequality if and only if the right and left singular vectors for the two matrices coincide. Thus, in minimizing g(X ∣ Xn), we can assume that the singular vectors of X coincide with those of Zn and rewrite the surrogate function as

$$g(X \mid X_n) = \frac{1}{2}\sum_k(\omega_k - \sigma_k)^2 + \rho\sum_k\sigma_k.$$

Application of the forward directional derivative test

$$d_\nu\, g(X \mid X_n) \ge 0$$

for all tangent directions ν identifies the shrunken singular values

$$\sigma_{n+1,k} = \max\{\omega_k - \rho,\, 0\}$$

as optimal. In practice, one does not have to extract the full singular value decomposition of Zn. Only the singular values ωk > ρ are actually relevant in constructing Xn + 1.
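
The whole MM iteration reduces to impute-then-shrink, as the sketch below makes explicit; mask marks the observed entries, and a full SVD stands in for the truncated decomposition one would use at scale. Names are our own.

```python
# MM algorithm for matrix completion by singular value shrinkage.
import numpy as np

def matrix_complete(Y, mask, rho, iters=100):
    X = np.zeros_like(Y)
    for _ in range(iters):
        Z = np.where(mask, Y, X)                 # impute missing entries with X
        U, omega, Vt = np.linalg.svd(Z, full_matrices=False)
        sigma = np.maximum(omega - rho, 0.0)     # shrink the singular values
        X = (U * sigma) @ Vt
    return X
```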

In many applications, the underlying structure of the observation matrix Y is corrupted by a few noisy entries. This tempts one to approximate Y by the sum of a low-rank matrix X plus a sparse matrix W. To estimate X and W, we introduce a positive tuning constant λ and minimize the criterion

$$f(X, W) = \frac{1}{2}\,\big\|P_\Delta(Y) - P_\Delta(X) - P_\Delta(W)\big\|_F^2 + \rho\,\|X\|_* + \lambda\sum_{(i,j)\in\Delta}|w_{ij}|$$

by block descent. We have already indicated how to update X for a fixed W. To minimize f(X,W) for a fixed X, we set wij = 0 for any pair (i,j) ∉ Δ. Because the remaining W parameters separate in f(X,W), the shrinkage updates

$$w_{ij} = \mathrm{sign}(y_{ij} - x_{ij})\,\max\{|y_{ij} - x_{ij}| - \lambda,\, 0\}, \qquad (i,j) \in \Delta,$$

are trivial to derive.

Augmented Lagrangians


The augmented Lagrangian method is one of the best ways of handling parameter constraints (Hestenes, 1969; Nocedal & Wright, 2006; Powell, 1969; Rockafellar, 1973). For the sake of simplicity, we focus on the problem of minimizing f(θ) subject to the equality constraints gi(θ) = 0 for i = 1, … ,q. We will ignore inequality constraints and assume that f(θ) and the gi(θ) are smooth. At a constrained minimum, the classical Lagrange multiplier rule

$$\nabla f(\theta) + \sum_{i=1}^q\lambda_i\,\nabla g_i(\theta) = 0 \tag{8}$$

holds provided the gradients ∇ gi(θ) are linearly independent. The augmented Lagrangian method optimizes the perturbed function

$$\mathcal{L}_\rho(\theta, \lambda) = f(\theta) + \sum_{i=1}^q\lambda_i\, g_i(\theta) + \frac{\rho}{2}\sum_{i=1}^q g_i(\theta)^2$$

with respect to θ. It then adjusts the current multiplier vector λ in the hope of matching the true Lagrange multiplier vector. The penalty term ρ ∕ 2 gi(θ)2 punishes violations of the equality constraint gi(θ) = 0. At convergence, the gradient ρgi(θ) ∇ gi(θ) of ρ ∕ 2 gi(θ)2 vanishes, and we recover the standard multiplier rule (8). This process can only succeed if the degree of penalization ρ is sufficiently large.

Thus, we must either take ρ initially large or gradually increase it until it hits the finite transition point where the constrained and unconstrained solutions merge. Updating λ is more subtle. If θn furnishes the unconstrained minimum of the augmented Lagrangian for the current multiplier vector λn, then the stationarity condition reads

$$\nabla f(\theta_n) + \sum_{i=1}^q\big[\lambda_{ni} + \rho\, g_i(\theta_n)\big]\nabla g_i(\theta_n) = 0.$$

The last equation motivates the standard update

$$\lambda_{n+1,i} = \lambda_{ni} + \rho\, g_i(\theta_n).$$
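
For a single equality constraint, the method fits in a few lines. The sketch below leaves the inner unconstrained minimization to scipy's general-purpose minimize routine purely for illustration; in practice, one tailors the inner solver to the problem at hand.

```python
# Augmented Lagrangian method for min f(theta) subject to g(theta) = 0.
import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, g, theta0, rho=10.0, outer=20):
    theta, lam = np.asarray(theta0, dtype=float), 0.0
    for _ in range(outer):
        def L_rho(th):                        # perturbed objective
            return f(th) + lam * g(th) + 0.5 * rho * g(th) ** 2
        theta = minimize(L_rho, theta).x      # inner unconstrained minimization
        lam += rho * g(theta)                 # standard multiplier update
    return theta
```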

The alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976; Glowinski & Marrocco, 1975) minimizes the sum f(θ) + h(γ) subject to the affine constraints Aθ + Bγ = c. Although the objective function is separable in the block variables θ and γ, the affine constraints frustrate a direct attack. However, the problem is ripe for a combination of the augmented Lagrangian method and a single round of block descent per iteration. The augmented Lagrangian is

$$\mathcal{L}_\rho(\theta, \gamma, \lambda) = f(\theta) + h(\gamma) + \lambda^t(A\theta + B\gamma - c) + \frac{\rho}{2}\,\|A\theta + B\gamma - c\|^2.$$

Minimization is performed over θ and γ by block descent before updating the multiplier vector λ via

$$\lambda_{n+1} = \lambda_n + \rho\,(A\theta_{n+1} + B\gamma_{n+1} - c).$$

Introduction of block descent simplifies the usual augmented Lagrangian method, which minimizes the augmented Lagrangian jointly over θ and γ. This modest change keeps the convergence theory intact (Boyd et al., 2011; Fortin & Glowinski, 1983) and has led to a resurgence in the popularity of ADMM in machine learning (Bien & Tibshirani, 2011; Boyd et al., 2011; Chen et al., 2012; Qin & Goldfarb, 2012; Richard et al., 2012; Xue et al., 2012).

Example 11. Fused lasso

The ADMM is helpful in reducing difficult optimization problems to simpler ones. The easiest fused lasso problem minimizes the criterion (Tibshirani et al., 2005)

$$f(\theta) = \frac{1}{2}\,\|y - \theta\|^2 + \mu\sum_{i=1}^{p-1}|\theta_{i+1} - \theta_i|.$$

The ℓ1 penalty on the increments θi + 1 − θi favours piecewise constant solutions. Unfortunately, this twist on the standard lasso penalty renders coordinate descent inefficient. We can reformulate the problem as minimizing the criterion 1 ∕ 2 ∥ y − θ ∥ 2 + μ ∥ γ ∥ 1 subject to the constraint γ = Dθ, where

$$D = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}.$$

In the augmented Lagrangian framework, updating θ amounts to minimizing 1 ∕ 2 ∥ y − θ ∥ 2 + ρ ∕ 2 ∥ γ − 1 ∕ ρλ − Dθ ∥ 2. It is straightforward to solve this least squares problem. Updating γ involves minimizing ρ ∕ 2 ∥ Dθ − γ ∥ 2 + μ ∥ γ ∥ 1, which is a standard lasso problem. Thus, ADMM decouples the problematic linear transformation Dθ from the lasso penalty.
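
The three ADMM updates for this problem are each a one-liner, as the following sketch shows; it adopts a standard scaled form of the multiplier and is meant as an illustration, not a tuned solver.

```python
# ADMM for the fused lasso: min 0.5*||y - theta||^2 + mu*||D theta||_1.
import numpy as np

def fused_lasso_admm(y, mu, rho=1.0, iters=500):
    p = len(y)
    D = np.diff(np.eye(p), axis=0)          # (p-1) x p differencing matrix
    theta, gamma, lam = y.copy(), D @ y, np.zeros(p - 1)
    M = np.eye(p) + rho * D.T @ D           # fixed matrix for the theta update
    for _ in range(iters):
        theta = np.linalg.solve(M, y + rho * D.T @ (gamma - lam / rho))
        z = D @ theta + lam / rho
        gamma = np.sign(z) * np.maximum(np.abs(z) - mu / rho, 0.0)  # lasso step
        lam += rho * (D @ theta - gamma)    # multiplier update
    return theta
```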

Algorithm Acceleration


Many MM and block descent algorithms converge very slowly. In partial compensation, the computational work per iteration may be light. Even so, diminishing the number of iterations until convergence by one or two orders of magnitude is an attractive proposition (Berlinet & Roland, 2007; Jamshidian & Jennrich, 1997; Kuroda & Sakakihara, 2006; Lange, 1995; Roland & Varadhan, 2005; Zhou et al., 2011). In this section, we discuss a generic method for accelerating a wide variety of algorithms (Zhou et al., 2011). Consider a differentiable algorithm map θn + 1 = A(θn) for optimizing an objective function f(θ), and suppose stationary points of f(θ) correspond to fixed points of A(θ). Equivalently, stationary points correspond to roots of the equation B(θ) = θ − A(θ) = 0. Within this framework, it is natural to apply Newton's method

$$\theta_{n+1} = \theta_n - \big[I - dA(\theta_n)\big]^{-1}\big[\theta_n - A(\theta_n)\big] \tag{9}$$

to find the root and accelerate the overall process. This is a realistic expectation because Newton's method converges at a quadratic rate in contrast to the linear rates of MM and block descent algorithms.

There are two principal impediments to implementing algorithm (9) in high dimensions. First, it appears to require evaluation and storage of the Jacobian matrix dA(θ), whose rows are the differentials of the components of A(θ). Second, it also appears to require inversion of the matrix I − dA(θ). Both problems can be attacked by secant approximations. Close to the optimal point θ ∞ , the linear approximation

$$A\circ A(\theta_n) - A(\theta_n) \approx M\,\big[A(\theta_n) - \theta_n\big]$$

is valid. This suggests that we take two ordinary steps and gather information in the process on the matrix M = dA(θ ∞ ). If we let v be the vector A ∘ A(θn) − A(θn) and u be the vector A(θn) − θn, then the secant condition reads Mu = v. In practice, it is advisable to exploit multiple secant conditions Mui = vi as long as their number does not exceed the number of parameters p. The secant conditions can be generated one per iteration over the current and previous q − 1 iterations. Let us represent the conditions collectively in the matrix form MU = V for U = (u1, … ,uq) and V = (v1, … ,vq).

The principle of parsimony suggests that we replace M by the smallest matrix satisfying the secant conditions. If we pose this problem concretely as minimizing the criterion ∥ M ∥ F2 subject to the constraints MU = V, then a straightforward exercise in Lagrange multipliers gives the solution M = V(UtU) − 1Ut (Lange, 2010). The matrix M has rank at most q, and the Sherman–Morrison formula yields the explicit inverse

$$\big[I - V(U^t U)^{-1}U^t\big]^{-1} = I + V\,\big(U^t U - U^t V\big)^{-1}U^t.$$

Fortunately, it involves inverting just the q × q matrix UtU − UtV. Furthermore, the Newton update (9) boils down to

$$\theta_{n+1} = A(\theta_n) - V\,\big(U^t U - U^t V\big)^{-1}U^t\,\big[\theta_n - A(\theta_n)\big].$$

The advantages of this procedure include the following: (a) it avoids large matrix inverses, (b) it relies on matrix times vector multiplication rather than matrix times matrix multiplication, (c) it requires only storage of the small matrices U and V, and (d) it respects linear parameter constraints. Non-negativity constraints may be violated. The number of secants q should be fixed in advance, say between 1 and 15, and the matrices U and V should be updated by substituting the latest secant pair generated for the earliest secant pair retained. If an accelerated step fails the descent test, then one can revert to the ordinary MM or block descent step.
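
The sketch below implements the acceleration with a single secant condition (q = 1), in which case the q × q inverse collapses to a scalar; the fallback logic mirrors the safeguards just described, and the update follows the formula displayed above.

```python
# Quasi-Newton (secant) acceleration of a fixed point map A with q = 1.
import numpy as np

def accelerate_q1(A, theta, iters=100, f=None):
    for _ in range(iters):
        a1 = A(theta)
        a2 = A(a1)
        u, v = a1 - theta, a2 - a1           # secant pair from two ordinary steps
        c = u @ u - u @ v                    # the 1 x 1 matrix U'U - U'V
        if abs(c) < 1e-12:
            theta = a1                       # degenerate secant: ordinary step
            continue
        accel = a1 - ((u @ (theta - a1)) / c) * v
        if f is not None and f(accel) > f(a1):
            accel = a1                       # accelerated step fails descent test
        theta = accel
    return theta
```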

Acceleration of non-smooth algorithms is more problematic (Hiriart-Urruty & Lemarechal, 2001). For gradient descent and its generalizations (Combettes & Wajs, 2005) to non-smooth problems, Nesterov (2007) has suggested a potent acceleration. As noted by Beck & Teboulle (2009), the accelerated iterates in ordinary gradient descent depend on an intermediate scalar tn and an intermediate vector ϕ according to the formulas

$$\theta_n = \phi - s\,\nabla f(\phi), \qquad t_{n+1} = \frac{1 + \sqrt{1 + 4t_n^2}}{2}, \qquad \phi = \theta_n + \frac{t_n - 1}{t_{n+1}}\,(\theta_n - \theta_{n-1}),$$

with initial values t1 = 1 and ϕ = θ0. In other words, instead of taking a steepest descent step from the current iterate, one takes a steepest descent step from the extrapolated point ϕ, which depends on both the current iterate θn and the previous iterate θn − 1. This mysterious extrapolation algorithm can yield impressive speedups for essentially the same computational cost per iteration as gradient descent.
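
The extrapolation is only a few lines of code. The sketch below applies it to smooth gradient descent with a fixed step size s = 1 ∕ L; the function names are illustrative.

```python
# Nesterov-accelerated gradient descent for a smooth objective.
import numpy as np

def nesterov_gradient(grad, theta0, s, iters=500):
    theta_old = np.asarray(theta0, dtype=float)
    phi, t = theta_old.copy(), 1.0
    for _ in range(iters):
        theta = phi - s * grad(phi)          # gradient step from extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        phi = theta + ((t - 1.0) / t_next) * (theta - theta_old)
        theta_old, t = theta, t_next
    return theta_old
```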

Discussion


The fault lines in optimization separate smooth from non-smooth problems, unconstrained from constrained problems, and small-scale from large-scale problems. Smooth, unconstrained, and small-scale problems are easy to solve. Mathematical scientists are beginning to tackle non-smooth, constrained, large-scale problems at the opposite end of the difficulty spectrum. The most spectacular successes usually rely on convexity. We can expect further progress because some of the best minds in applied mathematics, computer science, and statistics have taken up the challenge. What is unlikely to occur is the discovery of a universally valid algorithm. Optimization is apt to remain as much art as science for a long time to come.

We have emphasized a few key ideas in this survey. Our examples demonstrate some of the possibilities for mixing and matching the different algorithm themes. Although we cannot predict the future of computational statistics with any certainty, the key ideas mentioned here will not disappear. For instance, penalization is here to stay, the descent property of an algorithm is always desirable, and quadratic approximation will always be superior to linear approximation for smooth functions. As computing devices hit physical constraints, the importance of parallel algorithms will also likely increase. This argues that block descent and parameter-separated MM algorithms will play a larger role in the future (Zhou et al., 2010). Although we have de-emphasized convex calculus, readers who want to devise their own algorithms are well advised to learn this inherently subtle subject. There is a difference, after all, between principled algorithms and ad hoc procedures.

References

  • ACM SIGKDD and Netflix. (2007). Proceedings of KDD Cup and Workshop. Available online http://www.cs.uic.edu/liub/Net-flix-KDD-Cup-2007.html.
  • Beck A. & Teboulle M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2, 183–202.
  • Berlinet A. & Roland C. (2007). Acceleration schemes with application to the EM algorithm. Comp. Statist. Data Anal., 51, 3689–3702.
  • Bien J. & Tibshirani R. J. (2011). Sparse estimation of a covariance matrix. Biometrika, 98(4), 807–820.
  • Böhning D. & Lindsay B. G. (1988). Monotonicity of quadratic approximation algorithms. Ann. Instit. Stat. Math., 40, 641–663.
  • Borwein J. M. & Lewis A. S. (2000). Convex Analysis and Nonlinear Optimization: Theory and Examples. New York: Springer.
  • Boyd S., Parikh N., Chu E., Peleato B. & Eckstein J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1), 1–122.
  • Bradley E. L. (1973). The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. J. Amer. Statist. Assoc., 68, 199–200.
  • Cai J.-F., Candès E. J. & Shen Z. (2008). A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20, 1956–1982.
  • Candès E. J. & Tao T. (2009). The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inform. Theory, 56, 2053–2080.
  • Charnes A., Frome E. L. & Yu P. L. (1976). The equivalence of generalized least squares and maximum likelihood in the exponential family. J. Amer. Stat. Assoc., 71, 169–171.
  • Chen C., He B. & Yuan X. (2012). Matrix completion via an alternating direction method. IMA J. Numer. Anal., 32, 227–245.
  • Chen H. F. (2002). Stochastic Approximation and its Applications. Dordrecht: Kluwer.
  • Chen S. S., Donoho D. L. & Saunders M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20, 33–61.
  • Chi E. C., Zhou H., Ortega Del Vecchyo D. & Lange K. (2013). Genotype imputation via matrix completion. Genome Res., 23, 509–518.
  • Claerbout J. & Muir F. (1973). Robust modeling with erratic data. Geophys., 38, 826–844.
  • Combettes P. & Wajs V. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul., 4, 1168–1200.
  • Conn A. R., Gould N. I. M. & Toint P. L. (1991). Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math. Prog., 50, 177–195.
  • Davidon W. C. (1959). Variable metric methods for minimization. AEC Research and Development Report ANL–5990, Argonne National Laboratory, USA.
  • de Leeuw J. (1977). Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics, Eds. Barra J. R., Brodeau F., Romier G. & Van Cutsem B., pp. 133–146. Amsterdam: North Holland Publishing Company.
  • de Leeuw J. (1994). Block relaxation algorithms in statistics. In Information Systems and Data Analysis, Eds. Bock H. H., Lenski W., Richter M. M., New York: Springer.
  • Dempster A. P., Laird N. M. & Rubin D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. B, 39, 1–38.
  • Dennis J.E. Jr & Schnabel R. B. (1996). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Philadelphia: SIAM.
  • Ding C., Li T. & Jordan M. I. (2010). Convex and semi-nonnegative matrix factorizations. IEEE Trans. Pattern Anal. Mach. Intell., 32, 45–55.
  • Dobson A. J. (1990). An Introduction to Generalized Linear Models. London: Chapman & Hall.
  • Donoho D. & Johnstone I. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425–455.
  • Duchi J., Shalev-Shwartz S., Singer Y. & Chandra T. (2008). Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), New York: ACM.
  • Fortin M. & Glowinski R. (1983). Augmented Lagrangian methods: applications to the numerical solution of boundary-value problems. J. Appl. Math. Mech., 65, 622.
  • Friedman J., Hastie T. & Tibshirani R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat., 1, 302–332.
  • Gabay D. & Mercier B. (1976). A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Comp. Math. Appl., 2, 17–40.
  • Gifi A. (1990). Nonlinear Multivariate Analysis. Hoboken, NJ: Wiley.
  • Glowinski R. & Marrocco A. (1975). Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Rev. Française d'Aut. Inf. Rech. Opér., 2, 41–76.
  • Goldstein A. A. (1964). Convex programming in Hilbert space. Bull. Amer. Math. Soc., 70, 709–710.
  • Green P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation and some robust and resistant alternatives (with discussion). J. Roy. Stat. Soc. B, 46, 149–192.
  • Hastie T., Tibshirani R. & Friedman J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer.
  • Hestenes M. R. (1969). Multiplier and gradient methods. J. Optim. Theory Appl., 4, 303–320.
  • Hiriart-Urruty J. B. & Lemarechal C. (1996). Convex Analysis and Minimization Algorithms: Part 1: Fundamentals. New York: Springer.
  • Hiriart-Urruty J. B. & Lemarechal C. (2001). Convex Analysis and Minimization Algorithms: Part 2: Advanced Theory and Bundle Methods. New York: Springer.
  • Hunter D. R. & Lange K. (2004). A tutorial on MM algorithms. Amer. Statist., 58, 30–37.
  • Jamshidian M. & Jennrich R. I. (1997). Quasi-Newton acceleration of the EM algorithm. J. Roy. Stat. Soc. B, 59, 569–587.
  • Jank W. (2006). Implementing and diagnosing the stochastic approximation EM algorithm. J. Comput. Graph. Statist., 15, 803–829.
  • Jennrich R. I. & Moore R. H. (1975). Maximum likelihood estimation by means of nonlinear least squares. In Proceedings of the Statistical Computing Section: Amer. Stat. Assoc., Atlanta, Georgia, pp. 57–65.
  • Khalfan H. F., Byrd R. H. & Schnabel R. B. (1993). A theoretical and experimental study of the symmetric rank-one update. SIAM J. Optim., 3, 1–24.
  • Kiefer J. & Wolfowitz J. (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Stat., 23, 462–466.
  • Kruskal J. B. (1965). Analysis of factorial experiments by estimating monotone transformations of the data. J. Roy. Stat. Soc. B, 27, 251–263.
  • Kuroda M. & Sakakihara M. (2006). Accelerating the convergence of the EM algorithm using the vector epsilon algorithm. Comput. Statist. Data Anal., 51, 1549–1561.
  • Kushner H. J. & Yin G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. New York: Springer.
  • Lange K. (1995). A gradient algorithm locally equivalent to the EM algorithm. J. Roy. Stat. Soc. B, 57, 425–437.
  • Lange K. (1995). A quasi-Newton acceleration of the EM algorithm. Statist. Sinica, 5, 1–18.
  • Lange K. (2010). Numerical Analysis for Statisticians, 2nd ed. New York: Springer.
  • Lange K. (2012). Optimization, 2nd ed. New York: Springer.
  • Lange K., Hunter D. R. & Yang I. (2000). Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Statist., 9, 1–59.
  • Lee D. D. & Seung H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
  • Levitin E. S. & Polyak B. T. (1966). Constrained minimization problems. USSR Comput. Math and Math Physics, 6, 1–50.
  • Mazumder R., Hastie T. & Tibshirani R. (2010). Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11, 2287–2322.
  • McLachlan G. J. & Krishnan T. (2008). The EM Algorithm and Extensions, 2nd ed. Hoboken, NJ: Wiley.
  • Meinshausen N. & Bühlmann P. (2010). Stability selection. J. Roy. Stat. Soc. B, 72, 417–473.
  • Meng X.-L. & Rubin D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80, 267–278.
  • Michelot C. (1986). A finite algorithm for finding the projection of a point onto the canonical simplex in R^n. J. Optim. Theory Appl., 50, 195–200.
  • Nelder J. A. & Wedderburn R. W. M. (1972). Generalized linear models. J. Roy. Stat. Soc. A, 135, 370–384.
  • Nesterov Y. (2007). Gradient methods for minimizing composite objective function. CORE Discussion Papers.
  • Nocedal J. & Wright S. (2006). Numerical Optimization, 2nd ed. New York: Springer.
  • Ortega J. M. & Rheinboldt W. C. (1970). Iterative Solution of Nonlinear Equations in Several Variables. New York: Academic.
  • Osborne M. R. (1992). Fisher's method of scoring. Int. Stat. Rev., 60, 99–117.
  • Paatero P. & Tapper U. (1994). Positive matrix factorization: a non-negative factor model with optimal utilization of error. Environmetrics, 5, 111–126.
  • Park M. Y. & Hastie T. (2007). L1-regularization path algorithm for generalized linear models. J. Roy. Stat. Soc. B, 69, 659–677.
  • Powell M. J. D. (1969). A method for nonlinear constraints in minimization problems. In Optimization, Ed. Fletcher R., pp. 283–298. New York: Academic Press.
  • Qin Z. & Goldfarb D. (2012). Structured sparsity via alternating direction methods. J. Mach. Learn. Res., 13, 1435–1468.
  • Ranola J. M., Ahn S., Sehl M. E., Smith D. J. & Lange K. (2010). A Poisson model for random multigraphs. Bioinformatics, 26, 2004–2011.
  • Rao C. R. (1973). Linear Statistical Inference and its Applications, 2nd ed. Hoboken, NJ: Wiley.
  • Richard E., Savalle P.-A. & Vayatis N. (2012). Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pp. 1351–1358. Edinburgh, Scotland, UK.
  • Robbins H. & Monro S. (1951). A stochastic approximation method. Ann. Math. Stat., 22, 400–407.
  • Robert C. & Casella G. (2004). Monte Carlo Statistical Methods. New York: Springer.
  • Rockafellar R. T. (1973). The multiplier method of Hestenes and Powell applied to convex programming. J. Optim. Theory Appl., 12, 555–562.
  • Roland C. & Varadhan R. (2005). New iterative schemes for nonlinear fixed point problems, with applications to problems with bifurcations and incomplete-data problems. Appl. Numer. Math., 55, 215–226.
  • Ruszczyński A. (2006). Nonlinear Optimization. Princeton, NJ: Princeton University Press.
  • Santosa F. & Symes W. W. (1986). Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Stat. Comput., 7, 1307–1330.
  • Sha F., Saul L. K. & Lee D. D. (2003). Multiplicative updates for nonnegative quadratic programming in support vector machines. In Advances in Neural Information Processing Systems, Vol. 15, Eds. Becker S., Thrun S. & Obermayer K., pp. 1065–1073. Cambridge, MA: MIT Press.
  • Strang G. & Borre K. (2012). Algorithms for Global Positioning. Wellesley, MA: Wellesley-Cambridge Press.
  • Taylor H., Banks S. C. & McCoy J. F. (1979). Deconvolution with the L1 norm. Geophys., 44, 39–52.
  • Teo C. H., Vishwanathan S., Smola A. J. & Le Q. V. (2010). Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 11, 311–365.
  • Tibshirani R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B, 58, 267–288.
  • Tibshirani R., Saunders M., Rosset S., Zhu J. & Knight K. (2005). Sparsity and smoothness via the fused lasso. J. Roy. Stat. Soc. B, 67, 91–108.
  • Wei G. C. G. & Tanner M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. JASA, 85, 699–704.
  • Wu T. T., Chen Y. F., Hastie T., Sobel E. M. & Lange K. (2009). Genomewide association analysis by lasso penalized logistic regression. Bioinformatics, 25, 714–721.
  • Wu T. T. & Lange K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat., 2, 224–244.
  • Wu T. T. & Lange K. (2010). The MM alternative to EM. Stat. Sci., 25, 492–505.
  • Xue L., Ma S. & Zou H. (2012). Positive definite L1 penalized estimation of large covariance matrices. JASA, 107, 1480–1491.
  • Zhou H. & Zhang Y. (2012). EM vs MM: a case study. Comp. Stat. Data Anal., 56, 3909–3920.
  • Zhou H., Alexander D. H. & Lange K. (2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statist. Comput., 21, 261–273.
  • Zhou H. & Lange K. (2010). MM algorithms for some discrete multivariate distributions. J. Comput. Graph. Statist., 19, 645–665.
  • Zhou H., Lange K. & Suchard M. A. (2010). Graphics processing units and high-dimensional optimization. Stat. Sci., 25, 311–324.
  • Zou H. (2006). The adaptive lasso and its oracle properties. JASA, 101, 1418–1429.
  • Zou H. & Hastie T. (2005). Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B, 67, 301–320.