
Optimization methods have always played a key role in the development of statistical methodology, but they have become critically important for modern methods that analyse large, high-dimensional data sets. Lange, Chi and Zhou are to be commended for providing a comprehensive overview of optimization methods widely used in statistics. The paper discusses classical unconstrained optimization algorithms (steepest descent and its variants) and the majorization–minimisation framework, which has proved very useful in devising novel algorithms for a variety of statistical problems, and also provides a flavour of constrained optimization problems arising in high-dimensional statistics (e.g. regularisation, matrix completion, etc.). The topics discussed and their accompanying examples focus on important classes of algorithms that have helped statisticians develop and fit complex models.

In this note, we focus on a class of algorithms suitable for high-dimensional constrained optimization problems, namely proximal algorithms. These algorithms are very generally applicable in constrained non-smooth optimization, and they are particularly well suited to recent statistical techniques developed for the analysis of high-dimensional data (Bach et al., 2011; Lee et al., 2010; Ravikumar et al., 2010). In our discussion, we make connections to a number of topics discussed in Lange, Chi and Zhou, including majorization–minimisation, classical gradient descent algorithms and acceleration schemes.

### Proximal Algorithms

We start by providing some definitions and then discuss special cases arising in high-dimensional statistics.

Let f : ℝⁿ → ℝ ∪ { + ∞} be a closed proper convex function; that is, its epigraph is a non-empty closed convex set. The proximal operator of f is defined as

$$\operatorname{prox}_{\lambda f}(u) = \operatorname*{arg\,min}_{x}\;\Big\{ f(x) + \frac{1}{2\lambda}\,\|x - u\|^{2} \Big\} \tag{1}$$

where ∥ ⋅ ∥ denotes the usual Euclidean norm, and λ > 0 is a parameter that controls the degree to which the proximal operator maps points towards the minimum of f. As shown in Boyd and Vandenberghe (2004), the function being minimised is strongly convex and not everywhere infinite and hence admits a unique minimiser. Note that the definition implies that proxλf(u) is a point that approximately minimises f while remaining close to u.

In general, evaluating the proximal mapping requires solving the minimisation problem posited in (1), but there are instances where the proximal operator admits an explicit expression.

1. If f(x) = 0, then proxλf(u) = u.

2. If f(x) = IC(x), where IC denotes the indicator function of a closed convex set C, then proxλf(u) = PC(u), the Euclidean projection of u onto C.

3. If f(x) = ψ ∥ x ∥ 1, where ∥ ⋅ ∥ 1 denotes the ℓ1 norm, then the i-th coordinate of proxλf(u) is given by: (i) proxλf(u)i = 0 if | ui | ≤ λψ; (ii) proxλf(u)i = ui − λψ if ui > λψ; and (iii) proxλf(u)i = ui + λψ if ui < − λψ. It can be seen that this proximal mapping corresponds to a soft-thresholding operation.
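These closed-form mappings take only a few lines of code. The sketch below is ours, not from the paper: the helper names `prox_l1` and `prox_box` are illustrative, and the box [lo, hi]ⁿ stands in for a generic convex set C.

```python
import numpy as np

def prox_l1(u, psi):
    """Proximal operator of f(x) = psi * ||x||_1 (with the lambda * psi
    threshold absorbed into psi): coordinate-wise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - psi, 0.0)

def prox_box(u, lo, hi):
    """Proximal operator of the indicator of the box [lo, hi]^n:
    the Euclidean projection onto the box."""
    return np.clip(u, lo, hi)
```

For example, `prox_l1(np.array([3.0, -0.5, -2.0]), 1.0)` shrinks each coordinate towards zero by 1 and zeroes out the middle one.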

A proximal minimisation algorithm is defined at the k-th iteration step by

$$x_{k+1} = \operatorname{prox}_{\lambda_k f}(x_k).$$

As presented in Bauschke and Combettes (2011), if f has a minimum, then xk converges to the set of minimisers of f and f(xk) to the optimal value. Similarly to steepest descent algorithms, convergence is guaranteed for step sizes satisfying λk > 0 and ∑k λk = ∞.

This basic algorithm has not found many applications but is instructive from a theoretical point of view. One important application, though, is the minimisation of the quadratic function

$$f(x) = \tfrac{1}{2}\, x^{\top} A x - b^{\top} x,$$

with A ≽ 0 (positive semidefinite). The proximal mapping can be written down explicitly as

$$\operatorname{prox}_{\lambda f}(u) = (I + \lambda A)^{-1}(u + \lambda b),$$

which gives rise to the iterative refinement update of Golub and Wilkinson (1966)
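As a numerical sketch of this quadratic case (function name ours; a serious implementation would factor I + λA once rather than solve repeatedly), the proximal minimisation iteration converges to the solution of Ax = b:

```python
import numpy as np

def prox_quadratic_iteration(A, b, lam=1.0, iters=200):
    """Proximal minimisation of f(x) = 0.5 x'Ax - b'x via the explicit
    proximal mapping: x_{k+1} = (I + lam*A)^{-1} (x_k + lam*b)."""
    n = A.shape[0]
    M = np.eye(n) + lam * A
    x = np.zeros(n)
    for _ in range(iters):
        x = np.linalg.solve(M, x + lam * b)
    return x

# Toy check: A = diag(2, 4), b = (2, 4), so the minimiser is A^{-1} b = (1, 1).
x_star = prox_quadratic_iteration(np.diag([2.0, 4.0]), np.array([2.0, 4.0]))
```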

The power of proximal mappings is revealed in the following setting.

Consider the following optimization problem:

$$\min_{x}\; f(x) + g(x) \tag{2}$$

where f, g : ℝⁿ → ℝ ∪ { + ∞} are closed proper convex functions, and f is differentiable. The function g can be used to encode constraints on x, as discussed in the ‘Augmented Lagrangians’ section of Lange, Chi and Zhou.

Then, the proximal gradient method is given by

$$x_{k+1} = \operatorname{prox}_{\lambda_k g}\big(x_k - \lambda_k \nabla f(x_k)\big) \tag{3}$$

where λk > 0 denotes the step size. If ∇ f(x) is Lipschitz continuous with constant M, then this algorithm converges at rate O(1 ∕ k) for fixed step size λk ≡ λ ∈ (0,1 ∕ M]. If M is unknown, then various line search strategies (akin to those used in gradient descent methods) can be employed.

Note that, using the special forms of the proximal operator given earlier, when g = 0 this algorithm reduces to standard gradient descent; when g = ψ ∥ x ∥ 1 (a lasso penalty), it leads to iterative soft-thresholding; and when g = IC, it reduces to the projected gradient method of Bertsekas (1999).
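The lasso case can be sketched in a few lines. The function name and the toy setup below are ours, under the assumption of a least-squares f(x) = ½∥Ax − y∥²:

```python
import numpy as np

def proximal_gradient_lasso(A, y, psi, iters=500):
    """Proximal gradient (iterative soft-thresholding) for
    min_x 0.5 * ||Ax - y||^2 + psi * ||x||_1."""
    # Fixed step size 1/M, where M = ||A||_2^2 is the Lipschitz
    # constant of the gradient x -> A'(Ax - y).
    lam = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - y)
        u = x - lam * grad                                   # gradient step
        x = np.sign(u) * np.maximum(np.abs(u) - lam * psi, 0.0)  # prox of lam*psi*||.||_1
    return x

# Toy check with A = I: the solution is the soft-thresholding of y.
x_hat = proximal_gradient_lasso(np.eye(3), np.array([3.0, 0.5, -2.0]), psi=1.0)
```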

In Beck and Teboulle (2012), the proximal gradient algorithm is interpreted as a majorization–minimisation algorithm. Specifically, consider the surrogate for f(x) given by

$$Q_{\lambda}(x, y) = f(y) + \nabla f(y)^{\top}(x - y) + \frac{1}{2\lambda}\,\|x - y\|^{2}.$$

For fixed λ, Qλ(x,y) is convex in x, satisfies Qλ(x,x) = f(x), and is an upper bound on f when ∇ f is Lipschitz continuous with constant M and λ ∈ (0,1 ∕ M]. Then, the algorithm that uses updates of the form

$$x_{k+1} = \operatorname*{arg\,min}_{x}\; Q_{\lambda}(x, x_k)$$

is a majorization–minimisation one. Analogously, for the function f(x) + g(x), we can use the surrogate Qλ(x,y) + g(x), and some algebra shows that the majorization–minimisation-based update

$$x_{k+1} = \operatorname*{arg\,min}_{x}\; \big\{ Q_{\lambda}(x, x_k) + g(x) \big\}$$

is equivalent to (3).
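Completing the square makes this equivalence explicit: writing Qλ(x,y) = f(y) + ∇f(y)ᵀ(x − y) + ∥x − y∥² ∕ (2λ) and dropping terms that do not depend on x,

```latex
\begin{aligned}
\operatorname*{arg\,min}_{x}\;\big\{ Q_{\lambda}(x, x_k) + g(x) \big\}
  &= \operatorname*{arg\,min}_{x}\;\Big\{ \nabla f(x_k)^{\top}(x - x_k)
     + \tfrac{1}{2\lambda}\|x - x_k\|^{2} + g(x) \Big\} \\
  &= \operatorname*{arg\,min}_{x}\;\Big\{ \tfrac{1}{2\lambda}
     \big\| x - \big(x_k - \lambda \nabla f(x_k)\big) \big\|^{2} + g(x) \Big\} \\
  &= \operatorname{prox}_{\lambda g}\!\big(x_k - \lambda \nabla f(x_k)\big),
\end{aligned}
```

which is exactly the proximal gradient step.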

Nesterov (1983) introduced a sequence of updates that accelerates the convergence rate from O(1 ∕ k) to O(1 ∕ k²) in convex programming problems. The sequence extrapolates between previous updates of the algorithm. Specifically, for the proximal gradient method, it takes the form

$$y_{k+1} = x_k + \omega_k (x_k - x_{k-1}), \qquad x_{k+1} = \operatorname{prox}_{\lambda_k g}\big(y_{k+1} - \lambda_k \nabla f(y_{k+1})\big) \tag{4}$$

where ωk ∈ [0,1) is an extrapolation parameter and λk the usual step size. A simple choice for the extrapolation parameter is ωk = k ∕ (k + 3).

Then, it can be established that for ∇ f Lipschitz continuous with constant M, the acceleration scheme guarantees convergence at rate O(1 ∕ k²) with a fixed step size λk ≡ λ ∈ (0,1 ∕ M]. As before, if M is unknown, the step sizes can be determined through a line search.
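The accelerated scheme differs from the plain proximal gradient method only in the extrapolation step. A sketch for the lasso case (names and setup ours, using ωk = k ∕ (k + 3)):

```python
import numpy as np

def accelerated_proximal_gradient_lasso(A, y, psi, iters=300):
    """Nesterov-accelerated proximal gradient for
    min_x 0.5 * ||Ax - y||^2 + psi * ||x||_1, with omega_k = k/(k+3)."""
    lam = 1.0 / np.linalg.norm(A, 2) ** 2   # fixed step size in (0, 1/M]
    x = x_prev = np.zeros(A.shape[1])
    for k in range(iters):
        omega = k / (k + 3.0)               # extrapolation parameter
        z = x + omega * (x - x_prev)        # extrapolated point y_{k+1}
        u = z - lam * (A.T @ (A @ z - y))   # gradient step at z
        x_prev, x = x, np.sign(u) * np.maximum(np.abs(u) - lam * psi, 0.0)
    return x

# Toy check with A = I: the solution is again the soft-thresholding of y.
x_acc = accelerated_proximal_gradient_lasso(np.eye(3), np.array([3.0, 0.5, -2.0]), psi=1.0)
```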

An area where we believe more research is needed to specifically suit the needs of statistics is stochastic optimization. This corresponds, for example, to (2) where the function f is given as an intractable integral of the form f(x) = ∫ F(x, u) μ(du). This is a well-known problem in stochastic programming and online learning and has generated various stochastic extensions of the algorithms presented earlier (Nemirovski et al., 2009; Xiao, 2010; Duchi et al., 2012; Lan, 2012). But in statistics, this problem takes a more challenging form. Latent variables abound in statistics and lead to log-likelihood functions and their derivatives that are intractable: ℓ(x) = log ∫ p(y, u; x) du and ∇ ℓ(x) = ∫ ∇x log p(y, u; x) π(u; x) du, where π(u; x) ∝ p(y, u; x). The important point here is that the distribution π(⋅; x) is typically very difficult to simulate and, unlike most examples in online learning and stochastic optimization, depends on x. Nevertheless, one can easily adapt the proximal gradient algorithm and its accelerated version presented earlier by replacing ∇ f(x) with a Monte Carlo approximation obtained by simulating from π(⋅; x) (possibly using Markov chain Monte Carlo methods). The approximation can be obtained with a fixed or an increasing number of Monte Carlo steps.
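The structure of such a stochastic proximal gradient scheme can be sketched generically. Everything below is our illustration, not the paper's implementation: `grad_sampler` stands in for whatever mechanism (exact simulation or MCMC) produces unbiased gradient draws from a distribution depending on the current iterate, and the toy problem replaces an intractable likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_proximal_gradient(grad_sampler, prox, x0, lam=0.05, iters=400, mc_size=50):
    """Proximal gradient with grad f(x) replaced by a Monte Carlo average.

    grad_sampler(x, m) returns m unbiased draws of the gradient at x
    (in the latent-variable setting, obtained by simulating u ~ pi(.; x))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g_hat = grad_sampler(x, mc_size).mean(axis=0)  # Monte Carlo gradient estimate
        x = prox(x - lam * g_hat, lam)                 # usual proximal step
    return x

# Hypothetical toy problem: minimise E[0.5 * (x - Z)^2] + 0.5 * |x| with Z ~ N(1, 1);
# the exact solution is the soft-thresholding of 1 at 0.5, i.e. x = 0.5.
def toy_grad_sampler(x, m):
    return x - rng.normal(1.0, 1.0, size=(m,) + x.shape)  # unbiased gradient draws

def prox_l1(u, lam, psi=0.5):
    return np.sign(u) * np.maximum(np.abs(u) - lam * psi, 0.0)

x_hat = stochastic_proximal_gradient(toy_grad_sampler, prox_l1, x0=np.zeros(1))
```

With a fixed Monte Carlo size the iterates fluctuate around the solution; increasing `mc_size` along the iterations would shrink this fluctuation, mirroring the fixed-versus-increasing choice discussed above.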

For illustration purposes, consider the following random effects logistic regression example. We have n statistical units with repeated binary responses {yit, 1 ≤ t ≤ Ti}, yit ∈ {0,1}. For the covariate matrix Z ∈ ℝⁿ⁽ˣ⁾ᵖ, with i-th row denoted by zi, we assume that

$$\operatorname{logit}\, \mathbb{P}(y_{it} = 1 \mid u_i) = z_i \beta + u_i,$$

where ui ∼ N(0, σ²) are independent random effects. The log-likelihood of β is intractable and requires integrating out the random effects u. We estimate β by an ℓ1-penalised likelihood approach; thus, g(β) = ψ ∥ β ∥ 1.

We implemented the proximal gradient algorithm and its accelerated version by replacing ∇ f(xk) in (3) and (4) with a Monte Carlo estimate obtained using a Markov chain Monte Carlo scheme. Figure 1 depicts the relative error ∥ βk − β⋆ ∥ ∕ ∥ β⋆ ∥ as a function of the iteration k for both algorithms, where β⋆ denotes the true value of the parameter vector.

This simulation example suggests that stochastic versions of the proximal gradient algorithm and its accelerated variants can be designed to deal with high-dimensional statistical models with intractable log-likelihood functions. However, more work is required to understand the conditions under which these extensions exhibit convergence properties analogous to those of their deterministic counterparts.

### Alternating Direction Method of Multipliers

In the proximal gradient algorithm, the function f was assumed to be smooth. However, the method can be adapted to handle the following variant of the minimisation problem:

$$\min_{x}\; f(x) + g(x),$$

where f, g : ℝⁿ → ℝ ∪ { + ∞} are closed proper convex functions, and both f and g can be non-differentiable. Then, the alternating direction method of multipliers employs the following updates:

$$x_{k+1} = \operatorname{prox}_{\lambda f}(z_k - u_k), \qquad
  z_{k+1} = \operatorname{prox}_{\lambda g}(x_{k+1} + u_k), \qquad
  u_{k+1} = u_k + x_{k+1} - z_{k+1}.$$

It can be seen that this method handles the two functions in the objective completely separately, through their proximal operators. It is most useful when the proximal operators of f and g can be computed easily, but that of the sum f + g cannot. The convergence theory for this method is discussed in detail in Boyd et al. (2011).
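For the lasso, the two proximal operators are a ridge-type linear solve and soft-thresholding, and the updates can be sketched directly (function name and toy setup ours; the penalty parameter ρ plays the role of 1 ∕ λ):

```python
import numpy as np

def admm_lasso(A, y, psi, rho=1.0, iters=300):
    """ADMM for min 0.5 * ||Ax - y||^2 + psi * ||z||_1  subject to  x = z.

    x-update: proximal operator of f (a ridge-type linear solve);
    z-update: proximal operator of g (soft-thresholding);
    u-update: dual ascent on the consensus constraint x = z."""
    n = A.shape[1]
    M = A.T @ A + rho * np.eye(n)   # a serious implementation would factor M once
    Aty = A.T @ y
    x = z = u = np.zeros(n)
    for _ in range(iters):
        x = np.linalg.solve(M, Aty + rho * (z - u))
        w = x + u
        z = np.sign(w) * np.maximum(np.abs(w) - psi / rho, 0.0)
        u = u + x - z
    return z

# Toy check with A = I: the lasso solution is the soft-thresholding of y.
z_hat = admm_lasso(np.eye(3), np.array([3.0, 0.5, -2.0]), psi=1.0)
```

Note that f and g never interact except through the consensus variable, which is what makes the method attractive when f + g has no tractable proximal operator.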

### Acknowledgement

The authors' research was supported in part by the NSF grant DMS-1228164.

### References

• Bach, F., Jenatton, R., Mairal, J. & Obozinski, G. (2011). Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn., 4, 1–106.
• Bauschke, H. H. & Combettes, P. L. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces. New York: Springer.
• Beck, A. & Teboulle, M. (2012). Smoothing and first order methods: a unified framework. SIAM J. Optim., 22, 557–580.
• Bertsekas, D. P. (1999). Nonlinear Programming. Belmont, MA: Athena Scientific.
• Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. New York: Cambridge University Press.
• Duchi, J. C., Agarwal, A., Johansson, M. & Jordan, M. I. (2012). Ergodic mirror descent. SIAM J. Optim., 22, 1549–1578.
• Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3, 1–122.
• Golub, G. H. & Wilkinson, J. H. (1966). Note on the iterative refinement of least squares solution. Numer. Math., 9, 139–148.
• Lan, G. (2012). An optimal method for stochastic composite optimization. Math. Program. Ser. A, 133, 365–397.
• Lee, J., Recht, B., Salakhutdinov, R., Srebro, N. & Tropp, J. A. (2010). Practical large scale optimization for max-norm regularization. Adv. Neural Inf. Process. Syst., 23, 1297–1305.
• Nemirovski, A., Juditsky, A., Lan, G. & Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19, 1574–1609.
• Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27, 372–376.
• Ravikumar, P., Agarwal, A. & Wainwright, M. J. (2010). Message-passing for graph-structured linear programs: proximal methods and rounding schemes. J. Mach. Learn. Res., 11, 1043–1080.
• Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11, 2543–2596.