Original Article

# Discussion

Article first published online: 17 FEB 2014

DOI: 10.1111/insr.12033

© 2014 The Authors. International Statistical Review © 2014 International Statistical Institute

Additional Information

#### How to Cite

Atchade, Y. and Michailidis, G. (2014), Discussion. International Statistical Review, 82: 71–75. doi: 10.1111/insr.12033

#### Publication History

- Issue published online: 22 APR 2014
- Article first published online: 17 FEB 2014
- Manuscript Accepted: 28 JUN 2013
- Manuscript Received: 17 JUN 2013

Optimization methods have always played a key role in the development of statistical methodology, but they have become critically important for modern methods that analyse large, high-dimensional data sets. Lange, Chi and Zhou are to be commended for providing a comprehensive overview of optimization methods widely used in statistics. The paper discusses classical unconstrained optimization algorithms (steepest descent and its variants) and the majorization–maximisation framework that has proved very useful in devising novel algorithms for a variety of statistical problems, and it also provides a flavour of constrained optimization problems arising in high-dimensional statistics (e.g. regularisation, matrix completion, etc.). The topics discussed and their accompanying examples focus on important classes of algorithms that have helped statisticians develop and solve complex models.

In this note, we focus on a class of algorithms suitable for high-dimensional constrained optimization problems: *proximal* algorithms. These algorithms are very generally applicable in constrained non-smooth optimization, and they are particularly well suited to recent statistical techniques developed for the analysis of high-dimensional data (Bach *et al.*, 2011; Lee *et al.*, 2010; Ravikumar *et al.*, 2010). In our discussion, we make connections to a number of topics discussed in Lange, Chi and Zhou, including majorization–maximisation, classical gradient descent algorithms and acceleration schemes.

### Proximal Algorithms

We start by providing some definitions and then discuss special cases arising in high-dimensional statistics.

Let *f* : ℝ^{n} → ℝ ∪ { + ∞ } be a closed proper convex function; that is, its epigraph is a non-empty closed convex set. The *proximal operator* of *f* is defined as

- prox_{f}(*u*) = argmin_{x ∈ ℝ^{n}} { *f*(*x*) + (1 ∕ 2*λ*) ∥ *x* − *u* ∥ ^{2} } (1)

where ∥ ⋅ ∥ denotes the usual Euclidean norm, and *λ* > 0 is a parameter that controls the degree to which the proximal operator maps points closer to the minimum of *f*. As shown in Boyd and Vandenberghe (2004), the function minimised is strongly convex and not everywhere infinite and hence admits a unique minimiser. Note that the definition implies that prox_{f}(*u*) is a point that approximately minimises *f*, but remains close to *u*.

In general, evaluating the proximal mapping requires solving the minimisation problem posited in (1), but there are instances where the proximal operator admits an explicit expression.

- If *f*(*x*) = 0, then prox_{f}(*u*) = *u*.
- If *f*(*x*) = *I*_{C}(*x*), where *I*_{C} denotes the indicator function of a convex set *C*, then prox_{f}(*u*) = *P*_{C}(*u*), the *projection P*_{C} of *u* on *C*.
- If *f*(*x*) = *ψ* ∥ *x* ∥ _{1}, where ∥ ⋅ ∥ _{1} denotes the *ℓ*_{1} norm, then the *i*-th coordinate of prox_{f}(*u*) is given by: (i) prox_{f}(*u*)_{i} = 0 if | *u*_{i} | ≤ *ψ*; (ii) prox_{f}(*u*)_{i} = *u*_{i} − *ψ* if *u*_{i} > *ψ*; and (iii) prox_{f}(*u*)_{i} = *u*_{i} + *ψ* if *u*_{i} < − *ψ*. This proximal mapping corresponds to a soft-thresholding operation.
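As a quick illustration, the last two explicit cases can be coded directly. This is a minimal sketch: the function names `prox_l1` and `prox_box` are ours, and the box [lo, hi]^{n} is just one convenient choice of convex set *C*.

```python
import numpy as np

def prox_l1(u, psi):
    """Proximal operator of f(x) = psi * ||x||_1: coordinate-wise
    soft-thresholding, matching cases (i)-(iii) above."""
    return np.sign(u) * np.maximum(np.abs(u) - psi, 0.0)

def prox_box(u, lo, hi):
    """Proximal operator of the indicator of the box C = [lo, hi]^n:
    the Euclidean projection P_C, i.e. coordinate-wise clipping."""
    return np.clip(u, lo, hi)
```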

A proximal minimisation algorithm is defined at the *k*-th iteration step by

- *x*^{k + 1} = prox_{f}(*x*^{k}), with parameter *λ*^{k}

As presented in Bauschke and Combettes (2011), if *f* has a minimum, then *x*^{k} will converge to the set of minimisers of *f* and *f*(*x*^{k}) to its optimal value. Similarly to steepest descent algorithms, convergence is guaranteed for values of the parameter *λ* that satisfy *λ*^{k} > 0 and Σ_{k} *λ*^{k} = ∞.

This basic algorithm has not found many applications but is instructive from a theoretical point of view. One important application, though, is minimising the quadratic function

- *f*(*x*) = (1 ∕ 2) *x*^{⊤}*Ax* − *b*^{⊤}*x*

with *A* ≽ 0 (positive semidefinite). The proximal mapping can be written down explicitly as

- prox_{f}(*u*) = (*I* + *λA*)^{ − 1}(*u* + *λb*)

which gives rise to the iterative refinement update of Golub and Wilkinson (1966)

- *x*^{k + 1} = *x*^{k} + *λ*(*I* + *λA*)^{ − 1}(*b* − *Ax*^{k})
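For concreteness, the explicit prox for this quadratic case and the resulting proximal iteration can be sketched as follows; the function names and the small test problem are our own illustrative choices.

```python
import numpy as np

def prox_quadratic(u, A, b, lam):
    """prox_f(u) for f(x) = 0.5 x'Ax - b'x: minimise
    f(x) + ||x - u||^2 / (2 lam), i.e. solve (I + lam A) x = u + lam b."""
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) + lam * A, u + lam * b)

def proximal_point(A, b, lam=1.0, iters=200):
    """Proximal minimisation x^{k+1} = prox_f(x^k); for this quadratic f
    the iterates converge to a solution of A x = b."""
    x = np.zeros(A.shape[0])
    for _ in range(iters):
        x = prox_quadratic(x, A, b, lam)
    return x
```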

The power of proximal mappings is revealed in the following setting.

### Proximal Gradient Method

Consider the following optimization problem:

- minimise_{x ∈ ℝ^{n}} *f*(*x*) + *g*(*x*) (2)

where *f* : ℝ^{n} → ℝ and *g* : ℝ^{n} → ℝ ∪ { + ∞ } are closed proper convex functions, and *f* is *differentiable*. The function *g* can be used to encode constraints on *x*, as discussed in the ‘Augmented Lagrangians’ section of Lange, Chi and Zhou.

Then, the proximal gradient method is given by

- *x*^{k + 1} = prox_{g}(*x*^{k} − *λ*^{k} ∇ *f*(*x*^{k})) (3)

where *λ*^{k} > 0 denotes the step size, which also plays the role of the parameter of the proximal operator. If ∇ *f*(*x*) is Lipschitz continuous with constant *M*, then this algorithm converges at the rate O(1 ∕ *k*) for a fixed step size *λ*^{k} ≡ *λ* ∈ (0,1 ∕ *M*]. If *M* is unknown, then various line search strategies (akin to those used in gradient descent methods) can be employed.

Note that, using the special forms of the proximal operator given above: when *g* = 0, this algorithm reduces to standard gradient descent; when *g* = *ψ* ∥ *x* ∥ _{1} (a lasso penalty), it leads to iterative soft-thresholding; and when *g* = *I*_{C}, it reduces to the projected gradient method of Bertsekas (1999).
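The lasso case can be sketched in a few lines. The naming and the fixed-step implementation are our assumptions: f(x) = ½ ∥ *Zx* − *y* ∥ ^{2}, g(x) = *ψ* ∥ *x* ∥ _{1}, and the step size is 1 ∕ *M* with *M* the largest eigenvalue of *Z*^{⊤}*Z*.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient_lasso(Z, y, psi, iters=500):
    """Update (3) for f(x) = 0.5 ||Zx - y||^2 and g(x) = psi ||x||_1,
    with fixed step size lam = 1/M, M the Lipschitz constant of grad f."""
    M = np.linalg.norm(Z, 2) ** 2        # largest eigenvalue of Z'Z
    lam = 1.0 / M
    x = np.zeros(Z.shape[1])
    for _ in range(iters):
        grad = Z.T @ (Z @ x - y)                        # grad f(x)
        x = soft_threshold(x - lam * grad, lam * psi)   # prox step on g
    return x
```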

In Beck and Teboulle (2012), the proximal gradient algorithm is interpreted as a majorization–minimisation algorithm. Specifically, consider the upper bound on *f*(*x*) given by

- *Q*_{λ}(*x*, *y*) = *f*(*y*) + ∇ *f*(*y*)^{⊤}(*x* − *y*) + (1 ∕ 2*λ*) ∥ *x* − *y* ∥ ^{2}

For fixed *λ*, *Q*_{λ}(*x*, *y*) is convex in *x*, satisfies *Q*_{λ}(*x*, *x*) = *f*(*x*), and is an upper bound on *f* when ∇ *f* is Lipschitz continuous with constant *M* and *λ* ∈ (0,1 ∕ *M*]. Then, the algorithm that uses updates of the form

- *x*^{k + 1} = argmin_{x} *Q*_{λ}(*x*, *x*^{k})

is a majorization–minimisation one. Analogously, for the function *f*(*x*) + *g*(*x*), we can use the surrogate *Q*_{λ}(*x*, *y*) + *g*(*x*), and some algebra shows that the majorization–minimisation-based update

- *x*^{k + 1} = argmin_{x} { *Q*_{λ}(*x*, *x*^{k}) + *g*(*x*) }

is equivalent to (3).
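The "some algebra" is a completion of the square. Assuming the standard quadratic surrogate *Q*_{λ}(*x*, *y*) = *f*(*y*) + ∇ *f*(*y*)^{⊤}(*x* − *y*) + ∥ *x* − *y* ∥ ^{2} ∕ 2*λ* and dropping terms that are constant in *x*:

```latex
\begin{aligned}
x^{k+1} &= \arg\min_x \left\{ Q_\lambda(x, x^k) + g(x) \right\} \\
        &= \arg\min_x \left\{ \nabla f(x^k)^\top (x - x^k)
           + \tfrac{1}{2\lambda}\lVert x - x^k\rVert^2 + g(x) \right\} \\
        &= \arg\min_x \left\{ \tfrac{1}{2\lambda}\,
           \bigl\lVert x - \bigl(x^k - \lambda \nabla f(x^k)\bigr)\bigr\rVert^2 + g(x) \right\} \\
        &= \operatorname{prox}_{g}\bigl(x^k - \lambda \nabla f(x^k)\bigr),
\end{aligned}
```

with the proximal operator taken with parameter *λ*, which is exactly update (3).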

### Accelerated Proximal Gradient Method

Nesterov (1983) introduced a sequence of updates that *accelerate* the convergence rate from linear (O(1 ∕ *k*)) to quadratic (O(1 ∕ *k*^{2})) in convex programming problems. The sequence extrapolates between previous updates of the algorithm. Specifically, for the proximal gradient method, it takes the form

- *y*^{k} = *x*^{k} + *ω*^{k}(*x*^{k} − *x*^{k − 1}), *x*^{k + 1} = prox_{g}(*y*^{k} − *λ*^{k} ∇ *f*(*y*^{k})) (4)

where *ω*^{k} ∈ [0,1) is an extrapolation parameter and *λ*^{k} the usual step size. A simple choice for the extrapolation parameter is *ω*^{k} = *k* ∕ (*k* + 3).

Then, it can be established that, for ∇ *f* Lipschitz continuous with constant *M*, the acceleration scheme guarantees convergence at the rate O(1 ∕ *k*^{2}) with a fixed step size *λ*^{k} ≡ *λ* ∈ (0,1 ∕ *M*]. As before, if *M* is unknown, the step sizes can be determined through a line search.
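The accelerated updates (4) differ from the plain scheme only in the extrapolation step. A minimal sketch on the same assumed lasso problem (our naming and parameter choices, with *ω*^{k} = *k* ∕ (*k* + 3)):

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def accelerated_proximal_gradient(Z, y, psi, iters=500):
    """Updates (4) with omega^k = k/(k+3), for f(x) = 0.5 ||Zx - y||^2
    and g(x) = psi ||x||_1, fixed step size lam = 1/M."""
    M = np.linalg.norm(Z, 2) ** 2
    lam = 1.0 / M
    x = x_prev = np.zeros(Z.shape[1])
    for k in range(iters):
        v = x + (k / (k + 3.0)) * (x - x_prev)   # extrapolation step y^k
        grad = Z.T @ (Z @ v - y)                 # grad f evaluated at y^k
        x_prev, x = x, soft_threshold(v - lam * grad, lam * psi)
    return x
```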

### Stochastic Proximal Gradient Algorithms

An area where we believe more research is needed to specifically suit the needs of statistics is stochastic optimization. This corresponds, for example, to (2) where the function *f* is given as an intractable integral of the form *f*(*x*) = ∫ *H*(*x*, *u*) *μ*(d*u*). This is a well-known problem in stochastic programming and online learning and has generated various stochastic extensions of the algorithms presented earlier (Nemirovski *et al*., 2009; Xiao, 2010; Duchi *et al.*, 2012; Lan, 2012). But in statistics, this problem takes a more challenging form. Latent variables abound in statistics and lead to log-likelihood functions and their derivatives that are intractable: ℓ(*x*) = log ∫ *p*_{x}(*y*, *u*) d*u* and ∇ ℓ(*x*) = ∫ ∇ _{x} log *p*_{x}(*y*, *u*) *π*_{x}(*u* | *y*) d*u*, where *π*_{x}(*u* | *y*) = *p*_{x}(*y*, *u*) ∕ ∫ *p*_{x}(*y*, *v*) d*v* denotes the conditional distribution of the latent variables given the data. The important point here is that the distribution *π*_{x}( ⋅ | *y*) is typically very difficult to simulate and, unlike most examples in online learning and stochastic optimization, depends on *x*. Nevertheless, one can easily adapt the proximal gradient algorithm and its accelerated version presented earlier by replacing ∇ *f*(*x*) by a Monte Carlo approximation obtained by simulating from *π*_{x}( ⋅ | *y*) (possibly using Markov chain Monte Carlo methods). The approximation can be obtained with a fixed number of Monte Carlo steps, or with an increasing number of Monte Carlo steps.

For illustration purposes, consider the following random effects logistic regression example. We have *n* statistical units with repeated binary responses { *y*_{it}, 1 ≤ *t* ≤ *T*_{i} }, *y*_{it} ∈ { 0,1 }. For the covariate matrix *Z* ∈ ℝ^{n × p}, with *i*-th row denoted by *z*_{i}, we assume that

- ℙ(*y*_{it} = 1 | *u*_{i}) = exp(*z*_{i}^{⊤}*β* + *u*_{i}) ∕ (1 + exp(*z*_{i}^{⊤}*β* + *u*_{i}))

where the random effects *u*_{i} are independent N(0, *σ*^{2}) variables. The log-likelihood of *β* is intractable and requires integrating out the random effects *u*. We estimate *β* by an *ℓ*_{1}-penalised likelihood approach; thus, *g*(*β*) = *ψ* ∥ *β* ∥ _{1}.

We implemented the proximal gradient algorithm and its accelerated version, replacing ∇ *f*(*x*^{k}) in (3) and (4) by a Monte Carlo estimate obtained using a Markov chain Monte Carlo scheme. Figure 1 depicts the relative error ∥ *β*_{k} − *β*_{ ⋆ } ∥ ∕ ∥ *β*_{ ⋆ } ∥ as a function of the iteration *k* for both algorithms, where *β*_{ ⋆ } denotes the true value of the parameter vector.

This simulation example suggests that stochastic versions of the proximal gradient algorithm and its accelerated versions can be designed to deal with high-dimensional statistical models with intractable log-likelihood functions. However, more work is required to gain a clear and deep understanding of the conditions needed so that these extensions work and exhibit convergence properties analogous to their deterministic counterparts.
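A generic version of this recipe can be sketched as follows. The toy Gaussian model, the batch size and the 1 ∕ √k step-size decay are our illustrative assumptions, and a plain Monte Carlo average stands in for the Markov chain Monte Carlo estimator used in the experiment above.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def stochastic_proximal_gradient(grad_est, prox, x0, lam0, iters, rng):
    """Update (3) with grad f(x^k) replaced by a Monte Carlo estimate;
    a decaying step size lam^k = lam0 / sqrt(k) tames the gradient noise."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        lam = lam0 / np.sqrt(k)
        x = prox(x - lam * grad_est(x, rng), lam)
    return x

# Toy problem: f(x) = E[0.5 ||x - W||^2], W ~ N(mu, I), g(x) = psi ||x||_1,
# so grad f(x) = x - mu is estimated from a fresh batch of 64 draws of W.
rng = np.random.default_rng(0)
mu = np.array([2.0, 0.0, -3.0])
psi = 0.5
grad_est = lambda x, r: x - r.normal(mu, 1.0, size=(64, mu.size)).mean(axis=0)
prox = lambda v, lam: soft_threshold(v, lam * psi)
x_hat = stochastic_proximal_gradient(grad_est, prox, np.zeros(3), 1.0, 2000, rng)
# x_hat approaches the exact minimiser soft_threshold(mu, psi)
```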

### Alternating Direction Method of Multipliers

In the proximal gradient algorithm, the function *f* was assumed to be smooth. However, the method can be adapted to handle the following variant of the minimisation problem:

- minimise_{x ∈ ℝ^{n}} *f*(*x*) + *g*(*x*)

where *f*, *g* : ℝ^{n} → ℝ ∪ { + ∞ } are closed proper convex functions, and both *f*, *g* can be *non-differentiable*. Then, the alternating direction method of multipliers employs the following updates:

- *x*^{k + 1} = prox_{f}(*z*^{k} − *u*^{k})
- *z*^{k + 1} = prox_{g}(*x*^{k + 1} + *u*^{k})
- *u*^{k + 1} = *u*^{k} + *x*^{k + 1} − *z*^{k + 1}

It can be seen that this method handles the two functions in the objective function completely separately through their proximal operators. It is most useful when the proximal operators of *f* and *g* can be easily computed, but that of the objective function *f* + *g* is not. The convergence theory for this method is discussed in detail in Boyd *et al.* (2011).
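A compact sketch of this scheme, assuming the scaled-dual form of the updates with a common proximal parameter *λ* (the function names and the toy problem are ours):

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def admm(prox_f, prox_g, n, lam=1.0, iters=300):
    """ADMM for min f(x) + g(x), driven only by the two proximal operators:
    x-update, z-update, then dual ascent on the residual x - z."""
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        x = prox_f(z - u, lam)
        z = prox_g(x + u, lam)
        u = u + x - z
    return z

# Example: f(x) = 0.5 ||x - a||^2 and g(x) = psi ||x||_1, whose joint
# minimiser is the soft-thresholded vector soft_threshold(a, psi).
a = np.array([3.0, -0.2, 1.0])
psi = 0.5
prox_f = lambda v, lam: (v + lam * a) / (1.0 + lam)  # prox of 0.5||x - a||^2
prox_g = lambda v, lam: soft_threshold(v, lam * psi)
x_hat = admm(prox_f, prox_g, 3)
```

Note that each sub-step touches only one of the two functions, which is exactly the separation property discussed above.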

### Acknowledgement

The authors' research was supported in part by the NSF grant DMS-1228164.

### References

- Bach, F., Jenatton, R., Mairal, J. & Obozinski, G. (2011). Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn., 4, 1–106.
- Bauschke, H.H. & Combettes, P.L. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces. New York: Springer.
- Beck, A. & Teboulle, M. (2012). Smoothing and first order methods: a unified framework. SIAM J. Optim., 22, 557–580.
- Bertsekas, D.P. (1999). Nonlinear Programming. Belmont, MA: Athena Scientific.
- Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3, 1–122.
- Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. New York: Cambridge University Press.
- Duchi, J.C., Agarwal, A., Johansson, M. & Jordan, M.I. (2012). Ergodic mirror descent. SIAM J. Optim., 22, 1549–1578.
- Golub, G. & Wilkinson, J.H. (1966). Note on the iterative refinement of least squares solution. Numer. Math., 9, 139–148.
- Lan, G. (2012). An optimal method for stochastic composite optimization. Math. Program. Ser. A, 133, 365–397.
- Lee, J., Recht, B., Salakhutdinov, R., Srebro, N. & Tropp, J. (2010). Practical large-scale optimization for max-norm regularization. Adv. Neural Inf. Process. Syst., 23, 1297–1305.
- Nemirovski, A., Juditsky, A., Lan, G. & Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19, 1574–1609.
- Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27, 372–376.
- Ravikumar, P., Agarwal, A. & Wainwright, M.J. (2010). Message-passing for graph-structured linear programs: proximal methods and rounding schemes. J. Mach. Learn. Res., 11, 1043–1080.
- Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11, 2543–2596.