Optimization methods have always played a key role in the development of statistical methodology, but they have become critically important for modern methods that analyse large, high-dimensional data sets. Lange, Chi and Zhou are to be commended for providing a comprehensive overview of optimization methods widely used in statistics. The paper discusses classical unconstrained optimization algorithms (steepest descent and its variants) and the majorization–minimisation framework, which has proved very useful in devising novel algorithms for a variety of statistical problems, and it also provides a flavour of the constrained optimization problems arising in high-dimensional statistics (e.g. regularisation, matrix completion, etc.). The topics discussed and their accompanying examples focus on important classes of algorithms that have helped statisticians develop and fit complex models.

In this note, we focus on a class of algorithms suitable for high-dimensional constrained optimization problems: proximal algorithms. These algorithms apply very generally to constrained non-smooth optimization, and they are particularly well suited to recent statistical techniques developed for the analysis of high-dimensional data (Bach et al., 2011; Lee et al., 2010; Ravikumar et al., 2010). In our discussion, we make connections to a number of topics discussed in Lange, Chi and Zhou, including majorization–minimisation, classical gradient descent algorithms and acceleration schemes.

Proximal Algorithms

We start by providing some definitions and then discuss special cases arising in high-dimensional statistics.

Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a closed proper convex function; that is, its epigraph is a non-empty closed convex set. The proximal operator of f is defined as

  $$\operatorname{prox}_{\lambda f}(u) = \operatorname*{arg\,min}_{x \in \mathbb{R}^n} \left\{ f(x) + \frac{1}{2\lambda}\, \| x - u \|_2^2 \right\} \qquad (1)$$

where $\| \cdot \|_2$ denotes the usual Euclidean norm, and λ > 0 is a parameter that controls the degree to which the proximal operator maps points towards the minimum of f. As shown in Boyd and Vandenberghe (2004), the function being minimised in (1) is strongly convex and not everywhere infinite, and hence admits a unique minimiser. Note that the definition implies that $\operatorname{prox}_{\lambda f}(u)$ is a point that approximately minimises f while remaining close to u.

In general, evaluating the proximal mapping requires solving the minimisation problem posed in (1), but there are important instances where the proximal operator admits an explicit expression; three such cases are listed below, with a code sketch following the list.

  1. If f(x) = 0, then $\operatorname{prox}_{\lambda f}(u) = u$.

  2. If $f(x) = I_C(x)$, where $I_C$ denotes the indicator function of a closed convex set C, then $\operatorname{prox}_{\lambda f}(u) = P_C(u)$, the Euclidean projection of u onto C.

  3. If $f(x) = \psi \| x \|_1$, where $\| \cdot \|_1$ denotes the ℓ₁ norm, then the i-th coordinate of $\operatorname{prox}_{\lambda f}(u)$ is given by: (i) $(\operatorname{prox}_{\lambda f}(u))_i = 0$ if $|u_i| \le \lambda\psi$; (ii) $(\operatorname{prox}_{\lambda f}(u))_i = u_i - \lambda\psi$ if $u_i > \lambda\psi$; and (iii) $(\operatorname{prox}_{\lambda f}(u))_i = u_i + \lambda\psi$ if $u_i < -\lambda\psi$. This proximal mapping is a soft-thresholding operation.
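All three operators are one-liners in, say, NumPy. The following is a minimal sketch under our own naming conventions, with the box $[lo, hi]^n$ standing in for a generic convex set C whose projection is available in closed form:

```python
import numpy as np

def prox_zero(u, lam=1.0):
    # Case 1: f(x) = 0, so the proximal operator is the identity.
    return u

def prox_box(u, lo=-1.0, hi=1.0, lam=1.0):
    # Case 2: f is the indicator of the box C = [lo, hi]^n,
    # so the proximal operator is the projection onto C.
    return np.clip(u, lo, hi)

def prox_l1(u, psi, lam=1.0):
    # Case 3: f(x) = psi * ||x||_1, so the proximal operator is
    # coordinate-wise soft thresholding at level lam * psi.
    return np.sign(u) * np.maximum(np.abs(u) - lam * psi, 0.0)
```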

A proximal minimisation algorithm is defined at the k-th iteration step by

  $$x^{k+1} = \operatorname{prox}_{\lambda_k f}\left( x^k \right).$$

As presented in Bauschke and Combettes (2011), if f has a minimiser, then $x^k$ converges to the set of minimisers of f and $f(x^k)$ to the optimal value. Similarly to steepest descent algorithms, convergence is guaranteed for step sizes satisfying $\lambda_k > 0$ and $\sum_k \lambda_k = \infty$.

This basic algorithm has not found many direct applications, but it is instructive from a theoretical point of view. One important application, though, is minimising the quadratic function

  $$f(x) = \tfrac{1}{2}\, x^\top A x - b^\top x$$

with A ≽ 0 (positive semidefinite). The proximal mapping can be written down explicitly as

  $$\operatorname{prox}_{\lambda f}(u) = (I + \lambda A)^{-1} \left( u + \lambda b \right),$$

which gives rise to the iterative refinement update of Golub and Wilkinson (1966)

  $$x^{k+1} = (I + \lambda A)^{-1} \left( x^k + \lambda b \right).$$
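A minimal sketch of this iteration in NumPy (the function name is ours); note that $I + \lambda A$ is positive definite, and hence invertible, even when A is singular:

```python
import numpy as np

def prox_point_quadratic(A, b, lam=1.0, iters=100):
    """Proximal point iteration x_{k+1} = (I + lam*A)^{-1} (x_k + lam*b)
    for f(x) = 0.5 * x'Ax - b'x with A positive semidefinite."""
    n = len(b)
    M = np.eye(n) + lam * A  # positive definite even if A is singular
    x = np.zeros(n)
    for _ in range(iters):
        x = np.linalg.solve(M, x + lam * b)
    return x
```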

The power of proximal mappings is revealed in the following setting.

Proximal Gradient Method

Consider the following optimization problem:

  $$\min_{x \in \mathbb{R}^n} \; f(x) + g(x), \qquad (2)$$

where $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ are closed proper convex functions, and f is differentiable. The function g can be used to encode constraints on x, as discussed in the ‘Augmented Lagrangians’ section of Lange, Chi and Zhou.

Then, the proximal gradient method is given by

  $$x^{k+1} = \operatorname{prox}_{\lambda_k g}\left( x^k - \lambda_k \nabla f(x^k) \right), \qquad (3)$$

where $\lambda_k > 0$ denotes the step size. If ∇f is Lipschitz continuous with constant M, then this algorithm converges in $O(1/\epsilon)$ steps to an ε-accurate objective value for a fixed step size $\lambda_k \equiv \lambda \in (0, 1/M]$. If M is unknown, then various line search strategies (akin to those used in gradient descent methods) can be employed.

Note that, using the special forms of the proximal operator listed earlier, when g = 0 this algorithm reduces to standard gradient descent; when $g(x) = \psi \| x \|_1$ (a lasso penalty), it leads to iterative soft-thresholding; and when $g = I_C$, it reduces to the projected gradient method of Bertsekas (1999). A sketch of the lasso case follows.
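For concreteness, here is a minimal sketch of iteration (3) for the lasso, with $f(\beta) = \tfrac{1}{2}\| y - X\beta \|_2^2$ and $g(\beta) = \psi \| \beta \|_1$; the fixed step size 1/M uses the Lipschitz constant $M = \lambda_{\max}(X^\top X)$, and the function names are ours:

```python
import numpy as np

def soft_threshold(u, t):
    # Proximal operator of t * ||.||_1 (coordinate-wise soft thresholding).
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_proximal_gradient(X, y, psi, iters=500):
    """Proximal gradient iteration (3) for 0.5*||y - Xb||^2 + psi*||b||_1."""
    p = X.shape[1]
    b = np.zeros(p)
    lam = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # step size 1/M
    for _ in range(iters):
        grad = X.T @ (X @ b - y)                   # gradient of f at b
        b = soft_threshold(b - lam * grad, lam * psi)
    return b
```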

In Beck and Teboulle (2012), the proximal gradient algorithm is interpreted as a majorization–minimisation algorithm. Specifically, consider the upper bound on f(x) given by

  $$Q_\lambda(x, y) = f(y) + \nabla f(y)^\top (x - y) + \frac{1}{2\lambda}\, \| x - y \|_2^2.$$

For fixed λ, $Q_\lambda(x, y)$ is convex in x, satisfies $Q_\lambda(x, x) = f(x)$, and is an upper bound on f when ∇f is Lipschitz continuous with constant M and λ ∈ (0, 1/M]. Then, the algorithm that uses updates of the form

  $$x^{k+1} = \operatorname*{arg\,min}_{x} \; Q_{\lambda_k}(x, x^k)$$

is a majorization–minimisation one. Analogously, for the function f(x) + g(x), we can use the majorizer $Q_\lambda(x, y) + g(x)$, and some algebra (spelled out below) shows that the majorization–minimisation-based update

  $$x^{k+1} = \operatorname*{arg\,min}_{x} \left\{ Q_{\lambda_k}(x, x^k) + g(x) \right\}$$

is equivalent to (3).
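To spell out the algebra: dropping terms that do not depend on x and completing the square gives

$$\operatorname*{arg\,min}_{x} \left\{ Q_\lambda(x, y) + g(x) \right\} = \operatorname*{arg\,min}_{x} \left\{ g(x) + \frac{1}{2\lambda} \left\| x - \left( y - \lambda \nabla f(y) \right) \right\|_2^2 \right\} = \operatorname{prox}_{\lambda g}\left( y - \lambda \nabla f(y) \right),$$

which, with $y = x^k$ and $\lambda = \lambda_k$, is precisely the proximal gradient step.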

Accelerated Proximal Gradient Method

Nesterov (1983) introduced a sequence of updates that accelerates the convergence rate of first-order methods from $O(1/k)$ to $O(1/k^2)$ on convex programming problems. The scheme extrapolates between successive iterates of the algorithm. Specifically, for the proximal gradient method, it takes the form

  $$y^{k} = x^k + \omega_k \left( x^k - x^{k-1} \right), \qquad x^{k+1} = \operatorname{prox}_{\lambda_k g}\left( y^k - \lambda_k \nabla f(y^k) \right), \qquad (4)$$

where ωk ∈ [0,1) is an extrapolation parameter and λk the usual step size. A simple choice for the extrapolation parameter is ωk = k ∕ (k + 3).

Then, it can be established that, for ∇f Lipschitz continuous with constant M, the acceleration scheme guarantees convergence in $O(1/\sqrt{\epsilon})$ steps with a fixed step size $\lambda_k \equiv \lambda \in (0, 1/M]$. As before, if M is unknown, the step sizes can be determined through a line search.
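A minimal sketch of iteration (4) for the same lasso problem as before, using the suggested choice $\omega_k = k/(k+3)$ (again, the function names are ours):

```python
import numpy as np

def soft_threshold(u, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_accelerated(X, y, psi, iters=500):
    """Accelerated proximal gradient (4) for 0.5*||y - Xb||^2 + psi*||b||_1."""
    p = X.shape[1]
    b_prev = np.zeros(p)
    b = np.zeros(p)
    lam = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # fixed step size 1/M
    for k in range(iters):
        omega = k / (k + 3.0)                      # extrapolation parameter
        z = b + omega * (b - b_prev)               # extrapolate between iterates
        grad = X.T @ (X @ z - y)                   # gradient of f at z
        b_prev, b = b, soft_threshold(z - lam * grad, lam * psi)
    return b
```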

Stochastic Proximal Gradient Algorithms

An area where we believe more research is needed to specifically suit the needs of statistics is stochastic optimization. This corresponds, for example, to (2) where the function f is given as an intractable integral of the form $f(x) = \int F(x, u)\, \pi(\mathrm{d}u)$. This is a well-known problem in stochastic programming and online learning, and it has generated various stochastic extensions of the algorithms presented earlier (Nemirovski et al., 2009; Xiao, 2010; Duchi et al., 2012; Lan, 2012). But in statistics, this problem takes a more challenging form. Latent variables abound in statistics and lead to log-likelihood functions and derivatives that are intractable: $\ell(x) = \log \int p(y, u \mid x)\, \mathrm{d}u$ and $\nabla \ell(x) = \int \nabla_x \log p(y, u \mid x)\, \pi_x(\mathrm{d}u)$, where $\pi_x(\mathrm{d}u) \propto p(y, u \mid x)\, \mathrm{d}u$ denotes the conditional distribution of the latent variables u given the data y. The important point here is that the distribution $\pi_x$ is typically very difficult to simulate and, unlike most examples in online learning and stochastic optimization, $\pi_x$ depends on x. Nevertheless, one can easily adapt the proximal gradient algorithm and its accelerated version presented earlier by replacing ∇f(x) with a Monte Carlo approximation $\widehat{\nabla f}(x)$ obtained by simulating from $\pi_x$ (possibly using Markov chain Monte Carlo methods). The approximation can be obtained with a fixed number of Monte Carlo samples or with a number that increases across iterations.
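A minimal sketch of this substitution, assuming a hypothetical user-supplied routine mcmc_gradient_estimate(x, n_samples) that averages $\nabla_x \log p(y, u \mid x)$ over MCMC draws $u \sim \pi_x$ to produce a noisy estimate of ∇f(x):

```python
import numpy as np

def stochastic_proximal_gradient(x0, mcmc_gradient_estimate, prox_g,
                                 lam=0.01, iters=1000, n_samples=50):
    """Iteration (3) with the exact gradient of f replaced by a Monte Carlo
    estimate; a sketch only, not a tuned implementation."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        grad_hat = mcmc_gradient_estimate(x, n_samples)  # noisy gradient of f
        x = prox_g(x - lam * grad_hat, lam)              # proximal step on g
    return x
```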

For illustration purposes, consider the following random effects logistic regression example. We have n statistical units with repeated binary responses $\{ y_{it},\ 1 \le t \le T_i \}$, $y_{it} \in \{0, 1\}$. For the covariate matrix $Z \in \mathbb{R}^{n \times p}$, with i-th row denoted by $z_i$, we assume that

  $$\mathbb{P}\left( y_{it} = 1 \mid u_i \right) = \frac{\exp\left( z_i^\top \beta + u_i \right)}{1 + \exp\left( z_i^\top \beta + u_i \right)},$$

where $u_i \stackrel{\text{iid}}{\sim} N(0, \sigma^2)$ are unit-level random effects. The log-likelihood of β is intractable and requires integrating out the random effects. We estimate β by an ℓ₁-penalised likelihood approach; thus, $g(\beta) = \psi \| \beta \|_1$.

We implemented the proximal gradient algorithm and its accelerated version by replacing $\nabla f(x^k)$ in (3) and (4) with an estimate $\widehat{\nabla f}(x^k)$ obtained using a Markov chain Monte Carlo scheme. Figure 1 depicts the relative error $\| \beta^k - \beta^\star \| \,/\, \| \beta^\star \|$ as a function of the iteration k for both algorithms, where $\beta^\star$ denotes the true value of the parameter vector.

Figure 1. Relative errors for stochastic proximal gradient algorithm for a random effects logistic model with p = 100, T = 10 and k = 200.

This simulation example suggests that stochastic versions of the proximal gradient algorithm and its accelerated variant can be designed to handle high-dimensional statistical models with intractable log-likelihood functions. However, more work is required to gain a clear understanding of the conditions under which these extensions converge with properties analogous to those of their deterministic counterparts.

Alternating Direction Method of Multipliers

In the proximal gradient algorithm, the function f was assumed to be smooth. However, the method can be adapted to handle the following variant of the minimisation problem:

  $$\min_{x, z} \; f(x) + g(z) \quad \text{subject to} \quad x = z,$$

where $f, g : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ are closed proper convex functions, and both f and g may be non-differentiable. Then, the alternating direction method of multipliers employs the following updates, in which $w^k$ denotes a scaled dual variable:

  $$x^{k+1} = \operatorname{prox}_{\lambda f}\left( z^k - w^k \right), \qquad z^{k+1} = \operatorname{prox}_{\lambda g}\left( x^{k+1} + w^k \right), \qquad w^{k+1} = w^k + x^{k+1} - z^{k+1}.$$

It can be seen that this method handles the two functions in the objective completely separately, through their respective proximal operators. It is most useful when the proximal operators of f and g are easy to compute individually but that of the sum f + g is not. The convergence theory for this method is discussed in detail in Boyd et al. (2011). A generic sketch of the updates follows.
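A minimal sketch of the updates as a generic driver, with the two proximal maps passed in as functions (all names are ours); the example minimises an ℓ₁ penalty over a box, a problem in which neither term is smooth:

```python
import numpy as np

def admm(prox_f, prox_g, n, lam=1.0, iters=200):
    """Alternating direction method of multipliers, scaled form:
    minimise f(x) + g(z) subject to x = z."""
    x = np.zeros(n)
    z = np.zeros(n)
    w = np.zeros(n)                    # scaled dual variable
    for _ in range(iters):
        x = prox_f(z - w, lam)         # proximal step on f
        z = prox_g(x + w, lam)         # proximal step on g
        w = w + x - z                  # dual update on the constraint x = z
    return z

# Example: f(x) = psi*||x||_1 and g the indicator of the box [0, 1]^n.
psi = 0.5
prox_f = lambda u, lam: np.sign(u) * np.maximum(np.abs(u) - lam * psi, 0.0)
prox_g = lambda u, lam: np.clip(u, 0.0, 1.0)
beta = admm(prox_f, prox_g, n=10)
```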

References

  • Bach, F., Jenatton, R., Mairal, J. & Obozinski, G. (2011). Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn., 4, 1–106.
  • Bauschke, H. & Combettes, P. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces. New York: Springer.
  • Beck, A. & Teboulle, M. (2012). Smoothing and first order methods: a unified framework. SIAM J. Optim., 22, 557–580.
  • Bertsekas, D. (1999). Nonlinear Programming. Belmont, MA: Athena Scientific.
  • Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3, 1–122.
  • Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. New York: Cambridge University Press.
  • Duchi, J. C., Agarwal, A., Johansson, M. & Jordan, M. (2012). Ergodic mirror descent. SIAM J. Optim., 22, 1549–1578.
  • Golub, G. & Wilkinson, J. (1966). Note on the iterative refinement of least squares solution. Numer. Math., 9, 139–148.
  • Lan, G. (2012). An optimal method for stochastic composite optimization. Math. Program. Ser. A, 133, 365–397.
  • Lee, J., Recht, B., Salakhutdinov, R., Srebro, N. & Tropp, J. (2010). Practical large scale optimization for max-norm regularization. Adv. Neural Inf. Process. Syst., 23, 1297–1305.
  • Nemirovski, A., Juditsky, A., Lan, G. & Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19, 1574–1609.
  • Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27, 372–376.
  • Ravikumar, P., Agarwal, A. & Wainwright, M. (2010). Message-passing for graph-structured linear programs: proximal methods and rounding schemes. J. Mach. Learn. Res., 11, 1043–1080.
  • Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11, 2543–2596.