Original Article

# Rejoinder

Version of Record online: 22 APR 2014

DOI: 10.1111/insr.12030

© 2014 The Authors. International Statistical Review © 2014 International Statistical Institute

Additional Information

#### How to Cite

Lange, K., Chi, E. C. and Zhou, H. (2014), Rejoinder. International Statistical Review, 82: 81–89. doi: 10.1111/insr.12030

#### Publication History

- Issue online: 22 APR 2014
- Version of Record online: 22 APR 2014
- Manuscript Accepted: 3 JUL 2013
- Manuscript Received: 1 JUL 2013


### Response to Y. Atchade and G. Michailidis


We are grateful to Profs. Atchade and Michailidis for discussing proximal splitting methods and highlighting their connection to the methods under review. Although proximal splitting methods have been around for decades, they have recently enjoyed a renaissance in handling non-smooth regularisation, not only in statistics but also in signal processing and machine learning. Combettes and Wajs (2005) provided a comprehensive overview of proximal splitting methods, including the proximal gradient method and the alternating direction method of multipliers (ADMM) discussed by Atchade and Michailidis.

Hunter made an interesting point that requires even stronger emphasis in the context of proximal methods. One of the reasons methods such as ADMM have become so popular is that, like MM and block coordinate descent, they decompose challenging optimisation problems into simpler subproblems. These decompositions often lighten the load of coding. Moreover, just as proximal gradient algorithms can be accelerated by Nesterov's method, ADMM and its variants can also be accelerated with modest changes to the underlying algorithms (Deng and Yin, 2012; Goldstein *et al.* 2012; Goldfarb *et al.* 2012). Thus, proximal splitting can lead to a simpler code with no sacrifice in computational speed.
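To make the decomposition and acceleration concrete, here is a minimal sketch of the proximal gradient method with Nesterov acceleration (FISTA) applied to the lasso. This is our own illustration, not code from any cited paper; the step size 1∕*L*, with *L* the squared spectral norm of the design matrix, is a standard choice.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal map of t * ||.||_1: componentwise shrinkage towards zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista_lasso(A, b, lam, n_iter=200):
    """Proximal gradient with Nesterov acceleration (FISTA) for
    min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth gradient
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(n_iter):
        # Gradient step on the smooth part, then the l1 proximal map.
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        # Nesterov momentum update.
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

Note how the non-smooth penalty is isolated in the proximal map, while the momentum step costs only one extra vector operation per iteration.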

The benefits of this approach can be seen in a recent convex version of cluster analysis (Chi and Lange, 2013; Hocking *et al.*, 2011; Lindsten *et al.*, 2011). Given *p* points *x*_{1}, … ,*x*_{p} in $\mathbb{R}^{d}$, the new clustering method operates by minimising the convex criterion

$$F_\gamma(\mathbf{U}) \;=\; \frac{1}{2}\sum_{i=1}^{p} \lVert x_i - u_i \rVert_2^2 \;+\; \gamma \sum_{i<j} w_{ij}\, \lVert u_i - u_j \rVert, \qquad (1)$$

where *γ* is a positive regularisation parameter, *w*_{ij} is a non-negative weight, and the *i*-th column *u*_{i} of the matrix **U** is the cluster centre attached to the point *x*_{i}. The norm in the first summation is the Euclidean norm; the penalty norms can be either Euclidean or non-Euclidean. Figure 1 shows the solution path of this convex problem as a function of *γ*.

This problem generalises the fused lasso, and as with other fused lasso problems, the penalties make minimisation challenging. The original problem can be reformulated as the equality-constrained problem

$$\min_{\mathbf{U},\mathbf{V}} \; \frac{1}{2}\sum_{i=1}^{p} \lVert x_i - u_i \rVert_2^2 + \gamma \sum_{i<j} w_{ij}\, \lVert v_{ij} \rVert \quad \text{subject to} \quad u_i - u_j - v_{ij} = 0 \;\text{ for all } i < j. \qquad (2)$$

This alternative formulation is ripe for attack by proximal splitting. Our recent paper (Chi and Lange, 2013) presents variants of ADMM and the related alternating minimisation algorithm (AMA) (Tseng, 1991) that solve the equality-constrained version (2). As remarked earlier, both approaches are simple enough to encourage parameter acceleration. As a rule, the proximal splitting framework generates simple modular solutions. Consider our ADMM solution. Let *λ*_{ij} denote the Lagrange multiplier for the *ij*-th equality constraint. We describe a single round of ADMM block updates of the variables *u*_{i}, *v*_{ij}, and *λ*_{ij}. The centroids *u*_{i} are updated as follows:

$$u_i \;=\; \frac{1}{1 + p\rho}\, y_i \;+\; \frac{p\rho}{1 + p\rho}\, \bar{x},$$

where *ρ* is the positive quadratic penalty parameter in the augmented Lagrangian, $\bar{x}$ is the average of the *x*_{i}, and

$$y_i \;=\; x_i + \sum_{j > i} (\lambda_{ij} + \rho\, v_{ij}) - \sum_{j < i} (\lambda_{ji} + \rho\, v_{ji}).$$

The updates for *v*_{ij} are independent and amount to the proximal maps

$$v_{ij} \;=\; \operatorname{prox}_{(\sigma_{ij}/\rho)\, \lVert \cdot \rVert}\!\left( u_i - u_j - \rho^{-1} \lambda_{ij} \right), \qquad (3)$$

where *σ*_{ij} = *γw*_{ij}. Finally, the Lagrange multipliers are updated by the dual ascent step

$$\lambda_{ij} \;=\; \lambda_{ij} + \rho\,\left( v_{ij} - u_i + u_j \right).$$

Each update is simple, and the effects of changing the norm in the fusion penalty are isolated to the updates for *v*_{ij}. In other words, only the proximal mapping needs to be changed. The updates for the AMA method exhibit similar simplicity and modularity.
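The round of updates just described can be sketched in a few lines of numpy. This is our own illustration rather than the authors' MATLAB implementation; it assumes every pair *i* < *j* carries an equality constraint, which is the setting in which the closed-form centroid update holds.

```python
import numpy as np

def prox_l2(z, tau):
    # Proximal map of tau * ||.||_2: blockwise soft-thresholding.
    norm = np.linalg.norm(z)
    return max(0.0, 1.0 - tau / norm) * z if norm > 0 else z

def admm_round(X, U, V, Lam, gamma, w, rho):
    """One round of ADMM block updates for convex clustering.

    X, U are d-by-p arrays (data and centroids); V, Lam, w are dicts
    keyed by pairs (i, j) with i < j.
    """
    d, p = X.shape
    xbar = X.mean(axis=1)
    # Centroid update: u_i = y_i/(1 + p*rho) + (p*rho/(1 + p*rho)) * xbar.
    Y = X.copy()
    for (i, j) in V:
        t = Lam[(i, j)] + rho * V[(i, j)]
        Y[:, i] += t                  # j > i contributes with a plus sign
        Y[:, j] -= t                  # i < j contributes with a minus sign
    U = Y / (1.0 + p * rho) + (p * rho / (1.0 + p * rho)) * xbar[:, None]
    # v updates: independent proximal maps.
    for (i, j) in V:
        V[(i, j)] = prox_l2(U[:, i] - U[:, j] - Lam[(i, j)] / rho,
                            gamma * w[(i, j)] / rho)
    # Dual ascent on the multipliers.
    for (i, j) in Lam:
        Lam[(i, j)] += rho * (V[(i, j)] - U[:, i] + U[:, j])
    return U, V, Lam
```

Changing the fusion norm only requires swapping `prox_l2` for the proximal map of the new norm, which is exactly the modularity emphasised in the text.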

Proximal splitting methods have also proven to be effective when mixed and matched with other optimisation methods. For example, Ramani and Fessler (2013) combined ADMM, MM, and acceleration to concoct an image reconstruction algorithm that outperforms all currently competing algorithms. Proximal methods themselves are undergoing improvement. Application of Newton and quasi-Newton methods to proximal methods is especially promising (Becker and Fadili, 2012; Lee *et al.* 2012).

We agree with Atchade and Michailidis that stochastic proximal gradient algorithms represent an important frontier requiring further exploration. Recent results on inexact variants of proximal splitting provide important clues for understanding the conditions under which stochastic variants converge as reliably as their deterministic counterparts (Deng and Yin, 2012; Schmidt *et al.*, 2011). Finally, it is noteworthy that some of the most recent refinements of stochastic gradient methods involve generalisation to second-order methods (Byrd *et al.*, 2011; 2012). It is refreshing to see classical ideas recycled and refurbished for modern purposes. Second-order methods have been developed for lasso-regularised optimisation in both the deterministic and stochastic settings (Byrd *et al.*, 2012), but it remains to be seen whether deterministic second-order proximal methods can be generalised to the stochastic setting where other non-smooth regularisers come into play. Atchade and Michailidis are surely right in calling for a deeper understanding of the convergence behaviour of such potential hybrids. In practice, it appears that the Hessian approximation need not be as accurate as the gradient approximation. This observation can lead to substantial computational savings (Byrd *et al.*, 2011; 2012). Obtaining a clearer understanding of how to tune the relative accuracies of the gradient and Hessian for the best performance is one of many theoretical challenges begging for resolution.

### Response to D. Hunter


We are grateful to Prof. Hunter for emphasising that ‘there is simply no such thing as a universal “gold standard” when it comes to algorithms’. His concrete treatment of the Bradley–Terry model is particularly apt. Readers may find a fuller discussion of this simple example helpful in understanding that there are multiple ways to skin a statistical cat. The log-likelihood of the Bradley–Terry model considered by Hunter is

$$L(\boldsymbol{\gamma}) \;=\; \sum_{i,j} w_{ij} \left[ \ln \gamma_i - \ln(\gamma_i + \gamma_j) \right], \qquad (4)$$

where *w*_{ij} is the number of times individual *i* beats individual *j* and the *γ*_{i} > 0 are the parameters to be estimated. Here, *w*_{ii} = 0 by convention.

#### Bradley–Terry as a Geometric Program

Geometric programming (Boyd *et al.*, 2007; Ecker, 1980; Peterson, 1976) deals with posynomials, namely functions of the form

$$f(\mathbf{x}) \;=\; \sum_{\alpha} c_\alpha\, x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_n^{\alpha_n}.$$

Here, the index set is finite, and all coefficients *c*_{α} and all components *x*_{1}, … ,*x*_{n} of the argument **x** of *f*(**x**) are positive. The possibly fractional powers *α*_{i} corresponding to a particular *α* may be positive, negative, or zero. In geometric programming, we minimise a posynomial *f*(**x**) subject to posynomial inequality constraints of the form *u*_{j}(**x**) ≤ 1 for 1 ≤ *j* ≤ *q*. In some versions of geometric programming, equality constraints of posynomial type are permitted (Boyd *et al.*, 2007).

Maximising the positive likelihood function (4) is equivalent to minimising its reciprocal

$$\prod_{i,j} \left( \frac{\gamma_i + \gamma_j}{\gamma_i} \right)^{w_{ij}} \;=\; \prod_{i,j} \left( 1 + \gamma_j \gamma_i^{-1} \right)^{w_{ij}}, \qquad (5)$$

which is a posynomial after expanding the powers $(1 + \gamma_j \gamma_i^{-1})^{w_{ij}}$. Therefore, the Bradley–Terry model is an unconstrained geometric programming problem. Recognising standard convex programming problems such as geometric programs can free statisticians from the often onerous task of designing and implementing their own optimisation algorithms. For instance, using the open-source convex optimisation software CVX (Grant and Boyd, 2008; 2012) to minimise the criterion (5) requires only six lines of MATLAB code.
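To convey the same point in Python, here is a hedged analogue that hands the equivalent convex program, in the log-parametrisation *λ*_{i} = ln *γ*_{i}, to an off-the-shelf solver (scipy rather than CVX). The function name and the pinning of *λ*_{1} = 0 for identifiability are our illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(W):
    """Minimise f(lam) = sum_ij W[i,j] * (log(exp(lam_i) + exp(lam_j)) - lam_i),
    which is convex in lam = log(gamma).  lam[0] is pinned to 0 because the
    likelihood is invariant to rescaling the gamma's."""
    p = W.shape[0]

    def f(free):
        lam = np.concatenate(([0.0], free))
        # logaddexp gives a numerically stable log(exp(a) + exp(b)).
        L = np.logaddexp(lam[:, None], lam[None, :])
        return float((W * (L - lam[:, None])).sum())

    res = minimize(f, np.zeros(p - 1), method="BFGS")
    lam = np.concatenate(([0.0], res.x))
    return np.exp(lam)
```

The solver never needs to know anything about the Bradley–Terry model beyond the convex criterion, which is precisely the labour-saving point made above.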

Because convex program solvers such as CVX implement variants of Newton's method, they tend to falter on high-dimensional problems. For the Bradley–Terry model, CVX handles *p* < 100 problems very efficiently but struggles for *p* > 1000. Parameter separation by the MM principle and exploitation of special Hessian structures are two possible remedies.

#### Another MM for Bradley–Terry

We now derive another MM algorithm for the Bradley–Terry model. By the arithmetic–geometric mean inequality, the objective (5) is majorised by a sum of monomials anchored at the current iterate *γ*_{n}. The parameters *γ*_{i} and *γ*_{j} are still entangled in the majorising terms but can be separated by a further majorisation, thanks to the convexity of the underlying functions. The resulting surrogate function is easy to optimise because all of the *γ*_{i} parameters are separated. The next iterate *γ*_{n + 1,i} of *γ*_{i} is obtained by minimising a univariate function that is strictly convex according to the second derivative test. Both bisection and Newton's method locate its minimum quickly.
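The univariate minimisation needs only a few lines. Here is a generic sketch of bisection on the derivative of a strictly convex function; the bracketing interval is an assumption the caller must supply.

```python
def argmin_bisection(dh, lo, hi, tol=1e-10):
    """Minimise a strictly convex function h on [lo, hi] by bisecting the
    sign change of its derivative dh, assuming the minimum lies inside."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dh(mid) > 0:      # derivative positive: minimum lies to the left
            hi = mid
        else:                # derivative non-positive: minimum lies to the right
            lo = mid
    return 0.5 * (lo + hi)
```

Newton's method converges faster near the solution, but bisection is unconditionally safe once the minimum is bracketed.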

Although our new MM algorithm achieves parameter separation, the two successive majorisations and the lack of analytic updates probably make it uncompetitive with the simple MM algorithm of Hunter. Nonetheless, this example illustrates the flexibility of the MM principle. Interested readers can refer to our recent paper (Lange and Zhou, 2014) for a general class of MM algorithms for geometric and signomial programming. In signomial programming, some of the coefficients *c*_{α} are allowed to be negative.
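For comparison, Hunter's simple MM update referenced above has the well-known closed form *γ*_{i} ← (Σ_{j} *w*_{ij}) / (Σ_{j≠i} (*w*_{ij} + *w*_{ji}) / (*γ*_{i} + *γ*_{j})). A short numpy sketch, as our own illustration of that iteration:

```python
import numpy as np

def bradley_terry_mm(W, n_iter=200):
    """MM iteration for the Bradley-Terry log-likelihood (4).

    W[i, j] holds w_ij, the number of times i beats j.  Returns the
    estimated strengths gamma, normalised to sum to one.
    """
    N = W + W.T                      # games played between each pair
    wins = W.sum(axis=1)             # total wins for each individual
    g = np.ones(W.shape[0])
    for _ in range(n_iter):
        # Each update has an analytic solution; no inner solver is needed.
        denom = (N / (g[:, None] + g[None, :])).sum(axis=1)
        g = wins / denom
        g /= g.sum()                 # fix the scale indeterminacy
    return g
```

Every update is analytic, which is why this simple MM scheme is hard to beat.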

#### Exploiting Structure in High Dimensions

The importance of exploiting Hessian structure in high-dimensional optimisation can also be illustrated by the Bradley–Terry model. By switching to the parametrisation *λ*_{i} = ln *γ*_{i}, it suffices to minimise the equivalent negative log-likelihood

$$f(\boldsymbol{\lambda}) \;=\; \sum_{i,j} w_{ij} \left[ \ln\!\left( e^{\lambda_i} + e^{\lambda_j} \right) - \lambda_i \right].$$

This is a convex function because the terms $e^{\lambda_i} + e^{\lambda_j}$ are log-convex and the collection of log-convex functions is closed under addition (Boyd and Vandenberghe, 2004). The gradient has entries

$$\frac{\partial f(\boldsymbol{\lambda})}{\partial \lambda_k} \;=\; \sum_{j} (w_{kj} + w_{jk}) \frac{e^{\lambda_k}}{e^{\lambda_k} + e^{\lambda_j}} \;-\; \sum_{j} w_{kj},$$

and the Hessian has entries

$$\frac{\partial^2 f(\boldsymbol{\lambda})}{\partial \lambda_k\, \partial \lambda_l} \;=\; \begin{cases} \displaystyle \sum_{j \ne k} (w_{kj} + w_{jk}) \frac{e^{\lambda_k + \lambda_j}}{\left( e^{\lambda_k} + e^{\lambda_j} \right)^2}, & k = l, \\[2ex] \displaystyle -\,(w_{kl} + w_{lk}) \frac{e^{\lambda_k + \lambda_l}}{\left( e^{\lambda_k} + e^{\lambda_l} \right)^2}, & k \ne l. \end{cases}$$

Computing Newton's direction $-[d^2 f(\boldsymbol{\lambda})]^{-1} \nabla f(\boldsymbol{\lambda})$ requires solving a system of linear equations, an expensive *O*(*p*^{3}) operation that becomes prohibitive when the number of parameters *p* is large. Fortunately, in large-scale competitions, most individuals or teams play only a small fraction of their possible opponents. This implies that the data matrix **W** is sparse and consequently that the Hessian $d^2 f(\boldsymbol{\lambda})$ is also sparse. This fact allows fast calculation of the matrix–vector product $[d^2 f(\boldsymbol{\lambda})]\, v$ for any vector *v* and suggests substitution of the conjugate gradient (CG) method for the traditional (Cholesky) method of solving for the Newton direction. The computational cost per iteration drops to *O*(*p*^{2}), where the constant depends on the sparsity level and the number of CG iterations.

Figure 2 compares the progress of the Newton iterations using the Cholesky decomposition (NM) and the conjugate gradient substitution (NM-CG) on two data sets (*p* = 1000 and 2000) simulated under the same conditions as Hunter's Figure 1. Our MATLAB code is available at http://www4.stat.ncsu.edu/~hzhou3/softwares/bradleyterry. The norm of the gradient vector serves to measure progress towards convergence; convergence is declared when its change per iteration falls below 10^{ − 6}. For this convex problem, all stationary points represent global minima. Under the simulation conditions, each individual plays *p* ∕ 20 opponents, so nearly 90% of the entries of the Hessian matrix are zero. The NM-CG approach achieves remarkable efficiency at *p* = 2000, demonstrating the importance of exploiting sparsity in high-dimensional problems. Of course, the original MM updates mesh with sparsity equally well.
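The CG substitution can be sketched in Python with scipy's sparse tools (this is our illustration, not the MATLAB code linked above; the small ridge added to absorb the translation invariance *λ* → *λ* + *c* is our own choice).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def newton_direction(W, lam):
    """Newton direction for the log-parametrised Bradley-Terry criterion,
    computed by conjugate gradients on the sparse Hessian."""
    p = len(lam)
    W = sp.csr_matrix(W)
    N = W + W.T                                    # sparse games-played matrix
    i, j = N.nonzero()
    nij = np.asarray(N[i, j]).ravel()
    pij = np.exp(lam[i] - np.logaddexp(lam[i], lam[j]))   # P(i beats j)
    # Gradient entries: sum_j (w_kj + w_jk) p_kj - sum_j w_kj.
    grad = np.bincount(i, weights=nij * pij, minlength=p) \
        - np.asarray(W.sum(axis=1)).ravel()
    # Hessian: off-diagonal -(w_kl + w_lk) p q, diagonal sum_j (w_kj + w_jk) p q.
    h = nij * pij * (1.0 - pij)
    H = sp.coo_matrix((-h, (i, j)), shape=(p, p)).tocsr() \
        + sp.diags(np.bincount(i, weights=h, minlength=p))
    # A tiny ridge handles the singular direction lam -> lam + c.
    d, info = cg(H + 1e-8 * sp.eye(p), -grad)
    return d, info
```

Only sparse matrix–vector products are performed inside CG, so the cost scales with the number of observed pairings rather than with *p*^{3}.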

#### Acceleration for Non-smooth Problems

Hunter mentioned that, for smooth problems, MM-type algorithms combined with quasi-Newton acceleration can achieve the ‘best of both worlds’ by offering both stability and efficiency. Let us add that we have also had good results applying a different quasi-Newton acceleration scheme (Zhou *et al.* 2011) to non-smooth problems such as matrix completion (Chi *et al.* 2013) and regularised matrix regression (Zhou and Li, 2013). Nesterov acceleration works well for certain types of non-smooth problems (Beck and Teboulle, 2009).

### Response to C. Robert


Prof. Robert has raised a number of objections from a Bayesian perspective, most of which we embrace. For instance, he said that integration comes more naturally to statisticians than optimisation. This is true; the current review is a modest attempt to change this state of affairs. We do not agree that an emphasis on optimisation neglects everything except estimation. For instance, the most convincing modern strides in model selection owe their existence to penalised estimation. Parameter tuning, cross-validation, and stability selection (Meinshausen and Bühlmann, 2010) all operate within the framework of optimisation. Robert may well be correct in asserting that variational Bayes (Wainwright and Jordan, 2008) and approximate Bayesian computation (Marin *et al.* 2012) will rescue Bayesian applications to big data. In our view, the jury is still out on the scope of these methods. In any event, variational Bayes operates by optimisation, so even fully committed Bayesians stand to gain from fluency in optimisation.

We have little experience with applying Bayesian inference to data summaries. Although this is a worthy suggestion, data summaries run the risk of losing vital information and presuppose knowledge of a good model for how the data are generated.

We are sympathetic to simulated annealing and have employed it in many of our scientific applications. It functions best on problems of intermediate size for which the computational complexity of all known algorithms is high. It is not truly an option for large data. Imagine, for instance, using simulated annealing to solve the travelling salesman problem with a million cities. As a rule, stochastic algorithms cannot compete with the speed of deterministic methods in optimisation. That is why we did not feature stochastic simulation in our review. However, ideas such as annealing do generalise successfully to optimisation. Our recent paper on parameter estimation in the presence of multiple modes advocates deterministic annealing (Zhou and Lange, 2010).

We agree with Prof. Robert that EM and MM algorithms are not panaceas. It takes careful thought to construct fast, stable algorithms. MM, EM, and block descent and ascent are always stable and typically easy to code and debug. The failure of the MM principle in applications involving intractable integrals is currently the biggest bottleneck. The random-effects logistic regression model discussed by Atchade and Michailidis is a prime example. Statisticians venturing into the terrain of convex optimisation must also exercise special caution. Our review gives a few hints and a brief history of successful techniques. We cannot foresee the future, but it would be surprising if the covered techniques did not prove helpful for many years to come. Finally, let us reiterate our agreement with Robert's contention that statistical inference is more than parameter estimation.

### Acknowledgements


We thank our colleagues for their thoughtful commentaries. They raise intriguing points and arguments that deserve equally thoughtful responses.

### References


- Becker, S. & Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, 25, Eds. P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou & K. Q. Weinberger, pp. 2627–2635.
- Beck, A. & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1), 183–202. Available at http://epubs.siam.org/doi/pdf/10.1137/080716542.
- Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.
- Boyd, S., Kim, S.-J., Vandenberghe, L. & Hassibi, A. (2007). A tutorial on geometric programming. Optim. Eng., 8(1), 67–127.
- Byrd, R. H., Chin, G. M., Neveitt, W. & Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3), 977–995.
- Byrd, R. H., Chin, G. M., Nocedal, J. & Oztoprak, F. (2012). A family of second-order methods for convex *ℓ*_{1}-regularized optimization. Technical report, Optimization Center, Northwestern University.
- Byrd, R. H., Chin, G. M., Nocedal, J. & Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Math. Program., 134(1), 127–155.
- Chi, E. C. & Lange, K. (2013). Splitting methods for convex clustering. arXiv:1304.0499 [stat.ML].
- Chi, E. C., Zhou, H., Chen, G. K., Ortega Del Vecchyo, D. & Lange, K. (2013). Genotype imputation via matrix completion. Genome Res., 23(3), 509–518.
- Combettes, P. L. & Wajs, V. R. (2005). Signal recovery by proximal forward–backward splitting. Multiscale Model. Simul., 4(4), 1168–1200.
- Deng, W. & Yin, W. (2012). On the global and linear convergence of the generalized alternating direction method of multipliers. CAAM Technical Report TR12-14, Rice University.
- Ecker, J. G. (1980). Geometric programming: methods, computations and applications. SIAM Rev., 22(3), 338–362.
- Goldfarb, D., Ma, S. & Scheinberg, K. (2012). Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program., 1–34.
- Goldstein, T., O'Donoghue, B. & Setzer, S. (2012). Fast alternating direction optimization methods. Technical report cam12-35, University of California, Los Angeles.
- Grant, M. & Boyd, S. (2008). Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control, Eds. V. D. Blondel, S. P. Boyd & H. Kimura, pp. 95–110. London: Springer-Verlag. http://stanford.edu/~boyd/graph_dcp.html.
- Grant, M. & Boyd, S. (2012). CVX: MATLAB Software for Disciplined Convex Programming, version 2.0 beta. http://cvxr.com/cvx.
- Hocking, T., Joulin, A., Bach, F. & Vert, J.-P. (2011). Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Eds. L. Getoor & T. Scheffer, pp. 745–752. New York: ACM.
- Lange, K. & Zhou, H. (2014). MM algorithms for geometric and signomial programming. Math. Program. Ser. A, 143(1–2), 339–356.
- Lee, J., Sun, Y. & Saunders, M. (2012). Proximal Newton-type methods for convex optimization. In Advances in Neural Information Processing Systems, 25, Eds. P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou & K. Q. Weinberger, pp. 836–844.
- Lindsten, F., Ohlsson, H. & Ljung, L. (2011). Just relax and come clustering! A convexification of k-means clustering. Technical report, Linköpings universitet.
- Marin, J.-M., Pudlo, P., Robert, C. P. & Ryder, R. J. (2012). Approximate Bayesian computational methods. Statist. Comput., 22(6), 1167–1180.
- Meinshausen, N. & Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(4), 417–473.
- Peterson, E. L. (1976). Geometric programming. SIAM Rev., 18(1), 1–51.
- Ramani, S. & Fessler, J. A. (2013). Accelerated non-Cartesian SENSE reconstruction using a majorize–minimize algorithm combining variable-splitting. In *Proceedings IEEE International Symposium on Biomedical Imaging*, pp. 700–703.
- Schmidt, M., Le Roux, N. & Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, 24, Eds. J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira & K. Q. Weinberger, pp. 1458–1466.
- Tseng, P. (1991). Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J. Control Optim., 29(1), 119–138.
- Wainwright, M. J. & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1–2), 1–305.
- Zhou, H., Alexander, D. & Lange, K. (2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statist. Comput., 21, 261–273.
- Zhou, H. & Lange, K. (2010). On the bumpy road to the dominant mode. Scand. J. Stat., 37(4), 612–631.
- Zhou, H. & Li, L. (2013). Regularized matrix regression. J. R. Stat. Soc. Ser. B Stat. Methodol., DOI: 10.1111/rssb.12031.