## 1. Introduction

Estimation of discrete structure, such as graphs or clusters, and variable selection are age-old problems in statistics. They have enjoyed increased attention in recent years due to the massive growth of data across many scientific disciplines. These large data sets often make estimation of discrete structures or variable selection imperative for improved understanding and interpretation. Most classical results do not cover the loosely defined case of high dimensional data, and it is mainly in this setting that we motivate the promising properties of our new stability selection approach.

In the context of regression, for example, an active area of research is to study the *p*≫*n* case, where the number of variables or covariates *p* exceeds the number of observations *n*; for an early overview see for example van de Geer and van Houwelingen (2004). In a similar spirit, graphical modelling with many more nodes than sample size has been the focus of recent research, and cluster analysis is another widely used technique to infer a discrete structure from observed data.

Challenges with estimation of discrete structures include computational aspects, since corresponding optimization problems are discrete, as well as determining the right amount of regularization, e.g. in an asymptotic sense for consistent structure estimation. Substantial progress has been made over recent years in developing computationally tractable methods which have provable statistical (asymptotic) properties, even for the high dimensional setting with many more variables than samples. One interesting stream of research has focused on relaxations of some discrete optimization problems, e.g. by *l*_{1}-penalty approaches (Donoho and Elad, 2003; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006; Wainwright, 2009; Yuan and Lin, 2007) or greedy algorithms (Freund and Schapire, 1996; Tropp, 2004; Zhang, 2009). The practical usefulness of such procedures has been demonstrated in various applications. However, the general issue of selecting a proper amount of regularization (for the procedures that were mentioned above and for many others) for obtaining a right-sized structure or model has largely remained a problem with unsatisfactory solutions.

We address the problem of proper regularization with a very generic subsampling approach (bootstrapping would behave similarly). We show that subsampling can be used to determine the amount of regularization such that a certain familywise type I error rate in multiple testing can be conservatively controlled for finite sample size. Particularly for complex, high dimensional problems, a finite sample control is much more valuable than an asymptotic statement with the number of observations tending to ∞. Beyond the issue of choosing the amount of regularization, the subsampling approach yields a new structure estimation or variable selection scheme. For the more specialized case of high dimensional linear models, we prove what we expect in greater generality: namely that subsampling in conjunction with *l*_{1}-penalized estimation requires much weaker assumptions on the design matrix for asymptotically consistent variable selection than what is needed for the (non-subsampled) *l*_{1}-penalty scheme. Furthermore, we show that additional improvements can be achieved by randomizing not only via subsampling but also in the selection process for the variables, bearing some resemblance to the successful tree-based random-forest algorithm (Breiman, 2001). Subsampling (and bootstrapping) has been primarily used so far for asymptotic statistical inference in terms of standard errors, confidence intervals and statistical testing. Our work here is of a very different nature: the marriage of subsampling and high dimensional selection algorithms yields finite sample familywise error control and markedly improved structure estimation or selection methods.
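As an informal sketch of the subsampling idea (the function below, the penalty value and the 80% frequency threshold are illustrative choices, not the paper's exact procedure), one can record how often each variable is selected by the lasso across random subsamples of size ⌊*n*/2⌋ and keep only frequently selected variables:

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, lam, n_subsamples=100, seed=0):
    """Fraction of random half-subsamples on which each variable
    is selected (has a non-zero lasso coefficient) at penalty `lam`."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample without replacement
        fit = Lasso(alpha=lam).fit(X[idx], y[idx])
        counts += fit.coef_ != 0
    return counts / n_subsamples

# Toy data: only the first 3 of 50 covariates carry signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.5 * rng.standard_normal(200)

freq = selection_frequencies(X, y, lam=0.2)
stable = np.where(freq >= 0.8)[0]  # variables selected in at least 80% of subsamples
```

Variables with high selection frequency are retained regardless of the exact penalty, which is what makes the scheme less sensitive to the choice of regularization.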

### 1.1. Preliminaries and examples

In general, let *β* be a *p*-dimensional vector, where *β* is sparse in the sense that *s*<*p* components are non-zero. In other words, ‖*β*‖_{0}=*s*<*p*. Denote the set of non-zero values by *S*={*k*:*β*_{k}≠0} and the set of variables with vanishing coefficient by *N*={*k*:*β*_{k}=0}. The goal of structure estimation is to infer the set *S* from noisy observations.
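As a minimal illustration of this notation (with hypothetical toy values), the support *S* and its complement *N* can be read off a coefficient vector directly:

```python
import numpy as np

# A sparse coefficient vector with p = 6 components, s = 2 of them non-zero.
beta = np.array([0.0, 1.5, 0.0, 0.0, -2.0, 0.0])
S = set(np.flatnonzero(beta))       # support: indices k with beta_k != 0
N = set(range(len(beta))) - S       # indices with vanishing coefficient
s = np.count_nonzero(beta)          # the l0-"norm" of beta
```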

As a first supervised example, consider data (*X*^{(1)},*Y*^{(1)}),…,(*X*^{(n)},*Y*^{(n)}) with univariate response variable *Y* and *p*-dimensional covariates *X*. We typically assume that the (*X*^{(i)},*Y*^{(i)})s are independent and identically distributed (IID). The vector *β* could be the coefficient vector in a linear model

*Y* = *Xβ* + *ɛ*,  (1)

where *Y*=(*Y*_{1},…,*Y*_{n}), *X* is the *n*×*p* design matrix and *ɛ*=(*ɛ*_{1},…,*ɛ*_{n}) is the random noise whose components are IID. Thus, inferring the set *S* from data is the well-studied variable selection problem in linear regression. A main stream of classical methods proceeds to solve this problem by penalizing the negative log-likelihood with the *l*_{0}-norm ‖*β*‖_{0}, which equals the number of non-zero components of *β*. The computational task of solving such an *l*_{0}-norm penalized optimization problem quickly becomes unfeasible when *p* grows large, even when using efficient branch-and-bound techniques. Alternatively, one can relax the *l*_{0}-norm by the *l*_{1}-norm penalty. This leads to the lasso estimator (Tibshirani, 1996; Chen *et al.*, 2001),

*β̂*^{λ} = argmin_{*β*∈ℝ^{p}} (‖*Y*−*Xβ*‖_{2}^{2} + *λ*‖*β*‖_{1}),  (2)

where *λ*∈ℝ^{+} is a regularization parameter and we typically assume that the covariates are on the same scale, i.e. *n*^{−1}∑_{i=1}^{n}(*X*_{k}^{(i)})^{2}=1 for all *k*. An attractive feature of the lasso is its computational feasibility for large *p* since the optimization problem (2) is convex. Furthermore, the lasso can select variables by shrinking certain estimated coefficients exactly to 0. We can then estimate the set *S* of non-zero *β*-coefficients by *Ŝ*^{λ}={*k*:*β̂*_{k}^{λ}≠0}, which involves convex optimization only. Substantial understanding has been gained over the last few years about consistency of such lasso variable selection (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006; Wainwright, 2009; Yuan and Lin, 2007), and we present the details in Section 3.1. Among the challenges are the issue of choosing a proper amount of regularization *λ* for consistent variable selection and the fact that restrictive design conditions are needed for asymptotically recovering the true set *S* of relevant covariates.
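To make the dependence of the selected set on *λ* concrete, the following sketch (scikit-learn names and toy data, chosen here for illustration rather than taken from the paper) traces the estimated support along a penalty grid:

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Toy regression: 5 active covariates out of 40.
rng = np.random.default_rng(0)
n, p = 100, 40
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta + 0.5 * rng.standard_normal(n)

# Estimated support along a descending grid of penalties lambda.
alphas, coefs, _ = lasso_path(X, y, n_alphas=20)
supports = [set(np.flatnonzero(coefs[:, j])) for j in range(len(alphas))]
# supports[0] (strongest penalty) is empty; supports[-1] (weakest) is large.
# Which lambda recovers exactly {0, ..., 4} is the model selection problem.
```

The support grows monotonically in practice from empty toward nearly all variables as the penalty shrinks, which is precisely why picking a single *λ* is delicate.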

A second example is on unsupervised Gaussian graphical modelling. The data are assumed to be

*X*^{(1)},…,*X*^{(n)} IID ∼ 𝒩_{d}(*μ*,Σ).  (3)

The goal is to infer conditional dependences among the *d* variables or components in *X*=(*X*_{1},…,*X*_{d}). It is well known that *X*_{j} and *X*_{k} are conditionally dependent given all other components {*X*_{l};*l*≠*j*,*k*} if and only if Σ^{−1}_{j,k}≠0, and we then draw an edge between nodes *j* and *k* in a corresponding graph (Lauritzen, 1996). The structure estimation is thus on the index set *E*={(*j*,*k*):1≤*j*<*k*≤*d*}, which has cardinality *p*=*d*(*d*−1)/2 (and, of course, we can represent *E* as a *p*×1 vector), and the set of relevant conditional dependences is *S*={(*j*,*k*)∈*E*:Σ^{−1}_{j,k}≠0}. Similarly to the problem of variable selection in regression, *l*_{0}-norm methods are computationally very difficult and become very quickly unfeasible for moderate or large values of *d*. A recent proposal is the graphical lasso (Friedman *et al.*, 2008):

*K̂*^{λ} = argmin_{*K*} {−log det(*K*) + tr(*K* *Σ̂*) + *λ*‖*K*‖_{1}},  (4)

where *Σ̂* is the empirical covariance matrix and ‖*K*‖_{1}=∑_{j,k}|*K*_{j,k}|. A relaxation with *l*_{1}-type penalties has also proven to be useful in this context (Meinshausen and Bühlmann, 2006). Estimator (4) amounts to an *l*_{1}-penalized estimator of the Gaussian log-likelihood, partially maximized over the mean vector *μ*, when minimizing over all non-negative definite symmetric matrices. The estimated graph structure is then *Ê*^{λ}={(*j*,*k*)∈*E*:*K̂*^{λ}_{j,k}≠0}, which involves convex optimization only and is computationally feasible for large values of *d*.
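As an illustration of reading off the estimated edge set from the penalized precision matrix (using scikit-learn's `GraphicalLasso` on synthetic data; the penalty value and problem sizes are arbitrary choices, not from the paper):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Synthetic Gaussian graphical model with a sparse precision (inverse covariance).
d = 6
K = np.eye(d)
K[0, 1] = K[1, 0] = 0.4   # edge (0, 1)
K[2, 3] = K[3, 2] = -0.3  # edge (2, 3)
Sigma = np.linalg.inv(K)

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=2000)

fit = GraphicalLasso(alpha=0.05).fit(X)
K_hat = fit.precision_
# Estimated edge set: non-zero off-diagonal entries of the estimated precision.
edges = {(j, k) for j in range(d) for k in range(j + 1, d)
         if abs(K_hat[j, k]) > 1e-8}
```

The *l*_{1} penalty sets many off-diagonal precision entries exactly to zero, so the estimated graph falls out of the fit without any discrete search.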

Another potential area of application is clustering. Choosing the correct number of clusters is a notoriously difficult problem. Looking for clusters that are stable under perturbations or subsampling of the data can help to obtain a better sense of a meaningful number of clusters and to validate results. Indeed, there has been some activity in this area, most notably in the context of *consensus clustering* (Monti *et al.*, 2003); for an early application see Bhattacharjee *et al.* (2005). Our proposed false discovery control can be applied to consensus clustering, yielding good estimates of the parameters of a suitable base clustering method.
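A minimal sketch of the consensus idea (the helper function, the subsample fraction and k-means as base method are illustrative choices, not the exact procedure of Monti *et al.*): pairs of points that repeatedly land in the same cluster across subsampling runs indicate stable structure.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k, n_runs=50, frac=0.8, seed=0):
    """Fraction of subsampling runs in which each pair of points falls in
    the same k-means cluster, among runs where both points were drawn."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    both = np.zeros((n, n))
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        same = labels[:, None] == labels[None, :]
        both[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    return together / np.maximum(both, 1)

# Two well-separated point clouds: k = 2 yields a near 0/1 consensus matrix,
# whereas a poor choice of k produces many intermediate consensus values.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
C = consensus_matrix(X, k=2)
```

A consensus matrix whose entries concentrate near 0 and 1 suggests that the chosen number of clusters is stable under subsampling.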

### 1.2. Outline

The use of resampling for validation is certainly not new; we merely try to put it into a more formal framework and to show certain empirical and theoretical advantages of doing so. It seems difficult to give a complete coverage of all previous work in the area, as notions of stability, resampling and perturbations are very natural in the context of structure estimation and variable selection. We reference and compare with previous work throughout the paper.

The structure of the paper is as follows. The generic stability selection approach, its familywise type I multiple-testing error control and some representative examples from high dimensional linear models and Gaussian graphical models are presented in Section 2. A detailed asymptotic analysis of the lasso and randomized lasso for high dimensional linear models is given in Section 3 and more numerical results are described in Section 4. After a discussion in Section 5, we collect all the technical proofs in Appendix A.