#### Discussion on the paper by Meinshausen and Bühlmann

**Sylvia Richardson** (*Imperial College London*)

This stimulating paper on combining resampling with *l*_{1}-selection algorithms makes important contributions to the analysis of high dimensional data. What I found particularly appealing in this paper is that it puts on a firm footing the idea of using stability under resampling to select a set of variables.

Using inclusion probabilities in resampled subsets had been discussed informally for a considerable time in applied work, in particular in genomics. Early work on extracting prognostic signatures from gene expression data was soon questioned when it was noticed that such signatures had little reproducibility. The idea of intersecting or combining resampled signatures followed. For example, Zucknick *et al.* (2008) investigated the stability of gene expression signatures in ovarian cancer derived by different supervised learning procedures, including the lasso, the elastic net and random forests, by computing their inclusion frequencies under resampling for profiles of different sizes (see Fig. 2 of Zucknick *et al.* (2008)).

I shall focus my discussion on the variable selection aspect rather than the graphical modelling, and specifically I shall comment on two aspects:

- (a)
trying to understand better the applicability of the bound in theorem 1 and the performance of this method beyond the reported simulations;

- (b)
putting ‘stability’ into a broader context and relating or comparing it with other approaches.

The focus of theorem 1 is on the control of the familywise error rate, a control which depends on two quantities: the threshold *π*_{thr} and the average number of selected variables over the regularization domain, *q*_{Λ}. It is informative to work out the bounds that are obtained as successive variables are selected for particular thresholds. On the vitamin data set, using the recommended *q*_{Λ}=√(0.8*p*)=57, the selection probability reaches 0.9 for only one variable, with a bound *E*(*V*)≤1. Lowering *π*_{thr} to 0.6 selects three variables with a small *q*_{Λ}=3.9, and hence a useful bound *E*(*V*)≤0.02. But the bound on *E*(*V*) reaches 3.8 if the domain is extended until a fourth variable is included. Hence, in this example, the bounds of theorem 1 would restrict the selection to three variables (Fig. 10(a)). If the practical use of the stability plots is extended to a ranking of the features according to their maximal selection probabilities, as suggested in the simulations, the bounds of theorem 1 thus appear to be quite conservative for deriving a cut-off.
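
The bound of theorem 1, *E*(*V*) ≤ *q*_{Λ}^{2}/{(2*π*_{thr}−1)*p*}, is easy to evaluate numerically; the following minimal sketch (with *p*=4088 for the vitamin data set) reproduces the orders of magnitude quoted above:

```python
def stability_bound(q, pi_thr, p):
    """Theorem 1 bound on E(V), the expected number of falsely
    selected variables: q**2 / ((2*pi_thr - 1) * p)."""
    assert 0.5 < pi_thr <= 1.0
    return q ** 2 / ((2.0 * pi_thr - 1.0) * p)

p = 4088  # number of gene expression variables in the vitamin data set
print(stability_bound(57.0, 0.9, p))  # ~0.99, i.e. E(V) <= 1
print(stability_bound(3.9, 0.6, p))   # ~0.02
```

Note that the recommendation *q*_{Λ}=√(0.8*p*) makes the bound exactly 1 at *π*_{thr}=0.9, since *q*_{Λ}^{2}/(0.8*p*)=1.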

With respect to the randomized lasso, the relevant quantities in the consistency theorem 2 are the threshold *δ* and the bound on the *β*_{k}s. Unfortunately, these quantities do not seem amenable to explicit computation. The authors seem to rely instead on a semiqualitative interpretation of the plots, described in terms such as ‘variables standing out’, ‘better separated’, …, without giving quantitative guidelines on how to judge such a separation. However, the selection probabilities are clearly influenced by the choice of weakness *α* (see Figs 10(b) and 10(c)), indicating that the thresholds for stability selection should be adapted with respect to *α*. Besides the elegant theoretical results of theorem 2, it is thus not entirely clear how to use the randomized stability paths in practice.

Broadly speaking, stability selection and machine learning methods can both be viewed as ‘ensemble learning’ procedures following Hastie *et al.* (2009). Counting the number of times that a variable is selected in each of the resampled *n*/2 subsets for particular values of *λ* is just one way of combining the information of a collection of lasso learners. In this respect, it is a little surprising that the authors have not opened up the discussion on connections between their approach and ensemble methods such as ‘bagging’ or ‘stacking’. Exploiting this connection could potentially lead to revisiting some of the choices made in their procedure, such as the set of learners that are combined (e.g. involving learners with more complex penalties such as in the elastic net) and the size of the subsamples, and to investigate the performance of combination rules that would exploit more than the marginal information, e.g. the order, or the stability of subsets.
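
This ‘ensemble of lasso learners’ view can be made concrete in a few lines: estimate each variable's marginal inclusion frequency over random *n*/2 subsamples. The coordinate descent lasso below and all parameter values are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Plain coordinate descent for (1/(2n))||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                       # running residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            bj = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * (bj - b[j])
            b[j] = bj
    return b

def selection_probabilities(X, y, lam, n_subsamples=50, seed=0):
    """Inclusion frequency of each variable in lasso fits over random
    subsamples of size n/2 -- one lasso 'learner' per subsample."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, n // 2, replace=False)
        counts += lasso_cd(X[idx], y[idx], lam) != 0
    return counts / n_subsamples

# toy check: 3 signal variables among 20
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))
y = X[:, :3] @ np.array([2.0, 2.0, 2.0]) + 0.5 * rng.normal(size=100)
probs = selection_probabilities(X, y, lam=0.3)
```

Other combination rules (bagging-style averaging of coefficients, or stacking the subsample fits) would reuse exactly the same loop over subsamples.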

Linking stability selection to Bayesian approaches provides further intriguing questions. It is well known that the penalty *λ* can be viewed as a parameter in a Laplace prior on the regressions coefficients *β*. To ‘stabilize’ inference, the authors take the maximum of over a domain Λ. From a Bayesian perspective, the choice of using the maximum rather than some form of *integration over*Λ is questionable. Have the authors considered alternative choices to the maximum and would some of their results carry over?

This naturally leads me to discuss the connection with the Bayesian variable selection (BVS) context, where stability and predictive performance are achieved, not by resampling the data but by allowing parameter and model uncertainty. In this light, model averaging for BVS could be viewed as an ensemble method. There are several strategies for BVS, differing in their prior model of the regression coefficients and the model search strategy. One way (but by no means the only way) to exploit the output of the BVS search is to compute marginal probabilities of inclusion for each variable, averaging over the space of models that are visited. In the large *p*, small *n* paradigm, ranking the posterior probabilities of inclusion to select relevant variables is commonly done. Of course, when the covariates are dependent, joint rather than marginal thresholding should be also considered.

To understand better the power and sensitivity of stability selection, and to investigate further the claim that is made by the authors of empirical evidence of good performance even when the ‘irrepresentable condition’ is violated, we have implemented their procedure on a set of simulated examples under two scenarios of large *p*, small *n*, the first inspired by classical test cases for variable selection that were devised by George and McCulloch (1997) and the second based on phased genotype data from the HapMap project. In both cases, a few of the regressors have strong correlation with the noise variables. In parallel, we have run two Bayesian stochastic search algorithms, shotgun stochastic search (Hans *et al.*, 2007) and evolutionary stochastic search (Bottolo and Richardson, 2010), on the same data sets. Receiver operating characteristic curves and false discovery rate curves averaged over 25 replicates are presented in Fig. 11. It is clear from the plots that, in these two cases of large *p*, small *n*, good power for stability selection is only achieved at the expense of selecting a large number of false positive discoveries, a fact that can also be clearly seen in Fig. 7 of the paper. The Bayesian stochastic algorithms outperform the stability selection procedures in the two scenarios. Through their capacity to explore important parts of the parameter and model space efficiently, and to perform averaging according to the support of each model, the Bayesian learners achieve enhanced performance here.

As can be surmised from my comments, I have found this paper enjoyable, thought provoking and rich for future research directions, and I heartily congratulate the authors.

**John Shawe-Taylor and Shiliang Sun** (*University College London*)

We congratulate the authors on a paper with an exciting mix of novel theoretical insights and practical experimental testing and verification of the ideas. We provide a personal view of the developments that were introduced by the paper, mentioning some areas where further work might be usefully undertaken, before presenting some results assessing the generalization performance of stability selection on a medical data set.

The paper introduces a general method for assessing the reliability of including component features in a model. The authors independently follow a line similar to that of Bach (2008), who proposed to run the lasso algorithm on bootstrap samples and to include only features that occur in all the models thus created. Meinshausen and Bühlmann refine this idea by assessing the probability that a feature is included in models created with random subsets of ⌊*n*/2⌋ training examples. Features are included if this probability exceeds a threshold *π*_{thr}.

Theorem 1 provides a theoretical bound on the expected number of falsely selected variables in terms of *π*_{thr} and *q*_{Λ}, the expected number of features to be included in the models for a fixed subset of the training data, but a range of values of the regularization parameter *λ* ∈ Λ. The theorem is quite general, but makes one non-trivial assumption: that the distribution over the inclusion of false variables is exchangeable. In their evaluation of this bound on a range of real world training sets, albeit with artificial regression functions, they demonstrate a remarkable agreement between the bound value (chosen to equal 2.5) and the true number of falsely included variables.

We would have liked to have seen further assessment of the reliability of the bound in different regimes, i.e. bound values as fixed by different *q*_{Λ} and *π*_{thr}. The experimental results indicate that in the data sets that were considered the exchangeability assumption either holds or, if it fails to hold, does not adversely affect the quality of the bound. We believe that it would have been useful to explore in greater detail which of these explanations is more probable.

One relatively minor misfit between the theory and the practical experiments is that the theoretical results are stated in terms of expected values of the quantities over random subsets, whereas in practice a small sample of subsets is used to estimate the features to include as well as quantities such as *q*_{Λ}. Perhaps finite sample methods for assessing the fit of the exchangeability assumption could also be considered. This might lead to an on-line approach in which subsets are sampled until the required accuracy is achieved.

Theorem 2 provides a more refined analysis in that it also guarantees that relevant variables are included provided that they play a significant part in the true model, which is something that theorem 1 does not address. Though stability selection as defined refers to the use of random subsampling, and all the experiments make use of this strategy, theorem 2 analyses the effect of a ‘randomized lasso’ algorithm that randomly rescales the features before training on the full set. Furthermore, the proof of theorem 2 does not make it easy for the reader to gain an intuitive understanding of the key ideas behind the result.

Our final suggestion for further elucidation of the usefulness of the ideas that are presented in the paper is to look at the effects of stability selection on the generalization performance of the resulting models.

As an example we have applied the approach to a data set that is concerned with predicting the level of cholesterol of subjects on the basis of risk factors and single-nucleotide polymorphism genotype features.

The data set includes 1842 subjects or examples. The feature set (input) includes six risk factors (age, smo, bmi, apob, apoa, hdl) and 787 genotypes. Each genotype takes a value in {1,2,3}. As preprocessing, each risk factor is normalized to have mean 0 and variance 1. For each example, its output is the averaged level of cholesterol over five successive years. The whole data were divided into a training set of 1200 examples and a test set of the remaining 642 examples. We shall report the test performance averaged across 10 different random divisions of training and test sets. The performance is evaluated through the root-mean-square error. In addition to standard ‘stability selection’ we report performance for a variant in which complementary pairs of subsets are used.
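
Since the medical data set is not public, the evaluation protocol (random 1200/642 divisions, root-mean-square error averaged over 10 splits) can be sketched on synthetic stand-in data; the closed-form ridge solver, the reduced dimension *p*=20 and the regularization value are assumptions for illustration only:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def ridge_fit(X, y, lam):
    """Closed-form ridge regression estimate."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 1842, 20          # 1842 subjects as in the text; p reduced for the sketch
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

scores = []
for _ in range(10):                    # 10 random training/test divisions
    perm = rng.permutation(n)
    tr, te = perm[:1200], perm[1200:]  # 1200 training and 642 test examples
    beta = ridge_fit(X[tr], y[tr], lam=1.0)
    scores.append(rmse(y[te], X[te] @ beta))
print(np.mean(scores), np.std(scores))
```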

We report results for four methods:

- (a)
ridge regression with the original features (method M1);

- (b)
the lasso with the original features (method M2);

- (c)
ridge regression with the features identified by stability selection (method M3);

- (d)
the lasso with the features identified by stability selection (method M4).

The variants of M3 and M4 based on complementary pairs of subsets are denoted M3c and M4c. The performances of the first two methods are independent of *π*_{thr} and provide a baseline given in Table 1.

Table 1. Mean (and standard deviations in parentheses) of the test performance and number of retained features for methods M1 and M2

| | *Results for method M1* | *Results for method M2* |
| --- | --- | --- |
| Root-mean-square error | 0.752 (0.017) | 0.707 (0.017) |
| Number of retained features | 792 (0.66) | 109 (5.22) |

For the two methods involving stability selection we experiment with values of *π*_{thr} from the set {0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5}. The results for various values of *π*_{thr} for methods M3 and M4 using standard subsampling and the randomized lasso are given in Table 2, whereas using the complementary sampling gives the results of Table 3.

Table 2. Mean (and standard deviations in parentheses) of the test performance and number of retained features for methods M3 and M4

| *π*_{thr} | *Number of features* | *Results for method M3* | *Results for method M4* |
| --- | --- | --- | --- |
| 0.20 | 117.4 (6.2) | 0.722 (0.017) | 0.716 (0.017) |
| 0.25 | 86.8 (5.2) | 0.720 (0.016) | 0.715 (0.016) |
| 0.30 | 64.7 (4.1) | 0.719 (0.017) | 0.715 (0.017) |
| 0.35 | 45.3 (4.1) | 0.716 (0.016) | 0.715 (0.017) |
| 0.40 | 27.3 (3.8) | 0.714 (0.016) | 0.713 (0.016) |
| 0.45 | 17.7 (1.9) | 0.712 (0.016) | 0.710 (0.016) |
| 0.50 | 11.4 (1.6) | 0.714 (0.019) | 0.713 (0.019) |

Table 3. Mean (and standard deviations in parentheses) of the test performance and number of retained features for methods M3c and M4c

| *π*_{thr} | *Number of features* | *Results for method M3c* | *Results for method M4c* |
| --- | --- | --- | --- |
| 0.20 | 116.5 (4.4) | 0.721 (0.017) | 0.715 (0.017) |
| 0.25 | 83.8 (3.0) | 0.720 (0.017) | 0.715 (0.017) |
| 0.30 | 62.4 (3.6) | 0.718 (0.017) | 0.714 (0.016) |
| 0.35 | 44.2 (3.2) | 0.717 (0.015) | 0.716 (0.016) |
| 0.40 | 27.4 (3.4) | 0.714 (0.015) | 0.713 (0.015) |
| 0.45 | 18.2 (1.7) | 0.714 (0.012) | 0.710 (0.013) |
| 0.50 | 11.8 (1.8) | 0.715 (0.014) | 0.713 (0.014) |

The results suggest that stability selection has not improved the generalization ability of the resulting regressors, though clearly the lasso methods outperform ridge regression. The performance is remarkably stable across different values of *π*_{thr} despite the number of stable variables undergoing an order of magnitude reduction.

The vote of thanks was passed by acclamation.

**Tso-Jung Yen** (*Academia Sinica, Taipei*) **and Yu-Min Yen** (*London School of Economics and Political Science*)

We congratulate the authors for tackling a challenging statistical problem with an effective and easily implementable method. Our comments and interest in the paper are as follows. First, the authors claim that the final result is insensitive to the tuning parameter *λ*. However, we have found that it may still be affected by the range of *λ*, particularly in the *p*≫*m* situation, where *m* is the subsampling size. In this situation, as *λ*→0, the subsampling estimation results of the lasso will approach those of ordinary least squares. Consequently, taking *λ*_{min} too close to 0 will lead to the selection of all variables *k* ∈ {1,…,*p*} with high probability.

We propose a solution to this problem by directly estimating the regularization region Λ=[*λ*_{min},*λ*_{max}] from the data.

Secondly, in addition to *E*(*V*)/*p*, we may also be interested in controlling the false discovery rate. Conventionally, this quantity may be approximated, but it is unknown whether such an approximation works well in regression-based variable selection.

**Rajen Shah and Richard Samworth** (*University of Cambridge*)

We congratulate the authors for their innovative and thought-provoking paper. Here we propose a minor variant of the subsampling algorithm that is the basis of stability selection. Instead of drawing individual subsamples at random, we advocate drawing disjoint pairs of subsamples at random. This variant appears to have favourable properties.

Below, we use the same notation as the paper. Our method of subsampling involves splitting {1,…,*n*} into two halves at random and picking a subset of size ⌊*n*/2⌋ in each half. Repeating this *M* times, we obtain a sequence of subsets *I*_{1},…,*I*_{2M} with *I*_{2i}∩*I*_{2i−1}=∅, *i*=1,…,*M*. For *k* ∈ {1,…,*p*}, define the empirical selection probability of variable *k* as the fraction of the 2*M* subsets on which *k* is selected.

Similarly to the stability selection algorithm, we select variable *k* when this empirical selection probability is at least *π*_{thr}.
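
A minimal sketch of this complementary-pairs draw (indices only; the seed and the choice of *M* are arbitrary, and the selection step on each subset is whatever base procedure is in use):

```python
import numpy as np

def complementary_pairs(n, M, seed=1):
    """Draw M random splits of {0,...,n-1} into a disjoint pair of
    subsets of size floor(n/2), giving subsamples I_1,...,I_2M with
    I_{2i-1} and I_{2i} disjoint.  For even n, every observation then
    appears in exactly M of the 2M subsamples."""
    rng = np.random.default_rng(seed)
    half = n // 2
    pairs = []
    for _ in range(M):
        perm = rng.permutation(n)
        pairs.append((perm[:half], perm[half:2 * half]))
    return pairs

pairs = complementary_pairs(10, 50)
```

The balanced appearance of each observation (point (b) below) follows directly from the construction: each split places every observation in exactly one of its two halves.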

- (a)
Letting *V*_{M} be the number of falsely selected variables, *E*(*V*_{M}) satisfies the same upper bound as in theorem 1 of the paper: results corresponding to lemma 1 and lemma 2 of the paper hold for our subsampling scheme, and the argument of theorem 1 then follows through. Thus we have the same error control as in the paper even for finite *M*, as well as in the infinite subsampling case.

- (b)
Simulations suggest that we obtain a slight decrease in the Monte Carlo variance. A heuristic explanation is that, when *n* is even, each observation is contained in the same number of subsamples. This minimizes the sum of the pairwise intersection sizes of our subsamples.

- (c)
With essentially no extra computational cost, we obtain estimates of simultaneous selection probabilities, which can also be useful for variable selection; see Fan *et al.* (2009).

- (d)
If, in addition to the assumptions of theorem 1, we also assume that the distribution of the selection proportions is unimodal, we obtain improved bounds. For a visual comparison between this bound and that of theorem 1, see Fig. 14. The improvement suggests that using sample splitting with this bound can lead to more accurate error control than using standard stability selection.

- (e)
This new bound gives guidance about the choice of *M*. For instance, when *π*_{thr}=0.6, choosing *M*>52 ensures that the bound is within 5% of its limit as *M*→∞. When *π*_{thr}=0.9, choosing *M*>78 has the same effect.

**Christian Hennig** (*University College London*)

Stability selection seems to be a fruitful idea.

As usually done with variable selection, the authors present it as a mathematical problem in which the task is to pick a few variables with truly non-zero coefficients out of many variables with true *β*_{k}=0. However, in reality we do not believe model assumptions to be precisely fulfilled, and in most cases we believe that the (in some sense) closest linear regression approximation to reality does not have any regression coefficients precisely equal to zero.

It is of course fine to have theory about the idealized situation with many zero coefficients, but in more realistic situations the quality of a variable selection method cannot be determined by considering models and data alone. It would be necessary to specify ‘costs’ for including or excluding variables with ‘small’ true *β*_{k}, which may depend on whether we would rather optimize predictive quality or favour models with small numbers of variables enabling simple interpretations. We may even be interested in stability of the selection in its own right. Accepting the dependence of the choice of a method on the aim of data analysis, it would be very useful for promising methods such as stability selection to have a ‘profile’ of potential aims for which they are particularly suited, or not preferable.

Considering the authors’ remark at the end of Section 1.1, Hennig (2010) illustrates in which sense the problem of finding the correct number of clusters *s* cannot be decided on the basis of models and data alone; in some simulation set-ups given there it also turns out that *s* is not necessarily estimated most stably if it is chosen by a subsampling method looking for stable clusterings *given s* (based on ‘prediction strength’; Tibshirani and Walther (2005)).

**Paul D. W. Kirk, Alexandra M. Lewin and Michael P. H. Stumpf** (*Imperial College London*)

We consider stability selection when several of the *relevant* variables are correlated with one another. Like the authors, we are interested in variable relevance, rather than prediction; hence we wish to select all relevant variables.

To illustrate, we use a simulated example, similar to that of the authors, in which *p*=500, *n*=50, the predictors are sampled from an *N*(0,Σ) distribution and the response is given by *Y*=Σ_{i=1,…,8} *X*_{i}+*ɛ*, where *ɛ* is a zero-centred Gaussian noise term with variance 0.1. Here Σ is the identity matrix except for the elements Σ_{1,2}=Σ_{3,4}=Σ_{4,5}=Σ_{3,5}=0.8 and their symmetric counterparts. Thus two sets of predictors are correlated: {*X*_{1},*X*_{2}} and {*X*_{3},*X*_{4},*X*_{5}}.

For variables that are correlated with each other, different realizations of the simulation example above result in different stability paths; for example some realizations will stably select *X*_{1} with high probability but not *X*_{2}, some will stably select *X*_{2} but not *X*_{1} (as in Fig. 15(a)) and others will select both variables with lower probability, and hence may not select either with sufficiently high probability to be chosen in the final analysis. In fact there is a clear relationship between the marginal selection probabilities for *X*_{1} and *X*_{2}, as shown in Fig. 15(b), which shows these probabilities for 1000 realizations.

One approach is to use the lasso as before, but to calculate selection probabilities for sets of correlated predictors. Fig. 15(c) shows the stability paths for grouped predictors for the same realization as in Fig. 15(a), in which only one member of each correlated set would have been selected with high probability. Grouping them enables us to select the groups as required.
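
Computing a selection probability for a pre-defined group amounts to counting the subsamples in which at least one group member appears; a small sketch with toy selected sets (the sets and indices are purely illustrative):

```python
def group_selection_probability(selected_sets, group):
    """Fraction of subsamples in which at least one member of the
    pre-defined group of correlated predictors was selected."""
    group = set(group)
    hits = sum(1 for s in selected_sets if group & set(s))
    return hits / len(selected_sets)

# toy trade-off: indices 0 and 1 stand for two correlated predictors
# that alternate across subsamples, as X1 and X2 do in the text
sets_ = [{0}, {1}, {0}, {1}, {0, 1}]
print(group_selection_probability(sets_, [0, 1]))  # 1.0
print(group_selection_probability(sets_, [0]))     # 0.6: each member alone looks unstable
```

The group reaches selection probability 1 even though each member alone would fall below a typical threshold, which is exactly the behaviour seen in Fig. 15(c).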

The obvious drawback to selection probabilities for groups is that the groups must be defined from the outset. We propose to use the elastic net (Zou and Hastie, 2005), which uses a linear combination of the lasso *l*_{1}-penalty and the ridge *l*_{2}-penalty. The *l*_{2}-penalty lets the algorithm include groups of correlated variables, whereas the *l*_{1}-penalty ensures that most regression coefficients are set to 0. We find that using marginal selection probabilities with the elastic net can give us all members of the correlated groups without defining them in advance, as shown in Fig. 15(d).

**J. T. Kent** (*University of Leeds*)

This has been a fascinating paper dealing, in particular, with the important problem of variable selection in regression. I have two simple questions about the methodology in this setting.

First, if we are willing to assume joint normality of the *Y*,*X* data, then all the information in the data will be captured by the sufficient statistics, namely the first two sample moments, together with the sample size *n*. Presumably there is no need to resample from the data in this situation; in principle, inferences could be made analytically from the set of sample correlations between the variables, though in practice a version of the parametric bootstrap might be used. More generally, the use of resampling methods seems to carry with it an implicit assumption or accommodation of non-normality and leads to the question how the methodology of the paper will be affected by different types of non-normality.

Second, I am not entirely clear what happens under approximate collinearity between the explanatory variables. In the conventional forward search algorithm in regression analysis, we are often faced with the situation where two variables *x*_{1} and *x*_{2} have similar explanatory power. If *x*_{1} is in the model, then there is no need to include *x*_{2}; conversely, if *x*_{2} is in the model there is no need to include *x*_{1}. If I understand your procedure correctly, you will tend to include *x*_{1} half the time and *x*_{2} half the time, leading to stability probabilities of about 50% each. If so, you might falsely conclude that neither variable is needed.
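
Professor Kent's concern can be mimicked by a toy simulation in which a greedy rule must choose between two nearly collinear predictors on each subsample; the noise levels, the greedy correlation rule and the use of a fresh realization per draw (so the frequencies average over data sets) are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
picks = np.zeros(2)
for _ in range(200):                         # 200 simulated subsample draws
    z = rng.normal(size=n)                   # shared component
    x1 = z + 0.05 * rng.normal(size=n)       # two exchangeable,
    x2 = z + 0.05 * rng.normal(size=n)       # nearly collinear predictors
    y = z + 0.3 * rng.normal(size=n)
    idx = rng.choice(n, n // 2, replace=False)
    c1 = abs(np.corrcoef(x1[idx], y[idx])[0, 1])
    c2 = abs(np.corrcoef(x2[idx], y[idx])[0, 1])
    picks[0 if c1 >= c2 else 1] += 1         # greedily keep the stronger variable
print(picks / 200)  # each chosen roughly half the time
```

With each variable chosen on roughly half the subsamples, neither reaches a threshold such as *π*_{thr}=0.9, illustrating the risk of falsely concluding that neither is needed.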

**Axel Gandy** (*Imperial College London*)

I congratulate the authors on their stimulating paper. The following comments concern the practical implementation of selecting stable variables.

The paper defines the set of stable variables in expression (7) as those *k* whose selection probability is at least a fixed threshold 0<*π*_{thr}<1. In practice, these selection probabilities, and therefore also the set of stable variables, cannot be evaluated explicitly.

Besides this guaranteed performance, the algorithm in Gandy (2009) is sequential and will come to a decision based on only a small number of subsamples if the selection probability is far from the threshold *π*_{thr}.

If Λ is a finite set with #Λ elements then a simultaneous bound on the sampling error for all variables can also be obtained.

This can be accomplished by running the algorithm of Gandy (2009) for each *λ* and *k* with the Bonferroni corrected threshold *ɛ*/(*p*#Λ). These can be run in parallel using the same subsamples *I*_{j}. The Bonferroni correction would be conservative. Devising a less conservative correction could be a topic for further research.

**Howell Tong** (*London School of Economics and Political Science*)

I join the others in congratulating the authors on a thought-provoking paper. It may be constructive to look beyond the independent and identically distributed data case and the exchangeable case. I would welcome the authors’ reaction to what follows, some of which I have alluded to elsewhere (Tong, 2010). There are many examples of ill-posed problems with dependent data. I only need to mention the well-known seasonal adjustment, which is as old as time series itself. Akaike (1980) considered the classic decomposition of a time series *y*_{i}, *i*=1,…,*n*, into *y*_{i}=*T*_{i}+*S*_{i}+*I*_{i} where *T*_{i}, *S*_{i} and *I*_{i} are respectively the trend, the seasonal and the irregular component. He treated the problem as one of smoothing. By incorporating the prior belief of some underlying structural stability, e.g. a smooth trend or gradual change of the seasonal component, he minimized

Σ_{i}(*y*_{i}−*T*_{i}−*S*_{i})^{2}+*d*^{2}Σ_{i}*A*_{i}^{2}+*r*^{2}Σ_{i}*B*_{i}^{2}+*z*^{2}Σ_{i}*C*_{i}^{2}

where *d*, *r* and *z* are regularization parameters, *A*_{i}=*T*_{i}−2*T*_{i−1}+*T*_{i−2}, *B*_{i}=*S*_{i}−*S*_{i−12} and *C*_{i}=*S*_{i}+*S*_{i−1}+…+*S*_{i−11}. He estimated the regularization parameters by adopting a Bayesian approach based on what he called the ‘ABIC’, which is minus 2 times the log-likelihood of a Bayesian model. Akaike's approach has three important aspects:

- (a)
treating the problem as one of smoothing;

- (b)
focusing on structural stability rather than variable selection;

- (c)
treating the regularization parameters as some hyperparameters in a Bayesian framework.

Finally, I have a minor question. Have the authors tried to use their stability selection on model selection in time series modelling?

**Chris Holmes** (*Oxford University*)

The authors are to be congratulated on a ground breaking paper. The following comments are made from the perspective of a casual Bayesian observer.

Crisp variable subset selection is rare; much more common is for the statistician to work with the owner of the data to determine the relevance of the measured variables and to understand better the dependence structures within the data, i.e. as part of a dialogue whereby statistical evidence is combined with expert judgement. On the one hand the Bayesian works with the posterior distribution over models, which is a function of the data and a prior distribution that captures any rich domain knowledge that may exist; on the other hand stability selection reports selection probabilities, which are a function of the data and the algorithm. It seems to the casual Bayesian observer that the former is more objective while providing a formal mechanism to incorporate any domain knowledge which might exist about the problem.

The following contributions were received in writing after the meeting.

**Ismaïl Ahmed and Sylvia Richardson** (*Imperial College London*)

The object of this contribution is to discuss further the vitamin example that was provided by the authors. This example is given to ‘see how the lasso and the related stability path cope with noise variables’. It shows that, on the basis of a graphical analysis of the stability path, we can select five of the six unpermuted genes whereas, with the lasso path, the situation seems much less clear.

Thanks to the authors, we had the opportunity to reanalyse the vitamin data set that was used in the paper. The first thing that we would like to remark is that by performing a simple univariate analysis, i.e. by using each of the 4088 genes one at a time and then adjusting the corresponding *p*-values for multiplicities at a 5% level for the false discovery rate, we also pick up five of the six unpermuted covariates. The results are illustrated by Fig. 16, which also shows that there is an important discrepancy between the first five *q*-values and the remaining values.

Furthermore, we also performed a standard multivariate regression analysis restricted to the six unpermuted covariates, thus removing all the noise variables. The results, which are shown in Table 4, indicate that only one unpermuted gene is associated with a *p*-value that is less than 0.05 and that three unpermuted genes have a *p*-value that is less than 0.10. Consequently, it seems unclear whether any multivariate selection method could or should pick up more than these three variables. And indeed, when applying the shotgun stochastic search algorithm of Hans *et al.* (2007) to the whole data set with 20 000 iterations, no more than these three variables could possibly be selected with regard to their posterior importance measure (as defined in equation (2) of Hans *et al.* (2007)) over the 100 000 top visited models.

Table 4. Results of a multivariate regression restricted to the six unpermuted covariates

| *Parameter* | *Estimate* | *Standard error* | *t-value* | *Pr(>\|t\|)* |
| --- | --- | --- | --- | --- |
| (Intercept) | −7.7107 | 0.1782 | −43.27 | 0.0000 |
| X1407 | −0.1221 | 0.1912 | −0.64 | 0.5246 |
| X1885 | 0.6665 | 0.3888 | 1.71 | 0.0894 |
| X3228 | −0.1094 | 0.2716 | −0.40 | 0.6880 |
| X3301 | 0.4697 | 0.2750 | 1.71 | 0.0905 |
| X3496 | 0.6183 | 0.3077 | 2.01 | 0.0470 |
| X3803 | −0.1729 | 0.3271 | −0.53 | 0.5982 |

It thus seems to us puzzling that, on this example, stability selection behaves more like a univariate approach rather than a multivariate approach.

**Phil Brown and Jim Griffin** (*University of Kent, Canterbury*)

We comment on the use of the randomized lasso (equation (13)). This, with subsampling, attempts to remedy the seductive appeal of convex penalization, which is a property of the lasso. Demanding a single solution when there is inherent uncertainty and interchangeability of predictors leads to the present paper's suggestion of subsampling for inference. In the Bayesian modal analysis of Griffin and Brown (2007) it is the multiplicity from a non-convex penalization which allows posterior exploration of alternative models without the need for external randomization. Our generalization of the lasso, a hierarchical scale mixture-of-normals prior, is the flexible normal–exponential–gamma distribution. The first two stages generate a double exponential, which is the equivalent of *L*_{1}-penalization. The third stage puts a gamma(*α*,1/*λ*^{2}) distribution on the natural parameter of the exponential second-stage mixing. Thus the penalization can be written as a randomly reweighted *l*_{1}-penalty,

where *a*_{1},*a*_{2},…,*a*_{p} are independently realized *χ*^{2} random variables with 2*α* degrees of freedom weighting each *β*_{k} in each simulation. This third-stage gamma distribution is somewhat different from the authors’ advocacy of an inverse truncated uniform distribution, whose implied prior distribution for *β*_{k} is less natural and, we feel, needs more justification. Combining all three stages, the *β*_{1},*β*_{2},…,*β*_{p} are independent and identically distributed *a priori*, where *β*_{k} follows a normal–exponential–gamma distribution which can be written explicitly in terms of a parabolic cylinder function, and is a unimodal spiked distribution with tails whose heaviness depends on the shape parameter *α*. When *α*=0.5, it is the quasi-Cauchy distribution of Johnstone and Silverman (2005) and the robustness prior of Berger (1985), section 4.7.10. The third-stage stochastic generation of the gamma (i.e. *χ*^{2}) distribution gives a *stochastic lasso* allowing fast algorithms such as LARS, and we thank the authors for that suggestion. We would ask, though, whether the other form of randomization, subsampling of observations, is necessary with such rich stochastic weighting possibilities. It is better to generate prior data than to throw away real data.
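
On our reading, the third-stage draw amounts to sampling *χ*^{2}_{2α} weights for a randomly reweighted *l*_{1}-penalty; a minimal sketch of the weight generation (the function name and the reweighting form are our own notation, not the authors'):

```python
import numpy as np

def stochastic_lasso_weights(p, alpha, seed=2):
    """One draw of chi-squared weights a_1,...,a_p with 2*alpha degrees
    of freedom; each simulation reweights the l1-penalty of beta_k by
    a_k (our reading of the normal-exponential-gamma construction)."""
    rng = np.random.default_rng(seed)
    return rng.chisquare(2.0 * alpha, size=p)

a = stochastic_lasso_weights(1000, alpha=0.5)  # alpha = 0.5: quasi-Cauchy case
```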

We have also given a full Bayesian analysis in Griffin and Brown (2010) illustrating the limitations of straight lasso penalization using another robustness prior, the variance gamma prior.

**David Draper** (*University of California, Santa Cruz*)

I have two comments on this interesting and useful paper.

- (a)
I am interested in pursuing connections with Bayesian ideas beyond the authors’ mention of the Barbieri and Berger (2004) results. I reinterpret three of the concepts in the present paper in Bayesian language.

- (i)
Frequentist penalization of the log-likelihood to regularize the problem is often equivalent to Bayesian choice of a prior distribution (for example, think of the *l*_{1}-norm penalty term in the lasso in the paper's equation (2) as a log-prior for *β*; might there be an even better prior to achieve the goals of this paper?).

- (ii)
Under the assumption that the rows *Z*^{(i)} of the data matrix are sampled independently from the underlying data-generating distribution, resampling the existing data is like sampling from the posterior predictive distribution (given the data seen so far) for future rows in the data matrix (and of course, if we already had enough such rows, it would no longer be true that *p*≫*n* and the good predictors would be more apparent).

- (iii)
When estimated via resampling, stability paths are just Monte Carlo approximations to expectations of the indicator function, for inclusion of a given variable, with respect to the posterior distribution for the unknowns in the model.

I bring this up because it has often proved useful in the past to identify the Bayesian model (both prior and likelihood) under which an algorithmic procedure, like that described in this paper, is approximately a good posterior summary of the quantities of interest, because even better procedures can then be found; a good example is the reverse engineering of neural networks from a Bayesian viewpoint by Lee (2004). Can the authors suggest the next steps in this algorithm-to-Bayes research agenda applied to their procedure?

- (b)
The authors have cast their problem in inferential language, but the real goal of structure discovery is often decision making (for instance, a drug company trying to maximize production of riboflavin in the example in Section 2.2 will want to decide in which genes to encourage mutation; money is lost both by failing to pursue good genes and by pursuing bad ones); moreover, when the problem is posed inferentially there is no straightforward way to see how to trade off false positive against false negative selections, whereas this is an inescapable part of a decision-making approach. What does the authors’ procedure look like in a real world problem when it is optimized for making decisions, via maximization of expected utility?
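If the stability path values *π*_{k} are read as approximate inclusion probabilities and one posits costs for false positive and false negative selections, the expected-utility decision reduces to a simple threshold on *π*_{k}. A toy sketch (the cost values and all names here are our own illustrative assumptions, not part of the paper):

```python
import numpy as np

def utility_optimal_selection(pi, cost_fp=1.0, cost_fn=5.0):
    """Bayes decision rule: selecting variable k incurs expected cost
    cost_fp*(1 - pi_k); not selecting it incurs cost_fn*pi_k.
    Selecting is optimal exactly when pi_k > cost_fp/(cost_fp + cost_fn)."""
    pi = np.asarray(pi, dtype=float)
    return np.flatnonzero(pi > cost_fp / (cost_fp + cost_fn))

# With symmetric costs the rule is the familiar 0.5 threshold;
# costly false negatives (expensive missed genes) push the threshold down.
sel_symmetric = utility_optimal_selection([0.9, 0.3, 0.05], cost_fp=1.0, cost_fn=1.0)
sel_fn_costly = utility_optimal_selection([0.9, 0.3, 0.05], cost_fp=1.0, cost_fn=5.0)
```

This makes explicit how the false positive/false negative trade-off that Draper asks about could enter through the loss function rather than through a fixed error rate.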

**Zhou Fang** (*Oxford University*)

The authors suggest an interesting and novel way of enhancing variable selection methods and should be congratulated for their contribution. However, recalculation with subsamples can be computationally costly. Here, a heuristic is suggested that may allow some benefits of stability selection with the lasso without resampling.

Differentiating the lasso criterion with respect to the weights *W* gives, after substitution, a matrix *A*.

*A**δ* is a linear approximation to changes in the lasso non-zero estimates under a small perturbation of weights *δ*. This is directly analogous to influence calculations in normal linear regression. (See for example Belsley *et al.* (1980).) Although this approximation may be inaccurate, the components of *A* will all already be calculated as part of LARS or similar algorithms, reducing the computational burden.

Consider a simulation. Using *n*=50 and *p*=100, generate (*Z*^{(1)},…,*Z*^{(p)},*V*) independently standard normal and set *X*^{(k)}=*Z*^{(k)}+*V*. We set *Y*=*X**β*+*ɛ*, with *β*=(1,1,1,1,1,0,…,0)^{T} and *ɛ* independent noise. Fig. 17 shows results from various variable selection techniques.
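This simulation design is easy to reproduce. A numpy/scikit-learn sketch (the penalty level, subsample count and seed are our own choices) generates the common-factor design and the subsampled selection frequencies that stability selection thresholds:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 50, 100
Z = rng.standard_normal((n, p))
V = rng.standard_normal((n, 1))
X = Z + V                          # shared factor V gives every pair of predictors correlation 1/2
beta = np.zeros(p)
beta[:5] = 1.0                     # five true covariates
y = X @ beta + rng.standard_normal(n)

# Subsampled selection frequencies at one fixed penalty level
B = 100
hits = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)
    fit = Lasso(alpha=0.2, fit_intercept=False).fit(X[idx], y[idx])
    hits += fit.coef_ != 0
pi_hat = hits / B
```

Thresholding `pi_hat` at, say, 0.6 gives the stability-selected set; the spurious correlated predictors are selected far less consistently across subsamples than the five true ones.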

In Fig. 17(a), *V* creates correlation in the predictors, hence generating false positive selections. For high regularization, these spurious fits rival the true fits in magnitude. As suggested in the paper, stability selection is more effective at controlling false positive selections.

For perturbation methods based on the original single lasso path, we may examine the relative norms of the rows of *A*, as defined by

In our experiments, we see that calculating this also distinguishes the true covariates from the irrelevant covariates: with most *λ*, the true positive selections vary relative to their estimates much less under perturbation. For Fig. 17(d), we naively convert the relative norms into an approximate upper bound on the selection probability under weight perturbations of variance 1, using the one-sided Chebyshev (Cantelli) inequality.

This approximates the stability selection path, except for low regularization, where failure to consider cases of unselected variables being selected under a perturbation means that we overestimate stability. However, this calculation, even inefficiently implemented, took 0.5 s to compute, compared with 14 s for 200 resamples of stability selection.
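The Cantelli inequality used in that conversion states that P(*X*−*μ* ⩾ *t*) ≤ *σ*^{2}/(*σ*^{2}+*t*^{2}) for *t*>0. The exact quantity Fang plugs in is not reproduced above, so the helper below is only a generic sketch of the inequality itself:

```python
def cantelli_upper_bound(mu: float, sigma2: float, t: float) -> float:
    """Cantelli (one-sided Chebyshev): P(X - mu >= t) <= sigma^2 / (sigma^2 + t^2), t > 0."""
    if t <= 0:
        return 1.0                    # the bound is vacuous for t <= 0
    return sigma2 / (sigma2 + t * t)

# e.g. the probability that a perturbed quantity drifts at least one
# standard deviation above its mean is at most 1/2
bound = cantelli_upper_bound(mu=0.0, sigma2=1.0, t=1.0)
```

The bound is distribution free, which is what makes it usable here without modelling the perturbation law.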

**Torsten Hothorn** (*Ludwig-Maximilians-Universität München*)

Stability selection brings together statistical error control and model-based variable selection. The method, which controls the probability of selecting—and thus potentially interpreting—a model containing at least one non-influential variable, will increase the confidence in scientific findings obtained from high dimensional or otherwise complex models.

The idea of studying the stability of the variable selection procedure applied to a specific problem by means of resampling is simple and easy to implement. As the authors point out, this straightforward approach has actually been used much earlier. The first reference that I could find is a paper on model selection in Cox regression by Sauerbrei and Schumacher (1992). Today, multiple-testing procedures utilizing the joint distribution of the estimated parameters can be applied in such low dimensional models for variable and structure selection under control of the familywise error rate (Haufe *et al.* (2010) present a nice application to multivariate time series). With their theorem 1, Nicolai Meinshausen and Peter Bühlmann now provide the means for proper error control also in much more complex models.

Two issues seem worth further attention to me: the exchangeability assumption that is made in theorem 1 and the prediction error of models fitted by using only the selected variables. One popular approach for variable selection in higher dimensions is based on the permutation variable importance measure that is used in random forests. Interestingly, it was found by Strobl *et al.* (2008) that correlated predictor variables receive a higher variable importance than is justified by the data-generating process. The reason is that exchangeability is (implicitly) assumed by the permutation scheme that is applied to derive these variable importances. The problem can be addressed by applying a conditional permutation scheme and I wonder whether a more elaborate resampling technique taking covariate information into account might allow for a less strong assumption for stability selection as well.
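The conditional permutation scheme of Strobl *et al.* can be sketched as permuting a variable only within strata of a correlated covariate. The quantile binning and all names below are our own crude simplification of their tree-based conditioning:

```python
import numpy as np

def conditional_permutation(x, z, n_bins=4, seed=0):
    """Permute x within quantile bins of z, preserving the coarse x-z
    dependence that a global permutation would destroy."""
    rng = np.random.default_rng(seed)
    x, z = np.asarray(x), np.asarray(z)
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1))
    strata = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, n_bins - 1)
    out = x.copy()
    for s in range(n_bins):
        idx = np.where(strata == s)[0]
        out[idx] = x[rng.permutation(idx)]   # shuffle only inside stratum s
    return out

rng = np.random.default_rng(2)
z = rng.standard_normal(400)
x = z + 0.3 * rng.standard_normal(400)       # x strongly correlated with z
x_perm = conditional_permutation(x, z)
```

Comparing importance drops under this scheme with those under a global shuffle mimics the correction for correlated predictors that motivates Hothorn's question about weakening exchangeability.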

Concerning my second point, the simulation results show that stability selection controls the number of falsely selected variables. I wonder how the performance (measured by the out-of-sample prediction error) of a model that is fitted to only the selected variables compares with the performance of the underlying standard procedure (including a cross-validated choice of hyperparameters). If the probability that an important variable is missed by stability selection is low, there should not be much difference. However, if stability selection is too restrictive, I would expect the prediction error of the underlying standard model to be better. This would be another hint that interpretable models and high prediction accuracy might not be achievable at the same time.

**Chenlei Leng and David J. Nott** (*National University of Singapore*)

We congratulate Meinshausen and Bühlmann on an elegant piece of work which shows the usefulness of introducing additional elements of randomness into the lasso and other feature selection procedures through subsampling and other mechanisms. It is now well understood that certain restrictive assumptions (Zhao and Yu, 2006; Wainwright, 2009) must be imposed on the design matrix for the lasso to be a consistent model selector, although adaptive versions of the lasso can circumvent the problem (Zou, 2006). However, as convincingly pointed out by Meinshausen and Bühlmann, by considering multiple sparse models obtained from perturbations of the original feature selection problem the performance of the original lasso, which uses just a single fit, can be improved.

We believe that a Bayesian perspective has much to offer when thinking about randomized versions of the lasso. We offer two alternative approaches, where the randomness comes from an appropriate posterior distribution.

- (a)
Our first approach puts a prior on the parameters in the full model. Given a draw of the parameters, say *β*^{*}, from the posterior distribution, we consider projecting the model that is formed by this realization onto subspaces defined via some form of *l*_{1}-constraint on the parameters. Defining the loss function as the expected Kullback–Leibler divergence between this model and its projection, we use either of the constraints ∑_{k}|*β*_{k}|≤*t* or ∑_{k}|*β*_{k}|/|*β*_{k}^{*}|≤*t* on the subspace, inspired by the lasso and adaptive lasso penalty respectively. Owing to the *l*_{1}-penalty, in the posterior distribution of the projection there is positive probability that some parameters are exactly 0, and the posterior distribution on the model space that is induced by the projection allows exploration of model uncertainty. This idea is discussed in Nott and Leng (2010) and extends a Bayesian variable selection approach of Dupuis and Robert (2003) which considers projections onto subspaces that are defined by sets of active covariates.

- (b)
In on-going work, we consider the following adaptive lasso (Zou, 2006):

- (35)
*β̂* = arg min_{β} {‖*Y*−*Xβ*‖_{2}^{2} + ∑_{k=1}^{p}*λ*_{k}|*β*_{k}|}.

In comparison with the usual methods which determine a single estimate of *λ*=(*λ*_{1},…,*λ*_{p})^{T} (Zou, 2006; Wang *et al.*, 2007), we generalize the Bayesian lasso method in Park and Casella (2008) to produce a posterior sample of *λ*, which is denoted as *λ*^{(1)},…,*λ*^{(B)}. For each *b*, we plug *λ*^{(b)} into expression (35), which gives a sparse estimate *β*^{*b} of *β*. The estimated parameters *β*^{*1},…,*β*^{*B} can then be used for prediction and assessing model uncertainty. This is very much like the randomized lasso of Meinshausen and Bühlmann, but the randomness enters very naturally through a posterior distribution on hyperparameters. Our preliminary results show that this approach works competitively in prediction and model selection compared with the lasso and adaptive lasso.

**Rebecca Nugent, Alessandro Rinaldo, Aarti Singh and Larry Wasserman** (*Carnegie Mellon University, Pittsburgh*)

Meinshausen and Bühlmann argue for using stability-based methods. We suspect that the methods that are introduced in the current paper will generate much interest.

*General view of stability*

Let {*P*_{h}} be some class of procedures indexed by a tuning parameter *h*. We think of larger *h* as corresponding to larger bias. Our view of the stability approach is to use the least biased procedure subject to having an acceptable variability. This has a Neyman–Pearson flavour to it since we optimize what we cannot control subject to bounds on what we can control. The advantage is that variance is estimable whereas bias, generally, is not. There is no notion of approximating the ‘truth’ so it is not required that the model be correct. In contrast, Meinshausen and Bühlmann seem to be more focused on finding the ‘true structure’.

Rinaldo and Wasserman (2010) applied this idea to finding stable density clusters as follows. Randomly split the data into three groups *X*=(*X*_{1},…,*X*_{n}), *Y*=(*Y*_{1},…,*Y*_{n}) and *Z*=(*Z*_{1},…,*Z*_{n}). Construct a kernel density estimator *p̂*_{h} from *X* (with bandwidth *h*) and construct a kernel density estimator *q̂*_{h} from *Y*. Define the instability by

Ξ(*h*) = *P̂*_{Z}({*p̂*_{h}>*λ*} Δ {*q̂*_{h}>*λ*}),

where *P̂*_{Z} is the empirical distribution based on *Z* and Δ denotes the symmetric difference. Under certain conditions, Rinaldo and Wasserman (2010) showed the following theorem.

**Theorem 3. ** Let *h*_{*} be the diameter of {*p*>*λ*} and let *d* be the dimension of the support of *X*_{i}. Then:

- (a)
Ξ(0)=0 and Ξ(*h*)=0 for all *h*≥*h*_{*};

- (b)
;

- (c)
as ;

- (d)
for each *h* ∈ (0,*h*_{*}), for constants *c*_{1} and *c*_{2}.

We suggest using

- (36)
*ĥ* = inf{*h*: Ξ(*h*)≤*α*},

where Ξ(*h*) measures the variability and *α* is a user-defined acceptable amount of variability. Currently, we are generalizing the results to hold under weaker conditions and to hold uniformly over cluster trees rather than a single level set. The same ideas can be applied to graphs.
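A one-dimensional sketch of this instability criterion (the kernel, the level *λ*, the bandwidth grid and the acceptance level *α* are our own illustrative choices): estimate the level-set cluster on two halves of the data and measure, over the third sample, how often the two estimates disagree.

```python
import numpy as np

def kde(train, at, h):
    """1-d Gaussian kernel density estimate with bandwidth h."""
    u = (at[:, None] - train[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
rng.shuffle(data)
X, Y, Z = data[:200], data[200:400], data[400:]

lam = 0.05                                   # level defining the cluster {density > lam}
def instability(h):
    """Empirical (over Z) measure of the symmetric difference between the
    level-set clusters estimated from X and from Y."""
    return np.mean((kde(X, Z, h) > lam) != (kde(Y, Z, h) > lam))

hs = np.linspace(0.05, 2.0, 40)
xi = np.array([instability(h) for h in hs])
alpha = 0.1
stable = hs[xi <= alpha]
h_hat = stable.min() if stable.size else None   # least smoothing with acceptable instability
```

Small bandwidths produce erratic clusters that differ between the two halves; very large bandwidths oversmooth both estimates into agreement, which is the bias–variability trade-off the criterion navigates.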

The authors spend time discussing the search for true structure. In general, we feel that there is too much emphasis on finding true structure. Consider the linear model. It is a virtual certainty that the model is wrong. Nevertheless, we all use the linear model because it often leads to good predictions. The search for good predictors is much different from the search for true structure. The latter is not even well defined when the model is wrong, which it always is.

**Adam J. Rothman, Elizaveta Levina and Ji Zhu** (*University of Michigan, Ann Arbor*)

We congratulate the authors on developing a clever and practical method for improving high dimensional variable selection, and establishing an impressive array of theoretical performance guarantees. We are particularly interested in stability selection in graphical models, which is illustrated with one brief example in the paper. To investigate the performance of stability selection combined with the graphical lasso a little further, we performed the following simple simulation. The data are generated from the *N*_{p}(0,Ω^{−1}) distribution, where Ω_{ii}=1,Ω_{i,i−1}=Ω_{i−1,i}=0.3 and the rest are 0. We selected *p*=30 and *n*=100, and performed 50 replications. Stability selection with pointwise control was implemented with bootstrap samples of size *n*/2 drawn 100 times.
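The simulation is easy to reproduce. A reduced sketch with scikit-learn's `GraphicalLasso` (fewer replications, subsamples of size *n*/2 and a single penalty value, all chosen by us for speed rather than matching the discussants' full grid):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(5)
p, n = 30, 100
Omega = np.eye(p)                       # tridiagonal precision matrix, as in the text
for i in range(1, p):
    Omega[i, i - 1] = Omega[i - 1, i] = 0.3
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

# Edge selection frequencies over subsamples of size n/2
B, lam = 30, 0.15
freq = np.zeros((p, p))
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)
    prec = GraphicalLasso(alpha=lam).fit(X[idx]).precision_
    freq += np.abs(prec) > 1e-6
freq /= B

true_edges = np.eye(p, k=1, dtype=bool) | np.eye(p, k=-1, dtype=bool)
false_edges = ~true_edges & ~np.eye(p, dtype=bool)
```

Thresholding `freq` at *π*_{thr} ∈ [0.6, 0.9] recovers one point of the stability selection ROC curve discussed above; the true neighbouring edges are retained far more consistently than the spurious ones.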

We selected four different values of the tuning parameter *λ* for the graphical lasso, which correspond to the marked points along the receiver operating characteristic (ROC) curves for the graphical lasso in Fig. 18. The ROC curve showing false positive and true positive rates of detecting 0s in Ω for the graphical lasso was obtained by varying the tuning parameter *λ* and averaging over replications. For each fixed *λ*, we applied stability selection varying *π*_{thr} within the recommended range of 0.6–0.9, which resulted in an ROC curve for stability selection. The ROC curves show that stability selection reduces the false positive rate, as it should, and shifts the graphical lasso result down along the ROC curve; essentially, it is equivalent to the graphical lasso with a larger *λ*. Figs 18(a) and 18(b) have *λ*s which are too small, and stability selection mostly improves on the graphical lasso result, but it does appear somewhat sensitive to the exact value of *λ*: if *λ* is very small (Fig. 18(a)), stability selection only improves on the graphical lasso for large values of *π*_{thr}. In Figs 18(c) and 18(d), *λ* is just right or too large, and then applying stability selection makes the overall result worse. This example confirms that stability selection is a useful computational tool to improve on the false positive rate of the graphical lasso when tuning over the full range of *λ* is more expensive than doing bootstrap replications. However, since it does seem somewhat sensitive to the choice of a suitable small *λ*, it seems that combining it with some kind of initial crude cross-validation could result in even better performance. 
It would be interesting to consider whether there are particular types of the inverse covariance matrix that benefit from stability selection more than others, and whether any theoretical results can be obtained specifically for such structures; in particular, it would be interesting to know whether stability selection can perform better than the graphical lasso with oracle *λ*.

**A. B. Tsybakov** (*Centre de Recherche en Economie et Statistique, Université Paris 6 and Ecole Polytechnique, Paris*)

I congratulate the authors on a thought-provoking paper, which pioneers many interesting ideas. My question is about the comparison with other selection methods, such as the adaptive lasso or thresholded lasso (TL). In the theory these methods have better selection properties than those stated in theorem 2. For example, consider the TL *β̂*^{TL} with *β̂*_{k}^{TL}=*β̂*_{k} 1{|*β̂*_{k}|>*cs*^{1/2}*τ*}, where *β̂* is the lasso estimator with *λ* as in Bickel *et al.* (2009), *τ*=√{ log (*p*)/*n*} and *c*>0 is such that ‖*β̂*−*β*‖_{2}≤*cs*^{1/2}*τ* with high probability under the restricted eigenvalue condition of Bickel *et al.* (2009). Then a two-line proof using expression (7.9) in Bickel *et al.* (2009) shows that, with the same probability, under the restricted eigenvalue condition the TL selects *S* correctly whenever min_{k ∈ S}|*β*_{k}|>*Cs*^{1/2}*τ* for some *C*>0 depending only on *σ*^{2} and the eigenvalues of *X*^{′}*X*/*n*. Since also *c* depends only on *X* and *σ*^{2} (see Bickel *et al.* (2009)), *c* can be evaluated from the data. The restricted eigenvalue condition is substantially weaker than assumption 1 of theorem 2, and min_{k ∈ S}|*β*_{k}| need not be as large as *C*^{′}*s*^{3/2}*τ*, as required in theorem 2. We may interpret this as meaning that stability selection is successful if the relevant *β*_{k} are very large and the Gram matrix is very nice, whereas for smaller *β*_{k} and less diagonal Gram matrices it is safer to use the TL. Of course, here we compare only the ‘upper bound’, but it is not clear why stability selection does not achieve at least similar behaviour to that of the TL. Is this only technical or is there an intrinsic reason?
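A thresholded-lasso sketch in scikit-learn. Here the constant *c* is set to 1 and the true sparsity *s* is used directly, purely for illustration (Tsybakov notes that *c* can in fact be evaluated from the data; estimating *s* is a separate matter):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p, s = 100, 200, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 3.0                            # s strong true coefficients
y = X @ beta + rng.standard_normal(n)

tau = np.sqrt(np.log(p) / n)              # tau = sqrt(log(p)/n)
fit = Lasso(alpha=2 * tau, fit_intercept=False).fit(X, y)

c = 1.0                                   # illustrative stand-in for the data-evaluable constant
threshold = c * np.sqrt(s) * tau          # threshold level c * s^(1/2) * tau
S_hat = np.flatnonzero(np.abs(fit.coef_) > threshold)
```

In this easy regime the hard threshold removes the small spurious coefficients while all strong signals comfortably survive, which is exactly the two-step mechanism being compared with stability selection.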

**Cun-Hui Zhang** (*Rutgers University, Piscataway*)

I congratulate the authors for rightly calling attention to the utility of randomized variable selection and for their great effort in studying its effectiveness.

In variable selection, a false variable may have a significant observed association with the response variable by representing a part of the realized noise through luck or by correlating with the true variables. A fundamental challenge in such structure estimation problems with high dimensional data is to deal with the competition of many such false variables for the attention of a statistical learning algorithm.

The solution proposed here is to simulate the selection probabilities of each variable with a randomized learning algorithm and to estimate the structure by choosing the variables with high simulated selection probabilities. The success of the proposed method in the numerical experiments is very impressive, especially in some cases at a level of difficulty that has rarely been touched on earlier. I applaud the authors for raising the bar for future numerical experiments in the field.

On the theoretical side, the paper considers two assumptions to guarantee the success of the method proposed:

- (a)
many false variables compete among themselves at random so each false variable has only a small chance of catching the attention of the randomized learning algorithm;

- (b)
the original randomized learning algorithm is not worse than random guessing.

The first assumption controls false discoveries whereas the second ensures a certain statistical power of detecting the true structure. Under these two assumptions, theorem 1 asserts in a broad context the validity of an upper bound for the total number of false discoveries. This result has the potential for an enormous influence, especially in biology, text mining and other areas that are overwhelmed with poorly understood large data.

Because of the potential for great influence of such a mathematical inequality in the practice of statistics, possibly by many non-statisticians, we must proceed with equally great caution. In this spirit, I comment on the two assumptions as follows.

Assumption (a) is the exchangeability condition in theorem 1. As mentioned in the paper, it is a consequence of the exchangeability of *X*_{N} given *X*_{S} in linear regression. The stronger condition implies a correlation structure for the design of the form

cor(*X*_{N}) = (1−*r*)*I* + *r***1**,  cor(*X*_{N},*X*_{S}) = *ρ***1**,

where *I* is the identity and **1** denotes matrices of proper dimensions with 1 for all entries. I wonder whether such an assumption could be tested.

Assumption (b) may not always hold for the lasso. For *q*_{Λ}<|*S*|, a counterexample seems to exist with *X*_{N}=*ρ*_{S}^{T}*X*_{S}+*Z*_{N}, where *X*_{S} and *Z*_{N} are independent standard normal vectors and the components of *ρ*_{S} are of the same sign as those of *β*.

**Hui Zou** (*University of Minnesota, Minneapolis*)

I congratulate Dr Meinshausen and Professor Bühlmann on developing stability selection for addressing the difficult problem of variable selection with high dimensional data. Stability selection is intuitively appealing, general and supported by finite sample theory.

Regularization parameter selection in sparse learning is often guided by model comparison criteria such as the Akaike information criterion and the Bayes information criterion, in which prediction accuracy measurement is a crucial component. It is quite intriguing to see that stability selection directly targets variable selection without using any prediction measurement. The advantage of stability selection is well demonstrated by theorem 1, in which inequality (9) controls the number of false selections. In the context of variable selection, inequality (9) is very useful when the number of missed true variables is small. In an ideal situation we wish to have *S*⊆*Ŝ*^{stable} with high probability while controlling the number of false selections. An interesting theoretical problem is whether a non-trivial lower bound could be established for *P*(*S*⊆*Ŝ*^{stable}).

Table 5 summarizes the simulation results. First of all, in all four cases the number of false selections by stability selection is much smaller than 1, the nominal upper bound. For the case of *p*=20 000 and *n*=800, both SIS and stability selection select all true variables. In particular, stability selection using *π*_{thr}=0.9 achieves perfect variable selection in all 100 replications. When *p*=4000 and *n*=200, SIS still has a reasonably low missing rate (less than 5%), but stability selection using *π*_{thr}=0.6 and *π*_{thr}=0.9 selects only about six and three of the 10 relevant variables respectively. The performance is not very satisfactory. From this example we also see that, with finite samples, the choice of *π*_{thr} can have a significant effect on the missing rate of stability selection, although its effect on the false discovery rate is almost negligible.

Table 5. Simulation for SIS plus stability selection based on 100 replications

| *π*_{thr} | *d* | True positives, SIS | True positives, stability selection | False positives, stability selection |
|---|---|---|---|---|
| ***p*=20 000, *n*=800** | | | | |
| 0.6 | 63 | 10 | 10 | 0.22 |
| 0.9 | 126 | 10 | 10 | 0 |
| ***p*=4000, *n*=200** | | | | |
| 0.6 | 28 | 9.52 | 5.91 | 0.09 |
| 0.9 | 56 | 9.68 | 3.23 | 0.01 |

The **authors** replied later, in writing, as follows.

We are very grateful to all the discussants for their many insightful and inspiring comments. Although we cannot respond in a brief rejoinder to every issue that has been raised, we present some additional thoughts relating to the stimulating contributions.

*Connections to Bayesian approaches*

Richardson, Brown and Griffin, Draper, and Leng and Nott discuss interesting possible connections between stability selection (or other randomized selection procedures) and Bayesian approaches with appropriately chosen priors. The randomized lasso has the most immediate relation, as pointed out by Brown and Griffin in connection with their interesting paper (Griffin and Brown, 2007). They also raise the question whether subsampling is then still necessary. Although we do not have a theoretical answer here, it seems that subsampling improves a randomized procedure (or the equivalent Bayesian counterpart) in practice. We are also not ‘throwing away real data’ with subsampling, since the final selection probabilities over subsampled data are U-statistics of order ⌊*n*/2⌋ and use all *n* samples, not just a random subset. Stability selection is closely related to bagging (Breiman, 1996), as pointed out by Richardson, but it aggregates selection outcomes rather than predictions and assigns an error rate via our theorem 1. The approach of Nott and Leng (2010) seems very interesting in the context of Bayesian variable selection.

*Bayesian decision theoretic framework*

Draper and Holmes point out that a decision theoretic treatment is natural within the Bayesian framework. And, indeed, this is one of the advantages of Bayesian statistics. In our examples of biomarker discovery and, more generally, for variable selection (and also, for example, graphical modelling), the workflow consists of two steps: the first aim is (a) to obtain a list of candidate variables, ranked according to relevance, and the second is (b) to decide where to cut this list. Although a decision theoretic analysis is mostly helpful in step (b), stability selection potentially improves both (a) and (b). The issue of where to cut the list in step (b) involves, in the frequentist set-up, a choice of an acceptable type I error rate. The choice of a type I error rate is maybe not as satisfying as a full decision theoretic treatment but it is often useful in practice. Each ‘discovery’ needs to be validated by further experiments, which are often very costly, and the chosen framework aims to optimize the number of true discoveries under a given budget that can be spent on falsely chosen variables or hypotheses.

*Exchangeability assumption*

Shawe-Taylor and Sun, and Zhang raise, very legitimately, the question of whether the exchangeability assumption in theorem 1 is too stringent. As noted in the paper, the results do seem to hold up very well for real data sets where the assumption is likely to be violated (and theorem 2 does not make use of the strong exchangeability assumption). It is perhaps also worth mentioning that the assumptions can be weakened considerably for specific applications. For the special case of high dimensional linear models, we worked out a related solution in follow-up work (Meinshausen *et al.*, 2009).

*Tightness of bounds and two-step procedures*

Tsybakov correctly points out that sharper bounds on the *l*_{2}-distance are available for the standard lasso and that these could be exploited for variable selection by using hard thresholding of coefficients or the adaptive lasso. The reasons for the looser results for the randomized lasso are, in our view, technical rather than intrinsic. It is much more difficult to analyse the stability selection algorithm, which involves subsampling and randomization of covariates; this perhaps opens up interesting areas for further mathematical investigation. We thought that it was interesting, nevertheless, that the irrepresentable condition can be considerably weakened by using randomization of the covariates instead of two-step procedures such as hard thresholding or the adaptive lasso.

*Power and false positive selections*

Zou, Richardson, Shah and Samworth, and Rothman, Levina and Zhu examined the power of the method to detect important variables and compared it with alternative approaches for some examples. Although it is obviously true that no method will be universally ‘optimal’, stability selection places a strong emphasis on avoiding false positive selections. This is in contrast with, say, sure independence screening used by Zou, which is a screening method (by name!) and sits at the opposite end of the spectrum, placing a large emphasis on power while accepting many false positive selections. For the simulation results of Zou, we suspect that sure independence screening would have a much larger false positive rate for *p*=4000, but we could not see this being reported. Rothman, Levina and Zhu compare the receiver operating characteristic curve for the example of graphical modelling. It is not entirely unexpected from our point of view that the gain of stability selection is very small or, indeed, non-existent, since the simulation takes place in a Toeplitz design case which is very close to complete independence between all variables. For regression, it was shown already in the paper that stability selection cannot be expected to improve performance for independent or very weakly correlated variables. And our theorem 2 showed that we can expect major improvements only if the irrepresentable condition is violated, which has analogies in Gaussian graphical modelling (Meinshausen, 2008; Ravikumar *et al.*, 2008).

*Generalization performance and sparsity*

Richardson, Shawe-Taylor and Sun, Tsybakov and Hothorn discuss the connection between generalization performance and sparsity of the selected set of variables. Hothorn mentions that achieving both optimal predictive accuracy and consistent variable selection might be very difficult, as manifested also in the Akaike information criterion–Bayes information criterion dilemma for lower dimensional problems. Shawe-Taylor and Sun illustrate that stability selection will in general produce rather sparse models, which is in agreement with the discussion on false positive selections above. Their example also demonstrates impressively, though, that predictive performance is sometimes compromised only very marginally when using much sparser models than those produced by the lasso under cross-validation. In general, stability selection will yield much sparser models than the lasso with cross-validation. How much predictive performance one is willing to sacrifice for higher sparsity of the model, if any, should be application driven. If the answer is ‘none’, stability selection might not be appropriate.

*Approximating the true model*

Hennig and Nugent, Rinaldo, Singh and Wasserman rightly question the assumed existence of a true linear model and whether coefficients are ever exactly vanishing. Firstly, sometimes it *is* true that *β*_{k}s are exactly zero for some *k* ∈ {1,…,*p*}, namely if observed variables contain truly just noise (in astronomy and physics this is often so—in biology not so much). Secondly, any study of variable importance or variable relevance will necessarily be model based in some form or another, be it in a low or high dimensional linear model or a random-forest framework, to name two examples. In general, no low or high dimensional parametric or non-parametric model is ever correct in practice. And yet it is legitimate in our view to be interested in assessing variable importance or relevance. In this context, it is maybe worthwhile to look instead at some sparse approximation of the data-generating distribution (which always exists) and to treat the question of variable importance and variable selection in this light. For the lasso, this has been worked out in Bunea *et al.* (2007) and Bickel *et al.* (2009) for estimation of the approximating regression parameters whereas for example van de Geer *et al.* (2010) deal explicitly with the problem of variable selection when the linear model is a sparse linear approximation for a true possibly non-linear regression function. Our theorem 1 can be extended to such settings where the ‘true structure’*S* is defined via a sparse approximation: we need to replace the true set *S* by an approximation set *S*_{approx}. For example, in a linear approximating model for a general regression function *f*(·)=*E*(*Y*|*X*=·), we can define

- (37)
*S*_{approx} = {*k*: *β*_{k}^{*}≠0}, where *f*^{*}=*Xβ*^{*} minimizes ‖*f*−*f*_{M}‖_{2}^{2}/*n*+*C*^{2}|*M*| over all *f*_{M}=*Xβ*_{M},

and *β*_{M} has non-zero components only in the set *M*⊆{1,…,*p*}. Here, *C*^{2} is a suitable positive number, typically depending on *n*, and we denote by *f*, *f*^{*} and *f*_{M} the *n*×1 vectors evaluated at the observed covariates. Clearly, if the true model is linear and sparse with many regression coefficients equal to 0 and where the few non-zero regression coefficients are all sufficiently large, then the set *S*_{approx} in expression (37) equals the set *S* of the true active variables. Theorem 1 will remain valid under an appropriate exchangeability assumption for selection of variables in the complement of *S*_{approx} which might or might not be realistic. The mathematical arguments for extending theorem 2 to such a setting seem to be more involved.

*Correlated predictor variables*

Kirk, Lewin and Stumpf, and Kent raise the issue of correlated predictor variables and examine the behaviour of stability selection for highly correlated designs. This is a very important discussion point. As mentioned above, stability selection puts a large emphasis on avoiding false positive selections and, as a consequence, might miss important variables if they are highly correlated with irrelevant variables. This is similar to the behaviour of a classical test for a regression coefficient in *p*≪*n* situations. For situations where we are more interested in whether there are interesting variables in a certain group of variables, the proposal of Kirk, Lewin and Stumpf on testing stability of sets of variables (and finding those sets possibly by the elastic net) seems very interesting and useful.
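The frequency-splitting phenomenon behind this point can be seen in a stylized simulation (our own toy example, with a ‘pick the single best predictor’ selector standing in for the lasso): two near-duplicate signal variables split the subsampling votes, so neither clears the threshold *π*_{thr} on its own, whereas the set containing both is highly stable:

```python
import random

random.seed(0)

# Toy setup: predictors 0 and 1 are near-duplicates of the signal,
# predictor 2 is noise. A selector that keeps only the single best
# predictor per subsample splits its votes between the duplicates.
B, pi_thr = 1000, 0.8
freq = [0, 0, 0]
for _ in range(B):
    # simulated per-subsample scores: duplicates are equally good up to noise
    scores = [1.0 + random.gauss(0, 0.05),
              1.0 + random.gauss(0, 0.05),
              0.1 + random.gauss(0, 0.05)]
    freq[max(range(3), key=lambda k: scores[k])] += 1
freq = [c / B for c in freq]

print(freq)                     # roughly [0.5, 0.5, 0.0]
stable = [k for k in range(3) if freq[k] >= pi_thr]
print(stable)                   # [] -- neither duplicate is individually stable
group_freq = freq[0] + freq[1]  # stability of the *set* {0, 1}
print(group_freq >= pi_thr)     # True -- the group clears the threshold
```

Each duplicate is selected in roughly half of the subsamples, so both fall below *π*_{thr}=0.8 individually, while the group {0, 1} is selected in essentially every subsample.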

*Numerical example of vitamin gene expression data*

Ahmed and Richardson analyse our gene expression data set with several competing methods and come to the conclusion that at most three genes should be selected. They raise the question of whether stability selection selects too many variables. However, as shown in the initial contribution to the discussion by Richardson, stability selection in fact also selects only three genes under reasonable type I error control. The methods seem to be in agreement here.

Yen and Yen mention that the number *q* of selected variables can grow very large for small regularization parameters *λ* and propose an interesting way to choose a suitable region for the regularization parameter. Yet, instead of restricting *λ* to larger values, a useful alternative in practice is to select only the first *q* variables that appear when lowering the regularization parameter. And *q* can be chosen *a priori* to yield non-trivial bounds in theorem 1.
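Concretely, theorem 1 bounds the expected number of falsely selected variables by *E*(*V*)≤*q*^{2}/{(2*π*_{thr}−1)*p*}, so an *a priori* *q* can be obtained by inverting the bound. A minimal sketch (the helper names are ours; *p*=4088 is the dimension of the vitamin gene expression data):

```python
import math

def ev_bound(q, p, pi_thr):
    """Theorem 1 bound on the expected number of false selections E(V)."""
    assert 0.5 < pi_thr < 1.0
    return q ** 2 / ((2 * pi_thr - 1) * p)

def q_for_target(ev_target, p, pi_thr):
    """Largest integer q whose theorem-1 bound stays below ev_target."""
    return int(math.floor(math.sqrt(ev_target * (2 * pi_thr - 1) * p)))

p, pi_thr = 4088, 0.9                # vitamin data dimension, threshold
q = q_for_target(1.0, p, pi_thr)
print(q, ev_bound(q, p, pi_thr))     # → 57 0.993..., i.e. E(V) <= 1
```

With *π*_{thr}=0.9 this reproduces the value *q*=57 used for the vitamin data set in the discussion, with familywise bound *E*(*V*)≤1.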

*Computational improvements*

Gandy and Fang both propose interesting extensions that help to alleviate the computational challenge of having to fit a model on many subsamples of the data. An interesting alternative to the procedure proposed by Gandy is given by the improved bounds suggested by Shah and Samworth.

Tong notes that stability selection makes an inherent assumption of independence between observations. We have not yet tried to apply the method to dependent data such as time series. The standard subsampling scheme will not be suitable in cases of dependence. A block-based approach with independent subsampling of blocks (and where dependence is captured within blocks, at least approximately) along the lines of Künsch (1989) might be an interesting alternative to explore in this context.
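A minimal sketch of such a block scheme (our own illustration, along the lines of Künsch (1989); the function name is ours): subsample whole contiguous blocks of indices instead of individual observations, so that short-range dependence is preserved within each retained block:

```python
import random

def block_subsample(n, block_len, frac=0.5, rng=random):
    """Draw a subsample of a length-n time series by sampling whole
    contiguous blocks, keeping short-range dependence within blocks."""
    blocks = [list(range(s, min(s + block_len, n)))
              for s in range(0, n, block_len)]
    chosen = rng.sample(blocks, k=max(1, int(frac * len(blocks))))
    return sorted(i for b in chosen for i in b)

random.seed(1)
idx = block_subsample(n=20, block_len=5, frac=0.5)
print(idx)  # indices from two of the four length-5 blocks, in sorted order
```

The selection procedure would then be run on the observations indexed by `idx` in each replication, with inclusion frequencies aggregated exactly as in the independent-observations case.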

*Connections to clustering and density cluster estimation*

Nugent, Rinaldo, Singh and Wasserman, and Hennig provide fascinating connections to related ideas in clustering and density cluster estimation. As described in the paper, the consensus clustering of Monti *et al.* (2003) is another interesting connection.

Richardson and Hothorn mention numerous related earlier references. We tried to point out many connections to previous work but have inevitably missed some important ones. It is perhaps worth emphasizing again the similarity, at a crude level, to the work of Bach (2008) on the bolasso, which was developed independently and simultaneously.

We thank all the contributors again for their many interesting and thoughtful comments, which have already opened up, and will continue to open up, new research in this area. We would like to convey special thanks to Rajen Shah and Richard Samworth, who spotted a mistake in the definition of the assumption ‘not worse than random guessing’ in an earlier version of the manuscript. Their improved bounds will also make stability selection less conservative and address John Shawe-Taylor's comment regarding the finite amount of random subsampling in practice *versus* our theoretical arguments corresponding to all possible subsamples. Finally, we thank the Royal Statistical Society and the journal for hosting this discussion.