## 1. Introduction

### 1.1. Background

Consider the problem of estimating a *p*-vector of parameters ** β** from the linear model

**y**=**X β**+** ɛ**, (1)

where **y**=(*Y*_{1},…,*Y*_{n})^{T} is an *n*-vector of responses, **X**=(**x**_{1},…,**x**_{n})^{T} is an *n*×*p* random design matrix with independent and identically distributed (IID) **x**_{1},…,**x**_{n}, ** β**=(*β*_{1},…,*β*_{p})^{T} is a *p*-vector of parameters and ** ɛ**=(*ɛ*_{1},…,*ɛ*_{n})^{T} is an *n*-vector of IID random errors. When dimension *p* is high, it is often assumed that only a small number of predictors among *X*_{1},…,*X*_{p} contribute to the response, which amounts to assuming ideally that the parameter vector ** β** is sparse. With sparsity, variable selection can improve the accuracy of estimation by effectively identifying the subset of important predictors, and can also enhance model interpretability with a parsimonious representation.

Sparsity arises frequently with high dimensional data, a growing feature of many areas of contemporary statistics. Such problems arise frequently in genomics, including gene expression and proteomics studies, biomedical imaging, functional magnetic resonance imaging, tomography, tumour classification, signal processing, image analysis and finance, where the number of variables or parameters *p* can be much larger than the sample size *n*. For instance, one may wish to classify tumours by using microarray gene expression or proteomics data, to associate protein concentrations with expression of genes, or to predict a clinical prognosis (e.g. injury scores or survival time) by using gene expression data. For problems of this kind, the dimensionality can be much larger than the sample size, which calls for new or extended statistical methodologies and theories. See, for example, Donoho (2000) and Fan and Li (2006) for overviews of the statistical challenges associated with high dimensionality.

Returning to the problem in model (1), it is challenging to find tens of important variables out of thousands of predictors, with the number of observations usually in the tens or hundreds. This is similar to finding a few needles in a huge haystack. A new idea in Candes and Tao (2007) is the notion of the uniform uncertainty principle on deterministic design matrices. They proposed the Dantzig selector, which is the solution to an *l*_{1}-regularization problem, and showed that, under the uniform uncertainty principle, this minimum *l*_{1}-estimator achieves the ideal risk, i.e. the risk of the oracle estimator with the true model known ahead of time, up to a logarithmic factor log (*p*). Appealing features of the Dantzig selector include that

- (a) it is easy to implement because the convex optimization that the Dantzig selector solves can easily be recast as a linear program and
- (b) it has the oracle property in the sense of Donoho and Johnstone (1994).
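Feature (a) can be made concrete with a minimal sketch, assuming `numpy` and `scipy` are available: writing ** β**=**u**−**v** with **u**,**v**≥**0** turns the *l*_{1}-objective and the *l*_{∞}-constraint into a standard linear program. The function name, the tuning constant `lam` and the toy data below are our own illustrative choices, not settings from Candes and Tao (2007).

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve min ||b||_1 subject to ||X^T (y - X b)||_inf <= lam,
    recast as a linear program via b = u - v with u >= 0, v >= 0."""
    n, p = X.shape
    G = X.T @ X                       # p x p Gram matrix
    xty = X.T @ y
    c = np.ones(2 * p)                # objective: sum(u) + sum(v) = ||b||_1
    # the two-sided constraint -lam <= G b - X^T y <= lam as one-sided blocks
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([lam + xty, lam - xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]

# toy data: 3 non-zero coefficients out of 20, low noise
rng = np.random.default_rng(0)
n, p, sigma = 50, 20, 0.1
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + sigma * rng.standard_normal(n)
bhat = dantzig_selector(X, y, lam=sigma * np.sqrt(2 * n * np.log(p)))
```

Since the columns of `X` have Euclidean norm near √*n*, the choice λ=σ√{2*n* log (*p*)} mirrors the σ√{2 log (*p*)} rule for unit-norm designs; on data like this the solver should recover the three large coefficients up to a small shrinkage.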

Despite this remarkable achievement, we still have four concerns when the Dantzig selector is applied to high or ultrahigh dimensional problems. First, a potential hurdle is the computational cost of large or huge scale problems, such as implementing linear programs in dimensions of tens or hundreds of thousands. Second, the factor log (*p*) can become large and may not be negligible when dimension *p* grows rapidly with sample size *n*. Third, as dimensionality grows, the uniform uncertainty principle condition may be difficult to satisfy, which will be illustrated later by using a simulated example. Finally, there is no guarantee that the Dantzig selector picks the right model, even though it has the oracle property. These four concerns inspire our work.

### 1.2. Dimensionality reduction

Dimension reduction or feature selection is an effective strategy for dealing with high dimensionality. With the dimensionality reduced from high to low, the computational burden can be reduced drastically, and accurate estimation can then be obtained by using some well-developed lower dimensional method. Motivated by this, along with the concerns above about the Dantzig selector, we have the following main goal in our paper: to reduce the dimensionality *p* from a large or huge scale to a relatively large scale *d* below the sample size *n* by a fast and efficient method.

We achieve this by introducing the concept of sure screening and proposing a sure screening method based on correlation learning, which filters out the features that have weak correlation with the response. Such correlation screening is called sure independence screening (SIS). Here and below, by sure screening we mean the property that all the important variables survive variable screening with probability tending to 1. This dramatically narrows down the search for important predictors. In particular, applying the Dantzig selector to the much smaller submodel addresses our first concern about the computational cost. In fact, this not only speeds up the Dantzig selector but also reduces the logarithmic factor in mimicking the ideal risk from log (*p*) to log (*d*), which is smaller than log (*n*) and hence addresses our second concern above.
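As a concrete illustration, correlation screening of this kind can be sketched in a few lines; the function name `sis`, the standardization details and the toy data are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def sis(X, y, d):
    """Sure independence screening sketch: rank predictors by absolute
    marginal correlation with the response and keep the top d."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    ys = (y - y.mean()) / y.std()
    omega = np.abs(Xs.T @ ys) / len(y)          # componentwise |correlations|
    return np.sort(np.argsort(-omega)[:d])      # indices of the d largest

# toy example: p = 1000 predictors, only the first 5 matter
rng = np.random.default_rng(1)
n, p = 100, 1000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 5.0
y = X @ beta + rng.standard_normal(n)
kept = sis(X, y, d=50)                          # submodel of size d = 50
```

A subsequent lower dimensional method (e.g. the Dantzig selector) would then be run only on the `d` retained columns.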

Oracle properties in a stronger sense, say, mimicking the oracle not only in selecting the right model but also in estimating the parameters efficiently, give a positive answer to our third and fourth concerns above. Theories on oracle properties in this sense have been developed in the literature. Fan and Li (2001) laid the groundwork for variable selection problems in the finite parameter setting. They discussed a family of variable selection methods that adopt a penalized likelihood approach, which includes well-established methods such as the Akaike information criterion and the Bayes information criterion, as well as more recent methods like the bridge regression in Frank and Friedman (1993), the lasso in Tibshirani (1996) and the smoothly clipped absolute deviation (SCAD) method in Fan (1997) and Antoniadis and Fan (2001), and they established oracle properties for non-concave penalized likelihood estimators. Later, Fan and Peng (2004) extended the results to the setting of *p*=*o*(*n*^{1/3}) and showed that the oracle properties continue to hold. An effective algorithm for optimizing penalized likelihood, the local quadratic approximation, was proposed in Fan and Li (2001) and well studied in Hunter and Li (2005). Zou (2006) introduced an adaptive lasso in a finite parameter setting and showed that the lasso does not have oracle properties, as conjectured in Fan and Li (2001), whereas the adaptive lasso does. Zou and Li (2008) proposed a local linear approximation algorithm that recasts the computation of non-concave penalized likelihood problems as a sequence of penalized *L*_{1}-likelihood problems. They also proposed and studied one-step sparse estimators for non-concave penalized likelihood models.

There is a huge literature on the problem of variable selection. To name a few contributions in addition to those mentioned above, Fan and Li (2002) studied variable selection for Cox's proportional hazards model and frailty model, Efron *et al.* (2004) proposed the least angle regression algorithm LARS, Hunter and Li (2005) proposed a new class of algorithms, minorization–maximization algorithms, for variable selection, Meinshausen and Bühlmann (2006) looked at the problem of variable selection with the lasso for high dimensional graphs and Zhao and Yu (2006) gave an almost necessary and sufficient condition for model selection consistency of the lasso. Meier *et al.* (2008) proposed a fast implementation of the group lasso. More recent studies include Huang *et al.* (2008), Paul *et al.* (2008), Zhang (2007) and Zhang and Huang (2008), which significantly advance the theory and methods of the penalized least squares (PLS) approaches. It is worth mentioning that in variable selection there is a weaker concept than consistency, called persistency, which was introduced by Greenshtein and Ritov (2004). The motivation for this concept lies in the fact that, in machine learning applications such as tumour classification, primary interest centres on the misclassification errors or more generally the expected losses, not the accuracy of the estimated parameters. Greenshtein and Ritov (2004) studied the persistence of lasso-type procedures in high dimensional linear predictor selection, and Greenshtein (2006) extended the results to more general loss functions. Meinshausen (2007) considered a case with finite non-sparsity and showed that, under quadratic loss, the lasso is persistent, but the rate of persistency is slower than that of a relaxed lasso.

### 1.3. Some insight into high dimensionality

To gain some insight into the challenges of high dimensionality in variable selection, let us look at a situation where all the predictors *X*_{1},…,*X*_{p} are standardized and the distribution of **z**=**Σ**^{−1/2}**x** is spherically symmetric, where **x**=(*X*_{1},…,*X*_{p})^{T} and **Σ**=cov(**x**). Clearly, the transformed predictor vector **z** has covariance matrix *I*_{p}. Our approach in this paper is to separate the effects of the covariance matrix **Σ** and the distribution of **z**, which gives a better understanding of the difficulties of high dimensionality in variable selection.
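The whitening step can be checked numerically with a small sketch; the equicorrelated **Σ** below is a hypothetical example of our own choosing.

```python
import numpy as np

# check that z = Sigma^{-1/2} x removes the correlation structure:
# hypothetical equicorrelated Sigma with correlation rho = 0.5
p, rho = 5, 0.5
Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)

# symmetric inverse square root via the eigendecomposition of Sigma
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(2)
x = rng.multivariate_normal(np.zeros(p), Sigma, size=200_000)
z = x @ Sigma_inv_half.T            # z = Sigma^{-1/2} x, row by row
print(np.round(np.cov(z, rowvar=False), 2))   # approximately I_p
```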

The real difficulty when dimension *p* is larger than sample size *n* comes from four facts. First, the design matrix **X** is rectangular, having more columns than rows; in this case, the matrix **X**^{T}**X** is huge and singular. The maximum spurious correlation between a covariate and the response can be large (see, for example, Fig. 1) because of the dimensionality and because an unimportant predictor can be highly correlated with the response owing to the presence of important predictors associated with it. These facts make variable selection difficult. Second, the population covariance matrix **Σ** may become ill conditioned as *n* grows, which adds difficulty to variable selection. Third, the minimum non-zero absolute coefficient |*β*_{i}| may decay with *n* and fall close to the noise level, say, of order { log (*p*)/*n*}^{1/2}. Fourth, the distribution of **z** may have heavy tails. Therefore, in general, it is challenging to estimate the sparse parameter vector ** β** accurately when *p*≫*n*.

When dimension *p* is large, intuition may not be accurate. This is exemplified by the data piling problems in high dimensional space that were observed in Hall *et al.* (2005). A challenge with high dimensionality is that important predictors can be highly correlated with some unimportant ones, and this correlation usually increases with dimensionality. The maximum spurious correlation also grows with dimensionality. We illustrate this by using a simple example. Suppose that the predictors *X*_{1},…,*X*_{p} are independent and follow the standard normal distribution. Then the design matrix is an *n*×*p* random matrix, each entry an independent realization from *N*(0,1). The maximum absolute sample correlation coefficient between predictors can be very large, which runs against intuition, as the predictors are independent. To show this, we simulated 500 data sets with *n*=60 and *p*=1000 or *p*=5000. Fig. 1 shows the distributions of the maximum absolute sample correlation between predictors. The multiple canonical correlation between two groups of predictors (e.g. 2 in one group and 3 in another) can be much larger still, as there are already *p*(*p*−1)(*p*−2)(*p*−3)(*p*−4)/12 choices of the two groups in our example. Hence, sure screening when *p* is large is very challenging.
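A scaled-down version of this simulation, with far fewer replications than the 500 used for Fig. 1 (purely for speed), conveys the same point.

```python
import numpy as np

# maximum absolute sample correlation between p independent N(0,1)
# predictors with n = 60 observations, over a handful of replications
rng = np.random.default_rng(3)
n, p, reps = 60, 1000, 20
max_cors = []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    R = np.corrcoef(X, rowvar=False)   # p x p sample correlation matrix
    np.fill_diagonal(R, 0.0)           # ignore the trivial diagonal
    max_cors.append(np.abs(R).max())
print(np.median(max_cors))  # typically well above 0.4 despite independence
```

Even though every pair of predictors is independent in the population, the largest of the roughly *p*^{2}/2 sample correlations is far from zero at this sample size.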

The paper is organized as follows. In the next section we propose a sure screening method, SIS, and discuss its rationale as well as its connection with other methods of dimensionality reduction. In Section 3 we review several known techniques for model selection in the reduced feature space and present two simulations and one real data example to study the performance of SIS-based model selection methods. In Section 4 we discuss some extensions of SIS and, in particular, an iterative SIS is proposed and illustrated by three simulated examples. Section 5 is devoted to the asymptotic analysis of SIS and an iteratively thresholded ridge regression screener as well as two SIS-based model selection methods. Some concluding remarks are given in Section 6. Technical details are provided in Appendix A.