Penalized Euclidean Distance Regression

A new method is proposed for variable screening, variable selection and prediction in linear regression problems where the number of predictors can be much larger than the number of observations. The method involves minimizing a penalized Euclidean distance, where the penalty is the geometric mean of the $\ell_1$ and $\ell_2$ norms of the regression coefficients. This particular formulation exhibits a grouping effect, which is useful for screening out predictors in high- or ultra-high-dimensional problems. Also, an important result is a signal recovery theorem, which does not require an estimate of the noise standard deviation. The practical performance of variable selection and prediction is evaluated through simulation studies and the analysis of a dataset of mass spectrometry scans from melanoma patients, where excellent predictive performance is obtained.

1 Introduction
In our method we use the Euclidean distance as the objective function plus a new norm penalty based on the geometric mean of the $\ell_1$ and $\ell_2$ norms of the regression parameters. The combination of Euclidean loss and geometric mean penalty is the main contribution of the paper. The advantage of our approach is that we are able to provide a pivotal recovery property and, in addition, gain the grouping property of the elastic net (where the regression coefficients of a group of highly correlated variables are very similar). The resulting penalized Euclidean distance (PED) method is shown to work well in a variety of settings. A particularly strong feature is that it works well with correlated designs, weak signal and strong noise.

2 Penalized Euclidean distance
We assume that the data are organized as an $n \times p$ design matrix $X$ and an $n$-dimensional response vector $Y$, where $n$ is the number of observations and $p$ is the number of variables. The columns of the matrix $X$ are denoted by $X_j$, that is, $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$, $j = 1, \ldots, p$, and the regression parameters are $\beta = (\beta_1, \ldots, \beta_p)^T$. We assume that the vector of outcomes $Y$ is modelled linearly as $Y = X\beta^* + \epsilon$, where $\beta^*$ is the true parameter vector of dimension $p$, the expectation of $\epsilon$ is zero and its covariance matrix is the identity matrix. Thus, we assume that the expectation of the response $Y = (y_1, \ldots, y_n)^T$ depends only on a few variables, and so $X\beta^* = \tilde X \tilde\beta$, where the columns of the matrix $\tilde X$ are a subset of the set of columns of the entire design matrix $X$; thus $\tilde X$ is associated with a subset of indices $J^* \subset \{1, 2, \ldots, p\}$, and $\tilde\beta$ is the subvector of $\beta^*$ with the zero elements removed, whose dimension is equal to the cardinality of $J^*$. In general, if we try to minimize $\|Y - \tilde X \tilde\beta\|$ over choices of $J^*$ and vectors $\tilde\beta$, the optimal choice of $J^*$ may not be unique, because an under-determined system could have solutions with different sparsity patterns, even if the degree of the optimal sparsity (model size) is the same. However, in the signal reconstruction problem that we consider, where a penalty on the parameters is introduced, we will show that under some assumptions we can approximate $\beta^*$ in probability. The cardinality of $J^*$ (denoted by $|J^*|$) is assumed to be less than the number of observations, and when $p$ is greater than $|J^*|$, a real challenge is to detect the set of irrelevant columns, namely the variables that correspond to the positions of the null components of $\beta^*$ and are thus not needed for efficiently controlling the outcomes $Y$.
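To fix ideas, the following short R sketch generates data from this model with an illustrative sparse $\beta^*$; the dimensions, sparsity level and noise level are arbitrary illustrative choices, not values used in the paper.

```r
## Illustrative data generation for the sparse linear model Y = X beta* + noise
set.seed(1)
n <- 50; p <- 200            # p much larger than n (illustrative values)
pstar <- 5                   # |J*|: number of truly relevant variables
X <- matrix(rnorm(n * p), n, p)
beta_star <- numeric(p)
beta_star[1:pstar] <- 2      # non-zero coefficients in the first pstar positions
sigma <- 1
Y <- drop(X %*% beta_star) + sigma * rnorm(n)
```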
Our method involves minimizing the Euclidean distance (as a loss function, essentially equivalent to the empirical norm used in the square root Lasso) between $Y$ and $X\beta$, with a penalty based on the geometric mean of the $\ell_1$ and $\ell_2$ norms. In particular, we minimize
$$L_{PED}(\lambda, \beta) = \|Y - X\beta\| + \lambda\,(\|\beta\|\,\|\beta\|_1)^{1/2}, \qquad (1)$$
where $\lambda$ is the scalar regularization parameter, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is a vector in $\mathbb{R}^p$ (to be optimized over), $\|\beta\|^2 = \sum_{j=1}^p \beta_j^2$ is the squared $\ell_2$ norm and $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the $\ell_1$ norm. The PED estimator $\hat\beta$ is defined as the minimizer of the objective function (1), that is, $\hat\beta = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p)^T$ and
$$\hat\beta(\lambda) = \arg\min_{\beta}\, L_{PED}(\lambda, \beta). \qquad (2)$$
The penalty is proportional to the geometric mean of the $\ell_1$ and $\ell_2$ norms and has only one control parameter, $\lambda$.
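As a concrete reading of (1), a minimal R implementation of the objective could look as follows; the function name `ped_objective` is a hypothetical helper, not code from the paper.

```r
## PED objective (1): Euclidean loss plus lambda times the geometric mean
## of the l1 and l2 norms of beta
ped_objective <- function(beta, X, Y, lambda) {
  loss <- sqrt(sum((Y - X %*% beta)^2))                 # ||Y - X beta||
  penalty <- sqrt(sum(abs(beta)) * sqrt(sum(beta^2)))   # (||beta||_1 ||beta||_2)^(1/2)
  loss + lambda * penalty
}
```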
An alternative, well-established method that involves a convex combination of $\ell_1$ and squared $\ell_2$ penalties is the elastic net (Zou & Hastie, 2005). The Lasso (Tibshirani, 1996) is a special case of this penalty, and so the elastic net combines the two methods. Our method also inherits important properties of the Lasso and ridge regression, but in a radically different way. The square root Lasso (Belloni et al., 2011) involves minimizing
$$L_{SQL}(\lambda, \beta) = \frac{1}{n}\|Y - X\beta\| + \frac{\lambda}{n}\|\beta\|_1.$$
With the application of a location transformation, both the design matrix $X$ and the response vector $Y$ can be centred; here, we say that a vector is centred if the mean of its values is zero. Considering a scaling transformation, each covariate $X_j$ can be regarded as a point on the unit hypersphere $S^{n-1}$ with a centring constraint. Throughout the paper, we assume that $Y$ has been centred and that the columns of $X$ have been standardized as described above.
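A sketch of this standardization in R, continuing the illustrative data above: `Y` is centred, and each column of `X` is centred and rescaled to unit Euclidean norm, so that it lies on $S^{n-1}$.

```r
## Centre Y; centre each column of X and scale it to unit Euclidean norm
Y_c <- Y - mean(Y)
X_c <- scale(X, center = TRUE, scale = FALSE)   # centre the columns
col_norms <- sqrt(colSums(X_c^2))
X_std <- sweep(X_c, 2, col_norms, "/")          # each column now has ||X_j|| = 1
```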
3 Theoretical results

General properties
The PED objective function enables variable selection under some mild compatibility conditions. The concept is based on the simple fact that the sum of the squares of the relative sizes of vector components (as defined by $\beta_j/\|\beta\|$) is always equal to 1. For any vector in $\mathbb{R}^p$, if there are components that have relative sizes larger than $1/\sqrt{p}$, then the other components must have relative sizes falling under this value. In addition, if many components have similar relative sizes due to a grouping effect, then the relative size of those components must be small. The new penalty function that we consider is actually a norm.
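To spell out the counting argument behind this (an elementary observation, not a display from the paper): since the squared relative sizes sum to one,
$$\sum_{j=1}^{p} \left(\frac{\beta_j}{\|\beta\|}\right)^2 = 1
\qquad\Longrightarrow\qquad
\#\left\{ j : \frac{|\beta_j|}{\|\beta\|} > t \right\} < \frac{1}{t^2} \quad \text{for any } t > 0,$$
so, in particular, not all $p$ components can have relative size exceeding $1/\sqrt{p}$.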

Lemma 1
The geometric mean of the $\ell_1$ and $\ell_2$ norms is also a norm on finite-dimensional Euclidean vector spaces.
The following theorem demonstrates the grouping effect achieved by a minimizer of the PED objective function. The idea of a grouping effect was first introduced by Zou & Hastie (2005). Our version of the grouping effect involves the relative contributions of the components of the minimizer of the PED objective function. This property enables irrelevant variables to be eliminated from the model, and in the situation $p > n$ the selection and grouping of variables is an important priority. Theorem 1 supports the idea of obtaining groups of highly correlated variables, based on the relative sizes of the corresponding component minimizers of the PED objective function.

Theorem 1
Assume we have a standardized data matrix $X$ and a centred response vector $Y$. Let $\hat\beta$ be the PED estimate given by $\hat\beta(\lambda) = \arg\min_{\beta}\{L_{PED}(\lambda, \beta)\}$ for some $\lambda > 0$. Let $\rho_{ij} = (X_i)^T(X_j)$ denote the sample correlation and let $\theta_{ij}$ be the angle between $X_i$ and $X_j$, with $0 \le \theta_{ij} \le \pi/2$; the theorem bounds the difference between the relative sizes of the components $\hat\beta_i$ and $\hat\beta_j$ in terms of $\theta_{ij}$. Note that this result is analogous to Theorem 1 of Zou & Hastie (2005) for the elastic net, and the same method of proof is used in the Appendix. From Theorem 1, if $\theta_{ij}$ is small, then the corresponding parameters estimated from PED regression will be similar, which is the grouping effect. When $\theta_{ij} = 0$, we have the following corollary.
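Because the columns are centred and scaled to unit length, the sample correlation and the angle are linked by $\rho_{ij} = \cos\theta_{ij}$; a two-line R check of this relationship, using the standardized matrix `X_std` from the earlier sketch, is given below (restricting the angle to $[0, \pi/2]$ via the absolute correlation is one possible convention, assumed here).

```r
## For centred, unit-norm columns the inner product is the sample correlation
rho_12   <- sum(X_std[, 1] * X_std[, 2])   # rho_12 = cosine of the angle between X_1, X_2
theta_12 <- acos(min(1, abs(rho_12)))      # angle restricted to [0, pi/2]
```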
The grouping effect means that if we have strong overcrowding on the unit hypersphere around an irrelevant column, then this would be detected by a large drop in the relative size of the corresponding components of the solution to our objective function.
We consider the case when the number of variables far exceeds the number of true covariates. The cardinality of the set $S$ is then infinite, and the challenge is to find a sparse solution in it. The starting point of our analysis will be a solution of the PED problem defined by (2). As before, we let $\hat\theta_j$ represent the angle between the vectors $X_j$ and $Y - X\hat\beta$. We note that the sparsity measure considered here takes its highest value 1 when there is a single non-zero element in $\beta$ (very sparse) and its smallest value when all elements of $\beta$ are equal and non-zero (very non-sparse). We assume that $0 \in \mathbb{R}^p$ is not a minimizer of $\|Y - X\beta\|$.

Result 2
If $\hat\beta$ is the solution of (2) and its $j$th component is non-zero, then ...

The following result helps demonstrate the existence of a minimizing sequence whose terms have the grouping effect property for the relative size of their components.

Lemma 3
If $\hat\beta$ is the solution of (2), then $\hat\beta_j$ ...

Model consistency
In this section, we demonstrate that our method is also able to recover sparse signals without (pre-)estimates of the noise standard deviation or any knowledge about the signal. In Belloni et al. (2011), this property is referred to as pivotal recovery. An important aspect is that an oracle theorem also brings a solid theoretical justification for the choice of the parameter $\lambda$.
We assume that $Y = X\beta^* + \sigma\epsilon$, where $\beta^*$ is the unknown true parameter value for $\beta$, $\sigma$ is the standard deviation of the noise and the $\epsilon_i$, $i = 1, \ldots, n$, are independent and identically distributed with a normal law $\Phi_0$ with $E_{\Phi_0}(\epsilon_i) = 0$ and $E_{\Phi_0}(\epsilon_i^2) = 1$. Let $J^* = \operatorname{supp}(\beta^*)$. For any candidate solution $\hat\beta$, we can use the notation $L$ for the plain Euclidean distance, $L(\hat\beta) = \|Y - X\hat\beta\|$, and the newly introduced norm is denoted by $\|\beta\|_{(1,2)}$, that is, $\|\beta\|_{(1,2)} = (\|\beta\|_1 \|\beta\|)^{1/2}$.

The idea behind the following considerations is the possibility of estimating the quotient $\|X^T\epsilon\|_\infty/\|\epsilon\|$ in probability, following Belloni et al. (2011). We can use the same general result to show that the method we propose is also capable of producing pivotal recoveries of sparse signals.
Before stating the main theorem, we introduce some more notation and definitions. The solution of the PED objective function is denoted by $\hat\beta(\lambda)$. Let $\|u\|_X$ denote $\|Xu\|$, let $p^*$ be the cardinality of $J^*$, $M = \|\beta^*\|$, $S = \|X^T\epsilon\|_\infty/\|\epsilon\|$ and $c > 1$, and, for brevity, write $\bar c = (c+1)/(c-1)$. Also, we write $u^*$ for the vector of components of $u$ that correspond to the non-zero elements of $\beta^*$, that is, with indices in $J^*$, and we write $u^{*c}$ for the vector of components of $u$ that correspond to the zero elements of $\beta^*$, that is, with indices in the complement of $J^*$. We shall initially focus on the case $n^2 > p$. Consider the two compatibility conditions requiring that the quantities $k_{\bar c}$ and $\bar k_{\bar c}$ (defined through $\|u\|_X$) are bounded away from 0. We remark that if the first compatibility condition holds, there is a relatively simple scenario in which the second condition holds as well: if $\bar k_{\bar c}$ is bounded away from 0, then $\|u\|_X$ must be bounded below accordingly, and we assume $M/\sqrt{p^*}$ is bounded. Thus, the second compatibility condition can easily be achieved in the case when $p = n^{1+\alpha_1}$ and $p^* = n^{\alpha_2}$ with $\alpha_1, \alpha_2 > 0$ and $\alpha_1 + \alpha_2 \le 1$. We also present a result with certain compatibility conditions for the case when $p > n^2$ in Vasiliu et al. (2017).
We refer to $k_{\bar c}$ and $\bar k_{\bar c}$ as restricted eigenvalues. The concept of restricted eigenvalues was introduced by Bickel et al. (2009) with respect to the $\ell_1$ penalty function. Our definition and usage are adapted to our own objective function. As stated before, our oracle theorem is based on the estimation of $\|X^T\epsilon\|_\infty/\|\epsilon\|$. Directly following from Lemma 1 of Belloni et al. (2011), we have the following lemma.
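For orientation only, the restricted eigenvalue of Bickel et al. (2009), stated for the $\ell_1$ penalty, can be written as
$$\kappa(s, c_0) \;=\; \min_{J \subseteq \{1,\ldots,p\},\ |J| \le s}\ \ \min_{u \ne 0,\ \|u_{J^c}\|_1 \le c_0 \|u_J\|_1}\ \frac{\|Xu\|}{\sqrt{n}\,\|u_J\|};$$
the quantities $k_{\bar c}$ and $\bar k_{\bar c}$ used here adapt this idea to the PED objective and the norm $\|\cdot\|_{(1,2)}$, so this display is an analogy rather than the definition used in this paper.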

Lemma 4
Given $0 < \alpha < 1$ and $c > 1$, the choice $\lambda = \ldots$

Now we are ready to state the main result.
We can use the value of $\lambda$ in Lemma 4 for practical implementation, in order to ensure that the lower bound relating $\lambda$, $c$, $\sqrt{p^*}$ and $S$ holds with probability $1 - \alpha$. Note that the rate of convergence is asymptotically the same as the rates seen in other sparse regression problems (e.g. Negahban et al., 2012), although, as for the square root Lasso of Belloni et al. (2011), knowledge of $\sigma$ is not needed. Also, there are some circumstances when we can consider other values of $\lambda$.
If, at the same time, we assume that $\lambda$ satisfies a complementary upper bound involving $\sqrt{p^*}$, the restricted eigenvalue and $\sqrt{n}$, for some constant in $(0, 1)$, then we also have an oracle property. We use the corollary to suggest a method for choosing the model parameters by maximizing the sparsity measure introduced above, and this would encourage the selection of sparse models.
For a practical implementation of our method, we make use of the proven theoretical results. From the signal recovery theorem and corollary, we obtain that $\|\hat\beta(\lambda) - \beta^*\| = O\big(\sqrt{p^*\log(2p/\alpha)}/\sqrt{n}\big)$ with probability $1 - \alpha$. Thus, if $j$ is an index where there is no signal, that is, $\beta^*_j = 0$, then, from the previous equation, we have that
$$|\hat\beta_j(\lambda)| < \|\hat\beta(\lambda) - \beta^*\| \le \mathrm{const}\,\sqrt{p^*\log(2p/\alpha)}/\sqrt{n}.$$
If $\|\hat\beta(\lambda)\| \ne 0$, we can divide by $\|\hat\beta(\lambda)\|$ and obtain
$$\frac{|\hat\beta_j(\lambda)|}{\|\hat\beta(\lambda)\|} < \mathrm{const}\,\frac{\sqrt{p^*\log(2p/\alpha)}}{\sqrt{n}\,\|\hat\beta(\lambda)\|}. \qquad (5)$$
We will use (5) to inform a threshold choice as part of the PED numerical implementation in the next section. As well as dependence on $n$, we also investigate the effect of $p$ on the relative size of the components.
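As an illustration of how (5) can be turned into a screening rule, a minimal R sketch is given below; the function name `ped_threshold` is hypothetical, and `const` and the value supplied for `pstar` stand in for the unspecified constant and the unknown sparsity level in (5).

```r
## Keep only components whose relative contribution |beta_j| / ||beta||
## exceeds a threshold of the form suggested by (5); others are set to zero
ped_threshold <- function(beta_hat, p, pstar, n, alpha = 0.05, const = 1) {
  beta_norm <- sqrt(sum(beta_hat^2))
  thresh <- const * sqrt(pstar * log(2 * p / alpha)) / (sqrt(n) * beta_norm)
  rel <- abs(beta_hat) / beta_norm
  ifelse(rel > thresh, beta_hat, 0)
}
```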

Theoretical comparisons
In Section 3.1, we presented general properties for the PED estimator without requiring any special assumptions. The elastic net also shares the grouping effect without any special assumptions, but the Lasso and square root Lasso do not. The other results in Section 3.1 are particular to our estimator. In Section 3.2, under special assumptions, the Lasso, square root Lasso and PED estimators have near-oracle rates of the same order, namely $O(\sqrt{p^*\log(p)/n})$. The main difference is that, in the case of the Lasso, a pre-estimate of the noise standard deviation $\sigma$ is needed, as well as the different restricted eigenvalue assumptions given by Bickel et al. (2009). For the square root Lasso and PED, the near-oracle rate can be achieved without a pre-estimate of the noise, but with PED we also guarantee the grouping effect, which mitigates the effect of multiple correlations. The simulations below show numerical evidence that our rate could actually be better than the one used as a theoretical benchmark for comparison with the Lasso and square root Lasso.

Numerical implementation
The objective function $L_{PED}(\lambda, \beta) = \|Y - X\beta\| + \lambda(\|\beta\|\,\|\beta\|_1)^{1/2}$ is convex for any choice of $\lambda$ and also differentiable on all open orthants in $\mathbb{R}^p$ bounded away from the hyperplane $Y - X\beta = 0$. In order to find good approximations to minimizers of our objective function, as in many cases of non-linear large-scale convex optimization problems, a quasi-Newton method may be preferred because it is known to be considerably faster than methods such as coordinate descent, achieving super-linear convergence rates. Another important advantage is that second derivatives are not necessarily required. For testing purposes, we present a numerical implementation based on the well-performing quasi-Newton methods for convex optimization known as Broyden-Fletcher-Goldfarb-Shanno (BFGS) methods: limited-memory BFGS (L-BFGS) (Nocedal, 1980) and BFGS (Bonnans et al., 2006). We also tested a version of non-smooth BFGS called the hybrid algorithm for non-smooth optimization (Lewis & Overton, 2008) and obtained very similar results.
The idea for the estimation is to use theoretically informed parameters, based on Theorem 2, the corollaries and (5), in order to choose a suitable value of $\lambda$ and give a sparse estimate of $\beta$ after thresholding. We choose a value of $\lambda$ in an interval with endpoints defined through $\Phi^{-1}(\cdot)$. We retain the components of the solution that have higher relative contributions, that is, those exceeding a threshold governed by a tuning constant $\delta$, which could be selected by some information criterion such as the Akaike information criterion (AIC) or by $n$-fold cross-validation, or we could fix $\delta$, for example $\delta = 0.75$. The steps for the numerical approximation of $\beta^*$ by the PED method are as follows (a schematic R sketch is given after the list):
1. Use a quasi-Newton algorithm (e.g. L-BFGS) to minimize the convex objective function (1) for each candidate value of $\lambda$, applying the threshold informed by (5).

2. For the solution $\hat\beta$ that maximizes the sparsity measure, retain the components with relative contributions above the threshold and set the remaining components to zero.
3. Eliminate the columns of the design matrix corresponding to the zero coefficients $\hat\beta_j$, leaving a reduced design matrix.
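A schematic R sketch of these steps, assuming the hypothetical helpers `ped_objective` and `ped_threshold` from the earlier sketches and the standardized data `X_std` and `Y_c`; the grid of $\lambda$ values, the starting point and the `pstar` value are placeholder choices, and `optim` with method "L-BFGS-B" plays the role of the quasi-Newton solver.

```r
## Steps 1-3 of the PED procedure (schematic)
lambdas <- c(0.25, 0.5, 1)                        # placeholder grid of lambda values
fits <- lapply(lambdas, function(lam) {
  optim(par = as.vector(crossprod(X_std, Y_c)),   # arbitrary non-degenerate start
        fn = ped_objective, X = X_std, Y = Y_c, lambda = lam,
        method = "L-BFGS-B", control = list(maxit = 1000))$par
})
## Step 2: in practice, pick the solution maximizing the sparsity criterion;
## here the middle lambda is taken purely for illustration
beta_hat    <- fits[[2]]
beta_sparse <- ped_threshold(beta_hat, p = ncol(X_std), pstar = 5,
                             n = nrow(X_std), alpha = 0.05)
## Step 3: eliminate the columns corresponding to zero coefficients
X_reduced <- X_std[, beta_sparse != 0, drop = FALSE]
```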

Simulation study
We consider a simulation study to illustrate the performance of our method and the grouping effect in the case when $p \gg n$. In this example, we compare the results with the square root Lasso method (Belloni et al., 2011), which uses a scaled Euclidean distance as a loss function plus an $\ell_1$ penalty term, using the asymptotic choice of $\lambda$. We also compare the results with both the Lasso and elastic net methods as they are implemented in the publicly available packages for R, again using the default options. In particular, we used 10-fold cross-validation to choose the roughness penalty for the Lasso and elastic net using the command cv.glmnet in the R package glmnet, and we use the command slim in the R package flare with the penalty term set equal to 1. The results are summarized in Tables I and II, and the reported values are based on averaging over 100 datasets. The distance between the true signal and the solution produced is also recorded using root mean square error. Under highly correlated designs, the method we propose performs very efficiently against the "curse of dimensionality" and overcrowding. PED with the default parameters (Table I) has performed very well, obtaining the highest rate of true positives in many examples. Also, we compare it with PED when the parameters are selected by AIC, given by PED(AIC) in Table I, which also performs well. The elastic net gives the next best performance, and the Lasso (for the strongly correlated case) and square root Lasso have low rates of true positives. Note that the square root Lasso has performed rather differently here from the others: it is the only method using an asymptotic value of $\lambda$, and $n$ may not be large enough here. PED has a lower model size compared with the elastic net. Finally, the root mean square error is generally best for PED, particularly for the more highly correlated situations.
Table II. Simulation results based on example 1 and the numerical implementation described in the previous section.
Table II summarizes the true positive rates for fixed model sizes, where the largest $k_{MS}$ coefficients in absolute value are retained for model size $k_{MS}$. Again, PED has performed very well, retaining the highest number of true positives in nearly all cases (42 out of 45), with the elastic net being best in three cases. Overall, PED has performed extremely well in these simulations.
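For reference, the comparison fits can be obtained with calls roughly like the following; the elastic net mixing value `alpha = 0.5` and the `slim` settings shown are illustrative assumptions rather than the exact options used in the study.

```r
library(glmnet)   # Lasso and elastic net
library(flare)    # square root Lasso via slim()

## Lasso and elastic net, with 10-fold cross-validation for the penalty parameter
cv_lasso <- cv.glmnet(X_std, Y_c, alpha = 1,   nfolds = 10)
cv_enet  <- cv.glmnet(X_std, Y_c, alpha = 0.5, nfolds = 10)  # mixing value is a guess

## Square root Lasso (Euclidean loss plus l1 penalty) from the flare package
fit_sqrt <- slim(X_std, Y_c, method = "lq", q = 2, verbose = FALSE)
```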

Real data applications
For prediction performance comparison, we considered the datasets Air (Chambers et al., 1983), Servo (Quinlan, 1993; Lichman, 2013), Tecator (Borggaard & Thodberg, 1992), Housing (Harrison & Rubinfeld, 1978), Ozone (Breiman & Friedman, 1985) and Iowa (Draper & Smith, 1998). It is important to note that throughout the simulations and the real data analyses, both the Lasso and elastic net were run with double cross-validation for selecting the model size and the tuning parameter. We used PED with default values, and we also ran PED with a single 10-fold cross-validation for tuning $\delta$, which affects only the model size; the results are reported in Table III. In some cases, such as the Tecator dataset, the prediction error improved further when the final choice of $\lambda$ was $\Phi_0^{-1}(1 - \ldots)$.
Melanoma: In this application, we implement PED as a variable selection tool where the response variable is a binary class label. We consider an application of the method to a proteomics dataset from the study of melanoma (skin cancer). The mass spectrometry dataset was described by Mian et al. (2005) and further analysed by Browne et al. (2010). The data consist of mass spectrometry scans from serum samples of 205 patients, with 101 patients with stage I melanoma (least severe) and 104 patients with stage IV melanoma (most severe). Each mass spectrometry scan consists of an intensity for 13,951 mass over charge ($m/z$) values between 2,000 and 30,000 Da. It is of interest to find which $m/z$ values could be associated with the stage of the disease, which could point to potential proteins for use as biomarkers. We first fit a set of 500 important peaks to the overall mean of the scans using the deterministic peak-finding algorithm of Browne et al. (2010) to obtain 500 $m/z$ values at peak locations. We consider the disease stage to be the response, with $Y = -1$ for stage I and $Y = 1$ for stage IV. Note that we have an ordered response here, as stage IV is much more severe than stage I, and it is reasonable to treat the problem as a regression problem.
We fit the PED regression model versus the intensities at the 500 peak locations, so we have $n = 205$ and $p = 500$. The data are available at http://www.maths.nottingham.ac.uk/~ild/mass-spec. Here, we use $\alpha = 0.05$. The parameter values chosen to maximize the sparsity measure are $\lambda = 0.5$ and $\delta(p) = 0.75$, selecting 96 non-zero $m/z$ values. Browne et al. (2010) also considered a mixed-effects Gaussian mixture model and a two-stage t-test for detecting significant peaks. If we restrict ourselves to the coefficients corresponding to the 50 largest peaks in height, Browne et al. (2010) identified 17 as non-zero, as did PED, with eight out of the 17 in common. If we apply PED(AIC), then seven peaks are chosen out of the largest 50, of which only two are in common with Browne et al. (2010). The elastic net chose six peaks, with five of those in common with Browne et al. (2010), and for the Lasso, five peaks were chosen from the top 50 largest, with four in common with Browne et al. (2010). Note that here PED has selected the most peaks in common with Browne et al. (2010), and it is reassuring that the different methods have selected some common peaks.

Appendix: Proof of Lemma 1
We notice that $C \equiv \{\beta \in \mathbb{R}^p : \ell_1(\beta)\,\ell_2(\beta) \le 1\}$, and therefore $C$ is a bounded, closed and convex subset of $\mathbb{R}^p$ that contains the origin. Let $g(\beta) = \ell_1(\beta)\,\ell_2(\beta)$, and let $\mathrm{Epi}(g)$ denote its epigraph, that is, $\mathrm{Epi}(g) = \{(\beta, t) \in \mathbb{R}^{p+1} : g(\beta) \le t\}$. The set $C$ is convex and orthant symmetric. Indeed, the Hessian of $g$ is positive semi-definite on each orthant of $\mathbb{R}^p$ because, after differentiating $g$ twice with the product rule, it can be written as a sum of three matrices, which can be argued, by applying Sylvester's theorem, to be positive semi-definite. We see that in our case $\mathrm{Epi}(g) = \{t(C, 1) : t \in [0, +\infty)\}$, and therefore $\mathrm{Epi}(g)$ is a convex cone in $\mathbb{R}^{p+1}$ because $C$ is a convex set in $\mathbb{R}^p$. This shows that $\sqrt{\ell_1 \ell_2}$ is a convex function. Because $\sqrt{\ell_1 \ell_2}$ is convex and homogeneous of degree 1, it follows that it must also satisfy the triangle inequality. Therefore, $\sqrt{\ell_1 \ell_2}$ is a norm on $\mathbb{R}^p$.