Ridge Regression in Prediction Problems: Automatic Choice of the Ridge Parameter

To date, numerous genetic variants have been identified as associated with diverse phenotypic traits. However, identified associations generally explain only a small proportion of trait heritability, and the predictive power of models incorporating only known associated variants has been small. Multiple regression is a popular framework in which to consider the joint effect of many genetic variants simultaneously. Ordinary multiple regression is seldom appropriate in the context of genetic data, due to the high dimensionality of the data and the correlation structure among the predictors. There has been a resurgence of interest in the use of penalised regression techniques to circumvent these difficulties. In this paper, we focus on ridge regression, a penalised regression approach that has been shown to offer good performance in multivariate prediction problems. One challenge in the application of ridge regression is the choice of the ridge parameter that controls the amount of shrinkage of the regression coefficients. We present a method to determine the ridge parameter from the data, with the aim of good performance in high-dimensional prediction problems. We establish a theoretical justification for our approach, and demonstrate its performance on simulated genetic data and on a real data example. Fitting a ridge regression model to hundreds of thousands to millions of genetic variants simultaneously presents computational challenges. We have developed an R package, ridge, which addresses these issues. The package implements the automatic choice of ridge parameter presented in this paper, and is freely available from CRAN.

Table 1: Four simulation scenarios used in the evaluation of the bias-variance decomposition. The simulation scenarios are taken from Zou & Hastie (2005). The correlated predictors are generated as:

x_j = Z_1 + ε_j, Z_1 ~ N(0, 1), j = 1, ..., 5
x_j = Z_2 + ε_j, Z_2 ~ N(0, 1), j = 6, ..., 10
x_j = Z_3 + ε_j, Z_3 ~ N(0, 1), j = 11, ..., 15
x_j ~ N(0, 1), j = 16, ..., 40

n, number of observations; p, number of predictors; β, vector of coefficients; X, matrix of predictors. HL, HyperLasso regression; RR-k_r*, ridge regression with the shrinkage parameter k_r*.

Figure 1: Comparing the mean squared error of ridge regression estimates obtained using the shrinkage parameter k_r to those obtained using the shrinkage parameter k_HKB. Plotted is the proportion of replicates in which using k_r results in smaller mean squared error than the estimates fitted using k_HKB (equivalent to k_r with r = p).

Figure 2: Receiver operating characteristic (ROC) curve plotting true positive rate (TPR) against false positive rate (FPR) as the probability threshold for classification as a case is varied. Area under the curve (AUC) statistics are given in the legend in parentheses: RR-k_r (0.56); univariate p ≤ 10^-5 (0.51); p ≤ 10^-7 (0.49); p ≤ 10^-10 (0.51). HLasso, HyperLasso regression; RR-k_r*, ridge regression with shrinkage parameter k_r*; univariate significance thresholds are for inclusion of individual SNPs in a multiple regression model. Regression models were fitted on WTCCC-BD data and evaluated on GAIN-BD data; the evaluations on GAIN-BD data are plotted here. For details of the study data and the methods used to fit the regression models, see the main text.

ROC curve: risk prediction in bipolar disorder data
Supplementary Appendix A: The number of principal components to include in k_r

We address whether, when computing k_r, it is more useful to include all non-zero PCs (of which there are at most min(n, p)), or fewer than all the non-zero PCs. To this end we reanalyse the data analysed by Hoerl et al. (1975), extending their results to compare the shrinkage parameter k_r to k_HKB. The data are a ten-factor dataset consisting of 36 observations, relating to the operation of a petroleum refining unit. These data were first discussed by Gorman & Toman (1966) and are described in Daniel et al. (1999).

Following the approach taken by Hoerl et al. (1975), we use the ten-factor dataset as a design matrix in a simulation study. In each replicate, a vector of regression coefficients with a specified squared length is simulated. As in Hoerl et al. (1975), we find that, subject to normalisation, our results are not sensitive to this value. Response variables are simulated at a range of signal-to-noise ratios, where the signal is the squared length of the coefficients used to generate the data and the noise is the error variance of the responses, σ². For each signal-to-noise ratio, 1000 replicates are simulated, and results are reported as averages over these.

Hoerl et al. (1975) tabulate the mean squared error under both the linear and ridge models, and report the percentage of replicates in which linear regression gives rise to estimates β̂ with smaller mean squared error than the ridge estimates β̂_kHKB, with k_HKB defined as in Equation (??). Following this approach, in Supplementary Figure 1 we plot the percentage of replicates in which k_r results in ridge estimates β̂_kr with smaller mean squared error than the estimates obtained using the shrinkage parameter k_HKB. From this figure we see that, when the signal-to-noise ratio is not too low, estimates with smaller mean squared error are obtained using k_r with r < p than using k_HKB.
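The comparison above can be sketched in code. The following is an illustrative Python sketch, not the original analysis: it uses a simulated design matrix in place of the ten-factor refinery data, and computes k_HKB from its standard definition, k_HKB = p σ̂² / ‖β̂_OLS‖².

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design standing in for the 10-factor, 36-observation
# refinery dataset (the original data are not reproduced here).
n, p = 36, 10
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# Ordinary least squares fit and residual variance estimate.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols
sigma2 = resid @ resid / (n - p)

# Hoerl-Kennard-Baldwin shrinkage parameter: k_HKB = p * sigma^2 / ||beta_ols||^2.
k_hkb = p * sigma2 / (beta_ols @ beta_ols)

# Ridge estimate at k = k_HKB.
beta_ridge = np.linalg.solve(X.T @ X + k_hkb * np.eye(p), X.T @ y)

print(k_hkb > 0)  # the HKB parameter is always positive
```

In a full simulation study, this replicate would be repeated 1000 times per signal-to-noise ratio, recording which shrinkage parameter yields the smaller mean squared error of β̂.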

Supplementary Appendix B: Definitions of degrees of freedom in penalised regression models
Ordinary least squares regression (OLSR), ridge regression (RR) and principal components regression (PCR) all result in models of the form given in Equation (??). For models that can be expressed in this form, several definitions of effective degrees of freedom have been proposed (Hastie & Tibshirani, 1990).
The effective number of parameters, tr(H), gives an indication of the amount of fitting that H does. As discussed in the main text, tr(HH') can be defined as the effective degrees of freedom for variance. The degrees of freedom for error is given by n − tr(2H − HH'), so the effective number of parameters under the degrees-of-freedom-for-error definition is tr(2H − HH').
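As a concrete illustration of the three definitions, the following Python sketch (illustrative data; any full-rank design would do) forms the ridge hat matrix H = X(X'X + kI)⁻¹X' and evaluates each definition numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative design and ridge parameter.
n, p, k = 50, 8, 5.0
X = rng.normal(size=(n, p))

# Ridge hat matrix: H = X (X'X + kI)^{-1} X'.
H = X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)

df_H = np.trace(H)                      # effective number of parameters
df_HHt = np.trace(H @ H.T)              # degrees of freedom for variance
df_err = n - np.trace(2 * H - H @ H.T)  # degrees of freedom for error
df_2H = 2 * df_H - df_HHt               # effective parameters in df for error

print(df_HHt <= df_H <= df_2H)  # ordering implied by the diagonal-element argument
```

With k > 0 the three quantities differ; as k → 0 all three approach p, recovering the OLSR case.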
In OLSR, all three definitions of degrees of freedom reduce to p, the number of parameters in the model. In PCR, all three definitions reduce to r, the number of components retained in the PCR.
tr(H) is the sum of the t non-zero diagonal elements of H, each of which is less than or equal to 1. tr(HH') is also the sum of t diagonal elements, this time of HH', each of which is the square of the corresponding diagonal element of H. These diagonal elements each take a value between 0 and 1, so the sum of their squares is less than or equal to the sum of the original elements. A similar argument holds for the diagonal elements of 2H − HH', whose sum is greater than or equal to the sum of the diagonal elements of H.

Corollary: For a fixed value of the degrees of freedom, k_HH' < k_H < k_2H−HH', where k_HH' is the k such that tr(HH') = r, k_H is the k such that tr(H) = r, and k_2H−HH' is the k such that tr(2H − HH') = r (for the same value of r in all three cases).
Proof. Let λ_1, ..., λ_t denote the non-zero eigenvalues of X'X, so that in the principal component orientation the t non-zero diagonal elements of H are λ_j/(λ_j + k). We seek k_H and k_HH' such that

Σ_{j=1}^t λ_j/(λ_j + k_H) = r and Σ_{j=1}^t (λ_j/(λ_j + k_HH'))² = r.

For each diagonal element j = 1, ..., t and any k > 0, λ_j/(λ_j + k) lies in (0, 1), so its square is strictly smaller; hence tr(HH') evaluated at k_H is strictly less than tr(H) at k_H, which equals r. Since each diagonal element, and therefore tr(HH'), is strictly decreasing in k, achieving tr(HH') = r requires a smaller shrinkage parameter, which simplifies to k_HH' < k_H. An analogous argument shows that k_2H−HH' > k_H.
The larger the value of k, the further the ridge estimates are from the OLS estimates; this relationship holds when the ridge estimates are returned to the orientation of the data. In RR with k > 0, the three definitions of degrees of freedom follow the inequalities given in Equation (S1). For each of the definitions, it is possible to choose the ridge parameter such that the degrees of freedom equal some specified value. Thus, among the three definitions, choosing k such that tr(HH') = r results in regression coefficient estimates that are closest to the OLS estimates.
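The ordering of the three shrinkage parameters can be checked numerically. The sketch below (illustrative, using a simulated design) works from the eigenvalues of X'X and solves tr(H) = r, tr(HH') = r and tr(2H − HH') = r for k by bisection, confirming k_HH' < k_H < k_2H−HH'.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 60, 10, 4.0
X = rng.normal(size=(n, p))
lam = np.linalg.svd(X, compute_uv=False) ** 2  # eigenvalues of X'X

def df(k, kind):
    h = lam / (lam + k)  # diagonal elements of H in the PC orientation
    if kind == "H":
        return h.sum()
    if kind == "HHt":
        return (h ** 2).sum()
    return (2 * h - h ** 2).sum()  # tr(2H - HH')

def solve_k(kind, r, lo=1e-8, hi=1e8):
    # Each df definition is strictly decreasing in k, so bisection
    # (on the log scale) finds the k with df(k) = r.
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        if df(mid, kind) > r:
            lo = mid  # k too small: df still above target
        else:
            hi = mid
    return np.sqrt(lo * hi)

k_HHt = solve_k("HHt", r)
k_H = solve_k("H", r)
k_2H = solve_k("2H", r)
print(k_HHt < k_H < k_2H)  # the ordering stated in the corollary
```

Since k_HH' is the smallest of the three, fixing the degrees of freedom via tr(HH') = r shrinks the coefficients least, consistent with the remark above.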

Supplementary Appendix C: Logistic ridge regression by cyclic coordinate descent
In this section, we describe logistic RR and cyclic coordinate descent, the algorithm which we use to compute logistic RR coefficients.
In the logistic regression model, let X be an n × p matrix of predictors with rows x_i = (x_i1, ..., x_ip), as in the linear regression model (Equation (??)). Let Y = (Y_1, ..., Y_n) be a vector of observed binary outcomes. This setup is common in biomedical data, where the outcome variable represents disease status, with cases coded as 1 and controls as 0.
The i-th response Y_i is a Bernoulli variable with probability of success π(x_i). The logistic regression model relates the probability π(x_i) that the i-th observation is a case to the predictor variables as

log(π(x_i)/(1 − π(x_i))) = x_i'β,

where β is a vector of parameters to be estimated. Logistic RR estimates are obtained by maximising the log-likelihood of the parameter vector subject to a penalty term. The penalised log-likelihood to be maximised is:

Σ_{i=1}^n [Y_i log π(x_i) + (1 − Y_i) log(1 − π(x_i))] − k Σ_{j=1}^p β_j².

The CLG algorithm (Zhang & Oles, 2001) is a cyclic coordinate descent algorithm for penalised logistic regression, described in detail by Genkin et al. (2007). The CLG algorithm is initiated by setting all of the coefficient estimates to an initial value. Then, each coefficient is updated in turn whilst holding the rest fixed; this has the advantage of avoiding the inversion of large matrices. Convergence is checked after each round of updating all of the coefficients.
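The penalised objective can be written down directly. The following Python sketch (illustrative; it assumes the penalty k Σ β_j² applied to all coefficients, with no separate intercept) evaluates the penalised log-likelihood for 0/1-coded outcomes.

```python
import numpy as np

def penalised_loglik(beta, X, y, k):
    """Logistic ridge objective: log-likelihood minus k * sum(beta_j^2).

    y is coded 0/1. Uses logaddexp for numerical stability:
    log pi = eta - log(1 + exp(eta)), log(1 - pi) = -log(1 + exp(eta)).
    """
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return loglik - k * np.sum(beta ** 2)

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

# At beta = 0 every fitted probability is 1/2, so the penalised
# log-likelihood equals -n * log(2).
print(penalised_loglik(np.zeros(p), X, y, k=1.0))
```

Maximising this objective over β, rather than evaluating it at a fixed point, is the task of the coordinate descent algorithm described next.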
In the CLG algorithm, cases are coded as Y_i = 1 and controls as Y_i = −1. Finding the updated coefficient β_j^new that maximises the penalised log-likelihood whilst keeping the other parameters fixed is equivalent to finding the z that minimises

Σ_{i=1}^n f(r_i + (z − β_j) y_i x_ij) + z²/(2τ),

where τ = 1/(2k) and the r_i = y_i x_i'β are computed using the current value of β and so are treated as constants.
Here f(r) = ln(1 + exp(−r)); penalty terms not involving z are constant and are therefore omitted, and other loss functions can be used (Zhang & Oles, 2001). A tentative update Δv_j is computed from a Newton-type step on this one-dimensional objective. We then apply the trust region restriction

Δβ_j = min(max(Δv_j, −Δ_j), Δ_j)

to give the actual update, and the trust region half-width is revised as Δ_j ← max(2|Δβ_j|, Δ_j/2). Convergence is declared when (Σ_{i=1}^n |Δr_i|)/(1 + Σ_{i=1}^n |r_i|) < ε, where Σ_{i=1}^n |Δr_i| is the sum of the absolute changes in the linear scores once all the coefficients have been updated, and ε is a user-specified tolerance. The CLG algorithm is summarised in Algorithm 1.
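The steps above can be sketched as a simplified Python implementation. This is an illustrative sketch, not the package's C++ implementation: it uses the global curvature bound f''(r) ≤ 1/4 in the Newton-type step, in place of the trust-region-adapted bound of Zhang & Oles (2001), and `tol` plays the role of ε.

```python
import numpy as np

def clg_logistic_ridge(X, y, k, tol=1e-6, max_iter=500):
    """Simplified CLG-style cyclic coordinate descent for logistic ridge.

    y is coded +1/-1 as in the CLG algorithm. The curvature of
    f(r) = ln(1 + exp(-r)) is bounded above by 1/4, which gives a
    conservative (always-valid) denominator for the Newton-type step.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = np.zeros(n)       # linear scores r_i = y_i * x_i' beta
    delta = np.ones(p)    # trust region half-widths
    for _ in range(max_iter):
        abs_dr_sum = 0.0
        for j in range(p):
            # Gradient of the penalised objective in coordinate j.
            grad = -np.sum(y * X[:, j] / (1.0 + np.exp(r))) + 2.0 * k * beta[j]
            curv = np.sum(X[:, j] ** 2) / 4.0 + 2.0 * k
            step = -grad / curv
            # Trust region restriction, then revise the half-width.
            step = float(np.clip(step, -delta[j], delta[j]))
            delta[j] = max(2.0 * abs(step), delta[j] / 2.0)
            dr = step * y * X[:, j]
            r += dr
            beta[j] += step
            abs_dr_sum += np.sum(np.abs(dr))
        # Convergence test on the relative change in the linear scores.
        if abs_dr_sum / (1.0 + np.sum(np.abs(r))) < tol:
            break
    return beta

rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = np.where(rng.random(n) < 1 / (1 + np.exp(-X @ beta_true)), 1.0, -1.0)

beta_hat = clg_logistic_ridge(X, y, k=1.0)
```

Because each coefficient update touches only one column of X and the n-vector of linear scores, memory use stays linear in n and p, which is what makes this approach feasible for the very wide genotype matrices discussed in the main text.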