Ridge Regression in Prediction Problems: Automatic Choice of the Ridge Parameter

Authors

  • Erika Cule

    Corresponding author
    1. Department of Epidemiology and Biostatistics, Imperial College London, London, United Kingdom
    2. Statistical Consulting Group, GlaxoSmithKline, Stevenage, United Kingdom
    • Correspondence to: Erika Cule, Statistical Consulting Group, GlaxoSmithKline R & D Ltd., Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK. E-mail: erika.j.cule@gsk.com

  • Maria De Iorio

    1. Department of Statistical Science, University College London, London, United Kingdom

ABSTRACT

To date, numerous genetic variants have been identified as associated with diverse phenotypic traits. However, identified associations generally explain only a small proportion of trait heritability, and the predictive power of models incorporating only known-associated variants has been small. Multiple regression is a popular framework in which to consider the joint effect of many genetic variants simultaneously. Ordinary multiple regression is seldom appropriate in the context of genetic data, due to the high dimensionality of the data and the correlation structure among the predictors. There has been a resurgence of interest in the use of penalised regression techniques to circumvent these difficulties. In this paper, we focus on ridge regression, a penalised regression approach that has been shown to offer good performance in multivariate prediction problems. One challenge in the application of ridge regression is the choice of the ridge parameter, which controls the amount of shrinkage of the regression coefficients. We present a method to determine the ridge parameter based on the data, with the aim of good performance in high-dimensional prediction problems. We establish a theoretical justification for our approach, and demonstrate its performance on simulated genetic data and on a real data example. Fitting a ridge regression model to hundreds of thousands to millions of genetic variants simultaneously presents computational challenges. We have developed an R package, ridge, which addresses these issues. The ridge package implements the automatic choice of ridge parameter presented in this paper, and is freely available from CRAN.
