• prediction;
• polygenic score;
• penalised regression;
• lipids;
• SNP selection


Accurate genetic prediction of quantitative traits related to complex disease risk could have clinical impact, so it is important to investigate statistical methodology that improves predictive performance. We compare a simple approach, polygenic scores built from top-ranking single nucleotide polymorphisms (SNPs), with a set of shrinkage models, namely Ridge Regression, Lasso and Hyper-Lasso. These penalised regression methods analyse all genotyped SNPs simultaneously, so the models can include much larger sets of SNPs, not only those with the smallest P values. We compare the accuracy of these models for predicting low-density lipoprotein (LDL) and high-density lipoprotein (HDL) cholesterol, two lipid traits of clinical relevance, in the Whitehall II and British Women's Health and Heart Study cohorts, using SNPs from the HumanCVD BeadChip. For gene scores, the most accurate predictions arise from multivariate weighted scores and include only a small number of SNPs, identified as top hits by the HumanCVD BeadChip. Furthermore, there was little benefit from including external results from published sets of SNPs. Shrinkage approaches rarely improved significantly on gene score results. Genetic predictive performance is trait specific: it depends on the heritability and genetic architecture of the trait, and is limited by the sample size of the training data. Our results for lipid traits suggest no current benefit of more complex methods over existing gene score methods. Instead, the most important choice for the prediction model is the number of SNPs and the selection of the most predictive SNPs to include. However, further comparisons, in larger samples and for other phenotypes, would still be of interest.
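
The kind of comparison described above can be illustrated with a minimal sketch in Python. It is not the analysis pipeline used in the study. The genotypes and trait are simulated rather than drawn from either cohort, all function names and parameter values (`simulate`, `top_k_gene_score`, `ridge_predict`, the sample sizes, the penalty values) are hypothetical, and only Ridge Regression is shown as a representative shrinkage model, so Lasso and Hyper-Lasso are omitted. The gene score here ranks SNPs by marginal association in the training data, keeps the top k, and fits their weights jointly by least squares, which is one plausible reading of a multivariate weighted score.

```python
import numpy as np


def simulate(n, p, n_causal, h2, rng):
    """Simulate allele counts (0/1/2) and a quantitative trait (assumed model)."""
    maf = rng.uniform(0.05, 0.5, size=p)            # minor allele frequencies
    X = rng.binomial(2, maf, size=(n, p)).astype(float)
    beta = np.zeros(p)
    beta[:n_causal] = rng.normal(size=n_causal)     # only a few SNPs have an effect
    g = X @ beta                                    # genetic component of the trait
    noise_sd = np.sqrt(g.var() * (1 - h2) / h2)     # scale noise to target heritability
    return X, g + rng.normal(scale=noise_sd, size=n)


def standardise(X_train, X_test):
    """Centre and scale both sets using training-set means and SDs only."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_train - mu) / sd, (X_test - mu) / sd


def top_k_gene_score(X_train, y_train, X_test, k):
    """Weighted gene score from the k SNPs with the strongest marginal association."""
    yc = y_train - y_train.mean()
    marginal = np.abs(X_train.T @ yc)               # proportional to |correlation|
    top = np.argsort(marginal)[-k:]                 # indices of the k top-ranked SNPs
    A = np.column_stack([np.ones(len(y_train)), X_train[:, top]])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)  # joint (multivariate) weights
    return coef[0] + X_test[:, top] @ coef[1:]


def ridge_predict(X_train, y_train, X_test, lam):
    """Ridge regression on all SNPs via the closed-form solution."""
    p = X_train.shape[1]
    y_mean = y_train.mean()
    beta = np.linalg.solve(X_train.T @ X_train + lam * np.eye(p),
                           X_train.T @ (y_train - y_mean))
    return y_mean + X_test @ beta


def r_squared(y, y_hat):
    """Squared correlation between observed and predicted trait values."""
    return np.corrcoef(y, y_hat)[0, 1] ** 2


rng = np.random.default_rng(0)
X, y = simulate(n=800, p=200, n_causal=10, h2=0.3, rng=rng)
X_train, X_test = standardise(X[:600], X[600:])     # hypothetical 600/200 split
y_train, y_test = y[:600], y[600:]

for k in (5, 10, 50):
    r2 = r_squared(y_test, top_k_gene_score(X_train, y_train, X_test, k))
    print(f"gene score, top {k:>3} SNPs: R^2 = {r2:.3f}")
for lam in (1.0, 100.0, 1000.0):
    r2 = r_squared(y_test, ridge_predict(X_train, y_train, X_test, lam))
    print(f"ridge, lambda = {lam:>6}: R^2 = {r2:.3f}")
```

In this sketch, the number of SNPs k plays the role of the model choice highlighted above, and the penalty `lam` controls how strongly all SNP effects are shrunk. Which approach predicts better on the held-out samples depends on the simulated heritability, the number of causal SNPs and the training sample size, so the printed values should not be read as evidence about real lipid traits.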