Tailoring sparse multivariable regression techniques for prognostic single-nucleotide polymorphism signatures

Authors

  • H. Binder,

    Corresponding author
    1. Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, 79104 Freiburg, Germany
    2. Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Center Johannes Gutenberg University Mainz, 55101 Mainz, Germany
  • A. Benner,

    1. Division of Biostatistics, German Cancer Research Center, 69120 Heidelberg, Germany
  • L. Bullinger,

    1. Department of Internal Medicine III, University Hospital of Ulm, D-89081 Ulm, Germany
  • M. Schumacher

    1. Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, 79104 Freiburg, Germany

Correspondence to: Harald Binder, Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Center Johannes Gutenberg University Mainz, 55101 Mainz, Germany.

E-mail: binderh@uni-mainz.de

Abstract

When seeking prognostic information for patients, modern technologies provide a huge number of genomic measurements as a starting point. For single-nucleotide polymorphisms (SNPs), there may be more than one million covariates that need to be considered simultaneously with respect to a clinical endpoint. Although the underlying biological problem cannot be solved on the basis of clinical cohorts of only modest size, some important SNPs might still be identified. Sparse multivariable regression techniques have recently become available for automatically identifying prognostic molecular signatures that comprise relatively few covariates and provide reasonable prediction performance. To illustrate how such approaches can be adapted to the specific features of SNP data, we propose different variants of a componentwise likelihood-based boosting approach. This approach links SNP measurements to a time-to-event endpoint through a regression model that is built up in a large number of steps. The variants allow for strategic choices in dealing with SNPs that differ in variance because of their variation in minor allele frequencies. In addition, we propose a heuristic that allows computationally efficient handling of millions of covariates, thus opening the door for incorporating SNP × treatment interactions. We illustrate this using data from patients with acute myeloid leukemia. We assess the resulting models using prediction error curves and resampling data sets. We obtain increased stability by moving interpretation from the SNP to the gene level. By considering these different aspects, we outline a more general strategy for linking SNP measurements to a time-to-event endpoint by means of sparse multivariable regression models. Copyright © 2012 John Wiley & Sons, Ltd.
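
The point about SNP variance and minor allele frequency (MAF) can be made concrete with a small sketch. It is not the authors' implementation. It assumes additive 0/1/2 genotype coding, Hardy-Weinberg proportions, and a ridge-type penalty on each componentwise update, and it uses a squared-error update in place of the Cox partial likelihood to stay short. All function names and the penalty value are hypothetical.

```python
def hwe_variance(maf):
    """Variance of a 0/1/2-coded SNP under Hardy-Weinberg equilibrium."""
    return 2.0 * maf * (1.0 - maf)


def genotypes(maf, n):
    """Deterministic genotype vector with Hardy-Weinberg expected counts."""
    n2 = round(n * maf * maf)                # two copies of the minor allele
    n1 = round(n * 2.0 * maf * (1.0 - maf))  # one copy
    n0 = n - n1 - n2                         # no copy
    return [0] * n0 + [1] * n1 + [2] * n2


def center(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def standardize(x):
    """Center and scale to unit (population) standard deviation."""
    xc = center(x)
    sd = (sum(v * v for v in xc) / len(xc)) ** 0.5
    return [v / sd for v in xc]


def shrinkage(x, penalty):
    """Factor by which a ridge-penalized one-covariate update is shrunk
    relative to the unpenalized least-squares update (assumed setup)."""
    sxx = sum(v * v for v in center(x))
    return sxx / (sxx + penalty)


# Hypothetical illustration: a rare and a common SNP in 1000 individuals.
rare = genotypes(0.1, 1000)
common = genotypes(0.5, 1000)
penalty = 500.0  # arbitrary illustrative value

print("variance rare/common:", hwe_variance(0.1), hwe_variance(0.5))
print("raw shrinkage rare/common:",
      shrinkage(rare, penalty), shrinkage(common, penalty))
print("standardized shrinkage rare/common:",
      shrinkage(standardize(rare), penalty),
      shrinkage(standardize(common), penalty))
```

Under these assumptions, the unstandardized rare SNP receives a more heavily shrunken update than the common one, while standardization puts both on the same footing. Whether to standardize is therefore a strategic choice about how much weight low-frequency variants receive.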
