EMLasso: logistic lasso with missing data

Authors

  • N. Sabbe,

    Corresponding author
    • Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653a Ghent, Belgium
  • O. Thas,

    1. Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653a Ghent, Belgium
    2. Centre for Statistical and Survey Methodology, School of Mathematics and Applied Statistics, University of Wollongong, NSW 2522, Australia
  • J-P. Ottoy

    1. Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653a Ghent, Belgium

Correspondence to: N. Sabbe, Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653a Ghent, Belgium.

E-mail: nick.sabbe@ugent.be

Abstract

In clinical settings, missing data in the covariates occur frequently. For example, some markers are expensive or hard to measure. When such data are used for model selection, the missingness is often resolved through a complete case analysis or a form of single imputation. Another common workaround is to leave out the covariates with the most missing values. All these strategies jeopardise the goal of model selection.

In earlier work, we applied the logistic Lasso in combination with multiple imputation to obtain results in such settings, but we provided only heuristic arguments to advocate the method. In this paper, we propose an improved method that builds on firm statistical arguments and that is developed along the lines of the stochastic expectation–maximisation algorithm. We show that our method can be used to handle missing data in both categorical and continuous predictors, as well as in a nonpenalised regression. We demonstrate the method by applying it to data from 273 lung cancer patients. The objective is to select a model for the prediction of acute dysphagia, starting from a large set of potential predictors, including clinical and treatment covariates as well as a set of single-nucleotide polymorphisms. Copyright © 2013 John Wiley & Sons, Ltd.
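
To make the general idea concrete, the following is a minimal sketch of how a stochastic expectation–maximisation loop can be combined with an L1-penalised logistic fit. It is not the authors' implementation. It assumes, purely for illustration, that only one binary covariate has missing values (coded as None), that this covariate follows a marginal Bernoulli distribution estimated from the observed cases, and that covariates are scaled to roughly unit range so that a fixed step size is safe. All function names (fit_logistic_lasso, stochastic_em, and so on) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    # Proximal operator of the L1 penalty.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def linear_predictor(beta, row):
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def fit_logistic_lasso(X, y, lam, beta=None, steps=200, lr=0.5):
    """Proximal gradient descent for L1-penalised logistic regression.

    beta[0] is an unpenalised intercept. The fixed step size lr is an
    assumption that suits covariates in roughly [0, 1].
    """
    n, p = len(X), len(X[0])
    beta = list(beta) if beta is not None else [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for row, yi in zip(X, y):
            r = sigmoid(linear_predictor(beta, row)) - yi
            grad[0] += r / n
            for j, x in enumerate(row):
                grad[j + 1] += r * x / n
        beta[0] -= lr * grad[0]
        for j in range(1, p + 1):
            beta[j] = soft_threshold(beta[j] - lr * grad[j], lr * lam)
    return beta


def stochastic_em(X, y, miss_col, lam, iters=20, seed=0):
    """Alternate between drawing the missing values and refitting the model."""
    rng = random.Random(seed)
    observed = [row[miss_col] for row in X if row[miss_col] is not None]
    prior1 = sum(observed) / len(observed)  # marginal P(x = 1), an assumption

    # Initialise the missing entries with draws from the marginal.
    filled = [list(row) for row in X]
    for row in filled:
        if row[miss_col] is None:
            row[miss_col] = 1 if rng.random() < prior1 else 0
    beta = fit_logistic_lasso(filled, y, lam)

    for _ in range(iters):
        # Stochastic E-step: sample each missing value from
        # p(x_mis | x_obs, y, beta), proportional to p(y | x, beta) * p(x_mis).
        for row, orig, yi in zip(filled, X, y):
            if orig[miss_col] is None:
                weights = []
                for v in (0, 1):
                    row[miss_col] = v
                    p1 = sigmoid(linear_predictor(beta, row))
                    lik = p1 if yi == 1 else 1.0 - p1
                    weights.append(lik * (prior1 if v == 1 else 1.0 - prior1))
                prob1 = weights[1] / (weights[0] + weights[1])
                row[miss_col] = 1 if rng.random() < prob1 else 0
        # M-step: refit the penalised model on the completed data (warm start).
        beta = fit_logistic_lasso(filled, y, lam, beta=beta, steps=50)
    return beta, filled
```

In this sketch the observed values are never altered; only the None entries are redrawn at each iteration. A realistic application would need a joint model for several incomplete covariates, both categorical and continuous, and a data-driven choice of the penalty lam, neither of which is attempted here.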
