Comparison of methods for imputing ordinal data using multivariate normal imputation: a case study of non-linear effects in a large cohort study

Authors

  • Katherine J. Lee,

    Corresponding author
    1. Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Melbourne, Australia
    2. Department of Paediatrics, University of Melbourne, Melbourne, Australia
    • Katherine J. Lee, Clinical Epidemiology and Biostatistics Unit (CEBU), Murdoch Childrens Research Institute, Royal Children's Hospital, Flemington Road, Parkville, Vic. 3052, Australia.

      E-mail: katherine.lee@mcri.edu.au

  • John C. Galati,

    1. Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Melbourne, Australia
  • Julie A. Simpson,

    1. Centre for Molecular, Environmental, Genetic and Analytic Epidemiology, The University of Melbourne, Melbourne, Australia
    2. Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia
  • John B. Carlin

    1. Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Melbourne, Australia
    2. Department of Paediatrics, University of Melbourne, Melbourne, Australia

Abstract

Background

Multiple imputation is becoming increasingly popular for handling missing data, and Markov chain Monte Carlo imputation under an assumption of multivariate normality (MVN) is a commonly used approach. Imputing categorical variables (which are clearly non-normal) using MVN imputation is challenging, and several approaches have been suggested. However, it remains unclear which approach should be preferred.

Methods

We explore methods for imputing ordinal variables using MVN imputation when the aim is to estimate a non-linear association between an ordinal exposure and a binary outcome. The methods include imputing the variable as continuous or as a set of indicators, combined with various methods for assigning imputed values to the possible categories (rounding). We introduce a new approach in which we impute as continuous and assign imputed values to categories based on the mean indicators imputed in a separate round of imputation. We compare these approaches in a simple setting in which 50% of the values of an ordinal exposure are made missing completely at random within an otherwise complete real dataset.
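As a rough illustration of two ingredients of this design, the Python sketch below sets a proportion of an ordinal variable to missing completely at random and then assigns continuous imputed values to the nearest category (simple rounding, which is only one of the possible rounding methods). The function names, the simulated exposure coded 1 to 4, and the stand-in "imputed" values are assumptions made for illustration. They are not the study data, not the authors' code, and no MVN imputation model is fitted here.

```python
import numpy as np

def make_mcar(values, prop_missing, rng):
    """Return a float copy of `values` with a random subset set to NaN (MCAR)."""
    out = np.asarray(values, dtype=float).copy()
    n_missing = int(round(prop_missing * out.size))
    # Every observation is equally likely to be deleted, independent of its value.
    missing_idx = rng.choice(out.size, size=n_missing, replace=False)
    out[missing_idx] = np.nan
    return out

def round_to_categories(imputed, categories):
    """Assign each continuous imputed value to the nearest allowed category."""
    imputed = np.asarray(imputed, dtype=float)
    categories = np.asarray(categories, dtype=float)
    # Distance from every imputed value to every category; pick the closest one.
    nearest = np.abs(imputed[:, None] - categories[None, :]).argmin(axis=1)
    return categories[nearest]

rng = np.random.default_rng(0)
exposure = rng.integers(1, 5, size=1000)       # hypothetical ordinal exposure coded 1..4
exposure_mcar = make_mcar(exposure, 0.5, rng)  # 50% of values set to missing

# Hypothetical continuous values standing in for draws from an MVN imputation model.
continuous_draws = np.array([0.3, 1.4, 2.6, 3.49, 5.2])
rounded = round_to_categories(continuous_draws, categories=[1, 2, 3, 4])
```

In this sketch, values that fall outside the category range (such as 0.3 or 5.2) are pulled to the lowest or highest category, which hints at how rounding alone can change the shape of the imputed distribution.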

Results

Methods that impute the ordinal exposure as continuous distorted the non-linear exposure–outcome association by biasing the relationship towards linearity irrespective of the rounding method. In contrast, imputing using indicators preserved the non-linear association but not the marginal distribution of the ordinal variable.

Conclusions

Imputing ordinal variables as continuous can bias the estimation of the exposure–outcome association in the presence of non-linear relationships. Further work is needed to develop optimal methods for handling ordinal (and nominal) variables when using MVN imputation.

Copyright © 2012 John Wiley & Sons, Ltd.
