Multiple Outputation: Inference for Complex Clustered Data by Averaging Analyses from Independent Data

Authors

  • Dean Follmann,

    Corresponding author
    1. National Institute of Allergy and Infectious Diseases, 6700B Rockledge Drive MSC 7609, Bethesda, Maryland 20892, U.S.A.
      *email: dfollmann@niaid.nih.gov
  • Michael Proschan,

    1. Office of Biostatistics Research, National Heart, Lung, and Blood Institute, 6701 Rockledge Drive, Bethesda, Maryland 7938, U.S.A.
  • Eric Leifer

    1. Office of Biostatistics Research, National Heart, Lung, and Blood Institute, 6701 Rockledge Drive, Bethesda, Maryland 7938, U.S.A.


Abstract

Summary. This article applies a simple method for settings where one has clustered data, but statistical methods are only available for independent data. We assume the statistical method provides us with a normally distributed estimate, $\hat{\theta}$, and an estimate of its variance, $\hat{\sigma}^2$. We randomly select a data point from each cluster and apply our statistical method to these independent data. We repeat this multiple times, and use the average of the associated $\hat{\theta}$s as our estimate. An estimate of the variance is given by the average of the $\hat{\sigma}^2$s minus the sample variance of the $\hat{\theta}$s. We call this procedure multiple outputation, as all “excess” data within each cluster are thrown out multiple times. Hoffman, Sen, and Weinberg (2001, Biometrika 88, 1121–1134) introduced this approach for generalized linear models when the cluster size is related to outcome. In this article, we demonstrate the broad applicability of the approach. Applications to angular data, p-values, vector parameters, Bayesian inference, genetics data, and random cluster sizes are discussed. In addition, asymptotic normality of estimates based on all possible outputations, as well as a finite number of outputations, is proven given weak conditions. Multiple outputation provides a simple and broadly applicable method for analyzing clustered data. It is especially suited to settings where methods for clustered data are impractical, but can also be applied generally as a quick and simple tool.
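The procedure described in the summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `multiple_outputation` and `mean_and_variance`, the `n_outputations` and `seed` parameters, and the choice of the sample mean as the independent-data method are all hypothetical, and the sample variance of the estimates is taken with the usual M − 1 divisor.

```python
import random
import statistics


def mean_and_variance(sample):
    """Illustrative independent-data method (an assumption, not from the
    article): the sample mean and the estimated variance of that mean."""
    n = len(sample)
    return statistics.fmean(sample), statistics.variance(sample) / n


def multiple_outputation(clusters, method, n_outputations=1000, seed=0):
    """Average an independent-data analysis over repeated outputations.

    clusters: list of lists, one inner list of observations per cluster.
    method:   function mapping an independent sample to
              (estimate, estimated variance of the estimate).
    """
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(n_outputations):
        # Keep one randomly chosen observation per cluster; discard the rest.
        sample = [rng.choice(cluster) for cluster in clusters]
        est, var = method(sample)
        estimates.append(est)
        variances.append(var)

    # Point estimate: average of the per-outputation estimates.
    theta_bar = statistics.fmean(estimates)
    # Variance estimate: average of the per-outputation variance estimates
    # minus the sample variance of the per-outputation estimates.
    var_mo = statistics.fmean(variances) - statistics.variance(estimates)
    return theta_bar, var_mo


# Hypothetical clustered data: five clusters of unequal size.
example_clusters = [[1.2, 1.4, 1.1], [2.3], [0.7, 0.9], [1.8, 2.1, 1.9, 2.0], [1.5, 1.3]]
theta_bar, var_mo = multiple_outputation(example_clusters, mean_and_variance, n_outputations=500)
```

Because the variance estimate is a difference of two quantities, it can in principle come out negative when the number of outputations or clusters is small, so a caller may want to check its sign before taking a square root.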
