Using full-cohort data in nested case–control and case–cohort studies by multiple imputation


Correspondence to: Ruth H. Keogh, Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, U.K.



In many large prospective cohorts, expensive exposure measurements cannot be obtained for all individuals. Exposure–disease association studies are therefore often based on nested case–control or case–cohort studies in which complete information is obtained only for sampled individuals. However, in the full cohort, there may be a large amount of information on cheaply available covariates and possibly a surrogate of the main exposure(s), which typically goes unused. We view the nested case–control or case–cohort study plus the remainder of the cohort as a full-cohort study with missing data. Hence, we propose using multiple imputation (MI) to utilise information in the full cohort when data from the sub-studies are analysed. We use the fully observed data to fit the imputation models. We consider using approximate imputation models and also using rejection sampling to draw imputed values from the true distribution of the missing values given the observed data. Simulation studies show that using MI to utilise full-cohort information in the analysis of nested case–control and case–cohort studies can result in important gains in efficiency, particularly when a surrogate of the main exposure is available in the full cohort. In simulations, this method outperforms counter-matching in nested case–control studies and a weighted analysis for case–cohort studies, both of which use some full-cohort information. Approximate imputation models perform well except when there are interactions or non-linear terms in the outcome model, where imputation using rejection sampling works well. Copyright © 2013 John Wiley & Sons, Ltd.
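The rejection-sampling idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on simplifying assumptions: a logistic outcome model with an exposure-by-covariate interaction stands in for the analysis model, the exposure X given the fully observed covariate Z is taken to be normal, and all parameter values are treated as known. The names `outcome_prob`, `draw_missing_x`, `beta`, `gamma` and `sigma` are hypothetical. A candidate exposure is proposed from the distribution of X given Z and accepted with probability equal to the outcome likelihood, which never exceeds 1, so accepted draws follow the distribution of X given both the outcome D and Z.

```python
import math
import random


def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-u))


def outcome_prob(d, x, z, beta):
    """P(D = d | X = x, Z = z) under an assumed logistic model with an X*Z interaction."""
    b0, bx, bz, bxz = beta
    p = expit(b0 + bx * x + bz * z + bxz * x * z)
    return p if d == 1 else 1.0 - p


def draw_missing_x(d, z, beta, gamma, sigma, rng, max_tries=100000):
    """Draw X from f(x | d, z), proportional to P(d | x, z) * f(x | z), by rejection sampling."""
    g0, gz = gamma
    for _ in range(max_tries):
        # Proposal: assumed exposure model X | Z ~ N(g0 + gz * z, sigma^2).
        x = rng.gauss(g0 + gz * z, sigma)
        # Accept with probability P(d | x, z); this is a valid acceptance
        # probability because the outcome likelihood is bounded above by 1.
        if rng.random() < outcome_prob(d, x, z, beta):
            return x
    raise RuntimeError("no proposal accepted within max_tries")


# Illustrative (made-up) parameter values, not estimates from any study.
rng = random.Random(1)
beta = (-1.0, 0.8, 0.3, 0.5)   # intercept, X, Z, X*Z
gamma = (0.0, 0.5)             # intercept and slope of X on Z
sigma = 1.0

# Five imputed exposure values for a case (d = 1) with z = 1.0,
# one for each of five imputed data sets.
imputations = [draw_missing_x(1, 1.0, beta, gamma, sigma, rng) for _ in range(5)]
print(imputations)
```

In a real application the parameters would be estimated rather than fixed, and each imputed data set would use its own parameter draw to reflect that uncertainty; this sketch omits that step. The acceptance rate can also be low when the outcome is rare, since the acceptance probability for cases is then small for most proposals.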