Multiple Imputation of Missing Data in Nested Case-Control and Case-Cohort Studies

Summary: The nested case-control and case-cohort designs are two main approaches for carrying out a substudy within a prospective cohort. This article adapts multiple imputation (MI) methods developed for handling missing covariates in full-cohort studies to the nested case-control and case-cohort designs. We consider data missing by design and data missing by chance. MI analyses that make use of full-cohort data and MI analyses based on substudy data only are described, alongside an intermediate approach in which the imputation uses full-cohort data but the analysis uses only the substudy. We describe adaptations to two imputation methods: the approximate method (MI-approx) of White and Royston (2009) and the "substantive model compatible" (MI-SMC) method of Bartlett et al. (2015). We also apply the "MI matched set" approach of Seaman and Keogh (2015), which does not require any full-cohort information, to nested case-control studies. The methods are investigated using simulation studies and all perform well when their assumptions hold. Substantial gains in efficiency can be made by imputing data missing by design using the full-cohort approach or by imputing data missing by chance in analyses using the substudy only. The intermediate approach brings greater gains in efficiency relative to the substudy approach and is more robust to imputation model misspecification than the full-cohort approach. The methods are illustrated using the ARIC Study cohort. Supplementary Materials provide R and Stata code.

Repeat steps 2-4 until the sampled values of X converge in distribution. At this point, use these sampled values as the imputed values for the single imputed dataset. Repeat the whole process M times to generate M imputed datasets.

S2. Substantive model compatible multiple imputation ('MI-SMC')
Here we describe the MI-SMC method referred to in Section 2.3 of the main text. For MI-SMC with p partially observed variables, the algorithm to generate one imputed data set is as follows.
(1) Replace the missing values in X with arbitrary starting values, to create a complete data set. Set k = 1.
(2) Fit the Cox model to the current complete data set to obtain estimates (β_X, β_Z) and their estimated variance Σ. Draw values β*_X, β*_Z from a joint normal distribution with mean (β_X, β_Z) and variance Σ.
(3) Calculate Breslow's estimate, denoted H*_0(t), of the baseline cumulative hazard H_0(t), using the parameter values β*_X, β*_Z and the current imputations of X.
(4) Fit a regression model (e.g. linear or logistic, as appropriate) of X_k on X_{-k} and Z to the current complete data set. Draw a value γ*_{X_k} from the approximate joint posterior distribution of the parameters γ_{X_k} in this model.
(5) For each individual for whom X_k is missing, (a) draw a value X*_k from the distribution p(X_k | X_{-k}, Z; γ*_{X_k}) and let X* denote X with X_k replaced by its proposed value X*_k, (b) draw a value U from a uniform distribution on [0, 1], and (c) accept the proposal X*_k if either D = 0 and U ≤ exp{−H*_0(t) e^{β*_X X* + β*_Z Z}}, or D = 1 and U ≤ H*_0(t) e^{β*_X X* + β*_Z Z} exp{1 − H*_0(t) e^{β*_X X* + β*_Z Z}}. If X*_k is not accepted, then discard it and repeat (a), (b) and (c).
(6) If k < p, set k = k + 1 and return to step 4; otherwise set k = 1.
Repeat steps 2-6 until the sampled values of X converge in distribution. At this point, use these sampled values as the imputed values for the single imputed dataset. Repeat the whole process M times to generate M imputed datasets.
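As an illustration, the rejection step (5) can be sketched in Python (the paper's own example code is in R and Stata). The function and parameter names below are hypothetical, a simple normal proposal stands in for the fitted covariate model p(X_k | X_{-k}, Z), and the acceptance probabilities are the standard ones for a Cox substantive model in the SMC-FCS framework of Bartlett et al. (2015): the survival function for censored individuals, and the density contribution rescaled by its maximum for events.

```python
import numpy as np

rng = np.random.default_rng(1)

def acceptance_prob(H0t, lp, d):
    """Acceptance probability for a proposed covariate value in step 5(c).

    H0t : Breslow estimate H*_0(t) at the individual's observed time
    lp  : linear predictor beta*_X X* + beta*_Z Z under the proposal
    d   : event indicator D (1 = event, 0 = censored)
    """
    A = H0t * np.exp(lp)
    if d == 0:
        return np.exp(-A)        # survival function S(t | X*, Z)
    # density contribution rescaled by its maximum e^{-1}, attained at A = 1
    return A * np.exp(1.0 - A)

def impute_xk(H0t, d, beta_x, beta_z_times_z, mu, sigma, max_iter=10_000):
    """Rejection-sample X_k from its conditional given (T, D, X_{-k}, Z).

    The N(mu, sigma^2) proposal stands in for the fitted covariate model;
    all argument names are illustrative, not from the paper's code.
    """
    for _ in range(max_iter):
        x_star = rng.normal(mu, sigma)        # step 5(a): propose X*_k
        u = rng.uniform()                     # step 5(b): draw U
        if u <= acceptance_prob(H0t, beta_x * x_star + beta_z_times_z, d):
            return x_star                     # step 5(c): accept
    raise RuntimeError("rejection sampler failed to accept a proposal")

x_imp = impute_xk(H0t=0.1, d=1, beta_x=0.5, beta_z_times_z=0.2, mu=0.0, sigma=1.0)
```

Because both acceptance probabilities lie in [0, 1], the accepted draws follow the conditional distribution of X_k given the observed survival data, which is what makes the imputations compatible with the substantive Cox model.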
In the full-cohort and intermediate MI approaches, the MI-SMC algorithm just described is applied to all the data on the full cohort. In the setting described in the main text, there were only p = 2 partially observed variables, X_1 and X_2 (as well as the fully observed variables Z), but the algorithm allows for any number of partially observed variables.
In the substudy approach, the MI-SMC algorithm is applied only to the data on the substudy, but with the following modifications to steps 3 and 4 (steps 1, 2, 5 and 6 remain unchanged). Steps 3 and 4 involve estimating, respectively, the (population) baseline cumulative hazard H_0(t) and the parameters of the distribution p(X_2 | X_1, Z). In Section 5.2 of the main text we described modified estimators of these two quantities that should be used in steps 3 and 4 when using the substudy approach. In the setting described in the main text, there was only p = 1 partially observed variable, X_2 (since X_1 is fully observed in the substudy), but the algorithm allows for any number of partially observed variables.

S3. Further details of the simulation study
We base our simulation partly on information from a recent review of case-cohort studies (Sharp et al., 2014) and partly on studies in cardiovascular epidemiology, from which we take our example in Section 7 of the main text. The full cohort size of 15,000 individuals used in the simulation study was approximately the 25th percentile of full cohort sizes in the studies reviewed by Sharp et al. (2014). Variable Z is binary and was generated from a Bernoulli distribution with probability 0.5, variable X_2 is binary and was generated from a Bernoulli distribution such that logit Pr(X_2 = 1 | Z) = 0.5Z, and variable X_1 is continuous and was generated from a normal distribution with mean 0.25Z + 0.25X_2 and variance 1.
Loss-to-follow-up times were generated from an exponential distribution with hazard λ_C. The parameter λ in the hazard model used to generate event times, and λ_C, were chosen so that 10% of individuals have the event, 20% are lost to follow-up and the remainder are administratively censored after 15 years of follow-up. The follow-up time and event rate are realistic values for studies in cardiovascular epidemiology. For the case-cohort sample we used a subcohort sampling percentage of 5%, which was similar to the median subcohort sampling fraction of 4.1% found in the review by Sharp et al. (2014). Missingness in X_2 was generated with probability of missingness exp(a + 0.2Z + 0.2D + 0.2ZD)/{1 + exp(a + 0.2Z + 0.2D + 0.2ZD)}, with a chosen to give 10% or 50% missingness.
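The covariate and missingness generation just described can be sketched as follows (in Python rather than the R and Stata used elsewhere in the paper). The event indicator D and the intercept a below are placeholders: the survival-time generation under the hazard model with parameter λ, and the calibration of a to hit the target missingness percentage, are omitted.

```python
import numpy as np

rng = np.random.default_rng(2024)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

n = 15_000                                   # full cohort size
Z = rng.binomial(1, 0.5, n)                  # Z ~ Bernoulli(0.5)
X2 = rng.binomial(1, expit(0.5 * Z))         # logit Pr(X2 = 1 | Z) = 0.5 Z
X1 = rng.normal(0.25 * Z + 0.25 * X2, 1.0)   # X1 | Z, X2 ~ N(0.25Z + 0.25X2, 1)

# Placeholder event indicator D; in the simulation this comes from the
# Cox hazard model with parameter lambda and the censoring mechanism.
D = rng.binomial(1, 0.10, n)

a = -2.0   # illustrative intercept; tuned in practice to the target rate
p_miss = expit(a + 0.2 * Z + 0.2 * D + 0.2 * Z * D)
X2_obs = np.where(rng.uniform(size=n) < p_miss, np.nan, X2.astype(float))
```

With the event-dependent term 0.2D + 0.2ZD in the linear predictor, the missingness depends on the outcome as well as Z, so it is missing at random rather than missing completely at random.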
All MI analyses used 10 imputed data sets. MI-SMC used 100 iterations; fewer iterations resulted in non-convergence in the scenario with a large effect size and a large amount of missing data when using the full-cohort approach in the case-cohort setting, though in other simulation scenarios fewer iterations (e.g. 10) tended to be sufficient. In the nested case-control setting, the MI-SMC full-cohort approach did not require a large number of iterations because a larger proportion of the full cohort is in the nested case-control sample.
MI-approx analyses used 5 iterations, which is the default in the mice package in R.

S4. Additional simulation scenario: Model misspecification
We investigated the performance of the methods when the imputation model is misspecified.
For this, X_1 was generated from a log-normal distribution with log X_1 having mean 0.25Z and standard deviation 0.65 (so that the variance of X_1 was around 1, as in the main simulation). X_2 was generated from a Bernoulli distribution using logit Pr(X_2 = 1 | X_1, Z) = 0.5Z + 0.25(X_1 + X_1^2). In MI-approx, the misspecified imputation model for X_1 (full-cohort and intermediate approaches) was X_1 = α_0 + α_1 X_2 + α_2 Z + α_3 D + α_4 H(T) + ε, and the misspecified imputation model for X_2 was logit Pr(X_2 = 1 | X_1, Z, T, D) = α_0 + α_1 X_1 + α_2 Z + α_3 D + α_4 H(T). MI-SMC allows imputed variables to be used on a transformed scale in the substantive model. We considered two forms for the model for X_1 | X_2, Z: (1) a misspecified normal distribution for X_1 | X_2, Z with main effects of X_2 and Z; (2) a correctly specified normal distribution for log X_1 | X_2, Z with main effects of X_2 and Z. In both cases we used a model for X_2 | X_1, Z with main effects of X_1 and Z, which is misspecified.
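The data-generating step of this misspecification scenario can be sketched as follows (a Python placeholder for the covariate generation only; the survival-time generation is unchanged from the main simulation):

```python
import numpy as np

rng = np.random.default_rng(7)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

n = 15_000
Z = rng.binomial(1, 0.5, n)

# log X1 | Z ~ N(0.25 Z, 0.65^2), so X1 itself is log-normal, not normal:
# a normal imputation model for X1 is therefore misspecified.
log_x1 = rng.normal(0.25 * Z, 0.65)
X1 = np.exp(log_x1)

# X2 depends on X1 both linearly and quadratically, so a logistic model
# with a main effect of X1 only is also misspecified.
X2 = rng.binomial(1, expit(0.5 * Z + 0.25 * (X1 + X1**2)))
```

The standard deviation 0.65 for log X_1 keeps the marginal variance of X_1 close to 1, matching the main simulation.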

S5. Software
Traditional analyses of nested case-control and case-cohort studies (Section 3 of the main text) can be performed in standard software using Cox regression. For case-cohort studies the data should be modified so that cases not in the subcohort have a start of follow-up time which is just before their event time. For nested case-control studies the Cox regression should be stratified by the matched set indicator (this gives an analysis which is equivalent to using conditional logistic regression).
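A minimal sketch of the case-cohort data modification (in Python/pandas rather than the R and Stata code provided with the paper; the column names and the toy data are illustrative):

```python
import pandas as pd

# Toy cohort sample: follow-up time, event indicator, subcohort membership.
df = pd.DataFrame({
    "time":      [5.0, 8.0, 3.0, 10.0],
    "event":     [1,    0,   1,    0],
    "subcohort": [True, False, False, True],
})

# Cases outside the subcohort enter follow-up "just before" their event
# time, so they contribute only to their own risk set.
eps = 1e-6
df["entry"] = 0.0
outside_case = (df["event"] == 1) & (~df["subcohort"])
df.loc[outside_case, "entry"] = df.loc[outside_case, "time"] - eps

# Individuals outside the subcohort who are not cases are not sampled.
df = df[df["subcohort"] | (df["event"] == 1)].reset_index(drop=True)
```

The resulting (entry, time, event) data can then be passed to any Cox regression routine that supports delayed entry; for nested case-control data one instead keeps the usual follow-up times and stratifies the Cox fit by the matched set indicator.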
Example R and Stata code for applying the methods described in this paper is available at https://github.com/ruthkeogh/MI-CC.
MI-approx (full-cohort, intermediate, and substudy approaches) can be implemented using the mice package in R (Van Buuren and Groothuis-Oudshoorn, 2011) and the mi command in Stata. The MI-approx analyses could also be performed using other statistical packages (e.g. PROC MI in SAS). The MI-approx methods described in this paper require estimates of the cumulative hazard (see https://github.com/ruthkeogh/MI-CC).
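For illustration, a Nelson-Aalen estimate of the cumulative hazard, of the kind MI-approx uses as the H(T) covariate in the imputation model, can be computed as follows (a Python sketch with a hypothetical function name; the paper's own code is in R and Stata):

```python
import numpy as np

def nelson_aalen(times, events, eval_times=None):
    """Nelson-Aalen estimate of the cumulative hazard H(t).

    times  : observed follow-up times
    events : 1 = event, 0 = censored
    By default H is evaluated at each individual's own time, which is the
    form needed for the H(T) covariate in the MI-approx imputation model.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events)
    if eval_times is None:
        eval_times = times
    uniq = np.unique(times[events == 1])                  # distinct event times
    d = np.array([(times[events == 1] == t).sum() for t in uniq])  # d(t)
    n_at_risk = np.array([(times >= t).sum() for t in uniq])       # n(t)
    increments = d / n_at_risk
    return np.array([increments[uniq <= t].sum() for t in eval_times])

H = nelson_aalen([1.0, 2.0, 3.0], [1, 0, 1])
```

Applying the same function with the censoring indicator 1 − D in place of D gives an estimate of the censoring cumulative hazard of the kind discussed in Section S6.1.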
For the full-cohort and intermediate approaches, MI-SMC is applied in the full cohort and this can be implemented using the smcfcs package (Bartlett and Morris, 2015) in Stata (see https://github.com/jwb133/Stata-smcfcs, and also available on SSC) or R (see https://github.com/jwb133/smcfcs, and also available on CRAN). We have implemented the MI-SMC substudy approach (main text Section 5.2) in the new smcfcs.nestedcc and smcfcs.casecohort functions in the smcfcs package in R.
MI matched set (Section 5.3 of the main text) can be performed using mice in R or ice in Stata (Royston, 2005) after some rearrangement of the data, as described by Seaman and Keogh (2015). MI matched set could also be performed using other statistical packages (e.g. PROC MI in SAS).
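A minimal sketch of the kind of rearrangement involved, assuming a long-to-wide reshape so that the covariate values of the other members of a matched set can serve as predictors when imputing a missing value (Python/pandas with illustrative column names and toy data; see Seaman and Keogh (2015) for the actual procedure):

```python
import pandas as pd

# Toy nested case-control data in long format: one row per subject.
long = pd.DataFrame({
    "set_id": [1, 1, 2, 2],           # matched set indicator
    "role":   ["case", "control", "case", "control"],
    "x2":     [1.0, None, 0.0, 1.0],  # partially observed covariate
})

# One row per matched set, one column per member: imputation software
# such as mice can then use the other members' values as predictors.
wide = long.pivot(index="set_id", columns="role", values="x2")
wide.columns = [f"x2_{r}" for r in wide.columns]

# ... run the imputation on `wide`, then melt back to long format
# for the stratified Cox (conditional logistic) analysis.
```

With m controls per case this produces m + 1 covariate columns per set; the reshaping back to long format is a standard melt on the same set indicator.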

S6.1 Censoring, left-truncation and auxiliary variables
MI-approx and MI-SMC (Section 2 of the main text) assume that any right-censoring occurs independently of the variables with missing data given the fully observed variables.
Such dependence can be accommodated in MI-approx by adding the term H_cens(t) into the imputation model, where H_cens(t) denotes the Nelson-Aalen estimate of the cumulative hazard for censoring (Borgan and Keogh, 2015). In the minimum information setting in a case-cohort study, H_cens(t) can be estimated using a modified version of H*_CC(t) with 'event' replaced by 'censoring'. MI-SMC has been extended to allow competing risks (Bartlett and Taylor, 2016) and this can be used to handle dependence of right-censoring on variables with missing data. Event times are also commonly subject to left truncation, for example when the time scale for the analysis is age but individuals enter the cohort at a range of ages. Left truncation has been incorporated in the Stata version of the smcfcs package.
Auxiliary variables are variables that may be predictive of the value of a partially missing variable but that we do not wish to include as covariates in the substantive model. Auxiliary variables can be included as additional predictors in the imputation models.
In matched nested case-control studies, the terms {n(τ_k)/(c(τ_k)+1)} exp(β_{X_1} X_{1l} + β_{X_2} X_{2l} + β_Z Z_l) in the cumulative hazard estimate should be modified to incorporate stratum-specific weights. In the full-cohort approach, the Cox analysis performed on the imputed data should incorporate the matching variables, either by stratifying or by including them as covariates.
The traditional case-cohort analysis (Section 3 of the main text) was described by Prentice (1986). Modifications have been proposed in which the denominator in the pseudo-partial likelihood L_substudy is modified to include individual weights (Onland-Moret et al., 2007). This is relevant for the substudy and intermediate approaches, in which the imputation methods can be used without modification. Case-cohort studies can also use stratified random sampling of the subcohort. One version of the traditional analysis uses a modified version of the pseudo-partial likelihood L_substudy in which R̃_j is replaced by R̃_gj, the subset of R̃_j that is in the same stratum g as the case whose event time is τ_j. See Borgan et al. (2000) for alternative methods for use when the subcohort is selected by stratified random sampling. In MI-approx the imputation model should include the stratifying variables, and the cumulative hazard estimate H_CC(t) = {n_S(0)/n} Σ_{τ_k ≤ t} d(τ_k)/n_S(τ_k) should be modified to incorporate stratum-specific weights in the denominator. In MI-SMC the covariate model used for the proposal distribution should include the matching variables as predictors, and the denominator of the cumulative baseline hazard estimate H_0^CC(t) should similarly be modified using stratum-specific weights.

Figure S1. Simulation study results: case-cohort study within a cohort; 10% missing X_2. The points are the means of the point estimates from 1000 simulated data sets. The horizontal lines around each point are the 95% confidence intervals obtained based on Monte Carlo errors. The relative efficiency is relative to the complete-data substudy analysis.

Figure S2. Simulation study results: nested case-control study with 1 control per case within a cohort; 10% missing X_2. The points are the means of the point estimates from 1000 simulated data sets. The horizontal lines around each point are the 95% confidence intervals obtained based on Monte Carlo errors. The relative efficiency is relative to the complete-data substudy analysis.

(a) β_{X_1} = β_{X_2} = β_Z = 0.2
Figure S3. Simulation study results: nested case-control study with 4 controls per case within a cohort; 50% missing X_2. The points are the means of the point estimates from 1000 simulated data sets. The horizontal lines around each point are the 95% confidence intervals obtained based on Monte Carlo errors. The relative efficiency is relative to the complete-data substudy analysis.

(a) β_{X_1} = β_{X_2} = β_Z = 0.2 (b) β_{X_1} = β_{X_2} = β_Z = 0.7
Figure S4. Simulation study results: nested case-control study with 4 controls per case within a cohort; 10% missing X_2. The points are the means of the point estimates from 1000 simulated data sets. The horizontal lines around each point are the 95% confidence intervals obtained based on Monte Carlo errors. The relative efficiency is relative to the complete-data substudy analysis.