Construction and assessment of prediction rules for binary outcome in the presence of missing predictor data using multiple imputation and cross‐validation: Methodological approach and data‐based evaluation

Abstract We investigate calibration and assessment of predictive rules when missing values are present in the predictors. Our paper has two key objectives. The first is to investigate how the calibration of the prediction rule can be combined with multiple imputation to account for missing predictor observations. The second is to propose methods that can be implemented with current multiple-imputation software, while allowing for unbiased predictive assessment through validation on new observations for which the outcome is not yet available. We commence with a review of the methodological foundations of multiple imputation as a model-estimation approach, as opposed to a purely algorithmic description. We specifically contrast the application of multiple imputation for parameter (effect) estimation with predictive calibration. Based on this review, two approaches are formulated: the second utilizes the classical Rubin's rules for parameter estimation, while the first averages probabilities from models fitted on single imputations to directly approximate the predictive density for future observations. We present implementations using current software that allow for validation and estimation of performance measures by cross-validation, as well as imputation of missing data in predictors on the future data, where the outcome is missing by definition. To simplify, we restrict discussion to binary outcomes and logistic regression throughout. Method performance is verified through application to two real data sets. Accuracy (Brier score) and variance of predicted probabilities are investigated. Results show substantial reductions in the variation of calibrated probabilities when using the first approach.

• 100 datasets of 1000 individuals each were simulated.
• For each individual, four covariates (X_j) were drawn from N_4(μ, Σ), with μ_j = 0 and Σ equal to the variance-covariance matrix of four continuous covariates in the CRT dataset.
• For each individual, a binary outcome was generated by means of a draw from a binomial distribution with probability P, where P was determined by the logistic regression model ln(P/(1 − P)) = β_0 + X_1 β_1 + ... + X_4 β_4, in which X_1, ..., X_4 were the predictors.
• Missing values were then introduced into X_1, either according to an MCAR mechanism or a MAR mechanism. For MAR, the probability of being missing depended on the value of X_2 (see the paper for the precise definition). The number of missing values generated is governed by a parameter M, which is set to either a low or a high value to generate fewer or more missing values in X_1.
• Three parameters were varied, as also explained in the paper, each set to either a low (L) or a high (H) level. The tables below (and in the paper) use the same L/H notation to denote the scenarios, which are all 8 combinations of these 3 binary settings. The approaches used are indicated in the tables as A1 for approach 1 and A2 for approach 2. This led to eight scenarios (S1, ..., S8); for example, scenario 1 (LLL) has β_0 = −1.39, β_1 = log(1.1), and 10% missing values.
• Cross-validation was then applied as explained in the main paper, with either K = 1, K = 10 or K = 100 imputations. For all analyses, ten folds were used (L = 10). Logistic regression models with predictors X_1, ..., X_4 were fitted.
• All analyses were replicated ten times.
• As measures of performance, we considered variation, as defined in the main paper ('var'), bias (the difference between the 'true' predicted probabilities given by the generating model and the predicted probabilities in the validation folds after applying the whole procedure of MI and CV), and the Brier score ('BS'). Besides being computed on the full datasets, all measures were also assessed separately for individuals with missing values (M) and for fully observed (FO) individuals. We also add the Monte Carlo standard errors (MC SE), defined by √( 1/(n_sim(n_sim − 1)) Σ_{i=1}^{n_sim} (θ̂_i − θ̄)² ), where n_sim = 100, θ̂_i is the estimate of interest (variance, bias or Brier score) in the i-th simulated dataset (averaged over the replications) and θ̄ is the estimate averaged over the datasets. All statistics shown in the tables have been multiplied by a factor 100 to improve readability.
• The first two approaches introduced in the paper were implemented (A1 and A2).
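The generation steps above can be sketched as follows. This is an illustrative Python translation (the study's own code is in R); the covariance matrix Σ, the coefficients β_2, ..., β_4 and the exact MAR mechanism are placeholders, not the paper's values:

```python
import numpy as np

def simulate_dataset(n=1000, beta0=-1.39, beta1=np.log(1.1), seed=0):
    """Draw four correlated normal covariates and a binary outcome from
    the logistic model ln(P/(1-P)) = beta0 + X1*beta1 + ... + X4*beta4.
    Sigma and beta2..beta4 are illustrative placeholders (the study used
    the CRT covariance matrix)."""
    rng = np.random.default_rng(seed)
    Sigma = np.eye(4)                        # placeholder for the CRT covariance
    X = rng.multivariate_normal(np.zeros(4), Sigma, size=n)
    beta = np.array([beta1, 0.5, 0.5, 0.5])  # beta2..beta4: illustrative values
    P = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
    y = rng.binomial(1, P)                   # one Bernoulli draw per individual
    return X, y, P

def make_mar_missing(X, m=0.5, rng=None):
    """Set X1 missing with probability depending on the observed X2 (MAR).
    The logistic form and the parameter m are illustrative, not the paper's
    exact definition; for MCAR, use a constant missingness probability."""
    if rng is None:
        rng = np.random.default_rng(1)
    p_miss = 1.0 / (1.0 + np.exp(-(X[:, 1] - m)))
    X_obs = X.copy()
    X_obs[rng.random(len(X)) < p_miss, 0] = np.nan
    return X_obs
```

Increasing m in this sketch shifts the missingness probability downward, mimicking the low/high settings of the parameter M.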
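Under these definitions, the summary measures can be sketched as below (again an illustrative Python version of what the R code computes):

```python
import numpy as np

def mc_se(estimates):
    """Monte Carlo standard error of the mean over simulated datasets:
    sqrt( 1/(n_sim*(n_sim-1)) * sum_i (theta_i - theta_bar)^2 )."""
    theta = np.asarray(estimates, dtype=float)
    n = theta.size
    return np.sqrt(np.sum((theta - theta.mean()) ** 2) / (n * (n - 1)))

def brier_score(p_hat, y):
    """Brier score: mean squared difference between predicted
    probabilities and observed binary outcomes."""
    return np.mean((np.asarray(p_hat, dtype=float) - np.asarray(y)) ** 2)
```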

1.2 R code
The files needed to repeat the simulation study are available as supplementary material. The code is structured in such a way that 100 datasets can be generated and analyzed in parallel by making use of a computer cluster. The files can be used as follows:
• 'one_simulation_logreg.R' sets the parameters for generating and analyzing one dataset for each of the eight scenarios. MCAR or MAR and the number of imputations have to be set manually. Data are generated and analyzed in the function 'sim_scenario'.
• 'sim_scenario' is found in 'sim_scenario_MCAR_logreg.R' or 'sim_scenario_MAR_logreg.R'. After the generation of the dataset, data are set to missing according to the chosen mechanism. Each dataset is analyzed 10 times according to the two approaches (A1, A2). The output contains both the generated dataset and the cross-validated predictions under the different approaches for the different replications.
• The tables below have been created by summarizing over replications and simulations. The code is given in 'SummaryMeasures_function_logreg_ext.R' and 'SummaryMeasures_FinalTable_logreg_MCAR_M10.R' (again, MCAR or MAR and the number of imputations are set manually).

Tables
Listing of algorithms and tables presented:
1. Algorithmic description of approach 1 (page 5, corresponding to the graphical representation in figure 1 of the paper)
2. Algorithmic description of approaches 2 and 3 (page 6, corresponding to the graphical representation in figure 2 of the paper)
3. Variance, bias and Brier statistics for the MCAR simulation with K = 1 imputation (table S2)
4. Variance, bias and Brier statistics for the MCAR simulation with K = 10 imputations (table S3)
5. Variance, bias and Brier statistics for the MCAR simulation with K = 100 imputations (table S4)
6. Variance, bias and Brier statistics for the MAR simulation with K = 1 imputation (table S5)
7. Variance, bias and Brier statistics for the MAR simulation with K = 10 imputations (table S6)
8. Variance, bias and Brier statistics for the MAR simulation with K = 100 imputations (table S7)
We repeat the scenarios table (table S1) from the paper for ease of reference.
Approach 1
Compute the final prediction for each i-th individual as the average P_i across all K predictions P_i,k.
Algorithm 1. Algorithmic description of approach 1 for the combination of multiple imputation with cross-validation, using logistic regression modelling for a binary outcome. K represents the number of imputations and L denotes the number of folds in the cross-validation. Note that the folds get re-defined for each new imputation in this approach. β_k denotes the vector of regression coefficients fitted in the calibration fold for the k-th imputation. The notation x_i,k denotes the imputed data record for the set-aside i-th patient in the validation fold for the k-th imputation (observed data are not affected).
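Approach 1 can be sketched compactly as follows. This is an illustrative Python stand-in for the study's R code: 'fit_logistic' is a minimal Newton-method substitute for a standard logistic-regression fit, and 'impute' is a placeholder hook for a proper single-imputation draw:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Minimal Newton (IRLS) fit of a logistic regression with intercept;
    a stand-in for glm(..., family = binomial) in R."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))
        H = Xd.T @ (Xd * (p * (1 - p))[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(H, Xd.T @ (y - p))
    return beta

def predict_proba(X, beta):
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))

def approach1_cv(X, y, K=10, L=10, impute=None, seed=0):
    """Approach 1: for each imputation k, re-draw the fold split, impute
    once, run L-fold CV and store out-of-fold predictions P_{i,k}; the
    final prediction P_i is the average over the K runs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((K, n))
    for k in range(K):
        Xk = impute(X, rng) if impute else X   # single imputation (placeholder hook)
        folds = rng.permutation(n) % L         # folds re-defined per imputation
        for l in range(L):
            test = folds == l
            beta_k = fit_logistic(Xk[~test], y[~test])
            preds[k, test] = predict_proba(Xk[test], beta_k)
    return preds.mean(axis=0)
```

Note how the fold assignment is drawn inside the loop over k, matching the caption's remark that the folds are re-defined for each new imputation.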

Approaches 2 and 3
Define L folds for the CV procedure and select each l-th fold in turn as the validation set, keeping all other data as the calibration set. For each such fold, carry out the following computation.
1. Remove the outcome from the validation data.
2. Combine this outcome-deleted validation data with the corresponding calibration data.
3. Run K imputations on this combined set.
4. Fit separate logistic regression models on the training portion of each of these K imputed datasets.
5. Compute the average β̄ of the K regression coefficient vectors from these models.
6. Derive predictions for the subjects i in the imputed validation sets from the combined model, using the equation P_i,k = 1/(1 + exp(−x_i,k β̄)).
7. Compute the final prediction for each i-th individual as the average P_i across all K predictions P_i,k.
Algorithm 2. Algorithmic description of approach 2 for the combination of multiple imputation with cross-validation, using logistic regression modelling for a binary outcome. K represents the number of imputations and L denotes the number of folds in the cross-validation. Note that the folds get defined first in this approach and are then held fixed. β̄ denotes the Rubin's-rules pooled regression coefficients across imputations, fitted in the calibration fold.
x_i,k denotes the imputed data record for the set-aside i-th patient in the validation fold for the k-th imputation. Approach 3 is identical to approach 2, except that the individual imputations x_i,k are replaced with the Rubin's-rules pooled imputation x̄_i, such that only a single prediction P_i is generated.

Figure S11: Average bias measures (Bias). The four rows of plots, from top to bottom, correspond to simulation scenarios 1 to 4, MAR (tables S5 to S7). The two left columns of plots show results from approaches 1 and 2 versus the number of imputations used in the calibration of the predictors. The right-hand column of plots displays the corresponding relative reductions for approach 1 relative to approach 2. See table S1 for a description of the scenarios.

Figure S12: Average bias measures (Bias). The four rows of plots, from top to bottom, correspond to simulation scenarios 5 to 8, MAR (tables S5 to S7). The two left columns of plots show results from approaches 1 and 2 versus the number of imputations used in the calibration of the predictors. The right-hand column of plots displays the corresponding relative reductions for approach 1 relative to approach 2. See table S1 for a description of the scenarios.
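Approaches 2 and 3 can be sketched in the same illustrative style (a Python stand-in for the study's R code; 'impute' is a placeholder for a proper multiple-imputation draw on the combined calibration and outcome-deleted validation data, and 'fit_logistic' is a minimal substitute for a standard logistic fit):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Minimal Newton (IRLS) logistic regression with intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))
        H = Xd.T @ (Xd * (p * (1 - p))[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(H, Xd.T @ (y - p))
    return beta

def predict_proba(X, beta):
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))

def approach23_cv(X, y, K=10, L=10, impute=None, seed=0):
    """Approaches 2/3: folds are fixed up front; within each fold the K
    imputations yield K coefficient vectors that are pooled by averaging
    (the Rubin's-rules point estimate).  Approach 2 averages the K
    predictions made with the pooled beta; approach 3 predicts once from
    the averaged imputed records."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % L          # folds defined once, then held fixed
    p2, p3 = np.zeros(n), np.zeros(n)
    for l in range(L):
        test = folds == l
        betas, X_test_imp = [], []
        for k in range(K):
            # imputation would run on the combined (outcome-deleted
            # validation + calibration) data; `impute` is a placeholder hook
            Xk = impute(X, rng) if impute else X
            betas.append(fit_logistic(Xk[~test], y[~test]))
            X_test_imp.append(Xk[test])
        beta_bar = np.mean(betas, axis=0)   # Rubin's-rules pooled coefficients
        # approach 2: average the K predictions made with the pooled beta
        p2[test] = np.mean([predict_proba(Xt, beta_bar) for Xt in X_test_imp], axis=0)
        # approach 3: single prediction from the pooled imputed records
        p3[test] = predict_proba(np.mean(X_test_imp, axis=0), beta_bar)
    return p2, p3
```

With fully observed data (no imputation) the two returned prediction vectors coincide; they differ only through the variability of the K imputed validation records.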