A comparison of strategies for selecting auxiliary variables for multiple imputation

Abstract Multiple imputation (MI) is a popular method for handling missing data. Auxiliary variables can be added to the imputation model(s) to improve MI estimates. However, the choice of which auxiliary variables to include is not always straightforward. Several data‐driven auxiliary variable selection strategies have been proposed, but there has been limited evaluation of their performance. Using a simulation study we evaluated the performance of eight auxiliary variable selection strategies: (1, 2) two versions of selection based on correlations in the observed data; (3) selection using hypothesis tests of the “missing completely at random” assumption; (4) replacing auxiliary variables with their principal components; (5, 6) forward and forward stepwise selection; (7) forward selection based on the estimated fraction of missing information; and (8) selection via the least absolute shrinkage and selection operator (LASSO). A complete case analysis and an MI analysis using all auxiliary variables (the “full model”) were included for comparison. We also applied all strategies to a motivating case study. The full model outperformed all auxiliary variable selection strategies in the simulation study, with the LASSO strategy the best performing auxiliary variable selection strategy overall. All MI analysis strategies that we were able to apply to the case study led to similar estimates, although computational time was substantially reduced when variable selection was employed. This study provides further support for adopting an inclusive auxiliary variable strategy where possible. Auxiliary variable selection using the LASSO may be a promising alternative when the full model fails or is too burdensome.


Introduction
Missing data are often encountered in medical research.Many studies aim to estimate an expected value (e.g. a mean or proportion) or an exposure-outcome association.Restricting the analysis to individuals with available data, i.e., analysing "complete cases," can lead to bias or loss of precision in these estimates compared to if all data were observed.[1] Multiple imputation (MI) is a popular two-stage method for handling missing data that can produce valid estimates and standard errors of target quantities under relaxed assumptions regarding the mechanism leading to missing data.[2,3] In the first stage of MI, missing data are imputed multiple times with random draws from the predictive distribution of the missing values given the observed data and a specified imputation model.In the second stage, the statistical analysis of interest is applied to each imputed dataset and the results are combined using Rubin's rules to obtain a single estimate with associated standard error.[2] When and how MI should be applied depends on the target of analysis, the reasons for missing data and whether any of the missing information can be recovered.[4] MI is usually carried out using either multiple imputation by chained equations (MICE) or multivariate normal imputation (MVNI).MICE, also known as "fully conditional specification", "regression switching" or "sequential regression multiple imputation", is a flexible MI approach in which univariate imputation models are specified for each variable with missing data.[5] Imputed values for each variable with missing data are then generated using these models, one variable at a time, until all missing values are replaced (one cycle).The algorithm carries out a number of cycles before obtaining one imputed dataset, and then this procedure is repeated m times to obtain m sets of complete data that differ in their imputed values.The second approach, MVNI, is based on the assumption that all incomplete variables jointly follow a multivariate normal distribution.Imputations are obtained from this model using the Data Augmentation algorithm.[6] Although the availability of MI procedures in statistical software such as R, Stata and SAS has made MI widely accessible and easy to implement in practice, the user still needs to make a number of decisions in order to carry out an analysis using MI, including what variables to include in the imputation model and in what form.[7] Best practice is to include all variables that appear in the main analysis, in the same form, as predictors in the imputation model.[8] Failure to do so could introduce bias in MI estimates.A benefit of MI is that additional variables that do not appear in the main analysis, known as "auxiliary variables", can be added to the imputation model to improve performance by reducing bias and/or increasing precision.[9,10] Candidate auxiliary variables include variables that are related to the variables with missing data and possibly also related to the missingness of the variables with missing data.[6] The extent to which auxiliary variables reduce bias or improve precision in MI estimates depends on the missingness mechanism, the target estimand and the relationship between the auxiliary variables and the variables with missing values.[9,11] The choice of which auxiliary variables to include in the imputation model is not always straightforward.On one hand, imputations should not be created using a model that is more restrictive than needed, [3] but on the other, including too many auxiliary variables in an imputation model can lead to a number of problems in practice.[12] Imputation algorithms may automatically drop variables or fail completely due to numerical problems such as perfect prediction, where categories of the response variable are perfectly separated by a linear combination of the covariates, or collinearity, where covariates are highly correlated.[13,14] These problems often arise in large-scale longitudinal studies where there are many potential auxiliary variables to choose from.Even when imputations are successfully produced, after a sufficient number of suitable auxiliary variables have been included in the imputation model the benefits of adding additional auxiliary variables may be small.[15] One study found that including too many auxiliary variables can have a detrimental effect on the bias and precision of MI estimates, which led to the recommendation that the number of variables included in the imputation model should be no more than a third of the number of cases with complete data.[11] Others suggest that no more than 15-25 variables are needed in the imputation model in practice.[7] These arguments motivate the desire to employ an auxiliary variable selection strategy when there are a large number of potential auxiliary variables available.
Several strategies have been proposed to specify an imputation model for an incomplete variable using a reduced set of auxiliary variables.Van Buuren et al. propose a four-step strategy for quick, data-based selection of variables to be included in the imputation model for the incomplete variable.[3,7] In short, these steps are as follows.First, include all variables that appear in the analysis model.Then add variables that are related to the nonresponse in the incomplete variable according to a suitable criterion.Next, add variables that explain a considerable amount of variance in the incomplete variable.Lastly, remove any variables in steps two and three that have too many missing values.Variables in steps two and three can be chosen based on their correlation with the missingness indicator of the incomplete variable and their correlation with the incomplete variable, respectively, while variables in step four can be identified by a criterion such as the proportion of usable cases (i.e., the proportion of observed cases in the variable being selected within the subset of cases where the incomplete variable is unobserved [7]).This strategy has been implemented in the quickpred function in the R package mice to specify imputation models for each incomplete variable in the MICE procedure, [16] and has been adopted in practice to choose an imputation model when the full model is not feasible.[17,18] A similar strategy is given by Graham, [15] with guidance on correlation cut-offs for inclusion of variables (0.4 if the variable is a cause of missingness, 0.5 otherwise, p.197) but it is noted that these can be reduced to include more variables if there are a relatively small number of variables in the analysis.Howard et al. [19] suggest using principal components of auxiliary variables in the imputation model, instead of the auxiliary variables themselves.Strictly speaking, this approach is a data reduction method rather than a variable selection method, but we included it in this comparison since it has also been adopted in practice to specify the imputation model, [20,21,22,23,24,25,26] with the R package PcAux available to aid its implementation.[27] Other strategies for selecting auxiliary variables utilise stepwise selection, t-tests or penalised regression.The ice package in Stata has an option to construct imputation models for each incomplete variable using stepwise selection.[28] An additional step is added between the initialization of the MICE algorithm and performing the imputation, where the stepwise command is used to select variables for each of the univariate imputation models.A related approach proposed by Andridge and Thompson uses forward selection with the fraction of missing information due to nonresponse (FMI, estimated as the ratio of between-imputation variance to total variance for a given estimator [1]) for the mean of the incomplete variable as the selection criterion for inclusion of auxiliary variables in the imputation model for that variable.[29] Dixon described the implementation of two sample t-tests to identify variables that are associated with missingness in another incomplete variable.[30] The sample is split into two groups based on the presence or absence of values of the incomplete variable.T-tests are then applied to the other variables using this grouping, with large test statistics providing evidence that the probability of being missing is not the same for all cases, i.e., data are not missing completely at random (MCAR).[31] Enders suggests results from such t-tests can be used to identify, and limit the number of, potential auxiliary variables.[32] Another approach is to combine penalised regression with multiple imputation.For example, the Least Absolute Shrinkage and Selection Operator (LASSO) can be used to simultaneously perform variable regularisation and selection.[33] This approach has been used for dimension reduction in the context of high-dimensional data.[34] The aim of this study was to provide a comprehensive comparison of strategies used to select the set of auxiliary variables for the imputation model in MI.Motivated by a case study based on the Longitudinal Study of Australian Children (LSAC), we restrict attention to the scenario where there is an incomplete continuous outcome, a completely observed continuous exposure, a set of completely observed continuous auxiliary variables and no confounding.We use a simulation study to evaluate the performance of eight available strategies for selecting auxiliary variables for an imputation model: (1, 2) two versions of the four-step strategy provided by Van Buuren et al. [3,7]; (3) using principal components of auxiliary variables; (4,5) forward and forward stepwise selection; (6) forward selection based on the FMI; (7) hypothesis tests of the MCAR assumption and (8) the LASSO.We compare these eight auxiliary variable selection strategies against two "benchmark" analysis strategies that do not use any auxiliary variable selection: complete case analysis and MI using all auxiliary variables (the "full" imputation model).We investigate three scenarios of particular interest, including a scenario in which the full imputation model is likely to encounter problems.We also illustrate the performance of these strategies by applying them to the motivating LSAC case study.In Section 2 we describe the motivating case study, the design of the simulation study and the auxiliary variable selection strategies that were evaluated.Results are provided in Section 3. A discussion of these results and a summary of our key messages are provided in Section 4.

Motivating example: The Longitudinal Study of Australian Children
The Longitudinal Study of Australian Children (LSAC) is a large-scale longitudinal study, with the aim of investigating the effect of a child's environment on their well-being throughout life.[35] Two cohorts were recruited into the study in 2003: the baby "B" cohort, consisting of participants aged 0-1 years at wave 1, and the kindergarten "K" cohort, consisting of participants aged 4-5 years at wave 1.This case study uses data from the 4983 children in the K cohort of LSAC to examine the association between body mass index (BMI) z-score at 4-5 years of age and health-related quality of life (HRQoL) problems at 12-13 years of age.This is a simplified version of a published analysis.[36] The analysis of interest aimed to estimate the effect of BMI z-score on HRQoL using a linear regression model with adjustment for potential confounders.The outcome of interest, HRQoL, was measured using the Total Scale Score of the Paediatric Quality of Life Inventory (PedsQL) collected at wave 4. [37] The PedsQL scale is made up of either 21 or 23 items (depending on the child's age), each of which is measured on a 5-point Likert scale with 1 corresponding to "never a problem" and 5 corresponding to "always a problem".The Total Scale Score is calculated as the average of the observed scale items, after a reverse linear transformation of items in the following manner: 1=100, 2=75, 3=50, 4=25 and 5=0.
The exposure was BMI z-score at wave 1. BMI at each wave was calculated using height and weight measurements taken by an interviewer.These measurements were standardised by age and sex to derive the BMI z-score.Several potentially confounding covariates were measured at wave 1: child sex, child age, whether the child was Aboriginal or Torres Strait Islander, whether the child had a non-English speaking background and the socioeconomic position of the child's family.
The parameter of interest was the coefficient of BMIz in the following linear regression model, adjusted for potential confounders: where Female is an indicator for whether the child is female, Age is the age in months for the child at wave 1, IndStat is an indicator for whether the child is Aboriginal or Torres Strait Islander, NonEng is an indicator for whether the child has a non-English speaking background and SEP is a standardised score representing the socioeconomic position of the child's family at wave 1.The HRQoL outcome was missing for 17.4% of participants.A small percentage of participants were also missing values for BMIz, IndStat, NonEng and SEP (Supporting Information Table 1).We have shown previously that individuals with missing data on one or more variables in the analysis model had lower mean HRQoL and SEP, and were more likely to be Aboriginal or Torres Strait Islander or come from a non-English speaking background than individuals with complete data on all variables.[38] Furthermore, a large number of potential auxiliary variables were available in the dataset.The assumption that non-response in HRQoL does not depend on HRQoL itself after conditioning on variables in the imputation model (under which MI is valid) is more plausible when the conditioning set includes auxiliary variables, and even more so if there are a large number of these.Thus MI was identified as a more appropriate method of estimating β 1 than a complete case analysis, which conditions only on variables in the analysis model.We note that it may be the case that missingness in HRQoL depended on HRQoL even after conditioning on analysis and auxiliary variables, i.e.HRQoL is "missing not at random", but we do not consider this further.
A subset of auxiliary variables for inclusion in the imputation model was initially identified across the four waves of data collection based on substantive knowledge.) with m = 20 imputations including all seven variables in the analysis model, as well as all auxiliary variables listed above.Continuous variables (including SDQ and GHM) were imputed using linear regression, except for HRQoL, which was imputed using predictive mean matching (PMM, a semiparametric approach where values are imputed using random draws from the observed data of a set of donors with similar predicted means [39]) with 5 donors because this variable exhibited negative skew.Binary variables were imputed using logistic regression and PedsQL items were imputed using ordinal logistic regression.The augment option was used to perform augmented regression in the presence of perfect prediction for all categorical imputation variables.This imputation procedure failed when fitting an ordinal logistic regression model with the message "convergence not achieved" and no imputations were produced.

Simulation study
A simulation study was conducted to evaluate the performance of a range of strategies for selecting auxiliary variables for inclusion in the univariate imputation models of the MICE procedure.For simplicity, we focused on the setting where there is one incomplete continuous outcome, one completely observed continuous exposure, a set of completely observed continuous auxiliary variables and no confounding.We considered three scenarios of interest: 1. Basic: This scenario was designed to compare the auxiliary variable selection strategies in a scenario where one would not expect the MI algorithm to fail.It is characterised by a large sample size relative to the number of auxiliary variables, moderate missingness mechanism and a moderate (but realistic) proportion of missing data in the outcome.
If auxiliary variable selection strategies do not perform well in this scenario, we would not expect them to perform well in general.
2. Extreme: This scenario was included to compare the auxiliary variable selection strategies in a scenario where including all auxiliary variables in each univariate imputation model would be expected to be problematic.It is characterised by a small sample size relative to the number of auxiliary variables, a strong missingness mechanism and a large proportion of missing data in the outcome.
3. Realistic: This scenario was included to compare the auxiliary variable selection strategies in a realistic scenario.The sample size, number of auxiliary variables, missingness mechanism and proportion of missing data were based on what was observed in the motivating LSAC case study.

Simulation of complete data
Let X denote a fully observed continuous exposure variable, Y denote a continuous outcome variable with missing values and A = (A 1 , A 2 , . . ., A p ) denote a p-vector of potential auxiliary variables.Also let 0 c denote a c-vector of 0's.We generated data on X, Y and A for n individuals in two steps.In the first step we generated data for (X, A ) by randomly drawing n samples from a multivariate normal distribution with mean 0 p+1 and covariance matrix where Σ A is a p × p covariance matrix defined for each scenario below.In the second step we generated Y from the linear regression model where β A is a p-vector of regression parameters and ε ∼ N (0, σ 2 ε ).For the Basic scenario, we set n = 1000, p = 16 and Σ A = I p , where I p is the p × p identity matrix.For the Extreme scenario, we set n = 250, p = 50, and Σ A = I p .Since the strength of relationship between an auxiliary variable and a variable with missing data is a key consideration for inclusion in the imputation model, [3,15,11] we designed these two scenarios such that there was a range of realistic relationships between Y and each auxiliary variable, conditional on X and all other auxiliary variables.For the Basic scenario, this was achieved by setting β A = (0 4 , 0.11 4 , 0.21 4 , 0.41 4 ) , where 1 d denotes a d-vector of 1's.
For the Extreme scenario, we added 34 additional "junk" auxiliary variables in a similar manner to Collins, Schafer and Kam, [9] by setting β A = (0 38 , 0.11 4 , 0.21 4 , 0.41 4 ) .For ease of interpretation, we specified σ 2 ε such that β A was the vector of correlations between Y and A. This imposes restrictions on the magnitude of the values that β A can take (equivalent to requiring the covariance matrix of (Y, X, A ) to be positive semidefinite).
In particular, we could not consider many highly correlated auxiliary variables.However, very large correlations were not observed in the case study (Supporting Information Figure 1).For the Realistic scenario, we set n = 4983, p = 81 and Σ A based on the features of the LSAC motivating example.In this scenario, β A and σ 2 ε were specified such that the vector of correlations between Y and A was the same as the estimated vector of correlations between Y and A in the LSAC case study.
We generated 2,000 datasets so that the Monte Carlo standard error for the estimated coverage probability of the 95% confidence intervals for our estimands (see Section 2.2.3) was less than 0.5%.Datasets were generated in R. Data for (X, A ) were generated using mvrnorm, which utilises an eigenvalue decomposition of the covariance matrix.

Imposing missing data
After generating the complete data, values of Y were set missing with a probability determined by the logistic regression model where M Y denotes the missingness indicator for Y and (γ 0 , γ A ) ∈ R p+1 determines the proportion of missing data, which auxiliary variables are related to M Y and the strength of missingness mechanism.
Since we investigated four different strengths of relationship between auxiliary variables and Y in the Basic and Extreme scenarios (via the elements of β A ), for these scenarios we set γ A such that half of the auxiliary variables with each level of relationship with Y were associated with M Y .For simplicity, the strength of association with missingness was the same for each of these auxiliary variables.For the Basic scenario, values of Y had a 20% chance of being missing and the odds of missingness increased by 20% with every one standard deviation increase in each of the auxiliary variables chosen to be associated with missingness (holding the other auxiliary variables fixed).This was achieved by setting the required elements of γ A to either 0 or log(1.2).The missingness probability and strength of missingness mechanism (i.e., the non-zero values of γ A ) in this scenario were chosen based on what was observed in the LSAC motivating example (Supporting Information Table 1).
For the Extreme scenario, values of Y had a 50% chance of being missing, and the odds of missingness increased by 100% with every one standard deviation increase in each of the auxiliary variables chosen to be associated with the missingness probability (holding the other auxiliary variables fixed).Since this scenario was included to explore the robustness of the auxiliary variable selection strategies when the full imputation model is likely to encounter difficulties, the missingness mechanism was intentionally chosen to be stronger than what was observed in the case study.For the Realistic scenario, we specified γ A based on the estimated odds ratio of missingness in HRQoL obtained from univariate logistic regressions of the missingness indicator for HRQoL on the standardised auxiliary variables within the LSAC case study (Supporting Information Table 1).Elements of γ A were set to log(1.2) if the corresponding odds ratio was greater than 1.15, log(0.8)if the odds ratio was less than 0.85 and 0 otherwise.In Supporting Information Section 2.1 we describe how γ 0 was calculated to achieve the desired proportions of missingness in Y for each of the simulation scenarios.

Estimands of interest
We considered two estimands: the regression coefficient of X on Y , i.e., the parameter β X in the following linear regression model and the marginal mean of Y , µ Y = E(Y ).The true values of these parameters were taken to be the values used in the data generation.That is, β X = 0.3 and µ Y = 0.
The motivation for using MI rather than a complete case analysis may be to reduce bias or increase precision.For the missingness mechanism considered here, the complete case estimate of µ Y will be biased, while the complete case estimate of β X will be unbiased.[40] These points are illustrated graphically Section 2.2 in the Supporting Information, and can also be explained using causal diagrams.[4] By considering both estimands, we are able to gain insight on the extent to which bias is improved or introduced by the various MI strategies.We are also interested in assessing gains in precision of these estimates from each MI strategy.

Auxiliary variable selection strategies
We compared the performance of eight data-based strategies for selecting auxiliary variables for an imputation model and two benchmark strategies.For each auxiliary variable selection strategy, the variable selection was done once, prior to the imputation step.The auxiliary variable selection strategies were: 1. Quickpred-pt2: This approach uses the four-step selection strategy proposed by van Buuren et al. [7] For each i = 1, . . ., p, A i is included in the imputation model for Y if the maximum of the absolute correlation between A i and Y (using all available cases) and the absolute correlation between A i and M Y is greater than 0.2.This strategy was implemented using the quickpred function from the R package mice.[16] The relatively low correlation of 0.2 was chosen as a reasonable cut-off for an inclusive strategy.
2. Quickpred-pt4: Similar to Quickpred-pt2, except with the correlation cut-off set to 0.4 instead of 0.2.This cut-off was chosen based on the rule of thumb provided by Graham.[15] 3. PcAux: This approach uses principal components of auxiliary variables as predictors in the imputation model instead of the auxiliary variables themselves.[19] The number of principal components was chosen such that the principal component scores explained ≥ 40% of the variance in the A i 's, following Howard et al. [19] This strategy was implemented using PcAux in R. [27] 4. Forward: Under this approach, auxiliary variables were chosen using forward selection.
[28] An auxiliary variable was added to the imputation model for Y at each step of the forward selection algorithm if the p-value from a Wald test on the corresponding regression coefficient was less than 0.05.This strategy was implemented in Stata using the stepwise option in ice.

5.
Forward-sw: This approach is similar to Forward, except that at each step auxiliary variables could either be added to the imputation model as per above, or removed if the p-value from the Wald test dropped below 0.05.It was also implemented using ice in Stata.7. Tests: Under this approach, auxiliary variables were selected for inclusion in the imputation model for Y if the p-value obtained from a t-test for a given auxiliary variable, carried out using M Y to define the two groups, was less than 0.05.

LASSO:
In this approach, a model for Y that included X and all auxiliary variables was fitted using the LASSO with 10-fold cross-validation.[42] The regularisation penalty (usually denoted by λ) was chosen to be the value that gave the most regularised model such that the cross-validation error was within one standard error of the minimum.[33] The auxiliary variables that were included in the fitted model were included in the imputation model for Y , i.e., the LASSO was used to select variables for the imputation model but not to estimate parameters of the imputation model.This is similar to the "indirect use of regularised regression" approach described elsewhere.[34] This strategy was implemented using glmnet.
The benchmark strategies were: 1. CCA: A complete case analysis.The subset of individuals included in the analysis was restricted to those with data for Y .
2. Full: All auxiliary variables were included in the imputation model for Y .
To be useful in practice, an auxiliary variable selection strategy should perform better than the CCA strategy by either reducing bias and/or increasing precision (see Discussion section for further comments).Ideally the auxiliary variable selection strategy would also perform no worse than the Full strategy if the imputation algorithm for this strategy does not fail, and successfully produce imputations in the case that it does fail.
All imputation models included X and m = 50 imputations were performed, which was chosen to be equal to the percentage of missing data in Y in the Extreme scenario.[28] MI estimates were obtained using Rubin's rules.[2] Unless stated otherwise, MI was implemented using mice with linear regression in R. [16]

Performance measures
The performance of the 10 analysis strategies (eight auxiliary variable selection strategies and two benchmark strategies) for estimating β X and µ Y was compared for each of the three simulation scenarios using bias, empirical standard error (SE), average model SE, and coverage probability of 95% confidence intervals.We used the definitions of these performance measures, as well as the formula for Monte Carlo SE estimates, that are given in Table 6 of Morris et al. [43] We also report standardised bias, defined as (bias / empirical SE) × 100, relative bias for β X , defined as (bias / true parameter value) × 100 (this can not be calculated for µ Y = 0), the relative error in the model SE, defined as (average model SE / empirical SE -1) × 100, and the convergence rate, defined as the proportion of datasets for which MI estimates were successfully obtained.

Application of strategies to LSAC example
To illustrate their performance in practice, we applied each of the auxiliary variable selection strategies considered in the simulation study to the LSAC case study described in Section 2.1.We also carried out the benchmark strategies CCA and Full (in R).We used MICE with the same univariate regression models that were specified in Section 2.1.All imputation models included all analysis model variables (to ensure congeniality), with auxiliary variables chosen according to the analysis strategy.A total of 20 imputations were performed, with 10 iterations for each imputation.In the next paragraph, we outline how each auxiliary variable selection strategy was implemented to accommodate the additional complexities of multivariate missingness and different types of variables.
For the Quickpred-pt2 and Quickpred-pt4 strategies, the quickpred function was used to select the auxiliary variables in the imputation model for each of the incomplete variables using the correlation cut-offs of 0.2 and 0.4, respectively.The PcAux strategy required imputation of the missing values in the auxiliary variables prior to obtaining principal components.This was done using the PcAux package, [27] which implements a single imputation using PMM within the MICE approach.The first eight principal components of the auxiliary variables, which explained approximately 44% of the variance in these variables, were included in the imputation models for the remaining five incomplete variables in the analysis model.For the Forward and Forward-sw strategies, the stepwise algorithm was used to select the auxiliary variables in the imputation model for each of the incomplete variables.
Dummy variables for the ordinal PedsQL items were grouped during the variable selection step to ensure they were either all included in, or excluded from, the imputation model.
The Forward-FMI method does not currently handle missing values in auxiliary variables or highly negatively skewed ordinal variables.[29] Rather than extend this methodology (which is beyond the scope of this study) we deemed this method not suitable to use in this case study.For the Tests strategy, t-tests (for continuous variables) or chi-squared tests (for categorical variables) were used to select the auxiliary variables in the imputation model for each of the incomplete variables, with the two groups to be compared determined by the missingness indicator of the incomplete variable.An auxiliary variable was selected for inclusion in the imputation model for the incomplete variable if the p-value obtained from the hypothesis test was less than 0.05 and a sufficient number of observations were available in each group (≥ 20 for the t-test and all expected counts ≥ 5 for the chi-squared test).For the LASSO strategy, we wanted to implement the grouped LASSO to ensure dummy variables for Ped-sQL items were either all included or excluded from the imputation model (c.f.Forward and Forward-sw).However, it is not clear how to apply the grouped LASSO with ordinal outcomes.Furthermore, the LASSO requires predictors (here the analysis model variables and the candidate auxiliary variables) to be complete.Rather than imputing the auxiliary variables, we proceeded as follows.A model for HRQoL that included all variables in the analysis model and all candidate auxiliary variables was fitted to the complete cases using the grouped LASSO with 10-fold cross-validation (implemented using the gglasso package in R).The regularisation penalty was chosen to be the value that gave the most regularised model such that the cross-validation error was within one standard error of the minimum.
Each auxiliary variable selected using the LASSO was included in the imputation model for each of the incomplete variables in the analysis model, as well as the imputation model for all other incomplete auxiliary variables selected by the LASSO.Auxiliary variables not selected by the LASSO were excluded from the imputation process altogether.

Simulation study
Mean of Y Full results are presented in Table 2 in the Supporting Information.There were no problems with convergence of any of the analysis approaches, with estimates of µ Y (and β X ) produced on all simulation runs.As expected, estimating µ Y using complete cases led to the largest bias, while using MI with all auxiliary variables in the imputation model was unbiased across all scenarios.Within each scenario, all auxiliary variable selection strategies led to smaller absolute bias than the complete case analysis, with absolute bias from the Quickpred-pt2, Forward-FMI, Tests and LASSO strategies less than one third of the absolute bias from the complete case analysis in each scenario.The smallest standardised bias was observed for the Full and LASSO strategies (less than 30% in each scenario, an amount which is considered not to be problematic by Collins, Schafer and Kam).[9] MI with all auxiliary variables in the imputation model led to lower empirical SEs than the complete case analysis across each scenario, as did the Quickpred-pt2 and LASSO strategies.The remaining MI strategies had larger empirical SEs compared to the complete case analysis in at least one of the scenarios (with PcAux producing larger empirical SEs in all three scenarios).All strategies in the Basic and Realistic scenarios produced average model SEs that underestimated the empirical SEs, although the absolute value of the relative error in the model SE did not exceed 10%.In the Extreme scenario, relative error in the model SE was underestimated by more than 10% for the Quickpred-pt4, Forward and Forward-sw strategies (-16%, -20% and -19%, respectively).Estimating µ Y using complete cases led to the lowest coverage probability in each scenario (ranging from 0% in the Realistic scenario to 70% in the Basic scenario), followed by the Quickpred-pt4 strategy (ranging from 4% in the Realistic scenario to 80% in the Basic scenario).In comparison, the Full, Tests and LASSO strategies had reasonable coverage probability in each scenario (between 93-96%).Note that comparing coverage probabilities across scenarios is difficult due to different sample sizes.Full results are presented in Table 2 in the Supporting Information.As expected, both the complete case analysis and MI with all auxiliary variables resulted in negligible bias in all scenarios.All other MI strategies also resulted in negligible bias, with the largest absolute biases (which were obtained in the Extreme scenario) amounting to less than 4% of the true parameter value and all standardised biases less than 16%.All MI strategies produced smaller empirical SEs than the complete case analysis, with gains (absolute differences) in precision of up to 0.0035 in the Basic scenario, 0.019 in the Extreme scenario and 0.0006 in the Realistic scenario.The model SE was a reasonable estimate of the empirical SE for all strategies and scenarios.The relative error in the average model SE did not exceed 2% across analysis strategies in the Basic scenario, 7% in the Extreme scenario or 4% in the Realistic scenario.All coverage probabilities were between 94% and 96%.

Selected auxiliary variables
Table 1 provides further information on the auxiliary variables selected by the analysis strategies on each simulation run.This table does not include PcAux since this strategy included auxiliary variable information via principal component scores.In the Basic and Extreme scenarios, all strategies except Tests and Quickpred-pt4 included auxiliary variables that had a correlation ≥ 0.4 with Y and were not related to missingness on most (> 90%) of the simulation runs.The Tests strategy selected these variables relatively rarely.The LASSO strategy (1) was the most inclusive strategy overall (determined by the total average number of auxiliary variables included) and (2) included the largest number of auxiliary variables that were neither correlated with Y or related to M Y , although this number was still relatively low compared to the number of auxiliary variables in this category.In the Realistic scenario, the Tests strategy was the most inclusive overall, followed by Quickpred-pt2 and then LASSO.We compared the performance of eight data-based strategies for selecting auxiliary variables for inclusion in the imputation procedure when MI was used to estimate the mean outcome and an exposure-outcome association when the outcome had missing values, under three scenarios of interest.We also carried out the analysis using complete cases and MI with all auxiliary variables in the imputation model.MI with the full imputation model performed well across all scenarios considered, regardless of whether the target of analysis was the exposure-outcome association or the mean outcome, and did not fail to converge despite the Extreme scenario.All auxiliary variable selection strategies led to negligible bias and smaller empirical standard errors than the complete case analysis when estimating the exposure-outcome association in the simulation study.When estimating the mean, the best performing strategy (based on bias and coverage probability across all scenarios) was the LASSO.This strategy had the advantage of incorporating all auxiliary variable information when selecting variables for the imputation model (rather than information from multiple pairwise comparisons).However, it was not clear how to best extend this approach to the case study where there were the additional complexities of missing data in auxiliary variables and highly skewed ordinal variables.

LSAC example
This study adds to the limited body of research on auxiliary variable selection for MI.
Early work in this area illustrated that auxiliary variables can improve the performance of MI and led to the recommendation of an inclusive strategy for variable selection.[9] However, the study by Collins et al. considered just two auxiliary variables that were correlated with either the incomplete variable or the missingness of the incomplete variable.In large-scale longitudinal studies there are often many potential auxiliary variables to choose from.The results of the current study provide support for the inclusive auxiliary variable strategy when there are a larger number of auxiliary variables and a mix of realistic relationships between these variables and an incomplete outcome.This is illustrated, for example, by the Full strategy performing better than the Quickpred-pt2 strategy, which in turn performs better than the Quickpred-pt4 strategy in terms of bias across scenarios.The quickpred strategies are attractive in practice due to their simplicity, the ease at which they extend to more complicated problems (such as the LSAC case study) and their implementation in statistical software.In contrast, although the Forward-FMI strategy performed reasonably in this study, there is currently no software freely available to apply it in practice and the strategy does not easily extend to complicated problems.Our results also show that if one employs the Tests strategy, auxiliary variables that are strongly associated with the incomplete variables may be omitted from the imputation model, which results in larger standard errors (relative to other MI strategies).
Although appealing, employing an inclusive strategy may be problematic when there are many auxiliary variables available.We encountered convergence problems when analysing the case study with Stata using the Full, Forward and Forward-sw strategies.Results were obtained when the Full strategy was implemented in R, but with a number of logged events reported on each iteration of the imputation algorithm.Others have also noted convergence problems when using MICE in Stata, [13,44] and one simulation study using MICE in both Stata and R found that, although R did not encounter the same convergence problems as Stata, the estimates obtained in R were less accurate than estimates from a MVNIbased approach.[45] The lack of convergence problems in our simulation study is likely due to the simulated datasets containing only continuous variables (and only one incomplete variable), with no very high correlations between these variables.Instead of reducing or modifying the set of auxiliary variables, one could try several alternative approaches to produce imputations when the MI algorithm fails.These include changing the imputation model form, e.g., instead of ordinal logistic regression to impute ordinal variables, one could try linear regression or PMM; changing the imputation method, e.g. for missing data in more than one variable, MVNI could be used instead of multiple imputation by chained equations [46]; or changing the level at which variables are imputed (applicable for derived variables or scale scores).[13,38] However, these approaches also have limitations.Perhaps the most effective approach for overcoming challenges with the MI algorithm would be to identify and then remove from the imputation model the problematic variables.Although, if these appear in the substantive analysis then uncongeniality of the imputation model and substantive model becomes a concern.We did not consider such alternative approaches in this study.
Many auxiliary variable selection strategies have been proposed and implemented in statistical software, but there has been little empirical evidence to evaluate these strategies even though data-based variable selection strategies are known to have limitations in other contexts.[47,48] [9,11] We did not observe bias from our Full analysis strategy in any scenario considered.However, our study was not set up to address limits on the number of auxiliary variables that should be included.The results of the current study also reiterate that the decision of whether to apply MI should include careful consideration of whether the estimand can be estimated without bias from the available data.Causal diagrams that include missingness indicators are a useful tool to aid this decision.[4] If the objective of a study is to estimate an exposure-outcome association when there is missing data in the outcome and no useful auxiliary variables are available, a complete case analysis may be sufficient.In this paper we considered an auxiliary variable strategy to perform better than the complete case analysis if it reduced bias and/or increased precision compared to the complete case analysis.However, we note that there are cases in which MI will (appropriately) result in larger standard errors than a complete case analysis as the uncertainty in the missing values is taken into account.
There are many variations on the ways in which the auxiliary variable selection strategies may be implemented that were not considered in this paper.For the quickpred strategies, one could consider adjusting the correlation cut-off such that a specified number of auxiliary variables with the highest correlations are chosen or exclude auxiliary variables with too many missing values.[3,7] For the PcAux strategy, the number of principal components to include in the imputation model was chosen such that the principal component scores explained ≥ 40% of the variance in the auxiliary variables.One may consider a different proportion of variance explained or employ other methods for specifying the number of principal components to be used.The pcaux function provides an option (nComps = Inf) to use the smallest number of component scores such that adding one more component score does not make a discernable difference in the amount of variance explained.Also, the pvalue threshold may be changed in the Forward, Forward-sw and Tests strategies to include different subsets of auxiliary variables.Finally, we used the LASSO for variable selection only, since the aim of the study was to compare methods used to select auxiliary variables for MI.However, there are alternative ways to incorporate the LASSO with MI that may perform better.[34] This study provides a natural starting point for evaluating a range of auxiliary variable selection strategies for MI.However, several limitations of this study should be noted.
Firstly, all variables in our simulation study were continuous and normally distributed.When there are a large number of categorical auxiliary variables (such as in the case study), perfect prediction can cause problems with MI.Others have considered how to do MI in this scenario.[13,38] For simplicity, we considered the exposure and auxiliary variables to be independent which may not be realistic in practice.An alternative approach may be to generate data using causal diagrams and a series of regression models.We also acknowledge that, (although not the case in the current study) including auxiliary variables may introduce bias into the MI estimates because of underlying relationships between variables.[49] The strength of missing data mechanism was the same for each auxiliary variable, there was only one incomplete variable and there was no confounding.These simplifying assumptions are unlikely to hold in practice.Finally, we chose to focus on only three scenarios.By doing so, we may have overlooked a range of scenarios under which the analysis strategies perform poorly.However, this study can be used to inform further research in this area.
In conclusion, this study evaluated a range of strategies for selecting auxiliary variables for MI models, with the aim of providing advice for researchers faced with this problem.MI using all auxiliary variables in the imputation model performed well in all scenarios considered, providing support for adopting an inclusive auxiliary variable strategy where possible.
Auxiliary variable selection using the LASSO was the best performing auxiliary variable selection strategy overall and is a promising alternative when the full model fails.Quick data-based selection of auxiliary variable, as implemented in quickpred, also performed reasonably when used with a low cut-off and was straightforward to apply to the LSAC case study.However we note that further work is needed to evaluate these strategies in a range of different scenarios.

6 .
Forward-FMI: This approach uses forward selection based on the FMI in the mean of Y (hereafter "FMI") as proposed by Andridge and Thompson.[29]Here we provide a brief description of the theory and implementation of this strategy in the current study.For further details and extensions under different assumptions we refer the reader to the work of Andridge and Thompson,[29] and Little.[41]Let A * denote a "proxy variable" which is created as the predicted values from a linear regression of Y on a subset of A. Assume that M Y follows a Bernoulli distribution and that the distribution of (Y, A * ), given M Y , is bivariate normal with distinct means and variances for respondents (M Y = 0) and nonrespondents (M Y = 1).Also assume that the probability that Y is missing depends only on A * .Under these assumptions, the maximum likelihood estimates of µ Y and the FMI can be expressed as functions of the observed data.The forward selection procedure was implemented as follows.The procedure was initialised by estimating the FMI for each auxiliary variable in turn and selecting for inclusion in the imputation model the auxiliary variable associated with the smallest FMI.In the next step, p − 1 pairs of auxiliary variables (with each pair consisting of the auxiliary variable chosen in the previous step and one of the remaining p − 1 auxiliary variables) were used to create p − 1 proxy variables.Estimates of the FMI were obtained for each proxy variable, with the auxiliary variable from the pair resulting in the smallest estimated FMI added to the imputation model.The forward selection procedure continued in this manner until the reduction in the FMI was less than a pre-specified percentage of the missing data in Y .The justification for the use of this stopping rule and further details are provided in Section 2.4 of the Supporting Information.

Figure 1
Figure 1 presents simulation estimates of bias and empirical SE for µ Y (true value equal to 0), with 95% Monte Carlo confidence intervals, for each analysis strategy and scenario.

Figure 1 :Figure 2
Figure 1: Simulation estimates of bias and empirical SE for µ Y (true value = 0) from each analysis strategy and for each scenario.Estimates are presented with 95% Monte Carlo confidence intervals to quantify simulation uncertainty.

Figure 3 Figure 2 :
Figure 3 presents estimates, with 95% confidence intervals, for the effect of BMIz on HRQoL (β 1 in model (3)) and the mean HRQoL for the LSAC case study, using seven of the analysis strategies.Estimates from Forward and Forward-sw are not presented in this figure as the MI algorithm for these strategies (which were implemented in Stata) failed with the message "convergence not achieved" during the auxiliary variable selection step for one of the PedsQL items and no estimates were produced.Although the R strategies produced

Figure 3 :
Figure 3: Estimates of the effect of BMIz on HRQoL (left panel) and mean HRQoL (right panel), with 95% confidence intervals, from seven different analysis strategies.
the child's vocabulary.The case study had two major complexities that were not incorporated into the design of the simulation study (next section).Firstly, the exposure, two covariates and all auxiliary variables had some missing values.Secondly, the case study involved different types of variables (including highly skewed ordinal PedsQL items).Further details of the variables in this case study, including percentage of missing data, are provided in Supporting Information Table 1.MI was initially attempted using MICE in Stata (16.1, StataCorp LLC, College Station, TX

Table 1 :
StrategyCorrelation between A i and Y ; A i associated with M Y Average number (%) of times auxiliary variables were selected across simulation runs for relevant analysis strategies.Auxiliary variables are grouped by correlation with Y and association with the indicator for missingness in Y (M Y ).The scenario and corresponding number of auxiliary variables in each group is provided in bold.estimates,problems were reported (via the list of logged events reported by mice) in the Full and Tests strategies.No problems were reported for the other MI strategies.All analysis strategies led to similar estimates of β X (ranging from -0.73 to -0.69) and therefore similar conclusions.That is, a one-unit increase in BMI z-score in Australian children aged 4 to 5 years is associated with a decrease in HRQoL at age 10 to 11 years of around 0.7 units.Estimates of the mean HRQoL ranged from 76.7 to 77.5, with estimates produced by all MI strategies noticeably lower than estimates from the complete case analysis.In Table3in the Supporting Information we report the variables selected for inclusion in the imputation model for HRQoL for the MI strategies that produced estimates.Of the 85 candidate auxiliary variables considered, 61 were selected for inclusion in the imputation model for HRQoL by the Quickpred-pt2 strategy, 3 by the Quickpred-pt4 strategy, 50 by the Tests strategy and 17 by the LASSO strategy.
[19]challenge in evaluating such strategies is designing a study to include a mix of realistic correlations between auxiliary variables and variables in the analysis model.A study by Hardt et al. considered two values of the correlations between all auxiliary variablesand other variables in the study: low (0.1) and moderate (0.5).[11]Another study by Howard et al. took a similar approach, varying correlation values from 0.1 to 0.6.[19]Inpractice, it is highly unlikely for all auxiliary variables to have the same relationship with the variable with missing values.Our simulation study was designed to assess the performance of strategies for selecting auxiliary variables when there were a mix of relationships between auxiliary variables and (i) the incomplete variable, and (ii) the missingness indicator of the incomplete variable.In contrast to our study, both Collins et al. and Hardt et al. showed that including many auxiliary variables in the imputation model can lead to bias in MI estimates.