Multiple imputation methods for bivariate outcomes in cluster randomised trials

Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single‐level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost‐effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing‐at‐random clustered data scenarios were simulated following a full‐factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed‐effects multiple imputation and too low following single‐level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.


Introduction
In cluster randomised trials (CRTs), the unit of random allocation is a group of individuals (e.g. a school or a hospital) rather than the individual subjects. It is a common study design in the health and social sciences, especially for evaluations of interventions that operate at a group level, manipulate the sociophysical environment or cannot be delivered at an individual level. It is well known that observations within each cluster are correlated [1], and that analyses that ignore this homogeneity within clusters can result in overestimation of the precision of the treatment effects, possibly leading to inappropriate inferences being drawn. Appropriate statistical techniques for CRTs are well developed and include mixed models and generalised estimating equations [2].
A common problem that compromises the validity of the results is that of missing data. The validity of inferences from incomplete data depends on the process that leads to data being missing, the so-called missing data mechanism, also known as missingness mechanism or missing data process [3, section 3.2]. The missing data mechanism is characterised by the conditional distribution of the probability of missingness, given the data. A classification of the missing data mechanisms according to the assumed model for the probability of non-response was introduced by Rubin [4]. A process is said to be missing completely at random (MCAR) if the probability of non-response is completely independent of any other variable, whether measured or not. A process is classified as missing at random (MAR) if the probability of non-response is conditionally independent of the unobserved data given the observed data. Processes that are neither MCAR nor MAR are called missing not at random (MNAR).
For missing data mechanisms that satisfy MAR, valid inferences can be obtained using likelihoodbased or Bayesian analyses of the complete cases [3,Part III]. However, moment-based estimators, such as those that use generalised estimating equations are, without special modification, only valid with more stringent conditions about the missing data mechanism, namely, that the missingness is independent of the outcome given the covariates in the model [5].
With partially observed clustered data, by far, the most common approach is to only analyse the complete cases (CCA) [6]. However, when two or more outcomes are analysed jointly, the proportion of complete cases is often smaller than the complete cases corresponding to each outcome in turn. This is an important issue as multivariate outcomes are common in clinical trials. Examples include clinical trials of psychological interventions and those in cardiology, which often focus on non-fatal cardiovascular events, in addition to time-to-event. Our paper uses cost-effectiveness analyses (CEA) that use data from CRTs as an illustrative example. Most CEA that use individual-level data from clinical trials have observations with incomplete information [7].
A common approach for obtaining valid inferences with incomplete data under the MAR assumption is to undertake multiple imputation (MI) [4]. In some circumstances, essentially when the analysis and imputation models coincide, MI principally replicates a likelihood analysis. Nevertheless, an advantage of MI is that unlike conventional likelihood analyses, it can incorporate so-called auxiliary variables that are not included in the analysis model but which are related to both the missing values and to the probability of observations being missing. Incorporating such auxiliary variables makes the underlying MAR assumption more plausible.
From a theoretical perspective, it is known that for CRTs, the imputation method should accommodate the multilevel structure of the data. A failure to do so may lead to invalid inferences [8]. Unfortunately, multilevel MI (MMI) is not yet available as a standard implementation in commonly used statistical packages. Hence, analyses using MI in the CRT settings commonly avoid such imputation strategies and use instead single-level imputation (SMI) methods [6]. A systematic review of CEA that use CRT data [9] found that only 5% of studies included used MI, of which none accounted for the clustering.
An alternative approach that has been previously recommended in the literature is including the cluster as a fixed effect in the imputation model (FMI) [10,11]. This has the advantage of being easily implemented in widely available MI software.
The aim of this paper is to investigate and compare the performance of these different MI strategies for handling missing bivariate outcome data in CRTs over a wide range of missingness mechanisms that are dependent on individual and cluster-level variables. We do this by first applying the methods to a costeffectiveness study that used data from a published CRT (Section 3). Then, a simulation study with a full-factorial design is presented in Section 4. We close with a few points of interpretation and discussion in Section 5.

Multiple imputation
Multiple imputation breaks down the analysis of incomplete data into a number of steps. We first need to distinguish between two statistical models. The first is the analysis model that would have been used had the data been complete. This is called the substantive model or model of interest. The second model, called the imputation model, is used to describe the conditional distribution of the missing data given the observed. For hierarchical data, this conditional distribution must reflect the multilevel nature of the data.
The MI algorithm proceeds by fitting the imputation model to the observed data and taking Bayesian draws from the posterior distribution of its model parameters. Missing data are then imputed from the imputation model, using the parameters previously drawn. These steps are repeated a fixed M number of times, to obtain M completed data sets. The substantive model is then fitted to the multiple data sets separately, producing M sets of parameter and covariance estimates, which are combined using Rubin's formulae [4] to produce a single MI estimate of the substantive model parameters and associated covariance matrix. Under the MAR assumption, this will produce consistent estimators and, in the absence of auxiliary variables, is asymptotically (as M increases) equivalent to maximum likelihood [12,13].
Sampling from the approximate predictive distribution of the missing data as described earlier can be performed in several ways. Two broad approaches can be identified; the first approach jointly models incomplete variables, by sampling from an underlying joint predictive distribution [13,14]. In the second approach, referred to as full-conditional specification (FCS) or chained equations, draws from the joint distribution are approximated using a sampler consisting of a set of univariate models for each incomplete variable conditional on all the other variables [15]. In the motivating example and simulations presented here, both approaches are used for ease of implementation. For SMI and fixed cluster effects models, which are also essentially single-level, the FCS method is used. The FCS approach is not well-suited to proper multilevel MI and so, for these imputations, a joint modelling algorithm assuming multivariate normality is used [13]. In our settings, because both outcomes are continuous, and modelled with normal linear regressions, the FCS algorithm is equivalent to a Gibbs sampler that draws from a multivariate normal distribution, and hence equivalent to a joint MI algorithm [16][17][18].
Having outlined the generic MI procedure, we now set out the details of the relevant imputation models to be compared here. Let Y 1,ijk and Y 2,ijk be the two continuous outcomes with missing data, corresponding to the i-th individual in cluster j of a two-arm cluster trial. Assume that J clusters are allocated to each treatment k ∈ {0, 1}, and that there are n j individuals in each cluster j, for j = 1, … , J. Let k indicate treatment allocation, k = 1, if the cluster is allocated to intervention, and 0 otherwise.
Let X ijk denote the vector of fully observed variables, individual and cluster level, to be included in the imputation model. This includes the variables in our model of interest and any other auxiliary variables, and may be different in each treatment arm. The imputation models compared here are regression models of the outcomes on the covariates in the substantive model and the auxiliary variables, fitted separately within each treatment arm, to allow for different covariance structure.
The single-level imputation model (used in SMI) can be written as With SMI, the imputed values are drawn from the conditional distribution of the missing observations given the observed data, ignoring any dependency between observations within a cluster not explained by the cluster-level auxiliary variables included in the model. Therefore, the single-level imputation model does not properly represent the conditional distribution of the missing data given the observed data.
The effect of clustering can be incorporated either as a fixed or random effect. Firstly, we include a cluster fixed effect in the imputation model (corresponding to FMI): Y 1,ijk = 1,0k + X ijk 1,X + 1,jk + e 1,ijk Y 2,ijk = 2,0k + X ijk 2,X + 2,jk + e 2,ijk where ,jk are the fixed cluster-effect coefficients, different from 0 only if the observation i belongs to cluster j in treatment group k, for j = 1, … , J. To avoid over-parameterisation, ,1k = 0, for k ∈ {0, 1}, ∈ {1, 2}, making the first cluster in each treatment arm the reference category. The error terms ( e 1,ijk , e 2,ijk ) are assumed to be bivariate normal as before. Missing outcomes will be imputed from the conditional normal distribution given the other outcome, if observed, and the covariates and auxiliary variables, which must all be at the individual level, with a mean determined by the fixed effect for that cluster.
We note that this parameterisation of the fixed-effects imputation model may result in biased estimates when there is a high proportion of clusters with completely missing outcomes. This is because FMI imputes empty clusters from the distribution of the reference cluster, as the fixed effect for the empty cluster cannot be estimated. When this is the case, the imputer must choose the reference cluster carefully. In particular, we should choose the cluster that has a cluster-mean closest to the randomised-group mean.
An alternative to the FMI is to include a cluster random effect in the imputation model (corresponding to MMI): Y 1,ijk = 1,0k + X ijk 1,X + u 1,jk + e 1,ijk Y 2,ijk = 2,0k + X ijk 2,X + u 2,jk + e 2,ijk again separately in each treatment group k. The individual-level residuals ( e 1,ijk , e 2,ijk ) are assumed normally distributed and correlated as before, independently of ( u 1,jk , u 2,jk ) , the cluster random effects. Finally, complete case analysis (CCA) is also included in our simulations and example for comparative purposes.

Substantive model
In this paper, we assume that the substantive model is a bivariate linear random-effects model where the only explanatory variable is treatment. This means that in what follows, the vector X ijk of explanatory variables in the imputation models specified in the previous section contains only auxiliary variables. If, however, the substantive model includes baseline covariates, these must be included in the imputation model, as covariates if they are fully observed, or as dependent variables, if they themselves have missing values.
The substantive model is fitted to the data from both arms simultaneously, assuming common variance across the treatment arms. Let the cluster random effects be represented by the latent variables u 1,jk and u 2,jk . The model can be written as follows: where 1 and 2 represent the treatment effect on the corresponding outcome. The error term (e 1,ijk , e 2,ijk ) and the cluster effects are assumed to be normally distributed: where 1 , 2 are the individual-level standard errors, is the individual-level correlation between Y 1 and Y 2 and 1 , 2 and are the standard errors and correlation of the two cluster random effects, respectively.

Motivating example: the OPERA study
We illustrate our methods using the OPERA study (exercise for treating depression in care home residents). It was a CRT to evaluate the impact of a 'whole home' exercise intervention on depressive symptoms in care home residents in England, aged 65 years or over who are free of severe cognitive impairment [19]. Clusters were randomly allocated to provide either a depression awareness training session for care home staff (control) or an exercise intervention delivered by a visiting physiotherapist (treatment). The intervention comprised twice weekly physiotherapist-led exercise groups. For the purpose of illustration, we look at the cost-effectiveness data, which consisted of 798 individuals in 72 nursing homes. There were 31 clusters in the intervention and 41 in the control arm. As is common, the OPERA CRT had an imbalanced design; the number of participants per cluster varied from 5 to 20.
This paper considers costs (in Great British pounds, £) and health-related quality of life completed via proxy (based on European Quality of Life questionnaire -EQ5D) recorded at three-monthly intervals, for a period of 12 months. These EQ5D data were used to obtain quality-adjusted life years (QALYs) over 12 months. Intra-cluster correlation coefficients (ICCs) were high for QALYs (0.23 in the intervention and 0.08 in the control) but moderate for costs (0.03 for intervention and 0.10 in the control arm). While QALYs were approximately normally distributed, costs were positively skewed. The correlation between the outcomes was −0.11 in the control arm and −0.07 in the intervention.
The data set also includes the clinical primary outcome, depression, measured using the Geriatric Depression Score-15 (GDS-15), and baseline covariates at both cluster level -location and size of the home-and at the individual level -age, sex, ethnicity, being on antidepressants, years spent in formal education, cognitive impairment (Mini-Mental State Examination -MMSE), physical function (Short Physical Performance Battery-SPPB), fear of falling, pain, social engagement, baseline GDS-15 and baseline EQ5D (self-completed and proxy).
We had 449 individuals with complete cost-effectiveness data, 190 individuals with only missing QALYs at 12 months, a further 110 with only costs missing and an additional 49 individuals with both outcomes missing. Table I reports the percentage of observations with missing cost-effectiveness outcomes and baseline covariates, by treatment group. The number of clusters with one outcome completely missing was moderate: one intervention and six control clusters had QALYs completely missing (less than 10% of the total). There were no clusters with completely missing costs. The CEA assumes a model with linear additive treatment effects for both costs and QALYs, with no additional covariates. The corresponding effect of treatment, incremental QALYs Q and incremental costs C , is estimated from a bivariate normal mixed model (2) [20]. Cost-effectiveness is then reported as the estimated incremental net monetary benefit (INB) where represents the decision-makers' willingness to pay for a one unit gain in health outcome. Thus, the new treatment is cost-effective if INB > 0. In the original study, the reported INB was calculated using = £ 20 000, which is within the range of the cost-effectiveness threshold recommended by the UK National Institute for Health and Care Excellence [21]. As the INB is a linear combination of̂C and Q , its variance can be calculated from the corresponding estimated variances and covariances, in the usual way.

Multiple imputation methods for the OPERA study
We now apply the alternative MI methods to the OPERA data set. For the purpose of illustration, we delete from the data set the single observation with missing age at baseline and consider age, sex and cluster size as completely observed baseline variables and use them as auxiliary variables in the imputations, as they are associated with the missingness and the outcomes. All MI strategies use the same baseline covariates as auxiliary variables in the imputation model, with one exception. Cluster size is dropped from the FMI approach, as cluster-level variables cannot be used as explanatory variables in models using fixed cluster effects. Although costs are somewhat skewed, we do not log-transform or perform postimputation rounding, as this has been shown to bias the associations [22,23]. Both outcomes, costs and QALYs, are included in all imputation models. The number of imputations in this example is 50.
We calculate INB on each multiply imputed data set using bivariate linear mixed models (2) and combine these results using Rubin's rules to obtain MI estimates. We construct normal-based confidence intervals (CIs) around the MI estimate.
Single-level imputation and FMI are implemented in R package mice, which uses the FCS algorithm. The number of iterations or cycles of the chained equations algorithm used is 50, as this appears to lead to satisfactory convergence for this data set. For the MMI, we use R package pan, with 1000 burn-in iterations and imputed every 3000 to reduce auto-correlation and improve convergence, as it is known that with large number of clusters and small ICCs, the Gibbs sampler is slowly mixing [8].
All three MI methods result in approximately 35 negative imputed costs per imputed set. Table I shows that the estimates of incremental QALYs, which has relatively high ICCs, are relatively insensitive to the choice of MI approach. By contrast, the incremental cost point estimate obtained by FMI is very different from the others. The standard errors across the two outcomes are different for each missing data approach but are relatively large compared with the size of the estimate. SMI produces smaller standard errors than those obtained with MMI and CCA. This is because costs and QALYs have a relatively large ICC in the OPERA data set, and we are looking at a between-cluster estimator. As a consequence, there is an increased risk of type I error [24].
The choice of MI method, which mostly affects the way the variance of the missing data is modelled, affects the estimated SE. Nevertheless, for the OPERA study, all MI approaches lead to the same conclusion, that the OPERA intervention is not cost-effective compared with the control treatment.

Simulation study
We now use a full-factorial simulation study, to compare the performance of the MI methods across a wide range of circumstances typically found in CRTs. The simulation steps proceeded as follows: data generation, application of a missing data mechanism and estimation and inference for the treatment effect from the analysis after handling (or ignoring) the missing data. Finally, the behaviour of the treatment effect estimator is examined according to our chosen performance measures.

Data generation
We begin by selecting those factors anticipated to have an impact on the performance of the approaches for handling missing data, based on previous literature [24,25]: number of clusters per treatment arm and number of individuals per cluster ( three settings); ICCs of the outcomes ( four levels); and proportion of missing data ( two levels) and missing data mechanism (four settings). The total number of clusters is 2J, with n j individuals in cluster j, for j ∈ {1, … , J}, in each trial arm k ∈ {0, 1}. The number and size of clusters are allowed to vary while maintaining the same expected sample size (S = 500). This sample size is typical of the sample sizes seen in CRTs, as a recent systematic review of CRTs published in medical journals reported the inter-quartile range (IQR) of number of participants per arm as being [143-866] [26]. Three different types of two-arm CRT design are considered: (i) large number of clusters (J = 25) and few individuals per cluster ( n j = 10 ) ; (ii) small number of clusters (J = 5) and large cluster size ( n j = 50 ) ; and (iii) moderate number of clusters (J = 15) and variable number of individuals per cluster. The small and large number of clusters were also chosen to be close to the lower and upper quartiles of number of clusters reported by Ivers [26], which found an IQR of . Following previous simulation studies [20,27], the variable cluster size n j is obtained by rounding a Gamma-distributed random variable. This Gamma random variable has mean 20 and coefficient of variation cv = SD(n) E(n) = 0.5. The full description of the simulation factors and their levels are summarised in Table II. There are 3 × 4 × 2 × 2 × 4 = 192 simulated scenarios in total.
In each simulated scenario, a cluster-level indicator is then created allocating half of the clusters to treatment and half to control. Then, for each subject i in cluster j, i.i.d standard normal individual-level covariate X i and cluster-level variable W j are generated. These are independent of treatment allocation k and are therefore thought of as pre-randomisation variables. Then, bivariate normal outcome data ( are generated separately by treatment arm as follows: with is the level-1 variance-covariance matrix, with 1 = 40, 2 = 20 and = 0.1 assumed constant across all scenarios.
The level-2 variance-covariance matrix Φ =  Note: The top part of the table reports values for scenarios with missingness mechanisms, which do not differ by treatment arm; those corresponding to missingness mechanism which are differential by treatment arm, are reported at the bottom. The numbers in italics are not simulation parameters but the approximate empirical rates of non-response obtained after setting 0 . ICC, intra-cluster correlation coefficient.
We note that while the variables X and W are associated with the outcomes, they are not included in the substantive model, and as such, they are auxiliary variables in the imputation models to follow.
The R function used to generate these data, by changing the levels of each factor accordingly, can be found in the Supporting Information File 2.

Missing data mechanisms
To generate the missing data for each outcome under the MAR assumption, we used four different missing data mechanisms, where the probability of missingness, denoted by ,ijk , with ∈ {1, 2}, is such that the non-response indicator R ,ijk ∼ Bern( ,ijk ), depends on X i and/or W j , as displayed in Table II. The coefficient represents the strength of association between the covariates and missingness indicator R ,ijk . We adjust 0 empirically to achieve the required expected probability of missing.
For both outcomes, individual-level and cluster-level covariates have the same level of association, , with the missingness indicator, and thus we drop the subscripts and X, W. However, for the last of our missingness mechanisms, we allow to differ between treatment arms. This represents a situation where there is an interaction between treatment and the covariate driving the missingness, that is, the treatment modifies strength of association between the covariate and non-response, which may arise in clinical trials, because, for example, of side effects or lack of perceived efficacy in the intervention arm or disillusionment amongst those assigned to the control arm. We allow two settings; these are presented at the bottom of Table II, together with the probabilities of non-response, which also differ across treatment arms.
Non-response rates are chosen to be moderate, to avoid situations where a high proportion of clusters have one or both outcomes completely missing.
For each simulated data set, non-response indicators R ,ijk for each outcome ∈ {1, 2} are independently drawn from a Bernoulli distribution with probabilities ,ijk as specified in Table II. Missing values are then generated to create the observed data set.

Implementation
For a simulation study of this size, it is important to balance computational time with efficiency of the methods. The number of imputations M is set to 10 [29], although in practice, a higher number of imputed sets is recommended [10].
The MI methods using FCS, that is, SMI and FMI, are implemented using the mice package in R. The chained equations procedure was repeated for 10 cycles to produce a single imputed data set, following recommendations in [30]. For the multilevel MI, Schafer's pan package in R is used [31]. Details on the MCMC procedure used in pan can be found in [13]. Briefly, a Gibbs sampler is used to simulate draws from the posterior distribution of the parameters, starting with the level 2 variances. For our simulations, we use non-informative priors for regression parameters, diffuse inverse-Wishart priors for variance components and impute on every 1000th iteration, after a 1000-iteration burn-in. In practice, one needs to monitor the convergence behaviour of the MCMC algorithm and modify the number of iterations between imputations and the burn-in period accordingly.
After imputation, for which the two covariates are used as auxiliary variables, the substantive model, Equation (2), is applied to each multiply imputed data set to estimate treatment effect on Y 1 and Y 2 simultaneously. The estimates obtained using the analysis model in each of the M multiply imputed sets are then combined using Rubin's rules. CIs around the MI estimates are constructed using a normal distribution, instead of a t-distribution with the small-sample MI degrees of freedom [32]. This is for comparability with the CIs obtained after a CCA, which are also constructed routinely using a normal approximation.
For each scenario, the whole simulation procedure (data generation, imposing missing values, imputation, analysing each of the imputed data sets using the substantive model and combining the resulting treatment effect estimates using Rubin's rules) is performed on each of the N = 1000 data sets to capture the behaviour in repeated samples.

Performance criteria
Let denote the true treatment effect on Y , with ∈ {1, 2}, and̂, the estimate obtained in the = 1, … , N replicated data set. The following criteria were used to measure the performance of the different MI strategies.
(1). Confidence interval coverage rate: The percentage of times that the true parameter value is covered in the 95% CI.
(4). Average width of confidence interval (AW): The distance between the average lower and upper CI limits across N CIs.
The performance of a procedure is regarded as poor if its coverage drops below 90% [33]. When the procedure results in over-coverage, there is an increased type II error probability. When coverage is close to 100%, extra caution should be taken when using that procedure [34], especially when coupled with wide CIs. Coverage close to the nominal value, along with narrow CIs, translates into greater accuracy and higher power. Figures 1 and 2 present, respectively, the bias and coverage distribution for each method. Each box-andwhiskers plot shown represents 48 scenarios, stratified by missing data mechanism. The performance of the methods across the scenarios was similar for both Y 1 and Y 2 . For the first three missing data mechanisms shown (Table II), where the missing data mechanism was not dependent on treatment arm, all approaches resulted in unbiased estimates across most of the scenarios. This is in line with theoretical results, as the variables associated with missingness are not associated with the treatment effect. However, for the scenario where the missingness mechanism is differential by treatment arm, the CCA produced substantially biased estimates across the scenarios considered. This corresponds to the situation where there is a different association between the covariate and the response indicator in each treatment arm, and this treatment-covariate interaction is not accounted for when we condition on the complete cases. By contrast, the corresponding results for the MI estimates show negligible bias: in general, less than 3.5% and mostly within Monte Carlo error limits. To see this in more detail, see Table III. However, the alternative MI strategies result in very different variance estimates and consequently varying coverage rates. This is evident in the plots of CI coverage Figures 1 (b) and 2 (b). In general, SMI resulted in coverage lower than the nominal. This is particularly critical for scenarios with high ICCs (0.20 and above). The number and size of clusters also appear to be factors associated with low Table III. Percentage bias for the estimated treatment effect on Y 1 for scenarios corresponding to missingness mechanism is differential by treatment.  Tables A4, A5 and A6 in the Supporting Information File 1. In contrast, fixed-effects MI results in over-conservative coverage for a range of scenarios, especially those where the ICCs are small and the number of clusters is large and the cluster-level variable is associated with the missingness mechanism. In addition, wider CIs are obtained using FMI compared with those obtained using MMI, even when coverage was similar. See Tables A5, A6, A16 and A17 in the Supporting Information File 1. MMI results in acceptable coverage rates in most scenarios, but for scenarios where the number of clusters is relatively low (J = 5 per arm) and the clustering is high (≥ 0.2), coverage rates are only just above 90%. This is because the convergence of the Gibbs sampler depends on the degree to which the cluster random effects in the imputation model, Equation (1), can be estimated from the observed data [8]. Convergence can be improved by increasing the burn-in period for the sampler in the MMI software. For example, we re-ran the scenario with J = 5 and the ICC = 0.2 for both outcomes with differential missingness by treatment arms with low and missing proportions 0.2 in each outcome. By increasing the burn-in to 5000, CI coverage rate increased to 92.0%, compared with 90.9% reported in Table IV for the same scenario. We therefore recommend that, when faced with small numbers of clusters, the burn-in period is increased. It is clear that the validity of inferences drawn depends crucially on the method chosen to handle the missing data. As the box and whisker plots for CI coverage show, the method that most consistently achieves coverage rates close to the nominal is MMI, as the interquartile range of the distribution of coverage across the 192 simulated scenarios is almost all contained within the limits 90-97% (for example, only 8 scenarios out of the total 192 resulting in coverage for Y 1 outside this range).

Simulation results
The results corresponding to RMSE are reported in the Supporting Information File 1. In general, MMI is more efficient than the other two MI methods. The ratio of MMI RMSE to either SMI or FMI RMSE is almost always ≤ 1, with only four scenarios resulting in a ratio > 1.02. In general, the FMI RMSE is larger than those corresponding to the other two MI methods, in situations where the outcome ICCs were smaller than 0.2. Conversely, when ICCs are greater or equal than 0.2, the RMSE corresponding to SMI is larger than the corresponding RMSE for the other two methods.

Discussion
In this study, we compared the performance of single, multilevel and fixed-effects MI for handling missing data in CRTs. The full-factorial nature of our simulation study enabled us to establish which characteristics have the greatest influence on the performance of the alternative methods for handling missing data considered here.
In our simulations, which assumed the data were MAR throughout, bias was a serious problem for the CCA when the missingness mechanism was differential by treatment arm, while all MI methods resulted in unbiased treatment estimates. The main difference amongst the three MI procedures is in how variability is incorporated into the imputations. This is reflected in the variance estimates and has an impact on CI coverage rate. SMI resulted in low (< 90%) coverage rate across most scenarios, in particular when the ICCs exceeded 0.05 and there were few clusters. Fixed-effects MI produced overly conservative coverage (> 98%), especially when there were small ICCs and more than 30 clusters. This finding reflects the way these two approaches accommodate the between-cluster variance. Under SMI, the between-cluster variance is set to zero, whereas with FMI, this variance is unbounded in the sense that the behaviour of one estimated cluster effect is unrelated, or unconstrained, by the behaviour of any of the others. Hence, FMI cannot be used to impute cluster-level variables, or indeed, when the substantive model includes cluster-level variables, because these cannot be explicitly included in the imputation model.
By contrast, MMI models the correlation in the data appropriately, producing coverage rates close to the nominal level. This consistent performance across the varying number of clusters and cluster sizes is indicative of acceptable finite sample properties. Moreover, MMI is compatible with the substantive model, in the sense that the imputation model contains the analysis model. The imputation model can include auxiliary variables at both the individual and the cluster-level, thus increasing the plausibility of the MAR assumptions.
The re-analysis of the OPERA study illustrates how each of the methods could be implemented in practice. In this re-analysis, the standard errors for the estimated treatment effect for both outcomes are substantially larger when using FMI, while SMI resulted in smaller standard errors. From this and other simulation studies [24,25], we know that FMI overestimates the variance, while SMI underestimates it. Moreover, there is a large difference in the estimates following the FMI, compared with other methods. This was larger for the endpoint (cost) where the ICC was smallest. This could be because the FMI cannot incorporate explicitly cluster size into the imputation model. In this example, the overall conclusion that the exercise intervention was not cost-effective did not differ according to the approach taken to handling the missing data, but this may not always be the case.
The validity of the results when using MI depends on obtaining an appropriate estimate for the standard errors. This requires that the imputation model recognises the dependencies within the data, in this case amongst clusters. It is also important to use an appropriate number of imputations. A small M will translate into a loss of efficiency compared with the estimate obtained with infinitely many imputations. If we can accept a 5% loss of efficiency, then five imputations may be sufficient even for 25% missing information [4]. In practice, the actual number of imputations necessary for MI to perform satisfactorily depends not only on the amount of information missing but also on the type of analysis. Some analyses may require M = 50 or more to obtain stable results [35]. So, for a particular application, this number must be carefully chosen, based on sampling error of the MI estimates [10]. In the present work, we used 10 imputations for the simulations and 50 for the illustrative example, following the recommendations in [10] for determining the required number of imputations.
We have shown how the flexibility of MMI allows the analyst to handle continuous multivariate outcomes without any modification to the multilevel imputation algorithm, because it is already based on multivariate normality. We illustrate this here using bivariate outcomes, but the generalisation to multivariate outcomes is straightforward. The MMI approach is readily available for continuous data in R packages pan [31] and jomo. The stand-alone software RealcomImpute also performs MMI [36] and can be used in conjunction with Stata.
Previous simulation-based comparisons of the alternative methods have been published before [24,25]. Our study builds on and extends the previous literature by establishing which characteristics of the setting most influence the performance of the different strategies for handling the missing data. In addition, we complement Andridge's work [25] by undertaking a more comprehensive assessment of the fixed effects MI, including complex scenarios with cluster-level variables as predictors of missingness, varying cluster sizes and bivariate outcomes and showing further limitations of the FMI approach when compared with MMI.
The approach presented in this paper has some limitations. For simplicity, we assumed the missing data mechanism is MAR throughout. However, MI provides a flexible and convenient route for investigating sensitivity to alternative MNAR mechanisms (e.g. [37,Chapter 10]). Our simulations excluded situations with missing covariate data and where the imputation model is misspecified. MI assumes that the functional form of the imputation model has been correctly specified and includes all interactions and terms of higher order that are of substantive interest. A further concern could be that either the imputation or the analytical models make incorrect distributional assumptions. This was the case in the OPERA example, where we imputed the costs assuming a normal distribution. However, simulation studies by Schafer [13] and others [22,34] have shown parametric MI to be fairly robust to misspecified distributions. Inferences are also insensitive to non-Gaussian random effects in an multilevel imputation model, except when the rates of missingness are very high or the sample size is small [38].
Future research directions thus include considering MNAR mechanisms, especially those where the cluster random effect is driving the missingness. Other potential extensions relate to situations where there is cluster non-response. In both situations, MMI could provide a flexible route for investigating sensitivity to alternative MNAR mechanisms and cluster dropout.