To assess the performance of methods for imputing missing data in a particular meta-analysis context, especially with regard to the issues of congeniality and the order in which Rubin's rules and inverse-variance weighted meta-analysis are applied, we perform a simulation study.

#### 3.1 Simulation of data

We simulate data on 30 studies from the following data-generating model:

- (5)

In the first set of simulations, each study is of equal size, with 200 participants per study. In the second set of simulations, studies are of unequal size, with the number of participants in each study ranging from 125 to 275 in steps of five individuals, such that the total number of individuals is the same in both simulations. The Supporting Information presents results from the second simulation with studies of unequal size.

Scenario 1 is the most homogeneous model considered, and we add heterogeneity sequentially by drawing the model parameters from a normal distribution to allow for between-study variability. We note that the homoscedastic stratified model given in equation (1) is a correctly specified analysis model in Scenario 1. A heteroscedastic stratified model is a correctly specified analysis model in Scenario 2 (and should give consistent estimates in Scenario 1). The fixed-effect method is correctly specified for *β*_{1} in Scenarios 3 and 4 (and should give consistent estimates for both *β*_{1} and *β*_{2} in Scenarios 1 and 2). The random-effects method is correctly specified for both *β*_{1} and *β*_{2} in Scenario 5 (and should give consistent estimates for both parameters in each of the other scenarios).

Similarly, the stratified imputation model is a correctly specified imputation model in Scenario 1, but not in any of the other scenarios. The within-study imputation model should give consistent imputed values in all scenarios.
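To make the distinction between the two imputation models concrete, the following sketch (in Python rather than the Stata/R used in the paper, and not the authors' code) shows a simplified "proper" regression imputation of *X*_{2} given *X*_{1}. Within-study imputation applies it separately to each study's data, whereas stratified imputation would stack all studies and add study indicators to the design matrix. The variable names, and the simplification of fixing the residual variance at its point estimate, are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)

def impute_proper(x1, x2):
    """One 'proper' regression imputation of x2 given x1: draw the
    regression coefficients from their approximate sampling distribution,
    then draw imputed values with residual noise. (A simplified sketch of
    what packages such as Stata's mi impute or R's mice do; for brevity
    the residual variance is held at its point estimate.)"""
    obs = ~np.isnan(x2)
    X = np.column_stack([np.ones(obs.sum()), x1[obs]])
    y = x2[obs]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (obs.sum() - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    beta_draw = rng.multivariate_normal(beta_hat, cov)
    # Impute the missing values with parameter and residual noise.
    Xmis = np.column_stack([np.ones((~obs).sum()), x1[~obs]])
    x2_imp = x2.copy()
    x2_imp[~obs] = Xmis @ beta_draw + rng.normal(0.0, np.sqrt(sigma2), (~obs).sum())
    return x2_imp

# Within-study imputation: call impute_proper separately on each study.
# Stratified imputation: stack all studies and include study indicators in X.
```

Repeating the draw *M* times yields the *M* imputed datasets combined later by Rubin's rules.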

In each scenario, we create 1000 simulated datasets for analysis. We assume that *Y* and *X*_{1} have no missing observations and consider missingness only in *X*_{2}. If *π*_{is} is the probability that the observation *x*_{2is} is missing, we generate approximately 50% sporadically missing data in *X*_{2} using a missing at random (MAR) model, in which missingness depends on the observed value of *X*_{1} (which is correlated with *X*_{2}):

- (6)

where expit(*x*) = (1 + exp( − *x*))^{ − 1} is the inverse of the logit function. We use this high missingness rate to illustrate the issues in parameter estimation with missing data more clearly.
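A minimal sketch of such a MAR mechanism, assuming for illustration a unit coefficient on *x*_{1} (equation (6) specifies the exact form used in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def expit(x):
    # expit(x) = (1 + exp(-x))^(-1), the inverse of the logit function
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical single study: x1 fully observed, x2 subject to missingness.
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)  # x2 correlated with x1

# MAR: the probability that x2 is missing depends only on the observed x1.
# With x1 centred at zero, expit(x1) averages roughly 0.5, giving ~50% missing.
p_miss = expit(x1)
missing = rng.uniform(size=n) < p_miss
x2_obs = np.where(missing, np.nan, x2)
```

Because the missingness probability depends only on the fully observed *x*_{1}, complete-case analysis loses efficiency but the MAR assumption underlying multiple imputation holds.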

In this paper, we used five imputations for each dataset for computational reasons; in a practical application, more imputations would ideally be used. We generated the multiply imputed datasets in Stata (StataCorp, College Station, Texas, USA) [19] and performed subsequent analyses of the datasets in R (R Foundation for Statistical Computing, Vienna, Austria) [20].

Tables 1 and 2 (for the equal sized studies) and Tables SA1 and SA2 (for the unequal sized studies) show the results of the stratified, fixed-effect, and random-effects meta-analysis methods. The Supporting Information shows alternative tables displaying the same results, grouped by imputation method rather than by scenario. In Tables 1 and 2, we show results from scenarios where the analysis model is misspecified with a shaded background, whereas we show results from scenarios where the imputation model is misspecified in italics.

Table 1. Simulation study comparing complete-data, complete-case, and multiple imputation analyses with two imputation models to estimate *β*_{1} = 0.3 with thirty (30) equal sized studies using three analysis models in five scenarios with increasing heterogeneity: mean and standard deviation (SD) of estimates, mean standard error (SE), and coverage (Cov %) of the 95% confidence interval. In inverse-variance weighted analyses, it is indicated whether Rubin's rules were applied within each study prior to meta-analysis (RR then MA) or meta-analysis of imputed datasets was performed prior to combining estimates using Rubin's rules (MA then RR). Results from scenarios where the analysis model is misspecified are shown with a shaded background, whereas results from scenarios where the imputation model is misspecified are shown in italics.

Table 2. Simulation study comparing complete-data, complete-case, and multiple imputation analyses with two imputation models to estimate *β*_{2} = − 0.6 with thirty (30) equal sized studies using three analysis models in five scenarios with increasing heterogeneity: mean and standard deviation (SD) of estimates, mean standard error (SE), and coverage (Cov %) of the 95% confidence interval. In inverse-variance weighted analyses, it is indicated whether Rubin's rules were applied within each study prior to meta-analysis (RR then MA) or meta-analysis of imputed datasets was performed prior to combining estimates using Rubin's rules (MA then RR). Results from scenarios where the analysis model is misspecified are shown with a shaded background, whereas results from scenarios where the imputation model is misspecified are shown in italics.

#### 3.2 Results of complete-data and complete-case analyses

Before considering the imputation of missing data, we discuss estimates of both *β*_{1} (Table 1) and *β*_{2} (Table 2) from complete-data and complete-case analyses. In the complete-data analysis, we analyzed data from all individuals prior to the introduction of missing values. In the complete-case analysis, we excluded individuals with missing values from the analysis. We present results for each of the three meta-analysis methods: the mean estimate across simulations, the standard deviation of the estimates, the mean of the estimated standard errors, and the coverage of the 95% confidence interval for the true parameter of interest (or the mean of the parameter's distribution when there is heterogeneity in the parameter of interest). In an ideal analysis, the mean estimate across simulations should be close to the true parameter value, the empirical standard deviation of the estimates should be close to the mean standard error estimate, and the coverage should be close to 95%. We constructed confidence intervals assuming normal distributions for the parameters of interest.
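These performance measures can be computed with a short helper; this is an illustrative sketch, not the authors' code, applied to one method's estimates and standard errors across simulated datasets:

```python
import numpy as np

def simulation_summary(estimates, std_errors, true_value, z=1.96):
    """Summarise one method over simulated datasets: mean estimate,
    empirical SD of the estimates, mean SE, and coverage (%) of the
    normal-based 95% confidence interval for the true value."""
    estimates = np.asarray(estimates, float)
    std_errors = np.asarray(std_errors, float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        "mean": estimates.mean(),
        "sd": estimates.std(ddof=1),     # empirical SD, compare with mean_se
        "mean_se": std_errors.mean(),
        "coverage_pct": 100.0 * covered.mean(),
    }
```

An ideal method has `mean` near the true value, `sd` near `mean_se`, and `coverage_pct` near 95.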

The pooled estimate from each of the methods shows little bias throughout, even when the model is misspecified. The stratified and fixed-effect analyses give good estimates when the parameter of interest (i.e., *β*_{1} or *β*_{2}) is fixed between studies, with some reduction in coverage and less efficient estimates for the stratified method when there is some between-study variability in other parameters. However, both the stratified and the fixed-effect methods underestimate the variance when the parameter of interest is heterogeneous. The random-effects meta-analysis method gives marginally larger standard errors than the fixed-effect method when there is no true heterogeneity in the parameter of interest, but gives much better coverage when heterogeneity is present. Coverage in random-effects meta-analysis is known to be below the nominal level in theory because uncertainty in the estimated heterogeneity is not acknowledged [21]; this does not seem to be a serious issue here, as with 30 studies the heterogeneity is well estimated. We note that the loss of information in the complete-case analyses relative to the complete-data analyses is much smaller in Scenarios 4 and 5, where there is heterogeneity in the parameter of interest, than in the other scenarios.
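For reference, the two inverse-variance weighted pooled estimates can be sketched as follows. This is an illustration, not the authors' code; the random-effects version uses the DerSimonian-Laird estimator of the between-study variance τ², a standard choice that the paper does not explicitly name:

```python
import numpy as np

def fixed_effect_meta(est, se):
    """Inverse-variance weighted fixed-effect pooled estimate and SE."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    w = 1.0 / se**2
    pooled = np.sum(w * est) / np.sum(w)
    return pooled, np.sqrt(1.0 / np.sum(w))

def random_effects_meta(est, se):
    """Random-effects meta-analysis with the DerSimonian-Laird tau^2."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    w = 1.0 / se**2
    pooled_fe = np.sum(w * est) / np.sum(w)
    q = np.sum(w * (est - pooled_fe) ** 2)         # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(est) - 1)) / c)      # truncated at zero
    w_re = 1.0 / (se**2 + tau2)                    # widened weights
    pooled = np.sum(w_re * est) / np.sum(w_re)
    return pooled, np.sqrt(1.0 / np.sum(w_re))
```

When the study estimates show no excess dispersion, τ² is truncated at zero and the two methods coincide, which is why the random-effects standard errors are only marginally larger under homogeneity.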

As seen in these simulations, the complete-case analyses are less efficient than the complete-data analyses. This motivates us to consider methods for the imputation of missing data. We note that analyses that perform badly in terms of bias or coverage with complete data will not perform well under multiple imputation; this should not be interpreted as a failure of the multiple imputation method, and correct specification of the analysis model should be seen as a first step before choosing between imputation models.

#### 3.3 Results of combining Rubin's rules and inverse-variance weighted meta-analysis

To assess the impact of the order in which Rubin's rules and an inverse-variance weighted meta-analysis are combined, we initially consider estimates from fixed-effect and random-effects meta-analysis models using the within-study imputation method, as in this case the imputation model is correctly specified in all scenarios.
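The two orderings can be made explicit in code. This sketch (an illustration, not the authors' implementation) uses a fixed-effect inverse-variance analysis and arrays of per-study results with shape (M imputations, S studies):

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Combine M estimates and variances from imputed datasets:
    pooled estimate, total variance = within + (1 + 1/M) * between."""
    q = np.asarray(estimates, float)
    u = np.asarray(variances, float)
    m = len(q)
    total_var = u.mean() + (1.0 + 1.0 / m) * q.var(ddof=1)
    return q.mean(), total_var

def fixed_effect(est, var):
    """Inverse-variance weighted pooled estimate and variance."""
    w = 1.0 / np.asarray(var, float)
    return np.sum(w * est) / np.sum(w), 1.0 / np.sum(w)

def rr_then_ma(est, var):
    # Apply Rubin's rules within each study, then meta-analyse.
    per_study = [rubins_rules(est[:, s], var[:, s]) for s in range(est.shape[1])]
    e, v = zip(*per_study)
    return fixed_effect(np.array(e), np.array(v))

def ma_then_rr(est, var):
    # Meta-analyse each imputed dataset, then combine with Rubin's rules.
    per_imp = [fixed_effect(est[m], var[m]) for m in range(est.shape[0])]
    e, v = zip(*per_imp)
    return rubins_rules(np.array(e), np.array(v))
```

With fixed-effect weights the point estimates typically differ little between the orderings; the difference lies in where the between-imputation variability enters the combined variance.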

*Within-study imputation, fixed-effect analysis*: With the fixed-effect analyses (ignoring scenarios where a fixed-effect analysis is not appropriate), the coverage is further below the nominal 95% than would be expected by chance when estimates are combined using Rubin's rules and meta-analysis, whichever order the two steps are applied in. (The Monte Carlo standard error for the coverage, representing the uncertainty in the simulated results due to the limited number of simulations, is 0.7%.) Additionally, there is a slight but consistent bias toward the null.

The reason that the fixed-effect analyses undercover and show bias is that the imputation of missing data introduces heterogeneity into the estimates of the parameter of interest (say *β*_{1}), even when there was no heterogeneity in this parameter in the data-generating model for the studies. Even though the parameter in the data-generating model was the same in all studies, the estimates of the related parameter used in the imputation model for generating imputed data will differ between studies, and so a fixed-effect analysis model will be misspecified. A study that, by chance, has a larger than average estimate of *β*_{1} in the available data will use this inflated estimate (via the related parameter *α*_{2} in the imputation model) to impute the missing data. Hence, the parameter estimate from such a study will be less precise than that from a study with a smaller than average estimate. Pooling the estimates of association therefore gives increased weight to studies with smaller than average estimates of *β*_{1}, producing a slight downward bias in the combined estimate. This introduction of heterogeneity also leads to slight under-coverage when Rubin's rules are applied before meta-analysis, as the fixed-effect assumption is no longer valid. The bias and the reduction in coverage appear to be of similar magnitude whether the studies are of equal or unequal size.

*Within-study imputation, random-effects analysis*: With the random-effects analyses, the coverage of the 95% confidence interval when the results are meta-analyzed before Rubin's rules are applied is conservative, at 97.7% or greater, when there is no heterogeneity in the parameter of interest, with the mean standard error consistently larger than the standard deviation of the estimates. When the results are combined using Rubin's rules and then meta-analyzed, the coverage is close to the nominal 95% when there is no effect heterogeneity. When there is heterogeneity in the parameter of interest, random-effects meta-analysis is known to give slightly over-narrow confidence intervals, as stated in Section 3.2. However, this is an issue with the meta-analysis method, not with the imputation method, and the coverage is close to that achieved in the complete-data analysis. Additionally, there is a slight but consistent bias toward the null.

The reason for the overly conservative coverage is that the multiple imputations introduce additional heterogeneity into the imputed datasets. If the imputed datasets are meta-analyzed before being combined by Rubin's rules, then the heterogeneity in the meta-analysis results represents the sum of the true between-study heterogeneity and the heterogeneity introduced by the imputation process. If instead the imputed datasets are combined for each study using Rubin's rules before the meta-analysis, then each combined study estimate reflects the true uncertainty of the estimate using all the data in that study.

*Stratified imputation*: If we consider the stratified imputation model in Scenario 1, the only scenario in which this imputation model is correctly specified, then a congenial analysis requires the meta-analysis to be performed before the application of Rubin's rules. This is because missing data in each study are imputed conditional on data in other studies, inducing a dependence between the imputed data values in different studies that is not accounted for when Rubin's rules are applied at the study level. The inverse-variance weighted analysis models assume that estimates of the parameter of interest from each study are independent. In this case, the fixed-effect analysis still has slightly low coverage, whereas the random-effects analysis has correct coverage levels.

#### 3.4 Results of comparison of imputation models

To assess the impact of the choice of imputation method on the estimates, we compare the performance of the two imputation methods under each of the analysis methods.

In general, the efficiency of the multiple imputation analyses for *β*_{1} is greater than that of the complete-case analyses and slightly below that of the complete-data analyses. The efficiency for *β*_{2} is similar to that of the complete-case analyses, with some slight improvement, especially when there is heterogeneity in the parameter. We see that the results obtained are most sensitive to the choice of imputation method.

*Stratified imputation*: The results using the stratified imputation method for the inverse-variance weighted analyses show bias in all scenarios except Scenario 1, where there is little heterogeneity between the studies. This is especially marked for estimates of *β*_{2}. The coverage is below the nominal level, with the mean standard error generally less than the standard deviation of the estimates, even in Scenario 1. This is because the imputation induces a correlation between data values in different studies, which is not acknowledged in an inverse-variance weighted analysis model. Although the stratified analysis method is misspecified in all scenarios except Scenario 1, the results for this method are not so bad, with minimal bias; this may reflect the congeniality of the imputation and analysis models. The inverse-variance weighted meta-analysis models do not correspond to the imputation model and so are uncongenial. It seems that the heteroscedasticity introduced from Scenario 2 onwards, which makes the imputation model misspecified, is the key feature of the data-generating model that introduces bias into the inverse-variance weighted results.

*Within-study imputation*: Using the within-study imputation method, the stratified analysis method gives biased estimates, although it does not perform as badly in Table 1 as the inverse-variance weighted methods do with the stratified imputation method; the bias is more substantial in Table 2. We described the behavior of estimates from the inverse-variance weighted methods with the within-study imputation method in Section 3.3. In Scenario 1, the finding of bias and low coverage using the within-study imputation and stratified analysis models runs contrary to the conventional advice in multiple imputation that the imputation model can be more detailed than the analysis model [22]: in this scenario, both models should lead to consistent estimates, and the imputation model is larger than the analysis model; however, there is bias, and the coverage is lower than the nominal 95% level.

We conclude that the stratified imputation method should be avoided when there is heterogeneity between studies. Koopman *et al*. and Andridge reported similar findings in an applied study with a logistic analysis model [12] and in the context of a cluster randomized trial [23], respectively. This is unfortunate, as the stratified imputation model provides a way to impute data on a covariate that is completely missing in a particular study [24]. In the complete absence of data on a covariate, we must make strong assumptions not only about the relation between the covariate and the other variables in the model, but also about the error distribution of the covariate. We return to this issue in the discussion.