## 1 Introduction

Missing data are data whose values are not available. Data may be missing for a number of reasons, both within and outside the control of investigators. Factors within their control tend to lead to systematically missing data, where data on a variable are missing for an entire subset of the study population. For example, a variable may have been measured in only a few studies because of the cost of measurement. Factors outside the control of the investigators tend to lead to sporadically missing data, where data on a variable are missing for a few individuals with no clear pattern to the missingness.

Missing data are classified as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) depending on whether the probability of data being missing does not depend on any data values, observed or unobserved (MCAR), depends only on observed data (MAR), or depends additionally on unobserved data (MNAR) [1, 2]. If the data are MCAR, then the missing values are distributed identically to the measured values. If the data are MAR, then the distributions of the missing and measured values are the same conditional on measured covariates. If the data are MNAR, the distribution of the missing values differs from that of the measured values even conditional on measured covariates.
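As an informal illustration of these three mechanisms, the following minimal Python sketch simulates an outcome `y` that depends on a fully observed covariate `x` and deletes values of `y` under each mechanism. The data-generating model, the missingness probabilities, and the function name `simulate_missingness` are assumptions chosen for illustration and are not taken from the analyses in this paper. For the observed values, it reports the mean of `y` and the mean of `y - x`, the latter standing in for the distribution of `y` conditional on `x`.

```python
import math
import random
import statistics


def simulate_missingness(n=20000, seed=1):
    """Summarize the observed values of y under MCAR, MAR and MNAR.

    Returns a dict mapping each label to a pair:
    (mean of observed y, mean of observed y - x).
    """
    rng = random.Random(seed)
    # Hypothetical data-generating model: y = x + noise, with x fully observed.
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]

    def expit(v):
        return 1 / (1 + math.exp(-v))

    def summarize(prob_missing):
        # Keep each y value with probability 1 - prob_missing.
        kept = [(xi, yi) for xi, yi, p in zip(x, y, prob_missing)
                if rng.random() >= p]
        mean_y = statistics.fmean([yi for _, yi in kept])
        mean_resid = statistics.fmean([yi - xi for xi, yi in kept])
        return mean_y, mean_resid

    return {
        "full": summarize([0.0] * n),                # nothing deleted
        "MCAR": summarize([0.3] * n),                # constant probability
        "MAR": summarize([expit(xi) for xi in x]),   # depends on observed x only
        "MNAR": summarize([expit(yi) for yi in y]),  # depends on y itself
    }
```

Under these assumptions, the observed mean of `y` should be close to the full-data mean under MCAR but shifted under both MAR and MNAR, whereas the mean of `y - x` should remain close to zero under MCAR and MAR and be shifted only under MNAR.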

There are several possible approaches to handling missing data. The most common approach is to ignore individuals with missing data entirely: a complete-case analysis. More sophisticated approaches are available, such as multiple imputation [3]. Multiple imputation under a MAR assumption is increasingly being used in applied research because of recent software development. In multiple imputation, missing values are imputed several times by drawing random values from the conditional distribution of the missing values given the observed data, according to a specified imputation model; each set of draws forms a completed dataset. The parameter estimates and standard errors from each of these imputed datasets are combined using formulae known as Rubin's rules [4]. There are two main advantages to a multiple imputation analysis over a complete-case analysis:

*Power*: Observations with partial missingness may still be informative for the estimate of interest, especially if missingness is in a single variable. An efficient analysis should include all relevant information.

*Bias*: If the missing data are MAR, a complete-case analysis can introduce bias, whereas correctly specified multiple imputation estimates are unbiased [5].
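The combination step using Rubin's rules mentioned above can be sketched in Python as follows. The function name `rubins_rules` and its arguments are hypothetical; the sketch covers a scalar parameter only and omits the degrees-of-freedom calculation used for the reference distribution.

```python
import math
import statistics


def rubins_rules(estimates, std_errors):
    """Combine scalar estimates from m >= 2 imputed datasets.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.fmean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance (divisor m - 1).
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)
```

For example, `rubins_rules([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])` gives a pooled estimate of 2.0 with total variance 1 + (4/3) × 1 = 7/3.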

In this paper, we consider the specific context of multiple imputation for missing data in an individual participant data (IPD) meta-analysis [6]. A meta-analysis is an analysis of data from multiple sources to give a single pooled value representing the overall estimate of the parameter of interest using the totality of the available data. Often, by necessity, a meta-analysis is performed on summarized data published by each study. In an IPD meta-analysis, the original data on the study participants are available for analysis. This facilitates hierarchical analyses, where the analysis of multilevel data can be performed in a single step [7], as opposed to the common two-stage inverse-variance weighted meta-analysis method.
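For comparison, the second stage of a two-stage inverse-variance weighted meta-analysis can be sketched as follows, assuming a fixed-effect model (an assumption made here for brevity; a random-effects model would add a between-study variance term to each study's variance). The function name `inverse_variance_pool` is hypothetical. Each study contributes an estimate and standard error from the first stage, and studies are weighted by the reciprocal of their variance.

```python
import math


def inverse_variance_pool(estimates, std_errors):
    """Fixed-effect inverse-variance weighted pooling of study estimates.

    Returns the pooled estimate and its standard error.
    """
    # Each study is weighted by the reciprocal of its variance.
    weights = [1 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    return pooled, math.sqrt(1 / total_weight)
```

For example, two studies with estimates 0.0 and 1.0 and standard errors 1.0 and 0.5 receive weights 1 and 4, giving a pooled estimate of 0.8.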

The main difficulty with performing and interpreting meta-analyses is between-study variability [8]. This consists of both statistical heterogeneity due to differences in populations, such that coefficients cannot realistically be assumed to be constant across studies, and variability due to the investigators, such as the choice of variables measured in each study or the definition of variables. IPD enable the assessment of statistical heterogeneity and the standardization of analyses across studies [9]. Additionally, individual-level data allow detailed analyses, such as multiple imputation, that would not generally be possible with summarized data [10, 11].

The imputation of missing data presents specific challenges in a meta-analysis context. For example, if a covariate represents an important confounder for an association, missing data methods can be used to impute sporadically missing data in the covariate, although it is not clear whether it would be optimal to impute data in each study separately or in all studies using a hierarchical model. If the covariate has not been measured in a study, it is unclear how to impute data on this variable using information from other studies. Previous work has shown that imputation of covariates across multiple studies can lead to inconsistencies in estimation [12].

The structure of this paper is as follows. We first introduce methods for the analysis of data from multiple studies and the imputation of missing data in this context (Section 2). Two particular issues are considered: (i) the congeniality of the imputation and analysis models [13] and (ii) the correct order in which to apply the combination of imputation estimates using Rubin's rules and the pooling of study estimates using an inverse-variance weighted meta-analysis. We present a simulation study to investigate the behavior of estimates using the analysis and imputation methods previously introduced (Section 3). We illustrate the methods with an analysis of the association of low-density lipoprotein cholesterol (LDL-C) with blood pressure using data from the Emerging Risk Factors Collaboration (ERFC) [14] (Section 4). We conclude by discussing the findings of the paper and potential avenues for future work (Section 5). The methods and simulations considered in this paper mainly relate to sporadically missing data. Issues relating to systematically missing data are left for discussion.