## 1. Introduction

### 1.1. Missing data

Missing data occur in almost all medical and epidemiological research. Inadequate handling of the missing data in a statistical analysis can lead to biased and/or inefficient estimates of parameters such as means or regression coefficients, and biased standard errors resulting in incorrect confidence intervals and significance tests. In all statistical analyses, some assumptions are made about the missing data. Little and Rubin's framework 1 is often used to classify the missing data as being (i) missing completely at random (MCAR—the probability of data being missing does not depend on the observed or unobserved data), (ii) missing at random (MAR—the probability of data being missing does not depend on the unobserved data, conditional on the observed data) or (iii) missing not at random (MNAR—the probability of data being missing does depend on the unobserved data, conditional on the observed data). For example, blood pressure data are MAR if older individuals are more likely to have their blood pressure recorded (and age is included in the analysis), but they are MNAR if individuals with high blood pressures are more likely to have their blood pressure recorded than other individuals of the same age. It is not possible to distinguish between MAR and MNAR from the observed data alone, although the MAR assumption can be made more plausible by collecting more explanatory variables and including them in the analysis.

### 1.2. Multiple imputation and its rationale

Multiple imputation (MI) 2 is a statistical technique for handling missing data, which has become increasingly popular because of its generality and recent software developments 3, 4. The key concept of MI is to use the distribution of the observed data to estimate a set of plausible values for the missing data. Random components are incorporated into these estimated values to reflect their uncertainty. Multiple data sets are created and then analyzed individually but identically to obtain a set of parameter estimates. Finally, the estimates are combined to obtain the overall estimates, variances and confidence intervals. Although MI can be implemented under MNAR mechanisms 2, 5, standard implementations assume MAR, and we make this assumption throughout this paper (except in Section 10.4). When correctly implemented, MI produces asymptotically unbiased estimates and standard errors and is asymptotically efficient. The three stages of MI are described formally below:

*Stage 1*: *Generating multiply imputed data sets*: The unknown missing data are replaced by *m* independent simulated sets of values drawn from the posterior predictive distribution of the missing data conditional on the observed data. For a single incomplete variable *z*, this involves constructing an imputation model which regresses *z* on a set of variables with complete data, say *x*_{1}, *x*_{2}, …, *x*_{k}, among individuals with the observed *z*. Choices of imputation model are discussed in Section 2. Methods for multiple incomplete variables are discussed in Section 1.3. Let $\hat{\boldsymbol{\beta}}$ and $\mathbf{V}$ be the set of estimated regression parameters and their corresponding covariance matrix from fitting the imputation model. The following two steps are repeated *m* times. Let $\boldsymbol{\beta}^{*}$ be a random draw from the posterior distribution, commonly approximated by $N(\hat{\boldsymbol{\beta}}, \mathbf{V})$ 2. Imputations for *z* are drawn from the posterior predictive distribution of *z* using $\boldsymbol{\beta}^{*}$ and the appropriate probability distribution. This process is known as proper imputation because it incorporates all sources of variability and uncertainty in the imputed values, including prediction errors of the individual values and errors of estimation in the fitted coefficients of the imputation model. Alternatives to proper imputation 6 are not considered here. An alternative way to draw proper imputations, predictive mean matching, is described in Section 4.2.
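The paper's illustrations use Stata; as a language-neutral sketch, the two repeated steps of Stage 1 can be written in Python/NumPy for a single continuous variable *z* under a linear imputation model. The function name `proper_impute` and the exact posterior draws (a chi-squared draw for the residual variance, then a multivariate Normal draw for the coefficients) are one standard way to make the imputation "proper"; this is an assumption-laden minimal sketch, not the paper's implementation.

```python
import numpy as np

def proper_impute(z, X, m, rng):
    """Draw m sets of proper imputations for a continuous variable z
    (with NaNs marking missing values), given complete covariates X (n x k)."""
    obs = ~np.isnan(z)
    Xo = np.column_stack([np.ones(obs.sum()), X[obs]])     # observed rows
    Xm = np.column_stack([np.ones((~obs).sum()), X[~obs]])  # rows to impute
    # Fit the imputation model z ~ X on individuals with observed z
    beta_hat, *_ = np.linalg.lstsq(Xo, z[obs], rcond=None)
    resid = z[obs] - Xo @ beta_hat
    df = Xo.shape[0] - Xo.shape[1]
    sigma2_hat = resid @ resid / df
    V = sigma2_hat * np.linalg.inv(Xo.T @ Xo)  # covariance of beta_hat
    imputations = []
    for _ in range(m):
        # Step 1: draw sigma^2* and beta* from the (approximate) posterior
        sigma2_star = resid @ resid / rng.chisquare(df)
        beta_star = rng.multivariate_normal(
            beta_hat, V * sigma2_star / sigma2_hat)
        # Step 2: draw the missing values, including prediction error
        z_star = Xm @ beta_star + rng.normal(0, np.sqrt(sigma2_star),
                                             Xm.shape[0])
        imputations.append(z_star)
    return imputations
```

Because both the coefficients and the residual variance are redrawn before each set of imputations, the *m* sets reflect estimation uncertainty as well as individual prediction error.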

*Stage 2*: *Analyzing multiply imputed data sets*: Once the multiple imputations have been generated, each imputed data set is analyzed separately. This is usually a simple task because complete-data methods can be used. The quantities of scientific interest (usually regression coefficients) are estimated from each imputed data set, together with their variance–covariance matrices. The results of these *m* analyses differ because the missing values have been replaced by different imputations.

*Stage 3*: *Combining estimates from multiply imputed data sets*: The *m* estimates are combined into an overall estimate and variance–covariance matrix using Rubin's rules 2, which are based on asymptotic theory in a Bayesian framework. The combined variance–covariance matrix incorporates both within-imputation variability (uncertainty about the results from one imputed data set) and between-imputation variability (reflecting the uncertainty due to the missing information). Suppose $\hat{\theta}_j$ is an estimate of a univariate or multivariate quantity of interest (e.g. a regression coefficient) obtained from the *j*th imputed data set and $\mathbf{W}_j$ is the estimated variance of $\hat{\theta}_j$. The combined estimate $\bar{\theta}$ is the average of the individual estimates:

$$\bar{\theta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}_j.$$

The total variance $\mathbf{T}$ of $\bar{\theta}$ is formed from the within-imputation variance $\bar{\mathbf{W}} = \frac{1}{m}\sum_{j=1}^{m}\mathbf{W}_j$ and the between-imputation variance

$$\mathbf{B} = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{\theta}_j - \bar{\theta}\bigr)\bigl(\hat{\theta}_j - \bar{\theta}\bigr)^{\mathsf{T}}$$

as

$$\mathbf{T} = \bar{\mathbf{W}} + \Bigl(1 + \frac{1}{m}\Bigr)\mathbf{B}.$$
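For a scalar quantity of interest, Rubin's rules reduce to simple arithmetic on the *m* estimates and their variances. The following Python sketch (the helper name `rubin_combine` is ours) pools the estimates and returns the total variance:

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine m univariate estimates and their estimated variances by
    Rubin's rules; returns the pooled estimate and its total variance."""
    theta = np.asarray(estimates, dtype=float)
    W = np.asarray(variances, dtype=float)
    m = len(theta)
    theta_bar = theta.mean()   # pooled estimate: average of the m estimates
    W_bar = W.mean()           # within-imputation variance
    B = ((theta - theta_bar) ** 2).sum() / (m - 1)  # between-imputation variance
    T = W_bar + (1 + 1 / m) * B                     # total variance
    return theta_bar, T
```

The factor (1 + 1/*m*) inflates the between-imputation component to allow for the finite number of imputations.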

Single imputation is sometimes considered as an alternative to multiple imputation, but it is unable to capture the between-imputation variance **B**, hence standard errors are too small.

Wald-type significance tests and confidence intervals for a univariate θ can be obtained in the usual way from a *t*-distribution; degrees of freedom are given in references 7, 8. Wald tests can also be constructed for a multivariate θ 7.
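A scalar confidence interval can be computed directly from the pooled quantities. The sketch below (function name `rubin_ci` is ours) uses the classical Rubin degrees of freedom, $\nu = (m-1)\bigl(1 + \bar{W}/((1+1/m)B)\bigr)^{2}$; small-sample refinements of this formula are given in the references cited above and are not implemented here.

```python
import numpy as np
from scipy import stats

def rubin_ci(estimates, variances, level=0.95):
    """Wald-type confidence interval for a pooled scalar estimate, using
    Rubin's classical degrees of freedom (assumes B > 0)."""
    theta = np.asarray(estimates, dtype=float)
    W = np.asarray(variances, dtype=float)
    m = len(theta)
    theta_bar, W_bar = theta.mean(), W.mean()
    B = theta.var(ddof=1)                 # between-imputation variance
    T = W_bar + (1 + 1 / m) * B           # total variance
    nu = (m - 1) * (1 + W_bar / ((1 + 1 / m) * B)) ** 2
    half = stats.t.ppf(1 - (1 - level) / 2, nu) * np.sqrt(T)
    return theta_bar - half, theta_bar + half
```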

### 1.3. Multiple imputation by chained equations

In large data sets it is common for missing values to occur in several variables. Multiple imputation by chained equations (MICE) 9 is a practical approach to generating imputations (MI Stage 1) based on a set of imputation models, one for each variable with missing values. MICE is also known as fully conditional specification 10 and sequential regression multivariate imputation 11. Initially, all missing values are filled in by simple random sampling with replacement from the observed values. The first variable with missing values, *x*_{1} say, is regressed on all other variables *x*_{2}, …, *x*_{k}, restricted to individuals with the observed *x*_{1}. Missing values in *x*_{1} are replaced by simulated draws from the corresponding posterior predictive distribution of *x*_{1}. Then, the next variable with missing values, *x*_{2} say, is regressed on all other variables *x*_{1}, *x*_{3}, …, *x*_{k}, restricted to individuals with the observed *x*_{2}, and using the imputed values of *x*_{1}. Again, missing values in *x*_{2} are replaced by draws from the posterior predictive distribution of *x*_{2}. The process is repeated for all other variables with missing values in turn: this is called a cycle. In order to stabilize the results, the procedure is usually repeated for several cycles (e.g. 10 or 20) to produce a single imputed data set, and the whole procedure is repeated *m* times to give *m* imputed data sets.
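The cycling described above can be sketched compactly. The following simplified Python implementation (the name `mice_impute` is ours) handles continuous variables only, imputing each by linear regression on the others; for brevity it does not redraw the regression coefficients from their posterior, so a full implementation would combine it with proper draws as in Stage 1.

```python
import numpy as np

def mice_impute(X, n_cycles=10, rng=None):
    """Produce one chained-equations imputed data set for an n x k array X
    of continuous variables, with NaNs marking missing values."""
    rng = rng or np.random.default_rng()
    X = X.copy()
    miss = np.isnan(X)
    n, k = X.shape
    # Initial fill-in: simple random sampling from the observed values
    for j in range(k):
        X[miss[:, j], j] = rng.choice(X[~miss[:, j], j], miss[:, j].sum())
    for _ in range(n_cycles):          # each pass over all variables is a cycle
        for j in range(k):
            if not miss[:, j].any():
                continue
            others = np.delete(np.arange(k), j)
            A = np.column_stack([np.ones(n), X[:, others]])
            obs = ~miss[:, j]
            # Regress x_j on all other variables, using current imputations
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = np.sqrt(resid @ resid / max(obs.sum() - A.shape[1], 1))
            # Replace missing x_j with draws from the predictive distribution
            X[miss[:, j], j] = (A[miss[:, j]] @ beta
                                + rng.normal(0, sigma, miss[:, j].sum()))
    return X
```

Calling this function *m* times (with independent random streams) yields the *m* imputed data sets of Stage 1.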

An important feature of MICE is its ability to handle different variable types (continuous, binary, unordered categorical and ordered categorical) because each variable is imputed using its own imputation model. Suitable choices of imputation models are discussed in Sections 2 and 4.

### 1.4. Plan of the paper

The remainder of the paper is structured as follows. Section 2 describes and illustrates how to impute missing values in Normally distributed and categorical variables. Section 3 introduces the UK700 data that we use for illustration in the later sections and describes the MICE algorithm. Section 4 shows how to impute missing values in skewed quantitative variables. Section 5 focuses on how to choose the variables in the imputation model, and Section 6 focuses on how to specify the form of the imputation model when non-linear analyses are of interest. Section 7 discusses how to choose the number of imputations. Section 8 suggests how to use multiply imputed data for extended statistical analyses, such as model building and prediction. Section 9 gives an illustrative analysis of the UK700 data. Section 10 discusses theoretical limitations and pitfalls of MICE. We conclude with a general discussion in Section 11, which includes consideration of some alternatives to MICE.

We illustrate the methods using Stata code fragments where appropriate, although knowledge of Stata is not required to understand the paper. Version 11 of Stata, released in July 2009, contains a new suite of `mi` commands 12. These do not implement MICE, which requires the user-contributed `ice` command 13–17. Rubin's rules are implemented by the user-contributed `mim` command 18 and by the new `mi estimate` command.