## 1. MULTIPLE IMPUTATION, THEN DELETION

Multiple imputation (MI) is an increasingly popular tool for analyzing data with missing values (Rubin 1987). As it is commonly used, MI is part of a four-step estimation strategy:

1. *Replication*. Make multiple copies of the incomplete data set.
2. *Imputation*. In each copy, replace each missing value with a plausible random imputation. (Imputations are drawn conditionally on the observed values of all of the variables.)
3. *Analysis*. Analyze each imputed data set separately, using the standard methods that are used for complete data.
4. *Recombination*. Combine the results of the separate analyses, using formulas that account for variation within and between the imputed data sets.
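The four steps can be sketched in code. The following is a minimal toy illustration (our own example, not from the paper): a normal regression with some *Y* values missing, a crude regression-based imputer, and Rubin's rules for recombination. A proper MI implementation would also draw the imputation-model parameters from their posterior rather than using point estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x is fully observed; about 30% of y is missing at random.
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3
y[miss] = np.nan
obs = ~miss

M = 5                                  # number of imputed copies (step 1)
slopes, within_vars = [], []
for _ in range(M):
    # Step 2: impute each missing y from a regression of y on x fit to
    # the complete cases, plus random residual noise.  (A full MI
    # implementation would also draw b0, b1, and sd from their posterior.)
    b1, b0 = np.polyfit(x[obs], y[obs], 1)
    sd = np.std(y[obs] - (b0 + b1 * x[obs]))
    y_imp = y.copy()
    y_imp[miss] = b0 + b1 * x[miss] + rng.normal(scale=sd, size=miss.sum())

    # Step 3: analyze each imputed data set with complete-data methods.
    slope, intercept = np.polyfit(x, y_imp, 1)
    slopes.append(slope)
    resid_var = np.var(y_imp - (intercept + slope * x), ddof=2)
    within_vars.append(resid_var / (n * np.var(x)))  # var of the slope

# Step 4: recombine with Rubin's (1987) rules.
q_bar = np.mean(slopes)                   # pooled point estimate
u_bar = np.mean(within_vars)              # within-imputation variance
b = np.var(slopes, ddof=1)                # between-imputation variance
t = u_bar + (1 + 1 / M) * b               # total variance of q_bar
```

The total variance `t` inflates the within-imputation variance by the between-imputation variance, which is how the recombination formulas account for the uncertainty added by imputation.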

Researchers often use MI when they are estimating the conditional distribution of an outcome *Y* given some inputs **X** = (*X*_{1}, …, *X*_{p}). For example, analysts may use MI in estimating the parameters of a generalized linear model such as a normal or logistic regression, or a hierarchical linear model.

Researchers often ask how they should handle the dependent variable *Y*. The easy question is whether *Y* should be used to impute *X*. The answer is yes (e.g., Allison 2002). If *Y* is not used to impute *X*, then *X* will be imputed as though it has no relationship to *Y*. When the imputed data are analyzed, the estimated slope of *Y* on *X* will be biased toward zero, since a value of zero was tacitly assumed in imputation.^{1}
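This attenuation is easy to see in a toy simulation (a hypothetical bivariate-normal example of our own, not from the paper). When missing *X* values are drawn without reference to *Y*, the estimated slope shrinks toward zero; when they are drawn from the conditional distribution of *X* given *Y*, the slope is recovered.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical example: y = x + e, with about 40% of the x values
# missing at random.  The true slope of y on x is 1.
n = 5000
x = rng.normal(size=n)
y = x + rng.normal(size=n)
miss = rng.random(n) < 0.4

# Imputing x WITHOUT y: draws from x's marginal distribution carry no
# information about y, tacitly assuming a slope of zero.
x_bad = x.copy()
x_bad[miss] = rng.normal(size=miss.sum())
slope_bad = np.polyfit(x_bad, y, 1)[0]    # attenuated toward zero (~0.6)

# Imputing x USING y: draws from the conditional distribution of x
# given y (for this model, mean y/2 and residual sd sqrt(1/2)).
x_good = x.copy()
x_good[miss] = y[miss] / 2 + rng.normal(scale=np.sqrt(0.5), size=miss.sum())
slope_good = np.polyfit(x_good, y, 1)[0]  # close to the true slope of 1
```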

This paper focuses on a harder question, which is what to do with cases that are missing *Y*. The answer begins with two remarks in Little (1992:1227):

> If the *X*s are complete and the missing values of *Y* are missing at random, then the incomplete cases contribute no information to the regression of *Y* on *X*_{1}, …, *X*_{p}.

In other words, when the *X*s are complete, there is no need for imputation, because maximum-likelihood estimates can be obtained simply by deleting the cases with missing *Y*. Using imputed *Y* values in analysis would simply add noise to these estimates. On the other hand, Little continues:

> If values of *X* are missing as well as *Y*, then cases with *Y* missing can provide a minor amount of information for the regression of interest, by improving prediction of missing *X*'s for cases with *Y* present. (P. 1227)

This second remark implies that cases with missing *Y* should be used in the imputation step, since those cases may contain information that is useful for imputing *X* in other cases. But after imputation, cases with imputed *Y* have nothing more to contribute; when the data are analyzed, random variation in the imputed *Y* values adds nothing but noise to the estimates. In short, cases with missing *Y* are useful for imputation, but not for analysis.

In light of this observation, we propose a new estimation strategy that we call multiple imputation, then deletion (MID). MID is just like a conventional MI strategy except that cases with imputed *Y* are deleted before analysis:

1. *Replication*.
2. *Imputation*.
2½. *Deletion*. Delete all cases that have imputed values for *Y*.
3. *Analysis*.
4. *Recombination*.
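The extra deletion step amounts to one line of code. In this toy sketch (our own illustration, using the same crude regression imputer as before), *X* happens to be complete, so after deletion the retained cases are identical across copies and MID reduces to complete-case analysis, exactly the fully efficient estimate that Little's first remark describes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: x complete, about 30% of y missing at random.
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3
obs = ~miss

M = 5
slopes = []
for _ in range(M):
    # Steps 1-2: impute missing y (crude regression imputation with noise).
    b1, b0 = np.polyfit(x[obs], y[obs], 1)
    sd = np.std(y[obs] - (b0 + b1 * x[obs]))
    y_imp = y.copy()
    y_imp[miss] = b0 + b1 * x[miss] + rng.normal(scale=sd, size=miss.sum())

    # Step 2.5 (MID): drop every case whose y was imputed.
    # Steps 3-4 then proceed as usual on the reduced data sets.
    slope = np.polyfit(x[obs], y_imp[obs], 1)[0]
    slopes.append(slope)

q_bar = np.mean(slopes)

# Because x is complete here, the retained cases are the same in every
# copy, so all M analyses return the identical slope: the imputed y
# values contributed nothing but were harmlessly discarded.
```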

One advantage of MID is efficiency. Compared to an ordinary MI strategy (one that retains imputed *Y*s), MID tends to give less variable point estimates, more accurate standard-error estimates, and shorter confidence intervals with equal or higher coverage rates. To state these advantages in terms of hypothesis tests, MID tests tend to have greater power while maintaining equal or lower significance levels. The efficiency advantage of MID is often minor, but it can be substantial when many values are missing and relatively few data sets are imputed.
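A small Monte Carlo sketch conveys the intuition (a toy setup of our own, not the paper's simulations): with *X* complete, half of *Y* missing, and only M = 3 imputations, the conventional MI point estimate carries extra noise from the imputed *Y* values, so across replications it is more variable than the MID estimate.

```python
import numpy as np

rng = np.random.default_rng(4)

def one_replication(rng, n=100, M=3):
    # Toy setup: x complete, about half of y missing at random.
    x = rng.normal(size=n)
    y = x + rng.normal(size=n)
    miss = rng.random(n) < 0.5
    obs = ~miss

    # Crude regression-based imputer fit to the complete cases.
    b1, b0 = np.polyfit(x[obs], y[obs], 1)
    sd = np.std(y[obs] - (b0 + b1 * x[obs]))

    mi_slopes = []
    for _ in range(M):
        y_imp = y.copy()
        y_imp[miss] = b0 + b1 * x[miss] + rng.normal(scale=sd, size=miss.sum())
        mi_slopes.append(np.polyfit(x, y_imp, 1)[0])

    mi = np.mean(mi_slopes)   # conventional MI: imputed y kept in analysis
    mid = b1                  # MID: imputed-y cases deleted before analysis
    return mi, mid

results = np.array([one_replication(rng) for _ in range(2000)])
var_mi, var_mid = results.var(axis=0)
# With x complete, the imputed y values add only noise, so the MI point
# estimate is more variable across replications than the MID estimate.
```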

A second and perhaps more important advantage is that MID is robust to problems in the imputation model. Problems in imputing *Y* cannot affect MID estimates because cases with imputed *Y* are deleted before analysis. Problems in imputing *X* may also have little effect if, as is often the case, missing *X*s tend to occur in the same cases as missing *Y*s.

The importance of deleting problematic imputations bears some emphasis because, in practice, there are several things that can go wrong when data are imputed. For example, nonlinear relationships, such as interactions, may be carelessly omitted from the imputation model (Allison 2002). Even if the imputation model is specified carefully, the imputation software may have undocumented biases (Allison 2000; von Hippel 2004). Even if the software works well, it may have limited flexibility, so that the analyst has to impute skewed or discrete variables as though they were normal. An inappropriate assumption of normality can result in unrealistic imputations such as a negative body weight, or a dummy variable with a value of 0.6. To “fix” such unrealistic values, some analysts round or transform imputed values—but these fixes can introduce biases of their own (Horton, Lipsitz, and Parzen 2003; Allison 2005).
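The rounding problem in particular is easy to demonstrate (a hypothetical example of our own, in the spirit of Horton, Lipsitz, and Parzen 2003): a dummy variable imputed under a normal model and then rounded to 0/1 ends up with a biased proportion, even though the unrounded imputations reproduce the proportion correctly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical dummy variable with true proportion p = 0.2, imputed as
# if it were normal (matching mean and variance), then "fixed" by
# rounding each imputation to 0 or 1.
p = 0.2
n = 100_000
imputed = rng.normal(loc=p, scale=np.sqrt(p * (1 - p)), size=n)
rounded = (imputed > 0.5).astype(float)

mean_unrounded = imputed.mean()   # reproduces the true proportion (~0.20)
mean_rounded = rounded.mean()     # biased toward 0.5 (~0.23 here)
```

The unrounded imputations include impossible values such as 0.6 or negative numbers, yet their mean is unbiased; rounding makes each value look realistic while distorting the estimate.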

In short, since imputation can go wrong in several ways, it seems desirable to reduce reliance on imputed values. MID does this.

Like any missing-data method, MID relies on certain assumptions. In particular, MID assumes that missing *Y* values are *ignorable* (Little and Rubin 2002) in the sense that the unobserved *Y* values are similar to observed *Y* values from cases with similar values for *X*. In addition, MID assumes that missing *X* values are ignorable in cases with missing *Y*. The assumption of ignorability is not unique to MID; in fact, the vast majority of conventional MI analyses assume that missing values are ignorable. But there are extensions of MI that relax the assumption of ignorability (Rubin 1987). Though these extensions are rarely used and not always effective (Rubin 2003), under MID they can hardly be used at all.^{2}

MID also relies on the assumption that the imputed *Y* values contain no useful information. This assumption is usually valid, but in some data sets the imputed *Y* values have been enriched by auxiliary information from an outside source. As we will show in Section 7, however, the “signal” in such auxiliary information must be quite strong before it overcomes the “noise” from the random component in the imputed *Y* values. In short, although there are special circumstances when a conventional MI strategy is superior to MID, under most practical circumstances MID has at least a small advantage.

In this paper, we describe applications of MID in social research (Section 2); we explain why MID works (Section 3); and we demonstrate the efficiency of MID estimates both analytically (Section 4) and through simulations (Section 5). In the final sections of the paper, we outline some extensions and limitations of MID (Sections 6 and 7), and we argue that the limitations are usually minor compared to the advantages.