Suppose that the target analysis is a linear regression where the inputs X_{1} and X_{2} are bivariate standard normal variables with a correlation of ρ, and the outcome is
 Y = α_{Y·12I} + β_{Y·1}X_{1} + β_{Y·2}X_{2} + β_{Y·I}I_{12} + e_{Y·12I}  (12)
where I_{12} = X_{1}X_{2} is the interaction term and e_{Y·12I} ∼ N(0, σ^{2}_{Y·12I}) is normal error. To keep things simple, we assume that all the parameters have a value of 1, i.e., (α_{Y·12I}, β_{Y·1}, β_{Y·2}, β_{Y·I}, σ^{2}_{Y·12I}) = (1, 1, 1, 1, 1). The assumptions of normal regressors and unit parameters are not strictly necessary, but they make the calculations simpler and more transparent.
The mean and covariance matrix of (X_{1}, X_{2}, I_{12}, Y) are, according to Mathstatica,
 μ_{12IY} = (0, 0, ρ, 1 + ρ),

             [ 1        ρ        0          1 + ρ          ]
 Σ_{12IY} =  [ ρ        1        0          1 + ρ          ]
             [ 0        0        1 + ρ^{2}  1 + ρ^{2}      ]
             [ 1 + ρ    1 + ρ    1 + ρ^{2}  4 + 2ρ + ρ^{2} ]   (13)
Notice that the interaction I_{12} has no correlation with X_{1} or X_{2}. This is a bivariate analog to our earlier and more familiar finding that a standard normal variable is uncorrelated with its square.
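This zero correlation is easy to verify numerically. The following simulation sketch is not part of the original analysis; the variable names and the choice ρ = 0.5 are ours. It generates the complete data of equation (12) with all parameters set to 1 and checks the moments just described:

```python
import numpy as np

# Simulate the complete data of equation (12) with rho = 0.5 and all
# parameters equal to 1, then check the moments cited in the text.
rng = np.random.default_rng(0)
rho, n = 0.5, 500_000
x1, x2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
i12 = x1 * x2                                   # interaction term
y = 1 + x1 + x2 + i12 + rng.standard_normal(n)  # equation (12), parameters 1

print(np.cov(i12, x1)[0, 1])  # near 0: interaction uncorrelated with X1
print(np.cov(i12, y)[0, 1])   # near 1 + rho^2 = 1.25
print(y.mean())               # near 1 + rho = 1.5
```

With ρ = 0.5, the simulated covariance of I_{12} with X_{1} is near 0, while the covariance of I_{12} with Y is near 1 + ρ^{2} = 1.25.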
Now suppose that all values of X_{2} are deleted; then the interaction I_{12} must be deleted as well, since in real data I_{12} could not be calculated without X_{2}. Again, we imagine that, despite missing all values of X_{2} and I_{12}, we nevertheless know μ_{12IY} and Σ_{12IY} and can use them to replace the missing variables with imputed variables X^{(m)}_{2} and I^{(m)}_{12}. This is a rigorous thought experiment: If an imputation method fails even when we have perfect knowledge of the complete-data parameters, then that method must be fundamentally unsound.
4.1. Transform, Then Impute
As noted earlier, normal linear regression, whether implicit or explicit in the imputation model, is the most widely implemented method for imputing continuous variables. Using a normal linear regression model with two dependent variables, we can impute X^{(m)}_{2} and I^{(m)}_{12} by regression on the complete variables X_{1} and Y:
 X^{(m)}_{2} = α_{2·1Y} + β_{2·1}X_{1} + β_{2·Y}Y + e_{2·1Y}
 I^{(m)}_{12} = α_{I·1Y} + β_{I·1}X_{1} + β_{I·Y}Y + e_{I·1Y}  (14)
Alternatively, we can impute the variables one at a time, first imputing X^{(m)}_{2} by regression on X_{1} and Y, and then imputing I^{(m)}_{12} by regression on X_{1}, Y, and X^{(m)}_{2}. Or vice versa: first impute I^{(m)}_{12} and then impute X^{(m)}_{2}. Or we can impute implicitly under a multivariate normal model for (X_{1}, X_{2}, I_{12}, Y). As long as the regression parameters used for imputation are derived from μ_{12IY} and Σ_{12IY}, the result will be the same: the mix of complete and imputed variables (X_{1}, X^{(m)}_{2}, I^{(m)}_{12}, Y) will have the same mean μ_{12IY} and covariance matrix Σ_{12IY} as the complete data. So when Y is regressed on X_{1}, X^{(m)}_{2}, and I^{(m)}_{12}, the regression parameters will be the same as if all the variables were complete.
In short, in an idealized situation where the mean and covariance matrix of the complete data are known, the transform-then-impute method yields unbiased regression estimates.
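In this idealized setting the transform-then-impute calculation can be sketched as a simulation. The code below is ours, not the paper's Mathstatica derivation; ρ is fixed at 0.5, and the joint mean and covariance matrix are supplied analytically rather than estimated, mirroring the assumption that μ_{12IY} and Σ_{12IY} are known:

```python
import numpy as np

# Transform, then impute: X2 and I12 are imputed jointly from X1 and Y,
# with imputation parameters derived from the known complete-data moments.
rng = np.random.default_rng(1)
rho, n = 0.5, 500_000
x1, x2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
y = 1 + x1 + x2 + x1 * x2 + rng.standard_normal(n)
# X2 and I12 are now treated as entirely missing.

# known mean and covariance of (X1, X2, I12, Y)
mu = np.array([0, 0, rho, 1 + rho])
S = np.array([[1.0,     rho,     0.0,         1 + rho],
              [rho,     1.0,     0.0,         1 + rho],
              [0.0,     0.0,     1 + rho**2,  1 + rho**2],
              [1 + rho, 1 + rho, 1 + rho**2,  4 + 2*rho + rho**2]])

obs, mis = [0, 3], [1, 2]  # observed: X1, Y; missing: X2, I12
B = S[np.ix_(mis, obs)] @ np.linalg.inv(S[np.ix_(obs, obs)])
Sc = S[np.ix_(mis, mis)] - B @ S[np.ix_(obs, mis)]  # conditional covariance
cond_mean = mu[mis] + (np.column_stack([x1, y]) - mu[obs]) @ B.T
x2m, i12m = (cond_mean + rng.multivariate_normal([0, 0], Sc, size=n)).T

# regressing Y on X1 and the two imputed variables recovers slopes near 1
X = np.column_stack([np.ones(n), x1, x2m, i12m])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)
```

Because the imputation parameters come from the complete-data moments, the mix of observed and imputed variables reproduces μ_{12IY} and Σ_{12IY}, and all four regression coefficients come out near 1.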
4.2. Impute, Then Transform, and Its Variants: Passive Imputation and Transform, Impute, Then Transform Again
It bears repeating that the imputed variables X^{(m)}_{2} and I^{(m)}_{12} are not the same as the complete variables X_{2} and I_{12}. Unlike the complete variable X_{2}, the imputed variable X^{(m)}_{2} is not normal. More disturbingly, while the complete-data interaction I_{12} is the product of the complete variables X_{1} and X_{2}, the imputed interaction I^{(m)}_{12} is not the product of the observed X_{1} and the imputed X^{(m)}_{2}. It is quite possible, for example, for I^{(m)}_{12} to have a negative value in a case where both X_{1} and X^{(m)}_{2} are positive.
Some researchers find this last anomaly so troubling that they replace the imputed interaction I^{(m)}_{12} with the product X_{1}X^{(m)}_{2}. We call this procedure transform, impute, then transform again. Alternatively, a researcher may calculate the product X_{1}X^{(m)}_{2} without having previously imputed I^{(m)}_{12}. This approach is called impute, then transform. An iterative version of the impute-then-transform method is called passive imputation; it is implemented in Stata's ice command and, under different names, in IVEware for SAS and MICE for R.
All these techniques are variants of the basic impute-then-transform method, and all of them lead to the same biased estimates. The problem is that the product X_{1}X^{(m)}_{2}, although its individual values look plausible, does not vary in the right way with the other variables, especially with Y. According to Mathstatica, the imputed-then-transformed data (X_{1}, X^{(m)}_{2}, X_{1}X^{(m)}_{2}, Y) have the same mean μ_{12IY} as the complete data, but the covariance matrix of the imputed-then-transformed data is
 (15)
which is clearly not the same as the covariance matrix Σ_{12IY} of the complete data. All of the variances and covariances involving the recalculated interaction X_{1}X^{(m)}_{2} differ from those involving the original complete-data interaction I_{12}. Whereas the complete-data interaction I_{12} had zero covariance with X_{1} and X_{2}, the recalculated interaction X_{1}X^{(m)}_{2} has a small covariance with both X_{1} and X^{(m)}_{2}. (This is possible because X^{(m)}_{2}, unlike X_{2}, has skew.) Most importantly, the recalculated interaction X_{1}X^{(m)}_{2} has a much smaller covariance with Y than did the original interaction I_{12}. This is because the product X_{1}X^{(m)}_{2} was calculated from the observed and imputed X values alone, without additional input from Y.
To see what the impute-then-transform method does to regression estimates, we can use Mathstatica software to derive, from μ_{12IY} and Σ^{(m)}_{12IY}, the parameters for a regression of Y on the observed variable X_{1}, the imputed variable X^{(m)}_{2}, and the composite interaction X_{1}X^{(m)}_{2}. Figure 1 plots the results as a function of the correlation ρ between X_{1} and X_{2}. A horizontal reference line shows the value 1, which in the complete data is the correct value for all of the parameters. If the estimates obtained by the impute-then-transform method were unbiased, they would all be 1. But they are not.
Instead, as an estimate of the complete-data regression, a regression using the imputed-then-transformed data has several biases. The residual variance is too large, and the slopes of X_{2} and I_{12} are biased toward zero, with the size of the bias depending on ρ. The intercept and the slope of X_{1} are also biased, with both the size and direction of the bias depending on ρ. The exact shape of these biases would have been hard to guess, but the broadest patterns are not surprising. Under most conditions the slope of the interaction is much more biased than the other slopes; this makes sense since the interaction is the term that was neglected in imputation. The residual variance is too large under all circumstances; this too makes sense. When the regressors are not imputed well, they leave a lot of the variation in Y unexplained.
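The attenuation is easy to reproduce by simulation. In the sketch below (our code, with ρ = 0.5), X^{(m)}_{2} is imputed by normal linear regression on X_{1} and Y with slopes derived from the true moments, the interaction is then recalculated as the product X_{1}X^{(m)}_{2}, and Y is regressed on the result:

```python
import numpy as np

# Impute, then transform: impute X2 from X1 and Y by normal linear
# regression, recompute the product X1 * X2m, and regress Y on the result.
rng = np.random.default_rng(2)
rho, n = 0.5, 500_000
x1, x2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
y = 1 + x1 + x2 + x1 * x2 + rng.standard_normal(n)

# population regression of X2 on (X1, Y), derived from the true moments
S_oo = np.array([[1.0, 1 + rho], [1 + rho, 4 + 2 * rho + rho**2]])
s_mo = np.array([rho, 1 + rho])
b = np.linalg.solve(S_oo, s_mo)     # imputation slopes for (X1, Y)
resid_sd = np.sqrt(1 - s_mo @ b)    # conditional sd of X2
x2m = b[0] * x1 + b[1] * (y - (1 + rho)) + resid_sd * rng.standard_normal(n)

# regress Y on X1, the imputed X2, and the recalculated product X1 * X2m
X = np.column_stack([np.ones(n), x1, x2m, x1 * x2m])
beta, rss = np.linalg.lstsq(X, y, rcond=None)[:2]
resid_var = rss[0] / n
print(beta, resid_var)
```

At ρ = 0.5 the interaction slope comes out near 0.5 rather than 1, the slope of X^{(m)}_{2} is attenuated as well, and the residual variance is near 1.9 rather than 1, consistent with the pattern described above.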
Would the impute-then-transform method yield better results if the imputation model for X_{2} were better specified? In equation (14), X^{(m)}_{2} was imputed by linear regression on X_{1} and Y. But clearly X^{(m)}_{2} does not depend on X_{1} and Y in a purely linear way. Instead, since the regression equation (12) for Y contains an interaction between X_{1} and X_{2}, we know that the X_{2}–Y relationship varies with X_{1}. It follows that the imputation equation for X_{2} should include an interaction between X_{1} and Y.
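A heuristic way to see this (our rearrangement, not a formal derivation) is to solve equation (12) for X_{2}:

 X_{2} = (Y − α_{Y·12I} − β_{Y·1}X_{1} − e_{Y·12I}) / (β_{Y·2} + β_{Y·I}X_{1})

With all parameters set to 1, the denominator is 1 + X_{1}: the slope of X_{2} on Y varies with X_{1}, which is exactly the kind of dependence that a purely linear imputation model cannot represent. (The rearrangement is only heuristic, since the denominator can approach zero.)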
Suppose that we add the interaction I_{1Y} = X_{1}Y to the regression equation that is used to impute missing values of X_{2}:
 X^{(m)*}_{2} = α_{2·1YI} + β_{2·1}X_{1} + β_{2·Y}Y + β_{2·I}I_{1Y} + e_{2·1YI}  (16)
where the regression parameters are derived from the mean and covariance matrix of (X_{1}, Y, I_{1Y}, X_{2}):
 (17)
And suppose that, after imputing X^{(m)*}_{2}, we calculate the products X_{1}X^{(m)*}_{2}.
Do these imputed interactions come close to replicating the covariance structure of the complete data? Unfortunately, they do not. Although (X_{1}, X^{(m)*}_{2}, X_{1}X^{(m)*}_{2}, Y) has the same mean μ_{12IY} as the complete data, the covariance matrix of the imputed data is not the covariance matrix Σ_{12IY} of the complete data. Instead, the covariance matrix of (X_{1}, X^{(m)*}_{2}, X_{1}X^{(m)*}_{2}, Y) is
 (18)
where the Ps represent polynomials in ρ.
If we derive regression estimates from this mean and covariance matrix, the result will still be biased when compared with the complete-data regression. In fact, when plotted, the biases in these estimates are almost indistinguishable from the biases of the impute-then-transform method plotted in Figure 1. It appears that, although adding an extra interaction when imputing X^{(m)*}_{2} may improve the marginal distribution of the imputed variable, this refinement does not substantially improve the accuracy of the regression estimates that are derived from the imputed data.
4.3. Stratify, Then Impute
An alternative imputation strategy—which we call stratify, then impute—presents itself when one of the interacting variables is discrete and has no missing values. For example, X_{1} may be a complete dummy variable that takes values of 1 and 0. In this situation, we can divide the data into two strata, a stratum with X_{1}= 0 and a stratum with X_{1}= 1. Within each stratum, we impute missing values of X_{2} by linear regression on Y, and if Y has missing values, we can impute them by linear regression on X_{2}. This is an elegant solution since it allows for the interaction without having to incorporate it into the imputation model. Within each stratum no interaction is required, and simple linear regression can produce imputations with the same mean and covariance matrix as the complete data. The regression of Y on the imputed Xs will be the same, on average, as if Y were regressed on the complete Xs.
The stratify-then-impute method is an ideal solution, but unfortunately it is not always available. Defining strata is straightforward when the stratifying variable X_{1} is complete and takes just a few discrete values. But strata are harder to define when X_{1} is continuous or has missing values itself.
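The following sketch (our construction, not the paper's data) illustrates stratify-then-impute with a balanced dummy X_{1}; within each stratum, the imputation of X_{2} uses the true within-stratum moments of (X_{2}, Y):

```python
import numpy as np

# Stratify, then impute: X1 is a complete dummy, X2 is treated as entirely
# missing, and imputation runs separately within each X1 stratum.
# Our construction: X2 ~ N(0, 1) independent of X1, all parameters 1.
rng = np.random.default_rng(3)
n = 400_000
x1 = rng.integers(0, 2, size=n).astype(float)
x2 = rng.standard_normal(n)
y = 1 + x1 + x2 + x1 * x2 + rng.standard_normal(n)

x2m = np.empty(n)
for g in (0.0, 1.0):
    m = x1 == g
    # within stratum g: Y = (1 + g) + (1 + g) * X2 + e, so X2 | Y is linear
    cov_x2y, var_y, mu_y = 1 + g, (1 + g) ** 2 + 1, 1 + g
    slope = cov_x2y / var_y
    resid_sd = np.sqrt(1 - slope * cov_x2y)
    x2m[m] = slope * (y[m] - mu_y) + resid_sd * rng.standard_normal(m.sum())

# pooled regression with the recalculated interaction
X = np.column_stack([np.ones(n), x1, x2m, x1 * x2m])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)
```

Because the within-stratum relationship between X_{2} and Y is purely linear, no interaction is needed in the imputation model, and the pooled regression with the interaction recovers all four coefficients near 1.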
4.4. An Applied Example
To illustrate the imputation of interactions in practice, we again apply the competing imputation methods to Allison's (2002) U.S. News data on colleges and universities. Again we predict each college's graduation rate Y, which is missing for 7.5% of cases, but now our predictors are the college's average combined SAT scores X_{2}, which is missing for 40% of cases, and a complete variable X_{1} indicating whether the college is public (X_{1}= 0) or private (X_{1}= 1). Allison's (2002) own analysis of these data includes other regressors, but to illustrate the properties of competing imputation methods it suffices to regress Y on X_{1}, X_{2}, and I_{12}= X_{1} X_{2}. Despite the omission of supplementary variables, our results are very similar to Allison's, but more comprehensive because more methods are tested.
To ease interpretation, we rescaled the combined SAT score by subtracting the public-college mean of 942 and dividing by 100. Under this scaling, the intercept represents the average graduation rate at a public college, and a single unit on the rescaled score represents 100 points on the SAT. Note that the U.S. News data differ in several ways from the idealized data that we used in our earlier calculations. Unlike the idealized data, the U.S. News data have missing values on Y as well as X_{2}; X_{2} is partly rather than completely missing; and the complete variable X_{1} is a dummy rather than a normal variable.
We applied all of the imputation methods that we described earlier, using M = 40 imputations (Bodner 2008). The passive imputation method was carried out using IVEware for SAS (Raghunathan et al. 2002), while the other imputation methods used the MI procedure in SAS software, version 9. Table 3(a) gives the regression estimates that are obtained after the six different imputation strategies are applied.
Table 3(a). Linear Regression Predicting the Graduation Rate
(standard errors in parentheses; * p < .05, ** p < .01, *** p < .001)

Imputation method                             Intercept        X_{1} Private    X_{2} SAT (hundreds)   I_{12} Private * SAT   Residual variance
Stratify, then impute                         50.5*** (0.7)    12.9*** (0.9)    9.7*** (0.7)           −2.0* (0.9)            194.3
Transform, then impute                        50.5*** (0.8)    12.9*** (0.9)    10.0*** (0.9)          −2.3* (1.0)            194.7
Impute, then transform                        50.4*** (0.8)    12.8*** (0.9)    8.5*** (0.7)           −0.4 (0.8)             197.4
Impute, then transform, with an extra
  interaction to impute X_{2}                 50.6*** (0.8)    12.9*** (0.9)    8.6*** (0.7)           −0.5 (0.8)             196.0
Transform, impute, then transform again       50.5*** (0.8)    12.8*** (0.9)    8.5*** (0.7)           −0.4 (0.9)             196.2
Passive imputation                            50.5*** (0.8)    12.9*** (0.9)    8.5*** (0.7)           −0.4 (0.8)             198.2
Although we do not know the true parameter values, we expect to get our best estimates from the stratify-then-impute method and the transform-then-impute method. Reassuringly, these two methods give very similar results. By contrast, the other methods—impute-then-transform (with and without an extra interaction to impute X_{2}), transform-impute-then-transform-again, and passive imputation—return estimates that are very similar to each other, and quite biased. Compared with the sound methods, the biased methods give too large a residual variance, too small a coefficient for X_{2}, and much too small a coefficient for the interaction. All these biases are consistent with those displayed in Figure 1. The intercept and the slope of X_{1} appear to be approximately unbiased; again, this is consistent with Figure 1, where the bias for these estimates is small if the correlation between X_{1} and X_{2} is small. (In these data, the correlation between X_{1} and X_{2} is just 0.14.)
It is worth taking a minute to interpret the results. According to either of the sound imputation methods—transform-then-impute or stratify-then-impute—the intercept is 50.5, implying that public colleges have an average graduation rate of 50.5%. The slope for X_{1}, the private college indicator, is 12.9, implying that a private college with the same combined SAT score as an average public college would be expected to have a graduation rate that is 12.9% higher. The slope for X_{2}, the combined SAT score, is about 10 and the interaction between X_{1} and X_{2} is about −2. So in public colleges a 100-point boost in the combined SAT score predicts a 10% increase in the graduation rate, but in private colleges the increase in the graduation rate is just 8% (10 − 2). In other words, the graduation gap between public and private colleges is smaller at institutions where the students have high SAT scores. This diminishing return to private education is modeled by the interaction, which under the best imputation methods is decent-sized and significant. Under the inferior methods, the same interaction is nonsignificant and trivial in size. This bias might lead researchers to overlook the diminishing returns of private education for high-scoring students. Parents reading a research summary might mistakenly believe that their high-scoring children would benefit from private schooling much more than is actually the case.
We also carried out a probit regression in which the interacting variables were used to predict a dummy variable Y* that turns from 0 to 1 when the graduation rate Y exceeds the median value of 60%. Given the relationship between a probit regression of Y* and a linear regression of Y (see Section 3.4), it is not surprising that the probit results in Table 3(b) display the same pattern as the linear results in Table 3(a). Under the biased imputation methods, the probit slopes of both Xs and especially of their interaction are smaller than they are under the sound methods. A logistic regression would display the same biases as well, since logistic slopes are approximately equal to probit slopes multiplied by 1.65 (Long 1997).
Table 3(b). Probit Regression Predicting Whether the Graduation Rate Exceeds 60%
(standard errors in parentheses; * p < .05, ** p < .01, *** p < .001)

Imputation method                             Intercept          X_{1} Private    X_{2} SAT (hundreds)   I_{12} Private * SAT
Stratify, then impute                         −0.98*** (0.10)    1.18*** (0.11)   0.96*** (0.12)         −0.38** (0.14)
Transform, then impute                        −0.95*** (0.09)    1.16*** (0.11)   0.93*** (0.12)         −0.34* (0.14)
Impute, then transform                        −0.92*** (0.09)    1.12*** (0.11)   0.76*** (0.10)         −0.13 (0.12)
Impute, then transform, with an extra
  interaction to impute X_{2}                 −0.88*** (0.09)    1.08*** (0.11)   0.65*** (0.13)         0.02 (0.14)
Transform, impute, then transform again       −0.91*** (0.10)    1.11*** (0.11)   0.75*** (0.11)         −0.11 (0.12)
Passive imputation                            −0.90*** (0.09)    1.09*** (0.11)   0.74*** (0.11)         −0.12 (0.12)