Toward a standardized evaluation of imputation methodology

Developing new imputation methodology has become a very active field. Unfortunately, there is no consensus on how to perform simulation studies to evaluate the properties of imputation methods. In part, this may be due to different aims between fields and studies. For example, when evaluating imputation techniques aimed at prediction, different aims may be formulated than when statistical inference is of interest. The lack of consensus may also stem from different personal preferences or scientific backgrounds. All in all, the lack of common ground in evaluating imputation methodology may lead to suboptimal use in practice. In this paper, we propose a move toward a standardized evaluation of imputation methodology. To demonstrate the need for standardization, we highlight a set of possible pitfalls that bring forth a chain of potential problems in the objective assessment of the performance of imputation routines. Additionally, we suggest a course of action for simulating and evaluating missing data problems. Our suggested course of action is by no means meant to serve as a complete cookbook, but rather meant to incite critical thinking and a move to objective and fair evaluations of imputation methodology. We invite the readers of this paper to contribute to the suggested course of action.

For some data analyses, a single imputation may suffice, while for other data analyses, multiple imputations are needed. For example, in the case of inferential analyses, such multiple imputations (Rubin, 1976) have proven to be a valuable technique for obtaining valid inferences on incomplete data sets. The resulting multiple inferences on multiple completed data sets can be combined into a single inference using Rubin's rules (Rubin, 1987, p. 76). When generating predicted values is the goal of the data analysis, a single imputation may already generate sufficient predicted values (Nijman et al., 2020).
In all scenarios, it can be said that the quality of the solution obtained by imputation depends on the statistical properties of the incomplete data and the degree to which the imputation procedure is able to capture these properties when modeling missing values. In general, modeling missing data becomes more challenging as the amount of missingness increases. However, when (strong) relations in the data are present, the observed parts can hold great predictive power for the models that estimate the missingness. In that case, imputation may be substantially more efficient than the ubiquitous complete case analysis.
When evaluating the statistical properties, and thereby the practical applicability, of imputation methodology, researchers most often make use of simulation studies. Such studies are typically composed of an analysis pipeline in which data are successively generated, made incomplete, and imputed, repeatedly and under different simulation conditions. A set of evaluation criteria is then postulated to evaluate the performance of one or more missing data methods. Recent attention to best practices for method evaluation by means of simulation (Morris et al., 2019) and the advice of Pawel et al. (2022) to preregister simulation protocols will guide methodologists in conceptualizing their simulation workflow. However, there is no consensus on how to design, execute, and report simulation studies aimed at evaluating imputation methodology. Without a "gold standard" for evaluation, the validity and comparability of simulation setups may differ tremendously from one developer to another. This brings forth a chain of potential problems in the objective assessment of imputation method performance within and across studies, which could lead to suboptimal use of imputation in practice.
The purpose of this paper is threefold: First, we raise some concerns with respect to evaluating imputation methodology. These concerns stem from careful deliberation with fellow "imputers" and from encounters as reviewers for statistical journals. Second, we provide imputation methodologists with a suggested course of action when using simulation studies to evaluate imputation techniques for missing data problems. This suggested approach should identify common ground, but is in no way intended as an absolute solution. This identifies the third purpose of our paper: discussion. We hope to elicit critical thinking regarding the problems at hand. We are all convinced that our methodology has some merit. But for the sake of progress, it would be much more advantageous if the aim of our evaluations went beyond proving the point and legitimately considered the statistical properties.

WHY SOME EVALUATIONS SHOULD NOT BE TRUSTED
The ideal evaluation of imputation methodology may differ between studies. A simulation study developed for the comparison of several imputation methods will typically have a different design than one aimed at establishing the inferential validity of a single (novel) method. How the evaluated methods are used in practice thereby governs the simulation procedure. Since the simulation aims and setup are intertwined with the choice of the evaluated imputation methods, one should always assess the suitability of the chosen evaluation metrics with the imputation methods' purpose in mind.
We limit the scope of this paper to comparative simulation studies in the context of statistical inference. With that, we exclude the evaluation of imputation methodology for detailed aspects of causal inference and prediction. The type of simulation study under consideration in this paper requires a form of "comparative truth" to assess the inferential validity of imputation methods. This is in contrast to simulation studies aimed at comparing the predictive performance of imputation and prediction method pairs. Such a simulation design typically does not start out from a "ground truth" (e.g., a complete data set in which missingness is subsequently induced by the simulator). Rather, the imputation and prediction method pairs are evaluated on their ability to yield high predictive accuracy in one or more incomplete benchmark data sets (Liu et al., 2021). In these designs, only the comparative performance of methods may be established. We do not recommend this approach if the inferential validity of the imputations is of interest. For an overview of missingness in the prediction context, see, for example, Wood et al. (2015) and Sperrin et al. (2020). Missing data within a causal inference framework have been described by, for example, Moreno-Betancur et al. (2018) and Mohan and Pearl (2021).
Within the scope of comparative simulation studies aimed at statistical inference, we observe several problems. To demonstrate the broad impact of these problems, we recognize the following four distinct categories: problems with simulation design, problems with data generation, problems with missingness generation, and problems with performance evaluation. We further detail the impact each of these problems may have on the validity of evaluations.

Simulation design
Before setting up a simulation study, the simulator should clearly define and describe the scope of their evaluations. If the simulation parameters are unclear, any methodological problems with simulation study design may be obfuscated by the cognitive problem of generalization (Greenland, 2017). For example, misinterpretation and (unintended) questionable research practices can prompt extrapolation beyond the scope of the simulations (Greenland, 2017; Pawel et al., 2022). Imputation methods that seem to work well in simulations may then not be suitable in practice, when applied to incomplete empirical data.
One potential reason for discrepancies between imputation method performance in simulations versus applications is the use of sampling variance in the simulation design. For example, when dealing with infinite populations, the conventional pooling rules proposed by Rubin (1987, pp. 76-77) apply. When dealing with a finite incomplete population, some adjustment is needed for the pooling rules to yield correct inferences. In such cases, valid inferences can be obtained by excluding the sampling variance components from the calculation of the total variance about (Q − Q̄) (Raghunathan et al., 2003; Vink & Van Buuren, 2014), where Q would be the estimand and Q̄ = Σₗ₌₁ᵐ Q̂ₗ/m is the average over the m imputed estimates (cf. Rubin, 1987, pp. 76-77). Vink and Van Buuren (2014) propose that the total variance T = B + B/m would then solely rely on the between-imputation variance B = Σₗ₌₁ᵐ (Q̂ₗ − Q̄)′(Q̂ₗ − Q̄)/(m − 1) and demonstrate that the degrees of freedom would then equal m − 1.
We can make use of this unique real-world data property when simulating the performance of incomplete data methodology. When we take a single drawn complete data set as our comparative truth, just inducing simulated missingness would suffice, and the necessary Monte Carlo variance for our evaluations would stem from the sampled missingness. The above-detailed pooling rules can be used to obtain valid inferences in those simulations. A detailed overview of simulation strategies that adopt this approach, as well as conventional simulation strategies, can be found in Vink (2022).
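The adjusted pooling rules described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name is ours, and the computation follows the quantities defined in the text (pooled estimate Q̄, between-imputation variance B, total variance T = B + B/m, and m − 1 degrees of freedom).

```python
import numpy as np

def pool_single_truth(estimates):
    """Pool m imputed estimates when a single fixed data set serves as
    the comparative truth: the total variance T = B + B/m relies solely
    on the between-imputation variance B, with m - 1 degrees of freedom
    (cf. Vink & Van Buuren, 2014). Illustrative sketch; the function
    name is ours. Returns (Q-bar, T, df)."""
    estimates = np.asarray(estimates, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()       # pooled point estimate Q-bar
    b = estimates.var(ddof=1)      # between-imputation variance B
    t = b + b / m                  # total variance, no sampling component
    return q_bar, t, m - 1
```

A t-interval on the pooled estimate would then use the returned variance and degrees of freedom instead of the conventional Rubin (1987) quantities.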
We would like to note that it is not strictly necessary to exclude sampling variance from missing data simulations, as the additional variance induced by the sampling step would be ignorable. In practice, however, we have found it convenient to allow a single data set to serve as comparative truth in all simulation repetitions. Especially for data-driven simulation strategies with complex data transformations, it can be challenging to derive the true parametric references to evaluate against. Evaluating against a single generated set avoids many procedural problems and computational challenges. There are also other benefits to this simulation approach: it is computationally convenient, requires less data archiving, and may sharpen comparisons between imputation methods, thereby clarifying inconclusive results on imputation method performance. After all, we are interested in solving for the missingness and can do without the ignorable noise induced by the sampling mechanism for evaluation in such studies.
A well-designed simulation does not guarantee fair evaluations. Estimands and other simulation targets should be clearly and unambiguously defined in the context of the study aim (Petersen & Van der Laan, 2014). The study aim and the required level of precision may also inform the number of simulation repetitions (e.g., as determined from a maximum tolerable level of uncertainty in terms of a performance measure's Monte Carlo error; see Morris et al., 2019). The simulation design should ideally be fully factorial, that is, varying each simulation condition against all other conditions. There may, for example, be interactions between the source and amount of missingness in the incomplete data and the efficacy of imputation methods: the validity of the assumed missingness mechanism becomes increasingly important with higher missingness proportions (Schouten & Vink, 2021). If there is more than one imputation method under evaluation, the simulation design should apply each method to every incomplete data set. Applying all methods to the same incomplete data sets is computationally convenient and minimizes unnecessary variation, which makes for fairer comparisons.
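The Monte Carlo error mentioned above can guide the choice of the number of repetitions. As a sketch (the helper name is ours), the Monte Carlo standard error of an estimated coverage rate follows the formula given by Morris et al. (2019):

```python
import math

def coverage_mcse(p, n_sim):
    """Monte Carlo standard error of an estimated coverage rate p after
    n_sim simulation repetitions: sqrt(p * (1 - p) / n_sim)
    (cf. Morris et al., 2019). Hypothetical helper; the name is ours."""
    return math.sqrt(p * (1 - p) / n_sim)
```

For example, at an anticipated coverage of 95%, 1900 repetitions yield a Monte Carlo standard error of 0.005, that is, half a percentage point of uncertainty on the estimated coverage rate.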

Data generation
To evaluate the ability of an imputation routine to handle missingness, a form of ground truth has to be established. Those who perform simulation studies are in the luxury position to establish the truth beforehand by choosing a data-generating mechanism. Data-generating mechanisms define how a complete data set is obtained at the start of each simulation repetition. We highlight two general data generation approaches: (i) model-based simulation, in which data are drawn from a known statistical model or probability distribution, such as the multivariate normal distribution; and (ii) design-based simulation, where data are sampled from a sufficiently large observed set, such as official registers. Model-based data-generating mechanisms have advantages in flexibility and precision, because data are generated from a known statistical model and the true theoretical parameters can be derived. A design-based approach is often used in situations where a probability distribution is not available, or where real-life data structures are of interest. Of course, the simulator should choose their data-generating mechanism in line with the study scope. Other methods to simulate data may be more suitable, for example, synthetization of the actual data-generating process (Li et al., 2022; Volker & Vink, 2021).
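A model-based data-generating mechanism of the first kind can be sketched as follows. This is an illustrative example, not taken from the paper; the function name is ours. Because the data are drawn from a bivariate standard normal distribution with a chosen correlation, the true theoretical parameters of the comparative truth are known exactly.

```python
import numpy as np

def generate_mvn(n, rho, seed=None):
    """Model-based data generation: draw n bivariate standard-normal
    observations with true correlation rho, so that the theoretical
    parameters of the comparative truth are known exactly.
    Illustrative sketch; the function name is ours."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]  # unit variances, correlation rho
    return rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
```

Any estimate obtained after amputation and imputation of such data can then be evaluated against the known population values (means of 0, variances of 1, correlation rho).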
The challenge with model-based data-generating mechanisms is that a method's performance on simulated data may not translate to empirical data. Real-life data may not follow a known theoretical distribution, so there is no guarantee that simulation results are generalizable. Moreover, data are often generated such that the problem being studied is most pronounced, for example, with consistently high correlations between groups of variables. This results in simulated data that contain such valuable information structures that, no matter what type of missingness would subsequently be induced, the observed parts of the data will still hold much (if not all) of the information about the missing part. Unsurprisingly, the performance of any imputation method will then be evaluated as good.
Another threat to the generalizability of model-based simulations is the use of a single model for both data generation and imputation. If data are generated following a model that is also used for imputing the data, the imputation approach will be deemed good (or better than other methods) purely because the evaluated conditions favor the problem that is studied. Other potentially unfair comparative advantages in favor of a certain imputation method may occur due to characteristics of the generated data, such as the number of observations, the number of variables, the measurement level, and the coherence between variables. In contrast to design-based studies, such characteristics are not always explicit simulation conditions, which may give a false sense of objectivity.
An obvious problem with design-based simulation is that obtaining a large data set without missing entries can be challenging. Most real-world data contain at least some missing entries, for which the true underlying missing data model is, by definition, unknown. Therefore, the simulator needs to deal with missingness in some way before incomplete empirical data can serve as comparative truth in the simulations. It might seem an intuitive solution to only draw complete cases from the large data set, which would indeed yield complete samples. However, there may be inherent differences between cases with and cases without missing values, due to the unknown missing data model. Only sampling complete cases from the data set may thus result in samples that fail to capture all relevant real-world conditions, which, in turn, refutes the main reason for using design-based simulation. Another way to deal with missingness in a design-based simulation is to impute the incomplete data set once, to obtain a single completed data set to draw samples from. Unfortunately, this practice may favor the imputation method that was used in this initial imputation step throughout any further evaluations. To be clear: leaving the missingness as-is and inducing additional missing values to impute is not an option here, because we would then not have a real and unbiased comparative truth.

Missingness generation
Often, reports of simulation studies remain vague about the missingness conditions under investigation and, even worse, some authors only report something like: "We generated missing data following a missing at random missingness mechanism."
This should be considered unacceptable, as claims about the validity of the imputation inference heavily depend on the simulated missingness conditions, such as missingness mechanisms and missingness patterns. Missingness mechanisms describe the relationship between missing entries and observed data values, whereas missingness patterns concern the location of missing entries across incomplete data (Little & Rubin, 2020, p. 8). Under this definition, there is a row-wise element to the missingness pattern describing which variables are jointly observed, and a column-wise element encompassing the amount of missingness in the data.
Even the terminology on missingness generation can be confusing. Does a missingness proportion of 50% mean that half of the entries in an incomplete data set are missing, or that half of the rows have at least one missing entry? In this paper, we will henceforth refer to the latter as the proportion of incomplete cases, and keep the term "missingness proportion" restricted to the variable-by-variable interpretation. The distinction between the two concepts, however, is not always clear in the literature, and convoluting the terms may lead to incorrect generalizations, because they rarely mean the same (e.g., a proportion of 50% incomplete cases in bivariate data could translate to a missingness proportion of 25% in both variables, or one completely observed variable and one with 50% missingness). Simulators should therefore be explicit in their description of the missing data-generating model and should make sure that the model fits the aim of their simulations.
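The distinction between the two quantities is easily made explicit in code. The following sketch (the helper name is ours) computes both for a data matrix with NaN-coded missingness:

```python
import numpy as np

def missingness_summary(data):
    """Distinguish the variable-by-variable missingness proportion from
    the proportion of incomplete cases (rows with at least one missing
    entry), as discussed above. Illustrative helper; the name is ours."""
    miss = np.isnan(data)
    per_variable = miss.mean(axis=0)            # missingness proportion per column
    incomplete_cases = miss.any(axis=1).mean()  # proportion of incomplete rows
    return per_variable, incomplete_cases
```

For the bivariate example above, 25% missingness in each variable, induced in disjoint rows, indeed corresponds to 50% incomplete cases.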
Extrapolating simulation results from a specific missing data-generating model to more intricate (empirical) missingness could lead to suboptimal advice in practice. For example, if a simulator induces missingness in the outcome variable of their analysis model exclusively, they may inadvertently induce missingness according to the special case described by Van Buuren (2018, §2.7). Under this specific missingness model, list-wise deletion may outperform any imputation method. Such a conclusion would, unfortunately, only translate to data that adhere to the same special case, while biasing inferences in other cases. Another example of potential side effects of missingness generation may occur when multivariate missingness is generated using step-wise univariate missingness induction. The resulting incomplete data may then not have the desired statistical properties, or unintentional nonignorable missingness may be generated (Robins & Gill, 1997; Schouten et al., 2018). This complicates any definitive conclusions about imputation method performance in relation to the data generation parameters.
Missingness should be induced according to several sets of missing data conditions. We encourage simulators to consider different missingness patterns and mechanisms. Missingness mechanisms were first defined in Rubin's seminal work (Rubin, 1976). For more recent considerations, see, for example, Doretti et al. (2018), Little and Rubin (2020), Moreno-Betancur et al. (2018), Mohan and Pearl (2021), Mealli and Rubin (2015), Schomaker (2021), Seaman et al. (2013), and Schouten and Vink (2021). Although not every missingness mechanism is realistically assumed in practice, they can all offer valuable insights as simulation conditions. A disconnect between induced missingness, real missingness, and assumed missingness may result in simulation studies that are not as informative as they could be.
Truly random missingness across all data entries ("missing completely at random" or MCAR under Rubin's (1976) definition) may be considered a necessary simulation condition for the evaluation of imputation procedures, because the statistical properties of the observed data given the missing data are known. Any imputation routine that cannot at least mimic the performance of the observed data inference should be deemed inefficient in the scope of the simulation. If an imputation method is not able to solve the problem (i.e., yield valid inference) under MCAR, the statistical properties of the procedure are not universally sound. Sadly, the straightforward case of MCAR is often neglected in simulation studies, whereas it offers an informative reference condition in many simulation setups.
Besides MCAR missingness, observed-data-dependent missingness mechanisms, such as missing at random (MAR) missingness, are paramount in statistical inference investigations. A straightforward technique for inducing univariate MAR missingness is described in Van Buuren (2018, §3.2.4), whereas generalizations to multivariate MAR missingness can be found in Schouten et al. (2018). If the missingness is to be induced in longitudinal data, generating the missingness through autoregressive MAR models can be useful (see, e.g., Shara et al., 2015, models 2 and 3). It is advisable to investigate varying shapes of MAR missingness to achieve a more realistic indication of the robustness of the imputation performance across the range of random missingness. The effects of different types of MAR mechanisms are described in Schouten and Vink (2021).
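Univariate right-tailed MAR amputation can be sketched as follows. This is our own illustrative implementation, in the spirit of the technique in Van Buuren (2018, §3.2.4); the function name and the exponential weighting on the standardized covariate are our choices, and other functional forms (e.g., logistic weights) are equally valid:

```python
import numpy as np

def ampute_right_tail_mar(y, x, prop=0.5, seed=None):
    """Induce univariate right-tailed MAR missingness in y: cases with
    larger observed values of the covariate x are more likely to lose
    their y value. Exponential weights on standardized x are one
    possible functional form; illustrative sketch, names are ours."""
    rng = np.random.default_rng(seed)
    y, x = np.asarray(y, float), np.asarray(x, float)
    z = (x - x.mean()) / x.std()   # standardize the missingness covariate
    weights = np.exp(z)            # larger x -> larger missingness odds
    weights /= weights.sum()
    idx = rng.choice(len(y), size=int(round(prop * len(y))),
                     replace=False, p=weights)
    y_amputed = y.copy()
    y_amputed[idx] = np.nan        # right-tailed MAR missingness in y
    return y_amputed
```

Note that, as discussed below, such a scheme only yields a valid MAR mechanism when y and x are actually related; with zero correlation, the induced missingness mimics MCAR.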
Inducing an MAR mechanism presents the simulator with a choice, namely, the type or functional form of the missingness model. Given the simulated data distributions, one random missingness model may be far more disastrous to the observed information than another model (Schouten & Vink, 2021). This may influence the performance of some (but not necessarily all) imputation routines. For example, inference from hot-deck techniques such as predictive mean matching (Little, 1988; Rubin, 1986) may be more severely impacted by large amounts of one-tailed missingness than inference from parametric techniques. It would be a shame to overlook such results due to the focus on a single functional form of the missingness-generating model.
Although MAR missingness is often considered as a simulation condition, the problem of spurious MAR is generally overlooked. With MAR missingness mechanisms, observed relations in the data are used to induce missingness during simulation (e.g., weight is made incomplete based on observed gender to mimic a situation wherein one gender is less likely to disclose their weight). These relations in the observed data may, however, be weak or nonexistent. If MAR is induced based on weaker relations in the true data, claims for a method's applicability to situations where the missingness is random become less valid. The most extreme example would be when MAR is induced from data without multivariate relations. The inferential implications of the missingness would then mimic those of MCAR, even though the missingness is generated randomly. Figure 1 demonstrates this: Both univariate MAR mechanisms in Figure 1 are induced based on the right tail, but only the mechanisms for which the variable to ampute and the missingness covariate have a positive correlation will yield valid MAR mechanisms. In the case of zero correlation between the data columns, iterative univariate non-MCAR missingness schemes may induce invalid missingness mechanisms that mimic MCAR. Table 1 demonstrates this and highlights the importance of including complete case analysis as an evaluation analysis to allow for identifying such invalidly generated missingness mechanisms in simulation. In fact, this would amount to one of the special cases described by Van Buuren (2018) under which complete case analysis would be more efficient than imputation: the missingness does not depend on the incomplete variable. This property might be useful in practice, but considering it as a condition to evaluate the performance under MAR missingness is pointless.

FIGURE 1 Three different realizations of missingness generation for MCAR and right-tailed MAR missingness mechanisms for different levels of correlation ρ between the variable to ampute and the missingness covariate that guides the probability to be missing. Note that the conditional probability of right-tailed MAR (ρ = 0) mimics that of MCAR because the variable to ampute and the missingness covariate have ρ = 0.

TABLE 1 Inferences obtained by complete case analysis (CCA) and multiple imputation over 1000 simulations. Displayed are the bias of the mean, coverage rate (cov) of the corresponding 95% confidence interval, and the confidence interval width (ciw). Imputations are generated by Bayesian linear regression imputation.
A nonignorable or MNAR mechanism, where the unobserved values themselves may also play a role in the probability to be missing, might fall outside many studies' scope, yet could yield informative insights for daily practice. Even if an imputation method is specifically developed with ignorable missingness in mind, chances are that the method will be applied to nonignorable empirical missingness at some point in time. After all, for every MNAR mechanism, there is an MAR mechanism with equal fit, and one cannot definitively verify that empirical missingness is random (Molenberghs et al., 2008). Therefore, it can be argued that MNAR is the more likely mechanism for real-life missingness scenarios. It may be wise to include MNAR missingness in the simulation study just in case: A method that performs well under MAR and some cases of MNAR may be preferable over a method that only works under MAR, and not under MNAR. Such "late-phase" methodological investigations (see Heinze et al., 2022) can provide an understanding of an imputation method's robustness. Alternatively, a sensitivity analysis may provide an indication of the validity of the obtained inference, given that the assumed missingness mechanism is suspected to be invalid (see, e.g., Molenberghs et al., 2014, part 5).
In addition to varying the missingness mechanisms, each simulated mechanism should be combined with different missingness patterns. Remember that missingness is only ignorable under MAR when the parameter of the data is distinct and a priori independent from the parameter of the missing data process (Little & Rubin, 2020, Corollary 6.1A). Under MAR missingness, we assume that we may use the observed data to make inferences about the joint (observed and unobserved) data. The dependency of the procedure on the assumption under which we obtain inference is only influenced by the amount of missingness. If there is no missingness (or if there are no data, for that matter), the inference does not depend on the assumption. Conversely, the validity of assumptions becomes increasingly important when the missingness increases. Since we control the MAR mechanism, the assumption under which we may solve the missing data problem should hold, and it is only fair to assess performance under stringent missingness conditions. We therefore propose to evaluate imputation methodology under several missingness proportions to emulate a realistic range in severity of the missingness problem. Depending on how the missingness mechanism interacts with the simulated data, higher missingness proportions may yield biased or invalid completed data inferences. The missingness proportions should thus be considered carefully.
The above emphasizes the need for both thorough evaluation and thorough documentation of the intended and used missingness mechanisms. Simulators must make clear which mechanisms are of interest. To be able to verify these mechanisms, simulators must also be transparent about the process that generated the missingness, for example, by sharing the code or by detailing the process that generated the missing values. Alternatively, missingness generation devices, such as the mice::ampute() function, can be used to generate valid, pattern-based missingness in any structured data set (Schouten et al., 2018; Van Buuren & Groothuis-Oudshoorn, 2011). Finally, simulators must include simulation conditions and evaluation parameters that make it possible to identify the proper generation of missing values. This would prevent errors in the missingness generation from going unnoticed.

Performance evaluation
The evaluation criteria used to assess imputation performance vary from one simulator to another.This is not surprising as people from different fields could have a different focus on the problem at hand.There are, however, some overarching issues when assessing imputation method performance.In the first place, there are pitfalls in the evaluation of the estimates that are obtained by fitting the analysis model after imputation.For example, focusing on one performance measure over another may yield different conclusions about imputation efficacy.In the second place, diagnostic evaluation of the generated imputations is often left out of simulation study results.Identifying problems with the imputationgenerating process (e.g., an iterative imputation algorithm) may offer explanations for underperformance of imputation methods.Such evaluations, however, are typically omitted and valuable insights into the imputation method(s) may then be overlooked.At minimum, imputation method performance should be quantified using appropriate measures.It depends on the specifics of each study which performance measures would be most suitable.If the goal is inference-not prediction-the uncertainty about estimates needs to be properly quantified.To capture this uncertainty, the standard errors of the estimates should be correctly calculated, which requires multiple imputation.The aim of multiple imputation is not to reproduce the data, but to allow for obtaining valid inference given that the data are incomplete.This means that, given the framework provided by Rubin (1987), and for any ( − Q), statistical properties such as bias, confidence intervals, and the coverage rate of the confidence intervals should always be studied.We therefore recommend evaluating the following points: (i) The methods should preferably be unbiased.Note that the way bias is calculated should be carefully chosen and described, because this can greatly influence the interpretation of the results.A negligible 
absolute bias, for example, may already yield an almost infinite relative bias if the true value of the estimand is zero. In most cases, unbiased estimation may be expected under an MCAR missingness mechanism, which can easily be verified by including MCAR missingness in the scope of the missingness investigations. (ii) The intervals around estimates should have valid coverage of the population (i.e., true) value. Coverage of a 95% interval should in theory be ≥ 95%, where a coverage rate of exactly 95% would be most efficient (Neyman, 1934, p. 589). Undercoverage indicates that the procedure is not confidence valid and may lead to invalid inference. Undercoverage may occur when the estimation procedure is biased, too liberal (yielding intervals that are too narrow), or both. Overcoverage, though technically not confidence invalid, indicates that efficiency could still be gained: a narrower interval exists for which the method would still yield confidence valid results. (iii) The width of the confidence or credible interval conveys statistical efficiency and should be considered when comparing imputation methods. Wider intervals are associated with more uncertainty, whereas a narrower interval that still attains nominal coverage indicates a sharper inference. However, inference from a wider interval with proper coverage is to be considered more valid than a narrower interval that no longer covers properly. That said, given valid nominal coverage, the method that yields the narrowest interval is most efficient. (iv) Resemblance to the true data values may be quantified using the root mean squared error (RMSE). We do not generally recommend the RMSE as an evaluation criterion, because this metric does not account for the inherent uncertainty of the missing values and may inflate the type I error rate of statistical inferences (Chapter 2.6, Van Buuren, 2018). However, if a study is aimed at obtaining predictions, and inferential validity is not of interest, the RMSE of the prediction errors (residuals) can be compared between methods. The RMSE may then yield valuable information about the methods' comparative efficiency in terms of accuracy and precision.
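The inferential criteria above (bias, coverage, and interval width) can be computed directly from simulation replications. The following minimal Python sketch illustrates this for one estimand; the function name and interface are ours and purely illustrative:

```python
import numpy as np

def evaluate_estimates(estimates, lower, upper, truth):
    """Summarize simulation replications of a single estimand.

    estimates, lower, upper: one point estimate and one 95% interval
    bound pair per replication. truth: the population (true) value.
    """
    estimates = np.asarray(estimates, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return {
        # raw (absolute) bias of the point estimates
        "bias": float(np.mean(estimates) - truth),
        # proportion of intervals that cover the true value
        "coverage": float(np.mean((lower <= truth) & (truth <= upper))),
        # average confidence interval width
        "ciw": float(np.mean(upper - lower)),
    }
```

Note that the RMSE deliberately does not appear here; as argued above, it is only appropriate when prediction rather than inference is the aim.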
After establishing the statistical validity of an imputation routine on the level of the estimand, it remains paramount to also study the imputation-generating process. Omitting such investigations may yield suboptimal results. One simple example arises often in practice: generating implausible imputed values. Even though the estimation on the analysis level may be justified, some methods can yield imputations that may seem completely invalid to applied researchers. For example, one could very accurately estimate average human height by filling in negative values and values that are unrealistically large. While the obtained inference could still be valid under such imputations, the plausibility of the imputed values given the observed data should be under scrutiny. Under many circumstances, imputation methods may realistically be expected to preserve both marginal and conditional distributions with respect to the comparative truth. Imputation methods that fail to do so should not be considered general-purpose methods. Many techniques can yield valid inferences, but techniques that sample realistic or plausible values may be preferable in practice.
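A first screening for implausible imputations can be automated. The Python sketch below flags the share of imputed values falling outside a range the evaluator deems plausible; the bounds and names are our illustrative assumptions (e.g., human heights outside 50 to 250 cm):

```python
import numpy as np

def implausible_rate(imputations, lower=0.0, upper=np.inf):
    """Fraction of imputed values outside a plausible range.

    imputations: imputed values for one variable (e.g., height in cm).
    lower/upper: plausibility bounds chosen by the evaluator.
    """
    imp = np.asarray(imputations, dtype=float)
    return float(np.mean((imp < lower) | (imp > upper)))
```

Any nonzero rate, for example `implausible_rate(imputed_heights, lower=50, upper=250)`, would warrant closer inspection of the imputation model, even if the resulting inference appears valid.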
The evaluation of simulated results goes far beyond plausibility and distributional characteristics. The quality and validity of an imputation task also rely on the relation between the missing data models, imputation models, and analysis models. When a joint model exists whose conditionals include the imputation model and the analysis model, the two models are said to be congenial (Bartlett et al., 2015; Meng, 1994). The congeniality of all imputation models should be assessed, but methods to assess the suitability of imputation models and to diagnose misfit often rely on visual inspection of the imputations (see, e.g., Abayomi et al., 2008; Bondarenko & Raghunathan, 2016). This makes congeniality assessment a challenging and time-consuming endeavor in many simulation studies.
When algorithms are used, the simulator should study the validity of the algorithmic process. Many contemporary imputation techniques rely on iterative algorithms, such as the Gibbs sampler, to generate imputations. As with any iterative algorithm, and especially with imputation algorithms that should critically be regarded as possibly incompatible Gibbs samplers (PIGS; Li et al., 2012), algorithmic convergence should be carefully evaluated. Oberman et al. (2021) demonstrate that statistical validity can be reached before algorithmic convergence, but this is not guaranteed. Unfortunately, there is no universal quantitative method to diagnose nonconvergence in iterative imputation algorithms (Oberman et al., 2021; Zhu & Raghunathan, 2015), and the alternative (visual inspection of the imputation algorithm; Van Buuren, 2018) is neither efficient nor fail-proof. As a result, imputation algorithms may be terminated before reaching a stable state, which could yield suboptimal imputations and an underestimated performance of the method. Problems with producing stable imputations are not exclusive to iterative imputation algorithms. Any method may run into failures of the imputation-generating process and subsequently lack results, for example, due to overparameterization errors.
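Although no universal diagnostic exists, generic MCMC diagnostics can still be informative when applied to a scalar summary of each imputation chain, such as the per-iteration mean of the imputed values. The Python sketch below computes the Gelman-Rubin potential scale reduction factor on such summaries; it is our own illustration and not part of any of the cited packages:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for a scalar chain statistic.

    chains: array of shape (m, n) -- m parallel imputation chains over
    n iterations, each entry a summary of the imputations at that
    iteration (e.g., the chain mean of the imputed values).
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return float(np.sqrt(var_hat / W))
```

Values close to 1 are consistent with, but do not prove, convergence; values well above 1 indicate that the chains have not yet mixed and that the imputations may be premature.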
Every imputation workflow should therefore contain an evaluation of the obtained imputations. Even though inspecting each imputation may be labor-intensive due to the number of imputations generated in a simulation study, we highly recommend that simulators consider the following aspects: (i) The absence of nonconvergence in the imputation-generating process is a minimum requirement for any imputation method. If nonconvergence is suspected, the inference resulting from the imputations might be invalid. However, preliminary work suggests that iterative imputation algorithms could achieve inferential validity before reaching a stable state (Oberman et al., 2021). (ii) The fit of the imputation model may be verified with the help of a posterior predictive check (Nguyen et al., 2017; Zhao, 2022). A straightforward posterior predictive check for imputation methodology is the multiple overimputation of observed data values (Cai et al., 2022). If the statistical properties of the overimputed values are equivalent to those of the observed data values, one could infer that the imputation model fits the observed part of the incomplete data reasonably well. By extension, one could then assume that the imputation model might be able to produce good imputations for the missing part of the data too. This overimputation procedure is straightforward with the overimpute() function in Amelia II (Honaker et al., 2011) and the where argument in mice (Cai et al., 2022; Van Buuren & Groothuis-Oudshoorn, 2011). (iii) The distributional characteristics of the imputations should be inspected for anomalies. The distribution of the imputed values may differ greatly from that of the observed data; under anything but the MCAR assumption, this can be expected. When evaluating imputations, the distributional shapes should be checked and diagnostic evaluations should be performed (see Abayomi et al., 2008, for a detailed overview of diagnostic evaluation for multivariate imputations). When anomalies are found, and the imputation method is valid, there should be an explanation, especially in the controlled environment of a properly executed simulation study.
(iv) Finally, the plausibility of the imputed values may be evaluated. Plausible imputations, that is, imputations that could be real values if they had been observed, are not a necessary condition for obtaining valid inference. However, in practice, especially when the imputer and the analyst are different persons, plausible imputations may be a desired property. One would prefer an imputation technique to yield both valid inference and plausible imputations. It should be studied whether an imputation method is prone to deliver implausible imputations, and if so, under what conditions. When evaluating imputation routines, the evaluator should mention whether the routine is prone to deliver implausible values. We must note, however, that the aim of imputation should be valid inference in the first place, and plausibility in the second place.
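The overimputation check in (ii) reduces to comparing each observed value against the spread of its own overimputed draws. The cited R functions implement this directly; a language-agnostic version of the comparison step might look as follows (Python sketch; the names and the 90% level are our assumptions):

```python
import numpy as np

def overimputation_check(observed, overimputed, level=0.90):
    """Posterior predictive check by overimputing observed cells.

    observed: n observed values that were deliberately overimputed.
    overimputed: array of shape (n, d) -- d overimputed draws per cell,
    assumed to come from the imputation routine under evaluation.
    Returns the fraction of observed values falling inside the central
    `level` interval of their own overimputed draws.
    """
    observed = np.asarray(observed, dtype=float)
    overimputed = np.asarray(overimputed, dtype=float)
    alpha = (1.0 - level) / 2.0
    lo = np.quantile(overimputed, alpha, axis=1)
    hi = np.quantile(overimputed, 1.0 - alpha, axis=1)
    return float(np.mean((lo <= observed) & (observed <= hi)))
```

A rate far below `level` suggests the imputation model does not reproduce the observed data well, which casts doubt on its imputations for the truly missing cells.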
There are parallels between evaluating the performance and evaluating the missingness. Just as MCAR can serve as a condition to benchmark the validity of the missing data solution, we can use the simulation's complete data (i.e., without missing values) as an upper limit for our performance evaluation. In addition, the simulator should also perform complete case analysis, that is, analysis on the incomplete data without solving for the missingness. We know the theoretical properties of complete case analysis, which makes this ad hoc technique useful as a lower limit for evaluating imputation performance. Complete case analysis may therefore serve as a benchmark method against which imputation performance should be evaluated. Realistically, the simulator would expect the imputation solution to perform no worse than complete case analysis and, preferably, to mimic the complete data results. Moreover, pairing complete case analysis with MCAR and MAR mechanisms allows the simulator to evaluate the impact of the missing data generation process under the chosen analysis model and to identify the unintentional generation of accidental MNAR.
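To make the benchmarking concrete, a single replication could contrast the complete-data analysis (upper benchmark), complete case analysis (lower benchmark), and the routine under evaluation. The Python sketch below uses MAR missingness on y via a fully observed x, with deterministic regression imputation purely as a stand-in for the routine being tested:

```python
import numpy as np

def benchmark_once(n=2000, seed=0):
    """One replication of the three benchmark analyses of the mean of y.

    x is fully observed; y is made missing at random (MAR) through x,
    so the complete case mean of y is biased while a regression-based
    fill-in (the stand-in routine) can repair it.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 5.0 + 2.0 * x + rng.normal(size=n)           # complete data
    miss = rng.random(n) < 1 / (1 + np.exp(-x))      # MAR: P(missing) rises with x
    xo, yo = x[~miss], y[~miss]
    b1, b0 = np.polyfit(xo, yo, 1)                   # regression on complete cases
    y_imp = np.where(miss, b0 + b1 * x, y)           # deterministic regression imputation
    return {
        "complete_data": float(y.mean()),            # upper benchmark
        "complete_case": float(yo.mean()),           # lower benchmark
        "imputed": float(y_imp.mean()),
    }
```

Under this mechanism the complete case mean is visibly biased downward while the imputed estimate tracks the complete-data benchmark; repeating the replication many times yields the quantitative comparison described above. Note that deterministic imputation like this understates uncertainty and is used here only to illustrate the benchmarking logic, not as a recommended routine.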
Last, when the simulator is on the verge of drawing conclusions about the performance of the imputation methodology, the performance should be carefully qualified. Comparing the performance of an imputation routine against a population (or true) parameter allows for quantitative evaluation. Yet, to make qualitative statements about performance under the simulated conditions, comparative methodology is required. For example, a claim that imputation performance becomes unacceptable once deviations from normality grow severe is highly dependent on the simulation conditions used. For a well-balanced judgment about the severity of the performance drop, comparative simulations with, for example, nonparametric models should be executed. A method may perform badly, but if it still outperforms every other approach, it may yet be of great practical relevance.

SUGGESTED COURSE OF ACTION
We encourage simulators to carefully consider and document their choices in the evaluation of imputation methodology. Simulation studies should not only be well executed, but also properly reported. For example, the inconsistent reporting of simulation conditions may impact the objectivity of meta-evaluations of imputation methods, as one method's performance may appear favorable merely because of less stringent simulation conditions. This may ultimately lead to statisticians recommending a less efficient method to applied researchers, thereby limiting the efficiency of the imputation approach and unnecessarily lowering the statistical power. Simulators should therefore be explicit in the descriptions of their evaluations. However, deciding what and how to report may be challenging. The simulation design could be presented textually, in a flowchart, or as a block of pseudocode, whereas missingness mechanisms could be written as a function of the data or displayed graphically. Ideally, the evaluations should be supplemented by an online repository with all of the data and code required to reproduce the simulation results. To aid simulators in reporting and to move toward standardization in evaluation, we provide a draft version of reporting guidelines in Appendix A.1 (also available from www.gerkovink.com/evaluation). We invite the readers of this paper to contribute to its development.
We aim to elicit critical thinking about incomplete data simulation and to establish a common ground for the evaluation of imputation routines. Such a common ground would be the basis of a standardized evaluation. This would allow for fairer and more efficient comparisons between imputation techniques. Ultimately, it would be desirable to evaluate every imputation routine against the same standardized set in order to quantify the statistical properties across imputation routines. If properly executed, such evaluations would allow for careful matching of imputation methodologies to new missing data problems.

ACKNOWLEDGMENTS
We thank the Amices team for the fruitful discussions and highly value the comments and suggestions from the reviewers and the editors.