Multiple imputation of incomplete multilevel data using Heckman selection models

Missing data is a common problem in medical research, and is commonly addressed using multiple imputation. Although traditional imputation methods allow for valid statistical inference when data are missing at random (MAR), their implementation is problematic when the presence of missingness depends on unobserved variables, that is, the data are missing not at random (MNAR). Unfortunately, this MNAR situation is rather common, in observational studies, registries and other sources of real‐world data. While several imputation methods have been proposed for addressing individual studies when data are MNAR, their application and validity in large datasets with multilevel structure remains unclear. We therefore explored the consequence of MNAR data in hierarchical data in‐depth, and proposed a novel multilevel imputation method for common missing patterns in clustered datasets. This method is based on the principles of Heckman selection models and adopts a two‐stage meta‐analysis approach to impute binary and continuous variables that may be outcomes or predictors and that are systematically or sporadically missing. After evaluating the proposed imputation model in simulated scenarios, we illustrate it use in a cross‐sectional community survey to estimate the prevalence of malaria parasitemia in children aged 2‐10 years in five regions in Uganda.


Introduction
Over the past few years, data sharing efforts have substantially increased, and researchers increasingly often have access to IPD from large combined datasets derived from electronic health records (EHR) or from multiple randomized or observable trials (i.e. in IPD meta-analysis, IPD-MA).For example, the clinical practice research datalink (CRPD) (Herrett et al. 2015) is an electronic health record dataset in the UK, which has been used in a variety of medical research, such as the evaluation of health policy and drug efficacy.A recent example of an IPD-MA is the emerging risk factor collaboration (The Emerging Risk Factors Collaboration 2007), where data were combined from approximately 1.1 million individuals across 104 observational studies to investigate associations of cardiovascular diseases with several predictors.Individuals in these large datasets tend to be clustered in centres, countries or studies, where they have been subject to similar healthcare processes.Moreover, clusters may also differ in participant eligibility criteria, follow-up length, predictor and outcome definitions, or in the quality of applied measurement methods.Hence, heterogeneity between clusters with respect to baseline covariates and outcomes, while the structure of the correlations between these variables is likely to be different across different clusters.
A usual problem is that such clustered datasets may contain many incomplete variables.For example, in registry data it is common that test results are not available for all patients, as the decision to test may be at the discretion of the primary care physician or because the patient refuses to undergo testing.It is also possible that variables are systematically missing across clusters.(Resche-Rigon et al. 2013) For instance, in an IPD-MA, studies may have collected information on different variables.Missing values may thus appear for all participants of a study in the combined dataset.The presence of missing data can lead to loss of statistical power, imbalance in cluster size, bias in parameter estimates and therefore to erroneous conclusions as the analysis could be based on an unrepresentative sample.
To address the presence of missing data, it is important to consider the missing mechanism for each incomplete variable.Rubin (1976) (Rubin 1976) identified three missing mechanisms where the probability of missingness: 1) is independent from observed or missing values (missing completely at random; MCAR), 2) depends on observed data only (missing at random; MAR), or 3) depends on unobserved information even after conditioning on all observable variables (missing not at random; MNAR).Traditional imputation methods are designed to address incomplete data sets where variables are MCAR or MAR.Their implementation is justified when there is not systematic difference between units with missing and with complete data or when the missingness of a variable is strongly related to variables measured in the study.
Registries are notoriously prone to incomplete variables that are MNAR, due complex recording processes.(Liu et al. 2021) For example, laboratory tests are taken only in certain patients based on symptoms that are often incompletely recorded.Data from randomized trials may also suffer from MNAR, for example when study participants that experience unfavorable results drop out of the study.Also, heterogeneity of the primary objective or resources of the studies may result in variables relevant to explain the missing process not being recorded at all in some of the studies.
Modelling data under MNAR mechanism implies to specify information about the missingness process in addition to assumptions about the observed data.Two major approach have been used to address MNAR mechanism: pattern mixture models (Little and Wang 1996) and selection models (Vella 1998).One of the most popular techniques within selection models is the one proposed by Heckman (1976).(Heckman 1976) Briefly, the Heckman selection model corrects for selection bias by estimating two linked equations: a outcome equation, where the missing variable is associated with predictors, and a selection equation, that accounts for the inclusion of observations in the sample.An important feature of the Heckman selection model is that it does not assume data to be MNAR, so that it can also be used when data are MCAR or MAR.It therefore offers an appealing solution to incomplete data sets when the missingness mechanism is not precisely known.
Over the past few years, several extensions and adaptations to the Heckman selection model have been proposed for multiple imputation.Among them, Galimard et al.(2016) (Galimard et al. 2016) implemented a chained equations imputation method for continuous variables, which was extended to binary and categorical variables by employing copula estimates.(Galimard et al. 2018) Also, Ogundimu and Collins (2019) (Ogundimu and Collins 2019) proposed a chained equations imputation method that is less dependent on normality assumptions.
In clustered data sets, multilevel imputation methods are required to properly propagate uncertainty within and across clusters.(Audigier et al. 2018) However, to our knowledge, existing multilevel imputation methods mainly focus on situations where data are MAR, and do not adopt Heckman selection models.Although Hammon and Zinn (2020) (Hammon and Zinn 2020) recently proposed an extension that allows for the inclusion of random intercept effects, it can only be used for binary missing variables and assumes that the effect of explanatory variables on the missingness mechanisms is common across clusters.
Therefore, the aim of this work is to develop a multilevel imputation method for continuous and binary variables that are both sporadically and systematically MNAR, and that this can be applied for an incomplete outcome or multiple incomplete predictors in the data sets.
In section 2 we provide an introduction to the Heckman model and its estimation, and we extend it to a hierarchical setting.In section 3 we define the main steps of the proposed imputation method.In section 4 we provide the settings and results of a simulation study to evaluate the performance of our imputation method.In section 5 we illustrate the method using the survey information collected in different sub-districts in Uganda to estimate the prevalence of malaria in children.Finally, in section 6 we summarize the results, outline limitations and propose future extension of our method.

The Heckman model
The Heckman selection model was initially proposed as a method to correct for selection bias, in which individuals are not randomly selected from the population, leading to inconsistent estimates and erroneous conclusions.(Heckman 1976) Selection bias occurs when the inclusion of an observation into the sample is influenced by unobserved variables (e.g., the respondent's level of trust toward healthcare entities may cause them to self-select out of the study or refuse to sign consent for a test), which in turn either influence the outcome of interest (e.g., the result of a blood test), or are related to other unobserved variables that influence the outcome of interest.(Vella 1998) Here β O i and β S i are p × 1 and q × 1 coefficient parameter vectors and O i = ( O i1 , O i2 , ..., O ini ) T and S i = ( S i1 , S i2 , ..., S ini ) T are the residual terms vectors for the outcome and selection equations, respectively.Generally the same variables can be used on the matrix of predictor variables X O i and X S i .However, to avoid multicollinearity problems (Puhani 1997), it is recommended to include in X S i at least one variable that is not included in the outcome model.(Angrist and Krueger 2001) This variable is commonly known as an exclusion restriction variable (ERV), and should only be associated with the selection in the sample r * i (relevance condition) but not with the actual observation y * i (exclusion condition).(Gomes et al. 2020) The ERV meets by definition the relevance and exclusion conditions in order to provide independent information about the selection process and to facilitate the estimation of the Heckman model.
In the presence of selection bias, the aforementioned outcome equation will yield biased estimates of β O i if no efforts are made to adjust for the non-representativeness of the observed X O i and y i values.For this reason, the Heckman model aims to jointly estimate the outcome and selection equation by defining a relation between their respective error distributions.For instance, Heckman's original model (Heckman 1976) assumes that residual terms have a bivariate normal distribution (BVN), where σ i corresponds to the variance of the error in the outcome equation and ρ i to the correlation between the error terms of the outcome and selection equations in the i-th cluster.As in a probit model, this model assumes a unit variance for the error term of the selection equation.The unit variance has no consequence on the observable values of r ij = {0, 1}, since they only depend on the sign of r * ij and not on its scale.The interpretation of ρ i is fairly straightforward.When ρ i = 0, the participation does not affect the outcome model and missing data can be considered MCAR (if data are missing completely at random) or MAR (if missingness is already explained by x O ij ).Conversely, when ρ i = 0, this suggests that data are MNAR.

Heckman model estimation
Under the BVN assumption, the parameters of the Heckman model coefficients can be estimated using the two-step Heckman method (H2S) (Heckman 1976) or the full information maximum likelihood method (FIML).(Amemiya 1984) However, both methods may lead to inconsistent estimators when the true underlying distribution is not the BVN.(Gomes et al. 2020) To overcome this problem, other approaches have been proposed that relax the distribution assumptions, among them copula models.(Smith 2003) The copula approach uses a function, known as a copula, that joins the marginal distribution of the error terms of the selection and outcome equation which are specified separately.The Heckman model can be expressed with the following bivariate normal copula: where Φ is the bivariate standard normal cumulative distribution function.Thus, to estimate the parameters of the Heckman method, it is sufficient to specify the marginal distributions of the error terms, and link them with a suitable copula function.In our imputation method we estimate the Heckman model using the copula method for non-random selection (Wojtyś, Marra, and Radice 2018) which is a flexible approach that allows the specification of different parametric distributions of the selection and outcome variables and also different types of dependence structure between the two equations.

Hierarchical model
The Heckman model can be extended to hierarchical settings, i.e., in cases when individuals or sampling units are nested within clusters, as is the case in EHR or IPD.This hierarchical complexity must not only be taken into account in the analysis model, but also when dealing with missing data, thus requiring imputation models that are congenial to the analysis model, i.e. that make the same assumptions about the data.
Different procedures can be adopted to combine information between clusters; however, in our imputation method we opted for the two-stage approach that is often used in meta-analyses.(Simmonds et al. 2005) This is because such an approach is computationally less intensive and could potentially generate fewer convergence problems in the estimation of the Heckman hierarchical model compared to other approaches.Our method is based on the following conditional imputation model, whose parameters are Briefly, in a first stage we estimate the cluster specific parameters of the Heckman model , for all i between 1 and N clusters.That is, we estimate separately for each of the clusters, the parameters of the two equations in each study, the outcome model ( 1) and the selection model ( 2) via a copula model.
In a second stage, we estimate random effects multivariate meta-analysis model for the coefficient parameters.
Here β O i and β S i parameters are assumed to be drawn independently and identically from a latent multivariate normal distribution of parameters with a population means β O and β S and population variance covariance ψ O and ψ S respectively.(Higgins, Thompson, and Spiegelhalter 2009) Here we assume that σ i and ρ i are random variables that come from a populational distribution and we estimate random effects univariate meta-analysis models for these parameters.

Using the Heckman model to impute missing data
We follow a similar approach proposed by Resche-Rigon and White (2018) (Resche-Rigon and White 2018) for multilevel data imputation, which allows us to impute values in very common scenarios in IPD, e.g., sporadic and systematic missingness patterns.Our imputation method was created to impute datasets with a single missing outcome variable or covariate, but it can also be used to impute multiple incomplete variables in a dataset, as it can be implemented under a multivariate imputation by chain equations (MICE) approach.
In the following subsections, we explain the method when the incomplete variable is the outcome variable but this method can also be used to impute any incomplete MNAR predictor variable in the dataset.In the latter case, in the outcome equation ( 1), the incomplete predictor variable must be specified as the dependent variable and X O i as the set of covariates associated with it.On the other hand, in the selection model(2), r * i corresponds to the latent variable of selection of the incomplete predictor variable together with its X S i associated set of predictors.

Univariate imputation
Given an outcome variable y = (y 1 , y 2 , ..., y N ) T , that consists of y miss ij missing and y obs ij observable values, we generate independent draws from the posterior predictive distribution for the missing data, y miss ij , given the observable data information y obs ij .
p(y miss ij Here we implicitly assume vague prior distributions for each of the parameters included in the parameter vector θ.Because the integration can be performed computationally by sampling from the posterior predictive distribution p(θ|y obs ij ), our imputation method can be carried out in the following two steps: 1. Draw a θ parameter vector, θ * , from p(θ|y obs ij ), their posterior distribution.2. Draw y miss ij from p(y miss ij |θ * ), their predictive distribution for a given θ * vector.
Below we describe each step in depth:

Draw the θ * parameter vector
Fit p(y obs ij |θ i ), the heckman selection model at cluster level Initially, we use the copula method to estimate the set of cluster-specific parameters, , using all j units with observable measurements y obs ij within each cluster i.Here we obtain the estimates not only the parameters' point estimates θ i , but also their corresponding S(θ i ) within-cluster variance-covariance matrix.

Fit a meta-analysis model
In this step, we pool the parameters θ i with a random effects meta-analysis model using only the clusters with observable information, i.e., those with no systematically missing outcome.In particular, we pool the p coefficients of the β O in the outcome equation and estimate a multivariate random effects meta-analysis model with them.Similarly we combine all q coefficient parameters of the β S in the selection equation.Thus, we can denote the study specific coefficients with sampling errors O i and S i .We assume that σ i and ρ i are random variables coming from a population distribution, so we can express them as: For each of them, we perform a univariate random effects meta-analysis.The model for σ i is given by the model: In the case of an incomplete binary variable, the σ i parameter is not specified, so it is not necessary to perform a random effects meta-analysis model of σ i or to include it in any of the following steps in the imputation model.
The model for ρ i is given by: Draw the marginal parameters Θ From the meta-analysis model, we obtain the marginal estimates

Draw the cluster parameters θ * i
We draw the shrunked-cluster-parameters θ * i for each i cluster from the following posterior distribution conditional on Θ * and ψ * .
As can be seen, the mean and variance of the posterior distribution is a combination between the estimated marginal and cluster-specific parameters.Here the weights on the cluster-specific parameters θ i and the marginal parameters Θ * are inversely proportional to the within cluster variance S θi and between clusters variance ψ * .For example, when S θi < ψ * the mean of the conditional distribution gives more weight to the estimated cluster-specific parameter.Conversely, when S θi > ψ * , more weight is given to the estimated marginal parameters.Therefore, in case of a cluster with systematic missingness, it is as if the within-cluster variance is infinite ( S θi → ∞), so all the weight is assigned to the parameters estimated at the marginal level.That is, when there is no information in a cluster, we rely entirely on the marginal information.

Draw y miss ij observation
Having θ * i , the shrunk-cluster parameters vector for each cluster, we back-transform σ * and ρ * to the original scale.Then y miss ij , the missing values, can be drawn from p(y miss ij |θ * i ), their predictive distribution given θ * i , as follows: Continuous missing variable The imputed value of y miss ij can be drawn from the conditional expectation of y ij on unobserved measurements: (Greene 2018) where φ(.) is the probability density function and Φ(.) is the cumulative distribution function of the standard normal distribution.
Binary missing variable When y miss ij is a binary variable, the imputed value is drawn from a Bernoulli distribution with a proportion parameter p * ij given by P [y ij = 1|r ij = 0], that is the conditional probability that y ij = 1 given that the measure is unobservable (r ij = 0).The p * ij is obtained from a bivariate probit model, as follows: (Greene 2018) where Φ 2 (.) corresponds to the bivariate normal cumulative distribution function.

Multivariate imputation
When there are simultaneous missing variables in a dataset, our imputation method can be extended in a Gibbs sampler procedure.Particularly, our imputation method has been implemented according to the structure of the MICE R package, (Buuren et al. 2021) that allows imputing multiple incomplete predictors and covariates in a given dataset.
Briefly, MICE (multiple imputation of chained equations) was built under the fully conditional specification framework, where for each incomplete variable a conditional imputation model is specified based on another variables in the dataset.This process is carried out iteratively, so that in each iteration the missing values of an incomplete variable are drawn from the conditional distribution based on the updated variables in the previous iteration.
Our imputation model can then be used in the imputation of any incomplete variable in a dataset following an MNAR mechanism, even if it is an outcome, predictor or auxiliary variable of the main analysis model.Furthermore, as implemented according to the MICE framework, the imputation method could be used simultaneously with other imputation methods available in MICE or in add-in packages such as micemd (Audigier and Resche-Rigon 2022), to impute datasets with multiple incomplete variables that differ in type and missing mechanism.

Technical details of the implementation
The Heckman model is estimated with the gjrm function of the GJRM R package (Radice 2021), under the bivariate model with the nonrandom sample selection (BSS) specification.The meta-analysis model is performed with the mixmeta function of the R package mixmeta (Gasparrini and Sera 2021), which allows the use of maximum likelihood (ML), restricted maximum likelihood (REML), and moments estimation methods.For the simulation and illustrative study, we use the restricted REML estimation method, which is recommended as it has a good balance between unbiasedness and efficiency.(Viechtbauer 2005).4 Simulation study

Aim
We designed a simulation study aimed to compare the performance of alternative methods for imputing a single missing outcome variable in a hierarchical dataset, where the missingness follows a MNAR mechanism.
In our scenarios we considered systematically missingness.

Data-generation mechanism
We generated the data from a Heckman selection model with bivariate normal distribution error terms.For simplicity we started from "basic scenario" where the database collected information from N = 10 clusters of n i = 1000 individuals.However, we altered both N and n i in sensitivity analyses (Table 1).
For each dataset, we generated X 1i a treatment indicator variable from a Bernoulli distribution with a probability of treatment on each cluster equal to 0.6.Next, we simulated the mean of two continuous covariates from a multivariate normal distribution ( µ2 µ3 ) ∼ N (( 0 0 ), ( 0.2 0.015 0.015 0.2 )).We then simulated for each cluster a baseline covariate X 2i ∼ N (µ 2 , 1) and a exclusion restriction variable X 3i ∼ N (µ 3 , 0.5).
Here, we considered X 1i and X 2i as predictors in the outcome equation, i.e., For the selection equation we included both variables and the X 3i exclusion restriction variable, Then in case of a missing continuous variable, we calculate the latent variables y * i and r * i as follows: Here we assumed that all coefficient parameters varied across studies, by including cluster-specific random effects as: We fixed coefficients β O = {0.3, 1, 1} and β S = {−0.8,1.3, −0.7, 1.2} in order to get around 40% of sporadically missing values on the response y ij in the entire data set.Additionally, we ensured that the y * ij observations were systematically missing in 20% of the clusters included in the data set, by removing the outcome values in the 20% of the clusters.
We assumed that random effects were independent within equations (b 3 ), but were linked between both selection and outcome equations through a bivariate normal distributed as: for the coefficients of intercept and the coefficients of the variables X 1 ,X 2 with ψ 00 = ψ 11 = ψ 22 = 0.4.We considered that the correlation parameter of the random effects between equations is 40% of the value of the assumed correlation parameter between error terms ρ.In addition, we included a random effect on the exclusion restriction variable given by b 3 ∼ N (0, 0.2) assuming that the intracluster variation in the exclusion restriction effect is lower than the variation on other coefficient parameters effects.The ρ parameter was given different values depending on the simulated missing mechanism (Table 1) .
As regards the error terms, they were bivariate normal distributed as: whose σ 2 i is a random parameter across clusters distributed as log(σ i ) ∼ N (0, 0.05).

Adittional scenarios
In addition to the basic scenario described above, we explored additional data generating mechanisms.Specifically we investigated the performance of the imputation methods under the following scenarios: • M(N)AR scenarios: We explored a missing MAR mechanism (ρ = 0), a scenario when data followed a MNAR mechanism with a low (ρ = 0.3), intermediate (ρ = 0.6) and strong correlation (ρ = 0.9) between y * and r * .We also explored a scenario when the missing variable is binary.Therefore we simulated a y i binary incomplete variable, by keeping similar parameters to the ones used in the simulation of a missing continuous variable, but here we defined the observable binary variable as: sample size and cluster number: we explored different configurations regarding the number of patients per cluster n i = {50, 100, 1000} and the number of clusters N = {10, 50, 100}.• Violation of distributional assumptions: Here, we aimed to investigate how the imputation models behave in settings with departures from the bivariate normal distribution, performed two sensitivity analyses.
1. Skewed-t: We drew error terms from a bivariate skewed student-t distribution using the same location parameter and covariance matrix of the normal distributed settings, with 4 degrees of freedom and an α = {−2, 6} parameter which regulates the asymmetry of the density.
2. Explicit: In addition we simulated an explicit missingness process, where error terms of the selection and outcome equations were independently normal distributed and the selection of observations depended on the value of the outcome variable, as ry * i = 0.3y * i + S i .These settings assure that the percentage of missingness on the outcome variable is around 60% in all the evaluated scenarios.
3. Normal: As a reference, we provide the results from the basic scenario of the main simulation study, where the error terms were Bivariate normal distributed.

Estimands
The estimands were the parameter coefficients of the outcome equation , with special emphasis on the treatment effect parameter β 0 1 .We also report the estimated standard deviation from the covariance matrix of the random effects, i.e., √ ψ 00 , √ ψ 11 , √ ψ 22 .
Estimating procedures: After the imputation procedure, we estimated the following (generalized) mixed linear effect model using the lmer() function from the lme4 R package.(Bates et al. 2022) In case of missing binary variable, we used the same matrix of predictors but on a binary model estimated with the glmer() function from the lme4 R package.Then, we pooled the estimates of the β O i and the variance of the random effect and residual errors of the multiple imputed datasets according to Rubin's rule (Rubin 1987), over which we calculated the performance measures on the estimands.
To calculate the coverage of the parameter coefficients' 95% confidence intervals (CI), we estimate CI with the Wald method.

Imputation methods
For each scenario we simulated 500 datasets over which we evaluated the following imputation methods: • Complete case analysis (CCA): We removed all patients with missing observations.• 1l.Heckman: Multiple imputation based on the Heckman model without no study specification, following the imputation method proposed by Galimard et al.(2016).(Galimard et al. 2016

Performance measures
We calculated the following measures, usually employed to evaluate imputation methods (Buuren 2018), according to the formulas provided in Morris et al.(2019) (Morris, White, and Crowther 2019): • Bias: Bias on the coefficient and variance-covariance matrix terms.
• Coverage: Coverage of the 95% confidence intervals for the coefficient parameters.
• Width: Average width of the confidence interval on the coefficient parameters.
• RMSE: Root mean squared error of the coefficient and random effect parameters.
In addition, in the appendix table we reported the empirical standard errors (EmpSE), Monte Carlo standard errors (ModSE) on the coefficient parameters, average processing time (time in seconds) and the percentage of datasets where the imputation method converged (run), i.e., the imputation method generated an output.

Software
For the simulation study and illustrative examples we used R version 4.0.4 in a linux environment.(R Core Team 2021) The Heckman 2L imputation method is available in the micemd R package (Audigier and Resche-Rigon 2022) (as mice.2l.2stage.heckman())and also on the github repository https://github.com/johamunoz/Statsmed_Heckman where you can also find all the codes accompanying this paper and a toy example that explains how to implement the method in mice.

Results M(N)AR scenarios
Continuous incomplete variable Figure 2 shows the results of simulations where the missing variable was continuous.In the MAR scenario, i.e. when ρ = 0, all imputation methods provided similar unbiased estimates of the coefficient parameters, but as rho increased, i.e. the mechanism became MNAR, the estimates for the complete case analysis and the MAR-IPD imputation method became biased.
As it is expected both Heckman-based imputation (1l.and 2l.) models provided less biased estimates of the coefficient parameters in MNAR scenarios, but the random effects estimates on the 1l.heckman were far away from the true values as no cluster information was considered at all.
Regarding coverage, the 2l.Heckman imputation method provided the best coverage values across all the ρ scenarios with a coverage level close to the nominal 95% interval.Even though 2l.Heckman was not properly a randomization-valid method (Buuren 2018), as it was not unbiased and had a coverage above 95% across all the ρ values, it led to better results in terms of bias and coverage compared to the evaluated methods.
The 2l.Heckman method, also resulted in better estimates in terms of RMSE than the other methods evaluated, via a better trade-off between bias and variance.
In particular, our method provided an advantage over the 1l.Heckman method in systematic missingness scenarios, by adding a cluster variable in the imputation model, a feature lacking from 1l.Heckman.This can be seen in the estimates of the random effects parameters of the 1l.Heckman method, which were more biased than those of the other methods in which cluster information was included, i.e. 2l.MAR and 2l.Heckman.Regarding the width of the 95% CI of the estimated parameters, we observed that using the 2l.Heckman model we obtained larger CIs than when using other evaluated methods, which generally increased as ρ moved away from zero.In particular, we observed in β 2 that the CI length was larger in the MAR scenario.By checking the simulation results further, we noted that the high variability came from one simulation in which the linear model estimation had singularity problems (not shown here).

Bivariate incomplete variable
In the case of a missing bivariate variable, we also found (Figure 3) that the 2l.Heckman method provided the least unbiased results on the coefficient parameters.However, we observed more unbiased estimates and a larger 95th percentile amplitude in the estimates of the standard deviations of the random effects, especially in the √ ψ 22 compared to the estimates obtained in the continuous incomplete outcome scenario.

Sensitivity analysis: number of clusters and sample size of clusters
We evaluated how robust our method was to variations in the number of clusters and also in the sample size of cluster (Figure 4).By increasing N, the number of clusters, from 10 to 100, we observed that the bias was not affected but the width of the 95% CI of the estimates decreased (larger precision) and hence the RMSE decreased.On the other hand, by reducing the number of units per cluster (from n i =1000 to n i =100) the precision decreased for all coefficients, the bias on coefficient estimates were not drastically affected but the bias of random effects did.
When we reduced the sample size to 50 patients per study, the bias and RMSE of the √ ψ 22 were drastically affected (not shown here but in the Appendix).This could be explained in part due to the scarce information on certain clusters, which affects directly the estimation of the Heckman model on those clusters.

Bias
Coverage RMSE Width

Sensitivity analysis: distributional assumptions
When using the Heckman model in an explicit MNAR process, i.e. when the probability of loss is associated with the value of the missing variable, we observe that under all imputation methods we obtain biased estimators, with poor coverage.Particularly when estimating the intercept we observed more biased estimates compared to those obtained for the other estimated parameters β 2 with coverage below 60%.
Our model may not be fully suitable for this type of scenario, especially if the main analysis is focused on estimating absolute prevalence estimates, as it is greatly affected by the bias of the intercept parameter (β 0 ) and the width of the 95th percentile CI of all coefficient parameters.Also the imputation method was affected in terms of the bias of the standard deviation of the random-effects parameters √ ψ 22 .
With respect to the Skewed-t scenario, only the bias of β 0 was affected for all imputation methods applied.But β 0 estimates for both the 1l.Heckman and 2l.Heckman were drastically affected in terms of bias, with very poor coverage (below 25%).Therefore, the applicability of our method could be questionable in scenarios that do not conform to the BNV assumption.

Bias
Coverage RMSE Width

An illustrative study
Malaria is a mosquito-borne disease and is the leading cause of illness and death in Africa, especially in children and pregnant women.To prevent the spread of the disease, long-lasting nets (LLINs) and indoor residual spraying (IRS) in at-risk households are used as control measures.
Specifically, in Uganda, under the Uganda LLIN evaluation project, a LLINS distribution campaign was conducted between 2013 and 2014.In 2017, the effect of LLIN control together with insecticides was assessed through a cross-sectional community survey in 104 health sub-districts in 48 districts located within 5 sub-regions of Uganda.
In each sub-district, a sample of households with at least one child aged 2-10 years was surveyed, where information was collected on household conditions and use of preventive measures.In addition, finger prick blood samples were taken from each child to determine the prevalence of parasitemia and an entomological study was conducted to estimate mosquito prevalence.Details of the project and survey are provided elsewhere.(S.G. Staedke et al. 2019) For this example, we used data accessed directly from ClinEpiDB(S.Staedke et al. 2021), where data were collected from 5195 households with verified consent, inhabited by 11137 residents aged 2-10 years.Blood samples were only taken from 8846 children, as 69 were excluded from the study due to lack of consent and 2222 were not present at the time of the survey.Although the original data set consists of 164 variables, here we only consider the variables described in Table 2, which were used as predictors in the imputation model and are fully observed in the dataset.In this dataset, the parasitaemia test is an incomplete binary result (1=positive test, 0=negative test), which is missing 21% across the whole dataset.
To illustrate our proposed method, following the article by Rugnao et al. (2019), (Rugnao et al. 2019) we estimated the prevalence of parasitemia by subregion and by age after approximately 3 years of LLIN campaigns started.We estimated parasitemia prevalence using 3 approaches that made different assumptions on the missingness mechanism: MCAR, MAR and MNAR.
Under the MCAR assumption, prevalence was calculated on the basis of the recorded tests, i.e., we only included patients with a test result.Under the MAR assumption, the test values of children who were not present during the survey were imputed with the 2l.2stage.binmethod of the micemd package, where the community was taken as the cluster and the following factors previously associated with parasitemia were used as predictors in the imputation model: sex, bednet ( indicator of whether only two or fewer persons share a mosquito bed net).In addition, we included age as a power 3 spline function, the cluster-level Log10 mean of the number of female anopheline mosquitoes per household estimated from the entomological survey, and the household wealth index from principal components analysis calculated specifically for the surveyed households.
Under the MNAR assumption, we used the proposed 2l.Heckman method to impute missing test values.The selection and outcome equation included the same predictor variables as used under the MAR approach.In addition, we included a holiday indicator variable as ERV.This was calculated according to school vacation calendars and public holidays in Uganda in 2017.We examined the association of this ERV with the outcome variable (y) and with the selection indicator (ry), conditioned on the remaining imputation predictors.The model results in Table 3 indicate that the holiday indicator could be a plausible ERV variable, as there was strong evidence of an association with ry, but no evidence of an association with y.
According to our imputation approach, non-tested children were estimated to have a higher prevalence of malaria than participants in more than half of the districts analyzed.As can be seen in Figure 6, for each  subregion the prevalence estimates of the approaches did not differ significantly between methods.However, prevalence estimates under the MNAR assumption (i.e. 2 level Heckman) were higher than those estimated under the MAR or MCAR aproaches, except for the East-Central region.
In terms of prevalence by age, there were no significant differences between methods (Figure 7).The prevalence estimates for children aged 2 to 6 years were very similar in all regions under the different assumptions.
Assuming that children start going to school after the age of 6, the results could be partly explained by the mobility of 2-6 year olds compared to school-age children, i.e. school-age children spend more time outdoors and travel more than younger children.
However for school children, prevalences estimated with the Heckman method were found to be higher in the Mid-East and Southwest regions than those obtained with the other methods, whereas in the East-Central region the estimates with the Heckman method are lower.A possible reason for selection bias in surveys of this type is, for example, that daytime visits might favor measurement in sick school children who stay home, leading to overestimated prevalence results as found in the East-Central region.(Program 2020) Nevertheless, we were unable to find information confirming the direction in which malaria prevalence is driven by selection bias in this Uganda study or in other studies similar to this one.

Discussion
We have extended and evaluated methods for multiple imputation of clustered in situations where some incomplete variables follow a MNAR mechanism.For clustered datasets, only imputation methods under the MAR mechanism had previously been proposed.Although there are imputation methods that can handle MNAR they can only handle the case of individual studies.This puts limit to their use in situations common in IPD-MA such as when there is systematic missingness or when the proportion of missingness of a variable is very high in one of the included studies.To address this gap, we proposed a new multiple imputation method for continuous and binary MNAR covariates, that is specifically designed for clustered datasets.Our method, allows borrowing of information between the clusters to obtain more reliable imputation results at the individual cluster level.
In our simulations we observed that the imputation method we proposed is optimal for the imputation of continuous and binary missing variables that follow a MNAR mechanism according to the Heckman model and that come from multilevel data, such data commonly used in the IPDMA.
Overall, our new method produced unbiased estimates with convergence close to 95% for the coefficient parameters.It also resulted in less biased estimates for the random effects parameters compared to the other methods evaluated.
Empirically, we showed that the proposed method could be applicable when there is systematic missigness at the cluster level.This method, in particular, could provide less biased imputation values compared with individual-level imputation methods, as it not only allows imputation of missing values in clusters with systematic missigness, but can also shrunk the values of individual clusters toward the overall study mean.This is especially advantageous in studies with small sample sizes, where an analysis approach that ignores data from other studies may lead to extreme estimates of prevalence.By using our proposed method, these would be reduced to the overall mean.
The advantage of the proposed method over methods that assume MAR is that it allows the imputation of variables from cluster level data following a MAR or MNAR mechanism according to the Heckman model.That is to say that under the specification of a valid exclusion variable the method determines which is the most adjustable correlation parameter between equations (ρ), or in general terms the missingness mechanism (MAR or MNAR), in each of the clusters evaluated.
Our implementation of the imputation method was built according to the specifications of the mice R package and is available in the micemd package, which allows, first of all, to be used both on the outcome and on the covariates.In addition, it offers the option of being used simultaneously with other imputation methods implemented in the MICE package, which is advantageous in databases containing missing variables with different prediction methods and models.Finally, the method can be used on systematically and sporadically missing clusters, both for continuous and binary missing variables with heterogeneous effects and error variances.

Limitations and future directions
A major limitation of our method is that it needs a valid restriction variable, which in some contexts is difficult to establish at the individual study level and can be even more challenging if one tries to find a valid exclusion variable across clusters.
In addition, to estimate the marginal estimates, the method only uses clusters with observable information, i.e. that are not systematically missing or have sufficient information to estimate the Heckman model.The latter might restrict the evaluation of the Heckman model at the cluster level to a certain number of predictors depending on the sample size of the cluster.
Also, the method can be sensitive to both the sample size and the number of studies included in the database.On the one hand, a small sample size at the individual study level can affect not only the precision of estimates but also the convergence of the method since sample size needed to estimate the parameters of the Heckman model which can be at least twice the number of parameters required to estimate in an imputation model that assumes MAR.On the other hand, a high number of studies that may improve the precision of the estimators may also make the estimation of the marginal parameters more difficult and also considerably increase the processing time of our method.
In our simulation study, data were generated by assuming a constant correlation across all clusters in order to evaluate the performance against M(N)AR assumptions.In practice, however this parameter can be variable across clusters and this can considerably affect the performance of our method.Therefore, in future research the effect of this parameter could be further evaluated.One might also consider relaxing this assumption of constant correlation to allow for a random effects distribution for the correlation parameter.
Further, the method can also be extended to other copula models for non-random selection, with different distributions of the selection and outcome equations and dependency structure.(Wojtyś, Marra, and Radice 2018) Similarly, less restrictive Heckman based models can be considered in terms of normality distribution of errors and no specification of exclusion variables such as those proposed by Ogundimu & Collins (2019).(Ogundimu and Collins 2019).

Conclusion
We have proposed an extension to the Heckman model that can account for MNAR, MAR or MCAR of a continuous or binary variable in clustered data sets.Our simulations showed that it can have favorable statistical properties, when its assumptions were met, and provided that the sample size is sufficiently large.
Regarding deviations from distributional assumptions of the error terms, the coefficient parameters were fairly robust in terms of bias, but the intercept was not.

Disclaimer
The views expressed in this paper are the personal views of the authors and may not be understood or quoted as being made on behalf of or reflecting the position of the regulatory agency/agencies or organizations with which the authors are employed/affiliated.

Figure 1 :
Figure 1: DAG Heckman selection model: here the nodes (dotted lines = latent variables, continuous lines = observed variables) describe the relationship between y * ij the latent response and r * ij the latent selection variables of a jth unit in a ith cluster, that are dependable on x O ij and x S ij sets of predictors and are correlated through u ij unobservable variables.This is visualized in Figure 1, where for the jth individual or unit within the ith cluster, there is y * ij , a latent outcome variable, and r * ij , a latent selection variable, which are correlated through u ij an unobserved or unrecorded variable, with i ∈ [1, 2, ..., N ] and j ∈ [1, 2, ..., n i ].Here, both latent variables are related to the sets of predictor covariates x O ij and x S ij .From r * ij one can derive r ij = I(r * ij >= 0) a selection indicator of y * ij into the sample, and with this in turn, one can define y ij = y * ij , ∀r ij = 1 the observable outcome variable.Denoting y * i = (y * i1 , y * i2 , ..., y * ini ) T and r * i = (r * i1 , r * i2 , ..., r * ini ) T the vectors of latent outcomes and latent selections in the cluster i, then Heckman's model is defined by two main equations: the outcome equation (1), which describes the relation between the latent outcome (y * ij ) and a set of covariates (X O i = (x O i1 , x O i2 , ..., x O ini ) T ), and the selection equation (2) which models the likelihood that the outcome is observed in the sample as a function of another set of covariates (X S i = (x S i1 , x S i2 , ..., x S ini ) T ).y * i = X O i β O i + O σ, ρ} and the between-cluster variance matrix ψ = { Ψ O , Ψ S , ψ σ , ψ ρ }, i.e. variance-covariance matrix of the random effects, with their corresponding variance-covariance matrices S Θ and S ψ , which are used to draw the Θ * and ψ * parameters, from their posterior distribution as follows:(Resche-Rigon et al. 2013) ) • 2l.MAR: Multiple imputation assuming MAR for hierarchical datasets, we used the multilevel imputation model ( 2l.2stage.normand 2l.2stage.bin)from the micemd R package(Audigier and Resche-Rigon 2022), which are described in Audigier et al. (2018) paper.(Audigier et al. 2018) • 2l.Heckman: The proposed imputation method based on the Heckman model for hierarchical datasets.

Figure 2 :Figure 3 :
Figure 2: Continuous incomplete variable by varing ρ, where dashed lines depict the target performance criteria value

Figure 4 :
Figure 4: Continuous incomplete variable under systematical missingness by varing: number of clusters (N); sample size per cluster (n i ), where dashed lines depict the target performance criteria value

Figure 5 :
Figure 5: Continuous incomplete variable with deviations in distribution assumptions, where dashed lines depict the target performance criteria value

Figure 7 :
Figure 7: of malaria prevalence by sub-region and age

Table 1 :
Data generation scenarios