Multiple imputation of binary multilevel missing not at random data

We introduce a selection model‐based multilevel imputation approach to be used within the fully conditional specification framework for multiple imputation. Concretely, we apply a censored bivariate probit model to describe binary variables assumed to be missing not at random. The first equation of the model defines the regression model for the missing data mechanism. The second equation specifies the regression model of the variable to be imputed. The non‐random selection of the binary data is mapped by correlations between the error terms of the two regression models. Hierarchical data structures are modelled by random intercepts in both equations. To fit the novel imputation model we use maximum likelihood and adaptive Gauss–Hermite quadrature. A comprehensive simulation study shows the overall performance of the approach. We test its usefulness for empirical research by applying it to a common problem in social scientific research: the emergence of educational aspirations. Our software is designed to be used in the R package mice.


Introduction
In the social sciences, large-scale surveys with complex data structures have become the norm rather than the exception. Applied statisticians are well aware that, for valid statistical inference, they need to account for the peculiarities arising from complex survey data; see, for example, Gelman et al. (1998), Kish and Frankel (1974) and Rubin (1987). Nonetheless, despite urgent appeals for methodologies that are capable of meeting this need, an effective set of survey statistical methods is still lacking. For some time, multiple imputation (MI) (Rubin, 1987) has been the state of the art for handling missing data in surveys. Yet, to the authors' knowledge, no appropriate method exists to handle binary multilevel data under the missingness not at random (MNAR) assumption in the context of MI. Standard applications of MI techniques are usually based on the assumption that the data are missing at random (Rubin, 1976). In many situations, however, it seems entirely realistic to assume that the missing values depend on the incomplete variable Y itself even after conditioning on all the other available data and thus follow an MNAR mechanism. It is well known that this occurs, for example, in variables that are related to income. In this case, usual MI models and implementations may not be sufficient and may result in biased estimates. This paper aims to fill this gap by introducing a selection model-based MI model to be used within the fully conditional specification (FCS) framework for MI. The FCS-MI approach makes it possible to deal with missing values in variables of several distinct types (Raghunathan et al., 2001;Van Buuren et al., 2006). Our method is an imputation model for binary multilevel data that are assumed to be missing not at random. It is designed to be used in the R package mice (Van Buuren and Groothuis-Oudshoorn, 2011). Thus, it is flexible and versatile. A variety of methods exists for addressing multilevel data that are missing at random, e.g. weighting (Asparouhov, 2006;Asparouhov and Muthen, 2006;Pfeffermann et al., 1998), MI (Audigier et al., 2017;Enders et al., 2017;Lüdtke et al., 2017) and likelihood-based methods such as the full information maximum likelihood approach (Larsen, 2011;Wothke, 2000).
Under MNAR the probability of missing data is a function of the unobserved values themselves. As a consequence, each MNAR problem requires specification of the joint distribution f.Y , R/ of the complete data Y and the missing data indicator R. Depending on the factorization of this distribution, two kinds of MNAR models result (Little, 2008). Factorizing f.Y , R/ as f.Y/f.R|Y/ gives selection models, and decomposing it as f.R/f.Y |R/ yields pattern-mixture models. Thus, selection models specify the joint distribution by weighting the marginal distribution of the outcome variable f.Y/ by a selection probability f.R|Y/, that accounts for non-random non-response and must be modelled explicitly (Rubin, 1974).
In contrast, pattern-mixture models define the full data likelihood as a mixture of distinct response patterns. Thus, different distributions for Y are assumed for units with observed and missing values (Little, 1993;Rubin, 1977). Both model types rely on unverifiable assumptions that cannot be validated by the observed data. In selection models the full data response distribution f.Y/ and the conditional distribution of the missing data mechanism f.R|Y/ must be specified by underlying parametric assumptions and, in pattern-mixture models, the outcome distribution of the missing values f.Y mis / must be determined by additional external information (Glynn et al., 1986).
It is not possible to test empirically whether the missing data are missing not at random or missing at random since the information that is required to be able to distinguish between these two mechanisms is not available in the data. Furthermore, we must be aware that all types of alternative MNAR models also rely on untestable assumptions. Therefore, it is reasonable to estimate a variety of missing data models with different assumptions, rather than to rely exclusively on one type of model. This non-testability of the assumptions about the missing data mechanism makes sensitivity analysis indispensable for possibly missing not at random data (Molenberghs and Fitzmaurice, 2008). Only in this way can the effect of the assumptions and thus the robustness of statistical inference be assessed. In the context of MI, sensitivity analyses aim to contrast analysis results from missing at random imputed data with those from alternative MNAR models to evaluate whether they lead to different inferences and conclusions. In the existing literature, most methods that apply FCS under MNAR use sensitivity parameters to address distributional differences between the missing and observed units (Hedeker et al., 2007;Resseguier et al., 2011;Tompsett et al., 2018;Van Buuren et al., 1999). Such parameters cannot be estimated from observed data. Instead, plausible value ranges must be provided by experts in the respective research field. In the social sciences, it is usually difficult to define such value ranges because of the widespread lack of objective and generally valid comparative figures. We therefore present an MNAR imputation model that does not include sensitivity parameters but is completely identified by its distributional assumptions.
The idea of using a selection model (Rubin, 1974) in the context of FCS-MI is not new. Galimard et al. (2016) used a two-stage selection model for imputing continuous missing not at random data. Very recently, Galimard et al. (2015Galimard et al. ( , 2018 presented a variant of their model that applies to binary missing not at random data. Their basic idea is to use a bivariate probit model to specify the response probability and the binary outcome of a single-level model jointly. We extend their approach by adding random intercepts to the selection and outcome model to account for potential multilevel structures in the data. We use a maximum likelihood approach and adaptive Gauss-Hermite quadrature (AGHQ) procedure to fit our imputation model. The derivation of the formulae that is used will be described in detail below.
Most multilevel MNAR approaches originate from biostatistics and deal with continuous multilevel data; for comprehensive overviews, see, for example, Little (2008), Molenberghs and Fitzmaurice (2008) and . The lack of related applications in the social sciences, where variables are mostly ordinal and binary, is striking. We make a first step towards rectifying this situation by applying our novel approach to a common research problem in the social (educational) sciences. Before doing so, we examine the overall performance of our new approach by conducting a comprehensive simulation study. By means of standard errors, relative bias and coverage rates, we compare the performance of our novel MI-FCS MNAR imputation model with other common imputation strategies that cope either with binary multilevel data or with binary missing not at random data, but never with both simultaneously. Furthermore, we test the robustness of our approach under misspecified missing data models. We then apply our novel method to analyse the effect of personal attributions and social background factors on the educational aspirations of ninth-grade students in Germany.
It is also advisable to compare the performance of our method with that of alternative MNAR models such as pattern-mixture models. However, in the context of FCS-MI, no such method currently exists. The development of such a method is beyond the scope of this paper and thus is left for future work.
The remainder of this paper is structured as follows. First, we describe the new imputation method and the underlying model. This is followed by the simulation study and the presentation of its results. Thereafter, we apply the approach to empirical data from the German National Educational Panel Study (NEPS). We conclude with a short summary of the results, a discussion of some critical issues and tasks for future work.

Method
The basic idea of FCS-MI is to specify separate imputation models for each incomplete variable and to impute the missing data variable by variable, i.e. for a binary variable with missing values a model describing this variable appropriately is required. If data are additionally missing not at random, then the mechanism that caused the missing values must also be modelled. For this, like Galimard et al. (2015Galimard et al. ( , 2016Galimard et al. ( , 2018 we use a selection-model-based approach. Combined with the binary variable Y to be imputed, this yields a two-equation system: one equation for the selection process and one equation describing Y . We use a bivariate probit model with sample selection (Greene, 2012;Wooldridge, 2002), i.e. a censored bivariate probit model, to specify this two-equation system. Multilevel structures in the data are accounted for by expanding the bivariate probit model by a random-intercept term. This expanded bivariate model serves as an imputation model to impute missing values of Y within the iterative FCS-MI scheme. In what follows, we describe this model in detail and present an efficient way to estimate it. We then present the imputation algorithm that will be used to obtain plausible replacements for the missing values of Y . Here, R describes the missing data indicator of Y that takes the value 1 if Y is observed and 0 otherwise. Observations of Y and R are denoted by y and r.

Imputation model: censored bivariate probit model with random intercept
Assume that the data at hand contain j = 1, : : : , J clusters consisting of i = 1, : : : , n j individuals respectively. By using the standard probit specification based on a latent variable formulation, the model can be specified as follows: The first equation accounts for the non-random selection process, i.e. for the missing data mechanism in our case. The second equation models the focal variable Y and defines the outcome equation. The asterisk denotes the latent variables r Å ji and y Å ji , whose observed equivalents are r ji and y ji . The covariates of the two regression equations are x R and x Y , and β R and β Y are the related coefficients. α R,j and α Y ,j are the random intercepts for describing cluster effects, and R,ji and Y ,ji are the error terms. The function 1 denotes the indicator function and 'not applicable' denotes a missing value. To assure model identifiability, x Y must be a subset of x R and x e R = x R \ x Y to be highly correlated with r and hardly connected to y (Rendtel, 1992). The set x e R is called the exclusion restriction. The selection and the outcome equation are linked through correlated error terms and random intercepts: .2:2/ Here ρ describes the correlation of the bivariate distribution of R Å and Y Å , and therefore models the relationship between the selection and outcome equation. Clearly, this two-equation system specifies only the dependence of the missing-data mechanism on the outcome variable appropriately if the normality assumptions hold. To capture potential dependences of the missing data mechanism on the cluster structure of the data, the random intercepts α R and α Y are also allowed to depend on each other. τ denotes the correlation of the bivariate normal distribution of α R and α Y , and Σ their variance-covariance matrix. In this paper, only a two-level hierarchy is considered, but the extension to further levels is straightforward. The log-likelihood function of the two-equation model (2.1) can be expressed as (see, for example, Greene (2012)) .2:3/ Here, Φ 2 .·/ denotes the cumulative distribution function of the bivariate standard normal distribution and Φ.·/ is the cumulative distribution function of the univariate standard normal. The function φ 2 .·|0, Σ/ is the probability density function of a bivariate normal distribution with mean 0 and variance-covariance matrix Σ. The two random intercepts α R and α Y are treated as nuisance parameters and are integrated out. This works in the same way as for the univariate panel model with one random intercept (see, for example, Butler and Moffitt (1982)). The double integral of the log-likelihood function (2.3) has no closed form solution. One way to solve the integral nevertheless is to approximate the area under the integrand. This can be done by using either simulation (e.g. Cappellari and Jenkins (2003), Greene (2004) and Train (2009)) or quadrature techniques (e.g. Delattre and Moussa (2015) and Mulkay (2015)). In this paper, we use Gauss-Hermite quadrature (GHQ) to solve the double integral. The reason is that simulation is expected to take much more computational time for the two dimensions that we have than quadrature, which is of tremendous disadvantage in the context of MI when data are imputed several times and usually also for several variables. Simulation only becomes beneficial with higher dimensions since the required number of iterations for approximating the integrals does not depend on the number of dimensions. GHQ approximates an integral of a specific form by a weighted sum of the integrand evaluated at predetermined abscissas of the variable that is integrated out (see Abramowitz and Stegun (1964) and Davis and Rabinowitz (1967)). These predetermined abscissas are called quadrature points. For example, in the one-dimensional case the integral of the form where the quadrature points a p are set to be the nodes of the Hermite polynomial and ω p are the corresponding weights with p = 1, : : : , P. By design, the quadrature points are set symmetrically around zero. The accuracy of the Gauss-Hermite approximationĨ depends on the chosen number P of quadrature points. Ideally, P is determined by investigating the convergence behaviour ofĨ when P is increased. However, Lesaffre and Spiessens (2001) showed that a number of P = 10 is often sufficient and differences by further increasing P are only minimal. The Gauss-Hermite weights and nodes can be found in the tables of Abramowitz and Stegun (1964) or can be computed by using an algorithm that was proposed by Golub and Welsch (1969). An improved version of GHQ is AGHQ (Liu and Donald, 1994;Naylor and Smith, 1982). Here, in contrast with traditional GHQ, the quadrature points are set symmetrically around the maximum value of the integrand. In other words, AGHQ shifts and scales the quadrature locations to place them under the peak of the integrand, so that the function is evaluated where the area is expected to be largest. Applying the AGHQ approach to log-likelihood equation (2.3) gives the following approximation (the corresponding calculation steps yielding this formula are given in the on-line supplementary material provided along with this paper): where p 1 = 1, : : : , P and p 2 = 1, : : : , P are the quadrature points for the selection equation and for the outcome equation respectively. In contrast with GHQ, AGHQ enables us to specify quadrature points for each cluster separately. It can be expected that this improves the approximation of the cluster-specific integrals The related bivariate quadrature pointsã jp = .ã jp 1ã jp 2 / with p = .p 1 , p 2 / are defined as where a p = .a p 1 , a p 2 / and ω p = .ω p 1 ω p 2 / are the standard Gauss-Hermite nodes and weights.
Here, the matrix Ω j scales a p and the vector μ j centres them. The function |Ω j | denotes the determinant of Ω j . The square root of Ω j , Ω 1=2 j , can be properly described by the lower triangular matrix T of the Cholesky decomposition of Ω j = TT .
There are different approaches to specify μ j and Ω j . Liu and Donald (1994) recommended centring the nodes with respect to the mode of the integrand and scaling them according to the negative inverse Hessian matrix (curvature) at the mode. In the case considered, the mode is the most likely value for the random effects given the observed data and the current estimates of all the other model parameters. The integrand of the likelihood l j for cluster j, that is necessary to calculate the cluster-specific mode and curvature, is the two random intercepts for cluster j can be computed, for example, bŷ The related curvature matrix at the modes,Ω j , is a proper estimator for Ω j . It is defined aŝ The estimatorμ j must be found by some numerical optimization algorithm such as the Nelder-Mead method, whereasΩ j can be calculated analytically or also solved numerically. In general, AGHQ is clearly superior to ordinary GHQ since it dramatically reduces the number of necessary quadrature points to approximate a given integral by sampling the nodes in the relevant region of the function. Although additional time is needed to compute the mode and curvature at each maximization iteration, many fewer quadrature points are required for the same approximation accuracy (Lesaffre and Spiessens, 2001). This applies especially when the mode of the integrand is far from 0. We rely on the standard maximum likelihood approach to estimate the parameters of the approximated log-likelihood function (2.4). For numerical optimization we suggest using the Broyden-Fletcher-Goldfarb-Shanno method (e.g. Goldfarb (1970))-a very powerful and efficient optimization algorithm for solving unconstrained non-linear optimization problems that belongs to the group of quasi-Newton methods. These methods do not require the computation of the Hessian matrix but approximate it in each iteration by using the gradients. This fact makes them computationally very attractive (e.g. Nocedal and Wright (2006)). To speed up the maximization process of parameter estimation, we calculated the analytic gradients of equation (2.4), which can be found in the on-line supplementary material, and use them during optimization. Note that, during the optimization procedure,μ j andΩ j must be computed for each individual cluster in each iteration step.

Imputation algorithm: fully conditional specification
With the imputation model at hand and an adequate method to estimate it, missing values can now be imputed. As already stated, in FCS-MI, the incomplete variables are imputed sequentially in a univariate fashion. For this, plausible replacements are drawn variable by variable from the related conditional densities. In our case, the bivariate probit model (2.1) determines the conditional density of Y . Let θ = .β Y , β R , r, ξ 2 Y , ξ 2 R , z/ be the unknown parameters of the bivariate probit model, where r = tanh −1 .ρ/, ξ 2 Y = ln.σ 2 Y /, ξ 2 R = ln.σ 2 R / and z = tanh −1 .τ / are common transformations to preserve the range constraints of the parameters during maximization. At each iteration the following five steps are conducted to impute the missing values of Y . At each FCS-MI iteration step the approximated log-likelihood (2.4) must be maximized to obtain updated estimatesθ for θ. Furthermore, to ensure a proper imputation procedure, parameter uncertainty must be considered (Rubin, 1987). This is achieved by drawing parameter candidatesθ by using a normal approximation to the posterior distribution ofθ (e.g. Gelman et al. (2013), chapter 4).
Step 1: estimate model parameters by maximum likelihood and AGHQ by using equation Step Step 3: draw random-intercept candidates .α R,j ,α Y ,j / for each cluster j from N.μ j ,Ω j /.
Step 4: calculate for each unit with missing Y the probabilityṗ that Y equals 1: Step 5: draw for each missing value Y mis a replacement from a Bernoulli distribution with success probabilityṗ. To generate M imputed data sets, these steps are repeated M times.
This imputation algorithm extends the work of Galimard et al. (2018). We have implemented it in a way that the algorithm can be used within the mice()function of the R package mice. (The corresponding source code is available from http://github.com/Angelina Hammon/PaperBinaryMNARmultilevelData.) In FCS-MI, it is common practice that each incomplete variable is also a potential predictor in the imputation models for all the other variables. This applies to Y as well. Since by assumption the missing data indicators R of Y and X Y are correlated (determined by the selection equation), R must be included as predictor in the imputation models of all of the other incomplete variables that are part of X Y ; see also Galimard et al. (2016). Otherwise biased imputations may arise.

Simulation study
To evaluate the performance of our novel imputation procedure, we conduct a set of distinct Monte Carlo simulation studies, using different data-generating processes to represent possible real world scenarios. For clarity, we concentrate on the univariate imputation model of Y , and we assume that all the covariates considered are observed completely. However, as already stated, an application of the algorithm to multivariate missing data is straightforward. The number of replications is set to 1000.

Data generation
In sum, we consider five simulation scenarios. The total sample size is set to n = 2500 and the number of clusters equals m = 50, leading to a cluster size of n j = 50, j = 1, : : : , m. These numbers are in line with cluster and sample sizes that are commonly used in educational research, for example, when describing students in schools. For simplicity, we assume that all clusters comprise the same number of units. However, the method can also be applied without any problems in case of different cluster sizes. In any simulation scenario, we initially generate complete data sets with one binary outcome variable y ji , i = 1, : : : , n j , and three different normally distributed covariates x 1,ji , x 2,ji and x 3,ji according to Missing values are imposed on y ji by specifying a model for the response indicator r ji , where r ji equals 1 if y ji is observed and is 0 otherwise. To assess the performance of our imputation method under distinct (realistic) missing data situations, we implement models for five different missing data mechanisms. We specify four models for MNAR and one model for missingness at random (MAR). Depending on the mechanism that is considered the parameters ρ and τ of equation (2.2) take varying values expressing different relationships between the response indicator r and the outcome variable y. We include different types of MNAR missing data, where we assume that the probability of observing y ji increases with the value of y Å ji . Under the first three MNAR scenarios (MNAR selection), missing data are produced by using the following parameterization of the selection equation: r Å ji = 0:5 + 1:5x 1,ji − 0:25x 2,ji + 0:1x 3,ji + α R,j + R,ji r ji = 1.r Å ji > 0/: .3:1/ To take into account different magnitudes of correlation between y ji and r ji , we use three values for ρ, namely ρ ∈ {0:3, 0:6, 0:9}, reflecting weak, medium and strong correlation. We set τ = 0:5 to allow for a medium correlation between the random intercepts of both equations. The variable x 3 represents the exclusion criterion. To evaluate our method also in an MNAR situation, where the missing data mechanism does not strictly follow the selection model specification of the imputation procedure (MNAR non-selection), we consider another MNAR scenario, where the missing data are imposed by P.r ji = 1/ = Φ.0:8 + 1:75y Å ji + 1:5x 1,ji − 2:5x 2,ji + α R,j / r ji ∼ Ber{P.r ji = 1/}: Here, Ber.· · ·/ denotes the Bernoulli distribution, and ρ and τ of equation (2.2) are set to 0. Since it is not possible to test empirically whether the missing data mechanism at hand is MAR or MNAR, sensitivity analyses incorporating alternative imputation models with additional external assumptions are the only means of detecting a potential MNAR mechanism. The underlying idea is that, in the case of MAR the inferences should not differ between the distinct MAR and MNAR imputation methods. Thus, to conduct effective sensitivity analyses it is crucial that the alternative imputation models can not only handle data missing not at random but also yield valid inferences under MAR. Therefore, we additionally consider an MAR scenario where the missingness does not depend on y ji to evaluate how our new method performs under MAR. For this, we specify the latent response indicator r Å ji by using equation (3.1) with ρ = 0 and τ = 0. All missing data scenarios that were examined yield approximately 35% missing values in y. The complete code for data generation and analysis of our simulation study is available from http://github.com/AngelinaHammon/PaperBinaryMNARmultilevelData.

Data analysis
To assess the adequacy of our new imputation method with AGHQ (referred to in what follows as MNAR AGHQ), its performance will be compared with that of other relevant imputation procedures including methods that assume MAR. We also tested the performance of our method by using the GHQ approximation. However, the corresponding results are slightly worse than under MNAR AGHQ. Thus, they are not reported. As MAR imputation methods, we use a mixed effects logistic regression as imputation approach (MAR mixed) (Zinn, 2013) as well as a technique that uses a two-stage estimation approach (MAR 2-stage) (Resche-Rigon and . Besides these methods, which take into account the multilevel structure of the data, we also evaluate the procedure of Galimard et al. (2018Galimard et al. ( , 2015 that handles MNAR by using a selection model approach but is only designed for single-level data (MNAR Galimard). As a benchmark scenario, we present the results of a complete-case analysis (CCA), which in the case at hand (only missing values in y) is also valid under MAR (Von Hippel, 2007). We used M = 10 imputations for each scenario and imputation procedure. Since here we focus on only univariate missing data, which are a special case of monotone missingness, there is no need to iterate the mice algorithm.
Each completed data set is analysed by estimating a mixed effects probit regression on y with covariates x 1 and x 2 . For this, we use the glmer()function of the R package lme4 (Bates et al., 2015). After estimation, all the estimates are pooled by using Rubin's combining rules (Rubin, 1987). We assess the performance of each imputation method by using the empirical means of the parameter estimates, their relative bias and the empirical standard errors of the estimates, as well as the root mean square of the estimated standard errors. Furthermore, we derive and evaluate the coverage rates of the nominal 95% confidence intervals.

Results
For the various imputation strategies and simulation scenarios including the MNAR scenario based on a selection model with medium correlation, i.e. for ρ = 0:6, Table 1 shows the results for the regression parameter β 1 of the first covariate x 1 . Table 2 gives the estimates for the slope parameter β 2 of variable x 2 . The results for the selection-model-based scenarios with low and high correlation, i.e. ρ ∈ {0:3, 0:9}, are not reported here since they are similar in terms of relative bias and coverage rates. However, they can be found in the on-line supplementary material to this paper. If the true missing data mechanism is MAR, the CCA and the imputation model based on a mixed logistic regression (MAR mixed), which are both designed for this type of missing data, perform-as expected-very well in terms of bias. Surprisingly, the MAR 2-stage approach performs rather badly under MAR with regard to both bias and coverage. Apparently, the twomodel strategy is not suitable for the kind of hierarchical data situation that is considered in our simulation study. The results of Audigier et al. (2017) suggest that the method introduces bias with cluster sizes below 100, which might be the reason for its poor performance in our simulation study. Our novel approach MNAR AGHQ performs well under the MAR scenario, with an average relative downward bias of 1.98% and a reasonable coverage rate of 96% for β 1 . The bias is slightly higher than for the CCA or MAR mixed, but nevertheless these results confirm  that our novel approach also works for missing data that are in fact missing at random, which is a crucial property for conducting adequate sensitivity analyses. In the considered MNAR scenarios, the MNAR AGHQ method clearly outperforms all competing approaches. For β 1 , it yields, under both MNAR conditions, a relative bias of lower than 1% and coverage rates near the nominal coverage probability. The three MAR methods underestimate β 1 up to 51.46% in both MNAR scenarios. The approach of Galimard (MNAR Galimard) yields biased estimates for β 1 in all scenarios. This is caused by the fact that Galimard's model does not induce any multilevel structure into the imputed values.
In principle, the results for parameter β 2 are similar to those of parameter β 1 . The MAR mixed approach shows a high upward bias in all the considered MNAR scenarios along with very low coverage rates. Interestingly, the MAR 2-stage technique leads to an unbiased estimate for β 2 and to a nominal coverage of 97.2% in the MNAR scenario based on the selection model and even outperforms MNAR AGHQ, which shows a relative bias of 2.47% and a coverage rate of 92.8%. In contrast, MAR 2-stage again yields a poor coverage rate and biased estimate in the presence of MAR. All three MAR methods overestimate β 2 up to 93.84% under MNAR. For β 2 , our new approach MNAR AGHQ again shows reasonable performance in terms of bias and coverage in all the scenarios considered. However, in the data situation where missing data are created under a non-selection model, the estimate is slightly more biased than for the other scenarios. In addition, the bias is also higher than for the estimate of β 1 in the same missing data situation. Nevertheless, the average relative bias of 3.52% still lies within an acceptable range, in particular compared with the performance of the other methods investigated, which clearly underperform relative to MNAR AGHQ. Under MAR, MNAR AGHQ shows a slightly lower coverage rate for β 2 than expected, but it still lies in a reasonable range. As for β 1 the MNAR Galimard model biases the estimates of β 2 downwards and results in too low coverage rates. Hence, even if the MNAR mechanism is modelled, the failure to consider the hierarchical data structure during imputation leads to incorrect estimates of fixed effects parameters.

Analysis of educational aspirations
We apply our novel imputation method to a frequently studied problem in educational research: young people's educational aspirations and the effect of their social background. Our analysis focuses on ninth-grade students attending lower secondary school, Hauptschule, the lowest track of secondary school in Germany. This group is particularly affected by social disadvantage (Schneider, 2018;Wößmann, 2007) and thus is of special interest to educational researchers. This is especially true for the relationship between parental education and the degree that students aspire to obtain. We use wave 1 of the NEPS starting cohort 'School and vocational training: educational pathways of students in grade 9 and higher' to study the aspirations of ninth graders to graduate with a degree that is higher than the degree that is offered by the school that they are currently visiting. For students attending Hauptschule, this is either an intermediate secondary degree or a degree allowing for university admission. We study the effect of parental education on a student's aspirations by using the information on whether a student's mother has a university admission certificate (UAC) or not. As personal attributes that may influence students' aspirations we consider their grades in mathematics and German and their competencies in mathematics and reading, and their gender, as well as their migration background (measured by generation status smaller than 3.5). Competency scores are estimated as weighted maximum likelihood estimates (Warm, 1989). Detailed information on the measurement of competencies in the NEPS is given in Duchhardt and Gerdes (2013), Gehrer et al. (2012) and Neumann et al. (2013). Grades range from '1, very good' to '6, insufficient'. We are aware of the problem of multicollinearity between grades and competency scores. Nonetheless, in Germany correlations between both quantities are comparably low in lower secondary school (e.g. around −0:32 between grades in German and reading competencies in 2012 (Authors Education Report (2016), page 94)). Thus, competencies are included in the outcome model to capture ability effects that grades do not map. As possible composition effect, we consider the proportion of mothers with a UAC in a school's entire ninth-grade level. The participation in the NEPS survey is voluntary. As a consequence, some schools in the sample have only a few ninth graders. We exclude schools with fewer than 10 ninth graders from our sample to reduce distortion and estimation problems. This results in a loss of 4% of the students. In sum, our data set comprises observations on 3291 ninth graders in 142 schools who were surveyed in 2011. The average number of ninth graders in a school is 23.2 (with a minimum of 10 students and a maximum of 48 students). In total, 77.2% of the ninth graders surveyed aspire to obtain a higher degree than they can obtain at Hauptschule. The grade level intraclass correlation ICC concerning higher aspirations of students is 22.7% (with a standard deviation of less than 0.1). Hence, the multilevel structure of our data is obvious. The on-line supplementary material shows all the model variables along with their mean values, standard deviations and the proportion of missing values. The variables on competencies and gender exhibit very few missing values (at maximum 4%), whereas the variables on migration background, aspirations and grades show a few more missing values (from 13% to 17%). However, we find a high percentage of missing values (more than 50%) for maternal education. Using Little's test (Little, 1988), we see that the missingness mechanism that generated our data set is not missingness completely at random, i.e. it is either MAR or MNAR. To cope with this issue, we use the FCS-MI approach, applying distinct imputation methods depending on the nature of the variable to be imputed. It is clear that a regressor in a multilevel model does not necessarily have to have a multilevel structure as well. Imputing missing values of a regressor with a single level structure by using a multilevel imputation model can lead to an unnecessarily high variance of the imputed values. Therefore, we first compute ICC for all the variables, using their observed values to assess whether a multilevel imputation routine is required. As a rule of thumb, we consider ICC-values that are higher than 20% as at risk of having a multilevel structure, whereas regressors with lower ICC-values are assigned a single-level imputation model. We find ICC-values larger than 20% for migration background and higher aspirations. Thus, higher aspirations and migration background are imputed by using multilevel imputation approaches, and all other variables are imputed by a single-level approach. From non-response analyses with similar NEPS data, we know that people with lower educational attainment are less likely to take part in the survey; see Zinn et al. (2018). Furthermore, we suspect that students with lower educational aspirations more often refuse to take part than their counterparts. Thus, we hypothesize that MNAR mechanisms generated the data on maternal education and educational aspirations. We use Galimard's imputation method (Galimard et al., 2016) to impute the variable 'mother has UAC' and our novel method to impute 'higher aspirations'. As an exclusion criterion (in the selection models), we use the information on whether students were ever surveyed individually at home, on line or by phonei.e. not at school-within nine waves (i.e. within 5 years). This appears to be an optimal choice since we find high correlations between this survey mode variable and the indicators of whether an aspiration value has been observed (around 95.8%) or of whether maternal education has been observed (around 46.1%). In contrast, we find low correlations between the survey mode variable and the observed aspiration values (0.03%) and the observed maternal education values (16.7%). Thus, we do not expect that the complete aspiration variable or the complete maternal education variable are substantially correlated with the survey mode variable. All other variables are imputed by using an MAR approach. In detail, missing values for grades are imputed by using predictive mean matching, and missing values for competence categories are imputed by using a polytomous regression approach. Missing values for gender are imputed by using a single-level logistic regression model, whereas missing values for migration background are imputed by using a two-level logistic regression model. Table 3 shows the results of our analysis, contrasted with the results of a CCA which is valid if missingness does not depend on observed or missing outcomes values given all other observed data (e.g. White and Carlin (2010)), and an MAR imputation approach for 'higher aspirations'. As in the simulation study, the two-level MAR imputation method is the method that was used in Zinn (2013). The number of imputed data sets is 20 with 50 iterations per imputed data set.
Under all three missing data schemes, we find significant effects (i.e. with p < 0:05) for higher grades in German, higher mathematics competencies, gender, migration background and the proportion of mothers with a UAC in the ninth grade. As expected, students with better grades in German show higher aspirations than students with lower grades. We do not find any significant effect of mathematics grades under a CCA. However, the effect of the mathematics grades becomes significant under MAR or MNAR. Thus, there is evidence that grades in mathematics are important for a student's aspirations. Having a look at the effect of competencies on students' aspirations, we see that the significance of German competencies shrinks to insignificance under MAR and MNAR-whereas, under a CCA, significant effects are estimated. For mathematics competencies, the picture is slightly different: even under MAR and MNAR, the significance of the good competency effect remains. In other words, there seems to be a significant part of remaining explanatory power in a student's mathematics competency that is not already captured by the mathematics grade (or vice versa). Apart from that, under all three missing data schemes, we find that female students in lower secondary school have much higher educational aspirations than males. Likewise, students with a migration background have much higher educational aspirations than those without. The proportion of ninth graders' mothers with a UAC has a very strong effect on students' aspirations as well: the larger the proportion, the higher the effect. In sum, the size of this composition effect is very large. The reason for this lies in the small variance in the proportion of mothers with a UAC proportion at the grade level. It is less than 0.1. This small value is due to the target population under consideration, and not to particularities of the data. In Germany, social segregation in the lowest level of secondary education is enormously high. This is reflected in the data: in only one school is the proportion of ninth graders' mothers with a UAC higher than 50%, whereas, in 34.5% of the schools, none of the students has a mother with a UAC. Considering the effect of maternal education, we see that no significant influence can be detected under CCA and MAR. This changes under MNAR, although the effect of maternal education is statistically significant only at the 0.1-level. However, 58% of the cases with missing values concerning aspiration also have missing values for maternal education. Therefore, the chances are high that we are missing students with low aspirations and mothers without a UAC. Ignoring this issue may induce confounding bias, resulting in an underestimation of the effect of maternal education. In this respect, it is very likely that students who have a mother with a UAC have considerably higher educational aspirations than students whose mothers do not hold such a degree. To summarize, our analysis underscores the plausibility of the MNAR assumption concerning the 'maternal education' data. Having a look at the random-effect variances estimated under MAR and MNAR, the strong multilevel structure of the data becomes apparent. Neglecting this data feature when imputing missing values yields (at least) biased and misleading variance estimates. Exploring the missing data models that were estimated for maternal education and students' aspirations, we find a correlation (averaged across all imputations) ρ (between the error terms of the selection equation and the outcome model) of 0.25 for mother's education and of 0.06 for student's aspirations. This substantiates our prior finding of an MNAR mechanism having generated the 'maternal education' data (under the distributional assumptions of the selection model applied). In contrast, ρ = 0:06 indicates that an MAR mechanism generated the aspiration data (and not an MNAR mechanism). Looking additionally at the generally small differences between the estimates of the MAR and the MNAR models, this suspicion seems to be confirmed. The correlation τ between the random effects of the selection equation and the outcome model is on average 0.57. This high value suggests that participating in the study and having higher aspirations depend in a similar fashion on the (latent) class context of a student.
In situations where a single estimate is required (and not a set of estimates found within a sensitivity analysis), the results from the distinct imputation models of the sensitivity analysis can be combined by using Rubin's combining rules. This yields a multiple-model imputation approach that takes into account all sources of uncertainty with regard to the missing data mechanism that is assumed (Rubin, 1987;Siddique et al., 2012Siddique et al., , 2014. In our application, imputations are generated from different posterior predictive distributions (since we are using distinct (univariate) imputation models). This constitutes an additional source of uncertainty that must be considered in the final inference. Siddique et al. (2012) described a modified version of Rubin's combining rules based on nested MI (Rubin, 2003;Shen, 2000) that can be used for this.

Conclusion
In this paper, we introduced a novel and unique method for handling incomplete binary multilevel data that are assumed to be MNAR. Our univariate imputation method can easily be incorporated into the FCS framework to deal with multivariate missingness. For example, it can be used in the R software package mice. (A related implementation is available from http://github.com/AngelinaHammon/PaperBinaryMNARmultilevelData.) Our simulation studies show that the novel approach outperforms competing techniques in terms of bias and coverage when data are affected by distinct MNAR mechanisms, i.e. in contrast with methods that are designed for MAR data, our method is capable of producing unbiased and accurate estimates of the quantities of interest. Moreover, our novel imputation method also yields valid estimates if the missing data were produced by an MAR mechanism. All in all, the method seems well suited to conducting meaningful sensitivity analyses-the only means of assessing the plausibility of an MNAR missing data mechanism or, more precisely, whether it matters for inference. In addition, the simulation studies demonstrate that valid imputation requires multilevel modelling if the data are clustered. In other words, when working with multilevel data that are supposed to be missing not at random, a proper imputation model like ours is required. In our simulation study, we kept the number of clusters and cluster sizes constant. A variation of both quantities can have an influence on the performance of our method. Thus, one of our future tasks will be to find out whether and to what extent such an influence exists. Furthermore, we included only random intercepts in our imputation model. The generalization to models containing random slopes is less straightforward and there is not a completely satisfactory imputation approach at the moment (Enders et al., 2016;Grund et al., 2016). However, in most social science applications, it is sufficient to consider only random intercepts to reflect a hierarchical structure in the data. We proved that our method is applicable to real data problems as well. For this, we studied the effect of maternal education on the educational aspirations of students in lower secondary education. However, we also noted that analysing large data sets with many clusters and incomplete predictors means long computing times when using only one processor. Thus, to run our approach, we highly recommend executing the MIs of mice in parallel on multiple cores.
Our approach must be extended in several respects. Up to now, we have applied a normal approximation for the parameter draws in our imputation algorithm. Using Bayesian estimation would make it possible to overcome this restriction since the model parameters can be drawn from the actual posterior distributions (which do not necessarily have to be normal). Furthermore, prior information on the correlation between the selection and the outcome model could be integrated. A decisive extension of the approach is the possibility of handling not only binary data but also, for example, ordinal variables and count data. The extension to ordinal data is straightforward since an ordinal outcome can also be explained by an underlying latent continuous variable. Therefore, an ordered probit model with sample selection (e.g. Greene (2012)) is a natural choice for imputing ordinal MNAR data. To perform meaningful sensitivity analyses, it is indispensable to compare alternative MNAR models with different assumptions regarding their missing data mechanisms. Thus, as a complement to a selection-model-driven approach, an imputation approach based on pattern-mixture modelling should be developed as well. For this, we plan to use the proxy pattern-mixture approach of Little (2009, 2011). This method includes one sensitivity parameter to assess the robustness of the missing data inference. The parameter ranges between 0 and ∞, where 0 indicates MAR and ∞ means that missingness depends only on the incomplete variable. Since the sensitivity parameter is independent of any expert assessment, the method is well suited to social science applications. Finally, we point out the crucial role that is played by the exclusion criterion in the successful application of our method. The identification of a suitable variable for this task may seem difficult at first glance. Normally, however, when working with survey data, metainformation such as the survey mode or access corridors exists, which is suspected to be strongly correlated with the respondents' willingness to provide information, but not with the outcome variable to be imputed. Nevertheless, it is advisable to carry out sensitivity analyses with regard to the exclusion criterion as well.