Acknowledgement. The author thanks the two anonymous referees for very helpful comments.
PARAMETRIC FRACTIONAL IMPUTATION FOR NON-IGNORABLE CATEGORICAL MISSING DATA WITH FOLLOW-UP
Version of Record online: 30 AUG 2012
DOI: 10.1111/j.1467-842X.2012.00673.x
© 2012 Australian Statistical Publishing Association Inc.
How to Cite
Kim, J. Y. (2012), PARAMETRIC FRACTIONAL IMPUTATION FOR NON-IGNORABLE CATEGORICAL MISSING DATA WITH FOLLOW-UP. Australian & New Zealand Journal of Statistics, 54: 239–250. doi: 10.1111/j.1467-842X.2012.00673.x
Publication History
- Issue online: 15 OCT 2012
Keywords:
- EM algorithm;
- multiple imputation;
- non-ignorable missing mechanism
Summary
Incomplete data subject to non-ignorable non-response are often encountered in practice and suffer from a non-identifiability problem. A follow-up sample is randomly selected from the set of non-respondents to avoid the non-identifiability problem and to obtain complete responses. Glynn, Laird & Rubin (1993) analysed non-ignorable missing data with a follow-up sample under a pattern mixture model. In this article, maximum likelihood estimation of the parameters of categorical missing data is considered with a follow-up sample under a selection model. To estimate the parameters with non-ignorable missing data, the EM algorithm with weighting, proposed by Ibrahim (1990), is used: in the E-step, the weighted mean is calculated using fractional weights for the imputed data. Variances are estimated using an approximated jackknife method. Simulation results are presented to compare the proposed method with previously presented methods.
1. Introduction
Missing data often arise in sample surveys. When non-response is not related directly to the missing values, the mechanism is called missing at random (MAR), as defined by Rubin (1976). In practice, however, non-response is often directly related to the values of the missing variable, even after adjusting for the auxiliary variables. For example, in exit polls, people are less likely to respond to questions about the party that they voted for if the party is not very popular. The missing data mechanism is considered non-ignorable when the non-response is directly related to the values of the missing variable.
Under MAR, Chen & Fienberg (1974) suggested a maximum likelihood method of parameter estimation for two-dimensional categorical data. Fuchs (1982) focused on maximum likelihood estimation of log-linear models using the expectation-maximisation (EM) algorithm proposed by Dempster, Laird & Rubin (1977). Little & Schluchter (1985) suggested a maximum likelihood estimation method for mixed continuous and categorical data with missing values, using the EM algorithm. Schafer (1987) analysed generalised linear models whose covariates were measured with error, again using the EM algorithm to estimate the regression coefficients. Ibrahim (1990) proposed the EM algorithm by the method of weights for generalised linear models with missing categorical covariates. For non-ignorable missing data, however, some parameters may not be identifiable, and are termed non-identifiable. Nordheim (1984) allowed the probability of uncertain classification to depend on category identity and analysed data from a genetic study on Turner's syndrome. Baker & Laird (1988) examined incomplete contingency tables under non-ignorable non-response and fitted log-linear models to them. Little (1993) considered pattern mixture models and analysed non-ignorable missing data under some restrictions, or by using prior information. Glynn, Laird & Rubin (1993) used a follow-up sample to avoid the non-identifiability problem and analysed non-ignorable missing data using a pattern mixture model. Park & Brown (1997) examined non-ignorable missing categorical data under a log-linear model; to avoid the non-identifiability problem, they restricted the parameter space and used constrained maximum likelihood estimation.
Ibrahim & Lipsitz (1999) considered the generalised linear regression problem with non-ignorable missing covariates using the weighting scheme proposed by Ibrahim (1990).
In this paper, to overcome the non-identifiability problem for non-ignorable missing data, a follow-up sample randomly selected from the non-respondents is used. To estimate the parameters, the mean score approach using fractional weights for the imputed data is proposed, and the maximum likelihood estimators are obtained via the EM algorithm using the method of weights proposed by Ibrahim (1990). To estimate the variance, a one-step jackknife variance estimation method is used. This paper is organised as follows. In Section 2, the basic setup is introduced. In Section 3, we propose the parameter estimation method for non-ignorable categorical missing data with a follow-up sample under the selection model. In Section 4, the variance estimation method is discussed. In Section 5, some simulation results are presented.
2. Basic setup
Suppose that x_1, …, x_n are independent and identically distributed (i.i.d.) realisations of a p-dimensional vector of auxiliary variables that are always observed, and that y_1, …, y_n are also i.i.d. realisations of the univariate categorical random variable y from a parametric distribution with density f(y | x; θ), parameterised by a q-dimensional unknown parameter θ. Without loss of generality, it may be assumed that y is scalar, even though the method can be extended directly to the case of a vector y of categorical variables. The parameter of interest is θ, and under complete response, the likelihood function for θ is

L_full(θ) = ∏_{i=1}^{n} f(y_i | x_i; θ).
When missing data occur, the sample A is decomposed as A = A_R ∪ A_M, where A_R is the set of respondents and A_M is the set of non-respondents. The response indicator variable r is defined as

r_i = 1 if i ∈ A_R, and r_i = 0 if i ∈ A_M,

for i = 1, …, n. Let the density of r_i be g(r | x_i, y_i; φ). Specifically, it is assumed that g is the density function of a Bernoulli distribution parameterised by π_i = π(x_i, y_i; φ). That is,

g(r | x_i, y_i; φ) = π_i^{r} (1 − π_i)^{1−r},

where π(x, y; φ) is a known function up to φ. Thus, the response probability is allowed to depend on y as well as x, so that the non-response mechanism is non-ignorable. The observed likelihood function is then

L_obs(θ, φ) = ∏_{i∈A_R} f(y_i | x_i; θ) π(x_i, y_i; φ) ∏_{i∈A_M} Σ_y f(y | x_i; θ) {1 − π(x_i, y; φ)}. (1)
Our purpose is to estimate θ. Note that the observed likelihood function depends on the missing mechanism. When the missing mechanism is MAR, so that π(x, y; φ) = π(x; φ), the observed likelihood function in (1) is factorised as

L_obs(θ, φ) = L_1(θ) L_2(φ), (2)

where

L_1(θ) = ∏_{i∈A_R} f(y_i | x_i; θ)

and

L_2(φ) = ∏_{i∈A_R} π(x_i; φ) ∏_{i∈A_M} {1 − π(x_i; φ)}.

Thus, under MAR, φ need not be estimated in order to estimate θ. However, under the non-ignorable missing mechanism, the observed likelihood function cannot be expressed as (2), and θ and φ must be estimated simultaneously.
Nonetheless, under non-ignorable missing data, the parameters in (1) may not be identified. To illustrate the non-identifiability problem, suppose that y is a dichotomous response variable with values 0 or 1, x is an auxiliary variable which is fully observed, and r is the response indicator random variable. Define

π_a(x) = P(r = 1 | x, y = a),

where a = 0, 1, for each value of x. Then,

P(r = 1 | x) = Σ_{a=0}^{1} P(y = a | x; θ) π_a(x).

Define also π_i = π_0(x_i) if y_i = 0 and π_i = π_1(x_i) if y_i = 1. The observed likelihood function of (θ, φ) is then

L_obs(θ, φ) = ∏_{i∈A_R} f(y_i | x_i; θ) π_i ∏_{i∈A_M} Σ_{a=0}^{1} f(a | x_i; θ) {1 − π_a(x_i)}.

Under the logistic regression model,

logit{π(x, y; φ)} = c + φ_1 x + φ_2 y, (3)

where φ = (c, φ_1, φ_2) and c is a constant intercept. In (3), there are more parameters than the number of sufficient statistics, so the parameters are not identifiable.
To overcome the non-identifiability problem, further assumptions are made on the parameters. Glynn et al. (1993) suggested estimating the parameters using a follow-up sample under a pattern mixture model based on the multiple imputation method. We discuss the parametric fractional imputation method with follow-up data under a selection model in the next section.
3. Fractional imputation with a follow-up
Fractional imputation, proposed by Kalton & Kish (1984) and further developed by Kim & Fuller (2004), is extended here to make inference under non-ignorable missing data. In fractional imputation, more than one imputed value is created for each missing item, and a fractional weight is assigned to each imputed value. Let y_i^{*(1)}, …, y_i^{*(M)} be the M possible values of y_i. In fractional imputation, y_i^{*(j)} is the jth imputed value for the missing y_i, and w_{ij}^* is a fractional weight assigned to y_i^{*(j)} satisfying Σ_{j=1}^{M} w_{ij}^* = 1.
To compute the maximum likelihood estimator, we use a follow-up sample to observe y_i and assume that there is no non-response in the follow-up sample. Let A_V be the index set of the follow-up data. Assume that A_V is randomly selected from the non-respondents by Bernoulli sampling, so that A_M \ A_V is the set of non-respondents not followed up. The sample inclusion indicator random variable for the follow-up sample is defined by

δ_i = 1 if i ∈ A_V, and δ_i = 0 otherwise.

For i ∈ A_R, we observe (x_i, y_i, r_i = 1). For i ∈ A_V, we observe (x_i, y_i, r_i = 0, δ_i = 1). We observe only (x_i, r_i = 0, δ_i = 0) for i ∈ A_M \ A_V. Assume that

P(r_i = 1 | x_i, y_i) = π(x_i, y_i; φ)

for some function π known up to φ. For the follow-up sample mechanism, we have

P(δ_i = 1 | r_i = 0, x_i, y_i) = ν,

where ν is a known constant predetermined at the design stage of the follow-up survey. The joint probability mass function of (r_i, δ_i) is then

P(r_i, δ_i | x_i, y_i) = π_i^{r_i} {(1 − π_i) ν}^{(1−r_i)δ_i} {(1 − π_i)(1 − ν)}^{(1−r_i)(1−δ_i)},

because δ_i is independent of (x_i, y_i) given r_i = 0.
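The Bernoulli follow-up mechanism can be sketched in a few lines. In the snippet below, the response indicators and the rate ν = 0.5 are simulated values chosen purely for illustration, not the paper's design:

```python
import numpy as np

def draw_follow_up(r, nu, rng):
    """Bernoulli follow-up sampling: each non-respondent (r_i = 0) enters
    the follow-up set A_V independently with known probability nu."""
    delta = np.zeros_like(r)
    nonresp = np.flatnonzero(r == 0)
    delta[nonresp] = rng.binomial(1, nu, size=nonresp.size)
    return delta

rng = np.random.default_rng(0)
r = rng.binomial(1, 0.57, size=300)   # simulated response indicators
delta = draw_follow_up(r, 0.5, rng)   # follow-up indicators delta_i

# Every follow-up unit is a non-respondent: A_V is a subset of A_M.
assert np.all(r[delta == 1] == 0)
```

Because ν does not depend on (x, y) or on the unknown parameters, the follow-up indicators contribute only a known constant factor to the likelihood.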
In this setup, the observed likelihood function may be written as

L_obs(θ, φ) = L_1(θ, φ) L_2(θ, φ),

where

L_1(θ, φ) = ∏_{i∈A_R} f(y_i | x_i; θ) π(x_i, y_i; φ) ∏_{i∈A_V} f(y_i | x_i; θ) {1 − π(x_i, y_i; φ)} (4)

and

L_2(θ, φ) = ∏_{i∈A_M\A_V} Σ_y f(y | x_i; θ) {1 − π(x_i, y; φ)};

the known factors ν and 1 − ν are omitted because they do not involve the parameters. In (4), the conditional distribution of y_i given x_i is denoted by f(y | x; θ), where θ is a vector of parameters that specifies the conditional distribution, and π(x_i, y_i; φ) = P(r_i = 1 | x_i, y_i) for i = 1, …, n. Writing η = (θ, φ), the observed likelihood function for η is L_obs(η).
To find the maximum likelihood estimator η̂ that maximises the observed likelihood L_obs(η), we have to solve

∂ log L_obs(η)/∂η = 0. (5)

Instead of using (5), we use the mean score equation, defined by

S̄(η) ≡ E{S(η) | x, y_obs, r, δ} = 0, (6)

where S(η) is the score function of η under complete response, S(η) = ∂ log L_full(η)/∂η, y_obs is the observed part of y = (y_1, …, y_n), r = (r_1, …, r_n) and δ = (δ_1, …, δ_n). Louis (1982) proved the equivalence of (5) and (6).
The full sample score function is defined as

S(η) = Σ_{i=1}^{n} {S_1(θ; x_i, y_i) + S_2(φ; x_i, y_i, r_i)},

where

S_1(θ; x, y) = ∂ log f(y | x; θ)/∂θ (7)

and

S_2(φ; x, y, r) = ∂ log g(r | x, y; φ)/∂φ. (8)
Note that S̄(η) is a function of the unknown parameter set η. In the EM algorithm, the mean score equation in (6) can be solved iteratively by using

S̄(η | η^{(t)}) ≡ E{S(η) | x, y_obs, r, δ; η^{(t)}}. (9)

With categorical data, as noted by Ibrahim (1990), the conditional expectation in (9) is viewed as a weighted mean. Thus, the mean score function in (9) is called the weighted mean score function using fractional imputation. The weighted mean score function in (9) can be partitioned as

S̄(η | η^{(t)}) = S̄_1(θ | η^{(t)}) + S̄_2(φ | η^{(t)}),

where

S̄_1(θ | η^{(t)}) = Σ_{i∈A_R∪A_V} S_1(θ; x_i, y_i) + Σ_{i∈A_M\A_V} Σ_{j=1}^{M} w_{ij}^{*(t)} S_1(θ; x_i, y_i^{*(j)}),

with S_1 defined in (7), and

S̄_2(φ | η^{(t)}) = Σ_{i∈A_R∪A_V} S_2(φ; x_i, y_i, r_i) + Σ_{i∈A_M\A_V} Σ_{j=1}^{M} w_{ij}^{*(t)} S_2(φ; x_i, y_i^{*(j)}, r_i),

with S_2 defined in (8). Here, w_{ij}^{*(t)} = P(y_i = y_i^{*(j)} | x_i, r_i = 0, δ_i = 0; η^{(t)}) is used as the fractional weight assigned to y_i^{*(j)} in fractional imputation. In the M-step, we can update the parameters by solving

S̄(η | η^{(t)}) = 0.
The parametric fractional imputation algorithm for missing categorical data using follow-up data is described as follows.
Step 1. Let η^{(t)} = (θ^{(t)}, φ^{(t)}) be the current parameter estimates. We denote w_{ij}^{*(t)} = P(y_i = y_i^{*(j)} | x_i, r_i = 0, δ_i = 0; η^{(t)}). Then w_{ij}^{*(t)} is the fractional weight. By Bayes' rule, we compute the fractional weight as

w_{ij}^{*(t)} = f(y_i^{*(j)} | x_i; θ^{(t)}) {1 − π(x_i, y_i^{*(j)}; φ^{(t)})} / [Σ_{k=1}^{M} f(y_i^{*(k)} | x_i; θ^{(t)}) {1 − π(x_i, y_i^{*(k)}; φ^{(t)})}]. (10)

Step 2. Solve the weighted mean score equation using the fractional weights obtained in Step 1. That is, solve

S̄(η | η^{(t)}) = 0

for η to obtain η^{(t+1)}, where

S̄_1(θ | η^{(t)}) = Σ_{i∈A_R∪A_V} S_1(θ; x_i, y_i) + Σ_{i∈A_M\A_V} Σ_{j=1}^{M} w_{ij}^{*(t)} S_1(θ; x_i, y_i^{*(j)})

and

S̄_2(φ | η^{(t)}) = Σ_{i∈A_R∪A_V} S_2(φ; x_i, y_i, r_i) + Σ_{i∈A_M\A_V} Σ_{j=1}^{M} w_{ij}^{*(t)} S_2(φ; x_i, y_i^{*(j)}, r_i).

Step 3. Go to Step 1 until η^{(t)} converges.
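As a concrete illustration of Steps 1–3, the following sketch runs the EM algorithm for a binary y, assuming logistic models for both f(y | x; θ) and π(x, y; φ). All model forms, parameter values and sample sizes here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def wlogit(X, y, w, n_iter=25):
    """Weighted logistic-regression MLE via Newton-Raphson:
    solves the weighted score equation sum_i w_i (y_i - p_i) x_i = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        H = X.T @ (X * (w * p * (1.0 - p))[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (w * (y - p)))
    return beta

# Illustrative data: logit P(y=1|x) = 0.5 + x, logit P(r=1|x,y) = 0.3 + 0.5x + 0.7y.
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + x))))
r = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 0.5 * x + 0.7 * y))))
delta = np.where(r == 0, rng.binomial(1, 0.5, size=n), 0)
obs = (r == 1) | (delta == 1)         # units with y observed (A_R and A_V)

theta, phi = np.zeros(2), np.zeros(3)
for _ in range(50):                   # EM iterations (Steps 1-3)
    # Step 1 (E-step): fractional weights by Bayes' rule, as in (10).
    xm = x[~obs]
    p1 = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * xm)))    # P(y=1|x)
    m0 = 1.0 - 1.0 / (1.0 + np.exp(-(phi[0] + phi[1] * xm)))  # P(r=0|x,y=0)
    m1 = 1.0 - 1.0 / (1.0 + np.exp(-(phi[0] + phi[1] * xm + phi[2])))
    fw1 = p1 * m1 / (p1 * m1 + (1.0 - p1) * m0)               # weight on y* = 1
    # Augmented data: observed rows (weight 1) plus two rows per missing y.
    xa = np.concatenate([x[obs], xm, xm])
    ya = np.concatenate([y[obs], np.ones(xm.size), np.zeros(xm.size)])
    ra = np.concatenate([r[obs], np.zeros(2 * xm.size)])
    wa = np.concatenate([np.ones(obs.sum()), fw1, 1.0 - fw1])
    # Step 2 (M-step): solve the two weighted score equations.
    theta = wlogit(np.column_stack([np.ones_like(xa), xa]), ya, wa)
    phi = wlogit(np.column_stack([np.ones_like(xa), xa, ya]), ra, wa)

# Estimate of p = P(y = 1) from the augmented data.
p_hat = (y[obs].sum() + fw1.sum()) / n
```

The augmented data set (xa, ya, ra, wa) is exactly the fractionally imputed data described in the text; each M-step fully maximises the corresponding weighted likelihood.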
The approach using the weighted mean score equation based on fractional imputation is computationally attractive. For fractional imputation with categorical data, we only need to impute a finite number of values, and the number of imputed values is equal to the number of categories. To each imputed value, we assign the fractional weight computed from the estimated conditional probability of obtaining the imputed value given the other observed information. That is, the fractional weight assigned to the jth imputed value y_i^{*(j)} is

w_{ij}^* = P(y_i = y_i^{*(j)} | x_i, r_i = 0, δ_i = 0; η̂),

where η̂ is the maximum likelihood estimator of η that maximises the observed likelihood function. The final fractional weights are assigned to the fractionally imputed data values, and the augmented data can be used just as complete data.
Once the augmented data with the fractional weights have been obtained, we can estimate various parameters, such as domain means and proportions, by applying complete-data estimation methods with the fractional weights; no further reference to the missing data mechanism is needed.
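For instance, with the augmented data a proportion or a domain proportion is just a weighted mean; the numbers below are made up purely for illustration:

```python
import numpy as np

# Augmented data for 5 units: three with y observed (weight 1) and two
# with missing y, each contributing the pair y* = 1 and y* = 0 with
# fractional weights that sum to 1 per unit (illustrative values).
y_aug = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
w_aug = np.array([1.0, 1.0, 1.0, 0.7, 0.3, 0.4, 0.6])
dom = np.array([1, 1, 0, 1, 1, 0, 0])   # domain indicator, constant per unit

p_hat = np.sum(w_aug * y_aug) / np.sum(w_aug)              # P(y = 1)
p_dom = np.sum(w_aug * y_aug * dom) / np.sum(w_aug * dom)  # domain proportion
```

Here w_aug sums to n = 5, so the weighted mean behaves exactly like a complete-data estimator.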
4. Variance estimation
Variance estimation can be performed by a replication method. Under complete response, let w_i^{(k)} be the kth replication weight for unit i, for k = 1, …, L. Assume that the replicate variance estimator

V̂(θ̂) = Σ_{k=1}^{L} c_k (θ̂^{(k)} − θ̂)²,

where θ̂^{(k)} is the estimator computed with the kth replication weights and c_k is a factor associated with replicate k, is consistent for the variance of θ̂. For example, for the delete-one jackknife with L = n, the kth jackknife replication weight is defined as

w_i^{(k)} = 1/(n − 1) if i ≠ k, and w_i^{(k)} = 0 if i = k,

and c_k = (n − 1)/n. Now, consider variance estimation for the parameter estimates from the fractional imputation method described in Section 3. To use the replication method for variance estimation, we need the replicated fractional weights. Thus w_{ij}^{*(k)}, the kth replicates of w_{ij}^*, are required to compute the replicated fractional weights. We let η̂^{(k)} denote the point estimator of η for the kth replicate. Let
S̄^{(k)}(η) = Σ_{i∈A_R∪A_V} w_i^{(k)} S(η; x_i, y_i, r_i) + Σ_{i∈A_M\A_V} w_i^{(k)} Σ_{j=1}^{M} w_{ij}^{*(k)} S(η; x_i, y_i^{*(j)}, r_i), (11)

where w_i^{(k)} is the replication weight for unit i and w_{ij}^{*(k)} is the fractional weight for the jth imputed data value in (10) evaluated at η = η̂^{(k)}. To compute η̂^{(k)}, we need to solve the replicated mean score equation S̄^{(k)}(η) = 0. But this is computationally heavy because the score functions in (11) are usually nonlinear and the equations have to be solved iteratively for each replicate. So, we consider a one-step approximation method using Taylor linearisation for obtaining η̂^{(k)}. For the score functions in (11), we have

S̄^{(k)}(η̂^{(k)}) = 0. (12)

Taylor expansion of (12) gives

0 = S̄^{(k)}(η̂^{(k)}) ≈ S̄^{(k)}(η̂) + {∂S̄^{(k)}(η̂)/∂η′}(η̂^{(k)} − η̂). (13)

By (13), we can get

η̂^{(k)} ≈ η̂ + {Î^{(k)}(η̂)}^{−1} S̄^{(k)}(η̂), (14)

where

Î^{(k)}(η̂) = −∂S̄^{(k)}(η̂)/∂η′.

The approximation formula (14) can be implemented as

η̂^{(k)} = η̂ + {Î_obs^{(k)}(η̂)}^{−1} S̄^{(k)}(η̂),

where Î_obs^{(k)}(η̂) is the estimator of the kth observed information matrix for η. Louis (1982) showed that the observed information matrix for η can be expressed as

I_obs(η) = E{I_full(η) | x, y_obs, r, δ} − E{S(η)S(η)′ | x, y_obs, r, δ} + S̄(η)S̄(η)′, (15)

where I_full(η) = −∂S(η)/∂η′ is the information matrix of the complete data and S̄(η) = E{S(η) | x, y_obs, r, δ}. Using (15), Î_obs^{(k)}(η̂) is computed by evaluating each conditional expectation as a fractionally weighted mean with the replication weights w_i^{(k)} and the fractional weights w_{ij}^{*(k)}, all evaluated at η̂. The consistency of the variance estimator follows by the argument of Rao & Tausi (2004).
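To make the replication variance formula concrete, the sketch below applies the delete-one jackknife with c_k = (n − 1)/n to a simple estimator by full recomputation. The paper's one-step replicates avoid this brute-force re-solving; this is only an illustration of the variance formula itself:

```python
import numpy as np

def jackknife_variance(x, estimator):
    """Replicate variance sum_k c_k (theta_hat^(k) - theta_hat)^2
    with delete-one replicates and c_k = (n - 1)/n."""
    n = x.size
    theta_hat = estimator(x)
    reps = np.array([estimator(np.delete(x, k)) for k in range(n)])
    return (n - 1) / n * np.sum((reps - theta_hat) ** 2)

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.45, size=300).astype(float)
v_jk = jackknife_variance(x, np.mean)

# For the sample mean, the jackknife reproduces s^2 / n exactly.
assert np.isclose(v_jk, x.var(ddof=1) / x.size)
```

For nonlinear estimators the equality is only approximate, which is why the linearised replicates in (14) are useful.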
5. Simulation study
To compare parametric fractional imputation with other existing methods, we have performed a small simulation study.
5.1. Simulation setup
In the simulation study, we used B = 3000 Monte Carlo samples of size n = 300 from the model

y_i | x_i ~ Bernoulli(p_i),

where logit(p_i) = 0.5 + x_i, and the response indicator variable r_i is distributed as

r_i | (x_i, y_i) ~ Bernoulli(π_i),

where logit(π_i) = φ_0 + φ_1 x_i + φ_2 y_i for i = 1, …, n. The average response rate is about 57%. Let δ_i be the sample inclusion indicator for the follow-up survey with

δ_i | (r_i = 0) ~ Bernoulli(ν),

where ν is the follow-up sampling rate.
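A generator for one Monte Carlo sample of this design might look as follows. The response-model coefficients φ, the follow-up rate ν and the distribution of x are not stated above, so the values used here are placeholders, not the paper's:

```python
import numpy as np

def generate_sample(n, rng, phi=(-0.2, 0.5, 0.7), nu=0.6):
    """One simulated sample: outcome model logit(p_i) = 0.5 + x_i,
    response model logit(pi_i) = phi0 + phi1*x_i + phi2*y_i,
    Bernoulli(nu) follow-up among non-respondents."""
    x = rng.normal(size=n)                       # x ~ N(0, 1) is an assumption
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + x))))
    pi = 1.0 / (1.0 + np.exp(-(phi[0] + phi[1] * x + phi[2] * y)))
    r = rng.binomial(1, pi)
    delta = np.where(r == 0, rng.binomial(1, nu, size=n), 0)
    return x, y, r, delta

rng = np.random.default_rng(3)
x, y, r, delta = generate_sample(300, rng)
```

A Monte Carlo study would call generate_sample B times and record the point estimates from each sample.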
The parameters of interest are:

1. p: the probability of y = 1, and
2. φ = (φ_0, φ_1, φ_2): the regression coefficients for the logistic regression of r on (1, x, y).

For each parameter, we have computed four estimators:

1. the complete sample estimator,
2. the fractional imputation estimator (FIE), and
3. the multiple imputation estimator (MIE) with M = 10 and M = 100 imputations for each missing value.
In fractional imputation, as y is a dichotomous random variable, only two values are imputed, y = 0 and y = 1. Fractional weights are assigned by formula (10). The parameters p and φ are specified by the selection model. For fractional imputation, the one-step jackknife method is used for variance estimation. For multiple imputation, the variance estimator is

V̂_MI = W_M + (1 + M^{−1}) B_M,

where

W_M = M^{−1} Σ_{k=1}^{M} V̂^{(k)}

is the average of the variance estimators V̂^{(k)} computed under the kth imputed sample, and

B_M = (M − 1)^{−1} Σ_{k=1}^{M} (η̂^{(k)} − η̄_M)², with η̄_M = M^{−1} Σ_{k=1}^{M} η̂^{(k)},

is the between-imputation variance of the point estimators η̂^{(k)}.
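The multiple-imputation variance estimator is Rubin's combining rule; a minimal numeric sketch with made-up inputs:

```python
import numpy as np

def mi_variance(point_estimates, within_variances):
    """Rubin's combining rule: T = W_M + (1 + 1/M) * B_M, where W_M is the
    average within-imputation variance and B_M is the between-imputation
    variance of the M point estimates."""
    est = np.asarray(point_estimates, dtype=float)
    M = est.size
    W_M = float(np.mean(within_variances))
    B_M = est.var(ddof=1)
    return W_M + (1.0 + 1.0 / M) * B_M

# Illustrative inputs: M = 3 imputed-data estimates of p and their variances.
T = mi_variance([0.44, 0.46, 0.45], [0.0008, 0.0009, 0.0007])
```

As M grows, the (1 + 1/M) factor tends to 1, which is why the imputation-variance penalty shrinks for large M.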
5.2. Simulation results
In the simulation study, we computed the Monte Carlo (MC) means and variances of the point estimators. Table 1 presents the MC means, the MC variances and the MC standardised variances of the point estimators with 26% follow-up. The MC standardised variance is computed by dividing the MC variance of the point estimator by that of the corresponding complete sample estimator and multiplying by 100.
Table 1. Monte Carlo means, variances and standardised variances of the point estimators (26% follow-up).

Parameter | Method | Mean | Variance | Standardised variance
---|---|---|---|---
p | Complete sample | 0.45 | 0.00083 | 100
p | FIE | 0.45 | 0.00166 | 200
p | MIE (M = 10) | 0.45 | 0.00174 | 211
p | MIE (M = 100) | 0.45 | 0.00169 | 204
φ_0 | Complete sample | −1.53 | 0.18854 | 100
φ_0 | FIE | −1.53 | 0.20674 | 104
φ_0 | MIE (M = 10) | −1.50 | 0.19564 | 101
φ_0 | MIE (M = 100) | −1.50 | 0.19469 | 100
φ_1 | Complete sample | 0.51 | 0.01780 | 100
φ_1 | FIE | 0.51 | 0.01816 | 101
φ_1 | MIE (M = 10) | 0.50 | 0.01758 | 99
φ_1 | MIE (M = 100) | 0.50 | 0.01753 | 99
φ_2 | Complete sample | 0.71 | 0.06158 | 100
φ_2 | FIE | 0.73 | 0.15928 | 258
φ_2 | MIE (M = 10) | 0.72 | 0.16993 | 276
φ_2 | MIE (M = 100) | 0.72 | 0.16250 | 264
Table 1 shows that the point estimators are all approximately unbiased. The theoretical variance of the complete sample estimator of p is p(1 − p)/n = 0.45 × 0.55/300 ≈ 0.00083, which is consistent with the simulation result in Table 1. The variance of the fractional imputation estimator of p is larger, which is partly explained by the large variance in the estimation of φ_2, the coefficient of y in the response model: only the follow-up sample provides direct information for estimating φ_2. In Table 1, the FIEs show smaller variances than the MIEs for p and φ_2.
The FIE is more efficient than the MIE because the FIE is a deterministic imputation method, whereas the MIE is a stochastic imputation method. Being stochastic, the MIE is subject to an imputation variance of order M^{−1}; thus, letting M → ∞ would remove the imputation variance of the MIE. For variance estimation, the relative bias is computed by dividing the difference between the expected value of the variance estimator and the true variance by the variance of the point estimator. Table 2 shows that the t-statistics are not significant at the 5% level, so the variance estimators are approximately unbiased.
Table 2. Relative biases and t-statistics of the variance estimators under the selection model (SM).

Model | Parameter | Method | Relative bias (%) | t-statistic
---|---|---|---|---
SM | | App. JK (for FIE) | −2.35 | −0.83
SM | | MIE (M = 100) | −3.43 | −1.32
References
- Baker, S.G. & Laird, N.M. (1988). Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. J. Amer. Statist. Assoc. 83, 62–69.
- Chen, T. & Fienberg, S.E. (1974). Two-dimensional contingency tables with both completely and partially cross-classified data. Biometrics 30, 629–642.
- Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39, 1–38.
- Fuchs, C. (1982). Maximum likelihood estimation and model selection in contingency tables with missing data. J. Amer. Statist. Assoc. 77, 270–278.
- Glynn, R.J., Laird, N.M. & Rubin, D.B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with follow-ups. J. Amer. Statist. Assoc. 88, 984–993.
- Ibrahim, J.G. (1990). Incomplete data in generalized linear models. J. Amer. Statist. Assoc. 85, 765–769.
- Ibrahim, J.G. & Lipsitz, S.R. (1999). Missing covariates in generalized linear models when the missing data mechanism is non-ignorable. J. R. Stat. Soc. Ser. B Stat. Methodol. 61, 173–190.
- Little, R.J.A. (1993). Pattern-mixture models for multivariate incomplete data. J. Amer. Statist. Assoc. 88, 125–134.
- Little, R.J.A. & Schluchter, M.D. (1985). Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika 72, 497–512.
- Louis, T.A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 44, 226–233.
- Kalton, G. & Kish, L. (1984). Some efficient random imputation methods. Comm. Statist. Theory Methods A 13, 1919–1939.
- Kim, J.K. & Fuller, W. (2004). Fractional hot deck imputation. Biometrika 91, 559–578.
- Nordheim, E.V. (1984). Inference from nonrandomly missing categorical data: An example from a genetic study on Turner's syndrome. J. Amer. Statist. Assoc. 79, 772–780.
- Park, T. & Brown, M.B. (1997). Log-linear models for a binary response with nonignorable nonresponse. Comput. Statist. Data Anal. 24, 417–432.
- Rao, J.N.K. & Tausi, M. (2004). Estimating function jackknife variance estimators under stratified multistage sampling. Comm. Statist. Theory Methods 33, 1–9.
- Rubin, D.B. (1976). Inference and missing data. Biometrika 63, 581–590.
- Schafer, D.W. (1987). Covariate measurement error in generalized linear models. Biometrika 74, 385–391.