Efficient multiply robust imputation in the presence of influential units in surveys

Item nonresponse is a common issue in surveys. Because unadjusted estimators may be biased in the presence of nonresponse, it is common practice to impute the missing values with the objective of reducing the nonresponse bias as much as possible. However, commonly used imputation procedures may lead to unstable estimators of population totals/means when influential units are present in the set of respondents. In this article, we consider the class of multiply robust imputation procedures that provide some protection against the failure of underlying model assumptions. We develop an efficient version of multiply robust estimators based on the concept of conditional bias, a measure of influence. We present the results of a simulation study to show the benefits of the proposed method in terms of bias and efficiency.


Introduction
Item nonresponse is ubiquitous in surveys conducted by National Statistical Offices.Most often, it is treated by some form of single imputation, whereby a missing value is replaced by some plausible value constructed under certain assumptions.The customary imputation process starts with specifying an imputation model describing the relationship between the variable y requiring imputation, and a set of fully observed variables, v, available for both respondents and nonrespondents.Defining an imputation model involves the selection of an appropriate set of predictors and, in the case of parametric imputation, the specification of a functional linking y to v. Once the missing data have been imputed, a population total is readily estimated by computing a weighted sum of observed and imputed values.The validity of imputed estimators requires the first moment of the imputation model, E(y | v), to be correctly specified.Misspecification of the first moment may result in significant bias.To protect against the misspecification of the imputation model, one can have recourse to multiply robust imputation procedures, whereby the imputer specifies multiple imputation models and/or multiple nonresponse models, where a nonresponse model is a set of assumptions describing the relationship between the response indicators (equal to 1 if y is observed and equal to 0, otherwise) to a set of fully observed variables.The rationale behind multiply robust imputation procedures is to construct a set of imputed values by combining all the information contained in these multiple models.A procedure is said to be multiply robust if the resulting estimator remains consistent if all but one of the specified models are incorrectly specified, which is a desirable feature.The reader is referred to Han and Wang (2013), Han (2014a), Han (2014b), Chan and Yam (2014), Chen and Haziza (2017), Chen and Haziza (2019) for a discussion of multiply robust procedures.Double robustness (e.g., Robins et al., 1994;Scharfstein et al, 2005;Haziza and Rao, 2006;Kim and Park, 2006;Kang, Schafer, 2008;Cao and al., 2009;Kim and Haziza, 2014) can be viewed as a special case of multiple robustness.
While multiply robust imputation procedures provide some protection against the failure of underlying model assumptions, the resulting estimators are generally vulnerable to the presence of influential units in the sample.
A unit is said to be influential if its inclusion or exclusion from the computation has a large impact on the resulting estimate.In the presence of influential units, imputed estimators are (asymptotically) unbiased if the first moment of the imputation model is correctly specified but they may exhibit a large variance.At this stage, it is useful to distinguish influential units from gross measurement errors.The latter are identified and corrected at the data-editing stage.In contrast, an influential unit corresponds to a respondent who exhibits a value that is correctly recorded.An influential unit may thus represent other similar units in the set of nonrespondents or in the non-sampled part of the population.This type of units has been called representative outliers by Chambers (1986) and are the focus of the current article.The issue of influential units is common in business surveys.
On the one hand, the distribution of economic variables is typically highly skewed, which generates a conducive ground for the presence of influential units.On the other hand, an influential unit can arise when the measure of size recorded on the sampling frame and used to stratify the population is considerably smaller than the size recorded on the field.This unit is then placed in a stratum with smaller units.As a result, it will generally exhibit a large y-value combined with a large weight, which makes it potentially harmful.These units are often referred to as stratum jumpers.
To quantify the influence of a unit, we use the concept of conditional bias that was first suggested by Muñoz-Pichardo et al. (1995) in the customary independent and identically distributed (iid) setup and adapted by Moreno-Rebollo et al. (1999) and Moreno-Rebollo et al. (2002) in the survey sampling setup.At this stage, it is worth pointing out that a unit is influential/not influential with respect to a given configuration.In the context of imputation for missing survey data, a configuration consists of (i) the variable y and its distribution in the population; (ii) the finite population parameter of interest; (iii) the sampling design and the associated estimator; (iv) whether or not the unit is present in the sample; (v) whether or not the unit responded to item y; (vi) the imputation procedure used to fill in the missing values.A unit may have a large influence with respect to a given configuration but may have no influence with respect to another configuration.
In the ideal set-up of 100% response, Beaumont et al. (2013) constructed an efficient version of the Horvitz-Thompson estimator based on the concept of conditional bias.Results of several empirical investigations suggest that the estimator of Beaumont et al. (2013) outperforms the Horvitz-Thompson in terms of mean square error when influential units are present in the sample.This is achieved at the expense of introducing a bias.In the absence of influential units in the sample, the estimator of Beaumont et al. (2013) suffers from a very slight loss of efficiency with respect to the Horvitz-Thompson estimator, which is a desirable feature.Favre-Martinoz et al. (2016) extended the approach of Beaumont et al. (2013) to the case of two-phase sampling designs and weighting for unit nonresponse.Doubly robust imputation procedures in the presence of influential units were considered in Dongmo Jiongo (2015) who extended the results of Beaumont et al. (2013) with the conditional bias evaluated with respect to two inferential frameworks: the nonresponse model framework and the imputation model framework.This led to two efficient estimators, one for each framework.In this article, we consider the case of multiply robust imputation and evaluate the conditional bias using a framework different from the ones considered in Dongmo Jiongo (2015).
Our approach leads a single estimator, which is attractive from an imputer's perspective.
The paper is organized as follows.In Section 2, we briefly describe the approach of Beaumont et al. (2013) in the ideal scenario of 100% response.
In Section 3, we define the conditional bias of a unit and extend the results of Section 2 to the case of multiply robust imputation procedures.In Section 4, we present a bootstrap procedure for estimating the conditional bias of a unit.A calibrated imputation procedure is described in Section 5.In Section 6, we present the results from three empirical investigations, assessing the proposed method in terms of bias and efficiency.Some final remarks are given in Section 7. Finally, some technical details are relegated to the Appendix.

Efficient complete data estimation
Consider a finite population U = {1, . . ., i, . . ., N } of size N .We are interested in estimating the population total, t y = i∈U y i , of a survey variable y.We select a sample S, of size n, according to a given sampling design p(S).Let I i the sample selection indicator attached to unit i, such that I i = 1 if i ∈ S and I i = 0, otherwise.The first and second-order inclusion probabilities are respectively given by π i = P(I i = 1), i ∈ U, and A complete data or prototype estimator of t y is the Horvitz-Thompson estimator (Horvitz and Thompson, 1952) where w i = π −1 i denotes the sampling weight attached to unit i.The Horvitz-Thompson estimator is design-unbiased for t y ; that is, E p t y,HT = t y , where the subscript p denotes the sampling design.Under mild regularity conditions, it is also design-consistent for t y in the sense that t y,HT − t y = O p (N/ √ n); see, e.g., Breidt and Opsomer (2017).
In the presence of influential units in the sample, the Horvitz-Thompson estimator may be highly unstable.In the ideal situation of 100% response, Beaumont et al. (2013) proposed an efficient version of t y,HT based on the concept of conditional bias (Moreno-Rebollo et al., 1999;Beaumont et al., 2013).Let θ N be a finite population parameter and θ be an estimator of θ N .
The conditional bias attached to the ith sample unit is defined as If θ N = t y and θ = t y,HT , it can be shown that where ∆ ik = π ik − π i π k .Because the conditional bias (2) depends on the complete set of population values, y 1 , . . ., y N , it is generally unknown.A conditionally unbiased estimator of B 1i is given by That is, We consider an efficient version of t y,HT of the form t * y,HT = t y,HT + ∆(c), where ∆(c) is a random variable that depends on the cut-off value c.Beaumont et al. (2013) suggested to determine the value of ∆(c) that minimizes the maximum absolute estimated conditional bias of t * y,HT .This leads to t * y,HT = t y,HT − where B min = min i∈S ( B 1i ) and B max = max i∈S ( B 1i ).Beaumont et al. (2013) showed empirically that t * y,HT can be significantly more efficient than t y,HT when influential units are present in the sample.This is achieved at the expense of introducing a bias given by However, under mild regularity conditions, the estimator ( 4) is design- 3 Efficient estimation in the presence of missing data In practice, the survey variable y may be prone to missing values.Let r i be a response indicator attached to unit i such that r i = 1 if y i is observed and denote the set of respondents and the set of nonrespondents to the survey variable y, respectively.Throughout the paper, we assume that the data are Missing At Random (Rubin, 1976): where v denotes a vector of fully observed variables.The true model linking the survey variable y to the set of fully observed variables v is given by Although we assume equal variances, our results can be easily extended to the case of unequal variances.
Following Han and Wang (2013) and Chen and Haziza (2017), we consider two classes of models: (i) The class C 1 of J nonresponse models, each of the nonresponse model being a set of assumptions about the unknown nonresponse mechanism: ) is a predetermined functional associated with the jth nonresponse model, α (j) is a vector of unknown parameters and v (j)   denotes the set of predictors included in the jth nonresponse model.
(ii) The class C 2 of L imputation models, each model being a set of assumptions about the conditional distribution of y given v: where m ( ) (•, β ( ) ) is a predetermined functional associated with the th imputation model, β ( ) is a vector of unknown parameters and v ( ) denotes the set of predictors included in the th imputation model.
Overall, the imputer specifies J + L models that will be used in the construction of the imputed values.
To construct the imputed values, y * i , i ∈ S nr , we proceed as follows: (1) We start by estimating the parameters α (j) , j = 1, . . ., J and β ( ) , = 1, . . ., L by solving the following estimating equations: = 0 and S ( ) respectively, where φ pi and φ mi are two coefficients associated with unit i.In practice, these coefficients are either set to 1 or to w i .
(2) For each unit i ∈ S, we form the following vectors of size J and L, respectively: i , α (1) ), . . ., p J (v and L) ) .
To compress the information contained in the vector U pi , we fit a linear regression model with the response indicator r as the dependent variable and the vector U p as the set of predictors.This leads to the J-vector of estimated coefficients To compress the information contained in the vector U mi , we fit a linear regression model based on the responding units, with the survey variable y as the dependent variable and the vector U m as the set of predictors.This leads to the L-vector of estimated coefficients Finally, for each unit i ∈ S, we obtain the following two standardized scores: Here, if a = (a 1 , . . ., a h ) is a h-vector, a 2 denotes the vector of square coefficients (a 2 1 , . . ., a 2 h ) .
(3) The imputed values y * i , i ∈ S nr , are obtained by fitting a weighted linear regression model with y as the dependent variable and h = (1, m) as the vector of predictors.The regression weights are given by . This leads to where When J = 0 and L = 1, the imputation procedure (5) reduces to an imputation based on a single imputation model.When J = 1 and L = 1, the imputation procedure (5) corresponds to a doubly robust imputation procedure.
Based on the observed values, y i , i ∈ S r , and the imputed values, y * i , i ∈ S nr , we construct an imputed estimator of t y : The estimator ( 6) is multiply robust in the sense that it remains consistent for t y if all but one of the J + L models are incorrectly specified.That is, if at least one one the J + L models is correctly specified, we have Chen and Haziza (2017).
While the imputation procedure (5) provides some protection against model misspecification, the resulting estimator (6) may be highly unstable in the presence of influential units.In this section, we develop an efficient version of t M R by extending the results of Beaumont et al. (2013).
We start by defining the concept of conditional bias in the context of imputation for missing data.We identify three sources of randomness: the imputation model that generates the N -vector of population y-values, y U = (y 1 , . . ., y N ) ; the sampling design that generates the N -vector of sample selection indicators, I U = (I 1 , . . ., I N ) ; and the nonresponse mechanism that generates the N -vector of response indicators, r U = (r 1 , . . ., r N ) .Different combinations of these distributions may be used to assess the conditional bias of a unit.In the sequel, the conditional bias is evaluated with respect to the sampling design.That is, in addition to the sets of predictors included in the imputation and nonresponse models, the vectors y U and r U will be treated as fixed.The conditional bias associated with unit i of t M R is thus defined as The expectation on the right hand-side of ( 7) is intractable as the estimator t M R is a complex function of the sample selection indicators I 1 , . . ., I N .
Therefore, we rely on the first-order Taylor expansion, which leads to where m and τ • denoting the probability limits of α, β, η p , η m and τ , respectively.The derivations leading to (9) are shown in the Appendix.
Using (8) and ignoring the higher-order terms, we obtain the following approximation of the conditional bias attached to unit i: The conditional bias in ( 10) is unknown as it involves population quanti- ties.An estimator of ( 10) is given by where ψ k in ( 11) is obtained from ( 9) by replacing each unknown quantity with a corresponding estimator.Note that the estimator B in (11) does not involve the term k∈U ψ k − t y on the right hand-side of (10).Indeed, an estimator of t y is given by t M R , whereas an estimator of k∈U ψ k is given by k∈S w k ψ k .From (8), we have, where The estimated linearized variable ψ k is obtained by estimating each unknown quantity in (12) with a suitable estimator.This leads to where It follows that the estimated conditional bias associated with unit i of the multiply robust estimator t M R is given by The first term on the right hand-side of (13) corresponds to the influence of unit i on the sampling error, whereas the second term represents the effect of nonresponse and imputation on the influence of unit i.From (13), a unit has a large influence if its complete data conditional bias is large and/or its residual y k − v k β r ) is large and/or if the a k is large (which may indicate that the unit has a high leverage).Therefore, our measure accounts for all the components of the configuration described in Section 1.
As in Section 2, we consider an efficient version of t M R of the form We determine the value of ∆(c) that minimizes the maximum absolute conditional bias of t * M R .This leads to where ) and ).If at least one of the J + L models is correctly specified, the bias of t M R is negligible.As a result, the bias of t * M R can be approximated by where the expectation E(.) is evaluated with respect to the joint distribution induced by the imputation model, the nonresponse mechanism and the sampling design.
4 Pseudo-population bootstrap procedure for estimating the conditional bias In Section 3, we derived an approximation of the conditional bias based on a first-order Taylor expansion.However, the derivation involved relatively tedious algebra.In this section, we describe a pseudo-population bootstrap procedure for estimating the conditional bias; see Mashreghi et al. (2016) for a discussion of bootstrap procedures in finite population sampling.The idea behind pseudo-population bootstrap procedures is to create a pseudopopulation from the original sample.Bootstrap samples are then selected from the pseudo-population using the same sampling design utilized to select the original samples.
A general pseudo-population bootstrap algorithm can be described as follows: (i) Repeat the pair (y i , π i ), π −1 i times for all i in S to create, U f , the fixed part of the pseudo-population.
(ii) To complete the pseudo-population, U * , draw U c * from {(y i , π i )} i∈S using the original sampling design with inclusion probability π −1 i − π −1 i for the ith pair, leading to U * = U f ∪ U c * .Compute the bootstrap parameter t * y on the resulting pseudo-population (iii) Take a bootstrap sample S * from U * using the same sampling design that led to S. (vi) Let S b,i be the set of bootstrap samples that contain unit i.A bootstrap in ( 7) is given by where M i denotes the cardinality of S b,i .
5 Calibrated imputation procedure The proposed method described in Section 3 consists of (i) imputing the missing values according to ( 5), (ii) computing the multiply robust estimator t M R given by ( 6) and (iii) obtaining the efficient version t * M R given by ( 14).In this section, we suggest implementing the proposed method through a calibrated imputation procedure, which may be attractive from a secondary analyst's point of view.The concept of calibrated imputation has been considered in Ren and Chambers (2003), Beaumont and Alavi (2004) and Beaumont (2005), among others.
The rationale behind calibrated imputation is to find final imputed values y * iF , i ∈ S nr , as close as possible to the preliminary imputed values y * i given by (5) subject to More specifically, we seek final imputed values y * iF , i ∈ S nr , that minimize subject to ( 15), where G(•) is a pseudo-distance function and q i > 0 is a known coefficient attached to unit i.The pseudo-distance function G(•) must satisfy the following properties: Deville and Särndal (1992) for a description of commonly used functions G(•).Seeking final imputed values y * iF , i ∈ S nr , close to the preliminary values y * i is desirable as the latter ensure that the resulting imputed estimator t M R is a consistent estimator of t y if at least one of the J + L models is correctly specified.
For instance, if we use the generalized chi-square distance, we seek final imputed values y * iF , i ∈ S nr , that minimize The final imputed values y * iF are readily obtained by using any standard calibration software; e.g., the SAS macro CALMAR2 (Sautory, 2003) and the R package Icarus (Rebecq, 2016).

Simulation study
We conducted a simulation study to assess the performance of the proposed method in terms of bias and efficiency.For each scenario, we repeated R = 10, 000 iterations of the following process: (i) A finite population of size N = 5, 000 was generated.The population consisted of a survey variable y and a set of predictors.We first generated the auxiliary variable v 1 according to a uniform distribution, v 1 ∼ U (0, 5).Given the v 1 -values, we generated the survey variable y according to four distributions D: normal, Gamma, lognormal and Pareto.More specifically, we used: Several configurations of the vector β = (β 0 , β 1 , β 2 ) were used; see the tables of results below.The value of σ 2 was set to 500, 50, 30 and 20, for the normal, the gamma, the lognormal and the Pareto distributions, respectively.The parameters were set so that the first two moments of the distribution D(µ i ; σ 2 i ) were the same for the four distributions.Figure 1 shows the relationship between y and v i = (v 1i , v 2 1i ) in each scenario.
(ii) From the finite population generated in Step (i), a sample, of size n = 50; 100, was selected according to simple random sampling without replacement.
(iii) In each sample, the response indicators r i , i = 1, . . ., n, were independently generated according to a Bernoulli distribution with probability .
This led to a response rate approximately equal to 70%.(v) In each completed data set, the estimators t M R and t * M R , given respectively by ( 6) and ( 14), were computed.
As a measure of bias of an estimator, we computed the Monte Carlo percent relative bias given by where t is a generic notation used to denote an estimator of t y and t (k) is the estimator t at the kth iteration.As a measure of efficiency, we computed the percent relative efficiency, using t M R as the reference: , where (a) Normal β = (10, 10, 10)

Imputation based on a single imputation model
In this section, the imputed values were constructed using a single imputation model.In each scenario, we fitted the model m(v i , β) = v i β with That is, the first moment of the imputation was correctly specified.Table 1 shows the Monte Carlo percent relative bias and relative efficiency of t M R and t * M R for four distributions.
For the normal distribution, the estimator t * M R showed negligible bias and was slightly less efficient than the estimator t M R with values of RE equal to 103 for n = 50 and equal to 101 for n = 100.For the Gamma distribution, the estimator t * M R was biased with values of absolute RB ranging from 2.6% to 22%.In terms of RE, the estimator t * M R was more efficient than t M R for all the configurations of the vector β.For β = (1, 0.05, 0.05), the estimator t * M R was much more efficient than t M R with value of RE equal to 67 for n = 50 and equal to 76 for n = 100.For the lognormal distribution, the values of absolute RB varied from 1.3% to 9.7%.Again, the estimator t * M R was more efficient than t M R for all the configurations of the vector β, with values of RE ranging from 67 to 94.Finally, for the Pareto distribution, the estimator t * M R was moderately biased with values of absolute RB ranging from 1.2% to 4.7%.For some configurations of the vector β, the proposed estimator t * M R was considerably more efficient than its counterpart t M R ; for β = (1, 0.1, 0.1), the value of RE was equal to 57.Finally, the value of RE with n = 100 was never less than the value of RE for n = 50.Distribution In this section, the finite populations and the nonresponse indicators were generated using the same models as in Section 6.1.We considered the case of doubly robust imputation procedures for which the imputer specifies an imputation model and a nonresponse model.In the three scenarios described below, we fitted a nonresponse model and an imputation model of the form and m(v i , β) = v i β.
We considered three scenarios: (i) Both models were correctly specified, denoted by m and p ; (ii) The imputation model was correctly specified but the nonresponse model was denoted by m and p ; (iii) The nonresponse model was correctly specified but the imputation model was misspecified, denoted by m and p .
Correctly specified models were based on the set of predictors 1, v 1 and v 2 1 .Misspecified models were based on the set of predictors 2, v 1 and v 2 , where v 2 was generated from a U(0, 4) and was unrelated to both the survey variable y and the response indicators r.This is summarized in Table 2.The results are shown in Tables 3-6.As expected, the estimator t M R showed a small bias in all the scenarios.This can be explained by the fact that it is doubly robust in the sense that it remains consistent for the true total t y if either model is correctly specified.The results in Tables 3-6 were similar to those obtained in Section 6.1.The estimator t * M R was biased but more efficient than t M R in all the scenarios.Again, the gains in efficiency were especially noteworthy for the Pareto distribution with values of RE ranging from 53 to 92; see Table 6.

Imputation based on two imputation models
Again, the finite populations and the nonresponse indicators were generated using the same models as in Section 6.1.In this section, the imputed values were based on two imputation models: ) was misspecified.Table 7 gives the set of predictors for each model.
Table 7: Working models The results are shown in Table 8.Again, the results were very similar to those obtained in Sections 6.1 and 6.2. Distribution

Final remarks
In this paper, we have proposed an efficient version of the customary multiply robust estimator based on the concept of conditional bias of a unit.The proposed method is general as it can be applied to a wide class of imputation procedures including the customary imputation based on a single imputation model and doubly robust imputation procedures.The results from a simu-lation study suggest that the proposed method outperforms the customary multiply robust estimator in terms of mean square error when the distribution of y given v is highly skewed.The gains were especially substantial in the case of the lognormal and the Pareto distributions.
It would be of interest to develop an estimator of the mean square error of the proposed estimator t * M R to assess its efficiency in practice.A satisfactory solution to this issue is currently lacking, even in the ideal case of 100% response.Although a bootstrap procedure would seem natural, the extreme order statistics B (M R) min and B (M R) max in t * M R make the application of bootstrap relatively complex.This issue will be considered elsewhere.
Using a first-order Taylor expansion, we first write: Also, we have After some algebra, we obtain where with and and, as a result, is ignored.Example 3.1.Consider the case of an imputation procedure based on a single model (J = 0 and L = 1) with m(v i , β) = v i β.The linearized variable ψ k in (9) reduces to

(
iv) Let S * R = {i ∈ S * : r * i = 1}.Impute the bootstrap missing values in S * \ S * R by applying the same imputation method used for the original missing data.Compute the estimator t * M R based on observed and imputed values in the bootstrap sample S * .(v) Repeat Steps 1 to 4 a large number of times, M , to get t * y,1 , . . ., t * y,M and t * M R,1 , . . ., t * M R,M .

(
iv) The missing values in each sample were imputed by three types of imputation procedures: (a) an imputation based on a single imputation model; (b) a doubly robust imputation based on a single imputation model and a single nonresponse model; and (c) an imputation based on two imputation models.

Table 1 :
Monte Carlo percent relative bias and relative efficiency of t M R and

Table 2 :
Summary of the three scenarios

Table 3 :
Monte Carlo percent relative bias and relative efficiency of t M R and

Table 5 :
Monte Carlo percent relative bias and relative efficiency of t M R and

Table 6 :
Monte Carlo percent relative bias and relative efficiency of t M R and t * M R for the Pareto distribution

Table 8 :
Monte Carlo percent relative bias and relative efficiency of t M R and t * M R for four distributions