Accounting for Non‐ignorable Sampling and Non‐response in Statistical Matching

Data for statistical analysis are often available from different samples, with each sample containing measurements on only some of the variables of interest. Statistical matching attempts to generate a fused database containing matched measurements on all the target variables. In this article, we consider the use of statistical matching when the samples are drawn by informative sampling designs and are subject to not-missing-at-random non-response. The problem with ignoring the sampling process and the non-response is that the distribution of the data observed for the responding units can be very different from the distribution holding for the population data, which may distort the inference process and result in a matched database that misrepresents the joint distribution in the population. Our proposed methodology employs the empirical likelihood approach and is shown to perform well in a simulation experiment and when applied to real sample data.


Introduction
Statistical matching has become popular in recent years. Information on a set of variables of interest is often available in different micro databases, with each database containing only some of the variables, but with no joint observations on all the variables. For example, in Italy, reliable information on household income is provided by the Survey on Household Income and Wealth (SHIW), conducted by Banca d'Italia. On the other hand, information on consumption expenses is provided by the Household Budget Survey (HBS), run by the Italian National Institute of Statistics (ISTAT) (cf. Conti et al. 2017). This constitutes a serious problem because household data on income and expenditure are used by policymakers for analysing the impact of policy strategies. Statistical matching attempts to combine the data obtained from different, non-overlapping samples, drawn from the same target population. At a micro level, the main objective is to construct a synthetic (fused) data set, with joint observations on all the variables of interest. At a macro level, the main objective is the estimation of the joint population distribution of all the variables of interest.
Let A and B be two independent samples of sizes n_A and n_B, respectively, selected from a population of N independent and identically distributed (i.i.d.) records, generated from some joint probability (density) function (pdf) f_p(x, y, z; θ) of variables (X, Y, Z), indexed by a vector parameter θ, where p signifies the population model (the model holding for the population values). We suppose that the population is large, such that the samples A and B can be assumed to have no units in common. The statistical matching problem is that (X, Y, Z) are not jointly observed in the two samples: only (X, Y) are observed for the units in sample A, and only (X, Z) are observed for the units in sample B; see Rässler (2002) and D'Orazio et al. (2006b). Thus, the units in A have missing Z values, while the units in B have missing Y values. Because of the lack of joint information on all three variables, the joint pdf f_p(x, y, z; θ) is not directly identifiable, except under strong assumptions, which are generally hard to confirm. Several alternative approaches have been proposed in the literature to overcome the identification problem. The first (common) approach assumes conditional independence (CIA) between Y and Z given X; see, for example, Okner (1972). A second approach assumes the existence of external information. Relevant external information may be available in one of the following forms: (i) a sample C with joint observations on (X, Y, Z) (Singh et al., 1993) and (ii) proxy variables for Y and Z, as in Zhang (2015), where a range of statistical matching techniques are reviewed and developed for estimating the joint population pdf of categorical variables. Proxy variables, if sufficiently associated with Y or Z, can help in studying the relationship between Y and Z and, in particular, in verifying or refuting the CIA.
Empirical results in Zhang (2015) demonstrate that the use of proxy variables not only reduces the uncertainty associated with data fusion but also provides more accurate estimates of the target joint distribution. Notice, however, that the CIA cannot be tested from the samples A and B alone, and external information is often not available. (As discussed and illustrated in subsequent sections, the CIA can be tested indirectly by use of the estimated respondents' distribution resulting from this assumption.) A third approach proposed in the literature consists therefore of analysing the uncertainty regarding the joint distribution of (X, Y, Z). Under this approach, several alternative models for the joint distribution of (X, Y, Z), compatible with the distributions of (X, Y) and (X, Z) in the samples A and B, are considered, resulting in 'uncertainty intervals' for the joint pdf of all three variables and the target estimators derived from them. See, for example, Moriarity & Scheuren (2001), Rässler (2002) and D'Orazio et al. (2006a). Uncertainty in statistical matching in a non-parametric setting is considered in Conti et al. (2013, 2015). Zhang & Chambers (2019) describe a general approach for inference based on incomplete 2 × 2 tables (including the case of statistical matching and non-response), when assumptions required for validating a likelihood-based approach cannot be supported by the available data. The authors develop the concept of corroboration, as a measure of the statistical evidence in the observed data for the unknown parameter values, which is not based on likelihoods. For this, the authors compute intervals for each of the parameter values (rather than point estimates), without relying on any additional assumptions that can lead to pointwise identification of the joint distribution.
The interval corresponding to a maximum corroboration value identifies the parameter value that is the hardest to refute based on the observed data.
In practice, the independence assumption between sample measurements pertaining to different units is itself questionable when dealing with sample survey data. Often, the sample selection employs complex sampling designs that involve different inclusion probabilities, which could be related to the survey variables of interest, a situation known in the statistical literature as informative sampling. Informative sampling can invalidate the independence assumption and result in a distribution of the observed data that differs from the distribution holding in the population from which the sample is drawn. See Pfeffermann & Sverchkov (2009) for a discussion of the notion of informative sampling and a review of methods to handle this problem.
Statistical matching of complex sample surveys is studied by Rubin (1986), Renssen (1998) and Conti et al. (2016). Marella & Pfeffermann (2019) considered statistical matching under informative sampling designs, assuming complete response. However, in practice, not all the sampled units respond, and as is well known, response rates are steadily decreasing all over the world. Most of the approaches dealing with non-response assume that the missing data are missing at random (MAR; Little & Rubin, 1987). By this assumption, the response probabilities do not depend on the unobserved data, after conditioning on the observed data. In reality, the MAR assumption is often violated, and when the response probabilities are correlated with the target outcomes even after conditioning on the observed data (often, the model covariates), the missing data are not missing at random (NMAR non-response).
In this article, we consider the case where the sampling designs used to select the samples A and B are informative, and the non-response in the two samples is NMAR. As studied theoretically and illustrated in many articles, even informative sampling with complete response, or non-informative sampling but with NMAR non-response, already results in a different joint distribution of the observed data in the sample from the distribution of the same variables in the population from which the sample is taken. See, for example, Pfeffermann (2017). Not surprisingly and as illustrated later, ignoring the sampling and response processes in statistical matching (the focus of the present article) can result in severely biased estimators and a misrepresentative fused data set. To the best of our knowledge, no other article has been published so far, considering the dual effects of informative sampling and NMAR non-response in statistical matching. Our proposed methodology utilises the empirical likelihood (EL) approach.
In Section 2, we define more formally the statistical framework under consideration. Section 3 develops the EL in the statistical matching context under informative sampling designs and NMAR non-response, assuming the CIA. The proposed approach combines the EL with a parametric model for the response probabilities. In Section 4, the CIA is dropped, the uncertainty in statistical matching is introduced, and a procedure for choosing a population distribution from the class of plausible pdfs is described. Section 5 presents the results of a simulation study, aimed at assessing the performance of our proposed methodology. In Section 6, we apply the methodology to the SHIW and HBS samples mentioned in the Introduction. Section 7 contains a brief summary.

Statistical Matching under Non-ignorable Sampling and Non-response: Notation and Assumptions
Consider a finite population of N units, whose measurements D_p = {(x_i, y_i, z_i); i = 1, ..., N} are independent realisations from a distribution with pdf f_p(x, y, z; θ). Let V_{p,A}, V_{p,B} be sets of population values of design variables used for selecting two non-overlapping samples, A and B, respectively. Some or all of the variables (X, Y, Z) might be included among the design variables. We assume that D_p, V_{p,A}, V_{p,B} are realisations of a random process, implying that the first-order inclusion probabilities {π_{i,A}, π_{i,B}} may be viewed as random as well. Denote by w_{i,A} = 1/π_{i,A}, w_{i,B} = 1/π_{i,B} the (base) sampling weights. We assume that under complete response, the data available to the analyst consist of the samples A = {(x_i, y_i, w_{i,A})} of size n_A and B = {(x_i, z_i, w_{i,B})} of size n_B, but not the population values of the design variables, which are known to the persons drawing the samples but generally not to the persons analysing the sample data. Following Marella & Pfeffermann (2019), we assume that the sampling designs for selecting the two samples are informative for the corresponding joint population distribution, in the sense that the sample selection probabilities are correlated with at least some of the variables (X, Y, Z), implying that even if all three variables had been observed in the two samples, the joint sample pdf f_S(x, y, z) of the sample data would differ from the corresponding population pdf f_p(x, y, z), for S = A, B. In this article, we assume that in addition to the use of informative sampling designs, the samples A and B are subject to not-missing-at-random (NMAR) unit non-response, in the sense that the probability to respond depends on the study variables. The data available to the analyst therefore consist of the sets of responding units in A (R_A) and B (R_B), respectively.
Consequently, the joint pdf of the observed data, f_{R_S}(x, y, z), differs from the sample pdf f_S(x, y, z) under complete response and from the population pdf f_p(x, y, z), S = A, B. Here, for convenience, we omit the parameters indexing the three distributions. Notice that whereas the sampling probabilities are generally known, and thus can be used to account for informative sampling, the response probabilities are practically unknown and need to be modelled. Pfeffermann & Sikov (2011) review approaches proposed in the literature to deal with NMAR non-response.
Accounting for informative sampling but with complete response for statistical matching is considered in Marella & Pfeffermann (2019). The authors applied a parametric approach, basing the inference on the sample distribution, that is, the distribution holding for the observed sample data. However, as discussed in Pfeffermann & Landsman (2011), the maximisation of sample likelihoods can be complicated numerically and result in unstable estimates, depending on the population model and the model assumed for the sample selection probabilities, given the observed data. For this reason, and in order to account also for NMAR non-response, we propose in Section 3 the use of the EL, which enables estimating the parameters governing the sampling and response models, without specifying the corresponding population model.

Statistical Matching under Non-ignorable Sampling and Non-response by EL
In Section 3.1, the statistical framework under the EL approach is briefly described. The EL under informative sampling is introduced in Section 3.2 and extended to NMAR non-response in Section 3.4. The generation of a fused data set under the EL approach is described in Section 3.3.

Statistical Framework under the EL Approach
The use of the EL for analysing complex survey data has its origins in the pioneering article by Hartley & Rao (1968), where an estimator based on the multinomial function under simple random sampling is proposed. The use of EL gained increasing interest in general statistical contexts, following the work of Owen (1990, 1991, 2001, 2013). See also Qin & Lawless (1994) and the review article by Chen & Van Keilegom (2009). The EL combines the robustness of non-parametric methods with the efficiency of the likelihood approach. It is essentially the likelihood of the multinomial distribution employed by Hartley & Rao (1968), where the unknown parameters are the point masses assigned to the distinct sample values. Chen & Qin (1993) proposed an EL approach for using auxiliary information in simple random sampling without replacement. Chen & Sitter (1999) extended the method to unequal probability sampling, applying a 'pseudo-empirical likelihood approach'. Wu (2004) used pseudo-EL methods to combine information from two independent surveys and obtained an estimator for a mean, which is asymptotically equivalent to a GREG-type estimator. The EL approach facilitates the use of calibration constraints. See Remark 4 for the form of the constraints and Chaudhuri et al. (2010) for details of the constrained estimation procedure and the asymptotic properties of the resulting estimators. Most importantly, the use of this approach does not require specifying the population model and is thus more robust and often easier to implement.
In the present section, we assume the CIA, but this assumption is dropped in the following sections. We also assume that X can take K distinct values, with probabilities p_k^X = P(X = x_k), ∑_{k=1}^K p_k^X = 1, whereas Y and Z are continuous. The 'matching variable' X, measured in both samples, might be a stratification variable and/or a socio-demographic variable. Socio-demographic characteristics are often related to other variables of interest.
The basic idea of the EL approach is to approximate the population distribution by a multinomial model, whose support is given by the empirical observations. Let (x_i, y_i, z_i) define the values associated with unit i, and denote by p_i^{Y|X}, p_i^{Z|X} the conditional point masses of Y and Z given X, each with the support observed in the samples. Then, under the CIA, the joint population multinomial probability of unit i is given by p_i^{XYZ} = p_k^X · p_i^{Y|X} · p_i^{Z|X}, for x_i = x_k.
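The CIA factorisation above can be sketched in code as follows. This is a minimal illustration with discrete point masses; all the probability values are invented for the example, not estimates from the article.

```python
# A minimal sketch of the CIA decomposition used by the EL: the joint point
# mass of unit i factors as p_i^XYZ = p_k^X * p_i^{Y|X} * p_i^{Z|X} when
# x_i = x_k. All point masses below are illustrative.
p_x = {1: 0.6, 2: 0.4}                      # marginal pmf of X
p_y_given_x = {1: {0.0: 0.5, 1.0: 0.5},     # point masses of Y | X = k
               2: {1.0: 0.3, 2.0: 0.7}}
p_z_given_x = {1: {5.0: 1.0},               # point masses of Z | X = k
               2: {5.0: 0.2, 6.0: 0.8}}

def joint_mass_cia(x, y, z):
    # Under the CIA, Y and Z are conditionally independent given X.
    return p_x[x] * p_y_given_x[x].get(y, 0.0) * p_z_given_x[x].get(z, 0.0)
```

Summing the joint masses over the full support returns 1, as required of a multinomial model.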

The EL Approach under Informative Sampling
In what follows, we define the EL in the statistical matching context under informative sampling. Let I_i^A be the sample indicator, taking the value 1 if unit i is drawn to the sample A and 0 otherwise. For i ∈ A_k, the sample-model probabilities p_{i,A}^{Y|X} and p_{k,A}^X are defined by (2) and (3), respectively. Under informative sampling, the observed outcomes are no longer representative of the population outcomes, and the sample models (2) and (3) are different from the corresponding population models p_i^{Y|X}, p_k^X. Nonetheless, as shown and illustrated in Pfeffermann et al. (1998), if the population values are independent under the population model (see the beginning of Section 2), then under mild conditions they are also asymptotically independent under the sample model, when the sample size remains fixed but the population size increases. This permits approximating the sample likelihood by the product of the sample pdfs over the corresponding sample observations. Hence, for sufficiently large populations, the sample EL (ESL) based on the observed data in A is given by (4), where n_{k,A}^X is the size of A_k. An analogous expression to (4) holds for the ESL based on the observed data in B, and the ESL based on the sample A ∪ B is given by (5). By (2), (3) (with analogous expressions for the sample B) and (5), the log-likelihood based on A ∪ B is given by (6). Notice that the sampling probabilities in A and B may depend on many unobserved variables, and yet, by definition of the sample pdf, one only needs to model the probabilities P(I_i^A = 1 | x_i, y_i) and P(I_i^B = 1 | x_i, z_i). These probabilities can be estimated outside the likelihood by regressing the sample weights w_{i,A} (w_{i,B}) against (x_i, y_i) ((x_i, z_i)), using the observed data in A and B, respectively, or non-parametrically, as considered in Feder & Pfeffermann (2019). The resulting estimates can then be inserted into the expressions for τ_{i,A}^{XY} and τ_{i,B}^{XZ}, with τ_{k,A}^X and τ_{k,B}^X defined by (1). Moreover, as discussed and illustrated in Pfeffermann (2011), the resulting sample models can be tested.
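The weight regression described above can be sketched as follows. The exponential regression form (fitted by least squares on the log scale) and all the simulated data are assumptions for illustration; exponential regression is also the form used later in the simulation study.

```python
import numpy as np

# Sketch: estimate E_A(w_i | x_i, y_i) by an exponential regression of the
# sampling weights on (x, y), i.e. ordinary least squares on log(w).
# Data and coefficients are simulated for illustration only.
rng = np.random.default_rng(0)
n = 500
x = rng.integers(1, 5, size=n).astype(float)
y = rng.normal(loc=x, scale=1.0)
w = np.exp(0.5 + 0.2 * x + 0.1 * y + rng.normal(0.0, 0.05, size=n))

design = np.column_stack([np.ones(n), x, y])
beta, *_ = np.linalg.lstsq(design, np.log(w), rcond=None)

def expected_weight(xi, yi):
    """Plug-in estimate of E(w | x, y), to be inserted into the tau weights."""
    return float(np.exp(beta[0] + beta[1] * xi + beta[2] * yi))
```

The fitted expectations play the role of the estimates inserted into τ_{i,A}^{XY}; an analogous fit with (x, z) would serve the sample B.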
The unknown parameters in (6) are thus the probabilities {p_k^X}, {p_i^{Y|X}} and {p_i^{Z|X}}. In the statistical matching context, different approaches can be used for maximisation of the likelihood. For convenience, we describe the approaches for the case where no variables with known sample values and corresponding population means exist, which, as noted before, can be used for calibration. When such variables exist, the calibration equations are imposed to constrain the maximisation process. See Remark 4.

Remark 1. In practice, the covariates contained in the population model need not be the same as the covariates contained in the model of the conditional sample inclusion probabilities. However, to simplify the presentation, we assume for convenience that the same covariates appear in the two models, or alternatively that x_i defines the union of the two sets of covariates.

Estimating the unknown probabilities separately from the samples A and B
Noting that the likelihood (5) can be factorised into a likelihood based only on the sample A and a likelihood based only on the sample B, the unknown probabilities can be estimated separately from the two samples. This implies two sets of estimates for the probabilities {p_k^X}, which need to be harmonised. See Renssen (1998) and below. The EL estimators of the unknown probabilities are obtained by maximising the log-likelihood (6), subject to the constraints (7). Following Kim (2009), Chaudhuri et al. (2010) and Marella & Pfeffermann (2019), the estimators are given by (8), where p̂_{k,A}^X and p̂_{k,B}^X are the estimates of p_k^X obtained from the samples A and B, respectively. Harmonisation of the estimates p̂_{k,A}^X, p̂_{k,B}^X into a unique estimate p̂_k^X can be achieved by use of a linear combination of the two estimates, as in (9). Alternatively, one may choose the value λ minimising the variance of (9). To this end, variance estimates of p̂_{k,A}^X, p̂_{k,B}^X can be computed by resampling methods for finite populations, as proposed by Conti et al. (2020). The methods use a two-stage procedure. In the first stage, a pseudo-population, which can be viewed as a prediction of the target finite population, is constructed using the sampling weights. In the second stage, samples are drawn from the pseudo-population using the same sampling designs used for drawing the original samples. The procedure is also applicable for the case of informative sampling designs.

Remark 3. Chen & Sitter (1999) consider a pseudo-empirical likelihood (PEL) approach, which in the context of statistical matching implies the following likelihood.
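The harmonisation step can be sketched as follows. For independent estimates, the variance-minimising λ is the standard inverse-variance weight; the variance inputs below are illustrative stand-ins for the resampling-based estimates just described.

```python
# Harmonise two independent estimates of p_k^X by the linear combination
# p_k = lam * p_kA + (1 - lam) * p_kB; for independent estimates, the
# variance of the combination is minimised at lam = V_B / (V_A + V_B).
def harmonise(p_a, p_b, var_a, var_b):
    lam = var_b / (var_a + var_b)
    return lam * p_a + (1.0 - lam) * p_b, lam

# Illustrative inputs: sample A is noisier, so it receives the smaller weight.
p_hat, lam = harmonise(p_a=0.42, p_b=0.38, var_a=0.004, var_b=0.002)
```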

Remark 2. Another approach consists of replacing p_{k,A}^X and p_{k,B}^X in the likelihood by a single common set of probabilities {p_k^X}.
Notice that in (10), the two samples are not considered separately. It follows from Chen & Sitter (1999) that in the absence of calibration constraints, the estimates maximising the likelihood (11) have the same form as in (8), but with the base sampling weights replacing the weights τ. The basic difference between the two approaches is that in Chen & Sitter (1999), the likelihood is with respect to the population distribution, whereas the likelihood in (5) is with respect to the sample distribution.

File Concatenation for Estimation of the Probabilities p_k^X
Rubin (1986) proposed to estimate the population probability distribution of X by computing concatenated weights for the sample A ∪ B, as given in (12). The basic assumption underlying (12) is that the probability of a unit being drawn to both samples is negligible, such that P(I_i^A I_i^B = 1) ≈ 0. This is generally true when the two samples are independent, with small sampling fractions. Define n_{k,A∪B}^X = n_{k,A}^X + n_{k,B}^X. With this notation, the ESL based on the concatenated sample is given by (13). The ESL (13) is maximised under the constraints (7), yielding the estimators (14).
Remark 4. When population means of variables measured in the sample A and/or in the sample B are known, they can be added to the constraints of the ESL. Calibration constraints on these means may then be imposed, depending on data availability. In the simulation study of Section 5 and the application to real sample data in Section 6, we added the constraint based on the known population mean of X (see Section 5).
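The concatenated-weight construction can be sketched as follows. The inclusion probabilities and category values are illustrative, and the weight 1/(π_A + π_B) rests on the negligible-overlap approximation P(i ∈ A ∪ B) ≈ π_{i,A} + π_{i,B} stated above.

```python
# File-concatenation sketch: each unit observed in A or B receives the
# weight 1 / (pi_A + pi_B), and p_k^X is estimated by the normalised
# weighted count of category k. Inputs are illustrative.
def concat_estimate(x_vals, pi_a, pi_b):
    weights = [1.0 / (pa + pb) for pa, pb in zip(pi_a, pi_b)]
    total = sum(weights)
    cats = sorted(set(x_vals))
    return {k: sum(w for xv, w in zip(x_vals, weights) if xv == k) / total
            for k in cats}

p_hat = concat_estimate([1, 1, 2, 2], [0.1, 0.1, 0.2, 0.2], [0.1, 0.1, 0.2, 0.2])
```

Units with smaller inclusion probabilities receive larger weights, so category 1 here dominates the estimate despite equal sample counts.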

Generation of a Fused Data Set
1 Generate ñ observations x̃_i, taking values x_1, …, x_K, from the estimated probability function p̂_k^X.
2 For each observation x̃_i, draw a value ỹ_i from the estimated probability function p̂_i^{Y|X}.
3 Apply a similar procedure for drawing values z̃_i from the estimated probability function p̂_i^{Z|X}.
The consistency of the estimators of the model parameters guarantees that for sufficiently large sample sizes n A and n B , the fused data set can be considered as being generated from the joint population pdf.
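The generation steps can be sketched as follows, with discrete, illustrative conditional distributions standing in for the estimated point masses.

```python
import random

# Sketch of fused-data generation: draw x from the estimated pmf of X, then
# draw y and z independently given x (CIA) from the estimated conditional
# distributions. All distributions below are illustrative.
random.seed(1)
p_x = {1: 0.5, 2: 0.5}
p_y_given_x = {1: {10: 0.7, 11: 0.3}, 2: {12: 1.0}}
p_z_given_x = {1: {0: 1.0}, 2: {1: 0.4, 2: 0.6}}

def draw(pmf):
    return random.choices(list(pmf), weights=list(pmf.values()), k=1)[0]

def fuse(n):
    records = []
    for _ in range(n):
        x = draw(p_x)
        records.append((x, draw(p_y_given_x[x]), draw(p_z_given_x[x])))
    return records

fused = fuse(200)
```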

Use of the EL under Non-ignorable Sampling and Non-response
In what follows, we assume that in addition to informative sampling, the samples A and B are subject to NMAR non-response, by which the response probabilities depend in some stochastic way on the study variables of interest. Let R_i^A define the response indicator, taking the value 1 if sample unit i ∈ A responds and 0 otherwise. Let R_A denote the set of responding units in A, and r_A the size of R_A. The response process is assumed to be independent between units. This way, the set of respondents can be viewed as the result of a two-phase sampling process: (i) a sample A is selected from the finite population with known inclusion probabilities π_{i,A}; (ii) the response set R_A is selected from A with unknown response probabilities P(R_i^A = 1 | I_i^A = 1, x_i, y_i). The model holding for the outcomes of the responding units is given by (16), where τ_{k,A}^X and τ_{i,A}^{XY} are defined in (1). In (16), the sample model p_{i,A}^{Y|X} and the model assumed for the response probabilities define the model holding for the outcomes of the responding units. Notice that unless the response probabilities are independent of the outcomes, the respondents' model (16) is different from the sample model p_{i,A}^{Y|X} defined by (2), which in turn is different from the corresponding population model under informative sampling. Specifically, the respondents' model is a function of the corresponding population model, the conditional expectations of the sampling weights, E_A(w_{i,A} | x_i, y_i), and the response probabilities. Assuming that the response is independent of the sample selection, E_A(w_{i,A} | x_i, y_i) can be estimated by regressing w_{i,A} against (x_i, y_i), using the observed data in A, and similarly for the sample B. See Section 3.2. Clearly, if the response probabilities depend in some way on the sample selection, say, higher non-response rates for units with higher sampling probabilities, the expectations E_A(w_{i,A} | x_i, y_i) need to be estimated in some more elaborate manner. See also the concluding remarks in Section 7.

Remark 6.
Under MAR non-response, the response probability does not depend on the target outcome variable after accounting for the model covariates, such that the respondents' model in (16) coincides with the sample model. With straightforward modification of the notation, similar expressions to (15)-(18) are obtained for the model holding for the responding units in B. Thus, the empirical respondents' likelihood (ERL) for the sample A ∪ B is given by (19).

Remark 7. The likelihood (19) only depends on the observed data for the responding units.
The response probabilities in (15) and (16), defining the probabilities in (19), are unknown and need to be estimated from the available data. Because no 'response weights' are known, parametric models for the response probabilities in the two samples need to be postulated, for example, P(R_i^A = 1 | x_i, y_i) = g_A(x_i, y_i; γ_A) and P(R_i^B = 1 | x_i, z_i) = g_B(x_i, z_i; γ_B), for some functions g_A, g_B with unknown parameters γ_A, γ_B. Here again, we assume for convenience that the response probabilities depend on the same covariates as in the sample model. See Remark 1. Modelling the response probabilities by the logit or probit functions is common, but notice that in our case the probabilities also depend on the study variables, which is different from the familiar 'propensity scores' approach, under which the response probabilities only depend on the observed covariates, as commonly assumed under MAR non-response. The unknown vector parameters γ_A, γ_B indexing the response models in the two samples are then estimated as part of the maximisation of the likelihood. Thus, one needs to maximise the likelihood (19) with respect to a larger set of parameters, subject to constraints holding for all k and i. Notice the difference from the constraints in (7). In order to express (19) as only a function of the three sets of probabilities, we adopt the profile likelihood approach. Suppose that the three sets of probabilities are 'known'.
(In practice, we use some initial estimates; see Remark 8.) The profile likelihood function is then defined, and its maximisation with respect to γ_A, γ_B yields the estimates (23). Next, we substitute the estimates (23) into the likelihood (19) and maximise the resulting likelihood with respect to the unknown sets of probabilities, yielding (24). This completes the first iteration of the estimation process. In the second iteration, we consider the estimates in (24) as 'known', repeat the two maximisation steps, and iterate until convergence.

Remark 9. One of the reviewers of the present article proposed an EM algorithm for maximisation of the ERL. We hope to investigate the properties of this algorithm in the future. See also the concluding remarks in Section 7.
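A logit response model of the kind postulated above can be sketched as follows. The coefficient values are illustrative, not estimates, and the dependence of the probability on the outcome y itself is what makes the non-response NMAR.

```python
from math import exp

# NMAR response probability for sample A: a logit in (x, y), where the
# coefficients gamma = (g0, gx, gy) would be estimated inside the ERL
# maximisation. Values below are illustrative only.
def response_prob(x, y, gamma=(-0.5, 0.3, -0.4)):
    g0, gx, gy = gamma
    eta = g0 + gx * x + gy * y
    return 1.0 / (1.0 + exp(-eta))
```

With a negative coefficient on y, units with larger outcomes respond less often, which is exactly the situation a MAR analysis would misrepresent.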

Uncertainty in Statistical Matching under Informative Sampling and NMAR Non-response
So far, we have assumed that the joint population pdf satisfies the CIA. Clearly, the CIA may not hold in practice, and having no joint measurements for the variables of interest precludes distinguishing between different plausible distributions. In Section 4.1, we drop the CIA and define instead a class of plausible joint pdfs for the outcome variables of interest. Some measures quantifying the size of the class are introduced. In Section 4.2, a procedure for choosing a pdf from the class of plausible pdfs is described.

Measuring Uncertainty in Statistical Matching
In statistical matching, estimation of the joint pdf of (X, Y, Z) requires the estimation of (i) the marginal pdf of X and (ii) the joint conditional pdf of (Y, Z) given X. Denote by F_p(y, z | x_k) the joint cumulative population distribution function (cdf) of (Y, Z) given X = x_k, and by F_p(y | x_k), G_p(z | x_k) the corresponding marginal cdfs. Notice that, without additional assumptions, the only valid statement regarding F_p(y, z | x_k) is that it lies in the set Ω_p^k of all joint distributions having the marginal cdfs F_p(y | x_k) and G_p(z | x_k), so that

max{F_p(y | x_k) + G_p(z | x_k) − 1, 0} ≤ F_p(y, z | x_k),   (25)
F_p(y, z | x_k) ≤ min{F_p(y | x_k), G_p(z | x_k)}.   (26)

The bounds (25) and (26) are the Fréchet bounds; see Nelsen (1999). A natural pointwise uncertainty measure is the length of the interval defined by (25) and (26), and the conditional measure (27) averages this length with respect to the weight function dF_p(y | x_k) dG_p(z | x_k). Weight functions different from dF_p(y | x_k) dG_p(z | x_k) can be used instead. Our choice has a clear interpretation, with larger weights assigned to intervals with larger marginal densities. The measure in (27) is easily estimated from the sample data; see (29).
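The Fréchet bounds and the pointwise interval length entering the uncertainty measure can be sketched as follows; the marginal cdf values are supplied directly here, whereas in practice they would be estimated from the responding units as in (29).

```python
# Fréchet-bounds sketch: for marginal cdf values F = F_p(y | x_k) and
# G = G_p(z | x_k), any joint cdf satisfies
#   max(F + G - 1, 0) <= F_p(y, z | x_k) <= min(F, G),
# and the pointwise uncertainty is the length of this interval.
def frechet_bounds(F, G):
    lower = max(F + G - 1.0, 0.0)
    upper = min(F, G)
    return lower, upper

def pointwise_uncertainty(F, G):
    lo, hi = frechet_bounds(F, G)
    return hi - lo
```

For example, at F = G = 0.5, the interval is [0, 0.5], the widest possible, which is consistent with the maximum average uncertainty of 1/6 cited below.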
An overall uncertainty measure (28) is defined by the average of the conditional measures (27) over the distribution of X. As shown in Conti et al. (2012), the value Δ_p^k = 1/6 of the conditional uncertainty measure (27) represents the maximum uncertainty when no external information beyond knowledge of the marginal cdfs F_p(y | x_k) and G_p(z | x_k) is available. Consequently, the maximum unconditional uncertainty measure (28) also equals 1/6. The measure (27) can be estimated by averaging the r_{k,A}^X × r_{k,B}^X pointwise uncertainties computed from the responding units, as in (29). The bounds (25) and (26) can be narrowed when additional information is available. The reduction in uncertainty due to the use of external information is investigated in Conti et al. (2015, 2016), where, conditionally on X = x_k, constraints of the form a_k ≤ c_k(y, z) ≤ b_k, with c_k(y, z) defining a monotone function of y (z) for each z (y), are added. The class of plausible pdfs is now given by (31). Hereafter, each bivariate pdf in the class (31) is referred to as a plausible matching pdf for (Y, Z), conditionally on X = x_k. For example, Okner (1972) imposed the constraint Y ≤ Z. With this constraint, the Fréchet bounds (25) and (26) become (32) and (33) (see Conti et al., 2015). Notice the difference from (25) and (26), when no additional information is available. The corresponding uncertainty measures, Δ_{p,c}^k, Δ_{p,c}, are defined similarly to (27) and (28) but with respect to the bounds (32) and (33). By choosing a matching distribution from the class (31), the uncertainty measure Δ_{p,c} provides an upper bound for the matching error. The statistical matching problem consists therefore of choosing a matching distribution from the class (31). Conti et al. (2016) proposed a procedure for choosing a pdf in the class (31), based on iterative proportional fitting (IPF; Bishop et al., 1975). The procedure consists of the following steps:

Choosing a Matching Distribution
Step 1: Discretise Y and Z by grouping their ascending values into pre-defined classes. Conditionally on X = x_k, the range of Y is divided into h_k adjacent intervals I^{Y|x_k}, and the range of Z is divided similarly into adjacent intervals.

Step 2: Define the contingency table {C_k} for the discretised (Y, Z), with margins obtained from the MLE of the ERL (19).
Step 3: Once the contingency table {C_k} has been defined, the midpoints (y_{d,h}, z_{d,g}) are checked to identify cells in {C_k} that do not satisfy the constraint a_k ≤ c_k(y_{d,h}, z_{d,g}) ≤ b_k. These cells define structural zeroes in {C_k}. The IPF initial cell probabilities are defined proportionally to δ_{hg}, where δ_{hg} = 1 for cells not containing structural zeroes and δ_{hg} = 0 otherwise.
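The IPF step with structural zeroes can be sketched as follows. The 2 × 2 table size, the zero pattern, and the target margins are illustrative; in the procedure above, the margins would come from the MLE of the ERL.

```python
import numpy as np

# IPF with a structural zero: cells violating the constraint are fixed at 0
# in the starting table and stay 0 throughout; rows and columns are then
# alternately rescaled to match the estimated marginal probabilities.
def ipf(start, row_targets, col_targets, iters=200):
    t = start.astype(float).copy()
    for _ in range(iters):
        rows = t.sum(axis=1)
        t = t * (row_targets / np.where(rows == 0.0, 1.0, rows))[:, None]
        cols = t.sum(axis=0)
        t = t * (col_targets / np.where(cols == 0.0, 1.0, cols))[None, :]
    return t

start = np.array([[1.0, 1.0],
                  [0.0, 1.0]])              # cell (1, 0) is a structural zero
fitted = ipf(start, np.array([0.5, 0.5]), np.array([0.3, 0.7]))
```

Because multiplicative updates preserve zeroes, the forbidden cell stays empty while the fitted table reproduces both sets of margins.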
A fused data set for (X, Y, Z) is constructed from the estimated matching distribution obtained at the end of the iterations as follows: (i) generate ñ observations x̃_i from the estimated distribution of X, taking values x_1, …, x_K, and let ñ_k^X be the number of observations with x̃_i = x_k; (ii) for each x_k, draw independently ñ_k^X pairs (y_{d,h}, z_{d,g}) from the estimated matching distribution.

Description of Simulation Experiment
In order to evaluate the performance of our proposed methodology, we performed a simulation experiment consisting of the following steps:
Step 1. Generate a population of N = 10,000 values x_i, taking the values k = 1, 2, 3, 4 with probabilities p^X = (p_1^X, p_2^X, p_3^X, p_4^X) = (0.4, 0.1, 0.3, 0.2). For each x_i, generate independently values y_i and z_i from the following distributions: (i) y_i | x_i is normal, with mean and variance depending on x_i; (ii) z_i | x_i is likewise normal. Thus, the CIA holds in the population, and cor_{YZ}^{CIA} = cor_{XY} × cor_{XZ} = 0.27.
Remark 10. In Section 5.3 and in the application in Section 6 with real sample data, we no longer assume the CIA and illustrate the theory of Section 4.
Step 2. Draw independently samples A and B from the population generated in Step 1, by use of Poisson sampling with expected sample sizes E(n_A) = E(n_B) = 3000 and selection probabilities governed by the sampling parameters κ_A, κ_B (specified later).
Step 3. Generate the sets of responding units in the two samples with response probabilities governed by the parameters γ_A = (γ_{x,A}, γ_{y,A}) and γ_B = (γ_{x,B}, γ_{z,B}) of the response models acting in the samples A and B, respectively (specified later). Since the response probabilities in A depend on y and those in B depend on z, the non-response is clearly NMAR.
In what follows, we assume knowledge of the mean μ_X = ∑_{k=1}^4 p_k^X · k of X, hereafter the calibration constraint, abbreviated C-C. See Remark 4.
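Steps 1 and 2 above can be sketched as follows. The selection-probability rule is an illustrative informative choice (increasing in x), standing in for the article's exact formula, which depends on κ_A, κ_B.

```python
import numpy as np

# Step 1: population with P(X = k) = (0.4, 0.1, 0.3, 0.2); Step 2: Poisson
# sampling, each unit included independently with its own probability pi_i,
# here increasing in x_i so that the design is informative for X.
rng = np.random.default_rng(42)
N = 10_000
x = rng.choice([1, 2, 3, 4], size=N, p=[0.4, 0.1, 0.3, 0.2])
pi = 0.3 * x / x.mean()              # expected sample size = 0.3 * N = 3000
pi = np.clip(pi, 0.0, 1.0)
sample = np.where(rng.random(N) < pi)[0]
```

Under this rule, units with large x are over-sampled, so the unweighted sample mean of x exceeds the population mean, the signature of an informative design.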
Scenario 1: The sampling designs and the non-response are ignored in the estimation process; the likelihood is maximised under the constraints (7) and the C-C. The estimates of p_{Xk} obtained from the two samples are harmonised according to (9), with {p̂_{Xk,1}} the estimated population pdf under this scenario.
Scenario 2: All the sampled units respond, but the informative sampling designs are taken into account in the estimation process. The ESL (6) is maximised subject to the constraints (7) and the C-C.
Scenario 3: The informative sampling designs and the NMAR non-response are both accounted for in the estimation process. The ERL (19) is maximised under the constraints (22) and the C-C. The response is assumed independent of the sample selection, and the probabilities P(I_{A,i} = 1 | x_i, y_i) are estimated by regressing the sampling weights w_{i,A} against (x_i, y_i), using the observed data. A similar procedure is applied for the sample B. As in Scenario 2, we used exponential regressions, with {p̂_{Xk,3}} the estimated population pdf under this scenario. The two estimates of p_{Xk} are harmonised as under Scenario 1.
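Estimating the conditional inclusion probabilities from the weights can be sketched as below. As a simple stand-in for a full exponential (GLM) regression, the model E(w | x, y) = exp(b0 + b1·x + b2·y) is fitted by ordinary least squares on log(w); the data are hypothetical and, for clarity, noiseless, so the fit is exact:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical observed data for sample A: covariates, outcome, and the
# design weights w_i = 1 / pi_i carried by the sampled units.
n = 2_000
x = rng.choice([1, 2, 3, 4], size=n, p=[0.4, 0.1, 0.3, 0.2])
y = rng.normal(2.0 * x, 1.0)
pi = 1.0 / np.exp(0.3 + 0.2 * x + 0.1 * y)   # "true" probabilities (unknown)
w = 1.0 / pi

# Exponential regression of the weights on (x, y), fitted here by OLS on
# log(w): E(w | x, y) = exp(b0 + b1 * x + b2 * y).
X = np.column_stack([np.ones(n), x, y])
beta, *_ = np.linalg.lstsq(X, np.log(w), rcond=None)

# Estimated conditional inclusion probabilities: P(I = 1 | x, y) = 1 / E(w | x, y).
pi_hat = 1.0 / np.exp(X @ beta)
```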
Different sampling parameters (κ_A, κ_B) and response parameters (γ_A, γ_B) are considered, thus distinguishing between informative and non-informative samples and different NMAR non-response models. We repeated Steps 2 and 3 for each scenario and each combination of the parameters (κ_A, κ_B) and (γ_A, γ_B) M = 400 times.

Simulation Results When the Population Distribution Satisfies the CIA
We begin by studying the effect of ignoring the informative sampling mechanisms used for drawing the samples A and B. To this end, we estimated for each of the 400 samples the probabilities p̂_{Xk,h} under Scenarios 1 and 2 (h = 1, 2). Next, we computed the means of the p̂_{Xk,h} and their variance-covariance (V-C) matrix, but only for k = 1, 2, 3, because the probabilities and their estimates sum to 1. In order to evaluate the overall performance of the estimators, we use the Hotelling statistic T² = M (p̄̂ − p_X)′ V̂⁻¹ (p̄̂ − p_X), where p̄̂ is the mean vector of the estimated probabilities over the 400 samples and V̂ is the empirical V-C matrix of the estimates.
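A Hotelling-type statistic of this form can be computed from the replicated estimates as follows; the replicates below are simulated multinomial proportions, standing in for the 400 estimated probability vectors:

```python
import numpy as np

rng = np.random.default_rng(5)

p_true = np.array([0.4, 0.1, 0.3, 0.2])
M = 400

# Hypothetical replicated estimates of (p_X1, ..., p_X4): one multinomial
# proportion vector per simulated sample.
p_hat = rng.multinomial(3_000, p_true, size=M) / 3_000

# Drop the last category: the probabilities sum to 1, so the full 4x4
# variance-covariance matrix would be singular.
d = p_hat[:, :3] - p_true[:3]
d_bar = d.mean(axis=0)              # mean deviation vector
V = np.cov(d, rowvar=False)         # empirical V-C matrix (3x3)

# Hotelling statistic: M * d_bar' V^{-1} d_bar.
T2 = M * d_bar @ np.linalg.solve(V, d_bar)
```

Under a correctly specified estimator, T² behaves approximately like a chi-squared variable with 3 degrees of freedom, so large values signal systematic bias.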
Table 2 shows the KSd_{Y|xk}(p; h) distances for x_k = 1, 2, 3, 4. When the samples are drawn non-informatively, all the distances are very small. When κ_A = κ_B = (0.25, 0.25), the distance measures are much larger, and they increase further when κ_A = κ_B = (0.5, 0.5). Notice that for each x_k = 1, 2, 3, 4, KSd_{Y|xk}(p; 1) is much larger than KSd_{Y|xk}(p; 2), because under Scenario 2 we account for the informative sampling designs. We also observe that the KSd_{Y|xk}(p; h) distances for x_k = 1, 2 are much larger than the corresponding distances for x_k = 3, 4. This result is explained by the fact that the mean of the inclusion probabilities increases with x_k, from 0.10 for x_k = 1 to 0.20 for x_k = 2, 0.39 for x_k = 3 and 0.62 for x_k = 4 when κ_A = κ_B = (0.25, 0.25), with similar means for κ_A = κ_B = (0.5, 0.5). Thus, the informativeness of the sampling design reduces as X increases. Similar results (not reported) are obtained when estimating the population cdf of Z|X.
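A Kolmogorov–Smirnov distance between two empirical cdfs, as used for the KSd measures, can be sketched as follows; the population and the informatively selected sample below are hypothetical, chosen so that the selection over-represents large y:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical population Y values and an informative selection that
# favours large y, so the sample cdf is biased relative to the population.
y_pop = rng.normal(0.0, 1.0, size=10_000)
pi = 1.0 / (1.0 + np.exp(-y_pop))            # selection favours large y
keep = rng.random(y_pop.size) < 0.3 * pi
y_smp = y_pop[keep]

def ks_distance(a, b):
    """sup_t |F_a(t) - F_b(t)| over the pooled evaluation points."""
    grid = np.sort(np.concatenate([a, b]))
    fa = np.searchsorted(np.sort(a), grid, side="right") / a.size
    fb = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return np.abs(fa - fb).max()

ksd = ks_distance(y_smp, y_pop)   # large: the sample cdf is shifted upward
```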
Next consider Scenario 3, by which, in addition to informative sampling, the samples A and B are subject to NMAR non-response. Figure 1 exhibits the population pdf and the kernel density estimates of the sample pdf with full response, the respondents' pdf and the estimated population pdf of Y | x_k = 2, for one of the 400 samples A, for the case κ_A = (0.5, 0.5), γ_A = γ_B = (0.05, 0.1). For selecting the bandwidth of the kernel estimates, we followed Sheather & Jones (1991). Evidently, the sample pdf is different from the population pdf due to the informative sampling, and the respondents' pdf is different from the sample pdf because of the non-response. Notice that the estimated population pdf is the closest to the population pdf. Similar results (not reported) are obtained for the pdfs of Y|x_k, x_k = 1, 3, 4 and Z|x_k, x_k = 1, 2, 3, 4. Table 3 shows how, by accounting for the sampling and response effects under Scenario 3, we are able to fit the population model, using the same sample as in Figure 1. For this, we use the KS test statistic, with critical values computed by parametric bootstrap, as established theoretically by Babu & Rao (2004) and applied by Pfeffermann & Landsman (2011). Specifically, we generated B = 500 bootstrap samples from the estimated model, re-estimated for each sample the unknown model parameters, computed the KS statistic with the estimated parameters and then obtained the critical value at the α = 0.05 level from the resulting empirical distribution of the KS statistics. Table 3 reports the KS statistic of the estimated pdf of Y|x_k (x_k = 1, 2, 3, 4) for the sample in Figure 1 and the corresponding critical value computed by the parametric bootstrap.
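The parametric-bootstrap critical value can be sketched for a simple normal model; the data and the fitted model below are hypothetical stand-ins for the estimated population model of Y | x_k, but the bootstrap logic (simulate from the fitted model, re-estimate, recompute the KS statistic, take the 95th percentile) follows the procedure described above:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)

def norm_cdf(t, mu, sd):
    return np.array([0.5 * (1.0 + erf((v - mu) / (sd * sqrt(2.0)))) for v in t])

def ks_stat(sample, mu, sd):
    """Kolmogorov-Smirnov statistic of `sample` against Normal(mu, sd)."""
    s = np.sort(sample)
    n = s.size
    cdf = norm_cdf(s, mu, sd)
    return max((np.arange(1, n + 1) / n - cdf).max(),
               (cdf - np.arange(n) / n).max())

# Hypothetical observed data and fitted model.
y = rng.normal(4.0, 1.5, size=300)
mu_hat, sd_hat = y.mean(), y.std(ddof=1)
ks_obs = ks_stat(y, mu_hat, sd_hat)

# Parametric bootstrap: simulate from the fitted model, re-estimate the
# parameters for each bootstrap sample, recompute the KS statistic and take
# the 95th percentile as the critical value.
B = 500
ks_boot = np.empty(B)
for b in range(B):
    yb = rng.normal(mu_hat, sd_hat, size=y.size)
    ks_boot[b] = ks_stat(yb, yb.mean(), yb.std(ddof=1))
crit = np.quantile(ks_boot, 0.95)
```

Re-estimating the parameters inside each bootstrap replication is essential: it is what makes the critical value valid when the model parameters are estimated from the same data.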
We also applied the Hotelling test based on all the 400 samples, as in Table 1, with γ_A = γ_B = (0.05, 0.1) and γ_A = γ_B = (0.1, 0.1), and obtained extremely high p-values for all three choices of the vectors (κ_A, κ_B) defining the sample selection probabilities, thus verifying that the model which accounts for the sampling and response processes fits the population distribution of X well. Table 4 shows the KSd_{Y|xk}(p; 3) distances for the estimated cdf F̂_p(y|x_k), computed as in Table 2 by constructing a fused data set. See Section 3.3.

Results Obtained When Matching the Two Surveys
SHIW and HBS suffer from low response rates, about 62% in both samples. It is quite evident that the non-response is explained, at least in part, by the size of the household (HH) and the income (or expenditure). The larger the HH, the more possibilities exist to find a contact person for an interview. In addition, HHs consisting of only one or two elderly people often tend not to participate in surveys. Furthermore, as often reported in the literature, the response probability tends to decrease as the HH income or expenditure increases (Korinek et al., 2006). In order to obtain a response rate of about 62%, we computed the response probabilities in the two samples by use of the models defined by (35), with coefficients (γ_{x,A}, γ_{y,A}) = (0.2, −0.002) and (γ_{x,B}, γ_{z,B}) = (0.2, −0.003). Table 5 displays four different estimates of the probabilities {p_{Xk}}, when considering the four possible size values (hsize = 1, 2, 3, 4+). The first column, headed p_{Xk}, shows ISTAT's estimates of the household size distribution in Italy in 2010. These values are considered as the true probabilities and serve as benchmarks for the performance of the other estimates. The estimates are defined as follows: p̂_{Xk,1} are the estimates obtained when ignoring the sampling design effects, assuming that all the units responded and not imposing the C-C; they are obtained by maximising the likelihood as under Scenario 1 in Section 5.1, but only imposing the constraints (7). p̂_{Xk,1C} are the estimates obtained under the same set-up, but imposing also the C-C. p̂_{Xk,2C} are the estimates obtained when accounting for the sampling effects (but still assuming full response) and imposing the C-C, obtained by maximising the ESL (6), subject to the constraints (7) and the C-C. p̂_{Xk,2CM} are our proposed estimates, which account for the sampling designs and the non-response (Scenario 3 of Section 5.1), obtained by maximising the ERL (19) under the constraints (22) and the C-C.
We accounted for the sampling design effects by following the approach described in Section 3.2. The last four columns of Table 5 display the sample sizes and the numbers of respondents, with the index A defining the HBS and the index B the SHIW.
In order to compare the goodness of fit of the four sets of estimates in Table 5, we computed again the Hellinger distances (HD), comparing the estimates with the true probabilities p_{Xk}. For the estimates p̂_{Xk,1}, the HD is 0.023. It reduces to 0.018 for p̂_{Xk,1C}, to 0.012 for p̂_{Xk,2C} and to 0.009 for p̂_{Xk,2CM}. In addition to estimating the probabilities p_{Xk}, we estimated the pdfs {p_{Y|Xi}, p_{Z|Xi}}, both when ignoring the sampling designs and non-response and when accounting for them, imposing the calibration constraint C-C in both cases. Next, we generated a fused data set of size ñ = 10,000 by assuming the CIA, as described in Section 3.3. The (weighted) correlations cor_{XY} and cor_{XZ} in the original samples are 0.38 and 0.31, respectively. In the fused data sets, the correlations are 0.34 and 0.28 when ignoring the sampling designs and non-response, and 0.38 and 0.32 when accounting for them. The correlation between the imputed values of Y and Z under the CIA is only 0.12. As mentioned in Section 6.1, SHIW contains also some recall questions, aimed at constructing an approximate measure of total expenditure. The correlation in the SHIW sample between income and expenditure is 0.65. Thus, the fused data set constructed under the CIA seems to misrepresent the joint population distribution of (Y, Z). Consequently, we no longer assume the CIA and estimate instead a matching distribution for income and expenditure by assuming the class (31) of plausible distributions, with the added constraints Y ≤ Z and the C-C, and applying the IPF (Section 4.2). The IPF accuracy was found to be 7 × 10⁻⁴, much smaller than in the simulation study. Next, we used the estimated joint distribution for generating ñ = 10,000 values (x_i, y_i, z_i), as described at the end of Section 4.2. Figure 2 shows the bivariate density estimates obtained by application of the IPF and under the CIA, for households of size 3.
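The Hellinger distance between an estimated and a benchmark discrete distribution is straightforward to compute; the probability vectors below are hypothetical stand-ins, not the Table 5 figures:

```python
import numpy as np

# Hypothetical benchmark ("true") household-size probabilities and one set
# of estimates to be evaluated (stand-ins for the Table 5 columns).
p_true = np.array([0.30, 0.27, 0.21, 0.22])
p_est = np.array([0.27, 0.28, 0.22, 0.23])

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

hd = hellinger(p_true, p_est)
```

The distance lies in [0, 1], equals 0 only for identical distributions, and smaller values indicate a better fit, which is how the four sets of estimates are ranked above.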
Similar figures (not shown) have been produced for HHs of size 1, 2 and 4+. Evidently, the two estimated densities are different. As noted above, the correlation between the imputed values of Y and Z under the CIA is 0.12; it increases to 0.55 by use of the IPF. The correlation in the SHIW sample is 0.65, but recall that expenditure is not directly observed in SHIW (see Section 6.1). Rässler (2002) proposes four validation measures of decreasing importance in a statistical matching problem, which in our case are as follows: (i) preserving the true household values; (ii) preserving the true joint distribution; (iii) preserving correlation structures; and (iv) preserving marginal distributions. We cannot assess the first measure because the true incomes and expenditures at the HH level are unknown. The second measure requires knowledge of the true joint population distribution of (X, Y, Z), which is likewise unknown, but an uncertainty measure of the kind introduced in Section 4.1 can be used to assess how far the matching distribution is from the true joint distribution. When accounting for the sampling and non-response effects and imposing the constraint Y ≤ Z, the estimated uncertainty measure Δ̂(p; c) decreases from 0.16 (its maximum value with no constraint) to 0.11. The uncertainty measure increases to 0.13 when the sampling and non-response processes are ignored. Regarding the third measure, we note that the correlation between the imputed values of expenditure and income is 0.55 when applying the IPF. Thus, our proposed methodology seems to recover quite well the 'approximate' correlation of 0.65 between income and expenditure in the SHIW sample. Regarding the fourth measure, the constructed fused data set preserves, by construction, the marginal distributions of income and expenditure.
This follows from the use of the IPF, which adjusts the initial cell probabilities to fit the marginal distributions of the two variables, as estimated from the two samples separately.
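The marginal-fitting property of the IPF can be sketched as follows; the target marginals and the admissibility mask (with one structural zero) are hypothetical, but the alternating row/column rescaling is the standard IPF iteration:

```python
import numpy as np

# Hypothetical target marginals for Y (rows) and Z (columns), e.g. as
# estimated from the two samples separately, and a mask of admissible
# cells (0 marks a structural zero).
row_m = np.array([0.2, 0.3, 0.5])
col_m = np.array([0.25, 0.35, 0.4])
delta = np.array([[1, 1, 1],
                  [1, 1, 1],
                  [0, 1, 1]], dtype=float)

# IPF: start from the admissible cells and alternately rescale rows and
# columns until both marginals are matched. Structural zeroes stay zero.
p = delta / delta.sum()
for _ in range(1_000):
    p *= (row_m / p.sum(axis=1))[:, None]    # match the row marginal
    p *= (col_m / p.sum(axis=0))[None, :]    # match the column marginal
    if np.abs(p.sum(axis=1) - row_m).max() < 1e-10:
        break
```

After convergence, the fitted table reproduces both estimated marginals exactly while keeping the structural zeroes, which is why the fused data set preserves the marginal distributions by construction.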

Concluding Remarks
In this paper, we propose a comprehensive approach to statistical matching when the samples containing the unmatched data are drawn by informative sampling designs and are subject to NMAR non-response. Our approach employs the EL to account for the sampling and response processes, thus enabling the generation of a fused data set which represents the true joint population pdf of the target variables with sufficient accuracy. We first consider the case where the target variables of interest are conditionally independent given the available matching variables (the CIA), and then the much more challenging problem where the CIA cannot be assumed. In order to deal with the latter case, we apply a procedure based on the IPF for choosing a pdf from a class of plausible pdfs which satisfy available information regarding the relationship between the target variables, as well as calibration constraints. An extensive simulation study and an application to real data sets illustrate the good performance of our proposed methodology.
We hope that other researchers will apply our proposed approach, with the modifications required for their data. Possible theoretical extensions of the present work include the use of proxy variables for the estimation of the conditional sample inclusion probabilities P(I_{A,i} = 1 | x_i, y_i) when the response process is not independent of the sampling process (Section 3.4), possibly by adding them to the covariates of the sampling and/or the response models. Good proxy variables may also be used for the initialisation of the IPF algorithm.
Finally, we mention the possible use of the EM algorithm for the maximisation of the empirical respondents' likelihood (19), as proposed by one of the reviewers of the article (see Remark 9).