Keywords:

  • Auxiliary information;
  • Calibration;
  • Concentration and inequality measures;
  • Influence function;
  • Linearization;
  • Model-assisted approach;
  • Penalized B-splines;
  • Total variation distance

Summary

  1. Top of page
  2. Summary
  3. 1. Introduction
  4. 2. Non-parametric model-assisted estimation of finite population totals
  5. 3. Non-parametric model-assisted estimation of non-linear finite population parameters
  6. 4. Penalized B-spline estimators
  7. 5. Variance estimation
  8. 6. Empirical results
  9. 7. Discussion
  10. Acknowledgements
  11. Appendix A: Assumptions and proofs
  12. References

Currently, high precision estimation of non-linear parameters such as Gini indices, low income proportions or other measures of inequality is particularly crucial. We propose a general class of estimators for such parameters that take into account univariate auxiliary information assumed to be known for every unit in the population. Through a non-parametric model-assisted approach, we construct a unique system of survey weights that can be used to estimate any non-linear parameter that is associated with any study variable of the survey, using a plug-in principle. Based on a rigorous functional approach and a linearization principle, the asymptotic variance of the estimators proposed is derived, and variance estimators are shown to be consistent under mild assumptions. The theory is fully detailed for penalized B-spline estimators together with suggestions for practical implementation and guidelines for choosing the smoothing parameters. The validity of the method is demonstrated on data extracted from the French Labour Force Survey. Point and confidence interval estimation for the Gini index and the low income proportion are derived. Theoretical and empirical results highlight the benefit of using a non-parametric rather than a parametric approach when estimating non-linear parameters in the presence of auxiliary information.

1. Introduction

The estimation of non-linear parameters in finite populations has become a crucial problem in many recent surveys. For example, in the European Statistics on Income and Living Conditions Survey, several indicators for studying social inequalities and poverty are considered; these include the Gini index, the at risk of poverty rate, the quintile share ratio and the low income proportion. Thus, deriving estimators and confidence intervals for such indicators is particularly useful. In the present paper, assuming that we have a single continuous auxiliary variable available for every unit in the population, we propose a general class of estimators that take into account the auxiliary variable, and we derive their asymptotic properties for general survey designs. The class of estimators that we propose is based on a non-parametric model-assisted approach. Interestingly, the estimators can be written as a weighted sum of the sampled observations, allowing a unique weight variable that can be used to estimate any complex parameter that is associated with any study variable of the survey. Having a unique system of weights is very important in multipurpose surveys such as the European Statistics on Income and Living Conditions Survey.

The estimation of non-linear parameters is a problem that has already been addressed in several references such as Shao (1994) for L-estimators, Binder and Kovacevic (1995) for the Gini index and Berger and Skinner (2003) for the low income proportion. We mention also the recent work of Opsomer and Wang (2011). Taking auxiliary information into account for estimating means or totals is a topic that has been extensively studied in the literature; it now encompasses the model-assisted and the calibration approaches, which coincide in particular cases (Särndal, 2007). In a model-assisted setting, linear models are usually used, thus leading to the well-known generalized regression (GREG) estimators. Some non-parametric models have also been considered (Breidt and Opsomer, 2009). However, to the best of our knowledge, ratios, distribution functions and quantiles are the only examples of non-linear parameters estimated by using auxiliary information.

To derive our class of estimators and their asymptotic properties, we use an approach based on the influence function that was developed by Deville (1999). This approach utilizes a functional interpretation of the parameter of interest and a linearization principle to derive asymptotic approximations of the estimators. In general, the precision of an estimator inline image of a non-linear finite population parameter Φ is obtained by resampling techniques or linearization approaches and in the present paper we focus on linearization techniques. When a sample s is selected from the finite population U according to a sampling design p(·), the linearization of inline image leads, under some assumptions, to the approximation

  • display math(1)

where inline image denotes the first-order inclusion probability for element k under the design p(·). The right-hand term of approximation (1) is the difference between the well-known Horvitz–Thompson estimator and the parameter that it estimates, namely the total of the variable inline image over the population U. Here, inline image is referred to as the linearized variable of Φ and the way that it is derived depends on the type of linearization method used which could include the Taylor series (Särndal et al., 1992), estimating equations (Binder, 1983) or influence function (Deville, 1999) approaches. The linearized variable inline image is used to compute the approximate variance of inline image as

  • display math(2)

with inline image the joint inclusion probability for the elements k,l ∈ U.
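As a minimal numerical sketch of the approximate variance in expression (2), the snippet below evaluates the double sum over U for simple random sampling without replacement and compares it with the familiar closed form for that design. The names `u`, `pi` and `pi_joint` (linearized variable, first-order and joint inclusion probabilities) are notation assumed here for illustration, and the values of `u` are simulated rather than taken from any survey.

```python
import numpy as np

def approx_variance(u, pi, pi_joint):
    """Double sum over U of (pi_kl - pi_k*pi_l) * (u_k/pi_k) * (u_l/pi_l)."""
    e = u / pi                              # expanded values u_k / pi_k
    delta = pi_joint - np.outer(pi, pi)     # pi_kl - pi_k * pi_l
    return float(e @ delta @ e)

def srswor_probs(N, n):
    """Inclusion probabilities of simple random sampling without replacement."""
    pi = np.full(N, n / N)
    pi_joint = np.full((N, N), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(pi_joint, n / N)       # pi_kk = pi_k
    return pi, pi_joint

rng = np.random.default_rng(0)
N, n = 50, 10
u = rng.normal(size=N)                      # hypothetical linearized variable
pi, pi_joint = srswor_probs(N, n)

av = approx_variance(u, pi, pi_joint)
# Closed form for this design: N^2 * (1 - n/N) * S_u^2 / n
closed_form = N**2 * (1 - n / N) * np.var(u, ddof=1) / n
```

The two quantities `av` and `closed_form` agree up to rounding, which is a convenient check that the double sum has been coded with the diagonal terms handled correctly.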

Roughly speaking, when examining expressions (1) and (2), we can see that, if we estimate in an efficient way inline image, we shall achieve a small approximate variance and good precision for inline image. As stated above, it is well known that auxiliary information is useful for improving on the estimation of a total in terms of efficiency and, based on a linear model, the use of a GREG estimator is the most common alternative. When estimating a total, note that the asymptotic variance of the GREG estimator depends on the residuals of the study variable on the auxiliary variable. Because linearized variables may have complicated mathematical expressions, fitting a linear model onto a linearized variable may not be the most appropriate choice. This may occur even if the study and the auxiliary variables have a clear linear relationship, as illustrated in the following example. Consider a data set of size 1000 extracted from the French Labour Force Survey and consider inline image (the wages of person k in 2000) as the study variable and inline image (the wages of person k in 1999) as the auxiliary variable. We now consider the problem of estimating the Gini index. The expression of the linearized variable inline image, k ∈ U, for the Gini index is (Binder and Kovacevic, 1995)

  • display math

where G is the Gini index, F the empirical distribution function, inline image the mean of inline image lower than inline image and inline image the total of the inline image on U. It is a complex function of the study variable inline image, k ∈ U. In Figs 1(a) and 1(b) the study variable inline image and the linearized variable inline image respectively are plotted on the y-axis and the auxiliary variable inline image is plotted on the x-axis. The relationship between the study variable and the auxiliary variable is almost linear; however, the relationship between the linearized variable of the Gini index and the auxiliary information is no longer linear. The consequence of this is that we cannot increase the efficiency of estimating a Gini index if we take the auxiliary information into account through a GREG estimator. Therefore, non-parametric models should be preferred to estimate non-linear parameters Φ. Recent work already employs non-parametric models to estimate totals (Breidt and Opsomer, 2000; Breidt et al., 2005; Goga, 2005). The use of non-parametrics prevents model failure; however, the improvement over parametric estimation for totals and means may not be sufficiently significant to justify the supplemental difficulties of implementing non-parametric methodology. As illustrated above, the motivation for using non-parametrics becomes much stronger when estimating non-linear parameters. Note that the use of non-parametric regression to estimate distribution functions and quantiles has also been studied, for example in Johnson et al. (2008); however, to our knowledge, this has not been performed for other non-linear parameters.

Figure 1. (a) inline image, the wages of person k in 2000, against inline image, the wages of person k in 1999, and (b) inline image, linearized variable of the Gini index for the wages in 2000 for person k, against inline image, the wages of person k in 1999

We propose a novel methodology that allows for the efficient estimation of any parameter Φ by combining the functional approach (Deville, 1999) with any of the previously suggested non-parametric methods. In the present paper, we derive rigorous proofs of our asymptotic results. Most importantly, we prove that the total variation distance between finite measures is an adequate choice for the derivation of asymptotic approximations in this context. Asymptotic results are detailed at length for penalized B-spline non-parametric estimators.

The estimators under study combine two types of non-linearity: non-linearity due to the expression of a complex parameter and non-linearity due to non-parametric estimation. We propose a two-step linearization procedure that provides an approximation of the non-parametric estimator via a Horvitz–Thompson estimator of a total by using an artificial variable. Roughly speaking, this artificial variable corresponds to the residuals of the linearized variable inline image on the fitted values under the model. Because the linearized variable depends on the parameter of interest, the residuals will also depend on this parameter. The consequence of this important and general property is that the non-parametric approach helps to obtain a unique system of weights that may lead to a gain in efficiency for different complex parameters.

The paper is structured as follows: the second section provides some background information on the non-parametric estimation of a finite population total in a general framework. In the third section, a class of non-parametric substitution estimators based on non-parametric regression is introduced. Variance approximations are derived by using the influence function linearization approach (Deville, 1999) in a general non-parametric setting. We propose in the fourth section a penalized B-spline model-assisted estimator for the finite population totals which is in fact an extension to a survey sampling framework of the penalized B-spline estimator that was studied in Claeskens et al. We prove that the estimator is asymptotically design unbiased and consistent. Next, we build non-parametric penalized spline estimation for non-linear parameters and we assess the validity of the two-step linearization technique. The fifth section defines a class of consistent variance estimators and Section 6 contains a case-study. The data set is extracted from the French Labour Force Surveys of 1999 and 2000 as presented previously. Asymptotic and finite sample properties of the regression B-spline estimators are illustrated for simple random sampling without replacement and stratified simple random sampling. This section also includes suggestions for practical implementation and guidelines for choosing the smoothing parameters. Finally, Section 7 concludes this study and the assumptions and the technical proofs together with some discussion are provided in Appendix A.

2. Non-parametric model-assisted estimation of finite population totals

We focus on the estimation of the total inline image of the study variable inline image over U, taking into account the univariate auxiliary variable inline image. The values inline image of inline image are assumed to be known for the entire population.

Many approaches can be used to take into account auxiliary information inline image and thus improve on the Horvitz–Thompson estimator inline image. The goal is to derive a weighted linear estimator inline image of inline image such that the sample weights inline image do not depend on the study variable values inline image but include the values inline image for all k ∈ U. The construction of the model-assisted class of estimators inline image is based on a superpopulation model ξ:

  • display math(3)

where the inline image are independent random variables with mean 0 and variance inline image. If inline image were known for all k ∈ U, the total inline image could be estimated by the generalized difference estimator (Cassel et al., 1976),

  • display math(4)

Note that inline image is the difference between the Horvitz–Thompson estimator inline image and its bias under the model ξ, namely inline image. As a consequence, inline image is unbiased under the model, inline image, and, moreover, it is unbiased under the sampling design, inline image. The variance of inline image under the sampling design is given by

  • display math(5)

which shows clearly that the difference estimator inline image is more efficient than the Horvitz–Thompson estimator inline image if inline image approximates well inline image for all k ∈ U.
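The generalized difference estimator can be sketched numerically as follows. The form assumed here is the usual one, the Horvitz–Thompson estimator of the residuals y_k − f(z_k) plus the population total of f(z_k); the regression function `f`, the simulated data and all variable names are hypothetical and serve only as an illustration.

```python
import numpy as np

def difference_estimator(y_s, f_s, pi_s, f_U):
    """Sum over s of (y_k - f(z_k))/pi_k, plus sum over U of f(z_k)."""
    return float(np.sum((y_s - f_s) / pi_s) + np.sum(f_U))

def f(z):
    # Hypothetical "known" regression function, for illustration only.
    return 2.0 + 3.0 * z**2

rng = np.random.default_rng(1)
N, n = 200, 40
z = rng.uniform(size=N)
y_exact = f(z)                                   # y_k = f(z_k) exactly
y_noisy = f(z) + rng.normal(scale=0.1, size=N)   # y_k = f(z_k) + error

s = rng.choice(N, size=n, replace=False)         # one SRSWOR sample
pi_s = np.full(n, n / N)

# When f reproduces y exactly, the residuals vanish and the estimator
# returns the true total whatever the sample.
t_exact = difference_estimator(y_exact[s], f(z[s]), pi_s, f(z))

# With noise, only the (small) residuals are expanded by 1/pi_k.
t_noisy = difference_estimator(y_noisy[s], f(z[s]), pi_s, f(z))
t_ht = float(np.sum(y_noisy[s] / pi_s))          # plain Horvitz-Thompson
```

The exact-fit case makes the efficiency argument concrete: the design variance is driven by the residuals only, so the closer f(z_k) is to y_k, the smaller the sampling error.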

In practice, we do not know the true regression function f; thus we use an estimator of it. Generally, this estimator is obtained by using a two-step procedure: we first estimate f by inline image under the model ξ and, next, we estimate inline image by inline image by using the sampling design. Plugging inline image in equation (4) yields the final estimator of inline image.

The linear regression function inline image yields the GREG estimator that was extensively studied by Särndal et al. (1992). The GREG estimator is efficient if the model fits the data well but, if the model is misspecified, the GREG estimator exhibits no improvement over the Horvitz–Thompson estimator and may even lead to a loss of efficiency. One way of guarding against model failure is to use non-parametric regression which does not require a predefined parametric mathematical expression for f.

Breidt and Opsomer (2000) proposed local linear estimators and Breidt et al. (2005) and Goga (2005) used non-parametric spline regression. The unknown f function is approximated by the projection of the population vector inline image onto different basis functions, such as the basis of truncated qth-degree polynomials in Breidt et al. (2005) and the B-spline basis in Goga (2005). In what follows, we briefly recall the definition and the main asymptotic properties of non-parametric model-assisted estimators for finite population totals (see also Breidt and Opsomer (2009)).

Let inline image be the estimator of inline image obtained at the population level by using one of the three non-parametric methods that were mentioned above. Plugging inline image into equation (4) results in the following non-parametric generalized difference pseudoestimator of the finite population total:

  • display math(6)

inline image is called a pseudoestimator because it is not feasible in practice since inline image is unknown. This pseudoestimator is still design unbiased but it is model biased because non-parametric estimators inline image are biased for inline image (Sarda and Vieu, 2000). Nevertheless, under supplementary assumptions (Breidt and Opsomer, 2000; Goga, 2005), the bias under the model vanishes asymptotically to 0 when the population and the sample sizes go to ∞. The unknown quantities inline image are usually obtained by least squares methods (ordinary, weighted or penalized) and we may write

  • display math(7)

where the N-dimensional vector inline image depends on the population values inline image, k ∈ U, as well as on the projection matrix for the basis functions considered but does not depend on inline image. The expression of inline image depends on the non-parametric method chosen, as discussed in Breidt and Opsomer (2000), Breidt et al. (2005) and Goga (2005).

As in the parametric case, we estimate inline image by inline image by using the sampling design,

  • display math(8)

where inline image is the n-dimensional design-based estimator of inline image and inline image is the sample restriction of inline image. Plugging inline image into equation (6) yields the following non-parametric model-assisted (NMA) estimator

  • display math(9)

This estimator can be written as a weighted sum of the sampled observations

  • display math(10)

where the weights inline image depend only on the sample and on the auxiliary information,

  • display math(11)

with inline image the n-dimensional vector of 1s, inline image the n×n diagonal matrix with inline image, k ∈ s, along the diagonal and inline image the N×n matrix having inline image as rows with sample restriction inline image. Estimator (10) is a non-linear function of Horvitz–Thompson estimators, and its asymptotic variance has been obtained on a case-by-case basis. Under mild hypotheses (Breidt and Opsomer, 2000; Breidt et al., 2005; Goga, 2005), inline image is asymptotically design unbiased, namely inline image, and design root n consistent in the sense that

  • display math(12)

Moreover, it can be approximated by the non-parametric generalized difference estimator inline image

  • display math(13)

Furthermore, if the asymptotic distribution of inline image is normal inline image, we have that the asymptotic distribution of inline image is also normal inline image, where inline image is obtained according to formula (5) applied to residuals inline image. This means that the NMA estimators bring an improvement over parametric methods and the Horvitz–Thompson estimator when the relationship between inline image and inline image is not linear. In this case, the residuals inline image will be smaller than under a parametric smoother, which explains the reduction of the design variance of NMA estimators. Nevertheless, non-parametric estimators require that the auxiliary information should be known on the whole population, unlike the GREG estimator that requires only the finite population total for inline image.

The efficiency of NMA estimators depends on the choice of the smoothing parameters. Opsomer and Miller (2005) and Harms and Duchesne (2010) derived the optimal bandwidth for the local polynomial regression, whereas Breidt et al. (2005) circumvented the issue of the number of knots by introducing a penalty coefficient. They also gave a practical method for estimating this penalty.
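The construction of the NMA weights of equation (11) can be sketched as follows, assuming the form w_s' = 1_s'Π_s⁻¹ − 1_s'Π_s⁻¹Ĝ_s + 1_U'Ĝ that results from expanding estimator (9). For brevity a cubic polynomial basis fitted by design-weighted least squares stands in for the local polynomial or spline smoothers discussed above; the helper names `basis` and `nma_weights`, and the simulated data, are hypothetical.

```python
import numpy as np

def basis(z, degree=3):
    """Stand-in basis (assumption): columns 1, z, ..., z**degree."""
    return np.vander(z, degree + 1, increasing=True)

def nma_weights(z_U, sample_idx, pi_s, degree=3):
    """Return the sample weights w_s and the N x n smoother matrix G_hat."""
    B_U = basis(z_U, degree)
    B_s = B_U[sample_idx]
    d = 1.0 / pi_s                                  # Horvitz-Thompson weights
    BtW = B_s.T * d                                 # B_s' Pi_s^{-1}
    G_hat = B_U @ np.linalg.solve(BtW @ B_s, BtW)   # fitted values = G_hat @ y_s
    G_hat_s = G_hat[sample_idx]                     # sample restriction (n x n)
    w_s = d - d @ G_hat_s + G_hat.sum(axis=0)
    return w_s, G_hat

rng = np.random.default_rng(2)
N, n = 300, 60
z_U = rng.uniform(size=N)
y_U = np.sin(2 * np.pi * z_U) + rng.normal(scale=0.2, size=N)
sample_idx = rng.choice(N, size=n, replace=False)
pi_s = np.full(n, n / N)

w_s, G_hat = nma_weights(z_U, sample_idx, pi_s)
y_s = y_U[sample_idx]

# Weighted-sum form of the estimator ...
t_weighted = float(w_s @ y_s)
# ... and the equivalent "difference" form with estimated fitted values.
f_hat_U = G_hat @ y_s
t_difference = float(np.sum((y_s - f_hat_U[sample_idx]) / pi_s) + f_hat_U.sum())
```

The weights are computed without ever touching `y_U`, so the same `w_s` can be reused for any study variable. With a least squares smoother of this kind they also reproduce the population totals of the basis functions, which is the calibration property mentioned in the Introduction.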

3. Non-parametric model-assisted estimation of non-linear finite population parameters

3.1. Definition of the non-parametric substitution estimator

Let us consider the estimation of some non-linear parameters Φ by taking into account univariate auxiliary information known for all the population units. Examples of a non-linear parameter of interest Φ include the ratio, the Gini coefficient and the low income proportion. A parameter Φ may depend on one or several variables of interest; however, the same auxiliary variable inline image will be used to explain these variables of interest.

We aim to provide a general method for the estimation of Φ by using inline image and considering the functional approach that was introduced by Deville (1999). The methodology consists in considering a discrete and finite measure inline image where inline image is the Dirac measure at the point inline image and M is such that there is unit mass on each point inline image with k ∈ U and zero mass elsewhere. Furthermore, we write Φ as a functional T of M,

  • display math(14)

The non-parametric weights inline image are provided by equation (11) and M is estimated by

  • display math

Even if these weights are derived to estimate the total inline image, they do not depend on the study variable inline image; thus they can be used to estimate any non-linear parameter of interest Φ when it can be expressed as a function of M. Note that inline image is a random measure of total mass equal to inline image.

Plugging inline image into equation (14) provides the following non-parametric substitution estimator for Φ:

  • display math

We shall now illustrate the computation of inline image by using the simple case of a ratio R and subsequently the more intricate case of the Gini index and parameters defined by implicit equations.

  1. The ratio R between two finite population totals: we write inline image in a functional form as
    • display math
    The non-parametric estimator of R is easily obtained by replacing the measure M with inline image, namely
    • display math
    A similar estimation of R using GREG weights was previously considered by Särndal et al. (1992).
  2. The Gini index (Nygard and Sandström, 1985) is given by
    • display math
    where inline image is the empirical distribution function. Again, the non-parametric estimator for G is obtained by simply replacing M with inline image. Hence,
    • display math(15)
    where
    • display math
  3. Parameters defined by an implicit equation: let Φ be defined as the unique solution of an implicit estimating equation inline image (Binder, 1983) that may be written in a functional form as ∫ϕ(Φ) dM=0. We replace M with inline image and the non-parametric sample-based estimator of Φ is the unique solution of the sample-based estimating equation
    • display math
    An example of such a parameter is the odds ratio which is extensively used in epidemiological studies. Goga and Ruiz-Gazen (2013) have studied the estimation of the odds ratio by taking into account auxiliary information and non-parametric regression.
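The plug-in principle for the first two examples can be sketched in a few lines. The sketch assumes that the Gini index is written in the discrete form Σ (2F(y_k) − 1) y_k / Σ y_k, with F the (weighted) empirical distribution function; this is one common form and may differ from equation (15) in terms of order 1/N. The function names and the toy data are hypothetical, and any system of weights (Horvitz–Thompson or non-parametric) can be passed as `w`.

```python
import numpy as np

def ratio_plugin(y, x, w):
    """Plug-in ratio: weighted total of y over weighted total of x."""
    return float(np.sum(w * y) / np.sum(w * x))

def gini_plugin(y, w):
    """Plug-in Gini index under the assumed form
    sum_k w_k (2 F_hat(y_k) - 1) y_k / sum_k w_k y_k."""
    # F_hat(y_k) = sum_j w_j 1{y_j <= y_k} / sum_j w_j  (handles ties)
    F_hat = ((y[None, :] <= y[:, None]) @ w) / np.sum(w)
    return float(np.sum(w * (2.0 * F_hat - 1.0) * y) / np.sum(w * y))

# Toy data: with unit weights the "sample" is the whole population.
y = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([2.0, 2.0, 2.0, 2.0])
w = np.ones_like(y)

R_hat = ratio_plugin(y, x, w)        # 10 / 8
G_hat = gini_plugin(y, w)            # 5 / 10 under the assumed form
G_rescaled = gini_plugin(y, 3.0 * w)
```

Both estimators depend on the weights only through the estimated measure, so rescaling all weights by a constant leaves them unchanged, as `G_rescaled` illustrates.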

3.2. Asymptotic properties of the non-parametric substitution estimator under the sampling design

In this section, we investigate the asymptotic properties of the non-parametric estimator inline image. We use the asymptotic framework that was suggested by Isaki and Fuller (1982). Additionally, we make several assumptions (which are detailed in Appendix A) regarding the regularity of the functional T and the first-order inclusion probabilities of the sampling design.

The non-parametric estimator inline image is doubly non-linear, with non-linearity due to the parameter Φ and non-linearity due to the non-parametric estimation. Our main goal is to approximate inline image by using a linear estimator (Horvitz–Thompson type) which will allow us to compute the asymptotic variance of inline image. This approximation will be accomplished in two steps: first, we shall linearize Φ and, then, we shall linearize the non-parametric estimator that is obtained in the first step.

The first linearization step is a first-order expansion of inline image with the remainder going to 0. The parameter of interest Φ is a statistical functional T defined with respect to the measure M or, equivalently, with respect to the probability measure M/N (by assumption 1). Using the first-order expansion of statistical functionals T as introduced by von Mises (1947) and, under the assumption of Fréchet differentiability of T, the remainder depends on some distance function between M/N and an estimator of this measure (Huber, 1981). Deville (1999) used these facts to prove the linearization of the Horvitz–Thompson substitution estimator of Φ; however, no details were given about the distance considered, and Goga et al. (2009) provided only minimal details. In what follows, we provide a distance between inline image and the true M/N which goes to 0 when the sample and the population sizes go to ∞.

We consider the total variation distance for two finite and positive measures inline image and inline image to be defined by

  • display math

with inline image. We first prove (lemma 1 below) that the distance inline image between the Horvitz–Thompson estimator of M/N and the true M/N goes to 0. Next, we extend the result (lemma 2 below) to the non-parametric estimator inline image.

Let inline image represent the Horvitz–Thompson weights, namely inline image for all k ∈ s, and let inline image be the estimator of M by using these weights. Let inline image and, for ease of notation, inline image. Thus, for all k ∈ U, inline image uniformly in inline image and

  • display math

where inline image is the sample membership indicator.

Lemma 1. Make assumptions 3 and 5 from Appendix A. Then,

  • display math

The proof is provided in Appendix A.3.1. We now extend lemma 1 to non-parametric weights inline image given by equation (11). Consider again inline image and let inline image where inline image is obtained from expression (8) for inline image replaced with inline image. Let also inline image be obtained from expression (7) for inline image replaced with inline image.

Lemma 2. Make assumptions 3 and 5 from Appendix A. Assume in addition that

  (a) for all k ∈ U, inline image uniformly in h and
  (b) inline image uniformly in h.

Then,

  • display math(16)

The proof is provided in Appendix A.3.2. In Section 4, we prove that the non-parametric estimator of M constructed by using penalized B-spline estimators satisfies assumptions (a) and (b) from lemma 2. The results from Breidt and Opsomer (2000) may be used to prove the assumptions for local polynomial regression; however, this issue will not be pursued further here.

To provide the first-order expansion of Φ=T(M), we must also define its first derivative. This derivative is referred to as the influence function and is defined as follows (Deville, 1999):

  • display math

where inline image is the Dirac measure at point y. This definition is slightly different from the definition of the influence function that was given by Hampel (1974) in robust statistics, which is based on a probability distribution instead of a finite measure.

Let inline image, for all k ∈ U, be the influence function IT computed at inline image, namely

  • display math

These quantities are referred to as the linearized variable of Φ and are a tool for computing the approximate variance of inline image. They depend on the parameter of interest and they are usually unknown even for the individuals sampled. Deville (1999) provided many practical rules for computing inline image for rather complicated parameters Φ.

For example, the linearized variable of a ratio R is

  • display math(17)

and, for the Gini index, it is given by

  • display math(18)

where inline image is the mean of inline image lower than inline image.
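The definition of the influence function as a directional derivative can be checked numerically for the ratio. The sketch below represents the measure M by its support points and masses, adds a small mass t at one point and forms the difference quotient [T(M + tδ) − T(M)]/t. The result is compared with the standard closed form (y_k − R x_k)/t_x for the linearized variable of a ratio R = t_y/t_x. The names `x`, `y`, `ratio_functional` and `influence_fd`, and the simulated data, are assumptions made for illustration.

```python
import numpy as np

def ratio_functional(points, masses):
    """T(M) = (integral of y dM) / (integral of x dM) for a discrete measure."""
    y, x = points[:, 0], points[:, 1]
    return np.sum(masses * y) / np.sum(masses * x)

def influence_fd(T, points, masses, new_point, t=1e-6):
    """Finite-difference version of [T(M + t*delta) - T(M)] / t."""
    pts = np.vstack([points, new_point])
    m0 = np.append(masses, 0.0)          # M itself
    m1 = np.append(masses, t)            # M plus mass t at new_point
    return (T(pts, m1) - T(pts, m0)) / t

rng = np.random.default_rng(3)
N = 100
x = rng.uniform(1.0, 2.0, size=N)
y = 1.5 * x + rng.normal(scale=0.1, size=N)
points = np.column_stack([y, x])
masses = np.ones(N)                      # unit mass on each population point

R = y.sum() / x.sum()
u_exact = (y - R * x) / x.sum()          # closed form for the ratio
u_fd = np.array([influence_fd(ratio_functional, points, masses, p)
                 for p in points])
```

The same difference quotient can be applied to any functional coded as a function of points and masses, which gives a quick sanity check for a hand-derived linearized variable.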

We now provide the main result of this paper. The following theorem is the first linearization step of inline image. This proves that under broad assumptions the non-parametric estimator inline image is approximated by the non-parametric estimator for the population total inline image of the linearized variable. The proof is provided in Appendix A.3.3.

Theorem 1 (first linearization step). Make assumptions 1, 2, 3 and 5 from Appendix A. Additionally assume (a) and (b) from lemma 2. Then, the non-parametric substitution estimator inline image fulfils

  • display math

We can put inline image in the form of an NMA estimator. Denote inline image. Using equation (11), we can write

  • display math(19)

where inline image with inline image given by expression (8) and inline image is the sample restriction of inline image.

Remark 1. A model-based interpretation of inline image may be given. The linearized variable inline image can be fitted by using the auxiliary variable inline image by the following non-parametric model inline image:

  • display math(20)

where the inline image are independent random variables with mean 0 and variance inline image. The estimator of g under model inline image, which is denoted by inline image, is obtained by using the same non-parametric method employed for estimating f under model ξ. This implies that inline image is the best fit of the population vector inline image with inline image given by expression (7). Furthermore, inline image is estimated by inline image, which leads to the pseudoestimator inline image of inline image. However, unlike the linear case, inline image is not an estimate of inline image because the sample linearized variable vector inline image is not known and we refer to it as a pseudoestimator. We remark also that the estimator inline image is efficient if the non-parametric model inline image holds.

The non-parametric pseudoestimator inline image that is given by equation (20) is a non-linear function of Horvitz–Thompson estimators; however, it estimates a linear parameter of interest, namely the total of inline image, inline image. This indicates that inline image is similar to estimators that were used by Breidt and Opsomer (2000), Breidt et al. (2005) and Goga (2005), although it is computed for the linearized variable inline image. The second linearization step approximates inline image by the generalized difference estimator of inline image given by

  • display math(21)

where inline image.

Proposition 1 (second linearization step). Assume that inline image. Then,

  • display math

On the basis of theorem 1 and proposition 1, we see that the asymptotic variance of inline image is the variance of inline image, namely

  • display math

Moreover, if the asymptotic distribution of inline image is inline image, then the asymptotic distribution of inline image is also inline image. In Section 4, we provide the necessary assumptions for the linearized variable and the auxiliary variable inline image to obtain an approximation of inline image by inline image in a B-spline estimation context.
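The two linearization steps and the resulting variance can be summarized schematically as follows. The notation is assumed for illustration only: u_k denotes the linearized variable, w_ks the non-parametric weights of equation (11), g̃_k the population-level non-parametric fit of u_k on the auxiliary variable, and π_k, π_kl the first-order and joint inclusion probabilities. Normalizing factors and remainder terms are deliberately omitted, so the display is a sketch of the structure and not a restatement of theorem 1 or proposition 1.

```latex
\begin{align*}
\hat{\Phi}_{np} - \Phi
  &\simeq \sum_{k \in s} w_{ks}\, u_k - \sum_{k \in U} u_k
  && \text{(first step: linearization of } \Phi\text{)} \\
  &\simeq \sum_{k \in s} \frac{u_k - \tilde{g}_k}{\pi_k}
        - \sum_{k \in U} \left( u_k - \tilde{g}_k \right)
  && \text{(second step: linearization of the weights)} \\[4pt]
\mathrm{AV}\bigl(\hat{\Phi}_{np}\bigr)
  &= \sum_{k \in U} \sum_{l \in U}
     \left( \pi_{kl} - \pi_k \pi_l \right)
     \frac{u_k - \tilde{g}_k}{\pi_k}\,
     \frac{u_l - \tilde{g}_l}{\pi_l} .
\end{align*}
```

Written this way, the variance has the same form as expression (2) with the linearized variable replaced by its residual, which is why a smoother that tracks u_k closely yields a small asymptotic variance.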

Remark 2. When the linearized variable inline image is a linear combination of the study variables, the assumption from proposition 1 is reduced to assumptions on the study variables. For example, this occurs in the case of a ratio inline image, where the linearized variable is given by inline image. The error inline image can be written as a linear combination of errors inline image and inline image. Using mild regularity assumptions on inline image and inline image and on the sampling design, inline image and inline image are shown to be of order inline image (see Fuller (2009) for linear regression and Section 4 for B-spline estimators). Thus inline image is also of order inline image provided that R and inline image are bounded.

Remark 3. The asymptotic variance inline image that is given by theorem 1 and proposition 1 depends on the population residuals inline image of the linearized variable inline image under the model inline image. For the simple case of a ratio, the relationship between inline image and the study variables is explicit and given by inline image. If linear models fit the data inline image and inline image well, then a linear model will also fit inline image well. Nevertheless, for non-linear parameters such as the Gini index, the relationship between inline image and the study variable is not as simple as that for the ratio. In such situations, the use of non-parametric regression methods may provide a major improvement with respect to variance compared with parametric regression.
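As a small numerical illustration of remarks 2 and 3, the sketch below assumes the usual linearized variable of a ratio of totals R = t_y/t_z, namely u_k = (y_k − R z_k)/t_z. The formula itself is not reproduced in the text above, so this form, the toy data and all function names are assumptions. The sketch checks that the residuals of u from a simple linear fit on x are the same linear combination of the residuals of y and z, which is why a linear model that fits y and z well also fits u well.

```python
def ratio_linearized(y, z):
    """Return R = t_y / t_z and the (assumed) linearized variable
    u_k = (y_k - R * z_k) / t_z of the ratio of totals."""
    t_y, t_z = sum(y), sum(z)
    R = t_y / t_z
    u = [(yk - R * zk) / t_z for yk, zk in zip(y, z)]
    return R, u


def ols_residuals(x, v):
    """Residuals of the simple linear regression of v on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    v_bar = sum(v) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (vi - v_bar) for xi, vi in zip(x, v)) / s_xx
    intercept = v_bar - slope * x_bar
    return [vi - intercept - slope * xi for xi, vi in zip(x, v)]


# Toy population (hypothetical values, for illustration only).
x = [1.0, 2.0, 3.0, 4.0]
y = [12.0, 15.0, 9.0, 20.0]
z = [10.0, 14.0, 10.0, 16.0]

R, u = ratio_linearized(y, z)
t_z = sum(z)

e_y = ols_residuals(x, y)
e_z = ols_residuals(x, z)
e_u = ols_residuals(x, u)

# Residuals of u versus the same linear combination of residuals of y and z.
combined = [(ey - R * ez) / t_z for ey, ez in zip(e_y, e_z)]
max_gap = max(abs(a - b) for a, b in zip(e_u, combined))
```

Because least squares fitting is linear in the response, `max_gap` is zero up to rounding; no such identity is available for the Gini index, whose linearized variable is not a linear function of the study variable.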

4. Penalized B-spline estimators

Spline functions have many attractive properties, and they are often used in practice because of their good numerical features and ease of implementation. We suppose without loss of generality that all inline image have been normalized and lie in [0,1]. For a fixed m>1, the set inline image of spline functions of order m, with K equidistant interior knots inline image, is the set of piecewise polynomials of degree m−1 that are smoothly connected at the knots (Zhou et al., 1998):

  • display math

For m=1, inline image is the set of step functions with jumps at knots. For each fixed set of knots, inline image is a linear space of functions of dimension q=K+m. A basis for this linear space is provided by the B-spline functions (Schumaker, 1981; Dierckx, 1993) inline image defined by

  • display math

where inline image equals inline image if inline image and 0 otherwise. For all j=1, …,q, each function inline image has knots inline image with inline image for r=j−m, …,j (Zhou et al., 1998) which means that its support consists of a small, fixed, finite number of intervals between knots. Moreover, B-splines are positive functions with a total sum equal to 1:

  • display math(22)

For the same order m and the same knot location, one can use the truncated power basis (Ruppert and Carroll, 2000) given by inline image. The B-spline and the truncated power bases are equivalent in the sense that they span the same set of spline functions inline image (Dierckx, 1993). Nevertheless, as indicated by Ruppert et al. (2003), ‘the truncated power bases have the practical disadvantage that they are far from orthogonal’, which leads to numerical instability especially if a large number of knots are used.
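The properties listed above (dimension q=K+m, local support, non-negativity and a total sum equal to 1) can be checked numerically. The sketch below evaluates the B-splines with the Cox–de Boor recursion on K equidistant interior knots in [0,1], with the boundary knots repeated m times; the recursion and the function name `bspline_basis` are choices made for this illustration, not the construction used in the paper.

```python
def bspline_basis(x, K, m):
    """Evaluate the q = K + m B-splines of order m (degree m - 1) at x in [0, 1],
    for K equidistant interior knots, using the Cox-de Boor recursion."""
    # Extended knot vector: boundary knots 0 and 1 repeated m times.
    t = [0.0] * m + [r / (K + 1) for r in range(1, K + 1)] + [1.0] * m
    # Order 1: indicators of the half-open knot intervals.
    B = [1.0 if t[i] <= x < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    if x == 1.0:
        # Close the last non-empty interval on the right.
        B[m + K - 1] = 1.0
    # Raise the order one step at a time; 0/0 terms are treated as 0.
    for k in range(2, m + 1):
        nxt = []
        for i in range(len(t) - k):
            left = right = 0.0
            if t[i + k - 1] > t[i]:
                left = (x - t[i]) / (t[i + k - 1] - t[i]) * B[i]
            if t[i + k] > t[i + 1]:
                right = (t[i + k] - x) / (t[i + k] - t[i + 1]) * B[i + 1]
            nxt.append(left + right)
        B = nxt
    return B


K, m = 4, 3
grid = [0.0, 0.2, 0.37, 0.5, 0.99, 1.0]
values = {x: bspline_basis(x, K, m) for x in grid}

dimension = len(values[0.37])                          # q = K + m
sums = [sum(values[x]) for x in grid]                  # partition of unity
max_active = max(sum(1 for b in values[x] if b > 0.0) for x in grid)

# For m = 1 the basis reduces to step functions (indicators of the strata).
steps = bspline_basis(0.5, 2, 1)
```

At any point at most m basis functions are non-zero, which is the local support property that keeps the matrices involved in the fit banded and well conditioned.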

4.1. Non-parametric penalized spline estimation for finite population totals

We now consider the superpopulation model ξ given by equation (3). To estimate the regression function f, we use a spline approximation and a penalized least squares criterion. We define the spline basis vector of dimension q×1 as inline image, k ∈ U. The penalized spline estimator inline image of inline image is given by inline image with inline image as the least squares minimizer of

  • display math(23)

where the superscript (p) denotes the pth derivative with p ≤ m−1. The minimizer of expression (23) is a ridge-type estimator,

  • display math(24)

where inline image is the N×q matrix with rows inline image and the q×q matrix inline image is the squared inline image-norm applied to the pth derivative of inline image. Because the derivative of a B-spline function of order m may be written as a linear combination of B-spline functions of order m−1, for equidistant knots we obtain that inline image where the matrix R has elements

  • display math(25)

with inline image as the B-spline function of order m−p and inline image as the matrix corresponding to the pth-order difference operator (Claeskens et al., 2009).
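The pth-order difference operator mentioned above can be built by repeated first differencing of the identity matrix. The sketch below is only an illustration of that operator (the function name `difference_matrix` is hypothetical); it also checks that coefficient sequences that are polynomial of degree p−1 in the index are annihilated, so they are not penalized.

```python
def difference_matrix(q, p):
    """Return the (q - p) x q matrix of the p-th order difference operator,
    obtained by differencing the rows of the q x q identity matrix p times."""
    D = [[1.0 if i == j else 0.0 for j in range(q)] for i in range(q)]
    for _ in range(p):
        D = [[D[i + 1][j] - D[i][j] for j in range(q)] for i in range(len(D) - 1)]
    return D


def apply(D, theta):
    """Matrix-vector product D @ theta with plain lists."""
    return [sum(dij * tj for dij, tj in zip(row, theta)) for row in D]


D2 = difference_matrix(4, 2)

# Coefficients that are linear in the index j (degree p - 1 = 1) ...
theta_linear = [3.0 + 2.0 * j for j in range(4)]
# ... and coefficients that are quadratic in j.
theta_quadratic = [float(j * j) for j in range(4)]

penalized_linear = apply(D2, theta_linear)
penalized_quadratic = apply(D2, theta_quadratic)
```

Here the second differences of the linear sequence vanish, whereas those of the quadratic sequence are constant and non-zero.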

The amount of smoothing is controlled by λ>0. The case λ=0 results in an unpenalized B-spline estimator, the asymptotic properties of which have been extensively studied in the literature (Agarwal and Studden (1980), Burman (1991) and Zhou et al. (1998), among others). The case λ→∞ is equivalent to fitting a (p−1)th-degree polynomial. The theoretical properties of penalized splines with λ>0 have been studied only recently by Cardot (2000, 2002), Hall and Opsomer (2005), Kauermann et al. (2009) and Claeskens et al. (2009). The design-based estimators of inline image are

  • display math(26)

where inline image is the design-based estimator of inline image and inline image is the n×q matrix given by inline image We note that inline image may be written as in formula (8) for inline image

Finally, the penalized B-spline NMA estimator of inline image is inline image

This indicates that inline image may be written as a GREG estimator that uses the vectors inline image as regressors of dimension q×1 with q going to ∞ and a ridge-type regression coefficient inline image Furthermore, inline image is a weighted sum of sampled values inline image with weights inline image expressed as in equation (11):

  • display math(27)
4.1.1. Regression splines

For λ=0, we obtain the unpenalized B-spline estimator that was studied by Goga (2005) and called the regression splines. The B-spline property that is given in expression (22) may be written as inline image with inline image the q-dimensional vector of 1s, implying that inline image and inline image Using these two relationships in equation (28), Goga (2005) observed that inline image is equal to the finite population total of the prediction inline image

  • display math

where the weights are given by

  • display math(28)

Note the similarity with the GREG weights that were obtained in the case of a linear model when the variance of errors is linearly related to the auxiliary information (Särndal, 1980). We note that, for a B-spline of order m=1, the estimator inline image becomes the well-known post-stratified estimator (Särndal et al., 1992).

On the basis of assumptions regarding the sampling design and the variable inline image (assumptions 3–5 from Appendix A) and assumptions regarding the distribution of inline image and the number of knots (assumptions 6 and 7 in Appendix A), Goga (2005) proved that the B-spline estimator for the total inline image is asymptotically design unbiased and root n consistent (equation (12)) and may be approximated by a non-parametric generalized difference estimator (equation (13)). These results are valid without supplementary assumptions regarding the smoothness of the regression function f.

4.1.2. Penalized splines by using truncated polynomial basis functions

Let inline image be the vector basis and let inline image with inline image be the least squares minimizer of inline image for inline image The solution is given by

  • display math

with inline image and the penalty matrix A having m−1 0s on the diagonal followed by K 1-values, A=diag(0, …,0,1, …,1). For ρ=0, we obtain the same prediction inline image as with an unpenalized B-spline estimation. This result follows from the fact that the two bases are equivalent; thus there is a square and invertible transition matrix inline image such that inline image (Ruppert et al., 2003). For ρ>0, we have inline image which indicates equivalence to the estimator inline image that is obtained with penalized B-spline fitting given by equation (25) for inline image (see Claeskens et al. (2009) for the expression of inline image satisfying this equation).

In a design-based approach, Breidt et al. (2005) proved that the NMA estimator inline image is the population total of the design-based predictions

  • display math(29)

They also proved that inline image fulfils properties (12) and (13).

4.2. Asymptotic properties of the B-spline estimator of totals under the sampling design

In what follows, we study the asymptotic properties of inline image under the sampling design. We first provide a lemma concerning the convergence of inline image The proofs are based on the results that were provided by Goga (2005) for the unpenalized B-spline estimator and on the fact that the inverse of the matrix inline image is of order O(K) for the penalized B-spline estimator (lemma 1 from Claeskens et al. (2009)).

Lemma 3.

  1. Make assumptions 4, part (b), and 6, 7, part (a), and 8 from Appendix A. Then, inline image
  2. Make assumptions 3, 4, part (b), 5 and 6–8 from Appendix A. Then,
    • display math
    where ‖·‖ is the usual Euclidean norm.

The proof is provided in Appendix A.4.1. We note that, for B-spline functions of order m=1 and λ=0, we obtain a post-stratified estimator with a number of post-strata going to ∞. In this context, lemma 3, part (b), provides a detailed theoretical justification for the post-stratification example in Deville (1999), page 196. We note also that, to obtain the convergence of inline image Breidt et al. (2005) assumed that the result from lemma 3, part (b), holds. Finally, we note that GREG estimators may be viewed as a special case when the number of knots is fixed. References dealing with this issue usually assume that the regression coefficient satisfies the results from lemma 3 (see for example Robinson and Särndal (1983), or Isaki and Fuller (1983)). A similar result was proved by Cardot et al. (2013).

Using lemma 3, we derive the following results.

Proposition 2. Make assumptions 3, 4, part (b), 5 and 6–8 from Appendix A. Then,

  1. inline image
  2. inline image where
    • display math

The proof is provided in Appendix A.4.2. Using the Markov inequality, we see from the first point of proposition 2 that inline image is asymptotically design unbiased for inline image and root n consistent as inline image. The second point provides an approximation of inline image by the non-parametric generalized difference estimator inline image

4.3. Calibration with penalized splines

The spline approach has some interesting calibration properties. Under the unpenalized B-spline framework, the weights inline image that are given by equation (28) satisfy the calibration equation for the known population total of B-spline functions, namely

  • display math

This relationship is easily obtained by using equation (23) (Goga, 2005). Because the spline space inline image is spanned by the B-spline functions inline image, these weights will be calibrated to the total of any polynomial inline image of degree rq=K+m. In particular, inline image and inline image Breidt et al. (2005) proved that using the penalized splines and the truncated polynomial basis functions provides weights that are also calibrated for the finite population totals of the polynomial basis functions inline image
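The calibration property can be verified numerically. The sketch below assumes that the unpenalized weights have the usual GREG-type form w_k = d_k b(x_k)ᵀ (Σ_s d_l b(x_l) b(x_l)ᵀ)⁻¹ Σ_U b(x_l), with d_k the Horvitz–Thompson weight; the display for these weights is not reproduced in the text above, so this form is an assumption, as are the simulated auxiliary values, the simple random sample and every function name.

```python
import random


def bspline_basis(x, K, m):
    """q = K + m B-splines of order m at x in [0, 1] (Cox-de Boor recursion,
    K equidistant interior knots, boundary knots repeated m times)."""
    t = [0.0] * m + [r / (K + 1) for r in range(1, K + 1)] + [1.0] * m
    B = [1.0 if t[i] <= x < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    if x == 1.0:
        B[m + K - 1] = 1.0
    for k in range(2, m + 1):
        nxt = []
        for i in range(len(t) - k):
            left = right = 0.0
            if t[i + k - 1] > t[i]:
                left = (x - t[i]) / (t[i + k - 1] - t[i]) * B[i]
            if t[i + k] > t[i + 1]:
                right = (t[i + k] - x) / (t[i + k] - t[i + 1]) * B[i + 1]
            nxt.append(left + right)
        B = nxt
    return B


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


rng = random.Random(2014)
N, n, K, m = 500, 100, 3, 2          # toy sizes, not those of the survey
q = K + m
x_pop = [rng.random() for _ in range(N)]
sample = rng.sample(range(N), n)     # simple random sampling without replacement
d = N / n                            # Horvitz-Thompson weight

b_pop = [bspline_basis(x, K, m) for x in x_pop]
t_b = [sum(b[j] for b in b_pop) for j in range(q)]   # known population totals
G = [[sum(d * b_pop[k][i] * b_pop[k][j] for k in sample) for j in range(q)]
     for i in range(q)]
g = solve(G, t_b)
w = {k: d * sum(bj * gj for bj, gj in zip(b_pop[k], g)) for k in sample}

calibrated = [sum(w[k] * b_pop[k][j] for k in sample) for j in range(q)]
sum_w = sum(w.values())
sum_wx = sum(w[k] * x_pop[k] for k in sample)
```

Under these assumptions the weighted sample totals in `calibrated` reproduce `t_b`; the weights also sum to N and, because x lies in the spline space for m=2, reproduce the population total of x.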

4.4. Non-parametric penalized spline estimation for non-linear parameters

We now consider the non-linear parameter Φ estimated by inline image with inline image and the weights inline image that are given by equation (28). As in Section 3, to linearize inline image we use a two-step procedure. The first-step linearization is given in theorem 1 provided that assumptions (a) and (b) from lemma 2 are fulfilled. These assumptions are crucial because they ensure the convergence of some non-parametric estimator of M to the true measure M according to the distance inline image Using classical assumptions from a B-spline framework (assumptions 6–8 from Appendix A) and mild assumptions regarding the sampling design (assumptions 3 and 5 from Appendix A), we prove in theorem 2 below that assumptions (a) and (b) in lemma 2 are verified. The proof relies mainly on lemma 3 and the fact that the distance inline image is defined for uniformly bounded functions inline image ensuring that assumption 4, part (b), is automatically fulfilled.

By conducting this first linearization step, we see that the non-parametric B-spline estimator inline image will be approximated by the non-parametric pseudo-B-spline estimator of the total of inline image given by

  • display math

where inline image with inline image

The second-step linearization consists of providing an approximation of inline image by a non-parametric generalized difference pseudoestimator,

  • display math

where inline image To obtain this result, we state in theorem 2, part (b), a supplementary assumption regarding the linearized variable inline image Goga and Ruiz-Gazen (2013) prove that the linearized variable inline image of the odds ratio satisfies this assumption.

Theorem 2. Suppose that the sampling design satisfies assumptions 3 and 5. In addition, assume that conditions 6–8 hold.

  1. Assumptions (a) and (b) from lemma 2 are fulfilled for inline image. As a consequence, inline image Moreover, if the functional T satisfies assumptions 1 and 2, then inline image
  2. Suppose that the linearized variable is such that, for all k in U, inline image satisfy assumption 4, part (b). Then, inline image

The proof is provided in Appendix A.4.3.

5. Variance estimation

In this section we undertake a study of the variance estimation of inline image We suggest a variance estimator for inline image and we give its functional form. Assumptions under which this estimator is consistent are also given.

Let inline image and the measure inline image defined on inline image. A sample in this population inline image is inline image and has size inline image. Moreover, the first-order inclusion probabilities over the synthetic population inline image are inline image, which are exactly the second-order inclusion probabilities with respect to the initial sampling design p(s). The measure inline image is estimated on inline image by inline image where inline image

On the basis of theorem 1 and proposition 1, the asymptotic variance inline image of inline image is given by

  • display math(30)

where inline image is the linearized variable of Φ and inline image for inline image We recognize the Horvitz–Thompson variance of the total of the population residuals inline image We suggest estimating the variance of inline image by using the Horvitz–Thompson variance estimator with inline image replaced by the sample estimators inline image

  • display math(31)

where inline image is the sample estimate of inline image The Horvitz–Thompson variance estimator with true linearized variable is given by

  • display math(32)

The variance that is given in equation (32) depends on the population residuals inline image, for all k ∈ U. We may write inline image as a functional of inline image depending on the parameter inline image

  • display math(33)

Furthermore, the Horvitz–Thompson variance estimator inline image and the variance estimator inline image can be treated in a functional form as follows:

  • display math(34)

where inline image is the vector of sample-based fit residuals with inline image for all k ∈ U. Theorem 3 from Goga et al. (2009) allows us to establish under additional assumptions that the variance estimator (31) is n consistent for the asymptotic variance (30).

Theorem 3. Assume that assumptions 3 and 5 from Appendix A hold. Also assume that inline image holds uniformly in k and inline image If the Horvitz–Thompson variance estimator inline image is n consistent for inline image then the variance estimator inline image is also n consistent for inline image in the sense that inline image

The proof is given in Appendix A.4.4. Note that, because the functional inline image is Fréchet differentiable, the n-consistency of the Horvitz–Thompson estimator inline image for inline image may also be derived with assumptions on the fourth moment of inline image and on fourth-order inclusion probabilities. The reader is referred to Breidt and Opsomer (2000) for additional details.
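For a concrete design, the Horvitz–Thompson variance estimator applied to residuals has a simple closed form. The sketch below assumes simple random sampling without replacement, for which the estimator of the variance of an estimated total is N²(1−n/N)s²/n with s² the sample variance of the estimated residuals, and a normal-based confidence interval. The residual values, the point estimate and the function names are made up for illustration.

```python
import math


def srswor_variance(residuals, N):
    """Horvitz-Thompson variance estimator of an estimated total under simple
    random sampling without replacement, applied to the sample residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    s2 = sum((e - mean) ** 2 for e in residuals) / (n - 1)
    return N * N * (1.0 - n / N) * s2 / n


def confidence_interval(estimate, variance, z=1.96):
    """Normal-based interval; z = 1.96 gives a nominal 95% level."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical estimated residuals of the linearized variable (not survey data).
residuals = [1.0, -1.0, 2.0, -2.0]
variance = srswor_variance(residuals, N=100)
lower, upper = confidence_interval(estimate=0.30, variance=1e-4)
```

For other designs the same residuals would be passed to the variance formula of that design, for example stratum by stratum under stratified simple random sampling.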

6. Empirical results

We consider a data set from the French Labour Force Surveys of 1999 and 2000 as the finite populations of interest. The data consist of the monthly wages (in euros) of 19378 wage-earners who were sampled in both years. The study variable inline image and the auxiliary variable inline image are respectively the wage of person k in 2000 and 1999. The objective of the simulation studies is to investigate the finite sample performance of the regression spline estimators for two non-linear parameters of interest and two different survey designs. We concentrate in practice on the simple approach of regression B-splines and do not consider the penalized B-splines with λ>0. The empirical study of penalized splines raises the problem of estimating the parameter λ which is beyond the scope of the present paper. We illustrate the efficiency of the regression B-splines estimators compared with other estimators, and we also confirm the possibility of conducting valid inference using variance estimators as detailed in the previous section.

The parameters to estimate include the mean, the Gini index and the poverty rate for the wages in 2000 by using the wages in 1999 as auxiliary information. The poverty rate is the proportion of individuals whose wages are below the threshold of 60% of the median wage and corresponds to the low income proportion that was studied in Berger and Skinner (2003). The Gini index and the low income proportion are the complex parameters to be estimated and we provide results for the mean as a benchmark. Details on the low income proportion estimator and its associated linearized variable can be found in Berger and Skinner (2003) and are not provided in the present paper. In Section 6.1, we focus on simple random sampling without replacement and, in Section 6.2, we focus on stratified simple random sampling without replacement. We consider the following estimators for each parameter:

  1. the Horvitz–Thompson estimator HT, which does not incorporate any auxiliary information;
  2. post-stratified estimators POST with a different number of strata bounded at the empirical quantiles for 1999 wages;
  3. the GREG estimator GREG, which takes into account the 1999 wages as auxiliary information by using a simple linear model;
  4. B-spline estimators BS(m) (where m denotes the spline order), which take into account the wages from 1999 as auxiliary information by using a non-parametric model with different numbers of knots, K, located at the quantiles of the empirical distribution for wages from 1999. The m=2 and m=3 orders are considered.

The post-stratified estimator is an example of a B-spline estimator with order m=1. The number of strata corresponds to the number of interior knots K plus 1.

To use the regression B-spline estimators that we propose in a complex survey and to derive confidence intervals, the user must be able to calculate the weights that are given in equation (28) and the residuals inline image of equation (33). The weights depend on a spline basis that is easy to obtain by using for instance the transreg procedure in the SAS software (SAS Institute, 2010) or the functions spline.des or bs from the splines package in the R software (R Core Team, 2012). Then, it is possible to use standard calibration algorithms by simply providing the m+K B-spline basis functions as auxiliary variables for calculating the calibrated weights that correspond to equation (28). These weights are needed to calculate the substitution estimator of the parameter of interest (e.g. expression (15) for the Gini index). To estimate the variance, the linearized variable that is associated with the parameter must be estimated. For several inequality indicators, including the Gini index and the low income proportion, some SAS macro programs exist (Dell et al., 2002). Similar functions are available in the R language on request from the authors of the present paper. Once the linearized variable has been estimated, the residuals of this variable against the auxiliary variable by using regression splines are calculated; this can be accomplished with the transreg procedure in the SAS software. Then, by using the residuals as if they were the study variable in standard variance estimation tools for complex surveys, the user can obtain the estimated approximate variance and derive confidence intervals.
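Once a set of weights is available, the substitution (plug-in) estimators are straightforward to compute. The sketch below uses one common weighted form of the Gini index and takes the low income proportion to be the weighted share of units strictly below 60% of the weighted median. Expression (15) of the paper is not reproduced in this section, so both formulas, the function names and the toy values are assumptions rather than the authors' exact definitions.

```python
def weighted_gini(y, w):
    """Plug-in Gini index for values y with survey weights w, using
    G = [sum_k w_k (2 C_k - w_k) y_k] / (N_hat * t_hat) - 1,
    where C_k is the cumulative weight up to and including unit k (y sorted)."""
    pairs = sorted(zip(y, w))
    n_hat = sum(wk for _, wk in pairs)
    t_hat = sum(wk * yk for yk, wk in pairs)
    cum = 0.0
    num = 0.0
    for yk, wk in pairs:
        cum += wk
        num += 2.0 * wk * cum * yk - wk * wk * yk
    return num / (n_hat * t_hat) - 1.0


def weighted_median(y, w):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(y, w))
    half = 0.5 * sum(w)
    cum = 0.0
    for yk, wk in pairs:
        cum += wk
        if cum >= half:
            return yk
    return pairs[-1][0]


def low_income_proportion(y, w, fraction=0.6):
    """Weighted share of units with y strictly below fraction * weighted median."""
    threshold = fraction * weighted_median(y, w)
    return sum(wk for yk, wk in zip(y, w) if yk < threshold) / sum(w)


# Toy wages with equal weights (hypothetical values, not survey data).
wages = [100.0, 200.0, 300.0, 400.0, 1000.0]
weights = [1.0] * 5

gini = weighted_gini(wages, weights)
poverty = low_income_proportion(wages, weights)
```

Replacing `weights` by Horvitz–Thompson, GREG or B-spline calibrated weights gives the corresponding estimator of each parameter with no other change to the code, which is the practical appeal of a single system of weights.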

For each simulation scheme, we draw NS samples according to the sampling design and compare the finite sample properties of the estimators HT, GREG, POST and BS(2) and BS(3). We set knots at the quantiles of the empirical distribution of the auxiliary variable in the sample. We also compared the results with knots set at the quantiles of the empirical distribution of the auxiliary variable over the entire population. Both results are very similar; thus, we report only on the first method. For the estimators POST, BS(2) and BS(3) we tried different numbers of knots K but only report the results for K=2 and K=4. For the post-stratified estimator, K=2 and K=4 respectively correspond to three and five strata. To summarize, in what follows, we compare eight estimators (HT and GREG, and POST, BS(2) and BS(3) with K=2 and K=4).

There are several ways to estimate the linearized variable (see Section 5). In our simulations, the results are almost the same, regardless of whether we use the simple Horvitz–Thompson weights, the GREG weights or the B-spline weights for estimating the linearized variable. We recommend using the simplest weights (i.e. the Horvitz–Thompson weights), which is what we do in the present study.

The performance of an estimator inline image of a parameter θ is evaluated by using the following Monte Carlo measures:

  1. the relative bias in percentage,
    • display math
  2. the ratio of root-mean-squared errors in percentage,
    • display math
  3. Monte Carlo coverage probabilities for a nominal coverage probability of 95%.
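The three measures can be computed from the simulated estimates as sketched below. The displayed formulas are not reproduced in the text above, so the definitions used here are assumptions: the relative bias is taken relative to the true value θ, the ratio of root-mean-squared errors is taken relative to the Horvitz–Thompson estimator (consistent with HT being reported as 100 in the tables), and coverage is based on normal intervals. All names and numbers are illustrative.

```python
import math


def relative_bias(estimates, theta):
    """Relative bias in percent: 100 * (mean of estimates - theta) / theta."""
    return 100.0 * (sum(estimates) / len(estimates) - theta) / theta


def mse(estimates, theta):
    """Monte Carlo mean-squared error around the true value theta."""
    return sum((e - theta) ** 2 for e in estimates) / len(estimates)


def rrmse(estimates, ht_estimates, theta):
    """Ratio of root-mean-squared errors in percent, relative to HT."""
    return 100.0 * math.sqrt(mse(estimates, theta) / mse(ht_estimates, theta))


def coverage(estimates, variances, theta, z=1.96):
    """Percentage of intervals estimate +/- z * sqrt(variance) that contain theta."""
    hits = sum(1 for e, v in zip(estimates, variances)
               if abs(e - theta) <= z * math.sqrt(v))
    return 100.0 * hits / len(estimates)


# Made-up simulation output for four replications (illustration only).
theta = 10.0
est = [9.0, 11.0, 10.0, 10.0]
est_ht = [8.0, 12.0, 10.0, 10.0]

rb = relative_bias(est, theta)
ratio = rrmse(est, est_ht, theta)
cover_wide = coverage(est, [1.0] * 4, theta)
cover_narrow = coverage(est, [0.01] * 4, theta)
```

A ratio below 100 indicates an estimator that is more efficient than HT; coverage well below the nominal 95% points to an underestimated variance.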

6.1. Simple random sampling without replacement

The first survey design that we consider is simple random sampling without replacement with three sample sizes (n=200, n=500 and n=1000). The number of simulations is NS=3000. The eight estimators are compared and relative biases and ratios of the roots of the mean-squared errors are provided in Table 1 for the various parameters and sample sizes.

Table 1. RRMSE or RB (in parentheses) of estimators HT, GREG and POST, BS(2) and BS(3) for the mean, the Gini index and the low income proportion
Results in percentages; for POST, BS(2) and BS(3) each entry reads K=2–K=4.

Parameter     n     HT       GREG    POST           BS(2)          BS(3)
Mean          200   100 (0)  38 (0)  71 (0)–63 (0)  38 (0)–37 (0)  39 (0)–41 (0)
Mean          500   100 (0)  40 (0)  73 (0)–65 (0)  40 (0)–39 (0)  38 (0)–39 (0)
Mean          1000  100 (0)  40 (0)  73 (0)–66 (0)  40 (0)–40 (0)  38 (0)–39 (0)
Gini index    200   100 (1)  96 (1)  92 (1)–80 (1)  53 (2)–53 (2)  70 (3)–70 (3)
Gini index    500   100 (1)  93 (0)  93 (1)–85 (1)  50 (1)–50 (1)  59 (1)–56 (1)
Gini index    1000  100 (0)  92 (0)  93 (0)–86 (0)  49 (0)–48 (0)  55 (1)–51 (1)
Poverty rate  200   100 (2)  95 (0)  92 (0)–80 (0)  65 (1)–65 (1)  72 (1)–63 (1)
Poverty rate  500   100 (0)  95 (0)  88 (0)–78 (0)  64 (0)–64 (0)  68 (0)–62 (0)
Poverty rate  1000  100 (1)  94 (0)  89 (0)–78 (0)  64 (0)–64 (0)  67 (0)–61 (0)

Not surprisingly, for complex parameters, the largest gain in efficiency is observed when the B-spline estimators are compared with the estimator HT without auxiliary information. Because the wages from 2000 are almost linearly related to the wages from 1999, considering the B-spline estimator instead of the GREG estimator does not improve the performance of the mean estimation. However, regarding the Gini index and the low income proportion, the incorporation of auxiliary information by using GREG estimators does not improve efficiency compared with the estimator HT whereas using a B-spline approach improves the results especially for spline functions of order m=2. When comparing the estimator POST with the estimators BS(2) and BS(3) we note that there is quite a large gain in efficiency when order m=2 is used instead of m=1, whereas there is a loss of efficiency when m=3 is used instead of m=2, especially for sample sizes that are smaller than 1000. Moreover, for m=2 and m=3, the results do not depend heavily on the number of knots and are similar for K between 2 and 4 whereas, for the post-stratified estimator, there are large variations in the results depending on whether we consider three or five strata.

The coverage probabilities in Table 2 illustrate that valid inference can be carried out by using B-spline estimators as long as the spline order is not too high, especially when the sample size is not very large. No problems are detected for B-splines of order m=1 and order m=2 even when the sample size is n=200; however, for m=3 and n=200, the coverage probabilities for the Gini index estimation are approximately 75%, which is quite far from the 95% nominal probability. This result indicates that, for a moderate sample size, the variance may be underestimated when the order of the splines is larger than 2. The results are not given for m=4 but we have observed that the problem worsens when we increase the order of the splines. This is not surprising given the double linearization and the non-parametric estimation involved.

Table 2. Coverage probabilities for estimators HT, GREG and POST, BS(2) and BS(3)
Coverage probabilities in percentages; for POST, BS(2) and BS(3) each entry reads K=2–K=4.

Parameter     n     HT  GREG  POST   BS(2)  BS(3)
Mean          200   94  95    93–92  93–93  90–88
Mean          500   95  94    93–94  93–93  91–91
Mean          1000  95  95    94–93  94–94  93–93
Gini index    200   94  93    94–94  89–87  74–75
Gini index    500   93  93    93–94  91–90  83–85
Gini index    1000  95  94    95–94  94–93  88–90
Poverty rate  200   94  95    95–95  95–94  94–94
Poverty rate  500   93  95    95–94  95–95  96–95
Poverty rate  1000  94  95    96–96  95–96  96–95

6.2. Stratified simple random sampling without replacement

For each simulation, we draw NS=5000 samples from the French labour force population according to a stratified simple random sampling design without replacement. We compare the finite sample properties of the eight estimators that were considered in the previous subsection. The strata are spatial divisions of the French territory into six ‘regions’ that correspond to the major socio-economic regions of metropolitan France as defined by Eurostat. These regions are the first level of the nomenclature of territorial units for statistics classification. For our example, we grouped the northern and eastern regions together and we grouped the Mediterranean and the south-western regions together. The sample size inside each stratum is 200, making the total sample size 1200. Thus, we used an unequal probability design with a sample rate inside the strata that varied from 5% to 9.3%.
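The design described above can be mimicked with a few lines of code. The sketch below draws a stratified simple random sample without replacement with a fixed sample size of 200 per stratum and attaches the Horvitz–Thompson weights N_h/n_h. The three stratum sizes are toy values chosen only so that the sampling rates fall between 5% and about 9.3%; they are not the actual sizes of the six regions, which are not given in the text, and the function names are hypothetical.

```python
import random


def stratified_srswor(stratum_sizes, n_h, rng):
    """Draw n_h units without replacement in each stratum and return a list of
    (stratum, unit index within stratum, Horvitz-Thompson weight N_h / n_h)."""
    sample = []
    for h, N_h in enumerate(stratum_sizes):
        weight = N_h / n_h
        for k in rng.sample(range(N_h), n_h):
            sample.append((h, k, weight))
    return sample


def ht_total(sample, y):
    """Horvitz-Thompson estimator of the total of y(h, k)."""
    return sum(weight * y(h, k) for h, k, weight in sample)


stratum_sizes = [4000, 3000, 2150]   # toy sizes, NOT the survey's strata
n_h = 200
rng = random.Random(0)

sample = stratified_srswor(stratum_sizes, n_h, rng)
rates = [n_h / N_h for N_h in stratum_sizes]          # inclusion probabilities
n_hat = ht_total(sample, lambda h, k: 1.0)            # estimated population size
```

Because the sample size is the same in every stratum while the stratum sizes differ, the inclusion probabilities in `rates` are unequal across strata, and the weighted count `n_hat` recovers the population size exactly.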

As previously described, we set the knots at the quantiles of the empirical distribution of the auxiliary variable in the sample and we estimate the linearized variable by using the Horvitz–Thompson weights. The simulation results are reported in Tables 3 and 4 and the conclusions are similar to those obtained from the simple random sampling design without replacement when the size of the sample is n=200, which corresponds to the sample sizes inside each stratum. It is beneficial to use the available auxiliary information when estimating the mean but there is no need to use non-parametric estimators because they are not more efficient than the GREG estimator. However, for complex parameters, using a GREG estimator to take auxiliary information into account is not worthwhile in terms of variance whereas important gains can be made by using B-spline estimators. The empirical coverage probabilities are all very good except for the Gini index B-spline estimator of order 3 with values equal to 89–90%, which confirms the problem of variance underestimation for moderate sample sizes and splines of order 3.

Table 3. RRMSE or RB (in parentheses) of estimators HT, GREG and POST, BS(2) and BS(3)
Results in percentages; for POST, BS(2) and BS(3) each entry reads K=2–K=4.

Parameter     HT       GREG    POST           BS(2)          BS(3)
Mean          100 (0)  40 (0)  73 (0)–66 (0)  40 (0)–40 (0)  40 (0)–40 (0)
Gini index    100 (0)  93 (0)  94 (0)–88 (0)  50 (0)–50 (0)  55 (1)–52 (1)
Poverty rate  100 (0)  93 (0)  88 (0)–77 (0)  65 (0)–64 (0)  68 (0)–62 (0)

On the basis of this example we do not recommend the use of high order values for B-spline regression, especially when the sample sizes are smaller than 500. However, choosing m=2 instead of m=1 (which corresponds to post-stratification) leads to a clear improvement in terms of efficiency for complex parameters such as the Gini index or the low income proportion, and we recommend this choice.

Table 4. Coverage probabilities for estimators HT, GREG and POST, BS(2) and BS(3)
Coverage probabilities in percentages; for POST, BS(2) and BS(3) each entry reads K=2–K=4.

Parameter     HT  GREG  POST   BS(2)  BS(3)
Mean          95  95    95–94  94–94  93–92
Gini index    95  95    95–95  93–93  89–90
Poverty rate  94  95    95–95  95–95  96–95

7. Discussion

In this paper we considered the important problem of non-linear parameter estimation in a finite population framework by taking into account the survey design and a single auxiliary variable known for all the population units. Examples of non-linear parameters are concentration and inequality measures, such as the Gini index or the low income proportion. We proposed a general class of substitution estimators that allows us to take into account the auxiliary information via an NMA approach. The asymptotic variance of this class of estimators was derived, based on broad assumptions, and variance estimators were proposed. Our main result was that the asymptotic variance depends on the extent to which the auxiliary variable inline image explains the variation in the linearized variable inline image. Because linearized variables of non-linear parameters are likely to be non-linearly related to auxiliary information, a non-parametric approach is recommended. The estimators proposed are based on weights that are sufficiently flexible to increase the efficiency of finite population totals estimators for any study variable and to allow the consideration of parameters that are more complex than totals. Moreover, the penalized B-spline estimators were studied in detail, and the theoretical results were confirmed for regression B-spline estimators by using one case-study.

Our proposal can be extended in several ways. In particular, further research can extend this proposal to include multivariate auxiliary information by means of additive models, as in Breidt et al. (2005), or single-index models as in Wang (2009).

Acknowledgements

We are grateful to Patrick Gabriel for his precious help with lemma 1, to Hervé Cardot for helpful discussions and to Didier Gazen for his assistance with the simulations. We also thank two referees and the Associate Editor for their comments and remarks that have improved the content and the presentation of the paper.

Appendix A: Assumptions and proofs

A.1. Assumptions on functional T and on sampling design

Assumption 1. The functional T is homogeneous, in that there is a real number α>0, dependent on T, such that inline image for any real r>0. We also assume that inline image

Assumption 2. The functional T is Fréchet differentiable at M/N, i.e. there is a functional T(M/N;Δ) that is linear in Δ such that

  • display math

with inline image

We note that the strong assumption of Fréchet differentiability can be weakened to compact or Hadamard differentiability. However, for Hadamard differentiability, functionals are considered with respect to the empirical distribution function and the distance inline image should be replaced by the sup-norm. Additional assumptions are then needed to ensure the consistency of the estimator of the empirical distribution function. Motoyama and Takahashi (2008) studied the asymptotic behaviour of Hadamard statistical functionals but only for simple random sampling without replacement.

Assumption 3. inline image

Assumption 4.

  (a) inline image with ξ-probability 1.
  (b) inline image with C a positive constant not depending on N.

Assumption 5. inline image, inline image with inline image and inline image some positive constants and

  • display math

Assumptions 3 and 5 deal with first- and second-order inclusion probabilities and are quite classical in survey sampling theory (see also Robinson and Särndal (1983) and Breidt and Opsomer (2000)). They are satisfied for many sampling designs including rejective sampling (Boistard et al., 2012). Assumption 4, part (a), is a regularity condition that is necessary to obtain the consistency results. Some results need the stronger assumption 4, part (b).

A.2. Assumptions on B-splines

Assumption 6. There is a distribution function Q(z) with strictly positive density on [0,1] such that inline image with inline image the empirical distribution of inline image

Assumption 7.

  (a) K=o(N);
  (b) inline image with inline image.

Assumption 8. inline image where inline image with c a constant that depends only on p and the design density.

These assumptions are classical in non-parametric regression (Agarwal and Studden, 1980; Burman, 1991; Zhou et al., 1998). Assumption 6 means that, asymptotically, there is no subinterval in [0,1] without points inline image, and assumption 7 ensures that the dimension of the B-spline basis goes to ∞, but not too fast, when the population and the sample sizes go to ∞. Assumption 8 concerns the penalty λ as used by Claeskens et al. (2009).
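As an illustration of the objects that these assumptions govern, the following sketch evaluates a B-spline basis on [0,1] with the Cox–de Boor recursion. It assumes the usual convention that K interior knots and splines of degree d (order d+1) give a basis of dimension K+d+1; the function name `bspline_basis`, the clamped knot vector and the equidistant knots are illustrative choices, not taken from the paper.

```python
def bspline_basis(x, interior_knots, degree):
    """Evaluate every B-spline basis function of the given degree at x in [0, 1).

    Hypothetical helper using the Cox-de Boor recursion on a clamped knot
    vector (boundary knots 0 and 1 repeated degree + 1 times).
    """
    knots = [0.0] * (degree + 1) + list(interior_knots) + [1.0] * (degree + 1)
    # Degree-0 functions: indicators of the half-open knot intervals.
    basis = [1.0 if knots[j] <= x < knots[j + 1] else 0.0
             for j in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        raised = []
        for j in range(len(knots) - d - 1):
            left_den = knots[j + d] - knots[j]
            right_den = knots[j + d + 1] - knots[j + 1]
            # Terms with a zero denominator (repeated knots) are dropped.
            left = (x - knots[j]) / left_den * basis[j] if left_den > 0 else 0.0
            right = ((knots[j + d + 1] - x) / right_den * basis[j + 1]
                     if right_den > 0 else 0.0)
            raised.append(left + right)
        basis = raised
    return basis


# K = 4 equidistant interior knots, quadratic splines: 4 + 2 + 1 = 7 functions.
values = bspline_basis(0.5, [0.2, 0.4, 0.6, 0.8], degree=2)
print(len(values), sum(values))
```

At any point of [0,1) the basis functions are non-negative, sum to 1, and at most d+1 of them are non-zero. This local support is why the matrices built from the basis are banded, and letting K grow slowly with the population size is what keeps such matrices well conditioned.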

A.3. Proofs of results from Section 3

A.3.1. Proof of lemma 1

Let inline image and let inline image be the sample membership. Following the same lines as in Breidt and Opsomer (2000), we have, by using equation (16),

  • display math

uniformly in h by assumptions 3 and 5 and using the fact that inline image.

A.3.2. Proof of lemma 2

We have

  • display math

From the proof of lemma 1, we see that the first term from the right-hand side is of order inline image uniformly in h because

  • display math

by construction of inline image and assumption (a) of lemma 2. The result follows by using assumption (b) of lemma 2 because inline image uniformly in inline image

A.3.3. Proof of theorem 1

Under assumption 2, we provide a first-order von Mises (1947) expansion of T in inline image around M/N,

  • display math

Using the fact that, for a functional of degree α (assumption 1), we have inline image, and we write

  • display math(35)

since inline image Now, inline image and, hence, relation (31) becomes

  • display math

A.4. Proofs of results from Section 4

We state below several lemmas that are useful for the proofs of our main results. For a matrix inline image we consider the norm defined by inline image the spectral norm inline image and the Hilbert–Schmidt or the trace norm inline image We have inline image and inline image

We denote by

  • display math

and by

  • display math

its estimator.

Lemma 4. Make assumptions 6, 7, part (a), and 8. Then,

  (a) inline image (lemma 6.2 from Zhou et al. (1998)). We have inline image (lemma 6.3 from Zhou et al. (1998)).
  (b) inline image (lemma 6.3 from Claeskens et al. (2009)).

Lemma 5 (Goga, 2005). Make assumptions 3, 4, part (a), 5, 6 and 7, part (a). Then,

  (a) inline image and
  (b) inline image.

A.4.1. Proof of lemma 3
  1. When inline image is uniformly bounded (assumption 4, part (b)), we have, using lemma 3, part (a), from Goga (2005), that
    • display math(36)
    since for k,l ∈ U with |k−l|>m we have inline image
  2. For part (a), inline image is bounded following Goga (2005), inline image by lemma 4, part (b), and relation (32).
  3. Furthermore, we have
    • display math(37)
    Under assumptions 4 and 5, inline image is bounded by
    • display math
    We have that
    • display math(38)
    and
    • display math
    for inline image with inline image under assumption 7, part (b), implying that inline image is invertible and
    • display math
    As a result, inline image is also invertible. Using lemma 5, part (a), we obtain that
    • display math
    From lemmas 4 and 5, we obtain that
    • display math
    Finally, we have that
    • display math

A.4.2. Proof of proposition 2

Consider first part (b). Following the same lines as in the proof of lemma 1 and using the fact that inline image for all k ∈ U (Burman, 1991), we obtain that

  • display math(39)

Furthermore,

  • display math

by equation (35) and lemma 3, part (b). Then, the result follows by using the Markov inequality.

For part (a), now, we consider the error inline image We write

  • display math(40)

By assumptions 3, 4, part (b), and 5, we have that inline image Moreover, using relation (35) and lemma 3, part (a), we have

  • display math

which implies that

  • display math

by the fact that inline image, using assumption 7.


A.4.3. Proof of theorem 2
  1. We check that assumptions (a) and (b) in lemma 2 are fulfilled. We have that
    • display math
    with inline image for all k ∈ U. Following expressions (32) and (33), we obtain that inline image uniformly in h and
    • display math(41)
    uniformly in h. Now, we check assumption (b) in lemma 2, namely
    • display math
    uniformly in h.
  2. We have
    • display math
  3. The first term on the right-hand side does not depend on h and is of order inline image (equation (40)). For the second term on the right-hand side, we can use the proof of lemma 3, more exactly equation (38), and the fact that inline image to obtain
    • display math
    Finally, we obtain that
    • display math
    for inline image with inline image
  4. We write
    • display math
    because inline image (equation (40)) and inline image by lemma 5.

A.4.4. Proof of theorem 3

The proof of theorem 3 follows the same basic steps as in theorem 3 from Goga et al. (2009) and result 4 from Chaouch and Goga (2010). Let

  • display math

with inline image given by equation (34) and let also

  • display math

The quantity inline image can be written as

  • display math

Now,

  • display math

by assumptions 3 and 5 and the supplementary assumption of theorem 3. Using the same arguments as above, we obtain inline image and inline image. Hence, inline image and the result then follows because inline image and

  • display math

References

  • Agarwal, G. G. and Studden, W. J. (1980) Asymptotic integrated mean square error using least squares and bias minimizing splines. Ann. Statist., 8, 1307–1325.
  • Berger, Y. G. and Skinner, C. J. (2003) Variance estimation for a low income proportion. Appl. Statist., 52, 457–468.
  • Binder, D. A. (1983) On the variances of asymptotically normal estimators from complex surveys. Int. Statist. Rev., 51, 279–292.
  • Binder, D. A. and Kovacevic, M. S. (1995) Estimating some measures of income inequality from survey data: an application of the estimating equations approach. Surv. Methodol., 21, 137–145.
  • Boistard, H., Lopuhaä, H. P. and Ruiz-Gazen, A. (2012) Approximation of rejective sampling inclusion probabilities and applications. Electron. J. Statist., 6, 1967–1983.
  • Breidt, F. J., Claeskens, G. and Opsomer, J. (2005) Model-assisted estimation for complex surveys using penalised splines. Biometrika, 92, 831–846.
  • Breidt, F. J. and Opsomer, J. (2000) Local polynomial regression estimators in survey sampling. Ann. Statist., 28, 1026–1053.
  • Breidt, F. J. and Opsomer, J. (2009) Nonparametric and semiparametric estimation in complex surveys. In Handbooks of Statistics, vol. (eds D. Pfeffermann and C. R. Rao), pp. 103–121. Amsterdam: North-Holland.
  • Burman, P. (1991) Regression function estimation from dependent observations. J. Multiv. Anal., 36, 263–279.
  • Cardot, H. (2000) Spatially adaptive splines for statistical linear inverse problems. J. Multiv. Anal., 81, 100–119.
  • Cardot, H. (2002) Local roughness penalties for regression splines. Computnl Statist., 17, 89–102.
  • Cardot, H., Goga, C. and Lardin, P. (2013) Uniform convergence and asymptotic confidence bands for model-assisted estimators of the mean of sampled functional data. Electron. J. Statist., 7, 562–596.
  • Cassel, C. M., Särndal, C. E. and Wretman, J. H. (1976) Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63, 615–620.
  • Chaouch, M. and Goga, C. (2010) Design-based estimation for geometric quantiles with application to outlier detection. Computnl Statist. Data Anal., 54, 2214–2229.
  • Claeskens, G., Krivobokova, T. and Opsomer, J. (2009) Asymptotic properties of penalized spline estimators. Biometrika, 96, 529–544.
  • Dell, F., d'Haultfœuille, X., Février, P. and Massé, P. (2002) Mise en œuvre du calcul de variance par linéarisation. In Insee-Méthodes.
  • Deville, J. C. (1999) Variance estimation for complex statistics and estimators: linearization and residual techniques. Surv. Methodol., 25, 193–203.
  • Dierckx, P. (1993) Curves and Surface Fitting with Splines. Oxford: Clarendon.
  • Fuller, W. A. (2009) Sampling Statistics. New York: Wiley.
  • Goga, C. (2005) Réduction de la variance dans les sondages en présence d'information auxiliaire: une approche non paramétrique par splines de régression. Can. J. Statist., 33, 1–18.
  • Goga, C., Deville, J. C. and Ruiz-Gazen, A. (2009) Use of functionals in linearization and composite estimation with application to two-sample survey data. Biometrika, 96, 691–709.
  • Goga, C. and Ruiz-Gazen, A. (2013) Estimating the odds-ratio using auxiliary information. Math. Popln Stud., to be published.
  • Hall, P. and Opsomer, J. (2005) Theory for penalized spline regression. Biometrika, 92, 105–118.
  • Hampel, F. R. (1974) The influence curve and its role in robust statistics. J. Am. Statist. Ass., 69, 383–393.
  • Harms, T. and Duchesne, P. (2010) On kernel nonparametric regression designed for complex survey data. Metrika, 72, 111–138.
  • Huber, P. J. (1981) Robust Statistics. New York: Wiley.
  • Isaki, C. T. and Fuller, W. A. (1983) Survey design under the regression superpopulation model. J. Am. Statist. Ass., 77, 89–96.
  • Johnson, A. A., Breidt, F. J. and Opsomer, J. (2008) Estimating distribution function from survey data using nonparametric regression. J. Statist. Theor. Pract., 2, 419–431.
  • Kauermann, G., Krivobokova, T. and Fahrmeir, L. (2009) Some asymptotic results on generalized penalized spline smoothing. J. R. Statist. Soc. B, 71, 487–503.
  • von Mises, R. (1947) On the asymptotic distribution of differentiable statistical functions. Ann. Math. Statist., 18, 309–348.
  • Motoyama, H. and Takahashi, H. (2008) Smoothed versions of statistical functionals from a finite population. J. Jpn Statist. Soc., 38, 475–504.
  • Nygard, F. and Sandström, A. (1985) The estimation of the Gini and the entropy inequality parameters in finite populations. J. Off. Statist., 4, 399–412.
  • Opsomer, J. and Miller, C. P. (2005) Selecting the amount of smoothing in nonparametric regression estimation for complex surveys. J. Nonparam. Statist., 17, 593–611.
  • Opsomer, J. and Wang, J. C. (2011) On the asymptotic normality and variance estimation of nondifferentiable survey estimators. Biometrika, 98, 91–106.
  • R Core Team (2012) R: a Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
  • Robinson, P. and Särndal, C. E. (1983) Asymptotic properties of the generalized regression estimator in probability sampling. Sankhya B, 45, 240–248.
  • Ruppert, D. and Carroll, R. J. (2000) Spatially-adaptative penalties for spline fitting. Aust. New Zeal. J. Statist., 42, 205–223.
  • Ruppert, D., Wand, M. P. and Carroll, R. (2003) Semiparametric Regression. Cambridge: Cambridge University Press.
  • Sarda, P. and Vieu, P. (2000) Kernel regression. In Smoothing and Regression: Approaches, Computation, and Application (ed. M. G. Schimek). New York: Wiley.
  • Särndal, C. E. (1980) On the π-inverse weighting best linear unbiased weighting in probability sampling. Biometrika, 67, 639–650.
  • Särndal, C. E. (2007) The calibration approach in survey theory and practice. Surv. Methodol., 33, 99–119.
  • Särndal, C. E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. Berlin: Springer.
  • SAS Institute (2010) SAS/STAT® 9.22 User's Guide. Cary: SAS Institute.
  • Schumaker, L. L. (1981) Spline Functions: Basic Theory. New York: Wiley.
  • Shao, J. (1994) L-statistics in complex survey problems. Ann. Statist., 22, 946–967.
  • Wang, L. (2009) Single-index model-assisted estimation in survey sampling. J. Nonparam. Statist., 21, 487–504.
  • Zhou, S., Shen, X. and Wolfe, D. A. (1998) Local asymptotics for regression splines and confidence regions. Ann. Statist., 26, 1760–1782.