We give a formal definition of a representative sample, but roughly speaking, it is a scaled-down version of the population, capturing its characteristics. New methods for selecting representative probability samples in the presence of auxiliary variables are introduced. Representative samples are needed for multipurpose surveys, when several target variables are of interest. Such samples also enable estimation of parameters in subspaces and improved estimation of target variable distributions. We describe how two recently proposed sampling designs can be used to produce representative samples. Both designs use distance between population units when producing a sample. We propose a distance function that can calculate distances between units in general auxiliary spaces. We also propose a variance estimator for the commonly used Horvitz–Thompson estimator. Real data as well as illustrative examples show that representative samples are obtained and that the variance of the Horvitz–Thompson estimator is reduced compared with simple random sampling.
Consider a population U = {1,2, … ,N} of N units. The main goal is to select a sample so that various parameters can be estimated with good precision. Often, the totals or means of one or more study variables are the parameters of interest, but other parameters such as quantiles may also be of interest. Let y be a target variable, which takes a fixed value y_{i} for unit i. The total of y is denoted Y = ∑_{i=1}^{N} y_{i}. The available information is a set of auxiliary variables x_{i} = (x_{i1},x_{i2}, … ,x_{iq}), whose values are preferably known for each unit i ∈ U. Missing values can exist, but it must be possible to calculate a distance between all pairs of units. The x variables may be a mix of qualitative and quantitative variables. It is also assumed that each unit i ∈ U has a prescribed inclusion probability π_{i} with ∑_{i=1}^{N} π_{i} = n, where n is the sample size. Let I_{i} be the inclusion indicator for unit i, that is, I_{i} = 1 if unit i is in the sample and I_{i} = 0 otherwise. The total Y can then be estimated from a sample by the unbiased Horvitz–Thompson (HT) estimator
$$\hat{Y} = \sum_{i=1}^{N} \frac{y_i}{\pi_i} I_i, \tag{1}$$
which, if the π_{i}s are equal, may also be expressed as Ŷ = Nȳ, where ȳ is the sample mean.
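As a small illustration of (1), the following sketch computes the HT estimate for a toy population and checks that, with equal inclusion probabilities, it coincides with N times the sample mean. The data values, the sampled indices, and the function name `horvitz_thompson_total` are made up for illustration.

```python
from typing import Sequence


def horvitz_thompson_total(y: Sequence[float], pi: Sequence[float],
                           sample: Sequence[int]) -> float:
    """HT estimate of the total: sum of y_i / pi_i over the sampled units."""
    return sum(y[i] / pi[i] for i in sample)


# Toy population of N = 6 units with equal inclusion probabilities n/N = 3/6.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
pi = [0.5] * 6
sample = [0, 2, 5]  # indices of the sampled units (a hypothetical draw)

estimate = horvitz_thompson_total(y, pi, sample)
# With equal probabilities the estimate equals N times the sample mean.
n_times_mean = len(y) * sum(y[i] for i in sample) / len(sample)
```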
Often, it is desirable that the sample accurately reflect the entire population in as many ways as possible. The term representative sample is often used, and it seems that the common view of a representative sample is that it is a miniature version of the population. Before we can give our definition of a representative sample, we need to define what we mean by a coherent subset.
Definition 1. (Coherent subset) Let i ∈ U and let r ≥ 0 be a given radius. We say that U^{ * } is a coherent subset of U if the following holds. Unit j ∈ U is included in U^{ * } if and only if d(i,j) ≤ r, where d(i,j) is the distance between the units i and j. Thus, U^{ * } can be constructed by including all units within a ball of radius r from some unit i.
Definition 2. (Representative sample) A sample of size n from U is said to be representative if, for every coherent subset U^{ * } ⊂ U, we have
$$n^{*} \approx \frac{n}{N} N^{*},$$
where n^{ * } is the number of sampled units from U^{ * } and N^{ * } the size of U^{ * }.
Remark 1. Note that it is not required that a sample has been randomly selected for it to be representative. However, in this paper, we only treat random sampling to have statistically valid and unbiased estimators.
Example 1. (Representative sample) Let there be two auxiliary variables, age (young or old) and gender (male or female). Then, old male subjects constitute a coherent subset of the population. According to the definition, a representative sample then must have the same proportion of old male subjects as in the population. This is in agreement with our intuitive feeling of a representative sample.
Definition 3. (Well-spread sample) A sample s is said to be well spread or spatially balanced with respect to the inclusion probabilities if, for every coherent subset U^{ * } ⊂ U, we have
$$n^{*} \approx \sum_{i \in U^{*}} \pi_i,$$
where n^{ * } is the number of sampled units from U^{ * }.
Remark 2. With our definitions, the concepts of representative and well-spread samples are similar, but the latter is connected to a set of prescribed (possibly unequal) inclusion probabilities. A sample selected with equal inclusion probabilities that is well spread is also representative. However, a sample selected with unequal inclusion probabilities may be well spread, but it is unlikely to be representative.
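Definitions 1 and 3 can be checked numerically on a small population. The sketch below enumerates balls around every unit for a few radii and reports the largest gap between the number of sampled units in a ball and the sum of inclusion probabilities in it. The one-dimensional population, the set of radii, and the function names are illustrative assumptions, and the definitions do not specify how small the gap must be.

```python
from typing import Callable, List, Sequence


def coherent_subset(center: int, radius: float, n_units: int,
                    dist: Callable[[int, int], float]) -> List[int]:
    """All units j with d(center, j) <= radius (Definition 1)."""
    return [j for j in range(n_units) if dist(center, j) <= radius]


def max_discrepancy(sample: Sequence[int], pi: Sequence[float],
                    radii: Sequence[float],
                    dist: Callable[[int, int], float]) -> float:
    """Largest |n* - sum of pi over U*| over all balls (centre, radius)."""
    in_sample = set(sample)
    worst = 0.0
    for center in range(len(pi)):
        for r in radii:
            ball = coherent_subset(center, r, len(pi), dist)
            n_star = sum(1 for j in ball if j in in_sample)
            expected = sum(pi[j] for j in ball)
            worst = max(worst, abs(n_star - expected))
    return worst


x = list(range(10))                # one quantitative auxiliary variable
pi = [0.5] * 10                    # equal inclusion probabilities, n = 5
dist = lambda i, j: abs(x[i] - x[j])
radii = [0, 1, 2, 3]

spread = [0, 2, 4, 6, 8]           # every second unit
clumped = [0, 1, 2, 3, 4]          # all sampled units at one end
gap_spread = max_discrepancy(spread, pi, radii, dist)
gap_clumped = max_discrepancy(clumped, pi, radii, dist)
```

In this toy case the evenly spread sample never misses the expected count by more than half a unit, whereas the clumped sample misses by 2.5 units in some balls.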
Proposition 1. With our definition of a representative sample, the following two conditions are together sufficient for a sampling design to yield representative samples:
(i) The design uses equal inclusion probabilities, that is, all population units are included in the sample with equal probability.
(ii) The design produces well-spread samples.
Representative samples can be obtained by using designs that were first developed for sampling of natural resources. Natural resources often exhibit spatial trends. For such populations, sampling designs that spread the sample well over the space have been used for a long time and are considered to provide the smallest variance for commonly used estimators (Stevens & Olsen, 2004). Recently, Grafström (2012) and Grafström et al. (2012) presented two new designs for spatially balanced sampling. These designs are called spatially correlated Poisson sampling (SCPS) and the local pivotal method (LPM). They are considered in this paper. Both methods were developed for unequal inclusion probabilities, and hence, the methodology is presented in the general case. It is sometimes of more interest to have a small variance of an estimator than to have representative samples. For instance, it is possible to increase the inclusion probabilities for groups that are known to have a large standard deviation on the target variable and still benefit from having well-spread samples.
Samples resulting from simple random sampling (SRS) may be far from representative because of the high level of randomness. With auxiliary information, one possibility is to select balanced samples by using the cube method (Deville & Tillé, 2004).
Definition 4. (Balanced sample) A sample s is said to be balanced on the auxiliary x variables if
$$\sum_{i \in s} \frac{x_i}{\pi_i} = \sum_{i \in U} x_i.$$
A balanced sample selected with equal inclusion probabilities is mean balanced, that is, the sample means for the auxiliary variables will be equal to the population means.
However, mean balanced samples may also be far from representative. Such a sample may consist of only average units, thus not at all reflecting the units in the entire population. Despite that, mean balanced samples may be really efficient for estimating the total of some variables but cannot be regarded as representative without additional requirements. Well-spread probability samples are approximately balanced (Grafström & Lundström, 2013).
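The point that a balanced sample need not be representative can be seen in a tiny example. The sketch below evaluates the balancing equation of Definition 4 for one auxiliary variable; the population values and the function name `balance_gap` are hypothetical.

```python
from typing import Sequence


def balance_gap(x: Sequence[float], pi: Sequence[float],
                sample: Sequence[int]) -> float:
    """Left side minus right side of the balancing equation (Definition 4)."""
    return sum(x[i] / pi[i] for i in sample) - sum(x)


x = [1.0, 3.0, 3.0, 5.0]   # one auxiliary variable, population mean 3
pi = [0.5] * 4             # equal inclusion probabilities, n = 2

average_units = [1, 2]     # only "average" units (x = 3 and 3)
extreme_units = [0, 3]     # the two extremes (x = 1 and 5)
unbalanced = [0, 1]        # x = 1 and 3

gap_average = balance_gap(x, pi, average_units)
gap_extreme = balance_gap(x, pi, extreme_units)
gap_unbalanced = balance_gap(x, pi, unbalanced)
```

Both `average_units` and `extreme_units` satisfy the balancing equation exactly, yet neither covers the full range of x, which is why balance alone is not taken as representativity here.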
With only a few qualitative variables, it is possible to use stratification to make sure that the sample's proportions match the distribution in the population, such as selecting a sample with 50 per cent female and 50 per cent male subjects if that is the distribution in the population. With more variables, stratification soon becomes too crude a method, resulting in too many strata that are too small. Moreover, a multivariate stratification becomes somewhat arbitrary. Hence, our proposal is to use instead a sampling design that guarantees that the sample is well spread in the auxiliary space, that is, a design for selecting spatially balanced samples.
The paper is structured as follows. In Section 2, we discuss how to calculate distance in the auxiliary space and how to measure spatial balance. In Section 3, we give short descriptions of the LPM and SCPS designs and discuss some of their properties. A variance estimator is presented and discussed in Section 4. Two real data examples are presented in Section 5. Concluding comments are given in Section 6.
2 Representativity and spatial balance
A representative sample from a population will be a scaled-down version of the entire population, where all different characteristics of the population are present. With equal inclusion probabilities, a sample well spread in the space spanned by the auxiliary variables corresponds to a representative sample. The two sampling designs, further discussed in Section 3, both give samples that are well spread in space. To select such samples and to measure spatial balance, we need to define distance between units.
The purpose of the distance measure is mainly to identify units that are close in the auxiliary space. The auxiliary variables can be qualitative or quantitative. It is natural to think of the usual Euclidean distance, which is very reasonable when the auxiliary variables are quantitative. However, for qualitative variables, the Euclidean distance is not suitable. For such variables, we measure distance somewhat differently.
Definition 5. (Distance) Let x ∈ R^{q} be all available auxiliary variables, where {1, … ,p} correspond to the quantitative variables and {p + 1, … ,q} to the qualitative variables. The distance between units i and j in this q-dimensional space is
$$d(i,j) = \sqrt{\sum_{k=1}^{p} \left(x'_{ik} - x'_{jk}\right)^{2} + \sum_{k=p+1}^{q} \mathbf{1}\left(x_{ik} \neq x_{jk}\right)}, \tag{2}$$
where x′_{k} is the standardized version of x_{k}.
To give all auxiliary variables equal importance, standardization is necessary. Our motivation for this definition is its simplicity. Other distance functions that result in small values for units that have similar values on the auxiliary variables can also be used. If p = q, (2) corresponds to the usual Euclidean distance, and if p = 0, we simply sum the number of qualitative variables that differ between the units.
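A minimal sketch of the distance in Definition 5 follows. It assumes that (2) is the square root of the sum of squared standardized differences plus the number of mismatching qualitative variables, so that the purely quantitative case is the Euclidean distance. Standardizing with the N − 1 denominator is also an assumption, since the text does not specify the divisor. The function names are illustrative.

```python
import math
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """Centre a quantitative variable and scale it to unit standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]


def mixed_distance(quant_i: Sequence[float], quant_j: Sequence[float],
                   qual_i: Sequence[object], qual_j: Sequence[object]) -> float:
    """Distance between two units from standardized quantitative values and
    raw qualitative values (assumed square-root form of equation (2))."""
    squared = sum((a - b) ** 2 for a, b in zip(quant_i, quant_j))
    mismatches = sum(1 for a, b in zip(qual_i, qual_j) if a != b)
    return math.sqrt(squared + mismatches)


# Two units with two (already standardized) quantitative variables
# and two qualitative variables; they differ on one qualitative variable.
d = mixed_distance([0.0, 0.0], [3.0, 4.0], ["old", "male"], ["old", "female"])
```

Because the square root is monotone, dropping it would not change which unit is the nearest neighbour of another, only the scale of the distances.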
Voronoi polytopes have been used by, for example, Stevens & Olsen (2004) and Grafström et al. (2012) to measure spatial balance. In the case of equal inclusion probabilities, they can also be used to measure representativity. The Voronoi polytope p_{i} for the sample unit i ∈ s includes all population units closer to i than to any other sample unit j. Let v_{i} be the total inclusion probability within polytope i, that is,
$$v_i = \sum_{j \in p_i} \pi_j.$$
For a sample that is spatially balanced with respect to the inclusion probabilities, v_{i} should approximately be equal to 1 for each polytope. For a sample s, we define
$$B = \frac{1}{n} \sum_{i \in s} (v_i - 1)^{2} \tag{3}$$
as a measure of spatial balance. To measure how well spread the samples produced by a sampling design are, we use the mean of B over repeated samples. A low value corresponds to a high degree of spatial balance, that is, a high degree of representativity if the inclusion probabilities are equal. If one unit has the same distance to two or more sample units, it will be included in more than one polytope, and its inclusion probability is then divided equally among the polytopes it is included in. The lower bound for the expected value of B will differ between populations. The minimal possible bound, 0, can be attained for some populations, but not for all.
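The balance measure (3) can be sketched as follows, including the equal split of a unit's inclusion probability when it is equidistant to several sample units. The toy population of three well-separated pairs and the function name `spatial_balance` are assumptions made for illustration.

```python
from typing import Callable, Dict, Sequence


def spatial_balance(sample: Sequence[int], pi: Sequence[float],
                    dist: Callable[[int, int], float]) -> float:
    """Balance measure B: mean squared deviation of the Voronoi sums v_i from 1."""
    v: Dict[int, float] = {i: 0.0 for i in sample}
    for j in range(len(pi)):
        d = {i: dist(i, j) for i in sample}
        d_min = min(d.values())
        nearest = [i for i in sample if d[i] == d_min]
        for i in nearest:  # ties share the inclusion probability equally
            v[i] += pi[j] / len(nearest)
    return sum((v_i - 1.0) ** 2 for v_i in v.values()) / len(sample)


x = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]   # three well-separated pairs
pi = [0.5] * 6                           # n = 3
dist = lambda i, j: abs(x[i] - x[j])

b_spread = spatial_balance([0, 2, 4], pi, dist)    # one unit per pair
b_clumped = spatial_balance([0, 1, 2], pi, dist)   # two units from one pair
```

Selecting one unit from each pair gives v_{i} = 1 for every polytope, so B attains its minimum of 0 in this population, while the clumped sample gives B = 0.5.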
3 Sampling designs
Two different designs that give spatially balanced samples are considered: the LPM and SCPS. Both LPM and SCPS are fixed-size πps designs, that is, if ∑_{i ∈ U} π_{i} = n, where n is integer valued, these designs always select samples of size n. Both algorithms start with a vector of prescribed inclusion probabilities and successively transform the probability vector into a vector with N − n zeros and n ones. As usual, the ones indicate inclusion in the sample.
In our point of view, the probability mass (n) is at first distributed on all population units according to the prescribed inclusion probabilities. Then, in a sample selection algorithm (which produces fixed-size samples), the probability mass must be moved (according to a set of rules) until the mass is focused, with ones on the coordinates of the n selected units and zeros on the N − n remaining coordinates. Also, in a fixed-size sampling algorithm, the total probability mass remains constant when the vector is updated.
3.1 Local pivotal method
The LPM, proposed by Grafström et al. (2012), is a special case of the pivotal method (Deville & Tillé, 1998). Let π_{i}, i = 1,2, … ,N, be the prescribed inclusion probabilities. In a step of the pivotal method, the inclusion probabilities are updated for two units in such a way that the sampling outcome is decided for at least one of the two units. Hence, a sample is obtained in at most N steps. In the pivotal method, the two units may be chosen arbitrarily in each step. What Grafström et al. (2012) suggested was to always choose two units close in space. Then, the pivotal method is used locally in the auxiliary variable space, hence the name LPM. When the inclusion probabilities are simultaneously updated for nearby units, they seldom appear together in a sample. The effect is that the sample becomes well spread.
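One step of the LPM can be described as follows: Randomly choose one unit i with equal probabilities. Then, choose the nearest neighbour to this unit, denoted j. If two or more units have the same distance to i, then randomly choose one of them with equal probability. Update the inclusion probabilities for the chosen pair of units (i,j) according to the updating rule of the pivotal method (Deville & Tillé, 1998). If π_{i} + π_{j} < 1, then
$$(\pi'_i, \pi'_j) = \begin{cases} (0,\ \pi_i + \pi_j) & \text{with probability } \dfrac{\pi_j}{\pi_i + \pi_j}, \\[1ex] (\pi_i + \pi_j,\ 0) & \text{with probability } \dfrac{\pi_i}{\pi_i + \pi_j}, \end{cases}$$
and if π_{i} + π_{j} ≥ 1, then
$$(\pi'_i, \pi'_j) = \begin{cases} (1,\ \pi_i + \pi_j - 1) & \text{with probability } \dfrac{1 - \pi_j}{2 - \pi_i - \pi_j}, \\[1ex] (\pi_i + \pi_j - 1,\ 1) & \text{with probability } \dfrac{1 - \pi_i}{2 - \pi_i - \pi_j}. \end{cases}$$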
Now, replace (π_{i},π_{j}) with (π′_{i},π′_{j}). Once a unit has obtained an updated probability that is 0 or 1, it cannot be chosen for updating again. Repeat the aforementioned step until all units have received an updated probability that is 0 or 1. Thus, the algorithm starts with the vector of prescribed inclusion probabilities, and this vector is successively updated to become a vector of zeros and ones. The ones indicate inclusion in the sample.
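The LPM step can be sketched in a few lines. The sketch assumes the standard updating rule of the pivotal method (Deville & Tillé, 1998) for a pair of units and uses a small tolerance `EPS` to decide when a probability counts as 0 or 1. The tolerance, the function name, and the handling of a single leftover unit are illustrative choices, not taken from the paper.

```python
import random
from typing import Callable, List, Sequence

EPS = 1e-9  # tolerance for treating a probability as exactly 0 or 1


def lpm_sample(pi: Sequence[float], dist: Callable[[int, int], float],
               rng: random.Random) -> List[int]:
    """Select one sample by repeatedly applying the pivotal update to a
    randomly chosen unit and its nearest undecided neighbour."""
    p = list(pi)
    active = [i for i in range(len(p)) if EPS < p[i] < 1 - EPS]
    while len(active) > 1:
        i = rng.choice(active)
        others = [k for k in active if k != i]
        d_min = min(dist(i, k) for k in others)
        j = rng.choice([k for k in others if dist(i, k) == d_min])
        total = p[i] + p[j]
        if total < 1:
            # One unit is excluded; the other receives the combined mass.
            if rng.random() < p[j] / total:
                p[i], p[j] = 0.0, total
            else:
                p[i], p[j] = total, 0.0
        else:
            # One unit is included; the other keeps the surplus mass.
            if rng.random() < (1 - p[j]) / (2 - total):
                p[i], p[j] = 1.0, total - 1
            else:
                p[i], p[j] = total - 1, 1.0
        active = [k for k in active if EPS < p[k] < 1 - EPS]
    if active:  # only possible if the probabilities do not sum to an integer
        k = active[0]
        p[k] = 1.0 if rng.random() < p[k] else 0.0
    return [i for i in range(len(p)) if p[i] > 1 - EPS]


# Five pairs of identical units, equal inclusion probabilities, n = 5.
x = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
pi = [0.5] * 10
dist = lambda i, j: abs(x[i] - x[j])
sample = lpm_sample(pi, dist, random.Random(1))
```

In the pairs population each identical pair is resolved in a single step, so every sample contains exactly one unit from each pair.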
3.2 Spatially correlated Poisson sampling
SCPS is presented by Grafström (2012) as a modification of correlated Poisson sampling (Bondesson & Thorburn, 2008). The method is list sequential, that is, it starts at unit 1, and once the sampling outcome is decided for that unit, it continues with unit 2 and so on. It does not revisit any unit, and once a sampling outcome has been decided, the inclusion probabilities for the remaining units are updated.
Let π_{i}, i = 1,2, … ,N, be the prescribed inclusion probabilities. The algorithm starts at step k = 1 with the vector π_{i}^{(0)} := π_{i}, i = 1,2, … ,N. At step k = 1,2, … ,N, the sampling outcome for the units i < k is already decided. Unit k is then included with probability π_{k}^{(k−1)}, and we set γ_{k} = 1 if it is included and γ_{k} = 0 otherwise. The inclusion probabilities for the units i = k + 1, … ,N are then updated according to
$$\pi_i^{(k)} = \pi_i^{(k-1)} - \left(\gamma_k - \pi_k^{(k-1)}\right) w_k^{(i)},$$
where w_{k}^{(i)} are weights. These weights control how the inclusion probabilities for the remaining units i = k + 1, … ,N should be affected by the sampling outcome for unit k.
To avoid clustering of similar units and to obtain well-spread samples, the weights are chosen such that unit k gives maximal weight to the unit closest to k in distance, among the units k + 1, … ,N. Then, maximal weight is given to the next closest unit. A restriction is that the weights should sum to 1. The maximal weights are decided by the fact that the updated inclusion probabilities must lie within [0,1]. This is called the maximal weight strategy. For a more detailed description on how to calculate the weights, see Grafström (2012).
When the weights are chosen in this way, the result is that if unit k is included, the probabilities for the surrounding units decrease. If unit k is not included, the probabilities for the surrounding units increase. As a result, the samples become well spread.
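The list-sequential update with maximal weights can be sketched as below. The upper bound used for each weight, min(π_{i}/(1 − π_{k}), (1 − π_{i})/π_{k}), is derived here from the requirement that the updated probability stays in [0,1] for both outcomes of unit k. Ties in distance are broken by list order for simplicity, which is an assumption of this sketch and may differ from the treatment in Grafström (2012).

```python
import random
from typing import Callable, List, Sequence


def scps_sample(pi: Sequence[float], dist: Callable[[int, int], float],
                rng: random.Random) -> List[int]:
    """List-sequential selection with a maximal weight strategy (sketch)."""
    p = list(pi)
    n_units = len(p)
    sample: List[int] = []
    for k in range(n_units):
        p_k = min(1.0, max(0.0, p[k]))  # guard against rounding error
        included = rng.random() < p_k
        if included:
            sample.append(k)
        if p_k <= 0.0 or p_k >= 1.0:
            continue  # the outcome was certain, nothing to redistribute
        gamma = 1.0 if included else 0.0
        weight_left = 1.0  # the weights should sum to 1
        # Visit the undecided units from the closest to the most distant.
        for i in sorted(range(k + 1, n_units), key=lambda u: dist(k, u)):
            if weight_left <= 0.0:
                break
            # Largest weight keeping p[i] inside [0, 1] for both outcomes.
            w_max = min(p[i] / (1 - p_k), (1 - p[i]) / p_k)
            w = min(max(w_max, 0.0), weight_left)
            p[i] -= (gamma - p_k) * w
            weight_left -= w
    return sample


# Five pairs of identical units, equal inclusion probabilities, n = 5.
x = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
pi = [0.5] * 10
dist = lambda i, j: abs(x[i] - x[j])
sample = scps_sample(pi, dist, random.Random(1))
```

When unit k is included, γ_{k} − π_{k} is positive and the nearby probabilities decrease; when it is excluded, they increase, which is the mechanism that spreads the sample.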
3.3 Design properties
Here, we will show that the preceding designs are appropriate when the goal is to obtain representative samples. Both designs have the property that they automatically stratify on well-separated clusters in the auxiliary space and therefore are alternatives to stratification. In high dimensions, such natural strata could be hard to detect, and hence, stratification could be difficult. The two designs are based on very different algorithms. LPM only moves probability mass between pairs, whereas SCPS can move probability mass to (and from) several neighbours. This makes the LPM algorithmically easier. On the other hand, SCPS may produce even more balanced samples for some populations.
Example 2. (Two of a kind) Let the population consist of five pairs of identical twins. We want to select a sample of five individuals, which is representative. It then seems logical to select one twin from each pair. Because the distance between identical twins is 0 and hence always less than the distance between non-twins, both LPM and SCPS automatically select one from each pair. In fact, the two designs coincide and correspond to letting each pair decide the outcome by a toss of a fair coin.
This example is just a special case of the following general theorem, which is valid for both designs with general π_{i}s.
Theorem 1. Let U be a population and let U_{1}^{*},U_{2}^{*}, … ,U_{m}^{*} be a partition of U such that for k = 1,2, … ,m, we have
$$\max_{i,j \in U_k^{*}} d(i,j) < \min_{i \in U_k^{*},\; j \in U_\ell^{*},\; \ell \neq k} d(i,j).$$
Thus, each U_{k}^{*} is a coherent subset of U, and we obtain the partition by grouping together the most similar units. For LPM and SCPS, the sum of inclusion indicators for each U_{k}^{*} satisfies
$$n - \sum_{\ell \neq k} \lceil n_\ell \rceil \;\le\; \sum_{i \in U_k^{*}} I_i \;\le\; n - \sum_{\ell \neq k} \lfloor n_\ell \rfloor,$$
for k = 1,2, … ,m, where n_{ℓ} = ∑_{i ∈ U_{ℓ}^{*}} π_{i}. The notations ⌊ · ⌋ and ⌈ · ⌉ correspond to the floor and ceiling functions, respectively.
It is not guaranteed that we can find a partition as in theorem 1 (for any m ≥ 2) for every population. It depends on the auxiliary variables and the choice of distance measure.
The purpose of theorem 1 is to show that if such a partition exists, that is, if the population is clustered into groups of similar units, then we obtain good bounds for the inclusion indicators for each such group. This means that we are then guaranteed to select the right number of units (or close to it) from each such group U_{k}^{*}, that is,
$$n_k^{*} \approx \sum_{i \in U_k^{*}} \pi_i = n_k,$$
which indicates a well-spread sample. For any design with given inclusion probabilities π_{i}, i = 1,2, … ,N, we have E(∑_{i ∈ U_{k}^{*}} I_{i}) = ∑_{i ∈ U_{k}^{*}} π_{i} = n_{k}, but it is much stronger to have strict bounds for ∑_{i ∈ U_{k}^{*}} I_{i}. If all the n_{k}s are integer valued, then the cluster sample sizes are fixed.
Example 3. (Use of theorem 1) Assume that we have a population U, which can be partitioned into U_{1}^{*}, U_{2}^{*}, and U_{3}^{*}, such that theorem 1 holds. Let n_{1} = 20.7, n_{2} = 15.3, and n_{3} = 14 be the expected sample sizes from U_{1}^{*}, U_{2}^{*}, and U_{3}^{*}, respectively. Then, theorem 1 gives us the following bounds for the achieved sample sizes:
$$20 \le n_1^{*} \le 21, \quad 15 \le n_2^{*} \le 16, \quad \text{and} \quad 13 \le n_3^{*} \le 15.$$
We also have n_{1}^{*} + n_{2}^{*} + n_{3}^{*} = n = 50. Thus, one possible outcome is that n_{1}^{*} = 20, n_{2}^{*} = 15, and n_{3}^{*} = 15. This shows that n_{k}^{*} ≈ n_{k}.
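The bounds in Example 3 can be reproduced with a short function. It assumes that the lower bound in theorem 1 subtracts the ceilings of the other clusters' expected sizes from n and the upper bound subtracts their floors, which is the reading that matches the numbers in the example. The function name `cluster_bounds` is illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cluster_bounds(expected_sizes: Sequence[float], n: int) -> List[Tuple[int, int]]:
    """Lower and upper bounds on the achieved sample size in each cluster."""
    bounds = []
    for k in range(len(expected_sizes)):
        others = [v for l, v in enumerate(expected_sizes) if l != k]
        lower = n - sum(math.ceil(v) for v in others)
        upper = n - sum(math.floor(v) for v in others)
        bounds.append((lower, upper))
    return bounds


# Expected cluster sample sizes from Example 3, with n = 50.
bounds = cluster_bounds([20.7, 15.3, 14.0], 50)
```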
The proof of theorem 1 is given for LPM by Grafström et al. (2012). For SCPS, the proof is presented by Grafström (2012) in the special case of all n_{k} being integers, but the proof is similar to the one for LPM in the more general case. Theorem 1 gives an indication that the resulting samples are well spread in the auxiliary space, and if the inclusion probabilities are equal, we thus obtain representative samples.
Both designs only need a distance between units to select a sample. Changing the distance function will affect the sampling designs, but it does not change the algorithms as such.
4 Variance estimation
In this section, we present a variance estimator that may be used to estimate the variance of the unbiased HT estimator under the LPM and SCPS designs with general π_{i}s. First, we note that if the sample size is fixed, then the variance of (1) can be written as
$$V(\hat{Y}) = -\frac{1}{2} \sum_{(i,j) \in U} (\pi_{ij} - \pi_i \pi_j) \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^{2}, \tag{4}$$
where π_{ij} is the probability that units i and j both are included in a sample, also called the second-order inclusion probability. If the sample size is fixed and all second-order inclusion probabilities are strictly positive, then V(Ŷ) can be estimated without bias by the well-known Sen–Yates–Grundy estimator:
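$$\hat{V}_{SYG}(\hat{Y}) = -\frac{1}{2} \sum_{(i,j) \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^{2}. \tag{5}$$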
The estimator (5) may be difficult to use in practice because it requires calculation of the second-order inclusion probabilities. For the LPM and also for the SCPS design, the second-order inclusion probabilities depend heavily on the spatial structure of the population, and they cannot be calculated in practice. Moreover, for most populations, many second-order inclusion probabilities will be 0 when using these designs. Hence, it is not possible to have a design-based unbiased variance estimator. Because of these facts, we seek an alternative variance estimator.
Stevens & Olsen (2003, 2004) introduced a design called generalized random-tessellation stratified for spatially balanced sampling in two dimensions. They also developed a local mean variance estimator for the generalized random-tessellation stratified design. Grafström et al. (2012) tested the local mean variance estimator and found that it worked well also for the LPM when using spatial coordinates as auxiliary variables. However, the local mean variance estimator was not developed to handle equal distances, which follow when using only qualitative auxiliary variables. Hence, a more general variance estimator is needed. We propose to estimate the variance (4) of Ŷ under the LPM and SCPS designs by using
$$\hat{V}_{SB}(\hat{Y}) = \sum_{i \in s} \frac{n_i^{*}}{n_i^{*} - 1} \left( \frac{y_i}{\pi_i} - \frac{1}{n_i^{*}} \sum_{j \in s_i^{*}} \frac{y_j}{\pi_j} \right)^{2}, \tag{6}$$
where s_{i}^{*} is a coherent subset of s with n_{i}^{*} units. The coherent subset s_{i}^{*} includes unit i, and j ∈ s_{i}^{*} if j ∈ s and d(i,j) = min_{k ∈ s,k ≠ i}d(i,k). Thus, the subsets may be of different sizes. In case of only qualitative auxiliary variables and the distance measure (2), this variance estimator reflects the designs well. It identifies the same stratification as the designs. Moreover, for strata with only one selected unit, it produces something similar to an automatic stratum collapse. A single unit i then uses a squared deviation from a mean of all units in the nearest strata and the unit i itself. The stratum sample sizes cannot always be fixed, but the stratum sample sizes do not vary much. To be precise, the variance estimator uses stratification with the obtained stratum sample sizes. Within a stratum, the variance estimator corresponds to the variance estimator under independent random sampling (example 4).
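A minimal sketch of the variance estimator (6) is given below. For each sampled unit it forms the local neighbourhood of the unit itself and all sampled units at the minimal distance, then adds the scaled squared deviation from the local mean of y/π. The toy data and the function name `local_variance_estimate` are made up for illustration.

```python
from typing import Callable, Sequence


def local_variance_estimate(y: Sequence[float], pi: Sequence[float],
                            sample: Sequence[int],
                            dist: Callable[[int, int], float]) -> float:
    """Variance estimate from local neighbourhoods within the sample."""
    total = 0.0
    for i in sample:
        others = [k for k in sample if k != i]
        d_min = min(dist(i, k) for k in others)
        # Unit i plus every sampled unit at the minimal distance from i.
        neighbourhood = [i] + [j for j in others if dist(i, j) == d_min]
        m = len(neighbourhood)
        local_mean = sum(y[j] / pi[j] for j in neighbourhood) / m
        total += m / (m - 1) * (y[i] / pi[i] - local_mean) ** 2
    return total


# One quantitative auxiliary variable without tied distances in the sample.
x = [0.0, 1.0, 3.0, 7.0, 8.0, 12.0, 13.0, 20.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
pi = [0.5] * 8
sample = [0, 2, 3, 5]
dist = lambda i, j: abs(x[i] - x[j])

v_hat = local_variance_estimate(y, pi, sample, dist)
```

Without tied distances every neighbourhood has two units, and each term reduces to half the squared difference between a unit and its nearest sampled neighbour. With a single qualitative variable, the neighbourhoods coincide with the strata.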
Example 4. (A single qualitative auxiliary variable) Let there be one qualitative auxiliary variable with H categories. It divides the population U into strata U_{h}^{*}, h = 1, … ,H, where each U_{h}^{*} is a coherent subset of U. Also, let the expected sample sizes n_{h} = ∑_{i ∈ U_{h}^{*}} π_{i} be integer valued and each n_{h} ≥ 2. Then, the designs produce stratification with fixed stratum sample sizes n_{h}^{*} = n_{h}, and the variance estimator (6) corresponds to
$$\hat{V}(\hat{Y}) = \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \sum_{i \in s_h^{*}} \left( \frac{y_i}{\pi_i} - \frac{1}{n_h} \sum_{j \in s_h^{*}} \frac{y_j}{\pi_j} \right)^{2}, \tag{7}$$
where s_{h}^{*} is a coherent subset of s, including all selected units from U_{h}^{*}. This corresponds exactly to the variance estimator for Ŷ under stratification with independent random sampling within each stratum, that is, in each stratum U_{h}^{*}, n_{h} units are drawn with replacement according to the probabilities π_{i} / n_{h}, with ∑_{i ∈ U_{h}^{*}} π_{i} / n_{h} = 1.
A special case of the variance estimator (6) is when continuous auxiliary variables are used and no equal distances exist. Then, there are only two units in each local neighbourhood si*, and (6) simplifies to
$$\hat{V}(\hat{Y}) = \frac{1}{2} \sum_{i \in s} \left( \frac{y_i}{\pi_i} - \frac{y_{j_i}}{\pi_{j_i}} \right)^{2}, \tag{8}$$
where j_{i} is the nearest neighbour to i in the sample. A similar estimator to (8) is recommended by Wolter (2007, p. 336) named v_{12}, as one of the best general-purpose variance estimators for systematic sampling. In case of systematic sampling, it is assumed that the population has been ordered, prior to the sampling, by a variable related to the variable of interest. Then, the variable of interest has a trend in the ordered population. For systematic sampling, the pairs of units used in the variance estimator are constructed by taking the n − 1 successive pairs of units in the sample. This corresponds well to a local neighbourhood of two units. It makes sense to use a similar variance estimator for spatially balanced sampling as for a systematic design; both are selected to be well spread over a population with trends.
5 Examples
In this section, we study two real data sets. The simulations are carried out using the R software environment.
5.1 Baltimore data
The Baltimore data originate from a paper by Dubin (1992). These data are publicly available from GeoDa Center (2011) and contain the selling price of 211 houses together with 15 descriptive variables. We consider the 211 units to be our population. The target variable is the selling price, and we use the other 15 variables described in Table 1 as auxiliary variables. The two sampling designs presented in Section 3 are used with the distance measure (2). For comparison, we also consider samples generated by SRS.
Table 1. Variable description for the Baltimore data

| Variable | Description |
|---|---|
| Price | Selling price in thousands of dollars |
| Nroom | Number of rooms |
| Dwell | Indicator variable, 1 if detached unit |
| Nbath | Number of bathrooms |
| Patio | Indicator variable, 1 if patio |
| Firepl | Indicator variable, 1 if fireplace |
| AC | Indicator variable, 1 if air conditioning |
| Bment | Indicator variable, 1 if basement |
| Nstor | Number of storeys |
| Gar | Number of car spaces in garage |
| Age | Age of dwelling, in years |
| Citcou | Indicator variable, 1 if located in county and 0 if located in city |
| Lotsz | Lot size, in hundreds of square feet |
| Sqft | Interior living area, in hundreds of square feet |
| X | X-coordinate of the house |
| Y | Y-coordinate of the house |
We performed a simulation and generated 10,000 samples of size 50 with each design. The mean of the spatial balance B was 0.32 for both LPM and SCPS compared with 0.49 for SRS. This means that the LPM and SCPS designs produce much more well-spread samples.
In the simulation, the variances of our estimators Ŷ are estimated by the following empirical variance estimator
$$\hat{V}_{Sim}(\hat{Y}) = \frac{1}{m} \sum_{k=1}^{m} (\hat{Y}_k - Y)^{2}, \tag{9}$$
where m is the number of simulated samples and Ŷ_{k} is the estimate of Y for sample k. We can see in Table 2 that the variance of the sample means decreases substantially for all variables when using the LPM or SCPS design instead of SRS.
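The empirical variance (9) is straightforward to compute from simulated estimates. The sketch below applies it to repeated simple random samples from a made-up population and compares it with the standard closed-form SRS variance of the estimated total, N²(1 − n/N)S²/n. The population, the seed, and the number of replicates are illustrative and unrelated to the Baltimore data.

```python
import random
from typing import Sequence


def empirical_variance(estimates: Sequence[float], true_total: float) -> float:
    """Mean squared deviation of simulated estimates from the true total."""
    return sum((e - true_total) ** 2 for e in estimates) / len(estimates)


def srs_variance_of_total(y: Sequence[float], n: int) -> float:
    """Exact variance of the HT estimator of the total under SRS without replacement."""
    N = len(y)
    mean = sum(y) / N
    s2 = sum((v - mean) ** 2 for v in y) / (N - 1)
    return N * N * (1 - n / N) * s2 / n


rng = random.Random(1)
y = [float(v) for v in range(1, 21)]   # hypothetical population, N = 20
N, n, m = len(y), 5, 20000
true_total = sum(y)

estimates = []
for _ in range(m):
    s = rng.sample(range(N), n)        # one simple random sample
    estimates.append(sum(y[i] for i in s) * N / n)

v_sim = empirical_variance(estimates, true_total)
v_srs = srs_variance_of_total(y, n)
ratio = v_sim / v_srs                  # should be close to 1 for SRS
```

Replacing the SRS draw with a well-spread design and keeping `v_srs` as the denominator gives ratios of the kind reported in Table 2.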
Table 2. The empirical variance of the sample means for LPM and SCPS in relation to SRS for the Baltimore example

| Variable | LPM: V̂_Sim/V_SRS | SCPS: V̂_Sim/V_SRS |
|---|---|---|
| Price | 0.34 | 0.32 |
| Nroom | 0.39 | 0.35 |
| Dwell | 0.15 | 0.13 |
| Nbath | 0.43 | 0.40 |
| Patio | 0.48 | 0.45 |
| Firepl | 0.41 | 0.36 |
| AC | 0.31 | 0.26 |
| Bment | 0.38 | 0.34 |
| Nstor | 0.31 | 0.29 |
| Gar | 0.35 | 0.31 |
| Age | 0.36 | 0.35 |
| Citcou | 0.19 | 0.16 |
| Lotsz | 0.35 | 0.33 |
| Sqft | 0.35 | 0.28 |
| X | 0.32 | 0.28 |
| Y | 0.35 | 0.32 |

LPM, local pivotal method; SCPS, spatially correlated Poisson sampling; SRS, simple random sampling.
We also estimated the total of the variable Price, and in Table 3, the empirical variance estimator is presented together with the mean of the variance estimator (6). The variance estimator (6) overestimates the variance somewhat, which might be the result of using a variance estimator based on a with-replacement design. However, our experience is that if the target variable has a smooth trend, then the mean of the variance estimator is rather close to the true variance. The variance estimator still performs well compared with a variance estimator based on SRS and produces good coverage rates for the approximate 95 per cent confidence interval Ŷ ± 1.96 √V̂. The coverage rate with V̂_Sim can be interpreted as the true coverage rate, as if the variance was known.
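The coverage rates reported alongside the variances can be computed as the share of simulated samples whose interval, the estimate plus or minus 1.96 times the square root of the variance estimate, contains the true total. The sketch below shows this calculation on made-up numbers. The function name `coverage_rate` is illustrative.

```python
import math
from typing import Sequence


def coverage_rate(estimates: Sequence[float], variances: Sequence[float],
                  true_total: float, z: float = 1.96) -> float:
    """Share of intervals estimate +/- z * sqrt(variance) covering the true total."""
    hits = sum(1 for e, v in zip(estimates, variances)
               if abs(e - true_total) <= z * math.sqrt(v))
    return hits / len(estimates)


# Three hypothetical simulated estimates with their variance estimates.
estimates = [98.0, 100.0, 110.0]
variances = [4.0, 4.0, 4.0]
rate = coverage_rate(estimates, variances, true_total=100.0)
```

Passing the same constant (the empirical variance) for every sample gives the coverage that would be achieved if the variance were known.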
Table 3. Estimated variance (achieved coverage rate) of the Horvitz–Thompson estimator of the total price for the Baltimore example

| Variance estimator | LPM | SCPS | SRS |
|---|---|---|---|
| V̂_Sim | 128,985 (0.952) | 121,170 (0.951) | 378,605 (0.951) |
| mean(V̂_SB) | 151,151 (0.949) | 152,770 (0.957) | |

For SRS, the true variance is presented.
LPM, local pivotal method; SCPS, spatially correlated Poisson sampling; SRS, simple random sampling.
In Table 4, we also compared the mean squared error of the sample quartiles of the price distribution for the three designs. We obtain a significant improvement for LPM and SCPS compared with SRS.
Table 4. Estimated mean squared error of the quartiles of the Price distribution for the Baltimore example

| Quartile | LPM | SCPS | SRS |
|---|---|---|---|
| Q1 | 2.53 | 2.50 | 5.78 |
| Median | 3.88 | 3.82 | 7.63 |
| Q3 | 5.34 | 5.16 | 10.64 |

LPM, local pivotal method; SCPS, spatially correlated Poisson sampling; SRS, simple random sampling.
Overall, the LPM and SCPS designs produce similar results, both in terms of spatial balance and the resulting variances of estimates. Both designs achieved an important improvement compared with SRS. This is because the auxiliary variables used explain much of the variation in the target variable Price.
5.2 Jura data
In this example, we study the Jura data published by Goovaerts (1997) and publicly available in the R-package gstat. We use the prediction set consisting of 259 units as our population. A description of the variables is found in Table 5. We use Xloc, Yloc, Landuse, and Rock as our auxiliary variables. The remaining variables are our target variables.
Table 5. Variable description for the Jura data

| Variable | Description |
|---|---|
| Xloc | Spatial coordinate (km) |
| Yloc | Spatial coordinate (km) |
| Landuse | 1 forest, 2 pasture, 3 meadow, 4 tillage |
| Rock | 1 Argovian, 2 Kimmeridgian, 3 Sequanian, 4 Portlandian, 5 Quaternary |
| Cd | Cadmium (ppm) |
| Co | Cobalt (ppm) |
| Cr | Chromium (ppm) |
| Cu | Copper (ppm) |
| Ni | Nickel (ppm) |
| Pb | Lead (ppm) |
| Zn | Zinc (ppm) |
To compare representativity, we supplement the measure B in (3) with an additional analysis of the qualitative variables. For the two qualitative variables, we consider the variance of the counts in the contingency table. The results we present are based on 10,000 generated samples of size n = 80. We compare the same three designs as in the previous example with respect to spatial balance. The mean of the spatial balance B is 0.28, 0.29, and 0.55 for LPM, SCPS, and SRS, respectively. This means that the samples are much more well spread for LPM and SCPS. For the two quantitative auxiliary variables, the variance of the mean is reduced when the sample is well spread (Table 6).
Table 6. The empirical variance of the sample means for LPM and SCPS in relation to SRS for the two quantitative variables in the Jura example

| Variable | LPM: V̂_Sim/V_SRS | SCPS: V̂_Sim/V_SRS |
|---|---|---|
| Xloc | 0.08 | 0.07 |
| Yloc | 0.07 | 0.07 |

LPM, local pivotal method; SCPS, spatially correlated Poisson sampling; SRS, simple random sampling.
The results for the qualitative variables in Table 7 show that the variance is substantially reduced when using LPM and SCPS. For combinations of categories where the expected count is less than 2, the variance could obviously not be substantially reduced. The variance of the counts would be even lower if only the two qualitative auxiliary variables were used. On the other hand, we would then leave out the information about spatial location. This would then have an effect on the results for the target variables.
Table 7. Ec and variance of sample counts for LPM, SCPS, and SRS for the Jura example

| Rock | Statistic | Forest | Pasture | Meadow | Tillage | Total |
|---|---|---|---|---|---|---|
| Argovian | Ec | 2.2 | 1.9 | 12 | 0.3 | 16.4 |
| | V̂_LPM | 0.73 | 0.50 | 0.95 | 0.21 | 0.77 |
| | V̂_SCPS | 0.73 | 0.51 | 0.93 | 0.21 | 0.72 |
| | V̂_SRS | 1.42 | 1.27 | 7.07 | 0.21 | 9.02 |
| Kimmeridgian | Ec | 6.8 | 5.6 | 13.6 | 0.3 | 26.3 |
| | V̂_LPM | 0.66 | 0.75 | 1.02 | 0.21 | 1.41 |
| | V̂_SCPS | 0.61 | 0.73 | 0.98 | 0.22 | 1.33 |
| | V̂_SRS | 4.33 | 3.61 | 7.62 | 0.21 | 12.4 |
| Sequanian | Ec | 0.9 | 7.7 | 10.2 | 0.6 | 19.4 |
| | V̂_LPM | 0.65 | 0.76 | 0.91 | 0.42 | 1.01 |
| | V̂_SCPS | 0.64 | 0.77 | 0.95 | 0.41 | 0.86 |
| | V̂_SRS | 0.64 | 4.76 | 6.14 | 0.42 | 10.1 |
| Portlandian | Ec | 0.3 | 0.3 | 0.3 | 0 | 0.9 |
| | V̂_LPM | 0.21 | 0.21 | 0.21 | | 0.57 |
| | V̂_SCPS | 0.22 | 0.21 | 0.22 | | 0.58 |
| | V̂_SRS | 0.22 | 0.22 | 0.21 | | 0.64 |
| Quaternary | Ec | 0 | 1.9 | 14.8 | 0.3 | 17 |
| | V̂_LPM | | 0.17 | 0.69 | 0.21 | 0.64 |
| | V̂_SCPS | | 0.19 | 0.72 | 0.21 | 0.64 |
| | V̂_SRS | | 1.25 | 8.56 | 0.21 | 9.46 |
| Total | Ec | 10.2 | 17.4 | 50.9 | 1.5 | 80 |
| | V̂_LPM | 1.48 | 1.46 | 2.13 | 0.91 | |
| | V̂_SCPS | 1.40 | 1.34 | 2.00 | 0.94 | |
| | V̂_SRS | 6.26 | 9.46 | 12.9 | 1.03 | |

Ec, expected count; LPM, local pivotal method; SCPS, spatially correlated Poisson sampling; SRS, simple random sampling.
We present the results for the seven target variables in Table 8. For all target variables, we have estimated the population total. We present the empirical variance estimator (9) for the HT estimator as well as the variance estimator (6). We also construct approximate 95 per cent confidence intervals. For SRS, we calculate the exact variance. From the results, we see that the empirical variance is reduced whenever the sample is well spread. This is true for all seven target variables. The variance estimator is conservative and overestimates the variance. As a result, the coverage rate is above the nominal rate.
Table 8. Estimated variance (achieved coverage rate) of the Horvitz–Thompson estimator for the Jura example

| Variable | Variance estimator | LPM | SCPS | SRS |
|---|---|---|---|---|
| Cd | V̂_Sim | 313 (0.950) | 319 (0.950) | 485 (0.950) |
| | mean(V̂_SB) | 449 (0.971) | 444 (0.970) | |
| Co | V̂_Sim | 2,615 (0.950) | 2,699 (0.950) | 7,411 (0.954) |
| | mean(V̂_SB) | 3,612 (0.974) | 3,577 (0.970) | |
| Cr | V̂_Sim | 34,850 (0.950) | 34,827 (0.950) | 69,580 (0.950) |
| | mean(V̂_SB) | 50,659 (0.978) | 50,241 (0.976) | |
| Cu | V̂_Sim | 157,391 (0.951) | 160,773 (0.954) | 248,619 (0.954) |
| | mean(V̂_SB) | 236,336 (0.960) | 236,684 (0.961) | |
| Ni | V̂_Sim | 16,256 (0.951) | 16,178 (0.950) | 39,279 (0.952) |
| | mean(V̂_SB) | 22,499 (0.973) | 22,323 (0.973) | |
| Pb | V̂_Sim | 293,344 (0.954) | 321,042 (0.952) | 514,360 (0.950) |
| | mean(V̂_SB) | 566,519 (0.972) | 553,385 (0.970) | |
| Zn | V̂_Sim | 243,065 (0.952) | 248,885 (0.950) | 488,019 (0.950) |
| | mean(V̂_SB) | 382,705 (0.980) | 371,740 (0.976) | |

For SRS, the true variance is presented.
LPM, local pivotal method; SCPS, spatially correlated Poisson sampling; SRS, simple random sampling.
6 Final comments
For the LPM and SCPS designs, we would like to stress their generality and the fact that the samples produced are very well spread and, in the case of equal inclusion probabilities, also representative. We do not know of any other designs that can produce more well-spread probability samples in an arbitrary auxiliary space. The designs can incorporate all auxiliary information that is available on a unit level, which is a big advantage compared with other methods. We specifically recommend these designs for multipurpose surveys when diverse auxiliary information is available. Compared with SRS, estimation for all target variables that are well explained by the auxiliary variables will be improved.
If only one target variable is of interest, it may be better not to use a representative sample. It may be preferable to use unequal inclusion probabilities, where the inclusion probabilities are chosen proportional to one of the auxiliary variables.
We only consider the unbiased HT estimator because we focus on sampling designs, but other estimators may be slightly more precise in specific situations. In some situations, it is possible to use a calibration estimator (e.g. Deville & Särndal, 1992). If the sample means are close to the population means, then not much adjustment is generally needed for the design weights in the calibration estimator. Deville & Tillé (2004) found that for approximately mean balanced samples, the weights in the calibration estimator become more stable and negative weights are less likely to appear. The performance of regression and calibration estimators is also expected to be improved, compared with that under SRS, by the fact that the samples are well spread and more representative. A new method that allows the selection of samples that are both well spread and balanced was recently introduced by Grafström & Tillé (2013).
Acknowledgements
We thank Sara Sjöstedt-de Luna and Lennart Bondesson for valuable comments that improved this paper. We also thank the Associate Editor and two anonymous reviewers for many constructive comments.