Locally correlated Poisson sampling

Designs that produces spatially balanced, or well‐spread, samples are desirable as they increase the probability of obtaining a sample highly representative of the population. Spatially correlated Poisson sampling (SCPS) is a method for selecting well‐spread samples. In the SCPS method, the sampling outcomes (inclusion or exclusion of units) are decided sequentially. After each decision, the inclusion probabilities of surrounding units are updated. A specific order for deciding the sampling outcomes is not enforced for SCPS, that is, the order can be chosen randomly or be fixed. A new modified method called locally correlated Poisson sampling (LCPS) is suggested. In this new method, the order of the decisions makes sure the inclusion probabilities are updated (more) locally. As a result, a stronger negative correlation between inclusion indicators of nearby units is achieved. Simulations on various data sets show that the resulting samples from LCPS, in general, are more spatially balanced and produce lower variance than samples from SCPS and the local pivotal method.


INTRODUCTION
For environmental surveys, there has been a long-standing interest in using spatial information to achieve some kind of spatial regularity in the sample.Commonly, variants of systematic or stratified designs have been used in order to get a sample which is more representative of the landscape that is surveyed.The usefulness of sampling designs which can achieve representative samples is not limited to environmental surveys.In many other applications, where x and y coordinates are not applicable, stratification is used as a way to achieve a representative sample with respect to some, often just a few, categorical variables.In other cases, a systematic sample is used in order to capture the distribution of some single numerical, auxiliary variable.
In recent years, many sampling methods which produce well-spread samples over multiple auxiliary variables have been introduced.If these auxiliary variables have some explanatory power on the target variable, such designs will often lead to a decrease in the design-based variance for the target variable (Stevens & Olsen, 2004).Stevens and Olsen (2004) introduced the general random-tessellation stratified design, which uses a function to map a two-dimensional space to an ordered list, selecting units using systematic ps-sampling, that is, a systematic without replacement design with inclusion probabilities proportional to size.While the mapping preserves some degree of the spatial structure, it cannot fully capture the structure of the population.The cube method, developed by Deville PRENTIUS and Tillé (2004), uses auxiliary information in multi-dimensional space to select balanced samples, that is, samples where the Horvitz-Thompson estimate of a total of an auxiliary variable is approximately equal to the population total.Grafström and Tillé (2013) adapted the cube method in order to produce well-spread samples in addition to balanced samples.Balanced acceptance sampling (BAS) (Robertson et al., 2013) selects a sample from a continuous or finite population using quasi-random numbers, that is, pseudo-random numbers which are evenly distributed over an interval.BAS allows for importance sampling through acceptance/rejection sampling.Benedetti and Piersimoni (2017) introduced a sampling design which selects a sample with a probability proportional to the distance between sample units, however not allowing for prescribed inclusion probabilities.Jauslin et al. (2022) proposes a method which selects balanced samples from streamed or sequential populations.
Correlated Poisson sampling (CPS) was introduced by Bondesson and Thorburn ( 2008), as a ps method usable in real-time sampling situations.The method is list-sequential, that is, a decision is taken one unit at a time, and the conditional probabilities for the remaining undecided units are updated according to the outcome of the decision, using the splitting method (Deville & Tillé, 1998).From CPS, Grafström (2012) developed spatially correlated Poisson sampling (SCPS), where the outcome of a decision prioritized updating the probabilities for the units close in auxiliary variable space, introducing negative correlation for these units' inclusion indicators.The SCPS method produces well-spread samples respecting the prescribed inclusion probabilities (Grafström & Schelin, 2014).
The local pivotal method (LPM) (Grafström et al., 2012) operates similarly to SCPS through the splitting method, however only affecting two units at each iteration of the algorithm, whereas CPS/SCPS may affect multiple units.The LPM comes in two variants, LPM 1 and LPM 2. In LPM 1, for each iteration of the algorithm, two pairwise nearest neighbors are selected at random and compete against each other.Depending on the outcome, their probabilities are updated, moving probability mass in the direction of the winner.For the second variant, LPM 2, a unit is chosen at random, and its competitor is randomly selected among its closest neighbors.Generally, LPM 1 performs the better of the two, as both competitors are each other's nearest neighbors, creating a stronger negative correlation between inclusion indicators close in auxiliary variable space.
In this article, a modification of the SCPS is proposed, called locally correlated Poisson sampling (LCPS).Inspired by the difference between LPM 1 and LPM 2, the units which are affected at each iteration of the LCPS algorithm are selected in a way that guarantees that the updating is done for the smallest possible neighborhood of units.As such, each decision only affects units in a more local area, introducing a stronger negative correlation between the inclusion indicators of these units.Compared to LPM and SCPS, two of the top-performing methods for producing well-spread samples (Benedetti et al., 2015), the proposed modification makes LCPS more efficient than both, when evaluated against a variety of data sets.
In Section 2, the sampling algorithms for LPM and SCPS are presented.Then, in Section 3, the LCPS is introduced, and some properties of the design are presented.The methods are compared through simulation in Section 4, followed by a brief discussion in Section 5.

SCPS AND LPM
Let U be a population of units labeled 1, 2, … , N with a prescribed inclusion probability vector .Furthermore, lets assume that there exists some fully known set of auxiliary variables, on which there exists a distance measure d.Let  (t) be a conditional inclusion probability vector at step t ≥ 0 such that  (0) = .Using the splitting method (Deville & Tillé, 1998), it is possible to split  (t−1) into two parts, and selecting a new conditional inclusion probability vector with probability  (t) , with probability 1 −  (t) , where u (t) is the updating vector.

Spatially correlated Poisson sampling
Let i(t) be the step unit at step t ≥ 1.For SCPS, i(t) has some predetermined order, say i(t) = t, or can be considered randomly drawn from the set of undetermined units For simplicity, i will denote i(t), and Using the maximal weight strategy (Grafström, 2012), at each step t with step unit i, The negative elements of u (t) is decided by the distance where ) .

The local pivotal method
In LPM 2, a unit i is randomly drawn from the set of unresolved units U(t), defined as (1).Let U i (t) = U(t) ⧵ i.A single competitor j is randomly drawn from the set of nearest neighbors where d(i, j) is the Euclidean distance between units i, j in auxiliary space.If For LPM 1, instead of drawing i from all unresolved units, a pair i, j is drawn from the set of pairwise nearest neighbors Thus, the average distance that probability is moved will be lower in LPM 1 compared to LPM 2.

LOCALLY CORRELATED POISSON SAMPLING
The ideas behind LPM and SCPS are that it is possible to improve the spatial balance by introducing negative correlation in the inclusion indicators of units close in auxiliary space (Grafström, 2012;Grafström et al., 2012).The SCPS method In LPM, probability mass is moved in the conditional probability vector between pairs of units close in auxiliary space, exemplified through Euclidean distance in R 2 .The set of possible pairs for LPM 1 is highlighted through solid lines, where possible pairs of possible competing units are highlighted through solid and dotted lines for LPM 2.

F I G U R E 2
For SCPS, probability mass is moved in the conditional probability vector between sets of units close in auxiliary space, exemplified through Euclidean distance in R 2 .A step unit is decided, and probabilities will be moved within a radius of this unit.Four such radii are shown as dotted or solid lines, in a setting where each unit has (conditional) probability mass 1/3.For LCPS, the step unit is decided among the set of units with the smallest possible radius, highlighted through a solid line.
does this by sequentially updating the probability vector, moving probability mass to or from units close to a step unit, prioritizing those who are closest.LPM 2 operates similarly, by moving probability between a step unit and it's closest neighbor, whereas for LPM 1, decisions are only taken between units which are pairwise nearest neighbors.In Figure 1, it can be seen that the movement of probability mass is on average lower in LPM 1 compared to LPM 2, where the former generally also produces the most spatially balanced samples (Grafström et al., 2012).By choosing the step unit for SCPS in a way that reduces the movement of probability mass it is possible to increase the spatial balance of the samples that SCPS produces.This modification of SCPS is called LCPS.
In Figure 2, the step unit with smallest distance is highlighted by the solid circle.Grafström (2012) proved that if a population can be partitioned into distinct regions with integer probability mass, in which the maximum distance between units within a region is smaller than any distance to a unit outside the region, SCPS would produce fixed sized samples for each region.For LCPS, this property can be extended to hold for any single such distinct region.
Theorem 1.For a population U, in which there exists a subset U m ⊂ U as an isolated region in auxiliary space such that for all units i ∈ U m where n m = ∑ i∈U m  i is integer, LCPS will produce a sample from U m with a fixed size n m .
Proof.Assume that there exists a partition U m for which n m is integer and (4) holds.For a unit i ∈ U m , the weights that can be provided by other units can be described by the triangular function w i (x), defined as (3).As long as weights w i ( j ), j ∈ U m ⧵ i can be found summing to (at least) one for any arbitrary unit i ∈ U m , LCPS will decide this unit before deciding any unit outside of U m which would move probability mass to or from U m .
If there exists one other unit j ∈ U m ⧵ i, and U m has integer probability mass, j must have probability  j = 1 −  i , and as such w i ( j ) = 1.As w i (x) is linearly increasing/decreasing around 1 −  i , it is not possible to introduce more units without either moving  j along the same side of w i (x), or having the sum of the weights be larger than 1, while keeping the probability mass integer.▪ Furthermore, LCPS provides the same bounds on partitions as LPM 2, for cases where the probability mass in a partition is not integer (Grafström et al., 2012).

is the inclusion indicator of a unit i, then the sum of inclusion indicators satisfies
for all partitions U m , where ⌊⋅⌋, ⌈⋅⌉ are the floor and ceiling functions respectively.
Proof.Let U m be a partition of a population U satisfying (5).We consider first the case of the upper bound of (6).If the upper bound doesn't hold, it must be possible for a partition U m to push more than n m − ⌊n m ⌋ into another partition.From Theorem 1, we know that if the probability mass n m is integer, then no probability mass will leave U m .Assume that U m has been resolved internally to the extent that no further decisions can be taken within U m without affecting units outside of U m .Let U * m and n * m be the remaining, undecided units, and their probability mass.In order to break the upper bound, it must be possible to remove strictly more than n * m − ⌊n * m ⌋ from U * m .For an arbitrary unit i ∈ U * m , the amount which will be removed from U * m upon exclusion of i is  i W i , where and Units in U + m will have their probabilities updated to 1, whereas units in j ∈ U − m will have their probabilities updated to  j ∕(1 −  i ).Using these components, we can rewrite n * m as where the cardinality |U + m | is probability mass which will definitely be kept in in U m , and the summation term is the remaining "free" probability mass in U * m after the exclusion of i.
As the sum of the two latter terms in ( 8) is strictly less than one, which is the same quantity as the maximum amount of probability mass which can be removed from U * m , and thus the upper bound of (6) holds.
Similarly for the lower bound; in order to break the lower bound, it must be possible to add strictly more than ⌈n * m ⌉ − n * m from U * m .For an arbitrary unit i ∈ U * m , the amount which will be added to U * m upon inclusion of i is (1 −  i )W i .Units in U − m will have their probabilities updated to 0, whereas units in j ∈ U + m will have their probabilities update to 1 − (1 −  j )∕ i .Rewriting n * m as where the summation term is the probability mass which is left in undecided units, it is obvious that as these sum of these terms is strictly less than one.Since the maximum amount of probability which can be added to U * m after the inclusion of i is it is not possible to add strictly more than ⌈n * m ⌉ − n * m and the lower bound of (6) holds.▪

SIMULATION
The proposed method was evaluated through simulation, measuring the spatial balance of the produced samples through two methods, one using Voronoi polytopes (Grafström et al., 2012;Stevens & Olsen, 2004), and the other using a modified Moran's I index (Tillé et al., 2018).The spatial balance measure based on Voronoi polytopes does not have a fixed range, and the spatial balance of samples can only be interpreted relative to each other, where lower is better.The modified Moran's I index gives values on [−1, 1], where a value of −1 indicates a perfectly spatially balanced sample, and 1 a perfectly clustered sample.
The LCPS method was applied, together with LPM 1, LPM 2, SCPS, and simple random sampling without replacement (SRS), on seven artificial populations, as well as on three openly available real data sets.The LPM and SCPS implementations were provided by the R package BalancedSampling, which also contains a C++ implementation of LCPS (Grafström et al., 2022).
The seven artificial populations, shown in Figure 3, were constructed as follows: a. Poisson cluster process: Three parent locations were randomly located on the unit square.Around each parent location, a random number of children are spawned according to a Poisson distribution with mean 40, and placed relative to the parent according to a normal distributions with variance 0.1.The number of observations were then reduced to 135.Any unit falling outside of the unit square were mirrored back onto the unit square.b.Poisson cluster process: As (a), but with 20 parents, a Poisson distribution with mean of 20 children spread around the parents with variance 0.01.c.Regular grid: A rectangular grid of 10 × 10 units.d.Uniform: 120 units placed uniformly over the unit square.e. Normal: 120 units placed according to a standard normal distribution along both axes.f.Triangular/uniform: 200 units placed uniformly on one axis, and according to a (right-angled) triangular distribution on the other.g.Line: 105 units placed along a straight line, with coordinates decreasing distance between units towards the center.
The three real data sets, with auxiliary variables provided in Table 1, were the following: 1. Baltimore: House sale price and characteristics from Baltimore, MD 1978, consisting of 211 observations, provided by the R package spData (Bivand et al., 2021).2. Wheat: Wheat yield data from an agricultural field experiment by Mercer and Hall, consisting of 500 observations in a regular grid, provided by the R package spData (Bivand et al., 2021).3. Meuse: Heavy metal concentrations along the flood plain of the river Meuse, consisting of 155 observations, provided by the R package sp (Bivand et al., 2013).Two observations were excluded, as they were missing data on organic matter.
From each population, 10,000 samples were taken with sample sizes 10, 20, and 40, using the previously mentioned methods.For each sample, mean spatial balance measures were calculated, and are presented in Tables 2 and 3.In Table 4, the relative mean squared errors (MSE) are presented for the Horvitz-Thompson estimators of the variable of interest in  where * is a placeholder for the sampling method.The variables of interest, marked by italics in Table 1, were the sales price for the Baltimore data set, the cadmium concentration for the Meuse data set, and the wheat yield for the wheat data set.The results show that LCPS produces the most spatially balanced samples for all populations.The spatial balance of the sample is also shown to have an effect on the MSE's, as more spatial balance produces lower MSE's.
each of the real data sets.The relative MSE is defined relative to the MSE of the SRS, as Relative MSE( * ) = MSE( * )∕MSE(SRS), Variable descriptions for data sets.Mean spatial balance using Voronoi polytopes for the five sampling methods and various populations.Mean spatial balance using a modified Moran's I index for the five sampling methods and various populations.Lower values implies more spatial balance.n = sample size.Meth.= Sampling method.Relative MSE's of the Horvitz-Thompson estimators of the variable of interest for the four sampling methods and each of the three data sets.