## 1. Introduction

The term *geostatistics* describes the branch of spatial statistics in which data are obtained by sampling a spatially continuous phenomenon *S*(*x*):*x*∈ℝ^{2} at a discrete set of locations *x*_{i}, *i*=1,…,*n*, in a spatial region of interest *A*⊂ℝ^{2}. In many cases, *S*(*x*) cannot be measured without error. Measurement errors in geostatistical data are typically assumed to be additive, possibly on a transformed scale. Hence, if *Y*_{i} denotes the measured value at the location *x*_{i}, a simple model for the data takes the form

$$Y_i = \mu + S(x_i) + Z_i, \qquad i = 1, \ldots, n, \tag{1}$$

where the *Z*_{i} are mutually independent, zero-mean random variables with variance *τ*^{2}, often in this context called the *nugget variance*. One interpretation of the *Z*_{i} in model (1) is as measurement errors in the *Y*_{i}. Another, which explains the more colourful terminology, is as a device to model spatial variation on a scale that is smaller than the shortest distance between any two sample locations *x*_{i}. We adopt the convention that *E*[*S*(*x*)]=0 for all *x*; hence in model (1) *E*[*Y*_{i}]=*μ* for all *i*. Model (1) extends easily to the regression setting, in which *E*[*Y*_{i}]=*d*_{i}′*β*, with *d*_{i} a vector of explanatory variables associated with *Y*_{i} and *β* the corresponding vector of regression parameters. The objectives of a geostatistical analysis typically focus on prediction of properties of the realization of *S*(*x*) throughout the region of interest *A*. Targets for prediction might include, according to context, the value of *S*(*x*) at an unsampled location, the spatial average of *S*(*x*) over *A* or subsets thereof, the minimum or maximum value of *S*(*x*) or subregions in which *S*(*x*) exceeds a particular threshold. Chilès and Delfiner (1999) have given a comprehensive account of classical geostatistical models and methods.
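To make model (1) concrete, the following minimal sketch simulates measurements *Y*_{i} as a mean, plus a spatially correlated zero-mean signal, plus independent nugget noise. Several choices here are assumptions made purely for illustration and are not specified in the text above: the signal is taken to be Gaussian with an exponential covariance function *σ*^{2}exp(−*u*/*φ*) at distance *u*, and the names `simulate_model_1`, `sigma2`, `phi` and `tau2` are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - acc)
            else:
                low[i][j] = (a[i][j] - acc) / low[j][j]
    return low


def simulate_model_1(locations, mu, sigma2, phi, tau2, rng):
    """Simulate Y_i = mu + S(x_i) + Z_i at the given locations.

    Assumption (illustrative only): S is a zero-mean Gaussian process with
    exponential covariance sigma2 * exp(-distance / phi); Z_i ~ N(0, tau2).
    Returns the pair (s, y) of signal values and measured values.
    """
    n = len(locations)
    cov = [[sigma2 * math.exp(-math.dist(p, q) / phi) for q in locations]
           for p in locations]
    low = cholesky(cov)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # S = L w has covariance L L^T = cov.
    s = [sum(low[i][k] * w[k] for k in range(i + 1)) for i in range(n)]
    tau = math.sqrt(tau2)
    y = [mu + s[i] + rng.gauss(0.0, tau) for i in range(n)]
    return s, y


# Example: 40 locations chosen uniformly at random on the unit square.
rng = random.Random(42)
locations = [(rng.random(), rng.random()) for _ in range(40)]
s, y = simulate_model_1(locations, mu=4.0, sigma2=1.5, phi=0.2, tau2=0.25, rng=rng)
print("sample mean of Y:", sum(y) / len(y))
```

Setting `tau2=0.0` removes the nugget, so that each *Y*_{i} equals *μ*+*S*(*x*_{i}) exactly, which is a convenient check on the implementation.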

Diggle *et al.* (1998) introduced the term *model-based geostatistics* to mean the application of general principles of statistical modelling and inference to geostatistical problems. In particular, they added Gaussian distributional assumptions to the classical model (1) and re-expressed it as a two-level hierarchical linear model, in which *S*(*x*) is the value at location *x* of a latent Gaussian stochastic process and, conditional on *S*(*x*_{i}), *i*=1,…,*n*, the measured values *Y*_{i},*i*=1,…,*n*, are mutually independent, normally distributed with means *μ*+*S*(*x*_{i}) and common variance *τ*^{2}. Diggle *et al.* (1998) then extended this model, retaining the Gaussian assumption for *S*(*x*) but allowing a generalized linear model (McCullagh and Nelder, 1989) for the mutually independent conditional distributions of the *Y*_{i} given *S*(*x*_{i}).

As a convenient shorthand notation to describe the hierarchical structure of a geostatistical model, we use [·] to mean ‘the distribution of’ and write *S*={*S*(*x*):*x*∈ℝ^{2}}, *X*=(*x*_{1},…,*x*_{n}), *S*(*X*)={*S*(*x*_{1}),…,*S*(*x*_{n})} and *Y*=(*Y*_{1},…,*Y*_{n}). Then, the model of Diggle *et al.* (1998) implicitly treats *X* as being deterministic and has the structure [*S*,*Y*]=[*S*][*Y*|*S*(*X*)]=[*S*][*Y*_{1}|*S*(*x*_{1})][*Y*_{2}|*S*(*x*_{2})]…[*Y*_{n}|*S*(*x*_{n})]. Furthermore, in model (1) the [*Y*_{i}|*S*(*x*_{i})] are univariate Gaussian distributions with means *μ*+*S*(*x*_{i}) and common variance *τ*^{2}.

As presented above, and in almost all of the geostatistical literature, models for the data treat the sampling locations *x*_{i} either as fixed by design or otherwise stochastically independent of the process *S*(*x*). Admitting the possibility that the sampling design may be stochastic, a complete model needs to specify the joint distribution of *S*, *X* and *Y*. Under the assumption that *X* is independent of *S* we can write the required joint distribution as [*S*,*X*,*Y*]=[*S*][*X*][*Y*|*S*(*X*)], from which it is clear that for inferences about *S* or *Y* we can legitimately condition on *X* and use standard geostatistical methods. We refer to this as *non-preferential sampling* of geostatistical data. Conversely, *preferential sampling* refers to any situation in which [*S*,*X*]≠[*S*][*X*].

We contrast the term *non-preferential* with the term *uniform*, the latter meaning that, beforehand, all locations in *A* are equally likely to be sampled. Examples of designs which are both uniform and non-preferential include completely random designs and regular lattice designs (strictly, in the latter case, if the lattice origin is chosen at random). An example of a non-uniform, non-preferential design would be one in which sample locations are an independent random sample from a prescribed non-uniform distribution on *A*. Preferential designs can arise either because sampling locations are deliberately concentrated in subregions of *A* where the underlying values of *S*(*x*) are thought likely to be larger (or smaller) than average, or more generally when *X* and *Y* together form a marked point process in which there is dependence between the points *X* and the marks *Y*.
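The three kinds of design distinguished above can be sketched in a few lines. This is an illustration under stated assumptions rather than the model developed later in the paper: the region *A* is taken to be the unit square, the surface *S* is a smooth random sum of cosines standing in for a realization of a stationary process, and the preferential design selects locations with probability proportional to exp{*βS*(*x*)}, with `beta` a hypothetical parameter. All function names are invented for the example.

```python
import math
import random


def make_surface(rng, n_terms=50, phi=0.15):
    """Return a smooth random function S(x) on the plane (random cosine sum)."""
    terms = [(rng.gauss(0.0, 1.0 / phi), rng.gauss(0.0, 1.0 / phi),
              rng.uniform(0.0, 2.0 * math.pi)) for _ in range(n_terms)]
    scale = math.sqrt(2.0 / n_terms)

    def surface(x):
        return scale * sum(math.cos(a * x[0] + b * x[1] + c) for a, b, c in terms)

    return surface


def grid_cells(m):
    """Centres of an m x m grid of square cells covering the unit square."""
    return [((i + 0.5) / m, (j + 0.5) / m) for i in range(m) for j in range(m)]


def sample_design(cells, weights, n, rng, m):
    """Pick n cells with probability proportional to weights, then jitter
    each chosen centre uniformly within its cell."""
    chosen = rng.choices(cells, weights=weights, k=n)
    half = 0.5 / m
    return [(cx + rng.uniform(-half, half), cy + rng.uniform(-half, half))
            for cx, cy in chosen]


rng = random.Random(2010)
m = 30
cells = grid_cells(m)
S = make_surface(rng)
s_vals = [S(c) for c in cells]
beta = 2.0  # hypothetical strength of preferentiality

weights = {
    # Uniform and non-preferential: every cell equally likely.
    "uniform": [1.0] * len(cells),
    # Non-uniform but non-preferential: a prescribed density unrelated to S.
    "non-uniform": [1.0 + 3.0 * c[0] for c in cells],
    # Preferential: sampling intensity depends on the realized surface S.
    "preferential": [math.exp(beta * v) for v in s_vals],
}
designs = {name: sample_design(cells, w, 100, rng, m) for name, w in weights.items()}

for name, X in designs.items():
    mean_s = sum(S(x) for x in X) / len(X)
    print(f"{name:>12}: mean of S at sampled locations = {mean_s:.3f}")
```

With `beta` positive, the preferential weights favour cells where the surface is high, so the average of *S* over sampled locations tends to exceed its average over *A*; the other two designs carry no such information about *S*.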

We emphasize at this point that our definition of preferential sampling involves a stochastic dependence, as opposed to a functional dependence, between the process *S* and the sampling design *X*. For example, a model in which the mean of *S* and the intensity of *X* share a dependence on a common set of explanatory variables does not constitute preferential sampling. In most geostatistical applications it is difficult to maintain a sharp distinction between the treatment of variation in *S*(*x*) as deterministic or stochastic because of the absence of independent replication of the process under investigation. Our pragmatic stance is to represent by a stochastic model the portion of the total variation in *S* that cannot be captured by extant explanatory variables.

Curriero *et al.* (2002) evaluated a class of non-ergodic estimators for the covariance structure of geostatistical data, which had been proposed by Isaaks and Srivastava (1988) and Srivastava and Parker (1989) as a way of dealing with preferential sampling. They concluded that the non-ergodic estimators ‘possess no clear advantage’ over the traditional estimators that we describe in Section 3.1. Schlather *et al.* (2004) developed two tests for preferential sampling, which treat a set of geostatistical data as a realization of a marked point process. Their null hypothesis is that the data are a realization of a *random-field model*. This model assumes that the sample locations *X* are a realization of a point process on *A*, that the mark of a point at location *x* is the value at *x* of the realization of a random field *S* on *A*, and that *X* and *S* are independent processes. This is therefore equivalent to our notion of non-preferential sampling. Their test statistics are based on the following idea. Assume that *S* is stationary, and let *M*_{k}(*h*)=*E*[*S*(*x*)^{k}|*x*,*x*+*h*∈*X*]. Under the null hypothesis that sampling is non-preferential, the conditioning on *X* is irrelevant; hence *M*_{k}(*h*) is a constant. Schlather *et al.* (2004) proposed as test statistics the empirical counterparts of *M*_{1}(*h*) and *M*_{2}(*h*), and implemented the resulting tests by comparing the observed value of each chosen test statistic with values calculated from simulations of a conventional geostatistical model fitted to the data on the assumption that sampling is non-preferential. Guan and Afshartous (2007) avoided the need for simulation and parametric model fitting by dividing the observation region into non-overlapping subregions that can be assumed to provide approximately independent replicates of the test statistics. In practice, this requires a large data set; their application has a sample size *n*=4358.
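The idea behind these test statistics can be illustrated with a deliberately simplified empirical version of *M*_{k}(*h*), read here as the mean of the *k*th power of the mark at a sample location, given that another sample location lies at distance about *h* from it. The sketch below bins ordered pairs of sample points by their separation and averages the marks within each bin. It is not the estimator of Schlather *et al.* (2004): it uses distance rather than vector lag, applies no edge correction or kernel smoothing, and the function name `empirical_mk` and its arguments are hypothetical.

```python
import math


def empirical_mk(locations, marks, k, bin_edges):
    """Binned empirical analogue of M_k(h).

    For every ordered pair (i, j), i != j, whose separation falls in
    [bin_edges[b], bin_edges[b + 1]), add marks[i] ** k to bin b.
    Returns one average per bin, or None for a bin with no pairs.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    n = len(locations)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(locations[i], locations[j])
            for b in range(n_bins):
                if bin_edges[b] <= d < bin_edges[b + 1]:
                    sums[b] += marks[i] ** k
                    counts[b] += 1
                    break
    return [sums[b] / counts[b] if counts[b] else None for b in range(n_bins)]


# Tiny worked example: two points at distance 0.5 with marks 1 and 3.
locations = [(0.0, 0.0), (0.5, 0.0)]
marks = [1.0, 3.0]
edges = [0.0, 0.25, 0.75]
m1 = empirical_mk(locations, marks, 1, edges)
m2 = empirical_mk(locations, marks, 2, edges)
print(m1, m2)
```

Under non-preferential sampling the binned averages should be roughly constant across bins, apart from sampling variation. A systematic trend with distance, for example larger values at short distances where sample points cluster, is the kind of departure that the tests are designed to detect.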

In this paper, we propose a class of stochastic models and associated methods of likelihood-based inference for preferentially sampled geostatistical data. In Section 2 we define our model for preferential sampling. In Section 3 we use the model to illustrate the potential for inferences to be misleading when conventional geostatistical methods are applied to preferentially sampled data. Section 4 discusses likelihood-based inference using Monte Carlo methods and suggests a simple diagnostic for the fitted model. Section 5 applies our model and methods to a set of biomonitoring data from Galicia, northern Spain, in which the data derive from two surveys of the same region, one of which is preferentially sampled, the other not. Section 6 is a concluding discussion.

The data that are analysed in the paper can be obtained from http://www.blackwellpublishing.com/rss