Feasibility of using survey data and semi-variogram kriging to obtain bespoke indices of neighbourhood characteristics: a simulation and a case study

Data on neighbourhood characteristics are not typically collected in epidemiological studies. They are however useful in the study of small-area health inequalities. Neighbourhood characteristics are collected in some surveys and could be linked to the data of other studies. We propose to use kriging based on semi-variogram models to predict values at non-observed locations with the aim of constructing bespoke indices of neighbourhood characteristics to be linked to data from epidemiological studies. We perform a simulation study to assess the feasibility of the method as well as a case study using data from the RECORD study. Apart from having enough observed data at small distances to the non-observed locations, a good fitting semi-variogram, a larger range and the absence of nugget effects for the semi-variogram models are factors leading to a higher reliability.


Background
Health of individuals is not just determined by individual characteristics but also by the environment, physical and social, in which one lives [Diez Roux, 2016, Kress et al., 2020, Zolitschka et al., 2019] such that an unequal distribution of health determinants may lead to health inequalities [Zolitschka et al., 2022, Egan et al., 2008]. In particular, the spatial distribution of contextual factors such as access to transport and noise, but also neighbourhood social structures, plays a role in the development of health inequalities [Zolitschka et al., 2022]. In social epidemiological research, there is a wide range of possibilities to empirically assess contextual factors.
Noise or air pollution data obtained from official sources at a very small scale, for example, provide an accurate description of exposure at the location of study participants. Contextual factors can also be measured as individual perceptions, data which are typically available in social surveys.
While these may be poor proxies for objective measures, they are determinants of health in their own right. Concerning neighbourhood social structures, there may be no other more objective measures [Zolitschka et al., 2019].
In this research context, indicators of small-area (e.g. neighbourhood) characteristics are used to study contextual effects on health inequalities. Examples of such indicators are indices of multiple deprivation (IMDs), tools to categorise the socio-economic structure of the place of residence of study participants. These are usually obtained from official statistics or surveys and provide aggregated indicators at the level of administrative units. This means that indices are based on dimensions which are the same for everyone living in a given unit. Examples of such indices available for analysis in social epidemiology are the one developed for all nations of the U.K. [Abel et al., 2016] and the German Index of Multiple Deprivation (GIMD) [Maier and Schwettmann, 2018]. IMDs are usually categorised into quintiles [Abel et al., 2016] and linked to observations based on their area of residence. A reason for this is to avoid the assumption of a linear association between outcome and deprivation.
Various methods have been developed to increase the usability of IMDs and widen the range of data sources used. One example is to use regression models to adjust IMDs obtained separately for the four U.K. nations based on different data and statistical units [Abel et al., 2016].
Regression models are also used to "link" indicators estimated from surveys valid at a larger scale using small-area shrinkage estimation methods [Datta and Ghosh, 2012]. More generally, Gotway and Young [Gotway and Young, 2002, Gotway and Young, 2007] proposed a framework for linking geographically aggregated data from different sources for the purpose of spatial prediction, providing measures of uncertainty. Spatial microsimulation is another method that links various data sources to create synthetic datasets by combining, for example, individual survey data with small-area aggregated data (e.g. census data). These datasets are then used in simulation studies [Smith et al., 2021]. This method has been used to study various questions related to contextual health inequalities [Campbell and Ballas, 2016, Ballas et al., 2006, Wu et al., 2022]. One particular limitation of all these procedures for their application to the study of small-area health inequalities is their reliance on the administrative units at which data are aggregated. This limitation has been highlighted in the neighbourhood effect literature [Van Ham et al., 2012].
Moreover, IMDs are not necessarily relevant for all populations or research questions. In particular, the small-area determinants of health of vulnerable populations can be different from those of the general population. Theoretical work on small-area factors relevant for the health of vulnerable populations (e.g., refugees) postulates that the housing context is relevant for health, in particular "immediate surroundings of the accommodation and its' physical and social boundaries" [Dudek et al., 2022a]. A quantitative analysis confirms that feelings of isolation, of insecurity and limited contact with neighbours are associated with the health of refugees [Dudek et al., 2022b]. These factors, specific to a particular population group, might not be relevant for the majority population. A major point is that the relevance of these neighbourhood characteristics is only very local and it would not make sense to use them aggregated at a larger scale. Therefore we need alternative methods to obtain indicators of neighbourhood characteristics based on bespoke variables and defined at very small spatial scales. Valuable information about neighbourhood characteristics is contained in social survey data in which participants are asked about their perception of green space, noise, quality of construction, or about the social characteristics of their neighbourhood. Aggregated as a score, these variables provide neighbourhood metrics based on individual perception.
Geo-spatial modelling methods, in particular the modelling of semi-variograms, have been successfully used to evaluate non-measured spatial effects on health [Breckenkamp et al., 2021]. This consists of estimating how much an outcome is correlated with the outcome of a neighbour as a function of the distance to the neighbour [Schabenberger and Gotway, 2005]. Information about neighbourhood characteristics at a small scale, and in particular relating to the social fabric of a neighbourhood, may be available in survey data for which the geo-location of the participants is included. Modelling the spatial distribution of these data provides models which can be used for the prediction of values at given locations via the kriging method [Cressie, 1993, Schabenberger and Gotway, 2005]. This would make it possible to obtain bespoke indicators of neighbourhood characteristics, to be used in the study of small-area health inequalities for specific populations, at individual locations rather than as aggregated data at a given spatial unit.
It remains to be shown whether this approach provides reliable and useful predictions of the true values. A particular source of error is the estimation of the semi-variogram from sparse data at small distances, as seen in Sauzet et al. [Sauzet et al., 2021]. Consequently, only a small number of data points is available for the kriging estimates at small distances, given that the weight attributed to data beyond the range of the semi-variogram should be virtually null. Different variables related to aspects of a common construct such as social cohesion [Kress et al., 2020] can be correlated at the same location as well as across different spatial locations.
These correlation structures can also be modelled using a form of semi-variogram generalised to several variables: the co-variogram [Gelfand and Banerjee, 2010]. A score can be constructed as a linear combination of variable values predicted either separately from each other or by using a common co-variogram model.
In this work we investigate the usability of kriging models based on semi-variograms to predict geo-located neighbourhood characteristics at small scale (less than a kilometre) based on data of the type available in survey data: large sample size but sparse data at small distances.
We first recall the definition and estimation of semi-variograms and the method of kriging.
We then perform a simulation study to evaluate the reliability of semi-variogram-based kriging prediction when data at small distances are sparse, and next extend the simulation to a comparison of univariate semi-variograms with co-variograms to build indicators based on several variables. Finally, we present a case study based on data from the RECORD study collected from over 7,000 inhabitants of the Urban Area of Paris [Chaix et al., 2012].

Method

Statistical model
We assume a Gaussian random field as the continuous random process underlying the distribution of a spatially correlated neighbourhood characteristic Z. More details on what follows can be found in [Schabenberger and Gotway, 2005]. That is, Z is assumed to exist at every point s, defined by its coordinates (x, y), of a continuous area D. We denote by Z(s) the realisation of the random variable Z at location s. We also assume second-order stationarity and isotropy of the random field. That means that the covariance of Z measured at two different locations only depends on the distance (lag) h between the locations and is independent of its direction (rotation-invariant), and the random field has a constant mean:
$$E[Z(s)] = \mu, \qquad \mathrm{Cov}[Z(s), Z(s + h)] = C(h).$$
It follows that the variance of Z is also constant since it corresponds to C(0).
In geostatistics, it is common to express the spatial covariance structure of a continuous process in terms of semi-variograms. The semi-variogram contains essentially the same information as the covariance function. It is defined as:
$$\gamma(h) = \frac{1}{2}\mathrm{Var}[Z(s) - Z(s + h)].$$
For second-order stationary processes one has:
$$\gamma(h) = C(0) - C(h),$$
such that $\gamma(0) = C(0) - C(0) = 0$, that is, the semi-variogram passes through the origin, and it reaches (approximately) $\gamma(h^*) = \sigma^2$, the partial sill, when the distance h exceeds the distance $h^*$, the so-called range, up to which observations are correlated [Schabenberger and Gotway, 2005].
The empirical semi-variogram is estimated by the Matheron estimator as
$$\hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{(s_i, s_j) \in N(h)} \{Z(s_i) - Z(s_j)\}^2,$$
where $N(h)$ is the set of pairs of data points which have a lag distance of h (or lie in a small interval around it) and $|N(h)|$ is the number of such pairs.
The Matheron estimator is unbiased under the above-mentioned hypotheses. The graph of $\hat{\gamma}(h)$ against $\|h\|$ is called the empirical semi-variogram.
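As an illustration of the estimator above, the following minimal Python sketch bins all pairs of points by their distance and returns half the mean squared difference per bin. The function name, the binning rule (half-open intervals of width `bin_width`) and the toy data are assumptions made for this example; they are not taken from the study, which used R.

```python
import math

def empirical_semivariogram(coords, values, bin_width, max_dist):
    """Matheron estimator: half the mean squared difference per lag bin."""
    n_bins = int(math.ceil(max_dist / bin_width))
    sums = [0.0] * n_bins    # sum of squared differences per bin
    counts = [0] * n_bins    # |N(h)|: number of pairs per bin
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d >= max_dist:
                continue  # pairs beyond the maximal distance are ignored
            b = int(d // bin_width)  # bin k covers [k * width, (k + 1) * width)
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    # gamma_hat = sum / (2 |N(h)|); None marks bins without any pair
    gamma = [s / (2 * c) if c > 0 else None for s, c in zip(sums, counts)]
    return gamma, counts

# Toy data (hypothetical): four points on a line, 100 m apart
coords = [(0, 0), (100, 0), (200, 0), (300, 0)]
values = [1.0, 2.0, 4.0, 7.0]
gamma, counts = empirical_semivariogram(coords, values, bin_width=100, max_dist=400)
```

The pair counts per bin make the sparsity problem visible: bins at small distances with few pairs yield unstable estimates.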
There are cases where the (empirical) semi-variogram does not pass through the origin because of a) spatial microprocesses with a more fine-grained resolution than can be reproduced by the semi-variogram, or b) measurement errors. In practice the variables are not completely spatially structured and therefore two observations taken at the same location are not necessarily equal. The deviation from the origin at h = 0 is called the nugget effect $c_0$, with $c_0 + \sigma_0^2 = \sigma^2$, where $\sigma_0^2$ is the partial sill.
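An empirical semi-variogram can often be approximated by fitting one of several standard functions. One standard parametric model is the exponential semi-variogram, which can be fitted as
$$\gamma(h) = c_0 + \sigma_0^2 \left(1 - \exp\left(-\frac{\|h\|}{\theta}\right)\right), \quad \|h\| > 0,$$
with practical range $3\theta$, where $\theta$ is a parameter to be fitted.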
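A short Python sketch of the exponential model may help to see why the practical range is taken as three times the scale parameter: at that distance the model has reached about 95% of the partial sill. The function name is hypothetical; the scale value of 252 m is the one reported later for the case study.

```python
import math

def exp_semivariogram(h, theta, partial_sill=1.0, nugget=0.0):
    """Exponential semi-variogram; by definition it is 0 at distance 0."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / theta))

theta = 252.0                  # scale parameter in metres
practical_range = 3 * theta    # 756 m
# Fraction of the partial sill reached at the practical range: 1 - exp(-3)
fraction_at_range = exp_semivariogram(practical_range, theta)
```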

Prediction of spatially structured variables at unmeasured points
There are different methods to predict a value at locations within a continuous area (random field) where no measurement is available. With kriging, the value of one or more characteristic(s) Z at a given point $s_0$ is predicted from the known values of this characteristic at nearby points.
The result is a weighted average of all known points in the area (global kriging) or of a selection of neighbouring points (local kriging), that is, a linear combination of known data. Here the weighting is provided by the exponential semi-variogram model and depends on the spatial correlation structure, in particular on the distance between points, such that points at distances above the range have a weight of virtually 0.

Ordinary kriging
Under the assumption of a constant but unknown mean µ of the two-dimensional random field, we perform ordinary kriging to predict unobserved values $Z(s_0)$ at (usually unobserved) locations $s_0 = [x_0, y_0]$. The assumed model is:
$$Z(s) = \mu \mathbf{1} + e(s), \qquad e(s) \sim (0, \Sigma),$$
with Z(s) representing the observed spatial data $[Z(s_1), Z(s_2), \ldots, Z(s_n)]$, and
• µ is unknown but constant over the random field.
• Σ is assumed to be known in this model. In practical applications this is often not the case, and it has to be estimated based on the semi-variogram model.
• Z(s) is a second-order stationary process (for the benefit of semi-variogram estimation).
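The ordinary kriging predictor at a location $s_0$ is
$$\hat{Z}(s_0) = \lambda' Z(s),$$
with Z(s) representing the observed data and λ the vector of kriging weights, which sum to one. These weights are determined by
1. the vector of covariances between the value at the prediction location and the observed data, $\sigma = \mathrm{Cov}[Z(s_0), Z(s)]$, and
2. the variance-covariance matrix $\Sigma = \mathrm{Var}[Z(s)]$ or the corresponding inverse.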
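The following Python sketch illustrates how ordinary kriging weights can be obtained in practice. It is a minimal example under stated assumptions, not the implementation used in the study: it solves the ordinary kriging system in its equivalent semi-variogram form (with a Lagrange multiplier enforcing that the weights sum to one), assumes an exponential model without nugget, and uses a small hand-written linear solver so that it needs no external library. All names and the toy data are hypothetical.

```python
import math

def exp_semivariogram(h, theta, partial_sill=1.0, nugget=0.0):
    return 0.0 if h == 0 else nugget + partial_sill * (1.0 - math.exp(-h / theta))

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ordinary_kriging(coords, values, s0, theta):
    """Return the prediction at s0 and the kriging weights."""
    n = len(coords)
    # Semi-variogram values between observed points, bordered by the
    # unbiasedness constraint (weights sum to one)
    a = [[exp_semivariogram(math.dist(coords[i], coords[j]), theta)
          for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Semi-variogram values between observed points and the target location
    b = [exp_semivariogram(math.dist(c, s0), theta) for c in coords] + [1.0]
    weights = solve(a, b)[:n]  # last entry is the Lagrange multiplier
    return sum(w * z for w, z in zip(weights, values)), weights

coords = [(0.0, 0.0), (300.0, 0.0), (0.0, 400.0), (500.0, 500.0)]
values = [1.0, 2.0, 0.5, 3.0]
prediction, weights = ordinary_kriging(coords, values, (100.0, 100.0), theta=252.0)
```

Without a nugget effect the predictor interpolates exactly, so predicting at an observed location returns the observed value; this is a convenient sanity check for any implementation.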

Co-kriging
The kriging model can be extended to the simultaneous prediction of several variables which are spatially correlated with each other [Schabenberger and Gotway, 2005]. This is typically the case when constructing an index based on a number of related neighbourhood characteristics.
In this case, in addition to a semi-variogram for every variable, a cross-semi-variogram for each pair of variables can be modelled. A challenge is that the cross-covariance functions have to be specified in such a way that a positive definite covariance matrix results for every selection of locations. We use the linear model of co-regionalisation (LMC) implemented in the gstat package in R [Goulard and Voltz, 1992, R Core Team, 2021, Gräler et al., 2016, Pebesma, 2004].
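To make the positive-definiteness requirement concrete, the sketch below considers the simplest special case of a linear model of co-regionalisation: a single exponential structure shared by all variables, scaled by a co-regionalisation matrix B. In that case the model is valid when B is positive (semi-)definite. The code is an illustrative assumption, not the gstat procedure: it builds B as an equicorrelation matrix with unit variances and checks positive definiteness by attempting a Cholesky factorisation.

```python
import math

def is_positive_definite(b):
    """Attempt a Cholesky factorisation; it succeeds only for positive definite matrices."""
    n = len(b)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                pivot = b[i][i] - s
                if pivot <= 0:
                    return False
                l[i][j] = math.sqrt(pivot)
            else:
                l[i][j] = (b[i][j] - s) / l[j][j]
    return True

def equicorrelation(r, k=3):
    """Co-regionalisation matrix for k variables with unit variance and common correlation r."""
    return [[1.0 if i == j else r for j in range(k)] for i in range(k)]

def cross_covariance(h, i, j, b, theta):
    """Cross-covariance of variables i and j at distance h (single exponential structure)."""
    return b[i][j] * math.exp(-h / theta)

valid = {r: is_positive_definite(equicorrelation(r)) for r in (0.1, 0.5, 0.9)}
invalid_example = is_positive_definite(equicorrelation(-0.6))  # not a valid model
```

Three variables cannot all be pairwise correlated at -0.6, which is why the last matrix fails the check; fitting cross-variograms freely, pair by pair, can produce this kind of inadmissible model.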
The different variables can be measured at identical locations (for example, when they are from the same survey) which is called collocated or isotopic measures [Wackernagel et al., 2002] or each at different locations (when from different data sources which may use different spatial scales), which is called heterotopic.If all variables are indicators of the same spatial process, spatial correlations can be assumed in both cases.
A question that the simulation study needs to answer is whether, in order to predict a linear combination of variables (index), it is better to take the correlation structure between variables into account by using a co-kriging model with a more complex estimation procedure or to predict the variables separately. To compute the index value, we combine the predicted values at a specified location as the unweighted sum or mean of the three variables in both cases.

Simulation study
The simulation scenarios are based on previous studies which provide us with ranges for the semi-variogram and a number of neighbours within this range which corresponds to the kind of data which might be expected if some form of local sampling is performed (e.g. spatial random walk). The size of the sample surface and the density of the simulated data points are based on the overall sample size available to estimate an exponential semi-variogram model and the number of points available at a given range (radius) to predict values from the data available.
We performed two families of simulations: one in which the data are used to estimate the semi-variogram model for kriging and one in which the range is given a priori based on theoretical considerations. The limitation of performing a simulation in which the semi-variogram model is estimated is that there is no possibility to check whether the semi-variogram model fits the empirical semi-variogram. When the fit is poor in an applied analysis, the person fitting the model would change the starting parameters for the estimation or the maximal distance defining the empirical semi-variogram. This cannot be done in a simulation and leads to artificially increased variability.
Therefore we include a check of validity in which only models with a "reasonable" shape parameter are considered. For the simulation with fixed range we use ranges which may differ from the true range of the simulation process. The simulation parameters are provided below and in Table 2.
The aim of the simulation study is twofold: 1. to evaluate the reliability of prediction using a semi-variogram kriging model when the semi-variogram must be estimated from the data and only a limited number of observations is available at small distances, and 2. to find out whether using a co-variogram to predict several spatially correlated measures improves the prediction of a summary index compared to predicting the measures separately. The simulation models reproduce the type of data analysed in other studies (e.g. [Breckenkamp et al., 2021]) but are also based on theoretical considerations in which only the local spatial scale is relevant and spatial processes at wider ranges are ignored.

Simulation of true values
We start by simulating random fields. We assume a second-order stationary process with an exponential semi-variogram function in all simulated fields, with varied ranges. The variance of the considered variable was set to 1. Overall, twelve random fields were simulated on regular grids with different dimensionalities and ranges. From these realisations we sampled the observed values used to estimate the parameters of the semi-variogram model and the kriging predictions.
Different sampling densities were simulated by varying the dimensionality of the simulated fields to be 8,000m, 10,000m, and 15,000m and, for each field size, sampling three different numbers of points to result in three comparable densities. Half of the random fields were simulated without a nugget effect and the other half with a nugget effect of 0.2 (20% of the true variance). For the co-regionalised (multivariate) model, three variables were simulated with varying pairwise correlations between these variables. The data were simulated with the R package RandomFields [Schlather et al., 2015].
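The principle of simulating such a field can be sketched in a few lines of Python. This is not the RandomFields procedure used in the study but a generic small-scale illustration: the exponential covariance matrix of the grid points is factorised by Cholesky decomposition and multiplied with independent standard normal draws. The grid size, spacing, scale parameter and the split of the unit variance into a partial sill of 0.8 and a nugget of 0.2 are assumptions chosen for the example.

```python
import math
import random

def exp_covariance(h, theta, partial_sill, nugget):
    """Covariance implied by an exponential semi-variogram with nugget."""
    return partial_sill + nugget if h == 0 else partial_sill * math.exp(-h / theta)

def cholesky(c):
    """Lower-triangular factor l with l * l' = c (c must be positive definite)."""
    n = len(c)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(c[i][i] - s)
            else:
                l[i][j] = (c[i][j] - s) / l[j][j]
    return l

def simulate_field(coords, theta, partial_sill=1.0, nugget=0.0, seed=1):
    """One zero-mean Gaussian realisation at the given coordinates."""
    rng = random.Random(seed)
    cov = [[exp_covariance(math.dist(a, b), theta, partial_sill, nugget)
            for b in coords] for a in coords]
    l = cholesky(cov)
    w = [rng.gauss(0.0, 1.0) for _ in coords]
    z = [sum(l[i][k] * w[k] for k in range(i + 1)) for i in range(len(coords))]
    return z, cov, l

# Hypothetical 5 x 5 grid with 250 m spacing, practical range 3 * 200 = 600 m
grid = [(x * 250.0, y * 250.0) for x in range(5) for y in range(5)]
z, cov, l = simulate_field(grid, theta=200.0, partial_sill=0.8, nugget=0.2)
```

The cost of the factorisation grows with the cube of the number of points, so this direct approach is only practical for small grids; dedicated packages use more efficient algorithms for fields of the size considered here.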

Sampling from the random fields
Given the sample size n corresponding to the different densities and ranges, we repeatedly (5,000 times) sampled sets of n points from the corresponding simulated random fields. These points were first used to estimate the parameters of the exponential semi-variogram model, and the models were then used to obtain kriging estimates at the selected 200 test points. In a second variant, kriging was based on semi-variograms with parameters fixed at the true values for range and nugget. For co-kriging we distinguished two cases. First, we sampled collocated variables, where all three variables were measured at the same locations with varying densities (and pairwise correlations). Second, the three variables were sampled from different locations, each with a different density (again from random fields with varying pairwise correlations).

Selecting fixed test points for prediction
From the simulated random fields, we randomly chose 200 fixed test points for each extent of

Varying parameters
To compare the quality of predictions between different scenarios we varied the following parameters:
• dimensionality of random fields: Simulated square random fields with dimensionalities (side lengths) of 8,000m, 10,000m, and 15,000m were used to sample from.
• nugget effects of semi-variogram: The nugget was set to be 0 (i.e. no nugget effect) or 0.2.
• number of sampled points/density: To arrive at three comparable expected densities of observed points within a specific radius, we chose the number of sampled points dependent on the dimensionality of the random field we sampled from. Numbers of points were: 650, 1,300, and 2,300 for the smallest fields (8,000m × 8,000m), 1,000, 2,000, and 3,500 for the medium fields (10,000m × 10,000m), and 2,500, 5,000, and 8,000 for the largest fields (15,000m × 15,000m). This resulted in expected numbers of points within a distance of 250m of about 2, 4, and 7 for each size of random field. Increasing the range for a fixed distance between points means that the number of pairs available to estimate the spatial correlation structure increases. Therefore, we expect that the kriging model based on the parametric semi-variogram will perform better for larger ranges.
• semi-variogram parameters: Parameters of exponential semi-variogram models used for kriging were either estimated from the sampled points or fixed at the true values.
For the multivariate co-kriging models we also varied
• the pairwise correlations between the three different variables, set to r = 0.1, r = 0.5, or r = 0.9, and
• the type of prediction: univariate ordinary kriging (3 variables predicted separately) versus co-kriging (3 variables predicted simultaneously based on co-variograms).
Twelve sets of results were generated for each size of random field with estimated semi-variograms and another 12 with fixed semi-variograms. Overall, 72 generated result sets for different parameter combinations (dimensionality, range, density, nugget, semi-variogram parameters), each consisting of repeated kriging predictions for 200 test point locations, were included in our analyses in the univariate (single measure) case. For the multivariate case only fixed semi-variograms were used, kriging was repeated 1,000 times, the range was set to 600, and the maximum number of nearby points to include for kriging was fixed to be 50 within a maximum radius of 1,000m.
When the three variables were sampled from different locations, the density was varied between the variables.
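The expected numbers of neighbours quoted for the three densities can be checked with a short calculation. The sketch below assumes that the sampled points are scattered uniformly over a square field and ignores edge effects, so the expected count within a radius is the number of points times the ratio of the disc area to the field area. The function name is hypothetical; the sample sizes and side lengths are those listed above.

```python
import math

def expected_neighbours(n_points, side_m, radius_m=250.0):
    """Expected number of sampled points within radius_m of a location (no edge correction)."""
    return n_points * math.pi * radius_m ** 2 / side_m ** 2

scenarios = {
    8000: (650, 1300, 2300),
    10000: (1000, 2000, 3500),
    15000: (2500, 5000, 8000),
}
expected = {side: [round(expected_neighbours(n, side), 1) for n in sizes]
            for side, sizes in scenarios.items()}
```

Under these assumptions each field size yields roughly 2, 4, and 7 expected points within 250m, which is what makes the three densities comparable across field sizes.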

Evaluation
The precision of the estimates at a given point is evaluated using the empirical standard error, defined as the standard deviation of all the estimated values at this point over all 5,000 samples from a given simulation scenario. The 200 test points are classified within the quintile categories of the outcome distribution. The reliability of the estimation method is then based on the proportion of estimated points which fall into the original quintile as well as the percentage which fall into a neighbouring quintile. The evaluation of the multivariate approach mainly focuses on the benefit of using a co-variogram model as opposed to estimating the different variables independently of each other.
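These evaluation measures can be sketched as follows. The example is a simplified illustration with hypothetical names and toy data: it assumes that the quintile cut points are computed from the true values at the test points, which may differ from how the outcome distribution was defined in the study.

```python
import statistics

def quintile_of(value, cut_points):
    """Quintile index 0..4: the number of cut points that the value exceeds."""
    return sum(value > c for c in cut_points)

def evaluate(true_values, predictions):
    """predictions[i] holds the repeated kriging predictions for test point i."""
    cuts = statistics.quantiles(true_values, n=5)  # four quintile cut points
    exact = near = total = 0
    std_errors = []
    for truth, preds in zip(true_values, predictions):
        q_true = quintile_of(truth, cuts)
        for p in preds:
            diff = abs(quintile_of(p, cuts) - q_true)
            exact += diff == 0   # prediction falls into the correct quintile
            near += diff <= 1    # correct or neighbouring quintile
            total += 1
        # Empirical standard error: SD of the predictions at this point
        std_errors.append(statistics.stdev(preds))
    return exact / total, near / total, std_errors

true_values = [float(v) for v in range(1, 11)]
# Extreme shrinkage: every prediction equals the overall mean of 5.5
shrunk = [[5.5, 5.5] for _ in true_values]
prop_exact, prop_near, std_errors = evaluate(true_values, shrunk)
```

The toy case of complete shrinkage to the mean shows why quintile-based measures complement the standard error: the predictions have zero variability, yet few of them fall into the correct quintile, and the extreme quintiles are never recovered.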

Case study
We used data from the "Residential Environment and CORonary heart Disease" study (RECORD) [Chaix et al., 2012] to illustrate the method described above. This cohort study, conducted in the Paris Ile-de-France region in 2007 and 2008 with 7,290 participants, contains in particular data on a range of perceived neighbourhood characteristics as well as geo-coded addresses of participants.
We selected three variables from the dataset defined as scores based on questionnaire items on perceived neighbourhood characteristics. These three scores have a spatial correlation structure and are also spatially correlated with each other (see Figure 2 for the co-variograms). The three scores on perceived physical and social deterioration as well as insecurity from others in the neighbourhood were selected based on high inter-correlations in the study, which makes them candidates for measures of a homogeneous index and likely to be generated by the same spatial process (an assumption of linear models of co-regionalisation). We randomly chose 200 participants with known residential addresses for whom the perceived deterioration and insecurity of their neighbourhood were available. We predicted these values using several semi-variogram models (see below), and we compared the predicted with the true values.
First, we used all available data to fit exponential semi-variogram models for the three variables and their co-variograms by employing the linear model of co-regionalisation. A good fit for the three exponential models was obtained for maximal distances of the empirical semi-variogram of 1,000m to 1,250m (see Figure 2). Deviations from the shape of the model at larger distances indicated the existence of additional spatial processes on larger scales. We therefore only used points within a distance of up to 1,250m to fit the semi-variogram model parameters, resulting in an estimated range of 756m (scale parameter of 252 for the exponential model) common to the three variables. In a second stage, we sampled varied numbers of observations that we used both to estimate the exponential semi-variogram parameters and to predict values at the 200 points.
The kriging predictors were obtained using a number of randomly sampled known points varying between 500 and 7,090 (total number of observations minus the 200 points to be predicted). The minimum number of used neighbouring points was fixed to be 50.
The mean of the kriging predictions of the three neighbourhood variables at a given location was considered as representing the index value at this point. We then compared the estimated values to the true observed values at the 200 test points. The proportions of correct or neighbouring index quintiles as well as deviations from the true index values are reported for the varying parameter combinations.

Simulation: Reliability of single point estimation
We predicted the values at 200 fixed points (test points) by ordinary kriging based on 5,000 different samples (1,000 in the multivariate case) for each combination of parameters. The values for the different reliability measures are summarised per point and per parameter combination.
Figure 4 shows for each quintile the proportion of predicted values which are in the correct quintile or a neighbouring quintile. With increasing range, the proportion of estimates falling in the correct or neighbouring quintile increases, first and foremost for the lowest and highest quintiles. In addition, a shift of predictions towards the mean was more marked when the semi-variogram parameters were estimated from the sampled observed points than when they were fixed to the true values, and this was especially pronounced at a smaller range. Results for simulations with a nugget effect of 0.2 look very similar (see Table 2).
In Table 2 the values are further summarised over the 200 test points, showing mean, median, and SD of the distributions. The different quality measures are given for increasing extent of the simulated random field, three different sample sizes (resulting in 3 densities comparable between random fields of different dimensionalities), 2 ranges, with and without nugget effect, and semi-variogram parameters either fixed or estimated. A larger range and an increasing density were related to smaller prediction errors (MSE) and bias and, accordingly, to higher proportions of correct predictions. Having a larger range means that the information held in an observed point can be used at a larger distance. Increasing the sample size and, therefore, the density has two advantages: more points are available for the estimation of the model, leading to a better semi-variogram model, and there are more nearby points to be used for the predictions.
A higher dimensionality came along with a reduction in bias but no clear advantages in terms of other measures. Introducing a nugget effect led to less accurate predictions of quintiles and a markedly increased MSE. These patterns of results were very similar for fixed and empirically estimated semi-variograms. Only the precision increased with higher density or range with the fixed semi-variogram, and the proportion of correctly predicted quintiles was slightly better in this case. Overall, the comparison of both variants shows better performance with estimated semi-variogram parameters, especially considering the prediction errors and precision. On average, up to 80% of all predictions were found in the correct or its neighbouring quintile, but the exact correct quintile could only be predicted in at most 36% of cases.

Results of multivariate simulation and prediction
Values for the three variables were predicted at the same 200 test point locations as in the univariate case. At each test point an index value was computed as the sum of the three predicted variables and this was compared to the sum of the true simulated values. All results relate to these index values. Besides the variable correlations which were varied in the simulations, the multivariate kriging scenarios differed from the univariate ones in fixing the maximum number of neighbourhood points used for local kriging to 50 and the number of repeated samplings and predictions to 1,000. Both were done to account for the increased complexity of the model and the related time and memory requirements. Results are displayed in Tables 3 and 4.
Comparison univariate versus multivariate. The patterns of results were virtually identical for the separate univariate predictions and the co-kriging predictions when all 3 variables were measured at the same locations (collocated). Some differences were seen in the multi-located (heterotopic) case. Here, the bias of predictions was smaller for less correlated variables in the univariate case but slightly smaller with a high correlation for the co-kriging results. The proportion of exactly correct predictions of quintiles increased by 5 percentage points in the univariate but by 11 percentage points in the co-kriging predictions with multi-located variables.
However, this pattern was not evident when considering the proportions including neighbouring quintiles.
Comparison point collocation versus multi-location. We compare the results from Tables 3 and 4. Scenarios with three variables observed at different locations show better results than those with variables observed at the same locations. We see up to 82% of points in the correct or neighbouring quintile for multi-located observations but only up to 72% for collocated observations with the univariate method, and 83% vs. 72% with the multivariate method. Bias and MSE were also clearly greater in the collocated than in the multi-located case.
Effect of the correlation between variables. The more the variables are correlated, the more information can be gained from the others and, therefore, improvements in the predictions were to be expected in the case that observations are at different locations for different variables. However, different correlated variables from the same location share much information, increasingly so with the size of the correlation. Therefore, the information gain decreases with increasing correlations. This is confirmed by the simulations, which show that increasing the correlation between the variables from 0.1 to 0.9 led to a decrease in the proportion of predictions in the right or neighbouring quintile (71 to 65%) in the collocated case, both for univariate and multivariate predictions. When the points are multi-located, only the percentage of exactly predicted quintiles increased with increasing correlation (35 to 40% univariate and 36 to 47% multivariate). This seems less relevant, since overall the proportion of exact predictions still seems too low for practical applications.

Further results
Introducing a nugget effect lowered prediction quality in all scenarios in terms of correctly predicted quintiles and prediction error. Bias was mainly unaffected in the collocated case but was decreased in models without nugget effect for multi-located observations.
Density, which also corresponds to the number of sampled observations, did not affect prediction quality.
In the collocated case, the extent of the random field made no clear difference in the proportion of predicted quintiles. Bias was minimal for the smallest field. With multi-located observations we observed the best quintile prediction as well as the smallest prediction errors with the medium-sized field, and comparably smaller biases in either the medium or the small field.

Case study
The results of the case study based on RECORD data are shown in Tables 5 and 6. The  6 and 7.
Semi-variogram estimated using all available observations. In Table 5 the measures of the quality of estimates are presented for varying numbers of sampled known points used for kriging but using the semi-variogram obtained from all 7,290 available points. Here we see that the multivariate and univariate approaches are equivalent in terms of quality and that only large sample sizes provide a high percentage of predictions in the correct or neighbouring quintile: 88% for 500 points to 95% for 7,090 known points. The bias ranged from 20% of the index mean to 5%, with half of the predictions having less than 1% bias for 7,090 known points.
Semi-variograms estimated using the subset of observations available for prediction. In Table 6 the measures of the quality of predicted values from semi-variograms based only on the sampled points are presented, and the corresponding semi-variogram models are given in Table 7. Here, the increase in the quality of prediction with the number of sampled known points is more pronounced in the multivariate case. That is, specifically the realistic estimation of the cross-variograms seems to depend on the number of known points.
However, the difference between univariate and multivariate kriging is mainly seen up to 2,000 known points.

Discussion
We have shown the usability and reliability of kriging methods to create geo-located indices of neighbourhood characteristics based on survey data. The particularity of the method compared to the existing literature is that it does not rely on predefined geographical units, thus allowing for a very fine spatial granularity of characteristics. The focus of our study is on geo-located data sources (survey data) for which only sparse data at small distances to the points being predicted are available, thus offering limited information for prediction. For the study of small-area health inequalities, however, the benefit of such data is that they provide information on very local neighbourhood characteristics (which may be associated with health outcomes) relative to individual health data; such information is otherwise rarely available, while geo-coded data on health outcomes may be increasingly available.
We based our kriging models on semi-variograms commonly seen in social-epidemiological studies of neighbourhood effects on health in other contexts [Breckenkamp et al., 2021, Sauzet et al., 2021].
Here the prediction at one location is based on the spatial correlation structure between observed values around the non-observed location. The simulation scenarios were chosen to reflect the correlation structure of existing data as well as a sparse number of observations at small distances to the location at which predictions are performed (from about 10 to 40 observations). Across the simulated scenarios, the mean bias ranges from 10 to 3% of the standard deviation of the outcome, providing on average a reliable prediction of the true value. We also looked at data categorised in quintiles: the median over the simulations was 80% or more of predictions in the correct quintile or its neighbour. The method may therefore provide a useful indicator of a neighbourhood characteristic at one location, with the limitation that the variability may be reduced if too much reduction to the mean occurs. Nevertheless, this is better than predicting the same value for a whole area.
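The two quantities discussed in this paragraph, the bias expressed in standard deviation units of the true outcome and the loss of variability caused by reduction to the mean, can be computed as in the following sketch. It is a minimal illustration with hypothetical function names and toy data, and it assumes the sample standard deviation (n − 1 denominator); it is not the code used for the simulations.

```python
from statistics import mean, stdev


def bias_in_sd_units(true_values, predicted_values):
    """Mean of (predicted - true), divided by the SD of the true values."""
    differences = [p - t for t, p in zip(true_values, predicted_values)]
    return mean(differences) / stdev(true_values)


def sd_ratio(true_values, predicted_values):
    """SD of the predictions relative to the SD of the true values.
    Values well below 1 indicate predictions pulled towards the mean."""
    return stdev(predicted_values) / stdev(true_values)


# Toy data (hypothetical): predictions shrunk halfway towards the mean (2.0)
# and shifted upwards by 0.1
true_vals = [0.0, 1.0, 2.0, 3.0, 4.0]
pred_vals = [2.0 + 0.5 * (t - 2.0) + 0.1 for t in true_vals]

print(round(bias_in_sd_units(true_vals, pred_vals), 3))
print(round(sd_ratio(true_vals, pred_vals), 3))
```

For the case study values reported for the univariate approach with all available points (SD of 0.478 for the predicted against 0.624 for the true index values), the corresponding SD ratio is about 0.77.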
We presented a case study using a large study with about 7,000 geo-coded observations from which we drew samples of various sizes. Using a sample of only 500 data points available for prediction (average of 3.2 (SD 2.6) points within the range of predicted locations), 86% of predictions were in the correct or neighbouring quintile, going up to 95% when using all the data available (average of 45.1 (SD 33.8) points within the range of predicted locations). This shows that the method provided satisfactory predictions with very few neighbouring points.
As expected, larger sample sizes and more data at small distances will both provide better estimates of the semi-variogram model parameters. What matters for the predictions, however, is that there are enough points within the range radius and that the nugget effect is small. A nugget effect means that only a part of the information available in the points within the range radius will be used for prediction. The role of the range of the semi-variogram model is twofold. If the true range of the exponential model is larger (see simulation results), then more observed values will be (correctly) used for prediction. However, if the estimated range is larger than the true range (as seen in some sub-samples of the case study), then some observed values that are actually uncorrelated with the predicted points will be used for prediction, thus reducing the quality of predictions.
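The effect of the nugget on how the information from neighbouring points is used can be illustrated with the ordinary kriging weights themselves. The sketch below is a minimal, self-contained illustration under stated assumptions and not the gstat implementation used in the study: it assumes an exponential semi-variogram of the form nugget + psill × (1 − exp(−h / range)) for h > 0, hypothetical parameter values and coordinates, and a small hand-written linear solver.

```python
import math


def exp_semivariogram(h, psill, range_par, nugget=0.0):
    """Exponential semi-variogram (assumed parameterisation); zero at h = 0."""
    if h == 0:
        return 0.0
    return nugget + psill * (1.0 - math.exp(-h / range_par))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ok_weights(points, target, psill, range_par, nugget=0.0):
    """Ordinary kriging weights for predicting at target from points.
    The last row and column of the system force the weights to sum to 1."""
    def gamma(p, q):
        return exp_semivariogram(math.dist(p, q), psill, range_par, nugget)

    n = len(points)
    a = [[gamma(points[i], points[j]) for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(p, target) for p in points] + [1.0]
    return solve(a, b)[:n]  # drop the Lagrange multiplier


# Hypothetical example: two observed points, target closer to the first one
points = [(0.0, 0.0), (300.0, 0.0)]
target = (50.0, 0.0)
w_no_nugget = ok_weights(points, target, psill=1.0, range_par=200.0)
w_nugget = ok_weights(points, target, psill=1.0, range_par=200.0, nugget=0.5)
print([round(w, 2) for w in w_no_nugget])
print([round(w, 2) for w in w_nugget])
```

In this two-point example the weight of the nearer observation drops from about 0.82 to about 0.69 when a nugget of 0.5 is added. The weights move towards equal weighting, so the specific information carried by the closest observation counts for less in the prediction.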
We performed prediction based on estimated semi-variograms as well as with a semi-variogram given a priori. On average, estimation provided better results, probably because the high variability of the simulated data meant that the "true" semi-variogram did not always fit the specific sample data well. However, there may be situations where it makes sense to use a range and a nugget effect based on theory or previous work. In that case, the weight put on each point is not based on the correlation structure observed in the data.
Another aspect of our investigations was whether using co-variogram models, which take into account the spatial correlation structure between the different variables included in the index, would provide a better prediction of indices (as linear combinations of predicted values) than the linear combination of separately predicted values. The advantage of using the information gained by including the spatial correlation between different variables has to be balanced against possible problems with the estimation of the co-variogram model. In our case study, if all available points were used to estimate the models (thus providing a rather good semi-variogram and probably co-variogram), then the multivariate and the univariate approaches led to similar results. In contrast, as in the simulations, if the semi-variogram models are estimated using a smaller sample of points, then the univariate approach provides better results. In general, therefore, it does not seem to be hugely beneficial to use co-variogram models, and with an increasing number of variables the difficulty of obtaining estimates increases. If the different variables are not observed at the same points, e.g. if the data come from different sources, then a co-variogram will provide better estimates.
Given the usefulness of spatial methods based on semi-variograms to provide neighbourhood indicators not relying on administrative areas presented here or elsewhere, it is important that designers of social surveys take geographical proximity into account in their sampling frame.
This will make it possible to obtain estimates of neighbourhood characteristics which are not based on the perception of one person only but on the perception of several neighbours. Our case study was based on an epidemiological study with a large sample size, with participants selected in randomly chosen "arrondissements" or communes of the Paris area. As a consequence, the number of neighbours within 750 m ranges from 10 to 120, thus providing very good estimates at non-observed locations. A major drawback is the presence of observation deserts. Having enough data for kriging provides another possible application of the proposed method: we can obtain a measure of the neighbourhood perception of an individual and an "objective" measure of the neighbourhood characteristic, the latter given by the kriging predicted values provided by the neighbours.
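Whether a sampling design yields enough neighbours can be checked before any kriging is attempted, by counting the observed points within a given radius of each location of interest. The following sketch does this with plain Euclidean distances. It assumes coordinates in a projected system with metres as units; the function names and the toy coordinates are hypothetical.

```python
import math
from statistics import mean, stdev


def neighbours_within(observed, location, radius):
    """Number of observed points at Euclidean distance <= radius from location."""
    return sum(1 for p in observed if math.dist(p, location) <= radius)


def neighbour_summary(observed, locations, radius):
    """Mean, SD, minimum and maximum of the neighbour counts over the locations."""
    counts = [neighbours_within(observed, loc, radius) for loc in locations]
    return mean(counts), stdev(counts), min(counts), max(counts)


# Toy example (hypothetical): observations on a regular 250 m grid,
# one location in the centre of the grid and one in a corner
observed = [(x * 250.0, y * 250.0) for x in range(9) for y in range(9)]
locations = [(1000.0, 1000.0), (0.0, 0.0)]
print(neighbour_summary(observed, locations, radius=750.0))
```

Locations at the edge of the sampled area have far fewer neighbours than central ones (11 against 29 in this toy grid), which is the kind of pattern that a summary by mean and SD alone can hide and that observation deserts make worse.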
A limitation of this work is the lack of an estimate of the prediction error. The estimation of standard errors for semi-variogram parameters is difficult, in particular because of the type of data available (an empirical semi-variogram) and because the estimates are skewed (see [Dyck and Sauzet, 2022] for more details). Further work is needed to provide a reliable measure of precision. This is particularly important if an indicator is based on a large number of variables, each bringing a source of uncertainty. One direction of investigation is the work of Lopiano et al. on misaligned data used in regression models [Lopiano et al., 2011].

Conclusion
Provided that survey data and study data close enough to each other in sufficient numbers are available, survey data with geo-coded, spatially correlated neighbourhood characteristics can be used to predict values for these variables at nearby locations using semi-variogram kriging models. This enables, for example, the creation of bespoke indices for the study of small-area determinants of health which are valid at very small spatial scales and comprise relevant dimensions for the population under study.

Tables

random field on which to obtain kriging estimates. The same 200 test points were chosen for different parameter values for each simulated field. Ordinary kriging (function krige from gstat) was used to predict the values at these 200 fixed test points based on the values at the sampled locations for each sample and on the estimated or fixed exponential semi-variogram model from this sample, with a maximal distance for the estimation of the empirical semi-variogram equal to 1,000 (about 1.5 times to 3 times the true range).
the plots display the means summarized over the 5,000 samples per test-point for the 200 locations. That figure represents the true values at the 200 estimated points against the bias of the estimated kriging values (mean difference between kriging estimate and true value of the simulated random field) depending on the true range of the semi-variogram model and based on fixed versus estimated semi-variograms. The bias clearly depends on the magnitude of the true value but this dependency decreases with increasing range. For smaller ranges a reduction to the mean for extreme true values is apparent. With increasing range, the bias overall decreases (see also Table

corresponding semi-variogram parameters estimated from different numbers of sampled points and varying the maximal distance for semi-variograms are provided in Table 7. The three variables considered are approximately normally distributed. The mean of true index values over the 200 test points is −0.094, median = −0.091, and SD = 0.624. The corresponding kriging predicted values were mean = −0.097, median = −0.102, and SD = 0.478 for the univariate version when using all available points for the estimation of the semi-variograms. The values were mean = −0.098, median = −0.107, and SD = 0.406 with a co-variogram based on all available points. For results based on varying (co-)semi-variograms see Tables

Figure 4: Proportion of predicted points in the correct or a neighbouring quintile grouped by true quintile value

Table 1: Expected number of points within radius (mean and SD of the no. of points over the 200 test points per sample)

Table 3: Quality of prediction in different kriging scenarios: multivariate simulation with 3 variables observed at different locations (1,000 known points)

Table 4: Quality of prediction in different kriging scenarios: multivariate simulation with 3 variables observed at the same locations (collocated, 1,000 known points)
Note: Values are summarized over the 200 test points; SD: standard deviation; prop. corr. Q.: proportion of predicted values in correct quintile; corr./neighbor. Q.: proportion of predicted values in correct or neighbouring quintile; Bias: mean of predicted minus true value in SD units of true outcome; SE: mean standard error of predicted values; MSE: mean squared error of predicted values; No. of points: number of observed points used for local kriging; Correlations: pairwise correlation of the 3 variables in the simulated random field; Density: density of sampled points