Mapping depth to Pleistocene sand with Bayesian generalized linear geostatistical models

Spatial soil applications frequently involve binomial variables. If relevant environmental covariates are available, using a Bayesian generalized linear model (BGLM) might be a solution for mapping such discrete soil properties. The geostatistical extension, a Bayesian generalized linear geostatistical model (BGLGM), adds spatial dependence and is thus potentially better equipped. The objective of this work was to evaluate whether it pays off to extend from a BGLM to a BGLGM for mapping binary soil properties, evaluated in terms of prediction accuracy and modelling complexity. As motivating example, we mapped the presence/absence of the Pleistocene sand layer within 120 cm from the land surface in the Dutch province of Flevoland, using the BGLGM implementation in the R‐package geoRglm. We found that BGLGM yields considerably better statistical validation metrics compared to a BGLM, especially with – as in our case – a large (n = 1,000) observation sample and few relevant covariates available. Also, the inferred posterior BGLGM parameters enable the quantification of spatial relationships. However, calibrating and applying a BGLGM is quite demanding with respect to the minimal required sample size, tuning the algorithm, and computational costs. We replaced manual tuning by an automated tuning algorithm (which eases implementing applications) and found a sample composition that delivers meaningful results within 50 h calculation time. With the gained insights and shared scripts spatial soil practitioners and researchers can – for their specific cases – evaluate if using BGLGM is feasible and if the extra gain is worth the extra effort.

• Including spatial correlation might sometimes be worth the extra effort and computational costs.

KEYWORDS
Bayesian statistics, digital soil mapping, generalized linear model, geoRglm, model-based geostatistics, spatial statistics

1 | INTRODUCTION

In many soil mapping applications, spatial variables are continuous and can, perhaps after a transformation (Müller, 2007), conveniently be modelled using Gaussian spatial models. However, some spatial soil applications involve discrete count variables or categorical variables (e.g., Kempen, Brus, & Heuvelink, 2012; Malone, Minasny, & McBratney, 2017). In other cases, the end-user is only interested in information relative to threshold values (e.g., Lark & Ferguson, 2004). By definition, discrete and categorical variables cannot be transformed to normality. With such variables, nonlinear methods such as indicator kriging (Journel, 1983) deliver reasonable results, but are suboptimal in the case of a trend (Papritz, 2009) and lack model-based consistency, which implies, among other things, that they are unsuited to facilitate kriging with a change of support (Emery & Ortiz, 2004).
If environmental covariates related to the target variable are available for the whole area of interest, using a generalized linear model (GLM) might be a solution for mapping discrete soil properties. A GLM assumes that the actual observations are a realization of a discrete random process, such as Bernoulli, binomial or Poisson. Given this assumption, GLMs use, for each location, a transformation of a virtual continuous variable, the linear predictor, to the distribution parameter. For instance, the logit transformation links the linear predictor to the probability of success for a Bernoulli process. The linear predictor is the linear combination of covariate values (including an intercept) scaled by regression parameters (Myers, Montgomery, & Vining, 2002, section 4.2).
However, when applied to spatial data, a GLM ignores any spatial dependencies other than those contained in the covariates. The geostatistical extension, a generalized linear geostatistical model (GLGM), adds spatial dependence to the model and linear predictor by means of a spatially correlated Gaussian field (Diggle & Ribeiro, 2007). Therefore, compared to a GLM, a GLGM is more generic and flexible, and thus potentially better equipped, provided that its spatial dependence parameters can be estimated well. However, GLGMs also require more skill from the modeller (as calibrating spatial parameters is less straightforward than calibrating regression parameters) and are computationally more expensive.
Another possible extension is application of Bayesian statistics. Bayesian statistics demands explicit incorporation of pre-observation knowledge, even if this means an explicit definition of our ignorance (Lindley, 2004). It also considers all model parameters to be stochastic quantities, thus parameter uncertainty is taken into account (McElreath, 2016). Both Bayesian properties are applied in soil mapping by, for example, Steinbuch, Brus, and Heuvelink (2018) and Poggio, Gimona, Spezia, and Brewer (2016). Both properties together enable the construction of hierarchical models in a convenient and statistically sound way (Gelman, Carlin, Stern, & Rubin, 2013), which can be very useful in spatial statistics (Banerjee, Carlin, & Gelfand, 2004). In our context, both GLMs and GLGMs have Bayesian extensions, abbreviated to BGLM and BGLGM, respectively. A BGLGM is a spatial implementation of a Bayesian hierarchical model (Diggle & Ribeiro, 2007).
The objective of this work was (a) to make spatial Bayesian hierarchical modelling accessible and understandable for pedometricians, and (b) to evaluate whether it pays off to extend from a BGLM to a BGLGM for mapping binary soil properties, when evaluated in terms of prediction accuracy, modelling effort and computational costs. We pay due attention to software implementation issues and provide scripts in Appendix S1, because especially a BGLGM cannot easily be programmed from scratch, thus we need to rely on existing software libraries and functions.
As motivating example, we map the depth of the Pleistocene cover sand layer relative to the surface in the Dutch province of Flevoland. This depth is defined as a Bernoulli variable obtained by determining whether the depth is above or below 1.2 m. We state that at every location $s$ in the study area $A$, $z(s)$ is a binary realization (zero or one) of the Bernoulli distribution:

$$Z(s) \sim \mathrm{Bernoulli}(p(s)), \qquad (1)$$

where parameter $p(s) \in [0, 1]$ indicates the probability of success. We further assume that $Z$ is an independent process, that is, $Z(s)$ and $Z(s')$ are statistically independent for all $s, s' \in A$, $s \neq s'$. This spatial process $Z = \{Z(s), s \in A\}$ is observed at $n$ locations $s_i$, $i = 1, 2, \ldots, n$, providing observation vector $z = [z(s_1) \; \cdots \; z(s_n)]^T$, being a realization of $[Z(s_1) \; \cdots \; Z(s_n)]^T$. We model $p(s)$, and thus $Z(s)$, initially with a generalized linear model (GLM), as explained in the following section. Next, we extend this GLM to a Bayesian generalized linear model (BGLM) and a Bayesian generalized linear geostatistical model (BGLGM), the two models to be compared in this research.

| Generalized linear model (GLM)
We base the explanation of GLM on textbooks such as Myers et al. (2002). The logit transformation projects $p(s)$ to the mathematically more convenient $y(s)$:

$$y(s) = \mathrm{logit}(p(s)) = \ln\!\left(\frac{p(s)}{1 - p(s)}\right), \qquad (2)$$

while the inverse logit function derives $p(s)$ from $y(s)$:

$$p(s) = \frac{\exp(y(s))}{1 + \exp(y(s))}. \qquad (3)$$

The unobservable variable $y(s)$, $s \in A$, represents the "signal" over the study area. The vector $y = [y(s_1) \; \cdots \; y(s_n)]^T$ indicates the signal at the observation locations. In a GLM, $y$ is modelled as a linear predictor:

$$y = X^T \beta. \qquad (4)$$

Here, $X$ indicates the $k \times n$ design matrix, containing leading ones and covariate values at all observation locations. The vector of $k$ regression coefficients, including the intercept, is given as $\beta$. Later we discuss how we can calibrate this model using the vector of observations $z$. Note that in the GLM defined above (but not in the other models considered next), all randomness is captured in Equation (1).
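As a minimal numerical sketch of the logit link, its inverse and the linear predictor described above (in Python rather than the R used in this work; the design matrix and coefficients are hypothetical):

```python
import numpy as np

def logit(p):
    """Logit link: maps a probability to the signal scale."""
    return np.log(p / (1 - p))

def inv_logit(y):
    """Inverse logit: maps the signal back to a probability of success."""
    return np.exp(y) / (1 + np.exp(y))

# Linear predictor: design matrix X is k x n (leading ones + covariates),
# beta holds the intercept and the regression coefficient.
X = np.array([[1.0, 1.0, 1.0],        # leading ones (intercept)
              [-3.6, -5.4, 8.6]])     # hypothetical covariate values
beta = np.array([0.5, -0.2])
y = X.T @ beta                        # signal at the observation locations
p = inv_logit(y)                      # probability of success per location

assert np.allclose(logit(p), y)       # the two links are mutual inverses
```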

| Bayesian extension: BGLM
To include parameter uncertainty, we extend the GLM by considering β a stochastic parameter vector (but note that in this work we assume neither s nor X to be uncertain).
To work conveniently with stochastic parameters we shape our models using Bayesian statistics. We assume that the reader is familiar with concepts such as Bayes' rule, priors (conjugate or otherwise), conditional distributions, likelihood, the posterior, the posterior predictive distribution and Markov chain Monte Carlo (MCMC) simulation; for example textbooks, see Banerjee et al. (2004), Gelman et al. (2013) or McElreath (2016). In the following, we introduce a Bayesian GLM (BGLM). The posterior density of $\beta$, conditional on the observations, is given by Bayes' rule:

$$f(\beta \mid z) = \frac{f(z \mid \beta)\, f(\beta)}{f(z)} \propto f(z \mid \beta)\, f(\beta), \qquad (5)$$

with $f(\beta \mid z)$ the posterior probability distribution of $\beta$ conditional on $z$, containing all our knowledge after taking the observations. The prior $f(\beta)$ contains our pre-observation knowledge; it might be a mathematical abstraction of our assumed ignorance. The likelihood of the observations conditional on the model parameter, $f(z \mid \beta)$, follows directly from the Bernoulli probability mass function and can, while incorporating Equations (3) and (4), be expressed as:

$$f(z \mid \beta) = \prod_{i=1}^{n} p(s_i)^{z(s_i)} \left(1 - p(s_i)\right)^{1 - z(s_i)}. \qquad (6)$$

In Section 3.1 we elaborate on the choice of $f(\beta)$ and on how we implemented an algorithm to infer $f(\beta \mid z)$.
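The Bernoulli log-likelihood underlying this posterior can be evaluated directly; the sketch below (Python rather than R, with hypothetical data) shows that, under a flat prior, comparing log-likelihoods compares unnormalized log-posterior densities of candidate coefficient vectors:

```python
import numpy as np

def bernoulli_loglik(beta, X, z):
    """Log-likelihood of binary observations z under a logistic GLM.
    X is the k x n design matrix; beta the coefficient vector."""
    y = X.T @ beta                      # linear predictor (signal)
    p = 1.0 / (1.0 + np.exp(-y))        # inverse logit
    return np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))

# Hypothetical data: four locations, one covariate
X = np.array([[1, 1, 1, 1], [0.0, 1.0, 2.0, 3.0]], dtype=float)
z = np.array([0, 0, 1, 1])

# With a flat (improper uniform) prior on beta, the posterior is
# proportional to this likelihood.
print(bernoulli_loglik(np.array([-1.5, 1.0]), X, z))
```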

| Prediction with a BGLM
To derive a prediction map of the probabilities of success, we first define $n^*$ prediction locations $s_i$, $i = n+1, \ldots, n+n^*$. Typically, these are the nodes of a grid covering the study area, for which all covariates must be available. The prediction probabilities at the prediction locations result from back-transforming the signal at these locations using Equation (3):

$$p(s_i) = \frac{\exp(y(s_i))}{1 + \exp(y(s_i))}, \quad i = n+1, \ldots, n+n^*. \qquad (7)$$

We therefore require the vector $y^* = [y(s_{n+1}) \; \cdots \; y(s_{n+n^*})]^T$ of signals at the prediction locations.
In contrast to a GLM, in a BGLM $y^*$ is stochastic. The posterior probability distribution, referred to as the "posterior predictive", equals:

$$f(y^* \mid z) = \int f(y^* \mid \beta, z)\, f(\beta \mid z)\, d\beta = \int f(y^* \mid \beta)\, f(\beta \mid z)\, d\beta. \qquad (8)$$

Note that the last identity holds because $y^*$ is completely characterized by the covariates and $\beta$; given $\beta$, $y^*$ and $z$ are independent. In Section 3.1 we show the implementation of the BGLM prediction.
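In practice this integral is approximated by Monte Carlo: draw $\beta$ from its (approximately multivariate normal) posterior, push each draw through the linear predictor and the inverse logit. A sketch under hypothetical posterior moments (Python rather than R):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical MVN approximation of the posterior of beta
beta_mean = np.array([0.2, -0.8])
beta_cov = np.array([[0.04, 0.0], [0.0, 0.09]])

# One prediction location: intercept plus a single covariate value
x_star = np.array([1.0, 1.5])

# Monte Carlo approximation of the posterior predictive of p(s*)
beta_draws = rng.multivariate_normal(beta_mean, beta_cov, size=5000)
y_star = beta_draws @ x_star               # signal draws at s*
p_star = 1.0 / (1.0 + np.exp(-y_star))     # back-transform via inverse logit

print(p_star.mean(), np.quantile(p_star, [0.025, 0.975]))
```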

| Generalized linear geostatistical model (GLGM)
In a generalized linear geostatistical model (GLGM) (Banerjee, Carlin, & Gelfand, 2015; Diggle & Ribeiro, 2007), Equation (4) is extended by adding a spatially correlated random effect:

$$y = X^T \beta + u. \qquad (9)$$

Here, $u$ is a vector of random variables $U(s_i)$, $i = 1, \ldots, n$, taken from a stochastic spatial process $U$, which is modelled as a zero-mean, stationary Gaussian random field defined over $A$. Note that we use again $Y$, and its realization $y$, for the signal, although they are modelled differently in a GLM and a GLGM.
The spatial structure of $U$ is in this research characterized by an exponential correlation function:

$$c(h) = \exp(-h/\phi), \qquad (10)$$

where $h$ indicates the Euclidean distance between locations $s$ and $s + h \in A$, and $\phi$ is in the geostatistical context called the "range parameter" or "distance parameter".
The "nugget to partial sill ratio" $\eta^2$ is given by:

$$\eta^2 = \frac{\tau^2}{\sigma^2}, \qquad (11)$$

where parameter $\tau^2$ captures short-distance variation, that is, the "nugget", while $\sigma^2$ ("partial sill" in geostatistics) is the variance of $U$, minus the variance of $U$ already captured by $\tau^2$. Note that when $\tau^2 = 0$, $\eta^2 = 0$ and the correlation function in Equation (10) approximates one at very short distances.
In this research, we made simplifying choices regarding the spatial covariance function $\sigma^2 \cdot c(\phi, h, \eta^2)$, such as not considering spatial anisotropy ($c$ is a function of Euclidean distance only, not of direction) and not considering other covariance functions.
Each element of the $n \times n$ correlation matrix $C(\phi, h, \eta^2)$ is given by the correlation function in Equation (10), evaluated at the distance between two observation locations. With $C(\phi, h, \eta^2)$, we can extend Equation (9) with spatial parameters:

$$Y \mid \beta, \sigma^2, \phi, \eta^2 \sim \mathrm{MVN}\!\left(X^T \beta,\; \sigma^2\, C(\phi, h, \eta^2)\right). \qquad (12)$$

For notational convenience, in what follows we often merge the spatial and regression parameters into one parameter vector $\theta = [\beta, \sigma^2, \phi, \eta^2]$, and we will often indicate $C(\phi, h, \eta^2)$ by $C(\phi)$ or $C$.
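Building such a correlation matrix from the exponential correlation function is straightforward; a sketch with hypothetical coordinates and range parameter (Python rather than R, no nugget, i.e., $\tau^2 = 0$):

```python
import numpy as np

def exp_corr_matrix(coords, phi):
    """n x n correlation matrix from the exponential correlation
    function c(h) = exp(-h / phi), with h the Euclidean distance."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-d / phi)

# Hypothetical observation locations (in metres) and range parameter
coords = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 300.0]])
C = exp_corr_matrix(coords, phi=200.0)

# Diagonal is 1 (no nugget); correlation decays with distance.
print(C.round(3))
```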

| Bayesian extension: BGLGM
In the Bayesian extension of the GLGM we consider $\theta$ stochastic. $Y$ is stochastic as well, due to $\beta$ and the spatial signal $U$. We therefore need their joint distribution. The joint posterior density is, according to Bayes' rule and the chain rule of conditional probability, proportional to:

$$f(\theta, y \mid z) \propto f(z \mid y)\, f(y \mid \theta)\, f(\theta), \qquad (13)$$

with $f(y \mid \theta)$ the proportional density of the signal conditional on the parameters. Note that $f(z \mid y, \theta) = f(z \mid y)$ because, conditional on $y$, $z$ and $\theta$ are independent. Equation (13) is often referred to as a "hierarchical model", referring to the three related levels of probability distributions (Banerjee et al., 2004).
Equation (13) shows three terms to be inferred or defined: (a) the likelihood of the observations, conditional on the signal, which follows from Equations (1) and (3) and is much like Equation (6):

$$f(z \mid y) = \prod_{i=1}^{n} p(s_i)^{z(s_i)} \left(1 - p(s_i)\right)^{1 - z(s_i)}, \qquad (14)$$

(b) the likelihood of the signal, conditional on the parameters, which is given by the MVN probability density:

$$f(y \mid \theta) = (2\pi\sigma^2)^{-n/2}\, |C|^{-1/2} \exp\!\left(-\frac{1}{2\sigma^2}\,(y - X^T\beta)^T C^{-1} (y - X^T\beta)\right), \qquad (15)$$

with $|C|$ indicating the determinant of $C$, and (c) the definition of the prior $f(\theta)$, which is addressed in Section 3.2. Note that from Equation (13) we can infer the posterior of $\theta$ by integrating out the signal:

$$f(\theta \mid z) = \int f(\theta, y \mid z)\, dy. \qquad (16)$$

In the next section we first extend the above theory to prediction at new locations.

| Prediction with a BGLGM
To predict the signal $y^*$ conditional on $z$ within a BGLGM context, we formulate the posterior predictive distribution as an integral over both the parameters $\theta$ and the signal at the observation locations $y$:

$$f(y^* \mid z) = \iint f(y^* \mid \theta, y, z)\, f(\theta, y \mid z)\, dy\, d\theta = \iint f(y^* \mid \theta, y)\, f(\theta, y \mid z)\, dy\, d\theta. \qquad (17)$$

Note that $y^*$ and $z$ are conditionally independent given $\theta$ and $y$. In the final expression of Equation (17), we can consider $f(\theta, y \mid z)$ (the joint posterior density of parameters and signal as provided by Equation (13)) as weights to derive $f(y^* \mid z)$ from $f(y^* \mid \theta, y)$. The other component of the integrand, $f(y^* \mid \theta, y)$, the distribution of a Gaussian response at prediction locations given observations and model parameters, boils down to simple kriging. This is because $y^*$ and $y$ are part of the joint multivariate normal distribution:

$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathrm{MVN}\!\left( \begin{bmatrix} X^T\beta \\ X^{*T}\beta \end{bmatrix},\; \sigma^2 \begin{bmatrix} C(\phi) & C^*(\phi) \\ C^{*T}(\phi) & C^{**}(\phi) \end{bmatrix} \right), \qquad (18)$$

where $C^*(\phi)$ indicates the correlations between the signals at observation and prediction locations and $C^{**}(\phi)$ the correlation matrix of $y^*$, both of which can be calculated using Equation (10) and the geographic distance between the relevant locations. The simple kriging predictor, for a given realization of model parameters and signal, equals a Gaussian distribution (Searle, 1997, p. 47) with mean:

$$E[Y^* \mid \theta, y] = X^{*T}\beta + C^{*T}(\phi)\, C^{-1}(\phi)\, (y - X^T\beta), \qquad (19)$$

and prediction error variance:

$$\mathrm{Var}[Y^* \mid \theta, y] = \sigma^2 \left( C^{**}(\phi) - C^{*T}(\phi)\, C^{-1}(\phi)\, C^*(\phi) \right). \qquad (20)$$

In theory, one could calculate both the prediction probability density as provided above and the corresponding weight factor $f(\theta, y \mid z)$ in Equation (17) for every possible combination of parameters and signal at the observation locations, and numerically integrate the outcomes to infer the posterior predictive distribution $f(y^* \mid z)$. However, even if the space spanned by the model parameters and the signal $y$ were discretized by a coarse grid, this would be prohibitive because of the high dimension of the parameter–signal space. In Section 3.2 we discuss how to relieve the burden of computational costs and show how the actual inference of posterior parameters and posterior predictive was carried out in this research.
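The simple kriging step, the conditional mean and variance of the joint multivariate normal above, can be sketched as follows (Python rather than R; coordinates, range and variance are hypothetical, and the trend is taken as zero for brevity):

```python
import numpy as np

def simple_krige(C, C_star, C_starstar, mu, mu_star, y, sigma2):
    """Conditional mean and covariance of the signal at prediction
    locations given the signal y at the observation locations."""
    A = np.linalg.solve(C, C_star)                # C^{-1} C*
    mean = mu_star + A.T @ (y - mu)
    cov = sigma2 * (C_starstar - C_star.T @ A)
    return mean, cov

# Hypothetical setup: 3 observation locations and 1 prediction location
coords = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 300.0], [50.0, 50.0]])
phi, sigma2 = 200.0, 1.5
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
R = np.exp(-d / phi)                              # exponential correlation
C, C_star, C_starstar = R[:3, :3], R[:3, 3:], R[3:, 3:]

mu = np.zeros(3)            # trend at the observation locations (assumed 0)
mu_star = np.zeros(1)       # trend at the prediction location (assumed 0)
y = np.array([0.4, -0.2, 0.9])
mean, cov = simple_krige(C, C_star, C_starstar, mu, mu_star, y, sigma2)
print(mean, cov)
```

Conditioning on nearby observations shrinks the prediction variance below the unconditional variance $\sigma^2$.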

| BGLM implementation
We chose to apply a prior $f(\beta)$ indicating our pre-observation total ignorance about $\beta$. Because $\beta$ is a centrality parameter, this prior might be considered "flat" or "improper uniform" and can, from a mathematical point of view, be left out of Equation (5). This means that the posterior of $\beta$ for a BGLM is proportional to the likelihood (expressed as a function of $\beta$) as derived for the GLM. According to Kutner et al. (2005, section 14.5) and Myers et al. (2002, sections 4.3 and 4.4.1), this likelihood proportionally approaches an MVN distribution for large sample sizes, where the mean of this MVN distribution can be estimated by a maximum likelihood estimator using the iteratively reweighted least squares algorithm, and the corresponding variances and covariances are estimated via the Fisher information matrix. For these calculations, we applied the algorithm as implemented in the base function glm in the statistical programming language R (R Core Team, 2017). The above mathematical and numerical steps are illustrated in Figure 1. We obtain a prediction $p^*$ by substituting the maximum likelihood estimate of $\beta$ in Equation (4), with $X$ replaced by $X^*$, and the result of that in Equation (7).
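The iteratively reweighted least squares algorithm used by R's glm can be sketched in a few lines; this Python version (with simulated data, not the paper's data) shows the Newton/IRLS update and the covariance estimate from the inverse Fisher information:

```python
import numpy as np

def irls_logistic(X, z, n_iter=25):
    """Maximum likelihood for a logistic GLM via iteratively
    reweighted least squares (the algorithm behind R's glm()).
    X: n x k design matrix (rows = locations), z: binary responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)                          # IRLS weights
        H = X.T @ (W[:, None] * X)               # Fisher information
        beta = beta + np.linalg.solve(H, X.T @ (z - p))
    cov = np.linalg.inv(H)                       # variances/covariances
    return beta, cov

# Simulated data with known coefficients (0.5, 1.2)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x)))
z = (rng.random(200) < p_true).astype(float)
beta_hat, beta_cov = irls_logistic(X, z)
print(beta_hat)        # should be near (0.5, 1.2)
```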

| BGLGM implementation
For BGLGM inference and prediction we shaped our approach around several functions from the geoRglm R package, which was developed to calibrate generalized linear spatial models and thus enable two-dimensional spatial prediction of binomial- and Poisson-distributed variables of interest (Christensen & Ribeiro Jr, 2002, 2015). These functions numerically approximate the posterior distribution of signal and model parameters by simulating a large sample from the posterior. However, sampling from a posterior can be computationally expensive, especially if, as in our case, the posterior contains a signal of size $n$ (the number of observations), making it necessary to sample from a high-dimensional combined parameter and signal space. To lower computational costs, several extensions and simplifications were implemented by Christensen, Roberts, and Sköld (2006), to be discussed in the following subsections together with several choices we made for this research. Appendix S1 elaborates on the applied mathematics. The corresponding workflow, showing analytical and numerical processing steps between the boxed representations, is illustrated in Figure 2. Our embedding of geoRglm follows in Section 3.2.3.

| Choice of BGLGM priors: Restrict nugget and range parameter
We assume that the marginal prior of $\phi$, the marginal prior of $\tau^2$ and the combined prior of $\beta$ and $\sigma^2$ are independent:

$$f(\theta) = f(\phi)\, f(\tau^2)\, f(\beta, \sigma^2). \qquad (21)$$

We write the joint prior for $\beta$ and $\sigma^2$ as a product of conditional and marginal densities:

$$f(\beta, \sigma^2) = f(\beta \mid \sigma^2)\, f(\sigma^2). \qquad (22)$$

Next, following the ideas of Diggle and Ribeiro (2007), we assign as combined prior a multivariate normal (MVN) and scaled inverse chi-squared ($\chi^2_{ScI}$) distribution, respectively:

$$\beta \mid \sigma^2 \sim \mathrm{MVN}(\xi_0, \sigma^2 D_0), \qquad \sigma^2 \sim \chi^2_{ScI}(\nu_0, \varsigma_0^2). \qquad (23)$$

From this we derive the distribution of $\beta$ and $\sigma^2$ conditional on $y$ and $\phi$, $f(\beta, \sigma^2 \mid y, \phi)$, which also has an MVN–$\chi^2_{ScI}$ distribution; the proof is provided in Appendix S1. Because $f(\beta, \sigma^2)$ and $f(\beta, \sigma^2 \mid y, \phi)$ have the same type of distribution, they are "conjugate", and in this context we consider $f(\beta, \sigma^2)$ a "conjugate prior". The "hyperpriors" $\xi_0$, $D_0$, $\nu_0$ and $\varsigma_0^2$ define the distribution of the prior; subscript 0 indicates that a variable is a prior or hyperprior parameter.
For pragmatic reasons, $\eta^2$ is restricted to a fixed plug-in value. This means that we should apply external information about $\eta^2$ or try out different values and compare results. Furthermore, there are computational costs involved for each possible value of $\phi$. Therefore, the prior of $\phi$ is restricted to a discrete set of equally spaced values, thus forcing every possible $\phi$ in any calculation to be one of those values. For these values of $\phi$, related internal variables are precalculated and stored.
Within the above-indicated constraints, we chose in our research a low-informative prior $f(\beta)$ for the regression coefficients, by stating that $D_0^{-1} = 0$. This is a limit situation that causes $f(\beta \mid \sigma^2)$ to become improper. For $f(\sigma^2)$ and $f(\phi)$, we followed the reasoning of Berger, de Oliveira, and Sansó (2001) for Gaussian spatial models, which provides a reference prior composed of $f(\sigma^2) = 1/\sigma^2$ (which equals $\nu_0 = 0$, in this case a limit situation parameterized for the $\chi^2_{ScI}$ distribution, whereby $f(\sigma^2)$ becomes improper) and a proper prior for $\phi$, depending on the observation locations and related covariate values. An example of this reference prior $f(\phi)$ follows in the case study, among others in Figure 4. We fixed the prior for $\tau^2$ (and thus $\eta^2$) to an arbitrary zero, because we have no other information about its possible value or its distribution, and exploring multiple values was outside the scope of this research.

FIGURE 1 Workflow to infer the posterior of $\beta$ and predict the probability of success using a Bayesian generalized linear model (BGLM). Our implementation has been built from R base components.
3.2.2 | Integrate out regression coefficients and variance; sample from posterior range parameter and signal

Computational costs can be substantially reduced by integrating out the regression coefficients $\beta$ and the variance of the spatial signal $\sigma^2$ (see Appendix S1), rather than having them be part of the numerical approximation of the posterior. This "integrating out" means that these parameters, as such, disappear from the equation, while their influence remains, expressed in the relations between the remaining variables.
The remaining $(\phi, y)$ parameter and signal space is sampled using Markov chain Monte Carlo (MCMC) (Brooks, Gelman, Jones, & Meng, 2011). In each iteration the MCMC algorithm alternates between a proposal for $\phi$ and a proposal for $y$. For the one-dimensional update in $\phi$, a Metropolis algorithm is applied, meaning that the proposal distribution is symmetric. For the $n$-dimensional block update in signal space, the Langevin–Hastings algorithm (LH) is applied (Christensen et al., 2006). LH is a special case of Metropolis–Hastings, with faster convergence: whereas with Metropolis the number of iterations to reach convergence is proportional to the dimension of the combined parameter–signal space (i.e., it is $O(n)$; see Chivers & Sleightholme, 2015, for an explanation of the "big O" notation), LH converges proportionally to the cube root of the dimensionality ($O(n^{1/3})$) (Roberts & Rosenthal, 1998). It does so by using information captured in the spatial correlation matrix $C$ and in the observations $z$ to form a proposal probability gradient field (Christensen et al., 2006).
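The one-dimensional Metropolis update for $\phi$ over a discrete set of values can be sketched as follows. This is a toy illustration, not geoRglm's internal code: the log-posterior, grid and proposal scale are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

phi_grid = np.arange(50.0, 1050.0, 50.0)   # discrete support for phi

def log_post(phi):
    """Toy unnormalized log-posterior for phi (stand-in for the real one)."""
    return -0.5 * ((phi - 400.0) / 150.0) ** 2

def metropolis_phi(phi, gamma_phi):
    """One Metropolis step: normal proposal, rounded to the discrete grid."""
    prop = phi + rng.normal(0.0, gamma_phi)
    prop = phi_grid[np.argmin(np.abs(phi_grid - prop))]  # snap to grid
    accept = np.log(rng.random()) < log_post(prop) - log_post(phi)
    return (prop if accept else phi), accept

phi, n_acc = phi_grid[0], 0
for _ in range(5000):
    phi, acc = metropolis_phi(phi, gamma_phi=100.0)
    n_acc += acc
print(n_acc / 5000)    # acceptance rate; tuning gamma_phi steers this
```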

| Tuning the proposal distributions, chain phases
The proposal distributions for the MCMC iteration steps of $\phi$ and $y$ are, respectively: (a) a normal distribution, scaled and rounded to the earlier defined discrete values for $\phi$, and (b) a multivariate normal distribution scaled according to the LH algorithm (Christensen et al., 2006). Each proposal distribution is scaled by a single variance parameter, $\gamma_\phi$ and $\gamma_Y$, respectively. These scale settings should force the acceptance rates to be as close as possible to the ideal rates: 0.44 for a single-parameter Metropolis algorithm (Rosenthal, 2011) and 0.57 for LH (Christensen et al., 2006).

FIGURE 2 Workflow to sample parameter posteriors and the posterior predictive using a Bayesian generalized linear geostatistical model (BGLGM), as implemented in the geoRglm package (Christensen & Ribeiro Jr, 2002, 2015). The Markov chain Monte Carlo part (MCMC, indicated in red) represents many thousands of iterations. The mentioned "tuning phase" is our addition.
Manually tuning $\gamma_\phi$ and $\gamma_Y$ proved to be a very cumbersome process, because (a) a test run to assess a tuning setting can take up to half an hour of computation time and we need to assess several combined settings of the two scaling parameters, and (b) sometimes at the very beginning of a chain a low value of $\gamma_Y$ is needed to achieve an acceptance rate for $y$ larger than zero (in other words, to start exploring the parameter–signal space), whereas soon afterwards this value has to be increased. To overcome both issues we used a self-tuning algorithm that searches for values of $\gamma_\phi$ and $\gamma_Y$ that force the acceptance rates near the optimal values given above. Because the chain continues while trying out different settings during the first several thousand iterations, this approach supports starting up.
For the self-tuning algorithm we applied a proportional–integral (PI) feedback controller for each scale setting. Such controllers are used in many industrial processes and are known for their robustness with respect to process modelling uncertainties (Åström & Murray, 2008) and their easy mathematical description. Thus, in our research the complete MCMC chain consisted of three phases: (a) the tuning phase, where $\gamma_\phi$ and $\gamma_Y$ are tuned and exploration of the parameter–signal space starts; (b) the warm-up phase, where the chain arrives in an appropriate subspace of the parameter–signal space (note that, according to Gelman et al. (2013), "warm-up" is a better analogy than the often-used expression "burn-in" for this phase); and (c) the production phase, where the chain explores the parameter and signal space and thus samples the posterior. During phases 2 and 3, $\gamma_\phi$ and $\gamma_Y$ are constant. The chain part from phase 3 is thinned, and ultimately used for inference and prediction.
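The tuning idea can be sketched with a toy PI controller (Python rather than R; the gains, batch size and the toy acceptance model below are assumptions for illustration, not the settings used in this work):

```python
import numpy as np

def pi_tune(target, n_batches=200, batch=100, kp=1.0, ki=0.3):
    """Tune a log proposal scale so the empirical acceptance rate
    approaches `target`, using a proportional-integral controller.
    The 'process' here is a toy sampler whose acceptance rate
    drops as the proposal scale gamma grows."""
    rng = np.random.default_rng(3)
    log_gamma, integral = 0.0, 0.0
    for _ in range(n_batches):
        gamma = np.exp(log_gamma)
        # Toy acceptance model: rate decreases with the scale gamma
        rate = np.mean(rng.random(batch) < 1.0 / (1.0 + gamma))
        error = target - rate
        integral += error
        # PI update: grow the scale when accepting too often, and vice versa
        log_gamma -= kp * error + ki * integral
        log_gamma = np.clip(log_gamma, -10.0, 10.0)
    return np.exp(log_gamma), rate

gamma, _ = pi_tune(target=0.44)
# After tuning, the long-run acceptance rate of the toy sampler
# should sit near the 0.44 target.
print(gamma, 1.0 / (1.0 + gamma))
```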

| Inferring posterior and posterior predictive
As illustrated in Figure 2, each iteration of the thinned chain enables the sampling of one value of the marginal posterior $f(\beta \mid z)$; all those values together provide the sample describing this marginal posterior as an empirical distribution. Exactly the same holds for $f(\sigma^2 \mid z)$. The process to arrive at the posterior predictive is similar: for each thinned chain iteration, a corresponding prediction distribution is calculated, and from this distribution one two-dimensional set of values is simulated. Again, all values together constitute the spatial posterior predictive as an empirical distribution $f(y^* \mid z)$, which can be inverse logit transformed to $f(z^* \mid z)$. In Appendix S1 all this is explained in more detail, for example, how the chosen priors and chain iteration values influence those three sampling functions. The marginal posterior $f(\phi \mid z)$ as an empirical distribution is already captured in the thinned chain values.

| CASE STUDY: DEPTH OF PLEISTOCENE SAND LAYER IN FLEVOLAND
The Dutch province of Flevoland consists of two land reclamations (or polders): the Noordoostpolder (which translates literally to "North-East polder"), reclaimed in 1943, and the Flevopolder, reclaimed in two parts, in 1957 and 1968, respectively. The total land area is ca. 1,400 km². The Noordoostpolder encapsulates two former islands: Urk (founded on boulder clay deposited in the Saalian ice age) and Schokland (founded on Holocene peat).
During the last ice age (Weichselian), cover sand, an aeolian deposit with a median grain size of 105–210 μm (Koster, 2009), was deposited in this area. During the Holocene, the cover sand was covered by marine deposits (silt and clay) and by peat. The depth below the land surface of this Pleistocene cover sand layer now ranges from 0 to over 16 m.
The provincial government and the regional water board are interested in the soil composition down to the Pleistocene sand layer (Brouwer, de Vries, & Walvoort, 2018). This is because unripened clay and peat, whose properties change under the influence of air, might be present above the Pleistocene substrate. These changing properties cause a subsiding ground level (in some spots by over 2 cm year⁻¹), while the Pleistocene sand layer is stable. For example, in the municipality of Zeewolde, subsidence as a result of ripening soils causes problems with road maintenance. Also, spatially variable subsidence causes changes in surface water flows, which both the water board and farmers have to anticipate. A map of the depth of the Pleistocene sand layer might help to locate potential problems with land subsidence. The map might also be helpful for the foundation of construction works (such as buildings, roads and railroads).

| Soil data and covariates
In 2018, the soil in the province of Flevoland was sampled at 1,507 points (Brouwer et al., 2018) selected by spatial coverage sampling (Walvoort, Brus, & de Gruijter, 2010) over the area of interest. At each location, the depth of the Pleistocene cover sand layer below the land surface was determined. At some locations the cover sand was not encountered within the augering depth, which was at least 1.20 m. At these locations we have so-called right-censored observations of the cover sand depth. We transformed the continuous depth data into indicators, having value 1 if the depth of the cover sand exceeds 1.20 m, and 0 otherwise. We used two covariates in raster format, the first being elevation as derived from the digital elevation model (DEM) of the Netherlands, more specifically the LiDAR-based Dutch AHN2 (PDOK, 2020). We used the AHN2 version with 5-m resolution and void filling, and resampled it (using the mean) to a 25-m square grid. We chose this covariate because we expected it to be related to the cover sand depth. For the calibration locations, the DEM ranged from −5.4 to 8.6 m, with a mean of −3.6 m, relative to the reference sea level. The second covariate is the estimated thickness of the Holocene layer according to the three-dimensional Dutch Digital Geological Model (Dinoloket, 2020; Gunnink, Maljers, van Gessel, Menkovic, & Hummelman, 2013). This covariate is a legacy map of the depth to the Pleistocene cover sand. Depending on the quality of the legacy map, we expect this covariate to be a good candidate for modelling and mapping our Bernoulli variable. For the calibration locations, the thickness of the Holocene layer ranges from 0 to 8.6 m, with a mean of 2.9 m.
After cleaning the data (the original 1,507 observations and the covariates) and setting aside 500 observation locations as "validation locations", exactly 1,000 calibration locations remain, of which 782 have value "1", that is, depth to Pleistocene sand > 1.20 m. The removed observations comprise two observations at exactly the same location as other observations and five locations without full covariate coverage. The calibration and statistical validation locations, and the covariates, are shown in Figure 3. Note that urban areas, swamps, waterbodies and areas where the Pleistocene sand layer is known to be at the surface were not or less intensively sampled, because they are outside the area of interest (Brouwer et al., 2018).

| Research approach
We compared four regression models: no covariates (indicated by "empty regression model"), elevation as covariate, thickness of the Holocene as covariate, and both elevation and thickness of the Holocene as covariates ("full regression model"). All four regression models have an intercept. For each regression model, we used a BGLM and a BGLGM approach. To facilitate convergence assessment (further explained in Subsection 4.2.1) each BGLGM was fitted twice. Thus, we calibrated 12 models (four BGLMs and eight BGLGMs), and assessed their performance by statistical validation for comparison; the research goal here was to compare the added value of a BGLGM versus a BGLM in the context of different regression models. We selected the models without covariates and the models with both covariates to show the posterior distributions of model parameters and to predict over the whole area of interest.
In the case of a BGLGM, we aggregated the data by constructing pairs of points based on minimal separation distance, to reduce computing time. The sum of the two Bernoulli variables was treated as a binomial variable of two trials, located at a virtual observation point halfway between the two points of a pair. The binomial distribution is a straightforward generalization of the Bernoulli distribution shown in the theory sections of this work, allowing for several "draws" or observations per location instead of one; the number of observations at a given point is denoted by $m$ or $m_i$, not to be confused with the unit of distance "m". In our case, $m = 2$ for all virtual points. The covariate values at the individual points were averaged and assigned to the virtual observation point.
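The pairing step can be sketched as follows, assuming a greedy nearest-neighbour pairing (Python rather than R; the exact pairing rule used in this work may differ, and the coordinates and values are hypothetical):

```python
import numpy as np

def aggregate_pairs(coords, z, covs):
    """Greedily pair each point with its nearest unpaired neighbour.
    Returns virtual midpoints, binomial success counts (m = 2 trials
    per virtual point) and averaged covariate values."""
    unused = set(range(len(coords)))
    mids, counts, cmeans = [], [], []
    while len(unused) >= 2:
        i = min(unused)
        unused.remove(i)
        # nearest remaining point to i
        j = min(unused, key=lambda k: np.linalg.norm(coords[i] - coords[k]))
        unused.remove(j)
        mids.append((coords[i] + coords[j]) / 2.0)
        counts.append(z[i] + z[j])          # successes out of m = 2
        cmeans.append((covs[i] + covs[j]) / 2.0)
    return np.array(mids), np.array(counts), np.array(cmeans)

coords = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
z = np.array([1, 0, 1, 1])
covs = np.array([2.0, 4.0, 6.0, 8.0])
mids, counts, cmeans = aggregate_pairs(coords, z, covs)
print(mids, counts, cmeans)   # two virtual points, each with m = 2 trials
```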
We ran the warm-up phase and the production phase with 5 million iterations each, and thinned the production phase to 50 samples for posterior inference and statistical validation. This strong thinning was needed for computational feasibility.
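Thinning itself is trivial; a minimal Python sketch (the helper `thin_chain` is hypothetical, shown only to make the step concrete):

```python
def thin_chain(chain, n_keep):
    """Keep n_keep approximately evenly spaced samples from an MCMC chain."""
    step = max(1, len(chain) // n_keep)
    return chain[::step][:n_keep]
```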

| Chain convergence assessment for a BGLGM
As we are not aware of any easy-to-use MCMC convergence diagnostic for high-dimensional sample spaces, we applied the between-chain approach (Gelman & Shirley, 2011): for each BGLGM under investigation, we started two chains, with different random starting values for ϕ and Y, and different pseudo-random number seeds, forcing the proposals of each iteration to be different between the two chains. We assessed the convergence by visually comparing the posterior densities of ϕ, β and σ² of the two chains, as well as the statistical validation metrics (i.e., measures of the quality of the spatial predictions). Because the model parameters and the statistical validation metrics depend on the signal, we assume that this approach, although indirect, provides sufficient information to assess the convergence of inference of both ϕ and the signal. For illustration, we also present some MCMC traceplots and parameter correlations in Appendix S1.
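A first, crude automation of this between-chain comparison is sketched below (our own illustrative check, not part of the geoRglm workflow; the function name, summary statistics and tolerance are assumptions). It flags disagreement between the posterior samples of a scalar parameter from two chains:

```python
import numpy as np

def chains_agree(chain1, chain2, rel_tol=0.1):
    """Crude two-chain agreement check: compare posterior medians and
    interquartile ranges of a scalar parameter between two MCMC chains,
    relative to the larger interquartile range."""
    m1, m2 = np.median(chain1), np.median(chain2)
    iqr1 = np.subtract(*np.percentile(chain1, [75, 25]))
    iqr2 = np.subtract(*np.percentile(chain2, [75, 25]))
    scale = max(iqr1, iqr2, 1e-12)
    return abs(m1 - m2) / scale < rel_tol and abs(iqr1 - iqr2) / scale < rel_tol
```

A visual comparison of full posterior densities, as done in the study, remains more informative; such a numeric check only helps to screen many runs quickly.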

| Statistical validation
As already indicated, we randomly divided the original observations into a calibration set (originally n = 1,000, for a BGLGM aggregated to 500 virtual points) and a validation set (n_v = 500). As a statistical validation metric for continuous predictions p_v (probabilities between, and including, zero and one) in combination with binary validation observations z_v (zero or one), we used the Brier score (Spiegelhalter, 2019), that is, the mean squared difference between the predicted probabilities and the binary validation observations, which is the analogue of the general mean squared error statistic.
Based on an arbitrary probability threshold of 0.5, we also computed confusion matrices for each calibrated model, each matrix showing the numbers of true positives, false positives, true negatives and false negatives. We summarized these confusion matrices by the overall accuracy, that is, the fraction of all validation locations predicted correctly, and by the balanced accuracy, which is the average of the fraction of correctly predicted positives and the fraction of correctly predicted negatives (Brodersen, Ong, Stephan, & Buhmann, 2010). To investigate the improvement by the recent soil survey and our models as compared to the legacy map (used as a covariate in some of our models), we also calculated the validation metrics for this legacy map. We transformed the continuous variable thickness of the Holocene (which equals the depth to the Pleistocene cover sand) according to the legacy map into an indicator by applying the depth threshold of 1.20 m.
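For reference, the three validation metrics follow directly from their standard definitions; a Python sketch (the study itself used R, and the function names here are our own):

```python
import numpy as np

def brier_score(p, z):
    """Mean squared difference between predicted probabilities p and
    binary validation observations z (smaller is better)."""
    p, z = np.asarray(p, float), np.asarray(z, float)
    return np.mean((p - z) ** 2)

def accuracies(p, z, threshold=0.5):
    """Overall and balanced accuracy based on a thresholded confusion matrix."""
    pred = np.asarray(p) >= threshold
    z = np.asarray(z).astype(bool)
    tp = np.sum(pred & z)
    tn = np.sum(~pred & ~z)
    fp = np.sum(pred & ~z)
    fn = np.sum(~pred & z)
    overall = (tp + tn) / (tp + tn + fp + fn)
    balanced = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return overall, balanced
```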

| Prediction
For prediction, we chose to use all available information (thus, the full regression model). We resampled the covariate rasters to a prediction mask resolution of 250 × 250 m. In the case of a BGLM, we used the predict function on a glm object. For a BGLGM, we found that the calculation time of the prediction (additional to calculating the MCMC itself) was substantial. This prediction calculation time appeared to be roughly proportional to O(n*² · n_t), with n* the number of prediction locations and n_t the number of remaining MCMC samples after thinning. We therefore separated the prediction mask into 10 small prediction masks of ca. 2,500 pixels each, and predicted each small mask with a separate run, additional to the single run mentioned in Section 4.2 to infer the posterior parameters. All runs were provided with the same random generator seed and calibration data. Within each run we thinned the MCMC production phase to five samples; again, this quite rigorous separation and thinning was needed for computational feasibility. For each of those five samples, we calculated a spatial prediction as depicted in Figure 2. For each prediction location, we selected the median of the five predictions and finally merged the small prediction masks into the complete prediction map.
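The batching-and-merging scheme can be sketched as follows (Python, illustrative; `predict_batch` stands in for a full BGLGM prediction run and is a hypothetical callback, not a geoRglm function):

```python
import numpy as np

def predict_in_batches(pred_locations, predict_batch, batch_size=2500):
    """Split a large prediction mask into batches (prediction cost scales
    roughly with the squared number of prediction locations), take the
    per-location median over the retained MCMC samples, and merge.

    predict_batch(locs) is assumed to return an array of shape
    (n_samples, n_locs): one prediction per retained MCMC sample."""
    medians = []
    for start in range(0, len(pred_locations), batch_size):
        batch = pred_locations[start:start + batch_size]
        samples = predict_batch(batch)           # shape (n_samples, len(batch))
        medians.append(np.median(samples, axis=0))
    return np.concatenate(medians)
```

Keeping the random seed and calibration data identical across batch runs, as done in the study, ensures the batches are mutually consistent.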

| General performance and computational costs
Comparing the calculation approaches, we found that a BGLM always and swiftly produces an answer. However, we initially had some problems getting a BGLGM to work at all. A first hurdle was that we obtained MCMC chains with acceptance rates very close to zero or one, meaning that the chain explored the parameter and signal space not at all or only very slowly. This was due to difficulties in finding workable MCMC proposal settings γ_ϕ and γ_Y, and probably also to the need for different proposal settings during the initial iterations compared to the subsequent iterations. Those problems were solved for all but a few of our many trials (not shown, and tested with both real-world and simulated data) using the tuning algorithm described in Section 3.2.3. Given a properly working MCMC algorithm with reasonable acceptance rates, the second hurdle was to produce meaningful results, such as convergence to reasonable posteriors. We found that enough data were needed to achieve convergence, and that the (detrended) data needed to be sufficiently spatially structured; we also found that convergence problems due to a weak spatial structure could be solved by increasing the MCMC chain length.
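The tuning algorithm itself is described in Section 3.2.3; the sketch below shows only the generic idea of adapting a proposal scale to the acceptance rate observed over the previous batch of iterations (Python; the acceptance-rate band and adjustment factor are illustrative assumptions, not the values used in the study):

```python
def tune_proposal(step, acc_rate, factor=1.5, lo=0.30, hi=0.60):
    """Adjust an MCMC proposal scale based on the observed acceptance rate:
    shrink the step when too few proposals are accepted (chain stuck),
    grow it when too many are (chain moving in tiny steps)."""
    if acc_rate < lo:
        return step / factor
    if acc_rate > hi:
        return step * factor
    return step
```

Applied repeatedly during a warm-up phase, such a rule steers the chain away from the degenerate acceptance rates near zero or one described above.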
The BGLMs produced answers within a few seconds of calculation time, whereas each single BGLGM run took 30-50 h, using contemporary computer technology optimized for speed, one processing core and sufficient working memory.
For elaborating on the results and a more detailed understanding of the BGLGM algorithm, we show and discuss MCMC traceplots and MCMC parameter correlations in Section A7 in Appendix S1. Figures 4 and 5 show the posteriors of the regression and spatial parameters for the empty regression model and the full regression model, respectively, for both the BGLM and the BGLGM. The regression coefficient distributions for elevation are centred around zero for both models. In the case of the BGLGM, the distributions of β are wider than those obtained with the BGLM. In the empty BGLGM regression model the posterior of β_0 is even almost bi-modal, with one peak at β_0 = 0 and one at β_0 = 3. In the case of the full regression BGLGM models, the posteriors of β_1 and β_2 have more probability mass away from zero than those obtained with the BGLM. The posterior of the spatial parameter σ² has quite some probability mass at higher values in the case of the empty regression model. With the full regression model, this probability mass moves closer to zero. The medians of the posterior of σ² for the two BGLGMs with the empty regression model equal 47 and 57, whereas for the two BGLGMs with the full regression model they are 6.8 and 5.4. The posterior of ϕ shows the same movement: the medians are 29 and 30 km for the empty regression models and 8.3 and 7.0 km for the full regression models. The spatial parameters σ² and ϕ denote the variance and range parameter, respectively. Note that the BGLM has only regression coefficients (no spatial parameters). Note also that the prior of σ² cannot be normalized because it is improper; in other words, the area under its curve cannot be scaled to one. For each BGLGM, the posteriors of two separate runs (differing in random seed) are given, allowing exploration of their convergence.

| Statistical validation
The validation metrics are given in Table 1 and the confusion matrices in Figure 6. Note that, by definition, the balanced accuracy of a no-covariates model, which amounts to "just taking the mean", is exactly 0.5. In contrast, because almost 80% of the calibration data have value 1 (see Section 4.1), a BGLM based on taking the mean of the observations results in a spatially constant prediction larger than 0.5, and thus an overall accuracy of around 0.8 and also around 20% false positives; this is visualized in the confusion matrices. All BGLGMs outperform all BGLMs according to all metrics. A BGLGM extracts a bit more information out of the elevation covariate: all metrics show that adding elevation to the BGLM empty regression model does not improve performance, whereas for the BGLGM it does. There are minor differences in the validation statistics between the two different runs of the BGLGM models, with the largest difference being 0.003 (relative: ca. 5%) in the Brier score, 0.008 (<1%) in overall accuracy and 0.004 (<1%) in balanced accuracy.
The legacy map of the property of interest (thickness of the Holocene) performs better than the BGLMs without the thickness of the Holocene included as covariate, but is worse in all other cases.
Although all numbers presented in Table 1 point towards the same main conclusions as presented above, the three metrics each emphasize different aspects of the differences between the models and model runs.
TABLE 1 Statistical validation of all models according to the Brier score (smaller is better), overall accuracy (larger is better) and balanced accuracy (larger is better)

| Predictions
The predictions for the full models are presented in Figure 7. Figure 8 shows the same predictions for the detail area indicated by the orange rectangle in Figure 7. For the BGLGM, we used run #1. In both maps, we can see the effect of the covariates elevation and thickness of the Holocene layer (compare with Figure 3), for example in the clearly distinct spots with low probability in the very south-west of the Flevopolder. This may be an artefact in the urban area of the city of Almere. The prediction by the BGLM is in general somewhat smoother, especially in the very south of the Flevopolder. The BGLGM allows predicted probabilities much closer to zero than does the BGLM, whereas in large parts of the Noordoostpolder the BGLGM predicts probabilities closer to one compared to the BGLM. In the north of the Flevopolder, the BGLGM map shows an almost circular pattern with low probabilities that is hardly visible in the BGLM. Relating the prediction to the observations (Figure 8), the BGLGM follows the observations in the calibration data more closely.

| Mapping Pleistocene sand layer depth
Both prediction maps (Figure 7) confirm the general spatial pattern depicted on the covariate map thickness of the Holocene (Figure 3), showing several areas where the Pleistocene sand is close to the surface. The areas where the Pleistocene sand is deeper reflect old fluvial patterns, for example, the deltas formed by the ancestors of the current rivers Overijsselse Vecht and Gelderse IJssel (Brouwer et al., 2018) and probably also the Eem. The models also predict in the urban areas, the swampy areas and the areas otherwise excluded from the sampling design, which lack both calibration and validation data. One of these excluded parts is the area just below the very north of the Noordoostpolder. Here, the predicted probability of Pleistocene cover sand deeper than 1.20 m correctly approaches zero. The maps indicate (although not conclusively) which areas are most susceptible to soil subsidence (Brouwer et al., 2018), that is, where a closer look into the soil properties would make sense. The same methodology could also be used to make a map of the probability of Pleistocene cover sand at the surface. The statistical validation data of these maps suggest that we should have more trust in Figure 7b than in Figure 7a. Also visually, it seems that the BGLGM prediction in Figure 7b follows the observations as provided in Figure 3 more closely. We should be careful with the low-probability dots in the south-west of the Flevopolder as shown on both maps, because they are caused by locally deviating covariate values (very probably artefacts) and not by the observation data.

| Posterior regression parameters
The posteriors of the regression coefficients were calculated differently in the BGLM and the BGLGM. For the BGLM, we used straightforward mathematics and a numerical algorithm converging within a limited number of iterations. This was possible because of our choice of a low-informative prior, which meant that we could borrow well-developed concepts from non-Bayesian statistics (i.e., maximum likelihood estimates and the corresponding estimated variances). In the case of the BGLGM, the posterior is a sample generated from an MCMC chain. We suppose that extending a BGLM with variogram parameters and a signal into a BGLGM allows for wider posteriors of the regression parameters and even for a non-Gaussian posterior; see β_0 in Figure 4. The empty regression model resulted in a very narrow BGLM posterior centred at the logit-transformed mean of the observations, whereas the corresponding BGLGM posteriors explored a much wider range of values. We could not explain why the two posterior BGLGM modes consistently lay around, but differed from, the mode of the BGLM posterior. In the full model (Figure 5), the modes of the intercepts β_0 of the BGLM and BGLGM are about equal; more freedom for β results in larger modes for β_1 and β_2, suggesting that the BGLGM extracts more information from the covariates than the BGLM. This was already suggested by the statistical validation. Perhaps this is due to the covariance structure or to the hierarchical nature of the BGLGM.

| Posterior spatial parameters
The influence of the regression model on the spatial parameters, as shown in Section 4.3.2, indicates that the covariates explain part of the variation (i.e., the probability mass emphasizes lower values of σ² if covariates are included) and also part of the spatial structure at larger distances (i.e., the probability mass emphasizes lower values of ϕ after including covariates). It would be interesting, in follow-up research, to explore whether the resulting variogram slopes close to the origin differ between the empty regression model and the full regression model. It would also be interesting to investigate the correlation between the spatial parameters σ² and ϕ for a given regression model by exploring the joint posterior distribution f(σ², ϕ | z), which is, especially in a Bayesian setting like this, fairly easy to do.
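Given the stored MCMC samples, exploring that joint posterior is indeed straightforward; a Python sketch (illustrative, assuming paired per-iteration samples of the two parameters; a density plot of the 2D histogram would typically accompany this):

```python
import numpy as np

def joint_posterior_summary(sigma2_samples, phi_samples):
    """Summarize the joint posterior f(sigma^2, phi | z) from paired MCMC
    samples: posterior correlation plus a 2D histogram as a density sketch."""
    corr = np.corrcoef(sigma2_samples, phi_samples)[0, 1]
    hist, s2_edges, phi_edges = np.histogram2d(
        sigma2_samples, phi_samples, bins=20, density=True)
    return corr, hist, s2_edges, phi_edges
```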

| Statistical validation
The threshold of 0.5, used for computing the confusion matrices and the overall accuracies, is a fairly arbitrary choice. Additionally, the overall accuracy metric assumes that the penalties of a false-negative and a false-positive conclusion are equal. As a validation metric without a threshold, we chose the Brier score because it is very straightforward, and also because in earlier work we had disappointing experiences with the often-applied area under the curve metric (Steinbuch et al., 2018). Note that in the case of an unbalanced statistical validation dataset (meaning: many zeros and few ones, or vice versa), or with non-uniform penalties on prediction error, the Brier score as a means of model comparison can be counter-intuitive, while other metrics might offer more flexibility (Assel, Sjoberg, & Vickers, 2017; Jewson, 2004). Statistical validation of probabilistic predictors in the context of soil mapping is discussed by Rossiter, Zeng, and Zhang (2017) and Beaudette (2020), including alternative metrics. However, because of our focus on model comparison rather than on real-world applications where penalties indicate practical consequences, we assume that the quite consistent message from the three applied metrics (overall accuracy, balanced accuracy and Brier score) already provided a useful answer to our research questions. We did not calculate any other validation metrics. The only inconsistency is the drop in overall accuracy from the regression model with the thickness of the Holocene as covariate to the full model in the case of the BGLGM; this drop is not reflected in an increased Brier score. Based on the overall accuracy, we suspect some overfitting of the data by the model with the elevation covariate.

| Comparison of the models and alternative approaches
In our research the BGLM performed reasonably well (see Table 1). However, for this it leaned heavily on the thickness of the Holocene covariate, which is based on the 3D geological model as described in Gunnink et al. (2013). The covariate elevation, often included as covariate in digital soil mapping, hardly added predictive power to any of the models. We should also note that the DEM is already used in the 3D geological model.
Based on the validation statistics (Table 1), the BGLGM clearly outperformed the BGLM. However, as indicated in Section 4.3.1 (and elaborated in Section A8 of Appendix S1), it was quite an effort to get the BGLGM functioning properly, and when it worked there were serious computational costs involved, with an associated electrical energy footprint (Taffoni et al., 2019) and a waiting time of up to 50 h until results were available. For the BGLGM, we followed the approach and implementation of Diggle and Ribeiro (2007) and Christensen (2002). Comparable approaches in the R universe, such as geoCount (Jing & De Oliveira, 2015), spBayes (especially function spMvGLM [Finley, Banerjee, & Gelfand, 2015]), geoBayes (Evangelou & Roy, 2019), PrevMap (Giorgi & Diggle, 2017) and approximations such as INLA (Rue, Martino, & Chopin, 2009), eventually embedded in geostatsp (Brown, 2015), were outside our scope, but might be interesting to consider in future research and practical applications. We also did not look into more pragmatic approaches such as indicator kriging, creating a Gaussian model with the available data (i.e., without simplification to binary observations) or data-mining methods such as random forest (Heung et al., 2016) (in combination with additional covariates), all of which have their merits and drawbacks in the case of a binomial spatial variable.
A decision on which method to use should depend on the available or feasible sample size and the available covariates, the final goal of the map and/or other deliverables such as the posteriors and uncertainty quantification, and (more pragmatically) the available resources. Often, many covariates are used in digital soil mapping, some based on direct observation (such as the DEM and its derivatives generated from radar, and vegetation data visible on multi-spectral satellite images) and some based on modelling and interpolation (such as climate data). When few covariates related to the variable of interest are available, we need methods at hand that deliver reasonable mapping results based on the sample itself. In such cases, indicator kriging, although statistically not completely sound, might be a good approach. If we want to add existing knowledge not captured in covariates, a Bayesian approach might be a good option. If, from a pedometrics viewpoint, we are interested in a solid statistical description of binomial soil properties over space, an approach such as the BGLGM might very well be worth the effort. Furthermore, the addition of an automated tuning phase as described in Section 3.2.3 makes both methodological research and practice-directed data processing much more feasible than manual tuning would allow.

| CONCLUSIONS
Adding a geostatistical component to a Bayesian generalized linear model (BGLM) for mapping binary soil properties yielded considerably better statistical validation metrics in a case study with a large (n = 1,000) observation sample and few relevant covariates available. However, the resulting Bayesian generalized linear geostatistical model (BGLGM) is quite demanding with respect to sample size, tuning the algorithm and computational costs. In this study we focused on the implementation of the BGLGM as provided in the R-package geoRglm, which we embedded in our own scripts. We replaced manual tuning by an automated tuning algorithm and found a sample composition (in size and in number of observations per location) that delivers meaningful results within 50 h calculation time. With the gained insights, spatial soil practitioners and researchers can (for their specific cases) better evaluate whether using a BGLGM would be possible at all and whether the extra gain would be worth the extra effort. The developed automated tuning algorithm (of which the code is available as part of an easy-to-run example script, see doi.org/10.5281/zenodo.5005828) makes implementation of a BGLGM in applications easier.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.