Spatial regression modeling via the R2D2 framework

Spatially dependent data arises in many applications, and Gaussian processes are a popular modelling choice for these scenarios. While Bayesian analyses of these problems have proven to be successful, selecting prior distributions for these complex models remains a difficult task. In this work, we propose a principled approach for setting prior distributions on model variance components by placing a prior distribution on a measure of model fit. In particular, we derive the distribution of the prior coefficient of determination. Placing a beta prior distribution on this measure induces a generalized beta prime prior distribution on the global variance of the linear predictor in the model. This method can also be thought of as shrinking the fit towards the intercept-only (null) model. We derive an efficient Gibbs sampler for the majority of the parameters and use Metropolis-Hasting updates for the others. Finally, the method is applied to a marine protection area data set. We estimate the effect of marine policies on biodiversity and conclude that no-take restrictions lead to a slight increase in biodiversity and that the majority of the variance in the linear predictor comes from the spatial effect.\vspace{12pt}


Introduction
Spatially-dependent data arise in many applications including ecology (e.g., Plant, 2018), public health (e.g., Reich and Haran, 2018), environmental exposure monitoring (e.g., Berrocal et al., 2020) and medical imaging (e.g., Masotti et al., 2021).For example, this work is motivated by an analysis of the effect of conservation efforts on aquatic biodiversity (Gill et al., 2017).Since biodiversity is driven by ecological processes that evolve over space and time, it is reasonable to expect dependence between nearby observations.Accounting for this spatial dependence is necessary to provide accurate inference of covariate effects (Hodges and Reich, 2010) and allow for prediction of biodiversity at unmeasured locations.
Bayesian inference is a popular approach for modeling spatial data.This is typically accomplished through a latent Gaussian process.Estimating the hyper-parameters of this covariance structure is notoriously difficult (e.g., Zhang, 2004) which makes selecting prior distributions paramount.The standard approach places a vague (large variance) inverse gamma prior distribution on the spatial variance and informative gamma prior distribution on spatial correlation parameters (Gelfand et al., 2010).As Berger et al. (2001) discuss, eliciting an informative prior distribution is challenging as the spatial correlation parameters can be difficult to interpret.To provide an automatic approach, the authors derive reference priors for the covariance parameters using an objective Bayesian paradigm which minimizes the prior information from an information theory perspective (Berger, 2006).The Bayesian hierarchical spatial model is quite intricate so the resulting prior distribution has a complicated form.To provide a prior construction that can more easily incorporate prior knowledge while still being weakly informative, Fuglstad et al. (2019) adapt the penalized complexity prior framework of Simpson et al. (2017) to construct prior distributions for the hyper-parameters of a Matérn covariance structure.This prior distribution shrinks the spatial component of the model towards a base model, i.e., one with no spatial dependence, and the resulting prior distribution is much faster and easier to compute than that of Berger et al. (2001).
In this work, we propose a principled prior framework for Gaussian process spatial models by leveraging a Bayesian coefficient of determination, R 2 n , (Gelman et al., 2019) and the R2D2 prior framework (Zhang et al., 2022).This extends Zhang et al. (2022), Yanchenko et al. (2021) and Aguilar and Bürkner (2022) to spatial models.We show that a beta prior distribution on R 2 n is (approximately) equivalent to a conditional generalized beta prime distribution on the linear predictor variance, which includes the marginal spatial variance.This derivation conditions on other spatial parameters which allows the proposed method to accommodate virtually any correlation structure, e.g., non-stationarity.We also derive an efficient MCMC sampler which consists almost entirely of Gibbs sampling steps.The resulting prior specification provides both an interpretable method to select a subjective prior, and a default approach in the absence of prior information.Indeed, the Bayesian coefficient of determination measures the proportion of variance explained by the model so prior domain knowledge can be intuitively incorporated in the model through R 2 n as opposed to choosing prior distributions for individual spatial parameters.For example, it is likely easier for a practitioner to have a prior belief that the model will capture a certain proportion of the variation in the data than to a priori estimate the spatial range parameter.On the other hand, a uniform prior distribution on R 2 n or one with large prior mass near zero yields an automatic approach.Lastly, we apply the proposed method to a marine protection area data set to study the effects of marine policies and find that certain fishing restrictions lead to a small increase in aquatic biodiversity.
The proposed method relates to the existing methods in several ways.First, each is weakly informative in some sense: Berger et al. (2001) in terms of information theory, Fuglstad et al. (2019) in terms of the Kullback-Leibler divergence between the base and fitted model, and the proposed method (can be) in terms of R 2 n .Similar to Fuglstad et al. (2019), the proposed method allows for intuitive incorporation of prior domain knowledge, facilitates fast and easy computation and shrinks the fit towards a base model.Specifically, a prior distribution with large mass near R 2 n = 0 is akin to shrinking or penalizing towards the intercept-only (baseline) model.The difference is that Fuglstad et al. (2019) treats the model without spatial dependence as the baseline model.In contrast with the previous two methods, our approach chooses the prior distribution for the global variance which induces a prior distribution on the other spatial parameters whereas Berger et al. (2001) and Fuglstad et al. (2019) directly set the prior distribution on the spatial hyper-parameters and marginal variance.
The layout for the remainder of the paper is as follows.In Section 2 we present the modeling framework.Section 3 derives the prior distribution and discusses its properties.Computation is the topic of Section 4 and we apply the method to a marine protection area data set in Section 5.
We close with a discussion in Section 6.

Spatial Gaussian process model
We introduce the method for a Gaussian spatial regression model with fixed effects.For observation i ∈ {1, . . ., n}, let Y i be the response, s i ∈ R 2 be the spatial location and X i = (X i1 , . . ., X ip ) be a corresponding vector of explanatory variables.We also define the n × p matrix The standard spatial regression model is where β 0 is the intercept, β = (β 1 , . . ., β p ) T is a vector of fixed effects, θ = (θ 1 , . . ., θ n ) T are the spatial random effects and ε i iid as the linear predictor for observation i.

Modeling framework
The covariance of the fixed and random effects are parameterized in terms of an overall variance parameter W > 0, spatial correlation parameters ψ and proportions φ = (φ 1 , . . ., φ p+1 ) with φ j > 0 and p+1 j=1 φ j = 1.The elements of φ apportion variance between the fixed and spatial random effects.Specifically, where 0 m is a vector of zeros of length m, Φ is a p × p diagonal matrix with diagonal elements {φ 1 , . . ., φ p }, Σ ψ is the n × n spatial correlation matrix (for notational convenience we will often denote the spatial correlation matrix simply as Σ), and β ⊥ θ.We stress that the prior construction is conditional on the spatial correlation matrix Σ; therefore the framework holds for virtually any correlation function, e.g., Matérn, anisotropic, etc.
The ensuing derivations treat the explanatory variables X as fixed.It is common practice to standardize X such that its columns have mean zero and variance one, although the results still hold even if this is not the case.We recommend standardizing X so that σ 2 W is the average marginal global variance of the linear predictor η i (see Supplemental Materials).Indeed, W is then the average signal-to-noise ratio, i.e.,

Dirichlet decomposition
The parameters φ determine the relative variance of the model apportioned to each fixed effect and the spatial effect.The proportion of variance allocated to the spatial effect is φ p+1 and 1 − φ p+1 = p j=1 φ j is the proportion of variance allocated to the fixed effects.We could consider these parameters as fixed or give them a prior distribution, e.g., φ ∼ Dirichlet(ξ 1 , . . ., ξ p+1 ).We often take ξ 1 = • • • = ξ p+1 ≡ ξ 0 as default choice of these hyper-parameters.The concentration parameter ξ 0 > 0 controls the prior variation with large ξ 0 encouraging all the variance components to be roughly equal to 1/(p + 1) and small ξ 0 encouraging prior uncertainty in the variance components.Another default choice is 2p and ξ p+1 = 1 2 if it is expected that the fixed and spatial effects contribute equally to the global variance.Lastly, if we believe that the variance is the same for each fixed effect, then we can define a new parameter, φ, such that φ = ( φ1 , φ2 ) ∼ Dirichlet(ξ 0 , ξ 0 ).Then we set φ j = φ1 /p for j ∈ {1, . . ., p} and φ p+1 = φ2 to enforce equal variance for all fixed effects.We suggest this final parameterization as a default choice due to its simplicity.

Spatial R2D2 prior
The goal of this work is to chose prior distributions that induce a desirable distribution on the Bayesian coefficient of determination, R2 n , from Gelman et al. (2019) defined below in Section 3.1.To achieve this, we specify a prior distribution for W given φ and ψ that induces a Beta(a, b) on R 2 n .Since the prior for R 2 n is Beta(a, b) for all φ and ψ, the marginal distribution of R 2 n over φ and ψ is also Beta(a, b).In this sense, the prior for R 2 n described below is a function of the joint prior distribution of W , φ and ψ.

Bayesian coefficient of determination
If we define η i as the signal and ε i as the error, then R 2 n (Gelman et al., 2019) is where is the sample variance of the signal and η = n i=1 η i /n is the sample mean. 1 The interpretation of v n is the variation of the expectation of future data, conditioned on the fixed and spatial effects as well as the explanatory variables and spatial locations.Since v n is the variance of the modeled predictive means, and the predicted means depend on model parameters, then v n is conditional on the distribution of the fixed effects.Therefore, R 2 n measures model complexity because a more complex model can explain more variation in future data than a simpler model.For example, with W = 0 implying β = 0 p and θ = 0 n , then η i = β 0 is the intercept-only model and v n = R 2 n = 0. Thus, a prior for W with mass near zero corresponds to a prior for R 2 n with mass near zero and shrinks the prior to the simple intercept-only model.

Prior derivation
While Gelman et al. (2019) proposed R 2 n as an a posteriori measure of model fit, we use R 2 n to determine the prior distribution of covariance parameters.We view X and s as fixed and thus η i is a random function of β and θ, whose distribution depends on W , ψ and φ in (2).In this section, we specify a prior for W given ψ and φ so that averaging over the joint prior distributions β, θ and W gives R 2 n ∼ Beta(a, b).In order for R 2 n ∼ Beta(a, b), the variance v n must have generalized beta prime (GBP) distribution (Johnson et al., 1995;Yanchenko et al., 2021), i.e., v n ∼ GBP(a, b, 1, σ 2 ).The GBP distribution can be obtained via a transformation of a beta random variable: if U ∼ Beta(a, b), To specify the prior we write where P = (I n − 1 n 1 n 1 T n )/(n − 1), I n is the n × n identity matrix and 1 n is the vector of ones of length n.Since the fixed and spatial effects are assumed independent and we condition on X, Xβ + θ follows a normal distribution with mean zero and covariance σ 2 W (XΦX T + φ p+1 Σ).Then the distribution of v n can be equivalently written in terms of Z = (σ 2 W ) −1/2 (Xβ + θ) where where To specify a prior distribution for W so that the product W S has a beta prime distribution requires an understanding of the distribution of S. The classical quadratic form results that yield a chi-squared distribution do not apply here because the covariance of Z is not idempotent.Thus, the true distribution is a weighted sum of χ 2 1 random variables.Instead of working with the exact distribution, we approximate it by a gamma distribution with the same mean and variance (e.g., Box, 1954).By properties of quadratic forms, So, S approx.
Since W S ∼ BP(a, b) and S approx.
∼ Gamma(α ψ,φ , β ψ,φ ), then the conditional distribution of W given φ and ψ is specified by the hierarchical model where This expression shows that the distribution of W is conditional on the spatial and variance allocation parameters, ψ and φ, respectively.This allows trivial extensions of the prior construction to different spatial variance models.The distribution of R 2 n , however, is unconditional on any model parameters because of this construction.
The distribution in ( 6) is not standard so we would like to write it in a more manageable form.
We begin with the following proposition.
where U 2 ∼ Gamma(a, γ −1 ).We utilize this form of the distribution for computational purposes, but we can further simplify it for better conceptual understanding.
Therefore, an equivalent expression of ( 6) and ( 7) is We can, thus, write out the full prior specification: completes the model.Since this prior specification was induced by a prior distribution on R 2 and the variance is apportioned to the fixed and spatial effects through the Dirichlet decomposition, we name this the spatial R2D2 prior as in Zhang et al. (2022) and Yanchenko et al. (2021).

Properties
We now explore different properties of the spatial R2D2 prior.First, the unconditional distribution (not depending on γ) of W is not analytic, but we can find its unconditional prior mean and variance.
While α ψ,φ and β ψ,φ are complex functions of the explanatory variables, spatial locations and spatial covariance matrix, a and b can be tuned to enforce certain properties on W .For example, the expectation of W increases as a/b increases (which also increases the prior mean of R 2 n ) and b > 2 is required for both finite and mean and variance.Conversely, if b is large then there is large prior mass near R 2 n = 0 and the prior mean of W is also small.If a < 1 and b ≤ 1, then W has infinite expectation (i.e., heavy tail) as well as a mode at 0, a desirable property of shrinkage priors (e.g.Carvalho et al., 2009;Bhattacharya et al., 2015;Zhang et al., 2022).
In Table 1 we report the prior mean and variance of W for the same settings as above but let ρ ∈ {0.2, 0.4, 0.6, 0.8, 1.0}.Again, these results are for a specific realization of covariates and sampling locations.To ensure finite mean and variance of W we take (a, b) = (4, 4).As the spatial correlation increases, both the mean and variance of W increase since the model cannot capture the same amount of spatial variation without W increasing.This highlights the interplay between ρ and W for determining the prior distribution.
Since the prior distribution of W is highly dependent on µ S and σ 2 S , we are interested in their behavior for several specific correlation structures without covariates, i.e., φ p+1 = 1.Recall that the prior for W in ( 6) is W S|φ, ψ ∼ BP(a, b) for scaling factor S that is (approximately) distributed as a gamma random variable with mean µ S and variance σ 2 S .This mean and variance depend on the spatial correlation matrix and therefore studying their forms can illustrate how spatial correlation affects W 's prior distribution.In the following examples, we consider three specific spatial correlation structures Σ and their effect on the prior distribution of W .  Example 1 -Compound symmetry First consider the compound symmetry model with correlation ρ between all observations.Then Therefore, if ρ < 1 then S → µ S in distribution and, in particular, if the observations are independent with ρ = 0 then S converges in distribution to one and W ∼ BP(a, b) as in Zhang et al. (2022).On the other hand, if ρ = 1 then S is degenerate at zero.In this extreme, the covariance is singular and restricts v n = 0 so that no prior for W can achieve the prior R 2 n ∼ Beta(a, b).For ρ near to but less than one, S has mean and variance near zero implying the prior scale of W must be large to compensate for the restrictions of v n induced by the correlation.
Example 2 -Blocked compound symmetry Next assume the n locations are partitioned into m blocks, each with n/m locations, and the correlation is ρ within block and zero between blocks. Then We recover the compound symmetry result for m = 1 and the mean increases as the number of blocks m increases and/or as the correlation decreases.As the number of blocks increases, the number of correlated observations decreases which causes the prior mean of W to decrease since the mean of W and S are inversely proportional.Put another way, a smaller spatial correlation does not require as large of a W to capture the dependency.
Example 3 -Matérn correlation For the general Matérn (or any other) spatial correlation model we have as n → ∞ where the expectation is with respect to the sampling distribution of the spatial locations1 .If the correlation function is convex (e.g., exponential), then where d is the average distance between points.Again, we see that µ S is inversely related to the strength of spatial correlation.
Since φ and ψ are not conjugate, they require a Metropolis-Hastings step to update.Let φ (t)   be the value of φ at step t of the sampler.Then the candidate value φ * is drawn from φ * ∼ Dirichlet(c 1 φ (t) ) for some hyper-parameter c 1 .Using a similar definition and assuming that the spatial correlation model is Matérn with fixed ν, the candidate value ρ * is drawn from log ρ * ∼ Normal(log ρ (t) , c 2 ) for hyper-parameter c 2 .Both c 1 and c 2 are tuned during the burn-in stage to ensure an acceptance rate between 20% and 50%.
Although the steps of the algorithm are straightforward to implement, they can be slow for large n.The distribution for W depends on α ψ,φ and β ψ,φ which are functions of φ and ψ.Since φ and ψ are updated each iteration of the sampler, so too must α ψ,φ and β ψ,φ be updated.But these terms depend on traces of matrix multiplication so this is a computationally expensive process.
This computational burden can be mitigated by the following observation.For some matrix A with eigenvalues {λ i } n i=1 , tr(A) = n i=1 λ i and tr(A 2 ) = n i=1 λ 2 i .Thus, we can replace the computations of the trace of squared matrices with the sum of eigenvalues.Since we only require the eigenvalues (and not the eigenvectors), this can be done in O(n 2 ) as compared to the O(n 3 ) needed for matrix multiplication.Additionally, these eigenvalues only need to be computed once per iteration since µ S ∝ tr(A) and σ 2 S ∝ tr(A 2 ) where A = P(XΦX T + φ p+1 Σ).In the Supplemental Materials, we conduct a small simulation study to compare the proposed R2D2 prior framework with a standard vague prior and the Penalized Complexity (PC) prior of Fuglstad et al. (2019).Under these settings, the R2D2 prior yielded the best estimates of the spatial marginal variance parameter, while all methods performed comparably for estimating the fixed effects and spatial range parameter.5 Marine protection area data analysis

Data and model
Marine Protection Areas (MPAs) have been established around the globe to preserve aquatic biodiversity.Gill et al. (2017) collected data to understand the effects of these policies on conservation efforts.The response variable, Y i , is the logarithm of the biodiversity at site s i where a larger value of the response means greater biodiversity, a goal of conservationists and scientists.For this analysis, we consider the observations around Australia, as seen in Figure 2. Australia is known for its vast and unique collection of species (e.g., Butler et al., 2010) as well as being home to two biodiversity hotspots on the east and southwest coasts.Biodiversity hotspots are geographical regions that are rich in species, particularly those that are endemic, rare and/or endangered (Myers, 1988;Reid, 1998).
The spatial locations of the n = 471 observations shifted and scaled to fit in the unit square.We also select p = 9 explanatory variables: multi-use (0) vs. no-take (1) regulation indicator; depth (m); wave exposure (kW/m); distance to shoreline (km); distance to provincial capital market (km); coastal population (within 100 km 2 ); minimum sea surface temperature (2002-2009, • C); Chlorophyll-a (2002-2009, mg/m 3 ); and reef area within 15 km.These explanatory variables are centered and scaled to ensure each column has mean zero and variance one.Of primary interest is the indicator variable for whether the sampled location was under a multi-use (MU) or no-take (NT) restriction.MU regions have restrictions on fishing practices but still allow for some fishing whereas NT zones have a total ban on fishing.One question is whether there is a significant difference in biodiversity between the MU and NT zones after accounting for the other covariates.
In addition to these covariates and the spatial random effect, we also include a random intercept for the MPA, which is straightforward to incorporate into our prior framework.The model is then where Z i = 1 if site i corresponds with MPA region and 0 otherwise for ∈ {1, . . ., L = 37}.
For comparison, we also consider a vague prior and the penalized complexity (PC) prior (Fuglstad et al., 2019).For the vague prior, we take where we fix σ 2 β = 100 and the rest of the parameters have the same prior distribution as in the spatial R2D2 prior.From Fuglstad et al. (2019), the PC prior is where we set α = 0.05, σ 0 = 10 and ρ 0 = 0.01 as per Fuglstad et al. (2019).Note that these hyperparameter choices ensure that P(ρ < ρ 0 ) = α and P(σ θ > σ 0 ) = α.Again, all other parameters have the same prior distributions as above.For all models, we take 10,000 MCMC samples as burn-in, 100,000 to monitor convergence and thin the results by saving every fifth sample.To allow for a fair comparison, we coded all methods by hand in R. The code for the R2D2 prior is available at: https://github.com/eyanchenko/r2d2space.

Results
We report the posterior median and 95% credible interval for several parameters of interest in Table 9 as well as plot the posterior distribution of R 2 n and W in Figure 3.In the Supplemental Materials, we also report representative sample of trace plots in addition to results on computation time and number of effective samples for each MCMC chain.
First, we notice that each spatial R2D2 prior yields a posterior median of R 2 n around 0.42.This means that approximately 42% of the variation in the response can be explained by variation in the linear predictor.The posterior median of R 2 n is larger for the (a, b) = (4, 1) prior and smaller for the for the vague and PC priors, however, are slightly larger than that of the spatial R2D2 priors.This is because these priors induce a prior distribution on R 2 n that is effectively a point mass at one, due to the large prior variance of the fixed effects.n ∼ Beta(a, b) for the MPA fisheries data.Note that β 1 is the effect of NT relative to MU; σ 2 θ is the variance of the spatial term, θ (φ 3 W for the R2D2 priors); ρ is the spatial range effect; and φ 1 , φ 2 and φ 3 are the proportion of variance allocated to the fixed, random and spatial effects, respectively.
Next, we consider the posterior distribution of β 1 , the parameter that quantifies the effect of the NT and MU zones.All models yield a posterior median of β 1 greater than zero indicating that there is greater biodiversity in the NT zones than in the MU zones.All of the credible intervals contain zero, however, so this finding is not statistically significant.These conclusions are similar to those found in Cui et al. (2022).We can also compute the posterior probability of β 1 being positive as a measure of significance for this explanatory variable.We find that P(β 1 > 0|Y) is 0.94 for the Vague and PC priors compared to 0.88, 0.88, 0.87 and 0.89 for the R2D2 prior with (a, b) equal (1,1), ( 1 2 , 1 2 ), (1,4) and (4,1), respectively.This provides moderately strong evidence in favor of a statistically significant effect with the vague and PC priors providing the strongest evidence.Further investigation is needed to understand the practical significance of these zones.
We also report the posterior median and 95% credible intervals for the other fixed effects in the Supplemental Materials.For example, sea temperature and measurement depth seem to have the greatest effect on biodiversity whereas population and Chlorophyll-a concentration have very little impact.In many cases, the posterior medians are larger in magnitude for the vague and PC priors because the R2D2 priors are shrinking the fixed effects towards the base model while the prior variance is much larger in the other two models.
The prior distribution of R 2 n has only a small effect on the posterior distribution of φ since these distributions are quite similar across different combinations of (a, b).This is sensible because the R 2 n metric is most directly related to the linear predictor, which these parameters have minimal effect on.In particular, the spatial R2D2 prior with (a, b) = (1, 4) yields a posterior median of φ = (0.17, 0.19, 0.61).Thus, the spatial random effect accounts for approximately 61% of the variation in the linear predictor, whereas the fixed effects and MPA region account for about 20% each.The results are similar for the other R2D2 priors.
The difference among the prior frameworks is most pronounced for the spatial range and variance parameters.For the spatial marginal variance σ 2 θ (φ 3 W for the R2D2 priors), the Vague prior has the largest posterior median, followed by the PC prior, while the R2D2 priors have the smallest.
The Vague prior does not penalize this parameter which leads to the largest values.Both the PC and R2D2 priors, however, explicitly shrink this parameter towards zero (via W for R2D2), leading to the smaller posterior estimates.From this example, the R2D2 prior corresponds to greater shrinkage.While the R2D2 priors all yield similar results for ρ, the PC and vague priors have a posterior median that is approximately three times larger than that of the R2D2 prior.We expect the PC prior to have a larger estimate for ρ because this framework explicitly forces the estimate towards a base model of ρ → ∞.The large estimate for the Vague prior could be owed to the strong correlation (0.68) between the posterior distributions of σ 2 θ and ρ.Thus, the large values of σ 2 θ may also be increasing the estimates of ρ.The corresponding correlation for the R2D2 (1, 1) prior, on the other hand, is only 0.29.Finally, in Figure 4, we plot the posterior median of θ i at each location s i .All spatial R2D2 priors gave similar results so we only plot the results for (a, b) = (1, 1) and (1, 4).From the figure, spatial correlation between the fixed effects, we let β|σ 2 , W, φ, where Σ β is a spatial correlation matrix.Then α ψ,φ and β ψ,φ are computed using µ S = tr(A) and Finally, there are several interesting avenues of future study for the R2D2 prior framework including areal spatial data, non-Gaussian responses, times series and longitudinal data.In the Supplemental Materials, we outline how this procedure may work for non-Gaussian responses.
Additionally, throughout this work we have assumed that R 2 n has a beta distribution a priori.Another interesting extension would be to consider other prior distributions on R 2 n such as the Kumaraswamy distribution (Jones, 2009).

Extension to non-Gaussian response
We include a brief note on how the R2D2 spatial prior framework may work for non-Gaussian response.As an example of non-Gaussian data, consider a spatial counts regression model where .
where P = (I n − 1 n 1 n 1 T n )/(n − 1).Now, µ has a multivariate Log Normal distribution so we cannot use the same techniques from this manuscript.One approach is using simulation to approximately match the prior distribution of R 2 n to the desired Beta(a, b).For example, we may assume some prior distribution on (σ 2 , φ, ψ) ∼ π(•) and that W ∼ GBP(α 1 , α 2 , 1, α 3 ).We draw from these prior distributions, and find the empirical distribution of R 2 n .Then we find the values of (α 1 , α 2 , α 3 ) which minimize the KL divergence between the simulated empirical distribution of R 2 n and a desired Beta(a, b) distribution.This should then yield (approximately) the correct distribution.We may also be able to draw on ideas from Stochastic Variational Inference (Hoffman et al., 2013).

Simulation study
We conduct a brief simulation study to compare the proposed R2D2 prior framework with that of a vague/uninformative prior and the penalized complexity (PC) prior (Fuglstad et al., 2019).

Data generation model
We generate the response, Y i , from a normal likelihood: where β ∈ R p are the fixed effects, θ i is the spatial effect of response i and ε i iid ∼ Normal(0, σ 2 ) for i = 1, . . ., n.We generate the fixed effects with AR(1) auto-correlation i.e., where Σ X has an AR(1) auto-correlation, i.e., Cor(X ij , X ik ) = r |j−k| for j = k and r ∈ (0, 1).
Additionally, we generate the true value of the fixed effects from β j ∼ Normal(0, σ 2 σ 2 β ) for j = 1, . . ., p. Next, θ ∼ Normal(0 n , σ 2 σ 2 θ Σ) where Σ has an exponential correlation structure, i.e., where d ij = ||s i − s j || is the distance between the sampling locations of observations i and j, and ρ is the spatial range parameter.We generate sampling locations, s i , randomly within the unit square.
For the vague prior, we take where we fix σ 2 β = 100 and the rest of the parameters have the same prior distribution as in the when n = 100, the Vague and R2D2 priors yield the lowest MSEs.When n = 200, however, the R2D2 prior slightly outperforms both Vague and PC.Each method yields slightly lower MSEs when n is larger.Finally, MSE and coverage are comparable across methods for estimating R 2 n .

Discussion
We close the simulation study section with a brief discussion.First, each method yields comparable results for estimating β.Since each of the prior frameworks focuses on the spatial variance and range parameters, the similar results for estimating the fixed effects is to be expected.The main focus of the simulation study, however, is comparing the estimates of the spatial parameters.For σ 2 θ , the Vague prior distribution has non-trivial mass at large values, which leads to overestimating σ 2 θ and thus high MSE.The PC and R2D2 priors, on the other hand, explicitly shrink the estimates of σ 2 θ which generally leads to lower MSE values.It is evident that the R2D2 prior has greater shrinkage as it has significantly lower MSE values than both Vague and PC.Across all setting, the Vague and R2D2 priors yield comparable MSE values and both outperform PC when estimating ρ.Since the PC prior forces the estimate of ρ → ∞, this could be leading to a large, positive bias and thus, larger MSE.The R2D2 construction only indirectly affects the prior distribution of ρ (since W 's prior distribution is conditional on ρ), so the similar estimation to the Vague prior is reasonable.Additionally, Zhang (2004) proved that neither σ 2 θ nor ρ are consistently estimable under the in-fill asymptotic setting.This explains why the MSE values for σ 2 θ do not decrease with increasing n and while there is only a small decrease for ρ.Finally, each method yields similar estimates of R 2 n .Since this is largely driven by estimates of the fixed effects β and response variance σ 2 , and these are similar across models, this is a sensible result.

MPA data analysis 10.1 Sensitivity analysis
We re-run the real-data analysis for some other hyper-parameter combinations to show that the results are insensitive to these choices.We used 110,000 MCMC samples with the first 10,000 as burn-in and thinned the chain by keeping every fifth sample.The results are in Table 9.There is small change in parameter estimates between the different settings despite large differences in the prior distributions as the prior for ρ is on the log-scale.

Trace plots for MPA data
See Figure 5 for trace plots of various parameters of the MPA data set using the R 2 n ∼ Beta(1, 1) prior.Other priors led to similar convergence.
, b, c, d) and has density function π(x; a, b, c, d) a, b, c, d > 0 where B(•, •) is the beta function.The GBP reduces to the beta prime distribution

Figure 1 :
Figure 1: Prior distribution of W for different combinations of (a, b).

Figure 2 :
Figure 2: Logarithm of the biodiversity for locations around Australia.

Figure 3 :
Figure 3: Posterior distribution of R 2 n and W for MPA fisheries data using R2D2 prior with R 2 n ∼ Beta(a, b).
(a, b) = (1, 4) prior which reflects their prior means.Interestingly, even though the (a, b) = (1, 1) and (a, b) = ( 1 2 , 1 2 ) priors have quite different prior shapes, they yield almost identical posterior R 2 n shapes, likely because they have the same prior mean R 2 n .As expected, these trends are similar for the posterior distribution of W because R 2 n and W are positively associated.The posterior of R 2 n

Figure 4 :
Figure 4: Posterior median of latent spatial parameter θ for various prior distributions.

Figure 5 :
Figure 5: Trace plots for various parameters of the MPA data set using the R 2 n ∼ Beta(1, 1) prior.

Table 1 :
Values of parameters α ψ,φ and β ψ,φ and summaries of prior distribution of W for (a, b) = (4, 4) and different values of the spatial range parameter ρ.

Table 2 :
Posterior median and 95% credible intervals for vague, penalized complexity and R2D2 prior with R 2

Table 3 :
Simulation study results for n = 100 and ρ = 0.1.MSE is root mean squared error loss with posterior median and Cov is frequentist coverage of 95% posterior credible interval.Standard errors are in parentheses.The best values for each column are in bold.

Table 4 :
Simulation study results for n = 100 and ρ = 0.2.MSE is root mean squared error loss with posterior median and Cov is frequentist coverage of 95% posterior credible interval.Standard errors are in parentheses.The best values for each column are in bold.

Table 5 :
Simulation study results for n = 100 and ρ = 0.3.MSE is root mean squared error loss with posterior median and Cov is frequentist coverage of 95% posterior credible interval.Standard errors are in parentheses.The best values for each column are in bold.

Table 7 :
Simulation study results for n = 200 and ρ = 0.2.MSE is root mean squared error loss with posterior median and Cov is frequentist coverage of 95% posterior credible interval.Standard errors are in parentheses.The best values for each column are in bold.