Spatial+: A novel approach to spatial confounding

Abstract In spatial regression models, collinearity between covariates and spatial effects can lead to significant bias in effect estimates. This problem, known as spatial confounding, is encountered when modeling forestry data to assess the effect of temperature on tree health. Reliable inference is difficult as results depend on whether or not spatial effects are included in the model. We propose a novel approach, spatial+, for dealing with spatial confounding when the covariate of interest is spatially dependent but not fully determined by spatial location. Using a thin plate spline model formulation we see that, in this case, the bias in covariate effect estimates is a direct result of spatial smoothing. Spatial+ reduces the sensitivity of the estimates to smoothing by replacing the covariates by their residuals after spatial dependence has been regressed away. Through asymptotic analysis we show that spatial+ avoids the bias problems of the spatial model. This is also demonstrated in a simulation study. Spatial+ is straightforward to implement using existing software and, as the response variable is the same as that of the spatial model, standard model selection criteria can be used for comparisons. A major advantage of the method is that it extends to models with non-Gaussian response distributions. Finally, while our results are derived in a thin plate spline setting, the spatial+ methodology transfers easily to other spatial model formulations.
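As an illustration of the methodology described in the abstract, the following sketch (a hypothetical sinusoidal basis standing in for thin plate splines, with arbitrary smoothing parameters; not the paper's implementation) contrasts the spatial model estimate with the spatial+ estimate on simulated confounded data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.linspace(0.0, 1.0, n)

# Spatially dependent covariate: x shares the smooth pattern of the confounder f
f = np.sin(2 * np.pi * t)                       # unmeasured smooth spatial effect
x = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(n)
beta = 2.0
y = beta * x + f + 0.1 * rng.standard_normal(n)

# Toy spatial basis (stand-in for a thin plate spline basis)
B = np.column_stack([np.ones(n), t,
                     np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
                     np.sin(4 * np.pi * t), np.cos(4 * np.pi * t)])

def penalized_coefs(X, y, pen_cols, lam):
    """Least squares with a ridge penalty lam on the columns in pen_cols."""
    P = np.zeros((X.shape[1], X.shape[1]))
    P[pen_cols, pen_cols] = lam
    return np.linalg.solve(X.T @ X + P, X.T @ y)

lam = 2.0  # illustrative smoothing parameter

# Spatial model: y ~ x + smooth, penalty on the spatial basis only
X_sp = np.column_stack([x, B])
beta_spatial = penalized_coefs(X_sp, y, np.arange(1, 7), lam)[0]

# Spatial+: regress x on the spatial basis, keep the residuals r_x, then refit
r_x = x - B @ penalized_coefs(B, x, np.arange(6), lam)
X_plus = np.column_stack([r_x, B])
beta_plus = penalized_coefs(X_plus, y, np.arange(1, 7), lam)[0]

print(beta_spatial, beta_plus)  # spatial+ should lie closer to the true beta = 2
```

Because the smoothing penalty shrinks the spatial term, the spatial model's covariate estimate absorbs part of the confounder f; residualizing x first removes its smooth component, so the spatial+ estimate is driven by the non-spatial variation in x.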


Lemma 1.3 is proved by Utreras (1988). Lemmas 1.4 and 1.5 prove a number of asymptotic results that are convenient for later proofs. Lemma 1.4 shows how the results used by Rice (1986) for the analysis in dimension d = 1 generalize to dimensions d ≥ 1, while Lemma 1.5 generalizes Lemma 3 of Chen and Shiau (1991) to dimensions d ≥ 1. Proofs of Lemmas 1.2, 1.4 and 1.5 are given in Web Appendix B.

Lemma 1.3. For any g ∈ H^m(Ω), let g = (g(t_1), ..., g(t_n))^T. The averaged squared bias B²_tp(g, λ) of the thin plate spline S_λ g (i.e. the fitted values in a model of the form (1) in our paper with β = 0) is given by

B²_tp(g, λ) = (1/n) Σ_{k=1}^n (λμ_k c_k / (1 + λμ_k))², where c = Φ^T g.

Proof. See Utreras (1988), Lemma 2.2.

Web Appendix B: Proofs of technical lemmas
In this appendix we prove the lemmas set out in Web Appendix A. We start by introducing some notation. Recall the assumption from our paper that x = f_x + ε_x, which means that the covariate x is correlated with the smooth f in the spatial model. Therefore, x decomposes as x = f_x + ε_x with f_x = (f_x(t_1), ..., f_x(t_n))^T and ε_x = (ε_x,1, ..., ε_x,n)^T. For the asymptotic analysis, it is often convenient to consider the behaviour of the components in this decomposition separately. Let c_x = (c_x,1, ..., c_x,n)^T and ξ_x = (ξ_x,1, ..., ξ_x,n)^T denote the coefficients of f_x and ε_x, respectively, in the basis Φ introduced in Web Appendix A, i.e. c_x = Φ^T f_x and ξ_x = Φ^T ε_x.
Note that since f_x ∈ H^m(Ω) is bounded, we have that (1/n) Σ_{k=1}^n (c_x,k)² = O(1). (2) As in Rice (1986) and Chen and Shiau (1991), we also note that assumptions (A2) and (A3) hold for the coefficients ξ_x of the iid noise ε_x.

Proof of Lemma 1.4
To prove (a), we use the decomposition x = f_x + ε_x from (1) and the corresponding basis expansions in the basis Φ. By the Cauchy–Schwarz inequality, the cross term 2 c_x,k ξ_x,k never dominates the rate of convergence, so we only need to consider the parts of the sum relating to the other two terms. Using Cauchy–Schwarz again, the term involving (c_x,k)² is bounded using Lemma 1.3 and (2). For the term involving (ξ_x,k)², the required bound holds by assumption (A3) and Lemma 1.2. Hence, by assumption (A2), (a) is proved.
For (b), by Lemma 1.3 we have a bound in terms of the factors λμ_k/(1 + λμ_k). For a > 0 we have 1/(1 + a) ≤ 1 and a/(1 + a) ≤ 1. Using this with a = λμ_k, we see from assumption (A3) and Lemma 1.2 that the remaining term converges at the required rate.

So by assumption (A2), (b) is proved.
For (c), let c = Φ^T f be the coefficients of f in the basis Φ. For the term involving c_x,k, we use Cauchy–Schwarz and (2) together with Lemma 1.3. For the term involving ξ_x,k, we use Cauchy–Schwarz again, together with assumption (A3) and Lemma 1.2 (since λμ_k/(1 + λμ_k)² ≤ 1/(1 + λμ_k)). The rate of convergence of o(n^{-1/2}) follows from the fact that n^{-1/2}(log n) λ^{-d/4m+1/2} ≈ n^{-1/2}(log n) n^{-δ(1-d/2m)/2} = o(n^{-1/2}) since 1 − d/2m > 0. This proves (c).

For (d), the term involving (c_x,k)² is bounded by (2), and the term involving (ξ_x,k)² is bounded by assumption (A3) and Lemma 1.2.

Proof of Lemma 1.5

As in the proof of Lemma 1.4, we expand in the basis Φ and, once again by Cauchy–Schwarz, we only need to consider the terms involving (c_x,k)² and (ξ_x,k)². For (a), since λμ_k/(1 + λμ_k) ≤ 1, the term involving (c_x,k)² is bounded by Lemma 1.3. For the term involving (ξ_x,k)², firstly note that for a_1, a_2, a_3 > 0 the relevant product of factors is bounded, where in the last step we have used the fact that 1/(1 + a_i) ≤ 1 and a_i/(1 + a_i) ≤ 1 for all i. Using this with a_1 = a_2 = λ_x μ_k and a_3 = λμ_k, the bound follows by assumption (A3) and Lemma 1.2. Therefore, by assumption (A2), the required rate holds. This shows (a).

For (b), the term involving (c_x,k)² is bounded by the same argument as in (a). The analogous elementary inequality holds for a_1, a_2, a_3, a_4 > 0, and using it with a_1 = a_2 = λ_x μ_k and a_3 = a_4 = λμ_k shows that the term involving (ξ_x,k)² converges at the same rate as in (a). This proves (b).
For (c), let c = Φ^T f be the coefficients of f in the basis Φ. For the term involving c_x,k, we use Cauchy–Schwarz and proceed as in the proof of Lemma 1.4 (c). This proves (c). For (d), the term involving (c_x,k)² is bounded by Cauchy–Schwarz and Lemma 1.3, and the term involving (ξ_x,k)² is bounded using (A3) and Lemma 1.2.

This proves (d).

For (e), the term involving (c_x,k)² is bounded as before, and the term involving (ξ_x,k)² is bounded by assumption (A3) and Lemma 1.2. This proves (e).

For (f), we write the expression as a sum of three terms, as in (3). The first term in (3) is dealt with directly; for the second term in (3), the bound follows by Lemma 1.4 (a); and from (e), the third term in (3) converges at the required rate. This proves (f).

Web Appendix C: Main asymptotic results
This appendix details the main results of our asymptotic analysis referred to in Section 3 of the paper.

Asymptotic results for the spatial model
In the model (1) of the paper, spatial correlation is modeled through smoothing of the term f. Without the smoothing penalty, the model is an ordinary linear model in which all effect estimates are unbiased. Therefore, bias in the covariate effect estimate arises as a direct result of smoothing. Rice (1986) showed for dimension d = 1 that, while this bias is asymptotically 0 as n → ∞, the rate of convergence may be slow. More specifically, we cannot ensure that the bias converges faster than the standard deviation if the smoothing parameter λ converges at the optimal rate (minimizing the AMSE of the estimated spatial effect). Therefore, the bias can in practice become disproportionately large. Here, we generalize Rice's results and see that the problem of potentially excessive bias in β̂ persists in models where the spatial domain has dimension d ≥ 1. As an aside, we note that, as in the d = 1 case, the rate of convergence of the variance of β̂ is the same as that in a model with no smoothing penalty.
Theorem 3.1. Suppose λ ≈ n^{-δ} for some 0 < δ < 1, f, f_x ∈ H^m(Ω) are bounded and m ≥ d. Then for the partial thin plate spline estimate of β we have that (a) E(β̂) − β = o(n^{-1/2}) + O(λ^{1/2}) and (b) Var(β̂) = O(n^{-1}) as n → ∞. In particular, we need λ = o(n^{-1}) to ensure that the bias converges faster than the standard deviation of β̂.

Proof. The bias statement follows by Lemma 1.4 (a) and (c). Similarly, since Var(y) = σ²I, (3) in the paper shows the variance statement by Lemma 1.4 (a) and (b).
Theorem 3.2. Suppose λ ≈ n^{-δ} for some 0 < δ < 1, f, f_x ∈ H^m(Ω) are bounded and m ≥ d. Then the average squared bias B²(f, λ) and average variance V(f, λ) of the partial thin plate spline estimate of f satisfy (a) B²(f, λ) = O(λ) and (b) V(f, λ) = O(n^{-1} λ^{-d/2m}). In particular, the optimal rate for λ in terms of minimizing AMSE(f̂) is λ = O(n^{-2m/(2m+d)}), and when λ converges at this optimal rate, AMSE(f̂) = O(n^{-2m/(2m+d)}).
Proof. Since f̂ = S_λ(y − β̂x) by (3) in the paper, the bias of f̂ decomposes into a spline smoothing bias and a term involving the bias of β̂. We therefore obtain the stated bound by Theorem 1(a), Lemma 1.4 (d) and Lemma 1.3. This proves part (a).
For (b), firstly note that the variance of f̂ decomposes analogously. We therefore obtain the stated bound by Lemma 1.2, Theorem 1(b) and Lemma 1.4 (d). This proves part (b).
Finally, recall that AMSE(f̂) = B²(f, λ) + V(f, λ). From the above, we see that the bias term increases with λ while the variance term decreases with λ, so that the optimal rate for minimizing AMSE(f̂) is achieved when O(λ) = O(n^{-1} λ^{-d/2m}). This leads to an optimal rate of λ = O(n^{-2m/(2m+d)}). At this rate for λ, B²(f, λ) and V(f, λ) converge at the same rate of O(n^{-2m/(2m+d)}).
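The balancing step in this argument can be written out explicitly, using the rates O(λ) for the squared bias and O(n^{-1} λ^{-d/2m}) for the variance:

```latex
\lambda \asymp n^{-1}\lambda^{-d/2m}
\;\Longleftrightarrow\;
\lambda^{1 + d/2m} \asymp n^{-1}
\;\Longleftrightarrow\;
\lambda \asymp n^{-\frac{2m}{2m+d}},
```

and at this rate both terms, and hence AMSE(f̂), are of order n^{-2m/(2m+d)}.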
We have therefore proved the following result, which shows that we cannot avoid the potential for excessive bias in β̂ unless λ converges to 0 faster than the optimal rate of convergence, i.e. unless the smooth term is undersmoothed.
Corollary 3.3. Suppose λ ≈ n^{-δ} for some 0 < δ < 1, f, f_x ∈ H^m(Ω) are bounded and m ≥ d. The optimal rate of convergence for λ in terms of minimizing AMSE(f̂) is slower than the required rate of o(n^{-1}) for ensuring that the bias of β̂ converges faster than the standard deviation of the estimate.

Asymptotic results for the spatial+ model
In dimension d = 1, Chen and Shiau (1991) show that for the model (5) of the paper, the problems identified by Rice disappear. That is, when the parameters λ and λ_x converge at the optimal rate (for minimizing the AMSE of the estimated spatial effect), the bias of the covariate effect estimate β̂⁺ converges to 0 faster than the standard deviation and, therefore, does not become disproportionately large. We now generalize these results to dimensions d ≥ 1.
Theorem 3.4. Suppose λ ≈ n^{-δ}, λ_x ≈ n^{-δ_x} for some 0 < δ, δ_x < 1, f, f_x ∈ H^m(Ω) are bounded and m ≥ d. Then for the spatial+ estimate of β we have that (a) E(β̂⁺) − β = o(n^{-1/2}) + O((λλ_x)^{1/2}) and (b) Var(β̂⁺) = O(n^{-1}) as n → ∞. In particular, we need λλ_x = o(n^{-1}) to ensure that the bias converges faster than the standard deviation of β̂⁺.
Finally, recall that AMSE(f̂⁺) = B²_+(f, λ, λ_x) + V_+(f, λ, λ_x). From the above we see that the bias term increases with λ while the variance term decreases with λ, so that the optimal rate for minimizing AMSE(f̂⁺) is achieved when O(λ) = O(n^{-1} λ^{-d/2m}). This leads to an optimal rate of λ = O(n^{-2m/(2m+d)}). Since we have assumed that the convergence rates for B²_+(f, λ, λ_x) and V_+(f, λ, λ_x) are equal, the optimal rate for λ_x is then obtained when the corresponding terms in λ_x balance in the same way, giving λ_x = O(n^{-2m/(2m+d)}).

From this we obtain the following result which shows that, unlike β̂, the estimate β̂⁺ does not need undersmoothing to avoid excessive bias.

Corollary 3.6. Suppose λ ≈ n^{-δ}, λ_x ≈ n^{-δ_x} for some 0 < δ, δ_x < 1, f, f_x ∈ H^m(Ω) are bounded and m ≥ d. If λ and λ_x converge at the optimal rates in terms of minimizing AMSE(f̂⁺), then λλ_x = o(n^{-1}). In particular, the optimal rates for λ and λ_x ensure that the bias of the spatial+ estimate β̂⁺ converges faster than the standard deviation of the estimate.
Proof. Theorem 3(b) shows that we need E(β̂⁺) − β = o(n^{-1/2}) to ensure that the bias converges faster than the standard deviation. Part (a) of the same theorem shows that this required rate can be achieved if λλ_x = o(n^{-1}). Suppose λ and λ_x converge at their optimal rates from Theorem 4. Then λλ_x ≈ n^{-4m/(2m+d)}, and since 4m/(2m + d) > 1 whenever 2m > d (which holds as m ≥ d ≥ 1), we have λλ_x = o(n^{-1}). This proves the result.
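Assuming, as in the d = 1 analysis of Chen and Shiau (1991), that the optimal rates for λ and λ_x are both n^{-2m/(2m+d)}, the key inequality can be checked directly:

```latex
\lambda\lambda_x \asymp n^{-\frac{4m}{2m+d}}, \qquad
\frac{4m}{2m+d} > 1 \iff 4m > 2m + d \iff 2m > d,
```

which holds under the assumption m ≥ d ≥ 1, so that λλ_x = o(n^{-1}).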

Web Appendix D: Partial residual estimates
In this appendix we consider, as an aside, the asymptotic behaviour of the partial residual estimates introduced by Denby (1986) and, independently, by Speckman (1988), which are the estimates we obtain using the gSEM approach of Thaden and Kneib (2018). Here we adapt the method used in Sections 3.2 and 3.3 of the paper for estimates in the spatial and spatial+ models to show how the asymptotic results in Chen and Shiau (1991) for the partial residual estimates generalize from the one-dimensional model to dimensions d ≥ 1. We show that, as is the case for the spatial+ model, the smoothing-induced bias in the covariate effect estimate goes to 0 faster than the standard deviation, i.e. the partial residual estimates also avoid the problem of disproportionate smoothing-induced bias.

For a given value λ > 0 of the smoothing parameter, the partial residual estimates for the covariate effect β and the unknown smooth effect f = (f(t_1), ..., f(t_n))^T in the model (1) of the paper are defined as β̂_pr = (r_x^T r_x)^{-1} r_x^T r_y and f̂_pr = S_λ(y − β̂_pr x), (4) where S_λ is the smoother matrix and r_x = (I − S_λ)x and r_y = (I − S_λ)y are the residuals after fitting a thin plate spline to x and y, respectively. A similar argument to that of Section 2.2 of the paper shows that these estimates are the ones we would obtain in the gSEM if, for simplicity, we used the same smoothing parameter in all regressions. That is, the estimate β̂_pr is the same as the estimated effect of r_x in the linear model regressing r_y on r_x. Minor adjustments to the proofs of Theorems 1 and 2 and Corollary 1 for the spatial model estimates lead to the following results. These results show that the asymptotic behaviour of the estimates β̂_pr and f̂_pr is the same as that of the corresponding spatial model estimates, except for the rate of convergence of the bias of the covariate effect estimate β̂_pr.
More specifically, E(β̂_pr) − β = o(n^{-1/2}) + O(λ), whereas E(β̂) − β = o(n^{-1/2}) + O(λ^{1/2}), and this difference is enough to ensure that the bias converges faster than the standard deviation when λ converges at the optimal rate (for minimizing the AMSE of the estimated spatial effect).
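In a toy setting (a hypothetical sinusoidal basis with an illustrative ridge smoothing parameter, not the thin plate spline smoother of the paper), the partial residual estimate is simply the OLS slope of the response residuals on the covariate residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.linspace(0.0, 1.0, n)
f = np.sin(2 * np.pi * t)                        # smooth spatial effect
x = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(n)
y = 2.0 * x + f + 0.1 * rng.standard_normal(n)   # true beta = 2

B = np.column_stack([np.ones(n), t,
                     np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
lam = 2.0
# Linear smoother S_lam from a ridge-penalized basis regression
S = B @ np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T)

r_x = x - S @ x   # residuals after smoothing x
r_y = y - S @ y   # residuals after smoothing y

# Partial residual (Denby/Speckman, gSEM-style) estimate of beta
beta_pr = (r_x @ r_y) / (r_x @ r_x)
print(beta_pr)
```

Here the smooth component of x is regressed away before estimating the covariate effect, which is why the smoothing-induced bias is of the smaller order O(λ).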
Theorem 4.1. Suppose λ ≈ n^{-δ} for some 0 < δ < 1, f, f_x ∈ H^m(Ω) are bounded and m ≥ d. Then for the partial residual estimate of β we have that E(β̂_pr) − β = o(n^{-1/2}) + O(λ) and Var(β̂_pr) = O(n^{-1}) as n → ∞. In particular, we need λ = o(n^{-1/2}) to ensure that the bias converges faster than the standard deviation of β̂_pr.
By (4), f̂_pr = S_λ(y − β̂_pr x) has the same form as the corresponding partial thin plate spline estimate and, therefore, the argument proceeds as in the proof of Theorem 2. For the derivation of B²_pr(f, λ) and V_pr(f, λ), we can apply the same proof, where the only adjustment needed is the rate of convergence of the bias E(β̂_pr) − β.

In particular, the optimal rate for λ ensures that the bias of the partial residual estimate β̂_pr converges faster than the standard deviation of the estimate.

Web Appendix E: Derivations for unsmoothed models
In this section we consider in more detail the models we compare in Section 4 of the paper when no smoothing penalty is applied (i.e. λ = λ_x = 0) and include some derivations that help explain our simulation results for these models. In the unsmoothed case, the models are ordinary linear models and the estimated covariate effect and fitted values can be found using simple linear algebra. To avoid confusion, rather than using the same notation for the estimated covariate effect, we denote the estimate by β̂_null, β̂, β̂_RSR, β̂_gSEM, β̂⁺ in the null, spatial, RSR, gSEM and spatial+ models, respectively.
Recall that, by the data generation process, each replicate of the response data in the simulation is of the form y = βx + f + ε_y, where β and f = −z − z′ are the true covariate and spatial effects, respectively, and ε_y is iid noise.
In the null and RSR models the estimated covariate effect is the ordinary least squares estimate; in particular, for any given data replicate, the estimates in these models are identical. More specifically, β̂_null = β̂_RSR = (x^T x)^{-1} x^T y = β + (x^T x)^{-1} x^T f + (x^T x)^{-1} x^T ε_y. So the bias in the null and RSR models is given by E((x^T x)^{-1} x^T f) and is, therefore, directly related to the correlation between the covariate x and the true unmeasured spatial effect f. Since in our simulations the correlation is negative, the bias in our results is therefore negative. While the covariate effect estimates in the null and RSR models agree, the fitted values differ, as the larger model matrix in RSR explains a part of y that is treated as random noise in the null model. In fact, the column space of the model matrix of the RSR model is the same as that of the spatial model and, therefore, (when λ = 0) the fitted values in these models agree (i.e. for any given data replicate, the fitted values will be identical).

When no smoothing penalty is applied, the spatial model, the gSEM and the spatial+ model are essentially the same, i.e. for any given data replicate they have the same fitted values and the same unbiased estimate for the covariate effect. The spatial model is an ordinary linear model where the columns in the model matrix are the covariate x and the spatial basis vectors B_sp. The spatial+ model is a reparametrization of the spatial model where the column x in the model matrix is replaced by the spatial residuals r_x = x − f̂_x (where f̂_x are the fitted values of a spatial thin plate spline fitted to x). This does not change the overall column space, as the difference f̂_x lies in the column space of B_sp. By the data generation process, y = β r_x + (β f̂_x − z − z′) + ε_y with β f̂_x − z − z′ in the column space of B_sp and, therefore, the true effect of r_x is the same as that of x.

In fact, since r_x is orthogonal to the spatial basis vectors, the estimated effect β̂⁺ in the spatial+ model (14) of the paper (and therefore β̂ in the spatial model (10) of the paper) is obtained as β̂⁺ = (r_x^T r_x)^{-1} r_x^T y. Similarly, for the unsmoothed gSEM, since r_y = β r_x + (β f̂_x − z − z′ − f̂_y) + ε_y with β f̂_x − z − z′ − f̂_y in the column space of B_sp, the estimated effect of r_x in the gSEM model (13) of the paper is given by β̂_gSEM = (r_x^T r_x)^{-1} r_x^T r_y = (r_x^T r_x)^{-1} r_x^T y. This shows that β̂ = β̂⁺ = β̂_gSEM. Note that, since r_x and ε_y are independent, E(β̂⁺) = β, i.e. the estimated covariate effect is unbiased.
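The unsmoothed identities above can be verified numerically. The sketch below uses a hypothetical small basis standing in for B_sp and, for simplicity, a null model without an intercept; it checks that the null and RSR covariate estimates agree, that the spatial, spatial+ and gSEM estimates agree, and that the RSR and spatial fitted values agree (all at λ = 0):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
t = np.linspace(0.0, 1.0, n)
B = np.column_stack([np.ones(n), t, np.sin(2 * np.pi * t)])  # toy basis B_sp
x = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(n)
y = 2.0 * x - np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Null model (no intercept, for simplicity): y ~ x
beta_null = (x @ y) / (x @ x)

# RSR: basis orthogonalized against x; the coefficient of x is the OLS estimate
B_rsr = B - np.outer(x, x @ B) / (x @ x)
coef_rsr = ols(np.column_stack([x, B_rsr]), y)
beta_rsr = coef_rsr[0]
fit_rsr = np.column_stack([x, B_rsr]) @ coef_rsr

# Spatial model: y ~ x + B_sp
coef_sp = ols(np.column_stack([x, B]), y)
beta_sp = coef_sp[0]
fit_sp = np.column_stack([x, B]) @ coef_sp

# Spatial+: replace x by the residuals r_x of x regressed on B_sp
r_x = x - B @ ols(B, x)
beta_plus = ols(np.column_stack([r_x, B]), y)[0]

# gSEM: regress residuals of y on residuals of x
r_y = y - B @ ols(B, y)
beta_gsem = (r_x @ r_y) / (r_x @ r_x)
```

The agreement of beta_sp, beta_plus and beta_gsem is an instance of the Frisch–Waugh–Lovell theorem, while beta_null = beta_rsr holds because x is orthogonal to the reparametrized basis.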
Web Appendix F: Additional simulation results

Mis-specified model
In the simulations of Section 4 of the paper, the data was generated in such a way that the true spatial dependence was that of a thin plate spline. This was ensured by replacing the spatial fields z and z′ by the fitted values of a thin plate spline model fitted to them. However, in practice, this assumption may not hold and we therefore repeated the simulations for data generated in the same way but where, instead of fitting a thin plate spline model to z and z′, we used Gaussian process smooths. More specifically, z has an exponential covariance structure with range parameter 5 and z′ a spherical covariance structure with range parameter 1. Figure 1 shows the results of these simulations. We see that the results are very similar to the simulation results in the paper.
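The two covariance structures can be sketched as follows (a minimal illustration on random 2-D locations; the domain, field sizes and jitter level are arbitrary choices, not those of the simulation study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
pts = rng.uniform(0, 10, size=(n, 2))            # random spatial locations
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

# Exponential covariance with range parameter 5 (used for z)
C_exp = np.exp(-d / 5.0)

# Spherical covariance with range parameter 1 (used for z'):
# 1 - 1.5 h + 0.5 h^3 for h = d/range < 1, and 0 beyond the range
h = np.minimum(d / 1.0, 1.0)
C_sph = 1.0 - 1.5 * h + 0.5 * h**3

# Draw the fields via a (jittered) Cholesky factorization
jitter = 1e-6 * np.eye(n)
z = np.linalg.cholesky(C_exp + jitter) @ rng.standard_normal(n)
z_prime = np.linalg.cholesky(C_sph + jitter) @ rng.standard_normal(n)
```

The spherical covariance is exactly zero beyond its range, so z′ has short-range dependence, while the exponential covariance of z decays smoothly with distance.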

Moderate sample size
In order to investigate the behaviour at moderate sample sizes, we repeated the simulations of Section 4 of the paper for sample sizes n = 300, n = 150 and n = 50 (with spatial basis sizes k_sp = 100, k_sp = 100 and k_sp = 30, respectively). The results are shown in Figure 2.

Web Appendix G: Generalized models

Generalized models assume that the response distribution belongs to the exponential family, i.e. that the density of the response is of the form f(y) = exp{(yθ − b(θ))/a(φ) + c(y, φ)}, where θ and φ are parameters of the distribution and a, b and c are functions. This family includes a large number of commonly used distributions in applied statistics, e.g. Gaussian, Poisson, gamma and binomial.
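For concreteness, the exponential family density referred to above has the standard form below; e.g. the Poisson distribution is recovered with θ = log μ, b(θ) = e^θ, a(φ) = 1 and c(y, φ) = −log y!:

```latex
f(y;\theta,\phi) = \exp\!\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right\}.
```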
Estimation in generalized models is performed via the PIRLS algorithm; at convergence, the fit corresponds to a weighted Gaussian fit with diagonal weight matrix W. We can therefore define the generalized RSR model to be the same as the generalized spatial model but with the spatial basis vectors B_sp in the model matrix replaced by B̃_sp = W^{-1/2}(I − x̃(x̃^T x̃)^{-1} x̃^T)√W B_sp, where x̃ = √W x. Then, by construction, the generalized RSR model corresponds to a Gaussian model for which the columns x̃ = √W x and √W B̃_sp are orthogonal: x̃^T √W B̃_sp = x^T W B̃_sp = 0.
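A minimal sketch of this construction (a toy Poisson log-link fit with an arbitrary two-column "spatial" basis; the variable names are illustrative, not from the paper's implementation): PIRLS is run to convergence, the working weights W are extracted, and the basis is orthogonalized against x in the W inner product:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
B = np.column_stack([np.ones(n), np.sin(x)])     # toy "spatial" basis B_sp
X = np.column_stack([x, B])
y = rng.poisson(np.exp(0.5 * x))                 # Poisson response, log link

# PIRLS for a Poisson GLM: weights w_i = mu_i, working response z
coefs = np.zeros(X.shape[1])
for _ in range(100):
    mu = np.exp(X @ coefs)
    z = X @ coefs + (y - mu) / mu                # working response
    WX = X * mu[:, None]                         # W X with W = diag(mu)
    new = np.linalg.solve(X.T @ WX, WX.T @ z)
    if np.max(np.abs(new - coefs)) < 1e-10:
        coefs = new
        break
    coefs = new

# Generalized RSR: orthogonalize sqrt(W) B_sp against x_tilde = sqrt(W) x
w = np.exp(X @ coefs)                            # converged weights
x_t = np.sqrt(w) * x
B_w = np.sqrt(w)[:, None] * B
B_rsr = B_w - np.outer(x_t, x_t @ B_w) / (x_t @ x_t)
```

By construction x_t is orthogonal to every column of B_rsr, which is the weighted-orthogonality property x̃^T √W B̃_sp = 0 described above.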