A joint Bayesian space-time model to integrate spatially misaligned air pollution data in R-INLA

In air pollution studies, dispersion models provide estimates of concentration at grid level covering the entire spatial domain, and are then calibrated against measurements from monitoring stations. However, these different data sources are misaligned in space and time, and if the misalignment is not taken into account it can bias the predictions. We aim to demonstrate how the combination of multiple data sources, such as dispersion model outputs, ground observations and covariates, leads to more accurate predictions of air pollution at grid level. We consider nitrogen dioxide (NO2) concentration in Greater London and surroundings for the years 2007-2011, and combine two different dispersion models. Different sets of spatial and temporal effects are included in order to obtain the best predictive capability. Our proposed model lies between calibration and Bayesian melding techniques for data fusion. Unlike other examples, we jointly model the response (concentration level at monitoring stations) and the dispersion model outputs on different scales, accounting for the different sources of uncertainty. Our spatio-temporal model allows us to reconstruct the latent fields of each model component, and to predict daily pollution concentrations. We compare the predictive capability of our proposed model with other established methods that account for misalignment (e.g. bilinear interpolation), showing that in our case study the joint model is a better alternative.


Introduction
Our approach is similar to the coregionalization model proposed by Schmidt and Gelfand (2003) to model CO, NO, and NO2, which allows us to calibrate the deterministic models against the monitor observations through a coefficient, similarly to calibration techniques. However, here we treat the three sources of information on NO2 as coming from the same true underlying spatio-temporal process (i.e. the true air pollution concentration field), as in Bayesian melding. The pure application of this kind of model is computationally prohibitive for the high resolution output data we have at hand. We solve this issue by representing the spatially continuous fields as solutions to a Stochastic Partial Differential Equation (SPDE), which can be handled in a computationally efficient way (Lindgren et al., 2011; Krainski et al., 2018). Additionally, our model reconstructs the continuous latent spatial and temporal fields, allowing us to account for all the sources of uncertainty: first, the one associated with the estimates from the numerical models, which is not provided as these are deterministic models, and second, the measurement error associated with the ground observations. This is most useful in the perspective of using the predictions from the air pollution model as a measure of exposure in an epidemiological model, where the uncertainty could be fed forward (see for example Lee et al. 2017 and Cameletti et al. 2019).
Inference is performed under the Bayesian paradigm through the Integrated Nested Laplace Approximation (INLA) coupled with the SPDE approach, as implemented in the R-INLA package (Rue, 2018).
Other authors have implemented solutions for spatially misaligned air pollution data in R-INLA, however their approaches differ from ours in several respects. In particular, Moraga et al. (2017) show an example of area-to-point misalignment addressed via block averaging, in a spatial-only context, without accounting for the uncertainty associated with the raster data. Cameletti et al. (2019) implement a spatial upscaler from point to area, comparing two different averaging methods. Kifle et al. (2017) compare additive and coupled spatio-temporal processes for multivariate data in a biological context (prevalence of vectors for arboviruses), where the data are not misaligned, and do not include any explanatory covariate.
To the best of our knowledge, this is the first time a spatio-temporal model for spatially misaligned point-referenced data is implemented through the INLA-SPDE approach considering more than one deterministic model output at different spatial and temporal resolutions.
We compare and contrast several models and through a cross-validation method we evaluate which one produces the most accurate predictions of NO 2 concentration in Greater London and surroundings for the period 2007 to 2011. We compare our method with two approaches in which the alignment is done through bilinear interpolation or kriging, hence not accounting for the measurement error associated with the misaligned covariates. The first is a simple hierarchical model that includes linear effects for the covariates and structured spatio-temporal residuals. The second is the recently proposed data integration model from Mukhopadhyay and Sahu (2017), which allows for non-stationarity in the residual spatial process.
The remainder of the article is organised as follows: section 2 presents the study area and data; section 3 describes the methods used in the analysis, starting with the model specification followed by a description of the competitor models; section 4 reports the results of the application of such methods to our air pollution data; finally, section 5 contains the conclusions and a short discussion, pointing towards further developments.

Study area and data
The study focuses on NO2, as it is one of the pollutants regulated by national and international directives, and it is traffic driven, hence characterised by high spatio-temporal variability. The monitoring data were processed as follows: (i) we retained only monitors where NO2 measurements are present; (ii) we eliminated non-positive daily averages, which do not allow for the logarithmic transformation required in the analysis (negative observations are due to measurement error); (iii) we excluded monitors where the resulting daily NO2 is available for fewer than 1370 days, not necessarily consecutive (this is because an active monitor does not necessarily record NO2 measurements).
The monitors are split into 6 groups maximizing similarity criteria between groups, and a 6-fold cross validation is performed.
For each monitor we have information about the site type classification, that we aggregated into 3 categories: rural, urban, road-kerb side.
We define our study area as that including all the selected monitors, containing 495 grid cells for AQUM and 44,117 grid cells for PCM.
The locations of the air pollution data sources described above are displayed in Figure 1. The AQUM model includes chemistry, physical and aerosol models, meteorological configuration based on the Met Office's North Atlantic and European Model (NAE) and emission data (Savage et al., 2013); the PCM model input includes emission inventory, energy projections, road traffic counts, road transport activity and meteorological hourly data from Waddington weather station (Ricardo Energy & Environment, 2017).
Although data were available, we decided against the inclusion of meteorological variables in our models, as they are already an input for both the numerical models considered in the analysis.
Consistently with the selection criteria, all the 6 training and validation sets have similar distribution of daily NO 2 concentration by site type (see Appendix A, Fig. A.1). As expected, in all sets the road-kerb side monitors have higher mean and maximum levels of NO 2 .
In particular, 17 road-kerb side sites exceed the limit set by the WHO and the European Commission for the annual average of 40 µg/m3, for at least 4 of the 5 years under study (see Appendix A, Fig. A.2). Of these, the monitor in Lambeth-Brixton Road (LB4) is also well above the hourly threshold of 200 µg/m3 not to be exceeded more than 18 times annually, in every year, even though a decreasing trend can be observed (from 865 hourly exceedances in 2007 to 62 in 2011); 6 other monitors exceeded this threshold between 2007 and 2008.

Methods
In this section we first present some analyses of the AQUM and PCM data, then we introduce the joint model, and finally the models that we use for comparison. Note that we represent vectors and matrices in bold typeface.

Separate models for AQUM and PCM data
In order to quantify the relevance of the temporal component for PCM (i = 1) and the spatial component for AQUM (i = 2), we ran three models for each data source separately: (i) one with spatial-only or temporal-only effect respectively, (ii) one with additive spatial and temporal effects and (iii) one with a spatio-temporal interaction.
Let us define y_i as the vector of air pollution concentrations on the logarithmic scale across space and time for the i-th numerical model. This is assumed to be normally distributed with mean η_i and variance σ²_i:

y_i ∼ N(η_i, σ²_i I).

Each element of the linear predictor η_i (for a time point t and location s identified by UTM coordinates) for models (i), (ii), (iii) is specified as follows:

(i) η_i(s, t) = α_i + z_{1i}(s) (spatial-only, for PCM) or η_i(s, t) = α_i + z_{2i}(t) (temporal-only, for AQUM)
(ii) η_i(s, t) = α_i + z_{1i}(s) + z_{2i}(t)
(iii) η_i(s, t) = α_i + z_{3i}(s, t)

where z_{1i}(s) is the realisation at location s of the spatial process z_{1i} with Matérn covariance function, z_{2i}(t) is a temporal process modelled as a random walk, and z_{3i} ∼ MVN(0, σ²_{z3i} Σ_t ⊗ Σ_s) is a separable space-time interaction with Matérn spatial covariance function and temporal dependence modelled as a random walk.
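To make the separable interaction concrete, the covariance Σ_t ⊗ Σ_s can be sketched numerically. The following Python snippet is illustrative only: the locations, dimensions, smoothness ν = 1 and all parameter values are arbitrary stand-ins, not the fitted quantities of the paper. It builds a Matérn spatial covariance, a random-walk temporal covariance and their Kronecker product, and draws one realisation of the interaction field:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import kv, gamma

def matern_cov(coords, sigma2=1.0, kappa=0.05, nu=1.0):
    """Matérn covariance between all pairs of 2-D locations."""
    d = cdist(coords, coords)
    kd = kappa * d
    C = sigma2 * (2 ** (1 - nu) / gamma(nu)) * kd ** nu * kv(nu, np.where(kd == 0, 1.0, kd))
    return np.where(d == 0, sigma2, C)   # variance sigma2 on the diagonal

def rw1_cov(T, sigma2=1.0):
    """Covariance of a random walk started at zero: cov(t, u) = sigma2 * min(t, u)."""
    t = np.arange(1, T + 1)
    return sigma2 * np.minimum.outer(t, t)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(6, 2))     # 6 hypothetical UTM locations (km)
Sigma_s = matern_cov(coords)                  # spatial covariance (unit variance)
Sigma_t = rw1_cov(4)                          # temporal covariance over 4 time points
Sigma_st = np.kron(Sigma_t, Sigma_s)          # separable covariance Sigma_t (x) Sigma_s
z3 = rng.multivariate_normal(np.zeros(24), Sigma_st)  # one draw of the interaction field
```

The Kronecker structure is what makes the interaction "separable": its covariance between any two space-time points factorises into a spatial and a temporal term.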
Based on the deviance information criterion (DIC), the results show that AQUM spatial variation is relevant but there is no need for a space-time interaction so model (ii) is selected for AQUM, while the PCM temporal variation is negligible so model (i) is selected for PCM (see Appendix B for details). Hence, we will use this specification in the joint model presented in the next section.

Bayesian joint spatio-temporal model for misaligned covariates
Following Kifle et al. (2017) we implement an additive space-time model for data observed at different points in space, which share a spatial and a temporal component.
Previous similar applications consider measurements of more than one variable at the same locations, but this is not a requirement in the INLA-SPDE approach.
Our model is joint in the sense that we specify one likelihood for the response and one for each of the misaligned covariates, and they contain common components which are estimated using all the data. Even though in R-INLA the problem is treated computationally in a similar way to a multivariate situation, this is not the case here, as we ultimately consider solely the monitor observations as the response variable.
We make the assumption that the same temporal dynamics govern AQUM and monitor observations, and likewise the same spatial dynamics govern PCM, AQUM and monitor observations.
Our hierarchical model has three levels: in the first we define the likelihoods, in the second the random effect components, while the third level includes the prior distributions for the model parameters and hyperparameters.
The joint model presented below is implemented via INLA, a computationally efficient alternative to Markov chain Monte Carlo (MCMC) methods that works specifically on hierarchical Gaussian Markov Random Fields (GMRF). Details on how this is done in R-INLA can be found in Appendix E.

Level 1: Likelihoods and linear predictors
Let y_i(s, t) denote the PCM (i = 1) and AQUM (i = 2) data and the observed NO2 concentration (i = 3) at the generic time point t and site s, on the logarithmic scale. These are assumed to be normally distributed, with mean η_i(s, t) and measurement-error variance σ²_i:

y_i(s, t) ∼ N(η_i(s, t), σ²_i).

Based on the results from section 3.1 we model the PCM data with an intercept and a spatial component, and the AQUM data with an intercept and additive spatial and temporal components. These are shared between the three linear predictors, which are the following:

η_1(s, t) = α_1 + z_1(s)
η_2(s, t) = α_2 + λ_{1,2} z_1(s) + z_2(t)
η_3(s, t) = α_3 + β_{k_s} + λ_{1,3} z_1(s) + λ_{2,3} z_2(t) + z_3(t, k_s)

where α_i are the intercepts, λ_{i,j} are the scaling parameters for the components shared from η_i to η_j, β_{k_s} is the fixed effect for the site type as a categorical variable (k_s = 0: rural (reference), k_s = 1: urban, k_s = 2: road-kerb side), and z_1 and z_2 are the shared random effects. The linear predictor for the ground observations, η_3, also contains an interaction term z_3 which allows for a different residual temporal trend for each site type.
Note that even though PCM is assumed to be governed only by a spatial effect, its output does vary in both space and time, so the deterministic model output y_1 has space and time indices (here the locations are the centroids of the 44,117 PCM grid cells, and the time points are the years), while its latent field z_1 has only a spatial index.
For AQUM, the space and time indices of y_2 correspond to the centroids of the 495 AQUM grid cells and the 1826 days respectively.
Finally, y 3 is measured at the 126 monitors on 1826 days.

Level 2: latent fields
In Equation (3), z_1(s) is the realisation at location s of the common spatial latent field

z_1 ∼ MVN(0, σ²_{z1} Σ),

with Σ being the correlation matrix defined by the Matérn stationary and isotropic covariance function (see Appendix D). It is important to note that z_1 is then rescaled for AQUM and the monitor observations through λ_{1,2} (eq. 2) and λ_{1,3} (eq. 3).
In the same equation, z_2(t) is the t-th element of the temporal latent field z_2, and is modelled as a random walk:

z_2(t) − z_2(t − 1) ∼ N(0, σ²_{z2}).

Similarly to z_1, z_2 is rescaled for the monitor observations through λ_{2,3} (eq. 3). Finally, z_3 is the residual temporal trend, assumed to be different for each site type (rural, urban, road-kerb side) and modelled as a first-order autoregressive process:

z_3(t, k) = ρ z_3(t − 1, k) + ε(t, k), with ε(t, k) ∼ N(0, σ²_{z3}).

In other words, we assume conditionally independent replications of the same latent field for each site type, with shared hyperparameters (Martins et al., 2013).
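As an illustration of the structure of these temporal components (with arbitrary, hypothetical values for the hyperparameters, not the estimated ones), the shared random walk z_2 and the site-type-replicated AR(1) process z_3 can be simulated as:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 1826                                     # number of days in the study period
sigma_z2, rho, sigma_eps = 0.1, 0.9, 0.05    # hypothetical hyperparameter values

# Shared temporal field z2: first-order random walk (cumulative sum of innovations)
z2 = np.cumsum(rng.normal(0.0, sigma_z2, size=T))

# Residual trend z3: independent AR(1) replicates, one per site type,
# all sharing the same hyperparameters (rho, sigma_eps)
z3 = np.zeros((3, T))                        # rows: rural, urban, road-kerb side
for k in range(3):
    for t in range(1, T):
        z3[k, t] = rho * z3[k, t - 1] + rng.normal(0.0, sigma_eps)
```

The three rows of z3 are drawn independently but from the same model, mirroring the "conditionally independent replications with shared hyperparameters" assumption.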

Level 3: priors
The priors on the model parameters are specified as follows.
For the standard deviation of the random walk we assume a penalised complexity prior such that the probability that σ z2 is greater than the empirical standard deviation of the AQUM data is 1%, i.e. P (σ z2 > SD(AQU M )) = 0.01.
Finally, on the coefficients of the fixed effects α_i and β_{k_s} we assume the default weakly informative Normal prior distribution N(0, 1000).

Competitor models
We compared our model to three different competitors: (i) a joint model that includes only one misaligned covariate (either AQUM or PCM), (ii) a simple hierarchical model that includes a covariate aligned at the monitoring sites through bilinear interpolation (Akima, 1978) or kriging, and (iii) a complex hierarchical model which allows for non-stationarity after interpolating the misaligned covariates via bilinear interpolation (Mukhopadhyay and Sahu, 2017).
The aim of this comparison is to evaluate if the inclusion of more than one extra data source actually improves the model predictive capability and can counterbalance the need for complex random effect structures, and if there is a gain in moving from a simple interpolation to a modelling framework.
We describe the three comparators in the rest of this section.

Joint models with one misaligned covariate only
The joint model that includes PCM only is specified as:

y_1(s, t) ∼ N(η_1(s, t), σ²_1), with η_1(s, t) = α_1 + z_1(s)

and

y_3(s, t) ∼ N(η_3(s, t), σ²_3), with η_3(s, t) = α_3 + β_{k_s} + λ_{1,3} z_1(s) + z_2(t) + z_3(t, k_s).

Similarly, the joint model that includes AQUM only is defined as:

y_2(s, t) ∼ N(η_2(s, t), σ²_2), with η_2(s, t) = α_2 + z_1(s) + z_2(t)

and

y_3(s, t) ∼ N(η_3(s, t), σ²_3), with η_3(s, t) = α_3 + β_{k_s} + λ_{1,3} z_1(s) + λ_{2,3} z_2(t) + z_3(t, k_s).

For these models we considered calibration coefficients λ_{i,j} either fixed to 1 or freely varying, and different priors. The final choice of priors is the one reported in Section 3.2.3.

Data integration model via interpolation
We implement two models that use interpolation techniques to obtain values of AQUM and PCM at the monitoring stations. The first is a naive bilinear interpolation; the second can be considered Bayesian kriging, since we predict AQUM and PCM at the monitoring stations from the models described in section 3.1.
In both cases, after aligning the AQUM (X 1 ) and PCM (X 2 ) values, we consider a linear effect on the covariates, a spatially structured residual z 1 , a temporally structured residual z 2 and the site-type-specific temporal effect z 3 specified as in Section 3.2. We also keep the fixed effects for the site type β ks as in the joint model.
We specify a normal likelihood y(s, t) ∼ N(η(s, t), σ²) and the linear predictor as follows:

η(s, t) = α + γ_1 X_1(s, t) + γ_2 X_2(s, t) + β_{k_s} + z_1(s) + z_2(t) + z_3(t, k_s).
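The bilinear alignment step can be sketched as follows. This is a minimal illustration rather than the Akima (1978) routine used in the analysis; the grid values, origin and 2 km resolution are hypothetical:

```python
import numpy as np

def bilinear(grid, x0, y0, dx, dy, xs, ys):
    """Bilinearly interpolate a regular grid (rows = y, cols = x) at points (xs, ys)."""
    gx = (np.asarray(xs, float) - x0) / dx          # fractional column index
    gy = (np.asarray(ys, float) - y0) / dy          # fractional row index
    j0 = np.floor(gx).astype(int)
    i0 = np.floor(gy).astype(int)
    tx, ty = gx - j0, gy - i0                       # weights within the enclosing cell
    return ((1 - ty) * (1 - tx) * grid[i0, j0] + (1 - ty) * tx * grid[i0, j0 + 1]
            + ty * (1 - tx) * grid[i0 + 1, j0] + ty * tx * grid[i0 + 1, j0 + 1])

# Hypothetical 4x4 grid of log-concentrations at 2 km resolution, origin at (0, 0)
grid = np.arange(16.0).reshape(4, 4)
vals = bilinear(grid, x0=0, y0=0, dx=2, dy=2, xs=[1.0, 3.0], ys=[1.0, 1.0])
```

Each monitor value is a distance-weighted average of the four surrounding grid-cell values; note that this carries no measure of the uncertainty of the interpolated covariate, which is precisely what the joint model retains.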

Data integration model with non-stationarity
Mukhopadhyay and Sahu (2017) developed a site-type-specific regression on the AQUM data using our same classification for the site type. The key feature of their model is the specification of a non-stationary spatio-temporal process, which leads to a better predictive performance compared to the stationary Gaussian process (GP) in their application.
To obtain a like-for-like comparison, we also include a site-type-specific regression on the PCM data and implement both the stationary and the non-stationary versions of this model.
Both AQUM (X 1 ) and PCM (X 2 ) are interpolated at the monitoring site locations through bilinear interpolation.
The hierarchical model specification in this case is:

y(s, t) = γ_0 + γ_1 X_1(s, t) + Σ_{k=1,2} δ_k(s) (γ_0k + γ_1k X_1(s, t)) + ν(s, t) + ε(s, t)

for the model with AQUM only, and

y(s, t) = γ_0 + γ_1 X_1(s, t) + γ_2 X_2(s, t) + Σ_{k=1,2} δ_k(s) (γ_0k + γ_1k X_1(s, t) + γ_2k X_2(s, t)) + ν(s, t) + ε(s, t)

for the model with AQUM and PCM.
Here k = 0 indicates rural site type (baseline), k = 1 urban and k = 2 road-kerbside, δ k (s) is an indicator function equal to 1 if site s is of type k and 0 otherwise, γ 0 , γ 1 and γ 2 are the baseline intercept and slopes for X 1 and X 2 , while γ 0k , γ 1k and γ 2k are site-type-specific adjustments to the baseline intercept and slopes.
For the spatio-temporal process ν we first assume a stationary time-independent GP with zero mean and exponential correlation function (note that ν_t(s) = ν(s, t)). Then we specify a non-stationary covariance structure as in Sahu and Mukhopadhyay (2015), based on a set of knot locations at which the process ν*_t is defined. From multivariate Gaussian theory it follows that ν̃_t = C* H_{ν*}(φ)^{-1} ν*_t, with C* being the cross-correlation matrix between ν_t and ν*_t and H_{ν*}(φ) the correlation matrix of ν*_t. The non-stationarity and the anisotropy are given by the fact that corr(ν̃_t(s), ν̃_t(s')) = c*(s)^T H_{ν*}(φ)^{-1} c*(s'), which depends on both s and s' and not only on the separation vector or the distance between the two locations.
To introduce temporal dependence we specify a first order autoregressive model for ν * t . The choice of knot locations, model specification and prior distributions is based on Mukhopadhyay and Sahu (2017). This is justified by the fact that we use the same data sources on a subregion of their study area.
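The knot-based correlation can be illustrated numerically. In the sketch below the knot locations, the decay parameter phi and the two sites are arbitrary choices; the snippet computes the knot-induced correlation c*(s)^T H_{ν*}(φ)^{-1} c*(s') and, for comparison, the stationary exponential correlation at the same distance:

```python
import numpy as np
from scipy.spatial.distance import cdist

def expcorr(A, B, phi):
    """Exponential correlation exp(-d / phi) between two sets of 2-D locations."""
    return np.exp(-cdist(A, B) / phi)

rng = np.random.default_rng(1)
knots = rng.uniform(0, 50, size=(9, 2))       # hypothetical knot locations (km)
s, sp = np.array([[10.0, 10.0]]), np.array([[20.0, 10.0]])
phi = 15.0                                    # hypothetical decay parameter

H_inv = np.linalg.inv(expcorr(knots, knots, phi))      # H_{nu*}(phi)^{-1}
c_s, c_sp = expcorr(s, knots, phi), expcorr(sp, knots, phi)

corr_knot = float(c_s @ H_inv @ c_sp.T)       # knot-based, location-dependent correlation
corr_stat = float(np.exp(-10.0 / phi))        # stationary correlation at the same distance
```

Moving the pair (s, s') around while keeping their distance fixed changes corr_knot but not corr_stat, which is exactly the non-stationarity the model introduces.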

Validation and predictive capability measures
We compared the predictive capability through proper scoring rules (Gneiting and Raftery, 2007), such as the cross-validated logarithmic score (logScore), the Continuous Ranked Probability Score (CRPS) and the Root Mean Squared Error (RMSE), as well as the Predictive Model Choice Criterion (PMCC) proposed by Gelfand and Ghosh (1998).
Furthermore, we reported the correlation between the observed and the predicted values for the validation sites (COR), the Mean Absolute Percentage Error (MAPE), and the 95% coverage (COV), defined as the percentage of times that the observed value falls within the 95% credibility interval of the sampled posterior marginal.
To measure the predictive capability we need to report the fitted values at the validation sites on the scale of the outcome; we cannot compare the observed values directly with the fitted values, because the latter account only for the uncertainty associated with the model parameters and not for the measurement error. In order to do so, we first draw 50 values from the marginal posterior of the measurement error variance p(σ²_3 | y_3), and then draw η_3 from its conditional posterior p(η_3 | σ²_3, y_3) using the simulated values of σ²_3 (Gelman et al., 2013). These values are used as the mean and variance of a Normal distribution, from which we sample values at each site (sample size = 100).
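The two-stage sampling scheme can be sketched as follows. The two draw functions below are hypothetical stand-ins (a toy inverse-Gamma variance and a Normal linear predictor), since the actual marginals come from the fitted R-INLA model:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stand-ins for the posteriors of the fitted model; in practice these
# draws come from p(sigma2_3 | y3) and p(eta3 | sigma2_3, y3) respectively.
def draw_sigma2(rng):
    return 1.0 / rng.gamma(shape=50.0, scale=0.25)   # measurement-error variance

def draw_eta(sigma2, rng):
    return rng.normal(loc=3.2, scale=0.05)           # linear predictor at one site-day

samples = []
for _ in range(50):                 # 50 draws of the measurement-error variance
    s2 = draw_sigma2(rng)
    eta = draw_eta(s2, rng)
    samples.append(rng.normal(eta, np.sqrt(s2), size=100))  # 100 predictive draws each
y_pred = np.concatenate(samples)    # Q = 100 x 50 sampled values for this site-day
y_hat = y_pred.mean()               # point prediction for this site-day
```

Sampling the predictive values with standard deviation sqrt(sigma2_3) is what folds the measurement error back into the comparison with the observed data.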
The following are the formulas for the different measures of predictive capability used in the paper. For simplicity we apply a slight change of notation here: y_jt indicates the observed value at monitor j (with m the number of validation monitors) and day t (t = 1, …, T, with T = 1826), and ŷ_jt is the corresponding predicted value, obtained as the mean of the vector of Q = 100 × 50 sampled values (ŷ_1, …, ŷ_Q) ∼ F_jt, with F_jt the empirical distribution function at monitor j and day t.
For each model under comparison, the predictive capability measures presented above are computed pooling together the 6 validation sets. They can also be calculated by day, by site, by site type or across all sites, to obtain specific and global measures; lower values indicate better predictive performance (except for COR and COV, for which values closer to 1 and to 95% respectively are better).
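Under the notation above, the sample-based measures can be implemented as in the minimal sketch below, where the second argument of coverage95 and crps holds the Q predictive draws for each observation (the logScore and PMCC are omitted here):

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):
    return float(100.0 * np.mean(np.abs((y - yhat) / y)))

def coverage95(y, samples):
    """Share of observations falling in the central 95% interval of their samples."""
    lo, hi = np.quantile(samples, [0.025, 0.975], axis=1)
    return float(np.mean((y >= lo) & (y <= hi)))

def crps(y, samples):
    """Sample-based CRPS, averaged over observations (Gneiting and Raftery, 2007)."""
    vals = []
    for yi, si in zip(y, samples):
        term1 = np.mean(np.abs(si - yi))                          # E|Y - y|
        term2 = 0.5 * np.mean(np.abs(si[:, None] - si[None, :]))  # 0.5 * E|Y - Y'|
        vals.append(term1 - term2)
    return float(np.mean(vals))
```

The CRPS estimator uses the standard identity CRPS(F, y) = E|Y − y| − 0.5 E|Y − Y'| with Y, Y' independent draws from the predictive distribution F.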

Predictions on a regular grid
From the joint model, we extract daily predictions of NO 2 concentration on a regular grid that covers the study area. For the grid we choose an intermediate spatial resolution between PCM and AQUM data to limit the computational burden while retaining spatial variability.
In order to provide the predictions in a reasonable time, we extract samples from the joint posterior marginals and estimate the linear predictor at each time-location for the 1826 days on the regular grid (Thomas et al., 2019).
We compute the predictions from the model that includes all monitors as training set.
The linear predictor at each grid location s and day t is reconstructed as

η_3(s, t) = α_3 + β_{k_s} + λ_{1,3} z_1(s) + λ_{2,3} z_2(t) + z_3(t, k_s).

In particular, following the tutorial by Bakka (2017), we obtain samples from the posterior of the intercept α_3 and of β_{k_s} for each site type; samples from the posterior of λ_{1,3} z_1 at the mesh nodes, which we reproject onto the prediction grid; samples from the posterior of λ_{2,3} z_2 at each time point (day); and samples from the posterior of z_3 for each day and site type.
Note that in order to predict at the grid locations we need to know the site type classification of each grid point. To this aim, we built a function which assigns each location to the road-kerb side, urban or rural class depending on its distance from any road, using the 2012 Corine land cover shapefile for the UK, Jersey and Guernsey from the Centre for Ecology and Hydrology (Cole et al., 2015). See Appendix F for more details.
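A minimal sketch of such a classification rule is given below; the 25 m kerb-side distance threshold and the binary urban land-cover flag are hypothetical simplifications of the actual function described in Appendix F:

```python
def classify_site(dist_to_road_m, urban_land_cover, kerb_threshold_m=25.0):
    """Hypothetical rule: road-kerb side if very close to a road,
    urban if the land-cover class is urban, rural otherwise."""
    if dist_to_road_m <= kerb_threshold_m:
        return "road-kerb side"
    if urban_land_cover:
        return "urban"
    return "rural"
```

Applied to every grid centroid, this yields the site-type index k_s needed to evaluate β_{k_s} and z_3(t, k_s) at the prediction locations.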
For each sample we then sum the draws from the fixed and random effects to reconstruct the linear predictor; the prediction is then given by the average across all samples.
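The sample-by-sample reconstruction can be sketched as follows, with toy dimensions and Normal draws standing in for the actual posterior samples of each model component:

```python
import numpy as np

rng = np.random.default_rng(3)
S, T, Q = 200, 5, 100           # grid cells, days, posterior samples (toy sizes)
k = rng.integers(0, 3, size=S)  # hypothetical site-type index for each grid cell

# Hypothetical posterior samples of each component of eta_3
alpha3 = rng.normal(3.0, 0.05, size=Q)
beta   = rng.normal([0.0, 0.3, 0.8], 0.05, size=(Q, 3))  # one column per site type
l13_z1 = rng.normal(0.0, 0.2, size=(Q, S))               # spatial field on the grid
l23_z2 = rng.normal(0.0, 0.1, size=(Q, T))               # shared temporal field
z3     = rng.normal(0.0, 0.05, size=(Q, 3, T))           # site-type temporal residual

# Sum the components sample by sample (broadcasting over cells and days),
# then average over the Q samples to get the posterior mean prediction
eta = (alpha3[:, None, None] + beta[:, k][:, :, None]
       + l13_z1[:, :, None] + l23_z2[:, None, :] + z3[:, k, :])
pred = eta.mean(axis=0)         # S x T matrix of predictions
```

Summing within each sample before averaging preserves the posterior dependence between the components, which a component-wise average would discard.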

Results
In this section we present the results of the model comparison, with particular focus on the advantages of the proposed joint model, and the daily predictions that we obtained from the best model.

Model comparison
In order to show whether the inclusion of more than one extra data source actually improves the model predictive capability, we compare our proposed joint model with the corresponding models that include only AQUM or PCM.
For AQUM we assume a spatio-temporal effect, or a temporal-only effect when PCM is also included.
We also compare our joint model with other well established data integration techniques, the simple interpolation models described in section 3.3.2 and the more complex ones described in section 3.3.3.
Besides providing information about all the sources of uncertainty, all the joint models have better performance than the models where the misaligned data are interpolated, even allowing for non-stationarity (see table 2).
However, the AQUM data do not seem to provide much information: the model that includes only AQUM performs far worse than the one including only PCM. In addition, allowing for a spatial effect on AQUM does not improve the prediction in the model where PCM is also included. This can be explained by the fact that, when we remove AQUM from the model, the time-site type interaction z_3 replaces the role of AQUM in capturing the temporal trend and the temporal information is still provided by the numerous monitoring stations, while there is no other structured spatial component that compensates for PCM when it is removed.
Furthermore, as we focus here on spatial prediction rather than temporal forecasting, removing AQUM is less of a burden on the model performance in terms of predictive capability.
Nevertheless the model including both AQUM and PCM has the best performance in terms of PMCC and CRPS and we will report the results from this model in the next section.
Note that the predictive capability measures of the models in section 3.3.3 cannot be compared with the others due to the different model structure; only the PMCC and the 95% coverage are comparable, and are reported in table 2 (for the stationary AQUM + PCM model of Mukhopadhyay and Sahu (2017), PMCC = 75510 and 95% coverage = 62.13%, as provided by the spT.Gibbs function in the R package spAir). In table 2, (s) indicates spatial-only random effects, (t) temporal-only random effects, and (s, t) additive spatial and temporal random effects; when not specified, a linear effect is assumed as described in section 3.3. With regard to these models, allowing for non-stationarity and anisotropy leads to very little gain compared to the introduction of an additional source of data at high spatial resolution. In general, their performance is almost as poor as that of a linear effect on interpolated covariates.

Results from the complete joint model
We report the results for the joint model that includes spatial and temporal effects on AQUM and spatial effect on PCM, re-ran using all monitors as training data.
Looking at the summary reported in table 3 we see that, as expected, there is an increase in the NO2 concentration going from rural to road-kerb side locations, although not for urban sites. For the spatial latent field z_1, the estimated empirical range, i.e. the distance after which the spatial correlation function drops to 0.13 (Lindgren and Rue, 2015), is 177 km, corresponding to approximately 50% of the maximum extension of the spatial domain. The scaling parameters λ_{i,j} are all different from 1, meaning that the spatial field for PCM needs to be rescaled for AQUM (λ_{1,2} = 1.1) and for the monitor observations (λ_{1,3} = 1.3), and that the temporal latent field for AQUM is also calibrated against the monitor observations, with λ_{2,3} = 0.9.
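The 0.13 convention for the empirical range can be checked numerically for a Matérn field with smoothness ν = 1, the value implied by the common SPDE choice α = 2 in two dimensions (assumed here for illustration):

```python
import numpy as np
from scipy.special import kv

def matern1_corr(d, r):
    """Matérn correlation with nu = 1, parameterised by the empirical range r
    via kappa = sqrt(8 * nu) / r (Lindgren and Rue, 2015)."""
    x = np.sqrt(8.0) * np.asarray(d, float) / r
    return x * kv(1, x)   # 2^(1-nu)/Gamma(nu) * x^nu * K_nu(x) with nu = 1

val = float(matern1_corr(177.0, 177.0))   # correlation at one empirical range (km)
```

At distance equal to the range, val is approximately 0.14, matching the "drops to about 0.13" rule of thumb quoted in the text.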
The intercepts α i represent the overall mean of PCM, AQUM and ground observations respectively.
The spatial latent field z_1 (Fig. 2), shared between the PCM data, the AQUM data and the monitor observations, shows the traffic-driven characteristics of NO2, with higher values along motorways and in major city centres. The rescaled fields are reported as well; for λ_{1,3} z_1 the magnifying effect of the scaling parameter λ_{1,3} = 1.3 is particularly visible. Figure 3 shows the temporal latent field z_2 shared between the AQUM data and the monitor observations, which captures the seasonality of NO2, and the rescaled field λ_{2,3} z_2, which is shrunk by the scaling parameter λ_{2,3} = 0.9.
The latent fields z 1 and z 2 are both centred in zero as the large scale component of PCM and AQUM is captured by their intercepts α 1 and α 2 .
Finally, the time-site type interaction z_3 in Fig. 4 shows that there is some residual site-type-specific temporal variability, especially for urban and road-kerb side monitors, which is not captured by the main temporal component z_2.

Daily predictions
We selected four NO 2 pollution episodes reported by the LondonAir website (King's College London, 2018) and compared the predictions for these four days with four randomly selected summer Sundays across the study period, where we expect to see low levels of NO 2 . The predictions show the expected behaviour, with high predicted concentrations during the pollution episodes and low concentrations during the selected Sundays (Fig. 5).
A layer with the roads classified as motorways is plotted on top of each map, showing the correspondence between the highest predicted levels of NO2 and the major roads. This is expected, because NO2 is a highly traffic-driven pollutant. A peak of NO2 concentration can also be observed in the area of Heathrow airport, west of Greater London, which shows the highest levels also on low-concentration days.

Conclusion and discussion
We implemented a hierarchical Bayesian model to estimate air pollution concentration, combining misaligned data sources with a joint approach. This approach can be considered in between Bayesian melding and calibration, and it is the first attempt at implementing such methods on spatio-temporal air pollution data in R-INLA.
The proposed model includes information on the site type as well as output from two different numerical models characterised by spatial and temporal variability and accounting for traffic, chemistry, land use and meteorological covariates. Our method is transferable to any available data sources, however the interpretation of the results may change according to their intrinsic characteristics, in particular referring to the information included in the deterministic or LUR models.
We show that including more than one covariate at different spatial and temporal resolutions increases the model predictive capability. However, removing AQUM proved not to be detrimental; this may be explained by the fact that we are not performing temporal forecasting.
Overall we show that using as much spatial and temporal information as possible is more beneficial than increasing the complexity of the random effect structure.
A time-site type interaction was added to the model to account for residual temporal variability observed when looking at the site type-specific residuals.
The advantages of our method are manifold: first, reconstructing the entire latent field in a Bayesian approach provides us with the marginal posterior distributions of all the parameters, allowing us to correctly quantify the uncertainty associated with our predictions and with the deterministic models, which is not possible with other downscalers and non-model-based solutions; second, unlike the spatio-temporal downscaler proposed by Berrocal et al. (2012), our model reconstructs the latent fields of the misaligned covariates as a whole, rather than locally. For the same reason, in order to obtain daily predictions at new locations there is no need to calculate the value of the misaligned covariates at the prediction locations, as the model already estimates the whole latent field.
Our analysis presents some limitations, related on the one hand to the computational requirements of INLA due to the high number of parameters, and on the other hand to the generalizability of the results, as the models are quite data-sensitive.
In particular, we have very few rural sites even though we extended the study domain outside Greater London, suggesting the presence of preferential sampling that we did not account for. Furthermore, we made assumptions of stationarity and isotropy which may not hold when extending the spatial domain to bigger areas.
As a next step we will extend the joint model to a multivariate version including other pollutants, such as PM 10 or O 3 .
In the future, the predicted air pollution concentration with an associated measure of uncertainty could be used as the exposure in an epidemiological model, allowing for uncertainty propagation.

B Comparison of separate models for AQUM and PCM data
Table B.1 shows the deviance information criterion (DIC) and the logarithmic score for the models implemented separately for AQUM and PCM.
We need to include a spatial component for AQUM as this significantly improves the model performance.
According to the reported measures, the spatial-only model for PCM is the worst in terms of goodness of fit. However, the temporal component seems negligible and when including a space-time interaction the range becomes unreasonable (5230 Km). For these reasons, and to limit the computational burden, we decided to keep only the spatial component when modelling PCM, also because the temporal information is provided by the monitors and the AQUM data at a much higher resolution (daily instead of annual).

C INLA-SPDE
In this section we define the class of models on which we can perform Bayesian inference using INLA and briefly introduce how the inference is computed, following the notation in Blangiardo and Cameletti (2015).
Let us consider a set of data y = (y_1, . . . , y_n) whose distribution is characterized by a parameter μ, usually the mean, defined as a function of an additive linear predictor η through a link function g(·):

g(μ_i) = η_i = α + Σ_{m=1}^M β_m x_{mi} + Σ_{l=1}^L f_l(z_{li}),

where α is the intercept, the β_m are the coefficients of the covariates x, and the f_l are functions (e.g. random effects) of the covariates z.
Defining the vector of parameters θ = (α, β, f)^T and the vector of hyperparameters ψ = (ψ_1, . . . , ψ_K), the likelihood is given by

p(y | θ, ψ) = ∏_{i=1}^n p(y_i | θ_i, ψ).

We assume the latent field θ, conditionally on ψ, to be multivariate Normal with precision matrix Q(ψ) and conditionally independent, i.e. a Gaussian Markov Random Field (GMRF):

θ | ψ ~ N(0, Q^{-1}(ψ)).

The Markov property ensures the sparsity of the precision matrix Q.
The aim of INLA is to approximate the posterior marginals p(θ_i | y) = ∫ p(θ_i | ψ, y) p(ψ | y) dψ. Therefore we first need to compute (i) the joint posterior marginal of the hyperparameters p(ψ | y) and (ii) the posterior conditional distributions p(θ_i | ψ, y).
Within such a class of models, which includes a wide range of possible model specifications, we can compute these distributions through a Laplace approximation:

p(ψ | y) ∝ p(θ, ψ, y) / p̃(θ | ψ, y), evaluated at θ = θ*(ψ),

where p̃(θ | ψ, y) is the Gaussian Laplace approximation of p(θ | ψ, y) around its mode θ*(ψ).
For p(θ_i | ψ, y) we consider the partition θ = (θ_i, θ_{−i}):

p(θ_i | ψ, y) ∝ p(θ, ψ, y) / p̃(θ_{−i} | θ_i, ψ, y), evaluated at θ_{−i} = θ*_{−i}(θ_i, ψ),

where p̃(θ_{−i} | θ_i, ψ, y) is the Gaussian Laplace approximation of p(θ_{−i} | θ_i, ψ, y) around its mode θ*_{−i}(θ_i, ψ). In R-INLA, other approximation strategies are implemented and can be chosen to speed up the computation.
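To illustrate the Laplace idea in the simplest possible setting, the following Python sketch (a univariate toy example of our own, not part of the R-INLA algorithm) approximates the normalizing constant of a Gamma kernel by a Gaussian approximation around the mode, which is exactly the device used in the formulas above:

```python
import numpy as np
from scipy.special import gammaln

# Unnormalized log-density of a Gamma(a, b) kernel: g(x) = x^(a-1) * exp(-b*x).
a, b = 20.0, 2.0

# Mode of g and curvature of -log g at the mode.
mode = (a - 1) / b           # argmax of g
curv = (a - 1) / mode**2     # second derivative of -log g at the mode

# Laplace approximation: integral of g ~ g(mode) * sqrt(2*pi / curv).
log_laplace = (a - 1) * np.log(mode) - b * mode + 0.5 * np.log(2 * np.pi / curv)

# Exact value of the integral: Gamma(a) / b^a.
log_exact = gammaln(a) - a * np.log(b)

rel_err = abs(np.exp(log_laplace - log_exact) - 1)
print(rel_err)  # below 1%: the Gaussian approximation at the mode is accurate
```

Even with a single smooth mode, the one-point Gaussian approximation recovers the integral to within a fraction of a percent, which is why the same strategy scales to the high-dimensional latent field θ.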
In particular, p(θ_i | ψ, y) can be directly derived from the Normal approximation p̃(θ | ψ, y) already computed in the first step (Gaussian strategy). This can produce inaccurate approximations; however, when the conditional p(θ | ψ, y) is Gaussian, the approximation is exact and there is no need to apply the Laplace method, as in our case.
For point-referenced data (i.e. data observed at point locations typically referenced by coordinates), the latent continuous spatial process is a Gaussian Field (GF) with dense spatial covariance matrix that leads to computational issues. The SPDE approach proposed by Lindgren et al. (2011) is an alternative which consists in representing a continuous spatial process (the GF) as a discretely indexed spatial random process (i.e. a GMRF).
The continuous GF with Matérn covariance structure z(s) is the exact and stationary solution of the following stochastic partial differential equation:

(κ² − ∆)^{α/2} (τ z(s)) = W(s),  (1)

where ∆ is the Laplacian, α is a smoothness parameter, κ is a scale parameter, τ controls the variance, s is the generic spatial location and W(s) is a Gaussian spatial white noise process.
For the relationship between the SPDE in Eq. (1) and the Matérn parameters, see Appendix D. For details on how to solve the SPDE that gives a Matérn random field, see Bakka (2018).
This solution z(s) can be approximated through a weighted sum of basis functions ψ_g defined at the G vertices (nodes) of a triangulation of the spatial domain:

z(s) ≈ Σ_{g=1}^G ψ_g(s) z_g,

where the weights z_g are the values of the field at the mesh nodes. Therefore, at the discrete set of locations s_0 = (s_1, . . . , s_G), i.e. the mesh nodes, the field z follows a multivariate Normal distribution with zero mean and spatially structured correlation matrix. At the data locations we write z_i = A z(s_0), where A is the projection matrix from the mesh nodes to the data locations. This simplifies the notation, as the distribution at the data locations would otherwise have a complicated form.

D Matérn covariance function and INLA-SPDE
Let r = d(i, j) = ||s_i − s_j|| for two-dimensional domains. Then the Matérn class of stationary and isotropic covariance functions is defined as

C(r) = σ² / (2^{λ−1} Γ(λ)) (κr)^λ K_λ(κr),  (2)

where σ² = Γ(λ) / (Γ(α) (4π)^{d/2} κ^{2λ} τ²) is the marginal variance, K_λ is the modified Bessel function of the second kind and order λ > 0, λ = α − d/2 is a smoothness parameter, and d is the dimension of the domain (2 in the case of spatial processes). We refer the reader to Lindgren and Rue (2015) for a discussion of the choice of α implemented in R-INLA. Finally, κ is a scaling parameter related to the empirically derived range ρ = √(8λ)/κ, the distance at which the spatial correlation drops to approximately 0.1.
We adopt the default α = 2, which for two-dimensional domains means smoothness parameter λ = 1.
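Under these defaults (λ = 1), the behaviour of the Matérn correlation at the empirically derived range can be checked numerically. The short Python sketch below (our own illustration, with an arbitrary value of κ) evaluates the correlation function and confirms that at ρ = √(8λ)/κ the correlation has dropped to roughly 0.1:

```python
import numpy as np
from scipy.special import kv, gamma

def matern_corr(r, kappa, lam):
    """Matern correlation for r > 0: 2^(1-lam)/Gamma(lam) * (kappa*r)^lam * K_lam(kappa*r)."""
    kr = kappa * r
    return 2**(1 - lam) / gamma(lam) * kr**lam * kv(lam, kr)

lam = 1.0                        # alpha = 2, d = 2  ->  lam = alpha - d/2 = 1
kappa = 0.5                      # arbitrary scale parameter for illustration
rho = np.sqrt(8 * lam) / kappa   # empirically derived range

print(matern_corr(rho, kappa, lam))   # approximately 0.14, i.e. "about 0.1"
print(matern_corr(1e-8, kappa, lam))  # approaches 1 as r -> 0
```

The value at the range is ≈ 0.14 for λ = 1; the conventional statement that the correlation is "approximately 0.1" at ρ holds for all reasonable smoothness values.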
We need a parameterization to represent the Matérn correlation structure in INLA. The easiest way would be to assign a joint Normal prior distribution to ψ1 = log(τ) and ψ2 = log(κ). In fact, from the formula of the marginal variance σ² we can derive

log τ = (1/2) log( Γ(λ) / (Γ(α) (4π)^{d/2}) ) − log σ − λ log κ,  (3)

and from the formula of the empirical range ρ = √(8λ)/κ we obtain

log κ = (1/2) log(8λ) − log ρ.  (4)

A more naturally interpretable parameterization is in terms of the standard deviation σ and the range ρ, such as

ψ̃1 = log σ,  (5)
ψ̃2 = log ρ.  (6)

Substituting (5) in (3) and (6) in (4) we get respectively:

log τ = (1/2) log( Γ(λ) / (Γ(α) (4π)^{d/2}) ) − ψ̃1 − λ( (1/2) log(8λ) − ψ̃2 ),
log κ = (1/2) log(8λ) − ψ̃2,

so we can express τ and κ in terms of ψ̃1 and ψ̃2 (Lindgren and Rue, 2015).
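This change of variables can be verified numerically. The Python sketch below (our own, with arbitrary σ and ρ values) computes (τ, κ) from (σ, ρ) using the marginal variance and range formulas above, and checks the round trip:

```python
import numpy as np
from scipy.special import gamma

d, alpha = 2, 2
lam = alpha - d / 2  # smoothness parameter, = 1 here

def tau_kappa_from_sigma_rho(sigma, rho):
    """Invert sigma^2 = Gamma(lam) / (Gamma(alpha) * (4*pi)^(d/2) * kappa^(2*lam) * tau^2)
    and rho = sqrt(8*lam) / kappa to obtain (tau, kappa) from (sigma, rho)."""
    kappa = np.sqrt(8 * lam) / rho
    tau = np.sqrt(gamma(lam) / (gamma(alpha) * (4 * np.pi)**(d / 2))) / (sigma * kappa**lam)
    return tau, kappa

tau, kappa = tau_kappa_from_sigma_rho(sigma=2.0, rho=100.0)

# Round trip: recover sigma and rho from (tau, kappa) via the original formulas.
sigma2 = gamma(lam) / (gamma(alpha) * (4 * np.pi)**(d / 2) * kappa**(2 * lam) * tau**2)
rho_back = np.sqrt(8 * lam) / kappa
print(np.sqrt(sigma2), rho_back)  # -> 2.0 100.0
```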
More recently, Fuglstad et al. (2019) suggested an extension of the penalized complexity prior (PC prior) proposed by Simpson et al. (2017). The PC prior penalizes the Kullback–Leibler divergence between the model P and the base model P_0, i.e. the information lost when approximating P with P_0, and is suggested as a way to reduce overfitting.
Here P represents a model component, such as a GRF.
Fuglstad et al. (2019) developed a joint version of the PC prior for the range and marginal variance of Matérn GRFs with fixed smoothness and d < 4. The joint PC prior is derived from the alternative parameterization of the Matérn covariance function (τ, κ) and then transformed back to the range and variance scale (ρ, σ). The advantage of PC priors compared to the prior described above is that they can be expressed in a user-friendly form stating the upper or lower tail probability for a generic parameter of interest φ: P(φ > U) = q or P(φ < L) = q. This intuitive interpretation makes it easy to specify a vague, weak or informative prior by tuning these values.
In the case of the joint PC prior for GRFs, we must choose an upper limit for the standard deviation and a lower limit for the range: P (σ > σ 0 ) = q 1 and P (ρ < ρ 0 ) = q 2 .
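As an illustration of how these tail probabilities pin down the prior, the following Python sketch (our own, assuming the d = 2 marginal forms of the joint PC prior in Fuglstad et al. 2019: the range has density λ_ρ ρ^{−2} exp(−λ_ρ/ρ) and the standard deviation is exponential) calibrates the rate parameters from (ρ_0, q_2) and (σ_0, q_1) and checks the tail probabilities numerically:

```python
import numpy as np
from scipy.integrate import quad

# PC prior marginals for a Matern GRF in d = 2:
#   range rho:  pi(rho)   = lam_r * rho^(-2) * exp(-lam_r / rho)
#   sd sigma:   pi(sigma) = lam_s * exp(-lam_s * sigma)
# Calibrate the rates so that P(rho < rho0) = q2 and P(sigma > sigma0) = q1.
rho0, q2 = 10.0, 0.05    # e.g. P(range < 10) = 0.05
sigma0, q1 = 1.0, 0.05   # e.g. P(sd > 1) = 0.05

lam_r = -rho0 * np.log(q2)    # from P(rho < rho0) = exp(-lam_r / rho0) = q2
lam_s = -np.log(q1) / sigma0  # from P(sigma > sigma0) = exp(-lam_s * sigma0) = q1

# Verify the stated tail probabilities by numerical integration.
p_rho, _ = quad(lambda r: lam_r * r**-2 * np.exp(-lam_r / r), 0, rho0)
p_sig, _ = quad(lambda s: lam_s * np.exp(-lam_s * s), sigma0, np.inf)
print(round(p_rho, 4), round(p_sig, 4))  # -> 0.05 0.05
```

In R-INLA this specification corresponds to the prior.range and prior.sigma arguments of inla.spde2.pcmatern.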

E Implementation of the joint model through the INLA-SPDE approach
Because we have misaligned data, we describe the joint model with three likelihoods and three linear predictors (section 3.2, equations 1, 2, 3).
To implement this in R-INLA, we need to create a complex data structure: the matrix of observations M is defined as a block matrix with the number of columns corresponding to the number of likelihoods and each block row corresponding to the data used to estimate one of the linear predictors (Martins et al., 2013).
Through the R-INLA copy function, the z_1 and z_2 random effects are included in the linear predictors, so that each of them shares its hyperparameters across the linear predictors in Eqs. 1-3 while also having its own scaling parameter for calibration purposes (Gómez-Rubio, 2019; Rue et al., 2017).
Since the SPDE provides the approximation of the entire spatial process at the mesh nodes, there is no actual alignment procedure involved here. For the blocks with spatial structure (all three in our case), we just need a link between the mesh nodes and the locations at which the value is known; this link is provided by a projector matrix defined by built-in functions. Because these locations are different, we need one projector matrix for the PCM data y 1 , one for the AQUM data y 2 and one for the ground observations y 3 .
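To make the role of the projector matrix concrete, here is a one-dimensional toy analogue in Python (our own sketch, not the R-INLA implementation, which uses barycentric weights on the 2D triangulation): each row of A holds the basis-function weights of one data location, so A z interpolates the latent field from the mesh nodes to the data locations.

```python
import numpy as np

def projector_matrix(nodes, locs):
    """1D analogue of the SPDE projector matrix A: row i holds the
    piecewise-linear basis weights of data location i w.r.t. the mesh nodes,
    so (A @ z) linearly interpolates the field z from the nodes."""
    A = np.zeros((len(locs), len(nodes)))
    for i, s in enumerate(locs):
        j = np.searchsorted(nodes, s) - 1          # interval containing s
        j = np.clip(j, 0, len(nodes) - 2)
        w = (s - nodes[j]) / (nodes[j + 1] - nodes[j])
        A[i, j], A[i, j + 1] = 1 - w, w            # weights sum to 1
    return A

nodes = np.array([0.0, 1.0, 2.0, 3.0])  # mesh nodes
locs = np.array([0.25, 1.5, 2.9])       # data locations
z = np.array([0.0, 2.0, 2.0, 4.0])      # latent field values at the nodes

A = projector_matrix(nodes, locs)
print(A @ z)  # -> [0.5 2.  3.8]
```

In our model one such matrix is built (by the R-INLA helper functions) for each of y_1, y_2 and y_3, because their locations differ.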
For each block, an R-INLA stack object is created to link the data and/or the projector matrix to the model effects included in the linear predictor: for the y_1 stack we have the intercept and a single random effect represented by the spatial index; for y_2, the intercept, a temporal and a spatial index; and for y_3, the intercept, the site type covariate, the spatial index for z_1, the temporal index for z_2 and the time-site type interaction for z_3.
All the stack objects are then put together and passed on to the inla call as data.
With this data structure it is easy to include a validation set: assuming we select m sites for validation and the remaining n = s_3 − m for estimation, we just need to set the observations corresponding to the validation locations to NA, so that R-INLA will treat them as unknown and predict their values. In this case the matrix M keeps its block structure, with the y_3 block containing the observed values at the n estimation sites and NA at the m validation sites.
We also need the projector matrix associated with the validation set and the corresponding stack object. The prediction is computed at each new location (either a validation site or on a regular grid) through the procedure described in Section 3.5.

F Retrieving site type at unknown locations
To determine the site type at unknown locations, we tested 12 different approaches on the known monitoring sites, combining four different land cover sources of information to retrieve the rural/urban classification with three different rules to determine the road-kerb side classification based on the distance from a major or minor road (Greater London Ordnance Survey minor and major roads ESRI shapefile; Ordnance Survey 2017).
The first source for land cover is the Corine land cover for the year 2012 for the UK, Jersey and Guernsey shapefile from the Centre for Ecology and Hydrology (Cole et al., 2015), the second is the MODIS land cover type 1 classification raster at 500m resolution for year 2005 from NASA LP DAAC (2013), and the last two are the Global Urban Footprint (GUF) rasters at 1km and 12m resolution respectively (DLR, 2016;Esch et al., 2017). In all cases, the non-urban land cover types are aggregated to rural, and we assume the land cover did not change significantly over the study period.
We applied the following three rules to all the land cover data: (R1) The site type is defined as road-kerb side if the location is within 4 m from any road, otherwise urban or rural according to the land cover shapefile. This rule combined with the MODIS land cover is the one applied by Mukhopadhyay and Sahu (2017).
(R2) The site type is defined as road-kerb side if the location is within 10 m from any road, otherwise urban or rural according to the land cover shapefile.
(R3) The site type is defined as road-kerb side if the location is within 50 m from a major road or within 10 m from a minor road, otherwise urban or rural according to the land cover shapefile. This rule accounts for the different widths of the roads, assuming the road midline corresponds to the center of the street.
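The three rules can be sketched as a single classification function; the Python snippet below is our own illustration (the function name and argument names are hypothetical, and distances are in metres):

```python
def classify_site(dist_major_m, dist_minor_m, land_cover, rule="R3"):
    """Assign a site type from road distances (metres) and a binary
    land-cover label ('urban'/'rural'), following rules R1-R3 above."""
    if rule == "R1":
        kerb = min(dist_major_m, dist_minor_m) <= 4
    elif rule == "R2":
        kerb = min(dist_major_m, dist_minor_m) <= 10
    elif rule == "R3":
        # Wider buffer for major roads, assuming distance is to the road midline.
        kerb = dist_major_m <= 50 or dist_minor_m <= 10
    else:
        raise ValueError(f"unknown rule: {rule}")
    return "road-kerb side" if kerb else land_cover

print(classify_site(30, 200, "urban"))       # -> road-kerb side (major road within 50 m)
print(classify_site(80, 15, "rural"))        # -> rural
print(classify_site(80, 15, "urban", "R2"))  # -> urban
print(classify_site(80, 3, "urban", "R1"))   # -> road-kerb side
```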
The percentage of correctly classified monitors for each method is reported in Table F.1.
The Corine shapefile seems to provide more accurate information compared to the MODIS raster, and combined with the 10/50m rule it gives the highest percentage of correct classification (64.8%). This method was therefore applied to the unknown locations.
The data workspace can be requested directly from the corresponding author.