Efficient and automatic methods for flexible regression on spatiotemporal data, with applications to groundwater monitoring

Fitting statistical models to spatiotemporal data requires finding the right balance between imposing smoothness and following the data. In the context of P‐splines, we propose a Bayesian framework for choosing the smoothing parameter, which allows the construction of fully automatic data‐driven methods for fitting flexible models to spatiotemporal data. An implementation, which is highly computationally efficient and exploits the sparsity of the design and penalty matrices, is proposed. The findings are illustrated using a simulation study and two examples, all concerned with the modelling of contaminants in groundwater. This suggests that the proposed strategy is more stable that competing methods based on the use of criteria such as generalised cross‐validation and Akaike's Information Criterion. © 2015 The Authors. Environmetrics Published by John Wiley Sons Ltd.


INTRODUCTION
Spatiotemporal data have become ubiquitous. In some settings, this has been driven by the development of affordable technology for data collection where spatially located networks of sensors collect data over time. In environmental monitoring, multiple sensors are routinely used to gather data over time, in air, water or land settings. Brain imaging using electro-encephalography or magneto-encephalography is another example where around 200 sensors each record brain signals at very high time resolution, generating large volumes of data. In many scientific contexts, measurements are increasingly made automatically, leading to high-resolution data with a strong degree of regularity, while on other occasions, visits to sites of interest by trained personnel may be required, leading to sparser and more irregular data patterns. The problem discussed in the present paper deals with measurements of groundwater collected from wells and sent for subsequent lab analysis. Barcelona et al. (1985) describe the practical details. As mentioned earlier, the practicalities and cost of this inevitably lead to irregularity in time and also in space.
Models for the analysis and interpretation of spatiotemporal data have developed rapidly to match the demands of the data now available and the underlying questions. Sometimes, prediction is the aim, while on other occasions, interest can be directed at assessing the mean levels of the measurement and evidence for change over time. Banerjee et al. (2004), Finkenstädt et al. (2007 and Cressie and Wikle (2011) provide excellent entry points to the large literature on spatiotemporal modelling, with the last book very helpfully giving coverage of modern hierarchical and dynamic methods in both breadth and depth. These models are usually implemented in a Bayesian setting. In the wider literature, a unifying theme is the expectation that the spatial and temporal patterns exhibited will not follow simple parametric forms, so that models, which can express flexible, but generally smooth shapes, are required. One approach is to apply flexible forms of regression, described for example by Wood (2006), in the spatiotemporal setting. Bowman et al. (2009) take this approach to the modelling of sulfur dioxide over Europe throughout the 1990s. P-splines, described by Eilers and Marx (1996), and more general regression splines, offer a very interesting approach through the use of relatively low-dimensional sets of basis functions, and Lee and Durban (2011) apply this to the spatiotemporal modelling of ozone over Europe. The formulation of P-splines offers an interpretation in terms of mixed effects, and Ruppert et al. (2003) showed a wide range of settings to which these models can be applied when the random effect interpretation is appropriate. A fully Bayesian P-splines model was introduced by Lang and Brezger (2004), with inference carried out by Markov chain Monte Carlo MCMC. Fahrmeir et al. (2004) adopted a model of this type in the specific setting of spatiotemporal data with an empirical Bayes approach, which returns again to a mixed-model representation. Brezger and Lang (2006) provided a wider range of models and efficient updating schemes, while Brezger and Lang (2008) discussed simultaneous probability statements for Bayesian P-spline models, again in the context of MCMC implementation. More recently, Wood (2011) explored the Restricted maximum likelihood (REML) approach in detail and developed a fast implementation in a generalised linear modelling framework.
The context of the application discussed in this paper is the monitoring of contamination in groundwater. It is clearly important to assess water quality and its associated risks to human health and the wider environment, and in particular, to detect sudden increases in contaminant concentration due to possible releases. The contaminants in the groundwater are measured using water samples collected from wells and sent for subsequent lab analysis. The practicalities and cost of this inevitably lead to irregularity in time and also in space, even when operating within a fixed set of sampling locations determined by the well positions. The data collection and assessment activity are generally undertaken by staff who have science or engineering backgrounds, but may not have had advanced training in statistical methods. However it is impractical that results should always be referred back to others for statistical analysis, and so there is a practical need for statistical tools that can be implemented easily and robustly as a routine part of the work of those environmental professionals. The analysis therefore needs to be fully automatic and fast to not only carry out but also to produce results, which are reliable, informative, and to aid robust project decision-making.
The aim of the present paper is to address these issues. In order to allow the construction of flexible models over space and time, Psplines are used because of their ability to provide compact representations and to express smoothness control in simple forms, as described in Section 2. A fully Bayesian spatiotemporal model is introduced in Section 3, using conjugate priors to avoid the need for MCMC implementation. In particular, the issue of selecting the degree of smoothness in the model is also addressed in order to produce a fully automatic procedure. A focus will be on issues of 'ballooning', where predictions can be high in areas where there are no data, and this is identified and addressed by appropriate choices of the number of basis functions and the type of smoothness penalty used. The need for speed is addressed through matrix decompositions, which enable the parameter that controls smoothness to be separated out from the computationally intensive parts of the calculation, along similar lines to those used by Ruppert et al. (2003), but also exploiting the sparsity of the design matrices associated with the spline basis.

SPATIOTEMPORAL SMOOTHING BY P-SPLINES
The P-spline approach to smoothing has become widely used because of its simplicity and its 'low rank' representation of the function of interest. In the simplest case where responses y i and covariate values x i are observed for a sample i D 1; : : : ; n, the model y i D m.x i / C " i describes a flexible underlying relationship through the nonparametric regression function m whose form is unspecified beyond an assumption of smoothness. A basis approach assumes this function can be expressed through a linear combination P p j D1˛j j .x/, where the functions j .x/ are usually taken to be B-splines (usually of order 3) because of their efficient construction from polynomial pieces. By modifying the values of the coefficients˛j , a huge range of smooth functions can be created by weighting the local P-spline basis functions, which are centred at a grid of values along the x-axis.
A simple method of extending this to the spatiotemporal setting, where data y i are indexed over space .s 1i ; s 2i / and time t i , for i D 1; : : : ; n, is to express the regression function as m.s 1 ; s 2 ; t/ D P j P k P l˛jkl j .s 1 / k .s 2 / l .t /, which corresponds to using the tensor product of the marginal B-spline bases. This uses a basis set, which is simply the product of all triples of the marginal basis functions over s 1 , s 2 and t . This can be conveniently expressed in vector-matrix form as Y D B˛C ", where Y and " are vectors of response data and error terms,˛is the vector of parameters˛j kl and the design matrix B consists of the basis functions (columns) evaluated at each data point (rows). The design matrix B can be efficiently constructed through 'row-wise' Kronecker products of the marginal basis functions (Ugarte et al., 2010;Lee and Durban, 2011). The marginal spline bases make use of equally spaced knots, as this simplifies the construction of the penalty that will be employed to regularise the spline coefficients. Tensor product splines are not the only spline-based approach for covering the spatial domain. One alternative method commonly used in this context (see e.g. Kammann and Wand, 2003) are thin-plate splines. The latter have the advantage that they can adapt to the spatial density of the data and thus may lead to a more efficient placement of the spline knots, however, at the expense of having to run an additional algorithm for their placement.
Although this model can be fitted by simple least squares, Eilers and Marx (1996) proposed to use a dense set of basis functions in conjunction with a penalty term to control the degree of smoothness in the estimate. Specifically, the parameter estimate is chosen to be the value of˛that minimises where the matrix D computes successive differences across the sequence of˛j kl 's in each of the three covariate dimensions. Second-order differences are often used. Large values of the smoothing parameter thereby induce smoothness in the values of˛and hence in the estimated function m. The solution for the basis coefficients is easily seen to be Ǫ D .B T B C D T D/ 1 B T y. The trace of the matrix B.B T B C D T D/ 1 B T , which creates the fitted values from the data vector y, is defined as the 'effective degrees of freedom' by analogy with standard linear models. This gives a more intuitive scale on which the smoothness of the estimate can be expressed. The details of these methods are described by Eilers and Marx (1996), Ruppert et al. (2003), Wood (2006) and many other authors. In particular, Lee and Durban (2011) and Ugarte et al. (2010) discuss P-splines in the spatiotemporal setting. In all forms of flexible or nonparametric regression, the choice of the degree of smoothness for the estimator is a crucial one, and many authors have addressed this issue. Widely used approaches include cross-validation (CV) that, using spatiotemporal notation, chooses to minimise P n iD1 fy i O m i .s 1i ; s 2i ; t i /g 2 , where the subscript i indicates that observation i is not included in the construction of the estimate. Generalised cross-validation (GCV) is a popular variation. Akaike's Information Criterion (AIC) and its variation Akaike information criterion (AICc), described by Hurvich and Tsai (1989), are also widely used as a means of balancing the goodness-of-fit of the model against its complexity. The general thinking is that AICc is less affected by the under-smoothing to which AIC and cross-validation are sometimes prone. Another commonly used criterion is BIC. The details of all these methods are discussed by many authors, with Wood (2006) as a good starting point.
In the groundwater monitoring setting described in Section 1, the choice of the degree of smoothing is particularly important, as it needs to be implemented in an unsupervised, automatic setting. The panels in Figure 1 show the predicted concentrations at a single time snapshot from a dataset of benzene measurements, using a P-spline spatiotemporal model. The top-left panel shows the effects of using AICc in selecting the smoothness of the estimate. A troubling feature is that there are areas of high predicted values, which are not well supported by the observed data ('ballooning'). The top-right panel shows the results when four wells are omitted. Even though the four omitted wells recorded low concentrations, omitting them causes the unsupported peaks of the posterior distribution to disappear, indicating strong sensitivity of the results to particular observations. A further source of concern is that removing these wells changes the predictions in the immediate vicinity of the wells less than predictions further away. GCV and observation-based cross-validation suffer from the same problem. The results obtained by a Bayesian approach (shown in the bottom two rows) are much less sensitive to the removal of the four wells. The results for cross-validation depend on whether the well structure is used when omitting observations. This will be discussed in more detail in Sections 5 and 6. The single most important factor that causes this 'ballooning' is the design used for collecting the data. However, this design is typically imposed by external constraints and thus cannot be changed.
Another key contributing factor to the 'ballooning' problems is the choice of the number of basis functions used. The computational advantage of using splines over other methods such as kriging depends strongly on the number of basis functions. In almost all practical applications, this number is chosen by finding a reasonable compromise between run time and memory usage. This is not a problematic issue for one-dimensional or two-dimensional data where a moderate number of basis functions (say around 25) present little computational challenge. For spatiotemporal data however, using 25 basis functions for each dimension would result in having to perform expensive matrix operations on a 15; 625 15; 625 matrix, which in the context discussed here is not a realistic option. Figure 1. Predictions of the concentration (in g=`) for the benzene data on one particular day (Section 6.1) using the penalisation parameter chosen by optimising the different criteria, as well as fully Bayesian model averaging. The left column was obtained by using all wells, the right column was obtained after removing four wells (marked by crosses In the context of spatiotemporal models, a low number of basis functions are usually chosen, and this raises the question of whether it is also necessary to use a penalty to induce smoothness, given that the low number of basis functions already offers some protection against over-fitting. The results discussed in Sections 5 and 6, however, show that using a penalty is especially important in this case, as it helps to prevent 'ballooning'.
The methods of selecting smoothness and the number of basis functions used are not the only factors in the appearance of this 'ballooning' effect of high and low values. The use of a second-order smoothness penalty encourages the appearance of linear sections, especially if there is a gap in the data, as these linear sections do not attract any penalty. Depending on the data around the gap, this can lead to a high peak or deep valley with little support from neighbouring observations. Using a first-order smoothness penalty typically lessens the problem to some extent, but is, on its own, often not enough to avoid 'ballooning' altogether. Choosing a different transformation of the response variable, such as the logarithm or the square root, can, depending on the data and the transform used, also have an effect.
There is a vast literature on the selection of the smoothing parameter for spline-based models in order to avoid over-fitting or under-fitting. The focus of this work is on the problem of 'ballooning', which is a different phenomenon.

BAYESIAN SPATIOTEMPORAL SMOOTHING
This section sets out how a Bayesian framework can be used to select the smoothing parameter. The starting point in the classical Bayesian linear model formulation (see e.g. Denison et al., 2002) is adapted to the spatiotemporal setting. If the model determined by a particular value of the penalisation parameter is denoted by M then, using the model described earlier, the likelihood function is derived from For a fixed value of smoothing parameter, a conjugate prior for the parameters˛; 2 is the normal-inverse gamma, that is, In Sections 5 and 6, we use a D b D 0:0001 to acknowledge the uncertain prior information on the parameter 2 . The posterior distribution of the penalisation parameter can be shown to be This is a special case of model comparison of Bayesian linear models, here using as the model index; for a general result see for example, Denison et al. (2002, equation 2.24, which gives the general principle for the comparison of Bayes factor). One difficulty in the present context is the degenerate nature of the prior for˛, expressed in the rank deficiency of the differencing matrix D. This can be handled by use of an additional ridge penalty, which gives the matrix D full rank. The posterior distribution shown earlier can be obtained by considering the limit as this ridge penalty goes to 0. Alternatively, it is possible to decompose the regression coefficients˛into two components, one of which has a flat improper prior and the other a proper Gaussian prior. This approach is discussed in more detail in the Appendix. The main difference between this approach and the random-effects formulation of Ruppert et al. (2003) is that both the regression coefficients and the variance are handled in a Bayesian way, resulting in a fully Bayesian model.
In Sections 5 and 6, a non-informative improper uniform prior is used for f M . The value of that maximises this posterior density, known as the MAP (maximum a posteriori) value, is then adopted for penalisation. Additionally, a fully Bayesian approach, numerically integrating out the penalty parameter , is considered.
The use of conjugate priors is a computational convenience that addresses the need for fast computation, as MCMC or other solutions required for dealing with non-conjugate priors would be infeasible, given the time constraints which apply. Only when a conjugate prior is placed on˛and 2 is it possible to compute efficiently the marginal posterior probability of a given model M . The proposed methodology works with any prior placed on M . The flat prior used in Sections 5 and 6 is used for convenience. The results are however not sensitive to reasonable choices for the prior on M .

COMPUTATIONAL ISSUES
Computation of the MAP distribution requires the determinant of the posterior covariance matrix of˛as well as the posterior residual sum of squares, which in turn requires computation of the penalised least-squares estimator minimising the objective function (1). The penalised least-squares estimator and the determinant have to be recomputed for every value of under consideration. This requires matrix operations, which are O.p 3 /, where p is the number of regression parameters. For a spatiotemporal penalised spline model using p 0 basis functions in each dimension, we have p p 3 0 ; thus, the cost of the matrix operations is O.p 9 0 / However, the problem can be rewritten (see e.g. Wood, 2000;Ruppert et al., 2003) in such a way that the expensive linear algebra operations can be performed independently of , and only O.p 2 / operations have to performed for every value of . This allows the MAP solution to be computed much more efficiently.
Both the matrix of basis functions B and the differencing matrix D are sparse. Exploiting this sparseness allows further improvement in computational efficiency. However, for a trivariate P-spline problem with a moderate number of knots, the matrix B is much more dense than the matrix D. For example, if 10 basis functions are used for each dimension, roughly 14% of the entries of B are non-zero, whereas less than 0.3% of the entries of D are non-zero. Sparsity can therefore be exploited most effectively by working initially only on the matrix D, in contrast to the approaches set out by Wood (2000) and Ruppert et al. (2003), which start by manipulating the matrix B. The approach set out in detail in the Appendix is loosely based on the method described by Eldén (1977). The core idea is to exploit the sparseness of the matrices for almost all matrix operations. However, each matrix decomposition creates 'in-fill', and so the matrices become increasingly dense. Only the final step, a singular value decomposition, is computed using dense methods. This implies that exploiting the sparsity of the design matrix and penalty matrix will not allow the implementation to become much more than twice as fast as the corresponding dense methods. However, in the context of this work, where the aim is to obtain results within less than a minute, this offers a significant increase in speed. In addition, exploiting the sparseness reduces the amount of memory used.

SIMULATION STUDY
In this section, a simple simulation study is used to compare the different methods of selecting the smoothing parameter in a systematic way. The data are simulated from a highly idealised model for the spread of a solute in water. This is based on the partial differential equation @y @t D D @ 2 y @s 2 1 C @ 2 y @s 2 2 ! C 1 .s 1 ; s 2 / @y @s 1 C 2 .s 1 ; s 2 / @y @s 2 Here, y denotes the concentration of the solute, s 1 and s 2 denote the spatial coordinates and t 2 OE0; 1 denotes time. The first term describes the spread of the solute in the groundwater by diffusion, with the constant D controlling how fast the solute spreads. The two further advection terms describe how the solute is affected by groundwater flow, whose direction and velocity is represented by the functions 1 and 2 . These functions were chosen to correspond to the observed groundwater levels in the benzene example discussed in Section 6.1. Figure S1(a) shows the assumed groundwater levels and flow, which in the simulations are assumed for simplicity to be constant over time, as well as the assumed initial spread of the solute and its evolution over time. The 'true' concentrations were obtained by interpolating the numerical solution to the differential equation, computed over a 100 100 100 regular grid. Observed measurement data were generated by multiplicative Gaussian error terms, with standard deviation chosen to give a signal-to-noise ratio on the log-scale of 10 W 1. This reflects the fact that measurements of the solutes are usually quite accurate. A very small value of 0.05 was used for within-well correlation of the data, while the between-well correlation was assumed to be 0. Before the data were analysed, they were transformed using the function log.y C1/. The additive term was introduced because the simulations can produce concentrations of exactly 0. All model fitting and evaluation were performed on the transformed scale.
A P-spline model using second-order basis functions and a first-order penalty was used with 14 basis functions for easting, eight for northing and five for time. The different number of basis functions for space matches the different extents of the monitored region in easting and northing in the guiding example, while the reduced number of basis functions for time was chosen to reflect the fact that concentrations vary more quickly in space than in time. Addressing these issues through the basis functions allows a single smoothing parameter to be used in the model. Where little a priori information on solute behaviour is available, a natural default would be to choose a common number of basis functions in each dimension. The overall number of basis functions is deliberately chosen to be rather low to allow fast computations. Experimentation has shown these numbers of basis functions to be effective from this perspective, in addition to preventing 'ballooning' or over-fitting and producing good estimates of the underlying solute patterns.
Three different designs were used. The first scenario uses exactly the same well coordinates and sample dates as the benzene example discussed in Section 6.1. It consists of 1402 observations sampled at 29 well locations. The second scenario uses a much larger number of 280 randomly placed wells that are sampled much less frequently, resulting in the same number of observations. The second scenario is a much better design from a statistical point of view but is, of course, much more expensive, as establishing a new well is considerably more costly than collecting a sample from an existing one. The third scenario uses the same wells as the first scenario, but only has 100 observations in total, with each well sampled only about four times on average.
For all three scenarios, the methods were compared using the integrated squared error Z Z Z S . O m.s 1 ; s 2 ; t/ m.s 1 ; s 2 ; t// 2 ds 1 ds 2 dt The integral was computed numerically over the inside of the convex hull of the observed wells and sampling dates. Table 1 shows the results obtained from 500 replications for all three scenarios. From the table, it is immediately clear that no one method outperforms all other methods for all three scenarios.
Out of the three scenarios presented, only scenario one is prone to 'ballooning'. In this situation, AICc and GCV show poor performance. These two criteria, as well as observation-based cross-validation, lead to severe 'ballooning'. The Bayesian approaches (BIC, MAP and model averaging) give much better performance, with no evidence of 'ballooning'. The predicted concentration surfaces obtained using the different criteria are given in Figure S2. Figure 2 (a) shows density strip plots (Jackson, 2008) of the distribution of the smoothing parameter for each method. This shows that the Bayesian approaches and well-based cross-validation select values of the smoothing parameter , which are large enough to prevent 'ballooning'. The problems with other methods are caused by values of ; which are too low.
Although BIC performs very well if the focus is on preventing ballooning, it is prone to under-fitting. In the second scenario, which provides the 'best' data for estimating the concentrations, BIC performs significantly worse than the other methods. As Figure 2 (b) shows, this is because of selecting a value for , which is too large. In all three scenarios, the MAP and the fully Bayesian approach give good results, being the best method in the second and the third scenarios.
Cross-validation is, by far, the most computationally demanding method. In the simulations, and in Section 6, 10-fold cross-validation was used. The results depend on how the cross-validation is carried out. One option is to remove entire wells rather than single observations, and in this case, cross-validation favours very large values of the penalty parameter. In contrast, if the well structure is ignored and individual observations are removed, cross-validation favours very small values of the penalty parameter. The reason for the difference is that in this dataset, 'ballooning' occurs only in space and not in time. There is a small number of wells, and these are sampled very frequently in time.
Omitting observations individually typically does not create gaps in time, which are large enough to allow 'ballooning' at individual wells. Cross-validation can therefore address 'ballooning' only if a well is omitted entirely. The difference between the two variants is much less pronounced in the second and third scenarios.
The proposed model can be extended by allowing two separate smoothing parameters, one for space and one for time. Only one of these tuning parameters can be tuned using the efficient linear algebra in Section 4. Tuning the second smoothing parameter requires recalculating all the matrix decompositions described in the Appendix for each candidate value of the second smoothing parameter, which significantly slows down the computations. However, when two smoothing parameters are used in scenario one, the temporal smoothing parameter is estimated to be around 20 times the spatial smoothing parameter. This might suggest that the use of a single smoothing parameter is inappropriate, but the models with two smoothing parameters are more likely to suffer from ballooning, leading to a poorer fit. If the ratio of the smoothing parameters is chosen to minimise the mean-square error on the test data, one would choose both smoothing parameters to be roughly equal, suggesting that a single smoothing parameter helps preventing ballooning and allows a more robust estimation of the concentration surface. This supports the use of a single smoothing parameter.

Monitoring of benzene in groundwater
Benzene (C 6 H 6 ) is a constituent of crude oil and refined petrol, which can have serious adverse health (and ecological) effects if released into the environment. A release from an underground storage tank system can result in benzene contaminating the groundwater below the storage tank system. After such releases, networks of wells are set up to monitor possible groundwater contamination. The contaminant of interest is the concentration of benzene in g=`(modelled on a log-scale). The data consist of 1402 observations, which were obtained from a network of 29 wells, with considerable irregularity in the spatial and temporal spacing of the observations. A p-spline model using second-order basis functions and first-order difference penalties was fitted, with the smoothing parameter determined using different criteria. Different numbers of basis functions were chosen for the two spatial dimensions, with 18 for easting and 11 for northing, to reflect the different spatial extents in these directions. To reflect the fact that the level of spatial variation is expected to exceed the temporal variation, a smaller number (7) of basis functions were used for the time margin. The choice of the numbers of basis functions therefore provides a simple device for allowing appropriately differential degrees of smoothing in different dimensions, while retaining the very large computational advantage of a single overall penalty parameter. Figure 3 depicts the choice of the optimal value of the penalty parameter for this example. In the case of the Bayesian MAP approach, the optimal choice is given by the value of , which maximises its posterior distribution. It is noticeable that AICc, GCV and observationbased cross-validation all lead to very low estimates of the smoothing parameter, which effectively 'switches off' the penalty, leading to 'ballooning'. BIC and well-based cross-validation select a much larger value of the smoothing parameter, preventing 'ballooning'. Using a fully Bayesian approach (MAP or full model averaging) results in a smoothing parameter, which is smaller than the one selected by BIC and well-based cross-validation but is still big enough to prevent 'ballooning'. The posterior distribution of is typically quite narrow, so using the MAP usually gives results very similar to a fully Bayesian treatment of the smoothing parameter, which involves averaging over . This is borne out in this example, as well as in the simulations in Section 5. As shown in Figure 1 and discussed in Section 2, the MAP is not sensitive to the removal of the four wells, shown as crosses in the right-hand column of plots. In contrast, the values chosen by AICc, GCV and observation-based cross-validation are highly sensitive, with removal of the wells resulting in a big change in the predictions with much more credible predicted concentrations. This highlights that the well design plays a key role.

Monitoring of methyl tertiary butyl ether in groundwater
A more extensive example of the use of these techniques is provided by retrospective analysis of a dataset on a pollution event at a refinery site. MTBE (methyl tertiary butyl ether) is a petrol additive designed to reduce engine knocking and noxious emissions. MTBE is no longer in routine use at the site studied but was present in the refinery at the time of the event. On entry to groundwater, MTBE moves conservatively because of its high aqueous solubility and low retardation potential. It degrades only slowly under anaerobic conditions. Figure 4 shows a schematic plan of the site with colour-coded points to indicate the concentrations of MTBE measured at the monitoring wells at a date near the time of the MTBE release. Standard methods of analysis in this setting were to inspect individual well measurements over time to identify trends. Geographical information systems were available and these were helpful for individual time snapshots but these could not easily be adapted to show the evolving dynamics of the incident. Figure 5 (and the earlier Figure 1) was created using the rp.spacetime function from the current version of the rpanel (Bowman et al., 2007) package for R (R Development Core Team, 2011). This shows the estimated pollution surface at four time points using the Bayesian smoothing model described in Section 3, using 18 basis functions for easting, 22 basis functions for northing, 14 basis functions for time and the MAP estimate of . Despite the presence of protective pumping wells at the north-west boundary of the refinery site, the threat of MTBE migrating across the site boundary and potentially reaching drinking water wells required immediate action. The first time point shown in Figure 5 corresponds to the upgrading of a line of wells used to form a flow barrier in the middle of the site. The effectiveness of these wells was greatly improved, and the resulting curtailment of the plume to the north-west is apparent. Subsequently, the source of the MTBE release was identified near the south-east corner of the site, and the model clearly tracks the dissipation and attenuation of MTBE and the end of the incident. The shape and direction of the plume are clear and consistent with the south-east/north-west gradient in groundwater flow. Figure S4 shows the estimate of the time trend for four selected wells.   In order to assess the effect of the number of basis functions, we have also investigated a model with fewer basis functions (14 basis functions for easting, 17 basis functions for northing and 10 basis functions for time) and a model with an increased number of basis functions (22 basis functions for easting, 26 basis functions for northing and 15 basis functions for time). The smaller model uses about half of the number of parameters of the model shown in Figures 5 and is about eight times faster, as suggested by the computational complexity of the final eigenvalue decomposition. The larger model has about twice as many parameters than the aforementioned model and takes about eight times as long to fit. The smaller model suffers from ballooning in the North East, whereas the larger model gives results similar to the model used. Plots of the estimated concentration surface obtained from these two models are available in Figure S5.

DISCUSSION
A fully automatic Bayesian framework for determining the smoothing parameter in spatiotemporal P-spline models has been proposed. The focus was on a situation where the key objective was to deliver, on a fast timescale, automatic and robust estimates of the distribution of a solute in groundwater and the corresponding plume geometry. In particular, there was a need to avoid spurious local extrema of the predictive surface with little support in the data ('ballooning'), which can sometimes occur in regions where the well design is sparse.
In our experience, and evidenced by the simulation study and the two real-world examples presented, the Bayesian methods studied are more stable than competing strategies based on criteria such as the AICc or GCV. While BIC is very good at avoiding 'ballooning', it can also lead to over-smoothing, to which the other Bayesian methods are less prone. If used appropriately, which is difficult to judge without prior knowledge, cross-validation can be very effective at preventing 'ballooning', but it has a rather high computational cost.
Although our focus was on spatiotemporal models, the methods can also be applied to other smoothing problems in which the use of a single smoothing parameter is appropriate, possibly after rescaling of parts of the penalty or adjustment of the number of basis functions.
The use of splines is not the only way of constructing spatiotemporal models for the contamination of groundwater. A particularly attractive alternative would be the use of a model based on the underlying physical processes. However, such models require a good understanding of the geology of the site, which in turn requires additional information that is not always readily available.
In our experience the problem of 'ballooning' is not limited to splines. 'Ballooning' can also occur when using other techniques such as kriging with a Matérn covariance. In the latter case, the severity of the problem of 'ballooning' depends on how the shape parameter of the covariance function is chosen, and also whether the data adhere to the implied assumptions of stationarity and isotropy. The P-spline based approach has the key advantage of not requiring the latter assumption.
The methodology set out above is implemented in GWSDAT, a fully automatic tool for the analysis of groundwater contaminants developed by Shell Global Solutions (Jones et al., 2014).