Quantifying subsurface parameter and transport uncertainty using surrogate modelling and environmental tracers

We combine physics‐based groundwater reactive transport modelling with machine‐learning techniques to quantify hydrogeological model and solute transport predictive uncertainties. We train an artificial neural network (ANN) on a dataset of groundwater hydraulic heads and 3H concentrations generated using a high‐fidelity groundwater reactive transport model. Using the trained ANN as a surrogate model to reproduce the input–output response of the high‐fidelity reactive transport model, we quantify the posterior distributions of hydrogeological parameters and hydraulic forcing conditions using Markov chain Monte Carlo calibration against field observations of groundwater hydraulic heads and 3H concentrations. We demonstrate the methodology with a model application that predicts Chlorofluorocarbon‐12 (CFC‐12) solute transport at a contaminated field site in Wyoming, United States. Our results show that including 3H observations in the calibration dataset reduced the uncertainty in the estimated permeability field and infiltration rates, compared to calibration against hydraulic heads alone. However, predictive uncertainty quantification shows that CFC‐12 transport predictions conditioned to the parameter posterior distributions cannot reproduce the field measurements. We found that calibrating the model to hydraulic head and 3H observations results in groundwater mean ages that are too large to explain the observed CFC‐12 concentrations. The coupling of the physics‐based reactive transport model with the machine‐learning surrogate model allows us to efficiently quantify model parameter and predictive uncertainties, which is typically computationally intractable using reactive transport models alone.

groundwater reactive transport model. Using the trained ANN as a surrogate model to reproduce the input-output response of the high-fidelity reactive transport model, we quantify the posterior distributions of hydrogeological parameters and hydraulic forcing conditions using Markov chain Monte Carlo calibration against field observations of groundwater hydraulic heads and 3 H concentrations. We demonstrate the methodology with a model application that predicts  solute transport at a contaminated field site in Wyoming, United States. Our results show that including 3 H observations in the calibration dataset reduced the uncertainty in the estimated permeability field and infiltration rates, compared to calibration against hydraulic heads alone. However, predictive uncertainty quantification shows that CFC-12 transport predictions conditioned to the parameter posterior distributions cannot reproduce the field measurements. We found that calibrating the model to hydraulic head and 3 H observations results in groundwater mean ages that are too large to explain the observed CFC-12 concentrations. The coupling of the physics-based reactive transport model with the machine-learning surrogate model allows us to efficiently quantify model parameter and predictive uncertainties, which is typically computationally intractable using reactive transport models alone. Predicting the evolution of groundwater quality at contaminated sites is a foremost concern for current and future water resource management. Yet, field-scale contaminant transport that spans decadal to century-long timescales is impractical to directly measure and remains uncertain (Dam et al., 2015;Hammond & Lichtner, 2010;Zachara et al., 2013). It is increasingly apparent that aquifer biogeochemical conditions and contaminant transport dynamics are influenced by groundwater flow with varying residence times (Bea et al., 2013;Manning et al., 2015;Visser et al., 2009). For instance, groundwater flow with multi-decadal residence times can impact contaminant reactive chemistry processes (Green et al., 2010;Liao et al., 2012) and is requisite understanding to estimate transport velocity distributions used to predict contaminant flushing timescales (Bohlke & Denver, 1995;Manning et al., 2015;Sanford & Pope, 2013). Nonetheless, our understanding of groundwater flow and transport at the field-scale is complicated by the limited observations that are sensitive to groundwater with decadal and longer residence times (Gardner et al., 2015;Zell et al., 2018). Further investigation on the role that uncertainties in long-residence time groundwater transport have on field-scale solute transport predictions and predictive uncertainties is needed.
Physics-based numerical models are amongst the most powerful tools available to assimilate long-residence time groundwater into predictions of subsurface transport (Li et al., 2017;Steefel et al., 2005).
Given that numerical models are imperfect representations of complex groundwater systems, model calibration using field observations is key to estimate effective model parameters and make solute transport predictions (Doherty, 2015;Hill & Tiedeman, 2007). Model calibration against hydraulic head data alone cannot constrain the groundwater transport velocity fields and leads to poor solute transport predictive performance (Portniaguine & Solomon, 1998;Thiros et al., 2021). Augmenting calibration datasets with observations of solute concentrations has been shown to improve estimates of the parameters and processes that control solute transport (e.g., Schilling et al., 2019). Whilst injection tracer tests provide solute transport information locally around a well gallery (e.g., Ma et al., 2014), these methods are limited by the time frames of the field campaign and often cannot constrain field-scale processes and parameter heterogeneities. Alternatively, environmental tracers are naturally applied over timescales that range from years to centuries and act as a proxy for solute transport that integrates heterogeneity over broad spatial scales (Cook & Herczeg, 2000;Suckow, 2014). Environmental tracer observations are commonly assimilated into groundwater model calibration datasets (Portniaguine & Solomon, 1998;Sanford et al., 2004), with many studies reporting subsequent improvements in hydrogeological parameter estimates and system hydraulic forecasts (Green et al., 2010;Sanford, 2011;Starn et al., 2014;Thiros et al., 2021).
With respect to field-scale solute transport predictions, Curtis et al. (2006) and Åkesson et al. (2014) utilise measured tritium ( 3 H) at contaminated field sites to calibrate hydraulic model parameters of reactive transport models that simulate uranium and nitrate, respectively.
Despite the prevalence of studies that assimilate environmental tracer observations into model calibration datasets, quantification of model parameter and subsequent solute transport predictive uncertainties has received much less attention.
Theoretical and applied studies have shown that model structural errors, observation data uncertainty and calibration non-uniqueness degrade groundwater model calibration performance and lead to uncertain predictions (Hill & Tiedeman, 2007;Liu & Gupta, 2007). Calibration non-uniqueness is caused by the inability of field data to constrain the correlations amongst groundwater flow and transport model parameters and processes (Doherty & Welter, 2010;Linde et al., 2017). Calibration non-uniqueness manifests in many plausible models that can equally fit observation data, thus, interpreting a bestfit model is often fraught and leads to poor predictive performance (Beven, 2006;Hunt et al., 2007). Quantifying the parameter and resulting model predictive uncertainties, however, remains a challenging task. Typical groundwater model calibration and uncertainty quantification using deterministic methods requires the assumption of model linearization and Gaussian model errors (Doherty, 2015;Tarantola, 2005). However, the non-linearity in the mathematical equations that describe groundwater solute transport and variably saturated flow calls into question the application these simplified uncertainty analysis methods. Studies report parameter uncertainties quantified using linear approximations can significantly differ from those estimated using more robust Monte Carlo methods (Gallagher & Doherty, 2007;Yoon et al., 2013;Zell et al., 2018). Improved uncertainty quantification methods are needed to investigate the impact that including environmental tracer observations that constrain groundwater flow and transport over long temporal scales has on field-scale solute transport predictions and predictive accuracy.
Markov chain Monte Carlo (MCMC) is generally considered the most robust method to perform model calibration and uncertainty analysis (Linde et al., 2017). However, MCMC analysis is computationally intensive and often remains intractable to perform using fieldscale groundwater flow and transport models (Tonkin & Doherty, 2009;Yoon et al., 2013). Machine-learning based surrogate models that are trained to emulate the input-output response of the high-fidelity groundwater flow and transport model can be used to investigate groundwater system processes and perform uncertainty quantification at a fraction of the computational cost compared to the original physics-based model (Asher et al., 2015;Razavi et al., 2012).
For instance, Laloy and Jacques (2019) and Fienen et al. (2018) compare the ability of multiple surrogate models to emulate physics-based reactive transport models at the column scale and groundwater flow at regional scale, respectively. Recent studies extend the use of groundwater flow and transport surrogate models to perform MCMC analysis to solve the inverse problem that identifies groundwater contaminant source regions Zhou & Tartakovsky, 2020) and infer subsurface hydraulic conductivity fields (Cui et al., 2018;Mo, Zhu, et al., 2019;Rajabi, 2019;Xu et al., 2017).
To our knowledge, no study has applied surrogate modelling to emulate transport of environmental tracers at the field scales and performed subsequent solute transport predictive uncertainty analysis.
In this work, we develop an artificial neural network (ANN) surrogate model that is trained to simulate groundwater levels and 3 H concentrations at a contaminated field site near Riverton, WY. We utilise a high-fidelity, physics-based groundwater flow and transport model to generate the dataset that is used to train the surrogate model.
Using the trained surrogate model as the forward simulator, we perform MCMC calibration to infer uncertainties in subsurface property and hydraulic boundary condition parameters, conditioned on field observations of groundwater levels and 3 H observations. Ensembles of groundwater mean age and chlorofluorocarbon-12 (CFC-12) predictions from the high-fidelity model evaluated with samples from the calibrated parameter posteriors are used to estimate predictive solute transport uncertainties. To investigate the influence that 3 H observations that can constrain groundwater transport at the multidecadal timescales has on solute transport predictive performance, we perform the same MCMC model calibration and predictive analysis using water level observations alone. Through comparison of the model predictive uncertainties given the two calibration datasets, we are able to explore the role that long-residence time groundwater has on solute transport processes at the Riverton site.

| Site description
The Riverton site is located $3 km south-west of Riverton, WY on the Wind River Indian Reservation (Figure 1). Contamination at the Riverton site is sourced from a former uranium and vanadium processing mill that was active between 1958 and 1963. Despite tailings remediation in 1989 and a risk assessment that predicted natural flushing of the groundwater contaminants would occur within 100 years, elevated uranium concentrations persist within the shallow alluvial aquifer (Dam et al., 2015;DOE, 1998). A significant amount of groundwater flow and solute transport research has been performed and is on-going at the Riverton site to better understand the uranium plume dynamics (e.g., Byrne et al., 2020;DOE, 2015).
The Riverton site has an area of $7 km 2 and is on an alluvial terrace at $1500 m elevation within the Wind River Basin. The climate is arid to semi-arid with an annual mean temperature of 8 C and precipitation of 200 mm that predominantly occurs as winter snow and summer rain (DOE, 2015). The surface hydrology is characterised by the Wind River to the north and Little Wind River to the south. Both the Wind River and Little Wind River experience peak and base flows in June through July and September through February, respectively.
The Wind River Basin is composed of interbedded Eocene age sandstone and shale layers (DOE, 1998). Groundwater flows in the southeast direction through three predominant aquifers (DOE, 1998): (1) a 4-6 m thick unconfined aquifer comprised of sands, gravel and silt; (2) a middle semi-confined 5-9 m thick sandstone aquifer; and (3) a deep 15-20 m thick confined sandstone aquifer. Confining units are composed of shale and have thicknesses up to 10 m.

| Observation datasets
Field observation datasets are presented in detail within (Thiros et al., 2021). The US Department of Energy (DOE) has performed extensive site characterisation at the Riverton site, which includes the installation of numerous groundwater wells and regular groundwater sampling (DOE, 1998(DOE, , 2015. Throughout this work, observation data- during the months ranging from May to October. Sampling was performed following US Geological Survey procedures (https://water. usgs.gov/lab) and chemical analysis was performed at the University of Utah Noble Gas Laboratory following procedures presented in Thiros et al. (2021). CFC-12 concentrations are corrected for excess air calculated using the measured Ne, Ar, Kr and Xe aqueous noble gas concentrations and the closed-equilibrium excess air model (Aeschbach-Hertig et al., 1999). Due to expected microbial degradation of CFC-11 and CFC-113, we only use CFC-12 concentrations, which are less likely to be biochemically altered (Cook & Herczeg, 2000).

| High-fidelity forward model
Transient and 3-D groundwater flow and environmental tracer transport at the Riverton site is simulated using the PFLOTRAN software (Hammond et al., 2012;Hammond et al., 2014). PFLOTRAN is a physics-based numerical model that solves the fully distributed where S(l, t) is the estimated river water surface elevation [L] at time  (1) is solved separately for the Little Wind River (R = lwr) and Wind River (R = wr). Whilst R is varied during the calibration process, the a priori value is calculated as the average land surface elevation slopes along the river corridors delineated using the DEM ( [L/T] that are applied to land surface using a specified Neumann flux boundary condition are in the form: where γ is a multiplier that scales the base infiltration rate I th (t) [L/T]. I th (t) is calculated as the difference between precipitation rates measured at the Riverton, WY airport ($10 km away) and evapotranspiration rates evaluated using the Thornthwaite equation (Thornthwaite, 1948). To approximate snowpack processes, we adjust the measured precipitation totals such that precipitation accumulates over days with average temperatures below 0 C. The accumulated water is then applied to the precipitation totals for the next day with an average temperature above 0 C. The average of the I th (t) timeseries is 5.0 mm per month and has a standard deviation of 11.3 mm per month ( Figure S2). For the variably saturated flow, the van Genuchten characteristic function (van Genuchten, 1980) is used to relate fluid pressure to effective saturation and the Mualem relation (Mualem, 1976) is used to relate effective saturation to relative permeability. The empirical characteristic function fitting parameters m and α are taken from the literature (Dingman, 2015) and previous modelling studies for the Riverton site (DOE, 1998) ( Table 1).
The environmental tracer aqueous concentration histories (described in Section 2.2) are applied to the hydrologically active boundaries using a Dirichlet zero-gradient transport boundary condition. This boundary condition applies a specified concentration (Dirichlet type boundary) for water entering the domain and a zerogradient Neumann flux condition for water discharging from the domain. We assume that groundwater that enters the sides of the domain deeper than 6 m (two model layers) below the river surface water elevations is pre-modern and does not contain CFC-12 nor 3 H. Whilst this assumption is difficult to verify with direct field observations, simulations that applied atmospheric environmental tracer concentrations along the full depth profile below the rivers introduced an unrealistic amount of tracer concentration into the deep model layers.
The porosity (n ss ) and permeability (k ss ) fields of the sandstone model layer are assumed homogeneous. The hydrogeological properties of the shallow soil and gravel layers are considered spatially homogeneous for porosity (n soil ) and heterogeneous for permeability (k soil ). The heterogeneous permeability field is parameterized using the pilot point method, which only varies permeability at discrete locations known as pilot points (Doherty, 2003). To limit the number of parameters, we place 25 pilot points on an unstructured grid with a density that approximates the observation well density (Figure 1). The 3-Dlog 10 permeability field is created by interpolating between pilot points using ordinary kriging with an exponential variogram, unit standard deviation for the sill, and isotropic correlation lengths of 800 m in the x and y directions and no variation in the z direction. These cor- m ¼ n soil , log 10 k ss , n ss , γ, lwr, wr, log 10 k p where log 10 k p soil is the permeability of pilot point number p [0, 25]. Table 2 shows the parameter prior mean values that are based on previous site characterisation (DOE, 1998(DOE, , 2012(DOE, , 2015. Whilst the highfidelity model is spatially distributed, we simplify our workflow to only record simulated groundwater levels and 3 H concentrations at times and locations that match the observation dataset (Section 2.2). Let  (Asher et al., 2015). Whilst many types of ANN have been effectively trained to emulate hydrologic problems (see Shen, 2018), they have been used less frequently to emulate groundwater solute transport models at the field scales. In this work, our ANN surrogate model is a deep, but narrow, multi-layer perceptron (MLP). MLP are built as a sequence of layers, each with a number of nodes that are fully connected to all nodes in the previous layer (e.g., Lecun et al., 2015). MLP map from an input parameter layer to an output prediction layer by constructing a series of transformations in the form: where N is the number of training examples. Gradients of Equation 5 with respect to θ are calculated using reverse mode automatic differ-

| MCMC model calibrations
Bayesian model calibration is a widely used method to infer the posterior distribution of uncertain model parameters m given observation data d obs (Linde et al., 2017). The posterior parameter distribution P (mjd obs ), which represents our updated belief in model parameters after considering both observation data and prior knowledge regarding parameters, is quantified using Bayes' theorem where P(m) are the prior parameter distributions and P(d obs jm) is the likelihood of the observations given a parameters set. Computing the left-hand side of Equation (6) is intractable and must be approximated using numerical methods. In this work, parameter posterior distributions are quantified using a MCMC method that directly draws discrete samples from the posterior distributions.
The prior distributions reflect the state of knowledge on the parameters before considering the observation data. The a priori parameter mean values are established from previous characterisation of the Riverton site (Table 2). Following Brinkerhoff et al. (2021), we define the prior uncertainties using scaled Beta distributions: where Bound L and Bound U are the lower and upper parameter bounds, respectively, in Table 2. Beta distribution priors put higher probability density on values in the centre of the distribution, which corresponds to the a priori parameter mean. However, these priors are vague enough to allow a range of plausible parameters, which is important because unknown model structural errors lead to model calibration parameter estimates that are at incommensurate scales with fieldbased measurements (Doherty & Welter, 2010 The likelihood in Equation (6) is a measure of goodness of fit between the field observations and an equivalent model prediction.
Evaluating the likelihood provides a method to falsify parameter samples drawn from the prior distribution using observation data. Typically, quantifying the likelihood assumes a Gaussian error model (or data noise distribution) that is parameterized using estimates of the assumed observation errors (Linde et al., 2017).  (Goode, 1996). With this formulation of age transport, we do not simulate the full residence time distribution, rather, only the first moment of the distribution. The distribution of mean ages simulated here is not conceptually equivalent to the residence time distribution that is commonly estimated with the use of environmental tracers and lumped parameter models (Cook & Herczeg, 2000;Maloszewski & Zuber, 1982). The predictive mean age distribution represents an estimate in the uncertainty of mean age given plausible parameter sets.
This distribution captures the variance in average transport behaviours, conditioned to the observation dataset. Alternatively, residence time distributions are a measure of the flux weighted residences times of the varying flowpaths contributing to a sample, given a single model.

| Surrogate model performance
The surrogate model is a simplification of the high-fidelity model in that it only predicts groundwater levels and 3 H concentrations at locations and times that match the field observation dataset. Figure 2 shows water level predictions made by the trained surrogate model,

| Parameter posteriors
Using the trained surrogate model as the forward simulator, we test two separate model calibration scenarios. The first scenario uses only observed water levels in the calibration dataset. The second model calibration utilises both observed water levels and 3 H concentrations and was performed to further investigate the influence of conditioning solute transport predictions and predictive uncertainty to observations that are sensitive to long-residence time groundwater. In particular, the difference in parameter and predictive uncertainties between the two calibration scenarios represents a measure of observation data worth and can be used to gain insight on the role of longresidence time groundwater in solute transport uncertainties at the Riverton site (Zell et al., 2018). Model calibration and predictive uncertainty quantification methods are consistent between the separate 3 H and water level dataset scenarios. The prior and marginal posterior distributions for select model parameters are shown in Figure 4 and the joint posterior distributions are shown in Figures 5 and 6 for the calibration datasets that assimilate 3 H and water levels only, F I G U R E 2 Residual between the validation and artificial neural network (ANN) surrogate model water level predictions (y-axis) versus the corresponding validation water levels created by the high-fidelity model (x-axis).
Each subplot corresponds to a single observation well location and time. The left and right columns contain the instances out of the total 166 water level observations with the lowest and highest validation root-mean square error (RMSE), respectively. The black dashed line is the one-to-one prediction (zero residual) and the dotted red line is the best-fit line.
respectively. Model calibration results for the remaining parameters will be discussed below and shown in Figure S4.
Comparing the prior and marginal posterior distributions provides insight on how much the observation dataset constrains the F I G U R E 3 Residual between the validation and artificial neural network (ANN) surrogate model 3 H predictions (yaxis) versus the corresponding validation 3 H concentrations created by the highfidelity model (x-axis). Each subplot corresponds to a single observation well location and time. The left and right columns contain the instances out of the total 28 3 H observations with the lowest and highest validation root-mean square error, respectively. The black dashed line is the one-to-one prediction (zero residual) and the dotted red line is the best-fit line. For both calibration scenarios shown in Figure 4, the soil porosity posterior distribution shows a slight increase in the mean value and minor uncertainty differences compared to the prior. In particular, the maximum a posteriori is increased to $35% relative to the prior of 30%, and the uncertainty spans the full bounds of the prior. For the calibration scenario that only uses water level observations, Figure 6 indicates that the soil porosity does not have correlations with the other parameters. Alternatively, the soil porosity has a positive correlation structure with the infiltration rate parameter when the calibration assimilates both water level and 3 H observations ( Figure 5). The calibration to water levels alone results in river gradient posterior distributions that closely match those of the 3 H calibration scenario ( Figure 4). This suggests that the water levels, rather than 3 H observations, provide the bulk of the information content that constrains these parameters. This is supported by a sensitivity analysis that suggests the 3 H observations have minor sensitivities to the river gradient parameters, compared to the water level observations ( Figure S3). The largest discrepancies between the two datasets occur for pilot points 13, 15 and 22, which are shown in Figure 4. It is also apparent that whilst the pilot point permeability maximum a posteriori estimates can show significant differences between the two calibration dataset scenarios, the posterior uncertainty ranges tend to agree ( Figure 4 and S4).

| MCMC calibration performance
The parameter posteriors reflect our prior knowledge regarding parameters and the fit between model predictions and observed data, as shown in Figure 8. In Figure 8a, the 3 H residuals evaluated using the maximum a posteriori parameter estimates (black dots) show considerable scatter and a systematic bias around the one-to-one line. where all of the simulation uncertainties capture the one-to-one line ( Figure 8b). There are minor observable differences in the water level residuals when 3 H observations are omitted from the calibration dataset ( Figure 8d). Alternatively, Figure 8c shows that the calibration to water levels alone significantly over-predicts the 3 H observations. F I G U R E 8 Calibration residuals for 3 H on the left and water levels on the right. Subplots (a) and (b) correspond to the calibration against both water levels and 3 H, whilst (c) and (d) use only water levels. The x-axis is the field observations and y-axis are the simulated equivalents using the trained surrogate model. The black dots are predictions using the maximum a posteriori parameter estimates and the shaded coloured regions are the 95% confidence intervals evaluated using Monte Carlo sampling. The dashed line is the one-to-one prediction line.

| Predictive distributions
We use the high-fidelity model and 1000 parameter sets randomly sampled from the posterior distributions to simulate mean age distributions at four well locations (Figure 9). For the water level and 3 H calibration dataset, the solid lines in Figure 9 illustrates To further investigate the predictive mean age distributions, we compare the 3 H and CFC-12 field observations against idealised aquifer mixing behaviours. Figure 10 shows the 3 H and CFC-12 concentrations expected in samples assuming no mixing between flow lines (piston-flow model) and mixing that results in an exponential age distribution (Cook & Herczeg, 2000). We use the CFC-12 observation dataset to validate the predictive capability of our calibrated model. We use CFC-12 as the predictor of interest because we can apply the typically valid assumption of conservative transport and we know an approximate input concentration function. Figure 11a shows predictions of CFC-12 concentrations made using the high-fidelity model and the same 1000 parameter samples used to generate the mean age ensemble when calibrating to both 3 H concentrations and water levels. The CFC-12 prediction ensemble means and predictions using the maximum a posteriori parameters are significantly less than the field observations. In particular, the calibrated model predicts CFC-12 concentrations that are generally <0.5 pmol kg À1 , whilst the field observations are between 1 and 2 pmol kg À1 . The lack of high CFC-12 concentration model predictions is consistent with Figure 9 that shows mean age predictions F I G U R E 9 Posterior predictive distributions of mean age at four observation wells. The solid lines correspond to the calibration using water level and 3 H observations and the dashed line corresponds to the calibration using water levels alone. Each distribution contains a total of 1000 high-fidelity model runs parameterized by randomly sampling from the respective parameter posterior distributions. are generally pre-modern (before the year $1960). Furthermore, the predictions of CFC-12 concentrations >0 pmol kg À1 suggests the full age distributions contain a fraction of modern water mixing with premodern water. It is also apparent that for all but three wells, the predictive uncertainty does not capture the field observations. 4 | DISCUSSION

| Surrogate modelling error
Coupling physics-based hydrologic modelling with machine-learning surrogate models to investigate system processes and perform uncertainty quantification is a current area of considerable research (Asher et al., 2015;Linde et al., 2017;Razavi et al., 2012;Shen, 2018 (Rajabi, 2019).
Whilst we expect that the surrogate model validation accuracy would increase with a larger training set, this was not implemented due to the computational requirements of running the high-fidelity model.
However, we assume that the error introduced by the surrogate model is less than the total model error that accounts for the unknown model structural defects (Doherty & Welter, 2010;Xu & Valocchi, 2015).

| Hydrologic system characterization
The result that the estimated infiltration rates and uncertainties differ whether or not 3 H is included within the observation dataset represents the insensitivity of water levels to the infiltration rate parameter. In particular, water level observations cannot uniquely constrain both permeability and recharge parameters (Portniaguine & Solomon, 1998). Alternatively, the reduction of the infiltration rate posterior uncertainties suggests that 3 H concentrations in the shallow aquifer are highly sensitive to the boundary condition flux applied at F I G U R E 1 1 Chlorofluorocarbon-12 (CFC-12) concentration posterior predictive intervals using (a): 3 H and water levels in the calibration dataset and (b): Water levels alone. Each distribution contains a total of 1000 high-fidelity model runs parameterized by randomly sampling from the parameter posterior distributions. The filled circles and horizontal dash represent the ensemble mean and prediction with the maximum a posteriori, respectively. The uncertainty bars indicate the range of the ensemble prediction and the black triangles correspond to field observations. land surface. This sensitivity helps to explain the seemly discrepant result that studies such as Starn et al. (2014) and Carroll et al. (2020) find calibration to environmental tracers significantly constrains porosity estimates, yet, porosity in this work had little influence on the calibration. This is due to the transport velocity field in the shallow subsurface being set by the infiltration rate and the variation in porosity has little influence.
A sensitivity analysis performed on the training dataset shows that the boundary condition parameters (infiltration rate and river gradients) are the most sensitive to the observation dataset ( Figure S3).
In particular, the water level observations are most sensitive to the river gradient parameters whilst the 3 H observations are most sensitive to the infiltration rate multiplier parameter. Taken together these results suggest the water level observations dominantly constrain the groundwater hydraulic gradient and flux produced by the river boundary condition parameters. Thus, it is plausible that the estimated low infiltration rates when assimilating 3 H observations (e.g., Figure 4) is due to parameter compensation that aims to reduce the large flux of young water that is set by the river gradient parameters and water level observations. To test this hypothesis and the trade-off between 3 H and water levels in constraining the boundary flux parameters, we performed a MCMC calibration using the trained surrogate model and only 3 H observations. The estimated river gradient parameters are decreased towards the prior mean when only 3 H is assimilated into the model calibration, yet the infiltration rate posterior distribution remains similar to the calibration that uses both 3 H and water levels ( Figure S4a). This suggests that the low infiltration rates when considering 3 H observations are not solely due to parameter compensation against the river gradient parameters that are needed to explain the water level observations.
Despite anticipated benefits of calibrating against 3 H observations that are sensitive to both the permeability field and infiltration rate, the model poorly predicts CFC-12 concentrations ( Figure 11).
Comparison of the predictive mean age ensembles for the two calibration dataset scenarios can be used to provide insight on the processes leading to the observed CFC-12 prediction bias. The predictive mean age distribution for the 3 H calibration scenario contains a higher proportion of older water relative to the calibration that solely uses water levels. The mean ages that are greater than $70 years can only explain the observed 3 H through a mixture of modern (<70 years) and pre-modern (>70 years) water (Gardner et al., 2011). In the 3 H calibration scenario, the mean ages often exceed 100 years, which suggests the residence time distribution contains a significant fraction of premodern water. Whilst constraining the velocity and subsequent groundwater mixing such that there is a large portion of pre-modern water leads to model predictions that can explain the observed 3 H concentrations, predictions of CFC-12 are systematically low ( Figure 11). This CFC-12 bias is indicative that the fraction of modern, CFC-12 bearing groundwater is too little. This result is unexpected in that both 3 H and CFC-12 are sensitive to groundwater ages up to 70 years old.
Despite the contribution of pre-modern groundwater predicted by the numerical model, the environmental tracer mixing plots suggest that the majority of samples can be explained with a simple pistonflow model ( Figure 10) parameters that do not accurately generalise to predict CFC-12 concentrations. We anticipate that key model limitations in this work are the assumed structure of the boundary conditions and subsurface heterogeneity. Infiltration is expected to vary spatially and is a function of the complex land surface energy-balance and plant transpiration dynamics that govern evapotranspiration. Our approach that scales a base infiltration rate timeseries by a multiplier does not account for the uncertainty in the temporal dynamics of infiltration.
Given the observed sensitivity of 3 H simulations to the infiltration rate, this likely imposes a major assumption that propagates to the calibrated parameters and predictive uncertainty.
The simplified subsurface lithology represents an important model assumption that is expected to be influencing the transport simulations. It is apparent that the calibrated model does not accurately capture the full variance of the 3 H observations (Figure 8). Similar to findings of Starn and Belitz (2018)

| Uncertainty quantification
One of our research objectives was to investigate how uncertainties in long-residence time groundwater flow propagate to field-scale solute transport predictions and predictive uncertainties at the Riverton site. Whilst it is typically assumed that assimilating more observation data into model calibrations improves model performance, we find that calibrating against 3 H observations degrades predictive accuracy.
This finding is in contrast to our hypothesis that constraining solute transport predictions with observations of solute transport that span long-residence times will force the model to reconcile a broader spectrum of transport processes, leading to improved system characterisation and predictive performance. Rather, our results suggest that due to model structural errors, calibrated parameters take on very specific roles that do not necessarily represent the true processes and properties of the system. Consequently, improved model calibration and lower parametric uncertainties does not necessarily translate to improved model predictions. Identifying the different sources of uncertainties that lead to poor predictive performance is a challenging, yet critical task.
The robust uncertainty quantification methodology that we employ is a valuable step in understanding the model inadequacies that lead to inaccurate predictions. Within the MCMC calibration framework, we identify the full parameter sets that are consistent with the field observations and prior parameter knowledge. Thus, our uncertainty quantification methodology provides some level of confidence that the poor predictive performance we observe is not derived from the limiting assumptions that often influence model calibrations; such as model linearization and solely characterising local minima with a single optimal model. Alternatively, we are able to attribute the discrepancies between model calibration and predictive performance to model structural errors that are not compensated within the estimated parametric uncertainties. This highlights challenges in assimilating environmental tracer data into complex groundwater models and the need to robustly quantify both parametric and predictive uncertainties when evaluating model performance. However, we also highlight that by assimilating 3 H observations and performing uncertainty analysis we are able to expose the presence of model structural errors with high certainty. Alternatively, the water level observations do not contain adequate information content to capture the model defects.
These insights support studies that show model structural and conceptual errors can be the dominant source of uncertainty for complex groundwater models (Enemark et al., 2019).

| CONCLUSIONS
We demonstrate a method that facilitates parameter and predictive uncertainty quantification of a computationally expensive physicsbased groundwater flow and transport model using a machinelearning surrogate model. We train an ANN surrogate model to approximate the groundwater levels and 3 H concentrations predicted by a reactive transport model as a function of 31 uncertain model parameters. MCMC analysis is then performed using the trained surrogate model to infer parameter posterior distributions and predictive groundwater mean age and CFC-12 transport uncertainties for a field site near Riverton, WY. We find that whilst assimilating 3 H observations into the model calibration significantly constrains parameter uncertainties, the model does not predict the observed CFC-12 concentrations. The model parameters and associated uncertainties that explain the 3 H observations predict a larger fraction of old groundwater compared to what the CFC-12 observations suggest. The discrepancy between model calibration and predictive performance demonstrates that model misrepresentations can lead to low parametric uncertainties that do not translate to improved model predictive performance nor uncertainty estimates. These finding highlight the need to perform both parametric and predictive uncertainty analysis when assimilating information rich datasets such as environmental tracers into complex groundwater models. The methods presented in this study provide a tool that allows uncertainty quantification using computationally expensive groundwater flow and transport models.