Forecasting Groundwater Contaminant Plume Development Using Statistical and Machine Learning Methods

Forecasting groundwater contaminant plume development is critical for determining risks to downgradient receptors and predicting the time to site closure. However, accurate forecasts are a formidable challenge due to the complexities of a heterogeneous subsurface. While historically groundwater well data in combination with numerical flow models have been used for this task, the advent of machine learning offers new data‐driven opportunities for improving contaminant fate and transport predictions. In this study, we interrogate the viability of two forecasting models—Prophet and damped Holt's exponential smoothing model—for predicting groundwater contaminant plume development. The impacts of spatial and temporal data density on the accuracy of the forecasts are evaluated. For wells with declining contaminant concentrations, the damped Holt's method achieves more accurate forecasts. However, only Prophet allows for the inclusion of exogenous regressors, enabling predictions of future declining trends in wells with still increasing contaminant concentrations. Application of these models requires not only robust training data, but also an understanding of model biases. Overall, powerful data‐driven models are already available for contaminant plume prediction, but groundwater sampling approaches will have to improve, for instance, through the collection of real‐time spatial and temporal high‐resolution data, to take full advantage of their capabilities.


Introduction
Since the emergence of contaminant hydrology as a discipline, the struggle to delineate and track the movement of contaminant plumes in soil and groundwater has made vast strides, but ultimately, the heterogeneity of the subsurface challenges an accurate understanding. The advection-dispersion-reaction (ADR) equation provides a mathematical description of contaminant transport, but the determining mechanisms are often convoluted by complex processes. This complicates not only the delineation of groundwater plumes, but also the understanding of where and how far contaminants will move. Predicting the behavior of contaminant plumes has mostly been performed with process-based analytical and numerical models that solve the ADR equation. Tracer studies at highly characterized research sites have highlighted limitations in the predictive power of these models due to unrealistic assumptions, including over-simplified aquifer property distributions, Fickian transport behavior, and inadequate representation of preferential flow paths (Zheng et al. 2011). As data collected from contaminated sites become more comprehensive, empirical data-driven models may offer new opportunities for building a better understanding of contaminant fate and transport at individual sites.
Machine learning models are increasingly applied in the earth sciences and have the potential to transform the field, as has occurred in many other areas. Linear algebra-based methods, including Positive Matrix Factorization (PMF) and linear discriminant analysis, have been applied to remediation site data; for instance, PMF was used to identify areas of a plume in different stages of attenuation at a large site in New Jersey (Capozzi et al. 2018), and linear discriminant analysis was used to show 1,4-dioxane attenuation at sites documented in California's GeoTracker public records system (Adamson et al. 2015). One important area in which matrix factorization has proved useful for determining plume sources is forensics (Vesselinov et al. 2018; Sun et al. 2020). Other applicable data mining and statistical machine learning techniques involve the use of decision trees and neural networks. Decision trees were used with text mined from reports of gasoline station sites to predict remediation costs (Farrell et al. 2007). Neural networks have been used extensively to predict the parameters, such as permeability and distribution coefficients, that are inputs to reactive transport models (Santos et al. 2021). Advances in data-driven, empirical modeling have been made, and these tools are increasingly adopted for use in water resources and geology (Razavi et al. 2012; Yadav et al. 2018; Dramsch 2020; Liu et al. 2020). These studies have demonstrated the applicability of machine learning methods to environmental data. However, these are only a few examples, and many more areas of machine learning remain to be explored to understand how they can improve our understanding of complex environmental systems.
More specifically, time series forecasting methods originally developed in the field of econometrics, which incorporate machine learning, have recently been gaining popularity in the earth sciences. Time series data are foundational to contaminated site investigation and monitoring. The interpretation of these data is often based on nonparametric trend analysis, such as the Mann-Kendall method, which is not appropriate for predicting future concentrations (ITRC 2013). An open-source model named Prophet (Taylor and Letham 2018) has been used to predict air pollution (Shen et al. 2020; Nath et al. 2021), groundwater levels (Aguilera et al. 2019; Fernández-Ayuso et al. 2019), surface water pollution, and streamflow values (Papacharalampous and Tyralis 2018; Dabrowski et al. 2020; Papacharalampous and Tyralis 2020). Prophet has performed well in some cases, such as with temperature, precipitation, groundwater level, and air pollution data, but results with streamflow data have been mixed. In addition, the use of forecasting algorithms and numerical models in tandem may be beneficial, with some arguing that an iterative process using forecasting prior to intensive numerical model calibration may improve predictive performance (White 2017).
Numerical models use simplified representations of subsurface geology, then mechanistically apply a transport model and reactions. The time series models we have chosen to study use historical data to train a model that then predicts future values. There is no mechanistic representation of fate and transport processes for either the historical data or the prediction. This provides an advantage, because realistically representing the many processes contributing to fate and transport becomes impractical for large sites over long time scales (National Research Council 2013). However, a disadvantage of this empirical time series approach is that any mechanistic interpretation must be based on the data and a general understanding of the physical and chemical processes at play, rather than being explained by mathematics that represents those processes in the model.
In this study, we take a data-driven approach of using available contaminant concentrations to train forecasting models without solving governing transport equations. Forecasts using both a modified exponential smoothing model and Prophet are compared to the output of a realistic synthetic dataset generated for a site with historical contamination. The effects of training dataset size and temporal data density are interrogated to understand potential improvements in model performance. In addition, we use Prophet to evaluate the benefits of incorporating training data from multiple observation wells as external regressors. Forecasting in this manner has the potential to introduce a new tool for developing and refining conceptual site models, and the computational efficiency of these methods creates opportunities to streamline decision-making and reduce monitoring and remediation costs at a wide variety of sites. Historical data generated at legacy sites are fundamental to generating appropriate conceptual site models but remain largely untested for use in forecasting. With increasing quality and quantity of data, investigating the applications of these tools can improve our understanding of contaminant transport and lead to practical uses in modern site management.

Site Setting
Initially, historical lab and field data sets from large legacy sites across North America were examined to determine if parameters measured on a standard basis (e.g., oxidation-reduction potential, electron acceptor concentrations) correlated with contaminant concentrations or contaminant parent-to-daughter ratios. After correlation analysis with combined data from these large sites did not yield satisfactory correlations, an area of enhanced remediation, a permeable reactive barrier (PRB), was chosen for a detailed study. Details about the initial correlation analysis and site hydrogeology are included in Supporting Information, and a general site map is shown in Figure S2. Zero-valent iron filings were used to create a barrier in the subsurface that was intended to intersect a plume in which the major contaminant of concern was carbon tetrachloride (DERS 1997). Several wells installed on each side of the PRB had been sampled on a semi-regular schedule starting from the time the barrier was installed. Even in such a strongly biased environment downgradient of a PRB, the clear correlations we expected were not found (Figure S1). Consequently, we decided to reverse our task by creating a realistic synthetic dataset and exploring the question of how much data are needed for accurate groundwater plume forecasting.

Synthetic Dataset Creation
An existing groundwater flow model was adapted to generate detailed, synthetic concentration data sets that were input to machine learning algorithms to test input data requirements. Model-simulated or synthetic data sets are commonly used to evaluate parameter estimation and predictive modeling routines in hydrology (Mirus et al. 2011; Liu et al. 2020). For this study, the groundwater model used to generate synthetic data was initially designed to evaluate the efficacy of pump-and-treat scenarios for the containment of site contaminants (DERS 1997). The spatial extent of the original modeling domain was reduced to focus on the plume area. In addition, the grid was rotated to gain better alignment with the primary transport direction and the perpendicular PRB orientation. The flow model was converted from MODFLOW-96 to MODFLOW-2005 (Harbaugh 2005). Reactive transport of carbon tetrachloride was simulated using the MT3DMS code (Zheng 2010). The numerical transport model solves the ADR equation, as described in Supporting Information.
The updated groundwater flow and transport model consisted of three layers, with 507 rows and 300 columns. Horizontal grid spacing ranged from 0.3 m near the PRB to a maximum of 3 m. Layer thickness ranged from 2 to 20 m, reflecting spatially variable aquifer thicknesses. Confining units between aquifers were simulated using MODFLOW's quasi-3D approach through a vertical leakance parameter, consistent with the original flow model. The northern boundary cells were assigned constant head values representing a river. The southern boundary cells were assigned constant heads, with head values extracted from the steady-state solution of the original model. The eastern, western, and lower boundaries were parallel to flow and assigned no-flow conditions (Figure S3). Hydraulic parameters were extracted from the original model, with horizontal hydraulic conductivity (K) values ranging from 26 to 100 m/day and assigned to zones reflecting physical heterogeneity within the aquifers. Leakance values, necessary to simulate vertical flow through confining units, were calculated using a constant vertical K of 8.6 × 10⁻⁵ m/day. Porosity was assigned a constant value of 0.3.
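The quasi-3D leakance term used above is, in its simplest form, the vertical hydraulic conductivity divided by the confining-unit thickness. A minimal sketch follows; the 2 m thickness is a hypothetical value for illustration only (the actual confining-unit thicknesses are not reported here), and intervening aquifer half-thicknesses, which some formulations include, are neglected.

```python
def vertical_leakance(kv: float, thickness: float) -> float:
    """Simple quasi-3D leakance (1/day): vertical K (m/day) / unit thickness (m)."""
    return kv / thickness

# Vertical K used in this study (m/day) applied to an assumed 2 m confining unit:
leak = vertical_leakance(8.6e-5, 2.0)  # 4.3e-5 per day
```
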
Water flow was simulated under steady-state conditions, consistent with the original model and considered appropriate for the long-term simulation with no significant changes in storage. Groundwater velocity vectors calculated from the steady-state hydraulic heads were used as inputs for transient simulations of carbon tetrachloride (CT) transport and degradation. CT was introduced to the model through constant concentration cells assigned in the source area (Figure S3). Approximating historical conditions at the site, constant concentration cells in layers 1 and 2 were active during an initial source pulse lasting 5.5 years. Removal of the source was simulated by deactivating the constant concentration cells for the subsequent 20 years of the simulation. Simulation time was discretized using a total of 515 time steps, with lengths ranging from 0.1 to 20 days based on a time-step multiplier of 1.2. Shorter time steps were used at the beginning of the simulation and immediately following source deactivation. CT degradation within PRB cells was simulated as a first-order decay reaction. The assigned rate constant of 2.57 day⁻¹ (half-life = 0.270 days) was estimated from an independent analysis of the percent reduction of CT due to the PRB as determined from field data. Model-simulated concentrations during the 25.5-year transient simulation, corresponding to years 2005 through 2030, were extracted for a series of observation points arranged into 5 longsects extending downgradient from the PRB (Figures 3 and S4). To account for spatiotemporal heterogeneities and other complexity not explicitly represented in the numerical model, synthetic Gaussian noise was added to the simulated CT concentrations prior to training with machine learning algorithms (Shuryak 2017). Gaussian noise addition involved drawing numbers from a normal distribution with a mean of zero and a spatially scaled standard deviation set to 40% of the peak concentration at that location. This method resulted in some negative concentration values, which were subsequently set to zero.
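The noise-addition step can be sketched in a few lines of NumPy. The example series and seed are hypothetical, but the scaling (standard deviation = 40% of the local peak concentration) and the clipping of negative values to zero follow the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for a reproducible sketch

def add_scaled_noise(series: np.ndarray, frac: float = 0.4) -> np.ndarray:
    """Add zero-mean Gaussian noise with std = frac * the peak concentration
    at this location, then set any resulting negative concentrations to zero."""
    sigma = frac * series.max()
    noisy = series + rng.normal(0.0, sigma, size=series.shape)
    return np.clip(noisy, 0.0, None)

# Hypothetical clean concentration record at one observation point (mg/L):
clean = np.array([0.0, 1.0, 5.0, 3.0, 1.0])
noisy = add_scaled_noise(clean)  # same shape, no negative values
```
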

Time Series Forecasting Models
Two forecasting models, Prophet and the damped Holt's exponential smoothing model, were used to predict CT concentrations at the site. Exponential smoothing models predict future values based on averages of past values, with exponentially less weight given to older observations. Methods based on this framework are quick and computationally efficient and have performed well in a wide variety of applications (Hyndman and Athanasopoulos 2021). Holt's method is a type of exponential smoothing model incorporating trend and seasonality (Holt 2004). The trend component in this method remains linear into the future, which tends to overestimate values at long time horizons. A variation of this method that accounts for this overestimation includes a dampening parameter that reduces the magnitude of the trend until a point where it no longer has an effect (Gardner and McKenzie 1985). This damped Holt's exponential smoothing model was selected for analysis in this study based on performance with eight observation points distributed within the plume.
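The damped Holt's recursions are compact enough to sketch directly. The level and trend smoothing parameters below are illustrative placeholders (this study tuned only the damping parameter phi, discussed below), but the update and forecast equations follow the Gardner and McKenzie (1985) formulation.

```python
def damped_holt_forecast(y, alpha=0.5, beta=0.3, phi=0.96, horizon=12):
    """Damped Holt's linear method (Gardner and McKenzie 1985).
    Level:  l_t = alpha*y_t + (1 - alpha)*(l_{t-1} + phi*b_{t-1})
    Trend:  b_t = beta*(l_t - l_{t-1}) + (1 - beta)*phi*b_{t-1}
    h-step forecast: l_n + (phi + phi^2 + ... + phi^h) * b_n
    """
    level, trend = y[0], y[1] - y[0]  # simple initialization from first two points
    for obs in y[1:]:
        prev = level
        level = alpha * obs + (1 - alpha) * (level + phi * trend)
        trend = beta * (level - prev) + (1 - beta) * phi * trend
    forecasts, damp = [], 0.0
    for h in range(1, horizon + 1):
        damp += phi ** h          # cumulative damped trend multiplier
        forecasts.append(level + damp * trend)
    return forecasts

# A declining hypothetical series: forecasts keep falling but flatten out.
fc = damped_holt_forecast([10, 9, 8, 7, 6, 5, 4, 3], horizon=50)
```

Because phi < 1, the cumulative trend multiplier converges (to phi/(1 - phi), 24 for phi = 0.96), so long-horizon forecasts level off instead of declining linearly forever, which is the behavior the undamped Holt's method lacks.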
Prophet is a model built on a generalized additive model framework with interpretable components. It was released as an open-source project by Facebook (Taylor and Letham 2018). The creators of this method argued that in many industries there is a need for "forecasting at scale," in which many analysts who are familiar with their field but may not be experts in forecasting should be able to make accurate forecasts using an interpretable and tunable model. The model is designed for an "analyst in the loop" workflow, in which an analyst can easily create forecasts, detect anomalies, and tune parameters to correct them, all while efficiently forecasting with many datasets. This model has increasingly been applied in hydrology, atmospheric sciences, and other areas of earth science.
The synthetic data described above were used to train these models. Different simulated sampling intervals and amounts of univariate training data were considered to investigate how limitations of real-world data might impact the usability of these forecasting methods. The input data for each forecasting model were obtained from 100 individual sampling points in the synthetic plume; the 100 chosen points comprise the lower layer of the numerical model. For the damped Holt's model, a damping parameter (phi) of 0.96 was used, as determined by tuning models with reasonable values between 0.8 and 0.98 (Hyndman and Athanasopoulos 2021) on eight wells distributed through the plume. The Prophet model was customized with a logistic growth model specified for the trend, with a floor of 0 and a capacity equal to the maximum value of the training dataset plus one (in this case, 86.8). The changepoint prior scale of 0.5 was similarly determined by iterating through reasonable values between 0.001 and 0.5 (Prophet 2021) for the same eight wells.

Accuracy Measures
Predicted CT concentrations generated by each forecasting model were compared to the synthetic data at various spatial locations (Figures 3 and S4) and times. Model errors were quantified using two statistical measures: the root mean square error (RMSE, Equation (1)) and the mean absolute error (MAE, Equation (2)). The RMSE is widely used in the evaluation of transport models (Zheng 2002).

RMSE = sqrt( (1/N) Σᵢ (xᵢ − yᵢ)² )    (1)

MAE = (1/N) Σᵢ |xᵢ − yᵢ|    (2)

In these equations, N represents the total number of observations (concentration values) for a particular location in the synthetic test dataset, xᵢ is the ith observation, and yᵢ is the predicted equivalent of the ith observation. The RMSE is the square root of the mean of the squared prediction errors, and the MAE is the mean of the absolute values of the prediction errors. Both measures are in the same units as the time series, which in this case is mg/L. In cases where different units are being compared, scaled error metrics would be advantageous (Taylor and Letham 2018), but because the units of measurement in our study are the same and some values are close to zero, we chose RMSE and MAE to quantify the error. Smaller values for both metrics represent better model performance. The RMSE is more sensitive to outlier errors and thus tends to be higher than the MAE. These metrics were calculated based on the forecasts over the test period compared to the original synthetic data extracted from the numerical groundwater transport model.
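A minimal implementation of these two error metrics, with a small worked example using hypothetical values:

```python
import math

def rmse(obs, pred):
    """Root mean square error: square root of the mean squared prediction error."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error: mean of the absolute prediction errors."""
    return sum(abs(x - y) for x, y in zip(obs, pred)) / len(obs)

# Errors are 0, 0, and 3, so RMSE = sqrt(9/3) ≈ 1.73 while MAE = 3/3 = 1.0,
# illustrating RMSE's greater sensitivity to the single outlier error.
obs, pred = [1.0, 2.0, 4.0], [1.0, 2.0, 1.0]
r, m = rmse(obs, pred), mae(obs, pred)
```
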

Results and Discussion
The focus of this study was to investigate how much spatial and temporal data resolution is required to accurately forecast groundwater contaminant behavior. To study this question, portions of the synthetic dataset were used to train the forecasting models, and the predictions were compared against test data to investigate which model performed best. In addition, the models were trained with different simulated sampling intervals to determine the effect of decreasing temporal data density on the accuracy of forecasts. The amount of data included in the training dataset was also varied to determine its effect on model performance. Finally, upgradient well data were used as regressors in the forecasting models to understand whether including data from different spatial arrangements could markedly improve forecasting performance (Table 1).

Model Performance Using Historical Data
Time-series plots for eight selected wells in the synthetic dataset downgradient of the PRB are shown in Figure 1. The entire dataset was split into a test set and a training set, shown in black and gray, respectively. The initial evaluation considered monthly data at each spatial location (i.e., observed data with a ~30-day frequency). Monitoring well concentration forecasts are closest to the test values in areas near the PRB (e.g., the transect 17 ft downgradient of the PRB). Of the eight wells, both the RMSE and MAE were smallest for this transect (Figure 2), except for the forecast using the damped Holt's method in transect 303, longsect 2, which had RMSE and MAE values of 0.21 and 0.17 mg/L, respectively. For transect 17 in longsect 2, RMSE and MAE were 0.06 and 0.04 mg/L, respectively, using the damped Holt's method, and 0.54 and 0.51 mg/L, respectively, using Prophet. For wells in this area, the downward trend is correctly predicted by both methods. For the observation points in transect 303, the test data in longsect 2 are well predicted by Holt's method, while the pattern of the data in longsect 4 is not reproduced by either method. The predicted concentrations for observation points in transects 865 and 1462 are largely inaccurate because the training data cover only a period of increasing trend. The highest MAE was 10.4, for transect 303, longsect 4, using the Prophet method. The highest RMSE was 11.4, for transect 865, longsect 4. This highlights the limitations of long-term forecasting when training data are from an individual monitoring well where concentrations are still rising. Many legacy sites in the monitoring stage comprise wells with consistently declining concentrations. However, with less frequent sampling, natural variability may produce results that are not representative of the prevailing trend behavior in the subsurface (McHugh et al. 2011). For wells with still increasing contaminant concentrations, forecasting models that are not constrained by data from other spatial locations would be inappropriate. In addition, if new sources or significant changes in groundwater flow occur, these forecasting algorithms may not effectively forecast trend changes.
For predictions based on training data through 2017, the mean RMSE was 13.4 for Prophet and 4.6 for the damped Holt's model, and the mean MAE was 11.6 for Prophet and 4.2 for the damped Holt's model. Prophet RMSE ranged from 0.2 to 69.2, while the damped Holt's model RMSE ranged from 0.05 to 40.3. Prophet MAE ranged from 0.2 to 65.1, while the damped Holt's model MAE ranged from 0.04 to 37.4. An improvement in forecasting was observed using training data from 2005 through 2020, with a mean RMSE of 5.6 for Prophet and 2.7 for the damped Holt's model, and a mean MAE of 4.7 for Prophet and 2.4 for the damped Holt's model. Prophet RMSE ranged from 0.03 to 54.8, while the damped Holt's model RMSE ranged from 0.01 to 58.6. Prophet MAE ranged from 0.03 to 49.1, while the damped Holt's model MAE ranged from 0.01 to 54.2. These values vary spatially with factors including maximum concentration and distance from the source. Almost all errors are smaller with the longer training period. Comparisons of these error metrics can be made between the same data with different training and test sets in the top and bottom panels of Figure 2, although these values are not normalized to the total concentrations. Prophet performs better than the damped Holt's model for the closer wells, but worse for the wells furthest downgradient of the PRB.

Table 1. The different forecasting scenarios tested, shown with the variables that were adjusted for each scenario.

Maps of the predicted carbon tetrachloride concentrations in comparison to the output of the numerical model in the year 2025 are shown in Figure 3. The forecast generated by the damped Holt's method tends to slightly overestimate concentrations in the upgradient, southern portion of the well field, while it underestimates carbon tetrachloride concentrations further downgradient. Prophet largely overestimates the concentrations compared to the synthetic dataset.

Effects of Decreased Sampling Intervals
Groundwater sampling frequency is determined by many factors, including cost, risk, regulatory requirements, and degree of site characterization. Quarterly sampling is a common scheme required by regulatory agencies and has been used as a benchmark for the cost and time required to calculate an effective attenuation rate (McHugh et al. 2016). Sites where remedy effectiveness or nearby receptors are being closely monitored may be sampled on a more frequent basis. At sites where there is little risk, sampling schedules are more likely to be semi-annual or annual. Typically, sampling is more frequent when monitoring wells are initially installed, then decreases as a better understanding of contaminant fate and transport evolves and risk diminishes. To understand the impact of decreasing sampling frequency on forecast accuracy, two different temporal densities, monthly and quarterly, up to the year 2018 were used for model training and compared to the test data until 2030 (Figure 4). At longsect 3 in the transect 303 ft downgradient from the PRB, the damped Holt's method performs satisfactorily compared to the test data when using monthly monitoring data; the RMSE was 0.19 mg/L, and the MAE was 0.15 mg/L. Although the Prophet model performs worse using monthly training data at this location (RMSE = 2.9 mg/L; MAE = 2.8 mg/L), it still approximately replicates the shape and direction of the trend. When the same forecast models are trained with quarterly data, predictive capability is substantially reduced (Figure 4). The Prophet model does not correctly forecast the trend change. The damped Holt's method does replicate the trend direction, but this is likely an artifact, as the exponential smoothing places higher weight on more recent samples and the quarterly sample variability includes low values close to the training data cutoff date (Figure 4). In addition, the damped Holt's method does not correctly forecast the magnitude of the change. Consequently, while quarterly or semi-annual groundwater monitoring is often sufficient to evaluate risks for downgradient receptors, accurate plume forecasting may (temporarily) require higher monitoring frequencies.
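The effect of sparser sampling can be emulated by subsampling a monthly record. A small sketch with a hypothetical peaked concentration series illustrates one reason forecasts degrade: quarterly sampling leaves far fewer observations after a trend reversal.

```python
import numpy as np

# Hypothetical 4-year monthly record with a concentration peak at month 30,
# followed by a decline (a Gaussian pulse as a stand-in for plume passage).
months = np.arange(48)
monthly = np.exp(-0.5 * ((months - 30) / 8.0) ** 2)

# Quarterly sampling keeps every third observation.
quarterly = monthly[::3]

# Observations available after the trend reversal (month > 30):
post_peak_monthly = int((months > 30).sum())        # 17 samples
post_peak_quarterly = int((months[::3] > 30).sum())  # 5 samples
```

With only a handful of post-peak points, a smoother that weights recent observations heavily has little evidence of the new trend, consistent with the quarterly-data behavior described above.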

Comparing Different Training and Testing Data Splits
The amount of data in a training set is critical to forecast accuracy. As shown previously for wells at different distances from the source area, the point at which concentrations begin to decline is a factor in the ultimate accuracy of the prediction. This is illustrated for a longsect 3 observation point in Figure 5. With the training data split in 2018, both models perform poorly (Figure 5). With a training data split in 2021, the models still do not sufficiently recognize the change in the trend. In fact, while the forecast should improve with a later training split, the damped Holt's method forecast worsens, supporting our conclusion above that the forecast is heavily biased by more recent data points. Eventually, with the training split in 2024, both models provide an acceptable forecast until the end of the dataset in 2030. Consequently, a robust training data set needs to be collected before reliable forecasts can be achieved.

Including Upgradient Well Concentrations
All previous analyses were performed using univariate data. However, groundwater wells are not installed in isolation; rather, they are installed as a network. Monitoring wells measure point concentrations in a continuous aquifer. Contaminants may move through all the wells, but at different times based on advection, diffusion, and reactions. Upgradient well data will therefore contain a record of patterns that may later appear in downgradient data, as shown in Figure 1. Thus, we tested whether including upgradient well data as a nonlinear regressor could improve the forecast at downgradient wells. The damped Holt's method does not allow for exogenous regressors; thus, only the results using Prophet are presented here.
Figure 6 shows that a forecast using training data up to 2018 and one upgradient regressor well produces a better forecast than the historical data up to 2018 alone, improving the RMSE from 8.8 to 5.6. Notably, the trend is changed from increasing to decreasing with the additional information from another spatial location. When four upgradient regressors in adjacent longsects are added to the model, the RMSE decreases to 4.1. This is an incremental improvement, although relatively greater benefits were observed when changing from quarterly to monthly data (Figure 4) or increasing the amount of training data (Figure 5). The spatial layout of the regressor wells may play an important role in the improvement of forecasts at the target well. However, the monitoring well network is typically determined by other factors, and while wells may be placed to gain the most insight into subsurface plume behavior, they are not amenable to relocation for optimization after installation.
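Prophet's regressor mechanism is not reproduced here, but the underlying idea, that a lagged upgradient signal carries predictive information about a downgradient well, can be illustrated with an ordinary least-squares analogue. All series in this sketch, the 6-sample lag, and the 0.5 attenuation factor are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical upgradient record: a concentration pulse that arrives at the
# downgradient well 6 samples later, attenuated by half and noisy.
t = np.arange(60)
upgradient = np.exp(-0.5 * ((t - 20) / 5.0) ** 2)
lag = 6
downgradient = 0.5 * np.roll(upgradient, lag) + rng.normal(0, 0.02, t.size)
downgradient[:lag] = 0.0  # no signal before the pulse arrives

# Fit y_down(t) ≈ a * y_up(t - lag) + b by ordinary least squares,
# skipping the first `lag` samples where the shifted series wraps around.
X = np.column_stack([np.roll(upgradient, lag), np.ones(t.size)])
coef, *_ = np.linalg.lstsq(X[lag:], downgradient[lag:], rcond=None)
a, b = coef  # the slope a should recover roughly the 0.5 attenuation
```

This is only a linear analogue of what Prophet's exogenous-regressor feature does inside its additive model; it shows why an upgradient record can flip a forecast from increasing to decreasing before the downgradient well itself shows the turn.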

Effects of Heterogeneity
Heterogeneity within the groundwater flow and transport model strongly dictates the shape and timing of the synthetic plume. This study tested machine learning algorithms on synthetic concentration results obtained from a single realization of aquifer heterogeneity. In reality, significant uncertainty characterizes aquifer heterogeneity at field sites, which often requires models that incorporate multiple stochastic realizations of heterogeneity to predict a range of plausible outcomes. Future studies that use machine learning algorithms to predict contaminant concentrations should incorporate a range of heterogeneities to determine how differences in resulting plume shapes affect model performance.

Conclusions
Many large historical datasets from contaminated sites sampled for over 20 or 30 years have been collected. While not without unique value, the utility of these data for the purpose of contaminant plume forecasting may be limited due to improper groundwater sampling design and techniques. However, data collection is becoming more sophisticated; as the quality and quantity of environmental data improve, data-driven analysis will have greater potential to address relevant questions beyond compliance monitoring.
Using two different models with univariate data, our study showed that the damped Holt's exponential smoothing model largely outperformed Prophet in forecasting plume behavior. However, in wells with still increasing contaminant concentrations, a future declining trend could not be accurately predicted without external regressors; only when upgradient well data were included in the analysis could decline-curve forecasts be achieved in these cases. Overall, satisfactory predictions of time to closure can be achieved if the data used for model training are of sufficient spatial and temporal density. Nevertheless, specific model biases should be well understood and established through sensitivity analysis similar to the approach of our study. We note that this study was conducted using simulated concentration data and that the findings may not apply to currently available real-world datasets.
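The damped Holt's method favored above has a compact recursive form (level and trend updates with a damping factor phi). A minimal pure-Python sketch is given below; the smoothing parameters are illustrative defaults, whereas in practice they are fit to the training data, for example by minimizing in-sample squared error.

```python
def damped_holt_forecast(y, alpha=0.5, beta=0.3, phi=0.9, horizon=8):
    """Damped Holt's linear trend method (Gardner-McKenzie form).

    y: observed concentration series (a list of floats).
    Returns `horizon` forecasts; the damped trend flattens toward a finite
    asymptote instead of extrapolating linearly forever.
    """
    level, trend = y[0], y[1] - y[0]  # simple initialization from first two points
    for obs in y[2:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    # h-step-ahead forecast: level + (phi + phi^2 + ... + phi^h) * trend
    forecasts, damp = [], 0.0
    for h in range(1, horizon + 1):
        damp += phi ** h
        forecasts.append(level + damp * trend)
    return forecasts

# On a declining concentration series, each successive forecast step shrinks,
# avoiding the physically impossible negative concentrations of an undamped trend.
series = [95, 88, 82, 77, 72, 68, 65, 62, 60, 58]
print(damped_holt_forecast(series, horizon=4))
```

This damping behavior is what makes the method well suited to wells with declining concentrations, but the same recursion also explains its limitation noted above: with no mechanism for exogenous regressors, it can only extrapolate the trend already present in the training data.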
Spatial high-resolution data, for instance from multilevel samplers, have previously transformed our understanding of contaminant fate and transport in the subsurface and improved our ability to manage sites. The collection of temporal high-resolution data will similarly revolutionize our ability to forecast contaminant concentrations and the time to closure. With the rapid emergence of sensor-based collection of better and cheaper data, this revolution is imminent. The applicability of sensors to high-resolution monitoring of contaminated sites has recently been demonstrated (Sale et al. 2019; Sale et al. 2020; Blotevogel et al. 2021). This study presents the performance of only select forecasting algorithms; more research in this area will be necessary to take advantage of the full potential of advanced statistical and machine learning algorithms. Likewise, dashboards with real-time data are becoming more common, and automated forecasting across many wells will ensure that predictive capabilities are available to site managers (Askarani and Gallo 2020). A consequential aspect of forecasting will be building trust, at both the regulatory and operational level, that predictions are reliable.

Figure 1. Forecasts of carbon tetrachloride (CT) concentrations at eight observation points in longsects 2 and 4, at 17, 303, 865, and 1462 ft downgradient from the permeable reactive barrier. Both the Prophet and damped Holt's method models were trained with synthetic data through 2017. The black line represents the synthetic observation data during the training period, whereas the gray line represents the synthetic observation data for the test period.

Figure 2. Transect averages of accuracy measures for forecasts trained on data up to 2018 (top) and 2021 (bottom). Highlighted is the mean absolute error at different transects.

Figure 3. Carbon tetrachloride groundwater plumes downgradient of the PRB (brown line at the southern end of the predicted plume) in 2025 based on the synthetic data (left), the damped Holt's method forecast with training data through 2017 (middle), and the Prophet forecast with training data through 2017 (right). The direction of groundwater flow is NNE, perpendicular to the PRB. The map of the synthetic data (left) shows the naming conventions of longsects and transects of the observation point network. Modeled concentrations are from the lower aquifer layer that occurs below the river.

Figure 4. Forecasts of the observation point in longsect 3, transect 303 ft downgradient of the PRB with training data through 2017, using monthly data on the left and quarterly data on the right. The shift in data density from monthly to quarterly sampling results in poor forecasts by both models.

Figure 5. Forecasts of the observation point in longsect 3, transect 865 ft downgradient of the PRB with the training and testing data split at three different dates: 2018, 2021, and 2024. The black line represents the training data, the blue line is the Prophet forecast, and the red line is the damped Holt's method forecast.

Figure 6. Time series plots of forecasts using the Prophet model with historical data only, one upgradient regressor well, two upgradient regressor wells, and four upgradient regressor wells. All models are trained with data through 2017. The large blue dot on the diagram of the sampling network represents the well that is being forecast. The orange dots represent regressor wells.