Deep learning of estuary salinity dynamics is physically accurate at a fraction of hydrodynamic model computational cost

Salinity dynamics in the Delaware Bay estuary are a critical water quality concern as elevated salinity can damage infrastructure and threaten drinking water supplies. Current state‐of‐the‐art modeling approaches use hydrodynamic models, which can produce accurate results but are limited by significant computational costs. We developed a machine learning (ML) model to predict the 250 mg L−1 Cl− isochlor, also known as the “salt front,” using daily river discharge, meteorological drivers, and tidal water level data. We use the ML model to predict the location of the salt front, measured in river miles (RM) along the Delaware River, during the period 2001–2020, and we compare predictions of the ML model to the hydrodynamic Coupled Ocean–Atmosphere‐Wave‐Sediment Transport (COAWST) model. The ML model predicts the location of the salt front with greater accuracy (root mean squared error [RMSE] = 2.52 RM) than the COAWST model does (RMSE = 5.36); however, the ML model struggles to predict extreme events. Furthermore, we use functional performance and expected gradients, tools from information theory and explainable artificial intelligence, to show that the ML model learns physically realistic relationships between the salt front location and drivers (particularly discharge and tidal water level). These results demonstrate how an ML modeling approach can provide predictive and functional accuracy at a significantly reduced computational cost compared to process‐based models. In addition, these results provide support for using ML models in operational forecasting, scenario testing, management decisions, hindcasting, and resulting opportunities to understand past behavior and develop hypotheses.

In the mid-Atlantic region of North America, the Delaware River Basin supplies drinking water for 15 million people including nearly half of the population of New York City (Hutson et al. 2016).Water resources in the basin are jointly managed by four states and the federal government to meet demand while also maintaining downstream flow and water quality objectives (Mandarano and Mason 2013).Within the Delaware River Basin, the main stem of the Delaware River provides recreational opportunities, critical habitat to endemic species, and drinking water for several urban areas including Philadelphia, Pennsylvania.Salinity concentrations within the lower reaches of the Delaware Bay estuary vary throughout the year, but during periods of low river discharge, ocean salinity can be driven upstream, threatening water quality.Increased salinity conditions, which can be brought on by drought, sea level rise, and/or increased demand, can corrode critical infrastructure, threaten species habitat, and necessitate further treatment or alternative sources of drinking water.As such, salinity is a key constituent in the Delaware Bay estuary, and its dynamics are tracked closely to ensure the protection of water quality.For example, managed releases from upstream reservoirs, which are jointly managed for drinking water, flood control, recreation, and other purposes, are used to meet minimum flow objectives and manage salinity intrusion.However, these releases require the careful balancing of salinity mitigation with other beneficial uses, and there is considerable uncertainty surrounding the effect that the timing and magnitude of releases have on downstream salinity.
Within the Delaware Bay estuary, salinity intrusion is measured using the location of the 7-d average 250 mg L À1 Cl À isochlor, also referred to as the "salt front."The location of the salt front is reported in river miles (RM; 1.6 km) along the central axis of the estuary, ranging from 0 at the mouth of the estuary to 134 at Trenton, NJ where the river is no longer influenced by ocean tides.The location of the salt front, and salinity dynamics more broadly, in the Delaware Bay estuary are primarily controlled by river discharge along the mainstem of the Delaware River and from other major tributaries such as the Schuylkill River.However, other factors such as tidal action, meteorological systems, and bathymetric and topographical features also have influence, particularly during times of low discharge (Ross et al. 2015).
Current models of the Delaware Bay estuary primarily use numerical techniques to solve equations for the fundamental physical processes which drive salinity, hydrologic, and other biogeochemical dynamics (Galperin and Mellor 1990a,b;Whitney and Garvine 2006;Aristaz abal and Chant 2013;Philadelphia Water Department 2015).For example, the Coupled Ocean-Atmosphere-Wave-Sediment Transport (COAWST) modeling system has been used to accurately simulate salinity dynamics in the Delaware Bay (Cook et al. 2023).
While numerical hydrodynamic models such as COAWST can accurately simulate water level, temperature, salinity, and other constituents at fine spatial and temporal resolutions (Warner et al. 2010), there are substantial drawbacks including: (1) high computational costs in running the model, which can lead to long run times, shorter simulation periods, and difficultly in using the model for operations or short-term forecasting.For example, the COAWST model takes $ 50 h of compute time on 180 processors to simulate 1 yr of conditions within the Delaware Bay estuary.This limits the ability to test suites of scenarios simulating reservoir releases or potential future conditions.In addition, given the computational constraints, producing model ensembles for uncertainty or using ensemble-based data assimilation approaches to develop operational models for near-term forecasting is not feasible; (2) substantial systemspecific expertise required to define boundary conditions and configure the model; (3) significant requirements for hosting and accessing input and output data due to fine spatial and temporal resolutions of the model, for example, the COAWST model outputs $ 2 TB of data for a single year of simulations.
The rapid advances of machine learning (ML) have shown great promise in simulating dynamics in complex environmental and hydrologic systems, including estuaries, at reduced computational costs (Zaini et al. 2022;Bahari et al. 2023;Qi et al. 2023).For example, Zhou et al. (2021) used the Cubist ML algorithm to estimate the impact of Mississippi River discharge on estuary salinity dynamics along the Gulf Coast.Alizadeh et al. (2018) showed that ML models could accurately forecast estuarine water quality constituents, including salinity, up to 2 h in the future (Alizadeh et al. 2018).Other studies have demonstrated how ML models can reduce cost in data collection and computation (Codden et al. 2021).The flexibility and computational efficiency of such models raises exciting possibilities about their capacity to harness ever-growing data availability to develop accurate models at speeds and resolutions not possible with traditional modeling approaches.However, the complexity of ML models' inner workings can make it difficult to discern if the models are representing a system in a physically reasonable way.Moving beyond purely predictive performance metrics, it is important to ensure that ML models are getting the right answers for the right reasons (Kirchner 2006).This issue is particularly critical if models are to be used for forecasting, testing of future scenarios, or extrapolations of any kind in space and/or time.
This type of model accuracy is termed functional performance, which represents the ability of a model to accurately reproduce coupling and sensitivities within the system (Ruddell et al. 2019).A model with high functional performance should reproduce pairwise input-output relationships seen in the observed data.Transfer entropy, a metric from information theory, has been proposed as a method for assessing functional performance, as it provides a robust measure of non-linear, one-way coupling between inputs and outputs (Schreiber 2000;Ruddell and Kumar 2009;Konapala et al. 2020).The joint assessment of predictive and functional performance can provide a deeper understanding of a model's ability to represent a physical system and help identify potential weakness in the model.
The ML model used in this study is a long short-term memory (LSTM) network (Hochreiter and Schmidhuber 1997), a kind of temporally aware neural network that has shown skill in representing hydrologic processes due to its ability to retain information over many time steps (Rahmani et al. 2020;Frame et al. 2021;Zhi et al. 2021).As the name suggests, the network is engineered to discern which information from the past is relevant to the current time step and to carry that information forward.This behavior has natural resonance with hydrologic systems, where storage and transfer can operate across a range of different scales.An advantage of LSTMs is that they can identify these timescales from input data automatically (Tennant et al. 2020).
In this study, we develop a ML model to predict the 7-d backward-looking average salt front location using daily observations of discharge, meteorological variables, and water levels.We train the model using a record of the salt front location from 2001 to 2020 developed by the Delaware River Bay Commission (Preucil and Reavy 2020) with data from 2001 to 2015 used for training and 2016 to 2020 used for testing.We compare the model results with modeled salt front location from the COAWST hydrodynamic model for 3 yr during the holdout period for which we have results for both models.In addition to comparing predictive performance of the two models, we use transfer entropy between river discharge and salt front location to compare model functional performance at identified critical time scales.As COAWST is a state-of-the-art hydrodynamic model (Warner et al. 2010), it serves as a point of comparison for measuring the predictive and functional performance of the ML model.We also investigate the ML model's internal gradients using a method called expected gradients (EGs) (Erion et al. 2021) to assess whether the model is using input variable data in a physically consistent manner.
The reduced computational cost and increased flexibility of ML approaches raise the possibility of quickly running ensembles of scenarios.These applications would have the potential to inform water resources management both in real time and for the long term.For this and other future applications to be realized, modeling frameworks need to be rigorously compared to both observational data and the existing state-of-the-art modeling approaches to understand their strengths and weaknesses.Although ML approaches have been applied to estuary salinity in the past, this work focuses on explicitly comparing state-ofthe-art process-based modeling to ML methods using a combination of predictive and functional performance metrics.In addition, our use of EGs to examine the drivers of model response through time is a unique application of this explainable artificial intelligence technique to understand model behavior.

Geographical setting
The Delaware Bay estuary is a coastal plain estuary that drains an area of about 35,000 km 2 across Pennsylvania, New Jersey, New York, Delaware, and Maryland (Hutson et al. 2016).The Delaware River, its main tributary, contributes $ 60% of the annual freshwater to the total discharge of the estuary, with the Schuylkill River contributing an additional 15% and no other single tributary contributing more than 1% (Sharp et al. 1986) (Fig. 1).Streamflow is highest in the spring and lowest in the late summer and early fall (Supporting Information Fig. S1).
Salinity dynamics within the estuary have been characterized as having a weak response to variation in streamflow compared to other estuary systems (Garvine et al. 1992;Aristaz abal and Chant 2013).This is in large part due to significant tidally driven advection dominated by the M 2 principal lunar constituent (Wong 1995), vertical mixing, and the shape of the estuary.Other factors such as wind and gravitational flow due to density gradients formed by freshwater inputs can also affect salinity dynamics (Galperin and Mellor 1990a).
Throughout the estuary, lateral variations in salinity are generally greater than vertical (Wong and Münchow 1995), with more saline water found in the middle of estuary.These lateral gradients are likely less important in the upper reaches of the estuary, as it narrows considerably (Sharp et al. 1982).While lateral salinity gradients can be important for driving salinity dynamics in certain parts of the estuary, they are beyond the scope of this study.

Data sources
The ML model uses daily dynamic drivers to make predictions: discharge at two locations, four meteorological variables (barometric pressure, air temperature, wind direction, and wind speed), and four tidal signatures, for a total of 10 predictors.Daily river discharge data were acquired from the Delaware River at Trenton, NJ (USGS Site 01463500) and Schuylkill River at Philadelphia (USGS Site 01474500) using the USGS National Water Information System (U.S. Geological Survey 2022).Daily meteorological data were downloaded from the NOAA National Estuarine Research Reserve System (NERRS) at Saint Jones River (DELSJMET) (NOAA NERRS n.d.) (Fig. 1).We used the Saint Jones River data because that station provided the only complete record of all variables of interest among stations in the area, and the station record is more locally accurate than a gridded meteorological product.
Observed and predicted water level data from the mouth of the estuary were downloaded from the NOAA National Ocean Service (NOS) Tides and Currents web portal for Lewes, DE (8557380) at a 6-min resolution (NOAA National Ocean Services n.d.).Water levels are predicted for specific locations along the coastline using celestial mechanics and past local observations (NOAA Center for Operational Oceanographic Products and Services n.d.).Water level data was converted from 6-min resolution to the daily time step by calculating the daily range, maximum height, difference between predicted and actual height, and subtidal frequency fluctuations (Supporting Information Material Data preprocessing).
The ML model target variable is the backward-looking 7-d average location of the 250 mg L À1 Cl À isochlor, known as the salt front.The salt front location in units of RM is a specific regulatory metric, as such our modeling and analysis is done in these units (Delaware River Basin Commission 2019).The location of the salt front is estimated by the Delaware River Basin Commission using daily observations of specific conductivity from four locations throughout the estuary, Reedy Island (RM 54), Chester (RM 84), Fort Mifflin (RM 92), and the Ben Franklin Bridge (RM 100).Specific conductivity is converted to Cl À using a locally calibrated conversion, and a log-linear interpolation method is used to estimate Cl À concentration between observation points (Preucil and Reavy 2020).When station data is missing due to a faulty sensor or routine maintenance, often in the winter months, a correction is applied to interpolate between non-adjacent sensors.Uncertainty estimates are not available for the salt front location.For the purposes of calculating the salt front from observations, salinity is synonymous with Cl À concentration.
The 7-d average salt front is similar to the position of the 2 g kg À1 isohaline (X 2 ) used as a measure of salinity intrusion in other estuaries (Guerra-Chanis et al. 2019).It is an advantageous target variable for several reasons.First, it is a management-relevant variable that is directly referenced in decisions to mitigate salinity by releasing water from upstream reservoirs (U.S. Geological Survey, Office of the Delaware River Master 2017).Second, in contrast to an Eulerian frame of reference such as salinity concentration at single or multiple locations, a time series of the salt front location describes the directional history of salinity within the estuary; it conveys whether salinity is advancing inland or being repelled downstream.Finally, the salt front location is a single value that provides a succinct summary of salinity conditions in the estuary; more complex model output, such as maps of salinity or salinity at several locations, would require post-processing to produce similarly useful information.Although choosing an interpolated value for the modeling target introduces additional uncertainty from the interpolation method that might be avoided if we were instead to model salinity at several locations, we elected to select the salt front record as it is a target with more management relevance.

ML model
To model the 7-d average location of the salt front we use the LSTM ML method (Hochreiter and Schmidhuber 1997).The LSTM is a type of temporally aware neural network that has been used successfully to model complex hydrologic systems (Kratzert et al. 2018;Read et al. 2019;Jia et al. 2021;Sadler et al. 2022).Its success in hydrology over other ML methods is largely due to its ability to represent system memory and antecedent conditions.An LSTM is a sensible choice in this case as antecedent conditions and system memory have strong influence on the salt front movement (Supporting Information Fig. S1).The fundamental unit of an LSTM is a cell or hidden unit, which is made up of a series of gates through which incoming data pass.The gates control the flow of information throughout the model: the input gate controls which information from the current time step is added to the model's memory, the forget gate controls which information from the previous time step is removed from memory, and the output gate controls which information is used to make the current time step's prediction.The LSTM cells can be represented in a simplified way by: where h and c are the hidden state and cell state, respectively, and x t represents a vector of the model inputs at time t.The full set of equations describing the LSTM's functions as well as details describing the choice of modeling hyperparameters and number of replicates can be found in the Supporting Information.
In Eq. 2, the LSTM is used to generate the prediction of the salt front location, s t , using linear regression.We employed a model with a single LSTM layer, 20 hidden units, a look back sequence length of 365 d, dropout of 0.1 and a recurrent dropout of 0.3.The model was trained for 500 epochs with a learning rate of 0.003.The results presented are the average of 10 model replicates differing only in the random initiation of model weights.All input data were z-score normalized before modeling.

COAWST hydrodynamic model
The COAWST modeling system (Warner et al. 2008(Warner et al. , 2010) was used to model the processes controlling salinity intrusion in the Delaware Bay.The Regional Ocean Modeling System (Haidvogel et al. 2000(Haidvogel et al. , 2008;;Shchepetkin and McWilliams 2005) was employed as the ocean circulation model.River discharge was prescribed using the USGS gage at Trenton, NJ (USGS Site 01463500; U. S. Geological Survey 2022).Flows at the Chesapeake and Delaware Canal were estimated from NOAA water level and velocity measurements at Chesapeake City, MD (8573927; NOAA National Ocean Services n.d.).Tidal forcing was from the Advanced Circulation Model (Mukai et al. 2001).Subtidal forcing was generated using a 32-h low pass filter from water level observations interpolated from Lewes, DE (8557380) and Cape May, NJ (8536110).Temperature and salinity boundary conditions at the coastal ocean are provided by the COAWST forecast model (Warner and Kaira 2022).All model runs included atmospheric dynamics that were forced by the European Center for Medium-Range Weather Forecasts Reanalysis v5 (Hersbach et al. 2020) with a spatial resolution of 30 km and a temporal resolution of 1 h.The COAWST model was calibrated using water level and salinity at four stations.Details of the calibration scheme can be found in (Cook et al. 2023).
The COAWST model simulates estuary water level, velocities, temperature, and salinity, among other output variables, at a spatial resolution of 5-450 m.The model was simulated with 16 vertical levels and run with a 2-to 10-s time step.Several methods for comparison of COAWST model output to ML simulations were considered, including selecting COAWST salinity simulation from the locations where specific conductivity observations were collected and post-processing to interpolate a simulated salt front record.We elected to use the method that had previously been employed by Cook et al. (2023) and was the most likely to be used in a decisionmaking context.To derive a time series of the modeled salt front location from COAWST output, salinity is aggregated to the daily timestep, and the bottom of the vertical cell equal to approximately 0.52 psu (250 mg L À1 Cl À ) is selected (Philadelphia Water District 2019).Each model run consists of simulating a specific year, and each year is run independently.Note that the COAWST model and the ML model are designed to predict different quantities at different spatial and temporal resolutions and scales, therefore the models are not necessarily on equal footing for different prediction tasks.However, in this study, we have designed the comparison to emphasize relevance from a management perspective.

Assessing model performance
The predictive performance of the hydrodynamic and ML models were assessed against the salt front record using root mean squared error (RMSE) and Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe 1970).The model performance was assessed separately for a set of six RM intervals (< 58, 58-68, 68-70, 70-78, 78-82, > 82 RM) using geographic and bathymetric control points (Fig. 1).The intervals do not have the same length but were chosen based on the physical structure of the system.The features of the intervals are discussed in the Spatial Variability in Modeling Results section of the Discussion.
The functional performance of the model is an indication of how well a model represents the relationships between inputs and outputs compared to the observed relationships.To compute functional performance, we used transfer entropy (TE), a concept from information theory that quantifies the asymmetric flow of information between sources and sinks and can capture both linear and non-linear interactions (Schreiber 2000).The TE X!Y is a measure of the reduction in uncertainty in Y (the sink variable) gained through knowledge of X (the source variable), conditioned on the history of Y (Schreiber 2000;Ruddell and Kumar 2009).In our analysis, we focused on discharge from the Delaware at Trenton, NJ and discharge from the Schuylkill at Philadelphia as the source variables (X j ) for transfer entropy calculations, where j is the index of the source location, j is an element of T, S f g for Trenton or Schuylkill, respectively.The sink variable (Y) was the 7-d average salt front location.We calculated TE for each combination of sources and sink for modeled and observed data.For more detail on the calculation of TE please see Supporting Information Material.
Functional performance is defined as: where functional performance values were calculated for each model, and for each individual X j À > Y relationship.
A functional performance equal to 0 is ideal, meaning that the modeled salt front location (Y) has the same relationship to the input (X j ) as measured in the observed data.A functional performance > 0 indicates that the relationship is overdeterministic, while a functional performance < 0 indicates over-randomness.

ML model feature importance and EGs
To assess the importance of each input feature to the accuracy of the predictions, we use permutation feature importance in which a feature is independently replaced with a random sampling of values within the observed range of that feature's values and the trained model is used to make predictions.The resulting error, measured as RMSE, is compared to the error obtained using the observed unpermuted data, and the process is repeated for each input feature.This results in an estimate of the mean decrease in the performance metric resulting from information loss for each permuted feature, where larger decreases indicate that the model relies more on that feature to improve predictions.
EGs are a model agnostic method for calculating a model's local sensitivity to input variables at each individual time step of the model output.High positive or negative EG values indicate that the input variable strongly influences the model prediction at that time step, while EG values near zero indicate that the variable has little influence on the output.The method calculates the accumulated gradients of the change in an output with regards to a change in the input, df(x)/dx, as the input variable goes from some baseline value x 0 to its true value x, that is, Þwhere α = 0 at the baseline and α = 1 at the true value of x and γ is a function describing the relationship between x and x 0 (Jiang et al. 2022) (Supporting Information Fig. S6).The baseline values x 0 are generated by sampling from the range of the input values in the training data, given by D. For this study, we generated 200 random samples with replacement.Given a baseline distribution D, the EG of the ith input variable Φ i ð Þ can be calculated as the following expectation (Erion et al. 2021): where U 0,1 ð Þ is the uniform distribution between 0 and 1.We calculate the EG for each replicate model run, differing only by random seed, and show the average EGs across model replicates for visualization purposes.
While both permutation feature importance and EGs can help assess the influence of input features globally and locally, respectively, neither method explicitly accounts for interaction between features.

ML model results
The baseline ML model reproduces the structure of the salt front record well in both training and testing periods with RMSEs of 1.25 RM and 2.52 RM, respectively (Fig. 2a-d).The model shows worse performance when the salt front location is low in the estuary (< RM 58; Fig. 2e), however, model performance at this range is of lower importance to the modeling objectives which focus on salinity intrusions in the upper estuary, and the salt front location is not tracked below RM 54.When the salt front is high in the estuary (> RM 82), the model matches the observations well during the training period (Fig. 2b), but underpredicts the two major peaks in 2016 and 2019 during the testing period (Fig. 2d).However, the model did identify these as periods of salinity intrusion, and their relative timing is well matched.These conditions are relatively rare, representing 6.5% of the training period and 6.1% of the testing period, however they are of significant importance from a management perspective.
In general, the model performed best under median annual flow conditions.For example, the best performance during the testing set was observed in 2020 (RMSE = 1.38 RM), which was the year with the median cumulative discharge in the Delaware and Schuylkill Rivers during the entire modeling period.In contrast, 2016 was the year the lowest cumulative flow in the modeling period and showed the worst performance (RMSE = 3.31 RM).Similarly, the model performs worse when the salt front is either low or high in the estuary (< 58 or > 78 RM) and performs better when the salt front is closer to its median value (Fig. 2e).
The ML model showed NSE = 0.85 during the testing period and NSE = 0.96 during the training period.Somewhat worse performance was reported by Meyer et al. (2020) who used process-based and empirical models to simulate specific conductivity at several locations within the Delaware Bay estuary.They report NSE = 0.706-0.834for the process-based empirical model and 0.458-0.744for a hydrodynamic model.The modeling scheme in that study did not include holdout data, so comparison to the training performance is more appropriate.

Comparison of ML and COAWST hydrodynamic model
The ML model showed consistent accuracy across all 3 yr in the testing set for which we have model results for both models (2016, 2018, and 2019), with the best performance in 2019 (RMSE = 2.53 RM) and worst performance in 2016 (RMSE = 3.31 RM) (Fig. 3a).In contrast to the ML model, the COAWST model showed a wider range of accuracy, with minimum RMSE in 2019 (2.69 RM) and maximum in 2018 (8.46 RM).The two models show very similar performance in 2016 and 2019, while the ML model outperformed COAWST in 2018 (Fig. 3a). Figure 4a,c,e shows the annual time series with observed and modeled salt front location for both models.In 2016, both models capture the general structure of the salt front record but fail to capture the maximal extent of the salt front intrusion.In 2018, the COAWST model shows a consistent under prediction resulting in poor overall performance, but shows good sensitivity, matching the higher frequency fluctuations of the salt front record similar to the ML model (Fig. 4c).In 2019, both models capture the rising limb of the salt front peak similarly, but the COAWST model shows superior performance in matching the maximum salt front location (Fig. 4e).When aggregated across all 3 yr of comparison, the ML model outperforms the COAWST model across nearly all RM intervals (Fig. 3b); however, much of the poor performance in the COAWST model is driven by 2018, which was a very wet year with higher-than-average flows.The COAWST model shows no consistent performance across RM intervals, with worst performance at higher RM intervals in 2018 (Fig. 4d) and best performance in 2019 (Fig. 4f).

Comparison of functional performance
To assess functional performance of the ML model and COAWST, we focus on the relationship between 7-d average salt front location and discharge of the Delaware River measured at Trenton, NJ (Fig. 5a) at a time lag of 7 d, and the Schuylkill River at Philadelphia, PA (Fig. 5b) at a time lag of 8 d.For example, the salt front location on 10 October 2018 is the average salt front location from 03 October 2018 to 10 October 2018, and its relationship to Delaware discharge on 03 October 2018 is assessed (02 October 2018 for Schuylkill discharge).The time lags were chosen as they showed the strongest correlation between discharge at the two locations and salt front in the observational data (Supporting Information Fig. S1) and, as such, represent important time scales of information transfer.
The ML model results show over-random behavior (functional performance < 0) for 2016 and 2018 for both Delaware and Schuylkill discharge (Fig. 5a,b).Average functional performance for these 2 yr of À0.062 for the Delaware and À0.065 for the Schuylkill indicate that the ML model is underutilizing the information contained within the two discharge records by approximately 6%.However, the functional performance for 2019 shows near-perfect representation of the dischargesalt front relationship for the Delaware (0.002) and the Schuylkill (À0.004).
Averaged across all years, the COAWST model shows more accurate functional performance (closer to zero), with mean functional performance À0.013 and À0.009 for the Delaware and Schuylkill, respectively (Fig. 5a,b).The ML model shows À0.049 and À0.052 for Delaware and Schuylkill, respectively.
There is no consistent relationship between the functional performance and predictive performance (RMSE) for either model (Fig. 5), which suggests that there is not an obvious tradeoff between functional and predictive accuracy.For example, in 2018 the COAWST model shows poor predictive performance (RMSE = 8.46 RM), yet it also shows superior functional performance compared to the ML models.For both ML and COAWST model, 2018 shows the least accurate functional performance.2018 was a particularly wet year with the highest annual cumulative flow during the modeling period in the Schuylkill, and the second highest annual total in the Delaware, which represent conditions outside the range seen during ML training (U. S. Geological Survey 2022).

Feature importance and EGs
The permutation feature importance (Supporting Information Fig. S7) shows that the discharge of the Delaware and Schuylkill Rivers are the most and second most important variables, respectively, followed by the air temperature and wind speed.The feature importance calculation for every input variable besides the two discharge variables resulted in very small decreases in performance (< 0.1 RMSE), indicating that the model relies heavily on information from discharge to make its predictions.A version of the model with only Delaware and Schuylkill discharge as predictors showed an RMSE = 2.81 RM, compared to 2.52 RM for the full set of predictors, suggesting that even though individually the additional features contribute only a small amount to improve predictive performance, together they do improve the model.
The ML model EGs for 2019 are shown in Fig. 6b-d.The year 2019 was chosen to highlight this method as the shift from high discharge periods (Jan-Jul) to a prolonged summer low-flow period (Aug-Oct) showed the clearest pattern in EGs throughout the year.The highest magnitude EGs throughout the year are generally discharge from the Delaware and the Schuylkill Rivers (Fig. 6b).These gradients respond to the discharge record and attribute large decreases in the predicted salt front with high discharge events.For example, the large discharge event in January in the Delaware River (Fig. 6a; blue line) results in a large magnitude negative gradient (decreased salt front prediction) for Delaware discharge (Fig. 6b; red line).Similarly, a large discharge event in July in the Schuylkill River (Fig. 6a; green line) results in a large magnitude negative gradient (decreased salt front prediction) for Schuylkill discharge (Fig. 6b; gold line).
The EGs associated with discharge peaks are generally negative, reflecting the fact that discharge values greater than baseline generally tend to result in a decrease in the location of the predicted salt front.Similarly, generally positive EG values model is attributing a decrease in the predicted salt front to an increase in discharge, such as the Delaware discharge in October and November that push the salt front downstream from its annual maximum.Supporting Information Fig. S6 shows a schematic representation of the how the EGs are accumulated during discharge events and low-flow periods.
During the summer low-flow period, when there are fewer discharge events, the magnitudes of the discharge EGs are smaller than other parts of the year.During this period, the water level residuals (Fig. 6c), and the wind direction EGs (Fig. 6d) show greater magnitude, indicating that the model is relying more heavily on tidal and meteorological input data during periods of low discharge.

Spatial variability in modeling results
The ML models showed consistent spatial variability in model predictive performance (Figs.2e, 4b,d,f), with generally better performance when the salt front was in the intervals 58-78 RM, and worse above or below this range.This is likely a result of two factors: first, during the training and testing period, 85% and 87% of the observations of the salt front location fall in the interval 58-82 RM, respectively.This results in fewer opportunities for the model to learn the dynamics of the more extreme values, leading the model to make predictions that are biased toward central values.Similarly, during the training period, the maximum salt front location was RM 89, while during the testing period, there were 13 d in which the salt front exceeded that value, indicating that the testing period represented an extrapolation into unseen conditions.These observations are consistent with other studies that have shown ML models (Frame et al. 2022;Kayalvizhi et al. 2022) and other modeling approaches (Brunner et al. 2021) can struggle to match extreme values even if they are seen during the training period.
The second factor that contributes to the spatial variability in model response is the physical dimensions of the estuary.The model performance was assessed separately in RM intervals using geographic and bathymetric control points.Hydrodynamic studies of the Delaware Bay estuary show that bathymetric control points can exert significant control on salinity dynamics (Cook et al. 2023).
RM 58 is the location of the Chesapeake and Delaware Canal, connecting the Chesapeake and Delaware Bays.Below this point, the estuary opens into the Delaware Bay where salinity dynamics are governed by tidal energy and belong to a distinct regime from the upper estuary.Any salt front locations below RM 54 were removed from the input data, but observations between 54 and 58 were retained to provide a softer lower limit for the model.The ML model generally over-predicts the salt front location over this interval, resulting in an RMSE = 3.84 RM (Fig. 2e).Some of the error in this region is likely due to the influence of the canal, which can have bi-directional flow throughout the year (Wong 2002) and was not included as input in either model.
Upstream of RM 58, the river tapers in width from $ 5 km at RM 58 to 2.8 km at RM 68.The interval from RM 68 to 70 represents a narrowing of the river below the mouth of Christina River, which drains the urban and suburban areas surrounding Wilmington, DE.From RM 70 to 78, the river has a narrow, deep central channel.The model shows the best performance over these three intervals (RMSE = 1.92 RM for 58-68; RMSE = 1.68 RM for 68-70; and RMSE = 2.97 RM for 70-78).
Between RM 78 and 82, the deep central channel of the river widens, allowing for more lateral transport and mixing, leading to a more complex relationship between discharge and salt front location and likely also influencing performance (RMSE = 3.83 RM).Above RM 82 the river approaches the urban and suburban areas surrounding Philadelphia, PA and is influenced by discharge from Chester Creek and the Schuylkill River (RMSE = 3.86 RM).These intervals are more strongly influenced by urban runoff, which may complicate salinity dynamics further (Sharp et al. 2009).In addition, these areas contain groundwater wells that are used for drinking water abstraction from the underlying Potomac-Raritan-Magothy formation.Pumping in these wells can induce recharge from the river into the groundwater system, which represents an additional factor not accounted for by this modeling framework (Navoy et al. 2005).
How well is the ML model representing the physical system?
Consistent with other studies of salinity dynamics in the Delaware Bay, the ML model identified discharge in the main stem of the Delaware River as the most important driver of the salt front, followed by discharge in the Schuylkill River (Supporting Information Fig. S7) (Wong 1995;Sharp et al. 2009;Ross et al. 2015).Further analysis of the relationship between discharge and salt front indicate that the ML model is underutilizing the information contained within the two discharge inputs (Fig. 5a,b) during 2016 and 2018, while the functional performance for 2019 was near-perfect.These results suggest that under "normal" conditions, like 2019, the ML model does as well as the COAWST model in representing the functional relationship between discharge and salt front location.
In 2016 and 2018, COAWST shows better functional performance than the ML model.This result is not surprising, as the hydrodynamic model forces physically realistic flow at every time step.Even so, the EGs and feature importance together show that the ML model identified the most important drivers in the system and developed physically realistic relationships between inputs and outputs.Furthermore, it showed that those relationships could be used to predict the salt front location under a wide range of conditions, as 2016 had the lowest combined discharge (Delaware plus Schuylkill) of any year in the modeling period.As noted, 2016 and 2018 represent low and high discharge end members in the modeling set, respectively, while 2019 represents a more typical discharge year in both the Delaware and Schuylkill Rivers.
The decreased functional performance in 2016 and 2018 may be a result of several factors.First, average conditions are better represented in the training set and therefore more accurately simulated in the testing set, similar to predictive performance results.In addition, the seven-and eight-day time lags used to assess functional performance from Delaware and Schuylkill discharge, respectively, represent the average maximum magnitude correlation between discharge and salt front location throughout the entire modeling time period.However, viewed on an annual basis, the time lag of the maximum magnitude correlation varies as a function of the cumulative discharge amount (Supporting Information Fig. S8).This serves to highlight the year-to-year variability in the duration and strength of system memory that is disrupted by frequent discharge events (i.e., 2018) and extended by prolonged periods of low-flow (i.e., 2016).In general, drier years show longer time lags of maximum association between discharge and salt front location indicating longer system memory under drier conditions.As such, the assessment of functional performance at the time lag of maximum magnitude correlation may be most relevant for median flow years, while greater (lesser) time lags are likely more important for drier (wetter) years.Similar relationships have been noted in others estuary systems, for example, Qiu and Wan (2013) defined a piece-wise autoregressive relationship between discharge and salinity dynamics for different discharge regimes in the Caloosahatchee River Estuary.
While LSTMs are temporally aware ML algorithms and would ideally identify these patterns directly from the data, a process guided deep learning approach might be particularly useful in such a setting where critical time scales are variable and knowable (Read et al. 2019).To improve model functional performance at a particular time lag, the functional performance calculation could be incorporated into the ML model loss function.However, care must be taken to ensure that there is independent assessment of functional performance and that the modeling approach is not subject to kludging, where the model is tuned to fit an overly narrow set of performance metrics (Clark 1987;Ruddell et al. 2019).
In addition, the modeling target variable is calculated from observations.The interpolation naturally introduces uncertainty into the exact location of the salt front.This uncertainty could be reduced by predicting salinity at a specific location or several locations where observational data are available.

Leveraging the strengths of hydrodynamic and ML models together
The ML model presented in this study can be trained and used to make 20 years of predictions in a wall time of approximately 20 min on 16 2.60 GHz processors, while the COAWST model runs in approximately 50 h on 180 processors to produce a single year of predictions (4000 CPU hours per year).As noted previously, the COAWST model produces 2 TB of model output, with high-resolution estimates of water level, temperature, salinity, and other variables.For detailed investigations of estuarine circulation, the information gained is often worth additional computational cost.However, for prediction tasks such as the location of the salt front, or water quality constituent concentrations at various discrete locations, a ML approach such as the one shown here can produce similarly accurate results at a fraction of the computational cost.This has significant implications for the ability of such a model to help inform management decisions.The relative speed and flexibility of the ML approach means that the model could be altered or improved more easily based on stakeholder suggestions.For example, in its current form the ML model could be adjusted to make predictions of salinity at several specific locations throughout the estuary, assuming that the training data at those locations were obtainable.In addition, the ML models can be used to run a wider suite of scenarios that might cover potential future climate conditions and management actions than what would be feasible using a hydrodynamic model.
Similar advantages have been noted in several other diverse estuarine systems.For example, in the Chesapeake Bay, salinity and temperature response to climate change scenarios were modeled using statistical trees, a data-driven technique similar to traditional tree-based ML methods (Muhling et al. 2018).In addition, estuary systems along the Gulf of Mexico USA, the Pearl River in China, and the Danshui River in Taiwan have been successfully modeled using ML approaches with applications to forecasting and operational early warning systems (Chen et al. 2016;Zhou et al. 2021;Weng et al. 2024).In the Caloosahatche River Estuary in Florida, USA a statistical model was competitive with a three-dimensional hydrodynamic model and the statistical model was shown to be particularly useful for forecasting in support of upstream water resource decision-making (Qiu and Wan 2013).
Investigating the coupling of the ML model to upstream reservoir operations models to quickly run scenarios could reveal how different timing and magnitude of reservoir releases affect the movement of the salt front.Such information could have direct application to reservoir operation and water resource management and relies heavily on the rigorous model evaluation conducted in the current study to demonstrate the predictive and functional accuracy of the ML modeling approach.
Despite the clear computational advantages, the direct comparison of the ML models with COAWST showed that there were conditions in which the hydrodynamic model consistently outperformed the ML model.For example, the maximum salinity intrusion in 2019 was much better matched by COAWST than the ML model (Fig. 4e).As such, the ML model should not be used in its current form to make predictions of the extent of salinity intrusion during extreme events when the salt front is ≥ $ 82 RM.This highlights the utility of a dual modeling approach, where process-based models can be used in conjunction with ML approaches to collectively improve the performance and extrapolative power of each approach.While we focus on a comparison of modeling approaches in this study, future work could be focused on conceptual and/or computational integration of the models, for example through alteration of the ML loss function to conserve mass in the system (Read et al. 2019), using a ML model to emulate COAWST output (Chen et al. 2018), or rewriting portions of the COAWST model in a differentiable language to employ a differentiable ML framework (Feng et al. 2022).Further, the current model could be improved by including additional observational or modeled input data to drive the model.For example, records of discharge and/or salinity from contributing tributaries could increase model performance.In addition, the incorporation of more historical data that represents extreme conditions may help to increase the model performance, particularly at the model extremes.For example, the records of discharge and salt front are available for the drought of record for the area in the 1960s and could be included in future training runs; however, complete meteorological observations are sparser historically, such that tradeoffs might include the need to interpolate input data and/or employ a simpler ML model.

Conclusions
In this study, we developed a ML model to make hindcast predictions of the 7-d average 250 mg L À1 isochlor in the Delaware Bay estuary over the period 2001-2020.In comparison to results from a state-of-the art hydrodynamic model, the ML model produces overall more accurate results and similarly accurate functional coupling between inputs and outputs at a drastically reduced computational cost.The EGs of the ML model reveal that it has learned physically reasonable relationships, and the model is able to make accurate predictions under a wide range of conditions.While there are areas of potential improvement (e.g., prediction of extreme values, quantification of uncertainty), the current approach already shows promise for flexible exploratory and predictive modeling of estuary salinity dynamics in support of management decisions.oxygen at the continental scale?Environ.Sci.Technol.55: 2357-2368.doi:10.1021/acs.est.0c06783Zhou, J., M. J. Deitch, S. Grunwald, E. J. Screaton, and M.

Fig. 1 .
Fig. 1.Map of study area.Orange bars indicate river mile intervals used for model performance analysis; yellow inverted triangles provide historical and management context.Projection: World Geodetic System 1984.Basemap generated using the tigris and dataRetrieval packages in R (R Core Team 2022; DeCicco et al. 2023; Walker 2023), depth data from (NOAA National Centers for Environmental Information 2005).

Fig. 2 .
Fig. 2. Machine learning model results for training period (a, b) and the testing period (c, d).Panel (e) shows the testing period model results broken down by river mile interval, with the standard deviation among 10 replicates shown with error bars.

Fig. 3 .
Fig. 3. Comparison of ML model with COAWST model for each year in the testing set (a) and for each river mile interval (b), the standard deviation among 10 replicates for the ML model is shown with error bars.

Fig. 4 .
Fig. 4. Comparison of predictions from COAWST (orange) and ML model (blue) for the salt front location for 2016 (a, b), 2018 (c, d), and 2019 (e, f).Predictive performance for COAWST and ML models for years during the ML testing period (b, d, f) binned by performance at different river mile intervals.The standard deviation among 10 replicates for the ML model is shown with error bars.

Fig. 5 .
Fig. 5. Functional performance for (a) the relationship between the discharge 7 d previously in the Delaware River at Trenton, NJ and the 7-d average salt front location and (b) the discharge 8 d previously in the Schuylkill River at Philadelphia, PA and the 7-d average salt front location plotted against predictive performance for each year (symbol) and model (color).A functional performance value of 0 indicates perfect fidelity to observed relationships.