A Hybrid, Non‐Stationary Stochastic Watershed Model (SWM) for Uncertain Hydrologic Simulations Under Climate Change

Stochastic Watershed Models (SWMs) are emerging tools in hydrologic modeling used to propagate uncertainty into model predictions by adding samples of model error to deterministic simulations. One of the most promising uses of SWMs is uncertainty propagation for hydrologic simulations under climate change. However, a core challenge is that the historical predictive uncertainty may not correctly characterize the error distribution under future climate. For example, the frequency of physical processes (e.g., snow accumulation and melt) may change under climate change, and so too may the frequency of errors associated with those processes. In this work, we explore for the first time non‐stationarity in hydrologic model errors under climate change in an idealized experimental design. We fit one hydrologic model to historical observations, and then fit a second model to the simulations of the first, treating the first model as the true hydrologic system. We then force both models with climate change impacted meteorology and investigate changes to the error distribution between the models. We develop a hybrid machine learning method that maps model state variables to predictive errors, allowing for non‐stationary error distributions based on changes in the frequency of model states. We find that this procedure provides an internally consistent methodology to overcome stationarity assumptions in error modeling and offers an important advance for implementing SWMs under climate change. We test this method on three hydrologically distinct watersheds in California (Feather River, Sacramento River, Calaveras River), finding that the hybrid model performs best in larger and less flashy basins.


Introduction
Climate change and its uncertain impacts on the hydrologic system pose major challenges to the adaptation of existing water resources infrastructure and the design and construction of new infrastructure (Boland & Loucks, 2021; Stakhiv & Hiroki, 2021). This challenge is particularly notable in regions where data to support precise hydrologic modeling are limited. Considering this challenge, methods that quantify uncertainty in future hydrology play an increasingly critical role in the modern practice of water resources planning and management (Bloschl et al., 2019; Brown et al., 2015; Hui et al., 2018; Milly et al., 2008; Read & Vogel, 2015; Sterle et al., 2019).
In the past, historical hydrologic variability was deemed an adequate representation of future hydrologic uncertainty, motivating the use of stationary, stochastic streamflow models (SSMs) in engineering design and planning (Loucks & Van-Beek, 2017; Teegavarapu et al., 2019; Thomas & Fiering, 1962). As the impacts of climate change (and other land use change) have become increasingly apparent, many have questioned the suitability of such stationary statistical models for infrastructure planning (Galloway, 2011; Milly et al., 2008; Montanari & Koutsoyiannis, 2014). While the parameters of SSMs can be modified to enable the simulation of new hydrologic behavior (e.g., Bracken et al., 2014; Hadjimichael et al., 2020), the range of plausible change is difficult to infer without a modeling framework that can predict emergent patterns of hydrologic response to climate change.
Stochastic watershed models (SWMs) were developed to address this challenge (Vogel, 2017). SWMs combine deterministic predictions from process-based hydrologic models with a stochastic element that captures model uncertainty (Farmer & Vogel, 2016; Sikorska et al., 2015; Steinschneider et al., 2015; Vogel, 2017). The use of process-based models enables hydrologic projections that explicitly represent changes to meteorological forcings and landscape characteristics (e.g., vegetation or land use) and their non-linear impacts on hydrologic response. The stochastic component of a SWM represents hydrologic uncertainty that the deterministic model cannot capture. In the most straightforward case, this uncertainty is approximated by the predictive uncertainty of the model (i.e., based on errors between model predictions and the observations). The predictive uncertainty reflects the integration of input, parametric, and model uncertainty (Montanari & Koutsoyiannis, 2012) and can be represented by a variety of error modeling approaches (Koutsoyiannis & Montanari, 2022; McInerney et al., 2017; Shabestanipour et al., 2023; Vogel, 2017). The addition of simulated model errors to deterministic hydrologic model simulations creates a SWM simulation, and repetition of this process using multiple random samples of error yields a SWM ensemble that can be used for short-term probabilistic prediction (e.g., flood forecasting; Sikorska et al., 2015; McInerney et al., 2018; Koutsoyiannis & Montanari, 2022), subseasonal-to-seasonal forecasting (McInerney et al., 2020), and long-term planning (e.g., design event estimation; Farmer & Vogel, 2016; Shabestanipour et al., 2023).
To date, one important issue in stochastic watershed modeling that remains unresolved relates to non-stationarity in the stochastic process for predictive uncertainty. When used to develop hydrologic projections under climate change, past studies have made the implicit assumption that predictive uncertainty inferred from historical errors is sufficient to characterize future uncertainty (Shabestanipour et al., 2023; Sikorska et al., 2015; Vogel, 2017). Some have argued this approach is sufficient if the deterministic component of the model can account for nonstationarity (Montanari & Koutsoyiannis, 2014). However, there are reasons to doubt this assumption in the context of hydrologic model prediction. The distribution of hydrologic model prediction errors in the historical period implicitly reflects the historical distribution of model states and observed hydrology (Liu & Gupta, 2007; Renard et al., 2011; Vogel, 2017). The effects of climate change go deeper than simply amplifying or attenuating hydrologic response, instead affecting fundamental process relationships within catchments, including the timing and rate of snow accumulation and melt (Mote et al., 2018; Musselman et al., 2017), timing of peak soil moisture (Xu et al., 2021), and changes to runoff efficiency through both physical (Lehner et al., 2017; Overpeck & Udall, 2020) and biophysical (Mankin et al., 2019) effects. These climate-change-induced effects will alter the frequency, timing, and intensity of model states, activate model components in configurations not seen in the historical record, and change the way meteorological forcing is converted to streamflow. In turn, the model predictive errors would be expected to exhibit fundamental departures from the distributional properties observed in the historical period. For instance, if within the historical record a hydrologic model exhibits different error distributions during periods of snow accumulation and melt versus periods of direct rainfall-runoff response (e.g., because of different, incorrect process representations under those two different hydrologic regimes), and under climate change the former process becomes less frequent and the latter more frequent (or less/more intense), the distribution of model errors under climate change would almost certainly change compared to the historical period. To the authors' knowledge, this issue in SWMs has not yet been documented in the literature.
The potential for non-stationary predictive errors complicates an already difficult problem in stochastic watershed modeling (Beven, 2016). Hydrologic prediction errors exhibit a number of challenging characteristics, including autocorrelation, heteroscedasticity, and non-normality, even in the stationary case (Hunter et al., 2021; McInerney et al., 2017, 2018; Schoups & Vrugt, 2010). Efforts to understand and quantify these errors (Liu & Gupta, 2007) have progressed from simple autoregressive techniques (Toth et al., 1999) to more complex statistical methods using either decomposition (Kuczera et al., 2006; Renard et al., 2011) or aggregate approaches to predictive error modeling (McInerney et al., 2018; Montanari & Koutsoyiannis, 2012; Shabestanipour et al., 2023; Sikorska et al., 2015). One recent approach that has gained significant traction is the use of machine learning (ML) to correct prediction errors of process-based hydrologic models (Hah et al., 2022; Konapala et al., 2020; Quilty et al., 2022; Shen et al., 2022a, 2022b). These approaches (often termed "hybrid" or "physics informed data driven" models) range from simpler ML-based error correction models (Konapala et al., 2020; Shamseldin & O'Connor, 2001; Shen et al., 2022a, 2022b) to more complex stochastic formulations that utilize ensembles of hydrologic model simulations, each with different parameter sets and ML-based error correction models, and possibly including contributions from additional uncertainties (e.g., input, parameter; Quilty et al., 2022; Hah et al., 2022).
A key advantage of hybrid approaches is their ability to exploit information contained in the internal states of the process-based model (Beven, 2020; Hah et al., 2022; Quilty et al., 2022; Shen et al., 2021). While hybrid methods do not consistently improve hydrologic predictive performance over more direct ML methods (Frame et al., 2021), they can add realistic physical constraints to model predictions and help to address the issue of uncertainty representation in these methods (Klotz et al., 2022). In addition, some initial work suggests hybrid models may be more appropriate than direct ML prediction models for long-term projections that extrapolate hydrologic responses under unprecedented climate change (Feng et al., 2023; Reichart et al., 2023; Wi & Steinschneider, 2022, 2024). A similar logic may support non-stationary models of hydrologic model error for future uncertainty quantification. That is, hybrid methods that map process model states to predictive errors may also be able to exploit these relationships to capture non-stationarity in error structure based on changes in the frequency of hydrologic regimes (i.e., changing frequency of projected model state variables). This approach decouples the error models from static empirical relationships that may change fundamentally in a future climate, such as seasonality in the error distribution. To date, the potential of hybrid models to support non-stationary SWMs remains unexplored.
In this work, we demonstrate for the first time the challenge of non-stationary prediction errors in stochastic watershed modeling under climate change, and we advance a novel, hybrid modeling framework to address this challenge. We demonstrate this work in a case study of the Feather River basin upstream of Oroville Lake in northern California. We first introduce a stylized experimental design in which one hydrologic model is calibrated to observed streamflow and treated as the true hydrologic system (hereafter the "truth model"), while a second model (hereafter the "process model") is then calibrated to simulations from the truth model. We force both models with the same set of non-stationary meteorological inputs and document non-stationarity in the error distribution between them. We then develop a hybrid error model composed of an ML-based error correction model and a dynamic residual noise model, both of which use process model state variables to infer error properties. We assess the ability of this hybrid error model, coupled with simulations from the process model, to preserve the statistical properties of the truth model in out-of-sample cases with and without the impacts of climate change, and compare results to a static SWM approach as a benchmark. We conclude the study by demonstrating the same technique for a process model fit to actual streamflow observations in the Feather River basin, as well as two other California basins (Sacramento River above Shasta Lake, Calaveras River above New Hogan Lake) that differ in size and hydrologic regime.

Study Area and Data
The Feather River basin upstream of Lake Oroville drains an area of 9,338 km² on the west-facing slopes of the northern Sierra Nevada Range (Figure 1), and serves as the largest water source directly contributing to the State Water Project of California. This portion of the Sierra Nevada reaches altitudes of nearly 3,000 m, making the Feather River a snow-dominated catchment. The precipitation regime is driven by large, infrequent atmospheric rivers (ARs) that exhibit significant inter-annual variability and occur primarily in the cold season (November-April) (Hanak et al., 2011). Accordingly, streamflow varies considerably across years and also across seasons, as snowmelt drives higher flows in the spring and early summer months and high evapotranspiration drives lower flows in late summer and fall. Winter flows can vary considerably in response to winter storms, particularly when associated with AR-induced warming or rain-on-snow events (Hanak et al., 2011; Huang et al., 2012). Climate change is expected to significantly impact hydrologic response in this region through reduced snowpack, earlier snowmelt, and changing precipitation characteristics (Hanak et al., 2011; Huang et al., 2012; Sterle et al., 2019).
We also consider flows in the Sacramento River upstream of Shasta Lake (SHA) and the Calaveras River upstream of New Hogan Lake (NHG) for part of our analysis described in Section 3.5. The Sacramento River basin is approximately two times larger than the Feather River basin (12,262 km²) and flows out of the southern Cascade Range, which is lower in elevation than the Sierra Nevada and less snowmelt dominated. The Calaveras River basin is approximately one tenth the size of the Feather River basin (940 km²) and originates in the foothills of the Sierra Nevada, making it primarily rainfall dominated and its hydrology flashier than that of the Feather or Sacramento Rivers.
Daily streamflow data were taken from the California Data Exchange Center (CDEC) Full Natural Flow (FNF) database for water years (WY) 1989-2018 (1 October 1988-30 September 2018). FNF represents the water production of a river basin unaltered by upstream diversions, storage, and imports of water (see Figure 1). FNFs are calculated from observed flows and estimates of water diversion and imports, reservoir operations, and reservoir evaporation (Zimmerman et al., 2018). We note that even though FNFs on the Feather River are calculated rather than observed natural flows, this will have little impact on our stylized experiment described in Section 3, which fits and compares one hydrologic model to another. In addition, the two other study locations are relatively unimpaired, with only a few very small reservoirs upstream of Shasta Lake with insignificant storage capacity (∼1% of mean annual flow), and no reservoirs on the Calaveras River upstream of New Hogan Lake.
We used daily precipitation from the 6 km climate product of Pierce et al. (2021) as input forcing to all hydrologic models used in this work. Daily minimum and maximum temperature were taken from the 6 km data set of Livneh et al. (2015) up through December 2015 and then extended to September 2018 using the PRISM daily data set (PRISM Climate Group, 2014) to match the timeframe of the precipitation data.

Experimental Design
This study employs a stylized experimental design to demonstrate the challenge of non-stationary prediction errors in SWMs under climate change and to evaluate whether a novel, hybrid modeling framework can address this challenge (Figure 2). To demonstrate the utility of our approach, it is necessary to establish experimental conditions where we know the truth and can therefore measure how well our methods capture it. We first select two hydrologic models, designating one as the "truth model" and the other as the "process model." The truth model is taken to be the true hydrologic system, and simulations from this model under alternative meteorological forcing are taken to be the true hydrologic response to that forcing. In this study, we use SAC-SMA (Burnash, 1995) as the truth model. The process model represents an (imperfect) model of the true hydrologic system that can only approximate new hydrologic responses under alternative meteorological forcing. HYMOD (Boyle, 2001) is used as the process model in this work. In the stylized experiments below, we first ensure that SAC-SMA provides a reasonable representation of the observed hydrology of the Feather River basin, but then assume the fitted SAC-SMA model is the real-world watershed of interest. The goal of the HYMOD model is then to predict the behavior simulated by the SAC-SMA model, and not to predict the actual observations. We use this "model-as-truth" approach, akin to "perfect model" experiments used in the climate modeling community (Abramowitz & Bishop, 2015; Herger et al., 2018; Knutti et al., 2017), to overcome the challenge of having no future observations with which to compare our process model. That is, because there are no streamflow observations in the distant future under substantial amounts of climate change, there is no way to determine whether hydrologic model errors are non-stationary under highly non-stationary meteorological forcing. By using one model (SAC-SMA) as the true hydrologic system and another (HYMOD) as a model of that system, we can explore this issue in a controlled (albeit highly stylized) experiment. We describe the hydrologic models used for the truth and process models (SAC-SMA and HYMOD, respectively) in more detail in Section 3.2 below.
In the experiment, we evenly split the available record into a training period for calibration and validation (WY 1989-2004) and a test period for out-of-sample model evaluation (WY 2005-2018), following common practice. During the training period, we calibrate the truth model to observed streamflow data and then calibrate the process model to the truth model in that same period (gray dashed box in Figure 2). We then examine the distribution of hydrologic prediction errors between the truth and process models in the test period, calculated based on historical precipitation and temperature data from this period (Test), as well as historical precipitation and temperature warmed by 4°C (Test + 4C). The Test + 4C case is produced by simply adding 4°C to all temperatures in the testing period, and is reflective of the end-of-21st-century warming in California projected by a multi-model climate ensemble average under the RCP 8.5 emissions scenario (Pierce et al., 2018). We then document changes to the error distribution between the truth and process models across these two scenarios, and can attribute these changes to differences in how the truth and process models respond to warming, with all other aspects of meteorology held constant.
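The delta-change construction of the Test + 4C scenario can be sketched in a few lines (Python used here purely for illustration; the function and variable names are our own, not from the study's codebase):

```python
import numpy as np

def make_test_plus_4c(tmin, tmax, delta=4.0):
    """Delta-change scenario: add a uniform warming offset to the daily
    minimum and maximum temperature series, leaving precipitation
    untouched (a sketch of the Test + 4C case)."""
    return tmin + delta, tmax + delta

# Toy daily temperature series (degrees C) for illustration
tmin = np.array([-2.0, 0.5, 3.0])
tmax = np.array([5.0, 9.0, 14.0])
tmin_4c, tmax_4c = make_test_plus_4c(tmin, tmax)
```

Because only temperature is perturbed, any change in the truth-versus-process error distribution under this scenario is attributable to how the two models respond to warming.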
Errors between the truth and process model in the training period, along with state variables from the process model, are then used to fit a hybrid SWM (described in Section 3.3). The hybrid SWM combines samples of error between the truth (SAC-SMA) and process (HYMOD) models with deterministic simulations from the process model to develop an ensemble of streamflow traces that is statistically consistent with flows from the truth model. We evaluate whether our proposed hybrid SWM can capture potential changes in the error distribution between the Test and Test + 4C scenarios, and also compare the results of our hybrid SWM against a simpler benchmark SWM that does not depend on process model state variables (described in Section 3.3.3). We use interpretability methods to understand how the hybrid SWM uses model state variables to estimate changes to the error distribution (see Section 3.4).
Finally, because there are limitations to using a model to represent the true hydrologic system in the stylized experiment above, we repeat the experiment in a real-world (non-stylized) setting. Here, we train and test the performance of a hybrid SWM against actual streamflow observations for the Feather River (ORO), Sacramento River (SHA), and Calaveras River (NHG) over the historical record, using SAC-SMA as the process model (Section 3.5). That is, the SWM combines samples of error between observations and SAC-SMA predictions with deterministic simulations from SAC-SMA to develop an ensemble of streamflow traces that is statistically consistent with observed flows.

Hydrologic Model Setup
We calibrate two daily hydrologic models for the Feather River basin, SAC-SMA (Burnash, 1995) and HYMOD (Boyle, 2001), which are used as the truth and process models, respectively. We selected these models for two reasons. First, both have previously been developed for the Feather River basin, performed very well against observations, and outperformed other process-based models like the Variable Infiltration Capacity model (see Wi & Steinschneider, 2022). Second, SAC-SMA and HYMOD have similar structures, and so any non-stationarity in the predictive errors between these models under climate change would indicate a high likelihood that non-stationarity would also emerge between models (or between models and actual observations) with more significant structural differences.
Both SAC-SMA and HYMOD are built using 828 hydrologic response units (HRUs) defined for the Feather River basin by segregating each 6 km climate grid cell into different soil classes from the 1 km resolution State Soil Geographic data set (Miller & White, 1998). The temperature forcings are adjusted for each HRU using the monthly lapse rates derived by Wi and Steinschneider (2022) for the area. The Lohmann routing model (Lohmann et al., 1998) traces the runoff from HRUs through the river channel to simulate streamflow at the basin outlet (i.e., daily inflows into Oroville Dam).
We use a genetic algorithm (Wang, 1991) to calibrate the hydrologic models, with Nash-Sutcliffe Efficiency (NSE; Nash & Sutcliffe, 1970) as the objective function. We first calibrate SAC-SMA (the truth model) to the full natural flows for the period WY 1989-2004. The flows simulated by SAC-SMA were then used to calibrate HYMOD (the process model) for the same period. For the training (test) period of WY 1989-2004 (WY 2005-2018), SAC-SMA simulations achieved a training (test) NSE of 0.92 (0.88) when fit to the observations, whereas HYMOD NSE was 0.95 (0.92) when fit to the SAC-SMA flows. Further information on hydrologic model calibration is provided in Supporting Information S1. We also note that in previous work (Wi & Steinschneider, 2022), HYMOD achieved a training (test) NSE of 0.80 (0.78) when fit to the observations, although that calibration of HYMOD is not used in this stylized experiment.
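As a concrete reference, the NSE objective used for calibration can be computed as follows (a minimal sketch with toy data; the genetic-algorithm search itself is not shown):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 minus the ratio of the model error
    variance to the variance of the observations about their mean.
    NSE = 1 is a perfect fit; NSE = 0 matches the mean benchmark."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([10.0, 12.0, 8.0, 15.0, 9.0])   # toy daily flows
print(nse(obs, obs))                            # perfect simulation -> 1.0
print(nse(obs, np.full(5, obs.mean())))         # mean benchmark -> 0.0
```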
All internal state variables simulated by HYMOD (the process model) are used to inform our error model (see Table 1, also Supporting Information S1). These include simulated streamflow (sim), runoff, baseflow, snow water equivalent (swe), and upper and lower soil moisture (upr_sm, lwr_sm), and represent basin-wide average states (i.e., the sum of HRU state variables weighted by percent area). We also include meteorological input variables (e.g., temperature, precipitation) in this list, and use the term "state variables" hereafter to refer to both meteorological and internal hydrologic model state variables, as in Shen et al. (2022a, 2022b).
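The area-weighted aggregation of HRU states to basin-wide values can be illustrated with a toy sketch (the numbers and names here are hypothetical, not taken from the study):

```python
import numpy as np

def basin_average(hru_states, hru_area_frac):
    """Aggregate HRU-level state variables to a basin-wide value as an
    area-fraction-weighted sum over HRUs."""
    return np.dot(hru_area_frac, hru_states)

# 3 hypothetical HRUs covering 50%, 30%, 20% of the basin;
# one state variable (e.g., SWE in mm) per HRU
swe = np.array([100.0, 40.0, 0.0])
frac = np.array([0.5, 0.3, 0.2])
print(basin_average(swe, frac))  # -> 62.0
```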

Hybrid SWM
We develop a novel, hybrid SWM that is composed of deterministic and stochastic components:

Q_t = F(X_t, π) + e_t (1)

Here, Q_t is the true streamflow at time t, F(X_t, π) is a deterministic streamflow estimate from a process model F conditioned on meteorological and other inputs X_t and parameters π, and e_t is the stochastic prediction error. In our stylized experiment, Q_t represents flow from SAC-SMA and F(X_t, π) represents flow from HYMOD, but in real-world applications, Q_t would be observed streamflow and F(X_t, π) would be estimates from any hydrologic model of interest. To estimate the SWM, we follow the approach of Montanari and Brath (2004): we first train the model F to the flows Q_t (i.e., estimate the model parameters π), and afterward we develop an error model to represent the stochastic behavior of e_t. While more sophisticated approaches are possible that estimate the error model jointly with F and quantify parameter uncertainty in π (Kuczera et al., 2006; Renard et al., 2011), we opt for a simpler, staged approach that is easier to implement and helps avoid complex interactions between process and error model estimation.
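The SWM decomposition described above amounts to adding a sampled error series to the deterministic simulation. A minimal sketch, with a placeholder Gaussian sampler standing in for the full error model developed below:

```python
import numpy as np

rng = np.random.default_rng(0)

def swm_trace(process_sim, error_sampler):
    """One SWM trace: the deterministic process-model simulation plus a
    sampled error series (error_sampler is a placeholder for whatever
    error model has been fitted)."""
    return process_sim + error_sampler(len(process_sim))

sim = np.array([50.0, 60.0, 55.0])                       # toy deterministic flows
trace = swm_trace(sim, lambda n: rng.normal(0.0, 5.0, n))  # placeholder errors
```

Repeating the call with independent error samples yields an ensemble of traces around the deterministic simulation.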
The primary methodological contribution of this work is an adaptive, state-variable-dependent hybrid model for e_t, illustrated in Figure 3 (a simpler benchmark model is described in Section 3.3.3). There are two main components of the hybrid model. The first is an initial model updating step referred to as "error correction" (Shen et al., 2022a, 2022b), which is analogous to hydrologic post-processing (Seo et al., 2006; Sharma et al., 2023; Siqueira et al., 2021; Zha et al., 2020). The error correction model f creates a mapping between the process model state variables (θ_SV,t) and the raw errors (e_t), including autocorrelation in the errors through lagged error terms (e_{t-1:t-p}) out to lag p:

e_t = f(θ_SV,t, e_{t-1:t-p}) + ε_t (2)

This model corrects for conditional bias, that is, biases in the process model predictions that are dependent on the internal states of the model and recent prediction errors. The second component is a dynamic residual model that captures the remaining stochasticity in the error correction model residual, ε_t, also as a function of process model state variables. Each of these components is described in more detail in Sections 3.3.1 and 3.3.2 below.
Importantly, we use a split-sample calibration/validation approach to fit these two components, in a similar fashion to Hah et al. (2022). That is, we first fit the error correction model f to one subset of the training data (termed the calibration set), and then we fit the dynamic residual model to a separate subset of the training data (termed the validation set), after the error correction model has been applied to that validation set (see Figure 3). This strategy helps ensure that the dynamic residual model will represent the true variability of out-of-sample residuals from the error correction model. In this work we employ an approximate 70%/30% split of the training data between calibration (WY 1989-1998) and validation (WY 1999-2004) periods, following common practice in the ML literature (Hastie et al., 2017; Shalev-Shwartz & Ben-David, 2013).
The hybrid model can then be used to simulate errors (e*_t) in a new time period using the state variables associated with the process model simulated in that new period. We hypothesize that the model-based hydrologic states will vary considerably in periods with very different climates (e.g., Test vs. Test + 4C; see Figure 3), and this will propagate into new error distributions for the SWM. Simulated errors can be added to the process model simulation to yield a single SWM trace of streamflow; a SWM ensemble is generated by repeating this process for many independent simulations of error (see Section 3.3.3).

Error Correction Model
As described in Equation 2, the error correction model f uses process model state variables (θ_SV,t) and lagged errors (e_{t-1:t-p}) to estimate current process model errors for time t (i.e., f captures conditional bias in the process model).
Many different error correction models could be selected for f (Frame et al., 2021; Konapala et al., 2020; Shen et al., 2022a, 2022b). In this work, we follow Shen et al. (2022a, 2022b) and select f to be a random forest (RF) model, leveraging its parsimony, demonstrated hydrologic performance, and out-of-sample robustness (Tyralis et al., 2019). More advanced models (e.g., long short-term memory networks; Kratzert et al., 2018; Frame et al., 2021) would likely work better if applied to many sites simultaneously, but our focus on a single location with limited data supports a simpler ML model less prone to overfitting (discussed further in Section 5). The primary hyperparameters of a RF model are the number of trees in the forest ("ntree") and the number of features to randomly select at each split ("mtry"). The individual trees in the forest are fit by minimizing the mean squared error for a randomly bootstrapped subset of the data with a random subset of input features, and the overall RF prediction is obtained by averaging the predictions of the individual trees (Breiman, 2001; Breiman et al., 1984).
The RF model was implemented using the "ranger" package in R (Wright & Ziegler, 2017) with the default hyper-parameter settings of "ntree" = 500 and "mtry" = ⌊√k⌋, where k is the number of input variables. While we found that some improvement in error correction was possible through hyper-parameter selection on the "out-of-bag" prediction error, these improvements were modest and had the negative effect of apportioning more variable importance to the lagged errors, which degraded simulation performance (see Text S2 in Supporting Information S1).
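A rough Python equivalent of this error correction setup, using scikit-learn's RandomForestRegressor as a stand-in for ranger (the data here are synthetic, and "ntree" = 500 / "mtry" = ⌊√k⌋ mimic the defaults noted above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-ins for process-model state variables and raw errors
n, p = 400, 3                                    # timesteps, error lags
states = rng.normal(size=(n, 5))                 # theta_SV,t: 5 state variables
errors = 0.5 * states[:, 0] + rng.normal(0.1, 0.3, size=n)  # toy e_t

# Design matrix: states at time t plus errors at lags 1..p
X = np.column_stack([states[p:]] +
                    [errors[p - l:n - l] for l in range(1, p + 1)])
y = errors[p:]

k = X.shape[1]
rf = RandomForestRegressor(n_estimators=500,             # "ntree" = 500
                           max_features=int(np.sqrt(k)),  # "mtry" = floor(sqrt(k))
                           random_state=0)
rf.fit(X, y)
e_hat = rf.predict(X)        # conditional-bias estimate of the errors
residuals = y - e_hat        # epsilon_t, passed on to the residual model
```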
The RF model is fit to calibration period data and then used to predict the errors in the validation set (ê_t^val). These predicted errors are subtracted from the raw errors e_t^val to yield residuals ε_t^val in the validation period, which are used to train the dynamic residual model, described next.

Dynamic Residual Model
The residual model captures the stochastic properties of ε_t, and is "dynamic" in the sense that it allows the stochastic properties to vary over time based on hydrologic model state. This model is fit to the validation set of error correction model residuals (ε_t^val), which ensures it does not underestimate the variability of out-of-sample residuals from the error correction model (see Text S3 in Supporting Information S1).
To construct this model, we leverage the generalized likelihood (GL) approach of Schoups and Vrugt (2010), which utilizes the flexible skew exponential power (SEP) distribution (also known as the skew generalized error distribution; Wurtz et al., 2020). The original GL approach includes an autoregressive model and a linear model for heteroscedasticity, which together yield a set of random deviates (a_t) that are modeled via the SEP with a mean μ of 0, a standard deviation σ of 1 (i.e., after standardization by the heteroscedastic model), kurtosis β, and skew ξ (see Text S4 in Supporting Information S1 for more detail). We modify this formulation to allow all free parameters of the GL model (standard deviation σ, kurtosis β, skew ξ, and lag-1 autoregressive coefficient φ) to vary over time as functions of the scaled process model state variables θ_SV,t:

σ_t = σ_0 + σ_1ᵀ θ_SV,t (3a)
β_t = β_0 + β_1ᵀ θ_SV,t (3b)
log(ξ_t) = ξ_0 + ξ_1ᵀ θ_SV,t (3c)
φ_t = φ_0 + φ_1ᵀ θ_SV,t (3d)

The log-likelihood function for the SEP distribution of ε_t (Equation 3) is a function of the parameter vector (σ_0, σ_1, β_0, β_1, ξ_0, ξ_1, φ_0, φ_1) and follows Schoups and Vrugt (2010). Prior to maximum likelihood estimation, we scale all state variables to prevent discrepancies in magnitude from impacting the inferred parameters, and we preserve this scaling when simulating residuals from the SEP distribution in new time periods. We employ noise regularization during maximization of the likelihood function to smooth the parameter estimates (Klotz et al., 2022; Rothfuss et al., 2019). We also ensure the free parameters remain within valid ranges (σ_t > 0; β_t > -1; 0.1 < ξ_t < 10; 0 ≤ φ_t ≤ 1) by penalizing parameter selections that result in values outside of these ranges. As part of the constraint for σ_t, we require σ_0 to be no lower than the mean of the absolute value of the lowest decile of the residuals, and we require all elements of the vector σ_1 to be non-negative. Finally, ξ_t is log-transformed (as in Equation 3c) to linearize its relationship with θ_SV,t.
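The state-dependent parameter maps and range constraints described above can be sketched as follows, assuming linear (log-linear for the skew) dependence on the scaled state variables; the coefficient values are illustrative placeholders, not fitted values:

```python
import numpy as np

def dynamic_gl_params(theta_sv, coef):
    """Map scaled state variables to time-varying GL parameters and clip
    each parameter to the valid ranges noted in the text (a sketch; the
    study enforces these ranges via penalties during likelihood
    maximization rather than clipping)."""
    sigma = coef["sigma0"] + theta_sv @ np.maximum(coef["sigma1"], 0.0)
    beta = coef["beta0"] + theta_sv @ coef["beta1"]
    xi = np.exp(coef["xi0"] + theta_sv @ coef["xi1"])   # log-linear in states
    phi = coef["phi0"] + theta_sv @ coef["phi1"]
    return {
        "sigma": np.maximum(sigma, 1e-6),           # sigma_t > 0
        "beta": np.clip(beta, -1.0 + 1e-6, None),   # beta_t > -1
        "xi": np.clip(xi, 0.1, 10.0),               # 0.1 < xi_t < 10
        "phi": np.clip(phi, 0.0, 1.0),              # 0 <= phi_t <= 1
    }

rng = np.random.default_rng(0)
theta = rng.normal(size=(6, 2))   # 6 timesteps, 2 scaled state variables
coef = {"sigma0": 0.5, "sigma1": np.array([0.2, 0.1]),
        "beta0": 0.0, "beta1": np.array([0.3, -0.2]),
        "xi0": 0.0, "xi1": np.array([0.1, 0.1]),
        "phi0": 0.6, "phi1": np.array([0.2, 0.0])}
params = dynamic_gl_params(theta, coef)
```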
This modified GL approach allows the residual model to capture state-dependent, time-varying properties of variance, autocorrelation, and distributional form. Moreover, the dynamic model allows for adaptive prediction of residual error distributions even if the model state variables extend beyond their historical range, which is a challenge for other recently developed local uncertainty estimation procedures (e.g., BLUECAT; Koutsoyiannis & Montanari, 2022).

SWM Ensemble Generation and Benchmark Static SWM
To generate a single SWM trace, we first generate new time series of random deviates ãt from the SEP distribution with μ = 0, σ = 1 and the time-varying estimates βt and ξt, which are determined by the model state variables via Equations 3b and 3c. These ãt are then converted to new εt time series via Equation 4, where estimates of the lag-1 autoregressive parameter φt and the heteroscedastic parameter σt are inferred from Equations 3a and 3d. We then combine the simulated residuals εt with the error correction model to simulate new errors ẽt (Equation 5). These errors are generated as the sum of the predicted error from the error correction model, f(θSV,t, ẽt−3, ẽt−2, ẽt−1), which depends on the state variables at time t (θSV,t) and the generated errors from the previous three timesteps (ẽt−3, ẽt−2, ẽt−1), and the generated residual error (εt).
Since this model includes lag-1 to lag-3 errors, it is initialized with three randomly generated deviates from the residual error model. To generate an M-member ensemble of SWM traces, we simply repeat the steps above M times, where each trace has a different random time series of deviates ãt sampled from the SEP distribution.
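The trace generation steps above can be sketched as follows. This is an illustrative stand-in, not the authors' code: the SEP sampler is abstracted as a placeholder callable (`draw_deviate`), the AR(1)/heteroscedastic recursion approximates Equation 4, and the final combination approximates Equation 5.

```python
import numpy as np

def simulate_trace(sigma, phi, theta_sv, f, draw_deviate, rng):
    """Simulate one SWM error trace of length T.

    sigma, phi   : (T,) time-varying SEP scale and lag-1 AR coefficients
    theta_sv     : (T, K) model state variables
    f            : fitted error correction model, f(state, e_lag3, e_lag2, e_lag1)
    draw_deviate : callable(t, rng) returning a standardized SEP deviate a_t
    """
    T = len(sigma)
    a = np.array([draw_deviate(t, rng) for t in range(T)])
    eps = np.zeros(T)
    eps[0] = sigma[0] * a[0]
    for t in range(1, T):                  # AR(1) + heteroscedastic scaling (cf. Equation 4)
        eps[t] = phi[t] * eps[t - 1] + sigma[t] * a[t]
    e = np.zeros(T)
    e[:3] = eps[:3]                        # initialize with 3 deviates from the residual model
    for t in range(3, T):                  # add error correction prediction (cf. Equation 5)
        e[t] = f(theta_sv[t], e[t - 3], e[t - 2], e[t - 1]) + eps[t]
    return e
```

An M-member ensemble repeats this call M times with independent deviate streams.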
Importantly, the simulation procedure above includes contributions from model state variables directly (via the error correction model), as well as through the stochastic distribution of εt (via the dynamic residual model). This novel integration of dynamic, state variable-dependent components enables generalizable error simulation with intrinsic adaptability for out-of-sample and non-stationary error distributions.
To benchmark the hybrid error model, we also introduce a static SWM designed similarly to the hybrid model but without dependence on hydrologic state variables. This static model is instead fit to the raw errors et from the calibration period (WY1989-1998) on a monthly basis to capture seasonality. First, monthly mean biases are estimated and removed from et, producing residuals similar to εt in Equation 2. Then, an autocorrelative model (AR(3)) and a linear heteroscedastic transform are fit to εt to remove autocorrelation and capture variance that changes with simulated flow, and an SEP distribution is fit to the decorrelated and scaled residuals. This is the basic approach proposed in Schoups and Vrugt (2010) and is very similar to the dynamic residual model in Equation 3, although these fits are conducted separately by month without dependence on state variables. Simulation from the static SWM follows a similar procedure to that of the dynamic residual model, with monthly biases added back in during simulation.
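The static model's first step, monthly debiasing, can be sketched with pandas. This stand-in covers only the bias-removal step (not the subsequent AR(3), heteroscedastic, and SEP fits), and the function name is hypothetical.

```python
import pandas as pd

def remove_monthly_bias(errors: pd.Series):
    """Estimate and remove monthly mean bias from raw errors e_t.

    errors: daily error series indexed by a DatetimeIndex (calibration period).
    Returns the debiased residuals and the monthly biases, which are added
    back in during simulation.
    """
    monthly_bias = errors.groupby(errors.index.month).mean()
    residuals = errors - monthly_bias.reindex(errors.index.month).to_numpy()
    return residuals, monthly_bias
```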

Local Interpretable Model-Agnostic Explanation (LIME)
To understand the time-dependent importance of state variables in the RF error correction process (Section 3.3.1), we use an explainable artificial intelligence (xAI; Holzinger et al., 2022) technique referred to as LIME (Ribeiro et al., 2016). LIME randomly perturbs the inputs around each model prediction to develop a local, sparse linear approximation to the more complex ML model's predictive logic. This linear model provides a representation of the relative importance of each input to the ML model's final prediction at each time step (Hvitfeldt et al., 2022; Ribeiro et al., 2016). In this context, LIME has similarities to time-varying sensitivity analyses used in hydrologic model diagnoses (Herman et al., 2013). Importantly, LIME offers a uniquely different perspective than the aggregate variable importance metrics generated internally by the RF algorithm, since the explanations can be analyzed at the precision of individual events or aggregated over subsets of interest.
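The core idea behind LIME can be illustrated in a few lines: perturb the features around a single prediction, weight the perturbed samples by proximity, and read off the coefficients of a locally fit weighted linear model. This is a simplified stand-in for the LIME implementation used in the paper; the function name and Gaussian proximity kernel are illustrative choices.

```python
import numpy as np

def local_feature_weights(predict, x0, scale=0.1, n_samples=500, seed=0):
    """Approximate a model locally around x0 with a weighted linear fit.

    predict : callable mapping an (n, d) array to (n,) predictions
    x0      : (d,) point to explain
    Returns the (d,) local slopes, interpretable as feature weights.
    """
    rng = np.random.default_rng(seed)
    X = x0 + rng.normal(0.0, scale, size=(n_samples, x0.size))  # perturb inputs
    y = predict(X)
    d2 = np.sum((X - x0) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * scale ** 2))                        # proximity kernel
    A = np.hstack([np.ones((n_samples, 1)), X])                 # intercept + features
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(sw * A, sw[:, 0] * y, rcond=None)  # weighted least squares
    return coef[1:]
```

For a model that is already linear, the local slopes recover the global coefficients; for the RF error model, they vary from one time step to the next, which is what Figure 7 summarizes.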

Real-World Application
As a final experiment, we apply the hybrid SWM framework developed in Section 3.3 to a non-stylized, real-world setting, where the hybrid error model is constructed on the errors between the actual observed streamflow and a process-based hydrologic model (SAC-SMA) calibrated to those observations. In this case, the "truth model" is now the more complex real-world streamflow generating process, against which hydrologic models are a simplified representation, and we assess the ability of the hybrid SWM framework to learn an error distribution in the training period that generalizes to the test period. This experiment is conducted using FNF from the Feather (ORO), Sacramento (SHA), and Calaveras (NHG) River basins. SAC-SMA models are fit to the FNF at the SHA and NHG sites in the same way as described in Section 3.2 for the ORO site, with training (test) NSE for SHA and NHG equal to 0.91 (0.90) and 0.89 (0.83), respectively. We note that in any real-world application, we would not anticipate significant amounts of non-stationarity in model residuals between training and testing periods if warming trends or other types of climate change between the two periods are modest.

Non-Stationarity in Raw Errors
To first highlight the potential for non-stationarity in predictive errors, we examine the raw error distribution (et, Equation 1) by month between the process (HYMOD) and truth (SAC-SMA) models in the out-of-sample test period with and without 4°C warming applied to the meteorological data (Figures 4a and 4b). Predictive errors are defined as truth minus process model output. Note that because both models use the same meteorological data, any errors between the two models (and non-stationarity in those errors) can be attributed to differences in model structure. In Figure 4a, we show errors for the 5 wettest years in the test period ("5wet"; 2005, 2006, 2011, 2016, 2017), while in Figure 4b we show errors for the 5 driest years ("5dry"; 2007, 2008, 2012, 2014, 2015). We define "5wet" and "5dry" based on total annual streamflow. Similar results for the training period are shown in Text S5 in Supporting Information S1.

Figure 4 caption (excerpt): (a) Error distributions for the "5wet" years (2005, 2006, 2011, 2016, 2017). (b) As in (a) for the "5dry" years (2007, 2008, 2012, 2014, 2015). (c) Mean daily flow of the historical observations and SAC-SMA truth model across WY2004-2018 for the two scenarios, smoothed with a 30-day moving average. Solid (dotted) lines are for the "5wet" ("5dry") years.
For the wettest years (Figure 4a), there is a substantial divergence in error distributions between the historical and warmed scenarios across seasons, with the most notable differences occurring in the late winter and early spring (February-April) and summer (June-September). In March and April, when the mean daily flows are at their annual peak (Figure 4c, "5wet"), the errors in the two cases are biased in opposite directions, with process model outflows systematically overpredicting truth model flows in the Test case but underpredicting them in the Test + 4C case. In April to June, when the two cases exhibit the greatest disparities in mean daily flow (Figure 4c, "5wet"), error biases change sign for both the Test and Test + 4C cases. Then in the summer, both the Test and Test + 4C errors suggest process model overpredictions, but these are larger in the Test + 4C case. In addition, there are several months when the error dispersion (i.e., interquartile range) differs substantially between the two cases, with February-April being the most prominent examples. The reduction in bias and dispersion of the errors in the Test + 4C case, particularly in March, suggests less structural uncertainty between the truth and process models as snowmelt declines under warming.
For the driest years (Figure 4b), there is little difference in the error distributions between the Test and Test + 4C cases. In these years, the average streamflow with and without warming is very similar, with only small declines in the spring and summer in the Test + 4C case. Overall, Figure 4 shows that the error distribution between the truth and process models can change under warming during wet years alongside shifts in key hydrologic processes, such as more precipitation falling as rain and running off in the cold season, less snowpack carrying over into the spring, an earlier start to the snowmelt season, and increased evapotranspiration in the summer (Figure 4c). This phenomenon is not observed during very dry years, when there is little water to begin with and therefore little absolute change in the underlying hydrology with warming.

RF Error Correction Model Performance
In the first step of our modeling approach, we apply RF error correction to remove systematic biases that can vary through time conditional on hydrologic state. Figure 5 shows the residual distributions (εt, Equation 2) for both Test (Figure 5a) and Test + 4C (Figure 5b) cases after fitting this error correction procedure to the training set and applying it to the test set, focusing on the 5 wettest years, where we see the largest degree of non-stationarity. Figure 5 also shows the raw error distributions (et, Equation 1) for comparison, as well as scatterplots comparing raw errors to the RF predicted errors (Figures 5a and 5b insets). There is a clear reduction in conditional bias across months in the Test case, where all residuals are nearly centered around zero. This reduction in conditional bias holds in the Test + 4C case for October-May, but deteriorates somewhat in the summer months (June-September). Notably, the raw errors were successfully debiased in both Test and Test + 4C cases in March and April, when biases in the two cases were of different sign. These results showcase three important properties of the error correction process: (a) the model's ability to learn state variable-error relationships that enable debiasing across varying seasonal behavior; (b) the model's resistance to overfitting (i.e., the RF model provides effective error correction on unseen data in both the Test and Test + 4C cases); and (c) stability of the learned relationships even with prominent shifts to the raw error distributions under non-stationary forcing. However, some issues do remain; the scatterplots show that the RF model struggles in the lower tails (large negative errors) in the Test case and in the upper tails (large positive errors) in the Test + 4C case.
The RF model calculates variable importance as the fractional contribution of each variable to reducing prediction variance across the entire data set. Figure 6 shows that lag-1 autocorrelation in the raw errors is the most influential predictor variable. The most important state variables are the runoff component of simulated flow and the simulated flow itself, implying that conditional biases in the error are related to differences in how rainfall is apportioned to overland flow between the truth and process models. The remaining state variables show similar, lower values of importance, but we note that some variables that would be important only in specific times of year (e.g., snow water equivalent, SWE) will likely be less important in the aggregate.
The dominance of autocorrelation in variable importance suggests that a simpler autoregressive (AR) model could be sufficient as an error correction procedure. However, an AR model cannot simulate conditional bias that changes in nature under non-stationary conditions (Shabestanipour et al., 2023). An RF error correction model based solely on state variables (no lag terms) can infer conditional bias in both out-of-sample and non-stationary out-of-sample cases, but underpredicts the magnitude of the bias (see Text S6 in Supporting Information S1). This supports the integration of autoregressive terms and state variables in the RF error correction model.
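The integrated design described above can be sketched as a design matrix that combines the state variables at time t with the lag-1 to lag-3 raw errors, fit with a random forest. The feature layout, function names, and hyperparameters here are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_features(states, errors, n_lags=3):
    """states: (T, K) state variables; errors: (T,) raw errors e_t.
    Returns (X, y) aligned from t = n_lags onward, where each row holds the
    state variables at t plus the errors at t-1, t-2, and t-3."""
    T = len(errors)
    lags = np.column_stack([errors[n_lags - L:T - L] for L in range(1, n_lags + 1)])
    return np.hstack([states[n_lags:], lags]), errors[n_lags:]

def fit_error_correction(states, errors, seed=0):
    X, y = build_features(states, errors)
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(X, y)
    return rf
```

During simulation, the lag features are filled with the previously generated errors rather than observed ones, as described in Section 3.3.3.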
While the RF model calculates variable importance across the entire data set, we use LIME to explore the time-varying importance of state variables to the error correction model. This is shown in Figure 7 for the Test and Test + 4C cases, using results in March for illustration. Here, we confine our analysis to daily empirical errors that are biased in opposite directions for the Test and Test + 4C cases, in order to better emphasize how state variable impacts on error correction change based on background climate state. That is, we take the daily feature weights from LIME in March only when the predictive errors are negative (positive) for the Test (Test + 4C) cases, and show the median feature weight for these days in Figure 7. We do not include the lag-1 to lag-3 features in Figure 7, to concentrate focus on the state variable effects. In a practical sense, feature weights are the normalized coefficient values of the local linear regression in the LIME procedure (Section 3.4). Thus, positive feature weights imply a positive correlation between the feature (i.e., the state variable) and the local model response (i.e., the estimated error), and vice versa.
Figure 7 shows that precipitation and the process model simulated flow (sim) and baseflow exhibit the largest absolute feature weights in both the Test and Test + 4C cases. These feature weights are amplified in importance from the Test to the Test + 4C case, while the feature weight for snow water equivalent (swe) reduces to near zero for Test + 4C. This suggests that precipitation and simulated flow partitioning dynamics better explain model error in the Test + 4C case against a backdrop of decreasing SWE influence. This likely reflects changes in snowmelt dynamics under warming that lead to HYMOD overestimation (underestimation) of SAC-SMA flows in March under the Test (Test + 4C) cases. While the LIME procedure cannot provide causal evidence, it is possible that the more active role of snow accumulation and melt in determining streamflow response in the Test case, but not in the Test + 4C case (where snowpack is much less prevalent in March), leads to a reversal of model predictive biases in this month.

Dynamic Residual Model Performance
Overall, the error correction process yields residuals (εt) in both Test and Test + 4C cases that are relatively unbiased, but that still exhibit time-dependent properties (e.g., variance that changes by month; see Figure 5). This suggests that important dependencies between the model states and the residuals may still exist after error correction. We assess the ability of the dynamic residual model to capture these dependencies by comparing the empirical residual distribution (i.e., the distribution of εt calculated from Equation 2) to the residual distribution simulated by the dynamic residual model (εt in Equation 4), all for the out-of-sample Test and Test + 4C cases. Figure 8 shows this comparison for selected months (February-April) that exhibited the most notable differences between residual distributions in the Test and Test + 4C cases (see Figure 5); a comparison across all months is presented in Text S7 in Supporting Information S1. Results from Figure 8 show that in the Test case (top row), the dynamic residual model captures seasonal changes to the residual distribution's shape and variance. In the Test + 4C case (bottom row), the empirical residual distributions become more peaked in March and April compared to the Test case, and the dynamic residual model is able to infer these changes. The agreement between empirical and simulated residuals in Figure 8 confirms that the dynamic residual model is able to use state variable information to capture changes in higher moments of the residuals εt across months and very different climate conditions.

Figure 6 caption (excerpt): Note: "err -1" variable importance equals 0.27, which extends outside of plot bounds. Definitions for all state variable acronyms can be found in Table 1, while "err -1" to "err -3" reflect lag-1 to lag-3 errors.

Water Resources Research
10.1029/2023WR035042 BRODEUR ET AL.
Table 2 shows the state variable effects for the different parameters in the SEP model (see Equations 3a-3d), while Figure 9 shows the seasonality in SEP parameters in the Test and Test + 4C cases. For the parameter σt (heteroscedasticity), the runoff state variable is the most influential, with a small influence from SWE (Table 2). This result reflects the strong relationship between error variance and flow magnitude, as noted in previous literature (Schoups & Vrugt, 2010). This is also seen in the strong seasonal signal in both the mean and variability of σt (Figure 9, top left), though we note that this seasonality is truncated in Jul-Oct (the minimum value of σt is constrained for model stability). Further, the greatest divergence in σt between the Test and Test + 4C cases occurs in the late winter and spring months, where mean flow magnitudes diverge most substantially (see Figure 4c) and there are important contributions from snow accumulation and melt. Skewness (ξt) shows a relatively weak seasonal signal that is centered around 0 (log10 ξt = 0; no skew) in the warm season and slightly negative (log10 ξt < 0; negative skew toward process model overpredictions) in the winter and spring. Negative skew is more prominent in the Test case. As for σt, skewness is most strongly influenced by runoff and SWE, but also has an important contribution from precipitation (precip). In contrast, the kurtosis parameter βt exhibits a strong seasonal signal that is primarily tied to upper and lower soil moisture (upr_sm, lwr_sm), with smaller contributions from SWE, average temperature (tavg), runoff, and evapotranspiration (et). The residual distributions exhibit values of βt close to 1 (i.e., a Laplace distribution) in the cold season that become progressively more peaked and fat-tailed (βt > 1) in the summer months. This reflects a concentration of probability mass around small residuals εt in low flow months, alongside a high probability of occasional large (scaled) residuals (see Figure S4 in Supporting Information S1). Both the Test and Test + 4C cases show similar seasonal characteristics, though the Test + 4C βt are uniformly larger than the Test βt across most months except September-December.

Figure 7 caption (excerpt): comparison against median error (light orange background, black outline) for a selected month, where errors and associated feature weights are aggregated for errors less than (greater than) zero in the Test (Test + 4C) cases. Definitions for all variable acronyms can be found in Table 1.

Table 2 Note. The "intcpt" row corresponds to [σ0, β0, ξ0, φ0] from Equations 3a-3d, while the remaining rows correspond to the coefficient vectors [σ1, β1, ξ1, φ1] from Equations 3a-3d from left to right. All state variables are scaled, so the magnitude of the coefficient is proportional to its effect on the parameter.

Finally, lag-1 autocorrelation (φt) shows no seasonal signal, but a notable increase in variability in the winter season that is most influenced by baseflow and upper soil moisture (upr_sm). It is important to note that the autocorrelation captured in the residual model is the leftover autocorrelation after error correction (which included lag-1 to lag-3 terms). Across the seasons, there is very little difference between the Test and Test + 4C cases, indicating that the autocorrelation structure of the residuals εt is not highly influenced by warming.

Hybrid Model Simulation Performance
After fitting both components of the hybrid error model, we simulate new errors (ẽt, Equation 5) via the generation procedure detailed in Section 3.3.3 and evaluate how well their distribution matches that of the raw empirical errors (et, Equation 1). In Figures 10a and 10b, we show this comparison separately by month for both Test and Test + 4C cases, again focusing on the "5wet" years, when error non-stationarity is more evident. The hybrid model is able to reproduce the direction of bias and general patterns of variance across both the Test and Test + 4C cases. For instance, in both March and April, the Test empirical errors are negatively biased and have substantially more variance as compared to the Test + 4C errors, which are positively biased and have less variance. The model is able to simulate both of these shifts. In the summer, the hybrid model is also able to capture the negative bias in the Test case. However, this bias grows in the Test + 4C case, and the hybrid model does not capture this deeper negative summertime bias, consistent with the results in Figure 5. This deficiency will lead to simulated summertime flows that are biased compared to the truth model.
In comparison to the hybrid SWM, a static SWM based purely on seasonality in the error distribution (Figures 10c and 10d) struggles to capture the out-of-sample changes to the error distribution under both the Test and Test + 4C cases. In the Test case (Figure 10c), the static SWM substantially overestimates error variance and is insensitive to shifts in error bias during the winter and spring. This challenge is compounded in the Test + 4C case, where neither shifts in error biases nor shifts in dispersion are captured faithfully. The state variable dependencies built into the hybrid model allow a more faithful emulation of these shifts, even if imperfect. We also note that both models produce relatively accurate coverage probabilities in the Test case but struggle to do so in the Test + 4C case (see Text S8 in Supporting Information S1).
To further illustrate the performance of the hybrid SWM, Figure 11 shows the simulated time series of flow for a 6-month subperiod (February-July 2011) in the Test case (Figure 11a) and the Test + 4C case (Figure 11b) that spans both wet and dry seasons. We first highlight the markedly different truth model flows for the Test and Test + 4C cases, where again the only difference is the applied +4°C temperature adjustment to the Test + 4C forcings. The peak flow event in March in the Test case is weaker and of shorter length compared to the larger, sustained multipeak event in the Test + 4C case. In contrast, the snowmelt recession is longer and of higher magnitude in the Test case versus the Test + 4C case. The results also illustrate the method's adaptive bias correction, where the hybrid SWM corrects much of the process model's overprediction bias in the Test case in April (Figure 11a inset), whereas in the Test + 4C case, the model helps correct for process model underprediction bias during this same time of year (Figure 11b inset). The spread in the SWM ensemble is smaller in the Test + 4C case between May and July because flows are simulated to be lower (and more baseflow driven) in this subperiod, and variance is correlated most strongly with runoff (see Table 2). During March and April, the spread of the SWM ensemble is also able to capture the peak flows well for both Test and Test + 4C cases. Overall, the hybrid SWM simulations improve the process model simulation based on the ensemble median and capture many of the observations within the ensemble spread.
Finally, we verify the hybrid SWM by comparing the distribution of simulated high flows and extended dry conditions against those simulated by the deterministic truth and process models, as well as the static SWM (Figure 12). For high flows, we use the 99th percentile of flow as the metric for comparison (Figures 12a and 12b).
In the Test scenario (Figure 12a), the truth model 99th percentile flow value is substantially below that of the process model, indicating that the process model overestimates the truth model for high flows. The hybrid SWM ensemble is mostly able to correct this overestimation while also retaining high precision in the distribution, whereas the static SWM ensemble overestimates the 99th percentile truth model flow (even more so than the process model) and is very imprecise. In the Test + 4C case, the process model underpredicts the truth model 99th percentile (Figure 12b). The mode of the hybrid SWM ensemble partially corrects this bias, and the truth model 99th percentile falls within the SWM distribution. The static SWM is less biased in the Test + 4C case, but again, its distribution for high flows is highly variable.

For extended dry conditions, we use the minimum cumulative annual flow as the metric for comparison. The process model accurately predicts this metric in the Test case (Figure 12c) but underpredicts it in the Test + 4C case (Figure 12d). We find that in both the Test and Test + 4C cases (Figures 12c and 12d), the static SWM substantially underestimates the truth model value, again with high imprecision. The hybrid SWM estimate in the Test case is accurate, but this is not the case in the Test + 4C case, where the truth model value is not captured in the hybrid SWM distribution. This discrepancy reflects the inability of the hybrid SWM to capture the summertime shifts in the Test + 4C case, which was also noted in Figure 10b.
Overall, these results show that the hybrid SWM can capture non-stationary shifts in hydrologic model error more reliably than the static SWM, and therefore can better simulate streamflow ensembles under changing climate conditions. However, the hybrid SWM still struggles with low flow emulation, which has been noted in previous SWM studies as a difficult challenge for these methods (Shabestanipour et al., 2023).

Real-World Application
We conclude by employing the hybrid model in a real-world setting to assess whether the model can effectively infer conditional error distributions when the truth model is the actual streamflow observations and the process model is SAC-SMA. We fit the model to three different basins (ORO, SHA, NHG) to show the generalizability of the hybrid SWM approach across varying basin sizes and hydrologic regimes, using the same procedure outlined for the stylized case. There are statistically significant upward trends in temperature (between 1° and 1.5°C) and no significant trends in precipitation over the 1989-2018 period across the three basins (see Text S9 in Supporting Information S1), but we found that these modest warming trends did not drive substantial non-stationarity in the error distributions between the training (WY1989-2004) and test (WY2005-2018) periods (see Text S10 in Supporting Information S1). However, as shown below, error distributions do vary notably with hydrologic regime. Figure 13 shows the results of applying the hybrid model to errors between the process model (SAC-SMA) and observed streamflow for the three basins, focusing on the 5 wettest and driest years in the Test period (an analysis across the entire Test period is shown in Text S10 in Supporting Information S1). Across the larger sites (ORO, SHA), there is good agreement between the hybrid SWM simulated errors and the empirical errors in terms of the seasonality of the error biases and dispersion, and how this varies across wet and dry years. For example, at SHA, empirical biases during the late fall and early winter (November-February) tend to be negative during wet years and positive or near zero during dry years, and this is captured by the hybrid model. Similarly, the hybrid model captures small shifts in empirical bias across wet and dry years in certain months (November-February, August-September) at ORO. Still, there are areas where the hybrid model struggles. This is seen most prominently for
NHG, which is a much smaller, flashier basin that is rainfall dominated and has many days of zero flow. During dry years, the model overpredicts the magnitude and spread of errors at NHG in most months, while during wet years bias is overestimated in some months (January, March-June). The model also has a tendency to overpredict error dispersion at ORO in the winter and spring months in wet years, and misses some shifts in bias at ORO in certain months (e.g., May). These challenges notwithstanding, the results suggest that the hybrid SWM can infer important changes to error properties from model states in an out-of-sample period and under different hydrologic conditions in a real-world setting, particularly for larger and less flashy basins.

Discussion and Conclusion
In this work, we examined the assumption that historical predictive uncertainty of hydrologic models is sufficient to characterize future predictive uncertainty under non-stationary climate. We developed an idealized "model as truth" experimental design to test this assumption, where we designated one hydrologic model as "truth" and another as the "process" model. This design allowed us to analyze predictive uncertainty under both the historical meteorological conditions to which the models were fit and also under significant warming. We found that there were substantial shifts in the predictive error distribution under climate change, which manifested in changes to bias, variance, and (to a lesser extent) higher moments of error. These results suggest that SWMs fit to historical data may not perform well when used to simulate future, climate change impacted hydrology.
This result has large implications for the use of SWMs to estimate hydrologic design events under climate change. Process-based models are one of our best tools to predict streamflow under projections of future climate, but these models exhibit systematic errors in the prediction of extreme low flows and high flows that impede their direct use in estimating climate change impacted design events (7Q10, 100-year flood; Shabestanipour et al., 2023). One of the most important contributions of SWMs is the reduction of simulation bias in process-based hydrological model predictions at the upper and lower flow quantiles (Farmer & Vogel, 2016; Vogel, 2017), and so SWMs are generally seen as a tool that can help preserve the causal nature of process-based models while still providing the information needed for hydrologic design in long-term water resources planning efforts under climate change. However, the differences in error distributions between the Test and Test + 4C cases shown in this work imply that SWMs trained to a historical period may not improve the estimation of these design criteria under future climate conditions, complicating the use of SWMs as a tool for water resources planning under those conditions. To address these issues, we developed a novel, hybrid SWM that leverages information in hydrologic model state variables to predict changes in predictive uncertainty. The model used ML error correction to remove biases conditional on hydrologic state, and then used dynamic residual modeling to capture the dependencies between hydrologic state and higher order moments of the error distribution. To better emulate out-of-sample predictive uncertainty, we introduced a training approach whereby we fit the error correction model to a calibration set and then subsequently fit the dynamic residual model to a separate validation set, before evaluating the approach on an independent test set.
We found that the hybrid model was able to capture prominent shifts in predictive uncertainty in the test set, both for historical climate (Test) and under warming (Test + 4C). This included significant changes in bias during the winter and spring months, when snow accumulation and melt dynamics differed significantly between the truth and process models in the Test and Test + 4C cases. Notably, a static benchmark SWM was unable to emulate these shifting biases. The hybrid modeling framework was also able to predict changes in error variance and kurtosis in the spring months under warming. Overall, predictive uncertainty estimated using the hybrid error model matched that observed between the truth and process model reasonably well, even though some attributes of predictive uncertainty (e.g., magnitude of bias; coverage probabilities) were not captured, especially in low flow months. This finding was further supported by verification of the hybrid SWM streamflow ensemble against high flow and extended dry conditions simulated by the truth model. While improvements in some of these attributes should be the focus of future work, our methodology provides an important step toward addressing a gap in the hydrologic ML literature of how to adequately assess uncertainty under plausible but unprecedented future conditions (Klotz et al., 2022; Wi & Steinschneider, 2022).
Using different approaches for model interpretability (e.g., feature importance, LIME), we showed that lagged error terms, components of simulated streamflow, precipitation, and snow water equivalent were the most important features when correcting for bias, while a variety of meteorological and internal state variables helped model changes in higher order moments and autocorrelation of the residuals. Importantly, the effects of certain features in the error correction model changed in sign or varied non-linearly in intensity depending on the background climate and month of interest, which was of particular relevance during historical snowmelt seasons. This suggests that changes to predictive uncertainty under non-stationarity are more complex than just shifts in timing (e.g., Xu et al., 2021) or simple scaling relationships (e.g., Read & Vogel, 2015). We also highlight that these complex changes occurred when non-stationarity was applied via a simple temperature shift, suggesting that future studies are needed to explore how more complex forms of non-stationarity (e.g., changes to precipitation distributions) might impact hydrologic predictive uncertainty.
We also tested the hybrid model in a more challenging real-world setting in three separate basins, where the hybrid error model had to predict changes in predictive uncertainty between a process model and actual streamflow observations. We found that the hybrid error model worked well across most sites and months, exhibiting similar performance to the stylized experiment in capturing out-of-sample, state-dependent shifts in hydrologic uncertainty that varied across wet and dry years. These included prominent shifts in the sign, magnitude, and dispersion of the error distributions. However, the hybrid model struggled for the smaller, flashier, rain-fed basin (NHG), particularly in dry years, and for ORO in certain months. The results for NHG in particular suggest the model may have difficulty generalizing to flashier basins with a significant number of zero-flow days, which is a challenge previously seen with other SWM approaches (Schoups & Vrugt, 2010). This challenge notwithstanding, the results showed good qualitative agreement with error seasonality and shifts across wet and dry regimes at the other two sites, suggesting the approach has potential as a generalizable SWM strategy under climate change. In constructing the hybrid error model used in this work, we emphasized interpretability and parsimony over complexity. Future work could explore more advanced error correction procedures, potentially drawing on the forecast post-processing literature (Seo et al., 2006; Sharma et al., 2023; Siqueira et al., 2021), more complex optimization schemes for the dynamic residual model, or non-linear relationships to state variables in the dynamic residual model. Furthermore, this study accomplished only a limited exploration of the spatial generalizability of the approach, and future work should examine in more detail how performance varies by hydroclimate regime. These two efforts should be considered in tandem, as more sophisticated error correction procedures (e.g., long short-term memory networks) perform best out-of-sample when trained simultaneously on a large set of watersheds with diverse landscape and climate characteristics (Nearing et al., 2021). In other words, to improve the spatial generalizability of our approach, future work should investigate how to train a regional hybrid SWM model, instead of the site-specific training considered in this work.
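The staged hybrid structure discussed above, an error-correction model fit on a calibration period and a residual model then fit to what it leaves unexplained on a validation period, can be sketched as follows. All data are synthetic, and the linear correction model and AR(1) residual model are simplified stand-ins for the Random Forest and dynamic residual models used in this work.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: truth-model flow q, process-model flow f, and a
# state-variable matrix S (e.g., SWE, precipitation, simulated flow).
def make_period(n):
    S = rng.normal(size=(n, 3))
    f = 10.0 + rng.normal(size=n)
    q = f + 0.8 * S[:, 0] - 0.5 * S[:, 2] + rng.normal(scale=0.3, size=n)
    return S, f, q

S_cal, f_cal, q_cal = make_period(1500)   # calibration period
S_val, f_val, q_val = make_period(1500)   # validation period

# Stage 1 (calibration): fit an error-correction model e_t ~ state variables.
e_cal = q_cal - f_cal
A_cal = np.column_stack([np.ones(len(S_cal)), S_cal])
w, *_ = np.linalg.lstsq(A_cal, e_cal, rcond=None)

# Stage 2 (validation): fit a residual model (here a simple AR(1)) to what
# the stage-1 correction leaves unexplained on held-out data.
e_val = q_val - f_val
resid = e_val - np.column_stack([np.ones(len(S_val)), S_val]) @ w
phi = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation
sigma = np.std(resid) * np.sqrt(1.0 - phi**2)    # innovation scale

# Out of sample, stochastic error traces would be simulated as
# e_t = w · [1, S_t] + xi_t, with xi_t following AR(1)(phi, sigma).
```

Fitting the residual model on validation rather than calibration data keeps its dispersion estimates honest about out-of-sample error, which is the quantity an SWM is meant to reproduce.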

Figure 1. Geographical area of study depicting Feather River inflow to Oroville Dam (1) as well as significant upstream reservoirs and other diversions (2-9). The inset in the upper right shows the locations of the primary study basin (Feather River, ORO) and two additional basins considered in part of the analysis (Sacramento River, SHA; Calaveras River, NHG).

Figure 2. Conceptual diagram showing the stylized experimental design, where a "truth" (Q) and "process" (F) model are designated to test the effect of alternate hydrologic model forcing on predictive errors between the two models. In the Test scenario, both models are forced with out-of-sample but stationary forcings, whereas in the Test + 4C scenario, both models are forced with non-stationary and out-of-sample forcings incorporating 4°C of applied warming.

Figure 3. Diagram of the hybrid SWM structure, including the setup of calibration, validation, and testing periods (Test and Test + 4C) within the data. This diagram highlights the staged nature of the hybrid SWM, where an error correction model is first fit to calibration data and then used to fit the residual model on the validation data. The resultant model can be used to generate new sequences of predictive uncertainty in out-of-sample scenarios (e.g., Test and Test + 4C) using the state variable timeseries associated with the "process" model. Consistent with Figure 2, the colors of the boxes denote the time period used (Calibration, Validation, Test, Test + 4C), and separate colors are also used for the "truth" (Q) and "process" (F) models.

Figure 5. (a) Monthly comparison of raw error (e_t) distributions versus residual distributions (ε_t) after correction by the Random Forest (RF) model in the "5wet" years from the Test period (WY2005-2018). The inset is a scatterplot comparison of raw error versus RF-predicted error. (b) As in (a) but for the Test + 4C case.

Figure 8. Top row: Empirical distribution of RF-corrected residuals ε_t (histogram) versus the kernel density estimate of a simulated residual sample from the dynamic residual model (red line) for selected months from the Test case. Bottom row: As in the top row, but for the Test + 4C case.

Figure 7. Locally Interpretable Model-agnostic Explanation (LIME) median feature weight (white background, gray outline) comparison against median error (light orange background, black outline) for a selected month, where errors and associated feature weights are aggregated for errors less than (greater than) zero in the Test (Test + 4C) cases. Definitions for all variable acronyms can be found in Table 1.

Figure 9. Values for the four free parameters in the dynamic residual model aggregated by month for the Test and Test + 4C cases. The bold lines are the mean parameter values, while the blue shading is the 90% confidence interval for the Test case and the orange dashed line is the 90% confidence interval for the Test + 4C case.

Figure 10. (a) Monthly empirical distribution of errors in the Test subset (dark blue) versus 1,000 aggregated samples of hybrid model simulated errors (light blue). (b) As in (a) but for Test + 4C errors, where empirical (simulated) errors are dark orange (coral). (c)-(d) As in (a) and (b), but simulated errors are from the static SWM model.

Figure 11. (a) Truth model flow (black) compared against process model flow (dark blue) and the median flow of 1,000 samples (pink) from the hybrid SWM for the February-July period in 2011 from the Test scenario. The 95% coverage interval for the 1,000 samples is shown in light gray. The inset contains a zoomed-in depiction of the period delineated by the black dashed box. (b) As in (a) but for the Test + 4C scenario, where process model (hybrid SWM median) flow is dark green (orange).

Figure 12. (a) Comparison between the 99th percentile flow distributions for the hybrid SWM (pink) and static SWM (gray) for the Test scenario. Deterministic values from the truth (process) model simulations are shown with black (blue) vertical lines. (b) As in (a) but for the Test + 4C scenario. (c)-(d) As in (a)-(b) but for the minimum cumulative annual flow value throughout the simulation period.

Figure 13. Empirical errors between SAC-SMA and observations compared to 1,000 aggregated samples from the hybrid SWM in the 5wet and 5dry subsets of the Test period (WY2005-2018). Comparisons are conducted across three separate basins (ORO, SHA, NHG).
at Oroville Dam on the Feather River (CDEC ID: ORO), Shasta Dam on the Sacramento River (CDEC ID: SHA), and New Hogan Dam on the Calaveras River (CDEC ID: NHG) (CA DWR, 2024). These data represent unimpaired flows, or the natural water

which determines how the standard deviation, kurtosis, skew, and lag-1 autocorrelation change based on model state variables (θ_SV,t) (Equations 3a-d). Maximization of the log-likelihood function simultaneously estimates all 4(m + 1) parameters in η, where m is the number of state variables. In Appendix A, we define other intermediate terms (σ_ξ,t, ω_β,t, c_β,t, a_ξ,t) required in the likelihood function, following
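A reduced sketch of this likelihood maximization is given below. For brevity it conditions only the standard deviation (log-linearly) on a single state variable, whereas the full dynamic residual model also conditions skew, kurtosis, and lag-1 autocorrelation, giving the 4(m + 1) parameters described above. The data, coefficients, and plain gradient-descent optimizer are illustrative stand-ins, not the study's implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic residuals whose spread depends on one state variable s_t:
# log sigma_t = a0 + a1 * s_t, with "true" coefficients below.
n = 4000
s = rng.normal(size=n)
true_a = np.array([-0.5, 0.6])
eps = rng.normal(size=n) * np.exp(true_a[0] + true_a[1] * s)

def negloglik(a):
    """Gaussian negative log-likelihood (up to an additive constant)."""
    log_sigma = a[0] + a[1] * s
    return np.sum(log_sigma + 0.5 * eps**2 * np.exp(-2.0 * log_sigma))

def grad(a):
    """Analytic gradient of negloglik with respect to (a0, a1)."""
    log_sigma = a[0] + a[1] * s
    g = 1.0 - eps**2 * np.exp(-2.0 * log_sigma)
    return np.array([np.sum(g), np.sum(g * s)])

# Plain gradient descent stands in for the full likelihood maximization.
a = np.zeros(2)
for _ in range(2000):
    a -= 5e-5 * grad(a)
```

Because every distributional parameter is a linear function of the state variables, a shift in the frequency of model states (e.g., fewer snow-covered days under warming) automatically shifts the simulated error distribution, which is the mechanism that lets the hybrid SWM relax the stationarity assumption.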

Table 2
State Variable Coefficients for Multiple Linear Regression Models of the Four Parameters in the Dynamic Residual Model