Optimized Predictive Coverage by Averaging Time‐Windowed Bayesian Distributions

Hydrogeological models require reliable uncertainty intervals that honestly reflect the total uncertainties of model predictions. The operation of a conventional Bayesian framework only produces realistic (interpretable in the context of the natural system) inference results if the model structure matches the data‐generating process, that is, applying Bayes' theorem implicitly assumes the underlying model to be true. With an imperfect model, we may obtain a too‐narrow‐for‐its‐bias uncertainty interval when conditioning on a long time‐series of calibration data, because the assumption of a quasi‐true model becomes too strict. To overcome the problem of overconfident posteriors, we propose a non‐parametric Bayesian method, called Tau‐averaging method: it applies Bayesian analysis on sliding time windows along the data time series for calibration. Thus, it obtains so‐called transitional posteriors per time window. Then, we average these into a wider predictive posterior. With the proposed routine, we explicitly capture the time‐varying impact of model error on prediction uncertainty. The length of the calibration window is optimized to maximize goal‐oriented statistical skill scores for predictive coverage. Our method loosens the perfect‐model‐assumption by conditioning only on small windows of the data set at a time, that is, it assumes that “the model is sufficient to follow the system dynamics for a smaller duration.” We test our method on two cases of soil moisture modeling and show how it improves predictive coverage as compared to the conventional Bayesian approach. Our findings demonstrate that the proposed method convincingly overcomes the overconfidence drawback of Bayesian inference under model misspecification and long calibration time‐series.


Introduction
Hydro(geo)logical models require reliable uncertainty intervals that honestly reflect the total uncertainties of model predictions. Bayesian probability theory offers a statistically rigorous framework to propagate various types of uncertainty through the model to its output. Specifically, Bayesian updating provides so-called posterior distributions over parameters and states of a given model structure that reflect the remaining uncertainty after taking into account observed data for calibration. Bayes' theorem guarantees that those posteriors converge toward the true values (which generated the data) as the calibration data set size increases toward infinity if the underlying model structure is in fact true, that is, reflects the so-called data-generating process in nature (see, e.g., Kleijn & van der Vaart, 2012; Shao, 1997).
However, model structural errors always exist when a real-world system is represented by a model (de la Varga & Wellmann, 2016; Kavetski et al., 2018; Sikorska et al., 2012). If the chosen model does not flawlessly describe the data-generating process, or model errors are not appropriately handled (e.g., via parametric statistical error models), all inferred probability distributions (for parameters, states, predictions) will again converge to complete, alleged confidence with increasing data set size, yet around optimal but unrealistic values (Köpke et al., 2018; Rahn et al., 2011; Reichert & Schuwirth, 2012): optimal in the sense that they represent the best-compromise solution to fit the data with the flawed model; unrealistic in the sense that the identified parameter values now become "apparent" model parameters that lose their physical interpretation in the real-world context.
This effect of Bayesian updating under model misspecification has implications for parameter inference and predictive performance. With respect to parameter inference, it may become a problem if modelers (a) wish to draw conclusions from the posterior parameter distribution about the natural system's properties or (b) wrongly interpret the narrow uncertainty distribution as an indicator of the truthfulness of a parameter estimate (Dos Reis & Yang, 2013).
With respect to predictive performance, we will almost surely see what we call "overconfidence" in prediction intervals if large data sets (e.g., long time series of hydrological data) are used for calibration: Bayesian updating under model misspecification will yield a very narrow posterior predictive distribution that fails to capture future or unseen data points. The posterior (almost) collapses on the prediction made by the best-compromise parameter set, and while this is optimal in the sense of minimizing the error over the full calibration data set, it does not necessarily perform well in predicting even a single data point. The reason is that the imperfect model failed to statistically admit its errors during the Bayesian update, which treats the model as true, as we stated above. The unavoidable overconfidence becomes a major hindrance in practical applications, because a biased prediction with no meaningful uncertainty estimate fails to provide guidance in water resources management and risk assessment.
In the area of hydro(geo)logy, this overconfidence becomes a profound issue for three reasons: (a) we have limited knowledge about the true dynamics of the hydro(geo)logic system; (b) we cannot mathematically and computationally represent all processes on all scales in our models; and (c) the time series available for calibration in hydro(geo)logic modeling can be very long, which enhances the overconfidence issue. Usually, a modeler seeks to best "tune" the model, and a model is therefore calibrated on data sets as long as possible. For example, a typical calibration period length in hydro(geo)logy is years (Arnold et al., 2012; Baffaut et al., 2015; Razavi & Tolson, 2013; Singh & Bárdossy, 2012; Werth & Güntner, 2010). The consequence is that we ambitiously try to quantify and minimize the post-calibration uncertainty in our deterministic process-based models. But if we perform Bayesian updating on a given model structure that does not adequately represent the real system, we are bound to fail with biased simulations and low predictive coverage (Motavita et al., 2019; Reichert et al., 2021; Vrugt et al., 2008; Xu & Valocchi, 2015). Note that the term predictive coverage in this research means the ability of the posterior predictive distribution to cover previously unseen real data points, that is, the appropriateness of predictive distributions: the perfect Bayesian predictive distribution should contain X% of testing data in its X% credible interval for any value 0 ≤ X ≤ 100%. This term should not be confused with interval estimation in frequentist statistics.
There are two principal paths to improving model performance (or the inference results): (a) decrease the model's bias and (b) improve the uncertainty quantification. For the first principal path, that is, to decrease the bias (a), a variety of approaches has been proposed. For example, it is common to use an output error model to express the model residuals, which accounts for input errors, model structure errors, and observation errors (Evin et al., 2014; Kavetski et al., 2006; Reichert & Mieleitner, 2009; Xu et al., 2017). Generally, this type of error model needs to include autocorrelation, heteroscedasticity, and non-Gaussian distributions. Such an approach requires empirical parameterizations (e.g., parameterization of a certain trend or correlation in the random process) and does not guarantee satisfying results (Ammann et al., 2019; Reichert et al., 2021; Reichert & Schuwirth, 2012; Schoups & Vrugt, 2010). Another way to construct error models is by data-driven methods. For example, Demissie et al. (2009) compared four machine-learned error correction models implemented with a physically based groundwater flow model. Xu and Valocchi (2015) proposed to apply Gaussian process regression for the same task. Sikorska et al. (2015) simulated the residuals by resampling from the residual population of hydrological models with a nearest-neighbor approach. This method provides an alternative to the parameterization of model errors when the statistical characteristics are unknown.
Another way to deal with bias is to add stochastic processes into a deterministic model (Reichert & Mieleitner, 2009). In this kind of approach, an error term accounts for uncertainties inside the system description and propagates this intrinsic uncertainty through time and through the model to the model output. Reichert and Mieleitner (2009) modified deterministic hydrological models by applying random parameters to model inputs. Later, Reichert et al. (2021) applied a stochastic process directly to model parameters that determine the reservoir outflow levels. This stochastic process aims to increase the generalization ability of the model, because it resolves the intrinsic stochasticity in the true natural system. Stochastic models have also been applied in a coupled hydrological and water quality model at the catchment scale (Ammann et al., 2021).
The second principal path of improving model performance is to improve the uncertainty quantification (b). This means that the uncertainty should be reported more honestly. Multi-model combination methods, such as Bayesian model averaging (BMA) or Gaussian model mixtures (e.g., Höge et al., 2019), use an ensemble of models to provide more reasonable uncertainty intervals. Their main reasoning is that a collection of several competing or collaborating models can help cover the uncertainty of choosing between different possible system conceptualizations, as no single model can a priori be said to be the true one. Wöhling et al. (2008) concluded that the BMA method provides good predictive performance. In another study, BMA avoided biased estimation of groundwater levels when the uncertainty from several different conceptual groundwater models and climate scenarios was considered (Mustafa et al., 2019). According to Darbandsari and Coulibaly (2019), BMA can enhance the quality of streamflow simulation performance in data-scarce watersheds if ensemble members include diverse simulation scenarios. With a similar starting point, but a different approach, an informal Bayesian method called generalized likelihood uncertainty estimation (GLUE) was proposed to extend parameter uncertainty by constructing an empirical informal likelihood function (Beven & Binley, 1992). Blasone et al. (2008) proposed to make the prediction intervals statistically more robust with different cut-off thresholds so that the predictive distribution contains 95% of the training data at a significance level α = 0.05. Although this approach has been applied in several studies and showed satisfying results (Blasone et al., 2008; Cu Thi et al., 2018; Freer et al., 1996; Hassan et al., 2008), it is criticized for not being statistically rigorous (e.g., Reichert & Mieleitner, 2009; Stedinger et al., 2008).
A further alternative for more realistic uncertainty quantification is to apply fully data-driven approaches instead, because they aim to quantify the information in the data (and hence should implicitly provide a reasonable estimate of the remaining uncertainty, or the information gap) (Kratzert et al., 2019). Höge et al. (2022) applied machine learning (ML) models to a lumped hydrological model by substituting the empirical parts of the model and learning the parametrizations with ML models. When abundant data are available, data-driven models even learn the relation between different basins. Increasing the training data size has been proven to make data-driven models more generalizable (i.e., transferable to other catchments) (Gauch et al., 2021). Such approaches seem to perform surprisingly well. However, understanding why data-driven approaches perform better or worse in specific situations, and gaining deeper system understanding from them, remains a challenge.
In contrast to the above, but still in a data-driven manner, we wish to adapt the Bayesian approach such that deterministic process-based models can be provided with more realistic uncertainty estimates. We propose a framework that allows us to use Bayesian inference for long data records and imperfect models by widening the posterior just so much that predictive coverage is accurate, that is, consistent with the prescribed confidence bounds. In contrast to multi-model approaches, we want to achieve this without the need to set up and maintain a set of competing models. Also, we would like to build on formal Bayesian grounds. And, finally, we want to do so in a data-driven manner, without taking a priori assumptions on the parametric statistical shape of model errors.
Therefore, we build on a recent approach from a previous study of ours, called Bayesian analysis with a sliding time window (Hsueh et al., 2022). The analysis starts from selecting a window length τ that slides along the calibration time series. Note that the term "calibration" in this method refers to "Bayes-style calibration," that is, conditioning on observations and inferring the parameters as random variables. This window size determines the data set length used simultaneously (per window position) for calibration, and the updating step for each time-window position is consistently Bayesian. Note that this procedure weakens the model assumption from "model describes system" to "model can follow dynamics over a limited time-window length." The original publication did this in order to detect and diagnose episodically re-occurring patterns of model error. In the current study, our goal is not to diagnose problems of the model, but to solve the problem of overconfidence by providing more honest predictive uncertainty intervals. Our method has a connection to the dynamic identifiability analysis (DYNIA) by Wagener et al. (2003), which also applies a sliding window to extract more information. DYNIA selects different window lengths according to the length of the period over which a parameter is influential. Another related method is the Parameter Identification Method based on the Localization of Information (PIMLI) by Vrugt et al. (2002), which also allows increasing information retrieval from the data.
Our method is based on the motivation that we explicitly wish to widen the posterior distributions in view of the time-varying effect of model errors on prediction uncertainty. This is what we achieve by averaging over time: we obtain so-called transitional posteriors (one per position of the time window) and combine these Bayesian-inferred distributions into an overall, widened posterior by averaging over the sliding time windows. In order to achieve accurate predictive coverage, we search for the optimal calibration window length. This is the length at which the weakened assumption ("model can follow system") is still true, but the data are otherwise used as strongly as possible.
Since the optimal choice of the time-window length is extracted from the model-data misfit, our proposed approach can be seen as a data-driven extension of the Bayesian framework for uncertainty assessment, and it allows using any formal likelihood function within the time windows. This method mitigates the drawback of conventional Bayesian inference, where an erroneous model together with a large data set produces overconfident prediction intervals. Overall, we call it sliding-window Bayesian analysis, as of this study equipped with averaging of time-windowed Bayesian distributions. A similar approach to perform Bayesian model calibration on multiple data sets in the presence of model error has been proposed by Viswanathan et al. (2023). However, their approach is not specifically tailored to time-series data, does not involve sliding windows, and, instead of an optimized window width, relies on an expert knowledge-based choice of data subsets for PDF averaging.
The article is structured as follows: Section 2 describes the conventional Bayesian framework for uncertainty analysis and introduces the proposed method of optimized time-windowed PDF averaging. Section 3 demonstrates the concepts and the benefits of our proposed approach with a didactic example. Section 4 introduces a soil-hydrologic field study with a synthetic data set as well as a real-world data set. Section 5 summarizes the results of these applications. Finally, we summarize the strengths and limitations of our proposed method in Section 6.

Conventional Bayesian Framework for Uncertainty Estimation
The distribution of an uncertain quantity of interest y predicted by a fixed deterministic model structure M(θ) with uncertain model parameters θ (potentially also uncertain model drivers, boundary conditions, etc.) can be expressed with a prior distribution p(y|M) = ∫_Θ p(y|θ, M) p(θ|M) dθ. This distribution can be updated to reflect a posterior state of knowledge, p(y|D, M), given the evidence in observed data D via Bayes' theorem:

p(y|D, M) = p(D|y, M) p(y|M) / p(D|M)  (1)

For the sake of brevity, we omit the conditioning on M in the following, because we are not concerned with multiple models. Analogously, the prior parameter distribution p(θ) is updated to the posterior in the light of data:

p(θ|D) = p(D|θ) p(θ) / p(D)  (2)

The likelihood function p(D|y, M) is a function of model misfit (residuals) y − D that are attributed to measurement errors and model errors. For example, the likelihood function could be a Gaussian distribution with error covariance matrix R:

p(D|y) = (2π)^(−N_obs/2) |R|^(−1/2) exp(−(1/2) (y − D)^T R^(−1) (y − D))  (3)

with N_obs representing the number of observations (i.e., the length of D). The error representation could be, for example, limited to measurement error only (as in our analysis, R has diagonal entries of measurement error) so that all other contributions to residuals are to be treated by our sliding-window analysis. Note that the choice of likelihood function is not limited to be Gaussian; any arbitrary likelihood can be applied in our method.
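As a minimal numerical illustration of Equation 3, the sketch below evaluates the Gaussian likelihood of one simulated series against observations, assuming a purely diagonal covariance R = σ²I (measurement error only), as in our analysis; the function name and interface are illustrative choices, not part of the method's specification.

```python
import numpy as np

def gaussian_likelihood(y_sim, d_obs, sigma_meas):
    """Gaussian likelihood of Equation 3 for one simulated series y_sim
    against observations d_obs, with diagonal covariance R = sigma_meas**2 * I."""
    resid = np.asarray(y_sim) - np.asarray(d_obs)
    n_obs = resid.size
    # (2*pi)^(-N_obs/2) * |R|^(-1/2) * exp(-0.5 * resid^T R^(-1) resid),
    # evaluated on log-scale first for numerical stability
    log_l = (-0.5 * n_obs * np.log(2 * np.pi)
             - n_obs * np.log(sigma_meas)
             - 0.5 * np.sum(resid ** 2) / sigma_meas ** 2)
    return np.exp(log_l)
```

For long series, one would keep the log-likelihood instead of exponentiating, since the product over many residuals underflows quickly.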

Sliding-Window Bayesian Analysis
Within the sliding-window Bayesian analysis, we perform a Bayesian update of all model quantities conditioned on the short data set within the time window that starts at the initial time step.Then, we slide the window to the next time step and repeat the update.Note that we use the same parameter and hence prediction prior in all time steps.This procedure is repeated until reaching the end of the calibration data set.
In this manner, we obtain N_τ (N_τ = N_0 − τ + 1) posterior distributions through the Bayesian updating procedure, where N_0 represents the data set length, and τ refers to the window size, that is, the number of data points that are within the calibration window. Those posteriors are generated temporarily and are not our final result; thus, we call them transitional posteriors:

p(y_{T_j(τ)} | D_{T_j(τ)}) = p(D_{T_j(τ)} | y_{T_j(τ)}) p(y_{T_j(τ)}) / p(D_{T_j(τ)})  (4)

where D_{T_j(τ)} is defined with a calibration window length of τ, namely, conditioning on τ observed data points. The same would hold for the model parameters. By calibrating a model only on a time window, we reduce the number of observations on which we condition at once (τ < N_0). Within these short subperiods, we can assume the model to be quasi-true in the sense that a data subset of length τ is less likely to reject the model, that is, the fundamental assumption of Bayesian updating is not violated (too strongly).
With smaller values for τ, the likelihood in Equation 3 becomes more spread over a broader range, and the overall model likelihood p(D_{T_j(τ)} | y_{T_j(τ)}) larger. Altogether, this leads to wider posterior distributions with more realistic uncertainty intervals. However, each subperiod of course only contains information about a specific state of the system; to use all the information from the whole time series, we must combine these distributions without running into the "overconfidence trap" of conventional Bayesian updating. This is why we propose an averaging of the time-windowed (transitional) posterior distributions, as described in the next section.
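For a Monte Carlo ensemble (anticipating the numerical implementation in Section 2.5), the window-wise updating of Equation 4 can be sketched as re-weighting the same prior realizations with each window's likelihood, yielding one transitional posterior (as ensemble weights) per window position. The diagonal Gaussian likelihood and all names here are illustrative assumptions.

```python
import numpy as np

def transitional_weights(Y_ens, d_obs, tau, sigma_meas):
    """Normalized likelihood weights of every realization in every sliding
    window of length tau. Y_ens: (N_mc, N_0) prior ensemble simulations;
    d_obs: (N_0,) observations. Returns an (N_0 - tau + 1, N_mc) array,
    where each row encodes one transitional posterior as ensemble weights."""
    n_mc, n0 = Y_ens.shape
    n_win = n0 - tau + 1
    W = np.empty((n_win, n_mc))
    for j in range(n_win):
        resid = Y_ens[:, j:j + tau] - d_obs[j:j + tau]
        log_l = -0.5 * np.sum(resid ** 2, axis=1) / sigma_meas ** 2
        w = np.exp(log_l - log_l.max())  # subtract max for numerical stability
        W[j] = w / w.sum()               # normalize: every window counts equally
    return W
```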

PDF-Averaging for the Sliding-Window Analysis
To acquire the final posterior, we average the probability density functions of the transitional posteriors (PDF averaging, Figure 1):

p(y|D) = (1/N_τ) Σ_{j=1}^{N_τ} p(y_{T_j(τ)} | D_{T_j(τ)})  (5)

where N_τ refers to the total number of transitional PDFs and p(y_{T_j(τ)} | D_{T_j(τ)}) denotes the transitional posterior of time window T_j. Averaging the transitional PDFs from all time windows means that we include in our analysis the information contained in the whole time series, but not by conditioning on all these data at the same time. We change the operation "and" in the conditional probability (multiplicative likelihood formulation in Equation 3) to "or" (while maintaining "and" within each time window according to Equation 4); see also the discussion on multiplicative versus additive likelihood formulations in Bayesian updating by Viswanathan et al. (2023). The traditional "and" within each time window ensures an appropriate conditioning effect through the data: ensemble members (realizations) that show a higher goodness-of-fit will have a larger contribution to this window's posterior distribution than realizations that show a lower goodness-of-fit.
Note that when combining the posterior distributions, the posterior distribution per time window is normalized. This means that all time windows have the same importance in the averaged PDF in Equation 5. We call this approach "frequency-based weighting." It ensures that the averaged posterior from Equation 5 reflects the characteristics (frequency of event occurrence) of the calibration time series. In the case of rainfall-runoff modeling, for example, a strongly base-flow dominated time series will lead to transitional PDFs that are mostly fitted to these flow conditions. By taking the average of these transitional posterior distributions, we automatically represent all the hydrological conditions in the whole time series in proportion to their frequency of occurrence.
There could be the case that the model fits the data better in some windows (time steps) than in other windows. This is not exploited by our approach, but modelers can use different averaging strategies tailored to their modeling purpose. One similar averaging strategy would be a likelihood-based weighting scheme, which would not normalize the posterior at each time window. In fact, a likelihood-based weighting scheme (as opposed to our proposed frequency-based weighting) can be interpreted as a more performance-oriented but risk-tolerant approach: there is a chance to achieve a better overall performance as measured by a likelihood-based skill score in predicting future data, but at the cost of potentially scoring poorly during specific periods. This will become clearer in our didactic example and the discussion in the supplementary information. Alternatively, one could include a situation-specific weighting in Equation 5, for example, to emphasize extreme flows.
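The two weighting schemes can be sketched side by side, assuming transitional posteriors are represented as ensemble weights: per-window normalized weights for the frequency-based scheme of Equation 5, and raw (unnormalized) window likelihoods for the likelihood-based alternative. Function names are illustrative.

```python
import numpy as np

def frequency_based(W):
    """Equation 5 with frequency-based weighting: the rows of W are
    per-window NORMALIZED weights, so every window contributes equally
    to the averaged posterior."""
    return W.mean(axis=0)

def likelihood_based(L):
    """Likelihood-based alternative: the rows of L are raw (unnormalized)
    window likelihoods; normalization happens only once at the end, so
    windows where the model fits better dominate the average."""
    w = L.mean(axis=0)
    return w / w.sum()
```

With identical inputs both schemes coincide only when all windows carry the same total likelihood; otherwise the likelihood-based average shifts toward the best-fitted windows.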

Optimizing the Time-Window Length τ for PDF Averaging
A final choice to make in the time-windowed PDF averaging scheme is to define the length of the time window τ.
It can be chosen based on expert knowledge, or based on error period identification (Hsueh et al., 2022), or it can be optimized toward performance in predicting future system states.The latter is the approach we propose here.
We use the so-called predictive log-score (PLS) as the score metric to quantify how well data reserved for testing are captured by the predictive posterior (Good, 1952). The score is evaluated as the logarithm of the posterior probability of each observed data value D_val,n within the validation period:

PLS(τ) = Σ_{n=1}^{N_val} log p(D_val,n | D)  (6)

where N_val stands for the total number of validation data points, and D_val,n represents the observed value of the nth validation point. For each selected calibration window length τ, the scores of all validation points are summed up to the final score. This is equivalent to evaluating the product of probabilities over the full time series. Hence, we implicitly require that all (future/validation) data points should be satisfactorily captured by the predictive distribution; if any of the data values has a probability density of (close to) zero, the overall score will be very low (i.e., highly negative on log-scale). This is a desired feature, as it ensures a good predictive coverage of all validation data with the final PDF.
With the PLS scores for a set Z of different τ values, we can search for an optimal value τ_opt:

τ_opt = argmax_{τ ∈ Z} PLS(τ)  (7)

where the PLS is defined via Equation 6. With the predictive posterior inside the PLS that is constructed by the average over transitional PDFs in Equation 5 (which are dependent on the choice of τ), we can search for τ_opt, for example, by complete enumeration.
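The score and the enumeration over candidate window lengths can be sketched as follows. Representing the averaged predictive posterior as a weighted Gaussian mixture over ensemble simulations, with kernel width equal to the measurement error, is an assumption of this sketch, not a prescription of the method; names are illustrative.

```python
import numpy as np

def predictive_log_score(w, Y_val, d_val, sigma_meas):
    """Equation 6: sum of log posterior predictive densities over validation
    points. The predictive density is approximated as a w-weighted mixture of
    Gaussians centered on the ensemble simulations Y_val (N_mc, N_val)."""
    pls = 0.0
    for n in range(d_val.size):
        dens = (np.exp(-0.5 * (Y_val[:, n] - d_val[n]) ** 2 / sigma_meas ** 2)
                / (np.sqrt(2 * np.pi) * sigma_meas))
        pls += np.log(np.sum(w * dens) + 1e-300)  # guard against log(0)
    return pls

def optimize_tau(taus, score_fn):
    """Equation 7: complete enumeration over the candidate set Z of window
    lengths; score_fn(tau) must return the PLS for that tau."""
    scores = {tau: score_fn(tau) for tau in taus}
    return max(scores, key=scores.get), scores
```

In practice, `score_fn` would rebuild the τ-dependent averaged posterior (Equation 5) before scoring, so the enumeration re-uses the single MC ensemble described in the numerical implementation.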
Note that the PLS is related to Bayesian model evidence (BME), but with the important difference that the PLS assumes independence of the data points. That means that with the PLS, we check for each data point independently its probability density given the predictive posterior; BME, in contrast, evaluates the multi-dimensional predictive PDF, which means that (numerically) individual ensemble members need to be able to capture all of the data points well to achieve a good score. This is in the same spirit as implicitly assuming the model to be true. Here, with our time-varying parameter posteriors, it is more meaningful and consistent to ask for the full distribution to cover each individual data point well (no matter which exact parameter value fits which data point well). So the PLS consistently answers whether the obtained widened distribution achieves our goal of more realistically capturing unseen data points.
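The contrast between the two scores can be made concrete for a weighted ensemble; the helper below is purely illustrative (Gaussian point-wise densities, names assumed) and not part of the proposed method.

```python
import numpy as np

def pls_and_log_bme(w, Y_val, d_val, sigma_meas):
    """PLS checks each point against the weighted MIXTURE density, so
    different realizations may take turns fitting different points.
    A numerical log-BME averages each realization's JOINT likelihood,
    so a single realization must fit ALL points to score well."""
    dens = (np.exp(-0.5 * (Y_val - d_val) ** 2 / sigma_meas ** 2)
            / (np.sqrt(2 * np.pi) * sigma_meas))      # shape (N_mc, N_val)
    pls = np.sum(np.log(dens.T @ w + 1e-300))         # sum_n log sum_i w_i p(d_n | y_i)
    log_bme = np.log(np.sum(w * np.prod(dens, axis=1)) + 1e-300)
    return pls, log_bme
```

When each realization fits only part of the data (as with time-varying parameters), the PLS stays moderate while log-BME collapses, which is exactly why the PLS is the consistent choice here.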
The optimal window size that we expect to acquire should be short enough to resolve different regimes of how the model can be wrong against the data, and long enough to have a pronounced conditioning effect (a calibration on a single data point, as an extreme case, would not provide enough information for reasonable narrowing of the prior distribution); but this conditioning effect is only beneficial if we are still within a scale of time window length where the model can be considered quasi-true.
For different calibration purposes, a modeler can formulate any other problem-specific skill score to find an optimal value for τ.For example, the objective function can be a metric derived from information theory or a state-dependent metric (e.g., good coverage for upper or lower extreme states).This decision should reflect an appropriate penalization of accuracy or variance in the light of the model purpose.

Numerical Implementation
To obtain the transitional posteriors as a basis for (optimized) PDF averaging, we need to perform Bayesian updating N_τ times. Since analytical solutions are tied to overly strong assumptions, we have to draw on numerical implementations in practical applications (Schöniger et al., 2014). In principle, any suitable implementation method to perform Bayesian updating would also work for our proposed approach; however, the repeated computational effort can be reduced by a smart choice of method.
Brute-force Monte Carlo (MC) is computationally expensive when considering a single updating step, but becomes very efficient in our proposed setup of repeated updating. An MC ensemble consists of many random model predictions that, together, represent the prior predictive distribution. It is generated based on random draws from the prior parameter distribution and subsequent model runs. For our purposes, all transitional posteriors can be evaluated using this single MC ensemble. This is because we only need to re-evaluate likelihoods per time window, not re-run simulations, and use the window-wise likelihoods to weight all realizations in each window. This means that we only need to run the model ensemble once to prepare the (admittedly huge) set of realizations at the pre-processing stage. Once the MC ensemble is generated, the model is not needed for the rest of the analysis.
Given the observed data D_{T_j(τ)} in the current calibration window and a global definition of the likelihood function p(D|y), we evaluate the individual likelihood of each of the N_MC ensemble members (realizations) to obtain a time window-specific weighting of the ensemble. Recall that we assign to each window the same importance, which is achieved by normalizing all realization-wise likelihoods within each window by the sum of likelihoods.
Then, we slide the window to the next time step and repeat the procedure. We collect all the normalized likelihoods (imagine them as a time series of weights per realization) and finally take the average along the time dimension. For a sequence of N_τ shifted windows, there are N_τ × N_MC normalized likelihoods, and we obtain a final averaged likelihood vector of dimension 1 × N_MC. The final (parameter or prediction) posterior is obtained by weighting the prior ensemble with this averaged likelihood vector.
Modelers could also choose a Markov chain Monte Carlo (MCMC) method for its efficiency in the single-updating setup; however, here, a new MCMC run would be needed for each shifted time window, because the objective function changes. Hence, MCMC would become infeasible if the time series is very long or the time window very short. The brute-force MC approach we propose is cheap for short time windows, as not many prior realizations are necessary for statistically stable weights under the weakened conditioning per time window.

Didactic Example
In this section, we use a didactic example to visualize the concept of our proposed method by generating two cases, one with a perfect (true) model, and one with a flawed model (the common case).

Simple Linear Model Setup
We generate two test case scenarios with a simple linear model (LM model):
Case C1: Reality can be reproduced by the model (the model is perfect, as the conventional Bayesian method assumes).
Case C2: Reality is too complex to be resolved by the model (the model is imperfect or partially erroneous, as all hydrological models are in practice).
Let us assume that in case C1, we can observe y_t. The observation y_t is associated with a physical quantity x_t. The relationship between y_t and x_t is y_t = 0.2 · x_t. We build a corresponding model to predict the "system" of case C1, which writes as y_t = w · x_t with one model parameter w.
In case C2, the system changes (e.g., changing boundary conditions) and becomes more complex. In this system, y_t is also observable for us, and y_t is still associated with x_t, but the relationship between y_t and x_t is:

y_t = 0.2 · x_t (state 1),  y_t = 0.7 · x_t (state 2)

The model to describe case C2 remains unchanged, that is, we take into account that the (a priori unknown) two-state characteristic of the system is not resolved by the model. Case C2 is designed to mimic the scenario when a natural system switches between processes, inputs, or states, but the model fails to capture this feature.
In both of the cases, we create a time series of 200 time steps, and since case C2 is merely extended from case C1, both are presented in Figure 2. This data set is split into two groups: the first 160 time steps are used for calibration and the last 40 time steps for validation. At each time step, we generate a random input value {x_t ∈ (0, 1) | t = 0, …, 200} as a given physical quantity. We set up a simple linear model y_t = w · x_t with a parameter prior w ∼ U(0, 1) and sample size 30,000. We use a likelihood that reflects i.i.d. Gaussian measurement error with variance 0.04². We test our approach with four calibration window sizes: τ = 5, 15, 20, 160. To simplify the notation, by τ = 5 we mean conditioning on windows that contain 5 data points. The conventional Bayesian method (conditioning on the full data set at once) is represented by calibration with a window size of τ = 160. The window sizes τ = 5, 15, 20 are designed to demonstrate the effect of the Tau-averaging method.
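A setup in the spirit of case C2 and the Tau-averaging analysis can be sketched end-to-end. The state-switching time (here a single switch at step 80), the ensemble size, and all names are illustrative assumptions and may differ from the article's Figure 2 configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- synthetic two-state data in the spirit of case C2
# (a single switch at step 80 is an illustrative choice)
n0, n_mc, sigma = 160, 5000, 0.04
x = rng.uniform(0.0, 1.0, n0)
w_true = np.where(np.arange(n0) < 80, 0.2, 0.7)   # state 1, then state 2
d_obs = w_true * x + rng.normal(0.0, sigma, n0)

# --- prior ensemble of the imperfect single-parameter model y_t = w * x_t
w_prior = rng.uniform(0.0, 1.0, n_mc)
Y_ens = np.outer(w_prior, x)                      # (N_mc, N_0) simulations

def posterior_weights(tau):
    """Tau-averaged posterior weights: per-window normalized likelihoods,
    averaged over all N_0 - tau + 1 window positions."""
    rows = []
    for j in range(n0 - tau + 1):
        resid = Y_ens[:, j:j + tau] - d_obs[j:j + tau]
        log_l = -0.5 * np.sum(resid ** 2, axis=1) / sigma ** 2
        w = np.exp(log_l - log_l.max())
        rows.append(w / w.sum())
    return np.mean(rows, axis=0)

w_avg = posterior_weights(5)     # Tau-averaging: mass around both 0.2 and 0.7
w_conv = posterior_weights(160)  # conventional Bayes: collapses on a compromise
```

Comparing the weighted spread of `w_prior` under `w_avg` versus `w_conv` reproduces the qualitative effect discussed below: the averaged posterior stays wide enough to cover both states, while the full-window posterior collapses.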

PDF Averaging
When a calibration window of given size slides through a time series, a series of transitional posteriors (see Section 2.2) is generated on the fly. Recall that obtaining those transitional posteriors is not our final goal; they are intermediate results. Figure 3 presents how the transitional posterior shifts from one state to another when the LM model is calibrated on 5 data points at each time step in case C2. Each calibration window in Figure 3b shares the same color as the resulting transitional posterior in Figure 3a. To let readers clearly see which system state the data points are from, the states are indicated in Figure 3b.
When the calibration window is completely within state 1, the transitional parameter posterior centers at the value of 0.2, the true value for state 1. When the calibration window slides to the next time step (the second window from the left), the resulting transitional posterior mode shifts away from 0.2, because the observed data points are no longer purely from the system at state 1. This shifting continues because the ratio of calibration data points from state 1 versus state 2 changes at each time step. When the calibration window is completely within state 2, the transitional posterior centers at 0.7, the true value for state 2. From this demonstration, we can tell that the transitional posterior reveals the reality correctly whenever the calibration window is completely within one state.
The thick gray curves in Figure 4 present the averaged posterior of case C2 for the calibration window lengths 5, 15, 20, and 160, respectively. Each light blue curve presents one transitional posterior conditioned on the data points within the calibration window at a certain time step.
Note that state 2 covers in fact the same time span in which model errors appear. By design, the model in case C2 does not know and cannot describe state 2, where the relationship between y_t and x_t changes. This deficit between the system and the model is a model error from the modeler's perspective, because the modeler assumes a single, fixed parameter value (e.g., that of state 1), but does not acknowledge that the parameter may change to other values (e.g., that of state 2). Following this logic, we label the recognized parameter value (state 1 in reality) with the green vertical line and the other (unrecognized) parameter value (state 2 in reality) with the red vertical line.
In our demonstration, the total calibration data length is 160 days. We first use a window length of τ = 5, leading to 156 transitional posteriors (156 = 160 − 4, because the first four data points do not yet fill one time window). Each transitional parameter posterior in Figure 4 represents the plausible parameter values given the data set within the respective window. When the calibration window slides across states, we again observe that the transitional parameter posterior shifts away from the parameter value of the current state (the transitional posteriors that center neither at 0.2 nor at 0.7).
The averaged posterior is a combination of all transitional posteriors. Its shape is determined by the frequency of the transitional posteriors: if many transitional posteriors peak at the same value, the final averaged posterior reflects this fact. In the example with τ = 5, the averaged posterior honestly presents the two most frequently occurring parameter values among the total of 156 transitional posteriors.
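The sliding-window calibration and the subsequent equal-weight averaging can be sketched end-to-end for case C2 (again a grid-based stand-in for the paper's Monte Carlo sampling; the 20-step state switching follows Figure 2):

```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA, TAU = 0.04, 5

# Case C2 truth: parameter alternates between 0.2 and 0.7 every 20 steps
t = np.arange(160)
w_true = np.where((t // 20) % 2 == 0, 0.2, 0.7)
x = rng.uniform(0.0, 1.0, size=160)
y = w_true * x + rng.normal(0.0, SIGMA, size=160)

w_grid = np.linspace(0.0, 1.0, 501)
dw = w_grid[1] - w_grid[0]

def window_posterior(xw, yw):
    """Normalized posterior over w for one calibration window."""
    resid = yw[:, None] - w_grid[None, :] * xw[:, None]
    log_lik = -0.5 * np.sum(resid ** 2, axis=0) / SIGMA ** 2
    p = np.exp(log_lik - log_lik.max())
    return p / (p.sum() * dw)

# Transitional posteriors for every sliding window, then their average
transitional = np.array([
    window_posterior(x[i:i + TAU], y[i:i + TAU])
    for i in range(len(t) - TAU + 1)
])
averaged = transitional.mean(axis=0)   # equal-weight PDF averaging
```

With τ = 5 this yields 156 transitional posteriors, and the averaged density is bimodal with most probability mass near the two true state values 0.2 and 0.7.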
Figure 4 shows that, with a well-chosen τ that is small compared to the duration of typical error periods (in our case τ = 5 against an error period length of 20 points), the parameter posterior can reveal the parameter values of both states. A calibration window of about the same size as the error period (τ = 20) makes this advantage disappear. An even larger calibration window (up to conventional Bayesian calibration) fails to reveal how much the misspecified model struggles to match the observed data. For τ = 160, Figure 4 shows a clear peak at an intermediate parameter value (here: 0.41), which represents neither of the actual system states. This can be understood as the best compromise solution when fitting the model to the full time series without giving it the flexibility to adapt to two different system states.
In comparison, the transitional distributions and the averaged parameter posteriors for a case with a perfect model (case C1) are shown in Figure 5. In this case, the parameter value of state 1 is set to 0.2 for the synthetic reality, and there is no second state. Since the model describes reality correctly and this is a simple example, there is no conflict among the transitional distributions. Therefore, the widening effect of the Tau-averaging method is almost negligible and disappears in the limit of the full calibration data length. This demonstrates that, in the extreme case of a true model, the Tau-averaging method converges to the conventional Bayesian updating method.

Figure 2. Didactic example with the observation data points for case C1 marked with the blue curve and for case C2 marked with the symbol "x." Case C2 aims to mimic the periodically occurring change of processes or inputs in a natural system that switches between state 1 and state 2 every 20 days.
Figure 3. The shifting process of the transitional posteriors obtained with the calibration window τ = 5 in case C2. The transitional posterior mode shifts from 0.2 (state 1) to 0.7 (state 2). When the calibration window is completely within one state (the leftmost and the rightmost window), the posterior reveals the reality correctly.

Widened Posterior With the Tau-Averaging Method
Comparing predictive posteriors of the conventional and of our modified Bayesian updating, we can observe a more pronounced effect. Figure 6a shows the predictive posterior of the first validation point (t = 161), again acquired with τ = 5, 15, 20, and 160. Note that, despite the similar shape, it is not the parameter posterior. The thick gray curves present the averaged PDF (over the whole time series). The parameter value of state 1 is set to 0.2 and marked with the green line; the parameter value of state 2 is set to 0.7 and marked with the red line. To visualize the evolution of the predictive posterior obtained with the window-averaged transitional parameter posteriors over system time (t = 161-200), we show their 95% credible intervals as time series in Figure 6b. In this case of a perfect model (case C1), the conventional Bayesian method works perfectly because the assumption of a perfect model is fulfilled, and a growing data set enables the posterior to converge toward the true parameter and predictions. Since the model is perfect, the length of the calibration window has little influence on the calibration performance, and the widening effect is insignificant. Thus, the predictive posteriors for the different values of τ overlap each other (Figure 6b).
Given a flawed model (case C2), calibration with the Tau-averaging method leads to very different results. Figure 6c compares the predictive posteriors for different values of τ at the first validation point (t = 161) as an example. A conventional Bayesian calibration on the full data set (τ = 160) in case C2 yields a predictive posterior that is biased and overconfident when interpreted as a representation of the true system value. The maximum likelihood prediction does not meet the observation (Figure 6c), providing potentially wrong or misleading information for model/parameter interpretation and for decision-making. With conventional Bayesian calibration, which merely provides a best-fit compromise on the full data set, the parameter posterior (Figure 4) centers at a single value, but neither the maximum a posteriori parameter value nor the prediction (Figure 6c) meets either of the two true states of the system. Thus, the posterior predictive distribution with τ = 160 fails to cover most of the data points, as shown in Figure 6d.
Similar to the transitional and averaged parameter posteriors shown in Figure 4, the widening effect can again be observed in the predictive posteriors obtained with the window-averaged transitional posteriors (Figures 6c and 6d). Applying our proposed Tau-averaging method with the window length τ = 5, we obtain a bimodal predictive posterior (Figure 6c), and one of the posterior peaks falls on the observed data point. From the results in Section 3.2, we see that this predicted value stems from one of the inferred true parameter values. As for case C1, Figure 6d compares the 95% credible intervals of the predictive posteriors for various τ values. In this case, different τ values clearly change the credible intervals. Compared to the 95% credible interval for τ = 160 in Figure 6d, smaller τ values yield wider credible intervals that cover almost all validation data points. These results show two advantages of our proposed method: (a) the two different parameter values that are truly associated with the system in both states are revealed; (b) the Tau-averaging method overcomes both the bias and the overconfidence of the conventional Bayesian approach that occur with long calibration time series and an imperfect model. These advantages are very insightful for modelers, because the results can be interpreted with respect to their underlying physical meaning, at least if the imperfect model has a parameter with a conceptually interpretable connection. They also provide appropriate uncertainty intervals for forecasting and robust decision-making.
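The 95% credible intervals shown in Figures 6b and 6d follow from the averaged predictive density. A sketch of that step, using a stand-in bimodal parameter posterior (the two modes at 0.2 and 0.7 mimic the τ = 5 result; the mode widths and the input value x_new are illustrative assumptions of ours):

```python
import numpy as np

# Stand-in averaged (bimodal) parameter posterior on a grid, as produced
# by the sliding-window calibration in case C2
w_grid = np.linspace(0.0, 1.0, 1001)
dw = w_grid[1] - w_grid[0]
post = (np.exp(-0.5 * ((w_grid - 0.2) / 0.02) ** 2)
        + np.exp(-0.5 * ((w_grid - 0.7) / 0.02) ** 2))
post /= post.sum() * dw

# Predictive density for y = w * x_new + eps, mixing over the posterior
SIGMA, x_new = 0.04, 0.8
y_grid = np.linspace(-0.2, 1.0, 2401)
dens = np.array([
    np.sum(post * np.exp(-0.5 * ((yv - w_grid * x_new) / SIGMA) ** 2)
           / (SIGMA * np.sqrt(2.0 * np.pi))) * dw
    for yv in y_grid
])

# Central 95% credible interval from the predictive CDF
cdf = np.cumsum(dens)
cdf /= cdf[-1]
lo = y_grid[np.searchsorted(cdf, 0.025)]
hi = y_grid[np.searchsorted(cdf, 0.975)]
```

Because the averaged posterior keeps both modes, the resulting interval brackets the predictions of both system states (0.2 · x_new and 0.7 · x_new), which is exactly the widening effect described above.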

Optimizing the Window Size for Averaging
As demonstrated in the previous section, using a small calibration window and averaging the transitional PDFs can mitigate the erroneous overconfidence of conventional Bayesian inference. In the didactic example, even with a calibration length of 20 data points, the parameter posterior still predicts the value in the "valley" of the bimodal distribution and thus tends to be overconfident. Obviously, a good choice for the value of τ is advisable.
As an example for discussion, Figure 7 compares the predictive posteriors at the first validation point (t = 161) obtained for τ = 1, 5, 15, 20, and 160. The model is least constrained when calibrated on one data point at a time (τ = 1) and thus has the widest credible intervals. However, our goal is not the widest possible credible interval, but the most appropriate one. We can observe how the posterior evolves as we calibrate the model with increasing τ. For τ = 1, the posterior probability mass spreads to the largest extent. At the longest calibration data length (τ = 160), the probability mass is centered at the compromise solution and forms an almost normal distribution with a small standard deviation.
By formulating an optimization problem according to Section 2.4, we can find the optimal calibration window length that widens the posterior only to the degree needed for covering the validation data points. Our chosen skill score PLS (Equation 6) helps to achieve this goal. This can be clarified by observing the p-values at the observed value (vertical dashed line) in Figure 7. Even though the averaged posterior for τ = 1 has the widest credible interval, its score is not the highest, because the probability mass is spread over too wide a range: the posterior is not biased, but too imprecise. On the contrary, with the maximum window size τ = 160, the posterior gives a confident but biased result that is completely off the observation, and thus the p-value is low. The results for the tested window sizes (τ = 1, 5, 15, 20, 160) show that shorter windows have the chance to reveal multi-modal posteriors, whereas calibration with longer windows first smears out these multiple peaks and then narrows the distribution. As illustrated, an optimal window length τ_opt exists for which the calibrated result is most plausible, that is, which produces the highest predictive PDF value (y-axis). The optimization result in Figure 8 shows that, in the case of a perfect model (blue line), the score rises (improves) with increasing calibration data length. In the case of an imperfect model (black line), the score first increases and then decreases. This clearly demonstrates the failure of the conventional Bayesian routine under violated model assumptions. From this figure, we can also read that the optimal window length for case C2 is τ_opt = 4. The score decreases rapidly for window lengths larger than 10. Together with Figure 7, we can tell that this is roughly where the bimodal shape of the posterior starts to vanish.
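The τ-optimization can be sketched end-to-end for the didactic example. Since Equation 6 is not reproduced here, we assume for this sketch that the PLS is the mean log predictive density over the validation points (a plausible reading, not necessarily the paper's exact definition):

```python
import numpy as np

rng = np.random.default_rng(2)
SIGMA = 0.04

# Case C2 synthetic truth: parameter alternates 0.2 / 0.7 every 20 steps
t = np.arange(200)
w_true = np.where((t // 20) % 2 == 0, 0.2, 0.7)
x = rng.uniform(0.0, 1.0, 200)
y = w_true * x + rng.normal(0.0, SIGMA, 200)
x_cal, y_cal = x[:160], y[:160]       # calibration period
x_val, y_val = x[160:], y[160:]       # validation period

w_grid = np.linspace(0.0, 1.0, 501)
dw = w_grid[1] - w_grid[0]

def averaged_posterior(tau):
    """Equal-weight average of all sliding-window posteriors."""
    posts = []
    for i in range(len(y_cal) - tau + 1):
        resid = y_cal[i:i+tau, None] - w_grid[None, :] * x_cal[i:i+tau, None]
        ll = -0.5 * np.sum(resid ** 2, axis=0) / SIGMA ** 2
        p = np.exp(ll - ll.max())
        posts.append(p / (p.sum() * dw))
    return np.mean(posts, axis=0)

def pls(tau):
    """Assumed skill score: mean log predictive density in validation."""
    post = averaged_posterior(tau)
    dens = [np.sum(post * np.exp(-0.5 * ((yv - w_grid * xv) / SIGMA) ** 2)
                   / (SIGMA * np.sqrt(2.0 * np.pi))) * dw
            for xv, yv in zip(x_val, y_val)]
    return np.mean(np.log(dens))

scores = {tau: pls(tau) for tau in (1, 5, 15, 20, 160)}
tau_opt = max(scores, key=scores.get)
```

The enumeration reproduces the qualitative behavior of Figure 8 for the imperfect model: small windows score well, while the conventional choice τ = 160 is heavily penalized for its biased, overconfident predictions.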

Demonstration on a Soil Model Case Study
In this section, we test our proposed method on a real-world case study. First, we design a data scenario similar to Hsueh et al. (2022) to allow for controlled model error conditions and a clear interpretation of results; then, we apply the method to field data in the same modeling setup.

Field Site Description
The field data used in this research stem from previous research (Wöhling et al., 2008; Wöhling & Vrugt, 2008) at the Spydia experimental site in the northern Lake Taupo catchment, New Zealand. The studies of Wöhling et al. (2008) and Barkle et al. (2011) aimed to inversely estimate the hydraulic parameters of a volcanic vadose zone with three multiobjective optimization algorithms.

Soil Model Setup
The data used in this study were originally presented in Wöhling and Vrugt (2008). In this research, the 1D Richards equation and the modified Mualem-van Genuchten model (MvG model) (Mualem, 1976; van Genuchten, 1980; Vogel & Cislerova, 1988) are used to simulate the unsaturated soil water movement. The Richards equation writes:

∂θ/∂t = ∂/∂z [ K(h) (∂h/∂z + 1) ] − R.

In this equation, θ stands for the volumetric water content [−], and h is the soil water pressure head [cm]; t and z are time [day] and depth [m], respectively. K is the unsaturated hydraulic conductivity [cm day⁻¹] in the vertical direction, and R [cm³ cm⁻³ day⁻¹] refers to a sink term when root water uptake is considered.
The modified MvG model with an air-entry value of h_s = −2 cm writes:

θ(h) = θ_r + S_e (θ_s − θ_r),  K(S_e) = K_sat S_e^l [1 − (1 − S_e^{1/m})^m]²,

with the effective saturation

S_e(h) = [1 + (α|h|)^n]^{−m}, m = 1 − 1/n, for h < h_s, and S_e = 1 for h ≥ h_s,

where θ_r and θ_s represent the residual and saturated water content [−], respectively; α [cm⁻¹] and n [−] control the shape of the water retention curve; K_sat refers to the saturated hydraulic conductivity [cm sec⁻¹]; and the pore connectivity parameter l [−] is given by Mualem (1976). The model is implemented in the software Hydrus 1D (Simunek et al., 2005) and used for all simulations in this study.
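The hydraulic property functions can be sketched in code; the snippet below implements the standard van Genuchten-Mualem form that the modified model builds on (the Vogel-Cislerova air-entry modification with h_s = −2 cm is omitted for brevity, and all parameter values in the call are illustrative, not from Table 1):

```python
import numpy as np

def van_genuchten(h, theta_r, theta_s, alpha, n, k_sat, l):
    """Standard van Genuchten-Mualem retention and conductivity functions.

    h       : pressure head [cm], negative in the unsaturated zone
    returns : (volumetric water content, unsaturated conductivity)
    """
    m = 1.0 - 1.0 / n
    h = np.asarray(h, dtype=float)
    # Effective saturation: van Genuchten curve for h < 0, saturated above
    se = np.where(h < 0.0, (1.0 + (alpha * np.abs(h)) ** n) ** (-m), 1.0)
    theta = theta_r + se * (theta_s - theta_r)           # retention curve
    # Mualem conductivity model with pore-connectivity parameter l
    k = k_sat * se ** l * (1.0 - (1.0 - se ** (1.0 / m)) ** m) ** 2
    return theta, k

# Illustrative evaluation at h = -100 cm (hypothetical parameter values)
theta, k = van_genuchten(h=-100.0, theta_r=0.05, theta_s=0.40,
                         alpha=0.02, n=1.8, k_sat=10.0, l=0.5)
```

At saturation (h ≥ 0) the functions return θ_s and K_sat exactly, and for h < 0 both θ and K decrease monotonically, which is the behavior Hydrus 1D evaluates at every node and time step.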

Water Resources Research
10.1029/2022WR033280, HSUEH ET AL.

Wöhling and Vrugt (2008) used the laboratory analysis of physical characteristics and texture properties of the soil horizons at the Spydia experimental site to derive prior distributions for the MvG model parameters θ_s, α, n, K_sat, and l. They decided for independent uniform distributions with individual upper and lower bounds (see Table 1). The model output uncertainty bounds stem from these five unknown MvG model parameters. For a more detailed description of the model setup, please refer to Wöhling et al. (2008).
Another model we use in this study is the dual-porosity model by Durner (1994), which describes the retention properties as a linear combination of van Genuchten curves for different pore spaces:

S_e(h) = Σ_{i=1}^{k} w_i [1 + (α_i |h|)^{n_i}]^{−m_i},

where w_i are weighting factors for the k subclasses of the soil with w_i ∈ (0, 1) and Σ w_i = 1. One of the advantages of the dual-porosity model is the similar interpretation of its parameters as in the van Genuchten model (Durner, 1994).
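The weighted-sum structure of the Durner model is easy to sketch (the two-subclass parameter values below are illustrative assumptions, not those of Table 3):

```python
def durner_se(h, weights, alphas, ns):
    """Durner (1994) multi-porosity effective saturation: a weighted sum
    of van Genuchten curves, one per pore-space subclass.
    weights must sum to 1 so that S_e(0) = 1."""
    se = 0.0
    for w, a, n in zip(weights, alphas, ns):
        m = 1.0 - 1.0 / n
        se += w * (1.0 + (a * abs(h)) ** n) ** (-m)
    return se

# Two subclasses (soil matrix + macropores), hypothetical parameters
se = durner_se(-50.0, weights=(0.7, 0.3), alphas=(0.01, 0.2), ns=(1.6, 2.5))
```

Because the weights sum to one, the combined curve is again a valid effective saturation in (0, 1], while the macropore subclass (larger α) drains much earlier than the matrix, producing the characteristic dual-porosity retention shape.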

Synthetic Data Scenario
A synthetic observation data set is generated with one specific parameterization of the soil hydraulic model as described in Table 2. We refer to this parameterization as the reference run (marked as the dotted curve in Figure 9). This setup represents an analogy to state 1 in our didactic example (Figure 10).
To generate the scenario in which the true system changes, we use the dual-porosity model during three selected periods (pink shaded boxes in Figure 9); that is, we assume that the system's properties change due to some external conditions (e.g., changes in soil hydraulic properties). The additional parameters for the model parameterization are listed in Table 3. This setup aims to mimic state 2 of our didactic example. Recall that these three periods are error periods from a modeler's perspective. More details about the generation of the synthetic observations are provided in Hsueh et al. (2022).
We use the time series from day 1 to day 150 as calibration period and from day 151 to day 200 as validation period (Figure 9). Both periods include phases in which the model cannot resolve the reality (i.e., model error occurs).

Real Data Scenario
The pressure head data at 0.4 m depth (the observations closest to the surface layer) are used here due to their dynamics triggered by the atmospheric forcing. For the demonstration of our method, we exclude the first 10 days (data points) for spin-up and use the following 200 days.
We choose the first 150 simulation days for calibration and the following 50 simulation days for validation. The calibration and validation periods are exactly the same as in the synthetic scenario for better comparability of the results.

Numerical Implementation
To perform Bayesian updating (and our modification of it) for the synthetic and real data set cases, we define a Gaussian likelihood function with a diagonal covariance matrix. We assume lumped measurement and model errors with a standard deviation of 0.04 m for the soil water pressure head data. We generated an ensemble of N_MC = 900,000 random parameter realizations and performed model simulations for our given setup. The parameter priors are uniform distributions within the ranges given in Table 1.
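The likelihood evaluation over such an ensemble reduces to per-realization weights; a sketch with hypothetical stand-in arrays in place of the Hydrus 1D output (the diagonal covariance makes the Gaussian likelihood a sum of independent per-time-step terms):

```python
import numpy as np

rng = np.random.default_rng(3)
SIGMA = 0.04   # lumped measurement + model error std [m of pressure head]

# Hypothetical stand-ins for simulated and observed head time series
n_mc, n_t = 1000, 50
sims = rng.normal(0.0, 0.1, size=(n_mc, n_t))  # one row per realization
obs = rng.normal(0.0, 0.1, size=n_t)           # observed series

# Diagonal-covariance Gaussian likelihood -> i.i.d. per-time-step residuals
log_lik = -0.5 * np.sum((sims - obs) ** 2, axis=1) / SIGMA ** 2
weights = np.exp(log_lik - log_lik.max())      # stabilized in log space
weights /= weights.sum()    # normalized posterior weights per realization
```

Subtracting the maximum log-likelihood before exponentiating avoids numerical underflow, which matters here because long, informative series concentrate almost all weight on very few realizations.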
We first present averaged PDFs for three calibration window lengths (τ = 5, 15, 20 days) to demonstrate the effect of the window-size choice on the final predictive PDF. For comparison, we also show the result for τ = 150, which corresponds to the conventional Bayesian result. To evaluate and compare the predictive performance for different values of τ, we use the PLS as described in Section 2.4. To identify the optimal calibration window size in a second step, we perform a complete enumeration at a higher resolution of τ = 1, 2, …, 20. Since we observed a clearly declining performance trend for τ larger than 20, we evaluate the score less frequently in this region (but nevertheless report it for illustration).
The convergence of all Bayesian update runs is assessed with the effective sample size (ESS). The ESS indicates the independent sample size needed to obtain equally precise distributions (Robert et al., 1999). We check the ESS for the averaged distribution of each value of τ and confirm that the results are reliable. The one exception is the result for τ = 150: with an ESS below 15, it would be deemed unreliable. Low ESS numbers simply confirm a frequently encountered problem of brute-force MC implementations of Bayesian updating given long and informative data series. Our proposed method does not suffer from this problem, which demonstrates that our approach also has advantages on the level of implementation and convergence.
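The ESS diagnostic used above can be computed directly from the normalized likelihood weights with the standard estimator ESS = 1/Σ w_i² (a common definition for importance weights; the paper cites Robert et al., 1999):

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized importance weights.
    Equals N for uniform weights and approaches 1 after weight collapse."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

# Degenerate weights (one dominant realization) -> ESS close to 1
ess_collapsed = effective_sample_size([0.97, 0.01, 0.01, 0.01])
# Uniform weights -> ESS equals the ensemble size
ess_uniform = effective_sample_size([0.25, 0.25, 0.25, 0.25])
```

An ESS below 15 out of 900,000 realizations, as reported for τ = 150, therefore signals near-total weight collapse of the brute-force MC update.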
All our analyses are implemented and performed in MATLAB© R2019a on a standard desktop computer.

Averaged Posterior Parameter Distributions
Before discussing the predictive coverage, we present the transitional and averaged parameter posteriors for different values of τ (5, 15, 20, and 150) in Figure 11.
As we learned in the didactic example, the transitional parameter posteriors obtained from calibrating a perfect model (case C1) should have a shape similar to that of the averaged PDF (Figure 5). However, in this soil model case study, the assessment of the transitional posteriors is not as simple as in the didactic example. Generally, as long as the transitional posteriors share the same maximum a posteriori (MAP) estimate, we consider them to be similar. For example, comparing the transitional posteriors of the parameters θ_s and l, we observe that the transitional posteriors for τ = 15 and 20 have shapes similar to the averaged posteriors (Figures 11b1, 11c1, 11b5, and 11c5). In contrast, we consider the transitional posteriors for τ = 5 of these two model parameters (Figures 11a1 and 11a5) to be dissimilar, since the MAP estimates of some of these transitional posteriors shift significantly away from the peak of the averaged posterior.
As explained in the method section, the frequency of the transitional posteriors influences the final shape of the averaged posterior. We can separate the transitional posteriors of the parameter θ_s in Figure 11a1 into two shapes: shape 1 has right tails and shape 2 has left tails. We observe that shape 1 is dominant in the averaged posterior. This means that the final calibration result is not much influenced by state 2 (the error periods), because state 2 does not occur frequently (not as prominently as in the didactic example). If the macro-pore system were more dominant, we would see a clearer change (another peak at a high value) in the averaged posteriors of the parameters α and l.
With a time window of τ = 150, the total length of the calibration data set, we return to the conventional Bayesian procedure, and thus no transitional posterior is shown. Note that with τ = 150 the ESS value is lower than 15 and the results may be unreliable. That said, the bimodal distributions shown in the averaged posteriors of the parameters θ_s and K_sat for τ = 150 do not really represent two physical parameter values.
Both the transitional and the averaged posteriors reflect the existence of model errors (state 2, which is unknown to the modeler). Recall that state 2 (the error period) is generated by a dual-porosity model (Hsueh et al., 2022), which is characterized by higher α and l values. This can be observed in some transitional posteriors of the model parameters α and l that have left tails. This widening effect also indicates the model deficit. Observing the parameters θ_s and l, the averaged posteriors for τ = 5 suggest that these two model parameters could also assume higher values. Although the shapes of the averaged posteriors do not change as significantly as in the didactic case C2 (from normal to bimodal), we are convinced that this bimodal shape will appear as long as a modeler has a long enough time series in which an unresolved process (here: due to another parameterization) repeats regularly. This can especially be observed in the slight shape change of the averaged posteriors of the parameter α for τ = 5, 15, 20.

Predictive Coverage Achieved by Averaged Distributions
To demonstrate how our method optimizes predictive coverage, we summarize the 95% credible intervals of the averaged posterior predictive distributions for four τ values in Figure 12.
Calibrating the model on the full calibration data set (here: 150 days), that is, using the conventional Bayesian calibration scheme, yields the narrowest interval. When the model can correctly describe the system, this narrow posterior is beneficial. Yet, when the model cannot accurately resolve the system, this narrow posterior fails to cover the data points that reflect system state 2 (days 161-170).
Similar to the didactic case study, we see a clear narrowing effect with increasing values of τ. The credible intervals for τ = 15 and 20 become wider, but still do not cover all data points. The predictive posterior for τ = 5 shows the largest uncertainty intervals, but may already be too wide.

Optimal Calibration Window Size
Based on a visual inspection of Figure 12, we would guess that τ around 20 days should yield an optimal predictive coverage.
To find the true optimum, we perform a complete enumeration of window sizes from 1 to 20 days and additionally evaluate window sizes of 30, 50, 70, …, up to 150 days. For each window size, we determine the PLS of the averaged PDF for the validation period. Figure 13 shows the score as a function of the calibration window size τ. We find that the true optimum actually lies at a shorter window size of 14 days. Hence, this is the optimal balance between "zooming in" on the error periods (10 days) and "zooming out" to include as much information for calibration as possible.
To better understand the optimization result, we analyze the averaged predictive PDFs for individual validation data points (Figure S4 in Supporting Information S1). As expected, a longer window scores higher probability density values for most data points, because the error-period fraction in the calibration data set is rather small. Thus, for state 1 during validation, a strong calibration effect from the perfect model is observed. However, validation data points within state 2 then lie in the tails of the narrow predictive distributions. They would prefer shorter calibration window sizes (e.g., data point 170 is more plausible given a window length of 5 days instead of 15 or 20). The PLS then identifies the best compromise τ that scores acceptably over the whole validation data set, no matter which system state.

Predictive Coverage Achieved by Averaged Distributions
This section discusses the results of our method when tested on real field data. The transitional and averaged parameter posteriors for each model parameter are presented in the Supporting Information. Figure 14 compares the predictive posterior credible intervals evaluated with four calibration window lengths τ = 5, 15, 20, 150.
As expected, the posterior intervals obtained with the full calibration window length τ = 150 days (conventional Bayesian approach) are very narrow and fail to cover more than 50% of the observed data points. When using a shorter window, for example, τ = 20, we see an improvement over the overconfident predictive interval, but some data points are still not covered. At τ = 15, the predictive posterior nicely captures the dynamics during days 193 and 195, and all data points lie within the predictive posterior.
The predictive posterior for τ = 5 shows the widest uncertainty interval. From a rough visual inspection, τ = 5 seems too short, because the corresponding credible interval appears unnecessarily large. Up to this point, it seems plausible to conclude that τ = 15 could be the optimal window length.

Optimal Calibration Window Size
To find the formally optimal window length, we perform a full analysis of the PLS for all window lengths and present the result in Figure 15. Although τ = 15 looked like the optimal window length, the PLS shows a clear optimum at τ = 7, only slightly larger than 5.
To confirm this result, we further investigate the predictive posteriors of each validation point (Figure S12 in Supporting Information S1). We observe that the peak for τ = 5 at day 154 starts to separate from those of the two larger values of τ. These subplots show that the observations between day 158 and day 165 can be better predicted with smaller τ; that is, calibration with τ = 5 achieves significantly higher scores at these validation points.
Recalling the results of the synthetic data set case, most of the inferred predictions for different τ values (Figure S12 in Supporting Information S1) have similar shapes (identical MAP estimates). Interestingly, in the real data case, we observe that calibration with τ = 5 leads to different shapes of the predictive posteriors between day 158 and 165 (Figure S12 in Supporting Information S1). During these days, the predictions obtained with smaller window lengths (e.g., τ = 5) clearly fit better and are less biased, which explains why the optimal τ turns out to be 7 instead of a larger value.

Summary and Conclusions
This research proposes an approach to address the problem of overconfident and biased posterior distributions resulting from conventional Bayesian updating in the presence of imperfect models and long, informative calibration time series. We wish to have more confidence that the predictive intervals produced by our models are statistically appropriate, which is not to be mistaken for more confident predictions: in this setting, more confidence can mean larger predictive uncertainty. Our proposed solution relies on Bayesian calibration with short, sliding time windows whose length (i.e., number of covered data points) is specified by τ. Then, our method combines the window-wise (so-called transitional) posteriors into one wider distribution via PDF (probability density function) averaging.
Our method can be seen as a mixture model that mixes situation-specific posteriors into a combined distribution.
For the maximum value of τ (i.e., the length of the available calibration data set), there is only a single PDF and no mixture, so one arrives back at conventional Bayesian inference. For smaller values of τ, the individual transitional PDFs become wider and more situation-specific. These transitional PDFs are mixed in proportion to their frequency of occurrence, so τ is the only parameter governing the mixture. This allows us to formally optimize the window size τ such that the best possible predictive coverage is achieved.
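Written out in our notation (the paper's own equation numbering is not reproduced here), the equal-weight mixture over the K = T − τ + 1 sliding windows reads:

```latex
\bar{p}_\tau(\theta \mid D) \;=\; \frac{1}{K} \sum_{k=1}^{K} p\!\left(\theta \mid D_k\right),
\qquad D_k = \{d_k, \dots, d_{k+\tau-1}\}, \qquad K = T - \tau + 1 .
```

For τ = T we have K = 1 and the average reduces to the conventional posterior, while for τ = 5 and T = 160 it mixes K = 156 transitional posteriors, consistent with the didactic example.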
Our method can also be compared to specific versions of "particle filters" that avoid filter collapse, in which the particles (analogous to realizations) forget their likelihood weights after some time. The difference is that our motivation is not to technically avoid a filter collapse (due to numerical implementation issues), but to extend the Bayesian idea in order to avoid the theoretical collapse of posteriors (due to an improper application frame of the Bayesian formalism if the model is flawed). A further application of the averaged posterior could be in data assimilation/filtering methods: with the "currently most applicable time-windowed PDF" from the ensemble of previously seen time windows, a data assimilation/filtering method may mitigate the problem of filter collapse.
We explained the method and its effects in a didactic example: given a flawed model, the predictive posterior becomes overconfident and fails to encompass all validation data points when the analysis is performed in the conventional Bayesian manner (i.e., conditioned on the complete time series). The predictive posteriors obtained via PDF averaging over shorter window sizes capture the future data points much better. We use the PLS to assess the predictive performance of the averaged PDF in validation and use it as the optimization target for maximizing predictive coverage. We explain how this score evaluates both the uncertainty coverage and the goodness of fit of the predictive posterior.
The didactic example also shows that the interpretation of the averaged distribution is more insightful than the posterior of the conventional Bayesian approach. The latter can lead to parameter PDFs that are not physically meaningful or interpretable, whereas our approach reveals potentially multi-modal PDFs with peaks at the true parameter values. As a side effect, the modeler can learn about the sources of model deficits, at least when there is a parameter set that can compensate for the deficit by varying its values.
We also demonstrated our approach on a soil model case study. In two cases, with synthetic model-generated and with real-world field data, we demonstrate the strength of our method even when time-dependent model errors exist: with our approach, the predictive posterior becomes more realistic and captures all the validation data points. In the case with field data, the predictive score reveals an optimal calibration window length of τ_opt = 7 days.
Analyzing the predictive posteriors of individual validation points, we discovered clearly distinct shapes in the case of τ = 5. This indicates that, with a smaller calibration window, we obtain more specific information for better predictions.
The Tau-averaging method is very beneficial for hydrological models because the Bayesian method is typically applied to long time series of, for example, discharge, groundwater heads, or soil moisture values. The asymptotic behavior of Bayes estimators (narrower credible intervals with increasing data length) becomes a disadvantage if the model contains significant model errors. Calibration with long time series then potentially leads to biased results and parameter compensation. By conditioning on data within time windows, a modeler avoids these drawbacks, while with our proposed PDF averaging, all the information contained in the (long) time series is still appropriately considered.
One great advantage of this method is that it is not tailored to one specific type of data or purpose, as long as the system states and initial conditions are observable. Our method builds on a rigorous Bayesian foundation and can be applied with any likelihood definition to any type of dynamic system. The calibration window length can be chosen based on expert knowledge (e.g., the characteristic time scale of particular processes) or based on formal, data-driven optimization toward a predictive skill score, as shown here. Also, all involved metrics and the optimization target can be designed specifically for various calibration purposes. There are also alternative strategies for the PDF averaging, which may reflect different degrees of risk aversion that a modeler wants to implement, or different emphasis on specific time periods.
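The data-driven choice of the window length amounts to a one-dimensional search over candidate values of τ, scoring each by the predictive skill of the resulting averaged PDF. The sketch below is illustrative only (the helper `choose_tau` and the Gaussian stand-ins for the transitional predictive PDFs are our assumptions, with a mean-log-density score in the spirit of the PLS):

```python
import numpy as np
from scipy.stats import gaussian_kde

def choose_tau(y_cal, y_valid, taus, sigma=0.05, n_samp=200):
    """Illustrative data-driven tau selection for the toy model y_t = a.

    For each candidate window length, pool Gaussian stand-ins for the
    transitional predictive PDFs of all sliding windows, then score the
    validation data by their mean log density under the pooled mixture.
    Returns the best tau and the score per candidate.
    """
    rng = np.random.default_rng(2)
    scores = {}
    for tau in taus:
        pooled = np.concatenate([
            rng.normal(np.mean(y_cal[s:s + tau]), sigma, n_samp)
            for s in range(len(y_cal) - tau + 1)
        ])
        dens = gaussian_kde(pooled)(np.atleast_1d(y_valid))
        scores[tau] = float(np.mean(np.log(dens + 1e-300)))
    return max(scores, key=scores.get), scores
```

For a series with a regime shift, a short window wins: its pooled predictive PDF keeps density at both regimes, so validation points from either state score well, while the full-series window concentrates on an in-between value.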
Overall, our proposed method will prove useful for any type of dynamic modeling task where the model is imperfect, the available data sets are rather large, and honest coverage of prediction intervals is of importance. In addition, the method aids in the diagnosis and interpretation of model structural errors, which was the initial motivation for developing the sliding-window Bayesian analysis (Hsueh et al., 2022) that served as the methodological foundation for this contribution.

Figure 1. Illustration of PDF averaging with the proposed Tau method.

Figure 4. Didactic example, case C2: Transitional parameter posteriors (light blue PDFs) for four different window lengths. The thick gray curves present the averaged PDF (over the whole time series). The parameter value at state 1 is set to 0.2, marked with the green line, and the parameter value at state 2 is set to 0.7, marked with the red line.

Figure 5. Didactic example, case C1: Transitional parameter posteriors (light blue PDFs) for four different window lengths. The thick gray dotted curves present the averaged PDF (over the whole time series). The parameter value at state 1 is set to 0.2.

Figure 6. Posteriors of both the perfect (case C1) and imperfect (case C2) models. Predictive posterior at validation point t = 161 for case C1 (a) and case C2 (c). Predictive posterior (95% interval) obtained from the averaged PDF at all validation points for case C1 (b) and case C2 (d).

Figure 7. Predictive log-score obtained by reading the data probability from the posterior predictive PDF. Averaged posterior predictive distributions obtained with calibration window lengths τ = 1, 5, 15, 20, 160 show that shorter windows have the chance to reveal multi-modal posteriors, while calibration with longer windows first smears out these multiple peaks and then narrows the distribution. As illustrated, an optimal window length τ_opt exists where the calibrated result is most plausible, that is, which produces the highest predictive PDF value (y-axis).

Figure 8. Predictive log-score (PLS) of calibration with varying calibration window length τ. The maximum τ is 160 time steps, which is the total length of the calibration data set. Light blue curve: PLS of case C1 (perfect model). Dark curve: PLS of case C2 (erroneous model).

Figure 12. Synthetic data set and 95% credible interval of the posterior predictive distribution for the validation period, obtained with calibration window lengths of τ = 5, 15, 20, and 150 days.

Figure 13. Predictive log-score of the averaged posterior predictive distributions in validation as a function of calibration window length τ (synthetic data scenario).

Figure 14. Real data set and 95% credible interval of the averaged posterior predictive distribution for the validation period, obtained with calibration window lengths of τ = 5, 15, 20, and 150 data points.

Figure 15. Predictive log-score of the averaged posterior predictive distributions in validation as a function of calibration window length τ (real data).

Table 1
Bounds for the Uniform Prior of the Mualem-van Genuchten (MVG) Model Parameters

Table 3
Dual Porosity Model Parameter Values Applied to Perturb the Base-Case Data Set During Error Periods in the Case of Model Structural Error (Case 1); Parameters for the First Pore Subsystem Are Identical to Those Given in Table 2 (Base Case), With w1 = 1 − w2

Averaged PDF (dark curve) and the transitional posteriors (light blue curves) with different window lengths (τ = 5, 15, 20, and 150 days) of the field study with synthetic observed data. For each subplot, the x and y axes refer to model parameters and PDF, respectively.