## 1. Introduction

Forecasts of hydrologic variables are subject to uncertainty due to errors introduced into the modeling chain via imperfect initial and boundary conditions, poor model resolution, and the necessary simplification of physical process representation in the model [e.g., *Palmer et al*., 2005; *Bourdin et al*., 2012]. Deterministic forecasts of streamflow ignore these errors and may provide forecast users with a false impression of certainty. Probabilistic forecasts expressed as probability distributions are a way of quantifying uncertainty by indicating the likelihood of occurrence of a range of forecast values. Additionally, probabilistic inflow forecasts enable water resource managers to set risk-based criteria for decision making and offer potential economic benefits [*Krzysztofowicz*, 2001].

Ensemble forecasting techniques are designed to sample the range of uncertainty in forecasts. However, in both weather and hydrologic forecasting applications, ensembles are often found to be underdispersive and therefore unreliable [e.g., *Eckel and Walters*, 1998; *Buizza*, 1997; *Wilson et al*., 2007; *Olsson and Lindström*, 2008; *Wood and Schaake*, 2008; *Bourdin and Stull*, 2013]. In order to correct these deficiencies, uncertainty models can be used to fit a probability density function (PDF) to the ensemble, whereby the parameters of the distribution are estimated based on statistical properties of both the ensemble and past verifying observations. These theoretical distributions can potentially reduce the amount of data required to characterize the distribution (e.g., from 72 ensemble members to two parameters describing the mean and spread of a Gaussian distribution), and allow estimation of probabilities for future rare events that lie outside the range of observed behavior [*Wilks*, 2006].

A variety of different uncertainty models are available for generating probabilistic forecasts from ensembles. The simplest method is the binned probability ensemble (BPE), which makes the assumption that each ensemble member and the verifying observation are drawn from the same (unknown) probability distribution [*Anderson*, 1996]. Given an ensemble of *K* members, the verifying observation therefore falls between any two consecutive ranked ensemble members, or outside their predicted range, with equal probability (*K* + 1)^{−1}.
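As an illustration only (not part of the original formulation), the BPE forecast distribution can be sketched as follows; the function name and ensemble values are hypothetical:

```python
from bisect import bisect_right

def bpe_cdf(ensemble, x):
    """Binned probability ensemble [Anderson, 1996]: with K ranked
    members, each of the K + 1 bins (including the two open-ended
    ones) carries probability 1/(K + 1), so the forecast CDF at x
    is the number of members at or below x divided by K + 1."""
    members = sorted(ensemble)
    return bisect_right(members, x) / (len(members) + 1)

# Three members define four equally likely bins of probability 1/4 each.
print(bpe_cdf([10.0, 20.0, 30.0], 15.0))  # 0.25
```

Note that the BPE assigns probability (*K* + 1)^{−1} to each open-ended tail as well, so even values outside the ensemble range receive nonzero probability.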

Alternatively, it can be assumed that verifying observations are drawn from a normal distribution centered on the ensemble mean (or, equivalently, that the ensemble mean forecast errors are normally distributed). In this Gaussian uncertainty model, distributional spread can be given by the variance of the ensemble members, implicitly assuming the existence of a spread-skill relationship. That is, the spread of the ensemble members should be related to the accuracy (or skill) of the ensemble mean; when the forecast is more certain, as indicated by low ensemble spread, errors are expected to be small. However, this relationship is often tenuous [e.g., *Hamill and Colucci*, 1998; *Stensrud et al*., 1999; *Grimit and Mass*, 2002]. For variables with nonnormally distributed forecast errors, such as precipitation and streamflow, the Gamma distribution has also been applied in uncertainty modeling frameworks [e.g., *Hamill and Colucci*, 1998; *Sloughter et al*., 2007; *Vrugt et al*., 2008].
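A minimal sketch of the Gaussian uncertainty model (illustrative only; the function name, ensemble, and threshold are hypothetical) centers a normal distribution on the ensemble mean, takes its variance from the ensemble spread, and reads off an exceedance probability:

```python
import math

def gaussian_exceedance(ensemble, threshold):
    """Gaussian uncertainty model: a normal distribution centered on
    the ensemble mean, with spread given by the ensemble variance.
    Returns P(verifying value > threshold)."""
    k = len(ensemble)
    mean = sum(ensemble) / k
    var = sum((m - mean) ** 2 for m in ensemble) / (k - 1)  # sample variance
    z = (threshold - mean) / math.sqrt(var)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF via erf
    return 1.0 - cdf

# Symmetric ensemble: half the probability mass lies above the mean.
print(gaussian_exceedance([8.0, 10.0, 12.0], 10.0))  # 0.5
```

Using the raw ensemble variance as the distributional spread is exactly the implicit spread-skill assumption described above; when that relationship is weak, the spread parameter is instead often estimated from past forecast errors.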

The Bayesian model averaging (BMA) uncertainty model assigns a probability distribution to each ensemble member, assuming the verifying observation to be drawn from one of these [*Raftery et al*., 2005]. The forecast distribution is taken to be a weighted average of these distributions, where weights are based on past performance of individual ensemble members.
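The BMA mixture can be sketched as follows (illustrative only: Gaussian member kernels with a single common spread are assumed for simplicity, and the weights are taken as given, whereas *Raftery et al*. [2005] estimate weights and spread from a training period):

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bma_cdf(members, weights, sigma, x):
    """BMA forecast CDF: a weighted mixture of per-member distributions
    (Gaussian kernels here), with weights reflecting the past
    performance of the individual ensemble members."""
    total = sum(weights)  # normalize so the weights sum to one
    return sum((w / total) * normal_cdf(x, m, sigma)
               for m, w in zip(members, weights))

# Two members of equal past skill: the mixture CDF at their midpoint
# is 0.5 by symmetry.
print(bma_cdf([10.0, 20.0], [0.5, 0.5], 2.0, 15.0))
```

A member with a history of poor forecasts receives a small weight and contributes little to the forecast distribution, which is how BMA guards against overconfident or redundant members.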

In contrast with uncertainty models that fit distributions to ensembles, there exist sophisticated models that can be used to produce probabilistic forecasts from an individual hydrologic model. Such methods are commonly based on a sampling of the model's parameter uncertainty space. The generalized likelihood uncertainty estimation (GLUE) method is conceptually simple, easy to implement, and can handle a range of different error structures [*Kuczera and Parent*, 1998; *Blasone et al*., 2008]. Unlike GLUE, Bayesian recursive estimation (BaRE) makes strong, explicit assumptions about error characteristics [*Thiemann et al*., 2001]. The formal generalized likelihood function of *Schoups and Vrugt* [2010] builds on previous approaches, extending their applicability to situations where errors are correlated, heteroscedastic, and non-Gaussian, and resulting in improved forecast reliability.

If the assumptions made by the uncertainty model regarding error characteristics are valid, the resulting probability forecasts should be statistically reliable or *calibrated*, meaning that an event forecasted to occur with probability *p* will, over the course of many such forecasts, be observed a fraction *p* of the time [*Murphy*, 1973]. Otherwise, the probabilistic forecasts cannot be used for risk-based decision making, since the probabilities cannot be taken at face value.

Various methods of statistical calibration have been devised to correct for deficiencies in probabilistic forecasts. These can generally be split into two groups: ensemble calibration, which adjusts individual ensemble members in order to produce reliable forecasts; and probability calibration, which adjusts the probabilities (derived from an uncertainty model) directly. The BMA uncertainty model, as presented by *Raftery et al*. [2005], is an example of ensemble calibration, as it was developed specifically to produce sharp, calibrated probability forecasts by refining the spread parameters of the individual member distributions such that the continuous ranked probability score (CRPS) is minimized over a training period. Generalizations of BMA have also been developed for this purpose [e.g., *Johnson and Swinbank*, 2009].

The weighted ranks method [*Hamill and Colucci*, 1997] and its generalization, the Probability Integral Transform (PIT)-based calibration scheme of *Nipen and Stull* [2011], are examples of probability calibration that have been shown to improve the reliability and value of forecasts of precipitation, temperature, wind speed, and other meteorological variables. *Nipen and Stull* [2011] also demonstrated that their method was able to improve probabilistic forecasts generated using BMA when those forecasts were unreliable. Bayesian ensemble calibration methods have been applied successfully in hydrologic forecasting applications over a range of time scales [e.g., *Duan et al*., 2007; *Reggiani et al*., 2009; *Wang et al*., 2009; *Parrish et al*., 2012]. Probability calibration, on the other hand, has not yet been widely adopted by the hydrologic modeling community. *Olsson and Lindström* [2008] provide an example of a very simple probability calibration used to improve ensemble spread. *Roulin* [2007] applied the weighted ranks method to medium-range forecasts of streamflow and found very little improvement to the already reliable forecasting system. Quantile mapping (QM) is a similar probability calibration technique, but is suited to seasonal hydrologic forecasting, as it maps forecast probabilities to their corresponding climatological values [*Hashino et al*., 2007; *Madadgar et al*., 2012].
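The core idea of PIT-based probability calibration can be sketched in a few lines (a simplified illustration in the spirit of *Nipen and Stull* [2011], who use a smoothed calibration curve; the function name and PIT values are hypothetical): a raw forecast probability is replaced by the empirical frequency with which past PIT values fell at or below it.

```python
def pit_calibrate(past_pit, raw_prob):
    """Probability calibration: map a raw forecast CDF value to the
    fraction of past PIT values (forecast CDF evaluated at the
    verifying observation) at or below it, so that calibrated
    probabilities can be taken at face value."""
    return sum(1 for u in past_pit if u <= raw_prob) / len(past_pit)

# An underdispersive system piles its PIT values near 0 and 1, so
# raw probabilities in the tails are stretched toward the middle.
past = [0.02, 0.05, 0.10, 0.90, 0.95, 0.98]
print(pit_calibrate(past, 0.10))  # 3/6 = 0.5
```

For a system that is already reliable, the past PIT values are uniformly distributed and this mapping is close to the identity, consistent with the small gains *Roulin* [2007] found for an already reliable system.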

In this paper, we present a generalized methodology for producing probabilistic forecasts of reservoir inflow from an ensemble of deterministic forecasts. Prior to combination, each ensemble member is individually bias corrected using a simple degree-of-mass-balance scheme. An intelligent probability calibration scheme is employed to improve the reliability of the probability forecasts when necessary. The methods are applied to a 72-member ensemble and tested over a period of two water years.