This paper introduces a methodology for the construction of probabilistic inflow forecasts for multiple catchments and lead times. A postprocessing approach is used, and a Gaussian model is applied for transformed variables. In operational situations, it is a straightforward task to use the models to sample inflow ensembles which inherit the dependencies between catchments and lead times. The methodology was tested and demonstrated in the river systems linked to the Ulla-Førre hydropower complex in southern Norway, where simultaneous probabilistic forecasts for five catchments and ten lead times were constructed. The methodology exhibits sufficient flexibility to utilize deterministic flow forecasts from a numerical hydrological model as well as statistical forecasts such as persistent forecasts and sliding window climatology forecasts. It also deals with variation in the relative weights of these forecasts with both catchment and lead time. When evaluating predictive performance in original space using cross-validation, the case study found that it is important to include the persistent forecast for the initial lead times and the hydrological forecast for medium-term lead times. Sliding window climatology forecasts become more important for the latest lead times. Furthermore, operationally important features in this case study, such as heteroscedasticity and between-lead time and between-catchment dependencies that vary with lead time, are captured.
 The work described herein is motivated by the needs of the short-term optimization methodology for hydropower production. Stochastic optimization, i.e., optimizing when accounting for uncertainties in available resources (water) and utility (prices) in the future, is about to be introduced. This uncertainty must be quantified, and in this paper, we focus on water resources availability expressed as inflow forecasts. In the following, we use the more general term flow since the methodology can be applied to any flow forecast. When optimizing a water system, all catchments and several lead times must be considered simultaneously. Depending on the hydropower system in question, we may be dealing with a set of headwater catchments, or a system of upstream-downstream reservoirs in which water from one catchment arrives perhaps days later in a lower catchment, or a combination of both. Thus, there is a need to construct a multivariate predictive distribution for the forecasts which accounts for between-catchment and between-lead time dependencies. Furthermore, short-term stochastic optimization methods require ensembles, i.e., samples of flow forecasts, as input. Based on the ensembles, the optimization methods build scenario trees for future flows. An overview of scenario tree algorithms is given in Dupacova et al.  and a recent development is described in Rasol et al. . It is therefore desirable to have probabilistic forecasts that can be sampled from.
 Traditionally numerical physically based hydrological models are used to provide deterministic forecasts of flow, and we refer to them as hydrological forecasts. The driving forces of these dynamic models are precipitation (water in) and temperature (evaporation, snow accumulation, and snow melt are temperature-driven processes). The hydrological forecast is found by running the numerical hydrological model using observed temperature and precipitation values as input to obtain initial states, whereas precipitation and temperature forecasts are used as inputs during the forecasting period. This gives a deterministic flow forecast which contains errors for several reasons [Refsgaard and Storm, 1996]. These include errors in input data, especially in precipitation and temperature forecasts (e.g., catch deficit and interpolation of precipitation volumes); errors in internal states when the forecast starts (e.g., too much snow); errors in model parameters (e.g., the conductivity parameter is overestimated); errors in model structure (e.g., the absence of important processes); and errors in data used for calibration (e.g., errors in the rating curve).
 For almost three decades, the estimation of uncertainties in hydrological modeling in general has been a major field of interest [e.g., Kuczera, 1983; Beven and Binley, 1992; Yapo et al., 1998; Thiemann et al., 2001; Kavetski et al., 2006]. However, only a few of these studies have focused in particular on uncertainties linked to forecasts [e.g., Krzysztofowicz, 2002, 1999; Todini, 2008; Montanari and Grossi, 2008; Reggiani and Weerts, 2008]. For recent publications, see Cloke et al.  and references therein. Probabilistic forecasts may be either purely statistical or based on a hydrological forecast. An example of a pure statistical model is the climatology or persistent forecast. In order to benefit from our understanding of hydrological processes as well as our observations, it is appealing to base probabilistic forecasts on both hydrological and statistical forecasts.
 Conceptually, there are two ways of using a hydrological model to make probabilistic forecasts [e.g., Renard et al., 2010]; endogenous and exogenous (pre/postprocessing) methods. Whereas endogenous methods are based on a physical model and aim to make parts of this model stochastic, exogenous methods involve the construction of an uncertainty model for the deterministic forecast based on the joint distribution of forecast and observations. Endogenous methods can be used to make the internal states of the hydrological model stochastic, e.g., using an Ensemble Kalman Filter [e.g., Moradkhani et al., 2005], or make the model parameters stochastic [e.g., Reichert and Mieleitner, 2009], or replace the hydrological model with a probability density function describing precipitation-runoff processes [e.g., Bulygina and Gupta, 2009], or to produce seasonal flow forecasts [e.g., Wang et al., 2009]. Exogenous methods keep the numerical hydrological model deterministic. Pure postprocessing methods build a stochastic model for flow forecasts from either one [e.g., Krzysztofowicz, 1999; Montanari and Grossi, 2008; Weerts et al., 2011; Brown and Seo, 2013], or several hydrological models using either Bayesian Model Averaging [e.g., Vrugt et al., 2007] or the Model Conditional Processor [Todini, 2008; Coccia and Todini, 2011]. Most postprocessing methods aim to build a model which provides a predictive distribution of the flow forecast. The major challenges associated with this approach are linked to accounting for temporal dependencies, specification of the distribution, and accounting for heteroscedasticity of forecast errors. An attractive solution to the last two issues is to apply quantile regression [Koenker, 2005] to directly estimate the flow forecast quantiles [Weerts et al., 2011]. However, this method requires many parameters, since a unique regression equation has to be established for each quantile. 
Preprocessing might be included in the form of a stochastic model for weather forecasts [e.g., Krzysztofowicz, 2002; Todini, 2008], or probabilistic calibrated ensemble weather forecasts may be used [e.g., Cloke and Pappenberger, 2009; Marty et al., 2013]. Most of the postprocessing studies listed above focus on forecasting lead times of up to a few days for one site. Nevertheless, probabilistic methods for the seasonal forecasting of streamflow [e.g., Wang et al., 2009, 2011] and multisite forecasting [e.g., Reggiani and Weerts, 2008; Wang et al., 2009] have also been developed.
 The aim of this work was to construct reliable and sharp joint predictive distributions for flow forecasts for several lead times in a system of catchments. The models introduced here can be seen as extensions of the single-catchment one lead time model introduced by Engeland et al. . These extensions are based on two working hypotheses:
 1. The sliding window climatology forecast, the persistent forecast, and the hydrological forecast all contain predictive information about the future flow, and the relative importance of these forecasts varies with lead time.
 2. The between-catchment and between-lead time residual dependencies vary with lead time.
 The probabilistic forecast models proposed here can utilize information from the three operationally available forecasts (the hydrological, the sliding window climatology, and the persistent). Furthermore, our models account for the dependencies in forecast errors both between catchments and between lead times. This information is important for how joint management of several hydropower reservoirs may be optimized for durations longer than 1 day. The working hypotheses and the predictive performance of the proposed models were tested for five catchments and ten lead times on a headwater system that forms part of the Ulla-Førre hydropower complex in southern Norway. The predictive performance of the models was evaluated according to the reliability (correctness of the distribution), sharpness (width of the forecast intervals), and efficiency (how well the median values fitted with observed flows). The unique contribution of this paper is to provide an extension of the statistical model of Engeland et al.  to incorporate several lead times and catchments, and to use the energy score (ES) [Gneiting et al., 2008] as a measure of the quality of a multivariate probabilistic flow forecast.
2 The Study Area, Data, and Hydrological Model
2.1 The Study Area
 In this study, we considered the river systems linked to the Ulla-Førre hydropower complex in south-western Norway (Figure 1). This area has one of Norway's largest hydropower reservoirs, Blåsjø (3.1 km3). The complex consists of a system of several hydropower plants with an average annual production of 4.5 TWh. The region has five catchments: Lauvastøl, Kvilldal, Saurdal, Stølsdal, and Osali (see Figure 1). Osali and Lauvastøl are natural catchments, whereas Stølsdal, Saurdal, and Kvilldal are heavily modified catchments with reservoirs, creek intakes, and water tunnels. The Kvilldal catchment accumulates water from the four other catchments.
 The region is mainly at high altitudes with a long winter season and a major melt period in May and June. The climate is characterized by a seasonal variation in temperature with February (−4.9°C) as the coldest and July (11.2°C) as the warmest months, respectively. These values are averaged temperatures for the entire catchment area upstream Kvilldal. Precipitation also varies seasonally, with the lowest values in June (111 mm) and the highest in November (481 mm). Table 1 lists some general characteristics of the 5 catchments. The elevation is based on a digital elevation model with resolution of 1 × 1 km2.
Table 1. Area, Average Elevation (masl), Correction Factors for Interpolated Precipitation, and Performance Measured as Reff for the Five Catchments in the Ulla-Førre System
2.2 Hydrological and Meteorological Data and Forecasts
 Streamflow data for the Osali and Lauvastøl catchments were obtained from The Norwegian Water Resources and Energy Directorate, whereas naturalized flows were provided by the hydropower operator Statkraft for Stølsdal, Saurdal, and Kvilldal. Temperature and precipitation data were provided by the Norwegian Meteorological Institute and Statkraft. Precipitation data were adjusted for catch deficit due to wind loss according to the method of Førland et al. . Deterministic forecasts for 2 m temperature and total precipitation were obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF). We used the operational forecast, which has a resolution of 0.5° × 0.5° for the period 1 September 2005 to 2 February 2006 and 0.25° × 0.25° for the period 3 February 2006 to 31 August 2009. The forecasts start at 00:00 h GMT, and are given for every 6th hour with up to 10 days lead time.
2.3 Hydrological Forecasts
 The hydrological model was based on daily time steps and was distributed on a 1 × 1 km grid. It employed soil moisture and response routines from the Hydrologiska Byråns Vattenbalansavdelning (HBV) model [Bergström et al., 1992], combined with snow depletion curve and energy balance routines for snow modeling [Kolberg et al., 2006].
 The model was calibrated using observation data from the period 1 August 2001 to 31 August 2005. The period 1 August 2000 to 31 July 2001 was used as a “spin-up” period to mitigate the effects of initialization. The Nash-Sutcliffe coefficient Reff was used as the objective function, and the resulting calibration values are given in Table 1. For Osali, Stølsdal, and Lauvastøl, the Reff values obtained indicate moderate model performance.
 The calibrated hydrological model was used to generate forecasts for lead times l = 1–10 days for each day t and catchment i during the period 1 September 2005 to 31 August 2009. The issue time is given by t, and the forecast valid time by t + l.
2.4 Sliding Window Climatology and Persistent Forecasts
 The sliding window climatology was established for each catchment i. It depends on the day of year of the time τ and reflects usual flow conditions at the time in question. It was constructed on the basis of observations during the period 1987–1994 by assigning each day of the year the median value for a 15 day window centered on the day of interest. For example, the sliding window climatology for 8 February is the median of inflow observations taken between 1 and 15 February during the period 1987–2004.
 The persistent forecast represents the last observations at the issue time t of the forecast and states that the flow will remain constant during the forecasting period and that it does not depend on lead time.
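 The two statistical forecasts described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the data layout (`obs_by_year`, a dictionary of 365 daily flows per year), the function names, the wrap-around at the turn of the year, and the omission of leap days are all assumptions made for the example.

```python
from statistics import median


def sliding_window_climatology(obs_by_year, window=15):
    """Median flow per day of year, pooled over all years and a centered window.

    obs_by_year: dict mapping year -> list of 365 daily flows (hypothetical layout).
    """
    half = window // 2
    n_days = 365
    climatology = []
    for day in range(n_days):
        pooled = []
        for flows in obs_by_year.values():
            for offset in range(-half, half + 1):
                # Assumption: the window wraps around the turn of the year.
                pooled.append(flows[(day + offset) % n_days])
        climatology.append(median(pooled))
    return climatology


def persistent_forecast(last_observation, n_lead=10):
    """The last observed flow, repeated for every lead time."""
    return [last_observation] * n_lead
```

 With a 15 day window, the value for 8 February pools the observations from 1 to 15 February of every year in the record, which matches the example given in the text.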
2.5 Exploratory Analysis of Dependencies
 To obtain an initial check on working hypothesis 2 (in which we state that the between-catchment and between-lead time residual dependencies vary with lead time) and to establish a basis for the formulation of the postprocessing model, we performed an exploratory analysis of the error dependencies of the hydrological forecast. Figure 2a shows the forecast error correlations between the catchments for each of the 10 lead times. We see that the between-catchment correlations increase with increasing lead time. Furthermore, Figure 2b shows that the correlation between errors of successive lead times increases with lead time for all catchments. Postprocessing models should thus have the flexibility to permit the correlations between catchments and succeeding lead times to be dependent on both catchment and lead time.
3 Probabilistic Flow Forecast Modeling Framework
 In this section, we use Gaussian models to set up a framework for probabilistic forecasts based on deterministic flow forecasts for N catchments and L lead times. There are two issues associated with using Gaussian models for flow forecasts: (1) flow data are generally not Gaussian, but exhibit a heavier right tail, and (2) flow forecast errors generally increase with the magnitude of the predictions, i.e., they are heteroscedastic. To take these issues into account, we transformed the flow forecasts and observations using the Box-Cox transformation (see Appendix Box-Cox Transformation), where q is the flow in original scale (m3/s), λ is a transformation parameter, and y is the transformed flow. A Shapiro-Wilk test was used to identify the λ parameter giving transformed flow observations that were the closest to a Gaussian distribution. Visual inspection of forecast errors was used to evaluate the homoscedasticity of the forecast errors in the transformed space. The (transformed) flow observation of catchment i at time τ is denoted $y_{\tau}^{i}$. In this presentation, we use the notation $Y_{t,l}^{i}$ for the probabilistic (transformed) flow forecast issued at time t for catchment i and lead time l. This forecast is valid for time t + l and should be compared to observation $y_{t+l}^{i}$. Furthermore, we use 1:L to denote a vector of all lead times, and 1:N for all catchments. For example, we denote the vector of random variables for the forecast issued at time t for lead time l and all catchments by $Y_{t,l}^{1:N}$.
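 As an illustration of the transformation step, the following Python sketch implements the standard one-parameter Box-Cox transformation and its inverse, and shows how a forecast interval of constant width in transformed space becomes wider in original space as the median flow increases. The function names and the numbers used are assumptions for the example; the exact form used in the study is the one given in the appendix.

```python
import math


def box_cox(q, lam):
    """Standard Box-Cox transform of a positive flow q."""
    if q <= 0:
        raise ValueError("flow must be positive")
    if lam == 0:
        return math.log(q)
    return (q ** lam - 1.0) / lam


def inv_box_cox(y, lam):
    """Inverse transform; assumes lam * y + 1 > 0 when lam != 0."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


def interval_width_original(median_q, sd_transformed, lam, z=1.96):
    """Width in original space of a central interval that is Gaussian in transformed space."""
    y = box_cox(median_q, lam)
    lower = inv_box_cox(y - z * sd_transformed, lam)
    upper = inv_box_cox(y + z * sd_transformed, lam)
    return upper - lower
```

 For example, with λ = 0.2 and a (hypothetical) standard deviation of 0.5 in transformed space, `interval_width_original(100.0, 0.5, 0.2)` is much larger than `interval_width_original(1.0, 0.5, 0.2)`, which is the kind of heteroscedasticity in original space that the transformation is meant to capture.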
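 The probabilistic forecast at issue time t for all catchments and all lead times was modeled as Gaussian with mean $\mu_{t,1:L}^{1:N}$ and covariance matrix Σ:
$$Y_{t,1:L}^{1:N} \sim \mathcal{N}\left(\mu_{t,1:L}^{1:N}, \Sigma\right) \qquad (1)$$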
3.1 Mean Model
 In line with our first working hypothesis, we modeled the mean using the three deterministic forecasts presented in section 'The Study Area, Data, and Hydrological Model': the sliding window climatology, the persistent, and the hydrological. These forecasts were used as covariates in the probabilistic forecast. We denote the sliding window climatology forecast valid for time t + l and catchment i by $c_{t+l}^{i}$, the persistent forecast issued at time t for catchment i by $p_{t}^{i}$, and the hydrological forecast issued at time t for lead time l and catchment i by $h_{t,l}^{i}$. The persistent forecast is the last available observation and independent of lead time, i.e., $p_{t}^{i} = y_{t}^{i}$. We modeled the mean as a linear combination of the three forecasts, where the coefficients differ between catchments and lead times:
$$\mu_{t,l}^{i} = \beta_{0,l}^{i} + \beta_{c,l}^{i}\, c_{t+l}^{i} + \beta_{p,l}^{i}\, p_{t}^{i} + \beta_{h,l}^{i}\, h_{t,l}^{i} \qquad (2)$$
where $\beta_{0,l}^{i}$ might be interpreted as a bias term, and $\beta_{c,l}^{i}$, $\beta_{p,l}^{i}$, and $\beta_{h,l}^{i}$ as weighting factors for their respective deterministic forecasts. From (2), we see that for a specific catchment and lead time the expected value varies from day to day because the deterministic forecasts/covariates also vary with issue time t. The relative influence of the forecasts/covariates differs between catchments and lead times because the coefficients differ between catchments (i) and lead times (l).
3.2 Dependency Model
 To obtain a probabilistic forecast for several catchments and lead times, we needed to develop a variance and dependency model as well. We used the Gaussian model in (1) with mean as defined in (2). Our aim was to model the dependency structure given by Σ. The easiest approach would be to assume independence between catchments and lead times. However, our second working hypothesis, which is supported by the exploratory analysis presented in section 'Exploratory Analysis of Dependencies', indicates that we need to model dependency because both between-catchment and between-lead time dependencies vary with lead time. For a system with N catchments and L lead times, Σ is an NL × NL matrix with NL(NL + 1)/2 distinct parameters. When N = 5 and L = 10, the covariance matrix Σ has 1275 distinct parameters. In order to be able to estimate the model parameters, we needed a realistic model with fewer parameters.
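 The parameter count quoted above follows from the symmetry of a covariance matrix: a d × d symmetric matrix has d(d + 1)/2 distinct entries, with d = NL. The short Python sketch below (function names are illustrative assumptions) checks the arithmetic and contrasts it with the number of variances left under full independence.

```python
def n_full_covariance_params(n_catchments, n_lead_times):
    """Distinct entries of a symmetric covariance matrix of dimension N * L."""
    d = n_catchments * n_lead_times
    return d * (d + 1) // 2


def n_independent_params(n_catchments, n_lead_times):
    """Only the marginal variances remain when all dependencies are dropped."""
    return n_catchments * n_lead_times
```

 For N = 5 and L = 10 this gives 1275 parameters for the unrestricted matrix against 50 variances for the independent case, which is why a structured dependency model between these two extremes is attractive.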
 For simplicity, we now suppress the forecast time t and catchment vector 1:N from the notation, and let $Y_l$ denote a vector of flow forecasts for all catchments for lead time l. Any joint probability density function (pdf) can be factorized as
$$f(Y_1, \ldots, Y_L) = f(Y_1) \prod_{l=2}^{L} f(Y_l \mid Y_{l-1}, \ldots, Y_1) \qquad (3)$$
and instead of modeling the joint pdf directly, we can model each factor in (3). We first assumed a Markov property between lead times, i.e., given the flows for the previous lead time, the flow for lead time l does not depend on flow at earlier lead times. This gives $f(Y_l \mid Y_{l-1}, \ldots, Y_1) = f(Y_l \mid Y_{l-1})$, and hence (3) simplifies to
$$f(Y_1, \ldots, Y_L) = f(Y_1) \prod_{l=2}^{L} f(Y_l \mid Y_{l-1}) \qquad (4)$$
 Our approach to the joint modeling involved modeling each factor of (4). Since the joint distribution is Gaussian, all marginal and conditional distributions are also Gaussian, and from standard results of multivariate statistics we know that the marginal distribution for lead time l = 1 is
$$Y_1 \sim \mathcal{N}(\mu_1, \Sigma_1) \qquad (5)$$
and the conditional distribution for all catchments at lead time l given the flow for all catchments at lead time l – 1 is
$$Y_l \mid Y_{l-1} \sim \mathcal{N}\left(\mu_{l \mid l-1}, \Sigma_{l \mid l-1}\right) \qquad (6)$$
where the conditional mean for $Y_l$ is given by
$$\mu_{l \mid l-1} = \mu_l + A_l \left(Y_{l-1} - \mu_{l-1}\right) \qquad (7)$$
where $\mu_l$ is the marginal mean for lead time l as described in section 'Mean Model' and $A_l$ is a matrix that describes the dependency between the error of the mean of our probabilistic forecast at lead time l – 1 and the flow at lead time l. For further simplification, we also assumed a Markov property between catchments for successive lead times. Given the flow for catchment i at lead time l – 1, there is no further information about the flow of catchment i at lead time l, $Y_l^i$, in the flow of the other catchments at lead time l – 1. Hence, $f(Y_l^i \mid Y_{l-1}) = f(Y_l^i \mid Y_{l-1}^i)$. This implies that the matrix $A_l$ in (7) is diagonal, and we can write the conditional mean for catchment i as:
$$\mu_{l \mid l-1}^{i} = \mu_l^{i} + a_l^{i} \left(Y_{l-1}^{i} - \mu_{l-1}^{i}\right) \qquad (8)$$
 The dependency model which we have built here has the flexibility to enable each catchment and each lead time to have its own marginal variance in the Box-Cox transformed space. Furthermore, the postprocessing model is heteroscedastic in the original flow space for a given lead time and catchment, while the probabilistic forecast variance depends on the predictive mean in original space. This is illustrated in Figure 3 where we see that the width of the 95% forecast interval varies with median forecast (and lead time), but that the forecast intervals are identical for identical lead time and median forecasts.
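 The Markov structure of the dependency model suggests a simple sequential way of drawing ensemble members in transformed space: sample lead time 1 from its marginal distribution, then sample each later lead time conditionally on the previous one. The Python sketch below illustrates this under stated assumptions. The input layout (`mu[l][i]` for marginal means, `a[l][i]` for per-catchment autoregression coefficients, `cov[l]` for the N × N covariance of lead time l given lead time l – 1, with `cov[0]` the marginal covariance of the first lead time) and all names are hypothetical, and the members would still need to be back-transformed to original scale.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return low


def sample_member(mu, a, cov, rng):
    """Draw one ensemble member (lead times x catchments) in transformed space."""
    n_lead, n_catch = len(mu), len(mu[0])
    member = []
    for l in range(n_lead):
        low = cholesky(cov[l])
        z = [rng.gauss(0.0, 1.0) for _ in range(n_catch)]
        # Correlated noise across catchments for this lead time.
        noise = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n_catch)]
        if l == 0:
            cond_mean = list(mu[0])
        else:
            # Diagonal dependency: each catchment depends only on its own previous error.
            cond_mean = [
                mu[l][i] + a[l][i] * (member[l - 1][i] - mu[l - 1][i])
                for i in range(n_catch)
            ]
        member.append([cond_mean[i] + noise[i] for i in range(n_catch)])
    return member
```

 Calling `sample_member` repeatedly with the same random generator gives an ensemble whose members carry both the between-catchment dependency (through `cov[l]`) and the between-lead time dependency (through `a[l][i]`).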
3.3 Special Model Cases for Testing the Working Hypotheses
 As a means to evaluate our working hypotheses, we developed some special cases of the model described in sections 'Mean Model' and 'Dependency Model'. The first three of these models were constructed to evaluate the amount of predictive information of each deterministic forecast by successively excluding each of them in the mean model. A fourth model was used to evaluate the importance of the dependency submodel by assuming independence.
 The first model used only the hydrological and the persistent forecasts, i.e., the weighting factor of the sliding window climatology forecast was set to zero in (2). The climatology is then indirectly included in the parameter estimation.
 The second model employed only the hydrological and the climatological forecasts, i.e., the weighting factor of the persistent forecast was set to zero in (2).
 The third model was a pure statistical model based on flow data provided only by the sliding window climatology and the persistent forecasts. We obtained it by omitting the hydrological forecast in the mean model, i.e., by setting the weighting factor of the hydrological forecast to zero in (2).
 The fourth model was an independent model in which neither between-catchment nor between-lead time dependencies were included. It was constructed by assuming that the covariance matrix of the probabilistic forecast in equation (1) is diagonal. In this way, each catchment and each lead time had its own residual variance, and the dependencies between these were not estimated. We used the mean model as given in (2), but set the dependency matrix to zero in (7).
 Probabilistic climatological forecasts were established for each catchment and each day of year. They were constructed based on data for the period 1987–1994 by assigning each day of the year its own climatological distribution based on a 15 day window. For example, the climatology for 8 February is based on inflow observations between 1 and 15 February for the period 1987–2004. To obtain the forecast distribution for each day, the data were Box-Cox transformed and a normal distribution was fitted for every day.
3.4 Model Evaluation Scheme
 To evaluate the predictive performance of our models, all results were calculated in original scale and by using cross-validation. When estimating the parameters, we successively omitted the years one by one and used the resulting parameters to make forecasts for the omitted year. In total, we had 4 years of data as a basis for the cross-validation, which resulted in a fourfold cross-validation scheme. For model fitting, we used all available data as described in Appendix Estimation of Parameters in Probabilistic Forecast Model.
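 The leave-one-year-out scheme can be sketched as follows. This is an illustrative Python generator, not the code used in the study; the function name and the representation of each forecast day by a year label are assumptions. How the years are delimited (for example 1 September to 31 August) is left to the caller.

```python
def leave_one_year_out(year_labels):
    """Yield (held_out_year, train_indices, test_indices) for each distinct year.

    year_labels: one label per forecast day, e.g. the (hydrological) year it belongs to.
    """
    for held_out in sorted(set(year_labels)):
        train = [k for k, year in enumerate(year_labels) if year != held_out]
        test = [k for k, year in enumerate(year_labels) if year == held_out]
        yield held_out, train, test
```

 With four distinct year labels, the generator yields four folds: the parameters are estimated on the three training years and the forecasts are evaluated on the held-out year.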
 The purpose of postprocessing was to provide probabilistic flow forecasts in the form of probability density functions (pdf) for the flow. There are several methods available to evaluate probabilistic forecasts [e.g., Atger, 1999; Gneiting et al., 2007; Laio et al., 2007; Thyer et al., 2009]. We adhered to the criteria summarized by Engeland et al.  and tested for reliability and sharpness. To be consistent with the tradition in hydrology, we also evaluated the efficiency. These criteria are described in section 'Reliability, Sharpness, and Efficiency'.
 The continuous rank probability score (CRPS) combines both reliability and sharpness in one criterion which can be applied to univariate forecasts (one catchment and one lead time). The energy score (ES) [Gneiting et al., 2008] combines reliability and sharpness for multivariate forecasts (several catchments and/or lead times). Section 'Continuous Rank Probability Score and Energy Score' briefly describes these evaluation criteria.
 The evaluation of the predictions was carried out using a cross-validation scheme described in section 4.3, and was based on flows q in original scale (m3/s).
4.1 Reliability, Sharpness, and Efficiency
 A forecast is considered reliable if it is statistically consistent with the observed uncertainty (e.g., 95% of the observations should be inside a 95% forecast interval). This was examined using the predictive QQ-plot as described in Laio et al. [2007] and Thyer et al. [2009]. If we let Ft denote the predictive cumulative distribution function of the flow at time t, and qt the corresponding observed flow, the p value is given as $p_t = F_t(q_t)$. If our model is reliable, the p values will follow a uniform distribution on the interval [0,1]. This was checked graphically by plotting the p values as a function of theoretical quantiles of U(0,1). Deviations from the bisector (the 1:1 line) denote interpretable deficiencies [see Laio et al., 2007; Thyer et al., 2009].
 To summarize the reliability, we used the α-index as defined by Renard et al. 
where pj represents sorted p values in increasing order and j indicates the rank. The theoretical quantile was calculated using the plotting position . The α-index describes the total deviation of p values from the theoretical uniform quantiles. It has a maximum value of 1 that indicates perfect reliability. We used the α-index in order to easily compare the reliability of multiple forecasts on one plot. However, in order to analyze how the forecasts fail, predictive QQ-plots are needed.
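 The reliability check can be illustrated with a short Python sketch. Two things here are assumptions rather than statements from the text: the α-index is written in the commonly used form of one minus twice the mean absolute deviation between sorted p values and theoretical uniform quantiles, and the plotting position is taken as (j – 0.5)/n. The p value is estimated from an ensemble as the fraction of members not exceeding the observation; all function names are hypothetical.

```python
def p_value_from_samples(samples, observation):
    """Empirical predictive CDF evaluated at the observation."""
    return sum(1 for x in samples if x <= observation) / len(samples)


def alpha_index(p_values):
    """Reliability summary: 1 means the sorted p values match the uniform quantiles.

    Assumed form: 1 - 2 * mean(|p_(j) - u_j|), with plotting position u_j = (j - 0.5) / n.
    """
    n = len(p_values)
    p_sorted = sorted(p_values)
    deviation = sum(abs(p - (j + 0.5) / n) for j, p in enumerate(p_sorted)) / n
    return 1.0 - 2.0 * deviation
```

 A forecast whose p values all pile up at 0.5 (intervals far too wide) gives an index close to 0.5 under these assumptions, whereas p values spread evenly over [0,1] give an index close to 1.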
 Perfectly reliable forecasts such as the climatological forecast might provide forecast intervals that are too wide for any practical use [e.g., Gneiting et al., 2007]. We wanted our probabilistic forecast to be reliable and at the same time to exhibit as little uncertainty as possible. In other words, we wanted the forecast pdf to be sharp. In this study, for each catchment and each lead time, we evaluated sharpness by employing the average width of the 95% forecast intervals.
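 The traditional way of evaluating (deterministic) flow forecasts is to use the Nash-Sutcliffe efficiency Reff given as:
$$R_{\mathrm{eff}} = 1 - \frac{\sum_{t} \left(q_t - \hat{q}_t\right)^2}{\sum_{t} \left(q_t - \bar{q}\right)^2} \qquad (12)$$
where $q_t$ is the flow observation at time t, $\hat{q}_t$ is the forecast for the same day, i.e., the lead time l forecast issued at time t – l, and $\bar{q}$ is the mean of the observations.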
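 A minimal Python sketch of the Nash-Sutcliffe efficiency in its standard form is given below; the function name is an assumption. For probabilistic forecasts, the forecast series passed in would be the median of the predictive distribution.

```python
def nash_sutcliffe(observed, forecast):
    """1 - (sum of squared forecast errors) / (sum of squared deviations from the observed mean)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - f) ** 2 for o, f in zip(observed, forecast))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst
```

 A perfect forecast gives a value of 1, and a forecast that always equals the observed mean gives 0, so negative values indicate a forecast that is worse than the observed mean.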
4.2 Continuous Rank Probability Score and Energy Score
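 To evaluate and rank univariate forecasts (one catchment and one lead time), we used the continuous rank probability score (CRPS) [Matheson and Winkler, 1976; Hersbach, 2000; Gneiting et al., 2007]. This can conveniently be calculated from samples drawn from the predictive distribution:
$$\mathrm{CRPS}(f, q_{obs}) = \frac{1}{m} \sum_{j=1}^{m} \left| q_j - q_{obs} \right| - \frac{1}{2m^2} \sum_{j=1}^{m} \sum_{k=1}^{m} \left| q_j - q_k \right|$$
where $q_1, \ldots, q_m$ are independent samples from the predictive pdf f and qobs is the observation. For a time series of forecasts, we defined $\overline{\mathrm{CRPS}}$ as the average taken over all T time steps:
$$\overline{\mathrm{CRPS}} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{CRPS}(f_t, q_t)$$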
 CRPS is negatively oriented, i.e., lower values indicate better scores, and it is measured on the same scale as the observations (m3/s in our case). It is also a valid measure for evaluating the performance of deterministic forecasts. In this case, it reduces to the absolute error between observations and forecasts.
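 The sample-based CRPS can be sketched in Python as follows. The estimator shown is the common one based on the mean absolute error of the samples minus half the mean absolute difference between pairs of samples; the function name and the use of all ordered pairs (including a sample with itself) are choices made for this example.

```python
def crps_from_samples(samples, observation):
    """Sample estimate of the CRPS for one forecast and one observation."""
    m = len(samples)
    term_obs = sum(abs(x - observation) for x in samples) / m
    term_spread = sum(abs(x - y) for x in samples for y in samples) / (2.0 * m * m)
    return term_obs - term_spread
```

 With a single sample, the spread term vanishes and the score is the absolute error, which illustrates the statement above about deterministic forecasts.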
 To evaluate and rank multivariate forecasts (involving multiple catchments and/or lead times), we used the energy score (ES), which was introduced by Gneiting et al. [2008] and applied in Salazar et al. . For a single time step, ES was calculated using a sample of size m of the vector Q obtained by simulating from the multivariate probabilistic forecast
$$\mathrm{ES} = \frac{1}{m} \sum_{j=1}^{m} \left\lVert Q_j - Q_{obs} \right\rVert - \frac{1}{2m^2} \sum_{j=1}^{m} \sum_{k=1}^{m} \left\lVert Q_j - Q_k \right\rVert$$
where $\lVert \cdot \rVert$ denotes the Euclidean norm, $Q_1, \ldots, Q_m$ are m independent vectors sampled from the multivariate probabilistic forecast, and $Q_{obs}$ is the vector of observations. The temporal average of all values of ES is denoted as $\overline{\mathrm{ES}}$. The ES is a generalization of the CRPS for multivariate forecasts, and has many of the same properties as the CRPS. It is negatively oriented and reduces to the (Euclidean) absolute error for deterministic forecasts.
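 A sample-based energy score can be sketched in the same way as the CRPS, with absolute values replaced by Euclidean distances between vectors. The Python function below is an illustration with a hypothetical name; each ensemble member is a flat list of flows, for example one value per catchment and lead time. It uses `math.dist`, which is available from Python 3.8.

```python
import math


def energy_score(members, observation):
    """Sample estimate of the energy score for one multivariate forecast."""
    m = len(members)
    term_obs = sum(math.dist(x, observation) for x in members) / m
    term_spread = sum(math.dist(x, y) for x in members for y in members) / (2.0 * m * m)
    return term_obs - term_spread
```

 Restricting the vectors to the entries of one lead time, or of one catchment, gives the score for that subspace, and for one-dimensional vectors the function coincides with the sample CRPS.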
5 Results From the Ulla-Førre Case Study Using Five Catchments and Ten Lead Times
5.1 Model Fitting
 The first step was to set the λ parameter in the Box-Cox transformation in equation (A1). A Shapiro-Wilk test for normality indicated that the optimal value for λ was between 0.0 and 0.25. We set λ = 0.2 as visual inspection of forecast errors (plots of model residuals versus predicted values and QQ-plots) using this transformation showed that homoscedasticity and normality were reasonable assumptions. We show in this paper the plots of model residuals versus predicted values for Saurdal in Figures 3b, 3d, and 3f for lead times 1, 5, and 10 days, respectively. All further estimations were carried out for transformed flows. To evaluate if our results and conclusions were sensitive to the λ parameter, we fitted and evaluated the models for λ = 0.0 and 0.25 as well (results not shown).
 Based on the complete postprocessing model (equations (2) and (8)), Figures 3a, 3c, and 3e show the 95% forecast intervals for lead times 1, 5, and 10 days at Saurdal. The estimated regression parameters of the mean model (2) for the full postprocessing model, together with their 95% confidence intervals, are shown in Figure 4. Figure 5a shows the autoregression parameters of (8) together with their 95% confidence intervals, while Figure 5b shows the estimated correlations of the postprocessed residuals.
5.2 Evaluating the Models
 We used the methodology presented in section 'Model Evaluation Scheme' to evaluate the predictive performance of our postprocessing models. To evaluate the reliability of the postprocessing models and the climatological forecast, we used the predictive QQ-plots (see Figure 6). We also evaluated the reliability of different magnitudes of forecasted flows for the full postprocessing model (Model 1) by using the α-index (see Figure 6). We divided the postprocessed forecasts into the 20% lowest, the 10% highest, and the median flow values, and calculated the α-index for each of these three groups. If the α-index depends on flow magnitude, this might either be because the bias in the forecast depends on magnitude or because the heteroscedasticity of the flow forecasts is misspecified. Since the hydrological forecast was deterministic, its reliability could not be evaluated. The model without dependency was very similar to our full postprocessing model for all marginal evaluations for a given catchment and lead time. For this reason, its evaluation results are only shown for the energy score (ES).
 In order to summarize the sharpness for all catchments and lead times in one figure, we plotted the average width of the 95% forecast intervals as a function of lead time (Figure 7). Figure 8 shows the CRPS for all models, lead times, and catchments as well as ES for evaluation of the probabilistic forecasts jointly across all catchments for each lead time. When we calculated ES, we obtained a single value by using all 50 dimensions (5 catchments multiplied by 10 lead times). We divided the total ES into subspaces by calculating ES over all catchments to obtain one ES for each lead time (10 values), and over all lead times in order to get one ES for each catchment (five values). The results are shown in Figure 8 and Table 2. In the context of hydropower scheduling, the total volume is important. In other words, it is most important to have a good model for the largest catchments. We therefore chose to use the flow values given in m3/s when calculating the ES.
Table 2. Energy Score (ES) for Each of the Five Catchments for Lead Times 1–10 Jointly, and for All Catchments and Lead Times Jointly (a)
(a) The scores of the best models are shown in bold type. The models evaluated are the climatological forecast (Climatology), the full postprocessed model (Model 1), the models excluding the sliding window climatology forecast (Model 2), the persistent forecast (Model 3), the hydrological forecast (Model 4), and the dependency (Model 5), and the hydrological forecast (Hydrological). The two last lines show the CRPS for forecasted flows accumulated over all lead times for Model 1 and Model 5.
 In Table 2, we also included the CRPS calculated for the flow volumes accumulated over all lead times for each catchment, as well as accumulated over all catchments and lead times. These scores were used to evaluate the importance of the dependency model.
 When calculating the efficiency Reff for probabilistic forecasts, we used the median of the probabilistic forecast as in (12). Figure 9 shows the efficiency Reff for each model, each lead time and each catchment.
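 A minimal sketch of this calculation is given below. It assumes that Reff is a Nash-Sutcliffe-type efficiency (one minus the ratio of the squared forecast error to the variance of the observations around their mean); equation (12) should be consulted for the exact definition, and the function name is hypothetical.

```python
import statistics


def efficiency_from_ensembles(ensembles, observations):
    """Efficiency of the ensemble medians, assuming a Nash-Sutcliffe-type score."""
    # Reduce each probabilistic forecast to its median.
    medians = [statistics.median(ens) for ens in ensembles]
    mean_obs = statistics.fmean(observations)
    squared_error = sum((o - m) ** 2 for o, m in zip(observations, medians))
    total_variation = sum((o - mean_obs) ** 2 for o in observations)
    return 1.0 - squared_error / total_variation
```

 Under this assumed definition, a value of 1 corresponds to medians that match the observations exactly, and a value of 0 to medians that are no better than the observed mean.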
6.1 Working Hypotheses
 We now want to link the results of the case study to our working hypotheses. These were (1) the sliding window climatology forecast, the persistent forecast, and the hydrological forecast all contain predictive information about the future flow, and the relative importance of these forecasts changes with lead time; and (2) the between-catchment and between-lead time residual dependencies vary with lead time.
 To evaluate the first hypothesis, we start by considering the regression parameters of the mean model given in Figure 4. These coefficients can be interpreted as weights for the three deterministic forecasts, and we find that the importance of the different forecasts varies with lead time and also differs between the catchments. By examining the 95% confidence intervals of the coefficients, we also find that these differences are well beyond parameter estimation uncertainty. To evaluate the conditional predictive information of each forecast, we compared the predictive performance of the full postprocessing model (Model 1) with reduced models, each of which excludes one of the forecasts from the full postprocessing model. Model 2 excludes the sliding window climatology forecast, Model 3 the persistent forecast, and Model 4 the hydrological forecast. The reliability of the different forecasts can be assessed from Figure 6, the sharpness from Figure 7, the scores CRPS and ES from Figure 8, and the efficiency from Figure 9. In the following, we focus on the scores in Figure 8, but the same trends can be found by considering sharpness (Figure 7) and efficiency (Figure 9).
 In Figure 4, we can see that the hydrological forecast exhibits its highest weights for medium lead times (3–9 days). By comparing the CRPS and ES values in Figure 8 of the full model (Model 1) and Model 4, which excludes the hydrological forecast, we can see that the conditional information content in the hydrological forecast is largest for lead times of 3–9 days. The conditional information content in the hydrological forecast is low for both the shortest (1–2 days) and the longest (10 days) lead times. As such, the hydrological model is superfluous for these lead times. In these cases, we received no substantial benefit in terms of making a better probabilistic forecast, either from our physical knowledge (contained in the hydrological model) or from meteorological forecasts. Thus, we consider it sufficient to use a combination of the sliding window climatology and persistent forecasts.
 The persistent forecast exhibits high weights for the first lead times (Figure 4), which subsequently decrease with lead time. By comparing the CRPS and ES values in Figure 8 and Table 2 for the full model (Model 1) with Model 3, which excludes the persistent forecast, we see that the conditional information content in the persistent forecast is large for lead time 1, and has some effect for lead time 2. For longer lead times, the persistent forecast does not contain much conditional predictive information.
 In Figure 4, we can also see that the sliding window climatology forecast exhibits higher weights with increasing lead time. However, the weights are relatively small for all lead times, and either the persistent or the hydrological forecast has a higher weight for all catchments and all lead times. Furthermore, by comparing the CRPS and ES values (Figure 8 and Table 2) for the full model (Model 1) and Model 2, which excludes the sliding window climatology forecast, we find that inclusion of this forecast only slightly improves the forecasts. For our case study, the sliding window climatology forecast does not contain much conditional predictive information.
 We can also see that the importance of the different forecasts varies between catchments. For example, for the two smallest catchments (Lauvastøl and Osali), the persistent forecast provides additional information for the first lead time only, whereas for the other catchments it provides information for the second lead time as well.
 The second hypothesis is supported by an exploratory analysis of the forecast error exhibited by the hydrological forecast described in section 'Exploratory Analysis of Dependencies' (see Figure 2). From the estimated correlations between catchments (Figure 5), we find that the correlations increase slightly with lead time. From the between-lead time dependency parameters ad (Figure 5), we can see that the dependency varies considerably between catchments, and that in general larger catchments exhibit larger dependencies. Furthermore, the dependency between lead times 1 and 2 is larger than the dependency between lead times 2 and 3, especially for the Kvilldal catchment. For greater lead times, a slightly increasing dependency trend is observed for all catchments.
 The predictive importance of including dependency between lead times and catchments can be evaluated by comparing the CRPS and ES values (Figure 8 and Table 2) for the full model (Model 1) with the model assuming independent residuals (Model 5). Here we find that inclusion of the between-catchment correlation improves the forecasts for the first and the last two lead times, and that inclusion of the between-lead time dependency slightly improves the forecasts for all catchments, with the exception of the two smallest: Lauvastøl and Osali. For all catchments jointly, the dependency model only slightly improves the forecasts according to the ES. One reason dependencies are of interest is their consequence for the uncertainty of accumulated flows. To obtain an alternative evaluation of the dependency model, we therefore calculated the CRPS for accumulated flows. These CRPS values are listed in Table 2, and Model 1 clearly performs better than Model 5, especially when accumulating over all catchments and lead times. Further calculations (results not shown in the paper) show that the improved CRPS for accumulated flows in Model 1 compared to Model 5 comes from improvements in both reliability and sharpness. This indicates that the ES has low sensitivity to the dependency structure, which is also a finding in Möller et al.
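 The effect of dependency on accumulated quantities can be illustrated with a small calculation. The sketch below computes the standard deviation of a sum of variables from their standard deviations and correlation matrix. It is a simplified illustration in a Gaussian setting with hypothetical numbers; accumulated flows in original space are not Gaussian, so it shows only the principle.

```python
import math


def accumulated_sd(sds, corr):
    """Standard deviation of a sum, given marginal sds and a correlation matrix."""
    n = len(sds)
    variance = sum(
        sds[i] * sds[j] * corr[i][j] for i in range(n) for j in range(n)
    )
    return math.sqrt(variance)


# Hypothetical example: two lead times with unit standard deviation.
independent = accumulated_sd([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
dependent = accumulated_sd([1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]])
```

 With identical marginal distributions, the positively correlated case gives a wider distribution for the sum than the independent case, so a model that ignores positive dependency tends to be overconfident about accumulated volumes even when each marginal forecast is well calibrated.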
 Figure 5 shows that the correlation in forecast errors between catchments increases with lead time for both the hydrological and the postprocessed forecast. This feature is important for reservoir management because it implies that if a given forecast is wrong in one catchment, it is likely to be wrong in the same direction for all catchments. As a result, the operator has less flexibility in the management of each individual reservoir when aiming to maximize income. The postprocessed forecast results in smaller correlations between catchments than the hydrological forecast. This reduction in correlations reveals a potential for improving hydropower scheduling.
 One desirable property of a probabilistic forecast is to be reliable. The plots displayed in Figure 6 show that reliability increases with lead time. The climatological forecasts are the most reliable. We believe that the deviations from perfect reliability are small and that for this application the other evaluation criteria should be used to rank the performance of the postprocessing models.
 The heteroscedasticity of flows and flow forecast errors often presents a challenge in hydrological modeling. We have used a Box-Cox transformation of all observations and deterministic forecasts, and assumed a Gaussian model in the transformed space. In Figure 3, the residuals in transformed space are plotted against predictive flow for lead times 1, 5, and 10 for the Saurdal catchment. This figure (and the corresponding plots for the other catchments) did not demonstrate any clear evidence of heteroscedasticity in the transformed space (such as increasing residuals with increasing flow). The plots of the α-index shown in Figure 6 indicate how well the homoscedasticity assumption is met. We see that for Saurdal the forecasts are equally reliable for all flows. At Lauvastøl the forecasts are least reliable for low flows; at Osali and Stølsdal the forecasts are least reliable for both low and high flows; whereas at Kvilldal the forecasts are least reliable for high flows. Further inspection of the forecasts (not shown) revealed that the forecasts at Lauvastøl, Osali, and Stølsdal were too wide for low flows, whereas there is an overprediction for the high flows at Stølsdal, Osali, and Kvilldal. We carried out sensitivity tests for the value of the λ parameter in the Box-Cox transformation by varying the value between 0.0 and 0.25. This produced only minor changes in the results and had no influence on the conclusions.
6.2 Relevance of Case Study
 From a hydrologist's point of view, the events for which it is important to quantify the uncertainty in a hydropower production setting depend on external variables such as electricity prices, the reservoir state and capacity, and the production capacity, and not necessarily on the forecasted quantity itself. The value of accurate quantitative measures of uncertainty is thus difficult to assess without combining a probabilistic forecast with a stochastic optimization decision simulator.
 Our postprocessing model has only been tested for the Ulla-Førre complex. However, we believe that it has sufficient flexibility to be applied to other catchments and/or regions. The importance of the three different forecasts may depend on specific conditions unique to each system. For example, the importance of the climatological forecast may increase in catchments which exhibit even greater seasonal variation in flows (e.g., due to pronounced seasonal variations in snow accumulation and melt processes, evapotranspiration processes, and/or precipitation). We also expect that the importance of the persistent forecast may depend on catchment characteristics such as catchment size, lake percentage, and drainage density. In our case study, we found that catchment size might explain the information content in the persistent forecast. It is also reasonable to suppose that the importance of the hydrological forecast depends on the quality (correctness) of the hydrological model and its input. In this study, the hydrological model was applied on daily time steps, which are longer than the concentration time for the smallest catchments. Daily time steps were used because they are required by the hydropower scheduling models and because most of the input data are available at this resolution.
 We recognize that the precipitation inputs might include large uncertainties. In particular, the systematic errors which create as yet unknown biases generate major challenges in the study area. It was not our aim in this study to investigate this subject by building a stochastic preprocessor. Instead, we have carried out a simple bias adjustment of precipitation inputs, both observations and forecasts. The uncertainty inherent in these inputs will, however, implicitly be reflected in the postprocessing model, both in terms of the relative importance of the hydrological forecast and the covariance estimates.
 A challenge facing postprocessing methods is that they require homogeneous forecasts as input. The quality of the hydrological forecasts depends heavily on the quality of their inputs, such as the precipitation data and the meteorological forecasts. Since gauge networks as well as meteorological forecast products develop over time, it may be necessary to recalibrate the postprocessing model on a regular basis.
 An interesting way of extending the postprocessing is to employ explanatory variables such as season or weather within the variance and dependency structure. Another interesting class of explanatory variables is information about the uncertainty linked to the available forecasts, such as the spread of the three deterministic forecasts, the spread of the sliding window climatology, and/or the spread of a stochastic hydrological forecast obtained by a stochastic preprocessor of input uncertainty.
7 Summary and Conclusion
 In this paper, we have introduced a modeling framework for constructing probabilistic inflow forecasts for several catchments and lead times simultaneously. The framework involves a postprocessing method which in our case study was sufficiently flexible to handle operationally important features that the error is likely to contain, such as bias, heteroscedasticity, between-lead time dependencies that vary with lead time, and between-catchment dependencies which also vary with lead time. The framework is also able to utilize three different types of deterministic forecasts: the hydrological forecast, the persistent forecast, and the sliding window climatology forecast.
 We have applied the framework to construct probabilistic models for 10 lead times using field data from five catchments in the Ulla-Førre hydropower complex in south-western Norway. The resulting models were straightforward and computationally inexpensive to sample from, and can thus be used as input to stochastic optimization methods.
 When evaluating the models' predictive performance, we used a cross-validation scheme in which we successively omitted 1 year of data from the estimation and then applied the estimated parameters to the year that was omitted. Even though our full model contained a large number of parameters (200 for the mean model and 195 for the dependency model), the cross-validation indicates no signs of overfitting. On average, the full model and the model without the sliding window climatology forecast perform equally well and exhibit better predictive performance for all catchments and lead times than the other reduced models, in which one of the deterministic forecasts is omitted.
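 The leave-one-year-out scheme can be sketched as a simple split generator. The function name and the representation of the data as a list of year labels are assumptions made for illustration; the estimation and forecasting steps themselves are not shown.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices) for each distinct year.

    years: list with the calendar year of each forecast case.
    """
    for held_out in sorted(set(years)):
        # Estimate on every case outside the held-out year ...
        train = [i for i, year in enumerate(years) if year != held_out]
        # ... and evaluate on the cases inside it.
        test = [i for i, year in enumerate(years) if year == held_out]
        yield held_out, train, test
```

 Each fold estimates the model parameters on the training indices and scores the forecasts on the test indices, so every forecast case is evaluated exactly once with parameters that were estimated without it.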
 Based on our two working hypotheses, the following lessons have been learned from this study of the Ulla-Førre complex:
 1. The importance of each forecast type (hydrological, persistent and sliding window climatology) depends on lead time and catchment.
 2. The sliding window climatology forecast can be left out from the postprocessing model without significant loss of information.
 3. The value of information contained in the persistent forecast is largest for the first two lead times.
 4. For lead times 1 and 10 days there is no need for a hydrological model, since the combination of persistent and climatological forecasts produces essentially the same results as the full model.
 5. The forecast errors are strongly correlated between catchments and between lead times.
 The importance of the dependency model was demonstrated when calculating CRPS values for time-accumulated flow forecasts, whereas the ES values were not very sensitive to correct estimation of the dependency structure.
 In our future research, we plan to use the probabilistic forecasts presented in this paper as benchmarks for endogenous-based probabilistic forecasts and preprocessing-based probabilistic forecasts.
Appendix A: Box-Cox Transformation
 The Box-Cox transformation maps between the original scale q (in this case, flow measured in m3/s) and the transformed scale y using
\[
y = \begin{cases} \dfrac{q^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[1ex] \log q, & \lambda = 0, \end{cases}
\]
where λ is a transformation parameter. The aim of the transformation is to obtain Gaussian distributed variables and/or homoscedasticity.
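 A minimal sketch of the forward and inverse transformation is given below, assuming the standard one-parameter Box-Cox form with the logarithm as the limiting case for λ = 0. The function names are hypothetical. The inverse is what maps samples drawn in transformed space back to flows in m3/s.

```python
import math


def box_cox(q, lam):
    """Transform a positive flow q (m3/s) to the transformed scale."""
    if q <= 0:
        raise ValueError("Box-Cox requires a positive flow value")
    if lam == 0:
        return math.log(q)
    return (q ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Map a transformed value y back to the original flow scale."""
    if lam == 0:
        return math.exp(y)
    base = lam * y + 1.0
    if base <= 0:
        raise ValueError("transformed value is outside the invertible range")
    return base ** (1.0 / lam)
```

 For λ greater than zero the transformed values are bounded below by -1/λ, so a Gaussian sample in transformed space can occasionally fall outside the invertible range; the sketch raises an error in that case rather than choosing a handling strategy.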
Appendix B: Estimation of Parameters in Probabilistic Forecast Model
 The model setup in sections 'Mean Model' and 'Dependency Model' for which we want to estimate parameters can be described with L regressions. For l = 1
where and the elements of are given by . And for we have
where and the elements of are given by . As these regressions have no parameters in common, estimation can be done for each of them independently. In each of these regressions, every catchment has its own parameters in the mean, but the catchments are linked because the residuals ( or ) are correlated. This is known as seemingly unrelated regression (SUR), and both the regression coefficients and the covariance matrix ( or ) can be estimated as described in Zellner. This is implemented in the R package systemfit.
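 For orientation, the generic form of a SUR system and its feasible generalized least squares estimator can be written as follows. The notation (C equations indexed by c, n observations indexed by t, and the symbols y, X, β, ε, and Σ) is assumed here for illustration and is not necessarily that of the equations above.

```latex
\begin{aligned}
y_c &= X_c \beta_c + \varepsilon_c, \qquad c = 1, \dots, C, \\
\operatorname{Cov}(\varepsilon_{c,t}, \varepsilon_{c',t'}) &=
\begin{cases} \sigma_{cc'}, & t = t', \\ 0, & t \neq t', \end{cases} \\
\hat{\beta}_{\mathrm{FGLS}} &=
\left( X^{\top} \bigl(\hat{\Sigma}^{-1} \otimes I_n\bigr) X \right)^{-1}
X^{\top} \bigl(\hat{\Sigma}^{-1} \otimes I_n\bigr)\, y .
\end{aligned}
```

 Here y stacks the vectors y_c, X is block diagonal with blocks X_c, and the C by C matrix with elements σ_cc' is estimated from the residuals of equation-by-equation ordinary least squares fits.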
Appendix C: Derivation of Elements in Full Covariance Matrix Σ
 Assume that we have a multivariate Gaussian random variable Y, which we decompose into two vectors and Yl, . Y has mean and covariance matrix
where is the cross-covariance matrix between Yl and . The conditional distribution is then Gaussian with mean
 From (C3), we find that the marginal variance for lead time l can be found from
 As and are estimated, (C4) and (C5) can be applied recursively to obtain and for all lead times l. If further cross-covariance matrices are also of interest, formulas similar to (C4) and (C5) can be applied recursively, with the conditioning done on all previous lead times.
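 For reference, the standard result for conditioning in a partitioned Gaussian vector, on which this appendix relies, can be written as follows. The notation (subscripts l-1 and l, the matrix A_l, and the conditional covariance Σ_δ,l) is assumed for illustration and may differ from that of the numbered equations in this appendix.

```latex
\begin{aligned}
\begin{pmatrix} Y_{l-1} \\ Y_l \end{pmatrix}
&\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_{l-1} \\ \mu_l \end{pmatrix},
\begin{pmatrix} \Sigma_{l-1} & \Sigma_{l-1,l} \\ \Sigma_{l-1,l}^{\top} & \Sigma_l \end{pmatrix}
\right), \\
\operatorname{E}[Y_l \mid Y_{l-1} = y_{l-1}]
&= \mu_l + \Sigma_{l-1,l}^{\top} \Sigma_{l-1}^{-1} (y_{l-1} - \mu_{l-1}), \\
\operatorname{Cov}[Y_l \mid Y_{l-1} = y_{l-1}]
&= \Sigma_l - \Sigma_{l-1,l}^{\top} \Sigma_{l-1}^{-1} \Sigma_{l-1,l}.
\end{aligned}
```

 If the conditional mean is modeled as linear, with coefficient matrix A_l and conditional covariance Σ_δ,l, matching terms gives a recursion of the kind described above:

```latex
\begin{aligned}
\Sigma_{l-1,l} &= \Sigma_{l-1} A_l^{\top}, \\
\Sigma_l &= A_l \Sigma_{l-1} A_l^{\top} + \Sigma_{\delta,l}.
\end{aligned}
```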
 This study is part of the project “The hydrologic crystal ball”, sponsored by Statkraft. We also thank the anonymous reviewers and the associate editor for their critical comments, which helped to improve the manuscript.