Simultaneous calibration of ensemble river flow predictions over an entire range of lead times


  • S. Hemri (corresponding author)

    1. Swiss Federal Institute for Forest, Snow and Landscape Research WSL, Birmensdorf, Switzerland
    2. Now at Institute for Applied Mathematics, Heidelberg University, Heidelberg, Germany

    Correspondence: S. Hemri, Institute for Applied Mathematics, Heidelberg University, Im Neuenheimer Feld 294, DE-69120 Heidelberg, Germany

  • F. Fundel

    1. Swiss Federal Institute for Forest, Snow and Landscape Research WSL, Birmensdorf, Switzerland
    2. Deutscher Wetterdienst, Offenbach, Germany

  • M. Zappa

    1. Swiss Federal Institute for Forest, Snow and Landscape Research WSL, Birmensdorf, Switzerland


[1] Probabilistic estimates of future water levels and river discharge are usually simulated with hydrologic models using ensemble weather forecasts as main inputs. As hydrologic models are imperfect and the meteorological ensembles tend to be biased and underdispersed, the ensemble forecasts for river runoff typically are biased and underdispersed, too. Thus, in order to achieve both reliable and sharp predictions, statistical postprocessing is required. In this work, Bayesian model averaging (BMA) is applied to statistically postprocess raw ensemble runoff forecasts for a catchment in Switzerland, at lead times ranging from 1 to 240 h. The raw forecasts have been obtained using deterministic and ensemble meteorological forcing models with different forecast lead time ranges. First, BMA is applied based on mixtures of univariate normal distributions, subject to the assumption of independence between distinct lead times. Then, the independence assumption is relaxed in order to estimate multivariate runoff forecasts over the entire range of lead times simultaneously, based on a BMA version that uses multivariate normal distributions. Since river runoff is a highly skewed variable, Box-Cox transformations are applied in order to achieve approximate normality. Both univariate and multivariate BMA approaches are able to generate well calibrated probabilistic forecasts that are considerably sharper than climatological forecasts. Additionally, multivariate BMA provides a promising approach for incorporating temporal dependencies into the postprocessed forecasts. Its major advantage over univariate BMA is an increase in reliability when the forecast system changes due to model availability.

1. Introduction

[2] Bayesian model averaging (BMA) was introduced by Raftery et al. [2005] for the postprocessing of meteorological forecasts. It generates a single forecast PDF for a quantity like temperature or rainfall by combining several forecasts from different models. The goal of statistical postprocessing is to obtain calibrated and sharp predictive continuous PDFs of the quantity to be forecast, given uncalibrated raw predictive ensembles [Raftery et al., 2005; Gneiting et al., 2007].

[3] In hydrology, probabilistic runoff forecasts are typically generated by running a hydrologic model several times using atmospheric variables from different meteorological models as forcing. In recent years BMA has been used increasingly for the combination of multiple rainfall-runoff models [Ajami et al., 2007; Duan et al., 2007; Vrugt and Robinson, 2007; Diks and Vrugt, 2010; Parrish et al., 2012]. Besides the work of Bogner et al. [2013], there has so far been no similar BMA study accounting for the correlation structure between different lead times in the field of hydrologic forecasting. Note that Yuan and Wood [2012] successfully applied a different Bayesian postprocessing approach to probabilistic hydrologic forecasts. Berrocal et al. [2007] combined BMA with the geostatistical output perturbation (GOP) method [Gel et al., 2004] in order to achieve spatially correlated BMA forecasts; we will use this method to account for the correlations among lead times, with space replaced by lead time. For simplicity this BMA method is called multivariate BMA from now on.

[4] While multivariate BMA is primarily supposed to improve probabilistic forecasts in terms of multivariate verification measures like the energy score [Gneiting et al., 2008], it also seems to slightly improve the marginal predictive distributions at single lead times. Pinson et al. [2008] and Pinson and Girard [2012] introduced a similar method used for the postprocessing of probabilistic forecasts of wind power. Like multivariate BMA, it resorts to the multivariate Gaussian distribution and estimates covariance matrices from the series of prediction errors.

[5] For this study, the meteorological forcing consists of a mixture of different models. On the one hand, both global and regional models are used. On the other hand, the models can be divided into deterministic forecasts and ensemble prediction systems (EPS). Each member of an EPS stands for a model run with different initial states, boundary conditions, or model formulations [Palmer, 2000]. Ideally, these model runs are exchangeable, i.e., they are supposed to be statistically identical. One of the main goals of this study is to combine forecasts stemming from such a collection of different models into a sound probabilistic density forecast.

[6] In the following, we apply the methods introduced above to river discharge forecasts for Thur River in Eastern Switzerland. Hourly runoff forecasts forced by a mixture of deterministic models and EPSs that cover different ranges of lead times are combined into one single multivariate PDF using multivariate BMA. Thus, multivariate BMA features modifications that allow it, on the one hand, to account for temporally correlated prediction errors and, on the other hand, to produce time-consistent forecasts even though the different forcings cover unequal ranges of lead times. All analyses have been performed using the statistical software R [R Development Core Team, 2011].

[7] Section 2 presents the data set used in this study along with an example of a particular forecast. BMA, and especially multivariate BMA, is explained in detail in section 3. The forecasts obtained by univariate and multivariate BMA over a period of two and a half years are verified in section 4. Lastly, we briefly discuss the results in section 5 and make some concluding remarks in section 6.

2. Data Set With Example Forecast

[8] In this study, we consider Thur River, which drains a medium-sized subcatchment of the Swiss Rhine covering an area of 1696 km². Its altitude ranges from 356 m a.s.l. at the gauging station in Andelfingen to 2503 m a.s.l. at the highest mountain peak, with a mean altitude of 769 m a.s.l. Due to its cool climate, runoff generation from autumn to spring is influenced by snow dynamics. As a result of snowmelt, monthly discharges are usually maximal in April, though precipitation amounts are highest in summer [Fundel and Zappa, 2011; Fundel et al., 2013]. All observed runoff values have been provided by the Swiss Federal Office for the Environment (FOEN).

[9] The discharge forecasts constitute a hydrometeorological ensemble prediction system (HEPS). HEPS are obtained by running a hydrologic model several times with different meteorological forcings [Addor et al., 2011; Zappa et al., 2011]. The runoff predictions, which serve as raw forecasts for BMA in this study, have been generated by the hydrologic model “Precipitation-Runoff-Evapotranspiration HRU Model” (PREVAH) [Viviroli et al., 2007a, 2007b, 2009]. Reforecasts, i.e., historical forecasts with a consistent model version, from several models of different scale have been available as meteorological forcing. Table 1 gives an overview of the 69 model members that provide forecasts on an hourly basis for different ranges of lead times. For more details on the particular models we refer to Molteni et al. [1996] for the ECMWF-EPS, to Marsigli et al. [2004, 2005], Tibaldi et al. [2006], and Montani et al. [2011] for COSMO-LEPS, and to Consortium for Small-scale Modeling [2011] for COSMO-2 and COSMO-7. PREVAH is run on a daily basis, initialized at 00:00 UTC, in order to forecast hourly discharges. Each run corresponds to an ensemble member of an EPS or to a deterministic model. The period from 17 July 2007 to 17 November 2009 serves as verification period. For any verification day x, a calibration period of 30 days running from day x − 40 to day x − 11 has been used for model fitting. This guarantees that the calibration and verification periods do not overlap for lead times up to 240 h.

Table 1. Features of the Meteorological Input Models
Name | Number of Models | Lead Times (h) | Spatial Resolution ∼ (km)

[10] An example of a BMA forecast for lead times 1–240 h is given here. Figure 1 shows forecast trajectories of the raw ensemble and of both the univariate and the multivariate BMA models. The predictions have been initialized at 00:00 UTC on 18 April 2008. Note that COSMO-2, COSMO-7, and COSMO-LEPS drop out at lead times 24 h, 72 h, and 120 h, respectively. In the case of the BMA forecasts, the trajectories are obtained by random sampling of 69 different scenarios from the BMA distributions. The number of 69 samples corresponds to the number of raw ensemble members available at the first 24 lead times. For lead times up to 120 h the forecast trajectories of the raw ensemble and of the univariate and multivariate BMA forecasts are plausible when verified against the observations. The dropout of COSMO-LEPS strongly affects the raw ensemble and the univariate BMA forecast: as they then rely only on the ECMWF members, which forecast low runoff, they completely miss the observed runoff pattern. The multivariate approach, however, still relies on the COSMO-LEPS members through the covariances. For additional illustration, predicted quantiles of the raw ensemble and of both BMA forecasts initialized on fifteen different days, of which five represent high flow events, five low flow events, and five are chosen randomly, are available as supporting information for this report.

Figure 1.

Forecast trajectories of the raw ensemble, univariate and multivariate BMA forecasts initialized at 00:00 UTC on 18 April 2008 against lead time. BMA trajectories correspond to 69 random samples from the forecast distributions.

3. Methods

3.1. Standard Univariate BMA

[11] Raftery et al. [2005] introduced a Bayesian model averaging (BMA) approach that allows combining dynamic models. In its simplest form BMA combines forecasts $r_1, \ldots, r_K$ from K different models into one mixture forecast probability density function (PDF) given by

\[ p(y \mid r_1, \ldots, r_K) = \sum_{k=1}^{K} w_k\, g_k(y \mid r_k) \tag{1} \]

where $w_k$ is the weight of model k and $g_k$ is the conditional PDF of y, e.g., the forecast runoff, given that model k is best. Note that this form accounts neither for bias correction nor for transformation of $r_k$. Additionally, lead times are implicitly assumed to be completely independent.

[12] As parts of the raw forecasts stem from EPSs, BMA needs to deal with exchangeable ensemble members. Fraley et al. [2010] provide a BMA variant which guarantees by model design that the parameters of exchangeable ensemble members are equal. Hence, equation (1) has to be rewritten as

\[ p(y \mid r_{1,1}, \ldots, r_{I,m_I}) = \sum_{i=1}^{I} \sum_{j=1}^{m_i} w_i\, g_i(y \mid r_{i,j}) \tag{2} \]

where the index j denotes a specific member $j = 1, \ldots, m_i$ of ensemble model i, and I is the number of distinct models.
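As an illustration, equation (2) can be evaluated directly as a weighted sum of Gaussian kernels. The following is a minimal Python sketch (the study itself used R); the function name and the list-based data layout are our own illustrative assumptions, and the kernels are taken to be normal with member-specific means, as in the transformed model introduced below.

```python
import math

def bma_mixture_pdf(y, ensembles, weights, sigmas):
    """Evaluate the BMA mixture density of equation (2) at y.

    ensembles[i] holds the exchangeable member forecasts of model i,
    weights[i] is the per-member weight w_i shared by all members of
    model i (so sum_i m_i * w_i should equal 1), and sigmas[i] is the
    standard deviation of the Gaussian kernel g_i.
    """
    dens = 0.0
    for members, w, s in zip(ensembles, weights, sigmas):
        for r in members:
            # exchangeable members share the same weight and kernel variance
            dens += w * math.exp(-0.5 * ((y - r) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return dens
```

For a single deterministic model with weight 1, the mixture collapses to a single normal density centered at that forecast.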

[13] Streamflow data are undoubtedly non-Gaussian. In order to be able to resort to BMA methods relying on Gaussian distributions, both the observations and the raw ensemble predictions have to be transformed such that they are approximately normal. Duan et al. [2007] used the Box-Cox transformation [Box and Cox, 1964] to make runoff data approximately normal. The Box-Cox transformation is given by:

\[ h(x) = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[4pt] \log x, & \lambda = 0 \end{cases} \tag{3} \]

where x are the original data and λ is the Box-Cox coefficient. For this study the parameter λ has been estimated by maximizing the likelihood of the Box-Cox transformed climatological observations under a normal distribution centered at the mean of the transformed observations, using the R function boxcox of the package MASS. The term climatology denotes here the collection of the observed runoff values from January 1974 to May 2007. This period does not overlap with the verification and the associated training periods mentioned in section 2. Hence, the estimate $\hat{\lambda}$ is independent of any model parameter values that were used to simulate discharge dynamics. Figure 2 shows a normal QQ-plot for the Box-Cox transformed climatology of Thur River. In the following we will use the Box-Cox parameter estimate $\hat{\lambda}$ for both observations and forecasts.
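The maximum likelihood estimation of the Box-Cox coefficient can be sketched as a grid search over the profile log-likelihood, in the spirit of MASS::boxcox. The Python version below is an illustrative reimplementation under the assumption of a normal model for the transformed data; the function names and the grid are our own.

```python
import math

def boxcox(x, lam):
    # Box-Cox transformation of equation (3)
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def boxcox_loglik(data, lam):
    """Profile log-likelihood of a normal model for the transformed data,
    including the Jacobian term (lam - 1) * sum(log x)."""
    z = [boxcox(x, lam) for x in data]
    n = len(z)
    mu = sum(z) / n
    var = sum((v - mu) ** 2 for v in z) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(x) for x in data)

def estimate_lambda(data, grid=None):
    """Grid-search MLE of the Box-Cox coefficient (cf. MASS::boxcox in R)."""
    if grid is None:
        grid = [i / 100.0 for i in range(-100, 201)]  # lambda in [-1, 2]
    return max(grid, key=lambda lam: boxcox_loglik(data, lam))
```

In practice the grid bounds and resolution would be chosen to bracket the likelihood maximum of the climatological sample.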

Figure 2.

Quantile-quantile plot against the theoretical quantiles of a normal distribution with appropriate mean and variance, verifying approximate normality of the Box-Cox transformed climatology of Thur River. The transformation uses the maximum likelihood estimate $\hat{\lambda}$.

[14] The Box-Cox transformed raw forecasts $h(r_{i,j})$ still need to be bias corrected. This may be done by linear regression of the Box-Cox transformed observations on the Box-Cox transformed forecasts of the training period, as in Raftery et al. [2005]. The complete model in the lead time independent case, with Box-Cox transformed and bias corrected data, is then:

\[ \varphi \mid r_{1,1}, \ldots, r_{I,m_I} \sim \sum_{i=1}^{I} \sum_{j=1}^{m_i} w_i\, \mathcal{N}\!\left(a_i + b_i\, h(r_{i,j}),\ \sigma_i^2\right) \tag{4} \]

where φ denotes the Box-Cox transformed observations, and $a_i$ and $b_i$ are regression parameters. In this study $b_i$ is set to 1, as we consider only a simple additive bias correction; tests with full linear regression, both standard and robust, performed worse. The weights $w_i$ and the variances $\sigma_i^2$ are estimated by maximum likelihood using the expectation-maximization (EM) algorithm [Dempster et al., 1977; McLachlan and Krishnan, 1997; Duan et al., 2007; Fraley et al., 2010].
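The EM iteration for the weights and variances can be sketched as follows. This is a deliberately simplified Python illustration for K non-exchangeable, bias corrected members at a single lead time with a common kernel variance; the actual study estimates member-specific parameters with exchangeability constraints in R, and all names below are our own.

```python
import math

def normpdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_bma(forecasts, obs, iters=200):
    """EM estimation of BMA weights w_k and a common variance sigma^2
    (cf. Raftery et al. [2005]; simplified sketch).

    forecasts : list of T lists, each with K bias corrected member forecasts
    obs       : list of T (transformed) observations
    """
    T, K = len(forecasts), len(forecasts[0])
    w = [1.0 / K] * K
    sigma = 1.0
    for _ in range(iters):
        # E step: responsibility z[t][k] that member k was best on day t
        z = []
        for f, y in zip(forecasts, obs):
            p = [w[k] * normpdf(y, f[k], sigma) for k in range(K)]
            s = sum(p) or 1.0
            z.append([pk / s for pk in p])
        # M step: update weights and the common kernel variance
        w = [sum(z[t][k] for t in range(T)) / T for k in range(K)]
        var = sum(z[t][k] * (obs[t] - forecasts[t][k]) ** 2
                  for t in range(T) for k in range(K)) / T
        sigma = math.sqrt(max(var, 1e-12))
    return w, sigma
```

A member whose forecasts track the observations closely receives a weight near one, while uninformative members are down-weighted.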

3.2. Multivariate BMA

[15] In traditional BMA, different lead times are considered as completely independent from each other. Hence, ensemble models not covering the same range of lead times do not cause any problems. If a model drops out at a particular lead time, one just estimates the BMA model without using it anymore. Now the independence assumption is relaxed by replacing univariate BMA by a multivariate BMA method that is based on multivariate normal distributions. Berrocal et al. [2007] combined BMA with the geostatistical output perturbation (GOP) method [Gel et al., 2004] in order to generate calibrated probabilistic spatial forecasts of whole weather fields. In the present multivariate approach space is replaced by lead time. That is, the goal is to estimate a BMA model that is able to generate calibrated probabilistic forecasts over an entire range of lead times simultaneously.

3.2.1. BMA Fit

[16] In a first step a BMA model is fitted as described in section 3.1. But now, instead of estimating the model parameters for each lead time separately, lead times are combined into blocks with the same model configuration. Within a block the number of available ensembles is constant. Additionally, the variance is assumed to be constant within a block, such that all lead times can be treated equally. Only G < L BMA models now have to be estimated per verification day, where G is the number of blocks and L denotes the number of lead times considered. All equations remain the same, but Tr, i.e., the set of dates and lead times in the training period, takes on a different meaning: it is now the set of all forecast lead times in a particular block, for all dates in the training period.

3.2.2. GOP Fit

[17] In order to connect the GOP method with the BMA fits, the joint transformed forecast needs to be written as

\[ \boldsymbol{\varphi} \mid \mathbf{r}_k \sim \mathcal{N}\!\left(a_k \mathbf{1} + b_k\, h(\mathbf{r}_k),\ \Sigma_k^{*}\right) \tag{5} \]

where $g_k(\boldsymbol{\varphi} \mid \mathbf{r}_k)$ denotes the conditional distribution of Box-Cox transformed runoffs for a block of lead times given the vector of forecasts $\mathbf{r}_k$ from model member k for the same block of lead times. $a_k$ and $b_k$ are bias correction coefficients, and $h(\mathbf{r}_k)$ are the Box-Cox transformed raw forecasts from model k. The covariance matrix $\Sigma_k^{*}$ of this conditional distribution is, up to a constant, given by $\Sigma_k$, the covariance matrix defining the error structure of the forecasts. In the following this covariance matrix needs to be estimated from the training data. As mentioned above, this can be done in the same way as in Gel et al. [2004] and Berrocal et al. [2007]. Their GOP model is stationary, isotropic, and based on the assumption of an exponential variogram. Note that the more general Matérn variogram may be an alternative. $\Sigma_k$ can thus be estimated by:

\[ \sigma_{i,j} = \rho_k^2\, \delta_{i,j} + \tau_k^2 \exp\!\left(-\frac{d_{i,j}}{r_k}\right) \tag{6} \]

where $\sigma_{i,j}$ is the (i, j)th element of $\Sigma_k$, $\delta_{i,j}$ is 1 if i = j and 0 otherwise, and $d_{i,j}$ is the lag between lead time i and lead time j. The coefficients $\rho_k^2$, $\tau_k^2$, and $r_k$ can be interpreted as the nugget, sill, and range coefficients of a geostatistical model [Chilès and Delfiner, 1999; Cressie, 1993]. Hence, the following exponential variogram can be fitted:

\[ \gamma_k(d) = \rho_k^2 + \tau_k^2 \left(1 - \exp\!\left(-\frac{d}{r_k}\right)\right), \quad d > 0 \tag{7} \]

where d is the lag between two lead times.
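Equations (6) and (7) translate directly into a covariance matrix over lead times. The following is a small Python sketch with illustrative parameter names (nugget2 standing for $\rho_k^2$, sill2 for $\tau_k^2$, and rng for $r_k$):

```python
import math

def exp_cov_matrix(L, nugget2, sill2, rng):
    """Covariance matrix over L lead times implied by equations (6)-(7):
    sigma_ij = nugget2 * delta_ij + sill2 * exp(-|i - j| / rng)."""
    return [[(nugget2 if i == j else 0.0) + sill2 * math.exp(-abs(i - j) / rng)
             for j in range(L)] for i in range(L)]
```

The resulting matrix is symmetric, with variance nugget2 + sill2 on the diagonal and covariances decaying exponentially with the lag between lead times.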

[18] The procedure for estimation of those parameters is explained in detail in Berrocal et al. [2007]. As we use a slightly adapted version, accounting for exchangeable ensemble members, a short sketch of the procedure is given here:

[19] 1. Obtain the bias corrected forecasts $\mathbf{f}_{i,j,t} = a_{i,g} + b_{i,g}\, h(\mathbf{r}_{i,j,t})$, where $\mathbf{f}_{i,j,t}$ denotes the vector of bias corrected forecasts from ensemble i and member j on day t of the training period Tr. The elements of $\mathbf{f}_{i,j,t}$ stand for the different lead times, so $\mathbf{f}_{i,j,t}$ can easily be partitioned according to the blocks of lead times. The bias correction parameters now also depend on the block g, and hence $a_{i,g}$ and $b_{i,g}$ are calculated for each ensemble using all lead times within block g. In this study, we have set $b_{i,g} = 1$.

[20] 2. Then the empirical variogram needs to be estimated. To this end, first calculate the matrix of forecast errors $\mathrm{err}_{i,t}$ for each model i and each training day t, where the rows of $\mathrm{err}_{i,t}$ indicate lead times and the columns indicate ensemble members. In the case of a deterministic model, $\mathrm{err}_{i,t}$ is a vector. From the forecast errors the empirical variogram $\hat{\gamma}_i(d)$ for ensemble i can be calculated using:

\[ \hat{\gamma}_i(d) = \frac{1}{2\, T\, m_i\, (L - d)} \sum_{t \in Tr} \sum_{j=1}^{m_i} \sum_{l=1}^{L-d} \left(\mathrm{err}_{i,t}(l, j) - \mathrm{err}_{i,t}(l + d, j)\right)^2 \tag{8} \]

for $d \in \{1, \ldots, L - 1\}$, where L is the largest lead time of a given model, $m_i$ is the number of members of model i, and T is the number of days in the training period Tr. Then the optimal exponential variogram parameters, now denoted by $\hat{\rho}_i^2$, $\hat{\tau}_i^2$, and $\hat{r}_i$, are estimated. This is done numerically by minimizing the following expression:

\[ \sum_{d=1}^{L-1} \left(\hat{\gamma}_i(d) - \gamma_i(d)\right)^2 \tag{9} \]

[21] 3. Then Σi can easily be obtained using equation (6).

[22] 4. $\Sigma_i$ does not account for the fact that the BMA variances $\sigma_i^2$ are conditional on the assumption that model i is the best. Hence, corrected covariance matrices $\hat{\Sigma}_i = \kappa_i\, \Sigma_i$ have to be calculated, where $\kappa_i = \sigma_i^2 / (\hat{\rho}_i^2 + \hat{\tau}_i^2)$ is the deflation factor.
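Steps 2 and 3 above can be sketched in Python as follows: estimate the empirical semivariances of equation (8) from the error series and fit the exponential variogram by least squares as in equation (9). The brute-force grid minimizer below stands in for the numerical optimizer actually used; all names are our own.

```python
import math

def empirical_variogram(errors, d):
    """Empirical semivariance at lag d, cf. equation (8).
    errors : list of error series, one per member and training day,
             each running over the lead times of the block."""
    num, cnt = 0.0, 0
    for e in errors:
        for l in range(len(e) - d):
            num += (e[l] - e[l + d]) ** 2
            cnt += 1
    return num / (2.0 * cnt)

def fit_exp_variogram(gamma_hat, grids):
    """Least-squares fit of equation (9) by brute force over
    (nugget2, sill2, range) candidate grids."""
    best, best_sse = None, float("inf")
    for n2 in grids[0]:
        for s2 in grids[1]:
            for r in grids[2]:
                sse = sum((g - (n2 + s2 * (1.0 - math.exp(-d / r)))) ** 2
                          for d, g in gamma_hat.items())
                if sse < best_sse:
                    best, best_sse = (n2, s2, r), sse
    return best
```

With semivariances generated from a known exponential variogram, the minimizer recovers the generating parameters whenever they lie on the grid.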

[23] As an illustration, let us take a closer look at the predictions for Thur River initialized at 00:00 UTC on 18 April 2008. As explained above, BMA models over entire blocks of lead times, i.e., 1–24, 25–72, 73–120, and 121–240 h in the case of the Thur data set, are estimated first. In a second step, the correlation structure of the errors among forecasts for different lead times within each model ensemble needs to be estimated. This is done by fitting an exponential variogram to the empirical semivariances of the bias corrected raw forecast errors, as shown in Figure 3. The exponential variograms fit the empirical variograms quite well for that particular training period. Though not shown here, variogram fits for other training periods are comparable.

Figure 3.

Empirical variograms and corresponding exponential fits for COSMO-2, COSMO-7, COSMO-LEPS, and ECMWF forecast errors calculated with data from 9 March to 7 April 2008.

[24] In a next step the multivariate forecast distributions have to be constructed by simulation. The challenge here is to combine the separate block BMA estimates across block boundaries while accounting for dependencies among different lead times. This problem is solved by using the property of the multivariate normal distribution that subvectors of a multivariate normal given the remaining elements are again multivariate normal with specified mean and covariance matrix. This property is explained in detail in, e.g., Krzanowski [2000]. Now a description of how this property is used to combine the BMA fits is given here. Let $F_t$ denote the set of all vectors of bias corrected forecasts for verification day t. The random vector of forecast Box-Cox transformed runoffs over the entire range of lead times, $\boldsymbol{\varphi} = (\varphi_1, \ldots, \varphi_L)^{\top}$, follows an L-variate normal distribution. The vector $\boldsymbol{\varphi}$ can now be partitioned as follows:

\[ \boldsymbol{\varphi} = \Big(\underbrace{\varphi_1, \ldots, \varphi_{l_1}}_{\boldsymbol{\varphi}_{(1)}^{\top}},\ \underbrace{\varphi_{l_1 + 1}, \ldots, \varphi_{l_2}}_{\boldsymbol{\varphi}_{(2)}^{\top}},\ \ldots,\ \underbrace{\varphi_{l_{G-1} + 1}, \ldots, \varphi_{l_G}}_{\boldsymbol{\varphi}_{(G)}^{\top}}\Big)^{\!\top} \]

where $l_g$ denotes the largest lead time of block g, and hence $L = l_G$. The distribution of $\boldsymbol{\varphi}$ can now be written as:

\[ p(\boldsymbol{\varphi} \mid F_t) = p\!\left(\boldsymbol{\varphi}_{(1)} \mid F_t\right) \prod_{g=2}^{G} p\!\left(\boldsymbol{\varphi}_{(g)} \mid \boldsymbol{\varphi}_{(1)}, \ldots, \boldsymbol{\varphi}_{(g-1)}, F_t\right) \]

[25] In addition, the bias corrected forecasts and the covariance matrices need to be partitioned accordingly. The forecast vector $\mathbf{f}_{i,j}$ of an ensemble covering lead times up to block g is partitioned by:

\[ \mathbf{f}_{i,j} = \left(\mathbf{f}_{i,j,(1)}^{\top},\ \mathbf{f}_{i,j,(2)}^{\top},\ \ldots,\ \mathbf{f}_{i,j,(g)}^{\top}\right)^{\!\top} \]

where $\mathbf{f}_{i,j,(g)}$ denotes the subvector for block g; the index t is dropped for better readability. The covariance matrix can be partitioned as follows:

\[ \hat{\Sigma}_i = \begin{pmatrix} \hat{\Sigma}_{i,(1,1)} & \hat{\Sigma}_{i,(1,2)} & \cdots & \hat{\Sigma}_{i,(1,g)} \\ \hat{\Sigma}_{i,(2,1)} & \hat{\Sigma}_{i,(2,2)} & \cdots & \hat{\Sigma}_{i,(2,g)} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\Sigma}_{i,(g,1)} & \hat{\Sigma}_{i,(g,2)} & \cdots & \hat{\Sigma}_{i,(g,g)} \end{pmatrix} \]

where $\hat{\Sigma}_{i,(1,1)}$ denotes the covariances among lead times 1 to $l_1$, $\hat{\Sigma}_{i,(2,2)}$ those among lead times $l_1 + 1$ to $l_2$, and $\hat{\Sigma}_{i,(g,g)}$ those among lead times $l_{g-1} + 1$ to $l_g$. Ensembles covering only block 1 do not need any partitioning.

[26] Now the resulting forecast distribution for any verification day, t (index t omitted in the following), can be simulated by iterating the following steps:

[27] 1. Select a member (i, j) by sampling with weights wi from the BMA fit for block 1.

[28] 2. Obtain a simulated realization $\tilde{\boldsymbol{\varphi}}_{(1)}$ from the distribution $\mathcal{N}\big(\mathbf{f}_{i,j,(1)},\ \hat{\Sigma}_{i,(1,1)}\big)$.

[29] 3. Select a member (i, j) by sampling with weights wi from the BMA fit for block 2. Note that the number of available ensemble models usually decreases from block to block and that the weights wi are now different from those in step 1.

[30] 4. Obtain $\tilde{\boldsymbol{\varphi}}_{(2)}$ by simulating from:

\[ \boldsymbol{\varphi}_{(2)} \mid \boldsymbol{\varphi}_{(1)} = \tilde{\boldsymbol{\varphi}}_{(1)} \sim \mathcal{N}\!\left(\mathbf{f}_{i,j,(2)} + \hat{\Sigma}_{i,(2,1)} \hat{\Sigma}_{i,(1,1)}^{-1} \big(\tilde{\boldsymbol{\varphi}}_{(1)} - \mathbf{f}_{i,j,(1)}\big),\ \hat{\Sigma}_{i,(2,2)} - \hat{\Sigma}_{i,(2,1)} \hat{\Sigma}_{i,(1,1)}^{-1} \hat{\Sigma}_{i,(1,2)}\right) \tag{10} \]

[31] 5. For blocks $g = 3, \ldots, G$ do the following:

[32] 1. Select a member (i, j) by sampling with weights wi from the BMA fit for block g.

[33] 2. Obtain $\tilde{\boldsymbol{\varphi}}_{(g)}$ by simulating from:

\[ \boldsymbol{\varphi}_{(g)} \mid \boldsymbol{\varphi}_{(1:g-1)} = \tilde{\boldsymbol{\varphi}}_{(1:g-1)} \sim \mathcal{N}\!\left(\mathbf{f}_{i,j,(g)} + \hat{\Sigma}_{i,(g,1:g-1)} \hat{\Sigma}_{i,(1:g-1,1:g-1)}^{-1} \big(\tilde{\boldsymbol{\varphi}}_{(1:g-1)} - \mathbf{f}_{i,j,(1:g-1)}\big),\right. \]
\[ \left. \hat{\Sigma}_{i,(g,g)} - \hat{\Sigma}_{i,(g,1:g-1)} \hat{\Sigma}_{i,(1:g-1,1:g-1)}^{-1} \hat{\Sigma}_{i,(1:g-1,g)}\right) \tag{11} \]

where $\tilde{\boldsymbol{\varphi}}_{(1:g-1)} = \big(\tilde{\boldsymbol{\varphi}}_{(1)}^{\top}, \ldots, \tilde{\boldsymbol{\varphi}}_{(g-1)}^{\top}\big)^{\!\top}$ stacks the realizations simulated for blocks 1 to g − 1.

[34] According to Berrocal et al. [2007] each iteration can be interpreted as an ensemble member from the multivariate BMA model. Hence, after many iterations, the theoretical quantiles for any lead time can be approximated by the empirical quantiles of the simulated ensemble.
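The simulation loop can be illustrated in a deliberately reduced setting with two blocks of one lead time each, so that the conditional normal of equation (10) becomes scalar. The member dictionaries and function names below are hypothetical sketches, not the authors' implementation, which operates on whole blocks with matrix algebra.

```python
import math, random

def conditional_normal(mu1, mu2, var1, var2, cov12, phi1):
    """Scalar special case of equation (10): mean and variance of the
    block-2 value given a simulated block-1 value phi1."""
    mean = mu2 + cov12 / var1 * (phi1 - mu1)
    var = var2 - cov12 ** 2 / var1
    return mean, var

def simulate_two_blocks(members1, members2, rng=random):
    """Sketch of one iteration: sample a member per block by its BMA
    weight, then draw block 2 conditionally on the block-1 draw.
    Each member is a dict with weight w, means mu1/mu2, variances
    var1/var2, and cross-covariance cov12 (hypothetical structure)."""
    m1 = rng.choices(members1, weights=[m["w"] for m in members1])[0]
    phi1 = rng.gauss(m1["mu1"], math.sqrt(m1["var1"]))
    m2 = rng.choices(members2, weights=[m["w"] for m in members2])[0]
    mean, var = conditional_normal(m2["mu1"], m2["mu2"], m2["var1"],
                                   m2["var2"], m2["cov12"], phi1)
    phi2 = rng.gauss(mean, math.sqrt(max(var, 0.0)))
    return phi1, phi2
```

Repeating the loop many times yields the simulated ensemble from which empirical quantiles are read off.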

4. Results

4.1. Verification Tools

[35] As stated in section 1, the goal of statistical postprocessing is to obtain well calibrated and sharp predictive density functions of the quantity to be forecast [Raftery et al., 2005; Gneiting et al., 2007]. In the following, some methods for assessing calibration and sharpness are introduced. Calibration can be examined by the probability integral transform (PIT). The continuous ranked probability score (CRPS) addresses both calibration and sharpness. The widths of the prediction intervals are a simple measure of sharpness.

4.1.1. Calibration

[36] As stated above, the PIT is a measure of calibration. The PIT is given by:

\[ \mathrm{PIT}_i = \int_{-\infty}^{y_i} \rho(y \mid x_i)\, \mathrm{d}y \tag{12} \]

where $\rho(y \mid x_i)$ is the conditional predictive probability density of the response variable $y_i$, i.e., observed runoff in this work, given the forecast ensemble $x_i$. Rosenblatt [1952] showed that $\mathrm{PIT}_i \sim \mathcal{U}(0, 1)$ for a probabilistic forecast with perfect calibration. The PIT can be visualized by a histogram of the relative frequencies with which the observations fall into the quantile intervals of the forecast density. In the case of discrete predictive distributions, like ensemble forecasts or quantiles of continuous predictive distributions, the deviation from uniformity can be estimated by the reliability index [Delle Monache et al., 2006; Berrocal et al., 2007]. To this end the relative frequencies of the ranks of the observations among the discrete forecasts in the training period are determined. From these, the negatively oriented reliability index can be calculated by:

\[ \Delta = \sum_{j=1}^{m+1} \left| f_j - \frac{1}{m + 1} \right| \tag{13} \]

where $f_j$ denotes the relative frequency of rank j, and m is the number of discrete forecast values.
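Equation (13) amounts to comparing the observed rank frequencies with the uniform frequency 1/(m + 1). A minimal Python sketch (illustrative names):

```python
def reliability_index(ranks, m):
    """Reliability index of equation (13): total deviation of the observed
    rank frequencies f_j from uniformity over the m + 1 possible ranks."""
    n = len(ranks)
    freq = [ranks.count(j) / n for j in range(1, m + 2)]
    return sum(abs(f - 1.0 / (m + 1)) for f in freq)
```

A perfectly uniform rank histogram yields an index of zero, while a strongly biased forecast, whose observations pile up in one rank, yields a large index.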

4.1.2. Continuous Ranked Probability Score

[37] Hersbach [2000] discusses the CRPS and its decomposition in detail; nevertheless, a short summary is provided here. The CRPS assesses how much the predicted and the observed cumulative distributions differ from each other. Denote runoff, or any other quantity of interest, by y and the predictive density by ρ(y), and let $y_i$ again be the observed runoff. Then the CRPS is:

\[ \mathrm{crps}(P, y_i) = \int_{-\infty}^{\infty} \left(P(y) - P_i(y)\right)^2 \mathrm{d}y \tag{14} \]

where the cumulative distributions P and Pi are given by:

\[ P(y) = \int_{-\infty}^{y} \rho(t)\, \mathrm{d}t \tag{15} \]
\[ P_i(y) = H(y - y_i) \tag{16} \]

where H(y) is the Heaviside function:

\[ H(y) = \begin{cases} 0, & y < 0 \\ 1, & y \ge 0 \end{cases} \tag{17} \]

[38] The best possible CRPS value is zero, attained by a deterministic forecast that equals the observation, i.e., $P = P_i$. The lower the sharpness and the larger the bias, the higher the CRPS; hence, the CRPS should be as small as possible. Usually one is interested in $\overline{\mathrm{CRPS}}$, i.e., the average CRPS over the verification period. The CRPSS, its associated skill score, can be calculated by

\[ \mathrm{CRPSS} = 1 - \frac{\overline{\mathrm{CRPS}}}{\overline{\mathrm{CRPS}}_{\mathrm{clim}}} \tag{18} \]

where the uncertainty $\overline{\mathrm{CRPS}}_{\mathrm{clim}}$ equals the $\overline{\mathrm{CRPS}}$ of the climatological forecast, i.e., of the empirical density of the climatology. Note that the CRPSS of a perfect forecast equals 1, while CRPSS values below zero indicate that the forecast is worse than the climatological forecast in terms of $\overline{\mathrm{CRPS}}$. We have calculated $\overline{\mathrm{CRPS}}$ and $\mathrm{CRPSS}$ using the R package verification.
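For ensemble or sampled forecasts, the CRPS of equation (14) can be computed in its equivalent kernel form, and the skill score of equation (18) follows directly. A short Python sketch with illustrative names (the study itself used the R package verification):

```python
def crps_ensemble(samples, y):
    """Kernel form of the CRPS in equation (14) for a sample-based
    forecast: mean |x_j - y| - 0.5 * mean |x_j - x_k|."""
    m = len(samples)
    t1 = sum(abs(x - y) for x in samples) / m
    t2 = sum(abs(a - b) for a in samples for b in samples) / (2.0 * m * m)
    return t1 - t2

def crpss(mean_crps_forecast, mean_crps_climatology):
    """Skill score of equation (18) relative to climatology."""
    return 1.0 - mean_crps_forecast / mean_crps_climatology
```

A single-member forecast that hits the observation exactly scores zero, the best possible value.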

4.1.3. Multivariate Generalizations of PIT Histogram and CRPS

[39] The scoring rules introduced above assess only the univariate properties of the marginals of the BMA forecasts. In order to examine the dependency structure among different lead times multivariate generalizations of the PIT histogram and the CRPS are needed.

[40] Gneiting et al. [2008] introduced the multivariate rank histogram as a multivariate generalization of the PIT histogram. In this study, multivariate ranks are assigned in a similar way, but with an adjustment to account for the high dimensionality of the data at hand. As proposed by Pinson and Girard [2012], dimensions are reduced by an orthogonal transformation. The most obvious way to do this is to consider only the first few components of the principal component transform, as in the following estimation procedure:

[41] 1. Randomly sample m realizations $\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m$ from the predictive distribution at hand. Collect the realizations and the corresponding observation $\mathbf{x}_0$ in the set $S = \{\mathbf{x}_0, \tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m\} \subset \mathbb{R}^L$, where L is the number of lead times considered.

[42] 2. Reduce the dimension by applying a principal component transform to the set S. Since only the first d principal components are retained, the transformed set $S^{*}$ consists of vectors of dimension d.

[43] 3. Calculate preranks $r_j = \sum_{n=0}^{m} \mathbb{1}\{\mathbf{x}_n^{*} \preceq \mathbf{x}_j^{*}\}$, where $\mathbb{1}$ denotes the indicator function, and $\mathbf{x}_n^{*} \preceq \mathbf{x}_j^{*}$ if and only if $x_{n,k}^{*} \le x_{j,k}^{*}$ for all $k = 1, \ldots, d$. A prerank of n for vector j indicates, for instance, that there exist n − 1 other vectors whose elements are all smaller than or equal to the corresponding elements of vector j.

[44] 4. The multivariate ranks then correspond to the preranks, with ties, i.e., multiple vectors with equal preranks, resolved at random.

[45] Without dimension reduction, preranks are very likely to equal one due to the high dimensionality of the 240 lead times considered. Hence, the multivariate ranks would be assigned essentially at random, leading to a flat multivariate rank histogram in almost every case. The downside of the dimension reduction approach is that the number of principal components used has to be selected deliberately.
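Steps 3 and 4 of the ranking procedure can be sketched in Python as below, with the principal component reduction of step 2 omitted for brevity (in high dimensions it would be applied to the vectors first). The function names are our own.

```python
import random

def preranks(vectors):
    """Preranks of step 3: vector j's prerank is the number of vectors
    (itself included) that are component-wise <= vector j."""
    pr = []
    for v in vectors:
        pr.append(sum(1 for u in vectors
                      if all(ui <= vi for ui, vi in zip(u, v))))
    return pr

def multivariate_rank(obs_vec, fcst_vecs, rng=random):
    """Multivariate rank of the observation among forecast samples,
    with ties resolved at random (step 4)."""
    allv = [obs_vec] + fcst_vecs
    pr = preranks(allv)
    below = sum(1 for p in pr[1:] if p < pr[0])
    ties = sum(1 for p in pr[1:] if p == pr[0])
    return below + 1 + rng.randint(0, ties)
```

An observation dominated component-wise by every forecast sample receives rank 1, populating the leftmost bin of the rank histogram.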

[46] Gneiting and Raftery [2007] and Gneiting et al. [2008] proposed the energy score (ES) as a multivariate generalization of the CRPS. For univariate quantities the energy score equals the CRPS, with which it shares its negative orientation. It is given by

\[ \mathrm{ES}(P, \mathbf{x}) = \mathbb{E}_P \|\mathbf{X} - \mathbf{x}\| - \frac{1}{2}\, \mathbb{E}_P \|\mathbf{X} - \mathbf{X}'\| \tag{19} \]

where $\|\cdot\|$ denotes the Euclidean norm, and X and X′ are independent random vectors following the same distribution P. For an ensemble forecast $P_{\mathrm{ens}}$ of size M the energy score can be calculated by:

\[ \mathrm{ES}(P_{\mathrm{ens}}, \mathbf{x}) = \frac{1}{M} \sum_{j=1}^{M} \|\mathbf{x}_j - \mathbf{x}\| - \frac{1}{2 M^2} \sum_{j=1}^{M} \sum_{k=1}^{M} \|\mathbf{x}_j - \mathbf{x}_k\| \tag{20} \]

where $\mathbf{x}_1, \ldots, \mathbf{x}_M$ denote the members of $P_{\mathrm{ens}}$ and $\mathbf{x}$ is the vector of verifying observations over all prediction lags. If P is a density forecast, the energy score can be approximated by randomly sampling $\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_k$ from P:

\[ \mathrm{ES}(P, \mathbf{x}) \approx \frac{1}{k} \sum_{j=1}^{k} \|\tilde{\mathbf{x}}_j - \mathbf{x}\| - \frac{1}{2 k^2} \sum_{j=1}^{k} \sum_{l=1}^{k} \|\tilde{\mathbf{x}}_j - \tilde{\mathbf{x}}_l\| \tag{21} \]

where k is the size of the random sample.
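Equations (20) and (21) can be computed directly from a set of forecast vectors. A small Python sketch (illustrative names):

```python
import math

def energy_score(samples, obs):
    """Sample-based energy score of equations (20)/(21).
    samples : list of forecast vectors; obs : observation vector."""
    def norm(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    m = len(samples)
    t1 = sum(norm(x, obs) for x in samples) / m
    t2 = sum(norm(a, b) for a in samples for b in samples) / (2.0 * m * m)
    return t1 - t2
```

For one-dimensional vectors the value coincides with the kernel form of the CRPS, as noted above.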

4.2. Summary Results

[47] In the following, the univariate and the multivariate BMA approach are compared by means of summary results over the entire verification period.

4.2.1. CRPSS

[48] First, we analyze the CRPSS over the entire range of lead times up to 240 h. As shown in Figure 4, both BMA approaches, univariate and multivariate, lead to an increase in CRPSS compared to the raw ensemble for lead times up to 170 h. Note that any CRPSS value below zero indicates that using the climatology as a prediction would outperform the actual forecast. For lead times up to 72 h univariate and multivariate BMA are equally good. From 73 to 120 h the multivariate model outperforms its univariate counterpart. While the dropout of COSMO-2 and COSMO-7 at lead times 24 h and 72 h, respectively, does not lead to a sudden decrease in CRPSS, losing the COSMO-LEPS forecasts at 120 h has a big impact. The multivariate approach dampens this effect, resulting in better CRPSS values for lead times between 120 and 150 h.

Figure 4.

CRPSS of the BMA fits. The horizontal dashed line corresponds to the CRPSS of the climatology, which is zero by definition. The vertical dotted lines indicate the lead times at which COSMO-2, COSMO-7, or COSMO-LEPS drop out.

4.2.2. Calibration and Sharpness

[49] Now we investigate calibration by means of a modified, i.e., time dependent, 3-D PIT histogram. Figure 5 compares the PIT of the raw ensemble with that of the BMA models. The raw ensemble is underdispersed over the entire range of lead times. The BMA models are much better calibrated and do not show the cyclic calibration pattern of the raw ensemble. Differences between the univariate and the multivariate model are very small and do not seem to be systematic except for a drop in the (0.9, 1.0] interval right after lead time 120 h in case of multivariate BMA. Note that both models exhibit a frequency shift from interval (0.0, 0.1] to the intervals (0.2, 0.3] and (0.3, 0.4] at lead time 120 h.

Figure 5.

3-D PIT histograms for lead times 1–240 h for the raw ensemble and both the univariate and multivariate BMA models. Each lead time slice corresponds to the marginal PIT histogram for a particular time lag. Perfect calibration would result in a horizontal plane at 0.1.

[50] After having assessed calibration in more detail, we now take a closer look at the sharpness. This is done by comparing the mean width of the 95% prediction intervals of the BMA models with the 95% width of a lead time dependent climatological forecast. Lead time dependent climatology stands for the empirical density of observations at a particular time of day. That is, we have 24 climatologies instead of 1, because the raw and BMA forecasts are provided on an hourly basis. For lead times up to 120 h sharpness of the BMA models decreases linearly with increasing lead time as shown in Figure 6. At lead time 120 h the BMA models are still clearly sharper than the climatology. The univariate and the multivariate BMA models are quite similar in terms of sharpness up to lead time 120 h. The multivariate model is slightly sharper for lead times from 72 to 120 h. Interestingly, the multivariate model exhibits a prominent decrease in sharpness at lead time 120 h, while this is not the case for the univariate model. For lead times beyond 150 h the BMA models are not sharper than the corresponding climatologies.

Figure 6.

Mean width of the 95% prediction intervals against lead time. The dashed line corresponds to the 95% prediction interval width of a climatology that depends on the time of day. The vertical dotted lines indicate the lead times at which COSMO-2, COSMO-7, and COSMO-LEPS drop out.
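A sharpness comparison of this kind can be sketched as follows. The function names are ours; the Gaussian form of the forecast intervals is an illustrative assumption (in the study, the intervals come from the Box-Cox-transformed BMA mixtures):

```python
import numpy as np
from scipy.stats import norm

def mean_interval_width(sigma, level=0.95):
    """Mean width of central prediction intervals of Gaussian forecasts:
    each interval has width 2 * z_{(1+level)/2} * sigma."""
    z = norm.ppf(0.5 + level / 2.0)
    return float(np.mean(2.0 * z * np.asarray(sigma)))

def climatology_width(obs, hour_of_day, level=0.95):
    """Width of the central 95% interval of an empirical climatology,
    computed separately for each time of day (24 climatologies)."""
    lo, hi = (1 - level) / 2, 1 - (1 - level) / 2
    widths = {}
    for h in range(24):
        x = obs[hour_of_day == h]
        widths[h] = float(np.quantile(x, hi) - np.quantile(x, lo))
    return widths
```

Plotting the forecast widths against the climatological widths per lead time reproduces the kind of comparison shown in Figure 6.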

4.2.3. Multivariate Evaluation

[51] As shown above, both the univariate and the multivariate BMA method lead to marginally well calibrated forecasts, which are also of similar sharpness. In terms of multivariate evaluation methods, however, multivariate BMA performs much better than its univariate counterpart. Figure 7 compares multivariate rank histograms that have been obtained by using only the first few components of the principal component transform. Multivariate BMA is likely to be well calibrated in multivariate terms, whereas the bell-shaped histograms of univariate BMA indicate poor calibration. Table 2 shows that in the case of multivariate BMA the first six principal components explain 86% of the variance, while also leading to a low reliability index. For univariate BMA, 40 principal components are needed to explain only 79% of the variance. The reliability indices of univariate BMA are comparable with those of multivariate BMA only when ≥10 principal components are used. Using more than 10 principal components, however, means calculating a multivariate rank histogram in ≥10 dimensions. This typically leads to flat rank histograms even for badly calibrated ensembles, because the observation and most of the ensemble members are then very likely to have prerank 1, which leads to randomly distributed observation ranks.

Figure 7.

Multivariate rank histograms of univariate (left) and multivariate (right) BMA forecasts. The histograms are shown for the first 2, 4, and 6 principal components.
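The dimension-reduced multivariate rank histogram can be sketched as follows. This is an illustrative implementation of the standard multivariate ranking with componentwise preranks and random tie-breaking, combined with a PCA projection; it is not the authors' code:

```python
import numpy as np

def pca_reduce(x, n_pc):
    """Project the rows of x onto its first n_pc principal components."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:n_pc].T

def multivariate_rank(obs, ens, rng):
    """Multivariate rank of obs among m ensemble members: prerank each of
    the m+1 vectors by the number of vectors it dominates componentwise,
    then resolve ties at random. The rank lies in 1..m+1."""
    x = np.vstack([obs, ens])                              # (m+1, d)
    pre = np.array([np.sum(np.all(x <= xi, axis=1)) for xi in x])
    below = np.sum(pre < pre[0])
    ties = np.sum(pre == pre[0])                           # >= 1 (obs itself)
    return int(below + rng.integers(ties) + 1)
```

Collecting `multivariate_rank` over many verification cases, after projecting observation and ensemble jointly with `pca_reduce`, yields a histogram of the kind shown in Figure 7; in high dimensions most preranks collapse to 1, which is why the dimension reduction is needed.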

Table 2. Proportion of Variances and Corresponding Reliability Indices^a

n PC | Proportion of Variance (uni BMA) | Proportion of Variance (mult BMA) | Reliability Index (uni BMA) | Reliability Index (mult BMA)

^a Proportion of variance explained by the first n principal components and the corresponding reliability indices for univariate and multivariate BMA.


[52] As stated above, the ES is a multivariate generalization of the CRPS, which assesses sharpness and multivariate calibration simultaneously. Table 3 shows the mean and median ES values over all verification dates. The ES values confirm the multivariate rank histograms, which indicate that multivariate BMA performs better than univariate BMA. Note that even the ensemble consisting of the 69 ordered quantiles exhibits a higher mean and median ES than a random sample of size k = 1000 from the multivariate BMA forecast distributions.

Table 3. Mean and Median Energy Scores Over the Training Period^a

                   Mean ES   Median ES
uni BMA 69 quant    0.865      0.631
uni BMA mc          0.975      0.709
mult BMA mc         0.831      0.598

^a In the case of univariate BMA, the ES has been calculated once from the exact quantiles (69 in total) of the forecast distribution and once using the random sampling (Monte Carlo) approach with sample size k = 1000. For multivariate BMA, the ES has been obtained using a random sample of size k = 1000.
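The sample-based ES estimator behind such comparisons can be sketched as follows (an illustrative implementation, not the authors' code): the score is the mean Euclidean distance of the m ensemble members to the observation vector, minus half the mean pairwise distance among the members; lower values are better.

```python
import numpy as np

def energy_score(ens, obs):
    """Sample-based energy score of an ensemble of m d-dimensional
    members against one d-dimensional observation vector."""
    ens = np.asarray(ens, float)   # shape (m, d)
    obs = np.asarray(obs, float)   # shape (d,)
    t1 = np.mean(np.linalg.norm(ens - obs, axis=1))
    diff = ens[:, None, :] - ens[None, :, :]
    t2 = np.mean(np.linalg.norm(diff, axis=2))
    return t1 - 0.5 * t2
```

In one dimension this estimator reduces to the usual sample CRPS, which makes the ES a natural multivariate counterpart for evaluating whole forecast trajectories over all lead times at once.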

5. Discussion

[53] We first consider univariate verification tools only. Both the univariate and the multivariate BMA method improve calibration and sharpness relative to the raw forecast for lead times up to 120 h. Beyond lead time 150 h, neither BMA method performs better than the raw forecasts. As stated in section 4, multivariate BMA improves the CRPSS significantly between lead times 120 and 150 h. Since COSMO-LEPS, a regional ensemble model with high spatial resolution, provides forecasts only up to 120 h, the pronounced drop in CRPSS from lead time 120 to 121 h for the raw ensemble and the univariate BMA approach is not surprising. Note that the high-resolution deterministic models COSMO-2 and COSMO-7 provide forecasts only up to 24 and 72 h, respectively. Hence, multivariate BMA is probably able to exploit the more valuable COSMO-LEPS forecasts during the first 10–20 h after COSMO-LEPS drops out. As this is reflected in the CRPSS but not in sharpness, multivariate BMA seems to improve mainly reliability. This might be due to the higher spread of the COSMO-LEPS ensemble compared to the ECMWF ensemble. Consequently, multivariate BMA is less sharp for lead times right after the dropout of COSMO-LEPS. From these considerations it follows that, even in terms of univariate verification, multivariate BMA is to be preferred when the most valuable models drop out at an early lead time. In terms of multivariate verification tools like the energy score or the multivariate rank histogram, multivariate BMA performs much better than its univariate counterpart. Additionally, we have shown that the multivariate rank histogram can be used to assess high-dimensional data. To this end, the dimension has to be reduced by an adequate selection of the first few components of the principal component transform. Further investigations on multivariate verification might be valuable to overcome the loss of information due to principal component selection.

[54] As stated in section 3.2, the dependencies among lead times in the multivariate BMA model may be modeled by the more general Matérn variogram model instead of the exponential variogram model. On the one hand, this would result in more accurate variogram fits; on the other hand, it would make the fitting process more complex and more time consuming. As a test, of which we do not show further details here, we have fitted both models to the average empirical variogram over all training periods used for this study. The exponential variogram fits proved to be only slightly worse in this case. This may differ in other hydrologic settings and needs to be checked separately for each catchment.
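Fitting an exponential variogram to an empirical variogram can be sketched with a simple least-squares fit. The empirical values below are made up for illustration and are not those of the study:

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_variogram(h, sill, vrange):
    """Exponential variogram gamma(h) = sill * (1 - exp(-h / vrange))."""
    return sill * (1.0 - np.exp(-h / vrange))

# Hypothetical empirical variogram of (Box-Cox-transformed) runoff
# against lead-time lag h in hours -- illustrative numbers only.
h = np.array([1.0, 6.0, 12.0, 24.0, 48.0, 96.0, 144.0, 192.0])
gamma_emp = np.array([0.05, 0.22, 0.35, 0.52, 0.68, 0.78, 0.81, 0.82])

(sill, vrange), _ = curve_fit(exp_variogram, h, gamma_emp, p0=(1.0, 50.0))
```

A Matérn variogram would add a smoothness parameter to this fit, improving flexibility at the cost of a harder, slower optimization, which is the trade-off discussed above.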

6. Conclusions

[55] We have shown that multivariate BMA provides a valuable tool for postprocessing probabilistic forecasts for the Thur River over an entire range of lead times simultaneously. It performs much better than its univariate counterpart in terms of multivariate verification measures and, more surprisingly, it also outperforms univariate BMA in terms of univariate verification. Multivariate BMA especially improves the reliability of the prediction when the forecast system changes, i.e., when fewer models are available because some drop out at particular lead times. In order to generalize these findings, the strength of multivariate BMA should be verified for other catchments of different sizes and flow regimes. Additionally, multivariate BMA needs to be contrasted with other multivariate postprocessing methods like Ensemble Copula Coupling (Schefzik et al., Uncertainty quantification in complex simulation models using ensemble copula coupling, to be published in Statistical Science, 2013). Due to its inherent correlation structure, sampling randomly from the multivariate BMA forecast distribution leads to more realistic predicted trajectories than univariate approaches. Furthermore, in spite of its complexity, multivariate BMA requires rather little computing time, since one single BMA model typically applies to a whole range of lead times. However, multivariate BMA also has some shortcomings: first, it is much more complex than the univariate BMA approach. Second, multivariate BMA would systematically fail in cases of sudden changes in the variance structure among different lead times. Note that changes in the variance structure from the training period to the verification day would be even more problematic. Further improvements of multivariate BMA may go in the direction of applying different models depending on river flow regimes like winter low flow, summer low flow, et cetera [Tongal et al., 2013]. One of the main tasks would be to find sound flow regime differentiation criteria.


[56] We are grateful to H. R. Künsch of the Seminar for Statistics at ETH Zürich for advising the master thesis of S. Hemri, on which this article is based. T. Gneiting, M. Scheuerer, and other colleagues of the Institute for Applied Mathematics at Heidelberg University are thanked for helpful discussions and inputs. We would like to thank three anonymous reviewers for their helpful comments. We are indebted to the Swiss Federal Office of Meteorology and Climatology MeteoSwiss and to ECMWF for granting access to their products. This research was partly funded by grants of the EU FP7 Project IMPRINTS (grant 226555/FP7-ENV-2008-1-226555) and by the Department of Waste, Water, Energy and Air (AWEL) of the Canton of Zurich.