Seasonal streamflow forecasts are valuable for planning and allocation of water resources. In Australia, the Bureau of Meteorology employs a statistical method to forecast seasonal streamflows. The method uses predictors that are related to catchment wetness at the start of a forecast period and to climate during the forecast period. For the latter, a predictor is selected from a number of lagged climate indices as candidates to give the “best” model in terms of model performance in cross validation. This study investigates two strategies for further improvement in seasonal streamflow forecasts. The first is to combine, through Bayesian model averaging, multiple candidate models with different lagged climate indices as predictors, to take advantage of different predictive strengths of the multiple models. The second strategy is to introduce additional candidate models, using rainfall and sea surface temperature predictions from a global climate model as predictors. This is to take advantage of the direct simulations of various dynamic processes. The results show that combining forecasts from multiple statistical models generally yields more skillful forecasts than using only the best model and appears to moderate the worst forecast errors. The use of rainfall predictions from the dynamical climate model marginally improves the streamflow forecasts when viewed over all the study catchments and seasons, but the use of sea surface temperature predictions provides little additional benefit.
 Seasonal streamflow forecasts are important for management of water resources. In Australia, the Bureau of Meteorology (BOM) provides a seasonal streamflow forecasting service. The forecasts are issued at the start of each month and predict the total unregulated inflow (hereinafter called streamflow) volumes for the next 3 months (hereinafter called season) at each forecast site (hereinafter called catchment). The forecasts are produced using a statistical technique, specifically the Bayesian joint probability (BJP) modeling approach [Wang et al., 2009; Wang and Robertson, 2011]. Separate forecasting models are established for each catchment and overlapping season to account for spatial variability and seasonality.
 The seasonal streamflow forecasts exploit two sources of predictability. The first is the amount of water held in a catchment (as snow, in surface storages, in the soil and in groundwater) at the time the forecast is made, which we term “catchment wetness”. The second is the climate during the forecast period. In the operational forecasting system, one predictor is used to represent each source of predictability. The predictors used in the forecasting models are chosen from a pool of candidates according to their predictive performance for retrospective cross-validation forecasts [Robertson and Wang, 2012a]. Candidate predictors representing catchment wetness include observed antecedent rainfall and streamflow totals. Candidate predictors used to represent the climate during the forecast period include the lagged climate indices.
 Choosing the best predictors can be problematic. A choice based on historical data is subject to sampling error, especially when the underlying relationships are relatively weak [Wang et al., 2012a]. In addition, several competing models can show similar overall performance but produce quite different forecasts for individual events. Choosing only one model excludes other plausible models and ignores model uncertainty [Beven and Binley, 1992; Raftery et al., 2005; Wang et al., 2012a].
 An alternative is to include all candidate models, and weight each model according to its predictive performance [Casanova and Ahrens, 2009; Wang et al., 2012a]. Several studies have pointed out the advantages of combining forecasts from multiple models [Casey, 1995; Rajagopalan et al., 2002; Coelho et al., 2004; Luo et al., 2007; Stephenson et al., 2005; Regonda et al., 2006; Devineni et al., 2008; Bracken et al., 2010; Wang et al., 2012a]. For example, Rajagopalan et al. [2002] used a Bayesian approach to combine categorical climate forecasts derived from forecast ensembles of a number of general circulation models (GCMs). The posterior probabilities of the combined forecast, calculated under the assumption of a multinomial process, showed improvement over individual model forecasts. Bracken et al. [2010] and Regonda et al. [2006] combined forecasts from multiple candidate models based on an objective criterion measuring the “predictive risk” of the candidate models, and reported that forecast combination resulted in improved seasonal streamflow forecast performance over using only the “best” model.
 Among various methods of model combination, Bayesian model averaging (BMA) has been reported to be an effective way of combining forecasts from multiple models [Hoeting et al., 1999; Neuman, 2003; Raftery et al., 1997, 2005; Ajami et al., 2007]. In the classical BMA approach, model weights are based on posterior model probabilities [Hoeting et al., 1999; Neuman, 2003]. However, BMA can also be formulated as a mixture model problem, where weights are derived by maximizing the likelihood function of the combined model. The maximization can be achieved using the expectation-maximization (EM) algorithm [Raftery et al., 1997, 2005; Ajami et al., 2007].
Wang et al. [2012a] further developed the mixture model approach to BMA. Firstly, they applied a prior for the weights to slightly favor an outcome of more evenly distributed weights (and thus consensus forecasts). The use of the prior results in more stable weights in the face of large sampling uncertainty. Secondly, they used a cross-validation likelihood function, instead of the classical likelihood function, to maximize the “predictive” rather than the “fitting” ability of the combined model [Shinozaki et al., 2010]. Wang et al. [2012a] applied BMA to combine seasonal rainfall forecasts from multiple statistical models. They showed that the combined rainfall forecasts are superior to forecasts generated from the best individual model.
 While the approach of combining forecasts from multiple models can be effective, it relies on candidate models that perform well individually. Recent studies show that it is possible to improve seasonal streamflow forecasts by incorporating outputs of dynamical hydrological models into statistical forecasting methods [Robertson et al., 2013; Rosenberg et al., 2011]. Hydrological models capture some of the catchment physical processes and can therefore better represent catchment wetness than simply using antecedent streamflow or rainfall as a predictor.
 In the BJP model currently used by the BOM, lagged climate indices are used as predictors of future climate. While there is evidence that lagged climate indices can be useful for forecasting seasonal rainfall, the forecast skill that can be achieved is modest [Schepen et al., 2012b; Wang et al., 2012a]. Indeed, Robertson and Wang [2012b] and Robertson et al. [2013] showed that most of the seasonal streamflow forecast skill comes from the knowledge of initial catchment wetness. A natural question is whether seasonal streamflow forecasts can be further improved by incorporating predictions from dynamical seasonal climate forecast models.
 Dynamical climate models simulate evolution of the climate system using physically based mathematical representations of the atmosphere, land and oceans. They predict dominant modes of climate variability such as the El Niño Southern Oscillation (ENSO) and the Indian Ocean Dipole (IOD) [Lim et al., 2009, 2011]. They also explicitly model transient processes and capture concurrent relations between forecast variables (e.g., rainfall and sea surface temperatures (SST)) [Schepen et al., 2012b]. Therefore, dynamical climate models can be expected to better predict climate than statistical models that simply use lagged climate indices as predictors.
 However, seasonal climate forecasting using dynamical models is challenging and the dynamical models do not often produce predictions that are more skillful than statistical models. The dynamical climate predictions are usually biased and have overconfident estimates of forecast uncertainty [Schepen et al., 2012a; Lim et al., 2011]. Rajagopalan et al. [2002] reported that climate predictions from a combination of a number of global climate models were superior to climatology in only a few regions of the world. Schepen et al. [2012b] reported that the performance of forecasts from “calibrated” dynamical climate models was comparable to, yet “different” from, that of their statistical counterparts and showed that their combination could extend the spatial and temporal coverage of forecast skill [Schepen et al., 2012a, 2012b].
 In this study, we investigate two strategies for improving the statistical method currently used by the BOM for seasonal streamflow forecasting. The first strategy is to use BMA to combine forecasts from multiple candidate models (established using the BJP modeling approach) using lagged climate indices as predictors instead of the currently used method of selecting the “best” forecast. We use the BMA approach by Wang et al. [2012a] for forecast combination. The second strategy is to take advantage of the direct simulations of various dynamic processes represented by a global climate model by using its rainfall and SST predictions as predictors to first establish additional BJP candidate models, and then adding them into the existing pool for forecast combination.
 The next section describes the catchment and hydrological data. Section 3 describes the BJP modeling approach and the candidate models, including a description of the predictors used to establish candidate models. Section 4 describes four different methods used to produce the seasonal streamflow forecasts. Section 5 assesses the skill and reliability of the forecasts. Section 6 presents some further analysis and discussion, and section 7 summarizes the study and presents the overall conclusions.
2. Catchment and Hydrological Data
 The study is carried out on 22 catchments in eastern Australia. Six of the catchments are located in Queensland, 13 in Victoria, one at the border of Victoria and New South Wales, and two in Tasmania (see Figure 1). The names and catchment characteristics of the 22 catchments are given in Table 1.
Table 1. Attributes of the 22 Catchments Used for This Study
Columns: Catchment; Catchment Area (km²); Mean Annual Rainfall (mm); Mean Annual Flow (mm); Annual Runoff Coeff.
Catchment | Mean Annual Flow (mm), with mean annual volume in parentheses
 | 605 (138 GL)
South Johnstone River | 2018 (787 GL)
 | 76 (2765 GL)
 | 23 (372 GL)
 | 79 (304 GL)
 | 289 (395 GL)
 | 227 (2764 GL)
 | 279 (890 GL)
 | 248 (433 GL)
 | 175 (1320 GL)
 | 150 (63 GL)
 | 373 (1447 GL)
 | 188 (1349 GL)
 | 98 (172 GL)
Cairn Curran Res. | 72 (115 GL)
 | 77 (54 GL)
 | 485 (236 GL)
Upper Yarra Res. | 443 (149 GL)
 | 577 (74 GL)
 | 766 (97 GL)
 | 793 (2141 GL)
 | 1724 (1260 GL)
 The two northernmost Queensland catchments (Barron and South Johnstone Rivers) experience a subtropical to tropical climate and are characterized by very wet austral summers (December–February) and drier winters (June–August). The South Johnstone River is the wettest of the 22 catchments, with annual rainfall in excess of 3000 mm (Figure 1 and Table 1). In contrast, the two largest catchments, the Burdekin River and the Cape River, with areas of 36,260 and 50,291 square kilometers respectively, have drier climates characterized by low rainfall (less than 600 mm per annum) and high evapotranspiration.
 The 14 Victorian catchments (including Lake Hume) experience a temperate climate. Typically the wettest months are June to September and the driest January to March. These catchments can be further divided into Upper Murray (Lake Hume, Dartmouth Reservoir, Kiewa River and Ovens River), central Victoria (Lake Nillahcootie, Lake Eildon, Goulburn Weir, Lake Eppalock, Cairn Curran Reservoir and Tullaroop Reservoir) and southern Victoria (Thomson Reservoir, Upper Yarra Reservoir, Maroondah Reservoir and O'Shannassy Reservoir). The wettest Victorian catchments are in the southern Victoria region, with annual rainfall of 1300–1400 mm, followed by the Upper Murray and central Victoria (see Table 1).
 The two Tasmanian catchments experience a temperate maritime climate and are typically wet throughout the year, with the wettest months extending from April to October. The King hydro-electric scheme catchment is the wetter of the two, with annual rainfall in excess of 2600 mm and a runoff coefficient of 0.64.
 In this study, we use the observed monthly streamflow data obtained from various water management agencies and the Bureau of Meteorology. For most catchments, data are available from 1950 to 2008 (see Table 1). The monthly catchment average rainfall and potential evapotranspiration for each catchment are calculated from 0.05° (∼5 km) gridded data available from the Australian Water Availability Project (AWAP) [Jones et al., 2009].
3. Candidate Models
 In each catchment, we seek to produce probabilistic forecasts of streamflow for the next season at the start of each month. We treat the 12 overlapping seasons separately and establish separate models to account for the seasonal variations in hydrological and climate processes.
 In each catchment, we establish multiple candidate models using the BJP modeling approach [Wang et al., 2009; Wang and Robertson, 2011] and generate forecasts in cross validation over the entire period of record (Table 1). We then use these forecasts for model combination and “best” model selection. The formation of candidate models and their subsequent use for best model selection and model combination is summarized in Figure 2. Figure 2a is a generalized schematic representation of the BJP modeling approach. It shows how the predictors representing the catchment wetness (CW) and the climate are used to establish a candidate model. Figure 2b provides an overview of the methodology (described in section 4) and shows the arrangement of the candidate models, formed with different combinations of predictors, for model selection and combination.
 The following sections describe formation of candidate models for one of the 12 overlapping seasons (for example January-February-March or JFM).
3.1. The BJP Modeling Approach
 In the BJP modeling approach, the predictors y(1) and the predictands y(2) are arranged as column vectors (equation (1) and Figure 2a). The vector y(1) consists of predictors representing catchment wetness (CW) and climate during the forecast period. In this study, we use a total of three predictors (except for two special cases, explained in section 4) to establish each candidate model, two representing CW and one representing the climate. The predictands include both streamflow and rainfall for the forecast period. Although rainfall is not the primary variable of interest, it is included to supplement the relatively weak relationships between climate indices and streamflow during the forecast period, especially when the streamflow data are sparse.
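 The original BJP formulation applies the Yeo-Johnson transformation (equation (2)), an extended version of the Box-Cox transformation [Yeo and Johnson, 2000; Wang et al., 2009], to all the variables to normalize data and stabilize variances. For a variable y with transformed value z, the Yeo-Johnson transformation is given by
$$
z =
\begin{cases}
\dfrac{(y+1)^{\lambda}-1}{\lambda}, & y \ge 0,\ \lambda \ne 0 \\
\ln(y+1), & y \ge 0,\ \lambda = 0 \\
-\dfrac{(-y+1)^{2-\lambda}-1}{2-\lambda}, & y < 0,\ \lambda \ne 2 \\
-\ln(-y+1), & y < 0,\ \lambda = 2
\end{cases}
\tag{2}
$$
where λ is the transformation parameter.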
 In this study, we apply a log-sinh transformation (equation (3)) [Wang et al., 2012b] for variables that can have only positive values (rainfall and streamflow), but retain the Yeo-Johnson transformation for variables that can have both positive and negative values (the climate indices). The log-sinh transformation has been shown to outperform the Box-Cox transformation when applied in catchments with highly skewed data [Wang et al., 2012b]. The log-sinh transformation is given by
$$
z = \frac{1}{\beta}\,\ln\!\big(\sinh(\alpha + \beta y)\big)
\tag{3}
$$
where α and β are the parameters of the log-sinh transformation.
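As an illustration only, the two transformations can be sketched in Python as follows. The function names are introduced here for the sketch and are not taken from the BJP software; the Yeo-Johnson branches follow the standard definition of Yeo and Johnson [2000], and the inverse log-sinh function is included only to show how a transformed value is mapped back to the original scale.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a scalar y with parameter lam (standard form)."""
    if y >= 0:
        if abs(lam) > 1e-12:
            return ((y + 1.0) ** lam - 1.0) / lam
        return math.log1p(y)
    if abs(lam - 2.0) > 1e-12:
        return -(((-y + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    return -math.log1p(-y)

def log_sinh(y, alpha, beta):
    """Log-sinh transform z = ln(sinh(alpha + beta * y)) / beta for y >= 0.

    Assumes alpha > 0 and beta > 0 so that sinh(...) is positive.
    """
    return math.log(math.sinh(alpha + beta * y)) / beta

def inv_log_sinh(z, alpha, beta):
    """Inverse of log_sinh: y = (asinh(exp(beta * z)) - alpha) / beta."""
    return (math.asinh(math.exp(beta * z)) - alpha) / beta
```

With λ = 1 the Yeo-Johnson transformation leaves the data unchanged, which is a convenient check on any implementation.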
 The transformed variables (z) are then assumed to follow a multivariate normal distribution,
$$
\mathbf{z} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\Sigma} = \mathrm{diag}(\boldsymbol{\sigma})\,\mathbf{R}\,\mathrm{diag}(\boldsymbol{\sigma})
\tag{4}
$$
where μ is the mean vector, σ is the standard deviation vector, diag(σ) is the diagonal matrix with the elements of σ on its diagonal, and R is the correlation matrix. The posterior distribution of model parameters p(θ|YO), including the mean vector, standard deviation vector, transformation parameters and the correlation matrix, is inferred using a Bayesian formulation (equation (5)), implemented through Markov chain Monte Carlo (MCMC) sampling:
$$
p(\theta \mid \mathbf{Y}_O) \propto p(\theta) \prod_{t=1}^{n} p(\mathbf{y}_O^{t} \mid \theta)
\tag{5}
$$
where p(θ) is the prior distribution, t = 1 … n indexes the years of data, and YO contains all the data used for the parameter inference. These include the observed data for the predictands as well as the observations and the dynamical climate model predictions (hindcasts) used as the predictors. The likelihood function is calculated as
$$
p(\mathbf{y}_O^{t} \mid \theta) = J_{z \to y}\; p(\mathbf{z}_O^{t} \mid \theta)
\tag{6}
$$
where Jz→y is the Jacobian determinant of the transformations (equations (2) and (3)) from z to y.
 Zero rainfall and streamflow data are treated as censored data with an unknown precise value below or equal to a threshold value (zero). The probability density below the threshold is then numerically integrated to the zero value point of the variable. This extends the assumption of the continuous multivariate normal distribution even in the case of zero values in the data [Wang and Robertson, 2011].
 Prior to the inference, the parameters are reparameterized to make the inference process easier. The details of the reparameterization, the specification of the prior distribution, the methods for calculating the Jacobian determinants and the treatment of zero values are given in Wang and Robertson [2011] and Robertson and Wang [2012a]. To produce probabilistic forecasts, the transformed multivariate normal distribution is conditioned on the predictor values. The posterior predictive density for a new forecast year corresponding to the overlapping season (for example JFM) is given by
$$
p\big(\mathbf{y}(2) \mid \mathbf{y}(1); \mathbf{Y}_O\big) = \int p\big(\mathbf{y}(2) \mid \mathbf{y}(1), \theta\big)\, p(\theta \mid \mathbf{Y}_O)\, d\theta
\tag{7}
$$
 Numerically, we generate a probabilistic forecast by first sampling 1000 sets of parameters to represent the posterior parameter distribution using MCMC. Then for each parameter set, we generate one forecast member from the posterior predictive density. Collectively the 1000 ensemble members represent the forecast for the season. Details of the method for the numerical evaluation of equation (7) can be found in Wang et al. [2009] and Wang and Robertson [2011].
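The sampling step can be sketched for the simplest case of one predictor and one predictand, both already in transformed space. This is a minimal illustration under stated assumptions, not the BJP implementation: the tuple layout (mu1, mu2, s1, s2, r) and the function names are hypothetical, the parameter sets stand in for MCMC draws from the posterior, and the back-transformation and the treatment of censored zeros are omitted.

```python
import math
import random

def conditional_member(params, z1, rng):
    """Draw one member of the predictand z2 given the predictor value z1.

    params = (mu1, mu2, s1, s2, r): means, standard deviations and correlation
    of a bivariate normal distribution in transformed space (assumed layout).
    """
    mu1, mu2, s1, s2, r = params
    cond_mean = mu2 + r * (s2 / s1) * (z1 - mu1)
    cond_sd = s2 * math.sqrt(max(0.0, 1.0 - r * r))
    return rng.gauss(cond_mean, cond_sd)

def forecast_ensemble(param_sets, z1, seed=0):
    """One forecast member per posterior parameter set, as in the text."""
    rng = random.Random(seed)
    return [conditional_member(p, z1, rng) for p in param_sets]
```

Drawing a single member per parameter set means the ensemble spread reflects both parameter uncertainty and the residual uncertainty of the conditional distribution.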
3.2. Predictors Representing Catchment Wetness
 We follow the approach of Robertson et al. [2013] to construct a predictor representing CW (Figure 2a) by using the predictions from a monthly water balance and partition (WAPABA) model [Wang et al., 2011a]. We calibrate WAPABA by maximizing a scalarized multiobjective measure consisting of a uniformly weighted average of the Nash-Sutcliffe efficiency (NS) coefficient [Nash and Sutcliffe, 1970], the NS of log transformed flows, the Pearson correlation coefficient and a symmetric measure of bias.
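A uniformly weighted objective of this kind might be sketched as below. The exact form of the symmetric bias measure is not given in the text, so the ratio-based score used here is an assumption made purely for illustration, as is the small offset eps added before taking logarithms; all function names are hypothetical.

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency of simulated against observed values."""
    mean_obs = sum(obs) / len(obs)
    sq_err = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sq_dev = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sq_err / sq_dev

def pearson(obs, sim):
    """Pearson correlation coefficient."""
    mo = sum(obs) / len(obs)
    ms = sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)

def symmetric_bias_score(obs, sim):
    """ASSUMED form: 1 when total volumes match, penalizing over- and
    under-prediction by the same factor. Requires positive totals."""
    ratio = sum(sim) / sum(obs)
    return min(ratio, 1.0 / ratio)

def calibration_objective(obs, sim, eps=1e-6):
    """Uniformly weighted average of the four measures (to be maximized)."""
    log_obs = [math.log(o + eps) for o in obs]
    log_sim = [math.log(s + eps) for s in sim]
    terms = [
        nse(obs, sim),
        nse(log_obs, log_sim),
        pearson(obs, sim),
        symmetric_bias_score(obs, sim),
    ]
    return sum(terms) / len(terms)
```

Each term equals 1 for a perfect simulation, so the averaged objective also has a maximum of 1.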
 Using the calibrated parameters, we run the model from the start of the historical record to the forecast period of interest using observed forcing data to initialize the model state variables. We then simulate streamflow for subsequent 3 months by forcing the model with mean monthly climatology. We repeat the procedure to produce a time series of the seasonal streamflow predictions for each month in the historical record (Table 1).
 The predictions are produced in cross validation over the entire historical record (Table 1) using the leave-five-years-out cross-validation method. The process involves removing the forecast year of interest and the subsequent 4 years from the calibration process and generating the predictions for the forecast year. This prevents the temporal sequence of the historical data being preserved in the parameter inference for the hydrological model and artificially inflating forecast skill.
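The leave-five-years-out scheme can be expressed compactly. The sketch below only builds the list of years retained for calibration when a given year is forecast; the function name and arguments are illustrative.

```python
def training_years(years, forecast_year, window=5):
    """Years kept for calibration when forecasting `forecast_year`.

    The forecast year and the following (window - 1) years are removed,
    so window=5 gives the leave-five-years-out scheme.
    """
    excluded = set(range(forecast_year, forecast_year + window))
    return [y for y in years if y not in excluded]

# Example: a 1950-2008 record, forecasting the year 2000.
record = list(range(1950, 2009))
kept = training_years(record, 2000)
```

Near the end of the record fewer than five years are removed, because the excluded window extends past the last available year.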
 We then use the predicted streamflow as a predictor in BJP models, to account for the influence of the CW on the next season streamflow. We also use the observed streamflow for the month immediately before the forecast period (lagged streamflow) as a second predictor. Robertson et al. [2013] found that using the WAPABA prediction and the lagged streamflow consistently improves the skill of seasonal streamflow forecasts over using only antecedent streamflow or antecedent rainfall as the CW predictor. We adopt this approach in representing the CW for all candidate models (except for a no predictor model to be explained in section 4).
3.3. Predictors Representing Climate During the Forecast Period
 The predictors representing the influence of climate during the forecast period on forecast streamflow are unique to each candidate model. In each catchment, we use a total of 23 predictors representing climate (along with 2 CW predictors) to form 23 candidate models (Figure 2b). The predictors include 11 lagged climate indices currently used in the BJP modeling approach for forecasting streamflow and rainfall [Robertson and Wang, 2012a; Wang et al., 2012a; Schepen et al., 2012b] and 12 additional predictors introduced for the purpose of this study that are derived from climate model predictions.
3.3.1. Predictors Established Using Lagged Climate Indices
 Eleven climate indices for the month immediately preceding the forecast period are used as candidate predictors. The lagged climate indices are calculated from observations of SST and atmospheric circulation anomalies over the Pacific Ocean, Indian Ocean and the extratropical region (Table 2). These climate indices have been related to rainfall over the study region and have been previously applied to forecasting seasonal rainfall and streamflow [Risbey et al., 2009; Lim et al., 2011; Schepen et al., 2012b].
Table 2. Climate Indices Included as Candidate Predictors of Seasonal Forecast of Streamflow and Data Sources
Climate Index | Description | Data Source
El Niño Modoki Index (EMI) | C − 0.5 (E + W); C = average SST anomaly over 165E–140W and 10N–10S; E = average SST anomaly over 110W–70W and 5N–15S; W = average SST anomaly over 125E–145E and 20N–10S | NCAR, ERSST.v3 [Smith et al., 2008]
Indian Ocean West Pole Index [WPI; Saji et al., 1999] | Average SST anomaly over 50E–70E and 10N–10S | NCAR, ERSST.v3 [Smith et al., 2008]
Indian Ocean East Pole Index [EPI; Saji et al., 1999] | Average SST anomaly over 90E–110E and 0N–10S | NCAR, ERSST.v3 [Smith et al., 2008]
Indian Ocean Dipole Mode Index [DMI; Saji et al., 1999] | DMI = WPI − EPI | NCAR, ERSST.v3 [Smith et al., 2008]
Indonesian Index [II; Verdon and Franks, 2005] | Average SST anomaly over 120E–130E and 0N–10S | NCAR, ERSST.v3 [Smith et al., 2008]
Tasman Sea Index [Murphy and Timbal, 2008] | Average SST anomaly over 150E–160E and 30S–40S | NCAR, ERSST.v3 [Smith et al., 2008]
140°E Blocking Index [Risbey et al., 2009] | 0.5 (U25 + U30 − U40 − 2 U45 − U50 + U55 + U60), where UX is the 500 hPa zonal wind at latitude X°S | Calculated from NCEP/NCAR reanalysis data [Kalnay et al., 1996]
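The composite indices in Table 2 are simple arithmetic combinations of box-averaged anomalies, as the following sketch shows. The function names are illustrative. The latitudes used in the blocking index (25, 30, 40, 45, 50, 55 and 60°S) are an assumption based on the commonly used form of this index, since the sign pattern of the seven terms is all that the table entry fixes.

```python
def dipole_mode_index(wpi, epi):
    """DMI = WPI - EPI, from the west and east pole SST anomalies."""
    return wpi - epi

def tripole_sst_index(c, e, w):
    """C - 0.5 (E + W), from the central, eastern and western box anomalies."""
    return c - 0.5 * (e + w)

def blocking_index(u):
    """140E blocking index from 500 hPa zonal winds.

    `u` maps latitude in degrees south to zonal wind; the latitude set is
    an ASSUMPTION (see the note above), not taken from the table.
    """
    return 0.5 * (u[25] + u[30] - u[40] - 2.0 * u[45] - u[50] + u[55] + u[60])
```

A uniform westerly flow at all latitudes gives a blocking index of zero, because the positive and negative terms cancel.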
3.3.2. Additional Predictors Established Using Dynamical Climate Model Predictions
 The additional predictors that represent climate during the forecast period are derived from the Predictive Ocean Atmosphere Model for Australia (POAMA) (http://poama.bom.gov.au/about_poama2.shtml) predictions. POAMA is a dynamical climate forecasting system designed to produce seasonal to interannual forecasts of climate for Australia. It comprises a global coupled ocean-atmosphere model and an ocean-atmosphere-land data assimilation system [Alves et al., 2003; Oke et al., 2005; Lim et al., 2011; Wang et al., 2011b]. Three configurations of POAMA are available: POAMA 2.4c is the standard configuration, whereas POAMA 2.4a and POAMA 2.4b use an atmospheric model with improved physics associated with shallow convection. POAMA 2.4b also includes an additional ocean-atmosphere flux correction scheme that reduces climatological biases [Wang et al., 2011b]. Each configuration generates different climate predictions. An ensemble of 10 predictions is generated by each model configuration from perturbations in initial conditions.
 We use two POAMA products as predictors representing the influence of climate to establish the candidate models:
 1. POAMA 2.4 rainfall predictions: Three candidate predictors are calculated as the mean of the 10-member seasonal rainfall prediction ensembles generated by each of POAMA 2.4a, 2.4b and 2.4c (P24A, P24B and P24C in Figure 2b). Area-weighted interpolation is used to downscale the POAMA rainfall predictions (at 2.5° × 2.5° spatial resolution) to the catchment scale.
 2. POAMA 2.4 SST predictions: Nine climate indices are derived from predicted SST (Table 2 and Figure 2b); the SOI and the 140°E Blocking Index are excluded because they are derived from atmospheric variables. The nine SST-based climate indices are derived from the mean of the 30-member ensemble SST prediction generated by the POAMA model. The POAMA SST forecast predictors are named with the prefix P24 followed by the name of the predicted climate index in Figure 2b.
 A previous study, which analyzed the skill of forecasts produced using statistical models relating POAMA 2.4 rainfall and SST predictions to observed rainfall [Langford et al., 2011] over continental Australia, revealed that the forecast skill was modest and varied spatially and with season. The forecasts produced from POAMA 2.4 rainfall predictions were particularly skillful during austral autumn over the south-eastern part of Australia, while the forecasts produced from POAMA 2.4 SST predictions were more skillful in spring over eastern Australia. Therefore a combination of both could result in improved climate forecasting skill over a larger number of catchments and seasons.
 In addition, POAMA is increasingly seen as the forecasting tool of the future, eventually replacing statistical methods for forecasting seasonal climate in Australia. It is therefore of considerable interest to the forecasting community in Australia to understand whether it can contribute to the skill of seasonal streamflow forecasts [Schepen et al., 2012]. POAMA climate predictions show modest skill, comparable to that of predictions made using statistical methods [Lim et al., 2011]. More importantly, the seasons and catchments where POAMA predictions are skillful tend to be different from those where statistical forecasts are skillful [Schepen et al., 2012a]. We therefore expect that any improvements from using POAMA predictions will be subtle and vary with catchments and seasons. We are interested in capitalizing on those subtle increases in skill and investigating in which seasons and catchments the POAMA predictions provide added value over the use of lagged climate indices.
 Seasonal streamflow forecasts are produced for each candidate model described in section 3 using a “leave-five-years-out” cross-validation [Robertson et al., 2013] strategy. The cross-validation forecasts are produced for the period 1950–2008 (Table 1) for candidate models using the lagged climate indices. Cross-validation forecasts using the predictors from POAMA predictions are produced for 1980–2008, based upon the availability of good quality POAMA predictions.
4. Methods for Model Selection and Forecast Combination
 We compare four methods of generating forecasts from candidate models to establish the benefits of combining forecasts from multiple statistical models through BMA and the benefits of including dynamical climate model predictions into statistical streamflow forecasting models. The following subsections describe the four methods and Figure 2b provides an overall schematic representation of the steps involved in the process.
4.1. Method A: Best Predictor Forecast Model
 Method A is based on choosing the best forecast model. The method consists of first calculating the natural logarithm of the pseudo Bayes factor (ln(PsBF); equation (8)) of each candidate model in the pool and then selecting the model with the highest ln(PsBF) [Robertson and Wang, 2012a]. To calculate the value of ln(PsBF), we use “leave-five-years-out” cross-validation predictive densities [Robertson et al., 2013] instead of the model predictive density shown in equation (7). DelSole and Shukla showed that model skill can be artificially inflated if predictor selection is not included in cross validation. In calculating the ln(PsBF), we exclude the cross-validation density corresponding to the forecast year i and the subsequent 4 years (as shown in equation (8)), resulting in a different value of ln(PsBF) for each forecast year.
$$
\ln(\mathrm{PsBF})_{k,i} = \sum_{\substack{t=1 \\ t \notin \{i,\dots,i+4\}}}^{T} \ln \frac{p_k\big(\mathbf{y}_O^{t}(2) \mid \mathbf{y}_O^{t}(1); \mathbf{Y}_O^{(t,t+4)}\big)}{p_{\mathrm{CLM}}\big(\mathbf{y}_O^{t}(2); \mathbf{Y}_O^{(t,t+4)}\big)}
\tag{8}
$$
where i = 1 … T is the cross-validation forecast year of interest, T is the total number of years in the record, p_k(·) is the cross-validation density of the kth candidate model in the pool, p_CLM(·) is the cross-validation density of the no predictor model (described below), yO(2)t is the vector of observed predictand data for time step t (t = 1 … T), yO(1)t is the corresponding predictor data, and the superscript (t, t + 4) denotes that data from years t to t + 4 have been excluded from the parameter inference process.
 Our pool of candidate models consists of 11 models formed using lagged climate indices, one climatology model (CLM) and one CW model (Figure 2b). The CLM represents a “no predictor” model, and is selected over the other candidate models if none of them shows skill. The CW model includes streamflow predictions from WAPABA and lagged observed streamflow. The CW model is selected if the information about the climate from all the candidate models is very weak. The best model identified for each cross-validation forecast year (i) is used to produce the seasonal streamflow forecast for that year. Examples of the best models selected are shown in Figure 11.
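The selection step in method A can be illustrated with a short sketch that works from precomputed log cross-validation densities. The data layout (a dictionary mapping model name to a list of log densities, one per year) and the function names are assumptions made for this example; the densities themselves would come from the BJP cross-validation runs.

```python
def ln_psbf(log_dens, log_dens_ref, i, window=5):
    """Log pseudo Bayes factor of a model against the reference model,
    leaving out year index i and the following (window - 1) years."""
    return sum(
        lk - lr
        for t, (lk, lr) in enumerate(zip(log_dens, log_dens_ref))
        if not (i <= t < i + window)
    )

def select_best(log_dens_by_model, ref_name, i):
    """Return the model with the highest ln(PsBF) for forecast year index i."""
    ref = log_dens_by_model[ref_name]
    scores = {
        name: ln_psbf(ld, ref, i) for name, ld in log_dens_by_model.items()
    }
    return max(scores, key=scores.get), scores
```

Because the reference model scores exactly zero against itself, it is returned whenever every other model has a negative ln(PsBF), which matches the role of the CLM described above.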
4.2. Method B: Combined Forecast From Models Using Lagged Climate Indices
 Method B combines the forecasts from the 13 candidate models (Figure 2b) considered in method A, using the BMA approach [Wang et al., 2012a]. The BMA approach combines forecasts from candidate models by assigning higher weights to better performing models while allowing for model uncertainty [Wang et al., 2012a]. The predictive density after combining forecasts from K candidate models is defined as the weighted average of the individual model densities (equation (9)).
$$
p_{\mathrm{BMA}}\big(\mathbf{y}(2) \mid \mathbf{y}(1)\big) = \sum_{k=1}^{K} w_k\, p_k\big(\mathbf{y}(2) \mid \mathbf{y}(1)\big), \qquad \sum_{k=1}^{K} w_k = 1
\tag{9}
$$
where wk is the weight assigned to the kth candidate model, p_BMA(·) is the BMA predictive density and p_k(·) is the predictive density of the kth model. The weights can be calculated by maximizing the posterior distribution of the weights using an EM algorithm [Wang et al., 2012a].
 In this study, we use the cross-validation densities instead of the predictive densities and (as in method A) exclude the cross-validation density corresponding to the forecast event (i) and its subsequent 4 years when calculating the weights. This results in a different set of weights for each forecast year (i). The cross-validated weights (wi, equation (10)) are used to generate forecasts for the year of interest.
$$
\mathbf{w}_i = \arg\max_{\mathbf{w}} \left[ (\alpha - 1) \sum_{k=1}^{K} \ln w_k + \sum_{\substack{t=1 \\ t \notin \{i,\dots,i+4\}}}^{T} \ln \sum_{k=1}^{K} w_k\, p_k\big(\mathbf{y}_O^{t}(2) \mid \mathbf{y}_O^{t}(1); \mathbf{Y}_O^{(t,t+4)}\big) \right]
\tag{10}
$$
where p_k(·; YO(t,t+4)) is the leave-five-years-out cross-validation density of the kth candidate model in the pool, calculated with years t to t + 4 excluded from the parameter inference process, and wi is the vector of cross-validated weights for forecast year i.
 The first term on the right hand side of equation (10) arises from the Dirichlet prior distribution for the weights, where α (=1+b/K) is a parameter controlling the strength of the prior. The use of this prior encourages uniform model weights when the uncertainty in calculating the weight is high. We adopt a slightly stronger prior on weights (b = 1) than was used by Wang et al. [2012a] (b = 0.5). We found that a stronger prior on weights was needed to obtain relatively stable cross-validation weights over the forecast period.
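The weight calculation can be sketched with the standard EM update for the mixture weights of fixed component densities under a symmetric Dirichlet prior with parameter α = 1 + b/K. This is a minimal illustration of that textbook update, not the authors' code: the function name, the data layout (dens[t][k] holding the cross-validation density of model k in year index t) and the fixed iteration count are assumptions.

```python
def bma_weights(dens, i=None, b=1.0, window=5, n_iter=200):
    """MAP estimate of BMA weights by EM with a symmetric Dirichlet prior.

    dens[t][k] : cross-validation density of model k at year index t
    i          : forecast year index; years i .. i+window-1 are left out
    b          : prior strength, alpha = 1 + b / K
    Assumes every retained year has at least one model with positive density.
    """
    K = len(dens[0])
    rows = [d for t, d in enumerate(dens)
            if i is None or not (i <= t < i + window)]
    T = len(rows)
    w = [1.0 / K] * K
    for _ in range(n_iter):
        resp_sum = [0.0] * K
        for row in rows:
            # E step: responsibility of each model for this year.
            mix = sum(wk * pk for wk, pk in zip(w, row))
            for k in range(K):
                resp_sum[k] += w[k] * row[k] / mix
        # M step: the (alpha - 1) = b / K prior count pulls weights toward 1/K.
        w = [(resp_sum[k] + b / K) / (T + b) for k in range(K)]
    return w
```

The prior acts like b additional pseudo-observations shared equally among the K models, so with few years of data the weights stay close to uniform, and with many years the data dominate.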
 The climate predictors (used for the formation of the candidate models) can be highly correlated with each other. For example, there is a strong correlation between IOD and ENSO over a large part of south-eastern Australia in spring [Lim et al., 2009]. When highly correlated predictors are used in a single model, for example by multiple linear or nonlinear regression, the resulting relationship may not be robust and can lead to spurious predictions under certain combinations of predictor values. In our approach here, each of the candidate models has only one climate related predictor. Therefore the individual models are not affected by predictor collinearity. In combining the forecasts from multiple models through BMA, if a subgroup of model forecasts is related to each other because they use correlated predictors, the BMA weight distribution within this subgroup may not be very stable as the forecasts from different models can to some extent substitute for each other. However, the collective BMA weight of this subgroup relative to other less related subgroups should be stable. To overcome the problem of unstable BMA weights among related models, we apply in our BMA approach a prior (equation (10)) that encourages even BMA weight distribution.
4.3. Method C: Combined Forecasts Including POAMA Rainfall Predictions
 Method C combines forecasts from candidate models established using POAMA rainfall predictions as predictors with forecasts produced using method B (Figure 2b). We apply a two-step approach to account for the difference in record length between the lagged climate indices (1950–2008) and the POAMA predictions (1980–2008). In the first step, we use BMA to combine forecasts from the three candidate models that use POAMA predictions as predictors with forecasts from a CLM and from the CW model. In the second step, the forecast from this model combination is combined with the forecast from method B using a second application of BMA. The weights for the second BMA are calculated for the period 1980–2008.
 The relations between the predictors representing the climate and the streamflow are often weak and noisy, so longer records help to establish them robustly. Although the POAMA predictions are limited by their short record, the two-step BMA process allows the maximum length of data to be used in establishing the models and calculating the weights for the candidate models that use the lagged climate indices as predictors.
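 Because a BMA forecast is a weighted mixture of the candidate forecast distributions, the two-step combination is itself a mixture whose effective weights are products of the step-one and step-two weights. The sketch below illustrates this bookkeeping only. All weights are made-up numbers, the POAMA model labels are hypothetical, and models fitted on the shorter record are kept under separate labels because they are not the same fitted models as those in method B.

```python
def flatten_two_step(method_b_weights, step1_weights, step2_weights):
    """Effective per-model weights implied by a two-step mixture.

    method_b_weights : dict, model label -> weight within method B
    step1_weights    : dict, model label -> weight within the step-one
                       (POAMA-period) combination
    step2_weights    : (weight on method B, weight on the step-one combination)
    """
    w_b, w_p = step2_weights
    effective = {}
    for name, w in method_b_weights.items():
        effective[name] = effective.get(name, 0.0) + w_b * w
    for name, w in step1_weights.items():
        effective[name] = effective.get(name, 0.0) + w_p * w
    return effective

# Illustrative (not estimated) weights; each group sums to 1.
method_b = {"CW + SOI": 0.5, "CW + NINO34": 0.3, "CLM": 0.2}
step_one = {
    "CW + POAMA rainfall (hypothetical label)": 0.6,
    "CW, 1980-2008 fit": 0.3,
    "CLM, 1980-2008 fit": 0.1,
}
effective = flatten_two_step(method_b, step_one, step2_weights=(0.7, 0.3))
```

 The effective weights again sum to 1, so the final forecast remains a proper mixture distribution.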
4.4. Method D: Combined Forecasts Including POAMA SST Predictions
 Method D adds forecasts from nine candidate models established using POAMA predictions of climate indices as predictors, again in two steps (Figure 2b). We first combine forecasts from the nine candidate models using POAMA climate index predictions with forecasts from the three candidate models using POAMA rainfall predictions as well as the CLM and CW models. We then combine these with the forecasts from method B, including only the events after 1980 to calculate weights in the same way as in method C.
5. Forecast Assessment and Results
 We assess the performance of the forecasts generated by each method with respect to two important features of probabilistic forecasts: skill and reliability. Forecast skill is a measure of accuracy and is expressed relative to a reference forecast [Jolliffe and Stephenson, 2003; Robertson and Wang, 2012b]. Forecast reliability is a measure of the “statistical consistency” of the forecast distribution with the observed frequency of the events [Toth et al., 2003; Robertson and Wang, 2012b].
5.1. Forecast Skill
 We use Root Mean Square Error in Probability (RMSEP) [Wang and Robertson, 2011; Robertson and Wang, 2012a] to measure forecast error (equation (11)) and RMSEP skill scores (SSRMSEP) to compare skill between the methodologies. RMSEP measures the difference between the nonexceedance probability of a forecast and the corresponding observation, rather than differences in magnitude. An advantage of RMSEP is that it places equal emphasis on all events rather than being strongly influenced by large errors occurring at a few events, as can occur in traditional measures of mean squared errors. The skill (equation (12)) of the forecast is calculated relative to a reference forecast.
where is the observation at t = 1, 2… n events, yt is the median of the forecast distribution. FCLI is the cumulative historical distribution (climatology), and is the nonexceedance probability for event t. SSRMSEP is the skill of the forecast, RMSEPREF is the RMSEP error values of the reference forecast, and RMSEP is the RMSEP error value of the forecast. The cumulative historical distribution is a transformed normal distribution fitted to available data.
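 The two quantities can be sketched in a few lines. The sketch assumes that RMSEP is the root mean square difference between the climatological nonexceedance probabilities of the forecast median and of the observation, and that the skill score is the percentage reduction in RMSEP relative to the reference forecast. A log-normal distribution is used here only as a simple stand-in for the transformed normal climatology; the function names and toy flows are illustrative.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise error function

def lognormal_cdf_from(sample):
    """Return a climatology CDF fitted to a sample (log-normal stand-in)."""
    logs = np.log(np.asarray(sample, dtype=float))
    mu, sigma = logs.mean(), logs.std(ddof=1)

    def cdf(y):
        z = (np.log(np.asarray(y, dtype=float)) - mu) / (sigma * math.sqrt(2.0))
        return 0.5 * (1.0 + _erf(z))

    return cdf

def rmsep(forecast_median, observed, f_cli):
    """Root mean square error in probability over all events."""
    p_forecast = f_cli(forecast_median)
    p_observed = f_cli(observed)
    return float(np.sqrt(np.mean((p_forecast - p_observed) ** 2)))

def skill_score(rmsep_forecast, rmsep_reference):
    """Percentage reduction in RMSEP relative to the reference forecast."""
    return 100.0 * (rmsep_reference - rmsep_forecast) / rmsep_reference

# Toy seasonal flows (made-up values) and two forecasts of the median.
obs = np.array([120.0, 45.0, 300.0, 80.0, 150.0])
f_cli = lognormal_cdf_from(obs)
climatology_median = np.full(obs.shape, np.exp(np.log(obs).mean()))
model_median = np.array([100.0, 60.0, 220.0, 90.0, 140.0])
ss = skill_score(rmsep(model_median, obs, f_cli),
                 rmsep(climatology_median, obs, f_cli))
```

 Because errors are measured in probability space, each term is bounded by 1, which is why a few very large flow errors cannot dominate the score.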
 The choice of the reference forecast depends upon the requirements of the users. In this study, we use climatology as the reference for all methods (A–D). In addition, for methods B–D, we assess skill incrementally by calculating skill scores relative to forecasts produced by the previous method. This allows assessment of the benefit of any particular method over its baseline (the previous method). We also assess model performance using the continuous ranked probability score (CRPS) but do not include the results in this paper because the conclusions from using CRPS and RMSEP are similar.
5.1.1. Method A
 Figure 3 shows the quantile ranges of three forecasts produced by method A. The panels show a forecast with moderate skill (less than 20% RMSEP skill score; Figure 3a), a forecast with high skill (greater than 50%; Figure 3b), and a forecast with very low skill (close to 0%; Figure 3c) relative to climatology. The figure gives a visual impression of how forecasts corresponding to particular skill scores might look.
 Figure 4 shows the RMSEP skill scores calculated for 22 catchments and 12 overlapping seasonal forecasts generated using method A. In general, the skill is higher for the forecast seasons that follow the wet months. The high skill periods tend to occur from May to August in Queensland and from September to December in Upper Murray, central Victoria and southern Victoria. During this period, the catchment undergoes a “draining” phase, which presents as a receding limb on an annual hydrograph. The skill of the forecasts is low before or during the wet months (January to March in Queensland and March to August in Victoria) as the catchment starts to wet up, characterized by the rising limb on an annual hydrograph.
 Higher skill is observed in catchments that demonstrate longer “persistence” in the observed streamflow data than in catchments with rapidly fluctuating, or flashy, runoff response. For example, in Victorian catchments the runoff response seems to be dominated by a large contribution of base flow, resulting in strong persistence in the streamflow data. In such cases, the relation between the predictand (streamflow) and the initial catchment wetness predictors is strong, resulting in high skill. In fast responding catchments (such as wet Tasmanian catchments or wet South Johnstone River and dry Cape River in Queensland), the skill relies mostly on the ability to predict the influence of the climate during the forecast period, and is generally poor [Robertson et al., 2013].
5.1.2. Method B
 Figure 5 compares the RMSEP skill scores, relative to climatology, produced by method B and method A. Each point in the figure represents the RMSEP skill score calculated for a single catchment and season. Most of the points (65%) lie above the 1:1 line, indicating that method B generally produces more skillful forecasts than method A. Seasons and catchments that have low or negative skill tend to show the greatest improvement in skill, while seasons and catchments with higher skill show little change. This suggests that the BMA forecast combination not only preserves the higher skill of forecasts but also moderates errors of the worst forecasts resulting from using method A.
 Figure 6 shows the RMSEP skill scores obtained by Method B. This time the RMSEP skill scores are calculated relative to Method A to show the benefit of a multimodel approach over the best model approach for the 22 catchments and 12 overlapping seasons. In Queensland, the improvements in skill range from 0.1 to 25% and are scattered across the catchments and seasons without any particular pattern. Some reductions in the skill of the forecasts occur, but these are infrequent and are less than 5%. In the upper Murray, the improvements can be observed for all catchments and in a majority of the seasons. The improvements can be as high as 15%, and reductions in skill are small (less than 5%). In central Victoria, southern Victoria and Tasmania the improvements are as high as 20% and reductions are less than 5%. In all catchments, the improvements outweigh losses in number of occurrences and magnitudes. The average improvement in skill for the 22 catchments and 12 seasons is 2.7%.
5.1.3. Method C
 Figure 7 compares the RMSEP skill scores, relative to climatology, of the forecasts produced by methods B and C. More than half of the points (55%) are located above the 1:1 line, indicating that there is some, albeit marginal, benefit in including candidate models using POAMA rainfall predictions as predictors into the BMA.
 Figure 8 shows the RMSEP skill scores calculated relative to method B to quantify the changes obtained by including models that use POAMA rainfall predictions. The skill varies widely with seasons and forecast catchments.
 In Queensland, the improvements are limited to a few forecast seasons. The reduction in skill is as high as 8.5% for the SON forecast for Somerset Dam, but in other catchments and seasons the reductions tend to be small (<5%). In the Upper Murray region, improvements are observed for MAM-AMJ and JJA-ASO seasonal forecasts and range in magnitude from 3.5 to 20%. The improvement for MAM seasonal forecasts occurs when the contribution to forecast skill from lagged climate indices is small and when the catchment is in transition from a dry to wet phase, during which the CW predictors provide little skill [Robertson et al., 2013]. The improvements for JJA and JAS occur in the period when the catchments are saturated and the contribution from the CW predictors is minimal.
 In central Victoria, the skill improvements for MAM forecasts continue for three catchments, and improvements during the wet months (JJA-ASO forecasts) occur for most catchments. In southern Victorian catchments, improvements occur for a few forecast seasons (JAS and NDJ) and tend to be small in magnitude (<10%). In other seasons, the changes in skill are small (within ±5%). In Tasmania, improvements in forecast skill outweigh reductions in terms of both magnitude and number of occurrences. The improvements are as high as 25%, while the losses are less than 6%.
 In Queensland, inclusion of POAMA rainfall as a predictor may not be beneficial, but in Upper Murray, central Victoria and Tasmania, it leads to improvements during seasons when both CW predictors and lagged climate indices do not provide much skill. Importantly, inclusion of POAMA rainfall as a predictor does not lead to a large loss of skill and the skill improvements outweigh the reductions for many catchments. The average improvement for the 22 catchments and 12 seasons is slightly more than 1%.
5.1.4. Method D
 Method D generally does not improve forecast skill over method C (figure not included). Minor improvements are visible in Tasmanian and Queensland catchments for some seasons. However, in most catchments and seasons the changes in RMSEP skill scores are within ±5%, indicating that inclusion of climate index predictions from POAMA does not add much value over POAMA rainfall predictions.
5.2. Forecast Reliability
 We use reliability diagrams [Wilks, 1995] to assess the reliability of the forecast distributions produced by each method. The reliability diagram compares forecast probabilities of events corresponding to certain selected thresholds against their observed relative frequency. For each method, we assess the reliability of the forecasts corresponding to three thresholds: events not exceeding 33.3, 50, and 66.6% of the observed flows. We assess the reliability of the forecasts produced by all four methods but show results corresponding to methods A (blue), B (red), and C (grey) in Figure 9. To increase sample size, we pool together the forecasts for four nonoverlapping seasons (JFM, AMJ, JAS, and OND) from all 22 catchments. We divide the sample into seven bins (Figure 9, insets) and include events from 1980 to 2008. The lines in the figure illustrate the reliability of the forecast distribution, while the histograms indicate the sharpness.
 The figure shows that for all the methods, the plotted values closely follow the 1:1 line, indicating that the forecast probabilities for all three thresholds are consistent with the observed frequencies. In other words, the forecast distributions are reliable. The reliability of the forecasts produced by method D (figure not included) is very similar to that of the other methods. Comparison of the histograms indicates a slight drop in sharpness from method A to methods B and C, as indicated by the larger number of samples in the extreme bins for method A.
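 The quantities plotted in a reliability diagram can be computed as in the following minimal sketch. It assumes that, for one threshold, each pooled event has a forecast probability of not exceeding the threshold and a 0/1 indicator of whether the observation did not exceed it. Seven equal-width probability bins are an assumption made here for illustration, and the function name and toy values are hypothetical.

```python
import numpy as np

def reliability_table(forecast_prob, event_occurred, n_bins=7):
    """Per-bin mean forecast probability, observed frequency, and count.

    forecast_prob  : forecast probability of the event for each pooled case
    event_occurred : 1 if the event was observed, 0 otherwise
    Returns a list of (mean_forecast_prob, observed_frequency, count),
    with NaN entries for empty bins.
    """
    p = np.asarray(forecast_prob, dtype=float)
    o = np.asarray(event_occurred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so probabilities of exactly 1 fall in the last bin
    idx = np.digitize(p, edges[1:-1])
    rows = []
    for b in range(n_bins):
        in_bin = idx == b
        count = int(in_bin.sum())
        if count == 0:
            rows.append((float("nan"), float("nan"), 0))
        else:
            rows.append((float(p[in_bin].mean()), float(o[in_bin].mean()), count))
    return rows

# Toy pooled sample (made-up values) for a single threshold.
probs = [0.05, 0.10, 0.40, 0.45, 0.55, 0.90, 0.95]
occurred = [0, 0, 0, 1, 1, 1, 1]
table = reliability_table(probs, occurred)
```

 Plotting observed frequency against mean forecast probability gives the reliability line, and the per-bin counts give the sharpness histogram: a sharper forecast places more cases in the extreme bins.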
6. Further Analyses and Discussions
6.1. Benefit of Forecast Combination Using the BMA
6.1.1. Moderation of Worst Forecast Error by BMA
 Our results show that combining forecasts from multiple models (method B) leads to increases in RMSEP skill scores over those obtained using the best model (method A) in many situations. The improvements are more visible in seasons that show negative skill. Figure 10 shows the time series of the skill scores produced by methods A and B. The skill scores are calculated for each event relative to climatology but averaged over all 22 catchments. Both methods show large variability in the skill scores, with values fluctuating between −160% and 80%. However, method B consistently performs better on forecasts with very low and negative skill, while performing as well as method A for forecasts that have higher skill. This further reinforces the result obtained in section 5.1.2 (Figure 5) that the BMA forecast combination not only preserves the higher skill of forecasts but also moderates errors of the worst forecasts resulting from using the best model.
 The figure also shows consistent improvements in RMSEP skill scores after 2002. This may be the result of the severe drought experienced over a large part of eastern Australia, which caused (1) an increase in persistence in the streamflow data resulting from many low flow events and (2) a shift in the climatology (i.e., the climatology computed as the average seasonal streamflow of a long period of record was a poor representation for this period in time).
6.1.2. Changes in Forecast Performance by BMA
 In general, the improvements due to BMA mostly occur when the forecast skill depends on a combination of various factors influencing climate and hydrological processes (represented by candidate models). In these situations, a single model is insufficient to represent the different aspects of the large-scale climatic system. To illustrate this, we take the example of OND forecasts for South Johnstone River. The forecast skill is affected by Indian Ocean and Pacific Ocean influences and neither of them is dominant. The choice of “best” model fluctuates between CW + SOI, CW + II, and CW + NINO34 (Figures 2b and 11a and Table 2) over the cross-validation period. Selection of a single model ignores the other two equally important models and results in lower RMSEP skill scores. Method B includes information from all three by weighting them approximately equally (Figure 11b). Method B results in forecast quantile ranges (Figure 12b) that are slightly more conservative (wider) than those of method A (Figure 12a) but include a larger number of observed values inside the 0.25–0.75 intervals (and closer to the forecast median), leading to a skill increase of ∼15% (Figure 5, magenta dot).
 Similarly, when all the candidate models have poor forecast skill, method A is vulnerable to the effects of random noise in the data, making the selection of the “best” model very difficult (and contentious) during cross validation. As an illustration, we take the example of the AMJ forecasts for Maroondah Reservoir. The lagged climate indices do not have much skill in forecasting the climate during autumn in southern Victoria and the CW predictors do not possess any skill either, as the catchment is in transition from being very dry to just starting to get wet. Ideally in this condition, selection of CLM would have provided a higher skill score, but method A selects a number of different models (Figure 11c) and produces forecast skill that is worse than climatology (Figures 4 and 5, cyan dot).
 Method B, on the other hand, assigns small weights to all models, with a slightly greater emphasis on CLM (Figure 11d). Therefore, although the skill of the forecasts is close to climatology, the increase in RMSEP skill score for the season is 15% compared to method A (Figure 5, cyan dot). The forecast quantiles generated by method A (Figure 12c) appear sharper than those of method B (Figure 12d), but have poorer skill. Forecast quantiles produced by method B have lower resolution, reflecting the lack of information in the data, but the observed values are more closely aligned with the median, resulting in the increase in the skill score.
 In some cases, the negative skill scores produced by method A can be avoided by using a ln(PsBF) threshold that limits the selection of a candidate model due to random noise in the data [Robertson and Wang, 2012a; Bracken et al., 2010; Regonda et al., 2006]. However, determining the threshold is complex and requires considerable computational effort [Robertson and Wang, 2012a], while method B provides a much simpler alternative.
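 The thresholding idea can be sketched as follows, assuming that the pseudo-Bayes factor of a candidate model is the ratio of the products of its cross-validation predictive densities to those of a reference model (for example, climatology), so that ln(PsBF) is a difference of summed log densities. The function names, the fallback label, and the threshold value are hypothetical; no particular threshold is implied by this study.

```python
import numpy as np

def ln_psbf(cv_dens_model, cv_dens_reference):
    """Log pseudo-Bayes factor of a candidate model against a reference.

    Both inputs hold cross-validation predictive densities evaluated at
    the observations, one value per event.
    """
    return float(np.sum(np.log(cv_dens_model)) - np.sum(np.log(cv_dens_reference)))

def select_model(cv_dens_by_model, cv_dens_reference, threshold=0.0):
    """Pick the candidate with the largest ln(PsBF) if it clears the threshold.

    Falls back to the reference model otherwise, which guards against
    selecting a candidate on the strength of noise alone.
    """
    scores = {name: ln_psbf(d, cv_dens_reference)
              for name, d in cv_dens_by_model.items()}
    best = max(scores, key=scores.get)
    if scores[best] > threshold:
        return best, scores
    return "reference", scores

# Toy densities (made-up values) for two candidates and the reference.
reference = [1.0, 1.0, 1.0, 1.0]
candidates = {
    "CW + SOI": [1.2, 0.9, 1.1, 1.0],
    "CW + NINO34": [1.1, 1.0, 0.9, 1.05],
}
chosen, scores = select_model(candidates, reference, threshold=2.0)
```

 With a demanding threshold, weakly supported candidates are rejected in favor of the reference; the difficulty noted above lies in choosing that threshold, which this sketch does not address.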
 Decreases in forecast skill from BMA also occur, but are less frequent and of small magnitude (<5%). They mostly occur for catchments and seasons where a single candidate model produces the most skill in forecasts. As an illustration, we look at OND forecasts for Lake Hume. For the OND season, the climate in the region is strongly affected by Indian Ocean influences and the forecast skill is dominated by CW + WPI (Figure 11e). In this case, the choice of the “best” model is very clear and the forecasts are very skillful, with an RMSEP skill score of 50% relative to climatology.
 Method B assigns a high weight (∼0.8) to CW + WPI (Figure 11f), but it also accommodates other candidate models. Although the weights given to other models are small compared to CW + WPI, they result in some (<1%) loss in RMSEP skill score (Figure 5, orange dot). The contributions from other models can be reduced, to some extent, by using a weaker prior on the weights, but this would also make the weights more vulnerable to random noise in the data and lead to worse performance overall. The forecast quantiles (Figures 12e and 12f) generated by both methods are similar. In this case, the loss of skill is marginal, and method B preserves the forecast qualities of method A.
6.2. Benefit of Using POAMA Predictions
 The POAMA rainfall prediction skill varies considerably with season and spatially across continental Australia, and this is translated to the seasonal streamflow forecast performance of method C [Langford et al., 2011; Schepen et al., 2012a; Lim et al., 2011]. In Upper Murray for MAM, in particular, POAMA rainfall predictions show very high skill. This may be due to the ability of POAMA to capture upper atmospheric processes. The autumn rainfall variability in south-east Australia can be linked with the strengthening and southward movement of the subtropical ridge [Robertson and Wang, 2012b]. While POAMA may have adequately captured this process, lagged climate indices derived from SST anomalies cannot. Furthermore, atmospheric processes (like the one described above) are often very chaotic and lack the persistence necessary to be useful as lagged climate indices. However, further investigations are required to verify our speculation.
 Similarly, the POAMA data available for model establishment and BMA are shorter and thus noisier than the data for method B. This could make BMA more vulnerable to noise in the data, leading to negative skill scores (relative to method B) in some seasons. The BMA approach used in this study can account for short and noisy data, to some extent, by applying a stronger prior on weights. However, this also increases the potential for diluting improvements in seasons where there is stronger skill.
 Finally, the use of POAMA SST predictions does not provide skill beyond what is provided by the lagged climate indices and the POAMA rainfall predictions, but their inclusion does not degrade forecast skill either. Operationally this leads to an interesting question of whether all POAMA prediction combinations (POAMA rainfall predictions as well as POAMA SST predictions) could be used in place of method C without substantial losses in skill.
 We investigate this using an additional modeling setup that combines forecasts from models using all POAMA predictions as predictors with CLM and CW, and compare the results with method C. The results (figure not included) show that method C provides better forecast skill than the setup using all POAMA predictions. The lagged climate indices data available over a longer period of time could possibly lead to identification of more robust relationships, thus providing greater forecast skill. The result highlights the important contribution of the lagged climate indices to the skill of the seasonal streamflow forecasts in the current forecasting context.
7. Summary and Conclusion
 Our study aims to improve forecasts of 3-month (seasonal) total streamflow made using the BJP modeling approach by providing a better representation of the climate during the forecast period. We test two strategies aimed at improving the seasonal streamflow forecasting performance. The first strategy involves combining forecasts from multiple candidate models using the BMA approach [Wang et al., 2012a]. The BMA approach combines the strengths of the forecasts made by different candidate models, obviating the need to select the best model, and accounts for model uncertainty. The second strategy is to incorporate rainfall and SST predictions from a dynamical climate model (POAMA) into the existing forecasting system. The dynamical models simulate physical processes and capture concurrent relations between SST anomalies and climate in the forecast region. We test both strategies on 22 catchments located in eastern Australia.
 Our results show that BMA successfully combines the strengths of forecasts produced by competing models, leading to increases in forecast skill in most catchments. Forecasts produced using the BMA combination consistently retain the skill of the best individual candidate model when it performs well. When the best individual candidate model performs poorly, the BMA combination improves forecast skill. The improvements are mostly observed for forecasts that are under influence of a number of different competing models or processes, and when forecast skill for the season is very low and possibly dominated by noise. Marginal losses, however, can occur for forecasts that are dominated by a strong and overwhelming signal or a single candidate model.
 The inclusion of POAMA rainfall predictions leads to mixed results, with skill improvements in Victorian and Tasmanian catchments and losses of skill in Queensland. However, the losses are generally small and the benefits outweigh the losses when viewed across the 22 catchments and 12 seasons. Use of POAMA SST predictions does not provide any additional value over that provided by the POAMA rainfall predictions.
 In the current circumstances, we believe that the strengths of both lagged climate indices and POAMA rainfall predictions should be combined to produce the best forecast performance. We recommend the BMA combination of forecasts from the candidate models with the 11 lagged climate indices, three POAMA rainfall predictions as well as the catchment wetness and the climatology (as in method C). While this combination does not substantially improve forecast skill over all the catchments, it provides the best overall forecast performance without greatly increasing the computational burden.
 This research has been funded by the South Eastern Australian Climate Initiative (SEACI). Partial support for the research is also provided by the Water Information Research and Development Alliance (WIRADA) between CSIRO's Water for a Healthy Country Flagship and the Bureau of Meteorology, and by the CSIRO OCE Science Leadership Scheme. Data used in this study were provided by Melbourne Water, Hydro Tasmania, the Murray Darling Basin Authority, Goulburn-Murray Water, the Queensland Department of Environment and Resource Management, the Bureau of Meteorology's Climate and Water Division, and the Centre for Australian Weather and Climate Research. We would like to acknowledge James C. Bennett for his help in editing the manuscript and Andrew Schepen for making the POAMA climate index forecasts available to us. We would also like to thank the Associate Editor and three anonymous reviewers, whose comments and suggestions helped improve the paper substantially.