Streamflows often vary strongly with season, and this leads to seasonal dependence in hydrological model errors and prediction uncertainty. In this study, we introduce three error models to describe errors from a monthly rainfall-runoff model: a seasonally invariant model, a seasonally variant model, and a hierarchical error model. The seasonally variant model and the hierarchical error model use month-specific parameters to explicitly account for seasonal dependence, while the seasonally invariant model does not. A Bayesian prior is used in the hierarchical error model to account for potential variation and connection among model parameters of different months. The three error models are applied to predicting streamflows for five Australian catchments and are compared by various performance scores and diagnostic plots. The seasonally variant model and the hierarchical model both perform substantially better than the seasonally invariant model. From a cross-validation analysis, the hierarchical error model provides both the most accurate prediction mean and the most reliable prediction uncertainty distribution in most situations. The use of the prior to constrain the model parameters in the hierarchical model produces more robust parameter estimation than the other two models.
 Hydrological models have become essential tools for flood hazard mitigation and water resources management. Increasingly, there is a demand for probabilistic predictions to reflect the fact that model predictions are subject to errors and that prediction uncertainty needs to be taken into account in decision making.
 Various methods have been developed over recent decades to quantify hydrological prediction uncertainty. The methods range from lumping all errors into only prediction errors [e.g., Sorooshian and Dracup, 1980; Kuczera, 1983; Vrugt et al., 2005], to implicitly specifying model input, output, parameter and structural errors through the likelihood function on the total error [e.g., Beven and Binley, 1992; Freer et al., 1996], through to explicitly characterizing each source of errors [e.g., Moradkhani et al., 2005; Kuczera et al., 2006; Huard and Mailhot, 2008; Reichert and Mieleitner, 2009; Renard et al., 2010; Salamon and Feyen, 2010; Renard et al., 2011]. In nearly all cases, statistical models are used to represent the structure of the prediction errors. Some statistical models assume homoscedastic error distributions [e.g., Diskin and Simon, 1977], while others assume heteroscedastic error distributions either explicitly [e.g., Sorooshian and Dracup, 1980; Schoups and Vrugt, 2010] or through data transformation [e.g., Thiemann et al., 2001; Thyer et al., 2002; Wang et al., 2012a]. They also differ in the way they represent the temporal dependence of the prediction errors. The most commonly used are independent error models [e.g., Diskin and Simon, 1977] and autoregressive error models [e.g., Kuczera, 1983; Bates and Campbell, 2001; Engeland and Gottschalk, 2002].
 We attempt to devise error models that can be applied to real-time hydrological forecasts. To make these models easy to apply, we have sought to make them as simple as possible and to minimize computation. To achieve this, we have chosen to focus only on errors in the response variable—i.e., the streamflow prediction.
 The performance of hydrological models varies with flow magnitudes and soil moisture [Freer et al., 2003; Choi and Beven, 2007], and because flow magnitudes and soil moisture often vary with season it is useful to consider errors as being dependent on season. It is possible to attempt to reduce seasonally dependent prediction errors by calibrating hydrological models differently for different seasons (or months), and then apply a generic error model. Alternatively, or in addition, the error model can be varied seasonally. For example, Yang et al. considered a continuous-time autoregressive error model and used different asymptotic standard deviations and characteristic correlations for dry and wet seasons. Their case study of the Chaohe Basin in northern China showed that using seasonally dependent parameters leads to more accurate probabilistic streamflow prediction than using constant parameters throughout the year. Engeland et al. created 15 seasonally dependent weather classes for a catchment in northern Norway and evaluated hydrological prediction errors with an autoregressive model using weather class specific parameters. They demonstrated the usefulness of seasonally dependent parameters for accounting for high uncertainties linked to snow cover formation and snowmelt processes.
 In this study, we consider a hydrological model that is calibrated to all available data (i.e., not calibrated conditionally for each season or month). We then rely on the error model to cope with seasonal dependence in the prediction errors. Our approach to seasonal error modeling is similar to postprocessing: we treat the hydrological model parameters (once calibrated) as fixed, and then devise seasonally dependent error models without revising the hydrological model parameters.
 Varying error models by month or by season introduces a large number of additional parameters. Any model that has a large number of parameters may be prone to overfitting. One approach to guarding against overfitting is to apply constraints on parameters. In one of our error models (the seasonally variant model), we allow the parameters to be specified for each month without constraints. To guard against overfitting, we devise a hierarchical error model that connects the parameters of different months through a Bayesian prior (we refer to this as the hyper-distribution). The extent of parameter variation with month is then inferred from data.
 Bayesian hierarchical modeling has been applied to hydrological prediction errors previously, notably in the Bayesian Total Error Analysis (BATEA) methodology introduced by Kavetski et al. BATEA is based on a Bayesian hierarchical model but differs from this study in several respects. First, BATEA uses a Bayesian hierarchical model to introduce latent variables describing uncertainties in observations and model structure, whereas the hierarchical error model in this study connects model parameters of different months to avoid possible overparameterization. Second, BATEA applies a Markov chain Monte Carlo (MCMC) method to evaluate the posterior distribution of model parameters; in this study, we estimate only the single best value of the model parameters, and parameter uncertainty is not explicitly considered. Third, BATEA explicitly treats different sources of error, while the hierarchical error model in this study aggregates all sources of error into the prediction residual.
 In section 2, we describe the hydrological model used in this study and present methods relating to the error models, their estimation and evaluation. A case study of five catchments in Australia to demonstrate model calibration and verification is given in section 3. We discuss and summarize our findings in section 4.
2.1. Hydrological Model
 The WAter PArtition and BAlance (WAPABA) model recently introduced by Wang et al. is used in this study. The WAPABA model is a lumped conceptual monthly rainfall-runoff model using monthly rainfall and potential evapotranspiration as inputs. The WAPABA model evolved from a Budyko framework model [Zhang et al., 2008] and partitions water to a number of components based on supply-demand-consumption curves. Wang et al. applied the WAPABA model to 331 catchments in Australia and found that it performed as well as or even better than two widely used daily models in simulating monthly runoff. The WAPABA model has five parameters:
- the catchment consumption curve parameter,
- the evapotranspiration curve parameter,
- the proportion of catchment yield as groundwater,
- the groundwater store time constant,
- the maximum water holding capacity of the soil store.
 We denote the collection of the five WAPABA parameters by a single parameter vector.
2.2. Seasonally Invariant Model and Seasonally Variant Model
 Let $Q_t$ and $\tilde{Q}_t$, respectively, denote the actual and simulated monthly streamflow from the WAPABA model at a given time $t$. To normalize the data and stabilize the variance, a logarithmic hyperbolic-sine (log-sinh) transform [Wang et al., 2012a] is applied to $Q_t$ and $\tilde{Q}_t$ by

$$Z_t = \frac{1}{b}\ln\left\{\sinh\left(a + bQ_t\right)\right\} \quad \text{and} \quad \tilde{Z}_t = \frac{1}{b}\ln\left\{\sinh\left(a + b\tilde{Q}_t\right)\right\},$$

respectively, in order to induce the model error

$$\eta_t = Z_t - \tilde{Z}_t$$

to follow a normal distribution. Here $a$ and $b$ are the transform parameters. A seasonally invariant error model is defined by a lag-one autoregressive model of the model error in the transformed domain as

$$\eta_t = \mu + \rho\left(\eta_{t-1} - \mu\right) + \varepsilon_t, \qquad (5)$$

where $\varepsilon_t$ is a white noise process with zero mean and variance $\sigma^2$. There are three error model parameters in a seasonally invariant model: the parameter $\mu$ is closely related to the bias of $\eta_t$; the parameter $\rho$ describes the lag-one autocorrelation of $\eta_t$; and the parameter $\sigma$ represents the variation of $\varepsilon_t$. To keep the model structure simple, we assume that the WAPABA model has no overall bias (i.e., $E(\eta_t)=0$), and this implies that $\mu=0$. We thus do not infer $\mu$ from model calibration but fix it to be zero. When equation (5) is applied, it updates the error by using the information from the previous time step. All three parameters in equation (5) are assumed to be constant over time (i.e., seasonally invariant). This restricts the effectiveness of the seasonally invariant error model to cases where the model error has little seasonal variation.
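As a concrete illustration, the transform and the autoregressive error update can be sketched in code. The function names, the parameter values, and the exact form of the log-sinh transform shown here are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def log_sinh(q, a, b):
    """A log-sinh variance-stabilizing transform (one plausible form)."""
    return np.log(np.sinh(a + b * q)) / b

def ar1_predict(eta_prev, rho, sigma):
    """One-step-ahead error distribution under the seasonally invariant
    AR(1) model with the mean fixed to zero: eta_t ~ N(rho * eta_prev, sigma^2).
    Returns the predictive mean and standard deviation of eta_t."""
    return rho * eta_prev, sigma

# Example: update the transformed-domain error from the previous month
a, b = 1.0, 0.1            # illustrative transform parameters
q_obs, q_sim = 25.0, 20.0  # observed and simulated flow (mm)
eta = log_sinh(q_obs, a, b) - log_sinh(q_sim, a, b)
mean_next, sd_next = ar1_predict(eta, rho=0.5, sigma=0.3)
```

In this sketch, the AR(1) update shifts the predictive mean toward the previous month's error, which is how the error model "learns" from the most recent observation.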
 A seasonally fully variant error model (abbreviated to seasonally variant model) is defined by

$$\eta_t = \mu_{m(t)} + \rho_{m(t)}\left(\eta_{t-1} - \mu_{m(t-1)}\right) + \varepsilon_t, \qquad (6)$$

where $\eta_t$ is the model error in the transformed domain, $m(t)$ denotes the calendar month at time $t$, and $\varepsilon_t$ is a white noise process with zero mean and variance $\sigma^2_{m(t)}$. This is an extension of the seasonally invariant error model that allows each parameter to vary across all months. The error model parameters in equation (6) are month specific to explicitly represent seasonal dependence in prediction errors. The total number of error model parameters for the seasonally variant model is 36 (three parameters for each of the 12 months), while the seasonally invariant error model has only three parameters. In addition, there are the two transform parameters and five hydrological model parameters.
2.3. Hierarchical Error Model
 To improve the robustness of the error model, we develop a new error model from the seasonally variant model by building connections within model error parameters through Bayesian modeling. In particular, model error parameters from different months are assumed to be random and follow a common prior. The parameter variation with month is inferred from data and indicates the seasonal dependency of model error structure.
 We define the hierarchical error model by equation (6) with the additional assumption that the error model parameters of different months follow independent and identical Gaussian priors:

$$\mu_m \sim N\left(\alpha_\mu, \beta_\mu^2\right), \qquad \rho_m \sim N\left(\alpha_\rho, \beta_\rho^2\right), \qquad \log\sigma_m \sim N\left(\alpha_\sigma, \beta_\sigma^2\right)$$

for $m = 1, \ldots, 12$, where $\mu_m$, $\rho_m$, and $\sigma_m$ are the month-specific error model parameters, $\alpha_\mu$, $\alpha_\rho$, and $\alpha_\sigma$ are the hyperparameters describing the means of the error model parameters, and $\beta_\mu$, $\beta_\rho$, and $\beta_\sigma$ are the hyperparameters describing the standard deviations. We reparameterize $\sigma_m$ to $\log\sigma_m$ so that its distribution can be easily identified in the reparameterized domain. The issue of the reparameterization will be further addressed from the perspective of parameter estimation in section 2.4.
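The role of the hyper-distribution can be sketched as a sampling exercise: given hyperparameters, one draw from the prior yields a full set of month-specific parameters. The hyperparameter values and names below are hypothetical, and sampling sigma on the log scale mirrors the reparameterization described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters: (mean, sd) of the Gaussian prior for each
# error model parameter; sigma is handled on the log scale.
hyper = {"mu": (0.0, 0.05), "rho": (0.4, 0.15), "log_sigma": (-1.0, 0.3)}

def draw_monthly_params(hyper, n_months=12, rng=rng):
    """Draw one set of month-specific error model parameters from the
    Gaussian hyper-distribution (a sketch of the hierarchical prior)."""
    params = {}
    for name, (mean, sd) in hyper.items():
        params[name] = rng.normal(mean, sd, size=n_months)
    params["sigma"] = np.exp(params.pop("log_sigma"))  # back-transform to sigma > 0
    return params

monthly = draw_monthly_params(hyper)
```

Small prior standard deviations pull the 12 monthly parameters toward a common value; large ones let them vary freely, so the data decide the degree of seasonality.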
2.4. Parameter Estimation
2.4.1. Seasonally Invariant Model
 Maximum likelihood estimation (MLE) is used to estimate all 10 parameters (five hydrological model parameters, two transform parameters, and three error model parameters) of the seasonally invariant model. MLE selects the parameter values that maximize the likelihood (the probability density function of the observed streamflow conditional on precipitation and potential evapotranspiration for all time steps within the calibration period) as a function of the unknown parameters,
 The hydrological model transfers the information from the precipitation and potential evapotranspiration inputs to the simulated streamflow, so the likelihood given by equation (10) can be written explicitly as
By using equations (2)-(5), we can derive the likelihood contribution at each time step as
where $J_t$ is the Jacobian determinant for the transform from $Q_t$ to $Z_t$,
 Care must be exercised in computing the likelihood function when zero flows occur. Zero flows are treated as censored data having unknown values at or below zero. The cumulative probability of flows at or below zero is adopted for the corresponding term in the likelihood when the observed flow is zero,
where $\Phi$ is the cumulative distribution function of a standard normal distribution. The zero-flow treatment is essentially the same as that of Wang and Robertson, but is applied only to the observed streamflow (see further discussion in section 5). Equations (12) and (15) describe a probabilistic streamflow prediction, which is the probability distribution of the streamflow at time $t$ conditioned on the model parameters, the WAPABA simulated streamflows at times $t$ and $t-1$, and the observed streamflow at time $t-1$. This streamflow prediction includes error updating. The Shuffled Complex Evolution (SCE) algorithm [Duan et al., 1994] is used to minimize the negative log likelihood for the seasonally invariant model. The lower and upper bounds used in the SCE optimization are given in Table 1.
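The censored treatment of zero flows can be sketched as a single likelihood term. The helper below is hypothetical (not the authors' code): it assumes a Gaussian predictive distribution in the transformed domain and switches to the cumulative probability when the observed flow is zero:

```python
import math
from statistics import NormalDist

def loglik_term(q_obs, mean_t, sd_t, transform, jacobian):
    """Log-likelihood contribution of one month (illustrative sketch).
    mean_t, sd_t: mean and sd of the predictive distribution in the
    transformed domain; `transform` maps flow to that domain; `jacobian`
    is the derivative of the transform at q_obs. Zero flows are treated
    as left-censored at zero."""
    nd = NormalDist(mean_t, sd_t)
    if q_obs <= 0.0:
        # Censored term: cumulative probability at the transform of zero.
        return math.log(nd.cdf(transform(0.0)))
    # Uncensored term: transformed-domain density times the Jacobian.
    return math.log(nd.pdf(transform(q_obs))) + math.log(jacobian)
```

The total log likelihood would then be the sum of such terms over the calibration period, mixing density contributions (positive flows) with probability-mass contributions (zero flows).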
Table 1. Parameter Estimation for the Seasonally Invariant Model, Including the Lower and Upper Bounds of the SCE Optimization and the Calibrated Parameters for the EPP Catchment
Fixed value; see discussion in section 2.2. $\bar{Q}$ denotes the mean streamflow.
2.4.2. Seasonally Variant Model
 MLE is also used for the seasonally variant model, and the likelihood is essentially the same as equations (10) and (11), but we keep the transform parameters and the hydrological model parameters at the estimates obtained for the seasonally invariant model. This approach aims to avoid possible parameter interaction and to ease the computational burden. More discussion of this approach is given in section 5. The likelihood evaluated at all observations can be calculated as the product of the likelihood contributions of each individual month:
 This implies that we can maximize the likelihood in equation (11) by maximizing the likelihood function evaluated at each individual month. The same estimation procedure as for the seasonally invariant model can be applied directly to estimate the parameters for each individual month: the MLE for the seasonally variant model is essentially the MLE for the seasonally invariant model applied 12 times. In this estimation procedure, the parameters for one month are estimated independently of those for the other months, and the estimation of the seasonally variant model has the same order of computational complexity as that of the seasonally invariant model. The Nelder-Mead Simplex algorithm [Nelder and Mead, 1965] is used to minimize the negative log likelihood for the seasonally variant model. The Simplex algorithm is a local optimization method and is thus sensitive to starting conditions. If flows are never zero, maximizing the likelihood function defined by equation (11) is equivalent to solving an ordinary least squares problem because of the Gaussian assumption. We therefore use the MLE computed as if no zero flows were present as the starting values for the Simplex algorithm. The proportion of monthly streamflows that are zero is generally small in the catchments we have examined (see section 3) and, therefore, the optimized parameters are often close to the starting values.
2.4.3. Hierarchical Error Model
 The hierarchical maximum likelihood estimation of Farrell and Ludwig [2008] is used for the hierarchical error model. For the hierarchical model, the error model parameters and hyperparameters are estimated separately in a two-stage procedure. The other parameters are carried over unchanged from the seasonally invariant model, and their values are fixed. In the first stage, we do not directly estimate the error model parameters for each individual month, but instead estimate the hyperparameters at the population level. The point estimates of the hyperparameters maximize the likelihood, marginalized over all possible error model parameters for each month:
where the likelihood contribution of each month is given by equation (12). In the second stage, the error model parameters for each individual month are estimated by maximizing the likelihood conditional on the estimated hyperparameters from the first stage. Specifically,
 The major computational difficulty is the integral in equation (17). Although this integral factorizes into 12 three-dimensional integrals, each subintegral has no analytical form and has to be calculated by Monte Carlo integration [e.g., see Robert and Casella, 2004, chap. 3]. For a given set of hyperparameters, we generate the three error model parameters of each month randomly from their Gaussian priors and calculate the resulting conditional likelihood. We repeat this procedure N times and use the average of the N likelihood values to approximate each subintegral. In this study, we choose N to be 1000. As for the seasonally variant model, the Simplex algorithm is used for estimating the hierarchical error model. The starting values of the hyperparameters are derived from the parameters estimated for the seasonally variant model; for example, we use the mean and standard deviation of the 12 monthly estimates of a parameter as the starting values of the corresponding mean and standard deviation hyperparameters. The starting values for maximizing the likelihood function described by equation (18) are the corresponding parameters estimated for the seasonally variant model.
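The Monte Carlo approximation of one subintegral can be sketched as follows. The function `cond_loglik` is a hypothetical stand-in for the conditional log likelihood of one month's data; averaging is done via log-sum-exp for numerical stability (a detail of our sketch, not necessarily of the authors' implementation):

```python
import numpy as np

def marginal_loglik(cond_loglik, hyper_mean, hyper_sd, n_draws=1000, seed=0):
    """Monte Carlo approximation of one month's marginal log likelihood:
    average the conditional likelihood over parameter draws from the
    Gaussian hyper-distribution (a sketch of the integral in equation (17))."""
    rng = np.random.default_rng(seed)
    # One row per draw; one column per error model parameter.
    draws = rng.normal(hyper_mean, hyper_sd, size=(n_draws, len(hyper_mean)))
    log_liks = np.array([cond_loglik(theta) for theta in draws])
    # log of the average likelihood, computed stably via log-sum-exp
    m = log_liks.max()
    return m + np.log(np.mean(np.exp(log_liks - m)))
```

With N = 1000 draws per month, the outer optimization over hyperparameters evaluates 12 such averages at every iteration, which is where most of the computation goes.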
2.5. Model Evaluation
 The issue of model evaluation needs to be addressed in any modeling exercise. Cross validation is a common approach to assessing model adequacy without using additional, independent data for verification. To assess a prediction against a given monthly observation, we leave out the observed streamflow of that month and the streamflows of the five subsequent years from model parameter estimation (i.e., 60 monthly observations are removed). We then use the estimated parameters to predict the streamflow of that particular month. Streamflow in a given month influences subsequent flows for a certain period through catchment memory; we assume that this influence extends less than 5 years. Leaving out the five succeeding years should therefore ensure that the observed flows have a negligible influence on the prediction for a given month, giving a reliable cross validation. It follows that the cross-validation results we present give a robust estimate of model performance for future events.
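The leave-out scheme described above can be sketched as an index helper (a hypothetical utility, not the authors' code):

```python
def cv_training_index(n_months, target, leave_out_years=5):
    """Return the indices of months retained for parameter estimation when
    predicting month `target`: the target month and the five subsequent
    years (60 months in total) are left out, as described above."""
    excluded = set(range(target, min(target + leave_out_years * 12, n_months)))
    return [t for t in range(n_months) if t not in excluded]

# Example: 50 years of monthly data (600 months), predicting month 100
train = cv_training_index(600, 100)
```

Months before the target remain in the training set, which is consistent with the catchment-memory argument: past flows may influence the target month, but the target month cannot influence the past.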
 We use several evaluation statistics and diagnostic plots for model verification. Bias and Nash-Sutcliffe (NS) efficiency [Nash and Sutcliffe, 1970] are used to quantify the accuracy of the prediction mean after bias correction and error updating, defined as

$$\mathrm{Bias} = \frac{1}{n}\sum_{t=1}^{n}\left(\mu_t - Q_t\right) \quad \text{and} \quad \mathrm{NS} = 1 - \frac{\sum_{t=1}^{n}\left(\mu_t - Q_t\right)^2}{\sum_{t=1}^{n}\left(Q_t - \bar{Q}\right)^2},$$

respectively, where $\mu_t$ is the prediction mean and $\bar{Q}$ is the mean of $Q_t$ for all $t$.
 The prediction probability distributions are evaluated by using the continuous ranked probability score (CRPS). The CRPS has been widely used to assess probabilistic forecasts of streamflow, for example, by Yang et al. and Wang et al. Let $F_t$ be the cumulative distribution function of a probabilistic prediction of streamflow at time $t$. The CRPS evaluated at the observed streamflow $Q_t$ is defined by

$$\mathrm{crps}_t = \int_{-\infty}^{\infty}\left[F_t(q) - H\left(q - Q_t\right)\right]^2 dq,$$

where $H(\cdot)$ denotes a step function that attains a value of 1 if $q \ge Q_t$ and a value of 0 otherwise. We denote the average of $\mathrm{crps}_t$ over all $t$ of interest as CRPS and use it as an overall assessment of performance. A smaller value of CRPS indicates a better probabilistic prediction.
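For intuition, the CRPS of an empirical predictive distribution can be computed from its members via the identity CRPS = E|X − obs| − 0.5 E|X − X′|. This ensemble form is our illustration only; the study evaluates the CRPS from a closed-form predictive distribution:

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of an empirical predictive distribution represented by
    ensemble members (illustrative sketch)."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))                        # E|X - obs|
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # 0.5 E|X - X'|
    return term1 - term2
```

For a degenerate (deterministic) forecast the CRPS reduces to the absolute error, so it generalizes the mean absolute error to probabilistic predictions.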
 A cross-validated likelihood [Smyth, 1996, 2000; Shinozaki et al., 2010; Wang et al., 2012b] is used to indicate predictive capability, and the ratio of cross-validated likelihoods is used to support the use of one error model over a comparative one for prediction. Mathematically, a cross-validated log-likelihood ratio of model $M_i$ to model $M_j$ is defined by

$$R_{ij} = \sum_{t}\log p_i\left(Q_t\right) - \sum_{t}\log p_j\left(Q_t\right),$$

where $p_i(Q_t)$ is the predictive density of $Q_t$ under model $M_i$, evaluated at the cross-validation MLE of the model parameters; it is conditional on the precipitation, potential evapotranspiration, WAPABA simulated streamflow, and observed streamflow at the previous time step, and it has a closed form provided by equations (12) and (15). We calculate the cross-validated log-likelihood ratio for each pair of the three error models.
 $M_i$ is preferred to $M_j$ when $R_{ij}$ is greater than zero, subject to sampling uncertainty. A chi-square approximation is usually used for the likelihood ratio but is not applicable to our study because we consider a likelihood ratio under cross validation. To determine whether the improvement of $M_i$ over $M_j$ is statistically significant, we follow the bootstrap procedure for the cross-validated likelihood of McLachlan to simulate the distribution of $R_{ij}$. Denote the contribution of each month to the log-likelihood ratio by $r_t$, so that $R_{ij} = \sum_t r_t$. We resample $\{r_t\}$ with replacement and compute the bootstrap statistic $R^*_{ij}$ from the bootstrap resample. We repeat the bootstrap resampling 5000 times and approximate the distribution of $R_{ij}$ by the empirical distribution of $R^*_{ij}$. The proportion of the bootstrap statistics greater than zero indicates how strongly $M_i$ is preferred to $M_j$: it suggests how often $M_i$ works better than $M_j$ (i.e., the probability) but not by how much (i.e., the magnitude).
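The bootstrap procedure can be sketched as follows; the per-month log predictive densities passed in are hypothetical example values:

```python
import numpy as np

def bootstrap_preference(loglik_i, loglik_j, n_boot=5000, seed=0):
    """Proportion of bootstrap resamples in which model i attains a higher
    cross-validated log likelihood than model j (sketch of the procedure
    described above). Inputs: per-month log predictive densities."""
    r = np.asarray(loglik_i) - np.asarray(loglik_j)  # monthly contributions
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(r), size=(n_boot, len(r)))  # resample with replacement
    ratios = r[idx].sum(axis=1)  # bootstrap log-likelihood ratios
    return float(np.mean(ratios > 0.0))
```

A returned proportion near 1 indicates strong support for model i over model j; a value near 0.5 indicates no clear preference.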
 In addition, a histogram is used as a diagnostic plot to check the uniformity of the prediction probability integral transform (PIT) of the streamflow observations. The PIT of the observed streamflow $Q_t$ is defined by

$$\pi_t = F_t\left(Q_t\right).$$

 If the predictive distribution $F_t$ is reliable, $\pi_t$ should be uniformly distributed on [0, 1]. Deviation of $\pi_t$ from uniformity indicates whether the predictions are too high or too low, or too wide or too narrow, compared with the observed streamflows. We use a histogram in preference to the PIT uniform probability plot because a large number of observations are available [Wang et al., 2009].
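Computing PIT values and their histogram counts is straightforward. The Gaussian predictive distribution used here is our simplification; the study computes PITs from the predictive CDF given by equations (12) and (15):

```python
import numpy as np
from statistics import NormalDist

def pit_values(obs, means, sds):
    """PIT of each observation under its Gaussian predictive distribution
    (illustrative sketch)."""
    return np.array([NormalDist(m, s).cdf(q) for q, m, s in zip(obs, means, sds)])

def pit_histogram(pits, n_bins=10):
    """Bin counts for a PIT histogram; roughly equal counts across bins
    indicate a reliable predictive distribution."""
    counts, _ = np.histogram(pits, bins=n_bins, range=(0.0, 1.0))
    return counts
```

A left-skewed histogram signals systematic overprediction, a right-skewed one underprediction, and a U- or hump-shaped one a predictive distribution that is too narrow or too wide.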
3. Case Study
 We carry out this study in five catchments in Victoria, Australia (Figures 1b and 1c), a region of temperate climate. Attributes of these catchments are given in Table 2. Rainfall and streamflows in these catchments are highest around the austral winter (June to September) and lowest in the austral summer (January to March) (Figure 1a).
Table 2. Catchment Attributes for Five Catchments Used in This Study

Catchment | Area (km2) | Mean Annual Rainfall (mm) | Mean Annual Flow (mm) (Volume in Parentheses) | Annual Runoff Coefficient | Zero Flow Proportion (%)
Lake Nillahcootie (NIL) | — | — | 150 (63 GL) | — | —
Lake Eildon (EIL) | — | — | 373 (1447 GL) | — | —
Eppalock Reservoir (EPP) | — | — | 98 (172 GL) | — | —
Cairn Curran Reservoir (CCN) | — | — | 72 (115 GL) | — | —
Thompson Reservoir (THM) | — | — | 485 (236 GL) | — | —
 Monthly streamflow observations from 1950 to 2004 are used in this study. The monthly catchment average rainfall and potential evapotranspiration for each catchment are calculated using the 5 km gridded data set developed for the Australian Water Availability Project (AWAP) [Raupach et al., 2008; Jones et al., 2009]. The first five years (1950–1954) are used as a warm-up period to initialize states in the monthly hydrological model. The remaining 50 years (1955–2004) are used for model calibration and verification. Because the case studies from five catchments achieve very similar results, we only report detailed results from the Eppalock (EPP) Reservoir catchment in the main body of the paper (section 3) and outline the results from the other catchments in a summary table. Further analyses of the other catchments are included as supporting information.
 The estimated model parameters of the seasonally invariant model from the calibration period (1955–2004) are presented in Table 1. The seasonally dependent error model parameters from the seasonally variant model and the hierarchical error model are compared graphically in Figure 2. The error model parameters from the two models share a similar seasonal pattern. For example, for both models, one of the error model parameters reaches its minimum in March and rises to its maximum in October before declining again. The autocorrelation parameter for both the seasonally variant model and the hierarchical error model is generally positive except in some dry months, such as February. As expected, the seasonally variant model parameters vary more with season than the hierarchical error model parameters. The error model parameters from the seasonally invariant model fall within the range spanned by the estimated parameters of the hierarchical error model. The hyperparameters for all five catchments are given in Table 3, which shows that the standard deviation hyperparameters are all different from zero. This demonstrates that the error model parameters are indeed seasonally dependent.
Table 3. Estimated Hyperparameters of the Hierarchical Error Model in Calibration for All Catchments
 Figure 3 presents the bias of the three error models in calibration for the EPP catchment. The hierarchical model and the seasonally variant model produce lower overall biases than the seasonally invariant model. The hierarchical error model produces the lowest mean bias of 0.1 mm, demonstrating the ability of this model to correct bias. It is somewhat surprising that the hierarchical model gives marginally better calibration biases, as we would expect the seasonally variant model to be able to fit the observations more closely. Biases for all models generally follow the seasonal pattern of flow magnitudes, with larger biases in winter than in summer. The seasonally variant model and the hierarchical model perform markedly better than the seasonally invariant model in reducing biases in the drier months (October to March). Overall, the use of month-specific parameters significantly reduces biases in calibration.
 We also evaluate the log likelihood to indicate how model constraints affect the likelihood. The values of the log likelihood are −1216, −1103, and −1112 for the seasonally invariant model, the seasonally variant model, and the hierarchical error model, respectively. These show that the models with more constraints have lower likelihoods of the observed streamflow given the precipitation, PET, and observations from the previous time step. The seasonally invariant model has a substantially lower log likelihood than the other two models, while the hierarchical error model has a slightly lower log likelihood than the seasonally variant model. The hierarchical error model has a lower log likelihood than the seasonally variant model because the additional constraints imposed by the hyperparameters make parts of the parameter space much less probable. The seasonally variant model allows the maximum flexibility for the error model parameters and leads to the largest log likelihood. The seasonally invariant model instead uses the fewest parameters and results in the most limited model fit in terms of log likelihood. However, the log likelihoods given here reflect only the ability of each model to fit observed data, rather than to predict streamflows for an independent period. We consider cross-validated likelihood ratios to address this issue in section 3.3.
 Figure 4 directly checks the model assumptions (i.e., the suitability of the Gaussian distribution for describing the errors and the assumption that the standardized residuals are independent in time) for the EPP catchment from the calibration results. For each error model, we examine the estimated standardized residuals defined by equations (5) and (6) to see whether they can be approximated by a standard normal distribution. As seen from the first column of Figure 4, the quantiles of the standardized residuals from all three error models are reasonably close to the quantiles of a standard normal distribution. The autocorrelation of the standardized residuals as a function of lag (second column of Figure 4) is only significantly different from zero at a lag of zero months. This indicates that the standardized residuals can be considered an independent time series.
3.3.1. Evaluation Statistics
 Figure 3 presents three verification scores, including bias, NS, and CRPS for the EPP catchment, calculated after cross validation. We calculate each verification score for each month in order to demonstrate seasonal performance. The verification score computed over all 12 months is reported as a measure of overall performance.
 The hierarchical error model leads to the smallest overall bias, 0.27, in this cross-validation analysis, while the overall biases for the seasonally invariant model and the seasonally variant model are 0.86 and 0.3, respectively. As expected, the streamflow predictions from the cross-validation analysis are more biased than the predictions from calibration, although the differences between calibration biases and cross-validation biases are very small for both the hierarchical model and the seasonally variant model. As with the calibration results, the seasonally invariant model performs the worst, while the hierarchical error model and the seasonally variant model show very similar performance. The hierarchical error model leads to similar (or larger) NS than the other two error models. All three error models are useful for high to medium flow months (from April to December), with NS scores greater than 0.6. The NS scores for the low flow months (from January to March) are generally low, which suggests that predicting low flows is very challenging for all models. As with bias and the NS score, the hierarchical error model also produces the smallest overall CRPS and thus the most accurate probabilistic streamflow prediction under cross validation. As expected in a strongly seasonal catchment, CRPS shows that larger prediction errors occur in high flow months than in low flow months.
 Table 4 compares the overall performance statistics for the other four catchments. NS and CRPS are very similar for all three error models for all catchments. The hierarchical error model and the seasonally variant model also produce similar biases for all catchments. The similar performances of the hierarchical error model and the seasonally variant model after cross validation are somewhat surprising. We would expect the hierarchical error model to be less susceptible to overfitting than the seasonally variant model and to, therefore, perform better under cross validation. The seasonally invariant model performs comparably well for CRPS and NS, but leads to more biased predictions for three of the five catchments.
Table 4. Performance Statistics (Average Over All Months) for the Other Four Catchments (Seasonally Invariant Model (IV); Seasonally Variant Model (SV); and Hierarchical Error Model (H))
 Table 5 presents the cross-validated log-likelihood ratio to compare the error models from the point of view of statistical model selection. In contrast to the calibration log likelihoods presented in section 3.2, the hierarchical error model clearly has the best ability to predict events that have not been used in parameter estimation. The cross-validated log-likelihood ratios of the hierarchical error model against the seasonally invariant model and against the seasonally variant model are all positive, and more than 90% of the bootstrap statistics are greater than zero in all cases except for the THM catchment (66% against the seasonally invariant model) and the EPP catchment (63% against the seasonally variant model). The seasonally variant model is strongly preferred to the seasonally invariant model at EPP and EIL but performs significantly worse at THM, while the seasonally invariant and seasonally variant models perform very similarly at NIL.
Table 5. Model Comparison Through the Cross-Validated Log-Likelihood Ratio, With the Percentage of Bootstrap Log-Likelihood Ratios Greater Than Zero Shown in Parentheses (Seasonally Invariant Model (IV); Seasonally Variant Model (SV); and Hierarchical Error Model (H))
3.3.2. PIT Histogram
 Figure 5 compares PIT histograms from three error models when the cross-validation analysis is carried out. PIT histograms based on all months are generally close to the theoretical value derived from a uniform distribution as suggested by the horizontal line. The seasonally invariant model, however, does not lead to uniform PIT histograms for each individual month. The PIT histogram of the seasonally invariant model is skewed to the left for dry months (such as from December to March) and to the right for wet months (such as from July to October). This suggests the seasonally invariant model yields an overestimation of low flows but an underestimation of high flows. The PIT histograms of the seasonally variant model and the hierarchical model are fairly uniform and no substantial spikes are observed for any individual month, indicating that predictions from these two error models are both reliable.
3.3.3. Prediction Median and Credible Interval Plots
 Figures 6-8 show the prediction median and the prediction [0.05, 0.95] credible interval, together with the observed data, for the three error models. The prediction median is generally consistent with the observed data, and the [0.05, 0.95] credible interval widens gradually as the prediction median increases. All error models perform much better than climatology, which uses a constant prediction median for each month and a credible interval independent of the observed values. The seasonally invariant model provides the widest credible intervals in Figure 6, while the seasonally variant model and the hierarchical error model lead to very similar credible intervals, as shown in Figures 7 and 8.
 Figure 9 shows the coverage of the prediction [0.05, 0.95] credible interval, which differs from month to month and from model to model. The overall coverage of the hierarchical error model and the seasonally invariant model is very close to the theoretical coverage of 0.90, while around 87% of observations fall within the interval for the seasonally variant model. These overall figures, however, mask considerable month-to-month variation. The coverage of the seasonally invariant model, in particular, varies widely: it covers only 80% of the observations in May but more than 95% in September. This supports the PIT histogram analysis (Figure 5) in demonstrating that the seasonally variant and hierarchical error models generally give reliable estimates of prediction uncertainty, while the seasonally invariant model does not estimate uncertainty reliably for all months.
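The coverage statistics in Figure 9 amount to the following calculation, shown here as a simple sketch. The function names and the month labels are illustrative assumptions; the point is that the overall coverage can hide month-to-month departures from the 0.90 target, so a per-month breakdown is needed.

```python
import numpy as np

def interval_coverage(obs, lower, upper):
    """Fraction of observations falling inside the credible interval;
    for a [0.05, 0.95] interval the target coverage is 0.90."""
    obs = np.asarray(obs)
    return np.mean((obs >= np.asarray(lower)) & (obs <= np.asarray(upper)))

def monthly_coverage(obs, lower, upper, months):
    """Per-month coverage, exposing the month-to-month variation that
    the overall coverage figure can mask."""
    obs, lower, upper = map(np.asarray, (obs, lower, upper))
    months = np.asarray(months)
    return {m: interval_coverage(obs[months == m],
                                 lower[months == m],
                                 upper[months == m])
            for m in np.unique(months)}
```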
 Figure 10 displays the time series of the prediction median and the prediction [0.05, 0.95] credible interval for the hierarchical error model, together with the observed streamflow. The time series shows the ability of the error model to predict streamflows and to reliably assess the uncertainty over a range of streamflows. No evident trend over time is observed in the relationship between the prediction median and the observed values, indicating that the performance of the forecasts is not unduly influenced by wetter or drier periods.
3.3.4. Sensitivity Analysis
 In order to evaluate the sensitivity of parameter estimation to small changes in the data set, we compare the standard deviations of the parameter estimates across the cross-validation runs in Figure 11. A smaller standard deviation indicates more stable parameter estimation. The hierarchical error model leads to more stable error model parameters than the seasonally variant model in nearly all cases; for example, the standard deviations of its parameters for March are substantially smaller. This suggests that the additional constraints imposed by the Bayesian priors significantly improve the robustness of the error model and make it less sensitive to outliers in the data set. The seasonally invariant model produces the least variation in error model parameters, but at the cost of flexibility and performance.
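The stability measure used in Figure 11 reduces to the following computation, sketched here for illustration: collect the parameter estimates from each cross-validation run and take the standard deviation of each parameter across runs.

```python
import numpy as np

def parameter_stability(estimates):
    """Sample standard deviation of each parameter across
    cross-validation runs. estimates: array of shape
    (n_runs, n_params). Smaller values indicate estimation that is
    less sensitive to small changes in the calibration data."""
    return np.std(np.asarray(estimates, dtype=float), axis=0, ddof=1)
```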
4. Discussion and Conclusions
 In this study, we compare three error models to investigate the seasonal dependence of the prediction errors of a hydrological model: a seasonally invariant model, a seasonally variant model, and a hierarchical error model. The seasonally invariant model applies the same parameter set to all months. The seasonally variant model uses a set of month-specific error model parameters. The hierarchical error model is derived from the seasonally variant model, and constrains the seasonally varying model parameters by assuming common priors for these parameters and then infers any seasonal influence on errors from data. All error models are used in conjunction with the WAPABA rainfall-runoff model for monthly streamflow predictions at catchments in southeast Australia.
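The shrinkage effect of the common priors in the hierarchical error model can be illustrated with a simplified objective function. This sketch is an assumption-laden caricature, not the paper's estimation procedure: it treats one scalar error model parameter per month, places a common Gaussian prior N(mu, tau^2) on the twelve month-specific values, and forms a negative log posterior whose prior term pulls the monthly estimates toward a shared mean.

```python
import numpy as np

def neg_log_posterior(theta_m, mu, tau, neg_log_lik_by_month):
    """Hierarchical shrinkage sketch. theta_m: twelve month-specific
    values of one error model parameter (simplified to a scalar per
    month). mu, tau: mean and standard deviation of the common
    Gaussian prior. neg_log_lik_by_month: hypothetical callable
    mapping theta_m to an array of monthly negative log likelihoods.
    The prior term constrains the monthly estimates, guarding
    against overfitting when monthly records are short."""
    theta_m = np.asarray(theta_m, dtype=float)
    data_term = np.sum(neg_log_lik_by_month(theta_m))
    prior_term = np.sum(0.5 * ((theta_m - mu) / tau) ** 2
                        + np.log(tau * np.sqrt(2.0 * np.pi)))
    return data_term + prior_term
```

Minimizing this objective trades fit to each month's data against departure of the monthly parameters from the common mean, which is the mechanism that distinguishes the hierarchical model from the unconstrained seasonally variant model.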
 The hierarchical error model produces comparable or better monthly streamflow predictions than the seasonally invariant and seasonally variant error models. The seasonally invariant model performs reasonably well in terms of NS and CRPS; however, its predictions tend to be more biased than those of both the hierarchical and seasonally variant models, and it is less statistically reliable than both other models in certain months. Reliability is a critically important attribute of robust forecasts, and accordingly we do not recommend the use of the seasonally invariant error model. The seasonally variant and hierarchical error models have similar bias, CRPS, and NS values, and are similarly reliable. This is somewhat surprising, as the large number of unconstrained parameters in the seasonally variant model could well have resulted in overfitting the data set; the absence of overfitting is possibly due to the low variability of monthly data. The improvement offered by the hierarchical error model may not be large in magnitude, but it is very consistent, as demonstrated by tests of statistical significance through the cross-validated log-likelihood ratio. Likelihood-ratio-based measures assess the entire distribution of probabilistic forecasts, rather than only the forecast ensemble mean (as NS and bias do). Further, likelihood ratios have been recommended over the Brier score (from which the CRPS is derived) because they give more intuitive results [Jewson, 2008]. This makes the likelihood ratio an attractive measure of model performance, and we argue that it shows the hierarchical error model to be clearly superior to the other models tested here.
 The hierarchical error model has the additional benefit of more stable parameter estimation, which makes it less susceptible to overfitting when applied to a wider range of catchments and less susceptible to outliers in the observations. In summary, the hierarchical model offers marked improvements in performance over the seasonally invariant model and slight but consistent improvements over the seasonally variant model, with the additional benefit of more stable parameters. These improvements need to be weighed against the increased complexity of the hierarchical error model relative to the seasonally variant model. The seasonally variant model is preferred over the seasonally invariant model and can be used as an alternative to the hierarchical error model in practical applications.
 We have shown that when we allow the error model parameters to vary seasonally and infer them from data, the parameters do indeed vary with season. This is not surprising for the seasonal catchments used in this study, and it supports findings from other studies that hydrological model errors are seasonally dependent [e.g., Choi and Beven, 2007]. The potential danger of allowing parameters to vary seasonally is overfitting the error model. We have shown, however, that the hierarchical error model and the seasonally variant model outperform the seasonally invariant model under robust cross validation, indicating that overfitting is not a problem for these two models. The hierarchical model, in particular, is designed to prevent overfitting. This supports the use of month-specific parameters in the error model.
 We use a multistage parameter estimation procedure for the seasonally variant model and the hierarchical error model: both reuse the WAPABA and log-sinh transform parameters estimated for the seasonally invariant model. Estimating all parameters (i.e., including the WAPABA and log-sinh transform parameters) jointly for the seasonally variant and hierarchical models in a single stage may be possible. However, such a procedure is likely to make some parameters compensate for each other and yield nonidentifiable parameter inference. The multistage estimation procedure also reduces the number of parameters in the optimization, eases the computational burden, and makes the parameter inference more reliable. In addition, the simplification used to calculate equations (16) and (17), in which the calculation factorizes by month, would no longer apply if all parameters were estimated jointly.
 It is well established in the literature that the parameters of a hydrological model may vary strongly with the choice of error model. Accordingly, the WAPABA model parameters estimated for the seasonally invariant model may not, in the absence of an error model, allow WAPABA to simulate catchment processes as accurately as possible, and therefore may not be appropriate for regionalization or other forms of extrapolation. We note, however, that because the seasonally invariant model (for which we determined the WAPABA parameters) performed worst of all the error models, optimizing the WAPABA parameters as part of the seasonally variant or hierarchical models (if this is possible) is likely to strengthen the performance of these two models. Therefore, the conclusions we have drawn about the relative strengths of the hierarchical and seasonally variant models remain valid, even if the WAPABA model parameters could be improved.
 We have chosen to use the season as a covariate to represent the variation in hydrological prediction error. Other covariates, such as weather pattern [Yang et al., 2007] and the state of the flow regime, could also be used and may be more effective for daily streamflow prediction. If other covariates are used, we suggest imposing some constraints on the covariate-dependent parameters to improve the robustness of parameter estimation.
 In this study, we use both seasonally dependent variance and the log-sinh transformation to handle heteroscedasticity. As already noted, the use of month-specific parameters is strongly supported by the findings of this study, and this includes month-specific variance. Specifying a variance for each month has the potential to render the use of a transformation (in our case, the log-sinh transformation) unnecessary. To test whether the log-sinh transformation is still needed, we applied the seasonally variant model without any transformation to the EPP catchment and found that the standardized residuals could not be well approximated by a normal distribution. This suggests that an appropriate transformation remains necessary for stabilizing the variance and that its role is not replaced by seasonally dependent parameters. More discussion on the importance of transformations can be found in Del Giudice et al.
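For reference, the log-sinh transformation of Wang et al. [2012a] is commonly written as z = (1/b) log(sinh(a + b q)); it behaves roughly like a logarithmic transform for small flows and like a linear transform for large flows, which is what stabilizes the residual variance. A minimal sketch of the forward and inverse transformations:

```python
import numpy as np

def log_sinh(q, a, b):
    """Log-sinh transformation: z = (1/b) * log(sinh(a + b * q)).
    Approximately logarithmic for small q and linear for large q,
    stabilizing the variance of the transformed residuals."""
    return np.log(np.sinh(a + b * np.asarray(q, dtype=float))) / b

def inv_log_sinh(z, a, b):
    """Inverse transformation back to flow space:
    q = (arcsinh(exp(b * z)) - a) / b."""
    return (np.arcsinh(np.exp(b * np.asarray(z, dtype=float))) - a) / b
```

The parameter values of a and b are estimated from data; the sketch above only fixes the functional form.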
 While we have chosen to focus on errors in the response variable, a more comprehensive investigation into the sources of model error is, of course, possible. As we note in section 1, a recent example of such a detailed investigation is BATEA [Kavetski et al., 2006a, 2006b; Kuczera et al., 2006]. BATEA employs a Bayesian hierarchical model and uses MCMC sampling to characterize uncertainty in hydrological model inputs, internal fluxes, and outputs. The parameters of our error models, together with the water balance model parameters, could have been inferred by a full Bayesian approach through MCMC sampling to provide a full posterior distribution of the model parameters. We have not taken this approach mainly for computational reasons: given the large number of parameters in the models and the many cross-validation runs we conducted, MCMC sampling would have been highly computationally intensive. Further, we note that while investigations of parameter uncertainty are useful for diagnosing structural problems in hydrological models, parameter uncertainty is often not the major contribution to the total predictive uncertainty if a sufficiently large number of data points is available [e.g., Kuczera et al., 2006].
 A complex error model (like BATEA) can be applied in a real-time context provided that the model calibration is done offline. However, we sought to design an error model that could be calibrated (and recalibrated, as data become available) and implemented reasonably quickly so as to be easily extended to real-time applications. The hierarchical error model recommended in this paper is associated with a monthly time step hydrological model and is, therefore, mainly useful for short-term and seasonal streamflow prediction. A similar error model in conjunction with a daily or hourly hydrological model could be adopted for real-time forecasting at shorter time steps.
 In this study, we have applied a zero-flow treatment to but not to in calculating the likelihood function (equation (15)). A more comprehensive treatment could follow the full approach of Wang and Robertson, but it would substantially increase the complexity of the parameter estimation. This will be considered in future work.
 The seasonally variant and hierarchical error models developed in this study are suitable for use in real-time seasonal streamflow forecasting. The error models update the streamflow prediction of the WAPABA model based on information from the previous month. Bias correction and model updating of streamflow predictions with lead times of multiple months can be obtained by applying the hierarchical error model recursively. When combined with climate ensemble forecasts, the error model could be adapted to real-time seasonal streamflow forecasting.
 This work has been supported by the Water Information Research and Development Alliance (WIRADA), a collaboration between CSIRO and the Bureau of Meteorology. We would like to thank Jiufu Lim for his contribution at the early stage of this work and Prafulla Pokhrel for providing data. David Robertson and Eddy Campbell made valuable suggestions that led to substantial strengthening of the manuscript. We are grateful to two anonymous reviewers and an associate editor for their insightful comments and constructive suggestions.