Water Resources Research

Generic error model for calibration and uncertainty estimation of hydrological models

Authors


Abstract

[1] Because of the necessary simplification of the complex natural processes and the limited availability of observations, model simulations are always uncertain and this uncertainty should be quantified. In this contribution, the model error is quantified using a combined procedure. For the uncertainty of discharge due to meteorological input, a stochastic simulation method is used. To quantify the effect of process representation and parameterization, a sensitivity analysis is carried out. It is assumed that the model error due to process uncertainty is proportional to the sensitivity. The final model error variance can thus be calculated from the stochastic errors and the process sensitivities. The coefficients used for the quantification are estimated simultaneously with the model parameters. The methodology presented produces error series that are normally distributed and that represent the varying importance of different processes in time. This uncertainty time series can be used as a weighting factor to normalize the model residuals during calibration so that the assumptions of least squares optimization are fulfilled. Calibration and uncertainty estimation are demonstrated with an example application of a distributed Hydrological Bureau Waterbalance (HBV) model of three watersheds in the Neckar basin in southwest Germany. The model residual distributions are presented and compared to a standard calibration method. Further, it is shown that the new methodology leads to more realistic confidence intervals for model simulations. Although applied to the HBV model as an example, the method is general and can be applied to any model and also in conjunction with other uncertainty estimation techniques.

1. Introduction

[2] Rainfall runoff modeling is one of the central and classic problems in hydrology. By definition every model is a simplification of a more complex system. The fact that natural processes are described with mathematical equations and the corresponding parameters are derived from observations and experience leads to uncertainties. The main sources of uncertainty are embodied in the following five areas: inputs, state specification, process definition, model structure and output.

1.1. Input Uncertainty

[3] The meteorological input is based on point observations, which are themselves uncertain and are sometimes combined with indirect measurements such as radar or satellite information. Because the exact precipitation, temperature and other input variables are not known at every point of the catchment, uncertainty due to measurement and interpolation errors and spatial variability needs to be taken into account.

1.2. State Uncertainty

[4] The actual state of the catchment (e.g., moisture conditions, snow cover) is usually not directly observed but calculated using model equations. Because the input and the model abstraction are simplifications, the state itself becomes uncertain. Additionally, continuous simulations inherit state uncertainty from preceding time steps.

1.3. Process Abstraction-Related Uncertainty

[5] The main hydrological processes are described using equations that can capture only part of the complex natural processes. The parameters of these equations partly correspond to sets of discrete measurements or need to be estimated via calibration. This automatically leads to uncertainties in the corresponding model output.

1.4. Model Structure Uncertainty

[6] The model structure itself leads to uncertainty due to the inherent simplification of the more complex real system. The discretization of the landscape into polygons or rasters produces additional errors as the real processes occur on much smaller scales.

1.5. Output Uncertainty

[7] Observed discharge, groundwater level, soil moisture, conductivity and other observations are also based on rating curves, point measurements, or remote sensing and can be corrupted by measurement errors and neglected spatial variability.

[8] The quantification of these uncertainties is important both for practical decision making and theoretical modeling. Unfortunately, this is neither a straightforward nor a simple task. Kavetski et al. [2003], Gupta et al. [2005], Beven [2006], Schaefli et al. [2007], and many others state that despite the considerable attention that has been given to uncertainty estimation in recent years, no satisfactory approach to separate all sources of error and to quantify the total uncertainty has been proposed to date. Singh and Woolhiser [2002] describe this fact as one of the major limitations of current watershed models. Therefore, the purpose of this study was to develop a methodology for the quantification of total model uncertainty considering the above list of relevant error sources in turn and in combination.

[9] Even physically based hydrological models require parameter calibration because subgrid processes can only be parameterized in a lumped way. Effective parameters are required at the model grid scale; even though they are lumped quantities, they can be quite different from field or laboratory measurements [Beven, 1989]. This calibration is more difficult than may be expected because of problems associated with the objective function used, parameter interaction, input uncertainty, and the implicitly assumed error model. Kavetski et al. [2003] give a comprehensive overview of these problems and show that objective functions based on least squares or derivatives thereof will yield biased parameter estimates if the input and output data are corrupt.

[10] It has long been understood that the choice of a single objective function must lead to biased calibration as each performance criterion is sensitive only to certain characteristics of the hydrograph [Krause et al., 2005]. Multiobjective calibration has been proposed to counteract this effect [Yapo et al., 1998; Gupta et al., 2003] and additional information may very well reduce the uncertainty of model predictions. However, the extension of the dimensionality of the optimization can also increase uncertainty and the approach will still suffer from the main shortcomings of standard single-objective calibration. The problem is that most calibration methodologies assume and require that the model errors are Gaussian and that their variance is constant in space and time (homoskedastic), which is rarely verified.

[11] Markov Chain Monte Carlo methods are the most popular in uncertainty estimation. The Shuffled Complex Evolution Metropolis algorithm (SCEM-UA) of Vrugt et al. [2003] and the generalized likelihood uncertainty estimation (GLUE) by Beven and Binley [1992] have been used in numerous studies. The latter has also been criticized for the adoption of “less formal likelihoods”, the subjective choice of “behavioral” parameter sets and the lumping of all sources of uncertainty into a single parameter uncertainty, which leads to very wide confidence bounds [Mantovan and Todini, 2006; Kavetski et al., 2003]. Perhaps the major concern with both methods is the lack of a specific error model structure acknowledging the properties of input and parameter uncertainties.

[12] Montanari and Brath [2004] propose to use the normal quantile transform in order to make the input and output time series Gaussian and to derive a linear regression relationship between the model residuals and the simulated river flow. The major drawback of this method is the assumption that the model performance and errors are homoskedastic. Wagener et al. [2003] tackle this commonly ignored problem with a dynamic identifiability analysis (DYNIA). It allows for the evaluation of simulated and observed time series with respect to information content for specific model parameters. This analysis can be used to indicate areas of structural failure and potential improvement to the model.

[13] Kavetski et al. [2003] introduce a strict inference scheme called BATEA (Bayesian Total Error Analysis) to analyze the posterior distributions of the model parameters, conditioned on model error, input error and output error, using Markov chain Monte Carlo methods. Although it considers explicit input and output uncertainty, this method still requires error models of low dimensionality for numerical reasons. Unfortunately, most environmental observation time series show significant complexity, thus prohibiting the use of simple multiplicative error models.

[14] One of the first and most robust approaches to deal with heteroskedasticity was presented by Sorooshian and Dracup [1980]. They employed a power transformation and maximum likelihood theory to estimate the weights of the weighted least squares approach in a two-step procedure optimizing the parameters of a simple two-parameter model. Schaefli et al. [2007] use a mixture of two normal distributions to mimic the heteroskedasticity of the total modeling errors of a conceptual rainfall runoff model applied to a highly glacierized alpine catchment. The two normal distributions represent the error populations during the two very distinct high- and low-flow regimes. Unfortunately, the approach still assumes normal, homoskedastic, and lag-one autocorrelated error distributions for each flow regime and lumps all error sources into the parameter uncertainty. As the assumptions were not completely proven by the data, the problem was broken down into two similarly ill-posed cases instead of actually being solved.

[15] Gallagher and Doherty [2007] demonstrate the estimation of model predictive uncertainty for a water resource management model consisting of a soil water balance and a groundwater model. Although the chief disadvantage of the method, the assumption that the model is linear, prevents the exact determination of highly nonlinear model error, useful approximations of the individual contributions to the overall predictive uncertainty can be given, provided that plausible estimates of the individual uncertainty sources such as input data or model parameters are available. An issue that is neglected in most approaches is that conditions change in time, e.g., because of improved observation networks, climate change scenarios or meteorological forecasts. This varying temporal uncertainty due to changing input should be acknowledged in uncertainty estimation methods.

[16] Gupta et al. [2005] identify the typical assumptions of normality, constancy of variance and simplicity of the correlation structure of the underlying error model as the major drawbacks of current uncertainty estimation schemes. Therefore, the presented methodology explicitly addresses these important properties: it produces error series that are normally distributed and that reproduce the variable contributions of different processes to the total uncertainty in time. It is based on a scaled decomposition of plausible error contributions from different uncertainty sources that represent the time variant importance of different processes. The hydrological model and the corresponding error model are calibrated simultaneously. The uncertainty time series are used as a weighting factor to normalize the model residuals during calibration so that the assumptions of least squares optimization are fulfilled. The methodology is demonstrated with an example application to the distributed Hydrological Bureau Waterbalance (HBV) model of three watersheds in the Neckar basin.

2. Methodology

[17] The error of the modeled discharge on day t is considered to be a random variable ɛQ(t). This random error is assumed to be the sum of the random errors due to input (e.g., precipitation ɛP(t) and temperature ɛT(t)) and process description ɛθi(t) (e.g., snow accumulation and melt processes, soil properties, runoff generation and infiltration, and internal storage)

\varepsilon_Q(t) = \varepsilon_P(t) + \varepsilon_T(t) + \sum_i \varepsilon_{\theta_i}(t) \qquad (1)

with θi being the respective group of model parameters that control the processes just mentioned. For the sake of simplicity we assume that the process errors are independent; a covariance matrix to include the mixed terms could easily be introduced into the equation. If one assumes that these random variables are independent, then the variance of the sum is the sum of the variances

\mathrm{Var}[\varepsilon_Q(t)] = \mathrm{Var}[\varepsilon_P(t)] + \mathrm{Var}[\varepsilon_T(t)] + \sum_i \mathrm{Var}[\varepsilon_{\theta_i}(t)] \qquad (2)

The assessment of these variances is not a trivial task. While one can assume that the errors on successive days are independent for the meteorological variables, this assumption does not hold true for the errors due to the inevitably simplifying process descriptions and effective model parameters, θi [Kuczera et al., 2006; Schaefli et al., 2007]. Many methods to estimate the uncertainty from these individual sources have been proposed. The main difficulty is to separate their effect from the other sources. An example for rainfall input errors is given by Kuczera and Williams [1992]. They used a Box-Cox transformation to fit a stochastic rainfall model to the observations and propagated the associated rainfall uncertainties through a rainfall runoff model. The method presented here is built upon similar ideas.
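As a purely illustrative example of equation (2), with hypothetical error magnitudes that are not taken from the study: if, on a given day, Std[ɛP(t)] = 2.0 m3/s, Std[ɛT(t)] = 1.0 m3/s, and a single process term contributes Std[ɛθ1(t)] = 1.5 m3/s, then

\mathrm{Var}[\varepsilon_Q(t)] = 2.0^2 + 1.0^2 + 1.5^2 = 7.25\ \mathrm{(m^3/s)^2}, \qquad \mathrm{Std}[\varepsilon_Q(t)] = \sqrt{7.25} \approx 2.7\ \mathrm{m^3/s}.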

3. Meteorological Sources of Uncertainty

[18] Rainfall and temperature are the most frequent meteorological inputs for hydrological models. They are usually observed at a selected number of points and have to be estimated for the rest of the catchment. Meteorological data can be interpolated using geostatistical methods such as ordinary or external drift kriging (EDK) [Ahmed and de Marsily, 1987]. For this study, both precipitation and temperature were interpolated using EDK. For temperature, topographical elevation was used as the external drift, as temperature changes linearly with altitude. For precipitation, the square root of the elevation was used as the external drift, as the increase in precipitation becomes smaller with increasing altitude. Interpolation leads to an artificial picture of the real meteorological conditions as it is an estimation with minimum error variance; therefore, by definition, the true variability of the real fields is smoothed out. In order to account for the effect of spatial variability, simulation methods can be used to quantify the uncertainty due to meteorological input data.

3.1. Conditional Precipitation and Temperature Simulation

[19] In contrast to the objective of interpolation (minimum error variance), the goal of simulation methods is to create a set of realizations of a variable that show the same variability as the observations. Conditional simulation additionally preserves the observations at the measurement locations as far as possible. There are different simulation methods such as Monte Carlo, turning band, sequential and Markov chain simulations (A. Bárdossy, Introduction to geostatistics, lecture notes, Institute for Hydraulic Engineering, Universitaet Stuttgart, 2002).

[20] Despite their popularity in geostatistics, conditional simulation methods have not been used as extensively in hydrology as have the related interpolation approaches. Haberlandt and Gattke [2004] apply simulated annealing to generate precipitation fields as stochastic input for a Nash cascade model of the Lippe basin in Germany. They found considerable variability of the runoff hydrographs due to simulated precipitation but the excessive simplicity of the hydrological model prevented deeper analysis of the uncertainty.

[21] Simulation is carried out in space for each time t separately using the following algorithm:

[22] 1. The observed values are transformed using their empirical distribution function to a standard normal distribution (normal score transform)

W(u_i, t) = \Phi^{-1}\big(F_t(Z(u_i, t))\big) \qquad (3)

with Z(ui, t) being the precipitation at location ui at time t, Ft the empirical distribution function of the observations at time t, and Φ the standard normal distribution function.

[23] 2. The experimental variogram γ*(h) of W(ui, t) (with fixed t) is calculated for a selected set of lags leading to the values (γ*(h1), …, γ*(hk)).

[24] 3. A theoretical variogram is fit to the experimental one using the algorithm developed by Hinterding [2003], where (1) the experimental variogram is monotonized using PAVA (pool adjacent violators algorithm) [Barlow and Bartholomew, 1972], which yields a monotonic sequence γp(h1) ≤ … ≤ γp(hk); (2) the nugget is estimated as γp(h1); (3) the range is estimated as the first hi such that i > 1 and γp(hi) = γp(hi+1); (4) the total sill is set to 1 as W follows a standard normal distribution; and (5) the variogram is assumed to be spherical.

[25] 4. Conditional geostatistical simulation is applied to generate values of W(u, t) for unobserved points u.

[26] 5. The last step is the back transformation to the marginal of the precipitation using the inverse transform of the normal score applied in step one.

[27] Note that because simulation was carried out for a large number of days (total period >10,000 days), a conventional geostatistical analysis could not be carried out for each individual day. A more sophisticated approach using maximum likelihood variogram estimation would have required a disproportionate numerical effort and was thus not used. The procedure could have been applied without the normal score transformation, but because of the skewness of the precipitation distribution the variogram fitting algorithm then frequently produced implausible results.
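To make steps 1, 3, and 5 of the algorithm above concrete, the following Python sketch illustrates the normal score transform, the PAVA monotonization of an experimental variogram, and the back transformation. It is a minimal illustration only: the actual geostatistical simulation (step 4) and the full variogram fitting of Hinterding [2003] are not reproduced, and the use of Weibull plotting positions is an assumption not stated in the text.

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_score_transform(z):
    """Step 1: map observed values Z(u_i, t) to standard normal scores W(u_i, t)
    via their empirical distribution function (Weibull plotting positions assumed)."""
    n = len(z)
    p = rankdata(z, method="average") / (n + 1.0)   # empirical non-exceedance probability
    return norm.ppf(p)

def pava_monotonize(gamma):
    """Step 3, part 1: pool adjacent violators so that the experimental variogram
    becomes a nondecreasing sequence gamma_p(h_1) <= ... <= gamma_p(h_k)."""
    g = list(map(float, gamma))
    w = [1.0] * len(g)
    i = 0
    while i < len(g) - 1:
        if g[i] > g[i + 1]:                          # violation: pool the two values
            g[i] = (w[i] * g[i] + w[i + 1] * g[i + 1]) / (w[i] + w[i + 1])
            w[i] += w[i + 1]
            del g[i + 1], w[i + 1]
            i = max(i - 1, 0)                        # re-check the previous pair
        else:
            i += 1
    return np.repeat(g, np.array(w, dtype=int))

def back_transform(w_sim, z_obs):
    """Step 5: map simulated normal scores back to the marginal distribution of the
    observed precipitation by inverting the empirical distribution (interpolated)."""
    z_sorted = np.sort(z_obs)
    p_obs = np.arange(1, len(z_obs) + 1) / (len(z_obs) + 1.0)
    return np.interp(norm.cdf(w_sim), p_obs, z_sorted)
```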

[28] The method produces stochastic rainfall that conserves the observations themselves as well as their spatial variability and avoids the smoothing effect of interpolation. Because of the high variability of precipitation fields, which differs from time step to time step, one can assume that the corresponding uncertainty is independent in time. For temperature the same procedure was applied; however, in this case the deviation from an elevation-dependent trend was simulated.

4. Process and Parameter Related Uncertainty

[29] Process representation and process parameters play a very important role in the uncertainty of hydrological models. Their contribution certainly depends on the actual hydrological conditions. For example, in summer the contributions of snow accumulation and melt processes in Germany are negligible, while in winter they may play a central role. The same also applies to other processes that contribute either more or less to the runoff generation and concentration under certain conditions. The sensitivity of the calculated discharge on a given day with respect to a parameter group, θi, can be calculated as ∂Qm(t)/∂θi. In this case the first-order approximations of these derivatives have been used. Figure 1 shows an example of the determined discharge sensitivity time series.

Figure 1.

Time variant discharge sensitivities with respect to different processes (parameter groups) at Neuenstadt.

[30] Whereas the snow module parameters show a significant sensitivity during winter, runoff generation and concentration processes are important throughout the whole year. Runoff generation is most sensitive during large precipitation events and in the summer when the soils are dryer. Runoff concentration determines runoff most significantly during floods, but also has a remarkable influence during the recession periods that follow the flood events by controlling the retention capacity in the basin.
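The following sketch indicates how such sensitivity time series could be approximated by first-order forward differences. The run_model interface, the parameter grouping, and the relative perturbation size are illustrative assumptions, not part of the study.

```python
import numpy as np

def discharge_sensitivity(run_model, params, group, rel_step=0.01):
    """First-order (forward difference) approximation of dQ_m(t)/dtheta_i for one
    parameter group.

    run_model(params) -> simulated daily discharge series Q_m(t)   [assumed interface]
    params            -> dict of parameter groups, e.g. {"snow": array, "soil": array}
    group             -> name of the parameter group to perturb jointly
    """
    q_ref = np.asarray(run_model(params))
    perturbed = {k: np.array(v, dtype=float, copy=True) for k, v in params.items()}
    step = rel_step * np.abs(perturbed[group])
    step[step == 0.0] = rel_step                     # guard against zero-valued parameters
    perturbed[group] = perturbed[group] + step
    q_pert = np.asarray(run_model(perturbed))
    # one aggregate sensitivity value per time step for the whole parameter group
    return (q_pert - q_ref) / np.mean(step)
```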

[31] One can assume that the standard deviation of the random contribution of a certain process to the total uncertainty is proportional to its sensitivity

\mathrm{Std}[\varepsilon_{\theta_i}(t)] = a_i \left| \frac{\partial Q_m(t)}{\partial \theta_i} \right| \qquad (4)

This in combination with equation (2) leads to

\mathrm{Var}[\varepsilon_Q(t)] = \mathrm{Var}[\varepsilon_P(t)] + \mathrm{Var}[\varepsilon_T(t)] + \sum_i a_i^2 \left( \frac{\partial Q_m(t)}{\partial \theta_i} \right)^2 \qquad (5)

Because the coefficients, ai, are unknown, equation (5) does not yield an explicit estimation of the error variance. The coefficients have to be estimated via calibration. Model calibration in this case can be considered in an overarching sense as a simultaneous estimation of the model parameters, θi, and the parameters of the calculated output error model, ai. Assuming that the errors are normally distributed, this task can be carried out using a maximum likelihood method or a biobjective optimization. Therefore for both cases, the standardized errors are calculated as shown below

\eta(t) = \frac{Q_o(t) - Q_m(t)}{\mathrm{Std}[\varepsilon_Q(t)]} \qquad (6)

with Qo(t) as the observed discharge on day t (m3/s), Qm(t) as the modeled discharge on day t (m3/s), and η(t) as the standardized model error (dimensionless). This approach implies that the model should be more accurate on a day when the uncertainty is smaller (e.g., when fewer processes are active). In other words the model should not be forced to be correct on a day when the input is quite uncertain, because it would be for the wrong reason. Using the maximum likelihood method, the likelihood of a normal distribution is

L = \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \qquad (7)

In this case, the likelihood of the model parameters, θi, and the error model parameters, ai, is

L(\theta, a) = \prod_{t=1}^{n} \frac{1}{\mathrm{Std}[\varepsilon_Q(t)]\sqrt{2\pi}} \exp\!\left( -\frac{\big(Q_o(t) - Q_m(t)\big)^2}{2\,\mathrm{Var}[\varepsilon_Q(t)]} \right) \qquad (8)

If each data point, xi, has its own standard deviation, σi, (a reasonable assumption based on Figure 1), equation (7) can be transformed to the log likelihood

\ln L = -\sum_{i=1}^{n} \left[ \ln\!\big(\sigma_i \sqrt{2\pi}\big) + \frac{(x_i - \mu)^2}{2\sigma_i^2} \right] \qquad (9)

Substituting for xi with (Qo(t)−Qm(t)) and for σi with Std [ɛQ(t)], together with μ = 0 yields

\ln L(\theta, a) = -\sum_{t=1}^{n} \left[ \ln\!\big(\mathrm{Std}[\varepsilon_Q(t)] \sqrt{2\pi}\big) + \frac{\big(Q_o(t) - Q_m(t)\big)^2}{2\,\mathrm{Var}[\varepsilon_Q(t)]} \right] \qquad (10)

Maximizing this log likelihood function leads to the optimal model parameters, θ, and error model parameters, a.
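A minimal sketch of how equations (5) and (10) could be evaluated, assuming the daily input error variances, the sensitivity time series, and the coefficients ai are available as arrays; the function names and array layout are illustrative.

```python
import numpy as np

def error_std(var_p, var_t, sensitivities, a):
    """Equation (5): time-variant Std[eps_Q(t)] from the input error variances and
    the scaled process sensitivities.

    var_p, var_t  -> daily discharge error variances (squared standard deviations)
                     due to precipitation and temperature
    sensitivities -> array of shape (n_groups, n_days) with dQ_m(t)/dtheta_i
    a             -> error model coefficients a_i, one per parameter group
    """
    var_process = np.sum((np.asarray(a)[:, None] * sensitivities) ** 2, axis=0)
    return np.sqrt(var_p + var_t + var_process)

def log_likelihood(q_obs, q_mod, sigma):
    """Equation (10): Gaussian log likelihood of the residuals Q_o(t) - Q_m(t) with
    zero mean and time-variant standard deviation sigma(t) = Std[eps_Q(t)]."""
    res = q_obs - q_mod
    return -np.sum(np.log(sigma * np.sqrt(2.0 * np.pi)) + res**2 / (2.0 * sigma**2))
```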

[32] An alternative to the maximum likelihood approach is to minimize the sum of the estimation variances

\sum_{t=1}^{n} \mathrm{Var}[\varepsilon_Q(t)] \rightarrow \min \qquad (11)

under the condition that the normalized variance is unity

\frac{1}{n} \sum_{t=1}^{n} \eta(t)^2 = 1 \qquad (12)

or under the condition that η(t) is normally distributed (biobjective optimization). This condition can be ensured by minimizing the Kolmogoroff-Smirnoff D statistic, which measures the maximum distance between an empirical cumulative frequency distribution and a given theoretical distribution, in this case the normal distribution

D = \max_i \big| F(x_i) - \Phi(x_i) \big| \qquad (13)

with F(xi) as the relative cumulative frequency of xi and Φ(xi) as the value of the standard normal distribution function at xi.
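For reference, a minimal sketch of the D statistic of equation (13) for the standardized errors η(t), computed against the standard normal distribution (scipy's kstest could be used equivalently).

```python
import numpy as np
from scipy.stats import norm

def ks_d_statistic(eta):
    """Equation (13): maximum distance between the empirical cumulative frequency of
    the standardized errors eta(t) and the standard normal distribution function."""
    x = np.sort(np.asarray(eta))
    ecdf = np.arange(1, len(x) + 1) / len(x)         # relative cumulative frequency F(x_i)
    return np.max(np.abs(ecdf - norm.cdf(x)))        # max_i |F(x_i) - Phi(x_i)|
```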

[33] However, other assumptions regarding the error distributions may be reasonable and could therefore also be included into the described methodology.

5. Case Study

5.1. Input Uncertainty

[34] The methodology introduced in this study was applied to the Neckar basin located in southwest Germany (Figure 2). The climate in the basin can be characterized as temperate humid, with a long-term spatially averaged annual precipitation of 950 mm that ranges from 700 mm (Tübingen, 370 m above sea level (asl)) to 1680 mm (Freudenstadt, 787 m asl). The precipitation regime shows a weak seasonality with a moderate maximum rainfall in June (110 mm) and a minimum rainfall in October (66 mm). The average annual temperature in the catchment is 8.7°C; the minimum is 6.4°C in Klippeneck (973 m asl) and the maximum is 9.1°C in Nürtingen (280 m asl). The coldest month is January (–0.5°C) and the warmest is July (17.3°C). All data used in this study were provided by the State Institute for Environmental Protection Baden-Württemberg.

Figure 2.

Location of the Neckar catchment in Germany.

[35] Figure 3 shows an example of precipitation data that were interpolated with external drift kriging, which is a typical interpolation method used in environmental models. The unrealistically smooth gradients that follow the topography and the distance from the locations of the observed points are clearly visible. Figure 4 shows an example of the simulated precipitation fields for the same day. It exhibits a similar spatial structure (determined by elevation and observations) but has a much higher spatial variability; the standard deviation is 17 mm, compared to 8 mm for the interpolation. The mean of all simulated realizations, as shown in Figure 5, should resemble the interpolated field in Figure 3, which it does. In this example, the variability in the interpolated field is already relatively high because a dense station network is available (294 stations for 14,100 km2, giving an average station distance of 6.9 km). In most other cases, fewer stations exist and interpolations are even smoother.

Figure 3.

EDK-interpolated precipitation field for 1 December 1981.

Figure 4.

Example of a simulated precipitation field for 1 December 1981.

Figure 5.

Ensemble mean of 50 simulated precipitation fields for 1 December 1981.

[36] A similar procedure is followed for temperature. Figure 6 shows an example of EDK-interpolated temperature, and Figure 7 shows an example of simulated temperature. Elevation is the only source of spatial variability between the stations in the interpolation case. In contrast, the simulated field provides a more realistic representation of the true temperature distribution. The mean of the ensemble of realizations again comes close to the interpolation, which indicates that the systematic error introduced by the simulations is small (Figure 8).

Figure 6.

EDK-interpolated temperature field for 18 March 1980.

Figure 7.

Example of a simulated temperature field for 18 March 1980.

Figure 8.

Ensemble mean of 50 simulated temperature fields for 18 March 1980.

[37] This bias was investigated by comparing statistics from other individual days throughout the year and the long-term mean values for precipitation and temperature. Analysis of the discharges modeled with the simulated time series showed that on average using the temperature simulations did not introduce a significant bias into the simulated discharges compared to using interpolated temperatures.

[38] Depending on the time of the year, the discharges of the individual realizations vary, but their long-term ensemble mean closely matches the interpolations. Figure 9 shows the mean of the differences between the interpolation and simulation as well as the temperature at Höfen for 4.5 years. In winter, when temperatures are close to 0°C, the discharges due to precipitation vary significantly depending on the form of precipitation (snow or rain). A positive peak in the curve is always followed by a negative one and vice versa. This is due to the fact that snow can only melt once and that a smaller snowpack subsequently leads to reduced discharges. The impact of temperature in summer is small, and the long-term mean of the series is zero, indicating that no systematic error is introduced by the temperature simulations. Looking at the standard deviation of the discharge differences mentioned above as a measure of uncertainty in the input data, one can expect a strong seasonality, which is shown in Figure 10. Here, a 30-day moving average of the 10-year mean daily standard deviation of the discharge differences and of the temperature is plotted. As expected, when the mean temperature in winter falls below 10°C the variability starts to rise; at 5°C, there is a significant spread in the discharges of the different temperature simulations, which also means a greater uncertainty from the temperature input data than during summer.

Figure 9.

Mean differences between discharges modeled with interpolated and simulated temperatures at Höfen.

Figure 10.

Annual cycle of daily standard deviation of discharge differences (interpolation minus realizations) modeled with simulated temperatures at Höfen (30-day moving average).

[39] The precipitation simulation, on the other hand, was shown to overestimate rainfall by about 35 mm (4%) per year and therefore to produce systematically larger discharges. This is due to the nonlinearity of the normal score transformation and the skewness of the precipitation distribution. Accordingly, the precipitation input uncertainty was determined on the basis of the ensemble mean of the simulations and not on the interpolations. The mean of the differences between the ensemble mean and the realizations is in this case zero by definition. Figure 11 shows the standard deviation of the discharge differences between the individual realizations and the ensemble mean; the 30-day moving average of the 10-year mean daily standard deviation of the discharge differences and of the precipitation is plotted.

Figure 11.

Annual cycle of daily standard deviation of discharge differences (ensemble mean minus realizations) modeled with simulated precipitation at Höfen (30-day moving average).

[40] Input uncertainty depends strongly on the available data, which can be demonstrated using a reduced measurement network. From a precipitation forecast or during a storm, only a smaller number of rainfall observations may be available. The reduced resolution of the model input has a significant effect on the input uncertainty. This is demonstrated by using only every fourth rainfall station (69 instead of 294) for the conditional simulation of rainfall realizations. Figure 12 shows the calculated standard deviation of the differences between the ensemble mean and the individual realizations based on the reduced station set, which is used to estimate the input uncertainty. The method thus allows for an error estimation even if observation densities vary. A comparison with Figure 11 indicates that, relative to the complete data set, the reduced data density approximately doubles the rainfall uncertainty during summer, bringing it close to the winter level, and also changes the seasonal distribution of the rainfall and discharge standard deviation. Therefore, in such cases the predictive uncertainty would need to be calculated accordingly. However, as the error model parameters and process uncertainties in this method are assumed to be sufficiently independent of input errors, one only needs to reassess the contribution of precipitation to the total uncertainty. This separation of the error sources is one of the significant advantages of the proposed method. Nevertheless, in the case of a forecast, it is not the resolution but the bias of the forecast that will constitute the greatest problem. This effect was not within the scope of this study but should be addressed in the future.

Figure 12.

Annual cycle of daily standard deviation of discharge differences (ensemble mean minus realizations) modeled with simulated precipitation using the reduced station set at Höfen (30-day moving average).

[41] To estimate the uncertainty in discharge due to meteorological input data, the standard deviation of the differences between the discharges modeled with interpolated and with simulated temperatures was calculated for each day. For rainfall uncertainty, the differences between the ensemble mean discharge and the discharge of the individual realizations of the simulated precipitation were used. These uncertainty time series could then be combined with the process uncertainties to estimate the total model uncertainty due to the input and process description.
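A minimal sketch of this step, assuming the ensemble of discharge simulations is available as an array; the 30-day moving average corresponds to the presentation in Figures 10-12.

```python
import numpy as np

def input_uncertainty_series(q_reference, q_realizations):
    """Daily standard deviation of the discharge differences between a reference run
    (interpolated temperature, or the ensemble mean in the precipitation case) and
    the runs driven by the simulated input fields.

    q_reference    -> shape (n_days,)
    q_realizations -> shape (n_realizations, n_days)
    """
    differences = q_realizations - q_reference[None, :]
    return np.std(differences, axis=0)               # e.g. Std[eps_T(t)] or Std[eps_P(t)]

def moving_average(x, window=30):
    """30-day moving average as used for the annual cycles shown in Figures 10-12."""
    return np.convolve(x, np.ones(window) / window, mode="same")
```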

5.2. Process Uncertainty

[42] The methodology is demonstrated by its application to the distributed, conceptual HBV model of three mesoscale watersheds of the central European Neckar basin. The three watersheds represent the major landscape units of the basin: The Swabian Jura in the southeast (gauge Süßen), the Black Forest in the west (gauge Höfen) and the plains in the north (gauge Neuenstadt). The climate in all three basins can be characterized as temperate humid. Table 1 summarizes the properties of the basins.

Table 1. Key Properties of the Three Watersheds

                                                Süßen      Höfen      Neuenstadt
Elevation (m asl)                               360–860    360–900    170–520
Area (km2)                                      340        217        140
Mean annual precipitation for 1990–1999 (mm)    876        1375       983
Mean discharge for 1990–1999 (m3/s)             5.3        4.7        1.3

[43] The modified HBV model based on 1 km2 grid cells as primary hydrological units, as described by Götzinger and Bárdossy [2007], was used for this case study. Two snow-related parameters were kept constant in each basin. The two parameters controlling the soil moisture, β and kperc, and the three storage coefficients, α, k1 and k2, were allowed to vary from cell to cell. As free calibration of such a large number of parameters is expected to introduce significant uncertainty, this study attempts to quantify this uncertainty and proposes a new, heteroskedastic calibration methodology for such distributed or lumped models. The model was calibrated twice for the three basins using the time period from 1990 to 1999: first, using a composition of Nash-Sutcliffe coefficients on a daily, weekly and annual scale (standard calibration) and, second, using the approach presented in section 4. The traditionally measured model efficiency of both calibration runs was acceptable (the mean Nash-Sutcliffe coefficient was 0.58 in both cases).
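As an illustration of the standard calibration criterion, the sketch below computes Nash-Sutcliffe coefficients on daily, weekly and annual scales and combines them; the equal weighting is an assumption, since the study does not state how the three scales are composed.

```python
import numpy as np
import pandas as pd

def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency of a simulated series against observations."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def composite_nash_sutcliffe(obs, sim, dates, weights=(1/3, 1/3, 1/3)):
    """Composite of Nash-Sutcliffe coefficients on daily, weekly and annual scales.
    dates -> pandas DatetimeIndex of the daily series; equal weights are assumed."""
    obs_s, sim_s = pd.Series(obs, index=dates), pd.Series(sim, index=dates)
    ns = [nash_sutcliffe(obs_s, sim_s),
          nash_sutcliffe(obs_s.resample("W").mean(), sim_s.resample("W").mean()),
          nash_sutcliffe(obs_s.resample("A").mean(), sim_s.resample("A").mean())]
    return float(np.dot(weights, ns))
```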

[44] First, the distributions of the model residuals were compared. The final model residuals of each calibration run were transformed by dividing them by their standard deviation: directly in the case of the standard calibration, and by taking the ratio of the model residual and the uncertainty in the case of the maximum likelihood methodology. Figure 13 shows the results for the three basins compared to a normal distribution. The standard calibration residuals are biased and skewed, and their standard deviation and kurtosis are too large. Key figures for both methodologies are presented in Table 2; note that the standardized model residuals are compared here. The absolute errors of the maximum likelihood calibration are not larger than those of the standard approach, although the histograms are wider.

Figure 13.

Standardized classified relative frequency distributions of model error using the (top) standard and (bottom) maximum likelihood calibration methodology for the basins (left) Süßen, (middle) Höfen, and (right) Neuenstadt; the normal distribution is shown for comparison. Note that the highest- and lowest-class intervals contain all values greater and less than 3 and –3, respectively.

Table 2. Mean, Standard Deviation, and Kolmogoroff-Smirnoff D Statistics of the Model Error Distributions for Standard and Maximum Likelihood Calibration

              Standard Calibration               Maximum Likelihood Calibration
              Mean     Std. Dev.   Dtest         Mean     Std. Dev.   Dtest
Süßen         –0.02    2.92        11.43         –0.24    0.87        11.25
Höfen         0.67     3.21        9.50          0.00     0.99        4.29
Neuenstadt    0.29     1.00        21.1          0.25     0.99        7.18

[45] The maximum likelihood calibration yields approximately Gaussian error distributions, which are only slightly biased and skewed. At all three gauging stations, the two class intervals of outliers (<−3 and >3) contain a considerable number of time steps (1.5%, 1.1% and 1.1%, respectively) where the model could hardly simulate the discharge correctly, although it was expected to do so. This indicates additional weaknesses in the data or model structure that could not be properly addressed with this methodology. The frequency of outliers is nevertheless smaller than in the standard calibration approach (1.3%, 1.7% and 1.1%, respectively, where the expected value is 0.13%).

[46] In contrast to the standard calibration, the mean of the distribution in Höfen is zero, but in Süßen and Neuenstadt it is still considerably biased. The standard deviations of all three distributions are sufficiently close to 1 and their kurtosis resembles the normal distribution. Therefore despite the remaining bias, all three are more suitable for optimization methods assuming Gaussian errors than their standard calibration equivalents.

[47] For practical applications one has to assume a certain error distribution to calculate confidence intervals for the model output, in this case discharge. If all distributions can be assumed to be sufficiently normal, one can derive confidence intervals for the simulated discharge by taking the inverse of the normal residual distribution and adding an expected deviation corresponding to a selected confidence limit to the calibrated discharge values. For the case of the standard calibration, a certain error model of the simulation must be assumed. This can be additive (constant in time, a fixed value for each discharge) or multiplicative (relative to the simulated discharge, by adding the expected deviation in the logarithmic domain and transforming back). Figures 14 and 15 show two such examples for the gauge Neuenstadt, calculated for a confidence level of 80%.

Figure 14.

Standard calibration 80% confidence intervals for an additive error model.

Figure 15.

Standard calibration 80% confidence intervals for a multiplicative error model.
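A minimal sketch of how such 80% confidence bounds could be constructed for the additive and multiplicative error models described above, and for the heteroskedastic model discussed below; the argument names are illustrative, and the residual standard deviations (constant, log domain, or time variant) are assumed to have been estimated beforehand.

```python
import numpy as np
from scipy.stats import norm

def confidence_bounds(q_mod, level=0.80, error_model="heteroskedastic",
                      sigma_add=None, sigma_log=None, sigma_t=None):
    """Lower and upper confidence bounds for the simulated discharge q_mod.

    sigma_add -> constant residual standard deviation (additive error model)
    sigma_log -> residual standard deviation in the log domain (multiplicative model)
    sigma_t   -> time-variant Std[eps_Q(t)] (heteroskedastic error model)
    """
    z = norm.ppf(0.5 + level / 2.0)                  # ~1.28 for an 80% interval
    if error_model == "additive":
        return q_mod - z * sigma_add, q_mod + z * sigma_add
    if error_model == "multiplicative":
        return q_mod * np.exp(-z * sigma_log), q_mod * np.exp(z * sigma_log)
    return q_mod - z * sigma_t, q_mod + z * sigma_t  # heteroskedastic
```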

[48] The confidence intervals can be validated by comparing them with the observed discharges. Intervals that are too wide will contain too many observations; the predictions will also be very uncertain. If the intervals are too small, they will not include as many observations as they theoretically should, which indicates weaknesses in the chosen methodology [Beven and Binley, 1992]. As the intervals are derived for a certain confidence level, one can easily judge their value by comparing the number of points inside the limits with the chosen confidence level [Montanari and Brath, 2004].
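A correspondingly simple coverage check, as used for Tables 3 and 5, could look as follows; the comparison with the nominal 80% level is then done by inspection.

```python
import numpy as np

def coverage_percent(q_obs, lower, upper):
    """Percentage of observed discharges falling inside the confidence bounds."""
    inside = (q_obs >= lower) & (q_obs <= upper)
    return 100.0 * np.mean(inside)
```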

[49] An additive error confidence interval is similar to a fixed-width band around the simulated values (Figure 14). Therefore, all low and medium flows are included, but the model has little predictive power in those situations. Almost all higher discharge values, on the other hand, fall outside the range, which shows that the methodology is also not appropriate for that case. Multiplicative error intervals can be derived by transforming the discharges and calculating the uncertainty bounds in the logarithmic domain. In a sense, these confidence limits are more realistic because many instruments show relative errors. However, the amount of uncertainty then also increases in low-flow situations because it is difficult to precisely measure and model small quantities. Therefore, these confidence limits are useful for some of the medium floods but may miss some low-flow observations (Figure 15).

[50] The heteroskedastic error model provides confidence intervals that more realistically represent the time variant uncertainty of the discharge simulations (Figure 16). The uncertainty is large during floods and in all cases where several processes are simultaneously active. It is smaller during recessions and in less complex situations, such as when soils are either completely wet or dry. No parts of the hydrograph are systematically missed and four of the six small floods are contained within the confidence bands, which shows that they are suitable for all discharge ranges. At the same time, the error model has much more predictive power than an additive or multiplicative error model as the confidence interval is smaller when the model is assumed to be more precise.

Figure 16.

Maximum likelihood calibration 80% confidence intervals using the heteroskedastic error model.

[51] A statistical comparison of all three error models is given in Table 3. As shown, the additive error model overestimates and the multiplicative error model underestimates the uncertainty of the simulations in five of the six cases. If the error distributions were perfectly normal, exactly 80% of the points would lie inside the confidence bounds. Because the distribution of the heteroskedastic methodology in Süßen deviates most from this assumption, this limit is also largely exceeded.

Table 3. Data Points Within the 80% Confidence Limits of the Three Error Modelsa

              Additive Error    Multiplicative Error    Heteroskedastic Error
Süßen         94                77                      90
Höfen         91                81                      84
Neuenstadt    94                71                      82

  a Values are given in percent.

5.3. Comparison of Maximum Likelihood and Biobjective Optimization

[52] The maximum likelihood method presented in section 4 yields the optimal model and error model parameters if all of the assumptions used in its application are valid. Unfortunately, a posteriori analysis shows that the means of the standardized model error distributions can deviate slightly from zero. The biobjective optimization provides the possibility of giving more weight to the normality criterion during calibration. In this section, the results of its application are presented and compared to the maximum likelihood method. The sum of the error variances plus the Kolmogoroff-Smirnoff D statistic scaled by a constant factor of 100 is used as the objective function in this calibration.
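A minimal sketch of this combined criterion, reusing the standardized errors of equation (6) and scipy's Kolmogorov-Smirnov statistic against the standard normal; the function interface is illustrative.

```python
import numpy as np
from scipy.stats import kstest

def biobjective(q_obs, q_mod, sigma, ks_weight=100.0):
    """Biobjective criterion (to be minimized): sum of the error model variances plus
    the Kolmogoroff-Smirnoff D statistic of the standardized errors, scaled by 100."""
    eta = (q_obs - q_mod) / sigma                    # standardized errors, equation (6)
    d = kstest(eta, "norm").statistic                # distance to the standard normal
    return np.sum(sigma ** 2) + ks_weight * d
```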

[53] As expected, the standardized error distributions (Figure 17) resemble the normal distribution more closely than those of the maximum likelihood calibration (Figure 13). Nevertheless, the biobjective calibration distributions also show a considerable number of outliers, especially in Süßen and Höfen. The absolute errors of the biobjective calibration are larger than those of the maximum likelihood calibration, which results from the larger weight that was assigned to the normality condition during optimization. Table 4 supports these statements. Except for Höfen, where the maximum likelihood method also achieved very good results, the Kolmogoroff-Smirnoff D statistics show that the biobjective calibration yields distributions that resemble a Gaussian distribution much more closely; the price for this is a reduced model precision, because the absolute errors are larger and the confidence intervals are therefore on average wider, as shown in Figure 18 (to be compared to Figure 16).

Figure 17.

Standardized classified relative frequency distributions of model error using the biobjective calibration methodology for the basins (left) Süßen, (middle) Höfen, and (right) Neuenstadt; the normal distribution is shown for comparison.

Figure 18.

Biobjective calibration 80% confidence intervals using the heteroskedastic error model.

Table 4. Mean, Standard Deviation, and Kolmogoroff-Smirnoff D Statistics of the Model Error Distributions for the Maximum Likelihood and Biobjective Calibration Methodologies

              Maximum Likelihood                 Biobjective Calibration
              Mean     Std. Dev.   Dtest         Mean     Std. Dev.   Dtest
Süßen         –0.24    0.87        11.25         –0.07    1.16        1.35
Höfen         0.00     0.99        4.29          0.09     1.37        4.29
Neuenstadt    0.25     0.99        7.18          0.04     1.03        1.27

[54] Note the extremely large uncertainty in the first days of April; this is a good general example of the benefit of the methodology. The high uncertainty stems from snow storage modeling. Depending on whether the snow cover in the basin has already melted, new precipitation will fall on top of snow or directly on soil. Additionally, if the air temperature is below the threshold temperature, the precipitation will be snow; if it is above, it will be rain. The possible combinations of both conditions lead to very different discharge simulations. Obviously, a process with a substantial error memory together with a purely stochastic process creates a situation that critically depends on the state and the input of the system.

[55] A statistical comparison of both calibration methods is given in Table 5. As can be seen, the maximum likelihood method overestimates and the biobjective method underestimates the uncertainty of the simulations in three of the six cases (Süßen and Höfen). In general, the maximum likelihood method is recommended as it theoretically yields better results. If a posteriori analysis shows significant violations of the underlying assumptions, the biobjective calibration can be used to enforce normality. In that case, the weighting of the two criteria should be optimized to reduce the loss in accuracy.

Table 5. Data Points Within the 80% Confidence Limits of the Maximum Likelihood and Biobjective Calibration of the Heteroskedastic Error Modela

              Maximum Likelihood    Biobjective Calibration
Süßen         90                    82
Höfen         84                    71
Neuenstadt    82                    81

  a Values are given in percent.

5.4. Validation

[56] Because the maximum likelihood optimization has been identified as an appropriate method to determine model and error model parameters simultaneously, the transferability of both parameter sets to a different time period was analyzed. The time period from 1980 to 1989 was simulated. The traditionally measured model efficiency was acceptable (the mean Nash-Sutcliffe coefficient was 0.66). In addition, the rainfall and temperature variance time series were calculated using another model parameter set to test the transferability of the input uncertainty. Table 6 summarizes the results of this test. As in the calibration period, the distributions do not pass the Kolmogoroff-Smirnoff test, but although their normality deteriorates slightly, the transferability of the parameters and input uncertainty time series is still valid. Nevertheless, the confidence limits appear to be systematically too large. An analysis of input uncertainty time series from several model parameter sets showed that they are highly correlated, which supports the assumption that input errors are independent of process errors. However, input errors can be slightly biased. In the presented case, this leads to the overestimation of model uncertainty in the validation period, as demonstrated by the excessively wide confidence intervals.

Table 6. Mean, Standard Deviation, Kolmogoroff-Smirnoff D Statistics, and Points Within the 80% Confidence Intervals for the Validation Period

              Mean     Std. Dev.   Dtest    Points Within Confidence Intervals (%)
Süßen         –0.25    0.93        12.67    89
Höfen         –0.14    0.83        8.94     90
Neuenstadt    0.08     0.85        7.73     89

[57] Another possibility for validation is the internal testing of the underlying assumptions. By separating the observations and simulations into a low-flow regime (smaller than the mean discharge) and a high-flow regime (larger than the mean discharge), one can check whether the distributions of the resulting subpopulations are also sufficiently Gaussian. In Figure 19, this is shown for the stations Süßen and Neuenstadt. The number of observations in the validation period at Höfen was too small to separate them into meaningful subsets; therefore no data from Höfen are provided in Figure 19.

Figure 19.

Standardized classified relative frequency distributions of model error during low- and high-flow situations at Süßen and Neuenstadt.

[58] The variance of the errors in the low-flow regime is too small for both gauging stations; however, the distributions do not deviate excessively from the normal distribution. During high flows, the distributions are obviously biased and skewed, especially in Süßen, because the model underestimates a large number of observations. This indicates internal weaknesses in the process description of the model, which cannot and should not be compensated for by the error model. Therefore, this analysis can be a valuable diagnostic tool for checking the performance of the model in different flow situations. Table 7 provides the relevant statistics for both flow regimes.

Table 7. Mean, Standard Deviation, and Kolmogoroff-Smirnoff D Statistics for the Low- and High-Flow Regime of Süßen and Neuenstadt

              Low-Flow Regime                    High-Flow Regime
              Mean     Std. Dev.   Dtest         Mean     Std. Dev.   Dtest
Süßen         –0.03    0.60        7.96          –1.05    1.55        9.69
Neuenstadt    –0.05    0.65        4.22          –0.01    1.18        1.53
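A minimal sketch of how the regime statistics of Table 7 could be computed from the standardized errors; note that the scaling of the D statistic reported in the tables is not stated in the text, so the raw Kolmogorov-Smirnov distance is returned here.

```python
import numpy as np
from scipy.stats import kstest

def regime_statistics(q_obs, eta):
    """Mean, standard deviation, and Kolmogorov-Smirnov distance of the standardized
    errors eta(t), split into low-flow (below mean discharge) and high-flow subsets."""
    results = {}
    for name, mask in (("low", q_obs < q_obs.mean()), ("high", q_obs >= q_obs.mean())):
        sub = eta[mask]
        results[name] = {"mean": float(sub.mean()),
                         "std": float(sub.std()),
                         "D": float(kstest(sub, "norm").statistic)}
    return results
```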

[59] A similar analysis can be performed regarding the seasonal behavior of the model. The observations and simulations were split into winter periods (November–April) and summer periods (May–October), and the model error distributions were plotted for each subpopulation individually (Figure 20).

Figure 20.

Standardized classified relative frequency distributions of model error from winter and summer periods at Süßen and Neuenstadt.

[60] All four distributions deviate significantly from the normal distribution. The comparison of winter and summer simulations shows that the model performs better in summer than in winter. This may be due to the reduced complexity in summer, when fewer processes are active and the soils are mostly dry. The higher rainfall uncertainty associated with convective precipitation seems to be compensated for by the standardized model error. Again, winter storms are mostly underestimated, especially in Süßen, which is dominated by snowmelt events. This systematic bias indicates weaknesses in the process representation. Table 8 shows the statistical key figures for both seasons.

Table 8. Mean, Standard Deviation, and Kolmogoroff-Smirnoff D Statistics From Winter and Summer Periods in Süßen and Neuenstadt

              Winter                             Summer
              Mean     Std. Dev.   Dtest         Mean     Std. Dev.   Dtest
Süßen         –0.58    1.40        8.05          –0.12    0.61        7.35
Neuenstadt    0.25     0.93        4.99          –0.32    0.64        6.99

[61] The presented analysis of simulation subsets shows that the model is not ergodic, which means that subpopulations behave differently from the complete population. Ergodicity is an important prerequisite for the transferability of the results to other time periods (validation). Therefore, further research is needed to improve the methodology.

6. Results and Discussion

[62] The uncertainty from the input data, model parameters, and process description were estimated using a combined procedure for calibration and uncertainty estimation. The key points of the presented methodology are summarized in the following list:

[63] 1. The hydrological model error is assumed to be a combination of random components from the input variables and the process representation errors.

[64] 2. The uncertainty of the calculated discharge can be different for each time step depending on the uncertainty of the input and the processes contributing at and before the given time.

[65] 3. The standard deviation of the random error corresponding to a given process and parameter group is assumed to be proportional to the sensitivity of the simulated discharge with respect to the selected parameter group.

[66] 4. From these sensitivities an error model corresponding to a given parameter set can be created.

[67] 5. Model parameters and the parameters of the error model can be estimated simultaneously using a maximum likelihood or biobjective optimization procedure. The objective of model accuracy is the minimization of the normed deviation between the modeled and observed discharge. The objective of the error model parameter estimation is to obtain normally distributed random errors.

[68] 6. Because the two objectives (model accuracy and reliability) are coupled, only a joint optimization can be successful.

[69] 7. The framework developed in this paper yields a heteroskedastic model error, which can be used to derive plausible, time variant, process-dependent confidence limits for hydrological simulations.

[70] The estimation of predictive uncertainty depends on the modeling purpose. Therefore, different implementations of the presented method are necessary for simulation, forecasting or climate change impact studies as the input uncertainty varies significantly in each case. The advantage of the presented approach is that the process and parameter uncertainty is successfully separated from all other uncertainty sources. The meteorological uncertainty can be added depending on the given situation. Therefore, different models are likely to be better for different purposes depending on the data availability and the data quality (forecasts, radar and climate scenarios).

[71] Model resolution is expected to be just as important as input data. It was shown that input uncertainty is reduced when more observations are available for a given model resolution. On the other hand, one can assume that process uncertainty increases with finer model resolutions for a given input data density. Therefore, an optimal model resolution could be found for each observation network, which balances both effects and ideally exploits the available information.

[72] Another important aspect is the randomness of the calculated normalized model errors. Significant autocorrelation in the error time series (0.83, 0.91 and 0.74 for Süßen, Höfen and Neuenstadt, respectively) shows that the process-based error memory may be overestimated in the heteroskedastic error model. Additional analysis could show whether the time series can nevertheless be treated as quasi-random.

[73] In the introduction of this paper, model structure and output were mentioned as additional sources of predictive uncertainty. The former is represented by the process uncertainties, as the discharge sensitivity of each parameter group corresponding to a certain process of a given model structure is used. Therefore, the simplification by the model is implicitly taken into account. Ultimately, we are interested in how wrong the discharge can be for a given model structure, not in how wrong the model structure itself is. This hypothesis can be verified by comparing several model structures, as previously discussed for model resolution.

[74] The output uncertainty, in this case discharge, can be derived from the analysis of the rating curves and can be easily incorporated as an additional term into the methodology. The contribution is usually expected to be much smaller than the input and process uncertainty; further analysis of this aspect was not within the scope of this study. Therefore, additional research is needed to prove these hypotheses.

Acknowledgments

[75] This research was funded by the European Union in the Sixth Framework Program through the project RIVERTWIN. All data was kindly provided by the State Institute for Environmental Protection Baden-Württemberg. The suggestions of the editors and three anonymous reviewers helped to improve the paper.
