A parametric Bayesian combination of local and regional information in flood frequency analysis

Authors


Abstract

[1] Because of their impact on hydraulic structure design as well as on floodplain management, flood quantiles must be estimated with the highest precision given available information. If the site of interest has been monitored for a sufficiently long period (more than 30–40 years), at-site frequency analysis can be used to estimate flood quantiles with a fair precision. Otherwise, regional estimation may be used to mitigate the lack of data, but local information is then ignored. A commonly used approach to combine at-site and regional information is the linear empirical Bayes estimation: under the assumption that both local and regional flood quantile estimators have a normal distribution, the empirical Bayesian estimator of the true quantile is the weighted average of both estimations. The weighting factor for each estimator is inversely proportional to its variance. We propose in this paper an alternative Bayesian method for combining local and regional information which provides the full probability density of quantiles and parameters. The application of the method is made with the generalized extreme value (GEV) distribution, but it can be extended to other types of extreme value distributions. In this method the prior distributions are obtained using a regional log linear regression model, and then local observations are used within a Markov chain Monte Carlo algorithm to infer the posterior distributions of parameters and quantiles. Unlike the empirical Bayesian approach, the proposed method works even with a single local observation. It also relaxes the hypothesis of normality of the probability distribution of the local quantiles. The performance of the proposed methodology is compared to that of local, regional, and empirical Bayes estimators on three generated regional data sets with different statistical characteristics.
The results show that (1) when the regional log linear model is unbiased, the proposed method gives better estimations of the GEV quantiles and parameters than the local, regional, and empirical Bayes estimators; (2) even when the regional log linear model displays a severe relative bias when estimating the quantiles, the proposed method still gives the best estimation of the GEV shape parameter and outperforms the other approaches on higher quantiles provided the relative bias is the same for all quantiles; and (3) the gain in performance with the new approach is considerable for sites with very short records.

1. Introduction

[2] Depending on the availability of data, flood quantiles can be estimated using local frequency analysis, regional frequency analysis or a combination of both. Much effort has been spent during the last decades on the study of the statistical properties of flood distributions, but the lack of sufficiently long data series continues to limit the precision of the results [Bobée and Rasmussen, 1995]. The regionalization concept, introduced by Dalrymple [1960], allows us to mitigate the lack of data by transposing information from gauged sites toward ungauged sites of interest. The concept has been refined continuously since then, and new approaches have regularly been proposed [e.g., Benson, 1962; Matalas and Gilroy, 1968; Vicens et al., 1975; Rousselle and Hindie, 1976; National Environment Research Council (NERC), 1975; Tasker, 1980; Greiss and Wood, 1981; Kuczera, 1982; Hosking et al., 1985; Lettenmaier et al., 1987; Stedinger and Lu, 1995; Madsen et al., 1994, 1995; Madsen and Rojsberg, 1997; Fill and Stedinger, 1998; Burn, 1990; Groupe de recherche en hydrologie statistique (GREHYS), 1996a, 1996b; Ouarda et al., 2000, 2001; Chokmani and Ouarda, 2004]. Regionalization also results in more precise estimates of quantiles and parameters at sites with short records. It is, however, difficult to decide whether the local data series are long enough to discard regional information. To deal with this issue, Matalas and Gilroy [1968] recommend choosing the estimator that has the smallest variance. It would, however, make more sense to systematically combine all available and relevant information to gain a better knowledge of the hydrological quantities to be estimated. Attention should be paid to the fact that, in a highly heterogeneous region, the addition of the regional information may be counterproductive.

[3] We present in this paper a parametric Bayesian method for combining local and regional information for the GEV distribution. In this method, the prior information is specified from the regional data by the probability distribution of a quantile and two quantile differences qT1, qT2 − qT1, qT3 − qT2 (where qT is the T-year annual flood quantile). Guidelines for its extension to other extreme value distributions are also provided.

[4] The paper is divided into six parts. Section 2 presents a literature review on the Bayesian approaches for combining local and regional information. In section 3 the proposed Bayesian model is presented and the approaches for regional estimation and for prior specification are developed. The MCMC algorithm that was used to make inference on parameters and quantiles is also presented. The validation methodology is presented in section 4. The case study is presented in section 5, and the results are discussed in section 6. A conclusion is finally presented in section 7.

2. Literature Review

[5] The need to combine regional and local information was perceived early and several authors tried to address the issue using various approaches. These approaches can be classified in two groups: (1) mixed approaches which consist in estimating some parameters with the local data and the others with the regional data and (2) approaches that simultaneously use both information sources to estimate all parameters and quantiles. A Bayesian approach can be used in both cases, but to the knowledge of the authors, all approaches that are classified in group 2 are Bayesian. Bayesian approaches can consist either in the construction of an empirical estimator, or the complete inference of the posterior distributions. Depending on the distributions of local and regional estimators, the parametric Bayesian inference can be conducted either analytically or numerically.

2.1. Mixed Approaches

[6] The index flood method [Dalrymple, 1960; NERC, 1975] represents a mixed approach when it is applied to a gauged site because the average at-site flow is estimated with local data, while the parameters of the distribution of the normalized quantile are estimated with the regional data. Lettenmaier et al. [1987] used Monte Carlo simulation to show that, if the underlying regional distribution in the index flood approach is the generalized extreme value distribution (GEV), and if the parameters of this distribution are estimated with the L moments or the probability weighted moments (PWM), then the index flood regional estimation is more effective than the local estimation even in case of moderate regional heterogeneity.

[7] Another example of a mixed approach is the “two parameter” GEV/PWM method in which the shape parameter of the GEV distribution is estimated by a regional approach and the two other parameters with the local data. This method was shown to be superior to the three parameter GEV/PWM regional index flood method for the estimation of the 100-year flood when the size of local data series increases, or when the regional heterogeneity is significant [Lettenmaier et al., 1987; Stedinger and Lu, 1995; Fill and Stedinger, 1998].

[8] The procedure recommended by the Interagency Advisory Committee on Water Data [1982] is also a mixed approach since it uses a weighted skew (shape of the LP3 distribution) in order to improve the at-site estimator. The weighted skew may be computed through regression analysis, with the at-site skew.

[9] More recently, regional flood frequency analysis using canonical correlation analysis (CCA) has been extended to account for local data in neighborhood delineation [Ouarda et al., 2001]. CCA is a multivariate statistical technique which is used to express hydrological and physiographical variables in two special canonical spaces with special intercorrelation features. Distance in the hydrological space allows the delineation of the neighborhood of a given station using the approach of confidence level ellipsoid [GREHYS, 1996a, 1996b; Ouarda et al., 2000; Girard et al., 2000]. Short local data series can then be helpful to position a station in the hydrological space, and thus to define a more adequate neighborhood. It is a mixed approach to regionalization in the sense that local data influence parameter estimation through the identification of neighborhood limits. A mixed approach can also be Bayesian: for instance, a Bayesian approach was used by Reis et al. [2003, 2005] to infer the skew coefficient of the LP3 distribution while using local data to compute the two other parameters.

2.2. Simultaneous Estimation Using Bayesian Approaches

[10] In the Bayesian framework (which will be presented in more detail in section 3), the prior knowledge on the unknown quantities (parameters or quantiles of the local distribution) is described by probability densities. In the hydrological literature dealing with the combination of local and regional information, these prior probability densities are usually obtained from a regional analysis [e.g., Vicens et al., 1975; Madsen and Rojsberg, 1997; Fill and Stedinger, 1998]. The prior probability distributions are then used with the local observations to infer posterior distributions using the Bayes theorem.

2.2.1. Empirical Bayes Approach

[11] When the probability distributions of both regional and local quantile estimators are normal, it is easily shown [e.g., GREHYS, 1996b] that the quantile posterior distribution is normal with the following parameters:

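$$E\left[q_T\right] = \frac{\sigma_R^2\,\hat{q}_T^{(L)} + \sigma_L^2\,\hat{q}_T^{(R)}}{\sigma_L^2 + \sigma_R^2} \tag{1}$$

$$\mathrm{Var}\left[q_T\right] = \frac{\sigma_L^2\,\sigma_R^2}{\sigma_L^2 + \sigma_R^2} \tag{2}$$

where $q_T$ is the flood quantile we wish to estimate, $\hat{q}_T^{(L)}$ ($\hat{q}_T^{(R)}$) the local (regional) estimate of $q_T$, and $\sigma_L^2$ ($\sigma_R^2$) its local (regional) estimation variance. The estimator presented in equation (1) is also called the linear empirical Bayes estimator and was used by Vicens et al. [1975], Kuczera [1982], Fill and Stedinger [1998], and Madsen and Rojsberg [1997].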
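The inverse-variance weighting behind the linear empirical Bayes estimator can be illustrated with a minimal Python sketch. It assumes the standard normal-normal form, in which each estimate is weighted by the variance of the other one; the function name and the numerical values are hypothetical and serve only as an illustration.

```python
def empirical_bayes(q_local, var_local, q_regional, var_regional):
    """Combine two normally distributed quantile estimates by inverse-variance weighting."""
    total = var_local + var_regional
    w_local = var_regional / total        # weight on the local estimate
    w_regional = var_local / total        # weight on the regional estimate
    mean = w_local * q_local + w_regional * q_regional
    var = var_local * var_regional / total
    return mean, var

# Hypothetical values: a short local record (large variance) and a tighter regional estimate.
mean, var = empirical_bayes(q_local=520.0, var_local=90.0 ** 2,
                            q_regional=460.0, var_regional=45.0 ** 2)
print(mean, var)
```

In this example the combined estimate lies closer to the regional value because the regional estimator has the smaller variance, and the combined variance is smaller than either of the two input variances.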

[12] Vicens et al. [1975] assumed that the annual mean flows of New England rivers could be described by a normal distribution and obtained the average and the variance of the prior distribution of the mean annual flows with a multiple linear regression on physiographic variables. They then discussed the variation of the shape of the posterior distributions of flows with respect to the precision of the local and regional distributions. This analysis showed that the combination of the two sources of information reduced the estimation variance of the parameters and that of the mean annual flow. The posterior distribution of streamflows was dominated by the estimator which had the smallest variance.

[13] Kuczera [1982] used an empirical Bayesian method to stabilize the estimation of the variance of flood records, which were assumed to have a lognormal distribution. He obtained the prior information by fitting a gamma distribution to the estimated local variances. He used this model on a simulated data set without intersite correlation and showed that the relative root mean square error (RRMSE) of the estimated 100-year flood is reduced. The reduction, however, becomes less important as the regional heterogeneity increases. Kuczera's [1982] approach was later shown to be sensitive to violations of distributional assumptions [Lettenmaier and Potter, 1995]. The computation of the RRMSE by Kuczera [1982] was possible only because the true values of the quantiles of the simulated flood data were known. In a second application, Kuczera [1982] used real data from selected New England basins. Since the true values of the quantiles were not available, he was only able to show that the combination of the regional and local information stabilizes the estimation of quantiles, i.e., the posterior distribution of quantiles has a smaller variance.

[14] Fill and Stedinger [1998] used the empirical Bayesian method to combine the result of normalized quantiles regression (NQR) with the two-parameter GEV/PWM regional estimator. The NQR method, introduced by Koenker and Bassett [1978] and applied in hydrology by Stedinger [1989], consists in estimating the normalized quantile (the flood quantile divided by the average at-site flow) by linear regression on physiographic variables. Fill and Stedinger [1998] showed by simulation that the empirical Bayesian estimator was more robust and, in terms of root mean square error, performed as well as or better than the NQR method or the two-parameter GEV/PWM method.

[15] Madsen and Rojsberg [1997] used two Bayesian estimators of the T-year event in a study that was conducted on flood data from New Zealand. They used the index flood approach for regional estimation and the generalized Pareto distribution (GP) as the distribution of flood peaks above a given threshold. The first estimator is the empirical Bayesian estimator given in equation (1) whereas the second is the mean of the posterior distribution of the quantile obtained with a parametric Bayesian approach. In both cases, the prior information about the parameters of the GP was obtained by linear regression on physiographic variables, and then used to calculate the quantile estimation. Their results indicated that the parametric Bayesian estimator leads to posterior quantile estimation and variance that are respectively 5% and 11% higher than those obtained with the empirical Bayesian approach. They explained this result with the positive asymmetry introduced by the choice of the prior distribution.

2.2.2. Parametric Bayesian Approaches

[16] Less often used because of its complexity, the parametric (or fully) Bayesian inference for nonnormal distributions consists in inferring the posterior probability density of the parameters and the quantiles and generally leads to numerical integration. A common approach to avoid or reduce numerical integration consists in attributing to both local and regional estimators mutually compatible probability distributions (called conjugate distributions) so that the posterior distribution of the unknown quantities can be written in a closed analytical form.

[17] Parametric Bayesian approaches to regionalization were used by Shane and Gaver [1970], Rousselle and Hindie [1976], Rasmussen and Rojsberg [1991], Madsen et al. [1994, 1995], and Madsen and Rojsberg [1997] for partial duration series (PDS) models for which the exceedances are assumed to have a generalized Pareto or an exponential distribution.

[18] Shane and Gaver [1970] assumed that the exceedances above a given threshold follow an exponential distribution. They derived the equivalents of equations (1) and (2) for this distribution while searching for the linear combination of regional and local estimations which gives the smallest root mean square error. They also considered a Bayesian approach where the prior information about the parameter of the exponential distribution describing the magnitude of exceedances is represented by a Gamma distribution. The mean and variance of the prior distribution were obtained by regional multiple linear regression. Shane and Gaver [1970] then compared the implication of both estimators on the optimal height of a protection dike and found that both methods give essentially the same result.

[19] Rousselle and Hindie [1976] and Rasmussen and Rojsberg [1991] considered the classical PDS model with exponentially distributed exceedances and derived the posterior distribution of the T-year event. Rousselle and Hindie [1976] considered an informative gamma prior distribution for all the parameters while Rasmussen and Rojsberg [1991] assumed a non informative prior for the parameter of the exponential distribution of exceedances.

[20] Madsen et al. [1994, 1995] generalized the model of Rasmussen and Rojsberg [1991] to the case where the distribution of the exceedances is the Generalized Pareto distribution and applied it to extreme rainfalls. The model of Madsen et al. [1994, 1995] was later adapted to index flood regional estimation in the work by Madsen and Rojsberg [1997] which was described in section 2.2.1.

3. Bayesian Estimation

[21] In the Bayesian approach, the imperfect knowledge of the exact parameter values is accounted for through probability distributions. As stated by Jaynes [1985], the width of these probability distributions should be seen rather as a representation of the range of values that are consistent with observed data and the knowledge than as indicators of the range of variability of the parameter. The specification of prior information requires that belief or knowledge about the parameters is expressed in terms of a prior distribution, which must be formulated independently of the observations. This probability density is then used with the observations to obtain the posterior distribution using the well known Bayes theorem:

$$p(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta} \tag{3}$$

where x = (x1, x2, …, xn) is the vector of observations, $\pi(\theta)$ the prior probability density of the parameters $\theta$, $f(x \mid \theta)$ the likelihood of the observations, and $p(\theta \mid x)$ the posterior probability density of the parameters given the observations. The posterior distribution is obtained either analytically or numerically using sophisticated techniques such as Markov chain Monte Carlo (MCMC) algorithms [e.g., Gilks et al., 1996]. Example studies using Bayesian methodologies with the GEV distribution are those by Coles and Powell [1996], Coles and Tawn [1996], or Huerta and Sansó [2005].

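[22] In our application, $\theta = (\mu, \sigma, \xi)$ where μ, σ and ξ are respectively the location, scale and shape parameters of the GEV distribution. The PDF of the GEV distribution is given by

$$f(x \mid \theta) = \frac{1}{\sigma}\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi - 1}\exp\left\{-\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right\}, \qquad 1 + \xi\,\frac{x-\mu}{\sigma} > 0 \tag{4}$$

Its CDF is given by

$$F(x \mid \theta) = \exp\left\{-\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right\} \tag{5}$$

The quantiles are given by

$$q_T = \mu + \frac{\sigma}{\xi}\left\{\left[-\log\left(1 - \frac{1}{T}\right)\right]^{-\xi} - 1\right\} \tag{6}$$

Because the observations are independent, the likelihood of an observed sample x = (x1, x2, …, xn) is given by

$$f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \tag{7}$$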
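As an illustration of the GEV density, distribution function, quantile function and likelihood, the following Python sketch may be helpful. It assumes the parameterization in which a positive ξ gives a heavy upper tail (the convention of the evdbayes package); the function names, the Gumbel fallback for ξ close to zero, the overflow guard and the sample values are choices made for this sketch and are not taken from the paper.

```python
import math

def _log_t(x, mu, sigma, xi):
    """Log of t(x) = [1 + xi*(x - mu)/sigma]^(-1/xi); None outside the support."""
    s = (x - mu) / sigma
    if abs(xi) < 1e-8:                 # Gumbel limit as xi -> 0
        return -s
    z = 1.0 + xi * s
    if z <= 0.0:
        return None
    return -math.log(z) / xi

def gev_logpdf(x, mu, sigma, xi):
    """Log density of one observation; -inf outside the support."""
    if sigma <= 0.0:
        return -math.inf
    lt = _log_t(x, mu, sigma, xi)
    if lt is None or lt > 700.0:       # lt > 700 would overflow exp(); density is negligible
        return -math.inf
    return -math.log(sigma) + (1.0 + xi) * lt - math.exp(lt)

def gev_cdf(x, mu, sigma, xi):
    """Non-exceedance probability F(x)."""
    lt = _log_t(x, mu, sigma, xi)
    if lt is None:                     # below the lower bound (xi > 0) or above the upper bound (xi < 0)
        return 0.0 if xi > 0.0 else 1.0
    if lt > 700.0:
        return 0.0
    return math.exp(-math.exp(lt))

def gev_quantile(T, mu, sigma, xi):
    """T-year quantile, i.e. the value with non-exceedance probability 1 - 1/T."""
    y = -math.log(1.0 - 1.0 / T)
    if abs(xi) < 1e-8:
        return mu - sigma * math.log(y)
    return mu + sigma / xi * (y ** (-xi) - 1.0)

def gev_loglik(sample, mu, sigma, xi):
    """Log likelihood of independent observations (sum of the log densities)."""
    return sum(gev_logpdf(x, mu, sigma, xi) for x in sample)

# Hypothetical annual maxima and parameter values, for illustration only.
sample = [110.0, 150.0, 95.0, 210.0, 130.0]
q100 = gev_quantile(100.0, 100.0, 30.0, 0.1)
print(q100, gev_loglik(sample, 100.0, 30.0, 0.1))
```

Working on the log scale keeps the likelihood of long records from underflowing, which matters when the function is evaluated repeatedly inside an MCMC loop.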

The specification of the prior information via π(θ) can be made in several manners, for example, (1) by attributing a probability distribution to the ratios equation image and equation image given the quantiles qT1, qT2 and qT3 [Crowder, 1992], (2) by specifying the joint distribution of the parameters ξ, μ and σ [Coles and Powell, 1996], and (3) by using a quantile and two differences of quantiles (e.g., qT3 − qT2, qT2 − qT1 and qT1) to which we attribute a probability distribution [Coles and Tawn, 1996; A. Stephenson and M. Ribatet, A user's guide to the evdbayes package (version 1.1), the Comprehensive R Archive Network, http://cran.r-project.org/, hereinafter referred to as Stephenson and Ribatet, evdbayes user's guide, 2006].

[23] The last method was selected in this study because of its simplicity and ease of implementation, since qT3 − qT2, qT2 − qT1 and qT1 are hydrological quantities readily obtained using regional multiple linear regression. The estimation of hydrological quantities using multiple linear regression is straightforward. It was used in several studies [e.g., Matalas and Gilroy, 1968; Stedinger and Tasker, 1985; Tasker and Stedinger, 1989; Thomas and Benson, 1970; GREHYS, 1996a, 1996b; Ouarda et al., 2001] and provides a fitted (normal) distribution for the explained variable. To the knowledge of the authors, there is no published work in the hydrological literature that can orient the choice of a given class of distribution for ξ, μ, σ or for quantile ratios. The use of the first two methods would thus involve much more subjective elements than the application of the well known multiple linear regression model. Indeed, the parameters could have been obtained using multiple regression on physiographical variables, but this would have been a naive approach because of the observed interdependence between the GEV parameters (Stephenson and Ribatet, evdbayes user's guide, 2006): increasing ξ or σ leads to a heavier tailed distribution, so a priori negative correlation between these parameters is expected [Coles and Tawn, 1996]. This interdependence between parameters is taken into account with fewer hyperparameters when working in the quantile space (Stephenson and Ribatet, evdbayes user's guide, 2006).

[24] In sections 3.1–3.3, more details will be provided on the regional model, prior specification with regional information, and the MCMC algorithm used to infer the posterior.

3.1. Regional Model

[25] A regional model contains two parts [GREHYS, 1996a]: (1) a method of determination of homogeneous regions and (2) a regional estimation method. Homogeneous regions are subsets of stations having similar hydrologic behavior. Several methods have been proposed in the hydrological literature to delineate homogeneous regions such as the regions of influence method [Burn, 1990], correspondence analysis and hierarchical ascending classification [GREHYS, 1996a, 1996b], canonical correlation analysis [Cavadias, 1989; Ouarda et al., 2000, 2001], and the L moments method [Hosking and Wallis, 1993]. Regional estimation can be carried out for instance with the index flood method [Dalrymple, 1960] or the direct multiple regression method [Matalas and Gilroy, 1968; Thomas and Benson, 1970].

[26] The notion of similar hydrological behavior (and thus the concept of regional homogeneity) is relatively vague since it depends on what the modeler considers as being the key interactions between hydrological variables. For instance, a region for which the logarithms of quantiles are roughly linear combinations of some physiographical variables is homogeneous from the point of view of the users of the regional log linear multiple regression model, but not necessarily for the users of the index flood regional model for which the similarity of the shape parameter at all sites is essential. The two approaches can thus lead to different conclusions from the same data set.

[27] In this paper, the first definition of homogeneity (linear relation between the logarithm of quantiles and covariates) is considered. This is important for the validation phase which will involve the generation of regional data sets. To be consistent with the latter choice, the regional estimation method that will be used is direct multiple regression. There will be no need for a regional delineation method in the validation process since the generation algorithm is designed to directly provide hydrological regions with user-defined characteristics.

3.2. Prior Specification Using the Regional Model

[28] Prior information is specified from the regional model as follows: given three quantiles qT1, qT2, qT3 such as p1 = equation image < p2 = equation image < p3 = equation image and their differences ΔqT1, ΔqT2, ΔqT3 defined by

$$\Delta q_{T_1} = q_{T_1} \tag{8}$$
$$\Delta q_{T_2} = q_{T_2} - q_{T_1} \tag{9}$$
$$\Delta q_{T_3} = q_{T_3} - q_{T_2} \tag{10}$$

The log linear model is used to describe the relationship between the hydrological quantities and physiographic variables. If we denote equation imageqTiR the regional estimation of ΔqTi, the regional regression model is given by

equation image

where

equation image

with equation image. In equation (11), MVN(xβ, Σ) stands for the multivariate normal distribution with mean vector xβ and variance-covariance matrix Σ. Ak represents the value of the kth physiographic or meteorological variable at the site of interest, βk(i) is a regression coefficient, and m is the number of physiographic variables.

[29] We assume that the errors in model (11) do not display intersite correlation but that there may be some correlation between the error series corresponding to different quantiles. Model (11) is thus a case of the classical multivariate normal distribution with independent realizations. Its location parameters as well as its variance-covariance matrix can thus be obtained using ordinary least squares. More complex procedures such as generalized least squares [Stedinger and Tasker, 1985, 1986; Tasker and Stedinger, 1989] which account for intersite correlations could have been considered. However, this would have complicated the already difficult simulation of the validation data set (see section 4). Such procedures can improve the precision of the regional model when used on real data and deserve consideration in future work.

[30] Since there is no intersite correlation, β(i) is obtained by solving the following equation with the ordinary least squares (OLS) method:

$$\log\left(\Delta q_{T_i}\right) = x\,\beta^{(i)} + \varepsilon^{(i)} \tag{12}$$

where ɛ(i) is the random error term. The elements of Σ are directly computed from the data:

equation image
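A minimal sketch of such an ordinary least squares fit, written in plain Python, is given below. It fits one regression per log quantile difference and then computes a residual variance-covariance matrix. Several elements are assumptions made for this sketch: the covariates are taken as already log-transformed, the residual covariance is divided by n − p (number of stations minus number of coefficients), and the station values and coefficients are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Least squares coefficients for y = X beta + eps via the normal equations."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)

def fit_regional(X, Y):
    """Fit one regression per column of Y and return (coefficients, residual covariance)."""
    n, p, d = len(X), len(X[0]), len(Y[0])
    betas = [ols(X, [row[i] for row in Y]) for i in range(d)]
    resid = [[Y[s][i] - sum(X[s][k] * betas[i][k] for k in range(p)) for i in range(d)]
             for s in range(n)]
    # Assumed denominator n - p; the paper's exact estimator is not reproduced here.
    Sigma = [[sum(r[i] * r[j] for r in resid) / (n - p) for j in range(d)] for i in range(d)]
    return betas, Sigma

# Hypothetical design matrix: intercept plus two log-transformed covariates at 8 stations.
log_a1 = [1.0, 2.5, 3.1, 4.2, 5.0, 5.7, 6.3, 7.4]
log_a2 = [0.2, 1.1, -0.4, 0.9, 1.6, 0.1, 2.0, 0.7]
X = [[1.0, u, v] for u, v in zip(log_a1, log_a2)]
# Noise-free responses built from known coefficients, so that the fit can be checked exactly.
B_true = [[1.0, 0.8, 0.3], [0.2, 0.7, 0.4], [-0.1, 0.75, 0.35]]
Y = [[sum(xk * bk for xk, bk in zip(row, b)) for b in B_true] for row in X]
betas, Sigma = fit_regional(X, Y)
print(betas)
```

With real data the residuals are not zero, and the off-diagonal elements of the resulting matrix capture the correlation between the errors of the three regressions.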

[31] We deduce from (11) and (12) that

equation image

where J is the Jacobian of the transformation of (ΔqT1, ΔqT2, ΔqT3) toward (μ, σ, ξ). The expression of J is derived by Stephenson and Ribatet (evdbayes user's guide, 2006):

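$$J = \frac{\sigma}{\xi^{2}}\left|\sum_{1 \le i < j \le 3} (-1)^{i+j}\,(x_i x_j)^{-\xi}\,\log\frac{x_j}{x_i}\right| \tag{15}$$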

where xi = −log (1 − pi).
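The Jacobian of the change of variables between (μ, σ, ξ) and the quantile differences can be checked numerically. The Python sketch below evaluates a closed-form expression of the kind given in the evdbayes user's guide and compares it with a finite-difference determinant. It assumes that the quantile associated with pi is μ + σ(xi^(−ξ) − 1)/ξ with xi = −log(1 − pi), so that pi is treated here as an exceedance probability; the function names and numerical values are hypothetical.

```python
import math

def quantile_differences(mu, sigma, xi, p):
    """Return (q1, q2 - q1, q3 - q2) for the three probabilities in p."""
    x = [-math.log(1.0 - pi) for pi in p]
    q = [mu + sigma * (xj ** (-xi) - 1.0) / xi for xj in x]
    return [q[0], q[1] - q[0], q[2] - q[1]]

def jacobian(sigma, xi, p):
    """Closed-form absolute Jacobian determinant of the map above."""
    x = [-math.log(1.0 - pi) for pi in p]
    total = 0.0
    for i in range(3):
        for j in range(i + 1, 3):
            # (-1)**(i + j) has the same sign with 0-based or 1-based indices
            total += (-1.0) ** (i + j) * (x[i] * x[j]) ** (-xi) * math.log(x[j] / x[i])
    return sigma / xi ** 2 * abs(total)

def numeric_jacobian(mu, sigma, xi, p, h=1e-5):
    """Absolute determinant of the finite-difference derivative matrix."""
    theta = [mu, sigma, xi]
    m = []
    for k in range(3):
        up, dn = theta[:], theta[:]
        up[k] += h
        dn[k] -= h
        fu = quantile_differences(up[0], up[1], up[2], p)
        fd = quantile_differences(dn[0], dn[1], dn[2], p)
        m.append([(a - b) / (2.0 * h) for a, b in zip(fu, fd)])
    det = (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
           - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
           + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    return abs(det)

# Hypothetical exceedance probabilities (2-, 10- and 100-year events) and parameters.
p = (0.5, 0.1, 0.01)
analytic = jacobian(30.0, 0.1, p)
numeric = numeric_jacobian(100.0, 30.0, 0.1, p)
print(analytic, numeric)
```

The Jacobian does not depend on μ because μ only shifts the first quantile, which is why the closed-form function takes only σ and ξ.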

[32] For the comparison with the empirical Bayesian estimator, E(qTR) and Var(qTR) are also estimated from the solutions of the following equation:

equation image

The bias introduced by the logarithmic transformation in (16) is also corrected:

equation image

where bir and bia are the relative and absolute biases, and σTiR the quantile estimation variance. The relative and absolute biases are estimated by ordinary least squares using observed values of equation imageTi and those simulated with equation (16).

3.3. Inference on Parameters and Quantiles

[33] Inference on parameters and quantiles was carried out with the Metropolis-Hastings algorithm following Stephenson and Ribatet (evdbayes user's guide, 2006). The goal of the Metropolis-Hastings algorithm is to construct a Markov chain for which the equilibrium distribution is the posterior defined in (3). The generic Metropolis-Hastings algorithm can be written as follows.

  1. Start with some initial parameter value θ0 and set i to zero.
  2. Given the parameter vector θi, draw a candidate value θ* from some proposal distribution.
  3. Compute the ratio R of the posterior density at the candidate and current points, R = P(θ* | x)/P(θi | x) (this form assumes a symmetric proposal distribution).
  4. With probability min(R, 1), accept the candidate and set θi+1 = θ*; otherwise set θi+1 = θi.
  5. Set i = i + 1 and return to step 2.
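The generic steps above can be sketched in a few lines of Python for a one-dimensional target. This is only an illustration: it assumes a symmetric normal random-walk proposal (so that the acceptance ratio reduces to a ratio of posterior densities) and uses a standard normal density as a stand-in for the posterior; the function name, step size and chain length are hypothetical choices.

```python
import math
import random

def metropolis(log_post, theta0, step, n_iter, rng):
    """Random-walk Metropolis sampler working with log posterior densities."""
    theta, lp = theta0, log_post(theta0)
    chain, n_accept = [], 0
    for _ in range(n_iter):
        cand = theta + rng.gauss(0.0, step)          # step 2: draw a candidate
        lp_cand = log_post(cand)
        # steps 3 and 4: accept with probability min(R, 1), evaluated on the log scale;
        # 1 - random() lies in (0, 1], which keeps the logarithm finite
        if math.log(1.0 - rng.random()) < lp_cand - lp:
            theta, lp = cand, lp_cand
            n_accept += 1
        chain.append(theta)                          # step 5: move to the next iteration
    return chain, n_accept / n_iter

rng = random.Random(0)
# Stand-in target: standard normal log density (up to an additive constant).
chain, acc_rate = metropolis(lambda t: -0.5 * t * t, theta0=3.0, step=2.4,
                             n_iter=20000, rng=rng)
kept = chain[2000:]                                  # discard a burn-in period
mean = sum(kept) / len(kept)
var = sum((t - mean) ** 2 for t in kept) / (len(kept) - 1)
print(acc_rate, mean, var)
```

Only the ratio of posterior densities is needed, so the normalizing constant in the denominator of Bayes' theorem never has to be computed.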

[34] Many versions of this algorithm have been proposed depending on the proposal distribution and the order in which the parameters are updated. In this study, the three parameters of the GEV distribution are updated successively with normal proposal distributions for μ, log(σ) and ξ as proposed by Stephenson and Ribatet (evdbayes user's guide, 2006). The steps to generate the parameters at step i + 1 (i.e., μi+1, σi+1 and ξi+1) given μi, σi and ξi are the following.

  1. Propose μ* ∼ N(μi, σμ) where N represents the normal distribution.
  2. Set Δ = P(μ*, σi, ξi | x)/P(μi, σi, ξi | x).
  3. Set μi+1 = μ* with probability min {1, Δ}, else set μi+1 = μi.
  4. Propose σ* ∼ LN(σi, σσ) where LN represents the lognormal distribution.
  5. Set Δ = [P(μi+1, σ*, ξi | x) σ*]/[P(μi+1, σi, ξi | x) σi].
  6. Set σi+1 = σ* with probability min {1, Δ}, else set σi+1 = σi.
  7. Propose ξ* ∼ N(ξi, σξ).
  8. Set Δ = P(μi+1, σi+1, ξ* | x)/P(μi+1, σi+1, ξi | x).
  9. Set ξi+1 = ξ* with probability min {1, Δ}, else set ξi+1 = ξi.
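The nine steps above can be sketched in Python as follows. Several elements are assumptions made for this illustration: LN(σi, σσ) is read as a normal random walk on log(σ), which brings a correction factor σ*/σi into the acceptance ratio for σ; a flat prior is used as a placeholder for the regional prior; and the sample is simulated from hypothetical parameter values. The helper names are not taken from the paper or from evdbayes.

```python
import math
import random

def gev_logpdf(x, mu, sigma, xi):
    """GEV log density (xi > 0 means a heavy upper tail); -inf outside the support."""
    if sigma <= 0.0:
        return -math.inf
    s = (x - mu) / sigma
    if abs(xi) < 1e-8:                     # Gumbel limit
        lt = -s
    else:
        z = 1.0 + xi * s
        if z <= 0.0:
            return -math.inf
        lt = -math.log(z) / xi
    if lt > 700.0:                         # guard against overflow in exp()
        return -math.inf
    return -math.log(sigma) + (1.0 + xi) * lt - math.exp(lt)

def log_posterior(sample, mu, sigma, xi, log_prior):
    return log_prior(mu, sigma, xi) + sum(gev_logpdf(x, mu, sigma, xi) for x in sample)

def gev_mh(sample, init, steps, n_iter, log_prior, rng):
    """Update mu, sigma and xi one at a time, as in steps 1 to 9."""
    mu, sigma, xi = init
    s_mu, s_sigma, s_xi = steps
    lp = log_posterior(sample, mu, sigma, xi, log_prior)

    def accept(log_delta):
        return math.log(1.0 - rng.random()) < log_delta

    chain = []
    for _ in range(n_iter):
        cand = rng.gauss(mu, s_mu)                               # steps 1 to 3
        lp_c = log_posterior(sample, cand, sigma, xi, log_prior)
        if accept(lp_c - lp):
            mu, lp = cand, lp_c
        cand = sigma * math.exp(rng.gauss(0.0, s_sigma))         # steps 4 to 6
        lp_c = log_posterior(sample, mu, cand, xi, log_prior)
        if accept(lp_c - lp + math.log(cand / sigma)):           # includes sigma*/sigma_i
            sigma, lp = cand, lp_c
        cand = rng.gauss(xi, s_xi)                               # steps 7 to 9
        lp_c = log_posterior(sample, mu, sigma, cand, log_prior)
        if accept(lp_c - lp):
            xi, lp = cand, lp_c
        chain.append((mu, sigma, xi))
    return chain

rng = random.Random(1)
true_mu, true_sigma, true_xi = 100.0, 30.0, 0.1    # hypothetical 'true' parameters
sample = []
for _ in range(60):                                 # inverse-CDF simulation of 60 annual maxima
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    sample.append(true_mu + true_sigma / true_xi * ((-math.log(u)) ** (-true_xi) - 1.0))

flat_prior = lambda mu, sigma, xi: 0.0              # placeholder, not the regional prior
m = sum(sample) / len(sample)
sd = math.sqrt(sum((x - m) ** 2 for x in sample) / (len(sample) - 1))
chain = gev_mh(sample, init=(m, sd, 0.1), steps=(5.0, 0.15, 0.1),
               n_iter=4000, log_prior=flat_prior, rng=rng)
kept = chain[1000:]                                 # discard a burn-in period
post_mean_mu = sum(c[0] for c in kept) / len(kept)
print(post_mean_mu)
```

Replacing `flat_prior` with the log of a regional prior density would turn this sketch into a combination of local and regional information; the step sizes would then need retuning.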

[35] The variance parameters σμ, σσ and σξ of the proposal distributions are tuned using a trial-and-error method to improve convergence speed and acceptance rates. The Geweke [1992] test was chosen to assess the convergence of the MCMC chain because of its ease of interpretation. It is based on a test of equality of the means of the first part and the last part of a Markov chain.
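The idea behind this convergence check can be illustrated with a simplified Python sketch that compares the mean of the first 10% of a chain with the mean of its last 50%. This is a deliberate simplification: the actual Geweke test estimates the variances of the two means from the spectral density at frequency zero to account for autocorrelation, whereas the sketch below uses plain sample variances and is therefore only adequate for weakly correlated (for example thinned) chains. The window fractions and test chains are hypothetical.

```python
import math
import random
import statistics

def geweke_z(chain, first=0.1, last=0.5):
    """Z-score for the difference between the means of the early and late parts of a chain."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[n - int(last * n):]
    # Naive standard error that ignores autocorrelation (simplifying assumption).
    se2 = statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)

rng = random.Random(42)
stationary = [rng.gauss(0.0, 1.0) for _ in range(2000)]
drifting = [5.0 * i / 2000 + rng.gauss(0.0, 1.0) for i in range(2000)]
z_stationary = geweke_z(stationary)
z_drifting = geweke_z(drifting)
print(z_stationary, z_drifting)
```

A z-score far outside the range of a standard normal variable, as for the drifting chain, suggests that the early part of the chain has not yet reached the equilibrium distribution.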

4. Validation Methodology

[36] Simulation is an attractive way to validate the proposed methodology of combination of local and regional information. However, generating regional data is not a trivial task. It involves reproducing (1) at-site frequency distributions, (2) the relation between at-site flood features and explanatory physiographical and meteorological variables, (3) the dependence between the various explanatory variables at a given site, (4) the relation between explanatory variables at different sites, and (5) the regional heterogeneity characteristics. Unfortunately, most of these aspects are still not well understood, and even if they were, it would be hard to generate data sets which respect all the above-mentioned constraints. Nevertheless, a simulation study was performed in which an effort was made to preserve as many as possible of the elements mentioned above. This simulation study was performed in four steps: (1) define the data structures to be generated, (2) set up a generation procedure which respects as many of the above-mentioned constraints as possible, (3) generate the data sets, and (4) evaluate the studied parameter and quantile estimation methods on the data sets. All these steps will be described in detail in the following sections.

[37] It is obvious that the performance of the combination method will be influenced by the size of local data series as well as the bias and precision of the regional model. Another intuitive factor is the number of stations within the region, but its effects are not direct: it plays a role through its linkage with the bias and precision of the regional model. For this reason, several cases were considered in the validation study, corresponding to different values of the bias and precision of the regional model. For each of these cases, the performance of the studied combination methodology was assessed for different lengths of the local data series. The data structure for each of these cases is what we call a ‘regional data set’ and is described in section 4.1. The parameters used to generate each data set are provided in section 5.2.

4.1. Structure of a Regional Data Set

[38] The data structure for each case has three levels corresponding to (1) the station level, (2) the hydrological region level, and (3) the regional data set level which is a collection of regions on which the studied methods will be evaluated by Monte Carlo simulation. The lowest level corresponds to the hydrological station scale. Each generated element at this level is represented by a set of physiographical variables and a variable length observation record. The generated elements at the second level represent hydrological regions and are collections of stations among which one is designated as the target station. The length of the generated record at each station is randomly selected between 15 and 70 years, except at the target site where 80-year series are generated. At the third level, several regions are generated using the same bias and the same variance covariance matrix of the regional model.

4.2. Generation Procedure

[39] The procedure consists essentially in generating a triplet (qT1, qT2, qT3) of ‘real’ quantiles at each site, and then using them to compute the GEV parameters using the procedure described in Appendix A. This procedure takes advantage of the fact that, in the specific case of the GEV distribution, given the triplet of return periods (T1, T2, T3), there is a bijection between the triplet of parameters (μ, σ, ξ) and the triplet of quantiles (qT1, qT2, qT3). At nontarget sites, the quantiles (qT1, qT2, qT3) are generated using the following equation:

equation image

At target sites, (qT1, qT2, qT3) are generated following:

equation image

where br is a bias parameter that ensures that the log linear regional model would be biased if used to estimate quantiles at the target station. The reason for introducing br is that the regional model is always biased to some extent at the target site, since it is fitted with data from other sites. In a truly homogeneous region, the bias is null. In practice, there is always some moderate heterogeneity in hydrological regions and br was introduced to represent real life cases. The magnitudes of the elements of Σ control the precision of the regional model: the lower the elements on the diagonal of Σ, the more precise the regional model would be. Note that in equation (19), the relative bias introduced through br affects all three quantiles, i.e., if the regional model overestimates qT1, then it will overestimate qT2 and qT3 in the same proportions. It seems reasonable that the relative errors of the regional model for the three quantiles would be the same. This constraint has an important implication: it preserves the ratios of quantile differences, thus the shape parameter (see Appendix A).
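The bijection between quantiles and GEV parameters, and the fact that a common relative bias leaves the shape parameter unchanged, can be illustrated with a short Python sketch. It is not the procedure of Appendix A, which is not reproduced here, but one possible route under stated assumptions: the ratio of quantile differences depends only on ξ, so ξ is found by bisection on that ratio, after which σ and μ follow in closed form. The function names, search interval and numerical values are hypothetical.

```python
import math

def gev_quantile(T, mu, sigma, xi):
    """T-year quantile for xi != 0 (positive xi means a heavy upper tail)."""
    y = -math.log(1.0 - 1.0 / T)
    return mu + sigma / xi * (y ** (-xi) - 1.0)

def params_from_quantiles(q, T, lo=-3.0, hi=3.0, n_iter=200):
    """Recover (mu, sigma, xi) from three quantiles with return periods T1 < T2 < T3."""
    c = [-math.log(-math.log(1.0 - 1.0 / t)) for t in T]   # Gumbel reduced variates
    d1, d2 = c[1] - c[0], c[2] - c[1]
    target = (q[2] - q[1]) / (q[1] - q[0])

    def ratio(xi):
        # (q3 - q2)/(q2 - q1) as a function of xi only; increasing in xi
        if xi == 0.0:
            return d2 / d1
        return math.expm1(xi * d2) / (-math.expm1(-xi * d1))

    if not ratio(lo) <= target <= ratio(hi):
        raise ValueError("shape parameter outside the search interval")
    for _ in range(n_iter):                                 # bisection on xi
        mid = 0.5 * (lo + hi)
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    xi = 0.5 * (lo + hi)
    if xi == 0.0:                                           # Gumbel case
        sigma = (q[1] - q[0]) / d1
        mu = q[0] - sigma * c[0]
    else:
        sigma = (q[1] - q[0]) * xi / (math.exp(xi * c[0]) * math.expm1(xi * d1))
        mu = q[0] - sigma * math.expm1(xi * c[0]) / xi
    return mu, sigma, xi

T = (2.0, 10.0, 100.0)
q = [gev_quantile(t, 100.0, 30.0, 0.1) for t in T]          # hypothetical 'true' quantiles
mu, sigma, xi = params_from_quantiles(q, T)
# Same relative bias (+20%) applied to the three quantiles.
mu_b, sigma_b, xi_b = params_from_quantiles([1.2 * v for v in q], T)
print((mu, sigma, xi), (mu_b, sigma_b, xi_b))
```

In this example the biased quantiles lead to location and scale parameters multiplied by 1.2, while the recovered shape parameter is the same as for the unbiased quantiles.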

[40] To ensure that the simulations reproduce the complex relationships in the data set, we opted to use real field data for the vector x of explanatory variables in equation (18). The field data should come from a known hydrological region, each column of x representing a station inside that region. The vector of regression parameters β is computed from the same data set. The variance covariance matrix is computed using the following equation:

equation image

In equation (20), α is a parameter that tunes the quality of the regional regression, and rTi_Tj is the correlation coefficient between the regional estimates of qTi and qTj from the field data.

[41] Even though the same vector x and the same vector β are used for each generated region, the ‘true’ quantiles and parameters differ from one region to another because the quantiles (and thus the parameters) are linked to the realizations of a random process (equations (18) and (19)). Each generated region is thus different from the others. Once x and β are obtained, the simulation study proceeds using the following algorithm.

1. Choose the number M of regions to generate (the number of stations in a region is given by the number of rows of x, plus one).
2. Choose the values of α and br to set the characteristics of the regional model.
3. For each i ∈ {1, ..., M}, generate the ith region following these steps.
3a. Choose a target station t ∈ {1, ..., n}.
3b. For each k ∈ {1, ..., t − 1, t + 1, ..., n}, generate (qT1, qT2, qT3) at the kth station using equation (18).
3c. Generate (qT1, qT2, qT3) at the target station t using equation (19).
3d. For each k ∈ {1, ..., n}, compute the ‘true’ parameters μki, σki and ξki using the procedure given in Appendix A.
3e. For each k ∈ {1, ..., t − 1, t + 1, ..., n}, pick a random number l between 15 and 70 and generate an l-year GEV sample using the simulated parameters μki, σki and ξki.
3f. Generate an 80-year GEV sample at the target site using μti, σti and ξti.
4. For each l ∈ {5, 10, 20, 40, 80}, consider the first l generated values at the target site as the recorded streamflows. Apply the different parameter and quantile estimation methods presented in this paper to the regional data sets, and compute the performance criteria as a function of l.
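Steps 3f and 4 can be sketched as follows. The code is illustrative only: it assumes the GEV parameterization F(x) = exp{−[1 + ξ(x − μ)/σ]^(−1/ξ)}, draws the sample by inverse transform, and uses hypothetical parameter values; `gev_sample` is a name introduced here, not a routine from the paper.

```python
import math
import random


def gev_sample(mu, sigma, xi, n, rng):
    """Draw n annual maxima from GEV(mu, sigma, xi) by inverse transform."""
    sample = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:                      # avoid log(0)
            u = rng.random()
        y = -math.log(u)                     # unit exponential variate
        if abs(xi) < 1e-12:
            x = mu - sigma * math.log(y)     # Gumbel limit
        else:
            x = mu + sigma * (y ** (-xi) - 1.0) / xi
        sample.append(x)
    return sample


rng = random.Random(0)                       # fixed seed for reproducibility

# Step 3f: 80-year record at the target site (hypothetical parameters).
full_record = gev_sample(100.0, 30.0, 0.1, 80, rng)

# Step 4: the first l values play the role of the observed record.
records = {l: full_record[:l] for l in (5, 10, 20, 40, 80)}
```

Because the shorter records are prefixes of the same 80-year series, differences in performance across l reflect record length rather than a new random draw.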

4.3. Performance Measures

[42] The mode (Mo), the median (Md) and the mean (M) of the posterior probability distribution of quantiles and parameters obtained by the parametric Bayesian method will be used as point estimators, along with the empirical Bayesian estimator (EB), the regional estimator (R) and the local estimator (L). The performance of these six estimators will be assessed using the standard deviation (s), the bias (b) and the root-mean-square error (RMSE) defined by

equation image
equation image
equation image

where ns represents the number of samples, θ the real value of the variable (quantile or parameter), θ̂i its ith estimate, and μθ̂ = (1/ns) Σ θ̂i the mean of the estimates. We shall also check whether the parameters μ, σ and ξ obtained with the complete Bayesian method are closer to the ‘real’ parameters than those estimated with the short series of data.
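Since the three defining equations are not reproduced in this version of the text, the sketch below assumes the standard definitions with a 1/ns normalization. This choice is an assumption, but it is consistent with the tabulated results, where RMSE² ≈ s² + b² (for example, 25.65² + 2.891² ≈ 25.81² for the local estimator of μ at l = 5 on the first data set in Table 2). The data in the example are hypothetical.

```python
import math


def performance(estimates, theta):
    """Return (s, b, rmse) for a list of estimates of the true value theta."""
    ns = len(estimates)
    mean = sum(estimates) / ns
    s = math.sqrt(sum((e - mean) ** 2 for e in estimates) / ns)   # standard deviation
    b = mean - theta                                              # bias
    rmse = math.sqrt(sum((e - theta) ** 2 for e in estimates) / ns)
    return s, b, rmse


# Hypothetical estimates of a quantity whose true value is 10.
s, b, rmse = performance([9.0, 11.0, 12.0, 12.0], theta=10.0)

# With the 1/ns normalization the decomposition rmse^2 = s^2 + b^2 is exact.
print(s ** 2, b ** 2, rmse ** 2)
```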

5. Application

[43] As mentioned in section 4.2, a real data set was required to extract realistic physiographical variables and compute reliable parameters for equations (18) and (19). The application consisted in selecting a hydrologic region, extracting physiographical variables, generating the remaining characteristics and then applying successively all the studied parameter and quantile estimation methodologies.

5.1. Field Data

[44] The data was extracted from a database of 168 hydrological stations provided by the Quebec Ministry of the Environment (Province of Quebec, Canada) and for which the following physiographic and meteorological variables were available: the catchment area, the percentage of the area covered by lakes, the mean slope of the catchment, the mean annual precipitation and the average annual accumulation of degree-days below zero.

[45] As the province of Quebec is commonly divided into thirteen hydrographic regions (Figure 1), a natural choice was the hydrographic region which contains the largest number of stations among those listed in the above mentioned database. Hydrographic region 05 was hence selected with 32 stations. These stations are illustrated in Figure 1, and their characteristics are listed in Table 1.

Figure 1.

Hydrographic regions in the province of Quebec and hydrometric stations of the region 05.

Table 1. Characteristics of the Stations of the Hydrographic Region 05 of the Province of Quebec, Canada
Parameter | Mean | Standard Deviation
q10, m³/s | 243.34 | 219.26
q100, m³/s | 333.40 | 300.83
q1000, m³/s | 425.09 | 400.17
Catchment area, km² | 1114.49 | 1160.24
Mean slope of the catchment, m/km | 2.88 | 1.01
Percentage of the area covered by lakes, % | 3.27 | 2.48
Mean annual solid and liquid precipitation, mm | 1182.84 | 217.62
Average annual accumulation of degree-days below zero | 1481.29 | 173.99
Matrix of regression parameters (including the intercept parameter) | equation image | -

5.2. Characteristics of the Generated Regional Data Sets

[46] Three regional data sets corresponding to different characteristics of the log linear regional relationship were generated. Each regional data set contains 1000 regions (M = 1000). The number of stations in a given region is the same as in the Quebec 05 hydrographic region, from which the physiographic data are borrowed. The first data set is generated using an unbiased linear relationship between the explanatory variables and the logarithm of the quantiles (br = 0) and a very low variance of the error component (α = 0.10). The second data set also uses an unbiased linear relationship between the explanatory variables and the logarithm of the quantiles, but with a larger variance (α = 0.50). The third data set is similar to the first one, but a bias term is introduced at target sites (br = 100%). To provide an idea of the range of values that have been generated, the local estimations of μ, σ and ξ as well as the regional estimations of qT1, qT2 and qT3 were computed at the target site in each region and in each regional data set. The histograms of the relative error of the regional estimation of qTi, i = 1, ..., 3 are given in Figures 2a, 2b, and 2c for the first generated data set. The histograms of the local estimations of μ, σ and ξ are provided in Figures 2d, 2e, and 2f. Similar histograms for the second and third regional data sets are provided in Figures 3 and 4, respectively. Note that none of these histograms represents a normal distribution because of the logarithmic transformation in equations (18) and (19).

Figure 2.

Histograms of relative error on quantiles q10, q100, and q1000 and histograms of the GEV parameters at target sites for the first set of regions (no bias on quantiles, variance factor of the regional regression equal to 10%): (a) q10, (b) q100, (c) q1000, (d) μ, (e) σ, and (f) ξ.

Figure 3.

Histograms of relative error on quantiles q10, q100, and q1000 and histograms of the GEV parameters at target sites for the second simulated set of regions (no bias on quantiles, variance factor of the regional regression equal to 50%): (a) q10, (b) q100, (c) q1000, (d) μ, (e) σ, and (f) ξ.

Figure 4.

Histograms of relative error on quantiles q10, q100, and q1000 and histograms of the GEV parameters at target sites for the third simulated set of regions (100% positive relative bias on quantiles, variance factor of the regional regression equal to 10%): (a) q10, (b) q100, (c) q1000, (d) μ, (e) σ, and (f) ξ.

[47] The two values of α (0.1 and 0.5) are consistent with observed values of the regional model error variance in Quebec hydrographic regions. For instance, the regional model error variance for region 05 of Quebec was 0.0584 for q10, 0.0814 for q100 and 0.1108 for q1000. If we consider the set of all the hydrographic regions, the regional model error variance ranges from 0.0176 to 0.0951 for q10, from 0.0361 to 0.1231 for q100, and from 0.0534 to 0.2368 for q1000. The lowest value of α (0.1) is thus in the range of observed values and can be considered as representing a homogeneous region. The upper value of α (0.5) is well above the observed values and thus represents a heterogeneous region.

5.3. MCMC Runs and Convergence Assessment

[48] For each region, 5000 iterations of the MCMC algorithm were first run. Then, the Geweke [1992] convergence test was applied every 10000 iterations on the MCMC chains until convergence was successfully assessed. Furthermore, the parameters of the proposal distribution were dynamically changed in the course of the run to obtain an acceptance rate between 0.40 and 0.80, i.e., if the acceptance rate of the last 100 iterations was less than 0.4, the variance of the proposal distributions was reduced. Conversely, if the acceptance rate of the last 100 iterations was greater than 0.8, the variance of the proposal distributions was increased to better explore the parameter space. Examples of MCMC chains for the three quantiles q10, q100 and q1000 as well as for parameters μ, σ and ξ are provided in Figure 5. They are computed with the first region of the first generated data set, and it is easy to check visually that the chains have reached their stationary distributions. All the other MCMC runs displayed similar characteristics. Once convergence was successfully assessed for a given run, the last 10000 iterations were used to make the inference on the parameters μ, σ and ξ as well as the quantiles q10, q100 and q1000.
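The acceptance-rate tuning described above can be sketched with a one-dimensional random-walk Metropolis sampler. This is a minimal illustration, not the sampler used in the study: the target (a standard normal log density), the multiplicative factors 0.8 and 1.25, and the function name `adaptive_metropolis` are assumptions, and the Geweke test is not implemented here.

```python
import math
import random


def adaptive_metropolis(log_post, x0, n_iter, scale=1.0, window=100, seed=0):
    """Random-walk Metropolis; the proposal scale is retuned every `window` iterations."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    chain, accepted = [], 0
    for i in range(1, n_iter + 1):
        cand = x + rng.gauss(0.0, scale)
        lp_cand = log_post(cand)
        # Metropolis rule: accept with probability min(1, exp(lp_cand - lp)).
        if math.log(1.0 - rng.random()) < lp_cand - lp:
            x, lp = cand, lp_cand
            accepted += 1
        chain.append(x)
        if i % window == 0:
            rate = accepted / window
            if rate < 0.4:
                scale *= 0.8      # too many rejections: shrink the proposal
            elif rate > 0.8:
                scale *= 1.25     # too many acceptances: widen the proposal
            accepted = 0
    return chain, scale


# Toy target: standard normal log density (up to an additive constant).
chain, scale = adaptive_metropolis(lambda x: -0.5 * x * x, x0=0.0, n_iter=20000)
print(len(chain), scale)
```

In an actual application, `log_post` would be the log of the prior times the GEV likelihood, and only the iterations retained after the convergence check would be used for inference.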

Figure 5.

Examples of MCMC chains and real values of quantiles and parameters for first region of the first generated data set: (a) q10, (b) q100, (c) q1000, (d) μ, (e) σ, and (f) ξ.

6. Results and Discussion

[49] Once the simulations were performed, the effects of regional homogeneity and the effects of the length of the local data series on parameters and quantiles estimation were investigated.

6.1. Effects on Parameter Estimation

[50] The RMSE of the M, Md and Mo parameter estimators is plotted as a function of the length of the local data series and compared in Figures 6–8 to the RMSE of the local estimator (L) for the three regional data sets. The RMSE, bias and standard deviation of these estimators are presented in Tables 2–4.

Figure 6.

RMSE of the estimators of μ according to the length of local data series: (a) first generated data set, (b) second generated data set, and (c) third generated data set.

Figure 7.

RMSE of the estimators of σ according to the length of local data series: (a) first generated data set, (b) second generated data set, and (c) third generated data set.

Figure 8.

RMSE of the estimators of ξ according to the length of local data series: (a) first generated data set, (b) second generated data set, and (c) third generated data set.

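Table 2. RMSE, Bias, and Standard Deviation of the Estimators of μ
Measure | l | First Generated Data Set |  |  |  | Second Generated Data Set |  |  |  | Third Generated Data Set |  |  | 
 |  | M | Md | Mo | L | M | Md | Mo | L | M | Md | Mo | L
RMSE | 5 | 17.86 | 17.45a | 18.76 | 25.81 | 34.51 | 33.50a | 37.40 | 42.27 | 35.35 | 37.36 | 25.93 | 14.92a
RMSE | 10 | 14.41 | 14.38a | 15.96 | 19.15 | 23.04 | 23.00a | 24.30 | 26.03 | 17.90 | 17.38 | 17.91 | 10.01a
RMSE | 20 | 11.21 | 11.21a | 12.29 | 13.96 | 15.61a | 15.66 | 16.90 | 16.85 | 10.96 | 10.81 | 11.07 | 7.05a
RMSE | 40 | 7.33 | 7.33a | 8.32 | 8.84 | 9.25a | 9.25 | 10.05 | 9.82 | 6.35 | 6.29 | 6.30 | 4.82a
RMSE | 60 | 6.22a | 6.23 | 7.10 | 7.05 | 7.84 | 7.89 | 8.87 | 7.74a | 4.84 | 4.80 | 5.06 | 4.16a
RMSE | 80 | 5.52 | 5.49a | 6.17 | 6.12 | 6.35 | 6.32a | 7.12 | 6.54 | 3.90 | 3.86 | 4.29 | 3.64a
Bias | 5 | 0.107a | 0.671 | 1.074 | 2.891 | −0.071a | 2.315 | 2.524 | 7.528 | 18.180 | 17.837 | 17.421 | 0.579a
Bias | 10 | 0.694a | 0.901 | 1.488 | 1.835 | 1.265a | 1.402 | 1.294 | 3.336 | 12.027 | 11.783 | 11.292 | 0.487a
Bias | 20 | 0.632a | 0.729 | 1.023 | 0.945 | 0.472 | 0.419a | 0.551 | 1.375 | 6.829 | 6.723 | 6.431 | 0.189a
Bias | 40 | 0.183a | 0.201 | 0.323 | 0.188 | −0.056a | −0.084 | −0.265 | 0.522 | 3.585 | 3.521 | 3.259 | 0.003a
Bias | 60 | 0.050 | 0.046 | −0.116 | 0.034a | −0.348 | −0.386 | −0.543 | 0.022a | 2.303 | 2.260 | 2.189 | −0.078a
Bias | 80 | 0.020 | 0.021 | −0.047 | 0.003a | −0.032a | −0.047 | −0.365 | 0.240 | 1.594 | 1.564 | 1.508 | −0.172a
Standard deviation | 5 | 17.86 | 17.44a | 18.73 | 25.65 | 34.51 | 33.42a | 37.31 | 41.59 | 30.32 | 32.83 | 19.20 | 14.91a
Standard deviation | 10 | 14.39 | 14.35a | 15.90 | 19.06 | 23.00 | 22.96a | 24.27 | 25.81 | 13.26 | 12.77 | 13.90 | 10.00a
Standard deviation | 20 | 11.20 | 11.19a | 12.25 | 13.93 | 15.60a | 15.66 | 16.90 | 16.79 | 8.57 | 8.47 | 9.01 | 7.05a
Standard deviation | 40 | 7.33 | 7.32a | 8.31 | 8.83 | 9.25a | 9.25 | 10.05 | 9.80 | 5.25 | 5.21 | 5.39 | 4.82a
Standard deviation | 60 | 6.22a | 6.23 | 7.09 | 7.05 | 7.83 | 7.88 | 8.85 | 7.74a | 4.25 | 4.24 | 4.56 | 4.16a
Standard deviation | 80 | 5.52 | 5.49a | 6.17 | 6.12 | 6.35 | 6.32a | 7.11 | 6.54 | 3.56 | 3.53a | 4.02 | 3.64
a Smallest value for a given data set and a given length of the local data series.
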
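Table 3. RMSE, Bias, and Standard Deviation of the Estimators of σ
Measure | l | First Generated Data Set |  |  |  | Second Generated Data Set |  |  |  | Third Generated Data Set |  |  | 
 |  | M | Md | Mo | L | M | Md | Mo | L | M | Md | Mo | L
RMSE | 5 | 8.58 | 8.26a | 9.87 | 23.43 | 26.21 | 25.21a | 30.08 | 30.00 | 49.79 | 56.73 | 17.59 | 12.49a
RMSE | 10 | 7.21 | 7.09a | 8.18 | 15.89 | 16.35a | 16.62 | 18.55 | 17.92 | 13.89 | 12.73 | 12.92 | 7.50a
RMSE | 20 | 6.36 | 6.30a | 7.12 | 10.74 | 11.86 | 12.09 | 13.16 | 11.84a | 8.91 | 8.39 | 8.59 | 5.06a
RMSE | 40 | 5.17 | 5.13a | 5.55 | 7.54 | 8.01 | 7.99a | 9.85 | 8.57 | 5.39 | 5.15 | 5.29 | 3.59a
RMSE | 60 | 4.51 | 4.50a | 5.12 | 5.92 | 6.01 | 5.92a | 6.73 | 6.53 | 4.02 | 3.87 | 3.93 | 2.95a
RMSE | 80 | 4.03 | 4.02a | 4.58 | 5.02 | 5.04 | 5.02a | 5.77 | 5.46 | 3.34 | 3.23 | 3.50 | 2.64a
Bias | 5 | 0.859 | −0.001a | −0.156 | −1.837 | 1.634a | −3.089 | −6.712 | −2.185 | 15.090 | 14.079 | 11.899 | −1.648a
Bias | 10 | 0.546 | −0.015a | −0.255 | −0.534 | −0.566a | −2.383 | −4.056 | −1.682 | 9.840 | 9.032 | 8.696 | −0.081a
Bias | 20 | 0.381 | −0.028a | −0.358 | −0.136 | −0.622a | −1.505 | −1.982 | −1.256 | 6.033 | 5.610 | 5.364 | −0.146a
Bias | 40 | 0.286 | 0.018 | −0.247 | 0.005a | 0.190 | −0.225 | −0.152 | 0.051a | 3.372 | 3.150 | 3.010 | −0.091a
Bias | 60 | 0.178 | −0.031 | −0.120 | 0.012a | −0.033a | −0.331 | −0.470 | −0.121 | 2.284 | 2.134 | 1.920 | −0.064a
Bias | 80 | 0.073 | −0.085 | −0.223 | −0.049a | −0.015a | −0.232 | −0.148 | −0.105 | 1.797 | 1.686 | 1.632 | 0.020a
Standard deviation | 5 | 8.53 | 8.26a | 9.86 | 23.35 | 26.16 | 25.02a | 29.32 | 29.92 | 47.45 | 54.95 | 12.96 | 12.39a
Standard deviation | 10 | 7.19 | 7.09a | 8.18 | 15.88 | 16.34a | 16.45 | 18.10 | 17.84 | 9.80 | 8.97 | 9.55 | 7.50a
Standard deviation | 20 | 6.35 | 6.30a | 7.11 | 10.74 | 11.84 | 12.00 | 13.01 | 11.78a | 6.56 | 6.24 | 6.71 | 5.05a
Standard deviation | 40 | 5.16 | 5.13a | 5.55 | 7.54 | 8.01 | 7.99a | 9.85 | 8.57 | 4.20 | 4.07 | 4.35 | 3.59a
Standard deviation | 60 | 4.51 | 4.50a | 5.12 | 5.92 | 6.01 | 5.91a | 6.71 | 6.53 | 3.31 | 3.23 | 3.43 | 2.95a
Standard deviation | 80 | 4.03 | 4.02a | 4.57 | 5.02 | 5.04 | 5.02a | 5.77 | 5.46 | 2.81 | 2.76 | 3.09 | 2.64a
a Smallest value for a given data set and a given length of the local data series.
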
Table 4. RMSE, Bias, and Standard Deviation of the Estimators of ξ
Measure | l | First Generated Data Set |  |  |  | Second Generated Data Set |  |  |  | Third Generated Data Set |  |  | 
 |  | M | Md | Mo | L | M | Md | Mo | L | M | Md | Mo | L
RMSE | 5 | 2.91E-2 | 2.90E-2a | 3.31E-2 | 54.78E-2 | 8.66E-2 | 8.54E-2a | 9.63E-2 | 53.17E-2 | 7.82E-2 | 8.44E-2 | 7.63E-2a | 52.60E-2
RMSE | 10 | 2.68E-2a | 2.69E-2 | 3.02E-2 | 29.33E-2 | 6.96E-2a | 6.98E-2 | 7.81E-2 | 30.29E-2 | 3.68E-2a | 3.71E-2 | 3.97E-2 | 28.71E-2
RMSE | 20 | 2.56E-2a | 2.56E-2 | 2.94E-2 | 17.54E-2 | 6.27E-2a | 6.28E-2 | 7.06E-2 | 19.58E-2 | 2.39E-2a | 2.42E-2 | 2.75E-2 | 18.62E-2
RMSE | 40 | 2.37E-2 | 2.37E-2a | 2.76E-2 | 11.97E-2 | 5.75E-2a | 5.76E-2 | 6.54E-2 | 13.56E-2 | 2.52E-2a | 2.55E-2 | 2.94E-2 | 12.33E-2
RMSE | 60 | 2.30E-2 | 2.30E-2a | 2.60E-2 | 9.56E-2 | 5.38E-2a | 5.39E-2 | 5.79E-2 | 10.73E-2 | 2.52E-2a | 2.55E-2 | 2.82E-2 | 9.77E-2
RMSE | 80 | 2.27E-2 | 2.27E-2a | 2.62E-2 | 8.20E-2 | 4.94E-2a | 4.95E-2 | 5.55E-2 | 9.06E-2 | 2.48E-2a | 2.51E-2 | 2.85E-2 | 8.33E-2
Bias | 5 | −22.18E-04 | −14.63E-04 | −11.50E-04a | −11.53E-02 | 33.48E-04a | 70.26E-04 | 1.66E-02 | −13.05E-02 | −11.77E-04 | −8.45E-04a | 12.27E-04 | −8.73E-02
Bias | 10 | −16.63E-04 | −10.22E-04 | 34.36E-06a | −4.63E-02 | 64.55E-04a | 86.15E-04 | 1.41E-02 | −4.68E-02 | 71.20E-04a | 78.75E-04 | 89.57E-04 | −3.80E-02
Bias | 20 | −14.35E-04 | −9.81E-04 | 6.04E-04a | −1.73E-02 | 73.31E-04a | 91.22E-04 | 1.17E-02 | −2.62E-02 | 1.38E-02 | 1.42E-02 | 1.48E-02 | −59.28E-04a
Bias | 40 | −12.21E-04 | −9.16E-04 | −5.04E-06a | −92.39E-04 | 49.99E-04a | 58.58E-04 | 76.29E-04 | −2.19E-02 | 1.77E-02 | 1.80E-02 | 1.87E-02 | 21.59E-04a
Bias | 60 | −14.21E-04 | −11.23E-04 | −3.27E-04a | −77.59E-04 | 52.45E-04a | 56.00E-04 | 59.41E-04 | −1.35E-02 | 1.87E-02 | 1.90E-02 | 1.93E-02 | −1.33E-04a
Bias | 80 | −12.23E-04 | −9.88E-04 | −4.65E-04a | −36.89E-04 | 53.12E-04a | 53.23E-04 | 60.27E-04 | −1.02E-02 | 1.90E-02 | 1.93E-02 | 2.08E-02 | −3.75E-04a
Standard deviation | 5 | 2.90E-2 | 2.89E-2a | 3.31E-2 | 53.55E-2 | 8.65E-2 | 8.51E-2a | 9.49E-2 | 51.54E-2 | 7.82E-2 | 8.44E-2 | 7.62E-2a | 51.87E-2
Standard deviation | 10 | 2.67E-2a | 2.68E-2 | 3.02E-2 | 28.96E-2 | 6.93E-2 | 6.92E-2a | 7.68E-2 | 29.92E-2 | 3.61E-2a | 3.62E-2 | 3.87E-2 | 28.46E-2
Standard deviation | 20 | 2.55E-2a | 2.56E-2 | 2.94E-2 | 17.45E-2 | 6.22E-2 | 6.21E-2a | 6.96E-2 | 19.40E-2 | 1.95E-2a | 1.96E-2 | 2.31E-2 | 18.61E-2
Standard deviation | 40 | 2.37E-2a | 2.37E-2 | 2.76E-2 | 11.93E-2 | 5.73E-2 | 5.73E-2a | 6.49E-2 | 13.39E-2 | 1.80E-2a | 1.81E-2 | 2.26E-2 | 12.33E-2
Standard deviation | 60 | 2.30E-2 | 2.30E-2a | 2.60E-2 | 9.53E-2 | 5.35E-2a | 5.36E-2 | 5.76E-2 | 10.64E-2 | 1.68E-2a | 1.69E-2 | 2.06E-2 | 9.77E-2
Standard deviation | 80 | 2.27E-2 | 2.27E-2a | 2.62E-2 | 8.19E-2 | 4.91E-2a | 4.92E-2 | 5.51E-2 | 9.00E-2 | 1.60E-2a | 1.61E-2 | 1.94E-2 | 8.33E-2
a Smallest value for a given data set and a given length of the local data series.

[51] It can be seen in Figure 8 as well as in Table 4 that the RMSE and the standard deviation of the shape parameter are much smaller when estimated with the parametric Bayesian approach than those obtained with the local estimator. This is true for all regional data sets and all lengths of data series. Since large return period quantile magnitudes are very sensitive to variations of ξ, this means that the parametric Bayesian method will lead to more stable estimations when the data series are short.
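The sensitivity of high quantiles to ξ can be illustrated with a small numerical example. The parameterization q_T = μ + σ[(−ln(1 − 1/T))^(−ξ) − 1]/ξ and the values μ = 100, σ = 30, ξ = 0.1 and 0.2 are assumptions chosen for illustration; they are not taken from the simulated data sets.

```python
import math


def gev_quantile(mu, sigma, xi, t):
    """GEV quantile of return period t (assumed sign convention for xi)."""
    a = -math.log(-math.log(1.0 - 1.0 / t))      # Gumbel reduced variate
    if xi == 0.0:
        return mu + sigma * a
    return mu + sigma * math.expm1(xi * a) / xi


mu, sigma = 100.0, 30.0                          # hypothetical values

# Relative change of each quantile when xi moves from 0.1 to 0.2.
rel_change = {
    t: gev_quantile(mu, sigma, 0.2, t) / gev_quantile(mu, sigma, 0.1, t) - 1.0
    for t in (10, 100, 1000)
}
for t, r in rel_change.items():
    print(t, round(100.0 * r, 1))
```

With these values, the same shift in ξ changes q1000 several times more, in relative terms, than q10, which is why a stable estimate of ξ matters most for large return periods.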

[52] The parametric Bayesian approach also performs better than the local estimation when estimating the location and scale parameters μ and σ on the first generated data set: it leads to smaller RMSE and standard deviation for all values of l (Tables 2 and 3 and Figures 6a and 7a). The same conclusion can be drawn for the second generated data set (Tables 2 and 3 and Figures 6b and 7b), except that the improvement is smaller. Because of the large bias of the regional model, the local estimator outperforms the parametric Bayesian estimator when estimating μ and σ on the third generated data set (Tables 2 and 3 and Figures 6c and 7c).

6.2. Effects on Quantile Estimation

[53] The RMSE of the M, Md and Mo quantile estimators are compared in Figures 9 through 11 to the RMSE of the local, empirical Bayes and regional estimators for the three regional data sets. The RMSE, bias and standard deviations of these estimators are presented in Tables 5–7.

Figure 9.

RMSE of the estimators of q10 according to the length of local data series: (a) first generated data set, (b) second generated data set, and (c) third generated data set.

Figure 10.

RMSE of the estimators of q100 according to the length of local data series: (a) first generated data set, (b) second generated data set, and (c) third generated data set.

Figure 11.

RMSE of the estimators of q1000 according to the length of local data series: (a) first generated data set, (b) second generated data set, and (c) third generated data set.

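Table 5. RMSE, Bias, and Standard Deviation of the Estimators of q10
Measure | l | First Generated Data Set |  |  |  |  |  | Second Generated Data Set |  |  |  |  |  | Third Generated Data Set |  |  |  |  | 
 |  | M | Md | Mo | L | R | EB | M | Md | Mo | L | R | EB | M | Md | Mo | L | R | EB
RMSE | 5 | 20.47 | 20.17a | 21.90 | 51.75 | 33.16 | 33.83 | 50.87 | 50.42a | 57.18 | 68.80 | 148.01 | 78.68 | 65.76 | 62.56 | 57.13 | 27.15a | 157.29 | 93.17
RMSE | 10 | 18.25 | 17.99a | 19.02 | 36.66 | 33.20 | 29.90 | 38.79a | 38.98 | 41.99 | 44.47 | 148.01 | 56.92 | 47.03 | 44.60 | 42.39 | 20.62a | 157.15 | 82.71
RMSE | 20 | 16.42 | 16.33a | 17.19 | 27.02 | 33.26 | 25.47 | 26.81a | 26.98 | 28.26 | 29.44 | 148.01 | 49.95 | 30.38 | 29.29 | 28.04 | 14.22a | 157.17 | 65.97
RMSE | 40 | 12.98 | 12.95a | 13.91 | 19.09 | 33.29 | 21.06 | 17.67 | 17.57a | 18.81 | 19.40 | 148.01 | 29.59 | 18.31 | 17.73 | 16.94 | 9.98a | 157.62 | 48.43
RMSE | 60 | 11.68a | 11.70 | 12.83 | 16.08 | 33.38 | 18.84 | 14.28 | 14.28a | 15.14 | 15.96 | 148.01 | 23.00 | 13.74 | 13.41 | 13.50 | 8.85a | 157.94 | 38.61
RMSE | 80 | 10.64 | 10.59a | 11.42 | 13.70 | 33.37 | 16.88 | 12.93a | 12.96 | 13.70 | 14.14 | 148.01 | 18.19 | 11.20 | 10.94 | 10.64 | 7.55a | 158.25 | 32.15
Bias | 5 | 1.59 | 0.63 | −0.70 | −6.84 | 0.11a | −8.70 | 1.75a | −3.78 | −10.65 | −2.82 | −11.49 | −17.94 | 48.57 | 46.20 | 42.01 | −4.39a | 115.33 | 42.64
Bias | 10 | 1.52 | 0.80 | −0.60 | −2.27 | 0.24a | −4.52 | −0.09a | −2.89 | −7.47 | −1.81 | −11.49 | −10.99 | 33.93 | 32.27 | 29.73 | −0.75a | 115.18 | 36.02
Bias | 20 | 1.14 | 0.60 | −0.67 | −0.89 | 0.25a | −2.43 | −0.67a | −2.18 | −4.93 | −1.36 | −11.49 | −8.61 | 20.89 | 20.00 | 18.50 | −0.30a | 115.17 | 27.57
Bias | 40 | 0.52 | 0.16a | −0.84 | −0.77 | 0.25 | −1.60 | 0.41 | −0.39 | −1.77 | −0.24a | −11.49 | −4.56 | 11.95 | 11.47 | 10.50 | −0.08a | 115.87 | 18.91
Bias | 60 | 0.18 | −0.11a | −0.75 | −0.84 | 0.22 | −1.40 | −0.12a | −0.67 | −1.86 | −0.64 | −11.49 | −3.20 | 8.29 | 8.01 | 7.56 | −0.13a | 116.32 | 14.14
Bias | 80 | −0.05a | −0.25 | −0.61 | −0.73 | 0.22 | −1.11 | 0.07a | −0.32 | −1.23 | −0.46 | −11.49 | −2.42 | 6.50 | 6.28 | 5.76 | −0.10a | 116.79 | 11.42
Standard deviation | 5 | 20.41 | 20.16a | 21.89 | 51.30 | 33.16 | 32.69 | 50.84 | 50.28a | 56.18 | 68.75 | 147.56 | 76.60 | 44.33 | 42.18 | 38.71 | 26.79a | 106.96 | 82.84
Standard deviation | 10 | 18.19 | 17.97a | 19.01 | 36.59 | 33.19 | 29.56 | 38.79a | 38.88 | 41.32 | 44.44 | 147.56 | 55.85 | 32.57 | 30.78 | 30.23 | 20.61a | 106.91 | 74.46
Standard deviation | 20 | 16.38 | 16.32a | 17.18 | 27.01 | 33.26 | 25.35 | 26.80a | 26.89 | 27.83 | 29.41 | 147.56 | 49.21 | 22.05 | 21.40 | 21.07 | 14.22a | 106.96 | 59.93
Standard deviation | 40 | 12.97 | 12.95a | 13.89 | 19.08 | 33.29 | 21.00 | 17.66 | 17.57a | 18.73 | 19.40 | 147.56 | 29.24 | 13.88 | 13.52 | 13.29 | 9.98a | 106.86 | 44.59
Standard deviation | 60 | 11.68a | 11.70 | 12.81 | 16.05 | 33.38 | 18.79 | 14.28 | 14.26a | 15.03 | 15.95 | 147.56 | 22.78 | 10.95 | 10.75 | 11.19 | 8.85a | 106.84 | 35.92
Standard deviation | 80 | 10.64 | 10.59a | 11.41 | 13.68 | 33.37 | 16.85 | 12.93a | 12.96 | 13.64 | 14.13 | 147.56 | 18.03 | 9.12 | 8.96 | 8.95 | 7.55a | 106.79 | 30.05
a Smallest value for a given data set and a given length of the local data series.
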
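Table 6. RMSE, Bias, and Standard Deviation of the Estimators of q100
Measure | l | First Generated Data Set |  |  |  |  |  | Second Generated Data Set |  |  |  |  |  | Third Generated Data Set |  |  |  |  | 
 |  | M | Md | Mo | L | R | EB | M | Md | Mo | L | R | EB | M | Md | Mo | L | R | EB
RMSE | 5 | 32.76 | 32.14a | 34.33 | 186.20 | 47.93 | 61.60 | 80.87 | 79.64a | 89.42 | 225.31 | 199.94 | 155.81 | 107.74 | 102.01 | 95.33a | 104.48 | 213.52 | 155.61
RMSE | 10 | 29.21 | 28.76a | 31.53 | 117.36 | 47.80 | 55.45 | 64.31a | 65.19 | 71.37 | 133.33 | 199.94 | 123.98 | 81.37 | 77.26 | 73.59 | 68.84a | 213.33 | 140.57
RMSE | 20 | 27.01 | 26.84a | 28.76 | 82.00 | 48.04 | 46.40 | 47.93a | 48.35 | 53.78 | 103.02 | 199.94 | 110.53 | 55.80 | 53.75 | 51.43 | 44.44a | 213.35 | 122.33
RMSE | 40 | 22.55 | 22.43a | 23.94 | 57.63 | 48.09 | 42.38 | 33.82 | 33.41a | 36.74 | 60.56 | 199.94 | 82.24 | 35.65 | 34.49 | 33.93 | 29.61a | 213.97 | 98.71
RMSE | 60 | 20.84 | 20.80a | 22.14 | 47.34 | 48.23 | 38.47 | 28.85 | 28.63a | 31.82 | 48.77 | 199.94 | 66.75 | 27.57 | 26.78 | 26.95 | 24.60a | 214.40 | 83.41
RMSE | 80 | 19.32 | 19.22a | 20.81 | 41.60 | 48.21 | 35.61 | 28.28 | 28.18a | 28.38 | 40.97 | 199.94 | 55.19 | 23.22 | 22.58 | 22.39 | 21.33a | 214.75 | 71.86
Bias | 5 | 3.38 | 1.51 | −1.35 | 17.39 | −0.23a | −18.20 | 3.61a | −7.40 | −21.87 | 16.43 | −8.88 | −41.31 | 78.46 | 74.27 | 67.80 | 12.15a | 157.18 | 79.48
Bias | 10 | 2.69 | 1.39 | −1.25 | 6.34 | −0.03a | −13.50 | 0.36a | −5.39 | −15.19 | 8.24 | −8.88 | −27.98 | 57.33 | 54.43 | 50.14 | 5.77a | 156.98 | 66.52
Bias | 20 | 2.02 | 0.98 | −1.28 | 2.49 | 0.01a | −8.58 | 0.33a | −3.37 | −10.03 | 5.25 | −8.88 | −20.68 | 37.25 | 35.62 | 33.07 | 3.47a | 156.95 | 57.50
Bias | 40 | 1.26 | 0.53 | −1.02 | 0.64 | 0.04a | −5.97 | 2.37 | −0.24a | −4.42 | 1.57 | −8.88 | −12.56 | 22.81 | 21.88 | 20.48 | 2.32a | 157.93 | 43.34
Bias | 60 | 0.72 | 0.09 | −1.38 | −0.72 | 0.03a | −4.71 | 2.02 | −0.10a | −3.07 | 0.56 | −8.88 | −8.88 | 16.80 | 16.18 | 15.41 | 1.58a | 158.53 | 34.84
Bias | 80 | 0.37 | −0.13 | −1.02 | −0.40 | 0.03a | −3.50 | 1.70 | −0.17 | −3.26 | −0.05a | −8.88 | −7.40 | 13.87 | 13.37 | 12.47 | 1.21a | 159.13 | 28.90
Standard deviation | 5 | 32.58 | 32.11a | 34.31 | 185.39 | 47.93 | 58.85 | 80.79 | 79.30a | 86.70 | 224.71 | 199.74 | 150.23 | 73.83 | 69.92 | 67.02a | 103.77 | 144.52 | 133.78
Standard deviation | 10 | 29.08 | 28.73a | 31.50 | 117.19 | 47.80 | 53.78 | 64.31a | 64.97 | 69.74 | 133.07 | 199.74 | 120.78 | 57.74 | 54.83 | 53.86a | 68.60 | 144.45 | 123.83
Standard deviation | 20 | 26.93 | 26.82a | 28.73 | 81.96 | 48.04 | 45.59 | 47.93a | 48.24 | 52.84 | 102.88 | 199.74 | 108.58 | 41.54 | 40.26 | 39.39a | 44.30 | 144.52 | 107.97
Standard deviation | 40 | 22.51 | 22.42a | 23.91 | 57.62 | 48.09 | 41.96 | 33.74 | 33.41a | 36.47 | 60.54 | 199.74 | 81.28 | 27.40 | 26.67a | 27.05 | 29.52 | 144.37 | 88.69
Standard deviation | 60 | 20.83 | 20.80a | 22.10 | 47.33 | 48.23 | 38.18 | 28.78 | 28.63a | 31.67 | 48.76 | 199.74 | 66.16 | 21.86 | 21.34a | 22.12 | 24.55 | 144.35 | 75.79
Standard deviation | 80 | 19.32 | 19.22a | 20.78 | 41.60 | 48.21 | 35.44 | 28.23 | 28.18a | 28.19 | 40.97 | 199.74 | 54.69 | 18.62 | 18.20a | 18.60 | 21.29 | 144.21 | 65.79
a Smallest value for a given data set and a given length of the local data series.
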
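Table 7. RMSE, Bias, and Standard Deviation of the Estimators of q1000
Measure | l | First Generated Data Set |  |  |  |  |  | Second Generated Data Set |  |  |  |  |  | Third Generated Data Set |  |  |  |  | 
 |  | M | Md | Mo | L | R | EB | M | Md | Mo | L | R | EB | M | Md | Mo | L | R | EB
RMSE | 5 | 51.40 | 49.87a | 51.19 | 850.55 | 73.35 | 92.42 | 242.44 | 111.62a | 122.65 | 769.44 | 258.18 | 230.42 | 154.71 | 146.59 | 137.67a | 526.48 | 281.60 | 232.76
RMSE | 10 | 45.97 | 44.95a | 46.30 | 380.31 | 72.95 | 90.79 | 97.52 | 97.33a | 110.29 | 402.71 | 258.18 | 198.45 | 122.05 | 116.08 | 114.25a | 213.54 | 281.34 | 204.23
RMSE | 20 | 42.75 | 42.23a | 43.93 | 225.34 | 73.43 | 78.93 | 81.17 | 80.32a | 93.81 | 280.00 | 258.18 | 178.34 | 88.22 | 84.84 | 80.64a | 125.18 | 281.36 | 184.64
RMSE | 40 | 37.20 | 36.81a | 37.54 | 144.83 | 73.50 | 74.11 | 64.44 | 62.49a | 65.67 | 172.25 | 258.18 | 148.19 | 59.35 | 57.26 | 55.15a | 75.83 | 282.19 | 152.38
RMSE | 60 | 35.50 | 35.23a | 36.55 | 112.63 | 73.72 | 67.07 | 60.14 | 57.96a | 65.25 | 137.66 | 258.18 | 124.81 | 47.47 | 45.92 | 43.84a | 59.47 | 282.76 | 133.42
RMSE | 80 | 33.84 | 33.45a | 35.44 | 100.89 | 73.69 | 62.80 | 60.42 | 58.36a | 62.79 | 106.04 | 258.18 | 106.67 | 41.08 | 39.73 | 39.09a | 51.48 | 283.18 | 117.20
Bias | 5 | 6.50 | 3.54 | −0.65a | 197.57 | 1.40 | −20.22 | 18.16 | −8.74 | −30.97 | 153.35 | −4.34a | −52.47 | 108.56 | 102.56 | 93.63a | 121.38 | 202.80 | 130.01
Bias | 10 | 5.15 | 3.13 | 0.25a | 61.87 | 1.65 | −22.37 | 6.10 | −5.86 | −23.18 | 68.63 | −4.34a | −42.02 | 82.00 | 77.84 | 72.87 | 38.36a | 202.55 | 98.56
Bias | 20 | 4.14 | 2.49 | −0.36a | 24.33 | 1.76 | −15.86 | 5.71 | −2.36a | −14.76 | 41.93 | −4.34 | −32.43 | 55.73 | 53.22 | 49.14 | 19.43a | 202.49 | 86.57
Bias | 40 | 3.23 | 1.96 | −1.02a | 10.92 | 1.81 | −11.51 | 7.62 | 1.02a | −7.53 | 13.02 | −4.34 | −19.01 | 36.11 | 34.57 | 32.19 | 10.67a | 203.76 | 66.04
Bias | 60 | 2.48 | 1.34 | −0.75a | 4.47 | 1.80 | −8.74 | 7.75 | 1.96a | −6.96 | 7.99 | −4.34 | −13.27 | 27.84 | 26.75 | 24.70 | 7.34a | 204.55 | 54.43
Bias | 80 | 2.06 | 1.10 | −0.53a | 4.30 | 1.80 | −6.05 | 6.45 | 1.30a | −7.04 | 4.91 | −4.34 | −11.34 | 23.73 | 22.78 | 21.60 | 5.42a | 205.29 | 45.72
Standard deviation | 5 | 50.99 | 49.75a | 51.18 | 827.29 | 73.34 | 90.18 | 241.76 | 111.27a | 118.67 | 754.00 | 258.14 | 224.37 | 110.23 | 104.74 | 100.92a | 512.30 | 195.37 | 193.06
Standard deviation | 10 | 45.68 | 44.84a | 46.29 | 375.25 | 72.93 | 87.99 | 97.33 | 97.16a | 107.82 | 396.82 | 258.14 | 193.95 | 90.39 | 86.12a | 87.99 | 210.06 | 195.26 | 178.87
Standard deviation | 20 | 42.55 | 42.15a | 43.93 | 224.02 | 73.41 | 77.32 | 80.97 | 80.29a | 92.64 | 276.84 | 258.14 | 175.37 | 68.38 | 66.08 | 63.93a | 123.67 | 195.35 | 163.08
Standard deviation | 40 | 37.06 | 36.76a | 37.52 | 144.42 | 73.48 | 73.21 | 63.98 | 62.49a | 65.24 | 171.76 | 258.14 | 146.96 | 47.10 | 45.65 | 44.78a | 75.08 | 195.22 | 137.33
Standard deviation | 60 | 35.42 | 35.20a | 36.54 | 112.55 | 73.70 | 66.50 | 59.64 | 57.93a | 64.88 | 137.43 | 258.14 | 124.11 | 38.45 | 37.33 | 36.21a | 59.02 | 195.22 | 121.81
Standard deviation | 80 | 33.78 | 33.44a | 35.44 | 100.80 | 73.66 | 62.50 | 60.07 | 58.34a | 62.40 | 105.92 | 258.14 | 106.07 | 33.53 | 32.55a | 32.58 | 51.19 | 195.07 | 107.92
a Smallest value for a given data set and a given length of the local data series.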

[54] Figure 9 and Table 5 show that overall, the best estimator (in terms of RMSE and standard deviation) for q10 on the first generated data set is Md, closely followed by M and Mo. Next comes EB, and then R or L depending on the length of the local data series. R is the worst estimator when the length of the local data series is larger than 10, but it is better than L when l = 5 and l = 10. Depending on l, M or Md takes first place when estimating q10 on the second generated data set, followed by L, EB and R. The improvement due to the use of the parametric Bayesian method instead of the local estimation method is smaller than in the case of the first generated data set. Finally, the L estimator turns out to be the best with regard to all performance measures on the third data set.

[55] Similar conclusions can be drawn for the estimation of quantile q100 (Figure 10 and Table 6). The ranking of the different estimators remains the same, but the parametric Bayesian approach seems to be more competitive than in the case of q10. The improvement over L is larger (Figure 9b versus Figure 10b) for the second data set, and L hardly beats M, Mo and Md when estimating q100 on the third generated data set (Figure 10c). The parametric Bayesian estimators become the best for all data sets when used to estimate q1000 (Figure 11 and Table 7). Results indicate that the proposed method becomes more and more efficient as the return period increases. This property makes it very attractive for design purposes where high return period quantiles are of interest.

[56] The reason why the proposed approach performs better than the EB approach is that the latter does not account for the distribution of the local data series. EB makes the simplifying and often unverified assumption that the probability distributions of both the regional and local quantile estimators are normal, which is not the case here. This is true neither for the regional model (only the logarithm of the quantile is normal) nor for the local quantile estimator. The parametric Bayesian estimator does not make such limiting assumptions, and the fact that it leads to better results is not surprising.
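For reference, the linear EB estimator described in the abstract is a weighted average of the local and regional estimates, with weights inversely proportional to their variances. The sketch below illustrates that weighting with hypothetical numbers; the function name and the values are assumptions, and the variance estimates themselves would have to come from the local and regional analyses.

```python
def empirical_bayes(q_local, var_local, q_regional, var_regional):
    """Inverse-variance weighted average of a local and a regional quantile estimate."""
    w_local = (1.0 / var_local) / (1.0 / var_local + 1.0 / var_regional)
    q_eb = w_local * q_local + (1.0 - w_local) * q_regional
    return q_eb, w_local


# Hypothetical case: imprecise local estimate, more precise regional estimate.
q_eb, w = empirical_bayes(q_local=300.0, var_local=80.0 ** 2,
                          q_regional=340.0, var_regional=40.0 ** 2)
print(q_eb, w)
```

The estimator uses only two moments of each source of information, so any skewness in the sampling distribution of a short-record quantile estimate is ignored, whereas the parametric Bayesian approach works with the full likelihood of the local observations.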

6.3. Effect of Longer Local Data Series

[57] It can be seen in Figures 6–11 as well as in Tables 2–7 that when the length of the local data series increases, the RMSE and the standard deviation of all quantile estimators but the regional one decrease. The bias decreases almost consistently, although a few cases arose where the bias increased. For instance, the reduction of the RMSE of quantile estimators on all data sets ranges from 29% to 41% for the local estimator (mean reduction: 34%), from 6% to 41% for the empirical Bayes estimator (mean reduction: 20%), and from 13% to 40% for the parametric Bayesian estimators (mean reduction: 19%) when the length of data series increases from 20 to 40 years.

[58] The reduction of RMSE due to the proposed Bayesian combination method is compared in Table 8 to the reduction of RMSE due to the use of the 40-year data series instead of the 20-year data series. It can be seen that, for the first data set, the use of the Md quantile estimator always leads to a higher average reduction of the RMSE than the use of the 40-year data series instead of the 20-year data series. The same conclusion can be drawn for the second data set, but only for q100 and q1000. An interesting remark is that the parametric Bayesian method is helpful when estimating q1000 on the third data set, but not as much as doubling the length of the local data series. This means that the Bayesian method for combining at-site and regional information cannot and should not be considered as a substitute for a sustained intensive hydrological monitoring program. Quite the opposite: the application of the proposed Bayesian information combination method is only possible because of the availability of a reasonably good and dense regional network of stations with a good record of information. In fact, this result points out the importance of maintaining a good hydrometeorological network, as the available record can be used not only for at-site frequency analysis but also for estimation at other sites, even ungauged or briefly gauged ones.

Table 8. Percentage of Reduction of RMSE due to the Change in the Length of Data Series (From 20 years to 40 years) or the Application of Bayesian Combination Methods
Quantile | First Generated Data Set |  |  |  | Second Generated Data Set |  |  |  | Third Generated Data Set |  |  | 
 | M | Md | Mo | Sa | M | Md | Mo | Sa | M | Md | Mo | Sa
Q10 | 39.23% | 39.58% | 36.38% | 29.34% | 8.94% | 8.35% | 4.01% | 34.09% | −113.57% | −105.95% | −97.15% | 29.81%
Q100 | 67.06% | 67.26% | 64.93% | 29.72% | 53.47% | 53.06% | 47.79% | 41.21% | −25.57% | −20.96% | −15.73% | 33.37%
Q1000 | 81.03% | 81.26% | 80.50% | 35.73% | 71.01% | 71.31% | 66.50% | 38.48% | 29.53% | 32.22% | 35.59% | 39.42%
a S = Switching from l = 20 to l = 40.
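The entries of Table 8 appear to be relative reductions of RMSE with respect to the local estimator at l = 20. The formula below is inferred from the tabulated values rather than stated in the text, so it should be read as an assumption; the inputs are the rounded RMSE values of q10 for the first generated data set in Table 5.

```python
def rmse_reduction(rmse_reference, rmse_new):
    """Percentage reduction of RMSE relative to a reference estimator."""
    return 100.0 * (rmse_reference - rmse_new) / rmse_reference


# RMSE of q10, first generated data set (rounded values read from Table 5).
rmse_L_20, rmse_L_40, rmse_Md_20 = 27.02, 19.09, 16.33

gain_bayes = rmse_reduction(rmse_L_20, rmse_Md_20)    # Md instead of L at l = 20
gain_record = rmse_reduction(rmse_L_20, rmse_L_40)    # l = 40 instead of l = 20
print(round(gain_bayes, 2), round(gain_record, 2))
```

The results agree with the Md and S columns of Table 8 for Q10 to within the rounding of the inputs.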

6.4. Sensitivity to Regional Information

[59] As pointed out in section 6.1, the ability of the proposed methodology to correctly estimate the location and scale parameters decreases when a nonnull relative bias is introduced in the regional model. This relative bias does not seem to affect its ability to correctly estimate the shape parameter. The reason for this is that the relative bias introduced through br affects all three quantiles in the same proportion. It is shown in Appendix A that the shape parameter is a function of a ratio of quantile differences and thus should not be affected by br. Whether this constraint in the generation scheme is reasonable or not is a matter of judgment. The authors’ experience has shown that when using the log linear regional regression model, quantiles of different return periods tend to be biased in the same direction (downward or upward) and the magnitudes of their biases are comparable.

[60] A simple sensitivity analysis of the method to differences in the relative biases for different return periods was performed. Equation (19) was slightly modified to assign different relative biases (br1, br2, br3) to the regional model of the quantiles (qT1, qT2, qT3). For illustration purposes, we considered br1 = br2 = 0 and allowed br3 to vary from 0 to 1 in increments of 0.1. For each value of (br1, br2, br3), a regional data set was generated following the procedure described in section 4, and the following quantities were computed: (1) the mean difference equation image between the location parameter at target sites and the location parameter at nontarget sites, (2) the mean difference equation image between the scale parameter at target sites and the scale parameter at nontarget sites, (3) the mean difference equation image between the shape parameter at target sites and the shape parameter at nontarget sites, and (4) estimations of q10, q100 and q1000 using the Md estimator. equation image is computed as follows:

equation image

equation image and equation image are computed using the same procedure. Given a regional data set, equation image, equation image and equation image are measures of how the parameters at target sites differ from the parameters in their respective regions.

[61] The results are plotted in Figure 12 and lead to the following conclusions.

Figure 12.

Sensitivity analysis of the statistical parameters of the generated samples and of the performance of the proposed methodology to the bias parameter br3 (br1 = br2 = 0): (a) equation image, (b) equation image, (c) equation image, and (d) RMSE of the Md estimator of q10, q100, and q1000.

[62] 1. As expected, regional heterogeneity increases as br3 becomes very different from br1 and br2 (i.e., equation image, equation image and equation image become significantly different from zero). equation image and equation image increase, while equation image decreases.

[63] 2. As br3 (and thus regional heterogeneity) increases, the RMSE of the Md estimator of q1000 increases, which means that the proposed method becomes less efficient for this particular quantile. The RMSEs of the estimators of q10 and q100 do not seem to be affected. The best performance corresponds to br1 = br2 = br3 = 0.
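To see why an unequal bias changes the apparent shape of the distribution, the following sketch repeats the sweep br1 = br2 = 0, br3 = 0, 0.1, ..., 1 on a single set of quantiles. It is only an illustration under stated assumptions: the quantile convention in `gev_quantile`, the multiplicative form (1 + br) of the relative bias, and the parameter values (100, 30, 0.1) are not taken from the paper, and the function names are hypothetical.

```python
import math

def gev_quantile(mu, sigma, xi, T):
    """T-year quantile under the ASSUMED convention
    q_T = mu + sigma * (y**(-xi) - 1) / xi, with y = -ln(1 - 1/T)."""
    L = -math.log(-math.log(1.0 - 1.0 / T))  # Gumbel reduced variate, L = -ln(y)
    if abs(xi) < 1e-12:                      # Gumbel limit as xi -> 0
        return mu + sigma * L
    return mu + sigma * math.expm1(xi * L) / xi

def difference_ratio(q, br):
    """Ratio (q3 - q2) / (q2 - q1) after applying ASSUMED multiplicative
    relative biases (1 + br_i) to the three quantiles."""
    q1, q2, q3 = ((1.0 + b) * v for b, v in zip(br, q))
    return (q3 - q2) / (q2 - q1)

# Hypothetical "true" parameters, chosen only for illustration.
true_q = [gev_quantile(100.0, 30.0, 0.1, T) for T in (10, 100, 1000)]
baseline = difference_ratio(true_q, (0.0, 0.0, 0.0))

# Sweep described in the text: br1 = br2 = 0, br3 from 0 to 1 in steps of 0.1.
sweep = [(k / 10, difference_ratio(true_q, (0.0, 0.0, k / 10))) for k in range(11)]
for br3, ratio in sweep:
    print(f"br3 = {br3:.1f}  ratio = {ratio:.3f}  (unbiased: {baseline:.3f})")
```

Because the shape parameter is determined by this ratio (Appendix A), any drift of the ratio away from its unbiased value means that the regional model implies a different shape parameter from the true one, whereas a bias common to all three quantiles leaves the ratio unchanged.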

[64] Admittedly, this sensitivity analysis does not cover the full range of possible configurations of br1, br2, br3, and further investigation is desirable. However, the results strongly suggest that the methodology may be counterproductive at sites that are very different from the regional mean. This potential problem should be circumvented by a careful choice of the neighborhood delineation method.

6.5. Generalization to Other Extreme Value Distributions

[65] In the specification of the prior, only the Jacobian J (equation (15)) depends on the distribution. Thus the application of the method to other extreme value distributions is straightforward if an expression of J can be derived for the new distribution. The MCMC algorithm will also need to be adapted to the target distribution. Other analytical expressions for ΔqTi may also be used provided that the expression of J does not take null values in the parameter space.
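When an analytical expression of J is tedious to derive for a candidate distribution, a numerical check can indicate whether J vanishes at given parameter values. The sketch below assumes that J is the determinant of the Jacobian matrix of the map from (μ, σ, ξ) to (ΔqT1, ΔqT2, ΔqT3), which is the usual change-of-variables factor. Equation (15) is not reproduced here, so this interpretation, the quantile convention, and all function names are assumptions.

```python
import math

def gev_quantile(theta, T):
    """GEV T-year quantile under an ASSUMED sign convention for xi."""
    mu, sigma, xi = theta
    L = -math.log(-math.log(1.0 - 1.0 / T))  # Gumbel reduced variate
    if abs(xi) < 1e-12:                      # Gumbel limit as xi -> 0
        return mu + sigma * L
    return mu + sigma * math.expm1(xi * L) / xi

def quantile_differences(theta, quantile, T=(10, 100, 1000)):
    """Map the parameters to (dq1, dq2, dq3) = (q_T1, q_T2 - q_T1, q_T3 - q_T2)."""
    q1, q2, q3 = (quantile(theta, t) for t in T)
    return [q1, q2 - q1, q3 - q2]

def jacobian_det(theta, quantile, T=(10, 100, 1000), h=1e-5):
    """Determinant of d(dq1, dq2, dq3) / d(theta) by central finite differences."""
    cols = []
    for j in range(3):
        up, dn = list(theta), list(theta)
        up[j] += h
        dn[j] -= h
        f_up = quantile_differences(up, quantile, T)
        f_dn = quantile_differences(dn, quantile, T)
        cols.append([(a - b) / (2.0 * h) for a, b in zip(f_up, f_dn)])
    # m[i][j] = partial derivative of the i-th difference w.r.t. the j-th parameter
    m = [[cols[j][i] for j in range(3)] for i in range(3)]
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Illustrative parameter values only; any other quantile function with the
# signature quantile(theta, T) could be passed in place of gev_quantile.
print(jacobian_det((100.0, 30.0, 0.1), gev_quantile))
```

A scan of `jacobian_det` over a grid of plausible parameter values gives a quick indication of whether the nonnull condition holds for a new distribution, although it does not replace an analytical derivation.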

7. Conclusions

[66] A parametric Bayesian methodology to combine local and regional information in order to improve the estimation of flood quantiles is presented. The methodology is validated on three simulated data sets representing different levels of regional homogeneity. In this method, the prior information is specified using multiple regression on quantiles and quantile differences. The developments are made with the generalized extreme value distribution, but guidelines are provided for its extension to other distributions. The proposed method relaxes the normality assumption on the local quantile probability distribution and can be applied to very short data series. It stabilizes the estimation of the GEV shape parameter and significantly improves the estimation of the parameters and the quantiles when relatively short series are used. The method was shown to be superior in terms of RMSE to the local and regional estimators, and to the empirical Bayesian estimator used by Kuczera [1982]. On two out of the three simulated data sets, it was shown that the improvement in quantile estimation due to the use of the parametric Bayesian approach is at least equivalent to that obtained with the use of at-site series that are twice as long. The method presented in this paper is thus a promising approach for the estimation of quantiles at sites with short to medium length flood records.

Appendix A: Computation of μ, σ, and ξ From qT1, qT2, and qT3

[67] The following equations allow μ, σ, and ξ to be computed from ΔqT1, ΔqT2 and ΔqT3. From equations (8) and (9) we have

equation image

If g is a monotonic function of ξ, its inverse g⁻¹ exists and we have:

equation image
equation image
equation image

A simple plot of g versus ξ confirms that g is monotonic for T1 = 10, T2 = 100 and T3 = 1000.

Notation
α

parameter that tunes the precision of the regional model.

equation image

matrix of regression coefficients.

β(i)

ith row of equation image.

equation image

mean difference between location parameter at target sites and location parameter at nontarget sites.

equation image

mean difference between scale parameter at target sites and scale parameter at nontarget sites.

equation image

mean difference between shape parameter at target sites and shape parameter at nontarget sites.

br

common bias parameter for qT1, qT2, qT3.

br1 (resp. br2, br3)

bias parameter for qT1 (resp. qT2, qT3).

Ak

value of the kth physiographic or meteorological variable at the site of interest.

EB

empirical Bayes estimator.

f(x|θ)

likelihood of the observations.

L

local estimator.

M

mean of the quantile or parameter posterior probability density.

Md

median of the quantile or parameter posterior probability density.

Mo

mode of the quantile or parameter posterior probability density.

μ

location parameter of the GEV distribution.

n

sample size.

ns

number of samples.

p

exceedance probability.

π(θ)

prior probability density of the parameters.

p(θ|x)

posterior probability density of the parameters given the data.

qT

T-year flood.

ΔqT1

T1-year flood.

ΔqTi, i ≥ 2

difference between the Ti-year flood and the Ti−1-year flood.

equation imageT(L)

local estimation of the T-year flood.

equation imageT(R)

regional estimation of the T-year flood.

R

regional estimator.

Σ

variance-covariance matrix.

σL

standard deviation of the local estimation of the T-year flood.

σR

standard deviation of the regional estimation of the T-year flood.

T

return period.

θ = (μ, σ, ξ)

parameter vector.

equation imagei

ith estimation of the parameters vector.

x

vector of observed data.

ξ

shape parameter of the GEV distribution.

Acknowledgments

[68] The financial support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Hydro-Quebec is gratefully acknowledged. The paper has been improved by helpful comments from the associate editor and three anonymous reviewers.
