## 1. Introduction

### 1.1. Standard Implementations of Regional Frequency Analysis

[2] The purpose of regional frequency analysis (RFA) is to estimate the distribution of some hydrologic variable (e.g., annual maximum rainfall or runoff) using data from several sites. Compared with standard at-site frequency analysis, RFA attempts to improve the precision of estimates by sharing the information stemming from similar sites. Moreover, RFA enables estimation at ungaged or poorly gaged sites by transferring the information arising from neighboring gaging stations.

[3] Numerous approaches have been proposed to implement RFA schemes. Among them, the index flood method proposed by *Dalrymple* [1960] is still widely used in engineering practice. It is based on a scale invariance hypothesis: within a homogeneous region, distributions from all sites are assumed identical, up to a scale factor, termed the index flood. The implementation of the index flood method can be summarized in three steps [e.g., *Hosking and Wallis*, 1997]: (1) delineation of a homogeneous region; (2) estimation of the common regional distribution, based on scaled at-site data (i.e., divided by the index flood); and (3) transfer of information to an ungaged or poorly gaged site using a regression model linking the index flood values with site characteristics.
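
Steps 2 and 3 of the index flood method can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the method's definition: the homogeneous region (step 1) is taken as given, the index flood is the at-site mean, the regional distribution is estimated empirically rather than by fitting a parametric family, and the index flood is linked to a single hypothetical site characteristic (catchment area) by an ordinary least squares power law. All function and variable names are illustrative.

```python
import math
import statistics


def empirical_quantile(sorted_values, p):
    """Linearly interpolated empirical quantile of pre-sorted data, 0 <= p <= 1."""
    h = p * (len(sorted_values) - 1)
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (h - lo) * (sorted_values[hi] - sorted_values[lo])


def regional_growth_quantile(site_data, p):
    """Step 2: pool at-site data scaled by the index flood (assumed: at-site mean)."""
    pooled = []
    for series in site_data.values():
        index_flood = statistics.mean(series)
        pooled.extend(x / index_flood for x in series)
    # Pooled values are treated as independent, which is the simplification
    # that the basic implementation makes.
    return empirical_quantile(sorted(pooled), p)


def fit_index_flood_regression(site_data, site_area):
    """Step 3: OLS fit of log(index flood) = a + b * log(area) (assumed model)."""
    xs = [math.log(site_area[s]) for s in site_data]
    ys = [math.log(statistics.mean(site_data[s])) for s in site_data]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def ungaged_quantile(site_data, site_area, target_area, p):
    """Quantile at an ungaged site: predicted index flood times regional quantile."""
    a, b = fit_index_flood_regression(site_data, site_area)
    index_flood = math.exp(a + b * math.log(target_area))
    return index_flood * regional_growth_quantile(site_data, p)
```

For example, `ungaged_quantile(data, area, 250.0, 0.99)` would return an estimate of the 0.99 quantile at a hypothetical ungaged catchment of area 250, given dictionaries `data` (site name to list of annual maxima) and `area` (site name to catchment area).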

[4] The index flood method is widely used because of its ease of implementation and its robustness. However, its basic implementation is affected by several limitations:

[5] 1. The delineation of homogeneous regions, where the scale invariance assumption holds, is far from obvious;

[6] 2. The scale invariance assumption might simply be too restrictive in some cases;

[7] 3. In most cases, the index flood is defined as the mean or the median of at-site data, but the physical reasons behind this choice are unclear;

[8] 4. The regional distribution is estimated by pooling scaled at-site data, and treating them as if they were independent, which is rarely the case;

[9] 5. Standard regression methods like ordinary least squares might be statistically inefficient because index flood values are statistics (as opposed to observations), and are therefore affected by estimation errors that may be dependent in space and whose properties may vary from site to site; and

[10] 6. The previous points make the quantification of the total predictive uncertainty challenging.

[11] A wealth of research has been carried out over the years, either to improve the implementation of the index flood method, or to generalize it by abandoning some of its most restrictive assumptions (in particular, scale invariance). A nonexhaustive list of examples includes the work by *Ouarda et al.* [2001] on the concept of homogeneous region; the studies by *Stedinger* [1983] and *Hosking and Wallis* [1988] on estimating the regional distribution with spatially dependent data; or the extension to peak-over-threshold series of *Madsen and Rosbjerg* [1997] and *Ribatet et al.* [2007].

[12] The transfer of information from a gaged to an ungaged site and the quantification of the associated predictive uncertainty have been topics of particular attention [e.g., *Stedinger and Tasker*, 1985, 1986; *Reis et al.*, 2005; *Kjeldsen and Jones*, 2009a; *Micevski and Kuczera*, 2009]. Indeed, this transfer relies on a regression model linking the index flood values (estimated at gaged sites) and site characteristics. *Stedinger and Tasker* [1985, 1986] introduced the generalized least squares (GLS) approach to account for both the heteroscedastic and spatially dependent nature of sampling errors (i.e., errors in estimating the index flood at gaged sites) and the existence of regression errors. *Robson and Reed* [1999] and *Kjeldsen and Jones* [2009a] further generalized the GLS approach by considering spatially dependent regression errors.

[13] Despite these advances, GLS-based transfer of information from a gaged to an ungaged site still relies on the following two-step procedure:

[14] 1. Local estimation: an index flood is first chosen (e.g., the at-site mean or median), and is estimated at each gaged site. This estimation is affected by sampling errors, which are spatially dependent and whose variance varies from site to site. Consequently, the covariance matrix of sampling errors is also estimated.

[15] 2. Regional estimation: a regression model is estimated to link the index flood value with site/catchment characteristics. Importantly, the estimation of the regression model accounts for the existence of sampling errors in a GLS framework, and is hence performed conditionally on the covariance matrix of sampling errors estimated in step 1. Also note that estimating a regression model is not restricted to estimating the regression coefficients, but also involves estimating the properties of regression errors, i.e., their covariance matrix. This is of primary importance since this matrix plays an important role in the predictive uncertainty at ungaged sites.
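
The two steps above can be written compactly. The notation below is introduced here for illustration only and is an assumption: $\hat{\boldsymbol{\mu}}$ denotes the vector of index flood values estimated at the gaged sites (or a transform of them), $\mathbf{X}$ the matrix of site characteristics, $\boldsymbol{\varepsilon}$ the sampling errors with estimated covariance matrix $\hat{\boldsymbol{\Sigma}}_{\varepsilon}$ (step 1), and $\boldsymbol{\eta}$ the regression errors with covariance matrix $\boldsymbol{\Sigma}_{\eta}$ (step 2). Both error terms are assumed to have zero mean and to be mutually independent.

$$
\begin{aligned}
\hat{\boldsymbol{\mu}} &= \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\eta} + \boldsymbol{\varepsilon}, \\
\boldsymbol{\Lambda} &= \boldsymbol{\Sigma}_{\eta} + \hat{\boldsymbol{\Sigma}}_{\varepsilon}, \\
\hat{\boldsymbol{\beta}}_{\mathrm{GLS}} &= \left(\mathbf{X}^{\mathsf{T}} \boldsymbol{\Lambda}^{-1} \mathbf{X}\right)^{-1} \mathbf{X}^{\mathsf{T}} \boldsymbol{\Lambda}^{-1} \hat{\boldsymbol{\mu}}.
\end{aligned}
$$

In this sketch, $\hat{\boldsymbol{\Sigma}}_{\varepsilon}$ enters the estimator as if it were known, which is the conditioning on step 1 described above. The matrix $\boldsymbol{\Sigma}_{\eta}$ is generally unknown as well and has to be estimated along with $\boldsymbol{\beta}$.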

[16] Such a two-step procedure might be problematic in the context of quantifying the total predictive uncertainty. Indeed, estimates at step 2 are conditional on estimates at step 1. In particular, the covariance matrix of sampling errors is itself an estimate and may be in error (see, e.g., the discussion by *Stedinger and Tasker* [1985] and *Kroll and Stedinger* [1998] on the desirable properties of this estimate). Such errors may then propagate to step 2. Consequently, it would be desirable to avoid splitting the inference process into two steps: this can be achieved using hierarchical models (see section 1.2).

[17] In addition to these issues related to the estimation procedure, the assumptions underlying index flood approaches remain questionable. Indeed, the scale invariance assumption is convenient because it merges all spatial variability into a single parameter (the index-flood parameter). However, this assumption forces the coefficient of variation of data to remain constant throughout a homogeneous region: this might be too restrictive in some regions, or it might force one to drastically reduce the spatial extent of the region to ensure the scale invariance hypothesis is met. Several alternatives have therefore been proposed to move beyond this restrictive framework, in particular, the region of influence approaches [*Burn*, 1990] and recent developments [*Kjeldsen and Jones*, 2009b], normalized quantile regression [*Fill and Stedinger*, 1998], or empirical Bayes procedures [*Kuczera*, 1982a, 1982b, 1983] (see also the discussion provided by *Griffis and Stedinger* [2007]).
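
The link between scale invariance and a constant coefficient of variation can be made explicit in a few lines. The symbols are assumptions introduced for illustration: $Y_i$ is the variable at site $i$, $\mu_i$ its index flood, and $Z$ a dimensionless variable following the common regional distribution.

$$
\begin{aligned}
Y_i &= \mu_i Z, \\
\mathrm{E}[Y_i] &= \mu_i\,\mathrm{E}[Z], \qquad \mathrm{sd}[Y_i] = \mu_i\,\mathrm{sd}[Z], \\
\mathrm{CV}[Y_i] &= \frac{\mathrm{sd}[Y_i]}{\mathrm{E}[Y_i]} = \frac{\mathrm{sd}[Z]}{\mathrm{E}[Z]} \quad \text{for every site } i.
\end{aligned}
$$

The site-specific factor $\mu_i$ cancels, so the coefficient of variation depends only on the regional distribution of $Z$. The same argument applies to any dimensionless ratio of moments or quantiles.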

### 1.2. Bayesian Hierarchical Models

[18] An alternative approach, based on Bayesian hierarchical models, has been explored more recently. In particular, *Wikle et al.* [1998] proposed a general hierarchical framework to describe the spatial variability of the distribution of some environmental variable. The principle of a hierarchical model is to use several modeling layers. For instance, a first layer may assume that the data follow some distribution with unknown parameters, while a second layer may model the variability of those parameters in space, using some regression model. This closely corresponds to the successive steps involved in the standard implementation of RFA approaches. However, the main advantage of a hierarchical model is that all unknown quantities can be inferred simultaneously, therefore accounting for possible interactions between estimation errors made at different layers and yielding a more rigorous quantification of the predictive uncertainty. In other words, hierarchical models make it possible to describe the local variability of data *together* with their regional coherence, without splitting the inference process into several steps. Moreover, such models are more general than the model underlying the index flood approach, since they do not require assuming scale invariance.
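
The layered structure described above can be sketched as follows. This is a generic two-layer example, not the specific model of any cited paper, and the notation is assumed for illustration: $\mathbf{y}$ collects the data from all sites, $\theta_i$ is a distribution parameter at site $i$, $\mathbf{x}_i$ the site characteristics, $\boldsymbol{\beta}$ the regression coefficients, and $\eta_i$ a regression error whose properties are summarized by $\boldsymbol{\Sigma}_{\eta}$.

$$
\begin{aligned}
\text{Layer 1 (data):}\quad & \mathbf{y} \mid \boldsymbol{\theta} \sim p(\mathbf{y} \mid \boldsymbol{\theta}), \\
\text{Layer 2 (parameters):}\quad & \theta_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \eta_i, \quad \text{i.e.,} \quad \boldsymbol{\theta} \mid \boldsymbol{\beta}, \boldsymbol{\Sigma}_{\eta} \sim p(\boldsymbol{\theta} \mid \boldsymbol{\beta}, \boldsymbol{\Sigma}_{\eta}), \\
\text{Joint posterior:}\quad & p(\boldsymbol{\theta}, \boldsymbol{\beta}, \boldsymbol{\Sigma}_{\eta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\theta})\; p(\boldsymbol{\theta} \mid \boldsymbol{\beta}, \boldsymbol{\Sigma}_{\eta})\; p(\boldsymbol{\beta}, \boldsymbol{\Sigma}_{\eta}).
\end{aligned}
$$

Because the posterior is joint in $\boldsymbol{\theta}$, $\boldsymbol{\beta}$, and $\boldsymbol{\Sigma}_{\eta}$, uncertainty in the at-site parameters and in the regression is propagated together rather than passed from one estimation step to the next.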

[19] Several applications of Bayesian hierarchical models in a hydrological context have been proposed in the literature. For instance, *Cooley et al.* [2007] described the spatial variability of extreme rainfall, while *Aryal et al.* [2009] extended this description to both spatial and temporal variability. Similarly, Lima and Lall used Bayesian hierarchical models to describe daily rainfall occurrences [*Lima and Lall*, 2009] or runoff extremes [*Lima and Lall*, 2010] in a regional context. In addition to these hydrological applications, similar Bayesian hierarchical models have been used in other fields, e.g., for extreme wind speed modeling [*Coles and Casson*, 1998; *Casson and Coles*, 1998, 2000].

[20] Despite improving the standard implementation of RFA approaches, all of the Bayesian hierarchical models described above rely on an assumption of conditional independence: data are assumed spatially independent given the values of their distribution's parameters. This would be a valid assumption if most spatial covariation in the data were explained by the spatial covariation in the parameters. However, it is questionable, since spatial dependence between data on the one hand and between parameters on the other hand arises from distinct processes, as noted by *Cooley et al.* [2007]. In a nutshell, data dependence can be interpreted as *weather* spatial dependence, while dependence between parameters (also termed process dependence) relates to *climate* spatial dependence. Weather should exhibit spatial dependence (at least within a short distance range), even if the climate were perfectly known.
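
In symbols, conditional independence amounts to writing the likelihood as a product over sites. The notation is assumed here for illustration: $y_i(t)$ is the observation at site $i$ and time step $t$, $\theta_i$ the parameters at site $i$, and $f$ the at-site density. Independence in time is also assumed in this sketch.

$$
p(\mathbf{y} \mid \boldsymbol{\theta}) = \prod_{t} \prod_{i} f\big(y_i(t) \mid \theta_i\big).
$$

Under this factorization, observations at different sites at the same time step can only be related through their parameters, so that weather-related dependence, which persists for fixed parameters, is not represented.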

[21] Examples of spatial hierarchical models explicitly accounting for intersite dependence in the observations are very few. *Perreault* [2000] proposed such a model to detect a regional step-change in annual runoff. Alternatively, *Micevski et al.* [2006] and *Micevski* [2007] proposed a Bayesian hierarchical regional flood model accounting for data dependence. However, in both cases, the explicit description of data dependence was tied to a particular distributional assumption: Gaussian data were assumed by *Perreault* [2000], while *Micevski et al.* [2006] and *Micevski* [2007] used a mixture of lognormal distributions. Unfortunately, Gaussian-related assumptions may be too restrictive for other hydrologic variables or in other geographical contexts.

### 1.3. Objectives

[22] Building on previous work described in sections 1.1 and 1.2, this paper aims to derive a general Bayesian hierarchical framework for RFA. In particular, this framework should enable an explicit description of spatial dependence between data, without relying on Gaussian-related distributional assumptions. This is achieved by means of the elliptical copula family [*Genest et al.*, 2007], which constitutes a convenient tool to model dependence in a high-dimensional and non-Gaussian context.
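
The following minimal sketch illustrates how a copula separates the dependence structure from the marginal distributions. It assumes a Gaussian copula (one member of the elliptical family), two sites only, and Gumbel margins. The function names, the choice of Gumbel margins, and all parameter values are hypothetical and are not taken from this paper.

```python
import math
import random
from statistics import NormalDist


def gumbel_quantile(u, location, scale):
    """Inverse CDF of the Gumbel distribution (assumed margin), 0 < u < 1."""
    return location - scale * math.log(-math.log(u))


def sample_two_sites(n, rho, margins, seed=0):
    """Draw n pairs with Gumbel margins linked by a Gaussian copula.

    rho is the correlation of the underlying Gaussian pair; margins is a
    list of two (location, scale) tuples, one per site.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        # Correlated standard normal pair (2 x 2 Cholesky factor written out).
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z = (e1, rho * e1 + math.sqrt(1.0 - rho ** 2) * e2)
        row = []
        for z_i, (location, scale) in zip(z, margins):
            # Map to a uniform variate, then to the non-Gaussian margin.
            u = std_normal.cdf(z_i)
            u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            row.append(gumbel_quantile(u, location, scale))
        samples.append(tuple(row))
    return samples


def pearson(xs, ys):
    """Sample Pearson correlation, used here only to inspect the output."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

In this sketch, setting `rho=0.0` yields independent sites for fixed margins, which corresponds to the conditional independence assumption, while a nonzero `rho` introduces dependence between sites without altering either marginal distribution.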

[23] The paper is organized as follows. Section 2 describes the two-level hierarchical framework, with level 1 modeling the joint distribution of observations and level 2 modeling the variation of the distribution parameters in space. Section 3 describes the inference of the hierarchical model and its use for prediction at both gaged and ungaged sites. Section 4 illustrates the application of the proposed framework for the regional estimation of extreme rainfall. Avenues for further improvement are identified and discussed in section 5, before summarizing the main results in section 6.