# Formulation of a mathematical approach to regional frequency analysis

## Abstract

[1] Estimation of design quantiles of hydrometeorological variables at critical locations in river basins is necessary for hydrological applications. To arrive at reliable estimates for locations (sites) where no or limited records are available, various regional frequency analysis (RFA) procedures have been developed over the past five decades. The most widely used procedure is based on the index-flood approach and L-moments. It assumes that values of scale and shape parameters of the frequency distribution are identical across all the sites in a homogeneous region. In real-world scenarios, this assumption may not be valid even if a region is statistically homogeneous. To address this issue, a novel mathematical approach is proposed. It involves (i) identification of an appropriate frequency distribution to fit the random variable being analyzed for the homogeneous region, (ii) use of a proposed transformation mechanism to map observations of the variable from the original space to a dimensionless space where the form of the distribution does not change, and variation in values of its parameters is minimal across sites, (iii) construction of a growth curve in the dimensionless space, and (iv) mapping the curve to the original space for the target site by applying the inverse transformation to arrive at the required quantile(s) for the site. Effectiveness of the proposed approach (PA) in predicting quantiles for ungauged sites is demonstrated through Monte Carlo simulation experiments considering five frequency distributions that are widely used in RFA, and by a case study on watersheds in the conterminous United States. Results indicate that the PA outperforms methods based on the index-flood approach.

## 1. Introduction

[2] Regional frequency analysis (RFA) has gained recognition as a viable option to arrive at design estimates of variables associated with hydrometeorological events such as extreme precipitation and floods at target locations in river basins that are ungauged or have limited records. The analysis involves (i) use of a regionalization approach to identify locations that are similar to the target location (site), in terms of mechanisms influencing the variable being analyzed, to form a homogeneous region, and (ii) use of an RFA approach to fit a distribution to information pooled from the region for arriving at the design estimate. Among the various RFA approaches developed in the past, the index-flood approach [Dalrymple, 1960] has gained wide recognition. It makes the following assumptions: (i) records of the variable at each site in a region are identically distributed; (ii) records at each site are serially independent; (iii) there is no dependence between records at different sites; and (iv) the frequency distribution of the variable is identical across sites in the region, except for a site-specific scaling factor called the index-flood. Of these assumptions, the first three are generally valid for analysis of a random variable representing a hydrometeorological extreme event, but the fourth is specific to the index-flood approach. Implementation of the index-flood approach involves normalization of records of the variable for each site by dividing them by the site's scaling factor and combining information from those normalized records to construct a “dimensionless distribution function” (growth curve) that is assumed to be unique for all the sites in the region. Required quantiles at the target site are estimated by multiplying the growth curve by the site-specific scaling factor, which is often chosen as the mean of the variable. 
Alternate scaling factors that have been considered in previous studies include median, trimmed mean, and quantile of the at-site distribution [Smith, 1989; Hosking and Wallis, 1997].

[3] For the index-flood approach to be effective, the aforementioned assumptions (i)–(iv) should be valid for the records before and after normalization. Validity of the first three assumptions can be ensured by considering the scaling factor to be a population statistic. However, as the population statistic is unknown in real-world scenarios, modelers choose a sample statistic for normalization. As a result, the frequency distribution of the record at each site undergoes a change and assumptions (ii) and (iii) would be violated. These effects of normalization were reported in a number of previous studies [e.g., Stedinger, 1983; Hosking and Wallis, 1997, p. 88; Sveinsson et al., 2001, 2003]. Stedinger [1983] attempted to overcome the problem by implementing the index-flood approach in log-space and suggested use of unbiased moment or probability weighted moment estimators. Boes et al. [1989] suggested use of the method of maximum likelihood to estimate parameters of the growth curve assuming the regional frequency distribution to be Weibull. Sveinsson et al. [2001] extended this approach to the generalized extreme value (GEV) distribution and termed it the population index-flood (PIF) method. A literature review on different aspects related to the index-flood approach can be found in Stedinger and Lu [1995] and Bocchiola et al. [2003].

[4] For situations where sample statistic (mean) is chosen as a scaling factor, Sveinsson et al. [2001] proved through analytical formulations that assumption (ii) of index-flood approach (i.e., independence of records at each site) would be invalid, and frequency distribution of the normalized data would be different from that of the original data. Further, it was argued that the fourth assumption of index-flood approach (i.e., distributions of normalized records would be identical for sites) would be invalid if record lengths of sites in a homogeneous region are different. In real-world scenario, the scale and shape parameters of sites in a homogeneous region may not be close enough to be considered identical, even if the type of frequency distribution is the same for all the sites in the region. It can even be theoretically established that probability of those parameter values being exactly equal for any two sites in a region is zero, even if those sites have unlimited record lengths. The shortcomings associated with index-flood approach motivated development of an alternate mathematical approach to RFA in this paper. The RFA is deemed to be effective if knowledge of location, scale, as well as shape parameters of all the sites is utilized in the analysis, to properly characterize the growth curve (dimensionless distribution function) that represents the region. 
The proposed approach (PA) involves (i) identification of an appropriate frequency distribution to fit the random variable being analyzed for the homogeneous region, (ii) use of a proposed transformation mechanism to map observations of the variable from original space to a dimensionless space where the form of distribution does not change, and variation in values of location, scale, as well as shape parameters of the distribution is minimal across sites, thus satisfying all the assumptions of index-flood approach, (iii) construction of a growth curve in the dimensionless space, and (iv) mapping the growth curve to the original space for the target site by applying proposed inverse transformation to arrive at required quantile(s) for the site.

[5] The remainder of this paper is structured as follows: the index-flood approach and the problem being addressed in this paper are described in section 2. Following that, the methodology for the proposed RFA approach is presented in section 3. Subsequently, effectiveness of the proposed methodology is demonstrated through Monte Carlo simulation experiments and by application to real-world data in section 4. Finally, a summary and concluding remarks are given in section 5.

## 2. Background

[6] Over the past five decades, frequency analysis of a variety of variables associated with hydrometeorological events such as extreme rainfall, wind speed, floods, and droughts has been performed extensively using the index-flood approach [Dalrymple, 1960]. Let $Q_{ij}$ denote the j-th observed record (from $n_i$ data points) of the variable being analyzed at site i in a region comprising N sites. The approach defines the quantile function of the variable at site i as

$$Q_i(F) = \mu_i \, q(F) \tag{1}$$

where μi is the site-specific scaling factor known as index-flood and q(F) is ordinate of regional growth curve corresponding to nonexceedance probability F. The key assumption in equation (1) is that q(F) is common to all the sites in the region and is thus independent of site-specific information.

[7] To construct q(F), data points in the record corresponding to each site in the region are divided by the site's scaling factor $\mu_i$, and information from the resulting normalized records is combined using options such as the station-year method or the regional L-moment method [e.g., Hosking and Wallis, 1997, p. 88]. The former method combines normalized records of all the sites into a single sample, to which an appropriate frequency distribution is fitted to arrive at q(F). The latter method, which is widely used by practitioners, combines summary L-statistics (L-mean, coefficient of L-variation, L-skewness, and L-kurtosis) of the normalized records by averaging them across sites to arrive at regional estimates of those statistics, which are the basis for identification of an appropriate regional frequency distribution and estimation of q(F). The latter method is henceforth referred to as the conventional index-flood (CIF) method in this paper.
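The regional L-moment averaging described in this paragraph can be sketched in Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, sample L-moments are computed from unbiased probability weighted moments, and weighting sites by record length follows the convention of Hosking and Wallis [1997], which is an assumption here because the text only states that the statistics are averaged across sites.

```python
def sample_lmoments(record):
    """Return (l1, t, t3): sample L-mean, L-CV, and L-skewness of one record."""
    x = sorted(record)
    n = len(x)
    # Unbiased probability weighted moments b0, b1, b2 (ranks are 0-based here).
    b0 = sum(x) / n
    b1 = sum(j * x[j] for j in range(n)) / (n * (n - 1))
    b2 = sum(j * (j - 1) * x[j] for j in range(n)) / (n * (n - 1) * (n - 2))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    return l1, l2 / l1, l3 / l2


def regional_lmoment_ratios(records):
    """Normalize each record by its sample mean and average the L-moment ratios.

    Assumption: sites are weighted by record length.
    """
    total = sum(len(r) for r in records)
    t_reg = t3_reg = 0.0
    for r in records:
        index_flood = sum(r) / len(r)            # site scaling factor (sample mean)
        normalized = [v / index_flood for v in r]
        _, t, t3 = sample_lmoments(normalized)   # L-mean of normalized data is 1
        t_reg += len(r) * t / total
        t3_reg += len(r) * t3 / total
    return t_reg, t3_reg
```

Because the L-mean of every normalized record equals 1, the regional L-CV returned here is also the regional second L-moment in the normalized space.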

[8] Monte Carlo simulation experiments were performed in the present study to investigate effectiveness of CIF method by examining errors in estimates of location, scale, and shape parameters with respect to their true population values, considering scaling factor as population mean statistic. Each experiment involved

[9] (i) simulation of a realization of homogeneous region comprising N sites (each having record length n), based on each one of the five commonly used theoretical distributions (generalized logistic (GLO), GEV, generalized Pareto (GPA), generalized normal (GNO), and Pearson type III (PE3)) and L-moments procedure [Hosking and Wallis, 1997] by specifying first L-moment (λ1), second L-moment (λ2), and L-skew (τ3);

[10] (ii) normalization of record of each site using population mean statistic (λ1) as scaling factor, and use of L-statistics corresponding to those normalized records to estimate at-site values for location, scale, and shape parameters of the distribution considered for simulation;

[11] (iii) computation of regional average L-statistics and using those to estimate regional values for location, scale, and shape parameters by considering regional frequency distribution to be the same as the distribution considered for simulation; and

[12] (iv) comparison of at-site and regional values of location, scale, and shape parameters with their true population values in the normalized space.
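Steps (i) and (ii) of the experiment can be sketched for the GEV case as follows. The sketch assumes the GEV parameterization and the approximation for the shape parameter given by Hosking and Wallis [1997]; the function names, the random seed, and the chosen values of λ1, λ2, and τ3 are illustrative only.

```python
import math
import random


def gev_from_lmoments(lam1, lam2, tau3):
    """GEV (xi, alpha, k) from L-moments, using Hosking's approximation for k."""
    c = 2.0 / (3.0 + tau3) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c * c
    g = math.gamma(1.0 + k)
    alpha = lam2 * k / ((1.0 - 2.0 ** (-k)) * g)
    xi = lam1 - alpha * (1.0 - g) / k
    return xi, alpha, k


def gev_quantile(F, xi, alpha, k):
    """GEV quantile (k != 0) at nonexceedance probability F."""
    return xi + alpha * (1.0 - (-math.log(F)) ** k) / k


def simulate_region(n_sites, n_years, lam1, lam2, tau3, rng):
    """One realization of a homogeneous region with independent site records."""
    xi, alpha, k = gev_from_lmoments(lam1, lam2, tau3)
    return [[gev_quantile(rng.uniform(1e-12, 1.0 - 1e-12), xi, alpha, k)
             for _ in range(n_years)]
            for _ in range(n_sites)]


rng = random.Random(1)
region = simulate_region(n_sites=25, n_years=50, lam1=100.0, lam2=20.0, tau3=0.1, rng=rng)
# Step (ii): normalize with the population mean (lambda_1) as scaling factor.
normalized = [[v / 100.0 for v in record] for record in region]
```

At-site and regional parameter estimates in the normalized space would then be obtained by applying `gev_from_lmoments` to the sample L-statistics of `normalized`.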

[13] In these experiments, region construction ensured validity of the assumptions of the index-flood approach. To examine the effect of region and sample sizes, two values for N (25 and 50) and three values for n (25, 50, 100) were considered. Overall, the experiments revealed that errors in regional estimates of scale and shape parameters were significant with respect to their true population values, whereas those for the location parameter were reasonable. Further, at-site values of all the parameters were found to be scattered with respect to their regional estimates. Scatter in the case of the location parameter was found to be marginal, while that for scale and shape parameters was found to be high, even though all the sites correspond to a realization of a homogeneous region. These inferences were found to be valid irrespective of the form of regional frequency distribution, the number of sites in a region, and sizes possible for sites' record length in real-world scenarios. For brevity, results corresponding to a realization of a homogeneous region generated based on the GEV distribution are shown for n = 50 in Figure 1. The results suggest the need to devise an alternative effective methodology for RFA that would yield regional estimates of location, scale, and shape parameters, which are less biased with respect to their population values in the space where the regional growth curve is constructed. It is hoped that the curve forms the basis for arriving at better quantile estimates in the original space for sparsely gauged as well as ungauged sites. This motivated the authors to develop a new transformation mechanism for mapping observations of the variable being analyzed from the original space to a dimensionless space where deviations of regional estimates of all the parameters with respect to their population values as well as at-site estimates are minimal.

## 3. Methodology for Proposed Approach to RFA

[14] Suppose there are N sites in a region that is delineated to be homogeneous with respect to a random variable X (e.g., peak flow). Let x denote an observation (data point) corresponding to X. Implement the following steps to arrive at regional quantile function for a target site in the region.

[15] (i) Identify an appropriate regional frequency distribution to fit X. In real-world scenario, the distribution can be identified using observations (data) corresponding to sites in the region by an effective regional goodness of fit (GOF) test.

[16] (ii) Map observations corresponding to X from the original space to those corresponding to random variable Y in a dimensionless space, such that the form of the frequency distribution of Y is the same as that of X, and variation in at-site values of location, scale, as well as shape parameters of the distribution is minimal. To facilitate mapping, equation (2) is proposed when X follows GLO, GEV, GPA, or GNO distributions, whereas equation (3) is proposed when X follows PE3 distribution.

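$$y = -\frac{1}{k_X}\ln\left[1 - \frac{k_X\,(x - \xi_X)}{\alpha_X}\right] \tag{2}$$

$$y = \frac{x - \xi_X}{\beta_X} \tag{3}$$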

where ξX denotes the location parameter, and αX and kX in equation (2) denote the scale and shape parameters, respectively, whereas βX in equation (3) represents the scale parameter of the frequency distribution of X. The equation of the cumulative distribution function (CDF) of X corresponding to GLO, GEV, GPA, and GNO distributions is given in Table 1, while that for PE3 distribution is provided in Table 2. The CDF of Y that follows GLO, GEV, GPA, or GNO distributions, and the corresponding values for L-moments and parameters are given in Table 3, while those for PE3 distribution are provided in Table 4. It may be noted that the values of location, scale, and shape parameters for GLO, GEV, GPA, and GNO populations are 0, 1, and 0, respectively. Further, the values of location and scale parameters for the PE3 population are 0 and 1, respectively, whereas the value of the shape parameter is the same as that in the original space. Details pertaining to derivation of population parameter values and the corresponding equations for population growth curves in the dimensionless space are provided in the Appendix.
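The mapping in step (ii) can be checked numerically. The sketch below assumes the Hosking and Wallis [1997] parameterization of the GLO, GEV, GPA, and GNO distributions, under which the reduced variate is y = -ln[1 - k(x - ξ)/α]/k; this form is consistent with the dimensionless parameter values (0, 1, 0) stated in the text. Function names and parameter values are hypothetical.

```python
import math


def to_dimensionless(x, xi, alpha, k):
    """Map x to y for GLO, GEV, GPA, or GNO (assumed Hosking-Wallis form)."""
    if k == 0.0:
        return (x - xi) / alpha
    return -math.log(1.0 - k * (x - xi) / alpha) / k


def to_dimensionless_pe3(x, xi, beta):
    """Map x to y for PE3: remove location and scale, leave the shape untouched."""
    return (x - xi) / beta


# Check with a GEV population (hypothetical parameter values).
xi, alpha, k = 90.0, 30.0, 0.1
F = 0.99
x_F = xi + alpha * (1.0 - (-math.log(F)) ** k) / k   # GEV quantile in the original space
y_F = to_dimensionless(x_F, xi, alpha, k)
gumbel_y = -math.log(-math.log(F))                    # GEV quantile with parameters (0, 1, 0)
```

Here `y_F` and `gumbel_y` agree to rounding error, which illustrates that a GEV quantile in the original space maps to the quantile of a GEV with location 0, scale 1, and shape 0.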

Table 1. Formulations Related to GLO, GEV, GPA, and GNO Frequency Distributions for the Random Variable X<sup>a</sup>

Columns: GLO, GEV, GPA, GNO. Rows: $F_X(x)$, range of $x$, $k_X$, $\alpha_X$, $\xi_X$, $x(F)$.

<sup>a</sup> $F_X(x)$, $\xi_X$, $\alpha_X$, and $k_X$ denote the cumulative distribution function, location, scale, and shape parameters, respectively. $x(F)$ is the quantile estimation function.

Table 2. Formulations Related to PE3 Frequency Distribution<sup>a</sup>

Column: PE3. Rows: $F_X(x)$, gamma function, incomplete gamma function, range of $x$, $\alpha$, $\beta_X$, $\xi_X$, $x(F)$. In the expression for $x(F)$, $K_T$ is a frequency factor that can be computed using approximations such as Wilson-Hilferty and Cornish-Fisher transformations [Rao and Hamed, 2000, pp. 146–148].

<sup>a</sup> $F_X(x)$, $\xi_X$, $\beta_X$, and $\alpha$ denote the cumulative distribution function and parameters related to random variable X, and $x(F)$ is the quantile estimation function.

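Table 3. Formulations Related to GLO, GEV, GPA, and GNO Frequency Distributions for the Random Variable Y<sup>a</sup>

| Quantity | GLO | GEV | GPA | GNO |
| --- | --- | --- | --- | --- |
| $F_Y(y)$ | $1/(1 + e^{-y})$ | $\exp(-e^{-y})$ | $1 - e^{-y}$ | $\Phi(y)$ |
| Range of $y$ | $-\infty < y < \infty$ | $-\infty < y < \infty$ | $0 \le y < \infty$ | $-\infty < y < \infty$ |
| $\lambda_1$ | 0 | 0.5772 | 1 | 0 |
| $\lambda_2$ | 1 | 0.6931 | 0.5 | $1/\sqrt{\pi}$ |
| $\lambda_3$ | 0 | 0.1178 | 1/6 | 0 |
| $k_Y$ | 0 | 0 | 0 | 0 |
| $\alpha_Y$ | 1 | 1 | 1 | 1 |
| $\xi_Y$ | 0 | 0 | 0 | 0 |
| $y(F)$ | $\ln[F/(1-F)]$ | $-\ln(-\ln F)$ | $-\ln(1-F)$ | $\Phi^{-1}(F)$ |

<sup>a</sup> $F_Y(y)$ is the cumulative distribution function; $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the first three L-moments; $\xi_Y$, $\alpha_Y$, and $k_Y$ denote location, scale, and shape parameters, respectively; $y(F)$ is the population growth curve; and $\Phi$ is the standard normal cumulative distribution function.
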
Table 4. Formulations Related to the Random Variable Y in the Case of PE3 Frequency Distribution<sup>a</sup>

Column: PE3. Rows: $F_Y(y)$, range of $y$, first two L-moments, $\alpha$, $\beta_Y$ (= 1), $\xi_Y$ (= 0), $y(F)$.

<sup>a</sup> $F_Y(y)$ is the cumulative distribution function; $\xi_Y$, $\beta_Y$, and $\alpha$ denote parameters related to the distribution of random variable Y; and $y(F)$ is the population growth curve.

[17] (iii) Compute L-statistics corresponding to each of the sites in the dimensionless space using values obtained from mapping of observations and use those as the basis to estimate regional average L-statistics.

[18] (iv) Estimate location, scale, and shape parameters of regional frequency distribution using the regional average L-statistics and construct growth curve in the dimensionless space.

[19] (v) To arrive at regional quantile function for the target site, map the growth curve to the original space by applying proposed inverse transformation equation. Use equation (4) if regional frequency distribution is among GLO, GEV, GPA, or GNO, and equation (5) if it is PE3.

$$x(F) = \hat{\xi}_X + \frac{\hat{\alpha}_X}{\hat{k}_X}\left\{1 - \exp\left[-\hat{k}_X\, y(F)\right]\right\} \tag{4}$$

$$x(F) = \hat{\xi}_X + \hat{\beta}_X\, y(F) \tag{5}$$

where $\hat{\xi}_X$ denotes the location parameter, $\hat{\alpha}_X$ and $\hat{k}_X$ in equation (4) represent the scale and shape parameters, respectively, and $\hat{\beta}_X$ in equation (5) represents the scale parameter corresponding to the target site. The subscript X indicates that all the parameters are estimated in the original space. Those parameters can be reliably estimated using observations at the target site if the record length for that site is large enough. However, if the site is ungauged or has inadequate data, the required parameters can be estimated based on regional information by various methods. One option is to estimate those parameters using regional average values of L-statistics in the equations corresponding to those parameters given in Tables 1 and 2. An alternative option is to estimate those parameters by using regression relationships developed between each of them and site-specific attributes that influence the variable being analyzed. The site-specific attributes should be those that are readily available even for ungauged locations. For example, catchment area, slope, drainage density, and soil characteristics could be considered as attributes in the case of RFA of floods.
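Steps (iv) and (v) can be sketched for a GEV regional distribution as follows. The sketch assumes the Hosking and Wallis [1997] parameterization, so that the inverse mapping is x = ξ + (α/k)[1 - exp(-k y)] for GLO, GEV, GPA, and GNO, and x = ξ + β y for PE3. The function names and all numerical values (the regional fit in the dimensionless space and the target-site parameters) are hypothetical.

```python
import math


def gev_growth_curve(F, xi_y, alpha_y, k_y):
    """Regional growth curve y(F) for a GEV fitted in the dimensionless space."""
    if abs(k_y) < 1e-9:                       # Gumbel limit as the shape tends to 0
        return xi_y - alpha_y * math.log(-math.log(F))
    return xi_y + alpha_y * (1.0 - (-math.log(F)) ** k_y) / k_y


def to_original(y, xi_x, alpha_x, k_x):
    """Inverse transformation for GLO, GEV, GPA, or GNO."""
    if k_x == 0.0:
        return xi_x + alpha_x * y
    return xi_x + alpha_x * (1.0 - math.exp(-k_x * y)) / k_x


def to_original_pe3(y, xi_x, beta_x):
    """Inverse transformation for PE3."""
    return xi_x + beta_x * y


# Hypothetical regional fit in the dimensionless space and target-site parameters.
regional_fit = (0.02, 0.98, 0.01)
target = (90.0, 30.0, 0.1)
quantiles = {T: to_original(gev_growth_curve(1.0 - 1.0 / T, *regional_fit), *target)
             for T in (25, 50, 100, 200)}
```

If the regional fit were exactly (0, 1, 0), the mapped curve would coincide with the GEV quantile function defined by the target-site parameters; departures of the regional fit from (0, 1, 0) carry the pooled regional information into the estimate.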

## 4. Performance Assessment

[20] The proposed RFA approach was evaluated for its effectiveness by performing Monte Carlo simulation experiments in section 4.1 and by application to data pertaining to watersheds in conterminous United States in section 4.2. In both the sections, results obtained from the proposed approach are compared with those from methods related to index-flood approach to assess their relative performance.

### 4.1. Simulation Experiments

[21] Simulation experiments were designed to demonstrate the potential of proposed RFA approach over methods related to index-flood approach in estimating quantiles for ungauged sites. Each experiment involved simulation of Nsim (=1000) realizations of a homogeneous region comprising N sites (each having record length n), based on one of the five frequency distributions (GLO, GEV, GPA, GNO, PE3) and method of L-moments, for specified values of λ1, λ2, and τ3. In each realization, values in synthetic record corresponding to any site were independent among themselves and independent with respect to those corresponding to other sites.

[22] The value of λ1 was, without loss of generality, set to 100 for each site. To examine whether results from the experiment are sensitive to the values of coefficient of L-variation and τ3, experiments were repeated for various (τ, τ3) pairs considered by Viglione et al. [2007]. The pairs are deemed to represent parameters corresponding to annual maximum, mean, and minimum streamflows at about 1400 sites that are part of U.S. Geological Survey (USGS) Hydro-Climatic Data Network (HCDN) in the conterminous United States.

[23] In a realization, one site at a time was considered to be ungauged and the regional quantile function corresponding to the site was constructed based on information from the remaining gauged sites in the realization by RFA using the proposed approach, CIF method, and PIF method [Sveinsson et al., 2001]. In this analysis, the best fit at-site and regional frequency distributions were assumed to be the same as the distribution considered for simulating the realization. RFA with the proposed approach was based on the steps presented in section 3. The location, scale, and shape parameters required in the inverse transformation equation were estimated using regional average values of L-statistics.

[24] For RFA with CIF method, values in synthetic record corresponding to each of the gauged sites in the realization were first normalized using their respective “sample mean statistic” (scaling factor). Following this, regional average values of L-statistics were computed based on at-site values of those statistics in the normalized space. Subsequently, those statistics were used to compute location, scale, and shape parameters of the regional frequency distribution, and growth curve (CDF) was constructed using the parameters. Regional quantile function for the ungauged site was then constructed in the original space by multiplying the growth curve with the ungauged site's scaling factor that was considered to be the average of scaling factors corresponding to the gauged sites in the realization.
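The CIF procedure for an ungauged site can be sketched for the GEV case as follows. This is an illustration under stated assumptions, not the authors' implementation: function names are hypothetical, L-statistics are averaged with equal weights (all records have the same length here), and the GEV fit uses the approximation of Hosking and Wallis [1997].

```python
import math
import random


def lmoment_ratios(record):
    """Sample (l1, t, t3) from unbiased probability weighted moments."""
    x = sorted(record)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum(j * x[j] for j in range(n)) / (n * (n - 1))
    b2 = sum(j * (j - 1) * x[j] for j in range(n)) / (n * (n - 1) * (n - 2))
    l2 = 2 * b1 - b0
    return b0, l2 / b0, (6 * b2 - 6 * b1 + b0) / l2


def gev_fit(lam1, lam2, tau3):
    """GEV (xi, alpha, k) from L-moments (Hosking's approximation for k)."""
    c = 2.0 / (3.0 + tau3) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c * c
    g = math.gamma(1.0 + k)
    alpha = lam2 * k / ((1.0 - 2.0 ** (-k)) * g)
    return lam1 - alpha * (1.0 - g) / k, alpha, k


def gev_quantile(F, xi, alpha, k):
    return xi + alpha * (1.0 - (-math.log(F)) ** k) / k


def cif_quantile(gauged_records, F):
    """CIF estimate of the F-quantile at an ungauged site."""
    stats = [lmoment_ratios(r) for r in gauged_records]
    n_sites = len(stats)
    # L-CV and L-skewness are unchanged when a record is divided by its own
    # sample mean, and the L-mean of every normalized record is 1.
    t_reg = sum(s[1] for s in stats) / n_sites
    t3_reg = sum(s[2] for s in stats) / n_sites
    growth = gev_quantile(F, *gev_fit(1.0, t_reg, t3_reg))
    index_flood = sum(s[0] for s in stats) / n_sites   # average of gauged sample means
    return index_flood * growth


# Illustrative realization: 24 gauged sites, 50 values each, GEV population.
rng = random.Random(7)
true_params = gev_fit(100.0, 20.0, 0.1)
gauged = [[gev_quantile(rng.uniform(1e-12, 1.0 - 1e-12), *true_params) for _ in range(50)]
          for _ in range(24)]
estimate = cif_quantile(gauged, 0.99)
truth = gev_quantile(0.99, *true_params)
```

Comparing `estimate` with `truth` over many realizations is what the performance measures of this section summarize.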

[25] The PIF method was implemented only on realizations generated based on the GEV distribution (population), for which the regional frequency distribution is GEV following the foregoing assumption. This is because it is the only distribution for which L-moment based equations for the PIF method were available from Sveinsson et al. [2001, p. 2747], and the purpose of this study is not to develop formulations for implementing the PIF method on realizations/regions for which the regional frequency distribution is other than GEV. To apply the PIF method on a realization, at-site values of L-statistics (coefficient of L-variation and L-skewness) were computed for gauged sites in the realization using their synthetic records, and those values were used to compute regional average values of the L-statistics ($\tau^R$ and $\tau_3^R$). The regional shape parameter corresponding to the realization was then estimated as

(6)

where

(7)

[26] The ratio of scale parameter to location parameter of the regional frequency distribution, which is assumed to be a constant value for all the sites in the realization, was estimated as

(8)

[27] Scale parameter corresponding to the ungauged site was then estimated as

(9)

where $l_2$ is the second sample L-moment corresponding to the ungauged site, which was taken to be the average of the second sample L-moment values corresponding to the gauged sites in the realization. The location parameter for the ungauged site was then estimated as

(10)

[28] The regional quantile function for the ungauged site corresponding to the PIF method was constructed using the estimated location and scale parameters for the site together with the regional shape parameter kX.
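The sequence of PIF computations described above (regional shape parameter, scale-to-location ratio, then site scale and location) can be sketched as follows. The sketch relies on the standard GEV L-moment relations and shape approximation of Hosking and Wallis [1997]; it is an assumption that these agree in form with the expressions of Sveinsson et al. [2001], which are not reproduced here. The function name and inputs are hypothetical.

```python
import math


def pif_parameters(tau_reg, tau3_reg, l2_site):
    """Return (xi, alpha, k) for a site from regional L-CV, regional L-skewness,
    and a site-level second L-moment."""
    c = 2.0 / (3.0 + tau3_reg) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c * c                        # regional shape parameter
    g1 = (1.0 - math.gamma(1.0 + k)) / k                   # lambda_1 = xi + alpha * g1
    g2 = (1.0 - 2.0 ** (-k)) * math.gamma(1.0 + k) / k     # lambda_2 = alpha * g2
    ratio = tau_reg / (g2 - tau_reg * g1)                  # alpha / xi, common to all sites
    alpha = l2_site / g2                                   # site scale parameter
    xi = alpha / ratio                                     # site location parameter
    return xi, alpha, k


xi, alpha, k = pif_parameters(tau_reg=0.2, tau3_reg=0.1, l2_site=20.0)
```

With population inputs (τ = 0.2, τ3 = 0.1, λ2 = 20), the implied mean ξ + α[1 - Γ(1 + k)]/k equals λ2/τ = 100, which shows that the mean itself never enters the calculation.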

[29] The regional quantile function constructed for each of the ungauged sites in a realization using proposed approach as well as each method (CIF, PIF) was compared with the known population quantile function corresponding to the distribution that was the basis for simulating the Nsim realizations. The error was quantified for five return periods (T = 25, 50, 75, 100, and 200 years) or nonexceedance probabilities (F = 0.96, 0.98, 0.9867, 0.99, and 0.995) in terms of three performance measures: relative bias (R-bias), absolute relative bias (AR-bias), and relative root mean square error (R-RMSE). For each of the Nsim realizations, the measures corresponding to T year return period are estimated as

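$$\text{R-bias} = \frac{1}{N}\sum_{i=1}^{N}\frac{\hat{x}_i^T - x_i^T}{x_i^T} \tag{11}$$

$$\text{AR-bias} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{\hat{x}_i^T - x_i^T}{x_i^T}\right| \tag{12}$$

$$\text{R-RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\hat{x}_i^T - x_i^T}{x_i^T}\right)^2} \tag{13}$$

where $x_i^T$ and $\hat{x}_i^T$ denote the population quantile and the regional estimate of the quantile, respectively, corresponding to site i and T year return period. An approach or method could be considered effective if the corresponding values for the measures are closer to zero.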
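The three measures can be computed for one realization and one return period as sketched below. The sign convention (estimate minus population value) and the absence of a percentage scaling are assumptions of this sketch; the function name is hypothetical.

```python
import math


def performance_measures(true_q, est_q):
    """Return (R-bias, AR-bias, R-RMSE) over sites for one realization and one T."""
    rel = [(e - q) / q for q, e in zip(true_q, est_q)]   # relative error per site
    n = len(rel)
    r_bias = sum(rel) / n
    ar_bias = sum(abs(r) for r in rel) / n
    r_rmse = math.sqrt(sum(r * r for r in rel) / n)
    return r_bias, ar_bias, r_rmse


# Two sites with errors of +10% and -10%: the signed errors cancel in R-bias
# but not in AR-bias or R-RMSE.
r_bias, ar_bias, r_rmse = performance_measures([100.0, 200.0], [110.0, 180.0])
```

The example shows why R-bias alone can hide errors of opposite sign, whereas AR-bias and R-RMSE expose them.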

[30] Boxplots representing Nsim values of R-bias, AR-bias, and R-RMSE for each return period were prepared corresponding to each of the five distributions (GLO, GEV, GPA, GNO, and PE3), two region sizes (N = 25 and 50), three sample sizes (n = 25, 50, and 100), and four (τ, τ3) pairs: (0.2, 0.1), (0.3, 0.2), (0.4, 0.3), and (0.5, 0.4). In general, R-bias, AR-bias, and R-RMSE values were found to decrease with increase in sample size, and increase with increase in return period and (τ, τ3) values. Comparison of R-bias related boxplots corresponding to PA with those related to CIF and PIF methods indicated that for boxplots corresponding to PA, the medians are closer to zero and the interquartile and 5–95 percentile ranges are smaller. Further, comparison of AR-bias and R-RMSE related boxplots revealed that values of those measures are significantly lower for PA, and the performance of PA is followed by that of CIF and PIF methods. Regional quantile estimates corresponding to CIF and PIF methods were, in general, significantly biased (positively as well as negatively) with respect to population quantiles, while those corresponding to PA were less biased. This cannot be detected based on R-bias alone but can be inferred by noting that AR-bias is significantly larger for CIF and PIF methods than for PA, though medians of R-bias are near zero for all the methods. For brevity, results on R-bias and AR-bias corresponding to GLO and GEV distributions, and R-RMSE corresponding to GEV, GPA, and PE3 distributions are presented in Figures 2–6, respectively.

[31] With regard to estimation of regional quantile for ungauged sites, the better performance of PA could be attributed to construction of growth curve in the proposed dimensionless space, where deviations of regional estimates of all the parameters with respect to their population values as well as at-site estimates are minimal. Relatively inferior performance by CIF method can be attributed to construction of growth curve using parameters that are marginally biased with respect to their respective population values, while inferior performance by PIF method can be attributed to construction of growth curve (regional quantile function) using site-specific location and scale parameters that are largely biased with respect to the population values of those parameters. To demonstrate this, plots showing comparison of at-site values of the parameters and their regional estimates with population values are presented for PIF method and PA alongside those for CIF method in Figure 1, considering the GEV distribution based realization that was the basis for CIF method based plots.

[32] The performance of PA was compared with that of CIF and PIF methods for the case where the population mean corresponding to each site in a realization is assumed to be known in the Monte Carlo simulation experiments. To examine if the assumption results in an improvement in the performance of an approach/method, the corresponding boxplot related to each of the three performance measures (R-bias, AR-bias, and R-RMSE) was compared with that resulting from the approach/method for the case where population mean value for each site is unknown. The results indicated marginal improvement in the case of PA, substantial improvement in the case of CIF method, and no improvement in the case of PIF method when population mean is known. Nevertheless, values of performance measures corresponding to PA were consistently lower than those corresponding to methods related to index-flood approach when population values are known. Insensitivity of results from PIF method to knowledge on population mean value is expected, because mean statistic is not utilized by the method for estimation of parameters. This can be noted from equations (6) to (10) that form the basis for construction of regional quantile function for a target site by the method. For brevity, results on R-bias, AR-bias, and R-RMSE corresponding to GEV distribution are presented in Figure 7.

[33] To test robustness of PA to misspecification of regional frequency distribution (i.e., regional distribution being different from population), additional Monte Carlo simulation experiments were designed. Each experiment involved simulation of Nsim realizations of a homogeneous region comprising N sites (each having record length n), based on a frequency distribution (population) chosen from among GLO, GEV, GPA, GNO, and PE3 distributions, as done in the previous simulation experiments. Subsequently, the regional frequency distribution was assumed to be different from the population distribution that was considered for simulating the realizations. The regional quantile function constructed for a realization using PA as well as CIF and PIF methods was compared with the known true (population) quantile function corresponding to the distribution that was the basis for simulating the Nsim realizations, and the error was quantified in terms of three performance measures (R-bias, AR-bias, and R-RMSE) for five return periods (T = 25, 50, 75, 100, and 200 years) and then visualized in the form of boxplots, as done in the previous simulation experiments. The PIF method was applied only when the regional frequency distribution is GEV, for the reasons already mentioned. Values of performance measures obtained with the PA were consistently lower than those resulting from the use of methods related to index-flood approach, indicating that the PA outperforms index-flood approach related methods even in situations where the form of regional frequency distribution is misspecified. To examine robustness in performance of a method, the corresponding boxplot related to each of the three performance measures was compared with that resulting from the method for the case where the form of regional frequency distribution was assumed to be the same as the distribution (population) that was the basis for simulating the realizations. 
The difference between the two boxplots was found to be marginal in the case of each performance measure for PA as well as each of the methods related to index-flood approach. This indicates that all the methods are robust to misspecification of frequency distribution. For brevity, results from two experiments considering N = 25, n = 25, τ = 0.4, and τ3 = 0.3 are presented in Figure 8 for T = 50, 100, and 200 years. In the first experiment, realizations of a homogeneous region were based on GEV population, but the regional frequency distribution was assumed to be GNO. In the second experiment, the population and the regional frequency distribution considered in the first experiment were interchanged. The ability of the proposed approach to consistently yield lower and robust values for performance measures could be attributed to contiguity of CDFs/growth curves (corresponding to the considered frequency distributions in the nondimensional space defined in this study through proposed transformations), in the range of nonexceedance probabilities or return periods that are of interest in the context of extreme value analysis (Figure 9).

[34] Overall, Monte Carlo simulations reveal that the proposed approach outperforms CIF and PIF methods in predicting quantiles for ungauged locations irrespective of whether (i) population mean statistic corresponding to ungauged sites is known and (ii) the form of regional frequency distribution is misspecified. Further, the results are consistent for various populations, region, and sample sizes considered.

### 4.2. Application to Real-World Data

[35] Effectiveness of the proposed RFA approach in real-world scenarios is demonstrated by application to annual maximum flows (AMFs) corresponding to 884 sites in the conterminous United States, which were found to be stationary by Mann-Kendall [Mann, 1945; Kendall, 1955] and KPSS [Kwiatkowski et al., 1992] tests at 90% confidence level. Each of the chosen sites had at least 10 years of record and mean annual peak flow greater than 50 ft³/s. Locations of those sites are shown in Figure 10. The flow records were extracted from USGS HCDN, and they meet certain accuracy criteria specified in Slack et al. [1993].
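A trend screening of the kind mentioned above can be sketched with the Mann-Kendall statistic. The sketch is illustrative only: the function name is hypothetical, the normal approximation with continuity correction is used, ties are assumed absent (so no tie correction is applied to the variance), and the 90% confidence level is interpreted as a two-sided significance level of 0.10. The KPSS test is not covered.

```python
import math


def mann_kendall(series, alpha=0.10):
    """Two-sided Mann-Kendall trend test; returns (S, Z, p_value, trend_detected).

    Assumption: no tied values, so the tie correction to the variance is omitted.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)        # sign of each pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal p-value
    return s, z, p_value, p_value < alpha


s_up, z_up, p_up, trend_up = mann_kendall(list(range(1, 11)))     # steadily increasing
s_flat, z_flat, p_flat, trend_flat = mann_kendall([3, 6, 1, 5, 2, 4])
```

A record would be retained as stationary under this screen when `trend_detected` is false.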

[36] Before proceeding to RFA, “delineation of homogeneous regions” (regionalization) is necessary. A variety of approaches are available for this purpose, and none of them has been proven to be universally superior. Accordingly, one of the widely used approaches, referred to as “region of influence” (ROI) [Burn, 1990], was chosen for regionalization in this study. To implement this approach, an inventory of attributes corresponding to the watersheds was necessary. It was prepared based on the database developed by Kroll et al. [2004] on topographic, physiographic, hydrogeologic, geologic, and climatic variables for watersheds corresponding to gauges (sites) in HCDN. In addition, flood seasonality measures, which formed the sole basis for several regionalization studies in recent years [e.g., Cunderlik and Burn, 2006], were computed for each of the sites based on the dates of occurrence of the flood events, but the results indicated that watersheds in the study area do not show a strong seasonal flood response. The inventory was scrutinized to identify nonredundant attributes that are fairly well correlated with the mean of AMFs. The attributes selected based on this analysis were drainage area, main channel slope, mean annual precipitation, and base flow recession constant based on daily streamflow. Those attributes, together with two location indicators (latitude and longitude), were chosen as features for regionalization. Use of location indicators ensures that sites (depicting watersheds) in a region are not too far apart in geographical space, because geographically nearby watersheds could exhibit similar flood response due to similarities in the causative precipitation events. In practice, caution should be exercised to ensure that the chosen features can be obtained even for ungauged locations (sites) in the study area. Among the chosen six features, “mean annual precipitation” and “base flow recession constant” could be estimated for ungauged locations by spatial interpolation of the respective feature values corresponding to gauged locations in the study area, and the remaining four features could be estimated even if no historical (past) records are available.

[37] Among the six features, values of “drainage area” were quite large, and their distribution was highly skewed. Consequently, those values were log-transformed. Subsequently, values of each feature were scaled by subtracting the feature's minimum value and then dividing by its range, so that the resulting values lay between 0 and 1. One can instead consider other forms of scaling that would give a greater (lesser) range to features deemed more (less) important in depicting the flood-producing mechanism for watersheds in the study area. A feature vector representing each watershed was prepared from the corresponding values of the six scaled features.
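The scaling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `scale_features`, the column layout, and the sample values are hypothetical, and the natural logarithm is an assumption because the base of the logarithm is not stated.

```python
import numpy as np

def scale_features(features, log_columns=(0,)):
    """Log-transform selected columns, then min-max scale every column to [0, 1].

    features    : (n_sites, n_features) array of raw attribute values
    log_columns : indices of highly skewed columns (assumed here: drainage area)
    """
    x = np.asarray(features, dtype=float).copy()
    for j in log_columns:
        x[:, j] = np.log(x[:, j])  # assumes strictly positive values
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    if np.any(col_range == 0):
        raise ValueError("a feature is constant across sites, so its range is zero")
    return (x - col_min) / col_range

# Hypothetical values for three watersheds:
# drainage area, main channel slope, mean annual precipitation
raw = np.array([[10.0, 5.0, 800.0],
                [100.0, 20.0, 1200.0],
                [1000.0, 12.5, 1000.0]])
scaled = scale_features(raw)
```

Because min-max scaling removes any positive multiplicative constant, the scaled drainage-area values do not depend on which logarithm base is used.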

[38] To assess effectiveness of the proposed RFA approach in predicting peak flow quantiles for ungauged sites, one site at a time (from among 884 sites) was considered to be ungauged, and pooling group for the site was prepared using ROI approach. For this purpose, other sites were ranked in ascending order of their Euclidean distance to the ungauged site in the 6-D space of the scaled features. Following this, those sites were considered one at a time (in order of their distance) and assigned to the pooling group until collective record length of all the sites in the group exceeded 500 station-years. This ensures that pooled information is adequate to determine quantiles corresponding to return period T up to 100 years, as per 5T rule [Institute of Hydrology, 1999], and adequate sites are available to develop regression relationship using information in the group to estimate first and second L-moments for the ungauged site. AMFs corresponding to sites in the pooling group were considered as the basis for testing homogeneity of the group using regional heterogeneity measures [Hosking and Wallis, 1997, p. 63] and identifying regional frequency distribution using regional GOF test [Hosking and Wallis, 1997, p. 81] with 90% confidence level. The heterogeneity measures included H1 (based on τ), H2 (based on τ and τ3), and H3 (based on τ3 and L-kurtosis). In practical applications, a group is regarded as sufficiently or acceptably homogeneous when H < 2. The percentage of sufficiently homogeneous pooling groups as per H1, H2, and H3 was found to be about 26%, 66%, and 87%, respectively. Percentage of homogeneous pooling groups as per H1 is often low, as it is deemed to be the strictest homogeneity measure [Cunderlik and Burn, 2006, pp. 7–8]. 
In results based on GOF test, the distribution for which GOF measure was sufficiently close to zero was selected as regional frequency distribution, which was found to be GLO for 239 groups, GEV for 301 groups, GNO for 235 groups, PE3 for 79 groups, and GPA for 30 groups. Information on the distribution corresponding to each of the sites is provided in Figure 10.
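The pooling-group construction described in this paragraph can be sketched as below. The function name `form_pooling_group` and the toy data are hypothetical. The 500 station-year threshold follows the text and corresponds to 5T with T = 100 years.

```python
import numpy as np

def form_pooling_group(scaled_features, record_lengths, target, min_station_years=500):
    """Pool the nearest sites (Euclidean distance in scaled feature space)
    until their collective record length exceeds `min_station_years`.

    Returns the pooled site indices (nearest first) and the total station-years.
    The target site is treated as ungauged and is never pooled.
    """
    x = np.asarray(scaled_features, dtype=float)
    n = np.asarray(record_lengths)
    dist = np.linalg.norm(x - x[target], axis=1)
    order = [int(i) for i in np.argsort(dist, kind="stable") if i != target]
    group, total = [], 0
    for i in order:
        group.append(i)
        total += int(n[i])
        if total > min_station_years:  # "exceeded" in the text, hence strict inequality
            break
    return group, total

# Toy example with two features and a lowered threshold of 50 station-years
features = [[0.0, 0.0], [0.1, 0.0], [0.5, 0.5], [0.2, 0.1], [0.9, 0.9]]
lengths = [30, 20, 40, 25, 60]
group, total = form_pooling_group(features, lengths, target=0, min_station_years=50)
```

In this toy case the two nearest sites supply only 45 station-years, so a third site is added. If all sites are exhausted before the threshold is reached, the sketch simply returns every other site.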

[39] To arrive at regional quantile function for ungauged site corresponding to each of the 884 pooling groups, the RFA was performed on each pooling group using the proposed approach and each of the three methods related to index-flood approach, namely, CIF, PIF, and logarithm based index-flood (LIF) method [Stedinger, 1983]. The LIF method was not considered for Monte Carlo simulation experiments because application of the method requires identification of theoretical form of frequency distribution in log-space corresponding to synthetic samples generated in the original space based on each of the populations (GLO, GEV, GPA, GNO, and PE3), but those theoretical forms are not available in log-space.

[40] The procedure for performing RFA with each approach/method on a pooling group was similar to that on a realization in the case of Monte Carlo simulation experiments, except for estimation of (i) the first L-moment for ungauged site to be used in equation (4) or (5) with PA and as scaling factor in CIF method and (ii) the second L-moment for ungauged site to be used in equation (9) with PIF method that was implemented on pooling groups for which the regional frequency distribution was GEV. The required moment was estimated by substituting values of scaled features corresponding to the ungauged site in regional regression relationship developed between the moment and the six scaled features for gauged sites in the pooling group.
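The regression step can be illustrated with an ordinary least squares fit. This is only a sketch under stated assumptions: the functional form of the regional regression is not given here, so a linear model with an intercept is assumed, and the names `fit_regional_regression` and `predict_moment` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_regional_regression(scaled_features, moments):
    """Least squares fit of an L-moment on the scaled features (linear form assumed)."""
    X = np.asarray(scaled_features, dtype=float)
    y = np.asarray(moments, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_moment(coef, target_features):
    """Evaluate the fitted relationship at the ungauged site's scaled features."""
    return float(coef[0] + np.dot(coef[1:], np.asarray(target_features, dtype=float)))

# Synthetic pooling group: 20 gauged sites, six scaled features, noise-free response
rng = np.random.default_rng(0)
X_gauged = rng.random((20, 6))
true_coef = np.array([50.0, 30.0, -5.0, 12.0, 8.0, 0.0, 3.0])
l1_gauged = true_coef[0] + X_gauged @ true_coef[1:]

coef = fit_regional_regression(X_gauged, l1_gauged)
l1_ungauged = predict_moment(coef, np.full(6, 0.5))
```

With six features and an intercept, at least seven gauged sites are needed for the fit to be determined. A log-linear form is a common alternative when predictions must remain positive.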

[41] For RFA with the LIF method, peak flow records corresponding to each of the gauged sites in the pooling group were first log-transformed, and then those transformed values (in log-space) were subjected to the steps involved in the CIF method to construct a growth curve in normalized log-space. The curve was then multiplied by the scaling factor for the ungauged site in log-space to arrive at the regional quantile function in log-space. The scaling factor was computed by substituting values of the six scaled features corresponding to the ungauged site in the regression relationship developed between the scaled features for gauged sites and their corresponding mean annual peak flow in log-space. The regional quantile function was then inverse log-transformed to arrive at the regional quantile function for the ungauged site in the original space.
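The final two steps of the LIF method can be written compactly. The notation below is assumed for illustration and does not appear in the original: $q_{\log}(F)$ is the growth curve in normalized log-space, $\hat{\mu}_{\log}$ is the regression-based scaling factor for the ungauged site in log-space, and natural logarithms are assumed because the base is not stated.

```latex
\hat{Q}_{\log}(F) = \hat{\mu}_{\log}\, q_{\log}(F), \qquad
\hat{Q}(F) = \exp\!\left[\hat{Q}_{\log}(F)\right]
           = \exp\!\left[\hat{\mu}_{\log}\, q_{\log}(F)\right]
```

Under this form the scaling factor enters through the exponent, so a relative error $\varepsilon$ in $\hat{\mu}_{\log}$ raises the quantile estimate to the power $1 + \varepsilon$ rather than shifting it additively.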

[42] The regional quantile function constructed for an ungauged site using each of the methods (CIF, PIF, LIF) and PA was compared with the “true” quantile function for the site in terms of R-bias, AR-bias, and R-RMSE values corresponding to five return periods (T = 25, 50, 75, 100, and 200 years). The “true” quantile function was constructed by fitting the best fit frequency distribution to the AMF data available for the ungauged site, following the conventional practice [e.g., Cunderlik and Burn, 2006]. The best fit at-site frequency distribution was found to be GLO for 301 sites, GEV for 142 sites, GNO for 118 sites, PE3 for 183 sites, GPA for 129 sites, and Wakeby for 11 sites using the L-moment based goodness of fit test [Hosking and Wallis, 1997] with 90% confidence level. Values of the performance measures indicate that errors are least for the PA and considerably higher for the LIF, PIF, and CIF methods (Table 5). To gain further insight, scatterplots of the “true” at-site quantile estimates against the regional quantile estimates based on PA and each of the index-flood related methods (CIF, PIF, and LIF) were prepared for various return periods. They showed that points corresponding to PA deviate less from the solid 1:1 line than those corresponding to the methods related to the index-flood approach. In general, the regional estimates based on PIF and LIF methods were mostly higher than true quantile estimates for all sites, whereas those based on the CIF method were lower than true quantile estimates for the sites for which at-site quantiles are high. For brevity, results corresponding to a typical return period (T = 50 years) are presented in Figure 11. Overall, the results indicate that the proposed approach offers significant improvement over the index-flood related methods for RFA.
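The three performance measures can be computed as sketched below. The formulas are assumed here to take the commonly used relative-error forms, averaged over sites and expressed in percent. The exact definitions and sign convention are those given earlier in the paper and may differ from this sketch. The function name and sample values are hypothetical.

```python
import numpy as np

def relative_error_measures(estimated, true):
    """R-bias, AR-bias, and R-RMSE in percent (assumed relative-error definitions)."""
    est = np.asarray(estimated, dtype=float)
    tru = np.asarray(true, dtype=float)
    rel = (est - tru) / tru  # assumed sign convention: estimate minus truth
    return {
        "R-bias": 100.0 * rel.mean(),
        "AR-bias": 100.0 * np.abs(rel).mean(),
        "R-RMSE": 100.0 * np.sqrt(np.mean(rel ** 2)),
    }

# Two hypothetical sites: one underestimated and one overestimated by 10%
measures = relative_error_measures([90.0, 110.0], [100.0, 100.0])
```

Under these definitions, AR-bias is never smaller than the magnitude of R-bias, and R-RMSE is never smaller than AR-bias. The values in Table 5 follow the same ordering.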

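Table 5. Performance Measures R-Bias, AR-Bias, and R-RMSE Computed Based on Errors in Flood Quantiles Estimated Corresponding to (a) All 884 Ungauged Sites and (b) 301 Ungauged Sites for Which Regional Frequency Distribution of Pooling Group Is GEV^a

(a)

| T | R-Bias (%) PA | R-Bias (%) CIF | R-Bias (%) LIF | AR-Bias (%) PA | AR-Bias (%) CIF | AR-Bias (%) LIF | R-RMSE (%) PA | R-RMSE (%) CIF | R-RMSE (%) LIF |
|---|---|---|---|---|---|---|---|---|---|
| 25 | −11.35 | −56.10 | −83.79 | 29.00 | 74.18 | 100.28 | 51.54 | 140.07 | 176.75 |
| 50 | −9.18 | −54.80 | −58.04 | 27.27 | 73.57 | 80.55 | 45.83 | 139.66 | 148.01 |
| 75 | −8.29 | −54.31 | −45.69 | 27.09 | 73.63 | 72.29 | 43.85 | 140.16 | 135.03 |
| 100 | −7.80 | −54.06 | −37.88 | 27.21 | 73.85 | 67.77 | 42.96 | 140.82 | 127.17 |
| 200 | −7.03 | −53.83 | −21.68 | 28.17 | 74.91 | 60.71 | 42.29 | 143.37 | 111.88 |

(b)

| T | R-Bias (%) PA | R-Bias (%) CIF | R-Bias (%) LIF | R-Bias (%) PIF | AR-Bias (%) PA | AR-Bias (%) CIF | AR-Bias (%) LIF | AR-Bias (%) PIF | R-RMSE (%) PA | R-RMSE (%) CIF | R-RMSE (%) LIF | R-RMSE (%) PIF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | −11.30 | −50.20 | −82.38 | −31.60 | 29.36 | 69.26 | 98.64 | 59.22 | 57.52 | 136.97 | 192.48 | 127.81 |
| 50 | −9.40 | −49.03 | −56.43 | −31.46 | 28.28 | 68.58 | 78.89 | 60.01 | 51.51 | 135.61 | 161.41 | 130.38 |
| 75 | −8.68 | −48.56 | −43.91 | −31.54 | 28.56 | 68.62 | 70.47 | 60.71 | 49.44 | 135.57 | 147.02 | 132.28 |
| 100 | −8.31 | −48.32 | −35.97 | −31.67 | 29.00 | 68.80 | 66.07 | 61.35 | 48.57 | 135.88 | 138.16 | 133.80 |
| 200 | −7.84 | −48.11 | −19.47 | −32.23 | 30.80 | 70.06 | 58.96 | 63.43 | 48.24 | 137.76 | 120.63 | 138.06 |

^a PA represents the proposed approach; CIF, PIF, and LIF denote conventional, population, and logarithm index-flood methods, respectively. T denotes return period in years.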

[43] The reasons for the better performance of PA relative to the CIF method are the same as those mentioned for the Monte Carlo simulation experiments. The inferior performance of the PIF method relative to PA can be attributed to the fact that the method assumes the ratio of scale to location parameters to be constant for all the sites in a region, which may not be valid in a real-world scenario. The site-specific location and scale parameters estimated under this assumption are expected to be largely biased with respect to the population values of those parameters. Therefore, the regional quantile function based on those parameters would be largely biased with respect to the true quantile function. The inferior performance of the LIF method relative to PA could be attributed to the fact that the appropriate frequency distribution to fit peak flows in log-space may not be close to any of the known theoretical distributions.

[44] In regionalization studies, delineated regions are often adjusted to improve their homogeneity. Hosking and Wallis [1997, p. 59] summarize various options used by modelers for adjusting regions. To verify whether such adjustments lead to better results, the ROIs identified for each of the 884 sites were adjusted by eliminating discordant sites and assigning additional sites (in order of their distance to the ungauged site) until the collective record length of sites in the group exceeded 500 station-years. Errors in quantile estimates based on adjusted ROIs were found to be marginally higher than those based on unadjusted ROIs. The results are not shown due to lack of space. The increase in errors is attributed to the pooling of more distant sites into the ROI as a result of the adjustments.

## 5. Summary and Concluding Remarks

[45] The key assumption of the index-flood approach requires parameters (viz., location, scale, and shape) of frequency distributions of normalized records to be identical for all the sites in a homogeneous region. In the widely used L-moment framework, estimates of those parameters are conditioned on regional average estimates of L-statistics. Errors in estimates of scale and shape parameters are shown to be fairly large with respect to their true population values (in the normalized space), through Monte Carlo simulation experiments in which the scaling factor of the index-flood approach is taken to be the population mean statistic. To arrive at better parameter estimates and consequently more accurate quantile estimates, a novel mathematical approach is proposed for RFA in the L-moment framework. Transformation mechanisms corresponding to various commonly used frequency distributions are proposed to map the random variable being analyzed from the original space to a dimensionless space where the form of the distribution does not change, and where deviations of regional estimates of all the parameters (location, scale, shape) from their population values as well as from at-site estimates are minimal. This is demonstrated for the situation where the population is known. The location, scale, and shape parameters corresponding to GLO, GEV, GPA, and GNO populations are analytically derived to be 0, 1, and 0, respectively, in the dimensionless space.
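To make the last statement concrete, the sketch below evaluates the growth curves that result when location, scale, and shape are 0, 1, and 0. It assumes the quantile functions follow the Hosking and Wallis [1997] parameterization, in which the zero-shape cases are the logistic, Gumbel, exponential, and normal distributions. The function name `growth_curve` is hypothetical.

```python
import math
from statistics import NormalDist

def growth_curve(dist, F):
    """Dimensionless quantile y(F) for location 0, scale 1, shape 0.

    Assumes the Hosking and Wallis parameterization of each distribution.
    """
    if not 0.0 < F < 1.0:
        raise ValueError("F must lie strictly between 0 and 1")
    if dist == "GLO":   # logistic limit
        return math.log(F / (1.0 - F))
    if dist == "GEV":   # Gumbel limit
        return -math.log(-math.log(F))
    if dist == "GPA":   # exponential limit
        return -math.log(1.0 - F)
    if dist == "GNO":   # normal limit
        return NormalDist().inv_cdf(F)
    raise ValueError(f"unsupported distribution: {dist}")

# Tabulate y(F) for the return periods used in the simulation experiments
for T in (50, 100, 200):
    F = 1.0 - 1.0 / T
    row = ", ".join(f"{d}={growth_curve(d, F):.3f}" for d in ("GLO", "GEV", "GPA", "GNO"))
    print(f"T={T}: {row}")
```

None of these curves depends on any site-specific quantity, which is what allows a single curve to be constructed in the dimensionless space before mapping back to a target site.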

[46] Growth curve (CDF) constructed in the dimensionless space with proposed approach tends to be closer to population growth curve in that space, as regional estimates of parameters are closer to their population values. Mapping the curve to ungauged site in the original space would yield quantile function that tends to be closer to true (population) quantile function. Monte Carlo simulation experiments revealed that the proposed approach outperforms CIF and PIF methods in predicting quantiles for ungauged locations irrespective of whether (i) population mean statistic corresponding to ungauged sites is known and (ii) the form of regional frequency distribution is misspecified. Experiments on real-world data also suggested that the proposed approach offers significant improvement over CIF, PIF, and LIF methods in RFA. Further improvement in results could be possible by considering Mahalanobis distance to form ROI [Cunderlik and Burn, 2006], instead of Euclidean distance considered in this study. However, experiments in this direction require development of Mahalanobis based ROI methodology in multivariate setting to be able to simultaneously consider all six watershed related attributes for regionalization. Research is underway to extend the proposed approach to the framework of conventional method of moments and method of maximum likelihood.

## Appendix A

[47] Derivations of L-moments and parameters corresponding to transformed random variable Y that follows GLO, GEV, GPA, GNO, or PE3 distributions are presented. The parameters form the basis to construct population growth curve y(F) in the dimensionless space.

#### A1. Generalized Logistic Distribution

[48] L-moments of the random variable Y that follows GLO distribution in dimensionless space are determined as

(A1)
(A2)
(A3)

[49] The relationships between L-statistics and parameters of the distribution [Hosking and Wallis, 1997, pp. 196–197] can be used to arrive at values for population location, scale, and shape parameters represented as ξY, αY, and kY, respectively.

(A4)
(A5)
(A6)

[50] As kY = 0, the quantile corresponding to nonexceedance probability F is estimated based on the quantile estimation function given in Table 1 for the GLO distribution, by replacing X with Y, as they follow the same frequency distribution.

(A7)

#### A2. Generalized Extreme Value Distribution

[51] L-moments of the GEV distributed random variable Y in dimensionless space are determined as

[52] Let so that and

(A8)

[53] Let so that and

(A9)

[54] Let so that and

[55] From foregoing analysis,

(A10)

[56] The relationships between L-statistics and parameters of the distribution [Hosking and Wallis, 1997, p. 196] can be used to arrive at values for population location, scale, and shape parameters represented as ξY, αY, and kY, respectively.

(A11)
(A12)
(A13)

[57] As kY = 0, the quantile corresponding to nonexceedance probability F is estimated based on the quantile estimation function given in Table 1 for the GEV distribution by replacing X with Y as

(A14)
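As a hedged consistency check, not part of the original derivation: if ξY = 0, αY = 1, and kY = 0, as section 5 states for the GEV case, and if the Hosking and Wallis [1997] parameterization applies, then Y is a standard Gumbel variate. The standard L-moment results for that distribution are shown below, where $\gamma$ denotes Euler's constant.

```latex
\begin{aligned}
\lambda_{1} &= \gamma \approx 0.5772, \\
\lambda_{2} &= \ln 2 \approx 0.6931, \\
\tau_{3} &= \frac{\ln(9/8)}{\ln 2} \approx 0.1699 .
\end{aligned}
```

Substituting these values into the GEV parameter relationships should return the stated location, scale, and shape of 0, 1, and 0.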

#### A3. Generalized Pareto Distribution

[58] L-moments of the random variable Y that follows GPA distribution in dimensionless space are determined as

(A15)
(A16)
(A17)

[59] The relationships between L-statistics and parameters of the distribution [Hosking and Wallis, 1997, p. 195] can be used to arrive at values for population location, scale, and shape parameters represented as ξY, αY, and kY, respectively.

(A18)
(A19)
(A20)

[60] As kY = 0, the quantile corresponding to nonexceedance probability F is estimated based on the quantile estimation function given in Table 1 for the GPA distribution by replacing X with Y as

(A21)

#### A4. Log Normal Three-Parameter (Generalized Normal) Distribution

[61] L-moments of the GNO distributed random variable Y in dimensionless space are determined as

(A22)

[62] In general, . In the present case

(A23)
(A24)

[63] The values determined for , , and can be used in the relationships between L-statistics and parameters of the distribution to arrive at values for population location, scale, and shape parameters represented as ξY, αY, and kY, respectively.

(A25)
(A26)
(A27)

[64] The quantile corresponding to nonexceedance probability F can be estimated as

(A28)

#### A5. Pearson Type III Distribution

[65] L-moments of the PE3 distributed random variable Y in dimensionless space are determined for the case as

(A29)

[66] The hypergeometric function can be expanded following Abramowitz and Stegun [1972, p. 558, 15.2.11] as

[67] Considering

[68]  and can be expanded following equations 15.1.25 and 15.1.24 of Abramowitz and Stegun [1972] as

(A30)

[69] The values determined for and can be used in the relationships between L-statistics and parameters of the distribution to arrive at values for population location and scale parameters represented as ξY and βY, respectively.

(A31)
(A32)

[70] The quantile corresponding to nonexceedance probability F is estimated based on quantile estimation function given in Table 2 for PE3 distribution, by replacing X with Y as

(A33)

where KT is the frequency factor, which can be computed using approximations such as the Wilson-Hilferty and Cornish-Fisher transformations [Rao and Hamed, 2000, pp. 146–148].

## Acknowledgments

[71] The authors would like to express their gratitude to three anonymous reviewers and the Associate Editor for their constructive comments, which helped improve the quality of this manuscript.