The assessment of regional homogeneity is a critical point in regional frequency analysis. To this end, many homogeneity tests have been proposed, but a general comparison among them is still lacking. Commonly used homogeneity tests, based on L moment ratios, are compared here with two rank tests that do not rely on particular assumptions regarding the parent distribution. The performance of these tests is assessed in a series of Monte Carlo simulation experiments. In particular, the power and type I error of each test are determined for different scale and shape parameters of the regional parent distributions. The tests are also evaluated by varying the number of sites belonging to the region, the series length, the type of the parent distributions and the degree of heterogeneity. We find that tests based on L moments are more powerful when the samples are slightly skewed, while the rank tests perform better when skewness is high. On the basis of these findings we propose a simple method to guide the choice of the homogeneity test to be used in the different possible cases.
 Estimation of the frequency of extreme events is often required in hydrological practice. The procedures for the analysis of a single set of data are well established, but often observations of the same variable at different measuring sites are available, and more accurate conclusions can be reached by analyzing many data samples together. This constitutes the basis for regional frequency analysis [e.g., Hosking and Wallis, 1997]. Critical points of the regional approach to frequency analysis are the choice of the method used to group the data samples together and the assessment of the plausibility of the resulting groupings. The latter involves testing whether the proposed regions may be considered homogeneous or not. The hypothesis of homogeneity implies that frequency distributions for different sites are the same, except for a site-specific scale factor.
 The performance of these tests is assessed by determining their power through Monte Carlo simulation experiments. A more powerful homogeneity test offers the potential to reduce the error of quantile estimators, which is the final goal of a regional frequency analysis. However, this holds only when the significance level is chosen so as to maximize the benefits of the more powerful test. A full analysis of the problem would require disentangling the relations between the significance level and the power of the tests, a very complicated problem that goes beyond the scope of the present manuscript. Additional considerations on this topic are found at http://www.idrologia.polito.it/~alviglio/homtest.htm
Section 2 is devoted to the description of the considered tests. In section 3 we describe the procedure adopted for carrying out the comparison among the tests, in section 4 the obtained results are presented, and in section 5 some conclusions are drawn.
2. Homogeneity Tests
 Suppose that k samples of observations of the same variable at different measuring sites are available, and that one wishes to verify whether they can be grouped to form a statistically homogeneous region. Let $Y_{ij}$ be the jth observation in the ith sample, sorted in ascending order ($Y_{i1} \le Y_{i2} \le \dots \le Y_{i n_i}$, where $n_i$ is the length of the ith sample and i = 1, …, k). Following an index value procedure, the observations are first rescaled with respect to a site-specific index value $\mu_i$ (details on the choice of the index value are provided in section 4.1), obtaining $X_{ij} = Y_{ij}/\mu_i$. If the observations are independent and the ith rescaled sample has distribution function Fi, the homogeneity test corresponds to verifying the hypothesis H0: F1 = … = Fk = F, without specifying the common distribution F. The merits and drawbacks of a test statistic are evaluated by considering its power and its type I error. Given the null hypothesis H0 (in our case the hypothesis of regional homogeneity), the power of the test is defined as the probability of correctly rejecting H0 when it is not true. If instead the hypothesis is rejected when it should be accepted, one makes a type I error. The test is unbiased when the probability of making a type I error is equal to the selected level of significance, α, of the test.
 Homogeneity tests involve finding, for each site, an estimate of a quantity, θi, that measures some aspects of the (at site) frequency distributions, and verifying if the dispersion of the θi values around their regional counterpart, θR, is consistent with the hypothesis of homogeneity. This requires defining the distribution of θ under the null hypothesis H0, which in many cases implies that the common distribution F is selected a priori. This is a theoretical problem affecting the application of many homogeneity tests (an exception is the Wiltshire [1986a] CV-based test). The necessity to preselect F implies that the test actually does not allow one to verify the homogeneity hypothesis alone, but the composite (homogeneity plus goodness of fit) hypothesis that the parent distribution is the same at each site, and has a predefined mathematical form F. As a consequence, the possible reasons why the test is not passed can be either that the region is heterogeneous, or that the adopted regional probability distribution F is inadequate. We will return to this point in section 2.2, where the Anderson-Darling test is described.
 A second problem occurs as an effect of the normalization by the index value, which in some cases can distort the distribution of the test statistic under the null hypothesis: this is the case, for example, of the Wiltshire [1986a] rank-based test or of the k sample Anderson-Darling test. The problem will be treated in detail in section 2.3. We now describe the four homogeneity tests selected for the comparison. The R package homtest, developed to facilitate the practical application of the tests, is available at the CRAN web page (see http://www.r-project.org/).
2.1. Hosking and Wallis Heterogeneity Measures
 The idea underlying the Hosking and Wallis heterogeneity statistics is to measure the sample variability of the L moment ratios and compare it to the variation that would be expected in a homogeneous region. The latter is estimated through repeated simulations of homogeneous regions with samples drawn from a four-parameter kappa distribution [see Hosking and Wallis, 1997, pp. 202–204]. In more detail, the steps are as follows.
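 1. With regard to the k samples belonging to the region under analysis, find the sample L moment ratios (see Hosking and Wallis for details) pertaining to the ith site: these are the L coefficient of variation (L CV),
$$ t^{(i)} = \frac{l_2^{(i)}}{l_1^{(i)}}, \tag{1} $$
the coefficient of L skewness,
$$ t_3^{(i)} = \frac{l_3^{(i)}}{l_2^{(i)}}, \tag{2} $$
and the coefficient of L kurtosis,
$$ t_4^{(i)} = \frac{l_4^{(i)}}{l_2^{(i)}}, \tag{3} $$
where $l_r^{(i)}$ is the rth sample L moment at site i. Note that the L moment ratios are not affected by the normalization by the index value, i.e., the same values are obtained using Xij or Yij in equations (1)–(3).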
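 2. Define the regional averaged L CV, L skewness and L kurtosis coefficients,
$$ t^R = \frac{\sum_{i=1}^{k} n_i\, t^{(i)}}{\sum_{i=1}^{k} n_i}, \qquad t_3^R = \frac{\sum_{i=1}^{k} n_i\, t_3^{(i)}}{\sum_{i=1}^{k} n_i}, \qquad t_4^R = \frac{\sum_{i=1}^{k} n_i\, t_4^{(i)}}{\sum_{i=1}^{k} n_i}, \tag{4} $$
and compute the statistic
$$ V = \left\{ \frac{\sum_{i=1}^{k} n_i \left( t^{(i)} - t^R \right)^2}{\sum_{i=1}^{k} n_i} \right\}^{1/2}. \tag{5} $$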
 3. Fit the parameters of a four-parameter kappa distribution to the regional averaged L moment ratios tR, t3R and t4R, and then generate a large number Nsim of realizations of sets of k samples. The ith site sample in each set has a kappa distribution as its parent and record length equal to ni. For each simulated homogeneous set, calculate the statistic in equation (5), obtaining Nsim values. On this vector of V values determine the mean μV and standard deviation σV under the hypothesis of homogeneity (actually, under the composite hypothesis of homogeneity and kappa parent distribution).
 4. A heterogeneity measure, which is called here HW1, is finally found as
$$ \theta_{HW1} = \frac{V - \mu_V}{\sigma_V}. \tag{6} $$
θHW1 can be approximated by a normal distribution with zero mean and unit variance: following Hosking and Wallis, the region under analysis can therefore be regarded as “acceptably homogeneous” if θHW1 < 1, “possibly heterogeneous” if 1 ≤ θHW1 < 2, and “definitely heterogeneous” if θHW1 ≥ 2. Hosking and Wallis suggest that these limits should be treated as useful guidelines. Even if the statistic is constructed like a significance test, significance levels obtained from such a test would be accurate only under special assumptions: the data must be independent both serially and between sites, and the true regional distribution must be kappa.
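The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the sample L moments are computed from unbiased probability-weighted moments, and the kappa fitting and simulation of step 3 are not implemented. Instead, `simulate_region` is a hypothetical user-supplied callable that returns one simulated homogeneous region (a list of k samples); the function names and the use of the sample standard deviation for σV are assumptions made here.

```python
import statistics


def sample_lmoment_ratios(x):
    """Return (l1, t, t3, t4) for one sample (needs at least 4 observations)."""
    xs = sorted(x)
    n = len(xs)
    b0 = b1 = b2 = b3 = 0.0
    for j, v in enumerate(xs):  # j = 0 is the smallest observation
        b0 += v
        b1 += v * j / (n - 1)
        b2 += v * j * (j - 1) / ((n - 1) * (n - 2))
        b3 += v * j * (j - 1) * (j - 2) / ((n - 1) * (n - 2) * (n - 3))
    b0, b1, b2, b3 = b0 / n, b1 / n, b2 / n, b3 / n
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2 / l1, l3 / l2, l4 / l2


def v_statistic(samples):
    """Record-length-weighted standard deviation of the at-site L CVs (V)."""
    n = [len(s) for s in samples]
    t = [sample_lmoment_ratios(s)[1] for s in samples]
    t_reg = sum(ni * ti for ni, ti in zip(n, t)) / sum(n)
    return (sum(ni * (ti - t_reg) ** 2 for ni, ti in zip(n, t)) / sum(n)) ** 0.5


def heterogeneity_measure(v_obs, v_sim):
    """Standardize the observed V by the mean and sd of the simulated V values."""
    return (v_obs - statistics.mean(v_sim)) / statistics.stdev(v_sim)


def hw1(samples, simulate_region, nsim=500):
    """HW1 measure; simulate_region() must return one homogeneous region."""
    v_sim = [v_statistic(simulate_region()) for _ in range(nsim)]
    return heterogeneity_measure(v_statistic(samples), v_sim)
```

Because the L CV is scale invariant, `v_statistic` gives the same value for raw and index-rescaled samples, in line with the remark in step 1.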
 The statistic θHW1 measures heterogeneity only in the dispersion of the samples, since it is based solely on the differences between the sample L CVs in the region. As such, it is insensitive to heterogeneity that arises between sites having equal L CV but different L skewness. Hosking and Wallis also give an alternative heterogeneity measure (that we call HW2), in which V in equation (5) is replaced by
$$ V_2 = \frac{\sum_{i=1}^{k} n_i \left[ \left( t^{(i)} - t^R \right)^2 + \left( t_3^{(i)} - t_3^R \right)^2 \right]^{1/2}}{\sum_{i=1}^{k} n_i}. \tag{7} $$
The test statistic in this case becomes
$$ \theta_{HW2} = \frac{V_2 - \mu_{V_2}}{\sigma_{V_2}}, \tag{8} $$
with acceptability limits similar to those of the HW1 statistic. Hosking and Wallis judge θHW2 to be inferior to θHW1 and say that it rarely yields values larger than 2 even for grossly heterogeneous regions. Moreover, they stress that in practice it is uncommon to have sites with equal L CV and different L skewness (sites with high L skewness tend to have high L CV too). We nevertheless decided to consider this statistic in the present paper, because it is used in the most systematic and documented regional flood study available [Robson and Reed, 1999].
2.2. The k Sample Anderson-Darling Test
 As mentioned, the HW1 and HW2 heterogeneity measures suffer from the limitation that they assume a kappa parent distribution, thus turning the homogeneity test into a goodness of fit plus homogeneity test. The kappa distribution is probably flexible enough to limit the consequences of this assumption [Hosking and Wallis, 1997], but the theoretical inconsistency remains. We therefore decided to include in the comparison tests that do not have this problem. A possible candidate would be the Wiltshire [1986a] CV-based test, had it not been shown by the same author to be unreliable. Another test that does not make any assumption on the parent distribution is the Anderson-Darling (AD) rank test [Scholz and Stephens, 1987]. The AD test is the generalization of the classical Anderson-Darling goodness of fit test [e.g., D'Agostino and Stephens, 1986], and it is used to test the hypothesis that k independent samples belong to the same population without specifying their common distribution function.
 The test is based on the comparison between local and regional empirical distribution functions. The empirical distribution function, or sample distribution function, is defined by $\hat{F}(x) = j/\eta$ for $x_{(j)} \le x < x_{(j+1)}$, where η is the size of the sample and $x_{(j)}$ are the order statistics, i.e., the observations arranged in ascending order. Denote the empirical distribution function of the ith sample (local) by $\hat{F}_i(x)$, and that of the pooled sample of all N = n1 + … + nk observations (regional) by HN(x). The k sample Anderson-Darling test statistic is then defined as
$$ \theta_{AD} = \sum_{i=1}^{k} n_i \int_{\text{all } x} \frac{\left[ \hat{F}_i(x) - H_N(x) \right]^2}{H_N(x)\left[ 1 - H_N(x) \right]} \, dH_N(x). \tag{9} $$
 If the pooled ordered sample is Z1 < … < ZN, the computational formula to evaluate equation (9) is
$$ \theta_{AD} = \frac{1}{N} \sum_{i=1}^{k} \frac{1}{n_i} \sum_{j=1}^{N-1} \frac{\left( N M_{ij} - j\, n_i \right)^2}{j\,(N - j)}, \tag{10} $$
where Mij is the number of observations in the ith sample that are not greater than Zj. The homogeneity test can be carried out by comparing the obtained θAD value to the tabulated percentage points reported by Scholz and Stephens [1987] for different significance levels.
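The computational formula can be illustrated with a short Python sketch that uses only the standard library. It follows the Scholz and Stephens form for a pooled sample without ties, counting Mij with a binary search on the sorted ith sample; the function name `ad_k_sample` is chosen here for illustration.

```python
from bisect import bisect_right


def ad_k_sample(samples):
    """k sample Anderson-Darling statistic (computational form, no ties assumed)."""
    pooled = sorted(v for s in samples for v in s)
    big_n = len(pooled)
    total = 0.0
    for s in samples:
        s_sorted = sorted(s)
        n_i = len(s_sorted)
        inner = 0.0
        for j in range(1, big_n):  # j = 1, ..., N - 1
            # M_ij: observations of sample i not greater than Z_j
            m_ij = bisect_right(s_sorted, pooled[j - 1])
            inner += (big_n * m_ij - j * n_i) ** 2 / (j * (big_n - j))
        total += inner / n_i
    return total / big_n
```

As a small check, two fully separated samples such as `[1, 2]` and `[3, 4]` give θAD = 5/3, while the interleaved samples `[1, 3]` and `[2, 4]` give the smaller value 2/3.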
 The statistic θAD depends on the sample values only through their ranks. This guarantees that the test statistic remains unchanged when the samples undergo monotonic transformations, an important stability property not possessed by the HW heterogeneity measures. However, problems arise in applying this test in a common index value procedure. The index value procedure corresponds to dividing each site sample by a different value, thus modifying the ranks in the pooled sample. In particular, this has the effect of making the local empirical distribution functions much more similar to one another, giving an impression of homogeneity even when the samples are highly heterogeneous. The effect is analogous to that encountered when applying goodness of fit tests to distributions whose parameters are estimated from the same sample used for the test [e.g., D'Agostino and Stephens, 1986; Laio, 2004]. In both cases, the percentage points for the test should be appropriately redetermined. This can be done with a nonparametric bootstrap approach consisting of the following steps.
 1. Build up the pooled sample of the observed nondimensional data.
 2. Sample with replacement from the pooled sample and generate k artificial local samples, of size n1, …, nk.
 3. Divide each sample by its index value, and calculate θ(1)AD.
 4. Repeat the procedure for Nsim times and obtain a sample of θ(j)AD, j = 1, …, Nsim values, whose empirical distribution function can be used as an approximation of GH0(θAD), the distribution of θAD under the null hypothesis of homogeneity.
 5. The acceptance limits for the test, corresponding to any significance level α, are then easily determined as the quantiles of GH0(θAD) corresponding to a probability (1 − α).
 We will call the test obtained with the above procedure the bootstrap Anderson-Darling test, hereafter referred to as AD.
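The bootstrap procedure can be sketched as follows. This is an illustrative standard-library implementation under stated assumptions, not the code of the homtest package: the sample median is used as the index value, the data are assumed strictly positive, ties created by resampling are handled only through the "not greater than" count, and the names `bootstrap_ad_test`, `nsim` and `seed` are chosen here.

```python
import math
import random
import statistics
from bisect import bisect_right


def ad_k_sample(samples):
    """k sample Anderson-Darling statistic (computational form)."""
    pooled = sorted(v for s in samples for v in s)
    big_n = len(pooled)
    total = 0.0
    for s in samples:
        s_sorted = sorted(s)
        inner = 0.0
        for j in range(1, big_n):
            m_ij = bisect_right(s_sorted, pooled[j - 1])
            inner += (big_n * m_ij - j * len(s)) ** 2 / (j * (big_n - j))
        total += inner / len(s)
    return total / big_n


def rescale(sample):
    """Divide a sample by its index value (here: the sample median)."""
    index_value = statistics.median(sample)
    return [v / index_value for v in sample]


def bootstrap_ad_test(samples, nsim=500, alpha=0.05, seed=0):
    """Return (observed statistic, bootstrap critical value, reject flag)."""
    rng = random.Random(seed)
    rescaled = [rescale(s) for s in samples]
    theta_obs = ad_k_sample(rescaled)
    pooled = [v for s in rescaled for v in s]            # step 1
    theta_sim = []
    for _ in range(nsim):                                # step 4
        boot = [rng.choices(pooled, k=len(s)) for s in rescaled]  # step 2
        theta_sim.append(ad_k_sample([rescale(b) for b in boot]))  # step 3
    theta_sim.sort()
    # step 5: empirical quantile with probability (1 - alpha)
    idx = min(nsim - 1, math.ceil((1 - alpha) * nsim) - 1)
    critical = theta_sim[idx]
    return theta_obs, critical, theta_obs > critical
```

A call such as `bootstrap_ad_test(region, nsim=500, alpha=0.05)` returns the observed θAD, the acceptance limit, and whether homogeneity is rejected; larger `nsim` values give a smoother approximation of the null distribution at a proportionally higher cost.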
2.3. Durbin and Knott Test
 The last homogeneity test considered derives from a goodness of fit statistic originally proposed by Durbin and Knott [1971]. The test is formulated to measure discrepancies in the dispersion of the samples, without accounting for possible discrepancies in the mean or skewness of the data. In this respect, the test is similar to the HW1 test, while it is analogous to the AD test in that it is a rank test. The original goodness of fit test is very simple: suppose one has a sample Xi, i = 1, …, n, with hypothetical distribution F(x); under the null hypothesis the random variable F(Xi) has a uniform distribution in the (0, 1) interval, and the statistic $D = \sqrt{2/n}\,\sum_{i=1}^{n} \cos[2\pi F(X_i)]$ is approximately normally distributed with mean 0 and variance 1 [Durbin and Knott, 1971]. D serves the purpose of detecting discrepancies in data dispersion: if the variance of Xi is greater than that of the hypothetical distribution F(x), D is significantly greater than 0, while D is significantly below 0 in the reverse case. Differences between the mean (or the median) of Xi and F(x) are instead not detected by D, which guarantees that the normalization by the index value does not affect the test.
 The extension to homogeneity testing of the Durbin and Knott (DK) statistic is straightforward: we substitute the empirical distribution function obtained with the pooled observed data, HN(x), for F(x) in D, obtaining at each site a statistic
$$ D_i = \sqrt{\frac{2}{n_i}} \sum_{j=1}^{n_i} \cos\left[ 2\pi H_N(X_{ij}) \right], \tag{11} $$
which is approximately normally distributed under the hypothesis of homogeneity. The statistic θDK = $\sum_{i=1}^{k} D_i^2$ then has a chi-square distribution with k − 1 degrees of freedom, which allows one to determine the acceptability limits for the test corresponding to any significance level α. Note that the implementation of the DK test is much simpler than that of the other statistics considered.
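The simplicity of the DK test is apparent from a short sketch. The code below is an illustration under assumptions made here: HN(x) is taken as the fraction of pooled observations not greater than x, the input samples are assumed to be already rescaled by their index values, and the chi-square tail probability is computed with a small helper (`chi2_sf`, exact for integer degrees of freedom) so that only the Python standard library is needed.

```python
import math
from bisect import bisect_right


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)             # Q(1, z) = exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # Q(1/2, z) = erfc(sqrt(z))
    while a < df / 2.0 - 1e-9:
        # recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def dk_test(samples):
    """Return (theta_DK, p-value) using a chi-square law with k - 1 df."""
    pooled = sorted(v for s in samples for v in s)
    big_n = len(pooled)
    theta = 0.0
    for s in samples:
        # sum of cos(2 * pi * H_N(x)) over the observations of this site
        c = sum(math.cos(2 * math.pi * bisect_right(pooled, v) / big_n) for v in s)
        theta += (2.0 / len(s)) * c * c      # D_i squared
    return theta, chi2_sf(theta, len(samples) - 1)
```

Homogeneity would be rejected at level α when the returned p-value is smaller than α.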
3. Basis for Test Comparison
 The main aim of this work is to analyze, through Monte Carlo simulations, which of the tests described in section 2 works better, i.e., is less biased (type I error close to the adopted significance level) and more powerful. The Monte Carlo simulation experiment consists of the following steps.
 1. An artificial region is defined by providing the number of samples k, their length n (which is kept constant for all sites), the (three-parameter) parent distribution used for the generation of the samples, and the regional average L moment ratios τR and τR3.
 2. The artificial region has a known heterogeneity, with the local L moment ratios, τ(i) and/or τ(i)3 varying linearly from site 1 through site k, with an overall range of variation Δτ and Δτ3 (when Δτ and Δτ3 are both equal to zero, the region is homogeneous).
 3. For each site in the region, the three parameters of the parent distribution are estimated from the local L moments, and a sample of size n is generated from that distribution and normalized by the index value.
 4. The four homogeneity tests are applied to the obtained artificial region, after having selected a significance level α for the AD and DK tests, or an almost equivalent acceptability limit for the HW1 and HW2 heterogeneity measures.
 5. One thousand replications of the artificial regions are generated, and each replication is separately tested for homogeneity with the four tests; the power of each test (or its type I error) is estimated as the percentage of the 1000 replicates recognized as heterogeneous.
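Steps 1 to 3 can be sketched for a GEV parent as follows. This is a hedged illustration: the GEV parameters are obtained from the L moments with the widely used approximation for the shape parameter reported by Hosking and Wallis [1997], and whether the original experiments used exactly this approximation is an assumption. The at-site mean is set to 1 because the subsequent index value normalization removes the scale; the normalization itself is left to the caller, and all function names are chosen here.

```python
import math
import random


def site_lmoment_ratios(center, spread, k):
    """k values varying linearly over a total range `spread` around `center`."""
    if k == 1:
        return [center]
    return [center - spread / 2 + spread * i / (k - 1) for i in range(k)]


def gev_from_lmoments(l1, tau, tau3):
    """GEV (location, scale, shape) from the mean, L CV and L skewness."""
    c = 2.0 / (3.0 + tau3) - math.log(2.0) / math.log(3.0)
    shape = 7.8590 * c + 2.9554 * c * c
    l2 = tau * l1
    if abs(shape) < 1e-6:  # Gumbel limit
        scale = l2 / math.log(2.0)
        return l1 - 0.5772156649 * scale, scale, 0.0
    g = math.gamma(1.0 + shape)
    scale = l2 * shape / ((1.0 - 2.0 ** (-shape)) * g)
    return l1 - scale * (1.0 - g) / shape, scale, shape


def gev_quantile(f, loc, scale, shape):
    """Quantile function x(F) of the GEV distribution, 0 < f < 1."""
    y = -math.log(f)
    if shape == 0.0:
        return loc - scale * math.log(y)
    return loc + scale * (1.0 - y ** shape) / shape


def simulate_region(k, n, tau_r, tau3_r, d_tau, d_tau3, rng):
    """One artificial region: k GEV samples of length n with known heterogeneity."""
    taus = site_lmoment_ratios(tau_r, d_tau, k)
    tau3s = site_lmoment_ratios(tau3_r, d_tau3, k)
    region = []
    for tau, tau3 in zip(taus, tau3s):
        params = gev_from_lmoments(1.0, tau, tau3)
        sample = []
        while len(sample) < n:
            u = rng.random()
            if u > 0.0:  # avoid log(0) in the quantile function
                sample.append(gev_quantile(u, *params))
        region.append(sample)
    return region
```

The power (or type I error) of a test would then be estimated as the fraction of simulated regions, out of many replications, for which the test rejects homogeneity.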
 The comparison among the tests should be as general as possible; different values of k, n, τ, τ3, Δτ, Δτ3, and α, as well as different parent distributions, therefore need to be considered, which complicates the numerical simulation. In particular, the average dispersion and skewness of the samples, τR and τR3, are very likely to affect the performance of the tests substantially. The same is true for the other parameters, but the effect on the tests of a change in, say, n is much easier to predict and therefore less interesting. For this reason we decided to consider several τR and τR3 values, i.e., to explore in our simulation experiment a large portion of the τ − τ3 diagram. Numerical constraints on the τ and τ3 values are given by Hosking and Wallis: these are 0 ≤ τ < 1, −1 < τ3 < 1, and 2τ − 1 < τ3 (valid for variables that can take only positive values). However, the portion of the τ − τ3 space bounded by these constraints is still too large from an operational perspective.
 To choose tighter bounds in the τ − τ3 space we adopt a hydrological perspective and refer to the work of Vogel and Wilson, who use L moment diagrams to select a regional distribution for annual minimum, average and maximum streamflows. Vogel and Wilson build these diagrams for more than 1400 river basins in the continental United States. All the observed τ − τ3 values, independently of the type of flow, occupy a band along the bisector of the diagram with τ3 − 0.2 < τ < τ3 + 0.4 (see Figure 1), and very few points have a τ3 larger than 0.5 or smaller than −0.1. We therefore choose to limit our investigations to the region with the following bounds (Figure 1):
We consider all τR and τR3 pairs inside that region on a grid with a 0.1 spacing (gray points in Figure 1).
 As for the other variables involved (k, n, the parent distribution, Δτ, Δτ3, and α), the adopted simulation strategy involves building a main case study, with reasonable parameter values, and then carrying out a sensitivity analysis. The parameters selected for the main case study are the following: k = 11; n = 30; generalized extreme value (GEV) parent distribution; α = 5% (or, equivalently, θHW ≤ 2); Δτ = 0 and Δτ3 = 0 for verifying the type I error, or Δτ = 0.5τ and Δτ3 = 0 for verifying the power of the tests (see section 4.2). The type and degree of heterogeneity, the sample size, the number of sites in the region, the significance level, and the parent distribution are then varied one at a time (see section 4.3), and the results are analyzed for four points in the central part of the τ − τ3 diagram (points A, B, C and D in Figure 1).
4. Results
 This section is divided into three parts: the first discusses the choice of the index value, the second describes the main case study, and the third analyzes the effects of varying k, n, the parent distribution, Δτ, Δτ3, or α.
4.1. Choice of the Index Value
 A relevant issue in regional frequency analysis, which is related to the main subject of this paper, is the choice of the index value, i.e., of the parameter used to normalize the samples. We decided to include a specific section on this topic both because the choice of the index value can affect the performance of the homogeneity tests, and because we wish to stimulate discussion on this important, but often neglected, topic. In the original formulation of the index value method by Dalrymple, the index value was intended to be the population mean. However, the passage from theory to practice involved replacing the population mean by the sample mean. As clearly pointed out by Sveinsson et al., this change is not trouble-free, since replacing the population mean by its sampling counterpart can produce relevant distortions in the regional frequency analysis. The induced distortions can be expected to be rather large when the sample mean is not a “good” estimator of the population mean, i.e., when it is either biased or has a large estimation variance. In those cases a possible alternative is to use the sample median as the index value, as proposed for example by Robson and Reed [1999]. The advantages of this choice are described hereafter.
 A numerical investigation is conducted for each simulation point in Figure 1. One hundred thousand samples of length 30 are generated from a GEV distribution with known mean and median. The distortion of the sample estimates of the mean and median is measured by the normalized root mean square error,
$$ \mathrm{RMSE}\% = \frac{100}{\mu} \sqrt{\frac{1}{N_s} \sum_{s=1}^{N_s} \left( \hat{\mu}_s - \mu \right)^2}, $$
where μ and $\hat{\mu}_s$ are, respectively, the population mean (or median) and its sample estimate for the sth of the $N_s$ generated samples. The difference between the RMSE% for the mean and for the median is shown in Figure 2. Where the differences are negative, the estimation of the mean by its sample counterpart is less distorted than the corresponding median estimation, and the mean can therefore be regarded as a more reliable index value. It is clear from Figure 2 that the differences are almost negligible, except in the far right part of the graph, corresponding to highly skewed samples, where the sample median performs considerably better than the sample mean. In fact, the sample median is known to be less sensitive than the sample mean to the presence of outliers, and the latter are more likely to be found in samples from highly skewed distributions [Hampel, 1974]. Overall, we believe that Figure 2 demonstrates the advantages of using the sample median as the index value when skewed parent distributions are suspected, as in flood frequency analysis studies. Similar results are obtained with distributions other than the GEV. We therefore use the sample median as the index value in the remainder of the paper.
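The kind of comparison described above can be reproduced in miniature. The sketch below is only an illustration under assumptions made here: a two-parameter lognormal parent (median 1, known mean) is used as a stand-in for the GEV, the number of replications is far smaller than in the paper, and the function names are hypothetical.

```python
import math
import random
import statistics


def normalized_rmse(estimates, true_value):
    """Root mean square error as a percentage of the true value."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return 100.0 * math.sqrt(mse) / true_value


def compare_index_values(sigma, n=30, nrep=2000, seed=0):
    """RMSE% of the sample mean and of the sample median, lognormal parent."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(nrep):
        x = [math.exp(rng.gauss(0.0, sigma)) for _ in range(n)]
        means.append(sum(x) / n)
        medians.append(statistics.median(x))
    true_mean = math.exp(sigma ** 2 / 2.0)  # lognormal mean; the median is 1
    return normalized_rmse(means, true_mean), normalized_rmse(medians, 1.0)
```

For a strongly skewed parent (for example `sigma = 1.5`) the median is expected to show the smaller RMSE%, which is the qualitative behavior reported for the right part of Figure 2.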
4.2. Main Case Study
 The main case study corresponds to a full analysis of the performance of the tests for all points in the τ − τ3 diagram, with k = 11, n = 30, a GEV parent distribution and α = 5% (or θHW ≤ 2). The type I error of the tests is considered first, through simulation from homogeneous regions, with Δτ = 0 and Δτ3 = 0. Figure 3 reports in the background (gray numbers) the percentage of regions considered heterogeneous by each test, and in the foreground (black lines) a fitted “trend surface” whose isolines show how the type I error varies in the τ − τ3 space. Note that the average sample values <tR> and <t3R> (i.e., the averages of tR and t3R over the 1000 replications) can differ from their theoretical counterparts τR and τ3R, i.e., the gray numbers in Figure 3 do not lie precisely on the grid defined in Figure 1. This is because in small samples t and t3 are not unbiased estimators of τ and τ3 [Hosking and Wallis, 1997].
 None of the tests has the expected type I error everywhere in the τ − τ3 space. In a large part of the τ − τ3 space the percentage of regions stated as nonhomogeneous by the heterogeneity measures of Hosking and Wallis is 2–4%; this percentage rises to 8–10% for high L skewness coefficients (t3R > 0.4, Figure 3). The rank tests have a correct type I error in the central diagonal part of the L moments space, while the percentage of regions mistakenly assumed to be heterogeneous increases toward the borders (especially for the DK test).
 Figure 4 reports the results of the tests for simulated regions whose heterogeneity is due to the different dispersion of the frequency distributions at different sites. The range of variation of the L CVs (Δτ) inside the region is 0.5 times the regional average L CV (τR). With k = 11 as before, in a region with τR = 0.2 the samples are generated from distributions characterized by τ values respectively equal to 0.15, 0.16, 0.17, …, 0.25. The gray points and trend lines in Figure 4 show the power of the tests, i.e., the percentage of times the test succeeds in detecting the heterogeneity. The lack of power of the measure HW2, as anticipated by Hosking and Wallis, is evident. For all other tests, the power tends to be greater along the diagonal of the τ − τ3 space and to grow toward the upper right corner of the investigated space. HW1, compared to the DK and AD tests, has a higher power in the bottom left part of the L moments space. In contrast, for highly skewed regions it has considerably lower power than the nonparametric tests, among which the AD test is the most powerful.
4.3. Sensitivity Analysis
 As mentioned in section 3, the effect of a variation of k, n, the parent distribution, Δτ, Δτ3, and α is considered at four points (A, B, C and D) located in the central part of the τ − τ3 diagram (Figure 1), rather than over the whole diagram. As an example, we report in Figure 5 the behavior of the tests for regions whose heterogeneity is only due to the shape parameter (Δτ = 0, Δτ3 ≠ 0). In this case the nonparametric tests, in particular the AD test, and the Hosking and Wallis heterogeneity measure HW2 are (obviously) more powerful than HW1. This is particularly evident when the average shape parameter is rather large (τ3R ≥ 0.2), since for low values of τ3R (point A) all tests fail to detect the heterogeneity. As expected, the power of the tests increases with increasing heterogeneity, i.e., with increasing Δτ3.
 As a second example, we show in Figure 6 the power of the tests for regions generated from different parent distributions, when the heterogeneity is only due to differences in the L CVs (Δτ = 0.5τR). In addition to the GEV distribution, which is considered in the main case study, the other adopted three-parameter distributions are the Generalized Logistic distribution (GL), the three-parameter Lognormal distribution (LN), the Pearson type III distribution (P3) and the Generalized Pareto distribution (GP). The reader is referred to Hosking and Wallis [1997, pp. 191–208] for a description of the parametrization of these distributions and of the relations between their parameters and the L moments. The four tests behave in a very similar manner with varying parent distribution: in point A (low skewness) the Hosking and Wallis heterogeneity measure HW1 outperforms the nonparametric tests, while in point D (high skewness) the reverse is true. Points B and C reflect the transition between the two cases, and are characterized by a substantial equivalence of the different testing techniques. In all cases HW2 lacks power to discriminate between homogeneous and heterogeneous regions.
 The effects of a variation of the other parameters are more predictable, and the corresponding diagrams are not shown for reasons of space: the power of the tests increases with increasing number of sites k in a region and with increasing series length n. The tests are much more affected by the length of the series (n values from 10 to 100 are considered) than by the number of sites k (values from 3 to 21 are considered). As for an increase of the degree of heterogeneity in the dispersion parameter (Δτ/τR), its effect is obviously to increase the power of the tests. The power reaches 100% when Δτ/τR = 1 (except for HW2). In all of the considered cases the HW1 test is more powerful at points A and B, while the DK and AD tests are more powerful at points C and D. The differences in power can be relevant from a practical viewpoint, especially for intermediate degrees of heterogeneity.
5. Discussion and Conclusions
 A practical problem in regional frequency analysis is the choice of a test for regional homogeneity assessment. In this paper, the Hosking and Wallis heterogeneity measures (based on L moment ratios) are compared with the bootstrap Anderson-Darling test and with the Durbin and Knott rank test. This comparison shows that the Hosking and Wallis heterogeneity measure HW1 (based only on L CV) is preferable when skewness is low, while the bootstrap Anderson-Darling test should be used for more skewed regions. As for HW2, the Hosking and Wallis heterogeneity measure based on L CV and L skewness (L CA), it is shown to lack power considerably.
 Our suggestion is to guide the choice of the test according to Figure 7, which we have obtained as a compromise between the power and the type I error of the HW1 and AD tests. The L moment space is divided into two regions: if the t3R coefficient for the region under analysis is lower than 0.23, we propose to use the Hosking and Wallis heterogeneity measure HW1; if t3R > 0.23, the bootstrap Anderson-Darling test is preferable. Further comments arise from the observation of Figure 7, which displays some (tR, t3R) points. Each of these points is representative of a homogeneous region considered in one of three flood frequency studies: Hosking and Wallis, who directly report the tR and t3R values for several regions in the Appalachian area; and De Michele and Rosso [2002] and Farquharson et al. [1987], who give the three parameters of the GEV distribution (estimated using L moments) for many regions in Italy and around the world, respectively. Note that, as expected, these empirical regions lie in the part of the parameter space that was considered in our simulations. Also note that the majority of the points belong to the upper right region of the τ − τ3 space, where the bootstrap Anderson-Darling test is more powerful.
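The selection rule amounts to a single comparison, as in the following sketch. The function name is hypothetical, and the treatment of the boundary value t3R = 0.23 (assigned here to the AD test) is an assumption, since the text only specifies the two strict inequalities.

```python
def choose_homogeneity_test(t3_regional, threshold=0.23):
    """Suggest HW1 for low regional L skewness, bootstrap AD otherwise."""
    return "HW1" if t3_regional < threshold else "AD"
```

For example, a region with t3R = 0.10 would be tested with HW1, and one with t3R = 0.35 with the bootstrap Anderson-Darling test.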
 The good performance of the Hosking and Wallis heterogeneity measure HW1, widely used in hydrology, deserves further comment. The HW1 test is based solely on the L CV coefficient (see equations (5) and (6)), and the fact that it performs well suggests that the heterogeneity among the series is mainly due to variations in the variance of the samples. In contrast, the variations in skewness and kurtosis are in many cases masked by the sample variability of higher-order moments and L moments. As a consequence, other tests of constancy of the variance in different samples can be used as alternatives to the HW1 test. Possible examples are the “classical” Levene and Bartlett tests [Conover et al., 1981], which, however, proved to be weaker than the HW1 test in a preliminary study.
 This work was carried out under a CIPE (Comitato Interministeriale per la Programmazione Economica) grant, and a COFIN project of MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca), Italy.