A general expression for the statistical distribution of the probability of the highest event occurring in a record is presented. This result can be applied empirically to situations where records are available for multiple geographical locations. The empirical estimation of the probability of the highest events provides a means to assess whether the assumed (extreme value) distribution is appropriate for extrapolation. The approach allows the highest events from different records to be combined, and estimated return periods of the order of the length of the combined records to be validated. The method is illustrated with an analysis of the annual extreme wind speeds over the North Atlantic area according to the ERA40 dataset, showing that the Gumbel distribution is preferable to the GEV distribution for estimating the (appropriately transformed) extreme wind speeds up to return periods of 10^4 years.
 From a theoretical point of view, several distributions are possible [e.g., Coles, 2001]. However, although these distributions often fulfill the standard goodness of fit criteria, they may differ considerably in their estimates for large return periods. This makes it hard to verify which distribution correctly describes the extremes for return periods exceeding the length of the observational record.
 A frequent situation is that multiple time series of a certain meteorological variable are available (e.g., for different stations) for which one can assume the same distribution, albeit with different parameters. A common approach is then to combine the (normalized) data of multiple stations [e.g., Buishand, 1991; Hosking and Wallis, 1997], but this requires spatial homogeneity over the whole area.
 Here we present a robust diagnostic tool that detects whether the most extreme events of multiple records are well described by the fitted distributions. The tool shows how strongly the highest extremes in the records (the “outliers”) are over- or underestimated. The only conditions for the tool to work are that enough records are available, and that the most extreme events of the different records are distinct.
with F the cumulative distribution function of a (meteorological) variable y, from which follows:
Equation (1) states that F(y) is uniformly distributed between 0 and 1 for every cumulative function F, whereas equation (3) states that the inverse function of G applied to F(y) is distributed according to G.
 Let yn be the maximum of n independent observations y1 ≤ y2 ≤ … ≤ yn from distribution F(y); the distribution of yn is then given by:
$$P(y_n \le y) = F(y)^{n}$$
Using that G−1 is monotonically increasing, it follows:
 Note that equation (11) depends neither on the underlying distribution F, nor on the parameters of F, nor on the sample size n. This implies that the distribution of ΔXn can be constructed by combining (different or equal) meteorological variables from different or equal parent distributions, with correlated or uncorrelated parameters.
Equation (11) can be visualized on a so-called Gumbel plot, where the abscissa is transformed into the Gumbel variate:
The approximation is useful for the graphical illustration of the concept, as the highest event yn is plotted at a plotting position xn ≃ ln(n). Hence the horizontal ‘distance’ ΔXn between xn and the distribution (with abscissa −ln(−ln(F(yn)))) is Gumbel distributed with location parameter 0 and scale parameter 1. The concept is schematically illustrated in Figure 1.
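The reasoning behind this result can be sketched in a few lines. The block below is a reconstruction in the notation of the text (F, n, yn, ΔXn), not a reproduction of the numbered equations. It assumes a continuous F, takes ΔXn to be the distance between the abscissa −ln(−ln(F(yn))) and ln(n), and introduces p as shorthand for a probability level.

```latex
\begin{align*}
P\big(F(y_n) \le p\big) &= p^{\,n}, \qquad 0 \le p \le 1, \\
\Delta X_n &= -\ln\!\big(-\ln F(y_n)\big) - \ln n
            = -\ln\!\big(-\ln F(y_n)^{n}\big), \\
P\big(\Delta X_n \le x\big)
  &= P\Big(F(y_n) \le \exp\!\big(-e^{-x}/n\big)\Big)
   = \Big[\exp\!\big(-e^{-x}/n\big)\Big]^{n}
   = \exp\!\big(-e^{-x}\big).
\end{align*}
```

The last expression is the standard Gumbel distribution function (location 0, scale 1); neither F nor n appears in it.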
 If m records with independent outliers are used of (not necessarily equal) lengths Ti, 1 ≤ i ≤ m, the distribution of the outliers can be tested up to a maximum return period Tmax:
$$T_{\max} = \sum_{i=1}^{m} T_i \qquad (14)$$
which implies that Tmax can be orders of magnitude larger than the individual record lengths.
 We applied equation (11) to 10^4 computer-generated independent records of random length, from a randomly chosen distribution (Gumbel, Weibull, Gaussian or GEV) with randomly chosen parameters. The result is shown in Figure 2, in which Δn is the estimated value of ΔXn. It shows an excellent agreement with theory, independent of the choice of the distribution, parameters or record length. Note that Figure 2 clearly shows that the transformation according to equation (11) fully determines the distribution of ΔXn, with no degrees of freedom left.
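An experiment of this kind can be sketched with the Python standard library. The code below is an illustration under stated assumptions, not the procedure behind Figure 2: it uses only three parent distributions (Gumbel, Weibull, Gaussian), record lengths between 10 and 100, arbitrary parameter ranges, and takes ΔXn as −ln(−ln(F(yn))) − ln(n) with the exactly known F. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329       # mean of the standard Gumbel distribution
GUMBEL_STD = math.pi / math.sqrt(6.0)  # standard deviation of the standard Gumbel


def draw_record(rng):
    """Simulate one record; return (n, -ln F(y_n)) for its maximum y_n."""
    n = rng.randint(10, 100)  # record length (assumed range)
    kind = rng.choice(["gumbel", "weibull", "gaussian"])
    if kind == "gumbel":
        mu, beta = rng.uniform(-5.0, 5.0), rng.uniform(0.5, 3.0)
        # Gumbel variate = mu - beta * ln(E), with E a standard exponential variate
        y_max = max(mu - beta * math.log(max(rng.expovariate(1.0), 1e-300))
                    for _ in range(n))
        neg_log_f = math.exp(-(y_max - mu) / beta)
    elif kind == "weibull":
        a, k = rng.uniform(1.0, 10.0), rng.uniform(0.8, 3.0)
        y_max = max(rng.weibullvariate(a, k) for _ in range(n))
        # F = 1 - exp(-(y/a)^k); log1p keeps precision when F is close to 1
        neg_log_f = -math.log1p(-math.exp(-((y_max / a) ** k)))
    else:
        mu, sigma = rng.uniform(-5.0, 5.0), rng.uniform(0.5, 3.0)
        y_max = max(rng.gauss(mu, sigma) for _ in range(n))
        survival = 0.5 * math.erfc((y_max - mu) / (sigma * math.sqrt(2.0)))
        neg_log_f = -math.log1p(-survival)
    return n, neg_log_f


def simulate_deltas(m, seed=0):
    """Return m values of Delta X_n = -ln(-ln F(y_n)) - ln(n)."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(m):
        n, neg_log_f = draw_record(rng)
        deltas.append(-math.log(neg_log_f) - math.log(n))
    return deltas
```

If the theory holds, the sample mean and standard deviation of `simulate_deltas(m)` should approach `EULER_GAMMA` and `GUMBEL_STD` as m grows, whatever mixture of parents and record lengths is drawn.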
Equation (11) holds for an exactly known distribution F. In practical applications, however, a type of distribution is assumed a priori, and its parameters are then estimated from the data. This means that F is replaced by a fitted distribution: the assumed distribution with the vector of parameters fitted to the data (using Maximum Likelihood estimates).
 A common choice for F to describe meteorological annual maxima is the Generalized Extreme Value (GEV) distribution [Jenkinson, 1955]:
with the Gumbel variate x a substitute for:
in which μ is the location parameter, α the scale parameter, ξ the shape parameter, and y the annual maxima of the considered variable. The Gumbel distribution is the special case with ξ = 0.
 A well-known problem with goodness of fit tests is the influence of finite n and of the unknown parameters on the test criteria [Laio, 2004]. In our case, both the sampling effect and the effect of yn on the fitted parameters result in a standard deviation of Δn that is too small. The effect of the sample length n on the distribution of Δn is illustrated in Figure 3 for several values of n. Shown is the case in which the original distribution F is a Gumbel distribution and the fitted distribution is either a Gumbel or a GEV distribution. It shows that if the fitted distribution is a Gumbel distribution, the distribution of Δn is near the theoretical line for record lengths n ≥ 40.
 However, the presence of an extra parameter (ξ) in the GEV distribution leads to a severe underestimation of the larger values of ΔXn, even for a record length of n = 300. So, although the extra parameter may allow the GEV distribution to fit the observations better than the Gumbel distribution, this improvement holds only for return periods within the record length; for return periods exceeding the record length the results of the GEV fit are worse. This makes the GEV distribution inappropriate for estimates of return periods that exceed the record length. The same conclusion can be drawn for situations in which the parent distribution F is a GEV distribution, or in which L-moments estimates [Hosking et al., 1985] are used [Landwehr et al., 1979; Kochanek et al., 2008] (see auxiliary material).
 A possible solution may be to make assumptions about the shape parameter, e.g., to be constant over a certain area [Buishand, 1991].
 A single extreme meteorological event may determine the value of yn (and thus of Δn) for multiple records. To remove this spatial and temporal correlation, only those values of Δn should be considered that belong to distinct meteorological events. For a given extreme event, that record should be considered for which the event is most exceptional, i.e., for which Δn is maximal. Note that this condition is much weaker than those required for Regional Frequency Analysis [e.g., Reed et al., 1999].
4. Interpretation of the Empirical Δn Distribution
 The following situations may occur:
 1. If all m (temporally independent) points in a Δn-plot follow the theoretical line, the chosen distribution describes the extremes properly up to a return period corresponding to the length of the combined records (equation (14)).
 2. If all m points in a Δn-plot are below the theoretical line, the outliers are lower than the fitted distribution implies. This means that the fitted distribution overestimates the probability of the high extremes. In case of F being an extreme value distribution, this points to incomplete convergence of the normalized annual maxima to the extreme value distribution.
 3. If all m points in a Δn-plot are above the theoretical line, the outliers are higher than the fitted distribution implies. This means that the fitted distribution underestimates the probability of the high extremes. In case of F being an extreme value distribution, this points to incomplete convergence, or to the presence of a second population in the far tail of F [van den Brink et al., 2004a].
 4. If the m points in a Δn-plot follow the theoretical line up to a certain Gumbel variate, and then start to be higher, the fitted distribution describes only the more frequently occurring extremes well, but underestimates the probability of the less frequent extremes. This points to the presence of a second population in the far tail of F.
 5. If all lower points in a Δn-plot are above the theoretical line, and all higher points are below that line, F has too many free parameters, and is inappropriate to describe the extremes for return periods exceeding the individual record length.
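The comparison with the theoretical line in the cases above can be sketched as follows. This is a minimal illustration, not the construction used for the figures: the Hazen plotting position (i − 0.5)/m and the split into a lower and an upper half are assumptions made here, and the function names are hypothetical.

```python
import math


def delta_plot_points(deltas):
    """Pair each sorted delta with the standard Gumbel quantile at its plotting position.

    Returns a list of (theoretical, observed, observed - theoretical) tuples.
    """
    m = len(deltas)
    points = []
    for i, observed in enumerate(sorted(deltas), start=1):
        p = (i - 0.5) / m                      # Hazen plotting position (assumption)
        theoretical = -math.log(-math.log(p))  # standard Gumbel quantile
        points.append((theoretical, observed, observed - theoretical))
    return points


def mean_deviations(points):
    """Mean deviation from the theoretical line in the lower and the upper half."""
    half = len(points) // 2
    lower = [dev for _, _, dev in points[:half]]
    upper = [dev for _, _, dev in points[half:]]
    return sum(lower) / len(lower), sum(upper) / len(upper)
```

Deviations that are negative in both halves correspond to situation 2, positive in both to situation 3, and positive in the lower half but negative in the upper half to situation 5. Deciding whether a deviation is significant requires the sampling uncertainty of the m points, which this sketch does not provide.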
5. Example: Extreme Wind Speed in ERA40
 As an example of the method, we tested Cook's hypothesis that the normalized annual maxima of the wind speed u follow a Gumbel distribution if not u itself but u^k is the fitted variable, with k the shape parameter of the Weibull distribution:
$$F(u) = 1 - \exp\left[-\left(u/a\right)^{k}\right]$$
with a the scale parameter. He argues that the normalized block maxima of the Weibull distribution converge to the Gumbel distribution for every value of k, but that the convergence speed strongly depends on k, being optimal for k = 1, which is equivalent to fitting u^k.
 We tested Cook's hypothesis by determining the distribution of Δn for the North-Atlantic region (59°W–10°E and 41°N–70°N) by fitting a GEV and a Gumbel distribution to the annual maxima of both u and u^k. For u, we used the 44 annual maxima of the 10 m wind speed from the ERA40 dataset [Uppala et al., 2005] for the period 1958–2001, interpolated to a spatial resolution of 1°, resulting in 2100 values for yn.
 The Weibull shape parameter k was determined for every grid point from all 6-hourly wind speeds with u > a, with a the Weibull scale parameter (see auxiliary material).
 If an extratropical cyclone determines Δn for multiple grid points, we only use the maximum value of Δn, i.e., we consider the cyclone only at its most exceptional moment. We assume that cyclones are distinct if the dates corresponding to Δn differ by more than 3 days.
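One way to implement such a selection is a greedy pass over the grid points in order of decreasing Δn. The sketch below is one possible reading of the rule, not necessarily the exact procedure applied here; the function name and the tuple layout `(delta, date, grid_point)` are hypothetical.

```python
from datetime import date


def select_distinct_events(candidates, min_gap_days=3):
    """Keep, per meteorological event, only the grid point with the largest delta.

    candidates: list of (delta, date, grid_point) tuples, one per grid point.
    Two candidates are treated as the same event when their dates differ by
    min_gap_days or less.
    """
    selected = []
    # Visit the most exceptional outliers first, so each event keeps its maximum.
    for cand in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(abs((cand[1] - kept[1]).days) > min_gap_days for kept in selected):
            selected.append(cand)
    return selected
```

Because the largest Δn is visited first, every retained event is represented by the record for which it is most exceptional. Note that this reading compares dates only and ignores spatial distance, so two simultaneous but unrelated storms would be merged into one event.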
 This selection procedure leads in this example to 240 independent values of Δn. Equation (14) implies that it is likely that somewhere in the North-Atlantic area a 10^4-year (240 × 44) event happened during the 1958–2001 period.
 Fitting a GEV distribution leads to a severely biased distribution of Δn, both for u (not shown) and u^k (squares in Figure 4a). The maximum value of Δn of 2.84 corresponds to a return period of about 750 years (equation (10)), i.e., an underestimation by a factor of 14. Fitting a Gumbel distribution to the annual maxima of u (open circles in Figure 4a) results in an underestimation of Δn, indicating incomplete convergence to the Gumbel distribution. However, fitting a Gumbel distribution to the annual maxima of u^k (closed circles in Figure 4a) gives a satisfactory agreement with the theory, which confirms that the North-Atlantic annual extremes of u^k can be described by a Gumbel distribution up to return periods of 10^4 years (equation (14)). The same distribution for Δn is obtained for the winds on a 2.5° × 2.5° resolution. We stress that those return periods are only valid for the realization of the climate represented by the 1958–2001 period of the ERA40 dataset. Analyses for detrended series (as a first estimate of low-frequency variability) and of an apparently more homogeneous subset (1958–1990) lead to the same conclusion. However, inhomogeneities may be an issue in other parts of the world, especially in the Southern Hemisphere [Wang et al., 2006].
 The 240 geographical locations of Δn for the Gumbel fit to u^k are shown in Figure 4b. The event with the highest Δn of 5.32 (return period 9·10^3 years) is ‘Martin’, one of the Christmas storms in France [Ulbrich et al., 2001] at (0°E, 45°N) on 27-12-1999.
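The return periods quoted in this example can be checked with a few lines of arithmetic. The sketch assumes that Δn equals −ln(−ln(F(yn))) − ln(n), so that the fitted probability of the outlier follows from Δn and the record length n; the helper name `return_period` is hypothetical.

```python
import math


def return_period(delta, n):
    """Return period (years) that the fitted distribution assigns to an outlier
    with offset `delta` in a record of n annual maxima."""
    x = delta + math.log(n)                   # Gumbel variate of the outlier
    prob_exceed = -math.expm1(-math.exp(-x))  # 1 - exp(-exp(-x)), computed stably
    return 1.0 / prob_exceed


n_years = 44                 # annual maxima per grid point (1958-2001)
t_max = 240 * n_years        # combined record length, equation (14)
t_gev = return_period(2.84, n_years)     # largest delta of the GEV fit
t_martin = return_period(5.32, n_years)  # largest delta of the Gumbel fit to u^k

print(t_max, round(t_gev), round(t_max / t_gev, 1), round(t_martin))
```

The ratio `t_max / t_gev` reproduces the factor of about 14 by which the GEV fit falls short of the combined record length, while `t_martin` is of the same order as `t_max`.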
 A second result of Figure 4a is that in this area, no second population of extreme wind speeds is detected.
6. Discussion and Conclusions
 We derived an expression for the distribution of the probability of the most extreme events in (meteorological) records. By combining the results for multiple records, the empirical distribution of the probability of record-extremes can be compared with the theoretical distribution of the outliers.
 In climatology, the Gumbel and GEV distributions are often applied to annual maxima. Here we showed that the fitted GEV distribution on average severely overestimates the probability of the record-extremes for most observational record lengths (<100 years). This is caused by the presence of an extra parameter in the GEV distribution, which makes the fit too sensitive to sampling effects, and thus inappropriate for extrapolation. Possible practical solutions are to assume the GEV shape parameter to be constant over the area, or to fit a Gumbel distribution either to the variable y itself or to an appropriate power y^k.
 Whereas the standard goodness of fit tests evaluate the residuals within the observed time-domain, this tool tests whether the fit is appropriate outside the observed time-domain. This is especially important for extreme-value statistics, which are often used to make estimates for return periods that amply exceed the record length.