Observational probability method to assess ensemble precipitation forecasts



It is common practice when assessing the skill of either deterministic or ensemble forecasts to treat the observations as free of uncertainty. Observation uncertainty may have different causes, and the present paper discusses the uncertainty that derives from the mismatch between model-generated grid point precipitation and locally measured precipitation values. There have been many attempts to add uncertainty to the verification process; in the present paper the uncertainty is derived from the observed precipitation distribution within grid boxes of assigned resolution. The Brier skill score (BSS) and the skill score based on the area under the relative operating characteristic curve, calculated using the verification method that includes observational uncertainty (O-OP), are compared with analogous scores obtained from standard verification methods. The scores are calculated for two different forecasting systems: the European Centre for Medium-Range Weather Forecasts Ensemble Prediction System and the Spanish Meteorological Agency Short-Range Ensemble Prediction System.

The results show that the resolution component of the BSS improves when using the O-OP method, i.e. forecast probabilities are distinguished from climatological probabilities and therefore the system has better skill. The reliability component, on the contrary, greatly degrades and this degradation is worse for lower precipitation thresholds. The results also show that the more asymmetric the precipitation distribution is within the grid box, the larger is the degradation of the reliability component. The overall BSS improves except for low thresholds. These results encourage further research into observation uncertainty and how it can be effectively accounted for in the verification of weather parameters such as precipitation. Copyright © 2011 Royal Meteorological Society

1. Introduction

Verification of numerical weather prediction (NWP) models plays a central role in the improvement of both short- and medium-range forecasts. Deterministic and ensemble forecasts are assessed to ascertain their skill against observations, and the latter are assumed to be exact. Even though this assumption is not in general true, it is widely accepted in the context of verification. In the last few years, many papers have discussed the validity of this assumption and concluded that it may be legitimate for longer forecast ranges, when the forecast error is much larger than the observation uncertainty, but that it is inconsistent to accept that the model can be uncertain while the observations are exact.

Earlier attempts to take account of observation uncertainties can be found in Ciach and Krajewski (1999), Briggs et al. (2005) and Roberts and Lean (2008), who use the information on observation error to define the uncertainty. Bowler (2008) takes a slightly different approach whereby the standard deviation estimates from the data assimilation are used to quantify the error and therefore the uncertainty.

Saetra et al. (2004) investigate the effects of observation errors on the statistics for ensemble spread and reliability by adding normally distributed noise with predefined standard deviation to the forecast for each ensemble member. The addition of this uncertainty reduces the number of outliers, leading to flatter rank histograms in the short-range forecast. Moreover, Saetra et al. show that rank histograms are highly sensitive to the inclusion of observation error in the verification process, whereas reliability diagrams are less sensitive. In fact, perfect observations or observations with added noise produce almost identical results. Similar results are discussed in Pappenberger et al. (2009), who classify observation uncertainty as a result of measurement errors, inhomogeneous observation density, or model or observation interpolation.

Candille and Talagrand (2008; hereafter CT08) validate an ensemble prediction system by introducing the ‘observational probability’ method in the verification process (hereafter referred to as OP). They find that reliability and discrimination are degraded, while resolution is improved. Observation uncertainty is defined by a normal distribution whose expectation and standard deviation are obtained by random draws from a normal and a lognormal distribution respectively, which guarantees that the uncertainty is variable in both mean and spread.

The present paper discusses the uncertainty associated with precipitation observations due to the inability of such observations to be representative of an area around them. This uncertainty is often referred to in the literature as ‘representativeness error’ and is linked to the large spatial variability of rainfall. The verification approach presented in this paper is applied to precipitation forecasts from two ensemble forecasting systems and includes the uncertainty as sampled from the spatial variability of observed precipitation. As an extension of OP, and because of its empirical nature, it is termed ‘observed observational probability’ (O-OP). This methodology aims at extending previous attempts to include uncertainty in the verification process to variables that are not Gaussian distributed. Particular attention has been paid to the asymmetry of the precipitation distribution within each grid box: the stronger the asymmetry, the more likely it is that representativeness issues play an important role in the computation of the scores. Synthetic data experiments are used to compare, in a theoretical context, O-OP with the methods assessed in CT08, as well as to discuss the dependence of model performance on the asymmetry of the observations' distribution. These experiments help in understanding the behaviour of the different verification methods and support the results obtained with the forecast data experiments.

Observation uncertainty is defined using information from high-density observation networks available in Europe. Stations contained within each model grid box are used to define the uncertainty, while their averaged value, assigned to each grid point, is the observed status of the atmosphere.

Section 2 contains the description of the observation database and of the forecasting systems. Section 3 describes metrics and the O-OP methodology applied to synthetic and real data, while results are discussed in section 4. Conclusions are drawn in section 5.

2. Observation dataset and models

Models and the observation dataset will be briefly introduced in this section.

2.1. Observation dataset

European meteorological offices provide 24 h accumulated precipitation reports from their high-density rain gauge networks to the European Centre for Medium-Range Weather Forecasts (ECMWF). These data are used to compute an observed precipitation estimate using a simple upscaling technique (Ghelli and Lalaurette, 2000; Cherubini et al., 2002), whereby stations are assigned to model grid boxes and then averaged to produce one areal-average value, assigned to the corresponding grid point. Such a grid-average value can be compared with the model precipitation forecast, which also represents a grid-box areal average, as the two then refer to the same spatial scales.

In the present paper, 24 h accumulated precipitation estimates on a 1° × 1° grid and on a 0.25° × 0.25° grid have been used. Moreover, a rainfall distribution within each grid box has been computed, utilizing the information provided by all the stations belonging to the same box. The observation uncertainty is then defined using the 10th, 25th, 50th, 75th and 90th percentiles of these distributions, with the 10th and 90th percentiles representing the left and right tails. No function is fitted to the percentiles to obtain a continuous probability distribution function; the empirical quantiles are used directly. The accumulation period (0600 UTC to 0600 UTC) has been selected as the most common, and the 24 h accumulation period guarantees that a slight time shift in accumulation periods does not affect the results significantly.
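The upscaling and quantile extraction described above can be sketched as follows; the station-record format, function name and box-indexing scheme are illustrative assumptions, not the operational ECMWF implementation:

```python
import numpy as np

def upscale_and_quantiles(stations, res=1.0, min_obs=5):
    """Assign station reports to res x res grid boxes; return, per box,
    the areal-average precipitation and the five O-OP percentiles.
    `stations` is an iterable of (lat, lon, precip_mm) tuples
    (a hypothetical record format)."""
    boxes = {}
    for lat, lon, precip in stations:
        key = (np.floor(lat / res), np.floor(lon / res))  # grid-box index
        boxes.setdefault(key, []).append(precip)
    out = {}
    for key, vals in boxes.items():
        if len(vals) < min_obs:      # only boxes with at least 5 reports are used
            continue
        v = np.asarray(vals, float)
        out[key] = {
            "xavg": v.mean(),        # grid-point observed estimate (single value)
            # 10th/25th/50th/75th/90th percentiles by linear interpolation
            "quantiles": np.percentile(v, [10, 25, 50, 75, 90]),
        }
    return out
```

For each qualifying box this yields both the areal-average estimate used by the standard verification and the five percentiles that define the O-OP observational distribution.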

The observation dataset was available for the period April 2007 to December 2008.

2.2. Models

Two forecasting systems have been used for this study: the Ensemble Prediction System (EPS) from ECMWF and the Short-Range Ensemble Prediction System (SREPS) from the Spanish Meteorological Agency (AEMET). Both forecasting systems produce an ensemble of forecasts, which provide a description of a probability density function (PDF) of forecast states of the atmosphere.

The ECMWF EPS forecasting system has been operational at ECMWF since 1992 and comprises 50 members and one control forecast. Each of the 50 ensemble members is initialized from a perturbed initial condition (Buizza and Palmer, 1995), to which stochastic perturbations (Buizza et al., 1999), sampling the model errors due to parametrized physical processes, are added. The control forecast is initialized with the unperturbed initial condition (Buizza et al., 2003). The ECMWF EPS resolution during the period under consideration is T399L62 (spectral truncation T399 with 62 vertical levels), which corresponds to about 50 km in the horizontal in the mid latitudes. The EPS is not a static forecasting system and it is continuously updated to benefit from the best available schemes and to use the latest observations from new observation systems. Ghelli and Primo (2009) describe some of the changes that have had an impact on the ECMWF precipitation forecasts.

Since the observed precipitation estimate is available on a 1° × 1° grid, the model fields have been interpolated onto the same grid. The higher-resolution estimate (0.25° × 0.25°) was not considered a fair choice because the original grid of the ECMWF EPS model is 0.45° × 0.45°, much coarser than 0.25° × 0.25°.

SREPS (García-Moya et al., 2011) has been developed at AEMET and adopts a different approach from ECMWF to sampling forecast uncertainty. The present study will not assess the differences in the way the sampling of uncertainty is carried out, but will solely assess the benefits of including observation uncertainty in the calculation of the verification scores. AEMET-SREPS started producing daily ensemble forecasts (0000 UTC, 20 members) in April 2007; the current system runs twice daily (0000 and 1200 UTC) with 25 members. During the period April 2007 to December 2008, SREPS comprised a set of five different limited area models (LAMs) that ran once a day (0000 UTC) using initial and boundary conditions from four different global models, totalling 20 members. In this way the uncertainties originating from model errors and from imperfect initial conditions are taken into consideration. Table I shows the set-up of SREPS. The global models include variational data assimilation schemes and up-to-date physical parametrization schemes. Within SREPS, each model is run at its own resolution and map projection, and post-processing then ensures that all the models are consistent on the same area, latitude–longitude grid, and vertical and horizontal resolution. The latter has been set to 0.25° × 0.25°, with 40 equally spaced vertical levels.

Table I. AEMET-SREPS composition: models and initial and boundary conditions. Each of the five limited area models is driven by the same four sets of initial and boundary conditions.

Models:
- HIgh-Resolution Limited Area Model (HIRLAM) (McDonald and Haugen, 1992; Undén et al., 2002)
- DWD (Deutscher Wetterdienst) high-resolution regional model (Majewski and Schrodin, 1994)
- UK Met Office Unified Model (Cullen, 1993)
- COSMO (Consortium for Small-scale MOdelling) (Doms and Schättler, 1997)
- Mesoscale Model V5 (Penn State University and National Center for Atmospheric Research) (Dudhia, 1993; Grell et al., 1994)

Initial and boundary conditions:
- ECMWF IFS (Integrated Forecasting System) (Jakob et al., 2000)
- UK Met Office Global Unified Model (Cullen, 1993)
- NCEP (National Centers for Environmental Prediction) Global Forecasting System (Sela, 1982)
- DWD Global Model (Majewski et al., 2002)

The 0000 UTC forecast cycle for ECMWF EPS and SREPS has been used in this study. Moreover, the t + 30 and t + 54 forecast ranges for the two ensemble systems have been selected, as the SREPS forecast lead time extends only up to 3 days. These time steps verify at 0600 UTC, when the observations are available.

3. Metrics and methodology

Comprehensive descriptions of standard probabilistic (including ensemble) verification methods can be found in Candille and Talagrand (2005), Wilks (2006) and Jolliffe and Stephenson (2003). In section 3.1 a summary of the verification framework and the OP method, as described in CT08, is given; the proposed O-OP method is then introduced in section 3.2, synthetic data experiments are described in section 3.3 and forecast data experiments in section 3.4. CT08 notation is followed for consistency.

3.1. General framework and OP method described in CT08

A binary event X of the parameter x exceeding a threshold t is considered ({X : x > t}). Given a forecast probability p (the fraction of member values exceeding t, p ∈ {0/N, 1/N, …, N/N}) and the corresponding a posteriori observation probability p0, the Brier score is defined as BS = E[(p − p0)²], where E[·] is the expectation value over all cases (forecast–observation pairs). The BS is negatively oriented: BS = 0 if and only if p = p0 on every realization, while BS = 1 indicates the worst possible forecast. The standard decomposition of the BS was discussed in CT08 (CT08-7) and is reported here for clarity:

BS = E[(p − p′(p))²] − E[(p′(p) − pc)²] + (E[p0²] − pc²) = B_rel − B_res + B_unc    (CT08-7)

The expectation values can be taken through a partition in probability space by E[y] = E[Ep[y]] = ∫y g(p)dp, where g(p) is the forecast probability distribution, p′(p) = Ep[p0] is the distribution of conditional observation frequencies and pc = E[p′(p)] = E[Ep[p0]] is the base rate (or frequency of occurrence of the event). The reliability component (B_rel) measures the correspondence between forecast probabilities and p′(p) and can be improved by calibration, while the resolution component (B_res) gives a measure of the variability of p′(p) (given the forecasts) around the base rate, and cannot be improved by calibration. For a perfectly reliable system the reliability component vanishes, and the resolution is equal to the sharpness, a measure of the variability of the forecast probability distributions, or how often different forecast probabilities occur (without taking the observations into account). The uncertainty component (B_unc) is solely dependent on the observations and corresponds to the value of the BS obtained using the sample climatology as forecast (perfect reliability, no resolution); it is equal to the variance of p0 and is usually taken as the reference for the Brier skill score (BSS), if special care is taken with the interpretation (Mason, 2004). The BSS decomposition in reliability and resolution used here is that of CT08-8:

BSS = 1 − BS/B_unc = 1 − BSS_rel − BSS_res,  where  BSS_rel = E[(p − p′(p))²]/(E[p0²] − pc²)  and  BSS_res = 1 − E[(p′(p) − pc)²]/(E[p0²] − pc²)    (CT08-8)
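As a minimal numerical sketch of the extended scores, the decomposition CT08-7 and the skill components CT08-8 can be computed by binning forecast probabilities into the N + 1 possible ensemble fractions; function and variable names are illustrative:

```python
import numpy as np

def extended_brier(p, p0, n_members):
    """Extended Brier score decomposition (CT08-7) and skill components
    (CT08-8) when the observation is itself a probability p0 in [0, 1].
    Forecast probabilities p (ensemble fractions k/N) are partitioned
    into N + 1 bins."""
    p, p0 = np.asarray(p, float), np.asarray(p0, float)
    bins = np.round(p * n_members).astype(int)          # bin index 0..N
    pc = p0.mean()                                      # base rate
    # conditional observed frequency p'(p) in each occupied bin
    p_prime = np.array([p0[bins == k].mean() if np.any(bins == k) else np.nan
                        for k in range(n_members + 1)])
    pp = p_prime[bins]                                  # p'(p) per case
    b_rel = np.mean((p - pp) ** 2)                      # reliability term
    b_res = np.mean((pp - pc) ** 2)                     # resolution term
    b_unc = np.mean(p0 ** 2) - pc ** 2                  # variance of p0
    bs = b_rel - b_res + b_unc                          # CT08-7
    return {"BS": bs, "BSS": 1.0 - bs / b_unc,          # CT08-8
            "BSS_rel": b_rel / b_unc, "BSS_res": 1.0 - b_res / b_unc}
```

Because p takes only the values k/N, the binned p′(p) makes the three-term decomposition reproduce E[(p − p0)²] exactly.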

Discrimination, i.e. the ability of the system to distinguish the occurrence or non-occurrence of the binary event X given the observations, is a complementary measure of ensemble performance. Discrimination is related to the hit rate (H) and the false alarm rate (F) for a given base rate s. The area A under the ROC curve (H vs. F) is a measure of discrimination, with A = 0.5 for the sample climatology (no skill) and A = 1 for a perfect forecast. The expressions for H and F in the general context described so far are (in CT08 these are referred to as CT08-4):

H(pt) = E[p0 Θ(p − pt)]/s,  F(pt) = E[(1 − p0) Θ(p − pt)]/(1 − s)    (CT08-4)

where Θ is the Heaviside step function, pt is the probability decision threshold and s = E[p0] is the base rate.
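In the same spirit, the extended H and F can be sketched by letting each case contribute weight p0 to the 'event observed' class and 1 − p0 to the 'event not observed' class; this is a sketch of the natural generalization, not code from CT08:

```python
import numpy as np

def roc_extended(p, p0, thresholds):
    """Hit rate H and false-alarm rate F when the observation is a
    probability p0: each case counts with weight p0 as 'event observed'
    and 1 - p0 as 'event not observed'. Returns (H, F, ROC area)."""
    p, p0 = np.asarray(p, float), np.asarray(p0, float)
    H, F = [1.0], [1.0]                      # decision threshold 0: always warn
    for pt in sorted(thresholds):
        warn = p >= pt                       # Heaviside step, Theta(p - pt)
        H.append(np.sum(p0 * warn) / np.sum(p0))
        F.append(np.sum((1.0 - p0) * warn) / np.sum(1.0 - p0))
    H.append(0.0)                            # highest threshold: never warn
    F.append(0.0)
    xs, ys = np.array(F)[::-1], np.array(H)[::-1]
    area = float(np.sum(np.diff(xs) * (ys[1:] + ys[:-1]) / 2.0))  # trapezoid rule
    return np.array(H), np.array(F), area
```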

Traditional verification techniques do not account for observational uncertainty; therefore p0 ∈ {0,1}, E[p0²] = E[p0], and the set of scores CT08-7,8 (‘extended scores’) simplifies to the traditional expressions (‘standard scores’). CT08 assess the impact of observational uncertainty by comparing three verification methods: traditional or ‘reference’ (REF), which verifies an ensemble against observations, ignoring the presence of uncertainty, with standard scores; ‘perturbed-ensemble’ (ENS), which perturbs the ensemble consistently with the observational uncertainty and verifies it against the same observation data with standard scores, thus providing a reliable system by construction; and finally ‘observational probability’ (OBS, here OP), which accounts for the observational uncertainty by defining an observation probability distribution and verifying the ensemble against this distribution with the extended scores.

The OP method provides, on each realization, a p0 that can take any value in the interval [0,1]. The base rates pc for REF and OP will differ (pc(ref) ≠ pc(obs)) and a perfect forecast (one that always predicts event X with probability p = p0) will not necessarily be a deterministic forecast. The Brier uncertainty term E[p0²] − pc² (CT08-7) plays an important role here: it is the variance of p0. It is maximized when p0 can only be 0 or 1; when a range of values for p0 is allowed, the term decreases. Thus E[p0²] − pc² decreases as the observational uncertainty increases and, according to equations CT08-8, this decrease makes BSS_rel increase (degrade) and BSS_res decrease (improve), as both are negatively oriented. Therefore, when compared with REF, OP degrades reliability and generally improves resolution, showing a higher overall BSS but a degraded discrimination.
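A small numerical illustration of this argument, with made-up p0 values: at a fixed base rate pc = 0.5, the uncertainty term shrinks as soon as p0 is allowed interior values:

```python
import numpy as np

# Two sets of observation probabilities with the same base rate pc = 0.5:
p0_binary = np.array([0.0, 1.0, 0.0, 1.0])   # REF: p0 restricted to {0, 1}
p0_spread = np.array([0.2, 0.8, 0.4, 0.6])   # OP/O-OP: uncertainty admitted

def unc(p0):
    """Brier uncertainty term E[p0^2] - pc^2, i.e. the variance of p0."""
    return np.mean(p0 ** 2) - np.mean(p0) ** 2

# Same pc, but the uncertainty term shrinks once p0 takes interior values
print(unc(p0_binary), unc(p0_spread))   # ~0.25 vs ~0.05
```

With a smaller denominator in CT08-8, the same reliability term yields a larger (worse) BSS_rel, which is the degradation discussed above.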

3.2. The proposed O-OP method, statistics and metrics

The O-OP method is introduced to take into account the observed uncertainty for a parameter that is non-Gaussian distributed, representing this uncertainty by the grid box empirical observation distribution, which can be sampled given a sufficiently large number of observations.

Each grid box contains a number of observations: a precipitation data distribution. By linear interpolation (with no assumptions about its shape) five quantiles {xα(i)} are computed from this distribution and used to represent the data distribution inside the grid box: {xα(i)} with i = 1, …, 5 and α(i) = 0.10, 0.25, 0.50, 0.75, 0.90, in the sense that P(x > xα(i)) = 1 − α(i). This is expected to be consistent, as only boxes containing at least five observations are selected, and these cover the domain reasonably well.

The estimate for the observed probability p0 = P(x > t) of the event X on each grid box is derived using the quantiles {xα(i)} = {x0.10, x0.25, x0.50, x0.75, x0.90}, which represent the observational distribution. For a given threshold t, let k be the index such that xα(k−1) < t < xα(k); then the following inequalities apply (see Figure 1):

α(k−1) = P(x < xα(k−1)) ≤ P(x < t) ≤ P(x < xα(k)) = α(k)    (1a)
Figure 1.

Estimation of the observation probability p0 for the binary event x > t with the O-OP method (see text), for an asymmetric distribution on which the five O-OP quantiles have been plotted. In this case xα(k−1) = x0.50 and xα(k) = x0.75, so that the threshold lies between them, xα(k−1) < t < xα(k). The exact p0 is the grey area P(x > t) in (a); it is bounded above by the grey area P(x > xα(k−1)) in (b) and below by the grey area P(x > xα(k)) in (c); the averaged grey area p0 ≈ (P(x > xα(k)) + P(x > xα(k−1)))/2, depicted in (d), is the O-OP estimate of p0.

The probability P(x < t), given by the area under the distribution from t to the left, has a lower bound (area from xα(k−1) to the left) and an upper bound (area from xα(k) to the left). In the same way, and now using P(x > xα(i)) = 1 − α(i):

P(x > xα(k)) ≤ P(x > t) ≤ P(x > xα(k−1))    (1b)
1 − α(k) ≤ p0 ≤ 1 − α(k−1)    (1c)

The term P(x > xα(k−1)) gives an overestimation, whereas P(x > xα(k)) gives an underestimation of the expected observed probability p0 = P(x > t), which is likely to be closer to the overestimation (‘left’) value, owing to the expected asymmetry of the observed precipitation distribution inside the grid box. However, a simple average of the upper and lower bounds has been used to estimate p0 (see Figure 1):

p0 ≈ (P(x > xα(k)) + P(x > xα(k−1)))/2 = 1 − (α(k−1) + α(k))/2    (2)

In the extreme cases, p0 = 0 has been taken as the estimate if t > x0.90 and p0 = 1 if t < x0.10.
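Equation (2), together with these extreme cases, reduces to a small lookup over the quantile brackets; a minimal sketch (names are illustrative):

```python
def oop_p0(quantiles, t):
    """Estimate the observed probability p0 = P(x > t) from the five
    grid-box quantiles [x_0.10, x_0.25, x_0.50, x_0.75, x_0.90],
    given in increasing order, following Eq. (2)."""
    alphas = (0.10, 0.25, 0.50, 0.75, 0.90)
    if t > quantiles[-1]:          # threshold above the 90th percentile
        return 0.0
    if t < quantiles[0]:           # threshold below the 10th percentile
        return 1.0
    for i in range(4):             # find the bracket x_alpha(k-1) <= t <= x_alpha(k)
        if quantiles[i] <= t <= quantiles[i + 1]:
            # average of upper bound 1 - alpha(k-1) and lower bound 1 - alpha(k)
            return 1.0 - 0.5 * (alphas[i] + alphas[i + 1])
    return 0.0  # unreachable for sorted quantiles
```

Only six values of p0 can occur (1, 0.825, 0.625, 0.375, 0.175 and 0), which is the discretization of the p0 estimate referred to later in the text.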

At each grid box Eq. (2) is applied to compute the observational probability p0, while the forecast probability p is computed in the usual way. Once p and p0 are computed for all the cases, the accumulation of statistics (i.e. computing p′(p) = Ep[p0], etc.) is done by partitioning the probability space into N + 1 bins, a convenient and suitable classification when the N members of the ensemble are equally considered (Ziehmann, 2000). The extended scores (equations CT08-7 and CT08-8) are then applied to compute the BS and BSS decompositions, and equations CT08-4 to compute the ROC area from H and F. O-OP differs from OP only in the way the observational uncertainty distribution is considered, and hence in the p0 estimate. Both differ from REF in the application of ‘extended scores’ instead of ‘standard scores’.

Although not strictly necessary, the statistics can also be represented in terms of cell counts (e.g. Jolliffe and Stephenson, 2003). In that case, as p0 can take a range of values between 0 and 1 (in OP or O-OP), a partition can also be made in observation probability space: for O-OP into six bins (five quantiles are used, so six different values of p0 can occur). In this context the joint distribution of forecasts and observations is given by two distributions that characterize the system performance completely: g(p,p0), the forecast–observation probability distribution, and p′(p) = Ep[p0], the conditional observation distribution (as described before).

As a measure of discrimination, the ROC skill area (RSA) is used. If A is the area under the ROC curve, RSA = 2A − 1 takes values in the interval [−1,1]: 1 for a perfect forecast, 0 for no skill and −1 for a potentially perfect forecast after calibration. Discrimination is related to resolution, but the two do not measure exactly the same property and, especially when observational uncertainty is present, they can behave differently in indicative ways. While the BSS is potentially insensitive to extreme events, the RSA is not (Gutiérrez et al., 2004), whereas the RSA can be insensitive to some kinds of forecast biases (Kharin and Zwiers, 2003). The RSA is therefore used here as a complementary score to the BSS.

It is important to point out that the observational uncertainty introduced in the O-OP method aims at describing representativeness issues linked to the observations, whereas ensemble forecasts sample initial-condition and model errors. It is therefore beyond the scope of the present paper to compare the model and observation probability distribution functions; the main scope is to introduce observational uncertainty information empirically into the verification measures. Moreover, I. T. Jolliffe (2009; personal communication) noted that the extended BS including observational probability can only be demonstrated to be a proper score under certain conditions, and thus it could be hedged. Further research on this issue could be developed in future work.

3.3. Synthetic data experiments set-up

Using the same kind of synthetic data experiments as CT08, now applied to precipitation, O-OP is assessed together with the three methods presented in CT08: ENS, REF and OP (see above). These experiments can help in understanding, and complement, the results obtained from the real experiments (see the following subsection). CT08 notation is preserved as much as possible; for a detailed description of the basis for this numerical experiment the reader is referred to CT08.

Precipitation is not normally distributed, and here normal distributions have been replaced either by gamma (Wilks, 2006; Sloughter et al., 2007) or log-gamma distributions. In each case the uncertainty of the day is defined by a gamma distribution G(α,β), whose parameters α and β are determined numerically from the expectation m and standard deviation s (the notation G(m,s) is used hereafter for G(α(m,s), β(m,s))). The expectation m is obtained by a random draw from G(M,S), while s is obtained by a random draw from a log-gamma distribution with parameters σ and d. The values of the parameters M, S, σ and d (see CT08) have been selected so that G(m,s) fits the climatological distribution of precipitation in Europe for the 3-month period considered in the forecast real-data experiments. The truth xτ is drawn from G(m,s), the observation x0 from G(xτ,ε) for a given ε, and the ensemble {êi} is obtained from G(m,s).

In REF (traditional), the ensemble {êi} is verified against x0 using standard scores. In ENS the perturbed members {ûi} are generated from G(êi,ε) and verified against x0 using standard scores; this system is reliable by construction. In OP, the raw ensemble {êi} is verified against the observation described by the distribution G(x0,ε) using the extended scores described in section 3.1. In O-OP, five random observations are drawn from G(x0,ε), the five quantiles {xα(i)} are computed from them by linear interpolation, and the ensemble {êi} is then verified against this quantile-based observational distribution using the extended scores described in section 3.2. Results for the synthetic data are presented in section 4.
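The gamma draws can be sketched as follows, using the usual moment-matching parametrization (shape = m²/s², scale = s²/m). The numerical parameter values, the lognormal stand-in for the log-gamma draw of s, and all names are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_ms(m, s, size=None):
    """Draw from a gamma distribution G(m, s) parametrized by its
    expectation m and standard deviation s (shape = m^2/s^2, scale = s^2/m)."""
    shape, scale = (m / s) ** 2, s ** 2 / m
    return rng.gamma(shape, scale, size)

# One synthetic realization (illustrative parameter values, not CT08's):
M, S, eps = 4.0, 2.0, 1.0
m = gamma_ms(M, S)                   # expectation of the day, drawn from G(M, S)
s = rng.lognormal(0.0, 0.5)          # spread of the day (lognormal stand-in)
truth = gamma_ms(m, s)               # the 'truth' x_tau
obs = gamma_ms(truth, eps)           # observation drawn around the truth
ensemble = gamma_ms(m, s, size=20)   # raw ensemble members from G(m, s)
```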

3.4. Forecast data experiment set-up

The performance of two ensembles (ECMWF-EPS and AEMET-SREPS) is measured, comparing for each of them the standard verification method REF (using the precipitation averaged grid-point values xavg) with the O-OP method (using the precipitation quantiles {x0.10,x0.25,x0.50,x0.75,x0.90} and estimating the observed probability at each grid point). The BSS, its reliability (BSS_rel) and resolution (BSS_res) components, and the RSA are discussed.

Results for the forecast ranges t + 30 and t + 54 have been studied and show similar behaviour. Therefore, to improve clarity it was decided to display solely figures for the t + 54 range, but the conclusions can be extended to the shorter forecast lead time.

Rainfall thresholds 1, 5 and 10 mm/24 h have been selected to define the corresponding binary events of exceedance. For each binary event the joint distribution of forecasts and observations has been computed for both the REF and O-OP methods. A different number of forecast probability bins has been used for the two ensembles, namely 52 bins for ECMWF-EPS and 21 for AEMET-SREPS. Scores are calculated for 3-monthly periods to avoid issues linked to small sample size. Consistently, sample climatologies for the 3-month periods have been taken as reference. Hence the time series figures shown in the results are created with the corresponding 3-month moving average.

4. Results

4.1. Synthetic data experiments

Synthetic data experiments are used to compare the behaviour of different verification methodologies: traditional (REF), perturbed ensemble (ENS), OP (CT08) and the novel O-OP.

Figure 2 (left panels) shows the behaviour of the four verification methods for precipitation as the uncertainty in the observations varies (ε/σ; see section 3; ε = 0 corresponds to verification against the truth) for a fixed precipitation threshold (5 mm). The results for the reliability (upper left) are similar to those of CT08 for 850 hPa temperature (T850, cf. Figure 4 in CT08): ENS shows perfect reliability, REF shows a worsening of the reliability as the uncertainty grows (more precisely, as the uncertainty term decreases; see section 3) and for OP the worsening is much more pronounced. The novel O-OP shows a faster deterioration rate of the reliability, though consistent with OP. The resolution (bottom left) is deficient in both REF and ENS, consistent with the fact that they share the same uncertainty term, while OP improves the resolution over a range of uncertainty values. O-OP shows a further improvement of the resolution over a larger range of uncertainty values. This result is not fully understood, and could be due to the way in which O-OP samples the uncertainty. The main difference from CT08 is that the resolution growth rate is now larger for all methods. This could be due to the larger decrease in the Brier uncertainty term (see section 3) as the uncertainty grows when precipitation is considered.

Figure 2.

Left: BSS_rel (top) and BSS_res (bottom) behaviour as function of the uncertainty (ε/σ) given the 5 mm threshold for the four different verification methods: traditional (REF, solid line), perturbed ensemble (ENS, dotted line), OP (dashed line) and O-OP (dashed-dotted line). Right: BSS_rel (top) and BSS_res (bottom) behaviour of the same verification methods for different precipitation thresholds, given the uncertainty ε/σ = 1.

Figure 2 (right panels) displays the variations in reliability (top right) and resolution (bottom right) as a function of precipitation threshold for a fixed uncertainty (ε/σ = 1). ENS shows perfect reliability, as expected; in REF the reliability deteriorates slightly with the threshold; OP shows a significant degradation in the range 1–5 mm and remains constant thereafter. In O-OP the reliability improves for lower thresholds, in contrast with OP, because of the way in which O-OP estimates p0, while for higher thresholds the reliability deteriorates in O-OP, consistent with OP. All methods (bottom right) display a deterioration of the resolution (as ε/σ = 1; see above) as the threshold grows; in OP the degradation is most important at higher thresholds. The different behaviour of the OP and O-OP resolution can be partly explained by the discretization of the p0 estimate in O-OP (see section 3.2).

CT08 define the observational probability as a normal distribution with an assigned variance, but this cannot be used here: precipitation is asymmetrically distributed. To assess the impact of asymmetry on the observational uncertainty, an index of asymmetry is needed to establish a possible relationship between asymmetry, observational uncertainty and changes in the resolution and reliability terms. The Yule–Kendall index (Wilks, 2006), the difference between (upper quartile minus median) and (median minus lower quartile), divided by the interquartile range, provides a robust (independent of any assumption about the distribution) and resistant (not unduly affected by a few outliers) measure of the skewness of the precipitation distribution at a specific grid point. This index has been computed in the synthetic data experiments. Figure 3 shows the change in O-OP BSS_rel with respect to REF BSS_rel plotted against the asymmetry of the O-OP precipitation distribution for three selected precipitation thresholds: there is a clear increase of BSS_rel for 1 mm and 5 mm, while for 10 mm the growth is clear only over the first half of the asymmetry range (for higher thresholds and high asymmetry values the reliability is not degraded in the same way).
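The index is straightforward to compute; a minimal sketch:

```python
import numpy as np

def yule_kendall(values):
    """Yule-Kendall skewness index: ((q75 - q50) - (q50 - q25)) / (q75 - q25).
    Robust (no distributional assumption) and resistant (quartiles only)."""
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    return (q75 - 2.0 * q50 + q25) / (q75 - q25)
```

It is 0 for a symmetric sample and positive for the right-skewed distributions typical of grid-box precipitation.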

Figure 3.

Absolute change of BSS_rel as a function of asymmetry for the synthetic data experiments on three different cases: 1 mm/24 h (dashed line); 5 mm/24 h (solid line); 10 mm/24 h (dotted line).

The overall synthetic data results are in general agreement with the CT08 findings: O-OP degrades reliability with respect to REF, with both falling short of the ‘baseline’ reliability given by ENS; O-OP improves resolution with respect to both REF and ENS, owing to the decrease in the uncertainty term (the variance of the observations decreases when observational uncertainty is accounted for). Compared with OP, O-OP degrades reliability consistently and improves resolution further.


4.2. Forecast data experiments

Figure 4 shows the time series of the BSS and its components for three different thresholds: the standard verification (solid line) and the O-OP method (dotted line) for 1 mm/24 h (a), 5 mm/24 h (b) and 10 mm/24 h (c). The BSS slightly improves during the winter months when using the O-OP method (dotted line) for the higher thresholds (b, c), in agreement with the findings of CT08, while for the 1 mm/24 h threshold (a) the BSS deteriorates compared with REF (solid line) because of the large increase in BSS_rel (discussed below). The resolution (BSS_res, circles) and reliability (BSS_rel, triangles) components, both negatively oriented, show that BSS_rel is further amplified when the O-OP method is used (filled triangles), while BSS_res (filled circles) decreases (improves) when compared with REF (corresponding open symbols). This behaviour can be observed for all the precipitation thresholds and is in agreement with CT08.

Figure 4.

BSS time series for ECMWF EPS 24 h accumulated precipitation forecasts at t + 54 (0600 UTC), using the standard verification method (solid thick line) and the O-OP method (dotted thick line). BSS_rel (thin lines, triangles for standard approach, and filled triangles for O-OP) and BSS_res components (thin lines, circles for standard approach and filled circles for O-OP) are also shown. The 1 mm/24 h precipitation threshold is shown in panel (a), 5 mm/24 h in panel (b) and 10 mm/24 h in panel (c).
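As background for the reliability and resolution terms discussed here, the Brier score admits the standard Murphy decomposition BS = reliability − resolution + uncertainty. The sketch below is illustrative: the binning into 11 probability categories, and the normalization of BSS_rel and BSS_res by the uncertainty term, are assumptions rather than the paper's exact section 3 definitions.

```python
import numpy as np

def brier_decomposition(probs, outcomes, bins=11):
    """Murphy decomposition of the Brier score.

    probs    : forecast probabilities in [0, 1]
    outcomes : binary observations (1 if the event occurred)
    Returns (reliability, resolution, uncertainty), with
    BS = reliability - resolution + uncertainty.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    obar = outcomes.mean()  # sample climatological frequency
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, bins - 1)
    rel = res = 0.0
    for k in range(bins):
        mask = idx == k
        nk = mask.sum()
        if nk == 0:
            continue
        fk = probs[mask].mean()     # mean forecast probability in the bin
        ok = outcomes[mask].mean()  # observed frequency in the bin
        rel += nk * (fk - ok) ** 2
        res += nk * (ok - obar) ** 2
    unc = obar * (1.0 - obar)
    return rel / n, res / n, unc
```

With this decomposition, one common choice for negatively oriented components (smaller is better for both, as in the figures above) is BSS_rel = rel/unc and BSS_res = 1 − res/unc, which makes the sensitivity of BSS_res to the uncertainty term explicit.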

The increase in BSS_rel is inversely related to the precipitation threshold, with the largest values for the 1 mm/24 h threshold, where the spring and summer reliability term goes from values of around 0.05 to 0.3–0.4. To understand the behaviour at this threshold better, it is worth recalling that the precipitation estimate (verification against a single observation) is built by averaging the precipitation of all the reporting stations within a grid box. This averaging increases the number of grid points with small amounts of precipitation (fewer situations with zero precipitation) and smooths out large amounts of accumulated rain. Since the ECMWF EPS has a tendency to over-predict small amounts of precipitation, the estimate will be much closer to the model forecast. Conversely, models have difficulty in consistently predicting large amounts of rain, and a smoother observed estimate is therefore closer to the model fields. Moreover, for higher thresholds the event occurs less often and the forecast probabilities are smaller; the BSS is thus heavily weighted towards the low forecast probability categories, where the observed frequency is also low. The reliability term, and its change, is therefore smaller for the higher thresholds than for the lower ones.
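The averaging mechanism can be illustrated with a toy grid box; the gauge values and the threshold below are invented for illustration:

```python
import numpy as np

# Hypothetical 24 h gauge reports (mm) inside one grid box: generalized
# light rain plus a single heavy report, i.e. a highly asymmetric sample.
gauges = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.2, 18.0])
threshold = 1.0  # mm / 24 h

# REF: verify against the areal average, giving a single 0/1 observed
# "probability"; the one extreme report drags the mean above the threshold.
areal_mean = gauges.mean()
p_ref = 1.0 if areal_mean >= threshold else 0.0

# O-OP: observed probability taken directly from the in-box distribution,
# as the fraction of reports at or above the threshold.
p_oop = float((gauges >= threshold).mean())
```

Here REF declares the event observed with probability 1, while O-OP assigns it only the low probability actually supported by the gauges (2 of 7 reports), which is exactly the divergence driving the reliability change at low thresholds.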

Moreover, the decrease (improvement) in BSS_res observed across the precipitation thresholds (Figure 4) is linked to a systematic decrease of the uncertainty term (the variance of the observational probability) in the BS decomposition (see section 3). The largest decrease in uncertainty, during the summer months, can be attributed to the predominant convective situations, in which precipitation variability within a grid box may be larger because precipitation is very local in nature (not shown).

Figure 5(a) depicts the percentage change in BSS_res as a function of time for three precipitation thresholds: 1 (solid), 5 (dashed) and 10 mm/24 h (dotted). The improvement in BSS_res is mainly due to the decrease in the uncertainty term (see section 3) and shows seasonality, with the largest change in winter and the smallest in summer. As noted above, the largest decrease in uncertainty during the summer months can be attributed to the predominantly convective situations, in which within-grid-box precipitation variability is larger. The smaller BSS_res improvement for the 1 mm/24 h threshold could be associated either with the greater decrease in the uncertainty term or with the model bias at low precipitation thresholds.

Figure 5.

Relative (%) change of BSS_res (a) as function of time and relative (%) change of RSA (b) as a function of time for three precipitation thresholds: 1 mm/24 h (solid), 5 mm/24 h (dashed) and 10 mm/24 h (dotted line) for ECMWF EPS.

The RSA for the ECMWF EPS is shown in Figure 5(b), which displays the time series of the percentage change when the O-OP method is compared to REF for the different thresholds. The RSA deteriorates when scoring with O-OP for all thresholds, showing a drop of about 9%; the largest drop (around 12%) is obtained for the highest precipitation threshold (dotted line). The drop, which is consistent with that observed by CT08, is due both to the increased uncertainty of the observations, which decreases the number of model hits and increases the false alarms, and to the rather coarse definition of the O-OP observational probability classes. The latter could be improved if a distribution were fitted to the observations within a grid box. The decrease in RSA displays a seasonal behaviour, with the largest decrease in the summer months and the smallest in autumn and winter for thresholds of 1 to 10 mm/24 h. The summer months are characterized by convective rain, which within a grid box shows larger variability, and therefore uncertainty, than the large-scale rain associated with the synoptic systems more common in autumn and winter.
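The quantity underlying the RSA can be sketched as the trapezoidal area under the ROC curve, built from hit rates and false-alarm rates at a set of probability thresholds. The number of thresholds and the final skill-score normalization (here 2A − 1, so that a no-skill forecast scores zero) are illustrative assumptions, not the paper's exact RSA definition:

```python
import numpy as np

def roc_area(probs, outcomes, n_thresholds=11):
    """Trapezoidal area under the ROC curve.

    At each probability threshold a 'yes' forecast is issued when the
    forecast probability exceeds it; the hit rate and false-alarm rate
    trace a curve from (1, 1) down to (0, 0), which is integrated.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)
    n_events = max(int((outcomes == 1).sum()), 1)
    n_nonevents = max(int((outcomes == 0).sum()), 1)
    hr, far = [1.0], [1.0]  # always-yes end of the curve
    for t in np.linspace(0.0, 1.0, n_thresholds):
        yes = probs > t
        hr.append(int((yes & (outcomes == 1)).sum()) / n_events)
        far.append(int((yes & (outcomes == 0)).sum()) / n_nonevents)
    hr.append(0.0)
    far.append(0.0)
    # far is non-increasing along the list; apply the trapezoid rule
    area = 0.0
    for i in range(1, len(far)):
        area += (far[i - 1] - far[i]) * 0.5 * (hr[i - 1] + hr[i])
    return area

# Perfect discrimination gives area 1.0; a no-skill forecast gives 0.5,
# and a skill score can then be formed as 2 * area - 1.
```

A verification method that inflates false alarms at the expense of hits, as O-OP does here, pushes the curve toward the diagonal and so reduces the area, which is the degradation seen in Figure 5(b).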

The Yule–Kendall index (Y-K; see section 4.1) has been calculated for each grid-box precipitation estimate and averaged in space over the whole domain to provide an overall estimate of the asymmetry of the observations. Figure 6 illustrates the correspondence between the BSS_rel change and Y-K for all the precipitation thresholds (asterisks), for rain greater than 1 mm/24 h (filled circles) and for 10 mm/24 h (triangles). Because of the spatial averaging, the Y-K values span only 0.14 to 0.21, a much narrower range than the [0, 1] range of the synthetic data experiments. Linear regression lines are plotted to give a better indication of any correlation between the two variables. The positive correlation (reliability degradation) increases if the sample is stratified, with the lower threshold (1 mm/24 h, dashed line) contributing most to the change in BSS_rel. This result is quite interesting because it shows that, for small amounts of rain, the higher the precipitation variability within a grid box, the more likely it is that the reliability of the system is underestimated by O-OP. As mentioned above, the averaging methodology for a single precipitation estimate increases the number of grid points with small amounts of estimated rain, which in REF results in an observed probability of 1. The O-OP method, on the other hand, reflects the low probability of these small amounts reported by the observations at each grid point.

Figure 6.

Absolute change of BSS_rel as a function of asymmetry for ECMWF EPS in three different cases: 1 mm/24 h (filled circles and dashed line); 10 mm/24 h (triangles and dotted line); all thresholds (asterisks and solid line).

CT08 observe that the observational probability p_obs (p_OP) is smaller/larger than p_REF when the predicted probability p is greater/smaller than 1/2. In the case of O-OP there is no fixed threshold on the predicted probability p; the relation depends on the asymmetry and on the precipitation threshold (reliability diagrams not shown). For small thresholds the relation between the O-OP and REF probabilities is p_OOP < p_REF. Conversely, for large amounts of precipitation (10 mm/24 h, dotted line) the grid-box precipitation estimate used in REF reduces the overall probability of these large amounts by smoothing out the extremes, while the O-OP method preserves the low probabilities with which these events occur: p_OOP > p_REF for low p (a much larger number of cases) and p_OOP < p_REF for high p (a small number of cases). The fundamental difference between the two results lies in the definition of the observational probability: in CT08 the noisy observations are built using a normal distribution, whereas in the present study the observational probabilities are obtained directly from precipitation observations with an asymmetric distribution.


Figure 7 shows the BSS and its components for 1 mm/24 h (a), 5 mm/24 h (b) and 10 mm/24 h (c). The BSS improves slightly when the O-OP method (dashed line) is used for all thresholds except 1 mm/24 h (a). As for the ECMWF EPS, BSS_rel (filled triangles) increases with O-OP and is worst at 1 mm/24 h. Even though a direct comparison between the ECMWF EPS and SREPS is not fair, as the two systems have been scored at different resolutions, it seems appropriate to conclude that both systems suffer from the same overestimation of precipitation at small thresholds, and that the issue affects the limited-area model to a lesser degree.

Figure 7.

BSS and its component time series for AEMET SREPS 24 h accumulated precipitation forecasts at t + 54 (0600 UTC). Curves and symbols are displayed as in Figure 1. The 1 mm/24 h precipitation threshold is shown in panel (a), 5 mm/24 h in panel (b) and 10 mm/24 h in panel (c).

The BSS_res changes are illustrated in Figure 8(a), where the percentage change in BSS_res is plotted against time. The improvement is of the order of 10–12%, with peaks of 15% during the winter and spring periods, consistent with both CT08 and the results presented above for the ECMWF EPS. The correspondence between BSS_rel and asymmetry (not shown) for AEMET SREPS is not as clear as for the ECMWF EPS: the higher the grid-box resolution, the smaller the variance and the asymmetry of the observations inside the box. The decrease in uncertainty is therefore smaller, and hence BSS_rel and BSS_res show smaller changes.

Figure 8.

Relative (%) change of BSS_res (a) as a function of time and relative (%) change of RSA (b) as a function of time for three precipitation thresholds: 1 mm/24 h (solid), 5 mm/24 h (dashed) and 10 mm/24 h (dotted line) for AEMET SREPS.

Time series showing the change in RSA (as a percentage difference from REF) are displayed in Figure 8(b) for the 1 mm/24 h threshold (solid line), 5 mm/24 h (dashed line) and 10 mm/24 h (dotted line). As already noted for the ECMWF EPS, the RSA degrades when the O-OP method is used. Interestingly, there is very little difference among the curves for the different thresholds, as opposed to the behaviour of the ECMWF EPS, which exhibits a considerably larger degradation for the higher thresholds (Figure 5(b)). This could again be associated with the reduced variability within the grid box for SREPS, which was run at a higher resolution than the ECMWF EPS over the period considered in this study.

5. Conclusions

The effect of observation uncertainty on the forecast verification process has been investigated in several papers. The uncertainty has mostly been associated with an observational error added to the observations used in the verification methodology. In the present paper, the observed uncertainty has instead been included in the scoring method, extending what was previously described in CT08. The novel methodology, ‘observed-observational probability’ (O-OP), allows the inclusion of empirically derived observational uncertainty for variables that are not normally distributed. The CT08-extended BSS and its decomposition into BSS_rel and BSS_res, as well as the RSA, have been used.

Results using synthetic data show differences among the traditional (REF), perturbed-ensemble (ENS), OP and O-OP verification methods. O-OP degrades reliability with respect to REF, with both underestimating the ‘baseline’ reliability given by ENS; O-OP improves resolution with respect to REF, owing to the decrease of the uncertainty term (the variance of the observations decreases when observational uncertainty is accounted for), and gives a measure closer to the real resolution than REF or ENS, which underestimate it. These results are in agreement with the findings of CT08. Moreover, the O-OP methodology underestimates reliability to a degree consistent with OP, but overestimates resolution more than OP does.

ECMWF EPS (global) and AEMET SREPS (LAM) precipitation forecasts (0000 UTC + 54 h) for a period of 21 months have been verified against gridded observed rainfall estimates (areal averages for REF and observed distributions for O-OP) from European high-spatial-resolution data. The BSS shows slight improvements when using O-OP for both the ECMWF EPS and SREPS at precipitation thresholds above 5 mm/24 h, again in agreement with CT08. The improvements are due to the decrease in the Brier uncertainty term, which leads to an increase in the BSS resolution component; conversely, the reliability component degrades, as expected. A striking result comes from the 1 mm/24 h precipitation threshold: the large degradation of the reliability for the ECMWF EPS, and to a smaller extent for the SREPS LAM, can be explained by recalling that the observed precipitation distribution is highly asymmetric (large variability). When the areal value for the precipitation estimate is calculated, the number of grid points with small amounts of precipitation increases, thus improving the reliability of the system; in these situations the O-OP observed probability will be smaller than the probability (equal to 1) of the standard method. Conversely, the precipitation estimate smooths out large amounts of precipitation, so for rare events the REF probability (equal to 0) will be much lower than the observed probability, again degrading the reliability of the system. The implication of these findings is that, if uncertainty is included in the observations (OP and O-OP), the expected reliability degradation can be extreme for an asymmetric distribution such as precipitation, while the system shows an improved resolution. Moreover, the larger the asymmetry, the larger the reliability deterioration, consistent with the results from the synthetic data experiments. For rare events, it is expected that the O-OP reliability would not worsen in the same way, because the probability of observing a rare event within a grid box will be small but not zero.

The time series of the percentage change in BSS_res for both systems indicate a seasonal behaviour, with winter being the period when application of the O-OP method in the scoring process yields the highest scores for the model performance. The RSA is degraded for both systems when the O-OP method is used, again in agreement with CT08: the higher uncertainty in the observations increases the number of false alarms at the expense of hits. The SREPS LAM is less affected because of its higher resolution and thus smaller variability within a grid box.

Both OP and O-OP methods underestimate the ‘baseline’ reliability (i.e. the reliability when uncertainty is absent), both overestimate to some degree the ‘baseline’ resolution and both underestimate the ‘real performance’ of the system. O-OP shows the advantage of introducing empirical observational uncertainty without the need to assume any particular distribution.

The present study has been carried out on 21 months of observed data to establish the robustness of the O-OP method in a semi-operational context; future plans include the extension of the verification method to a longer period.

The present study has also shown the impact of using grid-box averages (standard verification method, REF) versus grid-box distributions (O-OP method). Asymmetry plays a major role: averaging a highly asymmetric distribution (e.g. generalized low precipitation with one extreme report within a grid box) produces an overestimation of precipitation amounts, whereas averaging rain-gauge reports within a grid box where generalized low precipitation and no extremes are observed produces a smoothed field and therefore substantially reduces the variability (Ensor and Robeson, 2008). As a consequence, a reliability degradation (higher in O-OP than in REF) and an improvement in resolution have been shown.


Acknowledgements

The authors would like to thank the two anonymous referees for constructive feedback. Thanks also to José Antonio López and Andrés Chazarra (AEMET) for helpful comments and clarification.