How sensitive are probabilistic precipitation forecasts to the choice of calibration algorithms and the ensemble generation method? Part I: sensitivity to calibration methods


  • Juan J. Ruiz,

    Corresponding author
    1. Departamento de Ciencias de la Atmósfera y los Océanos, Universidad de Buenos Aires, Buenos Aires 1428, Argentina
    2. Centro de Investigaciones del Mar y de la Atmósfera (CONICET-UBA), Buenos Aires 1428, Argentina
    • Facultad de Ciencias Exactas y Naturales, Departamento de Ciencias de la Atmósfera y los Océanos, Universidad de Buenos Aires, Buenos Aires 1428, Argentina.
  • Celeste Saulo

    1. Departamento de Ciencias de la Atmósfera y los Océanos, Universidad de Buenos Aires, Buenos Aires 1428, Argentina
    2. Centro de Investigaciones del Mar y de la Atmósfera (CONICET-UBA), Buenos Aires 1428, Argentina


Different techniques for obtaining probabilistic quantitative precipitation forecasts (PQPFs) over South America are tested during the 2002–2003 warm season. They have been applied to a regional ensemble system which uses the breeding technique to generate initial and boundary conditions perturbations. This comparison involves seven algorithms and also includes experiments to select an adequate size for the training period. Results show that the sensitivity to different calibration strategies is small with the exception of the rank histogram algorithm. The inclusion of the ensemble spread or the use of different ensemble members for the computation of probabilities shows almost no improvement with respect to probabilistic forecasts computed using the ensemble mean. This is basically due to the strong relationship between precipitation error and its amount. Copyright © 2011 Royal Meteorological Society

1. Introduction

The inclusion of uncertainty in weather forecasts can lead to a substantial increase in their economic value (Zhu et al., 2002; Palmer et al., 2007). The most popular way to account for this uncertainty is through the use of probabilistic forecasts, which essentially assign a probability to the occurrence of future events, based on a combination of presently available information about those events (e.g. numerical weather predictions, forecaster's experience, past observations). Ensemble forecasting (Epstein, 1969) has introduced new tools for the computation of probabilities, since it can provide estimations of the first and second moments of the future probability distribution function (PDF) of any variable of interest, such as, for example, precipitation amount. The ability of ensemble systems to outperform deterministic-style forecasts and to predict forecast skill has been convincingly established (see Palmer et al., 2007 and references therein). However, various challenges in the statistical postprocessing of ensemble outputs have been described. As documented by many papers (e.g., Wilks and Hamill, 2007 and references therein), probabilities derived directly from the ensemble are largely affected by model errors, leading to unreliable probabilistic forecasts that reduce the economic value of the information.

In order to correct the effect of the ensemble systematic errors, several techniques have been developed, all of them based on the study of the relationship between error and forecast value and on the development of statistical models to compute a calibrated probability given the forecasts of the ensemble members (Hamill and Colucci, 1997, 1998; Eckel and Walters, 1998; Applequist et al., 2002; Gahrs et al., 2003; Gallus and Seagal, 2004; Raftery et al., 2005; Hamill and Whitaker, 2006; McLean Sloughter et al., 2007; Stensrud and Yussouf, 2007, among others). Some of these techniques can be applied even to a single deterministic forecast, allowing the computation of probabilities without running an ensemble system (e.g., Gallus and Seagal, 2004). Although most of them share common principles, they differ in their implementation and/or in the mathematical algorithms they employ. Given the variety of calibration strategies available, many of them referred to above, their relative performance, for example in terms of forecast quality, remains unclear. Accordingly, one of the objectives of this paper is to describe the sensitivity of probabilistic forecasts to the calibration algorithm using selected techniques, including a brief discussion of their pros and cons. In addition, some modifications are proposed to improve either the performance or the ease of computation of some of these algorithms, which are also compared with the original implementations.

All the calibration techniques proposed are applied to precipitation forecasts, since they are one of the most challenging and least accurate products available from numerical weather prediction (Ebert, 2001; Stensrud and Yussouf, 2007). The algorithms have been applied to a short range regional ensemble system over South America based on the WRF (Weather Research and Forecasting) model. This work is an extension of Ruiz et al. (2009) where PQPF (Probabilistic Quantitative Precipitation Forecasts) generated with two different ensemble systems and using two calibration strategies (Hamill and Colucci, 1997; Gallus and Seagal, 2004) were compared. In that case, the validity of the results was limited due to the relatively small amount of observations available for calibration. In the present work, satellite estimates have been used for calibration and verification purposes, in order to reduce the uncertainty due to the small number of observations over the area of interest. According to Ruiz (2009), this choice does not affect the general conclusions regarding calibration performance, while providing more robust statistics due to more data availability.

Sensitivity to calibration strategies constitutes the first part of this assessment. It is also of interest to make a more comprehensive analysis including ensemble generation, taking into consideration the variety of alternatives to generate computationally cheap regional ensemble systems (see Applequist et al., 2002; Gahrs et al., 2003). Both assessments could help in designing a probabilistic forecast system suitable for small operational/research centres. This complementary evaluation is addressed in Ruiz et al. (2011).

The article is organized as follows: Section 2 describes the ensemble system, the dataset used for verification and calibration, and the different calibration methods used; results are analysed in Section 3, and Section 4 provides the conclusions of this work.

2. Methodology

2.1. The regional ensemble system

The regional ensemble system has been constructed using different initial conditions to run the WRF-ARW model version 2.0 (Skamarock et al., 2005) with 40 km grid spacing and 31 vertical sigma levels over the domain shown in Figure 1. The ensemble members have been generated using breeding of the growing modes (hereafter breeding, Toth and Kalnay, 1993) applied to both the initial and boundary conditions, with a rescaling period of 6 h. The regional ensemble consists of 11 members (5 pairs of perturbed members and a control run) integrated up to 48 h lead time. Boundary conditions are provided by a global ensemble (generated with the same perturbed initial conditions) based on the Medium Range Forecast model with T62L28 resolution (approximately 2.5° horizontal resolution). Although the resolution of the global model is low, it is considered that this limitation does not affect the main conclusions of this work. The unperturbed initial condition (control run) is obtained from the Global Data Assimilation System (GDAS) analysis with 1° × 1° horizontal grid spacing. The experiment starts on 15 December 2002 and ends on 15 February 2003, and the regional ensemble forecasts are initialized at 1200 UTC.

Figure 1.

Regional ensemble domain and CMORPH total precipitation (in mm) between 15 December 2002 and 15 February 2003 (shaded). The black box indicates the area used for verification and calibration

The configuration used to run WRF has been selected among a variety of options following Ruiz et al. (2010), which shows that this configuration satisfactorily represents surface observations during this same period. Basic model settings include the Kain–Fritsch cumulus parameterization (Kain, 2004), the Yonsei University scheme (Hong and Pan, 1996) for boundary layer representation, and the NOAH surface model (Chen and Dudhia, 2001) for surface processes. The reader is referred to Ruiz et al. (2010) for a detailed analysis of WRF model performance during the period of interest.

2.2. Data

In this work, CMORPH passive microwave precipitation estimates ( janowiak/cmorph_description.html) with 0.25° horizontal resolution and 3 h temporal resolution (Joyce et al., 2004) are used for forecast calibration and verification. Although the native horizontal and temporal resolution of CMORPH is 8 km and 30 min respectively, in this work the lower resolution products have been preferred because their resolution is closer to our model setting. Among existing precipitation estimates, CMORPH has been selected because the use of microwave data leads to better precipitation estimates (Joyce et al., 2004; Ebert et al., 2007), and provides an excellent alternative to the coarse operational rain gauge network over South America. CMORPH performance and its potential to be used for verification purposes over the region of interest have been addressed in Janowiak et al. (2005), Ruiz et al. (2009) and Ruiz (2009). In particular, Ruiz (2009) focuses on the same period as the present work and shows that CMORPH data have some systematic biases over the region of interest: they tend to overestimate the frequency of high and low precipitation events and underestimate the frequency of ‘no rain’ events during summer. However, it has been demonstrated that these systematic errors produce a small impact upon PQPF calibration compared with rain gauge data (Ruiz et al., 2009), supporting the use of CMORPH estimates in the present work.

For calibration and verification purposes, only the region indicated by the box in Figure 1 has been considered. As discussed in Ruiz et al. (2009), among others, calibration is sensitive to precipitation climatology, particularly to differences between tropical and mid latitude regimes. For this reason, a region comprising northern Argentina, southern Brazil, Paraguay and Uruguay, which share a rather similar precipitation regime, has been selected.

In addition, this region is of particular relevance for addressing forecast quality issues, since it lies within the La Plata Basin, which is the second largest basin in South America and the most important given its population and economic activities. Moreover, during the austral summer around 80% of the precipitation over this region is associated with the occurrence of mesoscale convective systems (Salio et al., 2007), representing a particular challenge for precipitation forecasting.

2.3. Calibration methods

As discussed in the introduction, one of the main objectives of this work is to assess the ability of different calibration algorithms to improve PQPF reliability. In order to understand better the advantages and disadvantages of selected alternatives, a brief description of the different algorithms implemented in this study is presented.

2.3.1. Rank histogram (RH)

This method was first introduced by Hamill and Colucci (1998) (hereafter HC98) and uses the precipitation rank histogram to compute the probability of occurrence of precipitation above a certain threshold. As in HC98, the training sample is divided into three categories based on the forecast spread. In this case the limits of the categories have been chosen from the terciles of the forecast spread probability distribution (this means that slight changes in the limits of the categories can occur from one day to the next, keeping the number of cases falling into each category approximately constant). Three rank histograms are computed, one for each spread category.

The main problem with the rank histogram approach in its original formulation (e.g. HC98) is that it tends to overestimate higher probabilities due to a non explicit treatment of the chance of observed precipitation being 0 mm. For this reason, in this work a modified version of the algorithm is introduced. The main difference between HC98 and our implementation is that the computation of the rank histogram is done excluding the cases where the observed precipitation is 0 mm. The probability of the precipitation being 0 is computed separately for each spread category.

Accordingly, the probability of the precipitation being over a certain threshold is computed as the product of two probabilities as indicated by:

$$P(o > \mathit{tr} \mid F) = P(o > 0 \mid s)\,P(o > \mathit{tr} \mid o > 0, F) \tag{1}$$

where P(o > tr|F) is the probability of the observed precipitation o being over the threshold tr given the precipitation value forecast by the different ensemble members F = (f1, …, fi, …, fk), with fi the forecast precipitation by the ith ensemble member and k the ensemble size. P(o > 0|s) is the probability of the observation o being over 0 given the spread of the forecast precipitation ensemble s. In this case this term is computed as the observed frequency of o being > 0 at each spread category. Finally P(o > tr|o > 0, F) is the probability of precipitation being over the threshold tr given that o is greater than 0 and the forecast precipitation by the different ensemble members. This term is computed from the rank histogram as in HC98.
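
As a purely illustrative reading of the product in Equation (1), using hypothetical values that are not taken from the experiments, suppose that the observed frequency of rain in a given spread category is 0.6 and that the rank histogram yields a conditional exceedance probability of 0.5:

```latex
P(o > \mathit{tr} \mid F)
  = P(o > 0 \mid s)\,P(o > \mathit{tr} \mid o > 0, F)
  = 0.6 \times 0.5 = 0.3
```

Under this formulation the calibrated probability cannot exceed the frequency of rain in the corresponding spread category, which is how the explicit treatment of dry observations limits the overestimation of high probabilities.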

Another difference with respect to HC98 original implementation is that only those forecasts where any of the ensemble members predict precipitation above 0 mm are considered for the computation of the rank histogram. Cases where all the ensemble members forecast 0 mm are considered separately, assuming that the probability of precipitation over the different thresholds is 0. This change has only a minor effect upon calibrated PQPF skill.

2.3.2. Gallus and Seagal algorithm (GS)

This method is based on Gahrs et al. (2003), Gallus and Seagal (2004) and Gallus et al. (2007). The range of forecast precipitation is divided into several bins. At each bin, the probability of having precipitation above a certain threshold is assumed to be equal to the observed frequency of precipitation above that threshold. This probability is assumed to correspond to the centre of the bin. Probabilities for other forecast precipitation values are then computed through linear interpolation. The following bin limits have been selected: 0, 0.01, 0.5, 1, 2.5, 5, 10, 15, 20, 25, 30, 50 and 150 mm. Different choices for bin distributions have been tested, including a dynamic computation of the bins using equal frequency bins, but the sensitivity of the results to this particular issue is small, as in Gahrs et al. (2003). Figure 2(a) and (c) show an example of the relationship between ensemble mean forecast precipitation and the probability of having precipitation over 1 and 15 mm respectively, obtained with GS.
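
The binning and interpolation steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names `train_gs` and `predict_gs` are hypothetical, empty bins are simply skipped, and probabilities outside the range of the bin centres are held constant, all of which are assumptions of this sketch.

```python
import numpy as np

# Bin limits (mm) quoted in the text for the GS algorithm.
BIN_LIMITS = np.array([0, 0.01, 0.5, 1, 2.5, 5, 10, 15, 20, 25, 30, 50, 150],
                      dtype=float)


def train_gs(fcst, obs, threshold, limits=BIN_LIMITS):
    """Return (bin_centres, observed_frequencies) for one threshold.

    fcst, obs: 1-D sequences of forecast and observed precipitation (mm).
    Bins without training cases are dropped (assumption of this sketch).
    """
    fcst = np.asarray(fcst, dtype=float)
    obs = np.asarray(obs, dtype=float)
    centres, freqs = [], []
    for lo, hi in zip(limits[:-1], limits[1:]):
        in_bin = (fcst >= lo) & (fcst < hi)
        if in_bin.any():
            centres.append(0.5 * (lo + hi))
            # Observed frequency of exceeding the threshold within the bin.
            freqs.append(np.mean(obs[in_bin] > threshold))
    return np.array(centres), np.array(freqs)


def predict_gs(fcst, centres, freqs):
    """Linearly interpolate the calibrated probability between bin centres."""
    # np.interp keeps the end values constant outside the range of the centres.
    return np.interp(np.asarray(fcst, dtype=float), centres, freqs)
```

In a dynamic training setting, `train_gs` would be called once per threshold on the forecast and observation pairs of the training window, and `predict_gs` applied to the ensemble mean of the new forecast.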

Figure 2.

Observed precipitation probability as a function of ensemble mean forecast precipitation for 1 mm threshold (a and b), and 15 mm threshold (c and d). (a) and (c) GS (circles), GAMMA (grey line), LR (crosses), ELR (triangles), (b) and (d) LR (crosses), SLR with low spread (solid grey line) and SLR with high spread (dashed grey line)

2.3.3. Logistic fit algorithm (LR)

This algorithm is similar to GS: the main idea is to find a relationship between forecast precipitation and probability of occurrence of rain above a certain threshold and is based on Applequist et al. (2002) and Hamill et al. (2004). The main difference with GS is that, instead of using several bins, a logistic regression (Wilks, 2006a) is used to fit the relationship between forecast rainfall and probability. The result is a reduction in the number of parameters required to compute probabilities: in GS, the number of probabilities that have to be estimated directly from the data is equal to the number of bins, while in this case, only two parameters are needed to describe the fitted function. The general form of a logistic regression is given by:

$$P(y > \mathit{tr} \mid f) = \frac{\exp\left(a f^{1/3} + b\right)}{1 + \exp\left(a f^{1/3} + b\right)} \tag{2}$$

where f is the forecast rainfall, P(y > tr|f) is the probability of the cube root of the observed precipitation (y) being over the considered threshold given f, and a and b are the parameters of the fitted function. Notice that the one-third power transformed forecast precipitation is used as predictor, instead of the precipitation itself, following the results of McLean Sloughter et al. (2007). As Hamill et al. (2004) recommended a one-fourth power transform for PQPF calibration, other power transforms have also been tested (i.e. one-half, one-fourth) but in this work the best results have been achieved with the one-third power transform.

Logistic regression can be particularly useful to compute the probability of high values of forecast precipitation, mainly in the case where the size of the training sample is small. Under these circumstances, the use of GS could result in a less robust estimation of probabilities by using too few forecast-observation pairs at particular bins. Instead, LR uses the entire sample to estimate the relation between forecast precipitation and observed precipitation probability. Moreover, as can be seen in Figure 2(a) and (c), LR is very close to the curve obtained with GS for both thresholds.
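
A two-parameter logistic fit of this kind can be sketched as below. The code is only an illustration under stated assumptions: the function names are hypothetical, the coefficient `a` is taken to multiply the cube root of the forecast and `b` to be the intercept, and the maximum likelihood estimate is obtained with a plain Newton-Raphson iteration rather than with any particular statistical package.

```python
import numpy as np


def fit_logistic(fcst, obs, threshold, n_iter=50, ridge=1e-8):
    """Fit P(obs > threshold | f) = 1 / (1 + exp(-(a * f**(1/3) + b))).

    Returns (a, b) estimated by Newton-Raphson maximum likelihood.
    The roles of a (slope) and b (intercept) are an assumption of this sketch.
    """
    x = np.cbrt(np.asarray(fcst, dtype=float))      # power-transformed predictor
    z = (np.asarray(obs, dtype=float) > threshold).astype(float)
    X = np.column_stack([x, np.ones_like(x)])       # columns: slope, intercept
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (z - p)                        # gradient of the log-likelihood
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # negative Hessian
        step = np.linalg.solve(hess + ridge * np.eye(2), grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta[0], beta[1]


def predict_logistic(fcst, a, b):
    """Calibrated probability for new forecast values (mm)."""
    x = np.cbrt(np.asarray(fcst, dtype=float))
    return 1.0 / (1.0 + np.exp(-(a * x + b)))
```

Because only two numbers are estimated from the whole training sample, the fit remains usable at thresholds where individual GS bins would contain very few cases.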

2.3.4. Extended logistic regression (ELR)

In the previous implementation of the logistic regression, a particular curve is fitted for each precipitation threshold, meaning that a and b are calculated for each threshold, implying larger computation with increasing threshold resolution. Wilks (2009) introduced the concept of extended logistic regression, where a monotonic function of the threshold itself is included as a predictor within the regression equation. In this way a single set of parameters is obtained that allows the computation of probabilities for any threshold; moreover, even the full PDF could be obtained in this way. The extended logistic regression can be expressed as:

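$$P(y > \mathit{tr} \mid f) = \frac{\exp\left(a_e f^{1/3} + b_e + c_e\,g(\mathit{tr})\right)}{1 + \exp\left(a_e f^{1/3} + b_e + c_e\,g(\mathit{tr})\right)} \tag{3}$$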

where P, tr, f, are as in the logistic regression equation, and g(tr) is a non decreasing function of the threshold. In this case the following function has been used:

$$g(\mathit{tr}) = \sqrt{\mathit{tr}} \tag{4}$$

This function has also been used by Wilks (2009) and Schmeits and Kok (2010). Other power transformations have been tested for g(tr); however, the square root was the one that gave the best results in terms of the Brier Skill Score (BSS). Coefficients ae, be and ce in Equation (3) are computed using the maximum likelihood method (Wilks, 2006a). Note that in this case ae, be and ce are independent of the threshold.

To compare the coefficients obtained with the standard logistic regression with those of the extended logistic regression, the former have been plotted in Figure 3 as a function of the precipitation threshold. Note that in Equation (3), ae is equivalent to a in Equation (2) and be + g(tr)ce is equivalent to b. The fact that in the extended logistic regression the dependence of coefficient a with the selected threshold is not considered (i.e. ae is assumed to be independent of the threshold) is justified by Figure 3, which shows that this coefficient is almost constant.

Figure 3.

Logistic regression parameters (Equation (2)) as a function of precipitation threshold. ‘a’ (circles), ‘b’ (crosses), and fit to the parameter a using ELR (dashed black line) and LLR (solid black line), and to parameter b using ELR (dashed grey line) and LLR (solid grey line)

Another way to take into account the dependence of the fitting coefficients a and b with the precipitation threshold is to compute them for several thresholds using Equation (2) and then perform a least squares fit of the form:

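$$\hat{a}(\mathit{tr}) = a_1 + a_2\sqrt{\mathit{tr}}, \qquad \hat{b}(\mathit{tr}) = b_1 + b_2\sqrt{\mathit{tr}} \tag{5}$$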

where â(tr) and b̂(tr) are the estimated coefficients a and b as a function of the precipitation threshold, and a1, a2, b1 and b2 are the fitting parameters that are computed using least squares linear regression between the computed values of a and b and the square root of the precipitation threshold. This alternative implementation of the extended logistic regression will be referred to as LLR. Figure 3 also shows the values of â and b̂ as a function of threshold. It can be seen that both ELR and LLR capture quite well the dependence of parameters a and b from Equation (2) on the threshold.
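
The least squares step of LLR can be sketched as follows, assuming that the per-threshold coefficients a and b have already been obtained from separate logistic fits. The function names and the ordering of the fitting parameters (intercept first, then the coefficient of the square root of the threshold) are assumptions made for illustration.

```python
import numpy as np


def fit_llr(thresholds, a_values, b_values):
    """Fit a(tr) = a1 + a2*sqrt(tr) and b(tr) = b1 + b2*sqrt(tr) by least squares.

    thresholds: precipitation thresholds (mm) at which a and b were estimated.
    Returns ((a1, a2), (b1, b2)).
    """
    g = np.sqrt(np.asarray(thresholds, dtype=float))
    # np.polyfit returns the highest-degree coefficient first.
    a2, a1 = np.polyfit(g, np.asarray(a_values, dtype=float), 1)
    b2, b1 = np.polyfit(g, np.asarray(b_values, dtype=float), 1)
    return (a1, a2), (b1, b2)


def llr_coefficients(threshold, a_par, b_par):
    """Evaluate the smoothed coefficients at an arbitrary threshold."""
    g = np.sqrt(threshold)
    return a_par[0] + a_par[1] * g, b_par[0] + b_par[1] * g
```

The smoothed coefficients returned by `llr_coefficients` can then be inserted into the standard logistic expression for any threshold, including thresholds that were not part of the original set.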

2.3.5. Two variable logistic regression (SLR)

In this case a logistic regression including two predictors is used to describe the dependence of probability of precipitation above certain thresholds: the forecast precipitation and the ensemble dispersion. The idea is similar to the LR scheme but the ensemble spread is used as an extra source of information. The implementation follows those by Wilks (2006b) and Wilks and Hamill (2007) and can be expressed as:

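$$P(y > \mathit{tr} \mid f, S) = \frac{\exp\left(a_s f^{1/3} + b_s + c_s S\right)}{1 + \exp\left(a_s f^{1/3} + b_s + c_s S\right)} \tag{6}$$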

where P, y, tr and f are as in Equation (2), S is the spread of the power transformed ensemble, and as, bs and cs are coefficients that are adjusted using the logistic fit.

Figure 2(b) and (d) show a comparison between LR and SLR using high and low spread cases. The low spread case shows lower probability values even for high forecast precipitation values (far above the considered threshold), which is not the expected behaviour, since the lower ensemble spread should be associated with less uncertainty and increased forecast resolution.

2.3.6. Gamma algorithm

This algorithm is based on the one presented in McLean Sloughter et al. (2007) (hereafter S2007), where calibrated probabilities are obtained as a linear combination of the individual PDFs associated with each ensemble member. The conditional probability of the cube root of the observed precipitation (y) being over a certain threshold (tr), given the precipitation value forecast by each individual member (fk) is given by:

$$P(y > \mathit{tr} \mid f_1, \ldots, f_K) = \sum_{k=1}^{K} w_k\,h_k(y > \mathit{tr} \mid f_k) \tag{7}$$

where hk(y > tr|fk) is the conditional probability of y being over tr given the precipitation forecast by the kth ensemble member (fk) in the case where the kth ensemble member is the best one; wk is the weight associated with the kth ensemble member (i.e. the probability that the kth ensemble member is the best one) and K is the ensemble size. In this work the weights are equal for each ensemble member as in a single model ensemble the skill of the individual members is considered to be almost the same (Fraley et al., 2010; Schmeits and Kok, 2010). The model proposed by S2007 for the computation of hk(y > tr|fk) has two parts depending on y being 0 or greater than 0. The probability of observed precipitation being greater than 0 as a function of forecast precipitation (hk(y > 0|fk)) is modelled using a logistic regression that can be expressed as:

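$$h_k(y > 0 \mid f_k) = \frac{\exp\left(a_k f_k^{1/3} + b_k + c_k \delta_k\right)}{1 + \exp\left(a_k f_k^{1/3} + b_k + c_k \delta_k\right)} \tag{8}$$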

where ak, bk and ck are constants which are estimated individually for each ensemble member. δk is 1 if fk is 0 and 0 otherwise, so the term δkck introduces a correction in the computation of the probabilities when the forecast precipitation is 0.

The conditional probability of y being over a certain threshold given that the precipitation is greater than 0 is modelled using a gamma distribution as stated in:

$$h_k(y > \mathit{tr} \mid y > 0, f_k) = 1 - G_{\mu_k,\sigma}(\mathit{tr}) \tag{9}$$

where $G_{\mu_k,\sigma}(\mathit{tr})$ is the cumulative distribution function at tr assuming a gamma distribution with mean µk and standard deviation σ.
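
Since the gamma distribution is usually parameterized by shape and scale rather than by mean and standard deviation, the following standard conversion may be useful when evaluating this cumulative distribution function. The symbols α (shape) and β (scale) are introduced here for illustration only and do not appear in the original formulation:

```latex
\mu_k = \alpha_k \beta_k, \qquad \sigma^2 = \alpha_k \beta_k^2
\quad\Longrightarrow\quad
\alpha_k = \frac{\mu_k^2}{\sigma^2}, \qquad \beta_k = \frac{\sigma^2}{\mu_k}
```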

The value of µk is also considered a function of the forecast precipitation and is computed as:

$$\mu_k = d_k f_k^{1/3} + e_k \tag{10}$$

where dk and ek are constants which are computed individually for each ensemble member through linear regression between the third root of the forecast precipitation and the observations as in S2007. Note that for the current experiments, even though ak, bk, ck, dk and ek are computed individually for each ensemble member, their value is averaged in order to obtain a single coefficient which will be applied to all the ensemble members in the calibration step. This is because, as stated before, in this particular ensemble the skill of the different members is not significantly different.

The value of σ is assumed to be the same for all the ensemble members as in S2007, but in this case it will be assumed that σ is independent of the forecast precipitation because this assumption is the one that produces the best results for the experiments presented in this paper. Similar conclusions have been obtained by Schmeits and Kok (2010).

Combining Equations (8) and (9), an expression to compute the probability of the precipitation being over a certain threshold given the forecast precipitation by the kth ensemble member, assuming that this is the best ensemble member, can be obtained:

$$h_k(y > \mathit{tr} \mid f_k) = h_k(y > 0 \mid f_k)\left[1 - G_{\mu_k,\sigma}(\mathit{tr})\right] \tag{11}$$

Now, Equation (7) can be used to combine the probability associated to each ensemble member assuming that the weights represent the probability of each ensemble member being the best ensemble member. The computation of the conditional probability based on Equation (7) is referred to as Bayesian Model Averaging (BMA). In this case, as equal weights are being used for all the ensemble members, which substantially simplifies the calibration process, the method will be referred to as GAMMA ENS.

The value of σ is computed numerically finding the maximum of the following function:

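$$C(\sigma) = \sum_{i=1}^{N} \log\left\{ \sum_{k=1}^{K} w_k \left[ \delta_y \left(1 - h_k(y > 0 \mid f_{ik})\right) + \left(1 - \delta_y\right) h_k(y > 0 \mid f_{ik})\, g_{\mu_k,\sigma}(y_i) \right] \right\} \tag{12}$$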

where $g_{\mu_k,\sigma}(y_i)$ is the probability density function (PDF) evaluated at the value of the observed precipitation at the ith location and time (yi) assuming a gamma distribution with mean µk and standard deviation σ, fik is the forecast precipitation for the ith location and time by the kth ensemble member, δy is 1 if yi is equal to 0 and 0 otherwise, and N is the total number of different locations and times in the training sample. The argument of the logarithm is the PDF that results from the combination of the PDFs associated with each ensemble member evaluated at the observed precipitation value, so when C is maximum then the combination is optimal in the sense that the probability of the observations given the forecast precipitation by the individual ensemble members within the training sample is maximum.

The numerical maximization of C is performed simply computing the value of C for a range of values of σ, and choosing the value of σ corresponding to the maximum value of C.
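
The grid search described above might be sketched as follows for equally weighted members. This is an illustrative sketch rather than the code used in the study: the function names are hypothetical, and the probability of precipitation and the gamma mean for every case and member are assumed to have been computed beforehand and passed in as arrays.

```python
import math

import numpy as np

# Element-wise log-gamma function built from the standard library.
_lgamma = np.vectorize(math.lgamma, otypes=[float])


def gamma_logpdf(y, mean, sd):
    """Log-density of a gamma distribution given its mean and standard deviation."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return ((shape - 1.0) * np.log(y) - y / scale
            - _lgamma(shape) - shape * np.log(scale))


def log_likelihood(sigma, y, pop, mu):
    """C(sigma) for equal weights w_k = 1/K.

    y:   (N,)   transformed observations (0 where no rain was observed)
    pop: (N, K) probability of precipitation for each case and member
    mu:  (N, K) gamma mean for each case and member
    """
    y = np.asarray(y, dtype=float)
    pop = np.asarray(pop, dtype=float)
    mu = np.asarray(mu, dtype=float)
    wet = y > 0
    member_pdf = np.empty_like(pop)
    member_pdf[~wet] = 1.0 - pop[~wet]            # dry cases: probability of zero
    member_pdf[wet] = pop[wet] * np.exp(
        gamma_logpdf(y[wet][:, None], mu[wet], sigma))
    return float(np.sum(np.log(member_pdf.mean(axis=1))))


def best_sigma(y, pop, mu, sigmas):
    """Return the candidate sigma that maximizes C over the supplied grid."""
    scores = [log_likelihood(s, y, pop, mu) for s in sigmas]
    return sigmas[int(np.argmax(scores))]
```

The resolution of the estimate is limited by the spacing of the candidate values, so a coarse grid followed by a finer one around the maximum is a natural refinement.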

Equation (11) could also be used to obtain the probability of precipitation above a certain threshold in the case where only one forecast is available (i.e. it can be applied to a single forecast experiment or to the ensemble mean). The application of Equation (11) to the ensemble mean, to obtain a calibrated probabilistic forecast will be referred to as GAMMA. In this case the parameters from Equations (8) and (10) have to be estimated for the ensemble mean.

In brief, it should be remarked that some of the algorithms (GAMMA, GS, LR, ELR and LLR) can be applied to both ensemble forecasts and deterministic (single) forecasts while some others (GAMMA ENS, SLR and RH) require an ensemble of forecasts to compute probabilities. This consideration can be important depending on the type of ensemble considered. For example, in the case of a multi model ensemble, where systematic errors of each ensemble member could be different, some advantage could be obtained if calibration is applied to each ensemble member individually and then probability is averaged using weights proportional to each member forecast skill over the training period, as in S2007. This issue will be addressed more in depth in Ruiz et al. (2011).

For all the methods tested, the probability of rainfall over 0 mm when the ensemble mean forecast precipitation is 0 has been set to 0. This decision does not affect calibration or skill measures significantly, while removing low probability values associated with large areas of ‘no rain’ forecasts.

2.4. Verification scores

In this work the skill of the different PQPF is quantified using the Brier Skill Score (BSS, Brier, 1950; Wilks, 2006a) which can be defined as follows:

$$\mathrm{BSS} = 1 - \frac{\mathrm{BR}}{\mathrm{BR}_{\mathrm{clim}}} \tag{13}$$

where BR is the Brier score of the forecast and the BRclim is the Brier score of the climatology. The Brier score is given by

$$\mathrm{BR} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2 \tag{14}$$

where pi is the forecast probability of a given event for a particular time and location, and oi is 1 if the event actually occurred at that particular time and location and 0 otherwise. The summation is over all the times and locations where N is the total number of issued forecasts. The Brier score of the climatology is computed by replacing pi by the climatological probability of the event. The Brier score is strictly proper (i.e. it cannot be improved by hedging) and although the BSS is not strictly proper, it approximates a strictly proper scoring rule if the verification sample is moderately large (Wilks, 2006a). It should be noted that the BR also provides information about the continuous ranked probability score (CRPS) (Hersbach, 2000). According to Gneiting et al. (2005), the CRPS is equal to the integral of the BR over all possible thresholds.
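
The two scores can be computed in a few lines, as in the sketch below. The function names are hypothetical, and the use of the sample frequency of the event as the climatological probability when none is supplied is an assumption of this sketch.

```python
import numpy as np


def brier_score(prob, occurred):
    """Mean squared difference between forecast probability and outcome (0/1)."""
    prob = np.asarray(prob, dtype=float)
    occurred = np.asarray(occurred, dtype=float)
    return float(np.mean((prob - occurred) ** 2))


def brier_skill_score(prob, occurred, clim_prob=None):
    """BSS = 1 - BR / BR_clim.

    If clim_prob is None, the sample frequency of the event is used as the
    climatological probability (assumption of this sketch).
    """
    occurred = np.asarray(occurred, dtype=float)
    if clim_prob is None:
        clim_prob = occurred.mean()
    br = brier_score(prob, occurred)
    br_clim = brier_score(np.full_like(occurred, clim_prob), occurred)
    return 1.0 - br / br_clim
```

A perfect forecast gives a BSS of 1, a forecast that always issues the climatological probability gives 0, and negative values indicate a forecast worse than climatology.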

The events considered in this paper for the computation of the BSS are the observed precipitation being over the following thresholds: 0.0, 1, 2.5, 5, 7.5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120 and 150 mm.

Following Stephenson et al. (2008), the BSS can be decomposed into reliability and generalized resolution components as stated in:

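$$\mathrm{BSS} = \frac{1}{\bar{o}\,(1-\bar{o})}\left[ -\frac{1}{N}\sum_{k=1}^{m} n_k\left(\bar{p}_k - \bar{o}_k\right)^2 + \frac{1}{N}\sum_{k=1}^{m} n_k\left(\bar{o}_k - \bar{o}\right)^2 - \frac{1}{N}\sum_{k=1}^{m}\sum_{j=1}^{n_k}\left(p_{kj} - \bar{p}_k\right)^2 + \frac{2}{N}\sum_{k=1}^{m}\sum_{j=1}^{n_k}\left(p_{kj} - \bar{p}_k\right)\left(o_{kj} - \bar{o}_k\right) \right] \tag{15}$$

where the forecast probability range is divided into m bins (in this case 0.1 wide bins are used), nk is the number of cases within each bin, pkj is the forecast probability for the jth case within the kth bin, p̄k is the mean forecast probability for the kth bin, okj is 1 when the jth event at the kth bin occurs and 0 otherwise, ōk is the observed relative frequency of the event within the kth bin, and ō is the observed relative frequency of the event over the whole sample of N cases. The first term in the right hand side of Equation (15) represents the reliability component of the BSS and the following terms are the generalized resolution component as defined by Stephenson et al. (2008). In terms of the BSS, a better forecast is the one with higher BSS, i.e. that where the reliability term is closer to 0 and with a larger resolution term. Note that in this case the definition of the reliability component includes the minus sign.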
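
The decomposition can be checked numerically with a sketch such as the one below. It assumes that the climatological reference is the sample frequency of the event, so that the Brier score of the climatology equals ō(1 − ō); the function name and the handling of empty bins are choices made for illustration only.

```python
import numpy as np


def bss_components(prob, occurred, bin_width=0.1):
    """Return (reliability, generalized_resolution) so that their sum is the BSS.

    The reliability term carries the minus sign, as in the text. The sample
    frequency of the event is used as climatology (assumption of this sketch).
    """
    prob = np.asarray(prob, dtype=float)
    occurred = np.asarray(occurred, dtype=float)
    n = prob.size
    obar = occurred.mean()
    unc = obar * (1.0 - obar)                     # Brier score of climatology
    n_bins = int(round(1.0 / bin_width))
    idx = np.minimum((prob / bin_width).astype(int), n_bins - 1)
    rel = res = wbv = wbc = 0.0
    for k in range(n_bins):
        sel = idx == k
        nk = int(sel.sum())
        if nk == 0:
            continue                              # empty bins contribute nothing
        pk, ok = prob[sel], occurred[sel]
        pbar, okbar = pk.mean(), ok.mean()
        rel += nk * (pbar - okbar) ** 2           # reliability
        res += nk * (okbar - obar) ** 2           # resolution
        wbv += np.sum((pk - pbar) ** 2)           # within-bin variance
        wbc += 2.0 * np.sum((pk - pbar) * (ok - okbar))  # within-bin covariance
    reliability = -rel / (n * unc)
    generalized_resolution = (res - wbv + wbc) / (n * unc)
    return reliability, generalized_resolution
```

The two within-bin terms are what makes the identity exact for any bin width, which is the reason for using the generalized rather than the classical resolution.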

3. Results

3.1. Estimation of the calibration training period length

The calibration methods have been applied as in a real forecast situation, i.e. using a dynamic training approach which considers N days prior to each forecast. The selection of an appropriate length for the training period (i.e. of N days) depends on several factors as, for example, data availability and current weather regime. Although long training periods are preferable, they should optimally correspond to the same season, and this is only attainable using reforecasts with the same version of the model (e.g. Hamill and Whitaker, 2006), which is out of our current possibilities.

Two different approaches to determine a reasonable sample size N under our dynamic calibration, are tested in this work: one is based on the stability of the estimated parameters and the other is based on the skill of the resulting PQPF. In both cases, the LR algorithm has been used, given its simplicity and good performance, as will be discussed later in the text.

One hundred groups for each length of N days have been randomly selected, and parameters have been estimated for each group to evaluate their stability as a function of N. Figure 4(a) shows the results for the estimation of parameter a (see Equation (2)) at the 1 mm threshold. The behaviour of this parameter becomes stable for training periods between 10 and 15 days. Figure 4(b) shows the BSS as a function of the number of days in the training period. As can be seen in this figure, the BSS rapidly increases with the number of days in the training period, with a significant change in the rate of increase between 7 and 10 days, and a continuous but less significant improvement with larger training data sets. Based on the results discussed so far, a 20 day training period was adopted for the subsequent experiments. Even though the stability of the parameters and the skill of the forecast continue to increase slightly beyond the 20 day training sample size, this value has been chosen in this work because of the shortness of the experimental period and because, even when the skill keeps increasing, the rate of increase is relatively small compared to that seen for smaller training samples. The size of the training period also depends on the amount of precipitation data available, as well as on the number of rainy days. It is therefore considered that this optimum number should be calculated at least for each season and region (assuming that different regions have different observational coverage).

Figure 4.

(a) Mean value (grey solid line) and sigma (black lines) for parameter ‘a’ from Equation (2) as function of number of days in the training data set. (b) BSS over the last 5 days of the period as function of number of days in the training data set
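The parameter stability test described above can be illustrated with a minimal sketch. It assumes a generic logistic form p = 1 / (1 + exp(-(a + b x))), which may differ from the exact form of Equation (2), and the function names, the small ridge term and the clipping of the linear predictor are hypothetical choices made only to keep the example numerically stable. For a given training length, random groups of days are drawn, the regression is fitted to each group, and the mean and standard deviation of parameter a are returned.

```python
import numpy as np

def fit_logistic(x, y, n_iter=50, ridge=1e-3):
    """Fit p = 1 / (1 + exp(-(a + b*x))) by Newton's method; returns (a, b).

    The logistic form and the small ridge term are assumptions of this sketch.
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        # Clip the linear predictor to avoid overflow in exp.
        eta = np.clip(X @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        grad = X.T @ (y - p) - ridge * beta
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(2)
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def parameter_stability(x_by_day, y_by_day, n_days, n_groups=100, seed=0):
    """Mean and standard deviation of 'a' over random groups of n_days days.

    x_by_day[d]: predictor values (e.g. ensemble mean) for day d.
    y_by_day[d]: 1.0 where the event (rain above threshold) occurred, else 0.0.
    """
    rng = np.random.default_rng(seed)
    a_values = []
    for _ in range(n_groups):
        days = rng.choice(len(x_by_day), size=n_days, replace=False)
        x = np.concatenate([x_by_day[d] for d in days])
        y = np.concatenate([y_by_day[d] for d in days])
        a_values.append(fit_logistic(x, y)[0])
    return float(np.mean(a_values)), float(np.std(a_values))
```

Calling `parameter_stability` for increasing values of `n_days` and plotting the returned mean and standard deviation gives a curve of the kind shown in Figure 4(a).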

3.2. Comparison between calibration strategies applied to the ensemble mean

The idea of calibration algorithms is to find a statistical model to describe the ensemble/model uncertainty. To assess how good this model is, the same type of data should be used for both training and verification, because statistical differences between verification and training data might appear as spurious failures in the calibration algorithm. As stated in Section 2, CMORPH data have been preferred here, since the main objective of this experiment is to evaluate the behaviour of different calibration methodologies and not to quantify the skill of the ensemble systems.

One of the forecast properties most affected by calibration is reliability, which can be evaluated at different thresholds using the reliability diagram, as shown in Figure 5. For forecast probabilities larger than 0.2, all the strategies reduce the overestimation of the probability of rainfall above the selected thresholds. The GAMMA approach shows a small overestimation of the probabilities for values higher than 0.5. Similar results in terms of reliability improvement are found at the 15 mm threshold; however, as expected, the maximum reliable probabilities for this threshold are almost half of those observed for the 1 mm threshold.

Figure 5.

(a) Reliability diagrams for 1 and (b) 15 mm thresholds for 24 h PQPF over region 1. Noncalibrated forecast (black line), GS (circles), GAMMA (grey line), LR (crosses) and ELR (triangles)

In order to quantify the skill of these calibration methods, the BSS and its reliability and resolution components have been computed using Equation (15) (as in Stephenson et al., 2008) and are shown in Figure 6. The BSS and its components are plotted relative to those of the LR, which is taken as a reference. The uncalibrated forecast is not included in this figure due to its poor performance, which might be due to the characteristic summer precipitation regime over the selected domain. As stated before, during this time of the year a large amount of precipitation comes from mesoscale convective systems that are usually hard to predict, thus increasing forecast error and leading to low forecast skill. As can be seen in Figure 6, up to 30 mm the performance of the different calibration approaches is very similar; for higher thresholds, however, the LR algorithm seems to lose reliability. This might indicate that, as the precipitation threshold increases and the number of cases where the event occurs in the training sample diminishes, the logistic fit loses accuracy. ELR efficiently solves this problem by including the precipitation threshold as a predictor; the alternative LLR approach leads to results similar to those obtained with ELR (not shown). However, for large precipitation thresholds the BSS is close to 0, indicating that the skill of the probabilistic forecast is very close to that of climatology. The ELR approach also seems to have slightly worse skill at the lower thresholds.

Figure 6.

(a) BSS and its (b) reliability and (c) resolution components over region 1 for different calibration strategies relative to the value of the BSS and its components corresponding to the LR method (in all cases positive values means an improvement with respect to the LR method). Lines and markers as in Figure 5. The grey dashed lines in panel (a) represent the 90% confidence interval limits for the LR BSS computed using 1000 bootstrap samples
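As an illustration of the quantities plotted in Figure 6, the following minimal sketch computes the Brier score and its classical three-term decomposition (reliability, resolution and uncertainty) from binned forecast probabilities. It is not Equation (15): the within-bin terms discussed by Stephenson et al. (2008) are omitted here, so the identity BS = reliability - resolution + uncertainty is exact only when all forecasts in a bin share the same probability. The function name and the choice of ten equal-width bins are assumptions.

```python
import numpy as np

def brier_decomposition(prob, obs, n_bins=10):
    """Brier score with the classical reliability/resolution/uncertainty terms.

    prob: forecast probabilities in [0, 1]; obs: 1.0 if the event occurred.
    Within-bin variance terms are neglected in this sketch.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(prob, edges[1:-1])  # bin index 0 .. n_bins-1
    n = len(obs)
    obar = obs.mean()                     # sample climatology
    rel = 0.0
    res = 0.0
    for k in range(n_bins):
        sel = idx == k
        nk = sel.sum()
        if nk == 0:
            continue
        pk = prob[sel].mean()             # mean forecast probability in bin
        ok = obs[sel].mean()              # observed frequency in bin
        rel += nk * (pk - ok) ** 2
        res += nk * (ok - obar) ** 2
    rel /= n
    res /= n
    unc = obar * (1.0 - obar)
    bs = float(np.mean((prob - obs) ** 2))
    return {
        "bs": bs,
        "reliability": float(rel),        # smaller is better
        "resolution": float(res),         # larger is better
        "uncertainty": float(unc),
        "bss": 1.0 - bs / unc,            # skill relative to sample climatology
    }
```

Dividing the reliability and resolution terms by the uncertainty gives the corresponding components of the BSS; differences with respect to a reference method, as in Figure 6, then follow by subtraction.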

GS and GAMMA are the algorithms showing the best performance over all the selected threshold ranges, GS being the best choice at lower thresholds. It should be remarked that the GAMMA approach uses only three parameters to compute the probability, which significantly reduces the number of parameters that have to be determined during the calibration process. Figure 6 also provides insight into the resolution component of the BSS: all the calibration algorithms increase forecast resolution compared to the uncalibrated forecast (not shown), although there is little difference among them. Table I summarizes the skill of the different calibrations discussed so far using the CRPS. In the present case, the best results have been achieved using the GS and GAMMA calibrations, closely followed by the other methods. LLR is included in this table for comparison against the ELR method: the performance of both is similar, as expected, although the LLR approach is slightly more skillful. As can be seen in this table, the 48 h lead time forecasts show larger CRPS (i.e. lower skill), as expected, with similar behaviour of the different calibration schemes.

Table I. CRPS values corresponding to the different calibration strategies discussed in the text for 24 and 48 h forecasts
Method       24 h forecast   48 h forecast
GAMMA ENS    4.70            4.95
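For reference, the CRPS of a single forecast can be written as the integral over precipitation amount of the squared difference between the forecast cumulative distribution and a step function located at the observed value, which is equivalent to integrating the Brier score over all thresholds. The sketch below evaluates this integral with the trapezoidal rule on a grid of thresholds; the function name and the grid are hypothetical, and the exact numerical procedure used to obtain the values in Table I is not specified here.

```python
import numpy as np

def crps_from_cdf(thresholds, cdf, obs):
    """CRPS = integral of (F(x) - H(x - obs))^2 dx, by the trapezoidal rule.

    thresholds: increasing grid of precipitation amounts (mm).
    cdf: forecast probability of not exceeding each threshold.
    obs: observed precipitation amount (mm).
    """
    heaviside = (thresholds >= obs).astype(float)
    sq = (cdf - heaviside) ** 2
    return float(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(thresholds)))
```

For a deterministic forecast the forecast distribution is itself a step function and the CRPS reduces to the absolute error, so the score is expressed in the units of the variable (mm in this case) and lower values indicate higher skill.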

Other ways to measure reliability and resolution, such as the probability integral transform histogram and the ROC diagram, confirm the results discussed so far, both at 24 and 48 h lead times.

The calibration methods discussed before use only one predictor, the ensemble mean. In the next section a set of methods that use the full ensemble or the ensemble mean and spread as predictors are evaluated, in order to address whether these variables provide extra information that could improve precipitation forecast skill.

3.3. Comparison of calibration strategies using more information than the ensemble mean

As previously shown, the use of the ensemble mean as the only predictor provides good results; however, it might be expected that adding extra information would result in better skill. This is analysed with the aid of Figure 7, which shows the BSS and its decomposition for those algorithms that include other information besides the ensemble mean (i.e. SLR, GAMMA ENS and RH). The RH algorithm has the lowest BSS, mostly due to lower reliability. For GAMMA ENS, a degradation of reliability with respect to GAMMA for thresholds under 20 mm and a slight improvement for higher thresholds can be seen. These results are in agreement with those of S2007, where the dressed ensemble has slightly lower BSS than a logistic regression based on the ensemble mean. However, the BSS of the GAMMA ENS approach is lower than the BSS of GAMMA for almost all thresholds. A similar situation is observed with the SLR and LR approaches, with slightly lower BSS for the SLR, particularly at lower thresholds. The resolution component shows almost no improvement with the inclusion of extra information about the ensemble.

Figure 7.

As in Figure 6, but for GAMMA (grey solid line), GAMMA ENS (black dashed line), LR (crosses), SLR (circles) and RH (triangles). The grey dashed lines in panel (a) represent the 90% confidence interval limits for the LR BSS computed using 1000 bootstrap samples
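Confidence limits such as those drawn in panel (a) can be estimated by resampling. The following minimal sketch computes a percentile bootstrap interval for the BSS using 1000 samples drawn with replacement. Resampling individual forecast-observation pairs is an assumption of this sketch (the resampling unit is not stated here, and resampling whole days would be a reasonable alternative if errors are correlated in space); the function names are hypothetical.

```python
import numpy as np

def brier_skill_score(prob, obs):
    """BSS relative to the sample climatology of obs (obs must not be constant)."""
    bs = np.mean((prob - obs) ** 2)
    bs_ref = np.mean((obs.mean() - obs) ** 2)
    return float(1.0 - bs / bs_ref)

def bootstrap_bss_interval(prob, obs, n_boot=1000, level=0.90, seed=0):
    """Percentile bootstrap confidence interval for the BSS."""
    rng = np.random.default_rng(seed)
    n = len(obs)
    scores = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        scores[i] = brier_skill_score(prob[idx], obs[idx])
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(scores, [tail, 100.0 - tail])
    return float(lo), float(hi)
```

A calibration method whose BSS falls inside the interval obtained for the reference method cannot be distinguished from it at the chosen confidence level.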

Given these results, a natural question is why the inclusion of extra information does not improve forecast resolution in these experiments. Hamill and Colucci (1998) studied the relationship between ensemble spread and ensemble error for precipitation, concluding that although the relationship between both is strong, a strong relationship between ensemble mean (i.e. actual rain amount) and spread also exists. In order to assess whether this applies to the current investigation, the ensemble spread-error and ensemble mean-error relationships are shown in Figure 8(a) and (b). Correlation coefficients between these variables are particularly high. However, as discussed in Hamill and Colucci (1998), if the dependence of spread and error on the ensemble mean is removed, there is almost no relationship between spread and error. To illustrate this issue, the ensemble mean range has been divided into several bins (the same bins that have been used for the calibration methods) and the error and standard deviation in each ensemble mean bin have been standardized with respect to their corresponding bin mean and standard deviation. Figure 8(c) shows the scatter plot of the standardized ensemble spread and standardized error. As can be seen, the correlation coefficient is close to 0, suggesting that the spread is not adding much extra information. This could also explain the unexpected behaviour of the calibration curves for low and high spread discussed in Section 2 and illustrated in Figure 2.

Figure 8.

(a) Scatter plot between ensemble spread and the ensemble mean error, (b) ensemble mean and ensemble mean error and (c) the standardized ensemble spread and ensemble mean error. All variables are in mm
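The standardization used for panel (c) can be sketched as follows. Within each ensemble mean bin, a variable is centred on the bin mean and divided by the bin standard deviation, which removes the part of the spread-error relationship that is explained by the ensemble mean alone. The function names and the bin edges passed by the caller are hypothetical; the bins actually used are those of the calibration methods and are not reproduced here.

```python
import numpy as np

def standardize_within_bins(values, ens_mean, bin_edges):
    """Standardize values by the mean and std of their ensemble mean bin.

    Bins with fewer than two cases or zero variance are returned as NaN.
    """
    idx = np.digitize(ens_mean, bin_edges)
    out = np.full(values.shape, np.nan)
    for k in np.unique(idx):
        sel = idx == k
        sd = values[sel].std()
        if sel.sum() > 1 and sd > 0:
            out[sel] = (values[sel] - values[sel].mean()) / sd
    return out

def correlation(a, b):
    """Pearson correlation, ignoring pairs where either value is not finite."""
    ok = np.isfinite(a) & np.isfinite(b)
    return float(np.corrcoef(a[ok], b[ok])[0, 1])
```

Comparing `correlation(spread, error)` with the correlation of the two standardized series shows how much of the raw spread-error relationship survives once the dependence on precipitation amount is removed.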

4. Conclusions

In this work a comparison of different methods for probabilistic quantitative precipitation forecast (PQPF) calibration over South America is presented. The analysis focuses on 24 h accumulated precipitation forecasts over a particular sub-region where probabilistic precipitation forecasts could be of relevance for economic activities. Two main types of calibration algorithms have been employed, one using only the ensemble mean, and the other adding information about the ensemble through its spread or using all the ensemble members.

Among those methods using only the ensemble mean, the Gallus and Segal (GS) algorithm and the GAMMA algorithm show very similar results in terms of calibration and skill. However, GAMMA uses fewer parameters than GS, and also has the advantage that, once the shape of the PDF is estimated, probabilities can be computed for any desired threshold.

The inclusion of the ensemble spread or the individual ensemble members, as in the two variable logistic regression (SLR), GAMMA ENS and rank histogram (RH) methods, does not lead to an improvement of skill or reliability with respect to methods based on the ensemble mean. This has been shown to be related to a particular characteristic of precipitation, which exhibits a large correlation between expected error and precipitation amount (Hamill and Colucci, 1998). Still, more work is needed to assess whether this remains true for high horizontal resolution ensembles, where promising results have been obtained by Schaffer et al. (2011).

It should be remarked, however, that forecasts using the ensemble mean are better than those derived from a single deterministic forecast because errors of the ensemble mean are lower than those of a single deterministic forecast (Ruiz et al., 2009).

Another alternative to explore is the inclusion of additional predictors that can affect PQPF calibration more directly, such as the ensemble mean bias. This kind of approach could be more adequate to handle the non-uniform weather regime over the area of study, which in this paper has been only partially controlled by using a box inside the model domain. A two variable logistic regression using the ensemble mean bias in the training period would allow the calibration curves to be corrected for systematic forecast errors. The inclusion of other variables could also be explored, as in Applequist et al. (2002) and Hamill and Whitaker (2006), among others. In particular, Hamill and Whitaker (2006) also introduce the analog technique applied to global NCEP reforecasts. Methods applied to large datasets such as the NCEP reforecasts are very promising, and their evaluation over South America is part of work in progress.

Although most of the results shown in the present paper correspond to only one region of South America, other regions such as central Brazil and Bolivia have also been analysed. Over these regions, characterized by tropical regimes, forecast skill scores are lower, in agreement with what has been reported by Ruiz et al. (2009) and Ebert et al. (2003). However, results regarding the relative performance of the different calibrations were very similar to those over the region used throughout the present study, so no discussion about other regions has been included here. Also, the 48 h forecasts do not add useful information to the present analysis, except for the fact that their skill is lower than that of the 24 h forecasts. It is considered that this investigation provides useful insight into the problem of selecting a calibration technique adequate for improving the reliability of probabilistic precipitation forecasts in an operational framework over South America. This work, combined with that of Ruiz et al. (2011), which evaluates the impact of using several types of ensembles, should aid in the identification of an optimum and feasible system to generate probabilistic precipitation forecasts operationally.

Although a relatively large amount of data is available for this experimental period, the work does not provide information about the dependence of the results on the time of the year (which can introduce strong changes in forecast skill) or about interannual variability, which could also be significant. It is not obvious whether this variability would affect each calibration algorithm in the same way, or whether our results would remain the same over other periods. New experiments covering longer periods that span different seasons and years should be performed to see if the main conclusions of this work can be confirmed and generalized to other weather regimes.


The authors are thankful to Istvan Szunyogh for providing the scripts to run the global ensemble, to Erick Kostelich for his help with the MRF model runs and to Jae Schemm for providing the initial conditions in the required file format. The authors also wish to acknowledge the two anonymous reviewers for their comments on this manuscript. This study has been supported by the following projects: UBACyT X204 and CONICET PIP 112-200801-00399. The research leading to these results has received partial funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No 212492 (CLARIS LPB. A Europe-South America Network for Climate Change Assessment and Impact Studies in La Plata Basin).