How sensitive are probabilistic precipitation forecasts to the choice of calibration algorithms and the ensemble generation method? Part II: sensitivity to ensemble generation method


  • Juan J. Ruiz (corresponding author)
    1. Dept. de Cs. de la Atmósfera y los Océanos, FCEN, Universidad de Buenos Aires, Argentina
    2. Centro de Investigaciones del Mar y la Atmósfera (CONICET-UBA), Buenos Aires, Argentina
  • Celeste Saulo
    1. Universidad de Buenos Aires, Argentina
    2. Centro de Investigaciones del Mar y la Atmósfera (CONICET-UBA), Buenos Aires, Argentina
  • Eugenia Kalnay
    1. University of Maryland, College Park, USA


In this work the sensitivity of summer Probabilistic Quantitative Precipitation Forecasts (PQPF) to alternative ensemble generation methods over southeastern South America is examined. A perturbed initial condition ensemble using the breeding technique, a multimodel ensemble and a pragmatic ensemble based on spatial shifts of the forecast fields were used to generate calibrated PQPF over the selected region, and the results were evaluated using the Brier Skill Score. Results show that calibrated PQPF quality is sensitive to the ensemble system used, and that this sensitivity is mainly related to the resolution component of the Brier Skill Score. For the 24 h lead time, the pragmatic approach shows surprisingly good results, while for the 48 h lead time the best results are obtained with the multimodel ensemble. The combination of the spatial shift technique with the multimodel and with the perturbed initial condition ensembles has also been evaluated and resulted in an increase of the PQPF skill at all lead times. Copyright © 2011 Royal Meteorological Society

1. Introduction

In Part I of this research (Ruiz and Saulo, 2011, hereafter RS11), different calibration strategies were applied to a single regional model ensemble and the sensitivity of the Probabilistic Quantitative Precipitation Forecast (PQPF) skill to these calibration methodologies was examined. An 11 member ensemble using the Weather Research and Forecasting (WRF) model, with initial conditions perturbed using breeding of the growing modes (Toth and Kalnay, 1997), was the basis for those tests. The Part I experiments show that calibration algorithms such as those proposed by Gallus and Segal (2004), Applequist et al. (2002) and Sloughter et al. (2007) (hereafter S2007), denoted as GAMMA, lead to very similar results, and the choice of one or another depends on the specific user needs.

As described by Wilks and Hamill (2007), errors in PQPF have several sources, including initial condition and model errors. These two error sources motivate two different approaches to ensemble generation: perturbed initial and boundary condition ensembles (Molteni et al., 1996; Toth and Kalnay, 1997, among many others), and ensembles with different formulations of the model physics (Palmer et al., 2007) or multimodel ensembles, i.e. ensembles where each member is a different model (Krishnamurti et al., 1999; Ebert et al., 2001; Park et al., 2008). The main objective of this paper is to evaluate which approach leads to better PQPF skill over our region of interest, and to compare perturbed initial condition and/or perturbed physics ensembles with simpler, computationally cheap ensembles such as the one proposed by Theis et al. (2005), which only takes into account displacement errors in the forecast field. In previous work focusing on South America (Ruiz et al., 2009), two different ensemble systems, a regional ensemble based on the Scaled Lagged Averaged Forecast (SLAF) technique (Ebisuzaki and Kalnay, 1991) and the University of Sao Paulo Multimodel Ensemble System (Silva Dias et al., 2006), were compared with the PQPF derived from a single model. Results showed that the multimodel ensemble system was better than the SLAF ensemble, particularly over tropical regions.

When multimodel ensembles are considered, it would be desirable to use calibration strategies that could take into account that the different ensemble members might not be equally probable. In order to deal with this issue Raftery et al. (2005) proposed the Bayesian Model Averaging method (BMA) which is an ensemble dressing technique that takes into account the individual skill of each ensemble member by computing the probability of each one being the best ensemble member and by individually removing the systematic error of each member. This technique was extended to precipitation forecasts by S2007, adopting a gamma distribution to describe the probability density function (PDF) of forecast precipitation errors.

With the aim of taking into account heterogeneity inside the ensemble, two different approaches are used here: an implementation of the BMA approach as in S2007 and the calibration of a weighted ensemble mean. These two approaches are also compared with the case where all the members are assumed to be equally probable. It is considered that, combined with Part I, this work provides useful guidelines to identify strategies for probabilistic precipitation forecasts generation and also to estimate forecast uncertainties using ensemble techniques that are adequate for limited computational capabilities. This paper is organized as follows. Section 2 describes the ensemble generation strategies and the diverse methodologies used for PQPF calibration, Section 3 shows the results of the verification of the PQPF derived from different ensemble systems and calibration strategies and Section 4 presents the main conclusions of this work.

2. Methodology

2.1. Verification scores

In this work the skill of the different PQPF is quantified using the Brier Skill Score (BSS, Wilks, 2006) which can be defined as follows:

$$\mathrm{BSS} = 1 - \frac{\mathrm{BR}}{\mathrm{BR}_{\mathrm{clim}}}\qquad(1)$$

where BR is the Brier score of the forecast and BR_clim is the Brier score of the climatology. The Brier score is given by:

$$\mathrm{BR} = \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2\qquad(2)$$

where p_i is the forecast probability of a given event at a particular time and location, and o_i is 1 if the event actually occurred at that time and location and 0 otherwise. The summation is over all times and locations, and N is the total number of issued forecasts. The Brier score of the climatology is computed by replacing p_i with the climatological probability of the event. The Brier score is strictly proper (i.e. it cannot be improved by hedging) and the BSS approximates a strictly proper scoring rule if the verification sample is moderately large (Wilks, 2006). It should be noted that the BR also provides information about the continuous ranked probability score (CRPS) (Hersbach, 2000): according to Gneiting et al. (2005), the CRPS is equal to the integral of the BR over all possible thresholds.
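As a concrete illustration, Equations (1) and (2) can be computed with a few lines of NumPy (a minimal sketch; the function names and the toy data are our own, not part of the paper):

```python
import numpy as np

def brier_score(p, o):
    """Mean squared difference between forecast probabilities and binary outcomes, Eq. (2)."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return np.mean((p - o) ** 2)

def brier_skill_score(p, o, p_clim):
    """BSS = 1 - BR / BR_clim, Eq. (1); positive values beat the climatological forecast."""
    br = brier_score(p, o)
    br_clim = brier_score(np.full(len(o), p_clim), o)
    return 1.0 - br / br_clim

# toy example: five forecasts of "precipitation above threshold"
p = [0.9, 0.1, 0.8, 0.2, 0.7]
o = [1, 0, 1, 0, 1]
print(brier_score(p, o))                    # low BR -> sharp, accurate forecasts
print(brier_skill_score(p, o, p_clim=0.5))  # skill relative to a 50% climatology
```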

The events considered in this paper for the computation of the BSS are the observed precipitation being over the following thresholds: 0.0, 1, 2.5, 5, 7.5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120 and 150 mm.

Following Stephenson et al. (2008), the BSS can be decomposed into reliability and generalized resolution components:

$$\mathrm{BSS} = \frac{1}{\bar{o}(1-\bar{o})}\left[-\frac{1}{N}\sum_{k=1}^{m} n_k(\bar{p}_k-\bar{o}_k)^2 + \frac{1}{N}\sum_{k=1}^{m} n_k(\bar{o}_k-\bar{o})^2 - \frac{1}{N}\sum_{k=1}^{m}\sum_{j=1}^{n_k}(p_{kj}-\bar{p}_k)^2 + \frac{2}{N}\sum_{k=1}^{m}\sum_{j=1}^{n_k}(p_{kj}-\bar{p}_k)(o_{kj}-\bar{o}_k)\right]\qquad(3)$$

where the forecast probability range is divided into m bins (in this case 0.1 wide bins are used), n_k is the number of cases within each bin, p_kj is the forecast probability for the jth case within the kth bin, p̄_k is the mean forecast probability for the kth bin, o_kj is 1 when the jth event in the kth bin occurs and 0 otherwise, ō_k is the observed relative frequency of the event within the kth bin and ō is the overall climatological frequency of the event. The first term on the right hand side of Equation (3) represents the reliability component of the BSS, and the following terms form the generalized resolution component as defined by Stephenson et al. (2008). In terms of the BSS, a better forecast is one with a higher BSS, i.e. one whose reliability term is closer to 0 and whose resolution term is larger. Note that in this case the definition of the reliability component includes the minus sign.
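The decomposition in Equation (3) can be sketched as follows (our own illustrative implementation of the Stephenson et al. (2008) terms; details such as the handling of a forecast probability of exactly 1 are implementation choices):

```python
import numpy as np

def bss_components(p, o, m=10):
    """Reliability and generalized resolution components of the BSS, Eq. (3).

    p: forecast probabilities in [0, 1]; o: binary outcomes (0 or 1).
    Returns (reliability, generalized_resolution); their sum equals the BSS.
    """
    p = np.asarray(p, float)
    o = np.asarray(o, float)
    N = p.size
    obar = o.mean()
    unc = obar * (1.0 - obar)                       # BR of the climatological forecast
    bins = np.minimum((p * m).astype(int), m - 1)   # 1/m-wide bins; p == 1 goes to the last bin
    rel = gres = 0.0
    for k in range(m):
        idx = bins == k
        nk = idx.sum()
        if nk == 0:
            continue
        pk, ok = p[idx], o[idx]
        pbar, obark = pk.mean(), ok.mean()
        rel += nk * (pbar - obark) ** 2                 # within-bin conditional bias
        wbv = np.sum((pk - pbar) ** 2)                  # within-bin variance of p
        wbc = 2.0 * np.sum((pk - pbar) * (ok - obark))  # within-bin covariance term
        gres += nk * (obark - obar) ** 2 - wbv + wbc
    # the reliability component carries the minus sign, as in Eq. (3)
    return -rel / (N * unc), gres / (N * unc)

# consistency check: reliability + generalized resolution recovers the BSS
rng = np.random.default_rng(0)
p = rng.random(500)
o = (rng.random(500) < p).astype(float)
rel, res = bss_components(p, o)
print(rel + res)
```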

2.2. Ensemble generation

Three different ensemble generation strategies are tested in this work. All of them are implemented using the Weather Research and Forecasting (WRF) model Advanced Research WRF (ARW) dynamical core version 2.0 (Skamarock et al., 2005). In all the experiments the horizontal grid spacing is 40 km, and 31 vertical sigma levels are used with the model top located at 50 hPa. The model domain (Figure 1) is the same as that used in the experiments described in RS11.

Figure 1.

Regional ensemble domain and CMORPH total precipitation (in mm) between 15 December 2002 and 15 February 2003 (shaded). The black box indicates the area used for verification and calibration

2.2.1. Breeding ensemble

This is a single model, single configuration ensemble with perturbations introduced in the initial and boundary conditions using the breeding of the growing modes technique (hereafter breeding) (Toth and Kalnay, 1997) with a rescaling period of 6 h, and is the same ensemble used in RS11. The ensemble consists of 11 members (five pairs of perturbed members and a control run) integrated up to the 48 h lead time. Boundary conditions are provided by a global ensemble based on the Medium Range Forecast model with T62L28 resolution (approximately 2.5° horizontal resolution). It should be noted that the aim of this paper is to explore the impact of different ensemble methodologies upon PQPF, so the fact that a relatively old version of the global model is used to provide boundary conditions to the regional ensemble should not affect the main conclusions of this work.

The unperturbed initial condition (control run) is obtained from the Global Data Assimilation System (GDAS) analysis with 1° × 1° resolution. The experiment starts on 15 December 2002 and ends on 15 February 2003, and the regional ensemble forecasts are initialized at 1200 UTC. The configuration used to run WRF has been selected among a variety of options following Ruiz et al. (2010), which shows that this configuration represents surface observations satisfactorily. Basic model settings include the Kain–Fritsch cumulus parameterization (Kain, 2004), the Yonsei University scheme (Hong and Pan, 1996) for boundary layer representation, and surface processes modelled using the NOAH land surface model (Chen and Dudhia, 2001). The reader is referred to Ruiz et al. (2010) for a detailed analysis of WRF model performance during the period of interest.

2.2.2. Multimodel ensemble

The WRF framework allows the use of alternative parameterizations for convection, boundary layer mixing, surface processes, cloud microphysics and radiation. In this work, this capability is used to construct a multimodel ensemble based on the same dynamical core (ARW) but using different choices for the model physics. The initial and boundary conditions, as well as the model domain, are the same for all members and correspond to the control run of the breeding system described above.

Table I shows the different choices adopted for each ensemble member. Three convective schemes are used: the Kain–Fritsch scheme (KF) (Kain, 2004), the Grell scheme (GRELL) (Grell and Devenyi, 2002) and the Betts–Miller scheme (BM) (Betts, 1986). Two different schemes are used for the representation of processes in the planetary boundary layer (PBL): the Yonsei University scheme (YSU) (Hong and Pan, 1996) and the Mellor–Yamada–Janjic scheme (MYJ) (Janjic, 1994). Also, three land surface models (LSMs) are tested: the NOAH LSM (NOAH) (Chen and Dudhia, 2001), the Rapid Update Cycle LSM (RUC) (Smirnova et al., 1997) and a simple five layer surface model (5 layer), previously used in MM5, that keeps soil moisture constant. Other model physics packages, such as the radiation and microphysics treatments, have been kept as in the breeding experiment. Since, for this particular period (austral summer), most of the forecast precipitation over the region under consideration results from the convective scheme, larger sensitivity should be expected from those parameterizations strongly linked with convection. While the convective scheme is directly associated with the representation of convective processes within the model, the PBL scheme and the LSM are also strongly related to convection through their influence upon atmospheric convective instability and moisture availability at low levels. This is the basic rationale for the particular combinations of model configurations adopted in this work. The multimodel ensemble mean obtained by simply averaging these configurations has shown good performance representing surface variables for the same region and period of study, as reported in Ruiz et al. (2010).

Table I. Schemes used for each ensemble member for the representation of convection, PBL and surface processes
Member   Convective scheme   PBL scheme   LSM
3        BM                  MYJ          5 layer
8        KF                  YSU          5 layer

2.2.3. Spatially shifted ensemble

This ensemble is synthetically generated and does not require multiple model runs. It simply consists of spatially shifted forecasts obtained with a single control run (Theis et al., 2005; Schaffer et al., 2011). The main idea of this strategy is to account for uncertainty in the location of forecast features (i.e. in this case, precipitation areas). Accordingly, this ensemble cannot take into account uncertainties associated with the intensity or even the existence of these features. In this case, the control run of the breeding ensemble is used to construct the spatially shifted ensemble.

The number of ensemble members can be very large depending on the model resolution and the maximum shift allowed. Figure 2 shows the BSS of the PQPF as a function of the maximum shift. As can be seen, the BSS increases with increasing maximum spatial shift up to a maximum shift of five grid points (larger spatial shifts do not produce a significant increase in BSS, not shown). This improvement is mostly associated with the resolution component of the BSS, indicating better discrimination between the occurrence and non-occurrence of precipitation above different thresholds (not shown). For a maximum shift of five grid points and a horizontal grid spacing of approximately 40 km, the maximum real displacement is around 280 km, corresponding to the case when the maximum shift is applied simultaneously in the X and Y directions. It is thus assumed that any precipitation feature may occur as far as 280 km from its forecast location.

Figure 2.

BSS difference between shifted PQPF corresponding to different maximum spatial shifts (one grid point, black dashed line; three grid points, grey solid line and five grid points, grey dashed line) and the PQPF derived from the unshifted control run (black solid line). (a) For 24 h forecasts and (b) for 48 h forecasts

Although all shifts are initially equally probable, the likelihood of each shifted forecast can be taken into account during the calibration, as discussed below. However, this likelihood is not flow dependent, as it is in the breeding ensemble. The ensemble size is (2 ms + 1)^2, where ms is the maximum shift expressed in grid points (in this case ms = 5, giving 121 members).
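A spatially shifted ensemble of this kind can be generated with simple array manipulation. The sketch below is our own illustration (padding the uncovered edge rows and columns with NaN is an assumption; the paper does not describe its edge treatment):

```python
import numpy as np

def shifted_ensemble(field, ms=5):
    """Build a (2*ms + 1)**2-member synthetic ensemble by shifting a single
    forecast field up to `ms` grid points in each horizontal direction.
    Uncovered edges are padded with NaN so they are not mistaken for forecasts."""
    ny, nx = field.shape
    members = []
    for dy in range(-ms, ms + 1):          # south-north shifts
        for dx in range(-ms, ms + 1):      # west-east shifts
            shifted = np.full((ny, nx), np.nan)
            src = field[max(0, -dy):ny - max(0, dy), max(0, -dx):nx - max(0, dx)]
            shifted[max(0, dy):ny - max(0, -dy), max(0, dx):nx - max(0, -dx)] = src
            members.append(shifted)
    return np.stack(members)               # shape: ((2*ms + 1)**2, ny, nx)

ens = shifted_ensemble(np.random.rand(50, 60), ms=5)
print(ens.shape)  # (121, 50, 60)
```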

2.3. Calibration methodology

Calibration and verification of the 24 h accumulated PQPF are performed over the same region used in RS11, indicated by the box in Figure 1. For each particular date, the calibration is performed using all the forecasts and observations corresponding to the 20 days prior to the forecast initialization date. For this reason, the period with calibrated forecasts starts on 8 January 2003 and ends on 15 February 2003. The optimal length of the training period was determined in RS11. It should be noted that this length can depend on the data available for calibration (i.e. number of stations, satellite coverage, domain size) as well as on the climatological regime of the region (drier regions could require a longer training sample). CPC MORPHing technique (CMORPH) precipitation estimates (Joyce et al., 2004) have been used as observations in the calibration process. Among existing precipitation estimates, CMORPH has been selected because the use of microwave data leads to better precipitation estimates and provides an excellent alternative to the coarse operational rain gauge network over South America. CMORPH performance and its potential for verification purposes over the region of interest have been addressed in Ruiz et al. (2009).

In RS11, several calibration strategies were discussed and applied to a single model ensemble. In this case, however, as multimodel ensembles are introduced, differences in skill among the ensemble members could have some impact on the calibration process. This is also true for the spatially shifted ensemble, where the forecast skill obtained with shifts of various directions and distances could differ significantly.

In order to evaluate these issues, an approach similar to that presented in S2007 was followed, in which calibrated probabilities are obtained as a linear combination of the individual PDFs associated with each ensemble member. The conditional probability of the cube root of the observed precipitation (y) exceeding a certain threshold (tr), given the precipitation values forecast by the individual members (f_1, …, f_K), is given by:

$$P(y > tr \mid f_1,\ldots,f_K) = \sum_{k=1}^{K} w_k\, h_k(y > tr \mid f_k)\qquad(4)$$

where h_k(y > tr | f_k) is the conditional probability of y exceeding tr given the precipitation forecast by the kth ensemble member (f_k), in the case where the kth ensemble member is the best one; w_k is the weight associated with the kth ensemble member (i.e. the probability that the kth ensemble member is the best one) and K is the ensemble size. The model proposed by S2007 for the computation of h_k(y > tr | f_k) has two parts, depending on whether y is 0 or greater than 0. The probability of the observed precipitation being greater than 0 as a function of the forecast precipitation, h_k(y > 0 | f_k), is modelled using a logistic regression that can be expressed as:

$$h_k(y > 0 \mid f_k) = \frac{\exp\!\left(a_k + b_k f_k^{1/3} + c_k\delta_k\right)}{1+\exp\!\left(a_k + b_k f_k^{1/3} + c_k\delta_k\right)}\qquad(5)$$

where a_k, b_k and c_k are constants estimated individually for each ensemble member, and δ_k is 1 if f_k is 0 and 0 otherwise, so the term c_k δ_k introduces a correction in the computation of the probabilities when the forecast precipitation is 0.

The conditional probability of y exceeding a certain threshold, given that the precipitation is greater than 0, is modelled using a gamma distribution:

$$h_k(y > tr \mid f_k,\, y > 0) = 1 - G_{\mu_k,\sigma}(tr)\qquad(6)$$

where G_{µ_k,σ}(tr) is the cumulative distribution function evaluated at tr, assuming a gamma distribution with mean µ_k and standard deviation σ.

The value of µ_k is also considered a function of the forecast precipitation and is computed as:

$$\mu_k = d_k + e_k f_k^{1/3}\qquad(7)$$

where d_k and e_k are constants computed individually for each ensemble member through linear regression between the cube root of the forecast precipitation and the observations, as in S2007.
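A minimal sketch of how d_k and e_k in Equation (7) might be estimated is given below (our own illustration; the restriction to rainy cases follows the S2007 setup, and the function names are hypothetical):

```python
import numpy as np

def fit_member_mean(f_k, y):
    """Estimate d_k and e_k of Eq. (7) by regressing the observed (cube-root)
    precipitation y on the cube root of member k's forecast, rainy cases only."""
    wet = y > 0
    x = np.cbrt(f_k[wet])
    # polyfit returns coefficients in increasing order: [intercept, slope]
    d_k, e_k = np.polynomial.polynomial.polyfit(x, y[wet], 1)
    return d_k, e_k

def mu_k(f, d_k, e_k):
    """Mean of the member-k gamma distribution as a function of its forecast f."""
    return d_k + e_k * np.cbrt(f)

# synthetic demo with a known linear relationship in cube-root space
f_demo = np.linspace(0.5, 20.0, 50)
y_demo = 0.2 + 0.9 * np.cbrt(f_demo)
print(fit_member_mean(f_demo, y_demo))
```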

The value of σ is assumed to be the same for all ensemble members, as in S2007, but in this case σ is also assumed to be independent of the forecast precipitation, because this assumption produces the best results for the experiments presented in this paper.

Combining Equations (5) and (6), an expression for the probability of the precipitation exceeding a certain threshold, given the precipitation forecast by the kth ensemble member and assuming that it is the best ensemble member, can be obtained:

$$h_k(y > tr \mid f_k) = \frac{\exp\!\left(a_k + b_k f_k^{1/3} + c_k\delta_k\right)}{1+\exp\!\left(a_k + b_k f_k^{1/3} + c_k\delta_k\right)}\left[1 - G_{\mu_k,\sigma}(tr)\right]\qquad(8)$$

Now, Equation (4) can be used to combine the probabilities associated with the individual ensemble members, assuming that the weights represent the probability of each ensemble member being the best one. The computation of the conditional probability based on Equation (4) is referred to as Bayesian Model Averaging (BMA).
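The resulting BMA probability (Equations (4)-(8)) could be evaluated along the following lines (an illustrative sketch using SciPy's gamma distribution; the parameter values in the example are hypothetical, and the mean/standard-deviation to shape/scale conversion assumes µ_k > 0):

```python
import numpy as np
from scipy.stats import gamma

def h_k(tr, f_k, a, b, c, d, e, sigma):
    """P(y > tr | f_k) for one member: logistic PoP (Eq. 5) times the
    gamma exceedance probability (Eq. 6), i.e. Eq. (8)."""
    f_k = np.asarray(f_k, float)
    delta = (f_k == 0).astype(float)
    z = a + b * np.cbrt(f_k) + c * delta
    pop = np.exp(z) / (1.0 + np.exp(z))   # probability of any rain
    mu = d + e * np.cbrt(f_k)             # Eq. (7); assumed positive here
    shape = (mu / sigma) ** 2             # gamma with mean mu, std dev sigma
    scale = sigma ** 2 / mu
    return pop * gamma.sf(tr, shape, scale=scale)

def bma_prob(tr, forecasts, params, weights):
    """Eq. (4): weighted combination of the per-member exceedance probabilities."""
    probs = [w * h_k(tr, f, *p) for f, p, w in zip(forecasts, params, weights)]
    return np.sum(probs, axis=0)

# hypothetical coefficients (a, b, c, d, e, sigma) for two members
params = [(0.5, 1.0, -2.0, 0.3, 0.8, 1.0), (0.4, 1.1, -1.8, 0.2, 0.9, 1.0)]
fc = [np.array([2.0, 0.0, 10.0]), np.array([3.0, 1.0, 8.0])]
print(bma_prob(1.0, fc, params, np.array([0.6, 0.4])))
```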

The value of σ, as well as the weights corresponding to each ensemble member, is computed using the expectation-maximization algorithm described by S2007. Basically, this algorithm seeks the maximum of the following function:

$$C = \sum_{i=1}^{N}\ln\left\{\sum_{k=1}^{K} w_k\left[\delta_{y_i}\bigl(1 - h_k(y_i > 0 \mid f_{ik})\bigr) + (1-\delta_{y_i})\, h_k(y_i > 0 \mid f_{ik})\, g_{\mu_{ik},\sigma}(y_i)\right]\right\}\qquad(9)$$

where g_{µ_ik,σ}(y_i) is the probability density function evaluated at the observed precipitation at the ith location and time (y_i), assuming a gamma distribution with mean µ_ik and standard deviation σ; f_ik is the precipitation forecast for the ith location and time by the kth ensemble member; δ_{y_i} is 1 if y_i is equal to 0 and 0 otherwise; and N is the total number of locations and times in the training sample. The argument of the logarithm is the PDF resulting from the combination of the PDFs associated with each ensemble member, evaluated at the observed precipitation value, so when C is maximum the combination is optimal in the sense that the probability of the observations, given the precipitation forecast by the individual ensemble members within the training sample, is maximum.

To evaluate the impact of the weights within the implementation proposed by S2007, an alternative formulation assuming equal weights for each ensemble member has also been used. This implementation is referred to as GAMMA-ENS. In this case, only the value of σ is computed through the maximization of C, which is performed by simply computing C for a range of values of σ and choosing the value of σ corresponding to the maximum of C.
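The GAMMA-ENS grid search over σ might look as follows (our own sketch; the search range for σ and the array layout of the precomputed inputs are assumptions):

```python
import numpy as np
from scipy.stats import gamma

def C(sigma, y, pop, mu, weights):
    """Eq. (9): log likelihood of the observations under the mixture.
    pop[i, k] is h_k(y_i > 0 | f_ik); mu[i, k] is the gamma mean from Eq. (7)."""
    shape = (mu / sigma) ** 2
    scale = sigma ** 2 / mu
    g = gamma.pdf(y[:, None], shape, scale=scale)   # g_{mu_ik, sigma}(y_i)
    dry = (y == 0)[:, None]
    # dry cases contribute 1 - PoP, wet cases PoP times the gamma density
    mix = np.where(dry, 1.0 - pop, pop * g)
    return np.sum(np.log(np.sum(weights * mix, axis=1)))

def fit_sigma(y, pop, mu, sigmas=np.linspace(0.1, 2.0, 40)):
    """GAMMA-ENS: equal member weights, sigma chosen by grid search over C."""
    K = pop.shape[1]
    w = np.full(K, 1.0 / K)
    scores = [C(s, y, pop, mu, w) for s in sigmas]
    return sigmas[int(np.argmax(scores))]
```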

Equation (8) can also be used to obtain the probability of precipitation above a certain threshold when only one forecast is available (i.e. it can be applied to a single forecast or to the ensemble mean). The application of Equation (8) to the ensemble mean, to obtain a calibrated probabilistic forecast, is referred to as GAMMA. In this case only a, b, c, d and e from Equations (5) and (7) have to be estimated, for the ensemble mean. The same idea can also be applied to a weighted ensemble mean: one possible approach is to use the weights computed by the BMA method to construct a weighted ensemble mean, and then to apply Equation (8) to obtain the calibrated probabilistic forecast derived from it. This particular implementation is referred to as WMEAN. The BMA and WMEAN approaches include information about the individual skill of each ensemble member through weights that reflect the probability of each member being the best one. The GAMMA-ENS approach considers that all ensemble members have the same probability of being the best; however, the systematic errors of each member are partially corrected through Equation (7), where a different relationship between the mean of the gamma distribution and the forecast precipitation is assumed for each member. Finally, the GAMMA approach combines all the ensemble members into a single forecast without taking their individual skill into account.

Figure 3(a) shows the time averaged weights associated with each member of the multimodel ensemble. As can be seen, some ensemble members have a higher probability of being the best one, with members 8 and 10 showing the largest weights. The differences among the weights assigned to the ensemble members are large, even larger than the differences in their skill as suggested, e.g., by their Equitable Threat Score (ETS, Schaefer, 1990). Note, for example, that the mean weight assigned to member 5 is almost 0, so this member is effectively excluded from the ensemble by the application of this methodology. As can be seen in Figure 3(b), the weights associated with each ensemble member show a smooth time evolution (mainly because the training samples for consecutive days are very similar). However, these weights show important changes within the period of study (some members starting with large weights end up with small weights by the end of the experimental period). These temporal changes are larger than the changes in the individual skill of the members computed over the same training period using the ETS (not shown). It should be noted that, if probabilistic forecasts are to be computed for other variables, the weights should be recalculated, since model skill can be very sensitive to the variable under consideration (see, for example, Ruiz et al. (2010), documenting different skill for alternative variables over the same period). For the shifted ensemble (Figure 4), the maximum time averaged weights (computed in the same way as in Figure 3(a)) are associated with the smaller shifts. In this case, the distribution of weights also indicates better skill for the west–east displacements than for the north–south ones. This could be an indication that the displacement errors associated with precipitating systems are larger in the west–east direction.

Figure 3.

(a) Temporally averaged weights for each multimodel ensemble member and (b) time evolution of the weights assigned to each multimodel ensemble member

Figure 4.

Weights associated to each member of the spatially shifted ensemble as a function of the corresponding shift in the south–north (y axis) and the west–east (x axis) directions. Negative shift values indicate southward and westward shifts respectively

Figure 5(a) shows the performance of the proposed calibration alternatives compared with that of GAMMA applied to the multimodel ensemble mean (negative values denote a loss of skill with respect to the GAMMA multimodel in terms of the BSS and its components). It can be seen that, for the multimodel ensemble, the sensitivity of the BSS and its components to the different calibration alternatives is small with respect to the width of the 90% confidence interval. The confidence interval has been computed by applying a bootstrap technique. The use of weights introduces an improvement in the BSS (i.e. WMEAN slightly outperforms the GAMMA approach, and BMA outperforms the GAMMA-ENS approach, particularly for the lower thresholds); however, in both cases the impact is smaller than the width of the confidence interval. It can also be seen that the combination of different PDFs to compute the probability (using either equal or different weights for each ensemble member) degrades the skill in terms of reliability for thresholds below 40 mm (Figures 5(b) and (e)). There is also a negative impact on the resolution component (Figure 5(c)) that affects almost the entire range of thresholds considered in this paper. Similar results were obtained by S2007 for this range of precipitation thresholds, although the reason for this behaviour is not clear. In the case of the spatially shifted ensemble (Figures 5(d), (e) and (f)), it is not clear whether the use of weights leads to a skill improvement in terms of the BSS, since GAMMA outperforms WMEAN in this case, while BMA outperforms GAMMA-ENS. As in the previous case, the combination of PDFs to compute the final conditional probabilities reduces the skill in both the reliability and resolution components.

Figure 5.

Differences of the BSS (a and d) and its reliability (b and e) and resolution (c and f) components with respect to the GAMMA experiment, as a function of the precipitation threshold (mm). Each line corresponds to distinct calibrations: GAMMA (black solid line), WMEAN (grey solid line), GAMMA-ENS (black dashed line) and BMA (grey dashed line). The 90% confidence interval for the BSS corresponding to the GAMMA experiment is indicated by grey circles. (a), (b) and (c) correspond to the multimodel ensemble, and (d), (e) and (f) correspond to the spatially shifted ensemble. All scores correspond to 24 h forecasts

Similar results can be obtained by computing the CRPS, which can be interpreted as the integral of the BR over all possible thresholds (Gneiting et al., 2005). Note that for this index lower values indicate better performance. CRPS values for the experiments presented in Figure 5 are shown in Table II.

Table II. 24 h CRPS values computed for alternate calibration approaches, applied to the multimodel ensemble and to the spatially shifted ensemble
         Multimodel   Spatially shifted

Some other alternatives have been tested for computing the weights used to obtain the weighted ensemble mean. One possible approach is to use weights proportional to the individual skill of each ensemble member, measured using the ETS. This approach produces better results for both the spatially shifted and the multimodel ensembles (not shown); however, these results are not significantly better than the ones obtained with the GAMMA approach (i.e. computing the probability directly from the standard ensemble mean). One difference between the weights computed using the ETS and those computed by the BMA approach is that the latter shows larger differences among the weights assigned to the ensemble members. This suggests that the expectation-maximization algorithm used to compute the weights in the BMA approach might be putting too much weight on certain ensemble members at the expense of others, thus losing valuable information provided by members that are only slightly worse.
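For reference, the ETS for a given threshold can be computed from the 2 × 2 contingency table as in this sketch (standard definition after Schaefer (1990); the function name is ours):

```python
import numpy as np

def ets(forecast, observed, threshold):
    """Equitable Threat Score for precipitation above `threshold`.
    ETS = (hits - hits_random) / (hits + misses + false_alarms - hits_random),
    where hits_random is the number of hits expected by chance."""
    f = np.asarray(forecast) >= threshold
    o = np.asarray(observed) >= threshold
    hits = np.sum(f & o)
    false_alarms = np.sum(f & ~o)
    misses = np.sum(~f & o)
    total = f.size
    hits_random = (hits + misses) * (hits + false_alarms) / total
    denom = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denom if denom != 0 else 0.0
```

ETS-proportional weights would then be w_k = ETS_k / Σ_j ETS_j over the ensemble members.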

3. Evaluation of PQPF skill

Results from the previous section support the choice of the GAMMA calibration strategy in terms of performance and ease of implementation. Accordingly, subsequent results correspond to this calibration strategy. Figure 6 shows the BSS differences with respect to the breeding ensemble for the multimodel and spatially shifted ensembles (i.e. positive values denote a skill increase with respect to breeding). The reliability component is very similar in all cases due to the effect of calibration (not shown), so it is not discussed further. For the 24 h lead time (Figure 6(a)) the shifted ensemble shows the best results, while the multimodel ensemble performs better than the breeding one.

Figure 6.

BSS difference with respect to the breeding ensemble (grey solid line), for the multimodel (black solid line), the spatially shifted (grey thick dashed line) and the combined ensembles (black dashed lines). Grey circles indicate the 90% confidence interval corresponding to the BSS of the breeding ensemble. The thin grey dashed lines indicate the BSS of the individual members of the multimodel ensemble. Part (a) corresponds to 24 h forecasts and (b) to 48 h forecasts

The multimodel and breeding ensembles can also be combined into a larger, 21 member ensemble (the control run is the same for both), as suggested in Hou et al. (2001). This combination, denoted COMBINED, outperforms the multimodel ensemble, particularly for thresholds above 5 mm (Figure 6(a)).

The improvement achieved by the shifted ensemble is surprising, because this is essentially a cost free ensemble in which the differences among members are not flow dependent and are not related to errors in the model formulation, as they are in the other ensembles. This behaviour suggests that a simple spatial smoothing can effectively remove unpredictable components of the precipitation field at short lead times, at least at this horizontal resolution. Similar results have recently been reported by Schaffer et al. (2011), who obtained higher PQPF skill at the 24 h lead time with a shifted ensemble than with a mixed physical parameterization and perturbed initial condition ensemble. However, at longer lead times the results are somewhat different, as shown in Figure 6(b) for the 48 h lead time. In this case, the skill of the shifted ensemble is significantly worse than that of the multimodel or combined ensembles, although the shifted ensemble is still slightly better than the breeding one. This suggests that, as the atmosphere evolves and the differences among the members of the multimodel ensemble grow, they better represent the forecast uncertainty. The simpler smoothing approach can only take into account position errors; it cannot represent the uncertainty associated with the existence of a precipitating system or its intensity. This type of uncertainty can only be addressed with a dynamic ensemble. Similar results, showing fast degradation of the PQPF derived from a single dynamical model run, were also obtained by Ruiz et al. (2009).

Figures 6(a) and (b) also include the BSS computed for the PQPF derived individually from each of the multimodel ensemble members, using the GAMMA algorithm to obtain a PQPF from each individual model. As expected, each of these forecasts performs significantly worse than any of the ensemble systems considered, underscoring the value of using ensembles regardless of the ensemble generation strategy.
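For reference, the BSS used throughout these comparisons can be sketched as follows. This is a minimal illustration with the standard definitions; the function names are ours, and the reference forecast is taken to be the sample climatology:

```python
import numpy as np

def brier_score(prob, obs):
    """Mean squared error of probability forecasts against binary outcomes."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return np.mean((prob - obs) ** 2)

def brier_skill_score(prob, obs):
    """BSS relative to the sample climatology (positive values indicate skill)."""
    obs = np.asarray(obs, dtype=float)
    clim = np.mean(obs)  # climatological frequency of the event
    bs_ref = brier_score(np.full_like(obs, clim), obs)
    return 1.0 - brier_score(prob, obs) / bs_ref
```

A perfect probabilistic forecast yields BSS = 1, while forecasting the climatological frequency at every point yields BSS = 0, which is why differences with respect to a reference ensemble (as in Figure 6) are a convenient way to display the scores.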

The spatially shifted approach can also be combined with the multimodel, breeding and combined ensembles in order to blend the best attributes of these approaches. To do this, shifted ensembles have been derived from each member of the multimodel, breeding and combined ensembles, using a maximum shift of five grid points. The resulting ensembles are referred to as the shifted multimodel, shifted breeding (both with 1331 members) and shifted combined (with 2541 members) ensembles. Results for the 24 and 48 h lead times are shown in Figure 7 as BSS differences with respect to the breeding ensemble. The figure also includes the BSS of the PQPF obtained by applying the spatial shift approach to each individual member of the multimodel ensemble. At the 24 h lead time (Figure 7(a)), the shifted multimodel and shifted combined ensembles outperform the shifted breeding ensemble at low thresholds; however, the opposite occurs for thresholds above 20 mm. As discussed before, changes in the BSS are due to differences in the resolution component of the score (not shown). It is surprising that the shifted breeding ensemble outperforms the shifted multimodel, given that the breeding ensemble alone performs worse than the multimodel ensemble alone (Figure 6(a)). Moreover, the skill of the shifted breeding ensemble is close to (slightly better than) that of the shifted ensemble based on the breeding control run alone (i.e. when the spatial shift technique is applied to that single member). This behaviour shows that, although the multimodel ensemble can outperform its individual members (Figure 6(a)), the shifted multimodel cannot outperform the shifted ensembles based on those members that show the best skill.
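The construction of the shifted ensembles can be illustrated with a minimal sketch, assuming forecast fields on a regular grid. Note that `np.roll` wraps fields around the domain edges; the actual implementation may treat the boundaries differently:

```python
import numpy as np

def shifted_ensemble(members, max_shift=5):
    """Enlarge an ensemble by spatially shifting each member's forecast field.

    members has shape (n_members, ny, nx); every shift (dy, dx) with
    |dy|, |dx| <= max_shift contributes one member, so each base member
    yields (2 * max_shift + 1) ** 2 shifted members.
    """
    shifted = []
    for field in members:
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                shifted.append(np.roll(field, shift=(dy, dx), axis=(0, 1)))
    return np.stack(shifted)
```

With a maximum shift of five grid points, each base member yields 11 × 11 = 121 shifted members, so the 11-member multimodel and breeding ensembles grow to 1331 members and the 21-member combined ensemble to 2541, matching the sizes quoted above.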

Figure 7.

BSS and resolution difference between the breeding ensemble (grey solid line) and the shifted multimodel ensemble (black solid line), shifted breeding ensemble (grey thick dashed line) and shifted combined ensemble (black dashed line). Grey circles indicate the 90% confidence interval corresponding to the BSS of the breeding ensemble. The thin grey dashed lines indicate the BSS of the shifted ensemble derived from each individual member of the multimodel ensemble. Part (a) corresponds to 24 h forecasts and (b) to 48 h forecasts

The results for the 48 h forecasts are shown in Figure 7(b). In this case the shifted multimodel shows the best performance, closely followed by the shifted combined ensemble. Although some individual models outperform the shifted multimodel above 40 mm at this forecast length, and over wider threshold ranges at 24 h, it is difficult to know a priori which of the proposed configurations will perform best.

These results suggest that combining the spatial shift approach with the multimodel and/or combined ensembles retains the advantages of both approaches: forecast skill does not decay with lead time as fast as for the spatially shifted ensemble alone, and the shifted multimodel and shifted combined ensembles perform significantly better than the multimodel or combined ensembles alone (compare, for example, the BSS shown in Figure 7 with those in Figure 6).

To provide further elements for interpreting the previous results, probability integral transform (PIT) histograms (Gneiting et al., 2005) for the uncalibrated PQPF of some of these ensemble systems are shown in Figure 8. The PIT is the value of the forecast cumulative distribution function evaluated at the observation, and the PIT histogram is simply the histogram of the PIT values over different times and places. Here the PIT is computed directly from the ensemble members (i.e. the probability of precipitation being below the observed value is the number of ensemble members forecasting below the observed value divided by the ensemble size). The histogram was constructed with bins of width 0.1. The interpretation of the PIT histogram is equivalent to that of the rank histogram (Hamill, 2001); the PIT histogram is preferred here because it allows comparison of ensembles of different sizes. As reported in several studies (S2007; Eckel and Walters, 1998; Hamill and Colucci, 1998, among others), the uncalibrated forecasts are under-dispersive, and in this particular case they also show a wet bias, since small PIT values occur more often than large ones. Nevertheless, the ensembles with better BSS, such as the shifted multimodel and the spatially shifted ensembles, are slightly less under-dispersive. This reduction in under-dispersion is not merely a consequence of differences in ensemble size, since the multimodel ensemble is also less under-dispersive than the breeding ensemble, and both have the same size. Instead, it should be interpreted as an indication that these approaches produce a more realistic representation of the uncertainty associated with precipitation forecasts over the region.
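The PIT computation just described can be sketched as follows (an illustrative implementation; the function names are ours):

```python
import numpy as np

def pit_values(ens_forecasts, observations):
    """Empirical PIT: fraction of ensemble members below the observed value.

    ens_forecasts has shape (n_cases, n_members); observations has
    shape (n_cases,). Returns one PIT value per verification case.
    """
    ens = np.asarray(ens_forecasts, dtype=float)
    obs = np.asarray(observations, dtype=float)[:, None]
    return np.mean(ens < obs, axis=1)

def pit_histogram(pit, bin_width=0.1):
    """Relative frequency of PIT values in bins of the given width (sums to 1)."""
    edges = np.arange(0.0, 1.0 + bin_width, bin_width)
    counts, _ = np.histogram(pit, bins=edges)
    return counts / counts.sum()
```

A flat histogram indicates a well-calibrated ensemble; the U-shape typical of under-dispersion, with an extra peak at low PIT values for a wet bias, is the pattern discussed above.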

Figure 8.

PIT histogram for the uncalibrated PQPF of different ensemble systems corresponding to 24 h forecasts: breeding ensemble (grey solid line), shifted multimodel ensemble (dashed black line), multimodel ensemble (black solid line) and shifted ensemble (grey dashed line)

4. Discussion and conclusions

Different ensemble generation strategies have been compared for the generation of 24 h accumulated PQPF over southeastern South America during the 2002–2003 warm season. CMORPH precipitation estimates have been used in order to work with a more complete precipitation dataset for calibration and verification. Calibration strategies for multimodel ensembles, or more generally for ensembles in which each member may have different skill, were briefly discussed and tested. The results suggest that computing a weighted ensemble mean can lead to moderately better results; however, the best choice of weight computation algorithm remains an open question. The PQPF derived from the unweighted ensemble mean produces results that are almost as good as, if not better than, those of any other approach considered in this work, with the advantage of simpler computation.

The PQPF derived from two dynamically generated ensemble systems was compared with that of a very simple ensemble system (the spatially shifted ensemble) that only accounts for uncertainty in the position of rain areas. This simple ensemble proves quite competitive at short forecast ranges (up to 24 h), as also documented by Schaffer et al. (2011), yet its skill drops rapidly with increasing lead time, as shown in Section 3. This is probably because the simple spatial lag approach cannot capture uncertainty in the existence or intensity of forecast precipitation features, sources of uncertainty that become more important at longer lead times. The success of this simple approach at short lead times suggests that the dynamic ensemble approaches tested here do not adequately represent forecast uncertainty during the first 24 h, and that more specific methods are needed at short forecast ranges, at least for the horizontal resolution adopted in this work.

With respect to the ensemble generation strategy, among the dynamically generated ensembles tested, the multimodel ensemble, based on different configurations available in the WRF model, outperformed the breeding ensemble, which relies exclusively on initial and boundary condition perturbations. Still, the improvement obtained from combining the breeding and multimodel ensembles is not large. This suggests that most of the PQPF limitations during this particular time of the year (i.e. Southern Hemisphere summer) arise from errors in the model physics rather than from problems in the initial and/or boundary conditions. During this season, most of the precipitation predicted by the different configurations comes from the convective parameterization, further emphasizing the importance of appropriate model physics in determining precipitation fields. Along the same lines, the relatively low performance of the breeding ensemble could be due to the length of the rescaling period, which could be adjusted to capture the smaller-scale, faster-growing perturbations associated with mesoscale and convective scale processes.

During summer, important interactions take place between synoptic and convective scales over this region, particularly when mesoscale convective systems are present. For this reason the results described in this paper might not apply to other seasons, when convective precipitation is less frequent.

Among the alternatives evaluated, the most important improvement has been obtained by combining the multimodel ensemble approach with the spatial shift technique, even at the 48 h lead time. This approach is particularly interesting and promising for implementing high resolution ensembles at small operational or research centres where computational costs severely restrict ensemble size: the spatial shift technique enlarges the number of ensemble members at no extra computational cost, increasing PQPF skill.

More research is needed to extend these results to other seasons and to other variables, such as surface variables (temperature, dew point, winds). The spatial shift approach is still applicable in such cases; however, to avoid unrealistic displacement of features tied to terrain elevation and other geographical characteristics, the shift should be applied to the standardized climatological anomalies instead of to the field itself.

Although the results shown in this paper were obtained using CMORPH, similar conclusions can be drawn using a dense raingauge network available over the region during the selected period. Most of the experiments were also carried out over a tropical region in northern Brazil, which exhibited similar behaviour, although PQPF skill there was lower than over the region discussed in this paper.

Besides the BSS, other verification scores have been used to compare the experiments presented in this work. Forecast resolution and value have been measured using relative operating characteristic diagrams and the value score (Wilks, 2006). The relative operating characteristic diagrams support conclusions similar to those obtained through the resolution component of the BSS. In the case of the value score, the experiments showing the best results were the same ones showing the highest accuracy as measured by the BSS. For this reason only the results associated with the BSS have been presented in this work.

Acknowledgements
The authors are thankful to Istvan Szunyogh for providing the scripts to run the global ensemble, to Erick Kostelich for his help with the MRF model runs and to Jae Schemm for providing the initial conditions in the required file format. Fruitful comments and suggestions received from Elizabeth Ebert and one anonymous reviewer are also acknowledged. This study has been supported by the following projects: UBACyT X204, CONICET PIP 112-200801-00399. The research leading to these results has received partial funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No 212492 (CLARIS LPB. A Europe-South America Network for Climate Change Assessment and Impact Studies in La Plata Basin).