Optimizing Precipitation Forecasts for Hydrological Catchments in Ethiopia Using Statistical Bias Correction and Multi‐Modeling

Accurate rainfall forecasts on timescales ranging from a few hours to several weeks are needed for many hydrological applications. This study examines bias, skill and reliability of four ensemble forecast systems (from Canada, UK, Europe, and the United States) and a multi‐model ensemble as applied to Ethiopian catchments. By verifying these forecasts on hydrological catchments, we focus on spatial scales that are relevant to many actual water forecasting applications, such as flood forecasting and reservoir optimization. By most verification metrics tested, the bias corrected European model is the best individual model at predicting daily rainfall variations, while the Canadian model shows the most realistic ensemble spread and thus the most reliable forecast probabilities, including those of extreme events. The skill of the multi‐model ensemble outperforms individual models by most metrics, and is skillful up to 9 days ahead. Skill is higher for the 0–5 day model accumulation than for the first 24 h, suggesting that timing errors strongly penalize the skill of forecasts with shorter accumulation periods. Due to seasonality in the model biases, bias correction is best applied to each month individually. Forecasting extreme rainfall is a challenge for Ethiopia, especially over mountainous regions where positive skill is only reached after bias correction. Compared to individual models, the multi‐model ensemble has a higher probability of detecting extreme rainfall and a lower false alarm rate, with usable skill at 24 h lead times.

roughly the northernmost extent of the ITCZ, the rainfall pattern is monomodal with a peak in the summer. This June to September rainy season is referred to as kiremt.
Ethiopia's complex topography and its location at the boundary between the Central African ITF and the South Asian monsoon system make it a complicated region to study meteorologically. For example, studies have identified between four and six homogeneous sub-regions, each with different seasonal rainfall cycles and/or different modes of interannual variability (e.g., Diro et al., 2008; Gissila et al., 2004; Riddle & Cook, 2008). During the boreal spring, a low-level cross-equatorial jet, the Somali Jet, begins to form, bringing moist air from the southern Indian Ocean onto the southern flank of the Ethiopian highlands (Riddle & Cook, 2008; Riddle & Wilks, 2013). By June, this moisture source is diverted as the Somali Jet extends across the Arabian Sea toward India, becoming the inflow to the Indian Monsoon. During the boreal summer, low-level flows from the Congo Basin to the southwest and the Red Sea to the northeast converge over the Ethiopian highlands, fueling the summer Ethiopian rainy season (Jury, 2011; Jury & Chiao, 2014; Viste & Sorteberg, 2013b). While the Red Sea and Indian Ocean are the largest of these moisture sources (Viste & Sorteberg, 2013b), variability in moisture transport from the west (Congo Basin) has a greater impact on rainfall variability (Nicholson, 2017; Riddle & Wilks, 2013; Viste & Sorteberg, 2013a).
Tropical waves can be a source of predictability in the tropics at timescales ranging from several hours to several weeks. African easterly waves (AEW), caused by disturbances in upper tropospheric jets, can be a source of predictability at timescales of a few days, especially over central and western Africa. A few studies have suggested that these westward traveling waves may have precursors over the Arabian Sea and Ethiopia. For example, Mekonnen and Rossow (2011) and Mekonnen and Thorncroft (2016) show that wave activity in the upper-level tropospheric winds can concentrate disorganized convection over the Arabian Sea into larger mesoscale convective systems that can transition into AEWs over Ethiopia and Sudan (Jury, 2018). At longer timescales up to 30 days, the Madden-Julian Oscillation (MJO) influences rainfall over the arid regions of eastern Ethiopia (Schlueter et al., 2019), as well as coastal Kenya and Tanzania (Pohl & Camberlin, 2006a, 2006b) during the transition seasons. An analysis of the relative influence of all types of equatorial waves on African precipitation, however, suggests that the impact of tropical waves over Ethiopia is limited compared with central and western Africa (Schlueter et al., 2019). Together, these studies highlight the meteorological complexity of the region for rainfall forecasting.

Numerical Weather Prediction for Ethiopia
The global NWP models used in this study are physics-based numerical models that estimate how the atmosphere evolves with time, given certain initial and boundary (surface) conditions. These models are the backbone of most weather forecasts across the globe. However, NWP models face some challenges in the tropics. Recent studies have shown that error growth over the initial 1-2 days of a forecast is much larger in the tropics than in the midlatitudes (e.g., Bechtold, 2019; Žagar, 2017). These errors are due in part to a sparser observing network for model initialization, especially of upper-level winds (Žagar, 2017), which are important for concentrating and propagating convective disturbances. In addition, the models struggle to realistically model the character of tropical convection (e.g., Stephens et al., 2010). They produce too many days with light rain, underpredict the rainfall rates of the highest intensity events (Trenberth et al., 2017), are unable to capture realistic diurnal cycles (e.g., Knievel et al., 2004; Yang & Slingo, 2001), and are often not run at high enough resolution to properly model the interaction of atmospheric flows with complex topography (e.g., Woodhams et al., 2018). Finally, the ensemble systems tend to be overconfident, or under-dispersive, because they fail to properly model the unpredictable, stochastic aspects of rainfall generation (Watson et al., 2017).
Statistical methods can correct some of these model shortcomings. The models used in this study are run in ensemble mode: each model is run multiple times, using different perturbations of the initial conditions to create an ensemble of possible outcomes. This ensemble represents the uncertainty in the forecasts. Averaging together multiple ensemble members serves to cancel out the noise in individual model runs, and enhance the model signal. Averaging the ensembles from several different forecast centers together further reduces noise and tends to improve the forecast, especially if the individual models have opposing errors (Hamill, 2012;Park et al., 2008). Another key aspect of why a multi-modeling approach is often so successful is explained by Hagedorn et al. (2005). Each model has its own strengths and weaknesses.
Which model is "the best" single model often depends on the particular situation. A model that, for example, performs well for predicting summer rainfall might not be the best model for winter rainfall. The multi-model, however, performs consistently well, especially when compared against any single model rather than against the (varying) best model. Even models that seem to perform poorly can enhance the multi-model through the aforementioned opposing errors. Only when a particular model consistently performs the worst does it degrade the skill of the multi-model; conversely, only when one single model consistently performs the best will the multi-model have lower skill. In addition to the multi-modeling approach, known biases in the forecasts can also be corrected using historical records of forecasts and observations.
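The benefit of opposing errors can be illustrated with a toy numeric sketch (entirely synthetic data, not the paper's forecasts): two hypothetical models with opposite systematic biases, whose equally weighted average has a lower RMSE than either member on its own.

```python
import numpy as np

rng = np.random.default_rng(42)
obs = rng.gamma(shape=2.0, scale=3.0, size=1000)  # synthetic daily rainfall (mm)

# Two hypothetical models with opposing systematic errors plus random noise
model_wet = obs * 1.3 + rng.normal(0, 2, size=obs.size)  # overpredicts
model_dry = obs * 0.7 + rng.normal(0, 2, size=obs.size)  # underpredicts

# Equally weighted multi-model mean: the opposing biases largely cancel
multi = 0.5 * (model_wet + model_dry)

rmse = lambda f: np.sqrt(np.mean((f - obs) ** 2))
print(rmse(model_wet), rmse(model_dry), rmse(multi))
```

With this construction the multi-model mean retains only the (partially canceling) noise terms, so its RMSE falls below that of both constituent models.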
Only a few previous studies have focused on assessing NWP model skill over eastern Africa and Ethiopia at timescales ranging from 24 h to several weeks. Vogel et al. (2020) provide a comprehensive analysis of the probabilistic skill of the European and Canadian models for the global tropics. Woodhams et al. (2018) compare the results of the UKMET operational global model with a limited-domain high-resolution convection permitting (CP) model over eastern Africa for forecasts up to 48 h. Those authors find significantly better representation of the diurnal cycle and 3-h rainfall rates in the CP model, but only modest improvements in the skill of 24 h averages over the global model. Mutemi et al. (2007) assess the skill of seven ensemble forecast models over Africa in general, comparing individual model forecasts with a multi-model mean and a simple method for weighting the models and correcting the mean biases in the forecasts. They show that the multi-model ensemble performed better than the individual best model, but that care should be taken in the selection of forecasts to be included in the multi-model: "bad" models degraded the skill of the multi-model. From the NWP models available in the TIGGE archive, we have performed an initial skill analysis and selected four of the best performing state-of-the-art ensemble prediction systems to create a multi-model ensemble.
As in our current study, Kipkogei et al. (2016) use the TIGGE multi-model ensemble to assess skill over the Greater Horn of Africa. They also study four TIGGE models: ECMWF (Europe), NCEP (US), UKMET (UK), and CPTEC (Brazil) and show that the skill, as measured by spatial correlations and the equitable threat score (ETS; also a metric of the spatial pattern of rainfall), is the highest for a weighted bias corrected multi-model average (super-ensemble), followed by the UKMET model, uncorrected for biases. The authors suggest that more work is needed to test other bias correction and multi-modeling methods, and to determine if the skill scores are high enough to be of practical use.
Our study extends this previous work in several ways along the lines of what Kipkogei et al. (2016) suggest. First, we use and test a bias correction approach, called quantile-to-quantile (Q2Q) mapping (Hopson & Webster, 2010), that corrects the full rainfall distribution instead of just the mean rainfall. We also test whether doing a separate bias correction on each month of data improves the results over performing the bias correction on the aggregated year-round data. Second, our choice of temporal and spatial scales is motivated by real-world water management challenges, and we examine skill separately for individual catchments (see Figure 2) as well as averaging the results regionally. Compared to Kipkogei et al. (2016), who validate spatial rainfall patterns daily on 625 km² grid boxes, we examine skill on catchments ranging in size between 4,000 and 200,000 km² (see Table 1), for both 24-h and 5-day averages (the latter being the day 0-5 and day 5-10 average rainfall accumulations). Given the challenges that NWP models have in the tropics, it is useful to test whether averaging to these large temporal and spatial scales improves skill. Third, we extend the skill analysis to include many more skill metrics and visualization tools frequently used in forecast verification. This approach gives a more complete picture of the utility of the forecasts for specific catchments, as well as insight into reasons why different models and methods succeed or fail.
STELLINGWERF ET AL.

Forecast Data
The gridded forecast data of four ensemble prediction systems (EPSs) are used in this study. Descriptions of the models are given in Table 2. Model data produced from 2011 to 2016 are used in this study. Mladek (2019) gives a full description of the NWP models in the TIGGE archive.

Multi-Model Ensemble
The multi-model ensemble that we use here is a collection of equally weighted ensemble members of all available models mentioned in Table 2. UKMET's forecasts go out to seven days; therefore, the multi-model consists of 101 members for lead times up to 7 days, and 90 members for longer lead times. The forecasts are all from 00Z initialization cycles. Any cycle without complete fields out to the final lead time of the respective model is dropped. The TIGGE archive provides the forecast fields for all models at 0.5° × 0.5°. Given this, it is important to note that effectively only a couple of forecast grid cells represent the smallest catchment, while close to one hundred represent the largest. This has implications for the expected basin-averaged rainfall forecast skill of each catchment: all else being equal, forecast displacement errors diminish as the basin spatial scale increases.
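A minimal sketch of how such a pooled ensemble can be assembled; the member counts and maximum lead times below are hypothetical placeholders (the real values are in Table 2), with one short-range model dropping out beyond day 7 as UKMET does.

```python
import numpy as np

# Hypothetical member counts and lead-time limits (illustrative only;
# the actual model configurations are described in Table 2).
n_members = {"model_a": 50, "model_b": 20, "model_c": 20, "short_range": 11}
max_lead_days = {"model_a": 10, "model_b": 10, "model_c": 10, "short_range": 7}

rng = np.random.default_rng(0)

def pooled_ensemble(lead_day):
    """Concatenate all members valid at this lead time into one equally weighted pool."""
    pool = []
    for name, n in n_members.items():
        if lead_day <= max_lead_days[name]:
            # stand-in for that model's member forecasts for one catchment/day
            pool.append(rng.gamma(2.0, 3.0, size=n))
    return np.concatenate(pool)

print(pooled_ensemble(5).size)  # 101 members while the short-range model is available
print(pooled_ensemble(9).size)  # 90 members beyond its final lead time
```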

Observational Data
The observational data set is a merged satellite product built from two near real-time satellite products that use measurements of microwave and infrared radiation in their construction: GSMaP (Global Satellite Mapping of Precipitation; Aonashi et al., 2009) and NASA's TRMM (Tropical Rainfall Measuring Mission; Huffman et al., 2007). The GSMaP product is released hourly with a horizontal spacing of 0.1° by 0.1°, while NASA's TRMM product is released every three hours with a horizontal spacing of 0.25° by 0.25°. Satellite rainfall estimates from a number of centers have been shown to have significant random and systematic errors in our study region (Hirpa et al., 2010), and the lack of country-wide rain gauge and/or radar data precludes applying a geographically extensive corrective error algorithm to each product separately. We have therefore merged the two products to (presumptively) reduce the random and systematic errors of relying on a single product.
To merge the two products, we first brought them to a common spatial and temporal grid, bilinearly interpolating the NASA product spatially to the higher resolution GSMaP 0.1° by 0.1° grid, and accumulating the GSMaP data temporally up to the NASA 3-h time step. Then, where both products are available, they are averaged with equal weighting, grid-by-grid and time-step-by-time-step; where only one product is available, that single product is used. Finally, the 0.1° by 0.1° merged satellite product is spatially averaged over the Ethiopian catchments (using full grid point weights for grids falling fully within a catchment, and fractional grid point weights along the catchment boundaries).
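The equal-weight merge step might be sketched as follows, assuming both products have already been brought to a common 0.1°, 3-hourly grid and use NaN where a product is missing; `merge_products` is an illustrative helper, not code from the study.

```python
import numpy as np

def merge_products(gsmap, trmm):
    """Equal-weight average where both products exist; fall back to whichever is available."""
    stacked = np.stack([gsmap, trmm])
    # nanmean averages where both values exist and keeps the single
    # available value where one product is missing (NaN)
    return np.nanmean(stacked, axis=0)

# Toy 2x2 grids: NaN marks a missing product value
gsmap = np.array([[1.0, np.nan], [2.0, 4.0]])
trmm = np.array([[3.0, 5.0], [np.nan, 6.0]])
print(merge_products(gsmap, trmm))
# [[2. 5.]
#  [2. 5.]]
```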

Bias Correction Using Quantile-to-Quantile Mapping
Bias correction by so-called Q2Q mapping (Hopson, 2005; Hopson & Webster, 2010) is a relatively straightforward technique to remove systematic biases from ensemble forecasts by requiring that the forecasts be replaced with samples from the cumulative distribution function (CDF) of the associated observational historical record. In short, the forecast percentile (as determined relative to the model historical CDF) is replaced with the observed value at the equivalent percentile (as determined relative to the observational historical CDF). This is visually explained in Figure 3. In this study, Q2Q bias correction is done for catchment-integrated predictions. For ensemble forecasts, every ensemble member is adjusted separately to preserve the ordering by lead time (i.e., temporal covariance) of each ensemble.
The first step in bias correcting a catchment's rainfall forecast from an ensemble forecast system is to select data to represent a well-sampled model hindcast CDF, one for each lead time. Similarly, observational data is selected to represent a well-sampled observational CDF (but with no lead time dependence). These CDFs are then stored and used in the following two-step process: for each lead time of a particular ensemble member, the forecast value's percentile is determined from the model CDF; then the equivalent percentile of the observed CDF is extracted and used in place of the forecast. This is done for each lead time for a given ensemble member. See Appendix A for more details.
We apply the Q2Q bias correction separately to each model, each catchment, each accumulation period, and each lead time (while maintaining the lead time ordering of each ensemble member). To test the importance of seasonal variations in the biases, we compare the results of a Q2Q bias correction performed on the aggregated data from the whole year with those of a Q2Q bias correction performed on individual months separately. From the whole set of ensemble members, we randomly select three members each forecast day to create the forecast climatology. This reduces computation time and limits the reuse of the same forecast data for both Q2Q bias correction and validation, while retaining a sufficiently large sample size. As this data set consists of 7 years of data, leaving out days with missing data, the sample size that makes up the model climatologies is still over 5,000 (the adequacy of the reduced 3-member climatologies was also confirmed by comparing subsets of data, one with three members and the other with the full ensemble set; no appreciable difference in results was found).
Because this Q2Q bias correction is applied to precipitation forecasts, there will be many occasions where the predicted precipitation is zero. Assume some model predicts zero precipitation 20% of the time; the lowest 20% of its CDF then consists entirely of zero precipitation. In this case we randomly pick a percentile of the model CDF from within this 20%. If the cumulative chance of zero precipitation is larger in the observed climatology, say 30%, the "adjusted" forecast will always be zero precipitation as well. If, on the other hand, the cumulative chance of zero precipitation is smaller in the observed climatology, some of these random draws will correct the zero-precipitation forecast to produce some (typically light) precipitation. Using the same example, where the modeled chance of zero precipitation is 20% but the observed chance is only 10%, the model will be corrected to producing (some) precipitation on average half of the time it predicted zero precipitation.

Figure 3. A visualization of bias correction using quantile-to-quantile (Q2Q) mapping. The cumulative distribution functions (CDFs) of both the model and observed climatologies are shown at top. First, the percentile of the forecast rainfall (P_fcst) is computed (here the 75th). Using this percentile, the forecast rainfall is adjusted (P_adj) to the matching percentile of the observed rainfall climatology. The lower two plots ("ranked forecasts" and "ranked observations") show how to apply this technique in practice. First, the rank of the forecast value to be adjusted (orange hash mark on the "ranked forecasts" line) is determined relative to all past hindcasts of the model (blue hash marks). The observed value having the same rank (orange hash mark on the "ranked observations" line) in the observational climatology is then determined and used in place of the forecast value.
On the right side of the distribution there are also some considerations to be made. The observational data set in this study consists of only 7 years of data, meaning that very rare rainfall events that happened more than 7 years ago are not captured. Also, the model climatology is three times as large (three random ensemble members per day) as the observational climatology (one value per day). Because of this, the highest (fractional) percentile present in the forecast climatology is higher than the highest (fractional) percentile in the observed climatology. Consider, as an example, an observational data set with a sample size of 100 and a forecast data set with a sample size of 300: the sample with the highest observed rainfall corresponds to a percentile of (100 − 1)/100 = 0.990, while the highest forecast rainfall corresponds to a percentile of (300 − 1)/300 = 0.997. When the percentile of an uncorrected ensemble member exceeds the highest percentile of the observed climatology, it is constrained to the maximum observed rainfall. More sophisticated methods, such as fitting a generalized extreme value (GEV) distribution (Katz et al., 2002), could be applied but are beyond the scope of this research.
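A sketch of the Q2Q correction for a single forecast value, including the random draw for zero-precipitation forecasts and the upper-tail cap described above; the `q2q_correct` helper and the example climatologies (20% model zeros, 30% observed zeros) are illustrative assumptions, not the study's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def q2q_correct(fcst, model_clim, obs_clim):
    """Quantile-to-quantile mapping of one forecast value (sketch).

    The forecast's percentile within the model climatology is replaced by the
    observed value at the same percentile. Zero-rain forecasts draw a random
    percentile from the model's zero-rain probability mass, and percentiles
    beyond the observed record are capped at the maximum observed value.
    """
    model_clim = np.sort(model_clim)
    obs_clim = np.sort(obs_clim)

    if fcst == 0.0:
        # randomly place the forecast within the model's zero-precipitation mass
        p = rng.uniform(0.0, np.mean(model_clim == 0.0))
    else:
        p = np.searchsorted(model_clim, fcst, side="right") / model_clim.size

    # cap at the highest percentile present in the observed climatology
    p = min(p, (obs_clim.size - 1) / obs_clim.size)
    return float(np.quantile(obs_clim, p))

# Illustrative climatologies: model is drier at zero (20%) than observed (30%)
model_clim = np.concatenate([np.zeros(20), np.linspace(1.0, 10.0, 80)])
obs_clim = np.concatenate([np.zeros(30), np.linspace(0.5, 8.0, 70)])

print(q2q_correct(0.0, model_clim, obs_clim))   # stays zero: observed dry fraction is larger
print(q2q_correct(10.0, model_clim, obs_clim))  # capped near the observed maximum
```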

Statistical Validation Techniques
For statistically validating the original and bias corrected forecasts against data from satellite observations, we make use of several verification metrics. These include summary measures, difference and correlation metrics, so-called skill scores, and accuracy measures for binary forecasts. An overview of the metrics used in this study is given in Table 3. A theoretical background of these metrics is given in Appendix A.
We assess the performance of the ensemble mean precipitation using the deterministic measures of difference, correlation, and skill. The probabilistic accuracy is assessed using the Brier skill score as well as plots illustrating the reliability and resolution. Finally, we evaluate the performance of binary forecasts for high-intensity rainfall.
To assess the uncertainty and related statistical significance of differences between models and accuracy changes after applying bias correction, we apply a technique called bootstrapping (Efron, 1979). In short, by randomly sampling the forecast data many times, different outcomes of the accuracy measures are simulated. Suppose we have 100 days of rainfall predictions and observations; we pick 100 random days to create a sample. This is done with replacement, so that a given day may be picked more than once, or not at all. We repeat this procedure many times, in our study 1,000 times, to simulate 1,000 different possible values for each metric. The distribution of these 1,000 values can then be used to infer the uncertainty around the mean value of the metric.
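A minimal sketch of this bootstrap procedure on synthetic data; the `bootstrap_metric` helper and the sample sizes are illustrative, not the study's implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

def bootstrap_metric(obs, fcst, metric, n_boot=1000):
    """Resample forecast/observation days with replacement to estimate metric uncertainty."""
    n = obs.size
    scores = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample n days, with replacement
        scores[i] = metric(obs[idx], fcst[idx])
    return scores

# Example: 90% interval for the RMSE of a noisy synthetic forecast
obs = rng.gamma(2.0, 3.0, size=100)
fcst = obs + rng.normal(0, 1.5, size=100)
rmse = lambda o, f: np.sqrt(np.mean((f - o) ** 2))

scores = bootstrap_metric(obs, fcst, rmse)
lo, hi = np.percentile(scores, [5, 95])
print(lo, hi)
```

The spread of the resampled scores gives the uncertainty band around the metric; two models whose bands do not overlap differ significantly at the chosen level.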
Which model performs "best" or "worst" can depend on the metric chosen (Jolliffe & Stephenson, 2012). The broad range of verification metrics used in this study allows for a thorough evaluation of different aspects of model performance. Special focus is given to how model performance varies with the changing of seasons, how model performance changes with lead time and temporal integration time (accumulation time), and finally how accurately the models predict heavy rainfall events. Whereas accurate rainfall predictions with longer lead times are more important for water management purposes, accurate rainfall predictions at shorter lead times are essential for flood forecasting.

Results and Discussion
This section demonstrates the need for bias correction (Section 4.1). In Sections 4.2 to 4.4 we assess the dependency of skill on the time of year, the temporal characteristics of the forecast, and finally the total rainfall amount.

Figure 4 shows the climatology of both the observed and forecast precipitation for all 12 Ethiopian catchments. The seasonality of the rainfall varies by region. Catchments 9, 10, and 13, in the southeastern region, show a strongly bimodal rainfall pattern, with rainfall in the MAM and ON months. Catchments 4, 6, and 7, mostly located in the Rift Valley in the northeast, show a different bimodal rainfall pattern, with the second rainfall peak occurring in summer. Catchments 1, 3, and 5 (the northwestern catchments), on the other hand, show a clear summer rainfall peak, with most rainfall in July and August. Catchments 8, 11, and 12, the southwestern catchments, have a dominant spring (belg) rainfall season with a precipitation maximum usually around April/May. In general, the models tend to systematically overpredict the amount of precipitation during the rainy seasons. Regardless of the bias, skill can still be present as long as the patterns match.

Figure 5 shows scatterplots of predicted versus observed rainfall. While the ensemble mean generally produces too much rainfall when observed values are low, extreme rainfall events tend to be underpredicted by the ensemble mean. Because we are averaging over multiple models, the multi-model ensemble mean will also tend to be less extreme than the ensemble means of individual models. As opposed to ensemble means, the individual ensembles attempt to simulate actual realizations (and the variability) of the atmosphere, and are able to produce a more realistic distribution of daily rainfall rates (Figure 6). UKMET is the only model that underestimates the frequency of heavy rainfall events. All raw model predictions tend to overestimate the average rainfall.
Q2Q bias correction successfully removes the bias, significantly decreasing the mean absolute error and the RMSE. The R² and the index of agreement (d) increase slightly, because these metrics are almost insensitive to bias. The AC and SS_RMSE both clearly increase after Q2Q bias correction. We have assessed the uncertainty of performance changes for all metrics by applying bootstrapping. The scores of all metrics and all models change significantly (p < 0.05) after Q2Q bias correction, except for the SS_RMSE for UKMET and the R² for ECMWF, UKMET, and NCEP. This is denoted with a single asterisk in Table 3. Figure 7 focuses on the uncertainty of the increase in the SS_RMSE, as this metric evaluates overall model performance. As systematic biases change over the seasons, monthly Q2Q bias correction is consistently the most effective in lowering the error as well as increasing correlation. In the remainder of this section, bias corrected results are obtained from monthly bias corrected forecasts.

Validation Statistics of the Ensemble Means
Overall, when averaging over all Ethiopian catchments, the multi-model EPS performs best. This difference in performance of the multi-model EPS is significant (p < 0.05) for all models and all metrics, except for the bias (all models have almost zero bias) and the d and AC of ECMWF. This is denoted by the double asterisks in Table 3. Figure 8 shows that the multi-model EPS significantly outperforms the other models; it compares the SS_RMSE of the NWP models after applying monthly Q2Q bias correction, including its uncertainty. It is also worthwhile to assess the spatial variability of the best scoring models. Figure 9 shows the number of catchments for which each model's ensemble mean scores best based on error and correlation metrics.
Before Q2Q bias correction, the multi-model is most frequently the best when looking at the RMSE, R², d, and AC. Judging a model based on its bias or MAE, there is no clear "winner." After Q2Q bias correction the multi-model is clearly the best choice for most catchments. The bias is almost negligible after correction (Table 4) and therefore the distribution of best models is more or less random.
The summary statistics and the error and correlation metrics give a good indication of overall model performance when the ensemble mean is used as the forecast. Rank histograms give more information about how reliable the whole set of ensemble members is. Rank histograms bin observations according to their values relative to the sorted values predicted by the ensemble members. An under-dispersive EPS will have a U-shaped histogram, with observed values often falling outside the rainfall range predicted by the ensemble members; conversely, an over-dispersive EPS will have a large population in the middle ranks. EPSs with a dry bias will have a large population in the highest rank, and vice versa. Figure 10 shows the rank histograms. Models tend to have a wet bias before Q2Q bias correction, which turns into a (smaller) dry bias after the correction. The EPSs are generally under-dispersive both before and after Q2Q bias correction, but less so after.
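The construction of a rank histogram can be sketched as follows; the synthetic ensemble is deliberately under-dispersive (member spread smaller than the spread of the "truth"), which produces the U-shape described above. The data and helper are illustrative only.

```python
import numpy as np

def rank_histogram(obs, ens):
    """Count where each observation falls among its sorted ensemble members.

    obs: (n_days,)            observed catchment rainfall
    ens: (n_days, n_members)  ensemble forecasts for the same days
    Returns counts for the n_members + 1 rank bins.
    """
    n_members = ens.shape[1]
    # rank = number of members falling below the observation
    ranks = np.sum(ens < obs[:, None], axis=1)
    return np.bincount(ranks, minlength=n_members + 1)

rng = np.random.default_rng(3)
ens = rng.normal(0, 1, size=(5000, 9))   # under-dispersive ensemble
obs = rng.normal(0, 2, size=5000)        # truth has twice the spread
hist = rank_histogram(obs, ens)
print(hist)  # outer bins populated far more than the middle: a U-shape
```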
In the remainder of this section we explore in more detail the effect of Q2Q bias correction on the variability of forecast skill.

Seasonal Variability of Skill
The forecast skill, expressed through the anomaly correlation and the skill score of the RMSE, is temporally variable (Figure 11). The dry season is characterized by the lowest skill scores before applying Q2Q bias correction, influenced by the systematic overestimation of low-intensity rainfall. After removing the systematic biases, the skill score improves significantly. Skill is highest during the spring (MAM) and autumn (ON) rainy seasons. The predictability of the summer rainy season is lowest.
During the rainy seasons, the multi-model ensemble has the highest skill, closely followed by ECMWF. After Q2Q bias correction, NCEP performs the worst. There is considerable variability in skill from catchment to catchment (triangles in Figure 11). This is likely due to regional differences in topography, as well as (related or unrelated) differences in rainfall drivers. These results highlight the complexity of the region both spatially and temporally, with skill scores varying both by catchment and from one season to the next. Further work is needed to better understand the relationship between spatial and temporal variations in these skill scores and the underlying mechanisms that cause daily rainfall variability across Ethiopia.

Skill Versus Lead Time and Integration Time
Skill decreases with lead time, as is apparent from the decrease of both the anomaly correlation and the SS_RMSE (Figure 12).
Table 4
Validation Statistics Over the Ethiopian Catchments
Note. For each model, the uncorrected forecasts (Raw), forecasts using quantile-to-quantile (Q2Q) bias correction on all model data at once (All), and forecasts using Q2Q bias correction conditioned on the month (Mon) are evaluated. All forecast metrics are based on the ensemble means of +24 h forecasts. The sample size is 1,729 daily values for all models; the average observed rainfall is 1.96 mm/day. The sequential color scale indicates the score per metric for bias corrected models, with the best scores in green, mediocre scores in yellow, and the worst in red. The asterisks indicate that: *the difference between the score with monthly bias correction and without bias correction is NOT significant (p > 0.05); **the difference between the score of the monthly bias corrected multi-model and the other bias corrected models is NOT significant (p > 0.05).

The anomaly correlations for predicting 5-day accumulations, both the day 0-5 and the day 5-10 accumulation averages (shown as crosses in Figure 12), are larger than those for predicting 24-h accumulations. Note that the average lead times for the day 0-5 (day 5-10) accumulations correspond to two (seven) days, because the accumulation starts on day zero (five). One explanation for this higher skill is that increasing the accumulation time smooths the rainfall peaks: it is more likely for any individual day to have extreme (and harder to predict) rainfall than for five consecutive days to be extremely wet. Another explanation is that temporal phase errors are relatively less important for longer accumulation periods. Suppose that a tropical wave and its associated precipitation crosses a catchment on day 1, with days 2-5 being dry, and suppose a model perfectly forecasts the amount of rainfall but is one day too late with the arrival of the tropical wave.
The model produces rainfall only on day 2, when it is observed to be dry. This results in zero skill for day 1 and day 2, but perfect skill for a 5-day forecast (assuming the model also perfectly forecast no rain for days 3-5). This temporal phase error is directly related to a spatial phase error. Suppose we have five catchments along the east-west direction, and a hypothetical disturbance arrives from the east, so that it starts to rain on day 1 in the most easterly catchment. On day 2 the system has moved to the adjacent catchment, and so on, until it rains in the most westerly catchment on day 5. The model is again consistently one day late, but is otherwise a perfect forecast. For five consecutive days the skill of the 24-h rainfall accumulations is zero in all catchments, yet for the 5-day accumulation the skill is perfect (except for the last catchment, where rain is forecast on day 6, outside the 5-day accumulation period).
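This thought experiment can be reproduced numerically; the numbers below are purely illustrative.

```python
import numpy as np

# One wet day crossing a catchment; the forecast is identical but one day late.
obs = np.array([10.0, 0.0, 0.0, 0.0, 0.0])   # rain observed on day 1
fcst = np.array([0.0, 10.0, 0.0, 0.0, 0.0])  # same rain forecast, one day late

daily_mae = np.mean(np.abs(fcst - obs))    # large error at daily resolution
total_error = abs(fcst.sum() - obs.sum())  # zero error for the 5-day accumulation
print(daily_mae, total_error)  # 4.0 0.0
```

The one-day phase error destroys the daily scores while the 5-day accumulation remains a perfect forecast, which is exactly why the longer accumulation period scores higher.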
STELLINGWERF ET AL. 10.1029/2019EA000933

In general, then, there is a tendency for skill to be higher for longer temporal averaging periods, since a longer averaging window smooths over errors in the phase of tropical wave propagation and cancels out errors in stochastic variability at hard-to-predict temporal and spatial scales. In contrast, skill generally decreases with lead time as forecast model errors continue to accumulate. These two effects work in opposition when comparing 0-24 h and 0-5 day forecasts (or a 120-144 h and a 5-10 day forecast). Which of the two is higher in skill will depend on many factors, including the characteristic timescales of rainfall-producing mechanisms, the inherent difficulty of predicting rainfall in the region (the so-called "predictability limit"), and the quality and resolution of the model initialization, among others. Our research suggests that model errors (due to phasing, or other factors) strongly penalize the skill of forecasts with shorter accumulation periods, so that the advantage of being near the model initialization is overshadowed by the advantage of a longer averaging period.

Figure 9. Bar charts that show the number of catchments for which each model's +24 h ensemble mean forecast has the best score, per verification metric. A model is depicted as "the best" when it has the lowest bias (in an absolute sense), the lowest MAE, the lowest RMSE, or the highest R 2 , d, or AC.

Figure 10. Rank histograms for +24 h ensemble forecasts before (red) and after (blue) applying Q2Q bias correction. A rank histogram is created by sorting ensemble members from low to high precipitation amounts and then binning observations according to their magnitude relative to the individual ensemble members. The blue, horizontal line represents the shape a perfectly reliable ensemble prediction system would have.
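The rank histogram construction described in the Figure 10 caption can be sketched in a few lines; this is an illustrative routine (the array names are hypothetical, and ties with ensemble members are ignored for simplicity):

```python
import numpy as np

def rank_histogram(ensemble, obs):
    """Count, for each forecast time, the rank of the observation among the
    sorted ensemble members; a flat histogram indicates a reliable ensemble.
    ensemble: array of shape (n_times, n_members); obs: shape (n_times,)."""
    n_members = ensemble.shape[1]
    # Rank = number of ensemble members below the observation (0..n_members)
    ranks = np.sum(ensemble < obs[:, None], axis=1)
    return np.bincount(ranks, minlength=n_members + 1)
```

For example, with a single 3-member forecast of [1, 2, 3] mm and an observation of 2.5 mm, the observation ranks above two members, so the count lands in the third of four bins.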
The SS RMSE , by contrast, is only higher for 5-day accumulations after Q2Q bias correction. This is likely because biases accumulate with increasing integration periods, worsening the SS RMSE of uncorrected forecasts.
The multi-model ensemble is the most skillful, though the difference with ECMWF is small and diminishes with lead time. We can also see that skill based on anomaly correlation is not completely analogous to skill based on the SS RMSE . For example, CMC, together with NCEP, performs worst considering anomaly correlation (after Q2Q bias correction). However, considering the SS RMSE , CMC is the third best. This apparent inconsistency can be explained as follows: Murphy and Epstein (1988) showed that the skill score of the mean squared error can be decomposed as

SS_MSE = r_PO^2 − [r_PO − (s_P / s_O)]^2 − [(P̄ − Ō) / s_O]^2,

where r_PO is the anomaly correlation between predictions P and observations O, and s_P and s_O are their standard deviations. The three terms on the right-hand side are the (squared) anomaly correlation, the conditional bias, and the unconditional bias, respectively. When the bias terms are negligible, the skill score and the square (!) of the anomaly correlation are equivalent. Thus, the square of the anomaly correlation can be seen as a measure of potential skill, whereas the skill score shows the actual skill (Murphy & Epstein, 1988; Wilks, 2011). This means that even though the anomaly correlations of CMC and NCEP are similar, NCEP has more conditional bias (as the unconditional biases are practically zero for both models, see Table 4).

Figure 11. The seasonal variability of AC (upper) and skill score of the root mean squared error SS RMSE (lower) before (left) and after (right) Q2Q bias correction. The metrics are calculated using the ensemble means of +24 h forecasts and subsequently averaged over all Ethiopian catchments for the whole temporal range of the data set. In addition to the skill metrics, the figures also show the monthly average relative frequency of rainy days. The whiskers represent spatial variability and extend to the 10th and 90th percentiles. The lines are a spatial average of the skill metrics over all catchments, while the triangles identify, per month, the highest and lowest values associated with all of the catchments.
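The Murphy and Epstein (1988) decomposition can be checked numerically. The sketch below uses synthetic data and the observed mean as the climatological reference (with population standard deviations), under which the identity is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
o = rng.gamma(0.8, 3.0, size=2000)            # synthetic "observed" rainfall
f = 0.6 * o + rng.gamma(0.8, 2.0, size=2000)  # correlated synthetic "forecast"

# Skill score of the MSE, using the observed mean as the reference forecast
ss = 1.0 - np.mean((f - o) ** 2) / np.mean((o - o.mean()) ** 2)

# Decomposition: squared correlation minus conditional and unconditional bias
r = np.corrcoef(f, o)[0, 1]
ss_decomposed = (r ** 2
                 - (r - f.std() / o.std()) ** 2
                 - ((f.mean() - o.mean()) / o.std()) ** 2)

assert np.isclose(ss, ss_decomposed)
```

With a daily-varying climatology, as used in this study, the decomposition holds only approximately, but the interpretation of the three terms is the same.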
Applying Q2Q bias correction significantly increases skill for all lead times considered. Before Q2Q bias correction the two best models, the multi-model ensemble and ECMWF, were somewhat skillful for lead times up to three days. After correction, these models show skill up to a lead time of at least 9 days. These two EPSs perform best throughout the forecast lead times considered here.

Skill Versus Rainfall Amount
Rainfall data often follow a gamma distribution (Wilks, 2011): low rainfall intensities are common while heavy rains are rarer (Figure 6). Figure 13 shows conditional quantile (CQ; Murphy et al., 1989) plots for bias corrected forecasts. Given a certain forecast, different possible outcomes (observations) can occur. The CQ plot shows the distribution (through selected quantiles) of observed rainfall conditioned on the forecasts of individual ensemble members (Wilks, 2011), along with a climatological forecast (the average rainfall for each day of the year; i.e., the forecast obtained by using the average rainfall on a specific date as the prediction). The 1:1 line represents a perfect forecast. At the low end of the rainfall distribution, ensemble members are unbiased, with observed rainfall being neither over- nor under-forecast. However, this is not the case for the high end of the rainfall distribution. Individual (bias corrected) ensemble members that forecast a large daily rainfall are usually still too wet, with observed rainfall being much lower in most cases. This latter behavior is typically expected for uncertain forecasts of extreme events. However, keep in mind that these conditional quantile plots show the joint distribution of individual ensemble members and observations, not the ensemble mean (as in, e.g., Figures 5 and 9).

Figure 12. The anomaly correlation coefficient AC (upper) and skill score of the root mean squared error SS RMSE (lower) before (left) and after (right) Q2Q bias correction. The lines represent 24-h averages, the crosses 5-day averages (the day 0-5 and day 5-10 accumulation periods, with crosses centered on the respective period). Note that there is no day 5-10 forecast for UKMET, as this forecast only goes out to 7 days ahead. The metrics are calculated using the ensemble means and subsequently averaged over all Ethiopian catchments for the whole temporal range of the data set.
Related to this, a second aspect that can be inferred from the CQ plot is the resolution. The resolution refers to the ability of models to distinguish among different observed events for different forecasts, or put another way, the ability of the forecast model to differentiate between the conditional averages of the observations for different values of the forecast. Suppose that there is hardly any difference in the observed rainfall for forecasts of rainfall of 10 and 20 mm/day; the forecasts would have low resolution. The resolution can be inferred from the angle of the quantile lines of the observations. Generally, the quantile lines become more horizontal with forecasts of increasing rainfall (i.e. the same distribution of the observations occurs for forecasts of different values), indicating a decrease of resolution. No correction or calibration method can correct for a lack of resolution. Comparing between models, CMC shows the lowest resolution. In the end, the CQ plots suggest that individual ensemble members, especially when forecasting heavy rainfall, do not provide enough resolution for certain kinds of decision-making, and motivate the need to utilize the whole ensemble set, which we will focus on in the remainder of this study.
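The data behind a conditional quantile plot can be computed by binning forecasts and taking quantiles of the paired observations. A minimal sketch, assuming hypothetical `fcst` and `obs` arrays of paired daily values:

```python
import numpy as np

def conditional_quantiles(fcst, obs, bin_edges, qs=(0.25, 0.5, 0.75)):
    """For each forecast bin, return the bin center and selected quantiles
    of the observations paired with forecasts in that bin."""
    idx = np.digitize(fcst, bin_edges)
    result = []
    for b in range(1, len(bin_edges)):
        paired_obs = obs[idx == b]          # observations for this forecast bin
        if paired_obs.size > 0:
            center = 0.5 * (bin_edges[b - 1] + bin_edges[b])
            result.append((center, np.quantile(paired_obs, qs)))
    return result
```

Plotting the quantiles against the bin centers, together with the 1:1 line, reproduces the structure of Figure 13; flattening quantile lines at high forecast values correspond to the loss of resolution discussed above.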
Heavy rainfall events are most relevant in the context of flood forecasting. We chose to compare the skill of forecasts for rainfall exceeding the observed (catchment-specific) 90th percentile. Figure 14 shows the reliability diagrams of each EPS model, as well as the frequencies of forecast probabilities exceeding the 90th percentile. Exceedance probabilities are calculated from the raw and bias corrected ensemble distributions, neither of which is explicitly calibrated to account for overconfident or over-dispersive ensembles. Models are reliable when, on average, the events being forecast occur with a relative frequency that is close to the forecast probability. The reliability diagrams clearly show that all models tend to overestimate the probability of heavy rainfall events. This wet bias is mostly fixed by the Q2Q bias correction. Another aspect that can be inferred from the reliability diagram is the model's resolution. Models have good resolution when different forecast probabilities are reflected by different observed frequencies. UKMET has the poorest resolution, with the weakest relation between forecast probability and observed relative frequency. Interestingly, CMC performs well in terms of resolution of forecast probabilities, while its individual ensemble members showed the least resolution (Figure 13). This is likely due to the large, more accurate ensemble spread (not shown) that characterizes CMC (see Figure 10). This large spread might degrade the resolution of individual ensemble members, while increasing the resolution of forecast probabilities. If the model forecasts an extreme event with a high probability, which does not occur often for the CMC model (as seen on the right-hand side of the sharpness diagram inset), this is more often reflected by accurate observed frequencies of occurrence. The multi-model, with by definition the largest ensemble spread, also has high resolution.
For all models, resolution is increased after bias correction. Finally, the inset boxes show the models' sharpness, which is a property of the forecast alone and does not depend on the observations. Models are sharp when forecast probabilities are often different from the climatological average frequency. UKMET and ECMWF have similar sharpness; however, ECMWF is much more reliable, with higher probabilities also reflected in a higher relative observed frequency. Bias correction using Q2Q mapping decreases sharpness but increases reliability and resolution. The improved reliability and resolution are reflected by an increased Brier skill score.
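The points of a reliability diagram like Figure 14 are obtained by binning the forecast probabilities and computing the observed relative frequency in each bin; the bin counts give the sharpness inset. A sketch, assuming hypothetical arrays `prob_fcst` (ensemble exceedance probabilities) and `event_obs` (whether the event occurred):

```python
import numpy as np

def reliability_curve(prob_fcst, event_obs, n_bins=10):
    """Bin forecast probabilities (width 1/n_bins) and compute, per bin,
    the mean forecast probability, the observed relative frequency of the
    event, and the bin count (sharpness)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(prob_fcst, edges) - 1, 0, n_bins - 1)
    bin_prob, bin_freq, counts = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            bin_prob.append(prob_fcst[mask].mean())  # mean forecast probability
            bin_freq.append(event_obs[mask].mean())  # observed relative frequency
            counts.append(int(mask.sum()))           # sharpness information
    return np.array(bin_prob), np.array(bin_freq), np.array(counts)
```

For a perfectly reliable forecast system, the observed frequencies lie on the 1:1 line against the mean forecast probabilities.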
The Brier skill score shows the accuracy of probability forecasts for binary events. In this case, the event of interest is exceedance of the 90th percentile of the observed precipitation, and the reference forecast is the daily average climatology. Figure 15 shows the spatial distribution of the BSS. The lowest skill scores are found in the mountainous regions in the west. A similar spatial pattern in skill was found for probabilistic skill metrics by Vogel et al. (2020). Q2Q bias correction increases the BSS for all models and for all catchments. The multi-model ensemble has the largest BSS. CMC and ECMWF have similar performance to each other, and UKMET and NCEP are the least skillful.

Figure 14. Reliability diagrams showing the relationship between forecast probability and observed relative frequency of precipitation exceeding the climatological 90th percentile for Ethiopia. The inset boxes show sharpness diagrams, illustrating the frequency of occurrence of the forecast probabilities on a log scale. The probabilities were discretized using bins ranging from 0 to 1, with a bin width of 0.1. +24 h ensemble forecasts were used to create these plots.
We have seen that the multi-model ensemble has the highest Brier skill score for extreme rainfall. For less extreme rainfall, too, the multi-model ensemble usually outperforms the other models (Figure 16). Generally, the BSS increases with increasing rainfall percentile, and most models are skillful for rainfall predictions exceeding the 40th percentile. Because these rainfall amounts are by definition increasingly rare, using climatology as a predictor gives a lower and lower BS clim . The skill scores of all models increase after Q2Q bias correction. After the multi-model ensemble, CMC is usually the best when looking at the median BSS, closely followed by ECMWF. NCEP and UKMET are generally the least skillful.

Figure 15. Brier skill score for precipitation exceeding the catchment-specific 90th percentile of rain before (upper left) and after (upper right) Q2Q bias correction. The lower figure shows the difference (after minus before correction) of the BSS. In each sub-figure, the lower right plot displays the precipitation amount corresponding to the catchment-specific 90th percentile. +24 h ensemble forecasts were used to create these plots.

Threshold-Based Decision-Making
For the purpose of flood forecasting, thresholds need to be set for deciding when to let the "alarm bells" go off and take preventive action. Because we are dealing with ensemble forecasts, we could, for example, start to issue flood warnings when the chance of exceeding a threshold rainfall (e.g., the 90th percentile) exceeds a certain threshold probability. In choosing this probability threshold, there is always a trade-off between the probability of detection, which can be increased by decreasing the threshold probability, and the false alarm rate, which inevitably increases along with it. The ROC diagram (Mason, 1982), shown in Figure 17, shows this trade-off. A perfect forecast has a POD of one and an F of zero, and would display a graph along the left and upper boundaries of the ROC diagram. For a forecast using the climatological probability of the event (no skill), the POD and F are always equal, no matter the threshold probability. Skillful forecasts are located to the upper left of the 1:1 line, which represents no skill. As is to be expected for extreme events, which by definition deviate strongly from climatology, all models are skillful. Again, the multi-model ensemble is the most skillful.
Figure 16. Brier skill scores for precipitation exceeding a range of percentiles (x-axis) for all models. The box plots show the spatial variability of the BSS for calibrated forecasts, while the colored diamonds show the non-corrected spatial median of the BSS. +24 h ensemble forecasts were used to create these plots.

Figure 17. ROC curves showing the trade-off between false alarm rate (F) and probability of detection (POD) for precipitation exceeding the 90th percentile. The curve is created using a series of incremental probability thresholds, shown for the multi-model ensemble. The 1:1 line represents no skill. +24 h bias corrected ensemble forecasts were used to create these plots.

Decision-makers should carefully consider the trade-off between the probability of detection and the false alarm rate in choosing a threshold probability for taking action (e.g., issuing a warning). Figure 18a shows how choosing different threshold probabilities influences the POD, the success ratio (SR, which is 1 − FAR), the bias (straight lines), and the TS (curved lines). The best possible forecast has a POD, SR, B, and TS of one. Increasing the threshold probability increases the SR, but also decreases the POD, and vice versa. A common solution for finding the "best" threshold probability is to find the one that maximizes the TS (Wilks, 2011). The TS is the ratio of correct "yes" forecasts to the total number of times the event was forecast and/or observed. The TS is similar to the H but does not reward correct "no" forecasts, which would boost the score when applied to rare events. In the case of our example and data set, the maximum TS is obtained using a threshold probability of 0.3, or 30%. So, if at least 30% of the ensemble members exceed the 90th rainfall percentile, a warning is issued (a "yes" forecast). If the probability is lower than that, no warning is issued (a "no" forecast). Choosing to maximize the TS does come at a price. On average, the B is around 1.2. This means extreme rainfall is forecast around 20% more often than it is observed. For a threshold probability of 30%, this bias is relatively constant with increasing lead time. Figure 18b shows in more detail how the TS, the POD, the FAR, and B develop with increasing lead time, using a threshold probability of 30%. As expected, the scores are best for shorter lead times. The multi-model ensemble starts off with a TS of almost 40%. This is the highest score among the models, and this ranking persists throughout the lead times considered. At a 1-day lead time, the TS of the multi-model ensemble is as good as ECMWF's 0-day lead time forecast. For the other models the difference is even larger: the TS at 4-day lead of the multi-model is higher than the 1-day lead forecast of NCEP, UKMET, and CMC. Also considering the POD, the multi-model ensemble consistently outperforms all other models. The probability of correct "yes" forecasts, given the rainfall occurred, exceeds 60% for lead times up to 1 day when using the multi-model ensemble. ECMWF is a close second and is followed by UKMET, NCEP, and CMC in order of descending TS and POD. Because in this example we already put out a warning when the probability of exceeding the 90th percentile rainfall amount is 30%, there will be days with false alarms, decreasing the SR.
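The threshold-selection procedure described above, picking the warning threshold that maximizes the TS, can be sketched as follows; `exceed_prob` and `event_obs` are hypothetical arrays of ensemble exceedance probabilities and observed occurrences:

```python
import numpy as np

def best_threshold(exceed_prob, event_obs, thresholds):
    """Sweep probability thresholds for issuing a warning, compute the
    threat score TS = a / (a + b + c) at each, and return the threshold
    that maximizes it together with all TS values."""
    event = np.asarray(event_obs, dtype=bool)
    ts_values = []
    for t in thresholds:
        warn = np.asarray(exceed_prob) >= t
        a = np.sum(warn & event)       # hits
        b = np.sum(warn & ~event)      # false alarms
        c = np.sum(~warn & event)      # misses
        ts_values.append(a / max(a + b + c, 1))
    ts_values = np.array(ts_values)
    return thresholds[int(np.argmax(ts_values))], ts_values
```

The same loop, with POD and F recorded per threshold, traces out the ROC curve of Figure 17.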
Only the multi-model ensemble, and only for 0-day lead time forecasts, achieves an SR higher than 50%. Again, ECMWF is a close second, followed by NCEP and UKMET, and finally CMC with the lowest SR. Using a threshold probability of 30%, all models overestimate the frequency of rainfall exceeding the 90th percentile (bias).

Conclusion and Recommendations
This study explores one method to optimize the skill of large-scale precipitation forecasts from operational NWP systems. We applied Q2Q bias correction to the ensemble forecasts of four EPSs and a multi-model ensemble that combines those individual EPSs. We validated these models using satellite observations of precipitation in 12 catchments in Ethiopia. We used a variety of validation metrics to compare models and to explore the variability of model skill.
We found that applying a simple and straightforward bias correction technique, Q2Q mapping, significantly increases model skill. Especially in mountainous areas, biases are large and should be corrected. Biases change over the seasons and vary spatially. Therefore, the best model improvement is achieved by applying Q2Q bias correction per catchment and per month.
When looking at a spatial average of all Ethiopian catchments, the ensemble mean of the multi-model ensemble performs best when assessed over a broad range of error and correlation metrics. A close second is the ensemble mean of ECMWF. The bias corrected multi-model ensemble provides the best performing predictions for a large majority of catchments. For some individual catchments other models score better, judging based on the error and correlation metrics. Forecasters might therefore select the best model for each catchment individually. Not only is the multi-model ensemble the best for a majority of catchments, it is also best for most months of the year, especially during wet seasons. With increasing lead time, the multi-model ensemble remains the best performer. However, the additional skill over ECMWF diminishes with lead time as other models lose skill more rapidly than ECMWF. The ranking of CMC, UKMET, and NCEP is both spatially and temporally variable and depends on the verification metric used.
Increasing the temporal scale of a forecast increases its skill, albeit at the cost of the forecast's granularity. The anomaly correlation and (after Q2Q bias correction) the skill score of the root mean squared error are larger for 5-day rainfall averages than for 24-h averages. A possible explanation for this result is that the relative importance of temporal, and related spatial, phase errors decreases as the accumulation period increases. For future research it might be worth investigating the effect phase errors have on skill.
If the goal is to forecast extreme rainfall, CMC and the multi-model ensemble are the two most reliable. However, both models underestimate the frequency of high-intensity rainfall. The Brier skill score shows the accuracy of probabilistic forecasts and is highest for the multi-model ensemble. When making decisions related to flood forecasting, probability thresholds could be used to decide whether or not to act on the risk. In choosing the threshold probability, there is always a trade-off between the probability of detection and the false alarm rate and ratio. This trade-off, visualized by the ROC curve and the performance diagram, is most favorable for the multi-model ensemble. When using a threshold probability of 30%, the multi-model ensemble outperforms all others based on the threat score, the probability of detection, and the false alarm ratio. ECMWF is a close second, while CMC performs worst.
All results shown in this study were based on validation using satellite observations of rainfall. These observations may themselves be biased. Measures that are insensitive to bias, such as the anomaly correlation, can be seen as fairer metrics for comparing the performance of models. Even considering just the anomaly correlation, the multi-model ensemble and ECMWF would still come out as "the best." It would be worthwhile to further investigate the biases of different rainfall products, to make a quantitative estimate of the uncertainties in Ethiopia.
It is worthwhile to invest in improvements of NWP models so as to prevent precipitation biases from occurring in the first place. In the meantime, statistical post-processing techniques, such as bias correction or calibration, are necessary for providing skillful and reliable forecasts. Therefore, it is also worthwhile to invest in further improvements of these techniques. Perhaps applying an extreme value theory approach would result in better predictions of extreme rainfall.
A very straightforward and simple approach for creating a multi-model ensemble, just using all available ensemble members, significantly improves model skill and reliability. Perhaps, a more sophisticated selection of ensemble members would improve its skill relative to the best individual model even more. For example, one could apply methods to select only those members that perform best in certain regions.

Appendix: Elaboration of Bias Correction and Validation Metrics
In this section we expand on the Q2Q mapping bias correction technique and further explain the validation metrics that were introduced in the Methods section.

Quantile-to-Quantile Mapping Bias Correction
Expanding somewhat more formally on the explanation of this technique in Section 3.2, the process can be represented as follows. The model climatological cumulative distribution function (CDF) is

F_{i,f}(y) = P[Y ≤ y], (A1)

where i represents each catchment, f each forecast lead time, P[α] denotes the probability of the event α, and x (y) is a specific value of the generic random variable X (Y) representing the observed (forecast) weather variable of interest. Using these CDFs (Equation 1 for the observations and Equation A1 for the model), a direct percentile-by-percentile bias correction is made for each catchment i, lead time f, and precipitation ensemble member by setting Equation 1 equal to Equation A1 and determining the value of x that satisfies the equation. In this manner, the "observation-space" percentile is matched with the "forecast model-space" percentile, as shown in Figure 3, and x is used in place of y in the hydrometeorological application. Note that this technique ensures that the forecasts produce the same climatological number of "no rain" events as observed. It also preserves the temporal (ranked) covariances of the forecast weather variable fields (as generated by the numerical weather prediction model).
For a specific example, suppose a predicted rainfall amount of 25 mm corresponds to the 75th percentile of the model climatology, while the 75th percentile of observed rainfall is lower, at 20 mm. Bias correction is then applied by adjusting the original rainfall prediction from 25 mm down to 20 mm. This technique is explained visually in Figure 3. For ensemble forecasts, every ensemble member is adjusted in this way.
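An empirical version of this mapping can be sketched in a few lines. This is an illustrative implementation using empirical CDFs, not the exact code of this study; `model_clim` and `obs_clim` are hypothetical climatology samples for one catchment and lead time:

```python
import numpy as np

def q2q_correct(fcst_value, model_clim, obs_clim):
    """Quantile-to-quantile mapping: find the percentile of a forecast
    value within the model climatology, then return the observed value
    at that same percentile."""
    model_sorted = np.sort(model_clim)
    # Empirical (model-space) percentile of the forecast value
    pct = np.searchsorted(model_sorted, fcst_value, side="right") / model_sorted.size
    # Value at the same percentile of the observed climatology
    return np.quantile(np.sort(obs_clim), np.clip(pct, 0.0, 1.0))
```

With a synthetic observed climatology of 1-100 mm and a model climatology that is 25% too wet, a 25 mm forecast maps to roughly 20 mm, mirroring the worked example above. For an ensemble forecast, the function is applied to every member.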

Difference Measures
In the following, O refers to the observed variate, P refers to the model-predicted variate, and N refers to the sample size. Except for the Brier skill score and the metrics for binary forecasts, the model-predicted rainfall P is the ensemble mean of each individual EPS.
The statistical measures to show the difference, or error, between the model and observations are the mean absolute error (MAE), root mean squared error (RMSE), the bias error, and the index of agreement d, as recommended by Willmott (1981). These statistical values are calculated using Equations A3-A6.
The MAE describes the average absolute difference between an observed and model-predicted variate and is calculated using:

MAE = (1/N) Σ_{i=1..N} |P_i − O_i|. (A3)

The RMSE is related to the MAE, but gives more weight to larger errors due to its squaring of errors. Both measures indicate overall model performance. The RMSE is calculated as follows:

RMSE = [ (1/N) Σ_{i=1..N} (P_i − O_i)^2 ]^{1/2}. (A4)

Overall bias (also known as mean error) measures the average bias across all forecasts. The bias gives a good indication of the systematic over- or under-prediction of forecast precipitation. This bias is unconditional, as it does not depend on the forecast value. The bias is calculated using:

Bias = (1/N) Σ_{i=1..N} (P_i − O_i). (A5)

The index of agreement (d; Willmott, 1981) can be applied to compare between models and is both a relative and a bounded measure (as opposed to the RMSE). The d is calculated as follows:

d = 1 − [ Σ_{i=1..N} (P_i − O_i)^2 ] / [ Σ_{i=1..N} ( |P_i − Ō| + |O_i − Ō| )^2 ], (A6)

where Ō is the mean of the observations.
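These four difference measures translate directly into code; a minimal sketch for paired forecast and observation arrays:

```python
import numpy as np

def difference_measures(P, O):
    """MAE, RMSE, bias (mean error), and Willmott's index of agreement d
    for paired model predictions P and observations O."""
    P, O = np.asarray(P, dtype=float), np.asarray(O, dtype=float)
    mae = np.mean(np.abs(P - O))
    rmse = np.sqrt(np.mean((P - O) ** 2))
    bias = np.mean(P - O)
    # Willmott (1981): bounded in (-inf, 1], with d = 1 a perfect match
    denom = np.sum((np.abs(P - O.mean()) + np.abs(O - O.mean())) ** 2)
    d = 1.0 - np.sum((P - O) ** 2) / denom
    return mae, rmse, bias, d
```

For a perfect forecast (P identical to O), the errors vanish and d equals one.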

Correlation Metrics
The coefficient of determination, R 2 , shows how much of the variation of the predictand, P, is accounted for by the regression, in this case a least squares linear regression P̂ = a + bO, where P̂ is the predicted value of P. R 2 is then calculated using:

R 2 = SSR / (SSR + SSE), (A7)

with SSR being the regression sum of squares:

SSR = Σ_{i=1..N} (P̂_i − P̄)^2, (A8)

and SSE the sum of squared errors:

SSE = Σ_{i=1..N} (P_i − P̂_i)^2. (A9)

The disadvantage of the R 2 is that a high score can be obtained just by getting the seasonality right. This is not the case with the anomaly correlation. An anomaly is the difference between the (observed or forecast) variate, in this case precipitation, and the climatological mean of this variate. Simply said, it shows whether a day is wetter or drier than usual. The anomaly correlation shows the correlation of the forecast precipitation anomalies with the actual, observed precipitation anomalies. If the patterns of forecast and observed precipitation are more alike, the AC will be higher, regardless of any systematic bias that may exist. The (uncentered) AC is calculated using Equation A10:

AC = Σ_{i=1..N} (P_i − C_i)(O_i − C_i) / [ Σ_{i=1..N} (P_i − C_i)^2 · Σ_{i=1..N} (O_i − C_i)^2 ]^{1/2}, (A10)

with C_i being the climatological precipitation. Because the observational data set consists of only 7 years of data, simply using the average rainfall for each day would create a rather erratic and even somewhat skillful climatology. To mimic a climatology of a longer time span, we smoothed the 7-year climatology by computing a running mean with a 20-day averaging window.
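The uncentered anomaly correlation and a running-mean climatology smoother can be sketched as follows; this is an illustrative version (the wrap-around padding at the year ends is our assumption, not necessarily the paper's exact treatment):

```python
import numpy as np

def anomaly_correlation(fcst, obs, clim):
    """Uncentered anomaly correlation: correlation of forecast and observed
    departures from the daily climatology clim (Equation A10)."""
    fa, oa = fcst - clim, obs - clim
    return np.sum(fa * oa) / np.sqrt(np.sum(fa ** 2) * np.sum(oa ** 2))

def smooth_climatology(daily_clim, window=20):
    """Running mean over the day-of-year climatology, padded by wrapping
    around the year ends, to mimic a longer-record climatology."""
    kernel = np.ones(window) / window
    padded = np.concatenate([daily_clim[-window:], daily_clim, daily_clim[:window]])
    return np.convolve(padded, kernel, mode="same")[window:window + daily_clim.size]
```

A forecast identical to the observations yields an AC of exactly one, regardless of any constant offset shared by both relative to the climatology.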

Skill Scores
Accuracy measures by themselves may be ambiguous. Therefore, we will use several skill scores. Skill scores relate the accuracy of a prediction to a reference forecast using a particular accuracy measure. In this case, we use the climatology (C i ) as a reference forecast. Positive skill scores mean that the forecast is more accurate than when simply using the climatological mean value and vice versa. We use two skill scores: the skill score of the RMSE (SS RMSE ) and the Brier skill score (BSS). These are calculated using Equations A11-A13.
The SS RMSE is calculated using:

SS_RMSE = 1 − RMSE / RMSE_clim, (A11)

where RMSE_clim is the RMSE obtained when using the climatology C_i as the forecast. The Brier skill score is calculated as:

BSS = 1 − BS / BS_clim, (A12)

with BS being the model Brier score and BS_clim being the Brier score using the climatological mean. The Brier score measures the accuracy of forecast probabilities of dichotomous events and is calculated using the following equation:

BS = (1/N) Σ_{k=1..N} (P_k − O_k)^2, (A13)

where P_k and O_k denote the paired forecast probability and observed occurrence (1 if the event occurred, 0 otherwise) of the dichotomous event. Because probabilities of exceeding a certain amount of rainfall vary over the year, using just one climatological value for the reference forecast could create false skill (Hamill & Juras, 2006). Especially in Ethiopia, where the rainfall pattern is strongly seasonal, positive skill could be achieved by only predicting the overall variations across seasons. Therefore, we calculate monthly climatological probabilities for computing the BSS. For example, the climatological probability of any dichotomous event is the same for all days of all Januaries: simply the average relative frequency across all 7 years.
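The Brier skill score with a (possibly monthly-varying) climatological reference can be sketched directly from Equations A12 and A13:

```python
import numpy as np

def brier_skill_score(prob_fcst, event_obs, clim_prob):
    """BSS of probability forecasts for a dichotomous event, with the
    climatological event probability (scalar or per-day array, e.g.,
    monthly values) as the reference forecast."""
    event = np.asarray(event_obs, dtype=float)          # 1 if event occurred
    bs = np.mean((np.asarray(prob_fcst) - event) ** 2)  # model Brier score
    bs_clim = np.mean((np.asarray(clim_prob) - event) ** 2)
    return 1.0 - bs / bs_clim
```

A perfect probability forecast gives a BSS of one; issuing the climatological probability itself gives a BSS of zero.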

Accuracy Measures for Binary Forecasts
Probabilistic forecasts may have to be converted to binary forecasts for the purpose of decision-making. For example, if the probability of exceeding a certain rainfall amount is larger than a certain threshold probability, a flood warning may be issued. In essence, probabilistic forecasts are then converted to a "yes" or "no" forecast. A contingency table is created by applying the same threshold across a set of forecast/observation pairs and categorizing each pair as in Table A1. The accuracy of binary forecasts can then be calculated using a variety of scores: the hit rate (H), threat score (TS), probability of detection (POD), false alarm ratio (FAR), false alarm rate (F), and the bias (B), Equations A14-A19.
The hit rate is simply the fraction of correct "yes" and correct "no" forecasts. Denoting the number of hits (correct "yes" forecasts) by a, false alarms by b, misses by c, and correct "no" forecasts by d (Table A1), H is calculated using:

H = (a + d) / (a + b + c + d). (A14)

For events that seldom occur, the hit rate tends to be high. This is caused by the large number of correct "no" forecasts. The threat score does not take into account correct "no" forecasts, and is therefore a better accuracy measure for rare events. The TS is calculated as follows:

TS = a / (a + b + c). (A15)

The probability of detection is the fraction of correct "yes" forecasts given that the event occurred, and is calculated using:

POD = a / (a + c). (A16)

The false alarm ratio is the fraction of incorrect "yes" forecasts over the total number of "yes" forecasts and is calculated using:

FAR = b / (a + b). (A17)

The false alarm rate, not to be confused with the false alarm ratio, is the fraction of incorrect "yes" forecasts over the total number of nonevents and is calculated using:

F = b / (b + d). (A18)

The bias is the ratio of the number of "yes" forecasts to the number of "yes" observations and is calculated as follows:

B = (a + b) / (a + c). (A19)
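All six contingency-table scores can be computed from paired binary forecasts and observations; a compact sketch (with a, b, c, d as the hit, false alarm, miss, and correct-negative counts):

```python
import numpy as np

def binary_scores(fcst_yes, obs_yes):
    """Contingency-table scores (Equations A14-A19) for paired binary
    yes/no forecasts and observations."""
    fcst = np.asarray(fcst_yes, dtype=bool)
    obs = np.asarray(obs_yes, dtype=bool)
    a = np.sum(fcst & obs)        # hits
    b = np.sum(fcst & ~obs)       # false alarms
    c = np.sum(~fcst & obs)       # misses
    d = np.sum(~fcst & ~obs)      # correct negatives
    return {
        "H": (a + d) / (a + b + c + d),  # hit rate
        "TS": a / (a + b + c),           # threat score
        "POD": a / (a + c),              # probability of detection
        "FAR": b / (a + b),              # false alarm ratio
        "F": b / (b + d),                # false alarm rate
        "B": (a + b) / (a + c),          # frequency bias
    }
```

For instance, one hit, one false alarm, one miss, and one correct negative give H = 0.5, TS = 1/3, POD = FAR = F = 0.5, and B = 1.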

Data Availability Statement
This work is based on TIGGE data. TIGGE (The Interactive Grand Global Ensemble) is an initiative of the World Weather Research Programme (WWRP). As such, the authors would like to thank ECMWF for continuing to provide open access to this archive, and we are also indebted to Steve Worley and Doug Schuster for earlier access to the NCAR TIGGE archive. Besides the ECMWF TIGGE portal for raw forecasts, processed forecasts used in this study can be found here: http://ral.ucar.edu/~hopson/ESS/.