Automatic calibration of an ensemble for uncertainty estimation and probabilistic forecast: Application to air quality



[1] This paper addresses the problem of calibrating an ensemble for uncertainty estimation. The calibration method involves (1) a large, automatically generated ensemble, (2) an ensemble score such as the variance of a rank histogram, and (3) the selection based on a combinatorial algorithm of a sub-ensemble that minimizes the ensemble score. The ensemble scores are the Brier score (for probabilistic forecasts), or derived from the rank histogram or the reliability diagram. These scores allow us to measure the quality of an uncertainty estimation, and the reliability and the resolution of an ensemble. The ensemble is generated on the Polyphemus modeling platform so that the uncertainties in the models' formulation and their input data can be taken into account. A 101-member ensemble of ground-ozone simulations is generated with full chemistry-transport models run across Europe during the year 2001. This ensemble is evaluated with the aforementioned scores. Several ensemble calibrations are carried out with the different ensemble scores. The calibration makes it possible to build 20- to 30-member ensembles that greatly improve the ensemble scores. The calibrations essentially improve the reliability, while the resolution remains unchanged. The spatial validity of the uncertainty maps is ensured by cross validation. The impact of the number of observations and observation errors is also addressed. Finally, the calibrated ensembles are able to produce accurate probabilistic forecasts and to forecast the uncertainties, even though these uncertainties are found to be strongly time-dependent.

1. Introduction

[2] Air quality simulation involves complex numerical models that rely on large amounts of data from different sources. Most of the input data are provided with high uncertainties in their time evolution, spatial distribution and even average values. Chemistry-transport models are themselves subject to uncertainties in both their physical formulation and their numerical formulation. The multi-scale nature of the problem leads to the introduction of subgrid parameterizations that are an important source of errors. The dimensionality of the numerical system, involving up to hundreds of pollutants in a three-dimensional mesh, is much higher than the number of observations, which also leads to high uncertainties in non-observed variables.

[3] In order to quantify the uncertainties, classical approaches rely on Monte Carlo simulations. The input fields and parameters of the chemistry-transport model are viewed as random vectors or random variables. These are sampled according to their assumed probability distribution, and a model run is carried out with each element of the sample. The set of model outputs constitutes a sample of the probability distribution function of the output concentrations. Typically, the empirical standard deviation of the output concentrations measures the simulation uncertainties. This approach has been applied for air quality simulations [Hanna et al., 1998, 2001; Beekmann and Derognat, 2003].
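The Monte Carlo procedure described above can be illustrated with a minimal sketch. The function `toy_model` is a hypothetical stand-in for a chemistry-transport model run, and the input distributions (a log-normal emission factor, a normal wind speed) and their parameters are assumptions chosen only for illustration.

```python
import random
import statistics

def toy_model(emission: float, wind: float) -> float:
    """Hypothetical stand-in for a chemistry-transport model run (not a real model)."""
    return 40.0 + 60.0 * emission / (1.0 + wind)

def monte_carlo_spread(n_runs: int, seed: int = 0) -> tuple[float, float]:
    """Sample the uncertain inputs, run the model, return (mean, std) of the outputs."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_runs):
        # Assumed distributions: log-normal emission factor, normal wind speed.
        emission = rng.lognormvariate(0.0, 0.3)
        wind = max(0.0, rng.gauss(3.0, 0.5))
        outputs.append(toy_model(emission, wind))
    # The empirical standard deviation serves as the uncertainty measure.
    return statistics.mean(outputs), statistics.stdev(outputs)

mean, std = monte_carlo_spread(1000)
print(f"mean = {mean:.1f}, empirical std = {std:.1f}")
```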

[4] Another approach is the use of models which differ by their numerical formulation or physical formulation. The models can originate from different research groups [e.g., van Loon et al., 2007; Delle Monache and Stull, 2003; McKeen et al., 2005; Vautard et al., 2009] or from the same modular platform [Mallet and Sportisse, 2006]. In addition to this multimodel strategy, the input data can also be perturbed so that all uncertain sources are taken into account. It is also possible to choose between different emission scenarios and meteorological forecasts as Delle Monache et al. [2006a, 2006b] did. Pinder et al. [2009] split the uncertainty into a structural uncertainty due to the weaknesses in the physical formulation and a parametric uncertainty due to the errors in the input data. Garaud and Mallet [2010] built the ensemble with several models randomly generated within the same platform and with perturbed input data.

[5] Whatever the strategy for the generation of an ensemble, several assumptions are made by the modelers. One needs to associate probability density functions with every input field or parameter to be perturbed. Under the usual assumption that the distribution of a field or parameter is either normal or log-normal, one has to estimate a median and a standard deviation. For a field, providing a standard deviation is complex as it should take into account spatial correlations, and possibly time correlations. As for multimodel ensembles, one has little control over the composition of the models when they are provided by different teams. When the models are derived within the same platform, the key points are the amount of choice in the generation of an individual model, and the probability associated with each choice. Once all the assumptions and choices have been made, it is technically possible to generate an ensemble. However, it is quite difficult to determine the proper medians and standard deviations of the perturbed fields, and to design a multimodel ensemble that properly takes into account all formulation uncertainties.

[6] In order to evaluate the quality of an ensemble, several a posteriori scores compare the ensemble simulations with observations. These scores, such as rank histograms, reliability diagrams or Brier scores, assess the reliability, the resolution or the sharpness of an ensemble. For instance, a reliable ensemble gives a well estimated probability for a given event in comparison to the frequency of occurrence of this event, whereas the resolution describes the capacity of an ensemble to give different probabilities for a given event.

[7] Improving the quality of an ensemble should lead to improved scores, e.g., to a flat rank histogram or a low Brier score. One strategy could be tuning the perturbations of the input fields or optimizing the design of the multimodel ensemble (that is, choosing or developing physical parameterizations or numerical schemes, and better weighting each design option), so as to minimize or maximize some score. This is a complex and computationally expensive task that would require the generation of many ensembles.

[8] In this paper, we adopt a strategy based on a single, but large, ensemble. Out of a large ensemble, a combinatorial optimization algorithm extracts a sub-ensemble that minimizes (or maximizes) a given score such as the variance of a rank histogram. This process is referred to as (a posteriori) calibration of the ensemble. Section 2 describes it in detail. It is applied in Section 3 to a 101-member ensemble of ground-ozone simulations with full chemistry-transport models run across Europe during the year 2001. The scores of the full ensemble and the optimized sub-ensemble (i.e., the calibrated ensemble) are studied, based on observations at ground stations. In Section 4, the uncertainty estimation given by the calibrated ensemble is analyzed. In Section 5, probabilistic forecasts for threshold exceedance are studied.

2. Calibration Method

[9] Hamill and Colucci [1997] use rank histograms to calibrate precipitation probabilistic forecasts. When the ensemble is not reliable enough, the probabilistic forecasts cannot be derived directly from the ensemble relative frequencies. Assuming the shape of the rank histogram remains the same in the forecast period, the authors propose to rely on the past rank distribution to compute the probabilistic forecasts. Hopson and Webster [2010] calibrate an ensemble prediction to improve flood forecasting. An empirical cumulative distribution function is provided by ensemble predictions of precipitation. Then, it is calibrated with observations, using a quantile-to-quantile mapping technique.

[10] In this paper, by “ensemble calibration” we mean extracting a sub-ensemble from a large ensemble so that a certain criterion is satisfied. A preliminary step is therefore to generate a large ensemble, composed of simulations that are sufficiently different from each other to provide substantial information. A criterion is defined to assess the quality of an ensemble, and a corresponding score measures how well the criterion is satisfied. An automatic selection of a sub-ensemble is finally carried out to minimize the score. The criterion usually assesses the uncertainty representation of an ensemble, based on the additional information brought by the observations. This section details the method employed to generate a large ensemble and to carry out an automatic calibration.

2.1. Generation of a Large Ensemble

[11] The method employed for the automatic generation of a large ensemble is described by Garaud and Mallet [2010]. A wide range of options should be available for the design of a single model: several physical parameterizations, several numerical discretizations, different sources for the input data and random perturbations in the input fields. In the paper referred to, thirty alternatives are available for the generation of a single model. Each member of the ensemble is defined after the random selection of one option per alternative.
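As an illustration of the random selection of one option per alternative, the following sketch builds a small ensemble of model configurations. The alternatives listed in `ALTERNATIVES` are only a hypothetical subset (the option names echo those mentioned later for the 101-member ensemble), and the uniform probability of each option is an assumption; it is not the actual generation procedure of the platform.

```python
import random

# Hypothetical alternatives; the actual ensemble relies on about thirty of them.
ALTERNATIVES = {
    "chemical_mechanism": ["RACM", "RADM2"],
    "vertical_diffusion": ["Louis", "Troen-Mahrt"],
    "vertical_levels": [5, 9],
    "emission_perturbation": [0.8, 1.0, 1.25],
}

def generate_member(rng: random.Random) -> dict:
    """Define one model by drawing one option per alternative (uniformly, by assumption)."""
    return {name: rng.choice(options) for name, options in ALTERNATIVES.items()}

def generate_ensemble(n_members: int, seed: int = 0) -> list[dict]:
    """Generate the configurations of all the members of an ensemble."""
    rng = random.Random(seed)
    return [generate_member(rng) for _ in range(n_members)]

ensemble = generate_ensemble(101)
print(ensemble[0])
```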

[12] In this paper, we rely on the same ensemble as Garaud and Mallet [2010]. It includes 101 members run throughout the year 2001 over Europe. This ensemble will be used and calibrated in Section 3.

2.2. Automatic Selection

[13] Suppose a base ensemble with N members. There are $\sum_{k=1}^{N} \binom{N}{k} = 2^N - 1$ possible sub-ensembles. If N = 100, there are over $10^{30}$ sub-ensembles. It is obviously impossible to consider all possible combinations in order to select the best combination with respect to the given criterion. Consequently a combinatorial optimization algorithm is required to minimize the score associated with the criterion.
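The size of the search space can be checked directly. The short sketch below (the function name is ours) sums the binomial coefficients for a 100-member ensemble.

```python
from math import comb

def count_sub_ensembles(n_members: int) -> int:
    """Number of non-empty sub-ensembles of an ensemble with n_members members."""
    return sum(comb(n_members, k) for k in range(1, n_members + 1))

total = count_sub_ensembles(100)
print(f"{total:.3e}")  # → 1.268e+30
```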

[14] Let $\mathcal{E}$ be the full ensemble and $\mathcal{S}$ be a sub-ensemble of $\mathcal{E}$. $\mathcal{S}$ is supposed to be non-empty. Let $J(\mathcal{S})$ be the score of $\mathcal{S}$. The following sections describe different scores and algorithms which may be used in the ensemble calibration.

2.2.1. Criterion and Score

[15] The main reasons for generating an ensemble are to improve forecasts with the so-called ensemble forecasts, and to estimate the uncertainty in the model's output. In this paper, we focus on the second objective. The criterion typically measures the quality of an uncertainty estimation or of the prediction of exceeding a threshold. It can be based on two desirable features of an ensemble:

[16] 1. Reliability: an ensemble has high reliability when its probabilistic forecasts for a given event match, on average, the observed frequency of this event.

[17] 2. Resolution: the capacity of the prediction system to distinguish the outcomes for a given event.

Rank Histogram

[18] A rank histogram measures the reliability of an ensemble. Let $\{x_1, \ldots, x_j, \ldots, x_N\}$ be the output of an N-member ensemble at a given time, sorted in increasing order. This ensemble is considered as a sample of a random variable X with some probability distribution, which means that all $x_j$ are supposed to follow the same probability distribution. Let Y be a random variable representing the true state. At a given point, if Y has the same probability distribution as X, then $E_X[P_Y(y \le x_j)] = \frac{j}{N+1}$, where $E_X[\cdot]$ denotes the expectation related to X, $P_Y$ the probability associated with Y and y a realization of the true state, e.g., an exactly measured ozone concentration. The rank histogram, developed by Anderson [1996], Talagrand et al. [1999], and Hamill and Colucci [1997], is computed by counting the rank of the true state within the sorted ensemble of forecasts. A perfect histogram is flat, whereas a U-shaped rank histogram means a lack of variability in the ensemble.

[19] Let $r_j$ be the number of observations of rank j. An observation of rank j is an observation which is higher than the concentrations of exactly j members of the ensemble. Suppose we have M observations. The expectation of $r_j$ is $\bar{r} = E\left[\sum_{m=1}^{M} P_Y(x_j < y_m \le x_{j+1})\right] = \frac{M}{N+1}$. The score related to the rank histogram flatness is based on the squared error

$$\delta = \sum_{j=0}^{N} (r_j - \bar{r})^2 \qquad (1)$$

[20] The score $\delta$ gets lower as the histogram gets flatter, since $\bar{r}$ corresponds to the height of a flat histogram. Obviously, this measure depends on the number of members. It can be normalized by $\delta_0 = E[\delta] = \frac{MN}{N+1}$ because $E[(r_j - \bar{r})^2] = \frac{MN}{(N+1)^2}$. Finally the following score is used to measure the flatness of the rank histogram:

$$\frac{\delta}{\delta_0} = \frac{N+1}{MN} \sum_{j=0}^{N} (r_j - \bar{r})^2 \qquad (2)$$

which should ideally be close to 1.

Reliability Diagram

[21] Instead of simply predicting whether an event will occur or not, an ensemble can provide a probabilistic forecast. This is especially useful for the prediction of a threshold exceedance. A basic probabilistic forecast may be given by the number of models which exceed the threshold over the total number of models [Anderson, 1996]. In order to construct a reliability diagram, the range of forecast probabilities, [0, 1], is divided into K bins $[p_0, p_1], \ldots, [p_k, p_{k+1}], \ldots, [p_{K-1}, p_K]$ where $p_0 = 0$, $p_K = 1$, and the sequence $(p_k)_k$ is increasing. Let $O_k$ be the (observed) relative occurrence frequency of the event when the ensemble predicts a probability in $[p_k, p_{k+1}]$. A reliable ensemble should give $O_k \in [p_k, p_{k+1}]$. The reliability diagram [Wilks, 2005] plots $O_k$ against $p_k$ or $\frac{1}{2}(p_k + p_{k+1})$. A perfect reliability diagram should follow the diagonal.

Brier Score

[22] The Brier score measures the mean squared probability error for a specific event [Brier, 1950; Wilks, 2005]. Let M be the total number of observations. Let pi be the forecast probability and oi be the observed probability at a date i. The observed probability oi is equal to 1 if the event occurred, and 0 otherwise. The Brier score is given by:

$$\mathcal{B} = \frac{1}{M} \sum_{i=1}^{M} (p_i - o_i)^2 \qquad (3)$$

[23] A Brier score for an ensemble can be compared with the Brier score of the climatological forecast. The climatological forecast is given by a single occurrence frequency $o_c$, observed in the past. If $o_i$ follows the Bernoulli distribution and is equal to 1 with the frequency $o_c$ and to 0 with the frequency $1 - o_c$, the expected Brier score $\mathcal{B}_{\mathrm{cl}}$ of the climatological forecast is given by

$$\mathcal{B}_{\mathrm{cl}} = o_c (1 - o_c) \qquad (4)$$

The so-called Brier skill score is defined by

$$\mathrm{BSS} = 1 - \frac{\mathcal{B}}{\mathcal{B}_{\mathrm{cl}}} \qquad (5)$$

It is at most 1, which corresponds to a perfect forecast, and is greater than 0 when the ensemble prediction gives a better forecast than the climatological forecast.

Discrete Ranked Probability Score

[24] Suppose a set of L events, and let $p_i^l$ be the forecast probability and $o_i^l$ the observed probability (1 or 0) for the l-th event at the date i. The total number of observations M is the same for each event. The discrete ranked probability score (DRPS), which is a variant of RPS (ranked probability score) [Epstein, 1969; Murphy, 1971], is given by:

$$\mathrm{DRPS} = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{M} \sum_{i=1}^{M} (p_i^l - o_i^l)^2 \qquad (6)$$

[25] This score is a generalization of the Brier score from a single event to a set of events.
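The probabilistic scores of this section can be sketched in a few lines. The function names are ours, the toy concentrations are invented for illustration, and two assumptions are made: the climatological frequency $o_c$ is taken over the same dates as the forecasts, and the DRPS is the plain average of the Brier scores over the events.

```python
def ensemble_probability(members: list[float], threshold: float) -> float:
    """Fraction of members at or above the threshold (basic probabilistic forecast)."""
    return sum(x >= threshold for x in members) / len(members)

def brier_score(probs: list[float], occurred: list[int]) -> float:
    """Mean squared difference between forecast probability and outcome (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, occurred)) / len(probs)

def brier_skill_score(probs: list[float], occurred: list[int]) -> float:
    """1 - B / B_cl with B_cl = o_c (1 - o_c); undefined if o_c is 0 or 1."""
    o_c = sum(occurred) / len(occurred)
    return 1.0 - brier_score(probs, occurred) / (o_c * (1.0 - o_c))

def drps(ensembles, observations, thresholds) -> float:
    """Average of the Brier scores over a set of threshold-exceedance events."""
    total = 0.0
    for t in thresholds:
        probs = [ensemble_probability(m, t) for m in ensembles]
        occurred = [int(y >= t) for y in observations]
        total += brier_score(probs, occurred)
    return total / len(thresholds)

# Toy data: three dates, four members each (invented values, in μg m−3).
ensembles = [[90, 110, 125, 130], [60, 70, 80, 95], [115, 122, 128, 140]]
observations = [118, 75, 131]
probs = [ensemble_probability(m, 120) for m in ensembles]
occurred = [int(y >= 120) for y in observations]
print(probs, occurred)  # → [0.5, 0.0, 0.75] [0, 0, 1]
print(brier_score(probs, occurred), brier_skill_score(probs, occurred))
print(drps(ensembles, observations, [80, 100, 120, 140, 160]))
```

Note that a single model can only provide probabilities equal to 0 or 1 with `ensemble_probability`, which is why an ensemble is needed to obtain intermediate probabilities.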

[26] While the rank histogram and the reliability diagram measure the reliability of a prediction system, the Brier score, and thus the DRPS, can measure both the reliability and the resolution of an ensemble, as shown by Murphy [1973]. The latter scores can be broken down into three terms: reliability, resolution and uncertainty. For instance, the Brier score is an estimation of $E[(p - o)^2]$. Let $p_0$ be the specific probability for a given event $\mathcal{A}$ and $O_0$ be the occurrence frequency of $\mathcal{A}$ when $p_0$ is provided. The occurrence of $\mathcal{A}$, denoted $o$, follows a Bernoulli distribution. Thus, $o$ takes value 1 with frequency $O_0$ and value 0 with frequency $1 - O_0$. The expected value of $(p_0 - o)^2$ is

$$E[(p_0 - o)^2] = (p_0 - 1)^2 O_0 + p_0^2 (1 - O_0) = (p_0 - O_0)^2 + O_0 (1 - O_0) \qquad (7)$$

[27] Then, we compute (7) for many probabilities. In our case, the prediction system provides discrete probabilities for a given event. Suppose the system provides K + 1 different probabilities denoted $p_k$, ranging in [0, 1]. Let $n_k$ be the number of times $p_k$ is computed with the ensemble. Thus, the frequency distribution of $p_k$ is given by $\frac{n_k}{M}$ with M the total number of considered dates, i.e., the total number of observations. We have $\frac{1}{M} \sum_{k=0}^{K} n_k = 1$. Let $O_k$ be the observed occurrence frequency of the event when the ensemble predicts $p_k$. The climatological occurrence frequency is $o_c = \frac{1}{M} \sum_{k=0}^{K} n_k O_k$. Weighting (7) by the frequency of each $p_k$ gives

$$\mathcal{B} = \frac{1}{M} \sum_{k=0}^{K} n_k (p_k - O_k)^2 - \frac{1}{M} \sum_{k=0}^{K} n_k (O_k - o_c)^2 + o_c (1 - o_c) \qquad (8)$$

[28] The first term is a reliability term since it compares the probability provided by the forecast system with the occurrence frequency of the event. The second term is called “resolution” and is equivalent to the variance of Ok. The third one is the “uncertainty” term which corresponds to the score of the climatological forecast. It is constant for a specific event and is maximum when the climatological forecast is equal to 0.5. This means that the climatological forecast has the worst Brier score when it provides the most uncertain occurrence probability, i.e., 0.5. The same decomposition can be carried out for the Brier skill score and the DRPS (9).

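$$\mathrm{DRPS} = \frac{1}{L} \sum_{l=1}^{L} \left[ \frac{1}{M} \sum_{k=0}^{K} n_k^l (p_k - O_k^l)^2 - \frac{1}{M} \sum_{k=0}^{K} n_k^l (O_k^l - o_c^l)^2 + o_c^l (1 - o_c^l) \right] \qquad (9)$$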
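The decomposition of the Brier score into reliability, resolution and uncertainty can be verified numerically. The sketch below (function name and toy data are ours) groups the dates by distinct forecast probability $p_k$, and checks that reliability minus resolution plus uncertainty recovers the Brier score.

```python
from collections import defaultdict

def brier_decomposition(probs, occurred):
    """Return the (reliability, resolution, uncertainty) terms of the Brier score."""
    m = len(probs)
    groups = defaultdict(list)
    for p, o in zip(probs, occurred):
        groups[p].append(o)  # outcomes observed when the forecast probability is p_k
    o_c = sum(occurred) / m  # climatological occurrence frequency
    reliability = 0.0
    resolution = 0.0
    for p_k, outcomes in groups.items():
        n_k = len(outcomes)
        freq_k = sum(outcomes) / n_k  # observed occurrence frequency O_k
        reliability += n_k * (p_k - freq_k) ** 2 / m
        resolution += n_k * (freq_k - o_c) ** 2 / m
    return reliability, resolution, o_c * (1.0 - o_c)

# Toy forecasts from a hypothetical 4-member ensemble over ten dates.
probs = [0.0, 0.0, 0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 1.0, 1.0]
occurred = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]
rel, res, unc = brier_decomposition(probs, occurred)
brier = sum((p - o) ** 2 for p, o in zip(probs, occurred)) / len(probs)
print(rel, res, unc)
print(brier, rel - res + unc)  # both are equal up to rounding errors
```

In this toy case the reliability term is 0.0125, the resolution term 0.115 and the uncertainty term 0.24, which sum (with the minus sign on the resolution) to the Brier score 0.1375.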

[29] The choice of a criterion, i.e., an ensemble score, is the first step of the ensemble calibration. The second step is the choice of a combinatorial optimization algorithm.

2.2.2. Combinatorial Optimization Algorithm

[30] Two combinatorial optimization algorithms are employed in order to minimize the scores previously introduced: a genetic algorithm and simulated annealing.

Genetic Algorithm

[31] The genetic algorithm, described by Fraser and Burnell [1970] and Crosby [1973], takes evolutionary biology as its basis, with the selection, crossover and mutation of a population of individuals. Let $\mathcal{S}_i$ be an individual, that is, a sub-ensemble, and let $\mathcal{P} = \{\mathcal{S}_1, \ldots, \mathcal{S}_i, \ldots, \mathcal{S}_{N_{pop}}\}$ be a population of $N_{pop}$ individuals. The first step of the genetic algorithm is the random generation of the first population (denoted $\mathcal{P}_0$). Each $\mathcal{S}_i$ randomly collects an arbitrary number of models of the full ensemble $\mathcal{E}$. Then, three important steps generate the population $\mathcal{P}_{k+1}$ based on $\mathcal{P}_k$:

[32] 1. Selection: a few individuals are selected according to some method. In practice, we select the best half of the individuals with respect to the score.

[33] 2. Crossover: among the selected individuals, a crossover is carried out. Two parents $\mathcal{S}_a$ and $\mathcal{S}_b$ create two new children $\mathcal{S}_c$ and $\mathcal{S}_d$. All the models of $\mathcal{S}_a$ and $\mathcal{S}_b$ are randomly dispatched into $\mathcal{S}_c$ and $\mathcal{S}_d$. The list of models in an individual can be seen as its genetic print. A new population denoted $\mathcal{P}_{k+1}$ is generated with $N_{pop}/2$ parents and $N_{pop}/2$ children.

[34] 3. Mutation: each individual of the population $\mathcal{P}_{k+1}$ produced by the crossover can mutate. In our case, a model can be replaced by another one, removed from an individual or added to an individual. The mutated individuals constitute the new population $\mathcal{P}_{k+1}$.
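The three steps above, repeated for a fixed number of iterations, can be sketched as follows. This is a minimal illustration, not the implementation used in this study: members are identified by their indices, the score is any function of a set of indices, and the population size, the mutation rule (one move drawn uniformly among "none", "add", "remove" and "replace") and the toy score are assumptions.

```python
import random

def genetic_calibration(n_members, score, n_pop=20, n_iter=50, seed=0):
    """Minimize score(sub_ensemble) over non-empty subsets of range(n_members)."""
    rng = random.Random(seed)
    all_members = list(range(n_members))

    def random_individual():
        return frozenset(rng.sample(all_members, rng.randint(1, n_members)))

    def crossover(a, b):
        # Randomly dispatch every model of both parents into two children.
        c, d = set(), set()
        for model in sorted(a) + sorted(b):
            (c if rng.random() < 0.5 else d).add(model)
        return frozenset(c) or a, frozenset(d) or b  # fall back on a parent if empty

    def mutate(individual):
        members = set(individual)
        outside = [m for m in all_members if m not in members]
        move = rng.choice(["none", "add", "remove", "replace"])
        if move == "remove" and len(members) > 1:
            members.remove(rng.choice(sorted(members)))
        elif move == "add" and outside:
            members.add(rng.choice(outside))
        elif move == "replace" and outside:
            members.remove(rng.choice(sorted(members)))
            members.add(rng.choice(outside))
        return frozenset(members)

    population = [random_individual() for _ in range(n_pop)]
    for _ in range(n_iter):
        parents = sorted(population, key=score)[: n_pop // 2]  # selection
        children = []
        while len(children) < n_pop - len(parents):  # crossover
            a, b = rng.sample(parents, 2)
            children.extend(crossover(a, b))
        children = children[: n_pop - len(parents)]
        population = [mutate(ind) for ind in parents + children]  # mutation
    return min(population, key=score)

# Toy score (hypothetical): absolute mean "bias" of the selected members.
biases = [(-1) ** m * (m + 1) for m in range(12)]
toy_score = lambda s: abs(sum(biases[m] for m in s)) / len(s)
best = genetic_calibration(12, toy_score)
print(sorted(best), toy_score(best))
```

In an actual calibration, `score` would be one of the ensemble scores of the previous section, evaluated on the simulations of the selected members.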

[35] The operation is repeated until some stopping criterion has been satisfied, e.g., when a given number of iterations is reached. The final population contains many individuals that are better (with respect to the cost function) than those of the initial population. It is the best individual of the final population that is considered as the calibrated ensemble.

Simulated Annealing

[36] Simulated annealing, described by Kirkpatrick et al. [1983], is a basic optimization method inspired by a thermodynamic process. Each sub-ensemble of the search space is analogous to a state of some physical system.

[37] In our case, the first state is simply a randomly generated sub-ensemble. The current state has many neighbor states, which correspond to the current state with a unit change, that is, a removed, added or replaced model in the sub-ensemble. Let $\mathcal{S}$ be the current sub-ensemble and $\mathcal{S}'$ be a neighbor sub-ensemble. $\mathcal{S}'$ is a new sub-ensemble which is randomly built from the current sub-ensemble with one removed, added or replaced model. In order to minimize (resp. maximize) a score J, two transitions to the neighbor are possible:

[38] 1. If the score $J(\mathcal{S}')$ is lower (resp. higher) than $J(\mathcal{S})$, then the current sub-ensemble moves to the neighbor sub-ensemble. $\mathcal{S}'$ becomes the current sub-ensemble and another neighbor is generated.

[39] 2. If $J(\mathcal{S}')$ is greater (resp. lower) than $J(\mathcal{S})$, moving to $\mathcal{S}'$ is allowed to occur with an acceptance probability. This acceptance probability is equal to $\exp\left(-\frac{J(\mathcal{S}') - J(\mathcal{S})}{T}\right)$ (resp. $\exp\left(\frac{J(\mathcal{S}') - J(\mathcal{S})}{T}\right)$), where T is called temperature and is decreased after each iteration. A state movement is carried out if $u < \exp\left(-\frac{J(\mathcal{S}') - J(\mathcal{S})}{T}\right)$, where u is a random number uniformly drawn from [0, 1]. At the beginning of the algorithm, the temperature is high, so the acceptance probability is high. Thus, the probability of switching to a neighbor is higher than at the end of the algorithm.

[40] At the end of the process, the best state encountered in all the iterations, i.e., the best sub-ensemble, is taken as the calibrated ensemble.
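A minimal sketch of this simulated annealing, for the minimization of a score, is given below. The geometric cooling schedule (`cooling`), the initial temperature `t0`, the number of iterations and the toy score are assumptions made for illustration; members are identified by their indices.

```python
import math
import random

def simulated_annealing(n_members, score, n_iter=2000, t0=1.0, cooling=0.995, seed=0):
    """Minimize score(sub_ensemble); return the best sub-ensemble met and its score."""
    rng = random.Random(seed)
    all_members = list(range(n_members))
    current = set(rng.sample(all_members, rng.randint(1, n_members)))
    current_score = score(current)
    best, best_score = set(current), current_score
    temperature = t0
    for _ in range(n_iter):
        # Build a neighbor with one removed, added or replaced model.
        neighbor = set(current)
        outside = [m for m in all_members if m not in neighbor]
        move = rng.choice(["add", "remove", "replace"])
        if move == "remove" and len(neighbor) > 1:
            neighbor.remove(rng.choice(sorted(neighbor)))
        elif move == "add" and outside:
            neighbor.add(rng.choice(outside))
        elif move == "replace" and outside:
            neighbor.remove(rng.choice(sorted(neighbor)))
            neighbor.add(rng.choice(outside))
        neighbor_score = score(neighbor)
        delta = neighbor_score - current_score
        # Always accept an improvement; accept a degradation with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = neighbor, neighbor_score
        if current_score < best_score:
            best, best_score = set(current), current_score
        temperature *= cooling  # assumed geometric cooling
    return best, best_score

# Toy score (hypothetical): absolute mean "bias" of the selected members.
biases = [(-1) ** m * (m + 1) for m in range(12)]
toy_score = lambda s: abs(sum(biases[m] for m in s)) / len(s)
best, best_score = simulated_annealing(12, toy_score)
print(sorted(best), best_score)
```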

3. Application to a 101-Member Ensemble

[41] We consider the 101-member ensemble, launched throughout the year 2001 over Europe and described in detail by Garaud and Mallet [2010]. The ensemble was automatically generated for the simulation of ground-level ozone, with a horizontal resolution of half a degree. Each member of the ensemble is a unique combination of physical parameterizations, numerical schemes and input data. For instance, the members can differ in the chemical mechanism (RACM or RADM2), the computation of the vertical diffusion coefficient (Louis' or Troen&Mahrt's parameterizations), the vertical resolution (5 or 9 levels) or the perturbation of the meteorological fields (wind, temperature, etc.) and emission sources. About 30 alternatives are available for the generation of a member. The generated ensemble contains very different members and has a wide spread. The following subsections deal with the assessment of this ensemble and its calibration according to ensemble scores previously mentioned.

3.1. Evaluation of the Ensemble

[42] In this sub-section, we quickly review the performance of the models and then of the ensemble.

[43] The ensemble evaluation is carried out using the observation network Airbase. This database, managed by the European Environment Agency, provides ground-level ozone observations at 210 rural background, 702 rural, 647 suburban and 1324 urban stations across Europe.

[44] Stations that fail to provide observations at over 10% of all the dates considered are discarded as the scores at these stations may not be reliable. In order to have stations which are representative of the ozone peak concentration at the model scale (half a degree in the horizontal), only rural and background stations are kept. There are about 123,000 observations for ozone peaks during the year 2001. Following usual recommendations [Russell and Dennis, 2000; Hogrefe et al., 2001; U.S. Environmental Protection Agency, 1991], a cut-off is applied to the observations. Observations below 40 μg m−3 are discarded so as to focus on the most harmful concentrations.

3.1.1. Models Skills

[45] The different models show quite different skills and performances. The spatio-temporal mean of ground-level ozone peaks ranges from 60 to 130 μg m−3. Their variability also differs widely: the global standard deviation of ozone peak simulations ranges between 17 and 44 μg m−3.

[46] Figure 1 shows the performance, compared to the observations, of the 101 simulations in a single diagram. This Taylor diagram [Taylor, 2001] takes into account the standard deviation of the observations and the correlation between each simulation and the observations. The radial coordinate of the Taylor diagram corresponds to $\sigma_x / \sigma_y$, where $\sigma_x$ is the empirical standard deviation $\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ of the simulated sequence $(x_i)_{i=1,\ldots,n}$, with $\bar{x}$ its mean, and $\sigma_y$ is the empirical standard deviation of the observed sequence $(y_i)_{i=1,\ldots,n}$. The azimuth is the arccosine of the correlation between $(x_i)_{i=1,\ldots,n}$ and $(y_i)_{i=1,\ldots,n}$. The lower the azimuth, the higher the correlation between a simulation and the observations. A Taylor diagram shows the performance of an ensemble of simulations in terms of correlation, the variability of each simulation compared with the observed variability, and the spread of these performances. Although a large number of simulations show less variability than the observations, a number of members still show good variability. The correlations range between 0.3 and 0.77.
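The position of one simulation in a Taylor diagram can be computed as in the following sketch. The function name and the toy sequences are ours, and the standard deviations are assumed to be normalized by n.

```python
import math

def taylor_coordinates(simulated, observed):
    """Return (normalized std, azimuth in radians, correlation) for a Taylor diagram."""
    n = len(simulated)
    mean_x = sum(simulated) / n
    mean_y = sum(observed) / n
    var_x = sum((x - mean_x) ** 2 for x in simulated) / n
    var_y = sum((y - mean_y) ** 2 for y in observed) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(simulated, observed)) / n
    sigma_x, sigma_y = math.sqrt(var_x), math.sqrt(var_y)
    rho = cov / (sigma_x * sigma_y)
    rho = max(-1.0, min(1.0, rho))  # guard against rounding just outside [-1, 1]
    return sigma_x / sigma_y, math.acos(rho), rho

# Toy sequences (invented values, in μg m−3).
observed = [80.0, 95.0, 120.0, 110.0, 70.0]
simulated = [85.0, 90.0, 105.0, 108.0, 82.0]
radius, azimuth, rho = taylor_coordinates(simulated, observed)
print(f"radius = {radius:.2f}, azimuth = {math.degrees(azimuth):.1f} deg, rho = {rho:.2f}")
```

A radius below 1 indicates a simulation with less variability than the observations.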

Figure 1.

Taylor plots of ozone peak averaged over stations. The radial coordinate is the standard deviation normalized by the standard deviation of observations. The angles between the abscissa axis and the lines correspond to the arccosine of the correlation ρ between each simulation and observations.

[47] This shows that the ensemble has a wide variety and that the models can have very different statistical measures and performance. A few models have weak skill, i.e., a high RMSE (up to 29.6 μg m−3) and a low correlation (down to 0.3). However, these models should not be discarded because they can bring useful information. Figure 2 shows the number of times each model is closer to an observation than any other model. Most of the bars are close to the mean (1091 observations), and all the members give the closest concentrations to the observations a significant number of times. In the worst case, the count is about half the mean count. The worst model in terms of RMSE and correlation gives the closest concentrations to 1061 observations, which is about the average performance. This means that even if a member shows a poor performance on average, it still brings useful information in some regions and at some dates.

Figure 2.

Best models count for ozone peaks on the network Airbase. A model is counted “best” when the discrepancy between the simulated concentration and the observation is minimal. The count is carried out for all observations.
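The count shown in Figure 2 can be sketched as follows, with invented toy data. Breaking ties in favor of the lowest model index is an assumption.

```python
def best_model_counts(simulations, observations):
    """simulations[j][i] is the concentration of model j at observation i."""
    counts = [0] * len(simulations)
    for i, y in enumerate(observations):
        errors = [abs(sim[i] - y) for sim in simulations]
        counts[errors.index(min(errors))] += 1  # ties go to the lowest model index
    return counts

simulations = [
    [100.0, 90.0, 120.0, 60.0],  # model 0
    [110.0, 70.0, 135.0, 80.0],  # model 1
    [85.0, 95.0, 128.0, 75.0],   # model 2
]
observations = [104.0, 72.0, 130.0, 77.0]
print(best_model_counts(simulations, observations))  # → [1, 1, 2]
```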

3.1.2. Ensemble Scores

Reliability Diagram

[48] Figure 3 shows the reliability diagram for the event [O3] ≥ 120 μg m−3. The ensemble shows a reasonable performance since the diagram roughly follows the diagonal. Below the forecast probability 0.4, the ensemble overforecasts the event occurrence since the reliability curve is below the diagonal. On the other hand, the ensemble underforecasts the event occurrence when the forecast probabilities are greater than 0.4. The diagram shows that the ensemble has an acceptable resolution. An ensemble with lower resolution would have a flatter reliability diagram, closer to the climatological forecast. Unfortunately, for an event based on a higher concentration, such as [O3] ≥ 180 μg m−3, the ensemble leads to a poor reliability diagram. This can be explained by the very low occurrence of the event (about 0.6% of all cases) and by the sharpness histogram. Two sharpness histograms are shown in Figure 4 and represent the frequency of the forecast probabilities for the two previous events. The sharpness indicates the tendency of an ensemble to provide probabilities near 0 or 1. The forecast probabilities provided for the first event (120 μg m−3) are most often close to 0: most of the time, no simulation exceeds the threshold, so that the ensemble gives a zero probability of the event occurring. For the threshold 180 μg m−3, the sharpness histogram is even more concentrated near 0 since over 98% of forecast probabilities are less than 0.1. As the number of forecast probabilities greater than 0.1 is so low, it seems difficult to correctly build a reliability diagram. Hence for the threshold 180 μg m−3, the calibration cannot be carried out using the reliability diagram.

Figure 3.

Reliability diagram of the ensemble for ozone peaks. The ozone concentration threshold is 120 μg m−3. The black line corresponds to a perfect reliability diagram. The dashed horizontal line is the value of the climatological forecast.

Figure 4.

Sharpness histograms for two ozone concentration thresholds: (a) 120 μg m−3 and (b) 180 μg m−3.

Rank Histogram

[49] Figure 5 is the rank histogram of the 101-member ensemble for ozone peaks. The histogram does not show any extremely low or extremely high bar, but several bars have half the height they should have and several others are significantly higher than expected. The first bar, which corresponds to the number of observations below the lower envelope of the ensemble, is especially high. It means that, at certain locations and dates, the spread of the ensemble is insufficient to cover the observations. The measure of the flatness described in the section is 148.
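The rank histogram and its flatness measure can be sketched as follows. The function names are ours; the sketch assumes that the flatness measure is the sum of squared deviations from the flat height M/(N + 1), normalized by its expectation MN/(N + 1) for a reliable ensemble, so that a value close to 1 is ideal.

```python
def rank_histogram(ensembles, observations):
    """ensembles[m] holds the N member values for observation m; returns N + 1 counts."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, y in zip(ensembles, observations):
        rank = sum(x < y for x in members)  # number of members strictly below y
        counts[rank] += 1
    return counts

def flatness_score(counts):
    """Squared deviation from a flat histogram, normalized by its expected value."""
    n_members = len(counts) - 1
    m = sum(counts)
    flat_height = m / (n_members + 1)
    delta = sum((r - flat_height) ** 2 for r in counts)
    delta_0 = m * n_members / (n_members + 1)
    return delta / delta_0

# Toy case: the same 3-member ensemble at four dates, one observation per rank.
ensembles = [[1.0, 2.0, 3.0]] * 4
counts = rank_histogram(ensembles, [0.5, 1.5, 2.5, 3.5])
print(counts, flatness_score(counts))  # → [1, 1, 1, 1] 0.0
print(flatness_score([4, 0, 0, 0]))    # → 4.0
```

In the second call, all four observations lie below the lower envelope of the ensemble, which is the situation indicated by a high first bar.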

Figure 5.

Rank histogram of the 101-member ensemble on network Airbase for ozone peaks. The horizontal dashed line corresponds to the ideal value for a flat rank histogram with respect to the number of members. The large number of observations on the left means there are many observations below the lower envelope of the ensemble.

Brier Score and DRPS

[50] The Brier score, Brier skill score and discrete ranked probability score are computed with the full ensemble, with the “best” model alone and with the climatological forecast. The “best” model is the member from the full ensemble that minimizes or maximizes the given score. The climatological forecast is given by the all-year relative (observed) occurrence frequency of the event. These different scores are reported in Table 1. The DRPS is computed with the threshold exceedances for 80, 100, 120, 140 and 160 μg m−3.

Table 1. Brier Scores and Brier Skill Scores for the Event [O3] ≥ 120 μg m−3 for the Ensemble, the “Best” Model With Respect to the Score, and the Climatological Forecast (a)

Score          Full Ensemble            Best Model               Climatology
Brier          $76 \times 10^{-3}$      $95 \times 10^{-3}$      $113 \times 10^{-3}$
Brier skill    $32.7 \times 10^{-2}$    $15.6 \times 10^{-2}$    0.0
DRPS           $90.3 \times 10^{-3}$    $124 \times 10^{-3}$     $130 \times 10^{-3}$

(a) The DRPS is computed with the threshold exceedances for 80, 100, 120, 140 and 160 μg m−3.

[51] It is interesting to notice that the “best” model is always the same for all scores and corresponds to the model which has the smallest RMSE (20.5 μg m−3). This “best” model is always better than the climatological forecast. It should, however, be noted that, first, one model can only provide probabilities equal to 0 or 1 and secondly, a large majority of the models have worse scores than the climatological forecast. For instance, over 77% of the models have a negative Brier skill score for the 120 μg m−3 threshold exceedance. Whatever the score, the full ensemble always performs better than the “best” model. Consequently it seems that an ensemble is necessary to provide forecast probabilities which are more accurate than probabilities provided by a single model.

3.2. Calibration

3.2.1. Reliability Diagram

[52] We introduce the average probability $\bar{p}_k$ of all forecast probabilities lying in the interval $[p_{k-1}, p_k]$. As described in the section, a perfect reliability leads to $\bar{p}_k = O_k$. In order to have an optimized reliability diagram, the calibration method is therefore carried out with the mean squared error of the diagram. The score to minimize can be written as

$$\frac{1}{K} \sum_{k=1}^{K} (\bar{p}_k - O_k)^2 \qquad (10)$$
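A reliability diagram and a mean-squared-error score of the kind described above can be sketched as follows. The function names are ours, and the sketch relies on several assumptions: equal-width bins, half-open except for the last one, and an unweighted mean over the non-empty bins only.

```python
def reliability_diagram(probs, occurred, n_bins=10):
    """Return (mean forecast probability, observed frequency, count) per non-empty bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, o in zip(probs, occurred):
        k = min(int(p * n_bins), n_bins - 1)  # p = 1 falls in the last bin
        sums[k] += p
        hits[k] += o
        counts[k] += 1
    return [(sums[k] / counts[k], hits[k] / counts[k], counts[k])
            for k in range(n_bins) if counts[k] > 0]

def reliability_mse(diagram):
    """Unweighted mean squared distance to the diagonal over the non-empty bins."""
    return sum((p_bar - o_k) ** 2 for p_bar, o_k, _ in diagram) / len(diagram)

# Toy forecasts (invented): a slightly overconfident system.
probs = [0.05, 0.05, 0.95, 0.95]
occurred = [0, 0, 1, 1]
diagram = reliability_diagram(probs, occurred)
print(diagram)
print(reliability_mse(diagram))  # close to 0.0025
```

The third element of each tuple is the number of forecasts in the bin, which corresponds to the sharpness histogram and indicates how much weight each point of the diagram deserves.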

[53] We consider the event [O3] ≥ 120 μg m−3, and we apply the genetic algorithm and the simulated annealing. Figure 6 shows the two resulting reliability diagrams. The calibrated diagrams are better than the reliability diagram of the full ensemble since they are closer to the diagonal. The 35-member calibrated ensemble from the genetic algorithm is very reliable and has a mean squared error lower than $10^{-5}$. As the reliability is improved, the Brier skill scores of the two calibrated sub-ensembles are equal to $34 \times 10^{-2}$ and $35 \times 10^{-2}$, which represents slight improvements compared with the full ensemble. The Brier score decomposition shows that the reliability term is better after calibration whereas the resolution term is slightly worse. For the best calibrated sub-ensemble (genetic algorithm), the reliability term decreases by about 93% while the resolution term decreases by about 1%. Candille and Talagrand [2005] show that there is a compromise between reliability and resolution. Thus, resolution can be degraded when reliability is improved. Nevertheless, this calibration dedicated to improving reliability degrades resolution only very slightly.

Figure 6.

Calibrated reliability diagrams for the event [O3] ≥ 120 μg m−3 from the simulated annealing and the genetic algorithm. The dashed line corresponds to the value of the climatological forecast.

3.2.2. Rank Histogram

[54] We now apply the calibration with criterion (2) so as to get a flat rank histogram. Note that it is desirable to obtain a sub-ensemble with the largest number of models so that an accurate uncertainty estimation can be produced. It is possible to obtain a perfectly flat diagram with just one model, provided that half the observations are below the model concentrations and half the observations are above; but one model cannot help in providing an uncertainty estimation.
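A sub-ensemble selection of this kind can be sketched with a simple simulated annealing. Everything specific here is an assumption made for illustration: the flatness score (mean squared deviation of the bars from the flat height) stands in for criterion (2), the move (adding or removing one random member), the linear cooling schedule and the `min_members` guard against degenerate one-model solutions are hypothetical choices, and the data are synthetic.

```python
import math
import random


def rank_histogram(ensemble, observations):
    """Count, for each rank r, the observations that exceed exactly r members."""
    counts = [0] * (len(ensemble) + 1)
    for i, obs in enumerate(observations):
        rank = sum(member[i] < obs for member in ensemble)
        counts[rank] += 1
    return counts


def flatness_score(ensemble, observations):
    """Assumed score: mean squared deviation of the bars from the flat height."""
    counts = rank_histogram(ensemble, observations)
    ideal = len(observations) / len(counts)
    return sum((c - ideal) ** 2 for c in counts) / len(counts)


def calibrate(ensemble, observations, n_iter=1000, t0=1.0, min_members=2, seed=0):
    """Select a sub-ensemble by simulated annealing; returns (indices, score)."""
    rng = random.Random(seed)
    selected = set(range(len(ensemble)))

    def score(indices):
        return flatness_score([ensemble[m] for m in sorted(indices)], observations)

    current = score(selected)
    best, best_score = set(selected), current
    for it in range(n_iter):
        temperature = t0 * (1.0 - it / n_iter) + 1e-9  # linear cooling (assumed)
        candidate = set(selected)
        candidate ^= {rng.randrange(len(ensemble))}  # add or remove one member
        if len(candidate) < min_members:
            continue
        new = score(candidate)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if new <= current or rng.random() < math.exp((current - new) / temperature):
            selected, current = candidate, new
            if current < best_score:
                best, best_score = set(selected), current
    return sorted(best), best_score


# Toy usage with synthetic data (not the ozone ensemble of the paper).
rng = random.Random(1)
observations = [rng.gauss(0.0, 1.0) for _ in range(100)]
ensemble = [[rng.gauss(0.2 * m - 1.0, 1.0) for _ in range(100)] for m in range(10)]
members, score = calibrate(ensemble, observations)
```

Because the best sub-ensemble is tracked from the full ensemble onward, the returned score can never be worse than that of the full ensemble.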

[55] The calibration results depend on the height of the highest bar (here, the left bar) of the full-ensemble histogram. All observations with rank 0 (left bar) are below the lower envelope of the ensemble. For any sub-ensemble, the height of the left bar cannot be lower than the number r0 of observations below the lower envelope. In a flat histogram, at best, the height of the left bar is still r0 and all the bars have the same height. In this case, there cannot be more than 34 members (which is deduced from the total number of observations divided by r0). Figure 7 is the rank histogram of the sub-ensemble calibrated using simulated annealing. There are 33 members and the flatness score is about 6 instead of 148 for the full ensemble.
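The bound discussed above can be sketched as follows. In a flat histogram with M members there are M + 1 bars of equal height N / (M + 1), where N is the number of observations, and this height cannot be lower than r0, so M + 1 ≤ N / r0. The sketch counts bars and then subtracts one to obtain members; whether the figure of 34 quoted in the text refers to bars or to members cannot be determined from this section, so this convention is an assumption. The function name is hypothetical.

```python
def max_calibrated_members(ensemble, observations):
    """Upper bound on the size of a sub-ensemble with a flat rank histogram.

    `ensemble[m][i]` is the value of member m at observation i. Returns None
    when no observation lies below the lower envelope (no bound from r0).
    """
    n_obs = len(observations)
    lower_envelope = [min(member[i] for member in ensemble) for i in range(n_obs)]
    # r0: observations below every member, hence of rank 0 in any sub-ensemble.
    r0 = sum(obs < low for obs, low in zip(observations, lower_envelope))
    if r0 == 0:
        return None
    max_bars = n_obs // r0  # each of the M + 1 bars must be at least r0 high
    return max_bars - 1
```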

Figure 7.

Rank histogram of the calibrated ensemble on the Airbase network for ozone peaks. The horizontal dotted line corresponds to the ideal value for a flat rank histogram according to the number of members.

[56] This calibrated sub-ensemble also improves the Brier scores and the DRPS. For the same events as before, the Brier skill score and DRPS are 36 × 10−2 and 90 × 10−3, respectively. It is interesting to notice that the reliability term (from the DRPS decomposition (9)) is decreased by 90%, while the resolution remains unchanged. This is consistent with the fact that the rank histogram is an ensemble score which measures reliability.

3.2.3. DRPS

[57] The calibration according to the DRPS gives DRPScalib = 66 × 10−3. The DRPS of the full ensemble is reduced by 15%. The reliability part (see (9)) is reduced by 47% and the resolution part by 10%.

[58] For all ensemble scores, the calibration provides well balanced sub-ensembles. They are always better than the full ensemble, the best model or the climatology. The calibrated sub-ensembles also improve the reliability. However, the resolution essentially remains the same. As for the Brier score decomposition (8), the resolution term does not depend directly on the agreement between the forecast probability and the event occurrences. The improvement in the resolution depends on the definition of forecast probabilities bins described in paragraph and [Candille and Talagrand, 2005]. The ensemble calibration essentially improves the quality of forecast probabilities, i.e., the reliability, rather than the variance of the occurrence frequency Ok.
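The roles of the two terms can be illustrated with the standard decomposition of the Brier score into reliability, resolution and uncertainty, which is assumed here to correspond to decomposition (8). In this sketch the forecasts are grouped by distinct probability values rather than by the bins used in the paper, which makes the identity exact; the function name is hypothetical.

```python
from collections import defaultdict


def brier_decomposition(probabilities, occurrences):
    """Return (reliability, resolution, uncertainty), with BS = rel - res + unc."""
    n = len(occurrences)
    base_rate = sum(occurrences) / n
    groups = defaultdict(list)  # outcomes grouped by issued probability
    for p, o in zip(probabilities, occurrences):
        groups[p].append(o)
    reliability = 0.0
    resolution = 0.0
    for p, outcomes in groups.items():
        frequency = sum(outcomes) / len(outcomes)  # observed frequency O_k
        reliability += len(outcomes) * (p - frequency) ** 2
        resolution += len(outcomes) * (frequency - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty
```

The resolution term only involves the observed frequencies in each group and the overall base rate, not the issued probabilities themselves, which is why improving the agreement between probabilities and frequencies changes the reliability term but leaves the resolution term largely untouched.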

4. Uncertainty Estimation

[59] We now analyze the uncertainty estimation based on the sub-ensemble calibrated for the rank histogram. This calibration is chosen because it is related to the probability distribution of ozone concentrations, whereas the other scores are used to assess an ensemble for specific events.

[60] The uncertainty can be estimated with the (empirical) standard deviation of the ensemble. A monthly average of the standard deviation of the calibrated ensemble is computed in each cell of the domain studied. Figure 8 shows the corresponding uncertainty map over Europe, averaged over June 2001. A higher ozone uncertainty appears along the southern coasts of Europe. This is consistent with a well-known difficulty of predicting ozone along the coasts, mainly because of poor representation of winds and turbulence in these areas.
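The construction of such a map can be sketched in a few lines. The data layout `ensemble[m][t][c]` and the function name are hypothetical, and the sample standard deviation (with the n − 1 denominator) is an assumption, since the text only says "empirical".

```python
from statistics import mean, stdev


def uncertainty_map(ensemble):
    """Time-averaged ensemble standard deviation in each cell.

    `ensemble[m][t][c]` is the concentration of member m at time t in cell c.
    At least two members are required for the standard deviation.
    """
    n_times = len(ensemble[0])
    n_cells = len(ensemble[0][0])
    result = []
    for c in range(n_cells):
        # Spread across members at each time step, then average over time.
        spreads = [stdev([member[t][c] for member in ensemble])
                   for t in range(n_times)]
        result.append(mean(spreads))
    return result
```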

Figure 8.

Monthly average of ozone uncertainty from a calibrated sub-ensemble for June 2001 across Europe (μg m−3).

[61] Before presenting further results, it is important to assess the robustness of the calibration method. One question is the spatial robustness. A calibrated sub-ensemble is spatially robust if it is still reliable at non-observed locations. In order to check this robustness, we randomly exclude stations from the calibration, and assess the calibration on the remaining stations.

[62] Figure 9 shows all observation stations previously used to compute the ensemble scores and to calibrate the ensemble. This network is randomly split into two sub-networks (cyan and yellow). The rank histogram calibration is then carried out on each sub-network, that is, using only the observations of the sub-network. Figure 10 shows four rank histograms for the two calibrated sub-ensembles. At the top of the figure, the calibrated rank histograms are shown, each computed with the observations used for their calibration. At the bottom, the rank histograms are computed using the observations of the other sub-network. The rank histograms are almost flat, which shows that the calibration is robust. It is noteworthy that the two sub-ensembles have a similar number of members (27 and 28 members for the “cyan” and “yellow” sub-ensembles, respectively).
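The cross-validation procedure can be outlined as follows. The sketch is generic: `calibrate` and `score` are hypothetical callables supplied by the user (for instance a rank-histogram calibration and a flatness score), and the half-and-half split is an assumption about how the two random sub-networks were drawn.

```python
import random


def split_network(station_ids, seed=0):
    """Randomly partition the stations into two sub-networks of similar size."""
    shuffled = list(station_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


def cross_validate(station_ids, calibrate, score, seed=0):
    """Calibrate on each sub-network and score on both sub-networks.

    `calibrate(stations)` returns a sub-ensemble; `score(sub_ensemble, stations)`
    returns an ensemble score computed with the observations of `stations`.
    """
    cyan, yellow = split_network(station_ids, seed)
    results = {}
    for name, train, test in (("cyan", cyan, yellow), ("yellow", yellow, cyan)):
        sub_ensemble = calibrate(train)
        results[name] = {
            "own": score(sub_ensemble, train),    # stations used for calibration
            "other": score(sub_ensemble, test),   # independent stations
        }
    return results
```

Similar "own" and "other" scores for both sub-ensembles would indicate spatial robustness in the sense used here.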

Figure 9.

The two random sets of stations over Europe. These two sub-networks are used to assess the spatial robustness of the ensemble calibration method. The two sub-networks are a partition of the full network: each station of the full network belongs to one and only one sub-network.

Figure 10.

Rank histograms of the calibrated sub-ensembles on the two random sub-networks. The calibrated rank histograms of the (a) cyan and (b) yellow sub-ensembles. The rank histograms computed (c) from the yellow sub-ensemble on the cyan sub-network and (d) from the cyan sub-ensemble on the yellow sub-network.

[63] We can now compare the uncertainty estimation maps from the two previous calibrated sub-ensembles. Figure 11 shows the uncertainty estimation of the two calibrated sub-ensembles from the two previous random sub-networks. The spatial structures are similar. The high and low uncertainty values are located at the same places. In Figure 12, these uncertainty maps are also compared with the uncertainty map obtained after calibration with all observations. The relative difference between these maps is about 3% on average, and marginally exceeds 10%. For reference, the figure also shows the relative difference with the uncertainty derived from the full ensemble.

Figure 11.

Temporal average of uncertainty estimation in μg m−3 from two sub-ensembles which were calibrated with two random sub-networks over Europe. Uncertainty map from (a) the cyan network and (b) the yellow network for June 2001.

Figure 12.

Relative discrepancy on uncertainty fields (averaged over June) between the sub-ensemble calibrated with all observations and (a) the sub-ensemble calibrated on the cyan sub-network, (b) the sub-ensemble calibrated on the yellow sub-network, and (c) the full ensemble. For example, the relative discrepancy (Figure 12c) is defined (pointwise) as the difference between the averaged uncertainty obtained with the full ensemble and the averaged uncertainty obtained with the calibrated sub-ensemble, divided by the latter.

[64] Besides spatial robustness, the previous results also show that here, half of the observations are sufficient to calibrate an ensemble and estimate uncertainties. This raises the question of how many observations are needed for the calibration. An experiment was carried out to estimate this number. First, a rank histogram is computed for the full ensemble with about 30,000 hourly observations. These observations are selected arbitrarily. Then, observations are randomly removed and the rank histogram is computed again. After a few iterations, we can compare several rank histograms with a different number of observations. Figure 13 shows two rank histograms of the full ensemble with about 32,500 observations and about 10,100 observations. Their shapes are very similar. Below 8000 observations, the shape of the rank histogram starts changing. So we conclude that 8000–10,000 observations are required to assess the quality of the 101-member ensemble.
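This experiment can be sketched as follows. The comparison in the paper appears to be visual; the maximum absolute difference between normalized bar heights used here is a hypothetical measure introduced only to make the sketch quantitative, and the function names are assumptions.

```python
import random


def rank_frequencies(ensemble, observations, indices):
    """Normalized rank histogram computed on a subset of the observations."""
    counts = [0] * (len(ensemble) + 1)
    for i in indices:
        rank = sum(member[i] < observations[i] for member in ensemble)
        counts[rank] += 1
    return [c / len(indices) for c in counts]


def shape_change(ensemble, observations, sizes, seed=0):
    """Largest change in bar height when only `size` random observations are kept."""
    rng = random.Random(seed)
    all_indices = list(range(len(observations)))
    reference = rank_frequencies(ensemble, observations, all_indices)
    changes = {}
    for size in sizes:
        subset = rng.sample(all_indices, size)  # random removal of observations
        freqs = rank_frequencies(ensemble, observations, subset)
        changes[size] = max(abs(f - r) for f, r in zip(freqs, reference))
    return changes
```

Plotting the returned values against the subset size would show the number of observations below which the histogram shape starts to drift.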

Figure 13.

Rank histograms with a different number of observations: (a) about 32,500 and (b) 10,100 observations.

[65] A similar experiment was carried out to determine the number of observations needed for the calibration to be reliable. The full ensemble is calibrated a few times with a total number of observations (initially 32,500) divided by 2, 3, 5, 8 and 13. The calibrated ensembles contain a similar number of members, ranging from 22 to 27. The rank histograms for the calibrated ensembles are then computed, each time with the observations used in the calibration. The rank histograms remain flat in every case. The uncertainty estimations start to depend on the number of observations when there are fewer than 8000–10,000 observations.

[66] Another question is the impact of observational errors on the calibration and on the uncertainty estimation [Anderson, 1996; Hamill, 2001]. The rank histogram checks whether two random variables sample the same distribution. Noise in the observations should therefore be added to the ensemble so that we can check that the ensemble samples the real uncertainty without observation noise. Let $x_i^m(t)$ be the simulated concentration at station $i$ and date $t$ for the model $m$. We assume that observational errors do not depend on the station and date. We introduce the perturbed concentrations $\tilde{x}_i^m(t) = x_i^m(t)(1 + \alpha_i^m)$ where $\alpha_i^m$ follows a uniform distribution on the interval [−ɛ, ɛ]. This form allows us to introduce a noise relative to the concentration, which is a usual feature for ozone observations. Based on work by Airparif [2007], ɛ ≃ 0.13 for ozone peak concentrations measured over the year 2009 at about 30 stations from the Airparif monitoring network (in the Paris region). This noise is introduced before the calibration. The calibrated ensemble with perturbation (ɛ = 0.13) shows a flat rank histogram, and the resulting uncertainty estimations are plotted in Figure 14. The values and spatial patterns of the standard deviation are very similar to those of the calibration without perturbations. The observation errors therefore seem to have a limited impact on the calibration.
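The perturbation can be sketched as a multiplicative uniform noise. Following the indices of the perturbation coefficient, the sketch draws one coefficient per member and station and applies it to all dates; this reading of the notation, the data layout and the function name are assumptions.

```python
import random


def perturb(ensemble, epsilon=0.13, seed=0):
    """Apply x * (1 + alpha) with alpha drawn uniformly in [-epsilon, epsilon].

    `ensemble[m][i][t]` is the concentration of member m at station i and date t.
    One alpha is drawn per (member, station) pair (assumed reading of the notation).
    """
    rng = random.Random(seed)
    perturbed = []
    for member in ensemble:
        new_member = []
        for series in member:
            alpha = rng.uniform(-epsilon, epsilon)
            new_member.append([x * (1.0 + alpha) for x in series])
        perturbed.append(new_member)
    return perturbed
```

The perturbed ensemble would then be passed to the calibration in place of the original simulations.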

Figure 14.

Uncertainty estimations at stations (a) for the reference calibration and (b) for the calibration with perturbed simulations (ɛ = 0.13).

[67] Finally, we investigate the robustness of the calibration over time. A calibration is carried out during a learning period, and the relevance of this calibration is evaluated for a forecast period. The sub-ensemble selected based on the learning period is referred to as an a priori sub-ensemble. The quality of the forecast is measured by comparing the a priori sub-ensemble and the a posteriori sub-ensemble that is calibrated over the forecast period.

[68] The learning period is a week, from April 3rd to April 9th, with 50,000 hourly ozone observations. It is an arbitrarily chosen period. The forecast period ranges from April 10th to April 16th. Figure 15 shows the uncertainty map computed during the learning period and the forecast uncertainty map. These maps clearly show different patterns, e.g., with higher forecast uncertainties over the North Sea, over France and Germany, and with lower forecast uncertainties over several parts of the Mediterranean Sea. This, and tests not reported here, show that the uncertainty estimations can vary strongly over time. Figure 15 also shows the a posteriori uncertainty map. The forecast and a posteriori maps essentially show the same patterns and uncertainty levels. This means that, despite the significant variation in time, the calibration seems robust over time. Here the calibration can be used to forecast the uncertainties for a few days. The root mean square error between the forecast and a posteriori maps (daily averages), divided by the mean of the a posteriori map, is equal to about 5% over each of the next six days. It is noteworthy that the learning period should be long enough: two-day or four-day periods do not appear to be long enough to ensure a good forecast.
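The relative error quoted above, the root mean square error between two maps divided by the mean of the reference map, can be computed as in this minimal sketch. The maps are represented as flat lists of cell values and the function name is hypothetical.

```python
import math


def normalized_rmse(forecast_map, reference_map):
    """RMSE between two maps divided by the mean of the reference map."""
    n = len(reference_map)
    squared = sum((f - r) ** 2 for f, r in zip(forecast_map, reference_map))
    rmse = math.sqrt(squared / n)
    return rmse / (sum(reference_map) / n)
```

With the forecast uncertainty map as `forecast_map` and the a posteriori map as `reference_map`, a value of about 0.05 would correspond to the 5% reported in the text.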

Figure 15.

Comparison of ozone uncertainty maps averaged over one week in μg m−3. (a) The uncertainty estimation during the learning period (from 3rd to 9th April 2001), (b) the uncertainty forecast (10th to 16th April 2001), and (c) the a posteriori uncertainty.

5. Risk Assessment and Probabilistic Forecast

[69] In order to check that the calibration can help in risk assessment and in forecasting a given event, the same tests as in the previous section are carried out with the Brier skill score and the reliability diagram instead of the rank histogram.

[70] Figure 16 shows, for each sub-network, reliability diagrams for calibrated sub-ensembles and the full ensemble. Any sub-ensemble calibrated on one sub-network performs well on the other sub-network.

Figure 16.

Reliability diagrams for [O3] ≥ 100 μg m−3 of the calibrated sub-ensembles and the full ensemble. (a) The reliability diagrams are computed on the cyan sub-network. (b) The reliability diagrams are computed on the yellow sub-network.

[71] The same conclusion can be drawn from the Brier skill score calibration. Table 2 shows the Brier skill scores of the full ensemble and calibrated sub-ensembles computed on the cyan sub-network for three different thresholds — 80, 100 and 120 μg m−3. Whatever the threshold exceedance, the calibrated sub-ensembles perform significantly better than the full ensemble. The sub-network over which the calibration was carried out does not impact the results.

Table 2. Brier Skill Scores of the Full Ensemble and the Calibrated Sub-ensembles (a)

Threshold exceedance             [O3] ≥ 80 μg m−3   [O3] ≥ 100 μg m−3   [O3] ≥ 120 μg m−3
Full ensemble                    0.35               0.37                0.34
Cyan calibrated sub-ensemble     0.40               0.46                0.44
Yellow calibrated sub-ensemble   0.40               0.46                0.44

(a) The scores are computed using the observations of the cyan sub-network.

[72] According to these results, the calibrations based on the reliability diagram and the Brier skill score seem spatially robust.

[73] In order to assess the temporal robustness, we arbitrarily select the learning period from May 31st to June 6th and rely on the corresponding calibrated sub-ensemble to forecast the period from June 7th to June 13th. Figure 17 shows the reliability diagrams of the full ensemble, the a priori calibrated sub-ensemble and the a posteriori calibrated sub-ensemble, for the threshold exceedance [O3] ≥ 100 μg m−3. The a priori sub-ensemble performs better than the full ensemble, but its reliability diagram is degraded compared with that of the a posteriori sub-ensemble. Note that the forecast period is long (7 days) because the reliability diagram requires a significant amount of data to be computed. It is possible that the results would be better if the diagram could be computed with the observations of the very first forecast days only.

Figure 17.

Reliability diagrams for [O3] ≥ 100 μg m−3 of the full ensemble (cyan), the a posteriori calibrated sub-ensemble (red) and the a priori calibrated sub-ensemble (green). This is based on observations from June 7th to June 13th.

[74] The Brier skill scores in the same forecast period are 0.18, 0.27 and 0.25 for the full ensemble, the a posteriori sub-ensemble and the a priori sub-ensemble, respectively. This shows that the calibration can be relevant in the context of probabilistic forecasting.

6. Conclusion

[75] The work presented in this paper relies on a 101-member ensemble that was automatically generated on the Polyphemus platform. This large ensemble is evaluated for uncertainty estimation and for probabilistic forecasts. The tests show that about 10,000 observations are required to properly evaluate the 101-member ensemble. A calibration method is designed to select a sub-ensemble from the full ensemble that better estimates the uncertainties.

[76] Several calibrations for different ensemble scores are carried out and show significant improvements in the ensemble scores. An almost perfect reliability diagram and a very flat rank histogram can result from the calibration. We note that observation errors have only a slight impact on calibration, since uncertainty maps with and without observation errors have the same pattern. The quality of the spatial distribution of the uncertainty estimation is assessed by a cross validation. Again, the calibration seems robust as the uncertainty maps are not very sensitive to the observation network. Finally, we show that the method can be applied in a forecasting context. The calibration can be carried out on a learning period, and the resulting sub-ensemble is able to estimate the uncertainties in the subsequent period almost as well as the sub-ensemble calibrated on this subsequent period.

[77] It would therefore be a natural next step to apply the method proposed here in operational conditions, including for aerosols for which the number of available observations may be significantly lower. A question is how much the proposed approach can help forecast threshold exceedances. The results show that the scores associated with such forecasts are improved, but the impact in an operational platform for decision making has yet to be assessed.

[78] The complexity of the method mainly lies in the automatic generation of a large ensemble in which many sources of uncertainties are taken into account. An open question is what ensemble design should be considered for uncertainty estimation and probabilistic forecasting. This question is especially important when considering forecasts because the sub-ensemble selected over one period should still represent the right uncertainty sources in another period. Monte Carlo simulations, for instance, are easier to carry out, but they might miss important uncertainty sources coming from the model formulation itself.

[79] Further work should address the partition of the uncertainty sources in order to better identify modeling errors, representativeness errors and measurement errors. The spatial and temporal correlations in the errors should also be evaluated.


[80] We would like to thank Hélène Marfaing and Christophe Debert from Airparif for their very useful studies and their data about measurement uncertainties. We thank Richard James for proofreading the paper.