The Brier (1950) score is widely used in meteorology for scoring probability forecasts with two mutually exclusive outcomes (e.g. yes rain / no rain). For N forecast-verification pairs, the Brier score is†

BS = \frac{1}{N} \sum_{n=1}^{N} (p_{n} - d_{n})^{2}    (1)

where p_{n} is the forecast probability for the first of the two outcomes to occur at verification point n, and

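d_{n} = \begin{cases} 1 & \text{if the first outcome occurs at verification point } n, \\ 0 & \text{otherwise.} \end{cases}    (2)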

The expression in Eq. (1) implicitly assumes that each of the forecast-verification pairs will be assigned equal weight. In some situations this is not appropriate, however, and each pair should be weighted accordingly. The weighted Brier score is

BS = \frac{1}{\sum_{n=1}^{N} w_{n}} \sum_{n=1}^{N} w_{n} (p_{n} - d_{n})^{2}    (3)

where w_{n} is the weight assigned to forecast-verification pair n; in this note a general w_{n} is assumed. Murphy (1972, 1973) derived a decomposition of the non-weighted Brier score that splits it into terms representing observational uncertainty, forecast reliability and forecast resolution. This decomposition is in common use, for example when using the attributes diagram (Hsu and Murphy, 1986). In this note an analogous decomposition is derived for weighted forecast-verification pairs. Hersbach (2000) has derived the equivalent weighted decomposition for the continuous ranked probability score.
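As a minimal sketch of the weighted score described above, the following Python function averages the squared differences between forecast probabilities and binary outcomes using arbitrary non-negative weights. The function name, the use of NumPy and the sample values are illustrative assumptions, not part of the original note.

```python
import numpy as np

def weighted_brier_score(p, d, w=None):
    """Weighted (half-)Brier score: sum(w * (p - d)**2) / sum(w).

    p : forecast probabilities of the first outcome
    d : outcomes (1 if the first outcome occurred, else 0)
    w : weights; if omitted, every pair gets weight 1 (unweighted score)
    """
    p = np.asarray(p, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.ones_like(p) if w is None else np.asarray(w, dtype=float)
    return np.sum(w * (p - d) ** 2) / np.sum(w)

# Made-up sample data, for illustration only
p = np.array([0.9, 0.2, 0.6, 0.1])
d = np.array([1, 0, 0, 0])
w = np.array([1.0, 1.0, 2.0, 4.0])

bs_unweighted = weighted_brier_score(p, d)
bs_weighted = weighted_brier_score(p, d, w)
```

For these sample values the unweighted score is about 0.105 and the weighted score about 0.101; only the relative sizes of the weights matter, because the sum of the weights appears in the denominator.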

2. When is weighting appropriate?

The decomposition derived in this note should only be applied in situations where weighting is suitable or necessary. While the decomposition applies for any well-defined weighting of forecast-verification pairs, it is useful to consider which situations are suitable for weighting and which are not.

A common situation that might require weighting is one in which the pairs are distributed non-uniformly in space and a score is required that is representative of the whole domain. This could be over a regular grid such as the latitude–longitude grid (Figure 1, top), where weighting each grid point by the cosine of latitude would approximate spatial integration (for example Jung and Leutbecher, 2008). If there is some latitudinal variation in the Brier score or its components, then the effect of weighting can be substantial. Alternatively, it could be over an irregular set of points such as a network of weather stations (Figure 1, middle), where each station is representative of a different area. There might be a degree of redundancy between the observation stations in more densely observed areas, so it might be appropriate to assign lower weight there.

Another situation that might require weighting is a series of forecast-verification pairs over time at a single point in space (Figure 1, bottom). If the score represents the entire time series, then it might be appropriate to weight each pair by the length of the segment it represents. Alternatively, if the quality or reliability of observations changes over time (earlier verification samples may be less reliable or accurate because they used less advanced measurement techniques or a sparser observation network, for example), then it might be appropriate to assign less weight to the pairs with lower quality observations. If specific regions of the domain are known to produce consistently unreliable observational data, then weighting might be appropriate even when the pairs are uniformly spaced.
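One possible way of weighting a time series by segment length is sketched below. It assumes that the observation times are sorted and that each pair represents the interval bounded by the midpoints to its neighbours, with the first and last intervals ending at the first and last observation times; this convention and the function name are assumptions made for illustration.

```python
import numpy as np

def segment_weights(times):
    """Length of the time segment represented by each observation.

    Assumes `times` is sorted in increasing order. Segment boundaries
    are placed halfway between neighbouring observation times; the
    outermost segments stop at the first and last observation.
    """
    t = np.asarray(times, dtype=float)
    midpoints = 0.5 * (t[1:] + t[:-1])
    edges = np.concatenate(([t[0]], midpoints, [t[-1]]))
    return np.diff(edges)

# Irregularly spaced verification times (arbitrary units)
times = [0.0, 1.0, 2.0, 6.0, 10.0]
w_time = segment_weights(times)
```

With this convention the weights sum to the total length of the series, so sparsely sampled periods are not under-represented in the score.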

There are also situations in which the forecast-verification pairs are not distributed uniformly yet weighting is not appropriate. For example, consider a domain with very different climatological situations in two regions: a mountainous region with highly complex climate and many observation stations, and a plains region with few observation stations and a homogeneous climate. In this situation it would probably not be appropriate to assign higher weight to the observations on the plain just because each one is representative of a larger area, because the more complex climate in the mountains means the degree of redundancy between observations in the two regions may be comparable. Another example is when calculating the economic value of a forecast based on a cost/loss analysis (Wilks 2001). Although the value score represents the whole domain, it is based on total costs and total losses, which are just the sums of the costs and losses at individual points.

Overall, whether to weight or not depends on context. Whether the observations should be weighted is just as important in the interpretation of the results as the scores themselves. The derivation below assumes nothing about the weights themselves, however, so it is applicable in any situation where weighting is applied in practice.

3. Derivation of the weighted decomposition

Begin by defining

W = \sum_{n=1}^{N} w_{n}    (4)

as the total weight for the N pairs. The w_{n} could be normalized (so that W = 1 or W = N, for example), but W is retained here for generality. Following Murphy (1972), assume the forecast probability p_{n} can take any one of a fixed number of values; ordinarily these p_{n} are determined by the size of the forecast ensemble. For M ensemble members, p_{n} can take one of T = M + 1 values:

p^{t} = \frac{t - 1}{M}, \quad t \in \{1, 2, \ldots, T\}    (5)

where t − 1 is the number of ensemble members that predict the first of the two mutually exclusive outcomes will occur (Eq. (2)). The Brier score can therefore be split into T categories BS^{t}, t ∈ {1,2,...,T}, each concerning the N^{t} cases with a forecast probability p^{t}. Since by construction p_{n} = p^{t} for all n in category t, then

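BS^{t} = \frac{1}{\sum_{n=1}^{N^{t}} w_{n}^{t}} \sum_{n=1}^{N^{t}} w_{n}^{t} (p^{t} - d_{n}^{t})^{2}    (6)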

where the w_{n}^{t} are the weights assigned to the N^{t} forecast-verification pairs with p_{n} = p^{t} and d_{n}^{t} is the outcome for the nth pair in category t.

Now define a second sum,

w^{t} = \sum_{n=1}^{N^{t}} w_{n}^{t}    (7)

as the total weight in category t. Expand Eq. (6) and substitute in from Eq. (7) to obtain

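BS^{t} = (p^{t})^{2} - \frac{2 p^{t}}{w^{t}} \sum_{n=1}^{N^{t}} w_{n}^{t} d_{n}^{t} + \frac{1}{w^{t}} \sum_{n=1}^{N^{t}} w_{n}^{t} (d_{n}^{t})^{2}    (8)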

The final (d_{n}^{t})^{2} can be rewritten as d_{n}^{t}, because d_{n}^{t} can only take values of 1 and 0. Hence

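BS^{t} = (p^{t})^{2} - \frac{2 p^{t}}{w^{t}} \sum_{n=1}^{N^{t}} w_{n}^{t} d_{n}^{t} + \frac{1}{w^{t}} \sum_{n=1}^{N^{t}} w_{n}^{t} d_{n}^{t}    (9)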

Define a third quantity

\bar{d}^{t} = \frac{1}{w^{t}} \sum_{n=1}^{N^{t}} w_{n}^{t} d_{n}^{t}    (10)

which is the (weighted) relative frequency of the first outcome for forecasts in category t. Hence

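BS^{t} = (p^{t})^{2} - 2 p^{t} \bar{d}^{t} + \bar{d}^{t}    (11)

       = (p^{t} - \bar{d}^{t})^{2} + \bar{d}^{t} (1 - \bar{d}^{t})    (12)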

by completing the square. This form is analogous to Eq. (3) in Murphy (1972) but with \bar{d}^{t} defined for weighted forecasts.
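The completing-the-square step can be checked by expanding the squared form and confirming that the quadratic terms in the weighted relative frequency cancel. The lines below are only a worked check, writing \bar{d}^{t} for the weighted relative frequency of Eq. (10):

```latex
\begin{align*}
(p^{t} - \bar{d}^{t})^{2} + \bar{d}^{t}(1 - \bar{d}^{t})
  &= (p^{t})^{2} - 2 p^{t} \bar{d}^{t} + (\bar{d}^{t})^{2}
     + \bar{d}^{t} - (\bar{d}^{t})^{2} \\
  &= (p^{t})^{2} - 2 p^{t} \bar{d}^{t} + \bar{d}^{t}.
\end{align*}
```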

Now sum over the probability categories 1 → T to recover the full Brier score. Each BS^{t} is weighted by the total weight in category t. Hence

BS = \frac{1}{W} \sum_{t=1}^{T} w^{t} BS^{t}    (13)

Substituting in from Eq. (12) gives

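BS = \frac{1}{W} \sum_{t=1}^{T} w^{t} (p^{t} - \bar{d}^{t})^{2} + \frac{1}{W} \sum_{t=1}^{T} w^{t} \bar{d}^{t} (1 - \bar{d}^{t})    (14)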

which is the ‘original partition’ of the Brier score defined by Murphy (1973, his Eq. (2)) but for weighted forecasts. The first term on the right-hand side (r.h.s.) represents the contribution to the Brier score due to the forecast reliability, denoted REL.

Now proceed similarly to Murphy (1973). Expand the second term on the r.h.s. of Eq. (14):

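\frac{1}{W} \sum_{t=1}^{T} w^{t} \bar{d}^{t} (1 - \bar{d}^{t}) = -\frac{1}{W} \sum_{t=1}^{T} w^{t} (\bar{d}^{t})^{2} + \frac{1}{W} \sum_{t=1}^{T} w^{t} \bar{d}^{t}    (15)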

and define a fourth sum as the second term on the r.h.s. of the equation above:

\bar{d} = \frac{1}{W} \sum_{t=1}^{T} w^{t} \bar{d}^{t}    (16)

In Murphy (1973), \bar{d} is the overall observed relative frequency of the first outcome. It can be shown that Eq. (16) is the equivalent for the weighted decomposition by substituting in for \bar{d}^{t} from Eq. (10), giving

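\bar{d} = \frac{1}{W} \sum_{t=1}^{T} \sum_{n=1}^{N^{t}} w_{n}^{t} d_{n}^{t} = \frac{1}{W} \sum_{n=1}^{N} w_{n} d_{n}    (17)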

which is just the weighted overall relative frequency of the first outcome. Hence

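\frac{1}{W} \sum_{t=1}^{T} w^{t} \bar{d}^{t} (1 - \bar{d}^{t}) = -\frac{1}{W} \sum_{t=1}^{T} w^{t} (\bar{d}^{t})^{2} + \bar{d}    (18)

  = -\frac{1}{W} \sum_{t=1}^{T} w^{t} (\bar{d}^{t})^{2} + 2 \bar{d}^{2} - \bar{d}^{2} + \bar{d} - \bar{d}^{2}    (19)

  = -\left\{ \frac{1}{W} \sum_{t=1}^{T} w^{t} (\bar{d}^{t})^{2} - 2 \bar{d}^{2} + \bar{d}^{2} \right\} + \bar{d} (1 - \bar{d})    (20)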

The second term on the r.h.s. is the observational uncertainty for weighted forecasts, denoted UNC. Finally consider the term in braces; by expressing the second and third parts as sums and using Eqs (4) and (16), it is straightforward to show that this term is

\frac{1}{W} \sum_{t=1}^{T} w^{t} (\bar{d}^{t} - \bar{d})^{2}    (21)

which is the forecast resolution term, denoted RES. By putting these together, the decomposed Brier score for weighted forecast-verification pairs is obtained:

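BS = REL - RES + UNC    (22)

   = \frac{1}{W} \sum_{t=1}^{T} w^{t} (p^{t} - \bar{d}^{t})^{2} - \frac{1}{W} \sum_{t=1}^{T} w^{t} (\bar{d}^{t} - \bar{d})^{2} + \bar{d} (1 - \bar{d})    (23)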

where W is given by Eq. (4), p^{t} by Eq. (5), w^{t} by Eq. (7), \bar{d}^{t} by Eq. (10), and \bar{d} by Eq. (17).
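The decomposition can be checked numerically. The sketch below is one possible NumPy implementation, not a definitive one: it takes the number of ensemble members k_{n} predicting the event (so that p_{n} = k_{n}/M), accumulates the total weight and weighted event frequency in each of the T = M + 1 categories, and returns the reliability, resolution and uncertainty terms. The function name, argument layout and synthetic data are assumptions made for illustration.

```python
import numpy as np

def weighted_brier_decomposition(k, d, w, M):
    """Return (REL, RES, UNC) for weighted forecast-verification pairs.

    k : number of ensemble members predicting the event (0..M) per pair
    d : outcomes (1 or 0) per pair
    w : weights per pair
    M : ensemble size, giving T = M + 1 probability categories
    """
    k = np.asarray(k, dtype=int)
    d = np.asarray(d, dtype=float)
    w = np.asarray(w, dtype=float)

    W = w.sum()                                           # total weight
    p_t = np.arange(M + 1) / M                            # category probabilities
    w_t = np.bincount(k, weights=w, minlength=M + 1)      # weight per category
    wd_t = np.bincount(k, weights=w * d, minlength=M + 1)

    used = w_t > 0                                        # skip empty categories
    d_bar_t = wd_t[used] / w_t[used]                      # weighted frequency per category
    d_bar = wd_t.sum() / W                                # overall weighted frequency

    rel = np.sum(w_t[used] * (p_t[used] - d_bar_t) ** 2) / W
    res = np.sum(w_t[used] * (d_bar_t - d_bar) ** 2) / W
    unc = d_bar * (1.0 - d_bar)
    return rel, res, unc

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
M, N = 9, 500
k = rng.integers(0, M + 1, size=N)
d = (rng.random(N) < k / M).astype(float)
w = rng.random(N) + 0.1

rel, res, unc = weighted_brier_decomposition(k, d, w, M)
bs_direct = np.sum(w * (k / M - d) ** 2) / w.sum()
```

The directly computed weighted Brier score `bs_direct` should agree with `rel - res + unc` to within rounding error for any choice of positive weights.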

4. An illustration using seasonal forecasts

The effect of weighting on the Brier score and its decomposition is illustrated in this section with an analysis of seasonal forecasts from the EU ENSEMBLES project (Hewitt, 2005; Doblas-Reyes et al., 2009). To illustrate the method, a weighting function is used that has predictable results: seasonal predictability is generally higher at tropical latitudes compared with extratropical latitudes, so a weighting that favours the lower latitudes should produce better verification scores than the unweighted case.

The forecasts used are Stream 2 forecasts from five of the models used in the project: ECMWF IFS/HOPE, MeteoFrance ARPEGE4/OPA, Met Office HadGEM2, INGV ECHAM5/OPA8.2 and Kiel ECHAM5/OM1.‡ Many datasets are available from these forecasts, so the analysis is restricted to the following. Each forecast consists of nine initial condition ensemble members started from 1 May for each of the years 1991–2001, and the quantity predicted is the monthly mean temperature 2 m above the surface for each of the subsequent seven months (i.e. a lead time of one month represents the mean from 1 May–31 May). The forecasts were re-gridded from their original model grids§ using the Climate Data Operators first-order conservative remapping command remapcon on to a regular latitude–longitude grid of 2.5° spacing in both directions. They are verified grid-point-wise against the ERA-40 reanalysis dataset (Uppala et al., 2005)¶ valid at the same locations, which gives N = 10512 for the total number of forecast-verification pairs at each lead time.

For the purpose of this example, a suitable event must be defined for the forecasts to predict. The event is that the monthly mean 2 m temperature will be above the climate mean 2 m temperature, where the climate mean is defined using the ERA-40 dataset as the mean of this quantity over the period 1961–1990 for each grid point and month.

For each forecast (i.e. for each of the models, for each of the eleven years), the nine ensemble members were used to compute a probability forecast p_{n} for this event to occur at each grid point n. The probability is given by the number of ensemble members predicting a higher 2 m temperature than the climate mean, divided by the number of ensemble members. For each ensemble forecast this was done for each lead time up to seven months ahead. The verification value d_{n} was then found by comparing the ERA-40 reanalysis value at that time with the climate mean for the appropriate month.
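The construction of p_{n} and d_{n} described above can be sketched as follows. The array layout (ensemble member along the first axis), the function names and the toy values are assumptions for illustration; real fields would have shape (9, 73, 144) rather than (9, 3).

```python
import numpy as np

def event_probability(ensemble, climate_mean):
    """Fraction of ensemble members above the climate mean at each point.

    ensemble     : array of shape (M, ...) with member along axis 0
    climate_mean : array broadcastable to the trailing dimensions
    """
    ensemble = np.asarray(ensemble, dtype=float)
    return np.mean(ensemble > climate_mean, axis=0)

def event_outcome(verification, climate_mean):
    """1 where the verifying value is above the climate mean, else 0."""
    return (np.asarray(verification, dtype=float) > climate_mean).astype(int)

# Toy example: nine members at three grid points (temperatures in K)
climate = np.array([280.0, 290.0, 300.0])
offsets = np.arange(-4, 5)                      # nine member offsets
shift = np.array([0.5, -2.5, 4.5])              # ensemble-mean anomaly per point
ensemble = climate + offsets[:, None] + shift   # shape (9, 3)
verification = climate + np.array([1.0, -1.0, 0.0])

p_n = event_probability(ensemble, climate)
d_n = event_outcome(verification, climate)
```

In this toy case 5, 2 and 9 of the nine members exceed the climate mean at the three points, so the forecast probabilities are 5/9, 2/9 and 1. The strict inequality means a verifying value exactly equal to the climate mean counts as a non-event, which is an assumption of this sketch.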

An example of one such probability forecast, showing p_{n} at each point, is shown in the top left panel of Figure 2, along with the verification d_{n} in the top right panel.

The Brier score for the forecast was then calculated using Eq. (3) as a global statistic over all the grid points, and the decomposition was calculated using Eq. (23). This was done for an unweighted forecast, setting w_{n} = 1 for each grid point, and for a weighted forecast. An appropriate weighting function is to assign weight to each grid point proportional to the area of the globe that the point represents. For grid points at latitude λ_{j} separated in latitude by Δλ, this weighting function is given by||

w_{n} = \sin(\lambda_{j} + \Delta\lambda / 2) - \sin(\lambda_{j} - \Delta\lambda / 2)    (24)

where multiplicative factors constant at each grid point are omitted without loss of generality. This weighting clearly favours low latitudes, and so better verification scores are expected than in the unweighted case because seasonal predictability is generally higher at tropical latitudes compared with extratropical latitudes.
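A sketch of area-proportional weights for a regular latitude–longitude grid is given below. It assumes that the weight of a grid box is the difference between the sines of its northern and southern edge latitudes (which is proportional to the area of the latitude band), that the edges lie half a grid spacing either side of the grid point, and that boxes are clipped at the poles. The function name and grid construction are illustrative.

```python
import numpy as np

def latitude_area_weights(lat_deg, dlat_deg):
    """Area-proportional weight for grid points at latitudes `lat_deg`.

    Each box spans half a grid spacing either side of the grid point,
    clipped at +/-90 degrees so that polar boxes are half-width.
    Constant factors (Earth radius, longitude spacing) are omitted.
    """
    lat = np.asarray(lat_deg, dtype=float)
    north = np.radians(np.minimum(lat + dlat_deg / 2.0, 90.0))
    south = np.radians(np.maximum(lat - dlat_deg / 2.0, -90.0))
    return np.sin(north) - np.sin(south)

# Regular 2.5 degree grid: 73 latitudes by 144 longitudes
dlat = 2.5
lats = np.arange(-90.0, 90.0 + dlat, dlat)
w_lat = latitude_area_weights(lats, dlat)

# Same weight at every longitude on a given latitude circle
w_grid = np.repeat(w_lat[:, None], 144, axis=1)
```

Away from the poles these weights reduce to 2 sin(Δλ/2) cos λ_{j}, i.e. they are proportional to the cosine of latitude, and the resulting grid has 73 × 144 = 10512 points.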

For the example forecast in Figure 2, the contribution to the weighted Brier score from each point on the grid is shown in the bottom panel of that figure. For this particular forecast at a lead time of four months, the Brier score and its components are given in Table I.

Table I. Brier score and its components for the example given in Figure 2.

              Unweighted   Weighted
Brier score   0.401        0.339
Uncertainty   0.210        0.227
Reliability   0.193        0.116
Resolution    0.00227      0.00392

The weighting function used in Eq. (24) assigns more weight at lower latitudes, so it is expected that the weighted forecasts will produce better verification scores than the unweighted case, because of the variation in seasonal predictability with latitude. The example shown in Figure 2 confirms this prediction: in Table I the Brier score and reliability component are lower for the weighted forecast and the resolution component is higher.

But is this a general result? Two more analyses are now presented: a single forecast verified over several months and all the models and years available for analysis in combination. Firstly, in Figure 3, the scores for the weighted and unweighted cases are shown as a function of lead time for the Met Office HadGEM2 1991 forecast (the same forecast as Figure 2). In this forecast the weighted forecast produces better scores than the unweighted forecast at all lead times, as predicted.

Secondly, the same analysis is extended to all models and years for which results are available. The same calculations were performed for all five models from 1991–2001. For ease of visualization, the results were then combined for each model into a mean at each lead time over all the years analyzed. These results are presented in Figure 4. The differences between the weighted and unweighted cases for the Brier score (not shown in the figure) and the reliability component behave as predicted when latitude-based weighting is applied. The values in the weighted case are lower than in the unweighted case, as expected if the grid points where predictability is higher are favoured by the weighting scheme. The ECMWF model seems to behave in the opposite way, however, with the Brier score and reliability component becoming poorer when weighting is applied. Perhaps there is a bias in the ECMWF model that causes it to perform more favourably than the rest of the models at extratropical latitudes, or it might be because the ECMWF model is the same one used to create the ERA-40 dataset used for verification.

The differences in the resolution and uncertainty between unweighted and weighted cases have a smaller effect on the Brier score than the reliability. From Figure 4, the relative contributions to the change in the Brier score when weighting is applied are approximately in the ratio 1:10:1 for uncertainty:reliability:resolution. Looking at the resolution scores, the predicted result is obtained for short lead times, as the resolution score increases with weighting. After three months lead time there is either no difference or a slight bias towards poorer scores when grid points are weighted. This may be partly because at lead times beyond two months resolution scores are close to zero anyway; see the right hand panel of Figure 3, for example.

The difference in the uncertainty scores is slightly anomalous, as the scores are marginally larger in the weighted case than in the unweighted case. As uncertainty can be interpreted as a measure of the intrinsic difficulty of the forecast, the uncertainty might be expected to decrease as more weight is assigned to regions where the behaviour is easier to predict. This result can be explained by examining the verification data, however. The definition of the event being forecast means that the expected value of \bar{d} is 0.5 over the period covered by the climate mean. If the monthly mean temperature increases (decreases) after the end of the climate mean period, however, \bar{d} will increase (decrease) because the probability of being greater than the climate mean rises above (falls below) 0.5. For both an increase and a decrease the uncertainty will decrease, however, as it is equal to \bar{d}(1 − \bar{d}), which is largest at \bar{d} = 0.5. The amount by which \bar{d} changes (and hence the amount by which the uncertainty falls) varies directly with the monthly mean temperature change. If the weighted uncertainty is larger than the unweighted value, therefore, the weighting must be assigning more weight to regions where temperature changes less between the climate mean period and the verification time. The result in Figure 4 therefore implies that the temperature over the 1991–2001 period has changed (with respect to the climate mean) more in the extratropics than in the Tropics. In Figure 5 this temperature anomaly is plotted for May, comparing the mean of the 1991–2001 monthly mean temperatures with the mean of the 1961–1990 period. The anomaly is greater in the extratropics than in the Tropics, confirming the prediction and explaining the uncertainty results above. Equivalent plots for other months give the same result, and in some cases the difference between low and high latitudes is even greater than for the example shown here.
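The dependence of the uncertainty term on the overall (weighted) relative frequency of the event can be illustrated with a few values. The frequencies below are arbitrary and chosen only to show that the term peaks at 0.5 and falls symmetrically on either side.

```python
import numpy as np

# Arbitrary overall relative frequencies of the event
d_bar = np.array([0.3, 0.5, 0.7, 0.9])

# Uncertainty term: d_bar * (1 - d_bar)
unc = d_bar * (1.0 - d_bar)
```

The resulting values are approximately 0.21, 0.25, 0.21 and 0.09, so a region where the event frequency has drifted further from 0.5 contributes a smaller uncertainty.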

5. Concluding remarks

In this note a decomposition of the Brier score has been derived for weighted forecast-verification pairs, and its use has been illustrated for seasonal forecasts weighted according to the area represented by each grid point, i.e. proportional to the cosine of latitude. The weighted forecasts in the example give improved Brier and reliability scores compared with the unweighted case, consistent with what is expected given that tropical predictability is generally better than extratropical predictability.

The new decomposition has a few consequences for other verification scores. The attributes diagram (Hsu and Murphy, 1986) plots forecast probability against observed relative frequency. For weighted pairs the ordinate on the attributes diagram should be changed to the weighted expression for \bar{d}^{t} (Eq. (10)) and the point size used to represent each forecast probability should be changed from the total number of observations in that category to the total weight in that category, w^{t} (Eq. (7)).

The Brier score and its decomposition are often computed using a contingency table like Table 1 of Murphy. When weighted forecast-verification pairs are used, each element of the table is changed from the number of pairs with that combination of forecast and outcome to the total weight assigned to pairs with that combination. Equivalently, the contingency table for the whole forecast is a weighted sum of the contingency tables for each individual pair.
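A weighted contingency table of this kind can be sketched as below, with one row per forecast probability category and one column per outcome. Each entry accumulates the weight of the pairs with that combination, and the row sums and weighted relative frequencies needed for a weighted attributes diagram follow directly. The table layout, function name and sample data are assumptions for illustration.

```python
import numpy as np

def weighted_contingency_table(k, d, w, M):
    """Table of total weight by forecast category and outcome.

    Rows    : number of members predicting the event, k = 0..M
    Columns : outcome d = 0 (event did not occur), d = 1 (event occurred)
    """
    k = np.asarray(k, dtype=int)
    d = np.asarray(d, dtype=int)
    w = np.asarray(w, dtype=float)
    table = np.zeros((M + 1, 2))
    np.add.at(table, (k, d), w)   # unbuffered add handles repeated cells
    return table

# Made-up sample: M = 2 members, five forecast-verification pairs
k = [0, 1, 1, 2, 2]
d = [0, 0, 1, 1, 1]
w = [1.0, 2.0, 3.0, 1.0, 0.5]

table = weighted_contingency_table(k, d, w, M=2)
counts = weighted_contingency_table(k, d, np.ones(5), M=2)

# Quantities for a weighted attributes diagram
w_t = table.sum(axis=1)                         # point size per category
d_bar_t = np.divide(table[:, 1], w_t,
                    out=np.full_like(w_t, np.nan), where=w_t > 0)
```

Setting every weight to 1 recovers the usual table of counts, which is a convenient check that the weighted and unweighted calculations share the same code path.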

Weighting can also be applied to other scores. The ignorance score (Roulston and Smith, 2002) is defined by IGN = −log_{2}f_{j}, where f_{j} is the forecast probability assigned to the observed outcome. Averaging over N forecast-verification pairs with weights w_{n} gives

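IGN = \frac{1}{\sum_{n=1}^{N} w_{n}} \sum_{n=1}^{N} w_{n} \left( -\log_{2} (f_{j})_{n} \right)    (25)

    = -\frac{1}{W} \sum_{n=1}^{N} w_{n} \log_{2} (f_{j})_{n}    (26)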

where W is defined by Eq. (4). The quantities used to calculate points on the relative operating characteristic curve (Wilks 2001, Fig. 4) are also affected by weighting: the hit rate and probability of false detection scores used to create the curve need to be calculated using the weight assigned to each element of the contingency table instead of the number of pairs. Finally, when constructing the rank histogram, using a set of weighted forecast-verification pairs means changing the value for each ensemble member bin from the number of pairs where the verification falls within that bin to the total weight of pairs falling within that bin.
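The three weighted quantities mentioned above can be sketched in a few lines each. The function names, the convention that a warning is issued when the forecast probability is at or above a threshold, and the sample values are all assumptions made for illustration; the functions do not guard against empty classes (for example, no observed events).

```python
import numpy as np

def weighted_ignorance(f_obs, w):
    """Weighted mean of -log2 of the probability given to the observed outcome."""
    f_obs = np.asarray(f_obs, dtype=float)
    w = np.asarray(w, dtype=float)
    return -np.sum(w * np.log2(f_obs)) / np.sum(w)

def weighted_roc_point(p, d, w, threshold):
    """Weighted hit rate and probability of false detection for one threshold."""
    p = np.asarray(p, dtype=float)
    d = np.asarray(d, dtype=int)
    w = np.asarray(w, dtype=float)
    warn = p >= threshold
    hits = w[warn & (d == 1)].sum()
    misses = w[~warn & (d == 1)].sum()
    false_alarms = w[warn & (d == 0)].sum()
    correct_negatives = w[~warn & (d == 0)].sum()
    return hits / (hits + misses), false_alarms / (false_alarms + correct_negatives)

def weighted_rank_histogram(ranks, w, n_bins):
    """Total weight of pairs whose verification falls in each rank bin."""
    return np.bincount(np.asarray(ranks, dtype=int),
                       weights=np.asarray(w, dtype=float), minlength=n_bins)

# Made-up sample values
ign = weighted_ignorance([0.5, 0.25, 1.0], [1.0, 2.0, 1.0])
hit_rate, pofd = weighted_roc_point([0.9, 0.2, 0.6, 0.1], [1, 0, 0, 1],
                                    [1.0, 1.0, 2.0, 4.0], threshold=0.5)
hist = weighted_rank_histogram([0, 2, 2, 1], [1.0, 1.0, 2.0, 4.0], n_bins=3)
```

In each case the only change from the unweighted calculation is that sums of weights replace counts of pairs, so unit weights reproduce the standard scores.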

Acknowledgements

This work was financially supported by NERC Studentship NER/S/A/2005/13667. The author thanks Peter Read, Martin Leutbecher and two anonymous reviewers for their comments on the manuscript, and Falk Niehörster for comments on the manuscript and for access to the LSE Centre for the Analysis of Time Series seasonal forecasts from the ENSEMBLES project.

^{†} Technically this is the half-Brier score, as it only considers one of the two possible outcomes, but the factor of two can be omitted here without loss of generality.

^{‡} Details of the experiments run as part of the ENSEMBLES project are listed at http://www.ecmwf.int/research/EU_projects/ENSEMBLES/table_experiments/.

^{§} N80 Gaussian reduced grid for ECMWF; regular long–lat grid of 192 × 145 points for the Met Office; N48 Gaussian grids for Kiel and INGV, and a reduced 128 × 64 Gaussian grid for MeteoFrance.

^{||} The area represented by the grid box extends from λ_{j} − Δλ/2 to λ_{j} + Δλ/2 everywhere except the poles, where it is between λ_{j} and λ_{j} + Δλ/2 at the south pole and between λ_{j} − Δλ/2 and λ_{j} at the north pole.