A bias-corrected decomposition of the Brier score

Authors

  • C. A. T. Ferro,

    Corresponding author
    1. National Centre for Atmospheric Science, University of Exeter, UK
    2. College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK
    • College of Engineering, Mathematics and Physical Sciences, University of Exeter, Harrison Building, North Park Road, Exeter, EX4 4QF, UK.
  • T. E. Fricker

    1. College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK

Abstract

The Brier score is a widely used measure of performance for probabilistic forecasts of event occurrences and is often decomposed additively into three terms that quantify the reliability and resolution of the forecasts and the uncertainty of the forecast events. The standard decomposition yields biased estimates of the large-sample values of these three quantities: reliability is overestimated and uncertainty is underestimated, while resolution may be either overestimated or underestimated. An unbiased decomposition is shown to be unattainable but a new decomposition is proposed that has smaller biases and therefore provides a more accurate measure of forecast performance. The implications for the Brier skill score and the attributes diagram are discussed and results are illustrated with seasonal forecasts of sea-surface temperatures. Copyright © 2012 Royal Meteorological Society

1. The Brier score and its decomposition

Suppose that probabilities p1, …, pn are forecasts for the occurrence of n events, and let x1, …, xn indicate whether or not the n events occur, so that xi = 1 if the ith event occurs and xi = 0 if the ith event fails to occur. The Brier score (Brier, 1950) for these forecasts is

$$B = \frac{1}{n}\sum_{i=1}^{n}(p_i - x_i)^2$$

and takes values in the interval [0,1], with smaller values indicating better forecasts.
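As a minimal sketch of this definition, the Python function below computes the Brier score for a list of forecast probabilities and a list of binary outcomes. The function name and the five forecast-outcome pairs are illustrative assumptions, not data from this study.

```python
def brier_score(p, x):
    """Mean squared difference between forecast probabilities p and outcomes x (0 or 1)."""
    if len(p) != len(x) or not p:
        raise ValueError("p and x must be non-empty and of equal length")
    return sum((p_i - x_i) ** 2 for p_i, x_i in zip(p, x)) / len(p)


# Hypothetical data: five forecasts and the corresponding outcomes.
p = [0.9, 0.1, 0.5, 0.5, 0.0]
x = [1, 0, 1, 0, 0]
score = brier_score(p, x)  # (0.01 + 0.01 + 0.25 + 0.25 + 0) / 5, about 0.104
print(score)
```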

Suppose now that each forecast can take one of only K distinct values, π1, …, πK. Let Ik = {i : pi = πk} be the set of indices for those occasions on which πk is forecast and let nk be the number of such occasions. For those k for which nk > 0, define the conditional relative frequency,

$$\bar{x}_k = \frac{1}{n_k}\sum_{i \in I_k} x_i$$

to be the proportion of events that occur out of the nk occasions on which πk is forecast. Also define

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

to be the overall proportion of occasions on which the event occurs. Then the Brier score can be decomposed (Murphy, 1973) as

$$B = \mathrm{REL} - \mathrm{RES} + \mathrm{UNC} \tag{1}$$

where

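$$\mathrm{REL} = \frac{1}{n}\sum_{k \in K_0} n_k(\bar{x}_k - \pi_k)^2 \tag{2}$$
$$\mathrm{RES} = \frac{1}{n}\sum_{k \in K_0} n_k(\bar{x}_k - \bar{x})^2 \tag{3}$$
$$\mathrm{UNC} = \bar{x}(1 - \bar{x}) \tag{4}$$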

and K0 = {k : nk > 0} so that the sums are over those k for which nk exceeds zero. The first term (REL) in the decomposition is a weighted average of the squared differences between the conditional relative frequencies and the corresponding forecasts and measures the reliability of the forecasts. The best score for the reliability is zero, which is obtained if the conditional relative frequencies are equal to their corresponding forecasts. The second term (RES) is a weighted variance of the conditional relative frequencies and measures the resolution of the forecasts. The worst score for the resolution is zero, which is obtained if the conditional relative frequencies are the same for all forecasts. The third term (UNC) is a measure of uncertainty or climatological variation in the event occurrence. Very rare or very common events have low uncertainty.
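The three terms can be computed by grouping the outcomes according to the forecast value issued. The sketch below assumes the usual Murphy (1973) forms: REL and RES are weighted sums over the forecast values that actually occur, and UNC is the variance of the outcome indicator. The function name and the eight forecast-outcome pairs are hypothetical.

```python
from collections import defaultdict


def standard_decomposition(p, x):
    """Return (REL, RES, UNC) for forecasts p and binary outcomes x."""
    n = len(p)
    outcomes = defaultdict(list)  # distinct forecast value -> its outcomes
    for p_i, x_i in zip(p, x):
        outcomes[p_i].append(x_i)
    x_bar = sum(x) / n  # overall relative frequency of the event
    rel = res = 0.0
    for pi_k, xs in outcomes.items():  # only values with n_k > 0 appear here
        n_k = len(xs)
        x_bar_k = sum(xs) / n_k  # conditional relative frequency
        rel += n_k * (x_bar_k - pi_k) ** 2
        res += n_k * (x_bar_k - x_bar) ** 2
    return rel / n, res / n, x_bar * (1 - x_bar)


# Hypothetical sample: two distinct forecast values, four occasions each.
p = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
x = [0, 0, 1, 0, 1, 1, 0, 1]
rel, res, unc = standard_decomposition(p, x)
brier = sum((p_i - x_i) ** 2 for p_i, x_i in zip(p, x)) / len(p)
# The three terms recombine to the Brier score: REL - RES + UNC = B.
print(rel, res, unc, brier)
```

For this sample the conditional relative frequencies are 0.25 and 0.75, so the forecasts are nearly reliable (small REL) and resolve the two situations well (RES is a quarter of UNC).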

Bröcker (2011) showed that the three terms in the Brier score decomposition (1) are biased. This means that the expected value of each term is typically different from its true value, defined to be the value that would be obtained were the sample size, n, increased to infinity. The reliability is systematically overestimated and the uncertainty is systematically underestimated, while the resolution may be either overestimated or underestimated. Therefore, evaluating this standard decomposition for finite samples can give a misleading impression of forecast quality. We show that an unbiased decomposition of the Brier score is unattainable but propose a new decomposition that has smaller biases than the standard decomposition and therefore provides a more accurate measure of forecast performance. We discuss the implications of the bias for the Brier skill score and the attributes diagram, and illustrate the new decomposition with seasonal forecasts of sea-surface temperatures (SSTs).

2. The bias and a bias-corrected decomposition

We show that the standard decomposition of the Brier score is biased and derive our results under the assumption that the forecast-verification pairs {(pi,xi) : i = 1,…,n} are independent and identically distributed random variables. Extensions to dependent random variables are discussed in section 5. We define the long-run relative frequency with which the event occurs to be the expected value

$$\mu = E(x_i)$$

for all i, and define the long-run relative frequency with which the event occurs amongst those occasions on which the forecast equals πk to be

$$\mu_k = E(x_i \mid p_i = \pi_k)$$

for all i and each k. We also define the expected frequency with which πk is forecast in a sequence of n forecasts to be

$$E(n_k) = n\phi_k, \qquad \phi_k = \Pr(p_i = \pi_k),$$

for each k, where ϕk > 0. The weak law of large numbers tells us that $\bar{x} \to \mu$, $\bar{x}_k \to \mu_k$ and nk/n → ϕk for each k as n → ∞. Substituting these limits into the decomposition (1) of the Brier score yields the following limits for the reliability, resolution and uncertainty:

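$$\mathrm{REL} \to \sum_{k=1}^{K}\phi_k(\mu_k - \pi_k)^2 \tag{5}$$
$$\mathrm{RES} \to \sum_{k=1}^{K}\phi_k(\mu_k - \mu)^2, \qquad \mathrm{UNC} \to \mu(1 - \mu) \tag{6}$$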

These are the values that would be obtained were the sample size infinite. For finite n, however, a special case of a result obtained by Bröcker (2011) shows that the expected values of the reliability, resolution and uncertainty terms in the standard decomposition (1) are as follows:

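$$\begin{aligned}
E(\mathrm{REL}) &= \sum_{k=1}^{K}\phi_k(\mu_k - \pi_k)^2 + \frac{1}{n}\sum_{k=1}^{K}\nu_{k,n}\mu_k(1 - \mu_k),\\
E(\mathrm{RES}) &= \sum_{k=1}^{K}\phi_k(\mu_k - \mu)^2 + \frac{1}{n}\sum_{k=1}^{K}\nu_{k,n}\mu_k(1 - \mu_k) - \frac{\mu(1 - \mu)}{n},\\
E(\mathrm{UNC}) &= \frac{n - 1}{n}\,\mu(1 - \mu),
\end{aligned} \tag{7}$$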

where νk,n is the probability that nk exceeds zero. A special case of these expressions (in which members of an ensemble predict the event independently with probability μ, the forecast is the proportion of ensemble members that predict the event and μk = μ for all k) was obtained by Ferro et al. (2008) in their investigation of the effect of ensemble size on the Brier score, but they did not comment on the dependence of these expected values on the sample size, n. The differences between the expected and limiting values above are the biases. The bias in the reliability,

$$\mathrm{bias}(\mathrm{REL}) = \frac{1}{n}\sum_{k=1}^{K}\nu_{k,n}\mu_k(1 - \mu_k) \tag{8}$$

is non-negative and decreases monotonically to zero as n increases. In other words, REL tends to overestimate its limiting value and the reliability of the forecasts will tend to appear poorer than it would do were a larger sample available. The bias in the uncertainty,

$$\mathrm{bias}(\mathrm{UNC}) = -\frac{\mu(1 - \mu)}{n} \tag{9}$$

is non-positive and increases monotonically to zero as n increases. Therefore the uncertainty will tend to appear smaller than it would do were a larger sample available. The bias in the resolution can be positive or negative, but also converges to zero as n increases. In practice, however, the bias in the resolution is often positive because μ(1 − μ) is often small compared with $\sum_{k=1}^{K}\nu_{k,n}\mu_k(1 - \mu_k)$, in which case the resolution of the forecasts will tend to appear better than it would do were a larger sample available.
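These biases can be evaluated numerically once the long-run quantities are specified. The sketch below assumes the forms bias(REL) = (1/n) Σ νk,n μk(1 − μk) and bias(UNC) = −μ(1 − μ)/n, which follow from the definitions in this section under the independence assumption, together with bias(RES) = bias(REL) + bias(UNC), which holds because the expected Brier score does not depend on n. The function name and the values chosen for ϕk and μk are hypothetical.

```python
def standard_biases(phi, mu_k, n):
    """Biases of REL, RES and UNC in the standard decomposition.

    phi[k]  : long-run probability that forecast value k is issued
    mu_k[k] : long-run event frequency when forecast value k is issued
    n       : sample size
    """
    mu = sum(f * m for f, m in zip(phi, mu_k))  # long-run event frequency
    nu = [1 - (1 - f) ** n for f in phi]        # Pr(n_k > 0), n_k binomial(n, phi_k)
    bias_rel = sum(v * m * (1 - m) for v, m in zip(nu, mu_k)) / n
    bias_unc = -mu * (1 - mu) / n
    bias_res = bias_rel + bias_unc              # expected Brier score is free of n
    return bias_rel, bias_res, bias_unc


# Hypothetical long-run quantities: two equally likely forecast values.
phi = [0.5, 0.5]
mu_k = [0.2, 0.8]
for n in (10, 40, 160):
    print(n, standard_biases(phi, mu_k, n))
```

In this example the resolution bias is positive for every n shown, because the sum of νk,n μk(1 − μk) (about 0.32) exceeds μ(1 − μ) = 0.25.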

We prove in the appendix that unbiased estimators for the reliability and resolution are unattainable. Nonetheless, we propose a new decomposition of the Brier score in which the estimate of uncertainty is unbiased and the estimates of reliability and resolution have smaller biases than in the standard decomposition. This new decomposition is

$$B = \mathrm{REL}' - \mathrm{RES}' + \mathrm{UNC}' \tag{10}$$

where

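$$\mathrm{REL}' = \mathrm{REL} - \frac{1}{n}\sum_{k \in K_1}\frac{n_k\bar{x}_k(1 - \bar{x}_k)}{n_k - 1} \tag{11}$$
$$\mathrm{RES}' = \mathrm{RES} - \frac{1}{n}\sum_{k \in K_1}\frac{n_k\bar{x}_k(1 - \bar{x}_k)}{n_k - 1} + \frac{\bar{x}(1 - \bar{x})}{n - 1} \tag{12}$$
$$\mathrm{UNC}' = \frac{n}{n - 1}\,\bar{x}(1 - \bar{x}) \tag{13}$$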

and K1 = {k : nk > 1}, so that the sums are over those k for which nk exceeds 1. Usually all nk exceed 1 because small nk are often eradicated by relabelling distinct forecasts with a common forecast value (e.g. Bröcker and Smith, 2007), although this will typically change the limiting values REL and RES being estimated. Whether or not forecasts are pooled in this way, the new decomposition yields more accurate estimates than the standard decomposition. We prove in the appendix that UNC′ is unbiased and that the biases of REL′ and RES′ decay to zero at a faster rate than the biases of REL and RES as the sample size, n, increases.
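A minimal sketch of a bias-corrected decomposition follows. It assumes that the correction subtracts the same sum, (1/n) Σ nk x̄k(1 − x̄k)/(nk − 1) over those k with nk > 1, from both REL and RES, that x̄(1 − x̄)/(n − 1) is added to RES, and that UNC′ = nUNC/(n − 1). These forms are consistent with the properties stated in the text, but they should be checked against the published equations (11) to (13) before use. The function name and data are hypothetical.

```python
from collections import defaultdict


def bias_corrected_decomposition(p, x):
    """Return (REL', RES', UNC') under the assumed correction terms."""
    n = len(p)
    if n < 2:
        raise ValueError("at least two forecast-outcome pairs are required")
    outcomes = defaultdict(list)  # distinct forecast value -> its outcomes
    for p_i, x_i in zip(p, x):
        outcomes[p_i].append(x_i)
    x_bar = sum(x) / n
    rel = res = corr = 0.0
    for pi_k, xs in outcomes.items():
        n_k = len(xs)
        x_bar_k = sum(xs) / n_k
        rel += n_k * (x_bar_k - pi_k) ** 2
        res += n_k * (x_bar_k - x_bar) ** 2
        if n_k > 1:  # the correction is defined only when n_k exceeds 1
            corr += n_k * x_bar_k * (1 - x_bar_k) / (n_k - 1)
    unc = x_bar * (1 - x_bar)
    rel_c = (rel - corr) / n
    res_c = (res - corr) / n + unc / (n - 1)
    unc_c = n * unc / (n - 1)
    return rel_c, res_c, unc_c


# Hypothetical sample: two distinct forecast values, four occasions each.
p = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
x = [0, 0, 1, 0, 1, 1, 0, 1]
rel_c, res_c, unc_c = bias_corrected_decomposition(p, x)
brier = sum((p_i - x_i) ** 2 for p_i, x_i in zip(p, x)) / len(p)
print(rel_c, res_c, unc_c, rel_c - res_c + unc_c, brier)
```

With so small a sample the corrected reliability is negative (about −0.06), although the three terms still recombine to the Brier score of 0.19.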

The new decomposition has one complication: REL′ and RES′ can be negative. In such cases, we recommend replacing the sum in the definitions of REL′ (11) and RES′ (12) by the largest value for which both terms are non-negative. This is equivalent to replacing REL′ with max{REL′,REL′ − RES′,0} and replacing RES′ with max{RES′,RES′ − REL′,0}. This ensures that the three terms in the decomposition still combine to equal B.
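The replacement rule stated above can be written directly in code. This sketch takes the two bias-corrected estimates as inputs and applies the two maxima; the function name and the numerical values are hypothetical.

```python
def truncate_components(rel_c, res_c):
    """Apply max{REL', REL' - RES', 0} and max{RES', RES' - REL', 0}."""
    rel_adj = max(rel_c, rel_c - res_c, 0.0)
    res_adj = max(res_c, res_c - rel_c, 0.0)
    return rel_adj, res_adj


# Hypothetical estimates with a negative reliability term.
rel_adj, res_adj = truncate_components(-0.06, 0.04)
print(rel_adj, res_adj)

# Non-negative estimates are returned unchanged.
unchanged = truncate_components(0.01, 0.02)
```

The difference between the two adjusted terms equals the difference between the original terms, which is why the adjusted decomposition still sums to B.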

Independent work by Bröcker (2011) proposed a different decomposition:

equation image(14)
equation image(15)
equation image(16)

We prove in the appendix that the biases of these estimates all decay to zero more slowly than the biases of our decomposition. We also show in the appendix that the biases of the uncertainty and reliability terms in these three decompositions satisfy the following orderings for all n:

equation image(17)

and

equation image(18)

The ordering of the biases of the resolution terms can depend on n.

3. The Brier skill score and attributes diagram

The Brier skill score (e.g. Glahn and Jorgensen, 1970) is defined as BSS = 1 − B/Bref, where Bref is the Brier score for some set of reference forecasts. If the reference forecasts are always equal to the in-sample climatology, $\bar{x}$, then Bref = UNC and BSS = (RES − REL)/UNC (Murphy, 1996). The preceding calculations show that both the numerator and denominator of this BSS are systematically underestimated, but to say anything about the bias of the ratio would require further analysis. We do find, however, that the BSS based on the new decomposition is larger than the BSS based on the standard decomposition: since UNC′ ≥ UNC, we have BSS′ = 1 − B/UNC′ ≥ 1 − B/UNC = BSS with equality if and only if B = 0.
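The inequality between the two skill scores can be checked with a few lines of code. The sketch below uses the relation UNC′ = nUNC/(n − 1) from the appendix; the function name and the input values are hypothetical and are not the values reported for the SST forecasts.

```python
def brier_skill_scores(b, unc, n):
    """Return (BSS, BSS') relative to in-sample climatology."""
    if unc <= 0 or n < 2:
        raise ValueError("need unc > 0 and n >= 2")
    unc_c = n * unc / (n - 1)  # bias-corrected uncertainty
    return 1 - b / unc, 1 - b / unc_c


# Hypothetical values: B = 0.19, UNC = 0.25, n = 8.
bss, bss_c = brier_skill_scores(0.19, 0.25, 8)
print(bss, bss_c)  # about 0.24 and 0.335
```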

Our new decomposition of the Brier score also has implications for the attributes diagram of Hsu and Murphy (1986). The attributes diagram augments the reliability diagram, which comprises the points $(\pi_k, \bar{x}_k)$, with three lines: the horizontal line at height $\bar{x}$ representing climatology, the 45° line through the origin representing perfect reliability (REL = 0) and the no-skill line $\bar{x}_k = (\pi_k + \bar{x})/2$. The positions of the points and the first two lines in the diagram are unaffected because the forecast values πk are fixed, the quantities $\bar{x}_k$ and $\bar{x}$ are unbiased estimators of the corresponding long-run quantities and if all points lie on the 45° line then REL′ = 0. The no-skill line, however, is affected. This line is derived by recalling that reference forecasts equal to the in-sample climatology, $\bar{x}$, yield a Brier score equal to UNC. Setting B = UNC implies REL = RES and the no-skill line is obtained by equating the summands in the definitions of REL (2) and RES (3). However, UNC is a biased estimator for the expected Brier score achieved by the long-run climatological reference forecast, μ. An unbiased estimator is UNC′ and setting B = UNC′ implies REL′ = RES′. A new, no-skill curve is obtained, therefore, by equating the summands in the definitions of REL′ (11) and RES′ (12). Rewriting the last term in the definition of RES′ as the sum $\frac{1}{n}\sum_{k \in K_0} n_k\,\bar{x}(1 - \bar{x})/(n - 1)$ shows that this new curve is defined by

equation image

where equation image and equation image. This curve is a hyperbola with asymptotes πk = β/2 and equation image. In the region between the two branches of the hyperbola, the REL′ summand is less than the corresponding RES′ summand and so this region represents forecasts that make a positive contribution to skill.
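The shape of this curve can be recovered with a short derivation, offered here as a sketch rather than as the published form. It assumes that the summands of REL′ and RES′ contain the same correction term, which therefore cancels, and that the remaining extra term in the RES′ summand is nk x̄(1 − x̄)/(n − 1); the constant c is introduced only for this illustration.

```latex
\begin{align*}
(\bar{x}_k - \pi_k)^2 &= (\bar{x}_k - \bar{x})^2 + c,
  \qquad c = \frac{\bar{x}(1 - \bar{x})}{n - 1},\\
\bar{x}_k &= \frac{\pi_k^2 - \bar{x}^2 - c}{2(\pi_k - \bar{x})}
           = \frac{\pi_k + \bar{x}}{2} - \frac{c}{2(\pi_k - \bar{x})}.
\end{align*}
```

Under these assumptions the curve has a vertical asymptote at πk = x̄ and an oblique asymptote along the standard no-skill line x̄k = (πk + x̄)/2, and it collapses onto that line as n increases and c tends to zero.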

4. A numerical illustration

We illustrate the standard (1) and bias-corrected (10) decompositions of the Brier score using 4928 probabilistic, seasonal forecasts of equatorial Pacific monthly mean SST anomalies constructed previously by Stephenson et al. (2005). The forecasts are for the event that the anomaly is positive and are verified against the ERA-40 reanalysis (Uppala et al., 2005). Further information about the forecasts and verifications may be found in Stephenson et al. (2008). We categorize the forecast probabilities into ten non-overlapping bins of width 0.1 and replace each forecast by its corresponding bin mean so that there are ten distinct forecast probabilities. Qualitatively similar results were obtained for other bin widths.
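The binning step can be sketched as follows. The text does not state which bin contains a probability that falls exactly on a bin edge, so the convention used here (each bin includes its lower edge, and a forecast of exactly 1 joins the last bin) is an assumption, as are the function name and the example values.

```python
def bin_forecasts(p, n_bins=10):
    """Replace each probability by the mean of the probabilities in its bin."""
    # Bin j covers [j / n_bins, (j + 1) / n_bins); a forecast of 1 joins the last bin.
    index = [min(int(p_i * n_bins), n_bins - 1) for p_i in p]
    totals = [0.0] * n_bins
    counts = [0] * n_bins
    for j, p_i in zip(index, p):
        totals[j] += p_i
        counts[j] += 1
    return [totals[j] / counts[j] for j in index]


# Hypothetical forecasts falling in three of the ten bins.
binned = bin_forecasts([0.12, 0.18, 0.95, 1.0, 0.5])
print(binned)  # approximately [0.15, 0.15, 0.975, 0.975, 0.5]
```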

We use the following procedure to illustrate how the biases in the Brier-score components depend on the sample size, n. First, we calculate the bias-corrected reliability, resolution and uncertainty components of the Brier score using all 4928 forecast-verification pairs and take these values as approximations to the true, long-run values REL, RES and UNC. Then, for each n < 4928, we form 10 000 samples of n forecast-verification pairs by subsampling at random from the full data set and compute the standard and bias-corrected decompositions for each sample. Thus, for each n, we obtain 10 000 values of REL and REL′, RES and RES′, and UNC and UNC′. The means of these values approximate the corresponding expected values and are plotted in Figure 1 for 10 ≤ n ≤ 100. The 5% and 95% quantiles of the 10 000 values are also plotted to illustrate the sampling variation.
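The subsampling procedure can be sketched in a few dozen lines. Several details below are assumptions made for illustration: the data are synthetic rather than the SST forecasts, subsamples are drawn without replacement (the text says only "at random"), 2000 subsamples are used instead of 10 000 to keep the run short, and the bias-corrected terms use correction terms inferred from the properties stated in section 2. All function names are hypothetical.

```python
import random
from collections import defaultdict


def decompositions(p, x):
    """Return the standard and bias-corrected (REL, RES, UNC) triples."""
    n = len(p)
    outcomes = defaultdict(list)
    for p_i, x_i in zip(p, x):
        outcomes[p_i].append(x_i)
    x_bar = sum(x) / n
    rel = res = corr = 0.0
    for pi_k, xs in outcomes.items():
        n_k = len(xs)
        x_bar_k = sum(xs) / n_k
        rel += n_k * (x_bar_k - pi_k) ** 2
        res += n_k * (x_bar_k - x_bar) ** 2
        if n_k > 1:
            corr += n_k * x_bar_k * (1 - x_bar_k) / (n_k - 1)
    unc = x_bar * (1 - x_bar)
    standard = (rel / n, res / n, unc)
    corrected = ((rel - corr) / n, (res - corr) / n + unc / (n - 1), n * unc / (n - 1))
    return standard, corrected


def mean_components(p, x, n, n_samples, rng):
    """Average both decompositions over random subsamples of size n."""
    totals_std = [0.0, 0.0, 0.0]
    totals_cor = [0.0, 0.0, 0.0]
    for _ in range(n_samples):
        idx = rng.sample(range(len(p)), n)  # subsample without replacement
        std, cor = decompositions([p[i] for i in idx], [x[i] for i in idx])
        for j in range(3):
            totals_std[j] += std[j]
            totals_cor[j] += cor[j]
    return [t / n_samples for t in totals_std], [t / n_samples for t in totals_cor]


rng = random.Random(0)
# Synthetic, perfectly reliable forecasts taking five distinct values.
values = [0.1, 0.3, 0.5, 0.7, 0.9]
p = [rng.choice(values) for _ in range(2000)]
x = [1 if rng.random() < p_i else 0 for p_i in p]
mean_std, mean_cor = mean_components(p, x, n=20, n_samples=2000, rng=rng)
print("standard :", mean_std)
print("corrected:", mean_cor)
```

Repeating the final call over a range of n and plotting the two sets of means against n gives curves of the kind described for Figure 1.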

Figure 1.

Expected values of reliability, resolution and uncertainty against sample size, n, for the SST forecasts: standard decomposition (solid lines), bias-corrected decomposition (dashed lines) and true, long-run values (dotted lines). Pointwise 5–95% intervals of the sampling distributions are superimposed: standard decomposition (light grey regions), bias-corrected decomposition (dark grey regions) and their overlap (hashed).

As expected, the standard Brier score decomposition yields large biases. The expected values of REL and RES exceed their long-run values while the expected value of UNC lies below its long-run value. The magnitudes of the biases are considerable when n is small. For example, the expected value of REL is at least five times greater than its long-run value when n is less than 40. When the bias-corrected decomposition is used, the bias of UNC′ is zero for all n while the biases of REL′ and RES′ are smaller and decay more rapidly than the biases of REL and RES. The biases of REL′ and RES′ are negligible when n is greater than about 60, an accuracy achieved by REL and RES only once n exceeds 300 (not shown).

The 5–95% intervals defined by the quantiles of the sampling distributions are wider for RES′ than for RES, slightly wider for UNC′ than for UNC and slightly narrower for REL′ than for REL. The sampling variation is greater for UNC′ than for UNC because UNC′/UNC = n/(n − 1) > 1. We do not know if the sampling variation for REL′ is always less than for REL, or if the sampling variation for RES′ is always greater than for RES. For individual data sets, standard errors and confidence intervals for the three components might be estimated using ideas similar to those employed by Ferro (2007).

The SST data used in Figure 1 exhibit significant temporal dependence up to lags of three months and therefore violate the independence assumption that was used to derive the biases of the decompositions in section 2. The resampling scheme employed to construct Figure 1, however, destroys the time order of the data and so the results are indicative of how the decompositions perform when there is no temporal dependence. The performance of the decompositions in the presence of temporal dependence is discussed in section 5.

To illustrate our proposed adjustment to the no-skill curve in the attributes diagram, we consider a subset of 88 forecasts from a single grid point, at 150°W in the central equatorial Pacific. Using data from a single grid point helps to highlight the differences between the standard and bias-corrected no-skill curves because the two are visually indistinguishable when n is large. The diagram is shown in Figure 2. We see that the bias-corrected curve results in a larger positive skill region. The Brier score for these data is B = 0.131 and the standard decomposition yields REL = 0.018, RES = 0.137 and UNC = 0.250 with BSS = 0.475, while the bias-corrected decomposition yields REL′ = 0.009, RES′ = 0.129 and UNC′ = 0.251 with BSS′ = 0.478.

5. Discussion

The reliability–resolution–uncertainty decomposition of the Brier score is obtained by conditioning on the forecasts (Murphy, 1973). An alternative decomposition is obtained by conditioning on the verifications (Murphy and Winkler, 1987) to yield three terms that Murphy (1996) refers to as the type 2 conditional bias, the discrimination and the variance of the forecasts. The standard version of this alternative decomposition yields biased estimates of these three quantities and a bias-corrected version can be obtained using calculations similar to those described above. Decompositions obtained by conditioning on either forecasts or verifications can be obtained not only for the Brier score but for any score that takes the form of a mean-squared error (Murphy, 1996) or weighted mean-squared error (Young, 2010). Again, the standard decompositions are biased but bias-corrected versions can be derived.

Figure 2.

Attributes diagram for the SST forecasts. The circles are centred on the points $(\pi_k, \bar{x}_k)$ and their areas are proportional to the number, nk, of contributing data. The light grey region is the positive-skill region given by the standard no-skill line (dotted line). The dark grey region is the area added to the positive-skill region by using the bias-corrected no-skill curve (dashed curve). The solid horizontal and vertical lines represent the observed climatology, $\bar{x}$.

In fact, all proper scores can be decomposed into reliability, resolution and uncertainty terms (Bröcker, 2009). It would be useful to identify the bias of the decomposition for other scores and to construct bias-corrected decompositions where possible. Bröcker (2011) has considered the logarithmic (ignorance) score and the multi-category Brier score. We consider briefly the cases of the ranked probability score (RPS: Epstein, 1969) and the continuous ranked probability score (CRPS: Brown, 1974; Matheson and Winkler, 1976).

The RPS can be written as a sum of Brier scores corresponding to a nested sequence of events (e.g. Toth et al., 2003) and therefore a decomposition of the RPS into reliability, resolution and uncertainty terms can be obtained by summing the corresponding terms of these Brier scores. Both standard and bias-corrected decompositions can be formed in this way. The CRPS can be written as an integral of Brier scores corresponding to a nested continuum of events (e.g. Hersbach, 2000) and so the CRPS can be decomposed in a similar manner, integrating the terms of the Brier score decompositions.
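The identity between the RPS and a sum of Brier scores can be illustrated numerically. The sketch below assumes the unnormalised form of the RPS, the mean over occasions of the sum of squared differences between cumulative forecast probabilities and cumulative outcome indicators; some authors divide by the number of categories minus one. Function names and data are hypothetical.

```python
def rps(forecasts, outcomes):
    """Mean ranked probability score (unnormalised).

    forecasts[i][j] is the probability of category j on occasion i;
    outcomes[i] is the index of the category that occurred.
    """
    total = 0.0
    for probs, obs in zip(forecasts, outcomes):
        cum = 0.0
        for j in range(len(probs) - 1):  # the last cumulative term is always zero
            cum += probs[j]
            indicator = 1.0 if obs <= j else 0.0
            total += (cum - indicator) ** 2
    return total / len(forecasts)


def rps_from_brier(forecasts, outcomes):
    """Sum of Brier scores for the nested events 'category <= j'."""
    n = len(forecasts)
    n_cat = len(forecasts[0])
    total = 0.0
    for j in range(n_cat - 1):
        p = [sum(probs[: j + 1]) for probs in forecasts]  # Pr(category <= j)
        x = [1 if obs <= j else 0 for obs in outcomes]    # did that event occur?
        total += sum((p_i - x_i) ** 2 for p_i, x_i in zip(p, x)) / n
    return total


# Hypothetical three-category forecasts on two occasions.
forecasts = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
outcomes = [1, 0]
direct = rps(forecasts, outcomes)
summed = rps_from_brier(forecasts, outcomes)
print(direct, summed)  # both about 0.15
```

Decomposing each Brier score inside the loop and summing the resulting terms would give the corresponding decomposition of the RPS.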

These decompositions of the RPS and CRPS, however, are unsatisfactory because they measure the average reliability and resolution of sets of forecasts for binary events instead of the reliability and resolution of the full probability distributions specified by the forecasts. Other decompositions based on the full distributions are preferable (Murphy, 1972; Candille and Talagrand, 2005). It appears to be possible to construct bias-corrected versions of these decompositions too. These alternative decompositions of the RPS and CRPS rely on each distinct forecast distribution being issued several times so that empirical distributions of the corresponding verifications can be constructed. Unless there are very many forecasts, it is therefore often necessary to group similar, rather than identical, forecast distributions (Candille and Talagrand, 2005). This is also often done for the Brier score when the issued forecast probabilities can take any value in the interval [0,1] instead of only K distinct values. When such grouping is used, Stephenson et al. (2008) show that the Brier score obtained by combining the reliability, resolution and uncertainty terms will typically differ from the value obtained by evaluating the Brier score directly from the ungrouped forecasts. In order to retrieve the Brier score for the ungrouped forecasts, it is necessary to generalize the resolution term in the decomposition to account for within-group variation. The same can be expected to be true for the decompositions of the RPS and CRPS when forecasts are grouped. The generalized resolution defined by Stephenson et al. (2008) is also biased but, again, a bias-corrected version can be derived. We expect that bias-corrected versions could also be obtained in the cases of the RPS and CRPS. Finally, other decompositions of the RPS and CRPS have been proposed that avoid the need to group forecasts (Hersbach, 2000; Candille and Talagrand, 2005). The bias of these decompositions could be investigated too.

We have assumed throughout that the forecasts and verifications are independent and identically distributed random variables. Temporal dependence is likely to inflate biases and also to reduce the rates at which biases decay to zero. Analysing the biases in the presence of temporal dependence is complicated, however, because the verifications that contribute to the conditional relative frequencies, $\bar{x}_k$, are randomly spaced in time. Whichever decomposition is used, therefore, checking the convergence of the reliability, resolution and uncertainty estimates as the sample size increases is worthwhile. This can be done by plotting against n the estimates calculated from the first n data.
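The convergence check can be sketched as a function that re-evaluates any estimator on the first n pairs for increasing n. The bias-corrected uncertainty, nx̄(1 − x̄)/(n − 1), serves as the example estimator; the function names and the short outcome sequence are hypothetical, and the plotting step is left to the reader.

```python
def running_estimates(p, x, estimator, start=2):
    """Evaluate estimator on the first n pairs for n = start, ..., len(p)."""
    return [(n, estimator(p[:n], x[:n])) for n in range(start, len(p) + 1)]


def unc_corrected(p, x):
    """Bias-corrected uncertainty n * xbar * (1 - xbar) / (n - 1); p is unused."""
    n = len(x)
    x_bar = sum(x) / n
    return n * x_bar * (1 - x_bar) / (n - 1)


# Hypothetical outcomes in time order, with a constant forecast.
x = [0, 1, 1, 0, 1, 0, 0, 1]
p = [0.5] * len(x)
path = running_estimates(p, x, unc_corrected)
for n, estimate in path:
    print(n, estimate)
```

Passing a reliability or resolution estimator in place of `unc_corrected` produces the corresponding paths.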

6. Summary

The standard decomposition of the Brier score is biased and we have proposed a simple, bias-corrected decomposition that provides a more accurate description of forecast reliability and resolution when the verification data can be described by independent and identically distributed random variables.

Acknowledgements

This work was funded by NERC Directed Grant NE/H003509/1. Caio Coelho provided the data. Expert comments from Jochen Bröcker, Ian Jolliffe and an anonymous referee helped us to improve the original manuscript.

Appendix

Proofs

There is no unbiased decomposition

If an unbiased estimator for the limiting reliability (5) exists, then it must be the sum of unbiased estimators for its summands, and these estimators could be subtracted from the summands of the sample reliability REL (2) to obtain unbiased estimators for the summands, νk,nμk(1 − μk), of the bias (8) of REL, where νk,n = Pr(nk > 0) = 1 − (1 − ϕk)^n, because the distribution of nk is binomial with parameters n and ϕk. An unbiased estimator for νk,nμk(1 − μk) must be a function of nk and {xi : i ∈ Ik} but the order of the xi carries no information about μk or ϕk and so we can require this estimator to be a function of nk and sk, where $s_k = \sum_{i \in I_k} x_i$ and the conditional distribution of sk given nk is binomial with parameters nk and μk. Consider an estimator g(nk,sk) with expectation

$$E\{g(n_k, s_k)\} = \sum_{m=0}^{n}\sum_{s=0}^{m} g(m, s)\binom{n}{m}\phi_k^{m}(1 - \phi_k)^{n-m}\binom{m}{s}\mu_k^{s}(1 - \mu_k)^{m-s}$$

This polynomial in μk and ϕk must equal the polynomial

$$\{1 - (1 - \phi_k)^n\}\,\mu_k(1 - \mu_k)$$

for all μk and ϕk if g(nk,sk) is to be an unbiased estimator for νk,nμk(1 − μk). This can happen only if, for all i = 0,1,…,n and j = 0,1,…,n, the coefficients of $\mu_k^{i}\phi_k^{j}$ in the two polynomials are equal. The latter polynomial, however, has a non-zero coefficient for $\mu_k^{2}\phi_k$ and the former polynomial contains no such term. Thus, there is no unbiased estimator for νk,nμk(1 − μk) and hence no unbiased estimator for REL. A similar argument shows that there is no unbiased estimator for RES.

The bias of the new decomposition

We show first that the uncertainty term (13) in the new decomposition is unbiased. The definitions of UNC (4) and UNC′ (13) yield UNC′ = nUNC/(n − 1) while the expressions for the limiting uncertainty (6) and E(UNC) (7) yield E(UNC) = (n − 1)μ(1 − μ)/n so that E(UNC′) = nE(UNC)/(n − 1) = μ(1 − μ) and

$$\mathrm{bias}(\mathrm{UNC}') = 0 \tag{A1}$$

To find the bias of the reliability term (11), we write

equation image

where equation image if nk > 1 and rk = 0 if nk ≤ 1. If nk ≤ 1 then equation image. If nk > 1 then

equation image

because equation image when xi = 0 or 1, and so

equation image

Therefore,

equation image

and, using the bias (8) of REL,

$$\mathrm{bias}(\mathrm{REL}') = \frac{1}{n}\sum_{k=1}^{K}\Pr(n_k = 1)\,\mu_k(1 - \mu_k) \tag{A2}$$

The bias of RES′ equals the bias of REL′ because UNC′ and the Brier score itself are unbiased: the expectation of the Brier score is independent of n.

Next, we calculate the rate at which the bias of REL′ and hence of RES′ decays as n increases. From the binomial distribution of nk, we have

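$$\Pr(n_k = 1) = n\phi_k(1 - \phi_k)^{n-1}$$

and therefore the bias (A2) of REL′ is

$$\mathrm{bias}(\mathrm{REL}') = \sum_{k=1}^{K}\phi_k(1 - \phi_k)^{n-1}\mu_k(1 - \mu_k)$$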

which decays geometrically as n increases. The leading-order terms in the biases for the standard decomposition decay at the much slower rate of 1/n.
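The difference between the two decay rates is easy to see numerically. The sketch below assumes bias(REL) = (1/n) Σ {1 − (1 − ϕk)^n} μk(1 − μk) for the standard decomposition and bias(REL′) = Σ ϕk(1 − ϕk)^(n − 1) μk(1 − μk) for the new one; both forms are inferred from the binomial distribution of nk under the independence assumption and should be checked against the published expressions. The function name and the long-run quantities are hypothetical.

```python
def reliability_biases(phi, mu_k, n):
    """Return (bias of REL, bias of REL') under the i.i.d. assumption."""
    standard = sum(
        (1 - (1 - f) ** n) * m * (1 - m) for f, m in zip(phi, mu_k)
    ) / n
    corrected = sum(
        f * (1 - f) ** (n - 1) * m * (1 - m) for f, m in zip(phi, mu_k)
    )
    return standard, corrected


# Hypothetical setting: ten equally likely, perfectly reliable forecast values.
phi = [0.1] * 10
mu_k = [0.05 + 0.1 * k for k in range(10)]
for n in (10, 50, 100):
    standard, corrected = reliability_biases(phi, mu_k, n)
    print(n, standard, corrected, corrected / standard)
```

In this setting the ratio of the two biases falls from roughly 0.6 at n = 10 to well below 0.001 at n = 100.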

The bias of Bröcker's decomposition

Now we calculate the biases and their rates of decay for the decomposition (14)–(16) proposed by Bröcker (2011). Arguments similar to those above show that

equation image(A3)

and

equation image(A4)

with bias(RES′′) = bias(REL′′) + bias(UNC′′). The biases of REL′′, RES′′ and UNC′′ all decay to zero at rate 1/n². This is immediate for UNC′′. To see that it is true for REL′′, and hence for RES′′, note that

equation image

and

equation image

the leading-order term of which decays at rate 1/n. Therefore, equation image decays at rate 1/n and bias(REL′′) decays at rate 1/n².

The ordering of the biases

The ordering (17) on the biases of the uncertainty terms follows immediately from the bias expressions (9), (A1) and (A3). The ordering (18) on the biases of the reliability terms follows from the bias expressions (8), (A2) and (A4) because

equation image
