## 1. The Brier score and its decomposition

Suppose that probabilities *p*_{1}, *…*, *p*_{n} are forecasts for the occurrence of *n* events, and let *x*_{1}, *…*, *x*_{n} indicate whether or not the *n* events occur, so that *x*_{i} = 1 if the *i*th event occurs and *x*_{i} = 0 if the *i*th event fails to occur. The Brier score (Brier, 1950) for these forecasts is

$$
B = \frac{1}{n}\sum_{i=1}^{n} (p_i - x_i)^2
$$

and takes values in the interval [0,1], with smaller values indicating better forecasts.
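As a minimal sketch of the definition, the Brier score is the mean squared difference between the forecast probabilities and the binary outcomes. The function name `brier_score` and the four forecast-outcome pairs below are hypothetical and serve only to illustrate the calculation.

```python
def brier_score(p, x):
    """Mean squared difference between forecasts p and binary outcomes x."""
    if len(p) != len(x) or not p:
        raise ValueError("p and x must be non-empty and of equal length")
    return sum((pi - xi) ** 2 for pi, xi in zip(p, x)) / len(p)


# Hypothetical data: two sharp, correct forecasts and two uninformative ones.
forecasts = [0.9, 0.1, 0.5, 0.5]
outcomes = [1, 0, 1, 0]

score = brier_score(forecasts, outcomes)
print(score)  # approximately 0.13, up to floating-point rounding
```

The two sharp forecasts each contribute 0.01 to the sum and the two forecasts of 0.5 each contribute 0.25, so the score is 0.52/4 = 0.13.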

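Suppose now that each forecast can take one of only *K* distinct values, *π*_{1}, *…*, *π*_{K}. Let *I*_{k} = {*i* : *p*_{i} = *π*_{k}} be the set of indices for those occasions on which *π*_{k} is forecast and let *n*_{k} be the number of such occasions. For those *k* for which *n*_{k} > 0, define the conditional relative frequency,

$$
\bar{x}_k = \frac{1}{n_k}\sum_{i \in I_k} x_i
$$

to be the proportion of events that occur out of the *n*_{k} occasions on which *π*_{k} is forecast. Also define

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$

to be the overall proportion of occasions on which the event occurs. Then the Brier score can be decomposed (Murphy, 1973) as

$$
B = \mathrm{REL} - \mathrm{RES} + \mathrm{UNC} \tag{1}
$$

where

$$
\begin{aligned}
\mathrm{REL} &= \frac{1}{n}\sum_{k \in K_0} n_k \left(\bar{x}_k - \pi_k\right)^2, \\
\mathrm{RES} &= \frac{1}{n}\sum_{k \in K_0} n_k \left(\bar{x}_k - \bar{x}\right)^2, \\
\mathrm{UNC} &= \bar{x}\left(1 - \bar{x}\right),
\end{aligned}
$$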

and *K*_{0} = {*k* : *n*_{k} > 0} so that the sums are over those *k* for which *n*_{k} exceeds zero. The first term (REL) in the decomposition is a weighted average of the squared differences between the conditional relative frequencies and the corresponding forecasts and measures the reliability of the forecasts. The best score for the reliability is zero, which is obtained if the conditional relative frequencies are equal to their corresponding forecasts. The second term (RES) is a weighted variance of the conditional relative frequencies and measures the resolution of the forecasts. The worst score for the resolution is zero, which is obtained if the conditional relative frequencies are the same for all forecasts. The third term (UNC) is a measure of uncertainty or climatological variation in the event occurrence. Very rare or very common events have low uncertainty.
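The three terms can be computed directly by grouping occasions according to the forecast value issued. The sketch below assumes that forecasts sharing a value are stored as exactly equal floats, so that grouping by equality recovers the sets *I*_{k}; the function name `brier_decomposition` and the data are hypothetical. It also checks numerically that the Brier score equals REL minus RES plus UNC.

```python
from collections import defaultdict


def brier_decomposition(p, x):
    """Return (REL, RES, UNC) for forecasts p and binary outcomes x."""
    n = len(p)
    # Group outcomes by forecast value; only values that were actually
    # issued appear as keys, which matches summing over n_k > 0.
    groups = defaultdict(list)
    for pi, xi in zip(p, x):
        groups[pi].append(xi)

    xbar = sum(x) / n  # overall relative frequency of the event
    rel = sum(len(g) * (sum(g) / len(g) - pk) ** 2
              for pk, g in groups.items()) / n
    res = sum(len(g) * (sum(g) / len(g) - xbar) ** 2
              for g in groups.values()) / n
    unc = xbar * (1 - xbar)
    return rel, res, unc


# Hypothetical data: two forecast values, each issued four times.
forecasts = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 1, 0, 1, 1, 0, 1]

rel, res, unc = brier_decomposition(forecasts, outcomes)
brier = sum((pi - xi) ** 2 for pi, xi in zip(forecasts, outcomes)) / len(forecasts)

print(rel, res, unc)          # approximately 0.0025, 0.0625, 0.25
print(brier, rel - res + unc)  # both approximately 0.19
```

Here the conditional relative frequencies are 0.25 and 0.75, each differing from its forecast by 0.05, so the reliability term is small but non-zero, while the resolution term reflects the spread of the two conditional relative frequencies about the overall frequency of 0.5.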

Bröcker (2011) showed that the three terms in the Brier score decomposition (1) are biased. This means that the expected value of each term is typically different from its true value, defined to be the value that would be obtained were the sample size, *n*, increased to infinity. The reliability is systematically overestimated and the uncertainty is systematically underestimated, while the resolution may be either overestimated or underestimated. Therefore, evaluating this standard decomposition for finite samples can give a misleading impression of forecast quality. We show that an unbiased decomposition of the Brier score is unattainable but propose a new decomposition that has smaller biases than the standard decomposition and therefore provides a more accurate measure of forecast performance. We discuss the implications of the bias for the Brier skill score and the attributes diagram, and illustrate the new decomposition with seasonal forecasts of sea-surface temperatures (SSTs).
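The direction of the biases in the reliability and uncertainty terms can be seen in a small simulation. The following sketch is not taken from the analyses cited above; the design is an assumption chosen for illustration. Each forecast is drawn uniformly from five hypothetical values and the event then occurs with exactly the forecast probability, so the forecasts are perfectly reliable: the true reliability is 0 and, because the events occur with overall probability 0.5, the true uncertainty is 0.25.

```python
import random
from collections import defaultdict


def brier_decomposition(p, x):
    """Return (REL, RES, UNC) for forecasts p and binary outcomes x."""
    n = len(p)
    groups = defaultdict(list)
    for pi, xi in zip(p, x):
        groups[pi].append(xi)
    xbar = sum(x) / n
    rel = sum(len(g) * (sum(g) / len(g) - pk) ** 2
              for pk, g in groups.items()) / n
    res = sum(len(g) * (sum(g) / len(g) - xbar) ** 2
              for g in groups.values()) / n
    return rel, res, xbar * (1 - xbar)


rng = random.Random(0)                # fixed seed for reproducibility
values = [0.1, 0.3, 0.5, 0.7, 0.9]    # hypothetical forecast values
n, reps = 50, 2000                    # sample size and number of replications

totals = [0.0, 0.0, 0.0]
for _ in range(reps):
    # Perfectly reliable forecasts: the event occurs with probability p_i.
    p = [rng.choice(values) for _ in range(n)]
    x = [1 if rng.random() < pi else 0 for pi in p]
    for j, term in enumerate(brier_decomposition(p, x)):
        totals[j] += term

mean_rel, mean_res, mean_unc = (t / reps for t in totals)
print("mean REL:", mean_rel, "(true value 0)")
print("mean UNC:", mean_unc, "(true value 0.25)")
```

Because the reliability term is a sum of squares, its average over replications is positive even though the true reliability is zero, and the average uncertainty falls below 0.25. Both effects shrink as the sample size `n` is increased.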