We show that the standard decomposition of the Brier score is biased. Our results are derived under the assumption that the forecast-verification pairs {(*p*_{i}, *x*_{i}) : *i* = 1, …, *n*} are independent and identically distributed random variables. Extensions to dependent random variables are discussed in section 5. We define the long-run relative frequency with which the event occurs to be the expected value

$$\mu = E(x_i)$$

for all *i*, and define the long-run relative frequency with which the event occurs amongst those occasions on which the forecast equals *π*_{k} to be

$$\mu_k = E(x_i \mid p_i = \pi_k)$$

for all *i* and each *k*. We also define the expected frequency with which *π*_{k} is forecast in a sequence of *n* forecasts to be

$$\phi_k = E(n_k/n)$$

for each *k*, where *ϕ*_{k} > 0. The weak law of large numbers tells us that *x̄* → *μ*, *x̄*_{k} → *μ*_{k} and *n*_{k}/*n* → *ϕ*_{k} for each *k* as *n* → ∞. Substituting these limits into the decomposition (1) of the Brier score yields the following limits for the reliability, resolution and uncertainty:

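$$\mathrm{REL}_\infty = \sum_k \phi_k (\pi_k - \mu_k)^2, \qquad \mathrm{RES}_\infty = \sum_k \phi_k (\mu_k - \mu)^2, \tag{5}$$

$$\mathrm{UNC}_\infty = \mu(1 - \mu). \tag{6}$$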

These are the values that would be obtained were the sample size infinite. For finite *n*, however, a special case of a result obtained by Bröcker (2011) shows that the expected values of the reliability, resolution and uncertainty terms in the standard decomposition (1) are as follows:

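$$\begin{aligned}
E(\mathrm{REL}) &= \sum_k \phi_k (\pi_k - \mu_k)^2 + \frac{1}{n}\sum_k \nu_{k,n}\,\mu_k(1 - \mu_k),\\
E(\mathrm{RES}) &= \sum_k \phi_k (\mu_k - \mu)^2 + \frac{1}{n}\sum_k \nu_{k,n}\,\mu_k(1 - \mu_k) - \frac{1}{n}\mu(1 - \mu),\\
E(\mathrm{UNC}) &= \mu(1 - \mu) - \frac{1}{n}\mu(1 - \mu),
\end{aligned} \tag{7}$$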

where *ν*_{k,n} is the probability that *n*_{k} exceeds zero. A special case of these expressions (in which members of an ensemble predict the event independently with probability *μ*, the forecast is the proportion of ensemble members that predict the event and *μ*_{k} = *μ* for all *k*) was obtained by Ferro *et al.* (2008) in their investigation of the effect of ensemble size on the Brier score, but they did not comment on the dependence of these expected values on the sample size, *n*. The differences between the expected and limiting values above are the biases. The bias in the reliability,

$$E(\mathrm{REL}) - \mathrm{REL}_\infty = \frac{1}{n}\sum_k \nu_{k,n}\,\mu_k(1 - \mu_k) \tag{8}$$

is non-negative and decreases monotonically to zero as *n* increases. In other words, REL tends to overestimate REL_{∞} and the reliability of the forecasts will tend to appear poorer than it would do were a larger sample available. The bias in the uncertainty,

$$E(\mathrm{UNC}) - \mathrm{UNC}_\infty = -\frac{1}{n}\mu(1 - \mu) \tag{9}$$

is non-positive and increases monotonically to zero as *n* increases. Therefore the uncertainty will tend to appear smaller than it would do were a larger sample available. The bias in the resolution can be positive or negative, but also converges to zero as *n* increases. In practice, however, the bias in the resolution is often positive because *μ*(1 − *μ*) is often small compared with Σ_{k} *ν*_{k,n}*μ*_{k}(1 − *μ*_{k}), in which case the resolution of the forecasts will tend to appear better than it would do were a larger sample available.
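These finite-sample biases can be checked numerically. The following is a minimal sketch, not taken from the paper, assuming independent and identically distributed pairs, so that the number of events in *n* trials is binomial and *ν*_{k,n} = 1 − (1 − *ϕ*_{k})^{n}. The function names and the numerical values of *ϕ*_{k} and *μ*_{k} are hypothetical. The uncertainty bias is obtained by exact enumeration, and the reliability bias is evaluated as the sum of *ν*_{k,n}*μ*_{k}(1 − *μ*_{k}) over *k*, divided by *n*.

```python
from math import comb


def expected_unc(n, mu):
    """Exact E[xbar * (1 - xbar)] when the event count is Binomial(n, mu)."""
    total = 0.0
    for s in range(n + 1):
        prob = comb(n, s) * mu**s * (1 - mu) ** (n - s)
        xbar = s / n
        total += prob * xbar * (1 - xbar)
    return total


def unc_bias(n, mu):
    """Bias of the standard uncertainty term: E(UNC) - mu * (1 - mu)."""
    return expected_unc(n, mu) - mu * (1 - mu)


def rel_bias(n, phi, mu_k):
    """Bias of the standard reliability term for i.i.d. pairs.

    phi[k] is the probability that forecast value k is issued and mu_k[k]
    is the conditional event frequency. Under the i.i.d. assumption,
    Pr(n_k > 0) = 1 - (1 - phi[k]) ** n.
    """
    return sum((1 - (1 - f) ** n) * m * (1 - m) for f, m in zip(phi, mu_k)) / n


# Illustrative values only (assumed, not from the text).
phi = [0.5, 0.3, 0.2]
mu_k = [0.1, 0.4, 0.8]
mu = sum(f * m for f, m in zip(phi, mu_k))
for n in (10, 50, 250):
    print(n, unc_bias(n, mu), rel_bias(n, phi, mu_k))
```

With these inputs the uncertainty bias is negative and the reliability bias positive, and both shrink in magnitude as *n* grows, in line with the behaviour described above.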

We prove in the appendix that unbiased estimators for the reliability and resolution are unattainable. Nonetheless, we propose a new decomposition of the Brier score in which the estimate of uncertainty is unbiased and the estimates of reliability and resolution have smaller biases than in the standard decomposition. This new decomposition is

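$$B = \mathrm{REL}' - \mathrm{RES}' + \mathrm{UNC}' \tag{10}$$

where

$$\mathrm{REL}' = \mathrm{REL} - \frac{1}{n}\sum_{k \in K_1} \frac{n_k\,\bar{x}_k(1 - \bar{x}_k)}{n_k - 1}, \tag{11}$$

$$\mathrm{RES}' = \mathrm{RES} - \frac{1}{n}\sum_{k \in K_1} \frac{n_k\,\bar{x}_k(1 - \bar{x}_k)}{n_k - 1} + \frac{\bar{x}(1 - \bar{x})}{n - 1}, \tag{12}$$

$$\mathrm{UNC}' = \mathrm{UNC} + \frac{\bar{x}(1 - \bar{x})}{n - 1} \tag{13}$$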

and *K*_{1} = {*k* : *n*_{k} > 1}, so that the sums are over those *k* for which *n*_{k} exceeds 1. In practice all *n*_{k} usually exceed 1, because small *n*_{k} are commonly removed by relabelling distinct forecasts with a common forecast value (e.g. Bröcker and Smith, 2007), although such pooling will typically change the limiting values REL_{∞} and RES_{∞} being estimated. Whether or not forecasts are pooled in this way, the new decomposition yields more accurate estimates than the standard decomposition. We prove in the appendix that UNC′ is unbiased and that the biases of REL′ and RES′ decay to zero at a faster rate than the biases of REL and RES as the sample size, *n*, increases.
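As an illustration, both decompositions can be computed side by side. The sketch below is a hedged reading of the definitions rather than the authors' code. It assumes that forecasts are grouped by their distinct values, that REL′ and RES′ subtract from REL and RES the sum over *K*_{1} of *n*_{k}*x̄*_{k}(1 − *x̄*_{k})/(*n*_{k} − 1) divided by *n*, and that *x̄*(1 − *x̄*)/(*n* − 1) is added to RES′ and UNC′. The function name `brier_decompositions` and the sample data are hypothetical.

```python
from collections import defaultdict


def brier_decompositions(p, x):
    """Return (brier, standard, corrected) for forecasts p and binary outcomes x.

    Forecasts are grouped by distinct value, so each group k has forecast
    value pi_k, count n_k and observed event frequency xbar_k.
    Requires len(p) == len(x) > 1.
    """
    n = len(p)
    groups = defaultdict(list)
    for pi, xi in zip(p, x):
        groups[pi].append(xi)

    xbar = sum(x) / n
    brier = sum((pi - xi) ** 2 for pi, xi in zip(p, x)) / n

    rel = res = corr = 0.0
    for pi_k, xs in groups.items():
        n_k = len(xs)
        xbar_k = sum(xs) / n_k
        rel += n_k * (pi_k - xbar_k) ** 2 / n
        res += n_k * (xbar_k - xbar) ** 2 / n
        if n_k > 1:  # only groups in K_1 contribute to the correction
            corr += n_k * xbar_k * (1 - xbar_k) / (n_k - 1) / n
    unc = xbar * (1 - xbar)

    extra = unc / (n - 1)  # assumed correction shared by RES' and UNC'
    standard = (rel, res, unc)
    corrected = (rel - corr, res - corr + extra, unc + extra)
    return brier, standard, corrected


# Hypothetical forecasts and outcomes, for illustration only.
p = [0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.2]
x = [0, 0, 1, 0, 1, 1, 1, 0, 1, 0]
brier, standard, corrected = brier_decompositions(p, x)
print("B =", brier)
print("REL, RES, UNC =", standard)
print("REL', RES', UNC' =", corrected)
```

In both cases the three terms recombine as reliability minus resolution plus uncertainty to give the Brier score, because the same correction is subtracted from the reliability and resolution and the same quantity is added to the resolution and uncertainty.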

The new decomposition has one complication: REL′ and RES′ can be negative. In such cases, we recommend replacing the sum in the definitions of REL′ (11) and RES′ (12) by the largest value for which both terms are non-negative. This is equivalent to replacing REL′ with max{REL′,REL′ − RES′,0} and replacing RES′ with max{RES′,RES′ − REL′,0}. This ensures that the three terms in the decomposition still combine to equal *B*.
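The adjustment for negative terms is a two-line computation. The following minimal sketch applies the max rule stated above. The function name `clip_terms` and the input values are hypothetical.

```python
def clip_terms(rel_c, res_c):
    """Apply the non-negativity adjustment to corrected REL' and RES'.

    Both terms are shifted by the same amount, so their difference, and
    hence the identity B = REL' - RES' + UNC', is preserved.
    """
    new_rel = max(rel_c, rel_c - res_c, 0.0)
    new_res = max(res_c, res_c - rel_c, 0.0)
    return new_rel, new_res


# Illustrative inputs covering the possible sign combinations.
for rel_c, res_c in [(0.25, 0.5), (-0.25, 0.5), (0.5, -0.25), (-0.5, -0.25)]:
    print((rel_c, res_c), "->", clip_terms(rel_c, res_c))
```

When both inputs are already non-negative they are returned unchanged. Otherwise the more negative term is set to zero and the other is raised by the same amount.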

Independent work by Bröcker (2011) proposed a different decomposition:

- (14)

- (15)

- (16)

We prove in the appendix that the biases of these estimates all decay to zero more slowly than the biases of our decomposition. We also show in the appendix that the biases of the uncertainty and reliability terms in these three decompositions satisfy the following orderings for all *n*:

- (17)

and

- (18)

The ordering of the biases of the resolution terms can depend on *n*.