A conditional decomposition of proper scores: quantifying the sources of information in a forecast

Scoring rules condense all information regarding the performance of a probabilistic forecast into a single numerical value, providing a convenient framework with which to rank and compare competing prediction schemes objectively. Although scoring rules provide only a single measure of forecast accuracy, the expected score can be decomposed into components that each assess a distinct aspect of the forecast, such as its calibration or information content. Since these components could depend on several factors, it is useful to evaluate forecast performance under different circumstances; if a forecaster were able to identify situations in which their forecasts perform particularly poorly, then they could more easily develop their forecast strategy to account for these deficiencies. To help forecasters identify such situations, a novel decomposition of scores is introduced that quantifies conditional forecast biases, allowing for a more detailed examination of the sources of information in the forecast. From this, we claim that decompositions of proper scores provide a broad generalisation of the well‐known analysis of variance (ANOVA) framework. The new decomposition is applied to the Brier score, which is then used to evaluate forecasts that the daily maximum temperature will exceed a range of thresholds, issued by the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss). We demonstrate how the additional information provided by this decomposition can be used to improve the performance of these forecasts, by identifying appropriate auxiliary information to include within statistical postprocessing methods.

is to be used and it is therefore necessary to consider several different aspects of a forecaster's performance (Jolliffe and Stephenson, 2012). To understand fully the strengths and limitations of a prediction system, a variety of diagnostic tools should be employed, including graphical displays, summary statistics, and numerical performance measures.
Although scoring rules provide only a single measure of forecast quality, the expected score can be decomposed into components that each assess a distinct aspect of the forecast performance. Score decompositions provide additional feedback to the forecaster, which can be used to identify the strengths and limitations of a prediction scheme, and, in turn, help to improve future forecasts. Most commonly, the expected score is decomposed into terms quantifying the forecaster's uncertainty, resolution, and reliability. The uncertainty describes the inherent variability in the forecasting scenario, while the resolution measures the extent to which this variation is captured by the forecaster. The reliability, or (auto)calibration of a forecaster, on the other hand, refers to how well predictions align with the corresponding observations. These terms are all related to the marginal and conditional distributions of the forecasts and observations, and score decompositions therefore connect scoring rules to the distributions-oriented framework for forecast verification introduced by Murphy and Winkler (1987).
The decomposition of scores into uncertainty, resolution, and reliability has been studied in most detail using the Brier score (Brier, 1950; Murphy, 1973) and is often posited as a reason for the score's popularity. However, several alternative scores have been similarly decomposed: the weighted Brier score (Young, 2010), the discrete and continuous ranked probability scores (Sanders, 1963; Murphy, 1972; Hersbach, 2000; Candille and Talagrand, 2005), the error-spread score (Christensen, 2015), the quantile score (Bentzien and Friederichs, 2014), and the logarithmic or ignorance score and variations thereof (Weijs et al., 2010; Tödter and Ahrens, 2012). Bröcker (2009) builds on results in DeGroot and Fienberg (1982) to show that the expectation of any proper scoring rule can be partitioned into terms that represent the forecast uncertainty, resolution, and reliability, using the entropy and divergence functions associated with that score. A simple and interpretable generalisation of this result is provided by Siegert (2017).
Other decompositions have also been proposed that supply the forecaster with alternative information to that provided by the uncertainty, resolution, and reliability (Yates, 1982), and a thorough overview of the use of score decompositions in forecast evaluation is available in Mitchell (2020). The most suitable decomposition depends on what information would be most beneficial to the forecast user. Atger (2004), for example, remarks that the forecast reliability alone provides limited information regarding forecast quality to end users, since a forecast that is reliable according to this criterion can still exhibit large conditional biases. In the case of weather forecasts, for example, forecast quality could depend on the time of year, the spatial location, or the value of the forecast itself, and it is therefore useful to evaluate the performance of a forecaster under different circumstances. If a forecaster were able to identify situations in which performance is particularly poor, then they could more easily develop their forecast strategy to account for these deficiencies.
As such, in the following section we propose a novel decomposition of proper scoring rules that allows a forecaster to analyse the uncertainty, resolution, and reliability of their predictions under different circumstances, whilst maintaining a connection to the overall forecast accuracy. The new decomposition allows for a more detailed quantification of the sources of information in a forecast, reinforcing the relationship between scoring rules and the well-known analysis of variance framework. This is examined in Section 3. The decomposition is applied to the Brier score in Section 4, which is then used to assess probabilistic forecasts of temperature exceedance issued by the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss) in Section 5. Finally, Section 6 concludes.

DECOMPOSITIONS OF SCORES
In the following, Ω denotes the set of all possible outcomes, or observations, and Y is a random variable taking values in this set, corresponding to the unknown quantity being forecast (the predictand). Formally, Ω is a subset of a measurable space such as the real line, and, for simplicity, we assume that Ω is a finite set, so that the observation can take only a finite number of possible values. Forecasts for Y are assumed to be in the form of probability distributions over Ω, and we denote by 𝒫 the set of all such distributions. A scoring rule is then defined as a function S which takes a forecast and an observation as inputs, and outputs a real (possibly nonfinite) number, that is, a score. Without loss of generality, the scoring rules considered here are negatively oriented, so that a smaller score is preferable. Let E_Y denote the expectation with respect to the random variable Y, and let S(F, G) denote the expected score for a forecast F given that Y follows the distribution G ∈ 𝒫, that is,

S(F, G) = E_Y [S(F, Y)],  Y ∼ G.  (2)

The score entropy of a distribution G ∈ 𝒫 is defined as

e(G) = S(G, G),  (3)

and the corresponding score divergence between two distributions F, G ∈ 𝒫 as

d(F, G) = S(F, G) − e(G).  (4)

A scoring rule S is said to be proper with respect to 𝒫 if d(F, G) ≥ 0 for all F, G ∈ 𝒫, and strictly proper with respect to 𝒫 if this inequality is strict when F ≠ G. Suppose we are interested in assessing a forecaster via a proper scoring rule S. Bröcker (2009) demonstrates that the expected score assigned to the forecaster can be factorised into terms representing the forecast uncertainty, resolution, and reliability:

E_{F,Y} [S(F, Y)] = e(G) − E_F [d(G_F, G)] + E_F [d(F, G_F)] = UNC_Y − RES_F + REL_F,  (5)

where G denotes the unconditional distribution of Y and G_F the conditional distribution of Y given a forecast F, that is, the distribution of the observations when a particular forecast is issued. We treat F as a random variable, taking values in 𝒫, and E_{F,Y} denotes the expectation with respect to both F and Y.
That is, we imagine a population of forecasts and outcomes, and we represent this population by the joint distribution of the random variables F and Y; in practice, this distribution is typically unknown, and it is therefore common to consider the empirical distribution corresponding to a set of previously issued forecasts and their observations, with the forecaster assessed using the average score over these forecast-observation pairs. This decomposition does not rely on the scoring rule being proper, though propriety ensures that the resolution and reliability components are non-negative, which aids their interpretation. The first term of the decomposition, UNC_Y, expresses the inherent variability of the predictand and is therefore referred to as the uncertainty. A subscript Y is used to indicate that this is a property of the outcome variable and is independent of the forecast. The uncertainty is quantified by the entropy of the outcome's marginal distribution, and a larger uncertainty is indicative of a less predictable forecast scenario, which in turn leads to a higher (worse) expected score.
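As a concrete instance of these definitions, the entropy and divergence associated with the logarithmic score are the Shannon entropy and the Kullback–Leibler divergence. A minimal sketch in Python (the function names are our own, for illustration only):

```python
import numpy as np

def log_score_entropy(g):
    """Score entropy e(G) of the logarithmic score: the Shannon entropy of G."""
    g = np.asarray(g, dtype=float)
    g = g[g > 0]  # 0 * log(0) is taken as 0
    return -np.sum(g * np.log(g))

def log_score_divergence(f, g):
    """Score divergence d(F, G) = S(F, G) - e(G): here, the Kullback-Leibler
    divergence KL(G || F)."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    mask = g > 0
    s_fg = -np.sum(g[mask] * np.log(f[mask]))  # expected score S(F, G) under Y ~ G
    return s_fg - log_score_entropy(g)

g = np.array([0.2, 0.3, 0.5])  # true distribution of Y
f = np.array([0.4, 0.4, 0.2])  # forecast distribution
assert log_score_divergence(f, g) > 0            # propriety: d(F, G) >= 0
assert abs(log_score_divergence(g, g)) < 1e-12   # d(G, G) = 0
```

The assertions check the propriety condition d(F, G) ≥ 0 and that the divergence vanishes when the forecast equals the true distribution.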
The second component in the decomposition is the resolution, RES_F, which loosely measures the information content in the forecasts. The resolution acts negatively on the score, so that more informative predictions correspond to larger resolution terms, and hence result in smaller scores. If the score is minimised at zero, as is often the case, then the ratio of the resolution to the uncertainty provides the proportion of uncertainty that can be explained by the forecasts. Furthermore, the uncertainty and resolution are often considered together as a measure of the sharpness or refinement in the forecasts (Blattenberger and Lad, 1985; Mitchell, 2020).
The final term, REL_F, assesses the statistical consistency between the forecasts and observations. Since it acts positively on the score, this component is interpreted as the extent to which the forecasts are miscalibrated, and a lower reliability term is therefore desired. The component is equal to zero if the conditional distribution of the predictand given the forecast is equal to the forecast itself, that is, F = G_F. For example, if an event is predicted to occur with probability p, then the event should materialise on 100p% of occasions when such a forecast is issued. The forecast in this case is said to be reliable, or calibrated. Although the resolution and reliability depend on both the forecast and the outcome, a subscript F is used to highlight that these terms measure to what extent the forecasts capture the behaviour of the predictand. This notation will be useful to distinguish between different decompositions of the expected score.

A conditional decomposition
In practice, it is often useful to investigate how the forecast uncertainty, resolution, and reliability change under different circumstances. In doing so, forecasters can identify sources of deficiencies in their prediction scheme more easily. Let A denote a random variable taking values in the set {A_1, …, A_J}, where A_1, …, A_J represent mutually exclusive and collectively exhaustive states of the forecast system, that is, some auxiliary information on which to condition the forecast evaluation. In the context of weather forecasting, for example, possible choices for the states may include partitions of time (e.g., seasons or weather regime occurrences), space (e.g., spatial regions or grid points), or possible values of a meteorological variable. Note, however, that choosing states that depend on the observations will raise the forecaster's dilemma (Lerch et al., 2017). To avoid this, the states must be chosen such that the state assigned to a forecast is identifiable by the time the forecast has been issued.
To study the performance of a forecaster in a particular state, the decomposition of the expected score into forecast uncertainty, resolution, and reliability can easily be applied conditionally on the occurrence of this state, yielding a state-dependent or local decomposition:

E_{F,Y} [S(F, Y) | A] = e(G_A) − E_F [d(G_{F,A}, G_A) | A] + E_F [d(F, G_{F,A}) | A],  (6)

where G_A denotes the conditional distribution of Y given the state A, and G_{F,A} is the conditional distribution of Y given both the forecast F and the state A.

TABLE 1  Definition and brief interpretation of the terms of the classical (Equation 5) and conditional (Equation 8) decompositions.

Term       | Definition                 | Interpretation
UNC_Y      | e(G)                       | Uncertainty in the outcome variable
RES_F      | E_F [d(G_F, G)]            | Uncertainty explained by the forecast
UNC_{Y|A}  | E_A [e(G_A)]               | Within-state uncertainty in the outcome variable
RES_A      | E_A [d(G_A, G)]            | Uncertainty explained by changes in the state
RES_{F|A}  | E_{F,A} [d(G_{F,A}, G_A)]  | Uncertainty explained by the forecast that is not explained by changes in the state
RES_{A|F}  | E_{F,A} [d(G_{F,A}, G_F)]  | Uncertainty explained by changes in the state that is not explained by the forecast
REL_{F|A}  | E_{F,A} [d(F, G_{F,A})]    | Conditional reliability of the forecast given the states
The total expected score is recovered from Equation (6) by taking the expectation over the possible states. This is akin to calculating the score separately for forecasts pertaining to each state and then summing the resulting scores, each weighted by the relative frequency with which that state occurs. Likewise, uncertainty, resolution, and reliability components can be calculated by taking the expectation of the terms corresponding to each state:

E_{F,Y} [S(F, Y)] = E_A [e(G_A)] − E_{F,A} [d(G_{F,A}, G_A)] + E_{F,A} [d(F, G_{F,A})] = UNC_{Y|A} − RES_{F|A} + REL_{F|A}.  (7)

This generates a decomposition of the expected score that contains three components, representing the expected forecast uncertainty, UNC_{Y|A}, resolution, RES_{F|A}, and reliability, REL_{F|A}, given the chosen set of states. However, these terms are not equal to those in Equation (5). In particular, the three terms in Equation (7) vary depending on the choice of states. For example, a forecaster is reliable according to this decomposition (i.e., REL_{F|A} = 0) if and only if they are calibrated conditionally on the occurrence of each possible state, and Equation (7) therefore provides a stronger criterion for reliability than Equation (5) (Candille and Talagrand, 2005). It follows that the conditional reliability term REL_{F|A} must be at least as large as the overall reliability, REL_F; this is made clear by Equation (8) below. Equations (5) and (7) thus form two distinct decompositions of the expected score, each assessing slightly different characteristics of the forecast. Equation (5) provides definitions of the uncertainty, resolution, and reliability that do not depend on the states and are therefore easy to interpret. However, Equation (7) provides useful information regarding the forecast performance in different situations, which is not available from Equation (5).
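The recovery of the total score as a state-frequency-weighted average of per-state scores can be checked directly. A small sketch with synthetic data and the Brier score (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
state = rng.integers(0, 3, size=n)        # states A_1, A_2, A_3
p = rng.choice([0.2, 0.5, 0.8], size=n)   # probability forecasts
y = rng.binomial(1, 0.5, size=n)          # binary outcomes
scores = (p - y) ** 2                     # Brier scores

total = scores.mean()
# Weighted average of the mean score within each state, with weights equal
# to the relative frequency with which each state occurs.
weighted = sum((state == j).mean() * scores[state == j].mean() for j in range(3))
assert abs(total - weighted) < 1e-12
```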
Although one could apply both decompositions separately, an alternative representation of the expected score exists that amalgamates the two decompositions, thereby possessing the benefits of both:

E_{F,Y} [S(F, Y)] = [UNC_{Y|A} + RES_A] − [RES_A + RES_{F|A} − RES_{A|F}] + [REL_{F|A} − RES_{A|F}].  (8)

The first, fourth, and sixth components (UNC_{Y|A}, RES_{F|A}, and REL_{F|A}) are those present in Equation (7), while the first, second, and third sets of square brackets are equal, respectively, to the classical uncertainty, resolution, and reliability components in Equation (5). A proof of this is available in Appendix A. The terms of this decomposition also relate to an alternative local decomposition introduced recently by Ehm and Ovcharov (2017), and this connection is discussed in Appendix D. Table 1 briefly summarises the definition and interpretation of the terms in Equation (8). This conditional decomposition consists of five terms, rather than three, and contains two terms, RES_A and RES_{A|F}, that are both added and subtracted from the decomposition, and therefore have no effect on the overall score:

[UNC_{Y|A} + RES_A] − [RES_A + RES_{F|A} − RES_{A|F}] + [REL_{F|A} − RES_{A|F}] = UNC_{Y|A} − RES_{F|A} + REL_{F|A}.  (9)

Although these extra terms contribute nothing to the expected score, they are themselves useful, in that they convey supplementary information to the forecaster regarding the behaviour of the forecasts and observations conditional on the chosen states. For example, RES_A is the expected divergence between the marginal distribution of the outcome and the conditional distribution of the outcome given each state. It thus describes the variation in the outcome due to changes in the state, and can be thought of as the resolution of A (note the similar form to the forecast resolution in Equation (5)). The uncertainty has thus been divided into within-state (UNC_{Y|A}) and between-state (RES_A) contributions.
Similarly, RES_{A|F} can be thought of as the variation in the outcome variable that arises due to changes in the state but is not captured by the forecast or, equivalently, the increase in resolution that would be attained if the forecast captured the dependence of the predictand on the state perfectly.
Moreover, the sum of RES_A and RES_{F|A} is equal to the joint information provided by both the forecast and the states. That is, RES_{F,A} = RES_A + RES_{F|A} = RES_F + RES_{A|F}. The labelling of terms provides an intuitive interpretation here: the uncertainty resolved by F and A can be written as the uncertainty resolved by A plus the additional uncertainty resolved by F after A has been accounted for, or vice versa. Hence, by expressing the forecast resolution as the joint information provided by the forecast and states minus the resolution of the states given the forecast, Equation (8) allows for a sequential analysis of the sources of information in the forecasts.
The reliability component is now equal to the difference between RES_{A|F} and the expected reliability of the forecast F with respect to A, REL_{F|A}. These two terms are equal, and hence the reliability is zero, when F = G_F, which is the standard requirement for autocalibration. Note, however, that a reliable forecast is not necessarily calibrated with respect to the possible states. That is, REL_F could be zero even if REL_{F|A} is not. This acknowledges that forecast errors may cancel each other out in such a way that the prediction system is calibrated on the whole, but is subject to conditional biases (Hamill, 2001). If the forecast is reliable with respect to the chosen set of states, then F = G_{F,A} for all possible states A, in which case both RES_{A|F} and REL_{F|A} (and hence also REL_F) become zero. In this case, the forecast satisfies a stronger form of calibration, and the absence of conditional biases manifests in the score via the resolution term, since the negative impact of RES_{A|F} is eliminated. This highlights that removing conditional biases is synonymous with increasing the information contained in the forecasts.
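The distinction between REL_F and REL_{F|A} can be made concrete with a stylised population: suppose there are two equally likely states, the event occurs with probability 0.8 in the first state and 0.2 in the second, and the forecaster always issues p = 0.5. The forecast is then autocalibrated (REL_F = 0) yet conditionally biased. A sketch using the Brier divergence d(F, G) = (f − g)², where f and g denote the event probabilities under F and G (all quantities are exact population values, not estimates):

```python
# Two equally likely states; the event occurs with probability 0.8 in state 1
# and 0.2 in state 2. The forecaster always issues p = 0.5.
w = [0.5, 0.5]          # state probabilities
g_FA = [0.8, 0.2]       # event frequency given the forecast and each state
p = 0.5                 # the (single) issued forecast probability
g_F = sum(wi * gi for wi, gi in zip(w, g_FA))  # event frequency given the forecast

REL_F = (p - g_F) ** 2                                          # autocalibration
REL_F_given_A = sum(wi * (p - gi) ** 2 for wi, gi in zip(w, g_FA))
RES_A_given_F = sum(wi * (gi - g_F) ** 2 for wi, gi in zip(w, g_FA))

assert REL_F < 1e-15                        # marginally calibrated
assert abs(REL_F_given_A - 0.09) < 1e-12    # but conditionally biased
# REL_F = REL_{F|A} - RES_{A|F}
assert abs(REL_F - (REL_F_given_A - RES_A_given_F)) < 1e-12
```

Here the conditional biases in the two states cancel in the marginal reliability, exactly the situation described above.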

Conditioning on several states
The decomposition in Equation (8) has the desirable property that it can easily be extended further to simultaneously assess the forecast conditionally on two separate sets of states, say {A_1, …, A_J} and {B_1, …, B_L}.
For example, weather forecasters may be interested in assessing how well their forecasts perform in different seasons and in different weather regimes. This could be achieved using two separate applications of Equation (8), using the seasons or the weather regimes as the states, but these are unlikely to be independent, with some variations in the weather regime likely linked to changes in the season. If the two states are considered simultaneously, then we can evaluate how well the forecast captures the seasonality in the outcome, as well as the remaining variation due to changes in the regime after having accounted for the seasonal cycle, or vice versa. Letting B denote a random variable taking values in {B_1, …, B_L}, we can write the expected score as

E_{F,Y} [S(F, Y)] = [UNC_{Y|A,B} + RES_{A,B}] − [RES_{A,B} + RES_{F|A,B} − RES_{A,B|F}] + [REL_{F|A,B} − RES_{A,B|F}],  (10)

where the terms of this decomposition are defined similarly to those in Equation (8), but with A replaced with A and B. Note that this is equivalent to introducing a third set of states equal to the Cartesian product of the two sets, with elements of the form (A_j, B_l), and performing the conditional decomposition with respect to these combinations of the states. This decomposition does not separate the individual effects of the two sets of states, but rather considers their joint effect. For example, it would inform the forecaster how much variation in the outcome can be explained by changes in both the season and the weather regime, but not how much variation is explained by each factor individually. However, as alluded to earlier, the joint resolution of A and B can be decomposed further using the fact that RES_{A,B} = RES_A + RES_{B|A} = RES_B + RES_{A|B}, with the terms defined analogously to those in Table 1. An analogous breakdown also holds for the conditional resolution, RES_{A,B|F}.
Substituting this into Equation (10) gives

E_{F,Y} [S(F, Y)] = [UNC_{Y|A,B} + RES_A + RES_{B|A}] − [RES_A + RES_{B|A} + RES_{F|A,B} − RES_{A|F} − RES_{B|F,A}] + [REL_{F|A,B} − RES_{A|F} − RES_{B|F,A}].  (11)

This partitioning of the joint resolution therefore allows for a sequential analysis of the information content provided by a set of states: the uncertainty can now be written as the expected variation in the outcome for a fixed A and B (UNC_{Y|A,B}), plus the variation that can be explained by changes in A (RES_A) and the additional variation explained by changes in B after A has been accounted for (RES_{B|A}), or vice versa. We can thus quantify sequentially the amount of uncertainty attributable to changes in A and B.
Similarly, in the resolution term we can quantify sequentially the amount of uncertainty attributable to changes in A and B that is not captured by the forecast. For example, just as we can interpret RES_{A|F} as the improvement in score that would be achieved if the forecast modelled the dependence between the outcome Y and the state A correctly, RES_{B|F,A} is the improvement in score that would be achieved by modelling the dependence between Y and the state B correctly, if the forecast were already to capture the relationship between Y and A. If there is a strong association between the states A and B, then RES_{B|F,A} could be low even if RES_{B|F} is high. This information is therefore not available from two separate applications of Equation (8).
Similar arguments can also be used to extend Equation (8) further and condition the decomposition on more than two sets of states. Equation (11) contains seven distinct components, two more than in Equation (8), and the number of terms increases by two for every additional set of states that is considered in the decomposition. Although each term will provide the forecaster with more information regarding the performance of their forecasts in theory, considering more states in practice requires further stratification of the data at hand, and hence the number of states to use in the decomposition is limited by the amount of data available.
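The chain-rule structure of the joint resolution can be verified empirically. A sketch using empirical event frequencies and squared-difference (Brier-type) divergences, with two synthetic sets of states (the data and helper functions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
a = rng.integers(0, 2, size=n)                # states A_1, A_2
b = rng.integers(0, 3, size=n)                # states B_1, B_2, B_3
y = rng.binomial(1, 0.2 + 0.3 * a + 0.1 * b)  # outcome depends on both states

ybar = y.mean()

def res(groups):
    """Brier-type resolution of a grouping variable:
    (1/n) * sum over groups of n_g * (ybar_g - ybar)^2."""
    return sum((groups == g).sum() * (y[groups == g].mean() - ybar) ** 2
               for g in np.unique(groups)) / n

def res_cond(groups, given):
    """Resolution of `groups` conditional on `given`: expected squared deviation
    of the joint-cell event frequency from the conditioning-cell frequency."""
    total = 0.0
    for h in np.unique(given):
        idx = given == h
        for g in np.unique(groups[idx]):
            cell = idx & (groups == g)
            total += cell.sum() * (y[cell].mean() - y[idx].mean()) ** 2
    return total / n

ab = a * 3 + b  # joint states (A_j, B_l)
# RES_{A,B} = RES_A + RES_{B|A} = RES_B + RES_{A|B}
assert abs(res(ab) - (res(a) + res_cond(b, a))) < 1e-10
assert abs(res(ab) - (res(b) + res_cond(a, b))) < 1e-10
```

The two assertions confirm that the joint resolution can be built up sequentially in either order, as the text describes.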

ANALYSIS OF INFORMATION
The terms of Equation (8) can be used to quantify the amount of uncertainty attributable to different sources. These terms therefore resemble those in the well-known analysis of variance (ANOVA) framework, which seeks to quantify how much variation in the outcome variable can be explained by changes in certain factors. A duality between decompositions of scores and the analysis of variance has already been noted for quadratic scores by Blattenberger and Lad (1985). In this section, we claim that score decompositions provide a broad generalisation of the ANOVA framework, and we reinforce this by drawing parallels between statistics commonly used in ANOVA and the terms of Equations (5) and (8).
Consider a set of n observations, each corresponding to one of K factors, or treatment groups. The observations are written as y_{k,ℓ}, where k ∈ {1, …, K} denotes the treatment group and ℓ ∈ {1, …, n_k} is the unit within the group of interest, with n_k denoting the number of observations given treatment group k. The sum of the n_k across all possible treatment groups is equal to n, the total number of observations. A one-way analysis of variance then decomposes the variance in the observations into between-factor and within-factor effects. The decomposition is typically expressed as

Σ_{k=1}^{K} Σ_{ℓ=1}^{n_k} (y_{k,ℓ} − ȳ)² = Σ_{k=1}^{K} n_k (ȳ_k − ȳ)² + Σ_{k=1}^{K} Σ_{ℓ=1}^{n_k} (y_{k,ℓ} − ȳ_k)²,  (12)

where ȳ_k is the mean observation in treatment group k and ȳ is the mean observation across all treatment groups (Wilks, 2019). The first term is the sum of squared differences between the observations and their global mean, termed the total sum of squares (TSS). This term is equal to the sum of the treatment sum of squares (SST), denoting the weighted deviation between the global and conditional means, and the sum of squared error (SSE), the variation of the observations from their respective conditional means. Dividing these terms by the total number of observations recovers the law of total variance: the variance in the observations (TSS) is decomposed into the between-treatment-group variation (SST) and the within-group variation (SSE). The SST term thus represents the amount of variation in the observations that is captured by the treatment groups, whereas SSE is the remaining, unexplained variance. A common analysis might then test whether SST is significantly different from zero, which would suggest the treatment groups explain a non-negligible amount of variation in the data.
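The identity TSS = SST + SSE in Equation (12) is straightforward to verify numerically; a minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                    # number of treatment groups
groups = rng.integers(0, K, size=200)
y = rng.normal(loc=groups, scale=1.0)    # group-dependent means

ybar = y.mean()
TSS = np.sum((y - ybar) ** 2)
SST = sum((groups == k).sum() * (y[groups == k].mean() - ybar) ** 2
          for k in range(K))
SSE = sum(np.sum((y[groups == k] - y[groups == k].mean()) ** 2)
          for k in range(K))
assert abs(TSS - (SST + SSE)) < 1e-8
```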
This decomposition does not depend on a specific prediction as such, but only on the extent to which the treatment groups can distinguish between different observations. However, the sum of squared error can be decomposed further into two terms:

Σ_{k=1}^{K} Σ_{ℓ=1}^{n_k} (y_{k,ℓ} − ȳ_k)² = Σ_{k=1}^{K} Σ_{ℓ=1}^{n_k} (y_{k,ℓ} − P_k)² − Σ_{k=1}^{K} n_k (P_k − ȳ_k)²,  (13)

where P_k represents a prediction pertaining to treatment group k (Brook and Arnold, 1985). The first term on the right-hand side is the squared error of the forecasts P_k (SFE). The second term represents the deviation between the forecasts and the conditional mean given the treatment group, and is often labelled the sum of squares due to lack of fit (SSLF; Brook and Arnold, 1985). Note, however, that the ANOVA decomposition is typically applied in-sample, with the forecasts constructed implicitly to equal the conditional means observed in the data, that is, P_k = ȳ_k for all k. Hence, the SSLF component is usually zero.
Nonetheless, combining and rearranging Equations (12) and (13) and scaling by 1/n gives

(1/n) Σ_{k=1}^{K} Σ_{ℓ=1}^{n_k} (y_{k,ℓ} − P_k)² = (1/n) Σ_{k=1}^{K} Σ_{ℓ=1}^{n_k} (y_{k,ℓ} − ȳ)² − (1/n) Σ_{k=1}^{K} n_k (ȳ_k − ȳ)² + (1/n) Σ_{k=1}^{K} n_k (P_k − ȳ_k)².  (14)

Here, the mean squared error of the predictions P_k has been decomposed into the variance of the observations, minus the average squared deviation between the conditional and global means, plus the average squared difference between the predicted value in treatment group k and the associated conditional mean. These terms clearly resemble the uncertainty, resolution, and reliability of the forecast, respectively, as defined in Section 2. The (scaled) ANOVA decomposition is thus equivalent to a factorisation of the mean squared error into uncertainty, resolution, and reliability terms, where the treatment groups correspond to a discrete set of possible forecast values. In the case of binary outcomes, Equation (14) is equivalent to the well-known decomposition of the Brier score introduced by Murphy (1973). Of course, for other scores, the uncertainty, resolution, and reliability terms are not equal to those in Equation (14), and hence the exact mathematical equality of the analysis of variance and score decomposition frameworks does not hold. Nevertheless, the broad interpretation of the terms remains similar regardless of the score used to assess the forecasts: the resolution represents the amount of uncertainty that can be explained by the forecast, the reliability quantifies the accuracy that is lost due to miscalibration, and, in both cases, the aim of the forecaster is to maximise the proportion of uncertainty that is explained by their forecasts. Decompositions of scoring rules therefore provide a generalisation of the analysis of variance framework.
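The reading of Equation (14) as an uncertainty-resolution-reliability factorisation of the mean squared error can likewise be checked numerically. A sketch with hypothetical group-level predictions P_k (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 3
groups = rng.integers(0, K, size=300)
y = rng.normal(loc=2.0 * groups, scale=1.0)
P = np.array([0.1, 2.2, 3.9])   # hypothetical predictions, one per group

ybar = y.mean()
mse = np.mean((y - P[groups]) ** 2)
unc = np.mean((y - ybar) ** 2)  # TSS / n
res = sum((groups == k).mean() * (y[groups == k].mean() - ybar) ** 2
          for k in range(K))
rel = sum((groups == k).mean() * (P[k] - y[groups == k].mean()) ** 2
          for k in range(K))
# Equation (14): MSE = uncertainty - resolution + reliability
assert abs(mse - (unc - res + rel)) < 1e-10
```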
Furthermore, there are parallels in some common applications of score and ANOVA decompositions. For example, the terms of Equation (12) are often used to calculate the coefficient of determination, or R², corresponding to a set of predictions. This is a well-known goodness-of-fit statistic that quantifies the proportion of variation in the observations that is explained by the treatment groups, defined mathematically as the ratio of the explained variance (SST − SSLF) to the total variance (TSS). Analogously, if the score is minimised at zero, then the proportion of uncertainty that is captured by the forecast is equal to the ratio of RES_F − REL_F to UNC_Y, which is equal to the skill score obtained when using the unconditional (i.e., climatological) distribution as a reference forecast (Mason, 2004).
Similarly, the terms of the new decomposition (Equation 8) also align with the analysis of variance framework. Another common application of the ANOVA decomposition is to test whether or not an additional set of factors could capture some of the remaining unexplained variation in the data; this is typically the goal of a two-way ANOVA. In particular, a sequential ANOVA approach can be used to test whether potential factors reduce the SSE significantly, given those already included. The proportion of previously unexplained variation that is captured by the extra set of factors is quantified by the coefficient of partial determination. As remarked in the previous section, the RES_{A|F} component of Equation (8) represents the extent to which the forecast does not model the dependence between the observations and the chosen states accurately, while, if the score is minimised at zero, the expected score can be thought of as the amount of variation in the observations that is not explained by the forecast. Therefore, the ratio of RES_{A|F} to the expected score is the proportion of unexplained variation in the outcome that would be explained if the forecast captured all uncertainty in the observations due to changes in the state. This constitutes a natural analogue of the coefficient of partial determination. If this ratio is large, then it would be beneficial to incorporate information regarding the chosen states into the forecast.

Decompositions of the Brier score
The decomposition of scores into forecast uncertainty, resolution, and reliability has been studied in most detail using the Brier score. Hence, in this section, the Brier score is decomposed according to both Equations (5) and (8). The Brier score is used to assess forecasts for the occurrence of a binary event, and is defined as the squared difference between a probability forecast p and the corresponding binary outcome y, which takes the value one if the event under consideration occurs, and zero otherwise (Brier, 1950):

BS(p, y) = (p − y)².  (15)

A smaller score clearly indicates closer alignment between the forecast and observation, and a perfect forecast is one that can always recognise when the event will and will not occur, corresponding to a Brier score of zero. Consider the case where the forecast p can only take one of a finite number of values, p ∈ {P_1, …, P_K}. Murphy (1973) shows that, in such circumstances, the average Brier score over n forecast instances can be divided into three components. For each forecast instance i = 1, …, n, suppose we have a probability forecast p_i ∈ {P_1, …, P_K}, a corresponding observation y_i ∈ {0, 1}, and a prevailing state a_i ∈ {A_1, …, A_J}. Let I_{k•} = {i : p_i = P_k} denote the set of all instances where P_k was issued as the forecast, let I_{•ℓ} = {i : a_i = A_ℓ} denote the set of all instances where state A_ℓ occurred, and let I_{kℓ} = {i : p_i = P_k, a_i = A_ℓ} denote the set of forecast instances where P_k was issued as the forecast and state A_ℓ occurred. Define n_{k•} = |I_{k•}|, n_{•ℓ} = |I_{•ℓ}|, and n_{kℓ} = |I_{kℓ}| as the number of forecast instances in each of these three sets, and note that Σ_{k=1}^{K} n_{k•} = Σ_{ℓ=1}^{J} n_{•ℓ} = Σ_{k,ℓ} n_{kℓ} = n, with Σ_{k,ℓ} denoting the double summation over all possible forecasts and states. We define unconditional and conditional relative frequencies of the event as

ȳ = (1/n) Σ_{i=1}^{n} y_i,   ȳ_{k•} = (1/n_{k•}) Σ_{i∈I_{k•}} y_i,   ȳ_{•ℓ} = (1/n_{•ℓ}) Σ_{i∈I_{•ℓ}} y_i,   ȳ_{kℓ} = (1/n_{kℓ}) Σ_{i∈I_{kℓ}} y_i.  (16)

The classical decomposition of the Brier score into uncertainty, resolution, and reliability components is then

(1/n) Σ_{i=1}^{n} (p_i − y_i)² = ȳ(1 − ȳ) − (1/n) Σ_{k=1}^{K} n_{k•} (ȳ_{k•} − ȳ)² + (1/n) Σ_{k=1}^{K} n_{k•} (P_k − ȳ_{k•})² = ÛNC_Y − R̂ES_F + R̂EL_F  (17)

(Murphy, 1973).
The first term, ÛNC_Y, is an estimator for the uncertainty component, the second, R̂ES_F, is an estimator for the resolution, and the final term, R̂EL_F, is an estimator for the reliability. This is equivalent to the ANOVA decomposition in Equation (14). Clearly, this decomposition does not depend on the J possible states that could arise. Suppose now that we wish to assess the forecasts conditional on these states. The terms of Equation (8) for the Brier score can be estimated using

ÛNC_{Y|A} = (1/n) Σ_{ℓ=1}^{J} n_{•ℓ} ȳ_{•ℓ}(1 − ȳ_{•ℓ}),
R̂ES_A = (1/n) Σ_{ℓ=1}^{J} n_{•ℓ} (ȳ_{•ℓ} − ȳ)²,
R̂ES_{F|A} = (1/n) Σ_{k,ℓ} n_{kℓ} (ȳ_{kℓ} − ȳ_{•ℓ})²,
R̂ES_{A|F} = (1/n) Σ_{k,ℓ} n_{kℓ} (ȳ_{kℓ} − ȳ_{k•})²,
R̂EL_{F|A} = (1/n) Σ_{k,ℓ} n_{kℓ} (P_k − ȳ_{kℓ})²,  (18)

with a hat again used to emphasise that these terms are estimators given a finite amount of data. It is shown in Appendix B that when these estimators are combined as in Equation (8), we recover the terms in Equation (17).
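These plug-in estimators can be implemented in a few lines. A sketch (function and variable names are our own) that also checks the recombination identities numerically:

```python
import numpy as np

def brier_terms(p, y, a):
    """Plug-in estimators of the conditional Brier score decomposition terms.

    p : issued forecast probabilities, y : binary outcomes, a : state labels.
    Returns estimates of (UNC_{Y|A}, RES_A, RES_{F|A}, RES_{A|F}, REL_{F|A}).
    """
    p, y, a = map(np.asarray, (p, y, a))
    n, ybar = len(y), y.mean()
    unc_ya = res_a = res_fa = res_af = rel_fa = 0.0
    for al in np.unique(a):
        sa = a == al
        ybar_l = y[sa].mean()
        unc_ya += sa.sum() * ybar_l * (1 - ybar_l) / n
        res_a += sa.sum() * (ybar_l - ybar) ** 2 / n
        for pk in np.unique(p[sa]):
            cell = sa & (p == pk)
            ybar_kl, ybar_k = y[cell].mean(), y[p == pk].mean()
            res_fa += cell.sum() * (ybar_kl - ybar_l) ** 2 / n
            res_af += cell.sum() * (ybar_kl - ybar_k) ** 2 / n
            rel_fa += cell.sum() * (pk - ybar_kl) ** 2 / n
    return unc_ya, res_a, res_fa, res_af, rel_fa

rng = np.random.default_rng(4)
n = 4000
a = rng.integers(0, 2, size=n)
p = rng.choice([0.25, 0.5, 0.75], size=n)
y = rng.binomial(1, 0.3 + 0.3 * a)
unc_ya, res_a, res_fa, res_af, rel_fa = brier_terms(p, y, a)

bs = np.mean((p - y) ** 2)
# The expected score is recovered after the RES_A and RES_{A|F} cancellations
assert abs(bs - (unc_ya - res_fa + rel_fa)) < 1e-10
# Classical uncertainty: UNC_Y = UNC_{Y|A} + RES_A
assert abs(y.mean() * (1 - y.mean()) - (unc_ya + res_a)) < 1e-10
# Classical resolution: RES_F = RES_A + RES_{F|A} - RES_{A|F}
res_f = sum((p == pk).sum() * (y[p == pk].mean() - y.mean()) ** 2
            for pk in np.unique(p)) / n
assert abs(res_f - (res_a + res_fa - res_af)) < 1e-10
```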

Bias-corrected components
In practice, the Brier score and the terms of its decomposition can only be estimated from a finite amount of data. The sample mean score is an unbiased estimator for the expected score, but, in general, the empirical uncertainty, resolution, and reliability terms are biased (Bröcker, 2012). Although no unbiased estimators exist for the resolution and reliability components of the Brier score (Ferro and Fricker, 2012), Bröcker (2012) and Ferro and Fricker (2012) have proposed bias corrections whose biases decay at a much faster rate than those of Equation (17). In the following section, the bias-corrected estimators of Ferro and Fricker (2012) are utilised. Similar ideas can be used to obtain bias corrections for the terms in Equation (18), making the resulting estimates less sensitive to the amount of data available. It is assumed here that the forecast-observation pairs are independent and identically distributed, and that the forecasts and states are chosen such that none of n k•, n •ℓ, and n kℓ is equal to one. It is straightforward to verify that the bias corrections above cancel each other out, so that the estimator for the expected score is unbiased, and that the bias corrections for the corresponding UNC Y, RES F, and REL F estimators agree with those in Equation (19). A more thorough investigation of the properties of these bias corrections is conducted in Appendix C. In the next section, these bias-corrected components of the Brier score are used to assess operational weather forecasts.
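The finite-sample bias can be seen in simulation: for a perfectly reliable forecaster the true reliability is zero, yet the plain estimator is positive on average. The corrected reliability below follows the form of the Ferro and Fricker (2012) correction as we transcribe it, subtracting an unbiased estimate of the sampling variance of the conditional frequencies (an assumption on our part, since the exact equations are elided above):

```python
import numpy as np

rng = np.random.default_rng(42)
probs = np.array([0.2, 0.5, 0.8])   # a perfectly reliable forecaster
n, n_sims = 60, 4000

rel_raw, rel_corr = [], []
for _ in range(n_sims):
    p = rng.choice(probs, size=n)
    y = (rng.random(n) < p).astype(float)   # outcomes drawn from the forecast
    raw = corr = 0.0
    for pk in probs:
        idx = p == pk
        nk = idx.sum()
        if nk < 2:
            continue
        yk = y[idx].mean()
        raw += nk / n * (pk - yk) ** 2
        # subtract an estimate of the sampling variance of yk
        corr += nk / n * ((pk - yk) ** 2 - yk * (1 - yk) / (nk - 1))
    rel_raw.append(raw)
    rel_corr.append(corr)

# the truth is REL = 0: the plain estimator is biased upwards on average,
# while the corrected one is (nearly) centred on zero
```

Averaging over the simulations, the uncorrected reliability is systematically positive even though the forecaster is perfectly calibrated, while the corrected estimator is approximately unbiased.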

Data
The decompositions of scores discussed thus far provide the forecaster with additional information regarding how their forecasts perform. In this section, we provide an example of how these decompositions can be applied in practice, and how the additional information can be used to improve statistical postprocessing models. To illustrate this, the Brier score is used to evaluate forecast probabilities that the daily maximum temperature will exceed a chosen threshold. The forecasts are extracted from MeteoSwiss's COSMO-E ensemble prediction system, which operates at 2.2-km resolution over Switzerland and the surrounding area. The gridded forecasts issued by COSMO-E are interpolated here to 146 synoptic weather stations across the domain, displayed in Figure 1. At each location, the prediction system generates ensemble forecasts comprised of M = 21 members for the daily maximum temperature, and a probability that the temperature will exceed a chosen threshold is then extracted from the ensemble using the following formula:

$$p = \frac{M_+}{M}, \quad (21)$$

where M_+ denotes the number of ensemble members that predict that the threshold will be exceeded. This results in M + 1 = 22 evenly spaced possible forecast values.
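The extraction of exceedance probabilities from the ensemble is a simple member count; a sketch with a hypothetical 21-member forecast (the member values are invented for illustration):

```python
import numpy as np

def exceedance_probability(ensemble, threshold):
    """Fraction of ensemble members exceeding the threshold (M+ / M)."""
    return (np.asarray(ensemble) > threshold).mean(axis=-1)

# a hypothetical 21-member forecast of daily maximum temperature (degrees C)
members = np.array([18.4, 19.1, 20.3, 21.0, 17.9, 20.6, 19.8, 22.1, 18.7,
                    20.9, 19.5, 21.4, 18.2, 20.1, 19.9, 21.8, 17.5, 20.4,
                    19.2, 21.1, 20.7])
p = exceedance_probability(members, 20.0)  # one of the 22 values 0/21 .. 21/21
```

With 21 members, the resulting probability always lies on the 22-point grid 0/21, 1/21, ..., 21/21.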
The forecasts considered here are initialised daily at 0000 UTC during the three summer months (JJA) between 2018 and 2020, and are available up to four days in advance. Hence, at each lead time, there are roughly 40,000 (92 × 3 × 146) available forecast cases on which to evaluate forecast performance. Forecast evaluation is performed using the mean Brier score over all forecast cases, with daily maximum temperature measurements at the sites of interest used to verify the forecasts.
The classical decomposition of the mean Brier score can be used to assess the resolution and reliability of the forecasts, while the novel decomposition provided by Equation (8) additionally quantifies the conditional resolution and conditional reliability given some choice of the states. In the following, we condition this latter decomposition on three sets of states: the prevailing weather regime when the forecast is initialised, a grouping of the stations of interest depending on their spatial location, and an alternative grouping of the stations based on their altitude.
The weather regimes considered here are the nine weather type classifications used operationally at MeteoSwiss, derived by applying a cluster analysis to the leading principal components of ERA40 analysis mean sea-level pressure fields (see Weusthoff, 2011, for details). These regimes form a mutually exclusive, collectively exhaustive partition of time, that is, it is assumed that one and only one regime occurs on each day. Weather regimes have a large influence not only on the local weather but also on the biases that manifest in numerical weather models (Ferranti et al., 2015; Matsueda and Palmer, 2018; Allen et al., 2020). Although forecast errors typically depend more on the weather regime that is predicted to occur at the forecast validation time (Allen et al., 2021), such information is not available for the forecasts and time range considered here. As such, the regime assigned to a forecast is that which occurs when the forecast is initialised. Using weather regimes as the states in the decomposition of the Brier score allows one to analyse how forecast quality depends on the prevailing large-scale atmospheric behaviour.
While the regimes constitute a partition of time, the remaining two choices of states form subgroups of the 146 weather stations. Firstly, the stations are divided into five groups based on their spatial location. The five groups comprise the Jura, a subalpine mountainous range in the north of Switzerland; the north and south slopes, areas on either side of the Swiss Alps; the Central Plateau, which lies between the Jura mountains and the Swiss Alps; and the Grisons region in the southeast of Switzerland. The stations assigned to each of these groups are displayed in Figure 1.
Also displayed in Figure 1 is the altitude of each station, defined as the height above sea level. Switzerland has a remarkably complex topography, but operational weather models typically assume a more homogeneous land surface in order to simplify the differential equations underlying the numerical forecasts. As such, forecast biases across Switzerland often depend strongly on altitude (Friedli et al., 2021). Hence, to investigate this in relation to COSMO-E temperature exceedance forecasts, the stations are grouped depending on whether their altitude is lower than 500 m, between 500 and 1000 m, 1000 and 1500 m, 1500 and 2000 m, or above 2000 m. These five categories then form a set of states that can be used within the conditional decomposition. Figure 1 also displays the altitude group to which each station belongs, with the number of stations assigned to an altitude group ranging from 18 (≥ 2000 m) to 47 (< 500 m).
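Grouping stations into such altitude bands is a one-line binning step; a sketch with hypothetical station altitudes (the values below are invented):

```python
import numpy as np

# hypothetical station altitudes in metres
altitudes = np.array([320, 480, 760, 1240, 1580, 2105, 995, 1820, 410, 2540])

# bin edges matching the five categories in the text; np.digitize assigns
# group 0 for < 500 m, ..., group 4 for >= 2000 m
edges = [500, 1000, 1500, 2000]
groups = np.digitize(altitudes, edges)

# number of stations in each altitude group
counts = np.bincount(groups, minlength=5)
```

The resulting integer labels can then be used directly as the state variable in the conditional decomposition.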

Reference forecasts
Firstly, consider forecasts made for whether or not the temperature exceeds 20 °C at a lead time of three days; the sensitivity of the results to the threshold and the lead time is considered later. One of the simplest forecasts to issue is the unconditional distribution of the temperature threshold exceedance, that is, the historical average event frequency. Such a forecast is known to be reliable but also uninformative, and it is hence often considered as a baseline with which other forecast schemes can be compared (Mason, 2004). In this case, the unconditional or climatological forecast probability of the threshold being exceeded is estimated from the daily maximum temperature observations during the summer seasons of 2016 and 2017. The resulting forecasts are therefore assessed out-of-sample. Data from all stations are amalgamated when estimating this historical event frequency, and the climatological forecast therefore does not react to changes in the location, nor to changes in the weather regime. The decompositions of the Brier score for this climatological forecast are displayed in the first rows of Figures 2-4. Figure 2 displays the components of Equation (8) when the weather regimes are used as the states, Figure 3 uses the spatial regions as the states, while Figure 4 shows results for the altitude groups. The terms of the classical decomposition, as well as the overall score, are independent of the choice of states, with both the forecast resolution and reliability of the climatological forecast almost equal to zero; if the forecasts were evaluated in-sample, then both terms should be exactly zero. The ratio of RES A to UNC Y provides the proportion of variation in the observations that is attributable to changes in the state: the weather regimes explain 2.6% of the uncertainty in the outcome, while the spatial regions explain 8.7% and the altitude groupings 38.6%.
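The ratio RES A/UNC Y can be illustrated on synthetic data, where the event frequency depends on a three-valued state (the frequencies 0.2, 0.4, and 0.7 are ours, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# synthetic binary outcomes whose event frequency depends on the state
state = rng.choice(3, size=n)
freq = np.array([0.2, 0.4, 0.7])[state]
y = (rng.random(n) < freq).astype(float)

ybar = y.mean()
unc = ybar * (1 - ybar)                     # UNC_Y
res_a = sum((state == s).mean() * (y[state == s].mean() - ybar) ** 2
            for s in range(3))              # RES_A, between-state variation
explained = res_a / unc  # proportion of outcome uncertainty due to the state
```

Here the theoretical ratio is about 0.17, so roughly 17% of the uncertainty in the outcome is attributable to the state, analogous to the percentages quoted above for the regimes, regions, and altitude groups.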
Furthermore, for all choices of the state, the RES A and RES A|F components are equal, suggesting that none of this state-dependent variation is captured by the climatological forecast. Indeed, this is expected, given that the climatological probability is constructed such that it does not change depending on these states. As such, despite the forecasts being almost perfectly reliable overall, they exhibit large conditional biases, which can be seen from the large values of REL F|A. This conditional miscalibration could be removed by deriving a separate climatology for each state and issuing this as a state-dependent climatological forecast. The decompositions for these three conditional climatology forecasts are displayed in the second rows of Figures 2-4. In general, the RES A|F and REL F|A components are again almost the same, indicating the forecasts are reliable, but they are now also very close to zero, indicating the forecasts are additionally reliable conditional on the chosen states. Note that this is not the case when the regimes are employed as the states: although the RES A|F component is very close to zero, the conditional reliability term, REL F|A, is not.

FIGURE 2: Conditional (left) and classical (right) decompositions of the Brier score for the climatological, conditional climatological, COSMO-E ensemble, statistically postprocessed, and altitude-dependent postprocessed forecasts, when predicting whether the daily maximum temperature will exceed 20 °C at a lead time of three days. The decompositions are presented using the weather regimes as the states. Red, green, and blue contributions sum up to UNC Y, RES F, and REL F, respectively. All terms have been scaled by 10⁴ to aid interpretation.

FIGURE 3: As in Figure 2, but using the spatial regions as the states.
The reason for this is that some of the nine regimes do not occur frequently in the relatively short time period considered here, making it difficult to estimate the climatological event probability accurately in these regimes. This leads to regime-dependent climatology forecasts that are miscalibrated when evaluated out-of-sample, illustrating the trade-off between incorporating information into statistical models and the associated risks of overfitting.

FIGURE 4: As in Figure 2, but using the altitude groups as the states.
In removing the conditional biases owing to a set of states, the resulting forecasts exhibit an increased resolution, and hence a better score. However, the information provided by such a statistical forecast is still much lower than that provided by a numerical weather prediction model, which aims to reproduce the physical processes governing the atmosphere's evolution. The decompositions of the Brier score for the COSMO-E ensemble forecasts are also presented in Figures 2-4, and, as expected, the resolution of the COSMO-E output is considerably larger than that of the climatological forecasts. In particular, relatively low values of RES A|F suggest that the weather prediction model output explains a large proportion of the uncertainty attributable to the regimes and spatial regions. However, the increased information provided by the COSMO-E forecasts comes at the expense of reliability. This can be seen from the large differences between the REL F|A and RES A|F terms, and can be verified using reliability diagrams (not shown). Nonetheless, the overall Brier score is unsurprisingly much lower for the dynamical forecasts than for the climatological predictions.
As discussed in Section 3, the ratio of RES F − REL F to UNC Y constitutes an analogue to the R² value, or coefficient of determination, regularly used within the ANOVA framework. In this case, the generalised R² value for the COSMO-E forecasts is equal to 0.413, suggesting the ensemble forecasts explain 41% of the uncertainty in the outcome. This also suggests that the COSMO-E forecasts are 41% more skilful than the climatological forecasts, which can be confirmed by calculating the Brier skill score for the COSMO-E forecasts using the climatology as a reference.
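For the Brier score, this generalised R² coincides exactly with the Brier skill score computed against the in-sample climatology, since the climatological Brier score equals UNC Y. A numerical check on synthetic data (the forecast values and distortion are ours, chosen for illustration):

```python
import numpy as np

def brier_decomposition(p, y):
    # Murphy (1973): mean Brier score = UNC - RES + REL
    n, ybar = len(y), y.mean()
    unc, res, rel = ybar * (1 - ybar), 0.0, 0.0
    for pk in np.unique(p):
        idx = p == pk
        nk, yk = idx.sum(), y[idx].mean()
        res += nk / n * (yk - ybar) ** 2
        rel += nk / n * (pk - yk) ** 2
    return unc, res, rel

rng = np.random.default_rng(0)
n = 5000
p = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=n)
# outcomes drawn from a slightly distorted version of the forecast probability
y = (rng.random(n) < np.clip(p + 0.05, 0, 1)).astype(float)

unc, res, rel = brier_decomposition(p, y)
bs = np.mean((p - y) ** 2)
r2 = (res - rel) / unc                       # generalised R^2
bss = 1 - bs / np.mean((y.mean() - y) ** 2)  # skill vs in-sample climatology
```

The two quantities agree to machine precision, because the in-sample climatological Brier score is exactly ȳ(1 − ȳ) = UNC Y.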
The results in Figures 2-4 refer to a particular lead time and temperature threshold. To illustrate the sensitivity of the results to these factors, the terms of the two decompositions are displayed as a function of lead time in Figure 5 (for a fixed threshold of 20 °C), and as a function of threshold in Figure 6 (at a fixed lead time of three days). While the uncertainty is independent of the lead time, the forecast resolution decreases slightly as the forecast horizon increases. However, this is counteracted by a decrease in the reliability (miscalibration) component, resulting in a Brier score that is fairly insensitive to the lead time. The terms of Equation (8), which are calculated using the altitude groups as the states, exhibit similar behaviour, with the conditional resolution and reliability terms also decreasing slowly with time, while the two uncertainty components remain almost constant.
In contrast, the terms of both decompositions depend strongly on the choice of threshold. The uncertainty in the outcome is maximised when the climatological exceedance probability is 0.5, which occurs at a threshold between 20 and 25 °C, and the dependence of this inherent variability on the choice of threshold highlights why forecasts made for events with different climatological frequencies should not be compared directly. The resolution follows a similar pattern, while the miscalibration of the forecasts tends to be larger for slightly lower thresholds. On the other hand, when considering the exceedance of higher thresholds, which is common when constructing weather warnings, the COSMO-E ensemble forecasts already appear almost reliable.
The right-hand panel of Figure 6 demonstrates how the altitude groups affect forecast performance at different temperature thresholds. Whereas the within-state uncertainty, UNC Y|A, accounts for a high proportion of the total uncertainty at larger thresholds, the relative contribution of the between-state uncertainty, RES A, is larger at lower thresholds. This suggests that the altitude groups are more adept at identifying whether or not a low daily maximum temperature will occur, rather than a high temperature. Accordingly, the conditional forecast miscalibration, REL F|A, and the amount of information provided by the altitudes that is not explained by the forecast, RES A|F, are relatively large at lower thresholds.

FIGURE 6: Terms of the classical decomposition (left) and the conditional decomposition with respect to the altitude groups (right) for the mean Brier score of the COSMO-E ensemble forecasts, as a function of the temperature threshold. All results are shown for a lead time of three days.

Statistical postprocessing
While ensemble forecasts derived from complex weather prediction models typically offer substantial improvements over purely statistical approaches, it is common in practice to employ statistical techniques to recalibrate the dynamical forecasts. This is known as statistical postprocessing (Vannitsem et al., 2018). In its simplest form, postprocessing aims to model the conditional distribution of the outcome given the ensemble forecast, F, that is, G F. If G F can be modelled perfectly, then issuing this conditional distribution as the forecast will result in a prediction that is reliable, whilst also retaining the information contained in F. Hence, although errors occur in practice when estimating G F, the general aim of postprocessing is to calibrate the forecast without discarding the information provided by the numerical weather prediction model. As such, the theoretical improvement in the Brier score that would be obtained by an ideal recalibration scheme is equivalent to the miscalibration of the raw ensemble forecast, REL F. For example, when considering a threshold of 20 °C at a lead time of three days, Figures 2-4 suggest that a perfect recalibration scheme would improve the raw ensemble forecast by roughly 10% (141.8/1348.1), while Figure 6 indicates that there is less benefit to postprocessing when interest is in larger temperature thresholds (see Williams et al., 2014). This is also apparent from Figure 7, which shows the theoretical improvement gained by postprocessing as a function of the threshold. Figure 7 also displays the relative improvement of a simple postprocessing scheme applied to the ensemble forecasts, calculated using the Brier skill score of the postprocessed forecasts with the COSMO-E ensemble forecast as reference.
The forecasts are postprocessed here using a logistic regression model, where the sole predictor is the probability of the temperature threshold being exceeded, as extracted from the raw ensemble forecast using Equation (21). Like the climatological forecast, the postprocessing model is trained on data from 2016 and 2017, and a separate model is fitted for each threshold. Figures 2-4 again present the decompositions of the Brier score for the resulting forecast. Although the postprocessed forecasts are not perfectly calibrated, perhaps due to the limited data available or invalid assumptions made by the logistic regression model, the reliability term is significantly reduced, and only a small amount of resolution is lost in the process. As such, Figure 7 demonstrates that, despite the improvement falling below the theoretical improvement available for a perfect recalibration scheme, this postprocessing approach is still beneficial for most thresholds. When considering the exceedance of larger temperature thresholds, however, this simple postprocessing method can hinder the performance of the COSMO-E forecasts.
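A minimal version of this postprocessing step can be sketched on synthetic data. The Newton (IRLS) fit below is a simple stand-in for the logistic regression used in the paper, and the miscalibration function in `simulate` is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate(n):
    # hypothetical raw ensemble probabilities on the 22-point grid, with the
    # true event probability a distorted version of the raw forecast
    p_raw = rng.integers(0, 22, size=n) / 21.0
    p_true = 1.0 / (1.0 + np.exp(-3.0 * (p_raw - 0.35)))
    y = (rng.random(n) < p_true).astype(float)
    return p_raw, y

def fit_logistic(x, y, iters=25):
    # minimal Newton (IRLS) fit of P(y=1|x) = sigmoid(b0 + b1*x)
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ w))
        H = X.T @ (X * (mu * (1 - mu))[:, None])
        w += np.linalg.solve(H, X.T @ (y - mu))
    return w

p_train, y_train = simulate(4000)
p_test, y_test = simulate(4000)

b0, b1 = fit_logistic(p_train, y_train)
p_post = 1.0 / (1.0 + np.exp(-(b0 + b1 * p_test)))

bs_raw = np.mean((p_test - y_test) ** 2)
bs_post = np.mean((p_post - y_test) ** 2)  # recalibrated forecast scores better
```

Because the raw probabilities are systematically miscalibrated in this setup, the recalibrated forecast achieves a lower out-of-sample Brier score than the raw one.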

Conditional postprocessing

Although statistical postprocessing reduces the miscalibration of the ensemble prediction system, the recalibrated forecasts may still exhibit conditional biases. Large values of REL F|A, for example, illustrate that the logistic regression-based postprocessing method still exhibits conditional biases owing to all three choices of the states. The conditional miscalibration given the altitude groups (REL F|A in the fourth row of Figure 4) is even larger than the overall miscalibration for the raw COSMO-E ensemble output (REL F in the third row of Figures 2-4). Moreover, the RES A|F component quantifies the additional information that would be provided by the forecast if it correctly captured all of the uncertainty in the outcome owing to the states. The ratio of this to the overall score then constitutes an analogue to the partial R² value. Calculating this quantity for the postprocessed forecasts using the three sets of states suggests that the predictions would improve by a further 2.1% if the weather regimes were incorporated into the forecast, 4.4% if the spatial regions were considered, and 13.8% if information regarding the altitude were utilised.
With this in mind, a forecaster might then look to incorporate these states into the postprocessing model. Several recent studies have highlighted the benefit provided by additional sources of information when postprocessing (e.g. Taillardat et al., 2016;Messner et al., 2017;Rasp and Lerch, 2018;Allen et al., 2019;Schulz and Lerch, 2022) and the novel decomposition presented herein allows forecasters to screen possible predictor variables in order to find those that are most relevant to use when postprocessing: the decomposition can be applied several times with different choices of the states, and those that result in a large RES A|F component may be a source of significant conditional biases in the forecast.
For example, consider a second postprocessing model whereby the altitude group that a station belongs to is included as an additional categorical predictor in the logistic regression model described previously. The final row of Figure 4 demonstrates that this forecast exhibits a reliability similar to that of the original postprocessing model, but both RES A|F and REL F|A are significantly smaller, indicating that the conditional biases owing to the altitude groups are now largely alleviated. As seen from the classical decomposition, this results in a larger resolution component, which in turn coincides with a lower total mean Brier score. Figure 7 illustrates how the benefit provided by this altitude-dependent postprocessing model changes with the temperature threshold of interest. Both the theoretical benefit (quantified by the partial R² value associated with the altitude groupings for the postprocessed forecast) and the actual benefit (quantified by the relative improvement in the mean Brier score of the altitude-based logistic regression model over that of the simpler postprocessing model) are displayed, and both suggest that the altitude is considerably more informative when predicting whether or not the daily maximum temperature will exceed a relatively low threshold. Furthermore, these two quantities almost coincide for all temperature thresholds, which suggests that the dependence of the outcome on the altitude is modelled sufficiently well by these postprocessed forecasts.
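Incorporating a categorical state into the regression amounts to adding dummy predictors; a sketch with a hypothetical binary "altitude" group whose effect the baseline model cannot capture (the simulation design and coefficients are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(11)

def fit_logistic(X, y, iters=25):
    # minimal Newton (IRLS) fit of a logistic regression with design matrix X
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ w))
        H = X.T @ (X * (mu * (1 - mu))[:, None])
        w += np.linalg.solve(H, X.T @ (y - mu))
    return w

def simulate(n):
    # hypothetical data: the forecast's bias depends on a binary altitude state
    p_raw = rng.integers(0, 22, size=n) / 21.0
    alt = rng.integers(0, 2, size=n)          # low / high altitude group
    p_true = 1 / (1 + np.exp(-(3 * p_raw - 1.0 - 1.5 * alt)))
    y = (rng.random(n) < p_true).astype(float)
    return p_raw, alt, y

p_tr, a_tr, y_tr = simulate(6000)
p_te, a_te, y_te = simulate(6000)

w_base = fit_logistic(np.column_stack([np.ones_like(p_tr), p_tr]), y_tr)
w_alt = fit_logistic(np.column_stack([np.ones_like(p_tr), p_tr, a_tr]), y_tr)

def predict(w, cols):
    return 1 / (1 + np.exp(-np.column_stack(cols) @ w))

bs_base = np.mean((predict(w_base, [np.ones_like(p_te), p_te]) - y_te) ** 2)
bs_alt = np.mean((predict(w_alt, [np.ones_like(p_te), p_te, a_te]) - y_te) ** 2)
```

Since the state carries genuine information about the conditional bias, the model with the altitude dummy attains a lower out-of-sample Brier score than the baseline.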
Of course, as well as incorporating the altitude, it would be straightforward also to include the weather regimes and the spatial regions (among other possible predictors) in the postprocessing model. Incorporating information regarding these states should yield yet larger improvements. However, this is true only in theory, and errors may occur in practice due to a finite amount of data. In this sense, just as the aim of postprocessing methods is to calibrate the ensemble forecast without sacrificing the information content provided by the numerical weather prediction model, adding more predictor variables seeks to increase the forecast resolution without using a model that is so complex that it overfits the training data and thus hinders calibration. This aligns with the notion of yielding predictive distributions that are sharp, subject to being calibrated (Murphy and Winkler, 1987;Gneiting and Raftery, 2007).

CONCLUSIONS
This article has studied decompositions of proper scoring rules into uncertainty, resolution, and reliability components. We distinguish between two alternative partitions of proper scoring rules into these terms and propose an extension that utilises information from both. The decomposition introduced herein maintains the established interpretation of the components, while also allowing the forecast quality to be assessed in different situations.
The motivation behind such an extension is that it provides additional information to forecasters, helping them to identify when the performance of their forecasts changes. The novel decomposition divides both the forecast uncertainty and resolution into within-state and between-state contributions, thereby permitting a more thorough analysis of the sources of information in the forecasts. From this, we suggest that score decompositions provide a generalisation of the well-known analysis of variance framework, with further analogies corresponding to the coefficients of determination and partial determination.
In addition, this decomposition permits the simultaneous evaluation of unconditional and conditional forecast biases, acknowledging that forecast errors in different states of the forecasting system may cancel each other out, leading to a prediction scheme that is reliable on the whole but conditionally miscalibrated. Such information can then easily be incorporated into the forecasts, potentially via statistical postprocessing methods, in order to improve future predictions.
These conditional biases could also be analysed by separately decomposing the expected score into uncertainty, resolution, and reliability components for each state of interest (see Equation 6). The novel decomposition presented herein illustrates how these terms relate to the overall forecast uncertainty, resolution, and reliability, and additionally includes two terms that provide the forecaster with further information regarding how their forecasts perform. In particular, the RES A component in Equation (8) describes the amount of uncertainty in the predictand that is attributable to changes in the state, while RES A|F quantifies how much of this variation is captured by the forecast. Large values of this latter component therefore indicate that the forecast does not capture the dependence of the outcome on the different states correctly, suggesting the forecasts would be improved if this information were incorporated into the prediction system.
Since decompositions of scores into uncertainty, resolution, and reliability terms are studied most commonly using the Brier score, both classical and novel decompositions are presented for this score in Section 4, along with suitable bias corrections when these terms are estimated using a finite sample. These decompositions are then applied to probability forecasts that the daily maximum temperature will exceed a range of thresholds. The forecasts are extracted from MeteoSwiss's medium-range ensemble prediction system, COSMO-E, at a range of weather stations across Switzerland. The novel decomposition is applied in three different circumstances, which form partitions of either time or space, and we demonstrate how the terms of this decomposition can be used to design effective statistical postprocessing methods that can recalibrate the ensemble output, whilst also incorporating relevant information.
For simplicity, we condition the decomposition of scores on a discrete random variable, which can assume one of a finite number of possible states. In practice, such an assumption is typically necessary in order to calculate the terms of the decomposition. In theory, however, this could be extended to continuous random variables. Similarly, we also assume that the forecasts can manifest only as a finite number of possible options. This too can often be achieved easily by grouping together similar forecasts, though Dimitriadis et al. (2021) recently proposed a more consistent approach to estimate the terms of score decompositions using isotonic regression. Future work could investigate using this approach to estimate the terms in Equation (8). Such an approach could then potentially be generalised so that the extended decomposition can readily be applied to probabilistic forecasts for nonbinary events.
ACKNOWLEDGEMENTS
Bhend is thanked for providing the data used in this study, and for fruitful discussions. We are also grateful to two anonymous reviewers, whose suggestions have helped significantly to improve the original article.

DATA AVAILABILITY STATEMENT
The code used in this study is available on GitHub at https://github.com/sallen12/ConditionalScoreDecomp.

APPENDIX A. CONDITIONAL DECOMPOSITION OF PROPER SCORES
We demonstrate here how the components of the classical decomposition of proper scores can be expressed as the terms in Equation (8). Firstly, the uncertainty term can be rewritten by conditioning on the possible states, and the reliability term can be decomposed similarly. Finally, combining Equations (7) and (8), and substituting the expressions for UNC Y and REL F derived above, rearranging yields the breakdown of the resolution as given in Equation (8).
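In the Brier case, the split of the uncertainty term above has a familiar form: since the outcome is binary, its entropy under the quadratic score is its variance, and the split of UNC Y into UNC Y|A and RES A is precisely the law of total variance. A sketch of this step, in our notation for the conditional terms:

```latex
% The uncertainty split for the Brier score: the entropy of a Bernoulli
% outcome is its variance, so the split of UNC_Y is the law of total variance.
\begin{align*}
\mathrm{UNC}_Y &= \operatorname{Var}(Y) \\
  &= \mathbb{E}\bigl[\operatorname{Var}(Y \mid A)\bigr]
   + \operatorname{Var}\bigl(\mathbb{E}[Y \mid A]\bigr) \\
  &= \mathrm{UNC}_{Y \mid A} + \mathrm{RES}_A .
\end{align*}
```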

APPENDIX B. CONDITIONAL DECOMPOSITION OF THE BRIER SCORE
The uncertainty component of the classical Brier score decomposition is ÛNC Y = ȳ(1 − ȳ), and the remaining components are recovered using the fact that, for each k = 1, …, K,

$$\sum_{\ell=1}^{J} \frac{n_{k\ell}}{n}\,(P_k - \bar{y}_{k\ell})^2 = \frac{n_{k\bullet}}{n}\,(P_k - \bar{y}_{k\bullet})^2 + \sum_{\ell=1}^{J} \frac{n_{k\ell}}{n}\,(\bar{y}_{k\ell} - \bar{y}_{k\bullet})^2.$$

APPENDIX C. PROPERTIES OF THE BIAS CORRECTIONS

The bias-corrected estimators for the conditional terms have biases of the same form as those previously derived for ŨNC Y, R̃ES F, and R̃EL F. Furthermore, although biases are still present for all terms, they again decay at a much faster rate than those in Equation (C5). Although these estimators are not all unbiased, they are asymptotically unbiased. As such, if the sample size, n, is sufficiently large, then the estimators should produce reliable estimates of all terms in the decomposition, regardless of how many times each state is observed. However, to make reliable conclusions about the terms of the decomposition within a particular state (i.e., the components in Equation (6)), sufficient observations that correspond to this state are required.

APPENDIX D. COMPARISON WITH EHM AND OVCHAROV (2017)

Ehm and Ovcharov (2017) introduce a decomposition of scoring functions in the presence of auxiliary information similar to that presented in Section 2.1. We show in this section how Equation (8) relates to the decomposition presented therein. Ehm and Ovcharov (2017) consider pairs of the form W = (F, A), where A again denotes some auxiliary information, while F represents the forecast. This leads to the following local decomposition:
This partition conditions the expected score on both the forecasts and some auxiliary information, whereas the local decomposition in Equation (6) depends only on the states. Therefore, Equation (6) can be obtained by taking the expectation of the components in Equation (D1) with respect to the possible forecasts, and rearranging the expected entropy into the local uncertainty and resolution. Ehm and Ovcharov (2017) then note that taking the expectation of Equation (D1) with respect to W yields the overall expected score. Writing W = (F, A), this can be expressed in terms of Equations (5) and (8). The first term is clearly equal to the uncertainty, UNC Y, while the third term is equal to REL F|A. The second term, on the other hand, is the resolution of W, RES F,A, which, using results in Section 2.1, can itself be decomposed into RES A + RES F|A. We therefore note that the latter two terms of Equation (D2) are not equal to the resolution and reliability as defined in Equation (5), but are instead larger, by an amount equal to RES A|F.