• Brier score;
  • Reliability diagram


Ensemble forecasts provide probabilistic predictions for the future state of the atmosphere. Usually the probability of a given event E is determined from the fraction of ensemble members which predict the event. Hence there is a degree of sampling error inherent in the predictions. In this paper a theoretical study is made of the effect of ensemble size on forecast performance as measured by a reliability diagram and Brier (skill) score, and on users by using a simple cost-loss decision model. The relationship between skill and value, and a generalized skill score, dependent on the distribution of users, are discussed. The Brier skill score is reduced from its potential level for all finite-sized ensembles. The impact is most significant for small ensembles, especially when the variance of forecast probabilities is also small. The Brier score for a set of deterministic forecasts is a measure of potential predictability, assuming the forecasts are representative selections from a reliable ensemble prediction system (EPS). There is a consistent effect of finite ensemble size on the reliability diagram. Even if the underlying distribution is perfectly reliable, sampling this using only a small number of ensemble members introduces considerable unreliability. There is a consistent over-forecasting which appears as a clockwise tilt of the reliability diagram. It is important to be aware of the expected effect of ensemble size to avoid misinterpreting results. An ensemble of ten or so members should not be expected to provide reliable probability forecasts. Equally, when comparing the performance of different ensemble systems, any difference in ensemble size should be considered before attributing performance differences to other differences between the systems.

The usefulness of an EPS to individual users cannot be deduced from the Brier skill score (nor even directly from the reliability diagram). An EPS with minimal Brier skill may nevertheless be of substantial value to some users, while small differences in skill may hide substantial variation in value. Using a simple cost-loss decision model, the sensitivity of users to differences in ensemble size is shown to depend on the predictability and frequency of the event and on the cost-loss ratio of the user. For an extreme event with low predictability, users with low cost-loss ratio will gain significant benefits from increasing ensemble size from 50 to 100 members, with potential for substantial additional value from further increases in number of members. This sensitivity to large ensemble size is not evident in the Brier skill score. A generalized skill score, dependent on the distribution of users, allows a summary performance measure to be tuned to a particular aspect of EPS performance.