Clustering numerical weather forecasts to obtain statistical prediction intervals


  • Ashkan Zarnani
    1. Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
  • Petr Musilek (corresponding author)
    1. Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
    2. Department of Computer Science, VSB-Technical University Ostrava, Czech Republic
    Correspondence: P. Musilek, Department of Electrical and Computer Engineering, University of Alberta, Edmonton T6G 2V4, Canada.
  • Jana Heckenbergerova
    1. Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
    2. Department of Mathematics and Physics, University of Pardubice, Czech Republic

Numerical weather prediction (NWP) models output deterministic point values arranged on a three-dimensional grid. However, there is always some level of uncertainty in the prediction, and many applications would benefit from relevant uncertainty information provided along with the forecast. A common means of formulating and communicating forecast uncertainty is the prediction interval (PI). In this study, various methods for modelling the uncertainty of NWP forecasts are investigated and PIs are derived accordingly. In particular, the interest is in analysing the historical performance of the system as a valuable source of information for uncertainty analysis. Various clustering algorithms are employed to group the performance records as the first step of the PI determination process. In the second step, a range of methods is used to fit appropriate probability distributions to the errors of each cluster. As a result, PIs can be computed dynamically depending on the forecast context. The clustering algorithms are applied over different feature sets, including derived and generated features. All presented PI computation methods are empirically evaluated using a comprehensive verification framework in a set of experiments involving two real-world data sets of NWP forecasts and observations. The proposed evaluation provides a considerably fairer and more reliable judgement compared to existing methods. Results show that incorporating trained uncertainty model outputs into the NWP point predictions provides PI forecasts with higher reliability and skill. This can lead to improved decision processes in the many applications that rely on these forecasts. Copyright © 2013 Royal Meteorological Society

1. Introduction

Weather forecasts are typically made and reported in the form of an expected value of the attribute of interest at a particular time and location. Numerical weather prediction (NWP) models are advanced computer simulation systems that provide such expected-value forecasts for a number of attributes. Although the deterministic interactions of simulated physical processes in such systems yield real-valued numbers with high precision, these values are uncertain due to the inaccuracy of initial conditions, the parameterization of sub-grid scale processes, and various simplifying assumptions (Palmer, 2000; Orrell et al., 2001; Lange, 2003). However, such uncertainty information is not available in the immediate outputs of the system. Yet, in many applications it is desirable that forecasts be accompanied by their corresponding uncertainties. Information about forecast uncertainty may be as significant as the forecast itself, and can play an important role in the planning and decision-making processes that utilize the forecasts (Chatfield, 1993; Richardson, 2000).

The uncertainty of a forecast is typically formulated and communicated using prediction intervals (PIs) accompanied by a percentage expressing the level of confidence or nominal coverage rate [e.g., T = (2, 14 °C), conf = 95%] (Hahn and Meeker, 1991; Chatfield, 1993). The confidence level specifies the expected probability of the actual observation to fall inside the PI range. This type of forecast (sometimes called a central credible interval forecast or forecast interval) may be harder for a non-specialist to interpret and evaluate, but it provides the user with a more complete description of the predicted phenomenon compared to a point forecast (Chatfield, 2001). In spite of the clear value of PI forecasts, this format of forecast ‘…has been largely overlooked by meteorologists and would benefit from some attention…’ (Jolliffe and Stephenson, 2003).

A major category of solutions for uncertainty analysis and PI estimation, especially in meteorology, is based on ensemble predictions (Ehrendorfer, 1997). In this method, individual predictors are members of an ensemble of forecasts run with different parameters and/or initial conditions, and the forecast uncertainty is linked to the dispersion among the members (Richardson, 2000; Toth, 2003). However, ensemble executions of an NWP model incur a high computational cost, making this approach infeasible in many applications, especially when updated uncertainty analyses are required at short temporal intervals.

PIs can also be obtained by statistical modelling of forecast error using the historical performance of relevant past forecasts by the system (Chatfield, 2001; Jørgensen and Sjøberg, 2003). In this approach, the dynamics of the forecast uncertainty are essentially learned from the recorded accuracy of past forecasts. Such records are available for many deterministic forecasting systems. The current study focuses on this approach as a potentially efficient method that has received relatively little attention in the literature.

It is a well-known fact that the extent of forecast uncertainty varies with the weather situation (Palmer, 2000). For example, low pressure systems are known to be less predictable than the more stable high pressure systems. It is expected that such patterns of uncertainty dependency on the forecasted attributes can be discovered from the historical performance of the NWP forecasts (Lange, 2003; Nielsen et al., 2006; Pinson et al., 2006). Lange et al. discovered such dependencies by clustering the performance records into six separate groups and characterizing the attributes of their error distributions individually (Lange, 2003, 2005; Lange and Heinemann, 2003). However, this analysis has not been practically applied and evaluated for the purpose of deriving PIs from a deterministic forecasting system.

A practical application of weather classification to obtain PIs was proposed by Pinson et al. (2006) and Pinson and Kariniotakis (2010). The authors used two predicted variables, wind speed and wind power, to categorize wind energy forecast records into four manually defined classes (Pinson, 2006). PIs were then computed using empirical quantiles of the error distributions in each group and fuzzy membership values of a new forecast in each of the predefined groups. Experimental evaluation of the resulting PIs demonstrated applicability of the historical forecast grouping approach. It provided skilful and relatively reliable PIs from the initial point forecasts.

To improve the quality of the resulting PIs, and to alleviate the problem of manual grouping of weather forecasts, we investigate the application of automatic objective-based clustering algorithms to obtain optimally defined groups of forecast records that follow the inherent structures in the data. It is suggested that, as these clusters are based on actual similarities between past forecast situations, they lead to PIs of higher quality. Moreover, this approach does not suffer from the limitations of expert-based definition of partitions, which becomes a daunting task as the dimensionality of the influential variables increases. In this study, the application of crisp clustering algorithms (K-means, CLARA and hierarchical clustering) is examined, and the resulting PIs are assessed. Fuzzy C-means clustering is also applied as a natural alternative to the crisp allocation of forecast records to clusters. The next step involves fitting an appropriate probability distribution function to the observed error distribution in each cluster. Statistical techniques are examined in this regard, and the modifications required by the fuzzy approach are considered. Inherent in all these models is the dynamic calibration of forecasts by uncovering the 'situation-based' forecast bias.

The evaluation of PI forecasts and probabilistic forecasts in general, is more complex compared to point forecasts. To test the proposed approaches empirically, the PI models are applied to two real-world data sets. A comprehensive evaluation framework that covers all major measures found in the PI evaluation literature is developed. This approach also brings some new insights to the PI verification process, leading to fairer judgements.

The applicability and quality of the resulting PIs in practical scenarios is also investigated. The results provide insight into the role of different aspects such as the clustering algorithms, number of clusters, feature sets, distribution fitting algorithms and their appropriate choice in the uncertainty modelling process. In addition, higher skill and quality of the output PIs compared to some baseline PI approaches and raw point predictions of the NWP system show the advantages and value of the proposed models.

The next section provides basic definitions of PIs and explains the density fitting methods used in the quantile calculation process. Section 'Using clustering techniques for uncertainty modelling' explains the application of clustering algorithms in the uncertainty modelling process. PI forecast quality measures and the verification framework are explained in Section 'An evaluation framework for PI forecasts'. Finally, Section 'Experimental study' reports the experimental results, while conclusions and future directions are provided in the last section.

2. Prediction intervals: definitions and methods

2.1. Prediction intervals

NWP models provide deterministic forecasts in the form of a time series $\{\hat{y}_t\}$ for each weather attribute (e.g., temperature, wind speed) on a three-dimensional grid. For simplicity, the location coordinates are omitted in the following text. In a probabilistic forecasting model, by contrast, the prediction is provided in the form of a probability distribution function (pdf) $\hat{f}_t(y)$ as an estimate of the true pdf $f_t(y)$, where $y_t$ is the variable of interest. It is expected that the actual observation $y_t$ is a sample following the predictive distribution $\hat{f}_t$. A point forecast $\hat{y}_t$ is in effect a single value taken from this full distribution prediction, often selected to be the mean of the distribution, i.e. the expected value of $y_t$.

The relation between the forecast $\hat{y}_t$ and its observation $y_t$ can be described as:

$$y_t = \hat{y}_t + e_t \qquad (1)$$

i.e., each observation can be decomposed into the predicted value $\hat{y}_t$ for time t and an error term $e_t$ for the specific forecast instance. Based on a probabilistic forecast, the pdf $\hat{f}_t$ and the cumulative distribution function (cdf) $\hat{F}_t$ are explicitly available. These are estimates of the true pdf and cdf of the observations, $f_t$ and $F_t$. Thus, an α-quantile $q_t^{(\alpha)}$ of the distribution of $y_t$ can be defined as (Hahn and Meeker, 1991; Pinson, 2006; Pinson and Kariniotakis, 2010):

$$P\left(y_t \le q_t^{(\alpha)}\right) = F_t\left(q_t^{(\alpha)}\right) = \alpha \qquad (2)$$
$$q_t^{(\alpha)} = F_t^{-1}(\alpha) \qquad (3)$$

The prediction interval $I_t^{(\alpha)}$ is defined as the (1 − α)-confidence interval into which the observation $y_t$ is expected to fall with probability 1 − α. Therefore, it can be described as a range satisfying:

$$P\left(L_t^{(\alpha)} \le y_t \le U_t^{(\alpha)}\right) = 1 - \alpha \qquad (4)$$

where $L_t^{(\alpha)}$ and $U_t^{(\alpha)}$ are, respectively, the lower and upper bounds of the prediction interval $I_t^{(\alpha)}$, defined by the corresponding distribution quantiles:

$$L_t^{(\alpha)} = q_t^{(\alpha_l)} = F_t^{-1}(\alpha/2) \qquad (5)$$
$$U_t^{(\alpha)} = q_t^{(\alpha_u)} = F_t^{-1}(1 - \alpha/2) \qquad (6)$$

For instance, with α = 0.05, the prediction interval has a 95% confidence level and is bounded by the quantiles $q_t^{(\alpha_l)}$ and $q_t^{(\alpha_u)}$ with αl = 0.025 and αu = 0.975. The above equations are also expected to hold for the estimates $\hat{F}_t$ and $\hat{F}_t^{-1}$ provided by a probabilistic forecasting system. The corresponding quantiles of the predictive distribution would hence be $\hat{q}_t^{(\alpha_l)}$ and $\hat{q}_t^{(\alpha_u)}$ (Pinson, 2006; Wilks, 2006). In practice, $I_{t+k}^{(\alpha)}$ has to be predicted at time t using all information available at that time. The probabilistic forecast would accordingly provide $\hat{I}_{t+k}^{(\alpha)}$ as the prediction interval for the target value k temporal steps (e.g. hours) ahead.

When a forecasting system does not provide probability models over its predictions, the output forecasts carry no guidance about their accuracy. In this situation, the uncertainty dynamics of the predictions have to be analysed using a secondary procedure (Chatfield, 1993; Bremnes, 2004; Lange, 2005). This is the condition holding for NWP forecasts, which are inherently deterministic. Under this approach, the analysis of prediction errors is the means to obtain probabilistic forecasts: the systematic characterization of forecast error can lead to a model of forecast uncertainty for the target variable. This can be achieved by considering $e_t$ in Equation (1) as an instance of the random variable e and using $\hat{F}_e$ as its estimated cumulative distribution function:

$$L_t^{(\alpha)} = \hat{y}_t + \hat{q}_e^{(\alpha/2)} = \hat{y}_t + \hat{F}_e^{-1}(\alpha/2) \qquad (7)$$
$$U_t^{(\alpha)} = \hat{y}_t + \hat{q}_e^{(1-\alpha/2)} = \hat{y}_t + \hat{F}_e^{-1}(1 - \alpha/2) \qquad (8)$$

where $\hat{q}_e^{(\alpha)}$ is the estimated α-quantile of the estimated probability distribution $\hat{F}_e$ of the forecast error. The distribution of $y_t$, and hence the desired quantiles, are not explicitly known. Therefore, to find the $I_t^{(\alpha)}$ prediction interval of $y_t$, the quantiles of e (i.e., the error associated with the forecast) are estimated and added to the predicted value $\hat{y}_t$ to obtain the lower and upper bounds for the original variable (Pinson and Kariniotakis, 2010). Thus, by finding quantiles of the forecast error distribution, one can find the quantiles of the forecast value that are expected to enclose the target observation.

2.2. Fitting distributions to forecast error

Various methods can be used to hypothesize a distribution of the forecast error e from the samples available in the performance record of the NWP system. After such a distribution is available, prediction intervals can be calculated using Equations (7) and (8). The major schemes for deriving distributions from error samples used in this work are described below.

2.2.1. Gaussian fit

The error of a point forecast at time t (et) can be regarded as a sample of the error random variable e. This random variable e has its own probability distribution that can be characterized by its mean (μe) and standard deviation (σe).

Let {et} be a series of random samples of the error variable e. The values of sample bias and sample standard deviation can then be calculated by the following sample statistics:

$$\hat{\mu}_e = \frac{1}{N} \sum_{t=1}^{N} e_t \qquad (9)$$
$$\hat{\sigma}_e = \sqrt{\frac{1}{N-1} \sum_{t=1}^{N} \left(e_t - \hat{\mu}_e\right)^2} \qquad (10)$$

where N is the size of the sample series (Chatfield, 2001).

A simple yet popular method to find the boundaries of $I_t^{(\alpha)}$ is based on the assumption that the pdf of the error follows a Gaussian distribution. Many studies confirm that forecast errors of many weather attributes follow a Gaussian distribution (Landberg, 1999; Lange, 2005). When a Normal distribution $N(\mu_e, \sigma_e)$ with known parameters is assumed, one can calculate the PI quantiles (Jørgensen and Sjøberg, 2003) using Equations (7) and (8):

$$L_t^{(\alpha)} = \hat{y}_t + \mu_e + z_{\alpha/2}\,\sigma_e \qquad (11)$$
$$U_t^{(\alpha)} = \hat{y}_t + \mu_e + z_{1-\alpha/2}\,\sigma_e \qquad (12)$$

where $z_{\alpha/2}$ and $z_{1-\alpha/2}$ are the (α/2)- and (1 − α/2)-quantiles, respectively, of the standard normal distribution N(0, 1). In the case of a PI with 95% confidence (α = 0.05), $z_{\alpha/2}$ and $z_{1-\alpha/2}$ are equal to −1.96 and 1.96, respectively (Hahn and Meeker, 1991; Jørgensen and Sjøberg, 2003).

Figure 1(a) shows the error distribution of temperature forecasts in various locations in the province of British Columbia (BC), Canada for Summer 2008. The matching normal distribution and quantiles are also shown.

Figure 1.

(a) Temperature error and the corresponding normal distribution based on $\hat{\mu}_e$ and $\hat{\sigma}_e$ of the entire dataset. (b) Wind speed (m s−1) error distribution and its Weibull distribution fit curve for a sample subset of NWP forecasts.

To obtain PIs for the NWP forecasts using this method, a dataset of past predictions and associated observations must be constructed. Using the simple method of moments for the fitting step, one can apply Equations (9) and (10) to calculate $\hat{\mu}_e$ and $\hat{\sigma}_e$ as sample statistics of the empirical dataset. As these parameters are estimates, the boundaries of the prediction interval are determined using the following equations (Wonnacott and Wonnacott, 1990; Jørgensen and Sjøberg, 2003):

$$L_t^{(\alpha)} = \hat{y}_t + \hat{\mu}_e + t_{(\alpha/2,\,N-1)}\,\hat{\sigma}_e \sqrt{1 + \tfrac{1}{N}} \qquad (13)$$
$$U_t^{(\alpha)} = \hat{y}_t + \hat{\mu}_e + t_{(1-\alpha/2,\,N-1)}\,\hat{\sigma}_e \sqrt{1 + \tfrac{1}{N}} \qquad (14)$$

where $t_{(\alpha,n)}$ is the α-quantile of Student's t-distribution with n degrees of freedom. The quantiles of the t-distribution and the multiplier term are used because the moments of the true distribution are unknown and are estimated from samples of the historical performance dataset (Jørgensen and Sjøberg, 2003).
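As a concrete illustration, the Gaussian fit of Equations (9), (10), (13) and (14) can be sketched in a few lines. This is not the authors' code: the function name and interface are hypothetical, and `errors` stands for the historical error sample of the NWP system.

```python
import numpy as np
from scipy import stats

def gaussian_pi(errors, y_hat, alpha=0.05):
    """PI bounds from a Gaussian fit to past forecast errors.

    Because mu_e and sigma_e are only estimated from a finite sample,
    Student's t quantiles with N - 1 degrees of freedom replace the
    standard normal quantiles, together with the sqrt(1 + 1/N) factor.
    """
    e = np.asarray(errors, dtype=float)
    n = e.size
    mu = e.mean()                            # sample bias, Eq. (9)
    sigma = e.std(ddof=1)                    # sample standard deviation, Eq. (10)
    spread = sigma * np.sqrt(1.0 + 1.0 / n)  # widens the PI for estimation error
    lo = y_hat + mu + stats.t.ppf(alpha / 2, df=n - 1) * spread
    hi = y_hat + mu + stats.t.ppf(1 - alpha / 2, df=n - 1) * spread
    return lo, hi
```

For a symmetric, zero-bias error sample the interval is centred on the point forecast, and it shrinks toward the ±1.96σ normal interval as N grows.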

2.2.2. Weibull fit

Investigations of actual forecast accuracies show that in many cases the forecast error distribution does not fully follow a symmetric normal shape. This is often observed for target attributes that themselves follow non-Gaussian distributions, such as wind speed (Pinson, 2006). To achieve a better fit, and consequently better PIs, in such skewed cases, the Weibull distribution can be a good choice. It is defined as:

$$f(x; k, \lambda) = \frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k}, \quad x \ge 0 \qquad (15)$$

where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution; the value of f is zero for x < 0. To find the distribution parameters fitting a set of values of x, the method of maximum likelihood estimation (MLE) is used (Wilks, 2006). Using MLE, the distribution parameters are tuned so that the likelihood of drawing the sample data from the fitted distribution is maximized. The cdf of this fitted Weibull distribution, $\hat{F}_e$, can then be used to compute the error quantiles of the PI through Equations (7) and (8). The error values have to be shifted to the right so that the minimum value of the random variable is zero. Figure 1(b) depicts an example of a Weibull fit over a subset of wind speed errors.
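A sketch of this fitting step, under the assumption that SciPy's `weibull_min` MLE fit is an acceptable stand-in for the procedure described above; the function name and the small positive shift offset are illustrative choices, not from the paper.

```python
import numpy as np
from scipy import stats

def weibull_error_quantiles(errors, alpha=0.05):
    """Fit a Weibull distribution to shifted errors via MLE and return
    the (alpha/2, 1 - alpha/2) error quantiles for Equations (7) and (8)."""
    e = np.asarray(errors, dtype=float)
    shift = e.min() - 1e-6        # shift right so all values are strictly positive
    k, loc, lam = stats.weibull_min.fit(e - shift, floc=0)  # MLE, location fixed at 0
    q_lo = stats.weibull_min.ppf(alpha / 2, k, scale=lam) + shift
    q_hi = stats.weibull_min.ppf(1 - alpha / 2, k, scale=lam) + shift
    return q_lo, q_hi             # undo the shift on the way out
```

The tiny offset keeps the minimum sample strictly above zero, which avoids a degenerate log-likelihood term at x = 0 during the fit.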

2.2.3. Empirical distribution

Another alternative is a non-parametric approach in which the error distribution is derived from the actual distribution of the sample data at hand. This means that the empirical cumulative distribution function of the sample errors is used as a direct estimate of the true population distribution (Pinson, 2006). This empirical cdf is defined as:

$$\hat{F}_e(x) = \frac{\left| \{ e_i \in E : e_i \le x \} \right|}{|E|} \qquad (16)$$

where E is the set of errors in the available sample. Using this approach, the term $\hat{F}_e^{-1}$ in Equations (7) and (8) is computed differently, leading to different PIs. There is also a large body of work on obtaining probabilistic distributions, and calibrating them, in situations where ensemble forecasts are available (Nipen and Stull, 2011).
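A minimal sketch of this empirical-quantile variant (hypothetical function name; `np.quantile` with its default linear interpolation stands in for inverting Equation (16)):

```python
import numpy as np

def empirical_pi(errors, y_hat, alpha=0.05):
    """PI bounds from the empirical cdf of past errors.

    Quantiles are read directly from the sorted sample, so the PI can
    never extend beyond the most extreme error seen in training -- a
    tail limitation the kernel density smoothing method addresses.
    """
    e = np.asarray(errors, dtype=float)
    q_lo = np.quantile(e, alpha / 2)         # alpha/2 empirical error quantile
    q_hi = np.quantile(e, 1 - alpha / 2)     # (1 - alpha/2) empirical error quantile
    return y_hat + q_lo, y_hat + q_hi        # Equations (7) and (8)
```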

2.2.4. Kernel density smoothing

There are some potential drawbacks in the application of the empirical cdf method for estimating the forecast error distribution. First, the sampling characteristics of the data can have a dramatic impact on the cdf. Second, and more significantly, the domain of the pdf is limited to the minimum and maximum values existing in the sample, which is not ideal for PI analysis, which is chiefly sensitive to the tails. An alternative to the empirical pdf is the kernel density smoothing method, which can provide both a smoother function and a better estimate of the tails. Instead of the zero-or-one contributions used in constructing the empirical pdf, kernel density smoothing stacks kernel blocks centred at the data values. A smoothing kernel is a non-negative function with unit area, and hence is a proper probability density function in its own right. Each sample contributes a stacking element equal to the smoothing kernel centred at the sample value. Here, the Gaussian smoothing kernel, which has support (−∞, +∞), is used. The final pdf is constructed as (Wilks, 2006):

$$\hat{f}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left( \frac{x - e_i}{h} \right), \quad h = 0.9\, \min\!\left( s, \frac{IQR}{1.34} \right) N^{-1/5} \qquad (17)$$

where h is a smoothing parameter that balances the smoothing intensity. The definition of h in the above equation is a reasonable choice for this parameter when the Gaussian kernel is used (Silverman, 1986). In this formulation, s is the sample standard deviation and IQR is the inter-quartile range of the sample data.

Figure S1(a) shows the empirical distribution of a sample subset of wind speed errors; the smooth curve shows the kernel smoothing density of this sample using Gaussian kernels. The cumulative distribution functions of the empirical and kernel-smoothed distributions for this sample are shown in Figure S1(b). As can be seen, the kernel version has a desirably smoother shape and declines gradually at the edges.
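The kernel density estimate of Equation (17) and its numerically inverted cdf can be sketched as follows; the grid padding of four bandwidths and the function name are illustrative choices, not from the paper.

```python
import numpy as np

def kde_error_quantile(errors, alpha, grid_size=2048):
    """Error quantile from a Gaussian-kernel density estimate.

    Bandwidth follows the rule used in Equation (17):
    h = 0.9 * min(s, IQR / 1.34) * N**(-1/5). The cdf is accumulated on
    a grid extending past the sample range, giving smoother tails than
    the empirical cdf.
    """
    e = np.asarray(errors, dtype=float)
    n = e.size
    s = e.std(ddof=1)
    iqr = np.percentile(e, 75) - np.percentile(e, 25)
    h = 0.9 * min(s, iqr / 1.34) * n ** (-0.2)
    # evaluate the mixture of Gaussian kernels on a padded grid
    x = np.linspace(e.min() - 4 * h, e.max() + 4 * h, grid_size)
    pdf = np.exp(-0.5 * ((x[:, None] - e[None, :]) / h) ** 2).sum(axis=1)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]                      # normalise the discrete cdf to [0, 1]
    return np.interp(alpha, cdf, x)     # invert the cdf at level alpha
```

Normalizing the accumulated cdf makes the kernel's constant factors cancel, so only the relative kernel weights matter for the quantile lookup.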

When multiple ensemble member forecasts are accessible, one can apply the Bayesian model averaging approach described in Raftery et al. (2005) to obtain a calibrated smooth probability density forecast.

3. Using clustering techniques for uncertainty modelling

3.1. PI computation using crisp clustering

In the above PI computation procedures, the weather situation and other features such as time and location do not play any role. Thus, PIs obtained by these methods are invariant across weather situations. Considering the dependence of forecast error on the forecast situation (Palmer, 2000), a fine grouping of situations can lead to clusters of forecast cases with similar error behaviour, while the error behaviour in one cluster remains distinct from that in other clusters. Such groupings can be found by clustering all available cases using the relevant influential variables as features. Subsequently, the prediction interval analysis described in the previous section can be applied to each cluster separately. This way, different PIs can be found for the different forecast situations discovered through clustering. In other words, rather than treating all cases as equal, the error distribution within each cluster determines the prediction interval of that cluster only. After discovering clusters of forecast records in the first step, the error distributions in the clusters are modelled in the second step (Figure S2). Finally, the PIs are obtained from the density models and later undergo an evaluation process.

In this study, several clustering algorithms representing the most common clustering approaches are applied: K-means from the centroid-based class, CLARA from the medoid-based class, and agglomerative hierarchical clustering from the hierarchical class of clustering algorithms (Xu and Wunsch, 2005).

3.1.1. K-means

This is a simple yet powerful clustering algorithm that has been used in many applications (Vejmelka et al., 2009), including clustering of atmospheric situations and patterns (Huth et al., 2008). The algorithm finds K clusters in a dataset D = {x1, x2, …, xN}, where $x_i \in \mathbb{R}^d$, N is the total number of forecast cases available for training, and d is the number of influential variables. It iteratively updates the centre points of the K clusters, C = {c1, c2, …, cK}, and reassigns every data point to the nearest cluster centre. This heuristic iterative process locally minimizes the total distance of points to their respective cluster centres (Kaufman and Rousseeuw, 1990):

$$J = \sum_{j=1}^{K} \sum_{x_i \in D_j} \left\| x_i - c_j \right\|^2 \qquad (18)$$

where $D_j$ is the set of points in D assigned to cluster j, and $c_j$ is their nearest cluster centre in C. Here, $x_i = (x_{i,1}, \ldots, x_{i,d})$ holds the d influential features of forecast case i. The forecast error $e_i$ of case i is also associated with the predictand y, but is not used for clustering.

To find clusters of NWP forecasts, each prediction can be considered a point in D. For each point $x_i$, d influential variables are taken into account (e.g. forecast temperature, wind speed and direction, surface pressure, location, elevation, etc.). Figure 2(a) visualizes a sample of four clusters discovered by K-means in the BC data set of temperature forecasts, projected into a two-dimensional space obtained using two-dimensional principal components analysis (2DPCA). After a set of cluster centres {c1, c2, …, cK} is determined from past forecasts, each cluster has its own set of forecast cases $D_j$ and its own set of errors $E_j$ for target y, such that:

$$E_j = \left\{ e_i : x_i \in D_j \right\}, \quad n_j = \left| D_j \right| \qquad (19)$$
Figure 2.

(a) 2DPCA visualization of the four identified clusters (first two components cum. prop. of var. = 0.40). (b) Error distributions and their moments for the entire dataset (solid black) and four identified clusters of forecasts.

where $n_j$ is the number of sample points in cluster j. The error samples of a desired variable (e.g., temperature) in each cluster j are thus given by $E_j$. Next, one of the distribution fitting approaches described in the previous section is applied to the error sets $E_j$ to estimate the functions $\hat{f}_{e,j}$ and $\hat{F}_{e,j}$. The estimated $\hat{F}_{e,j}$ is then used in Equations (7) and (8) in place of $\hat{F}_e$ to obtain a different PI for each cluster j. For instance, based on the Gaussian fitting method, the sample statistics $\hat{\mu}_{e,j}$ and $\hat{\sigma}_{e,j}$ are determined for each cluster j. Figure 2(b) shows the fitted Gaussian distributions of temperature error for all discovered clusters of Figure 2(a). When a new forecast $x_{new}$ is made, the cluster to which it belongs is identified via the nearest cluster centre:

$$j^* = \arg\min_{j \in \{1, \ldots, K\}} \left\| x_{new} - c_j \right\| \qquad (20)$$

and boundaries of the corresponding PI are estimated as:

$$L^{(\alpha)} = \hat{y}_{new} + \hat{F}_{e,j^*}^{-1}(\alpha/2) \qquad (21)$$
$$U^{(\alpha)} = \hat{y}_{new} + \hat{F}_{e,j^*}^{-1}(1 - \alpha/2) \qquad (22)$$

where $\hat{y}_{new}$ is the attribute of interest in the forecast $x_{new}$. In general, PIs determined through this process vary between forecasts, as they depend on the cluster to which the current forecast case belongs. The methods introduced in Monache et al. (2011) are closely related, as they consider each forecast case as an independent cluster when predicting error.
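The two-step procedure above (cluster the historical cases, then attach per-cluster error quantiles) can be sketched with scikit-learn's KMeans; the empirical-quantile fitting step, the function names, and the inputs `X` (influential features) and `errors` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_cluster_pis(X, errors, k=4, alpha=0.05, seed=0):
    """Cluster past forecast cases and attach empirical error quantiles
    to each cluster (Equations (19), (21) and (22))."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    quantiles = {}
    for j in range(k):
        e_j = errors[km.labels_ == j]              # error sample E_j of cluster j
        quantiles[j] = (np.quantile(e_j, alpha / 2),
                        np.quantile(e_j, 1 - alpha / 2))
    return km, quantiles

def cluster_pi(km, quantiles, x_new, y_hat):
    """PI for a new forecast: the nearest centre selects the cluster, Eq. (20)."""
    j = int(km.predict(np.atleast_2d(x_new).astype(float))[0])
    q_lo, q_hi = quantiles[j]
    return y_hat + q_lo, y_hat + q_hi
```

In this scheme a forecast falling in a historically noisy cluster automatically receives a wider interval than one falling in a well-predicted cluster.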

3.1.2. CLARA

CLARA (Clustering LARge Applications) uses random sampling from the original data set to maintain the important geometric properties of clusters effectively. The reduced-size sample is then clustered into K groups represented by data points (medoids). This grouping is done through a brute-force search for the best swap between a medoid and a data point by the PAM (Partitioning Around Medoids) clustering algorithm (Kaufman and Rousseeuw, 1990; Xu and Wunsch, 2005).

3.1.3. Agglomerative hierarchical clustering

In this algorithm, the data rows are initially regarded as singleton clusters, and at each step the most similar pair of clusters is merged into a new cluster at the next higher level. This process is repeated until a single cluster remains (Xu and Wunsch, 2005). The output of this process is a tree structure called a dendrogram, which can be cut into the desired number of clusters based on the order of similarities between the clusters. To obtain a PI model, the same process as described for the K-means algorithm can be used.

3.2. PI computation using fuzzy C-means clustering

The clustering algorithms described in the previous section assign each sample point to exactly one cluster; the membership of a forecast case in a cluster ($x_i \in D_j$) is therefore characterized by a binary value. In a more natural approach, supported by fuzzy sets, forecast cases can be associated with multiple clusters at different levels of membership. Such partial membership of samples in forecast groups can improve the modelling of forecast situations; many weather conditions, such as transitional phases of weather, can be better explained using this approach. Fuzzy C-means (FCM) is a common method of fuzzy clustering. In this algorithm, the membership values of data points (rows) in the various clusters (columns) are represented by a matrix with fractional entries (compared to the binary values of crisp clustering algorithms). The objective function of the clustering process changes accordingly (Bezdek et al., 1984; Pedrycz, 2005):

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{K} u_{ij}^{m} \left\| x_i - c_j \right\|^2 \qquad (23)$$

where $u_{ij}$ represents the degree of membership of point $x_i$ in cluster j, with $\sum_{j=1}^{K} u_{ij} = 1$, and m > 1 is a fuzzification factor that controls the balance between membership values close to 0 or 1 and intermediate values. The objective function can be minimized in an iterative alternating-optimization process in which the membership matrix is updated as follows:

$$u_{ij} = \left[ \sum_{l=1}^{K} \left( \frac{\left\| x_i - c_j \right\|}{\left\| x_i - c_l \right\|} \right)^{2/(m-1)} \right]^{-1} \qquad (24)$$

and the cluster centres are then recalculated using the new partition matrix. This iterative process repeats until convergence, i.e. until no entry of the membership matrix u changes by more than ϵ.

After applying a crisp clustering algorithm, the set of past error samples in cluster j for target y, i.e. $E_j$, would be induced by Equation (19). In contrast, the output of the fuzzy clustering process determines the contribution of each error sample to every cluster. Therefore, the distribution fitting for each cluster is performed with all training samples, using a weighting scheme based on each sample's degree of membership in that cluster. In other words, the training samples that are more associated with a cluster contribute more to the formation of the cluster's pdf:

$$\hat{f}_{e,j}(x) = \frac{1}{h \sum_{i=1}^{N} u_{ij}} \sum_{i=1}^{N} u_{ij}\, K\!\left( \frac{x - e_i}{h} \right) \qquad (25)$$

Hence, when applying the kernel density smoothing method to fit a probability distribution over the error set of cluster j, $u_{ij}$ determines the vector of sample weights in the fitting process. In addition, any new forecast case $x_{new}$ is now associated with all clusters, with different degrees of membership. Therefore, Equation (20) is no longer appropriate for PI calculation. Instead, the PI boundaries computed from each cluster's fitted distribution are consolidated using the membership levels of the new forecast in each of the fuzzy clusters:

$$\hat{q}_{new}^{(\alpha)} = \frac{\sum_{j=1}^{K} u_{new,j}\, \hat{q}_j^{(\alpha)}}{\sum_{j=1}^{K} u_{new,j}} \qquad (26)$$

where $u_{new,j} = m_j(x_{new})$ is the membership level of $x_{new}$ in cluster j. This provides a normalized weighted mean of the individual quantiles calculated for each cluster. Hence, for example, when the new forecast belongs to cluster 1 with a much higher degree than to cluster 2, its prediction interval is determined with a much stronger contribution from the quantiles of cluster 1. This method is in essence a distribution combination process.
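The consolidation step of Equation (26) reduces to a normalized weighted average and can be sketched as follows; the function name and input layout are illustrative, and the per-cluster quantile pairs are assumed to come from the weighted fitting step described above.

```python
import numpy as np

def fuzzy_pi(y_hat, memberships, cluster_quantiles):
    """Combine per-cluster error quantiles into one PI, weighted by the
    new forecast's degree of membership in each fuzzy cluster."""
    u = np.asarray(memberships, dtype=float)        # u_new,j for each cluster j
    q = np.asarray(cluster_quantiles, dtype=float)  # shape (K, 2): (q_lo, q_hi) rows
    w = u / u.sum()                                 # normalised membership weights
    q_lo, q_hi = w @ q                              # weighted mean of the quantiles
    return y_hat + q_lo, y_hat + q_hi
```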

4. An evaluation framework for PI forecasts

4.1. Basic verification measures

Verification of PI forecasts serves to determine the quality of forecasts and to properly select or modify a forecasting system. The probabilistic nature of interval forecasts complicates their verification compared to deterministic forecasts (Chatfield, 2001; Wilks, 2006). Atmospheric science has been the field with the most developments in forecast verification processes (Wilks, 2006). However, verification of probabilistic forecasts in general, and PI forecasts in particular, is still under active development (Brocker and Smith, 2007; Casati et al., 2008). The Brier score is a classical and still widely used verification score for probabilistic forecasts (Brier, 1950); however, it is appropriate only for dichotomous variables (Jolliffe and Stephenson, 2003; Pierolo, 2011). An extension of the score to multi-category probabilistic forecasts is the widely used ranked probability score (RPS) (Murphy, 1971). The major shortcoming of this score is well known: it does not penalize vague forecasts, i.e. wide intervals (Wilks, 2006). Another version of this score, developed for probabilistic forecasts of continuous variables and named the continuous ranked probability score (CRPS) (Hersbach, 2000), can only measure the match between the forecast full pdf and that of the observation (Wilks, 2006). Hence, these measures are not appropriate for PI evaluation: either they are not sensitive to interval width, or they do not match the double-quantile format of PI forecasts.

Recently, information theory approaches have also been studied as a verification tool for probabilistic forecasts. Examples include the ignorance score (Roulston and Smith, 2001) and information gain (Peirolo, 2011). Although these measures can provide proper scores (Brocker and Smith, 2007; Gneiting and Raftery, 2007) for probabilistic predictions of continuous variables, they are not widely used (Casati et al., 2008). They are neither intuitive nor effective at communicating forecast skill to decision makers. In addition, they can be applied only in scenarios where both the predicted and target probabilities are provided in the form of full probability distributions. Hence, they cannot be directly employed for the evaluation of PI forecasting systems. Instead, efforts have been made to quantify the skill of these forecasts using more relevant measures. Specific measures for evaluating the reliability, sharpness and resolution aspects of PI forecasts have been proposed in Bremnes (2006), Pinson et al. (2006), and Pinson and Kariniotakis (2010). They reflect individual quality aspects of the forecast, but do not provide a single score for a conclusive verification.

The major expectation of a set of PI forecasts is that their empirical coverage of the observations in a test setting is as close as possible to their required confidence level. This primary property of a PI forecasting system M is called reliability (Pinson and Kariniotakis, 2010), denoted $R^M$:

$$R^M = \frac{1}{T}\sum_{i=1}^{T} c_i - (1 - \alpha) \tag{27}$$
$$c_i = \begin{cases} 1 & L_i \le y_i \le U_i \\ 0 & \text{otherwise} \end{cases} \tag{28}$$

where T is the number of PIs in the test data set used for the evaluation, $y_i$ is the observation, $[L_i, U_i]$ is the forecast PI, and $c_i$ is an indicator of a hit (1) or miss (0) of the observation with respect to the PI boundaries. Hence, $R^M$ simply accounts for the difference between the average hit rate of the forecasts over all test cases and the required nominal coverage $(1-\alpha)$ defined for the PI. For an ideal case, $R^M = 0$. It is assumed, without loss of generality, that all forecasts in the tests are provided with a constant confidence level.

A forecasting system providing PIs with less vagueness (i.e. narrower PIs) is clearly preferred. This leads to the second major aspect of PI forecast quality, called sharpness (Nielsen et al., 2006; Pinson and Kariniotakis, 2010):

$$\bar{w}^M = \frac{1}{T}\sum_{i=1}^{T} w_i \tag{29}$$

where $w_i = U_i - L_i$ is the width of the ith prediction interval. Note that the sharpness measure is negatively oriented, as forecasts with a lower average PI width are preferred.

Another important quality aspect of a PI computation method is its ability to provide intervals of variable width, depending on the forecast situation. A method with high resolution $\sigma_w^M$ is capable of distinguishing low-uncertainty from high-uncertainty forecasts. The standard deviation of PI widths is a natural choice to measure resolution (Pinson, 2006):

$$\sigma_w^M = \sqrt{\frac{1}{T}\sum_{i=1}^{T}\left(w_i - \bar{w}^M\right)^2} \tag{30}$$

It should be noted that this measure does not depend on the observations. Thus, it is not a significant measure on its own and can be hedged. However, when the first two major measures of reliability and sharpness are equal for two competing PI forecasting methods, the method with higher resolution may be preferred.
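The three basic measures above can be computed directly from a batch of PI forecasts and observations; a minimal sketch (the function name is ours):

```python
import numpy as np

def pi_verification(lower, upper, obs, alpha=0.05):
    """Basic PI verification measures: reliability, sharpness, resolution.

    lower, upper: PI boundaries L_i, U_i; obs: observations y_i.
    """
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    hits = (obs >= lower) & (obs <= upper)       # c_i hit indicator
    reliability = hits.mean() - (1.0 - alpha)    # deviation from nominal coverage
    widths = upper - lower                       # w_i
    sharpness = widths.mean()                    # average PI width (lower is better)
    resolution = widths.std()                    # variability of PI widths
    return reliability, sharpness, resolution
```

For constant-width intervals (as produced by a climatological baseline), the resolution term is exactly zero.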

4.2. Skill score of a prediction interval forecasting system

Having a single scalar summary measure of forecast quality is attractive and useful for objective comparison of various methods. Such a measure simplifies the evaluation of the complete performance profile of a forecasting system. The most common prediction interval skill score is Winkler's score, proposed in Winkler (1972). It is widely used as a concluding objective evaluation measure for PI forecasting methods (Bremnes, 2004; Nielsen et al., 2006; Pinson et al., 2006). A comprehensive study by Gneiting and Raftery (2007) proved that this score is 'strictly proper': it gives the maximum score to a forecast that is actually the true belief of the forecaster and thus cannot be 'hedged'. This indicates that a PI that follows the true distribution of the target obtains the maximum score.

Using the notations and assumptions defined here, this skill score can be expressed as:

$$SScore^M = \sum_{i=1}^{T} Sc_i \tag{31}$$
$$Sc_i = \left(\frac{\alpha}{2} - \xi_i^l\right)(L_i - y_i) + \left(1 - \frac{\alpha}{2} - \xi_i^u\right)(U_i - y_i) \tag{32}$$

To better understand the behaviour of this score, it can be algebraically simplified by considering the cases of hits and misses. For simplicity, $\xi_i^l$ and $\xi_i^u$ are used instead of $\mathbf{1}\{y_i < L_i\}$ and $\mathbf{1}\{y_i < U_i\}$. When a 'hit' occurs for the forecast PI of case i, $\xi_i^l = 0$ and $\xi_i^u = 1$. By substituting these values in Equation (32) and multiplying the terms we have:

$$Sc_i = -\frac{\alpha}{2}\, w_i \tag{33}$$

In the other case, when an observation is 'missed', it lies either to the right or to the left of the PI boundaries. In this case, the values of $(\xi_i^l, \xi_i^u)$ are equal to (0, 0) or (1, 1), respectively. When the missed observation is on the right side of the interval, it has a positive distance $\delta_i$ from the upper boundary $U_i$. The score of this particular case can be calculated using Equation (32):

$$Sc_i = -\left(\frac{\alpha}{2}\, w_i + \delta_i\right) \tag{34}$$

For a miss on the left side of the PI, the score has the same value given by Equation (34). With $r^M$ denoting the overall miss rate, the total score received by a PI forecasting method M over all T cases in the test set is:

$$SScore^M = -T\left(\frac{\alpha}{2}\,\bar{w}^M + r^M\,\bar{\delta}^M\right) = -T\left(\frac{\alpha}{2}\,\bar{w}^M + \bar{\Delta}^M\right) \tag{35}$$

where $\bar{\delta}^M$ is the average distance of an observation from the PI boundaries among the missed cases, and $\bar{\Delta}^M$ is the average of this distance over all test cases, owing to the fact that $\Delta_i$ is equal to zero for hits and to $\delta_i$ for misses.
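The hit/miss decomposition of Equations (33)-(35) can be implemented compactly; a sketch under the stated conventions (positively oriented score; the function name is ours):

```python
import numpy as np

def skill_score(lower, upper, obs, alpha=0.05):
    """Interval skill score summed over all T cases (positively oriented).

    Hit:  Sc_i = -(alpha/2) * w_i
    Miss: Sc_i = -((alpha/2) * w_i + delta_i),
          delta_i = distance of the observation from the violated boundary.
    """
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    w = upper - lower
    # Delta_i: zero for hits, delta_i for misses on either side
    delta = np.maximum(lower - obs, 0.0) + np.maximum(obs - upper, 0.0)
    return float(np.sum(-(alpha / 2.0) * w - delta))
```

For a hit inside [−5, 5] and a miss at 7 against the same interval (alpha = 0.05), the cases contribute −0.25 and −2.25 respectively.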

Figure S3 depicts the value of this score for different observation values, for a sample PI of [−5, +5] at 95% confidence. As can be seen, the score is the same for all cases where the observation falls inside the PI, and decreases linearly as the observation moves away from the PI boundaries. Multiplying this score by 2/α, which does not change the comparison among different methods, recovers the equivalent verification measures of Winkler's score and Gneiting's score. Hence, all these scores are essentially equivalent and simply use a weighted sum of the two major aspects of PI quality: sharpness and reliability. However, they measure the reliability aspect of the PI forecast by the average distance of observations to the PI boundaries, $\bar{\Delta}^M$.

4.3. Uncertainty of skill score measurements

The described PI skill score provides a measure for the evaluation of a group of PI forecasting systems, so that the best system for the application at hand can be chosen among many potential choices. However, a close look at the evaluation study conducted in this work reveals that the mere calculation of the SScore measure on a test data set may result in misleading evaluations. Since the number of real-world test cases is always limited, the 'measurement uncertainty' of SScore on the available test data set must also be accounted for. This issue becomes more important as the number of clusters increases, because fewer test cases are then available to measure SScore in each cluster.

To analyse the uncertainty of the skill score, its components are investigated more closely. The terms that depend on the method under verification are $\bar{w}^M$ and $\bar{\Delta}^M$. These terms are essentially weighted sums of their measured values in the different clusters:

$$\bar{w}^M = \sum_{j=1}^{K} \frac{|T_j|}{T}\,\bar{w}_j, \qquad \bar{\Delta}^M = \sum_{j=1}^{K} \frac{|T_j|}{T}\,\bar{\Delta}_j \tag{36}$$

where $T_j$ is the set of test cases assigned to cluster j. The measured SScore is denoted $\widehat{SScore}^M$, since it is a sample statistic from a single sample set only. $\bar{w}_j$ is calculated as the average of the PI widths among the test cases in cluster j. As the PIs in a single cluster are obtained from the same fitted error distribution, it follows that:

$$w_i = w_j \quad \forall i \in T_j \;\Rightarrow\; \bar{w}_j = w_j \tag{37}$$

Hence, the width term of the skill score is constant in each cluster, and there is no uncertainty when this statistic is measured on a sample data set in model test evaluations. However, the $\bar{\Delta}_j$ term is the mean of the random variable $\Delta_i$ measured on a sample of $|T_j|$ values, and it is thus subject to sampling variation.

With a limited number of test cases and high nominal coverage rates of PIs, it may happen that only a few test cases (e.g. $|T_1| = 400$) are assigned to cluster j = 1, and that fewer of them (e.g. 30) lead to non-zero values of $\Delta_i$. The measured value of $\bar{\Delta}_1$ for this cluster may be equal to $\bar{\Delta}_2$ of another cluster j = 2 with significantly more test cases (e.g. $|T_2| = 6000$). Although the two statistics are equal, the uncertainty of $\bar{\Delta}_2$ is much smaller than that of $\bar{\Delta}_1$, because for cluster two the measure has been evaluated on a much larger sample set.

To analyse the uncertainty of $\bar{\Delta}_j$, it must be considered not as a single estimate over the test cases, but as a one-sided confidence interval providing an upper bound on this measure with a specific confidence level. Using this upper limit for all clusters, a lower limit on $SScore^M$ with the desired confidence level can be determined:

$$P\left(\bar{\Delta}_j \le \hat{\Delta}_j^{\beta}\right) = \beta \tag{38}$$
$$SScore_{\beta}^{M} = -T\left(\frac{\alpha}{2}\,\bar{w}^M + \sum_{j=1}^{K}\frac{|T_j|}{T}\,\hat{\Delta}_j^{\beta}\right) \tag{39}$$

where β is the desired confidence level of the measure, expressed as a percentage. For example, with β = 0.95, $SScore_{0.95}^{M}$ is a lower bound which the true skill score of method M is expected to meet or exceed with 95% confidence.

To find the confidence interval for $\bar{\Delta}_j$, its sampling distribution (the probability distribution describing the batch-to-batch variations of this statistic) has to be considered. Bootstrap resampling builds a collection of artificial data batches of the same size as the original sample set by drawing with replacement; the statistic computed over these batches effectively provides an estimate of the sampling distribution (Wilks, 2006). For the purpose of this study, 2000 bootstrap samples were constructed for each cluster and the $\bar{\Delta}_j$ measure was calculated for each sample set. The distribution defined over these measurements is then used to compute the desired quantile for the confidence level β. Intuitively, there is less uncertainty associated with $\hat{\Delta}_j^{\beta}$ as an increasing number of test cases in cluster j are used in the bootstrapping process. Using the upper limit $\hat{\Delta}_j^{\beta}$ in Equation (36) leads to the lower limit of the final skill score, $SScore_{\beta}^{M}$. This measure, which accounts for test-sample uncertainty, is preferred for fair verification of PI forecasting systems.
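The percentile-bootstrap bound for one cluster can be sketched as follows (a minimal illustration; the function name, seed and batch count default are ours, with 2000 batches matching the text):

```python
import numpy as np

def bootstrap_upper_bound(deltas, beta=0.95, n_boot=2000, seed=0):
    """Percentile-bootstrap upper confidence bound on the mean of Delta_i.

    deltas: per-case distances outside the PI (zero for hits) in one cluster.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(deltas, dtype=float)
    # resample with replacement; each artificial batch has the original sample size
    batches = rng.choice(d, size=(n_boot, d.size), replace=True)
    means = batches.mean(axis=1)            # empirical sampling distribution of the mean
    return float(np.quantile(means, beta))  # beta-quantile = one-sided upper bound
```

With fewer cases in the cluster, the bootstrap means spread more widely and the upper bound moves further above the point estimate, which is exactly the effect the text describes.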

5. Experimental study

5.1. Data sets and experimental set-up

The applicability and performance of the described uncertainty analysis models for obtaining PI forecasts were experimentally evaluated using the WRF NWP model (WRF Website). Two different hindcast data sets of hourly predictions were augmented with observations from the National Center for Atmospheric Research (NCAR) data repository. After deriving forecast errors, the two data sets were treated as repositories of the NWP model's historical performance. The WRF v3 simulations were run on three nested grids with resolutions of 10.8, 3.6 and 1.2 km. The outermost domain covered an area of about 15 595 km2 on a 38 × 38 grid. The grid point nearest to each observation station was assigned as the point of the associated forecast.

The first data set contains the forecasts and observations at 60 different weather stations in the province of British Columbia, Canada, for the summer of 2008. This data set (referred to as BC) contains about 13 000 records of forecast history. For this data set, 10 major weather, location and time attributes were used as the influential variables: predicted wind speed, wind direction, temperature, surface pressure, mixing ratio, grid precipitation, altitude, latitude, longitude, and hour-of-day. The second data set covers two stations close to Agassiz, BC, over the 3 years 2007-2009. This data set (referred to as AG) contains about 51 000 records of historical performance of NWP forecasts, with a total of 35 features available, as listed in Table S1. For both data sets, the described computation methods were applied to obtain PIs over the forecast temperature.

For the BC data set, five different subsets of the available features are defined to investigate the role of influential variables and to select the optimal set for PI forecasts. These feature sets are described in Table S2. Note that the features starting with 'pg' are newly derived features representing the tendency of surface pressure between the current forecast and the forecasts made 1, 3, 6 and 12 h earlier for the same location (pg 1, pg 3, pg 6 and pg 12). These features are expected to provide valuable information about the temporal stability of the forecast weather, further aiding uncertainty analysis and modelling. Table S3 lists 14 different feature sets used for the AG data set. Features in both data sets were also normalized.

The large number of features in some feature sets (e.g. BF2PG) can have a negative impact on the quality of the clustering algorithm. To reduce their dimensionality, only the most important components were used. The number after the letters 'PC' in these feature sets indicates how many of the most significant components, determined using Principal Component Analysis (PCA), were retained.
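A PCA projection of this kind can be sketched with an SVD of the centred feature matrix (an illustration only; the original study does not specify its PCA implementation):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project a (normalized) feature matrix X onto its leading principal components."""
    Xc = X - X.mean(axis=0)                          # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal axes
    return Xc @ Vt[:n_components].T                  # scores on the leading components
```

The retained columns are ordered by explained variance, so e.g. 'PC4' would keep the first four columns of the projection.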

For the evaluation of the various PI forecasting methods, the AG data set was split into three folds based on the temporal sequence of the records. That is, in each fold run, 2 years of data were used for training the PI model and the third year for testing it. The forecast PIs are then verified using the measures described in Section 4, and their average value over all folds is reported. A monthly split of the data records in the BC data set similarly yields a three-fold cross validation. In another evaluation approach, the two data sets were split randomly into five folds to perform five-fold cross validation.
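The temporal fold construction for the AG data set can be sketched as follows (the function name is ours):

```python
import numpy as np

def yearly_folds(years):
    """Temporal three-fold split: train on two years, test on the remaining one."""
    years = np.asarray(years)
    for test_year in np.unique(years):
        yield (np.where(years != test_year)[0],   # training indices (other years)
               np.where(years == test_year)[0])   # test indices (held-out year)
```

Unlike a random split, this keeps whole years together, so the test records are never interleaved with the training period.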

The focus of this study is on 95%-confidence level PIs (i.e. α = 0.05) for the target. Given the alternative choices available at each step of the PI training phase, many different methods are defined by combining the following options:

  1. Feature set: as listed in Tables S2 and S3.
  2. Clustering algorithm: K-means (five random starts), CLARA (300 + 4K random samples), HClust (Euclidean distance and Ward's agglomeration method) and FCM (m = 1.2 for BC and 1.1 for AG).
  3. Number of clusters (K): from 2 to 200 with increasing intervals.
  4. Fitting method: Gaussian, Weibull, empirical and Kernel density smoothing.

To compare the various proposed methods with a baseline, several simple approaches are considered (titles starting with 'Base-'). The first baseline method is the climatological approach, which considers all past error samples together (i.e. K = 1) and computes the PI from these samples. Note that any of the fitting methods can be used for the distribution fitting phase. The other baseline methods considered in this study follow a manual categorization of past forecast records based on an attribute: hour-of-day (K = 24), month (K = 12), and predicted temperature and wind speed (K = 10 with equal-bin discretization). For an initial look at the forecast error, Figure S4 shows the standard deviation of the target error plotted for different months and years in the AG data set. There is clearly a regular pattern of forecast uncertainty in the records that can be exploited when computing conditional PIs.
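Using the empirical fitting method, the climatological baseline amounts to applying the error quantiles of the whole history to each new point forecast; a minimal sketch (the function name is ours):

```python
import numpy as np

def climatological_pi(all_errors, point_forecast, alpha=0.05):
    """'Base-' climatological PI: empirical quantiles of all past errors (K = 1)."""
    lo = np.quantile(all_errors, alpha / 2.0)        # lower error quantile
    hi = np.quantile(all_errors, 1.0 - alpha / 2.0)  # upper error quantile
    return point_forecast + lo, point_forecast + hi
```

Because the same two quantiles are reused for every forecast, this baseline produces constant-width PIs with zero resolution.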

5.2. Crisp clustering PIs

The methods that achieved the five best values of SScore for temperature PIs among all considered PI forecasting methods are listed in Table S4. Note that the −T factor has been omitted from the SScore formula (Equation (35)) to make the measure negatively oriented and independent of the number of test samples across experiments. The best methods have the highest numbers of clusters. This result is counter-intuitive: with a very large number of clusters, very few samples are available in each cluster to effectively learn the uncertainty model of the weather situation represented by the cluster. The same issue is present for the best methods applied to the BC data set.

Figure S5 provides a closer look at the role of K in the SScore evaluations. It shows the SScore trend for the best temperature PI setup using each of the clustering algorithms for the BC and AG data sets. As mentioned earlier, the continuously improving trend contradicts the statistical nature of the training process.

The reason for this optimistic evaluation of SScore for methods with increasing numbers of clusters is that the measured SScore is a sample statistic over the available test samples. As argued in Section 4.3, the uncertainty bounds for this measure lie further away from this estimate when the number of samples is small. Using the 95%-confidence level bound on the SScore (Equation (39)) ensures that the evaluations are not skewed by a small number of samples. Figure S6 shows that sharpness and average delta (the components of the skill score) have improving trends as the number of clusters increases for the top method in AG; however, the 95% bound on the average delta measure does not follow this trend. The figure also depicts the observed nominal coverage, its 95% lower bound (using a binomial test), and the resolution measures. Figure 3 shows the trend of SScore0.95 for the best PI methods. As can be seen, this score accounts for the uncertainty of the SScore estimation and shows an increase in the SScore (as a loss value) for large numbers of clusters.

Figure 3.

SScore0.95 trend of best temperature PI methods over increasing number of clusters – (a) BC and (b) AG.

Consequently, SScore0.95 (called SScore unless mentioned otherwise) is used to rank the various methods. The results are reported in Tables 1 and 2, together with the best ranks achieved by the various baseline methods. In the BC data set, the K-means clustering algorithm with six clusters provides PI forecasts with an average width of 13.42 °C, while the baseline method provides PIs for the same forecasts with an average width of 14.23 °C. A paired t-test over the skill scores of these two methods confirms that their means are statistically significantly different (p-value = 0.014 < 0.05). The difference between the skill scores of the best clustering and the best baseline method is also statistically significant (p-value = 0.0001). In addition, the K-means PIs have a standard deviation of 1.20° in width, while this value is 0 for the climatological baseline approach, which provides constant-width PIs.

Table 1. PI verification measures for top methods of temperature PI in BC data set based on five-fold cross validation
Algorithm | K | Fit | Features | Sharpness | Coverage | Coverage0.95 | Resolution | RMSE | SScore | SScore rank | SScore0.95 | SScore0.95 rank

For the AG data set, the best K-means setup, with the kernel fitting method over the BF2 feature set, achieved an SScore of less than 0.3485 with 95% confidence. This value is 0.3774 for the climatological baseline and 0.3704 for the best baseline, the month-based grouping method. This improvement of the skill score (p-value < 0.005) is achieved because the PIs of the K-means setup have smaller width (vagueness) and higher coverage of observations (reliability).

The better scores for methods using larger numbers of clusters in the AG data set (compared to the BC data set) are likely a result of the availability of more data samples and features in both the training and test phases. This increases the number of learning samples in the training phase and decreases the uncertainty of the SScore evaluations in the test phase. The Coverage0.95 column provides the 95%-confidence level lower bound for the measured nominal coverage. This estimate is the weighted average (by the number of test samples) of the binomial-test lower bound of the nominal coverage in individual clusters. There is a notable difference between the sample measure of nominal coverage and its 95% confidence level lower bound, due to the rather few test samples available in each cluster (similar to the top right of Figure S6).
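A per-cluster lower bound on coverage can be approximated, for example, with a one-sided Wilson interval; this is a normal-approximation stand-in for the exact binomial bound, and the function name is ours:

```python
from math import sqrt
from statistics import NormalDist

def coverage_lower_bound(hits, n, beta=0.95):
    """One-sided Wilson lower bound on the coverage probability from hits out of n."""
    z = NormalDist().inv_cdf(beta)
    p = hits / n                                   # observed coverage
    centre = p + z * z / (2.0 * n)
    margin = z * sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return (centre - margin) / (1.0 + z * z / n)
```

For small n the bound falls well below the observed coverage, reproducing the gap between Coverage and Coverage0.95 discussed above.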

The tables also list the RMSE of the forecasts. This performance measure, important for point forecasts, is calculated using the mean of a given PI. The notable improvement of this measure achieved by the proposed methods is due to the dynamic calibration of forecast bias in the forecast groups discovered by the clustering algorithms: the forecast bias is estimated from the accuracy records dynamically, depending on the forecast situation.
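This PI-midpoint RMSE can be computed as follows (a sketch; the function name is ours, and the midpoint is used as the mean of a symmetric PI):

```python
import numpy as np

def rmse_from_pi(lower, upper, obs):
    """RMSE of the PI midpoint treated as a bias-corrected point forecast."""
    mid = (np.asarray(lower, float) + np.asarray(upper, float)) / 2.0
    err = mid - np.asarray(obs, float)
    return float(np.sqrt(np.mean(err ** 2)))
```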

For a general comparison of the various elements comprising a PI forecasting method, the performance measures of the three clustering algorithms are aggregated and averaged over all possible combinations (numbers of clusters, feature sets and fitting methods). Figure 4 shows a box plot of the skill score of the clustering algorithms for the BC and AG data sets. K-means clustering-based PI forecasts have superior performance for both data sets. The lower skill of HClust PIs in the AG data set is likely due to the use of a randomly selected subset of half the size of the original data set, used to overcome the low scalability of the algorithm.

Figure 4.

SScore0.95 of the three clustering algorithms for temperature PIs in (a) the BC data set (five-fold cross validation) and (b) the AG data set (yearly cross validation); (c) SScore0.95 of the five different feature sets for temperature PIs in the BC data set (five-fold cross validation).

Figure 4(c) shows the statistics of SScore0.95 for the five feature sets in the BC data set (averaged over all possible combinations). Feature sets BF1 and BF1PGPC4 clearly yield PIs of the highest quality. The comparison of fitting methods in Figure 5(b) reveals that the kernel density smoothing method reaches the best scores. Very similar results were obtained from the five-fold cross validation evaluations of the AG data set. The role of the AG feature sets is illustrated in Figure 5(a): the pressure-level features (included in BF2) are relevant and useful for temperature error modelling and PI computation.

Figure 5.

SScore0.95 (yearly cross validation) in AG data of (a) 14 different feature sets for temperature and (b) the 4 different fitting methods.

The best temperature PI method for the BC data set uses K-means clustering with K = 6 and kernel density smoothing. Table S5 provides a detailed description of the PIs from the six clusters for a sample fold of test results. As expected, the method provides PIs of variable width. The third column of the table shows the average distance of a missed case from the edge of the forecast PI. For cluster 1, where fewer test cases are available, the difference between the measured coverage and SScore and their respective 95% confidence level boundaries is greater than for the other clusters. The Kolmogorov-Smirnov goodness-of-fit test results (see Appendix S1) in the last two columns also suggest that the hypothesis that the trained error model and the observed test errors follow the same distribution is not rejected at the 10% level for five out of six clusters.

The best temperature PI forecasting method for the AG data set uses K-means clustering with K = 50 and kernel fitting. Figure 6(a) illustrates the variability of the forecast PI widths. The figure also shows the distribution of the PIs that missed the observed value, which can be interpreted as conditional reliability. The forecast PIs are clearly dynamic and narrower than those of the climatological baseline method, which yields static PIs with a width of 12.17.

Figure 6.

(a) Histogram of forecast temperature PI widths (total counts and miss cases) for the best-performing method in AG. (b) Sample temporal trends of upper and lower boundaries of prediction intervals for temperature error and the actual observations (black).

A 100 h example of actual temperature PI forecasts using the best-performing method on the AG data set is depicted in Figure 6(b). The horizontal lines represent the PIs of the climatological baseline method. This figure clearly shows that the estimated forecast uncertainty changes for different predictions. In this case, the PI width varies between 8 and 15, yielding relatively high sharpness (i.e. a relatively small average PI width). Figure 7 shows a fan chart of 11 different confidence level PIs (i.e. 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 0.95) obtained using the best-performing method, along with the observed values of temperature. This figure further demonstrates the dynamic nature of forecast uncertainty estimates based on forecast situations.

Figure 7.

Examples of 11 different confidence level prediction intervals for temperature forecasts in 2009.

5.3. Fuzzy C-means clustering PIs

Using the FCM algorithm, the forecast situations are defined as fuzzy sets over the training data. As a consequence, forecast cases are not associated with only one cluster, but have different levels of membership in all clusters. The three best-performing fuzzy and non-fuzzy setups are listed in Table 1. FCM was applied to the BC data set using the best-performing feature sets and fitting method for this data set, based on the non-fuzzy model evaluations described in the previous subsection (i.e. BF1, BF1PGPC4 and kernel fitting). The results show a modest improvement of the PI verification score using the fuzzy approach (0.4194 vs 0.4245).

A similar skill improvement was observed for the AG data set using the BF2 and BF2PG feature sets and the kernel density smoothing method (Table 2). The PIs obtained by the FCM algorithm (with K = 45) have the best skill score and the lowest RMSE when considered as point forecasts. The FCM-based PIs have somewhat smaller resolution values. This is expected, as in these models the error characteristics of every forecast case are affected by all discovered situations, although with different intensities. Overall, the presented empirical evaluation confirms that the proposed clustering-based methods improve the skill of estimated PIs compared to the baseline methods.

Table 2. PI verification measures for top methods of temperature PI in AG data set based on three-fold (yearly) cross validation
Algorithm | K | Fit | Features | Sharpness | Coverage | Coverage0.95 | Resolution | RMSE | SScore | SScore rank | SScore0.95 | SScore0.95 rank

6. Conclusions

Forecast uncertainty plays an important role in many practical applications of meteorology. In this study, the historical performance of the WRF NWP model is used as a source of information for uncertainty modelling. The proposed approach allows dynamic analysis of uncertainty based on context, i.e. a predicted weather situation. Contexts of weather forecasts are established by automatically discovered clusters, which are then used to derive conditional PIs through statistical analysis. The effectiveness of the proposed approach has been empirically evaluated using two data sets of weather hindcasts and associated observations.

Several feature sets were used to group weather situations using four different clustering algorithms (K-means, CLARA, HClust and Fuzzy C-means). To assess the proposed PI computation methods, a comprehensive evaluation framework based on a proper skill score metric was created. The assessment results confirmed the applicability of the proposed PI computation methods and showed that the resulting PIs have high sharpness and skill.

Comparisons to various baseline methods confirm an average 8% improvement in PI forecast skill when using the proposed dynamic methods based on Fuzzy C-means clustering. By their nature, the proposed methods also intrinsically remove bias, decreasing the RMSE of point forecasts by up to 10%. The proposed PI modelling methods can be used in real-world applications to enhance the point forecasts of NWP systems with information on prediction uncertainty.

Future work will develop techniques to guide the clustering process (merging and splitting of clusters) using characteristics of error distribution in the clusters. Additional information to improve the skill of PI predictions can be obtained using time series analysis techniques capturing the temporal nature of weather attribute forecasts and their associated errors. Developed clustering methods will also be compared with methods based on quantile regression.