Willmott et al. [Willmott CJ, Robeson SM, Matsuura K. 2012. A refined index of model performance. International Journal of Climatology, forthcoming. DOI:10.1002/joc.2419.] recently suggested a refined index of model performance (d_{r}) that they purport to be superior to other methods. Their refined index ranges from −1.0 to 1.0 to resemble a correlation coefficient, but it is merely a linear rescaling of our modified coefficient of efficiency (E_{1}) over the positive portion of the domain of d_{r}. We disagree with Willmott et al. (2012) that d_{r} provides a better interpretation; rather, E_{1} is more easily interpreted, in that a value of E_{1} = 1.0 indicates a perfect model (no errors) while E_{1} = 0.0 indicates a model that is no better than the baseline comparison (usually the observed mean). Negative values of E_{1} (and, for that matter, d_{r} < 0.5) indicate a substantially flawed model, as they simply describe a ‘level of inefficacy’ for a model that is worse than the comparison baseline. Moreover, while d_{r} is piecewise continuous, it is not continuous through the second and higher derivatives. We explain why the coefficient of efficiency (E or E_{2}) and its modified form (E_{1}) are superior and preferable to many other statistics, including d_{r}, because of intuitive interpretability and because these indices have a fundamental meaning at zero.

1. Introduction

A recent paper (Willmott et al., 2012) examines various statistics that have been proposed and used in a wide variety of environmental fields to provide model evaluation. Model evaluation techniques are useful in model development, model verification, and model calibration, as well as in conveying model performance to others (Pappenberger and Beven, 2006; Schaefli and Gupta, 2007). In particular, Schaefli and Gupta (2007) note that while the Nash–Sutcliffe coefficient of efficiency (Nash and Sutcliffe, 1970) is widely used and well known by hydrologists, for example, it is often foreign to researchers in other fields of environmental science. Moreover, values of the various statistics of model efficacy cannot be cross-compared, and even those who often report such measures may not know what, for example, an index of model performance of 0.85 really means.

Thus, Willmott et al.'s (2012) paper is vitally important in that it demonstrates that a number of the statistics of model evaluation available to researchers show little relationship to one another. This underscores the need to understand more about each of these statistics and to examine their behaviour and interpretation. However, the stated main purpose of Willmott et al. (2012) is the development of a ‘refined’ version of the dimensionless index of agreement (d_{r}) that resembles correlation in that it is bounded to the interval (−1,1]. Although they note that d_{r} is linearly related to our modified coefficient of efficiency (E_{1}; Legates and McCabe, 1999) over the positive portion of its domain, they contend that d_{r} represents an improvement over E_{1}. We respectfully disagree and argue that E_{1} is superior owing to its more intuitive interpretation and other desirable characteristics.

2. The generic form of the coefficient of efficiency

Legates and McCabe (1999) wrote the generic form of the coefficient of efficiency, E_{j}, as

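$$E_{j} = 1 - \frac{\sum_{i=1}^{N} |O_{i} - P_{i}|^{j}}{\sum_{i=1}^{N} |O_{i} - \bar{O}|^{j}} \tag{1}$$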

where the observed (O_{i}) and model predicted (P_{i}) series have N finite pairs for evaluation. In their original formulation, Nash and Sutcliffe (1970) used j = 2 (thus, E≡E_{2}), although Legates and McCabe (1999) suggested that j = 1 was a better scaling, because absolute values are preferable to squared terms in that they do not give undue weight to outliers (see Willmott et al., 1985; Willmott and Matsuura, 2005). The various coefficients of efficiency, E_{j}, are bounded by the range (−∞, 1.0], with a value of 1.0 indicating a perfect model (i.e. all O_{i} = P_{i}), and the statistics decrease as the model predicted and observed series diverge.

Of interest here is the interpretation of E_{j}. Although E_{j} has no lower bound, a value of E_{j} = 0.0 has a fundamental meaning. It implies that such a model has no more ability to predict the observed values than does the observed mean (or the baseline values; see Section 4). In essence, since the model can explain no more of the variation in the observed values than can the observed mean, such a model can have no predictive advantage. For negative values of E_{j}, the model is less efficacious than the observed mean in predicting the variation in the observations. Negative values of E_{j} (or negative ‘Nash–Sutcliffe values’, as they are called in hydrology) therefore measure the ‘level of inefficacy’ of a model (or maybe even its ‘level of uselessness’): they indicate that the model has failed to explain more of the variability in the observations than their mean.
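
The interpretation of E_{j} at 1.0, at 0.0, and below zero can be checked numerically. The following is a minimal sketch, not code from Legates and McCabe (1999) or Nash and Sutcliffe (1970); the function name `coefficient_of_efficiency` and the four-value series are illustrative assumptions.

```python
def coefficient_of_efficiency(observed, predicted, j=1):
    """Return E_j = 1 - sum(|O_i - P_i|**j) / sum(|O_i - mean(O)|**j)."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_observed = sum(observed) / len(observed)
    model_error = sum(abs(o - p) ** j for o, p in zip(observed, predicted))
    baseline_error = sum(abs(o - mean_observed) ** j for o in observed)
    if baseline_error == 0:
        raise ValueError("observations are constant, so E_j is undefined")
    return 1.0 - model_error / baseline_error


# Hypothetical series chosen only for illustration (observed mean = 5.0).
observed = [2.0, 4.0, 6.0, 8.0]
perfect = list(observed)               # P_i = O_i for every i
mean_only = [5.0] * len(observed)      # predicts the observed mean everywhere
reversed_model = [8.0, 6.0, 4.0, 2.0]  # larger errors than the observed mean

for name, predicted in [("perfect", perfect),
                        ("mean only", mean_only),
                        ("reversed", reversed_model)]:
    e1 = coefficient_of_efficiency(observed, predicted, j=1)
    e2 = coefficient_of_efficiency(observed, predicted, j=2)
    print(f"{name}: E_1 = {e1:.2f}, E_2 = {e2:.2f}")
```

In this example the perfect model gives E_{1} = E_{2} = 1.0 and the mean-only model gives E_{1} = E_{2} = 0.0. The reversed model gives E_{1} = −1.0 but E_{2} = −3.0, which illustrates how squaring gives more weight to the larger errors.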

Consider now the interpretation of E_{2} (or E). The original formulation by Nash and Sutcliffe (1970) provides a direct comparison with the coefficient of determination, R^{2}, or the square of Pearson's product moment correlation coefficient. In simple linear regression, it is defined for the dependent variable, Y_{i}, as

$$R^{2} = 1 - \frac{\sum_{i=1}^{N} (Y_{i} - \hat{Y}_{i})^{2}}{\sum_{i=1}^{N} (Y_{i} - \bar{Y})^{2}} \tag{2}$$

which is analogous to E_{2} if Y_{i} = O_{i}, Ȳ = Ō, and Ŷ_{i} = P_{i}. Thus, E_{2} can be interpreted as the proportion of variance in the observations that is explained by the model predicted values. For example, a value of E_{2} = 0.75 implies that the model can explain three-quarters of the variance in the observed values. Like R^{2}, a value of E_{2} = 1.0 implies a perfect model, whereas a value of E_{2} = 0.0 implies a model that cannot explain any variance in the observations. Again, negative values of E_{2} indicate that the model is worse than the observed mean (or baseline values) in predicting the observations.
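
The analogy between E_{2} and R^{2} can be illustrated with a small numerical sketch. It assumes an ordinary least-squares line with an intercept, for which E_{2} of the fitted values equals the squared Pearson correlation. It then adds a constant bias to the predictions, which leaves the correlation unchanged but lowers E_{2}. All names and data values are hypothetical and are not taken from the papers cited above.

```python
import math


def nash_sutcliffe(observed, predicted):
    """E_2 = 1 - sum((O_i - P_i)**2) / sum((O_i - mean(O))**2)."""
    mean_observed = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - sse / sst


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def least_squares_fit(x, y):
    """Fitted values from simple linear regression of y on x (with intercept)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * a for a in x]


# Hypothetical predictor and observations, for illustration only.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
observations = [2.1, 3.9, 6.2, 7.8, 10.1]

fitted = least_squares_fit(predictor, observations)
shifted = [p + 3.0 for p in fitted]  # same pattern, constant bias of +3.0

e2_fitted = nash_sutcliffe(observations, fitted)
r_fitted = pearson_r(observations, fitted)
e2_shifted = nash_sutcliffe(observations, shifted)
r_shifted = pearson_r(observations, shifted)

print("least-squares fit:", e2_fitted, r_fitted ** 2)
print("biased prediction:", e2_shifted, r_shifted ** 2)
```

The equality holds only for the least-squares fitted values. For an arbitrary model, E_{2} also penalizes bias, which the correlation coefficient ignores.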

Although E_{1} (or other values of j ≠ 2 in E_{j}) does not have this analogous property with R^{2}, we nevertheless believe that it is preferable to E_{2} because of the decreased weight that E_{1} places on outliers. Moreover, E_{1} does provide a similar degree of interpretation, albeit with respect to absolute differences and not the variance (i.e. squared differences). A value of E_{1} = 0.75, for example, implies that the model is able to explain three-quarters of the absolute-valued differences between the observations and their mean (or baseline value). Stated another way, the sum of the absolute differences between the observations and the model predictions is only one-quarter of the sum of the absolute differences between the observations and their mean (or baseline value).

This demonstrates that the scaling of the Nash–Sutcliffe coefficient of efficiency (E or E_{2}) and its modified form (E_{1}) is both robust and easily interpretable. We posit that this is a necessary quality for any metric of model evaluation. This is why the coefficient of efficiency, in all its forms (i.e. E_{j}), is preferable to many of the other statistics available.

3. Limitations of the refined index of agreement

Willmott et al. (2012) define their refined index of agreement, d_{r}, as

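$$d_{r} = \begin{cases} 1 - \dfrac{\sum_{i=1}^{N} |P_{i} - O_{i}|}{c \sum_{i=1}^{N} |O_{i} - \bar{O}|}, & \text{when } \sum_{i=1}^{N} |P_{i} - O_{i}| \le c \sum_{i=1}^{N} |O_{i} - \bar{O}| \\[2ex] \dfrac{c \sum_{i=1}^{N} |O_{i} - \bar{O}|}{\sum_{i=1}^{N} |P_{i} - O_{i}|} - 1, & \text{when } \sum_{i=1}^{N} |P_{i} - O_{i}| > c \sum_{i=1}^{N} |O_{i} - \bar{O}| \end{cases} \tag{3}$$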

where c = 2.0. They note that for c = 1.0, d_{r} is identically equal to the modified coefficient of efficiency, E_{1}, for the non-negative portion of its domain. Their refinement of E_{1}, therefore, is to rescale E_{1} using c and remap the negative portion of d_{r} so that it more closely resembles a correlation coefficient; that is, d_{r} varies over the range (−1.0, 1.0]. We argue that the choice of c ≠ 1.0 destroys the interpretability of E_{1}, particularly at E_{1} = 0 (and hence at d_{r} = 0), and that the remapping for negative values is merely cosmetic.

For c = 2.0, d_{r} attains a value of 0.0 when the sum of the absolute differences between the model predicted values and the observations equals twice the sum of the absolute differences between the observations and their mean. That is, when E_{1} = 0.0, d_{r} = 0.5, and when d_{r} = 0.0, E_{1} = −1.0. Thus, a model with no more predictive ability than the observed mean would achieve a value of d_{r} = 0.5, and the value of d_{r} = 0.0 is arbitrary (i.e. the model has less than c^{−1} times the predictive ability of the observed mean). This makes the interpretation of d_{r} difficult and generally means that even relatively poor models will exhibit a high value of d_{r}, a criticism that also affects the original index of agreement (see Willmott et al., 1985).
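
The correspondence between the two statistics can be sketched numerically. For c = 2.0 and d_{r} ≥ 0, the definitions give d_{r} = (E_{1} + 1)/2, so E_{1} = 0.0 maps to d_{r} = 0.5 and E_{1} = −1.0 maps to d_{r} = 0.0. The code below is an illustrative implementation written from the definitions of E_{1} and d_{r}, not code from Willmott et al. (2012); the function names and data are assumptions.

```python
def modified_efficiency(observed, predicted):
    """E_1 = 1 - sum|O_i - P_i| / sum|O_i - mean(O)|."""
    mean_observed = sum(observed) / len(observed)
    a = sum(abs(p - o) for o, p in zip(observed, predicted))
    b = sum(abs(o - mean_observed) for o in observed)
    return 1.0 - a / b


def refined_index_of_agreement(observed, predicted, c=2.0):
    """d_r with the piecewise remapping of its negative portion."""
    mean_observed = sum(observed) / len(observed)
    a = sum(abs(p - o) for o, p in zip(observed, predicted))
    b = sum(abs(o - mean_observed) for o in observed)
    if b == 0:
        raise ValueError("observations are constant, so d_r is undefined")
    if a <= c * b:
        return 1.0 - a / (c * b)
    return c * b / a - 1.0


# Hypothetical series chosen only for illustration (observed mean = 5.0).
observed = [2.0, 4.0, 6.0, 8.0]
cases = {
    "mean only": [5.0, 5.0, 5.0, 5.0],   # same error as the observed mean
    "reversed": [8.0, 6.0, 4.0, 2.0],    # twice the error of the observed mean
    "very poor": [13.0, 9.0, 1.0, -3.0], # four times the error of the observed mean
}

for name, predicted in cases.items():
    e1 = modified_efficiency(observed, predicted)
    dr = refined_index_of_agreement(observed, predicted)
    print(f"{name}: E_1 = {e1:.2f}, d_r = {dr:.2f}")
```

With these values the mean-only model has E_{1} = 0.0 and d_{r} = 0.5, the reversed model has E_{1} = −1.0 and d_{r} = 0.0, and the very poor model has E_{1} = −3.0 and d_{r} = −0.5. The linear relation holds for the first two cases and no longer holds once the remapped branch applies.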

The remapping of the negative portion of d_{r} such that it scales from (−∞, 0) to (−1, 0) is rather unnecessary. When d_{r} = 0.0, the model exhibits twice as much error (i.e. the absolute difference between the predicted and observed values) as could be achieved by using the observed mean as a predictor. In this sense, any value of d_{r} < 0.5 (as with a value of E_{1} < 0.0) places the model as having an explanatory ability that is less than that of the observed mean. Thus, these values simply describe the ‘level of inefficacy’ of the model, and it is largely immaterial how that portion of the statistic is scaled.

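Moreover, although it is piecewise continuous, d_{r} exhibits a C^{2} discontinuity (discontinuous second and higher derivatives) at 0.0. Let A = ∑|P_{i} − O_{i}| and B = ∑|O_{i} − Ō| for simplicity. For the positive values of d_{r} (A ≤ cB),

$$d_{r} = 1 - \frac{A}{cB}, \qquad \frac{\partial d_{r}}{\partial A} = -\frac{1}{cB}, \qquad \frac{\partial^{2} d_{r}}{\partial A^{2}} = 0 \tag{4}$$

while for the negative values of d_{r} (A > cB),

$$d_{r} = \frac{cB}{A} - 1, \qquad \frac{\partial d_{r}}{\partial A} = -\frac{cB}{A^{2}}, \qquad \frac{\partial^{2} d_{r}}{\partial A^{2}} = \frac{2cB}{A^{3}} \tag{5}$$

Thus, in the limit as ∑|P_{i} − O_{i}| → c∑|O_{i} − Ō| (i.e. A → cB), the first derivatives agree (both equal −1/(cB)) but the second derivatives do not (0 versus 2/(cB)^{2}), so d_{r} is continuous only through the first derivative.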
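
The break in the second derivative can also be checked numerically. The sketch below writes A for ∑|P_{i} − O_{i}| and B for ∑|O_{i} − Ō|, treats d_{r} as a function of A with B and c held fixed (an assumption made here for illustration), and estimates one-sided first and second derivatives at A = cB by finite differences. The function names, the step size h, and the value of B are all hypothetical choices.

```python
def d_r_of_a(a, b, c=2.0):
    """d_r written as a function of A = sum|P_i - O_i| for fixed B = sum|O_i - mean(O)|."""
    if a <= c * b:
        return 1.0 - a / (c * b)
    return c * b / a - 1.0


def one_sided_derivatives(b, c=2.0, h=1e-3):
    """Finite-difference estimates of d(d_r)/dA and d2(d_r)/dA2 on each side of A = cB."""
    a0 = c * b

    def f(a):
        return d_r_of_a(a, b, c)

    left_first = (f(a0) - f(a0 - h)) / h
    right_first = (f(a0 + h) - f(a0)) / h
    left_second = (f(a0) - 2.0 * f(a0 - h) + f(a0 - 2.0 * h)) / h ** 2
    right_second = (f(a0 + 2.0 * h) - 2.0 * f(a0 + h) + f(a0)) / h ** 2
    return left_first, right_first, left_second, right_second


b_value = 8.0  # illustrative value of B
c_value = 2.0
left_1, right_1, left_2, right_2 = one_sided_derivatives(b_value, c_value)

print("first derivative (left, right):", left_1, right_1)
print("second derivative (left, right):", left_2, right_2)
print("analytic values:", -1.0 / (c_value * b_value), 2.0 / (c_value * b_value) ** 2)
```

Both one-sided first derivatives should be close to −1/(cB), whereas the second derivative should be close to zero on the left and close to 2/(cB)^{2} on the right, up to finite-difference error.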

4. Baseline adjustments

Willmott et al. (2012) provide a rather cursory paragraph on the issue of adjusted baselines. However, much commentary has been provided on this topic, and, as we feel it is an important consideration in model evaluation, it is necessary to examine this topic in more detail.

In general, all statistics for model evaluation compare the relative efficacy of the model to the predictive abilities of the observed mean. This comparison forms the basis of evaluating regression performance using the coefficient of determination (R^{2}) and, in the absence of any alternative model, the observed mean is the best ‘strawman’ against which a model can be compared. However, it has long been recognized that the observed mean may not be the best choice of a foil. For example, it was argued by Garrick et al. (1978, p. 376) that a comparison of a model to the observed mean was an ‘unnecessarily primitive’ choice. We demonstrated in Legates and McCabe (1999) that the evaluation of a model that predicts potential evapotranspiration in southern Louisiana or runoff in southwestern Colorado would lead to quite high values of any model evaluation statistic when compared against the observed mean. This is because any model that mimicked the seasonal cycle to a reasonable degree would significantly outperform the observed mean. As a result, Legates and McCabe (1999) proposed a further modification to our modified coefficient of efficiency, E_{1}′, as

$$E_{1}' = 1 - \frac{\sum_{i=1}^{N} |O_{i} - P_{i}|}{\sum_{i=1}^{N} |O_{i} - \bar{O}'_{i}|} \tag{6}$$

where the new baseline is Ō′_{i}. Instead of simply comparing the observed values to a single number, such as the observed mean, the observed values can be compared against seasonally varying values, such as seasonal means, or a prediction using a function of other variables. The value of Ō′_{i} also could be taken from the results of a previous version of the model so that the statistic could represent the enhanced efficacy of the current model release. In any event, the observed mean can easily be replaced by a more appropriate baseline, and researchers should be cognizant of the fact that the observed mean is not likely to be the most appropriate choice.
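
The effect of changing the baseline can be sketched with a small example. It assumes the baseline is the mean of all observations that share the same position in a seasonal cycle, which is only one of the choices mentioned above. The function names, the period of four, and the data are hypothetical and are not taken from Legates and McCabe (1999).

```python
def baseline_adjusted_e1(observed, predicted, baseline):
    """E_1' = 1 - sum|O_i - P_i| / sum|O_i - baseline_i|."""
    if not (len(observed) == len(predicted) == len(baseline)):
        raise ValueError("all three series must have the same length")
    model_error = sum(abs(o - p) for o, p in zip(observed, predicted))
    baseline_error = sum(abs(o - b) for o, b in zip(observed, baseline))
    if baseline_error == 0:
        raise ValueError("baseline reproduces the observations, so E_1' is undefined")
    return 1.0 - model_error / baseline_error


def seasonal_mean_baseline(observed, period):
    """Mean of the observations that share each position in the seasonal cycle."""
    means = []
    for k in range(period):
        same_season = observed[k::period]
        means.append(sum(same_season) / len(same_season))
    return [means[i % period] for i in range(len(observed))]


# Two hypothetical years of quarterly observations with a strong seasonal cycle.
observed = [1.0, 5.0, 9.0, 5.0, 3.0, 7.0, 11.0, 7.0]
predicted = [2.0, 5.0, 9.0, 6.0, 3.0, 6.0, 10.0, 7.0]

overall_mean = sum(observed) / len(observed)
mean_baseline = [overall_mean] * len(observed)
seasonal_baseline = seasonal_mean_baseline(observed, period=4)

e1_mean = baseline_adjusted_e1(observed, predicted, mean_baseline)
e1_seasonal = baseline_adjusted_e1(observed, predicted, seasonal_baseline)

print("E_1 against the observed mean:", e1_mean)
print("E_1' against seasonal means:", e1_seasonal)
```

For these values the statistic is about 0.8 against the observed mean but about 0.5 against the seasonal means, because much of the apparent skill comes from reproducing the seasonal cycle.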

Schaefli and Gupta (2007) highlight the importance of specifying an appropriate baseline comparison and argue that it should become a standard practice in hydrologic modelling. We agree and argue that climatologists too must strongly consider comparing their models against more appropriate baselines. In particular, Schaefli and Gupta (2007, p. 2079) conclude,

Every modelling study should explain and justify the choice of benchmark. Of course, the appropriate benchmark will necessarily be different for different types of case studies. However, for efficient communication, the benchmark should fulfill the basic requirement that every [scientist] can immediately understand its explanatory power for the given case study and, therefore, appreciate how much better the actual [model] is.

We wholly agree and argue that climatologists must consider model evaluation as an important component of their research and not simply as a statistic to be reported. Such benchmarks are not likely to be globally applicable, but, as argued by Schaefli and Gupta (2007), we concur that it is important for each scientist to carefully select a benchmark that is appropriate for their particular study.

5. Concluding remarks

We are thankful that Willmott et al. (2012) have allowed us to extend the discussion from the hydrological community to climatological research. We wish to applaud the effort Dr Willmott has made in the statistical evaluation of model performance over the years. Moreover, we hope that such discussions lead climatologists in the future to take a more proactive role in model evaluation and the statistics that describe model efficiency.

We believe that our modified coefficient of efficiency (E_{1}) is an improvement over the Nash–Sutcliffe statistic (E_{2}) and forms the most appropriate basis for model evaluation because of its simplicity and ease of interpretation. The refined index of agreement, d_{r}, posited by Willmott et al. (2012) exhibits several distinct flaws that limit its utility. However, we welcome the discussion regarding model evaluation and baseline comparisons that has begun in the hydrological sciences and hope it extends to the climatological community as well.