A refined index of model performance: a rejoinder



Willmott et al. [Willmott CJ, Robeson SM, Matsuura K. 2012. A refined index of model performance. International Journal of Climatology, forthcoming. DOI:10.1002/joc.2419.] recently suggested a refined index of model performance (dr) that they purport to be superior to other methods. Their refined index ranges from −1.0 to 1.0 to resemble a correlation coefficient, but it is merely a linear rescaling of our modified coefficient of efficiency (E1) over the positive portion of the domain of dr. We disagree with Willmott et al. (2012) that dr provides a better interpretation; rather, E1 is more easily interpreted: a value of E1 = 1.0 indicates a perfect model (no errors), while E1 = 0.0 indicates a model that is no better than the baseline comparison (usually the observed mean). Negative values of E1 (and, for that matter, dr < 0.5) indicate a substantially flawed model, as they simply describe a ‘level of inefficacy’ for a model that is worse than the comparison baseline. Moreover, while dr is piecewise continuous, it is not continuous through the second and higher derivatives. We explain why the coefficient of efficiency (E or E2) and its modified form (E1) are superior and preferable to many other statistics, including dr, because of their intuitive interpretability and because these indices have a fundamental meaning at zero.

We also expand on the discussion begun by Garrick et al. [Garrick M, Cunnane C, Nash JE. 1978. A criterion of efficiency for rainfall-runoff models. Journal of Hydrology 36: 375-381.] and continued by Legates and McCabe [Legates DR, McCabe GJ. 1999. Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resources Research 35(1): 233-241.] and Schaefli and Gupta [Schaefli B, Gupta HV. 2007. Do Nash values have value? Hydrological Processes 21: 2075-2080. DOI: 10.1002/hyp.6825.]. This important discussion focuses on the appropriate baseline comparison to use, and why the observed mean often may be an inadequate choice for model evaluation and development. Copyright © 2012 Royal Meteorological Society

1. Introduction

A recent paper (Willmott et al., 2012) examines various statistics that have been proposed and used in a wide variety of environmental fields to provide model evaluation. Model evaluation techniques are useful in model development, model verification, and model calibration, as well as in conveying model performance to others (Pappenberger and Beven, 2006; Schaefli and Gupta, 2007). In particular, Schaefli and Gupta (2007) note that while the Nash–Sutcliffe coefficient of efficiency (Nash and Sutcliffe, 1970) is widely used and well known by hydrologists, for example, it is often foreign to researchers in other fields of environmental science. Moreover, values of the various statistics of model efficacy cannot be cross-compared, and even those who often report such measures may not know what an index of model performance of 0.85 really means, for example.

Thus, Willmott et al.'s (2012) paper is vitally important in that it demonstrates that a number of the model evaluation statistics available to researchers show little relationship to one another. This underscores the need to understand more about each of these statistics and examine their behaviour and interpretation. However, the stated main purpose of Willmott et al. (2012) is the development of a ‘refined’ version of the dimensionless index of agreement (dr) that resembles correlation in that it is bounded to the interval (−1, 1]. Although they note that dr is linearly equal to our modified coefficient of efficiency (E1—Legates and McCabe, 1999) over the positive portion of its domain, they contend that dr represents an improvement over E1. We respectfully disagree and argue that E1 is superior owing to its more intuitive interpretation and other desirable characteristics.

2. The generic form of the coefficient of efficiency

Legates and McCabe (1999) wrote the generic form of the coefficient of efficiency, Ej, as

$$E_j = 1 - \frac{\sum_{i=1}^{N} \left| O_i - P_i \right|^j}{\sum_{i=1}^{N} \left| O_i - \bar{O} \right|^j} \qquad (1)$$

where the observed (Oi) and model predicted (Pi) series have N finite pairs for evaluation. In their original formulation, Nash and Sutcliffe (1970) used j = 2 (thus, E ≡ E2), although Legates and McCabe suggested that j = 1 is a better scaling, because absolute values are preferable to squared terms in that they do not give undue weight to outliers (see Willmott et al., 1985; Willmott and Matsuura, 2005). The various coefficients of efficiency, Ej, are bounded by the range (−∞, 1.0], with a value of 1.0 indicating a perfect model (i.e. all Oi = Pi); the statistic decreases as the model predicted and observed series diverge.

Of interest here is the interpretation of Ej. Although Ej has no lower bound, a value of Ej = 0.0 has a fundamental meaning. It implies that such a model has no more ability to predict the observed values than does the observed mean (or the baseline values, see Section 4). In essence, since the model can explain no more of the variation in the observed values than can the observed mean, such a model can have no predictive advantage. For negative values of Ej, the model is less efficacious than the observed mean in predicting the variation in the observations. Negative values of Ej thus represent a measure of the ‘level of inefficacy’ (or, if you will, a ‘level of uselessness’) of a model: a negative value of Ej (or a negative ‘Nash–Sutcliffe value’, as they are called in hydrology) indicates that the model has failed to explain more of the variability in the observations than their mean.
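The behaviour described above can be sketched directly from Equation (1). The following is a minimal illustration, assuming plain Python lists of paired observations and predictions (the function name and example values are ours, not from the original papers):

```python
def coefficient_of_efficiency(obs, pred, j=1):
    """Generic E_j of Eq. (1): 1 - sum|O-P|^j / sum|O-Obar|^j.

    j = 2 gives the Nash-Sutcliffe E (i.e. E2); j = 1 gives the
    modified coefficient E1 of Legates and McCabe (1999).
    """
    obar = sum(obs) / len(obs)
    num = sum(abs(o - p) ** j for o, p in zip(obs, pred))
    den = sum(abs(o - obar) ** j for o in obs)
    return 1.0 - num / den

obs = [2.0, 4.0, 6.0, 8.0]          # observed mean = 5.0
print(coefficient_of_efficiency(obs, obs, j=2))        # perfect model -> 1.0
print(coefficient_of_efficiency(obs, [5.0] * 4, j=2))  # mean-only model -> 0.0
```

A model that simply predicts the observed mean everywhere scores exactly 0.0, which is the fundamental meaning of Ej = 0 discussed above; anything negative is worse than that baseline.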

Consider now the interpretation of E2 (or E). The original formulation by Nash and Sutcliffe (1970) provides a direct comparison with the coefficient of determination, R2, or the square of Pearson's product moment correlation coefficient. In simple linear regression, it is defined for the dependent variable, Yi, as

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( Y_i - \hat{Y}_i \right)^2}{\sum_{i=1}^{N} \left( Y_i - \bar{Y} \right)^2} \qquad (2)$$

which is analogous to E2 if Yi = Oi, Ȳ = Ō, and Ŷi = Pi. Thus, E2 represents the percentage of variance in the observations that is explained by the model predicted values. For example, a value of E2 = 0.75 implies that the model can explain three-quarters of the variance in the observed values. Like R2, a value of E2 = 1.0 implies a perfect model, whereas a value of E2 = 0.0 implies a model that cannot explain any variance in the observations. Again, negative values of E2 indicate that the model is worse than the observed mean (or baseline values) in predicting the observations.

Although E1 (or Ej for any other j ≠ 2) does not have this analogous property with R2, we nevertheless believe that it is preferable to E2 because E1 places less weight on outliers. However, E1 does provide a similar degree of interpretation, albeit with respect to absolute differences rather than the variance (i.e. squared differences). A value of E1 = 0.75, for example, implies that the model is able to explain three-quarters of the summed absolute differences between the observations and their mean. Stated another way, the sum of the absolute differences between the observations and the model predictions is only one-quarter of the sum of the absolute differences between the observations and their mean (or baseline value). Such a model can explain 75% of the absolute difference between the observations and their mean.
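As a worked check of this reading (a sketch with made-up numbers of our own choosing), a series whose summed model error is exactly one-quarter of the summed absolute deviation from the observed mean yields E1 = 0.75:

```python
# Hypothetical series chosen so that sum|O - P| is one quarter of
# sum|O - Obar|, illustrating the E1 = 0.75 interpretation in the text.
obs  = [1.0, 3.0, 5.0, 7.0]   # mean = 4.0; sum|O - Obar| = 3+1+1+3 = 8.0
pred = [1.5, 3.5, 4.5, 6.5]   # sum|O - P| = 0.5 * 4 = 2.0

obar = sum(obs) / len(obs)
num = sum(abs(o - p) for o, p in zip(obs, pred))
den = sum(abs(o - obar) for o in obs)
e1 = 1.0 - num / den
print(e1)   # -> 0.75
```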

This demonstrates that the scaling of the Nash–Sutcliffe coefficient of efficiency (E or E2) and its modified form (E1) is both robust and easily interpretable. We posit that this is a necessary quality for any metric of model evaluation. This is why the coefficient of efficiency, in all its forms (i.e. Ej), is preferable to many other statistics available.

3. Limitations of the refined index of agreement

Willmott et al. (2012) define their refined index of agreement, dr, as

$$d_r = \begin{cases} 1 - \dfrac{\sum_{i=1}^{N} \left| P_i - O_i \right|}{c \sum_{i=1}^{N} \left| O_i - \bar{O} \right|}, & \text{when } \sum_{i=1}^{N} \left| P_i - O_i \right| \le c \sum_{i=1}^{N} \left| O_i - \bar{O} \right| \\[2ex] \dfrac{c \sum_{i=1}^{N} \left| O_i - \bar{O} \right|}{\sum_{i=1}^{N} \left| P_i - O_i \right|} - 1, & \text{when } \sum_{i=1}^{N} \left| P_i - O_i \right| > c \sum_{i=1}^{N} \left| O_i - \bar{O} \right| \end{cases} \qquad (3)$$

where c = 2.0. They note that for c = 1.0, dr is identically equal to the modified coefficient of efficiency, E1, for the non-negative portion of its domain. Their refinement on E1, therefore, is to rescale E1 using c and remap the negative portion of dr so it more resembles a correlation coefficient, that is dr varies over the domain (−1.0, 1.0]. We argue that the choice of c ≠ 1.0 destroys the interpretability of E1, particularly at E1 = 0 (and hence dr = 0) and the remapping for negative values is merely cosmetic.

For c = 2.0, dr attains a value of 0.0 when the sum of the absolute differences between the model predicted values and the observations equals twice the sum of the absolute differences between the observations and their mean. That is, when E1 = 0.0, dr = 0.5, and when dr = 0.0, E1 = −1.0. Thus, a model with no more predictive ability than the observed mean would achieve a value of dr = 0.5, and the value of dr = 0.0 is arbitrary (i.e. the model has less than c⁻¹ times the predictive ability of the observed mean). This makes the interpretation of dr difficult and generally means that even relatively poor models will exhibit a high value of dr, a criticism that also affects the original index of agreement (see Willmott et al., 1985).
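The E1-to-dr mapping can be verified numerically. Below is a minimal sketch of Equation (3) with c = 2 (the function name and data are ours), showing that a model with no more skill than the observed mean scores dr = 0.5 rather than 0.0:

```python
def d_r(obs, pred, c=2.0):
    """Refined index of agreement of Eq. (3)."""
    obar = sum(obs) / len(obs)
    a = sum(abs(p - o) for o, p in zip(obs, pred))  # summed model error
    b = sum(abs(o - obar) for o in obs)             # summed baseline error
    if a <= c * b:
        return 1.0 - a / (c * b)
    return (c * b) / a - 1.0

obs = [2.0, 4.0, 6.0, 8.0]
mean_model = [5.0] * 4          # predicts the observed mean everywhere
print(d_r(obs, mean_model))     # -> 0.5, although E1 = 0.0 for this model
print(d_r(obs, obs))            # -> 1.0 for a perfect model
```

The mean-only model, which E1 scores at its fundamentally meaningful zero, lands at dr = 0.5; the value dr = 0.0 corresponds instead to twice the baseline error.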

The remapping of the negative portion of dr, such that it rescales (−∞, 0) onto (−1, 0), is rather unnecessary. When dr = 0.5, the model exhibits exactly as much error (i.e. the sum of the absolute differences between the predicted and observed values) as would be obtained by using the observed mean as a predictor; when dr = 0.0, the model exhibits twice as much error. In this sense, any value of dr < 0.5 (as with a value of E1 < 0.0) marks the model as having an explanatory ability that is less than that of the observed mean. Thus, these values simply describe the ‘level of inefficacy’ of the model, and it is largely immaterial how that portion of the statistic is scaled.

Moreover, although it is piecewise continuous, dr exhibits a C2 discontinuity (discontinuous second and higher derivatives) at dr = 0.0. Let A = ∑|Pi − Oi| and B = ∑|Oi − Ō| for simplicity. For the positive values of dr (A ≤ cB),

$$\frac{\partial d_r}{\partial A} = -\frac{1}{cB}, \qquad \frac{\partial^2 d_r}{\partial A^2} = 0 \qquad (4)$$

while for the negative values of dr (A > cB),

$$\frac{\partial d_r}{\partial A} = -\frac{cB}{A^2}, \qquad \frac{\partial^2 d_r}{\partial A^2} = \frac{2cB}{A^3} \qquad (5)$$

Thus, in the limit as ∑|Pi − Oi| → c∑|Oi − Ō| (i.e. A → cB), the first derivatives agree (both equal −1/(cB)) but the second derivatives do not (0 versus 2/(c²B²)); dr is continuous only through the first derivative.

4. Baseline adjustments

Willmott et al. (2012) provide a rather cursory paragraph on the issue of adjusted baselines. However, much commentary has been provided on this topic, and, as we feel it is an important consideration in model evaluation, it is necessary to examine this topic in more detail.

In general, all statistics for model evaluation compare the relative efficacy of the model to the predictive ability of the observed mean. This comparison forms the basis of evaluating regression performance using the coefficient of determination (R2), and, in the absence of any alternative model, the observed mean is the best ‘strawman’ against which a model can be compared. However, it has long been recognized that the observed mean may not be the best choice of a foil. For example, Garrick et al. (1978, p. 376) argued that a comparison of a model to the observed mean was an ‘unnecessarily primitive’ choice. We demonstrated in Legates and McCabe (1999) that the evaluation of a model that predicts potential evapotranspiration in southern Louisiana or runoff in southwestern Colorado would lead to quite high values of any model evaluation statistic when compared against the observed mean. This is because any model that mimicked the seasonal cycle to a reasonable degree would significantly outperform the observed mean. As a result, Legates and McCabe (1999) proposed a further modification of our modified coefficient of efficiency, E′1, as

$$E'_1 = 1 - \frac{\sum_{i=1}^{N} \left| O_i - P_i \right|}{\sum_{i=1}^{N} \left| O_i - \bar{O}'_i \right|} \qquad (6)$$

where the new baseline is Ō′i. Instead of simply comparing the observed values to a single number, such as the observed mean, they can be compared against seasonally varying values, such as seasonal means, or against a prediction made from a function of other variables. The values of Ō′i also could be taken from a previous version of the model, so that the statistic would represent the enhanced efficacy of the current model release. In any event, a more appropriate baseline can easily be substituted for the observed mean, and researchers should be cognizant of the fact that the observed mean is not likely the most appropriate choice.
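The Louisiana/Colorado point above can be illustrated with Equation (6). In this sketch (all series are hypothetical values of our own invention), a model that merely tracks a strong seasonal cycle scores very well against the flat observed mean but poorly against a seasonally varying baseline:

```python
def e1_adjusted(obs, pred, baseline):
    """Baseline-adjusted E1' of Eq. (6); baseline may vary with i."""
    num = sum(abs(o - p) for o, p in zip(obs, pred))
    den = sum(abs(o - b) for o, b in zip(obs, baseline))
    return 1.0 - num / den

# Strongly seasonal observations and a model that roughly tracks them.
obs      = [1.0, 2.0, 9.0, 10.0, 9.5, 2.5]
pred     = [1.5, 2.5, 9.5, 9.0, 9.0, 2.0]
omean    = [sum(obs) / len(obs)] * len(obs)  # flat observed-mean baseline
seasonal = [1.5, 2.0, 9.5, 9.5, 9.5, 2.0]    # hypothetical seasonal means

print(e1_adjusted(obs, pred, omean))     # high score against the flat mean
print(e1_adjusted(obs, pred, seasonal))  # much lower against the seasonal baseline
```

Against the flat mean the model looks excellent, yet against the seasonal-mean baseline its score turns negative: the seasonal climatology alone out-predicts it, which is precisely why the baseline choice matters.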

Schaefli and Gupta (2007) highlight the importance of specifying an appropriate baseline comparison and argue that it should become a standard practice in hydrologic modelling. We agree and argue that climatologists too must strongly consider comparing their models against more appropriate baselines. In particular, Schaefli and Gupta (2007, p. 2079) conclude,

Every modelling study should explain and justify the choice of benchmark. Of course, the appropriate benchmark will necessarily be different for different types of case studies. However, for efficient communication, the benchmark should fulfill the basic requirement that every [scientist] can immediately understand its explanatory power for the given case study and, therefore, appreciate how much better the actual [model] is.

We wholly agree and argue that climatologists must consider model evaluation as an important component of their research and not simply as a statistic to be reported. Such benchmarks are not likely to be globally applicable, but, as argued by Schaefli and Gupta (2007), we concur that it is important for each scientist to carefully select a benchmark that is appropriate for their particular study.

5. Concluding remarks

We are thankful that Willmott et al. (2012) have allowed us to extend the discussion from the hydrological community to climatological research. We wish to applaud the effort Dr Willmott has made in the statistical evaluation of model performance over the years. Moreover, we hope that such discussions lead climatologists in the future to take a more proactive role in model evaluation and the statistics that describe model efficiency.

We believe that our modified coefficient of efficiency (E1) is an improvement over the Nash–Sutcliffe statistic (E2) and forms the most appropriate basis for model evaluation owing to its simplicity and ease of interpretation. The refined index of agreement, dr, posited by Willmott et al. (2012) exhibits several distinct flaws that limit its utility. However, we welcome the discussion regarding model evaluation and baseline comparisons that has begun in the hydrological sciences and hope it extends to the climatological community as well.