Omri Allouche, Department of Evolution, Systematics and Ecology, Institute of Life Sciences, The Hebrew University, Givat-Ram, Jerusalem 91904, Israel (fax 972 2 6584741; e-mail email@example.com).
1. In recent years the use of species distribution models by ecologists and conservation managers has increased considerably, along with an awareness of the need to provide accuracy assessment for predictions of such models. The kappa statistic is the most widely used measure for the performance of models generating presence–absence predictions, but several studies have criticized it for being inherently dependent on prevalence, and argued that this dependency introduces statistical artefacts to estimates of predictive accuracy. This criticism has been supported recently by computer simulations showing that kappa responds to the prevalence of the modelled species in a unimodal fashion.
2. In this paper we provide a theoretical explanation for the observed dependence of kappa on prevalence, and introduce into ecology an alternative measure of accuracy, the true skill statistic (TSS), which corrects for this dependence while still keeping all the advantages of kappa. We also compare the responses of kappa and TSS to prevalence using empirical data, by modelling distribution patterns of 128 species of woody plant in Israel.
3. The theoretical analysis shows that kappa responds in a unimodal fashion to variation in prevalence and that the level of prevalence that maximizes kappa depends on the ratio between sensitivity (the proportion of correctly predicted presences) and specificity (the proportion of correctly predicted absences). In contrast, TSS is independent of prevalence.
4. When the two measures of accuracy were compared using empirical data, kappa showed a unimodal response to prevalence, in agreement with the theoretical analysis. TSS showed a decreasing linear response to prevalence, a result we interpret as reflecting true ecological phenomena rather than a statistical artefact. This interpretation is supported by the fact that a similar pattern was found for the area under the ROC curve, a measure known to be independent of prevalence.
5. Synthesis and applications. Our results provide theoretical and empirical evidence that kappa, one of the most widely used measures of model performance in ecology, has serious limitations that make it unsuitable for such applications. The alternative we suggest, TSS, compensates for the shortcomings of kappa while keeping all of its advantages. We therefore recommend the TSS as a simple and intuitive measure for the performance of species distribution models when predictions are expressed as presence–absence maps.
Models generating presence–absence predictions (hereafter presence–absence models) are usually evaluated by comparing the predictions with a set of validation sites and constructing a confusion matrix that records the number of true positive (a), false positive (b), false negative (c) and true negative (d) cases predicted by the model (Table 1). Models generating non-dichotomous scores on an ordinal scale (hereafter ordinal score models) are often evaluated by applying a certain threshold to transform the scores into a dichotomous set of presence–absence predictions, and constructing a corresponding confusion matrix. One simple measure of accuracy that can be derived from the confusion matrix is the proportion of correctly predicted sites (overall accuracy; Table 2). However, this measure was criticized for ascribing high accuracies for rare species (Fielding & Bell 1997; Manel, Dias & Ormerod 1999). Two alternative measures that are often derived from the confusion matrix are sensitivity and specificity. Sensitivity is the proportion of observed presences that are predicted as such, and therefore quantifies omission errors. Specificity is the proportion of observed absences that are predicted as such, and therefore quantifies commission errors (Table 2). Sensitivity and specificity are independent of each other when compared across models, and are also independent of prevalence ((a + c)/n, the proportion of sites in which the species was recorded as present; Table 1).
Table 1. An error matrix used to evaluate the predictive accuracy of presence–absence models. a, number of cells for which presence was correctly predicted by the model; b, number of cells for which the species was not found but the model predicted presence; c, number of cells for which the species was found but the model predicted absence; d, number of cells for which absence was correctly predicted by the model
Model prediction    Validation data set
                    Presence    Absence
Presence            a           b
Absence             c           d
Table 2. Measures of predictive accuracy calculated from a 2 × 2 error matrix (Table 1). Overall accuracy is the rate of correctly classified cells. Sensitivity is the probability that the model will correctly classify a presence. Specificity is the probability that the model will correctly classify an absence. The kappa statistic and TSS normalize the overall accuracy by the accuracy that might have occurred by chance alone. In all formulae n = a + b + c + d
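The measures listed in Table 2 can be computed directly from the four cells of the error matrix. The Python sketch below is illustrative only: the function name `accuracy_measures` and the example counts are assumptions introduced here, and the formulae are the standard definitions of these measures written in the cell notation of Table 1, not a transcription of the table body.

```python
def accuracy_measures(a, b, c, d):
    """Accuracy measures from a 2 x 2 error matrix (cells named as in Table 1).

    Assumes at least one observed presence (a + c > 0) and at least one
    observed absence (b + d > 0); otherwise the ratios are undefined.
    """
    n = a + b + c + d
    overall = (a + d) / n                      # proportion of correctly predicted sites
    sensitivity = a / (a + c)                  # correctly predicted presences
    specificity = d / (b + d)                  # correctly predicted absences
    prevalence = (a + c) / n                   # proportion of sites with the species
    # Agreement expected by chance, from the row and column totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (overall - expected) / (1 - expected)
    tss = sensitivity + specificity - 1
    return {
        "overall": overall,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "prevalence": prevalence,
        "kappa": kappa,
        "tss": tss,
    }


# Two hypothetical validation sets with the same sensitivity and specificity
# (0.8 each) but different prevalence (0.5 and 0.1).
balanced = accuracy_measures(a=40, b=10, c=10, d=40)
rare = accuracy_measures(a=8, b=18, c=2, d=72)
print(balanced["kappa"], balanced["tss"])
print(rare["kappa"], rare["tss"])
```

In this made-up example the two matrices share the same sensitivity and specificity, so TSS is identical for both, whereas kappa is lower for the rare species.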
AUC was shown to be independent of prevalence (Manel, Williams & Ormerod 2001; McPherson, Jetz & Rogers 2004) and is considered a highly effective measure for the performance of ordinal score models. However, practical applications of species distribution models in conservation planning, such as the identification of biodiversity hotspots and the selection of representative conservation sites, often require presence–absence maps of species distribution, and thus a selection of a threshold for transforming ordinal scores into presence–absence predictions (Cumming 2000b; Loiselle et al. 2003; Berg, Gardenfors & von Proschwitz 2004). In such cases, predictive accuracy should be evaluated based on the selected threshold rather than on threshold-independent ROC curves. It should also be noted that some of the most frequently used models of species distribution (e.g. BioCLIM, Nix 1986; GARP, Stockwell & Peters 1999) generate dichotomous presence–absence predictions of species distribution, for which ROC curves cannot be applied.
In spite of its wide use, several studies have criticized the kappa statistic for being inherently dependent on prevalence and claimed that this dependency introduces bias and statistical artefacts to estimates of accuracy (Cicchetti & Feinstein 1990; Byrt, Bishop & Carlin 1993; Lantz & Nebenzahl 1996). In a recent study focusing on the evaluation of species distribution models, McPherson, Jetz & Rogers (2004) used numerical simulations to analyse the dependency of kappa on prevalence of the modelled species and found that kappa responds to variation in prevalence in a unimodal fashion. Based on this finding they concluded that ‘kappa's sensitivity to prevalence overall, however, renders it inappropriate for comparisons of model accuracy between species or regions unless certain precautions are taken’ (McPherson, Jetz & Rogers 2004).
In this paper we explain the observed unimodal dependency of kappa on prevalence, and introduce into ecology a new measure for the performance of presence–absence distribution models, the true skill statistic (TSS), which corrects for this dependency while still keeping all of the advantages of kappa.
We begin with a theoretical explanation for the unimodal dependence of kappa on prevalence. To do so we reformulate kappa in terms of prevalence, sensitivity and specificity. We then show analytically that TSS is largely immune to prevalence. We also compare the effect of prevalence on kappa and TSS using real data by modelling distribution patterns of 128 species of woody plants in Israel. Finally we discuss some methodological issues of kappa, TSS and other measures of accuracy, and their relevance for the performance of species distribution models.
The mechanism underlying the unimodal dependency of the kappa statistic on prevalence can be understood by reformulating kappa in terms of three parameters: prevalence, sensitivity and specificity. Such derivation leads to the following form:
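$$\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = P\,Sn + (1-P)\,Sp, \qquad P_e = P\,[P\,Sn + (1-P)(1-Sp)] + (1-P)\,[P(1-Sn) + (1-P)\,Sp] \qquad \text{(eqn 1)}$$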
where P, Sn and Sp are prevalence, sensitivity and specificity, respectively, Po is the observed accuracy and Pe is the accuracy expected to occur by chance. Kappa has an extremum at the value of P that satisfies both (Sn − Sp)P² − 2(1 − Sp)P + (1 − Sp) = 0 and 0 ≤ P ≤ 1. The extremum is a maximum when Sn + Sp − 1 > 0 and a minimum when Sn + Sp − 1 < 0. We will focus on the former case, which characterizes models with performance better than random. The prevalence that maximizes the kappa score of a given model is thus a function of the sensitivity and specificity of the model. If sensitivity and specificity are equal, a maximal kappa score is obtained for equal proportions of presences and absences. If sensitivity is larger than specificity, kappa is maximized by higher prevalence rates. If specificity is larger than sensitivity, kappa is maximized by lower prevalence rates. In any case, kappa inherently depends on prevalence. A measure that is largely insensitive to prevalence is thus required for assessing the performance of presence–absence models.
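The dependence of kappa on prevalence can be explored numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, kappa is computed from the observed and chance accuracies implied by the cell definitions of Table 1, and the maximizing prevalence uses a closed form, √(1 − Sp)/(√(1 − Sp) + √(1 − Sn)), which we obtained by solving the quadratic condition above for its root in [0, 1]; it is not quoted from the text.

```python
import math


def kappa_from_rates(p, sn, sp):
    """Kappa for prevalence p, sensitivity sn and specificity sp."""
    po = p * sn + (1 - p) * sp            # observed accuracy
    q = p * sn + (1 - p) * (1 - sp)       # proportion of sites predicted present
    pe = p * q + (1 - p) * (1 - q)        # accuracy expected by chance
    return (po - pe) / (1 - pe)


def prevalence_maximising_kappa(sn, sp):
    """Root in [0, 1] of (Sn - Sp)P^2 - 2(1 - Sp)P + (1 - Sp) = 0.

    Assumes an imperfect model (sn and sp not both equal to 1).
    """
    u = math.sqrt(1 - sp)
    v = math.sqrt(1 - sn)
    if u + v == 0:
        raise ValueError("kappa equals 1 at every prevalence for a perfect model")
    return u / (u + v)


# The three combinations of sensitivity and specificity used in the simulations.
for sn, sp in [(0.8, 0.8), (0.9, 0.7), (0.7, 0.9)]:
    p_star = prevalence_maximising_kappa(sn, sp)
    print(sn, sp, round(p_star, 3), round(kappa_from_rates(p_star, sn, sp), 3))
```

Consistent with the argument above, the maximizing prevalence is 0·5 when sensitivity equals specificity, lies above 0·5 when sensitivity is larger, and lies below 0·5 when specificity is larger.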
It is rewarding first to define theoretically when two modelling methods are of equal performance. It seems reasonable to assume that two methods are equal in their overall performance if they are equal in both sensitivity and specificity, and hence are equal in their ability to detect presences and absences. It also seems reasonable to expect that properties of the specific data set for which alternative methods are applied should not affect their rating. Taking into account the fact that the confusion matrix can be fully described by sensitivity, specificity, prevalence and the size of the validation set, an ideal measure of model performance should not be affected by prevalence or the size of the specific data set used for model validation (both being properties of the specific data set) and it should combine sensitivity and specificity so that both omission and commission errors are accounted for. We propose the true skill statistic (TSS), also known as the Hanssen–Kuipers discriminant, as a measure that satisfies these requirements. This statistic, traditionally used for assessing the accuracy of weather forecasts (McBride & Ebert 2000; Saseendran et al. 2002; Elmore, Weiss & Banacos 2003; Accadia et al. 2005), compares the number of correct forecasts, minus those attributable to random guessing, to that of a hypothetical set of perfect forecasts (see Appendix S1 in the supplementary material). For a 2 × 2 confusion matrix TSS is defined as:
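$$\mathrm{TSS} = \frac{ad - bc}{(a + c)(b + d)} = \text{sensitivity} + \text{specificity} - 1 \qquad \text{(eqn 2)}$$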
Like kappa, TSS takes into account both omission and commission errors, and success as a result of random guessing, and ranges from −1 to +1, where +1 indicates perfect agreement and values of zero or less indicate a performance no better than random. However, in contrast to kappa, TSS is not affected by prevalence. It can also be seen that TSS is not affected by the size of the validation set, and that two methods of equal performance have equal TSS scores. In Appendix S1 in the supplementary material we describe in more detail the relation of TSS to kappa. TSS is a special case of kappa, given that the proportions of presences and absences in the validation set are equal.
Computer simulations were conducted to allow more thorough comparison of kappa and TSS and their responses to prevalence. Confusion matrices consisting of 100 cases each were created by tagging presences and absences to be correctly classified at probabilities equal to predetermined values of sensitivity and specificity, respectively. Three possible scenarios were simulated: (i) equal sensitivity and specificity (both set to 0·8); (ii) higher sensitivity (sensitivity = 0·9, specificity = 0·7); and (iii) higher specificity (sensitivity = 0·7, specificity = 0·9). The number of presence cases was varied systematically from 1 to 99 in increments of 1. For each level of prevalence we randomly simulated 100 000 confusion matrices. The kappa and TSS scores were determined for each of the 9 900 000 matrices and their mean values were calculated for each level of prevalence under the three scenarios. The corresponding theoretical expectations were also calculated for each value of prevalence based on equations 1 and 2. The results (Fig. 1) showed that TSS scores were largely unaffected by prevalence while kappa scores exhibited a unimodal response to prevalence, as found by McPherson, Jetz & Rogers (2004). We conclude that, in contrast to kappa, documented effects of prevalence on TSS can be interpreted as evidence for real ecological phenomena rather than statistical artefacts.
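A scaled-down version of this simulation can be sketched in a few lines of Python. This is not the code used for the analysis: the function name, the random seed, the three prevalence levels and the reduced number of replicates (2000 rather than 100 000) are assumptions chosen to keep the example fast.

```python
import random


def simulate_mean_scores(n_presences, sn, sp, n_cases=100, reps=2000, seed=0):
    """Mean kappa and mean TSS over randomly generated confusion matrices.

    Each presence is classified correctly with probability sn and each
    absence with probability sp. Requires 0 < n_presences < n_cases.
    """
    rng = random.Random(seed)
    n_absences = n_cases - n_presences
    kappa_sum = 0.0
    tss_sum = 0.0
    for _ in range(reps):
        a = sum(rng.random() < sn for _ in range(n_presences))  # true positives
        d = sum(rng.random() < sp for _ in range(n_absences))   # true negatives
        c = n_presences - a                                     # false negatives
        b = n_absences - d                                      # false positives
        tss_sum += a / n_presences + d / n_absences - 1
        po = (a + d) / n_cases
        pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n_cases ** 2
        kappa_sum += (po - pe) / (1 - pe)
    return kappa_sum / reps, tss_sum / reps


# Scenario (i): sensitivity = specificity = 0.8, at three levels of prevalence.
results = {k: simulate_mean_scores(k, 0.8, 0.8) for k in (10, 50, 90)}
for n_presences, (mean_kappa, mean_tss) in results.items():
    print(n_presences / 100, round(mean_kappa, 3), round(mean_tss, 3))
```

Under these settings the mean TSS should stay close to 0·6 at all three prevalence levels, while the mean kappa should be clearly lower at the two extremes than at a prevalence of 0·5.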
An empirical comparison of the responses of kappa and TSS to variation in prevalence was carried out by re-analysing the data used by Farber & Kadmon (2003) for introducing the Mahalanobis distance as an approach for species distribution modelling. This data set comprises 32 414 geo-referenced observations on the distribution of 128 woody species in Israel (median number of observations per species 159). The models developed by Farber & Kadmon (2003) were validated using an independent database consisting of lists of species recorded in 96 validation sites of 5 × 5 km covering the main climatic gradients of Israel. The same calibration and validation data sets were used here to compare the responses of kappa and TSS to prevalence.
As in the theoretical analysis, prevalence was defined as the proportion of validation sites in which the relevant species was recorded. Predictive presence–absence maps were produced using the Mahalanobis distance. Three climatic factors were used as predictors in the models: mean annual rainfall, mean daily temperature of the hottest month (August) and mean minimum temperature of the coldest month (January). Further details of the modelling approach and the data can be found in Farber & Kadmon (2003).
We quantified the accuracy of the predictive map produced for each of the 128 species using four measures of accuracy: kappa, TSS, sensitivity and specificity. We also calculated the AUC statistic for each species non-parametrically using the Wilcoxon statistic (Hanley & McNeil 1982). Each of these five measures was regressed against prevalence using two types of models: a linear model and a quadratic model.
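For readers unfamiliar with the non-parametric calculation of AUC, the following Python sketch shows the idea: AUC is estimated as the proportion of presence–absence pairs of validation sites in which the presence site receives the higher score, with ties counted as one half. The function name and the example scores are hypothetical, and the sketch assumes that higher scores indicate greater suitability; for a distance-based score such as the Mahalanobis distance, where smaller values indicate greater suitability, the scores would need to be negated first.

```python
def auc_wilcoxon(presence_scores, absence_scores):
    """AUC as the fraction of presence-absence pairs ranked correctly.

    Ties between a presence score and an absence score count as half a
    correct ranking. Both lists must be non-empty.
    """
    correct = 0.0
    for s_presence in presence_scores:
        for s_absence in absence_scores:
            if s_presence > s_absence:
                correct += 1.0
            elif s_presence == s_absence:
                correct += 0.5
    return correct / (len(presence_scores) * len(absence_scores))


# Hypothetical suitability scores at three presence and four absence sites.
auc = auc_wilcoxon([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
print(round(auc, 4))
```

The double loop is quadratic in the number of sites, which is unproblematic for a validation set of 96 sites; a rank-based formulation would be preferable for much larger data sets.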
When kappa was regressed against prevalence with a linear model, prevalence had a positive but very weak effect on kappa (P = 0·047). The portion of variance explained by this model was extremely low (0·02). When the same data were analysed using a quadratic model the portion of variance explained increased to 0·12 and the coefficient of the quadratic term was negative and highly significant (P < 0·001), as expected from the theoretical analysis.
The linear models constructed for AUC and TSS showed that both measures were negatively and significantly correlated with prevalence (Table 3). This response suggested that distribution ranges of rare species were more predictable than those of more common species. When AUC and TSS values were analysed by quadratic models, the coefficients of the quadratic term were not statistically significant (Table 3).
Table 3. Results of linear regression models (y = b0 + b1x) and quadratic regression models (y = b0 + b1x + b2x²) for the effect of prevalence of woody plant species on five measures of accuracy (kappa, TSS, AUC, sensitivity and specificity). Asterisks indicate significance levels of regression coefficients; *P < 0·05, ***P < 0·001
The effect of prevalence on sensitivity was not statistically significant for either the linear or the quadratic model, but the corresponding effect on specificity was negative and highly significant (Table 3). These results indicated that the decrease of TSS with increasing prevalence was caused by an increase in the magnitude of commission errors. As can be expected from these results, when the five measures of accuracy were plotted against prevalence, kappa showed a unimodal response, TSS, AUC and specificity showed a negative response, and sensitivity showed no response (Fig. 2).
Spearman correlation analysis indicated that all pair-wise correlations between AUC, TSS and kappa were statistically significant (P < 0·01). However, the correlation between AUC and TSS was higher than the correlation of AUC with kappa or the correlation between TSS and kappa (0·85 vs. 0·65 and 0·66, respectively).
McPherson, Jetz & Rogers (2004) demonstrated with numerical simulations that kappa, one of the most common measures of predictive accuracy in ecology, is inherently sensitive to prevalence, showing a unimodal dependency with a maximum at intermediate levels of prevalence. This bias has long been recognized in other research fields, such as clinical epidemiology (Cicchetti & Feinstein 1990; Byrt, Bishop & Carlin 1993; Lantz & Nebenzahl 1996). In this paper we provide an analytical explanation for the results obtained by McPherson, Jetz & Rogers (2004), and propose a new measure of accuracy, TSS, that is insensitive to prevalence while still keeping all the advantages of the kappa statistic. We also provide empirical data supporting the hypothesis that the two measures of accuracy respond differentially to variation in prevalence, and demonstrate that the relationship between kappa and prevalence is unimodal, as expected from the theoretical analysis. Several previous studies have documented unimodal responses of kappa to species’ prevalence (Manel, Williams & Ormerod 2001; Petit et al. 2003; Liu et al. 2005) but none of these studies attributed this response to statistical artefact.
In our empirical analysis, TSS showed a negative response to prevalence, a result we interpret as indicative of a true effect of prevalence (or ecological characteristics associated with prevalence) on predictive accuracy. The fact that AUC, which is known to be independent of prevalence, showed a similar response to prevalence supports this interpretation. We explain this result by the fact that prevalent species often occupy wide niches. The area of predicted presence for such species is therefore much larger than that of scarce species. The increased area allows the Mahalanobis distance method to keep high levels of sensitivity (correctly predicting a presence as such) but results in a decrease in specificity, as many locations where the species is absent are erroneously predicted as presence locations.
Evidence for negative effects of prevalence on the accuracy of species distribution models was found in several previous studies. For example, Guisan & Hofer (2003) analysed the distribution of reptiles in Switzerland using generalized linear models (GLM) and found that highly common species showed exceptionally low values of predictive accuracy. Segurado & Araujo (2004) compared the performance of seven modelling techniques in predicting the distribution of amphibians and reptiles in Portugal and found that, in general, widespread species had greater overall errors. Stockwell & Peterson (2002) analysed patterns of bird distribution in Mexico with GARP and found that range size had a negative effect on the accuracy of model predictions. They suggested that widespread species show local adaptations in ecological characteristics and that ignoring such ecological differentiation overestimates the species’ distribution range and reduces model accuracy.
As can be verified by assigning P = 0·5 in equation 1, when the proportions of presences and absences are equal, kappa is equal to TSS. The bias caused to kappa by unequal proportions of presences and absences in the validation set has led some to suggest that efforts should be made to collect validation sets such that prevalence would be around 50% (Lantz & Nebenzahl 1996; Hoehler 2000; McPherson, Jetz & Rogers 2004). Unfortunately this recommendation is of questionable practicability in ecological applications, particularly for rare species for which a small number of presence records is available. TSS satisfies this recommendation without requiring special adjustments or sampling efforts.
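The equality of kappa and TSS at a prevalence of 0·5 can be checked in a few lines. The derivation below is a sketch that starts from the cell definitions of Table 1; the symbol Q (the proportion of sites predicted as present) is introduced here for convenience, and the rearranged expression for kappa is algebraically derived from those definitions, not quoted from the text.

```latex
\begin{align*}
Q &= P\,Sn + (1-P)(1-Sp), \\
P_o &= P\,Sn + (1-P)\,Sp, \qquad P_e = P\,Q + (1-P)(1-Q), \\
P_o - P_e &= 2P(1-P)(Sn + Sp - 1), \qquad 1 - P_e = P + Q - 2PQ, \\
\kappa &= \frac{2P(1-P)(Sn + Sp - 1)}{P + Q - 2PQ}, \\
\kappa\big|_{P = 0.5} &= \frac{\tfrac{1}{2}(Sn + Sp - 1)}{\tfrac{1}{2} + Q - Q}
  = Sn + Sp - 1 = \mathrm{TSS}.
\end{align*}
```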
An alternative approach to obtaining a validation set with effective prevalence of 50% is by random resampling from the data available (Stockwell & Peterson 2002). This method suffers from stochasticity in the selection of points, but if the validation set is large enough, or if performance is averaged over several randomly drawn validation sets, the results would converge to the TSS statistic.
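The convergence of resampled kappa to TSS can be illustrated with a small simulation. The sketch below is one possible reading of the resampling approach, not a description of the procedure of Stockwell & Peterson (2002): it assumes that balance is achieved by drawing equal numbers of presence and absence sites without replacement, and the function names, the example counts and the number of replicates are hypothetical.

```python
import random


def kappa(a, b, c, d):
    """Kappa from the four cells of a 2 x 2 error matrix (Table 1 notation)."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (po - pe) / (1 - pe)


def mean_kappa_balanced(a, b, c, d, reps=2000, seed=0):
    """Mean kappa over validation subsets resampled to 50% prevalence."""
    rng = random.Random(seed)
    presences = [1] * a + [0] * c   # observed presences; 1 = predicted present
    absences = [1] * b + [0] * d    # observed absences; 1 = predicted present
    m = min(len(presences), len(absences))
    total = 0.0
    for _ in range(reps):
        drawn_presences = rng.sample(presences, m)
        drawn_absences = rng.sample(absences, m)
        a_s = sum(drawn_presences)
        b_s = sum(drawn_absences)
        total += kappa(a_s, b_s, m - a_s, m - b_s)
    return total / reps


# Hypothetical rare species: prevalence 0.1, sensitivity and specificity 0.8.
a, b, c, d = 8, 18, 2, 72
tss_full = a / (a + c) + d / (b + d) - 1
kappa_full = kappa(a, b, c, d)
kappa_balanced = mean_kappa_balanced(a, b, c, d)
print(round(kappa_full, 3), round(kappa_balanced, 3), round(tss_full, 3))
```

In this example the kappa of the full validation set is well below TSS, whereas the average kappa over balanced resamples should lie close to the TSS of the full set.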
TSS provides a threshold-dependent measure of accuracy that is readily applied for presence–absence predictions. Our theoretical analysis demonstrates that it is not affected by prevalence, and our empirical analysis indicates that its values are highly correlated with those of the threshold-independent AUC statistic. These findings suggest that TSS can serve as an appropriate alternative to AUC in cases where model predictions are formulated as presence–absence maps. Several recent studies have jointly used AUC as a threshold-independent and kappa as a threshold-dependent measure of predictive accuracy (Thuiller 2003; Huntley et al. 2004; Pearson, Dawson & Liu 2004; Araujo et al. 2005; Pearson et al. 2006). Our results suggest that TSS should be preferred over kappa as a threshold-dependent measure in such studies.
As is evident from equation 2, TSS assigns equal weights to sensitivity and specificity. Practical applications might require different weights. In conservation planning, for example, when predicting distribution of endangered species, one may wish to weight sensitivity more than specificity. Unlike the kappa score, weights can easily be introduced to the TSS in a straightforward manner.
While our theoretical and empirical results support the superiority of TSS over the kappa statistic, a thorough comparison of the two measures should also consider their variance and its dependency on prevalence. It can be shown that the variance of TSS is given by:
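$$\operatorname{Var}(\mathrm{TSS}) = \frac{Sn\,(1 - Sn)}{N\,P} + \frac{Sp\,(1 - Sp)}{N\,(1 - P)} \qquad \text{(eqn 3)}$$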
where Sn, Sp, P and N are the sensitivity, specificity, prevalence and size of the validation set, respectively. As evident from Fig. 3a, TSS is highly variable for extremely low and high levels of prevalence. The variability is caused by the large variability in sensitivity for small data sets with very low prevalence range (as randomness can easily change the parameter from 1 to 0 or vice versa) and in specificity for small data sets with very high prevalence range. The variance of kappa can be calculated non-parametrically for a finite N, by determining the probability of obtaining each possible confusion matrix and the corresponding kappa score. Given a prevalence P, sensitivity Sn, specificity Sp and N cases, the probability of obtaining a confusion matrix with values a, b, c and d (as defined in Table 1) is given by:
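$$\Pr(a, b, c, d) = \binom{NP}{a}\,Sn^{a}\,(1 - Sn)^{c}\;\binom{N(1 - P)}{d}\,Sp^{d}\,(1 - Sp)^{b}, \qquad a + c = NP,\; b + d = N(1 - P) \qquad \text{(eqn 4)}$$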
A plot of the variance of kappa against prevalence (Fig. 3a) indicates that kappa does not suffer from the border effects obtained for TSS, and that its variability decreases for extremely low or high prevalence. However, the absolute value of kappa also decreases for low or high prevalence (Fig. 1a). A more informative comparison of the two measures should therefore be based on their coefficient of variation (CV; the standard deviation divided by the mean). Such a comparison reveals similar curves for the two measures (Fig. 3b). Similar patterns have been shown to characterize AUC as well (Cumming 2000a; McPherson, Jetz & Rogers 2004). Instability at extremely low or high levels of prevalence seems to be inherent in any model with low number of cases in one of the cells of the confusion matrix (Nelson & Cicchetti 1995).
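The variance calculations described above can be sketched as follows. The code rests on assumptions that should be read as ours: the number of presences in the validation set is treated as fixed, presences and absences are classified independently, so that the numbers of true positives and true negatives are binomial, and the variance of TSS is taken as the sum of the binomial variances of sensitivity and specificity. All function names are hypothetical.

```python
from math import comb


def tss_variance(sn, sp, p, n):
    """Variance of TSS as the sum of the variances of sensitivity and specificity."""
    return sn * (1 - sn) / (n * p) + sp * (1 - sp) / (n * (1 - p))


def kappa_stat(a, b, c, d):
    """Kappa from the four cells of a 2 x 2 error matrix."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (po - pe) / (1 - pe)


def tss_stat(a, b, c, d):
    """TSS from the four cells of a 2 x 2 error matrix."""
    return a / (a + c) + d / (b + d) - 1


def exact_moments(stat, sn, sp, n_presences, n_cases):
    """Exact mean and variance of stat over all possible confusion matrices.

    Requires 0 < n_presences < n_cases.
    """
    n_absences = n_cases - n_presences
    mean = 0.0
    second = 0.0
    for a in range(n_presences + 1):
        c = n_presences - a
        prob_a = comb(n_presences, a) * sn ** a * (1 - sn) ** c
        for d in range(n_absences + 1):
            b = n_absences - d
            prob = prob_a * comb(n_absences, d) * sp ** d * (1 - sp) ** b
            value = stat(a, b, c, d)
            mean += prob * value
            second += prob * value ** 2
    return mean, second - mean ** 2


# Sensitivity = specificity = 0.8, 100 validation cases, 5 or 50 presences.
for n_presences in (5, 50):
    kappa_mean, kappa_var = exact_moments(kappa_stat, 0.8, 0.8, n_presences, 100)
    tss_mean, tss_var = exact_moments(tss_stat, 0.8, 0.8, n_presences, 100)
    print(n_presences, kappa_mean, kappa_var, tss_mean, tss_var)
```

Enumerating all matrices is feasible here because a validation set of N cases with a fixed number of presences has at most (NP + 1)(N(1 − P) + 1) distinct confusion matrices. Dividing the square root of each variance by the corresponding mean gives the coefficient of variation.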
Finally, when discussing the suitability of TSS vs. kappa as measures for model performance, one should distinguish between tests of agreement and validation. Kappa was originally designed to measure reliability of predictions by assessing agreement between two or more observers (Tooth & Ottenbacher 2004). In such applications none of the observers is treated as a ‘gold standard’, i.e. is known to be accurate. For such a purpose the prevalence effect on kappa is much desired and kappa should not be adjusted for it (Hoehler 2000). However, in validation tests such as those performed for evaluating the performance of distribution models, a gold standard obviously exists. Under such circumstances the prevalence effect of kappa turns against it, and sensitivity and specificity, which are not applicable for the purpose of assessing agreement between two observers, become very informative. TSS accounts for both sensitivity and specificity and is therefore better suited than kappa for measuring performance of a method in the presence of a gold standard.
In a recent review of the challenge of testing models of species distribution, Vaughan & Ormerod (2005) concluded that adequate testing of such models is still scarce and that their true value cannot yet be appraised. The results of this study support their conclusions and provide theoretical and empirical support that kappa, one of the most widely used measures of model performance in ecology, has serious limitations that make it unsuitable for such applications. The alternative we suggest, the TSS statistic, compensates for the shortcomings of kappa while keeping all of its advantages, and provides results that are highly correlated with those of the threshold-independent AUC statistic. We therefore recommend the TSS as a simple and intuitive measure for the performance of predictive maps generated by presence–absence models.
We thank A. Ben-Nun for assistance with GIS issues, U. Motro for fruitful discussions of statistical issues, and W. Thuiller and an anonymous referee for valuable comments on an earlier version of the manuscript. The research was supported by The Israel Science Foundation (grant no. 545/03) and the Ministry of Environment.
We describe the generalization of TSS for a k × k contingency table, adapting Doswell, Davies-Jones & Keller (1990). Let us denote the k categories as C1…Ck. Let nij be the number of cases in which Ci was forecast and Cj was observed. Let ni. be the total number of cases in which Ci was forecast, n.j the total number of cases in which Cj was observed, and n.. the total number of cases. The value expected to occur by chance for the ijth cell is given by Eij = (ni.)(n.j)/n.. and should be subtracted from the observed ijth element nij of the matrix n to remove success due to random guessing.
Let us now denote by R the matrix whose elements are Rij = nij − Eij, and construct a standard matrix R* that will be compared to R. R* is a matrix of perfect forecasts, accounting for random guessing. A matrix n* of perfect forecasts has n.i in the iith diagonal element and zeros in all off-diagonal elements. The number of cases expected to occur by chance is given by E*ij = (n.i)(n.j)/n.., and thus R* = n* − E*. The trace of R (the sum of the elements in the main diagonal) gives the number of correct forecasts beyond those attributable to chance. Using the trace of R* as a standard we define the generalized version of TSS as
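$$\mathrm{TSS} = \frac{\operatorname{tr}(R)}{\operatorname{tr}(R^{*})} = \frac{\sum_{i} n_{ii} - \frac{1}{n_{..}} \sum_{i} n_{i.}\,n_{.i}}{n_{..} - \frac{1}{n_{..}} \sum_{i} n_{.i}^{2}} \qquad \text{(eqn 5)}$$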
Using the above notation Cohen's Kappa is defined by:
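$$\kappa = \frac{\operatorname{tr}(R)}{\operatorname{tr}(n^{*} - E)} = \frac{\sum_{i} n_{ii} - \frac{1}{n_{..}} \sum_{i} n_{i.}\,n_{.i}}{n_{..} - \frac{1}{n_{..}} \sum_{i} n_{i.}\,n_{.i}} \qquad \text{(eqn 6)}$$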
It is now clear that TSS = Kappa whenever E* is replaced with E.
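The generalized definitions can be made concrete with a short Python sketch. It is an illustration under our own naming: `generalised_scores` is a hypothetical function, the table is assumed to hold forecast categories in rows and observed categories in columns (as in the definition of nij above), and chance expectations are taken as products of the marginal totals divided by the total number of cases.

```python
def generalised_scores(table):
    """Generalized TSS and kappa for a k x k contingency table.

    table[i][j] is the number of cases forecast as category i and observed
    as category j. At least two categories must be observed.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    forecast_totals = [sum(row) for row in table]
    observed_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    correct = sum(table[i][i] for i in range(k))
    # Trace of E: correct forecasts expected by chance for the actual forecasts.
    chance = sum(forecast_totals[i] * observed_totals[i] for i in range(k)) / total
    # Trace of E*: the same expectation for a set of perfect forecasts.
    chance_perfect = sum(t * t for t in observed_totals) / total
    tss = (correct - chance) / (total - chance_perfect)
    kappa = (correct - chance) / (total - chance)
    return tss, kappa


# A hypothetical 2 x 2 table laid out as [[a, b], [c, d]] (Table 1 notation).
tss, kappa = generalised_scores([[8, 18], [2, 72]])
print(round(tss, 4), round(kappa, 4))
```

For a 2 × 2 table the generalized TSS reduces to sensitivity + specificity − 1, which offers a simple check on the implementation. The two scores differ only in the chance term subtracted in the denominator.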