### Abstract


Transforming the results of species distribution modelling from probabilities of, or suitabilities for, species occurrence to presences/absences requires a specific threshold. Although there are many approaches to determining thresholds, no comparative study exists. In this paper, twelve approaches were compared using two species in Europe and artificial neural networks, and the modelling results were assessed using four indices: sensitivity, specificity, overall prediction success and Cohen's kappa statistic. The results show that the prevalence approach, the average predicted probability/suitability approach, and three approaches that combine sensitivity and specificity, namely the sensitivity-specificity sum maximization approach, the sensitivity-specificity equality approach and the approach based on the shortest distance to the top-left corner (0,1) of the ROC plot, perform well. The commonly used kappa maximization approach does not perform as well as the afore-mentioned approaches, and the fixed threshold approach performs worst. We also recommend building models with datasets whose prevalence is 50% where possible, since most optimization criteria are then satisfied, or nearly satisfied, at the same time, which makes it easier to find optimal thresholds.

Predicting species distributions is becoming increasingly important since it is relevant to resource assessment, environmental conservation and biodiversity management (Fielding and Bell 1997, Manel et al. 1999, Austin 2002, D'heygere et al. 2003). Many modelling techniques have been used for this purpose, e.g. generalized linear models (GLM), generalized additive models (GAM), classification and regression trees (CART), principal components analysis (PCA) and artificial neural networks (ANN) (Guisan and Zimmermann 2000, Moisen and Frescino 2002, Guisan et al. 2002, Berg et al. 2004). Most of these techniques give their results as the probability of species presence, e.g. GLM, GAM and some algorithms of ANNs, or as the environmental suitability for the target species, e.g. PCA (Robertson et al. 2003) and some algorithms of ANNs. However, in conservation and environmental management practice, information presented as species presence/absence may be more practical than information presented as probability or suitability. Therefore, a threshold is needed to transform the probability or suitability data into presence/absence data. A threshold is also needed when assessing model performance using the indices derived from the confusion matrix (Manel et al. 2001), which also facilitates the interpretation of modelling results. Before reviewing threshold determination approaches, we first review these model assessment indices, because some of them are also the only or primary component of certain threshold determination approaches.

### Model assessment indices


Many indices can be used to assess predictions of species distributions, including sensitivity, specificity, overall prediction success (OPS), Cohen's kappa statistic, the odds ratio, and the normalized mutual information statistic (NMI); some of them have been incorporated into approaches to determining thresholds. Fielding and Bell (1997) gave a comprehensive review (Manel et al. 2001). All these indices (Table 1) need information from the confusion matrix, which consists of four elements: true positives or presences (a), false positives or presences (b), false negatives or absences (c) and true negatives or absences (d). Since an individual element of the confusion matrix may be zero, the odds ratio and NMI cannot be calculated in some cases. Precision, recall and F are three indices used in the field of information retrieval. Precision is the proportion of the retrieved items that are relevant, i.e. the proportion of predicted presences that are real presences; recall is the proportion of the relevant items that are retrieved, which is equal to sensitivity; and F is the harmonic average of precision and recall (Nahm and Mooney 2000). F varies from 0, when almost no relevant items are retrieved, i.e. almost no real presences are predicted as presences, to 1, when all and only the relevant items are retrieved, i.e. all and only the real presences are predicted as presences. α is a parameter that gives weights (α and 1−α) to the two components of F. Moreover, when α=0.5, F tends strongly towards the lower of the two values (precision and recall); therefore, this measure can only be high when both precision and recall are high.

Table 1. Indices for assessing the predictive performance of species distribution models, where a is true positives (or presences), b is false positives (or presences), c is false negatives (or absences), d is true negatives (or absences), n (=a+b+c+d) is the total number of sites and α is a parameter between 0 and 1 (inclusive).

| Index | Formula |
|---|---|
| Sensitivity (or recall, R) | a/(a+c) |
| Specificity | d/(b+d) |
| Precision (P) | a/(a+b) |
| Overall prediction success (OPS) | (a+d)/n |
| Kappa | [(a+d)/n − ((a+b)(a+c)+(c+d)(b+d))/n²] / [1 − ((a+b)(a+c)+(c+d)(b+d))/n²] |
| Odds ratio | ad/(cb) |
| Normalized mutual information statistic (NMI) | 1 − [−a ln a − b ln b − c ln c − d ln d + (a+b) ln(a+b) + (c+d) ln(c+d)] / [n ln n − (a+c) ln(a+c) − (b+d) ln(b+d)] |
| F | 1 / [α/P + (1−α)/R] |
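To make the definitions in Table 1 concrete, the indices can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only (the function name and the example counts are our own choices, not from the paper), and it assumes none of the relevant denominators is zero:

```python
def confusion_indices(a, b, c, d, alpha=0.5):
    """Assessment indices from a confusion matrix (Table 1):
    a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    n = a + b + c + d
    sens = a / (a + c)          # sensitivity (= recall)
    spec = d / (b + d)          # specificity
    prec = a / (a + b)          # precision
    ops = (a + d) / n           # overall prediction success
    # Cohen's kappa: observed agreement corrected for chance agreement
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (ops - chance) / (1 - chance)
    # F: weighted harmonic mean of precision and recall
    f = 1.0 / (alpha / prec + (1 - alpha) / sens)
    return {"sensitivity": sens, "specificity": spec, "precision": prec,
            "OPS": ops, "kappa": kappa, "F": f}

# Hypothetical example: 40 true presences, 5 false presences,
# 10 false absences, 45 true absences
idx = confusion_indices(40, 5, 10, 45)
```

Note that, as the text observes, indices such as the odds ratio and NMI additionally require every count to be non-zero.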

### Threshold determination approaches


There are many approaches to determining thresholds, and they fall into two categories: subjective and objective. A representative of the first category is taking 0.5 as the threshold, which is widely used in ecology (Manel et al. 1999, 2001, Luck 2002, Stockwell and Peterson 2002, Bailey et al. 2002, Woolf et al. 2002). Sometimes 0.3 (Robertson et al. 2001) and 0.05 (Cumming 2000) are also used as thresholds. These choices are arbitrary and lack any ecological basis (Osborne et al. 2001). Sometimes a specific level of sensitivity or specificity, e.g. 95%, is desired or deemed acceptable and is predetermined (Cantor et al. 1999), from which the corresponding threshold can be found. This approach is also subjective because the level of the attribute in question (e.g. sensitivity or specificity) is predetermined by the researchers.

There are many objective approaches. With these approaches, thresholds are chosen to maximize the agreement between observed and predicted distributions. Cramer (2003) also recognized the problem with the fixed threshold approach, especially taking 0.5 as the threshold: with unbalanced samples, it gives nonsense results. He therefore recommended as the threshold the sample frequency, i.e. the prevalence of species occurrence (defined as the proportion of species occurrences among all the sites), or the mean value of the predicted probabilities of species presence. Fielding and Haworth (1995) used a threshold calculated as the mid-point between the mean probabilities of occupancy for the present and absent groups.
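These data-driven choices (approaches 4-6 in Table 2) are straightforward to compute. The sketch below is a minimal illustration under our own naming; it assumes observed presences are coded 0/1 and that both occupied and unoccupied sites occur in the data:

```python
def reference_thresholds(y_obs, p_pred):
    """Three data-driven thresholds: prevalence and mean predicted
    probability (Cramer 2003), and the mid-point between the group
    means (Fielding and Haworth 1995)."""
    n = len(y_obs)
    prevalence = sum(y_obs) / n       # proportion of occupied sites
    mean_prob = sum(p_pred) / n       # mean predicted probability/suitability
    present = [p for y, p in zip(y_obs, p_pred) if y == 1]
    absent = [p for y, p in zip(y_obs, p_pred) if y == 0]
    midpoint = (sum(present) / len(present) +
                sum(absent) / len(absent)) / 2
    return prevalence, mean_prob, midpoint
```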

For other objective approaches, usually either a specific index, e.g. kappa, or the trade-off between two conflicting properties, e.g. sensitivity and specificity, is optimized in various ways. The kappa maximization approach is popular in ecology (Huntley et al. 1995, Lehmann 1998, Guisan et al. 1998, Collingham et al. 2000, Berry et al. 2001, Pearson et al. 2002). Similarly, OPS and F can also be used to determine thresholds (Shapire et al. 1998). The sum of sensitivity and specificity can be maximized to give the threshold (Manel et al. 2001), which is equivalent to finding the point on the ROC (receiver operating characteristic) curve (i.e. sensitivity against 1−specificity) whose tangent slope equals 1 (Cantor et al. 1999). The point at which sensitivity and specificity are equal can also be chosen to determine the threshold (Cantor et al. 1999); this approach can likewise be applied to precision and recall (Shapire et al. 1998). Another approach is to select the point on the ROC curve that is closest to the top-left corner (0,1) of the ROC plot, since that corner represents a perfect classification with 100% sensitivity and specificity (Cantor et al. 1999). Similarly, the point on the P-R (precision-recall) curve that is closest to the top-right corner (1,1) of the P-R plot can be used to determine the threshold, since that corner represents a perfect classification with 100% precision and recall.
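The sensitivity-specificity criteria above can all be implemented by scanning candidate thresholds and scoring the resulting confusion matrix at each one. This is a sketch under our own naming and design choices (the observed predictions themselves serve as the candidate set, and ties are broken by the lowest candidate):

```python
def pick_threshold(y_obs, p_pred, rule="sum"):
    """Choose a threshold by one of three criteria (approaches 7-9 in
    Table 2): "sum" maximizes sensitivity + specificity, "equal"
    minimizes |sensitivity - specificity|, "roc" minimizes the distance
    to the top-left corner (0,1) of the ROC plot."""
    best_t, best_score = None, None
    for t in sorted(set(p_pred)):
        # Build the confusion matrix at threshold t (p >= t means "present")
        a = sum(1 for y, p in zip(y_obs, p_pred) if y == 1 and p >= t)
        c = sum(1 for y, p in zip(y_obs, p_pred) if y == 1 and p < t)
        b = sum(1 for y, p in zip(y_obs, p_pred) if y == 0 and p >= t)
        d = sum(1 for y, p in zip(y_obs, p_pred) if y == 0 and p < t)
        sens, spec = a / (a + c), d / (b + d)
        if rule == "sum":
            score = -(sens + spec)                       # lower is better
        elif rule == "equal":
            score = abs(sens - spec)
        else:  # "roc"
            score = ((1 - spec) ** 2 + (1 - sens) ** 2) ** 0.5
        if best_score is None or score < best_score:
            best_t, best_score = t, score
    return best_t
```

On perfectly separable data all three rules recover the same separating threshold; they differ when the predicted values of the two groups overlap.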

Some researchers went further and identified the appropriate threshold by incorporating the relative costs of FP (false positive) and FN (false negative) errors and prevalence (Zweig and Campbell 1993, Fielding and Bell 1997), or by incorporating the C/B ratio (the ratio of net FP cost to net true positive benefit) and prevalence (Metz 1978, Cantor et al. 1999). The threshold corresponds to the point on the ROC curve at which the slope of the tangent is (C/B)×(1−p)/p or (FPC/FNC)×(1−p)/p, where p is the prevalence (of species presence) and FPC and FNC are the costs of false positives and false negatives respectively.
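The tangent-slope condition is a one-line computation; the helper below (our own naming, for illustration) shows that with equal error costs and a prevalence of 0.5 the optimal slope is 1, which recovers the sensitivity-specificity sum maximization criterion:

```python
def cost_weighted_slope(prevalence, fpc, fnc):
    """Slope of the ROC tangent at the cost-optimal threshold:
    (FPC/FNC) * (1 - p) / p, where p is the prevalence and FPC, FNC
    are the costs of false positives and false negatives
    (Zweig and Campbell 1993, Fielding and Bell 1997)."""
    return (fpc / fnc) * (1 - prevalence) / prevalence

# Rarer species (lower p) or costlier false positives push the optimal
# slope above 1, i.e. towards more conservative (higher) thresholds.
```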

Although there are many approaches to determining the threshold, there has been no comparative study of their behaviours, so their relative performance is unknown. In this paper we compared twelve different approaches to determining thresholds (Table 2) and investigated their behaviours in various situations, i.e. with different prevalences for the model-building data and test data, using artificial neural networks, which many researchers (Ozesmi and Ozesmi 1999, Brosse et al. 1999, Manel et al. 1999, Olden and Jackson 2001, Berry et al. 2002, Pearson et al. 2002, 2004, Olden 2003) consider better than traditional techniques at modelling complex phenomena with non-linear relationships. We note that the probability-based approaches were designed for predicted probabilities, whereas our modelling results are predicted suitabilities for species presence. However, we believe this does not hinder the use of these approaches, since the "suitabilities" we obtain range from 0 to 1.

Table 2. Threshold-determining approaches studied in this paper.

| Code | Approach | Definition | Reference |
|---|---|---|---|
| **Subjective approach** | | | |
| 1 | Fixed threshold approach | Taking a fixed value, usually 0.5, as the threshold | Manel et al. (1999), Bailey et al. (2002) |
| **Objective approaches** | | | |
| *Single index-based approaches:* | | | |
| 2 | Kappa maximization approach | Kappa statistic is maximized | Huntley et al. (1995), Guisan et al. (1998) |
| 3 | OPS maximization approach | Overall prediction success (OPS) is maximized | |
| *Model-building data-only-based approach:* | | | |
| 4 | Prevalence approach | Taking the prevalence of the model-building data as the threshold | Cramer (2003) |
| *Predicted probability/suitability-based approaches:* | | | |
| 5 | Average probability/suitability approach | Taking the average predicted probability/suitability of the model-building data as the threshold | Cramer (2003) |
| 6 | Mid-point probability/suitability approach | Mid-point between the average probabilities of or suitabilities for the species' presence for occupied and unoccupied sites | Fielding and Haworth (1995) |
| *Sensitivity and specificity-combined approaches:* | | | |
| 7 | Sensitivity-specificity sum maximization approach | The sum of sensitivity and specificity is maximized | Cantor et al. (1999), Manel et al. (2001) |
| 8 | Sensitivity-specificity equality approach | The absolute value of the difference between sensitivity and specificity is minimized | Cantor et al. (1999) |
| 9 | ROC plot-based approach | The threshold corresponds to the point on the ROC curve (sensitivity against 1−specificity) with the shortest distance to the top-left corner (0,1) of the ROC plot | Cantor et al. (1999) |
| *Precision and recall-combined approaches:* | | | |
| 10 | Precision-recall break-even point approach | The absolute value of the difference between precision and recall is minimized | Shapire et al. (1998) |
| 11 | P-R plot-based approach | The threshold corresponds to the point on the P-R (precision-recall) curve with the shortest distance to the top-right corner (1,1) of the P-R plot | |
| 12 | F maximization approach | The index F is maximized; in this study α=0.5, i.e. no preference is given to precision or recall | Shapire et al. (1998) |

### Discussion


Finding a threshold and making the presence/absence prediction is the final step in species distribution modelling, and it is necessary in, for example, the estimation of species ranges and the assessment of the impact of climate change. It is important to give an accurate presence/absence prediction in these situations. However, in other situations, other considerations come into play. For example, in species reintroduction programs we may limit the reintroduction sites to the most suitable areas, whereas in some conservation planning programs we may take a less restrictive strategy, purposely including some less suitable areas in protection. It is expected that the larger the predicted probability/suitability of presence at a site, the more suitable the site is for the reintroduction of the species. However, the prevalence of the model-building data has a significant effect on the predicted probability/suitability of presence, i.e. the higher the prevalence, the larger the predicted probability/suitability (Cramer 2003, Liu et al. unpubl.), and this makes it difficult to distinguish more suitable from less suitable sites. Therefore, even in applications that involve some subjective decision-making, it is still necessary to find the appropriate threshold and take the "objective" presence/absence prediction as a reference.

In this study we treated the two kinds of errors, i.e. false positives and false negatives, as equally important and gave no preference to either side. However, approach 12 can be adapted to situations in which one of the two conflicting sides is emphasized, by varying the parameter α between 0 and 1: a smaller α emphasizes recall, and a bigger α emphasizes precision. It is difficult, though, to say to what degree one side should be emphasized. In such situations the subjective approach may be suitable, e.g. a "minimum acceptable error" could be defined that depends on the intended application of the model. For example, compared with false negatives, we could tolerate more false positives when setting up a conservation area for a particularly endangered species. If the purpose of the model were to identify experimental sites where a species could be found, we should minimize the false positive error rate (Fielding and Bell 1997).

When some kind of cost for false positives or false negatives and/or benefit for true positives or true negatives needs to be taken into account, Metz's (1978) approach can be adopted, because the cost of false positives and the benefit of true positives, as well as prevalence, are explicitly considered. However, the relevant costs and benefits are difficult to determine in environmental and ecological practice. Zweig and Campbell (1993) suggested that if FPC>FNC, the threshold should favour specificity, while sensitivity should be favoured if FNC>FPC (Fielding and Bell 1997). Because estimating the costs and benefits may add more uncertainty to the problem, caution must be exercised when this approach is adopted.

It is interesting to note that among the twelve approaches we studied, both sensitivity and specificity for approaches 4, 5, 7, 8 and 9 are high (>0.8) and higher than those for the other approaches. Since a larger sensitivity means a smaller false negative rate, and a larger specificity means a smaller false positive rate, both the false positive rate and the false negative rate are low (<0.2) for approaches 4, 5, 7, 8 and 9. These approaches are therefore recommended. The other approaches either have a low false negative rate and a high false positive rate (e.g. approach 12), a high false negative rate and a low false positive rate (e.g. approach 10), or both a high false positive rate and a high false negative rate (e.g. approaches 1 and 3); these approaches are not recommended.

It is not unexpected that the fixed threshold approach (threshold=0.5) is one of the worst. Guisan and Theurillat (2000) found that the threshold histogram is not centred on 0.5 with symmetric tails in each direction (towards 0 and 1); rather, all values ranged between 0.05 and 0.65 with a mean of 0.35 and an asymmetric shape. In fact, the prevalence of the model-building data affects all the results: the output is biased towards the larger of the two groups (Fielding and Bell 1997, Cramer 2003), occupied sites and unoccupied sites. When the prevalence is small, a 0.5 threshold classifies most of the sites as unoccupied (Cumming 2000). The prevalence approach, by contrast, is one of the most robust, i.e. although it is not the best in every situation, it is good, or at least not bad, even in the worst situation. In fact it is one of the best as assessed using sensitivity, OPS and kappa. This is also not unexpected: in another study we found that the probabilities corresponding to the maximum OPS and maximum kappa for the test data are positively correlated with the prevalence of the model-building data (Liu et al. unpubl.). We suggested that a good presence/absence prediction would be obtained by taking the prevalence of the model-building data as the threshold, and this hypothesis was verified by the present study.

From this study we also found that when the prevalence of the model-building data is 50%, there is little difference among the twelve approaches as measured by the four indices (Fig. 3): the relative difference (the difference between the maximum and minimum of the twelve values for each index, divided by the maximum) is <5% for all four indices and both species. Furthermore, in addition to approaches 1 and 4, approaches 2, 3 and 7 (and also 11 for *P*. *maritima*), approaches 8 and 10, and approaches 5 and 6 each reach exactly the same result for each of the two species. The convergence of approaches 1 and 4 is obvious. The convergence of approaches 5 and 6 follows because there are equal numbers of occupied and unoccupied sites. The convergence of approaches 2, 3 and 7 means that kappa, OPS and the sum of sensitivity and specificity are maximized at the same time. The convergence of approaches 8 and 10 means that specificity=sensitivity (=recall)=precision at the same time. Thus many conditions are satisfied, or nearly satisfied, simultaneously in this situation, and the best result will most probably be obtained by any approach, even the poor ones, as verified by this study (Fig. 3c, d). This is encouraging since it supports our recommendation to use model-building data with a prevalence of 50% in species distribution modelling (Liu et al. unpubl.).