Ecological niche modeling re‐examined: A case study with the Darwin's fox

Abstract Many previous studies have attempted to assess ecological niche modeling performance using receiver operating characteristic (ROC) approaches, even though diverse problems with this metric have been pointed out in the literature. We explored different evaluation metrics based on independent testing data, using the Darwin's fox (Lycalopex fulvipes) as a detailed case in point. Six ecological niche models (ENMs; generalized linear models, boosted regression trees, Maxent, GARP, multivariate kernel density estimation, and NicheA) were explored and tested using six evaluation metrics (partial ROC, Akaike information criterion, omission rate, and cumulative binomial probability, plus two novel metrics that quantify model extrapolation versus interpolation: E‐space index I, based on extent of extrapolation, and E‐space index II, based on Jaccard similarity). Different ENMs showed diverse and mixed performance, depending on the evaluation metric used. Because ENMs performed differently according to the evaluation metric employed, model selection should be based on the data available, the assumptions necessary, and the particular research question. The typical ROC AUC evaluation approach should be discontinued when only presence data are available, and evaluations in environmental dimensions should be adopted as part of the toolkit of ENM researchers. Our results suggest that selecting the Maxent ENM based solely on previous reports of its performance is a questionable practice. Instead, model comparisons, including diverse algorithms and parameterizations, should be the sine qua non for every study using ecological niche modeling. ENM evaluations should be developed using metrics that assess desired model characteristics instead of a single measurement of fit between model and data. The metrics proposed herein that assess model performance in environmental space (i.e., E‐space indices I and II) may complement current methods for ENM evaluation.

Step-by-step guide to partial ROC evaluations.
Ecological niche model evaluation using Partial ROC software (Barve 2008).
For details and the theoretical bases of this model evaluation method, see Peterson et al. (2008). Briefly, models are calibrated with one set of data and evaluated with an independent set of data. Evaluations consider the omission rate and the proportion of the evaluation area predicted as suitable, based on a sequence of threshold values across an evaluation area selected a priori. This spectrum of model predictions in a two-dimensional error space is then refined based on a user-defined error tolerance (E), which corresponds to how much "noise" is likely present in the occurrence dataset. For example, E = 5% would allow up to 5% of occurrences to be left out of thresholded predictions, as they may represent erroneous georeferencing rather than model failure (Peterson et al. 2008). For each threshold, omission rates are calculated, and only the part of the error space for which the omission rate is <E is considered. The curve tracing the omission values and proportions of the study area is then divided by the corresponding null expectations (see Fielding and Bell 1997) to create an "area under the curve (AUC) ratio." This process is repeated several times in Partial ROC from a series of replicates using a percentage of the evaluation points (usually 50%) selected randomly. This bootstrap step allows testing to establish whether a given AUC ratio is significantly above 1, which is the performance of a random classifier (Peterson et al. 2008).
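The AUC-ratio computation described above can be sketched in Python. This is a simplified illustration, not the Partial ROC software itself: the function names and the assumption that suitability values are available for every pixel of the evaluation area and for each evaluation occurrence are ours.

```python
import numpy as np

def partial_roc_ratio(raster_vals, occ_vals, E=0.05, n_thresholds=200):
    """Simplified AUC-ratio sketch following Peterson et al. (2008).

    raster_vals : suitability of every pixel in the evaluation area
    occ_vals    : suitability at the evaluation occurrences
    E           : allowed omission error (e.g., 0.05 for E = 5%)
    """
    thresholds = np.linspace(raster_vals.min(), raster_vals.max(), n_thresholds)
    # proportion of the evaluation area predicted suitable at each threshold
    frac_area = np.array([(raster_vals >= t).mean() for t in thresholds])
    # sensitivity = 1 - omission rate at each threshold
    sensitivity = np.array([(occ_vals >= t).mean() for t in thresholds])
    # keep only the part of ROC space where the omission rate is < E
    keep = sensitivity >= 1.0 - E
    x, y = frac_area[keep], sensitivity[keep]
    order = np.argsort(x)
    x, y = x[order], y[order]
    # trapezoidal areas under the model curve and under the null (y = x) line
    auc_model = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    auc_null = np.sum((x[1:] - x[:-1]) * (x[1:] + x[:-1]) / 2.0)
    return auc_model / auc_null

def bootstrap_ratios(raster_vals, occ_vals, n_reps=100, seed=0):
    """Replicates on random 50% resamples of the evaluation points."""
    rng = np.random.default_rng(seed)
    n = max(1, len(occ_vals) // 2)
    return np.array([partial_roc_ratio(raster_vals, rng.choice(occ_vals, n))
                     for _ in range(n_reps)])
```

Under this sketch, a model performs significantly better than random when at least 95% of the bootstrap ratios exceed 1 (alpha < 0.05), mirroring the interpretation of the Partial ROC output described below.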
Steps to run the evaluations using Partial ROC software.
1. The data available for the species should be split into one set for calibration and another set for evaluation.
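A minimal sketch of this split in Python (a random 50/50 split for illustration; the split could also be spatial or stratified, and the record IDs below are placeholders, not real occurrence data):

```python
import numpy as np

rng = np.random.default_rng(7)      # fixed seed so the split is repeatable
record_ids = np.arange(100)         # stand-ins for occurrence records
shuffled = rng.permutation(record_ids)
half = len(shuffled) // 2
calibration_set = shuffled[:half]   # used to fit the ENM
evaluation_set = shuffled[half:]    # held out for the partial ROC evaluation
```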

[Figure: map of the study region showing the calibration area with the calibration points and the evaluation area with the evaluation points.]

2. Models must be calibrated with one set of data: the calibration points.
3. Once the model is calibrated using the calibration occurrences, clip the model output to the evaluation area that was defined a priori and that is independent of the area used for model calibration (Hurlbert 1984). The evaluation points are distributed across this area, and the partial ROC test evaluates the coincidence between the points and the model predictions. Below is an example of the models' output in continuous format clipped to the evaluation region. Pixels with high values are in red, while pixels with low values are in green. Note that the coincidence of the evaluation points (black crosses) with the model predictions is the key to success of these evaluations; in particular, the GLM prediction shows good correspondence.
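In practice this clipping is done in a GIS or with a raster library; as a minimal illustration of the masking idea in plain NumPy (the toy grid and evaluation-area mask below are invented for the example):

```python
import numpy as np

# toy continuous suitability grid (values 0-1)
prediction = np.arange(16, dtype=float).reshape(4, 4) / 15.0
# pretend the lower-right 2x2 block is the evaluation area
eval_mask = np.zeros((4, 4), dtype=bool)
eval_mask[2:, 2:] = True
# pixels outside the evaluation area become NoData (NaN)
clipped = np.where(eval_mask, prediction, np.nan)
```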
4. The raster file of the evaluation area should be converted to ASCII. However, Partial ROC cannot read decimal values, so multiply the raster by 10,000 and round the pixel values to obtain only integers. For example, rasters from Maxent may originally range from 0 to 0.9; after this step, values in the ASCII file should range from 0 to 9,000.
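The rescaling in this step can be sketched as follows (toy pixel values; in practice the operation is applied to the whole raster before export to ASCII):

```python
import numpy as np

suitability = np.array([0.0, 0.1234, 0.5678, 0.9])  # example pixel values
scaled = np.round(suitability * 10000).astype(int)  # integer values for Partial ROC
# scaled is [0, 1234, 5678, 9000], matching the 0-9,000 range in the example
```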
5. Count the number of pixels in the raster for each value of prediction. Following the example, count how many pixels have a value of "0," how many pixels have a value of "1," and so on up to "9,000."

6. Make a list of the pixel counts (right column) for each value of prediction (left column), sorted from smallest to largest. Note that a suitability value of "0" should be counted (if no pixels are predicted as "0" suitability, add zero as the count of pixels for this value).

23. If the analysis was completed successfully, this window will appear:

24. The output file will be a CSV containing the Partial ROC ratios. Values above 1 represent predictions (of evaluation points in the evaluation area) better than random. Expect at least 95% of the ratios to be above 1 for an alpha <0.05. The partial ROC metric is also available in NicheA 3.0 (Qiao et al. 2015, 2016).
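Steps 5 and 6 (building the pixel-count table) can be sketched as follows, using a toy integer raster; `np.unique` does the counting, and the zero-count row required by Partial ROC is added when the value "0" is absent:

```python
import numpy as np

scaled_pixels = np.array([3, 3, 7, 7, 7, 9000])  # toy integer raster values
values, counts = np.unique(scaled_pixels, return_counts=True)  # sorted ascending
if 0 not in values:                  # Partial ROC expects a row for suitability "0"
    values = np.insert(values, 0, 0)
    counts = np.insert(counts, 0, 0)
# (prediction value, pixel count) pairs, smallest value first
table = list(zip(values.tolist(), counts.tolist()))
```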