On evaluating species distribution models with random background sites in place of absences when test presences disproportionately sample suitable habitat


Correspondence: Adam B. Smith, Center for Conservation and Sustainable Development, Missouri Botanical Garden, PO Box 299, Saint Louis, MO 63166-0299, USA.

E-mail: adam.smith@mobot.org


The distributions of rare and invasive species are often modelled in situations where reliable absences for evaluating model performance are unavailable. In such cases, predictions at randomly located sites, or ‘background’ sites, can stand in for true absences. The maximum value of the area under the receiver operating characteristic curve, AUC, calculated with background sites is believed to be 1 − a/2, where a is the typically unknown prevalence of the species on the landscape. Using a simple example of a species' range, I show how AUC can exceed 1 − a/2 when test presences do not represent each inhabited region of the species' range in proportion to its area. Values of AUC that surpass 1 − a/2 are associated with higher model predictions in areas overrepresented in the test data set, even if those areas are less environmentally suitable than other regions the species occupies. Pursuit of high AUC values can encourage inclusion of spurious predictors in the final model if they help to differentiate areas with disproportionate representation in the test data. Choices made during modelling to increase AUC calculated with background sites, on the assumption that higher scores connote more accurate models, can therefore decrease actual accuracy when test presences disproportionately represent inhabited areas.
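The arithmetic behind the 1 − a/2 bound, and how it can be exceeded, can be sketched numerically. The following is a minimal illustration and not the example used in the paper: the two occupied regions "A" and "B", their areas, the test-presence shares and the model scores are all hypothetical values chosen for clarity, and `expected_auc` is a helper defined here that computes the expected AUC as the probability that a test presence outscores a background site, plus half the probability of a tie.

```python
from itertools import product


def expected_auc(presence_share, background_share, score):
    """Expected AUC = P(presence score > background score) + 0.5 * P(tie)."""
    auc = 0.0
    for p_region, b_region in product(presence_share, background_share):
        weight = presence_share[p_region] * background_share[b_region]
        if score[p_region] > score[b_region]:
            auc += weight
        elif score[p_region] == score[b_region]:
            auc += 0.5 * weight
    return auc


# Hypothetical landscape: fraction of the total area in each region.
# Background sites are random, so they fall in regions in proportion to area.
area = {"A": 0.1, "B": 0.3, "unoccupied": 0.6}
prevalence = area["A"] + area["B"]  # a = 0.4
bound = 1 - prevalence / 2          # 1 - a/2

# Hypothetical models: one scores both occupied regions equally,
# the other ranks region A above region B.
uniform_model = {"A": 1.0, "B": 1.0, "unoccupied": 0.0}
favours_a = {"A": 1.0, "B": 0.5, "unoccupied": 0.0}

# Test presences in proportion to occupied area (A holds 0.1 / 0.4 = 25%).
proportional = {"A": 0.25, "B": 0.75}
# Test presences that overrepresent region A (assumed 90% of records).
biased = {"A": 0.9, "B": 0.1}

auc_proportional = expected_auc(proportional, area, uniform_model)
auc_biased_uniform = expected_auc(biased, area, uniform_model)
auc_biased = expected_auc(biased, area, favours_a)

print(f"bound 1 - a/2:                       {bound:.3f}")
print(f"proportional presences, uniform:     {auc_proportional:.3f}")
print(f"biased presences, uniform model:     {auc_biased_uniform:.3f}")
print(f"biased presences, model favouring A: {auc_biased:.3f}")
```

Under these assumed values the model that scores all occupied cells equally sits exactly at the bound of 0.8 whether or not the test presences are biased, whereas the model that ranks the overrepresented region A above region B reaches 0.93. Nothing in the sketch makes region A more suitable than region B; the gain comes only from matching the sampling bias in the test presences.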