Towards the modelling of true species distributions


Species distributions are the focal interest in a vast number of studies in ecology, biogeography and related sub-disciplines, including metapopulation ecology, invasive species and global warming studies. Typically, a model is used to relate the probability of species occurrence to explanatory variables, for example those describing environmental conditions. A species distribution model allows covariate relationships to be tested and then extrapolated to unsurveyed sites (e.g. as a distribution map) or to unsurveyed times (e.g. to predict response to climate change).

In recent years, species distribution modelling has experienced vigorous development at the intersection of ecology, biogeography, applied statistics and computer science, and several new techniques have appeared (Elith et al., 2006; Phillips et al., 2006). Surprisingly, however, virtually all approaches have neglected one important aspect of ecological field data, one which every naturalist knows very well: almost any species may be overlooked. That is, the probability of detection given presence (p) is seldom equal to 1. Indeed, uncertain detection adds much to the fun of being a field naturalist. It adds an element of hunting: if only I look harder, I’m more likely to bring in ‘prey’, i.e. to see an interesting species.

In academia, however, this age-old wisdom has often been lost; almost all approaches to species distribution modelling assume away the challenges of imperfect detection. This is worrisome, because inferences from models that erroneously assume that p = 1 will suffer from at least three problems.

  • 1 Occupancy will be underestimated. For instance, if a species occurs in 80% of grid cells, but its detection probability is 60%, it will be found in only 48% of cells. Thus, the ‘apparent distribution’ is modelled, i.e. the product of the probability of occupancy and the probability of detection, not the true distribution.
  • 2 Estimated slopes of covariate relationships are biased towards zero.
  • 3 Environmental relationships with detection probability may be wrongly identified as determinants of occurrence or may mask them.
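
The arithmetic behind the first problem can be checked with a short simulation. The following is only a minimal sketch in Python: the function names are illustrative and do not come from any occupancy package, and it assumes constant occupancy and detection probabilities and independent visits.

```python
import random


def apparent_occupancy(psi, p, visits=1):
    """Expected fraction of sites with at least one detection.

    psi: true occupancy probability; p: per-visit detection probability.
    """
    return psi * (1.0 - (1.0 - p) ** visits)


def simulate_naive_occupancy(psi, p, n_sites=10000, visits=1, seed=1):
    """Simulate surveys and return the fraction of sites where the species was seen."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sites):
        occupied = rng.random() < psi  # latent presence or absence
        # The species can only be detected at a site where it occurs.
        if occupied and any(rng.random() < p for _ in range(visits)):
            detected += 1
    return detected / n_sites


# The example from the text: 80% occupancy, 60% detection, one visit per cell.
expected = apparent_occupancy(0.8, 0.6)
simulated = simulate_naive_occupancy(0.8, 0.6)
print(f"expected apparent occupancy: {expected:.2f}")
print(f"simulated naive estimate:    {simulated:.3f}")
```

Under the same assumptions, raising `visits` shrinks the gap between apparent and true occupancy, but the gap only vanishes if detection is certain.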

To account for imperfect detection, MacKenzie et al. (2002) and Tyre et al. (2003) have independently proposed the use of zero-inflated binomial models, also called ‘site-occupancy models’, in the present context. This model class can be used to estimate true, not apparent, species distributions. In their simplest incarnation, site-occupancy models consist of two linked logistic regressions:

  • 1 Latent ecological process: z|ψ ∼ Bernoulli(ψ).
  • 2 Observation process: y|z ∼ Bernoulli(z × p).

Both processes are modelled as ‘coin-flips’, i.e. as Bernoulli distributions (or binomials with trial size 1). The imperfectly observed (latent) occurrence state, presence (z = 1) or absence (z = 0), is modelled separately from the ‘presence/absence’ observations, which are more correctly termed detection (y = 1) or non-detection (y = 0) observations. Occurrence, and hence the true species distribution, is governed by occurrence probability ψ, i.e. the probability that the abundance of the target species at a randomly chosen site is greater than zero. The observation process depends on detection probability p and maps occurrence state z onto observation y. Both parameters, ψ and p, can be modelled as functions of covariates, e.g. using a generalized linear model (GLM) or a generalized additive model (GAM) approach. To separately estimate both occurrence and detection parameters, at least two temporal, or more rarely spatial, independent replicate observations are required for a number of sites. Temporal replication must be over a short time period so that the occurrence state (z) of a site is unchanged.
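
To make the two-level structure concrete, the sketch below simulates replicated detection/non-detection data and recovers ψ and p by maximum likelihood. It is an illustration under simplifying assumptions, not the implementation used by any of the cited software: ψ and p are constant (no covariates), visits are independent, and a coarse grid search stands in for a proper numerical optimizer. All function names are hypothetical. The likelihood sums over the unknown state z: a site with d detections in J visits contributes ψ p^d (1 − p)^(J − d), plus (1 − ψ) if d = 0, because a site without detections may be either unoccupied or occupied but overlooked.

```python
import math
import random
from collections import Counter


def simulate_tally(psi, p, n_sites, n_visits, seed=42):
    """Simulate surveys; return a Counter mapping detections per site to site counts."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_sites):
        z = 1 if rng.random() < psi else 0  # latent ecological process
        # Observation process: each visit is a Bernoulli(z * p) trial.
        d = sum(rng.random() < z * p for _ in range(n_visits))
        tally[d] += 1
    return tally


def neg_log_lik(psi, p, tally, n_visits):
    """Negative log-likelihood with the latent state z summed out."""
    nll = 0.0
    for d, n in tally.items():
        lik = psi * p ** d * (1.0 - p) ** (n_visits - d)
        if d == 0:
            lik += 1.0 - psi  # never detected: the site may truly be unoccupied
        nll -= n * math.log(lik)
    return nll


def fit_by_grid(tally, n_visits, steps=99):
    """Grid search over (psi, p) in 0.01 steps; a stand-in for a real optimizer."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    _, psi_hat, p_hat = min(
        (neg_log_lik(a, b, tally, n_visits), a, b) for a in grid for b in grid
    )
    return psi_hat, p_hat


n_sites, n_visits = 2000, 3
tally = simulate_tally(psi=0.8, p=0.6, n_sites=n_sites, n_visits=n_visits)
naive = 1.0 - tally[0] / n_sites  # fraction of sites with at least one detection
psi_hat, p_hat = fit_by_grid(tally, n_visits)
print(f"naive occupancy: {naive:.3f}")
print(f"estimated psi:   {psi_hat:.2f}, estimated p: {p_hat:.2f}")
```

With data simulated this way, the naive proportion of sites with detections should fall below the true ψ, whereas the estimate from the occupancy likelihood should lie close to it.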

Site-occupancy models represent the only framework available for species distributions that is not affected by the above three problems in the presence of imperfect detection. The original model has been greatly extended, also to dynamic distributions; see Kéry et al. (2010) for an application. Unfortunately, the considerable potential of these models for species distributions has not been recognized as widely as I believe it should be.

In two recent papers, Karanth et al. (2009, 2010) illustrate how site-occupancy modelling can be used for rigorous inferences about species distributions. They did not aim low; instead, they used a country-wide survey of local experts to obtain replicate observations of mammal species occurrences for the entire Indian subcontinent! Their work represents one of the first attempts ever undertaken in the history of biogeography to map, for a very large area, true species distributions, i.e. potential or actual distributions corrected for all effects of imperfect detection. In addition, the determinants of occurrence that they identified are not biased by imperfect detection. Thus, their true species distribution maps, albeit crude, stand out from a myriad of apparent distribution maps that confound true absence and non-detection.

It is worth mentioning that recent modelling approaches for presence-only data (e.g. Phillips et al., 2006; Ward et al., 2009; Warton & Shepherd, 2010) cannot solve the problem of imperfect detection for species distribution models. The reason for this is that they suffer from an intrinsic non-identifiability of the occupancy parameter ψ (Ward et al., 2009). Moreover, the information about ψ, and about environmental relationships with ψ, stems from a comparison of the locations of recorded presences with the background, i.e. the larger area from which sampled sites are drawn. Presence-only modelling must by necessity assume that sites with presence records are a random sample from the background area. However, presence records can hardly ever be a random sample from the wider landscape and this will introduce bias (Phillips et al., 2009).

I would not argue that presence-only data or unreplicated detection/non-detection data should be discarded. If they are all we have, they should probably be analysed to answer questions of scientific or management relevance. But analysts ought to know, and acknowledge openly, that they are on thin ice: they are modelling ‘apparent distributions’ only, their estimates of covariate relationships will be biased by imperfect detection and all of their inferences may be biased by sample selection bias, especially in the case of presence-only data. The degree to which this applies to a particular study may or may not be fatal for their goals, but this usually remains unknown and unknowable.

I would like to offer a few final comments on site-occupancy models:

  • 1 They require replicated observations, and these may simply not be available. It is possible to deduce not only non-detection data, but also replicate observations, from general faunistic databases (Kéry et al., 2010). However, the effects of various types of detection heterogeneity on estimates of occupancy and covariate effects remain to be explored.
  • 2 The quality of the estimates from these models naturally declines with small sample size, e.g. few sites, few replicates, low p. Simulations may be required to know how good one’s estimates will be (Guillera-Arroita et al., 2010).
  • 3 Software to fit site-occupancy models, e.g. MARK and PRESENCE, is less advanced in some respects than many traditional species modelling tools. In particular, there is no integration with mapping software to automatically extrapolate estimated environmental relationships and produce maps (but see the new R package ‘unmarked’, which allows integration of analyses with mapping tools that are available in R; Fiske & Chandler, in press).
  • 4 Model and variable selection is cumbersome compared to super-flexible traditional modelling methods such as boosted regression trees or maximum entropy (Elith et al., 2006; Phillips et al., 2006). One would hope that the advantages of the two approaches, flexible covariate modelling and conceptual rigour, could be combined in the future. First steps in that direction are being explored (R.A. Hutchinson, School of Electrical Engineering and Computer Science, Oregon State University, pers. comm.).
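
The kind of simulation check mentioned in point 2 can be sketched as follows. This is a hypothetical, simplified design study, not the procedure of Guillera-Arroita et al. (2010): it assumes constant ψ and p, uses a coarse grid search in place of a proper optimizer, and all names and design values (25 versus 200 sites, ψ = 0.6, p = 0.3, three visits) are chosen purely for illustration.

```python
import math
import random
from collections import Counter


def simulate_tally(psi, p, n_sites, n_visits, rng):
    """Tally how many sites yielded 0, 1, ..., n_visits detections."""
    tally = Counter()
    for _ in range(n_sites):
        occupied = rng.random() < psi
        d = sum(rng.random() < p for _ in range(n_visits)) if occupied else 0
        tally[d] += 1
    return tally


def fit_psi(tally, n_visits, steps=49):
    """Grid-search maximum-likelihood estimate of psi (p is estimated jointly)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]  # 0.02 .. 0.98
    best_nll, best_psi = float("inf"), None
    for psi in grid:
        for p in grid:
            nll = 0.0
            for d, n in tally.items():
                lik = psi * p ** d * (1.0 - p) ** (n_visits - d)
                if d == 0:
                    lik += 1.0 - psi  # unoccupied sites also yield no detections
                nll -= n * math.log(lik)
            if nll < best_nll:
                best_nll, best_psi = nll, psi
    return best_psi


def rmse_of_psi(psi, p, n_sites, n_visits, n_reps=40, seed=7):
    """Root mean squared error of the psi estimate over repeated simulated surveys."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_reps):
        tally = simulate_tally(psi, p, n_sites, n_visits, rng)
        squared_errors.append((fit_psi(tally, n_visits) - psi) ** 2)
    return math.sqrt(sum(squared_errors) / n_reps)


# Compare a small and a larger survey for a species that is hard to detect.
results = {n: rmse_of_psi(psi=0.6, p=0.3, n_sites=n, n_visits=3) for n in (25, 200)}
for n, rmse in results.items():
    print(f"{n:4d} sites: RMSE of estimated psi = {rmse:.3f}")
```

Running such a comparison before fieldwork gives a rough idea of whether a planned number of sites and visits is likely to yield usable estimates for a species with a given detectability.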

To summarize my mini-review of species distribution modelling, I would say that without a doubt, species distribution modelling is more important than ever in the history of ecology. However, users of most traditional modelling methods must recognize the limitations of modelling apparent species distributions. Whenever within-site replicate observations are available, site-occupancy models may be adopted to yield estimates of true species distributions and unbiased covariate relationships. Creativity in deducing replicate observations from existing databases, and efforts made in designing the field work of future studies, will pay great dividends in terms of the rigour of the ensuing species distribution analyses. Finally, efforts are needed to merge site-occupancy models with modern variable selection techniques, and site-occupancy modelling ought to be integrated with mapping software. I believe that these developments would provide a great boost to species distribution modelling.

Editor: John Lambshead
