A comparison of Maxlike and Maxent for modelling species distributions

Summary

  1. Understanding species spatial occurrence patterns and their environmental dependence is one of the fundamental goals in ecology and evolution. Often, occurrence models are built with presence-only data because absence data are unavailable. We compare the strengths and limitations of the recently developed presence-only modelling method, Maxlike, with the more widely used Maxent.
  2. In spite of disparities highlighted by the developers of Maxlike and Maxent, we show approximate formal relationships between the parameters of Maxlike and Maxent for two scenarios to illustrate their similarity. Using case studies based on real and simulated data, we show how these similarities manifest in practice.
  3. We find more similarities than differences between Maxlike and Maxent, including coefficient values, predicted spatial distributions, similarity to presence–absence models, predictive performance and ranking the predicted suitability of cells. Maxlike reliably predicted absolute occurrence probabilities for very large data sets on landscapes where occurrence probability approximately spanned [0,1]. For smaller data sets, the uncertainty in predicted occurrence probability by Maxlike was very large, due to the inherent limitations of presence-only data. In contrast, Maxent is constrained to predicting relative occurrence probabilities or relative occurrence rates unless it is provided with additional information from presence–absence data. Both models can reliably predict relative differences in occurrence probability.
  4. The choice of which model to use depends partly on sampling assumptions, which we discuss in detail. Due to limitations of presence-only data, ecologists should typically focus on interpretations relying on relative differences in occurrence probability or relative occurrence rates. We discuss how to remedy a number of concerns about the use of Maxent and how to avoid some potential pitfalls with Maxlike – particularly related to high variance predictions. We conclude that both methods are similarly valuable for understanding and predicting species’ distributions in terms of relative differences in occurrence probability when the models are specified carefully.

Introduction

Understanding species spatial occurrence patterns and their environmental dependence is one of the fundamental goals in ecology and evolution. The most common type of occurrence models correlates records of presence, and sometimes absence, with environmental covariates. While correlative occurrence models have some important shortcomings (Anderson 2012; Araújo & Peterson 2012), the limitation of available data makes them one of the few tools available for understanding spatial distributions for the majority of species on earth. Data limitations are particularly apparent for presence-only (PO) data (Graham et al. 2004) – composed only of presence (and not absence) records – which has led to a variety of modelling methods to maximize inference (Manly, McDonald & Thomas 1993; Lele & Keim 2006; Phillips, Anderson & Schapire 2006; Royle et al. 2012). The most popular PO model has been Maxent (>2000 citations; Phillips, Anderson & Schapire 2006), which was originally developed in a machine learning framework but has been recently derived via maximum likelihood (Halvorsen 2012) and related to a variety of generalized linear models (Aarts, Fieberg & Matthiopoulos 2012; Fithian & Hastie 2013; Renner & Warton 2013). Recently, Royle et al. (2012) advocated an occurrence model for PO data, informally called Maxlike, to remedy issues that they found with Maxent. Royle et al. show that Maxlike is capable of estimating absolute occurrence probability (OP; the probability that the species is present in a grid cell), whereas Maxent cannot. In particular, Royle et al. focus criticism on the use of Maxent's logistic output to predict OP, with which we concur. Here, we illustrate the formal and practical relationship between Maxlike and Maxent under different sampling assumptions and a more appropriate application of Maxent.

The typical data for PO occurrence models consist of three components: (i) a collection of point locations where a species has been recorded, (ii) a gridded landscape that the modeller considers available to the species of interest (i.e. the species could migrate to all locations in that landscape), and (iii) environmental covariates for each of the grid cells. The point locations are typically aggregated to the grid cell in which they occur. These models assume that grid cells have been randomly sampled for presence, and thus, records occur in proportion to the species’ habitat usage preferences. Regardless of whether 1 individual or 100 individuals are observed, a 1 is recorded. Absences (0's) are unknown. Background samples are selected randomly from the gridded landscape to compare against the presence cells and determine whether environmental conditions are used disproportionately to their availability (Keating & Cherry 2004; Lele & Keim 2006). The goal of the analysis is to determine whether environmental conditions are used (presences) disproportionately to their availability (background).

Though it may seem appealing to understand species’ occurrence patterns in terms of OP, its definition poses some practical limitations. The OP of a grid cell describes the proportion of (hypothetical) cells with equivalent environmental conditions where a species is expected to be observed. This definition of OP is not purely biological, as we might intend; the OP depends on the size of the cell (larger cells should have larger OP) and is typically confounded with sampling effort (more search effort would lead to a larger probability of observation). Alternative models have been built to accommodate both of these limitations (e.g. Kery, Gardner & Monnerat 2010; Chakraborty et al. 2011; Keil et al. 2013); however, they require information on abundance or measures of sampling effort, which are typically unknown. Here, we focus on the more common occurrence models that ignore these issues, although these limitations are pertinent for model interpretation.

In this article, we begin by discussing the way in which Maxlike and Maxent model species occurrence. We show that Maxent is better suited to model relative abundance under different sampling assumptions than those typically made for modelling OP. We then illustrate the similarity between the methods by showing two scenarios where approximate analytic relationships between the slope parameters of each model hold. Consequently, we find more similarities than differences between Maxlike and Maxent. In the discussion, we highlight pitfalls of practical applications of both models and how they can be addressed.

Model definition

Modelling occurrence probability with Maxlike and Maxent

To formalize the differences between Maxlike's and Maxent's predictions, it is helpful to refer to their common likelihood function. Using yi = 1 to denote a presence at grid cell xi, and ψ(yi = 1|xi, β0, β) to denote OP, the likelihood for Maxlike is given by Royle et al. (2012):

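$$L(\beta_0, \beta) = \prod_{i=1}^{N} \frac{\psi(y_i = 1 \mid x_i, \beta_0, \beta)}{\sum_{x \in B} \psi(y = 1 \mid x, \beta_0, \beta)} \qquad \text{(eqn 1)}$$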

where N is the total number of presences, B is the set of background cells, β0 is an intercept parameter, and β is the vector of slope coefficients associated with environmental covariates. The numerator describes the likelihood at presence cells while the denominator describes the likelihood at background cells. Often, background cells are taken as a random sample of cells on the landscape where the model is fit (Lele & Keim 2006; Lele 2009; Royle et al. 2012).

The entropy maximization performed by Maxent turns out to be equivalent to maximizing the likelihood in equation (eqn 1) under certain assumptions. The maximum entropy distribution (described in detail below) can be obtained by maximizing the likelihood of a Gibbs distribution, given by q(x; β) = exp(βz(x))/ΣB exp(βz(x)), where z(x) is a vector of environmental covariates in cell x (Della Pietra et al. 1997; Phillips et al. 2004). Given this equivalence, and comparing equation (eqn 1) with the Gibbs distribution, it is apparent that Maxent's model can be obtained by maximizing the likelihood in equation (eqn 1) by making the parametric assumption that ψ(y = 1|x, β) = exp(βz(x)). Note that β0 is missing from this equivalence, a point that we return to below. Operationally, Maxent optimizes a penalized version of the likelihood in equation (eqn 1) (Phillips & Dudik 2008; Merow, Smith & Silander 2013).

The difference between Maxlike and Maxent derives from the choice of link functions used for ψ. Maxlike uses a logit-linear model:

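$$\psi(y_i = 1 \mid x_i, \beta_0, \beta) = \frac{\exp(\beta_0 + \beta z(x_i))}{1 + \exp(\beta_0 + \beta z(x_i))} \qquad \text{(eqn 2)}$$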

Maxent uses a loglinear model:

$$\psi(y_i = 1 \mid x_i, \beta) = \exp(\beta z(x_i)) \qquad \text{(eqn 3)}$$

The important difference between these functions is the intercept, β0, which differentiates their ability to predict OP. Using the loglinear model, Maxent cannot estimate an intercept because it cancels from the likelihood in equation (eqn 1) (Royle et al. 2012). The intercept effectively defines the expected prevalence across a landscape, which attaches a scale to the relative differences defined by the slope parameters (Fig. 1; Hastie & Fithian 2013). Prevalence is defined as ψ(y = 1) = ΣB ψ(yi = 1|xi)/|B|, where |B| is the number of background cells; that is, the proportion of occupied cells, or the average OP across the landscape. Hence, large values of the intercept increase the prevalence.
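The cancellation of the intercept can be checked numerically. The following minimal Python sketch evaluates the log of a likelihood with the structure of equation (eqn 1) under both link functions. The covariate values, the slope of −1, the candidate intercepts and all function names are hypothetical choices for illustration and are not taken from the analyses in this paper.

```python
import math

def log_likelihood(psi, presence_z, background_z):
    """Log-likelihood with the structure of eqn 1: each presence contributes
    log(psi at that cell / sum of psi over all background cells)."""
    denominator = sum(psi(z) for z in background_z)
    return sum(math.log(psi(z) / denominator) for z in presence_z)

def loglinear(beta0, beta):
    """Loglinear (exponential) model; beta0 is included only to show that it cancels."""
    return lambda z: math.exp(beta0 + beta * z)

def logit_linear(beta0, beta):
    """Logit-linear (logistic) model."""
    return lambda z: 1.0 / (1.0 + math.exp(-(beta0 + beta * z)))

# Hypothetical covariate values at background cells and at presence cells.
background_z = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
presence_z = [-2.0, -1.5, -1.0, -1.0, 0.0]

beta = -1.0
intercepts = (-3.0, 0.0, 3.0)
ll_loglinear = [log_likelihood(loglinear(b0, beta), presence_z, background_z)
                for b0 in intercepts]
ll_logistic = [log_likelihood(logit_linear(b0, beta), presence_z, background_z)
               for b0 in intercepts]

print("loglinear:", [round(v, 6) for v in ll_loglinear])  # same value for every intercept
print("logistic: ", [round(v, 6) for v in ll_logistic])   # changes with the intercept
```

Under the loglinear form the factor exp(β0) appears in both numerator and denominator, so the likelihood carries no information about the intercept. Under the logistic form the likelihood does change with the intercept, which is what allows Maxlike to estimate it.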

Figure 1.

Graphical illustration of Maxlike's estimation of prevalence from presence-only data. Values of an environmental covariate (z) were simulated at 10 000 background cells, and 1000 presences were sampled with probability given by logit(ψ(z)) = −1 − 1*z. (a) The black line (grey interval) shows Maxlike's median (95% interval) predicted OP over 500 replicate presence samples. (b) The black (grey) probability density shows the empirical distribution of z at background (presence) cells. The red and blue lines illustrate how Maxlike estimates prevalence (i.e. the intercept in eqn (eqn 2)). The blue line shows the empirical ratio between the two densities in (b) while the red line shows this same ratio multiplied by the estimated value of prevalence (eqn (eqn 4)). Maxlike chooses the value of prevalence by which to multiply the empirical ratio (blue) to make it as similar as possible (red) to a logistic curve (black). Note the wide range of predicted OP (grey 95% interval) for negative values of the covariate that can result from even a simple univariate model with a large sample size.

Maxlike

Maxlike uses a logit-linear model for two reasons. First, a logistic model ensures that predicted OP lies on [0,1], whereas Maxent's loglinear model does not. Secondly, because no two logistic functions are exactly proportional, the intercept in equation (eqn 2) is estimable from the likelihood in equation (eqn 1). Hastie & Fithian (2013) point out that the lack of proportionality that enables Maxlike to differentiate the likelihoods for different values of the intercept may be very slight. Indeed, our demonstrations below show that this leads to high variance estimates of the intercept.

It may seem counterintuitive that prevalence is estimable from PO data. To determine the proportion of occupied cells, it would seem the proportion of unoccupied cells must also be known. To get around this data limitation, Maxlike takes advantage of the definition of OP: the proportion of occupied cells is equivalent to the average true OP across the landscape. Of course, the true OP is unknown, so Maxlike assumes that the average predicted OP across the landscape approximates the average true OP (denominator in eqn (eqn 1)).

To illustrate the connection between Maxlike and data shown in Fig. 1, it is helpful to rewrite ψ(y = 1|x), which describes OP in geographic space, as ψ(y = 1|z(x)), which describes OP in terms of environmental covariate space. Here, z(x) denotes the set of environmental covariates in cell x (Aarts, Fieberg & Matthiopoulos 2012; Royle et al. 2012). Using Bayes’ rule, the OP can be rewritten as (Phillips et al. 2009),

$$\psi(y = 1 \mid z(x)) = \frac{\pi(z(x) \mid y = 1)\, \psi(y = 1)}{\pi(z(x))} \qquad \text{(eqn 4)}$$

where π(z(x)|y = 1) is the empirical probability density of the environmental predictors at presence cells and π (z(x)) is the empirical probability density of the environmental predictors at background cells. These empirical probability densities can be thought of simply as frequency histograms of a predictor, for example mean annual precipitation (Fig. 1b). Figure 1 graphically illustrates how Maxlike predicts OP for a model with a single environmental covariate; OP is a model for the ratio of the density at presence cells (grey) to background cells (black), rescaled by the prevalence. The blue line gives the exact, empirical ratio between the densities, while the red line shows the same ratio multiplied by the estimated value of the prevalence. Maxlike is able to estimate the intercept by estimating a value for the prevalence that causes the red line to be as close as possible to a logistic curve (black line). Hence, Maxlike's parametric assumption is critical for determining the intercept and hence the predicted prevalence.
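The role of the parametric assumption can be illustrated with a small numerical sketch. It is not Maxlike's fitting algorithm. It only shows, on a hypothetical five-value landscape with an assumed true model logit(ψ) = −1 − 1*z, that rescaling the presence-to-background density ratio by the correct prevalence yields a curve that is exactly linear on the logit scale, whereas other candidate prevalences do not. All names and values are illustrative assumptions.

```python
import math

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical landscape: five equally spaced covariate values and the
# share of background cells found at each value.
z_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
background_density = [0.1, 0.2, 0.4, 0.2, 0.1]

# Assumed "true" OP, logit(psi) = -1 - 1*z, and the prevalence it implies.
true_op = [logistic(-1.0 - z) for z in z_values]
true_prevalence = sum(d * p for d, p in zip(background_density, true_op))

# Density of the covariate at presence cells and its ratio to the background density.
presence_density = [d * p / true_prevalence
                    for d, p in zip(background_density, true_op)]
density_ratio = [pd / bd for pd, bd in zip(presence_density, background_density)]

def curvature_on_logit_scale(prevalence):
    """Largest absolute second difference of logit(prevalence * ratio).
    It is zero only when the rescaled ratio is exactly logistic in z."""
    op = [prevalence * r for r in density_ratio]
    eta = [logit(p) for p in op]
    return max(abs(eta[i - 1] - 2.0 * eta[i] + eta[i + 1])
               for i in range(1, len(eta) - 1))

for candidate in (0.5 * true_prevalence, true_prevalence, 1.2 * true_prevalence):
    print(round(candidate, 4), round(curvature_on_logit_scale(candidate), 4))
```

Only the true prevalence gives zero curvature on the logit scale. With sampled rather than exact densities, the curvature differences between candidate prevalences can be small relative to sampling noise, which is consistent with the high variance of intercept estimates discussed in the text.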

Maxent

Maxent's loglinear model and its inability to estimate an intercept can be understood by its connection to inhomogeneous Poisson process (IPP) models (cf. Warton & Shepherd 2010; Chakraborty et al. 2011), although other derivations are possible (Phillips, Anderson & Schapire 2006; Halvorsen 2012; Merow, Smith & Silander 2013). IPPs provide a general framework linking a variety of occurrence modelling strategies, including Maxent, discretized Poisson count models, naïve logistic regression (using the background as absences) and weighted logistic regression (Aarts, Fieberg & Matthiopoulos 2012; Fithian & Hastie 2013; Renner & Warton 2013). Although our discussion below focuses on comparison of Maxlike with Maxent, it extends more generally to a variety of generalized linear models under appropriate assumptions about discretization of space, distributions and integration techniques.

Rather than assuming random sampling of space, as Maxlike does, IPPs assume that presence records represent a random sample of individuals, and thus, records occur in proportion to population density (‘density estimation’; Dudik, Phillips & Schapire 2004; Baddeley et al. 2010). With this assumption, the goal is to predict the density of presence records at point locations in continuous space (as opposed to grid cells) using an intensity function, denoted λ(x) (Warton & Shepherd 2010). λ(x) is often called an occurrence rate and has units of number of records per unit area (Fithian & Hastie 2013). Integrating λ(x) over a region A returns the expected number of presence records in A and is given by Λ(A) = ∫A λ(x) dx. Integrating λ(x) over the entire landscape returns the expected total number of presence records. The IPP's occurrence rate derives from the continuous limit of a Poisson count model for increasingly fine discretizations of the landscape (Fithian & Hastie 2013). If N(A) denotes the number of observations in area A, it is distributed as (Fithian & Hastie 2013):

$$N(A) \sim \mathrm{Poisson}(\Lambda(A)) \qquad \text{(eqn 5a)}$$

As the area of A approaches zero, Λ(A) divided by the area of A approaches λ(x). The intensity λ(x) is often modelled as a loglinear function (Warton & Shepherd 2010):

$$\lambda(x) = \exp(\beta_0 + \beta z(x)) \qquad \text{(eqn 5b)}$$

Notably, the IPP estimates an intercept.
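As a minimal sketch of how the intensity and the expected count relate, the Python example below integrates a loglinear intensity over a one-dimensional landscape with a midpoint rule. The landscape [0, 1], the use of the location itself as the covariate, the parameter values and the function names are all hypothetical assumptions made for illustration.

```python
import math

def intensity(x, beta0, beta):
    """Loglinear intensity; the covariate is taken to be the location x itself."""
    return math.exp(beta0 + beta * x)

def expected_count(lower, upper, beta0, beta, n_cells=1000):
    """Midpoint-rule approximation of the integral of the intensity over [lower, upper]."""
    width = (upper - lower) / n_cells
    return sum(intensity(lower + (i + 0.5) * width, beta0, beta) * width
               for i in range(n_cells))

beta0, beta = 3.0, 1.5  # hypothetical parameter values

total = expected_count(0.0, 1.0, beta0, beta)
exact = math.exp(beta0) * (math.exp(beta) - 1.0) / beta  # closed form for this intensity
left_half = expected_count(0.0, 0.5, beta0, beta)
right_half = expected_count(0.5, 1.0, beta0, beta)

# Adding log(2) to the intercept doubles the expected number of records everywhere.
doubled = expected_count(0.0, 1.0, beta0 + math.log(2.0), beta)

print(round(total, 3), round(exact, 3))
print(round(left_half + right_half, 3), round(doubled / total, 3))
```

The intercept only sets the overall number of expected records, while the slope determines how those records are distributed across the landscape.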

Maxent differs in two ways from the IPP: it uses a discretized landscape rather than a continuous one and it assumes that the total number of records is fixed rather than random. To achieve this, Maxent uses a discrete Poisson model in cell xi:

$$N(x_i) \sim \mathrm{Poisson}(\lambda(x_i)) \qquad \text{(eqn 6a)}$$

with

$$\lambda(x_i) = \exp(\beta z(x_i)) \qquad \text{(eqn 6b)}$$

The first difference between these equations and equations (eqn 5a) and (eqn 5b) is that presence records, which occur across a continuous landscape, are now aggregated to the grid cell xi. The second difference is that Maxent does not estimate the intercept (Aarts, Fieberg & Matthiopoulos 2012). The intercept in the IPP (eqns (eqn 5a) and (eqn 5b)) determines the total number of presence records, which is typically not of interest (Fithian & Hastie 2013). Often, the counts are proportional, but not equal, to abundance, and hence, one is restricted to modelling the relative occurrence rate (ROR; Fithian & Hastie 2013; Renner & Warton 2013), as Maxent does. To understand the distinction between abundance and ROR, consider the following example (adapted from Fithian & Hastie 2013): for the Carolina wren data shown in Fig. 2, 1506 presences were observed. If twice as many presences were observed, should that indicate that there are twice as many wrens or twice as many birders looking for wrens? Because not all individuals are observed in any subset of cells, the best we can do is model relative differences in occurrence rates between cells. The ROR is typically reported as a density function given by,

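$$\pi(x_i) = \frac{\lambda(x_i)}{\sum_{x \in B} \lambda(x)} = \frac{\exp(\beta z(x_i))}{\sum_{x \in B} \exp(\beta z(x))} \qquad \text{(eqn 7)}$$

Figure 2.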

Comparison among predictions from a GLM using PA data and Maxlike and Maxent using PO data. Panels (a–c) show the models described in this paper for relative OP on the cumulative output scale, while (d–f) are reproduced from Royle et al.'s (2012) models for absolute OP. AUC and COR are calculated with respect to PA data. All models of relative OP predicted very similar spatial patterns.

The ROR, π(x), is known as the software's raw output. Given that an individual was observed, π(x) describes the relative probability that the individual derived from each cell on the landscape (Phillips, Anderson & Schapire 2006; Merow, Smith & Silander 2013). Alternatively, if π(x) is interpreted in environmental space as in equation (eqn 4) (π(z(x))), it describes the predicted probability density of presences in environment z (Fig. 1).
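The birder example can be sketched numerically. In the hypothetical Python example below, doubling the search effort doubles the occurrence rate in every cell (equivalent to adding log 2 to an intercept), yet the normalized raw output is unchanged. The covariate values, coefficients and function name are illustrative assumptions.

```python
import math

def relative_occurrence_rate(rates):
    """Normalize occurrence rates so that they sum to one over the landscape."""
    total = sum(rates)
    return [r / total for r in rates]

# Hypothetical covariate values in five grid cells and a hypothetical slope.
cell_covariate = [-1.0, -0.3, 0.2, 0.9, 1.4]
beta = 0.8

# Occurrence rates under a baseline effort, and with twice as many observers.
rates_baseline = [math.exp(-2.0 + beta * z) for z in cell_covariate]
rates_double_effort = [2.0 * r for r in rates_baseline]

ror_baseline = relative_occurrence_rate(rates_baseline)
ror_double_effort = relative_occurrence_rate(rates_double_effort)

print([round(r, 4) for r in ror_baseline])
print([round(r, 4) for r in ror_double_effort])  # same values as the line above
```

Because the normalization removes any constant multiplier, the raw output cannot distinguish more wrens from more birders, which is why only relative differences between cells are interpretable.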

From this derivation of Maxent's prediction, it seems clear that Maxent is better suited for predicting ROR than OP. By comparison of Maxent's raw output (eqn (eqn 7)) to the likelihood function in equation (eqn 1), it is apparent that λ(x) is equivalent to ψ(y = 1|x) (Royle et al. 2012). However, interpreting Maxent's prediction of λ(x) as OP is problematic for two reasons: λ(x) can be greater than unity, and λ(x) derives from the assumption of random sampling of individuals, not the random sampling of space typically associated with OP (Royle et al. 2012). Because λ(x) is not bounded on [0,1], additional assumptions are needed to convert these predictions to OP in the Maxent software package (Phillips & Dudik 2008). Due to problems with these assumptions (which lead to the so-called logistic output, discussed in Section V.A), we focus only on Maxent's raw output for the remainder of the paper.

Nonetheless, further accommodations have been made to interpret Maxent's output in terms of OP. To handle the issue of sampling design (individuals vs. space), Maxent removes duplicate observations in a grid cell to approximate a random sample of space. Conditional on the acceptability of this aggregation and thinning of the data (see 'Discussion' in Aarts, Fieberg & Matthiopoulos 2012), raw output can be interpreted as OP (ψ in eqn (eqn 3)) rescaled by a constant (the sum of predictions over all background cells). Hence, raw output is proportional to OP, and we refer to it as relative OP when treating the data as a random sample of space. Relative OP describes how cells compare to one another, with the ratio of relative OP equal to the ratio of OP between two cells (Lele & Keim 2006).

Linking Maxlike and Maxent

To summarize the previous two sections, Maxent's output can be interpreted in one of two ways, depending on sampling assumptions. If data are assumed to be a random sample of space, we can predict relative OP with Maxent, but not OP. If data are assumed to be a random sample of individuals, Maxent can predict ROR. Each of these interpretations is formally linked to Maxlike under the circumstances discussed below.

First, we consider the case where Maxent is used to model relative OP. In the limit of decreasing cell size on the discretized landscape, the parameters of logit-linear and loglinear models are asymptotically equivalent (Baddeley et al. 2010). As cell size decreases, the value of OP (denoted by ψ) in a given cell decreases. Maxlike models logit(ψ) = log(ψ/(1−ψ)) ≈ log(ψ), where approximate equivalence is achieved for small ψ. Because Maxent models log(ψ) explicitly, the models are equivalent in the limit of a high-resolution gridded landscape. This is not surprising, because an exponential function looks reasonably similar to a logistic function until the logistic function asymptotes.
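A quick numerical check of this approximation is sketched below. The gap between the two scales is logit(ψ) − log(ψ) = −log(1−ψ), which shrinks towards zero as ψ becomes small. The chosen values of ψ are arbitrary illustrations.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# Arbitrary occurrence probabilities, from large to small.
occurrence_probabilities = [0.5, 0.1, 0.01, 0.001]

# Difference between the logit and log scales at each probability.
gaps = [logit(p) - math.log(p) for p in occurrence_probabilities]

for p, gap in zip(occurrence_probabilities, gaps):
    print(p, round(gap, 6))
```

The gap is roughly equal to ψ itself once ψ is small, so the two links differ appreciably only where predicted OP is high.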

Secondly, we consider the case where Maxent is used to model ROR. The probability that a cell is occupied is equal to the probability that the number of counts there is greater than zero. The probability of non-zero counts is equal to 1 minus the probability of zero counts:

$$\psi(y_i = 1 \mid x_i) = \Pr(N(x_i) > 0) = 1 - \Pr(N(x_i) = 0) \qquad \text{(eqn 8)}$$

For the Poisson model in equation (eqn 5a), the probability of k observations in a cell is given by Pr(N(xi) = k) = λ(xi)^k exp(−λ(xi))/k!. Hence, the probability of zero observations is Pr(N(xi) = 0) = exp(−λ(xi)). Using this result with a general loglinear model for λ(xi) (e.g. eqn (eqn 5b)), we obtain,

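$$\psi(y_i = 1 \mid x_i) = 1 - \exp(-\lambda(x_i)) = 1 - \exp\left(-\exp(\beta_0 + \beta z(x_i))\right) \qquad \text{(eqn 9)}$$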

This result shows that the slope coefficients (β) estimated by a model for ROR, using count data, are closely related to those obtained from a model for OP, using presence–absence data, if a complementary log–log link function were used (Baddeley et al. 2010). To the extent that the logit link (used by Maxlike) is a good approximation to the complementary log–log link (which is also preferred on theoretical grounds; Baddeley et al. 2010), and when the same method of parameter estimation is used, the slope coefficients produced by Maxlike and a model of ROR from Maxent should be approximately equivalent (as seen in Table 1).
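The link between a Poisson rate and OP can be verified with a short sketch. The Python example below checks that the probability of a non-zero Poisson count equals 1 − exp(−λ), and that applying the complementary log–log transform to the resulting OP recovers the linear predictor of the loglinear rate model. The parameter values and function names are hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k records under a Poisson model with the given rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

def occupancy_from_rate(rate):
    """Probability of at least one record: 1 - Pr(N = 0)."""
    return 1.0 - math.exp(-rate)

def cloglog(p):
    return math.log(-math.log(1.0 - p))

def logit(p):
    return math.log(p / (1.0 - p))

beta0, beta = -1.0, -1.0  # hypothetical coefficients of a loglinear rate model

rows = []
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    eta = beta0 + beta * z            # linear predictor
    rate = math.exp(eta)              # occurrence rate in the cell
    psi = occupancy_from_rate(rate)   # implied occurrence probability
    rows.append((z, eta, psi, cloglog(psi), logit(psi)))

# Summing the probabilities of 1, 2, 3, ... records gives the same occupancy.
nonzero_probability = sum(poisson_pmf(k, 0.7) for k in range(1, 50))

for z, eta, psi, cll, lgt in rows:
    print(z, eta, round(psi, 4), round(cll, 4), round(lgt, 4))
```

The complementary log–log column reproduces the linear predictor exactly, while the logit column tracks it closely where OP is small and departs from it where OP is large.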

Table 1. Comparison of coefficients among models for the wren data shown in Fig. 2a–c. All models exhibit the same qualitative pattern of sign and relative magnitude of coefficients. We do not expect their parameters to be identical given the models’ different functional forms.

Coefficient        GLM       Maxlike   Maxent
Intercept          −1·065    −1·698    NA
Latitude           −5·049    −6·671    −4·124
Latitude²          −3·762    −5·046    −2·370
Longitude           6·409     8·173     7·621
Longitude²         −2·051    −1·159    −2·418
% Deciduous         0·704     1·288     1·286
% Deciduous²       −0·136    −0·236    −0·635
% Mixed Forest     −0·184    −0·224    −0·080
% Mixed Forest²     0·025     0·021     0·000

Equation (eqn 8) highlights a general connection between models of occurrence and abundance that extends from Maxent to IPPs. If an IPP is used to obtain λ(xi), it is possible to estimate an intercept (conditional on the presences being a random sample of individuals). Equation (eqn 8) shows that one can move freely between estimates of ROR and OP. However, the IPP intercept estimate does not account for the possibility of missing presence points; hence, estimates of the intercept (and prevalence) may be biased low. These missing presences prevent us from modelling occurrence rate, and limit us to ROR, as discussed above. In contrast, Maxlike does not appear to suffer from the influence of missing presences but instead leverages off the departure from loglinearity to estimate the intercept (which is criticized by Hastie & Fithian 2013).

These formal connections between Maxlike and Maxent show why the predictions of each model are similar, which we further illustrate with examples below.

Methods

To illustrate the similarities, strengths and weaknesses of Maxent and Maxlike, we follow the same two-pronged approach as Royle et al. (2012): analysing (i) data on the distribution of the Carolina wren (Thryothorus ludovicianus) across the United States and (ii) the simulated data described by Royle et al.

Similarities between Maxent and Maxlike are apparent if their predictions are on the same relative scale and span the same range. Royle et al. (2012) aptly identify the problems with Maxent's logistic output (see 'Discussion'), which means that we cannot compare Maxent and Maxlike predictions of OP. Instead, we degrade Maxlike's predictions of OP to relative OP by dividing by the sum of predicted OP over all cells. For display, we convert predictions to cumulative output (a common format for Maxent), where the value assigned to a cell is the sum of all raw values less than or equal to the raw value for that cell, rescaled to lie between 0 and 100 (Phillips, Anderson & Schapire 2006).
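The two rescalings described above can be sketched as follows. This is a minimal Python illustration of the stated definitions, not the code used by the Maxent software. The function names and the four example OP values are hypothetical.

```python
def to_relative(occurrence_probabilities):
    """Degrade OP to relative OP by dividing by the sum of OP over all cells."""
    total = sum(occurrence_probabilities)
    return [p / total for p in occurrence_probabilities]

def cumulative_output(raw):
    """For each cell, sum all raw values less than or equal to that cell's
    raw value, then rescale so that the result lies between 0 and 100."""
    total = sum(raw)
    return [100.0 * sum(r for r in raw if r <= value) / total for value in raw]

# Hypothetical predicted OP in four cells.
predicted_op = [0.2, 0.4, 0.6, 0.8]

relative_op = to_relative(predicted_op)
cumulative = cumulative_output(relative_op)

print([round(v, 3) for v in relative_op])
print([round(v, 1) for v in cumulative])
```

The quadratic-time loop is kept for readability. For a large landscape, sorting the raw values once and taking a running sum would give the same result more efficiently.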

Wren Data

We built models for the Carolina wren data using the same suite of features used by Royle et al. Because presence–absence (PA) data were available, we used logistic regression (denoted GLM) to build the ‘true’ model; that is, the best predictions of OP that we could achieve given absence data. We compared the GLM to Maxlike and Maxent to determine the similarity of relative OP predictions. Only linear and quadratic features were used in all models. We expect greater similarity between the GLM and Maxlike a priori simply because they use the same logistic functional form. Because it did not seem relevant to make comparisons among modelling strategies while using spurious variables, we refined the variables incorporated in the model using forward and backward stepwise AIC selection on the features used by Royle et al. in the GLM (i.e. the ‘true’ model). This removed features related to the percentage of coniferous forest and grassland. Using these reduced models, we also compared output from Maxlike and Maxent models built with 50, 100, 300 and 500 presences to determine the effect of sample size on predictions. We evaluated the fit of all models to 4615 presence–absence observations using the area under the receiver operating characteristic curve (AUC; Fielding & Bell 1997) and the point biserial correlation (COR; i.e. Pearson correlation with one binary variable; Zheng & Agresti 2000).
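For readers unfamiliar with the two evaluation statistics, the sketch below computes them from first principles in Python. AUC is calculated as the proportion of presence–absence pairs in which the presence cell receives the higher prediction (ties count as one half), and COR as the Pearson correlation between predictions and the binary observations. The six predictions and observations are hypothetical, and the functions are illustrative rather than those used in our analyses.

```python
import math

def auc(scores, labels):
    """Proportion of (presence, absence) pairs ranked correctly; ties count 0.5."""
    presences = [s for s, y in zip(scores, labels) if y == 1]
    absences = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > a else 0.5 if p == a else 0.0
               for p in presences for a in absences)
    return wins / (len(presences) * len(absences))

def point_biserial(scores, labels):
    """Pearson correlation between a continuous score and a binary label."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(labels) / n
    cov = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, labels))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    return cov / math.sqrt(var_s * var_y)

# Hypothetical predictions and presence (1) / absence (0) observations.
predictions = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
observations = [1, 1, 1, 0, 0, 0]

print(round(auc(predictions, observations), 3))
print(round(point_biserial(predictions, observations), 3))
```

AUC depends only on the ranking of cells, so it is unchanged by the rescaling from OP to relative OP. COR is likewise unchanged by multiplying all predictions by a positive constant.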

Simulated Data

We simulated data to compare Maxlike and Maxent predictions on the relative OP scale and to explore the variability in predictions from Maxlike for different sample sizes of presences (50, 100, 200, 500, 1000 and 2000) using the same model as Royle et al. (2012). This model simulates the ‘true’ OP as a function of a covariate z (simulated from Normal(0,1)) as logit(ψ) = −1 − 1*z. We drew a single sample of z at 10 000 cells and simulated binomial sampling for presence–absence at these cells based on ψ. The absences were discarded and 500 replicated samples were drawn from the set of presences. Importantly, these replicates were drawn from the same set of simulated z values and sampled without bias, so the only source of variability in the resulting models comes from sampling the ‘true’ distribution of the simulated species. Models were fit using only linear features.
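The simulation design can be sketched in a few lines. The analyses in this paper were run in R; the Python version below is only an illustration of the steps described above. The random seed, the choice to sample presences without replacement, the use of five rather than 500 replicates and all function names are assumptions made for brevity.

```python
import math
import random

def simulate_presences(n_cells=10000, beta0=-1.0, beta=-1.0, seed=1):
    """Simulate z ~ Normal(0, 1) at each cell, compute OP from
    logit(psi) = beta0 + beta * z, and keep covariate values at occupied cells."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]
    psi = [1.0 / (1.0 + math.exp(-(beta0 + beta * zi))) for zi in z]
    occupied = [rng.random() < p for p in psi]  # one Bernoulli draw per cell
    presence_z = [zi for zi, occ in zip(z, occupied) if occ]  # absences discarded
    return z, presence_z, rng

def draw_replicates(presence_z, sample_size, n_replicates, rng):
    """Draw replicate presence samples (assumed here to be without replacement)."""
    return [rng.sample(presence_z, sample_size) for _ in range(n_replicates)]

z, presence_z, rng = simulate_presences()
replicates = draw_replicates(presence_z, sample_size=200, n_replicates=5, rng=rng)

mean_background = sum(z) / len(z)
mean_presence = sum(presence_z) / len(presence_z)
print(len(presence_z), round(mean_background, 3), round(mean_presence, 3))
```

Because the slope is negative, presences are concentrated at low values of z, so the mean of z at presence cells falls below the landscape mean. Each replicate would then be passed to the model-fitting step together with the full set of background values.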

To understand the effect of sample size on predictions for the simulated data, we compared the distribution of mean intercept estimates for the 500 replicates for each sample size. We made a similar comparison between the ranges of predicted OP across these replicates for each cell. This range was calculated as the maximum OP minus the minimum OP for each cell over the 500 replicates. We compared the variability of Maxlike's predicted OP and Maxent's predicted relative OP as a function of sample size by plotting the distribution of average percentage error within cells across the 500 replicates. Finally, we compared Maxlike's predicted OP for different values for the slope parameter (−1, −0·6 and −0·2) to explore scenarios where true OP did not span [0,1], to determine when Maxlike might struggle to estimate the intercept.

All models were built in R, v3.0.2 (R Core Team 2013) except for those fit with Maxent, v3.3.3k (Phillips, Anderson & Schapire 2006), which were subsequently accessed and analysed in R. Maxlike models were fit with maxlike() in the maxlike package, v.0.1-3, and logistic regression models were fit with glm() in the stats package. Duplicate observations within a grid cell were removed for all models. Maxent's regularization was set to zero in order to fit an unpenalized likelihood function analogous to Maxlike's. Code for all analyses is provided in the Supplementary Materials.

Results

Wren Data

Maxent's and Maxlike's predictions of relative OP are highly similar to one another and to the GLM based on spatial pattern and AUC and COR (Fig. 2a–c, A2 in Appendix S1). This similarity is due to our decision to model relative OP. Royle et al.'s differences derive from using Maxent's logistic format where differences were more apparent (Fig. 2d–f), which we suggest avoiding (Section V.C). The GLM, Maxlike and Maxent found similar environmental relationships, with similar signs and relative magnitude for all coefficients but one, which was nearly zero (Table 1).

Changing the number of presences used for training the wren model drastically altered Maxlike's spatial predictions but not Maxent's (Fig. 3). Maxlike's mean predictions are somewhat variable across sample sizes due to the inherent difficulty in estimating prevalence. In contrast, rankings of cells in these same models are consistently good (AUC > 0·847), indicating that Maxlike correctly identifies environmental relationships, that is, slope parameters. Predictions with similar (or better) AUC values are found for Maxent (Fig. 3). By focusing on relative OP (i.e. only slope parameters), Maxent is able to make consistent predictions across this range of sample sizes (Fig. 3e–h). Of course, Maxlike can be used to make similarly robust estimates of relative OP by normalizing its predictions as we did in Fig. 2b.

Figure 3.

Comparison of representative Maxlike (a–d) and Maxent (e–h) performance for different sample sizes. AUC and COR are calculated with respect to PA data. The degradation of Maxlike's predictions of OP for smaller data sets is greater than the degradation of Maxent's predictions due to the challenge of estimating prevalence. Maxlike and Maxent offer similar performance, in terms of ranking cells, as shown by AUC and COR. The poor performance of Maxlike in (c) illustrates the challenge of estimating the intercept (cf. Figs 4 and 5); noise in this particular sample led to a higher intercept estimate than the samples used in (a,b,d).

Simulated Data

Using simulated data, we found that Maxlike and Maxent make nearly identical predictions of relative OP (Fig. A3 in Appendix S1; correlation = 0·9999) in spite of the differences in OP reported by Royle et al. (2012). Again, Royle et al.'s differences derive from using Maxent's logistic format. When the number of presences used for training was reduced below 2000, variance in estimates of the intercept over replicated data sets increased drastically (Fig. 4a). This increase in variance produces a large range of predicted OP, when using sample sizes of 50–1000 (Fig. 4b). Even for 2000 presence samples, predicted OP varies by more than 0·2, on average.

Figure 4.

Comparison of mean estimates across 500 replicated simulated data sets for different sample sizes using violin plots. White dots indicate the median, black bars indicate quartiles, and grey density functions show the distribution of values. (a) Distribution of the maximum likelihood estimate of the intercept by Maxlike across replicates. The horizontal dashed line at y = −1 represents the true value of the intercept. (b) Distribution of range of predicted OP in a cell by Maxlike, averaged over replicated samples and averaged over the 10 000 cells on the simulated landscape. (c) Distribution of average percentage error in predicted OP by Maxlike across replicates. (d) Distribution of average percentage error in predicted relative OP by Maxlike across replicates. (e) Distribution of average percentage error in predicted relative OP by Maxent across replicates. Notably, Maxlike and Maxent have similar predictive precision and bias when estimating relative OP (d–e).

The instability of intercept estimates by Maxlike leads to large average percentage error in predicted OP (Fig. 4c). In contrast, by avoiding estimating the intercept and focusing on relative OP, Maxent's predictions are much less biased (lower % error in Fig. 4d compared with Fig. 4c). While relative OP is less informative than OP, it can be estimated more accurately for smaller sample sizes.

Maxlike struggled to estimate the intercept when the true values of OP did not range from 0 to 1 (Fig. 5). When the slope parameter was smaller, a wider range of intercept values was consistent with the slope because OP did not span [0,1] over the range of the covariate. When OP spanned [0,1] (Fig. 5a), the logistic curves were much better constrained (i.e. it was not possible to move the logistic curve up/down) and parameter estimates had lower variance.
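The weak identifiability of the intercept can be illustrated with a minimal numerical sketch. It assumes the Maxlike likelihood takes the form given by Royle et al. (2012), in which each presence contributes ψ at its cell divided by the sum of ψ over all cells, and it assumes that the response curves in Fig. 5 are linear predictors on the logit scale. The covariate range of [−3, 3], the intercept grid and the function names are hypothetical choices made for illustration, not settings taken from the simulations reported here. The sketch holds the slope at its true value and measures how much expected log-likelihood is lost when the intercept is moved away from −1.

```python
import numpy as np


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))


def expected_loglik(z, true_beta, cand_beta):
    """Expected Maxlike log-likelihood per presence record.

    Presences are assumed to fall in cell j with probability
    psi_true[j] / sum(psi_true); the candidate model assigns that cell the
    normalised probability psi_cand[j] / sum(psi_cand).
    """
    psi_true = expit(true_beta[0] + true_beta[1] * z)
    p_true = psi_true / psi_true.sum()
    psi_cand = expit(cand_beta[0] + cand_beta[1] * z)
    log_p_cand = np.log(psi_cand) - np.log(psi_cand.sum())
    return float(np.sum(p_true * log_p_cand))


# Hypothetical landscape: 10 000 cells with a covariate spanning [-3, 3].
z = np.linspace(-3.0, 3.0, 10_000)
intercepts = np.linspace(-4.0, 2.0, 25)  # candidate intercepts (includes -1)

loss = {}
for slope in (-1.0, -0.6, -0.2):
    truth = (-1.0, slope)
    best = expected_loglik(z, truth, truth)
    profile = np.array(
        [expected_loglik(z, truth, (a, slope)) for a in intercepts]
    )
    # Expected log-likelihood lost per presence by using the wrong intercept.
    loss[slope] = best - profile
    print(f"slope {slope:+.1f}: largest loss over grid = {loss[slope].max():.4f}")
```

A smaller loss means that the likelihood surface is flatter in the intercept direction, so more presence records are needed before the data can distinguish between candidate intercepts.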

Figure 5.

Maxlike can identify the intercept well when the true OP spans from 0 to 1 because the steep slope parameter does not allow the curve to shift vertically (a). In contrast, when the species response is shallower (c, e) the same slope parameter is consistent with a wide range of intercept values (d, f). (a, c, e) Response curves from 200 data sets simulated from (a) ψ(z) = −1−1*z, (c) ψ(z) = −1−0·6*z and (e) ψ(z) = −1−0·2*z. (b, d, f) Maximum likelihood parameter estimates from the 200 simulated data sets. Dashed lines indicate true parameter values.

Discussion

Similarities between Maxlike and Maxent

It is not surprising that Maxlike and Maxent identify similar environmental relationships given the formal relationship between their slope coefficients. In fact, Manly, McDonald & Thomas (1993) recommend using logistic regression (the form of Maxlike) with the background cells treated as absences as a shortcut to obtaining slope parameters for an exponential model (the form of Maxent) precisely because the slope estimates of one model are well approximated by the other (we observed this in Table 1). The two methods are similar in their predicted spatial distributions (Fig. 2), similarity to the GLM using the full PA data set (Fig. 2), predictive performance (AUC in Fig. 2), coefficient values (Table 1) and ranking of the predicted suitability of cells (Fig. A2 in Appendix S1). This similarity occurs because the additional information estimated by Maxlike, the intercept in eqn 2, primarily slides the predictions up or down the scale of OP (Fig. 5; except to the extent that the intercept and slopes are correlated). Only minor differences are expected in cell rankings due to the different parametric forms of Maxlike and Maxent (eqns 2 and 3), with deviations greatest in the tails of the distributions (Fig. A3 in Appendix S1; Johnson et al. 2006). The similarity in slope coefficients means that the differences in predictions between Maxlike and Maxent shown by Royle et al. (2012; reproduced in Fig. 2d–f) can be substantially reduced by focusing on relative OP, as we have done here, or by using a different value for the prevalence if such information were known (Fig. A1 in Appendix S1). Note that the coefficients are not expected to be exactly equivalent because different parameter optimization techniques are used by each method.
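One standard way to see why the slope coefficients of a logistic and an exponential model can nearly coincide is the rare-event approximation sketched below. The symbols η, β₀, β₁ and α are notation assumed here for illustration and need not match the notation of eqns 2 and 3; the approximation applies only where OP is small.

```latex
\begin{align*}
\psi(z) &= \frac{e^{\eta(z)}}{1 + e^{\eta(z)}}, \qquad \eta(z) = \beta_0 + \beta_1 z
  && \text{(logistic form)} \\
\psi(z) &\approx e^{\eta(z)} \quad \text{when } e^{\eta(z)} \ll 1
  && \text{(low occurrence probability)} \\
\log \psi(z) &\approx \beta_0 + \beta_1 z
  && \text{(same slope as an exponential model } e^{\alpha + \beta_1 z}\text{)}
\end{align*}
```

Where OP approaches 1, the logistic curve saturates while the exponential curve does not, which is consistent with the two forms diverging most at the extremes.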

Many applications of SDMs do not require OP, and relative OP is sufficient, which means that Maxlike's and Maxent's predictions can be interpreted similarly. The ratio of relative OP between cells is equivalent to the ratio of OP, so applications that compare, rank or take ratios of OP among cells are not affected by which method is chosen. Examples include locating the best habitat for finding a new population or defining priority conservation locations.
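The invariance of ratios and rankings follows directly if relative OP is taken to be OP multiplied by an unknown positive constant. The symbols r and c below are notation assumed for this illustration.

```latex
\begin{align*}
r_i &= c\,\psi_i, \qquad c > 0 \text{ unknown} \\
\frac{r_i}{r_j} &= \frac{c\,\psi_i}{c\,\psi_j} = \frac{\psi_i}{\psi_j} \\
r_i > r_j &\iff \psi_i > \psi_j
\end{align*}
```

The constant cancels in every ratio and preserves every ordering, so only statements about the absolute level of OP depend on knowing c.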

Practical limitations of Maxlike

Our explorations with Maxlike suggest that it may exhibit some practical limitations. Our concern is not the fundamental specification of the model, but whether the types of data sets to which it is applied contain sufficient information to provide useful estimates of OP. The parametric assumptions of Maxlike have already been criticized by Hastie & Fithian (2013) and Phillips & Elith (2013), who highlight how the parametric (logistic) assumption of Maxlike is fairly restrictive and leads to inaccurate predictions with even mild departures from an exactly loglinear response. While parametric assumptions may be limiting in some cases, very flexible response curves can be readily accommodated using functional transformations of predictors, splines or nonlinear smoothers. Hence, we see the parametric assumptions as only partly limiting and focus on the limitations due to typical data sets and model interpretation in the case where Maxlike's assumptions are acceptable.

The apparent advantage of Maxlike is the ability to estimate the OP, whereas Maxent can only estimate relative OP or ROR. Of course, absolute probabilities are preferable because they contain more information than relative probabilities (Pearce & Ferrier 2000). Reliably estimating the prevalence from PO data requires a large number of observations to avoid large variance in predictions (Figs 3-5), and such large samples are usually not available from museum records or opportunistically collected data. While the rankings of cells or their relative OP are reliably estimated by Maxlike for samples of any size, variability in predicted OP can be so high for small sample sizes that the prediction may not be particularly useful (Figs 3 and 4). This highlights the need to report interval estimates for predicted OP. For example, high variance intercept estimates can be problematic when modellers create binary presence–absence predictions because predicted OP can depend heavily on sampling noise (Figs 3 and 4b). Previous studies similarly note the high variability in estimates of the intercept in Maxlike (Fitzpatrick, Gotelli & Ellison 2013) and closely related models (Lancaster & Imbens 1996; Lele & Keim 2006; Ward et al. 2009).

Poor predictions by Maxlike stem from the difficulty of identifying the intercept when true OP does not span [0,1] (Fig. 5). In that case, the response curve can readily shift up or down with little impact on the likelihood (Fig. 5). This observation also explains Maxlike's poor predictions in Phillips & Elith (2013). To apply Maxlike confidently to predict OP, one must determine whether OP on the landscape spans values between 0 and 1; whether this can be determined a priori in practice is an open question.

Substantial challenges exist in estimating sampling effort and detection probability for PO data (Dorazio 2012; Yackulic et al. 2012). Thus, the best one can do for typical PO data sets where sampling protocols are unknown is to indicate relative relationships in occurrence, which trivializes the debate over whether one can reliably predict OP vs. relative OP. If independent presence–absence data are available, it is possible to calibrate relative OP predictions to predict OP (Aarts, Fieberg & Matthiopoulos 2012; Halvorsen 2012; Halvorsen 2013). Otherwise, unless a model for detection probability is explicitly included (cf. Kery, Gardner & Monnerat 2010; Chakraborty et al. 2011; Yackulic et al. 2012), estimates of OP are only defined with respect to a given level of sampling effort (i.e. detection probability and true OP are confounded; Elith et al. 2011; Yackulic et al. 2012). Thus, interpretation of OP may not be as appealing as it initially appears if detection probability is ignored, as is typically done (Dorazio 2012; Yackulic et al. 2012).

Low variance Maxlike predictions may be restricted to sufficiently large data sets that may be well approximated as PA data. This is apparent if we recognize that PO and PA data lie on a continuum defined by sampling effort. If sampling effort is low or unknown, we often do not believe that a lack of presence corresponds to an absence, and we call it PO data. With increasing sampling effort we are more likely to believe these absences and call the observations PA data. This connection is made formally using models for detection probability (e.g. Gelfand et al. 2006; Kery, Gardner & Monnerat 2010; Chakraborty et al. 2011). In cases where large structured surveys exist, such as the breeding bird survey (from which the wren data derive), presence–absence methods are already a reasonable approximation. It is no surprise that Maxlike converges on similar predictions to a GLM (Fig. 2); the models have the same parametric form (logistic), and the data are nearly identical.

Modelling decisions with Maxent

Many of the limitations of Maxent arise from choices made during model building, particularly those raised by Royle et al. (2012). However, these are not fundamental attributes of Maxent models and are readily adjusted within the software package options (Merow, Smith & Silander 2013), or outside the software package (Halvorsen 2012), although this appears to be rarely done.

Maxent's logistic output should not be interpreted as OP because it relies on an assumption, rather than an estimate, of prevalence (Elith et al. 2011; Royle et al. 2012; Merow, Smith & Silander 2013). The differences between Maxlike and Maxent in Royle et al. (2012) emerged because they used Maxent's logistic output to predict OP; they have highlighted why this is problematic. Figure A1 in Appendix S1 shows the high variability in predictions that can emerge for different prevalence assumptions for the Carolina wren. However, these concerns about logistic output do not affect the interpretation of Maxent's raw output as relative OP; hence, we recommend that Maxent analyses focus on relative OP or ROR.

Maxent builds very flexible occurrence–environment relationships, which have been criticized for overfitting (Warren & Seifert 2011; Royle et al. 2012; Merow, Smith & Silander 2013). Maxent uses a number of functional transformations of each predictor, known as features, in conjunction with regularization to penalize for model complexity and select the best features from a very large candidate set (Tibshirani 1996; Phillips, Anderson & Schapire 2006; Merow, Smith & Silander 2013). Explorations with complex models can be valuable for revealing unexpected patterns in response curves but can be dangerous to interpret because such flexibility is bound to extract some pattern even when no causal relationships exist. One approach to reducing complexity in Maxent is to simply reduce the number of feature classes that it considers (e.g. only linear and quadratic features). Maxent's software package does not offer the option of including just a few specific features (e.g. based on specific hypotheses) but instead takes an all-or-nothing approach, which makes it difficult to evaluate the importance of specific features. If a subset of more complex features is desired, one can ensure that the model is built with precisely the user-specified set of features by (i) making the features manually, (ii) providing them as layers to Maxent, (iii) selecting only linear features and (iv) optionally setting the regularization multiplier to zero (Merow, Smith & Silander 2013). Stronger control of the features included and model selection is also possible using an IPP framework (Renner & Warton 2013) or using insights from the strict maximum likelihood interpretation of Maxent (Halvorsen 2013).
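Step (i) of this workflow, making the features manually, might look like the following minimal sketch. The predictor names, the simulated grids and the function `build_feature_layers` are hypothetical and serve only as illustration; writing the resulting layers to the raster format expected by Maxent, and the settings for steps (iii) and (iv), are left to the software's own documentation.

```python
import numpy as np


def build_feature_layers(temp, precip):
    """Return hand-made feature layers on the same grid as the inputs.

    Each layer would be supplied to the model as an ordinary predictor so
    that, with only linear features enabled, the fitted model contains
    exactly this user-specified set of terms.
    """
    def standardize(a):
        return (a - a.mean()) / a.std()

    t = standardize(temp)
    p = standardize(precip)
    return {
        "temp": t,               # linear term
        "temp_sq": t ** 2,       # quadratic term, e.g. for a unimodal response
        "precip": p,             # linear term
        "temp_x_precip": t * p,  # interaction (product) term
    }


# Hypothetical 100 x 100 landscape with two simulated predictors.
rng = np.random.default_rng(0)
temp = rng.normal(15.0, 5.0, size=(100, 100))
precip = rng.gamma(2.0, 400.0, size=(100, 100))

layers = build_feature_layers(temp, precip)
for name, grid in layers.items():
    print(name, grid.shape)
```

Standardizing before squaring or multiplying is a design choice in this sketch; it keeps the derived layers on comparable scales and reduces the correlation between a linear term and its square.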

The use of regularization has been criticized (Royle et al. 2012), although it is a widely accepted tool in both machine learning and statistics (Tibshirani 1996; Hastie, Tibshirani & Friedman 2009). If regularization is undesirable, Maxent's regularization parameter can easily be set to zero to maximize the unpenalized likelihood. The appeal of regularization is that it reduces overfitting to bias in the data. Choosing the appropriate strength of the regularization penalty is a more difficult matter that has received little attention (but see Phillips & Dudik 2008; Elith, Kearney & Phillips 2010; Anderson & Gonzalez 2011; Merow, Smith & Silander 2013; Renner & Warton 2013).

Recommendations

The connection between Maxent and Maxlike leaves users with a conundrum for many practical applications. When using Maxent as an OP model, one can make a good assumption about PO data – that it represents a random sample of space – but is forced to predict only relative OP due to limitations of the loglinear model. Alternatively, when using Maxent as an ROR model, one can make a questionable assumption about PO data – that it represents a random sample of individuals – and make a good assumption that the loglinear model is appropriate for count data. Maxlike makes reasonable assumptions about both the data and model for predicting OP. However, there exists only a limited set of circumstances where the predicted OP has sufficiently low variance to be more informative than relative OP, due to the limited information contained in PO data.

Modelling relative OP is valuable whenever interpretations do not rely on absolute values of OP. We expect that this scenario is rather common, given the exploratory nature of many occurrence models. This recommendation derives from the challenges of interpreting (absolute) OP, due to unknown detection probability and sampling effort (Section 'Practical limitations of Maxlike'; Yackulic et al. 2012). Modelling relative OP can be done with either Maxent or Maxlike, though Maxent may be preferable if correlation between the intercept and slope parameters is present (Fig. 5). If the number of samples is large, OP spans [0,1], and detection probability is constant, Maxlike is preferable because it can predict OP. In any case, the prediction uncertainty should always be reported given its large magnitude in the relatively well-behaved case studies shown here (Figs 3-5).

When data represent a random sample of individuals, we recommend using Maxent to model ROR, or more generally a model within the IPP family (cf. Aarts, Fieberg & Matthiopoulos 2012; Fithian & Hastie 2013). Although Maxent is suitable when the landscape must be discretized and its settings have been carefully selected (cf. Merow, Smith & Silander 2013), working in an IPP framework offers a few advantages over Maxent. IPPs can provide explicit connections between OP and ROR (eqn 9; Baddeley et al. 2010), and estimate the intercept of a loglinear model (Aarts, Fieberg & Matthiopoulos 2012) when total abundance is measured in at least some cells. IPPs can be fit using a range of software for all flavours of generalized linear models, for example GAMs, boosted regression trees or ridge/penalized regression, that come with a variety of tools for model assessment (Fithian & Hastie 2013). If information is available on abundance, the intercept parameter of the IPP can be estimated once the slope parameters describing ROR have been estimated (see Appendix from Aarts, Fieberg & Matthiopoulos 2012). Methods exist to determine the optimal spatial resolution and number of background points from the gridded landscape for an IPP, rather than the ad hoc methods typically used with Maxent (Warton & Shepherd 2010; Renner & Warton 2013). While OP from models built at different spatial resolutions typically cannot be compared, IPPs facilitate this comparison using intensity functions and offset terms based on pixel size during model fitting (Baddeley et al. 2010). Renner & Warton (2013) provide additional advantages, including theory regarding regularization, methods for assessing goodness-of-fit, and discussion of the loss of information associated with the gridded landscapes used in Maxent.

Though the merits of Maxlike and Maxent have been sharply contrasted against one another (Royle et al. 2012; Phillips & Elith 2013), we find that the limitations of PO data constrain modellers to focus on relative OP and that Maxlike and Maxent are similarly valuable for predicting relative OP (Fig. 4d, e) once modelling decisions have been carefully made. Consequently, we advocate for informed usage of any species distribution modelling method, by focusing on understanding assumptions, hypotheses and interpretations.

Acknowledgements

We thank Jane Elith, John Fieberg, Rune Halvorsen, Sean McMahon, Steven Phillips, Andrew Royle and Matthew Smith for helpful comments on the manuscript. CM acknowledges support from NSF Grant 1046328 to JAS and NSF Grant 1137366 to Sean M. McMahon.
