#### Basic concepts

In maximum entropy density estimation, the true distribution of a species is represented as a probability distribution π over the set X of sites in the study area. Thus, π assigns a non-negative value to every site x and the values π(x) sum to one. We produce a model of π, a probability distribution that respects a set of constraints derived from the occurrence data. The constraints are expressed in terms of simple functions of the environmental variables, called features. Specifically, the mean of each feature is required to be close (within some error bounds) to the empirical average over the union of the presence sites. For example, for a feature “annual precipitation”, the corresponding constraint says that the mean annual precipitation predicted by the model should be close to the average observed precipitation. Since the set of constraints typically under-specifies the model, among all probability distributions satisfying the constraints, we choose the one of maximum entropy, i.e. the most unconstrained one (Jaynes 1957).

Maximum entropy density estimation can also be explained from a decision theoretic perspective as robust Bayes estimation. Specifically, consider the scenario where the goal of the modeler is to optimize the expected log likelihood (see “Performance measures”, below), and the only fact known about the true distribution π is that it satisfies a certain set of constraints. The strategy which guarantees the best performance regardless of π, also called the minimax strategy, is to choose the maximum entropy distribution subject to the given constraints (Topsøe 1979, Grünwald 2000, Grünwald and Dawid 2004).

To understand how π represents the realized distribution of the species, consider the following (idealized) sampling strategy. An observer picks a random site x from the set X of sites in the study area, and records 1 if the species is present at x, and 0 if it is absent. If we denote the response variable (presence or absence) as y, then π(x) is the conditional probability P(x∣y=1), i.e. the probability of the observer being at x, given that the species is present. According to Bayes’ rule,

- π(x) = P(x∣y=1) = P(y=1∣x)P(x)/P(y=1) = P(y=1∣x)/(P(y=1)∣X∣) (1)

since according to our sampling strategy P(x)=1/∣X∣ for all x. Here P(y=1) is the overall prevalence of the species in the study area. The quantity P(y=1∣x) is the probability that the species is present at the site x, which is 0 or 1 for plants, but may be between 0 and 1 for vagile organisms. Equation 1 shows that π is proportional to the probability of presence. However, if we have only occurrence data, we cannot determine the species’ prevalence (Phillips et al. 2006, Ward et al. 2007). Therefore, instead of estimating P(y=1∣x) directly, we estimate the distribution π. We emphasize that here x is a site, rather than a vector of environmental conditions. This treatment differs from more traditional statistical methods, such as logistic regression; later we will bring these two viewpoints together and present a new way of estimating probability of presence from the Maxent model (see below).
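The relationship in eq. 1 can be checked numerically on a toy example. The sketch below is illustrative only: the four sites and their presence probabilities are assumed values, not data from the study.

```python
import numpy as np

# Hypothetical toy example: 4 sites, each with an assumed probability of presence.
p_presence = np.array([0.9, 0.6, 0.3, 0.1])   # P(y=1 | x) for each site x
n_sites = len(p_presence)

# Uniform sampling over sites: P(x) = 1/|X|.
p_x = np.full(n_sites, 1.0 / n_sites)

# Overall prevalence: P(y=1) = sum_x P(y=1|x) P(x).
prevalence = np.sum(p_presence * p_x)

# Bayes' rule (eq. 1): pi(x) = P(y=1|x) P(x) / P(y=1) = P(y=1|x) / (|X| P(y=1)).
pi = p_presence / (n_sites * prevalence)
```

Summing `pi` confirms it is a probability distribution over sites, and the ratio `pi / p_presence` is constant across sites, i.e. π is proportional to probability of presence.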

According to Section 2 of Dudík et al. (2004), the Maxent distribution belongs to the family of Gibbs distributions derived from the set of features f_{1}, …, f_{n}. Gibbs distributions are exponential distributions parameterized by a vector of feature weights λ=(λ_{1}, …,λ_{n}) and defined by

- q_{λ}(x) = e^{λ·f(x)}/Z_{λ} (2)

where Z_{λ} is a normalization constant ensuring that probabilities q_{λ}(x) sum to one over the study area. Therefore, the value of the Maxent model q_{λ} at a site x depends only on the feature values at x, and hence only on the environmental variables at x. This means that the Maxent model, which we originally defined with strict reference to the set X of training sites, can also be “projected” to other sites where the same environmental variables are available. The Maxent distribution is the Gibbs distribution q_{λ} that maximizes a penalized log likelihood of the presence sites, namely
(1/m)∑_{i=1}^{m} ln q_{λ}(x_{i}) − ∑_{j=1}^{n} β_{j}∣λ_{j}∣
where the regularization parameter β_{j} is the width of the error bound for feature f_{j} and x_{1}, …,x_{m} are the presence sites. The first term, the log likelihood, gets larger as we obtain a better fit to the data. This gives insight into how Maxent uses background data: the first term is larger for models that give more probability to the presence sites and less to the rest of the sites, i.e. models that best distinguish the presence sites from the background. The second term, the regularization (also known as the lasso penalty; Tibshirani 1996), gets larger as the weights λ_{j} get larger. Larger weights λ_{j} typically mean that the model is more complex and is thus more likely to overfit. Maximizing the difference between log likelihood and regularization can be viewed as seeking a Gibbs distribution which fits the data well, but which is not too complex. The tradeoff is controlled by the regularization parameters.
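The Gibbs density of eq. 2 and the penalized log likelihood can be sketched as follows. This is an illustrative NumPy implementation, not the Maxent software's actual code; the function names and array shapes are our assumptions.

```python
import numpy as np

def gibbs_density(lam, features):
    """q_lambda(x) = exp(lam . f(x)) / Z_lambda over all background sites.

    features: (n_sites, n_features) array of feature values f_j(x).
    """
    scores = features @ lam                      # lam . f(x) for every site
    scores -= scores.max()                       # stabilize the exponential
    q = np.exp(scores)
    return q / q.sum()                           # divide by Z_lambda

def penalized_log_likelihood(lam, features, presence_idx, beta):
    """Average log q_lambda at the presence sites minus the lasso penalty."""
    q = gibbs_density(lam, features)
    log_lik = np.mean(np.log(q[presence_idx]))   # (1/m) sum_i ln q_lambda(x_i)
    penalty = np.sum(beta * np.abs(lam))         # sum_j beta_j |lambda_j|
    return log_lik - penalty
```

With all weights set to zero the density is uniform over the background and the penalty vanishes, which makes the objective easy to sanity-check by hand.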

For large sample sizes, the performance of Maxent (as measured by log likelihood of test data) converges to that of the best Gibbs distribution, as long as the presence sites are drawn independently at random according to π (Dudík et al. 2004). The theoretical analysis gives the best performance guarantees when the regularization parameters β_{j} are as small as possible, while keeping the true feature means (under π) within the error bounds. Thus, we have an incentive to obtain error bounds that are as tight as possible. To simplify the process of tuning parameters, we reduce the number of parameters from one per feature to one per feature class by setting
β_{j} = β √(s^{2}[f_{j}]/m)
where β is a regularization parameter that depends only on the feature class and s^{2}[f_{j}] is the empirical variance of feature f_{j}, so √(s^{2}[f_{j}]/m) is an estimate of the standard deviation of the empirical average. According to the theoretical guarantees, we expect that values of β close to one will give good performance. However, the value of β that optimizes the theoretical bounds may not necessarily give the best model performance in practice. Therefore, we fine-tune the regularization parameter for each feature class separately, using empirical tuning as described below.
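A minimal sketch of this per-feature setting, assuming feature values at the m presence sites are supplied as a NumPy array (the function name is ours, not Maxent's):

```python
import numpy as np

def per_feature_regularization(presence_features, beta):
    """beta_j = beta * sqrt(s^2[f_j] / m): error-bound width per feature.

    presence_features: (m, n_features) feature values at the m presence
    sites; beta is the single class-level tuning parameter.
    """
    m = presence_features.shape[0]
    s2 = presence_features.var(axis=0, ddof=1)   # empirical variance s^2[f_j]
    return beta * np.sqrt(s2 / m)   # estimated std. dev. of the empirical average
```

Note the use of the sample variance (`ddof=1`); whether Maxent uses the sample or population variance here is an implementation detail we do not assert.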

#### Environmental variables and feature classes in Maxent

Features in Maxent are derived from environmental variables of two types: continuous and categorical. Continuous variables take arbitrary real values which correspond to measured quantities such as altitude, annual precipitation, and maximum temperature. Categorical variables take only a limited number of discrete values such as soil type or vegetation type. Some categorical variables quantify the degree of some property (on a discrete scale), for example soil fertility. This type of variable is referred to as discrete ordinal. We will typically treat discrete ordinal variables as if they were continuous.

The Maxent software (Phillips et al. 2005) implements features of six classes: *linear* (L), *quadratic* (Q), *product* (P), *threshold* (T), *hinge* (H), and *category indicator* (C) features. Hinge features are introduced in this paper, while the other five classes were introduced in Phillips et al. (2006). Linear, quadratic, product, threshold, and hinge features are derived from continuous variables. Linear features are equal to continuous environmental variables, quadratic features equal their squares, and product features equal products of pairs of continuous environmental variables. These three classes constrain the means, variances, and covariances, respectively, of the corresponding variables to match their empirical values (Phillips et al. 2006). Category indicator features are derived from categorical variables. Specifically, if a categorical variable has k categories, it is used to derive k category indicator features. For each of the k categories, the corresponding category indicator equals 1 if the variable has the corresponding value and 0 if it has any of the remaining k−1 values.
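The linear, quadratic, product, and category indicator expansions can be sketched as follows. This is an illustrative NumPy implementation with a hypothetical function name; threshold and hinge features are omitted for brevity, and only one categorical variable is handled.

```python
import numpy as np

def expand_features(continuous, categorical):
    """Build linear, quadratic, product, and category-indicator features.

    continuous: (n_sites, k) continuous variables; categorical: (n_sites,)
    integer codes of a single categorical variable.
    """
    cols = [continuous]                               # linear: the variables themselves
    cols.append(continuous ** 2)                      # quadratic: their squares
    k = continuous.shape[1]
    for i in range(k):                                # product: pairs of variables
        for j in range(i + 1, k):
            cols.append((continuous[:, i] * continuous[:, j])[:, None])
    for c in np.unique(categorical):                  # one 0/1 indicator per category
        cols.append((categorical == c).astype(float)[:, None])
    return np.hstack(cols)
```

For k continuous variables and one categorical variable with c categories, this yields k linear, k quadratic, k(k−1)/2 product, and c indicator columns.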

Threshold and hinge features allow Maxent to model an arbitrary response of the species to an environmental variable from which they are derived. If f is a continuous variable, then for any value h (called the knot), we define the threshold feature threshold_{f,h} by
threshold_{f,h}(x) = 1 if f(x) > h, and 0 otherwise.
The forward hinge feature *forwardhinge*_{f,h} is 0 if f(x) ≤ h, then increases linearly to 1 at the maximum value of f:
forwardhinge_{f,h}(x) = 0 if f(x) ≤ h, and (f(x)−h)/(max_{X} f − h) otherwise.
In a similar way, we define a reverse hinge feature, which is 1 at the minimum value of f, drops linearly to 0 at f(x)=h, and is 0 afterwards. Examples of a forward hinge feature and a reverse hinge feature are shown graphically in Fig. 1. Forward and reverse hinge features are collectively referred to as hinge features; we coined this term to evoke the shapes seen in the figure. In the terminology of splines, threshold features are basis functions of splines of order 1 (piecewise constant splines), while hinge features are basis functions of splines of order 2 (piecewise linear splines).
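The three piecewise feature types can be sketched directly from their definitions; the names and signatures below are ours, and the maximum and minimum of f are passed in explicitly.

```python
import numpy as np

def threshold_feature(f, h):
    """1 when f(x) > h, 0 otherwise (a piecewise-constant step at the knot h)."""
    return (f > h).astype(float)

def forward_hinge(f, h, f_max):
    """0 up to the knot h, then rising linearly to 1 at the maximum of f."""
    return np.clip((f - h) / (f_max - h), 0.0, 1.0)

def reverse_hinge(f, h, f_min):
    """1 at the minimum of f, falling linearly to 0 at the knot h."""
    return np.clip((h - f) / (h - f_min), 0.0, 1.0)
```

Evaluating these on a grid of f values reproduces the step and ramp shapes described above (and shown in Fig. 1 for the hinges).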

As the number of features increases, Gibbs distributions become more complex, and may be more prone to overfitting. We therefore expect that more complex feature classes will require more regularization to yield accurate predictions. Many combinations of feature classes are possible; common combinations used are LC, LQC, HC, HQC, TC, and HQPTC. Linear features are special cases of hinge features, so it is redundant to use L and H features simultaneously.

#### Maxent output formats and logistic models

The primary output of Maxent is the exponential function q_{λ}(x) that assigns a probability (referred to as a “raw” value) to each site used during model training. Raw values are not, however, intuitive to work with: in particular, it is hard to interpret “projected” values obtained by applying q_{λ} to environmental conditions at sites not used during model training. Raw values are also scale-dependent, in the sense that using more background data results in smaller raw values, since they must sum to one over a larger number of background points. For these reasons, raw values have generally been converted into the “cumulative” format (Phillips et al. 2006).

The cumulative format is defined in terms of omission rates predicted by the Maxent distribution q_{λ}. Specifically, we consider 0–1 prediction rules that threshold raw outputs at a level p. Each raw threshold p is transformed into the omission percentage c(p) predicted by q_{λ} for the corresponding rule, i.e.
c(p) = 100 ∑_{x: q_{λ}(x)<p} q_{λ}(x)
Therefore, if we make a 0–1 prediction from the Maxent distribution q_{λ} using a cumulative threshold of c, the omission rate is c% for test sites drawn from q_{λ}. The cumulative format is scale-independent, and is more easily interpreted when projected, but it is not necessarily proportional to probability of presence.
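The raw-to-cumulative conversion can be sketched as follows, assuming the raw values arrive as a NumPy array summing to one with no ties (tied sites would need a shared cumulative value, which this minimal version does not handle).

```python
import numpy as np

def cumulative_values(q):
    """Convert raw Maxent values q_lambda(x) to the cumulative format.

    The cumulative value at x is 100 times the total raw probability of
    all sites whose raw value does not exceed q_lambda(x), i.e. the
    omission percentage of the thresholding rule at that raw value.
    """
    order = np.argsort(q)
    csum = np.cumsum(q[order])          # running mass at or below each raw value
    cumulative = np.empty_like(q)
    cumulative[order] = 100.0 * csum    # scatter back to the original site order
    return cumulative
```

The site with the largest raw value always receives cumulative value 100, and lower-ranked sites receive the accumulated mass below them, which is what makes the format scale-independent.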

For example, consider a generalist species whose probability of presence is close to 1 across the whole study area, with slight variations that avoid ties. Since the probability values are similar across the entire region, the cumulative values of individual sites will be roughly proportional to their rank, and hence they will range evenly from 0 to 100. Thus, big variations in cumulative value do not necessarily represent big variations in suitability or probability of presence.

We therefore introduce a new logistic output format that gives an estimate of probability of presence. Let z denote a vector of environmental variables, and let z(x) be the value of z at a site x. Traditional statistical methods such as logistic regression estimate P(y=1∣z), the conditional probability of presence given the environmental conditions, which is closely related to the quantity we estimate, P(y=1∣x):

- P(y=1∣z) = ∑_{x∈X(z)} P(y=1∣x)/∣X(z)∣ (3)

where X(z) denotes the set of locations with environmental conditions z. Therefore, in order to estimate P(y=1∣z), it suffices to focus on P(y=1∣x). Indeed, in the special case that π(x) is only a function of the environmental conditions, if we let x(z) denote an arbitrary element of X(z), eq. 3 simplifies to

- P(y=1∣z) = P(y=1∣x(z)) (4)

Combining eq. 1 and 4, it is tempting to use our estimate q_{λ} of π to derive the following estimate of P(y=1∣z):

P(y=1∣z) ≈ q_{λ}(x(z))∣X∣P(y=1)
However, this approximation has two difficulties. First, we may not know or be able to estimate P(y=1), since this quantity is not determinable from presence-only data (Ward et al. 2007). Second, the approximation may result in probabilities greater than one, since Maxent does not guarantee that q_{λ}(x) is smaller than 1/(P(y=1)∣X∣).

We resolve these difficulties with a novel application of the maximum entropy principle. Rather than applying the principle to estimate a distribution over sites, we apply it to a joint distribution P(x, y) representing both a sampling distribution over sites (assumed the same for data collection and evaluation) and the presence/absence of the species. In particular, we estimate P(x, y) by a distribution Q(x, y) of maximum entropy subject to constraints on the conditional distribution P(x∣y=1), i.e. the same constraints we applied to estimate π. Once we obtain the joint estimate Q, we have enough information to derive the conditional probability Q(y=1∣x), which turns out to be

Q(y=1∣x) = e^{H}q_{λ}(x)/(1+e^{H}q_{λ}(x))
where q_{λ} is the maximum entropy estimate of π and H is the entropy of q_{λ}. Similarly,

Q(y=1∣z) = e^{H}q_{λ}(x(z))/(1+e^{H}q_{λ}(x(z)))
(for the derivation see Dudík and Phillips unpubl.). Thus, Q(y=1∣z) takes the form of a logistic regression model with the same set of parameters λ as the Maxent model and with the intercept determined by the entropy of q_{λ}. Because of the robust Bayes interpretation of Maxent (see “Basic concepts”, above), we expect that the estimate Q(y=1∣z) will perform well against a range of sampling distributions and prevalence values.

The model Q(y=1∣z) can also be interpreted from the point of view of information theory as follows. Suppose that we receive a sequence of independent samples from the Maxent distribution q_{λ}, corresponding to a sequence of observations. Then the average of their log probabilities will be very close to −H, the negative entropy (Cover and Thomas 2006), because −H is simply the mean log probability: −H = ∑_{x} q_{λ}(x) ln q_{λ}(x). Thus, for “typical” sites whose log probabilities are close to this mean, we obtain q_{λ}(x)≈e^{−H}. The model Q therefore assigns typical presence sites a probability of presence close to 0.5.
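The logistic transformation described above can be sketched as follows (an illustrative implementation that computes H directly from the raw distribution; the function name is ours):

```python
import numpy as np

def logistic_output(q):
    """Logistic format Q(y=1|x) = e^H q(x) / (1 + e^H q(x)),

    where H is the entropy of the raw distribution q. A site with
    q(x) = e^{-H} (a "typical" site) gets output exactly 0.5.
    """
    H = -np.sum(q * np.log(q))          # entropy of the raw distribution
    r = np.exp(H) * q
    return r / (1.0 + r)
```

For a uniform raw distribution over n sites, H = ln n and every site is "typical", so the logistic output is 0.5 everywhere; the transformation is also monotone in q, consistent with the rank-equivalence of the output formats noted below.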

We may have a prior expectation about the probability of presence at typical presence sites: for example, extensive collecting effort may have been required to obtain the known occurrence records for a rare species, suggesting that its probability of presence is low everywhere. This information could potentially be incorporated as a constraint on P(x, y). However, probability of presence depends on sampling effort, and in particular on site size and, for vagile organisms, on observation time. Therefore, we can more simply incorporate knowledge of sampling effort by interpreting Q(y=1∣z) as probability of presence under a similar level of sampling effort as was required to obtain the known occurrence data.

Note that the raw, cumulative, and logistic formats are all monotonically related, so they rank sites in the same order and therefore result in identical performance when measured using rank-based statistics such as AUC (Fielding and Bell 1997). However, their predictive performance will vary when measured by statistics that depend on actual output values, such as Pearson's correlation (Zheng and Agresti 2000).