The MIAmaxent R package: Variable transformation and model selection for species distribution models

Abstract The widely used “Maxent” software for modeling species distributions from presence‐only data (Phillips et al., Ecological Modelling, 190, 2006, 231) tends to produce models with high‐predictive performance but low‐ecological interpretability, and implications of Maxent's statistical approach to variable transformation, model fitting, and model selection remain underappreciated. In particular, Maxent's approach to model selection through lasso regularization has been shown to give less parsimonious distribution models—that is, models which are more complex but not necessarily predictively better—than subset selection. In this paper, we introduce the MIAmaxent R package, which provides a statistical approach to modeling species distributions similar to Maxent's, but with subset selection instead of lasso regularization. The simpler models typically produced by subset selection are ecologically more interpretable, and making distribution models more grounded in ecological theory is a fundamental motivation for using MIAmaxent. To that end, the package executes variable transformation based on expected occurrence–environment relationships and contains tools for exploring data and interrogating models in light of knowledge of the modeled system. Additionally, MIAmaxent implements two different kinds of model fitting: maximum entropy fitting for presence‐only data and logistic regression (GLM) for presence–absence data. Unlike Maxent, MIAmaxent decouples variable transformation, model fitting, and model selection, which facilitates methodological comparisons and gives the modeler greater flexibility when choosing a statistical approach to a given distribution modeling problem.

graphical user interface that is easy to use for modelers of all levels of experience. In addition, the models it produced showed high-predictive performance in Elith et al.'s (2006) influential paper comparing various distribution modeling methods. As a result, Maxent quickly became a very popular tool for distribution modeling; the paper that introduced it has more than 5,000 citations in the Web of Science Core Collection, and more than 60% of distribution modelers report using it (Ahmed et al., 2015). Recently, the Java software (currently version 3.4.1) was also adapted into an R package called "maxnet" (Phillips, Anderson, Dudík, Schapire, & Blair, 2017, currently version 0.1.2). For the remainder of this paper, we refer primarily to the Maxent Java software, but our statements are equally applicable to the maxnet R package.
The core functionality of the Maxent software is a statistical approach that comprises three linked, main elements: (a) variable transformation ("feature creation" in Maxent's terminology), (b) maximum entropy fitting, and (c) lasso regularization. Variable transformation expands explanatory variables (EVs) into a larger set of derived variables (DVs), the maximum entropy fitting algorithm finds parameter estimates, and lasso regularization reduces model overfitting.
Maxent implements these three elements together owing to its origins in the field of machine learning (Elith et al., 2011;Merow, Smith, & Silander, 2013;Phillips et al., 2006;Phillips, Dudík, & Schapire, 2004). However, there is no reason that variable transformation, maximum entropy fitting, and lasso regularization must be used in tandem; for example, a maximum entropy distribution model might equally be fit without variable transformation or by a different model selection technique. Nor is it clear that this combination of elements represents the optimal approach to presence-background distribution modeling. In fact, the literature reveals that many models produced by Maxent are difficult to interpret and poorly grounded in ecological theory, as evidenced by highly complex response curves, large numbers of parameters, and little critical examination of modeled relationships (Halvorsen, 2013;Yackulic et al., 2013). For studies aiming to explain occurrence-environment relationships, or to project them to a different spatial or temporal context, these norms are troubling (Halvorsen, 2012, especially figure 15 therein).
The purpose of this paper is to present the MIAmaxent R package as an alternative to the Maxent software and to summarize the motivation for MIAmaxent and its underlying approach (Halvorsen, 2013;. First, we assess individually the utility of variable transformation, maximum entropy fitting, and lasso regularization for distribution modeling. Especially, we ask how compatible each of these elements is with ecological theory concerning species responses to environmental gradients, using insights from the long tradition of gradient analysis in ecology (Austin, 2002(Austin, , 2007Halvorsen, 2012). Then, we show how modifications to Maxent's statistical approach-implemented in MIAmaxent-lead to more ecologically grounded and interpretable distribution models. By "ecologically grounded," we mean: in concordance with expectations about occurrence-environment relationships derived from lines of inquiry other than correlative distribution modeling. By "interpretable," we mean: simple enough that the relationship between prediction and predictors can be explained in an ecologically meaningful way.

| Variable transformation
Variable transformation in Maxent means that explanatory data enter into the model as DVs-functions of the original EVs supplied by the modeler (Phillips et al., 2006). It can be thought of as changing the functional form of the model specification, akin to adding polynomial terms to a linear regression. Recent versions of the software support five transformation types for continuous EVs: linear, quadratic, threshold, forward hinge, and reverse hinge (Phillips et al., 2017;Phillips & Dudík, 2008). In addition, categorical EVs are transformed into binary (dummy) DVs, and interaction terms are possible in the form of product transformations between pairs of EVs (Phillips et al., 2006). The effect of variable transformation is that the relationship between the occurrence of the modeled target and an EV can be captured more flexibly than if only the original EVs enter into the model (Austin, 2007;Halvorsen, 2013;Phillips et al., 2006). For example, if an occurrence rate is constant for one part of an EV range and monotonically increasing or decreasing for another part, then a hinge transformation of that EV allows Maxent to fit the response more closely than the original EV. Thus, variable transformation relates critically to the modeler's expectations about which model response shapes are ecologically realistic (Halvorsen, 2013;Merow et al., 2014;Phillips et al., 2006).
The gradient analysis literature shows that, across a sufficiently large interval, a species' response to any environmental determinant is generally unimodal, but truncation of the interval and the way the variable is scaled may affect the shape of the observed response (Austin, 2007;Halvorsen, 2012;. Thus, a particular variable transformation type accords with expectations based on gradient analysis if it enables a unimodal model response, or some truncated portion thereof, to the original EV. All of the transformation types in Maxent conform to this condition. Even a threshold transformation, which makes a continuous variable binary, can be regarded as approximating a strongly skewed and truncated unimodal response (Halvorsen, 2012). However, the list of transformation types in Maxent is not exhaustive, and other ecologically motivated transformations could also be considered (Halvorsen, 2013;Phillips et al., 2006). Increasingly, complex transformation types allow a closer fit to the data, but are less interpretable with respect to the original EV, and result in a less generalizable model.
Although the transformation types in Maxent are individually consistent with ecological expectations, combinations of the DVs they produce frequently result in model responses that are not. For example, the combined effects of multiple DVs may create a local minimum in the model response to a continuous EV (e.g., Elith et al., 2011, figure 5), which is inconsistent with the expectation of unimodality. More generally, a model containing multiple, simple transformations of a single EV can show a highly complex response to that EV. Therefore, a model will be more ecologically grounded and interpretable if the set of DVs it includes is tailored to the modeled system, based on a priori knowledge of the modeled target in the study area, or exploratory analyses of the data. Specifically, DVs that are responsible for unrealistic or unexpected response shapes, or that capture idiosyncrasies in the data rather than patterns of interest (Merow et al., 2013), should be discarded. However, Maxent applies the same transformation types to all continuous EVs and does not provide the possibility to customize the resulting set of DVs.
Moreover, Maxent users frequently neglect to examine modeled relationships critically (Yackulic et al., 2013), so they are ill-equipped to iteratively refine the set of DVs. In short, Maxent's variable transformation procedure does not offer the level of user control that is necessary to make maximally interpretable and ecologically grounded models.

| Maximum entropy fitting
The model fitting algorithm in Maxent is based on the principle of maximum entropy (Jaynes, 1957a(Jaynes, , 1957b, which has given the software its name. This algorithm finds the probability distribution that is minimally divergent from the distributions of predictors (equivalently: maximally uniform in space), subject to constraints given by the presence locations (Elith et al., 2011). The constraint placed on the distribution in Maxent is that the expected value of each predictor must match its empirical mean among presence locations (Phillips et al., 2006). Note that we use the term "predictor" to refer generically to any variable component of a model, regardless of whether it is transformed (DV) or not (EV). Under this constraint, the distribution which maximizes entropy follows a specific exponential distribution called a Gibbs distribution (Della Pietra, Della Pietra, & Lafferty, 1997). In recent years, it has been shown that these maximum entropy models belong to a broader family of solutions to parametric density estimation problems that also include inhomogeneous Poisson process models (IPP) and logistic regression models (Aarts, Fieberg, & Matthiopoulos, 2012;Renner & Warton, 2013;Warton & Shepherd, 2010). It particular, Fithian and Hastie (2013) showed that logistic regression recovers the same parameter estimates as the maximum entropy algorithm when background locations are weighted strongly compared with presence locations (so-called "infinitely weighted logistic regression"; IWLR).
Maximum entropy fitting gives the estimate of occurrence density that is maximally noncommittal, or minimally presumptive, while conforming to the information in the data (Jaynes, 1957a). A minimally presumptive model fit may be especially appropriate for the opportunistically collected presence-only data commonly used for distribution modeling, since frequent artifacts in these data-for example, resulting from sampling bias (Støa, Halvorsen, Mazzoni, & Gusarov, 2018)-may easily be inherited by overfitted models. For example, the classification methods (boosted regression trees and random forests) tested by  were more prone than Maxent to capture idiosyncratic patterns specific to the sample, despite applying variance reduction measures similar to Maxent's lasso regularization (Elith & Graham, 2009, supplemental information). Put another way, a maximally noncommittal model is likely to generalize well beyond its training data. This fact may not be captured in measures of predictive performance, which are typically estimated with data that are not fully independent of the training data (Halvorsen, 2012) due to spatial and temporal autocorrelation in geographic distributions (Araújo, Pearson, Thuiller, & Erhard, 2005).
While some fitting algorithms assume a distinct form of the occurrence-environment relationship, such as the rectilinear environmental space dictated by the classic BIOCLIM algorithm (Guisan & Zimmermann, 2000), maximum entropy fitting is exclusively responsive to the data. That gives maximum entropy fitting a straightforward ecological interpretation: It produces responses to predictors that deviate minimally from uniformity (i.e., no response). For this reason, a model fitted by maximum entropy will be interpretable and ecologically grounded to the same degree that the predictors in the model are, which emphasizes the importance of variable transformation. Maximum entropy fitting in itself is fully compatible with ecological theory concerning species responses to environmental gradients.

| Lasso regularization
The Maxent software uses a technique termed "lasso" (Least Absolute Shrinkage and Selection Operator) regularization to prevent model overfitting and improve model generalizability (Phillips et al., 2006;Tibshirani, 1996). Lasso regularization is one way to optimize the bias-variance tradeoff inherent in model selection (Hastie, Tibshirani, & Friedman, 2009); a more familiar model selection technique for many ecologists is subset selection (Zuur, Ieno, & Smith, 2007). Like subset selection, lasso regularization balances a model's goodness of fit against its complexity, by penalizing model complexity. However, unlike subset selection, the penalty is on the sum of absolute values of predictor coefficients, rather than on the number of predictors (Reineking & Schröder, 2006). As a result, lasso regularization reduces the magnitude of (i.e., shrinks) coefficients, potentially to zero, instead of selecting a subset of predictors. Subset selection "is a discrete process-variables are either retained or discarded," while "shrinkage methods [like lasso regularization] are more continuous" (Hastie et al., 2009). Lasso regularization stems from machine learning, a tradition where predictive power is valued above all else (Breiman, 2001), whereas subset selection stems from the classical tradition of hypothesis testing and aims to distinguish important predictors from insignificant ones (Hastie et al., 2009).
An important difference between lasso regularization and subset selection is that subset selection provides unbiased parameter estimates (Halvorsen, Mazzoni, Bryn, & Bakkestuen, 2015;Merow et al., 2013;Appendix A1). In contrast, lasso regularization allows the expected value of a given predictor to deviate from its mean presence value to achieve parsimony; that is, it relaxes the constraint on the maximum entropy distribution that expected values match their unbiased estimates (Halvorsen, 2013;Reineking & Schröder, 2006). This behavior is by design (Efron, 2001) and may be excused on the grounds that the constraint should not be enforced too closely, to prevent overfitting (e.g., Elith et al., 2011). However, subset selection also prevents overfitting, without biased estimates. Therefore, subset selection accords better with the principle of maximum entropy, which seeks the "least biased estimate possible on the given information" (Jaynes, 1957a), if the given information comprises the occurrence data and the predictors in the model. We emphasize that both lasso regularization and subset selection methods are subject to the bias-variance tradeoff, so-regardless of which method is employed-model complexity needs to be optimized from case to case according to the purposes of the study and characteristics of the data (Halvorsen, 2013;Merow et al., 2014;Reineking & Schröder, 2006).
Regression simulations have shown that subset selection performs better than lasso regularization when two conditions are fulfilled: (a) Relatively few candidate predictors have an effect on the modeled target and (b) their signal in the data is sufficiently strong (Reineking & Schröder, 2006;Tibshirani, 1996). Specifically, Reineking and Schröder (2006) showed that subset selection of logistic regression models results in better discrimination ability and better identification of true predictors when less than eight of 16 candidate predictors drive the response and more than 20 presences or absences per candidate predictor are observed. Their study used presence-absence data, so its conclusions are not directly generalizable to presence-background models, but it indicates which modeling circumstances are favorable for subset selection. Regarding species distributions, the gradient analysis literature suggests that the vast majority of environmentally determined variation in species occurrences is governed by a very limited number of composite, complex gradients (up to 3; Halvorsen, 2012). Together, these findings suggest that some distribution modeling applications may achieve better discrimination ability and better identification of true predictors via subset selection than via lasso regularization. We note that Gastón and García-Viñas (2011) found presence-background models selected by regularization to discriminate better than those selected by subset selection, but differences in variable transformation and interaction terms confound the comparison in that study.
One way to articulate the difference between lasso regularization and subset selection is that under lasso regularization, selection of predictors is secondary to parameter estimation, while under subset selection, parameter estimation is secondary to selection of predictors. Since predictors in distribution models usually represent ecologically meaningful quantities-especially when the purpose is to gain insight into the modeled target's response to its environment-subset selection will generally result in more interpretable models. Conceptually, if there exist two models with equal goodness of fit, where the first contains one predictor and the second contains two other predictors, and the sum of the absolute values of the coefficients in the two models are equal, then lasso regularization makes no distinction between the two models while subset selection prefers the model with only one predictor. For this reason, lasso regularization is more likely to result in models that include ecologically meaningless predictors (Halvorsen, 2013). Indeed, comparisons between lasso regularization and subset selection show that subset selection generally results in models with fewer predictors (Reineking & Schröder, 2006;Halvorsen, 2013;Halvorsen et al., 2016;Mazzoni, 2016 [chapter 6]; Appendix A2). Halvorsen et al. (2016) found that, across a range of model complexity penalties, distribution models selected by either lasso regularization or subset selection showed approximately equal predictive performance on independent presence-absence evaluation data, but the models selected by subset selection included significantly fewer predictors (see also Mazzoni, 2016, paper 6). In effect, the models selected by subset selection were more parsimonious. Lower model complexity is generally favorable for large spatial extents and coarse spatial resolutions, for small sample sizes, for strong sampling bias, and for applications that involve spatial or temporal extrapolation (Bell & Schlaepfer, 2016;Elith, Kearney, & Phillips, 2010;Merow et al., 2014;Moreno-Amat et al., 2015;Randin et al., 2006). For example, when projecting to a new area, changes in the covariance structure of predictors (Dormann et al., 2013) pose less of a risk for models containing fewer EVs. Simpler distribution models are also more suitable for hypothesis testing than for hypothesis generation and are clearly preferable when ecological interpretation is of interest in addition to spatial prediction (Halvorsen, 2013;e.g., Bendiksby, Mazzoni, Jørgensen, Halvorsen, & Holien, 2014;Merow et al., 2014). In summary, lasso regularization's focus on prediction error promotes complexity and sacrifices ecological interpretability of distribution models.

| Motivation for MIAmaxent
There is no consensus in the literature about the relative merits of Maxent's three core modeling elements-variable transformation, maximum entropy fitting, and lasso regularization-with regards to predictive performance or otherwise (Fletcher & Fortin, 2018;Phillips & Dudík, 2008). However, adjustments can be made to Maxent's variable transformation procedure to bring models more in line with expected ecological responses, and there is evidence that lasso regularization is not the optimal model selection technique for many distribution modeling applications (Halvorsen, 2013;Halvorsen et al., 2015Halvorsen et al., , 2016Reineking & Schröder, 2006). Applications focused on explaining the relationship of the modeled target to its environment, rather than predicting the modeled target's distribution in geographic space (Araújo et al., 2019;Halvorsen, 2012), are better served by subset selection than lasso regularization, because the former produces more interpretable models. Applications focused on projecting model predictions outside of the spatial or temporal context of the data may also benefit from subset selection, because it tends to result in lower model complexity, which often improves model generalizability (Bell & Schlaepfer, 2016;Elith et al., 2010;Merow et al., 2014;Moreno-Amat et al., 2015).
While it is possible to use the Maxent software without lasso regularization, practically none do so (Halvorsen, 2013;Mazzoni et al., 2015;Morales, Fernández, & Baca-González, 2017). This is unsurprising, because in the absence of an alternative, turning off lasso regularization will result in overfitted, highly complex models. Therefore, there is a need for a tool that replaces lasso regularization with subset selection, while retaining the two other core elements of Maxent. Software availability strongly affects modeling decisions (Ahmed et al., 2015), so a tool for subset selection will increase the likelihood that modelers investigate alternatives to lasso regularization. Furthermore, the coupling of variable transformation, maximum entropy fitting, and lasso regularization in the Maxent software hinders proper investigation of alternatives to each. Decoupling these three statistical elements will improve methodological comparisons and thereby advance good distribution modeling practice Golding et al., 2017;Halvorsen et al., 2015;Mazzoni, 2016;Naimi & Araújo, 2016).

| Core functionality and novelty
The MIAmaxent R package addresses the needs described above.
Its functionality primarily concerns the "statistical model" component of distribution modeling process, sensu Austin (2002, see also Halvorsen, 2012). It implements variable transformation, maximum entropy fitting, and subset selection, in a modular, adaptable manner (Mazzoni, 2016). The name "MIAmaxent" is derived from an early precursor to the package called the "MIA Toolbox"  and signifies a "Modular Integrated Approach" to Maxent.
Since maximum entropy fitting by infinitely weighted logistic regression (IWLR) is a trivial task in R, MIAmaxent's most important innovations are its implementations of variable transformation and subset selection. Additionally, MIAmaxent provides the option of using standard logistic regression in place of maximum entropy fitting by IWLR, without affecting variable transformation or subset selection. MIAmaxent's top-level functions correspond to a workflow that runs from the training data supplied for modeling to prediction and evaluation tools (Figure 1). We note that important distribution modeling considerations independent of the statistical model-such as collection of explanatory data, conceptualization of the study area, and treatment of spatial autocorrelation or sampling bias in the occurrence data-are not addressed in MIAmaxent and must be handled separately. Sampling bias (and detection bias) deserves especially careful consideration, because it may severely handicap presencebackground models (Merow et al., 2013;Phillips et al., 2009

| Variable transformation
MIAmaxent provides a standard set of transformation types that balances model flexibility and ecological interpretability. The set of transformation types implemented in the Maxent software is expanded slightly in MIAmaxent, as proposed by Halvorsen et al. (Halvorsen, 2013;Halvorsen et al., 2015;Mazzoni et al., 2015).
Linear transformations are supplemented with an additional monotonous transformation-the zero-skewness transformation (Økland, Økland, & Rydgren, 2001;-to reflect the fact that the scaling of the explanatory variable (EV) may not be ecologically meaningful for the species in question (Halvorsen, 2013). Maxent's quadratic transformation is replaced with a deviation transformation, which produces scaled distances from the EV value with the highest frequency of occurrence. As a result, the unimodal response expected based on gradient analysis may be captured by a single deviation-type derived variable (DV; Halvorsen et al., 2015). In Maxent, a unimodal response can be captured only through a combination of transformations (Halvorsen, 2013;Phillips et al., 2006). MIAmaxent's product transformations are produced during subset selection and are discussed in the following section. Like Maxent, MIAmaxent linearly rescales all DVs to the interval [0,1], which makes the magnitudes of coefficients directly comparable among variables.
In Maxent, the types of transformations applied depend on the number of presences; by default, smaller data sets are restricted to a reduced set of transformation types to avoid overfitting (Phillips & Dudík, 2008). Because the likelihood ratio tests used for subset selection in MIAmaxent account for sample size, MIAmaxent applies all transformation types by default, regardless of the size of the occurrence data set. Simultaneously, MIAmaxent offers fine control of which DVs enter into the model, as called for by Halvorsen (2013).
For example, a single DV may be removed from the pool of candidates for selection if it is found to produce an ecologically unrealistic or unexpected model response. Likewise, different types of transformations may be used for different EVs, and custom DVs may be added. The Maxent software does not offer the same degree of control over variable transformation because it creates DVs and enters these into the model in a single step. Greater control of model specification especially improves MIAmaxent's utility for hypothesis testing with distribution models (Merow et al., 2014). Finally, we emphasize that variable transformations in MIAmaxent may be used identically for models fitted by maximum entropy or logistic regression. With reference to Maxent's variable transformations, Merow et al. (2014) write: "in principle, this same complexity could be fit in a traditional GLM [logistic regression] but this is typically impractical." MIAmaxent removes this obstacle and makes it easy to create many DVs as candidates for selection in a logistic regression model .

| Subset selection
MIAmaxent implements subset selection in the form of forward stepwise selection. This means that model selection proceeds by adding variables to a minimal starting model, usually one by one, for as long as the improvement in goodness of fit outweighs the penalty on complexity. A significance test, which explicitly accounts for sample size, is used to compare the more complex model to its nested, simpler alternative (Halvorsen, 2013). Maximum entropy fitting of presence-background models is a special case of classical regression methods , and its maximum-likelihood interpretation may be used to derive likelihood ratio tests between nested models (Halvorsen, 2013;Halvorsen et al., 2015, appendix 2). The significance threshold (alpha) used for likelihood ratio tests may be decided a priori, or by choosing a value that maximizes some measure of fit by cross-validation or out-of-sample validation. The latter has been recommended for setting the strength of lasso regularization in Maxent (Merow et al., 2013).
Forward stepwise selection is a less stable process than some other subset selection procedures, because it is a so-called "greedy algorithm," which selects variables in order of greatest explanatory power (Araújo & Guisan, 2006;Elith et al., 2011). Thus, selection order may affect the final result, and the algorithm will not necessarily find the universally optimal subset. Bidirectional selection is more likely to reach the universally optimal subset, and best-subset selection will do so by definition, but these algorithms require more computation. For best-subset selection in particular, the amount of computation quickly becomes intractable as the number of candidate predictors increases. Moreover, empirical comparisons show that forward stepwise selection very frequently yields the same or nearly the same models as bidirectional (Wiegand, 2010) or bestsubset selection (Murtaugh, 2009). Therefore, forward stepwise selection is a computationally efficient choice among forms of subset selection.
MIAmaxent is designed to carry out forward stepwise selection in a two-stage, hierarchical procedure, executed with two different functions (selectDVforEV() and selectEV()). Because a single EV is generally transformed into multiple DVs, candidates for selection are inherently grouped according to their origin. Therefore, MIAmaxent allows the modeler to first select a parsimonious set of DVs for each EV and then select a parsimonious set of these DV sets-each representing a single EV (Halvorsen, 2013). During the second stage of the procedure, each set of DVs is treated as a unit that cannot be disaggregated, and forward selection proceeds as though the set constitutes an individual variable, while still accounting for each DV as one degree of freedom. Hastie et al. (2009)  instead, MIAmaxent treats the selected set of DVs derived from a given EV as an integral unit that takes the place of the EV. Therefore, with regard to MIAmaxent, "selection of EVs" is to be understood as shorthand for "selection of sets of DVs representing EVs". The hierarchical selection procedure can also be used to manipulate model complexity; for example, strict DV selection but lax EV selection will result in model with simple responses to individual EVs, but potentially many EVs.
A number of additional features in MIAmaxent improve control over the forward stepwise selection process, or aid interpretation of its outcome. First, it is possible to start EV selection from a minimal model that includes one or more EVs, by specifying a model formula. A priori inclusion of EVs may be desirable when these are known to affect the distribution (Gelman & Hill, 2006, first general principle for building regression models for prediction). It is also useful for testing hypotheses. Second, first-order interactions between pairs of EVs with individually significant main effects may be tested.
Interactions are not considered by default because their ecological justification is usually more tenuous than that of main effects (Merow et al., 2014). Third, MIAmaxent calculates and reports the fraction of null deviance explained for each model along the trail of selection . This value is analogous to the R 2 of least squares regression and termed "D 2 " in MIAmaxent, following Guisan and Zimmermann (2000). It can be used to compare nested models and to assess the relative contributions of variables in a model (Halvorsen, 2013). Note, however, that as a measure of variable contribution, these values are dependent on the order of inclusion in the model, so they should be interpreted with caution .
Some authors in the distribution modeling literature recommend prescreening EVs to reduce collinearity between candidates for selection (e.g., Merow et al., 2013). When using the Maxent software, reduced collinearity leads to more interpretable models, since two highly correlated variables may otherwise both be retained under lasso regularization (Merow et al., 2013). With MIAmaxent, subset selection will not retain both of two highly correlated variables unless the second accounts for a significant amount of variation beyond (i.e., orthogonal to) that accounted for by the first. Therefore, prescreening does not make MIAmaxent models more interpretable, and we recommend letting likelihood ratio tests determine which of two highly correlated EVs is the better predictor, except if model predictions are made for new data with changed covariance structure. In that case, using ecological knowledge to prescreen for the more proximal variables among correlated sets may reduce the risk associated with collinearity

| Data exploration
An intuitive way to explore environment-occurrence relationships manifested in a distribution modeling dataset is to plot occurrence rates against intervals or levels of EVs (Halvorsen, 2013). If the occurrence data comprise presence locations only, this rate reflects frequency of observed presence (FOP), while presence-absence data allow quantification of empirical frequency of presence (Støa et al., 2018). Hereafter, we refer only to plots of FOP, but our statements apply to both rates. Examining FOP plots is a useful data exploration step because it allows the modeler to compare prior expectations about occurrence-environment relationships to patterns in the data (Yackulic et al., 2013). FOP plots reveal patterns of occurrence specific to the study area, which may be contrary to expectations based on ecological knowledge. For example, a species generally considered thermophilic may show higher FOP at cold temperatures if the study area contains only the upper limit of its temperature range. Thus, a FOP plot may help the modeler anticipate the model's behavior.
Exploring FOP patterns may also guide the choice of transformation types (Halvorsen, 2013;Merow et al., 2013). For example, threshold transformations may be turned off if FOP plots show no abrupt shifts. Furthermore, MIAmaxent's FOP plots show the relative frequency of EVs values in the dataset (i.e., data density). Modeled relationships relate critically to the distributions of EVs in the dataset , and regions of data sparsity identified in a FOP plot may be associated with increased model uncertainty.

| Model interrogation
A basic but often overlooked way to understand a distribution model is to examine its parameter estimates (Yackulic et al., 2013).  et al., 2014, 2013). We recommend specifically that local minima in response curves be treated with skepticism, since these are often artifacts , and ecologically unlikely (Austin, 2002;Halvorsen, 2013). It is especially important to evaluate the trends of a response curve with respect to potential extrapolation; inspecting response curves can help the modeler decide whether "clamping" of predictions-whereby EV values are constrained to the interval present in the training data-is desirable (Elith et al., 2010;Guevara et al., 2018;Owens et al., 2013).
Model predictions for any supplied values of EVs are returned in the same nonspatial (data frame class) or spatial (raster class) format as the EVs. In addition, model predictions are always accompanied by the ranges of the supplied EV values compared with the training data range [0,1], which helps evaluate the risk of speculative extrapolation. Predictions from models fitted by maximum entropy are scaled to probability ratio output (PRO; Halvorsen, 2013), which can be interpreted as the "relative suitability of one place versus another" (Elith et al., 2011) and has the range (0,∞). PRO avoids the problematic assumptions inherent to Maxent's "logistic" and "cloglog" outputs Merow et al., 2013;Phillips et al., 2017;Yackulic et al., 2013), as well as the scale-dependence of Maxent's "raw" output (Halvorsen, 2013;Merow et al., 2013;Phillips & Dudík, 2008). In particular, PRO = 1 is a useful reference value that represents the relative suitability of a location randomly chosen from the entire set of training data locations used to parameterize the model, that is, the suitability of an "average" training data location (Halvorsen, 2013).
Finally, MIAmaxent can quantify a model's discrimination ability as the area under the curve (AUC) of the receiver operating characteristic (Fielding & Bell, 1997)-preferably using independent presence-absence occurrence data. Evaluation data collected independently of the training data are extremely informative and should be prioritized more often, especially in the context of projective modeling (Araújo et al., 2019;Araújo & Guisan, 2006;Bahn & McGill, 2013;Edvardsen et al., 2011;Halvorsen, 2012). Strict spatial independence between training data and evaluation data may not be achievable, since presences only occur within the spatially autocorrelated distribution of the species, but independently sampled data are a worthwhile alternative. To underline the importance of distinguishing between AUC calculated using presence-only or presence-absence data (Yackulic et al., 2013), MIAmaxent produces a warning when calculating AUC with presence-only data.

| E X AMPLE S
Both examples below are easily reproducible in R (version 3.5.2), using the R markdown file used to create this paper, which is available on GitHub (https ://github.com/julie nvoll ering/ MIAma xent-paper ).

| MIAmaxent workflow
In this section, we briefly demonstrate a basic modeling workflow in MIAmaxent (Figure 1). An expanded version of this demonstration accompanies the package as a vignette and can be accessed at: https ://cran.r-proje ct.org/web/packa ges/MIAma xent/vigne ttes/a-model ing-examp le.html.
The basic data format from which all analysis in MIAmaxent proceeds is a data frame with the response variable (presence/uninformed background or presence/absence) and explanatory variables (EVs). The readData() function provides convenient data import from spatial data formats commonly used in the Maxent software (CSV coordinates and ASCII raster files using the same coordinate  Note: Columns represent: the round of EV selection ("round"), the names of the EVs included in the model ("variables"), the number of DVs in the model ("m"), the fraction of deviance explained ("Dsq", sensu Guisan & Zimmermann, 2000), the F-statistic for the nested model comparison ("F"), the degrees of freedom associated with explained deviance ("df e ") and unexplained deviance ("df u "), and the p-value for the F-statistic under the specified degrees of freedom ("p"). et al., 2015). For example, "terslpdg_D2" is the squared deviation from an estimated optimum in "terslpdg" (around 6). deriveVars() also returns the parameterized transformation function that was used to produce each DV.
In Because none of these DVs from this EV accounted for a significant amount of variation in the response variable, the EV was dropped: In the second stage of model selection, the selectEV() function selects whole sets of DVs stemming from a single EV-picking those sets which explain a significant amount of variation in the response variable under the specified significance threshold (default alpha = 0.01): selectEV() returns three list items: the selected EVs as represented by their DVs, the trail of forward stepwise selection of EVs (Table 1), and the model automatically selected under the specified significance threshold. In this case, the automatically selected model contains 20 DVs representing 7 different EVs, and accounts for 3.1% of null deviance. Comparing this model to other models along the same trail of forward selection, the modeler may decide to proceed with the best model containing only 6 EVs, since it accounts for nearly the same fraction of null deviance (Figure 3).
The chooseModel() function returns the model specified by a supplied formula object and may be used to pick a simpler model from the trail of forward EV selection: Model parameters are stored as "alpha" and "beta" following the Model parameters are stored as "alpha" and "beta" following the notation of Elith et al. (2011) and Fithian and Hastie (2013). A vector of model predictions in probability ratio output (PRO) format is given by: where N is the number of background locations, α is a normalizing constant, β is a vector of coefficients, and x is a matrix of DVs.
To assess how well the model captured empirical occurrenceenvironment relationships, response curves may be compared with their corresponding FOP plots (Figure 4). Note that, where the response curve deviates strongly from the pattern in FOP, data density is very low.
The projectModel() function produces model predictions from a model object, the transformations used to create its DVs, and any values of EVs: It also compares univariate ranges of the EV data in the projection to the ranges of the same EVs in the training data (scaled to [0,1]). Values of categorical EVs are classified as "inside" or "outside" the range of values present in the training data. Since we projected our model across the same data used to train it, all ranges are identical: The testAUC() function calculates AUC for a given model based on evaluation data comprising occurrence data and the corresponding EV values. Preferably, the occurrence data include absences as well as presences. Plotting the receiver operating characteristic (ROC) curve is optional: The workflow above builds maximum entropy models based on presence-only occurrence data, through infinitely weighted logistic regression (IWLR). One useful feature of the MIAmaxent package is that the entire workflow can be adapted to presence-absence data and standard logistic regression (LR) models by changing a single setting (algorithm = "LR") in functions that perform model fitting. Replacing three commands above with their counterparts below would result in a statistical approach identical in variable transformation and model selection, but with maximum entropy models replaced by logistic regression models:

| MIAmaxent-maxnet comparison
In this section, we use an example data set to compare a maximum entropy model built using MIAmaxent (version 1.1.0) to one built using Maxent's R package equivalent, maxnet (version 0.1.2). As detailed above, the form of these models is identical (Appendix A3).
The primary difference between the two approaches is in model selection, where MIAmaxent uses subset selection, while maxnet uses lasso regularization. Another smaller difference is that MIAmaxent automatically adds all presence locations to the background prior to maximum entropy model fitting by IWLR, because the background should be representative of all conditions in the study area (Halvorsen, 2012;Renner et al., 2015); maxnet does not add presences to the background. For increased comparability, we do not use threshold transformations or interaction terms in either MIAmaxent or maxnet.
We compare the resulting models in terms of two crucial, linked qualities: their discrimination ability and their complexity. It is useful to examine these two qualities together because there exists an optimal level of complexity that maximizes discrimination ability on holdout or independent data (Hastie et al., 2009). We quantify discrimination ability as AUC from spatially stratified cross-validation.
Spatially stratified cross-validation is less likely than randomly partitioned cross-validation to overestimate predictive performance due to spatial autocorrelation (Radosavljevic & Anderson, 2014;Veloz, 2009;Wenger & Olden, 2012), so it is a good way to assess predictive power when independent evaluation data are not available. We assess model complexity by the number of parameters and shapes of response curves (like Merow et al., 2014).
The example data we use are described in the paper that introduced Maxent (Phillips et al., 2006) and are available for download from https ://biodi versi tyinf ormat ics.amnh.org/open_sourc e/ maxen t/. They consist of occurrence records for the brown-throated F I G U R E 3 An example of the increase in fraction of null deviance accounted for (D 2 ) across rounds of forward stepwise selection of explanatory variables (EVs). Each round contains models obtained by adding a single EV to the best model of the previous round. The automatically selected model is shown in red Appendix A4). In summary, the model produced by maxnet predicts marginally better than the model produced by maxnet, but it is less interpretable. However, spatially stratified cross-validation does not completely eliminate spatial autocorrelation and shared sampling bias between training and test partitions, so measures of predictive performance obtained by this procedure may favor overfitted models (Merow et al., 2014). Therefore, independent test data might have shown the simpler MIAmaxent model to perform relatively better.
These results are in line with previous findings that models selected by subset selection maintain high-predictive performance but are simpler and more ecologically meaningful than models selected by lasso regularization (e.g., Halvorsen et al., 2015Halvorsen et al., , 2016. To illustrate, consider the response curves in Figure 7; the MIAmaxent model predicts highest occurrence of brown-throated three-toed sloth at an intermediate level of precipitation that is lower in July (pre6190.l10) than October (pre6190.l7), while the maxnet model predicts highest occurrence at precipitation levels that are low in January (pre6190.l1), high in April (pre6190.l4), low in July (pre6190. l7), high in October (pre6190.l10), and annually low (pre6190.ann).
The response of the model produced by subset selection is clearly easier to justify ecologically.

| IN S TALL ATI ON AND CITATI ON
The latest release version of the MIAmaxent package (currently ver-   ).

| CON CLUS ION
Many authors have called for distributions modelers to consider more explicitly the ecological theory upon which their model is based (Austin, 2002(Austin, , 2007Halvorsen, 2012;Warren, 2012), and MIAmaxent grounds the practice of maximum entropy distribution modeling more strongly in ecological knowledge. MIAmaxent moves maximum entropy distribution models towards the "data modeling" school of statistical modeling (away from "algorithmic modeling"), by placing a higher premium on interpretable models (Breiman, 2001;Warren & Seifert, 2011). The most important change that MIAmaxent implements to cause this shift is to select models by subset selection instead of lasso regularization. Any distribution modeling approach-including how the model is produced and how it is evaluated-must be adapted to the purpose of the study and the characteristics of the data (Halvorsen, 2012;Merow et al., 2014), as no single approach is most suitable for all studies (Escobar, Qiao, Cabello, & Peterson, 2018;Mazzoni, 2016;Qiao, Soberón, & Peterson, 2015). MIAmaxent expands a distribution modeler's statistical toolbox, and for studies aiming to do something other than predict with minimal error the geographic distribution of the modeled target in the same spatial and temporal context as the data, MIAmaxent may frequently be more suitable than the Maxent software.

ACK N OWLED G M ENTS
Thank you to Knut Rydgren and Inger Auestad for constructive comments on an earlier version of this paper. Thank you also to two anonymous reviewers whose suggestions improved the paper.

CO N FLI C T O F I NTE R E S T S
The authors declare no conflicts of interest.

AUTH O R CO NTR I B UTI O N S
JV designed and coded the MIAmaxent package and drafted the paper. RH conceived the theoretical approach underlying the package and critically revised the paper. SM conceived the MIAmaxent package and developed its underlying informatics approach, provided data for the example, and reviewed the paper.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data used in the "MIAmaxent workflow" example accompany the MIAmaxent package (version 1.1.0), while those used in the "MIAmaxent-maxnet comparison" accompany the maxnet package (version 0.1.2). Both are available from the Comprehensive R Archive Network (https ://cran.r-proje ct.org/). This whole paper is fully reproducible from an R-markdown file that is available on GitHub (https ://github.com/julie nvoll ering/ MIAma xent-paper ).

PI R I C A L VA LU E S U N D E R I N CR E A S I N G L A SS O R EG U L A R IZ ATI O N
Regularization loosens the constraint on the maximum entropy distribution that the expected value of each predictor must match its observed mean. As a result, increasing the regularization multiplier to decrease model complexity has the effect that predictions are no longer fully consistent with the data (Merow et al., 2013, see also accompanying figure D2 in appendix 4). Subset selection does not have the same effect ( Figure A1).

A 2 : M O D E L B U I LD I N G US I N G S U B S E T O R S H R I N K AG E M E TH O DS
Dahlgren (2010) presented shrinkage methods and particularly the lasso as good alternatives to subset selection methods for model building with ecological data when the data comprise few observations and many candidate explanatory variables. He also advocates for avoiding model building altogether by using full models when the number of observations is sufficiently high and there is no strong collinearity between variables. Our plotting of his Table 1 results ( Figure A2) suggests that, in data sets of more than 20-30 observations per candidate explanatory variable, neither the lasso nor subset selection strongly improves prediction error compared with the full model, but subset selection results in more interpretable and parsimonious models than the lasso.

A3 : CO N D ITI O N A L EQ U I VA LE N CE O F M A X I M U M E NTRO PY M O D E L S I N " M I A M A X E NT " A N D " M A X N E T " PAC K AG E S
If presence observations are included in the background manually, and regularization is close to zero, predictions from models built using the "maxnet" package are near equivalent to predictions from models built using the "MIAmaxent" package ( Figure A3).
Note that, regularization cannot be turned off entirely in the maxnet package.

A 4: PA R A M E TE R S O F O P TI M A L M O D E L S I N " M I A M A XENT-M A XN E T CO M PA R I SO N " S EC TI O N
See Tables A1 and A2.
F I G U R E A 1 Deviation of expected values of predictions from the observed mean in the data, under three regularization scenarios (maxnet lasso regularization multiplier = 1, 10, 50). Deviations in models using lasso regularization are compared with deviations in unregularized models using the same set of predictors Ratio of observations:predictors Reduction in number of predictors lasso subset F I G U R E A 3 Near equivalence of "MIAmaxent" predictions and "maxnet" predictions, when maxnet uses a regularization multiplier close to zero and presence observations as part of the background. This illustration uses the "bradypus" data set from the maxnet package