#### Species distribution model

The IPPM is appropriate to model the location of points that are independent after conditioning on the environmental and geographical covariates. If the locations of individuals are independent, then the IPPM is appropriate to model the distribution of individuals. Many species, however, occur in groups. If individuals are treated as unique points, at a minimum, the individuals (points) that belonged to a group are not independent. Methods to test for independence of groups (i.e., point interactions) are well developed, and many methods exist to explicitly model point interactions (e.g., area-interaction model; Cressie 1993; Diggle 2003; Renner and Warton 2013). We proceed assuming that individuals occur in independent groups and that group locations can be modeled with an IPPM; however, the analyst should verify this assumption (Diggle 2003; Renner and Warton 2013).

The IPPM is similar to a generalized linear model with a Poisson response distribution because environmental covariates influence the group intensity through the log link function. The linear predictor can be written as:

$$\log(\boldsymbol{\lambda}_{gi}) = \alpha_{0} + \mathbf{X}_{gi}\boldsymbol{\alpha}_{gi} \qquad (1)$$

where the vector **λ**_{gi} is the group intensities, *α*_{0} is the intercept, **X**_{gi} is the design matrix of environmental covariates, and **α**_{gi} is the vector of environmental coefficients.
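As an illustration only, the log link in equation (1) can be sketched in a few lines of Python. The function name `group_intensity` and all numerical values are hypothetical and are not taken from Appendix S1; the sketch simply exponentiates the intercept plus the covariate terms for each location.

```python
import math

def group_intensity(alpha0, alpha, X):
    """Return exp(alpha0 + x . alpha) for each row x of the design matrix X."""
    return [math.exp(alpha0 + sum(a * x for a, x in zip(alpha, row))) for row in X]

# Hypothetical example: two covariates measured at three locations.
X_gi = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
lam_gi = group_intensity(alpha0=-1.0, alpha=[0.5, 0.25], X=X_gi)
```

Under the log link, a one-unit increase in a covariate multiplies the group intensity by the exponential of its coefficient (here, `lam_gi[1] / lam_gi[0]` equals exp(0.5)).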

To estimate model parameters, the IPPM likelihood is required. The IPPM likelihood contains an integral that can be difficult or impossible to solve; therefore, numerical approximation is required. Many techniques have been developed to approximate the likelihood and obtain parameter estimates from the IPPM, and several of the methods are implemented in easily accessible software packages (Fithian and Hastie 2013).

Additional data associated with presence-only locations (e.g., group sizes) are known as marks (Cressie 1993; Diggle 2003). Marked IPPMs, for example, have been applied in forestry statistics to model the locations of trees and wood volumes (Stoyan and Penttinen 2000). We treat group sizes as marks and analyze the marks using a zero-truncated generalized linear model (GLM) assuming a truncated Poisson distribution. The zero-truncated GLM is similar to standard GLMs; however, the assumed response distribution is conditioned on the fact that only group sizes greater than zero can be reported for presence-only data (Zuur et al. 2009; Zipkin et al. 2012). Similar to the IPPM model, we model the expected group size using a linear predictor

$$\log(\boldsymbol{\lambda}_{gs}) = \gamma_{0} + \mathbf{X}_{gs}\boldsymbol{\gamma}_{gs} \qquad (2)$$

where the vector **λ**_{gs} is the rate parameters of the zero-truncated Poisson distribution (i.e., unconditional expected group sizes), *γ*_{0} is the intercept, **X**_{gs} is the design matrix of environmental covariates and **γ**_{gs} is the vector of environmental coefficients.
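The distinction between the rate parameter (the unconditional expected group size) and the expected size of an observed group may be easier to see in code. The following Python sketch is illustrative only; the function names `ztp_pmf` and `ztp_mean` are hypothetical. It evaluates the zero-truncated Poisson probability mass function and its conditional mean using the standard result E[Y | Y > 0] = λ / (1 − e^{−λ}).

```python
import math

def ztp_pmf(k, lam):
    """P(Y = k | Y > 0) for a Poisson(lam) count truncated at zero."""
    if k < 1:
        return 0.0
    log_p = k * math.log(lam) - lam - math.lgamma(k + 1)
    return math.exp(log_p) / (1.0 - math.exp(-lam))

def ztp_mean(lam):
    """Conditional mean E[Y | Y > 0] = lam / (1 - exp(-lam))."""
    return lam / (1.0 - math.exp(-lam))
```

Because zeros cannot be observed, the mean of the reported group sizes (`ztp_mean(lam)`) always exceeds the rate parameter `lam`; the difference is largest when `lam` is small.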

Modeling group sizes separately from group locations allows us to use different covariates in models of group intensities and group sizes. This flexibility is required to adequately model the distribution of abundance if environmental features influence group sizes. We note that the zero-truncated Poisson distribution may not be the best model of group sizes for all presence-only data; however, many zero-truncated distributions (e.g., zero-truncated negative binomial) exist. Models of sea duck group sizes from aerial surveys were explored by Zipkin et al. (2012), and their methods could also be applied to presence-only data.

To model intensities of abundance (**λ**_{abundance}), we multiplied the elements of group intensities by the unconditional expected group sizes:

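$$\boldsymbol{\lambda}_{abundance} = \boldsymbol{\lambda}_{gi} \odot \boldsymbol{\lambda}_{gs} \qquad (3)$$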

Due to the exponential inverse link function, environmental coefficients that occurred in both the IPPM and zero-truncated GLM models can be summed to estimate the marginal effects of environmental covariates on intensity of abundance.
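To make the summation explicit, consider an illustrative scalar case in which a single covariate *x* appears in both models, with coefficient *α*_{1} in the IPPM and *γ*_{1} in the zero-truncated GLM (this notation is assumed here for the example). The product of the two exponentials is

$$
\lambda_{abundance}(x) = \exp(\alpha_{0} + \alpha_{1}x)\,\exp(\gamma_{0} + \gamma_{1}x) = \exp\{(\alpha_{0} + \gamma_{0}) + (\alpha_{1} + \gamma_{1})x\}
$$

so that a one-unit increase in *x* multiplies the intensity of abundance by exp(*α*_{1} + *γ*_{1}).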

Although we have presented linear models for the IPPM and zero-truncated GLM, many less restrictive methods exist to estimate **λ**_{gi} and **λ**_{gs}. For example, boosted regression trees or generalized additive models could also be used to estimate **λ**_{gi} and **λ**_{gs} (Guisan et al. 2002; Elith et al. 2008; Fithian and Hastie 2013).

#### Correcting for nondetection

Sampling bias that results in nondetection of groups has the potential to bias parameter estimates and predictions from the IPPM, zero-truncated GLM, or any SDM that uses presence-only data (Dorazio 2012). The effect of nondetection (i.e., Bernoulli thinning of the point process) on parameter estimates and predictions from an IPPM depends on the covariates that affect the detection and intensity process (i.e., **λ**_{gi}). Although the effects of nondetection on the IPPM have been documented (Dorazio 2012), we chose to conceptualize the detection process as a missing data mechanism so we could provide a unified framework that applies to both group locations and group sizes (Little and Rubin 2002). Using the terminology of Rubin (1976), if detection and reporting of groups were perfect (i.e., **p**_{det} = 1; where **p**_{det} is the vector of probabilities corresponding to each presence-only record), opportunistic records would consist of every possible location of the groups. With perfect detection, all parameter estimates from the IPPM would be asymptotically unbiased and identifiable. If detection is imperfect, but the covariates that influence the detection process are independent of the covariates that affect **λ**_{gi}, then the missing data are classified as missing completely at random (MCAR). In general, MCAR data are the best that can be obtained from any presence-only data collection process. If the nondetected presence-only data are MCAR, unbiased coefficients and relative intensities are estimated with the IPPM assuming the model is correctly specified; however, an unbiased intercept parameter (*α*_{0}) is unidentifiable (Dorazio 2012; Fithian and Hastie 2013).
If the covariates that affect the detection process are correlated with, or are the same as, the covariates affecting **λ**_{gi}, the missing data mechanism results in nonignorable missing (NIM) data and the coefficients of the correlated or shared covariates estimated from the IPPM will be biased (Dorazio 2012). It should be emphasized that covariates affecting the probability of detection that are the same as or correlated with covariates affecting **λ**_{gi}, but are not included in the IPPM due to model misspecification (i.e., neglecting to include the covariate), will result in NIM data. In practice, it is difficult or impossible to know whether the model is correctly specified or whether the data are MCAR; therefore, assuming that the missing data mechanism results in NIM data is a conservative assumption. We present a decision tree to aid researchers in deciding when correcting for nondetection sampling bias is required for the IPPM (Fig. 1).

The effect of nondetection on the analysis of group size marks is slightly different. Similar to the IPPM, if the covariates that affect detection are independent of the covariates that affect group size, then the missing data mechanism is MCAR, which is equivalent to a completely random sample of group sizes. If the detection process resulted in MCAR data for group size, all parameters (*γ*_{0} and **γ**_{gs}) are identifiable and unbiased if detection is ignored. If, however, the covariates affecting detection are correlated with or the same as covariates affecting group size, the missing data are classified as missing at random (MAR). Under MAR, all parameters (*γ*_{0} and **γ**_{gs}) are identifiable and unbiased if detection is ignored, assuming the model of group size is specified correctly and contains the covariates that were correlated with or affected both nondetection and group size. Under the MAR mechanism, the detection process would result in fewer data from values of covariates that resulted in low detection, but unbiased parameter estimates (e.g., *γ*_{0} and **γ**_{gs}) and predictions of **λ**_{gs}. For example, detection may be high close to developed areas, but large groups may tend to avoid these areas. In this case, more observations of large group sizes could be reported from areas that the larger groups tend to avoid, but analysis of the group size data does not result in biased estimates of the intercept (*γ*_{0}) or coefficients (**γ**_{gs}). Finally, if detection depends on group size after adjusting for the influence of covariates, the missing data mechanism is NIM, and parameter estimates would be biased. For example, if detection is greater for larger groups, then the parameter estimates from the zero-truncated GLM are biased and a correction for nondetection may be warranted.
We present a decision tree to aid researchers in deciding when correcting for nondetection sampling bias is required for marks associated with presence-only locations (Fig. 2). Again, in practice, it is difficult or impossible to know whether the model is correctly specified or whether the missing data are MAR or MCAR; therefore, assuming that the missing data mechanism for the marks results in NIM data is likely a conservative assumption.

For presence-only data, correcting for nondetection is the same as correcting for missing data; therefore, we used methods to correct for NIM data in our study. To correct for NIM data, estimates of **p**_{det} must be obtained from auxiliary data (henceforth referred to as the detection data set) as there is no information in presence-only data about the detection process (Rubin 1976; Little and Rubin 2002). To correct for NIM data, the inverse of **p**_{det} is used to weight the log-likelihood of the IPPM and zero-truncated GLM (Little and Rubin 2002). Correcting for nondetection by weighting the log-likelihood is attractive because the analysis can be carried out in standard software that allows weights to be specified (see Appendix S1 for annotated R code).
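A minimal Python sketch of an inverse-probability-weighted log-likelihood is given here for the group size model with a single covariate. It is an illustration under stated assumptions, not the code in Appendix S1: the function name `weighted_ztp_loglik`, the single-covariate linear predictor, and the argument names are all hypothetical. Each record's zero-truncated Poisson log-density is divided by its detection probability before summing.

```python
import math

def weighted_ztp_loglik(gamma0, gamma1, sizes, z, p_det):
    """Zero-truncated Poisson log-likelihood with each record weighted by 1 / p_det."""
    total = 0.0
    for y, zi, p in zip(sizes, z, p_det):
        lam = math.exp(gamma0 + gamma1 * zi)       # log link for the rate parameter
        log_f = (y * math.log(lam) - lam - math.lgamma(y + 1)
                 - math.log1p(-math.exp(-lam)))    # condition on y > 0
        total += log_f / p                         # inverse-probability weight
    return total
```

Under this weighting, a group detected with probability 0.25 contributes to the log-likelihood as if four identical groups had been recorded, which is how records from poorly sampled conditions are up-weighted.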

Although weighting the log-likelihood corrects the bias in the coefficient estimates and predictions of **λ**_{gi} and **λ**_{gs}, obtaining meaningful measures of uncertainty such as standard errors (SE), confidence intervals (CI), and prediction intervals that incorporate the uncertainty in the detection process requires additional effort in the form of implementing a two-phase bootstrapping algorithm. We implemented a two-phase, nonparametric bootstrap algorithm which uses the detection data set to obtain estimates of **p**_{det} and then fits the marked IPPM using the estimates of **p**_{det} to correct for nondetection sampling bias. We present the algorithm here:

1. Draw a bootstrap sample from the detection data set.
2. Fit an appropriate model to the bootstrap sample of the detection data set from step 1.
3. Draw a bootstrap sample from the presence-only data that includes group size marks.
4. Estimate **p**_{det} for each location for the bootstrap sample in step 3 using the fitted model from step 2.
5. Fit an IPPM that weights the log-likelihood function using the inverse of the estimated **p**_{det} and save coefficient estimates or predicted values of **λ**_{gi}.
6. Fit a model to group size that weights the log-likelihood function using the inverse of the estimated **p**_{det} and save coefficient estimates or predicted values of **λ**_{gs}.
7. Repeat steps 1–6 to obtain *b* bootstrap samples.
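The control flow of the algorithm can be sketched in Python as follows. This is a skeleton under stated assumptions rather than the implementation in Appendix S1: the model-fitting steps are passed in as callables (`fit_detection`, `predict_p_det`, `fit_ippm`, `fit_group_size`), all of which are hypothetical placeholders that the analyst would supply, and records are assumed to be stored as list elements.

```python
import random

def resample(rows, rng):
    """Nonparametric bootstrap: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def two_phase_bootstrap(detection_data, presence_data, fit_detection,
                        predict_p_det, fit_ippm, fit_group_size, b, seed=1):
    """Return b pairs of (IPPM result, group size result), one per replicate."""
    rng = random.Random(seed)
    results = []
    for _ in range(b):                                        # step 7
        det_sample = resample(detection_data, rng)            # step 1
        det_model = fit_detection(det_sample)                 # step 2
        pres_sample = resample(presence_data, rng)            # step 3
        p_det = [predict_p_det(det_model, row) for row in pres_sample]  # step 4
        weights = [1.0 / p for p in p_det]                    # inverse of p_det
        ippm_result = fit_ippm(pres_sample, weights)          # step 5
        size_result = fit_group_size(pres_sample, weights)    # step 6
        results.append((ippm_result, size_result))
    return results
```

Resampling the detection data set inside the loop is what propagates uncertainty in the estimated **p**_{det} into the saved estimates; fixing the detection model outside the loop would omit that source of variation.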

The CI and SE can be calculated from the empirical distributions; however, many other summaries of the empirical distributions (e.g., mean) may be of interest (Efron and Tibshirani 1994). An annotated example with R code implementing the two-phase nonparametric bootstrapping algorithm for the IPPM and zero-truncated GLM is available in Appendix S1.
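One common way to summarize the empirical distribution is sketched here in Python. The choice of the percentile interval is an assumption made for this example (other bootstrap intervals exist), and the function names `quantile` and `bootstrap_summary` are hypothetical.

```python
import math
import statistics

def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac

def bootstrap_summary(estimates, level=0.95):
    """Mean, SE (standard deviation of replicates), and percentile CI."""
    ordered = sorted(estimates)
    tail = (1.0 - level) / 2.0
    return {"mean": statistics.mean(ordered),
            "se": statistics.stdev(ordered),
            "ci": (quantile(ordered, tail), quantile(ordered, 1.0 - tail))}
```

Here `estimates` would hold the *b* saved values of a single coefficient or prediction from the bootstrap replicates.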

The use of weighted log-likelihoods to correct for bias has a long history for NIM data (Little and Rubin 2002) and has been used successfully to account for NIM data when GPS collars fail to record animal use locations in habitat selection studies (Frair et al. 2004). Although weighting provides an automatic procedure to reduce bias in parameter estimates and predictions from the IPPM and zero-truncated GLM when detection bias results in NIM data, weighting results in an increase in variance of the estimands. The increased variance may be undesirably large, and thus correcting for nondetection should be viewed as a bias–variance tradeoff. In general, imprecise (i.e., due to small sample size) and highly variable (i.e., due to the effect of covariates) estimates of **p**_{det} will result in highly variable estimands from the IPPM and zero-truncated GLM. For our simulation study, we estimated **p**_{det} using logistic regression (see simulation study); however, methods such as regularization that result in coefficient shrinkage, or trimming that results in less variable estimates of **p**_{det}, may result in a more desirable bias–variance tradeoff (Little and Rubin 2002; Hastie et al. 2009).

#### Simulation study

We conducted a simulation study to assess the properties of our SDM. For our simulation study, the data-generating distributions corresponded to those of the IPPM and zero-truncated GLM. This allowed us to test our two-phase bootstrap algorithm and determine whether our algorithm performed well on simulated data where the true values were known. We simulated group presence-only data (**y**_{pres}) over a region with 1 million pixels using an inhomogeneous Poisson point process distribution with intensity function (**λ**_{gi}) that varied according to the linear predictor:

$$\log(\boldsymbol{\lambda}_{gi}) = \alpha_{0} + \alpha_{1}\mathbf{z}_{gi} \qquad (4)$$

where *α*_{0} was the intercept and *α*_{1} was the regression coefficient for the vector of covariates **z**_{gi}. At each presence location, group sizes (**y**_{gs}) were simulated using a zero-truncated Poisson distribution with a rate parameter (**λ**_{gs}) that varied according to the linear predictor:

$$\log(\boldsymbol{\lambda}_{gs}) = \gamma_{0} + \gamma_{1}\mathbf{z}_{gs} \qquad (5)$$

where *γ*_{0} was the intercept and *γ*_{1} was the regression coefficient for the vector of covariates **z**_{gs}. Detection of each group (*y*_{det}) was simulated using a Bernoulli distribution, where a realized value of one represented detection and a value of zero represented nondetection. The probability of detection (*p*_{det}) varied according to the linear predictor:

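$$\operatorname{logit}(\mathbf{p}_{det}) = \theta_{0} + \theta_{1}\mathbf{z}_{det} + \theta_{2}\,s(\mathbf{y}_{gs}) \qquad (6)$$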

where *θ*_{0} was the intercept, *θ*_{1} was the coefficient for the vector of covariates **z**_{det}, and *θ*_{2} was the coefficient for the scaled and centered effect of group size (*s*(**y**_{gs})).

The entire simulated data set could be represented by the vectors: **y**_{pres}, **y**_{gs}, **y**_{det}, **z**_{gi}, **z**_{gs}, and **z**_{det}. The observed presence-only data set consisted of groups that were detected (i.e., **y**_{det} = 1). The auxiliary data used to estimate and correct for detection bias were obtained by taking a random sample without replacement from the full simulated data set (detected and nondetected). Logistic regression was used to estimate **p**_{det} using the auxiliary data set assuming the linear predictor in equation (6).
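The data-generating process can be sketched in Python as follows. This is a scaled-down illustration and not the simulation code used in the study: it uses far fewer pixels than the 1 million described in the text, it assumes a region of unit area (so each pixel has area 1 / `n_pixels`), it assumes a single shared covariate and a logit link for detection, and the function names are hypothetical.

```python
import math
import random
import statistics

def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small rates)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rztpois(lam, rng):
    """Zero-truncated Poisson draw obtained by rejecting zeros."""
    while True:
        y = rpois(lam, rng)
        if y > 0:
            return y

def simulate(alpha0, alpha1, gamma0, gamma1, theta0, theta1, theta2,
             n_pixels=20000, seed=1):
    rng = random.Random(seed)
    groups = []
    for _ in range(n_pixels):
        z = rng.gauss(0.0, 1.0)                  # shared covariate for all three processes
        lam_gi = math.exp(alpha0 + alpha1 * z)   # group intensity at this pixel
        for _ in range(rpois(lam_gi / n_pixels, rng)):   # assumed pixel area 1 / n_pixels
            size = rztpois(math.exp(gamma0 + gamma1 * z), rng)
            groups.append({"z": z, "size": size})
    sizes = [g["size"] for g in groups]
    mean, sd = statistics.mean(sizes), statistics.stdev(sizes)
    for g in groups:
        # Detection depends on the covariate and on scaled, centered group size.
        eta = theta0 + theta1 * g["z"] + theta2 * (g["size"] - mean) / sd
        g["p_det"] = 1.0 / (1.0 + math.exp(-eta))
        g["detected"] = 1 if rng.random() < g["p_det"] else 0
    return groups
```

For example, `simulate(7.0, 1.0, 1.0, 0.5, -2.0, -1.0, 0.5)` uses the parameter values reported for the small-sample case in this section; the presence-only data set is then the subset of groups with `detected == 1`.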

We simulated data from the worst-case scenario: low detection in habitat with a high intensity of abundance (i.e., more and larger groups) and where the covariate that affects the intensity of abundance is the same as the covariate that affects detection. We simulated the covariates from a single standard normal distribution so the covariates of group intensity, group size, and detection were the same (i.e., **z**_{gi} = **z**_{gs} = **z**_{det}). The covariate parameter for the inhomogeneous Poisson point process distribution was fixed at *α*_{1} = 1. We evaluated two sample sizes by setting the intercept (*α*_{0}) to 7.0 for the small sample size and 8.5 for the large sample size. We conducted 1000 simulations for each sample size and estimated the parameters of the IPPM using infinitely weighted logistic regression with 1000 Monte Carlo integration points and weights of 10000 (Fithian and Hastie 2013). The parameters for the zero-truncated Poisson distribution used to simulate group size were *γ*_{0} = 1 and *γ*_{1} = 0.5. The parameters for the Bernoulli distribution used to simulate the detection process for groups were *θ*_{0} = −2, *θ*_{1} = −1, and *θ*_{2} = 0.5, so that detection decreased with the habitat covariate and increased with group size. We randomly sampled 20% of the full data set to obtain our auxiliary detection data and estimated **p**_{det} using logistic regression. Extremely low values of the estimated **p**_{det} in the small sample size case resulted in convergence issues for steps five and six in our two-phase bootstrap algorithm, so we trimmed the estimates by replacing values of **p**_{det} ≤ 0.01 with 0.01. Although trimming could result in biased coefficient estimates, it improved convergence and greatly reduced the variance of parameter estimates from the IPPM and zero-truncated GLM with a minimal increase in bias in our simulations. For each simulation, we used *b* = 1000 bootstrap samples to estimate statistics from the empirical distributions.
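The trimming rule amounts to bounding the detection probabilities from below before inverting them. A minimal Python sketch follows; the function name `trimmed_inverse_weights` and the argument name `floor` are hypothetical.

```python
def trimmed_inverse_weights(p_det, floor=0.01):
    """Replace detection probabilities at or below `floor` with `floor`, then invert."""
    return [1.0 / max(p, floor) for p in p_det]
```

With a floor of 0.01, no single record can receive a weight greater than 100, which limits the influence of any one observation on the weighted log-likelihood.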

We evaluated the properties of our statistical methods by comparing the results from five scenarios for each sample size: (1) **p**_{det} was estimated and used to correct for detection bias; (2) **p**_{det} was estimated but the detection model was misspecified due to unknown group size; (3) **p**_{det} was known; (4) an unbiased sample of group locations and sizes (i.e., detection was perfect) was analyzed; and (5) detection bias was ignored. For studies using our methods, group size may be unknown in some of the auxiliary detection data (e.g., nondetected groups in a telemetry study; see discussion). Because of this, we evaluated our models ignoring the effect of group size (scenario 2) and estimated the parameters in our detection model with the misspecified linear predictor:

$$\operatorname{logit}(\mathbf{p}_{det}) = \theta_{0} + \theta_{1}\mathbf{z}_{det} \qquad (7)$$

Misspecification of the detection model could result in biased estimates of **p**_{det}, which, in turn, would result in biased estimates of *α*_{1}, *γ*_{1}, and . If the estimated **p**_{det} does not depend on group size or if group size was not available, there is no need to provide weights in step six of our estimation algorithm because the correction is equivalent to assuming that missing group size marks were MAR.

We compared estimates of *α*_{1}, *γ*_{1}, and from simulations of all five scenarios. We designed the comparison between the parameter estimates when **p**_{det} was known (scenario 3) and those when **p**_{det} was estimated (scenarios 1 and 2) to show the increase in variance due to uncertainty in the estimated **p**_{det}. We designed the comparison between parameter estimates from the unbiased sample (scenario 4) and those when **p**_{det} was known (scenario 3) to illustrate the increased variance of estimated parameters due to weighting the log-likelihood. Finally, we compared estimates from scenarios 1−4 to estimates from data when detection was ignored and the data were assumed to have been derived from an unbiased sampling effort (scenario 5).