### Summary

- Top of page
- Summary
- Introduction
- Implications of equivalence
- It's not what you use, it's how you use it
- Which way should you apply a given method?
- Discussion
- Acknowledgements
- References
- Supporting Information

**1.** The problems of analysing used-available data and presence-only data are equivalent, and this paper uses this equivalence as a platform for exploring opportunities for advancing analysis methodology.

**2.** We suggest some potential methodological advances in used-available analysis, made possible via lessons learnt in the presence-only literature, for example, using modern methods to improve predictive performance. We also consider the converse – potential advances in presence-only analysis inspired by used-available methodology.

**3.** Notwithstanding these potential advances in methodology, perhaps a greater opportunity is in advancing our thinking about how to apply a given method to a particular data set.

**4.** It is shown by example that strikingly different results can be achieved for a single data set by applying a given method of analysis in different ways – hence having chosen a method of analysis, the next step of working out how to apply it is critical to performance.

**5.** We review some key issues to consider in deciding how to apply an analysis method: apply the method in a manner that reflects the study design; consider data properties; and use diagnostic tools to assess how reasonable a given analysis is for the data at hand.

### Introduction

Technological and data-storage advances have led to an enormous increase in data on the locations and habitat preferences of species. In this paper, we focus specifically on cases where we have data on where a species has been found, without corresponding data on where it was not found. Such data are most commonly analysed by comparing the environmental conditions at the presence locations with those elsewhere in the study region, an approach sometimes referred to as a ‘used-available’ or ‘use-availability’ design (Manly *et al*. 2002). A model fitted in this way is commonly referred to as a resource selection function (Boyce & McDonald 1999; Manly *et al*. 2002, Fig. 1).

The ‘classical’ approach to resource selection function estimation is to apply logistic regression to used-available data – the response variable takes the value one or zero depending on whether a point is *used* (a known presence) or *available* (typically a randomly selected point). The resource selection function is then defined as the exponential function of the linear predictor from the fitted logistic regression model (Boyce & McDonald 1999; Manly *et al*. 2002; Johnson *et al*. 2006). The literature on resource selection functions has its origins over thirty years ago (Johnson 1980), and the seminal text on the topic was first published twenty years ago (Manly *et al*. 2002, first edition 1992).
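
As a minimal sketch of this classical recipe, the following fits a logistic regression to used and available points by Newton-Raphson and then exponentiates the linear predictor. The covariate, sample sizes and species preference are all invented for illustration; this is not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: one environmental covariate (think of it as elevation),
# with 'used' points drawn from higher values than 'available' points.
n_used, n_avail = 200, 1000
x_used = rng.normal(1.0, 1.0, n_used)
x_avail = rng.normal(0.0, 1.0, n_avail)

X = np.column_stack([np.ones(n_used + n_avail),
                     np.concatenate([x_used, x_avail])])
y = np.concatenate([np.ones(n_used), np.zeros(n_avail)])

# Fit logistic regression by Newton-Raphson.
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))

# The RSF is the exponential of the linear predictor; the intercept is
# usually dropped, as it reflects only the used:available ratio.
def rsf(x):
    return np.exp(beta[1] * x)

print("fitted slope:", round(float(beta[1]), 2))
print("relative selection, x = 1 vs x = 0:", round(float(rsf(1.0) / rsf(0.0)), 2))
```

Note that only ratios of RSF values between locations are interpretable, which is why the intercept is conventionally discarded.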

A related topic has recently emerged in the species distribution modelling literature, on methods for analysing presence-only data (Pearce & Boyce 2006; Elith & Leathwick 2009). Presence-only data typically consist of digitized opportunistic sightings or museum records of where a species occurs, over a broad geographic scale. By coupling these data with maps of environmental variables, we can model the spatial distribution of a species using generalizations of logistic regression (Pearce & Boyce 2006) or modern methods of classification (Phillips *et al*. 2006; Elith *et al*. 2006).

Although these approaches are called presence-only methods, they rely on environmental background layers, which act as a set of ‘pseudo-absence’ points (Pearce & Boyce 2006). Typically, the response variable in analyses is an indicator which takes the value one for presence points and zero for pseudo-absences. Presence-only analysis is a relatively recent topic in the species distribution modelling literature, which has exploded in recent times – ISI Essential Science Indicators rate it as one of the fastest-moving research fronts in the environmental sciences (accessed December 2012).

This paper is a contribution to a *Journal of Animal Ecology* Special Feature that arose from recognition that the two problems described above – estimating a resource selection function from used-available data and estimating a species distribution model from presence-only data – are equivalent. The equivalence of these two problems has been known for some time (Ferrier *et al*. 2002, for example), but despite this, the used-available and presence-only literatures appear to have developed largely in parallel with little cross-fertilization of ideas. The following section (*Implications of equivalence*) considers the question: how can we advance one literature by leveraging from lessons learnt in the other? Then (in *It's not what you use, it's how you use it*), we consider more broadly the question of how to advance current practice in used-available and presence-only data modelling, in particular, the importance of thinking about how best to apply a given analysis method to the data set at hand. Finally (*Which way should you apply a given method?*), we review three key considerations when applying a given analysis method – study design, data properties and using goodness-of-fit tools to inform analyses.

### It's not what you use, it's how you use it

The above suggestions, and more broadly, any advance in methods for analysing used-available or presence-only data, have the potential to improve predictive performance. It is well understood that applying different analysis methods to a data set can give very different results and very different predictive performance (Elith *et al*. 2006). However, it should also be understood that when different methods are applied in a similar way, differences in performance are typically more modest, and in some instances, seemingly different analysis methods can lead to near-identical results.

These ideas are illustrated on a bighorn sheep data set (T. Ryder, Wyoming Game and Fish Department, unpublished data). Five bighorn sheep (*Ovis canadensis*) were GPS-collared, and their location (in the Seminoe Mountains, Wyoming, USA) recorded hourly between 1 January and 15 April 2010 (Fig. 1). Maps of five environmental variables were also available over the whole study region, giving information on topography (aspect, slope, elevation) and exposure (distance from nearest tree, and distance from nearest ‘escape terrain’). It was of interest to estimate the resource selection function characterizing the behaviour of these five sheep and to estimate the relative importance of the five environmental variables to the sheep.

We applied three quite different analysis methods to the bighorn sheep data – logistic regression (Warton & Shepherd 2010), maximum entropy (MAXENT; Phillips *et al*. 2006) and multivariate adaptive regression splines (MARS; Elith & Leathwick 2007). These three methods were applied in similar ways – using the same random set of 2617 pseudo-absences and the same five environmental variables, included as linear, quadratic and interaction terms. Results were very similar across methods, as seen from inspection of maps of predicted values (Fig. 2) or from consideration of the relative importance of different environmental variables (Table 1).

Table 1. Different methods can give not-so-different results: the relative importance of different environmental variables (reported as % of explained deviance estimated via a leave-one-out approach) in (a) logistic regression; (b) MAXENT; (c) MARS, when modelling the bighorn sheep data as in Fig. 2. Note that results are broadly similar across models; for example, the rank order of the five environmental variables is unchanged across the three models.

| Variable | (a) Logistic regression | (b) MAXENT | (c) MARS |
|---|---|---|---|
| Aspect | 7·7 | 8·3 | 12·8 |
| Distance to escape | 20·7 | 13·9 | 14·5 |
| Slope | 0·9 | 0·7 | 0·1 |
| Elevation | 34·4 | 49·7 | 41·9 |
| Distance to tree | 31·5 | 18·2 | 18·0 |
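
The "% of explained deviance" importance measure used here can be read as refitting the model with each variable dropped in turn and recording the extra deviance incurred; the sketch below implements that reading on invented data (two synthetic covariates, only the first truly influential), not the sheep data themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_logistic(X, y, iters=30):
    """Newton-Raphson fit; returns coefficients and residual deviance."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))
    p = np.clip(1 / (1 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    dev = -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, dev

# Invented data: two covariates, only the first strongly influential.
n = 2000
Z = rng.normal(size=(n, 2))
eta = -1 + 1.5 * Z[:, 0] + 0.1 * Z[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(float)
X = np.column_stack([np.ones(n), Z])

_, dev_full = fit_logistic(X, y)
_, dev_null = fit_logistic(X[:, :1], y)   # intercept-only model
explained = dev_null - dev_full

# Importance of each variable: extra deviance incurred when it is dropped,
# expressed as a percentage of the total explained deviance.
importances = {}
for j, name in enumerate(["z1", "z2"], start=1):
    keep = [c for c in range(X.shape[1]) if c != j]
    _, dev_drop = fit_logistic(X[:, keep], y)
    importances[name] = 100 * (dev_drop - dev_full) / explained
print({k: round(v, 1) for k, v in importances.items()})
```

As expected, the influential covariate accounts for the bulk of the explained deviance, mirroring the ranking exercise in Table 1.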

However, precisely *how* a method is applied to data can have dramatic effects on results and predictive performance. For example, consider analyses of the bighorn sheep data using the same method, a Poisson point process model, with the same five environmental variables (as listed in Table 1). However, we have applied this analysis method in three different ways:

- As a static model – using the five environmental variables only, and not accounting for the time-sequencing of the data in any way.
- As a movement model (described below) – using raw data without transformation.
- As a movement model – using transformed data, where appropriate.

The movement model was fitted using the same modelling framework as the static model, but it additionally included three ‘movement variables’ in analyses, and pseudo-absences or ‘quadrature points’ (Warton & Shepherd 2010) were chosen in a different way. The three additional movement variables were a function of a sheep's last known sighting (distance from last location, direction of movement, time-of-day). By including these terms, the interpretation of model output changes – instead of modelling where a sheep is, we are modelling where a sheep will go next (given where it last was and when it was there). Pseudo-absences were chosen in the neighbourhood of a sheep's current location (similar to Forester *et al*. 2009), whereas for the static model, a regularly-spaced 30×30 m grid consisting of 78 182 pseudo-absences was used (Warton & Shepherd 2010). Further details are included in Appendix S1.
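
Computing movement covariates of this kind from an hourly track is straightforward. The sketch below uses an invented four-fix track and an invented radial scheme for choosing pseudo-absences around the previous location; the radius, counts and variable names are illustrative, not those of the paper.

```python
import numpy as np

# Invented hourly GPS track: (x, y) coordinates in metres.
track = np.array([[0.0, 0.0], [40.0, 10.0], [55.0, 60.0], [50.0, 90.0]])
t = np.arange(len(track))  # hours since start of tracking

# Movement covariates for each fix after the first, relative to the
# previous known location.
steps = np.diff(track, axis=0)
dist_moved = np.hypot(steps[:, 0], steps[:, 1])                   # distance from last location
bearing = np.degrees(np.arctan2(steps[:, 0], steps[:, 1])) % 360  # direction of movement (deg)
time_of_day = t[1:] % 24                                          # circular, period 24 h

# Radial pseudo-absence ('quadrature') points in the neighbourhood of the
# previous location, uniform over a disc of illustrative radius 100 m.
rng = np.random.default_rng(2)
def radial_pseudo_absences(centre, n=20, radius=100.0):
    ang = rng.uniform(0, 2 * np.pi, n)
    r = radius * np.sqrt(rng.uniform(0, 1, n))  # sqrt gives uniform density on the disc
    return centre + np.column_stack([r * np.cos(ang), r * np.sin(ang)])

pa = radial_pseudo_absences(track[0])
print(dist_moved.round(1), pa.shape)
```

These covariates would then enter the point process model alongside the environmental variables, shifting the interpretation from "where the sheep is" to "where the sheep goes next".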

When a point process model was applied to the bighorn sheep data in the above three ways, results differed substantially, both in terms of the appearance of maps of predicted sheep intensity (Fig. 3) and the relative importance of different variables (Table 2).

Table 2. Changing how you apply a method can dramatically affect results: the relative importance of different explanatory variables (reported as % of explained deviance estimated via a leave-one-out approach) when using (a) a static model; (b) a movement model on raw data; (c) a movement model on suitably transformed data. BIC of the fitted models is also included as a measure of goodness-of-fit.

| Variable | (a) Static model | (b) Movement model, raw data | (c) Movement model, transformed data |
|---|---|---|---|
| *Static variables* | | | |
| Aspect | 32·5 | 0·29 | 0·21 |
| Distance to escape | 1·5 | 0·04 | 0·02 |
| Slope | 0·1 | 0·03 | 0·04 |
| Elevation | 3·1 | 0·04 | 0·04 |
| Distance to tree | 0·5 | 0·02 | 0·01 |
| *Movement variables* | | | |
| Distance moved | – | 78·3 | 76·3 |
| Direction of movement | – | 0·05 | 0·15 |
| Distance moved × Time of day | – | 0·96 | 3·85 |
| BIC | 657 713 | 500 261 | 496 544 |

While we can advance the methodology for presence-only and used-available analysis, and in so doing make some performance gains, there is clearly significant potential to make gains by advancing our thinking about *how to use* a given analysis method. This potential is evident in comparing results in Figs 2 and 3 or Tables 1 and 2 – while changing the analysis methodology had some effects on results, changing how a given method was applied had a more substantial effect. This raises the question: once you have chosen an analysis method, how should you decide how to apply it?

### Which way should you apply a given method?

Given that the way a method is applied substantially influences results, it is important to consider carefully how to apply any chosen analysis method to the data at hand. A key consideration when analysing data in ecology is that the assumptions and approach make ecological sense (Austin 2002), a consideration which has implications for the choice of variables for analysis, and their form of inclusion in the model. In addition, there are some important statistical considerations:

- Study design – match the analysis method to the method by which the data were collected.
- Data properties – study the properties of data to be analysed and ensure variables are analysed on the appropriate scale.
- Goodness-of-fit – apply diagnostic tools to assess how well the given method of analysis fits the data.

These points are illustrated by example below, using the bighorn sheep data.

#### Study design

How data are analysed needs to be directly related to how they were collected. What were the independent sampling units (subjects) that were sampled? How were data collected on these subjects? See Cressie *et al*. (2009) for some additional considerations, including analysis when subjects had unequal probabilities of being sampled.

In the case of the bighorn sheep data (Fig. 1), the study involved putting radiocollars onto five sheep and using GPS to track sheep movement at hourly intervals. Hence, the five sheep were the independent sampling units, and the data that were collected are best described as *movement* data. The data did not arise as some ‘static’ list of points where the sheep have been located. This means that static models (as in Figs 2 and 3a), which do not take account of the time-sequencing in the data, are inappropriate, due to a mismatch between the way the data were collected and the way they are treated in analysis. A high level of temporal autocorrelation has been introduced through repeated sampling, which needs to be accounted for to validly infer the nature of the environmental association (Patterson *et al*. 2008).

A more natural model for the bighorn sheep data aims to predict future sheep locations not only as a function of environmental variables, but also as a function of previous location(s) as in Fig. 3b,c. This changes the interpretation of the model from a static model, of where the sheep is standing, to a movement model, of where the sheep is going. Hence, we were able to directly model the resource selection decisions that a sheep was making. Distance from last known location proved to be by far the most important predictor of where a sheep was next seen, accounting for over 75% of explained deviance in the movement models (Table 2b,c). This is not a particularly surprising result – obviously the best place to look for a sheep is where you last saw it! However, given how important previous location was in predicting a sheep's future location, it was important to incorporate this information into analysis.

Movement might be expected to show some diurnal variation, so because sheep were tracked at hourly intervals, time-of-day should also be incorporated into the model. In fact, analyses suggested that time-of-day (and its interaction with distance moved) was the second most important variable in the model. Further inspection suggested that sheep were most active at night and least active early in the morning.

The identities of the five different sheep were not used in the analyses of Fig. 2, and incorporating that knowledge could further improve models. One way to make use of sheep identity is to make design-based inferences (Manly *et al*. 2002) about predictive performance. For example, we can use a leave-one-out approach to consider how well a model predicts the movement of sheep *i*, when the model was constructed using all data except that for sheep *i*. Such use of independent ‘test data’ to assess predictive performance is an important idea in model validation (Boyce *et al*. 2002), and using different sheep as the test data, we can assess how well the model transfers from one sheep to the next.
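
A leave-one-sheep-out scheme of this kind might look as follows. The five "sheep" here are simulated stand-ins, and mean negative log-likelihood on the held-out animal is used as an illustrative score; the paper does not prescribe a particular score.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented per-animal data: a design matrix and binary response for each of
# five simulated sheep, all sharing the same underlying habitat preference.
def make_sheep(n=300):
    x = rng.normal(size=n)
    y = (rng.random(n) < 1 / (1 + np.exp(-(x - 0.5)))).astype(float)
    return np.column_stack([np.ones(n), x]), y

sheep = {i: make_sheep() for i in range(5)}

def fit_logistic(X, y, iters=30):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))
    return beta

# Leave-one-sheep-out: fit on four animals, score predictions on the fifth.
scores = {}
for held_out in sheep:
    X_tr = np.vstack([sheep[i][0] for i in sheep if i != held_out])
    y_tr = np.concatenate([sheep[i][1] for i in sheep if i != held_out])
    beta = fit_logistic(X_tr, y_tr)
    X_te, y_te = sheep[held_out]
    p = np.clip(1 / (1 + np.exp(-X_te @ beta)), 1e-12, 1 - 1e-12)
    scores[held_out] = -np.mean(y_te * np.log(p) + (1 - y_te) * np.log(1 - p))
print({k: round(v, 3) for k, v in scores.items()})
```

Holding out whole animals, rather than random observations, respects the study design: the sheep, not the hourly fixes, are the independent sampling units.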

#### Data properties

Some variables should not be analysed in their raw form, but instead routinely require transformation. A common example in biology is size variables, which tend to be the outcome of multiplicative processes and hence are quite naturally interpreted on a logarithmic scale (Kerkhoff & Enquist 2009). A different example, encountered in the bighorn sheep data set, is circular variables (Fisher 1993). Aspect is a circular variable measured from 0 to 360 degrees, with 0 and 360 both meaning the same thing – a due north aspect. Hence, this variable makes little sense when analysed on an arithmetic scale (Fig. 4a), but should be transformed in some way to reflect its circularity. A second circular variable in the sheep data set is time-of-day, which is circular in time (0 to 24 h).

A simple way to modify a circular variable for regression analyses is to include its sine and cosine transformations as predictors, sin(2π*x*/*K*) and cos(2π*x*/*K*), where *K* is the periodicity of the circular variable (*K* = 360 and *K* = 24, respectively, for aspect measured in degrees and time-of-day measured in hours). The sine and cosine functions map a circular variable onto the unit circle such that it can be interpreted as a directional quantity (Fisher 1993). Figure 4b plots the aspects at which sheep were located, with ‘jittering’, such that a high density of points suggests an aspect highly favoured by sheep. Contrary to Fig. 4a, a pattern can be seen – the sheep tend to favour southerly aspects.
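
A short sketch of this transformation (the helper name is ours, not from the paper): the key property is that values at opposite ends of the raw scale, such as aspects of 0 and 360 degrees, map to identical features.

```python
import numpy as np

def circular_features(x, period):
    """Map a circular variable onto the unit circle via sine and cosine terms."""
    angle = 2 * np.pi * np.asarray(x, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Aspect in degrees: 0 and 360 both mean due north, so features must agree.
s0, c0 = circular_features(0.0, 360)
s360, c360 = circular_features(360.0, 360)

# Time of day, period 24 h: 23:00 and 01:00 are close on the circle, even
# though they are 22 apart on the raw arithmetic scale.
s23, c23 = circular_features(23, 24)
s1, c1 = circular_features(1, 24)
print(np.hypot(s23 - s1, c23 - c1).round(2))  # chord length ≈ 0.52
```

Including both the sine and cosine terms in a regression lets the fitted effect peak at any point on the circle, not just at the origin of the raw scale.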

With reference to the point process models introduced previously (Fig. 3), model (b) was fitted using aspect and time-of-day without transformation. On more careful consideration of data properties, both variables should have been sine- and cosine-transformed, as in model (c). The implications of these changes on results were relatively modest – the maps of predicted sheep intensity were broadly similar in Fig. 3b,c, and in Table 2b,c, the main gains seemed to come from treating time-of-day as a circular variable, which captured an additional 3% of deviance.

#### Goodness-of-fit

The precise diagnostic tools that can be used to assess goodness-of-fit depend on the method of data analysis. Generalized linear models, for example, can typically be diagnosed using residual plots (Dunn & Smyth 1996) and information criteria. A conspicuous shortcoming of some machine learning methods is the lack of diagnostic tools – having fitted a support vector machine (Hastie *et al*. 2009) to data, for example, what model assumptions were made, and how can we check that they were reasonable?

A range of diagnostic tools have been developed for point process models (Baddeley *et al*. 2000; Diggle 2003; Baddeley & Turner 2005). We can use information criteria: for example, applying BIC to the three point process models of Table 2 suggests that model (c) is the most appropriate. Graphical tools can also be applied, for example *K*-functions (Baddeley *et al*. 2000; Diggle 2003). For Fig. 5, the cumulative conditional intensity of latitude (Cressie 1993) was calculated separately for each sheep using a leave-one-out approach, then plotted as a function of time. If the fitted model were valid, values of Λ(*y*) at points where sheep were observed would be approximately uniformly distributed, and importantly, they would be independent of time (or any other variable). This is evidently not the case for the static model (Fig. 5a), where each sheep's location in a north-south direction ‘drifts’, because of the dependence between a sheep's current location and its most recent known location. Models (b) and (c) account for this dependence sufficiently well that spatial dependence is no longer detectable (Fig. 5b,c).
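
As one concrete instance of using information criteria as a diagnostic, the sketch below compares BIC for two candidate logistic models on invented data whose true effect is quadratic; the data and candidate models are illustrative, not those of Table 2.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_logistic(X, y, iters=30):
    """Newton-Raphson fit; returns coefficients and maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))
    p = np.clip(1 / (1 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return beta, np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Invented data generated with a genuinely quadratic effect.
n = 3000
x = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-(0.5 + x - 0.8 * x**2)))).astype(float)

candidates = {
    "linear": np.column_stack([np.ones(n), x]),
    "quadratic": np.column_stack([np.ones(n), x, x**2]),
}
bic = {}
for name, X in candidates.items():
    _, ll = fit_logistic(X, y)
    bic[name] = -2 * ll + X.shape[1] * np.log(n)  # BIC = -2 log L + k log n
print("preferred model:", min(bic, key=bic.get))
```

BIC penalizes each extra parameter by log *n*, so the quadratic model is preferred only because its deviance improvement comfortably exceeds that penalty.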

### Discussion

The equivalence of the problems of analysing used-available and presence-only data presents some opportunities to advance on current practice in each discipline, by leveraging ideas developed in their corresponding contexts. Several ideas on this front have been suggested in this paper, by noticing ideas developed in one literature that have seemingly been under-developed in the other – some particularly interesting examples include the application of modern analysis methods (Elith *et al*. 2006) in used-available analysis, and the development of presence-only models of functional response to environment (Matthiopoulos *et al*. 2011) as a means to study the potential effects of a changing environment on species distribution.

While there are some ideas that have been developed in one literature and not the other, some insight can also be gained by studying commonalities. An important common theme in the used-available and presence-only literatures is how to select *available* points or pseudo-absences to include in analyses (Chefaoui & Lobo 2008; Forester *et al*. 2009; Barbet-Massin *et al*. 2012). The significant potential influence of this decision is evident in the analyses of Tables 1 and 2, where the method of pseudo-absence choice seemed the most striking source of differences in results. In the models of Table 1, a random set of 2617 pseudo-absences was analysed. This gave completely different results from Table 2a, which analysed 78 182 pseudo-absences in a uniform rectangular grid at a fine spatial resolution. The latter is a natural and effective sampling scheme for a static model (Warton & Shepherd 2010), and the substantial differences in results suggest that for this data set, 2617 random pseudo-absences were grossly insufficient. Results were different again in Table 2b,c, where for each presence point, a set of pseudo-absences was chosen in a radial design around the last known location (similar to Forester *et al*. 2009). On the question of precisely how to choose the number and location of pseudo-absences, point process models (Warton & Shepherd 2010) and animal movement models (Moorcroft & Barnett 2008; Forester *et al*. 2009) seem to have particular potential – when the role of the pseudo-absences is implicit in the modelling framework, there is no need to make *ad hoc* decisions to specify their number and location.
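
The first two pseudo-absence schemes contrasted above can be sketched in a few lines. The study-region bounds below are invented; the random count echoes the 2617 points of Table 1, while the 30 m grid over this hypothetical region necessarily yields a different count from the paper's 78 182.

```python
import numpy as np

rng = np.random.default_rng(5)
xmin, xmax, ymin, ymax = 0.0, 3000.0, 0.0, 2000.0  # invented study region (metres)

# (a) Random background points, as in the Table 1 models.
n_random = 2617
random_pa = np.column_stack([rng.uniform(xmin, xmax, n_random),
                             rng.uniform(ymin, ymax, n_random)])

# (b) Regular grid at 30 m spacing, as in the static point process model;
# the grid count is determined by the region, not chosen ad hoc.
gx, gy = np.meshgrid(np.arange(xmin, xmax, 30.0), np.arange(ymin, ymax, 30.0))
grid_pa = np.column_stack([gx.ravel(), gy.ravel()])

print(random_pa.shape, grid_pa.shape)
```

In the point process view, the grid points act as quadrature points for approximating an integral, which is what removes the arbitrariness: their number is set by the desired accuracy of that approximation rather than by convention.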

Although not the focus of this paper, it should be noted that there is a vast and growing body of literature on methods for modelling animal movement (Patterson *et al*. 2008; Moorcroft & Barnett 2008, for example), suitable for data such as the bighorn sheep example considered here. The Poisson point process approach considered in Fig. 3 was not a typical animal movement modelling approach; rather, it was an ‘omnibus’ approach for analysing point patterns adapted to the problem of modelling movement. However, the approach was sufficiently flexible that it could construct quite reasonable regression models of presence-only data in either the static or movement context, to illustrate some key ideas.

The key idea demonstrated in Table 2 and Fig. 3 was that the most important consideration in analysis is perhaps not which analysis method to use, but how to apply any given method in a manner that is appropriate for the data at hand. Both the used-available and presence-only literatures are rife with papers proposing advances in analysis methodology (for example, Phillips *et al*. 2006; Elith & Leathwick 2007; Elith *et al*. 2008; Matthiopoulos *et al*. 2011), but less attention tends to be paid to the perhaps more important question of how to apply a method appropriately. Key statistical considerations are as follows: analyse data in a manner which reflects the study design; consider data properties; use diagnostic tools to assess how reasonable a given analysis is for the data at hand. Yet some methods of analysis lack the flexibility to handle different study designs (e.g. incorporating animal movement), and some are seriously deficient in diagnostic tools for assessing goodness-of-fit. Perhaps this is where the greatest gains can be made in advancing methods for used-available and presence-only analysis.