Statistical inference and interpretation
In resource selection studies, the only information we have is the list of the resource units in the study area that were used during the investigation. We also have information on the resource units that could have been encountered during the study. Notice that a particular resource unit may be used repeatedly. This information is sufficient to obtain an estimate of the use distribution, fU(x). Given this limited information, the goal is to make inferences about the probability of selection (RSPF), s(x). Towards this goal, the key relationship is:

$$f^{U}(x) = \frac{s(x)\, f^{A}(x)}{\sum_{x'} s(x')\, f^{A}(x')},$$

where the sum runs over all resource types (and becomes an integral for continuous covariates). In the fox example, the use distribution answers the question: what is the probability that an egg picked randomly from those found in the stomach of a fox (use distribution of 863 eggs that were consumed) is blue? This probability is 0·8215 (Table 2). This probability is derived by examining only those units that were used and assumes that the set of observed used units is a random sample from all used units. We can also calculate this probability using the RSPF or the RSF and the available distribution, fA(x). If the probabilities of selection (RSPF) are known, fU(blue) is given by

$$f^{U}(\text{blue}) = \frac{s(\text{blue})\, f^{A}(\text{blue})}{\sum_{c} s(c)\, f^{A}(c)},$$

where c ranges over the egg colours. Because the selection ratio is proportional to the probability of selection, one can replace the probabilities of selection in the above formula by the RSF or selection ratios, written here as w(x), to obtain

$$f^{U}(\text{blue}) = \frac{w(\text{blue})\, f^{A}(\text{blue})}{\sum_{c} w(c)\, f^{A}(c)}.$$

Thus, knowledge of the probabilities of selection (RSPF), the RSF or even the selection ratios, together with the available distribution, is sufficient to obtain the use distribution, fU(x). In contrast, it is generally not possible to estimate the probability of selection given knowledge only of fU(x), that is, presence-only data. For example, the probability of selection could not be estimated if the 863 eggs used by the fox were known only from a museum collection (Pearce & Boyce 2006), without the accompanying data on what the fox encountered.
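The statement that the use distribution is proportional to the probability of selection multiplied by the available distribution can be checked numerically. The sketch below is illustrative only: the two egg categories, the selection probabilities and the available proportions are hypothetical values chosen for the example (they are not the values in Table 2), and the function name `use_distribution` is introduced here for convenience.

```python
def use_distribution(selection, available):
    """Discrete use distribution: s(x) * fA(x), normalized to sum to 1."""
    weights = {x: selection[x] * available[x] for x in available}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical inputs (assumptions for illustration, NOT the data of Table 2)
available = {"blue": 0.5, "other": 0.5}   # available distribution fA(x)
rspf = {"blue": 0.8, "other": 0.2}        # probabilities of selection s(x)

# An RSF is the RSPF multiplied by an unknown positive constant (here 10)
rsf = {x: 10.0 * s for x, s in rspf.items()}

f_u_from_rspf = use_distribution(rspf, available)
f_u_from_rsf = use_distribution(rsf, available)

print(f_u_from_rspf)
print(f_u_from_rsf)
```

Both calls return the same use distribution because the unknown constant cancels in the normalization. The same cancellation is why the reverse step fails: the use distribution alone cannot reveal the constant, and hence cannot reveal the probability of selection.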
In practice, the complete set of encountered units is seldom known. Instead, researchers postulate which resource units may be encountered and use that assumption to estimate the available distribution. For example, one may consider the whole study area to be equally accessible and take a random sample from it to estimate the available distribution. Or, one may consider only a small buffer around the current location as accessible and use that area to estimate the available distribution. Unfortunately, there is no practical way to check whether the assumed available distribution is appropriate. Moreover, the choice of the available distribution strongly affects the inference about the probability of selection and other quantities. Researchers should be aware that the major assumptions behind various RSF-related methods are (a) the available distribution is correctly specified, (b) selection depends only on the characteristics of the encountered resource unit and is independent of knowledge of the resource types of other resource units, and (c) the probabilities of selection remain unchanged during the period of investigation.
At its heart, resource selection involves active behavioural decisions by an organism. When designing controlled feeding studies, for example, researchers must decide whether to offer a food item to a consumer sequentially (Stage 1 experiments) or simultaneously with other food items (Stage 2 experiments) (Underwood & Clarke 2005; Taplin 2007; Manly 2006). When observing free-ranging animals, the way animals move and encounter resources has implications for the interpretation of outcomes (Matthiopoulos 2003; Martin et al. 2008). If resources are encountered sequentially, prey selection can be influenced by the order of prey presentation because of different cumulative handling times or gut saturation, even if the prey's ‘tastiness’ remains constant. In habitat studies, travel cost to a habitat and recent memory of resources can alter expectations and future selection of habitats. As a result, the time-scale over which selection studies are conducted can be important. Although researchers are aware of these complications, existing methods are unable to account for them.
One of the most commonly used RSPF models is the exponential RSPF. It is known that the intercept parameter (β0) of the exponential RSPF is non-estimable (Lele & Keim 2006). A computationally simple way to estimate the non-intercept parameters is to use any standard Logistic regression package (Johnson et al. 2006; McDonald et al. 2006). Using a Logistic regression package for this purpose has led to confusion about the interpretation of the estimates. For example, a common misconception is that the parameters can be interpreted as log-odds ratios (Hebblewhite, Merrill & McDonald 2005). However, the interpretation of the parameters is determined by the model that is being fitted, not by the computational procedure that is used to fit it. The coefficients in the exponential RSF model give relative risk on the log scale (Ramsay & Schafer 2002) and not the log-odds ratio. Sometimes, researchers standardize the estimated exponential RSF model by dividing the RSF values by their sum over the entire study area. Another standardization is to divide the estimated exponential RSF values by the maximum RSF value over the study area. Such standardized values do not correspond to the probability of selection. This is because the standardized values depend on the number and the types of resource units available in the study area, whereas, by definition, the probability of selection depends only on the characteristics of the unit itself, not on the characteristics of the other units in the study area nor on how many resource units there are. RSF values can be interpreted only in relation to each other. For example, one can use RSF values to answer the question: given two resource units, which one is more likely to be selected? Thus, the plot of standardized RSF values gives the correct visual impression of which areas are more likely to be selected and which ones are less likely to be selected, assuming encounter probability is the same for all resource units.
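The dependence of standardized RSF values on the composition of the study area can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration: a single covariate, a hypothetical coefficient `beta`, and two hypothetical study areas that differ by one resource unit. The function names are introduced here and do not come from any package.

```python
import math

def exponential_rsf(x, beta):
    """Exponential RSF for a single covariate: exp(beta * x)."""
    return math.exp(beta * x)

def standardize_by_max(values):
    """Divide each RSF value by the maximum RSF value in the study area."""
    top = max(values)
    return [v / top for v in values]

beta = 1.0                      # hypothetical coefficient
area_a = [0.0, 1.0, 2.0]        # covariate values in hypothetical study area A
area_b = [0.0, 1.0, 2.0, 3.0]   # area B: the same units plus one extra unit

std_a = standardize_by_max([exponential_rsf(x, beta) for x in area_a])
std_b = standardize_by_max([exponential_rsf(x, beta) for x in area_b])

# The unit with x = 1.0 receives a different standardized value in each area
print(std_a[1], std_b[1])

# The ratio between the units with x = 2.0 and x = 1.0 is exp(beta) in both
print(std_a[2] / std_a[1], std_b[2] / std_b[1])
```

The standardized value of an individual unit changes when a unit is added to the study area, so it cannot be a probability of selection. The ratio between two units is unaffected, which is the sense in which RSF values are interpretable only relative to each other.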
The exponential model is the unique model when both the used and the available distributions follow the Normal distribution. In practice, the case of both used and available distributions being Normal is rare because at least some of the covariates are categorical, or are strictly positive and may have skewed distributions. Nevertheless, the exponential model may still be appropriate when the distributions are not Normal; this should be checked using model comparisons between the exponential and other models such as Logistic or Probit. We also note that, similar to the above result, if both used and available distributions are multinomial, the exponential model is the only model that is permissible. This result was implicit in Lele & Keim (2006) and led to the conclusion that one cannot estimate the probability of selection when all covariates are categorical.
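The Normal case can be sketched in a few lines. The derivation below assumes a single covariate, with used and available distributions N(μ_U, σ_U²) and N(μ_A, σ_A²); this notation is introduced here for illustration. Because the use distribution is proportional to the probability of selection times the available distribution, s(x) is proportional to the ratio of the two densities:

```latex
\begin{align*}
\log \frac{f^{U}(x)}{f^{A}(x)}
  &= \log\frac{\sigma_A}{\sigma_U}
     - \frac{(x-\mu_U)^2}{2\sigma_U^2}
     + \frac{(x-\mu_A)^2}{2\sigma_A^2} \\
  &= c + \beta_1 x + \beta_2 x^2, \\[4pt]
\beta_1 &= \frac{\mu_U}{\sigma_U^2} - \frac{\mu_A}{\sigma_A^2},
\qquad
\beta_2 = \frac{1}{2}\left(\frac{1}{\sigma_A^2} - \frac{1}{\sigma_U^2}\right),
\qquad
c = \log\frac{\sigma_A}{\sigma_U} - \frac{\mu_U^2}{2\sigma_U^2} + \frac{\mu_A^2}{2\sigma_A^2}.
\end{align*}
```

Hence s(x) ∝ exp(β₁x + β₂x²), an exponential form that is linear in x when the two variances are equal (β₂ = 0) and quadratic otherwise. The ratio determines s(x) only up to a multiplicative constant, which is consistent with the intercept of the exponential model being non-estimable.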
The exponential model is somewhat limited because it allows estimation of only a relative probability of selection (RSF). The major deterrent to fitting models other than the exponential form was the lack of available software, but this capability is now evolving. Parametric functions such as the Logistic, Probit or Complementary log-log models can be fitted to telemetry data, with either common or changing availability. In contrast to the exponential RSPF, the Logistic, Probit or Complementary log-log models allow estimation of the actual probability of selection and not simply the relative probability of selection (Lele 2009; R package ‘ResourceSelection’), making them more useful than the exponential model. The weighted distribution approach to estimating the RSPF is applicable provided the covariate space contains at least one continuous covariate. It also needs a substantial number of used points, generally 300 to 500, to obtain stable estimators of the probability of selection. The new capability for estimating an RSPF makes it pertinent to ask when relative measures of selection (i.e. RSF) are adequate for the research or conservation question at hand. As argued by Lele (2009), although the relative change in probability of selection from 0·05 to 0·01 and from 0·9 to 0·18 is the same (one probability is 5 times the other), the management interpretations are likely to be quite different. In the first case, one might perceive that an already bad environment is made a bit worse; in the second case, a really good environment is worsened substantially. We believe the RSPF provides more information about the value of the resource type than the RSF. Of course, if one really wants to use the relative change in probability of selection, it can always be computed when the s(x) of two resource types is known.
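The arithmetic behind the example from Lele (2009) can be laid out explicitly. The short sketch below uses only the four probabilities quoted in the text; the helper name `relative_and_absolute_change` is introduced here for illustration.

```python
def relative_and_absolute_change(p_before, p_after):
    """Return (ratio, difference) between two probabilities of selection."""
    return p_before / p_after, p_before - p_after

# Probabilities quoted in the text
poor_ratio, poor_diff = relative_and_absolute_change(0.05, 0.01)
good_ratio, good_diff = relative_and_absolute_change(0.9, 0.18)

# An RSF can recover only the ratios, which are the same in both cases
print(round(poor_ratio, 6), round(good_ratio, 6))

# An RSPF also gives the absolute changes, which differ greatly
print(round(poor_diff, 6), round(good_diff, 6))
```

The two ratios coincide, whereas the absolute drop in the second case is eighteen times that in the first. Only a model that estimates s(x) itself can distinguish the two situations.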
We note that when a Logistic model is fitted using the weighted distribution approach (Lele & Keim 2006), the parameters should be interpreted as log-odds ratios for selection. Logistic regression is also used to estimate probability of occupancy (Manly et al. 2002), but the odds ratios in selection and occupancy have different interpretations. In the former, it is the odds of selecting vs. not selecting a resource unit when it is encountered, whereas in the latter, it is the odds that a resource unit is used at least once vs. never.
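The difference between the log-odds-ratio interpretation of a Logistic RSPF and the relative-risk interpretation of an exponential RSF can be seen numerically. The sketch below assumes a single covariate and hypothetical coefficients `b0` and `b1`; the function names are introduced here for illustration and are not part of any package.

```python
import math

def logistic_rspf(x, b0, b1):
    """Logistic RSPF: probability of selecting a unit with covariate x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds of selecting vs. not selecting an encountered unit."""
    return p / (1.0 - p)

b0, b1 = -1.0, 0.7   # hypothetical coefficients

results = []
for x in (0.0, 2.0):
    p0 = logistic_rspf(x, b0, b1)
    p1 = logistic_rspf(x + 1.0, b0, b1)
    odds_ratio = odds(p1) / odds(p0)   # Logistic model: exp(b1) at every x
    prob_ratio = p1 / p0               # Logistic model: changes with x
    # Exponential RSF with the same slope: the ratio itself is exp(b1)
    exp_rsf_ratio = math.exp(b1 * (x + 1.0)) / math.exp(b1 * x)
    results.append((x, odds_ratio, prob_ratio, exp_rsf_ratio))
    print(x, round(odds_ratio, 4), round(prob_ratio, 4), round(exp_rsf_ratio, 4))
```

Under the Logistic model, a unit increase in the covariate multiplies the odds of selection by exp(b1) wherever it occurs, while the ratio of probabilities varies with x. Under the exponential model it is the ratio of (relative) probabilities that equals exp(b1), which is why the same coefficient cannot carry both interpretations.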
The weighted distribution approach to estimate RSPF is applicable for data arising from telemetry studies where repeat visits to the same resource unit are counted as multiple data points. On the other hand, this method is inappropriate for presence-only data arising from occupancy surveys because in these data occupancy usually means ‘used at least once’. The weighted distribution based analysis of occupancy surveys (e.g. Royle et al. 2012) does not estimate probability of selection or habitat suitability.
In this paper, we presented precise definitions of common statistical quantities used in resource selection analyses. The lack of distinction between the various terms we discuss above has led to considerable confusion in the literature on appropriate modelling approaches. Probability of selection is inherent in use distribution, choice, occupancy and occurrence, but each of these also involves additional components such as the available distribution, the choice set or the number of individuals in the population. One can compute the probability that a used unit is of type x solely from information on the characteristics of a random sample of resource units with known use (i.e. presence-only data), without knowing the selection probabilities explicitly. As a result, the use distribution, based only on the relative proportion of a resource type within the use set, is often employed to ascribe the value of the resource type. However, this metric does not explicitly consider the constraints imposed by what is available. Many researchers seem to have ignored this critical point and concluded that selection and use are synonymous.
There are two, not necessarily mutually exclusive, philosophies of statistical modelling (Breiman 2001). One approach emphasizes the importance of models for better understanding of the mechanisms underlying the phenomenon, presuming that better understanding can lead to better prediction and forecasting. The other approach emphasizes prediction and hence leads to models that do not claim to improve understanding of nature but that are good at prediction. The machine-learning-based maximum entropy approach (e.g. Elith & Leathwick 2009) is commonly used in predicting species distributions using presence-only data. Mathematically, the MaxEnt approach is identical to the exponential RSF model described earlier. Most applications of the exponential RSF use linear or polynomial parametric models, whereas the models in MaxEnt are more flexible. Both provide only the relative probability of selection and are highly dependent on the specification of the available distribution. There is one important but subtle difference between the situations where RSF models are used and where MaxEnt has usually been used. The MaxEnt model is commonly used when occupancy data are available, that is, data recording that a location was occupied at least once during the study period, whereas RSF models are used for telemetry data where the same location may be used multiple times. Thus, MaxEnt seems to answer the question of occupancy when the non-occupancy data are unavailable. Mathematically, however, there is no reason why MaxEnt cannot be used for telemetry data as well.
Aside from the issue of choice of available distribution, the RSF, MaxEnt or RSPF models are, inherently, not process-based models as pointed out by Austin (2002). They do not explicitly relate the inferences to survival or the process of growth of a population. Instead, these are descriptive models that are useful for generating plausible hypotheses of what animals might select. However, any such inferences need to be corroborated by other pieces of evidence that relate directly to the behaviour and survival of the species (e.g. McLoughlin et al. 2010, Wasser et al. 2011).
Statistical models provide structure for characterizing selection, occupancy, use and choice, and all these concepts have useful applications. Selection, use and occupancy are all related concepts with different uses in management. Precise definitions and understanding of these concepts are important for applied ecological research. Our paper has attempted to clarify the assumptions and relationships behind different concepts used in selection studies. The main culprit for the confusion seems to be the lack of precise statements about various quantities that are being studied. Often, the same nomenclature is used to describe mathematically distinct events. We can avoid such confusion in future by describing the quantities and the associated events precisely and completely.