Generalized estimating equations and generalized linear mixed-effects models for modelling resource selection
*Correspondence author. E-mail: email@example.com
- 1. Accurate resource selection functions (RSFs) are important for managing animal populations. Developing RSFs using data from GPS telemetry can be problematic due to serial autocorrelation, but modern analytical techniques can help to compensate for this correlation.
- 2. We used telemetry locations from 18 woodland caribou Rangifer tarandus caribou in Saskatchewan, Canada, to compare marginal (population-specific) generalized estimating equations (GEEs), and conditional (subject-specific) generalized linear mixed-effects models (GLMMs), for developing resource selection functions at two spatial scales. We evaluated the use of empirical standard errors, which are robust to misspecification of the correlation structure. We compared these approaches with destructive sampling.
- 3. Statistical significance was strongly influenced by the use of empirical vs. model-based standard errors, and marginal (GEE) and conditional (GLMM) results differed. Destructive sampling reduced apparent habitat selection. k-fold cross-validation results differed for GEE and GLMM, as it must be applied differently for each model.
- 4. Synthesis and applications. Due to their different interpretations, marginal models (e.g. generalized estimating equations, GEEs) may be better for landscape and population management, while conditional models (e.g. generalized linear mixed-effects models, GLMMs) may be better for management of endangered species and individuals. Destructive sampling may lead to inaccurate resource selection functions (RSFs), but GEEs and GLMMs can be used for developing RSFs when used with empirical standard errors.
Introduction
Accurate modelling of habitat selection by animals is critical to developing effective management plans. Resource selection functions (RSFs) are used to compare used with available habitat (Manly et al. 2002). Recent progress in GPS technology development has resulted in enormous amounts of data being made available. However, sequentially surveyed locations may be correlated at intervals as long as 1 month apart (Cushman, Chase & Griffin 2005), and are obviously correlated at intervals measured in minutes or hours (e.g. Fortin et al. 2005). Such data violate assumptions of independence of observations, which may increase frequency of type I errors (Clifford, Richardson & Hémon 1989).
One approach to dealing with this autocorrelation has been to adopt an analysis that assumes absence of correlation, and then to manipulate the data to meet this assumption. For example, telemetry locations may be recorded every few hours or days (e.g. Johnson, Seip & Boyce 2004) on the assumption that this time-lag results in independence. However, the increased time-lag may not be sufficient to produce independent observations, and the reduced amount of data may increase bias and reduce accuracy (Gustine et al. 2006). Destructive sampling, accomplished by dropping data until independence is reached (Way, Ortega & Strauss 2004), is similarly problematic, and may require dropping as many as 95% of data collected (e.g. Saher 2005).
Some approaches that have been proposed for controlling for temporal autocorrelation are problematic. For example, information-theoretic approaches do not sufficiently correct for autocorrelation (cf. Boyce 2006; Aarts et al. 2008) because calculation of standard errors (sensitive to independence) is a critical component of the model selection paradigm, and because the likelihood used to calculate information criteria assumes independence (Burnham & Anderson 1998). Conditional logistic regression (e.g. Johnson & Gillingham 2005) assumes independence among groups of points (observed paired with random), which is not met when telemetry points are recorded frequently. To address this problem, Fortin et al. (2005) incorporated robust standard errors and destructive sampling to obtain long time-lags between clusters of points. Although useful at the ‘step-scale’, their approach does not allow for evaluation of habitat selection at the home-range scale.
Gillies et al. (2006) recommended models that include fixed and random (clustering) effects, such as generalized linear mixed-effects models (GLMMs), to control for the correlation that arises from recording multiple locations from each animal. Mixed models have been applied to correlated ecological data (e.g. Bolker et al. 2009), but Gillies et al. (2006) are among the first to apply them to RSFs (see also Aarts et al. 2008). However, there are at least two potential problems with applying GLMM to RSFs. First, the models are analytically complex (Fitzmaurice, Laird & Ware 2004: 326), which may inhibit convergence. Secondly, hypothesis tests in GLMMs are highly sensitive to model and correlation structure misspecification (Overall & Tonidandel 2004) when model-based standard errors are used. Because telemetry locations have been sampled sequentially, they are autocorrelated. However, random points selected from an animal's home range (e.g. Gillies et al. 2006) do not show autocorrelation, as they are not sampled sequentially over time. Because the correlation structures of telemetry and random points differ, it is impossible to correctly specify the within-cluster correlation structure. The data Gillies et al. provide suggest that grizzly bear Ursus arctos L. locations were determined approximately every 4 h. At this sampling frequency, it is unlikely that these locations are independent. Gillies et al. (2006: 890) misspecified the correlation structure, as they assumed that all data within a cluster (animal) were equally correlated. Therefore, their approach does not meet the assumptions of GLMM. Nonetheless, we believe their approach is promising, and can be developed further.
One possible modification is to use empirical (Huber–White sandwich) variance estimates within the GLMM to make the analysis robust to misspecification of the correlation structure (SAS Institute Inc. 2006), as Nielsen et al. (2002) did with a logistic regression model. Gillies et al. (2006) found that GLMMs were more effective for the development of RSFs than were logistic regression models with empirical standard errors, but did not evaluate GLMM combined with empirical standard errors. We suggest that GLMM with empirical standard errors may be robust to both among- and within-animal correlations, in contrast to GLMM without empirical standard errors.
Generalized linear models (GLMs) with generalized estimating equations (GEEs) may provide a useful alternative. GEEs include an additional variance component to accommodate correlated data, and to allow for differences among clusters. GEEs have several favourable properties for ecological analyses; for example, parameter estimates and empirical standard errors are robust to misspecification of the correlation structure (Overall & Tonidandel 2004), and they are usually less analytically complex than GLMMs (Agresti 2002: 365); hence, model convergence is more likely. GEEs have been used extensively in a variety of disciplines, such as epidemiology (Wu et al. 1999) and political science (Zorn 2001). In ecology, they have been used to control for lack of independence among nests clustered within sites (Driscoll et al. 2005) and among related species (Duncan 2004). Generalized estimating equations have been used only occasionally in habitat-selection studies. Storch (2002) and Dorman et al. (2007) demonstrate their use for controlling for spatial autocorrelation. In a conditional logistic regression context, Fortin et al. (2005) developed RSFs using estimating equations with an independence-working correlation structure and robust standard errors, which they implemented using Cox proportional hazards regression. Although GEEs with other correlation structures have not previously been used for building RSFs, robust standard errors have been applied to control for correlation among telemetry locations (Nielsen et al. 2002). However, pooling data across animals (e.g. Nielsen et al. 2002) biases results towards data-rich individuals (Gillies et al. 2006; Aarts et al. 2008), if data are not missing at random. Applying robust standard errors while using a working correlation structure other than ‘independence’ in the estimation procedure should help overcome this problem.
Nonetheless, as with GLMMs, the benefits of GEEs come with tradeoffs. Whereas GLMMs are sensitive to the choice of correlation structure, GEEs are sensitive to the link function (Pendergast et al. 1996: 101), which can affect model fit (Lele & Keim 2006). It is, therefore, important to compare these approaches according to both their performance and analytical paradigm, to evaluate the appropriateness of their tradeoffs under different management scenarios.
Another fundamental issue is the interpretation of parameter estimates. Conditional (subject-specific) coefficient interpretation means that coefficients model how individual responses change with respect to independent variables. Marginal (population) parameter estimates describe the effects of independent variables on a population. This has a strong effect on parameter estimates, standard error estimates, and significance testing (Fitzmaurice et al. 2004: 365). Whereas GLMMs generate conditional parameter estimates, from which marginal estimates can be derived (Agresti 2002: 499), GEEs only produce marginal ones. However, marginal parameter estimates derived from GLMMs are biased, in that their absolute value is too small, and this bias increases as the variance of the random effect increases (Agresti 2002: 499). Although RSFs do not produce estimates of actual probabilities of use, they produce estimates that are proportional to probability of use (Manly et al. 2002), and thus, this bias could be problematic. Further, the relationships among covariates, and the parameter estimates themselves, are not easily interpreted for marginal estimates derived from conditional models, and models are more likely to be misspecified (Agresti 2002: 499; Fitzmaurice et al. 2004: 364). It is therefore preferable to use a marginal model, such as GEE, when marginal population estimates are of interest (Agresti 2002: 501).
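The difference between conditional and marginal coefficients can be illustrated numerically. The following is a minimal sketch, assuming a logistic model with a single covariate and a normally distributed random intercept; the function name, the coefficient values and the use of Gauss-Hermite quadrature are illustrative choices and are not taken from the analyses in this paper. It averages subject-specific probabilities over the random intercept and reports the population-averaged log odds ratio that results.

```python
import numpy as np

def marginal_log_odds_ratio(beta0, beta1, sigma, n_nodes=40):
    """Population-averaged log odds ratio for a one-unit change in x,
    implied by a logistic model with a N(0, sigma^2) random intercept."""
    # Gauss-Hermite nodes/weights integrate f(t) * exp(-t^2) over the real line.
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma * nodes   # random-intercept values at the nodes
    w = weights / np.sqrt(np.pi)       # rescaled so the weights sum to 1

    def marginal_prob(x):
        # Average the subject-specific probabilities over the random intercept.
        eta = beta0 + beta1 * x + b
        return np.sum(w / (1.0 + np.exp(-eta)))

    p0, p1 = marginal_prob(0.0), marginal_prob(1.0)
    return np.log(p1 / (1.0 - p1)) - np.log(p0 / (1.0 - p0))

beta_conditional = 1.0  # hypothetical subject-specific coefficient
for sigma in (0.0, 1.0, 2.0):
    print(sigma, marginal_log_odds_ratio(-1.0, beta_conditional, sigma))
```

With no among-animal variance the two interpretations coincide; as the random-intercept standard deviation grows, the population-averaged effect shrinks towards zero relative to the subject-specific coefficient.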
Accurate resource selection functions make an important contribution to the conservation of rare or threatened species (Johnson, Seip & Boyce 2004). The boreal population of woodland caribou Rangifer tarandus caribou L. is threatened in Canada (COSEWIC 2002). It is sensitive to habitat composition and anthropogenic activities (Brown et al. 2007), and therefore, accidental misspecification of RSFs would have important conservation consequences. We compared RSFs developed using GLMMs and GEEs, at two spatial scales, using data on woodland caribou. We compared effects of empirical and model-based standard errors on statistical significance. Finally, we compared our results with an analysis done on a destructively sampled subset of the data. Because GEEs have rarely been applied to RSFs, we provide an overview of this approach (see also Dorman et al. 2007).
Materials and methods
Background on generalized estimating equations
For a review of the application of random effects for RSFs, we recommend Gillies et al. (2006); readers should also review the statistical literature on the use of random effects, such as Agresti (2002) and Bolker et al. (2009). Because we introduce GEEs for the development of RSFs using telemetry data, we present a brief conceptual overview of GEEs; for further details, we recommend Hardin & Hilbe (2003). We use the term cluster to mean a unit of analysis within which there are multiple measurements. In our example, each cluster is a caribou.
Three components are important in the GEE (Fitzmaurice et al. 2004: 294–295). Generalized estimating equations require a model for the mean response (as a function of covariates), the variance (often specified as a function of the mean), and a working correlation assumption. They are semi-parametric because estimates rely on parametric assumptions regarding the mean and variance/covariance, but they are not fully parametric (i.e. they require no other distributional assumptions).
First, consistent with a GLM, the conditional expectation, E(Yit | Xit) = µit, depends on the independent variables through a link function (a nonlinear equation used to link the predicted values with the independent variables):
$$g(\mu_{it}) = X_{it}\beta \qquad \text{(eqn 1)}$$
Secondly, the conditional variance of each Yit, given the independent variables, varies as follows:
$$\mathrm{Var}(Y_{it}) = \phi\, v(\mu_{it}), \qquad \text{(eqn 2)}$$
where φ is a known or estimated scale parameter (depending on which response distribution is used), and v(µit) is a known variance function of the mean µit.
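As a worked instance of the mean and variance models, consider a binary response coded 1 for a telemetry (used) location and 0 for a random (available) location, modelled with a logit link. This coding and link are assumptions made here for illustration. Under them, the link function, variance function and scale parameter take the following familiar forms:

```latex
g(\mu_{it}) = \log\!\left(\frac{\mu_{it}}{1-\mu_{it}}\right) = X_{it}\beta,
\qquad
v(\mu_{it}) = \mu_{it}\,(1-\mu_{it}),
\qquad
\phi = 1 .
```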
Thirdly, the correlation among data points within clusters is assumed to be a function of one or more correlation parameters, α. Essentially, the GEE is defined by substituting the variance term in the GLM with the following variance–covariance matrix (Hardin & Hilbe 2003: 58),
$$V(\mu_i) = \left[ D\big(V(\mu_{it})\big)^{1/2} \, R(\alpha) \, D\big(V(\mu_{it})\big)^{1/2} \right]_{n_i \times n_i}, \qquad \text{(eqn 3)}$$
where V(µit) is the variance of the marginal mean µit, and D is a diagonal matrix. The correlation in the data is modelled using the working correlation matrix, R(α), defined by the parameter vector α. This vector may contain a single value (i.e. it reduces to a scalar α) as in the compound-symmetric correlation structure, or it may contain several values. R is a square matrix of dimensions ni×ni, where ni is the number of samples (or measurements) within each cluster.
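The two working correlation structures considered in this paper can be written out explicitly. The sketch below is illustrative only: the function names are hypothetical, a binomial variance function is assumed, and the optional scale parameter `phi` is included as a convenience rather than as part of the definition cited above.

```python
import numpy as np

def working_correlation(n_i, structure, alpha=0.0):
    """Return the n_i x n_i working correlation matrix R(alpha)."""
    if structure == "independent":
        return np.eye(n_i)
    if structure == "compound-symmetric":
        # Every pair of observations within a cluster shares one correlation.
        R = np.full((n_i, n_i), float(alpha))
        np.fill_diagonal(R, 1.0)
        return R
    raise ValueError(f"unknown structure: {structure}")

def working_covariance(mu, R, phi=1.0):
    """phi * A^(1/2) R A^(1/2), where A = diag(v(mu)) and v(mu) = mu * (1 - mu)."""
    a_half = np.sqrt(mu * (1.0 - mu))
    return phi * (a_half[:, None] * R * a_half[None, :])

mu = np.array([0.2, 0.5, 0.8])  # hypothetical fitted means for one cluster
R_cs = working_correlation(3, "compound-symmetric", alpha=0.2)
V_cs = working_covariance(mu, R_cs)
V_ind = working_covariance(mu, working_correlation(3, "independent"))
print(V_cs)
print(V_ind)
```

Under the independent structure the working covariance is diagonal; the compound-symmetric structure adds a single shared off-diagonal correlation.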
An iterative process is used to estimate model parameters. First, estimates of Vi and β are obtained using initial estimates of α and φ (i.e. β is initially estimated from a generalized linear model, assuming φ = 1 and independence of observations). Then α and φ are estimated using the estimates calculated for Vi and β in the first step (Fitzmaurice et al. 2004: 302). This iterative process continues until model convergence is achieved, that is, until the parameter estimates change little from one iteration to the next.
At convergence, the model-based variance estimate is (Fitzmaurice et al. 2004: 305),
$$\mathrm{Cov}(\hat{\beta}) = B^{-1}, \qquad \text{(eqn 4)}$$
$$B = \sum_{i=1}^{N} D_i' V_i^{-1} D_i, \qquad \text{(eqn 5)}$$
where N is the number of clusters, Di is the derivative matrix (the matrix of the derivative of µi relative to the components of β), and Vi is the working covariance matrix. However, the model-based variance is often replaced by the empirical or ‘sandwich’ variance estimator, which is robust even when the working correlation structure does not correctly describe the correlation in the data (Fitzmaurice et al. 2004: 304). Although it does require a sufficiently large number of clusters to be unbiased (Fitzmaurice et al. 2004: 305), it has a potentially broad application for ecological analyses. The empirical variance matrix is (Fitzmaurice et al. 2004: 302),
$$\mathrm{Cov}(\hat{\beta}) = B^{-1} M B^{-1}, \qquad \text{(eqn 6)}$$
$$M = \sum_{i=1}^{N} D_i' V_i^{-1}\, \mathrm{Cov}(Y_i)\, V_i^{-1} D_i. \qquad \text{(eqn 7)}$$
Because Cov(Yi) is unknown, M is a theoretical variance, rather than a variance estimate. Cov(Yi) is estimated using,
$$\widehat{\mathrm{Cov}}(Y_i) = (Y_i - \hat{\mu}_i)(Y_i - \hat{\mu}_i)'.$$
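To make the contrast between model-based and empirical variance estimates concrete, the following sketch fits a logistic GEE with an independence working correlation and φ fixed at 1, then computes both variance matrices. It is a simplified illustration on simulated data, not the SAS procedure used in our analyses; the function name, the simulated covariate and all parameter values are hypothetical. With a logit link and independence, the term D_i' V_i^-1 D_i reduces to X_i' diag(µ(1 − µ)) X_i, and D_i' V_i^-1 (Y_i − µ_i) reduces to X_i'(Y_i − µ_i).

```python
import numpy as np

def fit_logistic_independence_gee(X, y, cluster, n_iter=25):
    """Return beta, the model-based covariance B^-1, and the empirical
    (sandwich) covariance B^-1 M B^-1 for a logistic GEE with an
    independence working correlation (phi fixed at 1)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):                       # Fisher scoring updates
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        B = X.T @ ((mu * (1.0 - mu))[:, None] * X)
        beta = beta + np.linalg.solve(B, X.T @ (y - mu))
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    B = X.T @ ((mu * (1.0 - mu))[:, None] * X)    # sum_i D_i' V_i^-1 D_i
    B_inv = np.linalg.inv(B)
    M = np.zeros_like(B)
    for c in np.unique(cluster):
        idx = cluster == c
        score = X[idx].T @ (y[idx] - mu[idx])     # D_i' V_i^-1 (Y_i - mu_i)
        M += np.outer(score, score)               # plug-in for Cov(Y_i)
    return beta, B_inv, B_inv @ M @ B_inv

# Simulated example: 18 clusters sharing a cluster-level random effect.
rng = np.random.default_rng(0)
n_clusters, n_per = 18, 50
cluster = np.repeat(np.arange(n_clusters), n_per)
u = rng.normal(0.0, 1.0, n_clusters)[cluster]
x = rng.normal(size=cluster.size)
X = np.column_stack([np.ones_like(x), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x + u))))

beta, model_cov, empirical_cov = fit_logistic_independence_gee(X, y, cluster)
se_model = np.sqrt(np.diag(model_cov))
se_empirical = np.sqrt(np.diag(empirical_cov))
print(beta, se_empirical / se_model)
```

The printed ratio of empirical to model-based standard errors is the same quantity that is used later as an informal check on the working correlation structure.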
Habitat selection by woodland caribou
Eighteen adult female woodland caribou from the Smoothstone–Wapaweka caribou management area in central Saskatchewan, Canada, were collared in 2005 and 2006 using Lotek GPS collars (Lotek Wireless Inc., 115 Pony Drive, Newmarket, Ontario). Locations were recorded every 4 h and consisted of late winter locations (1 January–15 March 2006 and 2007), when resources are most scarce and habitat selection is strong (Brown et al. 2007). Data from 1 year per caribou were used. The number of locations per animal ranged from 188–610 GPS data points.
We modelled the influence of key habitat types on habitat selection by woodland caribou (Brown et al. 2007), by evaluating whether habitat types differed between telemetry locations and random locations. We followed Mayor et al. (2007) in applying a biologically relevant but simplified habitat selection model to illustrate our analyses; other landscape features may also influence habitat selection by woodland caribou. We compared the presence or absence of mature coniferous stands [treed muskegs (TM), mature spruce stands (MS) and mature jack pine dominated stands (MJPD)] within 50 m of telemetry and random points. We modelled the influence of distance to roads (DRD) and distance to hardwood/mixed-wood stands (DHMW), as the spatial structure of habitat patches may influence caribou population declines and habitat selection (Johnson & Gillingham 2005). Distance to hardwood/mixed-wood (HMW) stands was correlated with presence/absence of HMW, and thus, presence/absence of HMW was not added to the model. Distance to cutblocks was correlated with DRD; therefore, the latter variable was used to capture effects of anthropogenic activities.
Spatial scales and data sets
We evaluated habitat selection at two spatial scales: herd home range (e.g. Linke et al. 2005), and the home range of individual animals (e.g. Gillies et al. 2006). In the first data set, we selected random points from the herd home range (100% minimum convex polygon; HHR data set). We selected five times the number of random points per animal as collected from the GPS collars (e.g. Johnson & Gillingham 2005). In the second data set, we selected random points from the home ranges of individual animals (IHR data set). For each data set, this resulted in a total of 8985 telemetry locations and 44 925 random locations.
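The selection of random (available) points can be sketched as follows. This is an illustration only: it assumes the home range is a convex polygon (as a minimum convex polygon is) with vertices listed counter-clockwise, and it uses simple rejection sampling from the bounding box; the function names and the triangular example polygon are hypothetical and do not describe the software used in this study.

```python
import numpy as np

def inside_convex_polygon(points, vertices):
    """True for points inside a convex polygon with counter-clockwise vertices."""
    inside = np.ones(len(points), dtype=bool)
    for a, b in zip(vertices, np.roll(vertices, -1, axis=0)):
        edge = b - a
        rel = points - a
        # A non-negative cross product means the point is left of (or on) the edge.
        cross = edge[0] * rel[:, 1] - edge[1] * rel[:, 0]
        inside &= cross >= 0
    return inside

def sample_available(vertices, n_used, ratio=5, rng=None):
    """Draw ratio * n_used uniform random points inside the polygon."""
    rng = np.random.default_rng() if rng is None else rng
    low, high = vertices.min(axis=0), vertices.max(axis=0)
    target = ratio * n_used
    kept = np.empty((0, 2))
    while len(kept) < target:
        candidates = rng.uniform(low, high, size=(2 * target, 2))
        kept = np.vstack([kept, candidates[inside_convex_polygon(candidates, vertices)]])
    return kept[:target]

# Hypothetical triangular home range and an animal with 188 telemetry locations.
home_range = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
available = sample_available(home_range, n_used=188, rng=np.random.default_rng(1))
print(available.shape)
```

Applying the same five-to-one ratio to all 8985 telemetry locations gives the 44 925 random locations reported for each data set.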
Because the correlation structures of the telemetry and random points differ, the correlation structure cannot be correctly specified. However, both GEE and GLMM may be used with an empirical rather than model-based variance estimator, which is robust to deviations from this assumption (Hardin & Hilbe 2003; Fitzmaurice et al. 2004). Although correct specification of the correlation structure is desirable because it allows for the calculation of more efficient (usually smaller) standard errors (Fitzmaurice et al. 2004), empirical standard errors may be used to determine statistical significance when the correlation structure cannot be correctly specified or when it is unknown.
For each data set, we also used destructive sampling to remove some of the temporal autocorrelation among relocations (e.g. Way, Ortega & Strauss 2004; Saher 2005). Because as many as 95% of data points may have to be dropped before independence is achieved (e.g. Saher 2005), we dropped 95% of our data, as a relatively extreme example, to create two data sets that were 5% of the size of the intact data sets (HHR 5% and IHR 5%). Retained telemetry points were 3·33 days apart (see also Fortin et al. 2005). GEE and GLMM were used to control for clustering of data within animals.
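The arithmetic behind the destructively sampled data sets can be checked with a short sketch. It assumes systematic thinning, that is, retaining every 20th location in sequence; the text does not state how the retained points were chosen, so this scheme and the function names are assumptions, although the scheme is consistent with the reported 3·33-day interval.

```python
def destructive_sample(locations, keep_fraction=0.05):
    """Keep every k-th location, where k = 1 / keep_fraction (assumed scheme)."""
    step = round(1 / keep_fraction)
    return locations[::step]

def retained_interval_days(fix_interval_h, keep_fraction=0.05):
    """Time between retained locations, in days."""
    return round(1 / keep_fraction) * fix_interval_h / 24

fixes = list(range(610))           # stand-in for one animal's 610 sequential fixes
thinned = destructive_sample(fixes)
print(len(thinned), retained_interval_days(4))
```

With a 4-h fix interval, keeping one location in 20 spaces the retained locations 80 h apart.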
We used Procs GENMOD and GLIMMIX in SAS 9·1 to develop GEEs and GLMMs, respectively (SAS Institute Inc. 2003). All statistical models included the same independent variables. Within GLMM, a random intercept variable was added to account for clustering of points within individuals (Gillies et al. 2006).
We used two working correlation structures to analyse each data set using GEE. The independent structure assumes within-cluster observations are independent, but is also useful for data sets with relatively few clusters (Hardin & Hilbe 2003: 142). In the compound-symmetric working correlation structure, all observations within clusters are assumed to be equally correlated, while observations from different clusters are assumed to be independent (see also Gillies et al. 2006). The compound-symmetric correlation structure is heuristically equivalent to including a random intercept in a mixed model. Empirical standard errors were used to evaluate statistical significance. We compared these results with model-based standard errors, to determine the effect of erroneously using model-based rather than empirical standard errors.
We did not directly compare relative fit of GEEs vs. GLMMs for two reasons. First, GEEs use a quasi-likelihood, while GLMMs typically use a maximum-likelihood framework for model estimation. Comparative measures such as Akaike's Information Criterion (Burnham & Anderson 1998) could be used for evaluating relative fit of models for GLMM (Bolker et al. 2009), whereas the quasi-likelihood-under-the-independence-model information criterion, or QIC (Pan 2001) could be used for evaluating relative fit of models for GEE, but there is no criterion that can be used for both. Further, our research indicates QIC rarely chooses the correct correlation structure (A. Barnett, N. Koper, A. Dobson & M. Manseau, unpublished data, 2008). Secondly, because parameter estimates from GLMM were conditional, while parameter estimates from GEE were marginal, parameter estimates and significance are expected to differ, and their comparison is not appropriate.
It is desirable to compare the fit of different correlation structures within a GEE analysis. Because our research demonstrates that QIC is strongly biased, this measure is not trustworthy. An informal comparison is to compare the relative size of the empirical (SEE) and model-based (SEM) standard errors. If SEE/SEM is close to 1, this suggests the correlation structure is correctly modelled (Bishop, Die & Wang 2000). We used the SEE/SEM ratio to evaluate whether the compound-symmetric correlation structure fit the data better than the independent correlation structure. There are no guidelines regarding the size of the ratio, but higher ratios reflect poorer model fit. This comparison is qualitative, but it is the best approach available at this time. Future developments and improvements to QIC are planned (J. Hilbe, 2008, personal communication).
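The SEE/SEM comparison amounts to a few lines of arithmetic. In the sketch below, the standard errors are hypothetical values chosen only to show the calculation, and the function name is illustrative; the structure whose mean ratio lies closer to 1 is taken, informally, to describe the correlation in the data better.

```python
import numpy as np

def se_ratio_summary(empirical_se, model_se):
    """Range and mean of the empirical / model-based standard error ratio."""
    ratio = np.asarray(empirical_se, dtype=float) / np.asarray(model_se, dtype=float)
    return {"min": ratio.min(), "max": ratio.max(), "mean": ratio.mean()}

# Hypothetical standard errors for three coefficients under each structure.
independent = se_ratio_summary([0.40, 0.66, 0.90], [0.05, 0.06, 0.10])
compound = se_ratio_summary([0.39, 0.60, 0.85], [0.05, 0.06, 0.10])

closer_to_one = (
    "compound-symmetric"
    if abs(compound["mean"] - 1.0) < abs(independent["mean"] - 1.0)
    else "independent"
)
print(independent, compound, closer_to_one)
```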
Model validation is important for RSF analyses. To demonstrate model validation in the context of GEE and GLMM analyses, we applied k-fold cross-validation (Boyce et al. 2002) to the individual home range data set (with the compound-symmetric correlation structure for GEE), but emphasize that this method should not be used to compare the fit of GEE and GLMM unless marginal estimates are derived from both methods. Because GEE predicts habitat selection of a population, for each iteration in the k-fold analysis, we withheld three animals from the data set, used the remaining 15 animals (83% of the 18 animals) to develop each RSF, and tested its fit using the withheld animals. Because the GLMM is describing habitat selection of specific animals, we withheld 17% of the data from each animal, used the remaining 83% of the data to develop each RSF, and tested the model using the withheld data (Boyce et al. 2002). Spearman's rank correlation analysis was performed on the area-adjusted frequencies across RSF bins. Ten RSF bins with equal number of observations were created for the analyses.
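The two ways of withholding data can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, six folds are assumed for the within-animal scheme (so that roughly 17% of each animal's data is withheld), and the Spearman helper ignores tied values.

```python
import numpy as np

def animal_folds(animal_ids, n_withheld=3, rng=None):
    """Marginal (GEE) validation: each fold withholds whole animals."""
    rng = np.random.default_rng() if rng is None else rng
    animals = rng.permutation(np.unique(animal_ids))
    groups = [animals[i:i + n_withheld] for i in range(0, len(animals), n_withheld)]
    return [np.isin(animal_ids, g) for g in groups]   # True marks test rows

def within_animal_folds(animal_ids, k=6, rng=None):
    """Conditional (GLMM) validation: each fold withholds about 1/k of
    every animal's rows."""
    rng = np.random.default_rng() if rng is None else rng
    fold = np.empty(len(animal_ids), dtype=int)
    for a in np.unique(animal_ids):
        idx = np.flatnonzero(animal_ids == a)
        fold[rng.permutation(idx)] = np.arange(len(idx)) % k
    return [fold == j for j in range(k)]

def spearman_r(x, y):
    """Spearman rank correlation (no correction for ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

animal_ids = np.repeat(np.arange(18), 30)   # 18 animals, 30 rows each (toy data)
gee_folds = animal_folds(animal_ids, rng=np.random.default_rng(2))
glmm_folds = within_animal_folds(animal_ids, rng=np.random.default_rng(2))
print(len(gee_folds), len(glmm_folds))
```

In each iteration the model would be fitted to the rows where the mask is False and evaluated on the rows where it is True, with the rank correlation computed between RSF bin rank and area-adjusted frequency of withheld locations.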
Results
All GEE models converged. Initially, the GLMM analysis on the IHR data set did not converge, but after changing the optimization procedure to the Newton–Raphson method with ridging, the analysis converged.
Effect of spatial scales on comparison between GLMM and GEE
Scale had a strong effect on parameter estimates and statistical significance (Table 1). Avoidance of roads was only significant with the HHR data set. This is probably because the animals selected home ranges to avoid roads, and thus, the measure of availability of roads differs between the HHR and IHR data sets. Degree of selection for jack pine, spruce, and treed muskeg habitats was similar between the HHR and IHR data sets for the GEE models (Table 1). Distance to hardwood and mixed-wood stands had no effect.
Table 1. Winter habitat selection by woodland caribou from Saskatchewan, 2005 and 2006, at the herd (HHR) and individual home range (IHR) scales, analysed using GEEs. Empirical and model-based standard errors are included for comparative purposes, but model-based standard errors did not meet statistical assumptions. β estimates are the same for models with empirical and model-based standard errors
| Scale | Correlation structure | Estimate | DHMW | DRD | MJPD | MS | TM |
|---|---|---|---|---|---|---|---|
| HHR | Independent | β | 0·132 | 0·857 | 1·164 | 1·185 | 1·978 |
| HHR | Compound-symmetric | β | 0·128 | 0·816 | 1·113 | 1·135 | 1·901 |
| IHR | Independent | β | –0·027 | 0·155 | 1·146 | 1·223 | 1·991 |
| IHR | Compound-symmetric | β | –0·026 | 0·176 | 1·168 | 1·241 | 2·033 |
Statistical significance of parameter estimates derived using GLMM was similar to that derived using GEEs for the HHR data set. However, only treed muskeg appeared to be selected when using GLMM at the IHR scale (Table 2). In contrast, the GEE analysis at this spatial scale suggested that jack pine and spruce were also selected (Table 1). When analysed with GLMM, the degree to which jack pine, spruce, and treed muskeg were selected was less in the IHR than in the HHR data (Table 2).
Table 2. Winter habitat selection by woodland caribou from Saskatchewan, 2005 and 2006, at the herd (HHR) and individual home range (IHR) scales, analysed using GLMMs. Empirical and model-based standard errors are included for comparative purposes, but model-based standard errors did not meet statistical assumptions. β estimates are the same for models with empirical and model-based standard errors
| Scale | Estimate | DHMW | DRD | MJPD | MS | TM |
|---|---|---|---|---|---|---|
| HHR | β | 0·135 | 0·863 | 1·166 | 1·195 | 2·015 |
| IHR | β | –0·019 | 0·128 | 0·704 | 0·575 | 0·920 |
Correlation structures and standard errors
The estimated correlation coefficient for the compound-symmetric correlation structures was small (r < 0·0075), suggesting that differences in habitat selection among caribou are small. This is consistent with previous research on discrete groups of woodland caribou (e.g. Gustine et al. 2006; Brown et al. 2007). However, the small magnitude of the correlation coefficient may also result from relative independence among random points. In addition, parameter estimates and statistical significance were similar for the independent and compound-symmetric correlation structures (Table 1), suggesting that allowing for differences among caribou had little effect on models.
The use of empirical and model-based standard errors had strong effects on significance (Tables 1–3). For GEEs, empirical standard errors were several times larger than model-based standard errors for the HHR and IHR data sets, regardless of whether independent (SEE/SEM range 7·13–12·53, mean 8·65) or compound-symmetric (SEE/SEM range 5·76–11·51, mean 8·15) correlation structures were used, although the SEE/SEM ratio was slightly smaller for the compound-symmetric structures. The high SEE/SEM ratio demonstrates that the correlation structures did not describe all the correlation in the data. Model-based standard errors were therefore underestimated, highlighting the importance of using empirical standard errors for determining statistical significance. Differences between model-based and empirical standard errors analysed using GLMM were similar to the results from GEE analyses (SEE/SEM range 6·43–13·32, mean 8·76). A statistical comparison of the fit of these correlation structures would have been more rigorous than this qualitative comparison but an unbiased statistical measure is not available at this time. Also, it would be unlikely to change our conclusion that differences in fit among correlation structures are small.
Table 3. Winter habitat selection by woodland caribou from Saskatchewan, 2005 and 2006, analysed using GEE, and destructive sampling to remove 95% of the data. Empirical and model-based standard errors are included for comparative purposes, but model-based standard errors did not meet statistical assumptions. β estimates are the same for models with empirical and model-based standard errors
| Scale | Correlation structure | Estimate | DHMW | DRD | MJPD | MS | TM |
|---|---|---|---|---|---|---|---|
| HHR | Independent | β | 0·141 | 0·919 | 1·341 | 1·142 | 2·133 |
| HHR | Compound-symmetric | β | 0·140 | 0·916 | 1·334 | 1·139 | 2·126 |
| IHR | Independent | β | 0·004 | 0·096 | 0·779 | 0·499 | 0·938 |
| IHR | Compound-symmetric | β | 0·007 | 0·007 | 0·639 | 0·289 | 0·725 |
In GEE analyses, the SEE/SEM ratio suggested that the compound-symmetric correlation structure fit the IHR 5% data set slightly better than the independent correlation structure, and therefore, we restrict our discussion to the former. The SEE/SEM ratio was closer to 1 in the two 5% data sets than for the complete data sets, indicating that some of the autocorrelation among telemetry locations had been removed by destructive sampling (Tables 3–4). However, SEE/SEM was on average 2·172 (range 1·823–2·874) for HHR 5%, and 1·518 (range 0·967–1·797) for the IHR 5% data set, suggesting that despite decreasing the sampling frequency to one sample every 3·33 days, autocorrelation remained. SEE/SEM was slightly smaller for the IHR 5% data set modelled with the compound-symmetric (range 0·97–1·80, mean = 1·52) than the independent structure (range 1·67–1·97, mean = 1·85, Table 3), suggesting that the compound-symmetric structure might fit slightly better.
Table 4. Winter habitat selection by woodland caribou from Saskatchewan, 2005 and 2006, analysed using GLMM, and destructive sampling to remove 95% of the data. Empirical and model-based standard errors are included for comparative purposes, but model-based standard errors did not meet statistical assumptions. β estimates are the same for models with empirical and model-based standard errors
| Scale | Estimate | DHMW | DRD | MJPD | MS | TM |
|---|---|---|---|---|---|---|
| HHR | β | 0·141 | 0·919 | 1·341 | 1·142 | 2·133 |
| IHR | β | 0·004 | 0·096 | 0·779 | 0·499 | 0·938 |
In the GEE analyses, destructive sampling had little effect on parameter estimates or statistical significance for the HHR data set (Tables 1 and 3). However, destructive sampling influenced significance and relative size of parameter estimates for the IHR 5% data set (Tables 1 and 3). Using the compound-symmetric correlation structure, selection for spruce was significant for the IHR data, but not for the equivalent destructively sampled data set. Selection for treed muskeg compared with jack pine was also relatively larger for the intact than for the destructively sampled data set. With the independent correlation structure, neither jack pine nor spruce was selected when using the IHR 5% data, while both habitat types were selected when using the intact data set.
With the destructively sampled data sets, the GLMM analyses converged on solutions where the variance component associated with the random effect was estimated to be 0, with a 0 variance. This indicates that for the destructively sampled data set, the GLMM model was not an improvement over GLM, suggesting that habitat selection was similar among animals. The destructively sampled sets may have had insufficient data to distinguish differences in habitat selection among animals, at least in terms of variables included in the RSFs. Consistent with GEE, the SEE/SEM ratio averaged 2·164 (range 1·814–2·860) for the HHR 5%, and 1·849 (range 1·665–1·974) for the IHR 5% data set analysed with GLMM (Table 4). There were no differences in parameter significance between the destructively sampled and intact data sets. Relative sizes of parameter estimates differed somewhat between the intact and destructively sampled data, as selection for jack pine was stronger using the HHR 5% data set (Tables 2 and 4).
In predicting habitat selection of the animals sampled, the GLMM model was highly significant with an average r of 0·962 (SD = 0·016), while the GEE model had an average r of 0·739 (SD = 0·150).
Discussion
Destructive sampling reduced power and could lead to higher probability of type II errors (see also Gustine et al. 2006) when analysed using GEE. In our study, we would have underestimated selection for spruce and jack pine (using the independent structure) with destructive sampling. Degree of habitat selection by woodland caribou may, therefore, be greater than previously recorded in studies that used destructive sampling or long intervals between relocations (e.g. Johnson, Seip & Boyce 2004). Alternatively, destructive sampling may not be sufficient for ensuring independence among sequential points (Cushman, Chase & Griffin 2005). Not accounting for such correlations (e.g. Johnson, Seip & Boyce 2004; Gillies et al. 2006) may overestimate the sample size of independent data, and may overestimate habitat selection (Clifford, Richardson & Hémon 1989). However, it may still be important to use destructive sampling to minimize the effects of measurement error (Jerde & Visscher 2005). Analytical methods that can use all the data are a better choice. Both GEEs and GLMMs hold promise, but both methods must be applied appropriately. Neither GLMMs nor GEEs with model-based variance estimators meet the assumption of correctly selecting the correlation structure. Evaluation of parameter significance was strongly influenced by type of standard error. Empirical standard errors must be used to determine significance if correlation structures are misspecified (Hardin & Hilbe 2003). This will almost invariably be the case with RSFs developed using GPS telemetry data. If GLMMs are used to develop RSFs, the methods described by Gillies et al. (2006) should be modified to ensure that empirical standard errors are used. The methods proposed by Fortin et al. (2005) avoid some of the differences in the correlation structure between used and available points, and thus are suitable if habitat selection is assessed at a local, rather than home-range scale.
Statistical results, and therefore conservation implications, sometimes differed between GEE and GLMM. One reason is that we used only conditional parameter estimates from GLMMs, while coefficient interpretation for GEEs was marginal; therefore, each model addressed a different question. Because the interpretation of parameter estimates is different for a marginal vs. conditional design, their selection is of critical importance and must be based on appropriate biological or management rationale. The population-specific response will be of interest if management actions are intended to influence whole populations. This would be typical of many scenarios in applied ecology, such as landscape-level conservation, in which a subset of the population is monitored to understand effects of management on the whole population or other populations. Marginal models such as GEEs are preferred for generating marginal population estimates (Agresti 2002: 501). Conditional estimates are most appropriate when management focuses on individuals (Fitzmaurice et al. 2004: 369); for example, when conservation of specific individuals is the management goal, such as with endangered species management (Gillies et al. 2006), and when individuals are monitored to understand how future management plans will affect them. If management goals are conditional, GLMMs can be used for their analysis.
We monitored a subset of individuals to develop management recommendations for the population; therefore, marginal estimates were of interest. We would have underestimated the degree of habitat selection for jack pine and spruce if we incorrectly applied a conditional approach to the analysis at the IHR scale. This would have serious conservation implications, as jack pine and spruce are economically valuable and therefore at significant risk of harvest.
Road avoidance was only detected at the HHR scale (see also Apps & McLellan 2006). Many previous studies have recognized differences in habitat selection among spatial scales (e.g. Gustine et al. 2006; Mayor et al. 2007), and thus, the analytical approach must be selected based on the appropriate spatial scale for answering the ecological or management question. The herd home range is the spatial scale at which many management plans are developed (e.g. Crichton & Duncan 2005), and therefore, this spatial scale is of particular importance in applied ecology.
Evaluating the predictive capacity of models is important for determining their usefulness for conservation and management (Boyce et al. 2002). Appropriate evaluation of model prediction using k-fold cross-validation (Boyce et al. 2002) is sensitive to whether a marginal vs. conditional approach is taken. Although we are not aware of any precedent for using k-fold cross-validation for evaluating fit of GEE or GLMM, it should be effective. However, the approach must be applied differently for a marginal vs. conditional design. For discussion purposes, we use the example of withholding 20% for evaluating performance of an RSF (sensu Boyce et al. 2002). With a marginal design, we hope to predict habitat selection of all animals in a population, from a subset of monitored animals. In that case, we should withhold data from 20% of the individuals and evaluate model performance with those. However, with a conditional design, the model describes habitat selection of specific animals. In that case, we should withhold data from 20% of the points from each animal, and evaluate model performance with those. Obviously, the results of the k-fold cross-validation will differ under each scenario, and address different questions about how well the RSF model predicts. It is unsurprising that we found RSFs were more likely to correctly predict habitat selection of those animals used to develop the models (conditional approach), than to predict habitat selection of other animals in the population (marginal approach); there is greater variation among than within individuals. Although threshold guidelines are not yet available (Pearce & Boyce 2006), guidelines for k-fold cross-validation thresholds would clearly differ for a marginal vs. conditional approach. This represents an important avenue for future research.
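The two withholding schemes described above can be sketched as follows. This is an illustrative sketch rather than the procedure implemented in this study; the function names, the random seed and the example data set of 10 animals with 50 locations each are hypothetical. With k = 5, each fold withholds 20% of the animals (marginal design) or 20% of the points from every animal (conditional design).

```python
import random
from collections import defaultdict


def folds_by_animal(animal_ids, k=5, seed=0):
    """Marginal design: a fold withholds all points from about 1/k of the animals."""
    animals = sorted(set(animal_ids))
    random.Random(seed).shuffle(animals)
    fold_of_animal = {a: i % k for i, a in enumerate(animals)}
    return [fold_of_animal[a] for a in animal_ids]


def folds_within_animal(animal_ids, k=5, seed=0):
    """Conditional design: a fold withholds about 1/k of the points from each animal."""
    rng = random.Random(seed)
    rows_by_animal = defaultdict(list)
    for row, a in enumerate(animal_ids):
        rows_by_animal[a].append(row)
    folds = [None] * len(animal_ids)
    for a in sorted(rows_by_animal):
        rows = rows_by_animal[a]
        rng.shuffle(rows)
        for i, row in enumerate(rows):
            folds[row] = i % k
    return folds


# Hypothetical data set: one animal identifier per telemetry location
animal_ids = [a for a in range(10) for _ in range(50)]

marginal = folds_by_animal(animal_ids)         # fold label per location
conditional = folds_within_animal(animal_ids)  # fold label per location

# For fold f, the model is fitted to rows whose label differs from f
# and evaluated on rows whose label equals f.
```

Under the marginal scheme, no withheld animal contributes to model fitting, so the evaluation addresses prediction for unmonitored animals. Under the conditional scheme, every animal appears in both the fitting and the withheld sets, so the evaluation addresses prediction for the monitored animals themselves.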
Suitability of GEEs and GLMMs for developing resource selection functions differs in several more ways. GLMMs may be less likely to converge than GEEs due to their added complexity (Fitzmaurice et al. 2004: 326). While we circumvented this problem using alternative optimization procedures, this is not always effective. Gillies et al. (2006) argue that point estimates obtained under a working independence assumption weight results towards animals with more samples. However, GEEs may be used in combination with empirical variance estimators to produce an estimator robust to moderate deviations in sample sizes (Fitzmaurice et al. 2004: 320). Parameter estimates from GEEs may be sensitive to degree of correlation (Pepe & Anderson 1994), which may vary among seasons. This suggests that, as is usually the case with RSFs, GEEs should be used to analyse habitat selection within reasonably discrete, biologically appropriate seasons. In our study, the correlation structure had few effects on GEE results. If greater differences in habitat selection among animals are expected in other studies, researchers should consider both independent and compound-symmetric correlation structures.
Both GEEs and GLMMs hold promise for the development of RSFs when used with empirical variance estimates. The optimal approach will depend on study design and management goals (see also Bolker et al. 2009). Selection of a marginal or conditional approach is a key step in the study design process, and should be based on ecological or management goals. GEEs may be more likely to converge for some data sets as they are simpler analytically, and are preferred when marginal population estimates are needed (Agresti 2002: 501). GLMMs are required for generating conditional population estimates, and may be preferred if the link function is likely to be misspecified (see also Lele & Keim 2006). GLMMs may be preferred if variances differ widely among groups within categorical explanatory variables (Agresti 2002: 501). Further research is required to adapt either approach for the development of resource selection probability functions (Lele & Keim 2006) to account for contamination of random points with used points (Keating & Cherry 2004). Contamination is likely to be minimal in our system, as the species density is low (Johnson et al. 2006). Nonetheless, this problem is beyond the scope of our study, and therefore, we restrict our discussion to RSFs. Caution must also be taken if relative selection is very small, as in this case, parameter estimates for RSFs may be inaccurate (Lele & Keim 2006). We also recognize that the correlation itself may be of interest, and recommend use of the many analytical procedures that are available, such as Mantel correlograms, in this case. However, other biological questions may need to be addressed within an RSF that does not focus on the autocorrelation, for which GEEs and GLMMs are useful.
We thank J.M. Hilbe and L. Lix for statistical advice, and J. Keeney and S. Keobouasone for GIS support. We thank E. J. Milner-Gulland, M. Hebblewhite and three anonymous reviewers for providing helpful editorial advice. The collaring programme was done in collaboration with F. Moreland and D. Frandsen, Prince Albert National Park, and A. Arsenault, T. Trottier and B. Tokaruk, Saskatchewan Environment and Resource Management. Funds were provided by Prince Albert National Park through Parks Canada Species at Risk Recovery Action and Education Fund, Saskatchewan Environment and Resource Management through the Fish and Wildlife Development Fund, Weyerhaeuser Inc., and Prince Albert Model Forest.