Identifying and correcting spatial bias in opportunistic citizen science data for wild ungulates in Norway

Abstract Many publications make use of opportunistic data, such as citizen science observation data, to infer large‐scale properties of species’ distributions. However, the few publications that use opportunistic citizen science data to study animal ecology at a habitat level do so without accounting for spatial biases in opportunistic records or using methods that are difficult to generalize. In this study, we explore the biases that exist in opportunistic observations and suggest an approach to correct for them. We first examined the extent of the biases in opportunistic citizen science observations of three wild ungulate species in Norway by comparing them to data from GPS telemetry. We then quantified the extent of the biases by specifying a model of the biases. From the bias model, we sampled available locations within the species’ home range. Along with opportunistic observations, we used the corrected availability locations to estimate a resource selection function (RSF). We tested this method with simulations and empirical datasets for the three species. We compared the results of our correction method to RSFs obtained using opportunistic observations without correction and to RSFs using GPS‐telemetry data. Finally, we compared habitat suitability maps obtained using each of these models. Opportunistic observations are more affected by human access and visibility than locations derived from GPS telemetry. This has consequences for drawing inferences about species’ ecology. Models naïvely using opportunistic observations in habitat‐use studies can result in spurious inferences. However, sampling availability locations based on the spatial biases in opportunistic data improves the estimation of the species’ RSFs and predicted habitat suitability maps in some cases. This study highlights the challenges and opportunities of using opportunistic observations in habitat‐use studies. While our method is not foolproof it is a first step toward unlocking the potential of opportunistic citizen science data for habitat‐use studies.


| INTRODUC TI ON
Modern biotelemetry devices using very high frequency (VHF) and Global Positioning System (GPS) approaches have made it possible to study the habitat use of multiple animals at fine spatial and temporal scales, providing unique opportunities to study how species use their environment without observer bias .
Nevertheless, such devices are expensive, often logistically difficult to deploy, and require specialist training in addition to the welfare considerations associated with animal capture. The result is that these approaches are often only used in study sites of limited size or with limited number of study animals, which may lead to poor population-level inferences ) and cannot be applied to all species for which such information is desirable. Ideally, it should be possible to use available biotelemetry data, and correct for, biases associated with the use of more extensive data types (which are often opportunistic), such as those associated with citizen science data sources.
Opportunistic citizen science data have the potential to provide tremendous amounts of data over large temporal and spatial scales that can potentially transform the study of ecology (Bela et al., 2016;Tewksbury et al., 2014). It has recently been estimated that as much as 50% of the species occurrence records stored in the Global Biodiversity Information Facility (GBIF) have been collected by Citizen Scientists (i.e., volunteers engaged in data collection), usually in the form of opportunistic data (Walker, 2019). Many citizen science projects have a long history (e.g., hunters recording harvest numbers, Cretois et al., 2020; records of the timing of cherry blossom in Japan, Aono & Kazui, 2008; the UK Butterfly Monitoring Scheme, Mair et al., 2014; the Christmas Bird Count, Kobori et al., 2016), and the development of web-based recording with userfriendly interfaces and associated databases is leading to an increase in the number of initiatives and an increasing uptake by the scientific community (Dickinson et al., 2012). Where these datasets have a sufficient spatial and temporal resolution, they can represent a costeffective tool for certain applications such as delineating relative distributions or identifying habitat correlates for 1st-order selection (Johnson, 1980). However, opportunistic data do not arise from any structured sampling design and thus violates many of the fundamental principles of data sampling. For instance, most data collected by volunteers are unevenly distributed in both space (i.e., off and on trails, close to roads, and human settlements, Westekemper et al., 2018) and time (i.e., collected during daylight and during the weekends) and might lead to spurious inference about drivers of species distribution. Moreover, observers differ in their abilities to recognize species and the effort they spend to detect certain species (Isaac et al., 2014). There is also a question whether the observed individuals are representative of the main wildlife population, or if they have deviant behavior, for example, because they are sick or more than usual habituated to human activities (Reimers et al., 2010). Even though at fine scales these biases can lead to misleading conclusions if not accounted for (Sicacha-Parada et al., 2020), some studies chose to simply ignore them (Weisshaupt & Rodríguez-Pérez, 2017), uncritically combine opportunistic records with other source of data (Mononen et al., 2018), or use methods difficult to replicate in other systems (Todd et al., 2016).
It is commonly assumed that opportunistic data represent the actual species distribution. This is only partially true as opportunistic data represents the intersection between opportunistic sampling and the actual species distribution (see a two-dimensional example in Figure 1). The environmental conditions determining occupancy by a species result from a hierarchical selection process (Johnson, 1980), while the fact that opportunistic data are conditional upon the presence of an observer and their ability to see and identify the animal, and file a report, are sources of bias. In contrast, the hypervolume in environmental space occupied by telemetry data results only from the space use of marked individuals from the species of interest. For instance, in Figure 1, citizen scientists and the target species do not use the landscape in the same way and citizen science observations only partially capture the 2nd and 3 rd orders of selection (Johnson, 1980). In contrast, ideal telemetry observations (i.e., exempt of sampling biases) are in theory able to capture both the 2nd and 3rd order of selection. In this example, citizen scientists use steeper slope and heavily used trails compared to the target species that prefer less steep landscapes that contains trails that are moderately used. Thus, the distribution of the citizen scientists and the species only partially overlap. Under the assumption of a representative sample of individuals in the telemetry data within a given site, it is possible to combine opportunistic and telemetry data to estimate the hypervolume occupied by the observers, which could be used to correct observer bias in the opportunistic data.
In this study, we present a novel method which aims to account for spatial biases in opportunistic observations to get a more accurate characterization of species' habitat selection in an area rich in opportunistic data but where relatively little telemetry data are available. We build on previous studies which have found that carefully F I G U R E 1 Conceptual figure representing the reasoning underlying the use of opportunistic observations to infer species' habitat preference along two potential environmental gradients. The thick line represents the area where opportunistic observations correctly identify species' ecological properties, including the RSF Trail use Slope CS 3rd order of selecƟon CS OpportunisƟc obs.
selecting background locations or using design-based weights can help account for latent sampling bias and improve improves the inference made on species distribution (see Irvine et al., 2018;Phillips et al., 2009) and extend these ideas to a higher order of selection (Johnson, 1980). We first explore the potential biases in opportunistic observations for three widespread and easily recognizable wild ungulate species (i.e., to limit the extent of the misidentification bias), roe deer (Capreolus capreolus), moose (Alces alces), and wild mountain reindeer (Rangifer tarandus) in southern Norway. Then, we account for these biases by fitting a model aimed at estimating accessibility for the citizen scientist (the observer model) based on the contrast between observation and telemetry locations. We use the observer model output to define the spatial domain of a background sample.
We then pair these background points with the opportunistic observations to estimate a resource selection function (RSF) that accounts for sampling bias. We further explored the potential of this method with both simulations and empirical datasets for the three wild ungulates. We compared the results of our novel method to RSFs naïvely using opportunistic observations without correction in availability locations and to RSFs derived from the unbiased telemetry data. For all species, the GPS data sampling interval ranged between 1 and 12 relocations per day. However, because of the large number of data points, which caused computational inefficiency, and to avoid risks of temporal and spatial autocorrelation, we resampled the telemetry dataset using the R package amt (Signer et al., 2019). We selected 1 GPS location every 5 h for both moose and roe deer and 1 GPS location every 10 h for wild reindeer as more observations were available. We tested different filtering to ensure independence and we notice that from 5 (for roe deer and moose) and 10 (for wild reindeer) hours onward the parameter estimates remained stable. Then, we selected location data that were recorded during summer (i.e., from June 22 to September 22) and during hours of normal human activities (i.e., between 8 and 22) and daylight for a fair comparison with opportunistic observations, which are more numerous in the summer months (in our dataset) and during daylight. Focusing on summer only also removed complications arising from variable migration behavior and possible confounding effects of proximity to winter feeding stations that are often used by moose and roe deer, as well as issues related to the increased grouping behavior of moose and roe deer in winter and the reduced human access to habitats caused by snow (Fryxell et al., 1988 Because of the low number of opportunistic records directly within the telemetry study sites of both roe deer and moose, we built a 10 km buffer around the distribution of telemetry observations and included records inside the buffer surrounding the telemetry data and inside the distribution of the telemetry data. For wild reindeer, enough opportunistic records were available within the areas from which telemetry locations were available (in the mountainous areas in the south, ranging from the southern part of Trøndelag and a county southward) and no buffer was needed. Because our paper aims to provide a general method, we aimed to get a dataset as representative as possible of the "general" or "everyday" citizen scientist. We sorted all opportunistic observations by the name of the observer and deleted observations made by "super-observers,"

| MATERIAL AND ME THODS
persons contributing to more than 50% of the dataset. These superobservers were often employed by wildlife management institutions, and therefore, their observations were not considered to be representative of typical opportunistic data. The resulting dataset was composed of 160 opportunistic records for moose, 316 for roe deer and 183 for wild reindeer. The spatial distribution of the observations used in our analysis is displayed in Map 1.

| Explanatory variables
We first fitted an observer model to quantify the biases in opportunistic data using variables related to human infrastructure (that might influence observer access to wildlife habitat), human activities (that might influence the potential number of observers), and habitat associated visibility (that might influence the detectability of a species to an observer), which are factors that are presumed to be the main drivers of biases in opportunistic records (Geldmann et al., 2016;Tiago et al., 2017).
After sampling availability locations with regard to the observer model, we estimated habitat selection for moose, roe deer, and wild reindeer using explanatory variables related to habitat (i.e., environmental data) and to human activity.

Explanatory variables for the observer model
We extracted map layers on roads and human settlement from Open Street Map (https://www.opens treet map.org; OSM) and Statistics Norway (https://kart.ssb.no/), respectively. Human settlements are defined as a cluster of buildings inhabited by at least 200 persons and the distance between buildings is less than 50 m (https://www.ssb. no/en/klass/ klass ifika sjone r/110). In the OSM dataset, we selected the main segments of the road network: motorway, trunk, primary, secondary, tertiary, unclassified, and residential. Datasets were then used to compute 10 m resolution rasters of distance to roads and distance to human settlements. Both rasters were created in ArcGIS Pro.
To represent human activity intensity, we used path use intensity and population number. Path use intensity captures the number of human activities such as running, cycling, or hiking events occurring on a given path. We aggregated all variables to a resolution of 1 km × 1 km, that is, the finest common resolution across covariates.
For both moose and roe deer, we used Strava Metro data (https:// metro.strava.com/) for southern Norway to compute a path use intensity raster. The Strava Metro product is a shapefile composed of OSM trails and roads. In its attribute table, each segment contained the number of users who recorded an activity, and calibration of the STRAVA activity counts using fixed-point counter station estimates revealed a strong overall correlation (Venter et al., 2020). Data were available between 2017 and 2020. We summed the number of users who recorded an activity within 1 km × 1 km grid cells and rasterized the results. Because of a lack of Strava users in high mountain habitats (due to poor telephone network coverage and battery constraints on mobile devices), we used a trail use index derived from trail counter data (automatic devices that record the number of people passing) (Gundersen et al., 2019). Human population density (residential) at a resolution of 100 m × 100 m was extracted from the national database and summed within 1 km × 1 km grid cells.

M A P 1
Map displaying the distribution of the opportunistic data for each species Finally, the Corine Land Cover dataset (available at a 100 m × 100 m resolution raster) was used to calculate the amount of forested area within a 1 km × 1 km grid cell as a proxy for visibility.
We assumed that the more forested area in a grid cell, the harder it would be for an observer to spot an animal.

Explanatory variables for the resource selection function
The explanatory variables used to estimate the resource selection function were chosen based on previous fine-scale studies of habitat selection of these species (for roe deer, see Bouyer et al., 2015; for moose, see Bjørneraas et al., 2011Bjørneraas et al., , 2012; and for wild reindeer, see Panzacchi et al., 2015; Table 1). Slope and altitude were computed from a 20 m resolution Digital Elevation Model extracted from the Norwegian Spatial Data Infrastructure (https://www.geono rge.no/).
Path use intensity, distance to roads, and urban settlements and forest coverage were the same variables we used in the observer model. We also included agricultural area coverage, which was computed by filtering the pixels labeled "agricultural areas" in the Corine Land Cover dataset. We then calculated the proportion of agricultural area in each 1 km × 1 km grid cell.

Sampling random availability locations
Resource selection functions are commonly used to characterize species' habitat use (Boyce & McDonald, 1999). RSFs are used to compare environmental covariates at locations visited by an animal with environmental covariates at a set of locations assumed to be available to the animal (Manly et al., 2007). Concretely, RSFs are presence/background (or presence/pseudo-absence) species distribution models used at a higher order of selection (Johnson, 1980) and are evaluated by fitting a logistic regression to observed and available locations with available locations consisting of points sampled randomly or systematically from within an animal's estimated home range (Manly et al., 2007).
Sampling available locations is a crucial step in habitat selection studies, and different choices of available locations may influence the quantification of selection (Beyer et al., 2010). Usually, areas are defined as "available" if they are found within a minimum convex polygon (MCP) drawn around the area from which "use" locations are derived (Calenge, 2011 We represented the spatial biases contained in opportunistic observations due to the observer behavior using a model of the biases known here as the observer model ( Table 2). The observer model estimates which factors influence the probability of an opportunistic observation being in a specific location. We use telemetry observations as a baseline and compare the differences in different locations of the telemetry data and opportunistic observations. For both telemetry and opportunistic data, we extracted at each location and for each observation the value of the covariate that has been demonstrated to influence the observation process (Table 1). We then fitted a logistic regression, the response variable being record type (i.e., the probability that an observation with certain environmental characteristics was opportunistically collected rather than derived from telemetry; opportunistic records were coded as 1 and telemetry observations coded as 0) and the explanatory variables being the extracted covariate values. If estimated parameter values (β) in Equation (1)  We then randomly sampled 100,000 locations within the polygon surrounding the telemetry points (plus buffer) and predicted the probability that a point would be "used" by the citizen scientist based on the parameter values estimated by the "observer model." Finally, we sampled the corrected availability locations (n = 3 x opportunistic citizen science observations, (1) Record type = + 1 distance to roads + 2 distance to urban centers + 3 path use intensity + 4 forest coverage + 5 population Forest and nice viewpoints were simulated as clusters across the simulated landscape and each grid cell was coded as 0 for "absence of forest" or "absence of nice viewpoints" and 1 as "presence of forest" or "presence of nice viewpoints" (Saura & Martínez-Millán, 2000). For Other gradient (a hypothetical variable which could represent another road or some properties of the study area such as a gradient of vegetation) and distance to roads, we generated a segment across the landscape and the distance to that segment was computed for each grid cell. It should be noted that the variable Other gradient was generated to be highly correlated with distance to roads (Pearson's r = −.80).
While distance to roads is directly associated with both the probability of occurrence of a species and an observer being present in a specific grid cell, other gradient is only associated with the probability of occurrence of the species being present in a specific grid cell. By being correlated to distance to roads, we could test whether sampling locations with regard to the observer model (i.e., including distance to roads) could correct for a variable not included in the observer model (i.e., other gradients). Finally, we simulated elevation across the landscape using a Gaussian random field (mean = 1 and std = 1).

| Simulating animal locations and opportunistic observations
We simulated animal locations which were linearly dependent on the values of the simulated environmental conditions using Equation (2). More specifically, using the following parameterization, we simulated a species that was more likely to be situated in forests (β forests = 2.5), in lower altitude (β altitude = −2), away from roads (β d_ roads = 4.5) and attracted by an unknown gradient (β other_gradient = 4.5).
The probability of presence of the species in a specific grid cell is given by Equation (2).
The likelihood of an observer being present in a given area depends on multiple variables such as accessibility (Sicacha-Parada et al., 2020). We gave a probability score to each grid cell, the higher the score, the more likely an observer is to be present. A high score (i.e., high probability of an observer being present) was given to grid cells located close to roads (β d_roads = −6), close to densely populated areas (β d_urb = −3) and if there were nice viewpoints (β nice_viewpoints = 1). The calculation of the probability of presence score (α) is given by Equation (3).
(2) logit( ) = − 7 + forest * forest + altitude * altitude + d roads * d roads + other gradient * other gradient (3) logit( ) = d roads * d_roads + d urb * d_urb + nice viewpoints * nice_viewpoints TA B L E 1 Description of the covariates used in the observer model and in the estimation of the resource selection function

Term Definition
Observer model Model quantifying accessibility within species home range for a citizen scientist by evaluating differences in locations between opportunistic and telemetry data.
Corrected availability Available locations sampled and that are used in the corrected OPP model in tandem with opportunistic observations.
Corrected OPP model Resource selection function estimated by an infinitively weighted logistic regression using both corrected availability locations and opportunistic observations. Naïve OPP model Resource selection function estimated by an infinitively weighted logistic regression using both availability locations randomly sampled across the species' home range and opportunistic observations. Finally, for each grid cell we calculated the probability of having an opportunistic observation. We defined as the product of α, the probability of an observer being within the area represented by the grid cell and , the probability of presence of the species being within the area represented by the grid cell. Thus, the probability of having an opportunistic observation in a specific grid cell was higher if the grid cell was easily accessible and if the probability of presence of the species was high (Equation 4).

| Model fitting
For each species, we estimated an RSF with (1) (Table 2). To account for individual-specific variation in the telemetry dataset, we used a random slope for all coefficients (Muff et al., 2020).

| Habitat suitability maps
In order to visualize the consequences of different background selection approaches, we predicted habitat suitability maps on a grid placed over the MCP drawn around the telemetry locations of each species.
These maps were produced using the covariates described earlier and the mean parameters of the RSFs for each of the three approaches.

| Difference in the locations of opportunistic citizen science observations and animal telemetry
We first estimated the potential biases in opportunistic observations by comparing them with telemetry data and available locations within the species' home range. Figure 2 shows that human activity variables (i.e., distance to roads, distance to human settlements, and path use intensity) influence the location of opportunistic observations and telemetry observations differently and that the contrast between telemetry and opportunistic data is species' specific. While opportunistic and telemetry locations are similarly distributed with regard to distance to roads for moose (mean dist to roads moose =741 and 757 m for opportunistic and telemetry locations, respectively), summary statistics indicate that opportunistic observations are on average closer to the roads for both roe deer and wild reindeer (mean dist to road roe deer = 262 and 422 m and mean dist to roads wild reindeer = 3574 and 8283 m for opportunistic and telemetry locations, respectively).
In contrast, opportunistic observations are on average closer to human settlements than telemetry observations for all species, although the discrepancy is particularly strong for moose (mean dist to human settlements moose = 3693, 6842; mean dist to human settlements roe deer = 2222, 4022; mean dist to human settlements wild reindeer = 19,417, 23,536 for opportunistic and telemetry locations respectively). Finally, the descriptive statistics show that opportunistic observations are located on average closer to more utilized paths than telemetry locations for all three species (mean log path use intensity moose = 3.54, 1.06; mean log path use intensity roe deer = 7.43, 6.77; mean log path use intensity wild reindeer = 4.41, 3.91 for opportunistic and telemetry locations respectively).

| Results of the simulation study
The simulation study was used to test our method under ideal conditions and confirm the intuition that accounting for biases reduces error in inference.
The results of the simulation (displayed on Figure 3) show that the model accounting for observer bias by sampling corrected available locations (i.e., RSF OPP corrected) returns coefficients that are more consistent with the parameter values used to simulate the species presence (i.e., simulated parameter value in Figure 3)

| Results from modeling the biases in opportunistic observations: the observer models
Even though the extent of the biases was qualitatively examined in Section 3.1, the method we suggest in this paper relies on the quantification of these biases. We quantified the biases by specifying an observer model and estimating the strength of the variables influencing the accessibility within the species' home range for a citizen scientist.
The observer model (Figure 4) confirms that there are spatial discrepancies between opportunistic citizen science data and telemetry locations. Results indicate that opportunistic observations are on average located within a different environmental space.
Nevertheless, while there are similarities between species, the extent to which the different variables influence opportunistic observation compared to telemetry observations differs. Opportunistic observations are on average closer to the roads than telemetry locations for both roe deer and wild reindeer (mean dist. to roads = −0.700, −0.729 for roe deer and wild reindeer respectively), but we only found a very weak effect for moose (mean dist to roads = −0.027). Nevertheless, opportunistic observations are closer to human settlements than telemetry locations for moose (mean dist to human settlements = −0.405).
The observer models show that opportunistic observations are more likely to be located near a heavily utilized path than telemetry locations for moose and wild reindeer mean log path use intensity = 0.247, 0.147 for moose and wild reindeer, respectively). It does not make a difference for roe deer, where opportunistic observations were located with a similar frequency as telemetry locations regarding path use intensity (mean log path use intensity = −0.027). Opportunistic observations of moose and roe deer are also likely to be made in areas with higher human density than telemetry locations (mean log population number = 0.204, 0.700). Finally, visibility (i.e., approximated by forest coverage within a grid cell) also had a major role as opportunistic observations for all three species were more likely to be found in grid cells containing F I G U R E 2 Boxplots of the distribution of telemetry locations, availability locations, and opportunistic citizen science observations within the distance to roads, distance to human settlements, and path use intensity spectrum for (a) moose, (b) roe deer, and (c) wild reindeer less forests (mean Forest coverage = −0.771, −0.876, −0.149 for moose, roe deer, and wild reindeer, respectively).

| Resource selection function obtained with a model using telemetry data, opportunistic data with random availability, and opportunistic data using a corrected availability
Finally, after quantifying the extent of the biases in 3.3 and sample availability locations regarding the observer model, we were able to estimate and compare the RSFs for the different models.
Discrepancies between the opportunistic citizen science observations and the telemetry locations had substantial effects on the naive OPP model ( Figure 5). Especially, in the model using telemetry data the coefficient was positive for distance to roads and distance to human settlements for all three species (mean dist to roads telemetry = 0.604, 0.779, 0.613, mean dist. human settlements telemetry = 1.005, 2.653, 0.341 for moose, roe deer, and wild reindeer, respectively). In contrast, the coefficients for distance to roads were negative for roe deer and wild reindeer (mean dist. to roads naïve CS = −0.036, −0.406 for roe deer and wild reindeer, respectively) in the naïve OPP model and negative for distance to human settlements for moose (mean dist. to human settlements naïve CS = −0.055). In the telemetry model, we can also see that the coefficient for path use intensity is negative for both While the corrected model is more consistent with the telemetry model than the naïve model for most of the coefficients, surprisingly we see in Figure 5 that the corrected OPP model coefficients do not get closer to the telemetry model for either proportion of agricultural F I G U R E 3 RSF parameter estimates obtained in the simulation for the naïve OPP model (in yellow) and the corrected OPP model (in green). Blue dots represent the parameter value used to simulate species presence across the simulated landscape. Dots represent the mean parameter estimates and bars the 95% credible intervals F I G U R E 4 Parameter estimates from the moose, roe deer, and wild reindeer observer models. The further away the estimate is from 0, the more OPP observations are affected by the variable compared to telemetry observations. Dots represent the mean parameter estimate and bars the 95% credible intervals coverage or altitude and even seems to perform worse than the naïve OPP model for these variables (mean agricultural coverage corrected CS = 0.204, 0.217, −0.099, mean altitude corrected CS = −1.896, −0.493, 3.297 for moose, roe deer, and wild reindeer, respectively).

| Suitability maps obtained with a model using telemetry, opportunistic with random availability, and opportunistic using a corrected availability
The

| D ISCUSS I ON
In this study, we show that it is possible to infer habitat selection of a species in an area rich in opportunistic data but where relatively little telemetry data is available and that opportunistic citizen science observations are skewed toward areas that are more easily F I G U R E 5 RSF parameter estimates for (a) moose, (b) roe deer, and (c) wild reindeer. In green are the parameter estimates obtained with the telemetry model, in yellow with the naïve OPP model, and in purple with the corrected OPP model. Dots represent the mean parameter estimates and bars the 95% credible intervals accessible and used by humans (i.e., the citizen scientists). Our results show that opportunistic observations are on average closer to human infrastructure and highly frequented trails than telemetry observations. This is are consistent with many studies pointing out the biases in opportunistic data (Geldmann et al., 2016;Sicacha-Parada et al., 2020;Tiago et al., 2017 F I G U R E 6 Suitability maps obtained using the mean coefficients of the telemetry (column 1), naïve OPP (column 2), and corrected OPP models (column 3) for roe deer (row 1), moose (row 2) and wild reindeer (row 3). Map values are the log-odds probabilities. On row 4 are the Pearson correlation coefficients between the habitat suitability maps obtained with the telemetry model and the ones obtained with the naïve OPP model and the corrected OPP model and S2). Alternatively, these results may be due to our failure to correctly understand how these environmental layers influence citizen scientists' movements and observation pattern.
Finally, the suggested method relies on a good estimation of observer bias. This requires reliable information about the species' ecology which can be obtained with GPS-telemetry data. Nevertheless, GPS-telemetry studies are costly and thus cannot be conducted everywhere on all species. Preliminary work (described in the annexes) suggests that using telemetry observation from an auxiliary species with similar habitat preference to a target species could be used to estimate observer bias, correct for availability locations used in the logistic model, and thus partially correct parameter estimates (see There are multiple reasons for why opportunistic observations do not accurately produce resource selection functions and more generally reflect species' ecology including spatial and temporal biases (Isaac et al., 2014). While the method suggested in this paper account for these biases to improve ecological inference, it is also necessary to improve opportunistic data collection. Citizen scientists usually report observations from quite human-dominated areas, or for instance when surprised to find a species in a place where they are not used to be seen, capturing only certain species' individual behavior that are not representative of the species usual range. Encouraging other types of citizen scientists, such as hunters or other outdoor enthusiast, could improve the coverage of the dataset and improve inference .
Despite its limitations, our method is a first step toward improving the use of opportunistic data in habitat selection studies. In fact, we do not present our method to correct for availability as an infallible technique but rather as a way to initiate conversations and research among ecologists to account for spatial biases in opportunistic data for more accurate inference at fine scales. Methods to account for variation in the observation process in opportunistic observations are developing and improving, notably with the potential of occupancy models (Altwegg & Nichols, 2019;Strien et al., 2013) and integrated models (Isaac et al., 2020). Nevertheless, these developments account for biases in opportunistic data at the distribution level (4th order of selection; Johnson, 1980), and to our knowledge, our study is the first attempting to find a general solution for using opportunistic data at finer scale. Instead of using telemetry data to infer biases in opportunistic data, it would also be possible to use other independent and reliable data sources such as observations systematically collected by professionals. It should be noted that our work is preliminary, and could be easily expanded. For instance, a possible extension of our method would involve using auxiliary species observation models to correct the habitat preference model of a target species (Figures S11-S13).

| CON CLUS ION
In this paper, we explore the challenges and the opportunities of using opportunistically collected citizen science data in habitat preference studies. We show that opportunistic data used in a naïve way can be misleading and result in spurious ecological inference.
Accounting for the observation process reduces this risk. Our study is a first step toward using opportunistic data for finer scale habitat analyses.

ACK N OWLED G M ENTS
BC was funded by a PhD scholarship from the Norwegian University of Science and Technology. JDCL and BVM were funded by the Research Council of Norway (grant 251112). We also thank the two anonymous reviewers for their helpful comments and suggestions.

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

O PE N R E S E A RCH BA D G E S
This article has been awarded Open Data, Open Materials Badges.

DATA AVA I L A B I L I T Y S TAT E M E N T
Code and data to run the analysis and the simulation study can be found on Zenodo, https://doi.org/10.5281/zenodo.5520616.