Mapping large‐scale bird distributions using occupancy models and citizen data with spatially biased sampling effort
Abstract
Aim
Although data collected by citizen scientists have received a great deal of attention for assessing species distributions over large extents, their sampling efforts are usually spatially biased. We assessed whether the bias of spatially varied sampling effort for opportunistic citizen data can be corrected using occupancy models that incorporate observation processes.
Location
Hokkaido Island, northern Japan.
Methods
We applied occupancy models for citizen data with spatially biased sampling effort to model and map large‐scale distributions of 52 forest and 23 grassland/wetland bird species. We used estimated species richness (summed occupancy probabilities among the species) as the aggregated distributional pattern of each species group and compared it among five models: two occupancy models (single‐species and multispecies), which incorporate observation processes, and two conventional logistic regression models and Maxlike, which do not explicitly deal with observation processes.
Results
Conventional logistic regression models and Maxlike predicted inappropriate patterns, such as forest species preferring lowland non‐forested areas where most of the data were collected. Occupancy models, however, showed more appropriate results, indicating that forest species preferred lowland forested areas. The prediction by logistic models was somewhat improved by the use of spatially biased non‐detection data as the absence data; however, estimates of species richness were still much lower than those of occupancy models. Differences in model outputs were evident for the forest species but not for grassland/wetland species because citizen data covered virtually all environmental niches for grassland/wetland species. Results of the single‐species and multispecies occupancy models were nearly identical, but in some cases, the single‐species models did not converge or gave estimates for a species that deviated notably from those of the other species, which was not the case for the multispecies model.
Main conclusions
We found that citizen data with spatially biased sampling effort can be appropriately utilized for large‐scale biodiversity distribution modelling with the use of occupancy models, which encourages data collection by citizen scientists.
Introduction
Contemporary ecology has been increasingly widening its scope in space and time. Moreover, global climate and land use changes are considered major threats to biodiversity (Parmesan & Yohe, 2003; Butchart et al., 2010). Therefore, modelling large‐scale biodiversity is of great importance to investigate the potential effects of climate change and land uses and to develop strategies to mitigate human‐induced biodiversity losses (Thuiller et al., 2008; Franklin, 2009). An important difficulty in answering these questions is how to collect sufficient data on species distributions over large extents and long time periods, which is difficult for individual researchers. In this context, citizen data have received a great deal of attention (Schmeller et al., 2009; Dickinson et al., 2010) because they allow many individuals to cover broad geographic ranges and various taxonomic groups (Sauer et al., 1994; Hopkins & Freckleton, 2002).
Citizen‐collected data, however, have some potential drawbacks (Danielsen et al., 2005; Dickinson et al., 2010). First, data quality regarding species detection is not necessarily consistent throughout the data set (Danielsen et al., 2005; Schmeller et al., 2009). For example, expert observers provide more precise records of detected species and population numbers than poorly trained observers (Sauer et al., 1994; Danielsen et al., 2005). Second, geographic bias may exist in sampling locations and survey efforts of citizen‐collected data (Reddy & Dávalos, 2003; Schmeller et al., 2009). Botts et al. (2011) reported that sampling locations and intensity in the South African Frog Project, conducted by volunteers and herpetologists, concentrated in spatially accessible sites or well‐known hotspots. This suggests that objectives, motivation and procedures (method and site selection) vary among citizens and that spatial accessibility influences the probability of detecting species (Reddy & Dávalos, 2003; Botts et al., 2011). Therefore, citizen data are referred to as ‘opportunistic data’ (Kéry et al., 2010).
The importance of accommodating differences in data quality and spatially biased sampling effort has been recognized (Devictor et al., 2010; Dickinson et al., 2010), and how to overcome these drawbacks is a major concern. One possible solution is the use of occupancy models (MacKenzie et al., 2006). Species detection probabilities are certainly < 1, and citizens usually perform observations repeatedly at the same sites. Using detection/non‐detection histories of citizen data, occupancy models can correct the spatially biased sampling effort of citizen‐collected data by considering observation processes (imperfect detection) and occupancy status of species separately in a hierarchical manner (MacKenzie et al., 2006; Royle & Dorazio, 2008). Other types of hierarchical models that consider imperfect detection have also been shown to effectively deal with uneven survey efforts among samples (Yamaura et al., 2011; Yamaura, 2013). However, this expectation has not been explicitly examined for opportunistic citizen data and occupancy models.
In this study, we applied different modelling methods to citizen data with spatially biased sampling effort and mapped large‐scale distributions of bird species richness. Birds are popular taxa and well observed over many years by many citizens, making them well suited for large‐scale modelling (Carignan & Villard, 2002). We used ordinary single‐species occupancy models as well as multispecies occupancy models (Dorazio & Royle, 2005; Dorazio et al., 2006) to estimate species richness (summed occupancy probabilities among the species) as the aggregated distributional patterns of target species. The results of occupancy models were compared to those of two conventional logistic regression models and Maxlike, which are standard methods and newly introduced techniques for species distribution modelling, respectively, but these models do not explicitly deal with observation processes. Finally, we discuss the effects of spatially biased sampling effort on species distribution modelling.
Methods
Species distribution data set
The study was conducted on Hokkaido Island, northern Japan (Fig. 1). Data sets of the distribution of common bird species were extracted from a wildlife distribution database developed and maintained by the Institute of Environmental Sciences, Environmental and Geological Research Department, Hokkaido Research Organization (Takada et al., 2009; Ono et al., 2013). This database is an aggregate of various sources, including published literature (e.g. local bird watchers' journals; see below) and original survey data. Therefore, the data were collected by many professional and citizen scientists using several survey methods across Hokkaido from 1970 to 2009. To best match the extent of the survey area to the extent of the modelling unit, we used the finest resolution available (1 km × 1 km) as the modelling resolution. We examined all survey records from the breeding season (April–August) and obtained distribution data for 52 forest bird species and 23 grassland/wetland bird species from 12,629 records composed of 2246 surveys at 625 sites (Table S1). We classified bird species into habitat groups (forest and grassland/wetland) according to the JAVIAN Database (http://www.bird-research.jp/appendix/br07/07r03.html). Sampling efforts—locations and number of surveys—were spatially concentrated around populated areas where agricultural fields were widespread, while fewer surveys were conducted in mountainous areas (Figs 1 & S1). Each record was assigned to one of four observation types according to observer type and survey method: point census by experts, line census by experts, observation with other methods by experts and observation with other methods by citizens. The methods categorized as other included surveys that lacked clear explanations, surveys by ships or aircraft, banding and questionnaires. In total, 53% of surveys were performed by citizens and 39% were performed by experts using methods other than point or line censuses. 
However, point and line censuses detected more species in a single survey than the other methods (Fig. S2).

Environmental covariates
Partially observed occupancy states of the modelled species were simply assumed to be a function of land cover and elevation. Land cover data were derived from 1:50,000 digital vegetation maps based on the second to fifth vegetation surveys provided by the Natural Conservation Bureau, Ministry of Environment (http://www.biodic.go.jp). We calculated the area of forest, grassland and wetland within each 1 km × 1 km mesh cell of the distribution data set. We calculated the mean elevation from the digital elevation model provided by the Japan Geographic Institute (http://www.gsi.go.jp/kiban/index.html).
Conventional logistic regression models
We developed two conventional logistic regression models using presence‐only (PO model) and presence–absence (PA model) data. Logistic regression requires both presence and absence data for model building, but opportunistically collected citizen data sometimes lack absence data. In this case, pseudo‐absence data are commonly used (Elith et al., 2006; Wisz & Guisan, 2009; Barbet‐Massin et al., 2012). To develop the PO models, we used cells with at least one detection record as presence data, whereas pseudo‐absence data were generated from 300 cells randomly extracted from the background data. We repeated the model fitting 100 times for each species and used the mean value as the representative estimate. The numbers of pseudo‐absences and simulation runs were considered sufficient to obtain accurate estimates by logistic regression models according to the methods of Barbet‐Massin et al. (2012). However, when non‐detection data were available, we used the surveyed cells in which no detection was recorded as a substitute for absence data (Elith et al., 2006; Phillips et al., 2009). We used this method to develop another set of logistic regression models (PA models). Thus, the difference between the PO and PA models was the spatial distribution of absence data (random or biased).
(1)
(2)
All the parameters of the PO and PA models were estimated by maximum likelihood using the program R v. 3.0.2 (R Core Team, 2013).
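The contrast between the PO and PA procedures can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes a single hypothetical covariate (forest fraction per cell), perfect detection in surveyed cells, and invented values for the true occupancy relationship and the survey bias. Only the 300 pseudo‐absences, the 100 repeated fits and the averaging of estimates are taken from the text; all function and variable names are illustrative.

```python
import math
import random

def inv_logit(eta):
    eta = max(-30.0, min(30.0, eta))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(xs, ys, iters=25):
    """Maximum-likelihood fit of logit(P(y=1)) = b0 + b1*x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = inv_logit(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient w.r.t. intercept
            g1 += (y - p) * x    # gradient w.r.t. slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

rng = random.Random(1)
n_cells = 5000
forest = [rng.random() for _ in range(n_cells)]  # hypothetical forest fraction
# Assumed truth: occupancy increases with forest cover.
occupied = [rng.random() < inv_logit(-2.0 + 4.0 * x) for x in forest]
# Assumed bias: surveys are concentrated in cells with little forest.
surveyed = [rng.random() < 0.8 * math.exp(-6.0 * x) for x in forest]

presence_x = [x for x, z, s in zip(forest, occupied, surveyed) if s and z]
nondetect_x = [x for x, z, s in zip(forest, occupied, surveyed) if s and not z]

# PO model: presences vs. 300 random background cells, refitted 100 times.
po_slopes = []
for _ in range(100):
    pseudo_x = rng.sample(forest, 300)
    xs = presence_x + pseudo_x
    ys = [1] * len(presence_x) + [0] * len(pseudo_x)
    po_slopes.append(fit_logistic(xs, ys)[1])
po_slope = sum(po_slopes) / len(po_slopes)

# PA model: presences vs. surveyed cells without detections.
xs = presence_x + nondetect_x
ys = [1] * len(presence_x) + [0] * len(nondetect_x)
pa_slope = fit_logistic(xs, ys)[1]

print(f"PO forest slope: {po_slope:.2f}, PA forest slope: {pa_slope:.2f}")
```

Under these assumptions the PO slope for forest cover comes out negative even though true occupancy increases with forest, whereas the PA slope is positive, because the PA absences share the spatial bias of the presences.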
Maxlike model
Maxlike (ML model) is a newly introduced maximum‐likelihood technique that estimates the probability of occurrence, rather than a ‘relative’ probability, from presence‐only data. Although this technique does not require pseudo‐absence data, it assumes that survey locations are a random sample, as does the use of random pseudo‐absence data (Royle et al., 2012). This model has recently gained major attention as a species distribution model with presence‐only data (e.g. Fitzpatrick et al., 2013; Merow & Silander, 2014).
To develop the ML models, we used the same presence data and habitat covariates as for the PO and PA models. The models were fitted using R (R Core Team, 2013) and the package maxlike v. 0.1-5 (Chandler & Royle, 2013) with the ‘BFGS’ method and 100,000,000 iterations to maximize the log‐likelihood functions.
Occupancy model
We developed two types of occupancy models, one that analysed each species individually (SO models) and another that analysed multiple species in a single model (MO models). MO models are an extension of SO models and are an appropriate framework for modelling species diversity including rare species (Dorazio & Royle, 2005; Dorazio et al., 2006), and we compared the performance of SO and MO models to deal with spatially biased sampling effort for citizen data.
(3)
(4)
(5)
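(6)
The SO and MO models had the same model structure as described above, but we assumed that the intercepts ($\beta\mathrm{int}_{i}$) and the regression parameters of the occupancy state ($\beta\mathrm{cov}_{i}$) and observation process ($\beta\mathrm{obs}_{i}$) of the MO models were species‐specific random variables governed by hyperparameters. We assumed that $\beta\mathrm{int}_{i} \sim \mathrm{Normal}(\mu\mathrm{int}_{\mathit{com}}, \sigma\mathrm{int}_{\mathit{com}}^{2})$, where $\mu\mathrm{int}_{\mathit{com}}$ and $\sigma\mathrm{int}_{\mathit{com}}^{2}$ were the mean and variance of the intercepts across the forest ($\mathit{com} = 1$) or grassland/wetland species ($\mathit{com} = 2$). Similarly, $\beta\mathrm{cov}_{i}$ and $\beta\mathrm{obs}_{i}$ were specified to follow normal distributions with mean vectors $\mu\mathrm{cov}_{\mathit{com}}$ and $\mu\mathrm{obs}_{\mathit{com}}$ and variance vectors $\sigma\mathrm{cov}_{\mathit{com}}^{2}$ and $\sigma\mathrm{obs}_{\mathit{com}}^{2}$, respectively.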
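To make the separation of occupancy state and observation process concrete, the sketch below evaluates the likelihood of one site's detection history under the basic single‐species occupancy model in its usual textbook form (MacKenzie et al., 2006): a site is occupied with probability ψ, and an occupied site yields a detection on survey k with probability p_k. This is a generic illustration and may differ in detail from the model equations of this study; the function names and the values ψ = 0.6 and p = 0.2 are hypothetical.

```python
def site_likelihood(history, psi, p):
    """Likelihood of a 0/1 detection history at one site.

    history: list of 0/1 outcomes, one per survey
    psi:     occupancy probability of the site
    p:       list of per-survey detection probabilities (given occupancy)
    """
    detect = 1.0
    for y, pk in zip(history, p):
        detect *= pk if y == 1 else (1.0 - pk)
    # An unoccupied site can only produce an all-zero history.
    never_detected = 1.0 if sum(history) == 0 else 0.0
    return psi * detect + (1.0 - psi) * never_detected

def prob_occupied_given_history(history, psi, p):
    """Conditional probability that the site is occupied, given its history."""
    detect = 1.0
    for y, pk in zip(history, p):
        detect *= pk if y == 1 else (1.0 - pk)
    return psi * detect / site_likelihood(history, psi, p)

# Hypothetical values: three surveys, psi = 0.6, p = 0.2 on each survey.
p = [0.2, 0.2, 0.2]
lik_detected = site_likelihood([0, 1, 0], 0.6, p)
lik_all_zero = site_likelihood([0, 0, 0], 0.6, p)
still_occupied = prob_occupied_given_history([0, 0, 0], 0.6, p)
print(lik_detected, lik_all_zero, still_occupied)
```

With these values, a site with three non‐detections still has a probability of about 0.43 of being occupied, which is the information a conventional model discards when it treats every non‐detection as an absence.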
The parameters of the SO and MO models were estimated in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) techniques (MacKenzie et al., 2006; Royle & Dorazio, 2008). The priors of $\beta\mathrm{int}_{i}$, $\beta\mathrm{cov}_{i}$ and $\beta\mathrm{obs}_{i}$ for the SO models and $\mu\mathrm{int}_{\mathit{com}}$, $\mu\mathrm{cov}_{\mathit{com}}$ and $\mu\mathrm{obs}_{\mathit{com}}$ for the MO models were normal distributions with mean 0 and variance 1000. Similarly, $\sigma\mathrm{int}_{\mathit{com}}$, $\sigma\mathrm{cov}_{\mathit{com}}$ and $\sigma\mathrm{obs}_{\mathit{com}}$ were given non‐informative uniform priors on [0, 100]. The posterior distributions of all parameters were obtained from three chains of 3,000,000 iterations after a burn‐in of 600,000 iterations, thinned at an interval of three, using JAGS version 3.3.0 (Plummer, 2013) in R v. 3.0.2 (R Core Team, 2013) with the package R2jags (Su & Yajima, 2013). The model was considered to have converged if the $\hat{R}$ values (the Gelman–Rubin statistic) of all parameters were < 1.1 (Gelman & Hill, 2007).
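The convergence criterion can be illustrated with a short computation of the Gelman–Rubin statistic for one parameter. This is a minimal sketch of the basic (non‐split) form of the statistic, not the exact routine used by JAGS or R2jags; the simulated chains and the function name are hypothetical.

```python
import math
import random
import statistics

def gelman_rubin(chains):
    """Basic Gelman-Rubin statistic for one parameter.

    chains: list of equal-length lists of posterior draws, one list per chain.
    """
    n = len(chains[0])
    chain_means = [statistics.mean(c) for c in chains]
    # W: mean within-chain variance; B: between-chain variance scaled by n.
    w = statistics.mean([statistics.variance(c) for c in chains])
    b = n * statistics.variance(chain_means)
    var_hat = (n - 1) / n * w + b / n
    return math.sqrt(var_hat / w)

rng = random.Random(0)
# Three hypothetical chains sampling the same distribution (well mixed).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Same chains, but the third is stuck around a different value (not converged).
stuck = [mixed[0], mixed[1], [x + 2.0 for x in mixed[2]]]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"well mixed: {rhat_mixed:.3f}, one chain shifted: {rhat_stuck:.3f}")
```

Chains that explore the same distribution give a value close to 1, whereas a chain centred elsewhere inflates the between‐chain variance and pushes the statistic above the 1.1 threshold.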
Finally, occupancy probabilities were summed across the species separately in each model, and spatial distributions of estimated species richness for forest and grassland/wetland species were obtained.
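As a small illustration of this last step, the sketch below sums per‐cell occupancy probabilities across species to obtain estimated richness for each mesh cell. The species names and probabilities are invented for the example.

```python
def estimated_richness(occupancy_by_species):
    """Sum occupancy probabilities across species for each mesh cell.

    occupancy_by_species: dict mapping species name -> list of per-cell
    occupancy probabilities (all lists in the same cell order).
    """
    n_cells = len(next(iter(occupancy_by_species.values())))
    return [
        sum(psi[j] for psi in occupancy_by_species.values())
        for j in range(n_cells)
    ]

# Hypothetical occupancy probabilities for three species in two cells.
occupancy = {
    "species_a": [0.90, 0.10],
    "species_b": [0.50, 0.20],
    "species_c": [0.25, 0.05],
}
richness = estimated_richness(occupancy)
print(richness)  # approximately [1.65, 0.35]
```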
Results
We obtained all parameters of the PO, PA, ML and MO models for the target species, but the SO models for 18 forest bird species and two grassland/wetland bird species did not converge (Table S1).
Detection probabilities for all the target species were estimated to be < 1 and clearly differed among the four observation types. Mean detection probabilities across the species estimated by the MO models were highest for line censuses by experts (0.25 ± 0.18 SD) followed by observation with other methods by citizens and point censuses by experts (0.15 ± 0.10), and observations with other methods by experts had the lowest detection probability (0.08 ± 0.06). Detection probabilities for forest species were generally lower (maximum of 0.57 in the MO models) than those of grassland/wetland species (max. 0.72) regardless of observation type.
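The practical meaning of these detection probabilities depends on the number of surveys at a site. As a rough sketch, assuming independent surveys with a constant per‐survey detection probability (a simplification, not a result of this study), the probability of detecting a present species at least once in K surveys is 1 − (1 − p)^K. The choice of three surveys below is hypothetical; the per‐survey values are the mean MO estimates reported above.

```python
def prob_detected_at_least_once(p, n_surveys):
    """Probability of one or more detections in n_surveys independent surveys."""
    return 1.0 - (1.0 - p) ** n_surveys

# Mean per-survey detection probabilities from the MO models (see text).
mean_p = {"highest (line census by experts)": 0.25,
          "intermediate": 0.15,
          "lowest (other methods by experts)": 0.08}

cumulative = {name: prob_detected_at_least_once(p, 3) for name, p in mean_p.items()}
for name, value in cumulative.items():
    print(f"{name}: {value:.3f}")
```

Even with the highest mean detection probability, a species present at a site would go unrecorded after three such surveys in roughly four cases out of ten, so sites with few surveys carry many false absences.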
Predicted preferred habitats differed depending upon whether observation processes were included (occupancy models) or not (conventional logistic regression models and Maxlike), as well as whether spatially biased absence data were included (PA model) or not (PO and ML models). These differences were prominent for forest species, but not for grassland/wetland species (Tables 1 & 2). For the PO and ML models, the occupancy probabilities of the forest species decreased with increase in forest area (Fig. 2) because the coefficients for forest area were negative for most species (Fig. S3). The estimated species richness of forest species was high in lowland non‐forested areas, which was more apparent for the ML models than the PO models (Table 1, Fig. 3). In contrast, the coefficients for forest area became positive in the PA models and highest in the occupancy models, and thus, the occupancy probabilities of the PA and occupancy models increased with increase in forest area. This led to a reversed prediction of spatial distributions of estimated species richness, which was high in lowland forested areas (Table 1). Estimates of mean species richness of the forest species over the entire study area (per mesh cell) were much lower for PO (5.1 species) and PA models (6.5 species) than ML (22.5) and MO models (25.1 species). In contrast, the occupancy probabilities of the grassland/wetland species in the five models increased with increase in grassland and/or wetland areas (Fig. 2) as one or both of these coefficients were positive (Fig. S4). The species richness of grassland/wetland species predicted by the PO, PA and ML models was also lower, but all models consistently indicated that regions with vast wetland areas should have high species richness (Fig. 3). Estimates of the SO and MO models were very similar (Tables 1 & 2), but in some cases, differences were observed in intercepts, parameters or detection probabilities (Figs 2, 4, S3, & S4). 
Specifically, estimates for some species from the SO models deviated notably from those for the other species, whereas the corresponding estimates from the MO models did not.
| Habitat types | PO model, n (%) | PA model, n (%) | ML model, n (%) | SO model, n (%) | MO model, n (%) |
|---|---|---|---|---|---|
| Forest birds | | | | | |
| Forest areas in higher elevation | 0 (0) | 4 (11.8) | 0 (0) | 17 (50) | 7 (20.6) |
| Forest areas in lower elevation | 1 (2.9) | 25 (73.5) | 1 (2.9) | 14 (41.2) | 24 (70.6) |
| Non‐forest areas in higher elevation | 10 (29.4) | 1 (2.9) | 17 (50) | 1 (2.9) | 0 (0) |
| Non‐forest areas in lower elevation | 23 (69.6) | 4 (11.8) | 16 (47.1) | 2 (5.9) | 3 (8.8) |
| Grassland/wetland birds | | | | | |
| Both grassland and wetland areas | 5 (23.8) | 4 (19.0) | 4 (19.0) | 11 (52.4) | 10 (47.6) |
| Only grassland areas | 0 (0) | 1 (4.8) | 0 (0) | 1 (4.8) | 3 (14.3) |
| Only wetland areas | 14 (66.7) | 12 (57.1) | 15 (71.4) | 8 (38.1) | 8 (38.1) |
| Non‐grassland and wetland areas | 2 (9.5) | 4 (19.0) | 2 (9.5) | 1 (4.8) | 0 (0) |

| | PO model | PA model | ML model | SO model |
|---|---|---|---|---|
| Forest birds | | | | |
| PA model | −0.61 | | | |
| ML model | 0.90 | −0.73 | | |
| SO model | −0.98 | 0.71 | −0.88 | |
| MO model | −0.93 | 0.84 | −0.94 | 0.95 |
| Grassland/wetland birds | | | | |
| PA model | 0.98 | | | |
| ML model | 0.90 | 0.88 | | |
| SO model | 0.89 | 0.86 | 0.94 | |
| MO model | 0.92 | 0.90 | 0.89 | 0.98 |



Discussion
Despite the invaluable potential of citizen data (Devictor et al., 2010; Dickinson et al., 2010), to our knowledge, no studies have examined the ability of occupancy models to deal with spatially biased sampling effort for citizen data. We compared estimates among the occupancy and conventional logistic regression models using citizen data with spatially biased sampling effort. As expected, our results showed that the probability of species occurrence and resultant species richness predicted by occupancy models were higher than those predicted by the conventional models. If the number of pseudo‐absences were increased following common recommendations (Phillips & Dudík, 2008; Barbet‐Massin et al., 2012), this discrepancy would become even more marked. Furthermore, conventional logistic models and Maxlike predicted that forest species preferred non‐forested areas (mainly agricultural landscapes), while occupancy models predicted that they preferred forested landscapes. The performance of the two occupancy models in handling spatially biased sampling effort for citizen data was almost identical. Although model validation was not performed due to the lack of independent unbiased data, these results suggest that opportunistic citizen data can be useful data sources for modelling large‐scale biodiversity if their spatially biased sampling effort is effectively considered. As with other hierarchical models that account for imperfect detection (Yamaura et al., 2011; Yamaura, 2013), occupancy models appear promising for handling spatially biased sampling effort. Occupancy models can make use of opportunistically collected citizen data, encouraging citizen scientists to collect more data.
Pseudo‐absence data extracted from unsampled locations are often used for species distribution modelling of presence‐only data, such as herbarium specimen records (e.g. Elith et al., 2006). Our results showed that when spatially biased presence data and random pseudo‐absence data were analysed by the logistic regression model, the signs of the coefficients of habitat covariates can be reversed, as demonstrated by the prediction of more forest species in non‐forest habitats. Previous studies proposed several ways to consider the spatially biased sampling effort when generating pseudo‐absence data: using the detection records of other species in the same data set (Elith et al., 2006; Phillips et al., 2009), extracting points randomly from within a certain distance of presence locations (spatial filtering; VanDerWal et al., 2009; Kramer‐Schadt et al., 2013), or extracting points based on known spatially biased sampling effort (e.g. bias file for Maxent; Phillips et al., 2009; Elith et al., 2011). The first approach is almost the same as the PA models in this study, which yielded good estimation results. This suggests that the estimates of PO models would be greatly improved if presence and pseudo‐absence data were spatially biased in the same manner. However, creating such ideal pseudo‐absence data is considered to require trial and error, as shown by many previous studies (VanDerWal et al., 2009; Wisz & Guisan, 2009; Barbet‐Massin et al., 2012; Kramer‐Schadt et al., 2013), because true absences are rarely known. When non‐detection data are available, we believe that occupancy models are a more cost‐efficient way to deal with opportunistic citizen data than devoting precious time and energy to pseudo‐absence calibration. Although occurrence data are relatively easy to collect (Bailey et al., in press), non‐detection data are only sometimes reported (although the data we used did include such information). When non‐detection data are not reported, we cannot apply occupancy models and need alternative methods.
Recently, Maxlike has gained significant attention for species distribution modelling (e.g. Fitzpatrick et al., 2013; Merow & Silander, 2014) because the pseudo‐absence data are not required for the inference (Royle et al., 2012); however, our results showed that if sampling locations are spatially biased, Maxlike can also produce biased estimates. These results suggest that, as with conventional logistic models, Maxlike assumes random sampling (Royle et al., 2012) and is sensitive to spatially biased sampling effort. Problems entailed by presence‐only data may not be fully resolved without absence data (Royle et al., 2012; Hastie & Fithian, 2013; Merow & Silander, 2014).
Even if sampling locations are spatially biased, results of the distribution of grassland/wetland bird species by conventional models and Maxlike were qualitatively similar to those of occupancy models, although estimated species richness was much lower compared with occupancy models. The contrasts between forest and grassland/wetland species are caused by differences in the spatial distribution of forested and grassland/wetland areas. Grasslands are clustered around residential areas in the lowlands of Hokkaido, and thus more accessible than forested areas, which are predominantly in remote areas. In addition, wetlands might be well‐known hotspots where citizens conduct more surveys (Reddy & Dávalos, 2003; Botts et al., 2011) because these areas have high biodiversity due to their location in the transition zone between terrestrial and aquatic environments. As a result, sampling locations in the data set fully covered the environmental variation of the grassland and wetland areas. Conventional logistic regression models and Maxlike can provide comparable estimates when the entire environmental niche space of target species is covered, as demonstrated by the results for grassland/wetland birds.
Acknowledgements
We are very grateful to Dr. Marc Kéry and two anonymous referees for valuable comments on the manuscript. This research was supported by the Environmental Research and Technology Development Fund (D‐1201) of the Ministry of the Environment, Japan. Y. Yamaura was supported by JSPS KAKENHI Grants 23780153 and 25252030.
References
Biosketch
All authors of this manuscript are interested in wildlife distributions and the application of species distribution modelling to effective conservation planning.
Author contributions: Y. Yamaura and I.K. conceived the ideas, M.H. led the analysis and writing of the manuscript, Y. Yabuhara and M.S. helped perform the modelling and write the manuscript, and S.O. contributed to the collection and management of the distribution data. All authors contributed to preparing the manuscript.




