Seasonal occurrence and abundance of dabbling ducks across the continental United States: Joint spatio‐temporal modelling for the Genus Anas

Estimating the distribution and abundance of wildlife is an essential task in species conservation, wildlife management and habitat prioritization. Although a host of methods and tools have been proposed to accomplish this undertaking, several challenges remain in accurately forecasting occurrence and abundance for highly mobile species. Exhibiting extensive geographic ranges with seasonally varying local occupancy, migratory ducks are exemplar highly mobile species and are foci for waterfowl conservation and management globally. With the goal of informing species conservation and management, our aim was to leverage citizen science data to estimate occurrence and relative abundance for ten dabbling duck species across the continental United States.

Likewise, approaches intended to derive abundance estimates from presence-only observations may assume a linear correlation of presence to abundance when in fact the relationship varies across multiple spatial and temporal scales (Johnston et al., 2015;Oliver et al., 2012;Yin & He, 2014).
To address some of the outstanding concerns with modelling highly mobile species, we employed spatially and temporally explicit techniques to estimate occurrence and relative abundance for ten migratory waterfowl species. Spatially explicit models are differentiated from those that are implicit in that the explicit variety considers the relative positions of individuals (without aggregating occurrences to grid or raster cells) and addresses spatial autocorrelation and spatial dependency as intrinsic parts of the model rather than through iterative fitting or the post hoc adjustment of model outputs (Hui, McGeoch, & Warren, 2006;Perry et al., 2002;Veldtman & McGeoch, 2004;Wiens, 2000). The spatial distribution of a species encodes information about the underlying ecological processes that gave rise to it (Hurtt & Pacala, 1995;Velázquez, Martínez, Getzin, Moloney, & Wiegand, 2016); therefore, it is essential that spatially explicit methods be applied to reveal species ecology through analysis of geography. Indeed, neglecting spatial autocorrelation may result in parameter estimates with smaller uncertainty than is warranted and may produce predictions with considerable error (Dormann, 2007;Elsner et al., 2016;Hoeting, 2009;Renner et al., 2015). In a comparable manner to spatial autocorrelation, temporal autocorrelation can also give rise to model error (Cressie & Wikle, 2011;Paradinas et al., 2017), which may be present whenever sampling has occurred repeatedly, at regular intervals, or over extended periods of time. For all of these reasons, spatially and temporally explicit methods offer distinct advantages over non-explicit alternatives, such as maximum entropy and machine-learning techniques (e.g., random forests, boosted regression trees), particularly when the focus of a study includes species or diseases that vary across time/space (Martnez-Minaya et al., 2018;Paradinas et al., 2017;Runge et al., 2014).
Our current proposal is to combine citizen science data with advanced spatio-temporal modelling techniques to jointly estimate occurrence probability and relative abundance for ten dabbling duck species in the Genus Anas. In addition to their contribution to ecosystem function (Bauer & Hoye, 2014;Figuerola, Charalambidou, Santamaria, & Green, 2010), economic importance (USFWS, 2015) and cultural benefit to humans (Schummer, Coluccy, Mitchell, & Elsen, 2017;Soulliere et al., 2007), several dabbling ducks species are of interest due to possible roles in avian influenza transmission (Van Den Brand et al., 2018;Papp et al., 2017;Ramey, DeLiberto, Berhane, Swayne, & Stallknecht, 2018;Ramey et al., 2014Ramey et al., , 2016. Choosing the continental United States as our study area, our primary goals include (a) leveraging freely available citizen science data to estimate seasonal species occurrence and abundance, (b) quantifying inter-annual variation in duck environmental associations and (c) demonstrating a powerful and flexible technique to account for errors arising from both spatial and temporal autocorrelation, as well as the biases inherent to data collected through citizen science efforts lacking standardized sampling protocols.

| Species data
period 2015-2017. Following pre-processing (see Section 2.3), 999,377 occurrence records were incorporated for spatio-temporal modelling. The number of records per species ranged between 10,039 and 190,309.

| Study domain and environmental data
The study domain encompassed the entire continental United States and exhibited an area exceeding 7.7 million km 2 . Environmental data (see Appendix S1: Supplement A) included two spatial datasets used to characterize land cover and dominant agricultural land use across the study domain: the 2011 National Land Cover Database (Homer et al., 2011Cropland Data Layer (CropScape, 2017. A "distance to water" covariate providing the Euclidean distance (km) to the nearest surface water class (inclusive of wetlands) within the National Land Cover Database (NLCD) was also calculated using the spatstat R package (Baddeley & Turner, 2005). Both the NLCD and Cropland datasets were categorical rasters at 30-m resolution. A topographical roughness index, a topographical wetness index and a continentality raster (30 arc-seconds continuous rasters) from the ENVIREM dataset (Title & Bemmels, 2017) were adopted to model topographical heterogeneity across the study domain. Indices for topographical roughness and wetness were intended to help differentiate wetlands and surface waters from other landscape components while continentality (average coldest month temperature subtracted from average warmest month temperature in °C) aided in distinguishing between terrestrial and marine climate regimes (Sayre, Comer, Warner, & Cress, 2009). Spatial and temporal climate variability were captured through inclusion of minimum monthly temperature and monthly precipitation rasters from the PRISM Climate Group at a 4-km resolution (PRISM, 2018). Additionally, a continuous raster (30 m) giving human population density (CIESIN, 2016) was used as one of two covariates to help control for possible observer bias associated with the reporting of duck occurrence locations (see Section 2.3).

| Preliminary analysis
Considering our point-referenced observation data and interest in quantifying both occurrence and abundance across a geographically expansive study domain, we adopted the stochastic partial differential equation (SPDE) approach (Lindgren, Rue, & Lindström, 2011), such that our joint modelling framework incorporated a two-tiered conditionally dependent structure with shared components. For clarification, we use the term "joint model" to refer to a single model consisting of multiple parts, levels or submodels that might typically be implemented as separate or individual analyses. The first level of our two-part joint model can be viewed as a submodel for duck occurrence probability, whereas its second level is a submodel for relative abundance. To avoid repeated and vague use of the word "model," the terms "tier" or "stage" are used to reference either the joint model's occurrence probability or relative abundance level specifically. Although our model exhibits a two-part structure that is comparable to that associated with occupancy models, which jointly estimate detection probability (observation process) with species occurrence probability (Johnston, Fink, Hochachka, & Kelling, 2018;MacKenzie et al., 2006), we emphasize that our model does not directly estimate detection. Rather than combining detection and occurrence probability, we propose a joint modelling approach that joins species occurrence probability with estimated relative abundance, such that species occurrence modelled in the first stage is passed as a covariate to the model's second stage for improved abundance estimation. Stated differently, the model's second stage (relative abundance) is conditional on the first stage (occurrence probability). The SPDE method was selected because of its ability to ease processing time associated with the spatial random effects used to quantify spatial autocorrelation and other spatial error in the joint model. Rather than using spatial grids or polygons, the SPDE method enables spatial random effects to be estimated across a triangulated mesh, which greatly reduces the time needed to run the models (Lindgren et al., 2011).
Prior to constructing species distribution models, pre-processing and exploratory analyses were performed to clean and better organize duck occurrence and environmental data for model fitting.
Steps were also taken to avoid covariate multicollinearity and similar statistical issues commonly encountered during species distribution modelling.
A major advantage of the eBird database is that criteria describing the field survey method used to collect data (i.e., point, transect and aerial survey) are included with each record, as are the number of participating surveyors, the duration of surveys and other essential information (Fink et al., 2010;Sullivan et al., 2014).
For the current study, records obtained from the eBird database were restricted to the most recent three-year period (2015-2017), included only complete checklists (i.e., all species present were documented and reported) and were filtered to remove records not yet reviewed by a participating member of the eBird organization (see Sullivan et al., 2009 for review standards). To reduce the range of values used in estimating abundance, all reported counts were truncated at 100 birds. Truncating counts were accomplished by recoding bird counts greater than 100 as 100. To avoid arbitrariness in choosing record filtering criteria and in selecting truncating thresholds, we followed comparable strategies to those reported in the literature (Cardador, Carrete, Gallardo, & Tella, 2016;Fink et al., 2010;Johnston et al., 2015;McCormack, Zellmer, & Knowles, 2010;Zipkin et al., 2017). Two actions were also taken to reduce the observer bias that is often inherent to citizen science data (Dormann, 2007;Johnston et al., 2018;Kadmon, Farber, & Danin, 2003;Renner et al., 2015). Firstly, an observation effort covariate was constructed by multiplying each record's reported number of observers by the total observation period duration (minutes) and then scaling the product to the range from zero to one. The resulting covariate reflected sampling effort (observer hours) with values close to zero indicating relatively minimal observation effort and a value of one signifying the maximum. Secondly, human population density was included as an independent variable with all models to account for the possibility that duck observations were proportional to the number of persons residing in the vicinity (Kadmon, Farber, & Danin, 2004). Beyond providing a means of accounting for observer bias during model fitting, the effort and human population variables facilitated prediction under assumptions of maximal observer effort and a proportionate collection of observations across the landscape regardless of human accessibility. That is, these variables allowed us to model both the observations and the observation process (Besbeas & Morgan, 2014;Gould & Nichols, 1998;Humphreys, Elsner, Jagger, & Pau, 2017;Zipkin et al., 2017).
Several Gaussian random fields (i.e., spatial random effects) were constructed to quantify the spatial relationships among duck observations in our dataset. The random fields served as error terms to measure spatial autocorrelation and other uncertainty associated with the observed point patterns within and between each tier of the model. To create the Gaussian random fields, a three-dimensional, triangulated mesh scaled to represent the curvature of the Earth's surface was constructed using guidelines provided by Lindgren et al. (2011). The mesh intersections (vertices) also served as quadrature locations to support numerical integration for some model components. We avoid using the term "pseudo-absence" as quadrature locations were intended to characterize the background environment (Renner et al., 2015) and should not be interpreted as locations devoid of ducks. Tools available in the r-INLA package for R  facilitated mesh construction by providing commands to control the distance between mesh vertices, the maximum length of triangle edges, the minimum permissible angle at edge intersections and the length of the mesh outer extension. The final mesh included 4,953 locationspecific vertices arranged in a regular, but not uniform, pattern across the study domain.
To organize data, a spatial point dataset was constructed by combining duck occurrence locations with a separate set created from the mesh vertices. Using the raster R package (Hijmans, 2016), location-specific values for the variables listed in Section 2.2 were extracted to the combined spatial points dataset and then scaled and centred for ease of computation and later interpretation. In the case of time-stamped temperature and precipitation data, extractions were undertaken to match monthly climate estimates to the corresponding year and month that individual ducks were observed. The resulting dataset included an attribute table with rows corresponding to the locations of duck occurrences or quadrature points and columns recording the coincident covariate for each geographic location. To enable season-specific modelling and comparison, reported duck observation dates were used to disaggregate the full spatial point dataset into spring (16 February-31 March), breeding (01 April-31 August), fall (01 September-30 November) and winter (01 December-15 February) subsets. In total, forty spatial point datasets were created and modelled; four season-specific models for each of ten focal duck species.
To gauge the potential for multicollinearity among covariates, collinearity diagnostics for independent variables as described by Belsley, Kuh, and Welsch (1980) were estimated for each duck species and season using the perturb R package (Hendricks & Pelzer, 2004). This method adds random noise to independent variables and then calculates variance decomposition proportions and a condition index to assess collinearity. Evaluation of model variables resulted in condition index scores for each species and season combination ranging between 11.2 and 19.8. These scores fell well below the Belsley et al. (1980 ) proposed threshold of 30 giving us confidence that collinearity would not bias model results.
As a final step before model fitting, a minimum of 25% of occurrence points were randomly selected from each duck species and season for model testing and evaluation. In the case of the abundant mallard (MALL), additional points were randomly selected for evaluation to reduce computer processing time and improve performance (see, Section 2.5). The final column of Table 2 provides the number of records used to fit and evaluate models for each focal species. Note: Sensitivity, specificity and the area under the receiver operating characteristic curve (AUC) are scaled from 0 to 1, and the True Skill Statistic (TSS) score ranges from −1 to 1. Higher scores signify improved performance for all metrics. Train/Test column provides the count of records used to fit and evaluate models, respectively. Parenthetical values provide the 95% confidence interval. Seasonal models are pooled by species.

| Statistical model
The The model's second-tier incorporated observed bird counts as the response variable and had the same covariates as the first-tier linear predictor (Equation 1), except that the second-tier also included the observer effort variable described during preliminary analysis (Section 2.3) and a shared spatial field created by copying the spatial effect  1 (s,t) from the first tier (Equation 1). Rather than coding unobserved locations associated with mesh vertices as zero as done in tier-one, the vertex locations were set to "NA" (missing data) in the second-tier resulting in a zero-truncated response vector. Adopting a Poisson likelihood with a mean μ and log link function, the second-tier process was, where β 0 is the intercept and β 1 through β 9 provide coefficients for the same covariates described under the first tier. Observer effort, which varies by location and year, is quantified by β 10 ,  2 (s,t) gives the spatial effect for second-tier observations, and  3 (s,t) is a copy of  1 (s,t) from the first tier that is shared with the second-tier. The shared field controls for correlation between our two model tiers

| Model fitting and evaluation
Model fitting and evaluation were performed as an iterative process in which models were constructed for each focal species and season and then validated against the testing dataset randomly partitioned during pre-processing. As occurrence probability comparison metrics, we calculated model sensitivity (proportion of correctly predicted presences), specificity (proportion of correctly predicted absences) and the area under the receiver operating characteristic curve (AUC) using the PresenceAbsence package (Freeman & Moisen, 2008) for R. Because the AUC may not be appropriate for comparison between species distribution models (Lobo, Jimnez-Valverde, & Real, 2008;Shabani, Kumar, & Ahmadi, 2016), we also determined the True Skill Statistic (TSS) by subtracting a value of one from the sum of the sensitivity and specificity estimates (Allouche, Tsoar, & Kadmon, 2006). Although the total number of presences available in our dataset differed by species, season and year, presence counts were largely balanced by background points drawn from the triangulated mesh. That stated, we recognized that a 0.5 threshold log it ( s,t ) = 0 + 1 Ppt (s,t) + 2 Cont s + 3 Twet s + 4 Tri s + 5 Pop s may not be optimal (Guisan & Theurillat, 2000) for all forty speciesseason combinations that were modelled and likewise acknowledge that a variety of other threshold selection criteria have been proposed (Bean, Stafford, & Brashares, 2012;Cramer, 2003;Liu, Berry, Dawson, & Pearson, 2005;Raxworthy, Ingram, Rabibisoa, & Pearson, 2007).
To assess the accuracy of duck abundance estimates, we calculated the Spearman rank correlation (referred to as correlation or "Cor" here forward) between model predicted abundance and the number of observed birds reported with eBird records as well as the correlation between predicted abundance and duck presence. In the case of mallard (MALL) models, the initial correlation between predicted and observed abundance fell below 0.50 causing us to re-run the models after reducing the MALL dataset. As described in Section 2.3, this resulted in a greater proportion of points being retained for model testing. Overall, accuracy statistics for occurrence probability appeared more consistent across species than those for abundance. More simply, greater between species variability is apparent in the Cor-A measure than in the TSS, sensitivity, specificity or AUC metrics. This disparity was also evident in the degree of inter-annual temporal cor- for all models in data Appendix S1: Supplement A.

| RE SULTS
Plotting predicted NSHO occurrence against predicted logscaled abundance (Figure 1) illustrates that although the two estimates are highly correlated (Table 3[Cor-B]), the relationship is non-linear with a rapid increase in abundance at locations having a greater than a 90% chance of occurrence. Note that the upper tail of each plot shows exponential increase.
Maps depicting predicted occurrence probability and estimated relative abundance for all focal duck species are included with data Appendix S1: Supplement B data. For demonstrative purposes, Pothole Region is recognized as being of critical importance to NSHO and a diverse range of other waterfowl (Fairbairn & Dinsmore, 2001;Klett, Shaffer, & Douglas, 1988;Reynolds, Shaffer, Renner, Newton, & Batt, 2001).
Although duck association with land cover varied little throughout the year, seasonal variability was evident for other model covariates

| D ISCUSS I ON
We presented a spatially and temporally explicit species distribution model for joint estimation of occurrence and abundance across the continental United States. With a focus towards migratory dabbling ducks, our primary objectives included undertaking a statistically rigorous treatment of potentially biased citizen science data, quantifying possible time-varying changes to duck environmental associations and demonstrating a robust and flexible method to account for spatial and temporal autocorrelation. Because point process models offer ecologically relevant information pertaining to both the observed pattern and underlying process (Fithian, Elith, Hastie, & Keith, 2015;Illian et al., 2013;Simpson et al., 2015;Velázquez et al., 2016), we utilized point-specific observation data from the eBird database (Sullivan et al., 2009) and opted to implement a geostatistical approach under a Bayesian perspective Lindgren et al., 2011;Rue et al., 2009). Observation data from the eBird database were found subject to observer bias (Link & Sauer, 1998), and accordingly, the effort and human population density covariates developed to control for sampling bias were shown to be important for all species and the majority of seasons (see Appendix S1: Supplement A). Though biased, we concur with Fink et al. (2010) that a major benefit of using the eBird database is that it provides observer-specific criteria to aid in quantifying survey effort for each record. It is also worth noting that our survey effort covariate had a stronger model influence than did the human population density variable and, like the environmental covariates, the strength and importance of both varied by season.
All modelled climate and topographical covariates showed seasonal variability. The lowest degree of seasonal change was seen in the land cover and agricultural variables derived from the NLCD and CropScape with the greatest being apparent in the temperature and water proximity covariates. The CropScape covariate did suggest the possibility that dabbling duck selection of rice, alfalfa, oats and other cultivars may change throughout the year (see Appendix S1:  Figure 2 and that predicted occurrence probabilities outside of the PPR are set to zero (tan colour) for improved visual clarity 100% environmental and geographic space, the strength of these correlations may have important implications for waterfowl conservation and wildlife disease risk assessment. For example, Figure 1 serves to emphasize that although duck presence is highly correlated to duck abundance, this relationship is non-linear and ducks may be substantially more abundant at some locations and times than at others.
That is, a correlation of r = 0.55 still allows for sizable inter-annual variability and the locations of densest duck aggregation may change year to year. As an applied matter, it is an important consideration that not all suitable habitat is created equal; some areas of suitable habitat may support substantially more ducks than do others. In an epidemiological context, if wildlife disease transmission rates correspond to population density, some locations may be at considerably greater risk than others and these higher-risk locations may move in the landscape from year to year.
Finally, we believe that our modelling framework helps address some of the outstanding concerns with modelling highly mobile species and underscore that spatially and temporally explicit methods grant distinct advantages over non-explicit alternatives. Our results for ten dabbling duck species appear reasonably robust considering use of publicly sourced data and the areal extent of our study area. Species distribution modelling is not a substitute for field survey; however, when formal survey data are not available and cannot be practically acquired, the presented framework offers a realistic alternative for modelling highly mobile species that vary thorough time and across space.

ACK N OWLED G EM ENTS
We thank J. Andrew Royle for assistance in reviewing the manuscript,