Keywords:

  • cross-data validation;
  • data efficiency ratio;
  • data quality;
  • eBird;
  • North American Breeding Bird Survey;
  • species distribution model

Summary

1. Species monitoring is an essential component of assessing conservation status, predicting effects of habitat change and establishing management and conservation priorities. The pervasive access to the Internet has led to the development of several extensive monitoring projects that engage massive networks of volunteers who provide observations following relatively unstructured protocols. However, the value of these data is largely unknown.

2. We develop a novel cross-data validation method for measuring the value of survey data from one source (e.g. an Internet checklist program) relative to a second, benchmark data source. The method fits a model to the data of interest and validates the model using benchmark data, allowing us to isolate the training data's information content from its biases. We also define a data efficiency ratio to quantify the relative efficiency of the data sources.

3. We apply our cross-data validation method to quantify the value of data collected in eBird – a western hemisphere, year-round citizen science bird checklist project – relative to data from the highly standardized North American Breeding Bird Survey (BBS). The results show that eBird data contain information similar in quality to that in BBS data, while the information per BBS datum is higher.

4. We suggest that these methods have more general use in evaluating the suitability of sources of data for addressing specific questions for taxa of interest.


Introduction

Species monitoring is used for assessing conservation status, ascertaining and predicting the effects of habitat change, establishing management and conservation priorities, and determining how management efforts are meeting objectives (US NABCI Monitoring Subcommittee 2007). Effective monitoring requires that collected data are representative of the region of interest (e.g. Sauer 2000), that data collection methods ensure a relatively high probability of detecting the target species (e.g. Conway & Timmermans 2005), and that data can be adjusted for imperfect detection of organisms (e.g. MacKenzie 2006). Meeting all of these goals would probably require a protocol tailored to each individual species, which is impractical. Thus, species monitoring is typically accomplished through more generic data collection schemes. While data from such programs have their weaknesses (Nichols & Williams 2006), they are useful resources for finding answers to unanticipated questions for which no data exist from a ‘well-designed’ monitoring scheme. For example, much of our knowledge of climate change and its effects on distribution (e.g. Thomas & Lennon 1999; Hickling et al. 2005) and demography (e.g. Jiguet et al. 2006) is based on data from generic monitoring schemes. Further, it is not always the case that a researcher has sufficient knowledge of the important processes to design studies and create protocols and statistical models that are appropriate to the biological objectives (Fitzpatrick et al. 2009). The prior knowledge needed to appropriately design targeted studies must come from somewhere, and generic monitoring schemes can provide the needed insights to effectively design targeted studies (Hochachka et al. 2007).

One relatively unexplored source of generic, surveillance monitoring data is low-structure, high-volume checklist programs, whose data value has not been thoroughly investigated (although focused demonstrations of information content in the data exist; e.g. Hochachka & Dhondt 2000; Koenig 2001; Kéry et al. 2010). For birds, numerous checklist-based monitoring projects exist in the Americas (Sullivan et al. 2009) and elsewhere (Schmid et al. 2001; Baillie et al. 2006; Harrison, Underhill & Barnard 2008). Often these projects engage large numbers of volunteers, making the cost per datum trivial (Sullivan et al. 2009). The utility of these data depends not only on how they were collected but also on the species, performance measures and questions of interest; data quality cannot be measured in the abstract. What is needed are general methods for quantifying the effectiveness of different monitoring schemes for answering particular questions about species of interest. Such methods would identify the most useful data source for meeting an analyst's needs. However, comparing monitoring schemes is complicated by the fact that each monitoring program has its own sets of biases and sources of variance. For example, the North American Breeding Bird Survey (BBS; Robbins, Bystrak & Geissler 1986) is a highly standardized survey conducted along roadsides, which influences how frequently different species are detected (e.g. Bart, Hofschen & Peterjohn 1995; Hanowski & Niemi 1995). By contrast, the eBird project (Sullivan et al. 2009) allows volunteers to conduct surveys wherever they want to watch birds, resulting in much denser data near major population centres (Fig. 1).

Figure 1. North American Breeding Bird Survey data are collected using a stratified design to ensure roughly uniform spatial coverage; eBird data are collected wherever birders choose to go, and are denser where people live. Each pixel shows the number of submitted checklists within a 20-km-by-20-km square; white indicates that no checklists were submitted during the May–July period (2003–2008).

In this paper we develop a cross-data validation method for quantifying the predictive power of a data source relative to a benchmark data source. For concreteness, we present the method in the context of monitoring species status and distribution, but the method is also applicable to monitoring population trends. Our method compares the predictive power of two models (species distribution models in our example). The first model is learned using the data source of interest (called candidate data in the rest of the paper), while the second is learned using the benchmark data. We define a data efficiency ratio to quantify how many candidate data samples are needed to equal the information in a single benchmark sample.

To demonstrate the method, we quantify the value of eBird breeding season data relative to BBS data, for 75 regularly occurring North American bird species. eBird is a general purpose data source that collects bird observations year-round from any location in the western hemisphere. This generality is a potential boon for studying bird populations at times and places not covered by established monitoring programs. In particular, established large-scale monitoring programs are limited to the breeding season, and relatively few data are available about bird populations during migration and winter. However, the reliability of eBird data, which are collected via a low-structured protocol, needs to be verified before scientists can use the data with confidence. Our case study compares the general purpose eBird data to the more specialized BBS data because the BBS is an established and widely used data source, with a highly structured protocol, whose reliability has been extensively studied. We mitigate the differences between the two data sources by choosing a subset of eBird data that approximates the characteristics of BBS data. The results show that eBird data contain breeding season distribution information comparable in quality with that in BBS data. More generally, our case study demonstrates that data from a low-structure data-collection protocol contain useful information; with enough such data, the information can be similar in quality to that collected using a more intensive, structured data scheme.

Materials and methods

Cross-data validation framework

Directly comparing the information in the candidate data set with that in the benchmark data set is hard because projects use different sampling designs and protocols, and draw samples from different points in space and time. To overcome this, our approach is to fit a pair of models (using the respective data sets) that can predict the response variable (e.g. species occurrence) as a function of predictor variables (e.g. habitat and climate). The predictive models generalize from the available samples to any point in space/time, and summarize the information available from the two data sets. We can directly compare the models’ predictions for a common set of independent test points from the benchmark data source.

Validating the candidate model on external, independent data gives us a way to objectively measure data quality. If we instead validated the candidate model using held-out candidate data, we could verify whether models could be fit to predict the combined biological and observational processes generating the candidate data. This would let us measure how well the model represents the data but would not tell us anything about the value of the data themselves: the candidate data could be dominated by biases or noise that hides the underlying biological signal.
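The cross-data validation loop is simple to implement. The following minimal sketch (in Python with scikit-learn, whereas the study itself used bagged IND decision trees and the perf utility) illustrates the core idea; the array names and the make_model factory are placeholders, not part of the original analysis.

```python
# Minimal sketch of cross-data validation: fit one model per data source, then score
# both models (and a constant-frequency baseline) on the SAME benchmark test data.
import numpy as np
from sklearn.metrics import roc_auc_score

def cross_data_validate(make_model,
                        X_cand, y_cand,                # candidate data (e.g. eBird)
                        X_bench_train, y_bench_train,  # benchmark training data (e.g. BBS)
                        X_bench_test, y_bench_test):   # benchmark test data
    cand_model = make_model().fit(X_cand, y_cand)
    bench_model = make_model().fit(X_bench_train, y_bench_train)
    baseline = np.full(len(y_bench_test), y_bench_train.mean())  # always predict species frequency

    predictions = {
        "candidate": cand_model.predict_proba(X_bench_test)[:, 1],
        "benchmark": bench_model.predict_proba(X_bench_test)[:, 1],
        "baseline": baseline,
    }
    # Any suitable metric can be substituted here (see Model evaluation below).
    return {name: roc_auc_score(y_bench_test, p) for name, p in predictions.items()}
```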

As the models serve as data set proxies, they need to faithfully represent the information in the data. Parametric models can be used if the phenomenon being described is sufficiently well understood. However, for many species, the causal factors that relate to their occurrence are not well understood, and machine learning techniques that automatically construct accurate non-parametric models from data (Hochachka et al. 2007; Kelling et al. 2009) are more appropriate.

Two types of comparisons can be made between data sources. First, how does the information from candidate data compare with the information from benchmark data? To answer this, we summarize the candidate model's predictive power on benchmark test data using appropriate performance metrics (e.g. accuracy and mean-squared error). We compare the candidate model's performance against the performance of the benchmark model and a simple baseline. As the benchmark model is fit using training data drawn from the same distribution as the test data, it should demonstrate the best possible performance for the test data. Comparisons with the simple baseline indicate whether either model has learned anything meaningful. For this study the baseline was to always predict the frequency of the species in the benchmark data. For example, western meadowlarks (Sturnella neglecta) were recorded in 23·6% of BBS surveys; so, the baseline predicted 0·236 probability of occurrence for all test surveys.

Second, how many candidate data points are needed to achieve the same performance as the benchmark data? To estimate data efficiency, we vary the volume of training data used for constructing the models and observe how performance depends on training set size. By fitting the observed trends, one can estimate how many candidate data would be needed to equal the performance of the benchmark model. The data efficiency ratio is the ratio of candidate to benchmark data at the best benchmark performance level; the ratio quantifies the relative efficiency of the candidate data source (for example, see Fig. 4a). In our study we found that performance improved roughly linearly with the log of data size. Accordingly, we fit the log-linear model:

  y = a + b log(x)

where x is the sample size and y is the performance loss. If performance for large data volumes is close to perfect, a log-quadratic model may fit the trend better; intuitively, performance is more likely to approach perfect prediction asymptotically than to actually achieve it. Log-quadratic models were not needed to achieve good fits in our analyses.
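As an illustration of the calculation, the sketch below fits the log-linear trends for the two data sources, projects the candidate sample size needed to reach the benchmark model's best (full-data) performance, and reports the resulting ratio. The input arrays are hypothetical; loss here means a performance loss such as 1 − AUC.

```python
# Sketch of estimating a data efficiency ratio from performance-vs-size trends.
import numpy as np

def fit_loglinear(sizes, losses):
    """Fit loss = a + b*log(size); returns (a, b)."""
    b, a = np.polyfit(np.log(sizes), np.asarray(losses, dtype=float), 1)
    return a, b

def data_efficiency_ratio(cand_sizes, cand_losses, bench_sizes, bench_losses):
    a_c, b_c = fit_loglinear(cand_sizes, cand_losses)
    a_b, b_b = fit_loglinear(bench_sizes, bench_losses)
    n_bench = max(bench_sizes)
    best_bench_loss = a_b + b_b * np.log(n_bench)          # benchmark loss at its full data volume
    n_cand_needed = np.exp((best_bench_loss - a_c) / b_c)  # candidate size projected to match it
    # Slopes are negative (loss falls with more data); a more negative candidate slope
    # means the candidate model is improving faster, i.e. the trends are converging.
    trend = "converging" if b_c < b_b else ("parallel" if b_c == b_b else "diverging")
    return n_cand_needed / n_bench, trend

# Example from Fig. 4a: a ratio near 6.53 would mean ~39 180 eBird surveys are needed
# to match the performance of a model trained on ~6000 BBS surveys.
```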

Figure 4. eBird's efficiency at collecting useful information, relative to the North American Breeding Bird Survey (BBS), depends heavily on the focal species and performance measure. (a) Example of estimating a data efficiency ratio. roc loss for western meadowlark (Sturnella neglecta) models decreases as a function of training data size. Roughly 6·53 (39 180/6000) times as many eBird data as BBS data are needed to match the best BBS model performance, based on fitting log-linear models to the performance trends. (b) The acc trend lines for northern bobwhite (Colinus virginianus) are converging; i.e. the eBird model improves faster than the BBS model and should eventually catch up. (c) The rms trend lines for eastern kingbird (Tyrannus tyrannus) are diverging; i.e. the BBS model improves faster than the eBird model. The eBird model will never match BBS model performance unless one or both trends change direction in the future. (d) Stacked histograms of eBird:BBS data efficiency ratios. Each bar counts the number of species with a data efficiency ratio in that range. If the trend lines used for estimating a ratio diverged, the ratio was categorized as diverging; otherwise it was categorized as converging.

The interpretation of a data efficiency ratio depends on the relative improvement rates of the two models. If the observed trends are parallel, then the data efficiency ratio is a threshold: as long as the ratio of candidate to benchmark data collected reaches the threshold, a candidate model can perform as well as a benchmark model. If the trends converge (e.g. Fig. 4b), the candidate model is improving faster than the benchmark model and is projected to perform as well as the benchmark model in future (provided at least as many candidate as benchmark data are collected). If the trends diverge (e.g. Fig. 4c), the candidate model is falling behind and matching the benchmark's performance requires ever more data. The candidate model is unlikely to ever perform as well as the benchmark model unless one trend changes direction.

When available, experts can compare the models qualitatively and identify any differences that are not captured by the performance statistics. Experts possess a broader, more holistic view of species distribution that comes from synthesizing multiple sources, including their own field experience. Benchmark as well as candidate models can be verified by experts. We use distribution maps generated from each model to visualize and diagnose the information captured by complicated models.

Calibration

Consistent differences in the biases of the candidate and benchmark data sets can exaggerate performance differences. For example, the candidate model may consistently predict lower occurrence probabilities. Correcting such simple differences can significantly improve the candidate model's performance and give a clearer picture of how similar the two models are.

In preliminary analyses of data from five species, we found that eBird occurrence models were poorly calibrated when predicting BBS test data. To fix this problem, after fitting the eBird models we postcalibrated them using Platt's scaling (Platt 2000), which frequently produces good results (Niculescu-Mizil & Caruana 2005). This calibration involves fitting the sigmoid function:

  f(x) = 1/(1 + exp(A x + B))    (eqn 1)

where x is a model's raw prediction. The sigmoid function transforms a raw prediction (x) into a calibrated prediction (f(x)). A small amount of benchmark data (i.e. set-aside BBS data) was used to fit equation (1) for each model. Calibration noticeably improved the accuracy and mean-squared error of candidate models for the preliminary results. Much of the benefit comes from adjusting the present/absent threshold (for accuracy) and correcting the mean probability (for mean-squared error). No calibration was used for the benchmark models because preliminary results showed that calibration hurt benchmark performance.
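Equation 1 can be fit with any logistic-regression routine applied to the raw predictions on the set-aside benchmark data. A minimal sketch follows; it approximates Platt's procedure with an (effectively unregularized) scikit-learn logistic regression, and all variable names are illustrative.

```python
# Sketch of Platt-style postcalibration (eqn 1), fit on set-aside benchmark data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_calibration(raw_preds, y_holdout):
    """Return f(x) = 1 / (1 + exp(A*x + B)) with A, B fit to the held-out data."""
    raw_preds = np.asarray(raw_preds, dtype=float).reshape(-1, 1)
    lr = LogisticRegression(C=1e6)              # large C: effectively no regularization
    lr.fit(raw_preds, y_holdout)
    A, B = -lr.coef_[0, 0], -lr.intercept_[0]
    return lambda x: 1.0 / (1.0 + np.exp(A * np.asarray(x, dtype=float) + B))

# calibrate = fit_platt_calibration(ebird_preds_on_calibration_set, calibration_labels)
# calibrated_preds = calibrate(ebird_preds_on_test_set)   # hypothetical arrays
```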

Data collection and processing

Our case study measured the quality of eBird breeding season data, relative to benchmark data from the North American BBS. BBS data were chosen because they were collected using a stratified design with controls that minimize the impact of weather, time of day and observer-based variance. The BBS data are the highest quality large-scale data set we know of for North America, and have relatively uniform coverage (Fig. 1). To compare the data sets as fairly as possible, we selected subsets of each to make them as similar as possible (details below).

We treated observations in both data sets as transect surveys where the observers travel along a route and record which bird species are detected. We did this because latitude and longitude coordinates were only available for the first stop of each BBS route, and location coordinates were needed to associate predictor variables (e.g. habitat) with each survey. Our analyses used the subset of surveys collected in the contiguous USA because most predictors were not available elsewhere. In the rest of the paper, a survey is a sampling event that records an entire route's observations for a specific date and time. Data sets may contain multiple surveys of the same route (e.g. from different years).

We converted these data sets from count data to presence/absence data; i.e. non-zero counts became present, and all others became absent. As we used the predictive performance of the models as a proxy for the information in a data set, it was important that the learned models be as accurate as possible. Our experience is that modelling relative abundance is more challenging (and less well understood) than modelling occurrence. A poorly fit model can lead to the false conclusion that a data set contains little biological information; by studying occurrence modelling, we reduce this risk.

Occurrence models were fit for the 75 species most frequently recorded on BBS surveys (listed in Table S1 in the Supporting Information). We assumed that regularly occurring species would have enough presence observations to allow accurate modelling, and be easy to detect (implying that benchmark data measures biology mostly unobscured by detection processes).

eBird data

eBird (Sullivan et al. 2009) does not enforce a stratified sampling design, resulting in uneven spatial sampling (Fig. 1). Importantly, volunteers answer questions about how each survey was conducted:

  •  Where were the data collected?
  •  When were the data collected? (Date and start time.)
  •  How many people were in the birding party?
  •  What kind of survey was conducted? Surveys can be casual counts (made while doing something else), stationary point counts, transect counts or area counts (all birds within an area). For the last three types, volunteers answer additional questions about survey duration and the distance or area covered (as appropriate). This information about effort is important for explaining some of the variance in the observations.
  •  Is the survey complete? On a complete checklist, the submitter reports all species they were able to identify.

One source of noise for our analyses is the non-standardized reporting of location for transect surveys; observers are encouraged to report the middle of the route as the location, but many report the beginning or end. This introduces extra noise into the spatial covariates associated with surveys.

We used the eBird Reference Dataset (Munson et al. 2009) which excludes casual counts and incomplete surveys. By only using complete checklists we were able to infer which species were undetected (as opposed to certainly absent: ‘absence’ information is noisier than presence information).

To make the eBird and BBS data similar, we selected transect surveys covering at most 8 km, that were collected from May to July, during the years 2003–2008 (2003 was the first year eBird collected at least as many transect surveys as the BBS; Fig. 2). The number of observers was missing for 9% of surveys; we filled in these missing values with the mean number of observers (two).
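The selection steps above reduce to a few filters on the eBird Reference Dataset. The sketch below illustrates them in Python with pandas; the file and column names are placeholders rather than the dataset's actual field names.

```python
# Sketch of subsetting eBird data to approximate BBS characteristics.
import pandas as pd

ebird = pd.read_csv("ebird_reference_dataset.csv", parse_dates=["date"])  # hypothetical file

keep = (
    (ebird["protocol"] == "transect")              # transect surveys only
    & (ebird["distance_km"] <= 8)                  # covering at most 8 km
    & ebird["date"].dt.month.isin([5, 6, 7])       # May-July breeding season
    & ebird["date"].dt.year.between(2003, 2008)    # 2003-2008
)
ebird = ebird[keep].copy()

# Fill missing party sizes with the mean number of observers (about two in these data).
mean_party = ebird["num_observers"].mean()
ebird["num_observers"] = ebird["num_observers"].fillna(round(mean_party))
```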

Figure 2. eBird data are much more abundant than North American Breeding Bird Survey (BBS) data. While BBS data per year are constant, eBird data per year are growing. The multiple eBird lines show data volume after successive data selection steps: complete, non-casual checklists only; breeding season only; transect surveys only. In 2008, eBird collected 8·7 times as many breeding season transect surveys as BBS.

BBS data

The North American BBS (Robbins, Bystrak & Geissler 1986) collects data about birds throughout much of road-accessible North America during the breeding season. Volunteers conduct roadside surveys along predefined routes that are distributed to ensure good spatial coverage. Each route consists of 50 stops spaced 0·8 km apart; at each stop the observer counts the number of birds they detect within a 3-min period. Routes are surveyed once per year at the height of the breeding season, when birds’ plumage and singing maximize their detectability. Surveys start 30 min before sunrise, as birds tend to be most active around sunrise.

Our analyses used BBS surveys from 2003 to 2008 collected in the contiguous USA. Only surveys of acceptable quality were used. (Each BBS survey is annotated with a variable called RunType that indicates if the USGS considers the survey acceptable for analysis.) We aggregated the first 10 stops in each route into an 8-km transect count associated with the route's starting location. We omitted the later stops because: (1) the predictors may not describe stops farther from the starting location as accurately; and (2) later stops occur farther from sunrise, resulting in decreased detection probabilities for most bird species (Robbins 1981; Skirvin 1981; Rosenberg & Blancher 2005; Hochachka, Winter & Charif 2009). The eBird models required effort predictors to predict occurrence on BBS data. We set survey distances to 8 km and durations to 30 min, for all BBS surveys.
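The aggregation of BBS stops into route-level transects can likewise be expressed as a small grouping operation; the sketch below assumes a hypothetical stop-level table with one row per route, year, species and stop.

```python
# Sketch of building 8-km BBS transect surveys from the first 10 stops of each route.
import pandas as pd

bbs = pd.read_csv("bbs_stop_counts.csv")                        # hypothetical stop-level file
bbs = bbs[(bbs["run_type"] == 1) & (bbs["stop"] <= 10)]         # acceptable surveys, first 10 stops

transects = bbs.groupby(["route_id", "year", "species"], as_index=False)["count"].sum()
transects["present"] = transects["count"] > 0                   # counts -> presence/absence
transects["distance_km"] = 8.0                                  # effort covariates required by
transects["duration_hrs"] = 0.5                                 # the eBird models (8 km, 30 min)
```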

Predictor variables

Predictor variables for the occurrence models included covariates describing the local region around the survey route and how much effort the observer expended (Table 1). The time stamp and observer covariates came with the eBird and BBS count data; the other covariates came from external GIS databases, and were linked to the surveys using their location and time (Munson et al. 2009). We chose to use large extent habitat predictors to ensure that the extents included the full length of the survey transects. We encoded missing categorical predictor values as mv (to treat them as just another nominal value) and discarded the small number of surveys with missing continuous values to avoid decisions about imputing missing values (58 BBS surveys; 1856 eBird surveys).

Table 1. Summary of predictor variables used in models*

Checklist time stamp (3) – The year, day (1–365) and time the survey was started. Source: eBird and BBS.
Observer (3) – Duration of observation (in hours), distance travelled (km) and number of observers in the birding party. Source: eBird and BBS.
BCR (1) – Bird conservation region (numeric identifier). Source: shape files from ESRI in 2004. Resolution: n/a (BCRs are large and stretch across multiple states).
Climate (10)† – Averages over 30 years. Total precipitation for month. Average/min/max daily temperature for month. Mean, median and extreme date ranges for (a) last 32 °F day in spring and (b) first 32 °F day in autumn. Source: Climate Atlas of the USA, v2 (1961–1990), from NOAA–NCDC. Resolution: 4 km by 4 km. Details: http://www.ncdc.noaa.gov/oa/about/cdrom/climatls2/info/atlasad.html
Elevation (1)† – Elevation in metres. Source: National Elevation Dataset from USGS. Resolution: 30 m by 30 m. Details: http://www.usgsquads.com/elevationdata.htm#NED_Info
Human population (1) – Population per square mile (2000 US census). Source: shape files from ESRI. Resolution: n/a (census block groups are variable size). Details: http://www.census.gov/geo/www/tiger/glossary.html
Habitat (16)† – Per cent of the surrounding landscape in each of 16 land cover classes (e.g. open water, deciduous forest); the landscape is a 15 km × 15 km box around the location. Remote sensing data from the 2001 National Land Cover Database. Source: MRLC. Resolution: originally 30 m by 30 m, aggregated into 15 km by 15 km. Details: http://www.mrlc.gov/nlcd.php

*For processing details, see Munson et al. (2009).
†Predictors from raster data use different grids (i.e. predictor grids are not aligned).

We did not discretize the spatial extent into a fixed grid; instead, spatial predictor values were defined from the neighbourhood around each survey location. Consequently, two surveys can only have identical spatial predictor values if they are within 30 m of each other (the finest resolution of predictors used). The coarsest spatial predictor is 15 km by 15 km.

Occurrence models

We trained bagged decision tree models (Breiman 1996) to predict species occurrence. We chose bagged decision trees for three reasons. First, many of the 75 species are not sufficiently well understood to specify parametric models. Second, bagged decision trees are flexible and powerful enough to approximate any function (provided enough training data are available) (Breiman et al. 1984), giving us confidence that a bagged tree model would be able to detect and use any information signal in the data. Third, decision trees, bagged trees and boosted trees have all been successfully used for species distribution modelling (De'ath & Fabricius 2000; Caruana et al. 2006; Elith et al. 2006; Hochachka et al. 2007). We used the IND decision tree package (Buntine & Caruana 1991) to train 100 MML style decision trees for every occurrence model. (IND's MML trees are a kind of Bayesian decision tree; for full details, see Buntine 1992.) Given the covariate description of a survey, each model predicts the probability of occurrence for a particular species. Recall that covariates describe the neighbourhood centred around a survey location; hence, predictions are made separately for each survey location, and not, as is sometimes done in species distribution modelling, for each cell in a fixed grid.
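The occurrence model itself is a standard bagged ensemble. The sketch below substitutes 100 bagged CART trees from scikit-learn for the 100 IND MML trees used in the study; the training arrays are placeholders.

```python
# Sketch of a bagged-decision-tree occurrence model.
from sklearn.ensemble import BaggingClassifier

def make_occurrence_model(n_trees=100):
    # BaggingClassifier's default base estimator is a fully grown decision tree,
    # so this bags 100 trees fit to bootstrap samples of the training surveys.
    return BaggingClassifier(n_estimators=n_trees, n_jobs=-1)

# model = make_occurrence_model().fit(X_train, y_present)   # hypothetical covariates/labels
# p_occurrence = model.predict_proba(X_test)[:, 1]          # probability the species occurs
```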

Model evaluation

We evaluated all occurrence models by measuring their predictive power on independent data. The data were divided into train and test sets (for model fitting and validation respectively) according to a checkerboard grid with 150-km-sided squares. For example, surveys located in white squares were training data, while surveys in black squares were test data. The same grid was used for dividing the BBS and eBird data to avoid any chance of spatial overlap between train and test sets. (The occurrence of a species is spatially correlated, and evaluating with test data that are spatially close to training data yields overly optimistic estimates of model performance. Checkerboard partitioning with large squares greatly reduces the chances of training and testing data being close enough to be spatially correlated.)
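A checkerboard assignment only requires projected (metric) coordinates for each survey. The sketch below shows one way to compute it; the coordinate arrays and the offset used to shift the board's corner are illustrative.

```python
# Sketch of a 150-km checkerboard train/test split.
import numpy as np

def checkerboard_train_mask(x_m, y_m, square_km=150.0, offset_m=(0.0, 0.0), train_colour=0):
    """True for surveys falling in 'training' squares of the checkerboard."""
    side = square_km * 1000.0
    col = np.floor((np.asarray(x_m, dtype=float) - offset_m[0]) / side).astype(int)
    row = np.floor((np.asarray(y_m, dtype=float) - offset_m[1]) / side).astype(int)
    return (col + row) % 2 == train_colour

# train_mask = checkerboard_train_mask(surveys_x_m, surveys_y_m)   # hypothetical projected coords
# test_mask = ~train_mask
# Shifting offset_m, or swapping train_colour, produces the repeated splits described below.
```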

After partitioning, there were roughly 21 175 surveys in the eBird training set, and 6460 in both the BBS training and test sets. For calibrating model predictions, 300 surveys were subsampled from the BBS training sets and set aside (≈5% of training data). For the analysis of performance as a function of data size, data were subsampled from the remaining training data. All subsampling was by route; i.e. when one survey from a route was added to a subsample, all other surveys from that route were also added. For example, 4·4 surveys were collected from each BBS route (on average); so, the calibration set contained data from roughly 68 routes. Sampling by route ensured independent locations for training and calibration sets and simulated the benefit from monitoring more locations.
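Subsampling by route, as described above, can be sketched as follows (the data frame and route identifiers are hypothetical): whole routes are drawn at random until the target number of surveys is reached.

```python
# Sketch of route-level subsampling: all surveys from a chosen route enter together.
import numpy as np

def subsample_by_route(surveys, target_n_surveys, seed=0):
    rng = np.random.default_rng(seed)
    routes = surveys["route_id"].unique()
    rng.shuffle(routes)
    chosen, total = [], 0
    for route in routes:
        chosen.append(route)
        total += int((surveys["route_id"] == route).sum())
        if total >= target_n_surveys:
            break
    return surveys[surveys["route_id"].isin(chosen)]

# calibration_set = subsample_by_route(bbs_training_surveys, 300)   # ~68 routes on average
```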

Predictive power was measured using accuracy (acc), root mean-squared error (rms), and the area under the ROC curve (roc) (Fielding & Bell 1997). These performance metrics measure diverse aspects of model performance and together give a more complete picture of model performance than a single metric (Caruana & Niculescu-Mizil 2004). acc is the percentage of times that a model correctly predicts whether the species was present/absent. rms measures the average error of a model's predictions and summarizes how well calibrated a model is. roc measures how well a model ranks sites from high to low occurrence. acc and roc are commonly used to evaluate occurrence models (Hegel et al. 2010, pp. 299–301), and rms is a standard metric for regression models. All performance measures were computed using perf (http://www.cs.cornell.edu/~caruana/perf/); acc was computed with a default threshold of 0·5.
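For completeness, the three measures can be reproduced with standard library calls; the sketch below uses scikit-learn and NumPy in place of the perf utility used in the study.

```python
# Sketch of the three performance measures for probabilistic presence/absence predictions.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, p_pred, threshold=0.5):
    y_true = np.asarray(y_true).astype(int)
    p_pred = np.asarray(p_pred, dtype=float)
    return {
        "acc": accuracy_score(y_true, (p_pred >= threshold).astype(int)),  # default 0.5 threshold
        "rms": float(np.sqrt(np.mean((y_true - p_pred) ** 2))),            # root mean-squared error
        "roc": roc_auc_score(y_true, p_pred),                              # area under the ROC curve
    }
```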

In addition, we generated occurrence maps from each model to visually compare the eBird and BBS models. Each map shows the predictions of a model made for 130 000 random locations selected using a spatially stratified design. The model made predictions based on the covariate description for each location, with fixed values for the effort covariates (one observer surveying an 8-km transect over a 30-min period). Multiple predictions were made for each location, varying as a function of date (7, 14, 21, 28 June, for all years 2003–2008), and averaged to create a map representing the breeding distribution. Each map pixel is approximately 15 km by 15 km. For pixels that contain multiple random survey locations, the colour is determined by the average of the predictions from those locations; the few pixels containing zero predictions are white.
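The map-generation step amounts to predicting at each random location with fixed effort covariates, averaging over the four June dates in every year, and then averaging within pixels. A sketch follows; the site table, its pixel indices and the feature columns are hypothetical.

```python
# Sketch of producing a breeding-distribution surface from an occurrence model.
import numpy as np
import pandas as pd

def occurrence_surface(model, sites, feature_cols):
    preds = []
    for year in range(2003, 2009):                  # all years 2003-2008
        for day in (7, 14, 21, 28):                 # four June dates
            X = sites.copy()
            X["year"] = year
            X["day_of_year"] = pd.Timestamp(year, 6, day).dayofyear
            X["num_observers"] = 1                  # fixed effort: one observer,
            X["distance_km"] = 8.0                  # an 8-km transect,
            X["duration_hrs"] = 0.5                 # surveyed over 30 minutes
            preds.append(model.predict_proba(X[feature_cols])[:, 1])
    out = sites.copy()
    out["p_occurrence"] = np.mean(preds, axis=0)    # average over dates and years
    # Each ~15-km map pixel takes the mean prediction of the sample sites it contains.
    return out.groupby(["pixel_row", "pixel_col"])["p_occurrence"].mean()
```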

Each analysis was repeated 20 times using different train/test splits. Ten different checkerboard grids were generated by randomly shifting the corner of the board. Two runs were made from each board: one run using white squares for training data and one run using black squares for training data. All reported numbers and maps are averages over the 20 repetitions. Error bars in graphs show 1 SD.

Results

Quality of eBird breeding season data

To visualize the results for all 75 species, we plotted BBS vs. eBird performance. For most species, the eBird models were nearly as accurate as the BBS models (Fig. 3, middle column).

Figure 3. Performance of eBird models vs. North American Breeding Bird Survey (BBS) models. Scatter plots compare BBS performance with uncalibrated eBird performance (left column) and calibrated eBird performance (middle column), with one point per species. Points close to the diagonal line indicate similar BBS and eBird performance. The right column graphs show the cumulative distribution of performance gaps – the difference between BBS and eBird (or baseline) performance. The percentage of species with performance gaps less than various thresholds (x-axis) is shown for calibrated (eBird (20K, cal.)) and uncalibrated eBird models (eBird (20K, uncal.)) and for the baseline model. For example, calibrated eBird models had accuracy (acc) at most 0·05 (5%) below the respective BBS models for 90% of species. The calibrated and uncalibrated lines are the same in the top right graph because Platt scaling does not change rankings.

Calibration significantly improved acc and rms for most species (Fig. 3, left vs. middle), although it did slightly hurt model performance for 15 species (see Table S1). Calibration-induced errors fell into two categories: (1) occurrence increased from 0% to 5–15% outside the species’ range, as a result of raising occurrence within the range to match higher BBS frequencies and (2) occurrence decreased to nearly zero in the edges of the species’ range, as a result of correcting occurrence probabilities in the core of the range. A more complicated calibration method might be able to overcome these flaws.

Most species were observed in the minority of surveys; accordingly, the baseline sometimes achieved good acc by always predicting not present and good rms by always predicting low probability of occurrence. To rule out the possibility that eBird appeared similar to BBS because the BBS models were close to the baseline, we compared the performance gaps of the eBird and baseline models. A model's (or baseline's) performance gap is the difference between its performance and the benchmark (BBS) model's performance. For all metrics, eBird models had smaller gaps for more species than the baseline (Fig. 3, right column). Even for acc, the metric for which the baseline was most competitive, eBird was better. Interestingly, without adjusting the threshold for accuracy (via calibration), eBird acc was seemingly no better than baseline acc for performance gaps of 0·05 or less; with the correct threshold, the models were clearly much better than the baseline.

eBird data efficiency

Data efficiency ratios varied greatly across both species and performance metrics (Fig. 4; Table S1). Overall, the eBird data were noisier than the BBS data due to non-uniform spatial sampling, lower detection probabilities or varying survey lengths and durations. In general, ratios for roc performance were the lowest, followed by acc and then rms. Similarly, the performance trend lines for roc loss were parallel or converging for two-thirds of the species studied – far more often than for acc and rms (Fig. 4d). Some diverging data efficiency ratios were very large or even infinite (Table S1). Infinite ratios occurred when the eBird model's performance worsened with increasing amounts of training data.

Expert opinion of maps

Overall, the BBS maps were slightly better than the eBird maps (judged by authors WMH, MI, BLS and CW). For some species, the BBS and eBird maps were both very good and differed only in small details (e.g. Fig. 5, top row). In other cases, both maps captured the species’ range reasonably accurately but differed in the predicted occurrence due to differences in the data sources (e.g. Fig. 5, middle row). For some species, one or both of the maps contained major mistakes (e.g. Fig. 5, bottom row). The rest of this section describes the examples in Fig. 5 in more detail.

Figure 5. Representative examples of model-generated occurrence maps. Top row: western meadowlark (Sturnella neglecta) range boundaries and areas of concentration in the North American Breeding Bird Survey (BBS) and eBird maps are extremely similar and accurate. Middle row: BBS and eBird maps of eastern kingbird (Tyrannus tyrannus) distribution differ mainly in their predicted occurrence rates, although the large-scale patterns are quite accurate for both. Bottom row: BBS and eBird maps of northern bobwhite (Colinus virginianus) correctly show the areas of high occurrence (Kansas, Oklahoma and south-eastern coastal plain), but both maps contain major mistakes elsewhere. For details, see the Expert opinion of maps section.

Western meadowlarks (top row; S. neglecta) are widespread but declining grassland birds. Both maps correctly show lower frequency in high mountains (e.g. Rockies, Sierra Nevadas), but the eBird map reflects finer scale habitat distinctions in California, south-east Arizona and western Colorado where meadowlarks occur only in the few patches of grassland and appropriate agriculture. eBird has better sampling coverage than BBS in California and south-east Arizona (Fig. 1), providing finer scaled occurrence data. In western Colorado, the BBS appears to have better coverage, but most routes follow open country roads in valleys and not winding mountain roads. Thus, mountainous habitat is under-sampled, biasing the data in the region.

Eastern kingbird (middle row; Tyrannus tyrannus) is a common bird of open country, roadsides and pond edges with stable populations. It is not surprising that the BBS model predicts higher occurrence as it is exclusively a roadside survey. The maps agree on areas of concentration (central Great Plains and coastal Southeast), but the BBS map is more accurate for the Dakotas where the BBS coverage is more complete and uniform than for eBird (Fig. 1). Both maps overstate the lack of kingbirds in the Adirondacks of New York and north-eastern Minnesota. While kingbirds do avoid heavily forested regions, kingbirds breed in these areas along lake edges and beaver ponds and in open area patches. The covariates used for modelling were probably measured at too coarse a resolution to detect these habitat distinctions.

Northern bobwhite (bottom row; Colinus virginianus) is an eastern quail that has disappeared from the northern portions of its former range and whose populations continue to drastically decline throughout their range. Bobwhites require sparsely populated farms and grasslands, and do not fare well in areas with people, dogs, and suburban development. eBird and BBS, respectively, under-sample and over-sample bobwhite habitat, causing biases in both maps. eBird data are concentrated around cities and towns (Fig. 1) where bobwhites are absent, and the eBird model over-generalizes this pattern to large regions of near-zero occurrence. BBS routes favour rural countryside, and the BBS model overstates bobwhite occurrence near populated areas. The true distribution of bobwhites lies somewhere between these maps, with high occurrence in wild and agricultural areas (as in the BBS map) and wide buffers of low occurrence around cities and suburban areas (as in the eBird map).

Discussion

In this paper, we illustrate how the relative information content of monitoring programs can be quantified for two different comparison goals. First, we show how the content of two data sets can be quantified for existing data. Second, we develop a more prospective comparison that asks how many data from one monitoring protocol would need to be collected to equal the information content of the reference protocol. Both methods are based on the assumption that a model linking predictor values to the response (e.g. presence and abundance) is a good summary of the information content of the data used to create the model. We feel that this is intuitively appropriate, and our results – that data from the more structured BBS protocol have a higher per-datum information content – are consistent with our intuition. While we have used species distribution models in our comparisons, other forms of models (e.g. abundance distributions and population trends) could be substituted as appropriate. The most novel aspect of our work is the use of a cross-validation that validates a model using data from a different source than the training data (vs. traditional validation using independent data from the same source). Our approach allows us to isolate the information content of the data from possible overfitting of biases inherent in the data-gathering process. The approach can be applied to compare the suitability of data sources for any modelling task.

Unresolved issues do exist with this approach because each collection protocol has its own biases, and these biases are also reflected in the model built from the data. We made basic attempts to account for the differing biases of the two protocols we considered by using a simple calibration. As our results show, more sophisticated approaches to addressing this issue are needed, as the calibration was not always effective. The failure patterns of simple calibration suggest that biases for a species vary by region as a function of the probability of occurrence; we suspect that the same issue would be encountered when modelling relative abundance instead of occurrence probabilities.

Another message from our results is that the species and performance metric will alter the quantitative differences between data from different protocols. In our comparison, we see three possible explanations for why eBird compares most favourably with BBS for roc, second most favourably for acc and worst for rms. First, ranking surveys from most likely to see a particular species X to least likely is easier than predicting whether each survey actually did record X (roc vs. acc). Similarly, predicting if a survey recorded X is easier than guessing the exact probability of recording X (acc vs. rms). Second, roc is unaffected by shifts in or (strictly monotone) scaling of detection probabilities. Third, tuning the threshold for acc is easier than calibrating rms; so, there is less chance to introduce errors. Regardless of the actual reason(s), analysts will need to decide whether it is sufficient that data contain accurate information on ranking of occurrence rates (i.e. measure performance with roc), or whether absolute errors (measured by acc or rms) are important.

It is important to remember that the benchmark model represents both the biological signal and biases of the benchmark data. Differences between the biases of the data sources can easily prevent the candidate model from equalling the benchmark model's accuracy on independent benchmark data, regardless of how many data are available. We expect candidate model performance to often asymptote to a slightly worse level than benchmark performance, even when the performance trends are converging. As the test data are biased, small discrepancies between candidate and benchmark models do not automatically imply that the candidate data contain less information; rather, the data sources are simply different. Either source could be slightly better, or they could be complementary. Determining which scenario is true is beyond the abilities of cross-data validation and remains a task for experts.

Regarding our specific comparison, we found that eBird- and BBS-based models had similar predictive power, with eBird models being slightly less accurate than BBS models. The converging performance trends for two-thirds of the species for roc suggest that the discrepancies between eBird and BBS – at least for ranking sites by species suitability – will shrink as the volume of eBird data outstrips the volume of BBS data. By combining data efficiency ratios (Table S1) and the trends for eBird growth (Fig. 2), one can infer when enough eBird data will be collected to rival the information in BBS data for describing distributions. For example, the acc data efficiency ratio for northern bobwhite is 16·8 with a converging trend. To collect as much information about bobwhite presence/absence as the BBS (annually), eBird needs to collect about 36 000 transect surveys. eBird's data volume increased 52% annually since 2003; at this pace, eBird will collect 41 000 breeding season transect surveys in 2010 – enough to equal the information in the 2010 BBS surveys. In some cases, eBird data already describe species’ distributions more accurately – as we found for western meadowlark, based on expert opinion.
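A rough arithmetic check of this projection, using only the figures quoted above and in the Fig. 2 caption, is shown below; the numbers are approximate because exact yearly survey counts are not reproduced here.

```python
# Back-of-envelope check of the northern bobwhite projection.
ratio_acc = 16.8                         # eBird:BBS data efficiency ratio (acc) for bobwhite
ebird_needed = 36_000                    # eBird surveys needed to match one year of BBS information
bbs_per_year = ebird_needed / ratio_acc  # implies roughly 2100 BBS transect surveys per year
ebird_2008 = 8.7 * bbs_per_year          # Fig. 2: eBird collected 8.7x as many surveys as BBS in 2008
ebird_2010 = ebird_2008 * 1.52 ** 2      # two further years of 52% annual growth
print(round(bbs_per_year), round(ebird_2008), round(ebird_2010))
# roughly 2100, 18 600 and 43 000 surveys: in the same ballpark as the 41 000 projected above
```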

Conversely, there are species for which eBird and BBS contain drastically different biases and sources of variance as evidenced by a few infinite data efficiency ratios in Table S1. For example, nocturnal birds like chuck-will's-widow (Caprimulgus carolinensis) are rarely counted in eBird (because most surveys start after dawn), yet are counted in BBS (because all BBS surveys start a half-hour before dawn). eBird data will never be comparable with BBS data in these cases (barring the development of methods to account for protocol biases), and experts should decide which data source is most appropriate to the goals of an analysis.

Although our analyses considered breeding season distributions, there is no reason to think that eBird data collected during other seasons differ significantly in quality.

In conclusion, we believe that the methods described in this paper can provide the basis for making decisions on the appropriate choice of data to use, if a single source of monitoring data needs to be chosen for analysis. Alternatively, demonstration of reasonable information content in multiple data sources could open the door for using one data set to construct a prior for Bayesian distribution models (Thogmartin & Knutson 2007) that are then fit using the second data set (e.g. using eBird data to create informative priors for analysis of BBS data). Given the apparent consistency in informativeness of the data from the low-structure eBird protocol, the uses of such birder checklist data need further exploration.

Acknowledgements

This study was supported by the Leon Levy Foundation, the Wolf Creek Foundation and the National Science Foundation (grants ITR-0427914, DBI-0542868, DUE-0734857, IIS-0612031, IIS-074826 and IIS-0832782). In addition, the authors thank the anonymous reviewers, T. Damoulas and R. Hutchinson for helpful comments on early versions of this work.

References

  • Baillie, S.R., Balmer, D.E., Downie, I.S. & Wright, K.H.M. (2006) Migration watch: an Internet survey to monitor spring migration in Britain and Ireland. Journal of Ornithology, 147, 254–259.
  • Bart, J., Hofschen, M. & Peterjohn, B.G. (1995) Reliability of the breeding bird survey: effects of restricting surveys to roads. The Auk, 112, 758–761.
  • Breiman, L. (1996) Bagging predictors. Machine Learning, 24, 123–140.
  • Breiman, L., Friedman, J., Stone, C.J. & Olshen, R.A. (1984) Classification and Regression Trees. Chapman & Hall, London.
  • Buntine, W. (1992) Learning classification trees. Statistics and Computing, 2, 63–73.
  • Buntine, W. & Caruana, R. (1991) Introduction to IND and Recursive Partitioning. Tech. Rep. FIA-91-28, NASA Ames Research Center.
  • Caruana, R. & Niculescu-Mizil, A. (2004) Data mining in metric space: an empirical analysis of supervised learning performance criteria. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds W. Kim, R. Kohavi, J. Gehrke & W. DuMouchel), pp. 69–78. ACM, New York.
  • Caruana, R., Elhawary, M., Fink, D., Hochachka, W.M., Kelling, S., Munson, A., Riedewald, M. & Sorokina, D. (2006) Mining citizen science data to predict prevalence of wild bird species. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds T. Eliassi-Rad, L. Ungar, M. Craven & D. Gunopulos), pp. 909–915. ACM, New York.
  • Conway, C.J. & Timmermans, S.T.A. (2005) Progress toward developing field protocols for a North American marshbird monitoring program. Bird Conservation Implementation and Integration in the Americas: Proceedings of the Third International Partners in Flight Conference (eds C.J. Ralph & T.D. Rich), General Technical Report PSW-GTR-191, pp. 997–1005. US Department of Agriculture Forest Service, Albany, California, USA.
  • De'ath, G. & Fabricius, K.E. (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81, 3178–3192.
  • Elith, J., Graham, C.H., Anderson, R.P., Dudik, M., Ferrier, S., Guisan, A., Hijmans, R.J., Huettmann, F., Leathwick, J.R., Lehmann, A., Li, J., Lohmann, L.G., Loiselle, B.A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J.M.M., Peterson, A.T., Phillips, S.J., Richardson, K., Scachetti-Pereira, R., Schapire, R.E., Soberón, J., Williams, S., Wisz, M.S. & Zimmermann, N.E. (2006) Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29, 129–151.
  • Fielding, A.H. & Bell, J.F. (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation, 24, 38–49.
  • Fitzpatrick, M.C., Preisser, E.L., Ellison, A.M. & Elkinton, J.S. (2009) Observer bias and the detection of low-density populations. Ecological Applications, 19, 1673–1679.
  • Hanowski, J.M. & Niemi, G.J. (1995) A comparison of on- and off-road bird counts: do you need to go off road to count birds accurately? Journal of Field Ornithology, 66, 469–483.
  • Harrison, J.A., Underhill, L.G. & Barnard, P. (2008) The seminal legacy of the Southern African Bird Atlas Project. South African Journal of Science, 104, 82–84.
  • Hegel, T.M., Cushman, S.A., Evans, J. & Huettmann, F. (2010) Current state of the art for statistical modelling of species distributions. Spatial Complexity, Informatics, and Wildlife Conservation (eds S.A. Cushman & F. Huettmann), Chapter 16, pp. 273–311. Springer, Japan.
  • Hickling, R., Roy, D.B., Hill, J.K. & Thomas, C.D. (2005) A northward shift of range margins in British Odonata. Global Change Biology, 11, 502–506.
  • Hochachka, W.M. & Dhondt, A.A. (2000) Density-dependent decline of host abundance resulting from a new infectious disease. Proceedings of the National Academy of Sciences of the USA, 97, 5303–5306.
  • Hochachka, W.M., Caruana, R., Fink, D., Munson, A., Riedewald, M., Sorokina, D. & Kelling, S. (2007) Data-mining discovery of pattern and process in ecological systems. Journal of Wildlife Management, 71, 2427–2437.
  • Hochachka, W.M., Winter, M. & Charif, R.A. (2009) Sources of variation in singing probability of Florida grasshopper sparrows, and implications for design and analysis of auditory surveys. The Condor, 111, 349–360.
  • Jiguet, F., Julliard, R., Thomas, C.D., Dehorter, O., Newson, S.E. & Couvet, D. (2006) Thermal range predicts bird population resilience to extreme high temperatures. Ecology Letters, 9, 1321–1330.
  • Kelling, S., Hochachka, W.M., Fink, D., Riedewald, M., Caruana, R., Ballard, G. & Hooker, G. (2009) Data intensive science: a new paradigm for biodiversity studies. BioScience, 59, 613–620.
  • Kéry, M., Royle, J.A., Schmid, H., Schaub, M., Volet, B., Häfliger, G. & Zbinden, N. (2010) Site-occupancy distribution modeling to correct population-trend estimates derived from opportunistic observations. Conservation Biology, in press. DOI: 10.1111/j.1523-1739.2010.01479.x.
  • Koenig, W.D. (2001) Spatial autocorrelation and local disappearances in wintering North American birds. Ecology, 82, 2636–2644.
  • MacKenzie, D.I. (2006) Modeling the probability of resource use: the effect of, and dealing with, detecting a species imperfectly. Journal of Wildlife Management, 70, 367–374.
  • Munson, M.A., Webb, K., Sheldon, D., Fink, D., Hochachka, W.M., Iliff, M., Riedewald, M., Sorokina, D., Sullivan, B., Wood, C. & Kelling, S. (2009) The eBird Reference Dataset, Version 1.0. Cornell Lab of Ornithology and National Audubon Society, Ithaca, New York. URL: http://www.avianknowledge.net/content/features/archive/eBird_Ref.
  • Nichols, J.D. & Williams, B.K. (2006) Monitoring for conservation. Trends in Ecology & Evolution, 21, 668–673.
  • Niculescu-Mizil, A. & Caruana, R. (2005) Predicting good probabilities with supervised learning. Proceedings of the 22nd International Conference on Machine Learning (eds L. De Raedt & S. Wrobel), ACM International Conference Proceeding Series, Vol. 119, pp. 625–632. ACM Press, New York.
  • Platt, J.C. (2000) Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (eds A.J. Smola, P.J. Bartlett, B. Schölkopf & D. Schuurmans), pp. 61–74. MIT Press, Cambridge, MA.
  • Robbins, C.S. (1981) Bird activity levels related to weather. Estimating Numbers of Terrestrial Birds (eds C.J. Ralph & J.M. Scott), Studies in Avian Biology, Vol. 6, pp. 301–310. Cooper Ornithological Society, Waco, TX, USA.
  • Robbins, C.S., Bystrak, D. & Geissler, P.H. (1986) The Breeding Bird Survey: Its First Fifteen Years, 1965–1979. Resource Publication 157. US Fish and Wildlife Service, Waco, TX, USA.
  • Rosenberg, K.V. & Blancher, P.J. (2005) Setting numerical population objectives for priority landbird species. Bird Conservation Implementation and Integration in the Americas: Proceedings of the Third International Partners in Flight Conference (eds C.J. Ralph & T.D. Rich), General Technical Report PSW-GTR-191, pp. 57–67. US Department of Agriculture Forest Service, Albany, California, USA.
  • Sauer, J.R. (2000) Combining information from monitoring programs: complications associated with indices and geographic scale. Strategies for Bird Conservation: The Partners in Flight Planning Process. Proceedings of the Third Partners in Flight Workshop (eds R. Bonney, D.N. Pashley, R.J. Cooper & L. Niles), RMRS-P-16, pp. 124–126. US Forest Service, Rocky Mountain Research Station, Ogden, Utah, USA.
  • Schmid, H., Burkhardt, M., Keller, V., Knaus, P., Volet, B. & Zbinden, N. (2001) Die Entwicklung der Vogelwelt in der Schweiz. Avifauna Report Sempach 1, Annex. Swiss Ornithological Institute, Sempach, Switzerland.
  • Skirvin, A.A. (1981) Effect of time of day and time of season on the number of observations and density estimates of breeding birds. Estimating Numbers of Terrestrial Birds (eds C.J. Ralph & J.M. Scott), Studies in Avian Biology, Vol. 6, pp. 271–274. Cooper Ornithological Society, Waco, TX, USA.
  • Sullivan, B.L., Wood, C.L., Iliff, M.J., Bonney, R.E., Fink, D. & Kelling, S. (2009) eBird: a citizen-based bird observation network in the biological sciences. Biological Conservation, 142, 2282–2292.
  • Thogmartin, W.E. & Knutson, M.G. (2007) Scaling local species-habitat relations to the larger landscape with a hierarchical spatial count model. Landscape Ecology, 22, 61–75.
  • Thomas, C.D. & Lennon, J.J. (1999) Birds extend their ranges northwards. Nature, 399, 213.
  • US NABCI Monitoring Subcommittee (2007) Opportunities for Improving Avian Monitoring. Technical report, US North American Bird Conservation Initiative Committee. Available from the Division of Migratory Bird Management, US Fish and Wildlife Service, Arlington, Virginia, USA. URL: http://www.nabci-us.org/.

Supporting Information

Table S1. Data efficiency ratios for eBird compared with North American Breeding Bird Survey (BBS).
