Filling in the GAPS: evaluating completeness and coverage of open‐access biodiversity databases in the United States

Abstract Primary biodiversity data constitute observations of particular species at given points in time and space. Open‐access electronic databases provide unprecedented access to these data, but their usefulness in characterizing species distributions and patterns in biodiversity depends on how complete species inventories are at a given survey location and how uniformly distributed survey locations are along dimensions of time, space, and environment. Our aim was to compare completeness and coverage among three open‐access databases representing ten taxonomic groups (amphibians, birds, freshwater bivalves, crayfish, freshwater fish, fungi, insects, mammals, plants, and reptiles) in the contiguous United States. We compiled occurrence records from the Global Biodiversity Information Facility (GBIF), the North American Breeding Bird Survey (BBS), and federally administered fish surveys (FFS). We aggregated occurrence records by 0.1° × 0.1° grid cells and computed three completeness metrics to classify each grid cell as well‐surveyed or not. Next, we compared frequency distributions of surveyed grid cells to background environmental conditions in a GIS and performed Kolmogorov–Smirnov tests to quantify coverage through time, along two spatial gradients, and along eight environmental gradients. The three databases contributed >13.6 million reliable occurrence records distributed among >190,000 grid cells. The percent of well‐surveyed grid cells was substantially lower for GBIF (5.2%) than for systematic surveys (BBS and FFS; 82.5%). Still, the large number of GBIF occurrence records produced at least 250 well‐surveyed grid cells for six of nine taxonomic groups. Coverages of systematic surveys were less biased across spatial and environmental dimensions but were more biased in temporal coverage compared to GBIF data. GBIF coverages also varied among taxonomic groups, consistent with commonly recognized geographic, environmental, and institutional sampling biases.
This comprehensive assessment of biodiversity data across the contiguous United States provides a prioritization scheme to fill in the gaps by contributing existing occurrence records to the public domain and planning future surveys.


Introduction
There is increasing recognition that ecological and evolutionary processes operate in response to natural and anthropogenic factors that are apparent at regional, continental, and even global scales. Research in the fields of macroecology and landscape ecology has demonstrated that broad-scale environmental variation and spatial processes play important roles in generating and maintaining biodiversity (Brown 1995; Turner et al. 2001). Multiple contemporary threats to biodiversity are apparent at similarly broad spatial scales, including habitat loss and fragmentation stemming from the alteration of natural landscapes, climate change, and intercontinental faunal and floral exchanges (Rahel 2000; Bates et al. 2008; Newbold et al. 2015). In freshwaters, additional broad-scale threats to biodiversity include eutrophication and hydrologic alteration (Bennett et al. 2001; Poff et al. 2007; Esselman et al. 2011).
Primary biodiversity data (observations of particular species at given points in time and space) are essential to understanding how these broad-scale processes affect the distribution of species and biodiversity across the globe (Soberón and Peterson 2004; Peterson et al. 2010). A major challenge that remains, however, is inadequate primary biodiversity data for many regions and taxonomic groups throughout the world. Recent efforts to overcome this Wallacean shortfall have sought to compile species occurrence records using open-access database platforms (Lomolino and Lawrence 2004; Whittaker et al. 2005). For example, the Global Biodiversity Information Facility (GBIF) currently provides online open access to over 521 million occurrence records representing more than 1.4 million species (Edwards et al. 2000; Yesson et al. 2006). These databases have allowed investigators to test ecological and evolutionary hypotheses that explain the natural generation and maintenance of biodiversity as well as document contemporary and human-induced changes in biodiversity (e.g., Rahel 2000; Kozak and Wiens 2006; Mitchell and Knouft 2009). Moreover, recent developments in GIS software, broad-scale environmental data layers (e.g., Worldclim; Hijmans et al. 2005), and refinement of statistical techniques (e.g., MaxEnt; Phillips et al. 2006) have led to the extensive use of these open-access databases in species distribution modeling (SDM; Guisan and Thuiller 2005; Broennimann et al. 2007; Domínguez-Domínguez et al. 2006) and biodiversity mapping (Sousa-Baena et al. 2013; García-Roselló et al. 2015).
Broad-scale databases describing species occurrences, such as GBIF, are frequently composed of many smaller (i.e., narrower spatial extent or fewer records) surveys of multispecies assemblages or single-species occurrence records (e.g., georeferenced museum vouchers) collected for many different purposes and by many different scientists, natural resource managers, and even recreational naturalists. This data compilation scheme frequently results in incomplete inventories of the species occupying a survey location and inadequate survey coverage along three important ecological dimensions: time, space, and environment (Ladle and Hortal 2013). Survey completeness is important for biodiversity studies that seek to statistically model and map patterns in species richness (Lobo 2008; Chao and Jost 2012). Indeed, survey completeness is an overriding factor affecting observed richness for a given survey location (Hortal et al. 2007; Soberón et al. 2007). A variety of analytic approaches have been developed to quantify the completeness of biodiversity surveys (reviewed by Colwell et al. 1994). Many of these approaches use parametric or nonparametric algorithms to estimate "expected" (i.e., actual) species richness based on the frequency of individual species occurrences within a survey location. The proportion of observed richness versus expected richness is then computed and used as a metric of survey completeness (Hortal et al. 2006; Soberón et al. 2007). An alternative approach characterizes the final (i.e., right side) slope of the species accumulation curve for a given survey location. Slopes near zero suggest that richness has reached an asymptote with the currently available number of occurrence records and are indicative of high completeness (Yang et al. 2013).
Regarding survey coverage, different ecological and evolutionary questions require consistent data coverage along one or more dimensions of time, space, and environment (Rahel 2000; Broennimann et al. 2007; Pearman et al. 2008). Uneven representation of key environmental gradients by occurrence records can strongly influence the accuracy of SDMs and the perceived importance of environmental predictor variables used to build those SDMs (Kadmon et al. 2004; Loiselle et al. 2008; Tessarolo et al. 2014). Similarly, uneven representation of spatial and environmental gradients also affects the performance of modeling efforts aimed at predicting and mapping patterns in biodiversity (e.g., species richness) across unsurveyed regions (Dennis and Thomas 2000; Ladle and Hortal 2013). Discrepancies in environmental data coverage between two regions (i.e., incomplete space-by-environment data coverage) can influence model transferability, which can weaken inferences made about geographic range limits and niche shifts of invasive species or tests of local adaptation among geographically separated populations (Broennimann et al. 2007; Peterson et al. 2007). Spatial data gaps through time (i.e., space-by-time) can influence the ability to detect geographic range shifts over time (Tingley and Beissinger 2009) or biotic homogenization between regions (Rahel 2000). Gaps in data along environmental gradients and through time (i.e., environment-by-time) can limit the detection of environmental niche evolution, a process that has important implications for understanding natural species richness patterns or predicting species adaptive potential in the face of human-induced global change (Pearman et al. 2008). As with completeness, a variety of analytic approaches have been developed to quantify the coverage of biodiversity surveys. These include direct measurement of the frequency distributions of biodiversity surveys along key environmental gradients (e.g., Kadmon et al. 2003; Loiselle et al. 2008) as well as summarizing environmental variation among survey locations as a surrogate of biodiversity (e.g., Hortal and Lobo 2005).
The aim of this study was to compare completeness and coverage among three open-access databases representing ten taxonomically diverse groups of macro-organisms in the contiguous United States (amphibians, birds, freshwater bivalves, crayfish, freshwater fish, fungi, insects, mammals, plants, and reptiles). First, by comparing completeness among a database composed of many smaller data compilation efforts (GBIF) with databases of systematic survey efforts (North American Breeding Bird Survey, federally administered fish surveys), our goal was to assess the utility of data compilation efforts with regard to describing spatial variation in species richness. Second, we characterized the coverage of biodiversity surveys derived from these databases along two spatial gradients (latitude and longitude); three natural environmental gradients (elevation, mean annual temperature, and mean annual precipitation); three anthropogenic environmental gradients (urban land cover, agricultural land cover, and total disturbed land cover); and two gradients of future climate change (forecasted change in mean annual temperature and mean annual precipitation between the late 1900s and the 2080s). Characterizing coverage along these latter two gradients of anthropogenic environmental change is an important, yet overlooked, component of biodiversity data planning. Lastly, by synthesizing completeness and coverage among multiple databases, taxonomic groups, and ecologically relevant gradients, our goal was to elucidate the causes of data gaps and offer an objective path toward comprehensive biodiversity conservation in the United States.

Compilation of occurrence records
We downloaded georeferenced occurrence records within the contiguous United States from GBIF. This data repository is likely the most comprehensive source of open-access species occurrence records and includes records from other frequently used data repositories that are specific to a geographic region (e.g., BISON), taxonomic group (e.g., FishNet2, HerpNet, MANIS), or institution (e.g., Kansas University Biological Survey) (Yesson et al. 2007). As such, numerous studies have made use of GBIF data as presence-only records in SDMs (e.g., Giovanelli et al. 2008) and biodiversity studies (e.g., Pineda and Lobo 2009). For the present analysis, taxonomic keywords were used to download records from GBIF for amphibians (Class: Amphibia), birds (Class: Aves), freshwater bivalves (Order: Unionoida), crayfish (Order: Decapoda), fishes (Class: Actinopterygii), fungi (Kingdom: Fungi), insects (Class: Insecta), mammals (Class: Mammalia), plants (Kingdom: Plantae), and reptiles (Class: Reptilia). Records were screened and those with (1) "no known coordinate issues"; (2) a sampling year between 1800 and 2013; (3) a taxonomic rank of "species"; and (4) a record type of "specimen" were retained for the analysis. In addition to GBIF records, we obtained georeferenced routes from the North American Breeding Bird Survey (hereafter BBS; Pardieck et al. 2014). The BBS data have been used previously in national and continental-scale studies of avian ecology and conservation (e.g., Bahn and McGill 2007; Peterson et al. 2007). We also obtained georeferenced federal fish surveys (hereafter FFS) from the Environmental Protection Agency's Regional Environmental Monitoring and Assessment Program and National Rivers and Streams Assessment as well as the United States Geological Survey's National Water Quality Assessment. These three sources of fish distributional data have been collated previously and used as a comprehensive presence-absence dataset in national-scale studies of freshwater biogeography, ecology, and conservation (e.g., Herlihy et al. 2006; Mitchell and Knouft 2009; Mims and Olden 2012). The BBS and FFS datasets provide an informative comparison with the GBIF datasets, as they represent systematic (and presumably less biased) sampling efforts by one or several collaborating authorities.
We defined an individual occurrence record as a database row that represents an individual organism collected from a known location (i.e., latitude and longitude) and at a known time (i.e., calendar year). These occurrence records were mapped in ArcMap (version 10.1; ESRI, Inc.: Redlands, CA) and assigned to one of 83,545 grid cells (0.1° by 0.1° rectangles) distributed across the contiguous United States using the spatial join tool. These dimensions correspond to cells that range in size from 80 km² (11.1 km by 7.2 km) at the northernmost latitudes to 112 km² (11.1 km by 10.1 km) at the southernmost latitudes. A trade-off exists between maximizing survey resolution (i.e., small grid cells) while retaining an adequate number of occurrence records within each grid cell. Previous studies indicated that a 0.1° cell size provides sufficient resolution to be useful for biodiversity research (Hortal et al. 2006; Soberón et al. 2007) and a preliminary exploration of larger and smaller sizes indicated that this size retained adequate numbers of occurrence records per cell with datasets used in this study. We defined an individual survey as all occurrence records within a grid cell. Surveys were defined over three different time periods: all records between 1800 and 2013 (hereafter "complete time period"); all records between 1990 and 2013 (hereafter "contemporary time period"); and records falling within each of ten different 20-year intervals plus a final 14-year interval (i.e., 1800-1819, 1820-1839, 1840-1859, 1860-1879, 1880-1899, 1900-1919, 1920-1939, 1940-1959, 1960-1979, 1980-1999, 2000-2013).
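The gridding and time-binning steps described above can be illustrated with a minimal Python sketch. This is not the ArcMap spatial join used in the study; the function names, the floor-based cell indexing, and the constant of 111.32 km per degree of latitude are assumptions introduced only for illustration.

```python
import math

# Assumed approximate length of one degree of latitude, in kilometers.
KM_PER_DEGREE = 111.32


def cell_index(lat, lon, size=0.1):
    """Return integer (row, column) indices of the grid cell containing a point."""
    return (math.floor(lat / size), math.floor(lon / size))


def cell_dimensions_km(lat, size=0.1):
    """Approximate north-south and east-west extent (km) of a cell at a latitude."""
    height = size * KM_PER_DEGREE
    # East-west extent shrinks with the cosine of latitude.
    width = size * KM_PER_DEGREE * math.cos(math.radians(lat))
    return height, width


def interval_start(year, first=1800, step=20):
    """Return the first year of the 20-year interval containing `year`."""
    return first + step * ((year - first) // step)
```

Under these assumptions, a cell near 25°N is roughly 11.1 km by 10.1 km and a cell near 49.4°N is roughly 11.1 km by 7.2 km, which agrees with the cell dimensions given above; a record from 2013 falls in the interval beginning in 2000.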

Survey completeness
Three completeness metrics were computed and used to classify each survey as "well-surveyed" or "not-well-surveyed." These metrics included (1) the number of records per grid cell, (2) the Chao2 completeness index, and (3) the final (i.e., right side) slope of the species accumulation curve (Chao 1987; Yang et al. 2013). We explored low (i.e., liberal), moderate, and high (i.e., conservative) thresholds for each of these three completeness metrics based on the range of thresholds used in previous studies (Soberón et al. 2007; Sousa-Baena et al. 2013; Yang et al. 2013). For the low threshold, well-surveyed cells were defined as those with ≥10 occurrence records, a Chao2 completeness metric ≥0.6, and a final species accumulation curve slope of ≤0.15. For the moderate threshold, well-surveyed cells were defined as those with ≥25 occurrence records, a Chao2 completeness metric ≥0.7, and a final species accumulation curve slope of ≤0.10. For the high threshold, well-surveyed cells were defined as those with ≥50 occurrence records, a Chao2 completeness metric ≥0.8, and a final species accumulation curve slope of ≤0.05.
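The classification rule above can be sketched in Python as follows. The sketch uses the classic Chao2 formula (observed richness plus Q1²/(2·Q2), where Q1 and Q2 are the numbers of species found in exactly one and exactly two sampling units) and, as an assumption, a simplified bias-corrected fallback when Q2 is zero. How sampling units were defined within a grid cell is not stated in the text, so the incidence counts here are hypothetical inputs, and the final accumulation-curve slope is taken as a precomputed value rather than derived.

```python
from collections import Counter

# (minimum records, minimum Chao2 completeness, maximum final slope),
# taken from the low, moderate, and high thresholds given in the text.
THRESHOLDS = {
    "low": (10, 0.6, 0.15),
    "moderate": (25, 0.7, 0.10),
    "high": (50, 0.8, 0.05),
}


def chao2_richness(incidence_counts):
    """Classic Chao2 richness estimate.

    incidence_counts: for each observed species, the number of sampling
    units in which it was recorded (hypothetical input format).
    """
    s_obs = len(incidence_counts)
    freq = Counter(incidence_counts)
    q1, q2 = freq[1], freq[2]  # uniques and duplicates
    if q2 > 0:
        return s_obs + q1 * q1 / (2 * q2)
    # Assumed fallback when no species occurs in exactly two units.
    return s_obs + q1 * (q1 - 1) / 2


def chao2_completeness(incidence_counts):
    """Observed richness divided by Chao2-estimated richness."""
    if not incidence_counts:
        return 0.0
    return len(incidence_counts) / chao2_richness(incidence_counts)


def is_well_surveyed(n_records, completeness, final_slope, level="moderate"):
    """Apply all three completeness criteria at the chosen threshold level."""
    min_records, min_completeness, max_slope = THRESHOLDS[level]
    return (
        n_records >= min_records
        and completeness >= min_completeness
        and final_slope <= max_slope
    )
```

For example, a cell with six observed species whose incidence counts are 1, 1, 2, 2, 3, and 5 has an estimated richness of 7 and a completeness of 6/7; with 30 records and a final slope of 0.04 it would pass the low and moderate thresholds but fail the high threshold on record count alone.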

Survey coverage
We evaluated the coverage of well-surveyed and not-well-surveyed grid cells along spatial, environmental, and temporal gradients. Natural environmental variables were acquired as raster grids (30 arc-second resolution) from the WorldClim dataset and were spatially joined to the survey grid cells in ArcMap. These variables included elevation above sea level, contemporary mean annual temperature (MAT), and contemporary mean annual precipitation (MAP). Anthropogenic environmental variables were acquired as raster grids (30 m resolution) from the 2001 National Land Cover Database (NLCD). Percent coverage of each anthropogenic land cover class was summarized within each survey grid cell. We defined urban land cover as the sum of NLCD classes 21, 22, 23, and 24; agricultural land cover as the sum of NLCD classes 81 and 82; and total disturbed land cover as the sum of urban and agricultural land cover. Lastly, forecasted changes in mean annual temperature (ΔMAT) and mean annual precipitation (ΔMAP) were acquired as raster grids (30 arc-second resolution) from the Climate Wizard tool (Girvetz et al. 2009). These projections represent differences between contemporary and future (2080s) MATs and MAPs based on an ensemble average of sixteen global circulation models assuming moderate carbon emissions (i.e., A1B scenario). These climate change variables were spatially joined to the survey grid cells using ArcMap.
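The land cover aggregation defined above amounts to simple sums over NLCD classes. A minimal Python sketch, assuming a hypothetical per-cell dictionary that maps each NLCD class code to its percent cover within the grid cell:

```python
URBAN_CLASSES = (21, 22, 23, 24)  # NLCD developed classes
AGRICULTURAL_CLASSES = (81, 82)   # NLCD pasture/hay and cultivated crops


def disturbance_summary(percent_by_class):
    """Summarize urban, agricultural, and total disturbed cover for one cell.

    percent_by_class: hypothetical mapping of NLCD class code to percent
    of the grid cell covered by that class.
    """
    urban = sum(percent_by_class.get(c, 0) for c in URBAN_CLASSES)
    agricultural = sum(percent_by_class.get(c, 0) for c in AGRICULTURAL_CLASSES)
    return {
        "urban": urban,
        "agricultural": agricultural,
        "disturbed": urban + agricultural,
    }
```

Classes absent from a cell are treated as zero percent cover, so a cell with no developed or agricultural pixels yields zero for all three summaries.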
We evaluated coverage along the time gradient by comparing the frequency of surveys among each of the 20-year time intervals to a uniform frequency distribution. We used a uniform distribution to represent the ideal null expectation of equal sampling among each of the 20-year time intervals from 1800 to 2013. For spatial and environmental gradients, however, the conditions of the background environment are not uniformly distributed within a given region (e.g., contiguous US). Thus, we compared the frequency distribution of surveyed grid cells to that of all 82,545 grid cells. Next, we performed Kolmogorov-Smirnov (K-S) goodness-of-fit tests for each dataset-by-taxon-by-gradient combination and used the test statistic (D-statistic) as an index of strong or weak (low or high values, respectively) congruence between each survey dataset and the background environment (Kadmon et al. 2004; Loiselle et al. 2008). D-statistics were computed for all surveyed grid cells and for well-surveyed grid cells, which we defined using the moderate completeness threshold described above. To evaluate comprehensive coverage of each survey dataset, D-statistics were summed for all spatial and environmental gradients. We plotted overlapping histograms of each survey dataset and the background environment to provide visual reference and detail as to the position along each gradient where congruence was strong or weak. Analyses for spatial and environmental gradients were limited to contemporary records (i.e., 1990-2013) because the 2001 NLCD land cover is not representative of historical land cover. All statistical analyses were carried out using the vegan (Oksanen et al. 2007) and fossil (Vavrek 2011) libraries in the R programming environment (R Core Team 2014).
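To make the coverage index concrete, the following Python sketch computes a two-sample K-S D-statistic as the maximum absolute difference between the empirical cumulative distributions of surveyed cells and background cells along one gradient. It is an illustration only: the analyses in this study were run in R, and treating the comparison as a two-sample test, along with the function names, is an assumption of this sketch.

```python
from bisect import bisect_right


def ks_d_statistic(surveyed, background):
    """Maximum absolute difference between two empirical CDFs.

    surveyed, background: gradient values (e.g., elevation) for surveyed
    grid cells and for all background grid cells, respectively.
    """
    a, b = sorted(surveyed), sorted(background)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both CDFs are step functions, so the largest gap occurs at a data value.
    for x in set(a) | set(b):
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def cumulative_coverage(d_by_gradient):
    """Sum D-statistics across gradients; lower totals mean better coverage."""
    return sum(d_by_gradient.values())
```

A D-statistic of 0 indicates that surveyed cells track the background distribution exactly, whereas a value of 1 indicates no overlap along the gradient. Summing across gradients, as in `cumulative_coverage`, gives every gradient equal weight.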

Survey completeness
Our compilation of open-access biodiversity data within the contiguous United States yielded in excess of 6.7 million GBIF records collected between 1800 and 2013, 4.8 million BBS records collected between 1963 and 2013, and 2.1 million FFS records collected between 1990 and 2008. These records were distributed among 183,165 GBIF grid cells (i.e., surveys), 3660 BBS grid cells, and 3372 FFS grid cells. Since 1990, in excess of 1.9 million GBIF records, 3.0 million BBS records, and 2.1 million FFS records have been accumulated. These contemporary records were distributed among 75,836 GBIF surveys, 3523 BBS surveys, and 3372 FFS surveys (Fig. 1). For the complete time period, plant surveys from GBIF were most prevalent, followed by GBIF mammals, GBIF insects, and GBIF birds. The least prevalent surveys were GBIF crayfish, FFS fish, and BBS birds (Table 1, Fig. 2). Surveys from standardized datasets (i.e., BBS and FFS) were substantially more complete than those from GBIF. Specifically, 4.7% and 3.7% of GBIF-surveyed grid cells for the complete and contemporary time periods, respectively, were classified as well-surveyed based on the moderate or high completeness thresholds. By contrast, 82.6% and 82.3% of BBS- and FFS-surveyed grid cells for the complete and contemporary time periods, respectively, were classified as well-surveyed (Table 1, Fig. 2).
GBIF surveys represented the longest period of record (dating back to 1800), followed by the BBS surveys (1967) and FFS surveys (1990) (Fig. 3). GBIF surveys, particularly those classified as well-surveyed, were most prevalent since approximately 1920. Nevertheless, a substantial number of well-surveyed grid cells were available from the nineteenth century for birds (116 grid cells), mammals (22 grid cells), and plants (9 grid cells). The average number of species inventoried per grid cell (i.e., survey richness) was highest for birds, plants, and fungi and lowest for crayfish, amphibians, and mammals. For most taxa, well-surveyed grid cells contained more species (upper left diagonal in Fig. 4A) than did all (i.e., both well-surveyed and not-well-surveyed) surveyed grid cells. Two exceptions were BBS birds and FFS fish, both of which contained similar numbers of species for well-surveyed grid cells and all grid cells (1:1 line in Fig. 4A).

Survey coverage
Coverage indices (i.e., K-S D-statistics) ranged from 0.03 to 0.84 (mean = 0.26) across the 264 K-S tests for all (i.e., both well-surveyed and not-well-surveyed) grid cells surveyed since 1990. For well-surveyed grid cells, D-statistics were higher on average (0.31) and ranged from 0.06 to 0.82 across the 132 K-S tests. This variability in coverage indices suggests that coverage varied substantially among data sources, gradients, and taxonomic groups (Table 2). With regard to temporal coverage, GBIF records were more uniformly distributed (mean D = 0.50 and 0.52 for all surveyed grid cells and well-surveyed grid cells, respectively) compared to the BBS (D = 0.73 and 0.73) and FFS records (D = 0.81 and 0.82). Lower temporal bias in GBIF surveys compared to standardized surveys was likely a consequence of the longer time span from which GBIF records have been compiled (Fig. 3).
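The dependence of temporal D-statistics on record span can be illustrated with a short Python sketch. It compares the cumulative proportion of surveys across the eleven time intervals with a uniform expectation, treating every interval as equally weighted even though the final interval spans only 14 years. Both that weighting and the function name are assumptions of this sketch; the text does not specify how the temporal statistic was computed.

```python
def temporal_d_statistic(surveys_per_interval):
    """Largest gap between observed and uniform cumulative proportions.

    surveys_per_interval: number of surveys in each consecutive time
    interval, ordered from earliest to latest.
    """
    total = sum(surveys_per_interval)
    if total == 0:
        raise ValueError("at least one survey is required")
    k = len(surveys_per_interval)
    d, running = 0.0, 0
    for i, count in enumerate(surveys_per_interval, start=1):
        running += count
        # The uniform expectation places i/k of all surveys in the first i intervals.
        d = max(d, abs(running / total - i / k))
    return d
```

Under these assumptions, a dataset with no surveys in the first nine of eleven intervals has a D of at least 9/11 (about 0.82), and one with no surveys in the first eight has a D of at least 8/11 (about 0.73). These lower bounds are close to the FFS and BBS values reported above, which is consistent with temporal bias in those datasets being driven largely by their short periods of record.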
Cumulative coverage indices (i.e., D-statistics summed for the eleven spatial, environmental, and temporal gradients) suggested that the BBS bird and GBIF insect surveys had the best coverage, whereas GBIF surveys of bivalves and fish had the worst coverage (Fig. 4B). For most taxa, well-surveyed grid cells exhibited worse cumulative coverage (upper left diagonal in Fig. 4B) than did all (i.e., both well-surveyed and not-well-surveyed) surveyed grid cells. Exceptions included BBS birds, which contained well-surveyed grid cells that exhibited similar coverage to all surveyed grid cells, and to a lesser degree FFS fish (1:1 line in Fig. 4B).
Averaged coverage indices (i.e., D-statistics) across all eleven temporal, spatial, and environmental gradients indicate that GBIF surveys of birds, insects, and mammals had the best coverage, whereas GBIF surveys of bivalves, crayfish, and fish were the most biased (Fig. 5A). For spatial gradients, the BBS bird surveys had the best coverage, followed by the GBIF insect surveys and the FFS fish surveys. By contrast, the GBIF bivalve, GBIF fish, and GBIF crayfish surveys had the worst spatial coverage (Fig. 5B). For environmental gradients, the BBS bird survey had the best coverage, followed by the GBIF insect and fungi surveys. By contrast, the GBIF surveys of bivalves, crayfish, and fish had the worst coverage (Fig. 5C). For forecasted gradients of climate change, the BBS bird surveys had the best coverage, followed by the GBIF insect surveys. By contrast, the GBIF surveys of bivalves, plants, and reptiles had the worst coverage along gradients of future climate change (Fig. 5D). Averaged coverage indices (i.e., D-statistics) across all ten GBIF datasets indicate gradients of agricultural land cover, disturbed land cover, and urban land cover had the best coverage, whereas the temporal, longitudinal, and forecasted change in mean annual temperature gradients had the worst coverage (Fig. 6A). For standardized datasets (i.e., BBS and FFS), gradients of latitude, forecasted change in mean annual temperature, and contemporary mean annual temperature had the best coverage, whereas the temporal, longitudinal, and contemporary MAP gradients had the worst coverage (Fig. 6C). Terrestrial taxa (i.e., amphibians, birds, fungi, insects, mammals, plants, and reptiles) generally had better coverage than aquatic taxa (i.e., bivalves, crayfish, and fish) (Fig. 6B and D). For terrestrial taxa, gradients of contemporary mean annual temperature, contemporary mean annual precipitation, and urban land cover had the best coverage, whereas the temporal, longitudinal, and future MAT gradients had the worst coverage (Fig. 6B). For aquatic taxa, gradients of agricultural land cover, future MAP, and disturbed land cover had the best coverage, whereas the temporal, contemporary MAP, and longitudinal gradients had the worst coverage (Fig. 6D).
Although D-statistics summarized survey coverage along the entirety of a given gradient, the specific locations along that gradient that were over- or under-represented were not characterized by these D-statistics. A detailed description of every histogram here would be exhaustive, so we provide the raw histograms for all 264 dataset-by-taxon-by-gradient combinations as Supporting Information (Figs. S2.1-S2.10).

Discussion
Open-access biodiversity databases are essential to biodiversity research and conservation (Soberón and Peterson 2004; Peterson et al. 2010); however, the efficacy of these databases depends on the completeness of species inventories and the coverage of surveys across dimensions of space, environment, and time (Kadmon et al. 2003; Hortal et al. 2008; Ladle and Hortal 2013). Many assessments have been completed for regions of the world including Central America, South America, the Iberian Peninsula, and western Africa (Hortal et al. 2007, 2008; Soberón et al. 2007; Sousa-Baena et al. 2013; Idohou et al. 2015) as well as the entire globe (Meyer et al. 2015). Still, no study to our knowledge has evaluated completeness and/or coverage of open-access biodiversity data for the United States. Our compilation of the Global Biodiversity Information Facility (GBIF), the North American Breeding Bird Survey (BBS), and federally administered freshwater fish surveys (FFS) yielded in excess of 13.6 million occurrence records distributed among more than 190,000 survey grid cells within the contiguous United States. By evaluating multiple datasets and taxonomic groups simultaneously, our findings provide novel insights into the Wallacean shortfall (Lomolino and Lawrence 2004; Hortal et al. 2015). This comparative approach to biodiversity informatics provides a relative understanding of data needs for ten of the most abundant and diverse macro-organism groups in the contiguous United States.
Open-access biodiversity datasets differ in the type and origin of occurrence records they contain. For example, GBIF contains occurrence records that are often represented by individually vouchered museum specimens (Edwards et al. 2000; Yesson et al. 2007), whereas the BBS is a standardized whole-assemblage surveying effort aimed at inventorying all breeding bird species along each survey route (Pardieck et al. 2014). The FFS is intermediate in that it contains standardized whole-assemblage surveys, but there is variation in completeness stemming from the surveys being carried out with different survey methods of different government entities (Gilliom et al. 1995). Not surprisingly, the standardized surveys (BBS and FFS) produced a substantially higher proportion of well-surveyed grid cells than GBIF-derived surveys of birds and fishes, owing to the larger number of individual records (and accumulated species) per grid cell (Hortal et al. 2007). Despite providing a low proportion of well-surveyed grid cells, GBIF still provided a sufficient number of well-surveyed grid cells for most taxa because the database contains so many more independent occurrence records than the BBS and FFS. Across all ten GBIF taxa, the average number of well-surveyed grid cells was 912 for the complete time period and 300 for the contemporary time period. Previous studies suggest that site-by-species matrices of this size are sufficient to produce accurate SDMs using presence-absence techniques or accurately model and map patterns in species richness (Lobo and Martín-Piera 2002; Wisz et al. 2008). Thus, our evaluation of survey completeness demonstrates that the quantity and quality of data contained in all three datasets are suitable for biodiversity and SDM studies for most of the ten taxonomic groups.
Survey resolution is an important consideration that directly affected the completeness of species inventories (Hortal et al. 2006; Soberón et al. 2007). The size of grid cells we chose for the present study (approximately 100 km²) is on the lower end of the size spectrum for studies of this type (e.g., Hortal et al. 2008; Yang et al. 2013; Sousa-Baena et al. 2013). As such, most grid cells did not contain adequate densities of GBIF records to be classified as well-surveyed. Nevertheless, those that did contain adequate densities of GBIF records offer richness and species occupancy information at a spatial resolution useful for biodiversity conservation and research (Rahbek 2005). One disadvantage of aggregating occurrence records by ~100 km² grid cells is that finer-grained spatial resolution of the systematic surveys is lost. This is because BBS routes and FFS stream reaches represent species inventories of areas that are smaller than the ~100 km² grid cells. Given that most grid cells contained only a single BBS route or FFS reach, there was no advantage to aggregating occurrence records from these datasets by grid cells because information from multiple BBS routes or FFS reaches was not accumulated in a way that increases grid cell completeness. We conclude that aggregating occurrence records by grid cells is an effective technique for studies that use GBIF by itself or for studies that compile records from multiple databases, but not for studies using only the BBS or FFS databases.
The time period over which occurrence records were accumulated is another key factor that affected the completeness of species inventories. Our analysis revealed many more well-surveyed grid cells for the 214-year complete time period compared to the 24-year contemporary time period. Although the time period over which occurrence records are aggregated will depend on the question being addressed (e.g., Rahel 2000; Pearman et al. 2008; Tingley and Beissinger 2009), these findings demonstrate that surveys derived from GBIF data are of sufficient quality and quantity for studies addressing contemporary or historical biodiversity of most taxonomic groups. It is also encouraging that GBIF records aggregated into 20-year intervals yielded reasonably large numbers of well-surveyed grid cells for several taxonomic groups going back to the late nineteenth century. Future efforts that identify historical time periods of high collection density could be used to optimize aggregation intervals, as opposed to the arbitrary 20-year intervals used in the present evaluation, and likely increase the number of historically well-surveyed grid cells (Hortal et al. 2008).
Many recent efforts have sought to characterize the Wallacean shortfall for individual taxonomic groups, particularly plants (e.g., Sousa-Baena et al. 2013; Yang et al. 2014) and insects (e.g., Hortal et al. 2008; Beck et al. 2013). Whereas these single-taxon studies are highly informative to conservation and research efforts within a given taxonomic group, comprehensive biodiversity conservation requires knowledge about data limitations of many taxonomic groups relative to one another (Funk et al. 2005; Meyer et al. 2015). Based on our simultaneous evaluation of ten taxonomic groups, it is apparent that the severity of the Wallacean shortfall varies substantially among taxonomic groups, an issue that has been described previously as the Linnean shortfall (Whittaker et al. 2005; Brito 2010). One notable trend is the lack of GBIF occurrence records for freshwater invertebrates, particularly crayfish and freshwater bivalves. Indeed, occurrence records for crayfish are so scarce that no well-surveyed grid cells were identified, even based on the low (i.e., least conservative) completeness thresholds. This is surprising given the imperilment of these taxa and the volume of research directed their way in recent years (Thorp and Covich 2009; Haag 2012; Ross 2013). Fungi were also poorly represented in GBIF relative to the other taxa, probably due to their cryptic life history and the relative paucity of research directed at documenting their distributions (Mueller et al. 2007). Overall, this quantitative evaluation thus provides an objective ranking (see Fig. 1) of primary biodiversity data needs for ten of the most abundant and diverse macro-organism groups in the United States.
Another notable trend that became apparent from evaluating coverage of multiple taxonomic groups is the "taxonomist surveying bias"; that is, the tendency for collectors (both professional and recreational) to survey locations where the taxon of interest is most abundant or diverse (Sastre and Lobo 2009). For example, our analysis shows that the frequency of reptile surveys is highest in the desert southwest where reptiles are abundant, diverse, and frequently collected and studied. By contrast, reptile records are infrequent at higher latitudes and eastern longitudes, where reptiles still occur but are less abundant and diverse (Kiester 1971) and, presumably, collected and studied less frequently (Sastre and Lobo 2009). Similarly, freshwater fishes and bivalves are highly diverse and frequently studied in the southeastern United States and consequently have been collected frequently and vouchered in museums of this region (e.g., Tulane Museum of Natural History; Warren et al. 2000; Haag 2012). This pattern of geographic bias in survey coverage for freshwater fishes mirrors the findings of a recent assessment of global fish biodiversity data. Institutional participation in data compilation projects such as GBIF also influences coverage. This is apparent in the high density of surveys for particular taxonomic groups in some states but not in others. For example, the Kansas Biological Survey has made extensive collections of amphibians throughout the state of Kansas and has made these records electronically accessible via GBIF. This pattern is also evident for GBIF crayfish in Oklahoma and GBIF fish in North Carolina. Accessibility of sampling locations via road corridors and population centers is also a common driver of bias in survey coverage (Dennis and Hardy 1999; Kadmon et al. 2004).
This source of coverage bias was not evident for any of the ten taxa, probably because the coarse survey resolution precluded our ability to detect biases in coverage along these finer resolution environmental characteristics.
Survey coverage varied among spatial, environmental, and temporal gradients. Not surprisingly, GBIF surveys exhibited higher spatial and environmental bias compared to the BBS and FFS, which represent systematic sampling efforts that are planned to be spatially and environmentally stratified (Gilliom et al. 1995; Pardieck et al. 2014). On the other hand, an advantage of museum record compilations such as GBIF is that the temporal distribution of records is typically longer and more uniform than systematic sampling efforts. Indeed, the first GBIF records were collected in the early 1800s, whereas the Breeding Bird Survey began in the 1960s and the FFS surveys span only the 1990s and 2000s. Another trend was consistently poorer coverage for aquatic taxa compared to terrestrial taxa. This may be a consequence of where aquatic habitats are most prevalent. For example, aquatic taxa surveys were overrepresented in wetter areas (i.e., higher mean annual precipitation; Fig. S2.5) of the eastern United States (Fig. S2.2). Nevertheless, unique and functionally diverse aquatic taxa persist in the arid southwest and other poorly covered regions (Pool and Olden 2012). Future efforts to fill in these aquatic biodiversity data gaps should therefore be a priority. Apart from these general trends, each taxon-gradient combination exhibited unique biases (Figs. S2.1-S2.10) that should be considered by collectors on a taxon-specific basis when planning new data compilation and surveying efforts. Recent studies have highlighted that environmental biases vary in their effect on the performance of predictive models in other regions (Kadmon et al. 2003; Loiselle et al. 2008; Tessarolo et al. 2014).
To what degree the environmental biases documented in the current study would affect predictive models remains unknown and should be a future objective of biodiversity informatics in the United States.
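As an illustration of how coverage along a single gradient can be quantified, the following minimal sketch computes a two-sample Kolmogorov-Smirnov D-statistic between gradient values at surveyed grid cells and gradient values at all background grid cells. The function name and the toy data are hypothetical, and the two-sample formulation is an assumption made for illustration; it is not necessarily the exact test configuration used in this study.

```python
from bisect import bisect_right

def ks_d_statistic(surveyed, background):
    """Maximum absolute difference between two empirical CDFs (KS D-statistic).

    `surveyed`   : gradient values (e.g., mean annual precipitation) at surveyed cells
    `background` : gradient values at all grid cells in the study area
    """
    a = sorted(surveyed)
    b = sorted(background)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Empirical CDFs are step functions that change only at observed values,
    # so it suffices to compare them at every distinct observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

if __name__ == "__main__":
    # Toy gradient: background cells span values 0-9, surveys only cover the upper half
    background = list(range(10))
    surveyed = [5, 6, 7, 8, 9]
    print(ks_d_statistic(surveyed, background))    # biased coverage gives a larger D
    print(ks_d_statistic(background, background))  # identical distributions give 0.0
```

A D-statistic near zero indicates that surveyed cells track the background distribution closely (strong congruence), whereas larger values indicate that surveys are concentrated in part of the gradient.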
Figure 6. Coverage indices for each of eleven temporal, spatial, or environmental gradients (twelve taxonomic survey datasets pooled) averaged across (A) GBIF, (B) standardized (i.e., BBS and FFS), (C) terrestrial, and (D) aquatic datasets. Index values are D-statistics from Kolmogorov-Smirnov goodness-of-fit, indicating strong or weak (low or high D-statistics, respectively) congruence between survey datasets and the background environment. Vertical gray and red lines represent the mean of all eleven datasets for all surveyed grid cells and well-surveyed grid cells, respectively.
Effective biodiversity conservation starts with researchers and conservationists having access to biodiversity surveys of sufficient completeness and coverage (Reichman et al. 2011). Evaluations like the one we present provide a quantitative and comprehensive prioritization scheme to facilitate efficient improvements to existing databases, such as GBIF. Another essential goal of such prioritization schemes should be to produce future data coverages that enable the study of long-term biodiversity responses to anthropogenic environmental change (e.g., Jiguet et al. 2010). Such an approach should involve the identification of areas that presently are undersurveyed and are expected to undergo anthropogenic environmental change over the next 50-100 years. Such foresight in data collection by this generation of scientists can provide complete and unbiased "before" data for BACI-designed natural experiments conducted by scientists 50-100 years into the future after environmental change has occurred. Indeed, our evaluation suggests that climate change gradients are among the most poorly covered environmental gradients. Lastly, descriptive evaluations of completeness and coverage like the one we present should be viewed as an iterative process. Investigators will need to periodically reevaluate completeness and coverage as new occurrence records are added to open-access databases.
Such periodic reevaluations will need to incorporate additional coverage information, as new environmental data layers become available or as existing environmental data layers change as a consequence of climate and land cover change.

Supporting Information
Additional Supporting Information may be found online in the supporting information tab for this article:
Figure S1. Spatial and environmental variables summarized at the resolution of 0.1° by 0.1° grid cells (N = 83,545) used in coverage analysis.
Figure S2.1. Distribution of occurrence records along a latitudinal spatial gradient.
Figure S2.2. Distribution of occurrence records along a longitudinal spatial gradient.
Figure S2.3. Distribution of occurrence records along a gradient of elevation.
Figure S2.4. Distribution of occurrence records along a gradient of mean annual temperature.
Figure S2.5. Distribution of occurrence records along a gradient of mean annual precipitation.
Figure S2.6. Distribution of occurrence records along a gradient of urban land cover.
Figure S2.7. Distribution of occurrence records along a gradient of agricultural land cover.
Figure S2.8. Distribution of occurrence records along a gradient of disturbed (urban + agricultural) land cover.
Figure S2.9. Distribution of occurrence records along a gradient of change (future - present) in mean annual temperature.
Figure S2.10. Distribution of occurrence records along a gradient of change (future - present) in mean annual precipitation.