Inferring trends in pollinator distributions across the Neotropics from publicly available data remains challenging despite mobilization efforts

Aggregated species occurrence data are increasingly accessible through public databases for the analysis of temporal trends in the geographic distributions of species. However, biases in these data present challenges for statistical inference. We assessed potential biases in data available through GBIF on the occurrences of four flower‐visiting taxa: bees (Anthophila), hoverflies (Syrphidae), leaf‐nosed bats (Phyllostomidae) and hummingbirds (Trochilidae). We also assessed whether and to what extent data mobilization efforts improved our ability to estimate trends in species' distributions.


| INTRODUCTION
The geographic distributions of species are the fundamental units of biogeography and an important variable in ecology.
Understanding the dynamics of species' distributions (that is, how they have changed over time) is essential for identifying drivers and correlates of range contractions and expansions (Powney et al., 2014; Woodcock et al., 2016); tracking the spread of invasive species (Delisle et al., 2003) and their impacts on native taxa (Roy et al., 2012); prioritizing areas for, and evaluating the effects of, conservation interventions (Cunningham et al., 2021; Moilanen, 2007); and monitoring progress towards international biodiversity targets, among other applications. To understand the dynamics of species' distributions, and hence tackle these important problems, researchers must have access to reliable records of what species occurred where and when. Generally, records of this type are referred to as species occurrence data (sometimes called biological records).
Species occurrence data have become increasingly accessible over the last two decades. This can be attributed to the mobilization of historic records from preserved specimens (taken here to include both the digitization of analog records and the deposition of digital records in public databases), the proliferation and growth of citizen science monitoring programs and the launch of online data portals through which these data can be easily accessed and shared (Ellwood et al., 2015; Faith et al., 2013; Nelson and Ellis, 2019; Peterson et al., 2015). To put this into context, the largest online data portal, the Global Biodiversity Information Facility (GBIF hereafter), currently holds nearly two billion species occurrence records spanning all continents and major taxa (GBIF.org, 2021).
Approximately 10% of the records held on GBIF derive from preserved specimens in museums and herbaria that have been mobilized for accession online. Whilst this represents a huge quantity of data, it is estimated that globally, museums and herbaria hold 1.5-2.0 billion preserved specimens (Peterson et al., 2015). That is to say, up to around 90% of these records have not been mobilized for use by the research community, at least not through GBIF. To bridge this gap, resources are now being devoted to national and international data mobilization initiatives (Nelson and Ellis, 2019; also see e.g. https://www.idigbio.org/). It is essential, therefore, to understand the extent to which specific mobilization efforts can improve our ability to derive robust estimates of trends in species' distributions.
Despite the increasing accessibility of species occurrence data, there remains a shortfall in the knowledge of species' geographic distributions and trends thereof: this is often called the "Wallacean shortfall" (Lomolino, 2004). The Wallacean shortfall can be explained at least in part by sampling biases (that is, nonrandom sampling along the axes of space, time, taxonomy and other important dimensions) and subsequent biases in data mobilization. Such biases confound information on species' true distributions with information on where, when and what was sampled, and which records were made accessible (e.g. Barends et al., 2020; Daru et al., 2018; Delisle et al., 2003; Isaac and Pocock, 2015; Oliveira et al., 2016; Reddy and Dávalos, 2003; Whitaker and Kimmig, 2020). Whilst individual datasets (e.g. from a single study or monitoring program) are not immune to these biases, they tend to become more evident when multiple datasets, each with their own idiosyncrasies, are aggregated (Whitaker and Kimmig, 2020).
There is no guarantee, then, that a given slice of aggregated species occurrence data will be suitable for a given analytical use.

addition of these new records in terms of their biases and estimated trends in species' distributions.

Results:
We found evidence of potential sampling biases for all taxa. The addition of newly-mobilized records of bees in Chile decreased some biases but introduced others. Despite increasing the quantity of data for bees in Chile sixfold, estimates of trends in species' distributions derived using the postmobilization dataset were broadly similar to what would have been estimated before their introduction, albeit more precise.

Main conclusions:
Our results highlight the challenges associated with drawing robust inferences about trends in species' distributions using publicly available data.
Mobilizing historic records will not always enable trend estimation because more data do not necessarily equal less bias. Analysts should carefully assess their data before conducting analyses: this might enable the estimation of more robust trends and help to identify strategies for effective data mobilization. Our study also reinforces the need for targeted monitoring of pollinators worldwide.

KEYWORDS
bees, GBIF, hoverflies, hummingbirds, leaf-nosed bats, pollinators, sampling bias, species occurrence data

Perhaps the most striking example of geographic bias in the availability of species occurrence data is the disproportionately poor coverage of the tropics, where species richness is highest. For example, the Neotropics (which we define as South and Central America, Mexico and the Caribbean islands) hosts the world's richest flora, and a high diversity of interactions with pollinators (Antonelli and Sanmartín, 2011). This region also hosts a great diversity of the major groups of pollinators, including the bees (Anthophila; Freitas et al., 2009; Moure et al., 2007), hoverflies (Syrphidae; Montoya, 2016), and two vertebrate taxa that are endemic to the region: hummingbirds (Trochilidae; Ellis-Soto et al., 2021) and leaf-nosed bats (Phyllostomidae; Villalobos and Arita, 2010). And yet, despite their diversity in the region, there remains a Wallacean shortfall in the knowledge of pollinator distributions across the Neotropics.
In this paper, we assess the suitability of species occurrence data within GBIF for estimating temporal trends in species' distributions, and whether recent data mobilization efforts have improved the situation. We focus on records of flower-visiting invertebrates and vertebrates collected across the Neotropical region over the period 1950-2019. We include four taxonomic groups in our analysis: bees (Anthophila), hoverflies (Syrphidae), leaf-nosed bats (Phyllostomidae) and hummingbirds (Trochilidae). We note that not all species of Phyllostomidae are flower visitors but include the whole group for simplicity. Generally, these taxa provide pollination services to a large fraction of flowering wild plants and cultivated crops and comprise culturally iconic species and rarities of conservation importance (IPBES, 2019; Vieli et al., 2021). We begin by conducting a continental-scale assessment of the GBIF data for common forms of bias in the geographic, temporal and taxonomic dimensions. To conduct this assessment, we deploy several heuristics that each indicate the potential for some form of bias in the data (Boyd et al., 2021). To assess the extent to which digitization efforts can improve our ability to estimate trends in species' geographic distributions, we identify two recent mobilization efforts that have drastically increased the number of records available for bees in Chile (12,001 and 36,010 records, respectively; Lopez-Aliste and Fonturbel, 2021a, 2021b). We create a "predigitization" dataset by removing the records that were introduced via these two mobilization efforts. We then compare the predigitization dataset with the full dataset using three criteria: (1) the total quantity of data after various stages of filtering (e.g. removing records with spatial issues); (2) the extent of any potential biases; and (3) estimates of temporal trends in species' distributions obtained by fitting statistical models to the data.

| Data
We extracted occurrence data for Anthophila (GBIF, 2021a, 2021b), Syrphidae (GBIF, 2021c), Phyllostomidae (GBIF, 2021d) and Trochilidae (GBIF, 2021e) collected in the Neotropics over the period 1950-2019 from GBIF. We used a bounding box (65°S-40°N) to filter the data and subsequently removed records from the USA, which fell within its limits. We used the CoordinateCleaner R package (Zizka et al., 2019) to flag and remove records with various potential spatial issues: coordinates matching country centroids and capital cities (indicating imprecise geolocation of records from vague locality names) and locations of biodiversity institutes; and records with equal latitude and longitude, which can indicate data entry errors. For the Anthophila, Syrphidae and Phyllostomidae, most of the records derive from natural history collections where they were identified by taxonomic specialists (Figure S6). The majority of the Trochilidae records do not derive from preserved specimens but were collected through the eBird initiative, which also has a stringent quality assurance policy including an expert review of unusual sightings. Two authors on this paper (RMBS and LAFP) reviewed the lists of species names for the Anthophila and Syrphidae for taxonomic issues; they found nothing that would affect our results (see the Supporting Information for more information).
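The simpler filters described above can be illustrated with a minimal Python sketch. It is not the CoordinateCleaner API and covers only the latitude band, the study period, the removal of USA records and the equal-coordinates check; the `Record` type, its field names and the use of two-letter country codes are assumptions made for illustration, and the example records are invented.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical occurrence record; the field names are illustrative only."""
    species: str
    country_code: str  # assumed to be a two-letter country code
    lat: float
    lon: float
    year: int


def keep_record(rec, south=-65.0, north=40.0, first_year=1950, last_year=2019):
    """Return True if a record passes the simple filters described in the text."""
    in_latitude_band = south <= rec.lat <= north
    in_period = first_year <= rec.year <= last_year
    not_usa = rec.country_code != "US"
    # Equal latitude and longitude can indicate a data entry error.
    distinct_coords = rec.lat != rec.lon
    return in_latitude_band and in_period and not_usa and distinct_coords


# Invented records, one per filter.
records = [
    Record("species A", "CL", -39.8, -73.2, 1985),   # passes all filters
    Record("species B", "US", 35.0, -106.6, 2001),   # USA record
    Record("species C", "BR", -10.0, -10.0, 1999),   # equal latitude and longitude
    Record("species D", "MX", 19.4, -99.1, 1948),    # outside the study period
    Record("species E", "CA", 45.5, -73.6, 2000),    # north of the bounding box
]
cleaned = [r for r in records if keep_record(r)]
print([r.species for r in cleaned])
```

Checks against country centroids, capital cities and biodiversity institutes need reference gazetteers and are therefore left out of this sketch.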

| Bias heuristics
To assess the data for sampling biases, we used five data-driven heuristics. We use the term heuristic to acknowledge that it is generally not possible to quantify the exact extent to which a sample is biased without a complete census or large probability sample for comparison. Although the goal is to draw species-level inferences, we apply these heuristics at the taxonomic group level, i.e. separately for the bees, hoverflies, hummingbirds and leaf-nosed bats. It is not possible to assess the data for sampling biases at the species level because they are presence-only: such data provide no information on sampling effort in space or time if a species was not detected.
Instead, we use the records for all species in each taxonomic group as a proxy for the spatio-temporal distribution of sampling effort for that group (often called the "target group approach"; see e.g. Phillips et al., 2009;Powney et al., 2019).
Each of the five heuristics indicates the potential for bias in at least one of the spatial, temporal and taxonomic dimensions (Boyd et al., 2021). Heuristics one and two are straightforward: the first is the total number of records for a taxonomic group, and the second is the proportion of species known to occur in the Neotropics that have been recorded (i.e. inventory completeness).
We acknowledge that these are probably better described as measures of "coverage" than "bias". However, when one looks at how they change over time (as we do here), then they indicate the potential for temporal biases in recording intensity and taxonomic coverage, respectively, both of which will be important to take into account for accurate inference. Information on the number of species known to occur in the Neotropics, derived from the literature, online datasets (specifically for Anthophila), specialists and authorities in each taxonomic group (among the authors), is used to calculate heuristic two (Table 1).
The third heuristic is used to indicate preferential sampling of rare species. It is calculated by regressing the total number of records for each species on the number of grid cells (defined below) in which they have been recorded in each period. Each species' deviation from the fitted regression indicates the degree to which it is over- or under-sampled given its recorded range size (Barends et al., 2020). Extending this concept, we use the coefficient of determination (r²) from the model as a measure of "rarity bias". This heuristic ranges from 0, indicating high bias (rare species are over-sampled relative to commoner species), to 1, indicating no bias. Note that where there is a negative correlation between recorded range size and sample size this heuristic becomes problematic to interpret; this problem did not arise here.
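As a minimal sketch of this heuristic, assuming a simple linear regression with a single predictor (for which r² equals the squared Pearson correlation), the index for one period could be computed as follows. The function name and the example counts are hypothetical.

```python
def rarity_bias_index(n_cells, n_records):
    """r-squared from regressing records per species on grid cells occupied.

    For a one-predictor linear regression, r-squared is the squared Pearson
    correlation between the two variables.
    """
    n = len(n_cells)
    mean_x = sum(n_cells) / n
    mean_y = sum(n_records) / n
    sxx = sum((x - mean_x) ** 2 for x in n_cells)
    syy = sum((y - mean_y) ** 2 for y in n_records)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(n_cells, n_records))
    if sxx == 0 or syy == 0:
        raise ValueError("r-squared is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Invented counts for four species in one period.
cells = [1, 2, 3, 4]
proportional = rarity_bias_index(cells, [2, 4, 6, 8])   # records scale with range size
uneven = rarity_bias_index(cells, [10, 4, 6, 20])       # the rarest species is over-sampled
print(proportional, uneven)
```

Because r² discards the sign of the relationship, the slope should be checked separately, which is why the heuristic is hard to interpret when the correlation is negative.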
The fourth heuristic provides a measure of geographic bias; specifically, it measures the degree to which the data deviate from a random distribution in geographic space. This measure is based on the Nearest Neighbour Index (NNI; Clark and Evans, 1954). The NNI is given as the ratio of the average nearest neighbour distance of the empirical sample (using the associated coordinates) to the average nearest neighbour distance of a random distribution of the same density across the same spatial domain. We simulated 15 random distributions of equal density to the occurrence data, which allowed us to present the uncertainty associated with the index. For our NNI, values may range from 0.00 to 2.15: values below 1 indicate that the data are more clustered than a random distribution, values of ~1 indicate that the data are randomly distributed and values above 1 signify over-dispersion relative to a random distribution. We acknowledge that some records available on GBIF have been converted to point locations from, for example, gridded datasets. In these cases, coordinates are only approximate and the NNI may be distorted.
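The calculation can be sketched in Python under simplifying assumptions: planar coordinates, a rectangular spatial domain and uniform random simulations. These choices, the function names and the example points are illustrative, not a description of the implementation used for the analysis.

```python
import math
import random


def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbour (brute force)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)


def nni_samples(points, bounds, n_sim=15, seed=0):
    """One NNI value per simulated random distribution of equal density."""
    xmin, xmax, ymin, ymax = bounds
    rng = random.Random(seed)
    observed = mean_nn_distance(points)
    ratios = []
    for _ in range(n_sim):
        simulated = [
            (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in points
        ]
        ratios.append(observed / mean_nn_distance(simulated))
    return ratios


# Twenty invented points packed into one corner of a 10 x 10 domain.
clustered = [(0.01 * i, 0.01 * j) for i in range(5) for j in range(4)]
ratios = nni_samples(clustered, bounds=(0.0, 10.0, 0.0, 10.0))
print(min(ratios), max(ratios))
```

The spread of the simulated ratios gives a simple picture of the uncertainty in the index; strongly clustered points, as in this example, yield values well below 1.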
The fifth and final heuristic indicates whether the same portion of geographic space has been sampled over time. Variation in geographic sampling confounds space and time, which can result in serious inferential problems if population trends have not been uniform over space. This heuristic comprises a gridded map indicating the number of time periods (defined below) in which each grid cell has been sampled. Of course, changes in the geographic distribution of records could indicate changes in species' distributions and not a bias. However, we suggest that, when working at the taxon group level (i.e. across many species) and at a coarse resolution (see below), changes in which cells have records are most likely to reflect a bias.
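A minimal sketch of how such a map could be tabulated is given below, assuming 1° grid cells and decadal periods starting in 1950 (the resolution used elsewhere in this paper). The function names, the cell indexing scheme and the example records are hypothetical.

```python
import math
from collections import defaultdict


def decade_index(year, first_year=1950):
    """0 for 1950-1959, 1 for 1960-1969, and so on."""
    return (year - first_year) // 10


def grid_cell(lat, lon, size=1.0):
    """Identify a grid cell by the floor of its coordinates (an assumed scheme)."""
    return (math.floor(lat / size), math.floor(lon / size))


def periods_sampled(records):
    """Map each grid cell to the number of distinct decades with any record."""
    decades_by_cell = defaultdict(set)
    for lat, lon, year in records:
        decades_by_cell[grid_cell(lat, lon)].add(decade_index(year))
    return {cell: len(decades) for cell, decades in decades_by_cell.items()}


# Invented (latitude, longitude, year) records.
example = [
    (-33.4, -70.6, 1955),
    (-33.9, -70.1, 1987),
    (-33.2, -70.9, 2015),
    (-20.5, -69.3, 1992),
]
counts = periods_sampled(example)
print(counts)
```

Cells with a count equal to the number of periods have been sampled throughout; low counts flag places where space and time are confounded.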
It is important to conduct bias assessments at the spatio-temporal resolution (grain size) at which inferences about species' distributions are desired. Otherwise, one might inadvertently "smooth over" biases evident only at finer scales (Pescott et al., 2019). In this case, preliminary screening indicated that the data clearly would not per-

| Data
To determine the extent to which the digitization of historic collections can improve our ability to estimate trends in species' distributions, we focussed on two recent mobilization efforts in Chile. The

| Utility of data for trend estimation
To compare the utility of the GBIF data before and after the addition of the two datasets described above, we focussed on Chile, where the newly-mobilized data were collected, and on the bees (Anthophila), because both datasets include a large number of records for this taxon. We began by comparing the total quantity of data before and after digitization, the quantity of records with no spatial issues and the total number of species represented. We then used the five heuristics described earlier to compare the biases in the data pre- and postdigitization. Finally, we compared estimated temporal trends in Anthophila distributions in Chile derived from GBIF before and after the additional data became available.

| Trend estimation
To estimate temporal trends in bee distributions in Chile, we used three statistical models. These include the model of Telfer et al. (2002), and two variants of the "reporting rate" model (Franklin, 1999): the basic model (RR + LL) and a slightly more complex model, which includes a random site (grid cell) effect (RR + LL + site; Roy et al., 2012). These models have been discussed at length elsewhere (Isaac et al., 2014; Pescott et al., 2019). Each of the models provides a species-specific measure of change in range size after attempting to correct for changes in recording intensity (see the Supporting Information for full details of the models used here). We fitted the RR models at the same resolution as the bias assessment: 1° grid cells in decadal time periods. The Telfer method is slightly different in that it can only be used to compare range sizes between two time periods; hence, we designated the first three and last three decades in our analysis as the first and second periods, respectively (data from the decade in between these periods were not used to fit this model). All models were fitted using the R (R Core Team, 2019; version 4.1.0) package sparta (August et al., 2020).
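To give a feel for the two-period comparison, the following Python sketch computes a Telfer-style change index. It is a simplified illustration and not the sparta implementation: it assumes an unweighted least-squares regression of logit-transformed proportions of occupied cells in the second period on those in the first, with standardized residuals as the index. The published method may include refinements (such as weighting) that are omitted here, and the function names and counts are hypothetical.

```python
import math


def logit_proportion(occupied, total):
    """Logit of the proportion of cells occupied, with a continuity adjustment."""
    p = (occupied + 0.5) / (total + 1.0)  # keeps the logit finite at 0 and total
    return math.log(p / (1.0 - p))


def telfer_like_index(cells_period1, cells_period2, total_cells):
    """Standardized residuals from regressing period-2 on period-1 logit proportions."""
    x = [logit_proportion(c, total_cells) for c in cells_period1]
    y = [logit_proportion(c, total_cells) for c in cells_period2]
    n = len(x)
    if n < 3:
        raise ValueError("need at least three species")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [r / residual_sd for r in residuals]


# Invented numbers of occupied cells for four species, out of 100 shared cells.
index = telfer_like_index([5, 10, 20, 40], [6, 4, 30, 45], total_cells=100)
print(index)
```

Negative values indicate species that occupy fewer cells in the second period than expected from the overall relationship across species, so the index measures change relative to the other species in the group rather than absolute change.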
To assess the extent to which the digitization of the historic data has changed our ability to estimate trends in species' distributions, we fitted models to both the pre-and postdigitiza-
In addition to temporal bias in data quantity, the data are also biased taxonomically, and the extent of these biases varies over time. First, for all taxa, the proportion of known species recorded within GBIF is <1. The leaf-nosed bats and hummingbirds are, however, best represented: in the early decades, around 75% of species in these groups were recorded, and in the later decades, this increased to almost 100%. Data are not available for the vast majority of known bee and hoverfly species (Figure 1b). Second, for most groups, rare species tend to be overrepresented in the data. Recall that the taxonomic bias index in Figure 1c

To reveal the potential for spatial biases in the data, we looked at the degree to which they are clustered in particular portions of the Neotropics using the NNI. For all groups, and in all decades, the data are more clustered than would be expected by chance (Figure 1d).
Whilst the NNI indicates that the data depart from a random dis-

| Data quantity
The two newly-mobilized datasets drastically increased the availability of Anthophila records collected in Chile between 1950 and 2019 on GBIF (Table 2). The total number of records and the number of records without common spatial issues (see Methods) increased approximately sixfold; the number of records with no spatial issues and which are identified to species level increased approximately sevenfold; and the number of species recorded increased from 326 to 356 (Table 2). The increase in species recorded in GBIF represents a move from 70% to 77% of the 464 bee species known to occur in Chile (López-Aliste et al., 2021).
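The fold changes and percentages quoted above follow directly from the counts in Table 2, as the short Python check below shows. Only the dictionary keys are invented; the numbers are those reported in the table and text.

```python
# Counts from Table 2 (predigitization and postdigitization).
pre = {"records": 6635, "no_spatial_issues": 6413, "species_level": 5574, "species": 326}
post = {"records": 38807, "no_spatial_issues": 37863, "species_level": 37024, "species": 356}
known_species = 464  # bee species known to occur in Chile

# Ratio of post- to predigitization counts for each metric.
fold_change = {key: post[key] / pre[key] for key in pre}

# Percentage of known species represented before and after.
coverage_pre = 100 * pre["species"] / known_species
coverage_post = 100 * post["species"] / known_species

for key, value in fold_change.items():
    print(f"{key}: {value:.2f}-fold")
print(f"species coverage: {coverage_pre:.1f}% -> {coverage_post:.1f}%")
```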

| Biases
Whilst the newly-digitized data drastically increased the quantity of data available for bees in Chile, they did not reduce all forms of bias and, in some cases, increased their severity. For example, Figure 2a shows that the vast majority of the new data were collected in decades two, three and four. A corollary is that the addition of these data introduced strong temporal biases in data quantity (Figure 2a,b). Moreover, in the full dataset, on average, preferential sampling of rare species is more apparent (Figure 2c).

FIGURE 1 Heuristics indicating the potential for bias in GBIF data for bees (Anthophila, green lines), hoverflies (Syrphidae, purple lines), leaf-nosed bats (Phyllostomidae, orange lines) and hummingbirds (Trochilidae, pink lines) across South and Central America. The data are assessed in seven decades between 1950 and 2019 (01/01/1950-31/12/1959, … 01/01/2010-31/12/2019). Panel (a) shows the number of records for each taxon in each of the seven decades in our analysis; these values are normalized by dividing by the number of records in the best-sampled decade per group for visual purposes. Panel (b) shows the proportion of species known to occur in the Neotropics that were recorded. Panel (c) shows an index of proportionality between species' recorded range sizes and the number of times they have been recorded in each decade (0 = low and 1 = high). Panel (d) shows the nearest neighbour index (NNI) for each taxon and decade, which indicates the degree to which the data are clustered (values further from 1 are more clustered). Shaded regions denote the 2.5th and 97.5th percentiles calculated by comparing the data to 30 random distributions. Panels (e)-(h) show the number of decades in which each 1° grid cell was sampled for each taxon.

Finally, the addition of new records did little to increase the geographical representativeness of the data: the NNIs indicate a similar, if not slightly greater, departure from a random distribution in the full dataset (Figure 2d). However, we remind the reader that the NNI cannot determine whether the data are nonrandomly distributed due to sampling biases or a taxon's true distribution.
Whilst the newly-digitized records did little to reduce some forms of bias in the available data, they improved the situation in other respects. The addition of the new data resulted in a more consistent level of taxonomic coverage across decades (~30%-40% of species known to occur in Chile; Figure 2b). They also increased the number of grid cells that have records in multiple decades, with many grid cells even having records from all decades (Figure 2e,f).

| Trend estimates
It was not possible to fit all models for all 356 species of Anthophila for which data are available in Chile, particularly when using the predigitization data. For the Telfer model we omitted species that were not recorded in at least two grid cells in the first time period: see Telfer et al. (2002) and the Supporting Information for the rationale. As a result, it was only possible to estimate distribution changes for 32 species using the Telfer method with the predigitization data. A separate problem emerged when fitting the relatively complex RR + LL + site model using the predigitization data: models for 21 species returned "singular fits". Singular fits occur where the estimated variance of the random intercept is 0, which can indicate that the model is overfitted. As a result, we only included the 325 species for which RR + LL + site models were successfully fitted, but also fitted the simpler RR + LL models, which do not include random effects; these models were successfully fitted for all 356 species.
As we wanted to compare the pre- and postdigitization models, for each model type, we were limited to including only those species whose distribution changes could be estimated using the predigitization data (even though many more species' distributions could be estimated using the postdigitization data).
Agreement between models fitted using the pre- and postdigitization data is generally strong, but there is some variation between model types (Figure 3). The correlations between predictions are 0.84, 0.83 and 0.52 for the Telfer, RR + LL and RR + LL + site models, respectively (Pearson's r; p < .001 in all cases; n = 32, 356 and 325, respectively).
Whilst the point estimates predicted by the models are highly congruent, there is strong evidence that the standard errors of the RR models' predictions are smaller when fitted to the postmobilization data than the premobilization data (Mann-Whitney U test; p < .001 in both cases; see the Supporting Information). This is not surprising given that the standard error of regression coefficients is a decreasing function of sample size, which increased sixfold (across species) with the addition of the newly-mobilized records.
To make a simple assessment of whether the newly-digitized data improved the accuracy of estimated trends, we focused on Bombus terrestris, which has been continually introduced to Chile since the 1990s (i.e. midway through the time series) and has expanded widely since. We were not able to estimate a trend for B. terrestris using the

| DISCUSSION
In this paper, we have demonstrated the need for analysts to use publicly available species occurrence data with caution when estimating trends in species' distributions. We began by providing evidence of sampling biases in available data on the occurrences of bees, hoverflies, leaf-nosed bats and hummingbirds collected in the Neotropics. We also showed that two recent data digitization efforts reduced some biases in the bee records collected in Chile, but introduced others. Finally, we showed that, despite a dramatic increase in data quantity, statistical models fitted to the pre- and postdigitization datasets produced broadly similar estimates of temporal trends in species' distributions (Figure 3).
The data-driven heuristics used here indicate nonrandom sampling along the axes of space, time and taxonomy. However, one might not expect presence-only data to be randomly distributed; for example, it is possible that the data are nonrandomly distributed across the continent because the taxa are truly concentrated in certain portions of geographic space. We showed that the data for the leaf-nosed bats and hummingbirds were nonrandomly distributed (Figure 1d) due to the availability of many records in the Andean region in Ecuador and Colombia (Figure 1g,h and Figure S3 and Figure S4 in the Supporting Information). This likely reflects the fact that these taxa are most diverse in this region (Ellis-Soto et al., 2021; Villalobos and Arita, 2010). Similarly, the distribution of data for bees is fairly consistent with areas of high species richness as estimated by Orr et al. (2021). For hoverflies, however, the nonrandom distribution of records more likely reflects sampling biases and the fact that most information remains undigitized in museums or other collections. For example, there is almost a complete absence of data in Venezuela and Paraguay, which is known to reflect a lack of monitoring (Montoya et al., 2012). There are also data on hoverfly occurrences from Colombia (Montoya, 2016), Brazil (Borges and Couri, 2009), Ecuador (Marín-Armijos et al., 2017) and Chile (Barahona-Segovia et al., 2021) that are yet to be digitized.

TABLE 2 Quantity of data on Anthophila collected in Chile over the period 1950-2019 before and after the addition of the newly-digitized records (Lopez-Aliste and Fonturbel, 2021a, 2021b)

| Metric | Predigitization | Postdigitization |
| --- | --- | --- |
| Total number of records | 6635 | 38,807 |
| Number of records without common spatial issues | 6413 | 37,863 |
| Number of records with no spatial issues and identified to species level | 5574 | 37,024 |
| Total number of species | 326 | 356 |

Much of the data for all taxa were collected in Mexico. In the case of the bees and hoverflies, this could reflect the fact this region has suitable habitat for many species. Mexico is a hotspot of endemic plants on which many species may depend (Myers et al., 2000); indeed, it hosts one of the richest bee faunas worldwide. However, Mexico is not considered a hotspot for leaf-nosed bats or hummingbirds (Ellis-Soto et al., 2021; Villalobos and Arita, 2010), so, for these taxa, the large number of records in this region likely reflects disproportionately high sampling or mobilization effort. In turn, leaf-nosed bat and hummingbird trends in Mexico would contribute disproportionately to any larger-scale trends (e.g. across the Neotropics) based on these data, unless serious mitigating action was taken. The fact that nonrandom distributions of presence-only data can reflect both sampling biases and species' true distributions reinforces the need for analysts to consult other sources of information, such as regional experts, in addition to the available data itself.
Notwithstanding the fact that the data for some taxa might be more geographically representative than the data-driven heuristics suggest, it is not possible to conclude that the available data for any  . For example, a species might fare well in one portion of the continent and less well in another; if the data were sampled from the former portion in one period and the latter portion in the next, then one might come to the artefactual conclusion that the species is in decline. Our results clearly demonstrate the need for analysts to properly scrutinize such data before using them to draw inferences about trends in species' distributions.
It is possible that the extent of the biases revealed here would differ had we consulted additional databases or considered alternative GBIF search terms. Whilst the data in many local databases ultimately end up on GBIF, there will be others that do not. Given the biases in the GBIF data revealed here and by others (e.g. Rocha-Ortega et al., 2021), it would be prudent for analysts to seek out such additional data. We have also been made aware that our GBIF search terms missed an appreciable number of hymenopteran records, which include bees, held by the American Museum of Natural History (Neil Cobb, pers. comm.). These records can be accessed through GBIF, but currently lack associated metadata on the date or year of collection.
Hence, it was not possible to use them in our analysis and they were not picked up by our search (which was temporally explicit).
The mobilization of historic records is the most direct (and arguably cost-effective) way to understand biodiversity change over the last few hundred years (Nelson and Ellis, 2019; Page et al., 2015).
However, to our knowledge, there have been no explicit comparisons of the utility of available data for a given inferential goal before and after the mobilization of such records. We identified two recent mobilization efforts that increased the quantity of data on bee occurrences in Chile approximately sixfold. The addition of these records had a mixed effect on sampling biases in the available data: a larger fraction of bee species are represented in the postdigitization data across decades, and more grid cells had been sampled in more decades; however, across decades there are stronger biases towards rare species and decades two to four. Whilst perhaps intuitive to some, the point that more data do not necessarily equal less bias is an important one and has the potential to be overlooked given the abundance of records now available to ecologists.
In terms of estimates of temporal trends in bee distributions in Chile, the addition of the newly-mobilized data had only a modest effect. This is indicated by fairly strong correlations between the predictions from the models fitted to the predigitization data and those fitted to the full dataset (Figure 3).

FIGURE 3 [...] (Telfer et al., 2002). Note that, respectively, one and three extreme outliers are omitted in panels (b) and (c).

It is not possible to conclude that the mobilization of historic records improves our ability to estimate trends in species' distributions in this case.
Targets for data mobilization have previously been defined in terms of data quantity. For example, GBIF aimed to serve one billion records by 2010 (Peterson et al., 2015). We share the sentiment of others (Meyer et al., 2015;Peterson et al., 2015) that a better strategy would be to target the mobilization of data that would be most informative for some inferential goal. Studies like ours could be used as "gap analyses" to establish where best to target new mobilization efforts along the axes of space, time and taxonomy. Such studies could also inform decisions on where best to focus future adaptive or targeted sampling efforts and for which taxa. However, we acknowledge that there will always be trade-offs between the mobilization/sampling strategy (e.g. to reduce bias), funding, logistics, the availability of experts (particularly taxonomists) and local interests.
There remain substantial gaps in knowledge about the status of pollinating species worldwide, and the effectiveness of measures to protect them, with evidence largely biased towards Europe and North America (Dicks et al., 2016; Zattara and Aizen, 2021). Our study builds on others, such as Sousa-Baena et al. (2014) who looked at plants, in reinforcing the urgent need for strategic data mobilization, and for targeted monitoring in selected locations. The aim should be to get as close as possible to a representative sample along the axes of space, time and taxonomy. This will be challenging both logistically and financially, but the benefits would almost certainly outweigh the costs (Breeze et al., 2021).

We would like to thank Neil Cobb for making us aware of hymenopteran records from the American Museum of Natural History, which had not been picked up by our GBIF search terms, and three anonymous reviewers for their valuable comments on a previous version of this manuscript.

CONFLICT OF INTEREST
The authors have no conflicts of interest to declare.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/ddi.13551.

DATA AVAILABILITY STATEMENT
The GBIF data can be accessed using the