Born‐digital biodiversity data: Millions and billions

Given the dramatic pace of change of our planet, we need rapid collection of environmental data to document how species are coping and to evaluate the impact of our conservation interventions. To address this need, new classes of “born digital” biodiversity records are now being collected and curated many orders of magnitude faster than traditional data. In addition to the millions of citizen science observations of species that have been accumulating over the last decade, the last few years have seen a surge of sensor data, with eMammal's camera trap archive passing 1 million photo‐vouchered specimens and Movebank's animal tracking database recently passing 1.5 billion animal locations. Data from digital sensors have other advantages over visual citizen science observation in that the level of survey effort is intrinsically documented and they can preserve digital vouchers that can be used to verify species identity. These novel digital specimens are leading spatial ecology into the era of Big Data and will require a big tent of collaborating organizations to make these databases sustainable and durable. We urge institutions to recognize the future of born‐digital records and invest in proper curation and standards so we can make the most of these records to inform management, inspire conservation action and tell natural history stories about life on the planet.

Museum specimens have always provided the most basic information about the spatial distribution of life on earth: which species live where and when. These records have formed the basis for our biodiversity range maps, biogeography and conservation planning (Suarez & Tsutsui, 2004). As the pace of global change accelerates, we need more biodiversity data to monitor how species are responding, which are most in need of conservation efforts, and what kinds of impacts these efforts deliver (Dirzo et al., 2014).
A recent paper by Farley, Dawson, Goring, and Williams (2018) discussed ecology's transition into the era of big data and showed exponential increases in biodiversity records in the Global Biodiversity Information Facility (GBIF) and other museum databases. A growing digital archive should put us in a good position to monitor change. However, another recent paper by Malaney and Cook (2018) showed that traditional museums are not actually keeping pace. Mammal specimen collecting in the United States peaked around 1990 and has dropped by a factor of three since then, with fewer than 5,000 specimens collected annually in recent years. That this is the situation for North American mammals, one of the world's best-surveyed faunas, sheds stark light on how poor a resolution incoming specimens will provide for understanding changes in global biodiversity. But what, then, explains the mismatch between the increases in GBIF data and the decreases in actual specimen collection?

| BORN DIGITAL BIODIVERSITY
The discrepancy is explained by a new class of biodiversity data that is collected electronically, or "born digital". These are not a replacement for physical museum specimens, which are useful in ways that digital collections can never be, including studies of genomic diversity, dietary ecology, disease ecology and morphology, among many other as yet undiscovered types of information (Holmes et al., 2016). However, born-digital records are documenting our biodiversity at a faster pace and higher resolution than physical museum specimens ever could. Most of this growth is through human-observed data: 98% of GBIF vertebrate records since 2015 are observations (GBIF, 2018), and as of 2019, 94% of all biodiversity records in GBIF were observations. The volume of these observations has clearly led to new insight, enabled in part by sophisticated data filtering algorithms (Kelling, Yu, Gerbracht, & Wong, 2011), but the accuracy of these observations is typically impossible to check since most do not have any record that can be verified (i.e. no voucher specimen retained as a reference); indeed, <1% have associated media that could function as a photographic or acoustic voucher. Furthermore, Bayraktarov et al. (2019) question whether the big unstructured biodiversity data provided by nonstandardized surveys really mean more knowledge. Approaches that do not document details of sampling effort, or that give incomplete species records, will be of dubious value for modelling efforts to establish predictive relationships between species and environmental conditions (Bayraktarov et al., 2019; Steger, Butt, & Hooten, 2017). Fortunately, two sensor-driven types of born-digital biodiversity data, camera traps and animal tracking devices, are now maturing and coalescing to provide verifiable big data with well-documented sampling protocols and survey effort (Kays, Crofoot, Jetz, & Wikelski, 2015; Steenweg et al., 2017).
The scale of data collected by these sensors has rapidly caught up with museums and citizen observations (Figure 1). Although sensors are not a solution for all groups, existing data represent a diversity of bird and mammal groups from around the world, including species of conservation concern (Figure 2).
As a photo-vouchered spatial record of biodiversity, camera traps offer a direct parallel to the museum mammal specimen because the identity of the species can be verified in the photograph, potentially even automatically through artificial intelligence (He et al., 2016).
Although not all species can be visually distinguished (Potter, Brady, & Murphy, 2018), camera traps are useful for most medium or large terrestrial mammals and have recently proven effective for small mammals and canopy fauna (Bowler, Tobler, Endress, Gilmore, & Anderson, 2016; McCleery et al., 2014). Camera traps also have the advantage of clearly recording sampling effort (where they are run and for how long), which is typically not known for museum collections or citizen science observations. Since building eMammal as a repository for camera trap photographs at the Smithsonian in 2012, we have seen steady growth of records and by 2019 have >1 million georeferenced, vouchered animal records (Figure 1). To put this in perspective, the world's largest physical mammal collection, also at the Smithsonian, has just under 600,000 georeferenced mammal records spanning 180 years, and the second-largest mammal collection (Museum of Southwestern Biology) has about half that (Dunnum et al., 2018). Furthermore, eMammal probably represents a relatively small amount of the camera trap data collected in the last decade.
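Because camera traps record where and for how long they were run, detections can be standardized by effort. The following is a minimal Python sketch of that idea, computing detections per 100 trap-nights; the `Deployment` record, its field names and the example numbers are hypothetical illustrations, not eMammal's schema or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    """One camera at one site; all field names here are hypothetical."""
    camera_id: str
    start: date
    end: date
    detections: int  # detections of the focal species during this deployment


def trap_nights(dep: Deployment) -> int:
    """Number of nights the camera was running."""
    return (dep.end - dep.start).days


def detection_rate(deployments: list[Deployment], per: int = 100) -> float:
    """Detections per `per` trap-nights, pooled over all deployments."""
    effort = sum(trap_nights(d) for d in deployments)
    if effort == 0:
        raise ValueError("no sampling effort recorded")
    return per * sum(d.detections for d in deployments) / effort


# Made-up example: two cameras, 30 and 20 nights, 10 detections in total.
deployments = [
    Deployment("cam01", date(2019, 3, 1), date(2019, 3, 31), 6),
    Deployment("cam02", date(2019, 3, 1), date(2019, 3, 21), 4),
]
rate = detection_rate(deployments)  # detections per 100 trap-nights
```

A rate of this kind is only comparable across surveys when effort is recorded, which is the contrast drawn here with most museum and citizen science records.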
A new collaborative camera trapping project called Wildlife Insights will provide the artificial intelligence and automated analytical tools to process and analyse big data efficiently, thereby bringing together even more born digital data from around the world for effective and more timely monitoring of animal life on Earth.
Modern GPS technology has empowered the animal tracking field to grow even faster. For example, the Movebank animal tracking database we established at the Max Planck Institute of Animal Behaviour in 2009 has over 1.5 billion georeferenced animal records (Figure 1).
FIGURE 1 Total size of georeferenced datasets available for birds and mammals from GBIF (museum specimens, observations), Movebank (number of individual animals tracked, total locations tracked) and eMammal (camera trap detections). Data available at https://doi.org/10.5061/dryad.b42j56r
Animal tracking data are inherently autocorrelated and so do not function as statistically independent occurrences in spatial models as museum specimens typically do. However, this more detailed [...] reference of other biodiversity data (Pacifici et al., 2017). As tracking tag technology miniaturizes, we can track smaller and smaller species (Kays et al., 2015). For example, the new ICARUS antenna that was recently mounted on the International Space Station allows the global tracking of 5 g GPS transmitters suitable for tracking 100 g birds with GPS accuracy and near-global data readout (Wikelski et al., 2007).
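One simple and widely used way to reduce (though not remove) the autocorrelation of tracking data before treating locations as occurrences is to thin each track to a minimum time gap between retained fixes. The sketch below illustrates that idea only; it is not a method described by the authors, and the `Fix` tuple layout, the 6-hour gap and the synthetic track are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical record layout: (timestamp, longitude, latitude)
Fix = tuple[datetime, float, float]


def thin_track(fixes: list[Fix], min_gap: timedelta) -> list[Fix]:
    """Keep a fix only if at least `min_gap` has passed since the last kept fix."""
    kept: list[Fix] = []
    for fix in sorted(fixes, key=lambda f: f[0]):
        if not kept or fix[0] - kept[-1][0] >= min_gap:
            kept.append(fix)
    return kept


# Synthetic track: one fix every 15 minutes for 24 hours (96 fixes).
start = datetime(2019, 5, 1)
track = [(start + timedelta(minutes=15 * i), 9.0 + 0.001 * i, 47.0) for i in range(96)]

# Retain at most one fix per 6 hours (an arbitrary, illustrative gap).
thinned = thin_track(track, timedelta(hours=6))
```

The appropriate gap depends on the species and the question, and thinning discards the fine-scale movement information that makes tracking data valuable in the first place, so it suits occurrence-style analyses rather than movement analyses.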

| USES FOR BORN-DIGITAL RECORDS
The collection of born-digital data in the first place is motivated primarily by spatial ecology, with animal tracking usually considering [...]
In addition to these empirical uses of born-digital data, we also see great potential to use the images and stories of these animals to help connect people with nature and inspire them to contribute to [...]
For example, Movebank now has consecutive live feeds from ca. 5,000 animals via GSM or satellite networks delivering approximately 1 million animal locations per day, while eMammal has ca. 10,000 camera trap detections uploaded per week.

| DIGITAL BIODIVERSITY INFRASTRUCTURE SUPPORT
While most statistical analyses are performed locally by scientists, there is also a need for web-based analytics to enable real-time monitoring and to make data available to users as diverse as land managers and school children (Schuttler et al., 2018).
The cyberinfrastructure and data curation that make big data ecology possible are expensive. They require not only extensive bandwidth and server space, but also web interfaces and analytical tools. A database is never "done" but needs continual support to pay for the never-ending updates that maintain security and connectivity, not to mention upgrades to support new user needs and data streams. The natural history museums that hold our physical specimens are one logical home for these cybertools, but broader collaboration is needed across government organizations, NGOs, universities and research institutes to bear the annual costs of supporting born-digital big data biodiversity. End users should also recognize the value of born-digital data and tools to their work and expect some payment for services to be part of future funding models. Charging for data access would be against the spirit of open data and would discourage wide use of these resources, particularly in developing countries that host much of the planet's biodiversity. We believe that data should be freely ingested and freely provided in standard format. However, we suggest it is appropriate to charge for premium services such as streamlined ingestion of very large data sets (e.g. >1 Hz sensor streams), more complicated derived data products or feature-heavy analytic protocols, which would not only help sustain this cyberinfrastructure but also widen the potential audience of users for these data.
Natural history museums were created as institutions to protect physical specimens so that they are available to researchers in perpetuity, and to use the objects and science stories to engage and educate a broad audience through exhibits and programming. Born-digital biodiversity data have the same potential for research and engagement value, but instead of shelving and taxidermists, we need to invest in servers, programmers and apps if we are to make them work as long-term records of planetary change and as inspiration for people to care about the natural world.

ACKNOWLEDGEMENTS
Big thanks to the larger teams that make eMammal and Movebank possible. We thank the Max Planck Institute of Animal Behaviour, the Smithsonian Institution, the North Carolina Museum of Natural Sciences, NASA, and the National Science Foundation for supporting Movebank and eMammal.