occCite: Tools for querying and managing large biodiversity occurrence datasets

documentation standards without sacrificing time and resources to the demands of providing increasing levels of detail on their datasets


Background
Recent advances in standards development (e.g. the Darwin Core; Wieczorek et al. 2012) and data publication methods (Robertson et al. 2014) have catalyzed cloud-based, open sharing of digitized natural history data in consistent formats (Constable et al. 2010). The result is over 1.6 billion occurrence records for species from across the tree of life served by the Global Biodiversity Information Facility (GBIF) alone (GBIF Secretariat 2020). As of November 2020, over 5000 peer-reviewed journal articles have been published that cite these data (GBIF Secretariat 2020), on topics ranging from biodiversity and biogeography to agriculture and climate change (Ball-Damerow et al. 2019).
These digitally accessible data are the result of millions of cumulative person-hours spent not only collecting organisms in the field, but also preparing and accessioning observations, specimens, and metadata post-collection (Hedrick et al. 2020). These data are also constantly being updated and revised as taxonomy changes and more specimens are accrued and digitized. Therefore, accession dates are a key piece of metadata to identify and trace possible data issues (Feng et al. 2019). However, despite an increasing interest in formalizing standards and metadata protocols for biodiversity data (Feng et al. 2019, Merow et al. 2019, Zurell et al. 2020), tools to manage the connections between occurrence data and the providers of those data, as well as assure proper citation as a key part of this process, are still nascent. Following best practices for citing datasets in a way that ensures a study is truly repeatable remains challenging given the size, complexity, and dynamism inherent in aggregating natural history data.
We must preserve the cycle of data citation from primary data sources to aggregating databases to research products and back again to primary data sources (Escribano et al. 2018). The citation cycle facilitates reproducibility and scientific transparency, but it is also key to supporting primary providers by documenting the use of their data. Data aggregators such as the Global Biodiversity Information Facility (GBIF) have made great strides in harvesting citations from research products and linking them back to primary data providers (Noesgaard 2019). However, this cycle functions only if those who publish research products cite primary data sources; in 2018, only 15% of research studies using GBIF data included the digital object identifier (DOI) for datasets provided by GBIF (Noesgaard 2019). Further, the R statistical computing environment (<https://www.R-project.org/>), which is used heavily by biodiversity researchers, has surprisingly few mechanisms to facilitate proper citation, unlike access through the GBIF web portal. In an era of shrinking funding for natural history museums and community science initiatives, but increasing relevance of the biodiversity they document, it is important to cite these primary data providers to highlight their efforts and emphasize their role as an essential link in the research chain.
Some packages in R provide tools that enable researchers to document sources during the data collection process. For example, rgbif (Chamberlain et al. 2020a) for GBIF and BIEN (Maitner 2020) for the Botanical Information and Ecology Network (BIEN) provide interfaces for their specific aggregator databases that include valuable features for citing data. However, these and other R packages that serve a single aggregator database are designed for specific use cases tailored to their databases, and uniting aggregator results into a single dataset brings its own set of challenges. Multiplatform occurrence aggregators do exist, and searches using spocc (Chamberlain 2019) can return occurrence information from up to six aggregator databases, but the process of combining data from these aggregators in each query results in the loss of key metadata: particularly accession date, primary data source, and in the case of GBIF, dataset DOIs. This loss is particularly important for software that uses occurrence downloading tools to supply data for biodiversity analyses (Kass et al. 2018, Osorio-Olvera et al. 2020). Finally, to our knowledge, there remains a deficit of R tools that manage metadata from an occurrence search and translate it into citations for primary data providers that include accession dates.
We here present a new package, occCite, which eases the burden on researchers to document increasingly complex occurrence datasets and associated metadata, with the aim of making studies on large, aggregated databases truly repeatable. occCite enables users to download data from multiple aggregator databases; manage, summarize, and visualize multi-database search results; and complete the data citation cycle by generating primary provider citations with DOIs and accession dates. occCite therefore preserves links between occurrence data and primary providers throughout the data-processing workflow. This package was initially developed as a module for dataset citation within the Wallace ecological modeling application (Kass et al. 2018) but was engineered to also have standalone functionality. We hope this package will enable more studies to reach emerging standards (Feng et al. 2019, Merow et al. 2019, Zurell et al. 2020) for occurrence citation practices that are fully open, repeatable, and acknowledge primary data providers for their hard work assembling, digitizing and publishing data.

Package overview
A stable version of occCite (ver. 0.4.7) is available via CRAN <https://CRAN.R-project.org/package=occCite>; the package is currently under review for inclusion in ROpenSci <https://ropensci.org/packages/> and the developer version can be accessed via GitHub <https://github.com/hannahlowens/occCite>. At its simplest, the occCite workflow follows a two-step process (Fig. 1). First, the user enters the names of one or more taxa into occQuery() and optionally, their GBIF login information (registration is free via the GBIF website and is required to access full metadata; <www.gbif.org>); occCite then checks these taxon names and searches for occurrence data via queries to the BIEN database (through BIEN) and/or the GBIF database (through rgbif). Search results and metadata are contained in an occCiteData object, both in their raw form and as a single table of results for each species with the date of observation, latitude, longitude, primary data source, and database aggregator source. The raw search returns are written to local memory for metadata purposes and to allow the user to access fields other than those in the single processed results table. The user can then pass the occCiteData object to occCitation(), which compiles citations and accession dates for the primary data providers based on metadata provided by BIEN and/or GBIF. occCiteData includes a summary() method that returns the taxonomic rectification and/or occQuery() search and associated metadata, and occCitation() includes a print() method that returns a formatted, alphabetized block of text with citations for each primary data provider represented in the search results. These text citations are one potential mechanism for data citation, as part of literature-cited sections (Riemer et al. 2018) or supporting information.
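The two-step workflow described above can be sketched roughly as follows. This is a minimal illustration rather than the package's canonical example: the wrapper name cite_occurrences() is hypothetical, the x and datasources argument names are assumed from the package documentation, and running the function requires occCite to be installed, a network connection, and (for GBIF) a GBIFLogin object.

```r
# Hypothetical wrapper around the two-step occCite workflow.
# `login` is assumed to be a GBIFLogin object (or NULL for a BIEN-only search).
cite_occurrences <- function(taxa, login = NULL, sources = c("gbif", "bien")) {
  stopifnot(is.character(taxa), length(taxa) > 0)

  # Step 1: query the aggregators; returns an occCiteData object
  oc <- occCite::occQuery(x = taxa, datasources = sources, GBIFLogin = login)
  print(summary(oc))  # search summary and associated metadata

  # Step 2: compile primary-provider citations with accession dates
  citations <- occCite::occCitation(oc)
  print(citations)    # formatted, alphabetized citation text

  invisible(list(data = oc, citations = citations))
}
```

Returning both objects invisibly keeps the raw search results available for later analysis while the citations are printed for copying into a manuscript.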

Package details
While the basic occCite workflow is quite simple, we have designed several options and features (Fig. 2) to allow users to build more customized workflows and visualize the results of their searches both as graphs (Fig. 3) and interactive maps generated using the leaflet package (Cheng et al. 2019). What follows is a detailed explanation of the architecture and options available to occCite users to optimize the workflow to their specific needs. Novel occCite functions, methods, and objects are bolded and italicized. A supporting information vignette that demonstrates these package details and includes examples of citation output is available <https://hannahlowens.github.io/occCite/>.

Setup
We provided a dummy login in Examples 1 and 2 to illustrate the format. A login is required because occQuery() is, in part, a wrapper around occ_download() from the rgbif package; this function is analogous to requesting a DOI-referenced dataset download via the GBIF website (Chamberlain et al. 2020). The username, email, and password are stored in the R working environment as a GBIFLogin object when they are supplied to occCite's GBIFLoginManager() function to simplify their specification for users.

    # Visualization features ---
    plot(simpleOC, bySpecies = FALSE, plotTypes = c("yearHistogram", "source", "aggregator"))
    occCiteMap(simpleOC, cluster = TRUE)
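For the login step, one way to avoid hard-coding credentials in a script is to read them from environment variables before building the GBIFLogin object. The sketch below is an assumption-laden illustration: the helper name make_gbif_login() is hypothetical, the variable names GBIF_USER, GBIF_EMAIL, and GBIF_PWD follow the rgbif convention, and the user, email, and pwd argument names are assumed from the occCite documentation.

```r
# Hypothetical helper: build a GBIFLogin object from environment variables.
make_gbif_login <- function() {
  creds <- Sys.getenv(c("GBIF_USER", "GBIF_EMAIL", "GBIF_PWD"))
  if (any(creds == "")) {
    stop("Set GBIF_USER, GBIF_EMAIL, and GBIF_PWD before querying GBIF.")
  }
  # Requires the occCite package; argument names are assumed
  occCite::GBIFLoginManager(
    user  = creds[["GBIF_USER"]],
    email = creds[["GBIF_EMAIL"]],
    pwd   = creds[["GBIF_PWD"]]
  )
}
```

Keeping credentials out of the script makes it easier to share analysis code alongside a publication without exposing a password.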
Through studyTaxonList(), it is possible for the user to choose from any of the available taxonomies in the Global Names Index <http://gni.globalnames.org/>, a component of the Global Names Architecture (Patterson et al. 2010). studyTaxonList() creates an occCiteData object, which can then be passed into occQuery() to perform an occurrence data search. Names without a match in the taxonomy of choice are returned via warning messaging and flagged in the occCiteData 'cleanedTaxonomy' slot to facilitate user review. Currently, occQuery() searches only for occurrence data that matches the input name in the occurrence database of choice, and thus does not return records corresponding to synonyms, misspellings, and other errors in the database(s).
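A rough sketch of the rectification step is given below. The wrapper name rectify_names() is hypothetical, the x and datasources argument names are assumed from the package documentation, and the taxonomy name "GBIF Backbone Taxonomy" is used only as an illustrative choice; running it requires occCite and a network connection.

```r
# Hypothetical wrapper: rectify taxon names and review flagged entries.
rectify_names <- function(taxa, taxonomy = "GBIF Backbone Taxonomy") {
  stopifnot(is.character(taxa), length(taxa) > 0)

  # Returns an occCiteData object holding the rectified names
  taxon_list <- occCite::studyTaxonList(x = taxa, datasources = taxonomy)

  # Inspect the 'cleanedTaxonomy' slot, where unmatched names are flagged
  print(taxon_list@cleanedTaxonomy)

  taxon_list
}
```

Reviewing the cleanedTaxonomy slot before calling occQuery() matters because, as noted above, the search does not return records filed under synonyms or misspellings.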

Query
The occQuery() function is designed to provide users with several ways to generate and optimize repeatable occurrence searches while keeping detailed metadata. occQuery() returns an occCiteData object that stores information on the type of query made (i.e. user-supplied list or phylogeny), the date of the query, the taxonomic resources used for name rectification, the accepted taxonomic names used in the search, the database aggregators searched, and a named list of search results corresponding to the taxonomic names used in the search (Fig. 2). There are also several optional arguments for occQuery() to load local downloads of GBIF data as well as previously prepared downloads being stored on GBIF's servers; these arguments are detailed below under 'Advanced Features'.

Citations
After the occurrence data search is complete, the resulting occCiteData object can be passed to occCitation() to generate citations for primary biodiversity databases. occCitation() returns an occCiteCitation data object, which is a named list with entries corresponding to the taxonomic names used to build a query. Each item in the list is a data frame with one row for each primary data provider to be cited. Columns include the name of the database aggregator, unique identifier code for the primary provider record as used by the database aggregator, the citation and accession date for the primary provider, and the number of occurrences supplied by that data provider. The print() method for an occCiteCitation object returns a formatted and alphabetized set of references, either as a single block of text for all species appropriate for addition to the references section of a publication, or as separate blocks of text for each species individually for more detailed source-parsing in publication Supporting information or other documentation. Examples of these outputs can be found in the package vignette <https://hannahlowens.github.io/occCite/>.
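To make the shape of this structure concrete, the following base R sketch builds a similar named list from mock records: one entry per taxon, one row per primary provider, with an occurrence count. It does not use occCite and is not the package's implementation; all column names, taxon names, and provider keys are invented for illustration, and the citation text and accession date columns are omitted for brevity.

```r
# Mock occurrence records; every value here is invented for illustration.
records <- data.frame(
  species      = c("Species A", "Species A", "Species A", "Species B"),
  aggregator   = c("GBIF", "GBIF", "BIEN", "GBIF"),
  provider_key = c("ds-001", "ds-001", "ds-xyz", "ds-002"),
  stringsAsFactors = FALSE
)

# Collapse records to one row per (aggregator, provider) pair with a count
count_by_provider <- function(recs) {
  recs$n_occurrences <- 1L
  aggregate(n_occurrences ~ aggregator + provider_key, data = recs, FUN = sum)
}

# Named list keyed by taxon, mirroring the structure described above
citation_tables <- lapply(split(records, records$species), count_by_provider)
print(citation_tables)
```

Keeping the tables separate per taxon is what allows references to be printed either as one combined block or species by species.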

Advanced features
Downloading data from GBIF can be time-consuming, especially for multiple species and/or species with many occurrence records. To save time when repeating a query that has been run in the past, the user has two options: 1) download previously prepared datasets from the GBIF servers (stored for six months after initial download request; GBIF Secretariat 2020); or 2) access previously downloaded datasets stored on their local machine. By default, occQuery() checks GBIF's servers for the user's previously prepared datasets before preparing a new dataset (this behavior can be disabled by setting the checkPreviousGBIFDownload argument to 'FALSE'). Alternatively, if the user wishes to access downloaded GBIF dataset .zip files on their local machine, the loadLocalGBIFDownload argument must be changed to 'TRUE' and the directory where the files are located must be specified via the GBIFDownloadDirectory argument. occQuery() will crawl through the specified directory and collect all the downloaded datasets contained in that folder and its subfolders. It will then import the most recent downloads for each species in the taxon list into the R working environment. These GBIF data can then be appended to a BIEN search (if desired) in the same way as if the user conducted a simple real-time search (Example 1); acquiring citation data follows the same set of steps as Example 1 (Example 2). occCite does not currently support mixed data download sources; that is, it is not possible to download GBIF datasets for some species and load the rest from local .zip files.
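Reloading local downloads might look roughly like the sketch below. The wrapper name reload_local_gbif() is hypothetical; the three download-related argument names come from the text above, while x, datasources, and GBIFLogin are assumed from the package documentation. Whether a login is needed when loading local files is not stated here, so it is left as an optional parameter.

```r
# Hypothetical wrapper: load previously downloaded GBIF .zip files from disk.
reload_local_gbif <- function(taxa, download_dir, login = NULL) {
  if (!dir.exists(download_dir)) {
    stop("Download directory not found: ", download_dir)
  }
  occCite::occQuery(
    x = taxa,
    datasources = "gbif",
    GBIFLogin = login,
    GBIFDownloadDirectory = download_dir,  # folder (and subfolders) to crawl
    loadLocalGBIFDownload = TRUE,          # read local .zip files
    checkPreviousGBIFDownload = FALSE      # skip the check of GBIF's servers
  )
}
```

Checking the directory up front gives a clearer error than letting the query fail partway through a long taxon list.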

Discussion
As the literature on biodiversity modeling expands, difficulties associated with generating and managing appropriate metadata to render these studies reproducible will continue to grow. Three recent papers have outlined complementary and interconnected visions for biodiversity model metadata reporting standards (Feng et al. 2019, Merow et al. 2019, Zurell et al. 2020). All three agree that sources of occurrence data are a basic necessity of model documentation workflows, although they differ in recommendations regarding the necessary level of detail for these data. Feng and colleagues (2019) reviewed recent literature and generated a checklist for reporting on modeling methods, recommending that researchers report the source of their occurrence data, as well as the download date and/or version of the data source used. Merow and colleagues (2019) did not make such a specific recommendation, but their Range Model Metadata Standards (RMMS) framework does require occurrence data sources to be reported. Most recently, Zurell and colleagues (2020) expanded on the RMMS data dictionary for their Overview, Data, Model, Assessment, and Prediction (ODMAP) protocol, designed specifically for species distribution model studies. They recommend not only that occurrence data sources be cited with accession dates, but that sample size per taxon and taxonomic reference system be reported. Both RMMS and ODMAP provide tools to generate a metadata document that supplements more traditional methods sections in biodiversity studies. However, neither of these sets of tools is designed to directly manage the complex stream of occurrence data upon which many biodiversity studies are built. occCite closes this gap by managing occurrence query information including dataset aggregators, accession dates, and taxonomic reference systems, as well as the primary data sources, observation dates, and per-taxon sample sizes of resulting occurrence records, along with reporting methods that can be directly fed into the RMMS workflow or copied into ODMAP.
occCite's utility in the context of biodiversity studies is clear, but its potential applications extend beyond facilitating citations for range modeling. We designed occCite to accept phylogenies into occurrence queries as a first step towards building documentation protocols for spatial comparative phylogenetics. By combining information on the evolutionary relationships among taxa with data on where those taxa are found, we can begin to more fully understand how ecological, geological, and evolutionary processes have shaped past and present biodiversity patterns. These inferences can then provide insight into the distributions of biodiversity in the future (Quintero and Wiens 2013, Jezkova and Wiens 2016).
The sheer amount of biodiversity data is growing every day, but documenting its use in scientific research following best-practice standards (Feng et al. 2019, Merow et al. 2019, Zurell et al. 2020) has not kept pace. As datasets and analyses continue to grow in size and complexity, ensuring acquisition and analysis protocols are completely reproducible requires increasing time and resources. Furthermore, in the process of aggregating data from many sources, citation linkages to primary data providers can be unintentionally severed, as workflows and tools in common use are not well developed for this key data management task. occCite was built to keep all the ease of using existing tools while significantly simplifying data citation production and improving reproducibility, as users can also more easily manage data resources stored either locally or on a cloud server. occCite also already integrates with the ecological modeling platform Wallace (Kass et al. 2018), thus strengthening existing, well-used tools meant to promote best-practice species distribution modeling and data management frameworks.

Figure 2. Schematic of full occCite architecture, including optional and required workflow routes.