palaeoverse: A community‐driven R package to support palaeobiological analysis

The open‐source programming language ‘R’ has become a standard tool in the palaeobiologist's toolkit. Its popularity within the palaeobiological community continues to grow, with published articles increasingly citing the usage of R and R packages. However, there is currently a lack of agreed standards for data preparation and of available frameworks to support the implementation of such standards. Consequently, data preparation workflows are often unclear and not reproducible, even when code is provided. Moreover, due to a lack of code accessibility and documentation, palaeobiologists are often forced to ‘reinvent the wheel’ to find solutions to issues already solved by other members of the community. Here, we introduce palaeoverse, a community‐driven R package to aid data preparation and exploration for quantitative palaeobiological research. The package is freely available and has three core principles: (1) streamline data preparation and analyses; (2) enhance code readability; and (3) improve reproducibility of results. To develop these aims, we assessed the analytical needs of the broader palaeobiological community using an online survey, in addition to incorporating our own experiences. In this work, we first report the findings of the survey, which shaped the development of the package. Subsequently, we describe and demonstrate the functionality available in palaeoverse and provide usage examples. Finally, we discuss the resources we have made available for the community and our future plans for the broader Palaeoverse project. palaeoverse is a community‐driven R package for palaeobiology, developed with the intention of bringing palaeobiologists together to establish agreed standards for high‐quality quantitative research. The package provides a user‐friendly platform for preparing data for analysis with well‐documented open‐source code to enhance transparency.
The functionality available in palaeoverse improves code reproducibility and accessibility, which is beneficial for both the review process and future research.


| INTRODUCTION
Since the development of large palaeontological datasets from the 1970s onwards, palaeontologists have increasingly adopted computational approaches to address questions about the history of life on Earth (Benton, 1999; Sepkoski, 1978). Today, most sub-disciplines within palaeontology regularly use large datasets to perform experiments in silico. This has initiated a 'Golden Age' of palaeontology (Sepkoski & Ruse, 2009), where extensive datasets of various formats are used to test macroevolutionary and macroecological hypotheses (e.g. Close, Benson, Alroy, et al., 2020; Mannion et al., 2014; Quental & Marshall, 2013; Zaffos et al., 2017). The growth and increasing availability of such datasets has made coding an integral part of palaeobiological research. Today, palaeobiologists commonly use code to clean (e.g. Flannery-Sutherland, Raja, et al., 2022; Zizka et al., 2019), analyse (e.g. Guillerme, 2018; Kocsis et al., 2019), and visualise data (e.g. Bell & Lloyd, 2015), as well as to build models (e.g. Silvestro et al., 2014; Starrfelt & Liow, 2016) and implement simulations (e.g. Barido-Sottani et al., 2019; Fraser, 2017; Furness et al., 2021; Jones et al., 2021). Although software has been developed in languages such as C++ (e.g. Garwood et al., 2019) and Python (e.g. Silvestro et al., 2014), the programming language R is currently the most popular in palaeobiology. This is due to the wide range of tools, in the form of R packages, available to help users work with their data. Many of these tools are borrowed or repurposed from ecology (e.g. Chao et al., 2014; Oksanen et al., 2020), while others have been developed to specifically handle fossil data (e.g. Kocsis et al., 2019; Lloyd, 2016).
In spite of the growth of analytical tools, few packages explicitly focus on preparing data for analyses, forcing users to construct custom scripts. This can result in distinct differences in code style and practices amongst the community, including in code legibility and documentation. Accordingly, custom scripts can be inaccessible to other users (Filazzola & Lortie, 2022). Although increasingly requested by journals, code is also not always provided as supplementary material nor made available in online repositories (e.g. GitHub, Zenodo, Dryad). A lack of available code can lead to research results being unreproducible, preventing future studies from extending the work. Even when code is available, it might be poorly documented or written in a way that is specific to the dataset being analysed, and as such it might require extensive reworking before it can be applied to other data. Consequently, researchers are often forced to 'reinvent the wheel', putting time and effort into writing code that already exists but is unavailable, inaccessible, and/or difficult to repurpose (Filazzola & Lortie, 2022). Such issues are exacerbated by the absence of community standards for how data should be prepared for analyses; differing approaches utilised by different researchers result in a lack of consistency between studies, making comparison between results challenging. Thus, there is a well-established need for both protocols and tools for preparing palaeontological data for further analysis.
Here, we introduce the R package palaeoverse, a community-driven toolkit for streamlining palaeobiological analyses and improving code accessibility and reproducibility. Our approach differs from other palaeontological R packages in that it aims to bring the palaeobiological community together to establish consensus on the steps taken in data preparation for analysis and how these steps should be implemented. The package contains functions that align with current researcher needs to clean, prepare, and explore occurrence datasets for further analysis. These needs were established via a survey conducted by members of a new working group. The functionality of palaeoverse is purposefully flexible and can be applied to a wide variety of occurrence datasets. In this paper, we report results from the survey, describe and detail the functionality of palaeoverse, and illustrate its features with usage examples.

KEYWORDS
analytical palaeobiology, computational palaeobiology, R programming, readable, reproducible, reusable

| COMMUNITY SURVEY
To assess the needs of the palaeobiological community, we conducted an online survey. The survey was distributed via social media (Twitter) and email, and it included questions related to researchers' previous experience, their pre-existing code (to identify potential contributions), and what functionality they would consider to be useful in a new palaeobiological toolkit. We summarise the types of data that survey participants typically work with, the tasks commonly carried out when working with these data, and the tools they would like to have access to, in Figure 1.
We found that survey participants (n = 35) work with a wide range of data (Figure 1) and that the checking and transformation of data is the most common task. A wide variety of functions were requested by survey participants, with data plotting, time binning, and data access commonly suggested (Figure 1). Over 40% of participants also indicated that they were willing to contribute code to palaeoverse, highlighting the potential for a community-driven project. Specific details regarding the survey and responses can be found in the Supporting Information.

| PACKAGE DESCRIPTION
After conducting the community survey, we combined participant input with our own experience to develop a toolkit for palaeobiologists: the palaeoverse R package. The package provides auxiliary functions to support data preparation and exploration for palaeobiological analysis. A summary of the functions currently available in palaeoverse is provided in Table 1.

| Installation
The palaeoverse package can be installed from CRAN using the install.packages function in R (R Core Team, 2022):

# Install the release version from CRAN
install.packages("palaeoverse")

If preferred, the development version of palaeoverse can be installed from GitHub via the remotes R package (Csárdi et al., 2021):

# Install the development version from GitHub
remotes::install_github("palaeoverse-community/palaeoverse")

Following installation, palaeoverse can be loaded via the library function in R:

# Load the package
library("palaeoverse")

| Data
FIGURE 1 Summary of responses to the palaeoverse survey. (a) The types of palaeontological data that survey participants typically work with. Each box represents an individual check within a check-box list, in which participants could check multiple boxes. (b) Tasks that survey participants routinely carry out in their own analyses (dark pink), and the functions they would find useful in the palaeoverse package (light pink).

Functionality in palaeoverse was designed to be compatible with occurrence dataframes, such as those downloaded from the Paleobiology Database (https://paleobiodb.org/#/), the Geobiodiversity Database (http://www.geobiodiversity.com), or the Neptune Sandbox Berlin database (https://nsb.mfn-berlin.de/). Functionality is purposely flexible in palaeoverse and can be applied to various data sources with ease. In most instances, the returned object from a function is also a dataframe, which we consider the easiest data structure for most users to understand and work with. Although this might be undesirable for some advanced R users, transforming data structures should be straightforward for these users.
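As a minimal illustration of the dataframe-centred design described above, the following base R sketch builds a small occurrence dataframe and converts it to other structures. The taxon names, values, and column names (taxon, max_ma, min_ma, lng, lat) are invented for illustration and are assumptions rather than a required palaeoverse schema.

```r
# Hypothetical occurrence dataframe: one row per fossil occurrence.
# All names and values below are invented for illustration.
occdf <- data.frame(
  taxon  = c("Taxon A", "Taxon B", "Taxon C"),
  max_ma = c(150, 100, 70),  # oldest age estimate (millions of years)
  min_ma = c(140, 90, 66),   # youngest age estimate (millions of years)
  lng    = c(-105, 30, 10),  # modern longitude (decimal degrees)
  lat    = c(40, 28, 50)     # modern latitude (decimal degrees)
)

# Standard dataframe operations apply directly, e.g. row filtering
younger <- occdf[occdf$max_ma <= 100, ]

# Users who prefer other structures can convert with base R
age_matrix <- as.matrix(occdf[, c("max_ma", "min_ma")])
by_taxon   <- split(occdf, occdf$taxon)
```

Because both input and output are plain dataframes in this sketch, any further step (filtering, joining, plotting) can use whichever tools the user already knows.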

| Functions
A summary of the functions available in palaeoverse along with their respective dependencies is provided in Table 1. Additional information about both example datasets can be accessed via ?tetrapods or ?reefs once the package is loaded.

| Time bins
TABLE 1 A summary table of the functions currently available in the palaeoverse R package and respective dependencies. Base R dependencies are highlighted with an asterisk.

Function    Description                                   Dependency
axis_geo    Add a geological time scale axis to a plot    deeptime (Gearty, 2023)

We developed time_bins to enable access to two popular Geological Time Scales (GTS): GTS2012 and GTS2020 (Gradstein et al., 2012, 2020). Both GTS2012 and GTS2020 are included in the package as reference datasets. The time_bins function allows users to extract time bins at a chosen rank (e.g. stage, era, or eon) using these datasets for a specified interval input:

# Get stage-level time bins
time_bins(interval = "Phanerozoic", rank = "stage", plot = TRUE)

As is evident from Figure [...] ages are known (e.g. from radiometric dating). However, the bespoke bin_time function (discussed below) is likely to be the preferred option for most fossil occurrence data, which often have an age range.
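The distinction drawn above between occurrences with a known point age and those with an age range can be sketched in base R. This is an illustrative sketch only, not the palaeoverse implementation: the bin boundaries are round hypothetical numbers rather than GTS values, and assign_bin is a hypothetical helper.

```r
# Hypothetical time bins (ages in millions of years; not GTS values)
bins <- data.frame(bin = 1:3, max_ma = c(30, 20, 10), min_ma = c(20, 10, 0))

# Hypothetical helper: return the bin whose limits contain a single age
assign_bin <- function(age) {
  bins$bin[age <= bins$max_ma & age > bins$min_ma][1]
}

# Case 1: point ages (e.g. radiometric dates) each fall in exactly one bin
ages <- c(25.3, 4.1, 12.0)
point_bin <- vapply(ages, assign_bin, integer(1))

# Case 2: occurrences with an age range may overlap several bins
occ <- data.frame(max_ma = c(28, 22, 9), min_ma = c(24, 16, 2))

# One simple option: assign by the midpoint of the age range
occ$mid_ma  <- (occ$max_ma + occ$min_ma) / 2
occ$bin_mid <- vapply(occ$mid_ma, assign_bin, integer(1))

# Count how many bins each age range overlaps
occ$n_bins <- vapply(seq_len(nrow(occ)), function(i) {
  sum(bins$max_ma > occ$min_ma[i] & bins$min_ma < occ$max_ma[i])
}, integer(1))
```

In this sketch the second occurrence (22 to 16 Ma) overlaps two bins, so a midpoint rule makes a choice that a point age would not require; this is the situation in which a dedicated binning function becomes useful.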

| Occurrence binning
Fossil occurrences are frequently 'binned' into distinct time intervals to quantify changes (e.g. biodiversity or disparity) through geological time, as described by the survey participants ( Figure 1). The function bin_time allows users to assign occurrences into time bins generated by the function time_bins, or those defined by the user:

# Assign occurrences to bins
bin_time(occdf = tetrapods, bins = bins, method = "mid")

Although binning occurrences with tightly defined temporal limits is straightforward and has been implemented in other R packages [...]

We have developed the function palaeorotate to address this challenge. The function allows palaeocoordinates to be reconstructed within R using two different approaches: 'point' and 'grid'. The first approach uses the GPlates Web Service and allows point data to be rotated to specific ages using the available models (see https://gwsdoc.gplates.org). The second approach uses reconstruction files of pre-generated palaeocoordinates to spatiotemporally link occurrences' modern coordinates and age estimates with their respective palaeocoordinates. These reconstruction files were generated using an equal-area hexagonal grid (~100 km spacings) via the h3jsr [...]

[...] (Jaro, 1989). This metric provides a measure of the dissimilarity between character strings as a value between 0 (exact match) and 1 (completely dissimilar [...]

[...] indet., the latter would be discarded under species-level analysis (i.e. a species richness of two). However, this occurrence clearly represents a different species to the two already present in the dataset.
Using tax_unique, Diplodocidae is treated as an additional species (i.e. a species richness of three) because this occurrence represents a different species than the two already present in the dataset. Yet, the implementation is also conservative: if multiple coarsely identified occurrences exist in the dataset, they are collapsed to the minimum number of possible species (i.e. two occurrences of Diplodocidae indet. would be treated as only one species). This method is similar to the 'cryptic' diversity measure introduced by Mannion et al. (2011).
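The conservative counting rule described above can be sketched in base R. This is an illustrative sketch under stated assumptions, not the tax_unique source code: the occurrence table, its column names, and the choice of named species are invented for the example. A coarsely identified occurrence is counted only when its higher taxon is not already represented by a more finely identified occurrence, and repeated indeterminate records of the same taxon count once.

```r
# Hypothetical occurrences: two named species plus family-level records
occ <- data.frame(
  family  = c("Tyrannosauridae", "Spinosauridae", "Diplodocidae",
              "Diplodocidae", "Tyrannosauridae"),
  genus   = c("Tyrannosaurus", "Spinosaurus", NA, NA, NA),
  species = c("rex", "aegyptiacus", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Occurrences identified to species level
named <- !is.na(occ$species)
species_names <- unique(paste(occ$genus[named], occ$species[named]))

# Genus-level records count only if that genus has no named species
genus_only <- !named & !is.na(occ$genus)
extra_genera <- setdiff(unique(occ$genus[genus_only]), occ$genus[named])

# Family-level records count only if that family has no finer record;
# unique() collapses repeated indeterminate records to a single taxon
family_only <- is.na(occ$genus) & !is.na(occ$family)
extra_families <- setdiff(unique(occ$family[family_only]),
                          occ$family[!is.na(occ$genus)])

n_unique <- length(species_names) + length(extra_genera) +
  length(extra_families)
```

Here the two Diplodocidae records add one taxon, whereas the indeterminate Tyrannosauridae record adds none because it could belong to the named tyrannosaurid species, giving a count of three.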

# Evaluate unique taxa
tax_unique(occdf = tetrapods, genus = "genus", family = "family", order = "order", class = "class", resolution = "genus")

Two functions exist in palaeoverse for computing taxon ranges. The first, tax_range_time, can be used to calculate and plot the temporal range of taxa. The function identifies all unique taxa provided in the occurrence dataframe and finds their first and last appearance dates (Figure 4). The second, tax_range_space, can be called to calculate the geographic range of taxa. This function allows the user to specify one of four different approaches [...] (4) the number and proportion of occupied equal-area grid cells. Similar to tax_range_time, the function will identify all unique taxa provided and calculate these metrics based on the available occurrences of each taxon.
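The range calculations described above reduce to per-taxon summaries of the occurrence data. A minimal base R sketch follows, assuming a hypothetical dataframe with invented values and column names (taxon, max_ma, min_ma, p_lat); it illustrates the idea and is not the package's implementation.

```r
# Hypothetical occurrences with age ranges and palaeolatitudes
occ <- data.frame(
  taxon  = c("A", "A", "B", "B", "B"),
  max_ma = c(100, 90, 70, 66, 60),
  min_ma = c(95, 85, 66, 61, 56),
  p_lat  = c(10, -5, 40, 45, 52)
)

# Temporal range: oldest maximum age and youngest minimum age per taxon
fad <- tapply(occ$max_ma, occ$taxon, max)  # first appearance
lad <- tapply(occ$min_ma, occ$taxon, min)  # last appearance
ranges <- data.frame(
  taxon  = names(fad),
  max_ma = as.numeric(fad),
  min_ma = as.numeric(lad)
)
ranges$range_myr <- ranges$max_ma - ranges$min_ma

# Latitudinal range: difference between extreme palaeolatitudes per taxon
ranges$lat_range <- as.numeric(
  tapply(occ$p_lat, occ$taxon, max) - tapply(occ$p_lat, occ$taxon, min)
)
```

Each row of the resulting table describes one taxon, which matches the dataframe-in, dataframe-out convention used throughout the package description.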

# Compute latitudinal range of orders
tax_range_space(occdf = tetrapods, name = "order", method = "lat")

The provided tax_expand_time and tax_expand_lat functions are complementary to the taxonomic range functions. They convert temporal or latitudinal range data to bin-level pseudo-occurrences. These pseudo-occurrences serve to fill in ghost ranges, in which a taxon is presumed to be present but no record exists. Although these pseudo-occurrences should not be treated as equivalent to actual occurrence data, such data can be useful for performing statistical analyses where bin-level data are required.
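The conversion from range data to bin-level pseudo-occurrences can be illustrated with a short base R sketch. The bins, ranges, and column names are hypothetical, and the overlap rule used here (a taxon is placed in every bin its range overlaps) is an assumption made for the example rather than a description of the package internals.

```r
# Hypothetical time bins (ages in millions of years)
bins <- data.frame(bin = 1:4,
                   max_ma = c(40, 30, 20, 10),
                   min_ma = c(30, 20, 10, 0))

# Hypothetical temporal ranges: one row per taxon
ranges <- data.frame(taxon = c("A", "B"),
                     max_ma = c(35, 18),
                     min_ma = c(12, 5))

# Expand each range into one pseudo-occurrence per overlapping bin
pseudo <- do.call(rbind, lapply(seq_len(nrow(ranges)), function(i) {
  overlaps <- bins$max_ma > ranges$min_ma[i] &
    bins$min_ma < ranges$max_ma[i]
  data.frame(taxon = ranges$taxon[i], bin = bins$bin[overlaps])
}))
```

Taxon A is placed in bins 1 to 3 even if it has no record in bin 2, which is how a ghost range is filled in this sketch.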

| Phylogeny wrangling
The function phylo_check compares a list of taxonomic names to the list of tip names in a user-provided phylogeny using the ape package (Paradis & Schliep, 2019). This comparison can be provided as a [...]

[...] or GTS2020 (the default). This functionality therefore enables numerical ages to be assigned to datasets only containing character-based interval names (e.g. "Maastrichtian").
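Assigning numerical ages from interval names amounts to a table look-up. The base R sketch below illustrates the idea with a hypothetical two-row look-up table; the ages are rounded, illustrative values rather than authoritative GTS2012 or GTS2020 boundaries, and the column names are assumptions.

```r
# Hypothetical occurrences that only carry interval names
occ <- data.frame(
  taxon    = c("A", "B", "C"),
  interval = c("Maastrichtian", "Campanian", "Maastrichtian")
)

# Hypothetical look-up table (rounded, illustrative ages in Ma)
lookup <- data.frame(
  interval_name = c("Campanian", "Maastrichtian"),
  max_ma = c(84, 72),
  min_ma = c(72, 66)
)

# Match each interval name to its row in the look-up table
idx <- match(occ$interval, lookup$interval_name)
occ$max_ma <- lookup$max_ma[idx]
occ$min_ma <- lookup$min_ma[idx]

# Names absent from the table yield NA and can be flagged for review
unmatched <- occ$interval[is.na(idx)]
```

Flagging unmatched names rather than dropping them keeps spelling variants and informal interval names visible to the user.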

| RESOURCES
To support the aims and use of palaeoverse, we have made several resources available to the palaeobiological community. First, we have built a package website (http://palaeoverse.palaeoverse.org), which provides information on how to contribute to palaeoverse, how to report issues and bugs, and a general community code of conduct. Second, we have established a Google Group to foster collaboration and discussion around the issues faced by the community, such as establishing standards for data preparation (https://groups.google.com/g/palaeoverse).

| FUTURE PERSPECTIVES
Palaeoverse is envisioned as a community project. While the initial development of the palaeoverse R package was led by the authors of this manuscript, it was also informed by the perspectives of 35 additional researchers (survey participants). Our hope is that palaeoverse will evolve into a community-driven package by welcoming contributions from the wider palaeontological community to broaden the available functionality. To support this aim, we provide guidance on how community members can contribute to palaeoverse on the package website (http://palaeoverse.palaeoverse.org).

FIGURE 4 Temporal range of tetrapod orders in the palaeoverse example dataset.

Our working group also has the wider aim of establishing community standards and consensus in computational palaeobiological research and facilitating comparison across studies. Through the palaeoverse R package, we hope to assist in making code more familiar and readable to fellow researchers, prevent researchers from 'reinventing the wheel' for common procedures, and improve the overall reproducibility of research through the use of computational tools that have been vetted and accepted by the broader community.
The development of the palaeoverse R package marks an initial effort to both streamline palaeobiological analysis pipelines and unite the computational palaeobiology community. Future efforts will see the expansion of the palaeoverse 'universe' with the development of Shiny applications to support non-R users and teaching exercises, tutorials to offer guidance for new researchers, and workshops to provide practical experience. In turn, we hope these efforts will foster collaboration and the sharing of resources within the palaeobiological community. Finally, we warmly welcome the community to join these efforts and have established a community space to help facilitate this process (https://groups.google.com/g/palaeoverse).

AUTHOR CONTRIBUTIONS
Lewis A. Jones conceived the project. All authors contributed to developing the project. Lewis A. Jones, Bethany J. Allen, William Gearty, Kilian Eichenseer, Christopher D. Dean and Joseph T. Flannery-Sutherland contributed the code. All authors contributed to testing and reviewing the code. Sofía Galván processed the survey results and produced the survey figures. All authors contributed to writing the manuscript.

ACKNOWLEDGEMENTS
The authors are extremely grateful to all survey participants who helped to shape the development of palaeoverse. Special thanks are given to Emma M. Dunne, who participated in numerous discussions, and shared her experience with the development team.
Thanks are also given to two anonymous reviewers that helped

CONFLICT OF INTEREST STATEMENT
We declare we have no conflict of interest.

PEER REVIEW
The peer review history for this article is available at https://www.webofscience.com/api/gateway/wos/peer-review/10.1111/2041-210X.14099.

FIGURE 5 Example Phanerozoic plot of the palaeolatitudinal distribution of reefs through time. The plot demonstrates the usage of the axis_geo function for adding the Geological Time Scale to a base R plot.

DATA AVAILABILITY STATEMENT
The palaeoverse R package is hosted on CRAN (https://cran.r-project.org/web/packages/palaeoverse/) and is available on GitHub (https://github.com/palaeoverse-community/palaeoverse). The code is also archived in Zenodo through continuous integration. All example datasets are bundled with the R package. All code is released under a GPL (>=3) licence.