The Extending Ocean Drilling Pursuits (eODP) Project: Synthesizing Scientific Ocean Drilling Data

For over 50 years, cores recovered from ocean basins have generated fossil, lithologic, and chemical archives that have revolutionized fields within the earth sciences. Although scientific ocean drilling (SOD) data are openly available following each expedition, the formats for these data are heterogeneous. Furthermore, lithological, chronological, and paleobiological data are typically separated into different repositories, limiting researchers' abilities to discover and analyze integrated SOD data sets. Emphasis within Earth Sciences on Findable, Accessible, Interoperable, and Reusable (FAIR) Data Principles and the establishment of community‐led databases provide a pathway to unite SOD data and further harness the scientific potential of the investments made in offshore drilling. Here, we describe a workflow for compiling, cleaning, and standardizing key SOD records, and importing them into the Paleobiology Database and Macrostrat, systems with versatile, open data distribution mechanisms. These efforts are being carried out by the extending Ocean Drilling Pursuits (eODP) project. eODP has processed all of the lithological, chronological, and paleobiological data from one SOD repository, along with numerous other data sets that were never deposited in a database; these were manually transcribed from original reports. This compiled data set contains over 79,899 lithological units from 1,125 drilling holes from 422 sites. Over 26,000 fossil‐bearing samples, with 5,378 taxonomic entries from 13 biological groups, are placed within this lithologic spatiotemporal framework. All information is available via GitHub and Macrostrat's application programming interface, which renders data retrievable by a variety of parameters, including age, site, and lithology.

SOD data could be, but are not commonly, used in a similar fashion because the data are not housed in an easily accessible database.
Most unmodified (i.e., the original interpretations of taxonomy and age from the scientists aboard the drill ships) SOD data are housed in three distinct online repositories that are not readily searchable. This means that users must already know what they are looking for and where to go to find it. Furthermore, some SOD data, particularly age models and lithologic records, are decoupled from related data sets and not available online. All of these issues hinder the investigation of large-scale temporal and geographic patterns because researchers must create both the data sets and analytical tools for each study in isolation, in contrast to the group of paleobiologists who generate open data as well as analysis and visualization tools built around those data (see the PBDB website's "Resources" section). The lack of a shared and integrated SOD database is a known issue in paleoceanography (e.g., Greene & Thirumalai, 2019) and has been a source of recent work (e.g., Khider et al., 2019). Establishing a community-led, open-source ecosystem of SOD fossil and stratigraphic data is vital for achieving many paleoceanography and marine sedimentary geology research goals, such as quantifying regional to global biodiversity (including the effects of mass extinction), food web interactions, and marine sedimentation and sediment subduction trends (i.e., Müller et al., 2022).
Here, we introduce the extending Ocean Drilling Pursuits (eODP; https://eodp.github.io/) project, which is building capabilities for the improved use and reuse of SOD data via existing databases. The nexus for creating a unified SOD database system is stratigraphy; the age and environment of deposition are the foundation upon which all sedimentary research is built. In current SOD databases, stratigraphic data are stored unconnected to other data, including the fossil occurrences extracted from them (Figure 1; Table 1). SOD has generated large amounts of data, but these data are of limited value without their meta-stratigraphic context. The eODP project facilitates the curation, access, analysis, refinement, and visualization of comprehensive and integrated marine fossil and stratigraphic data sets by adapting several established databases and tools. Macrostrat, a stratigraphic database that stores ages, sediment thicknesses, and lithologies in easily accessible and flexible formats, provides the spatiotemporal stratigraphic scaffolding that a unified SOD ecosystem requires, while the PBDB stores paleontological records and has considerable taxonomic capabilities to deal with more than 50 years of evolving taxonomic concepts for SOD fossils. The goal of eODP is to make SOD data easily accessible and manipulable by geoscientists, oceanographers, and biologists by adhering to the Findable, Accessible, Interoperable, and Reusable (FAIR) Data Principles (Wilkinson et al., 2016).

SOD Data Sources, Data Types, and Data Management
The International Ocean Discovery Program (IODP) is a multi-country collaboration to study Earth Science, primarily using deep-coring vessels (see O'Connell, 2019 and references within for a detailed history of SOD programs and for the revolutionary effects SOD has had on understanding Earth processes). IODP has had several predecessor programs dating back over 50 years: Deep Sea Drilling Project (DSDP) from 1968 to 1983; Ocean Drilling Program (ODP) from 1983 to 2003; and Integrated Ocean Drilling Program from 2003 to 2013. Each program had its own way of formatting and archiving data (Table 1). During each expedition (formerly known as Legs), a drillship visits sites or localities and drills or cores one or more holes at each site. Shipboard scientists gather information on the rocks, sediments, and fossils recovered from the cores. This information includes, but is not limited to, detailed macroscopic and microscopic lithologic descriptions, physical property measurements, geochemistry, magnetic properties, and paleontology. Age-depth relationship interpretations are constructed shipboard using indicator fossils, magnetic polarity, and occasionally using well-dated and described marker beds, such as volcanic tephras. These shipboard data are published in Initial Reports or Proceedings Volumes (hereafter termed shipboard reports) and are stored in three online sources: the National Oceanic and Atmospheric Administration's (NOAA's) National Centers for Environmental Information (NCEI) International Ocean Drilling Data Archive houses all DSDP data and ODP data from Leg 100 to 129; the SOD database "Janus" stores some data from ODP Leg 129 through Integrated Ocean Drilling Phase I Expedition 312; and the IODP database "LIMS" (Laboratory Information Management System) stores data from Integrated Ocean Drilling Phase II and IODP Expeditions 317 to present (Table 1).
These sources do not include data from sites cored by the Chikyu vessel or Mission Specific Platform Expeditions (e.g., Expeditions 313-316).
There are several additional data sets not available in any online repository (Table 1), such as lithologic descriptions from the Janus era, which are only available from the core description forms (standardized lithological/stratigraphic columns of individual cores that are published within the shipboard reports). Neither general geological ages of cored material nor detailed age models are available in repositories for most SOD programs. Additionally, some fossil data were not incorporated into any database and were instead stored as a large table within the shipboard reports. Due to the fractured nature of SOD data, it is currently difficult to even estimate the total magnitude of the data store. For example, as of April 2022, there have been 282 completed expeditions that visited a total of 1,601 sites (https://www.iodp.tamu.edu/publicinfo/ship_stats.html). The Janus database alone contains over 1 million fossil occurrences, while to date the PBDB in total contains roughly 1.5 million, most of which are derived from continental outcrops of marine and terrestrial rock units.
The fossil remains of animals, plants, bacteria, protists, and fungi are all found within SOD samples and are generally lumped under the descriptive (rather than taxonomic) terms "micropaleontology" and "microfossil" because their small size necessitates a microscope to study them. The most common taxonomic groups found within SOD samples are listed in Table 2. Both the preservation and abundance of all microfossil taxa within an SOD sample are recorded via terms that are fairly standardized across expeditions but vary between fossil groups because of the differing sample processing and counting methodologies required for each group. Microfossils have a robust species-level record (Ezard et al., 2011; Fraass et al., 2015; Jamson, Moon, & Fraass, 2022), a true novelty in paleobiology, and thus form a rich data set for addressing many of the questions highlighted in "Grand Challenges in Paleobiology" 2017 EarthRates' Report (see the "Grand Challenges": earthrates.org/news/earthrates-community-news-2/) at unprecedented levels of specificity.
There are ongoing efforts to mobilize key SOD data. Notably, the Neptune Sandbox (NSB) is a database of microfossil occurrences and age-depth relationships, largely constructed with postcruise age models. NSB is a tremendous resource for the paleoceanographic community, as it has focused on key sites with highly resolved chronologies and has been a source of important work for decades (e.g., Spencer-Cervato et al., 1994; Trubovitz et al., 2020). However, NSB does not include all sites and all shipboard microfossil data, nor does it include lithology logs. The targeted approach taken by NSB is complementary to the eODP project goals and the two projects are actively aligning efforts.

The Paleobiology Database
Created in 1998, the PBDB is an open data and software infrastructure centered around globally distributed, geographically and taxonomically explicit fossil occurrence data on all organisms through all time periods. Included in the data system are "bibliographic references, taxonomic names, taxonomic opinions on synonymies and classifications, primary collection data, taxonomic occurrences, and re-identifications of occurrences" (Uhen et al., 2013). As of June 2022, the PBDB contained over 81,000 references and 458,000 taxonomic names, with over 881,000 opinions on the classification of those names; over 1,560,000 occurrences from more than 225,000 collections were also available. These data were entered by over 674 contributors from many institutions around the world. Anyone can access all of the data via the PBDB websites and REST-ful application programming interface (API). The PBDB API originated in October 2014 and has since received hundreds of millions of requests from many different types of clients distributed around the globe. Documentation for the API is publicly available on the paleobiodb.org website (Peters & McClennen, 2016).
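As an illustration of how such an API request might be assembled, the following minimal Python sketch builds (but does not send) a query URL for fossil occurrences. The base URL, the `occs/list.json` endpoint, and the `base_name`, `interval`, and `show` parameter names are assumptions based on version 1.2 of the PBDB data service and should be checked against the current API documentation; the helper function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL for version 1.2 of the PBDB data service.
PBDB_BASE = "https://paleobiodb.org/data1.2"


def pbdb_occurrence_url(base_name, interval=None, show=("coords",)):
    """Build (but do not send) a URL requesting fossil occurrences.

    base_name: taxon name; occurrences of its subtaxa are included (assumed).
    interval:  optional named geologic interval, e.g. "Miocene" (assumed).
    show:      optional extra output blocks, e.g. coordinates (assumed).
    """
    params = {"base_name": base_name}
    if interval:
        params["interval"] = interval
    if show:
        params["show"] = ",".join(show)
    return f"{PBDB_BASE}/occs/list.json?{urlencode(params)}"


url = pbdb_occurrence_url("Globorotalia", interval="Miocene")
print(url)
```

The resulting URL could then be opened in a browser or passed to any HTTP client; the sketch stops short of a network call so that it runs offline.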

Macrostrat
Macrostrat was created circa 2005 to aggregate chronostratigraphically stacked rock units and their properties in order to enable quantitative analyses of regional- and continent-scale patterns in the rock record (e.g., Fraass et al., 2015; Peters, 2005, 2006, 2008; Peters et al., 2013, 2018, 2022). Although the database has primarily been used to store and analyze generalized rock columns representative of relatively large geographic regions (see references above), the fundamental structure of the database is scale agnostic, making it possible to store detailed measured sections and age models for them within the same data framework as lithostratigraphic-scale generalized regional columns. As of June 2022, the database contained 2,163 such rock columns, with 40,960 rock units distributed in North America, the Caribbean, New Zealand, South America and the deep sea realm. Continuous-time age models, generated initially algorithmically on the basis of imprecise but generally accurate constraints provided by stratigraphic superposition and correlations of units to chronostratigraphic bins that are in turn correlated with the current international timescale (Cohen et al., 2013, mod. 2022), are a key feature of Macrostrat. The SOD portion of the Macrostrat data set, prior to eODP, comprised 387 columns with 7,124 units, with a temporal sedimentary package hiatus structure (sensu Peters, 2006, 2008) defined by calcareous nannoplankton zones (Fraass et al., 2015; Peters et al., 2013). Correlations of these zones to the international timescale, and stacking order of sedimentary units, are used to assign a preliminary age model to these records. Macrostrat has several simple user interfaces that aid in the discovery and utilization of the data.
The open API serves as the basis for these interfaces and is versatile; it is currently used in multiple research applications and mobile software tools, such as Rockd (Mobile app), Mancos (iOS), and Flyover Country (Mobile app).

Data Harmonization
What follows is an overview of the workflow used to compile fossil, age, and lithology records from the three distinct SOD platforms and from the shipboard reports, to clean and standardize these records, add them to the database entitled "eODP Database," and incorporate them into Macrostrat and the PBDB (Figure 1). The Python scripts and Jupyter notebook used to process eODP data from NOAA, LIMS, and Janus are available at: https://github.com/eODP/data-processing (https://doi.org/10.5281/zenodo.7535415; see also Kwan et al., 2022); Python scripts to insert the data into the eODP database are available at: https://github.com/eODP/api (https://doi.org/10.5281/zenodo.7535413); and the workflow of adding taxonomic data and associated fields to the PBDB is available at: https://github.com/eodp/files-for-Sessa-2022 (https://doi.org/10.5281/zenodo.7535423). Shipboard data were accessed from the data sources listed in Table 1 by the developers at Whirl-i-Gig (http://www.whirli-gig.com/), who created the processing scripts and the eODP database, performed initial cleaning with input as needed from the authors, and who then supplied the compiled data sets to the authors to clean and standardize.
There are numerous SOD data sets that are not available in any online repository, such as lithologic descriptions from the Janus era. This has necessitated transcribing the lithologic data manually and through optical character recognition (OCR) from the core description forms. Age-depth relationships are also not available in a standardized, digitized format that adheres to FAIR principles for most SOD programs and therefore also had to be manually transcribed from the shipboard reports. In addition, sometimes fossil data were not incorporated into any database and were instead stored as a large table pdf within the shipboard reports. Acquiring these data required OCR software to process the pdfs, sometimes passing through Microsoft Excel to restructure the text as a table, then manual checking and formatting of the resulting fossil tables.

SOD Header Harmonization
Whirl-i-Gig developers first produced a unified data structure spreadsheet whereby the columns consisted of all headers used in SOD data tables. There are ∼250 headers, arranged in three eras (Leg/Exp. This direct ingestion of LIMS data into Macrostrat was necessary because LIMS had the widest variety of headers; for example, "Comment," "Comments," "COMMENTS," "comments," "Comment (general)," "General comment," "Sample comment," "BF comment," and "Nannofossil comment" are all the same type of information functionally but were classified as unique headers that were then manually harmonized. The igneous and metamorphic rock files were imported into the eODP database without harmonization (i.e., no effort was made to standardize terms like "2ND crystal roundness" and "2ND lithic roundness" across Legs/Expeditions) and are therefore not included in the above counts. Additionally, some headers could be further harmonized, but this would require additional data transformations. For example, in the chronostratigraphic data, ages had been stored as zonation schemes, datums, numerical age values, and with minimum, maximum, and average values, or sometimes as simple single values. All of this variability was retained in the eODP database. The explicit goal of the eODP database is to harmonize the data structure as thoroughly as possible without modifying the underlying data, unless it was found to be clearly in error (e.g., misspelling of taxonomic names).
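The kind of header harmonization described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the eODP processing code: the mapping table, the target name "Comments," and the function name are hypothetical, and only the header variants quoted in the text are included.

```python
import re

# Hypothetical lookup: normalized header variant -> harmonized header.
# The variants are those quoted in the text; the target name is an assumption.
HEADER_MAP = {
    "comment": "Comments",
    "comments": "Comments",
    "comment (general)": "Comments",
    "general comment": "Comments",
    "sample comment": "Comments",
    "bf comment": "Comments",
    "nannofossil comment": "Comments",
}


def harmonize_header(raw):
    """Map one raw column header to its harmonized form.

    Case and extra whitespace are ignored when matching; headers that are
    not in the lookup are returned unchanged (apart from trimming).
    """
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return HEADER_MAP.get(key, raw.strip())


raw_headers = ["Sample", "COMMENTS", "General comment", "BF comment"]
print([harmonize_header(h) for h in raw_headers])
```

Only the column names are changed here; the cell values are left untouched, in keeping with the stated goal of harmonizing structure without modifying the underlying data.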

Taxonomic Data Workflow
Whirl-i-Gig developers compiled lists of all unique taxonomic names for each of the three data sources listed in Table 1, starting with LIMS. While most of the LIMS data were processed in bulk, some data files were misformatted and required individual processing. Numerous fossil data sets that were not incorporated into any database were transcribed manually (Table 1) and incorporated into this first data batch. The validity of all generic and higher names was checked by Sessa, with assistance as needed from LeVay, Fraass, and the researchers listed in Table 2 and by utilizing the websites listed in Table 2. Prior to the import of these taxonomic lists into the PBDB, we first added "taxonomic backbones" (taxonomic hierarchies of the taxa within the compiled data set for each group listed in Table 2) to the PBDB (see https://github.com/eODP/files-for-Sessa-2022 for PBDB taxon ID numbers and resolved and original taxonomic names). While SOD data are generally resolved to the species level, and species are the desired unit for research, there are considerably more species than genera. Also, species are automatically linked to a taxonomic backbone because they are always associated with genera. Thus, genera were the most efficient targets for this first import. Some species names were validated on an ad hoc basis during this initial stage. As directed by research goals, species within key groups will be subjected to this workflow once all SOD generic and higher names have been entered into the PBDB taxonomic backbones.
The steps taken during the cleaning and standardizing of taxonomic entries included correcting misspellings; standardizing to "indet." for all names above the generic level (many of these entries were just the higher name or included "sp."); standardizing informal names to formal (e.g., "Miliolids" becomes "Miliolidae indet."); and moving authority, preservation, and morphologic and other descriptors into other comment fields (e.g., for "Ethmodiscus sp. fragments," "fragments" is moved to the "Comment" field; for "Rouxia sp. spatulate long heteropolar (MIS)," "spatulate long heteropolar (MIS)" is moved to the "Comment" field). Statistics on the resulting cleaned and standardized taxonomic data set are provided in Table 3. For some groups, such as planktic foraminifera and ostracods, the PBDB already contained a fairly comprehensive backbone; for other groups, such as benthic foraminifera and diatoms, the backbone needed to be built nearly from scratch; compare the number of references, taxonomic opinions, and authorities entered into the PBDB for each group in Table 4.
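A minimal Python sketch of these cleaning rules, applied to the examples given above, might look as follows. It is not the eODP code: the two lookup tables and the function name are hypothetical stand-ins for the curated lists used in practice, and misspelling correction is omitted.

```python
import re

# Hypothetical lookups; real lists would be curated by taxonomic experts.
INFORMAL_TO_FORMAL = {"Miliolids": "Miliolidae"}
HIGHER_TAXA = {"Miliolidae"}  # names above the generic level


def clean_taxon(raw):
    """Return (name, comment) for one raw shipboard taxon entry."""
    text = " ".join(raw.split())          # collapse stray whitespace
    first, _, rest = text.partition(" ")
    first = INFORMAL_TO_FORMAL.get(first, first)

    if first in HIGHER_TAXA:
        # Names above genus are standardized to "<name> indet.";
        # anything after a bare "sp."/"spp."/"indet." becomes the comment.
        comment = re.sub(r"^(spp?\.|indet\.)\s*", "", rest)
        return f"{first} indet.", comment

    # "Genus sp. <descriptors>": keep "Genus sp." and move the descriptors.
    match = re.match(r"^(spp?\.)\s*(.*)$", rest)
    if match:
        return f"{first} {match.group(1)}", match.group(2)

    # Otherwise the entry is left as-is (e.g., a genus-species pair).
    return text, ""


for entry in ["Miliolids", "Ethmodiscus sp. fragments",
              "Rouxia sp. spatulate long heteropolar (MIS)"]:
    print(clean_taxon(entry))
```

Returning the descriptor separately, rather than discarding it, mirrors the approach of moving such text into a comment field so that no shipboard information is lost.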
The PBDB provides several advantages for housing these taxonomic data because substantial taxonomic tools have been incorporated into it over the years. For example, the PBDB tracks multiple taxonomic opinions. One example, as shown in Figure 2, is the taxonomic nomenclature of the planktic foraminifera genus Globorotalia.
The taxonomy of Globorotalia is complex, as over time this genus has been ascribed as subgenera, which were then sometimes formally or informally elevated to genera. All of these revisions can and will be incorporated and stored within the PBDB. The PBDB also contains tools to disambiguate and keep separate taxonomic homonyms, which are taxonomic names that are spelled identically but belong to two or more separate taxa, for example, "Emiliania" is both the name of a calcareous nannoplankton genus and a now invalid genus name of a bivalve (i.e., Sánchez, 2010). At this initial stage, our focus has been to generate the taxonomic backbones, rather than updating the various taxonomies to the current state-of-the-art, though some revisions were entered into the PBDB on an ad hoc basis.
Of these 5,378 entries, there are:
Unique names of higher taxonomic names (any taxon above genus): 86
Unique genera: 1,068
Entries resolved below genus: 3,775
Unique below species: 136
Note. "Number of entries" is the total count of all entries within a taxonomic group, whereas "number of distinct taxonomic entries" is the number of valid taxonomic entries (e.g., "Chaetoceros spp." and "Chaetoceros spp. and similar spores" are two entries and one distinct taxonomic entry); "unique below genus" is all genus-species pairs except "sp." and "spp." Bolded taxa are the most diverse groups in the data set, with benthic foraminifera representing 32% of all entries, followed by planktic foraminifera at 18% and calcareous nannoplankton at 17%.
10.1029/2022GC010655
There are several fields related to taxonomic lists that also required standardization: the abundance values and units of individual taxa within samples, and the preservation, fragmentation, and group abundance fields, which are properties of the sample (in the parlance of the PBDB, these are properties of a "collection," and the abundance values and units are properties of an "occurrence list" of taxa). Taxonomic files of the shipboard reports typically contain qualitative abundance codes, such as "A" for "Abundant," "C" for "Common," "R" for "Rare," etc. The "Methods" chapter of each shipboard report contains descriptions of how shipboard scientists delineated these categories. For example, "A" means "Abundant" for the majority of taxonomic groups; however, it is used to represent a variety of values, from a span of percentages (10%-30%, 10%-50%, >16%, >20%, 20%-50%, >30%, 50%-90%, >50%) to a range of individual specimen counts (1-10, >1, >2, >5, 5-10, 10-100, >10, >11, 11-20, >20, >25, >30, >50, >2,000). These definitions can sometimes differ when used for an individual taxon or the quantity of a particular group (e.g., planktic foraminifera). Further, the counting methods used to generate abundance vary by taxon and sometimes by expedition based on how the shipboard scientists processed samples and generated abundance data including, but not limited to, per field of view, per slide or tray traverse, per 300 individuals, the number of fossils compared to the number of sediment particles, compared to the number of foraminifera (benthic and planktic), or the number of individuals within that particular group (e.g., percentage of a particular benthic foraminifera taxon with that entire assemblage group). It is also important to check assumptions about the abundance codes themselves; for example, in rare cases "F" was used for "Frequent" and not "Few." To standardize these codes while ensuring that the original shipboard determinations were maintained, Peters generated a list of all codes used in each expedition by taxonomic group and Fraass collated sample processing, counting methodologies, and abundance definitions (both group and individual taxon). We have standardized these values for ease of use (e.g., harmonizing "A" and "a" to "A," "R?" and "?R" to "R?," "rw" and "*" to "*" for reworked, because "*" is the standard symbol to denote reworking in the shipboard reports). In several instances, shipboard scientists would use an undefined code (e.g., using "C" when the scheme goes directly from "Abundant" to "Few"). In those cases, the original undefined code is retained in the abundance field of the particular taxon in the species list, and an interpreted definition of the undefined code is recorded in the comment field, as determined by Fraass.
In a few instances, transitional abundances were listed (e.g., "C-A") but undefined; however, as both the individual values were defined, the midpoint between the two abundances was interpreted for the abundance field, with the rationale provided in the comment field. All of these comments are recorded in a unified fashion in the eODP database. As with all eODP data, the eODP database contains the original entries, the harmonized values and comments. The harmonized abundances, their corresponding units and comments (see file "PBDB_taxon_occurrence_harmonization.csv" in https://github.com/eODP/files-for-Sessa-2022) (https://doi.org/10.5281/zenodo.7535423) were then imported into the PBDB fields of "Abundance"; "unit"; and "Comments" of an "occurrence list" (i.e., an individual SOD sample). "Group abundance" is a measure of the overall abundance of a particular taxonomic group in a sample and was standardized in much the same way as the abundances of the individual species. Abundance codes were harmonized across different groups and expeditions such that, for example, an "A" always means "Abundant." The eODP database contains both the original and harmonized codes and comments (see file "group_abundance_harmonization.csv" in https://github.com/eODP/files-for-Sessa-2022). The harmonized codes were incorporated into the PBDB field "Abundance in sediment," which includes the values of abundant, common, few, and rare (and is a property of a collection in PBDB parlance).
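The simplest part of this abundance harmonization (case, uncertainty markers, and reworking symbols) can be sketched as follows. This is a hedged illustration covering only the examples quoted above; the function names and the record layout are hypothetical, and the expert interpretation of undefined or transitional codes is deliberately not attempted.

```python
def harmonize_abundance(code):
    """Normalize one raw shipboard abundance code.

    Covers only the examples in the text: "a" -> "A", "?R" -> "R?",
    and "rw" or "*" -> "*" (reworked).
    """
    c = code.strip()
    if c.lower() == "rw" or c == "*":
        return "*"
    uncertain = "?" in c
    letters = c.replace("?", "").upper()
    return letters + ("?" if uncertain else "")


def harmonized_record(code):
    """Keep the original entry alongside the harmonized value."""
    return {"original": code, "harmonized": harmonize_abundance(code)}


for raw in ["A", "a", "R?", "?R", "rw", "*"]:
    print(harmonized_record(raw))
```

Storing the original code next to the harmonized one follows the practice described above of retaining the shipboard determinations in the eODP database.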
The same qualitative preservation codes generally are used across taxa and expeditions/legs and therefore the standardization of preservation was comparatively simple: "E" or "VG" for excellent/very good, "G" for good, "M" for moderate/medium, "P" for poor, and "VP" for very poor. The eODP database contains all unedited preservation codes. For import into the PBDB, in instances where preservation was coded as spanning categories (e.g., "G-M", "G-VG"), only the first letter was used, based on the presumption that the first letter was the most commonly seen preservation. In cases where the preservation was not contiguous (e.g., "VG-P", "G-VP"), the preservation was recorded as "V" (for "variable") (see file Group preservation.csv in https://github.com/eODP/files-for-Sessa-2022). This harmonization allowed eODP to use the existing "Fragmentation" field within the PBDB's "Preservation" data table with minimal loss of data, as "Fragmentation" is not a free-form field.
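The preservation rule described above can be expressed as a short Python sketch. Two points are assumptions rather than statements from the shipboard reports or the eODP code: "contiguous" is taken to mean adjacent on the ordered scale E/VG, G, M, P, VP, and "the first letter" is taken to mean the first code in the pair. The function name is hypothetical.

```python
# Ordered preservation scale; "E" and "VG" share the top rank (assumption).
RANK = {"E": 0, "VG": 0, "G": 1, "M": 2, "P": 3, "VP": 4}


def harmonize_preservation(code):
    """Reduce a single or spanning preservation code to one value."""
    parts = [p.strip().upper() for p in code.split("-")]
    if len(parts) == 1:
        return parts[0]
    if any(p not in RANK for p in parts):
        return code  # unrecognized code: leave for manual review
    ranks = [RANK[p] for p in parts]
    if max(ranks) - min(ranks) <= 1:
        return parts[0]  # contiguous span: keep the first code
    return "V"           # non-contiguous span: "variable"


for raw in ["G", "G-M", "G-VG", "VG-P", "G-VP"]:
    print(raw, "->", harmonize_preservation(raw))
```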
The processing of the first batch of taxonomic files began at the end of 2019, and generic and higher taxonomic names were validated in early 2020, when the entry of taxonomic authorities and opinions into the PBDB for the taxonomic backbones began. The first backbone entered into the PBDB was for the calcareous nannoplankton because LeVay is an expert in this group. Over the course of a year, three undergraduate and three graduate students from Sessa's and Fraass' institutions were paid to enter the taxonomic data listed in Table 4. The combined efforts of these six students were equivalent to a year of full-time work. About 540 references containing 572 taxonomic names and 1,187 taxonomic opinions were entered into the PBDB (Table 4).
Once taxonomic entries were cleaned and standardized, checks were run against the PBDB taxonomic backbone by Whirl-i-Gig developers using the PBDB API services to ensure that all generic and higher names were indeed within the PBDB and would be classified into their respective taxonomic hierarchies. Following these checks, taxonomic data were brought into the eODP database and then imported from there into the PBDB (Figure 1). Numerous file formatting and labeling errors were discovered when attempting to join the lithologic records in Macrostrat with the fossil files in the eODP database so that they could be imported into the PBDB. These errors, ranging from incorrect site names to missing lithology logs, are similar in complexity and scope to those described above and below for standardizing and formatting the data within these files, and they have delayed the import into the PBDB.

Lithology Harmonization
Macrostrat stores hierarchical vocabularies relevant to the description of rocks; no standardization of terminology is enforced, meaning that Macrostrat accepts that there are multiple different ways to describe rocks and sediments and the focus is instead on hierarchy and nomenclature that is in use in the scientific literature (e.g., Macrostrat understands that "basalt" is a "mafic," "volcanic," and "igneous" rock). In order to incorporate SOD lithologic logs into Macrostrat, this lithology (and corresponding lithology attributes) vocabulary was used, allowing original shipboard descriptions to persist while at the same time providing a hierarchical level of classification that allows for flexible description and retrieval. That is, it is now possible to retrieve all "carbonates"-bearing units in the SOD data, regardless of the specific lithologies assigned to the lithologic units (e.g., units described as "micrite" and "lime mudstone," two alternative carbonate classification nomenclatural schemes, would both be retrieved in queries for "carbonate" or "sedimentary" rocks).
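The benefit of a hierarchical vocabulary can be illustrated with a small, self-contained Python sketch. The dictionary below is a toy stand-in for Macrostrat's curated vocabulary and contains only the examples named in the text; the unit records and the function name are hypothetical, and no Macrostrat API is called.

```python
# Toy hierarchy: lithology -> broader classes it belongs to.
LITH_HIERARCHY = {
    "basalt": {"mafic", "volcanic", "igneous"},
    "micrite": {"carbonate", "sedimentary"},
    "lime mudstone": {"carbonate", "sedimentary"},
}

# Hypothetical lithologic units that keep their original descriptions.
units = [
    {"unit_id": 1, "lith": "micrite"},
    {"unit_id": 2, "lith": "lime mudstone"},
    {"unit_id": 3, "lith": "basalt"},
]


def units_matching(records, term):
    """Return units whose lithology is the term or falls under it."""
    term = term.lower()
    return [
        u for u in records
        if u["lith"].lower() == term
        or term in LITH_HIERARCHY.get(u["lith"].lower(), set())
    ]


print([u["unit_id"] for u in units_matching(units, "carbonate")])
print([u["unit_id"] for u in units_matching(units, "igneous")])
```

Because the classification lives in the lookup rather than in the unit records, the original shipboard terms persist unchanged while broader queries still succeed.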
In practice, matching Macrostrat's curated vocabulary of lithologies and their descriptors to the LIMS SOD data assembled by eODP required significant effort, primarily because of the heterogeneity within the LIMS data. For example, there are typically primary and minor lithology fields within the LIMS data, each of which optionally contains "prefix" and "suffix" descriptions. For example, the primary lithology in LIMS might be described as "ooze [MMK88]" with a prefix of "Clayey radiolarian" and a suffix of "with nannofossil and diatoms." There are a total of 3,281 unique prefix-principal lithology-suffix combinations in the original LIMS data set. Matching these terms to the Macrostrat curated vocabularies was done within the database itself. The original descriptions associated with the LIMS data remain connected to these revised and standardized descriptions, should they be required for any reason and in part because some standardization remains to be done (for example, where spelling errors or other anomalies appear in the LIMS data set, these modifiers may not yet be included in Macrostrat, though all principal and minor lithologies have been standardized within Macrostrat's vocabulary). Concurrent with the taxonomic efforts, manual entry of lithologic core descriptions not housed in online databases (ODP Leg 129 to IODP Expedition 312; Table 1) began in late 2020. Since that time, six undergraduate and graduate students at Texas A&M University were paid to manually enter sediment lithologic descriptions for 12,078 individual units from 618 holes at 229 sites and 43 Expeditions/Legs. Descriptions are typically entered at the core level (∼9.5 m resolution) and include the shipboard age assignment. The enterers worked directly off of the shipboard core summary sheets and used core depths stored in the LIMS database.
After about 6 months of this workflow, one of the students developed an OCR reading program and created core summary.csv files for each hole, including age and depth. This process reduced some of the steps associated with manual entry.

Chronostratigraphy Workflow and PBDB Connectivity
Similar to how the taxonomy of microfossils must be imported into the PBDB prior to further refinements of the system, Macrostrat requires a minimum level of chronostratigraphic detail (Peters et al., 2018). Because only limited age-depth relationships exist within any SOD database, eODP both manually entered age-depth information from the shipboard reports (Table 1) and collaborated with NSB to obtain chronostratigraphic information. In rare cases, postcruise information was used if, for example, the cruise sailed without any chronostratigraphers and therefore the shipboard reports contained no age information. The depth, core, and a variety of possible chronostratigraphic bins (e.g., calcareous nannofossil zone NN5, Eocene, and/or Chattian) were manually transcribed into web entry forms developed for Macrostrat. Additionally, NSB provided age models to the eODP team. These age-depth relationships were not incorporated in a one-to-one fashion, given Macrostrat's unit boundary-focused age model, but the ages were used to roughly calibrate the algorithm used to generate an initial age model. It is our intention that this is only a halfway step, and that further development of the stratigraphy aspects of eODP will involve fully faithful replication of NSB age-depth relationships, with accompanying citation back to NSB. Further developments include the capability for age models to be retrieved by the PBDB and to then be served with microfossil collections to users. Having a process of updating age models would be a worthwhile endeavor, but this will require community agreement and coordination on many aspects: for example, whether a linear interpolation or a spline fit is the more appropriate method, and which chronostratigraphic data types have priority over others. Even simple data standards would need to be agreed upon by the community before tools to update age models could be developed.
eODP is standardizing and migrating these shipboard data to Macrostrat, which can serve as a first step and platform for the SOD community to build upon. Additionally, chronostratigraphy was the focus of the "Coding the Column" workshops that eODP held online in 2022 (discussed below, in Section 4 "eODP community engagement").
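To make the interpolation question raised above concrete, the following is a minimal sketch of the simplest option, piecewise-linear interpolation between age-depth tie points. The function name and the tie points are hypothetical; this is not the algorithm used by eODP, NSB, or Macrostrat, and a spline fit would replace only the interpolation step.

```python
from bisect import bisect_right


def interpolate_age(depth_m, tie_points):
    """Linearly interpolate an age (Ma) at depth_m (m below seafloor).

    tie_points is a list of at least two (depth_m, age_ma) pairs with
    strictly increasing depths. Depths outside the tie-point range raise
    ValueError rather than being extrapolated.
    """
    if len(tie_points) < 2:
        raise ValueError("at least two tie points are required")
    depths = [d for d, _ in tie_points]
    if not depths[0] <= depth_m <= depths[-1]:
        raise ValueError("depth is outside the range of the tie points")
    # Index of the first tie point deeper than depth_m, clamped so that the
    # deepest tie point itself falls in the last segment.
    upper = min(bisect_right(depths, depth_m), len(depths) - 1)
    lower = upper - 1
    (d0, a0), (d1, a1) = tie_points[lower], tie_points[upper]
    fraction = (depth_m - d0) / (d1 - d0)
    return a0 + fraction * (a1 - a0)


# Hypothetical tie points: (depth in m, age in Ma).
example_ties = [(0.0, 0.0), (50.0, 5.0), (120.0, 19.0)]
print(interpolate_age(85.0, example_ties))
```

A hiatus (two tie points at the same depth with different ages) would need separate handling, which is one of the data-standard questions a community process would have to settle.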

Enhancing Macrostrat Age Models
Currently within Macrostrat, a portion of stratigraphy (a unit or subdivision of a unit) can only belong to one chronostratigraphic unit (e.g., a single biostratigraphic zone). This limitation results from the current "continuous time age-model" (Peters et al., 2018), which does not allow units within a column to overlap in time. For example, a unit below cannot belong to calcareous nannofossil biozone NN2 while the unit above belongs to planktic foraminifer biozone N5, because those zones partially overlap. The solution is to place both units in a broader time bin, such as the early Miocene, preserving their stratigraphic order and resulting in a more stable but less precise stratigraphy. This is an acceptable resolution for the analysis of basin-scale patterns of sedimentation over the Mesozoic and Cenozoic, similar to the Peters et al. (2013) study, which used Macrostrat and only calcareous nannoplankton zonations for determining marine age models. It is problematic for finer-scale analyses because these larger bins may encompass several million years of geologic time. eODP plans to accommodate more complex age models within the Macrostrat schema, for example, by simultaneously accommodating both biostratigraphic datums and zonation schemes, or by using multiple concurrent zonation schemes. The end goal, however, is to move toward well-defined algorithmic approaches (e.g., McKay et al., 2021). The current eODP age-depth data are confident at the Epoch scale and reasonable at the Stage level. Subdivision of eODP records below the Stage level is not advised without additional chronostratigraphic work by the end user.
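The overlap constraint described above can be illustrated with a short sketch. The function names are hypothetical, and the age ranges for NN2, N5, and the early Miocene are approximate placeholders chosen only so that the two zones overlap; they are not calibrated boundaries and this is not Macrostrat's implementation.

```python
def overlaps(zone_a, zone_b):
    """Return True if two (base_ma, top_ma) intervals share a span of time.

    Ages are in Ma, so base_ma (older) is numerically larger than top_ma.
    Intervals that only touch at a boundary do not count as overlapping.
    """
    base_a, top_a = zone_a
    base_b, top_b = zone_b
    return top_a < base_b and top_b < base_a


def assign_bins(lower_unit_zone, upper_unit_zone, broader_bin):
    """Keep both zones if they do not overlap in time; otherwise place both
    units in the shared broader bin, as described in the text."""
    if overlaps(lower_unit_zone, upper_unit_zone):
        return broader_bin, broader_bin
    return lower_unit_zone, upper_unit_zone


# Placeholder (base, top) ages in Ma; illustrative only, not calibrated.
nn2 = (22.8, 19.0)
n5 = (21.1, 17.5)
early_miocene = (23.03, 15.97)

print(assign_bins(nn2, n5, early_miocene))
```

Because the two placeholder zones overlap, both units fall back to the broader bin, which preserves stratigraphic order at the cost of precision.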

Description of the Compiled Data Set and Examples of Results
As of November 2022, the first compiled eODP data set comprises over 79,899 lithological units from 1,125 chronostratigraphically resolved ocean drilling holes at 422 sites, together with over 26,000 fossil-bearing samples with more than 5,300 taxonomic entries from 13 major biological groups (Table 3). Sites are added to the eODP database and are then imported into Macrostrat and the PBDB on a regular basis as they are digitized and processed. The lithologic and chronostratigraphic data can be accessed via the Macrostrat API by including "project_id=3" in the parameterization of the URL (the SOD sites that were entered into Macrostrat prior to the eODP project, i.e., those in Peters et al. (2013), can be accessed by including "project_id=4" in the API query). Note that Macrostrat allows for columns that are either "in process," meaning they are part of a project still underway, or "active," meaning they are part of a project that has been completed. Currently, eODP data are "in process," and the parameter "status_code=in process" must be included in the URL to retrieve them. Taxonomic samples will be accessible via the PBDB's "Main Menu" page, using the "Search for a reference" function with "Reference number" 82981.
The benthic foraminifera comprise 32% of these taxonomic entries, followed by planktic foraminifera (18%), calcareous nannoplankton (17%), diatoms (14%), radiolarians (13%), pollen and palynology (3%), and dinoflagellates (1%); all other groups listed in Table 3 comprise less than 1% of all taxonomic entries. Samples range from the Late Jurassic to the recent (Figure 3), with what is very likely a "Pull-of-the-Recent" bias (Raup, 1979) that begins in the mid-Cretaceous and is particularly evident from the mid-Neogene onwards. This bias results from the subduction of pre-Triassic and Jurassic sediments and a more complete sampling of younger sediments relative to those in the deep past, and it is characteristic of global compilations of unstandardized data through geologic time (e.g., Alroy, 2010b; Peters, 2005; Lowery et al., 2020). Another reason for this feature, specific to SOD data, is that younger sediments must be cored through to reach older ones and deep sea sections are commonly cored incompletely, resulting in the preferential sampling of younger sediments.
Collating these data also allows for the generation of reproducible maps of seafloor sediments from past intervals (Figure 4). In particular, the mapping of biogenic sediments (sediments generated by organisms, e.g., siliceous ooze produced by diatoms) can provide exceptional insights into how biogeochemical cycles, the preservation of sediments, and ocean-atmosphere interactions have evolved through time and across space. Most, although not all, SOD sediments are from deep ocean environments, and in these settings calcareous and siliceous sediments are biogenic rather than abiotically precipitated. To generate the maps in Figure 4, all eODP data within Macrostrat were downloaded for each lithology type (carbonate, siliciclastics, chemical, and volcanics) by including in the API query "units?project_id=3&lith_type=siliciclastic&response=long", which returns age, lithology, column id (a column being a collection of units from a single location, i.e., a stratigraphic column), unit id (a unit being a portion of sediments or rock within a column), thickness, and modern latitude and longitude coordinates for units in the eODP project. For this analysis, the additional parameter "&status_code=in process" was also included in the API call to retrieve all sites in the current "in process" eODP project; once all lithology and age files have been added to Macrostrat (Figure 1), this parameter will no longer be required.
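The query described above can be assembled programmatically. The sketch below only builds the URLs and does not send any requests. The base URL and the helper name are assumptions, as are the spellings of the lithology types other than "siliciclastic", which is the only value quoted in the text; all of these should be checked against the Macrostrat API documentation.

```python
from urllib.parse import urlencode

# Assumed base URL for the Macrostrat API (not given in the text).
BASE_URL = "https://macrostrat.org/api/v2"

# "siliciclastic" is quoted in the text; the other spellings are assumptions
# based on the lithology types named there.
LITH_TYPES = ["carbonate", "siliciclastic", "chemical", "volcanic"]


def units_url(lith_type, project_id=3, in_process=True):
    """Build a units query URL for one lithology type (hypothetical helper)."""
    params = {
        "project_id": project_id,
        "lith_type": lith_type,
        "response": "long",
    }
    if in_process:
        # Required while the eODP project is still flagged "in process".
        params["status_code"] = "in process"
    return f"{BASE_URL}/units?{urlencode(params)}"


for lith_type in LITH_TYPES:
    print(units_url(lith_type))
```

Here `urlencode` writes the space in "in process" as "+", which is the standard form encoding for query strings. Setting `in_process=False` reproduces the shorter query that should suffice once the project is marked complete.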
After retrieving the data, sediments present within each of the downloaded Macrostrat lithology types were sorted into six new lithology categories: calcareous, siliceous, clay, siliciclastic, glaciomarine, and volcanic sediments, based on the sediment categorization from Wade et al. (2020). The sediments were defined and distributed into these lithology categories using the primary lithology of each unit. As of November 2022, there are 18,644 calcareous points and 2,702 siliceous points, which were then confined to the epochs plotted in Figure 4, resulting in 6,037 calcareous points and 725 siliceous points. Many cores contain several different biogenic lithologies for a given time interval. For example, an east Pacific core (col id 4803) contains unit id 58539 (classified as a calcareous ooze, ranging from 9.95 to 9.97 Ma) and unit id 57885 (classified as a siliceous ooze, ranging from 9.97 to 9.98 Ma) during the Miocene; both units are plotted in Figure 4. Sediments are represented by different colored points (solid navy markers represent calcareous sediments; orange-colored open circles represent siliceous sediments). These maps were constructed using pyGPlates v.036 (Müller et al., 2018). These types of maps will be improved in the future with the addition of more data, more refined chronologies, and additional considerations, such as paleowater depth and the position of the calcite compensation depth (CCD). Despite these limitations, these maps display trends that have been found in more synthetic analyses (e.g., Lyle, 2003; Wade et al., 2020). For example, an abundance of siliceous sediments in the equatorial Pacific during the late Eocene (Figure 4a) is also clearly apparent in the Wade et al. (2020) map from this time.
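The sorting and epoch-filtering steps described above can be sketched as follows. The keyword rules are hypothetical stand-ins for the Wade et al. (2020) categorization, only two of the six categories are shown, and the record field names are illustrative rather than the actual API response schema. The two example units are those cited in the text; assigning a unit to an epoch by its midpoint age is an assumption made for this sketch.

```python
from collections import Counter

# Epoch boundaries in Ma as (base, top), following the international
# chronostratigraphic chart.
EPOCHS = {
    "Eocene": (56.0, 33.9),
    "Oligocene": (33.9, 23.03),
    "Miocene": (23.03, 5.333),
}

# Hypothetical keyword rules; the real scheme follows Wade et al. (2020).
CATEGORY_KEYWORDS = [
    ("calcareous", ("calcareous", "nannofossil", "foraminifer", "chalk")),
    ("siliceous", ("siliceous", "diatom", "radiolarian", "chert")),
]


def categorize(primary_lithology):
    """Map a unit's primary lithology name to a category, or None."""
    name = primary_lithology.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in name for keyword in keywords):
            return category
    return None


def epoch_of(base_ma, top_ma):
    """Return the plotted epoch containing the unit's midpoint age, or None."""
    midpoint = (base_ma + top_ma) / 2
    for epoch, (epoch_base, epoch_top) in EPOCHS.items():
        if epoch_top <= midpoint < epoch_base:
            return epoch
    return None


# The two units from col id 4803 cited in the text (illustrative field names).
units = [
    {"unit_id": 58539, "lith": "calcareous ooze", "base_ma": 9.97, "top_ma": 9.95},
    {"unit_id": 57885, "lith": "siliceous ooze", "base_ma": 9.98, "top_ma": 9.97},
]

counts = Counter(
    (epoch_of(u["base_ma"], u["top_ma"]), categorize(u["lith"])) for u in units
)
print(counts)
```

Both units fall in the Miocene, one in each biogenic category, which is why both appear as separate points in the same map panel.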
There are areas and intervals with substantially higher sampling (e.g., the equatorial Pacific Ocean during the Miocene, Figure 4c) that are also apparent in the data sets of Lyle (2003) and Wade et al. (2020). The eODP data set records a change in the proportion of calcareous versus siliceous sediments during the Cenozoic; during the Eocene ∼88% of the sediments are calcareous (1,835 calcareous points to 256 siliceous points; Figure 4a), while during the Oligocene it is ∼94% (795 calcareous points to 52 siliceous points; Figure 4b), and ∼89% in the Miocene (3,407 calcareous points to 417 siliceous points; Figure 4c). This matches the expectation from the literature, with the ∼1 km deepening of the CCD across the Eocene-Oligocene boundary resulting in a greater abundance of calcareous sediments in the Oligocene relative to adjacent time intervals (Coxall et al., 2005; Pälike et al., 2012). eODP represents a step-change forward for the scientific ocean drilling community, one where investigations like these can be done readily, without long and painstaking hours spent generating new data sets. It is the hope of the eODP project that, by following FAIR principles, these types of investigations can be facilitated much more readily.
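The calcareous share quoted above for each epoch is the number of calcareous points divided by the sum of calcareous and siliceous points. A minimal sketch of that arithmetic, using the Eocene and Miocene counts from the text (the function name is hypothetical):

```python
def percent_calcareous(calcareous_points, siliceous_points):
    """Percentage of biogenic points that are calcareous."""
    total = calcareous_points + siliceous_points
    return 100.0 * calcareous_points / total


eocene = percent_calcareous(1835, 256)
miocene = percent_calcareous(3407, 417)
print(f"Eocene: {eocene:.0f}%  Miocene: {miocene:.0f}%")
```

Note that this share considers only the two biogenic categories; points in the other lithology categories are not part of the denominator.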

eODP Community Engagement
The intention of eODP from the outset was not only to focus on data curation, but also to activate the SOD community to work on this material in a holistic way, co-designing the tools needed to do so. In December 2018, we convened an EarthRates-funded workshop, "Bringing Micropaleontology to the Paleobiology Database" (Workshop Report here: https://github.com/eODP/eODP-2018-EarthRates-Workshop; https://doi.org/10.5281/zenodo.7535419), where the ∼20 participants brainstormed on how to do this. The group also generated a datatype hierarchy document for microfossil occurrence data, with priority levels (required, desired, optional) for various tasks, which is included within the above GitHub repository. In 2022, eODP hosted two virtual workshops on SOD and Earth Science databases, called "Coding the Column, Using Databases to Synthesize Stratigraphy and Geologic Age," which engaged ∼75 scientists in total. The focus of these workshops was chronostratigraphy. The first meeting began with talks from chronostratigraphic database creators explaining their databases' purview, system, and structures. The second meeting focused on the methods employed to generate and store age models, the philosophy behind them, and best practices for storing, retrieving, and viewing chronostratigraphic and associated data (informational document, summary document, and video recording of Session #1 are available at https://github.com/eODP/Coding_the_Column; https://doi.org/10.5281/zenodo.7535411).
The eODP project has been introduced at several major conferences and seminars, including the American Geophysical Union Annual Meeting in 2019 (LeVay et al., 2019), the European Geosciences Union conference in 2020 (Fraass et al., 2020b), the EarthCube Annual Meeting in 2020 and 2022 (LeVay et al., 2020, 2022), the Geological Society of America in 2020 and 2022 (Fraass et al., 2020a; Sessa et al., 2022b), PaleoPERCS (Fraass, 2021), the Geological Association of Canada-Mineralogical Association of Canada Conference in 2022 (Fraass, LeVay, Sessa, Peters, Kaufman, et al., 2022), and the 2022 American Geophysical Union Annual Meeting (Sessa et al., 2022a). The eODP project has funding for several in-person workshops, both stand-alone and in conjunction with annual conferences, and it is our hope and plan to resume these activities, in person, during 2023.

Future Directions and Conclusions
This is the first paper in an anticipated series; subsequent works will describe additional steps (e.g., improving genus- and species-level taxonomies, more complex and complete age-depth relationships) and address research questions that are only possible with eODP data sets. Offshore environments, which are stable and continuous on million-year timescales and contain both the best-resolved fossil record and high-resolution paleoclimate records, have the potential to allow understanding of the coupled ocean-climate-biosphere system at a deeper level than previously possible. It is the hope of the eODP team that these questions can be tackled both by the eODP team and by a large community of other scientists employing this data set.
The status of the eODP project and its conclusions can be summarized as follows:
• Using existing databases, instead of building from scratch, has several advantages. Both the PBDB and Macrostrat have existing cyberinfrastructures that make data in them readily accessible in standard, use-agnostic formats that follow FAIR principles. These systems either meet the needs of the SOD research community or can be adapted to serve them.
• The PBDB does not currently reflect state-of-the-art taxonomy for many microfossil taxa. A planned step is to import the preexisting IODP Synonymy tables (circa 2010) developed by the Science and Technology Panel (STP) of the Integrated Ocean Drilling Program into the PBDB, alongside continued data entry work from the University of Victoria team.
• The management of age-depth relationships is complicated by the extensive requirements for generating high-quality marine stratigraphic records. eODP plans to continue developing tools as we work with the SOD community to establish the best ways in which to curate these data.
• Facilitating workshops in the pandemic era can be challenging, but despite the hurdles of virtual meetings, the SOD community remains eager to be a part of developing SOD data resources, as seen in the participation in and reception of the first EarthRates-funded workshop. We anticipate even greater success as in-person activity restarts and eODP is able to hold workshops.
• Modern SOD data have been available for more than half a century; however, they are not easily findable and interoperable, making them extremely difficult to use. The eODP project has devoted, and continues to devote, a significant effort to cleaning and harmonizing open data. This time investment highlights the difference between open data and FAIR data. If the community sees a benefit in SOD data being readily usable, working toward standardizing data collection would be prudent.
The eODP ecosystem could be used to facilitate other goals of the SOD community including developing mechanisms/tools to incorporate post-cruise research with shipboard data and to incorporate data from other marine geological programs (e.g., piston cores collected by UNOLS cruises). eODP represents, we believe, a step toward a new era for scientific ocean drilling, with legacy data used for broader and deeper questions than before.

Data Availability Statement
The standardized scientific ocean drilling lithology and micropaleontology data sets that compose the eODP database can be accessed through the eODP GitHub [https://github.com/eODP; https://doi.org/10.5281/zenodo.7535413]. All of the raw data can be found at [https://github.com/eODP/data-processing/tree/master/raw_data; https://doi.org/10.5281/zenodo.7535415] and the processed, standardized files can be found here [https://github.com/eODP/data-processing/tree/master/output; https://doi.org/10.5281/zenodo.7535415]. The "data-processing" repository on this GitHub contains the scripts used to standardize .csv files. All of eODP's GitHub repositories are public.

-1948843 and ICER-1928323. provides the salary support for L.J.L. We are extremely thankful to David Lazarus and Johan Renaudie for age models and fruitful collaborative discussions. Austin Hendy generously provided guidance on entering data into the PBDB. Two anonymous reviewers improved the clarity of the manuscript. We acknowledge the longstanding efforts of many in the SOD community, too numerous to mention, in advocating for SOD data to be united and on easily accessible platforms. With respect to guiding eODP, we would like to thank Brian Huber, David Lazarus, and Ellen Thomas. This is PBDB publication number 439.