The LILY Database: Linking Lithology to IODP Physical, Chemical, and Magnetic Properties Data

During each expedition of the International Ocean Discovery Program and its precursor, the Integrated Ocean Drilling Program (jointly referred to as IODP), vast arrays of data are collected from drill cores. These data, which are accessible from the IODP LIMS (Laboratory Information Management System) database, include physical, chemical, and magnetic properties collected semi‐continuously along cores using automated track systems, as well as a variety of analyses conducted on discrete subsamples taken from the cores. In addition, the lithology of all cores is described based on visual characteristics of the surface of split cores, visual examination of smear slides and thin sections, and compositional or mineralogical information derived from geochemical analyses. We extract basic lithologic information from this complex array of descriptive information and then tie that information to all other measurements. This new database is referred to as LIMS with Lithology (LILY). LILY currently contains over 34 million data from 89 km of core recovered on 42 expeditions conducted 2009–2019. Some uses of LILY include identifying the abundance of different lithologies, finding data from core intervals with a specific lithology, assessing the efficacy of coring systems in different lithologies, or characterizing and analyzing physical, chemical, and magnetic properties based on lithology. We illustrate the use of LILY by computing the grain density by lithology from over 24,000 moisture and density measurements and then use those grain densities, along with the large IODP bulk density dataset, to compute a new high‐resolution porosity dataset with over 3.7 million new porosity estimates.


Introduction
General Overview
The standardized collection of data during each Integrated Ocean Drilling Program or International Ocean Discovery Program (IODP) expedition includes a wide variety of physical, chemical, and magnetic properties data, digital images, and descriptive information. Most of these data are collected every few centimeters along the cores using automated track systems. Additional datasets are constructed from a variety of analyses on discrete subsamples taken from the cores. Some of the properties typically measured include P-wave velocity, bulk density, grain density, porosity, magnetic susceptibility, natural remanent magnetization, natural gamma radiation, thermal conductivity, and visible spectral reflectance as well as a wide range of geochemical properties (Table S1 provides a full list of data types; Table S2 provides data quantity). In addition to these measurements, descriptive information is recorded continuously. Ten years of digital tabular descriptive information now exists for 42 expeditions, representing data from multiple ocean basins and geological time.
Since 2009, the DESClogik software system has been used onboard the JOIDES Resolution to collect descriptive data. DESClogik is a software package developed internally by IODP that includes a tabular interface to enter data

10.1029/2023GC011287
Key Points:
• Lithologic information from ocean drill cores has been organized and paired with over 34 million other data to create a new database
• The new database allows lithologic associations with other data types to be explored
• Grain densities are determined for each lithology and combined with whole-core bulk densities to estimate over 3.7 million new porosities
Supporting Information: Supporting Information may be found in the online version of this article.
There is currently no convenient method to compare, combine, or analyze DESClogik data, particularly lithology data, across multiple expeditions or to integrate descriptive data with the other track and discrete measurements collected onboard. Most shipboard data collected onboard the JOIDES Resolution are stored in tabular form in the Laboratory Information Management System (LIMS) database. Descriptive information is an exception; it is available within the LIMS database in multi-tab Excel files, which complicates directly pairing the lithologic descriptions with other data.
The subjective nature of lithologic descriptions further complicates the pairing process, as the same lithology may not be described consistently from one expedition to the next. Inconsistencies in descriptions may even exist within the same expedition. Furthermore, lithologic terminology evolves over time. An additional complication is that lithologic descriptions are not made on the same scale as other observations. For example, each measurement made along a core section by a track system is representative of a distinct spot or narrow interval and data from each discrete sample are representative of a small volume of material, while descriptive information for a core section frequently includes large intervals (up to an entire ∼1.5 m section).
In this study, we extract the lithologic information from the complex array of descriptive information and then tie that information to other observations made on the cores in order to characterize the physical, chemical, and magnetic properties of a myriad of lithologies. The primary products are consistently cleaned, vetted, and organized tables of lithologies for each expedition and tables of all the other observations with a lithology attached. In these tables, each observation also contains the expedition, site, hole, core, core type, section, offset (interval), and depth in meters below seafloor (mbsf; referred to as CSF-A[m] in LIMS) and has associated metadata derived from the Hole and Core Summary files, including geographic coordinates, water depth, and additional coring type information (coring types are listed in Table S3). This database, which we refer to as LIMS with LithologY (LILY), contains millions of observations, some that are made on the scale of tens of microns to millimeters and others that are on the scale of centimeters to meters.
LILY may be used in a variety of investigations, including simply determining the characteristics of different lithologies. For instance, what is the magnetic susceptibility of nannofossil ooze and how does it differ from clay or diatom ooze? While one might be able to find some ranges of magnetic susceptibilities for common rock types with an internet search or in geophysical textbooks, those ranges are generally based upon a few tens to a few hundreds of measurements for a limited number of general lithologies. There are over 6.8 million magnetic susceptibility measurements in LILY and over 98 principal lithologies, each with more than 1,000 susceptibility observations. Other studies might seek to investigate relationships between physical properties as a function of lithology, depth, latitude, tectonic province, or other criteria. For example, how does porosity vary with density for different lithologies? How does overburden affect seismic P-wave velocity for different lithologies? In what lithologies is there the largest P-wave velocity anisotropy? How does the lightness (L*) vary with carbonate content as a function of lithology, and are the relationships dependent on location and paleoclimate? For example, intervals that are lighter and have higher carbonate content are associated with warmer interglacial intervals in the North Atlantic over the past 3 million years (Stein et al., 2006, 2009).
With lithologies paired with over 34 million track/discrete data (including digital image red/green/blue (RGB) color, visible spectral reflectance, magnetic susceptibility, natural remanent magnetization (NRM) before and after demagnetization treatment, P-wave velocity, natural gamma radiation (NGR), density (wet and dry), porosity, thermal conductivity, carbonate content, major element oxides and minor element content from ICP (inductively coupled plasma) measurements, and other chemical observations), LILY is a substantial database that can be probed with big data analytics, business intelligence, and machine learning methods to allow users to extract the information or attributes they desire. Below, we provide an example of this using grain and bulk densities to derive over 3.7 million new porosity estimates and characterize the densities and porosities by lithology. We chose this example because calculating porosities from bulk densities requires knowledge of the grain density, which in turn depends on the composition and hence the lithology of each sample.
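The dependence of porosity on grain density can be illustrated with the standard mass-balance relation for a fluid-saturated sample, φ = (ρ_grain − ρ_bulk) / (ρ_grain − ρ_fluid). The following is a minimal Python sketch of that relation, not the authors' implementation; the function name and the pore-fluid density of 1.024 g/cm³ (a typical seawater value) are assumptions made for illustration.

```python
def porosity_from_density(bulk_density, grain_density, fluid_density=1.024):
    """Fractional porosity of a saturated sample from densities in g/cm^3.

    fluid_density defaults to an assumed seawater value; it is not taken
    from the source text.
    """
    if grain_density <= fluid_density:
        raise ValueError("grain density must exceed fluid density")
    return (grain_density - bulk_density) / (grain_density - fluid_density)


# A sample with bulk density 1.862 g/cm^3 and grain density 2.70 g/cm^3
print(round(porosity_from_density(1.862, 2.70), 3))
```

Because the grain density appears in both numerator and denominator, using a lithology-specific grain density rather than a single global value changes the porosity estimate for every bulk density measurement.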

Background on Lithologic Naming Conventions
From the Deep Sea Drilling Project (DSDP) to current IODP expeditions, lithologic naming conventions have evolved. In addition, the detail to which one science party describes the lithology of the recovered material varies depending on the goals of the expedition, the size and expertise of the science party, and the volume of material recovered. Since DSDP Leg 2, most legs or expeditions have provided an explanation of their lithologic naming conventions in an "Explanatory Notes" or "Methods" section of their "Initial Reports" or "Proceedings Volume" (e.g., Huber et al., 2019; The Shipboard Scientific Party, 1970), all of which are available online at http://publications.iodp.org. The understanding of lithologic naming conventions as well as other core description practicalities is important for the appropriate use of IODP descriptive data (e.g., Marsaglia & Milliken, 2023).
Although not ubiquitously or consistently used from one expedition to the next, a common theme for sedimentary rock classification for IODP expeditions has been to follow the guidelines in Mazzullo et al. (1988), with exceptions that generally entailed devising alternative naming conventions for the "mixed sediment" classification of Mazzullo et al. (1988). While the basic descriptive terminology varies amongst the expeditions, lithologies for granular sedimentary rocks generally have been described with a principal name that may have major modifiers given as prefixes and minor modifiers as suffixes. For sedimentary rocks, the prefix (major modifier) generally implies a composition that is >25% and suffixes (minor modifiers) are for compositions between 10% and 25% (Mazzullo et al., 1988). The principal lithology is representative of the dominant composition (>60%) derived from one of four types of granular sedimentary rocks (pelagic, neritic, siliciclastic, or volcaniclastic). For example, an unconsolidated sediment that is 40% silt + 35% clay (total siliciclastic = 75%) + 25% nannofossil grains would have a principal name = silt, because silt is the dominant component of the siliciclastics, which comprise >60% of the total composition. The prefix or major modifier would be "clayey" and the suffix or minor modifier would be "with nannofossils", resulting in a full lithologic name of "clayey silt with nannofossils". When none of the four granular types comprised >60% of the total, the term mixed sediment was used by Mazzullo et al. (1988). The "mixed sediment" principal lithology is not, however, used by any of the expeditions covered in this study. Instead, most expeditions have derived principal names from the granular sedimentary rock type that represents >50% of the composition. Given the difference in usage and some inherent ambiguities in the nomenclature, full lithologic names at best provide semiquantitative estimates of the amount of each component in a sample. For example, a lithology such as a clayey sand with foraminifera is dominantly sand with >25% clay and 10%-25% foraminifera. Given the percentages for the major and minor modifiers, sand comprises 40%-65% of the composition and clay must also be <45%.
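The Mazzullo et al. (1988) thresholds described above can be sketched as a small naming routine. This is an illustrative Python sketch only: the lookup tables `CLASS_OF` and `PREFIX_FORM` are hypothetical and cover just a few components, the function name is invented, and the principal names of biogenic sediments (e.g., "ooze") are not handled.

```python
# Hypothetical lookup tables; a real classification needs many more entries.
CLASS_OF = {"silt": "siliciclastic", "clay": "siliciclastic",
            "sand": "siliciclastic", "nannofossils": "pelagic",
            "foraminifera": "pelagic"}
PREFIX_FORM = {"clay": "clayey", "silt": "silty", "sand": "sandy",
               "nannofossils": "nannofossil", "foraminifera": "foraminiferal"}


def name_sediment(composition):
    """composition maps component name -> percent of the sample."""
    # Sum percentages by granular sediment class.
    totals = {}
    for component, pct in composition.items():
        cls = CLASS_OF[component]
        totals[cls] = totals.get(cls, 0) + pct
    top_class, top_total = max(totals.items(), key=lambda kv: kv[1])
    if top_total <= 60:
        return "mixed sediment"
    # Principal name: most abundant component within the dominant class.
    members = {c: p for c, p in composition.items() if CLASS_OF[c] == top_class}
    principal = max(members, key=members.get)
    others = sorted(((p, c) for c, p in composition.items() if c != principal),
                    reverse=True)
    prefixes = [PREFIX_FORM[c] for p, c in others if p > 25]   # major modifiers
    suffixes = [c for p, c in others if 10 <= p <= 25]         # minor modifiers
    name = " ".join(prefixes + [principal])
    if suffixes:
        name += " with " + " and ".join(suffixes)
    return name


print(name_sediment({"silt": 40, "clay": 35, "nannofossils": 25}))
```

Replacing the 60% test with a 50% test would approximate the practice most expeditions in this study actually followed.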
Given this terminology, lithologic descriptions have been stored in the DESClogik "macroscopic" spreadsheets in three main columns or categories, which we refer to as "prefix", "principal", and "suffix", with a fourth column that gives the "full lithology" that is a combination of the prefix, principal, and suffix. Not every expedition uses all three columns, nor do they all use them in the same manner. Some expeditions (e.g., Expeditions 318, 341, 352, 363, 366, 374) follow or partly follow the classification system of Dean et al. (1985) or some modified version of that system, in which they include major modifiers (25%-50% of the composition) and minor modifiers (10%-25% of the composition) in the prefix, and differentiate them by adding "-rich" or nothing for major modifiers and by adding "-bearing" for minor modifiers. The principal lithology name is determined from the dominant (>50%) non-biogenic or biogenic component. This system does not use the "suffix" category. A lithology with the prefix = "radiolarian-bearing nannofossil-rich" and principal = "diatom ooze", full lithology = "radiolarian-bearing nannofossil-rich diatom ooze" would indicate the ooze was between 10% and 25% radiolaria, 25%-40% nannofossils, and >50% diatoms (note that nannofossils cannot be 25%-50% in this example because of the relative abundance of radiolaria and diatoms). In the unmodified version of the Dean et al. (1985) classification, the "-rich" was not added to a prefix of a component that comprises 25%-50% of the composition, and if more than two major modifiers were present, they were listed in order of increasing abundance. These practices result in ambiguities. For example, if the full lithology is given as "radiolarian-bearing nannofossil diatom ooze", diatoms may comprise 25%-50% if they are a major modifier of "ooze" or >50% if they are part of the principal lithology "diatom ooze". Confusion can also arise when some descriptions include a combination of methods that do not agree with what is stated in the "Methods" section of the associated Proceedings Volume. Several expeditions have minor modifiers in both the prefix and suffix columns even though their methods state that all minor modifiers will be placed in the prefix with "-bearing" added. For example, a lithology description that violates the classification systems of both Dean et al. (1985) and Mazzullo et al. (1988) is given as a prefix = "foraminifer bearing", principal = "diatom ooze", and suffix = "with radiolarians". In this example, a hyphen is missing in the prefix of the minor modifier and a second minor modifier is listed in the suffix, which presumably should have been in the prefix as "radiolaria-bearing." These types of inconsistencies or ambiguities are not uncommon and so we have developed some conventions of our own that hopefully add some consistency without altering the intent of the original lithologic descriptions, the details of which are given in Text S2 in Supporting Information S1 and Table S4.
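To make the "-rich"/"-bearing" prefix convention concrete, the sketch below maps each prefix term to the abundance range it implies. It is a hedged Python illustration, not the procedure documented in Text S2: the function name and the `RANGES` table are introduced here, and untagged terms are treated as major modifiers, which is exactly the ambiguous case noted above.

```python
import re

# Percent ranges implied by each tag, as stated in the text.
RANGES = {"bearing": (10, 25), "rich": (25, 50)}


def parse_prefix(prefix):
    """Return {component: (min_pct, max_pct)} for a Dean-style prefix."""
    # Tolerate a missing hyphen, e.g. "foraminifer bearing".
    normalized = re.sub(r"\s+(bearing|rich)\b", r"-\1", prefix.strip().lower())
    result = {}
    for token in normalized.split():
        component, sep, tag = token.rpartition("-")
        if sep and tag in RANGES:
            result[component] = RANGES[tag]
        else:
            # Untagged term: assumed here to be a major modifier.
            result[token] = RANGES["rich"]
    return result


print(parse_prefix("radiolarian-bearing nannofossil-rich"))
print(parse_prefix("foraminifer bearing"))
```

The ranges are upper bounds only; as the diatom ooze example shows, the other components can narrow them further.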
Igneous rock classification, which is based on composition and texture, can also vary considerably between expeditions. As might be expected, basalt is by far the most common principal lithology for igneous rocks, but there are many compositional and textural terms applied to this base lithology. Prefixes are commonly used to define phenocryst composition, abundance, and/or size, for example, highly olivine-phyric basalt, fine-grained basalt, and aphyric cryptocrystalline basalt. Again, the "Explanatory Notes" or "Methods" sections of the expedition publications provide the specifications for the nomenclature. For example, Expedition 335 defines aphyric cryptocrystalline basalt as basalt with <1% phenocrysts and with phenocryst grain size <0.1 mm (Expedition 335 Scientists, 2012). Suffixes are applied more rarely and have commonly been used for flow morphology, for example, "sparsely plagioclase phyric basalt pillow lava flow".

Methods Summary
Multiple script-based transformations are applied to data downloaded from IODP data repositories to generate the LILY database (Figure 1). All data used in this analysis are available from LORE (LIMS Online Reports; https://web.iodp.tamu.edu/LORE/). An exception to this is lithologic data for Expeditions 318 and 321, which are incomplete in LORE and are instead derived from IODP Publications archives and are provided in Zenodo as part of the full LILY database in Childress et al. (2023); Data Set S1, https://doi.org/10.5281/zenodo.8408297.

Data Gathering
DESClogik, a core and fossil description application, was first tested onboard the JOIDES Resolution in 2009 during Expedition 320T, although no data were archived, and then first implemented during Expedition 320; it was later updated in 2011 during Expedition 335 (the updated version is sometimes generically known as DESClogik 2.0). DESClogik data from IODP expeditions that are out of moratorium are available at http://web.iodp.tamu.edu/DESCReport/. We use data from the 42 expeditions listed in Table 1, all of which were conducted between March 2009 and March 2019 and the last of which was out of moratorium as of March 2021 (Figure 2). Descriptive information is organized at the expedition-site-hole level, resulting in a separate final product Excel document for each IODP hole drilled. For some expeditions, multiple versions of the Excel document of descriptive information are available for a single hole, usually as the result of corrections made to descriptive information during postexpedition editorial meetings. In the case of multiple versions of the same descriptive file, the most recently dated file is used.
The descriptive information is available as Microsoft Office Excel format files (including .xls and .xlsx) containing multiple sheets within each file, which include descriptive information such as drilling disturbance, lithology, structure, etc. While multi-sheet Excel file formats are a convenient method for storing tabular information, this format can present obstacles to data mining, and the naming of these files is inconsistent across the downloadable LORE dataset. For example, an export of lithologic data from Expedition 320 is given the filename "320 Core Description_U1332B", while an export from Expedition 363 is given the filename "X363-U1490B-macroscopic". Although not detrimental to the integrity of the data contained within, these inconsistencies create data mining hurdles. To more easily access individual data and remove unnecessary formatting, the R software environment (R Core Team, 2023) is used to extract descriptive data and, as described later, to pair descriptive and measured data.
To create independently accessible descriptive textual files that contain data records in a tabular format, each sheet of each Excel file was extracted and converted to comma separated value (CSV) format, and a consistent naming convention was applied (Table S5). These exported data are available as LILY-RawDESC. LILY-RawDESC files contain the raw text content of the Excel worksheet without applying any corrections or conversions. They thus have all the blemishes of the original observations but provide a raw, searchable dataset for all DESClogik data. However, because these are the raw data, they can include errors such as misspelled words, undefined acronyms, extra carriage returns, extra spaces, improper punctuation, and erroneous sample identification information (also referred to as "scope of description" in DESClogik terminology); some files are also derived from worksheets with odd naming conventions. Because DESClogik templates for data capture are often reused across expeditions, many Excel sheets (and thus resulting CSV products) contain no data because the descriptive category was irrelevant to a later expedition and ignored rather than removed. These inclusions are less noticeable in the Excel-sheet format of the original data but, in the CSV format, many files contain only header information. These empty datasets are left in the LILY-RawDESC collection for completeness and transparency to the user that data have not been removed.
We devise a naming convention for the LILY-RawDESC files that allows them to be related to the original worksheet from which they were derived, while also containing a consistent convention that aids users in identifying the type of data in the files. The naming convention for the CSV dataset is formatted as "XEXP_SHEETNAME_type_type2_SITEHOLE" where: X is an abbreviation for Expedition; EXP = expedition number; SHEETNAME = the original Excel file sheet name; type = a low-order data designation (e.g., radiolaria, sediment, diatom) based on the original filename; type2 = a high-order data designation of "microscopic", "macroscopic", or "paleo"; and SITEHOLE = the conventional IODP Site-Hole concatenation (e.g., U1417A for Site U1417, Hole A). If no low-order designation is provided by the filename, the high-order designation is repeated. For example, X374_General_macroscopic_macroscopic_U1524A.csv contains the macroscopic lithostratigraphy for Hole U1524A cored during Expedition 374. All CSV files can be traced to the Excel file of origin using Table S5. For example, the file X374_General_macroscopic_macroscopic_U1524A.csv was derived from the Excel file 374_U1524A_macroscopic.xlsx downloaded from LIMS.
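The naming convention above is simple enough to express as a small helper. The Python sketch below is illustrative only (the authors worked in R, and the function name and argument names are assumptions); it reproduces the documented pattern, including the rule that the high-order designation is repeated when no low-order designation exists.

```python
def rawdesc_filename(expedition, sheet_name, site_hole, high_order, low_order=None):
    """Build a LILY-RawDESC style CSV name: XEXP_SHEETNAME_type_type2_SITEHOLE.csv"""
    allowed = ("microscopic", "macroscopic", "paleo")
    if high_order not in allowed:
        raise ValueError(f"high_order must be one of {allowed}")
    # Repeat the high-order designation when no low-order one is available.
    low = low_order or high_order
    return f"X{expedition}_{sheet_name}_{low}_{high_order}_{site_hole}.csv"


print(rawdesc_filename(374, "General", "U1524A", "macroscopic"))
```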

Data Transformation and Management: Lithology
One of the most consistently collected and useful descriptive parameters in the DESClogik data is lithology. While lithology is recorded for cores from all expeditions, it can be stored across multiple sheets of the Excel files. For example, an expedition that recovers sedimentary, igneous, and metamorphic rocks may record the lithology of these materials across three separate sheets within the description file(s).
Additionally, DESClogik has not required a consistent header name for data entry of lithology. For example, the principal lithology column across expeditions may be written as "Principal lithology", "Lithology principal name", "PRINCIPALLITHOLOGY", "LithologyPRINCIPAL NAME", "Lithology", "Lithology 1 principal", "MAJ Lith. Principal name", "MAJ Principal lithology B", and a variety of variations of these combinations. In older data (prior to Expedition 339), lithology headers can vary hole-to-hole within a single expedition. Lithology prefix and suffix column headers vary similarly in the wide variety of styles and formats used for header column names. Some expeditions do not include a prefix and/or suffix column in the lithologic description and even within a single expedition, this may not be consistent across multiple description type sheets (e.g., sediment, metamorphic). Attempts to identify lithology columns by searching for a string or phrase such as "Lith" in the header are hindered by the presence of many other non-lithology column headers that include variations of the term "lithology," such as "Lith. Color" or "Lithologic accessories". The formatting of the column headers and column order for sample identification information (Expedition, Site, Hole, Core, etc.) is equally diverse across expeditions, and occasionally within a single expedition. Therefore, each sheet (as CSV) in LILY-RawDESC is processed separately to identify lithology columns (a list of these is given in Table S6), extract the principal lithologies along with major and minor modifiers, transform these observations into a consistent format, and add associated identification information and metadata to generate the LILY-RawLITH dataset (Figure 1).
In order to generate a consistently formatted dataset, decisions had to be made on how to handle cases in which there was no clear principal lithology (either with or without major and minor modifiers) in the DESClogik data. For example, problems arise due to inconsistent use of the term "principal lithology". In some expeditions, "principal lithology" is referred to as "major lithology", although without the presence of a "minor lithology". Note that this "major lithology" is not the "major modifier" of a principal lithology; it is instead used in some expeditions in place of "principal lithology". Other expeditions employ a scheme of "major lithology" with some records of "minor lithology". For example, a lithologic description for a section of core may be given as having a major lithology that is "nannofossil ooze with diatoms" and a minor lithology as "diatom ooze with nannofossils". Here, the term "major lithology" indicates the most common lithology in an interval that has varying amounts of nannofossils and diatoms, with nannofossils being more common than diatoms for most of the described interval. In this case, we would use only the "major lithology" and record the "principal lithology" as "nannofossil ooze" and the "minor suffix" as "with diatoms" in the LILY-RawLITH dataset. Minor lithologies, such as those described above, are uncommonly recorded; they are found in only nine expeditions. These minor lithologies are extracted and included in LILY-RawLITH for availability to other researchers but are not used in the lithology analysis presented here.
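The header problem described above, where a naive search for "Lith" also matches color and accessory columns, can be illustrated with a simple normalization-and-filter rule. This Python sketch is an assumption-laden illustration and not the authors' method, which relied on a curated per-sheet list (Table S6); the function name and the `EXCLUDE` terms are hypothetical.

```python
import re

# Hypothetical terms marking non-principal or non-lithology columns.
EXCLUDE = ("color", "colour", "accessor", "prefix", "suffix", "comment")


def is_principal_lithology_header(header):
    """Heuristic test for a principal lithology column header."""
    # Lowercase and drop spaces, periods, and other punctuation.
    squashed = re.sub(r"[^a-z0-9]", "", header.lower())
    if "lith" not in squashed:
        return False
    if any(term in squashed for term in EXCLUDE):
        return False
    return "principal" in squashed or squashed == "lithology"


for h in ["PRINCIPALLITHOLOGY", "MAJ Lith. Principal name",
          "Lith. Color", "Lithologic accessories"]:
    print(h, is_principal_lithology_header(h))
```

A heuristic like this is only a first pass; headers that vary hole-to-hole still need to be checked against the actual column contents.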
Another issue arises because some DESClogik datasets (Table S6) contain multiple principal lithologies for the same depth interval and include an additional column indicating the percentage of material of each lithology.
In instances of multiple principal lithologies for the same interval, the dominant (largest percentage value) lithology is retained. Fundamentally, only lithologies ≥50% should be retained, as this would be equivalent to a "principal lithology". Multiple expeditions, however, include three or more principal lithologies for the same description interval, and all percentages provided are <50%. In cases where multiple principal lithologies for the identical description interval are present with equivalent greatest percentage values (e.g., 50%-50% or 40%-40%), each entry is kept because no clear principal lithology can be determined. For example, consider an interval within a section that, rather than having a single row description in the DESClogik table with a principal lithology and with major and minor modifiers, is instead described in three rows by three different principal lithologies and each lithology has been assigned a percentage: "nannofossil ooze" = 40%, "clay" = 40%, and "mud" = 20%. In this case, there is no clear singular principal lithology, and both the "nannofossil ooze" and "clay" are retained as the principal lithology for the interval. This example is even more insidious because mud is a mixture of clay and silt; clay is mud, silt is mud, and mud would preferably never occur in lithologic descriptions that include silt or clay. Such terminology, however, does occur in the original descriptions, including principal lithologies given as "silty mud", "muddy clay", and "muddy silt".
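The retention rule for intervals with several percentage-weighted lithologies (keep the largest, keep all entries on a tie) can be sketched in a few lines of Python. The function name and the (lithology, percent) row layout are assumptions for illustration.

```python
def dominant_lithologies(rows):
    """rows: list of (lithology, percent) tuples for one description interval.

    Returns every lithology that shares the largest percentage, so ties
    such as 40%-40% keep both entries.
    """
    top = max(pct for _, pct in rows)
    return [lith for lith, pct in rows if pct == top]


print(dominant_lithologies([("nannofossil ooze", 40), ("clay", 40), ("mud", 20)]))
```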

Data Cleaning and Metadata Additions: Lithology
Description intervals with erroneous, missing, or corrupted identification information are common. Some errors in identification information, such as a randomly missing "A" for archive half in a single row of description, are reasonably obvious. Other errors are more complicated, such as several expeditions where pieces are described by the correct mbsf (CSF-A m) values but incorrect or ambiguous cm offset values. The most common of these are from the ambiguous use of "offset", which is the distance measured from the top of a sample, where the sample can be a section, piece, or subsample of a sample. Failure to identify the sample clearly means the distance given is measured from an unknown reference point. Generally, "offset" is used in IODP observations like the term "interval" was used for DSDP and ODP, in which "interval" is always a measure of the distance from the top of a section. To ensure consistent use of "offset" across all the expeditions, we only use offset as the distance from the top of a section and correct any of the originally ambiguous entries. Other erroneous description intervals include missing section values, descriptions at the core level that need to be expanded to the section level, missing cm offset values, missing mbsf (CSF-A m) values, piece identification information located in the wrong column, section identification information located in the wrong column, and rows that are otherwise flawed. Some of these errors are generated by the methods used to extract data from the CSV files. For example, many expeditions do not explicitly include identification information by column and so we parse it from the Sample ID during the lithology extraction process. The wide variety and format of Sample IDs further complicates the automation of the extraction process, requiring manual intervention. All noted identification errors are corrected by script.
To assess the completeness of lithologic descriptions, overlaps and omissions (gaps) in description are determined for LILY-CleanLITH. Overlaps in lithologic description are defined in this context as intervals that are associated with two or more lithologic descriptions. An exception to this is made for identical description intervals with different descriptions, as these are primarily derived from the early use of percentages in DESClogik (see above). An example of an overlap counted here is 369-U1513E-6R-2A, 0-29 cm (dolerite) and 21-29 cm (basalt). In total, 350 m of overlapping descriptions are found in LILY-CleanLITH. To determine omissions (gaps) in the lithologic description, we consider the total meters of curated material minus whole-round shipboard samples. Whole round (WRND) samples collected on the catwalk are typically not visually examined by core describers and therefore should lack a lithologic description. With WRND samples removed from the total curated length, 87,227 m of curated material are available for description, and of this, only 1,867 m are undescribed (∼2%). The largest discrepancies come from expeditions with high recovery where multiple holes or large portions of holes were not described owing to time constraints during the expeditions, for example, Expeditions 323, 339, and 363. This results in 85,360 m of lithology described in LILY-CleanLITH, regardless of the potential overlapping descriptions available.
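As an illustration of parsing identification information from a Sample ID, the Python sketch below handles one common label shape, such as "369-U1513E-6R-2A". The pattern is an assumption based on that single example (with "W" for working half and "CC" for core catcher added as further assumptions); as noted above, real Sample IDs vary widely and many would not match it.

```python
import re

# Assumed label shape: EXP-SITE+HOLE-CORE+TYPE-SECTION[+HALF]
SAMPLE_ID = re.compile(
    r"^(?P<expedition>\d+[A-Z]?)-(?P<site>[A-Z]\d+)(?P<hole>[A-Z])-"
    r"(?P<core>\d+)(?P<type>[A-Z])-(?P<section>\d+|CC)(?P<half>[AW])?$"
)


def parse_sample_id(sample_id):
    """Return a dict of identification fields, or None if the ID does not match."""
    match = SAMPLE_ID.match(sample_id.strip())
    return match.groupdict() if match else None


print(parse_sample_id("369-U1513E-6R-2A"))
```

Returning None for non-matching IDs, rather than guessing, leaves those rows flagged for the kind of manual intervention the text describes.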
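The overlap and gap bookkeeping described above can be sketched for a single section with a sweep over interval boundaries. This is a minimal Python illustration under stated assumptions: the function name is invented, intervals are (top_cm, bottom_cm) offsets from the top of the section, and identical intervals are collapsed first to mirror the stated exception for percentage-based descriptions.

```python
def overlap_and_gap_cm(intervals, section_length_cm):
    """Return (overlap_cm, gap_cm) for the described intervals of one section."""
    events = []
    for top, bottom in set(intervals):  # identical intervals count once
        events.append((top, 1))
        events.append((bottom, -1))
    events.sort()  # at equal positions, closings (-1) sort before openings (+1)

    covered = overlap = 0.0
    depth = 0          # number of descriptions active at the current position
    prev = 0.0
    for pos, delta in events:
        if depth >= 1:
            covered += pos - prev
        if depth >= 2:
            overlap += pos - prev
        depth += delta
        prev = pos
    return overlap, section_length_cm - covered


# 369-U1513E-6R-2A: 0-29 cm (dolerite) and 21-29 cm (basalt)
print(overlap_and_gap_cm([(0, 29), (21, 29)], 29))
```

Summing these per-section values over all sections would give totals comparable in kind to the overlap and gap lengths reported above, although the authors' exact accounting may differ.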
Core Summary information includes the expedition, site, hole, core, coring type, top and bottom depths drilled, advances and recoveries, time and date of recovery, and the number of sections. Coordinates and water depth for each hole are derived from LIMS (and the Janus database at http://www-odp.tamu.edu/database/ for older expeditions) and appended to Core Summary information [LILY-CoreSUMM]. During Expeditions 341 and 346, the half-length advanced piston corer (HLAPC) system was used, but the core type for the advanced piston corer (APC) and HLAPC were both recorded in the IODP database as "H." A distinction of "F" was later used for the HLAPC core type. Because the original "core type" is part of the Label and Sample ID (e.g., tracking information attached to real sections and samples that exist), this information is not corrected in the core type column ("Type").
Rather than changing the sample ID information, which would generate problems with other datasets, we provide a new column ("Expanded Core Type") to indicate the true coring system used. Coring types in column "Expanded Core Type" include a wide variety of terminology (Table S3; Storms, 1990); however, the most common types of interest are generally: APC = advanced piston corer, HLAPC = half-length advanced piston corer, XCB = extended core barrel, RCB = rotary core barrel, WASH = washed interval, DRILLED = drilled interval, and GHOST = ghost core. Because basic coring summaries are available for all DSDP, ODP, and IODP expeditions in the Janus and LORE databases, we have extracted those data, paired them with additional metadata (expanded core type, latitude, longitude, and water depth) and included them in LILY-CoreSUMM along with the expeditions included in the rest of this study.
The final cleaning of lithologic data is to standardize the terminology used for the prefix, principal, and suffix lithology terms. This standardization and linkage to the original DESClogik data is contained in the LILY dictionary (Table S4), which was created following the guidelines described in Text S2 in Supporting Information S1. Briefly, standardization includes the correction of variable capitalization (e.g., sand, Sand, SAND), misspellings (e.g., "bbasalt"), and redundant terminology (e.g., removal of "with nannofossils" from "nannofossil ooze with nannofossils"). Other normalizations include ordering of multiple terms (rich, bearing, poor). Alphabetization of mineral nomenclature is also included. While in some instances the ordering has meaning and purpose, this is not always guaranteed and without standardization of these terms over the 42-expedition dataset, nearly identical lithologies are less readily compared. Some choices made in this dictionary are for the purpose of exploring the entirety of the 42-expedition dataset and may not be suitable/preferred for other types of analysis. This includes the removal of 27.75 m worth of descriptions with principal lithologies deemed non-lithological ("drilling disturbed zone of rock fragments", "Void", "alteration vein", "void", "???", "Vein fill", "fall-in").
The final LILY dictionary (Table S4) contains unique combinations of prefix, principal, and suffix linking all the original (DESClogik) full lithologies to updated (LILY) full lithologies.
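The kind of term standardization described above can be illustrated with a minimal Python sketch. It is not the procedure used to build the LILY dictionary (that follows Text S2 and Table S4); the misspelling map, the function names, and the crude rule for detecting a redundant suffix are all assumptions made for illustration.

```python
import re

# Hypothetical misspelling map; the actual corrections are recorded in Table S4.
MISSPELLINGS = {"bbasalt": "basalt"}

def normalize_term(term: str) -> str:
    """Lowercase, collapse whitespace, and fix known misspellings word by word."""
    words = re.split(r"\s+", term.strip().lower())
    return " ".join(MISSPELLINGS.get(w, w) for w in words if w)

def drop_redundant_suffix(principal: str, suffix: str) -> str:
    """Return '' when the suffix repeats a component already named in the principal."""
    if not suffix:
        return ""
    component = suffix[5:] if suffix.startswith("with ") else suffix
    stem = component.rstrip("s")  # crude singular form: "nannofossils" -> "nannofossil"
    return "" if stem and stem in principal else suffix

def standardize(prefix: str, principal: str, suffix: str) -> str:
    """Build a standardized full lithology from prefix, principal, and suffix."""
    prefix, principal, suffix = (normalize_term(t) for t in (prefix, principal, suffix))
    suffix = drop_redundant_suffix(principal, suffix)
    return " ".join(t for t in (prefix, principal, suffix) if t)

example_redundant = standardize("", "Nannofossil  OOZE", "with nannofossils")
example_misspelled = standardize("", "bbasalt", "")
example_full = standardize("Nannofossil-rich", "CLAY", "with foraminifera")
```

A lookup table keyed on the original DESClogik strings, as in Table S4, keeps every change traceable, which rule-based cleaning like this sketch does not do on its own.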

Data Transformation and Management-Track and Discrete Data
All non-descriptive (physical, chemical, magnetic, etc.) data are available at http://web.iodp.tamu.edu/LORE/. Datasets paired with descriptive information are listed in Table S1 in Supporting Information S1, and information about the quantity of data in LILY by data type and expedition is given in Table S2 in Supporting Information S1.
Each track/discrete dataset is downloaded by expedition from LORE (LILY-RawDATA). An exception to this is when data are collected at a density (and therefore total volume) too large for LORE to download at the expedition level. In these instances, data are downloaded at the Site level and combined locally. Lithologic descriptive data (LILY-CleanLITH) are paired to the track/discrete data by matching the Expedition, Site, Hole, Core, Type, Section, and "Offset/Top-Bottom Offset cm". For the special case of IW (interstitial water) discrete samples, which are collected on the catwalk prior to core description, the lithology from immediately above the IW sample interval is presumed to continue through the sample. Where multiple principal lithologies are provided coincident with a track or discrete measurement, the track/discrete dataset row is duplicated. The track/discrete data paired with lithology are provided for each expedition in LILY-DataLITH.
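The pairing logic can be sketched in Python as an interval lookup within each core section. This is a minimal sketch rather than the LILY processing code: the field names, the inclusive treatment of interval boundaries, and the sample identifiers are hypothetical, and the special handling of IW samples is omitted.

```python
from collections import defaultdict

# Hypothetical field names standing in for Expedition, Site, Hole, Core, Type, Section.
SECTION_KEY = ("exp", "site", "hole", "core", "type", "section")

def pair_with_lithology(measurements, descriptions):
    """Attach the principal lithology of every description interval that contains the offset."""
    by_section = defaultdict(list)
    for d in descriptions:
        by_section[tuple(d[k] for k in SECTION_KEY)].append(d)

    paired = []
    for m in measurements:
        section_id = tuple(m[k] for k in SECTION_KEY)
        for d in by_section.get(section_id, []):
            # Inclusive bounds (an assumption): a measurement on a contact matches
            # both intervals, so its row is duplicated.
            if d["top_cm"] <= m["offset_cm"] <= d["bottom_cm"]:
                paired.append({**m, "principal": d["principal"]})
    return paired

# Made-up section identifiers and values, for illustration only.
section = {"exp": "000", "site": "U0000", "hole": "A", "core": 1, "type": "H", "section": 1}
descriptions = [
    {**section, "top_cm": 0.0, "bottom_cm": 50.0, "principal": "clay"},
    {**section, "top_cm": 50.0, "bottom_cm": 150.0, "principal": "nannofossil ooze"},
]
measurements = [{**section, "offset_cm": x, "value": 1.6} for x in (25.0, 50.0, 200.0)]
paired = pair_with_lithology(measurements, descriptions)
```

In this toy case the measurement at 50 cm sits on a lithologic contact and appears twice in the output, while the measurement at 200 cm has no description and is dropped.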

The Lithologies of LILY
The lithologic dataset (LILY-CleanLITH) contains descriptions of over 85 km of core collected on the 42 expeditions included in this study. Coring locations span the globe, but the total amount of core recovered is unevenly distributed with >68% from the northern hemisphere (Figure 2). A quarter of Earth's surface, consisting of the western hemisphere south of the equator, contains only 4.8% of the core, although this lack of sampling will be partially alleviated by several recent expeditions that were conducted in this region (i.e., Expeditions 382, 383, and 378) but were still in moratorium at the start of this project. Nearly half (47%) of the core was collected in the tropics (between 23.4336°S and 23.4336°N), while only 5.4% was collected in polar regions (south of 60°S and north of 60°N). The cores were collected in water depths that vary from 87.15 to 5,696.66 m.
Principal lithology is a required observation for the dataset and is present for all >85,000 m of core described. Within this dataset, 209 unique principal lithologies are identified and the most common (by meters of core described) is "nannofossil ooze" (>20% of all descriptions). The 20 most common principal lithologies account for >80% of the meters described (Table 2), whereas the four least common principal lithologies comprise only 0.04 m of the core described.
Prefixes are used for 37,958 m of the dataset, with over half of the descriptions (>49,200 m) containing no prefix. Within this dataset, 431 unique prefix values are identified and the most common (by meters of core described) is "nannofossil-rich" (∼9% of all prefixes by meters). The 20 most common prefixes account for almost 75% of the meters described that include a prefix (Table 2), whereas the five least common comprise only 0.06 m of all core described.
Suffixes are used for 24,603 m of the dataset, with the bulk of descriptions (>62,600 m) containing no suffix. Within this dataset, 185 unique suffix values are identified and the most common (by meters of description) is "with foraminifera" (∼25% of all suffixes by meters). The 20 most common suffixes account for >90% of the meters described that include a suffix (Table 2). The least common suffix is "with algae," which comprises 0.01 m of the core described.
Thirty-nine percent of principal lithologies (by meter of core described) have no prefix or suffix, 34% have only a prefix, ∼18% have only a suffix, and ∼10% have both. The most common lithology with both a prefix and suffix is "nannofossil-rich clay with foraminifera", which comprises ∼1.0% of the core described.
Rather than look at specific lithologies, one can group lithologies by type. For example, the total amount of unconsolidated core collected accounts for ∼71% of the core described. Within the unconsolidated group, all clastic sediments (e.g., clay, silt, mud, and sand) can be grouped, and these account for 58% of the unconsolidated core described. Biogenic oozes account for another 41% of unconsolidated sediments, leaving only ∼1% for all other unconsolidated lithologies.

Coring Type and Core Recovery Observations With and Without Lithologies
As noted above, LILY-CoreSUMM contains coring summaries that include coring type and recovery percentage information as well as associated metadata for all cores, drilled and washed intervals, and any other coring records from DSDP (21,725 records), ODP (36,388 records), and IODP (23,001 records; through Expedition 398). These cores were primarily collected (94% of LILY-CoreSUMM records) with one of four main coring systems: APC, HLAPC, XCB, and RCB. The data in the "Expanded Core Type" column will therefore generally be designated as being one of these four systems, although other core types, such as GHOST, NCB, PDCM, and MISC, occur much more rarely (Table S3). Of the 81,114 records in LILY-CoreSUMM, 35,344 are RCB, 24,055 are APC, 14,896 are XCB, and 2,693 are HLAPC cores.
The overall LILY-CoreSUMM dataset can be used to examine attributes of coring type, regardless of whether an expedition is included in the set of 42 expeditions with detailed lithology and track/discrete data primarily included in this study. Hence, one could examine the success of each coring system in terms of recovery percentage over the entire period that the coring systems have been in use for scientific ocean drilling, as summarized in Figure S1.
An analysis of how the coring systems work in different lithologies is possible using the 42 expeditions where coring type is paired with lithology, which contain over 14,900 records, of which 5,701 are APC, 4,318 are RCB, 2,534 are XCB, and 1,890 are HLAPC cores. The coring systems were designed to collect certain rock types and thus their use naturally correlates with lithology. In general, piston coring (APC, HLAPC) is used for unconsolidated materials (i.e., sediments), XCB for semi-lithified materials (e.g., chalk), and RCB for rock. The XCB system is also used to collect short intervals of hard rock at the base of an APC/XCB cored hole, for example, to collect a short interval of igneous basement after penetrating the sedimentary overburden. Its primary purpose is to extend coring in a hole started with the APC system, without needing to trip the pipe back to the surface to change bits, which is required for using the RCB system. The RCB system may be used to collect soft or semilithified materials as it passes through the unconsolidated sedimentary overburden on its way to deeper, targeted lithified or igneous intervals.
A natural question to ask is how successful each system is in collecting different lithologies. A study by Evans (2020) first addressed this question for IODP Expeditions 317-375. In that study, all cores from a hole were assigned the lithology of the most common rock type from the hole and then placed into one of six lithologic groups (Ooze/chalk; Siliciclastics [silt, clay, sand]; Shallow-water carbonate; Siliceous ooze; Volcaniclastic; and Glacial), which allows for studying only first-order differences in coring system. Nonetheless, even those basic lithologic designations revealed interesting observations: recovery for the APC and HLAPC systems is nearly 100% in all six groups, whereas the XCB system had 63%-76% recovery in ooze/chalk and siliciclastics and only 27%-37% recovery in glacial, volcaniclastic, and carbonate sediments/rock.
Our results agree with these generalizations, but with LILY-CleanLITH one can refine the relative success of each coring system in different lithologies. Because core recovery is known only for each core, all lithologies within a core are assumed to be recovered at the percent recovery for that core, which should be a fairly accurate indicator for each lithology. As an example of how the database can be used, we consider some questions that might commonly arise, such as what recovery percentage can be expected when using the RCB system to core in basalt.
We consider only cores that have basalt as the principal lithology for 50% or more of the core recovered, for which there are 392 cores (Figure 3a). The mean recovery for these is 51.7%. The recovery percentages have a somewhat bimodal distribution with modes at about 16% and 82%. Thus, someone planning an expedition that targets basalt might expect about 50% recovery, with relatively poor recovery (∼16%) in about half of the cores recovered and good recovery (∼82%) in the other half.
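The selection rule used for this example (keep a core when one principal lithology makes up at least half of the described material) can be sketched as follows. The record layout and the four cores below are made up for illustration and are not LILY values; only the 50% threshold comes from the text.

```python
from statistics import mean

def dominated_by(core, lithology, min_fraction=0.5):
    """True when `lithology` makes up at least `min_fraction` of the described length."""
    total = sum(core["described_m"].values())
    return total > 0 and core["described_m"].get(lithology, 0.0) / total >= min_fraction

# Hypothetical cores: recovery in percent, described length per principal lithology in meters.
cores = [
    {"core_type": "RCB", "recovery_pct": 16.0, "described_m": {"basalt": 1.5}},
    {"core_type": "RCB", "recovery_pct": 82.0, "described_m": {"basalt": 6.0, "breccia": 2.0}},
    {"core_type": "RCB", "recovery_pct": 60.0, "described_m": {"basalt": 1.0, "limestone": 4.0}},
    {"core_type": "APC", "recovery_pct": 101.0, "described_m": {"clay": 9.6}},
]

selected = [c for c in cores if c["core_type"] == "RCB" and dominated_by(c, "basalt")]
mean_recovery = mean(c["recovery_pct"] for c in selected)
```

Because recovery is recorded per core, every lithology in a selected core inherits that core's recovery percentage, which is the same simplification the analysis in the text makes.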
As another example, LILY might be used to assess if the APC coring system has higher success recovering unconsolidated clastic sediments or biogenic oozes. LILY-CleanLITH has 2,560 APC cores that contain 50% or more clastic sediments (i.e., those with principal lithologies such as clay, silt, mud, or sand) and 2,365 APC cores that contain 50% or more biogenic oozes. Histograms of the recovery percentages for these two lithologies (Figure 3b) indicate highly unimodal distributions with modes for clastic sediments and biogenic oozes at 101% and 105% recovery and means of 101.7% and 103.1%, respectively. [Note: recovery percentages greater than 100% generally result from core expansion (e.g., gas expansions in voids) when cores are brought from below the seafloor to the ship and because curation practices aim to preserve all material recovered, including soupy sediment at the tops of cores, some of which may have fallen downhole from above.] The mean difference in the recovery of clastic sediments versus biogenic ooze with the APC coring system is small (1.4%) but is actually significant at the 95% confidence level because of the large number of cores. There are several explanations for this difference, of which the two most likely are that the APC system either cores biogenic sediments slightly better than it cores siliciclastic sediments or that biogenic sediments expand more than siliciclastic sediments when they are retrieved from below the seafloor. With recovery rates over 100% for both lithologic groups, the main conclusion is that the APC system is very effective at recovering both types of sediments.
One might also investigate the efficacy of the RCB system relative to other systems in collecting the most common lithology, nannofossil ooze, and its more indurated version, nannofossil chalk. Figure 4 shows that nannofossil chalk is cored with moderate to good success with RCB, whereas nannofossil ooze recovery is poor to moderate. In contrast, nannofossil chalk can be reasonably well recovered with XCB, and on the few occasions it is cored by HLAPC, the recovery is quite good. Almost anything except RCB is a good bet for recovering nannofossil ooze.

Track and Discrete Data
More than 34 million unique individual track and discrete measurements from 23 instruments/tracks were paired with lithologic data (Figure 5; Table S2). The largest track data source, RGB, includes 10,789,797 measurements comprising ∼31% of the paired data. The smallest data source, the discrete measurement SRA, includes 121 measurements (<0.01% of the paired data). Some measurements (e.g., NGR) are collected consistently across all expeditions. However, several measurements are collected ad hoc (e.g., IW) or are collected only after a particular instrument or capability was acquired (e.g., RGB, SRA).

Data Quality
Given the large number of measurements, most made continuously along a core section, it is not surprising that anomalous or biased values are recorded, such as those that might occur for measurements made with any instrument that was poorly calibrated. The advantage of analyzing big data sets is that even though poor-quality measurements may occur, their numbers should be a relatively small percentage of the overall data, assuming good laboratory practices are generally followed when collecting data. Thus, meaningful estimates of physical, chemical, and magnetic properties can be extracted from the big datasets, with relatively little or even no filtering of "bad" data. Ultimately, some values recorded are so egregious that they are easy to filter out when analyzing the data. Other "bad" data may not be possible to identify or remove.
In the following analyses, it is important to understand some common reasons that anomalous or biased data occur, beyond typical uncertainties related to instrument sensitivity, which are covered in the IODP lab manuals (available at wiki.iodp.tamu.edu) and in the "Methods" section of the IODP "Proceedings Volumes".
As an example, consider track measurements, most of which have measurement-dependent parameters in their software, such as the expected core diameter, that are set and remain fixed for all cores measured, rather than being adjusted for real differences that occur along a core or between cores. For example, the inner diameter of the plastic core liner is 6.6 cm. For APC sediment cores, the liner is typically filled to the inner periphery. Thus, for measurements that vary with core diameter or volume of material being measured, such as density, P-wave velocity, susceptibility, and intensity of magnetization, 6.6 cm is a reasonably accurate estimate for the core diameter for intervals along a core that are filled with sediment. Of course, some intervals are not filled owing to cracks, gas voids, poor recovery, and various forms of disturbance related to drilling or processing the core. Such incompletely filled intervals will give anomalous observations for those measurements that depend on diameter or volume. For example, gamma ray attenuation (GRA) density measurements for some intervals will be lower than the density of water and may even be slightly negative; the former occurs over intervals that are only partially filled with material and the latter may occur over voids. Such intervals are not representative of any sediment or rock type and, if possible, need to be filtered out of the data when analyzing the properties of the different lithologies. The best use of IODP data also includes the review of expedition volume materials, including "Methods" chapters, which may identify or explain certain anomalous data.
[Figure caption fragment, beginning lost in extraction: … S1 in Supporting Information S1 for track/discrete data abbreviations and Table S2 in Supporting Information S1 for data quantity.]
Systematic biases may also occur. For example, the diameter of RCB cores is never 6.6 cm because the RCB bit cuts a core narrower than the core liner. The throat of the RCB bit is 5.87 cm, and the typical core diameter for an RCB core is between 5.5 and 5.8 cm; it commonly gets smaller the more hours a bit is used, because the roller cones tend to wobble more with wear. Track systems that have the diameter fixed at 6.6 cm will produce biased values for RCB cores for those properties that depend on the diameter, such as density, and for those that are normalized by volume, such as susceptibility and intensity of magnetization. Rarely, the operator of the track system may adjust the diameter value being used, but this is not done consistently and thus biases lurk in the data. Removing all anomalous data and correcting data for biases are beyond the scope of this study, although below we provide an example of how both can be done for density data. Future studies may use the joint observations made along each core by the different instruments to devise ways to detect and filter out more of the unrepresentative data and to correct for biases.

LILY Example Studies of Density and Porosity
LILY data can be probed for a myriad of studies. We illustrate this using the moisture and density (MAD) data along with the GRA bulk densities in the following examples:
1. MAD data are used to estimate the crystallographic (grain) densities, bulk densities, dry densities, and porosities for the different principal lithologies, simplified lithologies, lithology types and subtypes, degree of consolidation, and core type.
2. MAD and GRA bulk densities are compared to assess consistency and to investigate and correct for biases, improving the overall accuracy of the GRA densities.
3. Porosity is estimated for 3.7 million GRA density data using the grain densities determined from the MAD data.

Moisture and Density (MAD) Data
For background, MAD data are collected from samples, typically 6-10 cm³ in volume, that are taken from representative intervals of the core, typically at a frequency of one to two samples per core section. The wet and dry masses of the samples are weighed with an accuracy of 0.1%, the dry volume of each sample is carefully determined in a pycnometer with an accuracy of 1%, and the resulting bulk densities, grain densities, porosities, void ratios, and moisture contents are estimated generally with an accuracy of a few percent (Blum, 1997).
The densities and porosities in the MAD data are expected to follow relationships based on Archimedes principle (Hall & Hamilton, 2015), which is given in air by

$$\phi = \frac{\rho_G - \rho_B}{\rho_G}$$

and in a fluid by

$$\phi = \frac{\rho_G - \rho_B}{\rho_G - \rho_F}$$

where $\rho_B$ is the bulk density, $\rho_G$ is the grain density (i.e., the density of the solid components, also referred to as the crystallographic density), $\rho_F$ is the fluid density, and $\phi$ is the porosity given as a ratio of densities (we plot and express these values as percentages). The MAD densities are determined wet, where the pore-water fluid is considered to be seawater with a density of 1.024 g/cm³ (Blum, 1997). As apparent in the equations, porosity varies linearly with bulk density. In the absence of any solid material, porosity is 100% and the bulk density is that of seawater ($\rho_B = \rho_F$). As porosity decreases, bulk density increases linearly toward the grain density, and $\rho_B = \rho_G$ when the porosity is zero. These relationships are valid for porous materials, where liquid fills the open pores and any closed pores are treated as part of the solid material.
From the above, it is apparent that if the grain density is accurately known, the expected porosity can be accurately calculated for a sample by measuring only the bulk density. This can prove useful because there are about 4 million GRA bulk density measurements without associated porosity determination. The only catch is that the grain density is dependent on the lithology. The value of LILY becomes apparent as it gives us the ability to compute the grain density for any lithology represented in the MAD dataset and to apply that information to the GRA bulk density dataset to get porosities. Likewise, we can compute the expected MAD porosities using the MAD grain densities and MAD bulk densities and then compare them with the observed MAD porosities as a check on the quality of the data.
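The calculation of porosity from bulk density can be sketched in a few lines of Python. This is a minimal sketch that assumes the straight-line relation implied by the two end points stated in the text (bulk density equals the fluid density at 100% porosity and the grain density at 0% porosity); the function names are illustrative.

```python
RHO_SEAWATER = 1.024  # g/cm^3, pore-fluid density assumed for MAD data (Blum, 1997)

def porosity_from_bulk_density(rho_bulk, rho_grain, rho_fluid=RHO_SEAWATER):
    """Fractional porosity (0 to 1) from bulk and grain density in g/cm^3."""
    return (rho_grain - rho_bulk) / (rho_grain - rho_fluid)

def bulk_density_from_porosity(porosity, rho_grain, rho_fluid=RHO_SEAWATER):
    """Inverse relation: bulk density falls linearly as porosity rises."""
    return rho_grain - porosity * (rho_grain - rho_fluid)

# End-member checks for a calcite-like grain density of 2.711 g/cm^3.
porosity_all_fluid = porosity_from_bulk_density(RHO_SEAWATER, 2.711)
porosity_all_solid = porosity_from_bulk_density(2.711, 2.711)
```

Multiplying the returned fraction by 100 gives the percentage form used in the figures and tables.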
We first examine some general properties of the MAD dataset using data visualization and basic statistics. The dataset contains 26,467 rows of data. After discarding duplicates, such as samples that are at lithologic contacts with two different principal lithologies or sections that were measured twice, and discarding two rows that are missing a value in the bulk density, grain density, or porosity columns, 25,594 rows of data (independent samples with unique principal lithologies) remain. We do some minor cleaning of the data to remove extreme outliers, such as samples with density or porosity that is beyond what is possible (e.g., negative porosity) or probable for marine sediments and rocks. For example, marine sediments and rocks are predominantly composed of minerals that have grain densities well within the range of 2.0 to 3.2 g/cm³, with only very few and scattered values plotting outside that range. Applying the following cutoffs for extreme outliers reduces the number of MAD data to 25,335:
• 1.024 g/cm³ < bulk density < 3.2 g/cm³
• 2.0 g/cm³ < grain density < 3.2 g/cm³
• 0% ≤ porosity ≤ 100%
There are 1,336 unique full lithologies and 148 unique principal lithologies represented in these measurements.
The distributions of the grain densities, bulk densities, and porosities for the full MAD data are Gaussian in shape, but these are in turn composed of distinct lithology-dependent distributions, as illustrated for nannofossil chalk and basalt in Figure 6. Lithological differences in linear trends in bulk density versus porosity data are also apparent (Figure 7a). While the overall trend of all the data is linear, some non-linearity occurs in the combined data because bulk densities and porosities depend on lithology. This is particularly evident when the porosity is small and the bulk density approaches the crystallographic (grain) density of the constituent minerals. Hence, lithologies that are largely composed of calcite, like nannofossil ooze, will have grain densities near that of calcite (2.711 g/cm³; Smyth & McCormick, 1995), and lithologies composed of mafic minerals, like basalt and gabbro, will have grain densities approaching that of mixtures of mafic minerals, for example, pyroxenes (∼3.3 g/cm³), olivine (∼3.2 g/cm³), plagioclase (∼2.7 g/cm³), and hornblende amphibole (∼3.0 g/cm³).
Scatter plots of grain density versus porosity, when combined with histograms and cumulative data density contours, proved to be useful for identifying outliers or anomalous observations that exist in addition to the "extreme outliers" (Figure 7b). The expectation is that grain density should vary with lithology and be independent of porosity for each separate lithology. Hence, we would expect to see the carbonate lithologies, such as nannofossil ooze, cluster around 2.71 g/cm³ and mafic igneous rocks, such as basalt, cluster around 2.9 g/cm³ regardless of porosity (see the vertical solid and the vertical dashed lines, respectively, in Figure 7b). While the grain densities for nearly all the lithologies do exactly that, we also observe a small population (<400 samples) with grain densities that trend toward lower values as the porosity increases above about 80%. Most of these samples are biogenic sediments, with 56% being diatom and radiolarian ooze. The mean grain density for diatom ooze is only 2.34 g/cm³ for samples with porosity >80% and 2.49 g/cm³ for samples with porosity <80%. We attribute this behavior to the incomplete drying of very wet and mostly biogenic samples. A model (Figure 7b) in which a small percentage of the pore water is not removed by drying samples with high porosity can explain the anomalous values. In the model, the percentage of pore water not removed increases nonlinearly from 0% for porosities <70% to about 2% for a porosity of 90%. Possibly some of the pore water is locked in the tests of microfossils in the high porosity samples, which will be those near the mudline (top of the seafloor) where microfossil shells are generally well preserved. Alternatively, samples with very high porosities may merely need to be dried longer than those without so much water in their pore spaces.
[Figure 7 caption, beginning lost in extraction: … 2) for nannofossil ooze, diatom ooze, and basalt, respectively, which have grain densities of 2.72, 2.46, and 2.89 g/cm³, respectively (see Section 4.2.1 Estimating MAD Grain, Bulk, and Dry Densities and Porosities). The same grain densities are vertical lines in (b) as grain density is independent of porosity. Diatom ooze, however, shows a decrease in grain density when the porosity is higher than about 80%, which we attribute to incomplete drying of samples with very high porosity. The dotted-dashed line with triangles is a model that shows how incomplete drying of only a few percent of the total water can affect the grain density. The solid yellow line is a contour that contains 75% of the data. The contour and the histograms convey that the data are highly concentrated in a restricted range in both plots.]
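To see why a small amount of retained pore water matters so much at high porosity, consider the following Python sketch. It is not the model plotted in Figure 7b, whose functional form is not given here; it assumes, purely for illustration, that the retained water is counted with the solids in both the dry mass and the dry volume, and it ignores dissolved salt.

```python
RHO_SEAWATER = 1.024  # g/cm^3

def apparent_grain_density(porosity, rho_grain, retained_fraction, rho_fluid=RHO_SEAWATER):
    """Apparent grain density when a fraction of the pore water survives drying.

    Works per unit sample volume: solids occupy (1 - porosity) and the retained
    water occupies retained_fraction * porosity.
    """
    solid_volume = 1.0 - porosity
    retained_volume = retained_fraction * porosity
    mass = rho_grain * solid_volume + rho_fluid * retained_volume
    return mass / (solid_volume + retained_volume)

# Illustrative case: 2% of the pore water retained in a sample with 90% porosity.
fully_dried = apparent_grain_density(0.90, 2.49, 0.0)
partly_dried = apparent_grain_density(0.90, 2.49, 0.02)
```

Under these assumptions, retaining 2% of the pore water at 90% porosity lowers the apparent grain density from 2.49 to about 2.27 g/cm³, because the retained water volume is then a sizable fraction of the small solid volume.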

Estimating MAD Grain, Bulk, and Dry Densities and Porosities
To compute the best estimate of the grain, bulk, and dry densities and the porosity for each lithology, we first remove the extreme outliers as described above. Next, we group the data, where grouping is done by principal lithology, along with subsequent groupings by simplified lithology, lithology type, lithology subtype, degree of consolidation, and expanded core type. Each group is cleaned by excluding values that are more than three standard deviations from the group mean values, and then the means, medians, and their attendant statistics are recomputed (Table S7, with a subset of the data in Table 3). This cleaning reduces the number of MAD data from 25,335 to 24,748 samples, which is a reduction of about 2.3%. We refer to this as the "ultra-clean" dataset and we use the mean values from this dataset as the most representative estimate of the true values.
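The group-wise cleaning can be sketched in Python as below. This is a minimal sketch, assuming a single pass of three-standard-deviation clipping on one field and using the sample standard deviation; the row layout and the example values are hypothetical, and extreme outliers are assumed to have been removed beforehand.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def clean_groups(rows, key="principal", field="grain_density", n_sigma=3.0):
    """Group rows, drop values beyond n_sigma of the group mean, then summarize."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[field])

    summary = {}
    for name, values in groups.items():
        if len(values) > 2:
            mu, sd = mean(values), stdev(values)
            values = [v for v in values if abs(v - mu) <= n_sigma * sd]
        # Statistics are recomputed on the retained values only.
        summary[name] = {"n": len(values), "mean": mean(values), "median": median(values)}
    return summary

# Made-up values: 30 tightly clustered grain densities plus one anomalous sample.
rows = [{"principal": "nannofossil ooze", "grain_density": v}
        for v in [2.70, 2.71, 2.72] * 10 + [3.15]]
summary = clean_groups(rows)
```

In this toy case the single value of 3.15 g/cm³ lies more than three standard deviations from the group mean and is dropped, leaving 30 samples with a mean of 2.71 g/cm³.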
Representative grain and dry densities for the principal lithologies with more than 50 observations are shown along with their standard deviation and 95% confidence limits in Figure 8. The 95% confidence limits listed in Table 3 and Table S7 are derived by first adding in quadrature the random standard error and a systematic standard error equivalent to the difference between the mean and median, and then multiplying this sum by 1.96. This systematic error is included to account for the non-Gaussian nature of some of the distributions. The most extreme example of this is for the claystone grain density, where the mean and median differ by 0.03 g/cm³, which significantly exceeds the 0.01 g/cm³ 95% confidence limits for random errors alone.
[Table 3 note: A more complete list of grain densities, bulk densities, dry densities, and porosities along with their statistics is given in Table S7. Results in that table also include values for data grouped by simplified lithology, lithology type, lithology subtype, degree of consolidation, and expanded core type. N is the number of observations. 95% Confidence Limits include random and systematic errors as described in the text.]
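The confidence-limit recipe described above can be written out as a short Python sketch. It assumes that the random standard error is the sample standard deviation divided by the square root of the number of observations, which is the usual definition but is not spelled out in the text; the function name is illustrative.

```python
from math import sqrt
from statistics import mean, median, stdev

def confidence_limit_95(values):
    """95% limit from random and systematic standard errors added in quadrature."""
    random_se = stdev(values) / sqrt(len(values))         # assumed definition
    systematic_se = abs(mean(values) - median(values))    # mean-median offset
    return 1.96 * sqrt(random_se**2 + systematic_se**2)

# Symmetric toy data: mean equals median, so only the random term contributes.
symmetric_limit = confidence_limit_95([1.0, 2.0, 3.0, 4.0, 5.0])
# Skewed toy data: the mean-median offset adds to the limit.
skewed_limit = confidence_limit_95([1.0, 2.0, 3.0, 4.0, 15.0])
```

For skewed distributions the systematic term dominates as the number of observations grows, since the random standard error shrinks while the mean-median offset does not.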
The grain densities computed from the large number of observations for some lithologies give mean values that have very small 95% confidence limits, averaging ±0.024 g/cm³ for the 60 lithologies with 50 or more observations. This leads to some interesting observations. For example, nannofossil ooze (2.721 ± 0.007 g/cm³) has a small but statistically significantly higher grain density than nannofossil chalk (2.705 ± 0.005 g/cm³), which in turn has a higher grain density than limestone (2.665 ± 0.041 g/cm³). Likewise, sand (2.746 ± 0.022 g/cm³) has a significantly higher grain density than sandstone (2.659 ± 0.057 g/cm³), silt (2.718 ± 0.010 g/cm³) has a significantly higher grain density than siltstone (2.618 ± 0.071 g/cm³), and clay (2.738 ± 0.025 g/cm³) has a higher, although not statistically significant, grain density than claystone (2.718 ± 0.064 g/cm³). These seemingly small differences apparently have a physical origin. The more lithified versions of these lithologies may have some closed pore spaces that retain water, which would result in lower grain densities and higher dry densities than the unconsolidated analogs. This explanation is supported by the dry density data, where the more lithified lithologies (limestone, claystone, siltstone, and sandstone) have higher values than the unconsolidated analogs (nannofossil ooze, clay, silt, and sand).
The grain densities are generally consistent with the expected values based on the grain densities of the samples' constituent minerals. For example, the expected (calculated) grain density of pure calcite is 2.711 g/cm³ and the observed grain densities of the lithologies mainly composed of calcite have comparable values: nannofossil ooze (2.721 g/cm³), nannofossil chalk (2.705 g/cm³), chalk (2.703 g/cm³), and foraminifera ooze (2.723 g/cm³).
Although none of the calcite-rich lithologies are pure calcite, they differ by no more than 0.012 g/cm³ from the expected value for calcite. An exception to this is limestone (2.665 ± 0.041 g/cm³), which differs only marginally given its uncertainty.
MAD bulk densities and porosities show the expected inverse relationship (Figure 9). Lithologies with large porosities (>60%), such as oozes and other unconsolidated clastic sediments, have low bulk densities (<1.7 g/cm³), and rocks with very little porosity (<25%), such as igneous rocks, have high bulk densities (>2.4 g/cm³). Lithologies composed of microfossils with silica tests, like diatomite, diatom ooze, diatom silt, and radiolarian ooze, have the lowest density (<1.5 g/cm³) and highest porosity (>70%), which can be attributed to how the silica tests of diatoms provide support for a porous skeletal structure.
Grouping data using categories other than principal lithology, such as the degree of consolidation or coring type, provides additional insights (Table S7). The average porosity of all the consolidated lithologies is 38% compared with 58% for unconsolidated lithologies. RCB cores are collected in material with an average porosity of 37% compared with 64% for APC cores and 50% for HLAPC.

MAD and GRA Densities Compared
Besides the MAD measurements on discrete samples, bulk density is measured indirectly by gamma ray attenuation (GRA) along whole-round core sections. While generally more accurate, the MAD measurements are time consuming and require destructive sampling and therefore cannot be conducted at the same frequency as indirect, track-based, non-destructive methods. As a result, the GRA density data set in LILY includes over 4 million measurements, more than two orders of magnitude more data than MAD measurements. To compare the GRA bulk densities to the MAD bulk densities, we find the nearest GRA datum for each MAD datum, where the two data must come from the same core section and have the same principal lithology. We required that the paired data be no farther than 5 cm from each other. This leaves 24,178 paired data for comparison, with over 95% of them within 2 cm of each other. The results compare remarkably well for APC, HLAPC, and XCB cores (Figure 10). The GRA densities are on average 0.036 g/cm³ and 0.016 g/cm³ higher than the MAD densities for APC and HLAPC, respectively, and they are 0.048 g/cm³ lower for XCB. For APC and HLAPC, we attribute the higher GRA densities relative to MAD densities to the sediments being under compression in the core liner for GRA measurements. Once the cores are split, the subsamples taken for MAD measurements are no longer under compression and hence their densities are marginally lower. The slightly lower GRA values relative to MAD values for XCB cores would be expected because XCB cores often do not fill the core liner completely. The expectation is that there would be more outliers with GRA densities significantly lower than MAD densities due to GRA densities being measured over intervals where the core liner is only partly filled, and this tendency is observed. This tendency is much more apparent for RCB cores, where voids and intervals with partially filled core liner are much more abundant (see the "Anomalous" regions in Figure 10d).
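The nearest-neighbor pairing can be sketched in Python as follows. This is a minimal sketch, not the code used for the study: the field names are hypothetical, a single `section_id` stands in for the full expedition-to-section key, and the brute-force search would need indexing by section to be practical for millions of GRA rows.

```python
def pair_nearest(mad_rows, gra_rows, max_gap_cm=5.0):
    """Pair each MAD sample with the nearest GRA datum in the same section and lithology."""
    pairs = []
    for mad in mad_rows:
        candidates = [g for g in gra_rows
                      if g["section_id"] == mad["section_id"]
                      and g["principal"] == mad["principal"]]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda g: abs(g["offset_cm"] - mad["offset_cm"]))
        if abs(nearest["offset_cm"] - mad["offset_cm"]) <= max_gap_cm:
            pairs.append((mad, nearest))
    return pairs

# Made-up rows for illustration.
gra_rows = [
    {"section_id": "S1", "principal": "clay", "offset_cm": 10.0, "density": 1.62},
    {"section_id": "S1", "principal": "clay", "offset_cm": 12.5, "density": 1.64},
    {"section_id": "S1", "principal": "basalt", "offset_cm": 90.0, "density": 2.30},
]
mad_rows = [
    {"section_id": "S1", "principal": "clay", "offset_cm": 12.0, "density": 1.60},
    {"section_id": "S1", "principal": "basalt", "offset_cm": 60.0, "density": 2.70},
]
pairs = pair_nearest(mad_rows, gra_rows)
```

Here the clay sample pairs with the GRA datum 0.5 cm away, while the basalt sample is left unpaired because its nearest GRA datum is 30 cm away.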
RCB cores have significantly lower GRA densities than MAD densities (on average by 0.210 g/cm³) and the difference depends on lithology and degree of lithification. The harder rock types, for example, basalt, gabbro, diorite, breccia, conglomerate, tuff, limestone, siltstone, and sandstone, typically have GRA densities 0.2-0.4 g/cm³ less than the MAD densities. Softer, less consolidated sediments, such as mud, clay, silt, and ash, have GRA densities 0.1-0.2 g/cm³ less than the MAD densities. The softest sediments, like nannofossil ooze, diatom ooze, and diatomite, have GRA densities that are smaller than MAD densities on average by no more than 0.02 g/cm³.
These differences predominantly result from RCB cores having a smaller diameter than the standard core diameter (6.6 cm), which is a set parameter in the software for GRA measurements. This interpretation is supported not only by visually examining cores but also by computing the RCB core diameter for different lithologies using the relationship

$$d_{\mathrm{true}} = 6.6\ \mathrm{cm} \times \frac{\rho_{\mathrm{obs}}}{\rho_{\mathrm{true}}}$$

where $d_{\mathrm{true}}$ is the actual core diameter, 6.6 cm is the assumed diameter, $\rho_{\mathrm{obs}}$ is the observed GRA bulk density, and $\rho_{\mathrm{true}}$ is the true bulk density, which can be estimated from the average of all the MAD bulk densities for each lithology. For example, the average MAD bulk density for basalt is 2.708 g/cm³ and the average paired GRA bulk density is 2.267 g/cm³, which indicates that the average diameter of the core is 5.53 cm (or about 5.81 cm if we exclude GRA data that differ more than 0.5 g/cm³ from their paired MAD data, as discussed below). As noted above, even though the liner in the core barrel is 6.6 cm, the throat of the RCB bit is 5.87 cm. Unlike lithified rock, soft sediments collected with the RCB tend to expand and fill the liner. For example, the MAD bulk density for nannofossil ooze is 2.721 g/cm³ and the average paired GRA bulk density (collected with the RCB system) is 2.701 g/cm³, which indicates that the actual average diameter of an RCB core with nannofossil ooze is 6.55 cm.
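The diameter estimates quoted above can be checked with a short Python sketch. It assumes that the observed GRA density scales in direct proportion to the ratio of the true to the assumed core diameter, which is consistent with the numbers quoted in the text; the function name is illustrative.

```python
ASSUMED_DIAMETER_CM = 6.6  # liner inner diameter fixed in the GRA software

def true_core_diameter(rho_gra, rho_mad, assumed_cm=ASSUMED_DIAMETER_CM):
    """Estimate the actual core diameter from paired GRA and MAD bulk densities."""
    return assumed_cm * rho_gra / rho_mad

# Average paired densities quoted in the text (g/cm^3).
basalt_cm = true_core_diameter(2.267, 2.708)
ooze_cm = true_core_diameter(2.701, 2.721)
```

With the quoted densities this gives about 5.53 cm for basalt and about 6.55 cm for nannofossil ooze, matching the values in the text.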
The bias between the GRA and MAD densities therefore depends on the diameter of the recovered core, and that diameter depends on the coring system and lithology, with the coring system used being itself lithology dependent. For APC, HLAPC, and XCB systems, little bias is evident and what bias there is does not have a clear dependence on lithology, because the lithologies collected by these coring systems typically fill the core liner. For the RCB system, which collects a range of lithologies from hard rock to soft sediment, the lithology dependence is clear (Figure 10d).
Using LILY, we can compute the median differences between the paired GRA and MAD densities for each lithology collected with the RCB system and then use those values to correct all the GRA data (not just the paired data) from RCB cores (Table S8; Figure 10e). In computing the differences, we minimize the influence of data collected over intervals with partially filled core liners by using the median rather than the mean and by excluding data where the GRA minus MAD results are <−0.5 g/cm³ or >0.2 g/cm³ (labeled as the "Anomalous" region in Figure 10d). Given the small differences between GRA and MAD densities for data from APC, HLAPC, and XCB cores, we use the median difference for each coring system rather than for each lithology as a robust correction factor. After excluding the anomalous values, the remaining 23,155 paired GRA and MAD data were used to compute correction factors (Table S8), and those factors were used to correct all the GRA bulk density data.
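The correction procedure can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build Table S8. It assumes that the exclusion window is a GRA minus MAD difference below −0.5 g/cm³ or above 0.2 g/cm³, and that the correction is applied by subtracting the (typically negative) median difference from each GRA value. The function names and the sample pairs are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Assumed exclusion window for GRA minus MAD (g/cm^3); differences outside
# this window are treated as anomalous and ignored.
LOWER_LIMIT, UPPER_LIMIT = -0.5, 0.2


def median_offsets(pairs):
    """Return {group: median(GRA - MAD)} from (group, gra, mad) tuples.

    `group` is a lithology for RCB cores (or a coring system for the others).
    """
    diffs = defaultdict(list)
    for group, gra, mad in pairs:
        diff = gra - mad
        if LOWER_LIMIT <= diff <= UPPER_LIMIT:
            diffs[group].append(diff)
    return {group: median(values) for group, values in diffs.items()}


def correct_gra(gra, group, offsets):
    """Remove the median bias; values with no known offset are left unchanged."""
    return gra - offsets.get(group, 0.0)


# Hypothetical paired measurements: (lithology, GRA density, MAD density).
pairs = [
    ("basalt", 2.30, 2.70),
    ("basalt", 2.40, 2.70),
    ("basalt", 1.90, 2.70),            # difference of -0.80: excluded
    ("nannofossil ooze", 1.70, 1.71),
    ("nannofossil ooze", 1.69, 1.72),
]

offsets = median_offsets(pairs)
corrected = correct_gra(2.25, "basalt", offsets)
print(offsets)
print(round(corrected, 3))
```

Using the median together with the exclusion window keeps a few partially filled liner intervals from dominating the correction factor for a lithology.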

Porosities From GRA Densities
Before correcting the GRA bulk densities, we clean the GRA data similarly to how the MAD data were cleaned: duplicates and extreme outliers (densities <1.024 g/cm³ or >3.2 g/cm³) are removed. This leaves 3,752,145 GRA bulk densities, from which we compute corrected GRA bulk densities using the correction factors. We then compute porosities using the corrected GRA bulk densities, the MAD grain densities for the corresponding principal lithology, and Equation 2a (Data Set S2 in Acton et al., 2023; https://doi.org/10.5281/zenodo.10001854).
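The cleaning and porosity steps can be illustrated with a short Python sketch. It assumes that Equation 2a has the standard mass-balance form, porosity = (grain density − bulk density) / (grain density − pore-water density), and that the pore-water density is 1.024 g/cm³, matching the lower outlier cutoff. Both are assumptions made here for illustration. The records are hypothetical, the bulk densities are taken to be already corrected, and the grain density of 2.72 g/cm³ is the nannofossil ooze value quoted in the Figure 7 caption.

```python
RHO_WATER = 1.024              # assumed pore-water density (g/cm^3)
MIN_RHO, MAX_RHO = 1.024, 3.2  # outlier cutoffs from the text (g/cm^3)


def clean(records, lo=MIN_RHO, hi=MAX_RHO):
    """Drop exact duplicates and extreme outliers, preserving input order.

    Each record is a (depth_m, bulk_density) tuple.
    """
    seen = set()
    kept = []
    for record in records:
        _, rho = record
        if record in seen or rho < lo or rho > hi:
            continue
        seen.add(record)
        kept.append(record)
    return kept


def porosity(rho_bulk, rho_grain, rho_water=RHO_WATER):
    """Fractional porosity from bulk and grain density (assumed mass balance)."""
    return (rho_grain - rho_bulk) / (rho_grain - rho_water)


# Hypothetical corrected GRA records: (depth in m, bulk density in g/cm^3).
records = [
    (10.00, 1.70),
    (10.00, 1.70),   # duplicate
    (10.02, 0.90),   # below the lower cutoff
    (10.04, 3.50),   # above the upper cutoff
    (10.06, 1.80),
]

cleaned = clean(records)
porosities = [porosity(rho, rho_grain=2.72) for _, rho in cleaned]
print(cleaned)
print([round(p, 3) for p in porosities])
```

In a full workflow the grain density would be looked up per principal lithology rather than fixed at a single value.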
The new GRA porosity dataset has more than two orders of magnitude more data than the MAD dataset and is spaced every few centimeters, giving it a much higher resolution, as illustrated for an interval from Hole U1387A off the southern margin of Portugal (Figure 11). The sediment in this interval is predominantly nannofossil mud with thin interbeds of silty mud (Stow, Hernández-Molina, Alvarez Zarikian, Lofi, et al., 2013). GRA porosity shows a clear decrease associated with these silty mud layers (red dots along the gray line in Figure 11). This association was missed by the more widely spaced MAD porosity data. It is also apparent that, had the sedimentologists had the high-resolution GRA porosity dataset, they might have identified a few other narrow intervals that are likely silty mud.

Comparison of IODP and DSDP Density Compilations
Tenzer and Gladkikh (2014) compiled bulk density measurements from 716 DSDP drill sites (21,937 samples). These data are similar to a mix of IODP MAD and GRA bulk density data because the DSDP data include results from discrete samples and from measurements of whole-round core samples by gamma ray attenuation (e.g., Boyce, 1976; Mills, 1985; Scientific Party, 1970). In addition, DSDP discrete sample volumes were not estimated by a pycnometer, but were instead estimated from the shape of syringe and mini-core cylinder samples or by water displacement for irregularly shaped chunks of rock (e.g., Lee, 1973). As noted by Boyce (1976), the DSDP GRA densities are biased toward lower values when the core does not fill the core liner. This is exactly what we noted for the IODP GRA densities before we corrected them. The bias remains in the DSDP data, whereas it is absent from the IODP MAD data and has been removed from the IODP GRA data via the correction we applied (see Section 4.3 MAD and GRA Densities Compared).
This bias is apparent in the comparison of the DSDP densities with the IODP MAD densities (Figure 12). The DSDP data are on average about 0.2 g/cm³ lower than the IODP MAD data, which is an amount virtually identical to what we noted for the GRA bulk densities for RCB cores. Note that the DSDP data are mostly from RCB cores (Figure 12a).
The distribution of DSDP versus IODP density values differs in at least two other notable ways. The DSDP distribution contains a small secondary mode between 2.8 and 3.1 g/cm³ that is absent in the IODP data. About 45% of these data come from just three holes (395A, 462A, and 504B), which targeted mostly igneous basement. The GRA bias is not evident in these studies because these data are predominantly from discrete sample studies in which the bulk density was determined from weighing the sample and determining the volume from water displacement (e.g., Karato, 1983). The larger number of samples from igneous rocks reflects both the greater emphasis on coring oceanic basement by DSDP and the way cores were collected. During DSDP, spot coring was frequently employed to target intervals of interest, particularly igneous rock, while drilling through much of the sedimentary overburden.
Another noticeable difference is the bimodality of the primary maximum of the IODP data for densities <2.5 g/cm³, whereas the DSDP data have a single maximum. The two peaks in the IODP data are associated with differences in the coring systems used and the types of lithologies recovered. RCB coring was the most common method used by DSDP prior to 1979, when the APC coring system was introduced. Thus, during DSDP much of the soft sediment that would currently be cored with an APC coring system was either drilled through or cored by RCB, resulting in much less recovery of soft sediment, which is quite evident in Figures 12a and 12b. Furthermore, when the bias from unfilled core liners is considered, the mode at about 1.65 g/cm³ in the DSDP distribution actually corresponds to the mode at about 1.85 g/cm³ in the IODP distribution. Thus, the mode at 1.65 g/cm³ in the IODP data is absent in the DSDP data, mainly because DSDP collected so few cores with the APC system and none with the XCB or HLAPC systems.

Conclusions
The continuous collection of consistent types of data on IODP expeditions over a long time makes them a natural target for broad studies that span beyond a single site or expedition. The scientific ocean drilling database is big data, and an endless array of studies can be conducted with it. Undoubtedly, many new discoveries will be made with those data. That was true of the IODP data even before we added descriptive lithologic information to each datum to make LILY.
The many datasets that make up the IODP LIMS database are, however, from measurements collected along geologic units or from samples taken from those geologic units, and it is only natural that the type of geologic unit be part of the metadata for each measurement. LILY ensures that this information is part of the IODP data and enables users to investigate the properties and characteristics of the data that may be connected by or depend on lithology.
As we have shown in an example using MAD and GRA data, the addition of descriptive lithologic information allows new attributes of datasets to be resolved, features to be characterized, and hypotheses to be explored. In our example, we have used LILY to estimate grain densities by lithology from MAD data, to correct a bias in the GRA bulk densities that was dependent on lithology and coring system, and to compute a high-resolution porosity dataset using the lithology-dependent grain densities and the corrected GRA bulk densities. With over 3.7 million new porosity estimates, this dataset should be of use in identifying fluid pathways, estimating reservoir capacity, determining the CO2 sequestration potential for geologic units, investigating how changes in porosity with depth might be related to landslide or earthquake nucleation or facilitation, and much more. We hope future investigators will uncover endless possibilities for the use of the LILY database and derive new observations and insights from it.

Figure 2. Geographic location of sites for the 42 expeditions. Symbol size is proportional to the meters of core described for each site.

Figure 3. (a) RCB recovery of cores that are at least 50% basalt. (b) Histograms of percent recovery of the APC coring system in cores that are at least 50% clastic sediments (orange), biogenic lithologies (light blue), and their overlap (purple). Only the interval from 80% to 120% is shown because few cores have recoveries outside that range.

Figure 4. Recovery by coring type for cores with at least 50% nannofossil chalk or nannofossil ooze.

Figure 5. Extent of physical, chemical, and magnetic datasets paired with lithologic descriptions for this study. All data were retrieved from the IODP Laboratory Information Management System (LIMS) [http://web.iodp.tamu.edu/LORE/]. See Table S1 in Supporting Information S1 for track/discrete data abbreviations and Table S2 in Supporting Information S1 for data quantity.

Figure 6. Histograms of MAD data showing distributions of the combined data and differences in the distributions for two lithologies, nannofossil chalk and basalt.

Figure 7. (a) MAD bulk densities versus porosity data and (b) MAD grain densities versus porosity are plotted for all the data and for four representative principal lithologies along with histograms for all the data. The solid, dotted, and dashed lines in (a) show the expected linear relationships between bulk density and porosity (see Equation 2) for nannofossil ooze, diatom ooze, and basalt, respectively, which have grain densities of 2.72, 2.46, and 2.89 g/cm³, respectively (see Section 4.2.1 Estimating MAD Grain, Bulk, and Dry Densities and Porosities). The same grain densities are vertical lines in (b) as grain density is independent of porosity. Diatom ooze, however, shows a decrease in grain density when the porosity is higher than about 80%, which we attribute to incomplete drying of samples with very high porosity. The dotted-dashed line with triangles is a model that shows how incomplete drying of only a few percent of the total water can affect the grain density. The solid yellow line is a contour that contains 75% of the data. The contour and the histograms convey that the data are highly concentrated in a restricted range in both plots.

Figure 8. Mean (red squares) and median (yellow dots) for (a) grain densities and (b) dry densities for some of the principal lithologies with more than 50 observations each, along with their standard deviations (blue bar) and 95% confidence limits (red bar).

Figure 9. Mean bulk densities and porosities and their standard deviations for some of the principal lithologies with more than 50 observations each.

Figure 10. Comparison of GRA and MAD bulk densities measured from the same interval for (a) APC cores, (b) HLAPC cores, (c) XCB cores, and (d) RCB cores. In panels (a-d), the blue dots include all the data for the specific coring system. In panel (d), besides plotting all the RCB data, data for three lithologies are superimposed to illustrate how GRA densities are more biased for some lithologies than for others. As discussed in the text, the difference between GRA and MAD densities is a bias in the GRA data that is dependent on the coring system and lithology. (e) We have determined the size of this bias as a function of those dependent variables and corrected for it. This panel shows data from all the coring systems combined (gray dots), with data from three specific lithologies superimposed and with outliers [from the "Anomalous" region shaded yellow in panel (d)] removed (see text).

Figure 11. The new high-resolution GRA porosities (gray line) are plotted along with MAD porosities (blue dots). The GRA porosities for intervals with silty mud are plotted as red dots. The principal lithology for all other intervals is nannofossil mud.

Table 1
Expeditions Included in LILY

Table 2
Twenty Most Common Prefixes, Principal Lithologies, Suffixes, and Full Lithologies by Meters of Lithology Described

Table 3
Mean Grain Densities, Bulk Densities, Dry Densities, and Porosities and Their Uncertainties Computed From the MAD Physical Properties Data Set
porous biogenic samples. A simple model (line with triangles in Figure