COSORE: A community database for continuous soil respiration and other soil‐atmosphere greenhouse gas flux data

Abstract
Globally, soils store two to three times as much carbon as currently resides in the atmosphere, and it is critical to understand how soil greenhouse gas (GHG) emissions and uptake will respond to ongoing climate change. In particular, the soil‐to‐atmosphere CO2 flux, commonly though imprecisely termed soil respiration (R_S), is one of the largest carbon fluxes in the Earth system. An increasing number of high‐frequency R_S measurements (typically, from an automated system with hourly sampling) have been made over the last two decades; an increasing number of methane measurements are being made with such systems as well. Such high‐frequency data are an invaluable resource for understanding GHG fluxes, but lack a central database or repository. Here we describe the lightweight, open‐source COSORE (COntinuous SOil REspiration) database and software, which focuses on automated, continuous and long‐term GHG flux datasets and is intended to serve as a community resource for earth sciences, climate change syntheses and model evaluation. Contributed datasets are mapped to a single, consistent standard, with metadata on contributors, geographic location, measurement conditions and ancillary data. The design emphasizes the importance of reproducibility, scientific transparency and open access to data. While oriented towards continuously measured R_S, the database design accommodates other soil‐atmosphere measurements (e.g. ecosystem respiration, chamber‐measured net ecosystem exchange, methane fluxes) as well as experimental treatments (heterotrophic only, etc.). We give brief examples of the types of analyses possible using this new community resource and describe its accompanying R software package.


INTRODUCTION
Fluxes of greenhouse gases (GHGs) between soils and the atmosphere constitute a significant component of global carbon and biogeochemical cycling (Friedlingstein et al., 2019), with the two most commonly measured being those of carbon dioxide (usually referred to as soil respiration, R_S) and methane. Soil respiration constitutes one of the largest carbon fluxes in the entire Earth system (Bond-Lamberty, 2018; Raich & Potter, 1995; Xu & Shang, 2016) and is useful, but underutilized, for constraining and understanding other components of the carbon cycle (Barba et al., 2018; Davidson et al., 2002; Phillips et al., 2017; Wang et al., 2017). Atmospheric methane causes higher 100-year radiative forcing on a mass basis than carbon dioxide (Neubauer & Megonigal, 2015), and its production exhibits high temporal and spatial variability, often associated with redox conditions and climate. This contributes substantial uncertainty to global methane budgets (Friedlingstein et al., 2019; Kirschke et al., 2013; Saunois et al., 2016; Tian et al., 2015).
Other GHG fluxes are also measured, albeit less frequently, e.g. nitrous oxide (Gruber & Galloway, 2008), and researchers are beginning to measure multiple gases concurrently as well (Courtois et al., 2019).
These GHG fluxes are measured using a number of techniques (Pumpanen et al., 2004), most commonly infrared gas analyzers (IRGAs; Detto et al., 2011; DuBois et al., 1952) connected to chambers that sit on collars shallowly embedded into the soil surface (Nay et al., 1994; Xu et al., 2006). Continuous measurements of R_S can also be made using in situ solid-state sensors (Hirano et al., 2003; Jassal et al., 2005; Tang et al., 2003) and forced diffusion technology (Lavoie et al., 2012, 2015). In the last 30 years, continuously operating automated systems multiplexing multiple chambers to a single IRGA have been developed (Goulden & Crill, 1997; Irvine & Law, 2002; Rayment & Jarvis, 1997). Laser-based and spectroscopic methods for non-CO2 gases are also increasingly used in field research (Brannon et al., 2016; Savage et al., 2014). These high-frequency data, particularly when paired with complementary observations, open up new possible research applications, including understanding rapid plant-soil ecohydrological links (Volkmann et al., 2016), the coupling of phenology and respiration (Järveoja et al., 2018; Migliavacca et al., 2015; Raich, 2017), the contribution of root respiration (Högberg et al., 2001; Subke et al., 2006), validation of eddy covariance measurements in complex ecosystems (Miao et al., 2017), responses of soil GHG emissions to extreme climate events and rising atmospheric carbon dioxide concentrations (Drake et al., 2018) and novel inversion techniques (Latimer & Risk, 2016).
The resulting GHG flux datasets, however, remain widely dispersed and frequently unavailable. There is no centralized database for chamber fluxes akin to FLUXNET (Baldocchi et al., 2001), although annual (Bond-Lamberty & Thomson, 2010) and some daily to seasonal R_S flux databases do exist. This is troubling, both because of the lost or unavailable research opportunities for synthetic work with temporally high-resolution GHG fluxes and because of the inevitable loss of data (Wolkovich et al., 2012).
Fortunately, the tools and knowledge to support a ground-up community GHG flux database are now available (Lowndes et al., 2017).
Here we describe an open database, COSORE (originally derived from 'COntinuous SOil REspiration'), that focuses on continuous and long-term soil-atmosphere GHG flux datasets and is intended to serve as a community resource for future synthesis and model evaluation.

METHODS
COSORE is designed to be a relatively lightweight database: as simple as possible, but not simpler. It is targeted at continuous (i.e. measured by automated systems) soil respiration flux data, but the database design accommodates manual point (survey-style) R_S fluxes, methane fluxes and chamber measurements of net ecosystem exchange as well, paralleling the recent Soil Incubation Database. Its development started in April 2019, and as of this writing (2020-09-04) the COSORE version number is 0.8.0.

Database and dataset structure
The database is structured as a collection of independent contributed datasets (Table 1), all of which have been standardized to a common structure and units. Each dataset is given a reference name (internal to COSORE) that links its constituent tables, and provides a point of reference in reports. Each constituent dataset normally has a series of separate data tables:
• description (Table 2) describes site and dataset characteristics;
• contributors (Table 3) lists individuals who contributed to the measurement, analysis, curation and/or submission of the dataset;
• ports (Table 4);
• data (Table 5) holds the actual flux observations and accompanying time-stamped data;
• ancillary (Table S1) summarizes site-level ancillary measurements;
• columns (Table S2) maps raw data columns to standard COSORE columns, providing a record for reproducibility; and
• diagnostics (Table S3) provides automatically generated statistics on the data import process: errors, columns and rows dropped, etc.
The common key linking these dataset tables is the CSR_DATASET field, which records the unique name assigned to the dataset. In addition, a CSR_PORT key field links the ports and data tables. These links make it straightforward to extract datasets that have measured particular fluxes in certain ecosystem types, or isolate only non-treatment (control) chamber fluxes, for example.
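The key-based linking described above can be illustrated with a minimal Python sketch. It uses small in-memory stand-ins for the description, ports and data tables; the dataset names, the CSR_IGBP and CSR_TREATMENT field names, and the "None" label for control ports are assumptions made for illustration, not taken from the database itself.

```python
# Hypothetical stand-ins for three COSORE tables (lists of row dictionaries).
# CSR_IGBP, CSR_TREATMENT and the "None" control label are assumed names.
description = [
    {"CSR_DATASET": "d_example_forest", "CSR_IGBP": "Deciduous broadleaf forest"},
    {"CSR_DATASET": "d_example_wetland", "CSR_IGBP": "Wetland"},
]
ports = [
    {"CSR_DATASET": "d_example_forest", "CSR_PORT": 1, "CSR_TREATMENT": "None"},
    {"CSR_DATASET": "d_example_forest", "CSR_PORT": 2, "CSR_TREATMENT": "Trenched"},
    {"CSR_DATASET": "d_example_wetland", "CSR_PORT": 1, "CSR_TREATMENT": "None"},
]
data = [
    {"CSR_DATASET": "d_example_forest", "CSR_PORT": 1, "CSR_FLUX_CO2": 2.1},
    {"CSR_DATASET": "d_example_forest", "CSR_PORT": 2, "CSR_FLUX_CO2": 1.2},
    {"CSR_DATASET": "d_example_wetland", "CSR_PORT": 1, "CSR_FLUX_CO2": 0.4},
    {"CSR_DATASET": "d_example_forest", "CSR_PORT": 1, "CSR_FLUX_CO2": 2.4},
]


def control_fluxes(description, ports, data, igbp):
    """Return flux rows from control ports of datasets in one ecosystem type."""
    # Step 1: datasets of the requested ecosystem type (keyed by CSR_DATASET).
    wanted = {d["CSR_DATASET"] for d in description if d["CSR_IGBP"] == igbp}
    # Step 2: control ports within those datasets (keyed by dataset and port).
    control = {
        (p["CSR_DATASET"], p["CSR_PORT"])
        for p in ports
        if p["CSR_DATASET"] in wanted and p["CSR_TREATMENT"] == "None"
    }
    # Step 3: flux rows whose (dataset, port) pair is a control port.
    return [r for r in data if (r["CSR_DATASET"], r["CSR_PORT"]) in control]


result = control_fluxes(description, ports, data, "Deciduous broadleaf forest")
```

The port key is only unique within a dataset in this sketch, which is why the lookup uses the (CSR_DATASET, CSR_PORT) pair rather than CSR_PORT alone.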

Versioning and archiving
COSORE uses semantic versioning (https://semver.org/), meaning that its version numbers generally follow an 'x.y.z' format, where x is the major version number (changing only when there are major changes to the database or package structure and/or function, in a manner that may break existing scripts using the data); y is the minor version number (typically changing with significant data updates); and z the patch number (bug fixes, documentation upgrades or other changes that are completely backwards compatible). Following each official (major) release, a DOI will be issued and the data permanently archived by Zenodo (https://zenodo.org/). All changes to the data or codebase are immediately available through the GitHub repository, but only official releases will be issued a DOI; we anticipate this happening on an approximately annual basis.
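As an illustration of the 'x.y.z' scheme, the following Python sketch parses version strings and flags a change in major version, which is the case the text describes as potentially breaking existing scripts. The function names are hypothetical and are not part of the cosore package.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A semantic version; tuples compare field by field, major first."""
    major: int
    minor: int
    patch: int


def parse_version(text):
    """Parse a string such as '0.8.0' or 'v0.8.0' into a Version."""
    parts = text.lstrip("v").split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 'x.y.z', got {text!r}")
    return Version(*(int(p) for p in parts))


def may_break_scripts(old, new):
    """True if the major version increased between two releases."""
    return parse_version(new).major > parse_version(old).major
```

Comparing parsed versions rather than raw strings matters because string comparison would rank '0.10.0' below '0.9.0'. Note also that the semantic versioning specification treats 0.y.z releases as initial development, during which anything may change.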

Data license and citation
The database license is CC-BY-4 (https://creativecommons.org/licenses/by/4.0/); see the 'LICENSE' file in the repository. This is identical to that used by e.g. FLUXNET Tier 1 and ICOS RI. In general, this license provides that users may copy and redistribute the database and R package code in any medium or format, adapting and building upon them for any scientific or commercial purpose, as long as appropriate credit is given. We request that users cite this article and strongly encourage them to (a) cite all constituent dataset primary publications, and (b) involve data contributors as co-authors whenever possible, as is commonly done for other global databases such as FLUXNET (Baldocchi et al., 2001; Knox et al., 2019). Users should also reference the specific version of the dataset they used (e.g. v0.6.0), the access date and ideally the specific Git commit hash. This supports reproducibility of any analyses.
TABLE 5 Summary of COSORE's data table, which holds the actual flux observations and accompanying time-stamped data. Columns include field name, description, class (i.e. type of data), units and whether or not the field is required (required fields are marked by an asterisk); although not indicated, at least one flux observation (CSR_FLUX_CO2 or CSR_FLUX_CH4) is required in every database row. Note that all data in this table are acquired at the point of GHG flux measurement; see Table S1 for site-level data.
The cosore R package can be installed from GitHub and loaded with:
devtools::install_github("bpbond/cosore")
library(cosore)

DATA ACCESS AND USE
Four primary user-facing functions (cf. Figure 1) are available:
• csr_database() summarizes the entire database in a single convenient data frame, with one row per dataset, and is intended as a high-level overview. It returns a selection of variables summarized in Tables 2-5 and Tables S1-S3, including dataset name, longitude, latitude, elevation, IGBP code, number of records, dates and variables measured;
• csr_dataset() returns a single dataset: an R list structure, each element of which is a table (description, contributors, etc., as described above);
• csr_table() collects, into a single data frame, one of the tables of the database, for any or all datasets;
• csr_metadata() provides metadata information about all fields in all tables.
FIGURE 1 Summary of COSORE structure (multiple datasets, each with six tables; Tables 2-7) and primary accessor R functions, as described in the text (see Section 2.1). For example, R users can join specific tables across all datasets using the csr_table() function, and can access individual datasets with csr_dataset(). Non-R users access flat-file versions of the same data, with essentially the same structure as the R internal structure shown here.
Two additional reporting functions may also be useful to users:
• csr_report_database() generates an HTML report on the entire database: number of datasets, locations, number of observations, distribution of flux values, etc.;
• csr_report_dataset() generates an HTML report on a single dataset, including tabular and graphical summaries of location, flux data and diagnostics.
Finally, a number of functions are targeted at developers, and include functionality to ingest contributed data, standardize data and prepare a new release. See the package documentation for more details.
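For readers working outside R, the idea behind collecting one table across datasets can be sketched in Python. This is only an analogue of what the text says csr_table() does, not the package's implementation: the nested-dictionary layout, the function name collect_table and the example dataset names are all assumptions.

```python
def collect_table(database, table, datasets=None):
    """Concatenate one named table across any or all datasets.

    `database` maps a dataset name to a dictionary of tables, where each
    table is a list of row dictionaries (an assumed layout for this sketch).
    """
    names = list(database) if datasets is None else list(datasets)
    rows = []
    for name in names:
        # A dataset may lack a given table; skip it rather than fail.
        rows.extend(database[name].get(table, []))
    return rows


# Hypothetical two-dataset database used only to exercise the function.
example_db = {
    "d_example_a": {
        "contributors": [{"CSR_DATASET": "d_example_a", "CSR_FAMILY_NAME": "Alpha"}],
        "ancillary": [{"CSR_DATASET": "d_example_a", "CSR_LAI": 4.2}],
    },
    "d_example_b": {
        "contributors": [{"CSR_DATASET": "d_example_b", "CSR_FAMILY_NAME": "Beta"}],
    },
}

all_contributors = collect_table(example_db, "contributors")
```

Because every row already carries its CSR_DATASET key, the combined table remains traceable to its source datasets without any extra bookkeeping.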

Documentation
The primary documentation for the COSORE database is this manu-

Data quality and testing
When contributed data are imported into COSORE, the package code performs a number of quality assurance checks. These include:
• Timestamp errors, for example illegal dates and times for the specified time zone;
• Bad email addresses or ORCID identifiers;
• Records with no flux value;
• Records for which the analyzer recorded an error condition.
Any errors flagged or records removed during this process are summarized in the diagnostics table that is part of each dataset (Table S3). Across all contributed datasets, a median of 7.9% of raw observations were removed for one of these reasons. Note, however, that the flux values themselves are not checked (e.g. for outliers or improbable values); currently this is the responsibility of the user.
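The record-level checks listed above can be sketched as follows. This is a simplified Python illustration, not the package's import code: the record keys (timestamp, flux, error), the timestamp format and the diagnostic names are assumptions, and the time zone, email and ORCID checks are omitted.

```python
from datetime import datetime


def quality_check(records, fmt="%Y-%m-%d %H:%M:%S"):
    """Drop records that fail basic checks and summarize what was removed."""
    kept = []
    diagnostics = {"bad_timestamp": 0, "no_flux": 0, "analyzer_error": 0}
    for rec in records:
        try:
            # Rejects impossible dates and times such as 30 February.
            datetime.strptime(rec["timestamp"], fmt)
        except ValueError:
            diagnostics["bad_timestamp"] += 1
            continue
        if rec.get("flux") is None:
            diagnostics["no_flux"] += 1
            continue
        if rec.get("error"):
            diagnostics["analyzer_error"] += 1
            continue
        kept.append(rec)
    removed = len(records) - len(kept)
    diagnostics["removed_percent"] = 100.0 * removed / len(records) if records else 0.0
    return kept, diagnostics


# Hypothetical raw records: one valid, three failing a different check each.
raw = [
    {"timestamp": "2019-06-01 12:00:00", "flux": 2.3, "error": False},
    {"timestamp": "2019-02-30 12:00:00", "flux": 1.9, "error": False},
    {"timestamp": "2019-06-01 13:00:00", "flux": None, "error": False},
    {"timestamp": "2019-06-01 14:00:00", "flux": 2.0, "error": True},
]
kept, diagnostics = quality_check(raw)
```

As in the text, the sketch makes no judgement about the flux values themselves: a record with an implausible but present flux passes through untouched.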
The cosore R package also has a wide variety of unit tests (Zhao, 2003) that test code functionality via assertions about function behaviour and by verifying behaviour of those functions when importing test datasets (of different formats and with a variety of errors, for example). In total these tests cover 97.8% of the codebase.

CURRENT DATA AND COMMUNITY CONTRIBUTIONS
The database currently has 89 contributed datasets with a total of 8.14 million flux observations across 20 years and five continents (Table 1; Figure 2), widely distributed in climate and biome space, from Arctic to tropical ecosystems (Figure 3). In terms of data volume, the current database is dominated by CO2 fluxes in evergreen and deciduous forests (Table 1; Figure 4) from the mid-northern latitudes (Figures 2 and 5). These data are unequally distributed around the year, with many more data available during the Northern Hemisphere growing season (Figure 6). There is an order of magnitude more data in COSORE from the Northern than Southern Hemisphere, and currently no CH4 data at all from the Southern Hemisphere. The interval between measurements ranges from 3 to 1,440 min, with 25%-50%-75% quantile values of 30, 60 and 60 min respectively. A one-hour interval between measurements is thus by far the most common.
FIGURE 8 Distribution of CO2 fluxes in COSORE datasets, by IGBP classification (cf. Table 1).
Dataset CO2 fluxes (mostly soil respiration, but as noted above also some heterotrophic respiration and net ecosystem exchange) are generally log-normally distributed in most IGBP classifications (Figure 8). The distribution of CH4 is more complex, with most data clustered around 0 nmol m−2 s−1 but featuring long distribution tails to many orders of magnitude larger fluxes for both net uptake and release (Figure 9), due to the complexity and variety of biochemical processes involved in methane production and oxidation (Riley et al., 2011).
The COSORE team welcomes contributions of soil-atmosphere GHG flux data. We prioritize continuously measured (i.e. from automated systems including non-chamber approaches) soil respiration datasets, but the database structure also accommodates discontinuous (i.e. manual) data, as well as measurements of methane, net ecosystem exchange and heterotrophic respiration fluxes.
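The measurement-interval quantiles quoted above could be computed along the following lines. This Python sketch assumes a single chamber's timestamps in a fixed text format and uses the inclusive quantile method from the standard library; the function names, the example timestamps and the choice of method are assumptions, and the paper does not state how its quantiles were calculated.

```python
from datetime import datetime
from statistics import quantiles


def intervals_minutes(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Minutes elapsed between consecutive measurements, in time order."""
    times = sorted(datetime.strptime(t, fmt) for t in timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


def interval_quartiles(timestamps):
    """25%, 50% and 75% quantiles of the measurement interval, in minutes."""
    return quantiles(intervals_minutes(timestamps), n=4, method="inclusive")


# Hypothetical record: one half-hour gap followed by hourly sampling.
stamps = [
    "2019-06-01 00:00",
    "2019-06-01 00:30",
    "2019-06-01 01:30",
    "2019-06-01 02:30",
    "2019-06-01 03:30",
]
gaps = intervals_minutes(stamps)
quartiles = interval_quartiles(stamps)
```

In a real analysis the intervals would presumably be computed separately for each dataset and port before pooling, since consecutive rows from different chambers do not represent a sampling interval.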
Contributors receive a QA/QC report for all submissions, including details on invalid data, removed data, etc., and can then request corrections or changes before the data are uploaded and go 'live'; contributors may also request a temporary embargo on their data. There currently is no standardized data template that contributors must follow, but we anticipate this changing before version 1.0 (planned for late 2020). There is no minimum data coverage required, either in time or space, although we suggest datasets should at a minimum span a growing season.
It is important to note that COSORE itself is not (

CONCLUSIONS: STRENGTHS, LIMITATIONS AND FUTURE DIRECTIONS
COSORE is a 'coalition of the willing' (sensu Novick et al., 2018), and intended to be a community-driven resource for analyses of soil-atmosphere GHG exchange. Possible analyses and next steps include   following similar designs. Others, such as ForC (Anderson-Teixeira et al., 2018), take a broader scope and also focus on annual fluxes. We hope that the large volume of standardized, high-frequency GHG flux data in COSORE will enable novel global-scale syntheses, modelling activities, new insights driven by machine learning (Albert et al., 2017; Vargas et al., 2018) and conceptual advances that are currently impossible. Linking COSORE data with other high-resolution, open databases such as FLUXNET (Baldocchi et al., 2001) and the ICOS RI Carbon Portal (https://www.icos-cp.eu/data-services) is also likely to yield new insights.
COSORE has a number of limitations, some peculiar to the effort and others intrinsic to the discipline and community. First, as with many observations in the ecological and Earth sciences, it is spatially non-representative at the global scale (Xu & Shang, 2016), and currently dominated by datasets from North America and East Asia (Figure 2). There are no datasets from Africa (cf. Epule, 2015) and little South American data. The IGBP representation is skewed as well (Figure 4), although the database's climate space coverage is reasonable (Figures 3 and 6). This spatial patchiness, a function of many factors including economic development, infrastructure and scientific investment, imposes significant restrictions on our ability to draw global inferences and analyses from extant observational data.
A second category of limitations arises from COSORE's particular design. The database is oriented towards lightweight and minimal requirements, aiming for breadth over depth. This has benefits and costs. Having low barriers to entry shifts the burden of contributing data away from data providers, and keeping the design lightweight (with limited controlled vocabularies, ancillary data, etc.) has kept the burden on COSORE's designers and maintainers manageable; we are acutely aware that every additional field or piece of information imposes a cost, both immediately (for implementation) and in perpetuity (for maintenance). This was the rationale behind focusing initially on previously uncollated continuous measurements: to maximize scientific impact relative to the labour involved. In fact, nothing in COSORE's design itself precludes incorporation of spatially distributed, survey-style measurements. COSORE also remains relatively immature, with e.g. no 'level 2' data product incorporating external data (e.g. Fick & Hijmans, 2017). This imposes an additional cost (of time and effort) on database users to locate and integrate externally available data themselves.
Finally, analyses using COSORE will be limited by the nature of soil respiration and other soil-atmosphere gas flux measurements, and the state of the disciplines' networks and community. Automated measurements trade space for time: the systems are more expensive and require dedicated power, and do not perform well under certain conditions, limiting their spatial and temporal coverage at many scales (Barba et al., 2018). There remains no institutionally backed network akin to AmeriFlux or ICOS, and while there have been efforts to integrate chamber flux data into these networks' data products, this has inevitable consequences for continuity and consistency. There is also no standardization of measurement depths for ancillary measurements (e.g. soil temperature and moisture) in the manner of a top-down network such as NEON (Schimel et al., 2007) or ICOS RI (Op de Beeck et al., 2018). We intend, however, to shift this responsibility to data contributors before version 1.0, providing a template form that contributors must follow. This will allow for semiautomated data ingestion and follows the practices of many other earth sciences databases. Unusual or outlier measurements could also be automatically flagged for downstream users. More ambitiously, we have put substantial design work into ensuring interoperability so that COSORE data should flow relatively seamlessly into (or from) ESS-DIVE, AmeriFlux and ICOS RI.

Future directions
A long-term vision is that COSORE data could, for example, automatically be made available in the larger community database. It is crucial, we believe, that COSORE contributors have assurances that their data contributions are traceable across versions and that it is not necessary to prepare and submit their data to multiple repositories. Finally, currently all data are included in the COSORE R package download. While convenient for users, this model will likely break down when the database doubles or triples in data volume. At that point, the data will need to be hosted elsewhere and downloaded only on demand.

ACKNOWLEDGEMENTS
The authors declare no conflicts of interest. This research was

DATA AVAILABILITY STATEMENT
The data and code that support the findings of this study are openly available on GitHub at https://github.com/bpbond/cosore.