MacroSheds: A synthesis of long‐term biogeochemical, hydroclimatic, and geospatial data from small watershed ecosystem studies

The US Federal Government supports hundreds of watershed monitoring efforts from which solute fluxes can be calculated. Although instrumentation and methods vary between studies, the data collected and their motivating questions are remarkably similar. Nevertheless, little effort toward their compilation has previously been made. The MacroSheds project has developed a future‐friendly system for harmonizing daily time series of streamflow, precipitation, and solute chemistry from 169+ watersheds, and supplementing each with watershed attributes. Here, we describe the breadth of MacroSheds data, and detail the steps involved in rendering each data product. We provide recommendations for usage and discuss when other datasets might be more suitable. The MacroSheds dataset is an unprecedented resource for watershed science, and for hydrology, as a small‐watershed supplement to existing collections of streamflow predictors, like CAMELS and GAGES‐II. The MacroSheds platform includes a web dashboard for visualization and an R package for data access and analysis.

Technology type(s): remote sensing, long-term dataset synthesis.
Temporal range: 1963-2022. Frequency or sampling interval: daily. Spatial scale: Watershed site-based data synthesized for 169 gauged stream sites; water chemistry and/or streamflow for 495 sites primarily across North America at the time of this publication.

Background and motivation
Watershed ecosystem science began in the late 1960s, when Herb Bormann and Gene Likens began estimating precipitation inputs and stream water exports for small gauged watersheds in the Hubbard Brook Experimental Forest (Bormann et al. 1968). These input and output fluxes and their differences were used to detect trends in air pollution, climate, rates of chemical weathering, nutrient limitation, and nutrient saturation, and to quantify the magnitude, duration, and severity of disturbance effects on ecosystem element retention and loss (Likens 2013). All of these insights were gained from consistent comparison of precipitation and streamflow volumes and chemistry over long time scales. The simplicity of the watershed ecosystem approach and the magnitude of its scientific impact have led to similar watershed ecosystem studies being conducted in thousands of watersheds around the globe.
Altogether, hydrology labs and experimental forests operated by the US Forest Service, Department of Energy, and the National Science Foundation's Long Term Ecological Research (LTER), National Ecological Observatory Network (NEON), and Critical Zone Collaborative Network (CZNet, formerly CZO) programs support hundreds of small watershed studies around the United States (Fig. 1). Each of these programs collects nearly identical types of data. Yet to date, there has been no attempt to collate these datasets into a synthetic data platform that would facilitate comparison across sites. The notable examples where cross-site analyses have been performed (Williard et al. 1997; Kaushal et al. 2014; Zhang et al. 2017) have been limited in spatial scope, or applied to only one element (like N) or general water balance, and each of these individual efforts required significant supplemental funding and data expertise to enable synthesis. Kaushal et al. (2014) found that processing and retention of carbon and nitrogen varied significantly on a scale of kilometers, stressing the need for more studies across spatial scales. Synthesis work by Zhang et al. (2017) yielded important insights on cross-scale hydrologic response to forest changes using routine statistical tests, but synthesizing the data used in those tests required a much larger effort. Differences in data structure, access method, time and location representation, and other challenges inherent to merging even relatively consistent datasets have ultimately limited the scale of inference in watershed ecosystem science. Indeed, watershed scientists have become increasingly self-critical, recognizing the failure of our community to develop generalities and theories that apply across scales (McDonnell et al. 2007; Kirchner 2009; Lohse et al. 2009). Much of recent watershed science has focused on gaining ever finer detail on the spatial and temporal heterogeneity of flow paths, water residence times, and biogeochemical processes (McClain et al. 2003; Bernhardt et al. 2017). This fine-scale focus has identified many unique idiosyncrasies of individual watersheds but has not helped us develop general theories about watershed dynamics that can be applied at regional to global scales. It is a fair critique to suggest that most watershed ecosystem studies remain rather parochial, involving detailed studies of individual or paired watersheds, or surveys of a small set of attributes across multiple watersheds. Macroscale watershed science, or the search for general principles that describe functional capacity and behavior across watersheds, has been limited. A major reason for this lack of large-scale focus is the challenge of data access and integration across sites.
New requirements for data sharing have made it possible to access most National Science Foundation (NSF)-funded watershed science data, yet individual datasets are rarely interoperable across research sites, even when stored in the same repositories.
We find inspiration for harmonizing large datasets in the hydrology community, where there are two major modern efforts to synthesize records of discharge, precipitation, and watershed/catchment attributes: GAGES-II and CAMELS (Falcone 2011; Newman et al. 2014; Addor et al. 2017). GAGES-II provides geospatial data and classifications (reference vs. nonreference) for the watersheds of 9322 US Geological Survey (USGS) stream gages. The CAMELS dataset builds on progress from GAGES-II by identifying 671 minimally disturbed watersheds, compiling their precipitation and runoff time series, and generating watershed attributes for each. Though preeminent examples of data aggregation and distribution, these datasets are limited in scope to physical hydrology, mostly in watersheds too large to meet the assumptions of the watershed ecosystem concept, that is, uniform geology and a minimally permeable base of rock or permafrost (Fig. 2). By contrast, when the conditions of the watershed ecosystem concept are satisfied, it is possible to construct budgets of inputs, outputs, and net loss or gain for countless solutes of ecological importance. Still, CAMELS and GAGES-II provide a roadmap for synthesizing analysis-ready data for macroscale watershed ecosystem work. With 500 combined citations, they also demonstrate the value of such syntheses to the hydrology community. These datasets have enabled foundational shifts in the ways we make predictions at scale, especially through recent machine-learning advances in rainfall-runoff modeling (Kratzert et al. 2018, 2022). MacroSheds opens this landscape of opportunity to the biogeochemistry community. Our primary goal in developing the MacroSheds dataset is to merge all US federally funded watershed ecosystem studies into a common platform, and to use that platform to develop a classification of watershed ecosystems that identifies differences in watershed functional traits (sensu McDonnell et al. 2007).
Understanding these functional traits will allow us to predict how watershed biogeochemical cycles will respond to changing patterns of climate and element deposition. Ultimately, we hope that macroscale watershed science can build a mechanistic understanding of how variation in soil chemistry and biological demand for elements will alter the stoichiometric ratios of watershed outputs relative to inputs (in deposition and weathering). Merging records from hundreds of watershed ecosystem studies into a common format is the first step in developing macroscale watershed science. With this feat accomplished, a nearly limitless number of questions can be asked by researchers across the disciplines of hydrology, climate science, and ecology. We aim to facilitate these analyses through the MacroSheds dataset, R package, and web portal, which together constitute an open data platform.

Fig. 2. Comparison of watershed areas as represented in the MacroSheds, CAMELS, and GAGES-II datasets. Each vertical bar represents a single watershed, but note that pink and blue bars have been widened for visibility. The tail of the pink arrow marks the upper limit of MacroSheds watershed areas. The MacroSheds dataset fills out two orders of magnitude at the small end, with 122 watersheds under 10 km² and 68 under 1 km². For CAMELS, these numbers are 8 and 0, respectively. For GAGES-II, they are 207 and 2. Only those MacroSheds sites for which discharge data are publicly available are included in this figure.
In the MacroSheds dataset, we have unified publicly available data records of precipitation, streamflow, precipitation chemistry, and stream chemistry from watershed ecosystem studies that meet a requirement of at least monthly stream chemistry sampling. We used a common procedure to delineate the watersheds of any gauged stream sites without published boundaries, and daily, gridded climate data from PRISM (Daly et al. 2008) and Daymet (Thornton et al. 2020) to provide standardized estimates of precipitation, air temperature, and other climatological parameters within each watershed boundary. For each delineated watershed we summarized publicly available, gridded products encompassing topography, geology, soil, vegetation, and landcover attributes. A subset of watershed summary statistics and climate forcings included with the MacroSheds dataset are immediately commensurable with those of the published CAMELS dataset. MacroSheds therefore functions secondarily as a supplement to CAMELS, enhancing the predictive power of the combined set, especially for small watersheds.

Access methods and dataset contents
The MacroSheds dataset and all associated documentation can be found on the Environmental Data Initiative (EDI) data portal, at https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1262. This URL will always point to the most recent dataset version, and at the time of this writing is synonymous with https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1262&revision=1 (Version 1). When new versions are published, the old versions will still be accessible by appending a version number to the end of the base URL in the above fashion. Throughout our current funding cycle, we intend to update this dataset annually with newly available data.
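The versioning scheme above lends itself to programmatic access. A minimal sketch in Python (the helper name is ours, not part of any official EDI client):

```python
def edi_url(identifier=1262, revision=None):
    """Build an EDI data portal URL for a dataset in the 'edi' scope.

    With no revision, the URL resolves to the most recent version;
    passing revision=1 pins MacroSheds Version 1.
    """
    base = ("https://portal.edirepository.org/nis/mapbrowse"
            f"?scope=edi&identifier={identifier}")
    return base if revision is None else f"{base}&revision={revision}"

latest_url = edi_url()
v1_url = edi_url(revision=1)
```

The same pattern pins any archived revision for reproducible analyses.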
The dataset can also be downloaded through the "macrosheds" package for R (https://github.com/MacroSHEDS/macrosheds; Rhea et al. 2023a), or explored without downloading through the visualization platform at macrosheds.org. An interactive data catalog is available under the Data tab on macrosheds.org. See Table 1 for terms used throughout the following sections.
This dataset is derived from data already published in public repositories, primarily from US federally funded watershed studies, and in compliance with existing grant requirements. We report combined discharge and chemistry for 169 watershed studies (Fig. 1). The core dataset consists of seven data products (Table 2) grouped into two components, referred to below as "time series" and "watershed attributes." Each of these components of the core dataset has a supplementary counterpart in which data structure, variables, and methods parallel the CAMELS dataset, to maximize interoperability between MacroSheds and CAMELS.

Table 1. Common terms as used within the MacroSheds dataset and this paper.

Watershed: All land area contributing runoff to a point of interest along a stream, regardless of contributing area. Does not necessarily account for inputs from subsurface flow or human-constructed diversions. The terms "catchment" and "basin" are sometimes used in this way.
Site: An individual gauging station or stream sampling location and its watershed.
Domain: One or more sites under common management.
Network: One or more domains under common funding/leadership.
Product: A collection of data, possibly including multiple datasets/tables. Primary sources may separate products by temporal extent/interval, scientific category, detection method, and/or sampling location. MacroSheds products are detailed in Table 2.
Site-product: The collection of all data for a single MacroSheds product, available at a single site.

Time-series data
For the time-series component, we harmonized both physical hydrology and stream chemistry variables, capturing tremendous variation in hydrologic regimes and solute concentrations. MacroSheds data span a wide range of mean annual runoff (three orders of magnitude; Fig. 3). The distribution of flows is quite variable, with a high frequency of high-flow events, as is typical of the small, steep catchments that dominate this dataset. A significant fraction of streams goes completely dry in the average year, with baseflow index ranging from 0 to 0.9. Water quality varies greatly among streams in the MacroSheds dataset, with pH covering almost the full range of that reported for natural waters (3-8; Wetzel 2001). Dissolved phosphorus and nitrogen concentrations are generally low compared with previously published data compilations (Falcone 2011; Newman et al. 2014), largely because this dataset is dominated by undisturbed, smaller watersheds. In contrast, dissolved organic carbon (DOC) ranges from near detection to > 30 mg L⁻¹ (nearly black water), reflecting wide variation in wetland habitat. The MacroSheds dataset does include some site-specific sample collection biases, in that fewer than half of sites routinely collect total suspended solids (TSS), dissolved inorganic carbon (DIC), and alkalinity data (Fig. 4). These water quality patterns arise from geologic variation, incoming precipitation chemistry, vegetation cover, and patterns of ecosystem productivity (Fig. 5). The specific variables available within each watershed study vary widely, but the MacroSheds dataset includes at a minimum stream discharge or major stream ion concentrations (Ca, Mg, K, SO4, etc.) for each site. In all, the MacroSheds dataset contains 185 stream and precipitation variables, including concentrations of nutrients, metals, photosynthetic pigments, and dissolved gases, as well as temperature, turbidity, and other common water quality metrics where available.
The total numbers of sites with discharge and chemistry data are 181 and 484, respectively. The total number with both is 169. A breakdown of data availability and data sources by domain is given in Table 4, but for a complete list of variables by site and temporal range, consult variables_timeseries.csv on EDI or visit the interactive data catalogs under the Data tab at macrosheds.org.
MacroSheds time-series data are tiered by domain according to the restrictiveness of licensing and intellectual rights (IR) terms associated with their primary sources. Tier 1 domains have minimal restrictions, requiring at most standard attribution, while Tier 2 domains require some additional action on the part of data users. Data tiers and license/IR information are detailed in our Data Use Agreements (data_use_agreements.docx on EDI), and citations for all MacroSheds primary sources are included in Tables 4 and 6 of this document. A full compendium of attribution, contact, and legal information can be found in our documentation on EDI (attribution_and_intellectual_rights CSV files). The "Data Use and Recommendations for Reuse" section of this document contains instructions on efficiently achieving license/IR compliance as a user of MacroSheds data.
MacroSheds time-series data are provided as CSV files, separated by domain and indexed by date, site code, and variable. The column structure is laid out in Table 3 and later referred to as "MacroSheds format."

Table 3. Column structure of MacroSheds time-series files.

var: Variable code, including sample type prefix (described in the "Tracking of Sampling Methods for Each Record" section). See variables_timeseries.csv on EDI or run ms_load_variables() from the macrosheds package for more information.
val: The data value.
ms_status: QC flag. See the "Technical Validation" section. Lowercase "ms" here stands for "MacroSheds."
ms_interp: Imputation flag, described in the "Temporal Imputation and Aggregation" section.
val_err: The combined standard uncertainty associated with the corresponding data point, if estimable. See the "Detection Limits and Propagation of Uncertainty" section for details.
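To illustrate how tables in this long format might be consumed, here is a sketch using Python/pandas; the rows, sites, and values are invented for demonstration, but the column names follow the MacroSheds format:

```python
import io
import pandas as pd

# A few invented rows in MacroSheds long ("tidy") format.
raw = io.StringIO(
    "datetime,site_code,var,val,ms_status,ms_interp,val_err\n"
    "2015-06-01,w1,IS_discharge,12.4,0,0,0.01\n"
    "2015-06-01,w1,GN_NO3_N,0.021,0,0,0.001\n"
    "2015-06-02,w1,IS_discharge,11.9,1,0,0.01\n"
)
ts = pd.read_csv(raw, parse_dates=["datetime"])

# Keep only records flagged clean (ms_status == 0) and not imputed.
clean = ts[(ts.ms_status == 0) & (ts.ms_interp == 0)]

# Pivot to one column per variable for site-level analysis.
wide = clean.pivot_table(index="datetime", columns="var", values="val")
```

The long format makes such filtering and pivoting a one-liner in most analysis environments.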
In addition to our core time-series dataset, we provide a separate, supplementary collection of "CAMELS-compliant" time series (citations for the primary sources involved are given in Table 4).

Watershed attribute data

The core watershed attributes component of the MacroSheds dataset is an extensive spatial summary product, compiled from published, gridded products (Table 6). It describes climate, geology, terrain, vegetation, and land cover (Fig. 5). We also provide a separate, supplementary collection of "CAMELS-compliant watershed attributes" that conforms to the variables, data sources, and methods used in the CAMELS dataset. Importantly, the MacroSheds dataset covers much smaller watersheds than those included in the CAMELS dataset (Fig. 2). Due to the time cost of delineating watersheds, we elected to summarize attributes only for watersheds with discharge data, as they have substantially higher analytical potential. As an example, MacroSheds sites in Mediterranean California tend to have porous bedrock with high sulfur content, and to receive little nitrogen deposition, while eastern temperate forest sites have the highest geologic nitrogen content and receive the most nitrogen and sulfate deposition (Fig. 5). Concentrations of nitrogen and sulfur species in the stream water of each ecoregion will depend on these and other factors, such as mineralization, plant uptake, and erosion rates, and cumulative fluxes will further depend on the long-term hydraulic output of each stream. Not only that, but the relationship between concentration and flux may change with hydraulic regime over the course of seasons or decades.
Watershed attribute data are provided as CSV files in two formats, representing different levels of aggregation. At the coarsest level, gridded spatial data are summarized to a single value per variable per watershed, and provided in wide format. However, some watershed attributes are temporally explicit, and our second format preserves the dates associated with each model estimation or satellite pass. Column structure for this format is given in Table 5.

Criteria for dataset discovery and inclusion in the MacroSheds dataset
Sites included in the MacroSheds dataset were primarily identified through the NSF-funded LTER, LTREB, and CZNet (formerly CZO) programs (113 of 169 sites, as of MacroSheds v1.0). Additional sites funded or managed by the US Geological Survey, Department of Energy, and Forest Service were also included. One domain in MacroSheds is not associated with the federal government of the United States, but it will be joined by other US and international watershed studies as the MacroSheds project expands. NEON provides data products that will be integral to a future version of the MacroSheds dataset. Currently, NEON remains in its early operational phase, and its data products will be included in MacroSheds pending resolution of water quality and continuous discharge data anomalies (Rhea et al. 2023b).
To be considered for inclusion in MacroSheds, a site requires either automated monitoring of stream discharge or routine sampling of stream chemistry, for at least a full year (minus periods of freezing or drying), as well as public data hosting. Additional data describing the quantity and chemistry of precipitation are highly valuable, but not required. Watershed boundaries can be delineated and geospatial summaries generated via MacroSheds tools, so these are not required.

Data processing system: Design and overview
The data acquisition and processing routines used to build the MacroSheds dataset comprise a system of cyclical ingestion pipelines (Fig. 6), written entirely in R (R Core Team 2022). Source code is designed functionally and organized hierarchically, mirroring the network-domain-site hierarchy of the institutions that manage watershed studies. This allows routines specific to a domain, or shared across a network, to be loaded as modules, minimizing code redundancy and simplifying inclusion of new sites. Improvement of this design is ongoing, and will soon enable user data contributions in exchange for watershed boundaries, summary statistics, and derived time-series products.
For each domain, time series of discharge, precipitation, and chemistry are first downloaded and saved locally in whatever form and format they are provided. They are then processed by site-product into MacroSheds format. If a watershed boundary is not provided, it is delineated. Additional products are then derived, namely watershed-mean precipitation depth and chemistry (and daily solute flux may be generated via the "macrosheds" R package if desired; see the "Flux Calculation" section). Finally, we generate spatial summary statistics for each watershed.
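Daily solute flux, as derivable with the "macrosheds" R package, is in essence concentration times discharge with unit bookkeeping. A back-of-envelope Python sketch (ours, not the package's implementation):

```python
def daily_flux_kg(conc_mg_per_l, q_l_per_s):
    """Approximate daily solute flux (kg/d) from a daily mean
    concentration (mg/L) and daily mean discharge (L/s):

        mg/L * L/s * 86400 s/d * 1e-6 kg/mg = kg/d
    """
    return conc_mg_per_l * q_l_per_s * 86400 * 1e-6

flux = daily_flux_kg(2.0, 50.0)  # 2 mg/L at 50 L/s -> 8.64 kg/d
```

Real flux estimation must also contend with interpolation of concentrations between sampling dates, which is why complete records matter (see the "Detection Limits" section).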
The processing system is designed insofar as possible to accommodate future deviations from the ways primary sources currently structure and serve their products. Each pipeline is fault-tolerant, so if provider-side changes introduce errors at any stage of data access or processing, the errors are logged, the developers are notified by email, and the system moves on. Any change involving file headers, URL paths, or splitting/combining of datasets requires careful accommodation by the MacroSheds team (and anyone else who directly reuses primary data), so we encourage data providers to maintain structural consistency across dataset versions whenever feasible.

Fig. 6. Visualization of the four phases of MacroSheds data processing for a single domain: retrieval, harmonization (munging), derivation, and postprocessing, focusing on the evolution of precipitation data (P) from raw to final form as part of a domain dataset. Gold circles represent processing "kernels": modular and customizable sets of routines that carry out the core steps of the first three phases. Within each phase, zero or more kernels are called in sequence, depending on which products need to be updated, as determined by the progress tracker. In Phase 1, retrieval kernels download primary source data. During Phase 2, kernels are called by one of four "munge engines" (pentagons) depending on whether primary source files are separated by site, by time, by product, or some combination. After Phase 3, time-series and geospatial data are organized into one file for each of the core MacroSheds products (discharge, stream chemistry, precipitation, precipitation chemistry, gauge locations, watershed boundaries). After Phase 4, a complete dataset has been generated for a single domain, and the process repeats for the next domain.
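The fault-tolerance pattern described above (log, notify, move on) can be sketched as follows; the kernel and domain names are hypothetical, and email notification is stubbed:

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("pipeline_sketch")

def run_domain(domain, kernels, notify=lambda msg: None):
    """Run each processing kernel for a domain. On failure, log the
    error, notify developers (stubbed here), and continue, so one
    provider-side change cannot halt the whole synthesis."""
    failures = []
    for kernel in kernels:
        try:
            kernel(domain)
        except Exception as exc:
            log.error("%s/%s failed: %s", domain, kernel.__name__, exc)
            notify(f"{domain}/{kernel.__name__}: {exc}")
            failures.append(kernel.__name__)
    return failures

def retrieve(domain):  # placeholder kernel that succeeds
    pass

def munge(domain):     # placeholder kernel simulating a schema change
    raise ValueError("unexpected file header")

failed = run_domain("example_domain", [retrieve, munge])
```

Collecting failures rather than raising lets a nightly run report all broken site-products at once.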

Time-series data access and amenity to harmonization
Among the 25 domains currently included in the MacroSheds dataset, we have identified five distinct tiers of "harmonization amenity," or the convenience with which we were able to access discharge, precipitation, and chemistry data and unify their idiosyncratic differences within a domain. Harmonization amenity encompasses the core elements of the FAIR principles: Findability, Accessibility, Interoperability, and Reusability of data and metadata (Wilkinson et al. 2016), but also whether conceptually adjacent datasets share internal structure, and whether and how revisions are designated. Together, these elements determine the usability of public data, and the long-term practicality of including a source dataset in an ongoing synthesis effort like MacroSheds. Importantly, harmonization amenity tiers say nothing of the quality of a domain's data, only of its data structure and infrastructure. Licensing and IR restrictions are also a separate issue, with a separate tiering system (see data_use_agreements.docx). Our harmonization amenity tiers range from A, the most amenable, to E, the least amenable.
At Tier-E, data access is through personal correspondence only. As such, internal file structure is unpredictable and programmatic version-checking is impractical. We have generally avoided Tier-E domains and make no guarantees about their continued inclusion in MacroSheds, as they require an ongoing time commitment from our developers. We encourage watershed data managers to contribute routinely to public repositories like EDI, DataONE, HydroShare, or ESS-DIVE, so that we can build automated connections to MacroSheds.
Many datasets are hosted as hyperlinked, static files (Tier-D). This way of serving data is standardized only by the rules of transfer protocols (HTTP, FTP, etc.), which do not facilitate reliable file versioning (Postel and Reynolds; Belshe et al. 2015); however, it is possible to use the "last-modified" date in the header of a static file as a proxy for file version, as MacroSheds does. Many USFS and DOE domains, and even some CZNet domains, are Tier-D.
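The Last-Modified proxy strategy can be sketched as follows, assuming plain HTTP(S) hosting (the helper names are ours):

```python
from urllib.request import Request, urlopen

def remote_version_token(url, timeout=30):
    """HEAD a statically hosted file and return its Last-Modified
    header (falling back to ETag) as a proxy for a file version."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Last-Modified") or resp.headers.get("ETag")

def needs_update(new_token, cached_token):
    """Re-download when the proxy token is missing (version unknowable)
    or differs from the token recorded at last retrieval."""
    return new_token is None or new_token != cached_token
```

A HEAD request transfers only headers, so hundreds of files can be version-checked cheaply before any data are downloaded.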
By hosting data in any public data repository that follows FAIR data standards, a domain can easily achieve Tier-C harmonization amenity or higher, meaning related files are naturally grouped or linked in a way that aids discovery. Most repositories permit straightforward versioning of files and file collections; however, in Tier-C the onus is on data managers to establish that an uploaded resource is a new version of some existing resource. Most CZNet domains are housed on CUAHSI's HydroShare, a premier environmental data and code repository that allows for easy creation of new versions of "formally published" resources. However, some CZNet domains have not published their data formally and edit their existing resources rather than creating official new versions. This makes programmatic identification of new file versions at least as difficult as with Tier-D harmonization amenity.
Datasets associated with Tier-B domains are easily found and fully versioned. Within MacroSheds, most domains associated with the LTER network are Tier-B, owing in part to the strict metadata and publishing requirements of the EDI data portal and underlying PASTA+ repository, which all but ensure proper versioning and within-domain findability of related files. Still, for Tier-B domains, neither data hosting architecture nor management dictates the internal structure or naming of files; however, the EDI repository does provide an effective set of recommendations to help contributors adhere to best practices: https://edirepository.org/resources/cleaning-data-and-quality-control.
At the forefront (Tier-A) of harmonization amenity are the USGS and NEON domains, each also a network per se, which provide systematic access and consistent data structure across all the sites they manage. This means, for example, that the URL for water quality time series at site X is intuitively related to that for site Y, and that once downloaded, the two datasets are structured and formatted identically. Moreover, NEON and the USGS provide web services through which to explore, retrieve, and even manipulate their collections programmatically. In R, we queried these endpoints conveniently through official client packages (Lunch et al. 2021; Cicco et al. 2022). Because Tier-A institutions control data collection, storage, and hosting, they can establish a consistency of access and internal structure that is much more difficult to achieve post hoc.
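As an example of such systematic access, the USGS NWIS daily-values web service follows one URL template for every gauge. A sketch of query construction (in practice, we used the official client packages cited above rather than raw URLs):

```python
from urllib.parse import urlencode

def nwis_daily_values_url(site, parameter_cd="00060",
                          start="2020-01-01", end="2020-12-31"):
    """Build a query against the USGS NWIS daily-values service.
    Parameter code 00060 is discharge (cubic feet per second). The
    same template works for any USGS gauge, which exemplifies the
    Tier-A consistency of access described above."""
    query = urlencode({
        "format": "json",
        "sites": site,
        "parameterCd": parameter_cd,
        "startDT": start,
        "endDT": end,
    })
    return "https://waterservices.usgs.gov/nwis/dv/?" + query

url = nwis_daily_values_url("01646500")
```

Swapping the site number is the only change needed to retrieve an analogous record from any other gauge.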

Time-series data processing
This section details major steps taken to harmonize disparate chemistry, discharge, and precipitation data into MacroSheds format (see the "Data Description" section) and extract useful metadata. In any harmonization effort, there is a tradeoff between fidelity to the original datasets as they are, and cohesion of the aggregate set. We have endeavored to produce a MacroSheds dataset that is parsimonious yet high in analytical potential, and that assimilates provided metadata where practical.
Each MacroSheds data ingestion pipeline performs a wide variety of basic processing routines. For a technical account of the steps involved in (1) conforming site and variable names, (2) resolving datetime formats and time zones, (3) converting units, and (4) reshaping data tables, consult code_autodocumentation.zip on EDI, and our complete codebase at https://github.com/MacroSHEDS/data_processing. The rest of this section covers assimilation of metadata on sampling methods and detection limits, propagation of uncertainty, and temporal imputation/aggregation.

Tracking of sampling methods for each record
The MacroSheds dataset includes measurements recorded by installed equipment and by hand (grab sample), and end users may wish to filter it accordingly. We further distinguish between measurements made via sensors vs. analytical or visual means. The former distinction is made programmatically with simple heuristics (e.g., inconsistent sample interval precludes autosampling), and the latter by consulting primary metadata. These distinctions are summarized as two-letter "sample regimen" codes prefixed to each MacroSheds variable code: "I" or "G" for "installed" vs. "grab," and "S" or "N" for "sensor" vs. "non-sensor." For example, "IS_discharge." At present, we do not report specific analytical methods for time-series variables, effectively assuming that commensurate units imply commensurability. We know this to be misleading for some variables-in particular those measured via fluorescence or absorbance-and intend to include more detailed methods for at least these variables (e.g., FDOM, turbidity) in a future release.
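The prefix convention can be unpacked mechanically; a minimal parser (ours, for illustration):

```python
def parse_sample_regimen(var_code):
    """Split a MacroSheds variable code like 'IS_discharge' into its
    two-letter sample regimen prefix and the bare variable name.
    maxsplit=1 preserves underscores within variable names."""
    prefix, name = var_code.split("_", 1)
    return {
        "installed": prefix[0] == "I",  # "I" installed vs. "G" grab
        "sensor": prefix[1] == "S",     # "S" sensor vs. "N" non-sensor
        "variable": name,
    }

info = parse_sample_regimen("IS_discharge")
```

End users can apply such a parser to filter, say, sensor-derived records out of a concentration series before fitting a rating model.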

Detection limits and propagation of uncertainty
We were able to locate published limits of detection (LODs) for solute concentrations of only 10 of the 24 domains included in Version 1 of the MacroSheds dataset. For the rest, we assumed each variable's LOD to be the minimum LOD for that variable across the 10 domains with reported values. We do not attempt to infer LODs from the data, for example, by assuming they are approximated by the minimum reported absolute value. This risks egregious overestimation wherever measured values never approach the LOD, or underestimation wherever reported values have been transformed or determined via a calibration or rating curve.
Accurate cumulative flux calculations depend on relatively complete data records. It is thus critical that below-detection-limit (BDL) samples be given a numeric value, so that they are not confused with records for which a measurement is truly missing and must be naively imputed. BDL measurements are variously reported by primary sources as ½ LOD, ¼ LOD, LOD, 0, missing, and so on. Some domains do not report BDL measurements at all. For consistency, we replace any value flagged as BDL with ½ of the reported/estimated LOD and set the corresponding ms_status to 1 ("questionable," vs. 0 for "clean"; see the "Technical Validation" section). Only values explicitly flagged as BDL are replaced in this way. For the rare case in which a value is flagged as BDL and no LOD is reported for the corresponding variable at the reporting domain or any other domain, we set the value to 0 and the ms_status to 1. Within the MacroSheds dataset, BDL values are not flagged as such, but BDL flags can be reconstructed if necessary by cross-referencing any time-series dataset with detection_limits.csv on EDI.
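The substitution rule above reduces to a few branches; a sketch (the function name is ours):

```python
def resolve_bdl(value, flagged_bdl, lod):
    """Return (val, ms_status) under the BDL convention described
    above: flagged values become LOD/2 with ms_status = 1
    ('questionable'); if no LOD is known at any domain, they become 0
    (still flagged). Unflagged values pass through with ms_status = 0
    ('clean')."""
    if not flagged_bdl:
        return value, 0
    if lod is None:
        return 0.0, 1
    return lod / 2, 1

val, status = resolve_bdl(None, True, 0.008)  # half the 0.008 LOD, flagged
```

Because only explicitly flagged values are touched, unflagged low values are never silently rewritten.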
Before the MacroSheds processing system performs any mathematical transformation on raw data, uncertainty is attached to each record. Due to the scarcity of reported measurement or analytical precision/uncertainty, we have chosen not to propagate reported values. Instead, initial uncertainty for each domain-variable is determined by u = 10^(−p), where p is the precision of the variable's reported LOD, after conversion to MacroSheds standard units. For example, a LOD of 0.008 mg L⁻¹ has a precision of 3 (digits after the decimal), resulting in an initial uncertainty of 0.001 mg L⁻¹. For domains that do not report LODs, we set the initial uncertainty for each variable according to the minimum (coarsest) reported p across all domains that do report LODs. For some variables, we have no basis by which to infer initial uncertainty, so we report it as missing. The two exceptions are discharge and precipitation, both required for computing solute flux. For these, we set initial uncertainty to zero. Uncertainty is then propagated through all MacroSheds mathematical transformations via the errors package (Ucar et al. 2018). A table of all known detection limits can be found in our documentation on EDI (detection_limits.csv).
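The u = 10^(−p) rule depends only on the number of decimal places in the reported LOD. A small Python sketch (illustrative; the pipeline itself uses R) shows the idea, taking the LOD as a string so the reported precision, including trailing zeros, survives:

```python
from decimal import Decimal

def initial_uncertainty(lod: str) -> float:
    """Initial uncertainty u = 10^(-p), where p is the number of digits
    after the decimal point in the reported LOD (already converted to
    MacroSheds standard units)."""
    # For Decimal("0.008"), as_tuple().exponent is -3, i.e. -p.
    exponent = Decimal(lod).as_tuple().exponent
    return 10.0 ** exponent
```

So an LOD of "0.008" yields an uncertainty of 0.001, matching the worked example in the text.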

Temporal imputation and aggregation
We currently report all time-series data (not including temporally explicit spatial summary data) at a daily interval. The timestamp associated with each incoming record is floored to midnight (0 h, 0 min, 0 s), and series with a subdaily interval are aggregated across each 24-h span. Precipitation, which is reported in mm, is aggregated by sum, while discharge and chemistry are aggregated by mean. After aggregation, any implicit missing values are made explicit, so that there are no missing timestamps within a series. Linear interpolation is then used to fill gaps of no more than 3 d in each discharge series, and no more than 15 d in each stream chemistry series. Next-observation-carried-backward interpolation is used for precipitation chemistry series. Precipitation volume/depth series are rarely published with missing values during periods of gauge deployment, but when these are encountered, we use source metadata or direct contact to determine whether measured values represent multiday accumulation. If not, we fill gaps with zeros, indicating no precipitation; if so (we have not yet encountered this), we distribute measured precipitation values evenly across the preceding missing values. For precipitation and precipitation chemistry, gaps of up to 45 d are interpolated. In the case of solute flux series provided by primary sources, the maximum gap length we interpolate is 15 d. Gaps larger than the aforementioned maximum lengths retain their missing values, and no extrapolation is performed. Records interpolated by the MacroSheds processing system are given an ms_interp value of 1; otherwise 0. A future version of the MacroSheds dataset may include subdaily records where available.
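The floor-aggregate-interpolate sequence can be sketched in a few lines. This Python/pandas example illustrates the daily aggregation and gap-limited linear interpolation described above (the MacroSheds pipeline uses R; the function name and pandas usage here are illustrative):

```python
import pandas as pd

def to_daily(series: pd.Series, how: str, max_gap_days: int) -> pd.Series:
    """Floor timestamps to midnight and aggregate sub-daily records
    ("sum" for precipitation depth, "mean" for discharge and chemistry),
    then make implicit gaps explicit and linearly interpolate gaps of at
    most max_gap_days. Larger gaps stay missing; no extrapolation."""
    daily = series.groupby(series.index.floor("D")).agg(how)
    daily = daily.asfreq("D")  # insert explicit NaNs for missing dates
    return daily.interpolate("linear", limit=max_gap_days, limit_area="inside")
```

For example, a discharge series with observations on January 1 and January 5 and a 3-day interpolation limit would have January 2-4 filled linearly; a 4-day-or-longer gap would remain missing.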

Watershed attributes retrieval and processing
The MacroSheds dataset includes 185 watershed attributes: spatial summary statistics that may act as drivers of ecohydrological processes. Attributes are organized into six categories: vegetation, climate, terrain, parent material, landcover, and hydrology. Every spatial variable in the MacroSheds dataset has a two-letter prefix to indicate first the variable category, and second the data source. For example, Leaf Area Index (LAI) variables from the MODIS satellite have a prefix of "v" to indicate the vegetation category and "b" for MODIS, so the median LAI for a watershed in the MacroSheds dataset has the name "vb_lai_median." Watershed attribute prefix codes are catalogued in variable_category_codes_ws_attr.csv and variable_data_source_codes_ws_attr.csv on EDI.
Gridded products are summarized to watershed boundaries using one of two methods, based on where the source data product is held. For data accessible through Google Earth Engine (GEE), we used the R package "rgee" (Gorelick et al. 2017; Aybar 2021). First, watershed boundaries are uploaded to GEE and stored as an asset. Then, median and standard deviation values for each watershed at each reported time-step are summarized using the rgee function "reduceRegions." For products not housed on GEE, gridded data are processed locally using the "terra" package for R (Hijmans 2021). A list of gridded data products and their sources is given in Table 6.
Most watershed attributes included in the MacroSheds dataset are temporally explicit, with sampling/modeling intervals varying from daily to decadal. We provide all watershed attributes in their native (as reported by primary source) temporal intervals, and a subset of attributes as averages by site. We do not provide all watershed attributes for all sites, as some gridded products are only available for the contiguous United States.

Derivation of additional products
One of the core aims of the MacroSheds project is to enable engagement with continental-scale questions about whole-watershed solute and hydrologic flux. We do not yet publish stream or precipitation flux estimates, except for a few daily solute flux series that are provided by primary sources, but the next release of this dataset will include cumulative monthly and annual flux estimates for each site. For now, daily flux can be easily computed via the "macrosheds" R package.
Estimation of watershed solute influx and outflux requires information not consistently provided alongside the time-series data described above, namely watershed-mean precipitation and precipitation chemistry, and the watershed boundaries needed to compute them. Below we describe the derivation of these products.

Watershed delineation
For any watershed boundary not already published as a georeferenced spatial file, the MacroSheds processing system performs a delineation from the point of the stream gauge or sampling site (pour point). This process cannot be reliably automated for all pour points, due in part to imperfections in digital elevation models (DEMs), and in part to the fact that stream site locations are usually recorded from the nearby bank rather than the channel itself. Sometimes the watershed "found" by a delineation algorithm is actually a subset of, or adjacent to, the target watershed, and only visual inspection reveals the error. We rely on a semi-automated, interactive approach that delineates one or more candidate watersheds for each site, starting from one or more unique pour points. DEMs are retrieved using the "elevatr" package (Hollister et al. 2020) for R, and iteratively expanded any time a delineation reaches the DEM edge. Candidate watersheds are presented for visual inspection and topographic comparison via the package "mapview" (Appelhans et al. 2021). Hydrologic conditioning, pour point snapping, and delineation leverage the "whitebox" package (Wu 2021). If none of the candidates appears to represent the target watershed, the process can be conveniently repeated using updated parameters. For a detailed discussion of delineation parameters, see the "macrosheds" R package documentation.

Spatial interpolation of precipitation data
Each MacroSheds watershed is rasterized, or gridded, from the DEM used during delineation, or from one so retrieved. Precipitation chemistry is then imputed to each cell of the watershed raster by inverse squared-distance weighted interpolation, or IDW (Shepard 1968), using information from all precipitation gauges associated with the domain. Watershed-mean precipitation chemistry is then computed as the mean across all raster cells, separately for each solute and each day with data.
Due to the orographic effect in mountainous regions, precipitation depth at a given elevation can be estimated from a local, linear relationship (Hevesi et al. 1992). Daily precipitation depth in the MacroSheds dataset is computed as a weighted ensemble of two predictions, one generated by IDW (weight = 1) and the other from the empirical elevation-precipitation relationship among all domain-associated gauges (weight = coefficient of determination). On days for which fewer than three precipitation gauges are in operation, only the IDW prediction is used.
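The ensemble logic above can be sketched as follows. This is an illustrative Python implementation under stated assumptions (squared-distance IDW weights, ordinary least squares for the elevation-precipitation fit); the production code is R and may differ in detail:

```python
import numpy as np

def idw(xy_gauges, values, xy_cell, power=2):
    """Inverse squared-distance weighted estimate at one raster cell (Shepard 1968)."""
    xy_gauges = np.asarray(xy_gauges, float)
    values = np.asarray(values, float)
    d = np.hypot(*(xy_gauges - np.asarray(xy_cell, float)).T)
    if np.any(d == 0):                      # cell coincides with a gauge
        return float(values[np.argmin(d)])
    w = 1.0 / d ** power
    return float(np.sum(w * values) / np.sum(w))

def ensemble_precip(xy_gauges, elev_gauges, precip, xy_cell, elev_cell):
    """Weighted ensemble of IDW (weight 1) and the linear elevation-
    precipitation regression across gauges (weight = R^2). Falls back
    to IDW alone when fewer than three gauges report."""
    precip = np.asarray(precip, float)
    elev_gauges = np.asarray(elev_gauges, float)
    p_idw = idw(xy_gauges, precip, xy_cell)
    if len(precip) < 3:
        return p_idw
    slope, intercept = np.polyfit(elev_gauges, precip, 1)
    pred = slope * elev_gauges + intercept
    ss_res = np.sum((precip - pred) ** 2)
    ss_tot = np.sum((precip - np.mean(precip)) ** 2)
    r2 = 0.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    p_elev = slope * elev_cell + intercept
    return (1.0 * p_idw + r2 * p_elev) / (1.0 + r2)
```

With a perfect elevation-precipitation fit (R² = 1), the two predictions contribute equally; with a poor fit, the estimate leans toward IDW.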

Flux calculation
In Version 2 of the MacroSheds dataset, we will include cumulative monthly and annual solute flux estimates for each site. For now, we provide discharge, precipitation, and concentration data, and allow users to compute daily solute flux or volume-weighted concentration (VWC) via the "macrosheds" R package, using the ms_calc_flux function. Solute flux is computed according to Eqs. 1 and 2, where F_s and F_p are solute flux in stream water and precipitation, Q is discharge, P is mean precipitation depth over the watershed, C is solute concentration, and A is watershed area. F is reported in kg ha⁻¹ d⁻¹, and is calculated on each day for which Q or P, and corresponding C, are measured or interpolated. If ms_status or ms_interp are equal to 1 for either factor (i.e., if either record has been flagged as "questionable" or has been interpolated by the MacroSheds processing system), the resulting F inherits the same. VWC is computed according to Eq. 3, where N is the number of days in the aggregation period (e.g., a month or a year), C is solute concentration, and V is daily volume of streamflow or precipitation.
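The unit arithmetic behind daily flux and VWC can be illustrated with a short sketch. The unit conventions assumed here (discharge in L s⁻¹, concentration in mg L⁻¹, area in ha) are for illustration; the authoritative implementation is the ms_calc_flux function in the "macrosheds" R package:

```python
def daily_stream_flux(q_lps, conc_mgl, area_ha):
    """Daily stream solute flux in kg ha^-1 d^-1, assuming discharge in
    L s^-1, concentration in mg L^-1, and watershed area in ha.
    mg L^-1 * L s^-1 * 86400 s d^-1 = mg d^-1; / 1e6 -> kg d^-1;
    / area -> kg ha^-1 d^-1."""
    return q_lps * conc_mgl * 86400 / 1e6 / area_ha

def vwc(concs, volumes):
    """Volume-weighted concentration over an aggregation period:
    VWC = sum(C_i * V_i) / sum(V_i), for daily concentrations C_i and
    daily streamflow or precipitation volumes V_i."""
    return sum(c * v for c, v in zip(concs, volumes)) / sum(volumes)
```

For example, 100 L s⁻¹ of streamflow at 5 mg L⁻¹ from a 43.2 ha watershed exports 1 kg ha⁻¹ d⁻¹ of solute.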

Technical validation
Quality control (QC) practices in watershed ecosystem science are almost as diverse as watersheds themselves; however, there are common currents that run through every QC flag and comment. For example, if a sensor is buried in sediment for a week, that week's data should be omitted from analyses. Likewise with a sensor that is wildly malfunctioning or a water sample that is severely contaminated. Ultimately, when data are analyzed, each record is included, omitted, or included with caution. Thus, we have distilled each domain's QC flags and comments down to either "bad data," which is excised during processing, "questionable," or "clean." If a flag definition or comment makes any mention of insufficient sample volume, minor contamination, sensor drift, or some other condition that could, but does not necessarily, invalidate the corresponding record, we designate it "questionable," and set its ms_status value to 1. Only if flags and comments are absent, or specify no issues of potential concern, do we designate a record "clean," and set its ms_status to 0.
Almost every domain reports per-observation QC flags or comments of some kind. When these are restricted to a predetermined set that is well documented, parsing their meanings is straightforward. In some cases, flags and/or comments are free-form and quite difficult to catalog. Like other obstacles to data harmonization, QC flag proliferation can be resolved by using professionally managed data repositories, where metadata standards control flag values and definitions by design. In attribution_and_intellectual_rights_timeseries.csv, MacroSheds data users can find DOIs and source URLs of primary time-series data and metadata, where fully detailed flag information can be found. The MacroSheds processing system currently performs minimal QC beyond assimilating primary source flags and comments; however, we do filter each time-series record through a very loose "range check," intended to ensure that physically impossible values that happen to have evaded primary source QC are omitted from our aggregate dataset. Minimum and maximum reasonable values have been chosen so as not to risk any encroachment on the true natural range for each variable. A full list of these filter ranges can be found in range_check_limits.csv on EDI. Beyond range checking, we currently rely on the expertise of primary data providers to publish data that have been vetted. We intend to implement more sophisticated anomaly detection in a subsequent release of the MacroSheds dataset and portal.
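The range check itself is simple. A minimal sketch, assuming per-variable bounds like those published in range_check_limits.csv (the function name is illustrative):

```python
def range_check(value, var_min, var_max):
    """Omit physically impossible values that evaded primary source QC.

    Returns the value if it falls within the loose plausibility bounds for
    its variable, otherwise None (record dropped). Bounds are deliberately
    generous so the true natural range is never clipped."""
    if value is None or not (var_min <= value <= var_max):
        return None
    return value
```

For example, with bounds of 0 to 1e6 for a concentration variable, a negative concentration would be dropped while any plausible positive value passes through untouched.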

Data use and recommendations for reuse
The MacroSheds dataset is intended to provide analytical material for diverse investigations of watershed form and function. It is especially suited to comparing watersheds in terms of inputs and outputs of energy and material. In addition to precipitation, solute chemistry, and streamflow time-series data, it contains a comprehensive set of potentially predictive watershed attributes for each of 177 stream monitoring sites. A visual summary of relationships between watershed attributes and stream solute concentrations reveals strong correlations between land development and major anion concentration in streams, and between bedrock chemistry and inorganic ion concentration, possibly mediated by weathering (Fig. 7). These and other relationships may be used to classify watersheds. They may also be leveraged in the fitting of statistical models, or the training of machine learning algorithms to predict watershed solute outflows from watershed features. To our knowledge, the MacroSheds dataset is the most comprehensive analysis-ready collection of watershed biogeochemical data for North America. As of this writing, there is also a soon-to-be-published CAMELS-Chem dataset, which supplements 506 of the original CAMELS sites with measurements of 18 common stream chemistry constituents (Sterle et al. 2022).
The MacroSheds dataset can also be used as a small-watershed supplement to hydrological datasets like CAMELS and GAGES-II. Note that in addition to the original US-based CAMELS dataset, there are now equivalent products for Chile and several other regions. Because MacroSheds time-series data are currently represented at daily intervals, this dataset is not well suited to sub-daily analyses, such as those focused on stormflow dynamics. A future version may include time-series data at 15-min resolution.
To meet acceptable use requirements of the MacroSheds dataset, one must comply with the licensing and IR stipulations of all applicable primary sources. At minimum, this entails citing the MacroSheds dataset (Vlah et al. 2022), which is linked to source datasets through Ecological Metadata Language provenance. However, users must first check section 4.1 of data_use_agreements.docx, where our datasets are tiered according to the restrictiveness of source data licenses, as some sources require additional compliance. In any case, we provide tools that make citation/acknowledgement of all or a subset of MacroSheds data sources trivial, and we recommend acknowledgement/citation of source datasets even where attribution is not required. The first tool, for users of the "macrosheds" R package, is the ms_generate_attribution function, which produces a list of acknowledgements, citations, contact emails, and IR notifications based on a given data.frame in MacroSheds format. We also provide attribution_and_intellectual_rights_timeseries.csv and attribution_and_intellectual_rights_ws_attr.csv, which contain essentially the output of the ms_generate_attribution function, assuming the entire MacroSheds dataset is being used. The content of these documents can be copied and pasted, in whole or in part, depending on how much of the overall dataset is actually used.

Future directions for the MacroSheds project
Future developments will focus on the longevity of the MacroSheds project through targeted outreach and by better enabling community contribution. Outreach efforts will focus on encouraging data managers to leverage the FAIR-by-design standards of professionally managed data repositories like EDI and DataONE and to adopt open data licenses where possible. The long-term success of living, synthetic datasets like MacroSheds depends on consistency of source data and metadata from version to version, or at least predictability of changes (e.g., to file names). The continued growth of MacroSheds will be aided by community contribution, inspired by the success of StreamPULSE (streampulse.org) and other projects that add value to user-uploaded datasets, incentivizing contributions that eventually become public. Toward this end, we plan to adapt the MacroSheds data processing system into an interactive web application complete with QC, which will allow anyone with stream data to delineate and summarize watersheds, estimate flux, and so on, and contribute to the MacroSheds dataset after an optional embargo period. In the near term, the MacroSheds team will continue to identify and assimilate data from established watershed ecosystem studies. Globally, there are many networks of watershed observatories that we hope to coalesce into a more international MacroSheds dataset. These include ECN, among others.