A unified dataset of colocated sewage pollution, periphyton, and benthic macroinvertebrate community and food web structure from Lake Baikal (Siberia)

Sewage released from lakeside development can introduce nutrients and micropollutants that can restructure aquatic ecosystems. Lake Baikal, the world's most ancient, biodiverse, and voluminous freshwater lake, has been experiencing localized sewage pollution from lakeside settlements. Nearby increasing filamentous algal abundance suggests benthic communities are responding to localized pollution. We surveyed 40‐km of Lake Baikal's southwestern shoreline from 19 to 23 August 2015 for sewage indicators, including pharmaceuticals, personal care products, and microplastics, with colocated periphyton, macroinvertebrate, stable isotope, and fatty acid samplings. The data are structured in a tidy format (a tabular arrangement familiar to limnologists) to encourage reuse. Unique identifiers corresponding to sampling locations are retained throughout all data files to facilitate interoperability among the dataset's 150+ variables. For Lake Baikal studies, these data can support continued monitoring and research efforts. For global studies of lakes, these data can help characterize sewage prevalence and ecological consequences of anthropogenic disturbance across spatial scales.

Globally, sewage pollution is a common and often concentrated source of nitrogen and phosphorus inputs that can reshape aquatic ecosystems. Sewage inputs are often associated with increased primary production (Edmondson 1970; Moore et al. 2003), which can eventually lead to nuisance algal blooms (Hall et al. 1999; Lapointe et al. 2015). Even in instances where sewage pollution is mitigated, restoring systems can be complicated and necessitate system-specific (Jeppesen et al. 2005), long-term mitigation strategies (Hall et al. 1999; Tong et al. 2020). As such, effective sewage monitoring can require merging a suite of chemical, biological, and ecological data to synthesize locations and timing of inputs with associated shifts in ecological communities (Rosenberger et al. 2008; Hampton et al. 2011).
Definitively identifying sewage as the source of excess nutrients in a system can be challenging. Nutrients can originate from multiple sources, such as agriculture (Powers et al. 2016) or melting permafrost (Turetsky et al. 2000; Anisimov and Reneva 2006; Moore et al. 2009), which can obfuscate wastewater signals. Unlike nutrients, sewage-specific indicators, such as enhanced δ¹⁵N stable isotope signatures (Costanzo et al. 2001; Camilleri and Ozersky 2019), pharmaceuticals and personal care products (PPCPs) (Bendz et al. 2005; Rosi-Marshall and Royer 2012; Meyer et al. 2019), and microplastics (Barnes et al. 2009), can be highly specific to human wastewater. Accordingly, sewage-associated micropollutants have garnered global attention for their usefulness in identifying the presence and quantifying the magnitude of wastewater inputs. While indicators may accumulate differentially in certain taxa (Gartner et al. 2002; Green 2016; Vendel et al. 2017; Richmond et al. 2018), acutely dangerous concentrations are not common in most aquatic systems (Kolpin et al. 2002; Focazio et al. 2008; Yang et al. 2018). However, chronic exposure to microplastics and PPCPs at even minute concentrations (e.g., μg L⁻¹) can still disrupt ecological processes (Richmond et al. 2017). For example, oxazepam can increase feeding rate and decrease sociability of river perch (Brodin et al. 2013), and microplastics can release dissolved organic carbon, thereby altering microbial communities (Romera-Castillo et al. 2018). The pervasiveness and diversity of sewage-associated micropollutants, in tandem with their potency as ecologically disrupting compounds, necessitate investigation within and across systems, thereby enabling synthesis of how micropollutants alter ecosystems.
When assessing biological responses to increased nutrient loading, littoral benthic algal and macroinvertebrate communities often respond most markedly, as their physical proximity to the shoreline puts them in the path of sewage pollution entering the lake (Rosenberger et al. 2008; Hampton et al. 2011). Filamentous algae, for example, can quickly increase in abundance near sewage sources (Rosenberger et al. 2008; Hampton et al. 2011). As algal communities change, food webs can also restructure. For example, change in algal communities can alter the nutritional value of primary producers or cause changes in the relative abundance of different feeding groups (e.g., increased representation of detritivores). Among the suite of food quality metrics, availability of essential fatty acids (EFAs) offers a nuanced understanding of food quality as primary producers usually maintain consistent EFA signatures (Taipale et al. 2013) and consumers acquire EFAs by grazing (Dalsgaard et al. 2003) or trophic upgrading (Sargent and Falk-Petersen 1988; Dalsgaard et al. 2003).
Together, food web structure, community composition, and sewage indicator data can be powerful tools to assess biological impacts of sewage pollution. Despite their utility, these data are not often available for many limnological systems. PPCPs, for example, have historically been less measured in lake environments (Meyer et al. 2019). In instances where data are available, efficiently merging disparate data into a single, analytically friendly format can be challenging and sometimes require complex, computationally intensive workflows (Meyer et al. 2020).
To offer a template for harmonizing sewage indicator and biological data, we present a unified data product, which contains disparate data collected from 14 littoral and 3 pelagic sites at Lake Baikal from 19 through 23 August 2015 (Fig. 1). Located in Siberia, Lake Baikal is the oldest, most voluminous, and deepest freshwater lake in the world (Hampton et al. 2018). Lake Baikal also has the global distinction of being the most biodiverse lake, with the highest endemism (Moore et al. 2009). The lake is experiencing rapid warming associated with climate change, including a decrease in ice cover duration (Moore et al. 2009), and it exhibits offshore plankton community changes associated with warming (Katz et al. 2015; Izmest'eva et al. 2016). Less is known of the change occurring in the nearshore of Lake Baikal, where not only climatic changes (Swann et al. 2020) but also human activity may introduce nutrients that alter the environment. Nearshore change is particularly important to understand in Lake Baikal, since the majority of the lake's biodiversity and endemic species occur in the littoral zone (Kozhova and Izmest'eva 1998). While Lake Baikal's pelagic zone is generally ultra-oligotrophic (Yoshida et al. 2003; O'Donnell et al. 2017), littoral areas abutting lakeside settlements have recently shown distinct signs of eutrophication, such as increased abundance of filamentous green algae (Timoshkin et al. 2016; Volkova et al. 2018) as well as cyanobacteria (Bondarenko et al. 2021).
As a means of identifying sewage from small, concentrated lakeside towns and the associated ecological responses, we assembled a dataset consisting of over 150 variables collected at 14 littoral and 3 pelagic sampling sites. We structured the dataset in a tidy format, where each row is a sample, each column is a variable, and each CSV file is an observable unit, where more similar variables are contained within an individual file (Wickham 2014). Independent CSV files can be merged using unique locational identifiers as relational keys, enabling future researchers to customize analyses around a particular suite of variables. As a result of the dataset's interoperability, reproducibility, and extensive variable content, it is well poised for future reuse as supporting evidence of sewage pollution in Lake Baikal. Additionally, the data's flexibility and consistent structure enable it to be merged with similar datasets, so as to synthesize biological responses to sewage across systems and scales.
To our knowledge, no raw data on Lake Baikal macroinvertebrates, periphyton, or nearshore water quality are public in a machine-readable format, for any variable (i.e., abundance, fatty acid content, stable isotopes, nutrient, and pollutant concentration), and no georeferenced data on PPCPs or microplastics appear to be publicly available for any boreal, subarctic, or arctic lakes or rivers in Siberia. Thus, the dataset fills a substantial gap for future studies, providing a window into nearshore biotic assemblages and water quality in a unique, ancient ecosystem that holds 20% of the world's liquid surface water (Moore et al. 2009).

Data description
The final, replicate-level data products are available on the Environmental Data Initiative (EDI), where they can be freely accessed without potential barriers such as paywalls or account registrations. The final data are provided as 11 separate CSV files, each structured in a tabular format and containing a "site" column that can be used to merge tables. The repository also contains a compressed folder of R scripts (scripts.tar.gz), which were used in the main analysis of the dataset (Meyer et al. Under Review).
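As an illustration of how the shared "site" column can serve as a relational key, the following minimal Python sketch joins replicate-level rows to site-level metadata using only the standard library. The two in-memory CSV excerpts are hypothetical stand-ins for the real files: the "site", "replicate", and depth_m names follow the descriptions in this section, whereas nitrate_mg_per_l, the helper function names, and every value are assumptions made for illustration.

```python
import csv
import io

# Hypothetical excerpts of two files. The "site" and "replicate" columns and
# "depth_m" follow the text; "nitrate_mg_per_l" and all values are invented.
SITE_INFORMATION_CSV = """site,depth_m
BK-2,1.10
LI-1,1.25
"""

NUTRIENTS_CSV = """site,replicate,nitrate_mg_per_l
BK-2,1,0.05
BK-2,2,0.06
LI-1,1,0.09
KD-1,1,NA
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def left_join_on_site(sample_rows, site_rows):
    """Attach site-level columns to each replicate-level row via "site".

    Rows whose site has no match keep only their own columns.
    """
    site_lookup = {row["site"]: row for row in site_rows}
    joined = []
    for row in sample_rows:
        site_columns = site_lookup.get(row["site"], {})
        # Sample-level values win if a column name appears in both tables.
        joined.append({**site_columns, **row})
    return joined


joined_rows = left_join_on_site(
    read_rows(NUTRIENTS_CSV), read_rows(SITE_INFORMATION_CSV)
)
```

Values are kept as strings so that the NA marker for missing data survives the join unchanged; converting to numbers is left to the analysis step.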
site_information.csv This file contains metadata for each of the pelagic and littoral sampling locations. Missing data are assigned as NA.
year Year sampling occurred.
month
mid_temp_celsius Temperature of water midway (i.e., depth_m/2) between surface and bottom at sampling location in Celsius.
bottom_temp_celsius Temperature of water near sediment at sampling location in Celsius.
comments Notes in the field describing sampling conditions.
shore_photo Whether or not photos of the shoreline were taken. Photos are available on the project's Open Science Framework (OSF) portal (Meyer et al. 2015).
substrate_photo Whether or not photos of the substrate were taken.
sponges Whether or not sponges were present at a sampling location.
brandtia Whether or not Brandtia spp. (endemic amphipod species) were present at a sampling location.
distance_weighted_population_metrics.csv This file contains inverse distance weighted (IDW), census-based human population data for each sampled location. Although the majority of sites do not have adjacent shoreline human developments, we calculated IDW population for each sampling location. IDW population is a generalized representation of the size of and proximity to a sampling location's neighboring human settlements. As these population estimates are based on census data, they reflect static populations and do not account for seasonal population deviations from tourism. A full description of the methods used to calculate IDW population can be found in the companion manuscript Meyer et al. (Under Review).
site Unique alphanumeric identifier for a sampling location.
distance_weighted_population IDW population for a given sampling location, estimated as number of people. Because this interpolation is a function of the size of and proximity to neighboring developed sites, values can be non-integer.
nutrients.csv This file contains nutrient concentrations for each of the associated sampling locations. Samples were collected at a depth of 0.75 m. Nutrient samples were not filtered prior to analysis, meaning that nitrogen concentrations have the potential to include intracellular nitrogen. Concentrations of nitrogenous species may therefore be overestimated. Minimal detection limits were estimated as 0.01 mg L⁻¹ for nitrate, 0.005 mg L⁻¹ for ammonium, and 0.04 mg L⁻¹ for phosphorus.
site Unique alphanumeric identifier for a sampling location.
replicate Replicate for a given sampling location.
comments Notes from the observer.
invertebrates.csv This file contains abundance for benthic macroinvertebrates collected at each of the 14 littoral sampling locations. Only amphipod taxa were identified to species.
site Unique alphanumeric identifier for a sampling location.
replicate Replicate for sampling location. While three replicates were collected in the field, some samples were poorly preserved, and invertebrates were not enumerated so as to prevent potential errors.
Acroloxidae
total_lipid.csv This file contains gravimetry data for each fatty acid sample.
site Unique alphanumeric identifier for a sampling location.
Genus Genus of the analyzed organism.
Species Species of the analyzed organism. When an organism was identified solely to genus, the Species value is NA.
total_lipid_mg_per_g Total amount of lipids in a sample in milligrams of lipid per gram of tissue.
deviation Samples were weighed three times and standard deviation in measurement was calculated. All values are reported in milligrams of lipid per gram of tissue.
comments Quality flag column. Two samples spilled during fatty acid extraction. These samples are flagged as such.

Site information
The vast majority of Lake Baikal's 2100-km shoreline lacks lakeside development (Moore et al. 2009; Timoshkin et al. 2016). Our sample collection focused on a 40-km section of Lake Baikal's southwestern shoreline, which included three settlements of different sizes (Fig. 1), during a time of the year when tourism and summertime biological succession were likely at their annual peaks. Littoral locations were chosen to capture a range of sites with varying degrees of adjacent shoreline development, from "developed" (along the waterfront of human settlements) to "undeveloped" (no adjacent human settlements and complete forest cover; Fig. 1). The largest settlement, Listvyanka, is primarily a tourist town of approximately 2000 permanent residents, although tourism can contribute significantly to the town's population with approximately 1.2 million annual visitors (Interfax-Tourism 2018). The other two settlements are the villages Bolshie Koty and Bolshoe Goloustnoe, which have approximately 80 and 600 permanent residents, respectively. Bolshie Koty is home to two field research stations and several small tourist accommodations. Bolshoe Goloustnoe has several hotels and tourist camps.
To assess disturbance gradients and ecological responses from littoral-to-pelagic zones and laterally along the shoreline, our transect consisted of 17 sampling sites that were meant to characterize differences along these gradients. Pelagic sites were located 2 to 5 km offshore from each of the developed sites in water depths of 900-1300 m (Fig. 1; Table 1). All littoral sites were sampled at approximately the same depth (max depth of ~1.25 m) at a distance of 8.90-20.75 m from shore (Table 1), which allowed us to collect samples without the need for SCUBA but precluded us from sampling deeper littoral environments. Due to this constraint, only littoral sites contain macroinvertebrate and algal samples. Otherwise, data are available for both littoral and pelagic sites. At each site, air temperature was measured with a mercury thermometer, and photographs were taken of the substrate and the shoreline. Visual inspection of substrate photographs suggested that substrate was consistent among littoral sites and was generally characterized by large, oblate rocks and gravel.

IDW population calculation for each sampling location
We recognized that sewage indicator concentrations at each sampling location may be related to a sampling location's spatial position relative to both the size and proximity of neighboring developed sites. Therefore, we created the IDW population metric to compress, into a single metric, information about human population size, density, and location along the shoreline as well as distance between developed sites and sampling locations.
Our workflow for calculating IDW population required five steps. First, we traced polygons of each lakeside development's perimeter and line geometries of each development's shoreline from satellite imagery in Google Earth. Polygons were traced for the entire area of visible development. Similarly, shoreline traces only reflected shoreline length for which there was visible development. Second, where I is the IDW population at sampling location j, P is the population at each of the three developed sites Listvyanka (LI), Bolshie Koty (BK), and Bolshoe Goloustnoe (BGO), A is the area of a developed site in km², L is the shoreline length at a developed site in km, and D is the distance from sampling location j to each developed site's centroid in km. As these population estimates are based on census data, they reflect current, static populations and do not account for seasonal population swings from tourism.
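To make the roles of P, A, L, and D concrete, the sketch below combines them in one plausible way: population density (P / A) is scaled by developed shoreline length (L) and divided by distance (D), then summed over the three developed sites. This functional form is an assumption made for illustration, not the authors' equation, which is described in the companion manuscript. The function name, the dictionary keys, the area, shoreline, and distance values are all hypothetical; only the approximate resident counts come from the site description.

```python
def idw_population(settlements, distances_km):
    """Sum a distance-weighted population term over developed sites.

    ASSUMED form (not taken from the source): (P / A) * L / D per site.
    """
    total = 0.0
    for name, settlement in settlements.items():
        distance = distances_km[name]
        if distance <= 0:
            raise ValueError("distance must be positive")
        density = settlement["population"] / settlement["area_km2"]
        total += density * settlement["shoreline_km"] / distance
    return total


# Resident counts are the approximate values given in the text; the areas
# and shoreline lengths are invented placeholders, not traced geometries.
SETTLEMENTS = {
    "LI": {"population": 2000, "area_km2": 2.0, "shoreline_km": 4.0},
    "BK": {"population": 80, "area_km2": 0.5, "shoreline_km": 1.0},
    "BGO": {"population": 600, "area_km2": 1.5, "shoreline_km": 2.0},
}

# Hypothetical distances from one sampling location to each centroid.
example_distances_km = {"LI": 10.0, "BK": 5.0, "BGO": 20.0}
example_idw = idw_population(SETTLEMENTS, example_distances_km)
```

Whatever the exact weighting, a sum of ratios like this explains why the resulting values are generally not whole numbers of people.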

Nutrients
Water samples for nutrient analyses were collected in 150 mL glass jars that had been washed with phosphate-free soap and rinsed three times with water from the sampling location. Samples were collected at a depth of approximately 0.75 m in duplicate and immediately frozen at −20 °C until processing at the A. P. Vinogradov Institute of Geochemistry (Siberian Branch of the Russian Academy of Sciences, Irkutsk). Samples were not filtered prior to freezing, meaning that nitrogen and ammonium concentrations may include intracellular nitrogen and overestimate dissolved nitrogenous forms in the water column.
For ammonium (RD:52.24.383-2018 2018) and nitrate (RD:52.24.380-2017 2018) concentrations, samples were analyzed with a spectrophotometer (SF-26). GSO 7258-96 and 7259-96 standards of 1 g L⁻¹ stock concentration were used to calibrate nitrate and ammonium measurements, respectively. When nitrate and ammonium analyses could be performed within 24 h after thawing, samples were kept at 2-8 °C without addition of preservative agents. When nitrate analyses were performed between 24 and 48 h after thawing, samples were kept at 3-5 °C and chloroform was added as a preservative at a ratio of 2-4 mL per 1 L of sample volume. When ammonium analyses were performed within 24-96 h after thawing, samples were kept at 3-5 °C and ~10% sulfuric acid solution was added as a preservative. Phosphorus concentration was measured with a spectrophotometer (SF-46) following the addition of persulfate (GOST:18309-2014 2016). When possible, samples were analyzed within 3 h of thawing. When analyses could not be performed within 3 h, samples were kept at 3-5 °C and chloroform was added as a preservative at a ratio of 2-4 mL per 1 L of sample volume. Minimal detection limits were estimated as 0.01 mg L⁻¹ for nitrate, 0.005 mg L⁻¹ for ammonium, and 0.04 mg L⁻¹ for phosphorus. Concentrations are reported in mg L⁻¹ of each analyte.
For comparable methods in English, we recommend data users consult International Standards Organization (ISO) (1984) and ISO (2004) as analogs. Copies of the Russian-language methods are included in the OSF portal within the directory "Nearshore_sampling/methods."
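Data users may want to screen nutrient values against the detection limits reported above before analysis. The following Python sketch shows one way to do so. The three limits are taken from the text; the function name, the analyte keys, and the three-way labeling scheme are assumptions for illustration, and the NA handling follows the dataset's stated convention for missing data.

```python
# Minimal detection limits reported in the text, in mg per liter.
DETECTION_LIMITS_MG_PER_L = {
    "nitrate": 0.01,
    "ammonium": 0.005,
    "phosphorus": 0.04,
}


def flag_concentration(analyte, value):
    """Label one value as "missing", "below_detection", or "detected".

    `value` may be a number or a string as read from a CSV file; the
    string "NA" (the dataset's missing-data marker) and None are missing.
    """
    if value is None or value == "NA":
        return "missing"
    limit = DETECTION_LIMITS_MG_PER_L[analyte]
    if float(value) < limit:
        return "below_detection"
    return "detected"
```

How values below detection should then be treated (dropped, substituted, or modeled as censored) is an analysis decision that this sketch deliberately leaves open.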

Chlorophyll a
Water samples were collected in 1.5-liter plastic bottles from a depth of approximately 0.75 m. Although we did not note the plastic bottles' materials in the field, all bottles for Chl a measurement were cleaned beverage bottles, likely made of polyethylene terephthalate. Within 12 h of collection, three subsamples (up to 150 mL each) were filtered through 25-mm diameter, 0.2 μm pore size nitrocellulose filters. Filters were then placed in 35-mm petri dishes, which were wrapped with aluminum foil to prevent light exposure, and frozen in the dark until processing.
Chlorophyll samples were processed in a manner similar to that of Welschmeyer (1994). Nitrocellulose filters were ground in 10 mL of 90% HPLC-grade acetone, in which chlorophyll extraction was allowed to proceed overnight. Chlorophyll extract was then analyzed using a Turner Designs 10-AU fluorometer (Turner Designs) with an excitation wavelength of 436 nm and an emission wavelength of 680 nm. A 10-AU Secondary Solid Standard (P/N 10-AU-904) was used to calibrate the fluorometer prior to samples being processed. Blank samples registered a raw fluorescence of approximately 0.1 FL units. Concentrations were calculated using Eq. 2:
\[ \text{Chlorophyll concentration} = (\text{extract reading} - \text{blank reading}) \times \frac{\text{mL of extract}}{\text{mL of filtered sample}} \tag{2} \]
Detection limits are estimated to be approximately 0.02 mg L⁻¹. Concentrations are reported as mg L⁻¹.
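The chlorophyll calculation is simple enough to express as a short function. The Python sketch below is a direct transcription of the formula given above; the function and parameter names are hypothetical, and it assumes the fluorometer readings have already been calibrated to concentration units in the extract.

```python
def chlorophyll_concentration(extract_reading, blank_reading,
                              extract_ml, filtered_ml):
    """Blank-corrected reading scaled by the extract-to-sample volume ratio."""
    if filtered_ml <= 0:
        raise ValueError("filtered sample volume must be positive")
    return (extract_reading - blank_reading) * extract_ml / filtered_ml


# Illustrative inputs only: 10 mL of acetone extract and a 150 mL filtered
# subsample match the volumes in the text; the readings are invented.
example_concentration = chlorophyll_concentration(
    extract_reading=3.1, blank_reading=0.1, extract_ml=10.0, filtered_ml=150.0
)
```

Because subsample volumes could be smaller than 150 mL, the filtered volume should be supplied per subsample rather than assumed constant.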

Pharmaceuticals and personal care products
Water samples for PPCP analysis were collected in 250 mL amber glass bottles that were rinsed with either methanol or acetone and then three times with sample water prior to collections. Following collection, samples were refrigerated and kept in the dark until solid phase extraction (SPE).
Within 12 h of collection, samples were filtered directly from the amber glass bottle using an in-line Teflon filter holder with a glass microfiber filter (GMF; 1.0 μm pore size, Whatman Grade 934-AH) in tandem with an SPE cartridge (200 mg HLB, Waters Corporation) connected to a 1-liter vacuum flask. Lab personnel wore gloves and face masks to minimize contamination. Prior to filtration, SPE cartridges were primed with at least 5 mL of either methanol or acetone and then washed with at least 5 mL of sample water. Rate of extraction was maintained at approximately 1 drop per second. Extraction proceeded until water could no longer pass through the SPE cartridge or until all collected water was filtered. Cartridges were stored in Whirl-Pak bags at −20 °C until analysis for 18 PPCP residues using liquid chromatography tandem mass spectrometry (LC-MS-MS) following methods of Lee et al. (2016) and D'Alessio et al. (2018) with labeled internal standards (¹³C₃-caffeine, methamphetamine-d8, MDMA-d8, morphine-d3, and ¹³C₆-sulfamethazine). Detection limits are estimated to be 0.001 μg L⁻¹ based on a 500 mL sample volume. Concentrations are reported in μg L⁻¹.

Microplastics
At each location, samples were collected at a depth of approximately 0.75 m in triplicate using 1.5-liter clear plastic bottles that were washed thoroughly with sample water before each collection. Samples were collected by hand for each littoral site and with a metal bucket from aboard the ship for pelagic sites.
For processing, each sample was vacuum filtered on to a 47-mm diameter GF/F filter. During filtration, aluminum foil was used to cover the filtration funnel to prevent contamination from airborne microplastic particles. After filtration, filters were dried under vacuum pressure and then stored in 50-mm petri dishes. Following filtration of all three replicates, the filtrate was collected and then re-filtered through a GF/F filter as a control for contamination from the plastic vacuum funnel or potentially airborne microplastics.
Microplastic counting involved visual inspection of the entire GF/F in a similar manner to methods described in Hanvey et al. (2017). Visual enumeration was conducted under a stereo microscope with ~100× magnification, and microplastics were classified into one of three categories: fibers, fragments, or beads. For all categories, plastics were defined as observed objects with apparent artificial colors, so as to not enumerate plastics potentially contributed from the sampling bottle itself. Fibers were defined as smooth, long plastics with consistent diameters. Fragments were defined as plastics with irregularly sharp or jagged edges. Beads were defined as spherical plastics. Although we did not measure microplastic size, this technique likely allowed us to reliably quantify microplastics as small as ~300 μm (Hanvey et al. 2017). During enumeration, GF/Fs remained covered in the petri dish to minimize potential for contamination from the air.
Since the time of our field sampling, evidence has accumulated that our methods likely dramatically underestimated microplastic abundance (Brandon et al. 2020). Recent investigations of microplastics in Lake Baikal near Bolshie Koty (BK) used analogous methods and measured similar microplastic concentrations (Karnaukhov et al. 2020). Future studies aiming to use these data for comparison or to supplement potential data gaps should consider the minimum microplastic size that could be reliably detected by the method, so as to ensure data are comparable across methods.

Periphyton collection and abundance estimates
At each littoral site, we haphazardly selected three rocks representative of local substrate. A plastic stencil was used to define a surface area of each rock from which we scraped a standardized 14.5 cm² patch of periphyton. Samples were preserved with Lugol's solution and stored in plastic scintillation vials. Additional periphyton was collected in composite from each site for fatty acid and stable isotope analysis.
Periphyton taxonomic identification and enumeration was performed by subsampling 10 μL aliquots from each preserved sample, containing approximately 10-15 mL of preserved periphyton. For all 10 μL aliquots, cells, filaments, and colonies were counted, for the entire subsample, until at least 300 cells were identified for a given sampling replicate. If the first aliquot contained less than 300 cells, we counted additional subsamples until we reached at least 300 cells in total. In instances when 300 cells were counted before finishing a subsample, we still counted the entire aliquot. Taxa were classified into broad categories consistent with Baikal algal taxonomy (Izhboldina 2007), using coarse groupings to capture general patterns in relative algal abundance. As a result, algal groups consisted of diatoms, Ulothrix spp., Spirogyra spp., and the green algal Order Tetrasporales.
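The counting rule described above (always finish the aliquot in progress, and stop once the running total reaches at least 300 cells) can be sketched in Python as follows. The function names are hypothetical. The second function, which scales a count to the scraped 14.5 cm² area, is an assumption about how counts might be converted to areal densities and is not a procedure stated in the text; the 10 μL aliquot volume and 14.5 cm² area are the values reported above, while the preserved sample volume must be supplied per sample.

```python
def count_until_threshold(aliquot_counts, threshold=300):
    """Sum whole-aliquot counts until the running total reaches threshold.

    Each aliquot is counted in full, so the total can exceed the threshold.
    Returns (total_cells, aliquots_used).
    """
    total = 0
    used = 0
    for cells in aliquot_counts:
        total += cells
        used += 1
        if total >= threshold:
            break
    return total, used


def scale_to_area(cells_counted, aliquots_counted, sample_volume_ml,
                  aliquot_volume_ml=0.010, scraped_area_cm2=14.5):
    """ASSUMED conversion: extrapolate counted cells to cells per cm2."""
    counted_volume_ml = aliquots_counted * aliquot_volume_ml
    cells_in_sample = cells_counted * sample_volume_ml / counted_volume_ml
    return cells_in_sample / scraped_area_cm2
```

For example, aliquots containing 120, 150, and 90 cells would all be counted, giving 360 cells from three aliquots, and a fourth aliquot would not be needed.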
Separate periphyton samples for stable isotope and fatty acid analyses were also collected. Instead of preserving samples in Lugol's solution, these samples were immediately frozen at −20 °C at the field station. The samples were later transferred to the lab in the United States via a Dewar flask with dry ice.

Benthic macroinvertebrate collection and abundance estimates
Three kick-net samples were collected for assessment of benthic community composition and abundance. Using a D-net, we collected macroinvertebrates by flipping over 1-3 rocks and then sweeping five times in a left-to-right motion across approximately 1 m. After the series of sweeps, the catch was rinsed into a plastic bucket. For each replicate, bucket contents were concentrated using a 64-μm mesh and placed in glass jars with 40% ethanol (vodka; the only preservative available to us at the time) for preservation and refrigerated at 4 °C aboard the research vessel. The 40% ethanol preservative was replaced with ~80% ethanol upon return to the lab within 24-48 h, and samples were stored at ~4 °C.
Invertebrate taxonomic identification and enumeration were performed under a stereo microscope. All adult amphipods were identified to species according to Takhteev and Didorenko (2015), whereas juveniles were identified to genus. Mollusks were identified to the family level according to Sitnikova (2012). Leeches were enumerated at the subclass level, but were likely all from the family Glossiphoniidae based on size, depth of sampling locations, and invertebrate communities sampled (Kaygorodova 2012). Caddisflies were enumerated at the order level, although Baikal contains over 14 species of caddisfly (Valuyskiy et al. 2020). Flatworms were enumerated at the phylum level. All isopods enumerated were from the family Asellidae. In addition to the limited time available to spend with Baikal taxonomists during our field campaign, our choice of taxonomic resolution reflected the relative abundance of each taxonomic group: amphipods were the most abundant taxa and flatworms were among the least abundant taxa across all sites. All samples contained oligochaetes and polychaetes, but due to poor preservation, these taxa were not counted. Six of the 42 samples collected were not well preserved and were excluded from further analyses in order to reduce errors in identification. KD-1 and LI-1 were the only sites with one sample counted. BK-2 and KD-2 each had two samples counted.
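Because six samples were excluded, the number of replicates differs among sites, which matters when computing site-level means. A minimal Python sketch for tallying replicates per site and listing sites with fewer than the three collected is given below. The function names are hypothetical, and the example rows are invented to mirror the pattern described above; "BGO-1" is a made-up site identifier used only to show a site with all three replicates.

```python
from collections import Counter


def replicates_per_site(rows):
    """Count how many replicate rows each site contributes."""
    return Counter(row["site"] for row in rows)


def sites_below_expected(rows, expected=3):
    """Return {site: n} for sites with fewer than `expected` replicates.

    Sites with no rows at all cannot appear here, so the result should be
    compared against the full site list when completeness matters.
    """
    counts = replicates_per_site(rows)
    return {site: n for site, n in counts.items() if n < expected}


# Invented rows mirroring the text: KD-1 has one counted sample and BK-2
# has two; "BGO-1" is a hypothetical site with all three.
EXAMPLE_ROWS = [
    {"site": "KD-1", "replicate": "1"},
    {"site": "BK-2", "replicate": "1"},
    {"site": "BK-2", "replicate": "2"},
    {"site": "BGO-1", "replicate": "1"},
    {"site": "BGO-1", "replicate": "2"},
    {"site": "BGO-1", "replicate": "3"},
]

incomplete = sites_below_expected(EXAMPLE_ROWS)
```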
Separate collections were conducted for invertebrate fatty acid and stable isotope analyses. Invertebrates were collected using a D-net and by hand. Organisms collected by hand included amphipod species that were observed from the community composition D-net collections but not readily observed in the stable isotope and fatty acid D-net collections. Collected organisms were live-sorted, identified to species, and then frozen at −20 °C at the field station. The samples were later transferred to the lab in the United States via a Dewar flask with dry ice.
Due to some samples warming in transit, we only processed samples that were completely frozen upon arrival in the United States. Given the potential for fatty acids to highlight more subtle, multivariate ecological responses along our transect than stable isotopes, we prioritized both periphyton and macroinvertebrate fatty acid analyses over stable isotope analyses. As such, there is an imbalance across species' abundance, stable isotope, and fatty acid data. Dominant taxa, such as Eulimnogammarus verrucosus and Eulimnogammarus vittatus, nevertheless have paired data throughout the transect, whereas less common taxa, such as Brandtia spp., only have abundance estimates. Table 2 summarizes data available for each variable and taxonomic group.

Stable isotope analysis
Following freeze-drying, measurements of periphyton and macroinvertebrate δ¹⁵N and δ¹³C values were performed on an elemental analyzer-isotope ratio mass spectrometer (EA-IRMS; Finnigan DELTAplus XP, Thermo Scientific) at the Large Lakes Observatory, University of Minnesota Duluth. Stable isotope values were calibrated against certified reference materials including L-glutamic acid (NIST SRM 8574), low organic soil and sorghum flour (standards B-2153 and B-2159 from Elemental Micro-analysis Ltd.) and in-house standards (acetanilide and caffeine).
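For readers less familiar with the notation, δ values express the isotope ratio of a sample relative to that of a reference standard, in per mil (‰). The expression below is the conventional textbook definition rather than something stated in the dataset's documentation; the symbols R_sample and R_standard are introduced here for illustration, and the specific reference scales used by the authors are not given in this section.

```latex
\[
\delta^{15}\mathrm{N} = \left( \frac{R_{\mathrm{sample}}}{R_{\mathrm{standard}}} - 1 \right) \times 1000,
\qquad R = \frac{{}^{15}\mathrm{N}}{{}^{14}\mathrm{N}}
\]
\[
\delta^{13}\mathrm{C} = \left( \frac{R_{\mathrm{sample}}}{R_{\mathrm{standard}}} - 1 \right) \times 1000,
\qquad R = \frac{{}^{13}\mathrm{C}}{{}^{12}\mathrm{C}}
\]
```

Under this definition, a positive δ value indicates that the sample is enriched in the heavy isotope relative to the standard, and a negative value indicates depletion.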

Fatty acid analysis
Following freeze-drying, samples were transferred to 10 mL glass centrifuge vials, and 2 mL of 100% chloroform was added to each under nitrogen gas. Samples were allowed to sit in chloroform overnight at −80 °C. Fatty acid extractions generally involved three phases: (1) 100% chloroform extraction, (2) chloroform-methanol extraction, and (3) fatty acid methylation. Fatty acid extraction methods were adapted from Schram et al. (2018).
After overnight chloroform extraction, samples underwent a chloroform-methanol extraction three times. To each sample, we added 1 mL cooled 100% methanol, 1 mL chloroform : methanol solution (2 : 1), and 0.8 mL 0.9% NaCl solution. Samples were inverted three times and sonicated on ice for 10 min. Next, samples were vortexed for 1 min and centrifuged for 5 min (3000 rpm) at 4 °C. Using a double pipette technique, the lower organic layer was removed and kept under nitrogen. After the third extraction, samples were evaporated under nitrogen flow, resuspended in 1.5 mL chloroform, and stored at −20 °C overnight.
Once resuspended in chloroform, 1 mL of chloroform extract was transferred to a glass centrifuge tube with a glass syringe, along with an internal standard of 4 μL of 19-carbon fatty acid. Samples were then evaporated under nitrogen, and 1 mL of toluene and 2 mL of 1% sulfuric acid-methanol were added. The vial was closed under nitrogen gas and then incubated in a 50 °C water bath for 16 h. After incubation, samples were removed from the bath, allowed to reach room temperature, and stored on ice. Next, we performed a potassium carbonate-hexane extraction twice. To each sample, we added 2 mL of 2% potassium bicarbonate and 5 mL of 100% hexane, inverting the capped vial so as to mix the solution. Samples were centrifuged for 3 min (1500 rpm) at 4 °C. The upper hexane layer was then removed and placed in a vial to evaporate under nitrogen flow. Once the sample had almost evaporated, 1 mL of 100% hexane was added, and the sample was stored in a glass amber autosampler vial for GC/MS quantification. GC/MS quantification was performed with a Shimadzu QP2020 GC/MS following Schram et al. (2018). As part of our peak quantification protocol, we quantified and identified every lipid compound that appeared in the chromatogram. Each sample contained peaks that were associated with known fatty acids, and among the 59 fatty acids contained in our dataset, few fatty acids were completely absent from a sample. Consequently, it is difficult for us to definitively ascribe a minimal detection limit to this analysis, but based on standards used, we estimate that this procedure had a minimal detection limit of 1 ng mL⁻¹.
Table 2. Summary table of algal and macroinvertebrate data within the dataset. Although the fatty acid data include Hyalella spp., these specimens were likely misidentified in the field before processing. For consistency, and to detail the breadth of fatty acid profiles among Baikal's littoral amphipods, we have included them in the dataset, but caution should be taken when treating these fatty acids as representative of Hyalella spp.
Following methylation, remaining extracts were assessed for total lipid mass. Remaining sample extracts (~0.5 mL) were allowed to evaporate to dryness under a fume hood overnight. Dried samples were then left in a weight room to acclimatize for 30-60 min and then massed within the scintillation vials. Each sample was massed three times to calculate an average lipid mass and to assess deviation among measurements. Lipid gravimetry is reported as mg of lipids per g of dry-weight tissue.
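As an illustration only, the gravimetric calculation described above can be sketched in a few lines of Python. This is not the authors' code (their scripts are written in R). The function name, the tare correction for the empty scintillation vial, and all numeric values are assumptions made for the example; the text states only that each sample was massed three times and that results are reported as mg of lipids per g of dry-weight tissue.

```python
from statistics import mean, stdev


def lipid_gravimetry(weighings_mg, tare_mg, dry_tissue_g):
    """Return (mg lipid per g dry tissue, SD of the repeated lipid masses in mg).

    weighings_mg : repeated masses of the vial plus dried lipid residue (mg)
    tare_mg      : mass of the empty vial (mg); an assumed correction step
    dry_tissue_g : dry mass of the extracted tissue (g)
    """
    if len(weighings_mg) < 2:
        raise ValueError("at least two weighings are needed to estimate deviation")
    if dry_tissue_g <= 0:
        raise ValueError("dry tissue mass must be positive")
    # Subtract the vial mass from each weighing to isolate the lipid residue.
    lipid_masses = [w - tare_mg for w in weighings_mg]
    # Average the repeated measurements, then normalize by dry tissue mass.
    return mean(lipid_masses) / dry_tissue_g, stdev(lipid_masses)


# Hypothetical sample: three weighings of one vial, values invented for illustration.
content, deviation = lipid_gravimetry(
    [5012.0, 5013.0, 5014.0], tare_mg=5000.0, dry_tissue_g=0.1
)
```

The standard deviation is returned alongside the mean so that samples with unstable weighings can be flagged before further analysis.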

Technical validation
The dataset underwent three main validation procedures: taxonomic, analytical, and reproducibility validation.
For taxonomic validation, all phylogenetic groupings were based on the most recent identification keys. Amphipods were identified according to Takhteev & Didorenko (2015). Mollusks were identified according to Sitnikova (2012). Algal taxa were identified according to Izhboldina (2007). For consistency, all taxa were identified by one person (Michael F. Meyer), who was trained by experts in Baikal algal and macroinvertebrate taxonomy.
For analytical validation, internal standards were used for all mass spectrometry analyses. PPCP analyses involved labeled internal standards (¹³C₃-caffeine, methamphetamine-d8, MDMA-d8, morphine-d3, and ¹³C₆-sulfamethazine). Stable isotope values were calibrated against certified reference materials including L-glutamic acid (NIST SRM 8574), low organic soil and sorghum flour (standards B-2153 and B-2159 from Elemental Microanalysis Ltd.), and in-house standards (acetanilide and caffeine). Replicate analyses of external standards showed a mean standard deviation of 0.06‰ and 0.09‰ for δ¹³C and δ¹⁵N, respectively. Finally, fatty acid estimations used an internal 19 : 0 standard to assess oxidation of fatty acids during extraction, methylation, and quantification.
For data reproducibility, data aggregation and harmonization procedures were conducted in the R statistical environment (R Core Team 2019), using the tidyverse (Wickham et al. 2019) packages. As part of the data aggregation, an initial cleaning script (00_disaggregated_data_cleaning.R) removed incorrect spellings, erroneous data values, and inconsistent column names from raw data. This step created the standardized CSV files detailed above, which are available on the EDI repository (this study). Raw data files are available on the project's OSF portal (Meyer et al. 2015) but are not included in the EDI repository to prevent confusion or incorrect usage. Data hosted on EDI are at the replicate-level but can be aggregated to the sampling-site-level using script "01_data_cleaning.R." In addition to aggregation scripts, six R scripts used for analyses in Meyer et al. (Under Review) are also available on the EDI repository within the compressed entity "scripts.tar.gz." All R code for data aggregation was written by one person (Michael F. Meyer) and then independently reviewed by two others (Matthew R. Brousil and Kara H. Woo) to confirm that the code performed as intended, was well documented, and was completely annotated.
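The replicate-to-site aggregation mentioned above can be illustrated with a minimal Python sketch. It is not a translation of "01_data_cleaning.R"; the choice of the arithmetic mean as the summary statistic, the function name, and the example column "caffeine" are assumptions for illustration. Only the "site" identifier column is taken from the dataset description.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_site(rows, value_keys, site_key="site"):
    """Collapse replicate-level rows (dicts) to one row per sampling site.

    Each value column is averaged across replicates; empty or missing
    entries are skipped. The mean is an assumed summary statistic.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[site_key]].append(row)

    site_rows = []
    for site in sorted(grouped):
        record = {site_key: site, "n_replicates": len(grouped[site])}
        for key in value_keys:
            values = [
                float(r[key]) for r in grouped[site] if r.get(key) not in ("", None)
            ]
            # A site with no usable replicate values is kept, with None as its value.
            record[key] = mean(values) if values else None
        site_rows.append(record)
    return site_rows


# Hypothetical replicate-level records, shaped like rows read with csv.DictReader.
replicates = [
    {"site": "A", "caffeine": "1.0"},
    {"site": "A", "caffeine": "3.0"},
    {"site": "B", "caffeine": "5.0"},
]
site_level = aggregate_to_site(replicates, value_keys=["caffeine"])
```

Keeping a replicate count in the output makes it easy to see which site-level values rest on a single measurement.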

A commitment to FAIR and TRUST principles
Throughout the dataset's development, we strove to incorporate both FAIR (Findable, Accessible, Interoperable, and Reusable) and TRUST (Transparency, Responsibility, User Focus, Sustainability, and Technology) principles where applicable.
With respect to FAIR principles (Wilkinson et al. 2016), the data are openly accessible in a standardized, replicate-level format on the EDI portal. The 11 CSV files contained within the dataset are entirely interoperable using the "site" column, enabling all variables to be merged efficiently. Finally, all analytical and some data wrangling scripts are available on the EDI portal in a compressed format, such that future users can reproduce data manipulation and analyses described in Meyer et al. (Under Review).
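To show how a shared "site" column supports merging, the following Python sketch performs a simple left join between two tables held as lists of dictionaries. The function name and the example columns ("periphyton_biomass", "microplastic_density") and their values are hypothetical and do not correspond to actual files or variables in the dataset. The sketch also assumes the second table has already been reduced to one row per site.

```python
def merge_on_site(left, right, key="site"):
    """Left-join two tables (lists of dicts) on a shared site identifier.

    Every row of `left` is kept. Columns from the matching `right` row are
    added; sites absent from `right` simply gain no new columns. `right`
    must contain at most one row per site. If both tables share a column
    name other than `key`, the value from `right` overwrites the left one.
    """
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"duplicate site in right table: {row[key]!r}")
        lookup[row[key]] = row

    merged = []
    for row in left:
        combined = dict(row)  # copy so the input rows are not mutated
        for column, value in lookup.get(row[key], {}).items():
            if column != key:
                combined[column] = value
        merged.append(combined)
    return merged


# Hypothetical site-level tables, values invented for illustration.
periphyton = [
    {"site": "A", "periphyton_biomass": 2.5},
    {"site": "B", "periphyton_biomass": 4.0},
]
microplastics = [{"site": "A", "microplastic_density": 0.3}]
combined_table = merge_on_site(periphyton, microplastics)
```

A left join is used here so that sites lacking a given measurement remain visible in the merged table instead of being dropped.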
With respect to TRUST principles (Lin et al. 2020), we strove to document additional metadata and data-cleaning practices in a public OSF repository (Meyer et al. 2015). These steps are not necessarily critical to the core EDI dataset, but provide increased transparency for future users wishing to recreate the dataset de novo. All "raw" data are provided in the OSF portal, including an initial cleaning script (00_disaggregated_data_cleaning.R) to remove incorrect spellings, erroneous data values, and inconsistent column names. This repository also includes photographs of field notes as well as of the shoreline and substrate at sampling locations. To empower and expedite future reuse, all directories are accompanied by documentation that details directory contents, and all associated scripts are documented and annotated. While many of the files are redundant with the EDI repository, the OSF repository is meant to supplement the EDI repository, so as to enable sustainable, user-focused transparency of how data were collected and cleaned from their raw formats.

Data use and recommendations for reuse
Recognizing the potential for continued low-level sewage pollution at Lake Baikal (Timoshkin et al. 2016; Volkova et al. 2018) and lakes worldwide (Yang et al. 2018; Meyer et al. 2019), the final dataset can be applied to a suite of research questions pertaining to ecological responses to human disturbance. We highlight two main areas for immediate application.
First, the final data products can be harmonized with other littoral sampling efforts throughout Lake Baikal, so as to enhance spatial coverage and data diversity. Since 2010, Lake Baikal has experienced increasing filamentous algal abundance, especially near larger lakeside developments (Kravtsova et al. 2014; Timoshkin et al. 2016, 2018; Volkova et al. 2018). Recent benthic algal surveys throughout Lake Baikal's entirety, but especially near our sampling locations, have suggested that cosmopolitan filamentous algae, such as Spirogyra spp., tend to be more abundant near larger lakeside developments (Timoshkin et al. 2016; Volkova et al. 2018). For example, Listvyanka is a small town located at the beginning of the Angara River, Lake Baikal's only surface outflow. While Listvyanka's permanent population is approximately 2000 persons, the town is a growing tourism hub and hosts over 1.2 million tourists per year (Interfax-Tourism 2018). Surveys conducted near Listvyanka have suggested increased Spirogyra spp. abundance is associated with wastewater release (Timoshkin et al. 2016). Although wastewater inputs are likely low and are diluted to negligible concentrations offshore (Meyer et al. Under Review), combining monitoring efforts across spatial and temporal scales is necessary to evaluate the spatial and temporal extent of wastewater entering Lake Baikal. As such, our data could complement previous, current, and future monitoring efforts, where observations may be missing.
Second, the final data products are useful for expanding the study of freshwater PPCPs, microplastics, and associated biological responses across large spatial scales. Recent syntheses of the PPCP literature have reported that studies involving lakes are less abundant relative to those focused on lotic systems (Meyer et al. 2019). Likewise, microplastic studies have noted that freshwater environments are less represented in the literature relative to marine ecosystems (Horton et al. 2017). For both PPCPs and microplastics, toxic responses to even minute concentrations can be uncertain and differ between ecosystem types (e.g., Rosi-Marshall et al. 2013 for lotic and Shaw et al. 2015 for lentic). Because PPCPs and microplastics are garnering increasing attention worldwide, sampling of both with colocated biological data across multiple spatial and temporal scales is needed to synthesize biotic responses to micropollutants across systems. Although our data constitute a limited number of samples relative to the PPCP and microplastic data that exist globally, our final data products are highly structured and flexible for merging with similar datasets. Additionally, our dataset's sequential harmonization workflow could be adopted by similar monitoring efforts, thereby facilitating data interoperability. Through integration with similar monitoring efforts, our dataset can contribute to global synthesis of emerging contaminant consequences, especially in a region of the world that is often not easily accessible to many researchers.