Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton abundances in the global ocean

Marine picophytoplankton is the most abundant photosynthetic group on Earth; however, it is still underrepresented in dynamic ecosystem models. Major constraints for understanding its role in the ecosystem at a global scale are sparse data and the lack of a baseline description of its distribution. Here, we present three datasets to assess the global abundance of the principal groups of picophytoplankton, Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton: (1) a compilation of 109,045 field observations with ancillary environmental data, (2) a global monthly climatology on 1° grids from 0 to 200 m, and (3) projections for four climate scenarios from the Coupled Model Intercomparison Project 5, spanning years 1901 to 2100. Together, this set of observational and modeled data can improve our understanding of the role of picophytoplankton in the global ecosystem.

Background and motivation
Picophytoplankton, comprising three major groups, Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton, is the smallest and most abundant phytoplankton component on Earth (Fuhrman and Campbell 1998; Partensky et al. 1999). These groups are dominant in the oligotrophic ocean (Flombaum et al. 2013), but occur in all ocean environments (Raven 1998; Partensky et al. 1999; Scanlan et al. 2009) and are found from surface waters down to 150-200 m depth (Flombaum et al. 2013). Picophytoplankton is thought to contribute at least 10% of global net primary production, which largely stays in the photic layer because of small cell size and high buoyancy (Raven 1998; Le Quéré et al. 2005); yet recent evidence suggests that the picophytoplankton contribution to carbon export may be larger than expected (Stukel et al. 2013; Guidi et al. 2016). Thus, considering picophytoplankton in Earth System Models may improve our understanding of global biogeochemical cycles (Hood et al. 2006).
Biogeochemical models represent the complex structure of ecosystems by grouping organisms into functional types and connecting these biological entities with chemical, biological, and physical processes. Phytoplankton is usually represented by a few cell size classes (two or three) with different effects on nutrient concentrations, which may result in inaccurate representations of biogeochemical processes (Gruber and Doney 2010). This reduced complexity allows high control and understanding of marine biogeochemical processes; however, highly simplistic representations may provide limited insights (Gruber and Doney 2010; Kwiatkowski et al. 2014), a limitation that should be considered during the analysis of results (Emerson and Hedges 2008).
Phytoplankton diversity affects the marine cycles of carbon, nitrogen, and phosphorus through distinct metabolic needs and the amount of carbon exported to the deep ocean (Weber and Deutsch 2010). For example, the variability of C/N/P ratios showed a strong latitudinal pattern driven in part by the biological diversity of plankton assemblages, and in the nutrient-depleted subtropical North Atlantic Ocean, the high elemental ratios were partly explained by the higher ratios in marine Prochlorococcus and Synechococcus, which dominated this region (Martiny et al. 2013a). Furthermore, future ocean warming and reduced nutrients are expected to benefit Prochlorococcus and Synechococcus at the expense of groups with larger cell sizes, expanding their domains and, in turn, altering carbon export (Flombaum et al. 2013; Martiny et al. 2013a). Yet, the high C/P ratios of Prochlorococcus and Synechococcus could in part smooth the effects of decreasing cell size on carbon export (Martiny et al. 2013a). Therefore, biogeochemical models may be improved by more explicitly considering the biological uniqueness of phytoplankton groups (Martiny et al. 2013b; Flombaum et al. 2020).
To incorporate Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton in ocean models, it is key to understand which environmental variables shape their biogeography. Patterns for picophytoplankton can be obtained from in situ observations, remote sensing, and quantitative niche models. Regional in situ observations provided detailed information on abundance and associated environmental variables such as nutrients (Guo et al. 2014), temperature (Morán et al. 2010), and light (Malmstrom et al. 2010), and described seasonal and spatial distribution patterns (Blanchot et al. 2001; Guo et al. 2014; Amorim et al. 2016). A previous global compilation of in situ data for picophytoplankton provided flow cytometry cell counts, but lacked data on environmental variables (Buitenhuis et al. 2012). Remote sensing methods, based on chlorophyll a and other pigments, were used to distinguish Prochlorococcus and Synechococcus from other dominant groups (Alvain et al. 2005), providing real-time measurements at large scales for the ocean surface, and were also used to estimate the size distribution of phytoplankton, including picophytoplankton, in the water column (Uitz et al. 2006; Lange et al. 2018). Still, since some accessory pigments are shared by different taxonomic groups, and organisms may cover a wide range of sizes, the pigment-based approach may not accurately reflect the actual phytoplankton community structure (Bracher et al. 2017; Mouw et al. 2017). Instead, quantitative niche models, based on the relationship between abundance and environmental variables, identified photosynthetically active radiation (PAR), temperature, and nutrients as predictive variables for picophytoplankton cell abundance (Morán et al. 2010; Flombaum et al. 2013, 2020).
Thus, quantitative niche models for Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton, plus inputs of PAR, temperature, and nutrients, can be used as an alternative to estimate their mean abundances and global scale patterns, and to project changes in future climates (Flombaum et al. 2013).
To understand the role of picophytoplankton in the Earth system, it is key to address existing data gaps in observations and modeling. Here, we contribute an in situ observations dataset that includes the environmental information used to derive quantitative niche models (Flombaum et al. 2013), a novel modeled monthly climatology dataset, and a novel set of modeled projections in future climates for Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton.

Data description
Our dataset comprises three independent global datasets that cover past observations, mean present estimates, and future projections of the principal picophytoplankton groups: Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton. The first dataset comprises 41,912 in situ observations of Prochlorococcus, 44,949 of Synechococcus, and 22,184 of picoeukaryotic phytoplankton, along with in situ measurements of environmental variables, distributed across the major ocean basins (Fig. 1). Data cover the latitudinal range from 81°N to 69°S, down to a depth of 400 m, with 80% of the observations in the northern hemisphere and in regions with intensive research efforts (North Atlantic Ocean, North Pacific Ocean). Although most of the observations of Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton were from tropical regions (53%, 51%, and 69%, respectively, located between 35°N and 35°S), the temperature range was well represented, with 42%, 45%, and 30% of samples below 15°C (Fig. 1b).
The second dataset is a monthly global climatology of Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton abundances, predicted as cell counts on a 1° grid from the surface down to 200 m depth. Using quantitative niche models and the monthly means of temperature, nitrate, and PAR as inputs, we generated the climatology dataset for each group (Fig. 2). The mean annual concentration showed three areas of elevated concentration (highest 10%): a band between 20°N and 20°S where Prochlorococcus and Synechococcus dominated, two bands around 40°N and 40°S where Synechococcus and picoeukaryotic phytoplankton dominated, and areas poleward of 50°N and 50°S where picoeukaryotic phytoplankton dominated (Fig. 3d-f). The seasonal variation in global abundance increased in amplitude from Prochlorococcus to Synechococcus to picoeukaryotic phytoplankton, reaching 6.5%, 15.6%, and 28.0% difference between the highest and lowest abundance, respectively (Fig. 4a).
The third dataset synthesized distributions and global abundances for the three groups from 1901 to 2100 (Figs. 3d-f and 4b-d). As with the climatology, projected cell abundance combined quantitative niche models with inputs from the climate scenarios defined in the Coupled Model Intercomparison Project 5 (CMIP5) (Fig. 2). Major projected changes included increased concentration in tropical areas and expansion towards higher latitudes (Fig. 4d-f). As a whole, the three datasets represent observations, baseline conditions, and modeled projections for the three main groups constituting picophytoplankton.

Data record 1
The variables and units for the in situ observations dataset are listed in Table 1.
The dataset file "global-Prochlorococcus-Synechococcus-and-picoeukaryotic-phytoplankton" (February, 2020 version) was uploaded to the Biological and Chemical Oceanography Data Management Office (BCO-DMO, https://www.bco-dmo.org/) in csv format. A header containing all the fields listed in Table 1 is available in the data file.

Data record 2
The variables and units for the monthly climatology dataset are listed in Table 2.
Figure caption (fragment): ... dataset was used to establish niche models for each group (Flombaum et al. 2013). We generated a monthly climatology dataset using PAR, temperature, and nitrate climatologies as inputs for the niche models. Similarly, we obtained the projections for four climate scenarios using nitrate and temperature from an ensemble of five global circulation models.
Each file is named after a group, and every file follows the same name pattern. The name indicates the group (PRO for Prochlorococcus, SYN for Synechococcus, and PEUK for picoeukaryotic phytoplankton), followed by the data record name and month number (e.g., "PRO_climatology_01" indicates that the file contains the global abundance climatology of Prochlorococcus for January). The annual mean is also identified with the group's prefix. Each file is a matrix of 41,088 rows and 27 columns, which holds the global ocean abundance in units of cells mL⁻¹ for each grid location (1° grid) and depth. The first column is latitude, the second is longitude, and each subsequent column corresponds to the abundance at one depth from 0 to 200 m, following the same depth bins as the World Ocean Atlas (WOA, https://www.nodc.noaa.gov/OC5/woa13/), with 5 m intervals for the first hundred meters and 25 m intervals for the next hundred.
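The file naming pattern and matrix layout of the climatology files can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the column labels (`depth_0m`, etc.) are hypothetical, and the actual header fields are those listed in Table 2. The depth levels assume that "5 m intervals for the first hundred meters and 25 m intervals for the next hundred" means 0, 5, ..., 100 m followed by 125, 150, 175, and 200 m, which gives the 25 depth columns needed to reach 27 columns in total.

```python
from typing import List

# Group prefixes as given in the text.
GROUP_PREFIXES = {
    "PRO": "Prochlorococcus",
    "SYN": "Synechococcus",
    "PEUK": "picoeukaryotic phytoplankton",
}


def climatology_filename(group: str, month: int) -> str:
    """Build a monthly file name such as 'PRO_climatology_01' (no extension)."""
    if group not in GROUP_PREFIXES:
        raise ValueError(f"unknown group prefix: {group!r}")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return f"{group}_climatology_{month:02d}"


def depth_bins_m() -> List[int]:
    """Depth levels assumed from the text: every 5 m to 100 m, then every 25 m to 200 m."""
    shallow = list(range(0, 101, 5))   # 0, 5, ..., 100 (21 levels)
    deep = list(range(125, 201, 25))   # 125, 150, 175, 200 (4 levels)
    return shallow + deep


def column_names() -> List[str]:
    """Hypothetical labels: latitude, longitude, then one column per depth level."""
    return ["latitude", "longitude"] + [f"depth_{d}m" for d in depth_bins_m()]


print(climatology_filename("PRO", 1))
print(len(column_names()))
```

Under these assumptions, the two coordinate columns plus 25 depth levels match the 27 columns reported for each file.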
The dataset files (May, 2020 version) in csv format were uploaded to the Biological and Chemical Oceanography Data Management Office (BCO-DMO). A header containing all the fields listed in Table 2 is available in each data file. The zip file named "picophytoplankton_climatology" contains all the files of this data record.

Data record 3
The climate change dataset variables and units are listed in Table 3.
This data record is composed of three files: "pro-syn-peuk-cc-global-abundance-mean," "pro-syn-peuk-cc-global-abundance-std," and "pro-syn-peuk-cc-surface." The first file holds the yearly mean abundance, in units of cells, for each group and climate scenario tested with five global circulation models. The matrix has two dimensions, arranged in 105 rows and 62 columns. Each row corresponds to a different year, and columns are related to group, scenario, and tested model. The first column shows the years [...] for scenarios RCP 2.6, RCP 4.5, and RCP 8.5. The next 20 columns correspond to Prochlorococcus abundance for each scenario tested with every model, followed in the same order by Synechococcus and picoeukaryotic phytoplankton. The second file holds the abundance standard deviation, in units of cells, arranged in an identical way.
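The arithmetic behind the abundance columns can be checked with a short sketch. Four scenarios tested with five models give 20 columns per group, and three groups give 60 abundance columns. The label "historical" for the fourth scenario, the use of model indices instead of model names, and the ordering of scenario and model within each group are assumptions made for illustration; the authoritative layout is the header of each file (Table 3).

```python
from typing import List, Tuple

GROUPS = ["Prochlorococcus", "Synechococcus", "picoeukaryotic phytoplankton"]
# "historical" is an assumed label for the fourth scenario; the RCP names are from the text.
SCENARIOS = ["historical", "RCP 2.6", "RCP 4.5", "RCP 8.5"]
N_MODELS = 5  # five global circulation models; their names are not listed here


def abundance_columns() -> List[Tuple[str, str, int]]:
    """Enumerate (group, scenario, model index) in one plausible column order."""
    return [
        (group, scenario, model)
        for group in GROUPS
        for scenario in SCENARIOS
        for model in range(1, N_MODELS + 1)
    ]


columns = abundance_columns()
per_group = len(SCENARIOS) * N_MODELS
print(per_group, len(columns))
```

With 62 columns in the file, this count would leave two columns for year information, which should be confirmed against the file header.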
The file "pro-syn-peuk-cc-surface" contains a 2D matrix of global surface abundance (50 m) and distribution for each group, scenario, and circulation model for the last 30 years of the 21st century. The first column is latitude and the second is longitude. Subsequent columns are arranged in an identical way to previous files in this record.
The dataset files (February, 2020 version) in csv format were uploaded to the Biological and Chemical Oceanography Data Management Office (BCO-DMO). Headers containing all the fields listed in Table 3 are available in each file.

Methods
Observations data record (data citation 1)
We compiled a database for Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton cell counts along with environmental data (Martiny and Flombaum 2020a). In situ observations from public repositories and previous studies are listed in Supplementary Table 1. Environmental data included temperature, nitrate plus nitrite, and phosphate, and ancillary information on depth, year, and Julian day.
We considered flow cytometry counts for the three groups, and included microscope counts for Synechococcus. No standardization was attempted. We did not consider Prochlorococcus microscopy cell counts because of their weak autofluorescence. We imposed a nitrate minimum of 10⁻² μM to avoid issues with detection limits.
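One way to read the nitrate minimum is as a floor applied to reported concentrations, so that values below the limit are set to 10⁻² μM. The sketch below assumes that reading; the function name is hypothetical, and the text does not specify whether low values were clamped or treated otherwise.

```python
import math

NITRATE_FLOOR_UM = 1e-2  # minimum nitrate concentration stated in the text (μM)


def apply_nitrate_floor(nitrate_um: float) -> float:
    """Clamp nitrate to the floor; missing values (NaN) are passed through unchanged."""
    if math.isnan(nitrate_um):
        return nitrate_um
    return max(nitrate_um, NITRATE_FLOOR_UM)


for value in (0.001, 0.5, float("nan")):
    print(value, apply_nitrate_floor(value))
```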

Climatology data record (data citation 2)
We generated a monthly climatology of Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton cell abundance for the global ocean on a 1° grid from 0 to 200 m depth (Visintini et al. 2020). We used the niche models established for each group (Flombaum et al. 2013) together with PAR and temperature, for Prochlorococcus and Synechococcus, plus nitrate, for picoeukaryotic phytoplankton. The explained variances (R²) of the quantitative niche models of Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton were 0.66, 0.35, and 0.46, respectively (Flombaum et al. 2013). Downwelling irradiance was based on the climatologies of PAR (NASA Goddard Space Flight Center Ocean Ecology Laboratory Ocean Biology Processing Group 2018a) and the Kd490 diffuse attenuation coefficient (MODIS-Aqua, 0.083° grid) (NASA Goddard Space Flight Center Ocean Ecology Laboratory Ocean Biology Processing Group 2018b), and then averaged to fit a 1° grid. Monthly global statistical means of temperature and nitrate were obtained from WOA for 1° grids and the upper 200 m of the water column (Garcia et al. 2013; Levitus et al. 2015; Locarnini et al. 2013).
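The light input can be illustrated with a minimal sketch under two assumptions that are not stated in the text: that irradiance at depth follows an exponential decay of surface PAR with Kd490 used directly as the attenuation coefficient, and that the 0.083° fields are regridded by averaging non-overlapping 12 × 12 blocks (0.083° is approximately 1/12°). Function names are hypothetical.

```python
import math
from typing import List


def par_at_depth(par_surface: float, kd490: float, depth_m: float) -> float:
    """Exponential attenuation: E(z) = E(0) * exp(-Kd * z). Assumed form, not from the text."""
    return par_surface * math.exp(-kd490 * depth_m)


def block_mean(grid: List[List[float]], factor: int) -> List[List[float]]:
    """Average non-overlapping factor x factor blocks, ignoring NaN (no data) cells."""
    n_rows, n_cols = len(grid), len(grid[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be multiples of the block factor")
    coarse = []
    for r0 in range(0, n_rows, factor):
        row = []
        for c0 in range(0, n_cols, factor):
            values = [
                grid[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
                if not math.isnan(grid[r][c])
            ]
            # A block with no valid cells stays as no data.
            row.append(sum(values) / len(values) if values else math.nan)
        coarse.append(row)
    return coarse


# Toy example: a 2 x 2 grid reduced with factor 2 (a real 0.083° field would use factor 12).
print(block_mean([[1.0, 3.0], [5.0, 7.0]], 2))
print(par_at_depth(40.0, 0.05, 20.0))
```

Averaging blocks rather than subsampling keeps the coarse value representative of the whole 1° cell, at the cost of smoothing sharp gradients near coasts and fronts.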
We did not generate abundance data for ranges of PAR, temperature, and nutrient concentration that fell outside the niche model boundaries. Boundaries for PAR, temperature, and nitrate were 10⁻⁴ and 10^1.8 E m⁻² d⁻¹, 0 and 30°C, and 10⁻² μM with no upper threshold value, respectively.
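The boundary check can be sketched as a simple mask applied before a niche model is evaluated. The limits are those stated above; treating them as inclusive and returning no prediction for missing inputs are assumptions, and the function name is hypothetical.

```python
import math

PAR_MIN, PAR_MAX = 1e-4, 10 ** 1.8   # E m^-2 d^-1
TEMP_MIN_C, TEMP_MAX_C = 0.0, 30.0   # degrees Celsius
NITRATE_MIN_UM = 1e-2                # μM, no upper threshold


def within_niche_bounds(par: float, temperature_c: float, nitrate_um: float) -> bool:
    """Return True only if all inputs fall inside the stated model boundaries.

    Comparisons with NaN are False, so missing inputs are rejected as well.
    """
    return (
        PAR_MIN <= par <= PAR_MAX
        and TEMP_MIN_C <= temperature_c <= TEMP_MAX_C
        and nitrate_um >= NITRATE_MIN_UM
    )


print(within_niche_bounds(10.0, 20.0, 1.0))
print(within_niche_bounds(10.0, 31.0, 1.0))
```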
The global circulation models provided yearly temperature and nitrate concentration, while we considered PAR constant over time. PAR boundaries for the Prochlorococcus and Synechococcus niche models were 10⁻⁴ and 10^1.8 E m⁻² d⁻¹, and 10⁻³ and 10^1.8 E m⁻² d⁻¹ for the picoeukaryotic phytoplankton model. Global abundance represented the total number of cells in the entire ocean, accounting for differences in grid cell size from low to high latitudes. Sea surface cell concentration represented the annual average of the first 50 m per grid cell.
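Accounting for grid cell size when summing cells over the ocean can be sketched as below. The approach shown, a spherical Earth with cell area R² Δλ (sin φ₂ − sin φ₁) multiplied by layer thickness, is an assumption for illustration and not necessarily the procedure used for the datasets; the function names and the Earth radius value are likewise illustrative.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_M = 6.371e6   # mean Earth radius, spherical approximation
ML_PER_M3 = 1e6            # millilitres per cubic metre


def cell_area_m2(lat_south: float, lat_north: float, dlon_deg: float = 1.0) -> float:
    """Area of a latitude-longitude cell on a sphere; shrinks toward the poles."""
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return EARTH_RADIUS_M ** 2 * math.radians(dlon_deg) * band


def cells_in_layer(conc_cells_per_ml: float, lat_south: float, lat_north: float,
                   thickness_m: float, dlon_deg: float = 1.0) -> float:
    """Convert a concentration (cells per mL) into a cell count for one grid layer."""
    volume_ml = cell_area_m2(lat_south, lat_north, dlon_deg) * thickness_m * ML_PER_M3
    return conc_cells_per_ml * volume_ml


def global_abundance(records: Iterable[Tuple[float, float, float, float]]) -> float:
    """Sum cell counts over (concentration, lat_south, lat_north, thickness) records.

    Records with NaN concentration (no data) are skipped.
    """
    return sum(
        cells_in_layer(conc, south, north, thickness)
        for conc, south, north, thickness in records
        if not math.isnan(conc)
    )


# A 1° cell at the equator holds more water than one at 60°N for the same thickness.
print(cell_area_m2(0, 1) / cell_area_m2(60, 61))
```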

Technical validation
For the in situ observations dataset, all available values were considered. For the climatology dataset, we do not present abundance data below 1000 cells mL⁻¹ for Prochlorococcus and Synechococcus or below 500 cells mL⁻¹ for picoeukaryotic phytoplankton, as we rounded abundances to those significant figures. In the climate change datasets, note that scenario RCP 4.5 does not contain data for years 2056 to 2065. In all datasets, NaN and nd denote no data.
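When reading the csv files, the two no-data markers can be handled with a small parser such as the one below. This is a sketch with a hypothetical function name; matching the markers case-insensitively and ignoring surrounding whitespace are assumptions about the file contents.

```python
import math

NO_DATA_MARKERS = {"nan", "nd"}  # markers named in the text, compared case-insensitively


def parse_abundance(token: str) -> float:
    """Convert a csv field to float, mapping 'NaN' and 'nd' to NaN."""
    text = token.strip()
    if text.lower() in NO_DATA_MARKERS:
        return math.nan
    return float(text)


for field in ("12000", "nd", "NaN"):
    print(field, parse_abundance(field))
```

Gaps such as the RCP 4.5 years 2056 to 2065 then appear as NaN and can be skipped when computing means or sums.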

Data use and recommendations for reuse
Datasets can be used to parameterize and validate biogeochemical models that consider this group of phytoplankton, and to calibrate or support remote sensing approaches meant to detect major groups of picophytoplankton. Within the in situ observations dataset, there are time-series observations that can be used to analyze abundance temporal variations for some coordinates, while the climatology dataset enables the evaluation of the global abundance variability of Prochlorococcus, Synechococcus, and picoeukaryotic phytoplankton. The climatology dataset can be used as a reference for in situ observations at a regional scale as well. The dataset for future climates can be used to summarize projected changes and to contrast with other methods of projections, such as dynamic ecosystem models.