Predicting Daily River Chlorophyll Concentrations at a Continental Scale

Eutrophication is one of the largest threats to aquatic ecosystems and chlorophyll a measurements are relevant indicators of trophic state and algal abundance. Many studies have modeled chlorophyll a in rivers but model development and testing has largely occurred at individual sites which hampers creating generalized models capable of making broad‐scale predictions. To address this gap, we compiled a large data set of chlorophyll a concentrations matched to other water quality, meteorological, and reach characteristic data for a diverse set of 82 streams and rivers across the United States. We used this data set and extreme gradient boosting, a tree‐based machine learning algorithm, to predict daily chlorophyll a concentrations. Furthermore, we tested several practical considerations of broad‐scale models, such as making predictions at sites not included in model training or the utility of in situ water quality data versus universally available remotely estimated model inputs. Predictions were very strongly correlated to observations when compared against a randomly withheld subset of days; however, the model had lower accuracy when applied to completely novel sites withheld from model training. Turbidity and total nitrogen were the two most important variables for predicting chlorophyll a. Although in situ variables improved modeled estimates and were identified as more important during model interpretation, using only remote inputs still resulted in highly correlated predictions with small bias. Testing a model across many sites allowed for identification of common variables relevant to chlorophyll a and highlighted several challenges for applying data‐driven models to new sites or at larger spatial scales.


10.1029/2022WR034215
2 of 16 different geographic, hydrologic, and environmental conditions.The ability to estimate chlorophyll concentrations across a wide range of rivers could facilitate making predictions at unmonitored locations or identification of reaches that are likely to experience symptoms of excess productivity, such as eutrophication or harmful algal blooms.
Modeling chlorophyll dynamics in rivers can be challenging due to non-linear relationships between chlorophyll and its drivers, as well as covariation or interaction among drivers.Multiple approaches have been used to model chlorophyll in rivers including process-based (Billen et al., 1994;Everbecq, 2001;Pathak et al., 2021), traditional statistical (S.Kim, 2016;K. B. Kim et al., 2020), and machine learning (Cho et al., 2018;Jeong et al., 2001;Park et al., 2022).Although model objectives of predictive accuracy and process understanding are not mutually exclusive (Duarte et al., 2003), each modeling approach varies in its emphasis on these two components and has strengths and weaknesses for broad-scale use.Mechanistic models distill process information into their design which can aid understanding and interpretation, but in practice can require substantial detailed information for model parameterization and initialization (Duarte et al., 2003) which may limit their widespread application.By contrast, machine learning models can often be easily scaled to address larger problems but forgo specifying process knowledge (Baker et al., 2018) and can be difficult to interpret.Extreme gradient boosting (XGBoost) (Chen & Guestrin, 2016) is a commonly used decision-tree-based algorithm and has been used to make predictions of stream water temperature (Feigl et al., 2021), harmful algal blooms in coastal waters (Izadi et al., 2021), and chlorophyll a concentrations in streams and rivers (Park et al., 2022;Shin et al., 2020).Tree-based algorithms are well suited for making broad-scale predictions of chlorophyll concentrations because they are non-parametric, can handle non-linear relationships, are robust against multicollinearity of variables, and benefit learning from large data sets.
Machine learning models can suffer from a lack of transparency and it may be difficult to understand what contributed to predictions.Explainable artificial intelligence is an initiative with growing interest that aims to increase the transparency of machine learning models by improving the ability to explain their outcomes (Adadi & Berrada, 2018;Barredo-Arrieta et al., 2020).SHapley Additive exPlanations (SHAP) values are an approach for determining the importance of model input features and how they contributed to making predictions and have several useful properties for interpretation of machine learning models (Lundberg et al., 2019;Lundberg & Lee, 2017).First, SHAP values show the marginal contribution of each variable relative to the mean prediction; therefore, SHAP values can indicate both the magnitude and directionality that each variable contributes to predictions.Second, SHAP values are scalable and can be calculated for each individual prediction, globally, or various levels of aggregation in between.Third, there are multiple ways to visualize and interpret SHAP values such as examining the overall global importance with SHAP importance plots or showing how the importance of a variable changes over its range with SHAP dependence plots.Fourth, SHAP values can be decomposed into main effects and interaction effects and this allows for examining the impact of each variable after the removal of interactions with other variables (Lundberg et al., 2019).The use of SHAP values increases transparency of machine learning model predictions and can aid interpretation of modeled results.
Although it is often touted that machine learning models can handle large numbers of input features, it is still prudent to use domain knowledge to inform the initial selection of features to consider.A suite of in situ water quality and meteorological variables are commonly used to predict chlorophyll concentrations in rivers including air and water temperature, pH, electrical conductivity (or specific conductance), dissolved oxygen, turbidity, measures of nitrogen and phosphorus, precipitation, and solar radiation (H.G. Kim et al., 2019;Ly et al., 2021;Park et al., 2022;Shin et al., 2020).While many of these variables are commonly measured, they are certainly not available for the majority of reaches nor are they all measured concurrently at any given site.Alternatively widely available data sets exist that either directly estimate variables of interest or serve as useful proxies.For example, there are multiple data sets of gridded meteorological estimates (Thornton et al., 2022;Xia et al., 2012) that are analogous to in situ measurements of these variables.Relationships can also be captured by related variables, for example, longitudinal changes in stream productivity from light limitation (Finlay, 2011;Finlay et al., 2011) can be reflected in changes in channel width and riparian tree height (Savoy & Harvey, 2021).We selected a combination of in situ water quality data and remotely derived meteorological, nutrient, and geomorphological data so that we could examine the performance of models that rely only on remotely derived products versus models that also incorporate local water quality data.
To advance the goal of broad-scale predictive models of riverine phytoplankton chlorophyll concentrations, we compiled a large data set of daily chlorophyll a concentrations matched to relevant water quality, meteorological, and reach characteristic variables.We used extreme gradient boosting, a tree-based machine learning algorithm, 10.1029/2022WR034215 3 of 16 to predict daily chlorophyll concentrations because it is well suited to handling nonlinear relationships, is robust to multicollinearity, and benefits from the ability to learn across a large pool of diverse input data.This study builds on previous research but also expands upon it in several important ways.First, model performance was assessed at many sites with varying hydrogeomorphic, meteorological, and riparian corridor characteristics across the conterminous United States (CONUS).To the best of our knowledge this study represents the first broad-scale application and validation of modeled daily chlorophyll a concentrations in streams and rivers.Second, we used a robust approach of withholding entire sites from model training during both cross-validation and testing against holdout data.Withholding sites allowed us to examine model performance at completely novel sites, and simulated making predictions at reaches without chlorophyll a monitoring data.Third, we tested the performance of model estimates based solely on remotely derived inputs in addition to model estimates that include in situ water quality data.Comparing estimates using in situ and remote data sources allowed us to examine if predictions varied based on the quality and quantity of available input data.Finally, explainable measures were used to interpret the modeled estimates to increase transparency as well as to provide useful information about important features and how those features contributed to the final estimates of chlorophyll a concentrations.

Chlorophyll Data
Chlorophyll a concentrations (µg L −1 ) for flowing surface waters within CONUS were compiled from a variety of sources including the AquaSat data set (Ross et al., 2019), USGS National Water Information System (NWIS) monitoring stations (U.S. Geological Survey, 2021), and the National Ecological Observatory Network (NEON) (NEON, 2021c).A combination of periodic and near-continuous chlorophyll data was used to increase the likelihood of matchups with other ancillary data.Consequently, the data set only reflects water column chlorophyll a concentrations and does not include information about benthic chlorophyll.Ross et al. (2019) compiled and harmonized observations of chlorophyll for lotic sites using the Water Quality Portal (E.K. Read et al., 2017).While the final AquaSat data set matched in situ observations to remote sensing overpasses, their use of a transparent workflow and generation of intermediate products allowed us to select chlorophyll observations from earlier in their workflow without the constraint of matching to remote sensing overpasses.Because chlorophyll data were acquired from multiple sources, a few basic pre-processing steps were applied.Any extremely high (>400 μg L −1 ) or negative values were removed.Since the data set consisted of a combination of field samples and near-continuous sensor data, all dates with multiple observations at a sampling location were aggregated to daily mean values.A full description of how each data source was downloaded and processed can be found in Supporting Information S1.Observations were then matched to reaches in the National Hydrography Data set (NHDPlusV2.1, https://nhdplus.com/NHDPlus/NHDPlusV2_home.php) using the nhdplusTools R package (Blodgett & Johnson, 2022) to facilitate matching chlorophyll data to other variables.

Summary of Input Variables
A combination of in situ and remotely estimated variables were used as model inputs based on their availability and relevance to chlorophyll a concentrations (Table 1).These variables include factors which mediate phytoplankton dynamics as well as proxies or indicators of chlorophyll a concentrations.Productivity in flowing waters is influenced by several factors including discharge (Reynolds, 2000), light (Domingues et al., 2011;W. R. Hill et al., 1995), temperature (Demars et al., 2011), and nutrients (Grimm & Fisher, 1986).Total nitrogen (TN) and total phosphorus (TP) are particularly relevant because they represent total biologically available nutrients (Dodds, 2003;Dodds & Welch, 2000).Additionally, turbidity is important to stream productivity through its impact on the light environment (Cloern, 1987;Hall et al., 2015).Daily amplitudes of dissolved oxygen reflect patterns of productivity (Demars et al., 2015;Odum, 1956) and are a useful indicator of stream autotrophs.Specific conductance indirectly measures dissolved ions in solution and conductivity is regularly used in data-driven models of chlorophyll a (Park et al., 2022;Shin et al., 2020).In addition to typical water quality and meteorological variables, we also included several site-level factors.Changes in channel width and the height of riparian vegetation influence light available to the stream surface and have been found useful for estimating light limitation at broad-scales (Savoy & Harvey, 2021).Because lentic waterbodies within the watershed can influence nitrogen cycling and water quality (Schmadel et al., 2018), estimates of lentic areal density were also included.

Ancillary Water Quality Data
Additional water quality, meteorological, and reach characteristic data were gathered from 2000 onwards to serve as model predictor variables (Table 1).Acquiring and processing each of these three types of data varied due to differences in their temporal or spatial resolution.Ancillary water quality data obtained from both NWIS and NEON were used to capture in-stream conditions including discharge (m 3 s −1 ), water temperature (°C), dissolved oxygen (mg L −1 ), turbidity (FNU), and specific conductance (µS cm −1 at 25°C).For NWIS data we identified monitoring stations that were linked to a reach code (ComID) via the NHD linked data registry and where all five constituents were measured.NWIS data were downloaded using the dataRetrieval R package (Hirsch & De Cicco, 2015).NEON water quality (NEON, 2021c), discharge (NEON, 2021a), and water temperature (NEON, 2021b) data were downloaded using the neonUtilities R package (Lunch et al., 2021).Similar to chlorophyll data, all ancillary water quality data was aggregated to daily means except for dissolved oxygen.Instead of daily means, daily amplitude was calculated for dissolved oxygen because of its relation to patterns of productivity (Demars et al., 2015;Odum, 1956).Near-continuous sensor data were not available for TN and TP.Instead, long-term mean seasonal estimates of TN (mg L −1 ) and TP (mg L −1 ) were used to characterize nutrient conditions (Shen et al., 2019(Shen et al., , 2020)).A full description of downloading and processing ancillary water quality data can be found in Supporting Information S1.

Meteorological Data
Meteorological data from DAYMET (Thornton et al., 2022) were used to capture more distal controls such as air temperature, incoming light, and precipitation.Daily total precipitation (mm day −1 ) data were used directly from DAYMET but we also derived several variables from the DAYMET data.Daily mean air temperature (°C) was calculated from the daily minimum and maximum air temperatures.Incident daily total shortwave radiation (W m −2 ) was converted to photosynthetically active radiation, expressed as total incoming photosynthetic photon flux density (mol m −2 day −1 ), by using DAYMET photoperiod and the conversion from Britton and Dodd (1976).DAYMET data were downloaded using the daymetr R package (Hufkens et al., 2018) based on the midpoint of each reach.Note.Two sets of variables were tested: using all input variables (All) and using only remotely estimated variables (Remote).

Table 1
Summary of Variable Names and Descriptions

Reach Characteristic Data
In addition to timeseries of water quality and meteorological data, several variables that characterized reach conditions at each site were also included.Estimates of channel bankfull width (m) and depth (m) (Wieczorek et al., 2018) were retrieved based on the reach ComID.Estimates of riparian tree heights were also included because of their influence on stream shading and light environments.We used global estimates of canopy height at 30 m resolution (Potapov et al., 2021) and calculated a value for each reach by taking the 90th percentile of tree height in a 100 m riparian buffer from the reach midpoint.Estimates of lentic areal density (Schmadel & Harvey, 2020) were obtained based on the reach ComID.

Compiling the Modeling Data Set
Chlorophyll, ancillary water quality, meteorology, and reach characteristics data were all merged into a final data set for modeling (Figure 1a).Since the data sets consisted of multiple temporal and spatial resolutions, there were several options on how to merge data.We decided to use continuous monitoring stations as the basis for consolidating data because this is where the majority of the timeseries of chlorophyll and water quality data originated.To facilitate matching data sets with different resolutions, all data were linked to NHD reach ComIDs.If chlorophyll data were not collected at a site where other ancillary water quality data were present, chlorophyll data were matched to data at monitoring stations by first identifying a common NHD reach ComID, identifying the nearest station, and then finally matching to water quality data based on collection date.Meteorological and reach characteristic data were downloaded for each reach and matched to monitoring stations by reach ComID and then date.For simplicity the final merged data at each station are simply referred to as sites throughout the remainder of the text.All data were merged as a sparse data frame and then the final modeling data set was created by selecting days when chlorophyll and all 15 variables (Table 1) were present.The final model data set consisted of 82 sites with 25,706 total days of data.Individual sites had anywhere between 1 and 1,732 days of data, with a median of 131 days per site.The final model data set, as well as resulting model outputs, is freely available (Savoy & Harvey, 2023).

Approach to Analysis
Our objectives included testing modeled estimates at sites withheld from model training to simulate predictions at novel sites, comparing model performance using a combination of in situ and remotely estimated variables versus relying solely on remote data sets, and explainable interpretation of modeled estimates.These objectives informed the design of the analysis in several ways.We used XGBoost (Chen & Guestrin, 2016), a tree-based machine learning algorithm, to predict daily chlorophyll a concentrations.Since XGBoost has several hyperparameters, it is necessary to perform hyperparameter tuning within a cross-validation routine.We designed two separate cross-validation routines based on our objectives (Figure 1b).The first cross-validation routine examined how well daily chlorophyll could be predicted at new days for sites that were included in model training, and we refer to this as novel days throughout.The second cross-validation routine examined how well predictions could be made to completely new sites that were not included in model training and we refer to this as novel sites throughout.To address our objective of examining the impact of data quality and quantity on model performance, the cross-validation and hyperparameter tuning was performed using a combination of in situ and remotely estimated variables as well as for only remotely estimated variables alone (Table 1).Once the best set of hyperparameters was selected, all models were then tested against independent test data sets.Lastly, a final model with the best set of hyperparameters selected during cross-validation was trained using the entire data set for the purposes of doing post hoc analyses on the contribution of variables to chlorophyll estimates.Specific details about each component of the analysis are detailed in the following sections.

Creating Test and Cross-Validation Data Sets
There is an important distinction between how well a model can make predictions for novel days at sites where some training data are available and how well it can make predictions at totally novel sites.This is effectively analogous to the common topic of streamflow prediction in ungaged basins in hydrology (Sivapalan et al., 2003).
The process of model tuning, cross-validation, and testing was developed to assess model performance when making predictions at novel days versus novel sites.First, the data set was divided into a set of holdout test data to be used for independent testing after cross-validation, and a set of training data to be used during cross-validation (Figure 1b).Approximately 20% of the full data set was used to create the holdout test data set and was comprised of two components: (a) entire sites were removed from the data set until the total number of records withheld was approximately 10% of the total data set, (b) from the remaining sites individual days were randomly selected until the total number of days withheld was approximately 10% of the total data set.The remaining 80% of the data were used for cross-validation.Similar to the creation of test holdout data sets, cross-validation also assessed predictions at novel days and novel sites by creating two separate sets of cross-validation folds.For novel days a set of 5 nonoverlapping folds was created by randomly sampling individual days so that each fold was 20% of the cross-validation data set.For novel sites, a set of 5 nonoverlapping folds was created by selecting entire sites so that each fold comprised approximately 20% of the cross-validation data set.

Cross-Validation, Hyperparameter Tuning, and Independent Testing
A k-fold (k = 5) cross-validation routine was used, where each fold was used as the validation fold and the remaining four folds were used for model training.A total of four separate cross-validation runs were done so that both estimates at novel days and novel sites could be done using all variables and only remotely estimated variables.
All modeling was done using the xgboost R package (Chen et al., 2022).XGBoost has multiple tunable parameters and strictly testing every combination of parameter values would result in a very large number of models.Instead, latin hypercube sampling was used to select a set of distinct hyperparameter combinations because it is space-filling and can reduce the total number of model runs to effectively sample the hyperparameter space.A total of 20 hyperparameter combinations were created and each was trained and tested during cross-validation (Table S3 in Supporting Information S1).
The accuracy of estimates on the validation fold was assessed using Pearson's correlation coefficient (r), root mean squared error (RMSE), and bias.We followed the guidelines for interpreting the strength of correlations (e.g., weak, moderate, etc…) suggested by Evans (1996).The final cross-validation performance was calculated by taking the mean performance across all 5 validation folds.Average correlations were calculated by performing Fisher's Z transformation on the r values, calculating the mean value, then back-transforming to r values (Fisher, 1915).The best combination of hyperparameters was determined based on the minimum RMSE and then they were used to train a model on the full training data set and finally performance was assessed on the respective holdout test data set.

Variable Interpretation
The best set of hyperparameters for predicting novel days using all input variables was used to train the final model on the full model data set (i.e., the complete data set of all 25,706 days of data).The resulting final model was then used to perform post hoc analyses and interpretation of how variables influenced chlorophyll predictions.SHAP values, as well as SHAP main effects and interaction effects were calculated using both the SHAPforxgboost (Liu & Just, 2021) and shapviz (Mayer, 2022) R packages.SHAP values were used to make several types of interpretations at different levels of aggregation.First, the overall mean of absolute SHAP values were calculated to get a perspective on the global importance of each variable.However, taking the absolute value only gives an indicator of overall importance but not whether the variable positively or negatively impacted the prediction relative to the mean predicted value.To get a broad perspective of how each variable influenced model estimates, we calculated the mean SHAP values for the lower and upper quartiles of each input variable.This provided an indicator of how low and high values of a variable corresponded to higher or lower modeled estimates.Finally, changes in variable importance across sites was examined by constructing site-level dependence plots.Instead of using SHAP values, SHAP main effects were used to isolate the impact of each variable after interactions with other variables were removed.At each site the mean value of each variable and their respective main effects were calculated.The resulting site-level dependence plots provide a more detailed view of how a variable influenced model predictions across ranging site conditions.

All Variables
When all variables were used to make predictions at novel days, the best performing set of hyperparameters produced estimates that were very highly correlated (r = 0.91) to observations and RMSE was 11.76 μg L −1 with a slightly negative bias (−0.03 μg L −1 ) (Table 2).Hyperparameter tuning had a noticeable impact during cross-validation since r ranged from 0.37 to 0.91 and RMSE ranged from 29.28 to 11.76 μg L −1 across hyperparameter combination sets.By comparison, the best performing model using all variables at novel sites was only weakly correlated (r = 0.26) and RMSE was more than doubled compared to novel days (27.09 μg L −1 ).There was also a general trend toward underestimation when making predictions at novel sites (bias = −2.66μg L −1 ).Model results were also generally less sensitive to hyperparameter tuning when making predictions at novel sites.For example, RMSE only ranged from 29.73 to 27.09 μg L −1 between the worst and best performing set of hyperparameters.

Remotely Estimated Variables
When only remote variables were used to make predictions at novel days the best performing set of hyperparameters was strongly correlated (r = 0.70) with observations.While RMSE was higher (19.82 μg L −1 ) than when using all variables, bias (−0.09 μg L −1 ) was comparable.Hyperparameter tuning still had a noticeable effect with correlations ranging from 0.32 to 0.70 and RMSE ranged from 29.28 to 19.82 μg L −1 .However, when making predictions at novel sites the best performing set of hyperparameters produced estimates that were  only very weakly correlated to observations (r = 0.15).Despite this, both RMSE (27.47 μg L −1 ) and bias (−3.40 μg L −1 ) were relatively comparable to estimates made with all variables.During hyperparameter tuning the lowest correlation (r = 0.01) and highest error (RMSE = 44.87μg L −1 ) was encountered.

Test Holdout Data Set Evaluation
The best hyperparameter combination sets for novel days using all variables and only remote variables were each tested against the day holdout data and likewise the best hyperparameter combination sets for novel sites were tested against the site holdout data.When using all variables, estimates for novel days performed similar to the cross-validation results (r = 0.92; RMSE = 10.48 μg L −1 ) (Table 3).However, when predicting novel sites the results (r = 0.47; RMSE = 19.60 μg L −1 ) were better than during cross-validation.When using only remote variables performance on the test data set was similar to the cross-validation performance for both novel days (r = 0.68; RMSE = 19.89μg L −1 ) and novel sites (r = 0.16; RMSE = 22.50 μg L −1 ).Three sites with ample observations in the test data set were selected to examine the performance of estimating novel days at individual sites (Figure 2).Model predictions for all three sites tended to underestimate the highest values of chlorophyll, and this was more apparent when using only remotely estimated variables.Both estimates using all variables and remotely estimated variables tended to track the general timing of observations, with the latter having a slightly less daily variability.Model fitting statistics were calculated for all sites with at least 30 observations (n = 32) in the novel days holdout data set to examine differences in performance across sites.When using all variables, correlations ranged from 0.28 to 0.99 and 20 sites were highly correlated to observations.However, when using only remotely estimated variables correlations ranged from 0.04 to 0.96 and only six sites were highly correlated to observations.

Overall Variable Importance
The top five most important variables for predicting daily chlorophyll concentrations were turbidity, TN, discharge, daily amplitude of dissolved oxygen (DO amp ), and specific conductance (Figure 3a).Mean absolute SHAP values for turbidity (5.22) and TN (4.89) were relatively close but there was a large drop-off for the third most important variable, discharge (3.24).Generally, individual water quality variables were more important than meteorological ones with precipitation, air temperature, and total incoming photosynthetically active radiation (PAR) having relatively little impact.Mean absolute SHAP values were also calculated based on whether a variable was related to water quality, reach characteristics, or meteorology.Overall, water quality variables (2.71) were most important followed by reach characteristics (0.99) and meteorology (0.24).Similarly, mean absolute SHAP values for all in situ (2.58) were higher than remote (1.24) variables.
Mean SHAP values for the lower and upper quartile of each variable were used to examine the directionality of global variable importance.The lower quartiles of turbidity and TN had relatively negative SHAP values, which indicated that they correspond to lower than average estimates of chlorophyll (Figure 3b).By contrast the lower quartile of discharge corresponded with higher than average chlorophyll estimates as indicated by a positive SHAP value.However, most variables had little difference in SHAP values between the lower and upper quartiles.

Variable Dependence Plots
For turbidity the SHAP main effect was negative at lower turbidity values and showed a general positive trend with increasing turbidity; however, there was a good degree of scatter in this relationship (Figure 4).TN had negative SHAP main effects below about 1 mg L −1 and positive values above this point, with a slight increase over the rest of the range toward higher values.Discharge had a steep nonlinear shape with positive SHAP main effects at the lowest discharge and then a sharp decline toward slightly negative values for the remainder of the variable range.DO amplitude had a relatively tight positive trend with increasing SHAP main effects with increased DO amplitude.

Discussion
We estimated daily chlorophyll a concentrations across a diverse set of rivers using a tree-based machine learning algorithm (XGBoost).We also tested considerations for making broad-scale predictions, such as the ability to make predictions at sites not included in model training and the utility of in situ versus remotely estimated model inputs.
Finally, we examined how water quality, reach characteristics, and meteorological conditions contributed to modeled estimates of chlorophyll.Modeled estimates were very strongly correlated to observations when a combination of in situ and remote data sets were used and model assessment was performed on a randomly withheld subset of days.However, there was a noticeable decrease in model accuracy when assessed against sites that were completely withheld from model training.Model performance benefitted from the inclusion of in situ data and in-stream water quality variables were generally determined to be more important during model interpretation.However, if the objective is to make hind-cast or now-cast predictions at sites with sparse chlorophyll observations, then estimates from only remote data sets can still provide highly correlated predictions with low bias.Overall, turbidity and TN were the two most important variables for estimating chlorophyll followed by several other water quality variables like discharge, the daily amplitude of dissolved oxygen, and specific conductance.Assessing the importance of variables across multiple sites helped to identify measures that were relevant across diverse conditions, rather than being important at a specific site.
Several broad-scale studies have explored the importance of multiple factors mediating the spatial variability of chlorophyll and trophic state in lakes (R. A. Hill et al., 2018;Hollister et al., 2016) but related broad-scale studies in rivers are almost completely lacking.Turbidity, TP, and TN were the three most important variables for predicting trophic state across U.S. lakes (Hollister et al., 2016) and multiple regression analysis has also shown that TP, TN, and residence time are important predictors of algal abundance across rivers, lakes, and impoundments (Soballe & Kimmel, 1987).Although it is difficult to directly compare the importance of variables across different studies, our results also identified turbidity, TN, and hydrological conditions as important predictors of river chlorophyll concentrations.By testing a model across many sites we were able to examine which features were important across a wide range of conditions.Although our approach could not identify the mechanisms behind these relationships, determination of important features for broadscale prediction of chlorophyll could help identify relationships that can be further explored or explained by process models.

Total Nitrogen and Phosphorus
Nutrients, particularly nitrogen and phosphorous have been shown to be related to sestonic chlorophyll in rivers (Bennett et al., 2021;Van Nieuwenhuyse & Jones, 1996).While TN was the second most important variable we tested, TP had considerably less impact on modeled estimates.There are several possible explanations for why TP did not show up as an important variable.
In our data set TN and TP were very highly correlated (r = 0.80) (Figure S1 in Supporting Information S1) and although XGBoost is robust to multicollinearity for the purposes of making predictions, the presence of highly correlated variables can impact the interpretation of the importance of those variables.Because XGBoost sequentially develops trees, the presence of two highly correlated features can result in placing the majority of importance on one of the two features.To examine this issue we retrained the model on the full data set and assessed feature importance two additional times: once where TP was removed and once where TN was removed.While TN remained the second most important variable when TP was removed (Figure S2 in Supporting Information S1), TP did not become one of the most important variables when TN was removed and was only the ninth most important variable (Figure S3 in Supporting Information S1).These results demonstrate that the importance of TN is robust and not due to random chance between two highly correlated variables.
Although chlorophyll concentrations generally respond similarly to both TN and TP, this is not always the case (Bennett et al., 2021).There are several possible explanations for why TP did not show up as an important variable.Relationships between nutrients and chlorophyll can be complex.For example, there can be changing  relationships with watershed size as well as reduced correlations between nutrients and chlorophyll as nutrient availability increases (Van Nieuwenhuyse & Jones, 1996).Limitation by nitrogen, phosphorous, or colimitation can also vary across streams and rivers.Headwater streams are more likely to be N limited and larger rivers are more likely to be P limited (Jarvie et al., 2018).A total of 29 out of 82 sites in this study were headwaters (first to third order sensu Vannote et al., 1980) while only 10 of the 82 sites in this study were large rivers (>sixth order sensu Vannote et al., 1980) and it is possible that for this subset of sites TN was generally more limiting than TP and thus more closely related to observed chlorophyll.It is also important to remember that in this study nutrient data represented modeled estimates of long-term seasonal averages.The impact of TP may change if a different timescale was used and any uncertainty in estimated TP would also impact our current interpretations.

Turbidity
Turbidity was the second most important variable for estimating chlorophyll and the partial dependence plot showed a general trend of increasing chlorophyll with turbidity across sites.Hollister et al. (2016) performed a similar study in lakes and not only found that turbidity was the most important predictor of chlorophyll, but also found increasing chlorophyll concentrations with increasing turbidity.Similar to Hollister et al. (2016), we interpreted our results as indicative of turbidity capturing increased phytoplankton.At first glance this interpretation may seem counterintuitive given observed relationships between turbidity and light attenuation coefficients (Davies-Colley & Nagels, 2008;Savoy & Harvey, 2021), as well as reduced planktonic biomass (Cloern, 1987) and primary production (Hall et al., 2015) with increased turbidity.While chlorophyll concentrations have been related to water clarity, chlorophyll is only one of several components that can result in light attenuation within water.Contributions of non-algal sources such as suspended solids and colored dissolved organic matter also contribute to water clarity, and thus light availability (Brezonik et al., 2019).Whether turbidity serves as a proxy for light availability or as a proxy for chlorophyll concentrations will depend on the relative contributions of algal and non-algal components of water clarity (Lorenzen, 1980).A benefit of using XGBoost was that no a priori relationship was established between chlorophyll and turbidity and this approach should be able to identify turbidity as important regardless of the functional form of the relationship.

Discharge
Interpretation of the SHAP main effects suggest the potential for high chlorophyll concentrations at very low discharge but a sharp decline as discharge increases.Negative relationships between chlorophyll and discharge have been previously observed in large rivers (Schmidt, 1994).However, because discharge is related to many other river processes it is difficult to determine whether discharge directly or indirectly mediates chlorophyll concentrations.When discharge is high it may be a dominant control on planktonic biomass but low discharge may shift controls to other factors (Reynolds, 2000) where algal blooms may result from a combination of favorable conditions such as increased residence times and warmer water temperatures (van Vliet & Zwolsman, 2008).Residence time can be an important determinant of chlorophyll concentrations both within and across freshwaters (Soballe & Kimmel, 1987).Although we did not directly investigate the impact of residence time, predicted high chlorophyll at low discharge is consistent with findings regarding the importance of residence time across freshwaters for chlorophyll.

Methodological Considerations
There are several limitations to the scope of the current study.We focused entirely on planktonic chlorophyll and did not account for benthic algae, which can be an important source of productivity.Not only can the benthic algae contribute to productivity, but they can also produce cyanotoxins in rivers (Bouma-Gregson et al., 2018).Recent research has demonstrated that effluent can shift systems from phytoplankton to periphyton communities (Bergbusch et al., 2021) and not accounting for benthic algae could miss important periods of algal activity in some systems.Adding information on benthic algae would help capture more spatial and temporal variability in stream productivity, water quality, and ecosystem health across streams and rivers.Another way to further expand the scope would be to use more remotely derived data sets.Remote sensing can capture useful measures of water quality across broad spatial scales (Gardner et al., 2021;Ross et al., 2019) and is one potential avenue to explore for conducting larger synthetic analysis of patterns and drivers of riverine chlorophyll.
While XGBoost has many favorable properties, it also has some limitations that should be considered in the context of this study.Due to their nature, tree-based algorithms can not make predictions outside of the range of training data and are thus poorly suited for extrapolation.We partially mitigated this limitation by training the model on a pooled set of data from diverse sites, which increased the heterogeneity of chlorophyll a values in comparison to training a model on a single site.Similar future studies could further mitigate this issue by including even more sites, particularly from other parts of the world or from underrepresented systems, but this does not directly address the issue.Another alternative would be to use "theory-guided data science," which is a hybrid approach where empirical models are partially constrained or informed by theory-based models (Karpatne et al., 2017).For example, physics-guided machine learning approaches have been used to predict lake water temperature (Jia et al., 2019;J. S. Read et al., 2019).
Imbalanced data sets can present challenges for many modeling approaches.XGBoost and other decision-tree algorithms will often struggle to accurately predict underrepresented classes.Many ecological data sets may be positively skewed due to rare or extreme events and this presents similar challenges to imbalanced categorical data.Our compiled data set of chlorophyll a was positively skewed and while values ranged from 0 to 387 μg L −1 , the mean (10.67 μg L −1 ) and median (2.4 μg L −1 ) were relatively small.This means that extremely high chlorophyll concentrations represent only a small portion of the data set and are likely to be underpredicted by the model.Multiple approaches have been used to address class imbalance of categorical data, such as obtaining more equal sampling through resampling techniques or adding weights to place greater emphasis on accurate prediction of the minority class.However, these approaches have traditionally been used on discrete data and have received considerably less attention than the analogous issue of highly skewed continuous response data or extreme events (Krawczyk, 2016;Torgo et al., 2015).Several methods of handling imbalanced data were recently tested for predicting a continuous response of manganese concentrations and adding weights was found to be a preferable option because it did not modify the original data set (DeSimone & Ransom, 2021).Given that rare events of extremely high chlorophyll concentrations may be of great interest, it is worth exploring if existing methods for handling imbalanced data could improve the prediction of these events.

Conclusions
Although many studies have modeled chlorophyll a concentrations in rivers, these studies have overwhelmingly only made predictions at a single site or a small handful of sites in close proximity.While single-site model development and validation is useful for achieving reliable estimates at a given site, this approach does not indicate how models perform across a wide range of conditions nor does it indicate their capacity to make predictions at completely novel sites that were not used during model development, parameterization, or training.Our goals were to examine the performance of a machine learning algorithm (XGBoost) across a diverse set of sites, to test the accuracy of predictions at sites that were not included in model training, and to explore the utility of models which solely utilize widely available data sets versus models which also include more detailed in situ measurements.
One approach to making broad-scale predictions is to develop a model which predicts the response variable from the best available data sets and then makes predictions at new sites based on this relationship.Our results demonstrated that modeled estimates had noticeable reductions in performance when making predictions at sites that were completely withheld from model training versus when all sites were able to contribute a portion of data to model training.In other words, while the model was able to make predictions across a diverse set of conditions, it was not able to maintain the same performance when applied to new sites.These results suggest that single-site validation is not indicative of how the model would perform if applied to other sites.It is likely that failing to perform validation at sites that are independent of model parameterization or training will lead to overconfidence in data-driven approaches for the purposes of making predictions at unmonitored reaches.
An alternative approach to making broad-scale predictions is to build models which solely use widely available input data sets to facilitate making predictions across large spatial extents.We found that a model based solely on remote drivers had poorer performance than one that included in situ data sets; however, the model based only on remote inputs performed better than trying to extrapolate a more detailed model to new sites not included in model training.Many sites have a handful of periodic measurements of chlorophyll a but lack other concurrent measures of water quality.Sites with periodic measurements of chlorophyll concentrations could be paired with widely available remote data sets to estimate daily timeseries of chlorophyll a concentrations at those sites.The availability of daily timeseries could allow for retrospective analysis on temporal patterns of chlorophyll dynamics, or now-cast predictions based on remote inputs.
While we have presented an initial attempt at synthetic analysis and prediction of daily riverine chlorophyll a concentrations, there is undoubtedly much more to be learned from additional broad-scale analyses.Testing models across diverse conditions helps to identify areas for future model improvements regardless of the modeling approach.Our approach using XGBoost indicates that while it was capable of making accurate predictions across a pooled set of heterogenous data from many sites, it was not able to reliably make estimates at new sites.We identified variables that were important for predicting chlorophyll a across many sites and provided some interpretation of how these variables influenced modeled estimates.Identifying factors that are generally relevant to chlorophyll a could inform other data driven approaches or serve as the basis for further mechanistic inquiry to improve current process models.We have taken several steps towards the goal of broad-scale predictive models of riverine chlorophyll a, particularly with respect to machine learning models.Future work could expand on our analysis by including a more diverse set of global rivers, exploring other machine learning methods that may be more appropriate for extrapolating to new sites, applying process-driven models to our compiled data set, including novel data types, or utilizing other methods to infer processes mediating chlorophyll patterns.Given the varied conditions experienced by phytoplankton in rivers, additional broad-scale application and analysis of models will result in iterative improvements of process understanding and model performance which should promote more generalized models.The work described here was supported by the U.S. Geological Survey (USGS) Water Availability and Use Science Program and was completed as part of the USGS Proxies Project, an effort supported by the U.S. Geological Survey Water Mission Area (WMA) Water Quality Processes program to develop estimation methods for PFAS, harmful algal blooms, and metals, at multiple spatial and temporal scales.The National Ecological Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle.This material is based in part upon work supported by the National Science Foundation through the NEON Program.Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.We would also like to thank the anonymous reviewers for their comments which helped us refine and improve the content of the manuscript.

Figure 1 .
Figure 1.Overview of the workflow for compiling a data set for modeling (a) and the steps used in model cross-validation and testing (b).NWIS, U.S. Geological Survey National Water Information System; NEON, National Ecological Observatory Network; ComID, unique reach codes; C.V., cross-validation.

Figure 2 .
Figure2.Observed and predicted chlorophyll a concentration (µg L −1 ) for novel days at three sites (out of 82 sites in this study); station_07144100 (a, d), station_14159500 (b, e), and station_13213000 (c, f).The full model training data set (80% of all data) was used to train models with the best set of model hyperparameters identified during cross-validation.Only predictions for data in the independent day holdout are shown.

Figure 3 .
Figure 3. Overall mean absolute SHapley Additive exPlanations (SHAP) for all variables (a).SHAP values were calculated after training the model on the full data set.Mean SHAP values were also calculated for the lower and upper quartile of each variable (b).

Figure 4 .
Figure 4. SHapley Additive exPlanations (SHAP) main effects for turbidity (a), total nitrogen (TN) (b), discharge (c), and the daily amplitude of dissolved oxygen (d).Each point represents the mean SHAP main effect and variable value at a single site.

Table 2
Summary of k-Fold Cross-Validation Results Two separate holdout data sets were created to test model performance.Predictions were made for novel days at sites that were included in model training as well as for novel sites that were not included in model training.r is Pearson's correlation coefficient and RMSE is root mean squared error.

Table 3
Summary of Test Results on Holdout Data Sets