Modeling the drivers of eutrophication in Finland with a machine learning approach

Anthropogenic eutrophication is one of the most common threats to inland water quality, often causing toxic algal blooms and loss of aquatic biodiversity. Mitigating the harmful impacts of eutrophication requires managing nutrient inputs from the catchment focusing on the major local drivers of eutrophication. These drivers can be identified using models that predict lake trophic state based on characteristics of the lake and its catchment. In this study, we aimed to extend the spatial scope of these models by identifying drivers of eutrophication in a large sample of lakes (1547) distributed across Finland. Moreover, we used satellite-observed chlorophyll a (chl a ) concentration as trophic state indicator, instead of site-sampled data, which is commonly used in existing research. We identified major drivers of eutrophication on river basin district to country scale based on 11 catchment and lake characteristics, applying the random forest algorithm. On country scale, the catchment and lake characteristics explained 41% of the variation in lake chl a concentrations, and on river basin district scale, 20% – 44%. Catchment and lake hydromorphology were the most important explanatory characteristics. Especially, high natural eutrophication level, shallow mean depth of lake, and small share of lake area


INTRODUCTION
Eutrophication is one of the most common stressors deteriorating the ecological status of lakes (Bhagowati & Ahamad, 2019;European Environment Agency, 2018;Vinçon-Leite & Casenave, 2019). It is caused by increased primary productivity of aquatic ecosystems, often due to increased nutrient load from surrounding land areas (Bhagowati & Ahamad, 2019;Le Moal et al., 2019;Vinçon-Leite & Casenave, 2019). Natural eutrophication is a process where nutrients and sediment from the catchment accumulate in a water body over geologic time (Le Moal et al., 2019). However, human activities, such as agriculture as well as discharge of effluents from households and industry, have intensified eutrophication in lakes and other water bodies in many parts of the world since the mid-20th century (Bhagowati & Ahamad, 2019;Lürling et al., 2016;Schindler et al., 2016). Eutrophication commonly causes harmful algal blooms, aquatic biodiversity loss, and oxygen depletion, which lead to death of aquatic organisms (Bhagowati & Ahamad, 2019;Le Moal et al., 2019;Wurtsbaugh et al., 2019). Moreover, it compromises the supply of safe drinking water for humans and animals as well as decreases the recreational and economic values related to surface waters (Jenny et al., 2020;Wurtsbaugh et al., 2019).
Regulations, such as European Union (EU) Water Framework Directive (WFD) and Clean Water Act in the United States, have established frameworks for the assessment and management of water quality, aiming to restore and improve the ecological, physical, and chemical qualities of inland and coastal waters (Ferreira et al., 2007;Gibson et al., 2000). Achieving good ecological status in lakes requires controlling nutrient inputs from the catchment and applying protective measures designed based on local pressures (Carvalho et al., 2019;European Environment Agency, 2018;Schindler, 2012). Several existing studies have shown that the trophic state of lakes can be predicted based on characteristics of the lake and its catchment, and predictive models can be used for identifying major local drivers of eutrophication (e.g., Catherine et al., 2010;Hollister et al., 2016;Huo et al., 2015;Knoll et al., 2015). These models provide a cost-effective tool for planning effective use of protective measures since fine-resolution spatial data on environmental factors constantly become more accessible.
However, most trophic state predictive models in existing research have been calibrated using site-sampled chlorophyll a (chl a) or nutrient concentrations (e.g., Catherine et al., 2010;Hollister et al., 2016;Knoll et al., 2015;Liu et al., 2011), which limit their applicability to areas within established water quality monitoring networks. Moreover, many models in existing research have been developed for an individual catchment (e.g., Ekholm et al., 2000) or for a limited number of water bodies (<100) (e.g., Catherine et al., 2010;Huo et al., 2015;Liu et al., 2011), and hence, they might have not reflected the heterogeneity of environmental factors influencing eutrophication on larger spatial scales. The applicability of these models could be extended to scales relevant for decision-making (e.g., country scale, river basin district [RBD] scale) and their spatial coverage could be improved by utilizing complementary, widely available water quality data sources, such as satellite observations.
In this study, we aimed to extend the spatial scope of lake trophic state predictive models for identifying drivers of eutrophication. We examined a large sample of lakes (1547) distributed across Finland and used satellite-observed chl a concentration as the indicator of lake trophic state. We applied random forest algorithm, which has been shown to be suitable for capturing the relationship between lake eutrophication and its driving factors (Hollister et al., 2016;Kehoe et al., 2012;Liu et al., 2019), and developed predictive models both on country scale and on RBD scale to assess model applicability on different spatial scales and in areas with distinct environmental conditions. The models quantified the strength and the nature of the relationship between lake chl a concentration and characteristics of the lake and its catchment and determined the relative importance of catchment-scale characteristics in explaining lake trophic state.

Study area
The studied lakes and their catchments were distributed across Finland (Figure 1) (60 -70 N, 20 -32 E) that covers an area of 390,000 km 2 . Finland is divided into eight RBDs that have been defined following the natural basins of major rivers, and river basin management plans (e.g., related to EU WFD) are made on RBD scale (Alahuhta et al., 2013). We used RBDs as spatial units for the district-scale models. Most study catchments were in RBDs 1, 2, 3, and 4, hereafter referred to as RBDs East, South, West, and Central, respectively ( Figure 1). Therefore, these four RBDs were selected as the focus areas of RBD-scale modeling. The full sample of 1547 lakes and their catchments, spatial distribution shown in Figure 1, were included in country-scale modeling. The number of lakes included in the study was maximized for the available data on environmental characteristics. The catchment sizes of the studied lakes ranged from 1.4 km 2 to 3900 km 2 and 665 catchments were nested within other catchments. Among the studied lakes, 11% were oligotrophic (chl a concentration <2.6 μg/L), 31% mesotrophic (2.6-7.6 μg/L), and 58% eutrophic (>7.6 μg/L) (Figure 1), based on Carlson's trophic state index (Carlson, 1977).
The country-scale and RBD-scale study areas host a wide range of environmental and land use conditions that influence lake eutrophication (Appendix S1: Figures S1 and S2 and Table S1). Agricultural activity and erosion-prone clay soils, which have been shown to intensify eutrophication in existing research (e.g., Keatley et al., 2011;Ulén et al., 2007), are concentrated in the southern and southwestern parts of the country. On the other hand, the share of peatland area, one of the most important factors influencing organic carbon, nitrogen, and phosphorous concentrations in boreal lakes (Palviainen et al., 2016), is highest in eastern and northern Finland. In Finland, lake processes are characterized by ice cover in winter due to sub-zero temperatures. Summer and autumn are the rainiest seasons, and most of the nutrient leaching that contributes to lake eutrophication occurs in autumn and winter when there is less vegetation cover preventing water erosion (Puustinen et al., 2007).

Data
The model developed in this study utilized satelliteobserved chl a concentration in lakes in combination with lake, soil, land use, and hydroclimatological characteristics (Table 1). The environmental characteristics were summarized within lake-specific catchments and catchments were handed as individual samples in modeling. The study datasets were acquired from Finnish open data services, except for the catchment areas and chl a concentration data, which were obtained from Finnish Environmental Institute. In the following section, we describe the datasets and their preprocessing steps as well as motivate the selection of studied catchment and lake characteristics.
Lake-specific catchment delineation Lake-specific catchments were delineated by recursively finding upstream catchments relative to the catchment where each study lake is located in the Finnish Environmental Institute Catchment areas dataset F I G U R E 1 (a) Spatial distribution of study catchments included in country-scale modeling and borders of four river basin districts (RBDs) that were chosen as the focus areas of RBD-scale modeling. (b) Mean of observed annual median chlorophyll a concentrations in study lakes. For nested catchments, aggregated mean values are shown.

ECOSPHERE
3 of 17 (Finnish Environment Institute, 2014). The Finnish Environment Institute Catchment areas dataset has been derived using a 10-m elevation model and it includes catchments for channels (or chains of channels) with a length of at least 100 m and that end in a lake (with an area of at least 0.5 km 2 and all WFD lakes) or in an intersection of channels. The dataset does not include information about the hierarchical structure of catchments (e.g., which smaller catchments are part of a larger catchment). Therefore, to facilitate finding upstream catchments when delineating complete catchments for each study lake, we defined the pour point of each catchment in the Finnish Environmental Institute Catchment areas. For finding these pour points, we used a 10-m digital elevation model (National Land Survey of Finland, 2019) and the flow accumulation method in the Pysheds Python package (Bartos et al., 2020) version 0.3.
The resolution of the elevation model was too coarse for computationally determining pour points for 22 small catchments with areas below 0.012 km 2 . On the other hand, for one of the largest study catchments, pour point finding processing time became excessive. For these 23 catchments, pour points were assigned manually in QGIS (QGIS.org, 2018) based on visual inspection and the locations of other pour points. Finally, lakes with catchments extending outside of the borders of Finland (National Land Survey of Finland, 2021) were excluded. Study catchments that were nested within other catchments were linked to identifiers of their "parent catchments" (Finnish Environment Institute, 1991), that is, catchments that are not located within another catchment. The parent identifiers were utilized for three purposes: (1) to ensure that nested catchments would always end up in the same set in the train-test split to minimize T A B L E 1 Lake and catchment characteristics included in the random forest models and the characteristic summarization methods. the effects of spatial autocorrelation in model evaluation, (2) to aggregate results within nested catchments when evaluating model performance (e.g., in Figure 2b), and (3) to visualize variable values in nested catchments because it would be challenging to visualize the spatial extent of each nested catchment in such figures (e.g., Figure 1; Appendix S1: Figure S1).

Lake and catchment characteristics
In this study, the trophic state of lakes was explained based on 11 catchment and lake characteristics ( Table 1) that have been shown to correlate with chl a or nutrient concentrations in lakes. For modeling, the characteristics were summarized within each study catchment. In the following section, we describe the background of characteristics selection, characteristics data, and summarization methods. Lake chl a concentration was used as the target variable in the models developed in this study. Chl a concentration is a commonly used indicator of trophic state in ecological modeling (Vinçon-Leite & Casenave, 2019), and it is used as the eutrophication indicator in EU WFD water quality assessment (Papathanasopoulou et al., 2019). Chl a concentrations in study lakes were derived from monthly statistics of satellite observations from Copernicus program Sentinel-2 series Multispectral Instrument (MSI) (Finnish Environment Institute [SYKE], 2020). The chl a dataset over the studied lakes has been processed following similar data handling steps as presented in Attila et al. (2018), updated to account for the changes in instrumentation to Sentinel-2 MSI. Chl a concentrations have been derived from satellite observations using a bio-optical model applicable for water areas with high absorption of colored dissolved organic matter (Brockmann et al., 2016;C2RCC, 2022). For this study, average chl a concentrations in each lake were summarized by first calculating annual median values from the observed monthly medians and then an overall arithmetic mean across the study years 2016-2019.
Catchment land use has been shown to correlate with chl a and nutrient concentrations in inland waters (e.g., Catherine et al., 2010;Liu et al., 2011;Read et al., 2015;Röman et al., 2018). Moreover, nutrient concentrations are impacted by lake area to catchment area ratio (LA:CA) (Horppila et al., 2019;Röman et al., 2018), which can be viewed to represent catchment flushing rate (e.g., Fraterrigo & Downing, 2008). The models developed in this study included four land cover characteristics: share of agricultural area, built area, area on natural landscapes on peatland (hereafter "peatland area"), and lake area, which were derived from Corine Land Cover 2018 (20 m) (Finnish Environment Institute, 2018). We used the finest (i.e., fourth) level classification in Corine Land Cover to divide land use into the four land use categories (Appendix S1: Table S2). Forest, which is the dominant land cover category across Finland, was excluded from the models to reduce collinearity between the land cover categories.
Climate characteristics, such as precipitation and air temperature, have distinct impacts on nutrient losses from the catchment during winter and summer seasons, mainly due to differences in vegetation cover and nutrient uptake by plants. Increased precipitation or runoff and increased air temperature, especially outside of growing season, have been associated with higher nutrient losses and intensified eutrophication in water bodies (Coffey et al., 2019;Jabloun et al., 2015;Puustinen et al., 2007). In this study, we included seasonal 20-year averages of 2-m air temperature and cumulative seasonal precipitation as climate characteristics in the models. We derived these characteristics from monthly 2-m air temperature and precipitation rate datasets (Finnish Meteorological Institute, 2021), which include monthly weather radar observations interpolated to a 1 km × 1 km grid. The seasonal division was based on the annual observation period of the chl a statistics that is used for determining the ecological status of lakes in Finland (Attila et al., 2018): summer season (i.e., chl a observation season) covers months from June to September and winter season covers months from October to May. Seasonal 20-year average temperatures were calculated by selecting the monthly data within each season and taking the respective arithmetic means. Seasonal 20-year average cumulative precipitations were determined by calculating annual cumulative precipitations for each season each year and taking their respective arithmetic means.
Lake-specific factors such as mean or maximum lake depth have been important predictors of lake trophic state in several studies (Catherine et al., 2010;Knoll et al., 2015;Read et al., 2015). Moreover, lake trophic state is impacted by naturally occurring nutrient concentrations, which depend on catchment soil type, lake depth, lake size, and water retention time (Aroviita et al., 2019). In this study, lake-specific factors were represented by two characteristics: lake mean depth and natural eutrophication level. The lake mean depth data (Finnish Environment Institute, 2012) were retrieved from the Finnish lake registry. The depth measurements are based on soundings, depth contours from the topographic database of the National Land Survey of Finland, and graphical estimations based on depth maps (Finnish Environment Institute, 2012). Natural eutrophication-level categories were derived from the lake-type classification in the WFD water bodies dataset (Finnish Environment Institute and ELY-Centres, 2017) and the corresponding natural eutrophication levels from Aroviita et al. (2019). The natural eutrophication levels represent chl a concentrations expected to be found in lakes that have excellent ecological status and face minimal human impact. The reference values for each lake type have been defined based on chl a concentration measurements from minimally disturbed lakes, historical datasets, expert judgment, modeling, or a combination of these methods, with priority given to measurements (Aroviita et al., 2019). We divided the study lakes into three nearly equal-sized natural eutrophication-level categories based on lake-type-specific natural chl a concentration in reference conditions: (1) low (<4 μg/L, 489 lakes), (2) moderate (4-6 μg/L, 535 lakes), and (3) high (>6 μg/L, 523 lakes) (Appendix S1: Table S3).
The intensity of diffuse nutrient loading, which is a major driver of eutrophication, is related to the erosion sensitivity of the catchment, especially in agricultural catchments (Puustinen et al., 2007;Röman et al., 2018;Ulén et al., 2007). Therefore, erosion risk on agricultural fields was included as a catchment characteristic in the models. The erosion risk characteristic value was calculated using the field erosion risk map (2 m) (Karelia University of Applied Sciences and Natural Resources Institute Finland, 2020) and the Field Parcels 2016 (Agency for Rural Affairs in Finland, 2016) datasets, by summarizing median erosion risk within field parcels within a study catchment. Catchments with no fields were excluded. The erosion risk map combines information on slope length and steepness, soil erodibility, and erosive effect of rain and runoff according to the RUSLE erosion model adjusted to Finnish conditions by Natural Resources Institute of Finland (Karelia University of Applied Sciences and Natural Resources Institute Finland, 2020). The Field Parcels 2016 dataset includes arable land that is cultivated or available for cultivation and that is reported to the Finnish Food Authority for applying area-based agricultural subsidies (European Parliament and Council of the European Union, 2013; Finnish Food Authority, 2016).

Model design
Chl a concentration (i.e., trophic state) in study lakes was explained based on characteristics of the lake and its catchment (Table 1) by applying random forest regression (Breiman, 2001), implemented using the Scikit-learn Python package (Pedregosa et al., 2011) version 1.2.0. Random forest is a commonly used classification and regression algorithm in ecological modeling (Meyer et al., 2019) and it has been shown to be suitable for capturing nonlinear relationships, for example, the relationship between algal blooms and their driving factors, in complex environmental systems (e.g., Hollister et al., 2016;Kehoe et al., 2012;Liu et al., 2019). Compared with parametric modeling approaches, such as linear or nonlinear regression, random forest has several advantages for applications in ecology. First, unlike parametric models, random forest does not make assumptions about the nature of the relationship between explanatory variables and the target variable. Second, random forest does not assume that explanatory variables are normally distributed, independent, or that they have similar scales of values. Finally, random forests can handle both numerical and categorical predictor variables, which was beneficial in the specific model design of this study.
We developed one country-scale random forest model using the full country-scale dataset and four RBD-scale models ( Figure 1). With these modeling scales, we aimed to assess the applicability of trophic state predictive models on spatial scales relevant for decision-making: the countryscale model could be applicable for country-scale decisionmaking, whereas RBD scale is used, for example, in managing river basins related to the EU WFD. Furthermore, with the RBD-scale models, we aimed to assess model performance on areas with distinct environmental characteristics. The RBD-scale models did not cover the northern parts of Finland because there were too few (<200) samples from those areas for the random forest modeling approach.
We manually tuned two random forest hyperparameters: the number of trees and the maximum depth of trees. The number of trees was set to 300 and maximum depth of trees to 5 (country scale) or 4 (RBD scale). The depth of the trees was restricted to reduce model complexity and to avoid overfitting because the model sample sizes were relatively small for random forest (214-1547; Table 2) and because we expected the data to be noisy. For other hyperparameters, we used the default values defined in the Scikit-learn package documentation (Scikit-learn Developers, 2022), summarized in Appendix S1: Table S4.

Model validation
We aimed to reduce the impact of spatial autocorrelation and hierarchical structures on the model results by applying spatial blocking in model validation. Spatial autocorrelation is common in environmental characteristics data, such as the lake and catchment characteristics in this study. Furthermore, in the study dataset, 665 catchments were nested within larger catchments. Roberts et al. (2017) have demonstrated that cross-validation with spatial blocking reduces the risk of model overfitting to spatial and hierarchical structures within the data, and contributes to achieving a realistic estimate of the model performance. We performed repeated random subsampling cross-validation (or Monte Carlo cross-validation) with 100 splits. Instead of grid-like spatial blocking suggested in Roberts et al. (2017), we divided the data into train and test sets using the boundaries of parent catchments (Appendix S1: Figure S3), with test set size 20% (±2.5% in individual model runs). Using this spatial structure for selecting train and test sets ensured that there was spatial buffer zone between catchments in the two datasets, and that nested catchments would always end up in the same set. Regardless of the nestedness of catchments, each catchment was handled as an individual observation in the random forest models. The model training and validation included the following steps: Measures of model applicability and the impacts of lake and catchment characteristics The ability of lake and catchment characteristics to explain chl a concentration in individual lakes was measured for all models by calculating the coefficient of determination (R 2 ) and root mean square error (RMSE) for chl a concentration estimates in model test and train runs. In addition, for the country-scale model, we measured the linear correlation between spatially aggregated estimated chl a and spatially aggregated observed chl a with Pearson's correlation coefficient r (Pearson's r). We used arithmetic mean as the aggregation method. The model test estimations of chl a concentration were spatially aggregated on three levels of coarseness, following natural basin boundaries: parent catchment level (i.e., results aggregated only for study catchments nested within other study catchments), first main river basin subdivision level, and main river basin level (Appendix S1: Figure S3). The contribution of each catchment and lake characteristic in explaining lake trophic state was measured with Shapley values in both country-scale and RBD-scale models. Shapley values are based on cooperative game theory (Shapley, 1953) and they measure the average marginal impact that all the possible values of an explanatory variable have in the model output with all possible combinations of other explanatory variable values. Shapley values produce reliable measures of explanatory variable importance even when there are correlations between the explanatory variables (Lipovetsky & Conklin, 2001). We calculated and combined Shapley values for the lake and catchment characteristics from the 100 subsampling validation steps and visualized them using TreeExplainer (Lundberg et al., 2020) in the SHAP Python package (Lundberg & Lee, 2017) version 0.40.0. The four characteristics with the largest mean impact to estimated chl a concentration were chosen for more detailed examination. The nature of the impact of these four characteristics was visualized with dependence scatter plots generated from Shapley values. The dependence plots show the impact that an individual variable has on the output of the model for each observation in each model run. In the dependence plots, lowest and highest 1% of the characteristic values (i.e., outliers) were excluded.
To validate the Shapley value results, characteristic contribution to explaining lake trophic state was also calculated with permutation importance for both country-scale and RBD-scale models. Permutation importance measures the decrease in model performance when explanatory variable columns are permuted one by one. It reduces bias toward numerical variables in variable importance estimates when the model includes both numerical and categorical variables. Permutation importance was calculated for the test data using the permutation importance function in Scikit-learn Python package (Pedregosa et al., 2011) version 1.2.0, and the results from the 100 subsampling steps were averaged to obtain the final permutation importance ranking of lake and catchment characteristics.

Model applicability
R 2 and RMSE values averaged over 100 model runs were used to measure how well lake and catchment characteristics can explain lake trophic state on spatial scales relevant for decision-making (Table 2). Additionally, on RBD scale, R 2 and RMSE were used to measure the applicability of the models to areas with distinct environmental characteristics. On country scale, the selected lake and catchment characteristics explained on average 41% of the variation in chl a concentration in individual lakes, with average RMSE test = 5.43 μg/L (Table 2). In RBD South (RBD boundaries shown in Figure 1), the lake and catchment characteristics had the highest potential in explaining lake trophic state, with R 2 test ¼ 0:44 and RMSE test = 4.76 μg/L. However, on average, the countryscale model performed better than the RBD-scale models. The difference between model training and testing performance was on average 0.20-0.56 for R 2 and 0.98-2.23 μg/L for RMSE. The smallest difference between model training and testing performance was found on country scale and the largest difference in RBD West, where the number of lakes included in the model was the smallest. F I G U R E 2 Linear correlation between spatially aggregated (i.e., mean) observed chlorophyll a (chl a) concentration and estimated chl a concentration on country scale. The scatter points are shaded with natural eutrophication-level category, which was the most important explanatory characteristic in the country-scale model. Each scatter point represents an aggregated result from one model run; the model was run 100 times. The panels show results on four levels of spatial aggregation: (a) not aggregated (i.e., individual lakes), (b) means within parent catchments (i.e., aggregation only for nested catchments), (c) means within first main river basin subdivision level basins, and (d) means within main river basins.
On country scale, the estimated chl a concentrations correlated strongly with the observed chl a concentrations (Pearson's r 0.65-0.75). The correlation was the lowest with no spatial aggregation (0.65) and strengthened when the results were aggregated (Figure 2). Further, the range of estimation error decreased in the coarser levels of aggregation. At all levels of spatial aggregation, the range of estimation errors was smaller for lower observed chl a concentrations. Moreover, the range of errors was linked to natural eutrophication level: the largest errors were found in lakes with the highest natural eutrophication levels. The aggregated country-scale model errors were normally distributed and there were no apparent spatial patterns in their magnitude (Appendix S1: Figures S4  and S5).

Impacts of catchment and lake characteristics on lake trophic state
Catchment hydrology and lake-specific characteristics (i.e., natural eutrophication level, mean lake depth, and share of lake area in the catchment) had the largest impacts on lake trophic state on country scale and in most RBDs (Figure 3). The impacts of these characteristics remained similar across all model areas. We observed decreased chl a concentration in lakes with low natural eutrophication level. On the other hand, estimated chl a concentration increased in shallow lakes as well as in catchments with small share of lake area. The ranking of the most important lake and catchment characteristics remained similar irrespective of the method used for evaluating their importance (Shapley and permutation importance: Figure 3 and Appendix S1: Figure S6, respectively).
There was variability in the local importance of land use and climate characteristics. On country scale and in RBD East, increasing share of agricultural area in the catchment seemed to significantly increase chl a concentration in lakes (Figure 3). The impact of agricultural area in the catchment was similar but less significant across the other study areas. On the contrary, in RBDs South, West, and Central, increasing share of peatland area was among the main drivers of increasing lake chl a concentration. In RBD East, we did not observe a clear relationship between peatland area and lake chl a concentration, although the RBD has a relatively large peatland area. Climate characteristics did not seem to significantly influence eutrophication levels in most study areas. However, in RBD Central, catchment climate had more impact than in other study areas and low summer season precipitation seemed to increase lake chl a concentration.
We observed mixed or insignificant impacts for the remaining lake and catchment characteristics between the model areas ( Figure 3). In RBDs South, West, and Central, we observed an increase in lake chl a in catchments with higher shares of built area. Higher field erosion risk seemed to increase lake chl a concentration in RBD South but to lower chl a concentration in lakes on country scale and in RBDs East and Central. The impact of winter season precipitation was mixed in most models, but very low precipitation seemed to increase chl a concentration in most areas. In RBD Central, low summer and winter season temperatures decreased lake chl a concentration. On the other hand, on country scale and in RBDs East and South, lower summer season temperature seemed to increase lake chl a concentration. Similarly, on country scale, lower winter season temperature and higher summer season precipitation increased chl a concentration.
Next, we inspected the Shapley dependence of estimated chl a concentration on the four most important explanatory variables in the country-scale model ( Figure 4). Shapley value measures the average marginal impact that an individual explanatory variable has on the model output, with all possible combinations of other explanatory variable values. For lakes with low natural eutrophication levels, the model predicted around 3-6 μg/L lower chl a concentration than for other lakes (Figure 4). We also observed that in shallower lakes, chl a concentration tended to increase. The steepest increase in chl a concentration was found when the mean depth decreased from 4 to 3 m. In lakes with mean depth of at least 6 m, the impact plateaued at around −2 μg/L. Small share of lake in the catchment (<7.5%) was related to increased chl a concentration. When the share of lake area increased above 10%, there was around 1 μg/L reduction in chl a concentration. The relationship between agricultural land use and eutrophication was nonlinear. Small (<5%) share of agricultural area in the catchment was related to 0-3 μg/L reduction in chl a concentration. When the share of agricultural area T A B L E 2 Performance statistics of the country-scale model and the river basin district (RBD)-scale models.  increased above 5%, there was a steady, slight increase in estimated chl a concentration (~1 μg/L per 10% increase in share of agricultural area) and the range of impacts grew to around 1-5 μg/L at maximum.

Identifying major country-and district-scale drivers of eutrophication
In this study, we quantified the strength and nature of the relationship between lake chl a concentration and characteristics of the lake and its catchment. We regressed satellite-observed chl a concentration across over 1500 lakes against lake and catchment characteristics with random forest models. This approach extends existing research, which has mostly focused on analyzing eutrophication using site-sampled chl a concentrations. The models in this study identified catchment hydrology and lake-specific factors (i.e., natural eutrophication level, lake mean depth, and share of lake area in the catchment) as the most important characteristics explaining lake chl a concentration on both country and RBD scale in Finland. Higher natural eutrophication, shallow lake depth, and small share of lake area in the catchment were all related to increased chl a concentration in Finnish lakes at both spatial scales. Additionally, share of agricultural area in the catchment was among the most important characteristics explaining eutrophication on country scale. These results are mostly in line with existing knowledge about major drivers of inland water eutrophication in Finland and on large spatial scales. The identified most important characteristics explaining lake trophic state were similar to those found in existing research: catchment hydrology and lake-specific factors (e.g., lake depth and LA:CA) impact water quality on local to country scale (Horppila et al., 2019;Knoll et al., 2015;Liu et al., 2011;Read et al., 2015;Röman et al., 2018). In deep lakes, thermal stratification and relatively large volume of water compared with sediment area results in higher nutrient dilution potential (Sondergaard et al., 2001;Xie, 2006). Thereby, larger lake depth can dampen the impacts of external loading from, for example, agricultural areas in the catchment (Huo et al., 2015). In the modeling results, the impact of thermal stratification could be demonstrated in the steepness of the relationship between lake mean depth and lake trophic state at around 3-4 m depth (Figure 4), which is sometimes used as the cutoff depth between nonstratifying and stratifying lakes (e.g., Solheim et al., 2019). Short lake retention time (which is often related to small LA:CA) further controls nutrient availability (Soballe & Kimmel, 1987). This relationship was illustrated in the lower estimated chl a concentrations in lakes with higher share of lake area (Figures 3 and 4).
Natural eutrophication level was a useful basis for explaining trophic state in lakes in the categories "low" and "moderate" natural eutrophication but lead to underestimation of chl a concentration, especially in the category "high" natural eutrophication lakes (Figure 2). A similar pattern was found comparing the distribution of spatially aggregated absolute errors (Appendix S1: Figure S4) and the observed chl a concentration across the study area (Figure 1). Comparable models in existing research have not included naturally occurring nutrient concentrations as explanatory variables. However, the EU WFD ecological status classification for Finnish lake types (Aroviita et al., 2019) illustrates that when ecological status of lakes degrades, chl a concentration is expected to increase more steeply in lakes with high natural eutrophication level. This variability in chl a concentration response to degrading ecological status could explain model estimation bias in the "high" natural eutrophication category.
Depending on the more dominant land cover category, share of agricultural area or share of peatland area was identified as key characteristic in explaining lake trophic state. Agriculture is the largest single source of nutrient loads to waters in Finland (Rankinen et al., 2015) and share of agricultural area in the catchment has been strongly associated with increased nutrient concentrations in inland water bodies in multiple studies in Finland (Ekholm et al., 2000;Horppila et al., 2019;Palviainen et al., 2016;Röman et al., 2018;Tattari et al., 2015). A similar effect has been found in large spatial-scale studies in the United States and China (Hollister et al., 2016;Knoll et al., 2015;Liu et al., 2011). The high nutrient loads from agriculture in Finland are explained by fertilizer use, soil erodibility in agricultural catchments (Hyvönen et al., 2020), and potentially the concentration of agricultural land use to areas with nutrient-rich and erosion-sensitive soils (Appendix S1: Figure S1). In Finland, peatland is the land use class producing the highest nutrient loads per area (e.g., Palviainen et al., 2016), which explains its importance as an explanatory variable in RBD Central, where the share of peatland area is higher than in the other three RBDs (Appendix S1: Figure S1).

Model uncertainties and way forward
There was strong linear correlation between spatially aggregated chl a concentration observations and country-scale model estimates, which implies that the model was able to find meaningful relationships between lake chl a and the lake and catchment characteristics. However, the model explained a smaller percentage of variation in lake chl a concentration and had larger RMSE values than comparable models that used site-sampled chl a concentrations as target variable (Catherine et al., 2010;Hollister et al., 2016;Knoll et al., 2015;Liu et al., 2011). Cuevas et al. (2006) suggest that the predictive performance of water property predictive models may be degraded by the heterogeneity in environmental factors, which is introduced with increasing sample size and model spatial coverage. Therefore, the smaller range of errors and stronger linear correlation coarser levels of spatial aggregation (Figure 2) might be due to the dampened effect of individual extreme prediction errors caused by, for example, variability in environmental conditions. Because of this, the identified drivers of eutrophication might better represent general country-level patterns than drivers of eutrophication in individual catchments.
The lower predictive performance of the models in this study might also be related to methodological choices aiming to improve the robustness of model predictions in a case where explanatory variables are spatially autocorrelated, which is common in environmental characteristics. First, we excluded geolocation variables (e.g., latitude or elevation) even though these have previously been ranked among the most important predictors of nutrient concentrations or lake trophic state on large spatial scales (Catherine et al., 2010;Hollister et al., 2016). Second, we performed cross-validation using spatial buffers between selected samples (see Methods) instead of truly random cross-validation, which has been applied in, for example, Catherine et al. (2010), Hollister et al. (2016, and Knoll et al. (2015). Both geolocation variables and random cross-validation have been linked to stronger overfitting and reproducing training data rather than successfully predicting on new samples when the input data are spatially autocorrelated (Meyer et al., 2019;Roberts et al., 2017). Meyer et al. (2019) explain that these effects stem from the dependency between training and testing datasets if samples in the sets are spatially near each other. Further, Meyer et al. (2019) demonstrate that geolocation variables might produce artifacts in results, for example, linear patterns following the spatial gradient of a geolocation variable.
Measured with R 2 and RMSE, on average, the RBD-scale models had lower performance than the country-scale model. We identified two potential reasons for the lower model performance on RBD scale. First, the RBD models were developed using smaller subsamples of the complete dataset (Table 2). Decrease in sample size has been linked to lower model performance in existing research (Luan et al., 2020;Soranno et al., 2020;Soultan & Safi, 2017). With a small sample size, individual observations and their randomness can have a larger effect on the training and validation of the model, which could hinder the model from finding consistent patterns in the data. Second, there could be locally relevant drivers of eutrophication that were not included in the models and whose impacts could be highlighted in the RBD scale. Such locally important drivers in Finland include point sources of nutrient loads (e.g., manufacturing and municipal waste water treatment plants) and dispersed settlement (Röman et al., 2018;Tattari et al., 2015). These sources constitute around 15%-30% of the nitrogen load and 10%-20% of the phosphorous load in the four study RBDs (Tattari et al., 2015). Data on these drivers have been utilized in existing research (e.g., Röman et al., 2018;Tattari et al., 2015) but it is not openly available in Finland.
The applied variable importance ranking method (Shapley values, see Methods) is robust with multicollinear explanatory variables, but it should be noted that lower model performance introduces uncertainty to variable importance rankings. This could explain why the variable importance rankings in the lowest performing RBD-scale models (RBD West and RBD East; Table 2) did not fully reflect major drivers identified in existing research. In RBD West, the share of agricultural area was not identified as one of the most important variables, even though agricultural land use produces more than 70% of phosphorous and nitrogen loads to waters (Tattari et al., 2015).
Similarly, in RBD East, share of peatland area received relatively low importance ranking, even though it has previously been identified as the most important determinant of organic nutrient loading in the area (Palviainen et al., 2016).
Further, the direction of impacts of variables with low importance ranking may be distorted by another more important explanatory variable with an opposite spatial gradient. For example, in RBDs East and South, low summer season temperature was associated with lower chl a concentration (Figure 3), which contrasts existing research (e.g., Jenny et al., 2020). The unexpected effect of this variable might be explained by the opposite spatial gradients with two other strong explanatory variables within the two RBDs: share of peatland area and natural eutrophication (Appendix S1: Figure S1). Similarly, on country scale and in RBDs East and Central, higher field erosion risk was related to lower chl a concentration, but the spatial gradient of field erosion risk opposes the gradient of natural eutrophication level, which is a more impactful explanatory variable in these three models.
In the future, the predictive performance of the models could be improved by incorporating additional data on drivers of eutrophication (e.g., point sources of nutrient load and hydrologic connectivity of lakes) and longer time series of observed chl a concentration. Additional data, especially on point source nutrient loading, could improve model performance on RBD scale and in extremely eutrophicated lakes, where the model estimation errors were the largest (Figure 2). Moreover, the general applicability of similar models needs to be assessed in a more diverse set of climates and ecological zones, since most existing trophic state or nutrient concentration predictive models have been developed for water bodies in the northern hemisphere in temperate (Catherine et al., 2010;Huo et al., 2015), boreal (e.g., Ekholm et al., 2000;Horppila et al., 2019;Röman et al., 2018), and continental climates (Knoll et al., 2015).
The models in this study were able to identify major catchment-scale drivers of eutrophication on spatial scales that are relevant in decision-making (district to country scale), and the large and spatially comprehensive study sample reflects the variability in the presence and intensity of environmental factors affecting eutrophication (Appendix S1: Figure S2 and Table S1). These results suggest that trophic state predictive models that use satellite-observed chl a concentration as trophic state indicator could have practical applicability in planning and targeting monitoring and restoration actions on both study scales, because the models could identify potentially vulnerable water bodies based on the presence of locally important drivers of eutrophication. Compared with other methods of identifying drivers of eutrophication (e.g., mass balance modeling and catchment nutrient loss models), empirical models, such as the models in this study, require less data and can therefore be feasibly applied on larger spatial scales. Moreover, the increasing availability of satellite observation products on chl a concentration (e.g., Seegers et al., 2021;Wang et al., 2021) as well as other environmental characteristics provides opportunities to extend modeling to areas with limited access to site-sampled water quality data and catchment characteristics.

CONCLUSION
The aim of this study was to expand the potential spatial scope of empirical trophic state predictive models in identifying the main drivers of eutrophication by utilizing satellite-observed chl a concentration. We explained chl a concentration in 1547 lakes distributed across Finland based on 11 lake and catchment characteristics that are known to correlate with nutrient concentrations in inland water bodies. We applied the random forest algorithm and developed a country-scale model using the full sample of lakes as well as four RBD-scale submodels and then ranked the 11 lake and catchment characteristics based on their importance in explaining lake chl a concentration.
The country-scale model and most RBD-scale models identified the major drivers of eutrophication in Finland, although they explained less than 50% of the variation in observed chl a concentration. Catchment and lake hydromorphology (lake mean depth, natural eutrophication level, and share of lake area in the catchment) were the most important characteristics explaining lake chl a concentration in all models. Moreover, depending on the more dominant land use type in the model areas, share of agricultural area or peatland area in the catchment was also ranked among the most important catchment-scale drivers of eutrophication.
The results of this study suggest that trophic state predictive models that use satellite-observed chl a concentrations as the trophic state indicator can identify the major drivers of eutrophication on district and country scales. However, the applicability of the models reflects the quality and quantity of the input data, and developing a representative model requires a priori knowledge of the locally important drivers of eutrophication. A representative model could be utilized as an additional tool in planning targeted monitoring and restoration actions, especially in areas lacking an extensive water quality monitoring site network. In the future, the utility of these models needs to be further examined in areas with more limited access to environmental datasets, as well in a diverse set of climates and ecological zones.
AUTHOR CONTRIBUTIONS Sara Heikonen, Matias Heino, and Maria Yli-Heikkilä designed the research. Sara Heikonen performed the analyses and wrote the manuscript with support from Matias Heino and Maria Yli-Heikkilä.

ACKNOWLEDGMENTS
We are grateful to Matti Kummu for comments on this manuscript, Olli Varis for help in understanding the broader context of lake trophic state modeling, and Alexander Horton for assistance in improving the modeling design. We thank Jenni Attila and Teemu Hynönen from Finnish Environmental Institute (SYKE) for providing the satellite-observed chl a statistics data, and SYKE for providing data on catchment area boundaries. We are grateful to the two anonymous reviewers for their insightful comments that helped us to clarify and improve this manuscript. We acknowledge CSC-IT Center for Science, Finland, as well as Aalto University for high-performance computing cluster resources. Sara Heikonen and Matias Heino received funding from Maa-ja vesitekniikan tuki ry and Academy of Finland funded project TREFORM (grant 339834). Sara Heikonen was also supported by the Aalto University School of Engineering and the Water and Environmental Engineering research group at Aalto University. Maria Yli-Heikkilä was supported by the European Union (grant 101033957).

SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section at the end of this article.