Predicting stream water temperature with artificial neural networks based on open‐access data

Predictions of stream water temperature are an important tool for assessing potential impacts of climate warming on aquatic ecosystems and for prioritizing targeted adaptation and mitigation measures. Since predictions require reliable baseline data, we assessed whether open‐access data can serve as a suitable resource for accurate and reliable water temperature prediction using artificial neural networks (ANNs). For this purpose, we trained and tested ANNs in 16 small (≤ 1 m³/s) headwater streams of major types located in Bavaria, Germany. Between four and eight different combinations of input parameters were trained and tested for each stream ANN, based on data availability. The input parameters were air temperature (mean, minimum and maximum), day of the year, discharge, water level and sunshine duration per day. We found that the input combination with the highest accuracy (lowest RMSE) was stream‐specific, suggesting that the optimal input combination cannot be generalized across streams. Using a reasonable, but random, input combination resulted in an increase in error (RMSE) of more than 100% compared to the stream‐specific optimal combination. Hence, we conclude that the accuracy of water temperature prediction strongly depends on the availability of open‐access input data. We also found that environmental parameters such as hydrological characteristics and the proportion of land use in the 5 m riparian strip and the entire catchment are important drivers, affecting the accuracy and reliability of ANNs. ANNs' prediction accuracy was strongly negatively related to river length, total catchment area and water level. High proportions of semi‐natural and forested land cover correlated with a higher accuracy, while open‐canopy land use types such as grassland were negatively associated with ANN accuracy. In conclusion, open‐access data were found to be suitable for accurate and reliable predictions of water temperature using ANNs.
However, we recommend incorporating stream‐specific environmental information and tailoring the combination of input parameters to individual streams in order to obtain optimal results.


| INTRODUCTION
With rising atmospheric temperatures, climate change affects stream water temperatures due to the well-established relationship between air and water temperature (Crisp & Howson, 1982; Kothandaraman & Evans, 1972; Mohseni & Stefan, 1999; Webb et al., 2003). This is of particular relevance in small headwater streams, where, owing to the relatively low mean stream depth, water temperature is highly influenced by surface energy fluxes (Leach et al., 2023), which are correlated with air temperature (e.g. solar radiation). Moreover, headwater areas govern processes further downstream and their coldwater spots provide important refugia for coldwater-dependent species in light of climatic change (Kuhn et al., 2021). Stream water temperature is naturally regulated by various drivers: meteorological (air temperature and net radiation), hydro(geo)logical (discharge and groundwater inflow), hydromorphological (stream width and depth), and vegetational, the latter determining shading and evapotranspiration.
There are a number of anthropogenic activities that influence variables such as discharge and flow variation, the proportion of surfaces rendered impervious by urbanization, changes in ice cover, and thermal pollution, which can additionally affect the water's thermal properties and natural temperature regimes on a large spatial scale (Caldwell et al., 2015; Nelson & Palmer, 2007). As one of the most common forms of anthropogenic disturbance to ecosystems, land use plays a key role in water temperature regulation, for example due to the partitioning of precipitation into infiltration and surface runoff, which affects water regimes on a (sub-)catchment scale.
Since temperature is the most crucial determinant of abiotic and biotic processes, further anticipated changes in stream water temperature due to global warming and other human impacts are expected to have substantial effects on aquatic ecosystems (Smith, 1981). This not only includes a decreased saturation concentration of oxygen triggered by global warming (Piatka et al., 2021), but also changes in viscosity, vapour pressure, density, and surface tension (Caissie, 2006).
Additionally, temperature controls a wide range of biological processes, such as the decomposition rate of organic matter, species composition in aquatic communities, biotic interactions, and energy transfer in aquatic food webs (Woodward et al., 2010). The rapid pace of global warming (IPCC, 2022) creates a need for more detailed predictions of future water temperatures in streams. These are urgently required to enable an assessment of the potential impacts of climate warming on the abiotic stream environment and the consequences for biological communities. An understanding of this is also key to targeting and prioritizing mitigation and adaptation measures.
The importance of predicting stream water temperatures is reflected by the variety of approaches that have already been tested. As stated in Rabi et al. (2015), water temperature prediction models can generally be divided into two major categories: deterministic and statistical. Statistical models are in turn differentiated into parametric and non-parametric ones (for definitions, see e.g. Rabi et al. (2015) or Benyahya et al. (2007)). The availability of data for deterministic models, such as SHADE (Chen et al., 1998) or CEQUEAU (St-Hilaire et al., 2000), is problematic, as many variables are required for catchment and thermal representations, along with complete time series for discharge and meteorological parameters.
While parametric statistical models have much lower data requirements and are simple to use, their structure is specified from the start and hence not flexibly adjustable to the data (Benyahya et al., 2007). This limitation can lead to incorrect water temperature predictions when using linear regression, a technique often applied to describe the relationship between air and water temperature (Ahmadi-Nedushan et al., 2007; Crisp & Howson, 1982; Harvey et al., 2011; Krider et al., 2013; Rabi et al., 2015; Smith, 1981; Webb et al., 2003). At elevated and low air temperatures, physical effects lead to non-linearity (Mohseni & Stefan, 1999), which is beyond the limits of linear regression analysis.
In attempting to deal with the above challenges, the non-parametric statistical approach of Artificial Neural Networks (ANNs) is increasingly popular and has displayed equal or even higher accuracy (as evident from RMSE) than the majority of deterministic and parametric statistical models (Chenard & Caissie, 2008; Feigl et al., 2021; Hadzima-Nyarko et al., 2014; Pilgrim et al., 1998; Piotrowski et al., 2015; Rabi et al., 2015; Zhu et al., 2019,b,c). To the best of our knowledge, the smallest and hence best RMSE values reported for water temperature prediction using ANNs ranged between 0.46 °C (Zhu, Heddam, Nyarko, et al., 2019) and 1.58 °C (Hadzima-Nyarko et al., 2014). In the following, we refer to this "state of the art" range as "sota-range".
Besides performance, a major benefit of using ANNs compared to deterministic models is their lower data requirements. While deterministic models require large amounts of data for predictions of water temperature, ANNs have already displayed good results with comparably limited information. It is currently unknown which input parameters produce optimal results, and so their selection varies between studies. While air temperature is a key input parameter, its format varies greatly, as do the additional input parameters, particularly those concerning the temporal resolution of the data. Several studies used only daily mean air temperatures or once-a-day measurements (Graf et al., 2019; Hadzima-Nyarko et al., 2014; Qiu et al., 2020; Rabi et al., 2015; Zhu et al., 2019,b), while others are based on daily mean, minimum and maximum air temperatures (Chenard & Caissie, 2008; Feigl et al., 2021; Piotrowski et al., 2015). Most studies used discharge or water level (Chenard & Caissie, 2008; Feigl et al., 2021; Qiu et al., 2020; Zhu et al., 2019,b,c) and/or the day of the year as an additional input (Chenard & Caissie, 2008; Feigl et al., 2021; Hadzima-Nyarko et al., 2014; Qiu et al., 2020; Zhu et al., 2019,b,c), while only one study additionally used global radiation (Feigl et al., 2021) and one the declination of the sun (Piotrowski et al., 2015).
Various measures can be used to determine the quality of a prediction, the most prominent one being accuracy. Accuracy is an indicator of how exactly an ANN predicts the output in the context of training and testing. However, climate change and natural variability involve data variations that ANNs might not be sufficiently capable of learning, since data obtained for training and testing cannot be used to represent future climatic developments. It is therefore not sufficient to rely solely on accuracy to determine the suitability of an ANN for its task. For classification networks, there are several methods available that provide more insight into the behaviour of ANNs (for examples, see Bach et al., 2015; Baehrens et al., 2010; Erhan et al., 2009; Huang et al., 2020; Simonyan et al., 2013; and Sundararajan et al., 2017). However, the prediction of water temperature is not a classification but a regression problem. In the field of regression problems, Mohr et al. (2021), to the best of our knowledge, were the first to develop a methodology able to give insight into the behaviour of regression models and to measure their behaviour not only on the basis of accuracy calculations but also as a means of examining reliability. Consequently, we included these methods in our study to enable a more holistic picture of water temperature ANN performance.
While determining accuracy and reliability is important for understanding how much trust can be placed in a prediction, these measures do not fully explain the disturbances found in the predictions.
Variations in environmental conditions and the land use form surrounding streams are highly relevant for explaining the behaviour of ANNs, how they are influenced by environmental factors, and which conditions allow for a reasonable use of ANNs for predicting water temperature.
Hence, in this study, we address whether it is possible to train accurate and reliable ANNs based on open-access data for small (≤ 1 m³/s) headwater streams in Bavaria, Germany. Additionally, we address whether an optimal combination of input parameters exists and whether these combinations are unique for each stream or can be generalized across streams. To confine the range of environmental conditions in which ANNs can be optimally applied, we studied how environmental parameters such as stream length, hydrological characteristics and proportion of land use types affect ANN accuracy. We hypothesized that open-access data suffices to predict water temperature in small headwater streams with an RMSE in the sota-range. We also hypothesized that the accuracy and reliability of ANNs are influenced by both the input combinations and the environmental parameters of the streams, which would make the optimal input combination stream-specific.

| Study sites
For this study, we investigated 16 streams in major eco-regions of different geological origins throughout Bavaria, Germany. Figure 1 shows the locations of the gauging stations of the Gewaesserkundlicher Dienst Bayern (abbr. GkD) for each stream. Figures 2 and 3 depict water temperature time series that were available for each of the 16 gauging stations. We selected streams with a mean annual discharge of ≤ 1 m³/s, based on open-access data from GkD. Using this criterion, we were able to focus on headwater streams, which are of special interest since they also govern processes further downstream.

| Measures of model performance
To assess the ANNs' performance, we used three different accuracy metrics as described in the following and three newly developed reliability metrics from Mohr et al. (2021) as described in Appendix A.3.
Our aim was to prioritize the accuracy metrics used according to their expressive power regarding the reliability of ANNs. Therefore, we conducted correlation analysis (see description in Section 2.6.2) to identify connections between accuracy and reliability metrics. In the following, we describe the three accuracy metrics used and define them in Formulas 1, 2, and 3.

RMSE: According to Moriasi et al. (2007), the root mean square error (for RMSE, see Formula 1) is an error index commonly used in the context of model evaluation. A value of 0 indicates a perfect fit. We chose this metric since it is regularly used in the context of water temperature predictions with ANNs and is intuitively well understandable.

R: Pearson's product-moment correlation coefficient (for R, see Formula 2) describes the degree of collinearity between predicted and observed data (Rabi et al., 2015). It ranges from −1 to 1, where 0 indicates no linear relationship and −1 or 1 indicates a perfect linear relationship. In this study, we aimed for a positive correlation between the observed and predicted values, that is, values close to +1. Like the RMSE, this metric is regularly used in the context of water temperature prediction with ANNs; in contrast to the RMSE, however, it does not show the mean error but the degree of collinearity.

PBIAS: According to Gupta et al. (1999) as cited in Moriasi et al. (2007), the percent bias (PBIAS, see Formula 3) shows whether the predictions are, on average, over- or underestimated. A value of 0 indicates a perfect fit, while positive values indicate underestimation and negative ones indicate overestimation. This metric is uncommon in the field of water temperature prediction with ANNs but opens up a new perspective, since the direction of prediction inaccuracies (over- or underestimation) is displayed. In contrast, the other two metrics concentrate, in general, on the magnitude of the difference between observation and prediction.

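To define the evaluation metrics, we followed the notation by Rabi et al. (2015), where $P_i$ is the ith predicted water temperature value, $O_i$ is the ith observed water temperature value, $\bar{P}$ is the average of $P_i$, $\bar{O}$ is the average of $O_i$ and $n$ is the size of the dataset:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^2} \tag{1}$$

$$R = \frac{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)}{\sqrt{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2 \sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}} \tag{2}$$

$$\mathrm{PBIAS} = \frac{\sum_{i=1}^{n}\left(O_i - P_i\right) \cdot 100}{\sum_{i=1}^{n} O_i} \tag{3}$$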
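As a minimal sketch of how the three accuracy metrics could be computed, the following uses only the Python standard library. The function names and the example values are illustrative assumptions and are not taken from the study; the sign of PBIAS follows the convention stated above (positive values indicate underestimation).

```python
import math


def rmse(observed, predicted):
    """Root mean square error; 0 indicates a perfect fit."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def pearson_r(observed, predicted):
    """Pearson's product-moment correlation between observed and predicted values."""
    o_mean = sum(observed) / len(observed)
    p_mean = sum(predicted) / len(predicted)
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))
    var_o = sum((o - o_mean) ** 2 for o in observed)
    var_p = sum((p - p_mean) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def pbias(observed, predicted):
    """Percent bias; positive means the predictions are too low on average."""
    return 100.0 * sum(o - p for o, p in zip(observed, predicted)) / sum(observed)


# Hypothetical water temperatures in °C: every prediction is 1 °C too low.
observed = [8.0, 10.0, 12.0, 14.0]
predicted = [7.0, 9.0, 11.0, 13.0]

print(rmse(observed, predicted))       # 1.0
print(pearson_r(observed, predicted))  # 1.0 (perfectly collinear despite the offset)
print(pbias(observed, predicted))      # about 9.09, i.e. underestimation
```

The example illustrates why the metrics complement each other: a constant offset leaves R at 1 while RMSE and PBIAS both register the error, and only PBIAS shows its direction.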

| Input
The data basis for the ANNs consisted of open-access data supplied by the GkD and Germany's National Meteorological Service (DWD).
The data used for each stream consisted of the daily mean water temperatures (°C), daily discharges (Q, m³/s) and daily mean water levels (L, cm), as obtained from each of the GkD gauging stations. Additionally, the daily minimum, maximum and mean air temperatures (T, °C) and the daily sunshine durations (S, defined by DWD as duration of direct solar radiation at a given location) were derived from the two closest DWD gauging stations for each stream. All data sets carried a time stamp, which we recalculated to obtain the day of year (D) as a continuous number. To improve the learning of our ANNs, we employed data normalization, which is a common technique used in machine learning (Han et al., 2011).
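The two preprocessing steps mentioned above can be sketched as follows. The text does not state which normalization was applied, so min-max scaling to [0, 1] is an assumption made purely for illustration, as are the function names and example values.

```python
from datetime import date


def day_of_year(d):
    """Convert a calendar date into the day of year (1 to 365, or 366 in leap years)."""
    return d.timetuple().tm_yday


def min_max_normalize(values):
    """Scale values linearly to [0, 1]. Assumed method; the study only says 'normalization'."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical daily mean air temperatures in °C.
air_temp = [2.0, 6.0, 10.0]
print(min_max_normalize(air_temp))    # [0.0, 0.5, 1.0]
print(day_of_year(date(2020, 3, 1)))  # 61 (2020 is a leap year)
```

In practice the scaling parameters would be derived from the training data only and then reused for the test data, so that no information from the test period leaks into training.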
FIGURE 1 Location of GkD gauging stations throughout Bavaria, Germany.

We trained and tested all ANNs by inputting data taken from four consecutive days, with the predicted fourth day's daily mean water temperature as the output. This number of days was found to lead to a better accuracy compared to ANNs with the input of 1, 2, 3, or 5 consecutive days in a case study in the Bavarian headwater catchment Mähringsbach (Drainas, 2020).
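The four-day input scheme can be illustrated with a small sliding-window sketch. How the four days were arranged into a single input vector is not described in the text, so the flattening used here is an assumption, and `make_windows` and the example data are hypothetical.

```python
def make_windows(features, water_temp, window=4):
    """Build (input, target) samples from consecutive days.

    features:   one list of input values per day (e.g. [day_of_year, air_temp])
    water_temp: daily mean water temperature for the same days
    Each sample holds the inputs of `window` consecutive days, flattened into one
    vector, and the water temperature of the last day in that window as the target.
    """
    samples = []
    for end in range(window - 1, len(features)):
        start = end - window + 1
        x = [value for day in features[start:end + 1] for value in day]
        samples.append((x, water_temp[end]))
    return samples


# Six hypothetical days with two inputs each: day of year and mean air temperature.
features = [[100, 7.1], [101, 8.0], [102, 9.4], [103, 9.0], [104, 10.2], [105, 11.5]]
water_temp = [6.0, 6.3, 6.9, 7.0, 7.4, 8.1]

samples = make_windows(features, water_temp)
print(len(samples))   # 3 windows: days 1-4, 2-5 and 3-6
print(samples[0][1])  # 7.0, the water temperature on the fourth day
```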
Since not all data were measured continuously, we chose time periods for each stream during which all the input parameters were available (except for Scheine and Soellbach, for which no sunshine data was available). The data used for each stream, the DWD stations used for each GkD gauging station, and the distances between them are presented in Appendix A.4 in Table A2. To test the suitability of the different input parameters, we trained and tested ANNs with all possible combinations of input parameters for each stream. However, we rejected any combination with no air temperature or date (Mohseni & Stefan, 1999; Zhu, Nyarko, Hadzima-Nyarko, et al., 2019). We used the following com-
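The selection rule described above (day of year and air temperature always included, the remaining parameters optional) can be sketched by enumerating subsets of the optional inputs. The letter codes follow the abbreviations D, T, L, Q and S introduced in the text; treating the label `DTLQS` as the combination containing all inputs is an assumption, as is the function name.

```python
from itertools import combinations


def input_combinations(optional):
    """Enumerate input combinations that always contain D (day of year) and
    T (air temperature) plus any subset of the optional parameters."""
    combos = []
    for k in range(len(optional) + 1):
        for subset in combinations(optional, k):
            combos.append("DT" + "".join(subset))
    return combos


# Streams with water level (L), discharge (Q) and sunshine duration (S).
print(input_combinations(["L", "Q", "S"]))
# ['DT', 'DTL', 'DTQ', 'DTS', 'DTLQ', 'DTLS', 'DTQS', 'DTLQS']

# Streams without sunshine data.
print(input_combinations(["L", "Q"]))
# ['DT', 'DTL', 'DTQ', 'DTLQ']
```

Under these assumptions the enumeration yields eight combinations when sunshine duration is available and four when it is not.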

| The modelling approach: Artificial Neural Networks
ANNs are a machine learning approach that is inspired by biological nervous systems (da Silva et al., 2017). They can create output information (in this study water temperature) based on given input information (in this study different input combinations). To do so, they need to learn the relationship between input and output, for which they need to be trained. During training, the ANNs have access to the input as well as to the output data and iteratively learn to predict the output based on the input. After the training, the ANNs' ability to predict the output is tested by only presenting the input and comparing the predicted output to the actual output information. In this study, this comparison was conducted by calculating the RMSE.
For the training and testing phases, the given dataset needs to be divided into a training and a testing dataset. In this study, this was done randomly in a ratio of 90% (training) to 10% (testing), as is the default in an established library (see Appendix A.2 for details).
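A random 90/10 split of this kind can be sketched with the standard library alone. This is not the library routine referred to above; the function name, the fixed seed and the rounding rule are assumptions chosen for the illustration.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.1, seed=0):
    """Randomly assign sample indices to a training and a testing set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(1000)
print(len(train_idx), len(test_idx))  # 900 100
```

Because the assignment is random rather than chronological, days in the test set can lie between training days; this matches the random split described in the text but differs from a hold-out of the final period of a time series.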
For this study, we trained and tested nine ANNs for each of the 14 waterbodies, for which all input parameters were available. Additionally, we trained four ANNs for Scheine and four for Soellbach (no sunshine duration available), resulting in a total of 136 ANNs. The ANNs were determined by using search methods from the scikit-learn Python package (Pedregosa et al., 2011) (for a description of the search methods, see Appendix A.1). Optimization was based on the calculated root mean square error (RMSE), which we also employed in this work as an accuracy metric (see Formula 1).

| Examination of environmental parameters
To determine the influence of the environmental parameters on the prediction accuracy of ANNs, we examined the environmental parameters of two different spatial scales for which an influence can be expected.
For that, we used ArcGIS software to create riparian strips upstream of the 16 catchment outlets flanking 5 m left and right (measured from the middle of the stream) of the entire stream length (in the following referred to as "5 m riparian strip"). This width was chosen to capture the direct impact of the land use on the stream, e.g. in the form of shading. We obtained the stream geometry and catchment spatial maps as Regarding the resolution of the 5 m riparian strip, only the land use variables were provided, along with the total riparian strip area.
In Appendix A.4, the description and results of a principal component analysis (PCA) are presented, showing the distribution of the 16 stream sites along the different environmental gradients captured in our environmental dataset.

| Correlation analysis
To determine connections between the prediction accuracy and the environmental parameters, we conducted correlation analysis. For this, we used the calculated accuracy metrics (RMSE, R and PBIAS) for each input combination and each waterbody, once for the entire catchment and once for the 5 m riparian strip resolution. First, we used the Shapiro-Wilk test to examine whether our datasets were normally distributed. We then tested the distribution both for each input combination separately and for each environmental parameter. We then conducted tests to examine the correlation between each input combination and each environmental parameter. If both datasets were normally distributed, we used Pearson product moment correlation.
Otherwise, if one or both of the datasets was not normally distributed, we used Spearman rank-order correlation. To visualize the rho and corr values, we created heatmaps. We conducted all steps of the correlation analysis with RStudio (RStudio Team, 2022; Warnes et al., 2020).
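The decision rule described above can be sketched as follows. The study carried out this analysis in R; the Python version below is only an illustration, and it takes the outcome of the normality test as a boolean argument because the Shapiro-Wilk test itself is not part of the Python standard library. All function names and example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlate(x, y, x_is_normal, y_is_normal):
    """Use Pearson only if both datasets passed the normality test, else Spearman."""
    if x_is_normal and y_is_normal:
        return "pearson", pearson(x, y)
    return "spearman", spearman(x, y)


# Hypothetical example: a monotonic but non-linear relationship.
river_length = [1.0, 2.0, 3.0, 4.0, 5.0]
rmse_values = [1.0, 4.0, 9.0, 16.0, 25.0]
print(correlate(river_length, rmse_values, x_is_normal=True, y_is_normal=False))
# ('spearman', 1.0)
```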
Additionally, we conducted correlation analysis to find connections between reliability and accuracy to assess which accuracy metric is most suitable for displaying the reliability of an ANN's predictions.
Details are described in Appendix A.5.

| Distance-based linear model
As a multivariate approach, we used a distance-based linear model (DistLM), which is based on a procedure called "distance-based redundancy analysis" (dbRDA) (Legendre & Anderson, 1999) and implements a routine that analyses and models the relationship between a multivariate resemblance matrix and a set of given predictor variables.
DistLM is applied as a multivariate multiple regression that models the explanatory significance of the environmental predictor variables via partitioning of variation that facilitates permutation-based significance testing. In our case, we first used the resemblance matrix of the RMSE, R, and PBIAS values of all calculated ANN input combinations and the same data as predictors, to reduce dimensionality. We chose this combination of data to investigate which of the accuracy metrics and ANN input combinations explained most of the between-stream variability and to identify any redundancy in the three accuracy metrics (Eval-predict). In a subsequent approach, we used the same resemblance matrix but the environmental data set as predictor variables to determine the environmental variables that explain most of the observed variations in the multivariate data set of accuracy metrics of different ANNs (Enviro-predict). For both approaches, we used the DistLM function and redundancy analysis plots (dbRDA-plots) and chose the step-wise method and adjusted R² for model comparison in PRIMER v7 & PERMANOVA+ (Anderson et al., 2008).

| Measures of model performance
To prioritize the use of the three different accuracy metrics applied in this study, we evaluated which of them is most suitable to also display the reliability of an ANN. Therefore, we conducted correlation analysis between the accuracy and the reliability metrics (for detailed results see Appendix B.2), which resulted in significant correlations between two of the reliability metrics and the RMSE and one significant correlation between reliability metrics and R and PBIAS each.

| Selection of ANNs and input parameters
The comparison of prediction accuracy of randomly searched ANNs (see RandomizedSearch in Appendix A.1) for all different input combinations (see Tables A3, A4, A5, and A6) showed that the optimal input combination was different for each of the tested streams (see Table 1).
The most frequently used combination of input parameters was DTL (38% of all streams), followed by DTLQ (25% of all streams), allinputs (21%, if sunshine duration was available), DTQS (14%, if sunshine duration was available) and DTQ (6% of all streams) (Table 2 top). The combinations DT, DTS and DTLS were not selected as input combinations with the greatest predictive power in any of the streams. Consequently, the share of individual input parameters in the composition of ANNs was as follows: day of the year and air temperature were identified as input parameters in 100% of all streams, water level was identified in 81% of all streams, discharge was identified in 63% of all streams, and sunshine duration was identified in 36% of the streams for which sunshine duration was available (Table 2 bottom and 0.112% (Prien). Comparing the input combinations with the highest accuracy according to the RMSE from each stream with the combinations of lowest accuracy, the error increased on average by 41% when a random input combination was used, compared to the optimal combination, with a minimum of 5% (Otterbach) and a maximum of 102% (Scheine).
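The percentage increases reported above express the RMSE of a less accurate input combination relative to the RMSE of the optimal one. A minimal sketch of this arithmetic, with a hypothetical function name and RMSE values that are not taken from the study:

```python
def error_increase_percent(rmse_optimal, rmse_other):
    """Relative increase in RMSE (in percent) compared to the optimal combination."""
    return 100.0 * (rmse_other - rmse_optimal) / rmse_optimal


# Hypothetical stream: the optimal combination reaches 0.8 °C, the worst one 1.2 °C.
print(error_increase_percent(0.8, 1.2))  # about 50.0
```

An increase of more than 100% therefore means that the RMSE of the less suitable combination is more than twice that of the optimal combination.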
Given the number of ANNs and the accuracy metrics, a DistLM (Eval-predict) was calculated to determine which of the accuracy metrics and ANN input combinations explained the majority of the between-stream variability and to identify redundancy in accuracy metrics. The RMSE and R values were strongly correlated along dbRDA axis 1, implying that the accuracy of calculated ANNs was very similarly reflected by these two metrics (Figure 4). Both allinputs-ANNs of RMSE and R individually explained 58% of the total variability in the dataset according to marginal testing, and the sequential tests furthermore confirmed that the information of R and RMSE of allinputs-ANNs was redundant, as only one of them was included in the best-solution set of variables. However, the PBIAS values calculated on the basis of multiple input combinations were responsible for approximately 35% of the remaining variability in the data set and distinctly discriminated streams on dbRDA axis 2, hence showing that PBIAS provides information that cannot be substituted by the other two accuracy metrics used, and that even streams with high accuracy measures, as shown by small RMSE or high R values, can be subject to under- or overestimation in temperature predictions (Figure 4).

| ANN accuracy metrics
To determine connections between environmental parameters and ANNs' accuracy, we conducted correlation analyses at the spatial scales of the entire catchment and of the 5 m riparian strip.

| Entire catchment
With regard to the entire catchment resolution, we observed the most statistically significant associations between accuracy metrics and environmental parameters for RMSE (34), followed by R (16) and PBIAS (6).
For RMSE (see Figure 5a), we detected significantly positive correlations between all ANN input combinations and total river length as well as between all ANN input combinations and the hydrologic and semi-natural land use (DTS) (Figure 7a).

| 5 m riparian strip
Significant associations between accuracy metrics and environmental parameters in the 5 m riparian strip were most numerous for RMSE (26), followed by R (21) and PBIAS (2). The RMSE values of all ANN input combinations, with the exception of DTLS and DTS, were significantly positively correlated with total riparian strip area (Figure 5b). in DTLS, allinputs and DTQS. Also, the proportion of forest and water surface in the riparian-strip area correlated negatively with RMSE values.
Similarly, R values were significantly negatively related to the total riparian strip area (all ANN combinations except DTS and DTLS), negatively related to grassland (Figure 6b) and positively related to semi-natural land use.
For PBIAS values (see Figure 7b), we only observed a significant negative relationship with semi-natural land use (DT and DTS).

| Multivariate analysis of environmental predictors of ANN accuracy metrics
To investigate which of the accuracy metrics and ANN input combinations explained most of the between-stream variability and to identify any redundancy in the three accuracy metrics, we conducted a DistLM. We found that the 14 variables depicted in Figure 8 explained a total variation of R² = 0.99601, adjusted R² = 0.94014.
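The relation between the two reported values can be checked with a short calculation. This is a sketch under stated assumptions: it uses the standard definition of the adjusted coefficient of determination (whether the software computes it in exactly this way is assumed, not confirmed by the text), with n = 16 streams and p = 14 predictor variables.

```latex
R^2_{\mathrm{adj}}
  = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
  = 1 - \left(1 - 0.99601\right)\cdot\frac{16 - 1}{16 - 14 - 1}
  = 1 - 0.00399 \cdot 15
  \approx 0.940
```

Under these assumptions the result agrees with the reported adjusted R² to within rounding. With 14 predictors for 16 streams, only one residual degree of freedom remains, which is why the adjustment lowers the value noticeably.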

| DISCUSSION
In line with our hypothesis, our results suggest that the accuracy and reliability of ANNs' predictions for single streams are highly dependent on input combination and environmental parameters. To understand how environmental parameters affect ANNs' accuracy and reliability, we analysed a broad range of environmental predictors, showing that river length and water levels, the size of the catchment and open-canopy land use types were particularly negatively associated with ANN accuracy in the streams we tested.

| Measures of model performance
To prioritize the use of the accuracy metrics RMSE, R, and PBIAS for the evaluation of ANNs, we examined correlations between these metrics and reliability metrics as established in Mohr et al. (2021). We found that not all accuracy metrics correlated significantly with all reliability metrics, confirming the finding of Mohr et al. (2021) that the use of accuracy metrics alone is insufficient and should be supplemented by reliability metrics. Still, we can conclude that, as an accuracy metric, the RMSE was the most suitable one to reflect the reliability of an ANN, due to two significant correlations with reliability metrics as opposed to one significant correlation for R and PBIAS each. We also observed that the RMSE had a greater resolution and hence contributed more significant relationships with environmental parameters than R, probably because it had a higher potential to reflect the high-resolution dynamics of hydrologic parameters. This further confirmed the plausibility of its

| Input parameters
The most striking finding of this study was that the input combination with the highest accuracy was a stream-specific set of input parameters, suggesting that the optimal input combination cannot be generalized across streams. As an important asset, this study used a systematic procedure of training and testing ANNs with different sets of input parameters, which provided us with the opportunity to compare ANN accuracies within single streams. While other studies like Feigl et al. (2021) previously identified that the input combination has an effect on ANN performance in general, our finding that the optimal input combination is stream-specific adds an important new insight to this field, which can help to make stream water temperature predictions more accurate in the future. As we were able to show, the error in the prediction (RMSE) could increase by more than 100% in a single stream if a random input combination was used instead of the optimal input combination. Even when using the allinputs combination, the error increased by up to 34%, indicating that allinputs might be more accurate than a random input combination, but still not as accurate as a systematically determined combination. This result is in line with the "explosion" of Myth #7 in Maier et al. (2023), where it is stated that an increase in the number of input variables does not necessarily improve model performance, but that these variables need to be selected carefully. Clearly, the search for the optimal input combination is time consuming compared to a fixed procedure using a set of pre-defined input variables. Hence, to support the application of ANNs based on our results, we provide a flow chart to facilitate decision-making along the process of water temperature prediction with ANNs (see Figure 9).
Comparing the RMSE values from Table 1 to previous studies that

FIGURE 9 Flow chart with recommendations on stream-specific artificial neural network development.
Based on Zhu, Nyarko, Hadzima-Nyarko, et al. (2019), we expected a minor role of discharge in explaining temperature, since they state that discharge plays a minor role in stream water temperature prediction compared to the day of the year and that discharge's importance increases for high-altitude catchments. Still, in all of the streams of our study, the most accurate ANNs had water level and/or discharge as inputs. Unfortunately, Zhu, Nyarko, Hadzima-Nyarko, et al. (2019) did not consider water level, which hinders direct comparison with our results. Nevertheless, based on our results, we suggest using at least one hydrologic input parameter for water temperature prediction with ANNs, while we cannot generalize the recommendation to a specific hydrologic parameter, since this is highly stream-specific. Still, we conclude that no single input combination is optimal for all streams.

| Environmental influences on water temperature prediction
Given the high specificity of input combinations we determined for individual streams, it was a key interest of this study to identify stream environmental conditions that govern the accuracy of ANNs.
In light of climate change, such knowledge is also highly relevant for deducing mitigation and management strategies in streams related to securing high oxygen concentrations (Piatka et al., 2021), endangered fish populations (Wild et al., 2023), and temperature refuges (Kuhn et al., 2021; Mejia et al., 2023). In contrast to existing approaches, which mainly consider hyperparameter tuning and dataset length, the associations between environmental parameters and ANN accuracy allow a more mechanistic and realistic assessment of model applicability at individual stream sites, as demonstrated for our datasets. Several significant correlations between environmental parameters and prediction accuracy of ANNs were identified, suggesting key influences of catchment hydrological variables.
Specifically, the accuracy of ANNs (RMSE and R) was strongly related to total river and longest river length, total catchment area, and the hydrological parameters MW, HW, and NW, indicating a decrease in ANN accuracy with increasing river length, catchment size, and water level.
Since stream water temperatures are defined by complex and dynamic physico-chemical, hydrologic and atmospheric processes and not solely based on air temperature (Caissie, 2006; Leach et al., 2023), a possible explanation for the strong negative relationship between ANN accuracy and river length and catchment area could be the increase of air-temperature-unrelated complex influences along the flow path of streams. Beginning at the spring, the water has a specific temperature, depending on its origin and the distance to its spring. As the stream water passes through the landscape, energy exchange is influenced by advective fluxes like evaporation or longitudinal changes in advection and radiation due to changes in vegetation (Coats & Jackson, 2020; Leach et al., 2023). Energy is added by river bank and bed friction, and contact with the atmosphere increases, as do the radiative fluxes (Dan Moore et al., 2005; Kuhn et al., 2021; Webb et al., 2008). Hence, with increasing river length, the potential number of complex influences increases and thus, the accuracy of the water temperature predictions decreases. This is especially pronounced for models like ANNs, which do not receive additional information on catchment-size related variables but have to learn in the context of local input parameters, measured at the gauging station.
As with river length and catchment size, higher levels of HW, NW and MW were associated with a lower prediction accuracy (RMSE) of ANNs. The relationship between extreme water levels (HW) and ANN accuracy is due to difficulties in predicting the temperatures of water sources entering the stream along its flowpath (e.g. groundwater, hyporheic water, precipitation, anthropogenic water influxes; Nelson & Palmer, 2007; Webb et al., 2008). During spates and high-water events, these water sources contribute different relative quantities to total water volume, and temperature mixing during high water events is then presumably more difficult for ANNs to predict. Additionally, it has been shown that air-water temperature relationships are stronger and more sensitive for flows below median levels (Webb et al., 2003), likely because high water levels lead to a lower water-atmosphere interaction of the surface area compared to total water volume, influencing radiation influx and sensible heat transfer. As a result, depending on surrounding atmospheric temperatures, energy fluxes are often easier to predict for smaller water volumes, which explains the higher prediction accuracy for lower MW and NW values of streams. Hence, the connection between increasing water levels, in particular the HW values, and decreasing accuracy in water temperature prediction by ANNs seems plausible and should be considered when predicting water temperatures in streams during periods of high water.
We found that the land use types semi-natural, forest and water bodies had a positive effect on ANN accuracy. Further, our results showed that high proportions of grassland in the 5 m riparian strip (but not on the entire catchment resolution) correlated with decreasing accuracy (RMSE) in water temperature prediction.
The land use surrounding a stream has a strong influence on its temperature regime and humidity, which controls the water-atmosphere interaction (Webb & Zhang, 1997). It can be assumed that high proportions of grassland facilitated heat-induced evaporation, which can lead to cooling effects especially during high temperature phases (Ouellet & Caissie, 2023), inducing a paradoxical relationship between air temperature (increasing) and water temperature (decreasing). In low temperature phases, this effect is not induced, resulting in an inconsistent relationship between air and water temperature, hence potentially reducing the accuracy of water temperature predictions based on air temperature data.
In general, open-canopy land use such as grassland involves higher levels of radiation and heat fluxes due to a lack of shading and temperature buffering through a micro-climate of complex riparian vegetation.As solar radiation is the most important component of heat transfer in streams (Webb & Zhang, 1997, 1999), accounting for 70% of non-advective heat fluxes in a stream (Webb et al., 2008), open-canopy land use forms are associated with higher air temperatures and lower humidity, which can in turn result in more pronounced temperature extremes and drought conditions in streams.
For example, Rutherford et al. (2004) and Ebersole et al. (2003) attributed a maximum temperature decrease of 4 °C downstream of shaded areas to the effect of riparian vegetation. In a simulation study, Wondzell et al. (2019) determined that shading through a mature forest can account for a decrease of water temperature of 8 °C. Johnson (2004) quantified the net energy transfer in July in a stream in Oregon. Nonshaded, the water temperature gained 580 W/m², but fully shaded, the stream's water lost 149 W/m². Hoess et al. (2022) found that shading by coniferous vegetation could even compensate for a temperature increase caused by pond effluents. Also, without shading, conduction between water and heated alluvial substrates is an often underestimated process influencing stream water temperatures, particularly under forest harvest scenarios (Brown, 1969; Johnson & Jones, 2000). Hence, riparian shading appears to be of paramount importance for controlling and regulating stream water temperatures.
Our findings further demonstrated that for the prediction of water temperatures using an air-water-temperature relationship, the land use in the proximate riparian surroundings (in our case the 5 m riparian strip) seemed more important than the catchment's global land use. Also, Kail et al. (2021) found that large trees in the 10 m riparian strip are a better predictor of water temperature than the width of riparian strips (in their case 30 m), due to the presence of large trees that provided direct shade for the streams and hence cooled the stream water highly effectively. As we showed that prediction accuracy (RMSE) was higher in streams with higher proportions of forest

b. For an optimal outcome, all available input parameters should be tested for their suitability (see recommendations in Figure 9).
2. If water temperature is to be predicted for a specific stream, it might not be sufficient to use open-access data, especially if the stream is characterized by specific environmental parameters, which reduce the accuracy and reliability of water temperature prediction.
3. If the ANN is intended to predict water temperature for a future or past time with different climatic conditions compared to present ones, not only the accuracy but also the reliability of the ANN should be considered in the choice of architecture and input parameters (see recommendations in Figure 9). If it is not possible to test reliability, the RMSE is a good (but not in itself sufficient) predictor of ANN reliability and should hence be used.
Our findings highlight that water temperature predictions are more accurate and reliable in headwater streams closer to their source, especially if adjacent land use comprises forests and natural riparian vegetation that lack anthropogenic influences. The finding that ANN prediction accuracy is distinctly compromised by disturbances in the riparian cover, which commonly accumulate along a river's course, leads us to conclude that the lower ANN accuracy reflects the increasing disturbances in the air-water-temperature relationship. We propose that measures of ANN accuracy, as a proxy for an inconsistent air-water-temperature relationship, could even be used to indicate a functional and resilient water temperature regime in headwater streams. Given the importance of small headwater streams and spring ecosystems as refuges and highly specialized environments that feature a broad width of unique and sensitive species requiring special protection (Cantonati et al., 2012; Richardson, 2019), ANN accuracy measures could serve as an indicative tool to identify, evaluate and monitor headwater streams with regard to their temperature integrity and to support decision making regarding where and how to best protect these unique environments. Further, this research highlights that anthropogenic and, spe-

These combinations were attained by preselecting values for each hyperparameter based on prior experience. As stated above, preselection can reduce the power of the search, so we recommend including as many values as possible.

A.3 | Reliability of ANNs
Since common accuracy metrics only consider the differences between observed and predicted values, they are not suitable for assessing the reliability of the ANN, especially when it comes to changes in the database as expected for climate change scenarios.
Hence, we also applied the reliability methodology as established in essential for the assessment of a model's reliability. In this study, we applied perturbation analysis to simulate changes in the input variables. For that, we perturbed every input value except the date by 0.01 (normalized) and evaluated the changes in the mean output. This reliability method is similar to accuracy metrics (comparison of observed vs. predicted data) but differs in that the input is changed.
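The perturbation step described above can be illustrated with a minimal sketch. It assumes that the date is the first input, that the perturbation is a positive shift of 0.01 on normalized inputs, and that the result is the change in mean output; `toy_predict` is a hypothetical stand-in for a trained ANN and is not the model used in this study.

```python
from statistics import mean

def mean_perturbation(predict, samples, date_index=0, delta=0.01):
    """Change in mean output after shifting every non-date input by delta."""
    original = mean(predict(x) for x in samples)
    perturbed = mean(
        predict([v if i == date_index else v + delta for i, v in enumerate(x)])
        for x in samples
    )
    return perturbed - original

# Hypothetical stand-in for a trained ANN (assumption, for illustration only):
# inputs are normalized day of the year, air temperature and water level.
def toy_predict(x):
    day, air_temp, water_level = x
    return 2.0 * day + 10.0 * air_temp - 3.0 * water_level

samples = [[0.1, 0.5, 0.2], [0.6, 0.7, 0.4], [0.9, 0.3, 0.8]]
result = mean_perturbation(toy_predict, samples)
print(f"mean perturbation: {result:.3f}")
```

For a linear stand-in the result is simply the sum of the non-date weights times the shift; a trained ANN would generally respond non-linearly, which is what the analysis is meant to reveal.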

A.3.2. | MinMax analysis
To assess how reasonably an ANN behaves in its predictions, it is useful to know the range of prediction values that the ANN can display. Therefore, we used MinMax analysis, where we chose random input values between 0 and 1 (normalized) to identify the operational range of each ANN. We optimized the initially chosen input and repeated the method 10 times for each ANN for the minimum and 10 times for the maximum value, respectively.
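A minimal sketch of such a MinMax analysis is given below. The optimizer is not specified in the text, so a simple greedy local search is used here as an assumption; `toy_predict`, the step size and the number of iterations are likewise hypothetical.

```python
import random

def local_search(predict, start, sign, rng, step=0.1, iters=300):
    """Greedy search in the unit hypercube; sign=+1 maximises, sign=-1 minimises."""
    x, best = list(start), sign * predict(start)
    for _ in range(iters):
        candidate = list(x)
        i = rng.randrange(len(x))
        # Shift one input slightly and clip it back into the normalized range.
        candidate[i] = min(1.0, max(0.0, candidate[i] + rng.uniform(-step, step)))
        score = sign * predict(candidate)
        if score > best:
            x, best = candidate, score
    return sign * best

def min_max_range(predict, n_inputs, repeats=10, seed=0):
    """Smallest and largest prediction found from `repeats` random starts each."""
    rng = random.Random(seed)

    def random_start():
        return [rng.random() for _ in range(n_inputs)]

    lows = [local_search(predict, random_start(), -1, rng) for _ in range(repeats)]
    highs = [local_search(predict, random_start(), +1, rng) for _ in range(repeats)]
    return min(lows), max(highs)

# Hypothetical stand-in for a trained ANN on two normalized inputs.
def toy_predict(x):
    return 5.0 + 10.0 * x[0] - 3.0 * x[1]

low, high = min_max_range(toy_predict, n_inputs=2)
print(f"operational range: {low:.2f} to {high:.2f}")
```

Repeating the search from several random starts reduces the risk that a single run stalls in a poor region of the input space.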

A.3.3. | Impact analysis
While the reaction of the ANN to perturbations and its operational range already give a good overview of its reliability, the so-called Impact Analysis, which is a method similar to sensitivity analysis (Zurada et al., 1994), can be used to determine which input the ANN is sensitive to, or, more specifically in our case, can be used to measure the importance of each input feature by determining its contribution to the water temperature calculation.With this information, it can be assessed whether single input parameters are weighted unreasonably high or low and hence predictions of future scenarios might not be reliable.
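One way to picture such an importance measure is a finite-difference sensitivity per input, averaged over samples and normalized to percentages. This is only a sketch under that assumption; the exact procedure of the Impact Analysis is not given here, and `toy_predict` is a hypothetical stand-in for a trained ANN.

```python
def input_importance(predict, samples, delta=0.01):
    """Share (in %) of each input in the summed absolute output sensitivity."""
    n_inputs = len(samples[0])
    totals = [0.0] * n_inputs
    for x in samples:
        base = predict(x)
        for i in range(n_inputs):
            shifted = list(x)
            shifted[i] += delta
            # Finite-difference estimate of how strongly input i moves the output.
            totals[i] += abs(predict(shifted) - base) / delta
    overall = sum(totals)
    if overall == 0:
        raise ValueError("model output does not react to any input")
    return [100.0 * t / overall for t in totals]

# Hypothetical stand-in for a trained ANN; the last input is ignored on purpose.
def toy_predict(x):
    return 6.0 * x[0] + 3.0 * x[1] + 1.0 * x[2] + 0.0 * x[3]

samples = [[0.2, 0.4, 0.6, 0.8], [0.5, 0.5, 0.5, 0.5]]
importance = input_importance(toy_predict, samples)
print([round(v, 1) for v in importance])
```

An input that the model ignores receives an importance of 0%, while a single dominating input would indicate that predictions for future scenarios hinge on one variable.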

A.4 | Principal component analysis
To assess the environmental variables used to distinguish between the 16 assessed streams, we applied a principal component analysis (PCA) based on the normalized environmental variables that we compiled in the environmental dataset. The PCA and all subsequent multivariate analyses were calculated with the statistical software PRIMER v7 & PERMANOVA+ (Anderson et al., 2008).

A.5.1. | Reliability metrics
As described for the accuracy metrics above, we also conducted a correlation analysis of the reliability metrics. To do this, we correlated all the environmental parameters of both resolutions (entire catchment

Grosse Ohe, for the water level of the current day (L) at Prien, and for the water level 3 days before the current day (L:3) at Sulzbach. On the other hand, we observed that for all streams, the sunshine duration (S) for the current and the previous days had no impact (0%), contradicting the findings of the accuracy metrics, in which some streams displayed the highest accuracy when S was included as input parameter.
MinMax analysis: MinMax analysis was applied to define the specific limits of water temperature prediction for each stream. For this analysis, values between 0 and 1 (normalized) were randomly recombined to identify the ANN's minimum and maximum water temperature predictions for each stream. The results of the MinMax analysis were in line with the above results, showing that the maximum and minimum range of the calculated values varied strongly depending on the stream's specifics.

B.3. | Environmental characteristics of sites
The PCA of environmental conditions across the streams (Figure B3) showed that the 16 sites were broadly distributed along multiple environmental gradients. The first PC axis, covering 26.8% of the observed variation (Eigenvalue = 6.44), structured streams primarily according to the proportion of natural and forested vegetation and water bodies in their surroundings. It exemplifies that the streams Grosse Ohe, Soellbach and Bernauer Ache feature a higher share of natural vegetation than such streams as the Scheine, Kleine Vils or Sulzbach. Also hydrological features of the streams investigated, such as NW and HW, were reflected by PC1, with streams in the negative space of PC1 tending to have higher mean and high water levels than those in the positive space.
The second PC axis, making up 16.6% of the observed variation in the data set (Eigenvalue = 3.98), grouped streams largely according to the proportion of agricultural and urban land use (with a high share, for example, along Sulzbach and Wolnzach and a low share in Kirnach, Prien and Illach), while the proportion of grassland in the surroundings and the total length of the river upstream from the sampling site grouped streams in the opposite direction. Detailed proportions of land use are depicted in Table C1. Further information on environmental parameters is depicted in Tables A1 and A2.
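As a consistency check on the reported values, the share of variation explained by a PC axis is its eigenvalue divided by the sum of all eigenvalues, so each eigenvalue and its share imply a total variance. The worked arithmetic below uses only the numbers given above. That the total corresponds to the number of normalized variables is an assumption that holds only if each variable was scaled to unit variance.

```latex
\[
\text{share}_k = \frac{\lambda_k}{\sum_j \lambda_j}
\quad\Longrightarrow\quad
\sum_j \lambda_j = \frac{\lambda_k}{\text{share}_k}
\]
\[
\text{PC1: } \frac{6.44}{0.268} \approx 24.0, \qquad
\text{PC2: } \frac{3.98}{0.166} \approx 24.0, \qquad
\text{PC1} + \text{PC2: } 26.8\% + 16.6\% = 43.4\%
\]
```

Both axes imply the same total of about 24, and together they cover 43.4% of the observed variation, so more than half of the variation lies on further axes.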

B.4. | Environmental predictors of ANN reliability metrics
Regarding the overall catchment resolution (Table B2 top

Not all accuracy vs. reliability metrics correlated significantly. This confirmed the finding of Mohr et al. (2021) that the use of accuracy metrics alone is not sufficient and must be supplemented with reliability metrics. Still, we can conclude that, as an accuracy metric, the RMSE is the most suitable one of those we used to reflect the reliability of an ANN. We are able to conclude this thanks to the significant correlations both to mean perturbation and to the maximum values obtained in the MinMax analysis, while only one significant correlation was demonstrated for R and PBIAS, respectively. We also observed that the RMSE had a greater resolution and hence contributed more significant relationships with environmental parameters than R, probably because it had a higher potential to reflect the high-resolution dynamics of hydrologic parameters. This further increased the benefit of the RMSE and confirms the plausibility of its frequent use for measuring the accuracy of water temperature prediction with ANNs (Ahmadi-Nedushan et al., 2007; Caissie et al., 1998; Chenard & Caissie, 2008; Cho & Lee, 2012; Feigl et al., 2021; Graf et al., 2019; Hadzima-Nyarko et al., 2014; Qiu et al., 2020; Quan et al., 2020; Rabi et al., 2015; Rahmani et al., 2020; Rehana, 2019; St-Hilaire et al., 2000; Zhu, Nyarko, Hadzima-Nyarko, et al., 2019).
In this study, we additionally employed PBIAS as an accuracy metric, which is unusual for water temperature prediction with
). Using the most accurate input combination for each stream based on the RMSE, the RMSE values ranged between 0.373 °C (Aubach) and 1.667 °C (Otterbach), R values ranged between 0.997 (Aubach) and 0.958 (Otterbach), and PBIAS values ranged between −0.767% (Kirnach)

parameters MW and HW. Additionally, four ANN input combinations correlated significantly positively with NW (DT, DTL, DTQ and DTLQ), three ANN input combinations correlated significantly positively with the longest river (DT, DTQ and DTLQ), and three ANN input combinations correlated significantly positively with Dist1 (DTQ, DTQS, DTLS). The R of all input combinations was significantly negatively related to total river length and longest river length (Figure 6a). PBIAS correlated significantly positively with total catchment area (DTLS) and Dist1 (DT) and significantly negatively with DOD (DTLS and DTS)

60.5% of the 2-D configuration of the 12 streams was explained by dbRDA axis 1 and 19.05% by dbRDA axis 2. The DistLM's marginal tests indicated a significant relationship between the multivariate configuration of the streams, based on the three accuracy metrics (RMSE, R, PBIAS) with total river length (prop = 0.35, p < 0.01), longest river length (prop = 0.28, p < 0.01) and Dist1 (prop = 0.20, p < 0.05). Total river length correlated with the negative space of dbRDA axis 1, indicating that decreasing accuracy in terms of RMSE and R (see Figure 8) was significantly correlated with the total length of tributary streams. The streams exemplifying this relationship were the Otterbach in the most negative space of dbRDA axis 1, with a total river length of 41.43 km, contrasting the Aubach with a total river length of 7.91 km, in the upper positive space of dbRDA axis 1. On dbRDA axis 2, streams were mainly separated along a gradient of the parameters: longest river length, Dist1, catchment area as well as proportions of grassland, semi-natural land use and HW. Thus, when relating these findings to the underlying configuration of accuracy metrics, environmental parameters on dbRDA axis 2 were positively associated with overestimation (longest river length, Dist1, catchment area) or underestimation (grassland, semi-natural land use and HW) of water temperature prediction by ANNs.

FIGURE 5 (a) RMSE entire catchment; (b) RMSE 5 m riparian strip. Increasing intensity of red colour indicates increasing correlation (positive as well as negative). Significance is marked with *p < 0.05 and **p < 0.01. Input parameters (x-axis): D, day of the year; T, air temperature; Q, discharge; L, water level; S, sunshine duration; DTQLS, allinputs. Environmental parameters (y-axis): as described in Section 2.6.

frequent use for measuring the accuracy of water temperature prediction with ANNs (Ahmadi-Nedushan et al., 2007; Caissie et al., 1998; Chenard & Caissie, 2008; Cho & Lee, 2012; Feigl et al., 2021; Graf et al., 2019; Hadzima-Nyarko et al., 2014; Qiu et al., 2020; Quan et al., 2020; Rabi et al., 2015; Rahmani et al., 2020; Rehana, 2019; St-Hilaire et al., 2000; Zhu, Nyarko, Hadzima-Nyarko, et al., 2019). We additionally employed PBIAS as an accuracy metric, which is unusual for water temperature prediction with ANNs. Although we saw advantages of including the PBIAS due to the different aspects of model performance it highlights, in this study we were not able to find any general significant correlations between the assessed environmental parameters and the PBIAS. This might be because the PBIAS in general reflects variation in two directions, but in our study the direction of estimation (over- or underestimation) did not necessarily correlate with the examined environmental parameters in only one direction. The overestimation was pronounced for Kirnach, a stream with a very high proportion of grassland (72.58%) and a very low proportion of semi-natural land cover (0.01%). In contrast, underestimation of water temperature was pronounced for Aurach, a long stream with a large catchment. These findings were also confirmed by the DistLM analysis of environmental predictors of evaluation metrics, in which PBIAS/overestimation was associated with high proportions of grassland, particularly in the 5 m riparian strip. Consequently, it would be advisable to carefully check for both over- and underestimation of water temperature prediction, particularly in catchments with high proportions of open-canopy landscape.

FIGURE 6 (a) R entire catchment; (b) R 5 m riparian strip. Increasing intensity of red colour indicates increasing correlation (positive as well as negative). Significance is marked with *p < 0.05 and **p < 0.01. Input parameters (x-axis): D, day of the year; DTQLS, allinputs; L, water level; Q, discharge; S, sunshine duration; T, air temperature. Environmental parameters (y-axis): as described in Section 2.6.

FIGURE 7 (a) PBIAS entire catchment; (b) PBIAS 5 m riparian strip. Increasing intensity of red colour indicates increasing correlation (positive as well as negative). Significance is marked with *p < 0.05 and **p < 0.01. Input parameters (x-axis): D, day of the year; DTQLS, allinputs; L, water level; Q, discharge; S, sunshine duration; T, air temperature. Environmental parameters (y-axis): as described in Section 2.6.
predicted water temperatures with ANNs (sota-range 0.46 °C to 1.58 °C), only one stream (Otterbach, RMSE = 1.667 °C) had an RMSE slightly higher than the sota-range. Further, 12 streams had an RMSE within the sota-range and three streams had RMSE values even lower than the sota-range, namely Attel (RMSE = 0.453 °C), Aubach (RMSE = 0.373 °C), and Soellbach (RMSE = 0.419 °C). To the best of our knowledge, the RMSE values of Attel, Aubach and Soellbach were the smallest ever reported for stream water temperature prediction using ANNs.

FIGURE 8 distLM-Eval-Enviro-plot, Resemblance: D1 Euclidean distance, Correlation between total river length and negative space of dbRDA axis 1, discrimination of streams by longest river length, Dist1, catchment area, proportions of grassland and semi-natural land use, and HW along dbRDA axis 2. Environmental parameters: Total river length: Sum of lengths of all contributing rivers; Longest river length: Length of the longest contributing river; Land use: agriculture, forest, grassland, semi-natural, urban, water; Catchment: Catchment size of all contributing catchments; Area: Total buffer area; MW: Mean water level; HW: Highest measured water level; NW: Lowest measured water level; MQ: Mean discharge; HQ: Highest measured discharge; Tributaries: Number of tributaries; DOD: Number of days for which data was used; IPO: Number of input data points per output data point; Dist1: Distance between GkD station and DWD station 1; Dist2: Distance between GkD station and DWD station 2. Resolution: entire: Entire catchment; 5 m: 5 m riparian strip.
and semi-natural land use (5 m riparian strip) and semi-natural land use and water body area (entire catchment), it can be assumed that riparian shade stabilizes water temperatures, hence facilitating more accurate prediction, as water temperatures are likely more linearly and consistently related to atmospheric temperatures. Also the proportion of water bodies is likely related to prediction accuracy, due to their temperature-buffering properties in the catchment. Our results imply that larger proportions of open-canopy land use forms and the associated higher radiation and low levels of shading can lead to high levels of temperature variability, potentially hampering ANN accuracy and reliability. Consequently, we advise greater caution when using ANNs for streams in open-canopy landscapes.

5 | CONCLUSIONS
We conclude the following for water temperature prediction in streams with ANNs, based on open-access data:
1. It is possible to use open-access data for water temperature predictions within the sota-range.
a. The use of open-access data, however, comes with the problem that there is only a limited number of parameters. Hence, the choice of streams for which the water temperature is to be predicted is crucial for the accuracy and reliability of the predictions.
cifically, land-use-derived disturbances along stream ecosystems affect stream water temperatures and will consequently exacerbate the climate-change-associated warming of stream water. We have therefore added highly relevant information to the use of ANNs to predict stream water temperatures. In combination with climate change projections, ANNs could prove to be a cost-efficient and invaluable resource for decision makers to use when assessing future developments in stream water temperatures, aiding the evaluation and prioritization of restoration, renaturation and adaptation measures in streams.

on Continuous Verification of Cyber-Physical Systems (GRK 2428).

How to cite this article: Drainas, K., Kaule, L., Mohr, S., Uniyal, B., Wild, R., & Geist, J. (2023). Predicting stream water temperature with artificial neural networks based on open-access data. Hydrological Processes, 37(10), e14991. https://doi.org/10.1002/hyp.14991

APPENDIX A | Additional information on Materials and Methods

A.1 | Searching algorithms Scikit-learn
Our results were obtained using RandomizedSearch, which, given different options, optimized the architecture and hyperparameter combination of the ANNs for each individual input combination and waterbody. RandomizedSearch can deliver a lower search quality than GridSearch, since it evaluates only a random subset of the candidate hyperparameter combinations and can therefore end up at a local optimum, whereas GridSearch evaluates every combination and thus finds the optimum of the predefined grid. Still, as the ANN accuracy in our study showed no weakness compared with the results in the literature, we can support the use of RandomizedSearch, since it requires considerably less computing power and time. It is important to note, however, that even better results for the RMSE might be obtained with GridSearch and so it may be worth investing more time if fewer streams and less data needs to be processed or the time and capacity investment does not play a relevant role.

Difference between RandomizedSearch and GridSearch: While in RandomizedSearch, random sets of hyperparameters are used and tested, GridSearch tests all possible hyperparameter combinations systematically. The process can be accelerated by preselecting hyperparameters to reduce the total number of hyperparameter combinations. Of course, this again reduces the power of the search. In conclusion we suggest not using GridSearch if time and/or computing power are limited (see Figure 9).

A.2 | Results of RandomizedSearch
Using scikit-learn's RandomizedSearch, we determined an ANN with the hyperparameter combination leading to the lowest RMSE, for each waterbody and input combination. The RMSE values for all ANNs determined by RandomizedSearch are presented in Figure A1, according to waterbody. Figure A2 shows the same information but sorted by input combination. In Tables A3, A4, A5, and A6, these results are sorted by waterbody. The tables show which input combination for each stream reached what accuracy measures based on which hyperparameter combination. The abbreviations stand for the hyperparameters as indicated in the table below.
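The difference between the two search strategies described in A.1 can be sketched in plain Python. The search space and the scoring function `toy_rmse` are hypothetical and only stand in for training an ANN and measuring its RMSE; in scikit-learn, the corresponding classes are `GridSearchCV` and `RandomizedSearchCV`.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
space = {
    "hidden_units": [8, 16, 32, 64],
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32],
}

def toy_rmse(params):
    """Stand-in for training an ANN and returning its validation RMSE."""
    return (abs(params["hidden_units"] - 32) / 64
            + abs(params["learning_rate"] - 0.01)
            + abs(params["batch_size"] - 32) / 64)

def grid_search(space, score):
    """Evaluate every combination of the predefined grid."""
    keys = list(space)
    combos = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    return min(combos, key=score), len(combos)

def randomized_search(space, score, n_iter, seed=0):
    """Evaluate n_iter randomly drawn combinations (drawn with replacement here)."""
    rng = random.Random(seed)
    combos = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]
    return min(combos, key=score), len(combos)

grid_best, grid_evals = grid_search(space, toy_rmse)
rand_best, rand_evals = randomized_search(space, toy_rmse, n_iter=8)
print(grid_evals, toy_rmse(grid_best))
print(rand_evals, toy_rmse(rand_best))
```

The grid search needs 4 × 3 × 2 = 24 evaluations to guarantee the best combination of the grid, whereas the randomized search uses a fixed budget of 8 and can only be as good or worse; the gap in cost grows quickly with every additional hyperparameter value.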
values for all input combinations per waterbody.

FIGURE A2 RMSE values for all streams per input combination. Input parameters: D, day of the year; DTQLS, allinputs; L, water level; Q, discharge; S, sunshine duration; T, air temperature.

TABLE A1 Environmental parameter for distLM I.
and 5 m riparian strip) with the mean perturbation values as well as with the results of the MinMax analysis, that is, the minimum possible values and the maximum possible values for the ANN with the allinputs combination for each stream.

A.5.2. | Accuracy versus reliability metrics
Finally, we conducted a correlation analysis between the accuracy metrics RMSE, R, and PBIAS and the reliability metrics mean perturbation, minimum of MinMax analysis and maximum of MinMax analysis, to determine whether and to what extent the accuracy and reliability metrics agree with and/or complement each other.

B | Additional information on Results

B.1. | Assessment of ANNs
Impact analysis: The mean importance, as an indicator of the contribution of individual parameters for water temperature prediction, showed that the impact of individual input parameters strongly varied among streams (Figure B1). Calculating the mean over all streams, we observed that the water level of the current day (L) was the most important, with a value of 14%, followed by the mean air temperature of the closest DWD station of the current day (mean_St1), with a value of 11%. The greatest individual importance value of 35% was determined for the mean air temperature of the closest DWD station for the current day (mean_St1) at

), we determined a significantly positive correlation between mean perturbation and HW (r = 0.67, p < 0.01) as well as a significantly negative relationship between the minimum values of the MinMax analysis and DOD (r = −0.60, p < 0.05) and between the maximum values of the MinMax analysis and semi-natural (r = −0.55, p < 0.05). For the 5 m riparian strip resolution (see Table B2 bottom), there was a significantly positive correlation between mean perturbation and grassland (r = 0.52, p < 0.05) as well as between the minimum values of the MinMax analysis and grassland (r = 0.50, p < 0.05).

FIGURE B1 Impact analysis for all input parameters in each stream. Inputs on x-axis indexed as below. Whiskers mark 95% confidence intervals and bars mark mean importance for each input. St indicates the station from which air temperature was received (St1 = DWD station closest to GkD gauging station, St2 = DWD station second-closest to GkD gauging station). The "addendum.No" indicates how many days prior to D the data is from (0.1 = the day before D, 0.2 = 2 days before D, 0.3 = 3 days before D). Input values: D, day of the year; L, water level; max, maximum air temperature; mean, mean air temperature; min, minimum air temperature; Q, discharge; S, sunshine duration.

FIGURE B2 Comparison of calculated and observed minimum and maximum values for all waterbodies. Calculated values were determined by MinMax analysis, observed values were retrieved from the datasets.

TABLE B1 Correlations between evaluation and assessment metrics.

C. | Additional information on the Discussion

C.1. | Accuracy and reliability
ANNs. Although we see advantages in combining different accuracy metrics and including the PBIAS due to the different aspects of model performance it highlights, in this study we were not able to find any general correlations between the environmental parameters
Principal component analysis plot, all stream sites broadly distributed along multiple environmental gradients. Streams structured by PC axis 1 mainly according to the proportion of natural and forested vegetation and waterbodies in their surroundings, as well as by NW and HW. Structure by PC axis 2 mainly according to proportion of agricultural and urban land use.

TABLE B2 Correlation analysis of environmental parameters versus robustness measures, entire catchment and 5 m riparian strip.
and the PBIAS. This might be because the PBIAS reflects variation in two directions, but the direction of estimation (over- or underestimation) does not necessarily correlate with an environmental parameter in only one direction. Even though the PBIAS showed no consistent trends in the correlation analysis with either environmental parameters or reliability metrics, the significant correlation between the PBIAS and the maximum values of the MinMax analysis showed that with increasing MinMax-max values, the ANNs tended to overestimate water temperature. This overestimation was pronounced for Kirnach, a stream with a very high proportion of grassland (72.58%) and a very low proportion of semi-natural land cover (0.01%). In contrast, underestimation of water temperature was pronounced for Aurach, a long stream with a large catchment. These findings were also confirmed by the DistLM analysis of environmental predictors of evaluation metrics, in which PBIAS/overestimation was associated with high proportions of grassland, particularly in the 5 m riparian strip. Consequently, it would be advisable to carefully check for both over- and underestimation of the water temperature prediction, particularly in catchments with high proportions of open-canopy landscape (Figures C1-C6).
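Why the PBIAS highlights a different aspect of model performance than RMSE and R can be shown with a small example. The formulas of this study are not reproduced in this section, so the definitions below are common textbook forms and an assumption; in particular, the PBIAS sign is chosen so that positive values mean overestimation, and the opposite convention is also in use. The temperature values are invented for illustration.

```python
from math import sqrt

def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def pearson_r(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / sqrt(var_o * var_p)

def pbias(obs, pred):
    """Percent bias; assumed sign convention: positive means overestimation."""
    return 100.0 * sum(p - o for o, p in zip(obs, pred)) / sum(obs)

# Invented water temperatures: every prediction is exactly 1 degree too warm.
observed = [10.0, 12.0, 14.0]
predicted = [11.0, 13.0, 15.0]

print(rmse(observed, predicted), pearson_r(observed, predicted), pbias(observed, predicted))
```

In this example R is perfect although every prediction is too warm; only the PBIAS reveals the direction of the error, which is why its sign has to be checked separately for over- and underestimation.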

FIGURE C1 Cumulative barplot illustrating the shares of land use in the 5 m riparian strip.

FIGURE C2 Cumulative barplot illustrating the shares of land use in the entire catchment.

FIGURE C3 Catchments (a) Abens, (b) Attel, (c) Aubach, and (d) Aurach.

FIGURE C4 Catchments (a) Bernauer Ache, (b) Grosse Ohe, (c) Illach, and (d) Kirnach.

FIGURE C5 Catchments (a) Kleine Vils, (b) Otterbach, (c) Prien, and (d) Scheine.
Evaluation of the most suitable ANN for each waterbody, as determined by RandomizedSearch.
Also, grassland was significantly positively associated with all ANN input combinations except allinputs and DTQS. Semi-natural land use, in contrast, was significantly negatively correlated with RMSE values

TABLE 1 Note: Column titles: Stream, name of examined stream; Inputs, input combination used; RMSE, R, PBIAS, evaluation metrics as defined in Formulas 1, 2 and 3. Abbreviations: D, day of the year; DTQLS, allinputs; L, water level; Q, discharge; S, sunshine duration; T, air temperature.
TABLE 2 Top: Frequency of input combinations used for the best ANNs as depicted in Table 1. Bottom: Frequency of input parameters used in input combinations in Top.
Note: Sunshine duration was only available for 14 streams.
Distances (km), distances between GkD gauging station and DWD station; DOD, number indicating how many days served as data basis for training and testing; Gauging station, name of GkD gauging station from which water temperature, discharge and water level were obtained; IPO, maximum number of input values per output value; Stations, DWD stations from which air temperature data was used, bold indicates that sunshine duration was available (value from closer station preferred if possible); if no station is indicated in bold, no sunshine duration was available; Stream, name of stream investigated.

We applied perturbation analysis to test the degree to which the ANNs' predictions changed when historical input data varied by 0.01 (normalized). The mean perturbation value over all streams was 2.620 ± 2.109 °C, with the highest mean perturbation observed in Otterbach (9.981 °C) and the lowest in Wolnzach (0.985 °C).
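As an illustration of the idea behind such a perturbation analysis, the following minimal sketch shifts every normalized input by ±0.01 and averages the absolute change in the predicted temperature. The exact procedure used in this study is not given in this section, so the function `mean_perturbation`, the shifting of all inputs at once, and the linear `toy_model` standing in for a trained ANN are all assumptions made for this example.

```python
from statistics import mean

def mean_perturbation(predict, rows, delta=0.01):
    """Mean absolute change in prediction when all normalized inputs
    of a sample are shifted by +delta and by -delta (assumed scheme)."""
    changes = []
    for row in rows:
        base = predict(row)
        for sign in (1, -1):
            shifted = [x + sign * delta for x in row]
            changes.append(abs(predict(shifted) - base))
    return mean(changes)

def toy_model(row):
    # Hypothetical stand-in for a trained ANN: maps two normalized
    # inputs to a water temperature in deg C with fixed weights.
    weights = [12.0, 3.0]
    return 5.0 + sum(w * x for w, x in zip(weights, row))

# Hypothetical normalized input samples (e.g. air temperature, discharge)
rows = [[0.2, 0.5], [0.6, 0.1], [0.9, 0.8]]
sensitivity = mean_perturbation(toy_model, rows)
print(round(sensitivity, 4))
```

For a linear stand-in the result is simply the sum of the absolute weights times the shift. A trained ANN is nonlinear, so the value would vary between samples, and the mean over all samples summarizes how strongly the predictions react to small input changes.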
Agriculture, forest, grassland, semi-natural, urban, water: land use; Area, total area of riparian strip; Area, total buffer area; Catchment, total size of all contributing catchments; D, day of the year; Dist1, distance between GkD station and DWD station 1; Dist2, distance between GkD station and DWD station 2; DOD, number of days for which data was used; DTQLS, allinputs; HQ, highest measured discharge; HW, highest measured water level; IPO, number of input data points per output data point; L, water level; Longest river length, the length of the longest contributing river; Max, maximum determined by MinMax-analysis; Min, minimum determined by MinMax-analysis; MQ, mean discharge; MW, mean water level; NW, lowest measured water level; Perturbation, mean perturbation determined by perturbation analysis; Q, discharge; S, sunshine duration; T, air temperature; Total river length, sum of lengths of all contributing rivers; Tributaries, number of tributaries. *p < 0.05; **p < 0.01; ***p < 0.001.

TABLE C1 Land use in entire catchment and in 5 m riparian strip. Note: Top: land use in entire catchment. Bottom: land use in 5 m riparian strip for whole river. Abbreviations: Agriculture, forest, grassland, semi-natural, urban, water: proportion of land use in percent, for 5 m riparian strip as mean over all arms; Mean river length: (for 5 m riparian strips) if stream contained more than one arm, this is the mean of the lengths of the arms in km; Stream, name of stream investigated; Total river length, sum of lengths of all contributing rivers in km.