Learning from mistakes—Assessing the performance and uncertainty in process‐based models

Abstract Typical applications of process-based or physically based models aim to gain a better process understanding or to provide the basis for a decision-making process. To adequately represent the physical system, models should include all essential processes. However, model errors can still occur. Apart from large systematic observation errors, simplified, misrepresented, inadequately parametrised or missing processes are potential sources of error. This study presents a set of methods and a proposed workflow for analysing errors of process-based models as a basis for relating them to process representations. The evaluated approach consists of three steps: (1) training a machine-learning (ML) error model using the input data of the process-based model and other available variables, (2) estimation of local explanations (i.e., contributions of each variable to an individual prediction) for each predicted model error using SHapley Additive exPlanations (SHAP) in combination with principal component analysis, and (3) clustering of the SHAP values of all predicted errors to derive groups with similar error generation characteristics. By analysing these groups with different error-variable associations, hypotheses on error generation and the corresponding processes can be formulated. That can ultimately lead to improvements in process understanding and prediction. The approach is applied to the process-based stream water temperature model HFLUX in a case study for modelling an alpine stream in the Canadian Rocky Mountains. Using available meteorological and hydrological variables as inputs, the applied ML model is able to predict model residuals. Clustering of the SHAP values results in three distinct error groups that are mainly related to shading and vegetation-emitted longwave radiation. Model errors are rarely random and often contain valuable information.
Assessing model error associations is ultimately a way of enhancing trust in implemented processes and of providing information on potential areas of improvement to the model.


| INTRODUCTION
There will always be model errors (residuals) apparent in any prediction, as models are only a crude representation of reality. While these errors might be random, resulting from randomness in the investigated system (i.e., aleatoric uncertainty), there is also the possibility that they are related to the formulation of the model and thus to the chosen process representations (i.e., epistemic or systematic uncertainty; Beven, 2016).
This non-randomness of residuals can occur due to (1) simplified, misrepresented or inadequately parameterised processes, (2) missing processes, or (3) measurement errors. (1) would also include any issues related to scaling (Blöschl & Sivapalan, 1995). Systematic measurement errors can be mitigated or excluded by adequate calibration procedures, frequent checks of the sensors and post-processing. The other two potential error sources are more difficult to investigate, and in a typical process-based model application the occurrence of these errors is mainly identified using the investigator's knowledge of and experience with the model and the modelled system (Kirchner, 2006; Klemeš, 1986). While these errors might be negligible in some situations and with an experienced investigator, they potentially lead to unrecognised systematic errors and correspondingly biased conclusions. It is thus important to have an objective method to analyse model residuals to avoid systematic errors and enhance trust in the applied model. This is especially important if the results are used in some sort of decision-making process (Jemberie, 2004), especially due to the dangers of uncritical application (Beven, 1989).
This study specifically focuses on model residuals and their relationship to processes represented in process-based models. Therefore, we assume the model structure/formulation and all its parameters to be fixed. Hence, we aim to investigate how the performance of a fully set up process-based model can be used to gain knowledge and to enhance trust in its usefulness for a given application.
While there exists a range of studies that applied model residual predictions to improve process-based model outputs (e.g., Babovic et al., 2001; Demissie et al., 2009; Konapala et al., 2020; Wu et al., 2019; Xu et al., 2014; Young et al., 2017), to our knowledge only Jemberie (2004) investigated the relationship of model-predicted residuals with observed variables to gain insight into process-based models, by using information theory and machine-learning approaches to predict model errors. He developed a two-step approach that included (1) an assessment of the information content of model residuals and a selection of related observed variables using the average mutual information measure and (2) the prediction of model errors using the selected observed variables and a neural network. When combining (1) and (2) with physical insight, patterns of errors can potentially be traced back to physical processes. Based on the same ideas as formulated in Jemberie (2004), we here present an approach that allows assessing not only the general relationship between errors and observed variables, but also the relationship of each particular model error to observed variables, to further gain physical insight from model errors. This is made possible by recent developments in explainable machine-learning methods (Molnar et al., 2020), which have resulted in several methods to estimate variable importance for machine-learning models.
This study presents the learning from mistakes (LFM) workflow for analysing errors of process-based models as a basis for relating them to process representations. The evaluated approach consists of: (1) training a machine-learning error model using the input data of the process-based model and other available variables, (2) estimation of local explanations (i.e., contributions of each variable to an individual prediction) for each predicted model error using SHapley Additive exPlanations (SHAP; Lundberg et al., 2017, 2020) in combination with principal components (PCs), and (3) clustering of the SHAP values of all predicted errors to derive groups with similar error generation characteristics. By analysing these groups with different error-variable associations, hypotheses on error generation and the corresponding processes can be formulated.
To show the general applicability of our proposed methods for assessing error-variable relationships, we apply the LFM workflow in a case study of an alpine stream in the Canadian Rocky Mountains with the process-based stream water temperature model HFLUX (Glose et al., 2017). Water temperature in first-order streams is sensitive to a multitude of energy exchange processes between water and air and between water and sediment, including radiation, turbulent heat fluxes, bed conduction, and groundwater exchange (Moore et al., 2005). Therefore, this case study serves as a useful example to demonstrate the model sensitivity to errors in various model variables representing different physical processes. It is used to propose hypotheses for error generation and to define situations of low and large uncertainty of model predictions. The model and study site were chosen due to our detailed knowledge of the study area, the measurements and the model implementation from a previous study.
In the following, we introduce the LFM workflow for objectively assessing the relationship between model errors and available variables.
The originality of the study includes (1) a novel approach that allows model developers, users and stakeholders to gain further insight into the applied process-based model, (2) an extension to SHAP values to estimate local explanations for correlated variables, and (3) an application in a case study for simulating water temperature.

| MATERIALS AND METHODS
The following section is separated into three distinct parts describing (1) the LFM workflow, (2) PCA SHAP and (3) the case study with the process-based stream water temperature model HFLUX. The first part includes a depiction of the workflow which is generally applicable for any process-based model, the second part contains a description of our proposed extension to SHAP values for correlated datasets and the third part includes modelling choices that are specific to the presented case study.

| Preliminary steps
The LFM workflow assumes a fixed model setup, model residuals and a set of variables that are potentially associated with the model residuals. These variables will in most cases consist of the model input variables as well as any available variables that could be relevant for the modelled processes. A prerequisite for deriving insights from model residuals is their association with these variables, or, more generally, their non-randomness. Therefore, before starting the LFM workflow, a linear regression model should be applied together with an F-test to check for significant associations between model residuals and variables. In case there are no significant associations, hence no linear relationships, a very simple ML-model (e.g., random forests) can be used to check if at least some of the model error can be predicted.
If this still does not produce a significantly improved prediction, the assumption of non-randomness might not hold.
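This preliminary check can be sketched directly: the overall F-test of a linear regression of residuals on the candidate variables. All data below are synthetic stand-ins and the variable names are hypothetical; in a real application, `X` would hold the observed variables and `residuals` the process-based model errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical stand-ins: process-model residuals and a matrix of
# candidate explanatory variables (n time steps x p variables).
n, p = 500, 4
X = rng.normal(size=(n, p))
residuals = 0.8 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

def residual_f_test(X, y):
    """Overall F-test of a linear regression of residuals on the variables."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)           # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    p_value = stats.f.sf(f_stat, p, n - p - 1)
    return f_stat, p_value

f_stat, p_value = residual_f_test(X, residuals)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
```

A small p-value indicates a significant linear association, so the workflow can proceed; otherwise a simple non-linear ML-model would be tried next.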
The LFM workflow is shown in Figure 1 and starts with the process-based model application and model residual generation (Figure 1, parts 1-4). In this study we define residuals as the values resulting from subtracting observed values from predicted values; hence, positive residuals imply overestimation and negative residuals imply underestimation.
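This sign convention can be illustrated with a few hypothetical values:

```python
import numpy as np

observed = np.array([8.2, 8.5, 9.1])   # hypothetical stream temperatures (°C)
predicted = np.array([8.6, 8.4, 9.7])  # hypothetical model output (°C)

# Residuals as defined here: predicted minus observed, so positive values
# indicate overestimation and negative values underestimation.
residuals = predicted - observed
```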

| Machine learning models for residual prediction
After deriving residuals, we can set up a machine-learning model (Figure 1, part 5) that predicts model residuals from available observed variables. However, before using the observations as input for the ML-model, a principal component analysis (PCA, Pearson, 1901) is applied to derive their principal components (PCs).
PCA converts possibly correlated variables into a set of linearly uncorrelated variables. Deriving uncorrelated features is necessary, as dependency structures between input variables can potentially lead to biased variable usage for predictions in the applied ML-model; that is, in the extreme case, variables are not used by the model since other highly correlated variables are used instead. The PCA is applied on standardised inputs, that is, subtracting the mean of the values of a variable and dividing the resulting centred values by their standard deviation. The resulting PCs are linear combinations of these standardised inputs. Hence, the ith PC can be written as PC_i = Σ_j γ_ij x_j, where x_j denote the standardised input variables and γ_ij their factor loadings. Factor loadings of PCs of standardised inputs are the correlation coefficients between variables and PCs. A PCA results in the same number of PCs as original variables, which can be directly used as input features for the ML-model.
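A minimal sketch of this transformation, using synthetic stand-ins for correlated meteorological inputs (the variable names are hypothetical), could look as follows; the eigendecomposition of the correlation matrix is one of several equivalent ways to compute a PCA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated inputs (e.g., air temperature, shortwave radiation,
# relative humidity), n time steps each.
n = 300
t_air = rng.normal(15, 5, n)
swr = 0.8 * t_air + rng.normal(0, 2, n)   # correlated with air temperature
rh = -0.5 * t_air + rng.normal(60, 5, n)
X = np.column_stack([t_air, swr, rh])

# Standardise: zero mean, unit standard deviation per variable
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via eigendecomposition of the correlation matrix
corr = (Z.T @ Z) / (n - 1)
eigval, eigvec = np.linalg.eigh(corr)
order = np.argsort(eigval)[::-1]          # sort by explained variance
eigval, eigvec = eigval[order], eigvec[:, order]

pcs = Z @ eigvec                          # PCs: linear combinations of inputs
# Factor loadings: correlations between standardised inputs and PCs
loadings = eigvec * np.sqrt(eigval)
```

The columns of `pcs` are mutually uncorrelated and can be passed directly to the ML-model.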
Originally, some PCA properties assume normality of the input variables, but a violation of normality should not affect the usefulness of PCA (Jolliffe, 2002). Since the reason for applying PCA in the LFM workflow is to create independent variables, independence can also be checked explicitly in case non-linear dependency structures in the data are believed to exist. Possible measures of dependency that could be applied are Hoeffding's D (Hoeffding, 1948) and the maximal information coefficient (Reshef et al., 2011).

FIGURE 1 Learning from mistakes workflow. ① The available observations. ② The process-based model that should be investigated. ③ The simulation results compared to the observed values. ④ The resulting model residuals. ⑤ The ML-model that is used to predict model residuals from PCA-transformed observed variables. ⑥ The predicted residuals and the local explanations derived from the ML-model. ⑦ Importance of single variables for a cluster derived from the local explanations. The value of each variable in these plots shows the median local explanation for the corresponding cluster.
Setting up an adequate ML-model for residual prediction is arguably the most challenging part of this workflow but will in most cases be solvable with ready-to-use models from widely used ML-libraries.
The supporting information (Data S1) includes a brief guide on how to choose an ML-model for residual prediction.

| Computing local explanations
An ML-model that can adequately predict model residuals shows that there is valuable information contained in the available data.
To further gain insight into the dependencies between individual residuals and ML-model inputs, it is necessary to derive local explanations, that is, the contribution of each ML-model input to an individual prediction (Baehrens et al., 2010; Lundberg et al., 2020). Local explanations are estimated here with SHAP, as their interpretation is consistent with human intuition (Lundberg et al., 2017). The SHAP method and its extension for correlated input variables developed for the LFM workflow, PCA SHAP, will be described in detail in Section 2.2.

| Error clusters
This final step aggregates the local explanation values into comprehensible error groups, which are the basis for deriving hypotheses on error-cluster-to-process relationships. It consists of finding clusters with similar underlying error generation associations, that is, clusters of residuals with similar local explanation patterns (see Figure 1, part 7). This can be achieved by a range of clustering algorithms, for example, hierarchical clustering (Ward, 1963) or k-means (MacQueen, 1967), in combination with methods to assess cluster separation, for which several possibilities exist (e.g., Davies & Bouldin, 1979; Dunn, 2008; Rousseeuw, 1987). To find the optimal number of clusters and the most adequate clustering method, we propose to assess the performance of multiple clustering algorithms with different numbers of clusters using silhouette coefficients (Rousseeuw, 1987). Silhouette coefficients are a measure that describes how similar an object is to its cluster. They can range from −1 to 1 and are defined as the normalised difference between the mean distance of a point to its own cluster and the mean distance of a point to its neighbouring cluster. High positive values indicate adequate separation; negative values indicate poor cluster quality.
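The selection of the number of clusters via silhouette coefficients can be sketched as follows. For brevity only k-means (via SciPy's `kmeans2`) is tried here, and the SHAP-value matrix is a synthetic stand-in with three built-in groups; in the workflow, multiple algorithms would be compared the same way.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)

# Hypothetical SHAP-value matrix (n residuals x p variables) with three
# built-in groups, standing in for the local explanations of the error model.
shap_values = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(100, 4))
    for c in ([0, 0, 0, 0], [2, 2, 0, 0], [0, 0, 2, 2])
])

def mean_silhouette(X, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) averaged over points."""
    D = cdist(X, X)
    scores = []
    for i, lab in enumerate(labels):
        own = labels == lab
        if own.sum() < 2:
            scores.append(0.0)
            continue
        a = D[i, own].sum() / (own.sum() - 1)    # mean distance to own cluster
        b = min(D[i, labels == other].mean()     # nearest other cluster
                for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Try several cluster counts and keep the best-separated solution
results = {}
for k in range(2, 6):
    _, labels = kmeans2(shap_values, k, seed=7, minit="++")
    results[k] = mean_silhouette(shap_values, labels)
best_k = max(results, key=results.get)
```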
To this point, all steps of the LFM workflow can generally be applied to any model setup. However, understanding the relationships between modelled processes and errors and deriving error generation hypotheses depends strongly on the model and the type of application. A model developer will most likely be interested in different process-error relationships than a user who wants to show that only a certain type of process is related to model errors and that the model can thus be used for her/his specific use case. In the following case study, we provide examples on how to approach this task, but we also emphasise that even though the LFM workflow results in evidence for error-process relationships, a certain understanding of the model and its processes is necessary to interpret these.

| PCA SHAP
SHAP is based on Shapley values (Shapley, 1953), a concept in game theory that estimates the marginal contribution of every player to the total payout of a collaborative game. Applied to machine learning, the players are the inputs x ∈ ℝ^K of the ML-model and the payout is the prediction of the model f, so that the SHAP value of an input quantifies its contribution to an individual prediction. Using the PCs of the original variables as ML-model inputs adds a transformation step to the SHAP value estimations and restricts the computation to the kernel SHAP implementation instead of the faster Tree SHAP (Lundberg et al., 2020) or Deep SHAP (Lundberg et al., 2017). To obtain explanations in terms of the original variables, the SHAP values of the PCs are transformed back using the PCA loadings, ψ_j = Σ_i γ_ij φ_i, where φ_i denotes the SHAP value of the ith PC and γ_ij the factor loadings.
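Assuming the SHAP values of the PCs have already been estimated (e.g., with kernel SHAP), the back-transformation to variable-level explanations is a simple matrix product. The sketch below uses a linear stand-in model, for which SHAP values on independent features are known in closed form, and hypothetical loadings:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
pcs = rng.normal(size=(n, p))              # uncorrelated PC scores

# For a linear model on independent features, the exact SHAP value of
# feature i is beta_i * (pc_i - mean(pc_i)); used here as a stand-in for
# kernel SHAP estimates.
beta = np.array([1.5, -0.7, 0.2])          # stand-in linear error model
shap_pc = beta * (pcs - pcs.mean(axis=0))  # SHAP values per PC (n x p)

# gamma[i, j]: loading of variable j on PC i (hypothetical values)
gamma = np.array([[0.8, 0.5, 0.3],
                  [-0.4, 0.7, 0.1],
                  [0.2, -0.3, 0.9]])

# Back-transform: variable-level explanation psi_j = sum_i gamma_ij * phi_i
shap_var = shap_pc @ gamma
```

Note that `shap_var` trades the SHAP local accuracy property for unbiased per-variable explanations, as discussed in Section 4.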

| Case study: Stream water temperature modelling
This case study assesses the uncertainties and error-process relationships of the process-based stream water temperature model HFLUX in an application for simulating water temperature in a small headwater stream in the Canadian Rocky Mountains. The corresponding field work, data collection, data preparation and model setup are described by Roesky and Hayashi (2022). The following will give a brief overview of the study area, the model setup and all specifics of the LFM application.

| Study site and data
The study site is a headwater stream located on the eastern slopes of the Canadian Rocky Mountains. Figure 3 shows the measurement point that was finally used for the error model in this study; for the positions of the other measurement points, refer to Roesky and Hayashi (2022).
Discharge was measured continuously at the four gauging stations (GS i , Figure 3). Meteorological data was available from an automatic weather station (AWS1, 2084 m.a.s.l., Figure 3) installed by the University of Saskatchewan located in the meadow next to the three headwater springs.
It measured air temperature, relative humidity, shortwave and longwave radiation, canopy temperature and wind speed. The western tributary (at GS3, Figure 3) was treated as a distinct water source with measured discharge and water temperature.

| Stream water temperature modelling
To model stream water temperature along the study stream, the onedimensional heat transfer model HFLUX (Glose et al., 2017) was used.
HFLUX solves the differential equation

∂(A T_w)/∂t = −∂(Q T_w)/∂x + Q_L T_L + (W q_t)/(ρ_w c_w),

where A (m²) is the cross-sectional area of the channel, Q (m³ s⁻¹) the discharge, T_w (°C) the stream water temperature, Q_L (m³ s⁻¹ m⁻¹) and T_L (°C) the lateral inflow per unit stream length and the inflow temperature, W (m) the width of the stream, q_t (W m⁻²) the total energy flux to the stream, ρ_w (kg m⁻³) the density of water and c_w (J kg⁻¹ °C⁻¹) the specific heat of water. The total energy flux is the sum of all incoming energy fluxes, which are listed in Table 1.

The error model was chosen to be XGBoost (Chen & Guestrin, 2016), a regression-tree-based boosting algorithm. This choice was made based on our experience with stream temperature prediction from a previous study (Feigl, Lebiedzinski, et al., 2021), which showed that the performance of most of the tested ML-models depended mainly on the input data and the hyperparameter optimisation. Only recurrent neural networks (RNNs) showed different behaviour, as their structure already allows for sequences of inputs instead of tabular data. Preliminary results, however, showed that RNNs (LSTMs and GRUs) had a lower performance compared to XGBoost. We trained it on 13 days of data with a 5 times repeated 10-fold cross-validation and tested it on different randomly selected time steps distributed over the whole study time period with a total length of 5 days. The model hyperparameters were estimated by Bayesian hyperparameter optimisation (Kushner, 1964; Močkus, 1989; Snoek et al., 2012). The properties of all error clusters are investigated and used to propose hypotheses on error generation. This is followed by a test in which the model is adapted without further calibration to minimise these process-related errors.

Note: The empirical constants for height of measurement refer to constants that depend on the heights at which wind speed and vapour pressure are measured (Dingman, 1994; Glose et al., 2017).
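The 5 times repeated 10-fold cross-validation used for the error model can be sketched as follows. Since XGBoost is an external dependency, a ridge-regression stand-in is used here purely to illustrate the validation structure; the data are synthetic placeholders for the PC-transformed inputs and the HFLUX residuals.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical stand-in data: PC-transformed inputs and model residuals.
n, p = 400, 6
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.3, size=n)

def ridge_fit(X, y, lam=1.0):
    """Ridge-regression stand-in for XGBoost (the study used boosted trees)."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

def ridge_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def repeated_kfold_rmse(X, y, k=10, repeats=5, seed=0):
    """Repeated k-fold cross-validation: mean out-of-fold RMSE."""
    rng = np.random.default_rng(seed)
    rmses = []
    for _ in range(repeats):
        idx = rng.permutation(len(y))
        for fold in np.array_split(idx, k):
            train = np.setdiff1d(idx, fold)
            beta = ridge_fit(X[train], y[train])
            pred = ridge_predict(beta, X[fold])
            rmses.append(np.sqrt(np.mean((y[fold] - pred) ** 2)))
    return float(np.mean(rmses))

cv_rmse = repeated_kfold_rmse(X, y)
```

In the actual workflow, the Bayesian hyperparameter optimisation would wrap this cross-validation loop, selecting the XGBoost settings that minimise `cv_rmse`.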

| Error-variable relationships and clustering
An example of the PCA SHAP values plotted together with the HFLUX model residuals and the predicted residuals for August 8-9 is shown in Figure 6. For these days, differences in the error patterns between night and daytime can be observed. For understanding these patterns, it is also necessary to keep in mind that even an input variable value of 0 can have a non-zero PCA SHAP value, as such variables are still used for prediction, for example, shortwave radiation at night.
It is also notable that the variable hour usually has low importance, pointing to the fact that there are only few systematic errors that cannot be described with the other available variables. While this representation gives insights into the error-variable relationships, it is hard to interpret due to its complexity, especially when examining the full study time period. This motivates the use of cluster analysis to aggregate these results and find distinct groups of errors that can be further examined.
Clustering results showed the best performance in terms of silhouette coefficient values for the k-means algorithm with 3 clusters, with a mean silhouette coefficient value of 0.35. Figure 7a shows all

| Error clusters and hypotheses
In this section we examine each of the derived error clusters and propose hypotheses of their relation to processes. The PCA SHAP values for the full study period can be found in the Data S1 (supporting information S2).

| Cluster 1: Decreasing discharge and morning air temperatures
The characteristics of cluster 1 are summarised in Figure 8a-c. The results in Figure 8b show that the errors of cluster 1 are strongly related to the variables discharge, shortwave radiation, air temperature, relative humidity and wind speed. They mainly occur in the late morning and are almost always positive, showing that cluster 1 errors mainly occur during warming periods of the stream, in which HFLUX consistently predicts a more rapid increase in stream temperature than observed.
Since air temperature, shortwave radiation and relative humidity are highly correlated (shown in Figure 4), we can observe a similar importance of these variables for predicting cluster 1 errors in Figure 8b. However, canopy temperature does not seem to be as important, even though it is also highly correlated with these variables (mean correlation of 0.78). From these results we propose the following hypothesis for the error-process relationship of cluster 1: These errors occur due to a combination of (1) measurement uncertainties during decreasing discharge, and (2) presumably faster warming of the meadow, in which the meteorological station is located, compared to the conditions in the river reach to which HFLUX is applied.

| Cluster 2: Vegetation-emitted longwave radiation
The characteristics of cluster 2 are shown in Figure 10a-c. This cluster has the lowest residuals, which occur during the night or on cloudy days, thus in times of low or no shortwave radiation input. Both under- and overestimation are present. Besides the low or absent shortwave radiation, Figure 10b shows that canopy temperature, longwave radiation, relative humidity and discharge are associated with the prediction of these errors. During cluster 2 error occurrence, relative humidity is always quite high, longwave radiation tends to be higher than average and canopy temperature is generally lower than or equal to air temperature, with a mean difference of −1.1°C. From these results we propose the following hypothesis for the error-process relationship of cluster 2: Cluster 2 errors are mainly related to longwave radiation energy fluxes and, to a smaller part, to sensible and latent heat energy fluxes. The longwave radiation flux in HFLUX is estimated from incoming atmospheric longwave radiation and vegetation-emitted longwave radiation. Since HFLUX uses air temperature to estimate vegetation-emitted longwave radiation, differences between air and canopy temperature will lead to over- and underestimation of this flux depending on the sign of the differences. The influence of relative humidity, air temperature and wind speed also shows that some errors are associated with uncertainties in latent and sensible heat fluxes.
This is most likely due to differences between the measurements made over the meadow compared to the conditions around the modelled stream reach.
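As a rough, illustrative calculation (the emissivity value is an assumption and HFLUX's exact parametrisation may differ), the Stefan-Boltzmann law indicates how much the observed mean canopy-air difference of −1.1°C changes an emitted longwave flux:

```python
# Illustrative sensitivity of vegetation-emitted longwave radiation to the
# temperature used in the Stefan-Boltzmann law.
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant (W m^-2 K^-4)
EPS = 0.96              # assumed vegetation emissivity (hypothetical value)

def emitted_longwave(temp_c):
    """Grey-body emission for a temperature given in degrees Celsius."""
    return EPS * SIGMA * (temp_c + 273.15) ** 4

t_air = 10.0             # °C, an example air temperature
t_canopy = t_air - 1.1   # mean canopy-air difference for cluster 2
delta = emitted_longwave(t_air) - emitted_longwave(t_canopy)
```

Under these assumptions the flux difference is on the order of a few W m⁻², which illustrates why substituting air temperature for canopy temperature can produce systematic over- or underestimation of this flux.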

| Cluster 3: Canopy temperature
The characteristics of cluster 3 are shown in Figure 12a-c. Cluster 3 errors are characterised by medium to high shortwave radiation, high air and canopy temperature and medium to low relative humidity.
The absolute PCA SHAP values in Figure 12b show that these errors are mainly driven by shortwave radiation, air temperature, relative humidity, wind speed, discharge and canopy temperature. They are mainly negative and occur in the afternoon; thus, stream temperature is underestimated in times of high shortwave radiation. Figure 13 shows an example time series for cluster 3 for 8-9 August 2019. Cluster 3 errors always occur during times in which canopy temperature is much higher than air temperature and end when the two are nearly equal again. They are characterised by low or increasing relative humidity. Shortwave radiation is usually high at the beginning of each period but decreases to zero towards its end.
From these results we propose the following hypothesis for the error-process relationship of cluster 3: Cluster 3 errors mainly result from a combination of errors in shortwave radiation and longwave radiation. Due to the timing and importance of shortwave radiation, we assume that the constant shading value estimated by HFLUX is too high in the time between 12:00 and 17:00. Since shortwave radiation represents the largest energy input into the system, uncertainties in shading can result in large underestimations. In addition to shading-related errors, it can also be assumed that a smaller part of the error is related to longwave radiation. During cluster 3 errors, measured canopy temperature is higher than air temperature (median difference 1.4°C, maximum difference 14.7°C), leading to an underestimation of the vegetation-emitted longwave radiation.

| Error hypotheses testing
Based on the proposed hypotheses we assume that a large part of the observed errors is related to two energy fluxes and their representation in the process-based model: the shortwave radiation and longwave radiation energy fluxes. To test this assumption, we modified the HFLUX model to include a two-point shading factor and used measured canopy temperature instead of air temperature to estimate vegetation-emitted longwave radiation. We tested these hypotheses with adjustments to HFLUX and its inputs without further calibration and could show that the errors related to these processes were reduced.
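A two-point shading factor can be sketched as a simple time-dependent function; both shading values below are hypothetical illustrations, not the values used in the study:

```python
# Sketch of a two-point shading factor: a single constant is replaced by two
# values depending on time of day. The numbers are hypothetical; the actual
# adjustment was a rough estimate based on field experience.

def shading_factor(hour, midday=0.4, offpeak=0.7):
    """Return the shading fraction; lower shading between 12:00 and 17:00."""
    return midday if 12 <= hour < 17 else offpeak

def effective_swr(swr_in, hour):
    """Shortwave radiation reaching the stream after shading."""
    return (1.0 - shading_factor(hour)) * swr_in
```

Lowering the midday shading increases the afternoon energy input, which is the direction needed to reduce the systematic underestimation attributed to cluster 3.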
The LFM workflow is applicable to any process-based model and will mainly differ in the type of selected ML-model. However, the usage of the derived error-process relationships will depend on the type of application. The simple adjustment to the shading factor, which is a rough estimate based on experience from field work in the catchment, is most likely not the solution that best fits the amount of shading observed at every part of the stream. However, it shows the importance of temporal variability in shading when modelling the thermal system of a stream. In a study that focuses on an application related to shading, it would be necessary to include measurements to estimate shading along the stream at different times of the day, for example, with LiDAR (Loicq et al., 2018), hemispherical images (Garner et al., 2017), or a densiometer (Gravelle & Link, 2007).
The proposed PCA SHAP method results in explanations that also reflect feature dependence in the data. By using PCs of the data as input for the ML-model, the assumption of feature independence will always be true. Since this assumption is necessary for SHAP (Aas et al., 2021), we also know that the resulting SHAP values reflect an unbiased variable-error relationship. While computing single-variable PCA SHAP values from PCA loadings results in a loss of the SHAP local accuracy property, it allows for values that reflect unbiased explanations of all variables for a single prediction. Furthermore, the usage of multiple clustering methods and numbers of clusters in combination with a cluster quality metric allows for objectivity and thus enhances trust in the derived clusters. Iterative procedures like k-means can still produce slightly different results each time they are applied, but this will not change the general cluster properties and thus their interpretation.
Model errors contain valuable information that should be used to further understand physical processes and process-based models. By applying the presented workflow, it should be possible to derive data-based hypotheses on error generation for any type of model. Especially the PCA extension for SHAP values should make it applicable to any dataset where correlated features might heavily influence interpretability. This approach allows objective, data-driven methods to be used to further understand and develop process-based models. ML-models have already been widely applied to estimate model residuals of process-based models to improve predictions (e.g., Konapala et al., 2020; Mekonnen et al., 2015; Wu et al., 2019; Xu et al., 2017). At the same time, process-based models are especially important for tasks in which ML-models are not applicable, that is, understanding processes and/or predicting the results of changes in the system. Thus, only improving prediction will not assist these types of applications. Recent work by Bennett and Nijssen (2021) couples a process-based model and a neural network with an explainable machine-learning method to further understand a specific process. Similarly, the LFM workflow was developed to enhance and improve understanding of process-based model applications for any type of model.
Besides these applications, the improvement of model performance by an error ML-model can potentially be seen as a quantity of model limitation or error. It represents the information that is still contained in the data, which could potentially be used to improve the model. Similarly, it is also interesting that the error ML-model does not achieve perfection. The difference to a RMSE of zero could be seen as a quantification of other sources of uncertainties, such as uncertainties in the input data, or additional non-known controlling variables. These quantities can potentially lead to further evidence for model development and process understanding.
This study has potential limitations. Applying the LFM workflow needs in-depth knowledge of the physical system and its processes, the ML-model input data, and, to some extent, machine-learning prediction models. This can be a potentially limiting factor if expertise in one of those areas is not available. To mitigate this limitation, we provide the python code and data to reproduce the presented results, thereby providing a blueprint and template for other applications.
Another limitation is the validity of the resulting hypotheses on error generation. Without additional tests or measurements, these hypotheses cannot be verified or falsified with certainty. In simple cases, the resulting hypothesis will be sufficient and obviously reasonable; in other cases this could lead to wrong assumptions.

FIGURE 14 Comparison of observed and predicted time series of stream temperatures. Observed values are shown in blue, initial HFLUX predictions in orange and HFLUX predictions with the new shading scheme and measured canopy temperatures as the basis for longwave radiation estimation are shown in green.

| CONCLUSIONS
In this work we found that learning from model errors is possible and a worthwhile pursuit when applying process-based models for gaining knowledge or decision making. The developed approach of estimating explanations for ML-models that are not influenced by feature dependence is convenient and can also be used for other ML applications in hydrology. In a next step it would be interesting to apply the procedure to more complex models with a specific research question related to either decision making or process understanding. It would also be interesting to investigate whether the error-process relationships and the error model can be transferred to another river or river branch.