On the evaluation of climate change impact models

In‐depth understanding of the potential implications of climate change is required to guide decision‐ and policy‐makers when developing adaptation strategies and designing infrastructure suitable for future conditions. Impact models that translate potential future climate conditions into variables of interest are needed to establish the causal connection between a changing climate and its impacts on different sectors. Recent surveys suggest that the primary strategy for validating such models (and hence for justifying their use) heavily relies on assessing the accuracy of model simulations by comparing them against historical observations. We argue that such a comparison is necessary and valuable, but not sufficient to achieve a comprehensive evaluation of climate change impact models. We believe that a complementary, largely observation‐independent, step of model evaluation is needed to ensure more transparency of model behavior and greater robustness of scenario‐based analyses. This step should address the following four questions: (1) Do modeled dominant process controls match our system perception? (2) Is my model's sensitivity to changing forcing as expected? (3) Do modeled decision levers show adequate influence? (4) Can we attribute uncertainty sources throughout the projection horizon? We believe that global sensitivity analysis, with its ability to investigate a model's response to joint variations of multiple inputs in a structured way, offers a coherent approach to address all four questions comprehensively. Such additional model evaluation would strengthen stakeholder confidence in model projections and, therefore, in the adaptation strategies derived with the help of impact models.


1 | INTRODUCTION
Human activity has become a geologic-scale force, changing landscapes and climate at increasing rates in our effort to supply societies' growing demand for water, energy and food. A fundamental scientific and societal question of our time is: How will this activity alter water, energy, and biogeochemical cycles, and when and where will vulnerability thresholds critical for society and our environment be reached (Gleeson et al., 2020; Rockström et al., 2021)? To address these questions, we need impact models that can reliably translate climate signals into decision-relevant environmental variables (while also considering other possible changes, e.g., to land use) defining future boundary conditions for society and ecosystems. Information provided by impact models across appropriate space-time scales guides decision- and policy-makers in developing adaptation strategies, long-term planning, and infrastructure design suitable for future conditions (Barron, 2009).
Any impact model is an imperfect representation of the underlying system and will unavoidably be affected by uncertainties introduced along the modeling chain, including those from the climate projections or impact model parameters (Clark et al., 2016; Hakala et al., 2018; Hattermann et al., 2018; Singh et al., 2013; Smith et al., 2018; Wilby & Dessai, 2010). A critical question in the context of using such models is how they have been evaluated regarding both their ability to perform their task adequately, and their suitability as a system representation in the first place (Gupta et al., 2012; Klemeš, 1986; Seibert et al., 2000; Wagener et al., 2010). The process of addressing these questions is often referred to as model validation, although the use of this term has been criticized because it suggests that a model can be established as being true, which is not possible when dealing with open systems (Oreskes et al., 1994) or with highly complex systems that can be aggregated into much simpler simulation models in many different ways. We therefore prefer to use the term evaluation, instead of validation, to acknowledge that we can only ever achieve an incomplete and conditional assessment of a model's suitability (as suggested by Oreskes et al., 1994). In a recent review of validation methods and practices for resource management models, Eker et al. (2018) found that observation-based strategies dominate, even in the context of climate change impact assessment (or "scenario modeling"): that is, a model's ability to reproduce observed system responses over a historical period is taken as evidence that the model is a valid representation of reality and hence suitable for subsequent use. When observations of the predicted variables of interest are not available, for example, because they cannot be measured at relevant scales (e.g., groundwater recharge; Reinecke et al., 2021), or because the region is ungauged (Blöschl et al., 2013), or because the model output is a highly conceptualized variable with no real-world equivalent (e.g., some global integrated assessment variables; Butler et al., 2014), impact models are often not evaluated at all.
We claim that observation-based strategies as defined above are useful but not sufficient for comprehensive model evaluation and that an additional (essentially observation-independent) evaluation stage is necessary to gain confidence in future impact model projections. For reasons that will become apparent in the remainder of the paper, we call this additional stage a "response-based" evaluation. In developing our argument, we find it helpful to clearly distinguish two aims of model evaluation in the context of scenario modeling (Figure 1): model evaluation as an attempt to establish that the model is an adequate representation of the real-world system; and/or to establish that the model is adequate for a specific task (such as prediction of a single decision-relevant variable). These two aims can sometimes be in conflict, as we discuss below.
2 | OBSERVATION-BASED MODEL EVALUATION STRATEGIES

2.1 | Is the model an adequate representation of the real-world system?

First, the modeler might want to demonstrate that a model is an adequate representation of the real-world system (Figure 2). The dominant strategy to achieve this is an assessment of a model's accuracy, that is, a quantification of the fit between simulated and observed time-series or spatial patterns of a target variable, such as streamflow or the frequency of hot days. This comparison is most often made using one (or more) statistical objective functions in which the differences between observed and simulated variables are aggregated over time and/or space using summary metrics such as the root mean squared error or Nash-Sutcliffe efficiency (Bennett et al., 2013). Some previous studies have suggested performance thresholds related to such metrics to decide whether a model is adequate or not. For example, Moriasi et al. (2007) suggest fixed quality thresholds for metrics such as the Nash-Sutcliffe efficiency. However, other studies have shown that the use of a single performance threshold for an objective function across different systems can be highly misleading (Knoben et al., 2019), because it is much easier to achieve high performance metrics for some catchments than for others (Van Werkhoven et al., 2008). Flexible performance benchmarks have therefore been suggested as more meaningful alternatives (Schaefli & Gupta, 2007). A more physically meaningful evaluation strategy is the use of so-called system signatures (Gupta et al., 2008; Hrachowitz et al., 2014). System signatures can be defined as indices that provide insight into the functional behavior of the underlying system and thus how well the model matches this behavior. Examples of signatures in hydrology are the runoff ratio (defined as the long-term fraction of precipitation that leaves a catchment as streamflow) or the slope of the flow duration curve (which describes how damped a catchment response is) (McMillan, 2021). Such signature-based evaluation is a step toward attempting to understand the consistency between the simulation model and our current perception of the underlying real-world systems (Euser et al., 2013; Wagener et al., 2021).

FIGURE 1 Conceptualization of impact model evaluation strategies based on both observation-based and response-based strategies. Fct., function

FIGURE 2 Schematic representation of accuracy (How close are predictions to observations?) and precision (What is the variability of model predictions?) in the context of simulating past and future time periods
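To make these metrics and signatures concrete, the following minimal Python sketch (using purely synthetic, illustrative data) computes the Nash-Sutcliffe efficiency, the runoff ratio, and the slope of the flow duration curve for paired observed and simulated daily streamflow series. The flow duration curve slope is evaluated between the 33% and 66% exceedance levels in log space, which is one common convention; other definitions exist in the signature literature.

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency: 1 = perfect fit, 0 = no better than the observed mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def runoff_ratio(precip, flow):
    """Long-term fraction of precipitation leaving the catchment as streamflow (same units, e.g., mm/day)."""
    return np.sum(flow) / np.sum(precip)

def fdc_slope(flow, lower=33, upper=66):
    """Slope of the flow duration curve between two exceedance levels (log space);
    steeper slopes indicate a flashier, less damped catchment response."""
    q_low, q_high = np.percentile(flow, [100 - upper, 100 - lower])  # flows at the two exceedance levels
    return (np.log(q_high) - np.log(q_low)) / ((upper - lower) / 100.0)

# Illustrative synthetic data (one year of daily values)
rng = np.random.default_rng(42)
precip = rng.gamma(shape=0.8, scale=5.0, size=365)                 # mm/day
obs_flow = 0.4 * precip + rng.normal(0, 0.3, 365).clip(0) + 0.1    # a crude "observed" response
sim_flow = 0.38 * precip + 0.15                                    # a crude "model"

print("NSE          :", round(nash_sutcliffe(obs_flow, sim_flow), 3))
print("Runoff ratio :", round(runoff_ratio(precip, obs_flow), 3))
print("FDC slope    :", round(fdc_slope(obs_flow), 3))
```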
However, a simple comparison to historical observations is problematic in the context of climate change impact modeling because the real-world system might evolve with changing climatic boundary conditions, thus requiring different model parameters to reproduce system behavior under different (future) climates (Merz et al., 2011, for hydrologic models, or Rosero et al., 2010, for land surface models). How and with what data an impact model is calibrated might therefore strongly define future projections (Müller Schmied et al., 2014). Various studies have therefore used some type of resampling of the past (Fowler et al., 2018) or trading-space-for-time strategies (Singh et al., 2013) to better reflect future conditions during model calibration/evaluation. Other modelers prefer to run their models (especially across large domains) in uncalibrated mode, thus relying on their physical realism. This, however, regularly leaves significant performance gaps between models and observed behavior for climate change impact projections, and eliminating ensemble members that perform poorly on historical observations should increase stakeholder confidence in the ensemble projections (Krysanova et al., 2018), although using all ensemble members, regardless of historical performance, is rather common in large-scale impact studies (Gosling et al., 2017). In either approach, there is the additional problem that any discrepancy in historical model performance might not carry over in the same way to future model projections. For example, Milly and Dunne (2017) demonstrated how projected future hydrologic drying and wetting trends might be systematically overestimated or underestimated through the choice of evapotranspiration routine.

2.2 | Is the model adequate for the task it is intended for?
Second, the modeler might want to justify a model as being adequate for the task at hand (Figure 1). Modelers might see the second aim in connection with the first, assuming that if the model is an adequate representation of the system, then it should be adequate for any practical task, or they might see it as a more practical alternative to the first aim when that is difficult to achieve. The latter approach reflects the basic sentiment of "all models are wrong, but some are useful" (Box, 1976), and thus focuses on establishing that a model is mainly suitable for a chosen purpose. In fact, as any model is an imperfect representation of reality, and thus unavoidably includes model structural issues that often do not allow it to match multiple system observations or different objective functions simultaneously (Kollet et al., 2012), the modeler may have to make choices regarding how far to sacrifice one objective for another. For example, Wagener and McIntyre (2005) showed that a rainfall-runoff model can become less and less believable as a hydrologic catchment representation the more it is optimized for a specific water resources management task.
A general comparison of the model to historical data, for example, using the standard statistical objective functions mentioned above, might ignore the intended use of the model, even though one might expect it to be a primary driver of the evaluation strategy. For example, Brunner et al. (2021) showed that model evaluation with a widely used statistic of overall fit-to-data does not ensure that a hydrologic model is suitable for flood hazard assessment. Klemeš (1986) introduced multiple ideas for validation strategies in relation to intended model use, for example, related to modeling land-use change. These ideas are still rarely fully implemented. Moreover, focusing on fitting historical data emphasizes model performance rather than model robustness under unavoidable uncertainties. Lastly, comparing the model to historical data might ignore the way stakeholders gain trust in model predictions, especially when modeling change (Eker et al., 2018). Stakeholders might, for example, care strongly whether the modeler's perception of the real-world system reflects an understanding that is consistent with their own, that is, they might ask whether the perceptual model underlying the simulation model is consistent with their experiences (Mahmoud et al., 2009).

3 | BEYOND MODEL EVALUATION BASED ON FIT-TO-OBSERVATIONS
Here we claim that the above strategies based on fitting historical observations are necessary and useful (when appropriate observations are available), but by no means sufficient to demonstrate either that a model is a suitable system representation or that it is adequate for the task. We suggest that they should be complemented with strategies that demonstrate the "internal consistency" (Oreskes et al., 1994) of the model, that is, that the model's input-output response is sufficiently consistent with our current perception of the underlying system (Wagener & Gupta, 2005) and the intended use of the model (Klemeš, 1986).
Our discussion connects closely to the concepts of accuracy and precision, though applied to model predictions instead of measurements (Figure 2). A model is accurate if its predictions are close to the observed system output. It is precise if the variability of predictions across ensemble members is narrow. Key to our discussion here is whether we assess accuracy and precision for past periods (for which we can have historical observations), or whether we are looking at projections for future periods for which observations do not yet exist. Accuracy can therefore be assessed for the past but not for the future, whereas precision, and importantly its causes, can be assessed for both past and future periods. In this short piece, we argue that the evaluation of precision and its causes should be a key element in climate change impact model evaluation.
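As a minimal numerical illustration of this distinction (with entirely hypothetical numbers), the sketch below measures accuracy as the root mean squared error of an ensemble mean against historical observations, and precision as the spread across ensemble members; only the latter can also be computed for a future period without observations.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_past = rng.normal(10.0, 2.0, 50)                       # historical observations (hypothetical)
ens_past = obs_past + rng.normal(0.5, 1.0, (20, 50))       # 20 ensemble simulations of the past
ens_future = 12.0 + rng.normal(0.0, 2.5, (20, 50))         # 20 ensemble projections of the future

# Accuracy: requires observations, so it is only defined for the past period
accuracy_rmse = np.sqrt(np.mean((ens_past.mean(axis=0) - obs_past) ** 2))

# Precision: spread across ensemble members, computable for past and future alike
precision_past = ens_past.std(axis=0).mean()
precision_future = ens_future.std(axis=0).mean()

print(f"accuracy (past RMSE): {accuracy_rmse:.2f}")
print(f"precision (past spread): {precision_past:.2f}, precision (future spread): {precision_future:.2f}")
```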
We argue that one strategy to perform this observation-free evaluation of precision is through the use of global sensitivity analysis (GSA). GSA describes a set of mathematical tools that enable a modeler to systematically investigate how variations in the inputs of a model (including parameters, forcing, or initial conditions) propagate into variations in the model's output(s) (Saltelli et al., 2000). A typical result of GSA is a set of sensitivity indices, each measuring the relative contribution of a varied input to the variability (uncertainty) of a model's output(s). Some GSA methods also provide information about the input-output mapping, such as threshold values for the inputs or sub-regions of the input variability space that map into particular output ranges, for example, extreme values. In the remainder of this paper, we will discuss why it is advantageous to use GSA for impact model evaluation, including why it inherently involves understanding the influence of multiple interacting factors. We will structure our discussion around four specific evaluation questions in the context of impact assessment (named consistency, elasticity, leverage, and attribution), discussing them first in general terms and then through selected, previously published, examples.
In brief, GSA essentially comprises four key steps (Figure 3):

1. Characterizing variability/uncertainty in the model inputs, including both numerical inputs such as forcing data and parameters, and "conceptual" ones such as modeling choices and assumptions. Depending on the nature of the input and the information available, input uncertainty may be represented through a probability distribution, a range of variability, an ensemble, or a list of plausible values.
2. Generating input combinations, typically through statistical sampling from the input distributions defined in Step 1.
3. Executing the model for all input combinations and calculating one (or more) summary output metric(s) for each input combination. Importantly, for future periods, the metric can be defined using only the model output itself (e.g., frequency of occurrence of an event or exceedance of a threshold). The output sample can be used to quantify the output variability/uncertainty.
4. Analyzing the input-output dataset to derive a set of sensitivity indices (or input-output mapping information). The definition of the sensitivity indices, and the calculation procedure to approximate their values from the input-output dataset, vary depending on the GSA method. For example, some methods use correlations between input and output samples to measure sensitivity, while other methods measure sensitivity to an input through the reduction in output variance (or some other statistic) when fixing that input.

Starting points to learn about the different methods and their practical implementation can be found in Saltelli et al. (2000) and Pianosi et al. (2016). We refer to Norton (2015) for analytical approaches to sensitivity analysis and for the use of emulators in this context.

FIGURE 3 The four key steps of global sensitivity analysis (GSA)
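The following self-contained Python sketch walks through these four steps for a deliberately simple, hypothetical "impact model" (a closed-form function standing in for a real simulation model). The input names, ranges, and the toy model are illustrative only, and the binning estimator used in Step 4 is just one crude way to approximate first-order, variance-based sensitivity indices; the references cited above describe more rigorous and efficient estimators.

```python
import numpy as np

# --- Step 1: characterize input variability (here: uniform ranges for three hypothetical inputs) ---
# 'snow_factor' and 'soil_capacity' stand in for model parameters, 'precip_change' for a forcing perturbation.
ranges = {"snow_factor": (0.5, 2.0), "soil_capacity": (50.0, 500.0), "precip_change": (-0.3, 0.3)}

# --- Step 2: generate input combinations by Monte Carlo sampling ---
rng = np.random.default_rng(1)
n = 20_000
X = np.column_stack([rng.uniform(lo, hi, n) for lo, hi in ranges.values()])

# --- Step 3: run the model for every input combination and compute a scalar output metric ---
# A deliberately simple stand-in "impact model": mean annual flow anomaly as a nonlinear function of the inputs.
def toy_impact_model(x):
    snow, soil, dp = x[:, 0], x[:, 1], x[:, 2]
    return (1.0 + dp) * 800.0 - 0.8 * soil + 60.0 * snow * (1.0 + dp)

y = toy_impact_model(X)

# --- Step 4: estimate first-order sensitivity indices ---
# S_i = Var( E[Y | X_i] ) / Var(Y), approximated by binning each input and averaging Y within bins.
def first_order_indices(X, y, n_bins=50):
    var_y = y.var()
    indices = []
    for i in range(X.shape[1]):
        edges = np.quantile(X[:, i], np.linspace(0, 1, n_bins + 1))
        which = np.clip(np.digitize(X[:, i], edges[1:-1]), 0, n_bins - 1)
        cond_means = np.array([y[which == b].mean() for b in range(n_bins)])
        indices.append(cond_means.var() / var_y)
    return indices

for name, s in zip(ranges, first_order_indices(X, y)):
    print(f"S1[{name}] = {s:.2f}")
```

In a real application, Step 3 would call the actual impact model (usually the computationally dominant part of the analysis), and the output metric could be any decision-relevant summary of the simulated time series.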

3.1 | Consistency: Do modeled process controls match our system perception?
We argue that the "internal consistency" (between model behavior and perception of the real-world system), which Oreskes et al. (1994) demanded, cannot be demonstrated without looking at the input-output relationship (or "response surface") established by the model. GSA enables us to analyze this input-output relationship and to understand which model parameters (and therefore modeled processes) dominate the variability in the model output, even (if needed) disaggregated in space and time (Wagener & Pianosi, 2019). We can then ask whether the dominant processes in the model are consistent with those we expect to dominate in the real-world system and, with respect to climate change studies, which of the model parameters (and therefore modeled processes) show sensitivity to changing climatic boundary conditions (Confalonieri et al., 2010). Clearly, we have no way to measure the correct dominant processes under potential future conditions, though we might have empirical or theoretical arguments to support our expectations. For example, Pappas et al. (2013) utilized GSA to analyze the consistency in the behavior of an ecosystem model, LPJ-GUESS, with both expectations and empirical evidence (Figure 4a). Their analysis revealed that some processes exerted an unexpectedly high or low control on the model output, thus suggesting a need for model structural improvement. Other examples used GSA to reveal an unexpected climate dependence of environmental model parameters (Rosero et al., 2010; Van Werkhoven et al., 2008), implying that this dependence must be considered for scenario analyses. The integration of signatures, discussed in Section 2.1, with GSA can be particularly insightful (Van Werkhoven et al., 2008). There are interesting methodological similarities with pattern-oriented modeling or the use of stylized facts from other areas of modeling, such as ecology and integrated assessment, that have yet to be explored (Grimm et al., 2005; Schwanitz, 2013).

3.2 | Elasticity: Is my model's sensitivity to changing forcing as expected?
How natural or coupled natural-human systems respond to changing climatic forcing has challenged impact modelers for a long time (Nemec & Schaake, 1982). For water resources systems, Schaake (1990) introduced the concept of elasticity, originally from economics, to quantify the sensitivity of streamflow to changes in precipitation. Others subsequently defined and studied elasticities with respect to other climatic forcing variables (see review in Sankarasubramanian et al., 2001) using historical data (Vogel et al., 1999) or simulation models (Nijssen et al., 2001; Vano et al., 2012). We use the term elasticity here to describe the analysis of the extent to which a model's response surface is controlled by (potential) climatic forcing. For example, Saltelli et al. (2020) disentangled how different factors influence streamflow in the Colorado River, highlighting how the warming-driven loss of reflective snow increases evapotranspiration. Such factor interactions suggest that GSA should be a helpful tool to disentangle the diverse interactions of uncertain forcing and modeled system responses.
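To illustrate, one widely used nonparametric estimator of the precipitation elasticity of streamflow (in the spirit of Sankarasubramanian et al., 2001) takes the median, over all years, of the ratio of relative streamflow anomalies to relative precipitation anomalies. The sketch below applies it to synthetic annual data; in a response-based evaluation, the same estimator could be applied to model output generated under perturbed forcing to check whether the modeled elasticity is plausible.

```python
import numpy as np

def precip_elasticity(P, Q):
    """Nonparametric precipitation elasticity of streamflow:
    median over years of (dQ / Q_mean) / (dP / P_mean)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    dP, dQ = P - P.mean(), Q - Q.mean()
    return np.median((dQ / Q.mean()) / (dP / P.mean()))

# Synthetic annual precipitation (mm) and streamflow (mm) for 40 years (illustrative only)
rng = np.random.default_rng(7)
P = rng.normal(900.0, 120.0, 40)
Q = 0.002 * P**1.8 + rng.normal(0.0, 15.0, 40)   # a convex P-Q relation, elasticity of roughly 1.8

print(f"precipitation elasticity of streamflow: {precip_elasticity(P, Q):.2f}")
```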
Approaching climate impact assessment as a bottom-up (or "scenario-neutral"; Prudhomme et al., 2010) rather than a top-down (scenario-based) problem has added some energy to the assessment of plausible input (forcing) spaces for impact models (Figure 4b). Bottom-up strategies start by defining stakeholder-relevant thresholds and subsequently determine which combinations of climatic forcing produce model outputs above or below these thresholds (Brown et al., 2012; Poff et al., 2016; Wilby & Dessai, 2010). Such bottom-up approaches have been proposed as more appropriate than scenario modeling given the "deeply" uncertain future we face in a non-stationary world. Deep uncertainty refers to a lack of knowledge or a lack of agreement regarding the probability distribution of parameters describing a system, its boundary, or the system itself (Lempert & Collins, 2007). Such bottom-up approaches have been applied widely, including in hydrology (Singh et al., 2014), ecology (Poff et al., 2016), water resources systems analysis (Ghile et al., 2014; Quinn et al., 2020), and natural hazards studies (Almeida et al., 2017).
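A minimal sketch of such a bottom-up exploration is given below; the variable names, the stand-in response surface, and the threshold are all hypothetical. The idea is simply to sample a space of forcing perturbations, evaluate the impact model for each combination, and record which parts of that space push a stakeholder-relevant output beyond a critical threshold.

```python
import numpy as np

# Hypothetical forcing-perturbation space: temperature change (degC) and relative precipitation change (-)
rng = np.random.default_rng(3)
n = 5_000
d_temp = rng.uniform(0.0, 5.0, n)
d_precip = rng.uniform(-0.4, 0.4, n)

# Stand-in impact model: relative change in low flow as a simple function of the perturbations
def low_flow_change(d_temp, d_precip):
    return 1.6 * d_precip - 0.07 * d_temp   # illustrative response surface only

# Stakeholder-relevant threshold (hypothetical): a low-flow reduction of more than 20% is critical
critical = low_flow_change(d_temp, d_precip) < -0.20

print(f"fraction of sampled forcing space that is critical: {critical.mean():.2f}")
print(f"smallest warming among critical combinations: {d_temp[critical].min():.2f} degC")
```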

3.3 | Leverage: Do modeled decision levers show adequate influence?
A key purpose of impact models is to clarify the influence of human decisions on the modeled system output under climate change, especially in the context of developing adaptation or intervention strategies (Beltrame et al., 2021). We might, for example, want to understand how much land-use choices like deforestation/reforestation impact the level of downstream flooding under future climate conditions, or we might want to know the value of increased air conditioning in reducing human losses from excessive heat. We, therefore, must demonstrate that the parameters reflecting these intervention levers (such as those describing human actions) exert an adequate control on the model output, consistent with our current understanding (Butler et al., 2014). In other words, when assessing leverage, we investigate the impact of those model factors that are under our control, and included as such in our model, in contrast to elasticity, which focuses on the system forcing outside our control.
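The sketch below illustrates this contrast with a hypothetical water-shortage example: a decision lever and two uncontrollable factors are varied jointly, and a crude sensitivity proxy (squared correlation, adequate here only because the toy model is nearly linear) indicates how much influence the lever has relative to the forcing.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20_000

# Joint sampling of one decision lever and two uncontrollable factors (all names hypothetical)
reservoir_release = rng.uniform(0.0, 1.0, n)     # lever: fraction of storage released in dry months
precip_scaling = rng.uniform(0.6, 1.1, n)        # forcing: future precipitation scaling
demand_growth = rng.uniform(1.0, 1.5, n)         # socio-economic: water demand multiplier

# Stand-in model: frequency of water shortage, here dominated by precipitation rather than the lever
shortage_freq = np.clip(0.5 * (1.1 - precip_scaling) + 0.15 * (demand_growth - 1.0)
                        - 0.05 * reservoir_release, 0.0, 1.0)

# For this (nearly linear) toy case, squared correlation is a crude proxy for first-order sensitivity
for name, x in [("reservoir_release (lever)", reservoir_release),
                ("precip_scaling (forcing)", precip_scaling),
                ("demand_growth", demand_growth)]:
    r = np.corrcoef(x, shortage_freq)[0, 1]
    print(f"sensitivity proxy for {name}: {r**2:.2f}")
```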
For example, Hadjimichael et al. (2020) set up a diagnostic evaluation framework with GSA at its core to study water scarcity vulnerabilities in institutionally complex river basins (Figure 4c). The authors assessed how hydrological/climatic, institutional, and demand factors impact the frequency and duration of water shortages in a subbasin of the Colorado River, USA. Figure 4c depicts a result from their study in which the authors demonstrated that the tributaries shown are likely more sensitive to natural streamflow availability than to human-controlled water demand for an extremely dry year, 2002. Saltelli et al. (2020) subsequently used this approach for exploratory modeling of water scarcity vulnerability under plausible future conditions. An early application of this concept is the study of Pastres et al. (1999), who applied GSA to a water quality model for a shallow water system, and found that the main decision-relevant factor, nitrogen load, was less influential on the occurrence of anoxic crises than the history of the system, defined by the initial density of benthic algae. This uncertainty in the initial condition thus prevented the model from being able to identify potential management options.

3.4 | Attribution: Can we attribute uncertainty sources throughout the projection horizon?
Uncertainty is an unavoidable aspect of any impact projection, starting from the emission scenarios themselves, their translation into climate projections, the properties of all sub-systems involved, etc. (Clark et al., 2016). These uncertainty sources are nicely visualized in the uncertainty cascade of Wilby and Dessai (2010), who showed the compounding presence of uncertainty through the scenario modeling chain. However, quantification of these uncertainties is rather difficult (Rougier & Crucifix, 2018; Stephenson et al., 2012), especially since they will change along the projection horizon; for example, some will grow the further out we project. It is therefore valuable to understand which uncertainties dominate during which specific period within the projection horizon. Again, GSA can help by attributing output uncertainty to its sources, so that we can understand their relative importance over time. Although GSA does not solve the problem of having to decide the magnitude of the individual input uncertainties in the first place (see the issue of deep uncertainty discussed above), it nonetheless can guide the modeler regarding the importance of these choices.
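A minimal sketch of such time-varying attribution is given below; the factor names and the toy projection model are hypothetical. For selected years of the projection horizon, the output variance across a jointly sampled ensemble is split into an approximate share explained by a discrete scenario factor (via group means) and by a continuous model parameter (via squared correlation), showing how their relative importance can shift through time.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3_000
years = np.arange(2025, 2101)

# Two uncertain factors (hypothetical): a discrete emission scenario and a continuous model parameter
scenario = rng.integers(0, 3, n)                 # 0 = low, 1 = medium, 2 = high emissions
sensitivity_param = rng.uniform(0.5, 1.5, n)     # e.g., an impact-model sensitivity parameter

# Toy projection: the scenario signal grows along the horizon, the parameter effect is roughly constant
t = (years - years[0]) / (years[-1] - years[0])                    # 0 at start, 1 at end of horizon
impact = (scenario[:, None] * 2.0 * t[None, :]                     # scenario effect grows with time
          + sensitivity_param[:, None] * 1.0                       # parameter effect roughly constant
          + rng.normal(0.0, 0.2, (n, len(years))))                 # residual noise

# Attribution per year: variance of scenario group means (ANOVA-style) and squared correlation (parameter)
for year_idx in [0, len(years) // 2, len(years) - 1]:
    y = impact[:, year_idx]
    group_means = np.array([y[scenario == s].mean() for s in range(3)])
    group_sizes = np.array([(scenario == s).sum() for s in range(3)])
    s_scenario = np.average((group_means - y.mean()) ** 2, weights=group_sizes) / y.var()
    s_param = np.corrcoef(sensitivity_param, y)[0, 1] ** 2
    print(f"{years[year_idx]}: scenario share ~{s_scenario:.2f}, parameter share ~{s_param:.2f}")
```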
For example, Saltelli et al. (2020) used time-varying GSA to reveal the relative contribution of different uncertainties to coastal land loss projections given sea level rise scenarios (Figure 4d). They demonstrated that both the contributions of different uncertain input factors and their temporal evolution vary significantly across EU countries for projections covering the next 80 years. Failing to consider these uncertainties may invalidate conclusions and thus subsequent decisions. Another example of uncertainty attribution along a projection horizon is the study by Le Cozannet et al. (2015), who showed that uncertainty in coastal flood defense vulnerability projections was dominated by local factors such as bathymetry for near-term projections, before climate change scenario uncertainty started to dominate.

4 | CONCLUSION
Evaluation of climate change impact models, an important task for developing robust adaptation strategies and for gaining stakeholder confidence, and in line with the evaluation of climate models themselves (Eyring et al., 2019), cannot be based on assessing a model's fit to historical observations alone, even though this approach is still common practice (Eker et al., 2018). Such observation-based assessment can contribute toward establishing confidence in a simulation model (Fowler et al., 2018) and in its ability to perform its intended task (Klemeš, 1986; Refsgaard & Knudsen, 1996).
However, we suggest here that additional evaluation is necessary to achieve greater confidence in model behavior for climate change impact studies, both related to a model being an acceptable system representation and to it being appropriate for the task at hand. In this context, GSA is a valuable tool to complement any observation-based strategy because it allows us to make the model and its simulations significantly more transparent (Razavi et al., 2021;Saltelli et al., 2020;Wagener & Pianosi, 2019).
Appropriate GSA strategies can be defined that enable the modeler to simultaneously address the four evaluation questions we set out in this paper, thus reducing the computational burden that would occur through separate analyses. The four evaluations differ with respect to the input factors included or the way the results need to be interpreted. In summary, (1) Consistency evaluates whether the controlling processes in the model are consistent with expectations or empirical evidence, thus focusing on the model parameters and often relying heavily on the modeler's depth of understanding of the real-world system. (2) Elasticity focuses on evaluating how the modeled system responds to changes in climatic forcing, and whether this response is as expected. (3) Leverage evaluates whether the parameters that represent decision-making levers have sufficient influence on the modeled system response in the presence of other uncertainties.
(4) Attribution evaluates which uncertainties dominate the modeled system response at what time, thus guiding potential uncertainty reduction to enhance the model's value.
Recent studies have further demonstrated that GSA can be performed even on highly complex models (Saltelli et al., 2020) and on those covering a global domain (Reinecke et al., 2019), and that it can include wider model assumptions, such as model resolution (Savage et al., 2016). However, the complexity of models and therefore the computational burden of such analyses remains the main bottleneck for the widespread application of GSA. Advancing these methods for the purpose of climate change impact study evaluation therefore remains an exciting area of research.

ACKNOWLEDGMENT
The authors thank the two reviewers for their constructive criticism that helped to improve the paper.
Open Access funding enabled and organized by Projekt DEAL.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.