Identification of Dominant Hydrological Mechanisms Using Bayesian Inference, Multiple Statistical Hypothesis Testing, and Flexible Models

In hydrological modeling, the identification of model mechanisms best suited for representing individual hydrological (physical) processes is of major scientific and operational interest. We present a statistical hypothesis‐testing perspective on this model identification challenge and contribute a mechanism identification framework that combines: (i) Bayesian estimation of posterior probabilities of individual mechanisms from a given ensemble of model structures; (ii) a test statistic that defines a “dominant” mechanism as a mechanism more probable than all its alternatives given observed data; and (iii) a flexible modeling framework to generate model structures using combinations of available mechanisms. The uncertainty in the test statistic is approximated using bootstrap sampling from the model ensemble. Synthetic experiments (with varying error magnitude and multiple replicates) and real data experiments are conducted using the hydrological modeling system FUSE (7 processes and 2–4 mechanisms per process yielding 624 feasible model structures) and data from the Leizarán catchment in northern Spain. The mechanism identification method is reliable: it identifies the correct mechanism as dominant in all synthetic trials where an identification is made. As data/model errors increase, statistical power (identifiability) decreases, manifesting as trials where no mechanism is identified as dominant. The real data case study results are broadly consistent with the synthetic analysis, with dominant mechanisms identified for 4 of 7 processes. Insights on which processes are most/least identifiable are also reported. The mechanism identification method is expected to contribute to broader community efforts on improving model identification and process representation in hydrology.

This paper focuses on hydrological model development from the perspective of using observed data to identify individual process mechanisms rather than complete models, using inference and hypothesis-testing techniques capable of reflecting the inherent data and model uncertainty. The presentation is illustrated using models that represent the temporal dynamics of rainfall-runoff processes.
The selection of processes and mechanisms to be included in models has been a perennial challenge in hydrological and broader environmental sciences. Important questions in hydrological process understanding and representation relate to hydrological laws emergent at the catchment scale, causes of spatial heterogeneity in streamflow, evaporation, groundwater and other environmental fluxes, processes that control surface water-groundwater interactions and catchment connectivity, mechanisms by which climate and land use change impact the hydrology of arid and semiarid regions, and many others (e.g., see Beven, 1989; McDonnell et al., 2007; Wagener et al., 2007; Clark et al., 2011b; Ehret et al., 2014; Gupta et al., 2014; Blöschl et al., 2019).
Reliable identification of hydrological mechanisms is important for multiple purposes, including improving predictive performance and gaining insight into catchments' internal functioning, with the latter important in order for the model to "get the right answers for the right reasons" (Kirchner, 2006). The selection of hydrological mechanisms is also generally expected to respect the principle of parsimony, which favors simpler over complex model representations (Jakeman & Hornberger, 1993).
Hypothesis testing in hydrology and broader environmental sciences has been hampered by several challenges, which complicate hypothesis testing both for complete models and individual mechanisms (see, e.g., Clark et al., 2011a; Gupta et al., 2012; Clark et al., 2015; the series of debate papers in Blöschl, 2017; Beven, 2018, 2019; Höge et al., 2019). Notable challenges include lack of access to the true process (e.g., flow through an aquifer), the "uniqueness of place" paradigm (Beven, 2000), broader heterogeneity and variability of environments (Carrera et al., 2005; McDonnell et al., 2007), and resulting uncertainties and scarcity of observation data (Wagener & Montanari, 2011). In addition, as noted by Nearing et al. (2016, 2020), model hypothesis testing is necessarily constrained by the Duhem-Quine thesis (Duhem, 1991), which refers to the difficulty or even impossibility of separating individual model hypotheses from their "surrounding" model environment.
The adage "All models are wrong but some are useful" (Box, 1979) is particularly salient in environmental modeling. From this perspective, it may be more constructive to seek "better" and/or "quasi-true" rather than "true" models, where "quasi-true" models are sets of equations that reproduce the entire system or an individual system component with suitable "accuracy" at the scale of interest (Beven, 2018;Gupta et al., 2012;Höge et al., 2019).
How, then, can mechanisms representative of specific hydrological processes be identified given only incomplete and inexact information? Hypothesis testing and model selection/development in hydrology have progressed along several directions.
The "fixed modeling approach" seeks a complete model structure that reproduces the observed data (generally streamflow series or hydrological indices) across a wide range of catchments (e.g., PDM, Moore & Clarke, 1981; Moore, 2007; HBV, Lindström et al., 1997; Seibert & Vis, 2012; GR4J, Perrin et al., 2003; Van Esse et al., 2013). Well-performing fixed models can be developed through systematic model comparison (e.g., see the GR4J development history). However, the intermediate analyses conducted as part of these developments are not necessarily structured to provide systematic hypothesis testing of individual components.
Bayesian approaches offer a number of important techniques for model selection and hypothesis testing. The well-known Bayes Factor (Raftery, 1993; Kass & Raftery, 1995) quantifies the strength of evidence in favor of one model over another (Kass & Raftery, 1995; Jeffreys, 1998), which in turn allows assigning "degrees of belief" to the proposed models. Approximations such as the Bayesian Information Criterion (BIC) and Kashyap Information Criterion (KIC) can be used to avoid computationally costly Monte Carlo integration of the likelihood function when computing Bayesian Model Evidence (BME) and Bayes Factors (Ye et al., 2008; Schöniger et al., 2014). Bayesian model selection has seen multiple applications across hydrology, including the selection of rainfall-runoff models based on observed streamflow data (e.g., Marshall et al., 2005), improvement of model predictions (e.g., Vrugt & Robinson, 2007), evaluation of the worth of different types of observations for the selection of crop models, evaluation of uncertainty in posterior model weights due to measurement error in soil-plant models, quantification of information provided by a regionalization or hydrological model (Prieto et al., 2019), and many others (see Höge et al., 2019, for a recent review). To our knowledge, previous studies using Bayesian methods have focused primarily on model selection at the level of complete models, and not yet at the level of individual mechanisms.
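As a simple illustration of the strength-of-evidence comparison described above, the Bayes Factor between two models can be computed from their (log) Bayesian Model Evidence values. The sketch below is illustrative only; `bayes_factor` is a hypothetical helper, not part of this paper's method.

```python
import math

def bayes_factor(log_bme_1, log_bme_2):
    """Bayes Factor BF_12 = BME_1 / BME_2, computed in log space
    for numerical stability (BME values are often very small)."""
    return math.exp(log_bme_1 - log_bme_2)

# BF_12 > 1 favors model 1; Jeffreys-style scales grade BF values
# into qualitative "degrees of belief".
bf = bayes_factor(math.log(8.0), math.log(2.0))  # BME ratio 8/2 = 4
```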
Statistical hypothesis testing is a method of inference that allows comparing two or more models (hypotheses) describing systems with random (uncertain) behavior (Lumley, 2000; Burnham & Anderson, 2002). Typically, the "lack" of a discovery is taken as the "null" hypothesis (benchmark), for example, "the subsurface flow process is not represented by a Fickian diffusion mechanism," and rejection of the null hypothesis indicates a "discovery" (here, that "subsurface flow is Fickian"). Due to the presumed inherent uncertainty in the system being modeled, statistical hypothesis testing is formulated to include a confidence level (e.g., 95%), such that the corresponding significance level (e.g., 5%) represents the chance of reaching erroneous conclusions, notably Type I errors (false positives, i.e., incorrectly rejecting a true null hypothesis). It is also known that stringent control of Type I errors will generally translate into increased frequency of Type II errors (false negatives, i.e., failing to reject a false null hypothesis) (Smith & Bryant, 1975). The initial selection of hypotheses can also bias test results, for example, if some model components/mechanisms are overrepresented in the ensemble under consideration (see, e.g., Elkan, 2001; Saerens et al., 2002).
Many other new approaches for model selection and improvement are emerging in the field of data analytics, including the information-theoretic approach envisioned by Nearing and Gupta (2015, 2018) and Nearing et al. (2020). This approach uses information theory and machine learning to test and refine model hypotheses (e.g., if a data-driven model can extract more information from the data than a conceptual model, then the conceptual model can be improved).
This study develops new Bayesian methods for mechanism identification in hydrology taking advantage of flexible modeling frameworks. A key emphasis is on the identification of process mechanisms rather than complete models, and on the use of statistical techniques to reflect the inherent uncertainty in testing hypotheses related to hydrological mechanisms.
The study aims are as follows:
1. Develop a systematic hypothesis-testing approach for the identification of dominant hydrological mechanisms, using a combination of flexible models, Bayesian inference, and methods for multiple hypothesis testing.
2. Investigate the performance of the proposed mechanism identification method in the presence of error (low, medium, and high magnitude) using synthetic data and detailed performance metrics computed under replication.
3. Investigate the performance of the proposed method in a real catchment.

PRIETO ET AL., 10.1029/2020WR028338
The case studies are implemented using the FUSE hydrological modeling system (Clark et al., 2008, 2011a) and daily hydrological data from the Leizarán catchment in northern Spain.
The paper is organized as follows. Section 2 presents theoretical developments, including the derivation of the posterior probability of mechanisms and their use within a hypothesis-testing framework. Section 3 describes the case study setup. Section 4 presents the case study results, which are then discussed in Section 5. Section 6 summarizes the key conclusions.

Key Concepts and Terminology
This section defines the key concepts and terminology underlying the proposed method for identifying dominant hydrological mechanisms. Here, it is important to distinguish concepts related to the physical catchment itself from concepts related to its mathematical model.
The term hydrological process refers to a physical phenomenon occurring in a catchment, for example, surface runoff generation. The term hydrological model process, ℘, then refers to a single hydrological process intended to be represented by a hydrological model (irrespective of the accuracy of this representation).
A key concept in this study is a hydrological model mechanism, m, which we define as a set of model equations intended to represent a hydrological model process ℘. Note that a mechanism may or may not contain calibrated model parameters; for this reason, we use this term in preference to "process parameterization", which is often used in calibration contexts to denote process equations and associated parameter values.
Given these definitions, a hydrological model structure is a specific combination of hydrological model mechanisms intended to represent a number of (preselected) hydrological model processes; the SACRAMENTO and PRMS models are examples of such structures.
An ensemble of hydrological model structures comprises multiple hydrological models that differ in their selection of hydrological model processes and/or in the selection of hydrological model mechanisms used to represent these processes.
A multihypothesis framework (MHF) is defined as a framework for generating ensembles of model structures, i.e., a framework for generating model structures from a pool of available hydrological process mechanisms. In this work, the MHF is given by the flexible modeling framework FUSE (Clark et al., 2008, 2011a).
A hydrological model mechanism is defined as dominant if, conditionally on the study method and assumptions, it is "substantially" (according to a quantitative criterion) more likely to represent a particular hydrological process than all alternative mechanisms under consideration in that analysis. A dominant mechanism should not be confused with a dominant process, which is a process that contributes substantially to the overall catchment water balance.
In terms of mathematical behavior, model structures may be deterministic or probabilistic. In this study, we propose methods for general probabilistic models and illustrate them for the common case in current hydrological practice where probabilistic models comprise a deterministic model (intended for process representation) augmented by a relatively simple error model (intended for predictive uncertainty representation).

Hydrological Model
Consider a probabilistic model of streamflow at time t, with model structure G representing a function of parameters θ, forcing data x_{1:t} (up to time t), and initial conditions s_0. The posterior probability of a model structure within an ensemble G is conditional on that ensemble, which highlights the fundamental assumption that the "true" (or at least "quasi-true") model is in the ensemble. In Equation 7, this assumption is reflected in the scaling of the Bayesian Model Evidence of model G^(k) by the sum of BMEs of all models in the ensemble:

p(G^(k) | q̃, G) = BME(G^(k)) / Σ_j BME(G^(j)) (7)
The computation of BME and related terms has long been a challenge in Bayesian model selection (e.g., see Ye et al., 2008;Schöniger et al., 2014, for analyses in hydrological contexts). Several approaches are available, including "semianalytical" approximations (often referred to as "information criteria") and numerical approximations (e.g., using Monte Carlo samples from the parameter posterior).
The Bayesian Information Criterion (BIC) offers an attractive avenue to approximate the BME terms in Equation 7. The BIC is computed directly from the likelihood function evaluated at the maximum a posteriori parameter set (which given uniform parameter priors coincides with the maximum likelihood parameter set) and includes a so-called "Occam Razor" term to penalize model complexity (quantified by the number of parameters). The use of BIC to approximate Equation 7 is described in Appendix A, which also includes a brief discussion of potential alternatives.
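A minimal sketch of the BIC-based approximation: compute BIC = k ln n − 2 ln L̂ for each model, approximate BME ∝ exp(−BIC/2), and normalize across the ensemble to obtain the posterior model probabilities of Equation 7. The function names and the shift-by-minimum stabilization below are our own conventions, not the paper's; see its Appendix A for the actual formulation.

```python
import math
import numpy as np

def bic(max_log_lik, n_params, n_obs):
    # BIC = k*ln(n) - 2*ln(L_max); the k*ln(n) term is the "Occam Razor"
    # penalty on model complexity
    return n_params * math.log(n_obs) - 2.0 * max_log_lik

def posterior_model_probs(bics):
    # BME ~ exp(-BIC/2); shift by the minimum BIC for numerical stability
    b = np.asarray(bics, float)
    w = np.exp(-0.5 * (b - b.min()))
    return w / w.sum()   # Equation 7 with BIC-approximated BME terms
```

For example, two models with BIC values 10 and 12 receive posterior probabilities of roughly 0.73 and 0.27, reflecting the evidence in favor of the lower-BIC model.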
It is emphasized that the treatment of the BME term in Equation 7 is independent from the subsequent derivations, which work solely with p(G^(k) | q̃, G). As such, the modeler is free to compute p(G^(k) | q̃, G) using methods/approximations suitable for their specific application.

Posterior Probability of a Hydrological Mechanism
We now consider the estimation of probabilities of hydrological process mechanisms, rather than the probabilities of complete models (which represent combinations of mechanisms). Let N_℘ denote the number of mechanisms available for process ℘. As per the earlier definition in Section 2.1, we assume that the model structure ensemble is generated using an MHF.
Let p(m_k^℘ | q̃, G) denote the "ensemble-specific" posterior probability of mechanism m_k^℘ given observed streamflow q̃ and the ensemble of probabilistic hydrological models G. This posterior probability can be expressed using total probability as the sum of posterior probabilities of all model structures that include the mechanism:

p(m_k^℘ | q̃, G) = Σ_{j: m_k^℘ ∈ G^(j)} p(G^(j) | q̃, G) (8)

In practice, the model structures within the MHF ensemble G will be selected in an "unbalanced and opportunistic" way (e.g., Saerens et al., 2002). In other words, the MHF will generally not provide uniform and unbiased (let alone complete) coverage of the "universe" of possible hydrological models and mechanisms. For example, different processes may have a different number of available mechanisms, and some mechanisms for separate processes may be mutually incompatible. As a result, some mechanisms will appear more frequently than others across the model structures provided by the MHF ensemble. In addition, the selection of mechanisms themselves may be biased toward the modeler's expertise, general community preferences, etc.
The inherently subjective (implicit) nature of mechanism selection within any realistic MHF impacts on the estimation of both prior and posterior probabilities of process mechanisms.
For example, the imbalance in the frequency of mechanisms included in an MHF makes the underlying prior distribution of mechanisms nonuniform, with the "effective" ("ensemble-specific") prior probability of a mechanism m_i^℘ being the fraction of model structures in G that include it, N_{m_i}/N_G. This prior differs from the genuine uniform prior 1/N_℘. The unbalanced mechanism frequency also affects the posterior probabilities in Equation 8, as shown by Elkan (2001) and Saerens et al. (2002). In particular, due to the probabilities being summed, posterior estimation will be biased toward mechanisms that appear more frequently in the model ensemble.
In order to account for these imbalances, we employ the correction proposed by Saerens et al. (2002), which replaces the sum in Equation 8 by an average over the model structures that include mechanism m_k^℘:

p̃(m_k^℘ | q̃, G) ∝ (1/N_{m_k}) Σ_{j: m_k^℘ ∈ G^(j)} p(G^(j) | q̃, G) (10)

If we set uniform priors in the mechanism space, Equation 10 is used to compute the "unnormalized" posterior probabilities of all mechanisms for process ℘; these quantities are then normalized to sum to 1, yielding p̃(m_k^℘ | q̃, G). This normalization follows from the earlier assumptions that a hydrological process is represented by a single model mechanism and that the true model (and hence the true mechanism) is present in the ensemble.
Equation 10 is intuitive in that the sum of posterior probabilities of model structures with a given mechanism is now scaled by the corresponding number of model structures, i.e., the sum of posteriors is replaced by the average of posteriors. The implied assumption is that, on average, model structures that include highly probable mechanisms have higher posterior probability than model structures that include less-probable mechanisms. This assumption is generally reasonable but can be compromised by interactions between model mechanisms for multiple processes.
Given the considerations presented in this section, Equation 10 will be used in lieu of Equation 8 to compute posterior probabilities of mechanisms in this work. In light of its assumptions, the use of Equation 10 with practical MHF ensembles represents an approximation to the probability of a given mechanism rather than a "general" statement of probability theory. For this reason, it is important to embed Equation 10 in a robust hypothesis-testing framework (Sections 2.6-2.9) and to undertake a rigorous verification of its empirical performance (e.g., Section 3.4). We provide additional discussion of these aspects in Sections 5.1.1, 5.2.1, and 5.3.2.
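The difference between the naive summed posterior (Equation 8) and the frequency-corrected averaged posterior (Equation 10) can be sketched as follows, assuming posterior model probabilities are already available; the function name and mechanism labels are hypothetical:

```python
import numpy as np

def mechanism_posteriors(model_probs, membership):
    """Compare the naive summed posterior (Eq. 8) with the frequency-corrected
    averaged posterior (Eq. 10, renormalized to sum to 1).
    membership maps a mechanism label to a Boolean mask over the ensemble."""
    p = np.asarray(model_probs, float)
    naive, avg = {}, {}
    for m, mask in membership.items():
        mask = np.asarray(mask, bool)
        naive[m] = float(p[mask].sum())   # Eq. 8: biased toward frequent mechanisms
        avg[m] = float(p[mask].mean())    # Eq. 10: average over structures
    z = sum(avg.values())
    corrected = {m: v / z for m, v in avg.items()}
    return naive, corrected
```

For instance, with model posteriors [0.25, 0.25, 0.5] and mechanism A appearing in the first two structures and B in the third, the naive posteriors are equal (0.5 each), while the corrected posteriors favor B (2/3 vs. 1/3), removing the bias from A's higher frequency in the ensemble.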

Multiple Hypothesis-Testing Setup
The posterior probabilities of mechanisms can be used for hypothesis testing. For example, for a given process, we could simply search for the mechanism with the highest posterior probability. However, we develop a more comprehensive and reliable hypothesis-testing procedure, which recognizes the model ensemble uncertainty associated with Equation 10, includes the specification of a prescribed significance level α, and reflects the potentially large number of hypothesis tests being carried out (which raises the probability of Type I errors, i.e., false identification of a mechanism as dominant).
The building blocks of the proposed hypothesis-testing method are described next. The method is designed to identify the dominant mechanism for a given process ℘. Multiple processes are accommodated by applying the hypothesis-testing method separately to each individual process.

An individual comparison (or hypothesis test) is defined as an individual test of whether a mechanism m_k^℘ is "dominant", i.e., substantially more likely than all other alternative mechanisms available in G to represent process ℘. The quantitative definition of "substantially" will be introduced shortly.

The null hypothesis for an individual test, H0_k, states that mechanism m_k^℘ is not dominant for process ℘. The family of comparisons is then the set of individual comparisons used to identify the dominant hydrological model mechanism for model process ℘. In this work, given a set of mechanisms, a "family of comparisons" comprises the set of individual tests of each mechanism against all other mechanisms available to describe that process; note that each mechanism is only tested for dominance once. If the individual null hypothesis H0_k is rejected, mechanism m_k^℘ is identified as "dominant" for process ℘. In other words, the dominant hydrological model mechanism is the mechanism for which the individual null hypothesis is rejected. Note that it is also possible for no mechanism to be identified as dominant.
The family wise error rate (FWER) is the probability of making one or more Type I errors in a family of multiple tests, i.e., the probability of incorrectly identifying a mechanism as dominant for the given process (after testing all proposed mechanisms). Note that to keep the FWER below a prescribed significance level α, stricter significance levels α′ must be imposed in individual tests. In this study, the (conservative) Bonferroni correction is employed for this purpose (Hochberg, 1988).
Hypothesis testing is implemented by defining a test statistic t for the individual null hypothesis (Section 2.7). The empirical probability of the computed test statistic exceeding a prescribed threshold τ is estimated using a bootstrap approach (Section 2.9) and compared to the prescribed significance level α′ (Section 2.8). This approach follows the principles of classical hypothesis testing (Lehmann & Romano, 2005; Triola, 2001).
A flowchart of the hypothesis-testing approach is provided in Figure 1, and detailed descriptions are provided in the following sections.
Figure 1. Flowchart of the mechanism identification method. General steps (e.g., "estimate BME of all models") and their specific implementations in this work (e.g., "use BIC approximation") are indicated. The function freq(v) = count(v)/N returns the fraction of true elements in a Boolean set v of length N. BME, Bayesian Model Evidence; BIC, Bayesian Information Criterion.

Test Statistic Providing the Definition of a "Dominant" Mechanism
The test statistic t_k for (rejecting) the individual null hypothesis H0_k is taken as the posterior probability of mechanism m_k^℘, with a mechanism considered dominant if the test statistic exceeds a threshold value τ:

t_k = p̃(m_k^℘ | q̃, G) > τ (11)

The selection of τ represents a modeling choice that, jointly with the significance level α introduced in Section 2.8, controls the stringency of the hypothesis test. For example, τ = 0.5 represents the weakest requirement, where a mechanism is considered dominant even if it is barely more probable than its alternatives. In this work, we select τ = 0.75, which requires a mechanism to be at least 3 times more probable than all its alternatives. This choice is subjective and may be application specific.
Note that the statistic t_k could be formulated equivalently in terms of ratios of the posterior probability of mechanism m_k^℘ to those of its alternatives. However, Equation 11 is simpler and hence preferred in this work.

Hypothesis Testing Using the Test Statistic
The test statistic defined in Equation 11 depends on the model ensemble G. As noted earlier, this model ensemble on its own cannot be expected to provide complete coverage (sampling) of the total space of models and mechanisms. For this reason, we treat t_k as a realization of a random variable, and test the null hypothesis H0_k at an individual significance level α′ = α/N_℘, where the division by N_℘ is the Bonferroni correction to the overall significance level α (Hochberg, 1988).
Note that the overall confidence level of the family of comparisons (over all mechanisms) is 1 − α. In this paper, we set α = 0.05 and test the sensitivity of conclusions to this choice.
The next section describes the estimation of the distribution of the test statistic, i.e., the empirical probability of exceedance of the threshold τ.

Bootstrap Estimation of the (Empirical) Distribution of the Test Statistic
In order to estimate the uncertainty in the test statistic computed from models generated using an MHF, we distinguish the MHF model space from the hypothetical total model space. The MHF model space is given by the set of all distinct model structures available within the framework, i.e., all combinations of mechanisms available within that framework. As defined earlier, this sample space is G. In contrast, the total model space G_total is conceptualized as the set of all possible distinct model structures, i.e., all combinations of all mechanisms that might exist "in principle." Our implementation estimates the uncertainty in the test statistic (more precisely, its probability of exceedance) numerically by applying bootstrapping (Efron & Tibshirani, 1993) to the model ensemble G. The bootstrap approximation is based on the assumption that the uncertainty introduced when the total model space G_total is reduced to G is "similar" to the uncertainty introduced when G is replaced by a resampled subset. This assumption is typical of bootstrap approximations (e.g., Press et al., 1992; see also Efron & Tibshirani, 1993; Varian, 2005), except here it is applied in the space of model structures.
More specifically, the null hypothesis H0_k that mechanism m_k^℘ is not dominant for process ℘ is tested at the Bonferroni-corrected significance level α′ (Section 2.8) using the following procedure: bootstrap ensembles G^(b) are generated by resampling model structures from G with replacement; the test statistic t_k^(b) is computed for each bootstrapped ensemble; and H0_k is rejected if the fraction of bootstrapped ensembles in which t_k^(b) exceeds the threshold τ is at least 1 − α′ (the fraction is computed using the function count(v), defined as the number of true elements in a Boolean set v). The same test is applied to all other mechanisms for process ℘.

If the exceedance frequency is below 1 − α′, H0_k is not rejected. The original ensemble G may be selected randomly one or more times like any other bootstrap ensemble and is not given any special treatment. Due to sampling with replacement, any given bootstrapped ensemble G^(b) may (and almost always will) contain multiple instances of one or more model structures at the expense of excluding one or more other model structures. This setup is analogous to classic bootstrapped data sets, which exchange some of the observed data points with multiple instances of other observed data points. If no individual null hypothesis is rejected, the null hypothesis for the entire family of comparisons is not rejected. In this case, no mechanism is identified as dominant, i.e., the dominant mechanism is "not identified" or "undefined".
The same hypothesis-testing procedure is then applied to estimate the dominant mechanisms for all other model processes.
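The overall procedure (Saerens-style averaging as in Equation 10, threshold τ, Bonferroni-corrected level α′, and bootstrap over model structures) can be sketched as below. This is our own condensed reading of Sections 2.7-2.9, with hypothetical function and variable names, not the authors' code:

```python
import numpy as np

def dominant_mechanism(model_probs, membership, tau=0.75, alpha=0.05,
                       n_boot=1000, rng=None):
    """Bootstrap test for the dominant mechanism of a single process.
    model_probs: posterior probabilities p(G^(j)|q,G) over the ensemble.
    membership: mechanism name -> Boolean list, True where model j uses it.
    Returns the dominant mechanism's name, or None if no null hypothesis
    is rejected (no mechanism identified)."""
    rng = np.random.default_rng(rng)
    probs = np.asarray(model_probs, float)
    masks = {m: np.asarray(v, bool) for m, v in membership.items()}
    alpha_c = alpha / len(masks)            # Bonferroni-corrected level
    n_models = probs.size
    exceed = {m: 0 for m in masks}
    for _ in range(n_boot):
        idx = rng.integers(0, n_models, n_models)  # resample with replacement
        p = probs[idx]
        # Eq. 10: average (not sum) of model posteriors per mechanism
        avg = {m: (p[mask[idx]].mean() if mask[idx].any() else 0.0)
               for m, mask in masks.items()}
        z = sum(avg.values()) or 1.0        # renormalize mechanism posteriors
        for m in masks:
            if avg[m] / z > tau:            # test statistic exceeds threshold
                exceed[m] += 1
    for m in masks:  # reject H0_k if exceedance frequency >= 1 - alpha'
        if exceed[m] / n_boot >= 1.0 - alpha_c:
            return m
    return None
```

With a strongly supported mechanism (e.g., structures containing it carrying most of the posterior mass), the exceedance frequency approaches 1 and the mechanism is identified; with ambiguous evidence the function returns None, mirroring the "not identified" outcome described above.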

Case Study Description
This section details the synthetic and real data case studies used to investigate the fundamental properties of the proposed mechanism identification method. A major focus is on the ability of the method to identify dominant mechanisms in the presence of data/model error. In the synthetic case study, the "true" dominant mechanisms are assumed to be known, and the error is added in a controlled way from a distribution that matches the assumed error model, in order to provide a controlled experiment. Multiple (synthetic) data replicates are used to quantify the performance of the mechanism identification method using metrics of statistical reliability and power. In contrast, the real data study investigates the method under conditions when the error assumptions are not met exactly. The following sections provide technical details of the case study procedures. Flowcharts of the synthetic and real data analyses are given in Figure 2.

Catchment and Data
The case studies use data from the Leizarán catchment (c8z1). Daily time series of precipitation, PET, and streamflow were provided by the Diputación Foral de Guipúzcoa. The annual averages are estimated by arithmetic averaging of the daily data.
Data from October 1, 1995 to September 30, 2002 are used for model calibration and mechanism identification, and data from October 1, 1995 to September 30, 1996 are used for model warm-up (to reduce the impact of unknown model initial conditions).

Modeling Framework for Hypothesizing Hydrological Mechanisms and Deterministic Hydrological Model Structures
The hydrological models and mechanisms for hypothesis testing are generated using FUSE, a multihypothesis hydrological modeling system designed to facilitate work on model representation and improvement (e.g., Clark et al., 2008, 2011a; Konapala et al., 2020).
FUSE provides a choice of multiple model mechanisms to represent each model process. A total of seven processes are considered in this paper, namely: (1) architecture of the upper soil layer for the storage of water occurring in the unsaturated zone; (2) architecture of the lower soil layer for the storage of water occurring in the saturated zone; (3) evapotranspiration; (4) interflow for the lateral movement of water within the soil; (5) percolation for the vertical movement of water from the unsaturated zone (upper soil layer) to the saturated zone (lower soil layer); (6) surface runoff generation; and (7) routing for the evolution (shape and timing) of the surface runoff hydrograph as the water moves through the river. The total number of mechanisms is 19; see Table 1 for details.
The FUSE mechanisms are represented by components of existing models, namely, PRMS, SACRAMENTO, TOPMODEL, and ARNO/VIC. Each FUSE model structure is a combination of 7 hydrological mechanisms, with a single mechanism specified for each hydrological process. A total of 624 deterministic model structures are thus considered.
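Generating the ensemble of model structures amounts to taking the Cartesian product of the per-process mechanism lists and filtering out infeasible combinations. The mechanism counts, labels, and feasibility rule below are placeholders (Table 1 of the paper lists the actual 19 mechanisms; the real compatibility constraints are what reduce the product to 624 structures):

```python
from itertools import product

# Hypothetical split of the 19 mechanisms over the 7 processes (illustrative;
# the actual split is given in Table 1 of the paper).
mechanisms = {
    "upper_layer":    ["UL1", "UL2", "UL3"],
    "lower_layer":    ["LL1", "LL2", "LL3", "LL4"],
    "evaporation":    ["E1", "E2"],
    "interflow":      ["I1", "I2"],
    "percolation":    ["P1", "P2", "P3"],
    "surface_runoff": ["SR1", "SR2", "SR3"],
    "routing":        ["R1", "R2"],
}

def enumerate_structures(mechs, is_feasible=lambda s: True):
    """Yield every feasible model structure: one mechanism per process."""
    names = list(mechs)
    for combo in product(*(mechs[n] for n in names)):
        structure = dict(zip(names, combo))
        if is_feasible(structure):
            yield structure
```

With the placeholder counts above the unconstrained product has 3*4*2*2*3*3*2 = 864 structures; supplying an `is_feasible` rule that encodes mechanism incompatibilities shrinks the ensemble accordingly.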
The inputs into FUSE are the (daily) time series of catchment-average observed rainfall and potential evapotranspiration, and the outputs are the (daily) time series of simulated streamflow.

Probabilistic Model and Parameter Estimation (Single Model Structure)
The probabilistic hydrological model in this work is given by the FUSE deterministic model embedded within the residual error model of Equation 3. For pragmatic considerations, the estimates of the hydrological and residual error model parameters in Equation 5 are obtained using a computationally efficient "hybrid" least squares/method-of-moments (LS-MoM) approach, similar to the approach presented by McInerney et al. (2018).
The hybrid approach is summarized below: Stage 1. The estimated hydrological model parameters, θ̂_h, are obtained using the least squares method, by minimizing the sum of squared errors in Box-Cox space. The robust Gauss-Newton optimization algorithm with 10 multistarts is used (Qin et al., 2018). The estimates are then refined using a quasi-Newton optimization method with higher-resolution finite difference gradient estimation and a tight convergence tolerance (Kavetski & Clark, 2010).
Stage 2. The estimated standard deviation of normalized residual errors, σ̂, is obtained using the method of moments from the time series of normalized residuals η (computed using the optimum hydrological model parameters θ̂_h). This approach is equivalent to joint optimization of θ̂_h and σ̂ but is computationally faster (McInerney et al., 2018). The assumption of negligible posterior parameter uncertainty is consistent with the use of the BIC to approximate posterior model probabilities (Section 2.4 and Appendix A). The reduced computational cost is important given the number of model calibrations (parameter estimations) required for the case studies in this paper (Section 3.6).
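The two-stage hybrid LS-MoM calibration can be sketched as follows. For illustration, the Stage 1 search is a simple grid search standing in for the robust Gauss-Newton/quasi-Newton optimization used in the paper, and `model` stands for any deterministic rainfall-runoff simulator; all names are ours.

```python
import numpy as np

def boxcox(q, lam=0.2):
    # Box-Cox transform, used to homogenize the residual error variance
    return (q ** lam - 1.0) / lam

def calibrate_ls_mom(model, theta_grid, x, q_obs, lam=0.2):
    """Stage 1: least squares in Box-Cox space (grid search stand-in).
    Stage 2: method-of-moments estimate of the residual standard deviation."""
    z_obs = boxcox(np.asarray(q_obs, float), lam)
    sse = [float(np.sum((boxcox(model(th, x), lam) - z_obs) ** 2))
           for th in theta_grid]
    theta_hat = theta_grid[int(np.argmin(sse))]          # Stage 1
    eta = boxcox(model(theta_hat, x), lam) - z_obs       # normalized residuals
    sigma_hat = float(eta.std(ddof=1))                   # Stage 2 (MoM)
    return theta_hat, sigma_hat
```

The key design point mirrored here is the decoupling: σ̂ is computed from the residuals of the already optimized hydrological parameters, rather than being optimized jointly.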

Empirical Analysis Using Synthetic Data
The synthetic study enables a thorough investigation (verification) of the fundamental properties of the proposed mechanism identification method. A major focus is on the ability of the method to identify dominant mechanisms in the presence of error. This error is intended to represent the effects of data and model structural errors. Multiple replicates of synthetic data are used to quantify the empirical performance of the inference using metrics of statistical reliability and power.
A flowchart of the synthetic data study is given in Figure 2a; technical details are given in the following sections.

Synthetic Data Generation and Error Scenarios
The synthetic "observed" data are obtained by assuming a set of true mechanisms (yielding a "true" model) and "true" parameter values, shown in Table 1, using the procedure detailed in Appendix C. To achieve a degree of "realism" (representativeness of real conditions), the set of true mechanisms and parameters is taken from a model structure that performed well in the real case study. The synthetic "observed" data are obtained from the synthetic "exact" data by adding Gaussian noise with variance σ² in Box-Cox transformed space, with transformation parameter λ = 0.2 and σ = 0.035. This synthetic error model is consistent with the assumed error model (Section 3.3) and can be interpreted as representing combined "data/model" error. In each scenario, 50 synthetic replicates are generated, shown in Figure 3. The replicates are treated as "statistical trials" when computing the performance metrics detailed in the following section.
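The replicate generation described above can be sketched as follows. This is a hypothetical helper, not the procedure of Appendix C: the function name, the fixed seed, and the clipping of the inverse transform are assumptions.

```python
import numpy as np

def add_noise_box_cox(q_exact, sigma, lam=0.2, n_reps=50, seed=42):
    """Generate synthetic "observed" replicates by adding Gaussian noise
    with standard deviation sigma in Box-Cox transformed space."""
    rng = np.random.default_rng(seed)
    z = (np.asarray(q_exact, dtype=float)**lam - 1.0) / lam  # forward transform
    reps = []
    for _ in range(n_reps):
        z_noisy = z + rng.normal(0.0, sigma, size=z.shape)
        # Inverse Box-Cox; clip so the argument of the power stays positive.
        reps.append(np.maximum(lam * z_noisy + 1.0, 1e-12)**(1.0 / lam))
    return np.array(reps)
```

With λ = 0.2 and σ = 0.035 (the values quoted above), each call yields the 50 replicates treated as statistical trials.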

Performance Attributes and Metrics for the Hypothesis-Testing Method
PRIETO ET AL.

The following performance attributes and quantitative metrics are considered:
1. What is the "reliability" of the method? In other words, if the method identifies a mechanism as dominant, what is the probability that this mechanism is the true (dominant) mechanism?
We define (empirical) reliability R as
R = N_TP / (N_TP + N_FP)
where N_TP is the number of trials where the true mechanism is identified as dominant and N_FP is the number of trials where a wrong mechanism is identified as dominant. This quantity has also been termed the "Positive Predictive Value" (Tharwat, 2020). The definition of trials based on the replicates is given in Section 3.4.3.
R ranges from 0 to 1, with higher values indicating better reliability. Given the definition of FWER in Section 2.6, with a prescribed significance level of α = 5%, a method can be considered "reliable" if R ≥ 1 − α, as the frequency of false positives is then below the imposed significance level.
2. What is the "statistical power" of the method?
Statistical power is defined as the probability of (correctly) rejecting the null hypothesis when it is indeed false (Dekking, 2005; Neyman & Pearson, 1928). For our experimental setup, an empirical estimate of power is given by
P = (N_trials − N_FN) / N_trials = 1 − N_FN / N_trials
where N_FN is the number of trials where a dominant mechanism was not identified and N_trials is the total number of trials.
P ranges from 0 to 1, with higher values indicating better power. Low values of P correspond to the method being "indecisive" and hence of little value to a modeler. In the hypothesis-testing literature, a method is generally considered "powerful/decisive" if 0.8 ≤ P ≤ 1 (e.g., Ellis, 2010).
Note that in general there are trade-offs between the reliability and power of a test: as the probability of making Type I errors (false positives/ discoveries) decreases, the probability of making Type II errors (false negatives/rejections) increases (Smith & Bryant, 1975). In the context of hydrological process analysis, it is preferable to be indecisive (i.e., do not identify a dominant mechanism) than wrong (i.e., identify a wrong mechanism as dominant), because identifying a wrong mechanism as dominant is misleading and confuses our catchment understanding.
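Under the definitions above, R and P can be computed directly from a list of per-trial outcomes; a minimal sketch (the "TP"/"FP"/"FN" labeling scheme is an assumption):

```python
def reliability_and_power(outcomes):
    """Empirical reliability R and power P from a list of trial outcomes.

    Each outcome is "TP" (true mechanism identified as dominant), "FP"
    (wrong mechanism identified as dominant), or "FN" (no mechanism
    identified as dominant).
    R = N_TP / (N_TP + N_FP);  P = 1 - N_FN / N_trials.
    """
    n_tp, n_fp, n_fn = (outcomes.count(k) for k in ("TP", "FP", "FN"))
    r = n_tp / (n_tp + n_fp) if (n_tp + n_fp) > 0 else float("nan")
    p = 1.0 - n_fn / len(outcomes)
    return r, p
```

For example, 46 "TP" and 4 "FN" outcomes out of 50 replicates give R = 1 and P = 0.92, matching the Scenario 1 figures reported in Section 4.1.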

Use of Performance Metrics Across the Synthetic Error Scenarios
The performance metrics are used to analyze the method for three different stratification/pooling setups.
1. Scenario-specific metrics: metrics listed in Section 3.4.2 computed separately for each data/model error scenario defined in Section 3.4.1. We pool together the results of identifying all model processes, so that the number of synthetic trials per scenario is N = 350 (7 processes × 50 replicates). This analysis tells us how the performance of the hypothesis-testing method depends on the error magnitude.
2. Process/scenario-specific metrics: metrics computed separately for the identification of each process in each scenario. In this case, the number of (synthetic) trials is N = 50 per process. This analysis tells us which processes are the most/least identifiable (i.e., for which processes the true mechanisms are identified as dominant with highest/lowest reliability), and what the statistical power of these identifications is.
3. Overall metrics: metrics computed by pooling together the results of identifying all processes in all scenarios, so that the total number of trials is N = 1,050 (3 scenarios × 7 processes × 50 replicates). These metrics can be used to compare the overall performance of the hypothesis-testing method.
The analysis above is carried out for a significance level α = 0.05. In addition, we report results obtained when α is relaxed to 0.1 and when α is tightened to 0.01. This analysis provides an indication of the sensitivity of the inference to the significance level.
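The three pooling setups can be illustrated with a small helper that aggregates per-process/per-scenario trial outcomes; the layout of the `results` dictionary is an assumption:

```python
# Pool trial outcomes at the three stratification levels described above.
# `results` maps (scenario, process) -> list of per-replicate outcome labels
# (e.g., "TP", "FP", "FN"); this dictionary layout is illustrative only.
def pool_outcomes(results):
    per_process = dict(results)            # setup 2: process/scenario-specific
    per_scenario, overall = {}, []         # setups 1 and 3
    for (scen, _proc), outs in results.items():
        per_scenario.setdefault(scen, []).extend(outs)
        overall.extend(outs)
    return per_process, per_scenario, overall
```

For the case studies, pooling 7 processes × 50 replicates gives 350 trials per scenario, and pooling the 3 scenarios gives 1,050 trials overall, matching the counts above.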

Empirical Analysis Using Real Data
The performance of the proposed method is illustrated in a real data case study using observed data from the Leizarán catchment (see Section 4.2). Since the "true" mechanisms in this catchment are not known, our analysis here is limited to the following:
• Evaluation of broad similarities in the behavior of the mechanism identification method under synthetic versus real conditions. In particular, we compare the number and type of dominant mechanisms identified using real observations to the number and type of dominant mechanisms identified using synthetic data.
• Comparison of findings regarding dominant mechanisms to the existing knowledge of the hydrology of the Leizarán catchment.
• Appraisal of the consistency of mechanism identification based on 6 and 12 years of data, and of its sensitivity to the choice of significance level α.
A flowchart of the real data study is given in Figure 2b.

Computational Platform and Costs
The numerical experiments carried out in this study are computationally expensive.
In total, across all scenarios, replicates, error magnitude levels, and FUSE model structures, (approximately) 95,000 individual calibrations were required, each implemented using 10 optimizations (corresponding to 10 initial seeds). The complete analysis consumed approximately 13 months of CPU time on the IHCantabria supercomputer cluster Neptuno.

Synthetic Data Experiments
This section reports the performance of the mechanism identification method in Scenarios 1-3 as the magnitude of (synthetic) error is increased. Given the synthetic nature of the analysis, the true mechanisms are known; hence, we can compute the reliability and power metrics. Figure 3 shows the synthetic "exact" streamflow and the synthetic "observations" of streamflow generated in Scenarios 1-3 (see Section 3.4.1). The true mechanisms are highlighted in bold in Table 1. Figure 4 shows the estimated empirical reliability and power of the method for Scenarios 1-3 (low, medium, and high errors, respectively). The performance metrics are shown both for each process individually and averaged across all processes (upper row). In addition, we report mechanism identifiability averaged across all processes (right-most column).
We begin by considering the reliability and power achieved for the "default" significance level of α = 0.05. The sensitivity to this specification is examined in Section 4.1.5.

Scenario 1: Low Errors
Figure 4 (row 1) shows that, in the presence of low errors, the mechanism identification method achieves a reliability R scen1 = 1 and power P scen1 = 0.92. In other words, a determination is made in 92% of the trials (yielding P = 0.92), and the true mechanism is (correctly) identified as dominant in 100% of these trials (yielding R = 1).
Looking at each process independently, Figure 4 also shows "perfect" mechanism identification for 5 processes: those related to the storage in the upper and lower soil layers, evapotranspiration, surface runoff generation, and routing. For these 5 processes, the dominant mechanisms are identified correctly in all 50 replicates (N = 50 trials per process), i.e., R = 1 and P = 1.
For the remaining 2 processes, interflow and percolation, the dominant mechanisms are identified with perfect reliability (R interflow1 = R percolation1 = 1). However, these mechanisms are identified with lower power, P interflow1 = 0.78 and P percolation1 = 0.66.

Scenario 2: Medium Errors
Figure 4 (row 1) shows that, in the presence of medium errors, the mechanism identification achieves a reliability R scen2 = 1 and P scen2 = 0.84.
Looking at each process independently, Figure 4 shows that the reliability and power in Scenario 2 remain the same as in Scenario 1 for processes related to the storage in the upper and lower soil layers, surface runoff generation and routing. For these 4 processes, the dominant mechanisms are identified correctly in all 50 replicates ( process trials N = 50), i.e., R = 1 and P = 1.
For the remaining 3 processes, perfect reliability is maintained, R evapotranspiration2 = R interflow2 = R percolation2 = 1, but once again with a notable reduction in power, for example, P evapotranspiration2 = 0.9 and P interflow2 = 0.76. The identification of percolation suffers the largest loss of power, with P percolation2 = 0.22 (down from P percolation1 = 0.66 in Scenario 1).

Scenario 3: High Errors
Figure 4 (row 1) shows that, in the presence of high errors, the mechanism identification maintains the same reliability as in Scenario 2, R scen3 = 1, while its power decreases from P scen2 = 0.84 to P scen3 = 0.8.
Looking at each process independently, Figure 4 shows that the method maintains perfect identifiability for processes related to the storage in the lower soil layer and for the routing process (R lower soil layer3 = R routing3 = 1 and P lower soil layer3 = P routing3 = 1). These results are the same as in Scenario 2.
For the remaining processes, reliability remains perfect (R upper soil layer3 = R evapotranspiration3 = R surface runoff3 = R percolation3 = 1), but there is a notable loss of power. For example, for the upper soil layer and evapotranspiration, P upper soil layer3 = P evapotranspiration3 = 0.88, and for surface runoff generation, P surface runoff3 = 0.76. For percolation, the loss of mechanism identification power is almost complete, P percolation3 = 0.1.

Comparison Across the Processes for All Error Levels
Pooling the results across all scenarios and processes (Figure 4, upper right hand corner), mechanism identification achieves R overall = 1 and P overall = 0.86.
The most identifiable processes are those related to the storage in the lower soil layer and routing. For these processes, mechanism identification achieves perfect reliability and power in all three scenarios, i.e., regardless of streamflow errors.
In contrast, the least identifiable process is the percolation process, especially as streamflow errors increase. For this process, the power of mechanism identification deteriorates from P = 0.66 in Scenario 1 to P = 0.22 in Scenario 2 and then to P = 0.1 in Scenario 3.
Interestingly, the identifiability of (mechanisms for) the interflow process decreases from P = 0.78 in Scenario 1 to 0.76 in Scenario 2 but then increases to 0.92 in Scenario 3. This pattern of change is accompanied by several other processes becoming poorly identifiable. For example, the identification of mechanisms for the surface runoff generation process deteriorates to P surface runoff generation3 = 0.76 (down from P surface runoff generation2 = 1 in Scenario 2). A similar deterioration is seen in the identification of mechanisms for the storage in the upper soil layer. The concurrent increase of power in the identification of interflow mechanisms and the drop of power in the identification of mechanisms for other processes suggest a "compensatory/interaction behavior," discussed in Section 5.4.

Sensitivity to the Prescribed Significance Level
We now consider the sensitivity of mechanism identification to the prescribed significance level α. Figure 4 shows that the results are in general stable. For example, tightening α from 0.05 to 0.01 does not impact R in any of the scenarios and does not impact P in Scenarios 1 and 2 for processes in the unsaturated and saturated zones, surface runoff generation, and routing.
10.1029/2020WR028338

The value of α makes the most impact when errors are high. For example, in Scenario 3, tightening α from 0.05 to 0.01 results in a degradation in the identification of interflow, with P interflow3 decreasing from 0.92 to 0.84 (i.e., a loss in P of 0.08). For percolation, the loss in P is only 0.02; however, power in the identification of this process is already very low, for example, P percolation3 = 0.1 when α = 0.05, which reduces further to P percolation3 = 0.08 when α = 0.01.
Conversely, the value of α has the least impact when errors are low. For example, in Scenario 1, tightening α from 0.05 to 0.01 does not impact the identification of the dominant mechanisms in the upper and lower soil layers, evapotranspiration, percolation, surface runoff generation, and routing, where P = 1 both when α = 0.05 and when α = 0.01. Indeed, the only process affected substantially by the change in the significance level is the interflow, where tightening α from 0.05 to 0.01 results in a reduction of P from 0.78 to 0.72.

Real Data
We begin by considering mechanism inference from 6 years of data, with significance level α = 0.05. Table 2 reports the mechanisms identified from real data in the Leizarán catchment. In this analysis, the true mechanisms are unknown and hence we cannot reliably establish whether the mechanisms identified as dominant are the "true" (or "quasi-true") mechanisms. Estimates of reliability and power are unavailable, and we focus instead on qualitative results.
Dominant mechanisms are identified for 4 of the 7 processes represented in FUSE: storage in the unsaturated zone (single state variable), storage in the saturated zone (tension storage plus two parallel tanks), evapotranspiration (proportional to the depth of the roots in each layer), and routing (routing present).
No dominant mechanisms are identified for the remaining 3 processes, namely interflow, percolation, and surface runoff.
These process identification patterns are broadly similar to those found in the synthetic scenarios. In terms of identifiable processes/mechanisms: processes related to the storage in the lower soil layer and routing have well-identifiable dominant mechanisms, similar to synthetic Scenarios 2 and 3. In terms of nonidentifiable mechanisms: notably, percolation is not identifiable in the real catchment, consistent with its low identifiability in the synthetic study. This similarity is unsurprising given that the synthetic studies were set up using real data findings, but it does indicate that no unexpected artifacts are introduced by the inference.
Table 2 reports the sensitivity of mechanism identification to changes in the significance level α. For the 6-year inference period, tightening α from 0.05 to 0.01 has no impact on mechanism identifiability for any of the processes. However, loosening α from 0.05 to 0.1 leads to a dominant mechanism being identified for the percolation process.
Table 2 also shows the impact of data length on mechanism identification. There are no direct contradictions between the mechanisms identified using 6 and 12 years of data: in all processes for which a mechanism is identified in both periods, the same mechanism is identified in both periods. More specifically, mechanisms for processes in the saturated zone, evapotranspiration, and routing of surface runoff remain identifiable in both periods, and the same mechanisms are identified for these processes.
However, the change in data length does impact identifiability, with some switches in the processes for which dominant mechanisms are identified. In particular, there are switches in the identifiability of processes in the unsaturated zone and of surface runoff generation. For the 6-year period, a dominant mechanism is identified for the process of storage in the unsaturated zone (single state variable mechanism), whereas for the 12-year period, a dominant mechanism is not identified for this process. The opposite is found for the surface runoff generation process: a dominant mechanism is not identified from the 6-year period but is identified from the 12-year period (the mechanism where the saturated area is related to the storage in the unsaturated zone via a Pareto distribution).
Finally, Figure 5 compares the observed and simulated hydrographs from a selected subset of FUSE model structures. The following 18 model structures are selected: (1) models with mechanisms identified as dominant for processes in the upper and lower soil layer, evapotranspiration, and routing; and (2) models with all possible mechanisms for those processes where a mechanism is not identified as dominant (interflow, percolation, and surface runoff generation). Figure 5 shows that the FUSE models provide a generally accurate approximation to the observed streamflow, with NSE values (not labelled) of around 0.82-0.88 (and as high as 0.91-0.93 when computed with BC0.2-transformed streamflow).

Discussion
The discussion section is organized as follows. Key insights from the synthetic and real data case studies are discussed in Sections 5.1 and 5.2, respectively. Broader concepts of mechanism identification in the presence of data and model error are discussed in Section 5.3. Future research directions to overcome present study limitations are outlined in Section 5.4.

Higher Reliability Gives the Modeler Confidence in the Mechanisms Identified as Dominant
The mechanism identification method is "reliable": if it identifies a mechanism as dominant, this mechanism is the true (or, more generally, "quasi-true") mechanism. Perfect reliability is achieved for all processes and scenarios, R overall = 1 across all N = 1,050 trials, which is well within the imposed confidence level of 95%.
However, high reliability comes at the cost of lower power, i.e., ability to make an identification. The power when pooled over all scenarios is P overall = 0.86, indicating that the method does not identify a dominant mechanism in 14% of the trials. However, for some processes and error levels, the power dropped to as low as 0.22 and 0.10 (see Section 5.1.3 for further discussion of process identifiability).
The trade-off between reliability (i.e., low frequency of false positives) and power (i.e., low frequency of false negatives) is known from theoretical considerations. In particular, tightening the significance level  reduces the probability of Type I errors but increases the probability of Type II errors (e.g., Smith & Bryant, 1975). It is important to note that such lack of identification power is at least not misleading to the modeler-in contrast to the case of lack of reliability where the wrong mechanism is identified as dominant. A method with lower reliability, especially in conjunction with high power, has less practical value: the relatively high frequency of identifying a wrong mechanism will confound the modeler's interpretation of catchment functioning. It may also result in worse predictive performance when the identified mechanisms are used in predictive modeling contexts.
Overall, the high reliability of the mechanism identification method gives the modeler confidence in the identification of a mechanism as dominant, and cases where no mechanism is identified as dominant may point the modeler to seek more and/or higher quality data and/or to hypothesize new process mechanisms (see Section 5.1.3).

Performance in the Presence of Increased Errors in the Model/Data
Mechanism identification tends to deteriorate as the magnitude of data/model error is increased. In this synthetic study, this error represents a combined effect of data/model error (as the residual error model used to obtain the synthetic "observed" data does not distinguish multiple sources of error). While the deterioration in performance is unsurprising, the deterioration pattern is relatively "benign." In particular, the deterioration manifests as a loss of power, i.e., an increased probability of Type II errors (here, not identifying any mechanism as dominant): no mechanism is identified as dominant in 8%, 16%, and 20% of the trials for Scenarios 1, 2 and 3, respectively. In line with earlier arguments, we consider a loss of power to be a lesser limitation than a loss of reliability.

Which Processes Are the Most and Least Identifiable?
Processes related to storage of water in the saturated zone, and routing, appear well identifiable. Dominant mechanisms for these processes can be identified with perfect reliability and power even from streamflow corrupted with high errors. In contrast, interflow and percolation processes are the least identifiable. Specifically, no percolation mechanism is identified as dominant in as many as 78% and 90% of trials in Scenarios 2 and 3, respectively.
Process identifiability necessarily depends on aspects such as the contribution of the process to the model response used for mechanism identification. For example, a mechanism for a process that contributes a negligible amount of streamflow is unlikely to be identified from streamflow alone. The degree of difference between the competing mechanisms is also of clear relevance. For example, distinguishing between three mechanisms will be much harder if they employ similar equations (e.g., see the earlier study by Gupta & Sorooshian, 1983). This effect can be seen for the percolation process, for which FUSE provides three options, all of which are power functions of (different) model storages. Mechanism identification clearly suffers in these circumstances, with power dropping to as low as 22% and 10% for medium and high errors, respectively.
The dependence of identifiability on mechanism similarity is not surprising. As we consider more and more subtle differences between mechanisms, our ability to establish a mechanism as dominant will necessarily become more limited, especially in the presence of error. Increasingly accurate data would be needed to continue refining process representation.
Other considerations of data uncertainty are also relevant here, for example, estimated values of low flows can be sensitive to data errors (Westerberg et al., 2016), complicating the identification of mechanisms operating during low flow and ephemeral conditions.

Are Findings in the Real Data Study Similar to the Synthetic Scenarios?
In this section, we compare the findings of the synthetic and real case studies to look for broad similarities and differences. Section 4.2 indicates that process identifiability is generally consistent across the real data study and Scenarios 2 and 3. For example, the saturated zone process is always identifiable, with the dominant mechanism being tension storage plus two parallel tanks, and the surface runoff routing process is also always identifiable, with the dominant mechanism being "routing present." A minor exception is the evapotranspiration process, where a dominant mechanism is always identified in the real data study but is occasionally not identified in Scenarios 2 and 3 (P = 0.9 and 0.88, respectively).

Connection Between the Mechanisms Identified as Dominant and Existing Process Understanding in the Leizarán
The mechanisms identified as dominant can be interpreted using the process-oriented understanding available, albeit in a limited way, for the Leizarán (Basque Water Agency, personal communication, March 12, 2020):
Storage in the upper soil layer. Approximated by a single state variable (i.e., without a tension storage). This mechanism, which presumes low tension storage, is plausible because the Leizarán catchment has a low clay content (3%).
Storage of water in the saturated zone. Two parallel tanks and one tension storage. This mechanism might be plausibly linked to the combination of the geological and topographic conditions of the catchment, which favors a subsurface flow component. Geology is composed of 28% calcareous rocks, 28% sands, and 37% siliceous rocks; the catchment slope is high (elevation change over horizontal length is 0.42).
Evapotranspiration. Proportional to the depth of roots in each soil layer. This mechanism is plausible given the catchment has a riparian forest whose vegetation is oak and alder.
Routing. Present. The inference method finds that it is more probable that the surface runoff hydrograph is propagated through the river to the catchment outlet rather than being directly delivered to the outlet, i.e., the "routing mechanism is dominant." This mechanism, which specifies a lag between precipitation and streamflow generation, is plausible because the catchment has meanders across its area of 114 km².
For interflow, percolation, and surface runoff generation (the remaining processes), no mechanism is identified as dominant.
We note that these interpretations are necessarily tentative, in view of the currently limited understanding of the hydrology of the Leizarán catchment. These interpretations can be pursued in more depth in future work, along the research lines on the correspondence between models and catchments (Fenicia et al., 2014; Wrede et al., 2015; Carrer et al., 2019).
The general finding that the interflow and percolation processes are the hardest to identify aligns with previous work on hydrological process identification, where processes related to soil, geology, and vegetation were found harder to characterize than those related to climatic attributes (Beck et al., 2015; Addor et al., 2018). This raises the question of how to make the best use of soil and geological data, including signatures, for hydrological modeling (e.g., Fenicia et al., 2018), and how to represent the continuum of response dynamics in the unsaturated and saturated zones (Silva et al., 2009).
The sensitivity of the identified dominant mechanisms to the choice of significance level α appears low: only one additional mechanism per inference period becomes identifiable if α is relaxed to 0.1 (percolation for the 6-year period and interflow for the 12-year period). However, this finding is likely to be case specific.
Available data length is expected to impact mechanism identification, both by providing information for such identification and by potentially introducing variability into the identification if the catchment undergoes hydrological change. In the real catchment case study using 6 and 12 years of data, 4 out of 7 mechanisms are identified in both periods. An encouraging finding is that the mechanism inference is consistent: for those processes where a mechanism is identified as dominant with both lengths of data, the same mechanism is identified. In other words, there was no change in the estimation from one mechanism to another as more data were added; the only changes were from a mechanism being identified to a mechanism not being identified, and vice versa. Naturally, the temporal consistency of mechanism identification also depends on the catchment not undergoing any genuine major changes. These findings also align with the previous literature on hydrological model component identification, including on the number of model components identifiable from a streamflow time series (e.g., Jakeman & Hornberger, 1993) and on the length of data needed to calibrate a model in a humid catchment (e.g., Sorooshian et al., 1983; Yapo et al., 1996; Li et al., 2010).

Methods Proposed in This Work Focus on Individual Model Processes/Mechanisms, in Contrast to Existing Methods Which Focus on Complete Models
Flexible modular models have facilitated important advances for hypothesis testing in hydrology, enabling the decomposition of models into multiple testable hypotheses about mechanisms for individual processes (e.g., Clark et al., 2008, 2011a; Fenicia et al., 2011; Kraft et al., 2011; Wrede et al., 2015; Fenicia et al., 2016; Addor & Melsen, 2019; Knoben et al., 2019; Craig et al., 2020, and others).
Previous applications of Bayesian model selection in hydrology have focused on comparing models (or model parameters) but, to our knowledge, have not considered the question of individual mechanism identification. Our study builds on previous work on Bayesian model selection (e.g., Marshall et al., 2005; Vrugt & Robinson, 2007; Almeida et al., 2014; Schöniger et al., 2014; Wöhling et al., 2015; Prieto et al., 2019) and develops methods for the identification of dominant hydrological mechanisms by making hypotheses about the model mechanisms and their uncertainty. These advances are presented in Section 2 and represent the major contribution of this study.
The proposed mechanism identification method is general: it is derived for an ensemble of general probabilistic models without assuming a particular probabilistic model composition. For example, the case studies use a probabilistic model constructed from a combination of a deterministic model of hydrological processes and a residual error model for uncertainty characterization. This composition is typical in contemporary hydrological models. More general probabilistic (stochastic) models as well as probabilistic models constructed by forcing deterministic models with ensemble inputs could also be used, provided the probability density function of their outputs is known or can be approximated. As discussed next, the probabilistic model construction is less important than the coverage by the MHF of the space of mechanism hypotheses.

Toward Accounting for Uncertainty in the Identification of Dominant Hydrological Mechanisms
A key challenge in the hydrological community is to develop identification methods that perform reliably in the presence of incomplete and inexact information (i.e., uncertain streamflow observations, approximate model components and structures, limited coverage of hypothesis space, etc.).
Bayesian model selection, via posterior model probabilities (Hoeting et al., 1999), is well posed for process identification if a (quasi-)true model is in the ensemble (Höge et al., 2019). Bayesian methods have been used for model selection, ranking, and elimination, as well as for model structure estimation (Bulygina & Gupta, 2011). In this paper, Bayesian model inference is applied in the distinct (though related) context of identification of dominant hydrological mechanisms rather than complete models. The cornerstone of this method is the ability to estimate the posterior probabilities of individual mechanisms, which is based on the posterior probabilities of models containing these mechanisms. Hence, the mechanism identification method proposed in this work builds on previous applications of Bayesian inference to model selection in hydrology (Höge et al., 2019), including Bayes factors (Marshall et al., 2005) and the use of information criteria including Occam's razor terms to approximate the BME term (Ye et al., 2008; Schöniger et al., 2014).
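The cornerstone computation described above, estimating posterior mechanism probabilities by summing BIC-approximated posterior model probabilities over the models containing each mechanism, can be sketched as follows. The function interface is hypothetical, and a uniform prior over model structures is assumed.

```python
import numpy as np

def mechanism_posteriors(bic, contains):
    """Posterior probability of each candidate mechanism for one process.

    bic: BIC value of each model structure in the ensemble.
    contains: dict mapping mechanism name -> boolean mask over the models
              that use this mechanism.
    With a uniform prior over models, p(M_k | D) is approximated as
    proportional to exp(-BIC_k / 2); a mechanism's posterior probability
    is the sum over the models containing it.
    """
    bic = np.asarray(bic, dtype=float)
    w = np.exp(-0.5 * (bic - bic.min()))   # shift BIC for numerical stability
    p_model = w / w.sum()
    return {m: float(p_model[np.asarray(mask)].sum())
            for m, mask in contains.items()}
```

A mechanism would then be declared dominant only if its posterior probability exceeds that of every alternative and the hypothesis test (Section 2.6) rejects the null at the prescribed significance level α.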
The inference of the dominant mechanisms can be blurred by interactions between mechanisms used for (multiple) model processes. Hypothesis testing is necessarily constrained by the Duhem-Quine thesis (Duhem, 1991), which highlights the difficulty or even impossibility of separating individual model hypotheses from their "surrounding" model environment (Nearing et al., 2016, 2020). Conceptually, we see mechanism interactions as similar to parameter interactions in traditional inference: for example, two "poor" mechanisms (parameter values) can compensate for each other's weaknesses and produce a model with similar or even higher posterior probability than a model with two "good" mechanisms (parameter values). Such interactions can be problematic, as seen from the derivation of the mechanism identification equations (Section 2.5). Scenario 3 presents some empirical evidence of mechanism interactions (see end of Section 4.1.4), where they manifest in a loss of power, though strong interactions may eventually manifest in a loss of reliability. As individual model hypotheses are (gradually) improved, we expect to see corresponding improvements in the identification of dominant mechanisms. Interactions between model mechanisms can also be reduced by using observations of model outputs other than streamflow, for example, actual ET and groundwater levels. The benefits of using multivariate data for model identification and refinement are vividly seen in previous studies (e.g., Fenicia et al., 2008; Gupta et al., 2008; Wagener & Montanari, 2011; Euser et al., 2013).
In terms of requirements for the MHF, a key requirement is sufficiently wide coverage of the mechanism hypothesis space, including a diverse range of simple and complex candidate mechanisms, so that the resulting ensemble is more likely to include the true (or at least quasi-true) mechanisms. The presence of nested models and/or nested model mechanisms could lead to multiple models having the same likelihood function value, though the inclusion of Occam Razor terms in the computation of posterior model probabilities can help steer the inference toward the simpler representation (see Appendix A). Another important consideration is to avoid the potential trap of all models in the ensemble being wrong for the same reason (Clark et al., 2011a).
Finally, important simplifications undertaken in the case studies are the exclusion of parametric uncertainty within each model and the (related) use of the BIC to approximate the BME. These simplifications are motivated by computational considerations, with limitations and future work outlined next in Section 5.4.

Limitations and Future Work
Important directions for future research emerge in the application of the proposed mechanism identification method to more complex modeling scenarios, both in the context of synthetic tests and real data applications. This section lists the most salient opportunities.
The case studies in this work did not consider parametric uncertainty in the hydrological model. This assumption may be reasonable in many circumstances, for example, when the uncertainty associated with a (relatively) parsimonious hydrological model is represented using a simple residual error model, and a suitable long observational record is available for parameter inference (Kavetski, 2018;McInerney et al., 2018). However, there are many modeling situations where parametric uncertainty can be substantial, notably, in poorly identifiable models (e.g., Renard et al., 2010). An important and logical next step is to apply the mechanism identification method in a more complete Bayesian context that explicitly considers model parameter uncertainty, for example, using importance or Markov Chain Monte Carlo sampling.
The synthetic studies in this paper have used relatively simple error models to corrupt the synthetic "observed" data, and, importantly, the true mechanisms were included in the model space. But what happens if a true mechanism is not included in the ensemble? It is well known that many statistical model identification methods assume that the true model is present, which may not be realistic in practical hydrological contexts. An important practical question is then how "good" should the "best available" mechanism/model representation be for it to be identified as "dominant" by the method?
Another important uncertainty-related consideration that remains largely unexplored in this study is the interaction between identified mechanisms (see Duhem, 1991; Nearing et al., 2016, 2020). In most catchments, specialized measurements to test each hypothesis/mechanism (e.g., "observations" of storage of water in the unsaturated and saturated zone, percolation, actual evapotranspiration, etc.) are not available, and hypothesis testing is necessarily limited to tests against observed streamflow. Given the dependence of streamflow on multiple and often interacting hydrological processes, model predictions can be expected to depend on multiple interdependent hypotheses (Pfister & Kirchner, 2017). This aspect of mechanism identification also requires substantial further investigation.
Furthermore, as flexible frameworks continue to improve and expand their coverage, so will their ability to meet the mechanism identification requirements. However, as more candidate mechanisms are hypothesized, it is likely that differences between them will become smaller. We anticipate a limit on the degree of mechanism detail that can be inferred from highly uncertain environmental data. But more research will be needed before meaningful quantitative statements can be made.
Future work will apply the method proposed in this paper to a more diverse range of catchments, as well as to a range of different time periods in the same catchment, analogous to split-sampling and cross-validation. These analyses can help elucidate the pattern, consistency, and variability of the identified dominant mechanisms and can help establish model performance when faced with data unseen during mechanism identification. Of particular interest are applications in experimental catchments, where more extensive fieldwork and multivariate data other than streamflow time series (e.g., water storages) are available. Application of Bayesian mechanism identification in experimental catchments could extend previous studies where model comparison was done on the basis of existing perceptual knowledge (Carrer et al., 2019; Fenicia et al., 2014; Wrede et al., 2015). Such work could yield insights into the correspondence of models and reality, as well as into the variability of hydrological mechanisms in space (across catchments, e.g., due to differences in topography, geology, and land use) and time (within catchments, e.g., across seasons).

Conclusions
The development of hydrological models that provide an accurate representation of catchment dynamics and produce accurate streamflow predictions represents a formidable model identification challenge. In this work, we approach this model identification challenge with a focus on identifying individual model components (hydrological mechanisms) for each hydrological process that is included in the model.
The proposed hydrological mechanism identification method takes advantage of flexible hydrological models, Bayesian inference, and statistical hypothesis testing. A "dominant" mechanism is defined as a mechanism with (substantially) higher posterior probability than the sum of posterior probabilities of all other proposed mechanisms; here, we set the probability threshold at 0.75 (i.e., at least 3 times more probable a posteriori than all alternative mechanisms combined). A test statistic is constructed for the null hypothesis that "none of the proposed mechanisms is dominant", and its estimated probability is compared against an a priori confidence level (here, 95%, corresponding to the classic significance level of 5%).
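The dominance test described above can be illustrated with a simplified stand-in: a mechanism is declared dominant if, in at least 95% of bootstrap resamples of the model ensemble, its renormalized posterior probability exceeds the 0.75 threshold. The ensemble of (mechanism label, model posterior) pairs below is hypothetical, and the bootstrap details are a sketch rather than the paper's exact procedure.

```python
import random

# Hypothetical ensemble for one process: 10 models using mechanism "A"
# and 25 models using mechanism "B", with illustrative posterior weights.
ensemble = [("A", 0.05)] * 10 + [("B", 0.002)] * 25

def mech_probs(models):
    """Renormalized posterior probability of each mechanism in a (sub)ensemble."""
    total = sum(p for _, p in models)
    probs = {}
    for label, p in models:
        probs[label] = probs.get(label, 0.0) + p / total
    return probs

def dominant(models, threshold=0.75, conf=0.95, n_boot=2000, seed=1):
    """Return the dominant mechanism, or None if the null hypothesis
    ("no mechanism is dominant") cannot be rejected at the given level."""
    rng = random.Random(seed)
    hits = {}
    for _ in range(n_boot):
        resample = [rng.choice(models) for _ in models]  # bootstrap the ensemble
        probs = mech_probs(resample)
        best = max(probs, key=probs.get)
        if probs[best] > threshold:
            hits[best] = hits.get(best, 0) + 1
    for label, count in hits.items():
        if count / n_boot >= conf:
            return label
    return None

print(dominant(ensemble))  # mechanism "A" dominates this hypothetical ensemble
```

When the mechanism posteriors are close (e.g., two mechanisms near 0.5 each), the function returns `None`, corresponding to the "no mechanism identified as dominant" outcome reported in the synthetic trials.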
The method is evaluated empirically using synthetic and real data case studies based on daily data from the Leizarán catchment (Basque Country, Spain). The hydrological modeling system FUSE is employed to represent seven hydrological processes using 2-4 mechanisms per process, yielding a total of 624 feasible model hypotheses. Synthetic scenarios with 3 levels of error magnitude (low to high) and 50 data replicates each are used to establish the performance of the proposed method in the presence of data/model error. Metrics of statistical reliability and power are used to quantify and then rate the method's performance. Real data are used to investigate the generality of key qualitative findings of the synthetic experiments.
The most probable parameter set is defined as the parameter set that maximizes the posterior distribution of model parameters in Equation 5; Equation A1 can then be approximated using several analytical approaches.
The BIC, also known as the Schwarz Information Criterion (SIC), is defined as

$$\mathrm{BIC}^{(k)} = -2 \log p\!\left(\tilde{y} \mid \hat{\theta}^{(k)}, G^{(k)}\right) + N_\theta^{(k)} \log N_t \quad \text{(A3)}$$

where $N_\theta^{(k)}$ is the total number of model parameters in the probabilistic model $G^{(k)}$ (Ye et al., 2008; Schöniger et al., 2014). The last term in Equation A3 is a so-called "Occam Razor" term, which penalizes model complexity (here measured by the number of parameters). Occam Razor terms are essential to distinguish between models (and eventually mechanisms) that yield comparable predictive performance but differ in their degree of complexity.
The BIC approximation of the BME in Equation 7 is then

$$p\!\left(\tilde{y} \mid G^{(k)}\right) \approx \exp\!\left[-\tfrac{1}{2}\left(\mathrm{BIC}^{(k)} - \mathrm{BIC}_{\min}\right)\right] \quad \text{(A4)}$$

The last term in Equation A4 uses the shift by $\mathrm{BIC}_{\min}$, the smallest BIC value across the ensemble, to avoid numerical underflow; the shift cancels when the posterior model probabilities are normalized. Unless specific prior information is available, the prior over the hydrological models is set as uniform, $p\!\left(G^{(k)}\right) = 1/N_G$, where $N_G$ is the total number of models under consideration.
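As a minimal numerical sketch of this computation, each model's maximized log-likelihood and parameter count are converted into a BIC, shifted by the smallest BIC to avoid underflow, and normalized under a uniform model prior. The two models, their log-likelihoods, and the record length below are hypothetical values for illustration.

```python
import math

# Hypothetical ensemble summaries: maximized log-likelihood and parameter count.
models = [
    {"name": "G1", "max_loglik": -120.3, "n_params": 6},
    {"name": "G2", "max_loglik": -118.9, "n_params": 9},
]
n_obs = 365  # N_t: length of the observed record (illustrative)

def bic(m):
    # BIC = -2 log p(y~ | theta_hat, G) + N_theta * log(N_t)
    return -2.0 * m["max_loglik"] + m["n_params"] * math.log(n_obs)

bics = [bic(m) for m in models]
shift = min(bics)  # the shift cancels when posteriors are normalized below
weights = [math.exp(-0.5 * (b - shift)) for b in bics]
posteriors = [w / sum(weights) for w in weights]  # uniform prior over models
```

Here the slightly better fit of `G2` is outweighed by its extra parameters, so the Occam Razor term steers the posterior toward the simpler `G1`.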
The BIC is derived by applying the Laplace approximation to the integral in Equation A1 and retaining only the terms dependent on the data length $N_t$, i.e., assuming the number of observations is large (Schwarz, 1978; Konishi & Kitagawa, 2008). There are several alternative information criteria. For example, the Akaike Information Criterion (AIC) is derived from information theory and penalizes the number of parameters according to $2 N_\theta^{(k)}$ instead of $N_\theta^{(k)} \log N_t$ (Akaike, 1974). The Kashyap Information Criterion (KIC) is derived by assuming the parameter posterior is Gaussian and has additional Occam Razor terms, including a term containing the determinant of the Fisher information matrix (Fisher & Russell, 1922; Kashyap, 1982).
All three information criteria, AIC, BIC, and KIC, converge to the integral in Equation A1 as the number of observations used in the analysis increases, by virtue of the likelihood term eventually dominating all other terms. The KIC approach is theoretically more accurate; however, the approximation of the Fisher matrix (e.g., using finite differences) is noisy, and it is difficult to ensure comparable numerical accuracy across the many (hundreds to thousands of) model structures required in the mechanism identification framework. For these reasons, the BIC approach offers a better balance of numerical accuracy and computational robustness in the specific context of this study.
It is also emphasized that the simplification in Equation A4 is separate from the derivation of the mechanism identification equations, which requires solely 