On hypothesis testing in hydrology: Why falsification of models is still a really good idea

This opinion piece argues that in respect of testing models as hypotheses about how catchments function, there is no existing methodology that adequately deals with the potential for epistemic uncertainties about data and hydrological processes in the modeling process. A rejectionist framework is suggested as a way ahead, wherein assessments of uncertainties in the input and evaluation data are used to define limits of acceptability prior to any model simulations being made. The limits of acceptability might also depend on the purpose of the modeling so that we can be more rigorous about whether a model is actually fit-for-purpose. Different model structures and parameter sets can be evaluated in this framework, albeit that subjective elements necessarily remain, given the epistemic nature of the uncertainties in the modeling process. One of the most effective ways of reducing the impacts of epistemic uncertainties, and of allowing more rigorous hypothesis testing, would be to commission better observational methods. Model rejection is a good thing in that it requires us to be better, resulting in advancement of the science.


| INTRODUCTION
Hydrology is an inexact science, subject to both random (aleatory) and knowledge (epistemic) uncertainties. As such there are important issues about how to test models as hypotheses about system functioning in all aspects of hydrology and water science. When we can assume that model errors have a simple aleatory structure, then the full power of statistical hypothesis testing is available. But this is not normally the case. It is more usual that epistemic uncertainties dominate model performance such that how to do hypothesis testing is a more open question. A recent series of papers in Water Resources Research addressed the problem of hypothesis testing in hydrology (Blöschl, 2017). They concluded that testing models as hypotheses would be a good way of improving hydrological science but is difficult: because both observed and predicted variables are theory-laden quantities and subject to significant uncertainties; because model structures are often complex, but incomplete; and because parameters are often difficult to estimate given the data available (parameter inference is underdetermined, leading to equifinality of conceptual models and parameter sets; Neuweiler & Helmig, 2017; Pfister & Kirchner, 2017).
These papers also addressed the question of epistemic uncertainties by suggesting that where we lack hydrological knowledge then there is a need for more exploratory and experimental hydrology (what Baker (2017) refers to as abductive inference) and that exploratory hydrology can be fun and rewarding and should be valued more highly by research funders (Baker, 2017; McKnight, 2017; Pfister & Kirchner, 2017). Most of us (certainly of my generation) cut our hydrological teeth on exploratory hydrology in trying to understand what was happening in one particular catchment area, and we were consequently influenced by its own uniqueness of hydrological and landscape characteristics (see, in my case, Beven, 1978, 2000, 2001). We soon realized that exploratory hydrology is already difficult, given that the field techniques we had available were not really adequate to investigate flow pathways and fluxes in detail, particularly in the subsurface and at scales larger than small cores. In fact, that is still the case: it remains difficult to get good estimates of the rainfalls over even a small catchment area; more so for actual evapotranspiration over heterogeneous land use and hillslopes; and even more so for fluxes in subsurface flow pathways (Neuweiler & Helmig, 2017; Pfister & Kirchner, 2017). It is clear that there is still much to learn from exploratory hydrology but there remains a lack of discussion of how we might test models as hypotheses in the face of epistemic uncertainties.

| EXPLORATORY HYDROLOGY, EXPLORATORY MODELING, AND MODELS AS HYPOTHESES
As experimental hydrologists, we also soon realized that as catchment scale increases, it becomes much more difficult to do active studies of processes that are representative of the wider range of conditions that come into play. We then often resort to inferences from observed hydrological responses at the scale of interest, and one way of doing so is to create models that reflect our small scale understanding from observation and experimental study as far as possible, though this might not properly represent the change in dominant processes at different scales (Beven & Wood, 1993; Kirkby, 1976). That has also served the purpose of allowing quantitative predictions to be provided to decision makers who manage the water system to meet the needs of society. Water Resource Management, in all its aspects, has been a major driver for model development in hydrology, in addition to that drive to demonstrate "that we do, after all, understand our science and its complex interrelated phenomena" (in the words of Max Kohler).
But the complexity of hydrological systems means that models that reflect that understanding will have many components and parameters, even though there are many aspects about which we have relatively poor understanding. The combination of a model structure and a parameter set can be considered as a hypothesis about how a hydrological system functions (I include here the additional components that depend on the simulation of water fluxes, including water quality, sediment transport, drainage systems, and other features that might be required for water resources management). If we can find an acceptable model of that system, either deterministic or stochastic, it can be used to make deductive simulations of behavior for a variety of purposes. That is why we want our models to get the right results for the right reasons, so that they are fit-for-purpose when used to simulate required variables, not all of which might be readily observable. There has been a long discussion in hydrological modeling about how best to determine appropriate effective values of parameters to allow for the uniqueness of particular catchment areas when calibration data are often limited and uncertain (Beven, 2008, 2012), but less about what qualifies as fit-for-purpose for different types of purpose. As hydrological scientists, we do not want to be using models that are not fit-for-purpose. To do so would be to draw the wrong inferences about the future behavior of hydrological systems and lead to less than robust decisions in management. This suggests, therefore, that testing models (or components of models) as hypotheses is a valuable part of doing science in hydrology, and that falsification of models is still a really good idea.
But that is not really how hydrology, as an example of an inexact science, seems to have worked. There are not many papers in the literature that explicitly reject hydrological models as hypotheses. This is certainly partly because of the positive bias of publication. Papers are much more likely to be accepted for publication if they conclude that a model gives adequate simulations of the observations than if they report failures (even if, in some cases, this requires data assimilation to update the model states as the simulation proceeds). There are just a few examples of papers that report rejecting all the versions of a model tried (Beven, 2001; Choi & Beven, 2007; Dean, Freer, Beven, Wade, & Butterfield, 2009; Mitchell, Beven, Freer, & Law, 2011; Mitchell, Freer, & Beven, 2009; Page, Beven, & Freer, 2007), but personal experience suggests that such papers can be rather difficult to get past referees, especially where a referee is implicated in the development of a model.
More generally, poor model results do not get reported. They are considered rather as part of the development of a modeling study. They are improved by debugging the model code, changing the model assumptions, modifying parameter sets or "correcting" model boundary conditions. By such learning processes, we aim to gradually improve models as representations of hydrological systems, even if we do so without going through a specific hypothesis testing process (it might rather be called exploratory modeling, by analogy with exploratory field hydrology). The result, however, is that we have many competing hydrological models, with different assumptions, parameterizations and numerical solution schemes, that purport to do the same thing: modeling the rainfall-runoff process, modeling water tables, modeling isotope and nutrient concentrations and other water quality variables, modeling erosion and sediment transport, and so on. We also have modeling systems that provide many options of different process components, most recently, the SUMMA system (Clark et al., 2015). This implies a need to test different model structures as hypotheses about how a catchment works, in addition to the estimation of parameter values within those structures. It might be the case that different hydrological processes and regimes in different catchments will require different modeling assumptions, but why has there been so little real testing of those models as hypotheses or, even more importantly, as fit-for-purpose?
One reason is that they have all been considered acceptable in some sense, at least by the authors and referees on the papers in which they, and their simulation results, are described. Since hydrology is an inexact science, we (as modelers and referees ourselves) expect that the degree to which we can simulate any available observations will be necessarily limited. There will always be some residual error, whether that be due to error and uncertainty in the observational data itself, necessary approximations in the model assumptions, or error and uncertainty in the forcing data for the model. And it is clear that the resulting errors and uncertainties cannot be treated in simple statistical terms given their epistemic nature (see Beven, 2006, 2012; Beven, Buytaert, & Smith, 2012). This makes both establishing appropriate likelihood measures and defining appropriate methods of hypothesis testing particularly challenging (Beven, 2016b).

| HYPOTHESIS TESTING AND FIT-FOR-PURPOSE
But some form of more rigorous evaluation, that allows the possibility of model falsification as not fit-for-purpose, is surely required. That might depend, of course, on what the particular purpose might be. Purpose will govern the types of model structures chosen to be evaluated and the way in which they might be evaluated given the data available. For some purposes we will be more interested in flood peaks, in others recession behavior, in others flow pathways, and in others how water fluxes relate to residence and travel times and water quality. It might also be appropriate to use different model structures and parameter sets for different model scales. So what methodologies are available for testing models as hypotheses in this context?
For some, hypothesis testing implies a statistical analysis (for example, the analysis of changes in flow duration curves of Kroll, Croteau, and Vogel (2015), and the tests of flood frequency distributions using information criteria in Haddad and Rahman (2011)). This generally requires making assumptions about the structure of the model residuals conditional on the model being true, and that the sources of uncertainty are fundamentally aleatory in nature. This is not really a good basis for considering whether models are fit-for-purpose, especially when we suspect that there will be important epistemic uncertainties involved. Statistical likelihood functions do not allow for giving hypotheses a likelihood of zero, only for comparisons between likelihoods. Likelihoods might be very very small (and range over tens or hundreds of orders of magnitude) but are never zero. Model rejection in that context requires some additional subjective judgments, either in assuming a prior likelihood of zero for some model configurations in a Bayesian framework, defining some tolerance level in approximate Bayesian computation, or in deciding on the choice of one model over another using some information criterion or Bayes ratios. There is no real mechanism for falsification in such a framework (except again by some qualitative judgment by the modeler that the results are not yet good enough to write the paper).
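The point about likelihoods spanning many orders of magnitude without ever reaching zero can be illustrated with a minimal sketch. It assumes, purely for illustration, independent Gaussian residuals with a known standard deviation; the function name, the sample size, and the error sums of squares are hypothetical and are not taken from any of the studies cited.

```python
import math

def gaussian_log_likelihood(sse, n, sigma):
    """Log-likelihood of n residuals with sum of squares `sse`, ASSUMING
    independent zero-mean Gaussian errors of known standard deviation
    `sigma` (an aleatory assumption made here only for illustration)."""
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - sse / (2.0 * sigma ** 2)

n = 1000        # hypothetical number of evaluation time steps
sigma = 1.0     # hypothetical error standard deviation

# Two hypothetical models: B has a 50% larger error sum of squares than A.
ll_a = gaussian_log_likelihood(sse=1.0 * n, n=n, sigma=sigma)
ll_b = gaussian_log_likelihood(sse=1.5 * n, n=n, sigma=sigma)

# Difference in likelihood expressed in orders of magnitude (base 10).
orders_of_magnitude = (ll_a - ll_b) / math.log(10.0)

print(f"log-likelihood A: {ll_a:.1f}, B: {ll_b:.1f}")
print(f"A is favored over B by about {orders_of_magnitude:.0f} orders of magnitude")
```

Under these assumptions, a modest difference in fit is stretched to more than one hundred orders of magnitude in likelihood, yet the likelihood of the poorer model remains finite. Nothing in the calculation itself rejects it; that decision has to come from an additional judgment.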
A more attractive approach, still based in probability theory, is the information theory approach advocated by Nearing and Gupta (2015) and Nearing, Tian, et al. (2016). This makes a comparison between the performance of a model and the information content that can be extracted directly from the available data using purely data-based or machine learning algorithms. The basis of comparison is an entropy measure as calculated from the cumulative distribution of the variables being predicted and the equivalent model outputs. This approach has some attractive features, in particular that it does not require any assessment of sources of uncertainty in the modeling process, but works directly with the data as recorded. It also does not require any explicit assumptions about the structure of the modeling residuals. Nearing, Tian, et al. (2016) argue that it is far more valuable to consider the information provided by a model than the uncertainty associated with the predictions, and we should require that any process model should provide more information than can be gleaned from the data itself. Thus, any process model that results in an entropy greater than the data-based model could be rejected (though it is important to compare like with like: in an early application of this approach Gong, Gupta, Yang, Sricharan, and Hero (2013) tested a hydrological simulation model against a one-step ahead data-based forecasting model; unsurprisingly, the simulation model did not perform as well!). Some interesting recent studies have concerned "model benchmarking" in evaluating land surface parameterizations in climate models (Best et al.). The advantages of this approach would, however, also seem to contain the seeds of some important limitations.
Since the entropy measure is based only on the cumulative distribution of the variable of interest, any information about timing errors, either within an event or in the overprediction and underprediction of different events, is not taken into account. It might also be the case that if there are consistent epistemic uncertainties in the forcing data and evaluation observations, then not all events might be informative in evaluating model performance (Beven & Smith, 2015), in that the data conflict with basic concepts that underlie the model. A data-based model could compensate for consistent biases in the data, in ways that a process model constrained for example by mass and energy balance cannot.
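The insensitivity to timing can be seen in a minimal sketch. The histogram entropy used here is a deliberately simple stand-in chosen for illustration and is not the measure of Nearing and Gupta (2015); the synthetic hydrograph, the bin edges, and the function name are likewise hypothetical. The only property relied on is that the measure depends on the distribution of values alone.

```python
import numpy as np

def histogram_entropy(values, bin_edges):
    """Shannon entropy (bits) of the binned distribution of `values`.
    A simple stand-in for any measure that uses only the distribution
    of a variable and not the order in which its values occur."""
    counts, _ = np.histogram(values, bins=bin_edges)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Hypothetical simulated hydrograph: baseflow of 1 plus one event peak.
t = np.arange(200)
sim = 1.0 + 5.0 * np.exp(-0.5 * ((t - 60) / 8.0) ** 2)

# The same hydrograph with the event arriving 40 time steps late.
sim_late = np.roll(sim, 40)

edges = np.linspace(0.0, 7.0, 15)
h_sim = histogram_entropy(sim, edges)
h_late = histogram_entropy(sim_late, edges)

# A time-step-by-time-step measure does register the timing error.
timing_rmse = float(np.sqrt(np.mean((sim - sim_late) ** 2)))

print(f"entropy on time: {h_sim:.3f} bits, delayed: {h_late:.3f} bits")
print(f"RMSE between the two series: {timing_rmse:.2f}")
```

The two series contain exactly the same values, so any distribution-based measure scores them identically, even though one of them places the peak 40 steps late.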
Of course, demonstrating that a purely data-based model can extract more information from the data than a process-based model is in itself a valuable learning tool. It suggests that we could do better. There are other issues with this information-based approach including the possibility of different data-based models being more or less successful for different data periods (a form of equifinality of data-based models); testing for the possibility of overfitting of data-based models when uncertainties are epistemic; and whether a difference in entropy measures should be considered significant if we accept that there are uncertainties in the data. It has also been suggested that treating data as crisp values might not result in the best data-based models (Klir, 2006).
The information-based assessment of models is, however, one way of asking the question of just how good should we expect a model to be, given the information contained in a dataset? Posing that question a little differently, we could also ask just how bad does the performance of a model have to be for it to be rejected as not fit-for-purpose given what we understand about uncertainties in hydrological datasets? This type of rejectionist framework has always been available in the generalized likelihood uncertainty estimation (GLUE) methodology, initially using a decision about a threshold for one or more performance measures (and widely criticized for the subjective nature of that choice) and later in the use of limits of acceptability based on what is known about uncertainties in the observational data (Beven, 2006; Beven & Binley, 2014; Blazkova & Beven, 2009; Liu, Freer, Beven, & Matgen, 2009). Within this framework, we can decide on when a model (structure and parameter set) as hypothesis should be considered as acceptable or rejected using what we know, or can speculate, about the nature of errors in the observational data and about what is required to make a difference to a decision in the purpose of a model application. We can also decide to make new observations or new types of observations with a view to being more rigorous in deciding whether a model is giving the right results for the right reasons (at least where this is feasible given the available observational techniques and resources). It is important, as noted by Beven (2008, 2012), that those limits of acceptability (and any assumptions on which they are based, including the identification of disinformative periods of data) should be defined before running the models to be evaluated.
A suitable framework is therefore available but, as already noted, the application of any form of model hypothesis testing in hydrology is relatively rare. Why is this? Is there some concern that more models might not be considered as acceptable (a recent study that showed that all SWAT models tried in an application simulating discharge and nutrient concentrations in a UK catchment could be rejected has been proving difficult to get published, but see Hollaway, Beven et al. (2018))? Is it simply the expectation of limited accuracy of models in the inexact sciences so that if the results look qualitatively reasonable then it is not necessary to look more closely, since all models will be expected to be wrong in some details?
But that is then saying that our standards need not be too high. Should we not have the ambition of being a little more rigorous than that? This question has, of course, been raised before, notably by Vit Klemeš in his papers on model testing. It seems that little has changed in the three decades since Klemeš (1986) demanded: "What are the grounds for credibility of a given hydrological simulation model? In current practice, it is usually the goodness of fit of the model output to the historic record in a calibration period, combined with an assumption that conditions under which the model will be used will be similar to those under calibration. This may be reasonable in the simplest cases of the 'filling-in missing data' problem but certainly not if the express purpose of the model is to simulate records for conditions different from those corresponding to the calibration record. Here we have to do with the problem of general model transposability which has long been recognized as the major aim and the most difficult aspect of hydrological simulation models. Despite this fact, very little effort has been expended on the testing of this most important aspect" (p. 15).

| HYPOTHESIS TESTING AND DATA UNCERTAINTIES
Given that we should not expect a model to be better than the data that is used to force it or evaluate it, we need to make a careful assessment of such data uncertainties. By analogy with Type I and Type II errors in statistical decision making, there are two types of errors that we wish to avoid in evaluating a model (Beven, 2006, 2012). We do not want to reject a model that would be useful in prediction just because of data uncertainties; and we do not want to accept a model that would be misleading in prediction just because of data uncertainties. Of these two types of error, the former is more important since once a model is rejected it will (generally) not be considered further. In the latter case, we would expect that further evaluation would show that a model is not actually fit for purpose.
So any form of hypothesis testing in hydrology needs to take proper account of data uncertainties. But, as noted earlier, such uncertainties for both forcing and evaluation data will be usually dominated by epistemic rather than aleatory error. Analysis of such errors requires assumptions to be made about the characteristics of different sources of uncertainty, and clearly it is possible to be wrong in making such assumptions, for good epistemic reasons. That does not, however, imply that it is not worth the effort. The very fact of having to decide about assumptions already makes the process more rigorous, in that those assumptions then define an audit trail for the analysis, an audit trail that can then be evaluated by potential users of the model outputs for an application (Beven, Leedal, & McCarthy, 2014).
There remain issues to be resolved about the types of assumptions that might be made. If we consider the case of a distributed rainfall-runoff model that requires rainfall and evapotranspiration forcing data, based on local rain gauge and eddy correlation latent heat observations, and that will be evaluated using soil moisture, water table, and discharge data, then probably the only variable that is easy to assess for error and uncertainty is the stream discharge (and even then for extreme high and low flows there will normally be significant epistemic uncertainty in the rating curve). For the input data, we will not be too sure about how accurate and representative the rain gauge and eddy correlation data are for different types of events over the catchment, even if there are multiple site observations. Constructing plausible realizations for such errors (as opposed to simple stochastic models of point variables) has not been properly addressed, and would seem to be quite a difficult problem; again for good epistemic reasons (for example, are there any constraints on "outlier" errors for different event types that might result in events that are disinformative in model evaluation (Beven, 2016b; Beven & Smith, 2015)). For the internal state data, we will not be too sure about how the point soil moisture and water table data might relate to the equivalent variables at the discrete element scale in the distributed model (the commensurability problem). It is also unclear how measured values of catchment characteristics might relate to the effective values of model parameters (also a form of commensurability problem). In addition, any errors in the forcing data will get processed through the nonlinear dynamics of the (approximate) model structure to produce model errors of complex and nonstationary structure (Beven, 2016b).
Note that this does not mean that we need to work outside a probabilistic assessment of uncertainty, only that it is difficult to define likelihoods and probabilities that reflect the epistemic nature of the uncertainties involved. We can probably generally conclude, however, that epistemic uncertainties create problems for estimating likely occurrences in any formal framework (including, as noted earlier, for information-based testing using entropy measures) and will necessarily involve some subjective choices that will affect any consequent estimates of probabilities.
So how to proceed? One way is through the limits of acceptability approach. This is equivalent to a form of fuzzy reasoning, with the limits acting as constraints in the sense of the general theory of uncertainty of Zadeh (2005, 2006) and given an axiomatic basis in the general information theory of Klir (2006). It was one of the options suggested in the original GLUE paper of Beven and Binley (1992) and in the set theoretic approach of Keesman and van Straten (1990), Van Straten and Keesman (1991), and Rose, Smith, Gardner, Brenkert, and Bartell (1991). In imposing limits as constraints, we can try to assess the observational error in the predicted variables and use that as the basis for model evaluation. Limits can be imposed on individual observations, or on summary statistics of those observations. If the limits of acceptability are normalized to a common scale (Beven, 2006; Beven & Smith, 2015; Blazkova & Beven, 2009), different limits of acceptability evaluations against different observations can be considered in a common framework.
Whether the model predictions lie within limits determined in this way, however, will depend on the error and uncertainty associated with the forcing data. The limits will need to be extended to allow for the uncertainty in the forcing data, and as noted above, this adjustment might need to be nonstationary in nature. Because of the difficulty in constructing input error realizations when the errors are epistemic in nature, we cannot easily determine the magnitude of the adjustment (this could be the subject of research in basins where there are very good spatial observations of hydrological forcing data). We can, however, assess what critical adjustment of the limits would be required for a model run to be considered acceptable. Given some definition of the limits based on the evaluation data, this is easily calculated for every model run on a normalized scale.
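The calculation of the critical adjustment can be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: it assumes one particular normalization (a score of 0 at the observation, -1 at the lower limit, and +1 at the upper limit), it assumes that each observation lies strictly inside its limits, and the function names and numerical values are hypothetical.

```python
import numpy as np

def normalized_scores(sim, obs, lower, upper):
    """Score simulated values on a common scale: 0 at the observation,
    -1 at the lower limit of acceptability, +1 at the upper limit.
    Assumes lower < obs < upper at every time step."""
    sim, obs, lower, upper = (np.asarray(a, dtype=float)
                              for a in (sim, obs, lower, upper))
    below = (sim - obs) / (obs - lower)   # used where sim < obs
    above = (sim - obs) / (upper - obs)   # used where sim >= obs
    return np.where(sim < obs, below, above)

def critical_multiplier(sim, obs, lower, upper):
    """Smallest factor by which every limit would have to be stretched
    about its observation for the whole run to fall within the limits."""
    return float(np.max(np.abs(normalized_scores(sim, obs, lower, upper))))

# Hypothetical discharge observations with asymmetric limits of acceptability.
obs = [10.0, 20.0, 5.0]
lower = [8.0, 15.0, 4.0]
upper = [13.0, 26.0, 7.0]

run_a = [11.0, 18.0, 5.5]   # inside the limits at every time step
run_b = [16.0, 20.0, 2.0]   # outside the limits at two time steps

mult_a = critical_multiplier(run_a, obs, lower, upper)
mult_b = critical_multiplier(run_b, obs, lower, upper)
print(f"run A critical multiplier: {mult_a:.2f}")
print(f"run B critical multiplier: {mult_b:.2f}")
```

A multiplier of 1 or less means the run already lies within the limits as defined from the evaluation data; a larger value is the extension of those limits that would have to be justified by uncertainty in the forcing data.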
So what would we then expect as hydrologists? If a model performs within limits of two times the assessed observation error, would we consider that model to be acceptable? Probably yes. What about 5 times? Or 10 times? Would a model that cannot simulate within 10 times the limits of acceptability based on the evaluation observational data be considered as useful in prediction? Perhaps not, unless we had reason to suspect that the forcing data could produce errors of such a magnitude (what would cause us to suspect that degree of effect?). Any decision about an appropriate limit would necessarily depend on how good the forcing data are, of course, but if the forcing data (combined with any model structural errors) are sufficiently in error that the simulation cannot fall within 10 times the base limits of acceptability, should that model be considered as useful in prediction or fit-for-purpose? Should the debate about hypothesis testing be about defining standards of fitness-for-purpose in different circumstances, in the same way as statisticians allow for standards in allowing for Type I and Type II errors? This might also be imposed as a further fuzzy constraint within Zadeh's generalized theory of uncertainty, which allows for natural language variables. Could fitness-for-purpose be handled in such a framework?

| REDUCING DATA UNCERTAINTIES
Statistical theory allows for the reduction in Type I and Type II errors as more samples become available (though the uncertainty associated with individual samples is often ignored, or assumed to be taken from a simple common statistical distribution). Perhaps the most effective way of improving hypothesis testing in the inexact sciences would be to decrease the uncertainties associated with the forcing and evaluation data. Many of the advances in hydrology in the last 50 years have been initiated by the availability of a new type of measurement. I have already suggested that the hydrological community should be much more proactive about commissioning new experimental methods (Beven, 2016a), in a similar manner to commissioning a satellite such as SWOT (Yoon et al., 2012). This is a long, long process, but would surely benefit our science. It would be an interesting discussion about what the community should, in fact, commission. This would be limited by the current state of technology, but would also require some hypotheses about what new variables it would be most important to observe, or what existing observables, including rainfalls and discharge, it might be most important to improve. Commensurability issues also require consideration of the scale of observations required (with due consideration to the physical and technological constraints). It would already be an advance for hypothesis testing if we could be sure that the observations used to drive and evaluate a hydrological model did, themselves, satisfy the water balance and energy balance equations over the area of interest.
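One simple form that such a water balance check might take is sketched below. It is an illustration under stated assumptions, not a recommended procedure: the balance is written as precipitation minus actual evapotranspiration, discharge, and storage change; each term is represented only by assumed lower and upper bounds; and the function name and the numbers (in mm per year) are hypothetical.

```python
def water_balance_can_close(precip, evap, discharge, storage_change):
    """Return True if the bounds on each term (given as (low, high) tuples)
    admit a combination with P - E - Q - dS = 0. Simple interval arithmetic;
    no statistical structure is assumed for the errors."""
    residual_low = precip[0] - evap[1] - discharge[1] - storage_change[1]
    residual_high = precip[1] - evap[0] - discharge[0] - storage_change[0]
    return residual_low <= 0.0 <= residual_high

# Hypothetical annual totals in mm, each with assumed uncertainty bounds.
evap = (400.0, 500.0)
discharge = (450.0, 550.0)
storage_change = (-20.0, 20.0)

# Case 1: catchment rainfall bounds are compatible with the other terms.
consistent = water_balance_can_close((900.0, 1100.0), evap, discharge, storage_change)

# Case 2: rainfall is underestimated so badly that the outputs exceed
# the inputs for every combination of values within the bounds.
inconsistent = water_balance_can_close((700.0, 800.0), evap, discharge, storage_change)

print(f"case 1 can close: {consistent}")
print(f"case 2 can close: {inconsistent}")
```

A period that fails even such a loose check cannot be reproduced by a model constrained by mass balance, whatever its structure or parameter values, so it tells us about the data rather than about the model.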

| ADVANCING THE SCIENCE
We currently have a situation in hydrology where a wide variety of models are used to do essentially the same types of predictions and future projections of river flows and other variables of interest. Perhaps the majority of papers published in water resources journals now involve model predictions and projections of some type. Where model intercomparisons have been done, different models give different results and it is often the case that the rankings of models in terms of performance will vary with the period of data used, site, or type of application. This would seem to be a very unsatisfactory situation for the advancement of the science, especially when we expect that when true predictions are made, they will turn out to be at best highly uncertain and at worst quite wrong. It is a situation that cries out for more rigorous testing of models as hypotheses, while recognizing the uncertainties associated with the data. But defining what might be considered as rigorous requires a research program based on the best datasets available, and preferably datasets where both hydrological and tracer response information are available, so that better testing of whether a model is getting the right result for the right reasons is possible (Davies, Beven, Nyberg, & Rodhe, 2011; Davies, Beven, Rodhe, Nyberg, & Bishop, 2013; Kirchner, 2006). There is an implication that, given rigorous hypothesis testing, we should surely be much more willing to falsify some of the models that are currently available and widely used. This is, after all, a good thing so that the science will progress in the future, by rejecting what has been inadequate in the past.

| CONCLUSION
This paper has discussed some of the issues involved in testing models as hypotheses about catchment functioning in the face of epistemic uncertainties in both data and process representations. A framework for hypothesis testing, in terms of defining limits of acceptability before making model runs, is suggested. Past discussion (Beven, 2016b; Clark, Kavetski, & Fenicia, 2011; Beven, Smith, Westerberg, & Freer, 2012; Mitchell et al., 2009) suggests that hydrologists might find it difficult to agree on such a framework, depending on how far epistemic uncertainty is seen as an issue. It is, however, a framework that might be refined as we learn more about the nature of the observational and commensurability uncertainties for both forcing and evaluation data, and about the value of different types of evidence about the system response that could be used in model evaluation. Eventually this might lead toward more rigor in testing models as hypotheses. Such a framework focuses attention on the quality of forcing and evaluation data used in model testing, resulting in a suggestion that the community should be more proactive in commissioning better observational methods. That might be the most effective way of reducing the impacts of epistemic uncertainties. And if we cannot falsify models in this way, then what does that imply about the uncertainties in the hydrological data that we use, and the decisions that depend on both data and model outcomes?