A guide to good practice in modeling semantics for authors and referees
 This opinion piece makes some suggestions about guidelines for modeling semantics that can be referred to by authors and referees. We discuss descriptions of model structures, different forms of simulation and prediction, descriptions of different sources of uncertainty in modeling practice, the language of model validation, and concepts of predictability and fitness-for-purpose. While not expecting universal agreement on these suggestions, given the loose usage of words in the literature, we hope that the discussion of the issues involved will at least give pause for thought and encourage good practice in model development and applications.
1. Introduction
 There are many examples of bad practice in the use of words in the hydrological modeling literature, and we have found it rather surprising that referees have not always picked up on this. Thus, we would respectfully like to suggest that it would be rather useful to have some guidelines for good practice in modeling semantics that can be referred to by authors and referees, with the aim of encouraging good practice in model development and applications. Here we discuss some of the issues involved, drawing on practice in other areas. Section 'Model Structures' discusses descriptions of model structures; section 'Simulation, Prediction, Projection, Forecasting, and Hindcasting' discusses the different forms of simulation and prediction; section 'Describing Uncertainties' discusses descriptions of different sources of uncertainty in modeling practice; section 'The Language of Model Evaluation' discusses the language of model validation, and section 'Conditional Validation, Falsification, and Fit-For-Purpose' concepts of predictability and fit-for-purpose.
2. Model Structures
 There are many different classifications of model structures in the hydrological literature dating back to at least Clarke . A variety of categories such as lumped or distributed, deterministic or stochastic, inductive or deductive, and black-box or process-based are commonly used, while particular models transcend such crisp boundaries by being described as semidistributed, gray-box, or “using stochastic ensembles of deterministic model runs.” One term that referees should actively discourage is “physically based.” This has become corrupted in its use (for example, in being applied to models such as SWAT) [e.g., Gassman et al., 2010]. That is not to say that models cannot have components that aim to represent different hydrological processes, or that models cannot be interpreted in physically meaningful ways. Such alternative descriptions are more acceptable and correct. Even simple transfer function models derived from observations can have a mechanistic interpretation in terms of meaningful time constants or gains [e.g., Young, 2003, 2013], but there are (at least as yet) no models that are based on correct physical principles in hydrology. Models based on the Darcy-Richards equation and Manning equation might claim to have a physical foundation, but the empirical “physics” on which they are based is clearly not universally applicable [see, for example, Beven and Germann, 2013; Beven, 2013]. In principle, of course, we can have deductive physically based models derived from principles of conservation of mass, energy, and momentum (as in the Representative Elementary Watershed “REW” framework of Reggiani et al. [2000, 2001]) and simplifications of the Navier-Stokes equations [e.g., Neuman, 1977]. In both cases, except for some special cases, the auxiliary and boundary conditions required to have a working model at useful scales will reintroduce an inductive element.
 If the difficulty of applying physical principles in hydrology is embraced, then it seems that we could simplify approaches to the classification of models to simply distinguish between models that are deductive from those that are inductive, possibly with an additional qualifier describing whether a model calculates spatially distributed values of variables. Deductive model structures are defined prior to applying the available observational data, while inductive model structures are normally identified on the basis of the available observational data.
 Of course, the deductive-inductive dichotomy is rather unsatisfactory [Young, 2011, 2013], and these approaches are certainly not mutually exclusive. The best aspects of both approaches can be exploited to produce a systematic approach to hydrological model development. For example, there are deductive models that are based on principles originally derived by induction (Darcy-Richards, Manning), and the calibration (estimation) of parameters will introduce an inductive element (and associated uncertainty) to even the most deductive model [Beven, 2012; Young, 2011, 2013]. Referees should ensure, however, that the deductive and inductive elements of a modeling study are clearly differentiated.
 We also recognize that a distinction between deterministic and stochastic model structures can be useful, even if the deductive role of purely deterministic model runs would appear to be rather limited (see section 'Describing Uncertainties'). Given the nature of hydrological systems and the uncertainties in the modeling process, referees should expect that deterministic models are applied with a proper account taken of the relevant uncertainties. They should not, therefore, accept papers based on single deterministic model runs, unless a convincing argument is made as a special case. This should otherwise be considered bad practice.
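One simple way of taking some account of uncertainty with a deterministic model is to run it as an ensemble rather than once. The sketch below is illustrative only: the single-store `linear_reservoir` model, the uniform range assumed for its time constant `k`, and the 5% and 95% bounds are all assumptions made for this example, and sampling one parameter addresses only one of the relevant sources of uncertainty.

```python
import random


def linear_reservoir(rain, k, s0=0.0):
    """Hypothetical single-store model: outflow is storage divided by k."""
    storage = s0
    flow = []
    for p in rain:
        storage += p          # add rainfall for this time step
        q = storage / k       # release a fixed fraction of storage
        storage -= q
        flow.append(q)
    return flow


def ensemble_bounds(rain, k_range=(2.0, 10.0), n_runs=200, seed=1):
    """Run the deterministic model for many sampled k values.

    Returns approximate 5% and 95% bounds of simulated flow per time step.
    The uniform prior range on k is an assumption of this sketch.
    """
    rng = random.Random(seed)
    runs = [linear_reservoir(rain, rng.uniform(*k_range)) for _ in range(n_runs)]
    lower, upper = [], []
    for values in zip(*runs):  # all ensemble members at one time step
        ordered = sorted(values)
        lower.append(ordered[int(0.05 * (n_runs - 1))])
        upper.append(ordered[int(0.95 * (n_runs - 1))])
    return lower, upper


rain = [0.0, 10.0, 5.0, 0.0, 0.0, 0.0]
lower, upper = ensemble_bounds(rain)
for t, (lo, hi) in enumerate(zip(lower, upper), start=1):
    print(f"t={t}: simulated flow between {lo:.2f} and {hi:.2f}")
```

Reporting a range of this kind, together with the assumptions behind it, is more informative than the output of any single run.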
3. Simulation, Prediction, Projection, Forecasting, and Hindcasting
 There is considerable confusion in the literature about the use of the terms simulation, prediction, projection, forecasting, and hindcasting; the usage in hydrological modeling in some cases conflicts with usage in other areas. It would be desirable for authors and referees to use these terms with more rigor. The terms themselves are, with one exception, not that ambiguous (see Table 1). Simulation is the quantitative reproduction of the behavior of a system, given some defined inputs but without reference to any observed outputs. Forecasting is the quantitative reproduction of the behavior of a system ahead of time, but given observations of the inputs, state variables (where applicable), and outputs up to the present time (the forecasting origin). Hindcasting is the application of forecasting to past data as if the inputs and outputs were only available up to a certain point in time (most forecasting that is reported in the literature is, for obvious reasons, actually hindcasting). We should also add projection to the list of activities. Projection, or “what-if” simulation, is the simulation of the future behavior of a system given prior assumptions about future input data.
Table 1. Forms of Modeling
|Form of modeling|Inputs used|Observed outputs used|
|---|---|---|
|Simulation|I(t), t = 1:N| |
|Ex-ante forecasting|I(t), t = 1:t_o|O(t), t = 1:t_o|
|Ex-post forecasting|I(t), t = 1:t_o + k|O(t), t = 1:t_o|
|Hindcasting|I(t), t = 1:t_f|O(t), t = 1:t_f|
|Projection| | |
 That leaves “prediction,” which has been used ambiguously in hydrology. In many other areas of modeling, prediction is used synonymously with forecasting, i.e., forecasting into the future with inputs and other observations known only up to the present forecasting origin. This is consistent with the Latin origin of the word (from praedicere, to make a statement about the future). In hydrology, however, it has been used more generally for the case of any model output in time and in space, whether produced by simulation or forecasting, and in particular for the temporal or spatial outputs of simulation models used in model evaluation (for example, in “validation” or “verification” runs of a simulation model, see section 'Describing Uncertainties'). But this is not making a statement about the future; it is a simulated model output. To avoid this ambiguity, we would thus suggest that the word prediction should generally be avoided (the terms in Table 1 cover all useful circumstances). We suspect, however, that it will continue to be used with a variety of meanings, but would suggest that referees should ensure that this and the other terms are at least defined and used clearly by authors in ways consistent with Table 1.
 The situation is further confused by two different types of forecasting that have not always been distinguished in the hydrological literature. It is necessary to differentiate between “ex-ante” and “ex-post” forecasting. These are terms that are well known in the forecasting literature [Young, 2013]. Here ex-ante forecasts are real forecasts, only dependent on data up to the forecasting origin. Ex-post forecasts, on the other hand, are model outputs calculated as if future measured data (particularly the inputs, such as rain and evapotranspiration, which are particularly difficult to forecast) are also known after the forecasting origin. Note the difference here between ex-post forecasting and simulation. Both assume that the input data required are known, but ex-post forecasting also assumes that observations of the system response are available up to the start of the forecasting interval. Table 1 summarizes these definitions.
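The differences between these forms of modeling come down to which data a model run is allowed to see. The following sketch makes that explicit by slicing hypothetical input and output series according to Table 1. The function names, the `DataWindow` container, and the example series are assumptions introduced for illustration; Python slices are zero-based, so `series[:origin]` corresponds to t = 1 to the forecasting origin.

```python
from typing import List, NamedTuple, Optional


class DataWindow(NamedTuple):
    """Data that a model run is allowed to use."""
    inputs: List[float]
    observed_outputs: Optional[List[float]]


def simulation(inputs, n):
    """Simulation: inputs for t = 1..N, no reference to observed outputs."""
    return DataWindow(list(inputs[:n]), None)


def ex_ante_forecast(inputs, outputs, origin):
    """Ex-ante forecast: inputs and outputs only up to the forecasting origin."""
    return DataWindow(list(inputs[:origin]), list(outputs[:origin]))


def ex_post_forecast(inputs, outputs, origin, lead):
    """Ex-post forecast: measured inputs are also used over the lead time."""
    return DataWindow(list(inputs[:origin + lead]), list(outputs[:origin]))


def hindcast(inputs, outputs, past_origin):
    """Hindcast: an ex-ante forecast made from an origin within a past record."""
    return ex_ante_forecast(inputs, outputs, past_origin)


# Hypothetical ten-step record of rainfall (input) and flow (output).
rain = [0.0, 4.0, 9.0, 2.0, 0.0, 0.0, 6.0, 1.0, 0.0, 0.0]
flow = [0.1, 0.8, 2.5, 2.0, 1.2, 0.7, 1.6, 1.3, 0.8, 0.5]

window = ex_post_forecast(rain, flow, origin=6, lead=3)
print(len(window.inputs), len(window.observed_outputs))
```

In the ex-post case the model sees three more steps of measured rainfall than an ex-ante forecast from the same origin would, which is why the two should not be reported as if they were equivalent.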
 Ex-post forecasting (actually more strictly ex-post hindcasting since measured inputs cannot be available into the actual future) should never be portrayed either as real forecasting or as real simulation, since this is highly misleading: indeed, it is difficult to justify ex-post forecasting in any realistic application [e.g., Beven, 2009; Young, 2013], unless it is made quite clear that it is being used only as a check on the stochastic simulation abilities of the model. Again we suggest that referees should act to ensure good practice by requiring revision to papers that use “prediction” without clarification of its purpose, and in particular where it is used to mean ex-post forecasting. In rainfall-flow modeling, for instance, an ex-post forecast that uses future observed rainfall to forecast flow over the forecast horizon is clearly misleading on its own, since a true ex-ante forecast would require a forecast of future rainfall beyond the forecasting origin, based only on measurements of rainfall (and any other related variables required, such as potential evapotranspiration) available at the forecasting origin.
4. Describing Uncertainties
 There has been significant discussion in the literature about what methods of uncertainty analysis to use. Because many of the sources of uncertainty are epistemic in nature, the debate about uncertainty estimation techniques will not be resolved soon (see recent discussions in Beven [2006, 2008, 2012], Montanari , Beven et al. , Clark et al. [2011, 2012], Beven et al. , McMillan et al. , Beven and Smith , and Young ). What is important is that some consideration of uncertainty is incorporated into any modeling study, and as noted above, referees should ensure that this is the case.
 There is also an extensive literature on the classification of different types of uncertainty [Hoffman and Hammonds, 1994; Helton and Burmaster, 1996; Walker et al., 2003; Van der Sluijs et al., 2005; Refsgaard et al., 2006, 2007, 2013; Warmink et al., 2010; Spiegelhalter and Riesch, 2011]. There are multiple classifications of varying degrees of complexity. Operationally, however, we would suggest that sources of uncertainty can be simply differentiated as either aleatory or epistemic (see examples in Table 2). Aleatory uncertainties (from the Latin alea meaning a die, or game of dice) are concerned with apparent random variability and can be treated directly in probabilistic terms. In this definition, aleatory uncertainties can be treated as statistical variables, potentially of complex structure [e.g., Koutsoyiannis, 2010]. It is sometimes suggested that aleatory uncertainties are irreducible, but in the working definition we are suggesting here that will not necessarily be the case.
Table 2. Examples of Aleatory and Epistemic Uncertainties in Hydrological Modeling
|Source of uncertainty|Aleatory|Epistemic|
|---|---|---|
|Rainfall observations|Gauge errors, after allowing for consistent bias associated with height, wind speed, shield design etc.|Neglect of, or incorrect corrections for, gauge errors and radar estimates. Errors associated with lack of knowledge of spatial heterogeneity|
| |Radar reflectivity residuals, after corrections for type of reflector, drop size distribution, attenuation, bright band, and other anomalies| |
|Rainfall interpolations|Residuals for any storm given choice of interpolation method|Rain gauge network or choice of interpolation method might not be appropriate for all storms, unobserved cells, nonstationary spatial covariance characteristics/choice of fractal dimension|
|Evapotranspiration estimates|Random measurement errors in meteorological variables|Biases in meteorological variables relative to effective values required to estimate catchment average evapotranspiration|
| | |Choice of assumptions in process representations|
| | |The choice of functions in simulation or forecasting equation|
| | |Neglect of local boundary condition effects over a heterogeneous surface|
| | |Neglect of wet/dry surface effects|
|Discharge observations|Fluctuations in water level observations|Poor methodology and operator error in direct discharge measurement for rating curve definition|
| |Observation error in direct discharge measurements for rating curve definition|Unrecorded nonstationary changes in cross section from vegetation growth and sediment transport|
| |Residuals from rating curve|Inappropriate choice of rating curve, particularly in extrapolating beyond the range of available discharge observations|
| | |Errors due to unmeasured and distributed flow effects|
|Internal state variables (soil moisture, water tables, etc.)|Point observation errors|Commensurability errors of predicted variables with respect to observed values arising from inappropriate theory and scale effects|
|Remote sensing data|Random error in correction of sensor values to fields of digital numbers (sensor drift, atmospheric corrections, etc.)|Inappropriate correction algorithms or assumptions about parameters|
| |Random error in converting digital numbers to hydrologically relevant variables|Inappropriate conversion algorithms in obtaining hydrologically relevant variables|
 Epistemic uncertainties (from the Greek ἐπιστήμη, for knowledge or science), on the other hand, arise from lack of knowledge and understanding and, it is often suggested, could be reduced in principle by having more or better measurements; or by new science. Epistemic uncertainties may not be treated easily in probabilistic terms, but it will not usually be clear what other uncertainty framework could be used usefully to represent them. Epistemic errors are often treated in terms of probabilities as if they were aleatory, but the probabilities are much more likely to be incomplete, imprecise, or nonstationary in nature. Consequently, any representation will require subjectively chosen assumptions (actually, by definition, since if we had the knowledge to describe them properly, they would no longer be epistemically uncertain, see the discussion of Rougier and Beven ). Note that some epistemic errors may lead to disinformation in learning about the hydrological response of a catchment, with consequences for model calibration, estimation, and evaluation. Examples are where rainfalls are recorded, and there is no observed discharge response in a humid catchment, or where apparent runoff coefficients are greater than 1 [e.g., Beven and Westerberg, 2011; Beven et al., 2011; Young, 2013].
 Thinking a little more deeply about uncertainty introduces still more complications. For instance, we might have a source of uncertainty that appears to be aleatory and therefore, appropriately represented in probabilistic terms, but it might not be clear as to whether we are using the correct probabilistic model (at base, a distribution or joint distribution, together with any consistent bias or correlation structure). Techniques have been developed (e.g., the meta-Gaussian transform, or copula methods) to try to convert apparently complex series of errors into simple distributional forms for which the theory is well developed (in particular, multivariate Gaussian forms) [e.g., Montanari and Koutsoyiannis, 2012]. But there might then be epistemic uncertainty about the choice of distributional form or transform (e.g., the choice of a particular distribution or copula). We do not know a priori what the correct form is for a given type of error arising in a particular type of application, although there are often forms accepted as “standard practice” (e.g., the distributions commonly used in flood frequency analyses). Thus, deeper thought will often reveal that sources of uncertainty are more epistemic than aleatory (see Table 2), even if the choice of treating them as if aleatory may be justifiable on the basis of checking the associated assumptions about the structure of the errors in both calibration and forecasting or simulation.
 Note that ex-ante adaptive forecasting is one area where it is most appropriate to treat all sources of uncertainty as if they are aleatory in nature. Useful serial correlation in the residual error may well be captured by an aleatory model, such as the heteroscedastic autoregressive, moving average (ARMA) process used in Young , thus enhancing the stochastic description of the data and leading to improvement in ex-ante forecasting ability. The adaptive nature of the forecasting will also allow some account to be taken of epistemic errors as they occur.
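As a minimal illustration of how serial correlation in the residuals can be exploited in ex-ante forecasting, the sketch below fits a first-order autoregressive, AR(1), model to past residuals and uses it to correct the next model forecast. This is a deliberately simplified stand-in, not the heteroscedastic ARMA process referred to above: the function names, the definition of the residual as observation minus model output, and the example numbers are all assumptions of this sketch.

```python
def fit_ar1(residuals):
    """Least-squares estimate of phi in e[t] = phi * e[t-1] + noise."""
    num = sum(curr * prev for curr, prev in zip(residuals[1:], residuals[:-1]))
    den = sum(prev * prev for prev in residuals[:-1])
    return num / den if den else 0.0


def corrected_forecast(model_forecast, last_residual, phi, lead=1):
    """Add the expected residual at the given lead time to the model forecast.

    Uses only the residual known at the forecasting origin, so the
    correction remains ex-ante.
    """
    return model_forecast + (phi ** lead) * last_residual


# Hypothetical residuals (observed minus modeled flow) up to the origin.
residuals = [1.0, 0.5, 0.25, 0.125]
phi = fit_ar1(residuals)
print(phi)
print(corrected_forecast(10.0, residuals[-1], phi, lead=1))
```

In an adaptive scheme the estimate of `phi` would be updated as each new observation arrives; the assumptions about the residual structure should still be checked and reported.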
 To summarize, while referees should ensure that all modeling studies are associated with some form of uncertainty estimation, it has often been difficult in the past to assess how authors have differentiated between aleatory and epistemic errors. Perhaps it is sufficient at this stage to ensure that authors make a clear statement about how aleatory and epistemic errors have been treated, with a clear statement about the assumptions made, and their evaluation.
5. The Language of Model Evaluation
 Another vexed issue is the language used for model evaluation, confirmation, validation, and verification that has generated so much discussion in the literature. Refsgaard and Henriksen  provide a useful discussion of this topic in relation to hydrological modeling. As pointed out by Oreskes et al. , Rykiel , and others, this is partly because there are multiple uses of these words that range from the colloquial to the technical. There are also quite different uses in different domains of the environmental sciences. In meteorological modeling it is standard practice to present information on different model verification statistics in the evaluation of weather forecasts [e.g., Pappenberger et al., 2008; Cloke and Pappenberger, 2008]. In other domains, model verification is normally only used for the evidence that a computer code is consistent with the definition of the model concepts that it purports to represent, without any requirement that the resulting outputs will be consistent with observations in a particular application.
 The loose usage of the language of model evaluation was first criticized by Oreskes et al. , following the a posteriori evaluation of hydrogeological simulation models published in Konikow and Bredehoeft  and Anderson and Wössner . They pointed out that the word verification comes from a Latin root (verus) meaning truth. But models are not truthful representations of reality, they are necessarily approximations, so verification is not an appropriate term to use in respect of models. Rykiel , following Oreskes et al. suggested that the only possibility of having a logical truth in environmental modeling would be in the implementation of the model concepts as computer code. He suggested that verification should therefore be restricted to this context, and not used when referring to how a model might represent the actual response of the system of interest. As such, verification should be qualified as conditional verification since any such exercise for a nonlinear model will be conditional on the range of tests to which the model has been submitted. We know from experience that even widely distributed model codes can crash with certain combinations of parameter values and boundary conditions.
 Model validation is quite complex in its origins. The word validation is derived from the Latin validus (meaning strong or effective) and only came to signify being supported by evidence or authority in the seventeenth century. It is now widely used in environmental modeling to mean the successful testing of model output variables against some criteria of performance [see, for example, Klemeš, 1986; Refsgaard and Knudsen, 1996; Refsgaard and Henriksen, 2004]. There have, for example, been many papers that have declared success in validating the SWAT model [e.g., Gassman et al., 2010]. But as Mitro  points out, the statement that a model has been validated needs some qualification as to how that validation has been carried out. A decision maker might be faced by predictions from a number of different models, all of which have been validated in some sense. The degree of belief in each of the models should then depend on an assessment of the validation process, so the relevant information needs to be communicated. Young  introduces the term “conditional validation,” where the model is validated on data other than that used in its estimation (sometimes termed “predictive validation”). A model that is conditionally valid is one that has not yet been falsified by tests against observational data (see later for a discussion of predictability and falsification). Models that are considered to be conditionally valid in this sense clearly have immediate practical utility in simulating within the range of the calibration and evaluation data, while allowing for their updating in the light of future research and development or change in catchment characteristics.
 In this sense, conditional validation represents good practice, even in its weakest form when a model that has been calibrated for one data set is evaluated against a different data set of the same variables, at the same site as used in the calibration (the classical “split-record test”). Stronger forms of conditional validation are also possible by using additional variables or sites that have not been used in calibration [Klemeš, 1986; Refsgaard, 1997; Refsgaard and Knudsen, 1996]. In general, validation will be both criterion and data set dependent [see, for example, Choi and Beven, 2007]. Thus, predictive validation can only ever be conditional [Young, 2001], dependent on the conditions of the evaluation process. Mitro  therefore suggests that the term validation should also be avoided by simply stating explicitly the conditions and results of the evaluations carried out. Referees should consider this to be good practice. If the word validation is used, then, as pointed out previously, it should always be associated with the word conditional, and the conditions of the model evaluation should be stated clearly (this is consistent with the suggestions of Refsgaard and Henriksen ). The implication is that further testing in the future might reveal model (or data) deficiencies that up to now have not been apparent. Suggestions for the clear use of terms in this context are given in Table 3.
Table 3. Suggestions for the Use of Terms in Hydrological Model Evaluation
|Term|Suggested usage|Comments|Alternative terms|
|---|---|---|---|
|Verification|Only for checks that a model computer code properly reproduces the equations it purports to solve. Will generally be only conditional verification dependent on the range of tests carried out|Sometimes used in context of comparing model outputs and observables (as in forecast verification): this should be avoided in favor of conditional validation or model evaluation. Conditions of model verification should be clearly stated| |
|Validation|Conditional evaluation that a model computer code reproduces some observed behavior of the system being studied to an acceptable degree|Should always be used in the form conditional validation with an explanation of the conditions of the model evaluation|Conditional model evaluation|
|Falsification|An assessment that a model computer code does not reproduce some observed behavior of the system to an acceptable degree|Should always allow for potential errors in observed behavior so as to avoid Type I error of rejecting a model as a result only of errors in the data|Model rejection|
| | | |To fail a hypothesis test|
|Fit-for-purpose|Conditional evaluation that a computer model code is suitable for use in a particular type of application|Should always state conditions of how fitness-for-purpose has been assessed|To have (conditional) predictability|
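The weakest form of conditional validation discussed above, the split-record test, can be sketched as follows. Everything specific here is an assumption made for illustration: the single-store `linear_reservoir` model, the grid of candidate values for its time constant `k`, the use of the Nash-Sutcliffe efficiency as the evaluation criterion, and the synthetic record (generated by the same model, so the fit is perfect by construction).

```python
def linear_reservoir(rain, k, s0=0.0):
    """Hypothetical single-store model: outflow is storage divided by k."""
    storage = s0
    flow = []
    for p in rain:
        storage += p
        q = storage / k
        storage -= q
        flow.append(q)
    return flow


def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean."""
    mean_obs = sum(obs) / len(obs)
    num = sum((s - o) ** 2 for s, o in zip(sim, obs))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def split_record_test(rain, obs, candidates):
    """Calibrate k on the first half of the record, evaluate on the second."""
    half = len(rain) // 2
    best_k = max(
        candidates,
        key=lambda k: nse(linear_reservoir(rain[:half], k), obs[:half]),
    )
    # Run over the whole record so storage carries into the evaluation period.
    sim = linear_reservoir(rain, best_k)
    return best_k, nse(sim[:half], obs[:half]), nse(sim[half:], obs[half:])


# Synthetic record for illustration only, generated with k = 4.
rain = [0.0, 12.0, 3.0, 0.0, 0.0, 8.0, 0.0, 0.0, 5.0, 1.0, 0.0, 0.0]
obs = linear_reservoir(rain, 4.0)

best_k, nse_cal, nse_eval = split_record_test(rain, obs, [2.0, 3.0, 4.0, 5.0])
print(f"k={best_k}, calibration NSE={nse_cal:.3f}, evaluation NSE={nse_eval:.3f}")
```

A report of such a test would state the conditions explicitly: the calibration and evaluation periods, the criterion used, and the variables and site involved.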
6. Conditional Validation, Falsification, and Fit-For-Purpose
 A statement that a model has satisfied some conditions in evaluation suggests that the model is (to some level of belief) fit-for-purpose. But it is important that the conditional nature of the validation should be made explicit so that it can be evaluated by potential users of the model output. The limitations of conditional validation have been discussed by Kumar  in relation to different types of model use. He differentiates the use of a model to reduce the uncertainty in estimating a future trajectory of the catchment response, and the use of a model to explore potentially new trajectories for the catchment response as a result of changing boundary conditions or the complexity of evolving catchment dynamics. The first can be subject to conditional validation (for example, in terms of reduction of uncertainty relative to the “climatology” of the historical responses) but does not guarantee successful simulation or forecasts in future if there are changes in the system. The examples of postaudit analysis of groundwater models cited earlier are good examples of this. The second cannot be subject to validation until the future evolves along a particular new trajectory. Since this might involve modes of response and process interactions that differ from the historical response, a different model might be required that cannot easily be tested in the ways outlined above. Conditional validation in the sense of Young  allows for these possibilities by stressing the need for continued recursive updating of model parameters and data assimilation under the assumption that the parameters might change over time. Statistically significant changes in parameters will then reveal changing dynamic behavior caused by changes in the system and the need to re-evaluate the model.
 It is quite possible, of course, that a model structure or parameter set may fail such a continuing evaluation. Model falsification or rejection is important because in progressing the science, we normally learn more from falsification than from (conditional) model validation. Falsification implies some improvement is required, either in the data that is being used to drive and evaluate the model or in the model structure itself. However, testing models as hypotheses in this manner is difficult in hydrology and other areas of environmental science because of the inherent epistemic nature of many sources of uncertainty, as discussed earlier [see Beven et al., 2012; Clark et al., 2011, 2012]. Since all models are approximations, a close enough look at the model outputs will always reveal some deficiencies. Thus, defining fit-for-purpose and testing models as hypotheses in some posterior analysis must also be conditional. In hydrology and as far as we can tell, in many other areas of environmental modeling, the conditional nature of model validation and what constitutes being fit-for-purpose have not been given sufficient consideration, despite the publication of various papers on this topic [see, e.g., Beven, 2001; Young, 2001; Kirchner, 2006; Clark et al., 2011; Beven et al., 2012]. Authors and referees should, however, at least recognize the nature of these issues in the presentation of model results and posterior analysis and ensure that the conditions of accepting a model as useful are clearly set out.
7. Conclusions
 In this opinion piece, we have sought to provide some guidance to authors and referees about the proper use of words in describing model structures, in the use of terms describing model outputs, in the description of different types of uncertainties, and in the use of language in model evaluation and falsification. We have (mostly) limited this opinion paper to comments about good practice in modeling semantics rather than good practice in modeling. However, we would hope that if referees insist on more rigor in the technical semantics, the resulting papers will be both clearer and result in better practice in model applications.
 Jens-Christian Refsgaard, David Huard, and a third anonymous referee are thanked for their comments, which helped improve the final version of this manuscript.