Many Commonly Used Rainfall‐Runoff Models Lack Long, Slow Dynamics: Implications for Runoff Projections

Evidence suggests that catchment state variables such as groundwater can exhibit multiyear trends. This means that their state may reflect not only recent climatic conditions but also climatic conditions in past years or even decades. Here we demonstrate that five commonly used conceptual “bucket” rainfall‐runoff models are unable to replicate multiyear trends exhibited by natural systems during the “Millennium Drought” in south‐east Australia. This causes an inability to extrapolate to different climatic conditions, leading to poor performance in split sample tests. Simulations are examined from five models applied in 38 catchments, then compared with groundwater data from 19 bores and Gravity Recovery and Climate Experiment data for two geographic regions. Whereas the groundwater and Gravity Recovery and Climate Experiment data decrease from high to low values gradually over the duration of the 13‐year drought, the model storages go from high to low values in a typical seasonal cycle. This is particularly the case in the drier, flatter catchments. Once the drought begins, there is little room for decline in the simulated storage, because the model “buckets” are already “emptying” on a seasonal basis. Since the effects of sustained dry conditions cannot accumulate within these models, we argue that they should not be used for runoff projections in a drying climate. Further research is required to (a) improve conceptual rainfall‐runoff models, (b) better understand circumstances in which multiyear trends in state variables occur, and (c) investigate links between these multiyear trends and changes in rainfall‐runoff relationships in the context of a changing climate.


Introduction
Improvement of conceptual rainfall-runoff models so that they can defensibly be applied in climate change impact studies is one of the most pressing issues in hydrology today. Climate change impact studies inform policy in areas as diverse as water supply (Flörke et al., 2018;Zhuang et al., 2018), flood impacts (Iqbal et al., 2018;Zhou et al., 2018), ecosystem services (Mahmoud & Gan, 2018), environmental flow requirements (Cui et al., 2018), food production (Balkovič et al., 2018), hydroelectricity (de Jong et al., 2018), and sociological and indigenous impacts (Emanuel, 2018;Xue et al., 2018). In many such studies, rainfall-runoff models form a vital link, translating projected climatic and weather conditions into river flow and available water. Usually, relatively simple conceptual models are adopted (Melsen et al., 2018), and arguably there is insufficient attention paid to their suitability for the critical role they fulfill (but see, e.g., Vaze et al., 2010;Melsen et al., 2018).
In practice, rainfall-runoff models often perform poorly when faced with historic hydroclimatic variability, particularly multiyear droughts (Saft, Peel, Western, Perraud, & Zhang, 2016;Seibert & van Meerveld, 2016). This is the main challenge to their suitability for climate change impact assessments: if rainfall-runoff models cannot represent the past, how can we trust them to simulate under future climate change? A common tool for testing models is the Differential Split Sample Test (Klemeš, 1986;Refsgaard et al., 2014), where models are tested on subperiods of the historic record that are drier or wetter than calibration data. In this test, models typically perform increasingly poorly in proportion to the degree of change in hydroclimatic variables such as precipitation (Vaze et al., 2010; see also Wilby, 2005;Merz et al., 2011;Coron et al., 2014;Refsgaard & Knudsen, 1996;Singh et al., 2011). These failures seem to indicate that the model equations are inadequate and must be improved or replaced, and this study discusses ideas to inform this process of model improvement. However, we note that a variety of other factors may impede model performance and/or proper evaluation, including unsystematic testing procedures (e.g., Coron et al., 2012;Thirel et al., 2015;Zheng et al., 2018), data quality issues (e.g., Kuczera et al., 2010), and poor calibration methods (e.g., Arsenault et al., 2013 ;Fowler et al., 2016 ; possibly exacerbated by poor parameter identifiability (e.g., Beven & Freer, 2001a).
Despite clear need, few studies so far have proposed improvements to model structures to make them better able to simulate under changing climatic conditions ; for examples of exceptions see Herron & Croke, 2009;Westra et al., 2014;Grigg & Hughes, 2018). We believe that one way to hasten progress is to guide model development and selection based on the plausibility of internal dynamics during simulation (cf. Kuczera and Mroczkowski, 1998;Kirchner, 2005). Plausibility has sometimes been considered in the context of improving model calibration, where local knowledge and/or the knowledge of experimentalists allows qualitative assessment of the realism of model simulations (Seibert & McDonnell, 2002;Winsemius et al., 2009;Gharari et al., 2014;Holländer et al., 2014). However, plausibility and realism are rarely considered in the hydrological components of climate change impact studies, which usually focus on describing and applying, rather than critiquing, the chain of modeling components (e.g., global climate models, regional climate models, and rainfall-runoff models).
In this study, only one aspect of plausibility is considered: the presence or absence of long, slow dynamics in simulated state variables. Here "long, slow dynamics" means that a model allows multiyear trends in at least one state variable (e.g., groundwater storage). Low-frequency dynamics have been reported in hydrology for many years (e.g., Hurst, 1951), with authors describing low-frequency components in rainfall regimes (e.g., Cañón et al., 2007 ;Xu et al., 2004) and discussing their propagation through the hydrological cycle (Hughes et al., 2012;Markovic & Koch, 2015;Shun & Duffy, 1999;Wang et al., 2015). This propagation is often accompanied by a lag which varies depending on hydrological component-for example, following a change in long-term average rainfall, soil moisture characteristics might respond on shorter time scales than groundwater (e.g., Eltahir & Yeh, 1999;Leblanc et al., 2009). Lag time scales are also mediated by the geology of aquifers and by soil characteristics (e.g., Markovich et al., 2016;Tague et al., 2013;Tague & Grant, 2009). With regards to multiannual dynamics, one point on which the literature is unclear is the relation of these dynamics to hydrological models-for example, are such models able to operate under low slow changes in climate and able to represent possible amplifications of climate signals by the hydrological cycle? Some studies (e.g., Marcovic and Koch, 2015) report that rainfall runoff models are indeed able to do this, but overall, little research effort has been expended in this area.
With respect to hydrological systems, long slow dynamics are very important because systems and models with such dynamics exhibit the following characteristics: a. streamflow that, for a given time step, depends not only on recent climatic conditions (the preceding weeks or months) but also on past years or even decades (Almanaseer & Sankarasubramanian, 2011;Risbey & Entekhabi, 1996); b. relatively slower recovery from extreme events such as multiyear droughts (Hughes et al., 2012;Yang et al., 2017); and c. relatively greater sensitivity to climate change, since accumulated changes in state variables can push the model or system into rainfall-runoff relationships not seen before in the observed period (Saft et al., 2015).
Another way of phrasing point (a) is that the model (and catchment) has "memory": it can "remember" preceding years' climatic forcing and modulate the flow response accordingly for the present time step (Grigg & Hughes, 2018;Risbey & Entekhabi, 1996). However, here we use the memory terminology with care because it is often used in unrelated debates in hydrology (specifically, about which mathematical models best replicate the properties of observed streamflow given apparent Hurst-Kolmogorov dynamics-e.g., Blender & Fraedrich, 2006;cf. Mandelbrot & Wallis, 1968;Koutsoyiannis, 2011). We also note that one viewpoint on point (c) is that the catchment system may have multiple stable states of flow response (in the sense used by Peterson et al., 2009;Peterson & Western, 2014;2014b); in this view, extreme events or climate change may cause a semipermanent transition between the stable states . However, these concepts are not considered further in this article.
This paper asks the following research questions: 1. Do multiyear trends occur in environmental state variables (e.g., groundwater) in the selected study region (the state of Victoria, Australia)? 2. Are qualitatively similar trends present in any simulated state variables for selected commonly used conceptual rainfall-runoff models calibrated within the study region?
In our discussion, we also focus on a further question: what are the implications for the suitability of the models for climate change impact assessment, and for the simulation of hydrologic processes?
By "qualitatively similar" we mean exhibiting similar patterns over similar time scales, even if the models in question do not explicitly have (for example) a groundwater store to directly compare with groundwater levels. This is often the case, since impact studies typically avoid using rainfall-runoff models that explicitly represent physical processes, due to prohibitive data and computational requirements. Instead, impact studies commonly use conceptual "bucket"-style rainfall runoff models and these are the type in focus here. Other types of conceptual models are considered in the section 6.6 but not explicitly tested.
Before explaining the technical aspects of the method (section 4), we first outline the rationale for the approach of the study (section 2) and the region in focus (section 3).

Rationale for Approach
Below, the rationale is explained using south-east Australia's Millennium Drought as the example; however, the logic could apply equally to other cases of multiyear drought subject to climate-induced multiannual trends in state variables such as groundwater.
To explain the rationale, we use the following thought experiment. The present-day hydrologist tasked with hydrological projections for future climate scenarios is thought of as similar to the hydrologist of 1996 tasked with projections of flows during the 1997-2010 Millennium Drought, assuming foreknowledge of forcing variables (precipitation, evapotranspiration) over this period. Is there any aspect of model simulations that can guide the hydrologist of 1996 in model selection, given they know the ensuing years will be significantly drier?
To explore this question, Figure 1 shows simulated storage time series from two hypothetical models A and B. A and B could be from two different model structures, or two parameter sets within the same model structure. For simplicity, we assume that the storage time series indicate the total water stored in each respective model, over all model stores. We first consider the blue line only. From the perspective of the hydrologist of 1996, the blue line represents simulated storage over recorded history. Both models show a qualitatively similar time series, except that during the pre-1997 period Model B comes considerably closer to the "minimum possible storage" limit. This limit arises in many models directly from the model conceptualizationfor example, conceptualizing a state variable as the contents of a bucket introduces an inherent minimum possible storage state, namely, "empty." Model B comes close to empty at the low point of the seasonal cycle every year prior to 1996, whereas Model A does not. Assuming identical past performance in other respects (equifinality; cf. Beven, 2006), which model should the hydrologist of 1996 choose? We argue that Model A should be considered inherently superior, if there is sufficient reason to expect that the catchment could enter increasingly dry states. The hydrologist of 1996 does not know whether future dry conditions might cause downward trends in real-world state variables, but it is reasonable and prudent to assume that such trends are possible. Since Model B already reaches empty in the summers preceding the drought, it explicitly forbids long-term downward trends (the pink-purple line in Figure 1) from developing during the Millennium Drought. On this basis, a priori judgment between A and B is possible based on plausibility, without any knowledge of streamflow during the drought.
Two key points of this paper are (i) henceforth, a similar logic should be used by hydrologists to judge (and disqualify) models as unsuitable for runoff projections in a drying climate and (ii) many commonly used "conceptual bucket" models fail the test, which is likely related to poor performance of these models in split sample testing. With respect to (i), related arguments could be given for a wetting climate, but that is not in focus here. We use the Millennium Drought case study to show that such trends can occur in real state variables in response to changing climate (i.e., Research Question 1) and that the trends are not matched-even in a qualitative sense-by calibrated rainfall-runoff models (Research Question 2), thus significantly reducing their plausibility.
The following comments are relevant to how the present study fits into the broader picture of model improvement. In principle, we could categorize all catchments globally into three categories: (i) catchments with recorded multiyear trends in state variables over the observational record; (ii) catchments with no recorded multiyear trends but which might experience them under future climate; and (iii) catchments which will never experience significant multiyear trends regardless of climate, possibly due to some structural limitation on terrestrial water storage (note, an a priori distinction between ii and iii may be difficult or impossible). Hydrologists working in type (ii) catchments could be forgiven for adopting conceptual bucket models even if such models disallow multiyear trends-after all, recorded history shows no multiyear trends, so why does it matter? Furthermore, even if they concluded it did matter, such modelers have no means of parameterizing model components that govern the multiannual dynamics. Nonetheless, models that disallow multiyear trends might neglect feedback between climate change and the hydrologic system, thus significantly underestimating future likelihood of water shortage and leaving open the possibility for "surprise" reductions in water supply. Thus, • It is incumbent upon us to investigate category (i) catchments in order to improve models, since category (ii) catchments are uninformative. This is the purpose of this paper-not (yet) to propose new models, but to critique existing models in category (i) catchments, and • For modelers who suspect their study area might be in category (ii), it is better to use these improved models and accept that the low-frequency components are poorly identified (Reichert & Omlin, 1997) than to use models that disallow low-frequency behavior altogether. When applied in the context of uncertainty analysis, the improved models may indicate the possibility of larger future water supply reductions than those indicated by the unimproved models. This is useful-indeed necessary-for appropriate risk management of water resources.
Parts of south-east Australia are undoubtedly in category (i), a fact confirmed by catchment responses to the 1997-2010 "Millennium" drought, as described herein and elsewhere (e.g., Leblanc et al., 2009;Leblanc et al., 2011;van Dijk et al., 2013). Thus, we now describe this region and event in more detail.

Study Area
The study area is the state of Victoria, Australia (Figure 2), with a total area of 237,000 km 2 . Victoria has a temperate climate and relatively high interannual variability of rainfall and streamflow (Peel et al., 2001), with numerous multiyear droughts evident in the historic record (Potter et al., 2010). Spatial variability in topography and rainfall is relatively high, and the western part of Victoria is relatively flatter and drier than the east (Figure 2). Together with much of south-eastern Australia, Victoria was significantly impacted by the Millennium Drought, which began in the mid-1990s and ended with large floods in late 2010 (Kiem & Verdon-Kidd, 2010). The Millennium Drought was the most severe drought on record in many areas (Potter et al., 2010;Verdon-Kidd and Kiem, 2009;Van Dijk et al., 2013). In response, the city of Melbourne (along with other Australian cities) commissioned water savings measures and water supply augmentation through desalination and interconnection with adjacent supply basins. In addition, legislation was introduced to improve the sustainability of water management in the Murray Darling Basin, which Previous studies have reported trends in groundwater and Gravity Recovery and Climate Experiment (GRACE) data in the Murray Darling Basin (e.g., Leblanc et al., 2009;Leblanc et al., 2011;van Dijk et al., 2013). For example, Leblanc et al. (2009) reported a persistent decline in groundwater storage six years into the Millennium Drought, in contrast to rapid drying of soil moisture (shallower than 2 m) within only two years. They estimated that the groundwater declines are significantly larger than could be explained by human use of groundwater. In this paper we assume that the declines are due to a change in the balance between recharge, evapotranspiration from groundwater, and drainage of groundwater to streams and rivers. Furthermore, our presumption is that because such changes to the natural system are likely to impact stream flows (e.g., Condon & Maxwell, 2017), they are "within-scope" for any rainfall-runoff model to be used in a climate-change impact study-even those that do not explicitly represent each process. In the present study, the Millennium Drought serves as an instructive test case for such models. Nearly all Global Climate Models agree that Victoria's future climate will be drier (e.g., Whetton et al., 2016), possibly associated with Hadley Cell expansion (Nguyen et al., 2015), so it is important for policy development in Victoria that models demonstrate fidelity to behavior during historic dry periods (droughts). Such fidelity is a "necessary but not sufficient" (Klemeš, 1986) requirement in order for models to be considered suitable for climate change impact studies-that is, it is a minimum requirement and necessary first step toward plausible runoff projections.

Overview of Lines of Evidence
Numerous lines of evidence are synthesized in this article in order to explore whether long-term trends are present in observed and simulated data. Ideally, a controlled experimental approach would be taken where multiple types of environmental variables are measured simultaneously for the same set of catchments. Unfortunately, such data are not available in the study area for the duration of the Millennium Drought. In Victoria, groundwater data are concentrated in lowland areas where groundwater is extracted and used (some care was required to select bores unaffected by extractions-see section 4.3). These data provide water table fluctuations rather than volume fluctuations. In contrast, streamflow data are concentrated in water supply areas in the highlands, with little overlap with groundwater data. A regional approach-examining the whole of Victoria rather than individual catchments-is adopted to overcome these limitations, using a relatively large sample of catchments and bores to gain generalized insights (Gupta et al., 2014;Addor et al., 2017;Addor et al., 2019).
To supplement groundwater records from specifically chosen bores (section 4.3), GRACE water storage estimates are also sourced (section 4.2). Groundwater data and GRACE data are relatively complementary: groundwater records are useful to characterize long-term trends but cannot be converted into actual volumes or depths of water because accurate estimates of specific yield are not available across the study area. In contrast, GRACE products are reported directly as depths, so they are more comparable with volumetric measures (such as simulated storage time series). However, temporal coverage of GRACE data is relatively short (2002 onward) and can only be extracted over large regions, whereas bore data can be related to individual catchments. Note that, in this study, we do not explicitly separate out the soil moisture and groundwater components of GRACE storage, opting instead to report GRACE data subject to minimal manipulation.
With respect to modeling, we consider daily rainfall-runoff models (sections 4.4-4.7) tested in various calibration experiments (some flow-only, and some inclusive of groundwater) in addition to annual rainfallrunoff relationships prepared according to Saft et al. (2015), as described in section 4.8.

GRACE Water Storage Estimates
GRACE measurements integrate information over large areas, so GRACE cannot provide information at the scale of individual catchments, but only at regional scales or larger. Therefore, we define two regions within the study area, labeled "West" and "East" (Figure 2). The regions are used not only in reporting of GRACE results but also for aggregating other results types, because they capture useful hydrological distinctions in the study area, as already noted. Specifically, the East region contains the upland areas of Victoria, which are subject to steeper topography, higher elevation (up to 2,000 m), and relatively high rainfall (historic averages up to 2,500 mm/year). In contrast, the West region is flatter and drier on average, ranging up to only 1,200-m elevation and generally with rainfall less than 1,000 mm/year. The north-west of the state is semiarid and is not included in either region. Neither region experiences year-round snowpack. Readers familiar with GRACE products prepackaged into large (e.g., 1°× 1°) grid cells will notice the contrast with regions here-the adopted regions are irregular and composed with no regard to such grid squares. Normally, large spatial filtering (400-km smoothing) applied in the GRACE data processing causes the "leakage" of gravity change signals from surrounding area. To minimize issues arising from these two inherent limitations, we resample the raw GRACE data within our respective analysis regions and apply a spatial delineation technique to ensure that the influence of adjacent land areas and/or the ocean is having minimal impact on the final result. The technical method to do this is described below.
The terrestrial water storage (TWS) over the west and east regions of this study is estimated using the Center for Space Research RL06 GRACE monthly gravity solutions, which use spherical harmonic coefficients to degree and order 60 from January 2003 to August 2017. We follow the conventional geophysical reduction (Peltier et al., 2015) and spherical harmonic coefficients adjustment and replacement (Cheng et al., 2011;Swenson et al., 2008) to estimate water storage change from the gravity solutions.
The water mass information of GRACE suffers from noise and aliasing error originating from the spherical harmonic coefficients (Seo et al., 2008); thus, filtering is applied to suppress them. We apply two-step spatial filtering, which is composed of a decorrelation filter (Swenson & Wahr, 2006) and 400-km Gaussian smoothing. The spatial resolution of GRACE data after such spatial filtering is about 400 km or larger. Therefore, an obvious trade-off of the spatial filtering is smoothing of the true signal due the diffused signal from adjacent regions (including the ocean for our study regions). Also, the spatial smoothing can make the TWS signals from the east and west regions less discernible.
In order to reduce the undesirable effect of the spatial smoothing, a forward modeling (FM) approach is deployed (Chen et al., 2013;Seo et al., 2015). In FM, a set of initial guesses for "unsmoothed" TWSs are assigned to analysis regions and they are iteratively updated until their spatially smoothed TWSs converge to the smoothed GRACE data. It is assumed that FM reproduces the unsmoothed signal without restoring the noise and aliasing error. The initial guess values do not affect the final solutions (e.g., zeros can be assigned to initial TWSs as a priori). The FM approach has been successfully applied to correct the smoothing effect in earlier works (e.g., Wouters et al., 2005). In this study, we use spatially averaged TWS values from smoothed GRACE data as the initial guesses, and spatially averaged unsmoothed TWSs are successively re-calculated during the iteration.
Please note that, in order the test the efficacy of FM in our study regions, we have applied FM to synthetic GRACE data of the west and east regions generated by TWS from the Global Land Data Assimilation Scheme. We contaminated the synthetic data by the estimated noise and aliasing error using the method of Eom et al. (2017) which involves applying the same filters GRACE uses, to Global Land Data Assimilation Scheme data, to generate the synthetic GRACE data. Prior to analyzing the real GRACE data, the same two-step spatial filtering as applied to the GRACE data was applied to these synthetic GRACE data. "Smoothed" TWSs from the synthetic GRACE data were reduced to unsmoothed TWSs using the same FM procedure (down to 25-km resolution as per Global Land Data Assimilation Scheme) applied in real GRACE data above. Figure S1 in the supporting information compares the synthetic true TWSs with their smoothed values and the unsmoothed values by FM over the west and east regions. Figure S1 clearly shows that amplitudes of the initial estimates of TWS are underestimated. The unsmoothed TWSs by FM agree well with true TWSs.
Water storage in large reservoirs exerts gravity which must be taken into account. Time series of water storage are publicly available (e.g., www.g-mwater.com.au/storages/). In section 5, aggregate time series for each region are presented together with, and used to correct, the GRACE estimates. These aggregated time series considered individual storage time series for the following storages: Lakes Dartmouth, Hume, Eildon, Eppaloch, plus total storage in the Melbourne Water Supply Network and the Grampians Wimmera Mallee Supply Network.

Groundwater Bore Data
Thousands of groundwater bores exist in the study area, but the vast majority are affected by groundwater extractions and/or have few observations during the study period. We consider only bores in the "State Observation Bore Network," denoting locations of higher priority for long-term monitoring. Bores are chosen according to the following criteria: (i) shallow bores, that is, with maximum screen depth less than~20 m, to characterize unconfined, rather than confined, aquifers; (ii) located away from extraction points of groundwater and preferably outside of Victoria's "Groundwater Management Zones" and "Water Supply Protection Areas," so as to minimize the influence of groundwater pumping (see map, Figure S2); and (iii) bores with monitoring covering the period prior to and throughout the Millennium Drought (approximately 1994-2009). Of approximately 1,500 State Observation Bores, only 300 are shallow, of which a very high proportion were commissioned to monitor drawdown due to nearby extractions and thus fail criteria (ii). Only 19 bores remain after all criteria are applied, and these bores are listed in section 5 ( Figure 4).
For each of the 19 bores, raw observations (available at https://www.vvg.org.au) are supplemented by temporal interpolation using the freely available "Hydrosight" package (https://github.com/peterson-tim-j/ HydroSight/wiki). Given that observations at bores in Victoria are often sporadic (e.g., monthly or seasonal), temporal interpolation by Hydrosight improves visual discernment of long-term trends. Hydrosight applies the time series modeling approach of Peterson and Western (2014) to generate a groundwater hydrograph based on climatic forcing (rainfall and potential evapotranspiration as described in section 4.6). In a subsequent step, the resulting groundwater hydrograph is made to perfectly honor each observation using the interpolation methods of . Outlier observations are identified and removed as per Peterson et al. (2017).
For both GRACE data and groundwater data, no formal method is used to quantify long-term trends; rather, the emphasis is on visual inspection of the plotted time series.

Rainfall-Runoff Models: Assessing Model Simulations
In this section, we first explain the adopted method for analyzing a given model simulation to answer Research Question 2, then provide relevant details including site selection and model choice. Recall that Research Question 2 requires checking for long-term trends in simulated state variables. Visual inspection is cumbersome if we wish to test large samples of catchments and/or many different models, so is only provided for in the supporting information. To make the results digestible in a single plot, the following steps are applied to each model that is being tested: a. Define periods: Three two-year periods, labeled as "predrought," "middrought," and "late drought" are defined. These are calendar years 1995-1996, 2004-2005, and 2008-2009, respectively. b. Extract sum of state variables: For each day in each two-year period, the simulated state variables are extracted and added together. The types of state variables differ between models, but may include soil moisture storage, groundwater storage, and/or interception storage. Regardless, this step provides a daily time series of total water storage for each two-year period. c. Get minimum and maximum storage over each period: This is extracted and expressed as a percentage of the maximum possible storage (if defined) or, if unlimited, the maximum storage attained over the full modeling period. Note that this calculation depends upon the parameter set in question in cases where storage capacities are set by calibrated parameters. Further details are provided in Text S1 in the supporting information. d. Repeat, and plot: Steps a-c are repeated across other catchments. A box plot is prepared to summarize the distribution of minimums across all catchments in a given region, with a separate box plot for maximums.
This analysis is relevant to Research Question 2 because minimums close to zero for the predrought period imply that multiyear downward trends during the drought are impossible, as per Figure 1. Catchment selection, model choice, and split sample design follow from previous studies: • 38 catchments are adopted from Fowler et al. (Fowler et al., 2016;. Note that these earlier studies adopted a wider sample of 86 catchments, but 48 of them are not in the current study area, leaving 38 for adoption here. All catchments are unregulated, free from significant temporal changes (e.g., in land use) and have long-term "Hydrologic Reference Stations" measuring discharge (Turner, 2012). Data checks are done for each catchment as per Fowler et al. (2016).
• Five daily conceptual bucket models are tested, with details given in Fowler et al. (Fowler et al., 2016;, but in summary: GR4J (Perrin et al., 2003), SIMHYD (Chiew et al., 2002), IHACRES using the catchment wetness index approach (Jakeman & Hornberger, 1993;Ye et al., 1997), GR4JMOD (Hughes et al., 2013;Grigg & Hughes, 2018;see Fowler et al., 2016 for version information), and SACRAMENTO (Burnash et al., 1973). The phrase "model structure" is used hereafter to indicate the model equations and parameter limits (see Text S2) as distinct from a given parameterization. As noted in Fowler et al. (2016), the five models tested here cover a wide range of approaches to model development and are representative of many rainfall-runoff models in common use. • Differential split sample testing is adopted from Fowler et al. . This testing is conducted whereby the entire historic record, minus the seven driest consecutive years (judged according to streamflow), is used in calibration (the "nondry" period). The seven driest years (the "dry" period) are reserved for independent model evaluation.
Of the eight objective functions tested in Fowler et al. , two are used here, the Kling-Gupta Efficiency (KGE; Gupta et al., 2009) and the Refined Index of Agreement (Willmott et al., 2012). Fowler et al.  reported that the latter objective function gave good scores in evaluation, and thus gives the models the best possible chance (given the current state of knowledge) to perform over the varied hydroclimatic conditions in the study area. • To assess model performance consistently across both objective functions, we follow Fowler et al.  and use the KGE. This means that the KGE is used as both the evaluation metric and also as one of two objective functions (for the distinction between objective function and evaluation metric, see the discussion in Fowler et al. (Fowler, Coxon, et al., 2018)). KGE is selected as evaluation metric because the KGE's three components of bias, correlation, and match in variability are readily understandable and broadly reflect what water resource managers expect from models. • Forcing data (daily rain and potential evapotranspiration) are adopted from Jones et al. (2009) andJeffrey et al. (2001), as described in Fowler et al. (2016).

"Coverage" Tests
To supplement differential split sample test results, the concept of coverage is adopted from Fowler et al. (Fowler, Coxon, et al., 2018). Coverage indicates performance against criteria regarding model performance-in this case, two criteria are used: KGE nondry > 0.7 and KGE dry > 0.7 (cf. Fowler et al., 2016). Unless at least one parameter set is found that fulfills both criteria simultaneously, the model is deemed to have no coverage of that portion of the objective space. In principle, coverage can be determined using a large random sample; thus, coverage is independent of calibration method and depends only on the model structure and its numerical implementation. In practice we test coverage using the multiobjective optimizations of Fowler et al. (2016).

Joint Calibration to Surface Water and Groundwater/GRACE
The split sample calibration described above is (or should be) standard practice and a minimal first test for determining the adequacy of models for application in changing climate (Fowler, Coxon, et al., 2018;Klemeš, 1986;Refsgaard et al., 2014). However, if models thus calibrated do not exhibit multiyear trends in state variables, this does not necessarily mean they cannot do so. To test this, we need a calibration directly to a state variable that exhibits such a trend.
As noted already, the available data are sparse, making such a test exceedingly difficult to formulate. Few catchments contain bores and no catchment has sufficient bores to adequately characterize internal spatial variability in water table depth. Also, it is unclear how to relate the water table spatial field to the simulations of a lumped conceptual rainfall runoff model, because of lumping across space and across different hydrologic processes and state variables. Arguably the GRACE data are more suited to this purpose since they aggregate soil water and groundwater together, but the mismatch in spatial scale is problematic.
We are relieved from this dilemma via a simple observation: for many bores in the Western Region, the magnitude of the seasonal variation is exceeded by the magnitude of the multiyear trend over the Millennium Drought. This observation would presumably also be true of GRACE data if these data were available back to 1997. To see the truth in these statements, the reader is directed to section 5, Figures 3, 4,  this observation as a kind of "soft" data to inform calibration (e.g., Seibert & McDonnell, 2002). The logic is formalized as follows (note that this is shown diagrammatically in Figure 8): • Start with the time series of total simulated water storage as defined in section 4.4b; • Focusing on the predrought (1995)(1996) and late-drought (2008-2009) periods of section 4.4a: ∘ Let A be the minimum simulated storage during the predrought period; ∘ Let B be the maximum simulated storage during the late-drought period. • Define the objective function to be minimized: Put simply, this objective function incentivizes the model to strengthen the multiyear trend over the drought sufficiently to ensure that the seasonal range by the end of the drought has nil overlap with the seasonal range just prior to the drought. For brevity, we henceforth call this the "nil-overlap" requirement. In practice, we apply this in a three-objective Pareto optimization experiment, where the first two objectives are OF 1 = KGE nondry and the OF 2 = KGE dry (refer to section 4.4; note that these are maximized). The above nil-overlap objective function is referred to as OF 3 . By using these multiple objectives, we are requiring each of the five models to simulate flows over a broad range of climatic conditions in addition to showing a multiyear trend in total model storage. Pareto optimization is undertaken using AMALGAM (Vrugt & Robinson, 2007) with a generation population of 500 and a total of 200,000 function evaluations. These settings are informed based on earlier experience with this combination of optimizer, model structures, and catchments (Fowler et al., 2016).
The limitation of such an approach is that it can only be applied in catchments where we are sure that the original observation (nil overlap) is likely to be true. Three catchments are selected for the experiment, corresponding to gauges 406213, 406214, and 407230. These three are all located in the same region, the central highlands of Victoria, where local bore data consistently indicate sufficiently strong multiyear declines in groundwater (see section 5 and Figure 8).

Assessment of Temporal Changes in Calibrated Parameters
Since different parameters govern different aspects of flow response it may be insightful to analyze how parameter values change when the models are calibrated to different climatic periods (Merz et al., 2011). Here we focus on parameter change between the nondry and dry periods as defined above. . GRACE equivalent water heights for the two GRACE regions, along with reservoir storage in each respective zone. The dashed lines and solid red lines show GRACE data before and after being adjusted for water in reservoirs, respectively, and in each case is reported so that the long-term average is zero. Ticks on the horizontal axis mark the beginning of the calendar year. Note that GRACE data are not available prior to 2003.
When basing this solely on single "optimum" parameter sets obtained via optimization methods (e.g., , one concern is that equifinality (Beven, 2006) means that the parameter values of the numerically optimum parameter set may have limited meaning. This is because multiple different parameter sets, often with widely varying values of a given parameter, might all have similar near-optimum performance. Examining the change among multiple parameter sets, rather than just one, may partly mitigate this problem. Thus, here we consider multiple parameter sets.
We use the calibration method of Fowler et al. (2016) (also used in the coverage tests; see above) which provides a Pareto Curve (trade-off) between KGE nondry and KGE dry . The curve itself is composed of 100 parameter sets. We then select the five best performing parameter sets according to KGE nondry and select a different set of five according to KGE dry . In each case the selected five will include the endpoint of the curve and the next four. Note that Fowler et al. (2016) showed that endpoints of the Pareto curves are almost always very similar (in terms of performance) to results from a single-objective optimizer. For each parameter separately, the five parameter values are noted and the median calculated. Temporal changes in parameters are quantified by the change in this median value from nondry to dry periods.
The above is done for all five models and for the three catchments defined in the previous subsection. Also, the analysis is repeated using the Willmott objective function instead of KGE (i.e., running the multiobjective optimizer to define the trade-off between Willmott nondry and Wilmott dry ).

Changes in Annual Rainfall-Runoff Relationships
To supplement the simulation results, we test the stationarity of annual rainfall-runoff relationships in each study catchment, using the method presented by Saft et al. (2015). Working in south-east Australia, Saft et al. (2015) demonstrated that statistically significant changes in the relationship between annual rainfall and annual runoff occurred in many catchments during the Millennium Drought. To investigate whether such catchments also exhibit poor rainfall-runoff model coverage and poor differential split sample test performance (cf. , we repeat this analysis here. The method involves the following steps: (i) linearization of the relationship using a Box-Cox transformation (Box & Cox, 1964), (ii) definition of a separate relationship over the drought (in this case, the seven-year dry period of section 4.4), (iii) correction for autocorrelation in residuals (Haan, 2002), and (iv) a statistical test (at the 5% level) of the hypothesis that the rainfall-runoff relationship during the dry period is different to the rainfall-runoff relationship for the rest of the record.

Multiyear Trends in Observations of State Variables
The GRACE and groundwater results indicate clear multiyear trends associated with the Millennium Drought. For both zones, GRACE data ( Figure 3) over 2003-2009 show decreasing trends superimposed over a seasonal cycle, followed by some recovery in 2010 when the Millennium Drought ended with significant flooding. As mentioned, our focus is on the drought rather than subsequent recovery, so we defer further discussion of drought recovery to future work, limiting subsequent figures in this paper to the pre-2010 period. Note the following points with respect to GRACE: (i) GRACE data are reported relative to an arbitrary baseline (in this case the mean value over the period); (ii) the Millennium Drought commenced in 1997 prior to the launch of GRACE, so declines in TWS during the first six years of drought are not captured by GRACE. Thus, it not possible to compare predrought and postdrought GRACE data. (iii) The dashed line in Figure 3 shows the GRACE time series prior to adjustment for water stored in reservoirs. Total reservoir storage is more significant in the east (approximately 13,000 GL at full supply, which is 120 mm averaged over the whole east region) than in the west (approximately 1,500 GL or 15 mm at full supply over the west region), but in both cases the adjustment is small relative to the GRACE signal itself. Figure 4 shows groundwater hydrographs at each of the study bores. Many hydrographs show clear dominance of multiyear trends over seasonal fluctuations. The strength of the multiyear trend varies from site to site but is clearly discernible in almost all cases (for exceptions, see, e.g., 110166 and 111526). A small number of bores show abrupt changes rather than smooth trends (e.g., 112919). There is little difference between the two regions: of the six bores in the East, only one (110166) shows no discernible multiyear trend. As noted previously, the hydrology of the East region is dominated by the steep upland areas of Victoria, and yet there are no long-term bores in the upland areas ( Figure 2). Each of the bores in the East are located in the surrounding lowlands and are thus of limited use in generalizing across the region as a whole.

Model Behavior Case Study
Before considering whether multiyear trends are present in simulated state variables across all catchments, we examine a case study catchment: Axe Creek at Longlea (406213; west region; Figure 5a), one of very few catchments with concurrent long-term records in groundwater and surface water (Chiew et al., 2014;Figure 4). Figure 5a shows that rainfall in this catchment reduced by around 20% during the Millennium Drought, and flow responded nonlinearly with reductions of around 80%. Note that this catchment is also affected by small private dams, and these effects have not been explicitly represented here since the focus is on the qualitative mismatch between modeled storage and groundwater levels. As an example model, GR4J simulations are also shown, calibrated as per section 4.4 and using the Index of Agreement as the objective function. GR4J simulated flows over the Millennium Drought are about twice as large as observed (consistent with Saft, Peel, Western, Perraud, & Zhang, 2016; Figure 3). Figure 5b shows the groundwater levels from four selected bores-note that we have relaxed the rule requiring shallow (20 m deep) bores in three cases here (two are 50 m deep and one is 100 m deep-see map in Figure S3). To ease interpretation, the bore data are shifted on the y axis so that each bore has a value of zero at 1 July 1997, and values at other times are expressed relative to this. As with the bore data in the previous figure, multiyear trends during the Millennium Drought dominate these records. Finally, Figure 5c shows the time series for the GR4J "production" store state variable. This store functions as GR4J's soil moisture store, but also implicitly captures dynamics in shallow groundwater, since these are not explicitly represented in GR4J. The production store time series is dominated by seasonal and episodic components, showing little multiyear trend, contrasting strongly with the bore data. If this model were to exhibit multiyear declines such as those seen in the groundwater data, we would expect that the simulated storage would remain quite high (say, the minimum values would remain above 50% at least) during the predrought period, which is clearly not the case. In particular, the annual minimums show no trend, since they were already at zero prior to the drought (e.g., in 1992, 1995, 1997) and cannot go below zero. The annual maximums decline during the drought, but do not do so in a multiyear trend like the bore data do. Note that in this case the GR4J routing store (not shown but cf. Figure S6) has very low storage by comparison and GR4J has no . Further information on the groundwater data, including ID numbers and bore location map, is in Figure S3. (c) shows GR4J's production store only; in this case the GR4J's only other store (the routing store) has very low storage by comparison.

10.1029/2019WR025286
Water Resources Research other state variables. In this case GR4J has been calibrated over the composite period 1978-2001 and 2009-2010, using the most robust objective function identified by Fowler et al. . To summarize, the above discussion suggests that the GR4J model (as calibrated) does not plausibly represent-neither explicitly nor implicitly-the hydrological changes that occurred in the Axe Creek catchment during the Millennium Drought. But is this an isolated case? Figure 6 gives box plots of maximum and minimum simulated storage for the two-year periods defined in section 4.4. This plot summarizes the information in the 190 subplots in Figure S6, which show the full time series for each store for every model and catchment. Hypothetically (and as noted above), if a model were to exhibit multiyear trends such as those seen in the groundwater data, this would be indicated by a relatively high storage during the predrought period. In this hypothetical, in Figure 6 both the seasonal minimum storage and the seasonal maximum storage would each decline during the middrought years, as catchment drying accumulates. The lowest values would be reached during the late drought, after more than 10 consecutive drought years.

Analysis of Model Behavior: Full Set of Catchments and Models 5.3.1. Analysis of Simulated Storage Dynamics
The box plots results contrast strongly with this hypothetical situation, suggesting that the simulated behavior seen in Figure 5 (GR4J in Axe Creek) is typical of our sample of rainfall-runoff models, particularly in the West region. Storage in the West typically drops below 5% during the predrought period for all models except for SACRAMENTO, a profoundly low value indicating very little scope for the development of multiyear trends. Relatively speaking, SACRAMENTO typically has 2 to 3 times the minimum storage compared to the other model structures, but still usually below 20% in the West region. The larger values for SACRAMENTO arise from interactions between the multiple storages in this model. In general, SACRAMENTO storages need to be filled to a relatively higher level before streamflow occurs. Focusing now on the maximums, these show some downward trend across the three two-year periods. However, this is likely indicating the relative dryness of the years, rather than an accumulation with time, as discussed in section 6.2.
In the East, simulated multiyear trends seem possible based purely on the predrought minimums, which lie significantly above zero. For example, the GR4J simulated storage typically varies between 20% and 50% of the maximum value (for definitions see section 4.4/Text S1), providing scope for significant downward departures from the predrought range of variation.
Despite this, multiyear trends do not develop in the models, as seen by the relative similarity between the middrought and late-drought minimum box plots in Figure 6. Like the west, the maximums do decrease with time as discussed in section 6.2.

Model Performance and Coverage
Model performance results show a stark contrast between East and West. Recall from section 4.4 that models were calibrated to all of the historic record except for the seven driest consecutive years, which were reserved for evaluation. Figure 7a shows the coverage results-recall from section 4.4 that this is a calibration-independent measure of whether consistent performance is possible over the two periods (calibration, evaluation) of contrasting climatic conditions (without changing the parameter set). In the East, all model structures have coverage in more than 80% of catchments, but in the West, only two model structures (IHACRES and SACRAMENTO) have coverage in more than 20% of catchments. Split sample results (Figure 7b) show similar patterns, with median KGE in evaluation of around 0.7 for all models in the East, compared to −2.0 (GR4J; GR4JMOD), 0.0 (SIMHYD), 0.25 (SACRAMENTO), and 0.35 (IHACRES) in the West. Interestingly, model performance in calibration exceeds KGE of 0.7 even in the West, which is a sobering reminder that model performance in the present is a poor predictor of model performance under a changed climate . As mentioned in section 4.4, the analysis in Figure 7b is conducted for two different objective functions; for brevity only one (KGE) is shown here. The results for the other (Index of Agreement; Figure S4) show improved evaluation performance (consistent with  and a stronger contrast in calibration scores: calibration scores for the West are significantly lower than the East. This suggests that using the KGE as the objective function may cause overfitting, possibly to high-flow events ; Figure 2).  Figure S4 shows results when the objective function is the Index of Agreement. The whiskers extend a maximum of 1.5 times the interquantile range. Values beyond the whisker are marked as outliers and are denoted as cross.

10.1029/2019WR025286
Water Resources Research 5.4. Results: Joint Calibration to Surface Water and Groundwater/GRACE Figure 8b shows the results for the joint calibration to surface water and groundwater/GRACE, for the three selected catchments in the West region. Recall from section 4.6 that the calibration is not directly to groundwater or GRACE data, but rather the nil-overlap requirement, a "soft data" requirement related to the seasonal range of simulated water storage. Namely, the seasonal range late in the drought should have nil overlap with the seasonal range prior to the drought (Figure 8a). The output of this calibration experiment is an ensemble of 500 parameter sets (Figure 8b) that are spread out over the threedimensional Pareto surface and describe the trade-off between the three objectives. When plotted in any two dimensions, this ensemble will appear as a cloud rather than a line, which is why the plots in Figure 8b lack the crisp monotonic character of two-dimensional Pareto studies such as Fowler et al. (2016).
The ensembles are plotted with OF 1 (KGE nondry ) and OF 2 (KGE dry ) on the x and y axes, respectively-the same as in  However, the main focus here is on OF 3 , which measures long, slow dynamics in water storage and is denoted by color. Because GR4J, SIMHYD, and SACRAMENTO are unable to attain an OF 3 score of zero, Figure 8 confirms that these models lack the ability to simulate long, slow dynamics within our experimental setup. For IHACRES and GR4JMOD, long slow dynamics are possible with certain parameter sets (dark blue) but these parameter sets have poor performance with regards to streamflow. Recall that the best Figure 9. Temporal changes in calibrated parameter values between the nondry ("nd") and dry ("d") periods for the same catchments as Figure 8 and using the methods outlined in section 4.7, when KGE is used as objective function. Figure S5 shows results when the Willmott Refined Index of Agreement is used as objective function instead. Parameter values are normalized by the minimum and maximum allowable in optimization. See Text S2 for parameter ranges.

10.1029/2019WR025286
Water Resources Research possible OF 3 score is 0 and attaining this score should be considered a minimum requirement since it merely implies that the two seasonal ranges are offset-possibly only slightly. In reality, the ranges are significantly offset for some of the bores (e.g., 116568 in Figure 8a).
Much of the Pareto surface is a long way outside of the plot bounds of Figure 8b (too far to show here) and it is necessary to interrogate this unseen portion to reveal the best possible value of OF 3 (aside from IHACRES where a value of 0 is seen within the plot bounds). For GR4JMOD, the minimum possible value is zero which affirms the changes made by Hughes and co-authors (Grigg & Hughes, 2018;Hughes et al., 2013). For the other models, the best possible values of OF 3 are 1.0, 14.2, and 111.7 mm for GR4J, SIMHYD, and SACRAMENTO, respectively. Therefore, these models cannot fulfill the nil-overlap soft data requirement (section 4.6). Overall, these multiobjective results affirm earlier split sample results suggesting the absence of long, slow dynamics from these five conceptual bucket rainfall runoff models, or, at least, the inability of these models to harbor multiyear trends while also providing good simulation performance for streamflow. Figure 9 shows the change in individual parameter values when calibrated to the nondry period and then separately to the dry period. In general, the results are mixed and not always consistent between catchments and models. Across the examples shown, the parameter governing soil moisture storage capacity increases in the dry period for GR4J/GR4JMOD (x1) and SIMHYD (smsc). When the Willmott objective function is used instead of KGE ( Figure S5) this is also true for IHACRES, but no longer for SIMHYD. As discussed below in section 6.5, this behavior could result from trying to simulate an enlarged unsaturated zone under groundwater decline. Given significant model biases over the dry period as reported in Fowler et al. (Fowler, Coxon, et al., 2018), many of the parameter changes can be interpreted as compensation for excess water in the model (i.e., underestimation of ET) during the drought. Two examples are the downward trend in GR4J's x2 (i.e., toward greater water export) and the increase in SIMHYD's interception storage capacity rinsc. Some results are challenging to explain and may be the result of model-specific interactions that are difficult to untangle. For example, the trend toward higher x4 in GR4J and GR4JMOD tends to attenuate flood peaks (possibly to soften the impact of overestimation of flood peaks on objective function score), whereas the trend toward lower rk in SIMHYD has the opposite effect. Model-specific factors likely explain these seeming contradictions. In SACRAMENTO, complex interactions make it difficult to interpret the results. For example, of the two parameters governing the capacity of the soil's "upper zone," one increases while the other decreases (uztwm and uzfwm, respectively). Thus, it is difficult to say whether the above hypothesis Figure 10. Results of the analysis of changes in annual rainfall-runoff relationships (cf. section 4.6). Red catchments had statistically significant downward shifts (p < 0.05) in their annual rainfall-runoff relationships during the seven driest consecutive years on record, relative to the remainder of the record. about soil moisture storage holds for this model, without conducting detailed model-specific experiments. While it would be easy to explain certain trends as logical-for example, the temporal decrease of parameters governing SACRAMENTO's interflow and the wetting up of saturated areas (uzk and adimp, respectively)-it must be recalled that correct process representation would ideally result in a constant parameter value even as the underlying flux changes with time.

Changes in Annual Rainfall-Runoff Relationship
Finally, Figure 10 maps the catchments where the annual rainfall-runoff relationship underwent a change during the drought (of statistical significance at the 5% level). The map shows that 10 of 12 catchments in the West experienced a change in rainfall-runoff relationship, compared to only 7 out of 26 in the East. As per section 4.6, this analysis is conducted over the same seven-year dry periods as the evaluation periods in Figure 7. Thus, there is strong concordance between results of annual analysis and daily models.

Answers to Research Questions
The first research question is do multiyear trends occur in environmental state variables (e.g., groundwater) in the study region? Based on Figures 3 and 4, the answer to this question is yes. During the Millennium Drought, GRACE data show significant downward trends, superimposed upon seasonal fluctuations. Similarly, groundwater hydrographs show multiyear trends in nearly all cases, albeit with varying strength between sites. Consistent with section 2 and LeBlanc et al. (2009), it is unlikely that these trends are due to human causes such as groundwater extraction. Groundwater use in Victoria is less than 500 GL (17 mm) per year in 2015/2016 (Government of Victoria, 2016), and a large proportion of this is likely replenished by recharge, even during drought conditions. This can be seen in the groundwater hydrographs in Figure 4: hypothetically, if recharge had dropped to near-zero during the drought, this would have removed the upward half of the seasonal cycle in the bore data, producing a monotonic trend. In reality, in Figure 4 most bores show a continued seasonal cycle superimposed upon the downward trend throughout the drought.
The second research question is are qualitatively similar trends present in any simulated state variables for selected commonly used conceptual rainfall-runoff models calibrated within the study region? Despite the aggregation across stores used in Figure 6, it is clear that significant downward trends in modeled state variables are absent, for both West and East regions. Interested readers can examine individual stores in Figure  S6. Particularly in catchments in the West region, multiyear decline during the drought was not possible because the model "buckets" were already empty during summer predrought.
The third question is what are the implications for the suitability of the models for climate change impact assessment, and for the simulation of hydrologic processes? This question requires more detailed consideration and is addressed in section 6.4 below. However, first we must consider whether the results could not be explained by confounding factors other than model structural deficiency (section 6.2) and whether there is auxiliary evidence that the declining groundwater is linked to streamflow generation (section 6.3).

Attributing Results to Model Structural Deficiency and Not Confounding Factors
While the results above appear to indicate that the five selected conceptual bucket models are incapable of harboring multiyear trends, it is possible that the results are actually arising from other factors. In the literature, deficiencies in simulations are often due to factors other than structural inadequacy, such as • Data errors in model inputs (e.g., Andréassian et al., 2001), • Errors in data to which models are calibrated (e.g., Boughton, 2006), • Expecting the model to simulate aspects of behavior that it was not calibrated to (e.g., Mroczkowski et al., 1997), and • Poor choice of objective function (e.g., . In isolation, the split sample results are vulnerable to many of these criticisms. For example, perhaps different split sample results would arise from adopting a different objective function. This is why the split sample test is accompanied here by the coverage test. The coverage test can be conducted based on random samples of parameter sets (section 4.4), and thus, it is a more direct test of the capabilities of the model than a split sample test since the latter is always contingent on calibration method (Fowler, Coxon, et al., 2018). A further criticism is that perhaps the split sample results would be different if the models were directly calibrated to groundwater rather than to streamflow only. This is why the split sample test is accompanied here by the three-objective Pareto test (Figure 8), which directly tests the ability of the model to harbor multiyear trends in catchment storage. With respect to data errors, while it is true that errors can have a large impact on model simulations and parameter inference, it is difficult to see how such errors could be responsible for the reported Pareto results. Consider the third objective in the three-objective Pareto test, which merely tests the nil-overlap requirement (i.e., that the simulated seasonal range in storage over the predrought period should have nil overlap with the range in the late-drought period). This simple requirement is highly robust to the effect of data errors-that is, it is highly unlikely that the errors would be organized in just such a way as to provide a "false-negative." Based on the above discussion, it is unlikely that the lack of long, slow dynamics in the five models tested can be explained by anything other than structural inadequacy.

Groundwater-Surface Water Connections
To assess the implications for climate change impact assessment, we need to go beyond mere discussion of model structural inadequacy and consider hydrological processes. For example, it could be stated that "the fact that a model cannot correctly predict groundwater dynamics does not mean that it cannot be used to predict streamflow," and the only way to decide this is to consider auxiliary evidence of groundwater-surface water interaction. Consider the following three possibilities: 1. The streamflow and groundwater systems were "connected" both prior to and during the drought; 2. The streamflow and groundwater systems were connected predrought, but became "disconnected" during the drought (see also Fuchs et al., 2019;Kinal & Stoneman, 2012); and 3. The streamflow and groundwater systems were not connected predrought, nor during the drought.
The above statement does not hold for (1), as a connection implies streamflow can be influenced by groundwater. Nor does it hold for (2), in which streamflow is influenced by groundwater up until a point in time during the drought, then the connection ceases. Thus, the only category to which the statement can be applied is (3). So to demonstrate that the statement does not apply to the study catchments here, we need only demonstrate that they are not in category 3. There are two main lines of evidence of this, as listed below. Note that this review of evidence mainly focuses on the western region and particularly the three catchments in Figures 8 and 9: • Bore data: For some bores, for example, bore 116568 (Figure 8), the water table was above the ground surface predrought, indicating groundwater discharge. For other locations, shallow groundwater coexists with local slopes that are steep enough to indicate likely nearby groundwater discharge. For example, at bore 104359 (Figure 8), the water table was regularly shallower than 4 m predrought, and given significant local topography (e.g., within a 2.5-km radius are points 90 m higher and 40 m lower in elevation) it is considered unlikely that such shallow groundwater could exist independently of local surface water systems. • Prior studies: groundwater interaction with the surface is sufficiently strong that numerous local studies have investigated its impact on water supply and salinity. For example, Parsons et al. (2008) conducted a Murray-Darling-Basin-wide study of connectivity as part of a yield assessment and reported widespread groundwater-surface water interaction across our study region, including in the catchments reported in Figure 8. For example, they categorized the Campaspe River (406213) as "gaining" groundwater along the majority of its length upstream of the gauging station (see their Figures 4 and 5). Likewise, DSE (Department of Sustainability and Environment, Victoria) (2012) identified multiple locations along Joyces Creek (407230) as potential groundwater-dependent ecosystems and/or mineral springs (see their Figure 6). Lastly, in a predrought audit of salinity across the Murray Darling Basin, MDBMC (Murray Darling Basin Ministerial Council) (1999) estimated that the Campaspe and Loddon (of which Joyces Creek is a tributary) as having 1,000 (Campaspe) and 2,000 (Loddon) hectares in the headwater regions subject to "high water tables." In the Campaspe River, salinity records (http://data.water.vic.gov.au/) over the past 30 years indicate that salinity within a given year typically ranges between 100 and 600 mg/L at gauge 406213, with a slight increase during the mid-late Millennium Drought, but otherwise relatively stable. The salt is transported to the stream and land surface via groundwater interactions, which was the motivation for the MDBMC (1999) survey of shallow water tables.
The above evidence strongly supports the hypothesis that groundwater plays an integral role in streamflow generation in the catchments surveyed in Figure 8. Thus, the observed multiyear declines in groundwater are very likely to be the cause, or one of the causes of the unexpectedly large declines in streamflow (Saft et al., 2015) and the poor simulation performance of the tested rainfall runoff models.

Discussion of Flow Generation Processes
Working in a wider set of catchments than here, ) proposed a statistical model explaining shifts in annual rainfall-runoff relationships. Catchment storage variability predrought (implied from low flow behavior) and soil depth were selected as among the most informative from a large range of potential predictors. The authors concluded that catchment storage is a key factor in understanding changes in catchment response, which is consistent with the logic used here (see also section 3).
In terms of processes, two further concepts that may be helpful (particularly in explaining the East-West divide) are mentioned here. The first is the concept of nonlinear feedback between catchment storage and runoff generation, for example, through "saturation excess" mechanisms (e.g., Dunne, 1983). Petheram et al. (2011, p. 3,622) said, We propose that in the low-relief, moderate rainfall catchments of southern south-east Australia [our West region], relatively high groundwater levels may have amplified overland flow during pre-drought conditions by reducing the storage capacity of the unsaturated zone and by facilitating organised patterns of drainage and the connection of source areas of runoff as the soil wetted up during a rainfall event. Under a falling water table, the storage capacity of the unsaturated zone increased, and hence saturation conditions were less likely to occur, and the connectivity of source areas was likely to be less organised.
If this interpretation is correct, then runoff projections may be more compromised in landscapes where these processes are more likely to occur-presumably, catchments with gentler slopes, flat valley bottoms (Gallant & Dowling, 2003), soils deep enough to store significant volumes of water, and favorable (for the phenomena) hydrogeology. Further research is required to characterize this systematically, but in general, catchments in the East usually have steeper slopes and fewer flat valleys, so the processes described in the above quote may be less applicable in the East.
The second concept that may be helpful in explaining the East-West divide is the distinction between stream segments that are "gaining" groundwater (because the groundwater level exceeds the stream water level, thus leading to a hydraulic gradient and movement of water into the stream) and those "losing" water to groundwater (the opposite, where groundwater level is below stream level and thus the groundwater seeps downward toward the water table). It is known that stream reaches may transition from losing to gaining in the shift from dry season to wet . As the Millennium Drought progressed, perhaps many stream segments in the West transitioned from generally gaining to generally losing Saft et al., 2015). However, such generalizations belie the complexities arising from landscape form-for example, the Axe Creek catchment ( Figure 5) has significant topography and rainfall only on its south-western boundary ( Figure S3) and thus might be described/modeled as a single upstream source area supplying water to lowland reaches that were gaining predrought and losing thereafter. This hypothesis would be difficult to model using generic model structures and suggests that sitespecific approaches to modeling based upon perceived catchment functioning (e.g., Fenicia et al., 2014) may have a place in formulating process-based descriptions of drought hydrology.

Implications
To explore the implications of the results, consider the aforementioned properties (cf. section 1) that systems and models with long, slow dynamics exhibit. The first property is streamflow that, for a given time step, depends not only on recent climatic conditions (the preceding weeks or months) but also on past years or even decades. The results imply that simulated streamflow from the models used in this study lacks this property. For example, every time the GR4J storage in Figure 5c empties (i.e., almost every year during the dry season), this "wipes clean" the memory of past events. The previous wet season could have been the wettest or driest on record-regardless, if the bucket subsequently empties, it makes no difference to future simulations. Model states in each year thus depend only on conditions in that year. We suggest the middrought maximums in Figure 6 exceed the late-drought maximums only because 2005-2006 was slightly wetter than 2008-2009 (cf. Figure 5a), not because multiyear trends develop.
The second property exhibited by systems and models with long, slow dynamics is relatively slower recovery from extreme events such as multiyear droughts. As noted throughout the study, the scope here is limited to the drought period, leaving the recovery topic to future work. Thus, commentary on this property is limited (for now) to what we can learn from the data presented over the drought period. GRACE and bore data indicate the accumulation of a water deficit during the drought (Figures 3 and 4). At the conclusion of the drought, it seems likely (to be explored in future studies) that this deficit must be replenished before the prior rainfall-runoff relationship is restored. If this interpretation is correct, hydrological drought may continue beyond meteorological drought (Eltahir & Yeh, 1999;Van Loon, 2015;Yang et al., 2017), and we recommend this phenomenon be explored in future research.
The final property is relatively greater sensitivity to climate change, since accumulated changes in state variables can push the model into rainfall-runoff relationships not seen in the historic record. Inability to track deficit accumulation means models projecting runoff under a drying climate will likely overestimate future streamflows. This depends on the model in question: some models such as IHACRES (refer to Ye et al., 1997) contain thresholds of catchment wetness below which flow ceases, and such thresholds may effectively mimic natural processes in some cases (we believe that this partially explains the superior coverage results of IHACRES, as per Figures 7a and 8b). Other conceptual models may lack such thresholds but still exhibit suitably low flows during extended dry periods because the relevant equations (i.e., relating catchment wetness and streamflow) are highly nonlinear at the drier end of the range of conceptual storage. These latter models may still be severely limited if the moisture deficit ceases being properly tracked, because a relatively small rainfall event may be sufficient to drive the soil moisture back into a state of higher flow production, thus ending the dry spell unrealistically early.

Recommendations for Model Improvement
The data shown demonstrate that many catchments have multiannual memory of climatic forcing. In order for this memory to be simulated, the simulation models would need time responses that are at least as long as the memory. However, the longest time constants on the models tested (and most rainfall-runoff models) are those related to "baseflow" stores or similar, and are an order of magnitude shorter (months) than the observed memory (years). Such stores are typically designed to track event or seasonal recession behavior, not multiannual dynamics. Therefore, we suggest that changes are required to such models, and some possibilities are presented below.
• Models with additional storages. One simple way of adding a longer time response would be to add a model storage with a slow time constant that interacts with the faster-responding components, but care would be needed to parameterize the new storage on short or uninformative data. Coupled groundwater-surface water models are another possibility, but since they are often complex distributed models (Sophocleous, 2002), their high data requirements and computational cost may limit adoption in climate change impact studies, depending on context. Thus, more conceptual groundwater representations may also be helpful. Alternatively, quasi-distributed models such as Dynamic TOPMODEL (Beven & Freer, 2001b;Coxon et al., 2018) contain ensembles of storages, each representing different locations in the landscape (hillslopes, valley bottoms, etc.-cf. Savenije, 2010). Some formulations allow for the possibility that portions of the landscape may become hydrologically cut off in response to groundwater decline-for example, the upper segments of hillslopes may temporarily become cut off from the stream network. If this happened for extended periods (e.g., multiyear droughts), this could constitute a form of low-frequency dynamics, but the physical realism of this behavior would need consideration. • Alternative systems for tracking moisture. In typical model buckets, moisture can vary between two fixed limits: "empty" and "full." Here observed state variables (e.g., groundwater) entered multiyear decline but models were unable to do so because they could not go "beyond" empty. This suggests the suitability of alternative moisture-tracking formulations that do not impose a hard lower limit (empty) on storage. Examples include "bottomless" buckets (Wagener et al., 2003) and deficit formulations (Beven and Kirkby, 1979;Coxon et al., 2019; see Knoben et al., 2018 for discussion). We also recommend testing of functions that allow ET to continue even after flow has ceased, for example, via a (smoothed) threshold in catchment storage that must be exceeded before flow can occur (e.g., Ye et al., 1997;Son and Sivapalan, 2006;Kavetski & Kuczera, 2007).
• Models with multiple stable states: Peterson et al. (2009) demonstrated a hillslope Boussinesq rainfallrunoff model with multiple steady states arising from positive feedback arising partially from soil salinity. If it can be confirmed that a given aspect of the Millennium Drought led to a change from one steady state to another (e.g., feedback associated with vegetation dynamics), a numerical model of such dynamics may be more capable of modeling streamflow under changing climatic conditions. This is an area of active current research.
A key difficulty in assessing the veracity of all these modeling ideas is the lack of data to understand the mechanisms of long-term hydrological change. This underscores the need for ongoing continuous monitoring and field campaigns in experimental catchments that continue over years and preferably decades.

Other Recommendations
Based on the findings of this study, the following is recommended: • Initial plausibility checks for climate change impact assessments. In regions subject to projected drying, rainfall-runoff models should be checked for plausibility via inspection of their storage time series, before being adopted in climate change impact studies. Models likely to prevent development of multiyear trends, like Model B in Figure 1, should be avoided. Ideally, multiple models should initially be trialled (Clark et al., 2011), and model choice should depend on both performance and plausibility of simulations. • Research on other events similar to the Millennium Drought. A key limitation of this study is that it considers only one region and only one event, the Millennium Drought. Similar studies are recommended in other regions (for example, the 2011-2017 Californian drought) and preferably other climate types, to test the generality of conclusions derived from the Millennium Drought. • Investigation of mechanisms and causal drivers. The causes of shifts in rainfall-runoff relationships are still poorly understood (Descroix et al., 2009;Yang et al., 2017). Future research might seek to better understand circumstances where multiyear trends occur and are hydrologically relevant, including exploring the long-term dynamics in groundwater, soil moisture, actual evapotranspiration, and plant water use and growth (including remotely sensed variables). • Research into drought recovery to learn whether delays in recovery are simply due to long, slow processes or are better characterized as a semipermanent transition to a new state of flow response (see previous subsection). In Victoria this is hampered because of recent conditions: after record floods ended the Millennium Drought in 2010, two thirds of the intervening years have been drier than average, weakening the contrast between drought and postdrought conditions. • Research into calibration methods. A model structure's ability to simulate runoff in climate conditions different to the calibration data is strongly contingent on calibration method (Fowler, Coxon, et al., 2018;. Despite recent progress in this area (e.g., , further work is required.

Conclusions
Evidence presented in this paper suggests that five commonly used conceptual bucket rainfall runoff models lack the ability to simulate long, slow processes that can dominate catchment response to climate change.
The five models tested here cover a wide range of approaches to model development and are representative of many rainfall-runoff models in common use. Catchment state variables (GRACE and groundwater bore data) exhibited multiyear trends during the 13-year Millennium Drought in the state of Victoria, Australia. In contrast, the five conceptual bucket models tested replicate event and seasonal dynamics but lack any multiyear trends. The model storages go from high to low values in a typical seasonal cycle. Once the drought begins, there is often no room for decline in the simulated storage because model buckets are already emptying on a seasonal basis prior to the drought.
The data shown demonstrate that many catchments have multiannual long-term memory of climatic forcing. In order for this memory to be simulated, the simulation models would need time responses that are at least as long as the memory. However, the longest time constants on the models tested (and many other rainfall-runoff models) are an order of magnitude shorter (months) than the observed memory (years), and are designed to track seasonal recession behavior, not multiannual dynamics. Therefore, we suggest that changes are required to such models, and have provided suggestions above (section 6.6). Since the effects of multiyear drought (or climate change) cannot accumulate within these models, they cannot be used to extrapolate to different climatic conditions (particularly drier conditions) and are thus invalid for use in climate change impact studies.
The results were different in the drier, flatter West region compared to the mountainous East region. In the East, the model storages did not approach empty as closely, and the models performed better in split sample testing. These findings were discussed in the context of physical processes of flow generation, including the presence or absence of nonlinear feedback through saturation excess mechanisms or similar. The prevalence of such mechanisms may make a given region more vulnerable to a drying climate, but this hypothesis requires further research.
Based on our findings, we propose a test (section 2) that models should pass before they are adopted in climate change impact studies. Models likely to prevent development of multiyear trends should be avoided. With respect to modeling, future research is required to (i) identify model structural elements well suited to these case studies (as discussed in section 6.6); (ii) where relevant, test existing models that already include such structural elements; and (iii) if required, incorporate such structural elements into new or improved model structures to better simulate the long, slow dynamics. More broadly, future work may (a) expand the analysis here to other regions and drought events; (b) examine drought recovery, aiming to gain further insight into dominant processes; and (c) investigate mechanisms and causal drivers, including the link between long-term climate anomalies and changes in rainfall-runoff relationship. We see this continued research as crucial to the credibility and development of trust in runoff projections.