Improving the Representation of Long‐Term Storage Variations With Conceptual Hydrological Models in Data‐Scarce Regions

In the Luangwa basin in Zambia, long‐term total water storage variations were observed with Gravity Recovery and Climate Experiment, but not reproduced by a standard conceptual hydrological model that encapsulates our current understanding of the dominant regional hydrological processes. The objective of this study was to identify potential processes underlying these low‐frequency variations through combined data analysis and model hypothesis testing. First, we analyzed the effect of data uncertainty by contrasting observed storage variations with multi‐annual estimates of precipitation and evaporation from multiple data sources. Second, we analyzed four different combinations of model forcing and evaluated their skill to reproduce the observed long‐term storage variations. Third, we formulated alternative model hypotheses for groundwater export to potentially explain low‐frequency storage variations. Overall, the results suggest that the initial model's inability to reproduce the observed low‐frequency storage variations was partly due to the forcing data used and partly due to the missing representation of regional groundwater export. More specifically, the choice of data source affected the model's ability to reproduce annual maximum storage fluctuations, whereas the annual minima improved by adapting the model structure to allow for groundwater export from a deeper groundwater layer. This suggests that, in contrast to previous research, conceptual models can reproduce long‐term storage fluctuations if a suitable model structure is used. Overall, the results highlight the value of alternative data sources and iterative testing of model structural hypotheses to improve runoff predictions in a poorly gauged basin leading to enhanced understanding of its hydrological processes.


Introduction
Long-term and thus low-frequency total water storage variations have been observed in many regions world-wide (Long et al., 2017;Scanlon et al., 2018). This includes long-term increasing or decreasing storage trends and multi-annual variabilities over ≥10 years. For example, decreasing storage trends were observed in Australia during the Millennium Drought in 1997-2010 (e.g., Chen et al., 2016;Leblanc et al., 2009;Zhao et al., 2017a), whereas both, increasing and decreasing long-term trends as well as multi-annual variabilities were observed in the United States (Boutt, 2017;Long et al., 2013), the La Plata basin in South America (Chen et al., 2010), China Z. Zhang, Chao, et al., 2015), and different African river basins (Awange et al., 2016;Bonsor et al., 2018;Werth et al., 2017) which were attributed to rainfall variabilities, glacier melting, droughts or human activities such as groundwater abstractions and land cover changes.

Abstract
In the Luangwa basin in Zambia, long-term total water storage variations were observed with Gravity Recovery and Climate Experiment, but not reproduced by a standard conceptual hydrological model that encapsulates our current understanding of the dominant regional hydrological processes. The objective of this study was to identify potential processes underlying these low-frequency variations through combined data analysis and model hypothesis testing. First, we analyzed the effect of data uncertainty by contrasting observed storage variations with multi-annual estimates of precipitation and evaporation from multiple data sources. Second, we analyzed four different combinations of model forcing and evaluated their skill to reproduce the observed long-term storage variations. Third, we formulated alternative model hypotheses for groundwater export to potentially explain low-frequency storage variations. Overall, the results suggest that the initial model's inability to reproduce the observed low-frequency storage variations was partly due to the forcing data used and partly due to the missing representation of regional groundwater export. More specifically, the choice of data source affected the model's ability to reproduce annual maximum storage fluctuations, whereas the annual minima improved by adapting the model structure to allow for groundwater export from a deeper groundwater layer. This suggests that, in contrast to previous research, conceptual models can reproduce long-term storage fluctuations if a suitable model structure is used. Overall, the results highlight the value of alternative data sources and iterative testing of model structural hypotheses to improve runoff predictions in a poorly gauged basin leading to enhanced understanding of its hydrological processes.
Plain Language Summary According to satellite observations, the total amount of water stored on and below the land surface varied over the years in the Zambian Luangwa river basin. However, this variation was not well reproduced by existing rainfall-runoff models, resulting in inaccurate predictions of runoff and water availability. The goal of this study was to identify processes causing longterm fluctuations in the total water storage by using alternative data sources and by adjusting the model structure. First, we analyzed whether similar long-term fluctuations existed in the climate using different satellite products. Second, we tested whether these fluctuations could be better represented using different data sources. Third, we tested whether they could be caused by inter-basin groundwater flow. We indeed showed that long-term storage fluctuations were better represented by alternative data sources and by incorporating groundwater loss from the basin, leading to more reliable runoff predictions in the poorly gauged Luangwa basin.

HULSMAN ET AL.
These studies relied on satellite-based Gravity Recovery and Climate Experiment (GRACE) data, with the exception of Boutt (2017) who used in situ data.
However, many hydrological models fail to reproduce such observed long-term storage variations (Fowler et al., 2020;Scanlon et al., 2018;Winsemius et al., 2006). As highlighted by previous studies, these observed long-term storage trends and variations can be a result of climate variability, land-cover change, other human interventions or any combination thereof, while the inability of models to reproduce these variations can be a result of model structural deficiencies, poor parameterization, data errors, unsuitable parameter values or any combination thereof (Fowler et al., 2018;Grigg & Hughes, 2018;Jing et al., 2019;Saft et al., 2016). For example, Bouaziz et al. (2020) showed that although a suite of different conceptual models could similarly well reproduce stream flow over almost two decades, they considerably varied in their skill to reproduce observed storage variations, which was attributed to deficiencies of different model architectures. With a few notable exceptions (e.g., Bouaziz et al., 2018;Goswami et al., 2007;Hrachowitz et al., 2014;Le Moine et al., 2007;Perrin et al., 2003;Samaniego et al., 2011), processes that could potentially allow longterm memory effects, such as groundwater export (Fowler et al., 2020;Istanbulluoglu et al., 2012), remain mostly unaccounted for in standard formulations of conceptual rainfall-runoff models (Bergström, 1992;Burnash et al., 1973;Euser et al., 2015;Fenicia et al., 2014;Liang et al., 1994;Willems, 2014). This leads to the situation that these models cannot sufficiently well capture long and slow processes dominating longterm storage variations, as convincingly demonstrated by Fowler et al. (2020). Their study, which focused on the Millennium Drought in Australia, illustrated that modeled annual minimum storage remained rather constant instead of showing a decreasing trend. The reason for this was that the modeled storage converged to or even reached zero toward the end of each dry season and hence could not decrease any further. Such an omission of processes that allow to account for long-term memory processes in rainfall-runoff models results in biased modeled discharge and impedes accurate estimations of water availability which is particularly crucial during extreme dry conditions (Saft et al., 2016).
In many river basins, detecting long-term storage variations and identifying their drivers is challenged by limited availability of high-quality ground observations. That is why in this context satellite observations may play an important role. For example, satellite-based GRACE observations describe variations in the Earths' gravity field which can be used to detect regional mass changes that are dominated by variations in the terrestrial water storage after removing atmospheric effects. In other words, GRACE observations, which are available on monthly timescale, provide valuable information on total water storage changes (Landerer & Swenson, 2012;Swenson, 2012). For example, GRACE observations have been used for groundwater monitoring (Tangdamrongsub et al., 2018;J. Zhang et al., 2020), or drought analysis (Chao et al., 2016;Hulsman et al., 2021;Leblanc et al., 2009;van Dijk et al., 2013;Zhao et al., 2017b;D. Zhang, Zhang, et al., 2015).
While several previous studies focused on identifying long-term storage variations from (satellite-based) observations and potential drivers for these variations as well as on quantifying differences between observations and model results (e.g., Fowler et al., 2020;Jing et al., 2019;Joodaki et al., 2014;Leblanc et al., 2009;Meng et al., 2019;Scanlon et al., 2018), only very few studies attempted to modify a hydrological model to allow for meaningful representations of long-term storage variations. In one exception, Grigg and Hughes (2018) modified the GR4J rainfall-runoff model (Perrin et al., 2003) to mimic long-term catchment memory effects. This was done by introducing a threshold in the storage reservoir such that percolation from this reservoir stopped when the storage was lower than the threshold while evaporation losses continued. Some other studies highlighted the value of incorporating groundwater import or export in hydrological models (e.g., Bouaziz et al., 2018;Hrachowitz et al., 2014;Le Moine et al., 2007). For example, Le Moine et al. (2007) analyzed 1040 French catchments and concluded inter-basin groundwater flow should be incorporated explicitly in hydrological models instead of correcting the rainfall or potential evaporation to close the water balance. Bouaziz et al. (2018) illustrated inter-catchment groundwater flow reached on average 10% of the precipitation in the Meuse river basin and should be accounted for in models to prevent overestimating the actual evaporation. However, these studies did not analyze the effect of such a process on the long-term variability in the total water storage. Other studies corrected for poor representations of long-term storage trends by assimilating total water storage anomaly observations according to GRACE into hydrological models (Khaki et al., 2018;Schumacher et al., 2018).
In this study, long-term storage variations were observed in the Luangwa river basin, but not reproduced by a distributed implementation of a conceptual model. This model was used in several previous studies to assess the potential of satellite-based altimetry, evaporation and total water storage anomaly observations for stepwise model development and spatial-temporal model calibration Hulsman, Winsemius, et al., 2020). In these studies, the model successfully reproduced the dynamics of multiple hydrological variables in the Luangwa basin. The objective of this paper was to identify potential and so far overlooked processes underlying these low-frequency variations in a combined data analysis and model hypothesis testing approach (Clark et al., 2011). In the spirit of Nearing et al. (2016) and Addor and Melsen (2019), we here more specifically tested the hypotheses that the frequently reported inability of conceptual hydrological models to reproduce observed long-term, low-frequency water storage variations is a result of the combined effects of (1) model forcing data that are incongruent with data of storage variations and (2) oversimplified representation of processes associated with basin-scale groundwater dynamics in such models and that (3) a careful choice of the data source and adaptation of groundwater-related model processes can significantly improve the representation of long-term storage variations in conceptual models.

Site Description
The Luangwa River is a 770 km long, mostly unregulated tributary of the Zambezi in Zambia (Figure 1). Its 159,000 km 2 large basin area is poorly gauged and mostly covered with deciduous forests, shrubs and savanna. The elevation varies from 400 m up to 1,850 m between the low-lying areas around the river and the highlands. In this semi-arid area, there is a distinct wet season from October to April with heavy rains up to 100 mm month −1 . Nevertheless, the mean annual potential evaporation (1,555 mm yr −1 ) exceeds the mean annual precipitation (970 mm yr −1 ) (Hulsman, Winsemius, et al., 2020;The World Bank, 2010). The lithology is governed by intergranular/fractured siliciclastic sedimentary rocks in the center of the basin and weathered/fractured metamorphic rocks closer to the basin borders (Hartmann & Moosdorf, 2012;IG-RAC & UNESCO-IHP, 2015;Ó Dochartaigh, 2019).

Data Availability
In this study, hydro-meteorological data as shown in Table 1 were used. This included two satellite-based precipitation products (CHIRPS and TRMM) and five actual evaporation products (WaPOR, SEBS, SSEBop, GLEAM and MOD16). Land-cover changes were assessed using NDVI (Normalized Difference Vegetation Index). Temperature data from the CRU data set (Climatic Research Unit) Version 4.01 was used to estimate the daily potential evaporation with the Hargreaves (Hargreaves & Allen, 2003;Hargreaves & Samani, 1985) and Thornthwaite (Maes et al., 2019) methods. For this purpose, monthly temperature observations were interpolated to daily timescale using in situ observations at two locations (28° 30' E, 14° 24' S and 32° 35' E, 13° 33' S).
Processed GRACE observations generated by Centre for Space Research (CSR), GeoForschungsZentrum Potsdam (GFZ), and Jet Propulsion Laboratory (JPL) were obtained from the GRACE Tellus website (https:// grace.jpl.nasa.gov/). This study used the average of these three sources which previously processed the raw data to remove atmospheric mass changes, systematic errors and noise, and to subtract the 2004-2009 timemean baseline (Landerer & Swenson, 2012;Swenson & Wahr, 2006;Wahr et al., 1998). As a result, total water storage anomalies were available in equivalent water thickness. Total water storage anomaly observations include all terrestrial water storage components, hence water stored in the surface water bodies, soil moisture and groundwater.

Approach
This study consisted of three steps. In the first step, we analyzed the effect of the choice of the data source used to explain observed total water storage variations to understand whether any of the data contain, in principle, sufficient information to at least broadly reflect the dynamics of storage variations. This was necessary to rule out that the model's inability to reproduce long-term storage variations is merely an artifact of unsuitable data. Thus, we investigated whether periods of high water storage anomalies roughly coincide with periods of high precipitation anomalies and/or low evaporation anomalies and vice versa. To do so, we contrasted long-term estimates of variables such as precipitation, potential and actual evaporation from multiple data sources with the observed water storage variations. This allowed a preliminary assessment of which data sources are more consistent with the observed low-frequency storage variations than others. Based on that, we then analyzed, in a second step, four different combinations of data sources, that is, precipitation and potential evaporation, as input for a distributed implementation of a process-based hydrological model and evaluated their respective effects to reproduce the observed long-term storage variations with the model. In a third step, we then iteratively formulated and tested several alternative model hypotheses, incorporating alternative and/or additional process representations, such as regional groundwater export, for their importance to meaningfully reproduce long-memory effects.
where S is total water storage, P precipitation, E evaporation, and Q discharge.

Data Analysis
Long-term, basin-averaged satellite observations of precipitation from the CHIRPS and TRMM data products, actual evaporation from the WaPOR, SEBS, SSEBop, GLEAM and MOD16 products, potential evaporation according to the Hargreaves (Hargreaves & Allen, 2003;Hargreaves & Samani, 1985) and Thornthwaite (Maes et al., 2019) methods, respectively, as well as land-cover based on NDVI (Table 1) were contrasted with and compared to the water storage variations estimated by GRACE. For each of these data sources, the temporal variability was visualized on monthly and/or annual timescale.
To assess the potential role of regional groundwater import to or export from the basin, the water balance was estimated using long-term average annual precipitation, evaporation, and discharge from the different satellite products. Assuming negligible long-term storage changes and data uncertainties, surpluses or deficits in the long-term water balance, hence if   P E Q ≠ 0, can then be largely attributed to groundwater import/export. In case of groundwater export, the average annual loss term can then be estimated according to (e.g., Bouaziz et al., 2018): where L Q is annual mean groundwater export (mm yr −1 ), P annual mean precipitation (mm yr −1 ), E annual mean evaporation (mm yr −1 ), and Q annual mean discharge (mm yr −1 ). HULSMAN ET AL.

Benchmark Model (Model A0)
The process-based distributed hydrological model used in this study for the Luangwa basin was stepwise developed and refined in previous studies Hulsman, Winsemius, et al., 2020) following the FLEX-Topo modeling concept (Savenije, 2010). Each 0.25° × 0.25° model cell had the same model structure and parameter set, but was forced differently using spatially distributed precipitation and potential evaporation data (e.g., Euser et al., 2015). In addition, each cell was further discretized into functionally distinct landscape classes, that is, hydrological response units (HRUs) based on the topography (Nijzink et al., 2016). All HRUs within a cell were connected through a common groundwater component ( Figure 2a). This groundwater reservoir was lumped over the entire basin assuming a homogeneous groundwater system . The HRUs were classified based on the local slope and "Height-above-the-nearest-drainage" (HAND; Rennó et al., 2008) into sloped areas (slope ≥ 4%), flat areas (slope < 4%, HAND ≥ 11 m), and wetland areas (slope < 4%, HAND < 11 m). As a result, 68% of the basin was classified as flat, 28% as sloped, and 8% as wetlands (Figure 1b). This FLEX-Topo modeling concept was previously successfully applied in many different environments Gharari et al., 2014;Hulsman, Winsemius, et al., 2020;Nijzink et al., 2016).
As illustrated in Figure 2a, the hydrological model consisted of multiple storage components representing the interception storage, unsaturated root-zone storage, as well as fast and slow responding storages. Each storage component was schematized as reservoir with corresponding water balance and constitutive equations as shown in Table 3. As the dominant processes and thus the associated model structures of the three individual HRUs were very similar to each other, the major differences between the HRUs were accounted for by different parameter values. Model process constraints were applied as shown in Table 4 to allow partly overlapping prior parameter distributions with relationships consistent with our physical understanding of the system Hrachowitz et al., 2014), and to limit equifinality HULSMAN ET AL.
10.1029/2020WR028837 6 of 31 Figure 2. Schematization of the model structure applied to each grid cell for Models A0-A5. For Models A1-A5 (b-f), only the groundwater module is shown for brevity and clarity of the presentation, as the rest of the model structure remained the same. Abbreviations: precipitation (P), effective precipitation (P e ), potential evaporation (E p ), interception evaporation (E i ), plant transpiration (E t ), infiltration into the unsaturated zone (R u ), drainage to fast runoff component (R f ), delayed fast runoff (R fl ), groundwater recharge (R r ), groundwater upwelling (R GW ), fast runoff (Q f ), groundwater recharge into Deeper Groundwater reservoir (R s ), shallow groundwater flow (Q ss ), groundwater loss (Q L ) and deep groundwater flow (Q sd ). (Beven, 2006). For example, in the Luangwa basin, higher interception evaporation and larger root-zone storage capacities were expected in the densely vegetated, forest dominated sloped areas compared to the flat, grass-and shrub-land dominated areas and wetlands. Processes unique to a single HRU were incorporated by adjusting the model structure where necessary. In sloped and flat areas for example, the groundwater system was recharged by downward infiltration whereas in wetlands upwelling groundwater sustained the shallow groundwater tables and the unsaturated zone soil moisture content under the assumption that water is pushed upwards from the upland groundwater system into the wetlands due to the groundwater head difference between the uplands and wetlands .
After having calculated the runoff for each grid cell, the total flow at the outlet was estimated by applying a simple routing scheme based on the flow distance to the outlet and a constant, calibrated flow velocity. The modeled total water storage anomaly was then calculated for each grid cell by taking the sum of all storage components, hence S tot = S i,tot + S u,tot + S f,tot + S su + S sd (see Table 3 for an explanation of the abbreviations), and subtracting the 2004-2009 time-mean baseline similar to GRACE. The storages S i,tot , S u,tot and S f,tot are weighted averages from the storages S i,HRU , S u,HRU and S f,HRU , respectively, in each HRU in a grid cell. This model consisted of 17 calibration parameters with uniform prior distributions and process constraints as summarized in Table 4  . Parameter ranges were based on previous studies (e.g., Gao et al., 2014;Gharari et al., 2014;Wang-Erlandsson et al., 2016), their most extreme values (for example the splitter W) or selected based on trial and error such that different internal processes occur without introducing too much flexibility. In this benchmark model, the precipitation product CHIRPS was used and potential evaporation was estimated with the Hargreaves method (see Table 2).

First Model Adaptation: Alternative Forcing Data (Models B0-D0)
As first model adaptation, the forcing data were changed to assess the role of data uncertainty for the model's ability to reproduce the observed long-term storage variations and to test whether some combinations of data sources allow model results to be more consistent with the observed storage variations than others. Starting with Model A0 as benchmark, different combinations of precipitation products, that is, CHIRPS and TRMM, on the one hand and methods to estimate potential evaporation, that is, Hargreaves and Thornthwaite, on the other hand were tested in Models B0-D0 (Table 2).

Second Model Adaptation: Alternative Model Structure (Model A1-A5)
As second model adaptation, the model structure was changed to allow for additional alternative process formulations, representing deep groundwater flow or inter-basin groundwater export/import and to test their potential as relevant drivers for the observed long-term storage variations. In this study, a distinction was made between shallow groundwater flow (Q ss ), deep groundwater flow (Q sd ), and groundwater loss (Q L ). While the shallow and deep groundwater flow reached the river, the groundwater loss (Q L ) leaked out of the Luangwa basin and potentially reached the Zambezi river further downstream. Based on benchmark Model A0, CHIRPS precipitation data and the Hargreaves method to estimate potential evaporation were used in these five model adaptations A1-A5.
More specifically, with Model A1, it was tested whether only groundwater export, hence groundwater leaking out of the Luangwa basin, was a dominant driver for the long-term storage variations. In this model, a groundwater loss (Q L; Equation 36), which did not reach the river upstream of the gauging station, was introduced (Figures 2b and 3). In the spirit of model parsimony, Q L was assumed to be constant, and thus independent of the water content in the Upper Groundwater reservoir to limit the number of calibration parameters in the absence of more detailed information. Thus, the Upper Groundwater reservoir (S su ) was formulated as a deficit store that can become negative and that loses water at a constant rate Q L , expressed as a free calibration parameter. Note that, the shallow groundwater flow Q ss only occurred when this storage was positive (if S su > 0, Equation 27). Such a formulation allowed groundwater to keep on draining, and thus groundwater levels falling, even if discharge in the river ceased during dry periods (e.g., Bouaziz et   . From there constant groundwater loss (Q L ) then leaked out of the basin equivalent to Models A1 and A2.
With Model A4, it was tested whether temporally variable groundwater export from the Deeper Groundwater reservoir recharged only during wet seasons, was the main driver for long-term storage variations. In this model, the groundwater loss (Q L , Figures 2e and 3) was a function of the water content in the Deeper Groundwater reservoir (Equation 34). As in Models A1-A3, this groundwater loss (Q L ) did not reach the river.
With Model A5, it was tested whether temporally variable groundwater flow from the Deeper Groundwater reservoir recharged only during wet seasons, was the main driver for long-term storage variations. In this model, the groundwater drained from the Deeper reservoir into the river as Q sd , thereby contributing to the total river flow (Equation 38, Figures 2f and 3). Hence, only in Model A5 the additional contributions from a deep groundwater storage reached the gauged river system whereas in Models A1-A4 groundwater exclusively leaked out of the basin. Figure 3 gives an overview of all alternative model hypotheses tested in this study. The relevant model equations are given in Table 3 and the corresponding prior parameter distributions in Table 4. HULSMAN ET AL.

Third Model Adaptation: Alternative Forcing Data and Model Structure
As third model adaptation, the forcing and the model structure were changed simultaneously. For this purpose, the best performing model based on the results of the first model adaptation, that is, changing the forcing data (Models A0-D0) and the second model adaptation, that is, changing the model structure (Models A0-A5) were combined. For example, if Models D0 and A4 performed best, respectively, then the combined Model D4 using the forcing data applied in Model D0 and the model structure of Model A4 was tested.
To ensure a robust representation of both, discharge and total water storage anomalies, the above model selection was based on the combined performance metrics for both variables. We explicitly acknowledge the possibility of this not being the combination that most reliably reflects real world processes. However, exhaustively testing all possible combinations goes beyond our computational capacity.

Model Performance Metrics
The model performance was evaluated with respect to discharge and basin-average total water storage anomalies. With respect to discharge, eight hydrological signatures were evaluated simultaneously using the Nash-Sutcliffe efficiency (E NS,θ , Equation 39 in Table 5 or relative error (E R,θ , Equation 40), depending on the signature. The individual performance metrics included the Nash-Sutcliffe efficiency of the daily flow time-series (E NS,Q ) and its logarithm (E NS,logQ ), of the flow duration curve (E NS,FDC ) and its logarithm (E NS,logFDC ), and of the autocorrelation function of the daily flows (E NS,AC ). In addition, the relative error of the mean seasonal runoff coefficients during dry and wet periods (E R,RCdry , E R,RCwet ), and the rising limb density of the hydrograph (E R,RLD ) (Euser et al., 2013) were used. These signatures were combined, assuming equals weights, using the Euclidian distance (D E,Q , Equation 41) with D E,Q = 1 corresponding to the "perfect" model.
The model performance with respect to the basin-average total water storage anomalies was evaluated with the Euclidian distance (D E,S , Equation 41) of the Nash-Sutcliffe efficiencies on monthly (E NS,S,monthly ) and annual (E NS,S,annual ) timescale. On annual timescale, the Nash-Sutcliffe efficiency was calculated for the annual minima and maxima separately which were then averaged to obtain E NS,S,annual . The annual time-series were normalized by dividing it with the maximum range in the observed annual minima or maxima total water

Deeper Groundwater
Notes. Fluxes ( storage anomalies, respectively. With this performance measure for the total water storage anomalies, more emphasis could be given to annual variations rather than to seasonal variations only. The combined model performance with respect to discharge and total water storage anomalies (D E,QS ) was calculated with the Euclidian distance (Equation 41) using D E,Q for the discharge and D E,S for the total water storage anomalies. This performance measure was used to select the best performing models representing both the discharge and the total storage as good as possible.

Parameter Selection Procedure
Each hydrological model (A0-D0 and A1-A5) was calibrated by running the model with 10 5 random parameter sets generated with a Monte-Carlo sampling strategy with uniform prior parameter distributions. Then, following two different strategies, the optimal parameter set was selected according to the model performance metrics as previously described with respect to (1) discharge (D E,Q ) and (2) discharge combined with total water storage anomalies (D E,QS ). The 5% best-performing parameter sets with respect to D E,Q or D E,QS were considered as feasible. The feasible parameter sets were used to evaluate the model performance with respect to discharge (D E,Q ), total water storage anomalies (D E,S ) and both simultaneously (D E,QS   In addition, the predictive strength of the benchmark Model A0 and the best performing model hypothesis (i.e., third model adaptation; Section 4.2.4) were compared by calibrating both models with respect to discharge and total water storage anomalies simultaneously (D E,QS ) for the time period 2002-2012, and post-calibration evaluating the models with respect to total water storage anomalies for the time period 2012-2016. Due to the limited data availability in 2012-2016, the model could not be evaluated with respect to discharge.

GRACE Total Water Storage Anomalies
In the Luangwa basin, the total water storage anomalies varied both seasonally and in the long-term (for example Figure 4a). The seasonal variation, hence the difference between the annual maximum and minimum, remained rather similar throughout the years (on average 225 mm). However, the annual minima, mean, and maxima changed over One possibility is that these variations were a result of uncertainties in GRACE observations as the Luangwa basin is relatively small (150,000 km 2 ) relative to the resolution of GRACE. Previous studies estimated errors in GRACE observations to be about 20 mm for areas of around 63,000 km 2 (Landerer & Swenson, 2012;Vishwakarma et al., 2018). But similar long-term variations were also observed for the entire Zambezi basin (Figure 4b), which is considerably larger (1,390,000 km 2 ) and where the maximum variation (194 mm) was an order of magnitude larger than the average uncertainty error of 20 mm.
In addition, long-term variations in large open water bodies could influence the GRACE signal. In this study, multiple open water bodies were within a radius of 300 km of the Luangwa Basin (Figure 1a) which typically is the distance used for data smoothing when processing GRACE data (Blazquez et al., 2018;Landerer & Swenson, 2012). The area of these open water bodies was 2% of the Luangwa basin for the Cahora Bassa reservoir, 4% for the Kariba reservoir, and 20% for Lake Malawi. As no long-term variations were observed in the altimetry observations for the Cahora Bassa reservoir ( Figure S1 in the Supplementary Material) and since this reservoir had a small area compared to the Luangwa basin, the effect of this reservoir was assumed to be negligible. For the Kariba reservoir ( Figure 4c) and Lake Malawi (Figure 4f), long-term variations were observed in the altimetry data, but with a low temporal correlation with the total water storage anomalies as shown in Figure S2 in the Supplementary Material. For the Zambezi basin where similar long-term storage variations were observed (Figure 4b), these three open water bodies covered together 2.7% of the basin. This was considered to be too small to have a significant effect.
Furthermore, GRACE observations used in this study were an average of products generated by CSR, GFZ, and JPL as explained in Section 3. The individual products showed similar long-term and seasonal patterns for the 2002-2015 time-period as shown in Figure S3a in the Supplementary Material. The standard deviation between the three solutions reached up to 11.7 mm for the annual minima and 19.1 mm for the annual maxima which is significantly smaller compared to the multi-annual variations. In 2016, the standard deviation increased to 73 mm for the annual maximum when considering the mascon (mass concentration block) solution according to JPL too. That is why it is plausible to assume that these long-term storage variations were not dominated by uncertainties in the GRACE observations for the 2002-2015 time-period.

Precipitation
Alternatively, long-term variations in the total water storage can be caused by changes in precipitation. In the Luangwa basin, the annual observed precipitation volumes varied over the years, depending on the data source, from 920 to 1,337 mm (CHIRPS) and from 858 to 1,213 mm (TRMM), as shown in Figures 4a and 4d.
In general, precipitation anomalies preceded storage variations by roughly 1-3 years. According to CHIRPS (Figure 4a . These observations suggest that long-term variations in precipitation alone already contain considerable information to potentially explain much of the observed long-term storage variations.

Potential and Actual Evaporation
The two different methods to estimate potential evaporation and its variations over the study time period, gave dramatically different results. While the Hargreaves method suggested a long-term mean annual E P = 1,565 mm yr −1 (Figure 5a), Thornthwaite estimated long-term mean E P = 1,904 mm yr −1 (Figure 5b). Major long-term variations in E P were only observed for estimates based on the Thornthwaite method (Figure 5b), but with a different pattern compared to the total water storage anomalies resulting in weak correlations with respect to the annual minimum (R 2 = 0.03) and maximum variations (R 2 = 0.34). In contrast, no discernible long-term fluctuations were observed when applying the Hargreaves method (R 2 = 0.00 and HULSMAN ET AL.

10.1029/2020WR028837
13 of 31 R 2 = 0.12 with respect to the annual minima and maxima, respectively). As the potential evaporation did change over the years according to the Thornthwaite method, it is possible this was one of the reasons why the modeled total water storage anomalies did not capture any long-term variations when using the Hargreaves method for the potential evaporation.
Analysis of the actual evaporation did not reveal any systematic long-term patterns that could clearly explain observed variations in the total water storage for most of the satellite products used in this study (Figure S4 in the Supplementary Material). In general, the magnitudes and long-term fluctuations varied for each satellite product as a result of different underlying assumptions and input data which could influence whether or not long-term fluctuations are visible. This resulted in a range of R 2 = 0.00-0.13 with respect to the annual maxima and R 2 = 0.02-0.17 with respect to the annual minima for all satellite products used in this study except for SSEBop which showed the highest R 2 = 0.37 (Figure 5c and Figure S4 in the Supplementary Material). Note, that the observed annual minimum storage increase of 67 mm over 3 years (2006)(2007)(2008)(2009), which in fact is an accumulated difference arising from the combined history of inputs and outputs over that period, can result from a mean daily deviation of only 0.06 mm d −1 in evaporation, which is by far within the uncertainty range of many satellite-based evaporation products (Long et al., 2014;Westerhoff, 2015). Hence, evaporation can potentially be one of the drivers for the observed long-term storage fluctuations, but additional in-depth analyses is necessary to substantiate this hypothesis which was outside the scope of this study due to the limited ground observations available.
Overall, long-term variations in potential and actual evaporation, according to most satellite products used here, exhibited only very limited direct correspondence with water storage variations, which was likely a consequence of the subtle and spatially varying interactions between water supply and atmospheric water demand in this largely water limited environment. Thus, while actual evaporation is largely controlled by water supply in hillslope regions, it is to a higher degree dominated by variations in atmospheric water demand in wetland areas, where sufficient water supply is sustained by shallow groundwater throughout most of the year. On the basin average, these processes can, to some degree, cancel each other out and thus HULSMAN ET AL.

10.1029/2020WR028837
14 of 31 prevent the development of a clear long-term signal. Based on the above analysis, it therefore remains difficult to meaningfully assess the uncertainty of the different analyzed evaporation products.

Land-Cover
Affecting the magnitudes of transpiration, land-cover changes could also be one of the drivers for the observed annual storage variations. In the Luangwa basin, deforestation, forest recovery, and agricultural expansion have occurred in the past (Handavu et al., 2019;Phiri et al., 2019aPhiri et al., , 2019b. However, inspections of NDVI time-series ( Figure 5) did not reveal any significant long-term variations directly corresponding with water storage variations over the 2002-2016 period. NDVI showed some fluctuations, including a considerable decrease after 2010, which, however, did not directly correspond with the observed water storage variations. This resulted in low correlations between the annual minimum/maximum total water storage anomalies and NDVI (R 2 = 0.01 and R 2 = 0.06, respectively). It was therefore assumed that land use change did not play a major role for the observed long-term storage variations.

Overall Water Balance
Another potential reason for the observed long-term storage variations can be regional, inter-basin groundwater exchange. For example, groundwater may leak out of the Luangwa basin below the river, thus never contributing to the (river) flow at the basin outlet, and into the Zambezi river basin further downstream eventually draining into that river or potentially even directly into the sea. Given the available observations, this would result in a water balance surplus for the Luangwa basin. Depending on the rainfall and evaporation products used, the water balance surplus in the Luangwa basin for the study period ranged between 9 and 332 mm yr −1 (Table 1). This suggested that even in the likely presence of data uncertainty, groundwater export may occur at least to some degree in the study region. Assuming an inter-basin export of Q L = 332 mm yr −1 , discharge would be considerably overestimated as compared to actual discharge observations ( Figure 6). To remain within the ranges spanned by multiple analytical solutions for water partitioning in the Budyko space (dark gray area in Figure 6; Gerrits et al., 2009), groundwater export should not exceed Q L = 143 mm yr −1 , which corresponds to a mean daily flow of Q L = 0.39 mm d −1 or ∼13% of the annual rainfall. Therefore, based on the water balance, a plausible range of groundwater export of Q L = 0.02-0.39 mm d −1 is in the following assumed for the study basin.

Benchmark Model (Model A0)
Following the first calibration strategy, that is, calibrating with respect to discharge, the benchmark Model A0 captured the discharge well (Figures 7a and 7b) with an optimum model performance of D E,Q,opt = 0.85 (Table 6, Figure 8a). The modeled flow dynamics such as the timings of the wet and dry season were broadly consistent with the observations (Figure 7a), but the high flows were slightly underestimated and low flows somewhat overestimated (Figure 7b). In contrast, and in spite of its general ability to reproduce discharge, the model could only poorly reproduce the time-series of monthly and annual total water storage anomalies with D E,S = −14 (Table 6, Figure 8a). On the monthly timescale, the general seasonal storage fluctuations were modeled well with respect to the timings of the wet and dry season ( Figure S7a in the Supplementary Material). However, the annual storage maxima were significantly overestimated, and the annual minima underestimated (Figure 7c). In addition, the modeled total water storage anomalies did not reflect any fluctuations in the annual minima in contrast to the observations (Figure 7e, R 2 = 0.07), whereas the modeled annual maxima varied throughout the years, but with a different pattern compared to the observations (Figure 7d, R 2 = 0.20). As a result, the overall model performance with respect to discharge and total water storage anomalies D E,QS = −9.6 remained poor.
Following the second calibration strategy, that is, calibration with respect to discharge and total water storage anomalies simultaneously, the ability of the model to reproduce flow decreased significantly to D E,Q = −0.23 (Table 6, Figure 8b). While the general flow dynamics were modeled well ( Figure S8a  10.1029/2020WR028837 the Supplementary Material), but with slight differences in the storage decrease during the dry seasons. The magnitudes of the annual maxima and minima corresponded better with the observations (Figure 9b) and the fluctuations in the annual maxima improved slightly (Figure 9c, R 2 = 0.31). However, the modeled storage did not reflect any fluctuations in the annual minima (Figure 9d, R 2 = 0.06). Hence, the overall model performance D E,QS = −0.17 improved, but remained poor. Even when calibrating with respect to total water storage anomalies only, the annual minima did not reflect any fluctuations ( Figure S10 in the Supplementary Material, R 2 = 0.08).
As a result, this benchmark Model A0 reproduced the flows well only with calibration strategy 1, while the seasonal fluctuations in the total water storage were better reproduced with calibration strategy 2. However, the long-term variations in the total water storage anomalies with respect to the annual maxima were poorly modeled and with respect to the annual minima completely missed for both calibration strategies.

First Model Adaptation: Alternative Forcing Data (Models B0-D0)
Following the first calibration strategy, Models B0-D0, using different combinations of input data sources, represented the discharge in general well with D E,Q = 0.85-0.92 (Table 6 Following the second calibration strategy, the modeled flow improved for all Models B0-D0 to D E,Q = 0.32-0.83 compared to the benchmark Model A0 (Table 6, Figure 8b). The general flow dynamics were represented well for all models ( Figure S8 in the Supplementary Material), but the flow magnitudes were only captured well for Models C0 and D0 (Figure 9a). While Models A0 and B0 significantly overestimated the flows continuously, Model C0 only slightly overestimated the flows and Model D0 only slightly underestimated the medium to low flows (Figure 9a). As a result, Model D0 had the highest model performance with respect to discharge with D E,Q = 0.83 (Table 6, Figure 8b). Also the modeled monthly and annual total water storage anomalies improved for Models B0-D0 with D E,S = 0.00-0.34 compared to the benchmark Model A0 (Table 6, Figure 8b). On monthly timescale, the general seasonal variations were captured well for all models, but with slight differences in the storage decrease during dry seasons ( Figure S9   . Runoff coefficient (Q/P) as a function of the dryness index (E p /P) where Q is discharge, P precipitation, and E p potential evaporation. The blue dashed line indicates the energy limit and the blue horizontal dash-dotted line the water limit. The gray area indicates envelope of analytical solutions according to Schreiber (1904), Ol'dekop (1911, Turc (1953), Pike (1964), and Budyko (1974). The dryness index was estimated using CHIRPS or TRMM for the precipitation and the Hargreaves method (E P = 1,565 mm yr −1 ) or Thornthwaite (E P = 1,904 mm yr −1 ) for the potential evaporation. The runoff coefficient was estimated with the same precipitation products and (1) recorded discharge without groundwater exchange (red stars), (2) estimated discharge including groundwater exchange (Q Q P E   L , Equation 2 using the same precipitation products and SEBS (red dots), GLEAM (blue dots), MOD16 (brown dots), SSEBop (green dots), and WaPOR (orange dots) for the evaporation resulting in Q L = 9-332 mm y −1 depending on the chosen satellite products, and (3) sum of recorded discharge and maximum groundwater export (Q L = 332 mm yr −1 , blue stars). To remain within the Budyko space (dark gray area), the groundwater exchange should range between Q L = 9-143 mm yr −1 depending on the satellite products used. See Table 1 for the corresponding long-term values of the individual fluxes.
(R 2 = 0.00-0.03; Figure 9d). The overall model performance with respect to discharge and total water storage anomalies improved the most for Model D0 with D E,QS = 0.52 (Table 6, Figure 8b).
As a result, the ability of the model to reproduce long-term variations of the total water storage during the wet seasons, that is, the annual maxima, was considerably influenced by the choice of precipitation data source and the method to estimate potential evaporation. In contrast, the modeled dry season storage, that is, annual minima, did not reflect the observed pattern for any combination of data sources but remained rather stable. Overall, the combination of TRMM with the Thornthwaite method (Model D0) here produced model results that were most consistent simultaneously with observed discharge and the observed total water storage variations. This suggests that the choice of data source can already explain a significant part of the inability of the model to reproduce long-term water storage variations.

Second Model Adaptation: Alternative Model Structure (Model A1-A5)
Following the first calibration strategy, all Models A1-A5 reproduced the discharge well with D E,Q = 0.82-0.87 (Table 6, Figure 8a). All models captured the general flow dynamics and magnitudes (Figures S11 and S12a in the Supplementary Material). The monthly and annual total water storage anomaly time-series was modeled very poorly for all models (D E,S = −1,066 to −9.8, Table 6, Figure 8a). While Models A1 and A5 consistently over-or underestimated the storage with little resemblance in the fluctuations of the annual maxima (R 2 = 0.19-0.22) and minima (R 2 = 0.08-0.16), Models A2 and A3 substantially overestimated the long-term variations (R 2 = 0.00-0.11, Figures S12 and S13 in the Supplementary Material). Also in Model A4, the storage was over-or underestimated, but the long-term variations improved with respect to the annual maxima (R 2 = 0.56) and minima (R 2 = 0.27). As a result, the overall model performance with respect to discharge and total water storage anomalies simultaneously improved the most for Model A4 with D E,QS = 0.32 (Table 6, Figure 8a).
Following the second calibration strategy, the modeled discharge improved considerably for Models A3 (D E,Q = 0.28) and A4 (D E,Q = 0.54) compared to the benchmark Model A0, but was poorly represented for the remaining models with D E,Q = −0.31 to −0.20 (Table 6, Figure 8b). The general flow dynamics were reproduced well for Models A1-A4 ( Figure S14 in the Supplementary Material), albeit with slight differences in the timing of the wet season and dry season recession, whereas Model A5 poorly represented the recession during dry seasons. In addition, the flows were significantly over-or underestimated with Models A1-A3 and A5 (Figure 10a), whereas Model A4 only slightly overestimated the high flows and underestimated the low flows. The monthly variations in the total water storage anomalies were captured well for all models with some differences in the storage decrease during dry seasons especially for Model A2 ( Figure S15 in the Supplementary Material). While the magnitudes of the annual maxima and minima were captured well for all models (Figure 10b), the annual fluctuations improved the most Model A5 with respect to the annual maxima (R 2 = 0.51, Figure 10c) and for Models A2 and A5 with respect to the annual minima (R 2 = 0.23, Figure 10d). When considering both the monthly and annual fluctuations and magnitudes, Models A4 (D E,S = 0.16) and A5 (D E,S = 0.23) improved the most (Table 6, Figure 8b).
As a result, the model's ability to reproduce the long-term total water storage variations during dry and wet seasons, that is, annual minima and maxima, was significantly influenced by the model structure. The modeled annual and monthly total water storage anomalies improved the most for Models A4 and A5 (Table 6, Figure 8b) where a Deeper Groundwater reservoir was incorporated with groundwater loss and/or flow as function of the water content in the Deeper Groundwater reservoir. However, Model A5 only poorly captured the discharge (D E,Q = −0.31, Figure 10a). Therefore, when considering the overall model performance with respect to discharge and total water storage anomalies simultaneously (D E,QS ), Model A4 performed the best with D E,QS = 0.32 (Table 6, Figure 8b). This model captured the flows well as also the monthly and HULSMAN ET AL.

Table 6 Model Performance With Respect to Discharge (D E ), Total Water Storage Anomalies (D E,S ) and Both Combined (D E,QS ) Including Their 5/95% Percentile Ranges of the Feasible Parameter Sets for Models A0-D4
annual total water storage anomaly magnitudes and fluctuations, albeit with a slight overestimation of the annual minima and maxima in 2004-2006 (Figure 10b). These results provide evidence that the model hypotheses A0-A3 as well as A5 generate hydrological response patterns that are less consistent with the available data than those of Model A4. It is thus not implausible to reject hypotheses A0-A3 and A5 and to assume that long-term storage fluctuations are potentially the result of groundwater export according to the loss rate Q L from the Deeper Groundwater reservoir (Model A4).

Third Model Adaptation: Alternative Forcing Data and Model Structure
According to the first model adaptation (comparing Models A0-D0), Model D0 performed the best using precipitation data from TRMM and estimating the potential evaporation with the Thornthwaite method. According to the second model adaptation (comparing Models A0-A5), Model A4 performed the best featuring a Deeper Groundwater reservoir which was only recharged during the wet season and from where groundwater leaked out of the basin (Figures 2 and 3). In this section, both models D0 and A4 were HULSMAN ET AL.

10.1029/2020WR028837
19 of 31 combined into Model D4 where we used TRMM as data source for precipitation, the Thornthwaite method to estimate potential evaporation and the model structure associated with Model A4.
Following the first calibration strategy, this model reproduced the discharge well ( Figure S16a in the Supplementary Material) with D E,Q = 0.93 which was better than all other alternative model hypotheses (Table 6, Figure 8a). Both, the general flow dynamics and magnitudes were captured well with this model (Figures S16a and S16b in the Supplementary Material). The monthly and annual total water storage anomalies improved significantly to D E,S = 0.31 (Table 6, Figure 8a). The modeled monthly storage variations were broadly consistent with the observation ( Figure Figures 11c and 11e). The overall model performance increased to D E,QS = 0.63 which was better than all other alternative model hypotheses (Table 6, Figure 8b).
In a last step, the predictive strength of Model D4 was compared to that of the benchmark Model A0. For this purpose, both models were calibrated with respect to discharge and total water storage anomalies simultaneously (calibration strategy 2) for the time period 2002-2012, and post-calibration evaluated due to the lack of flow data only with respect to total water storage anomalies for the time period 2012-2016 (see Section 4.4). While the general flow dynamics were modeled well for both models ( Figure S19 in the Supplementary Material), the magnitudes improved significantly for Model D4 as the flows were only slightly underestimated during medium flows (Figure 12a). For the calibration period, the modeled flow improved from D E,Q = −0.13 for Model A0 to D E,Q = 0.51 for Model D4 (Table 7). Also the monthly and annual total water storage anomaly time-series improved for Model D4 to D E,S = 0.63. On monthly timescale, Model D4 captured the seasonal variations better with considerable improvements in the storage decrease during dry seasons ( Figure S20 in the Supplementary Material). While the magnitudes of the annual minima/maxima were captured well for both models (Figure 12b), long-term fluctuations improved for Model D4 with respect to the annual maxima (R 2 = 0.57, Figure 12c) and minima (R 2 = 0.44, Figure 12d). In both cases, Overall, the results suggest that the model's ability to simultaneously reproduce both the observed discharge and long-term and seasonal total water storage variations was considerably influenced by both, the choice of forcing data and model structure, respectively. Overall, the combination of TRMM data for precipitation, the Thornthwaite method for potential evaporation and the model structure associated with Model A4 here produced model results most consistent with the observed total water storage anomalies and discharge time-series. This Model D4 allowed for a better representation of the discharge and better prediction of the total water storage anomalies with respect to the seasonal and long-term fluctuations. The forcing data mostly controlled the model's ability to mimic annual storage maxima, whereas the annual storage minima improved the most when incorporating groundwater loss from the Deeper Groundwater reservoir (Model A4 and D4).

Discussion
In this study, we identified plausible drivers for the observed long-term total water storage variations in the Luangwa Basin. The results indicated modeled annual maximum storage fluctuations were to a large extent controlled by the choice of forcing data, whereas modeled annual minima were influenced by pro-HULSMAN ET AL.
10.1029/2020WR028837 23 of 31  cesses generating long-term memory effects which are missing in the original benchmark Model A0 and in many other common hydrological models (cf. Fowler et al., 2020). More specifically, the representation of monthly and annual total water storage fluctuations improved when using TRMM precipitation data, the Thornthwaite method to estimate potential evaporation and allowing for groundwater export via a loss from a deeper groundwater layer (Model D4). In 2005-2007, most models poorly reproduced the annual maximum/minimum total water storage anomalies. This could be further improved in future studies by testing alternative forcing data sources, model formulations for groundwater loss and calibration strategies. Depending on the model, the modeled total water storage anomalies improved considerably when calibrating only with respect to this variable, but at the cost of decreased discharge performances ( Figure S21 in the Supplementary Material).
The results demonstrated that models that can adequately reproduce discharge do not necessarily reproduce storage well which was also observed by Bouaziz et al. (2020). In this study, the benchmark Model A0 reproduced the general dynamics and magnitudes of the discharge well but did not reproduce the observed storage magnitudes nor the long-term storage fluctuations. Incorporating the total water storage anomalies in the calibration procedure only improved the modeled storage magnitudes, but not the long-term fluctuations. While alternative forcing data sources improved the representation of the annual maximum storage fluctuations, the storage conditions during dry seasons, that is, annual minima, remained poorly represented (Models A0-D0) and only improved after modifying the model structure (Model D4). These results suggested that groundwater loss from the Luangwa basin played an important role to explain long-term annual storage variations. The average groundwater loss/flow reached up to 0.68 mm d −1 for Models A1-A5 and D4 when considering the optimal and feasible parameter sets. However, in many commonly used hydrological models such processes allowing long-term memory effects are missing (e.g., Bergström, 1992;Fenicia et al., 2014;Liang et al., 1994) resulting in biased predictions of discharge and storage which is especially crucial during extreme dry conditions (Fowler et al., 2020;Saft et al., 2016).
Furthermore, this study showed that simple process formulations allowing for long-term memory effects can be readily incorporated in conceptual hydrological models. In this study, several model hypotheses were tested to assess which processes most likely dominated long-term memory effects in the Luangwa basin (Models A1-A5). The results suggested long-term storage variations were a result of groundwater loss from a deeper groundwater layer which was only recharged during wet seasons (Model D4). With this model, the storage prediction substantially improved compared to the benchmark Model A0, yet remained at a modest level (D E,S < 0, Table 7) most likely due to the chosen model performance metric and the limited number of data points for the evaluation when considering annual minima/maxima for the time-period 2012-2016 as explained in Section 5.2.4. In addition, the above model modifications also improved the model's skill to reproduce observed discharge time-series such that the general dynamics and magnitudes were represented better with Model D4 (Figure 11) compared to the benchmark Model A0 (Figure 7). Overall, the results suggest that the model hypotheses A0-D0 as well as A1-A5 can be rejected in favor of hypothesis D4. This underlines the crucial role of model hypothesis testing for improving the simultaneous representation of multiple variables in models and thereby providing evidence that the process representation in D4 is likely a more consistent representation of real-world processes than the other hypotheses tested here (Beven, 2018;Clark et al., 2011).
Previous studies highlighted the inability of many conceptual models to reproduce long-term storage variations and attributed this to data errors, poor parameterization, model structural deficiencies or a combination thereof (Fowler et al., 2018;Jing et al., 2019;Saft et al., 2016;Scanlon et al., 2018;Winsemius et al., 2006). Fowler et al. (2020) recently demonstrated that commonly used conceptual hydrological models cannot reproduce long-term storage variations as they lack long-term memory processes and hence should not be used for discharge predictions in for example drying climates. However, here we could show that following a careful, iterative data and model selection procedure, the representation of long-term storage variations in a conceptual model can be considerably improved. This further implies that although many typical implementations of hydrological models indeed cannot reproduce long-term storage changes, in particular with respect to annual fluctuations in dry season conditions, that is, annual minima, as shown by Fowler et al. (2020) and here with Models A0 -D0, this inability is not an inherent property of conceptual models per se. Instead, our results provide evidence that this inability can, at least to some degree, be overcome when adopting a systematic procedure to test alternative model hypotheses and thus to improve the representation of real-world processes (here: Models A1-A5).
The (satellite) observations used in this study are prone to data uncertainties. Uncertainties in GRACE observations are a result of data (post-) processing which includes data smoothening with a radius of 300 km (Landerer & Swenson, 2012). This results in signal leakage between neighboring cells of 1° and in grid cells which are not completely independent from each other. In addition, data gaps occur in GRACE observations due to instrument issues, calibration campaigns and battery management activities every 6 months since 2011 ( Figure S3b in the Supplementary Material). Uncertainties in precipitation data can be considerable, in particular for extreme events on small scale or in mountainous regions Hrachowitz & Weiler, 2011;Kimani et al., 2017;Le Coz & van de Giesen, 2019). As shown in Figure 4 for CHIRPS and TRMM, different methods and input data underlying satellite-based precipitation products affect the long-term patterns and hence the modeled long-term total water storage variations. In addition, bias errors in the precipitation affect the estimated long-term average groundwater export based on the water balance . Similarly, satellite-based evaporation data are a result of models with uncertainties in the input data, parameterization or model conceptualization (K. Zhang et al., 2016). This affects the monthly values and the long-term patterns as shown in Figure S4 in the Supplementary Material. In Section 5.1.3, it was illustrated that daily deviations within the uncertainty range of many evaporation satellite products can result in a considerable storage change of over multiple years. In addition, uncertainties in the evaporation affect the long-term water balance closure and hence also the estimated long-term groundwater export/import. Uncertainties in the potential evaporation are a result of the underlying equations and input data. This study compared the Hargreaves and Thornthwaite methods which both use temperature data to estimate potential evaporation but with different equations (Hargreaves & Allen, 2003;Hargreaves & Samani, 1985;Maes et al., 2019). This resulted in different monthly values and long-term variations as shown in Figure 5. Uncertainties in the potential evaporation affect the modeled actual evaporation especially during wet seasons when the total evaporation is limited by the energy, whereas during dry seasons the total evaporation is limited by the water availability. Discharge uncertainties are a result of rating curve uncertainties (Domeneghetti et al., 2012;McMillan & Westerberg, 2015;Westerberg et al., 2011) and limited data availability. Due to the limited data availability in this study, it was not possible to validate these observations with field measurements to estimate the magnitude of the uncertainties.
According to the International Groundwater Resources Assessment Centre (IGRAC), there are two aquifers which are shared by the Luangwa basin and neighboring basins ( Figure S22 in the Supplementary Material). These aquifers are located in the South upstream of the gauge station and in the East at the border with Malawi and mostly consist of alluvial sediments/sands and fractured crystalline -metamorphic basement rocks, respectively (IGRAC & UNESCO-IHP, 2015;TWAP, 2015aTWAP, , 2015b. Both aquifers were identified in previous studies as part of an effort to identify transboundary aquifers world-wide to support transboundary aquifer management activities. Additional studies are needed to characterize these aquifers more detailed and to analyze whether there are additional aquifers shared by Luangwa and surrounding river basins within Zambia. According to Fraser et al. (2020), who identified transboundary aquifers in Malawi based on lithology, hydrogeology, groundwater levels and literature, water flows from Zambia to Malawi in the aquifer at the eastern border of the Luangwa basin. This supports our findings with respect to the significance of potential groundwater leakage in the Luangwa basin.
In addition, Tóth (1963) illustrated groundwater flow occurs on local or regional scale depending on the topography such that local groundwater flow generally occurs in regions with large local relief and shallow aquifers, whereas regional groundwater flow generally occurs in regions with large regional relief and deep aquifers. In the Luangwa river basin, elevation differences are less pronounced near the eastern and southern border compared to the North and West ( Figure S23 in the Supplementary Material) which supports the possibility of regional groundwater flow in the East and South. Schaller and Fan (2009) illustrated regional groundwater flow often occur in arid regions as the groundwater level often remains below the local topography. This further supports the possibility of regional groundwater flow in the semi-arid Luangwa basin. According to Condon et al. (2020), conceptual hydrological models typically focus on the shallow groundwater system assuming the deep groundwater system is negligible even though deep flow paths can be relevant depending on the river basin and research question. This results in for example open water balances and flawed conceptual models (Condon et al., 2020). Our study illustrated that the addition of a Deeper Groundwater Reservoir with groundwater loss has the potential to substantially improve the modeled discharge and long-term total water storage anomalies in conceptual hydrological models.
For future studies, it will be interesting to explore the effects of evaporation on long-term storage fluctuations in a more detailed analysis. Our results suggest that long-term fluctuations in the potential evaporation can occur depending on the chosen estimation method (Hobbins et al., 2008;Huang et al., 2015;Roderick & Farquhar, 2005;Xu et al., 2018). It would therefore be interesting to look into alternative, potentially more accurate estimation methods. In addition, long-term fluctuations in the actual evaporation were observed depending on the satellite product due to the different underlying assumptions and input data (Bai et al., 2019;Feng et al., 2019;Goroshi et al., 2017;Wang et al., 2018). As shown in a previous study by  and in Figure S24 in the Supplementary Material, the basin-averaged evaporation was modeled well and incorporating this flux in the calibration procedure did not improve the modeled long-term storage variabilities. That is why, more in-depth analyses on the occurrence of long-term variations in the actual and potential evaporation, their main drivers and their effect on long-term storage fluctuations testing different model hypotheses is recommended. This was outside the scope of this study due to the limited data availability.
Furthermore, the calibration approach used in this study allowed to analyze the influence of different model adjustments on the behavioral parameter sets. We assumed that if a model adjustment is relevant, a clear improvement in the distribution of performances for the parameter sets retained as feasible should be visible. Other calibration schemes are recommended when attempting to identify the mathematically optimal parameter set. However, a solution that may mathematically be the best fit, is unlikely the most plausible representation of the real-world system given the many sources of uncertainty in the modeling process especially in data scarce regions (e.g., Beven, 2006). Sensitivity analysis of the parameters related to the groundwater loss indicated that these parameters influenced the modeled total water storage significantly with some exceptions, whereas the impact on the discharge was mostly limited ( Figure S25). We recommend using additional information sources to constrain these parameters, which was outside the scope of this study due to limited data availability. As this study focused on a necessarily limited number of model hypotheses, it should be noted that additional alternative model hypotheses clearly may lead to similar model improvements. This also includes alternative hypotheses with respect to the conceptualization of the groundwater system (e.g., de Graaf et al., 2015;Reinecke et al., 2019;Stoelzle et al., 2015). The results of our study therefore need to be understood in that context. Model hypothesis D4 allowed the rejection of all other hypotheses tested here, yet it may in the future be rejected itself in favor of another hypothesis.

Conclusion
In the Luangwa basin, long-term total water storage variations were observed with GRACE, but not reproduced by a previously developed process-based hydrological model that encapsulates our current understanding of the dominant regional hydrological processes. The objective of this paper was to identify so far overlooked processes underlying these low-frequency variations in a combined data analysis and model hypothesis testing approach. The data analysis results revealed different long-term patterns in the precipitation, potential and actual evaporation depending on the satellite product which could partly explain the observed long-term storage variations. The results of the model hypotheses testing suggest that the initial model's inability to reproduce the observed low-frequency storage variations was a combined effect of the data source used to run the model and the missing representation of regional groundwater export. More specifically, it was shown that a different choice of the model input data source produced model results that are more consistent with observed fluctuations in long-term annual maximum total water storage anomalies. In contrast, the incorporation of a process representing regional groundwater export from a deep groundwater layer significantly improved the model's ability to reproduce the observed long-term variations in the annual minimum storage. The results highlighted the combined value of alternative data sources and iterative hypothesis testing to improve our understanding of hydrological processes, their quantitative description in models and eventually toward more reliable predictions of hydrological models.

Data Availability Statement
Discharge data for the study region were made available by WARMA (Water Resources Management Authority in Zambia) and can be accessed upon request at WARMA. Satellite observations were obtained from publicly available online databases as described in Section 3 and Table 1.