Can hydrological models assess the impact of natural flood management in groundwater‐dominated catchments?

Natural flood management (NFM) is widely promoted for managing flood risks, but the effectiveness of different types of NFM schemes at medium (100-1000 km²) and large scales (>1000 km²) remains largely unknown. This study demonstrates the importance of fully understanding the impact of model structure, calibration and uncertainty techniques on the results before the NFM assessment is undertaken. Land-based NFM assessment is undertaken in two medium-scale lowland catchments within the Thames River basin (UK) with a modelling approach that uses the Soil and Water Assessment Tool (SWAT) model within an uncertainty framework. The model performed poorly in groundwater-dominated areas (P-factor <0.5 and R-factor >0.6). The model performed better in areas dominated by surface and interflow processes (P-factor >0.5 and R-factor <0.6), and here hypothetical experiments converting land to broadleaf woodland and cropland showed that the model offers good potential for the assessment of NFM effectiveness. However, the reduction of large flood flows greater than 4% in medium-sized catchments would require afforestation of more than 75% of the area. Whilst hydrological models, and specifically SWAT, can be useful tools in assessing the effectiveness of NFM, these results demonstrate that they cannot be applied in all settings.


| INTRODUCTION
Natural flood management (NFM) has gained much interest in recent years, driven by increased recognition of its potential for reducing flood risk, and enhancing and sustaining ecosystem services and functions (Dadson et al., 2017; McLean et al., 2013). Catchment-based NFM is the implementation of land and soil management measures to protect, restore, or enhance the natural features, characteristics, and functions of catchments, to reduce peak flood flows and/or delay their timing, and, in turn, to reduce flood impact (Connelly et al., 2020; SEPA, 2016). Although NFM is recognised as an effective natural approach, its effects vary across scales and cannot always be translated between sites or catchments (Wingfield et al., 2019). Investing in NFM therefore requires robust evidence on its effectiveness where it is to be implemented.
Despite the wide adoption of NFM across the UK, most evidence for its effectiveness has only been established at small scales (catchment areas of <100 km²). There remains a distinct and concerning lack of evidence at larger scales (Connelly et al., 2020; Dadson et al., 2017; Kay et al., 2019; Waylen et al., 2018; Wilkinson et al., 2019; Wingfield et al., 2019), and limited investigation of the detailed processes governing the catchment hydrological system, the associated uncertainties, and how these can be affected by NFM. Few studies on the effectiveness of NFM have focused on permeable lowland catchments with complex hydrology, where integrated modelling of surface water-groundwater interactions remains challenging (Lane et al., 2019; Wagener et al., 2021; Wheater et al., 2007); yet these catchments are subject to frequent flooding from groundwater and intensive rainfall (Ascott et al., 2017; Hughes et al., 2011; Macdonald et al., 2012). There is a clear need to test NFM effectiveness in large lowland catchments, to provide the evidence required to take appropriate actions.
Due to the complexity of large-scale, lowland, groundwater-dominated catchment systems, NFM assessment in these settings requires a consistent modelling framework that allows realistic simulation of the surface and groundwater interactions while considering the spatial variability of landscape features, acknowledging uncertainties, and assessing the changes in the processes and parameters due to NFM (Ellis et al., 2021). This would enable the assessment of the catchment-scale impacts of NFM, the interactions between individual NFM interventions and sub-catchments, and catchment process changes after implementation of NFM measures (Metcalfe et al., 2017). Within such a framework, the catchment model itself should be able to represent the spatial complexity of the catchment features and the NFM-related processes, whilst still being practical and computationally realistic to employ.
The Soil and Water Assessment Tool (SWAT) is an open-source, semi-distributed and physically based catchment-scale hydrological model that has been used extensively to simulate the impacts of land use and land cover changes on hydrologic processes (Perra et al., 2018; Wangpimool et al., 2013). However, its application remains challenging in groundwater-dominated catchments. Although modified versions of the SWAT model, with modifications to the groundwater module (Guse et al., 2016; Nguyen & Dietrich, 2018; Pfannerstill et al., 2014) or inclusion of coupled groundwater-surface water (Aliyari et al., 2019; Bailey et al., 2016; Chunn et al., 2019; Liu et al., 2020), have been tested, there has been little attempt to capture how different representations of processes, parametrisations, and calibration and uncertainty analysis (CUA) techniques affect the model performance. Yet this is essential for understanding the application of models for use in NFM assessment.
The objective of this study is to fully assess a modelling framework for NFM assessment in two medium-scale lowland catchments with complex hydrology within the Thames basin, UK. By doing so, the study aims to highlight the importance of implementing such a framework in order to understand to what extent the selected catchment model can be used to assess land-based NFM, the impact of different model configurations, as well as the uncertainties and the sensitivity of the model to changes in soil and land management practice.
The remainder of the paper is organised into four sections. Section 2 presents the study areas and describes the datasets, modelling framework and the model used, as well as how it is calibrated, validated and applied for testing NFM scenarios. Section 3 presents the results for the studied catchments on the model performance, changes in parameters and processes according to the landscape properties and flooding events, and the effects of the tested NFM scenarios. Section 4 discusses the meanings and implications of the results and, finally, Section 5 summarises the main findings and outlook.

| Study areas
The modelling study was carried out for two lowland catchments, the Pang and the Blackwater, subcatchments of the Thames Basin in southern UK (Figure 1). The study has been conducted as part of the LANDWISE project, for which a number of catchments were targeted for hydrological modelling, performed at small, medium and large scales. The Pang and Blackwater catchments were selected for medium-scale modelling.
The Pang is mainly rural with intensive farming of wheat and barley cultivated in winter and spring, and oilseed rape. The eastern part of the Blackwater is heavily urbanised compared to the western part which remains largely underdeveloped.
The bedrock geology of the Pang is mainly Cretaceous Chalk with high permeability, but downstream the Chalk is covered by low-permeability Palaeogene mudstone bedrock (Allen et al., 1997; Peters & van Lanen, 2005). In the Blackwater, the bedrock geology is more diversified, composed of mainly Chalk and London Clay, overlain by a series of relatively thin sands and clays.
The choice of Blackwater was guided by the fact that it has different landscape properties, compared to the other selected catchments. In fact, apart from having its upper part on the Whitewater catchment underlain by Chalk, most of the Blackwater catchment has different bedrock geology, and related soils, which gives an opportunity to test the model and the NFM effectiveness in different landscapes. In addition, heavily urbanised catchments like Blackwater are rarely ideal; assessing the suitability of such catchments was part of the underlying rationale of the LANDWISE project.
The Pang presents a particular feature known as the "Blue Pool" spring complex located near Bucklebury, a perennial spring discharging on average 0.1 and 0.2 m³ s⁻¹ of groundwater to streamflow in the summer and
FIGURE 1 Elevation with the location of hydrometeorological stations and geology of the Pang (a,b) and Blackwater catchments (c,d).
Three streamflow gauging stations on the River Pang were used: Frilsham, Bucklebury, and Pangbourne (NRFA, 2022). Backwater effects have been observed at the confluence of the Pang and Thames, which result in drowning of the Pangbourne gauge and uncertainty in recorded high flows. Five streamflow gauging stations were used in the Blackwater: two on the Whitewater, at Lodge Farm at the edge of the Chalk outcrop and downstream at Holdshot Farm; one on the Hart at Bramshill House; and two on the Blackwater, at Farnborough and Swallowfield (Figure 1). Sewage effluent is a major component (up to 40%) of the Blackwater flows (NRFA, 2022).
The hydrological features of the nested subcatchments are summarised in Table 1.

| Modelling framework
The modelling framework integrates the catchment model with different processes, configurations and methods, as well as input data from different sources on elevation, weather, soil, land use and river flows (Figure 2). It allows evaluation of whether the catchment hydrological system, including NFM-related processes, can be adequately simulated from available datasets while considering uncertainties. The framework involves six steps (Figure 2). First, two model setups are constructed to test the influence of using slope bands (classes of percentage slope inclination), since this is known to affect model performance. Second, the constructed models are calibrated, while considering uncertainties, using two different options for splitting the observed flow time series, which contains different types of flooding events. Third, each model is verified to assess how realistic the simulated processes are, and how the parameters vary and change. Fourth, the calibrated models are compared to select the best model setup. Fifth, the NFM scenarios are constructed. Sixth, the scenarios are modelled and assessed. The SWAT model is used in this study (see Section 1 of Supporting Information).

| Datasets
Daily rainfall station datasets from England's Environment Agency (EA), covering the period between 2001 and 2014, were used to set up and run the model. Further climate variables were obtained from the University of Reading meteorological observatory. Soil data are from the National Soil Map of England and Wales, NATMAP (Cranfield University, 2013). For the elevation model, the EA continuous digital terrain model (DTM) of 2 m resolution, resampled to 5 m due to computational constraints, was used. Land cover data are from the UK Centre for Ecology and Hydrology land cover polygons for 2015.
River flow data from the UK National River Flow Archive (NRFA, 2022) were used for the calibration and validation of the models. Point source sewage effluent discharge data for the Blackwater catchment were obtained from Thames Water. The integration of point source data in the model helped to improve the baseflow simulation at the affected outlets.

| Model setups
The SWAT model was implemented via the QSWAT3 v1.1 plugin integrated into QGIS 3.10.
Two model setups per catchment were built (Table 2) to test the impact of using slope bands, since catchment models are known to be particularly sensitive to this factor. Both setups use the 5 m DTM, one with slope bands (classes of percentage slope inclination) and one without (Figure 2, Step 1). The slope bands were defined according to the British Land Capability classification system. Based on the minimum and maximum altitude of the target catchment, we considered three aggregated slope classes: 0-3° (0-5.24%), 3-11° (5.24-19.4%) and >11° (>19.4%), corresponding to gently sloping, moderately-to-strongly sloping, and moderately-to-very steeply sloping land.
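The percentage values quoted for the slope classes follow from the usual conversion between slope angle and percent rise (100 times the tangent of the angle). The short sketch below illustrates that conversion and the resulting three-way classification; the function names and the short class labels are illustrative assumptions and are not part of SWAT or QSWAT3.

```python
import math

# Class boundaries in degrees, as quoted in the text (3 and 11 degrees).
SLOPE_BREAKS_DEG = (3.0, 11.0)
# Hypothetical short labels for the three aggregated classes.
CLASS_NAMES = ("gentle", "moderate-to-strong", "steep")


def slope_percent(angle_deg: float) -> float:
    """Convert a slope angle in degrees to percent rise (100 * rise / run)."""
    return 100.0 * math.tan(math.radians(angle_deg))


def slope_class(percent: float) -> str:
    """Return the aggregated slope class for a slope given in percent."""
    breaks_pct = [slope_percent(b) for b in SLOPE_BREAKS_DEG]
    for upper, name in zip(breaks_pct, CLASS_NAMES):
        if percent <= upper:
            return name
    return CLASS_NAMES[-1]


if __name__ == "__main__":
    for deg in SLOPE_BREAKS_DEG:
        print(f"{deg:.0f} degrees = {slope_percent(deg):.2f} %")
```

Expressing the breaks in degrees and deriving the percentages keeps the two notations consistent with each other.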
The soil and land cover maps used in this study and the delineated sub-basins as well as the geology maps (for information) are summarised in Figure S1.
The next step after HRU creation is importing the weather data and writing data files for each sub-basin and each HRU. The SWAT model distributes the weather data to each sub-basin by assigning the data of the station closest to the sub-basin centroid. Subsequently, the weather data of a sub-basin are assigned to each of the HRUs within it.
TABLE 3 (excerpt) Storage of water from rainfall interception by agricultural land and its loss via evaporation. a Parameters for which a relative change is applied to an initial value during the calibration, whereas an absolute change was used for the remaining parameters.
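The nearest-station rule described in this section can be illustrated with a few lines of code. This is a minimal sketch of the idea only, not SWAT's own implementation: it assumes projected coordinates in metres, plain Euclidean distance, and hypothetical identifiers for stations, sub-basins and HRUs.

```python
import math


def nearest_station(centroid, stations):
    """Return the name of the station closest to a sub-basin centroid.

    centroid: (x, y) tuple; stations: dict mapping name -> (x, y).
    Coordinates are assumed to be in a projected system (metres).
    """
    return min(stations, key=lambda name: math.dist(centroid, stations[name]))


def assign_weather(subbasin_centroids, stations, hrus_by_subbasin):
    """Map every HRU to the station selected for its parent sub-basin."""
    hru_station = {}
    for sub_id, centroid in subbasin_centroids.items():
        station = nearest_station(centroid, stations)
        for hru in hrus_by_subbasin.get(sub_id, []):
            hru_station[hru] = station
    return hru_station


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    stations = {"gauge_A": (0.0, 0.0), "gauge_B": (10_000.0, 0.0)}
    centroids = {1: (2_000.0, 1_000.0), 2: (9_000.0, 3_000.0)}
    hrus = {1: ["1_1", "1_2"], 2: ["2_1"]}
    print(assign_weather(centroids, stations, hrus))
```

One consequence of this rule is that all HRUs in a sub-basin share the same forcing, so rainfall variability within a sub-basin is not represented.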

| Model calibration and assessment
Calibration parameters were selected based on their relation to NFM processes. In total, 16 parameters for each group of subbasins and 5 catchment-wide parameters were selected (Table 3), resulting in a total of 53 parameters for the Pang and 85 for the Blackwater. The catchment-wide parameters are the baseflow alpha factor for the deep aquifer (ALPHA_BF_D) and the maximum canopy storage, which is assumed to vary similarly across the subbasins but differently according to land cover classes such as forest (CANMX_FR), pasture (CANMX_PST), rangeland (CANMX_RNG), and agricultural land (CANMX_AGR).
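The parameter totals can be checked with simple arithmetic. The sketch below assumes one subbasin group per streamflow gauging station, that is, three groups for the Pang and five for the Blackwater; this grouping is inferred from the number of gauges described for each catchment, and the function name is illustrative.

```python
PER_GROUP_PARAMS = 16      # parameters calibrated separately for each subbasin group
CATCHMENT_WIDE_PARAMS = 5  # ALPHA_BF_D plus the four CANMX land cover variants


def n_calibration_parameters(n_groups: int) -> int:
    """Total number of calibrated parameters for a catchment."""
    return PER_GROUP_PARAMS * n_groups + CATCHMENT_WIDE_PARAMS


if __name__ == "__main__":
    # Assumed group counts: one group per gauging station.
    for name, groups in (("Pang", 3), ("Blackwater", 5)):
        print(name, n_calibration_parameters(groups))
```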
The SWAT model was calibrated using the simultaneous multi-site calibration technique (Leta et al., 2017) coupled with the multicriteria sequential calibration and uncertainty analysis (MS-CUA) approach proposed by Wu et al. (2021) using the Python package SPOTPY (Houska et al., 2015). This involved the definition of groups of subbasins based on the number of streamflow gauging stations used (Table 4).
Two calibration-validation breakdowns are tested to consider the different types of flood events that appear in the relatively short record: (i) 2012-2014 calibration, which includes a groundwater flood, and 2005-2008 validation, which has a flood event from intensive rainfall; and (ii) 2005-2008 calibration and 2012-2014 validation, with 2 years of warm-up in each calibration. The two periods were defined based on the availability of the datasets required by the model, using the longest continuous periods available. Considering alternate periods for calibration and validation makes it possible to determine whether the same model, with the same parameters, is able to simulate floods originating from rising groundwater as well as from intensive rainfall.
The 95% predictive uncertainty interval was estimated and the model performance evaluated using the P-factor and R-factor (Abbaspour et al., 2017; Camargos et al., 2018; Section 2 of Supporting Information).
Verification was applied to check that the estimated parameters, simulated processes and their spatiotemporal variations are realistic (Figure 2, Step 3). Water balance components were also analysed and compared between calibrations. Finally, for each catchment, the best model setup was selected based on the quality of the simulated hydrograph (Figure 2, Step 4).
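For readers unfamiliar with the two criteria, the sketch below shows how they are commonly computed: the P-factor is the fraction of observations falling inside the 95% prediction band, and the R-factor is the mean width of that band divided by the standard deviation of the observations. It is a minimal illustration under stated assumptions (band taken as the 2.5th and 97.5th percentiles of an ensemble of simulations, population standard deviation, hypothetical function names) and is not the code used in the study.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def prediction_band(ensemble, lower_q=2.5, upper_q=97.5):
    """Return (lower, upper) band per time step.

    ensemble: list of simulated series, one list per parameter set,
    all of equal length.
    """
    steps = list(zip(*ensemble))
    lower = [percentile(s, lower_q) for s in steps]
    upper = [percentile(s, upper_q) for s in steps]
    return lower, upper


def p_factor(obs, lower, upper):
    """Fraction of observations bracketed by the prediction band."""
    inside = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return inside / len(obs)


def r_factor(obs, lower, upper):
    """Mean band width divided by the standard deviation of observations.

    The population standard deviation is used here; this is an assumption.
    """
    mean_width = statistics.fmean([hi - lo for lo, hi in zip(lower, upper)])
    return mean_width / statistics.pstdev(obs)
```

A P-factor close to 1 together with a small R-factor indicates a narrow band that still brackets most observations; the thresholds quoted in this paper (P-factor above 0.5, R-factor below 0.6) can be applied directly to the two returned values.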

| NFM scenarios generation and assessment of effects
To assess the extent to which the best model setup can be used for NFM assessment, two broadscale hypothetical scenarios were generated (Figure 2, Step 5). These scenarios consist of "what if" assumptions and focus on the change in the water balance components, and subsequently river discharge, when all the land cover, except water and urban areas, is converted to: (i) broadleaf woodland and (ii) conventional agricultural land. The use of "what if" scenarios is a common technique in sensitivity analysis and one that allows the behaviour of the model to be interpreted and explained in more detail (Cloke et al., 2008). Their use is important for understanding the sensitivity of the model to land use changes, which is essential for assessing and understanding the effects of more realistic NFM scenarios. It provides critical evidence, for instance, on how much increasing afforestation to its maximum extent can reduce flood risk.
The broadscale NFM effects in each catchment were assessed by looking at changes in the 95% prediction intervals (Figure 2, Step 6). Differences between large floods, small floods and high-flow pulses, considered as flows with exceedance probability less than or equal to 2%, between 2% and 10%, and between 10% and 40%, respectively, were assessed. These flows were computed from the 95% prediction uncertainties of the calibrated model and each modelled scenario. The Kolmogorov-Smirnov

| RESULTS
For the two catchments, the integration of slope bands in the model is not necessary as this involves the integration and calibration of more parameters but does not markedly improve the quality of the simulated hydrograph.
Thus, models without slope bands, PGM1 and BWM1 for the Pang and Blackwater, respectively, were selected as the best ones.
flows are simulated relatively well (Figure 3c). At Frilsham, the model reacts earlier to rainfall inputs, and the modelled hydrograph is flashier than the observed response of the catchment. During the validation period (2005-2008), the model performance drops drastically, overestimating flows at Frilsham (Figure 3d) and Bucklebury (Figure 3e) with a P-factor of less than 0.2 at the two stations, although relatively better results are obtained at Pangbourne (Figure 3f) with a P-factor of 0.45. At all stations, the modelled flows are higher than recorded flows for the July 2007 peak, although it should be noted that the peak flow at Pangbourne, in particular, is highly uncertain due to extensive overspill into the neighbouring Sulham Brook.

|
The 2005-2008 calibration shows poor model performance at Frilsham and Bucklebury (P-factor ~0.2, R-factor >0.6) and an improvement at Pangbourne (P-factor = 0.5). However, the performance seems better
There are important changes in parameters related to infiltration and soil water: CN2, SOL_K and SOL_AWC (Figure 5a,b,d); those related to groundwater dynamics: GW_DELAY, GWQMN, RCHRG_DP, REVAPMN and GW_REVAP (Figure 5i-m); and those related to the river channel: CH_N2, CH_K2 and ALPHA_BNK (Figure 5n-p). Parameters such as SOL_BD (Figure 5c), ESCO, SURLAG, OV_N, and ALPHA_BF (Figure 5e-h) vary less. For the calibration over 2005-2008, for which an improved model performance and the minimisation of the streamflow overestimation were observed, CN2, RCHRG_DP and REVAPMN have decreased in PI and PII. This leads to more infiltration of water in the soil, more percolation to the deep aquifer and more water in the shallow aquifer that can move to the root zone. The CH_N2 parameter has also decreased in the three subbasin groups (Figure 5n), while the SOL_AWC and GW_REVAP parameter values increase, suggesting reduced channel roughness and more soil water storage and evaporation from the shallow aquifer. Moreover, the analysis of the simulated hydrographs of both calibrations for the Pang at Frilsham, where the differences are drastic, reveals that the changes occurred mainly in CN2, SOL_K,

| Verification of the average water balances
A comparative analysis of the average water balances shows that the most important differences between the two calibrations reside in the surface runoff (SURQ) and groundwater contribution to streamflow (GWQ), which decreased for the 2005-2008 calibration, reducing the water yield (WYLD). At the same time, a marked increase in the soil water storage (SW) occurred, while the percolation (PERC), lateral flow (LATQ) and evapotranspiration (ET) did not change much. In terms of percentage, the different components of the water balance seem to be constant throughout the year, without any noticeable variability from one month to another (Figure 6). Further analysis of the spatial variability of the simulated processes and their changes between the two calibrations shows that the annual average simulated ET varies according to the land use, in a similar fashion for both calibrations (Figure 7a,b). It ranges between 200 and 800 mm and is reduced by 0-30 mm when the model is run with 2005-2008 parameters compared to 2012-2014 (Figure 7c). The generated surface runoff (SURQ_GEN) varies from 0 to 350 mm/year and is less than 10 mm for a large part of the catchment, especially on the upper chalk, ranging from 10 to 50 mm on the lower clay. In terms of catchment coverage, SURQ_GEN above 50 mm, and more precisely between 50 and 100 mm, is higher for the 2012-2014 calibration (Figure 7d).

| Model performance
The 2012-2014 calibration and 2005-2008 validation indicate in general poor values of P-factor (between 0 and 0.7) and R-factor (from 0.2 to 1.7). The visual inspection shows that the model performs relatively better for most of the sub-catchments except on the Whitewater at Lodge Farm underlain by the chalk where the model barely captures the dynamics of the observed flows and has higher uncertainties during the calibration and validation (Figure 8a,b). On the same Whitewater sub-catchment downstream on the clay bedrock geology at Holdshot Farm, the quality of the model responsiveness improves but it overestimates the baseflow and peak flows during the calibration and validation (Figure 8c,d). Missing data during the validation period makes it challenging to gain further insights into the model performance. For the Hart at Bramshill House and the Blackwater at Farnborough where data are available only for the calibration period, the model performs relatively better but hardly captures certain peak flows (Figure 8e,f). The best performance of the model is obtained for the Blackwater at Swallowfield (Figure 8i,j).
For the 2005-2008 calibration and 2012-2014 validation, the values of P-factor (0-0.6) and R-factor (0.4-2) at most stations, except at Lodge Farm during the calibration, still indicate an overall poor fit. However, in general, an improved model performance over the two periods, without any drastic differences compared to the 2012-2014 calibration/2005-2008 validation, can be observed. The most important change can be observed for the Whitewater at Lodge Farm, where a larger prediction uncertainty is obtained (Figure 9a,b). For the Whitewater at Holdshot Farm, there is no noteworthy change in the performance criteria during the calibration and validation (Figure 9c,d), but the model produces higher peak flows compared to the 2012-2014 calibration/2005-2008 validation. This is also observed on the Hart at Bramshill House (Figure 9e,f) and the Blackwater at Swallowfield (Figure 9g,h).

| Verification of parameter distribution densities across the landscapes
Selecting the parameter set from the 2012-2014 calibration, the parameters' probability densities for the groups of subbasins show a consistent variation across landscape characteristics (Figure 11). For several parameters related to infiltration, soil water and groundwater, a clear difference can be observed in the densities for the Whitewater at Lodge Farm compared to those of other sub-catchments. For instance, the lowest values for CN2 and higher RCHRG_DP, implying higher infiltration and groundwater recharge, are observed on the sub-catchment underlain by the chalk. Conversely, distributions centred on higher CN2 and lower RCHRG_DP values are observed on the Blackwater at both Farnborough and Swallowfield, indicating higher surface runoff and lower recharge, respectively. Furthermore, a peak in the probability density at higher ALPHA_BF is observed for the Whitewater at Lodge Farm, indicating a more rapid land response to recharge compared to other sub-catchments. This response is also consistent with the observed density of GW_DELAY, which is lower for the Whitewater at Lodge Farm. The densities of the maximum canopy storage parameter, CANMX, which was calibrated catchment-wide and plotted according to land cover types, indicate that higher interception occurs on forest and rangeland, with values of around 1.5 and 1.25 mm, respectively, while agricultural land and pasture have lower values (Figure 12). In addition, agricultural land and pasture tend to have similar maximum canopy storage capacity.
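The link between CN2 and infiltration noted above follows from the SCS curve number equation that underlies surface runoff generation in SWAT. The sketch below evaluates that equation for a single rainfall depth and a fixed curve number; it is a simplified illustration (SWAT additionally adjusts the curve number with soil moisture each day), and the rainfall depth and CN values in the example are hypothetical, not calibrated values from this study.

```python
def retention_mm(cn: float) -> float:
    """Retention parameter S (mm) for a curve number CN (0 < CN <= 100)."""
    return 25.4 * (1000.0 / cn - 10.0)


def surface_runoff_mm(rain_mm: float, cn: float) -> float:
    """SCS curve number runoff (mm) with initial abstraction Ia = 0.2 * S."""
    s = retention_mm(cn)
    ia = 0.2 * s
    if rain_mm <= ia:
        # All rainfall is abstracted before runoff starts.
        return 0.0
    return (rain_mm - ia) ** 2 / (rain_mm - ia + s)


if __name__ == "__main__":
    rain = 40.0  # hypothetical daily rainfall depth in mm
    for cn in (60.0, 85.0):  # hypothetical low (permeable) and high CN values
        q = surface_runoff_mm(rain, cn)
        print(f"CN={cn:.0f}: runoff={q:.1f} mm, remainder={rain - q:.1f} mm")
```

For the same rainfall, the lower curve number yields far less runoff, which is consistent with the low CN2 densities on the chalk sub-catchment being read as higher infiltration.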

| Blanket conversion to broadleaf woodland
The broadscale land conversion to broadleaf woodland except for areas covered by water and urban areas in the high-flow pulses extracted from the calibration and the modelled scenario. Estimations reveal a reduction of the medians of between 12% and 14% for large floods, 15% and 16% for small floods, and 21% and 28% for high-flow pulses following woodland planting (Figure 14a-c). A similar shift towards a reduction in the streamflow is observed in the Blackwater catchment following the blanket conversion to broadleaf woodland (i.e., from 15% to 74% of the catchment covered by woodland) (Figure 13b). Statistically significant differences in the distributions of the large floods, small floods and high-flow pulses extracted from the calibration and the modelled scenario were also obtained, with reductions of between 4% and 15%, 12% and 16%, and 17% and 29% in the medians of large floods, small floods and high-flow pulses, respectively (Figure 14d-f).

| Blanket conversion to cropland
The broadscale land conversion to cropland in the Pang at Pangbourne (from 54% to 96% of the catchment) and in the Blackwater (from 21% to 74% of the catchment) show an increasing shift of the 95% streamflow prediction uncertainty in the two catchments (Figure 13c,d). The Kolmogorov-Smirnov test shows that there is sufficient evidence to conclude that the large floods, small floods and high-flow pulses extracted from the calibration and the modelled scenario in the two catchments come from different distributions.
For the Pang catchment, there is an increase of between 5% and 7%, 5% and 6%, and 7% and 8% in the medians of large floods, small floods and high-flow pulses, respectively (Figure 14g-i). For the Blackwater at Swallowfield, the change in the median of large floods ranges from a reduction of 0%-1% to an increase of 0%-10%, while increases of between 5% and 8% and between 8% and 9% are obtained in the medians of small floods and high-flow pulses, respectively (Figure 14j-l).
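The flow classes and median changes reported in this section can be reproduced in outline as follows. This is a minimal sketch rather than the procedure used in the study: it assumes the Weibull plotting position, rank / (n + 1), for the empirical exceedance probability, applies the 2%, 10% and 40% thresholds from the methods, and uses hypothetical function names. A two-sample Kolmogorov-Smirnov test on each class could then be run with an existing implementation such as `scipy.stats.ks_2samp`.

```python
import statistics

# Upper exceedance-probability limits (%) of each flow class, from the text.
CLASS_LIMITS = (("large flood", 2.0), ("small flood", 10.0), ("high-flow pulse", 40.0))


def exceedance_probabilities(flows):
    """Empirical exceedance probability (%) of each flow.

    Assumes the Weibull plotting position rank / (n + 1), with rank 1
    assigned to the largest flow.
    """
    order = sorted(range(len(flows)), key=lambda i: flows[i], reverse=True)
    probs = [0.0] * len(flows)
    for rank, idx in enumerate(order, start=1):
        probs[idx] = 100.0 * rank / (len(flows) + 1)
    return probs


def class_medians(flows):
    """Median flow of each class found in a flow series."""
    groups = {name: [] for name, _ in CLASS_LIMITS}
    for flow, prob in zip(flows, exceedance_probabilities(flows)):
        for name, limit in CLASS_LIMITS:
            if prob <= limit:
                groups[name].append(flow)
                break
    return {name: statistics.median(vals) for name, vals in groups.items() if vals}


def median_change_pct(baseline, scenario):
    """Percentage change in class medians from baseline to scenario."""
    base, scen = class_medians(baseline), class_medians(scenario)
    return {k: 100.0 * (scen[k] - base[k]) / base[k] for k in base if k in scen}
```

Negative values indicate a reduction relative to the calibrated baseline, as in the woodland scenario, and positive values an increase, as in the cropland scenario.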

| Variation of SWAT model performance as affected by landscape characteristics, processes configuration and calibration techniques
For NFM modelling in a catchment with heterogeneous landscapes, a model that can account for the heterogeneities of soil and land use and perform well across landscapes is essential.
In this study, the model performance varies across the catchments depending on landscape characteristics in terms of soil and geology, modelling technique, as well as the cause of the flooding event (i.e., pluvial or groundwater flooding). The poor performance of the model upstream in the Pang catchment at Frilsham and Bucklebury compared to Pangbourne reflects the complexity of the catchment hydrological system, which is mainly governed by regional groundwater dynamics. The groundwater catchment has been shown to differ from the topographic catchment in the Pang, as regional groundwater flow is south-east towards the deeply incised River Thames (Jackson et al., 2006a; Jackson, Wheater, et al., 2006; Wheater et al., 2007), with the Pang acting as a drain that "switches on" when groundwater levels are high. The effect will be most pronounced in the upper Pang and thus helps explain why model performance at Frilsham and Bucklebury is poor and, moreover, why model parameters calibrated against a period of high groundwater levels (2012-2014) are unable to reproduce groundwater baseflow during a time of low/average groundwater levels (2003-2008). Whereas streamflow at Frilsham is almost exclusively from the Chalk aquifer, streamflow at Pangbourne also has a surface runoff contribution from the Palaeogene clay deposits, which may further explain why model performance at Pangbourne is better than at Frilsham and Bucklebury. SWAT conceptualises each HRU as an independent aquifer system (comprising a shallow unconfined aquifer and a deeper confined aquifer). To simulate hydrology in the Pang in a way that respects our understanding of the catchment, it is necessary to model lateral groundwater flow between HRUs as well as with neighbouring catchments.
While the SWAT conceptualisation of the aquifer as two stacked aquifers is not completely consistent with the understanding of the hydrogeology of the Pang (Jackson et al., 2006a), it does allow groundwater to be lost from the system, accounting for the flow of groundwater beneath the Pang surface water catchment to discharge to the River Thames. The weak performance of the model, predominantly related to the simulation of groundwater, can also be seen in the parameter shifts between the two calibration periods (Figure 5). In the 2005-2008 calibration, SOL_AWC increases, indicating a decrease in baseflow; GWQMN and RCHRG_DP increase, implying a reduction in baseflow generation and an increase in recharge; and REVAPMN decreases, meaning more water is allowed to move back from the shallow aquifer to the unsaturated zone and is subsequently lost via evapotranspiration.
FIGURE 14 Changes in large floods, small floods and high-flow pulses from the broadleaf woodland scenario in the Pang (a-c) and Blackwater (d-f) catchments; and from the cropland scenario in the Pang (g-i) and Blackwater (j-l) catchments.
The fact that the simultaneous multisite calibration technique helped to achieve reliable results across different landscapes in the study catchments provides compelling support to previous work (Brighenti et al., 2019; Leta et al., 2017), which demonstrated the advantages of this calibration technique. As concluded by Leta et al. (2017), in catchments with high spatial heterogeneities in soils and landscapes as well as geology, and subsequently spatial variations in streamflow generation processes, this simultaneous multisite calibration should be preferred, depending on the scale of the catchment. If the outlets are hydrologically connected, the simultaneous multisite calibration can be coupled with a sequential approach that consists of narrowing the parameter intervals from upstream to downstream, as used in this study.
This has the advantage of minimising the identifiability issue and quickly achieving the calibration.
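One simple way to picture the sequential narrowing of parameter intervals is sketched below. It is an illustration under loud assumptions, not the MS-CUA procedure itself: it assumes that a parameter's interval is shrunk to the range spanned by the best-scoring fraction of sampled values at the upstream gauge, clipped to the original bounds, before sampling continues downstream. The function name, the `keep_fraction` argument and the example values are hypothetical.

```python
def narrow_bounds(samples, scores, bounds, keep_fraction=0.1):
    """Shrink a parameter interval to the range spanned by the best samples.

    samples: sampled values of one parameter.
    scores: objective value of each sample (higher is better).
    bounds: (low, high) interval used for the original sampling.
    keep_fraction: share of best-scoring samples retained (an assumption).
    """
    n_keep = max(1, int(len(samples) * keep_fraction))
    best = sorted(zip(scores, samples), reverse=True)[:n_keep]
    values = [value for _, value in best]
    low, high = bounds
    # Clip so the narrowed interval never extends beyond the original one.
    return max(low, min(values)), min(high, max(values))


if __name__ == "__main__":
    # Hypothetical samples whose score peaks at a value of 50.
    samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    scores = [-abs(s - 50) for s in samples]
    print(narrow_bounds(samples, scores, (35, 98), keep_fraction=0.3))
```

Restricting the search space in this way reduces the number of parameter combinations explored downstream, which is one reason such an approach can ease identifiability problems and shorten calibration.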

| Performance of SWAT model for areas with different streamflow generation processes
Due to the variation of the SWAT model performance in different landscapes, it is important to understand where it performs well, and the reasons behind this, in order to guide future applications involving the model. The superior performance of the model at the Pang at Pangbourne and at sub-catchments of the Blackwater with a smaller groundwater contribution-that is, excluding gauges on the Whitewater-confirms that the model performs well in areas where streamflow generation is dominated by surface and shallow subsurface processes. In order to improve the performance of the model in catchments underlain by Chalk, SWAT would need to be coupled with a regional groundwater flow model. Although groundwater flow models of Chalk aquifers-including of the Pang -work well at a coarser temporal resolution as water resources models, modelling flood peaks on a daily time step is more challenging (Collins et al., 2020). In part, this is due to the complexity of interactions between matrix and fractures that govern flow through the Chalk unsaturated zone (Collins et al., 2020;Ireson et al., 2009;Jackson et al., 2006a;Rahman & Rosolem, 2017). These authors proposed different methods to improve water flow simulation in the unsaturated zone of Chalk including a decoupled soil layer approach , a coupled discrete soil layer approach (Brouyère, 2006), and a multi-layered approach which represents both soil and weathered chalk (Ireson et al., 2009) with the application of the dual porosity Richards equation to groundwater flow through the matrix. These approaches add a considerable number of parameters which are difficult to measure and increase the computational time, making the robust uncertainty analysis used in this study more challenging to apply. This suggests that NFM modelling in Chalk catchments may have a number of obstacles to overcome to reach the level of confidence required by stakeholders to invest in NFM schemes.
4.3 | Test of the effectiveness of NFM in the selected catchments: The effect of size of flood event
The reduction of peak flows from the blanket land conversion to broadleaf woodland, and their increase resulting from conversion to cropland, are consistent with previous studies on the effects of catchment woodland on the hydrological cycle and flood risks (Bosch & Hewlett, 1982; Thomas & Nisbet, 2007), as well as on the impacts of cropland expansion on streamflow (Rogger et al., 2017). Moreover, the relatively modest effect of broadscale land conversion to broadleaf woodland on the large flood events suggests that a worthwhile reduction of large flood flows in medium-sized catchments will require very extensive afforestation. Buechel et al. (2022) found similar effects of afforestation on high and very high flows in twelve catchments ranging in size from 500 to 10,000 km2. They also showed that the planting extent has a stronger control on streamflow than the location of planting within the catchment. Crooks et al. (2000) tested two scenarios consisting of 100% grassland and 100% forest (50% deciduous and 50% coniferous), respectively, over 10,000 km2 of the Thames basin and found a decrease in flood frequency. Moreover, results of the EUropean River Flood Occurrence & Total Risk Assessment System (EUROTAS) project have shown for the Thames basin that a realistic scenario for urbanisation of 2% of the catchment leads to an increase of flood discharge within the uncertainty of the modelling process, and that the effects of urbanisation on flood runoff are localised (Samuels, 2001a, 2001b).
Even if afforestation of a substantial part of the catchment is needed for an appreciable reduction in peak flow, a small change in flood flow can still change the flood risk of individual stakeholders and is therefore important in terms of flood risk reduction.
Results demonstrate that the model setup, process configuration, parametrisation and CUA used in this study enable NFM assessment, especially in areas where streamflow generation is dominated by surface and shallow subsurface processes. Such a consistent modelling framework provides a means for investigating the effects of different woodland types, not only for NFM but also to guide the development of future tree planting proposals as raised by Cooper et al. (2021), for example, to avoid the occurrence of undesirable low flows. For the groundwater-dominated areas, challenges remain, and a combination of techniques, analysis and reasoning, including expert knowledge, is therefore required to reach a meaningful and objective conclusion regarding the effect of NFM.
The overall disappointing values of P-factor and R-factor, though somewhat dependent on the threshold values of the efficiency criteria used for separating behavioural from non-behavioural simulations, indicate that the tested modelling framework only provides an indication of NFM effectiveness in the selected catchments, particularly in areas dominated by surface and fast subsurface processes. For robust policy recommendation on NFM implementation in the selected catchments, a systematic improvement of the model performance will be necessary.
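To make the two uncertainty measures concrete, the following minimal sketch computes them from an ensemble of behavioural simulations. It assumes the definitions commonly used with SWAT uncertainty tools: the P-factor as the fraction of observations bracketed by the 95% prediction uncertainty band, and the R-factor as the mean width of that band divided by the standard deviation of the observations. The function name, argument names and the use of the population standard deviation are illustrative assumptions, not taken from this study.

```python
import numpy as np


def p_and_r_factor(observed, simulations, lower_pct=2.5, upper_pct=97.5):
    """Return (P-factor, R-factor) for an ensemble of simulations.

    observed    : 1-D sequence of observed flows, length n_timesteps
    simulations : 2-D array-like, shape (n_runs, n_timesteps), assumed to
                  hold only the behavioural runs (hypothetical input layout)
    """
    obs = np.asarray(observed, dtype=float)
    sims = np.asarray(simulations, dtype=float)

    # Lower and upper limits of the prediction uncertainty band per time step
    lower = np.percentile(sims, lower_pct, axis=0)
    upper = np.percentile(sims, upper_pct, axis=0)

    # P-factor: share of observations that fall inside the band
    inside = (obs >= lower) & (obs <= upper)
    p_factor = inside.mean()

    # R-factor: mean band width relative to the spread of the observations
    # (population standard deviation, ddof=0, is an assumption here)
    r_factor = (upper - lower).mean() / obs.std()

    return float(p_factor), float(r_factor)


if __name__ == "__main__":
    # Synthetic example data, for illustration only
    rng = np.random.default_rng(0)
    obs = 5.0 + rng.random(365)
    sims = obs + rng.normal(0.0, 0.3, size=(200, 365))
    p, r = p_and_r_factor(obs, sims)
    print(f"P-factor = {p:.2f}, R-factor = {r:.2f}")
```

Under these definitions, a narrow band that still brackets most observations gives a high P-factor and a low R-factor, which is the combination (P-factor >0.5 and R-factor <0.6) reported for the areas where the model performed better.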

| NFM modelling at medium-scale catchments: Requirements and importance of the tested modelling framework
Overall, this study demonstrates that it is possible to reliably model land-based NFM in medium-scale catchments, but that success depends strongly on catchment landscape characteristics on the one hand, and on the model choice, configuration, parameterisation, and CUA techniques on the other. This reflects the importance of the tested modelling framework, which provides evidence on the feasibility of NFM modelling using models like SWAT in complex landscapes with different hydrogeologic properties.
From the practitioners' point of view, the study can be seen as a supportive technical guideline for any NFM modelling and implementation in heterogeneous medium-scale catchments. In fact, the proposed modelling framework allows its users to improve their understanding of catchment characteristics and NFM-related processes, to test the model suitability, to identify its best setup and parametrisation and, last but not least, to account for uncertainties. The tested modelling framework also highlights the importance of model parametrisation, as well as CUA techniques, in catchments with heterogeneous landscapes and complex hydrology.

| CONCLUSION
In this study, we tested how the SWAT model can be used to simulate NFM in medium-scale lowland catchments, while considering uncertainties by exploring the effect of using different process configurations and calibration techniques. We found that in catchments with heterogeneous soils, landscapes and geology, the model performance can be improved by using simultaneous multi-site CUA techniques compared to simple multi-site calibration. Depending on whether stations are hydrologically connected, a cascading approach which adjusts parameters from upstream to downstream will enhance the calibration. The model performed better in areas where streamflow generation is dominated by surface and fast subsurface processes but poorly in groundwater-dominated areas. The use of catchment hydrological models to simulate groundwater-dominated catchments may help to understand the system functioning but does not offer the means to assess NFM measures given the uncertainties associated with the modelling results.
Apart from the feasibility of NFM modelling in medium-scale catchments, and based on the results from areas dominated by surface and interflow processes, where the model performs better, NFM is likely to be less effective on large floods than on small floods and high-flow pulses, and its effectiveness is likely to vary according to storm conditions. However, further investigation is required to understand how the processes related to NFM change as a function of rainfall intensity, duration and frequency, and the interactions with groundwater.
Our methodological approach allows for a holistic assessment of the modelling framework to highlight the suitability and limitations of its components (data, methods and techniques). The proposed modelling framework provides the means to fully understand the functioning of the catchment system and how the NFM measures change the hydrological processes in areas where streamflow generation mechanisms are dominated by surface and fast subsurface processes. However, it does not provide a means for accurately estimating the potential contribution of NFM in groundwater-dominated areas. We conclude that prior to hydrological impact modelling in the context of NFM, it is necessary to understand both the target flood generation mechanism in relation to catchment landscape properties and processes, and the model's ability to simulate this, since the translation of landscape features related to NFM into tangible hydrological model configurations and parameters is neither simple nor straightforward. NFM modelling should only be used for policy recommendations for NFM implementation if the results can be assessed as robust and reliable. Our study provides an approach designed to ensure that this is the case and we hope that it also improves the perception of NFM modelling within the hydrological community. Our findings in particular highlight the need to further develop integrated hydrological simulation tools for groundwater-dominated catchments.