Assessing the hydrological and geomorphic behaviour of a landscape evolution model within a limits‐of‐acceptability uncertainty analysis framework

Landscape evolution models (LEMs) have the capability to characterize key aspects of geomorphological and hydrological processes. However, their usefulness is hindered by model equifinality and paucity of available calibration data. Estimating uncertainty in the parameter space and resultant model predictions is rarely achieved as this is computationally intensive and the uncertainties inherent in the observed data are large. Therefore, a limits‐of‐acceptability (LoA) uncertainty analysis approach was adopted in this study to assess the value of uncertain hydrological and geomorphic data. These were used to constrain simulations of catchment responses and to explore the parameter uncertainty in model predictions. We applied this approach to the River Derwent and Cocker catchments in the UK using a LEM CAESAR‐Lisflood. Results show that the model was generally able to produce behavioural simulations within the uncertainty limits of the streamflow. Reliability metrics ranged from 24.4% to 41.2% and captured the high‐magnitude low‐frequency sediment events. Since different sets of behavioural simulations were found across different parts of the catchment, evaluating LEM performance, in quantifying and assessing both at‐a‐point behaviour and spatial catchment response, remains a challenge. Our results show that evaluating LEMs within uncertainty analyses framework while taking into account the varying quality of different observations constrains behavioural simulations and parameter distributions and is a step towards a full‐ensemble uncertainty evaluation of such models. We believe that this approach will have benefits for reflecting uncertainties in flooding events where channel morphological changes are occurring and various diverse (and yet often sparse) data have been collected over such events.


| INTRODUCTION
In most landscapes, processes of weathering, erosion and deposition are highly integrated with hydrological processes and river flow. For several decades, research has tried to unravel the complexity of this behaviour using various methods, increasingly depending on the descriptive and predictive capabilities of numerical models to do this (Owens & Collins, 2006). A part of this research effort has been focused on developing LEMs to simulate and visualize various geomorphological and hydrological process dynamics simultaneously and examine potential short-term and long-term process linkages. As a result, LEMs are increasingly used in a wide range of applications characterizing various aspects of geomorphological change, such as: (1) testing hypotheses about landform process dynamics (e.g. Densmore et al., 1998;Gioia & Lazzari, 2019;Tucker & Slingerland, 1994); (2) evaluating the effect of changing climate on river morphology (e.g. Coulthard et al., 2000;Coulthard & Macklin, 2001;Hancock, 2009;Hancock et al., 2017;Temme et al., 2009); (3) as a management tool for environmental problems (e.g. Coulthard & Macklin, 2003;Hancock et al., 2000; and (4) assessing the effects of digital elevation model (DEM) resolution on model simulated outputs (e.g. Claessens et al., 2005;Schoorl et al., 2000) and vegetation effects (e.g. Bastola et al., 2018;Collins et al., 2004;Hooke et al., 2005).
The landscape evolution modelling community has made significant advances in understanding model complexity and component interactions ( Van de Wiel et al., 2011), but validation and uncertainty investigations have been limited (Skinner et al., 2018;Tucker & Hancock, 2010). However, to ensure effective use of LEMs, quantifying the magnitude and sources of uncertainty associated with observed constraining data and model simulations is essential as this can increase the reliability of the model predictions and effectively define realistic values that should be used in subsequent assessments. This paper addresses this important issue by exploring the parameter and predictive uncertainty of a LEM by assessing the ability of hydrological and geomorphic uncertain observations to constrain model simulations.
Uncertainty in LEMs is currently acknowledged as an issue, as it is for all environmental modelling, but has rarely been quantified (Skinner et al., 2018;Van de Wiel et al., 2011). Whilst the initial form of the landscape and external driving conditions are often interpolated or extrapolated due to the paucity of available data, the model structure, choice of geomorphic processes, processes formulations and parameterization are strongly affected by a lack of prior knowledge and the difficulty in deciding whether to include or neglect certain processes. These epistemic (knowledge) uncertainties are inherent in the model calculations in terms of cell grid structure and time step (Temme et al., 2011). Whatever the uncertainties, most LEMs have a large number of possible parameters, and each can combine many different ranges of acceptable values. Past studies have mainly focused on LEM sensitivity to changes in climate variability and precipitation characteristic (Armitage et al., 2018;Coulthard & Skinner, 2016;Skinner et al., 2020), and variations in initial conditions Ijjaszvasquez et al., 1992;Kwang & Parker, 2019). Hancock (2009)  instance, how to identify realistic changes simulated by the LEMs given the parameter and data uncertainties, and how the uncertainty in LEMs propagates when the results of LEMs are used for subsequent analysis. For example, LEM results could be used as inputs in flood inundation models to account for how morphological change impacts on flood risk. By identifying realistic simulations by the LEMs and uncertainty propagation, this is, therefore, a crucial first step in quantifying the parameter uncertainty in LEMs so as to improve their reliability as physically based numerical models.
In common with other types of environmental model applications, LEMs suffer from equifinality (Beven, 1996; such that there can be several or many combinations of parameter sets which result in equally acceptable simulations when the model is evaluated against observed data. Numerous approaches (e.g. standard Bayesian approaches (Krzysztofowicz, 2002;Kuczera & Parent, 1998), Bayesian model averaging (BMA) Duan et al., 2007;Vrugt et al., 2006), generalized likelihood uncertainty estimation (GLUE) (Beven & Binley, 1992) and others) have been proposed on the basis of accepting multiple acceptable parameter sets and treating each as a scenario of uncertain conditions that describes a simulated system. In most of the real applications, since a residual time series is neither available nor independent of uncertain observed data (Winsemius et al., 2009), the justification for making strong statistical assumptions about the nature of likelihood functions is rather weak. It will be difficult to apply formal statistical parameter inference which involves updating a prior distribution of model parameters based on statistical likelihood measures and requires an explicit account of errors in the model structure, parameters and the input data (Mantovan & Todini, 2006). Given the fact that the residual error characteristics are not fully known and the issues with a lack of commensurability between the limited observed data and the model, GLUE is a more suitable approach to reveal uncertainties in LEMs from which a probabilistic prediction can be made at the expense of relaxing certain statistical assumptions of the formal Bayesian approach (Beven, 2009).
Since there is no objective method to choose the threshold for some informal likelihood measures (Mantovan & Todini, 2006;Montanari, 2005), GLUE has been criticized for its subjective distinction between behavioural and non-behavioural model. In response, an approach to model evaluation on the basis of limits of acceptability (LoA) for use within the GLUE methodology was outlined by Beven (2006) and first demonstrated by Liu et al. (2009). This approach suggested that, before simulating any model runs, the LoA should first be defined based on a range of 'effective observational error' that incorporates observation error in the measurements given the availability of information and allows for the effects of commensurability error and input error. The Monte Carlo model runs that provide predictions that all fall within the LoA are then classified as behavioural, and each model is associated with a performance score that summarizes how well the model simulates the observed. Since different error uncertainties are often inherent within different types of data, interpretation of the model results could be biased because the errors from various sources (i.e. input data, model structure, model parameters and observations for calibration and validation) could compensate each other (McMillan et al., 2012). The LoA approach thus pays attention to different sources of uncertainty and allows different limits to be set for individual observations in the calibration process such that the modelling exercise is fit for purpose. Liu et al. (2009) demonstrated the use of the LoA approach in GLUE for identifying behavioural models whilst allowing for uncertainties in observational data. Other studies which focused on the calibration of hydrological models have adopted this approach in building their LoA for: (i) uncertainties in the stage-discharge relationship and evaluation points in a flow-duration curve (Van Hoey et al., 2015;Westerberg et al., 2011b); (ii) flood frequency estimation (Blazkova & Beven, 2009); (iii) evaluation of model structure, parameter and data uncertainties Mackay et al., 2018;Teweldebrhan et al., 2018); (iv) calibration of models incorporating both hard and soft information (Winsemius et al., 2009);and (v) uncertainties in hydrological and water quality data (Coxon et al., 2015;Hollaway et al., 2018;McMillan et al., 2012). It is therefore expected that this approach will be equally applicable in other environmental modelling frameworks such as LEMs.
The aim of this paper is to apply an LoA uncertainty analysis approach to evaluate LEM simulations: (1) to quantify and assess the value of uncertain hydrological and geomorphic data in constraining the catchment response, and (2) to identify the uncertainty of parameters for predictions of observed and uncertain hydrological and geomorphic behaviours and dynamics. The River Derwent and Cocker catchment in North West England, UK provides a good test of the methodology because of the availability of stage-discharge and suspended sediment load data, which are both highly uncertain and so require the quantification of the expected observational errors (see Table 1 for data summary). Also, the locations of the gauges and the monitoring sites provide an opportunity to assess not only the model performance at the catchment outlet but also the model ability in capturing the internal catchment response. Such an evaluation is uncommon in most LEM applications. The catchment is very steep in its upstream sections, which contain some of the highest peaks in England (over 900 m). The catchment geology is dominated by the Skiddaw Slate Group, which the River Cocker and most of the River Derwent upstream of Cockermouth (including Bassenthwaite Lake and Derwent Water) flow over, while the upper reaches of the River Derwent, Naddle Beck and St Johns Beck lie on the Borrowdale Volcanic Group (Hatfield & Maher, 2008. The remainder of the Derwent lies on Carboniferous limestone, milestone grit and coal measures (Moseley, 1979;Wilson, 2010). Holocene (post-glacial) alluvium (river sediments) occur along many of the main valleys in the catchment, which are dominated by loamy soils. The watercourses within the catchment comprise steep bedrock channels with step-pool sequences in the headwaters and boulder/gravel-bed channels in valley reaches. Sediment sizes range from sand to boulders but are dominated by gravel and cobbles.
Vegetation within the catchment is dominated by grassland (Environment Agency, 2009). The average annual rainfall is 2,408 mm but can be as high as 4,175 mm on the mountaintops (Environment Agency, 2009). In the upper headwaters, the rivers have a very flashy flow regime due to topography, geology and soils. However, this flashy response of the upstream reaches is attenuated by the lakes in the downstream sections of the River Derwent (Hatfield & Maher, 2008). The River Cocker has a more flashy response than the River Derwent. Combined with the impermeable underlying geology and waterlogged upland soils, large amounts of runoff are produced, and this can cause significant downstream flooding (Chiverrell et al., 2019).

| Data
To improve the understanding of the catchment dynamics, a NEXTMap DEM at 50 m resolution of the entire catchment was used. This DEM was provided by Intermap Ltd. based on an airborne interferometric synthetic-aperture radar (IFSAR) survey with a vertical accuracy of 1 m. Rainfall data were provided by the Environmental Agency of England and Wales (EA) during the period from December 1990 to September 2012. To account for the spatial and altitudinal effect, rainfall data from 21 tipping-bucket rain gauges located within and 6 km around the catchment (see Figure 1 for spatial distribution) were obtained to produce an areal average rainfall (AAR) for both the River Derwent and Cocker catchments. Stage-discharge measurements at five flow gauging stations ( Figure 1 and For suspended sediment, the data reported in the study of Warburton (2010) in which fluvial sediment flux monitoring was undertaken in the Derwent catchment from 1 April to 30 November 2006 were used (Table 1). River stage (m) and turbidity (measured in nephelometric turbidity units using an Analite 395 nephelometric turbidity probe) were recorded continuously and logged at 15-min intervals on a Campbell CR200 data logger at two monitoring sites (Derwent at Portinscale and Newland Beck at Braithwaite). The 15-min interval data were re-grouped to hourly time intervals and processed by Warburton (2010) to provide the suspended sediment data. While no primary field data were collected during this study, the grain size distribution (GSD) characteristics of the Eden catchment, a neighbouring catchment north east to the River Derwent and Cocker, were used as the input of the model in this study. The Eden catch-    Table A1  (2) maximize the numbers of years so that the rainfall data series are of reasonable length, ca. 10 years; (3) retain as many rain gauges within the catchment as possible even though their percentage of missing values in some years exceeds 10%; and (4) evaluate the rain gauge consistency against the three nearest neighbours by correlation analysis. Accordingly, the hourly rainfall series of the rain gauges (except rain gauges 1, 18, 19 and 20, Figure 1) from 1999 to 2011 were used and regarded as the observed rainfall period. The hourly rainfall data of the 17 rain gauges were interpolated using the ordinary kriging method, which preserves the pattern of spatial dependency (see e.g. Goovaerts, 2000;Mair & Fares, 2011) to estimate the spatial precipitation field of both catchments at an hourly time step.
The total gridded spatial rainfall within the catchment was divided by the catchment area to produce the hourly homogeneous AAR. This was done separately for the River Derwent and River Cocker catchments. To assess the robustness of the time series, the AARs of both catchments were aggregated into daily values, and its spatial field and temporal variability were compared with daily 1 km-gridded rainfall data, which is a composite of radar data and rain gauge data provided by the EA. In general, the AARs showed similar rainfall characteristics (with coefficient of determination, R 2 , of 0.61 and 0.57 for the River Derwent and River Cocker catchments, respectively) and were compatible with daily 1 km-gridded rainfall data.

| THE CAESAR-LISFLOOD MODEL
CAESAR-Lisflood is an LEM that simulates the evolution of landforms by directing water over a regular grid of cells and modifying elevations based on erosion and deposition caused by fluvial and slope processes . CAESAR-Lisflood integrates the LISFLOOD-FP 2D hydrodynamic flow model (Bates et al., 2010) with the CAESAR geomorphic model (Coulthard et al., 2000(Coulthard et al., , 2002(Coulthard et al., , 2005Van De Wiel et al., 2007) to dynamically simulate both erosion and deposition and flood inundation extent and depth simultaneously in river catchments and reaches over a range of temporal scales. There are four main components in CAESAR-Lisflood, featuring hydrological processes, multidirectional routing of river flow, fluvial erosion and deposition over a range of different grain sizes, and slope processes (soil creep, mass movement). These components are described briefly below, but for a full description see Coulthard et al. (2002), Van De Wiel et al. (2007) and Coulthard et al. (2013).
CAESAR-Lisflood can be run in two modes: a catchment mode, with no external influxes other than rainfall; and a reach mode, with one or more points where discharge and sediment enter the system.
When running in catchment mode, hourly rainfall data are used to drive an adapted version of TOPMODEL (Beven & Kirkby, 1979) to calculate runoff, which is then routed using the flow model. In reach mode, sources of discharge (both water and sediment) can be added at user-defined points. Surface water routing is then carried out using the 2D hydrodynamic LISFLOOD-FP model (Bates et al., 2010). The hydraulic model time step is controlled by the shallow-water Courant-Friedrichs-Lewy condition to maintain numerical stability. The flow depth and flow velocity are used to calculate shear stress, which in turn determines the erosion, transport and deposition of sediment. CAESAR-Lisflood can simulate erosion and deposition over nine sediment fractions, with one fraction treated as suspended sediment. As the study channels are largely gravel and sand, sediment transport is calculated using the Wilcock and Crowe (2003) equations, which are based on field and laboratory data from a coarser bed gravel-sand mix. Deposition of sediments differs between bed load and suspended load, with bed load being moved directly from cell to cell, whereas suspended load is deposited according to fall velocities and concentrations for the suspended sediment fraction. Erosion within the channel is controlled by the inchannel lateral erosion rate, which represents how cohesive or not the sediment is.
Slope processes are also included, with mass movement and soil creep. Mass movement (landslides) is represented as an instantaneous movement process. When a critical slope threshold is exceeded by the slope between adjacent cells, material is moved from the uphill cell to the one below until the angle is lower than the threshold.
Movement upslope may be triggered by a small slide in a cell at the base of the slope, and the adjacent cells are checked iteratively until there is no more movement (Coulthard et al., 2002). Soil creep (per year) is calculated between each cell using a simple diffusion equation which is linearly proportional to slope angle. After the fluvial erosion/ deposition and slope process amounts are calculated, the elevations and grain-size properties of the cells are updated simultaneously. In this study, CAESAR-Lisflood version 1.2x was used.

| Model setup
CAESAR-Lisflood was set up in catchment mode, and its main data sources were a DEM as the landscape, hourly rainfall inputs and grain size distribution (GSD). The river channel and floodplain and lake topography in the model were described using the modified DEM as above (i.e. 200 m and 100 m DEM for the River Derwent and Cocker catchments, respectively). Bedrock can afford an important control on channel incisionespecially in upland areas and over long (e.g. millennial) timescales. However, mapping the bedrock depth and any locations where it is at the surface is a considerable task for such a large catchment. Furthermore, for this study, many parts of the simulated basin were low gradient and lacustrine, where the impact of bedrock will be less than in an entirely upland basin. Therefore, the simulations were run without a bedrock layer present.
Many numerical models require a spin-up period for the results to become stable. In the case of CAESAR-Lisflood, the model produces extremely high sediment transport rates in early simulations as surface roughness in the digital elevation is removed and smoothed and the particle size distribution is sorted across the catchment according to the topography and hydrology. In this case, the model was usually run using repeated rainfall data for years of simulation Since the river channels are dominated by gravel and cobbles with sediment sizes ranging from sand to boulders, it is important to reflect the sediment variability of the catchment in order to examine the effects of GSD on model performance and the hydrological and geomorphic behaviour of the model during floods. In this regard, the grain size data were classified into nine size ranges to suit CAESAR-Lisflood (Table 2). Since grain sizes smaller than 0.3 mm were not observable using the photo analysis technique, the GSDs were adjusted by assuming a 20% proportion of fine sediments for each sub-catchment (personal communication with Jorge Ramirez). Figure 2 shows the uncertainty bound (grey) from 0.5 mm (+1ϕ) to 128 mm (À7ϕ) based on the variability (5th and 95th percentile) of the field measurements of all sub-catchments, while the envelope less than 0.5 mm is typical of upland soils described in Wilson (1993). The black line (case 1 in Table 2) represents the mean GSD characteristic of all subcatchments. To provide a basis for varying the GSD, the mean GSD was first described by four grain-size parameters which are based on Folk and Ward (1957)  Only variations in mean and kurtosis were of interest in this study, thus cases 2 to 5 ( Figure 2 and Table 2) were selected by changing the mean and kurtosis of the distribution while maintaining the skewness and sorting of the distribution to fall into the same description. One grain size parameter is changed in each case to further reduce the numbers of cases that need to be simulated. In general, case 2 can be described as 'peaky', case 3 as 'less peaky', case 4 as 'coarser' and case 5 as 'finer', relative to case 1, to reflect different distribution characteristics.

| Model parameters and sampling range
There are a total of 24 parameters in the CAESAR-Lisflood model that must be specified and are normally treated as being homogeneous across the model domain. To reduce the dimensionality of the Case 1 represents the mean grain size distribution characteristic of all sub-catchments of the Eden. Cases 2 to 5 are selected by changing the skewness and kurtosis of the distribution while maintaining the mean and sorting of the distribution to fall into the same description. One grain size parameter is changed in each case. In general, case 2 can be described as 'peaky', case 3 as 'less peaky', case 4 as 'coarser' and case 5 as 'finer', relative to case 1 parameter space, five parameters of the model were kept at their 'default' values due to the model's lack of sensitivity to variations in these factors in previous studies (Skinner et al., 2018;Ziliani et al., 2013). Also, five parameters (two for bedrock lowering and three for physical weathering) in the soil development component of the model were not included to further reduce the model complexity.
We acknowledge that the bedrock might have an impact on the model simulations, but since the focus of the current modelling is on the hydrological and geomorphic behaviour of the model during floods, we assume that the bedrock lowering and physical weathering play a minor role as compared with other factors. As a result, this study identified 14 parameters, which include one hydrologic parameter, three hydraulic parameters, five sediment parameters, two slope parameters and three vegetation parameters, plus the GSD cases as one of the parameters sampled within the GLUE analysis (  et al., 2007) and therefore has some precedence. A summary of these parameters and their sampling ranges is given in Table 3.

| Quantification of observational error
The starting point for setting LoA is to assess the uncertainty in the observed data that are being used for model evaluation. Uncertainties in input data (e.g. rainfall) could certainly be included to define a set of ensemble simulations. However, making an assessment of input error is limited by the simple representation of the rainfall in the model which could not fully account for the spatial aspects of the rainfall uncertainty. This study therefore focused on the errors in discharge and suspended sediment data arising from uncertainty in the stagedischarge rating curve and sediment load duration curve respectively, where clear evidence was available to quantify these data.

| Stage-discharge rating curve and observational uncertainty
Numerous methods (e.g. log-log linear regressions (Liu et al., 2009), envelope curves McMillan et al., 2010), Bayesian statistics (Moyeed & Clarke, 2005) and fuzzy rating curves (Pappenberger et al., 2006;Westerberg et al., 2011a)) have previously been used to estimate discharge rating curve errors (Kiang et al., 2018). Applying the most commonly used power function F I G U R E 2 Grain size distribution used for input into CAESAR-Lisflood. The uncertainty bound (grey) from 0.5 mm (+1ϕ) to 128 mm (À7ϕ) is based on the variability of the field measurements of all sub-catchments, while the envelope less than 0.5 mm is indicated by the ranges described in Wilson (1993). Case 1 (black line) represents the mean grain size distribution characteristic of all subcatchments of the Eden. Cases 2 to 5 are selected by changing the mean and kurtosis of the distribution while maintaining the skewness and sorting of the distribution to fall into the same description. One grain size parameter is changed in each case. In general, case 2 can be described as 'peaky', case 3 as 'less peaky', case 4 as 'coarser' and case 5 as 'finer', relative to case 1 [Color figure can be viewed at wileyonlinelibrary.com] T A B L E 3 Parameters, description and ranges used in the GLUE LoA uncertainty analysis The stage-discharge gauging data and observed uncertainty ranges estimated were fitted with the locally weighted scattering smoothing (LOWESS) method of Cleveland (1979). The LOWESS method provides an objective and empirical approach to curve estimation and associated uncertainties that requires no a priori assumption as to the form of the relationship. It is preferred in cases, such as those in this study, where the log-log relationship is non-linear and exhibits curvature (e.g. Hicks et al., 2000). The stage-discharge gauging data were first transformed to obtain a linear relationship through a log transform of gauge-height data and a Box-Cox transform of discharge data, corresponding to a more empirical form of the usually log-transformed stage-discharge power-law function (Moyeed & Clarke, 2005). The lambda parameter in the Box-Cox transformation was optimized to achieve the highest degree of linearity. The transformed stage-discharge data were then fit with LOWESS, where the LoA uncertainty bounds are computed from how well the estimated curve fits the population of stage-discharge gaugings.
Since the maximum stages in the observation period (1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011) are higher than those in the historical records at all gauging stations, extrapolation from the above LOWESS curves and uncertainty estimate fittings were needed. This was quantified by assuming a linear relationship from the upper tail of the LOWESS fittings with the same level of uncertainties in the log-log-transformed space. Whilst we recognize that such uncertainties might become relatively larger as discharge magnitudes are increased, we have no evidence to justify a different approach, so this simple extension approach was used. The extended LOWESS rating curves (Figure 3) were then used for the estimate of discharge in the observation period, and the resulting uncertainty bounds consequently defined the maximum and minimum discharge intervals for given stages. As a result, the percentages of the total time series in the observation period that were extrapolated in this form ranged from 3.5% to 17.5% for different gauging stations (Figure 3).

| Sediment load-duration curve
Similar to the stage-discharge measurements, a log transform of suspended sediment concentration and discharge data (Q-C F I G U R E 3 Uncertain rating curves for the five flow gauging stations in the River Derwent and Cocker catchment derived from the stagedischarge measurements. The black crosses represent the measured values, and the black solid lines indicate the fitted rating curve, while red and blue dashed lines represent the uncertainty limits for the fitted rating curve and the extrapolated part, respectively. The rating curve from the EA is also plotted in each flow gauging station for reference [Color figure can be viewed at wileyonlinelibrary.com] relationship) was first applied to find a linearized relationship, but this proved negative. This could be due to a number of well-documented reasons, for instance hysteresis effects, seasonal effects, antecedent conditions in the catchment and temporal change in vegetation cover (Asselman, 2000;Ferguson, 1987;Walling, 1977). In this regard, the sediment load-duration curve was developed to estimate the errors in the suspended sediment data. This adapts the concept of the flowduration curve that is commonly applied in hydrology and engineering fields, but in this case it indicates the percentage of time that given suspended sediment loads are likely to be exceeded. Suspended sediment loads (tonnes) were calculated by multiplying discharge data by suspended sediment concentrations. Since the sediment outputs in CAESAR-Lisflood were in m 3 , a further unit conversion was undertaken from tonnes to cubic metres by assuming the dry bulk density at source to be 1.3 tonnes/m 3 (Verstraeten & Poesen, 2001). Exceedance percentages from the sorted suspended sediment load values were then calculated based on the percentile values 100(0.5/n), 100 (1.5/n),…, 100([n À 0.5]/n), where n is the number of suspended sediment load values.
The accuracy of the sediment load-duration curve was limited by the availability and representativeness of the data, hence a bootstrap technique (Efron, 1979) was employed here to account for the uncertainty of the sediment load-duration curve induced by the concentration data. A total of 10,000 bootstrap samples were taken in this study, and the 95% confidence intervals of the set of samples defined the upper and lower observed LoA for the sediment load-duration curve. In general, the uncertainty bounds were the largest at the highload part of the sediment load duration curve but were progressively smaller towards low load (Figure 4). A scaled score (S) was calculated to define the deviation of the simulated results from both the observed discharge and sediment load data. The scaled score was calculated relative to the observed uncertainty from the LoA at each evaluated time step, given as

| Evaluation of model performance
where with R j being one of the j = 1,…, J accepted models with parameter set θ. The conditional probability was used to assess the model parameter identifiability and uncertainty.
To account for the ability of the model to capture the uncertainty limits of observed discharge and cumulative suspended sediment load, two measures were used to evaluate against the resulting 5th and 95th simulation bounds for each time step from the behavioural simulations. The first one is reliability (RM) (Equations 10 and 11), which calculates the overlap between the observed and simulated uncertainty bounds (Westerberg et al., 2011b). RM is calculated as the mean of the percentage of the overlapping range between the observation and simulation relative to the observation and relative to the simulation range, given as where Q and SS are discharge and suspended sediment load, Roverlap is the intersection between the simulated and observed ranges, Robs is the observed range and Rsim is the simulated range, and T is the number of time steps.
The second one is precision (PM) (Equations 12 and 13), which calculates the average percentage of the width of the overlapping range between the observed and simulated uncertainty bounds to the width of the simulated bounds for all time steps (Guerrero et al., 2013), given as The range for both measures is 0-100%. Guidance on a threshold of acceptable values for RM and PM is subjective, so we use these as a Another reason might be the artificial regulation of flow from Thirlmere Reservoir because this will particularly affect the magnitude of flood peaks, which is dependent on the reservoir level. The flow measured at the station 1 km downstream of the reservoir is affected by public water supply abstraction and also flood release regulation. Figure 6 illustrates how the behavioural ensemble is able to capture the sediment dynamics at the two monitoring sites. In addition, the cumulative suspended sediment loads were plotted (Figure 7) to examine whether the behavioural models were able to capture the timing and magnitude of the whole time series in terms of accumulation.

| Model performance to reproduce sediment yields
At both monitoring sites, the majority of the sediment loads at the low-load evaluation points were totally under-predicted at or greater than 10% exceedance ( Figure 6, left panel) and at or greater than 1% exceedance ( Figure 6, right panel) for the Derwent at Portinscale and for Newlands Beck at Braithwaite, respectively. This shows that the behavioural simulations did not generate any suspended sediment loads for ≥95% of the time but were able to capture the high-load part of the sediment load-duration curve during effective discharge events.
In Figure

| Behavioural ensembles for different diagnostics
Based on the behavioural simulations, the reliability (RM) and precision measures (PM) for the hydrological response were calculated and are summarized in Table 4. These reveal the overall ability of the behavioural simulations to capture the uncertainty limits of observed discharge for all 14 flood events (10 flood events for Glenderamackin   Table 3) in the River Derwent catchment. The conditional probability was calculated using equation 9 the homogeneous rainfall. The spatial rainfall aspects of the model were only evaluated recently, and their impact remains uncertain. Coulthard and Skinner (2016) demonstrated that the temporal and spatial resolutions of the rainfall data had a small impact on basin hydrology but generated larger changes in basin geomorphology.
Though subsequent simulations showed CAESAR-Lisflood simulations were sensitive to the choice of rainfall products, where hydrological changes were linear but non-linear differences were seen in sediment yields (Skinner et al., 2020). Therefore, the impacts of spatial rainfall input on basin discharge in larger basins like ours remains as a hypothesis to be tested. More in-depth analyses could be done to further investigate the impacts of rainfall input on model performance, but this is not within the scope of this study and future work could be done, for instance, to fully account for the rainfall uncertainties arising from the measurement errors and the spatial representation of the rain gauges. given the uncertainty spread in some of these relationships ( Figures A1 and A2). However, the behavioural simulations were still able to capture the high-magnitude low-frequency sediment events, which were possibly in terms of effective geographic events such as landslides or hydrological events such as flooding (Warburton et al., 2008). With the current focus on the model ability in simulating hydrological and geomorphic dynamics of the catchments during F I G U R E 9 Dotty plots of behavioural parameters for the 15 CAESAR-LISFLOOD parameters (the parameter names are explained in Table 3) in the Cocker catchment. The conditional probability was calculated using equation 9 floods, we argue that the model was able to provide realistic simulations and their associated uncertainties for subsequent analysis (e.g. as inputs for downstream flood analysis), given the model performance at catchment outlets showing better behaviours.
It is acknowledged that different sets of behavioural simulations were found when evaluating the model performance with different gauges within the catchment, different flood events and different types of observed data. There was no single behavioural simulation that could adequately and simultaneously reproduce both hydrological and sedimentological behaviours across different parts of the catchment. In many ways, this is not surprising because of the highly non-linear relationship between rainfall and discharge/sediment yield. This non-linear hydrology-sediment delivery response is fundamentally embedded into the sediment transport formulae in the model, such that the shear stress is proportional to the square of the flow velocity, and the flow velocity is non-linearly related to discharge as well . Given the uncertainties within the suspended sediment measurements and the stagedischarge rating curves, it remains a challenge to evaluate LEM performance in quantifying and assessing both at-a-point behaviour (e.g. sediment yield) and areal catchment response (e.g. streamflow).
We believe that addressing these uncertainties directly in model performance metrics is necessary, particularly as each type of data has its own error characteristics. Instead of assuming a good degree of data quality and using standard objective functions, we should be more explicit about quantifying data uncertainties when evaluating the model performance. While there are increasing numbers of research papers in the hydrological modelling literature considering observational uncertainties in model assessments, we advocate that similar awareness is needed in assessing the LEMs, and our study could be regarded as the beginning of a dialogue to deal with data uncertainties in LEM evaluation.
The LoA approach adopted in this study provided a flexible way to explore the interactions between observational uncertainty and parameter uncertainty. This is an important step looking at the uncertainty of the model performance rather than the sensitivity. Given the different nature and level of uncertainties in the observed data, the hydrologic parameter (m), Manning friction coefficient (n) and vegetation critical shear stress (Veg CriShear) were shown to be more identifiable for both catchments in this study, which aligned with the findings concluded by previous studies (Skinner et al., 2018;Welsh et al., 2009). Similar model performance by different behavioural parameter sets revealed that simulated results at a specific location could be driven by similar or totally different dynamics within the catchment (i.e. equifinality due to parameter uncertainties). This also implies that simulated results at a specific location of the catchment (e.g. catchment outlet) could only reflect the catchment behaviour of that specific location but do not necessarily provide the relevant information on other parts of the catchment (Skinner et al., 2018). Even with a large amount of available multi-proxy data, the reliability of the LEM was evaluated with the assumption of a good degree of data quality and low impact of other error sources (Gioia & Lazzari, 2019).
In this regard, our study applied for the first time the LoA approach within a GLUE uncertainty analysis framework to assess the reliability of a LEM with the consideration of observational and parameter uncertainties. This highlights a high potential of using such an approach in assessing different sources of uncertainty in LEMs.

| CONCLUSIONS
Despite their proven capabilities, the success of LEMs has been hampered by the lack of uncertainty investigations, in most cases because of the paucity of available data. Even when detailed evaluation data are available, the uncertainties inherent in observed data (e.g. discharge and suspended sediment load) could be largely owing to the unpredictability of the environmental time series and errors in collection techniques. Thus, a framework of the LoA approach in GLUE was adopted in this study, which enables parameter conditioning under the circumstances of highly uncertain data. This also allows for integration of different types of data in the parameter conditioning process by reflecting their quality. Here, we assessed the ability of hydrological and geomorphic uncertain observations in constraining catchment response and to explore the parameter identifiability in a landscape evolution model, CAESAR-Lisflood, using the River Derwent and Cocker catchments in the UK as an example.  (Wong et al., 2015). Finally, although the sophistication of landscape evolution models is rapidly evolving, evaluating their performance is often limited by the availability of longduration, high-resolution field datasets and spatial data. This paper is a first step to full-ensemble uncertainty evaluation of such models that reflects the challenges of constraining simulations with uncertain observations.

Jefferson Wong was funded by a University of Bristol Postgraduate
Scholarship. The authors thank the UK Environment Agency for providing the rain gauge and flow gauging station data. Thanks are also due to Dr. Jorge Ramirez for his efforts in collecting the grain size data and providing access to the data used in this study. The authors furthermore thank the two anonymous reviewers and Stuart Lane, the editor, for their invaluable comments, which greatly helped to improve the quality of the paper. Paul Bates was supported by a Royal Society

CONFLICT OF INTEREST
There is no conflict of interest in this paper.

DATA AVAILABILITY STATEMENT
Data used in this study are available from the corresponding author upon reasonable request.