Metamodeling uses computationally efficient models to emulate the outputs of complex models, trading off computational time against prediction accuracy and/or precision. Although metamodeling is potentially powerful, there is limited understanding of the uncertainty introduced by the metamodeling procedure. In particular, the errors associated with transformations of the predictions, such as aggregations during upscaling or differences between results used for impacts analysis, have not been explored in the metamodeling literature. We present an application of metamodeling that upscales physics-based model predictions to make catchment scale predictions of land management change impacts on peak flows. Two parallel sets of simulations are conducted, one with the original physics-based models and the other with metamodels. Despite good performance in emulating the local scale physics-based model simulations, once incorporated into a catchment scale model and especially once impacts of change are calculated, errors associated with the metamodeling procedure alone become significant, accounting for almost half of the prediction uncertainty in peak flows. The additional (metamodel-contributed) uncertainty is introduced both through biases in peak flows and through increases in peak flow variance. In the context of land management impacts, the results demonstrate the importance of tracking propagation of errors during upscaling, and of evaluating a model's ability to predict change as well as its ability to reproduce independent observations. Despite these errors, the predictions of land management impacts from both physics-based models and metamodels are broadly consistent with each other, and in accordance with expectations from the literature.
 There are a variety of water resource management problems where information about small-scale or nonlinear processes needs to be incorporated into computationally efficient decision support tools. These tools need to be able to rapidly assess a variety of management scenarios, taking into account uncertainty, in order to provide strategic policy guidance in reasonable timeframes. While there is a danger of overselling the capabilities of physics-based models [Beven, 1989; Woolhiser, 1996], there is generally a consensus (even among the strongest critics) that physics-based models have a role to play for some classes of prediction problems, such as the prediction of the impacts of land management or climate change. However, in addition to concerns about the validity of physics-based models for hydrological predictions [e.g., Beven, 1989; Beven, 2001; Ebel and Loague, 2006; Loague and VanderKwaak, 2004], their computational burden and challenges with model identifiability (due in part to their overparameterization) generally limit their suitability in a practical decision support role [Ratto et al., 2012].
 Notwithstanding these issues, if it is assumed that fine resolution physics-based models provide us with the best available representation of local scale responses under future land management change scenarios, the challenge then becomes how we can transfer this small-scale information to the catchment scale in a computationally efficient and sufficiently accurate way, and (recognizing the errors involved) one that allows quantitative uncertainty estimates to be made.
 One promising approach to this challenge is the use of metamodels. A metamodel can be broadly defined as a model of a model, characteristically more computationally efficient than the original model [Barton, 1998] and with fewer parameters. Beyond this general definition, the form of the metamodel in different applications varies depending on the nature of desired model output and the characteristics of the problem. When the output has only one relevant component (e.g., a statistic of peak flow estimates), the metamodel structure is typically a response surface approximation [Barton, 1998]; for example, in the form of regressions, splines, kriging, neural networks, and radial basis functions [Piñeros Garcet et al., 2006; Razavi et al., 2012a; Ballard et al., 2012]. In many cases, time series outputs from nonlinear models are temporally aggregated to produce the most relevant outputs. This has the advantage of suppressing some of the nonlinear response features, allowing the construction of simpler response surface approximations; for example, Forsman and Grimvall  took the average nitrate leaching over 30 years as predicted from a time series of leaching and then fitted a simple model to describe the aggregated response.
 Metamodels can be used as surrogates for complex physics-based models, allowing for more computationally intensive or time limited analyses to be performed, such as sensitivity analysis, model optimization, or real time decision support [Piñeros Garcet et al., 2006; Razavi et al., 2012a]. Metamodeling to predict time series is generally referred to as dynamic emulation [Castelletti et al., 2012a]. Possible approaches for dynamic emulation (with examples given from hydrological applications) include:
 1. Analytical derivations of simplified forms of the governing physics-based equations and their parameters [Wigmosta and Prasad, 2006; Martina et al., 2011].
 2. Maintaining the governing equations, but coarsening the time and space grids over which they are solved [Martina and Todini, 2008; O'Hagan et al., 1999], potentially with parameters derived based on emulation of finer grid simulations [e.g., Dunn and Mackay, 1996].
 3. Empirical fitting of a time series function to the output from the original numerical model to produce lower-order models that produce essentially the same outputs [e.g., Galelli et al., 2010]. This can include constraining the function to one that has some mechanistic interpretation [e.g., Young, 2001; Young and Ratto, 2009; Young, 2013].
 4. Maintaining the understanding that was used to develop the original model, but in the form of a simpler conceptual-type model, the parameters of which are then estimated by calibration to the original model outputs and whatever relevant observations may be available [Ewen, 1997; Wheater et al., 2008].
 Although there are significant computational benefits in using metamodeling, it is important that the implicit trade-off between efficiency and accuracy is understood and that modelers identify when metamodeling cannot provide a useful degree of accuracy [e.g., Razavi et al., 2012b]. This paper investigates the trade-offs associated with the fourth of the metamodeling approaches listed above in the context of predicting land use impacts on floods.
2. Problem Specification
 The nature of the problem to be solved plays an important role in the solution approach. In the case study used in this paper, the practical motivation is answering the question: “How does land use change impact on flooding?” Previous case studies [Wheater et al., 2008, 2012] concluded that metamodeling, used within a “what-if” scenario exercise [Castelletti et al., 2012a], may be a means of computationally efficient continuous and spatially explicit simulation of long (30+ years) flow series under multiple scenarios. The particular requirements in this context—looking at differences between scenarios, a spatially complex model, and the need to generate long time series in order to support extreme value frequency analysis—raise particular motivations for the use of metamodeling and particular challenges, which are expanded upon in this section.
Wheater et al.  used metamodeling as part of an upscaling procedure to examine the impacts of land use and land management change on flooding at the Pontbren catchment in Wales, UK. Their general approach, illustrated in Figure 1, divides the catchment into a number of runoff generating elements, which are each classified based on soil type and land management into a number of runoff classes. It is assumed that the runoff response is sufficiently similar within each class that the same model may be used, and that variability of response within any class may be considered random and representable by parameter uncertainty. For each runoff class, a physics-based model is developed, incorporating understanding and measurements of local hydrological processes and properties [Jackson et al., 2008], either from local observations, literature sources, or surrogate sites (sites of similar characteristics to those of interest, but outside of the catchment of interest). The outputs from these physics-based models are used to train simpler metamodels, which despite their simplicity remain consistent with understanding of local responses [McIntyre et al., 2011]. The metamodels are then incorporated into a semidistributed catchment model that routes local runoff through a channel network to the catchment outlet. This approach appears to provide an effective way of introducing small-scale process understanding in a computationally efficient way into catchment scale models [Jackson et al., 2008; Wheater et al., 2008; Wheater et al., 2012].
 The applications of Jackson et al. , Wheater et al. , and Wheater et al.  illustrate the importance, in this context, of explicit representation of the main spatial properties of land use and hydrology [Ewen et al., 2013]. The almost-infinite number of spatial patterns of land use change that may be of interest means that use of a spatially explicit hydrological model (i.e., a spatially distributed model) is essential. It is convenient to be able to build this spatially distributed model out of a set of metamodels, each one of a predefined spatial unit, and estimate impacts by taking the difference between model runs of two scenarios. In the usual approach to metamodeling [e.g., Razavi et al., 2012a; Castelletti et al., 2012a; and references therein], the output of interest is produced directly by a single run of a single metamodel; whereas in our case, a set of metamodels are run and their outputs are transformed first through aggregation through the stream network and then secondly by taking the difference between two scenario outputs. The nature of errors introduced by these transformations is central to understanding the applicability of metamodeling to land use scenario analysis and other problems involving scenarios of spatial change.
 The predominant source of uncertainty in metamodeling has been cited as the functional mapping of the complex model parameter space to the metamodel parameter space, based on sampled pairs of parameter sets [Young and Ratto, 2009]. The method used by us excludes this source of uncertainty by deriving metamodel parameter sets only at the sampled points in the complex model space, corresponding to the land use classes of interest. In other words, we do not attempt to map the parameter spaces, with the penalty of only being able to look at scenarios of change built up from different spatial combinations of the sampled land use classes. Although some progress toward such mapping was made as discussed in Ballard , this paper focuses on evaluating uncertainty arising from the approximations of the sampled points. The sources of uncertainty are demonstrated in Figure 1 and listed below:
 1. Classification of the runoff generating elements: it is assumed that the same physics-based model and metamodel can be used for all areas deemed to be within the same class.
 2. The identification and estimation of the structure and parameters of the physics-based models.
 3. The identification and estimation of the structure and parameters of the metamodel for each runoff generating element.
 4. The aggregation of elemental responses, and their errors, by the channel network model.
 5. The aggregation of errors that may occur when calculating the change in response between two scenarios.
 6. Errors associated with extrapolation of the metamodel over a range of conditions beyond those experienced within the training period (e.g., extreme rainfall events that might be observed when running for 30+ years).
 Although there are increasing efforts in the hydrological community to address uncertainties associated with (1) and (2) [e.g., Beven, 1993; Bulygina et al., 2011; Freer et al., 2004; Nandakumar and Mein, 1997; O'Hagan et al., 1999; Carrillo et al., 2011], in general, not enough attention has been given to all these sources of error in metamodeling applications [Razavi et al., 2012a]. In particular, most previous water resources applications of metamodels simply replace the complex models with their surrogates with the discrepancies between the two (i.e., source 3) ignored [Razavi et al., 2012a].
 For the purposes of this study, it will be assumed that the physics-based model simulations presented, which include multiple parameter set samples to account for uncertainty, are fit for purpose, i.e., the model structures are appropriate and the sample ranges represent the potential range of system responses (we therefore do not explore the influences of error sources 1 and 2 in this paper). Further, we will not make predictions of low frequency flood events, or evaluate the potential error associated with using the metamodel with forcing beyond the range of those used in the original emulation (error source 6), which is a general problem encountered when attempting to model extremes [e.g., Hall and Anderson, 2002; Dobler et al., 2012]. Hence, within this paper, we will focus on error sources 3–5, specifically aiming to address the following questions:
 1. How much additional uncertainty in flow simulations is introduced at the model element scale due to the metamodeling procedure?
 2. How does this uncertainty propagate through to the catchment scale?
 3. Is this uncertainty reduced or amplified when considering changes in flow due to land management changes?
 We will address these questions through a case study of the Hodder Catchment in north-west England examining the impacts of a number of upland land management changes on peak flows. The challenges encountered and the potential solutions offered have wider applicability in many environmental modeling applications.
 In the following sections, we describe in more detail the models and upscaling methodology, followed by a description of the methods that we have used in order to evaluate the metamodel performance. The general methodology originally developed by Wheater et al.  is shown in Figure 1. A more detailed methodology, specific to this application, with references to sections, figures, and tables within this paper is shown in Figure 2. Some of the details of the method are governed by the particular case study of predicting land management impacts in the Hodder Catchment, so first this case study is described.
3.1. Catchment Description and Land Management Scenarios
 The data used in this study are from a multiscale monitoring programme in the Hodder Catchment, collected as part of the United Utilities Sustainable Catchment Management Plan (SCaMP) programme [Ewen et al., 2008]. The data are from a subcatchment of the Hodder, defined by the Footholme flow gauge. The Footholme gauge drains a catchment area of approximately 25.3 km² that ranges in elevation from 544 m above sea level down to 180 m at the gauging station. Annual average precipitation in the catchment is approximately 1500 mm [Ewen et al., 2009]. Land use in the catchment is primarily agricultural, dominated by sheep and, to a lesser extent, cattle farming. Regions of the catchment also include commercial coniferous forestry, some parts of which have recently been clear felled and replanted. In the upper extents of the catchment, there are large areas of moorland and peatlands, which are managed for grouse and low intensity grazing.
 The Footholme catchment consists of three soil series: Belmont, Wilcocks, and Winter Hill (based on NSRI's NATmap [NSRI, 2011]). The Winter Hill series is a blanket peat deposit and is located on the low gradient hilltop plateaus. The Belmont soil series is an iron pan stagnopodzol [Thompson, 2007, p. 7] and typically forms on valley slopes adjacent to the blanket peat hill tops. The Wilcocks soil series is a cambic stagnohumic gley soil that typically forms in the valley bottoms. Land cover in the Footholme catchment is assessed based on the LCM2000 data [Fuller et al., 2002], supplemented through inspection of aerial photographs and local knowledge (e.g., to identify the locations of peatland drainage and blocked drains).
 Along with the baseline scenario (representative of the current land management regime), four alternative catchment land management scenarios are considered in this study:
 1. Complete coniferous afforestation of mineral soils.
 2. Extension of existing coniferous plantation near the Footholme gauging site.
 3. Returning all peatland blocked drains to functioning drains.
 4. Changing all peatland blocked drains to intact peatland.
 These scenarios involve changes to between 3 and 30% of the catchment area. Scenario 1 is an extreme forestry scenario (included to demonstrate maximum potential impacts), Scenario 2 is a more realistic forestry expansion scenario, and Scenarios 3 and 4 represent previous peatland management scenarios within the catchment. Comparisons of the effects of Scenarios 1–4 will be made against the baseline scenario. Scenarios 1 and 2 primarily involve changes from moorland to coniferous trees, and Scenarios 3 and 4 involve changes from blocked drains to drained or intact peatland, respectively. Figure 3 shows the spatial distribution of land management over the catchment for each of the scenarios.
 The Environment Agency (England) operates a calibrated flume at Footholme with data stored at 15 min intervals. The stage on a 1.7 km² tributary, Bre_sap, is monitored using a pressure transducer (Van Essen Divers [Ewen et al., 2010]). The stage is converted to discharge using a stage-discharge relationship developed using observations and extended for high flows based on physical reasoning [see Ewen et al., 2010]. Rainfall is measured by a tipping bucket rain gauge, and climate measurements (used to estimate potential evaporation) are made from an automatic weather station and a barometer within the catchment. There is a daily abstraction record upstream of the Footholme gauge that is used to naturalize the flows.
3.2. Runoff Generating Element Classification
 The procedure starts by dividing the catchment into a number of 200 m × 200 m grid cells, each representing a runoff generating element. Based on the soil classification and land management, runoff generating elements are grouped together into runoff classes. In total, there were 11 different runoff classes for the Footholme catchment, listed in Table 1.
Table 1. Summary of Runoff Classes Within the Footholme Catchment
Deciduous tree planting
Coniferous tree planting
Deciduous tree planting
Coniferous tree planting
Blocked drained peatland
3.3. Physics-Based Models
 The 11 classes of runoff generating elements require identification of 11 physics-based models. We have developed two different physics-based model structures: one to represent classes 1–8 in Table 1 and the other covering classes 9–11. Full details of the model development and testing are in Ballard  and Ballard et al. —only a summary is provided here.
 The first model represents mineral soils within the catchment with land uses of: moorland, grazed grassland, deciduous trees, and coniferous trees. The model is a two-dimensional hillslope model that couples Richards' equation for subsurface flow [Richards, 1931], the kinematic wave equation for overland flow [Singh, 1996], an adapted version of the Rutter model for interception [Rutter, 1975] and the Penman-Monteith equation for potential evaporation [Allen, 2006]. The model represents the different land uses through changes in parameter values. Both models have inputs of rainfall and potential evaporation, and outputs of runoff (both surface and subsurface).
 The second model is a quasi-3-D model of blanket peatlands that simulates differences in drainage management (drained, blocked drains, and intact peatland) for blanket peatlands. The model can represent a variety of open ditch drainage geometries through coupled one-dimensional models of subsurface, surface, and ditch flows (represented by the Boussinesq equation [Boussinesq, 1867] and kinematic wave equations, respectively). The different land management types are represented by model structural changes.
 Both physics-based models have a number of structural limitations and assumptions, such as simplified topography, homogeneity, and exclusion of macropore representations. However, these models were arrived at after many months of model development, testing, and refinement as reported in Ballard , and thus represent close to the best possible understanding of the relevant processes and efforts at physics-based modeling given the limited available data. The reader is directed to Ballard  and Ballard et al.  for further information about the development and testing of the models.
 Recognizing that there is variability within runoff classes and also uncertainty in the physics-based model parameter values, a Monte Carlo analysis framework was employed to generate an ensemble of runoff responses. As there were no suitable local scale runoff observations within the Footholme catchment to calibrate the physics-based models, the parameter values were estimated a priori. For each runoff class, 100 parameter sets were sampled, with sample ranges selected a priori based on data from a 5 m resolution digital elevation model, the NSRI soils database [NSRI, 2011] and various literature sources. Between runoff classes, the land management parameters (e.g., tree height, rooting depth, interception capacity) were independently sampled from the respective ranges, while each of the 100 samples of non-land management parameters (slopes, soil types) was uniform across classes. This meant that changes in land management class could be represented by pairs of parameter sets in which slope and baseline soil characteristics were constant. Likewise, the same 100 land management specific parameters (e.g., tree height, rooting depth, interception capacity) were applied across the different soil types. Details of the parameter ranges and sampling procedure can be found in Ballard et al.  and Ballard . All parameter sets are considered to be equally likely as there is no information to support any other interpretation. Simulations were conducted for a one year period from 1 December 2008 to 30 November 2009 with inputs of observed climate and rainfall, with outputs every 15 min.
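The paired sampling design described above can be sketched as follows. This is a minimal illustration only: the parameter names and ranges are hypothetical placeholders, not the values used in the study, and uniform sampling within each range is an assumption.

```python
import random

N_SETS = 100  # parameter sets per runoff class, as stated in the text

# Hypothetical ranges for illustration only (not the study's values).
SHARED_RANGES = {"slope": (0.02, 0.30), "soil_depth_m": (0.3, 1.5)}
LAND_RANGES = {
    "moorland": {"interception_capacity_mm": (0.5, 1.5)},
    "coniferous": {"interception_capacity_mm": (1.0, 3.0)},
}


def sample_uniform(ranges, rng):
    """Draw one value per parameter, uniformly within its (low, high) range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def build_ensemble(seed=0):
    """Return {land_use: [parameter set j for j in 0..99]}.

    Non-land-management parameters are sampled once and reused, so set j of
    every land use shares the same slope and soil values. Land management
    parameters are sampled independently for each land use.
    """
    rng = random.Random(seed)
    shared = [sample_uniform(SHARED_RANGES, rng) for _ in range(N_SETS)]
    ensemble = {}
    for land_use, ranges in LAND_RANGES.items():
        land = [sample_uniform(ranges, rng) for _ in range(N_SETS)]
        ensemble[land_use] = [{**shared[j], **land[j]} for j in range(N_SETS)]
    return ensemble
```

Because set j shares its slope and soil values across land uses, the difference between, say, the moorland and coniferous simulations for a given j reflects only the change in land management parameters.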
 The metamodel used in this application is a conceptual rainfall-runoff model, with a moisture accounting module coupled with two parallel routing stores. As the model is conceptual in nature, the model still incorporates some of the physical relationships inherent in the original physics-based simulations. The advantage of using this approach, rather than a purely statistical model, is that the structure retains the perceived key states and fluxes of the physics-based models [Razavi et al., 2012a]. Hence, we are more comfortable using the metamodel structure with driving inputs beyond the range of those used in the original physics-based model simulations; however, in this paper, we do not use wider ranging inputs, so cannot comment on the potential additional uncertainty that doing so may introduce, nor how this might compare against the uncertainty associated with extrapolation with a statistical model. Further, maintaining the state-space representation can assist in the interpretation of results as well as adding credibility for stakeholders and decision makers [Castelletti et al., 2012b].
 Although it is feasible to select a different metamodel structure for each runoff class, or even multiple structures for each class [e.g., Viana et al., 2009], for the purposes of simplicity, the metamodel with the best performance over all runoff classes was chosen. Using a single structure also has the advantage that changes in metamodel parameters following land management change can be directly quantified and then potentially be attributed back to changes in corresponding physical properties. It also allows for more straightforward sensitivity analysis to be conducted independent of the land use scenario, as demonstrated in Ewen et al. . Full details of the procedure employed to select an optimum metamodel structure are provided in supporting information.
 The final model structure selected is a combination of the Catchment Moisture Deficit model and three parallel linear reservoirs, shown in Figure 4. Mathematical descriptions of the relationships can be found in Evans and Jakeman  and Ballard . In the original model of Evans and Jakeman , the actual evaporation/potential evaporation (PE) ratio gradually reduces as the moisture deficit increases; this has been simplified so that evaporation is a constant fraction of PE. The Matlab numerical implementation of the model is from the Rainfall-Runoff Modeling Toolbox (RRMT) [Wagener et al., 2004].
 The time series inputs to the metamodel are rainfall and PE and the output is runoff at the element scale. The rainfall input time series is the same as that used in the physics-based models, while the same climate variables were used to estimate the PE. The PE is estimated for each of the modeled vegetation types (grazed, moorland on mineral soils, coniferous, deciduous, and peatland) using the Penman-Monteith equation with typical plant parameters for each land management type. The metamodel parameter estimation is described separately in section 3.5.
 The comparison between the physics-based models and the corresponding metamodels in terms of complexity and computational cost may be summarized as:
 1. While the physics-based models are computationally demanding, taking approximately 6–10 min per simulation year, the metamodels take approximately 2–6 s per simulation year.
 2. While physics-based models require estimation of between 10 and 40 parameters (depending on the specific soil and land management combination), the metamodels only require the estimation of 4–9 parameters.
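As a rough worked example using only the run times quoted above, the implied speed-up per simulation year can be bounded by pairing the fastest physics-based run with the slowest metamodel run, and vice versa. This pairing is an assumption made for illustration; the actual ratio for any given runoff class is not reported here.

```latex
\[
\frac{6\ \text{min}}{6\ \text{s}} = \frac{360\ \text{s}}{6\ \text{s}} = 60,
\qquad
\frac{10\ \text{min}}{2\ \text{s}} = \frac{600\ \text{s}}{2\ \text{s}} = 300 .
\]
```

On these figures, the metamodels run roughly 60 to 300 times faster than the physics-based models.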
3.5. Metamodel Parameter Estimation
 With the model structure selected, the optimal metamodel parameter set was estimated for each physics-based model realization and for each runoff generation class (giving a total of 1100 optimal parameter sets, with 100 physics-based model parameter sets being used for each of the 11 classes) for a 150 day period from 16 June 2009 to 13 November 2009. Tests indicated that lengthening the period to the full one year simulation period had limited influence on the identification of an optimal parameter set. Further, by training on only a subset of the physics-based model simulation period, some of the data remained available for a validation period. We found that there were only marginal differences in performance from the training period to the validation period. The estimation procedure for each of the 1100 physics-based model realizations consisted of: sampling 50,000 parameter sets using Latin hypercube sampling; running the metamodel for each set using the appropriate PE input time series; and identifying which set best replicated the physics-based model output runoff according to the chosen goodness of fit measure. The parameter sets were sampled from the ranges shown in Table 2, which were set based on guidance from the Rainfall-Runoff Modeling Toolbox (RRMT) user manual [Wagener et al., 2001].
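The sample, run, and select loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `toy_metamodel` is a hypothetical single linear store standing in for the actual metamodel, the parameter ranges are placeholders rather than those of Table 2, and a plain mean square error over the whole series stands in for the peak-based measure used in the study.

```python
import random


def latin_hypercube(n_samples, ranges, rng):
    """Latin hypercube sample: each parameter range is split into n_samples
    equal strata, one point is drawn per stratum, and the strata are shuffled
    independently for each parameter."""
    columns = []
    for lo, hi in ranges:
        points = [lo + (hi - lo) * (i + rng.random()) / n_samples
                  for i in range(n_samples)]
        rng.shuffle(points)
        columns.append(points)
    return list(zip(*columns))


def toy_metamodel(params, rain):
    """Hypothetical stand-in model: a fraction of rainfall enters a single
    linear store with residence time k (in time steps, k >= 1)."""
    k, frac = params
    store, flow = 0.0, []
    for r in rain:
        store += frac * r
        q = store / k
        store -= q
        flow.append(q)
    return flow


def mse(a, b):
    """Mean square error between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def calibrate(target, rain, ranges, n_samples=50_000, seed=0):
    """Return the sampled parameter set whose simulation best matches target."""
    rng = random.Random(seed)
    return min(latin_hypercube(n_samples, ranges, rng),
               key=lambda p: mse(toy_metamodel(p, rain), target))
```

In the study this search is repeated for each of the 1100 physics-based realizations, with the physics-based runoff series playing the role of `target`.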
Table 2. Sampling Ranges for the Metamodel Parameters
 Goodness of fit was based on peak flows and changes in peak flows under land management scenarios. For each physics-based simulation, the 10 rainfall events (r) that lead to the 10 largest runoff events are identified, giving a vector of peak flows, $q^{P}(r)$. The magnitudes of the peak flows for the same runoff events from the metamodel simulations, $q^{M}(r)$, are then extracted. The mean square error over the 10 sample events (MSEP) is used as a measure of the goodness of fit between the physics-based and metamodel peak predictions.
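The text does not give the formula explicitly; assuming the standard definition of a mean square error over the 10 events, with $q^{P}(r)$ and $q^{M}(r)$ denoting the physics-based and metamodel peak flows for event $r$, MSEP would be written as:

```latex
\[
\mathrm{MSE}_{P} \;=\; \frac{1}{10}\sum_{r=1}^{10}\left[\,q^{P}(r) - q^{M}(r)\,\right]^{2}
\]
```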
 It is obviously desirable that the metamodels replicate the responses of the physics-based models well. In the context of applications such as this, it is particularly important that the differences in flows between different land management simulations are also well replicated. If the metamodels were perfect matches of the physics-based models, this would not be an issue; however, as some error is introduced due to the metamodeling procedure, it is possible that these errors could be amplified when evaluating differences in flows. Further, initial tests with best fit parameter sets based on MSEP alone produced poor change predictions (not reported within this paper). Therefore, rather than selecting optimal metamodel parameter sets based on MSEP alone, an alternative procedure is used, as outlined below.
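A simple way to see why errors can be amplified in differences is the following sketch, which is an illustration under assumed notation rather than part of the original analysis. Suppose the metamodel peak flow under scenario $a$ equals the physics-based peak plus an error, $q^{M}_{a} = q^{P}_{a} + \varepsilon_{a}$, and likewise for scenario $b$, where $\varepsilon_a$, $\varepsilon_b$ have standard deviations $\sigma_a$, $\sigma_b$ and correlation $\rho$ (all hypothetical symbols). Then:

```latex
\[
\left(q^{M}_{a} - q^{M}_{b}\right) - \left(q^{P}_{a} - q^{P}_{b}\right) = \varepsilon_{a} - \varepsilon_{b},
\qquad
\operatorname{Var}\!\left(\varepsilon_{a} - \varepsilon_{b}\right)
  = \sigma_{a}^{2} + \sigma_{b}^{2} - 2\rho\,\sigma_{a}\sigma_{b}.
\]
```

If the two errors are strongly positively correlated they largely cancel in the difference; if they are uncorrelated the error variance of the difference is the sum of the two variances, and it is typically being compared against a change signal that is much smaller than the flows themselves.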
 Consider the jth (of 100) set of physics-based simulations of the peatland model, which includes time series for the drained, intact, and blocked land management scenarios. This set can be characterized individually by the vectors of peak flows, $q^{P}_{d}$, $q^{P}_{i}$, and $q^{P}_{b}$ (where the superscript indicates that they are values derived from the physics-based models, and subscripts indicate the land management scenario: drained, intact, or blocked). MSEP provides a measure of goodness of fit between the metamodel parameter sets and these vectors. However, this set of simulations also includes the differences in peak flows, $\Delta q^{P}_{d,i}$, $\Delta q^{P}_{d,b}$, and $\Delta q^{P}_{i,b}$, where $\Delta q^{P}_{d,i}$ is defined as $q^{P}_{d} - q^{P}_{i}$.
 Taking the 20 metamodel parameter sets that “best fit” (based on MSEP) each of the individual time series, there are $20^{3}$ (8000) potential combinations of the differences in peak flows between drained, intact, and blocked scenarios. Selecting 20 sets is arbitrary, but it ensures that peak flows for each single land use type are well simulated. A parameter set that contains metamodel parameters for all of the land management scenarios for a given soil type is referred to as a total metamodel parameter set. In the peatland case, where there are three potential land management scenarios, there are 3 × 8 parameters in the total metamodel parameter set, where 8 is the number of parameters in the metamodel.
 For each of the $20^{3}$ potential total metamodel parameter sets, the three vectors of differences in metamodel peak flows between pairs of scenarios (drained and intact, drained and blocked, intact and blocked) are calculated and compared against the corresponding physics-based predictions using a mean square error. The average (in this case across three estimates) of the mean square errors of these difference vectors is calculated (referred to as MSEΔP). The total metamodel parameter set that minimizes MSEΔP is taken as the final metamodel parameter set to represent the jth set of physics-based simulations of the peatland model. The same process is conducted for each of the 100 physics-based simulations and for each of the soil types; hence, the total number of parameter sets remains the same from the physics-based model simulations through to the optimal metamodel simulations.
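The two-stage selection described above (shortlist by MSEP, then choose the combination that minimizes MSEΔP) can be sketched as follows. The function and argument names are hypothetical, and the inputs are assumed to be precomputed vectors of peak flows rather than full time series.

```python
from itertools import product


def mse(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def diff(a, b):
    """Element-wise difference between two peak flow vectors."""
    return [x - y for x, y in zip(a, b)]


def select_total_set(physics_peaks, candidate_peaks, n_best=20):
    """physics_peaks: {scenario: peak flow vector from the physics-based model}.
    candidate_peaks: {scenario: list of peak flow vectors, one per sampled
    metamodel parameter set}.
    Returns ({scenario: index of chosen candidate}, minimum MSE of differences).
    """
    scenarios = sorted(physics_peaks)
    # Stage 1: shortlist the n_best candidates per scenario, ranked by MSE_P.
    shortlist = {
        s: sorted(range(len(candidate_peaks[s])),
                  key=lambda i: mse(candidate_peaks[s][i], physics_peaks[s]))[:n_best]
        for s in scenarios
    }
    # All unordered scenario pairs (three pairs when there are three scenarios).
    pairs = [(a, b) for n, a in enumerate(scenarios) for b in scenarios[n + 1:]]
    best_combo, best_score = None, float("inf")
    # Stage 2: exhaustive search over n_best ** len(scenarios) combinations.
    for combo in product(*(shortlist[s] for s in scenarios)):
        chosen = dict(zip(scenarios, combo))
        score = sum(
            mse(diff(candidate_peaks[a][chosen[a]], candidate_peaks[b][chosen[b]]),
                diff(physics_peaks[a], physics_peaks[b]))
            for a, b in pairs
        ) / len(pairs)
        if score < best_score:
            best_combo, best_score = chosen, score
    return best_combo, best_score
```

With three scenarios and `n_best=20`, the inner loop evaluates 8000 combinations, which is cheap because only stored peak vectors are compared and no model is rerun.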
3.6. Catchment Semidistributed Modeling Procedure
 The 200 m × 200 m runoff generating elements are modeled in a semidistributed model framework, using the RRMT-SD software [Orellana et al., 2008]. The runoff from each element is routed through a channel network (derived based on topography) to the catchment outlet. Each channel element routes flow through a single linear reservoir. The flow inputs are the outputs from upstream channel elements and the outputs are the flow from the reservoir combined with the contribution of locally generated runoff from the given grid cell. The same routing model is used for both the physics-based and the metamodel catchment simulations. Although there is scope to extend the metamodeling to the channel network simulation, this was omitted from the current experiment.
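A single channel element of this kind can be sketched as below. The function name, the explicit time stepping, and the unit time step are assumptions made for illustration and are not taken from the RRMT-SD implementation; `k` plays the role of the storage coefficient.

```python
def route_element(upstream_inflow, local_runoff, k, dt=1.0):
    """Route upstream inflow through one linear reservoir and add local runoff.

    The reservoir outflow is proportional to storage (outflow = storage / k).
    Locally generated runoff is added to the routed outflow, as described in
    the text. The simple explicit update used here requires k >= dt.
    """
    if k < dt:
        raise ValueError("k must be >= dt for this explicit scheme")
    storage = 0.0
    outflow = []
    for q_up, q_local in zip(upstream_inflow, local_runoff):
        storage += q_up * dt          # upstream flow enters the store
        q_res = storage / k           # linear reservoir release
        storage -= q_res * dt
        outflow.append(q_res + q_local)
    return outflow
```

Elements can be chained by passing the output of one call as the `upstream_inflow` of the next element downstream, for example `route_element(route_element(q_in, local_1, k), local_2, k)`.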
 Two categories of storage coefficient for the linear reservoirs are defined, one each for the major channel network (KA) and the minor channel network (Ka). The distinction between major channel network and the minor channel network elements is assessed based on a critical contributing area (Acrit). It is assumed that the channel routing parameterization is independent of land management change, thus neglecting impacts of land management change on channel roughness and geometry, through properties and processes such as sediment transport, woody debris, and channel bank vegetation. One approach could be to calibrate the routing parameters for each of the 100 catchment simulations to reflect the observed values; however, in this case, the calibrated routing parameters would have been dependent on the element runoff and, for those sets that poorly reflect reality, would have compensated for this in their selection. An alternative approach might have been to have a single routing parameter set, but selecting this would also not have been straightforward, and would have undoubtedly underestimated the catchment scale prediction uncertainty.
 In order to ensure that routing parameters are independent of the variability in runoff for each grid cell, and also represent the parametric uncertainty in the routing scheme, the routing parameters were randomly selected and assigned for each catchment simulation. For the catchment land management scenario applications in the following sections, 100 sets of the routing parameters were randomly sampled from the ranges shown in Table 3 and were randomly assigned to each of the 100 catchment simulations. The ranges were selected through a trial and error process such that performance in prediction of field observations was independent of the routing parameters. Thus, uncertainty in the semidistributed model is represented by the variability in the stream routing parameters and through the uncertainty in runoff generation elements (represented by the 100 samples for each runoff class).
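 A minimal sketch of the sampling step is given below. The numerical ranges are placeholders, not the values of Table 3, and uniform sampling is an assumption, since the sampling distribution is not stated here.

```python
import random

# Hypothetical ranges for illustration only; the actual ranges are in Table 3.
EXAMPLE_RANGES = {"K_A": (0.5, 2.0), "K_a": (2.0, 10.0), "A_crit": (1.0, 10.0)}


def sample_routing_parameters(ranges, n_sims=100, seed=0):
    """Draw one routing parameter set per catchment simulation.

    Each parameter is sampled uniformly (an assumption) and the sets are
    shuffled so their assignment to simulations is random.
    """
    rng = random.Random(seed)
    samples = [
        {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        for _ in range(n_sims)
    ]
    rng.shuffle(samples)
    return samples


routing_sets = sample_routing_parameters(EXAMPLE_RANGES)
```

 Because the routing sets are drawn without reference to the element runoff samples, pairing the ith routing set with the ith catchment simulation keeps the two sources of uncertainty independent.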
Table 3. Sample Range for Catchment Routing Parameters
3.7. Evaluating Loss of Information in the Metamodeling and Upscaling Procedure
 In order to evaluate the uncertainty introduced by the metamodeling procedure, two sets of catchment scale simulations of flow were conducted, first with runoff generation predicted by the physics-based models and then by the metamodels, for each of the catchment land management scenarios described in section 3.1. The goodness of fit is assessed by calculating the normalized root mean square error (nRMSE, which is the RMSE normalized by the range of the physics-based model predictions), the bias, the ratio of variances, and the nRMSE with corrections for (a) bias (nRMSEb) and (b) both bias and differences in variance (nRMSEb,σ). These measures are calculated for both the mean peak flow and the change in mean peak flow (respectively, the mean across the 10 rainfall events of the vectors q(r) described in section 3.5 and the percentage difference of these means).
 The goodness of fit in steps (2) and (3) is assessed first by calculating the normalized root mean square error (nRMSE) over the 100 simulations for each runoff class, comparing the mean peak flow prediction of the physics-based models with the mean peak flow prediction of the metamodels.
 Systematic errors in the predictions are assessed by calculating the prediction bias (equation (2)) and the ratio of the physics-based model standard deviation to the metamodel standard deviation of the mean peak flow over the 100 simulations (σP/σM). Positive bias indicates that the metamodels underpredict the peak flow predictions of the physics-based models, and when the ratio of variances is greater than one, the variance of the ensemble of simulated peak flows has increased following metamodeling.
 The nRMSE is recalculated first with a bias correction (nRMSEb, equation (3)) and then with both a bias correction and a standard deviation correction (nRMSEb,σ, equation (4)). Improvements in nRMSEb compared to the original nRMSE suggest that the bias is systematic and that prediction accuracy is changed due to metamodeling. Improvements in nRMSEb,σ compared to nRMSEb suggest that the prediction precision is changed due to metamodeling. The same measures are calculated for the change in mean peak flow.
 The ensemble means used in these corrections are the mean across all 100 samples of the mean peak flow prediction for the physics-based models and the corresponding mean for the metamodels.
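 The goodness of fit measures described above can be sketched as follows. This is a sketch under stated assumptions, not a transcription of equations (1) to (4): the bias is taken as the physics-based ensemble mean minus the metamodel ensemble mean (so that positive values mean underprediction by the metamodels), the variance ratio is taken as metamodel variance over physics-based variance (the reading consistent with how the ratio is interpreted in the results), and the corrections are applied as a shift and a rescaling of the metamodel ensemble. All function names are hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev, pvariance


def nrmse(p, m):
    """RMSE between ensembles, normalized by the range of the physics-based values."""
    rmse = sqrt(fmean((a - b) ** 2 for a, b in zip(p, m)))
    return rmse / (max(p) - min(p))


def bias(p, m):
    """Assumed sign convention: positive when the metamodels underpredict."""
    return fmean(p) - fmean(m)


def variance_ratio(p, m):
    """Assumed orientation: metamodel variance over physics-based variance."""
    return pvariance(m) / pvariance(p)


def nrmse_b(p, m):
    """nRMSE after shifting the metamodel ensemble by the bias."""
    b = bias(p, m)
    return nrmse(p, [x + b for x in m])


def nrmse_b_sigma(p, m):
    """nRMSE after matching both mean and standard deviation."""
    mean_p, mean_m = fmean(p), fmean(m)
    scale = pstdev(p) / pstdev(m)
    return nrmse(p, [mean_p + (x - mean_m) * scale for x in m])


# Toy ensembles (hypothetical): the metamodel is biased low and over-dispersed.
physics = [1.0, 2.0, 3.0, 4.0, 5.0]
meta = [-2.0, 0.0, 2.0, 4.0, 6.0]
```

 For these toy ensembles the bias is 1.0 and the variance ratio is 4.0. Correcting for bias alone lowers the nRMSE, and correcting for both bias and spread removes the error entirely because the two ensembles are linearly related.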
 For decision making purposes, it is likely the ensembles will be used to determine a median prediction and upper and lower uncertainty bounds. As such, the specific performance in predicting any one of the 100 model simulations may not influence the management decision should the two ensembles be the same. Hence, the significance of any differences between the physics-based and metamodel distributions (for raw, bias corrected, and bias and variance corrected metamodel predictions) is evaluated using a two-sample Kolmogorov-Smirnov test, where the null hypothesis is that the two alternative model structures make predictions from the same distribution.
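 For illustration, a two-sample Kolmogorov-Smirnov test can be sketched in pure Python as below. The p-value uses the standard asymptotic series with a small-sample adjustment to the statistic, which is an assumption of this sketch and not necessarily the variant used in the study; in practice a library routine such as scipy.stats.ks_2samp could be used instead.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_two_sample(x, y):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest gap between the two empirical CDFs; p is an
    asymptotic approximation, so it is only indicative for small samples.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m) for v in xs + ys
    )
    en = sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series converges poorly here, and p is 1 to within about 1e-5.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Hypothetical ensembles of 100 members each.
same_d, same_p = ks_two_sample(range(100), range(100))
diff_d, diff_p = ks_two_sample(range(100), range(200, 300))
reject_at_95 = diff_p < 0.05
```

 The null hypothesis of a common distribution is rejected at the 95% confidence level when p falls below 0.05, as in the second toy comparison.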
4.1. Element Scale
 Figure 5 is an example of the hydrographs produced by the best fit parameter sets for the Belmont soils, using an arbitrary parameter set out of the 100 samples. Overall, the metamodels appear to emulate the physics-based model flow time series well, with maximum, median, and minimum Nash-Sutcliffe efficiencies across all 1100 simulations of 0.96, 0.83, and 0.65, respectively. It is important to note that the error in the metamodel predictions, although apparently small, is of similar magnitude to the differences between the land management scenarios.
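 For reference, the Nash-Sutcliffe efficiency quoted above compares a simulated series against a reference series, here taken to be the physics-based model flow with the metamodel flow as the simulation. A minimal sketch, with a hypothetical function name and toy data:

```python
from statistics import fmean


def nash_sutcliffe(reference, simulated):
    """1 minus the ratio of squared error to the variance of the reference."""
    mean_ref = fmean(reference)
    squared_error = sum((r - s) ** 2 for r, s in zip(reference, simulated))
    spread = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - squared_error / spread


# Toy series (hypothetical flows).
reference_flow = [1.0, 2.0, 3.0, 4.0]
perfect = nash_sutcliffe(reference_flow, reference_flow)
mean_only = nash_sutcliffe(reference_flow, [2.5] * 4)
```

 A value of 1 indicates a perfect match, and a value of 0 indicates that the simulation is no better than the mean of the reference series.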
 Figure 6 shows plots of the mean peak flow predicted by the physics-based models against that predicted by the metamodels for the Belmont, Wilcocks, and Winter Hill soils, respectively, as well as a summary of the goodness of fit statistics for the predictions of mean peak flow. nRMSE values shown in bold are those for which the null hypothesis of equality between the physics-based and metamodel distributions could not be rejected at the 95% confidence level. Without the bias correction, the null hypothesis was only rejected for the Wilcocks grazed scenario. This is primarily due to a systematic overprediction; following a bias correction, the null hypothesis could no longer be rejected. Bias correction produced the most significant improvements in nRMSE for the Winter Hill soil series (where metamodels consistently underpredicted physics-based simulations), and variance correction produced the most significant improvement for the Belmont soil series (where metamodels consistently had a greater standard deviation than the physics-based simulations). Figure 6 also demonstrates the uncertainty due to the physics-based models (which is up to ±50% of the median predictions). The magnitude of the error due to the metamodeling procedure relative to the uncertainty in the original physics-based models is captured by nRMSE, due to the normalization by the range of the physics-based model predictions, and also in the ratio of the variances, where a value of 2 would indicate that the metamodeling procedure had introduced the same amount of uncertainty as the physics-based models (at least in terms of the ensemble spread).
 The same performance comparison was conducted for the difference in the mean peak flows between each of the land management types. Figure 7 shows plots of this difference as predicted by the physics-based models against the difference predicted by the metamodels for the Belmont, Wilcocks, and Winter Hill soils, as well as a summary of the metamodel element scale prediction performance for the difference. By calculating differences (as opposed to considering the time series independently), the variance ratios for the mineral soils have all dramatically increased, particularly for the Grazed-Moorland pairing, where the metamodel variance is 4.14 times greater than the equivalent physics-based model variance (compared to a maximum ratio of 1.24 for the independent predictions). Even with the bias correction, due to the greatly increased variance, the null hypothesis of equality between the physics-based and metamodel distributions was still rejected for the distributions of Grazed minus Moorland for both mineral soil types. The null hypothesis of equality could not be rejected for all other soil type/land use combinations following bias correction. Although the Grazed to Moorland pairings show the worst nRMSE, this is related to the small predicted range of the physics-based models for this pairing compared to other land management pairs, as the range is used for normalization of the RMSE.
4.2. Catchment Scale
 Figure 8 shows an extract from the simulation period of the physics-based and metamodel prediction ensemble hydrographs in comparison to the observed hydrograph at the catchment outlet. In general, the flow peaks are well predicted; however, there is a tendency for low flows to be underpredicted. This is not entirely surprising, given that the focus of all stages of the modeling has been to replicate flow peaks. The differences between the ensembles predicted by the physics-based and metamodels are not immediately obvious. On close inspection, differences in the low flows and recession periods can be seen, with the metamodel simulations typically having steeper recessions and lower low flows. There is also a tendency for the metamodels to make slightly lower predictions of the peak flows compared to the physics-based model predictions.
 Figure 9 provides a summary of the goodness of fit of the metamodel predictions to the physics-based mean peak flow at both Footholme and Bre_sap for the baseline and Scenarios 1–4. In all cases, the nRMSE is smaller than that of the most equivalent element scale scenario (for Scenarios 1 and 2, Belmont moorland to coniferous; for Scenario 3, Winter Hill blocked to drained; for Scenario 4, Winter Hill blocked to intact). This is most likely related to the fact that the scenarios only simulate land management change for a small proportion of the catchment (3–30% of the total area). A strong bias is observed in the small peatland catchment (Bre-sap), with metamodels consistently underpredicting the mean peak flow, although the same bias is not observed at the larger Footholme catchment.
 The metamodel catchment scale prediction performance for the change in mean peak flow is shown in Figure 10. In all cases, the nRMSEb,σ is larger than that of the most equivalent element scale scenarios (e.g., Figure 7); the catchment scale application has amplified the errors in the prediction of change. Based on information from the plot scale scenarios, it would be predicted that changing from blocked to intact peat should lead to a reduction in peak flows. However, the catchment simulations for Bre-sap with runoff generated directly from the physics-based models predict the opposite behavior (baseline minus Scenario 4). It is postulated that this is related to the distribution of land management change within this small catchment, which all takes place close to the gauging point. When rapid runoff generation occurs near a gauging location, there is potential for reductions in flow peaks, as the locally generated flows can desynchronize from the main flow peak of the catchment, allowing the locally generated peak flows to pass the gauging location prior to the arrival of the catchment peak flow. Flows are resynchronized when the runoff generation near the outlet is slowed (in this case changing from blocked to intact); this appears to be a likely reason for the prediction of increased peak flows. The same behavior does not occur in the catchment scale predictions with element runoff generated from the metamodels; this is presumably because the metamodeling procedure reduced the difference in predicted runoff between these two land management types at the element scale (in particular underpredicting peak flows for the blocked scenarios, which is also associated with a delay in the peak arrival time), and hence the same degree of desynchronization is not observed for the baseline scenario (with the blocked drains). This issue highlights the potential importance of the distribution of land management change, the significance of peak flow timing, and how the routing network can act to inflate errors in element scale runoff prediction.
 Table 4 provides a comparison of the predictions of reduction in peak flows for the four catchment scale land management simulations using both physics-based and metamodel local runoff generation. Results are given in terms of the minimum, median, and maximum reductions over the 100 ensemble members. These ranges demonstrate the uncertainty in the predictions of change in peak flows associated with the modeling procedure described within this paper. In general, the median reductions are well maintained between the two modeling methods; however, the errors introduced in the metamodeling process lead the metamodels to predict greater median reductions for deciduous trees compared to coniferous trees (results for scenarios with changes in deciduous trees are not listed in the table), and a decrease in peak flows following reversion of blocked peatland to intact peatland. Both predictions are opposite to those of the physics-based models. However, the absolute error that leads to these discrepancies (approximately 3% in the predicted change) is small. The general picture given by Table 4 is that significant information is lost and uncertainty introduced when moving from the physics-based to the metamodel predictions of change at the catchment scale; however, given the large uncertainty in the physics-based model and the (generally) close agreement of the medians, the same conclusions about land management impacts are likely to be reached.
Table 4. Minimum, Median, and Maximum Reductions (%) in Mean Catchment Peak Flow Predicted by Both Physics-Based Models and Metamodels for Land Management Scenarios 1–4
Note: Values in bold indicate ensembles where the reduction in median is statistically different from zero.
 This paper presented a procedure for training metamodels to replicate physics-based model runoff responses at an element (200 × 200 m) scale, and upscaling the responses using a semidistributed catchment model. The performance of the metamodels was evaluated not only in terms of emulating the time series and peak flows from the physics-based model but also by their ability to emulate impacts of land management change. By running a semidistributed model with runoff generated (1) by the physics-based models and (2) by the metamodels, it was possible to evaluate how and where the error introduced by the metamodeling procedure is inflated once the metamodels are used to generate runoff in the catchment model. Tracking the errors introduced by the metamodels at each of the different stages provides information for interpreting the source of the differences between the predictions (and uncertainty bounds) for the physics-based models and metamodels shown in Table 4. While a small number of metamodel performance evaluations have previously been done (see examples in Razavi et al. [2012a]), the evaluation of the errors in predicting change and of how metamodel errors aggregate during upscaling are new contributions to the literature. Furthermore, these two sources of error are not exclusive to metamodeling, and hence this study has applicability to upscaling and change modeling in general.
 The metamodels were successful in terms of their first requirement of significantly reducing computational time, with run times of the metamodels on average 99% faster than the corresponding physics-based models (although this is based only on run time, and does not account for the significant upfront investment of time in the development of the metamodels). However, their success in terms of accurate representation of the original physics-based models was more variable. Small uncertainties in the metamodel predictions at the element scale can inflate once differences in peak flows between scenarios are calculated and also once the models are applied as local scale runoff generation within the semidistributed model. For the forestry scenarios, increases in variance of both the mean peak flow and the change in mean peak flow (and hence total prediction uncertainty) were the main error types, whereas for the peatland scenarios, systematic bias was the main error type; the differences in error type are unique to the specific pairing of physics-based model and metamodel. The error type also changes as the local scale runoff is implemented in the semidistributed model; this is due to a combination of the distribution of the land management change in space, error compensations, and the smoothing influences of the channel routing.
 The catchment scale predictions of change based on metamodels are highly uncertain. The predictions summarized in Table 4 show that up to 60% of the uncertainty in the predicted changes in flows (calculated as the percentage difference in the ranges between the upper and lower uncertainty bounds) is due to errors introduced by the metamodeling process (as opposed to the uncertainty introduced due to the wide a priori parameter ranges for the physics-based models), although this amount varies between scenarios. If the performance of the metamodel parameter sets had only been assessed based on the ability to replicate the individual element scale simulations (e.g., Figure 4, which suggests good replication of the physics-based hydrographs), then it may not have been immediately apparent that the metamodeling process was introducing so much uncertainty to the model predictions. Given the large investment in field data needed to improve the physics-based models, the most immediate improvements that could be made to the entire upscaling strategy would be to investigate ways to transfer information more effectively from the physics-based models to the metamodels.
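 One possible reading of the percentage calculation mentioned above is sketched below. The formula is an assumption: it takes the share of the metamodel ensemble range that exceeds the physics-based ensemble range, which is only one way to interpret "percentage difference in the ranges". The bounds in the example are hypothetical and are not taken from Table 4.

```python
def metamodel_share_of_uncertainty(physics_bounds, metamodel_bounds):
    """Percentage of the metamodel range in excess of the physics-based range.

    Each argument is a (lower, upper) pair of uncertainty bounds on the
    predicted change. The formula is an assumed interpretation.
    """
    physics_range = physics_bounds[1] - physics_bounds[0]
    metamodel_range = metamodel_bounds[1] - metamodel_bounds[0]
    return 100.0 * (metamodel_range - physics_range) / metamodel_range


# Hypothetical bounds on predicted peak flow reduction (%).
share = metamodel_share_of_uncertainty((1.0, 5.0), (-2.0, 8.0))
```

 With these illustrative bounds the physics-based range is 4 and the metamodel range is 10, so 60% of the metamodel range would be attributed to the metamodeling step under this reading.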
 Once an optimized model structure(s) is identified, improvements could be made to parameter identification strategies. When trialing and selecting an appropriate strategy, it is important to note the parameter sets that optimize runoff prediction may not necessarily optimize the prediction of change [Ewen et al., 2006]. This was addressed in the Hodder case study procedure by incorporating prediction of change, averaged over all possible scenarios, in the fitting criterion. Performance for any subset of changes (e.g., scenarios involving only drain blocking) might have been improved by including only that type of change in the criterion, although this would lead to different models of the same runoff class and would be incoherent in that sense. This raises a more general issue of approaches to predict impacts of change, where common strategies involve conditioning on “before” and “after” observations, or proxy or paired catchment observations [Bulygina et al., 2011; O'Connell et al., 2004; O'Connell et al., 2007] and then using pairs of conceptual model predictions to calculate change without explicitly paying attention to the ability of the approach to calculate change. This can be particularly hazardous when the magnitude of change is expected to be small (as is the case with the land management scenarios presented within this paper).
 We also recognize that there may be other opportunities to improve results through refinements to the metamodel fitting procedures. Future research could investigate the use of alternative (potentially multiobjective) optimization strategies (e.g., by exploring the multidimensional Pareto front, as in Gupta et al. and Wagener et al.), rather than the simple sampling-based approach used here. The optimization strategy used in this study does not take into account the problem of “equifinality” [Beven, 2001] for each individual calibration, where multiple metamodel parameter sets can give equally good approximations of the physics-based models. Methods could therefore have been employed to select and carry forward multiple parameter sets to represent each physics-based model simulation [e.g., Beven, 1993]. Further, toward more justifiable uncertainty analysis, alternative parameter estimation strategies could also be investigated, such as using a formalized Bayesian framework [e.g., Bulygina et al., 2012], although this would require an error model to be specified for the physics-based model outputs.
 The differences between the element scale results and the catchment scale results suggest that for some land management changes (e.g., Figure 10, Scenario 4: Bre_Sap), the spatial distribution of change and the configuration of the drainage network can be very important when evaluating changes in peak flows and in controlling model error. For land management changes that are less strongly spatially distributed, predicted directions of change were the same for the physics-based model, metamodel, and element scale predictions. In most cases, the median predictions of peak flow change for each catchment scale scenario were reasonably predicted by the element scale predictions scaled by the fraction of area under land management change. Given the potential influences of the channel network demonstrated at the catchment scale, further research is still required in order to investigate how the transfer of local scale uncertainty is influenced by the routing procedures [e.g., O'Donnell et al., 2011; Ewen et al., 2013].
 Another potential avenue toward reducing uncertainty is improvements to the metamodel structure identification. Improvements in metamodel structure could potentially be made by identifying structures using techniques that do not require an a priori model structure definition yet maintain physical plausibility, for example, through data-based mechanistic modeling [Young, 2001; Young and Ratto, 2009], or through a bridging of data-based and conceptual model identification techniques [Young, 2013]. Following the extensive evaluation described in the supporting information of this paper, a single metamodel structure was applied to all runoff classes. Alternative approaches that may improve the prediction performance include retaining several metamodel structures for each runoff class [Viana et al., 2009] or identifying and employing the best metamodel structure for each individual runoff class to reduce issues related to mis-specification. This could be particularly important where inputs are more extreme than those experienced in the training period. However, in the Hodder case study, the differences in element-scale performance between structures were small compared to the uncertainty in the physics-based models. Future research could explore the uncertainty introduced through selecting a single metamodel structure through comparison of the simulations presented in this paper against a case where the “best possible” metamodel is used for each runoff class.
 Ideally, metamodel parameters could be estimated by a functional mapping between the physics-based parameter space and the metamodel parameter space, derived from the existing pairs of parameter sets [Young and Ratto, 2011]. This would allow a wider range of runoff classes to be modeled without extra physics-based model runs, using measurable physical properties. However, attempts at mapping between parameter spaces using linear regression (not reported in this paper) showed limited success. The fact that no successful parameter mapping could be achieved means that every single element type to be included in the model needs a physics-based model to be run, limiting the general applicability of the metamodeling and upscaling procedure beyond the application at the Footholme catchment. Given the importance of the parameter mapping for the ongoing use of a given metamodel, future research efforts could explore alternative techniques to achieve more reliable maps, such as methods described in Castelletti et al. [2012a] and Razavi et al. [2012a].
 The specific magnitudes of change presented within this paper for the different land management scenarios should be considered as preliminary, as they may be highly dependent on: (1) the model assumptions, (2) the averaging procedure used over events of different characteristics, and (3) the specific characteristics of the Hodder catchment. They have been included primarily to demonstrate the potential predictions and associated uncertainty bounds that could be produced by the modeling methodology described within this paper.
 This paper described and critically evaluated a procedure for predicting hydrological impacts of land management change using a metamodeling and upscaling procedure. The metamodeling approach reduced the computation time required by the original physics-based model by 99% while successfully replicating the direction and approximate magnitude of impacts on peak flows, although significant effort was required to develop these metamodels. It also replicated observed peak flows to within an impressive tolerance. However, the procedure introduced significant uncertainty over and above that stemming from the original physics-based model, particularly when considering differences between flow scenarios. This additional uncertainty is associated mainly with aggregation of errors when upscaling. In some cases, there was also some notable bias in the predictions. Improved precision and accuracy could probably be achieved by more investment in the physics-based model and metamodel identification exercises. However, a significant trade-off between accuracy and/or precision and cost of implementation is likely to remain. This value judgement needs to be made by the users of these tools, and investment in increasing metamodel accuracy is only likely to be made should lower uncertainty predictions be required in order to inform a decision making process.
 This research was funded by the UK Flood Risk Management Research Consortium Phase 2, EPSRC grant EP/F020511/1.