Assessment of hydrological model predictive ability given multiple conceptual geological models

Authors


Corresponding author: D. Seifert, Water Management, ALECTIA A/S, Teknikerbyen 34, DK-2830 Virum, Denmark. (dos@alectia.com)

Abstract

[1] In this study six hydrological models that only differ with respect to their conceptual geological models are established for a 465 km2 area. The performances of the six models are evaluated in differential split-sample tests against a unique data set with well-documented groundwater head and discharge data for different periods with different groundwater abstractions. The calibration results of the six models are comparable, with no model being superior to the others. However, the six models make very different predictions of changes in groundwater head and discharges as a response to changes in groundwater abstraction. This confirms the utmost importance of the conceptual geological model for making predictions of variables and conditions beyond the calibration situation. In most cases the observed changes in hydraulic head and discharge are within the range of the changes predicted by the six models, implying that a multiple modeling approach can be useful in obtaining more robust assessments of likely prediction errors. We conclude that the use of multiple models appears to be a good alternative to traditional differential split-sample schemes. A model averaging analysis shows that model weights estimated from model performance in the calibration or validation situation in many cases are not optimal for making other predictions. Hence, the critical assumption that is always made in model averaging, namely that the model weights derived from the calibration situation are also optimal for model predictions, cannot be assumed to be generally valid.

1. Introduction

[2] Predictions of hydrological changes due to changes in catchment or climate conditions have long been recognized as a major challenge [Klemes, 1986]. With the need to predict the effects of various human-induced measures to mitigate the increasing pressure on water resources and with the need to predict effects of future climate changes and possible adaptation options, this challenge will become even more important in the future. The two main problems in this respect are that data to test the predictive capability of our models are most often lacking for such situations and that assessments of model predictive uncertainties for such situations are fundamentally difficult [Refsgaard et al., 2006].

[3] According to the principles for testing of hydrological models under different conditions with respect to data availability and stationarity of catchment conditions outlined by Klemes [1986], a model's capability to predict conditions different from those in a calibration period should be evaluated by a so-called differential split-sample test. Several examples of such validation tests have been reported [Kuczera et al., 1993; Refsgaard and Knudsen, 1996; Refsgaard et al., 1998; Donnelly-Makowecki and Moore, 1999; Seibert, 2003].

[4] The terminology of model testing is disputed. Oreskes et al. [1994] argue that verification and validation of environmental models is impossible, because natural systems are open. In the present paper we use the terminology of model validation as a test of a site-specific model's ability to predict specified variables for specified geographical locations within given accuracy levels [Refsgaard and Henriksen, 2004], i.e., in a strictly conditional sense rather than the universal sense as discussed by Oreskes et al. [1994].

[5] When there are no data available for model validation tests, which is often the case when models are intended to be used for, e.g., prediction of land use change or climate change impacts, model structure uncertainty is recognized by many authors to be the main source of uncertainty and hence a multiple modeling approach is recommended [Beven, 2002; Neuman, 2003; Bredehoeft, 2005; Refsgaard et al., 2006; van der Linden and Mitchell, 2009; Vagstad et al., 2009]. Recent studies have confirmed that the importance of conceptual uncertainty originating from model structure uncertainty relative to parameter uncertainty increases with the degree of extrapolation beyond the calibration base [Højberg and Refsgaard, 2005; Rojas et al., 2010].

[6] As multiple model predictions are typically made for situations where no observational data are available, the predictions can seldom be tested in rigorous validation tests. We are only aware of one such study [Troldborg et al., 2007], where four hydrological models, differing only in the conceptual geological models, were calibrated against hydraulic head and discharge data and subsequently used to predict concentrations of environmental tracers for which field observations were available for evaluation of the model predictions. An interesting research question, which has received little attention in the literature so far, is to what extent the multiple model predictions will be able to encapsulate the reality as assessed by field data.

[7] A limitation of the multiple modeling approach is that the results from the individual models are not aggregated into an optimal model prediction. In many scientific disciplines the use of multimodels and model averaging based on a weighting of the individual models has therefore become increasingly popular, as it will often produce robust predictions compared to a single model prediction [Cavadias and Morin, 1986; Poeter and Anderson, 2005; Sánchez et al., 2009; Winter and Nychka, 2010; Diks and Vrugt, 2010]. Bayesian model averaging (BMA) [Hoeting et al., 1999; Neuman, 2003; Rojas et al., 2008] is a theoretically comprehensive and computationally demanding framework that not only produces optimal predictions, but also allows estimation of the total prediction uncertainty. Less sophisticated model averaging techniques, such as equal weights averaging (EWA), Bates-Granger averaging (BGA), and Granger-Ramanathan averaging (GRA) have in some cases turned out to be as good as the computationally much more demanding BMA method [Diks and Vrugt, 2010]. In the case of BMA, the model weights are calculated based on a combination of prior belief and the model performance during the calibration, while only the model performance is included for the simpler averaging techniques such as BGA and GRA. A critical assumption in all cases is that one model performing better than another model during the calibration will also perform better for prediction. This assumption is questionable, particularly for situations that represent predictions beyond the calibration base, which is the typical application of the multiple model approach. We are not aware of studies that have tested this assumption against field data that are collected during periods where the stresses on the hydrological system have been significantly different.

[8] In the present study a unique data set with well-documented groundwater head and discharge data for different periods with different groundwater abstractions is utilized to test the performance of six hydrological models that only differ with respect to their conceptual geological model. The objectives of the study are to assess the predictive capabilities of multiple models for differential split-sample validation tests compared to their respective performances during calibration, and to assess to what extent model weights based on model calibration performance are representative also for predicting conditions different from those in the calibration.

[9] While the study results have implications for the assessment of prediction uncertainties, it is beyond the scope of the present paper to fully address prediction uncertainty, including stochastic hydrogeological and Bayesian modeling aspects. Similarly, the purpose of the model weighting analysis is not to derive an optimal averaging scheme, but is restricted to evaluating the commonly used assumption that the model performance on data from the calibration situation also reflects the relative goodness of performance for simulations under different conditions.

2. Methodology Outline

[10] The following approach has been adopted in the present study:

[11] 1. Establish six conceptual geological models. Five different geological models are used for the same catchment area. The models are developed by different (hydro)geologists. Two of the models are based on the Zealand part (7000 km2) of the national water resources model of Denmark, two models are based on regional models covering the Roskilde County (1500 km2), and one is a local model (330 km2). In the calibration process parameterization of the local model is handled in two different ways, resulting in two conceptual models.

[12] 2. Establish six hydrological models. The six geological models are converted into six hydrological models for a 465 km2 area with the same input data, boundary conditions, etc., using the MIKE SHE model code [DHI Water and Environment (DHI), 2009a, 2009b].

[13] 3. Model calibrations. The hydraulic parameters of each model are calibrated against observations of groundwater head and stream discharge for the period 2000–2005 using the PEST parameter optimization tool [Doherty, 2004, 2008].

[14] 4. Split-sample validation test. The six models are validated against observations of groundwater head and stream discharge for another period (1995–1999) with comparable boundary conditions.

[15] 5. Differential split-sample validation test. The six models are used for a simulation of changes in groundwater head and stream discharge for two other periods (1990–1993 and 3 months in 2007), where the groundwater abstraction in the area is significantly changed compared to the calibration period (2000–2005). These are tests of the models' capabilities to make predictions outside of the calibration base.

[16] 6. Model averaging. Finally, three different methods of model averaging (EWA, BGA, and GRA) are tested and compared with observations for the differential split-sample test cases. The EWA, BGA, and GRA schemes were chosen because the objective was to test the assumptions of model performance on three different methods, and it was outside the scope of the present study to analyze the importance of prior beliefs as can be done by Bayesian model averaging.
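As a rough illustration of the three averaging schemes named in step 6, the sketch below computes EWA, BGA, and GRA weights in plain Python. It uses common textbook forms (equal weights, weights proportional to the inverse error variance, and unconstrained least-squares weights). The exact variants used in the study are not given in this section, so the function names, the use of the mean squared error as the variance estimate, and the toy series are all assumptions.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ewa_weights(sims):
    """Equal weights averaging: every model gets weight 1/K."""
    return [1.0 / len(sims)] * len(sims)


def bga_weights(sims, obs):
    """Bates-Granger averaging: weights proportional to 1 / error variance.

    The error variance is approximated by the mean squared error (assumption).
    """
    inv_var = []
    for sim in sims:
        mse = sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs)
        inv_var.append(1.0 / mse)
    total = sum(inv_var)
    return [v / total for v in inv_var]


def gra_weights(sims, obs):
    """Granger-Ramanathan averaging: unconstrained least-squares weights."""
    k = len(sims)
    xtx = [[sum(p * q for p, q in zip(sims[i], sims[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * o for p, o in zip(sims[i], obs)) for i in range(k)]
    return solve_linear(xtx, xty)


def average(sims, weights):
    """Weighted combination of the model simulations, one value per time step."""
    n_steps = len(sims[0])
    return [sum(w * sim[t] for w, sim in zip(weights, sims)) for t in range(n_steps)]


# Hypothetical head changes (m) from two models and matching observations.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [2.0, 1.0, 4.0, 3.0]
observed = [1.2, 1.8, 3.4, 3.5]
for name, w in (("EWA", ewa_weights([model_a, model_b])),
                ("BGA", bga_weights([model_a, model_b], observed)),
                ("GRA", gra_weights([model_a, model_b], observed))):
    print(name, [round(x, 3) for x in w])
```

Weights estimated this way on calibration data can then be applied to the simulations for another period with `average`, which is the step whose validity the study sets out to test.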

3. Study Area

[17] The study area is located in the central part of Zealand, Denmark (Figure 1). The land surface elevation varies from 0 m above sea level (m asl) at the Køge Bay in the east and at Roskilde Fjord in the north to 90–100 m asl at the ridge in the western part of the model area, and to 60–70 m asl at the northeastern boundary. The model area is 465 km2, of which 88% is land and the rest is sea. The focus area of the study is located in the Langvad Stream valley system and is defined as a buffer zone of 3 km around the well fields located in the catchment area of Langvad Stream (see Figure 1).

Figure 1.

Model area of the Langvad Stream catchment area with land surface elevation, streams, abstraction wells, and location of the west-east geological cross-section in Figure 2. The clusters of wells along the streams are the main well fields.

3.1. Climate and Hydrology

[18] The mean precipitation in the area varied in the period 1990–2005 between 570 and 960 mm yr−1, with an average of 760 mm yr−1, and the potential evapotranspiration was ∼600 mm yr−1. Based on results from the national water resources model of Denmark [Højberg et al., 2008] the actual evapotranspiration in the area was, on average, estimated to 500 mm yr−1. This results in an average net precipitation of 260 mm yr−1.

[19] The mean daily discharge in 1990–2005 at the downstream gauging station in Langvad Stream was 900 L s−1 (160 mm yr−1). In the summer months (June–August) the mean discharge was only 180 L s−1 at the downstream station, and the median of the annual minimum flow was 35 L s−1.

3.2. Geology and Hydrogeology

[20] The geology in the area consists of formations from the Upper Cretaceous, Paleocene, and Pleistocene. Deposits from late Paleogene and Neogene were eroded during the Quaternary [Japsen et al., 2002] and are therefore absent in the area. Upper Cretaceous deposits consist of Maastrichtian chalk, which has a thickness of up to 2000 m on large parts of Zealand. The chalk surface dips toward the west and is overlain by Danian Limestone. The limestone surface is found 50–150 m below ground surface (m bgs.) and the limestone has a thickness of ∼100 m in the western part of the study area, whereas the thickness decreases toward the east, where the formation is absent near the coast. In the valleys of Langvad Stream the depth to the limestone surface is as little as 20–30 m bgs. as a result of erosion of the overlying Quaternary deposits.

[21] The formations Greensand and Kerteminde Marl from Selandien are found on top of the limestone in the western part of the area. The Marl and the Greensand often form an inter-bedded system with highly variable thicknesses extending up to ∼50 m. Because of the Quaternary erosion, the Selandien formations are absent in some areas. The Pre-Quaternary sequence is disturbed by a fault zone striking south from Roskilde Fjord. Deposits from the Weichselian glaciation cover the entire area. The Quaternary deposits are described as interchanging and discontinuous layers of clayey moraine till and fluvial sand; see the different conceptual geological models in Figure 2.

Figure 2.

Geological cross-sections, west-east, for the alternative hydrostratigraphical models. The location of the cross-section is indicated in Figure 1. The clay unit in the models covers both the moraine till and the Kerteminde Marl located on top of the limestone in the western part of the area.

[22] The tectonic and glacial activities have resulted in the formation of fractures in the upper part of the carbonate rocks. Hence, the Maastrichtian Chalk, the Danian Limestone, and the Selandien Greensand are potentially fractured [Bonnesen et al., 2009]. The fracture density and fracture apertures are expected to decrease with depth, and therefore only the upper part of the Pre-Quaternary deposits is characterized by high hydraulic conductivities. The carbonate rocks serve as the most important aquifers in the area, whereas the Quaternary sand formations act as secondary aquifers. In most of the model area the carbonate aquifers are confined by the Quaternary till deposits except where the Kerteminde Marl is present, in which case it forms a lower confining unit.

[23] The erosion of the Quaternary deposits forming the valleys of Langvad Stream has also influenced the flow direction of the aquifers. The main flow direction is from the heights west and east of Langvad Stream toward the valley, and in the valley the flow direction is toward the Roskilde Fjord. Steep eastward gradients in the potentiometric surface west of the Langvad Stream system are caused by north-south faults in the limestone aquifer. In the eastern part of the model area the groundwater flows to Køge Bay.

[24] Groundwater in the area is mainly abstracted from the well fields connected to Lejre Waterworks, which supplies Copenhagen with drinking water. Hence, most of the abstracted water is exported out of the catchment. The well fields are located along the streams (see Figure 3). The total abstraction in the model area has decreased from ∼50 mm yr−1 (23 million m3 yr−1) in 1990 to ∼35 mm yr−1 (16 million m3 yr−1) in 2002–2005. About 90% of the water is abstracted from the limestone aquifer.
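The paired figures above (a volume per year and an equivalent depth per year) can be checked with a simple unit conversion. The sketch below assumes that the depth is obtained by dividing the abstracted volume by the full 465 km2 model area; the function name is illustrative only.

```python
MODEL_AREA_KM2 = 465.0  # model area stated in the text


def volume_to_depth_mm(volume_m3_per_yr, area_km2=MODEL_AREA_KM2):
    """Convert an annual volume (m3/yr) to an areal depth (mm/yr)."""
    area_m2 = area_km2 * 1.0e6          # 1 km2 = 1e6 m2
    return volume_m3_per_yr / area_m2 * 1000.0  # m -> mm


for period, volume in (("1990", 23.0e6), ("2002-2005", 16.0e6)):
    print(f"{period}: {volume_to_depth_mm(volume):.1f} mm/yr")
```

Both values land within about 1 mm yr−1 of the rounded figures quoted in the text.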

Figure 3.

Location of observation wells and discharge stations used for calibration (2000–2005) and validation (1995–1999) of the models. Names and location along the streams of the main well fields at Lejre Waterworks and location of selected wells in Figure 8 are indicated.

[25] A net precipitation of 260 mm yr−1, a stream discharge of 160 mm yr−1 from the main stream system and up to 200 mm yr−1 for other stream catchments, and a groundwater abstraction of 35–50 mm yr−1 for the total model area and 65–90 mm yr−1 for the focus area result, on average, in an outflow from the model area of ∼10–35 mm yr−1.

3.3. Alternative Hydrostratigraphical Models

[26] Based on available borehole logs, the hydrogeology in the model area has been described and alternative geological models developed with 3–12 hydrostratigraphical layers (Table 1). The two simplest geological models with 3–5 hydrostratigraphical layers, models R1 and R2, are extracted from regional models. The model with seven hydrostratigraphical layers, model L1/L2, is a local model of the catchment area, and the models with 11–12 hydrostratigraphical layers, models N1 and N2, are based on the national models. All geological models consist, from the bottom to the top, of Paleocene Limestone, Paleocene Clay, and a Quaternary unit. In the more complex geological models, the Quaternary unit is divided into several alternating sand and clay layers. In some of the models the Paleocene Limestone is divided into two layers (Selandien Greensand and Danian Limestone), and in a few models a layer of Upper Cretaceous Chalk (Maastrichtian) forms the bottom of the geological package.

Table 1. Geological Models of the Langvad Stream Catchment Area

| Name | R1 | R2 | L1/L2 | N1 | N2 |
| --- | --- | --- | --- | --- | --- |
| Year | 2002 | 2003 | 2005 | 1998 | 2006 |
| Hydrostratigraphical layers | 3 | 5 | 7 | 11 | 12 |
| Numerical layers in model | 3 | 5 | 7 | 9 | 10 |
| Geological units | 10 | 17 | 72 | 11 | 8 |
| Calibration parameters | 7 | 10 | 10/10 | 8 | 4 |
| Reference | [Roskilde Amt, 2002] | [Roskilde Amt, 2003] | [Københavns Energi, 2005] | [Henriksen et al., 1998] | [Højberg et al., 2008] |

[27] In the N1 and N2 models each hydrostratigraphical layer in the Quaternary zone contains a single geological unit with uniform hydraulic properties, and therefore uniform parameter values. In the remaining three models the hydrostratigraphical layers are divided into geological units with varying properties, e.g., the limestone layer is divided into zones with low, mean, and high hydraulic conductivities, and the Quaternary layers contain units with both sand and clay. The number of geological units for each model is listed in Table 1.

[28] All geological models have been used to construct hydrological models, which have been subject to calibration. The hydraulic conductivities (K) of the original hydrological models vary significantly for the different hydrogeological units. The horizontal K-values for Quaternary sand span from 10^−6 to 10^−3 m s−1, and the vertical K-values for Quaternary clay span from 10^−10 to 10^−8 m s−1. The upper and often more fractured clay can have Kh-values of up to 10^−5 m s−1. The effective hydraulic conductivity of limestone is also significantly dependent on the fracture intensity and can vary from 10^−9 to 10^−3 m s−1.

[29] In spite of the similarities in the conceptualization and representation of geological structures in the hydrostratigraphical models, the five geological models used in this study differ considerably from each other. The vertical location of the limestone surface, the proportion of sand and clay in the Quaternary unit, and the location and extent of the sand aquifers all differ significantly between the geological models. This is illustrated in Figure 2, which shows a west-east cross section through the models.

[30] The initial distributions of the hydraulic conductivity (K) of the Quaternary units and the limestone aquifer are taken from the original hydrological models. The distribution of K-values in the limestone in the original models is determined in three very different ways: (1) interpolation based on transmissivities inferred from pumping tests or specific yield assessments in wells (N1 and N2); (2) zonation primarily based on geological interpretation of the type of limestone, i.e., Selandien Greensand, Danian Limestone, and Maastrichtian Chalk (R1 and R2); and (3) zonation primarily based on maps of head observations where areas with high and low hydraulic gradient have been identified (L1/L2). In models N1 and N2, the hydraulic conductivity is calibrated by changing a factor by which the interpolated K-field is multiplied. Figure 4 illustrates the differences in distribution of the hydraulic conductivity in the limestone aquifer in the hydrostratigraphical models. The thickness of the numerical layer(s) containing limestone varies from 10 to 50 m.

Figure 4.

Distribution of estimated hydraulic conductivity in the limestone aquifer.

[31] The L1/L2-model contains a high number of geological units. Hence, the risk of over-parameterization is large, and to avoid subsequent problems with parameter estimation a simplification of the geological model is necessary. Two alternative methods have been used to simplify the original model, in which the geological units are categorized into 12 and 11 groups, respectively. Within each group the parameter values are tied together during calibration with a constant ratio equal to the ratio of the parameter values from the original model. The two resulting models are denoted L1 and L2 in the following:

[32] L1: Two groups of geological units are defined in each model layer where all high-permeable zones are categorized in one group and all low-permeable zones are categorized in the other group. This results in a total of 12 geological units, as the K-values in layers 6 and 7 are identical.

[33] L2: The parameter zones are tied together into 11 groups depending on the value of the hydraulic conductivity in the original model, regardless of layers and geology, e.g., sand and limestone may appear in the same group.
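The tying described for L1 and L2 can be pictured as one adjustable multiplier per group, applied to the original K-values so that the ratios between units within a group stay fixed. The sketch below is a minimal illustration under that reading; the unit names, K-values, and group labels are hypothetical and are not taken from the original model.

```python
# Hypothetical original K-values (m/s) for a handful of geological units.
original_k = {
    "sand_a": 1.0e-4,
    "sand_b": 5.0e-4,
    "clay_a": 1.0e-9,
    "clay_b": 4.0e-9,
}

# Hypothetical grouping in the spirit of L1: high- versus low-permeable zones.
groups = {
    "high_permeable": ["sand_a", "sand_b"],
    "low_permeable": ["clay_a", "clay_b"],
}


def apply_group_multipliers(original_k, groups, multipliers):
    """Scale every unit by its group's multiplier, preserving in-group ratios."""
    updated = {}
    for group, units in groups.items():
        for unit in units:
            updated[unit] = original_k[unit] * multipliers[group]
    return updated


# During calibration only the group multipliers would be adjusted.
calibrated = apply_group_multipliers(
    original_k, groups, {"high_permeable": 2.0, "low_permeable": 0.5}
)
print(calibrated)
```

With this layout the optimizer sees one free parameter per group rather than one per geological unit, which is how the grouping reduces the risk of over-parameterization.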

4. Model Setup

[34] An integrated hydrological model is set up for the catchment area. The different geological models are subsequently incorporated into this model. This section describes the elements of the base model.

[35] The integrated hydrological model of the catchment area is based on the national water resource model developed by the Geological Survey of Denmark and Greenland [Højberg et al., 2008], where the submodel covering the entire island of Zealand is used. The hydrological model is constructed using the MIKE SHE/MIKE 11 modeling software [DHI, 2009a, 2009b], which is an integrated hydrological modeling system. MIKE SHE includes modules for evapotranspiration, overland flow, land use, a two-layer description of the unsaturated zone, saturated zone, drains, and pumping wells. The river model MIKE 11 links to MIKE SHE, so that water can be exchanged between streams, the groundwater aquifers, and drains. In MIKE 11 the streams are defined by location of branches, stream cross-sections used for the river stage and discharge relation, boundary conditions such as point sources, inflows and water levels, and a riverbed leakage coefficient.

4.1. Discretization

[36] The model covers an area of 465 km2 (see Figure 1). Using a uniform cell size of 200 m, the model area is discretized into 162 × 127 cells. Within the model boundaries ∼12,000 cells are active for each model layer, resulting in 36,000–120,000 active cells in the different models with 3–10 computational layers. The vertical discretization of the saturated zone follows the hydrostratigraphical units interpreted in the different geological models, but in two of the models (N1 and N2) the three top hydrostratigraphical layers are described by one numerical layer.
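The cell counts quoted above follow from simple arithmetic on the grid dimensions. The sketch below reproduces them; the variable and function names are illustrative, and the rounded figure of 12,000 active cells per layer is taken from the text rather than computed from the model boundary.

```python
CELL_SIZE_M = 200.0
N_COLS, N_ROWS = 162, 127
MODEL_AREA_KM2 = 465.0

cell_area_km2 = (CELL_SIZE_M / 1000.0) ** 2          # area of one 200 m cell
grid_cells_per_layer = N_COLS * N_ROWS                # full rectangular grid
approx_active_cells = MODEL_AREA_KM2 / cell_area_km2  # cells inside the model area


def total_active_cells(n_layers, active_per_layer=12_000):
    """Total active cells for a model with the given number of numerical layers."""
    return n_layers * active_per_layer


print(f"grid cells per layer: {grid_cells_per_layer}")
print(f"active cells per layer (area based): {approx_active_cells:.0f}")
print(f"3 layers: {total_active_cells(3)}, 10 layers: {total_active_cells(10)}")
```

Only a little over half of the rectangular grid falls inside the model boundary, which is why the active count per layer is well below 162 × 127.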

[37] The model is set up in transient mode with time-varying input data for the period 1990–2008. Time series with daily values of precipitation (10 km grid), potential evapotranspiration (20 km grid), and temperature (20 km grid) are used in the model. The abstraction at the main well fields in the focus area is described using weekly data, whereas the abstraction in the rest of the wells is resolved on a yearly basis. The maximum time step for MIKE SHE is 24 h and the time step for MIKE 11 is 12 h. MIKE SHE dynamically changes the time step during the simulation in response to stresses on the system. The simulation time varies from 1 to 3 h for the different models.

4.2. Boundary Conditions

[38] The model area has been defined to obtain robust boundary conditions (no flow and constant head) for most of the boundary and avoid possible interferences from the model boundaries on the model results. The northern model boundary at the Roskilde Fjord and the eastern boundary at Køge Bay (Figure 1) are specified as constant head boundaries in the top layer (h = 0 m), and for the lower layers the coastal boundary is defined as no-flow ∼1.5–2 km from the coast. The remaining boundaries follow flow lines from the water divides in the secondary Quaternary sand aquifers in the southwest and northeast and are defined as no-flow boundaries for all layers. A contour map of the observed mean hydraulic head in the limestone aquifer shows an outflow from the model area over the southern boundary. The boundary condition for this boundary is defined as a gradient boundary with a gradient of 3.5 m km−1, which is derived from the contour map of the hydraulic head. This corresponds to an outflow of ∼10 mm yr−1 and accounts for 20% of the total surface and subsurface outflow to the sea. The outflow across the model boundary is compared with the model results from a regional model of Zealand.

[39] The water divide and flow lines in the sand aquifer do not precisely follow the topography and the catchment areas of the streams, so the headwaters of the stream flowing out of the model area are cut off (see Figure 1). The amount of water involved is assumed to be small compared to the total water balance.

[40] The abstraction from wells close to the model boundary is small compared to the abstraction at the main well fields, and changes in these will not affect the boundary significantly. Observations show that the hydraulic head >3 km from the main well fields is unaffected by changes in the abstraction at the main well fields. The boundary conditions are presumed to be valid for the entire simulation period.

4.3. Streams and Drains

[41] Major streams are represented in the model by the MIKE 11 module. The stream water is routed downstream through the stream systems using the Manning equation to compute the relation between the water level and stream discharge. Minor streams, trenches, and drains in, e.g., agricultural fields are represented in MIKE SHE as drains in the entire model area. The drain depth is specified as 0.5 m below ground surface. Close to the coast, where the land surface elevation is lower than 0.5 m asl, the drains are located at the land surface. Each drain cell is provided with a code defining which part of the stream system or outer boundary the drain water is routed to.
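To make the stage-discharge relation concrete, the sketch below evaluates the Manning equation in SI units, Q = (1/n) A R^(2/3) S^(1/2), for an idealized rectangular cross section. It is not a description of how MIKE 11 implements routing; the channel shape, roughness coefficient, width, and slope are hypothetical values chosen only for illustration.

```python
def manning_discharge(depth_m, width_m, slope, manning_n):
    """Discharge (m3/s) in a rectangular channel from the Manning equation (SI units)."""
    area = width_m * depth_m                   # flow area A
    wetted_perimeter = width_m + 2.0 * depth_m
    hydraulic_radius = area / wetted_perimeter  # R = A / P
    return (1.0 / manning_n) * area * hydraulic_radius ** (2.0 / 3.0) * slope ** 0.5


# Hypothetical stream reach: 4 m wide, slope 0.001, Manning n = 0.035.
for depth in (0.25, 0.5, 1.0):
    q = manning_discharge(depth, width_m=4.0, slope=0.001, manning_n=0.035)
    print(f"water depth {depth:.2f} m -> discharge {q:.2f} m3/s")
```

In a real cross section the area and wetted perimeter would come from the surveyed stream geometry rather than from a rectangle.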

4.4. Unsaturated Zone and Land Use

[42] The unsaturated zone is described by the two-layer module in MIKE SHE. The method focuses on the water balance and calculates the evapotranspiration and the infiltration to the saturated zone. The thickness of the layers depends on the root depth (RD) and the water table and may change with time. The evapotranspiration takes place in the root zone (upper layer), and the interaction between the saturated and unsaturated zones takes place between the root depth and the water table (lower layer). The root depth is defined as the maximum depth of active roots in the root zone. The vegetation is described by the leaf area index (LAI) and the root depth (RD), which vary according to the seasonal development of the crops.

[43] The soil properties in the root zone include soil moisture contents at the wilting point, at field capacity, and at saturation. The properties are distributed according to a topsoil map, divided into clay (70%), sand (20%), and paved areas (10%).

[44] The land use within the model area is divided into agriculture (80%), forest (10%), urban areas (10%), and lakes (<1%).

5. Calibration and Validation of Models

[45] The models are calibrated against observations of hydraulic head and stream discharge for the period 2000–2005, when the groundwater abstraction is almost constant. Subsequently, the models are validated against observations from the period 1995–1999, where groundwater abstraction is decreasing and both wet and dry years are found. All models are calibrated by inverse optimization using PEST [Doherty, 2004, 2008].

5.1. Observation Data and Objective Function

[46] Sixty-six observation wells with time series of hydraulic head mostly for every second month, and 11 stream gauging stations with a time series of daily stream discharge are available for the calibration and/or the validation period (see Figure 3). Additionally, a data set representing the mean hydraulic head in the calibration period is calculated based on 179 observation wells with only a few measurements.

[47] The observation wells within the model area are mainly located in the limestone aquifer (75%), while fewer (25%) are located in the upper sand aquifers. The observation wells with time series tend to cluster around the large well fields in the western part of the model area, as shown in Figure 3. The observation wells used for the mean hydraulic head have a relatively uniform spatial distribution. Most of the gauging stations (Figure 3) are located in the catchment area of Langvad Stream and its tributaries (Figure 1).

[48] Multiobjective calibration of the models is performed with the objective functions defined by the observation groups listed in Table 2. Formulations of the root-mean-square error (RMS), the total water balance error (Fbal), and the Nash-Sutcliffe coefficient (E) are provided in equations (3)–(5). The amplitude in heads is calculated as the difference between the maximum and minimum hydraulic head in the head time series within the calibration period. The amplitude error (ErrAmplTS) is then found as the difference between the observed and simulated amplitude. The variation in head reflects both seasonal variations and variations in abstraction. In the optimization, the individual objective functions, ϕ_i, are initially scaled by a factor, v_i^2, to ensure that the contribution of each objective function, ϕ_i, to the total objective function, Φ, is initially about the same. A second factor, w_i, is applied to control the weight given to the individual objective functions during the calibration. The total objective is thus formulated as,

$$\Phi = \sum_{i} w_i \, v_i^{2} \, \phi_i$$

Table 2. Calibration Set-Up, Relative Contribution to Objective Functions (w_i), Number of Observation Points From 2000–2005

| Observation Groups for the Objective Function, ϕ_i | Calibration 1, w_i,1 | Calibration 2, w_i,2 | Observation Points in the Calibration Period |
| --- | --- | --- | --- |
| Root-mean-square error of time series of hydraulic head (RMSTS) | 0.25 | 0 | 64 |
| Root-mean-square error of mean hydraulic head in wells with few data (RMSMean) | 0.25 | 0 | 179 |
| Total water balance error for stream discharge (Fbal) | 0.25 | 0.33 | 9 |
| Nash-Sutcliffe coefficient for daily stream discharge data (E) | 0.25 | 0.33 | 9 |
| Amplitude error in time series of hydraulic head (ErrAmplTS) | 0 | 0.33 | 64 |
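The two-factor weighting can be sketched as follows. The code assumes that the total objective is a weighted sum of the scaled group objectives and that the scaling factor for each group is the reciprocal of that group's objective value at the initial parameter set, so every scaled term starts at 1. Both are assumptions made for illustration; the initial objective values are hypothetical, while the weights are those listed in Table 2.

```python
# Weights from Table 2 for the two calibration stages.
W_CAL1 = {"RMS_TS": 0.25, "RMS_Mean": 0.25, "F_bal": 0.25, "E": 0.25, "ErrAmpl_TS": 0.0}
W_CAL2 = {"RMS_TS": 0.0, "RMS_Mean": 0.0, "F_bal": 0.33, "E": 0.33, "ErrAmpl_TS": 0.33}

# Hypothetical objective values at the initial parameter set (arbitrary units).
phi_initial = {"RMS_TS": 2.5, "RMS_Mean": 3.0, "F_bal": 12.0, "E": 0.4, "ErrAmpl_TS": 0.8}


def initial_scaling(phi_start):
    """Scaling factors chosen so each scaled objective equals 1 at the start (assumption)."""
    return {name: 1.0 / value for name, value in phi_start.items()}


def total_objective(phi, scaling, weights):
    """Weighted sum of scaled group objectives (assumed form of the total objective)."""
    return sum(weights[name] * scaling[name] * phi[name] for name in phi)


scaling = initial_scaling(phi_initial)
print("stage 1, initial total:", round(total_objective(phi_initial, scaling, W_CAL1), 3))
print("stage 2, initial total:", round(total_objective(phi_initial, scaling, W_CAL2), 3))
```

Because every scaled term starts at 1, the initial share of each group in the total equals its weight, which is the behavior the scaling is meant to provide.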

[49] The objective function is described in more detail in Appendix A. The multiobjective calibration set-up is inspired by Stisen et al. [2011], using the same components for the objective function and three different weightings of the components. Our objective function is somewhat simpler than the one used by Stisen et al. [2011], who, among other things, also include a criterion on water balance for the summer months, providing a focus on the simulation of low flows. As this criterion is omitted here, it is likely that parameter combinations other than those resulting from our model calibrations would be able to produce slightly better simulations of low flows.
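For readers without access to Appendix A, the sketch below gives commonly used forms of the four performance measures. The RMS error and the Nash-Sutcliffe coefficient are the standard definitions, and the amplitude error follows the description in the text (observed minus simulated range). The water balance error is written here as a relative difference in mean flow in percent, which is an assumption and may differ from the formulation in the appendix.

```python
import math


def rms_error(obs, sim):
    """Root-mean-square error between paired observed and simulated values."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe coefficient: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


def water_balance_error(obs, sim):
    """Relative error in mean flow, in percent (assumed form)."""
    mean_obs = sum(obs) / len(obs)
    mean_sim = sum(sim) / len(sim)
    return 100.0 * (mean_obs - mean_sim) / mean_obs


def amplitude_error(obs, sim):
    """Observed minus simulated amplitude, where amplitude is max minus min."""
    return (max(obs) - min(obs)) - (max(sim) - min(sim))


# Hypothetical daily discharge values (L/s), paired in time.
observed = [120.0, 300.0, 900.0, 450.0, 200.0]
simulated = [150.0, 280.0, 800.0, 500.0, 210.0]
print("RMS:", round(rms_error(observed, simulated), 1))
print("E:", round(nash_sutcliffe(observed, simulated), 3))
print("Fbal (%):", round(water_balance_error(observed, simulated), 1))
print("amplitude error:", amplitude_error(observed, simulated))
```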

[50] The goodness-of-fit criteria that we use in the multiobjective calibration are subjectively defined and not formulated within a rigorous statistical framework, and hence they cannot be expected to result in minimum variance parameter estimates. This would have been critical if we had focused on the estimation of parameter confidence intervals and associated prediction uncertainty. However, the multicriteria approach has been tested thoroughly during the development of the national water resources model [Henriksen et al., 2003; Stisen et al., 2011], where it has been the goal to obtain well-behaved hydrological models that are able to reproduce the overall behavior of the hydrological system. Since our focus is to analyze the impact of alternative geological conceptualizations, and since a key assumption in a strict statistical framework, namely that the conceptualization is correct, obviously does not hold, we consider that our approach can be justified in the context of the present study.

5.2. Sensitivity Analysis and Calibration Setup

[51] Sensitivity analysis and automatic optimization of the models are carried out using the parameter estimation program PEST [Doherty, 2004, 2008]. The optimization is performed in a two-stage procedure, where each stage includes a sensitivity analysis and a model calibration. The weights of the objective function components are given in Table 2.

[52] In the first stage, the hydraulic conductivities, the time constant for drains, the leakage coefficient for streams, and the root depth are estimated (calibration 1). The observations of hydraulic head and stream discharge contribute equally to the objective function (Table 2). The second stage focuses on dynamics, and the storage coefficients for limestone, the drain time constant, the stream leakage coefficient, and the root depth are calibrated. The parameters estimated in the first calibration step are used as initial values for the second step. In the second step, stream discharge contributes 2/3 and the error in head amplitude for wells with time series contributes 1/3 to the objective function (Table 2).

[53] The two-step process is used as the storage coefficients were not included in the group of sensitive parameters to be calibrated in the first calibration. After estimating the K-values in the first calibration, a second sensitivity analysis was carried out where the storage coefficients were identified as sensitive.

[54] The values of the horizontal (Kh) and vertical (Kv) hydraulic conductivities are tied together using an anisotropy factor (Kh/Kv) of 10. The values of the summer root depth (RDs) and winter root depth (RDw) for agriculture are likewise tied together with a factor of 5 (RDs/RDw).

[55] The relative composite sensitivities of the model parameters are examined to identify the most important model parameters to include in the optimization. The calibration parameters for the first calibration step are selected according to two criteria: only parameters with sensitivities higher than 20% of that of the most sensitive parameter are included, and at most 10 parameters are calibrated. In the second calibration the storage coefficients for limestone, the drain time constant, the stream leakage coefficient, and the root depth are calibrated. The number of calibrated parameters is provided in Table 1.
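
The selection rule for the first calibration step can be illustrated with a minimal sketch. The function name, parameter names, and sensitivity values below are hypothetical and serve only to show the two criteria (the 20% threshold and the cap of 10 parameters); they are not taken from the PEST runs of this study.

```python
def select_calibration_parameters(sensitivities, threshold=0.20, max_params=10):
    """Rank parameters by composite sensitivity and keep those whose sensitivity
    exceeds threshold * (largest sensitivity), capped at max_params."""
    if not sensitivities:
        return []
    largest = max(sensitivities.values())
    ranked = sorted(sensitivities.items(), key=lambda kv: kv[1], reverse=True)
    selected = [name for name, s in ranked if s > threshold * largest]
    return selected[:max_params]


# Hypothetical relative composite sensitivities (illustrative values only)
example = {
    "Kh_limestone": 1.00,
    "Kv_clay": 0.62,
    "root_depth": 0.45,
    "drain_constant": 0.30,
    "Kh_sand": 0.21,
    "leakage_coefficient": 0.05,
}
chosen = select_calibration_parameters(example)
print(chosen)
```

In this illustration the leakage coefficient falls below the 20% threshold and is therefore left out of the first calibration.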

5.3. Calibration Results

[56] In all models the hydraulic conductivity of the limestone aquifer, clay layers, and the root depth are sensitive parameters. In most models the hydraulic conductivity of the sand aquifers and the drain coefficient are also sensitive. The sum of the sensitivity coefficients for the calibration parameters comprises in all models >65% of the total sensitivity for all parameters.

[57] In two models the root depth and the drain time constant are highly correlated, with parameter correlation coefficients of ∼0.9. Most parameter pairs (85%) have no to low correlation (correlation coefficients of 0–0.5); the rest are moderately correlated (0.5–0.7).

[58] The parameter estimates for each model are listed in Table 3. All values of the hydraulic conductivities are within the realistic intervals for sand, clay, and limestone, but they vary significantly between the models, by up to 1–2 orders of magnitude. Besides the difference in the absolute values of the hydraulic conductivity between the models, the spatial distribution of the hydraulic conductivity of the limestone aquifer differs considerably (Figure 4). In the N1 and N2 models the hydraulic conductivity of the limestone in the area close to the well fields is significantly lower than in the other models, which have a higher K-value at the well fields than in the surroundings.

Table 3. Parameters in Each Model (a)
Name | Units | R1 | R2 | L1 | L2 | N1 | N2
Drain constant, CD | s^−1 | 1.4 × 10^−7 | 1.3 × 10^−7 | 0.19 × 10^−7 | 0.13 × 10^−7 | 2.4 × 10^−7 | 2.3 × 10^−7
Leakage coef., CL | s^−1 | 0.82 × 10^−8 | 1.4 × 10^−8 | 1.0 × 10^−8 | 0.029 × 10^−8 | 0.052 × 10^−8 | 1.0 × 10^−8
Root depth, RDs | mm | 491 | 821 | 650 | 602 | 1138 | 1076
Limestone, Kh | m s^−1 | 6.9–170 × 10^−5 (36 × 10^−5) | 0.1–800 × 10^−5 (110 × 10^−5) | 0.054–200 × 10^−5 (8.3 × 10^−5) | 0.045–200 × 10^−5 (8.6 × 10^−5) | 0.11–49 × 10^−5 (3.8 × 10^−5) | 0.12–55 × 10^−5 (4.3 × 10^−5)
Clay, Kv | m s^−1 | 0.29–16 × 10^−9 | 0.99–10 × 10^−9 | 0.69–41 × 10^−9 | 0.94–81 × 10^−9 | 0.25–150 × 10^−9 | 2.8 × 10^−9
Clay (top), Kv | m s^−1 | 0.013–4.4 × 10^−7; 0.0094–0.11 × 10^−7; 1.5 × 10^−7 (b); 1.8 × 10^−7 (b)
Sand (S2), Kh | m s^−1 | 2.2–100 × 10^−5 | 0.15–20 × 10^−5 | 0.12–100 × 10^−5 | 0.1–100 × 10^−5 | 2.5 × 10^−5 | 5.1 × 10^−5 (b)
Storage coef. limestone, ss | m^−1 | 0.10 × 10^−4 | 2.8 × 10^−4 | 0.11 × 10^−4 | 1.2 × 10^−4 | 2.6 × 10^−4 | 1.4 × 10^−4

(a) Values in brackets represent a mean value.
(b) Not calibrated values.

[59] The performance statistics for the calibration period, 2000–2005, are listed in Table 4. The mean error (ME) and the root-mean-square error (RMS) are given by:

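$$\mathrm{ME} = \frac{1}{m}\sum_{i=1}^{m}\left[\frac{1}{n}\sum_{j=1}^{n}\left(\psi_{obs,j}-\psi_{sim,j}\right)\right]_{i} \qquad (2)$$

$$\mathrm{RMS} = \frac{1}{m}\sum_{i=1}^{m}\left[\sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(\psi_{obs,j}-\psi_{sim,j}\right)^{2}}\,\right]_{i} \qquad (3)$$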

m is the number of observation wells or stream discharge stations, and n is the number of observations in each well or discharge station. ψobs, j is the observation, e.g., hydraulic head or stream discharge and ψsim, j is the corresponding simulated value.
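
As an illustration of these two statistics, the following minimal sketch computes ME and RMS for a small set of wells. It assumes that the statistic is first evaluated for each well (over its n observations) and then averaged over the m wells, and that residuals are taken as observed minus simulated, so that ME > 0 means the simulated head is too low. The function names and the data are hypothetical.

```python
from math import sqrt


def mean_error(wells):
    """wells: list of (observed, simulated) pairs of equal-length lists, one pair per well.
    Returns the average over wells of the per-well mean residual (obs - sim)."""
    per_well = [sum(o - s for o, s in zip(obs, sim)) / len(obs) for obs, sim in wells]
    return sum(per_well) / len(per_well)


def rms_error(wells):
    """Average over wells of the per-well root-mean-square residual."""
    per_well = [
        sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))
        for obs, sim in wells
    ]
    return sum(per_well) / len(per_well)


# Hypothetical heads (m): the first well is simulated 1 m too low,
# the second well is simulated too high.
wells = [
    ([10.0, 10.5, 11.0], [9.0, 9.5, 10.0]),
    ([20.0, 20.0], [21.0, 23.0]),
]
me = mean_error(wells)
rms = rms_error(wells)
print(me, rms)
```

The example also shows why both statistics are reported: errors of opposite sign partly cancel in ME, whereas RMS accumulates them.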

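Table 4. Statistics for Calibration Period (2000–2005) and Validation Period (1995–1999) for Each Model
Calibration Period | Name | Units | R1 | R2 | L1 | L2 | N1 | N2
Head | METS | m | −1.41 | −0.20 | 0.31 | −0.16 | 1.38 | −0.19
Head | RMSTS | m | 6.52 | 3.12 | 2.08 | 2.01 | 4.41 | 4.82
Head | MEMean | m | 1.03 | −0.41 | 0.90 | 0.32 | 0.88 | −2.15
Head | RMSMean | m | 6.15 | 4.57 | 4.65 | 4.06 | 5.51 | 5.40
Stream discharge | Fbal | % | −17 | −8 | −2 | −2 | −2 | −2
Stream discharge | E | | 0.58 | 0.58 | 0.17 | −0.12 | 0.63 | 0.75
Validation Period | Name | Units | R1 | R2 | L1 | L2 | N1 | N2
Head | METS | m | −2.37 | −0.76 | 0.04 | −0.51 | 0.99 | 0.16
Head | RMSTS | m | 6.54 | 3.47 | 2.31 | 2.44 | 4.42 | 4.80
Stream discharge | Fbal | % | −33 | −21 | −9 | −9 | −15 | −9
Stream discharge | E | | 0.63 | 0.64 | 0.06 | −0.14 | 0.72 | 0.80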

[60] The water balance error for the stream discharge is given by:

$$F_{bal} = 100\,\frac{\bar{Q}_{obs}-\bar{Q}_{sim}}{\bar{Q}_{obs}} \qquad (4)$$

where $\bar{Q}_{obs}$ and $\bar{Q}_{sim}$ are the mean observed and simulated discharge.

[61] The Nash-Sutcliffe model efficiency coefficient, E, is given by:

$$E = 1-\frac{\sum_{j=1}^{n}\left(Q_{obs,j}-Q_{sim,j}\right)^{2}}{\sum_{j=1}^{n}\left(Q_{obs,j}-\bar{Q}_{obs}\right)^{2}} \qquad (5)$$

[62] E = 1 corresponds to a perfect match between model results and observations, whereas when E < 0 the mean value of the observations, $\bar{Q}_{obs}$, describes the data better than the model results, i.e., the residual variance is larger than the variance of the observations.
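
The two discharge criteria can be sketched for a single station as follows. The sketch assumes that the water balance error is expressed in percent of the mean observed discharge with the sign convention used in the text (negative when the simulated discharge is too large), and it uses the standard Nash-Sutcliffe definition. The function names and discharge values are hypothetical.

```python
def water_balance_error(obs, sim):
    """Water balance error in percent; negative when the simulation overestimates discharge."""
    mean_obs = sum(obs) / len(obs)
    mean_sim = sum(sim) / len(sim)
    return 100.0 * (mean_obs - mean_sim) / mean_obs


def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over observed variance."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


# Hypothetical daily discharges (m3/s); the simulation is uniformly 0.5 too high
obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.5, 2.5, 3.5, 4.5]
fbal = water_balance_error(obs, sim)
e = nash_sutcliffe(obs, sim)
print(fbal, e)
```

A simulation equal to the observed mean at every time step gives E = 0, which is the reference level against which negative E-values are interpreted.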

[63] The differences in the statistical values reveal significant differences in the model results. For example, the mean errors for the head observations (ME) show that the simulated head differs by up to 3 m between the six models; some models simulate on average a hydraulic head that is too low (ME > 0) and others one that is too high (ME < 0). The stream discharge is on average overestimated by all models (Fbal < 0). For head observations, the R2, L1, and L2 models perform significantly better than the others, but for stream observations the N1 and N2 models give the best results. Hence, on the basis of the calibration statistics no model is superior to the others.

5.4. Validation Results

[64] The models are validated against observations of hydraulic head and stream discharge for the period 1995–1999, when the groundwater abstraction was on average 14% higher than in the calibration period. The model statistics for the validation period are shown in Table 4. Generally, the simulation results for the validation period are slightly inferior to those for the calibration period, but the statistical values have the same magnitude for the two periods. The E-values for the stream observations are slightly better for the validation period than for the calibration period for four of the six models (R1, R2, N1, and N2).

6. Results

6.1. Simulation of Changes in Abstraction

[65] Groundwater abstraction in the focus area changed significantly twice during the simulation period. First, a long-term trend of decreasing abstraction has been observed, especially in the period from 1990 to 2000, resulting in a significantly higher abstraction in the beginning of the 1990s than in the calibration period 2000–2005. Second, abstraction to Lejre Waterworks was stopped temporarily in late 2007. The effects of both the long-term and the short-term change in abstraction on the hydraulic head distribution and the stream discharge are used to test the predictive capabilities of the models.

6.1.1. Change in Abstraction From 1990–1993 to 2000–2005

[66] The abstraction at the well fields connected to the Lejre Waterworks has been reduced by 30% from 1990–1993 to 2000–2005. This has resulted in an increase of 2–3 m in the hydraulic head of the limestone aquifer close to the well fields, and 0.5–1 m at a distance of 2–3 km from the well fields. On average the hydraulic head increases 1.5 m in the 32 observation wells in the focus area. The 5% fractile of the observed stream discharge (i.e., the low flow, which is mainly groundwater fed) increases on average by 5 L s−1 corresponding to an increase of 30%. The precipitation has on average increased by only 1% and during the summer period by 5%. Hence, the changes in hydraulic head and low flow are assumed to be caused primarily by the decrease in groundwater abstraction.

[67] Figure 5 shows the simulated mean change in hydraulic head, Δhsim, in the model area. All models simulate an increase in the hydraulic head close to the well fields, though with a significant difference in the magnitude and extent of the increase. From Table 5, the average changes in hydraulic head within the focus area are seen to vary from 0.13 to 2.2 m. Some of the results, especially for model R2, include both increasing and decreasing hydraulic head in the focus area, which is reflected in a high standard deviation relative to the mean value, i.e., a high coefficient of variation (CV ≈ 9 for R2), whereas CV is close to 1 for the remaining models.

Figure 5.

Mean change in hydraulic head in the limestone aquifer due to a long-term reduction in the abstraction from 1990–1993 to 2000–2005.

Table 5. Mean and Standard Deviation of Simulated Change in Hydraulic Head in the Focus Area Due to Long-Term Reduction in the Abstraction (1990–1993 to 2000–2005)
 | R1 | R2 | L1 | L2 | N1 | N2 | Mean | SD
Mean change in head (m) | 0.44 | 0.13 | 1.8 | 1.3 | 1.3 | 2.2 | 1.2 | 0.8
SD (m) | 0.47 | 1.1 | 2.0 | 1.8 | 1.2 | 1.8 | 1.4 | 0.6
Coefficient of variation | 1.1 | 8.7 | 1.1 | 1.4 | 0.9 | 0.8 | |

[68] The error in the simulation results for the individual models and the variation between the models increase when going from the calibration period to the validation period, and further to predicting a change in the groundwater abstraction as described above. During the calibration period (2000–2005) with an abstraction in the focus area of 9.2 million m3 yr−1, the mean RMS for the six models using the 32 observation wells in the focus area is 3.8 m, and the standard deviation of the ME-values is 2.1 m (see Table 6). The mean RMS is the same for the validation period (1995–1999) with an abstraction of 10.5 million m3 yr−1, but the variation in the results between the models increases to a standard deviation of ME of 2.3 m. For prediction of the period with a significantly higher groundwater abstraction of 12.7 million m3 yr−1 (1990–1993), the mean RMS increases to 4.1 m and the standard deviation of ME to 2.8 m.

Table 6. Model Statistics for the 32 Wells in the Focus Area (a)
Period | Q, million m3 yr−1 | SD of ME (m) | Mean RMS (m)
Calibration (2000–2005) | 9.2 | 2.1 | 3.8
Validation (1995–1999) | 10.5 | 2.3 | 3.8
Prediction (1990–1993) | 12.7 | 2.8 | 4.1

(a) Comparison between the calibration, validation and a prediction period with changed groundwater abstraction (Q).

[69] In Table 7, statistics on the differences between observed changes in hydraulic head, Δhobs, and simulated changes in hydraulic head, Δhsim, for the 32 observation wells in the focus area are presented, using the six alternative models. Observed and simulated head changes are calculated as the difference in mean head for the two periods in the individual wells. MEΔ and RMSΔ are calculated by replacing ψobs and ψsim in equations (2) and (3) with Δhobs and Δhsim. MEΔ values between −1.8 m and 1.4 m are obtained, implying that both under- and overestimation are found. Compared with the observed mean change of 1.5 m, the model errors are significant. The smallest error of −0.3 m is found when the predicted change is calculated as the mean of the head change from all six models, denoted “mean” in Table 7. The root-mean-square values of the differences between observed and simulated changes, RMSΔ, show variations in the residuals of the same magnitude as, or larger than, the observed changes (Δhobs, mean = 1.5 m). The model with the best performance on hydraulic head in the calibration period (L2) also makes the best prediction of the mean changes in hydraulic head, although with a high variation (RMSΔ = 1.7 m). However, the model performing worst in the calibration (R1) is overall best in predicting the changes in the hydraulic head (RMSΔ = 1.1 m). Hence, in this case the model performance in the calibration period does not necessarily give information on performance in predicting the effects of changed abstraction conditions.

Table 7. Statistics for Predicted Change in Hydraulic Head of 32 Observation Wells and Mean Change in Discharge Due to Long-Term Reduction in Abstraction (1990–1993 to 2000–2005) Using Six Alternative Models and Model Averaging (a)
Six Alternative Models | Name | Units | Observations | R1 | R2 | L1 | L2 | N1 | N2 | Mean | SD
Head | MEΔ | m | | 0.8 | 1.4 | −1.0 | −0.5 | −0.7 | −1.8 | −0.3 | 1.2
Head | RMSΔ | m | | 1.1 | 1.6 | 2.0 | 1.7 | 1.5 | 3.1 | 1.8 | 0.7
Discharge | Mean change in 5% fractile discharge | L s^−1 | 5 | −1 | −1 | −2 | −3 | 4 | 4 | 0.2 | 3.1
Discharge | SD | L s^−1 | 6 | 9 | 7 | 3 | 4 | 7 | 4 | |
Model Averaging | Name | Units | EWA | BGA, RMS2 | BGA, E | BGA, RMS2, E | GRA, TS-H | GRA, TS-Q | GRA, TS-all
Head | MEΔ | m | −0.3 | −0.46 | −0.21 | −0.33 | −1.23 | −2.02 | −1.19
Head | RMSΔ | m | 1.1 | 1.38 | 1.09 | 1.16 | 2.18 | 3.93 | 2.10
Discharge | Mean change in 5% fractile discharge | L s^−1 | 1 | −2 | 4 | 2 | 0 | 5 | 0
Discharge | SD | L s^−1 | 4 | 2 | 7 | 4 | 1 | 6 | 2

(a) The mean observed change is: Δhobs,mean = 1.5 m. MEΔ and RMSΔ, see explanation in text.

[70] Regarding the change in the low flow (5% fractile of the stream discharge; Table 7), the only models estimating an increase in the discharge are N1 and N2. The results in Table 7 represent averages over the eight discharge stations in the focus area and hide large variations between the stations, as illustrated in Figure 6. The N2 model is overall the best model at simulating the observed changes in stream discharge.

Figure 6.

Change in the 5% fractile of the stream discharge at each discharge station due to long-term reduction in the abstraction from 1990–1993 to 2000–2005.

6.1.2. Abstraction Shutdown in 2007

[71] During the winter 2007–2008, the groundwater abstraction at many of the main well fields in the focus area was temporarily stopped for 3 weeks. To study the effect on the hydraulic head in the limestone aquifers, divers were installed in 22 observation wells within the focus area, and high-frequency monitoring of the development in hydraulic head was carried out for 2½ months. The 15 observation wells within 2 km of the well fields showed hydraulic heads rising by 0.5 to 5 m. In the remaining seven wells, no changes in the hydraulic head related to the interruption in abstraction were observed. Because of the short abstraction shutdown and a coincident period of intensive rainfall, no effect was observed in the stream discharge.

[72] The observed and simulated recoveries in selected wells are illustrated in Figure 7. When comparing the observed and simulated curves it must be noted that the models cannot be expected to match field data for individual wells. First, the models have been calibrated for the overall catchment with very few calibration parameters. Second, abstraction from the main well fields is recorded and included in the models on a weekly basis, while some of the observed head data show clear signs of variations in abstraction within a week. With this reservation in mind, it is noted that all models respond to the change in abstraction, although both the timing and magnitude of the increase in hydraulic head are quite different. Some models, especially R1 and L1, tend to give a fast and strong response, whereas the response of other models is relatively small. In Table 8 the statistics of the results from the six alternative models are listed for the period shown in Figure 7. On average, the recoveries simulated by the models are too small (MEΔ > 0 m), but as shown in Figure 7 this result covers both under- and overestimation of the recovery within the 70 d period considered. The R1 and L1 models perform best with MEΔ = 0.6–0.8 m and RMSΔ = 1.2–1.4 m. The remaining models perform on average very similarly, with MEΔ = 1.1–1.2 m and RMSΔ = 1.5–1.7 m. The Nash-Sutcliffe model efficiency coefficient, EΔ, is below zero for all models, i.e., the mean value of the observed recovery represents the data better than the model results. The mean of the results in this scenario is not significantly better than the single models.

Figure 7.

Observed and simulated recovery of the hydraulic head in selected observation wells at the temporary abstraction shutdown in 2007. The locations of the observation wells are indicated in Figure 3.

Table 8. Statistics for Recovery of Hydraulic Head in 15 Observation Wells Due to Temporary Abstraction Shutdown Using Six Alternative Models and Model Averaging (a)
Six Alternative Models | Units | R1 | R2 | L1 | L2 | N1 | N2 | Mean | SD
MEΔ | m | 0.6 | 1.2 | 0.8 | 1.1 | 1.1 | 1.2 | 1.0 | 0.24
RMSΔ | m | 1.2 | 1.7 | 1.4 | 1.5 | 1.6 | 1.6 | 1.5 | 0.16
EΔ | | −1.3 | −1.8 | −1.5 | −1.4 | −1.9 | −1.6 | |
Model Averaging | Units | EWA | BGA, RMS2 | BGA, E | BGA, RMS2, E | GRA, TS-H | GRA, TS-Q | GRA, TS-all
MEΔ | m | 1.00 | 0.99 | 1.04 | 1.01 | 0.88 | 1.14 | 0.84
RMSΔ | m | 1.39 | 1.38 | 1.40 | 1.38 | 1.37 | 1.50 | 1.38
EΔ | | −0.87 | −0.94 | −0.99 | −0.93 | −1.08 | −1.12 | −1.18

(a) The mean of the maximum observed change in each well is: Δhobs,mean = 2.5 m. MEΔ, RMSΔ, and EΔ, see explanation in text.

6.2. Model Averaging

[73] Three different averaging methods are applied: equal weights averaging (EWA), Bates-Granger averaging (BGA), and Granger-Ramanathan averaging (GRA). Results using Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) are not shown, as they put all weight on the single model with the smallest objective function (N2), as also found by Diks and Vrugt [2010]. The weights wj,i (j = 1, …, n; i = 1, …, k) are those of averaging method j for the individual hydrological model i, where k is the number of models. The equal weights,

$$w_{1,i} = \frac{1}{k}, \quad i = 1, \ldots, k \qquad (6)$$

are independent of the model performance of the individual models, whereas model statistics or observation and simulation data for the validation period are used for the other two weighting methods.

[74] BGA weighting is carried out using three different model results for the validation period (Table 4): the RMS-values for the hydraulic head (equation (7)), the Nash-Sutcliffe model efficiency coefficient (E) for stream discharge (equation (8)), and both RMS- and E-values (equation (9)).

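$$w_{2,i} = \frac{1/\mathrm{RMS}_{i}^{2}}{\sum_{l=1}^{k} 1/\mathrm{RMS}_{l}^{2}} \qquad (7)$$

$$w_{3,i} = \frac{\max(E_{i}, 0)}{\sum_{l=1}^{k} \max(E_{l}, 0)} \qquad (8)$$

$$w_{4,i} = \frac{w_{2,i}+w_{3,i}}{2} \qquad (9)$$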

[75] The GRA method uses linear regression,

$$\mathbf{w}_{j} = \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{Y} \qquad (10)$$

to estimate the weights, resulting in both positive and negative weighting coefficients. wj represents a vector with the weights for the GRA method using different types of observation, j = 5, 6, 7. Y is a vector with selected observation data and X is a matrix with the corresponding simulated state variables from each of the individual models in the columns of the matrix. For example, if Y consists of head observations in n wells (hobs,1, hobs,2, … , hobs,n), X is a matrix with the simulated heads in the same n wells from each of the k models ([hsim,1,1, hsim,2,1, … , hsim,n,1], … , [hsim,1,k, hsim,2,k, … , hsim,n,k]). Different types of data from the validation period are used to compare the difference in the weighting: time series of hydraulic head (1121), time series of stream discharge (1122), and time series of both hydraulic head and stream discharge (2243). The number of observations used is indicated in the parentheses.
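
The regression step of the GRA method can be sketched as an ordinary least squares problem without intercept, solved here through the normal equations in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `gra_weights`, the two hypothetical models, and the synthetic data are introduced only for the example, and no safeguard against a singular system is included.

```python
def gra_weights(X, Y):
    """Least squares weights w solving (X^T X) w = X^T Y.
    X: list of rows (one per observation, one column per model); Y: observations."""
    k = len(X[0])
    # Normal equations A w = b
    A = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    b = [sum(row[p] * y for row, y in zip(X, Y)) for p in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    w = [0.0] * k
    for r in range(k - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, k))) / A[r][r]
    return w


# Synthetic example: observations are an exact combination of two hypothetical models
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
Y = [1.4 * a - 0.3 * b for a, b in X]
w = gra_weights(X, Y)
print(w)
```

The recovered weights (about 1.4 and −0.3) illustrate that, unlike EWA and BGA, the regression weights are unconstrained: they can be negative and need not sum to one.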

[76] The individual hydrological models perform differently on different types of data; some perform better in simulating hydraulic head, whereas others are better in simulating stream discharge (Table 4). These results influence the weights (Table 9), depending on which data are used in the calculations. The models performing best in hydraulic head, e.g., L1 and L2, are assigned higher weights (w2 and w5) when using RMS or the head time series to calculate the weights. A similar pattern is observed for stream discharge (w3 and w6). This results in a nearly equal weighting between the models for the BGA method using both hydraulic head and stream discharge (w4).

Table 9. Weights Used for Model Averaging
Weights (w) | R1 | R2 | L1 | L2 | N1 | N2
w1 (EWA) | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17
w2 (BGA, RMS2, H) | 0.04 | 0.15 | 0.34 | 0.30 | 0.09 | 0.08
w3 (BGA, E, Q) | 0.22 | 0.22 | 0.02 | 0.00 | 0.25 | 0.28
w4 (BGA, RMS2, and E) | 0.13 | 0.19 | 0.18 | 0.15 | 0.17 | 0.18
w5 (GRA, time series H) | −0.06 | −0.04 | 0.72 | 0.27 | 0.00 | 0.13
w6 (GRA, time series Q) | 0.46 | −0.38 | −0.10 | −0.01 | −0.78 | 1.61
w7 (GRA, time series all) | −0.05 | 0.01 | 0.82 | 0.12 | 0.02 | 0.12
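
As a consistency check, the BGA weights in Table 9 (w2, w3, and w4) can be reproduced to two decimals from the validation statistics in Table 4 under three assumptions, which are inferred from the tabulated numbers rather than stated in the text: w2 is proportional to 1/RMS², w3 is proportional to E with negative values set to zero, and w4 is the simple mean of w2 and w3. The variable and function names are introduced for this sketch only.

```python
models = ["R1", "R2", "L1", "L2", "N1", "N2"]
rms_ts = [6.54, 3.47, 2.31, 2.44, 4.42, 4.80]  # validation RMSTS (m), Table 4
nse = [0.63, 0.64, 0.06, -0.14, 0.72, 0.80]    # validation E, Table 4


def normalize(scores):
    """Scale non-negative scores so that they sum to one."""
    total = sum(scores)
    return [s / total for s in scores]


# Assumption: weights proportional to the inverse squared RMS
w_rms = normalize([1.0 / r ** 2 for r in rms_ts])
# Assumption: weights proportional to E, with negative E given zero weight
w_e = normalize([max(e, 0.0) for e in nse])
# Assumption: combined weights are the simple mean of the two sets
w_both = [(a + b) / 2.0 for a, b in zip(w_rms, w_e)]

for name, a, b, c in zip(models, w_rms, w_e, w_both):
    print(name, round(a, 2), round(b, 2), round(c, 2))
```

Because the three sets are built from different statistics, a model such as L2 can receive a high weight for head (w2) and no weight for discharge (w3), which is why the combined weights w4 end up close to the equal weights of EWA.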

[77] Based on the calculated weights, the state variables for an average model (Ψj) for each of the averaging methods are found as,

$$\Psi_{j} = \sum_{i=1}^{k} w_{j,i}\,\psi_{i} \qquad (11)$$

where ψi are the state variables as simulated by the individual models. The results from the model averaging are listed in Tables 7 and 8, corresponding to the results from the individual models.
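
The averaging step is a weighted sum over the six models. Because the mean error is linear in the simulated values, the mean error of an averaged prediction equals the weighted sum of the individual mean errors whenever the weights sum to one. The sketch below uses this property to relate the individual MEΔ values in Table 7 to the EWA and BGA (RMS2) entries of the same table; the function name is hypothetical, and the BGA weights are the rounded values from Table 9, so the result only agrees with the table to within rounding.

```python
def average_prediction(weights, predictions):
    """Weighted sum of the individual model predictions (or of any linear statistic)."""
    return sum(w * p for w, p in zip(weights, predictions))


# MEΔ (m) of the six models for the long-term change in head (Table 7), order R1..N2
me_delta = [0.8, 1.4, -1.0, -0.5, -0.7, -1.8]

w_ewa = [1.0 / 6.0] * 6                           # w1 in Table 9
w_bga_rms = [0.04, 0.15, 0.34, 0.30, 0.09, 0.08]  # w2 in Table 9 (rounded)

me_ewa = average_prediction(w_ewa, me_delta)
me_bga = average_prediction(w_bga_rms, me_delta)
print(round(me_ewa, 2), round(me_bga, 2))
```

The same shortcut does not apply to RMSΔ, which is not linear in the predictions, nor to the GRA weights, which do not sum to one.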

[78] The results for the long-term change in groundwater abstraction (Table 7) show that the simple averaging methods (EWA and BGA) perform better than the individual models in forecasting the change of hydraulic head, both with respect to MEΔ and RMSΔ. The GRA methods perform significantly worse than the simple methods. However, the GRA method that uses only stream discharge observations for calculating the weights (GRA, TS-Q) produces the best forecast of the change in the 5% fractile of the stream discharge and is even better than the best individual model (Table 7). The BGA, E method performs nearly as well as the best individual model in forecasting the change in the low-flow stream discharge. The other averaging methods fail to estimate the change in stream discharge.

[79] Using the model averaging methods to forecast the change in hydraulic head during the short-term abstraction shutdown gives no improvement compared to using the individual models (Table 8). As the results from the individual models are very similar, model averaging cannot be expected to produce different or better results.

7. Discussion

7.1. Predictive Capability of Multiple Models for Extrapolations Beyond Calibration Base

[80] We calibrated six hydrological models against groundwater head and discharge data for the period 2000–2005 and performed split-sample tests against data from 1995 to 1999, where the catchment conditions (groundwater abstractions) were similar. The six models were subsequently tested against data from two other periods where the groundwater abstractions were significantly different. These validation tests for conditions beyond the calibration base correspond to differential split-sample tests according to Klemes [1986]. Few similar tests have been reported in the literature, and these are mostly limited to the use of single models [Kuczera et al., 1993; Refsgaard et al., 1998; Nyholm et al., 2002]. Kuczera et al. [1993] tested the capability of two models to predict the effects of land use change on river discharges and found that one of the two models failed to adequately predict peak flows, while the other model with a different conceptualization provided a reasonably good description. Refsgaard et al. [1998] tested an integrated river/reservoir/groundwater model and concluded that it was able to make reasonably good predictions of changes in the groundwater level regime caused by damming of the Danube, which significantly changed the stream-aquifer interaction dominating the aquifer dynamics. Nyholm et al. [2002] measured the stream depletion in a pumping test and established two groundwater models calibrated on data outside the pumping test and for the pumping test periods, respectively. They found that each calibrated model closely fitted its own calibration data but made biased predictions for the other period with different pumping conditions.

[81] Altogether, the differential split-sample tests reported in the literature with the use of one or two models do not provide a unique answer to the question of how well models perform under conditions different from the calibration situation. This broad range of performances, from very poor to very good, most likely has site-specific explanations in terms of data availability, model structure, model calibration, and the character of extrapolation used in model prediction. In any case, it is very difficult, if not impossible, to know beforehand (without making suitable tests) how good the performance will be. Moreover, most often such tests cannot be performed due to lack of data. Therefore, it is interesting to analyze whether the use of multiple models provides a more robust prediction than the use of single models. The results from the present study show that the six models generally perform more poorly in the differential split-sample test than they do in the ordinary split-sample tests. Our results, however, also show that the observed changes in most cases were within the ranges predicted by the six models, suggesting that the multiple modeling approach can be seen as a sound strategy to achieve robust assessments of prediction errors, which is very problematic when just using a single model. This ability to encapsulate the observations by multiple models cannot, however, be generally assumed. In the work of Troldborg et al. [2007] predictions of environmental tracer concentrations were generated in a similar hydrogeological multilayer aquifer setting by use of multiple models calibrated against groundwater heads and discharges. They found that the multiple models could only encapsulate less than half of the observed concentrations. This difference may be explained either by the level of extrapolation (solute transport is a further extrapolation than flow modeling, as in our case), or by our conceptual geological models differing more from each other than those used by Troldborg et al. [2007].

[82] According to the original scheme proposed by Klemes [1986], differential split-sample schemes should be carried out by a single model in two similar (e.g., neighboring) catchments with suitable test data sets; and the performance in these two catchments would then be a measure of expected prediction accuracy for the model for the target catchment. These tests are in practice very difficult to carry out, because they require very good data sets and substantial study resources, which are seldom available at the same time. The use of multiple models in the target catchment appears to us to be a good, and maybe even a better, alternative approach to the original differential split-sample scheme proposed by Klemes [1986]. The multiple modeling approach has the advantage that it operates in the target catchment, and hence critical and questionable assumptions on the representativeness of the neighboring catchments can be avoided. Furthermore, it preserves the idea of testing alternative hypotheses, ensuring a more robust answer than just using a single model for a single catchment. Finally, it often requires fewer study resources and is therefore easier to implement in practice. This approach is also adopted by the climate modeling community, where the use of multiple climate models is de facto standard practice [van der Linden and Mitchell, 2009].

7.2. Model Averaging

[83] When using a multiple modeling approach it is natural to consider some kind of model averaging to achieve an estimate of the optimal model prediction and its variance. No matter whether comprehensive model averaging schemes such as Bayesian model averaging [Neuman, 2003] or simpler schemes such as equal weights averaging (EWA), Bates-Granger averaging (BGA), or Granger-Ramanathan averaging (GRA) [Diks and Vrugt, 2010] are used, a critical assumption will be that one model performing better than another during the calibration or validation will also perform better for prediction (except for the EWA method). Our data set provides an opportunity to test this assumption, by calculating the weights of the six models based on model performance against head and discharge data during the calibration/validation situation, and then testing to what extent the relative performance of the six models is preserved during the prediction situation. Our results show that a model performing best under some conditions or in simulating a certain type of data is not necessarily the best in predicting other data or under changed conditions.

[84] The findings from our study reemphasize the concern raised by Refsgaard et al. [2012] that the estimated model weights derived from one situation may not be optimal for making predictions on something else. Although we have only documented this for three model averaging methods, it appears to be a severe limitation in all kinds of model averaging, no matter whether it is based on simple or sophisticated methods such as Bayesian model averaging.

8. Conclusions

[85] We have used six groundwater models that differ with respect to geological conceptualization but otherwise are identical to predict the effects of changes in groundwater abstraction on groundwater heads and discharges, corresponding to a differential split-sample test [Klemes, 1986]. The model predictions from the six models are very different, which is a confirmation of the importance of the geological conceptualization for model predictions beyond the calibration base as also found both for Danish conditions [Refsgaard et al., 2012] and elsewhere [Rojas et al., 2010].

[86] Our results show that the six models generally perform more poorly in the differential split-sample test, when used for making predictions under conditions different from those in the calibration situation, than they do in the ordinary split-sample tests. In most cases, the observed changes were within the ranges predicted by the six models, suggesting that the multiple modeling approach can be seen as a sound strategy to achieve a more robust assessment of prediction errors than is possible when using a single model. We conclude that the use of multiple models in the target catchment appears to be a good alternative approach to the original differential split-sample scheme [Klemes, 1986], which suggests the use of a single model in two similar (e.g., neighboring) catchments.

[87] We have tested three alternative model-averaging schemes in order to assess to what extent the model weights estimated on the basis of model performance against observed data in the calibration/validation situation are also the optimal weights for model predictions. Our results show that this is not the case, because the models performing best in the calibration situation were generally not the same as those that performed best during the differential split-sample prediction situation. Hence, we conclude that the critical assumption that is always made in model averaging, namely that the model weights derived from the calibration or validation situation are also optimal for model predictions, cannot be assumed to be generally valid.

Appendix A: Formulation of the Objective Function

[88] In estimating the model parameters, the parameter estimation program (PEST) minimizes the objective function (equation (1)) which consists of five components described by the observation groups in Table 2 (number of observation wells or discharge stations in parentheses):

[89] RMSTS, root-mean-square error of time series of hydraulic head, meter (64),

[90] RMSMean, root-mean-square error of mean hydraulic head in wells with few data, meter (179),

[91] Fbal, total water balance error for stream discharge, percent (9),

[92] E, Nash-Sutcliffe coefficient for daily stream discharge data, dimensionless (9),

[93] ErrAmplTS, amplitude error in time series of hydraulic head, meter (64).

[94] RMS, Fbal, and ErrAmpl equal to zero and E equal to 1 represent the perfect match between model results and observations. Hence, the objective function for each component, ϕ, consists of the summed squared residuals between the value for a perfect match and the model results of the observation wells/stream discharge stations.

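$$\phi_{RMS_{TS}} = \sum_{i=1}^{64}\left(0-\mathrm{RMS}_{TS,i}\right)^{2} \qquad (A1)$$

$$\phi_{RMS_{Mean}} = \sum_{i=1}^{179}\left(0-\mathrm{RMS}_{Mean,i}\right)^{2} \qquad (A2)$$

$$\phi_{F_{bal}} = \sum_{i=1}^{9}\left(0-F_{bal,i}\right)^{2} \qquad (A3)$$

$$\phi_{E} = \sum_{i=1}^{9}\left(1-E_{i}\right)^{2} \qquad (A4)$$

$$\phi_{ErrAmpl_{TS}} = \sum_{i=1}^{64}\left(0-\mathrm{ErrAmpl}_{TS,i}\right)^{2} \qquad (A5)$$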

[95] RMSi, Fbal,i, and Ei are given by equations (3)–(5) in the main text. The amplitude error, ErrAmpl, is the difference between the observed and simulated amplitude in the time series of hydraulic head for an observation well within the calibration period.

[96] The magnitudes of the five objective functions differ significantly, as they consist of different numbers of observations and as the magnitude of the observations ranges from 0 to 1 for E and from 5% to 30% for Fbal. Hence, the objective functions for each component are weighted, and the multiobjective function (Φ) in equation (1) is formulated as,

display math

where v is the weighting of the groups assigned by PEST (using the utility program PWTADJ1), ensuring that each term initially contributes equally to the objective function, Φ, and w is a weighting defining the contribution to the objective function. v differs for each model and each calibration, and w is listed in Table 2.
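
The role of the two weightings can be illustrated with a minimal sketch. It rests on two assumptions that are not confirmed by the text: that the multiobjective function is a sum of terms of the form w·v·ϕ, and that the equalizing factor v of a group is the reciprocal of that group's objective function at the initial parameter values. It does not reproduce PEST or PWTADJ1, and all function names, group names, and values are hypothetical.

```python
def component_objective(values, perfect):
    """Sum of squared residuals between the perfect-match value and per-station statistics."""
    return sum((perfect - v) ** 2 for v in values)


def equalizing_factors(initial_phi):
    """Assumed scaling: v = 1 / phi at the initial parameters, so every group starts at 1."""
    return {group: 1.0 / phi for group, phi in initial_phi.items()}


def total_objective(phi, v, w):
    """Assumed form of the multiobjective function: sum over groups of w * v * phi."""
    return sum(w[group] * v[group] * phi[group] for group in phi)


# Hypothetical per-station statistics at the initial parameter values
initial_phi = {
    "RMS_TS": component_objective([3.0, 4.0], perfect=0.0),   # m
    "Fbal": component_objective([-10.0, 20.0], perfect=0.0),  # %
    "E": component_objective([0.5, 0.7], perfect=1.0),        # dimensionless
}
v = equalizing_factors(initial_phi)
w = {"RMS_TS": 0.5, "Fbal": 0.25, "E": 0.25}  # hypothetical contributions

phi_start = total_objective(initial_phi, v, w)
print(initial_phi, phi_start)
```

Although the raw components differ by orders of magnitude (25, 500, and 0.34 in this example), each group contributes exactly its share w to the starting value of the objective function, which is the purpose of the equalizing step described above.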

Acknowledgments

[97] This work was funded by a grant from the Danish Strategic Research Council for the project Hydrological Modeling for Assessing Climate Change Impacts at Different Scales (HYACINTS – www.hyacints.dk) under contract DSF-EnMi 2104-07-0008. We appreciate the critical and constructive comments from three anonymous reviewers.
