## 1. Introduction

[2] Considerable progress has been made in the past three decades on uncertainty quantification in environmental modeling [*Liu and Gupta*, 2007; *Matott et al*., 2009; *Tartakovsky*, 2013; and references therein]. Initially, the emphasis was on uncertainty in model parameters. A more recent trend has been to consider uncertainties in both model structures and parameters [*Ye et al*., 2010a; *Gupta et al*., 2012]. This has been motivated by a growing recognition that environmental systems are open and complex, rendering them prone to multiple conceptualizations and mathematical descriptions, regardless of the quantity and quality of available data and knowledge [*Beven*, 2002; *Bredehoeft*, 2003, 2005; *Neuman*, 2003]. Multimodel analysis has become popular for quantification of model uncertainty [*Burnham and Anderson*, 2002; *Ye et al*., 2004, 2005, 2008a, 2008b, 2010b, 2010c; *Poeter and Anderson*, 2005; *Marshall et al*., 2005; *Beven*, 2006; *Foglia et al*., 2007; *Ajami et al*., 2007; *Vrugt and Robinson*, 2007; *Tsai and Li*, 2008a, 2008b; *Wohling and Vrugt*, 2008; *Rojas et al*., 2008, 2009; *Rubin et al*., 2010; *Winter and Nychka*, 2010; *Riva et al*., 2011; *Neuman et al*., 2012; *Lu et al*., 2011, 2012; *Nowak et al*., 2012; *Seifert et al*., 2012; *Rings et al*., 2012; *Parrish et al*., 2012; *Dai et al*., 2012]. In multimodel analysis, rather than choosing a single model, predictions and associated uncertainty from multiple competing models are aggregated, typically in a model averaging process. Consider a set of models, $\{M_k,\ k=1,\dots,K\}$, and denote $\Delta_k$ as a prediction (e.g., mean prediction or probability distribution) of model $M_k$ for a quantity of interest. The weighted average estimate, $\hat{\Delta}$, of the prediction is

$$\hat{\Delta} = \sum_{k=1}^{K} w_k \Delta_k \qquad (1)$$

where $w_k$ is the averaging weight associated with model $M_k$, the most critical variable to be estimated in the process of model averaging. It remains an open question how to estimate the averaging weights with mathematical and statistical rigor and computational efficiency.

[3] This study is focused on evaluating model averaging weights using

$$w_k = \frac{\exp\left(-\frac{1}{2}\Delta IC_k\right)}{\sum_{l=1}^{K}\exp\left(-\frac{1}{2}\Delta IC_l\right)} \qquad (2)$$

where the *IC* (information criteria) are various model selection criteria, and $\Delta IC_k = IC_k - IC_{\min}$ is the difference between the *IC* of model $M_k$ and the minimum *IC*, $IC_{\min} = \min_k IC_k$. Four model selection criteria are considered in this study:
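As a concrete illustration, equation (2) can be implemented in a few lines. The sketch below is ours, not part of the original study; the function name and the example *IC* values are hypothetical:

```python
import math

def averaging_weights(ic_values):
    """Equation (2): convert information criterion values IC_k into
    model averaging weights w_k proportional to exp(-Delta_IC_k / 2)."""
    ic_min = min(ic_values)  # IC of the best-ranked model
    raw = [math.exp(-0.5 * (ic - ic_min)) for ic in ic_values]
    total = sum(raw)
    return [r / total for r in raw]

# Three hypothetical models with IC values 100, 102, and 106 (Delta_IC = 0, 2, 6)
weights = averaging_weights([100.0, 102.0, 106.0])
```

Subtracting $IC_{\min}$ before exponentiating matters numerically as well as conceptually: the weights depend only on the differences $\Delta IC_k$, and exponentiating large raw *IC* values directly would underflow.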

*AIC* [*Akaike*, 1974], *AICc* [*Hurvich and Tsai*, 1989], *BIC* [*Schwarz*, 1978], and *KIC* [*Kashyap*, 1982]. They are defined for model $M_k$ as [*Ye et al*., 2008a]

$$AIC_k = -2\ln L(\hat{\boldsymbol{\theta}}_k \mid \mathbf{D}) + 2N_k \qquad (3)$$

$$AICc_k = -2\ln L(\hat{\boldsymbol{\theta}}_k \mid \mathbf{D}) + 2N_k + \frac{2N_k(N_k+1)}{N - N_k - 1} \qquad (4)$$

$$BIC_k = -2\ln L(\hat{\boldsymbol{\theta}}_k \mid \mathbf{D}) + N_k \ln N \qquad (5)$$

$$KIC_k = -2\ln L(\hat{\boldsymbol{\theta}}_k \mid \mathbf{D}) - 2\ln p(\hat{\boldsymbol{\theta}}_k) - N_k \ln(2\pi) + \ln\left|\mathbf{F}_k\right| \qquad (6)$$

where $\hat{\boldsymbol{\theta}}_k$ is the maximum likelihood (ML) estimate of the vector of $N_k$ adjustable parameters (which may include statistical parameters of the calibration data) associated with model $M_k$; $\mathbf{D}$ is a vector of $N$ observations collected in space-time; $-2\ln L(\hat{\boldsymbol{\theta}}_k \mid \mathbf{D})$ is the minimum of the negative log likelihood (*NLL*) function, $-2\ln L(\boldsymbol{\theta}_k \mid \mathbf{D})$, occurring, by definition, at $\boldsymbol{\theta}_k = \hat{\boldsymbol{\theta}}_k$; $p(\hat{\boldsymbol{\theta}}_k)$ is the prior probability of $\boldsymbol{\theta}_k$ evaluated at $\hat{\boldsymbol{\theta}}_k$; and $\mathbf{F}_k$ is the observed (implicitly conditioned on the observations $\mathbf{D}$ and evaluated at the ML parameter estimates $\hat{\boldsymbol{\theta}}_k$) Fisher information matrix having elements [*Kashyap*, 1982]

$$F_{k,ij} = -\left.\frac{\partial^2 \ln L(\boldsymbol{\theta}_k \mid \mathbf{D})}{\partial \theta_{k,i}\, \partial \theta_{k,j}}\right|_{\boldsymbol{\theta}_k = \hat{\boldsymbol{\theta}}_k} \qquad (7)$$
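Given the quantities defined above, the four criteria reduce to simple arithmetic. The sketch below assumes the *NLL* value ($-2\ln L$ at the ML estimate), the log prior, and $\ln|\mathbf{F}_k|$ have already been obtained from a calibration run; the function name and inputs are illustrative, not from the original study:

```python
import math

def selection_criteria(nll, n_params, n_obs, log_prior=0.0, log_det_fisher=0.0):
    """Equations (3)-(6). nll is the minimum of -2 ln L; n_params is N_k;
    n_obs is N; log_prior is ln p(theta_hat); log_det_fisher is ln|F_k|."""
    aic = nll + 2 * n_params
    aicc = aic + 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    bic = nll + n_params * math.log(n_obs)
    kic = nll - 2 * log_prior - n_params * math.log(2 * math.pi) + log_det_fisher
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "KIC": kic}

# Hypothetical calibration result: NLL = 10, two parameters, 100 observations
criteria = selection_criteria(nll=10.0, n_params=2, n_obs=100)
```

Note that the *BIC* penalty $N_k \ln N$ grows with the number of observations while the *AIC* penalty stays at $2N_k$, which is one reason the criteria can rank the same set of models differently.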

[4] Models associated with smaller values of a given criterion are ranked higher than those associated with larger values and are correspondingly assigned larger model averaging weights; the absolute value of the criterion is irrelevant. As shown by *Neuman* [2003] and *Ye et al*. [2008a], the model averaging weight calculated using *KIC* is a maximum likelihood (ML) approximation to the posterior model probability of Bayesian model averaging (BMA) described in *Hoeting et al*. [1999]. Therefore, BMA based on the model selection criteria is referred to as MLBMA hereinafter.

[5] The model selection criteria have been widely used in groundwater modeling for both model selection and model averaging, and they are default outputs of popular groundwater inverse modeling software such as PEST [*Doherty*, 2005], UCODE [*Poeter et al*., 2005], iTOUGH2 [*Finsterle*, 2007], and MMA [*Poeter and Hill*, 2007; *Ye*, 2010]. Their popularity in model selection is due to their quantitative representation of the principle of parsimony. The first term of each criterion, $-2\ln L(\hat{\boldsymbol{\theta}}_k \mid \mathbf{D})$, measures the goodness of fit between predicted and observed data, **D**; the smaller this term, the better the fit. The terms containing *N _{k}* are measures of model complexity. The criteria thus embody (to various degrees) the principle of parsimony by penalizing models that have a relatively large number of parameters when this does not bring about a corresponding improvement in model fit. Their popularity in model averaging is due to their ease of implementation and computational efficiency, particularly in comparison with methods that use Monte Carlo (MC) techniques to calculate model averaging weights.

[6] However, the model selection criteria in equations (3)-(6) aggressively exclude inferior models with relatively large Δ*IC* values. For example, models receive less than 5% probability if their Δ*IC* values are larger than 6. Applying the criteria to hydrologic modeling has sometimes led to the assignment of close to 100% of the averaging weight to one model even when available data and knowledge suggest that excluding the other competing models is unjustifiable. For example, *Meyer et al*. [2007] developed four models simulating uranium transport at the Hanford Site 300 Area of the U.S. Department of Energy (DOE). All the model selection criteria assigned almost 100% averaging weight to a single model, even though this model was not significantly superior to the other three models in matching calibration data. *Singh et al*. [2010] encountered a similar situation when working with nine models developed for one of the corrective action units of the DOE Nevada National Security Site (NNSS), USA. For another NNSS corrective action unit, *Pohlmann et al*. [2007] and *Ye et al*. [2010a] considered 25 groundwater models, each with different recharge components and hydrostratigraphic frameworks. Based on the four model selection criteria, only two models received significant weights, and the weights of the other 23 models were negligible. However, evaluating the models based on expert judgment [*Ye et al*., 2008b] and examining the models' calibration results did not support discarding all 23 models (though it was reasonable to discard some). Similar situations occurred in *Morales-Casique et al*. [2010] for a number of geostatistical and air flow models, in *Diks and Vrugt* [2010] for two cases that involved eight watershed models and seven soil hydraulic models, respectively, and in *Seifert et al*. [2012] for six hydrological models with different conceptual geological configurations.
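The aggressiveness of the exponential weighting is easy to reproduce numerically. In the hypothetical example below (the Δ*IC* values are ours, not from any of the studies cited above), modest Δ*IC* differences already concentrate nearly all weight on one model:

```python
import math

# Hypothetical Delta-IC values for four competing models whose calibration
# fits are not dramatically different
delta_ic = [0.0, 8.0, 9.0, 11.0]
raw = [math.exp(-0.5 * d) for d in delta_ic]
weights = [r / sum(raw) for r in raw]
# the best-ranked model receives essentially all of the averaging weight
```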

[7] *Tsai and Li* [2008a] proposed to address the problem of unjustifiably assigning nearly all model averaging weight to a single model by calculating the weights via

$$w_k = \frac{\exp\left(-\frac{\alpha}{2}\Delta IC_k\right)}{\sum_{l=1}^{K}\exp\left(-\frac{\alpha}{2}\Delta IC_l\right)} \qquad (8)$$

where *α* is a subjective scaling factor. *Tsai and Li* [2008a] gave several examples in which the averaging weight of a single model was reduced from 100% to as little as 60% using reasonable values of *α*. As shown below, however, use of equation (8) with a similar value of *α* did not solve the problem that one model unreasonably receives 100% weight in our numerical experiments.
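A small numerical sketch (our own illustrative values, assuming the α-scaled weighting described above) shows how *α* < 1 flattens the weight distribution relative to equation (2):

```python
import math

def scaled_weights(delta_ic, alpha):
    """Alpha-scaled averaging weights (equation (8));
    alpha = 1 recovers the weights of equation (2)."""
    raw = [math.exp(-0.5 * alpha * d) for d in delta_ic]
    total = sum(raw)
    return [r / total for r in raw]

sharp = scaled_weights([0.0, 6.0], alpha=1.0)  # second model gets < 5%
flat = scaled_weights([0.0, 6.0], alpha=0.2)   # weight is spread more evenly
```

For very large Δ*IC* differences, however, even a small *α* leaves the weight concentrated on one model, consistent with the observation above that this adjustment alone does not resolve the problem.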

[8] *Diks and Vrugt* [2010] evaluated a number of methods for estimating model averaging weights that did not assign 100% weight to a single model in either of their two applications. Several of these methods allow negative model weights. As observed by *Raftery et al*. [2005], negative weights can be difficult to interpret since they imply a negative correlation between a model's simulated value and the predicted (model average) value. In addition, only positive weights can be used when calculating a model-averaged probability density (to avoid negative densities). According to *Diks and Vrugt* [2010], the model averaging weights proposed by *Bates and Granger* [1969] had predictive performance significantly worse than the use of *AIC* or *BIC*. The other methods evaluated in *Diks and Vrugt* [2010] allow only positive model weights and had better predictive performance than the model selection criteria of *AIC* and *BIC*. These methods included Bayesian Model Averaging (BMA) with a likelihood based on a finite mixture model [*Raftery et al*., 2005], BMA with a likelihood based on a linear regression model (with weights constrained to be positive) [*Raftery et al*., 1997], and Mallows model averaging [*Hjort and Claeskens*, 2003; *Hansen*, 2007] (with weights constrained to be positive). With these methods, model weights are determined by fitting the model average result to the calibration data. This is in contrast to the use of equation (2), in which the model selection criteria (and therefore the weights themselves) are determined on the basis of each individual model's fit to the calibration data and on the complexity of the individual models. Unlike the model selection criteria, the BMA methods evaluated in *Diks and Vrugt* [2010] do not include a term representing model complexity. *Ye et al*. [2008a, 2010c] showed that model averaging weights determined using *KIC* (equation (6)) have a rigorous mathematical basis in the context of the BMA method of *Hoeting et al*. [1999]. A comparative study with the BMA method of *Raftery et al*. [1997, 2005] is needed to better understand the theoretical and numerical similarities and differences. Similarly, the investigation of error correlation in this study is expected to be of general use to other model averaging methods. For example, it could be included in the log likelihood function of the BMA method of *Raftery et al*. [2005], in which independence of forecast errors in space and time is explicitly assumed. These additional studies, however, are beyond the scope of this paper.

[9] As described in the remainder of this paper, the problem of assigning an unreasonably large model averaging weight to a single model when using the model selection criteria is caused by disregarding the correlation between total errors (including model errors and measurement errors) when calculating the *NLL* term common to all the model selection criteria. As discussed in section 2 below, the error correlation, reflected in the covariance structure used for maximum likelihood model calibration, affects the calculation of *NLL* and, correspondingly, the evaluation of the model averaging weights. To resolve this problem for temporal data, an iterative two-stage parameter estimation method is developed and introduced in section 3 to incorporate total error correlation into the covariance matrix used for model calibration and *NLL* evaluation. The method is evaluated using synthetic data in section 4 and then applied to an experimental study in section 5 to estimate the model averaging weights of several surface complexation models developed to simulate column experiments of hexavalent uranium [U(VI)]. The effects of disregarding the correlation of total errors on model averaging weights and on the predictive performance of individual models and model averages are evaluated for both the synthetic and experimental studies.