## 1. Introduction

[2] When developing a conceptual model to represent a subsurface formation, uncertainties in model data, structure, and parameters always exist. To accommodate different sources of uncertainty, strategies such as model selection, model elimination, model reduction, model discrimination, and model combination are commonly used to reach a robust model, using single-model approaches [*Cardiff and Kitanidis*, 2009; *Demissie et al*., 2009; *Engdahl et al*., 2010; *Feyen and Caers*, 2006; *Kitanidis*, 1986; *Gaganis and Smith*, 2001, 2006, 2008; *Irving and Singha*, 2010; *Nowak et al*., 2010; *Wingle and Poeter*, 1993] or multimodel approaches [*Doherty and Christensen*, 2011; *Li and Tsai*, 2009; *Morales-Casique et al*., 2010; *Neuman*, 2003; *Refsgaard et al*., 2006; *Rojas et al*., 2008, 2009, 2010a-2010c; *Singh et al*., 2010; *Troldborg et al*., 2010; *Tsai and Li*, 2008a, 2008b; *Tsai*, 2010; *Ye et al*., 2004, 2005; *Wöhling and Vrugt*, 2008].

[3] Although the single-model approach is commonly used for model prediction and uncertainty assessment of hydrologic systems, it has several flaws. *Beven and Binley* [1992] and *Beven* [1993] introduce the concept of equifinality by pointing to the nonuniqueness of catchment models, which is the possibility that the same final solution can be obtained from many potential model propositions. This concept, as coined by *von Bertalanffy* [1968], means that unlike a closed system, whose final state is unequivocally determined by the initial conditions, the final state of an open system may be reached from different initial conditions and in different ways. The problem of model nonuniqueness is salient to almost any field-scale hydrogeological model due to uncertainty about data, model structure, and model parameters. Thus, relying on a single model risks failing to accept a true model or failing to reject a false model [*Neuman and Wierenga*, 2003; *Neuman*, 2003]. In addition, even if a single model can still explicitly segregate and quantify different sources of uncertainty, *Neuman* [2003] points out that adopting one model can lead to statistical bias and underestimation of uncertainty. The hierarchical treatment in this study clearly illustrates this point.

[4] The multimodel approach aims at overcoming the aforementioned shortcomings of the single-model approach by utilizing competing conceptual models that adequately fit the data. Multimodel methods aim at ranking or averaging the considered models through their posterior model probabilities. The most general multimodel method is the generalized likelihood uncertainty estimation (GLUE) [*Beven and Binley*, 1992], which is based on the concept of equifinality [*Beven*, 1993, 2005]. In the first step, different models are generated by Monte Carlo simulation and are accepted as behavioral according to a user-defined threshold on their residual errors. In the second step, the posterior model probability of each accepted model is calculated from the observation data for a given likelihood function.
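The two GLUE steps described above can be sketched in a few lines. The following is only a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the function `glue_weights`, the linear toy model, the RMSE threshold of 0.5, the uniform prior range, and the observation values are all hypothetical, and the inverse error variance is used as one possible likelihood measure.

```python
import math
import random


def glue_weights(simulate, observed, sample_prior,
                 n_samples=2000, threshold=0.5, seed=0):
    """Two-step GLUE sketch (hypothetical helper, for illustration only)."""
    rng = random.Random(seed)
    n_obs = len(observed)

    # Step 1: Monte Carlo sampling; keep only "behavioral" parameter sets,
    # i.e., those whose RMSE does not exceed the user-defined threshold.
    behavioral = []
    for _ in range(n_samples):
        theta = sample_prior(rng)
        predicted = simulate(theta)
        sse = sum((p - o) ** 2 for p, o in zip(predicted, observed))
        if math.sqrt(sse / n_obs) <= threshold:
            behavioral.append((theta, sse))
    if not behavioral:
        raise ValueError("no behavioral models; relax the threshold")

    # Step 2: likelihood of each accepted model (here the inverse error
    # variance), normalized so that the weights sum to one.
    eps = 1e-12  # guards against division by zero for a perfect fit
    raw = [1.0 / (sse / n_obs + eps) for _, sse in behavioral]
    total = sum(raw)
    return [(theta, r / total) for (theta, _), r in zip(behavioral, raw)]


# Toy usage with made-up data: y = theta * x, theta drawn from U(0, 4).
x = [1.0, 2.0, 3.0, 4.0]
observed = [2.1, 3.9, 6.2, 7.8]
result = glue_weights(lambda th: [th * xi for xi in x],
                      observed,
                      lambda rng: rng.uniform(0.0, 4.0))
```

In this toy setting, only samples with `theta` close to 2 pass the threshold, and each receives a weight that grows as its residual error shrinks.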

[5] Variant GLUE methods can be developed by modifying the first step of model generation and acceptance. For example, to move from equifinality to optimality, *Mugunthan and Shoemaker* [2006] show that calibration performs better than GLUE both in terms of identifying more behavioral samples for a given threshold and in matching the output. However, this is a debatable point. For example, *Rojas et al*. [2008] remark that by including a calibration step in multimodel approaches, errors in the conceptual models will be compensated by biased parameter estimates during the calibration, and the calibration result will be at risk of being biased toward unobserved variables in the model [*Refsgaard et al*., 2006]. This study proposes a hierarchical Bayesian averaging approach to address this concern by explicitly segregating different sources of uncertainty.

[6] Variant GLUE methods can also be developed by modifying the second step through the use of different likelihood functions for model averaging. Formal GLUE [*Beven and Binley*, 1992] uses an inverse weighted variance likelihood function, but the method is flexible and allows for diverse statistical likelihood functions such as the exponential function [*Beven*, 2000] or even possibilistic functions [*Jacquin and Shamseldin*, 2007]. Exponential and inverse weighted variance likelihood functions do not account for model complexity and the number of data points and may lack a statistical basis [*Singh et al*., 2010]. *Rojas et al*. [2008, 2010a-2010c] introduce Bayesian model averaging (BMA) in combination with GLUE to maintain equifinality. Although BMA is statistically rigorous, a typical problem is that it tends to favor only a few best models [*Neuman*, 2003; *Troldborg et al*., 2010]. For example, several studies [*Rojas et al*., 2010c; *Singh et al*., 2010; *Ye et al*., 2010b] show that model averaging under formal BMA criteria (AIC, AICc, BIC, and KIC) tends to eliminate most of the alternative models, which may underestimate prediction uncertainty and bias the predictions, while GLUE probabilities are more evenly distributed across all models, resulting in superior prediction. To maintain the use of statistically meaningful functions while avoiding underestimation of uncertainty, *Tsai and Li* [2008a, 2008b] propose a variance window that allows selection of more models while satisfying the constraints imposed by the background knowledge, although it may simultaneously enlarge the magnitude of uncertainty.
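The tendency of information-criterion weights to concentrate on a few models can be seen in a small numerical sketch. It assumes equal prior model probabilities and the commonly used form in which a model's weight is proportional to exp(-ΔIC/2), where ΔIC is the difference from the best (lowest) criterion value. The function `ic_weights`, the criterion values, and the scaling factor `alpha` are hypothetical; `alpha` merely stands in for the idea of widening the acceptance window and is not claimed to reproduce the exact formulation of *Tsai and Li* [2008a, 2008b].

```python
import math


def ic_weights(ic_values, alpha=1.0):
    """Posterior model weights from information-criterion values.

    Assumes equal prior model probabilities. alpha = 1 gives the usual
    exp(-delta/2) weights; 0 < alpha < 1 (a hypothetical scaling factor)
    flattens the weights so that more models are retained.
    """
    best = min(ic_values)
    raw = [math.exp(-0.5 * alpha * (ic - best)) for ic in ic_values]
    total = sum(raw)
    return [r / total for r in raw]


# Made-up criterion values for three competing models.
ic = [100.0, 110.0, 120.0]
sharp = ic_weights(ic)             # nearly all weight on the best model
flat = ic_weights(ic, alpha=0.1)   # weight spread over all three models
```

With these made-up values, a difference of 10 in the criterion is enough for the unscaled weights to discard the second and third models almost entirely, whereas the scaled weights keep all three in play.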

[7] All the previously cited studies are collection multimodel methods, in which all models are at one level. *Wagener and Gupta* [2005] remark that an uncertainty assessment framework should be able to account for the level of contribution of the different sources of uncertainty to the overall uncertainty. In the groundwater area, to advance beyond collection multimodel methods, *Li and Tsai* [2009] and *Tsai* [2010] present a BMA approach that can separate two sources of uncertainty, which arise from different conceptual models and different parameter estimation methods. These were the first two studies to extend the collection BMA formulation of *Hoeting et al*. [1999] to two levels. The current study generalizes the work of *Li and Tsai* [2009] and *Tsai* [2010] to a fully hierarchical BMA method. To our knowledge, this is the first work that extends the BMA formulation in *Hoeting et al*. [1999] to any number of levels for analyzing individual contributions of each source of uncertainty with respect to model data, structure, and parameters.
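For orientation, the single-level (collection) BMA formulation referred to above can be written as follows. The notation is an assumption made here for illustration and may differ from that used later in the paper: $\Delta$ is the predicted quantity, $D$ the data, and $M_i$ the competing models.

$$
E[\Delta \mid D] = \sum_i E[\Delta \mid D, M_i]\,\Pr(M_i \mid D)
$$

$$
\mathrm{Var}[\Delta \mid D] = \sum_i \mathrm{Var}[\Delta \mid D, M_i]\,\Pr(M_i \mid D) + \sum_i \bigl(E[\Delta \mid D, M_i] - E[\Delta \mid D]\bigr)^2 \Pr(M_i \mid D)
$$

The first variance term is the within-model variance and the second is the between-model variance. One way to read the multilevel extension is as a repeated application of these two identities, conditioning on the propositions of one uncertain model component at a time.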

[8] The hierarchical BMA (HBMA) provides more insight than collection BMA into model selection, model averaging, and uncertainty propagation through a BMA tree. Each level of uncertainty represents an uncertain model component with its different competing discrete model propositions. For example, the variogram model selection can be one source of uncertainty, and its competing propositions could be exponential, Gaussian, and pentaspherical variogram models. The proposed HBMA method serves as a framework for evaluating the competing propositions of each source of uncertainty, prioritizing the different sources of uncertainty, and understanding the uncertainty propagation by dissecting uncertain model components.
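The idea of propagating uncertainty through a BMA tree can be illustrated with a small recursive sketch. It is not the derivation given later in the paper; it only applies the laws of total expectation and total variance at each node. The dictionary layout, the function `average`, the two-level tree (variogram model above fault conceptualization), and all probabilities, means, and variances are made up for illustration.

```python
def average(node):
    """Return (mean, variance) of a BMA tree node.

    A leaf is {"mean": m, "var": v}. An internal node is
    {"children": [(probability, child_node), ...]} with probabilities
    that sum to one. Hypothetical layout, for illustration only.
    """
    if "children" not in node:
        return node["mean"], node["var"]
    results = [(p, *average(child)) for p, child in node["children"]]
    mean = sum(p * m for p, m, _ in results)
    # Within-proposition variance: average of the child variances.
    within = sum(p * v for p, _, v in results)
    # Between-proposition variance: spread of the child means.
    between = sum(p * (m - mean) ** 2 for p, m, _ in results)
    return mean, within + between


# Upper level: two variogram propositions; lower level: two fault
# conceptualizations under each. All numbers are made up.
tree = {"children": [
    (0.6, {"children": [(0.5, {"mean": 10.0, "var": 1.0}),
                        (0.5, {"mean": 12.0, "var": 1.0})]}),
    (0.4, {"children": [(0.5, {"mean": 14.0, "var": 2.0}),
                        (0.5, {"mean": 16.0, "var": 2.0})]}),
]}
tree_mean, tree_var = average(tree)
```

Recording the `within` and `between` terms at each node, rather than only their sum, would show how much each level contributes to the total variance, which is the kind of prioritization the paragraph above describes.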

[9] We test the HBMA method on an indicator hydrostratigraphy model to characterize the Baton Rouge aquifer-fault system in Louisiana. The outline of the study is as follows. Section 'Hierarchical Bayesian Model Averaging' shows the derivation of HBMA under maximum likelihood estimation. Section 'Case Study' describes the indicator hydrostratigraphy model and the segregation of its uncertain model components, which are calibration data, variogram model, geological stationarity assumption, and fault conceptualization. Through the BMA tree of the hydrostratigraphic models, section 'Results and Discussion' presents the evaluation of the competing model propositions, the uncertainty propagation, and the prioritization of the uncertain model components. Section 'Conclusions' draws conclusions about the main features of the HBMA.