Corresponding author: H. Yoon, Geoscience Research and Applications Group, Sandia National Laboratories, Albuquerque, NM 87185, USA. (email@example.com)
 Assessing the impact of parameter estimation accuracy in models of heterogeneous, three-dimensional (3-D) groundwater systems is critical for predictions of solute transport. A unique experimental data set provides concentration breakthrough curves (BTCs) measured at a 0.25³ cm³ scale over the 13 × 8 × 8 cm³ domain (∼53,000 measurement locations). Advective transport is used to match the first temporal moments of BTCs (or mean arrival times, m1) averaged at 0.25³ and 1.0 cm³ scales through simultaneous inversion of highly parameterized heterogeneous hydraulic conductivity (K) and porosity (φ) fields. Pilot points parameterize the fields within eight layers of the 3-D medium, and estimations are completed with six different models of the K–φ relationship. Parameter estimation through advective transport shows accurate estimation of the observed m1 values. Results across the six different K–φ relationships have statistically similar fits to the observed m1 values and similar spatial estimates of m1 along the main flow direction. The resulting fields provide the basis for forward transport modeling of the advection-dispersion equation (ADE). Using the estimated K and φ fields demonstrates that advective transport coupled with inversion using dense spatial field parameterization provides an efficient surrogate for the ADE. These results indicate that there is not a single set of model parameters, or a single K–φ relationship, that leads to a best representation of the actual experimental sand packing pattern (i.e., nonuniqueness). Additionally, knowledge of the individual sand K and φ values along with their arrangement in the 3-D experiment does not reproduce the observed transport results at small scales. Small-scale variation in the packing and mixing of the sands causes large deviations from the expected transport results as highlighted in forward ADE simulations.
Highly parameterized inverse estimation is able to identify those regions where variations in mixing and packing alter the expected property values and significantly improve results relative to the naïve application of the experimentally derived property values. Impacts of the observation scale, the scale over which results are averaged, and the number of observations and parameters on the final estimations are also examined. Results indicate the existence of a representative element volume (REV) at 0.25 cm³, the existence of subgrid-scale heterogeneity that impacts transport, and the accuracy of highly parameterized models with even relatively small numbers of observations. Finally, this work suggests that local heterogeneity features below the REV scale are difficult to incorporate into parameterized models, highlighting the importance of addressing prediction uncertainty for small-scale variability (i.e., uncaptured variability) in modeling practice.
 Accurate identification of heterogeneous material properties in the subsurface is essential for predictive modeling of fluid flow and solute transport. Direct observations of the aquifer properties are limited, and data must be expanded and/or interpolated from the observation locations to populate the model domain. Furthermore, available observations of state variables such as pressure and/or concentration can be brought to bear on estimation of hydraulic parameters through inverse modeling approaches [de Marsily et al., 1999; Carrera et al., 2005]. Recently, the importance of identifying the spatial distribution of hydrogeological heterogeneity has been increasingly centered on transport processes in the subsurface for applications including environmental remediation [Murakami et al., 2010; Zachara, 2010], geological storage of CO2 [Baines and Worden, 2004; Department of Energy (DOE), 2007], and water resources management [South Florida Water Management District, 2005; Young et al., 2010]. For strongly nonlinear problems such as reactive transport processes and groundwater flow in highly heterogeneous formations, a majority of inverse parameter estimations are computationally expensive and the impact of small-scale heterogeneity on model prediction can be profound. Additionally, quantification of the uncertainty associated with model predictions resulting from estimated parameters is receiving increased attention in order to better understand different sources of uncertainty [Renard et al., 2010] and provide decision-makers with reliable predictive models [Keating et al., 2010].
 A number of previous works have estimated more than one spatially heterogeneous parameter from the same set of observations in 2-D fields. Most commonly, heterogeneous transmissivity (T) and storativity (S) fields have been estimated from some combination of steady state and transient head observations [Hendricks Franssen et al., 1999; Li et al., 2005, 2007]. Other works have focused on the simultaneous estimation of a heterogeneous T field and a spatially varying transport parameter such as the distribution of sorption coefficient, dispersivity, or contaminant source terms [e.g., Huang et al., 2004; Nowak and Cirpka, 2006; Tonkin et al., 2007; Hosseini et al., 2011]. These approaches have generally been demonstrated on horizontal fields with other researchers estimating multiple properties in 2-D cross-sectional domains [e.g., Fienen et al., 2009]. A question remains in all of these studies regarding the relationship between heterogeneous parameters. Most commonly, independence between the estimated parameter values is assumed, as often necessitated by the lack of observed values for one of the parameters.
 Over the past decade, a number of studies have increasingly focused on parameter estimation in heterogeneous 3-D fields. Lavenue and de Marsily  used observations of a sinusoidally varying pressure test with pilot point parameterization to estimate heterogeneous hydraulic conductivity (K) in a two-layered medium. Hendricks Franssen and Gómez-Hernández  estimated K in a 3-D fractured medium from steady and transient head observations with the locations of fracture zones specified a priori. Llopis-Albert and Capilla  extended this work to account for stochastic representation of complex fracture structures similar to that presented by Gómez-Hernández et al. . Li et al.  used the geostatistical inverse approach to estimate a 3-D K field from steady state drawdown and vertical flowmeter surveys. Riva et al.  used a stochastic Monte Carlo method to describe 3-D random geological facies and hydraulic conductivities in order to capture the features of the depth averaged breakthrough curve (i.e., temporal moments and long tailing). In particular, Riva et al.  used an empirical relationship between K and porosity (φ) to generate φ distribution with a dual medium model. Recently, Schöniger et al.  improved ensemble Kalman filter (EnKF) approaches by applying nonlinear, monotonic transformations to the observed states in parameter estimation from 3-D hydraulic tomography in multi-Gaussian log K fields. Huber et al.  used the pilot point method to calibrate a 3-D groundwater flow model of a strongly heterogeneous aquifer with the hydraulic conductivities at a limited number of locations and the leakage coefficients for a number of zones, which were updated in real time with EnKF.
Doherty  applied the pilot point method for parameter estimation in the context of underdetermined problems. This allowed pilot points to be distributed throughout the model domain, resulting in highly parameterized inverse estimation. Stable and unique inverse estimation of the highly parameterized model was accomplished through regularization approaches such as truncated singular value decomposition (TSVD) and Tikhonov regularization (i.e., hybrid subspace approaches) [Tonkin and Doherty, 2005]. In addition, highly parameterized inversion using pilot points has led to the development of computationally efficient calibration-constrained analysis of model prediction uncertainty by Doherty and coworkers [Moore and Doherty, 2005; Tonkin et al., 2007; Tonkin and Doherty, 2009]. Highly parameterized inverse estimation with pilot points has also been used in 3-D domains to simultaneously estimate three heterogeneous parameters (porosity, horizontal hydraulic conductivity, and vertical hydraulic conductivity) in a model with 11–14 layers conditioned to both head and concentration data [Tonkin and Doherty, 2005; Tonkin et al., 2007]. While parameter estimation with pilot points has become relatively common [Alcolea et al., 2006, 2008; Riva et al., 2010; Doherty and Hunt, 2010; Doherty et al., 2010], its application to estimation of 3-D fields has been limited.
 Here we focus on highly parameterized inverse modeling of spatially heterogeneous fields for a 3-D transport problem. The high level of parameterization is used to estimate fields of spatially correlated properties, and the pilot point method is used to parameterize these spatial fields. Parameter estimation with pilot points often utilizes multiple locations with existing property measurements to condition the estimated field. The experimental data set examined here is unique in that it provides extremely fine-scale (0.25³ cm³), exhaustive 3-D observations of concentrations in a 3-D flowcell over a 10–20 cm scale [Zhang et al., 2007; Yoon et al., 2008]. Contrary to most parameter estimation studies, however, there are no direct measurements of K or φ within the model domain, but independently measured K and φ values are available [Zhang et al., 2007; Yoon et al., 2008]. The exhaustive 3-D observations available from magnetic resonance imaging (MRI) in the flowcell packed with sands are an uncommon measurement set for inversion, but are analogous to time series imaging of transport in a geologic unit with geophysical techniques. This unique concentration data set provides an opportunity to explore the application of inverse parameter estimation for multiple variables using highly parameterized models in 3-D groundwater flow and solute transport problems.
 Objectives of this work are to: (1) Utilize a set of spatially exhaustive 3-D observations to simultaneously estimate heterogeneous K and φ fields through application of highly parameterized models. In particular, we evaluate the capability of six different approaches for quantifying the relationship between K and φ in the estimation process to recover the observed tracer transport. This evaluation also includes the straightforward approach of using the known zonation of the sands and the laboratory-measured K and φ values of each sand. (2) Examine the ability of highly parameterized estimations of K and φ to explain breakthrough curves (BTCs) with and without the addition of a dispersion term in forward solute transport models. If K and φ are parameterized at a high resolution, we hypothesize that it should be possible to use a purely advective transport model to match the first temporal moments of BTCs (or mean arrival times, m1) as a surrogate for more complex transport in the inversion. The estimated fields should then provide accurate transport results when used for forward runs with the advection-dispersion equation (ADE). (3) Determine the impact of varying the number of parameters and the number of observations on the ability of the estimated 3-D fields to match observed data at different scales. Additionally, the impact of the averaging scale of the observations (i.e., 0.25³ and 1.0 cm³) on the resulting estimated fields is examined to evaluate the impact of subscale heterogeneity on forward transport results with the ADE.
2. Experimental Setup and Parameterization
2.1. Experimental Setup
 The experimental procedure and data acquisition are given in detail by Zhang et al.  and Yoon et al. . Here, we briefly describe the experimental setup for flow and transport modeling. Five different sizes of sands were used to construct a heterogeneous K field in a 3-D flowcell. The entire flowcell has dimensions of 21.5 × 9 × 8.5 cm³, and the central portion of the flowcell (14 × 8 × 8 cm³) was packed with 1 cm³ cubes of five different sand types. The K and φ of all five sands were measured independently [Yoon et al., 2008] (Table 1). The sand pattern is based on the distribution of a K field generated using the sequential indicator simulation algorithm (SISIM) [Deutsch and Journel, 1998]. As shown in Figure 1a, there is a distinct, connected, high-K (sands 12/20 and 20/30) feature along the central portion of the flowcell. A snapshot of tracer concentration after 42 min qualitatively illustrates the preferential flowpath along the central portion of the flowcell (Figure 1b). A brass divider (14 × 8 × 1.2 cm³) with a grid of 1 cm² square openings was used to pack the flowcell with dry sands in each 1 cm layer. The rest of the flowcell includes a 4.5 cm zone adjacent to the inlet, a 3 cm zone adjacent to the outlet, a 0.5 cm thick layer on the bottom, and a 0.5 cm thick vertical layer adjacent to the side walls, all of which were filled with the lowest-K sand (50/70). A constant head was maintained at the inlet and outlet reservoirs and a nonreactive, paramagnetic tracer solution was continuously injected through the water-saturated flowcell by maintaining a constant flux. The total time for the tracer solution to flow through the entire heterogeneous region was ∼4–5 h.
Sand type is defined as a sand fraction between two sieve numbers.
 The signal intensity of the tracer concentration using MRI was obtained at a resolution of 0.1875 × 0.1875 × 0.225 cm³ at a regular sampling interval of 2.17 min over the MR imaging region (13 × 8 × 8 cm³), which is slightly smaller than the entire heterogeneous region (14 × 8 × 8 cm³), and was converted into tracer breakthrough curves (BTCs) [Yoon et al., 2008]. Observed concentration BTCs averaged at both the 0.25³ cm³ (i.e., 0.016 cm³; OBS0.016) and 1 cm³ (OBS1.0) scales were used. For model comparison, the mean arrival times (m1) and second central moments (m2c) were calculated from the observed BTCs using the method of temporal moments with a step input [Luo et al., 2006]. Observed data within the top 0.25 cm of the heterogeneous region were not used for model calibration due to decreased imaging accuracy. The observed data actually used are from x = 4.5 to 17.5 cm, y = 0.5 to 8.5 cm, and z = 0.5 to 8.25 cm in the heterogeneous region (Figure 1a), and the total numbers of observed m1 values at the OBS0.016 and OBS1.0 scales are 51,584 and 832, respectively.
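The moment calculation mentioned above can be illustrated with a minimal sketch. It assumes the standard step-input identities m1 = ∫(1 − C/C0) dt and m2 = 2∫t(1 − C/C0) dt, with m2c = m2 − m1², and it uses a synthetic BTC rather than the measured data. The function names and the synthetic curve are hypothetical and are not taken from the cited work.

```python
import math
import numpy as np

def trapezoid(y, t):
    """Trapezoidal integral of y(t) on a possibly non-uniform time grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def step_input_moments(t, c_rel):
    """Return (m1, m2c) for a BTC measured under a continuous (step) input.

    t     : sample times (min), increasing and starting before breakthrough
    c_rel : relative concentration C/C0, rising from 0 toward 1
    """
    t = np.asarray(t, dtype=float)
    deficit = 1.0 - np.asarray(c_rel, dtype=float)
    m1 = trapezoid(deficit, t)            # mean arrival time
    m2 = 2.0 * trapezoid(t * deficit, t)  # second raw temporal moment
    return m1, m2 - m1 ** 2               # second central moment

# Synthetic BTC (assumption): cumulative normal with a 100 min mean arrival
# time and a 10 min spread, sampled at the 2.17 min imaging interval.
t = np.arange(0.0, 300.0, 2.17)
c = np.array([0.5 * (1.0 + math.erf((ti - 100.0) / (10.0 * math.sqrt(2.0))))
              for ti in t])
m1, m2c = step_input_moments(t, c)
```

For this synthetic curve the recovered m1 and m2c should be close to 100 min and 100 min², respectively. The record must extend until C/C0 has effectively reached 1, otherwise both integrals are truncated.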
 In this study, there are no direct measurements of K or φ in the flowcell after packing, but independently measured K and φ values for all five sand types used for packing are available (Table 1). A priori bounds on K and φ parameters were also independently measured over a mix of two sand types. Porosity varied as a function of volume fraction of each sand type and has a range from 0.25 to 0.45. K values were also measured for sand mixtures with the lowest, median, and highest porosity values, and have a range from 0.3 to 61 cm min−1. In this work, the distribution of the K field used for packing and the independently measured values of K and φ on each sand (KPack) are used for comparison of simulation results with the estimated K and φ fields from different parameterizations.
2.2. Parameterization of K and Porosity (φ)
 Six different approaches for quantifying the relationship between heterogeneous K and φ in the estimation process subject to the observed tracer transport were compared. The relationship between K and φ is modeled using the following six different parameterizations:
 1. Independence (KIND). Following work in a number of previous studies where multiple spatially varying parameters were estimated [e.g., Hendricks Franssen et al., 1999; Li et al., 2005; Tonkin and Doherty, 2005], we consider K and φ to be independent of each other. This leads to the largest number of parameters (i.e., 2 × the number of pilot points) to be estimated.
 2. Independence with Zonal Information (KZone). In a variation on the independence parameterization, we add knowledge of the packing distribution of the five different sands as prior information. The packing distribution is referred to as “zonal information”. The spatial estimation of each parameter is limited to only using information from pilot points within the same zone (sand type) within that layer. Zonal information can be analogous to stratigraphic or geological facies information, or lithology reconstruction from other geophysical interpretations. This approach should better represent the discrete boundaries of the sand types in the estimated K and φ fields. The number of estimated parameters is equal to that in the previous case.
 3. Linear model of Coregionalization (KCoreg). The linear model of coregionalization [e.g., Goovaerts, 1997] enables modeling a linear relationship with a specified correlation coefficient between two variables that also exhibit spatial correlation. K is estimated at pilot points and the estimated K field is used to modify an independently generated random field for φ with the same covariance structure as the K field to produce a specified (K–φ) correlation coefficient. Here, we model a linear relationship between log K and φ with the correlation coefficient, positive or negative, in each layer as an estimated parameter. The resulting number of parameters is equal to the number of pilot points where K is estimated plus 24 (the correlation coefficient, mean and standard deviation of porosity in each of eight layers).
 4. Kozeny-Carman Equation (KKC). Empirical relationships between K and φ are also used to reduce the degrees of freedom for estimation problems [Hu et al., 2009; Riva et al., 2008]. For sandy materials, the Kozeny-Carman formula can be used to develop an empirical relationship between K and φ for the five different sands used for packing. Using the characteristic mean grain size and measured K values for the five different sands (Table 1), a linear regression model is developed based on the Kozeny-Carman equation as
 The regression model (R² = 0.989) is used to estimate φ from log K (cm min⁻¹). This parameterization reduces the number of estimated parameters for porosity to zero.
 5. Constant Porosity (Domain) (Konly). In the majority of studies where K (or transmissivity, T) is estimated from head and concentration measurements, φ is assumed to be constant-valued throughout the model domain [e.g., Fienen et al., 2009; Fu and Gómez-Hernández, 2009; Herckenrath et al., 2011]. We include this parameterization here for comparison. This parameterization reduces the number of estimated parameters for porosity to one.
 6. Constant Porosity (Layers) (KLAY). In a variation on the constant porosity parameterization, we add a single estimated porosity value for each layer in the 3-D model. This approach is similar to that used by Kollat et al.  and results in eight parameters for the porosity estimation.
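The construction described for KCoreg (item 3 above) can be sketched as follows. This is an illustration only: it blends standardized log K with an independent random field so that the two are correlated with a target coefficient ρ, and it uses uncorrelated noise as a stand-in for a spatially correlated random field. The function name, the example layer, and the numerical values are assumptions, not the values used in the study.

```python
import numpy as np

def correlated_porosity(log_k, rho, phi_mean, phi_std, rng):
    """Build a porosity field with a target correlation rho to log K.

    log_k    : array of log10 K values for one layer
    rho      : target correlation coefficient (-1 <= rho <= 1)
    phi_mean : mean porosity of the layer
    phi_std  : standard deviation of porosity in the layer
    rng      : numpy random Generator supplying the independent field
    """
    z_k = (log_k - log_k.mean()) / log_k.std()   # standardized log K
    z_e = rng.standard_normal(log_k.shape)       # independent standard field
    z_phi = rho * z_k + np.sqrt(1.0 - rho ** 2) * z_e
    return phi_mean + phi_std * z_phi

rng = np.random.default_rng(0)
# Hypothetical 8 x 14 layer of log10 K values (not the estimated field)
log_k = rng.normal(0.8, 0.45, size=(8, 14))
phi = correlated_porosity(log_k, rho=0.7, phi_mean=0.355, phi_std=0.03, rng=rng)
```

With only 112 cells per layer the sample correlation will scatter around the target value. Clipping φ to a priori bounds, which is not done here, would also change the realized correlation.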
 For the highly parameterized model, 896 pilot points were used, one in the center of each 1 cm³ block within the heterogeneous region. Additionally, the homogeneous inlet region contains 160 pilot points for K (20 per layer) and one pilot point for φ in each layer. Table 2 summarizes the main parameters and the number of pilot points for all parameterization methods. Simulation results with the estimated fields of K and φ from all six parameterizations are compared to each other and to the results obtained from KPack.
Table 2. Summary of the Parameters Used for Six Inversion Models
 Model calibration was performed using the parameter estimation package PEST [Doherty, 2010; Doherty and Hunt, 2010]. In particular, BeoPEST [Hunt et al., 2010], which uses the Message Passing Interface (MPI) protocol suitable for a tightly integrated cluster and cloud computing, was used to improve computational efficiency. Spatially varying parameters considered for estimation are log K and φ. Additionally, the head at the inlet (i.e., head difference between inlet and outlet boundaries) is estimated to match the flowrate entering the inlet of the flowcell because the total flowrate varies during calibration of K and φ fields. The overall objective function to be minimized is the sum of the measurement (Φm) and regularization (Φr) objective functions for the cases (KZone and KIND) where the regularization is employed. The objective function is
where the first term is Φm, the weighted sum of squared errors between the observed (obs) and simulated (sim) m1 values; the second term is Φr; d is a regularization observation vector; Z is a matrix that relates the parameter values (b) to d; μ is a regularization weight factor; the diagonal matrices Qm and Qr contain the squares of the observation and regularization weights, respectively; and the superscript T denotes the transpose. A weight factor of 1000–5000 is used for the fixed head at the inlet, compared to a unit weight factor for all m1 data, to ensure accurate estimation of the flowrate during calibration (i.e., negligible flowrate error). For regularized inversion, a regularization weight of 300 is used for porosity so that the initial Φm and Φr are similar, and the optimal value of the regularization weight factor is calculated by PEST at each iteration during calibration [Fienen et al., 2009; Doherty et al., 2010].
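For reference, one form of the objective function that is consistent with the symbol definitions above can be written as shown here. It is a sketch under the assumption that the regularization weight factor enters as μ², which is the convention used in the PEST documentation for Tikhonov regularization; the grouping of terms may differ from equation (2) as published.

```latex
\Phi = \Phi_m + \mu^{2}\,\Phi_r
     = (\mathbf{obs} - \mathbf{sim})^{T}\,\mathbf{Q}_m\,(\mathbf{obs} - \mathbf{sim})
     + \mu^{2}\,(\mathbf{d} - \mathbf{Z}\mathbf{b})^{T}\,\mathbf{Q}_r\,(\mathbf{d} - \mathbf{Z}\mathbf{b})
```

In this form, a larger μ pulls the estimated parameters toward the preferred values encoded in d, while a smaller μ favors a closer fit to the observed m1 values.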
 Both K and φ are considered to be spatial fields, and the spatial distribution of parameters is represented using pilot points. In this work, kriging with a spherical variogram model was used to interpolate property values over the entire numerical domain based on parameter values at pilot points. The range of the variogram was 4.2 cm for both log K and φ, and the sills of log K and φ were 0.2 and 0.005, respectively. It is noted that in this work the value of the sill has no effect on calibration; rather, the proportional contributions from the kriging factors modify the heterogeneous field [Doherty et al., 2010]. To reflect the packing procedure in which 1 cm thick physical layers (eight layers in total above the 0.5 cm bottom layer) were packed independently, interpolation of parameter values from pilot points to the model nodes was undertaken separately for each layer (i.e., 2-D interpolation). The number of neighboring pilot points used in the local kriging interpolation to the model grid was limited to five to account for local heterogeneity by avoiding oversmoothing of the calibrated parameter field [Doherty et al., 2010]. The resulting pilot point spacing of 1 cm is much smaller than the variogram range, which renders the influence of the variogram range negligible [Cirpka and Nowak, 2003] and allows local heterogeneity to be captured by the parameter values at pilot points during calibration [Doherty et al., 2010]. While the variogram parameters (i.e., range and sill) are not estimated in this work, the parameters describing the spatial variability of heterogeneity can be estimated by Monte Carlo simulations with the pilot point method [Castagna and Bellin, 2009] and by moment equations [Riva et al., 2010].
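The layer-wise interpolation from pilot points can be illustrated with a small ordinary kriging sketch that uses a spherical variogram and the five nearest pilot points. This is not the PEST utility implementation; the function names, the zero nugget, and the example grid node are assumptions made for illustration.

```python
import numpy as np

def spherical_variogram(h, sill, vrange):
    """Spherical variogram with zero nugget (an assumption of this sketch)."""
    s = np.minimum(np.asarray(h, dtype=float) / vrange, 1.0)
    return sill * (1.5 * s - 0.5 * s ** 3)

def ordinary_kriging_weights(pilot_xy, target_xy, sill, vrange, n_neighbors=5):
    """Return (indices, weights) of the nearest pilot points for one grid node."""
    d_target = np.linalg.norm(pilot_xy - target_xy, axis=1)
    idx = np.argsort(d_target)[:n_neighbors]
    pts = pilot_xy[idx]
    n = len(idx)
    # Ordinary kriging system in variogram form with a Lagrange multiplier
    a = np.ones((n + 1, n + 1))
    pair_dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    a[:n, :n] = spherical_variogram(pair_dist, sill, vrange)
    a[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = spherical_variogram(d_target[idx], sill, vrange)
    weights = np.linalg.solve(a, b)[:n]
    return idx, weights

# Pilot points at the centers of the 1 cm blocks of one 14 x 8 layer
xs, ys = np.meshgrid(np.arange(0.5, 14.0, 1.0), np.arange(0.5, 8.0, 1.0))
pilot_xy = np.column_stack([xs.ravel(), ys.ravel()])

# Weights for one hypothetical 0.25 cm grid-cell center
node = np.array([3.125, 2.375])
idx, w = ordinary_kriging_weights(pilot_xy, node, sill=0.2, vrange=4.2)
# An interpolated value would then be np.dot(w, values_at_pilot_points[idx])
```

Because the sill only scales the variogram, the weights returned for sill = 0.2 and sill = 0.005 are identical in this sketch, which is consistent with the remark above that the sill does not affect calibration.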
 In the 3-D domain, each 1 cm packing layer comprises four numerical grid cells (0.25 cm each) in the vertical direction, and these four cells share the same material properties (i.e., K and φ). The initial values of K and φ at each pilot point were set to the mean of the measured values of the five sand materials (Table 1). The pilot points are, in general, uniformly distributed over the heterogeneous region. For the most highly parameterized models, the pilot points are located at the center of each 1 cm³ block based on the packing distribution. For the homogeneous inlet region, a total of 160 pilot points (i.e., 20 pilot points per layer) were uniformly distributed to properly calibrate the mean arrival times upstream of the heterogeneous region. In this region, log K is estimated at each pilot point, but φ is estimated once per layer because a uniform porosity is expected from packing a single sand size.
 The parameter estimation with the OBS0.016 data (51,584 m1 values) is solved using the truncated singular value decomposition (TSVD) approach within PEST [Doherty, 2003, 2010; Tonkin and Doherty, 2005, 2009]. For the cases in which φ is calibrated as an independent parameter (KIND and KZone), Tikhonov regularization was additionally employed to regularize φ. Through Tikhonov regularization, preferred values for φ were provided, and the balance between misfit and departure from the preferred values is achieved by selecting an optimal solution according to equation (2). In this study, φ was regularized by using a single constant preferred value (0.355) for all five sands based on the measured values (Table 1). In addition, a priori bounds on all K and φ parameters were assigned based on independently measured log K (−1 to 2) and φ (0.25 to 0.45) values over different sand mixtures as described earlier. A relatively low truncation criterion for singular values (i.e., 1 × 10⁻⁹) was used because of the different weighting factor on the flowrate, so that all significant parameters are estimated during calibration. The hybrid Tikhonov and TSVD approach provides a tool to achieve a stable and effective estimation, which is particularly well suited for highly parameterized models [Doherty et al., 2010].
 Parameter estimation with the 1 cm³ observed data (OBS1.0, 832 m1 values) is solved using the SVD-assist approach within PEST [Doherty et al., 2010; Tonkin and Doherty, 2005, 2009]. For this case, no Tikhonov regularization is applied, but the same parameter bounds used for OBS0.016 are employed. In SVD-assist, the parameter space is decomposed based on primary components from SVD of the sensitivity matrix (i.e., the weighted Jacobian matrix), whose linear combinations are called superparameters in PEST. The calibration solution space that can be estimated from the calibration data set is determined by truncating the SVD at the TSVD threshold [Doherty and Hunt, 2010]. For highly parameterized models, the TSVD in the SVD-assist technique is numerically stable and computationally efficient because the parameters to be calibrated are effectively reduced to a smaller set of calibration solution parameters, so that sensitivities need not be computed with respect to all adjustable parameters. The maximum number of calibration solution parameters (superparameters) is set to the number of observed data used for calibration. The threshold value for singular values is set to 1 × 10⁻⁹.
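The role of singular value truncation in an underdetermined problem can be sketched with a single linearized update step. This is a generic illustration, not PEST's algorithm: the function name is hypothetical, and applying the threshold to the ratio of each singular value to the largest one is an assumption of this sketch that may differ from the criterion PEST uses internally.

```python
import numpy as np

def tsvd_update(jacobian, residuals, weights, threshold=1e-9):
    """Minimum-norm parameter update from a truncated SVD.

    jacobian  : (n_obs, n_par) sensitivities of simulated values to parameters
    residuals : (n_obs,) observed minus simulated values
    weights   : (n_obs,) squared observation weights (diagonal of Qm)
    threshold : singular values below threshold * largest are discarded
    """
    sqrt_w = np.sqrt(np.asarray(weights, dtype=float))
    a = sqrt_w[:, None] * jacobian   # weighted sensitivity matrix
    r = sqrt_w * residuals           # weighted residual vector
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    keep = s >= threshold * s[0]     # retained solution-space directions
    step = vt[keep].T @ ((u[:, keep].T @ r) / s[keep])
    return step, int(keep.sum())

# Toy underdetermined case: 3 observations, 5 parameters (random stand-ins)
rng = np.random.default_rng(0)
jac = rng.normal(size=(3, 5))
res = rng.normal(size=3)
step, n_kept = tsvd_update(jac, res, np.ones(3))
```

With more parameters than observations, at most as many directions as there are observations can be retained, which mirrors the cap on the number of superparameters described above.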
2.4. Numerical Simulation
 For parameter estimation, 3-D, steady state, water flow was simulated using MODFLOW-2000 [Harbaugh et al., 2000] and advective tracer transport was simulated using the particle tracking code MODPATH [Pollock, 1994]. A uniform numerical model grid spacing of 0.25 cm was used in the entire 3-D domain. The top, bottom and side boundaries were assigned as zero-flux boundaries, and constant head boundaries at the inlet and outlet were imposed, but a head difference at the inlet and outlet was calibrated to match the overall flowrate (5.25 cm3 min−1). At the start of each simulation, the tracer is introduced into the steady flow field by placing a total of 78,336 particles (64 particles per model grid cell) uniformly over the cross-section (i.e., y-z plane) at a specified x-location upgradient of the heterogeneous portion of the domain. The number of particles used in this work ensures that all numerical grid cells have particles pass through, such that mean arrival times can be computed. Individual particle locations in the heterogeneous domain are tracked at an interval of 0.5 min, and the forward MODPATH simulations were performed until all particles exited the heterogeneous domain. A histogram is constructed by counting the number of particles that arrive at each grid cell within the time interval. The histograms are converted into BTCs by normalizing over the total number of particles within each grid cell over the entire simulation time.
 In addition, forward solute transport (i.e., the ADE) was simulated to determine whether highly parameterized estimations of K and φ obtained with an advective (surrogate) transport model provide accurate transport results compared to the observed BTCs. The estimated K and φ fields were used as input for the ADE simulations. The longitudinal dispersivity (αL) was set equal to the mean grain diameter for each sand type based on the packing distribution. The transverse dispersivity (αT) was set to one tenth of the longitudinal dispersivity, as discussed in Yoon et al. . The integrated finite difference simulator STOMP (Subsurface Transport Over Multiple Phases) [White and Oostrom, 2006] was used to solve the ADE; the grid spacing was 0.25 cm in the x direction and 0.125 cm in the y and z directions to ensure numerical accuracy. A third-order scheme using the total variation diminishing (TVD) technique was used to solve advective transport. The boundary and initial conditions were the same as described above. Additional description of the STOMP simulation is reported in Yoon et al. .
2.5. Model Evaluation
 In this study, parameter estimations have been performed with two sets of observed m1 data (OBS0.016 and OBS1.0). Observed and simulated BTCs were compared at three different averaging scales: (1) the numerical grid scale of 0.25³ cm³ (0.016 cm³), (2) the intermediate scale of a 0.25 cm × 1 cm² (0.25 cm³) slice in the transverse directions (y-z directions), and (3) the physically based block scale of 1 cm³. These three averaging scales span 2 orders of magnitude and are hereafter referred to as the AVG0.016, AVG0.25, and AVG1.0 scales. For model comparison, m1 and m2c were calculated from the simulated BTCs using the method of temporal moments as described earlier.
 The quality of estimated K distributions compared to the actual sand packing pattern is assessed by mapping accuracy. In this study, mapping accuracy is the fraction of grid cells whose estimated K corresponds to the sand type used for physical packing (KPack). In addition, grid cells whose estimated K corresponds to a sand type adjacent in order of K value (e.g., matching sand 20/30 as sand 12/20 or sand 30/40) are also counted as accurate. A higher mapping accuracy indicates that a larger portion of the estimated K values match the corresponding sand type in the packing pattern.
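The mapping accuracy defined above can be sketched as follows. How an estimated K value is assigned to a sand type is not specified in the text, so this sketch assumes assignment to the sand whose measured K is nearest in log10 space. The reference K values and the example arrays are hypothetical placeholders, not the Table 1 measurements.

```python
import numpy as np

def mapping_accuracy(k_est, sand_true, sand_k):
    """Return (exact, exact_or_adjacent) fractions of correctly mapped cells.

    k_est     : estimated K per grid cell (cm/min)
    sand_true : packed sand index per cell, indexing sand_k
    sand_k    : reference K per sand type, sorted in increasing order
    """
    log_ref = np.log10(np.asarray(sand_k, dtype=float))
    log_est = np.log10(np.asarray(k_est, dtype=float))
    # Assumption: classify each cell by the nearest reference K in log space
    est_class = np.argmin(np.abs(log_est[..., None] - log_ref), axis=-1)
    offset = np.abs(est_class - np.asarray(sand_true))
    return float((offset == 0).mean()), float((offset <= 1).mean())

# Hypothetical reference K values for five sands, finest to coarsest
sand_k = [1.0, 3.0, 7.0, 15.0, 30.0]
sand_true = np.array([0, 1, 2, 3, 4, 2])
k_est = np.array([1.1, 8.0, 6.0, 2.5, 28.0, 40.0])
exact, with_adjacent = mapping_accuracy(k_est, sand_true, sand_k)
```

The second returned value corresponds to the definition in the text, where a match to a neighboring sand type in K order also counts as accurate.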
 Several statistical measures are used to evaluate the accuracy of the calibrations and ADE simulations: the mean error (ME), the mean absolute error (MAE), and the root mean square error (RMSE). Error is calculated by (observed value – simulated value). Outliers are defined as estimates with ME greater than 15 min or less than −15 min, and are used solely to highlight the number of extreme mismatches. MAE represents the measure of mismatch between measured and estimated results and RMSE emphasizes mismatches of larger magnitudes. In addition, the AICc [Hurvich and Tsai, 1989] and KIC [Kashyap, 1982], which are model selection criteria based on model discrimination (or information), are computed after inversion in order to rank different parameterizations. In this work, the model selection criteria for groundwater models are defined as in Poeter and Hill 
where n is the number of observations, σ² is the estimated residual variance (= Φm/n), k is the number of parameters (= the number of calibrated model parameters + 1), X is the sensitivity matrix, and ω is the weight matrix. The model selection criteria account for goodness of fit between observed and predicted states using maximum likelihood estimates of model parameters, and for model complexity based on the principle of parsimony, which penalizes models with a large number of parameters depending on the improvement of model fit with more parameters. The determinant in the last term of KIC is computed using the first term (Φm) in equation (2), and the role of this determinant in a Fisher information matrix term can be found in Ye et al. . The model with the lowest value in equation (3) is the best model (i.e., minimizing information loss). In the context of highly parameterized models with the number of parameters exceeding the number of observations, practical and theoretical issues are discussed in Poeter and Hill , Riva et al. , and Doherty .
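The evaluation measures of this section can be sketched in a few lines. The error statistics follow the definitions given above (error = observed − simulated, outliers beyond ±15 min). The AICc and KIC expressions are the commonly quoted least-squares forms, AICc = n ln σ² + 2k + 2k(k + 1)/(n − k − 1) and KIC = (n − (k − 1)) ln σ² − (k − 1) ln 2π + ln|XᵀωX|; they are assumptions of this sketch and should be checked against the cited definitions before use. Function names are hypothetical.

```python
import numpy as np

def fit_statistics(obs, sim, outlier_limit=15.0):
    """ME, MAE, RMSE and outlier count, with error = observed - simulated."""
    err = np.asarray(obs, dtype=float) - np.asarray(sim, dtype=float)
    return {
        "ME": float(err.mean()),
        "MAE": float(np.abs(err).mean()),
        "RMSE": float(np.sqrt((err ** 2).mean())),
        "n_outliers": int((np.abs(err) > outlier_limit).sum()),
    }

def aicc(phi_m, n, k):
    """Corrected Akaike criterion; requires n > k + 1."""
    sigma2 = phi_m / n
    return n * np.log(sigma2) + 2.0 * k + 2.0 * k * (k + 1.0) / (n - k - 1.0)

def kic(phi_m, n, k, x, w):
    """Kashyap criterion; x is (n, k - 1), w holds the diagonal of the weight matrix."""
    sigma2 = phi_m / n
    # slogdet avoids overflow or underflow when forming the determinant directly
    _, logdet = np.linalg.slogdet(x.T @ (w[:, None] * x))
    return (n - (k - 1)) * np.log(sigma2) - (k - 1) * np.log(2.0 * np.pi) + logdet

stats = fit_statistics([10.0, 20.0, 30.0], [12.0, 18.0, 50.0])
```

When the number of parameters exceeds the number of observations, XᵀωX is singular and the log determinant is not finite, which is one of the practical issues for highly parameterized models noted above.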
3.1. Parameter Estimation With Highly Parameterized Models
 Parameter estimations with highly parameterized models were performed with the OBS0.016 m1 data using MODFLOW and MODPATH (i.e., advection only). Parameters employed in each method are listed in Table 2. K and φ distributions for KPack and the resulting estimations for all six K–φ relationships are shown in Figure 2 for four of the eight horizontal layers. Estimated K fields for all six methods capture the main features of the packing, in which high-K zones are predominantly interconnected through the central regions of the flowcell (Figure 1). Pairs of inverse models with similar parameterizations (e.g., KZone–KIND, KCoreg–KKC, and KLAY–Konly) produce similar estimated K patterns. The quality of the estimated K distributions compared to the original packing pattern is quantitatively assessed by mapping accuracy in Table 3. On average, all six K–φ relationships estimate the locations of the different sand types used in KPack with greater than 50% accuracy. As expected, adding the zonation boundaries as prior information in the inversion gives the KZone approach a much better mapping accuracy than the other methods. KIND has a better mapping accuracy than the remaining four parameterizations, but its confidence interval slightly overlaps theirs (Table 3), indicating that more parameterization without zonal information leads to slightly better identification of the packing pattern, although the increase in accuracy is not statistically significant.
Table 3. Mapping Accuracy of K Distribution Compared to the Packing Pattern (mapping accuracy in percent, with 95% confidence interval)
 Even with the zonal parameterization, there is a high degree of variability within the zones relative to the assumed internally homogeneous packing pattern. As an example, Figure 2 identifies several regions with K values higher and lower than those used in KPack. The lower-K regions correspond to areas where coarse sand is in contact with fine sand in the original packing and represent regions where fine sand has mixed into the intergrain volumes of the coarse sand, thus reducing K and φ in those areas. The higher-K regions represent relatively interconnected homogeneous regions such as the central portions of the flowcell (e.g., dark red color in Figure 2), where loose packing of the sand may result in higher K and φ [Zhang et al., 2007, 2008]. This result highlights the ability of the highly parameterized model to capture key features of the heterogeneity.
 Statistics of mean arrival times (m1) for all parameterizations are very similar (Figure 3). As expected, the KZone and KIND sets have slightly lower MAE and RMSE values than the other parameterization sets, primarily because they use more parameters. Relative to KPack, RMSE was reduced more than MAE, indicating that the magnitudes of the largest mismatches are significantly reduced by all parameterizations, as shown in the box plots. AICc and KIC values for the six parameterizations are listed in Table 4. For the OBS0.016 data, the rank in both criteria is the same and the KZone set has the lowest value, indicating that higher parameterization with additional information (i.e., zonal information) leads to an improvement in model fit. The KIND and KCoreg sets are ranked next, and the improvement of model fit in KIND can be considered a direct result of using more parameters. For the OBS1.0 data, only AICc is computed because of the complexity of computing the determinant term in KIC. In addition, the number of estimated parameters for all six parameterizations was set to the number of observed data (n = 833) through TSVD, so that AICc is strongly determined by goodness of fit (i.e., the first term in AICc). AICc values for the KZone and KIND sets are lower than those for the other sets. This may indicate that the information contained in OBS1.0 was best transferred to the parameters by the highly parameterized models with more parameters, such as KZone and KIND. Further work is needed to provide practical and theoretical guidance for applying the model selection criteria to highly parameterized models.
Table 4. AICc and KIC Results for Six Parameterizations
Values outside and within parentheses represent results calibrated with OBS0.016 (n = 51,585) and OBS1.0 (n = 833) data, respectively. For results with OBS1.0, the number of calibration solution parameters is set to the number of observed data (n = 833) by TSVD.
k is the number of parameters (= the number of calibrated model parameters + 1) in equation (3).
Φm is the measurement objective function (= σ² × n).
 Compared to KPack, all inverse results reduce the mismatch of m1 and thus lower the objective function (equation (2)). This demonstrates that the actual experiment differs from the designed experiment and that inverse modeling improves the results relative to KPack, which relies on the experimental design alone. Comparison of the m1 profiles for KPack and the six parameterizations in Figure 4 reveals that the simulated m1 profiles match the trend of the observed m1 well. In particular, the estimated m1 profiles for all inverse parameterizations are highly similar, despite the different K and φ distributions (Figure 2). This result indicates that all of the highly parameterized models used in this work can characterize the moderately heterogeneous field in the flowcell conditional on the m1 observations, despite their different parameterizations of K and φ.
 Comparison of the estimated K fields for KIND and Konly reveals that the shapes of the high-K regions are very similar, but Konly has more pronounced hot spots of the highest K values and its high-K regions are larger than those of KIND (Figure 2). This result indicates that if φ is not included in the estimation, the estimation process has to make significant changes to K in order to achieve calibration conditional on the solute transport observations; that is, model structure error leads to parameter error. By contrast, the KIND parameterization needed a bimodal φ distribution in order to fit the mean arrival times affected by local heterogeneity (i.e., larger m1 to reduce the objective function). Heterogeneity at these unresolved scales is a possible reason for the relatively low to moderate K–φ correlation for KZone: the correlation coefficients were 0.54, 0.44, 0.59, and 0.22 from the top to the bottom layers in Figure 2. This pattern is also evident when comparing the results for KCoreg and KKC. KKC has the higher K contrast, while KCoreg has the higher porosity contrast (Figure 2). KKC can be considered a special case of KCoreg in which a perfect correlation between K and φ is imposed through the Kozeny-Carman equation. This comparison highlights the importance of local-scale heterogeneity for inverse parameter estimation and the need to account for the impact of the observation scale on transport modeling.
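To illustrate why a Kozeny-Carman type relation ties K and φ together deterministically, the following minimal sketch uses the textbook form K = (ρg/μ)(d²/180)φ³/(1 − φ)². The exact form, grain diameter, and fluid properties used for KKC in this study are not given in this section, so the function name `kozeny_carman_k` and all default values are assumptions.

```python
def kozeny_carman_k(porosity, grain_diameter_m,
                    density=1000.0, gravity=9.81, viscosity=1.0e-3):
    """Hydraulic conductivity (m/s) from the textbook Kozeny-Carman relation.

    Defaults approximate water at room temperature (assumed values):
    density in kg/m^3, gravity in m/s^2, dynamic viscosity in Pa*s.
    """
    if not 0.0 < porosity < 1.0:
        raise ValueError("porosity must lie strictly between 0 and 1")
    if grain_diameter_m <= 0.0:
        raise ValueError("grain diameter must be positive")
    # Intrinsic permeability (m^2) as a function of porosity and grain size.
    permeability = (grain_diameter_m ** 2 / 180.0) * (
        porosity ** 3 / (1.0 - porosity) ** 2)
    # Convert permeability to hydraulic conductivity with the fluid properties.
    return permeability * density * gravity / viscosity


# For a fixed (hypothetical) grain size of 0.5 mm, K rises monotonically with
# porosity, so every porosity value maps to exactly one K value.
k_loose = kozeny_carman_k(0.40, 5.0e-4)
k_dense = kozeny_carman_k(0.30, 5.0e-4)
assert k_loose > k_dense
```

Because K is a single-valued function of φ under this relation, the two fields cannot vary independently, which is the sense in which such a model is a special, perfectly correlated case of a coregionalized parameterization.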
3.2. Transport Modeling and Impact of Scales
3.2.1. Effect of the Estimated Fields of K and φ on ADE Results
 The calibrated fields of K and φ from all six highly parameterized models using the OBS0.016 and OBS1.0 data were used to simulate solute transport with the ADE. In total, there are 12 sets of estimated K and φ fields (i.e., six per observed data set). All statistical results of m1, m2c, and BTCs over three different averaging scales (AVG0.016, AVG0.25, and AVG1.0) using the ADE are presented in Table 5. The averaging scale refers to the scale at which concentrations from the ADE results and the observations are volume averaged before m1 and m2c are computed. First, MAE and RMSE values of m1 in Table 5 increased compared to those with advection only at the AVG0.016 scale (Figure 3). Since dispersivities reflect the small-scale fluctuations (i.e., mechanical dispersion) due to K and φ variability at the sub-Darcy scale, it is expected that m1 estimation with the ADE using fields calibrated with advection only leads to an increased level of mismatch. Although the mismatch is inevitable, it can be minimized with highly resolved estimation of the K and φ variability. As shown in the m1 profiles along the main flow (x) direction (Figure 5), m1 errors from the ADE increase, with systematic over- or underestimation of the observed m1 values, compared to the advective transport results (Figure 4). However, the overall increase in m1 mismatch due to dispersion in the ADE was not significant. In addition, m1 profiles for the ADE results using calibrated fields from all six methods are similar. This result shows that using advective transport with a highly parameterized model as a surrogate in an inversion on concentration data alone yields fields that accurately model transport under the ADE.
Table 5. Statistical Results of Advection-Dispersion Simulations for the Original Sand Packing and Six Calibrated Fields With the Observed Data at the 0.016 cm³ and 1 cm³ Calibration Scales
The numbers of observed m1 data at the 0.016 cm³ and 1 cm³ calibration scales are 51,584 and 832, respectively. Statistical values outside and within parentheses represent results of the forward ADE simulations for the fields calibrated with the observed data at the 0.016 cm³ and 1 cm³ calibration scales, respectively.
The number of outliers is counted if ME is greater than 15 min or less than −15 min.
The number in brackets represents the number of simulated and observed data at each averaging scale.
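The error statistics reported for m1 (ME, MAE, RMSE, and the outlier count with the ±15 min threshold) can be sketched as follows. This is a minimal illustration with error defined as observed minus simulated; the function name `error_stats` and the example arrival times are hypothetical.

```python
import math


def error_stats(observed, simulated, outlier_limit=15.0):
    """ME, MAE, RMSE, and outlier count for paired values (e.g., m1 in minutes).

    Error is (observed - simulated); a pair is an outlier when its error is
    greater than +outlier_limit or less than -outlier_limit.
    """
    if len(observed) != len(simulated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [o - s for o, s in zip(observed, simulated)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "outliers": sum(1 for e in errors if abs(e) > outlier_limit),
    }


# Hypothetical mean arrival times in minutes; the errors are 5, -10, 0, and 20.
stats = error_stats([100.0, 120.0, 90.0, 150.0], [95.0, 130.0, 90.0, 130.0])
# ME = 3.75, MAE = 8.75, RMSE = sqrt(131.25), outliers = 1
```

Because RMSE squares each error before averaging, the single 20 min mismatch dominates it far more than it dominates MAE, which is why a larger drop in RMSE than in MAE points to a reduction of the most extreme mismatches.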
 Inspection of the BTCs at selected points indicates that the majority of the observed BTCs are relatively well matched. However, as indicated by the number of outliers in Table 5, 3–6% of the BTCs at the AVG0.016 scale are significantly mismatched. Results of simulated and observed BTCs at the AVG0.016 and AVG0.25 scales are shown in Figure 6. Estimated fields with OBS0.016 and OBS1.0 were used for simulations. At all three locations, simulated results match m1 relatively well. Figure 6a–6f and Table 5 show that highly resolved estimation using only m1 data is able to accurately reproduce the full BTCs at most locations. However, the agreement between observed and simulated BTCs varies. In particular, observed BTCs at one location (Figures 6g and 6h) show the effect of subgrid scale heterogeneity on BTCs over all scales. Even at the AVG0.016 scale, the sharp increase of concentration is followed by gradual increase after ∼90 min, indicating that this grid cell receives mixed transport signals from upstream regions of differing velocity values, i.e., particles that are younger or older due to their upstream travel history. These particles can flow into one cell by simple discretization effects (e.g., misalignment between the cell boundaries and streamtubes, and/or numerical transverse dispersion), or by actual transverse dispersion. Hence, tailing and irregular shapes of BTCs in Figures 6g and 6h are mainly attributed to local heterogeneity. It is also noted that highly parameterized calibrated models with fixed dispersivities are able to capture some of the effect of local heterogeneities in a 3-D real packing problem.
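For readers who want to relate BTC shape to the moments discussed here, the following minimal sketch computes the normalized first temporal moment (m1) and the second central moment (m2c) by trapezoidal integration. It assumes a pulse-type signal; for a step-type BTC the same function would be applied to the time derivative of concentration instead. The function names are hypothetical and this is not the processing used in the study.

```python
def _trapezoid(times, values):
    """Trapezoidal integral of values over (possibly uneven) times."""
    return sum(0.5 * (values[i] + values[i + 1]) * (times[i + 1] - times[i])
               for i in range(len(times) - 1))


def temporal_moments(times, conc):
    """Return (m1, m2c): mean arrival time and second central temporal moment."""
    if len(times) != len(conc) or len(times) < 2:
        raise ValueError("need at least two paired samples")
    m0 = _trapezoid(times, conc)  # zeroth moment (area under the curve)
    if m0 <= 0.0:
        raise ValueError("zeroth moment must be positive")
    m1 = _trapezoid(times, [t * c for t, c in zip(times, conc)]) / m0
    m2c = _trapezoid(times, [(t - m1) ** 2 * c
                             for t, c in zip(times, conc)]) / m0
    return m1, m2c


# Symmetric triangular pulse centered at t = 2 (arbitrary units).
m1, m2c = temporal_moments([0.0, 1.0, 2.0, 3.0, 4.0],
                           [0.0, 1.0, 2.0, 1.0, 0.0])
# m1 = 2.0 and m2c = 0.5 for this sampled pulse
```

Two BTCs can share the same m1 while differing in m2c and in tailing, which is why matching m1 alone does not guarantee a match of the full curve at every location.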
3.2.3. Impact of Observation Scales
 For the ADE results in Table 5, the number of outliers (i.e., ME > 15 min or ME < −15 min) decreased dramatically, by a factor of 25 to 160, from the AVG0.016 to the AVG0.25 scale, compared with the factor-of-16 difference in averaging volume between the two scales. In contrast, all statistics at the AVG0.25 and AVG1.0 scales are very similar for the fields calibrated with both sets of observed data (OBS0.016 and OBS1.0), indicating that the AVG0.25 scale can be considered a representative elementary volume (REV).
 Comparison of the statistics in Table 5 for KCoreg and KKC, as well as for KLAY and Konly, shows that MAE and RMSE values of m1 for KKC and Konly were lower than those for KCoreg and KLAY, respectively. Note that KCoreg and KLAY have more calibrated parameters related to φ than KKC and Konly. In contrast to the results with advection only, where KCoreg and KLAY have lower objective functions (Table 4), the results of the ADE simulations reveal that parameters related to φ should be calibrated carefully with advection only to avoid overfitting. Comparison of the K–φ fields calibrated with the OBS0.016 and OBS1.0 m1 data (Figures 2 and 7) shows that the calibrated K fields are quite similar, while the φ fields for KZone and KIND calibrated with the OBS1.0 m1 data (Figure 7) show less contrast because φ at a majority of pilot points was not involved in calibration (i.e., not included in the calibration solution space) owing to the small number of observations (832 m1) compared to the number of parameters (1954). It should be noted that the statistical results for the fields calibrated with OBS1.0 (832 m1) and OBS0.016 (51,584 m1) are in good agreement at all three AVG scales, while individual statistical values for each K–φ relationship vary (Table 5). This result indicates that OBS0.016 data (below an REV scale) are not necessarily helpful in accounting for the impact of local heterogeneity on transport.
3.3. Impact of Number of Observations and Parameters
 The impact of the number of observations and parameters is investigated by estimating the parameters at the AVG1.0 scale while varying the number of OBS1.0 m1 data (48, 96, 224, 416, and 832) and the number of parameters (256, 456, 616, 1065, and 1952), corresponding to 112 to 896 pilot points in the heterogeneous region. All pilot points are uniformly distributed in the 3-D field, and K and φ are independently estimated at every pilot point (i.e., KIND parameterization with a varying number of pilot points). For example, 256 parameters are located at 120 pilot points with 15 pilot points in the heterogeneous region and one pilot point in the homogeneous region for each layer. For each combination of parameters and observations, the initial parameter values are the calibrated results from the previous estimation with OBS1.0 (832 m1). Calibration to the m1 data was performed with MODPATH (advection only).
 RMSE values of m1 at the AVG0.016 and AVG1.0 scales are shown in Figure 8. At both scales, RMSE values are distinctly different between 1065 and 616 parameters. For 1065 parameters, pilot points are located every two 1 cm³ cells; hence, the distance between pilot points was less than 2 cm and shorter than the correlation length (4.2 cm) of the spherical variogram model. For 616 parameters, pilot points are located every four cells, so the distance between pilot points is close to the correlation length. Hence, the number of pilot points within the correlation length is an important criterion for interpolation from pilot points to neighboring numerical cells, as demonstrated in analytical work on interpolation uncertainty as a function of data density on regular grids [Cirpka and Nowak, 2003]. A practical guideline on pilot point spacing is summarized in Doherty et al. The RMSE values with 456 parameters are higher than those with 256 parameters (Figure 8). Overall, Figure 8 shows that, for the cases studied in this work, pilot point spacing must be less than the correlation length, and accuracy in the calibrations is more strongly influenced by the number of parameters than by the number of observations.
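The role of pilot point spacing relative to the correlation length can be illustrated with the standard spherical variogram model. The sketch below assumes that the 4.2 cm correlation length is the variogram range, with unit sill and no nugget; these are assumptions, as is the function name `spherical_correlation`.

```python
def spherical_correlation(lag, corr_range):
    """Correlation implied by a unit-sill, zero-nugget spherical variogram.

    gamma(h) = 1.5 * (h / a) - 0.5 * (h / a) ** 3 for h < a, and 1 otherwise;
    the correlation is 1 - gamma(h).
    """
    if lag < 0.0 or corr_range <= 0.0:
        raise ValueError("lag must be >= 0 and the range must be > 0")
    if lag >= corr_range:
        return 0.0
    ratio = lag / corr_range
    return 1.0 - (1.5 * ratio - 0.5 * ratio ** 3)


RANGE_CM = 4.2  # correlation length from the text, treated here as the range
rho_2cm = spherical_correlation(2.0, RANGE_CM)  # roughly 0.34
rho_4cm = spherical_correlation(4.0, RANGE_CM)  # roughly 0.003
assert rho_2cm > rho_4cm
```

Under these assumptions, pilot points about 2 cm apart still share appreciable correlation with the cells between them, whereas at about 4 cm the correlation has almost vanished, so interpolated values between pilot points are only weakly constrained.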
 A subset of the observations (96 m1 data at OBS1.0) was used to test a more realistic case of limited observations. The subset of 96 observed data mimics 12 observation wells in each layer of the 3-D field. Highly parameterized models with 896 pilot points in the heterogeneous region and one pilot point in the homogeneous region were used to calibrate the K–φ fields (Figure 9). Observation wells are uniformly distributed. Comparison of the m1 profiles for KPack and the six different methods demonstrates that the simulated m1 profiles match the trend of the observed m1 relatively well along the main flow direction (Figures 10a and 10c), where observation wells are located. However, the discrepancy between m1 profiles increased in the center of the flowcell (Figure 10b), where there are no observation wells. In particular, the estimated K distributions (Figure 9) show that the resulting K contrast is much smaller than for the fields calibrated with larger amounts of observation data (Figure 7), but all six highly parameterized models captured the key features of the packing pattern relatively well. Relative composite sensitivities, which are the product of composite sensitivity and calibrated parameter value, show that sensitivity values for a majority of the K parameters in the heterogeneous region were similar in magnitude, while sensitivity values for φ were lower than those for K (results not shown). This comparison indicates that in well-defined or well-characterized regions (i.e., high K along the center of the domain in this study), the more sensitive parameters, such as K rather than φ, would be enough to calibrate the observed data. Hence, the calibration solution space, whose dimension was the same as the number of observations (i.e., 96), was mostly related to K. This also explains the different φ distributions between KCoreg and KKC: parameters related to φ in KCoreg were not included in the calibration solution space, resulting in correlation coefficients similar to the initial value (i.e., 0.5). Sensitivity and parameter importance or data worth associated with the design of monitoring networks and predictive uncertainty reduction will be explored in the future.
4. Summary and Conclusions
 Inverse estimation of K and φ based on advection only, using the unique MRI concentration data set, was performed to explore the application of highly parameterized models in a 3-D heterogeneous transport experiment. Comparison of the inverse estimations from the six different K–φ relationships shows similar statistical measures of mean arrival times (m1) and very similar m1 spatial estimates, as shown in the m1 profiles along the main flow direction (Figure 4). All six models recovered the main physical feature of the K pattern in KPack (i.e., highly interconnected high-K sands along the central part of the flowcell), though they differed in specific details of the K and φ patterns. This work highlights that complete knowledge of the 3-D field in the experimental design (KPack) does not provide good predictions, and that inverse models with all of the different parameterizations can considerably improve upon the naïve application of the experimental setup. Model comparison also suggests that site-specific conceptual model information, such as zonal information or a particular K–φ relationship, will help constrain the calibration and create more reliable predictions. Compared to the most commonly used KKC and Konly (and/or KLAY) approaches with less parameterization, the KIND approach within the framework of regularized highly parameterized models can be a practical solution for estimating aquifer properties. The KIND approach produced the estimations most similar to those obtained with KZone for all cases considered. Model selection criteria such as AICc and KIC also show that the rank of models with different parameterizations follows the order of the objective functions. Future work is necessary to provide guidance for the application of the model selection criteria to highly parameterized models.
 Forward transport modeling of the ADE using K and φ fields calibrated with advection only also shows that the BTCs at a majority of locations, as well as the statistical results of m1, m2c, and BTCs, are very similar for all six relationships. This comparison shows that there is not a single set of model parameters or a single K–φ relationship that leads to a single, best representation of the real experimental condition, and that estimation conditioned only on tracer breakthrough data can be nonunique. Further study is needed to confirm this, in particular for cases with direct K and φ measurements at limited locations. The difference between the forward model and experimental BTCs increased as the averaging scale of the concentration decreased, demonstrating that small-scale variability in aquifer properties is a significant source of uncertainty in transport predictions. Even with the KZone set, where packing (i.e., zonal) information is used as the basis for a highly parameterized model, small-scale variations in the packing and mixing of the sands cause significant deviations from the expected transport results, as seen in the number of BTC outliers. These results are consistent with similar impacts of fine-scale structures on transport results in heterogeneous media observed in other detailed experimental investigations [e.g., Klise et al., 2008].
 In this study, ME, MAE, and RMSE values at the AVG0.25 and AVG1.0 scales are similar, but different from those at the AVG0.016 scale (Table 5), indicating that the REV scale is approximately AVG0.25. Another important feature of these statistical results is that all statistics with the OBS0.016 (51,584 m1) and OBS1.0 (832 m1) data are strikingly similar across all averaging scales. This suggests that even with a much larger number of observations at the OBS0.016 scale, local-scale heterogeneity features below the REV are not properly incorporated into the parameterized models used in this work. Hence, an appropriate scale of observation for transport problems (e.g., concentration and/or mass flux), such as an REV scale, should be considered, and local fluctuations can then be treated separately as random or correlated deviations [Alcolea et al., 2008; Hu et al., 2009]. This work strongly suggests that variability left uncaptured during model calibration (e.g., local heterogeneity features in this study) inevitably exists, highlighting the importance of addressing prediction uncertainty following calibration in modeling practice.
 Although transport parameters such as dispersivities were not included during calibration in this study, the forward modeling results with the calibrated fields indicate that even in a moderately heterogeneous field, BTC tailing can be observed at a small scale such as AVG0.016 (Figure 6). In low to moderate K contrast fields, predicted and experimental BTCs have varied considerably [Barth et al., 2001; Nowak and Cirpka, 2006], and these differences have been attributed to variability in the measured hydraulic conductivity [Barth et al., 2001] and small-scale heterogeneity [Nowak and Cirpka, 2006]. In this study, even at the AVG0.016 scale, observed BTCs can exhibit distinct features, such as a fast increase of concentration followed by a more gradual increase, owing to local heterogeneity at high spatial resolution and mixing of flow paths (i.e., streamtubes). Recently, limited sets of BTCs have been calibrated with different parameterizations, such as inclusion of dispersivity as a fitting parameter in a 2-D tank tens of meters long [Nowak and Cirpka, 2006], mass transfer between mobile and immobile domains in a 3-D layered heterogeneous field together with spatial variability of porosity [Riva et al., 2008], a time-fractional ADE in a 2-D slab [Major et al., 2011], and inclusion of nested variograms at two different scales in 2-D synthetic heterogeneous fields [Alcolea et al., 2008]. In particular, this study shows that if the conditions for using advection-only transport in the inversion are met, ADE transport can be modeled accurately with fields calibrated under advection only. This highlights the computational and practical efficiency of inverse estimation with advection only as a surrogate for more expensive ADE solutions, providing results and initial conditions for further system analysis.
 This material is based upon work supported as part of the Center for Frontiers of Subsurface Energy Security, an Energy Frontier Research Center funded by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences under Award Number DE-SC0001114. Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. We also acknowledge the effort of the Associate Editor, Wolfgang Nowak, Matthew Tonkin, and two anonymous reviewers for their careful and constructive reviews, which led to significant improvement of our manuscript.