Groundwater modeling has become a vital component of water supply and contaminant transport investigations. These models require representative hydraulic conductivity (K) and specific storage (Ss) estimates, or a set of estimates representing subsurface heterogeneity. Currently, there are a number of approaches for characterizing and modeling K and Ss heterogeneity in varying degrees of detail, but there is no consensus on which approach yields the most robust groundwater models with the best predictive ability. The main goal of this study is to compare different heterogeneity modeling approaches (e.g., effective parameters, geostatistics, geological models, and hydraulic tomography) when input into a forward groundwater model and used to predict 16 independent cross-hole pumping tests. We first characterize a sandbox aquifer through single- and cross-hole pumping tests, and then use these data to construct forward groundwater models of various complexities (both homogeneous and heterogeneous distributions). Two effective parameter models are constructed: (1) by taking the geometric mean of single-hole test K and Ss estimates and (2) by calibrating effective K and Ss estimates through simultaneously matching the response at all ports during a cross-hole test. Heterogeneous models consist of spatially variable K and Ss fields obtained via (1) kriging single-hole data; (2) calibrating a geological model; and (3) conducting transient hydraulic tomography (Zhu and Yeh, 2005). The performance of these parameter fields is then tested through the simulation of 16 independent cross-hole pumping tests. Our results convincingly show that transient hydraulic tomography produces the smallest discrepancy between observed and simulated drawdowns.
1.1. Characterization Methods of Subsurface Heterogeneity in Hydraulic Parameters
 Subsurface characterization for groundwater investigations relies on determining the distribution of hydraulic parameters such as hydraulic conductivity (K) and specific storage (Ss). These values are then used to build groundwater models of various complexities to obtain quantitative estimates of hydraulic heads, groundwater fluxes, and the distribution and concentration of contaminants. Commonly, hydraulic parameters are estimated by collecting cores and subjecting them to permeameter tests and grain size analysis in a laboratory, or by conducting slug, single-hole, and/or pumping tests in situ. Most of these in situ methods rely on analytical solutions that treat the geological medium as homogeneous. These simplified solutions and the resulting estimated parameters have been utilized in a variety of real world applications and academic studies [e.g., Theis, 1935; Hantush, 1960; Neuman, 1972], despite the fact that the subsurface is heterogeneous at multiple scales. In particular, knowledge of the detailed three-dimensional distribution of K is critical for the prediction of contaminant transport, delineation of well catchment zones, and quantification of groundwater fluxes including surface water/groundwater exchange. Although many studies assume that Ss does not vary significantly, in formations where the aquifer compressibility varies significantly from one material type to the next (e.g., sands versus clays), Ss could vary by several orders of magnitude.
 The characterization of subsurface heterogeneity is fraught with difficulties as numerous samples are required to delineate the variability of hydraulic parameters as well as their spatial correlations and connectivity. Using soil cores to accurately characterize the K heterogeneity of a site requires a large number of samples to be tested in the laboratory [e.g., Sudicky, 1986; Sudicky et al., 2010]. Alternatively, these samples are sieved to obtain grain size distributions, which can then be analyzed using various empirical relations to estimate K.
1.2. Methods for Capturing Spatial Heterogeneity in Hydraulic Parameters
 Common approaches to mapping K (and, less often, Ss) heterogeneity are geostatistical or stochastic estimation techniques and more sophisticated interpolation methods. These approaches are considered the de facto standards. They assume that a user-specified covariance function is valid and that hydrogeologic parameters are lognormal and stationary. However, these assumptions are difficult to satisfy in many geologic settings. Because of these assumptions, and when data are not abundant, stochastic estimation techniques may provide a smooth image of the spatial heterogeneity and may not represent the true distribution accurately. Although a variety of stochastic simulation techniques [e.g., Deutsch and Journel, 1998] exist that can overcome this issue of smoothing, they still do not preserve many geological features. This is because traditional geostatistical methods are based on variograms computed using two-point statistics. To overcome this shortcoming, multiple point geostatistics [e.g., Guardiano and Srivastava, 1993; Caers, 2001; Strebelle, 2002; de Vries et al., 2009] have been developed through the use of more complex point configurations, whose statistics are retrieved from training images that represent the geological facies distributions obtained from outcrop mappings and/or geophysical imaging.
 Recently, geostatistical and stochastic inverse methods have received increasing attention. These approaches produce the first and second statistical moments of hydrogeologic variables, representing their most likely estimates and their uncertainty, respectively, conditioned on available observations. One such method, cokriging, relies on the classical linear predictor theory that considers the spatial correlation structures of flow processes (such as hydraulic head and velocity) and of the subsurface hydraulic property, as well as the cross-correlation between the flow processes and the hydraulic property. In the past few decades, many researchers [e.g., Kitanidis and Vomvoris, 1983; Hoeksema and Kitanidis, 1984, 1989; Rubin and Dagan, 1987; Gutjahr and Wilson, 1989; Harvey and Gorelick, 1995; Yeh et al., 1995, 1996] have demonstrated its ability to estimate K, head, and velocity, as well as solute concentrations in heterogeneous aquifers.
 In the laboratory, Illman et al. recently assessed the performance of various methods for characterizing K by predicting the hydraulic response observed in cross-hole pumping tests in a synthetic heterogeneous aquifer and the total flow rates obtained via flow-through tests. Specifically, they characterized a synthetic heterogeneous sandbox aquifer using various techniques (permeameter analyses of core samples, single-hole, cross-hole, and flow-through testing). They then obtained mean K estimates through traditional analysis of test data by treating the medium as homogeneous. Heterogeneous K fields were obtained through kriging and steady state hydraulic tomography. To assess the performance of each characterization approach, Illman et al. conducted forward simulations of 16 independent pumping tests and six steady state, flow-through tests using these homogeneous and heterogeneous K fields. The results of these simulations were then compared to the observed data. The results showed that the mean K and heterogeneous K fields estimated through kriging of small-scale K data (core and single-hole tests) produced biased predictions of drawdowns and flow rates under steady state conditions. In contrast, the heterogeneous K distribution or “K tomogram,” estimated via steady state hydraulic tomography, yielded excellent predictions of drawdowns of pumping tests not used in the construction of the tomogram and very good estimates of total flow rates from the flow-through tests. On the basis of these results, Illman et al. suggested that steady state groundwater model validation is possible if the heterogeneous K distribution and forcing functions (boundary conditions and source/sink terms) are characterized sufficiently.
1.3. Goal of This Study
 This study extends the work of Illman et al., who examined only K characterization approaches and their performance in predicting independent test data under steady state conditions, to the transient case. Using the same sandbox aquifer as Illman et al., we jointly assess the performance of various characterization and modeling techniques that treat the aquifer as either homogeneous or heterogeneous through the prediction of independent, transient cross-hole pumping tests not used in the characterization effort. Specifically, we characterize the 2-D heterogeneous aquifer using both single- and cross-hole pumping tests. These data are then used to construct various forward groundwater models with homogeneous and heterogeneous K and Ss estimates. Two homogeneous or effective parameter models are constructed: (1) by averaging local scale K and Ss estimates from single-hole pumping tests and treating the medium as homogeneous and isotropic; and (2) by using MMOC3 [Yeh et al., 1993] coupled with PEST [Doherty, 1994] to estimate the horizontal and vertical hydraulic conductivities (Kx, Kz), as well as Ss, by simultaneously matching the transient drawdown data from all ports during a cross-hole pumping test and treating the medium as homogeneous and anisotropic for K. Three heterogeneous models are constructed and consist of spatially variable K and Ss fields obtained via (1) kriging single-hole K and Ss data; (2) accurately capturing the layering and calibrating the K and Ss values for these layers using a parameter estimation program (i.e., a calibrated geological model); and (3) conducting transient hydraulic tomography (THT). The performance of these homogeneous and heterogeneous K and Ss fields is then quantitatively assessed by simulating 16 independent cross-hole pumping tests and comparing the simulated drawdowns to the observed drawdowns.
It should be noted that these methods utilize different amounts of data, so the comparison is not fair in a strict sense. However, aside from hydraulic tomography, the approaches examined are commonly utilized to deal with heterogeneity, and the goal of this comparison is to assess their performance relative to hydraulic tomography, which is designed to incorporate data from multiple pumping tests.
2. Experimental Methods
2.1. Sandbox and Synthetic Heterogeneous Aquifer Description
 A two-dimensional synthetic heterogeneous aquifer was constructed in a sandbox measuring 193.0 cm in length, 82.6 cm in height, and 10.2 cm deep. Forty-eight ports, 1.3 cm in diameter, were cut out of the stainless steel wall to allow coring of the aquifer as well as installation of fully penetrating horizontal wells (Figure 1). Each port was instrumented with a 0 to 1 psig (pounds per square inch gauge) Setra model 209 pressure transducer.
 The synthetic heterogeneous aquifer was created through the cyclic deposition of sediments under varying water flow and sediment feed rates. Our goal in relying on sediment transport was to create, in an efficient manner, a more realistic heterogeneity pattern with various scales of heterogeneity. Table 1 summarizes the grain size characteristics, K, and Ss estimates of the sands used to create the heterogeneous aquifer, and the layers in which these sands occur. Figure 2 is a photograph of the synthetic heterogeneous aquifer showing the interfingering nature of the deposits and layer numbers. Further details on this synthetic heterogeneous aquifer and its construction approach are provided by Illman et al.
Table 1. Characteristics of Each Layer Used to Create a Synthetic Heterogeneous Aquifer^a
^a If multiple ports are in the same layer then the geometric mean is presented.
1.03 × 10−1
3.20 × 10−2
5.32 × 10−2
2.12 × 10−4
2.99 × 10−2
5.29 × 10−2
5.67 × 10−2
2.60 × 10−4
7.29 × 10−3
7.14 × 10−2
5.70 × 10−2
5.00 × 10−4
6.68 × 10−2
5.68 × 10−2
5.10 × 10−2
2.22 × 10−4
8.16 × 10−2
5.00 × 10−2
4.00 × 10−4
5.70 × 10−2
1.27 × 10−1
7.35 × 10−2
4.20 × 10−4
5.33 × 10−2
1.34 × 10−1
4.50 × 10−2
1.75 × 10−4
6.68 × 10−2
8.69 × 10−2
4.60 × 10−2
2.15 × 10−4
1.20 × 10−2
1.13 × 10−1
8.25 × 10−2
1.14 × 10−3
5.70 × 10−2
1.37 × 10−1
2.05 × 10−1
2.15 × 10−4
1.32 × 10−1
3.40 × 10−2
4.95 × 10−2
6.32 × 10−4
1.03 × 10−1
2.60 × 10−1
1.05 × 10−1
9.80 × 10−4
9.22 × 10−3
9.79 × 10−2
5.70 × 10−2
9.80 × 10−4
6.68 × 10−2
8.58 × 10−2
7.50 × 10−2
2.00 × 10−3
4.16 × 10−2
2.69 × 10−2
7.11 × 10−4
7.29 × 10−3
4.51 × 10−2
4.47 × 10−2
1.14 × 10−3
1.03 × 10−1
1.45 × 10−1
1.16 × 10−1
3.38 × 10−3
 The aquifer system was bounded by three connected constant head boundaries (one situated at the top of the tank, and one on each end). The remaining boundaries (front, back, and bottom) were all no-flow boundaries.
2.2. Characterization of a Synthetic Heterogeneous Aquifer (Single-Hole Pumping Tests)
 The synthetic heterogeneous aquifer was characterized using single-hole pumping tests to obtain K and Ss estimates at each of the 48 ports. Since the support scale of parameters estimated via single-hole tests is unknown, we assume the length of the well screen open to the aquifer is representative [e.g., Guzman et al., 1996; Illman and Neuman, 2000; Illman, 2005] of the support scale. The tests were conducted by pumping water at each port at a constant rate and monitoring the transient head change within the pumped well using a pressure transducer. A constant pumping rate (Q = 1.25 cm3 s−1) was set for each single-hole pumping test. For each test, data collection started without the pump running in order to obtain the initial hydraulic head in the sandbox at all measurement ports. A peristaltic pump was then activated at the pumping port and allowed to run at a constant rate until the development of steady state flow conditions. The entire transient head response was matched using VSAFT2 [Yeh et al., 1993] through manual calibration by treating the aquifer as homogeneous. VSAFT2 was chosen for the analysis as opposed to traditional type curve models because the numerical model is able to more accurately describe the sandbox geometry and boundary conditions. Details of the numerical modeling and calibration effort are provided by Craig. The single-hole K estimates ranged from 0.01 to 0.32 cm s−1 with a geometric mean of 0.06 cm s−1 and a variance of ln-K of 0.38, and single-hole Ss estimates ranged from 1.0 × 10−4 cm−1 to 5.5 × 10−3 cm−1 with a geometric mean of 6.1 × 10−4 cm−1 and a variance of ln-Ss of 0.97. Here, we consider the geometric mean of the 48 local K and Ss estimates from the single-hole pumping tests to represent an effective K and Ss for the entire aquifer. We also considered the use of arithmetic and harmonic means as alternatives to the geometric mean. However, the arithmetic mean is representative of flow along stratification and the harmonic mean is representative of flow across layers. Since these single-hole tests induce flow both along and across the layers, we consider the geometric mean to provide the most reasonable estimate of effective parameters for K and Ss.
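The choice among the three means can be illustrated with a minimal sketch. The K values below are hypothetical placeholders, not the 48 measured estimates, and the use of a sample (n − 1) variance for the log-transformed values is an assumption, since the text does not state which estimator was used.

```python
import math


def arithmetic_mean(values):
    """Simple average; associated with flow parallel to layering."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Exponential of the mean of the natural logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the mean reciprocal; associated with flow across layers."""
    return len(values) / sum(1.0 / v for v in values)


def ln_variance(values):
    """Sample variance (n - 1 denominator, an assumption) of the natural logs."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return sum((x - mean_log) ** 2 for x in logs) / (len(logs) - 1)


# Hypothetical local K estimates in cm/s, NOT the 48 measured values.
k_local = [0.01, 0.03, 0.06, 0.12, 0.32]

k_h = harmonic_mean(k_local)
k_g = geometric_mean(k_local)
k_a = arithmetic_mean(k_local)
print(f"harmonic={k_h:.4f}  geometric={k_g:.4f}  arithmetic={k_a:.4f}")
print(f"variance of ln K = {ln_variance(k_local):.3f}")
```

For any set of positive values the harmonic mean is no larger than the geometric mean, which in turn is no larger than the arithmetic mean, so the geometric mean always falls between the two layered-flow end members.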
2.3. Characterization of the Synthetic Heterogeneous Aquifer (Cross-Hole Pumping Tests)
 Twenty-five cross-hole tests were also performed in the sandbox for the purposes of effective parameter characterization, calibration of a geological model (presented in section 4), transient hydraulic tomography (presented in section 5), and model validation (presented in section 6). Tests were conducted at each port along columns 2 (ports 2, 8, 14, 20, 26, 32, 38, and 44) and 5 (ports 5, 11, 17, 23, 29, 35, 41, and 47), and nine additional pumping tests were conducted at ports outside of these two columns (ports 13, 15, 16, 18, 21, 37, 39, 40, and 42) (see Figure 1). The cross-hole tests were conducted by pumping at rates ranging from 2.50 to 3.17 cm3 s−1 at the 25 separate ports indicated by open and dashed squares in Figure 1. During each test, head measurements in all 48 ports (and the constant head reservoirs) were recorded. Pumping continued until the development of steady state conditions, which was determined by observing the stabilization of all head measurements within the aquifer. The transducers in the constant head reservoirs indicated that the hydraulic head remained constant for the duration of the pumping tests.
 The cross-hole test at port 21, completed near the center of the aquifer, was used to estimate the effective homogeneous values of K and Ss. Analogous to the analysis of single-hole data, the head data from each observation port in this test were matched through manual calibration with VSAFT2, treating the aquifer as homogeneous. Estimates of K and Ss obtained between the pumping and observation intervals, when the medium is treated as homogeneous, are considered to be an equivalent hydraulic conductivity (Keq) or specific storage (Sseq). Analysis of the cross-hole test yielded 48 estimates of K and Ss for the equivalent homogeneous medium. The Keq estimates ranged from 0.054 to 0.42 cm s−1 with a geometric mean of 0.11 cm s−1 and a variance of ln-Keq of 0.22. The corresponding Sseq estimates ranged from 8.5 × 10−5 cm−1 to 7.5 × 10−3 cm−1 with a geometric mean of 3.3 × 10−4 cm−1 and a variance of ln-Sseq of 1.03.
 The cross-hole pumping test at port 20 was used to estimate effective homogeneous values of K (anisotropic) and Ss (isotropic). The parameter estimation program PEST [Doherty, 1994] was coupled with the groundwater flow model MMOC3 [Yeh et al., 1993] to simultaneously match the transient data recorded at all ports. The synthetic aquifer (measuring 160 × 10.2 × 78 cm) used for the parameter estimation was discretized into 741 elements and 1600 nodes with element dimensions of 4.1 × 10.2 × 4.1 cm. A finer mesh was also considered; however, the results did not change significantly. Thus, to minimize the computational requirements, the discretization presented was used for all cases. Both sides and the top boundary were set to the same constant head, while the bottom, front, and back boundaries of the sandbox were considered no-flow boundaries. Estimated Kx, Kz, and Ss values are 1.4 × 10−2 cm s−1, 2.5 × 10−3 cm s−1, and 5.9 × 10−4 cm−1, respectively.
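The reported element and node counts are consistent with a structured grid of brick elements that is one element thick. The short check below is a sketch under that assumption; the grid layout (39 × 19 elements, one element through the 10.2 cm thickness, nodes at element corners only) is inferred from the reported numbers and is not stated in the text, and `mesh_counts` is a hypothetical helper.

```python
def mesh_counts(length_cm, height_cm, element_size_cm, elements_through_thickness=1):
    """Count elements and corner nodes of a structured brick mesh.

    Assumes equal element size in the length and height directions and
    nodes located only at element corners (an assumption about the grid).
    """
    nx = round(length_cm / element_size_cm)   # elements along the length
    nz = round(height_cm / element_size_cm)   # elements along the height
    n_elements = nx * nz * elements_through_thickness
    n_nodes = (nx + 1) * (nz + 1) * (elements_through_thickness + 1)
    return nx, nz, n_elements, n_nodes


nx, nz, n_elements, n_nodes = mesh_counts(160.0, 78.0, 4.1)
print(nx, nz, n_elements, n_nodes)  # prints: 39 19 741 1600
```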
3. Geostatistical Analysis of Single-Hole Pumping Test Data
 Kriging and other simplified interpolation methods are commonly used to characterize subsurface heterogeneity [e.g., Sudicky, 1986; Adams and Gelhar, 1992; Chen et al., 2000; Sudicky et al., 2010]. As such, we used kriging of single-hole K and Ss data to generate heterogeneous distributions that could be used for forward modeling of cross-hole pumping tests. The exponential variogram model was fit to the experimental variograms in both horizontal and vertical directions, resulting in an anisotropic variogram model. Examination of the experimental variogram of the ln-Ss data revealed an increasing variogram with lag distance. We attributed this to a trend of ln-Ss values declining from high to low values from the top to the bottom of the sandbox. Such a trend was also observed in the Ss values from a separate sandbox packed in an entirely different way [Liu et al., 2007]. We detrended the data [e.g., Chen et al., 2000] by fitting an anisotropic exponential model to the residuals of the original experimental variograms for ln-Ss data.
Table 2 lists the variogram parameters fit to the experimental variograms. For both model fits, the anisotropy ratio of the natural log of the parameters was determined to be 3 with the horizontal correlation length larger than the vertical correlation length because of the layered nature of the deposits. The results (not shown here), in general, reveal smoother K and Ss fields in comparison to the interfingering layers shown in Figure 2, which is expected considering that there are only 48 data points used for kriging (see Figure S1 in the auxiliary material). It should be noted that ordinary kriging, as applied here, produces smooth parameter fields and is unable to handle abrupt changes in material types. One way to handle such abrupt changes is to construct a geological model on the basis of available borehole data.
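An anisotropic exponential variogram of the kind described above can be sketched as follows. The sill and correlation lengths are hypothetical placeholders (only the horizontal-to-vertical ratio of 3 comes from the text), and the form sill × (1 − exp(−h′)) with a scaled lag h′ is one common convention; some texts include a factor of 3 in the exponent so that the length parameter is the practical range.

```python
import math


def exp_variogram(hx, hz, sill, corr_x, corr_z, nugget=0.0):
    """Exponential variogram with geometric anisotropy.

    hx, hz         : horizontal and vertical lag components (cm)
    corr_x, corr_z : horizontal and vertical correlation lengths (cm)
    The lag is scaled by the correlation length in each direction, so a
    lag of one correlation length gives the same value in any direction.
    """
    h_scaled = math.hypot(hx / corr_x, hz / corr_z)
    if h_scaled == 0.0:
        return 0.0  # the variogram is zero at zero lag by definition
    return nugget + sill * (1.0 - math.exp(-h_scaled))


# Hypothetical parameters: vertical length 10 cm, horizontal length 30 cm
# (anisotropy ratio of 3, as reported), unit sill, no nugget.
SILL, CORR_X, CORR_Z = 1.0, 30.0, 10.0

gamma_horizontal = exp_variogram(30.0, 0.0, SILL, CORR_X, CORR_Z)
gamma_vertical = exp_variogram(0.0, 10.0, SILL, CORR_X, CORR_Z)
print(f"{gamma_horizontal:.3f} {gamma_vertical:.3f}")
```

With this scaling, a horizontal lag of 30 cm and a vertical lag of 10 cm give the same variogram value, which is what an anisotropy ratio of 3 means in practice.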
Table 2. Geostatistical Model Parameters for Kriging Single-Hole ln-K and ln-Ss Data
4. Construction of a Geological Model
 Groundwater flow and transport models are commonly built using various hydrogeologic data and often deterministically incorporate the knowledge of site geology. To compare the performance of groundwater flow models based on knowledge of the geology to that of other models, we constructed a numerical model whose parameter field closely resembles the stratigraphy of the synthetic heterogeneous aquifer, using Figure 2 as a reference.
 To construct the geological model, we assumed that the stratification is known for the entire simulation domain. In practice, a perfect knowledge of stratigraphy is not available, thus we consider this to be a best-case scenario in terms of having information on stratigraphy. We further assumed that the stratification shown on the glass (Figure 2) was uniform throughout the thickness of the sandbox.
 The parameter estimation program PEST [Doherty, 1994] coupled with the groundwater flow model MMOC3 [Yeh et al., 1993] was used to estimate K and Ss values for each layer using the data collected during the cross-hole test at port 20. In total, 38 parameters were estimated (K and Ss for 19 layers). Only 18 layers are identified in Figure 2; however, layer 5, which is discontinuous because of erosion that occurred during the deposition of layer 8, is treated as two separate layers in the PEST estimation. The model domain used for this estimation is identical to that described in section 2.3. Figure 3 shows the ln K and ln Ss (Figures 3a and 3b, respectively) distributions for the calibrated geological model.
 For the estimation procedure, parameter estimates were constrained to values between 1 × 10−4 and 10 cm s−1 for K, and 1 × 10−6 and 1 cm−1 for Ss. While the parameter estimation procedure did converge after approximately 600 model calls, the confidence intervals for the parameter estimates for each individual layer were quite large. This is likely a result of the highly parameterized nature of the inversion [e.g., Carrera and Neuman, 1986; McLaughlin and Townley, 1996; Carrera et al., 2005; Tonkin and Doherty, 2005] and the fact that data are not available at all points, thus causing the inverse problem to be ill-posed [e.g., Yeh et al., 1996; Yeh and Liu, 2000].
 The quality of the calibration to the data at port 20 can be assessed by examining the observed versus simulated scatterplot for port 20 in Figure S5. The relatively small amount of scatter and the cluster of data points along the 1:1 line indicate that, for this particular cross-hole test, the estimated values reasonably capture the heterogeneity of the sandbox aquifer.
5. Cross-Hole Pumping Tests and Transient Hydraulic Tomography Analysis
 The transient hydraulic tomography (THT) analysis of cross-hole pumping tests in the sandbox was conducted using the sequential successive linear estimator (SSLE) code developed by Zhu and Yeh. The inverse model assumes a transient flow field, and the natural logarithms of K (ln-K) and Ss (ln-Ss) are both treated as multi-Gaussian, second-order stationary, stochastic processes. The model additionally assumes that the mean and correlation structure of the K and Ss fields are known a priori. Further details on the SSLE code can be found in the work of Zhu and Yeh.
5.1. Input Parameters and Cross-Hole Tests Used
 The model domain used to obtain K and Ss tomograms with THT is identical to that described in section 2.3 for the calibration of the geological model.
 Inputs to the inverse model include initial guesses for the K and Ss, estimates of variances and the correlation scales for both parameters, volumetric discharge (Qn) from each pumping test where n is the test number, as well as head data at various times selected from the head-time curve. Although available point (small-scale) measurements of K and Ss can be input to the inverse model, we do not use these measurements to condition the estimated parameter fields.
 For this THT analysis, the initial parameter fields were homogeneous and based on the analysis presented in section 2.3, with K and Ss approximated as 0.19 cm s−1 and 1 × 10−4 cm−1, respectively. The input variances for K and Ss, which have been shown to have negligible effects on the resulting tomogram [Yeh and Liu, 2000], were based on estimates from the available small-scale data. The correlation scales represent the average size of heterogeneity, which is difficult to determine accurately without a large number of data sets in the field. The effects of uncertainty in the correlation scales on the THT estimates are negligible because the tomographic survey produces a large number of head measurements, reflecting the detailed site-specific heterogeneity [Yeh and Liu, 2000]. Therefore, the correlation scales were approximated based only on the average thickness and length of the discontinuous sand bodies. The estimated values were 50 and 10 cm for the horizontal and vertical correlation lengths, respectively.
 Prior to incorporation into the inverse model, the transient head records were treated with various error reduction schemes discussed by Illman et al. Data from pumped ports were not included in the inverse model because of excessive noise caused by pumping at a higher flow rate than that used for the single-hole tests. Each drawdown curve was then fit with a fifth- or sixth-order polynomial curve following Liu et al. A fifth- or sixth-order polynomial was found to best capture the overall drawdown behavior for the majority of the data. We then manually extracted five data points representative of the entire transient record.
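The curve-fitting step can be sketched in a few lines. Everything below is illustrative: the drawdown record is synthetic, the time axis is rescaled to [−1, 1] for numerical conditioning (an assumption, since the text does not say which variable was fit), and the five points are picked at evenly spaced indices, whereas the study selected them manually. A library routine such as `numpy.polyfit` would normally be used; a small pure-Python solver is written out here only to keep the sketch dependency-free.

```python
import math


def polyfit_least_squares(x, y, degree):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + ..."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)), b[i] = sum(y*x^i).
    a = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Synthetic (hypothetical) drawdown record: 50 samples rising to a plateau.
n_samples = 50
u = [i / (n_samples - 1) for i in range(n_samples)]   # normalized time, 0..1
x = [2.0 * ui - 1.0 for ui in u]                      # rescaled to -1..1
drawdown = [1.0 - math.exp(-3.0 * ui) for ui in u]    # arbitrary units

coeffs = polyfit_least_squares(x, drawdown, degree=5)
smoothed = [polyval(coeffs, xi) for xi in x]

# Five representative points at evenly spaced indices (an assumption; the
# study extracted its five points manually).
picks = [round(i * (n_samples - 1) / 4) for i in range(5)]
selected = [(u[i], smoothed[i]) for i in picks]
print(selected)
```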
 Data curves that could not be properly fit due to excessive noise were manually excluded from the analysis. In total, we utilized eight independent cross-hole tests with pumping taking place at ports 47, 44, 35, 32, 17, 14, 5, and 2 for the analysis. More specifically, we utilized five data points from 47 ports totaling 235 data in all tests, except for the pumping test in port 2. At port 2, five data points were obtained from 43 ports totaling 215 data. Some of the data points were excluded from this particular test as the data were excessively noisy. In total, we utilized 1860 data points from eight different tests in our transient inversions.
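The data totals quoted above can be verified with simple arithmetic. The dictionary below restates the port counts from the text (47 observation ports for seven tests, 43 for the test at port 2); the variable names are illustrative only.

```python
POINTS_PER_CURVE = 5  # data points extracted from each drawdown curve

# Pumped port -> number of observation ports whose curves were used.
ports_used = {47: 47, 44: 47, 35: 47, 32: 47, 17: 47, 14: 47, 5: 47, 2: 43}

data_per_test = {port: n * POINTS_PER_CURVE for port, n in ports_used.items()}
total_data = sum(data_per_test.values())

print(data_per_test[47], data_per_test[2], total_data)  # prints: 235 215 1860
```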
5.2. Computation of ln-K and ln-Ss Tomograms
 All computations for transient hydraulic tomography analyses were executed using 44 of 48 processors on a PC-cluster (consisting of 1 master and 12 slaves each with Intel Q6600 Quad Core CPU running at 2.4 GHz with 16 GB of RAM per slave) at the University of Waterloo. The operating system managing the cluster was CentOS 5.3 based on a 64-bit system. The total computational time for inverting data from eight pumping tests was about 14 min. Figures 4a–4d are the ln-K tomograms obtained by inverting the transient head data from two, four, six, and eight pumping tests, respectively. Figure 4a shows that with only two pumping tests a coarse picture of the heterogeneity pattern emerges, although the distribution is still fairly smooth and many details of the heterogeneity, in particular details of the stratification, are missing. As more tests are included in the SSLE algorithm, more detail of the heterogeneity structure emerges. In particular, the final ln-K tomogram (Figure 4d), obtained using eight pumping tests, reveals considerable detail of the heterogeneity structure, including the connectivity of various high and low K layers. Figure 4e is a ln-K tomogram computed by Illman et al. using the steady state hydraulic tomography algorithm of Yeh and Liu and is included for comparison purposes.
Figures 5a–5d show the corresponding ln-Ss tomograms that were estimated simultaneously. In contrast to Figures 4a–4d, the ln-Ss tomograms are smoother and show less of the layering structure visible in the ln-K tomograms. However, a decreasing trend in ln-Ss with depth in the synthetic aquifer is apparent. Physically speaking, this makes sense because the sands in the upper portion are less compressed, while the deeper sands are more compressed due to the stress exerted by the overlying material. This finding suggests that ln-K values are not significantly correlated with the ln-Ss values in this sandbox and is in agreement with the findings of Liu et al. for a different sandbox packed with a considerably different heterogeneity pattern. The only data available for comparison are the single-hole estimates presented in Table 1 and shown as a kriged distribution in Figure S1b. While not identical, the pattern is similar, and the values are in a similar range, suggesting the Ss tomogram is physically reasonable.
5.3. Statistical Summary of Results
Table 3 summarizes the geometric mean (KG), variance of ln-K, and correlation lengths of the resulting ln-K tomogram. The statistical parameters for the kriged and calibrated geological model cases are also included for comparison purposes. The estimated KG of the ln-K tomogram, after including data from eight cross-hole tests, was 1.0 × 10−1 cm s−1, while the estimated variance of ln-K was 1.32. We note that the value of KG is identical to that estimated using steady state hydraulic tomography by Illman et al., but the variance is slightly higher compared to a value of 1.12. The estimated KG from transient hydraulic tomography is somewhat higher than the estimate of KG obtained by taking the geometric mean (0.06 cm s−1) of the 48 local K values from single-hole tests. By comparison, the variance estimated from transient hydraulic tomography (1.32) is considerably higher than that estimated from the 48 single-hole K data (0.38).
Table 3. Statistical Properties of the Estimated ln-K Fields
KG (K ∼ cm s−1)
Kriged K and Ss
Calibrated geological model
Hydraulic tomography (2 tests)
Hydraulic tomography (4 tests)
Hydraulic tomography (6 tests)
Hydraulic tomography (8 tests)
 It is of interest to note that there is little change in the KG and the correlation lengths of the ln-K tomograms as more tests are included in the inverse analysis. On the other hand, the variance of ln-K increases as more cross-hole tests are included in the inverse model. These results imply that with as few as two pumping tests, one could reliably estimate the KG and the correlation lengths of the K distribution in the synthetic aquifer; however, the accurate estimation of the variance requires more cross-hole tests. A similar trend in the improvement of estimates of geostatistical parameters was observed in the hydraulic tomography analysis of steady state head data by Illman et al.
Table 4 summarizes the geometric mean (SsG), variance of ln-Ss, and correlation lengths of the ln-Ss tomogram. The estimated SsG of the ln-Ss tomogram after including data from eight cross-hole tests was 9.08 × 10−5 cm−1, while the estimated variance of ln-Ss was 0.76. The estimate of SsG obtained by taking the geometric mean of 48 single-hole Ss estimates yields 6.16 × 10−4 cm−1, which is somewhat higher than the estimate obtained through transient hydraulic tomography. The variance estimated from the single-hole data (0.97) is close to that estimated through transient hydraulic tomography. The estimated correlation lengths in the horizontal and vertical directions appear to stabilize as additional tests are included in the analysis.
Table 4. Statistical Properties of the Estimated ln-Ss Fields
SsG (Ss ∼ cm−1)
Kriged K and Ss
Calibrated geological model
Hydraulic tomography (2 tests)
Hydraulic tomography (4 tests)
Hydraulic tomography (6 tests)
Hydraulic tomography (8 tests)
5.4. Visual Comparison of the ln-K Tomogram to the Deposits
 A visual comparison of the ln-K tomogram (Figure 4d) to the deposits (Figure 2) shows that many of the features are captured, although due to the intralayer variability in K, we do not expect a 1:1 correlation between the ln-K tomogram and the stratification seen in Figure 2. The comparison of the ln-K tomogram from transient hydraulic tomography to the kriged K distribution (available as auxiliary material Figure S1a) shows a marked difference in the K distribution. We notice that many of the features captured by the ln-K tomogram are also captured by the kriged map, but the latter is distinctly smoother. In addition, the connectivity of the layers captured in the ln-K tomogram is not visible in the kriged ln-K field. Finally, we point out that the kriged ln-K distribution covers the midrange values of the ln-K tomogram, indicating that kriging produces a reasonable ln-K field in an average sense but lacks the details of the heterogeneity pattern that the ln-K tomogram reveals.
5.5. Comparison of ln-K Tomograms: Transient Versus Steady State Hydraulic Tomography
We next compare the ln-K tomogram obtained using transient hydraulic tomography (Figure 4d) to the one obtained using steady state hydraulic tomography (Figure 4e) computed previously by Illman et al. In both cases, eight cross-hole tests are used for the inverse analysis. Comparison of Figures 4d and 4e shows that, overall, the ln-K distributions from the two approaches are similar; however, more details are visible in the ln-K tomogram from transient hydraulic tomography. To facilitate a pixel-by-pixel comparison, we include a scatterplot (Figure 6) of ln-K values from Figures 4d and 4e. The dashed line indicates a perfect 1:1 correlation, the solid line is a linear model fit to the data, and R2 is the coefficient of determination. Results show that the data cluster around the 1:1 line, indicating good agreement between the two cases; however, we acknowledge that there is some scatter and bias in the estimates.
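The linear model fit and coefficient of determination used in such scatterplots can be sketched as follows. This is a minimal ordinary least-squares illustration, not the authors' code; the function name and the four ln-K pairs are hypothetical.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 equals the squared correlation coefficient.
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


# Hypothetical pixel values: ln-K from one tomogram (x) versus another (y).
ln_k_a = [-3.0, -2.0, -1.0, 0.0]
ln_k_b = [-2.9, -2.2, -0.8, -0.1]
slope, intercept, r2 = linear_fit_r2(ln_k_a, ln_k_b)
```

A slope near 1 and an intercept near 0 indicate little bias relative to the 1:1 line, while R2 reflects the amount of scatter about the fitted line.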
6. Performance Assessment of Results
6.1. Performance Assessment by Simulating Individual Tests
The various effective and heterogeneous K and Ss fields are assessed by simulating 16 independent cross-hole pumping tests. The cross-hole tests are considered independent because they were not used by the various methods described earlier to characterize the aquifer (the exception is the cross-hole test performed at port 20, which was used to calibrate the geological model). If a given characterization technique can capture the salient features of the true heterogeneity of the aquifer, then the resulting prediction of the independent pumping tests should be accurate. That is, the discrepancy between observed and simulated drawdown values should be small. In contrast, if the predicted drawdown values are inaccurate, then we consider the approach used to idealize the heterogeneity to be poor. In particular, we construct forward numerical models for the following five cases: (1) the effective K and Ss estimates from single-hole tests; (2) the PEST estimated homogeneous parameters of Kx, Kz, and Ss calibrated to the cross-hole test at port 20; (3) the kriged K and Ss fields from single-hole tests; (4) the PEST calibrated geological model; and (5) the K and Ss tomograms from transient hydraulic tomography. For brevity, only the results from the best performing cases (4 and 5) are illustrated with the figures presented here. Tables 5–6 summarize various performance metrics and are provided to allow for a direct comparison of all five cases. Additionally, figures for all cases are provided in the auxiliary material (Figures S2–S6). Since these methods incorporate different amounts of information, with cases 1 and 2 including the least and cases 3–5 progressively including more, it is reasonable to expect the performance of the models to improve along these lines.
Table 5. Statistics of the Linear Model Fit and Coefficient of Determination (R2)
1. Geometric mean K/Ss (single hole)
2. PEST Effective Kx, Kz, Ss
3. Kriged K and Ss
4. Calibrated geological model
5. Transient hydraulic tomography
Figure 7 shows scatterplots of observed versus simulated drawdowns from independent cross-hole tests 18, 23, 40, and 42. The simulated drawdown values were obtained through numerical simulations using the calibrated geological model. Figure 7 includes a dashed line indicating a perfect 1:1 correlation, a solid line which is the linear model fit to the data, and the coefficient of determination (R2). Data plotted in Figure 7 are drawdown values at 0.5, 2, 5, and 10 s after the pumping test began. We see that for most cases, the points cluster around the 1:1 line with some positive or negative bias, as indicated by the slope of the linear model fit. This same pattern is seen for most of the other tests except for those performed near the top of the aquifer, where the simulated drawdown tends to be smaller than the observed, suggesting that the estimated K of the upper layer (layer 18) is too high. The R2 values for the 16 cases range from 0.002 to 0.86 with an arithmetic mean of 0.65 (see Figure S5 and Table 5). The slope and the intercept of the linear model fit also provide an indication of bias. In comparison to cases 1–3, we see a significant improvement in the prediction accuracy of groundwater flow models built to simulate the 16 cross-hole tests when lithofacies are incorporated in conjunction with model calibration. This finding is in agreement with those from Sakaki et al.
Figure 8 shows similar plots but is based on the K and Ss tomograms computed using THT. These results show a significant improvement in the predictions of drawdowns for the four selected independent cross-hole pumping tests. The R2 values for the 16 cases range from 0.82 to 0.99 with an arithmetic mean of 0.94 indicating a marked improvement over the other modeling approaches. In addition, the comparison of the scatterplots in Figure 8 to those from Figure 7 clearly shows that THT is able to better predict independent cross-hole tests.
Table 5 summarizes the minimum, maximum, and mean values of the slope and intercept of the linear model fit as well as the R2 for all 16 tests for cases 1 to 5. Table 5 shows that the slope of the linear model fit is quite variable, ranging from 0.01 to 2.25 across all characterization approaches, with mean values ranging from 0.79 to 1.00. Likewise, the intercept of the linear model fit ranges from −0.52 to 0.27, with a mean ranging from −0.29 to 0.02. Finally, the R2 values range from 0.002 to 0.99, with a mean ranging from 0.29 to 0.94. Examination of the slope, intercept, and R2 values across all the characterization methods shows that, in an average sense over 16 independent cross-hole pumping tests, the K and Ss tomograms computed by THT yield the best predictions of drawdown data with the least bias and scatter.
To further quantitatively assess the correspondence between the simulated and observed drawdown values, we compute the mean absolute error (L1) and the mean square error (L2) norms of all cases examined. The L1 and L2 norms are computed as

$$L_1 = \frac{1}{n}\sum_{i=1}^{n}\left|\chi_i - \hat{\chi}_i\right|, \qquad L_2 = \frac{1}{n}\sum_{i=1}^{n}\left(\chi_i - \hat{\chi}_i\right)^2$$

where n is the total number of drawdown data, i indicates the data number, and $\chi_i$ and $\hat{\chi}_i$ represent the estimates from the simulated and measured drawdowns, respectively. The L1 and L2 norms were calculated for each case by evaluating the observed and simulated drawdowns at four times (0.5, 2, 5, and 10 s) at each port except the pumped port, which was excluded because of excessive noise. Thus, each L1 and L2 norm represents 188 observations (47 ports at four times each).
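The two norms described above, the mean absolute error (L1) and the mean square error (L2), can be sketched in a few lines. The function name and the drawdown values are hypothetical and serve only to illustrate the calculation; in the study each norm is evaluated over 188 observation pairs.

```python
def l1_l2_norms(simulated, observed):
    """Return (L1, L2): mean absolute error and mean square error."""
    n = len(simulated)
    if n == 0 or n != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [s - o for s, o in zip(simulated, observed)]
    l1 = sum(abs(r) for r in residuals) / n  # mean absolute error
    l2 = sum(r * r for r in residuals) / n   # mean square error
    return l1, l2


# Hypothetical drawdowns, not measurements from the sandbox.
simulated = [0.10, 0.25, 0.40, 0.55]
observed = [0.12, 0.20, 0.45, 0.50]
l1, l2 = l1_l2_norms(simulated, observed)
```

Because L2 squares each residual, it penalizes a few large mismatches more heavily than L1 does, which is why the two norms are reported together.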
Table S1 (available in the auxiliary material) summarizes the L1 norm, while Figure 9 summarizes the L2 norm calculated for all of the cases. Each cell is color-coded: the minimum value is assigned dark green, the maximum value dark red, and the median value yellow. Values between these anchor points are assigned the corresponding intermediate colors. Both Table S1 and Figure 9 show that forward simulations using the K and Ss tomograms from THT consistently yield the lowest L1 and L2 norms for all 16 independent cross-hole pumping tests, suggesting that the approach yields the best predictions.
Table 6. Statistics of the Linear Model Fit, Coefficient of Determination (R2), and L1 and L2 Norms for the Ensemble Analysis of All Cases^a
^a Data include transient hydraulic tomography after the inclusion of two, four, six, and eight tests.
The scatterplots for cases 1 to 5 were also analyzed in an ensemble sense to see whether the homogeneous cases were able to estimate the average behavior of all 16 cross-hole tests. This comparison is presented in Table 6 and includes the L1 and L2 norms as well as the slope, intercept, and R2 for the linear model fits when the data from all 16 cross-hole tests are analyzed collectively. The trends seen in the individual scatterplots are also seen in these ensemble scatterplots. The results for THT after one, two, four, six, and eight tests are also presented in Table 6. Although the matches do not change significantly with the inclusion of additional cross-hole tests, they do improve slightly.
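The ensemble analysis amounts to pooling the observed and simulated drawdown pairs from all tests before computing a metric. The sketch below illustrates this for the mean absolute error; the test names, helper functions, and values are hypothetical.

```python
def mean_absolute_error(pairs):
    """Mean absolute error over (simulated, observed) pairs."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(abs(sim - obs) for sim, obs in pairs) / len(pairs)


def pool_tests(tests):
    """Concatenate the per-test pairs into a single ensemble list."""
    return [pair for pairs in tests.values() for pair in pairs]


# Hypothetical (simulated, observed) drawdown pairs for two tests.
tests = {
    "test_A": [(0.10, 0.12), (0.30, 0.25)],
    "test_B": [(0.50, 0.40), (0.20, 0.20)],
}
per_test_l1 = {name: mean_absolute_error(p) for name, p in tests.items()}
ensemble_l1 = mean_absolute_error(pool_tests(tests))
```

When every test contributes the same number of observations, as in this sketch, the pooled error norm equals the arithmetic mean of the per-test norms. The slope, intercept, and R2 of a fit to pooled data do not generally reduce to averages of the per-test values, so they have to be recomputed from the pooled pairs.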
6.2. Predictability of Transient Drawdown Curves
Finally, to further illustrate the robustness of THT, we also plot simulated and observed drawdown curves for a cross-hole test performed at port 40. Again, only cases 4 and 5 are included here. Full figures showing the comparison at all 48 ports are available online in the auxiliary material as Figures S7–S11 for cases 1–5, respectively. In particular, Figures 10 and 11 show double-logarithmic plots of observed (small dots) and simulated (curves) drawdown records at 16 selected ports. In Figure 10, the simulated drawdown values are obtained through numerical simulations using the calibrated geological model, while in Figure 11, simulated drawdown values are obtained through numerical simulations with the K and Ss tomograms.
Figure 10 shows that the calibrated geological model does a reasonable job of predicting drawdowns at most of the ports; however, drawdowns are significantly underestimated in the upper ports. This again suggests that the upper layer (layer 18) is not accurately characterized. In contrast, Figure 11 shows a drastically improved result when the simulated values are obtained from the forward simulation of the pumping test using the K and Ss tomograms from THT. An examination of Figure 11 shows that the match is not perfect, but overall, the drawdown curves are captured very well throughout the duration of the pumping test, which the other approaches failed to accomplish. The robust performance of transient hydraulic tomography is due to the sequential inverse modeling of multiple pumping tests, which results in repeated calibration and refinement of aquifer parameter fields.
7. Summary and Conclusions
We characterized and modeled a synthetic heterogeneous aquifer using two effective parameter approaches and three separate methods that consider the heterogeneity of the aquifer. Two effective parameter models are constructed: (1) by taking the geometric mean of single-hole test K and Ss estimates obtained by treating the aquifer as homogeneous; and (2) using MMOC3 [Yeh et al., 1993] coupled with PEST [Doherty, 1994] to obtain effective Kx, Kz, and Ss estimates by matching the transient drawdown data simultaneously from all ports during a cross-hole test. Heterogeneous models consist of spatially variable K and Ss fields obtained via (1) kriging single-hole K and Ss data; (2) accurately capturing the layering (i.e., geological model) and calibrating K and Ss estimates of these layers with MMOC3 coupled to PEST; and (3) conducting transient hydraulic tomography (THT) [Zhu and Yeh, 2005].
It should be reiterated that these methods incorporate different amounts of data; as such, the comparison could be considered unfair in a strict sense. However, aside from hydraulic tomography (which explicitly incorporates data from multiple pumping tests), these approaches, in their current form, are unable to incorporate data from multiple pumping tests for the estimation of heterogeneous parameter fields. While this is possible with PEST when coupled with a groundwater model and a geostatistics code, as implemented by Vesselinov et al. [2001b], doing so would be a form of hydraulic tomography, which is outside the common application of PEST. Thus, the work presented in this study compares THT, which utilizes data from multiple pumping tests, to other commonly available methods that do not include data from multiple pumping tests.
The performance of these homogeneous and heterogeneous K and Ss fields was then tested through the forward numerical simulation of cross-hole pumping tests that were not used in the characterization effort. The observed and simulated drawdown values from 16 cross-hole pumping tests conducted in the sandbox aquifer were directly compared through scatterplots. The comparison was done for individual tests and also for all 16 tests together. A linear model was fit to each of the scatterplots and the coefficient of determination (R2) computed to quantitatively assess the goodness-of-fit between the observed and simulated drawdown values. The slope and intercept of the linear model fit provide information on prediction bias.
We found that the forward numerical simulations using the two homogeneous cases could not accurately predict almost any of the 16 cross-hole pumping tests, and significant bias was evident when the observed and simulated drawdowns were compared through a series of scatterplots. The scatter and bias were evident not just in individual tests but also when the data from all 16 tests were plotted together and examined in an ensemble sense.
The heterogeneous models were also tested in a similar fashion. We found that the forward simulations of 16 independent cross-hole pumping tests using the kriged K and Ss fields were unable to accurately match the observed drawdowns. The calibrated geological model showed a significant improvement; however, it did not seem to accurately represent the uppermost layer, resulting in very poor matches in this region. It should be emphasized that for the calibrated geological model, the stratigraphy was assumed to be perfectly known. In reality, such knowledge is impossible to obtain. Therefore, the constructed and calibrated geological model should be considered a best-case scenario.
The forward numerical modeling of the 16 tests using the ln-K and ln-Ss tomograms computed via THT convincingly showed that these distributions led to the smallest discrepancy between observed and simulated drawdowns. Our sandbox experiments demonstrate the robustness of THT conducted in a controlled environment. The repeated calibration from the inclusion of data from multiple cross-hole pumping tests leads to improved characterization of the heterogeneity and connectivity of hydraulic parameters. This is the main strength of THT and an important advantage over the other methods examined here. The accurate estimation of hydraulic parameters and their distribution was critical for the accurate prediction of independent cross-hole pumping tests. We must emphasize that THT is merely a fusion of data from sequential pumping tests to estimate aquifer heterogeneity, as pointed out by Yeh et al. This fusion of information can certainly be undertaken using different inverse modeling algorithms such as PEST and others. The results of our study therefore support the call for change in the way we collect and analyze data for characterizing aquifers [Yeh and Lee, 2007].
 We are optimistic about the capabilities of hydraulic tomography, but are also cautious in generalizing our sandbox results to the field scale. It remains to be seen whether hydraulic tomography will yield robust results under field conditions. We are currently conducting a comprehensive field assessment of hydraulic tomography and other heterogeneous characterization and modeling methods at the University of Waterloo. These results will be reported in the near future.
 This research was supported by the Strategic Environmental Research and Development Program (SERDP) under grants ER-1365 and ER-1610, by the National Science Foundation (NSF) under grants EAR-0229713, IIS-0431069, and EAR-0450336, and by the Natural Sciences and Engineering Research Council (NSERC) of Canada through the Discovery grant to Walter Illman. In addition, Steve Berg was supported in part through the Ontario Graduate Scholarship. We thank Andy Craig for data collection and Junfeng Zhu for assisting us initially with the computation of the ln-K and ln-Ss tomograms. Finally, we thank the Associate Editor and the three reviewers for their constructive suggestions, which led to an improved paper.