This paper addresses the inverse problem in spatially variable fields such as hydraulic conductivity in groundwater aquifers or rainfall intensity in hydrology. Common to all these problems are the complex patterns of spatial variability of the target variables and observations, the multiple sources of data available for characterizing the fields, the complex relations between the observed and target variables, and the multiple scales and frequencies of the observations. The method of anchored distributions (MAD) that we propose here is a general Bayesian method of inverse modeling of spatial random fields that addresses this complexity. The central elements of MAD are a modular classification of all relevant data and a new concept called “anchors.” Data types are classified by the way they relate to the target variable, as either local or nonlocal and as either direct or indirect. Anchors are devices for localization of data: they are used to convert nonlocal, indirect data into local distributions of the target variables. The target of the inversion is the derivation of the joint distribution of the anchors and structural parameters, conditional on all measurements, regardless of scale or frequency of measurement. The structural parameters describe large-scale trends of the target variable fields, whereas the anchors capture local inhomogeneities. Following inversion, the joint distribution of anchors and structural parameters is used for generating random fields of the target variable(s) that are conditioned on the nonlocal, indirect data through their anchor representation. We demonstrate MAD through a detailed case study that assimilates point measurements of conductivity with head measurements from natural gradient flow. The resulting statistical distributions of the parameters are non-Gaussian; similarly, the distributions of the hydraulic head estimates are non-Gaussian.
We provide an extended discussion of MAD vis-à-vis other inversion methods, including maximum likelihood and maximum a posteriori, with an emphasis on the differences between MAD and the pilot point method.
 This paper presents a new approach for inverse modeling called method of anchored distributions (MAD). MAD is an inverse method focused on estimating the distributions of parameters in spatially variable fields. MAD addresses several of the main challenges facing inverse modeling. These challenges fall into two broad categories: data assimilation and modularity.
 Data assimilation in inverse modeling is the challenge of using multiple and complementary types of data as sources of information relevant to the target variable(s). In hydrogeological applications, for example, one may be interested in mapping the spatial distribution of the hydraulic conductivity K [cf. Kitanidis and Vomvoris, 1983; Kitanidis, 1986, 1991, 1995, 1997a, 1997b; Carrera and Neuman, 1986a, 1986b; Hernandez et al., 2006] using measurements of the hydraulic head, measurements of concentrations and travel times obtained from solute transport experiments [Bellin and Rubin, 2004], and measurements of geophysical attributes obtained from geophysical surveys.
 Another example is the mapping of ocean circulation, which relies on a variety of data types (e.g., temperature, density, velocity vector components) obtained from ship surveys, moored instruments, buoys drifting freely on or floating below the ocean surface, and satellites. These data are measured over a wide range of scales and frequencies, and they need to be assimilated to yield accurate circulation models.
 In a third example, air quality management requires constructing maps of dry deposition pollution levels. Ideally, such maps would be based on a dense network of monitoring stations, but generally such networks do not exist. Alternative and related information must be used instead. For example, there are two main sources of information for dry deposition levels in the United States: one is pollution measurements at a sparse set of about 50 monitoring stations called CASTNet, with spacing between stations on the order of hundreds of kilometers, and the other is the output of regional-scale air quality models with grid resolution on the order of a few kilometers [Fuentes and Raftery, 2005].
 In all these cases, the observations can be related to the target variables by known functions. The challenge is to combine the multiple sources of data into a coherent map of the target variables without introducing external factors such as smoothing and weighting.
 To address such a wide range of problems, we need to address the challenge of modularity. Modularity means an inverse modeling approach that is not tied to particular models or data types and maintains the flexibility to accommodate a wide range of both. Inverse methodology and the numerical simulation of data-generating processes have become closely intertwined, in a way that limits their range of application. (“Data-generating processes” in this document refers to the natural processes that result in a quantity being measured. These processes are usually simulated by numerical codes.) This can be attributed to the increasing complexity of the processes being analyzed and of the computational techniques needed for their analysis. For example, inverse modeling in hydrogeology evolved from Theis' type-curve matching into modern studies that include complex and specialized elements such as (1) adaptive and parallel computing techniques, (2) geophysical modeling of electromagnetic fields and of the propagation of seismic waves, and (3) complex multicomponent chemical reactions. The range of skills needed for implementing these elements has forced researchers to build the inversion procedure around their own or favored numerical codes. As a result, the potential for expanding the range of applications beyond the original one, for example, by changing the data types or the numerical models used, is limited. Modularity is a strategy for alleviating this difficulty by pursuing a model-independent inverse modeling framework.
 This paper, through presentation of the MAD concept, explores all these issues using a Bayesian framework. A theoretical approach is developed and demonstrated with two subsurface flow problems.
2. Data Classification and Anchor Definitions
 This section presents several of the principles underlying the MAD concept. It summarizes for completeness, and expands on, a few developments included in an unpublished manuscript by Zhang and Rubin (Inverse modeling of spatial random fields, unpublished manuscript, 2008) and in a conference presentation by Zhang and Rubin. The MAD approach to inverse modeling is built around two elements. The first element is data classification. The second is a strategy for localization of nonlocal data. The localization strategy is intended to create a unified approach for dealing with all types of data. These two elements are integrated into (1) a Bayesian formalism for data assimilation and (2) a forward modeling strategy in the form of conditional simulation. The integration is done in a modular form that can accommodate a wide variety of data types and the multiple ways in which these data can be related to the target variables of interest. The rationale for formulating the inverse problem using a statistical formalism has been amply discussed in the literature [e.g., Kitanidis, 1986; Rubin, 2003] and is not repeated here for the sake of brevity.
 We consider a spatial random process denoted by Y(x), where x is the space coordinate. As discussed earlier, Y could be a variable in any number of fields. It could represent, for example, the hydraulic conductivity in hydrogeological applications. In this case, inverse modeling would focus on the conductivity field based on measurements of pressure induced by a pumping test, concentration data from a tracer experiment, or the arrival times of seismic waves at multiple locations obtained from a geophysical survey [Hoversten et al., 2006; Hou et al., 2006]. The entire field of Y is denoted by Ỹ. A realization of Ỹ is denoted by ỹ. Given data z that are related to Ỹ, the goal is to derive the conditional distribution of the field, p(Ỹ | z), and to generate random samples from that distribution.
 The field Ỹ is defined through a vector of model parameters (θ, ϑ). The θ part of this vector includes a set of parameters that are designed to capture the global trends of Y, and it can assume different forms. For example, if one selects a geostatistical approach for modeling the global trends of Y [cf. Rubin and Seong, 1994; Seong and Rubin, 1999], θ would include parameters such as the mean of Y and the parameters of its spatial covariance. An alternative formulation of θ could involve a zonation-based approach [e.g., Poeter and Hill, 1997], whereby θ includes the values of Y at various zones of the model domain. It could also be a hybrid of these two concepts, whereby the model domain is subdivided into geological units, with each of the geological units characterized by a different geostatistical model [cf. Rubin, 1995; Dai et al., 2004; Ritzi et al., 2004; Rubin et al., 2006]. The ϑ component of this vector consists of the anchored distributions. Anchored distributions, or anchors in short, are devices used to capture the local effects or, in other words, all the elements or features of Ỹ that cannot be captured using the global parameters represented by θ (provided that the local effects have an impact on the data). In their simplest form, anchors would be error-free measurements of Y. Other forms of anchors include measurements of Y coupled with error distributions and/or anchors that are obtained by inversion. The anchors are defined in detail in section 2.2.
 The overall strategy for deriving p(Ỹ | z) is to derive a joint conditional distribution of the model parameters, p(θ, ϑ | z), which would in turn allow us to generate multiple realizations of Ỹ that maintain the global trends and capture the local effects. The distribution p(θ, ϑ | z) should be general and flexible enough to accommodate a wide range of formulations of the vector (θ, ϑ), as well as a wide range of data types that could be folded into the vector z.
2.1. Data Classification
 The concept of anchors is built around a general approach for classifying data based on the relation of the data to the attribute y and on the support volumes of the data and the attribute. Two classifiers are commonly used to describe such relationships: local and nonlocal. Data could be measured over the same support as y and be modeled as a function of the collocated y, in which case we refer to them as local. Otherwise, they are nonlocal. As we will show below, these relations can be captured using anchors. We shall refer to local and nonlocal data as Type A and Type B, respectively, and we will use za and zb to denote the Type A and Type B data, respectively. The vectors za and zb are general symbols for data and may include any or all of the following: measurements, descriptions of statistical relations, and measurement errors. Type A and Type B data refer to on-site data. Information in the form of expert opinion or from similar fields is treated differently and falls under the category of prior information.
 Type A data can be related to y through the equation
za = y(ỹ) + εa, (1)
where y is a known function and εa is a vector of length na of zero-mean errors. The vector za could include measurements of Y, and in that case εa would represent measurement error. It can include predictions of y obtained by regression or through the use of petrophysical models [cf. Ezzedine et al., 1999], and in that case εa would represent regression or modeling error. In the case where y is permeability, for example, Type A data could include measurements of permeability obtained using permeameters or predicted permeability from soil texture and grain size distributions using petrophysical models [cf. Ezzedine et al., 1999; Mavko et al., 1998].
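To make equation (1) concrete, the sketch below builds anchor distributions from hypothetical Type A measurements. The values, locations, and the Gaussian error model are illustrative assumptions (equation (1) only requires zero-mean errors); none of these numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Type A data: log conductivity measured at three locations,
# each with a known zero-mean Gaussian measurement error (an assumption;
# equation (1) only requires the errors to have zero mean).
z_a = np.array([-2.1, -1.4, -2.8])      # measured values of Y
sigma_a = np.array([0.10, 0.25, 0.10])  # measurement-error standard deviations

# Each Type A measurement yields one anchor: a local distribution of Y.
# Here an anchor is represented by samples from N(z_a_i, sigma_a_i^2).
n_samples = 10_000
anchor_samples = rng.normal(loc=z_a, scale=sigma_a, size=(n_samples, 3))

# With error-free data (sigma -> 0) each anchor collapses onto its measurement.
print(anchor_samples.mean(axis=0))  # close to z_a
```

With zero error the anchor distribution degenerates to the measured value itself, which is the error-free case discussed in section 2.2.2.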
 Type B data include all the data that cannot be classified as Type A. Type B data are functions of the ỹ field, or of a section of it that is larger than the support volume of the y measurements defined by Type A data. The Type B data can be described by the following equation
zb = M(ỹ) + εb, (2)
where M is a known function, numerical or analytical, of the spatial field, representing one or more physical processes, and εb is a vector of length nb of zero-mean errors. It is recalled that the tilde sign over a variable denotes a field of that variable. The vector zb can include a wide range of data types that could be obtained from multiple sources. In hydrogeological applications, zb could include data obtained from small- and large-scale pumping tests, solute transport experiments, continuous observations of hydraulic heads, and geophysical surveys. With data from multiple sources, M in equation (2) should be viewed as a collective name for all models that relate the data to ỹ, including, for example, flow and transport simulators and geophysical simulators.
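As a toy illustration of equation (2), the following sketch uses steady one-dimensional flow through a series of cells as the forward model M, mapping an entire conductivity field to nonlocal head observations. The solver, field parameters, and error level are hypothetical stand-ins for the flow simulators discussed in the text.

```python
import numpy as np

def forward_heads(K, h_left=1.0, h_right=0.0):
    """Toy forward model M: steady 1-D flow across n cells in series with
    conductivities K; returns heads at the n-1 interior interfaces.
    For series flow the flux q is constant, so the head drop across each
    cell is proportional to its resistance 1/K_i."""
    resist = 1.0 / K
    q = (h_left - h_right) / resist.sum()   # constant Darcy flux
    drops = q * resist                      # head drop across each cell
    return h_left - np.cumsum(drops)[:-1]   # interior interface heads

rng = np.random.default_rng(1)
K = np.exp(rng.normal(0.0, 1.0, size=20))  # one realization of exp(Y), Y = ln K
# Equation (2): z_b = M(y-field) + eps_b, with a small zero-mean error
z_b = forward_heads(K) + rng.normal(0.0, 1e-3, size=19)
```

Each head value depends on the entire conductivity field, not on a collocated Y value, which is exactly what makes such data Type B.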
2.2. The Concept of Anchors
 With Type A and Type B data available, the inversion's final goal becomes the generation of Y fields conditioned on both data types. This could be done through the conditional distribution p(Ỹ | za, zb). The major challenge in deriving this distribution is the absence of a simple device that would guarantee that the simulated fields are already conditioned on the Type B data, without the need to verify such conditioning through repeated use of M. We will construct such a device in the form of anchors.
 Anchors are model devices in the form of local statistical distributions of Y, intended to establish connections between the unknown Y field and the data z = (za, zb). Using anchors and structural parameters, we will be able to generate Y fields that are conditioned on Type A and Type B data as well as on the inferred distributions of the structural parameters. Conceptually, structural parameters describe global trends and spatial associations, whereas anchors capture local features. We should emphasize at this point that anchors are not pilot points. There are fundamental conceptual and technical differences that are discussed in great detail in section 8.
 Anchors are always given as statistical distributions, but the correspondence between anchors and Type A data differs from that between anchors and Type B data. In the case of Type A data, anchors are given in the form of statistical distributions representing the Y values plus measurement (or regression) error, and we have one anchor per Type A measurement. Type B measurements, on the other hand, could be represented by more than one anchor, with each anchor possibly corresponding to one or several Type B measurements. Anchors are planted at multiple locations, with the idea that they would capture the information contained in the Type B measurements that is relevant to the Y field. This is achieved by transforming the Type B data into multiple anchors at known locations based on our knowledge of the Y field and the nonlocal data generation process M. The transformation of Type B data into anchors changes both the form of the data and the location of the information. Subsequently, simulations conditioned on these anchors are, to a large degree, conditioned on both Type A and Type B data.
 Anchor placement is an important element of MAD: obviously, we would want to place the anchors such that they capture all the relevant information contained in the data. This is trivial for Type A data because in that case the anchors are collocated with the measurements. It is a complex issue in the case of Type B data because of the complex averaging applied to Y by the Type B process. This issue will be discussed in section 6. Leaving the anchor placement question aside for now, we proceed to discuss how the statistical distributions are determined for a given set of anchor locations.
 In our approach, the anchors are viewed as model parameters, similar to the structural parameters, and as such, they are determined by inversion. Denote the vector of anchors by ϑ = (ϑa, ϑb), where ϑa, located at known locations xa, are the anchors corresponding to Type A data, whereas ϑb, located at chosen locations xb, are the anchors corresponding to Type B data. The goal of the inversion is to derive the joint anchor-parameter distribution p(θ, ϑ | za, zb). Once this distribution is defined, any random draw of (θ, ϑ) from this distribution contains all the information needed for generating conditional realizations. The derivation of this distribution is the subject of section 2.2.1.
2.2.1. MAD With Type A and Type B Data
 Whereas MAD in general uses both Type A and Type B data, it can also be used for inverse modeling when only Type A or only Type B data are available. These cases are presented in this section. As a starting point, let us consider the derivation of p(θ, ϑ | za, zb), the joint distribution of the model parameters, including structural parameters and anchors, conditional on Type A and Type B data. Following the anchor notation defined in the previous section, it can be shown that
p(θ, ϑ | za, zb) ∝ p(ϑa | za) p(θ | ϑa) p(ϑb | θ, ϑa) p(zb | θ, ϑ). (3)
Equation (3) is a Bayesian model that relates model parameters to data and to prior information in the form of a posterior probability. In equation (3), the posterior probability is simplified by dropping za whenever it is coupled with ϑa as a condition (i.e., to the right of ‘|’), under the assumption that the information provided by the anchors ϑa encapsulates the information provided by za, thus rendering the conditioning on za superfluous.
 In equation (3), p(zb | θ, ϑ) denotes the likelihood of the Type B data, which is the key for relating the posterior distribution of the model parameters to the Type B data. p(θ, ϑ | za) has the dual role of being the posterior probability given Type A data, as well as a prior probability preceding the introduction of Type B data. The distribution p(ϑa | za) is derived from the Type A data. The distribution p(θ | ϑa) is discussed in section 3.1 below. The distribution p(ϑb | θ, ϑa) is the prior of the anchors ϑb given the Type A data and the structural parameter vector θ.
 Equation (3) highlights the role of the anchored distribution as the mechanism for connecting Type A and Type B data, without making any specific modeling assumption to relate them. The only opening in equation (3) for ambiguity is in the relationships between the various Type B data that are included in the likelihood function and the target variables. This can be dealt with in one of two ways. First, this relationship may be known or can safely be assumed or derived from physical principles using statistical modeling assumptions [cf. Hoeksema and Kitanidis, 1984; Dagan, 1985; Rubin and Dagan, 1987a, 1987b]. Otherwise, the likelihood function can be defined nonparametrically and derived numerically based on extensive numerical simulations. Both approaches can be implemented in MAD. In the case study pursued here, we employed the second option.
2.2.2. MAD With Type A Data Only
 In the presence of Type A data only, equation (3) simplifies to
p(θ, ϑa | za) ∝ p(ϑa | za) p(θ | ϑa). (4)
Since Type A data are local, the role of the anchors here is limited to modeling measurement or regression errors, and so the anchors here are measured or regressed, unlike the anchors corresponding to Type B data, which are obtained by inversion.
 The distribution p(ϑa | za) represents the probabilities of the various ensembles of anchor values that are plausible in light of the Type A data and the distribution of the measurement/regression errors. These ensembles can lead to various structural parameter combinations, which are summarized in the distribution p(θ | ϑa).
 The application of MAD in this case includes a sequence of three steps. First, the anchor distributions are defined. Working with Type A data, the anchors are not obtained by inversion: they are determined based on the measurement and/or regression errors. In the next step, the distribution of θ is determined based on ϑa [e.g., Hoeksema and Kitanidis, 1984; Kitanidis, 1986; Diggle and Ribeiro, 2006], and in the final step, this distribution is used in conjunction with za to generate realizations of the Y field from the distribution p(Ỹ | θ, za).
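The middle step of this sequence, inferring the structural parameters from the anchor values, can be sketched as follows. The example assumes an i.i.d. Gaussian field (no spatial correlation), a flat prior, and a brute-force grid evaluation of the posterior; all numbers are illustrative, not from the case study.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical truth: theta = (mean, variance) of Y
theta_true = (-2.0, 0.5)
# Error-free Type A anchors: 50 direct observations of Y
y_a = rng.normal(theta_true[0], np.sqrt(theta_true[1]), size=50)

# Evaluate the (unnormalized) posterior p(theta | anchors) on a grid,
# assuming Y ~ i.i.d. N(mean, variance) and a flat prior over the grid.
means = np.linspace(-4.0, 0.0, 81)
variances = np.linspace(0.1, 2.0, 77)
log_post = np.empty((means.size, variances.size))
for i, m in enumerate(means):
    for j, v in enumerate(variances):
        # Gaussian log likelihood of the anchor values under N(m, v)
        log_post[i, j] = -0.5 * np.sum((y_a - m) ** 2 / v + np.log(2 * np.pi * v))

# Normalize into a discrete posterior and locate its mode
post = np.exp(log_post - log_post.max())
post /= post.sum()
i_max, j_max = np.unravel_index(post.argmax(), post.shape)
print(means[i_max], variances[j_max])  # should be near (-2.0, 0.5)
```

In a full application, θ would also include spatial covariance parameters and the posterior would feed the conditional field generation of the final step.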
 When the Type A data are error-free, equation (4) can be simplified by noting that in this case the anchors are equal to the measurements. The anchor notation can then be dropped altogether, and the posterior distribution of the structural parameters is given by
p(θ | za) ∝ p(za | θ) p(θ). (5)
 The final step of this process consists of generating realizations from the conditional distribution p(Ỹ | θ, za). In the studies cited in the previous paragraph, this distribution was obtained by standard conditioning procedures for multivariate normal distributions. Distributions other than the multivariate normal could also be used; Diggle and Ribeiro, for example, used trans-Gaussian transforms to deal with nonnormal distributions.
2.2.3. MAD With Type B Data Only
 In the absence of Type A data, equation (3) simplifies as follows: p(ϑa | za) p(θ | ϑa) becomes p(θ). Next, p(ϑb | θ, ϑa) becomes p(ϑb | θ). Finally, p(zb | θ, ϑ) becomes p(zb | θ, ϑb), leading to the following definition of the posterior,
p(θ, ϑb | zb) ∝ p(θ) p(ϑb | θ) p(zb | θ, ϑb), (6)
with the main difference compared to the previous cases being that the prior is not informed by the Type A data.
3. Inverse Modeling With MAD
 The critical elements in the application of equation (3) include the following: determination of the prior, placement of the anchors, derivation of the likelihood function, and application of MAD for predictions. In this section, we discuss these elements individually (except the anchor placement issue, which is discussed in section 5), and then we show how they combine into a modular algorithm.
 The nonparametric approach is more general because it covers a wide range of distributions. The appeal of nonparametric methods lies in their ability to reveal structure in the data that might be missed by parametric methods. This advantage can come with a heavy price tag: nonparametric methods are often much more computationally demanding than their parametric counterparts. The MAD algorithm (see next section) is flexible in its ability to employ both parametric and nonparametric methods. Several alternatives for calculating the likelihood function are summarized in the work of Scott and Sain and of Newton and Raftery. For the particular applications discussed below, we employed the algorithms described by Hayfield and Racine, which are part of the R-Package [R Development Core Team, 2007].
 Both approaches are suitable for MAD and can be implemented at the modeler's discretion. The overriding consideration should be the selection of the most appropriate representation of the likelihood, and this depends on the data. For example, when zb is composed of pressure head measurements taken in uniform-in-the-average flow in an aquifer domain characterized by a small variance of the log conductivity, a multivariate normal likelihood function is appropriate because the head can be expressed as a linear function of the log conductivity [Dagan, 1985]. We chose to present a nonparametric approach here because it is more general and because it is not commonly used in groundwater applications.
 The likelihood p(zb | θ, ϑ) in equation (3) is estimated using numerical simulations, as follows. For any given (θ, ϑ), we generate multiple conditional realizations of the Y field; for each realization, a forward model provides a prediction of zb in the form of ẑb. In other words, zb is viewed as a measured outcome of a random process, whereas ẑb is one of many possible realizations, each corresponding to a particular realization of (θ, ϑ). The ensemble of ẑb constitutes a sample of zb, and it is used for estimating the likelihood at zb.
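A minimal sketch of this likelihood estimate is given below. The forward model here is a hypothetical stand-in (a spatial-averaging function, not a flow simulator), and the kernel density estimator is SciPy's gaussian_kde rather than the R routines cited in the text; the logic, smoothing the ensemble of simulated ẑb and evaluating the density only at the measured zb, is the same.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)

def toy_forward(y_field):
    # Stand-in forward model M: a nonlocal (spatially averaged) function
    # of the whole field, yielding two Type B observations.
    return np.array([y_field[:10].mean(), y_field[10:].mean()])

def likelihood_at_zb(mean, var, z_b, n_real=2000):
    """Nonparametric estimate of p(z_b | theta) for theta = (mean, var):
    simulate many fields, push each through M, and evaluate a kernel
    density estimate of the simulated ensemble at the measured z_b."""
    fields = rng.normal(mean, np.sqrt(var), size=(n_real, 20))
    sims = np.array([toy_forward(f) for f in fields])  # ensemble of z_b-hat
    kde = gaussian_kde(sims.T)                         # 2-D density estimate
    return kde(z_b)[0]                                 # density at the data

z_b_obs = np.array([-2.0, -2.1])
# The likelihood is higher for parameters consistent with the observations
L_good = likelihood_at_zb(-2.0, 0.5, z_b_obs)
L_bad = likelihood_at_zb(+1.0, 0.5, z_b_obs)
```

Note that the density is evaluated only at the measured zb, which is the first of the two computational savings discussed below.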
 Nonparametric estimation of statistical distributions requires a large number of forward simulations. When multiple data types are involved, that would include conditional simulation of the Y field followed by forward modeling. In MAD, two elements combine to reduce the computational effort. First, we do not need to evaluate the likelihood p(zb | θ, ϑ) for every possible combination of zb values, but only for the particular set of values that was measured, thus requiring a smaller number of samples to ensure convergence. Second, the dimension of the parameter vector (θ, ϑ) is small compared to full-grid inversion.
3.3. A Flowchart for MAD
 The MAD approach is represented schematically in Figure 1, using the notation provided in equation (3). Figure 1 shows the modular structure of the MAD approach. There are three blocks in MAD, labeled Blocks I, II, and III, respectively, and two auxiliary blocks, labeled Auxiliary Blocks A and B.
3.3.1. Block I
 Block I is the preprocessing module, and it is focused on the Type A data. It computes the joint distribution of the structural parameters (θ) and anchors (ϑ) based on the Type A data as well as any prior knowledge of the parameter vector θ. As noted earlier, there is a wide range of ideas that could be implemented through this block. There is no particular approach to modeling the prior that is hard-wired into MAD. The output of Block I is the conditional distribution p(θ, ϑ | za), which is the posterior distribution of (θ, ϑ) with respect to za. If we view zb as the “main” data, then this distribution provides a prior of (θ, ϑ) for the Bayesian analysis of zb that is carried out in Block II.
3.3.2. Block II
 Block II is the likelihood analysis module. It incorporates the Type B data through the likelihood function p(zb | θ, ϑ). When combined with the Block I product, p(θ, ϑ | za), where ϑ are the anchor values at their respective locations, it yields the posterior p(θ, ϑ | za, zb). The likelihood function p(zb | θ, ϑ) is linked to observations by a forward model M through the relationship zb = M(ỹ) + εb. This relationship is composed of two elements: random field generation and forward simulation. The joint distribution of θ and ϑ from Block I is used to generate conditional realizations of the Y field ỹ. With each realization, the forward model M generates a realization of the Type B data that would eventually be part of the ensemble used to evaluate the likelihood function. Several forward models can be employed simultaneously, depending on the number of different Type B data types employed in the inversion process, as indicated by the vertical bar linked to forward simulations. Examples of M functions include flow models, solute fate and transport models, and geophysical models. The positioning of Models I and II within Block II and the positioning of Models III and IV as external but linkable elements of Block II are intended to signify the flexibility to plug in user-supplied models in addition to hard-wired models.
3.3.3. Block III
 Block III is the prediction block. It covers the postinversion analyses needed for predicting a future, unknown process of interest. It can connect directly to Block I in the absence of Type B data. The forward simulation step in Block III can guide the selection of anchor locations in Block II. A simple way to do it is by evaluating alternative anchor placement schemes. As with the other blocks, Block III could be linked with a wide range of forward simulation codes and computational techniques.
 The prediction block is built around multiple conditional realizations of the random field. Each of these realizations is generated using a realization of parameters and anchors drawn from the joint distribution of θ and ϑ. Generating random fields from the joint distribution of θ and ϑ is advantageous because many alternative combinations of both could be evaluated, leading to a more complete characterization of the uncertainties associated with the model. This is in contrast to the commonly used maximum likelihood (ML) and maximum a posteriori (MAP) approaches, both of which present the uncertainties of Y corresponding to a fixed set of model parameters.
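One such conditional realization can be sketched as follows, assuming a one-dimensional Gaussian field with an exponential covariance; MAD itself is not tied to these choices. The structural parameter values stand in for a single posterior draw, and the anchor values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def exp_cov(x1, x2, var, lam):
    """Exponential covariance between 1-D coordinate vectors x1 and x2."""
    return var * np.exp(-np.abs(x1[:, None] - x2[None, :]) / lam)

def conditional_realization(x_grid, x_anchor, y_anchor, mean, var, lam):
    """One realization of a 1-D Gaussian field on x_grid, conditioned on
    anchor values y_anchor at x_anchor, for one draw of the structural
    parameters theta = (mean, var, lam)."""
    C_gg = exp_cov(x_grid, x_grid, var, lam)
    C_ga = exp_cov(x_grid, x_anchor, var, lam)
    C_aa = exp_cov(x_anchor, x_anchor, var, lam)
    W = C_ga @ np.linalg.inv(C_aa)
    cond_mean = mean + W @ (y_anchor - mean)
    cond_cov = C_gg - W @ C_ga.T
    cond_cov = (cond_cov + cond_cov.T) / 2  # enforce symmetry vs round-off
    return rng.multivariate_normal(cond_mean, cond_cov)

x_grid = np.linspace(0.0, 10.0, 51)
x_anchor = np.array([2.0, 7.0])   # anchor locations (on the grid)
y_anchor = np.array([-1.0, -3.0]) # anchor values from one posterior draw

field = conditional_realization(x_grid, x_anchor, y_anchor, -2.0, 1.0, 3.0)
```

Repeating this for many posterior draws of (θ, ϑ) yields the ensemble of conditional fields on which the prediction block operates.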
 The auxiliary blocks include Block A, which is dedicated to anchor placement analysis, and Block B, which is dedicated to model intercomparison (both topics are discussed in the next section). They are not considered core blocks because they contain elective procedures that are not strictly necessary for a complete application of MAD.
4. Measurement Errors, Parametric Errors, and Conceptual Modeling Errors
 An important element of any inverse modeling approach, including MAD, is the forward model M. Once a realization of Y is generated in the form of a field ỹ, M(ỹ) can be used to generate the field z̃b, or possibly a number of such fields, corresponding to all Type B attributes. A subset of z̃b, specifically the values generated at xb, can be used to construct a sample of the likelihood p(zb | θ, ϑ). One can reasonably expect the values generated by M(ỹ) at xb to differ from zb, because of parametric errors, conceptual modeling errors, numerical modeling errors, and measurement errors. Parametric error refers to estimation errors in the parameters of a particular M, whereas conceptual modeling errors refer to errors in formulating the concepts that underlie the model M.
 It is recalled that equation (1) defines the measurement/regression error associated with Type A data. Equation (2) defines the errors associated with Type B data due to measurement and parameter errors for a given model M. By introducing these errors into equation (3), we can account explicitly for their impact on the posterior distribution. Equation (3) can be expanded to include the error terms as follows:
p(θ, ϑ | za, zb) ∝ ∫ p(ϑa | ya, εa) p(θ | ϑa) p(ϑb | θ, ϑa) p(zb | θ, ϑ, εb) p(ε) dε. (7)
The error in the Type A data is represented by εa. The parameter error is represented by εb, and it covers errors in ϑb and in the structural parameters θ. ε = (εa, εb), and p(ε) denotes its distribution. p(ϑa | ya, εa) is the distribution of the anchors given measured (or regressed) Y values and the distribution of the measurement/regression errors.
 For demonstration, consider the case of zero error in the Type A data. The distribution p(ϑa | ya, εa) then becomes the Dirac function p(ϑa | ya, εa) = δ(ϑa − ya), which means that the anchors corresponding to the Type A data are equal to the measured values. On that basis, equation (7) becomes
p(θ, ϑb | za, zb) ∝ p(θ | ya) p(ϑb | θ, ya) ∫ p(zb | θ, ϑb, ya, εb) p(εb) dεb. (8)
 Equation (7) does not account for the errors associated with the formulation of M because it considers only a single, deterministically known model M; thus, we cannot know how M would compare against the perfect, error-free model (which is in itself an elusive concept). Note that a “deterministically known” model does not exclude parameter errors from consideration; it only assumes that the model formulation is taken to be correct.
 One idea for departing from the confines of a single model is to formulate and evaluate several alternative plausible models and with this minimize the risk implied by betting on a single model. This is the approach pursued by Hoeting et al., Neuman, and Ye et al., which we apply here within the MAD algorithm. The idea is to formulate N alternative forward models: Mi, i = 1, .., N. Each alternative model is defined by a different set of parameters: a model Mi is defined by a corresponding set of parameters (θi, ϑi). N could be large, reflecting the wide range of choices that could be made in the model formulation process. Such choices could include, for example, the selection of a particular correlation model for Y or the selection of a particular multivariate distribution of Y from several available alternatives. These choices could also reflect decisions about numerical implementation, such as the selection of grid size and time step.
 Alternative models imply alternative combinations of structural parameters and anchors. Assuming that Mi, i = 1, .., N, are the N models under consideration and (θi, ϑi) are the parameters corresponding to Mi, equation (3) and each of its derivatives (e.g., equation (7)) could be written for each of the N combinations (θi, ϑi) instead of a single (θ, ϑ) combination. Solving the inverse problem would mean deriving the posterior distribution for each of these combinations.
To account for the multiple models in predictions, each of these models needs to be weighted by a probability p(Mi) such that Σi p(Mi) = 1. The role of p(Mi) is to reflect the plausibility of the corresponding model. With these definitions, any variable of interest can be predicted in a variety of ways. For example, one could average the expected value of a variable ψ at the maximum likelihood point of each of the models using 〈ψ〉 = Σi ψ(θ̂i, ϑ̂i)p(Mi), where ˆ denotes the maximum likelihood point. The major challenge here is to determine the model probabilities p(Mi). Derivation of the probabilities p(Mi), i = 1, .., N, is pursued by Hoeting et al. [1999], Neuman [2003], and Ye et al. [2004]. In our subsequent discussion, we shall refer to this approach as the discrete model approach.
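As a minimal numerical sketch of this averaging rule (the model probabilities and predictions below are purely illustrative and not taken from any study):

```python
import numpy as np

# Hypothetical probabilities p(M_i) for N = 3 alternative models; in
# practice deriving these probabilities is the major challenge noted above.
p_M = np.array([0.5, 0.3, 0.2])
assert np.isclose(p_M.sum(), 1.0)  # probabilities must sum to 1

# psi_hat[i]: expected value of psi evaluated at the maximum likelihood
# point of model M_i (illustrative numbers).
psi_hat = np.array([1.2, 0.9, 1.5])

# Model-averaged prediction: <psi> = sum_i psi_hat_i * p(M_i)
psi_avg = float(np.sum(psi_hat * p_M))
print(psi_avg)  # approximately 1.17
```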
In the discrete approach, each of the models Mi is defined by a different set of parameters θi, ϑi and a probability p(Mi). Defining the likelihood functions requires a computational effort that scales up with the number of models and the number of parameters in each model. This effort could be reduced if parameters could be used to distinguish between conceptual models. We refer to this idea as broad spectrum model selection, because a single parameter could be used to represent a broad spectrum of models. It can complement discrete model selection or replace it, depending on the application.
 To demonstrate this idea, we will consider the selection of the spatial covariance of Y. The literature suggests several authorized covariance models, e.g., normal, exponential, etc. Combinations of authorized models are also likely candidates. For each case one could consider isotropic and anisotropic models [Rubin, 2003, chapter 2]. One can easily identify N alternative models that could be used in equation (4), and each of them could be associated with a probability p(Mi). Representing all or a subset of these alternatives using equation (4) is a possibility, as discussed earlier. The alternative we propose here is to consider the Matérn family of covariance functions [Matérn, 1986], given by [cf. Nowak et al., 2010]
where ℓ2 = Σi (ri/λi)2, ri, i = 1, .., ns, are the components of the lag vector r, λi are the corresponding length scales, ns is the space dimensionality selected for modeling, and σ2 is the variance of Y. Bκ is the modified Bessel function of the third kind of order κ [Abramowitz and Stegun, 1965, section 10.2]. κ ≥ 0 is a shape parameter because it controls the shape of the covariance function. For example, κ = 0.5, 1, ∞ correspond to the exponential, Whittle, and Gaussian covariance models, respectively. The shape parameter κ can assume any nonnegative value, and as such, searching over the range of κ values amounts to screening an infinite number of models. The advantages are obvious: we can evaluate an infinite number of covariance models instead of a finite number. Furthermore, we do not need to assign a probability to each of these models. Instead, a distribution for κ is obtained by inverse modeling. Embedding this concept in MAD is straightforward: it is sufficient to introduce κ into the structural parameter vector θ. The MAD procedure would yield its distribution, with any value in this distribution representing a different covariance model.
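A minimal sketch of the Matérn family in one common parameterization (the exact scaling of the lag varies between authors; SciPy's `kv` is the modified Bessel function of the second kind, an alternative name for the "third kind" function used above):

```python
import numpy as np
from scipy.special import kv, gamma

def matern_cov(ell, sigma2=1.0, kappa=0.5):
    """Matern covariance as a function of the scaled lag ell = |r|/lambda.

    One common parameterization (conventions differ between authors):
    C(ell) = sigma2 * 2^(1-kappa)/Gamma(kappa) * ell^kappa * K_kappa(ell).
    """
    ell = np.asarray(ell, dtype=float)
    c = np.empty_like(ell)
    zero = ell == 0.0
    c[zero] = sigma2  # C(0) equals the variance
    l = ell[~zero]
    c[~zero] = sigma2 * 2.0**(1.0 - kappa) / gamma(kappa) * l**kappa * kv(kappa, l)
    return c

lags = np.array([0.0, 0.5, 1.0, 2.0])
# kappa = 0.5 recovers the exponential model sigma2 * exp(-ell)
print(np.allclose(matern_cov(lags, kappa=0.5), np.exp(-lags)))  # True
```

Varying `kappa` continuously thus sweeps through an infinite family of covariance shapes, which is exactly what makes it attractive as a structural parameter.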
The discrete approach and the broad spectrum approach can be combined: a discrete approach could be used for those components of the model that cannot be defined based on the broad spectrum approach (e.g., alternative numerical schemes), and the broad spectrum approach could be applied for the rest. The Matérn family of covariances can be used for a wide range of situations where spatial variability is of concern, and as such it holds a potential for reducing the computational effort and the limitations of working with a finite number of alternatives under the discrete approach.
5. Placement of Anchors
Where to place the anchors? The answer depends on what we want to accomplish with the anchors. The derivation of equation (3) assumes that conditioning on the data z = (za, zb) is equivalent to conditioning on the anchors ϑ = (ϑa, ϑb). Placement of the anchors is therefore intended to meet this requirement, which provides us with a clear guideline. Meeting it is a challenge, however, because it is difficult to know where to place the anchors in a field that is poorly characterized. Consider, for example, the decline in water table elevation. This decline could be controlled by the presence of local features such as high- and/or low-conductivity areas. Obviously, we would want to capture these features, and this could be done by proper placement of anchors. However, the locations of these features may not be known a priori, and we could consider placement of multiple anchors, under the assumption that this will secure our ability to identify the important local features. At the same time, we should consider that anchors are model parameters, and as such, more is not necessarily better. Hence, a strategy is needed that would place the anchors such that all the important information is captured using a small number of anchors.
We propose a strategy for placing anchors that is built around two steps. In the first step, the anchors are placed based on the geological conditions, the characterization goals, and the method(s) of data acquisition. The second step is a test of sufficiency. The first step is built around physical principles, judgment, and experience, whereas the second step is more mechanistic in nature and intends to capitalize on the increase in our understanding of the site's geology. Before discussing this approach in detail, we will provide some general background information.
There are several findings from previous studies that are relevant in this context. Bellin et al. investigated solute transport in heterogeneous media numerically and indicated that the spatial resolution of numerical models that would capture accurately the effects of spatial heterogeneity on solute transport is of the order of a quarter of the integral scale of the log conductivity. This dependence, however, does not mean that all the information collected in a transport experiment could be localized: some observations are sensitive to local variability only in the aggregate. For example, the spatial moments of large solute plumes depend on spatial variability only in the aggregate, whereas the moments of small plumes depend very much on local effects [Rubin et al., 1999, 2003]. In another example, Sánchez-Vila et al. showed that the large-time drawdown response measured during a pumping test led to transmissivity values that were the same regardless of the location of the observation well. The transmissivity backed out from such data is the effective transmissivity, which is sensitive to spatial variability only in the aggregate. In such a case there is no use for anchors, and the inverse modeling (see equation (3)) should be limited to identifying structural parameters [cf. Copty et al., 2008]. However, the early time drawdown in pumping and observation wells is reflective of local conditions at the respective locales, and those locales would constitute ideal locations for anchors.
In order to localize Type B information, one would need to identify first the Type B data types that could be localized and the locales that are the most sensitive to such data. An attractive strategy for identifying such locales is sensitivity analysis. Castagna and Bellin [2009] used a sensitivity analysis for this purpose in the context of hydraulic tomography. Vasco et al. used a sensitivity analysis in the context of tracer tomography. Both studies indicated that certain locales (in both cases near the injection and observation wells) are much more sensitive to nonlocal data than others. Such locales are prime targets for placement of anchors.
Castagna and Bellin [2009] found that the most sensitive locales are close to the tomographic wells. The areas that are somewhat removed from the wells are less sensitive, but they are uniformly sensitive. Placement of anchors over areas of uniform sensitivity should reflect the geological conditions and, in particular, the heterogeneity's characteristic length scales. From Castagna and Bellin [2009], we conclude that in cross-hole tomography, anchors should be placed 0.25 IY apart, where IY is the integral scale of the log conductivity. The challenge here lies in the fact that the integral scale may not be known a priori. However, reliable prior information could be obtained from field studies conducted in similar formations [e.g., Scheibe and Freyberg, 1995; Hubbard et al., 1999; Rubin, 2003, chapter 2; Sun et al., 2008; Ritzi et al., 2004; Ramanathan et al., 2008; Rubin et al., 2006] that could assist in a preliminary analysis. Additional anchors could be placed based on the test of sufficiency discussed below.
We have discussed thus far the placement of anchors based on sensitivity analysis and geological conditions. The next idea to explore is placing anchors where they would be the most beneficial in terms of predictions. We refer to this practice as targeted anchor placement. The posterior distribution p(θ, ϑ|za, zb) in equation (3) could be modified into the form p(θ, ϑ(1)|za, zb), where ϑ(1) is a subset of ϑ that contains those anchors that are potentially the most beneficial for prediction. One could also consider working with the subset of zb that corresponds to ϑ(1). For example, if one is interested in a detailed analysis of transport processes in a subdomain, then it would make sense to place ϑ(1) over that subdomain only. This would have the benefit of a (possibly significant) reduction in the computational effort associated with the inversion. We should note, however, that targeted placement could be a risky proposition because anchors and parameters are estimated simultaneously, and the elimination of anchors could affect the accuracy of the estimated structural parameters. For example, estimating the integral scale would require several anchor pairs to be placed at distances on the order of IY [Castagna and Bellin, 2009]. So targeted placement is recommended only when a compelling case could be made to support it.
Once an initial set of anchors is placed and the corresponding inversion is completed, a test of sufficiency could follow. Consider a set of anchors ϑb(1) and a nonoverlapping set of test anchors ϑb(t). The set ϑb(t) is intended to verify that all the relevant and extractable information contained in zb has been captured by ϑb(1). This condition is achieved when the marginal distributions of the test anchors in ϑb(t) do not change as additional anchors are added to ϑb(1). Let us further consider an expanded anchor set, ϑb(2), which includes ϑb(1) and a few additional anchors placed at potentially valuable locations covering the same subdomain as ϑb(1). When the test set ϑb(t) satisfies the following condition,
for an increasingly large ϑb(2), then we could say that ϑb(1) captures all the information contained in zb, because the additional anchors are redundant in terms of information content.
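One way to operationalize this comparison of test-anchor marginals is sketched below with synthetic samples standing in for the posterior anchor samples; the use of a two-sample Kolmogorov-Smirnov test is our own illustrative choice, not part of MAD:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-ins for posterior samples of one test anchor's marginal, obtained
# once with the anchor set b(1) and once with the expanded set b(2); here
# both are drawn from the same distribution to mimic the stable case in
# which the added anchors are redundant.
samples_with_b1 = rng.normal(loc=-1.0, scale=0.5, size=2000)
samples_with_b2 = rng.normal(loc=-1.0, scale=0.5, size=2000)

# If the marginal is unchanged by the added anchors, the two samples
# should be statistically indistinguishable.
stat, pvalue = ks_2samp(samples_with_b1, samples_with_b2)
print(stat, pvalue)  # a small KS statistic suggests b(1) is sufficient
```

Repeating this check for each test anchor individually points to the locations where additional anchors are still warranted.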
By looking at the marginal distributions of each of the test anchors individually, we could determine locations where introducing additional anchors into the set ϑb(1) is warranted. Such locations could represent local features that affect the observations and that were not captured by the original set of anchors. This point is discussed further in section 6. Our discussion there shows how the density functions of the dependent variables converge to stable asymptotic limits as more anchors are added, which is a clear indication that the optimal number of anchors has been reached, because no additional information could be extracted from the data.
It is possible that the condition stipulated in equation (9) would be attained without any of these distributions being equal to p(ϑb(t)|zb), meaning that the anchors did not capture all the information contained in the actual data, for example, when the anchors are placed at locations that are too remote to be of consequence. In order to avoid that possibility, the recommendations from the studies discussed above regarding spacing and sensitivity analysis could be implemented.
 Targeted placement could be applied in a variety of combinations. For example, a dense grid of anchors could be placed where prediction accuracy is critical, whereas a low-density grid could be used for the rest of the domain. The high-density portion of the grid will be effective for capturing local features, whereas the low-density grid would be useful for estimating the global trend parameters. In another example, anchors could be placed in the locations that are most beneficial for predictions. This is the idea of network design that was pursued by Janssen et al.  and Tiedeman et al. [2003, 2004].
6. Case Study
The goal of this case study is to determine the spatial distribution of the transmissivity using a sparse network of Type A and Type B data. The case study consists of the following steps. In the first step, we generated a spatially variable conductivity field for a given set of structural parameters. In the second step, we solved the flow equation for a given set of boundary conditions to obtain the spatial distribution of the pressure field. The conductivity field and the computed pressure field are taken as the baseline case. Conductivities and pressures at selected locations were designated as Type A and Type B data, respectively. The remaining values were used for evaluating the quality of the inversion and for testing the predictive capabilities of the inferred model.
6.1. Background and Methods
 The target variable is the log-transform of the transmissivity, which we will denote by Y. The information available for inversion includes Type A data in the form of Y measurements, and Type B data in the form of pressure measurements, taken at multiple locations. Hence, the data vector available for inversion is z = (za, zb), where za(xa) includes na measurements of Y taken at the vector of locations xa of length na, whereas zb(xb) includes nb pressure measurements taken at xb.
The flow field is at steady state and is uniform on average, resulting from a pressure difference across two opposing boundaries and no-flow conditions at the other two boundaries. With this, the target variable and observations can be related through the flow equation given in Appendix A, which constitutes the forward model M (see equation (2)).
Inverse modeling consists of two stages, as shown in Figure 1. The first stage is the derivation of the joint distribution of the anchors and the structural parameters, as specified in equation (3) (this part of the inversion is covered by Blocks I and II of Figure 1). In the second stage, this distribution is introduced into Block III, to be used for generating multiple realizations of the Y field. Following this route, we define a vector of structural parameters θ for Y and a vector of anchors ϑ (see equation (3)). The vector ϑ is of order n1 + n2, defined by ϑ = (ϑa, ϑb), where ϑa is a vector of order n1 = na, corresponding to the na measurements of Y, and ϑb is a vector of order n2, corresponding to the number of anchors used to localize the nb pressure measurements. In our study we assume that the Type A data are error-free; hence the two vectors ϑa and za are identical. Furthermore, ϑa is then nonrandom, saving the need to derive its distribution.
Following equation (3), and recalling that ϑa is nonrandom, i.e., p(ϑa|za) = δ(ϑa − za) with δ being Dirac's delta function, we can integrate ϑa out of equation (3), leaving us with the joint distribution of the anchors and structural parameters given by
In equation (10), p(θ|za) is given by equation (5), p(ϑb|θ, za) is the prior distribution of ϑb given the structural parameters and the Type A data, and p(zb|ϑb, θ, za) is the likelihood function. These terms will be addressed below.
6.1.1. Derivation of p(θ|za)
Following equation (5), this derivation requires one to define p(θ) and p(za|θ). To derive p(θ), we modeled Y as a space random function defined by an expected value and an isotropic spatial covariance (i.e., λ1 = λ2 = I) of the type given by equation (8) with κ = 0.5 and ns = 2. With this formulation, the vector of structural parameters is θ = (I, σ2, m), corresponding (from left to right) to the integral scale, the variance, and the vector m of coefficients defining the expected value of Y, of order d, where d > 1 corresponds to nonstationary situations. In stationary situations, d = 1, and m contains only one term, which is the expected value of Y. For the prior distribution of the vector of parameters, we specified the following distribution, following Jeffreys' multiparameter rule and Pericchi [1981, equation 2.4],
 In our case study we assumed a stationary Y field, leading to d = 1. In this case all the terms in m are equal to μ.
The conditional prior for σ2 and m, p(σ2, m|I) ∝ σ−(d+1), is noninformative with regard to σ2 and m, i.e., the prior densities of m and log(σ2) are both flat on (−∞, ∞). These modeling choices follow Box and Tiao [1973, sections 1.3, 2.2–2.4, 8.2]. Diggle and Ribeiro [2006, p. 161] adopted the same prior and noted that it is an improper prior because its integral over the parameter space is infinite. They commented, however, that formal substitution of this prior into equation (5) leads to a proper distribution.
The unspecified component p(I) is flexible. It was taken to be uniform and bounded in our case study. Alternative models could be used as well [cf. Hou and Rubin, 2005]. The other distribution appearing in equation (5), p(za|θ), is modeled as a multivariate normal distribution with mean and covariance as defined above,
where R is the correlation matrix between the various locations in xa and ∥a∥A2 is shorthand for aTAa. The selection of a multivariate normal distribution for p(za|θ) is based on the observation that Y was found to be normal in many case studies [see Rubin, 2003, chapter 2].
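A minimal sketch of this multivariate normal density with an isotropic exponential correlation (κ = 0.5 in equation (8)); the locations, data values, and parameter values below are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_p_za_given_theta(za, xa, mu, sigma2, I):
    """Log of the multivariate normal density p(za | theta), with
    theta = (I, sigma2, mu) and an isotropic exponential correlation
    R_ij = exp(-|x_i - x_j| / I)."""
    d = np.linalg.norm(xa[:, None, :] - xa[None, :, :], axis=-1)
    R = np.exp(-d / I)  # correlation matrix between the xa locations
    cov = sigma2 * R
    return multivariate_normal(mean=np.full(len(za), mu), cov=cov).logpdf(za)

xa = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # hypothetical locations
za = np.array([0.2, -0.1, 0.4])                      # hypothetical Y data
print(log_p_za_given_theta(za, xa, mu=0.0, sigma2=1.0, I=1.5))
```

Evaluating this density over a grid of (I, σ2, m) values, weighted by the prior, is one direct way to construct p(θ|za) numerically.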
6.1.2. Derivation of p(ϑb|θ, za)
The conditional distribution of ϑb(xb) (of length nb), given the structural parameter vector θ and the conditioning data za(xa), is given by
where the conditional mean and covariance of b are given by
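Under the multivariate normal model, these conditional moments follow from standard Gaussian conditioning (equivalently, simple kriging of the anchor locations on the Type A data). A sketch with hypothetical locations and an exponential covariance:

```python
import numpy as np

def conditional_moments(xb, xa, za, mu, sigma2, I):
    """Mean and covariance of the anchors at xb given theta and za:
    standard Gaussian conditioning with covariance C(r) = sigma2*exp(-|r|/I)."""
    def cov(p, q):
        d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
        return sigma2 * np.exp(-d / I)

    Caa = cov(xa, xa)
    Cba = cov(xb, xa)
    Cbb = cov(xb, xb)
    w = np.linalg.solve(Caa, za - mu)              # weights applied to residuals
    mean_b = mu + Cba @ w                          # conditional mean
    cov_b = Cbb - Cba @ np.linalg.solve(Caa, Cba.T)  # conditional covariance
    return mean_b, cov_b

xa = np.array([[0.0, 0.0], [2.0, 0.0]])  # Type A (measurement) locations
xb = np.array([[1.0, 0.0]])              # hypothetical anchor location
za = np.array([0.5, -0.5])
mean_b, cov_b = conditional_moments(xb, xa, za, mu=0.0, sigma2=1.0, I=1.0)
print(mean_b, cov_b)
```

Note that if an anchor coincides with a Type A location, the conditional mean reproduces the datum and the conditional variance collapses to zero, consistent with the error-free Type A assumption.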
6.1.3. Derivation of the Likelihood Function p(zb|ϑb, θ, za)
We adopted a nonparametric approach: the likelihood function is estimated following the procedure outlined in section 4.2.
Figure 2 is an aerial view of the aquifer with the locations of the Type A and Type B data and with several combinations of anchors. The mean flow direction is parallel to the y axis (see Appendix A for additional information). MAD was applied to each of these cases in order to evaluate the effects of using different numbers and locations of anchors.
Figure 3 shows the posterior distributions of the length scale parameter I obtained using the various layouts of data and anchors shown in Figure 2. We note that augmenting the database with Type B data has a favorable effect, leading to distributions that are narrower compared with those obtained using Type A data only. Adding anchors also has a favorable effect. In Figure 3f, the mode of the distribution lies just next to the actual value of the scale. Still, the posterior distribution is not much of an improvement over the prior. The scale is relatively large compared with the aquifer's domain, and the domain would need to be much larger than the scale in order to be ergodic with respect to the scale.
Figure 4 shows the marginal distributions for the variance σ2. Here we note that the contribution of the Type B data is more significant compared with what we saw in Figure 3, even when no use is made of anchors. We could also note that the mode of the distribution gets somewhat closer to the actual value as the number of anchors increases, but the improvement is minor. The variance is a statistic much easier to infer than the scale, because the scale is a compound aggregate of the length scales of the various geological units in the aquifer [cf. Ritzi et al., 2004; Rubin et al., 2006; Sun et al., 2008].
 For accurate prediction of processes (e.g., prediction of pressures, in this case study), is it critical to be able to obtain accurate parameter estimates? This is an important question because the goal of studies such as this is not the estimation of the geostatistical model parameters. Diggle and Ribeiro [2006, p. 160] speculated that difficulties in estimating the parameters may not necessarily translate into poor estimates of the processes. This question is addressed below.
Figure 5 evaluates the combined impact of measurements and anchors on a couple of transects along the aquifer. Two transects are examined. The transect on the left is surrounded by measurements and anchors. The other one is somewhat removed from the measurement locations, and relies mostly on anchors as sources of local information. Both transects show tighter bounds as the data base is augmented with Type B data (Figure 5b) and with the addition of anchors (Figure 5c). The transect on the right shows only a minor improvement as anchors are introduced due to the larger distance from the measurement locations.
A different way to evaluate the quality of the inversion results is to evaluate the improvement in the inferred model's predictive capability for various combinations of data and anchors. This can be done through application of Block III (see Figure 1). We selected several test points throughout the aquifer's domain and compared the baseline pressures with the distributions of the pressures obtained from the Block III calculations. Figure 6 shows the pressure's density distribution at the test point shown in Figure 6i. Comparing Figures 6a, 6b, and 6c shows that the Type B data are very effective in improving the model's predictive capability and that this improvement is much more significant than that seen for the parameters' inference (Figures 3 and 4). Anchors also play a significant role in improving predictions. With anchors nearby, the distributions are more tightly arranged around the baseline pressure value. Accurate predictions are obtained even without anchors in the immediate vicinity of the test point, and there are several reasons for that. First, pressure is nonlocal, and as such it is correlated over much larger distances compared with Y. Second, parameter estimates get somewhat more accurate with more anchors, and that leads to improved prediction. Third, the head is a much smoother variable than the conductivity because its variability is constrained by the physical principles of flow. The surprising finding from Figure 6, which we found consistently true for dozens of test points we examined, is that significant improvement in predictive capability is achieved even with relatively poor parameter estimates.
The most computationally expensive part of inverse modeling with MAD lies in the extensive forward simulation runs, represented as Block II in Figure 1. In order to obtain convergence of the posterior distribution, a sufficient number (np) of structural parameter sets needs to be generated from the prior distribution. Each structural parameter set is used for generating multiple parameter set realizations (ns). For each parameter set realization, a large number of random fields (nr) must be generated and forward simulations run prior to likelihood estimation. Therefore, the total computational time is proportional to the product npnsnr and depends on the complexity of the forward model.
In this case study, 200 structural parameter sets were generated from p(θ|za), and for each structural parameter set, 10 (for the 11 anchor cases) or 20 (for the cases with a larger number of anchors) parameter set realizations were generated. To produce a reliable likelihood estimate for each parameter set, 200 realizations of random fields conditional on the anchors were found to be sufficient. In total, 400,000 or 800,000 forward runs were required per configuration, which took roughly 24 or 48 h, respectively, on an Intel Core 2 Quad Q6600 2.40 GHz processor. The computation can be easily parallelized because the forward runs of Block II (see Figure 1) for different parameter combinations can be conducted independently.
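The nested Monte Carlo structure of Block II can be sketched as follows; the forward flow solver is replaced here by a trivial stand-in, and the counts reproduce this case study's smaller configuration:

```python
from itertools import product

# Counts from the smaller configuration of this case study.
n_p = 200   # structural parameter sets drawn from p(theta | za)
n_s = 10    # parameter set (anchor) realizations per structural set
n_r = 200   # conditional random fields per parameter set realization

def forward_run(ip, js, kr):
    """Hypothetical stand-in for one forward flow simulation."""
    return 0.0

total = 0
for ip, js, kr in product(range(n_p), range(n_s), range(n_r)):
    forward_run(ip, js, kr)  # runs are independent: trivially parallelizable
    total += 1

print(total)  # 400000 = n_p * n_s * n_r
```

Because each (ip, js, kr) triple is independent of the others, the loop maps directly onto a pool of workers or a cluster job array.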
7. Updating With Multiple Data Sets
Inverse modeling is a complex process of data accumulation and assimilation. Multiple data types and data sets could be acquired for the purpose of enhancing the quality of the inversion. Most commonly, data sets are acquired sequentially, and at times over extended periods of time. Such a process of data acquisition necessitates repeated applications of the inverse modeling algorithm. Bayesian methods are very efficient in this regard. Consider, for example, a case where zb = (zb(1), zb(2)), where zb(1) and zb(2) are two subsets of data collected at different times, and the set of anchors ϑb = (ϑb(1), ϑb(2)) corresponds to zb = (zb(1), zb(2)). The two sets, ϑb(1) and ϑb(2), may be identical or may overlap to some extent (meaning that a few anchors may appear in both sets). With this, the likelihood in equation (3), p(zb|ϑ, θ), can be modified as follows,
In the second line, zb(1) is removed from the list of conditioning terms because its informational content is captured by ϑb(1). If zb(1) and zb(2) do not overlap spatially (e.g., two pump tests conducted in two different and far-apart subdomains), then the likelihood could be simplified further into the form,
Updating the likelihood in MAD is performed using a two-step approach as follows. In the first step, when zb(1) is the only data set available, the likelihood is equal to p(zb(1)|ϑa, ϑb(1), ϑb(2), θ) or p(zb(1)|ϑa, ϑb(1), θ), depending on whether equation (16) or (17) is used. Substituting these distributions into equation (3) leads to a posterior given the Type A data and zb(1). The second step takes place when zb(2) becomes available. In that case, the likelihood is given by the products in equation (16) or (17). This requires us to compute only the conditional distributions of zb(2), because the conditional distribution of zb(1) is already known from the first step. This procedure can be expanded to include any number of additional data sets, zb(3), ..., zb(N), with each newly introduced data set being used to update the posterior distribution, without requiring recomputation of the previously computed likelihood(s).
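The bookkeeping implied by this two-step update can be sketched in log space, where the likelihood product becomes a sum and the zb(1) term is simply cached and reused (all values below are synthetic placeholders, not computed from any forward model):

```python
import numpy as np

rng = np.random.default_rng(1)
n_candidates = 5  # candidate (theta, anchor) combinations, illustrative

# Step 1: log-likelihoods of zb(1) given each candidate combination,
# computed and stored when the first data set arrives (synthetic values).
log_lik_zb1 = rng.normal(size=n_candidates)

# Step 2: when zb(2) arrives, only its conditional log-likelihood is
# computed; the total is the sum, i.e., the product in probability space.
log_lik_zb2 = rng.normal(size=n_candidates)
log_lik_total = log_lik_zb1 + log_lik_zb2

# Normalized posterior weights over the candidate combinations
# (max-shifted for numerical stability before exponentiation).
w = np.exp(log_lik_total - log_lik_total.max())
posterior_weights = w / w.sum()
print(posterior_weights)
```

A third data set zb(3) would add one more cached term to the sum, with no recomputation of the earlier likelihoods.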
In contrast, the problem with optimization-based methods, such as the pilot point method (see section 8), is that the optimal results obtained based on zb(1) cannot be used as a starting point for an update based on zb(2). In the case of pilot points, one could only speculate that the optimized pilot points, obtained using the initial set of data, could serve as a starting point for further updating. And so, once zb(2) is acquired, the optimization-based parameter search must start anew.
8. MAD vis à vis Other Inversion Methods
 This section discusses the similarities and dissimilarities between MAD and several other inverse modeling approaches for the purpose of adding perspective. We will identify and discuss differences in the assumptions and concepts employed, and we will look at some results. This discussion is not intended to be a comprehensive review, but rather to highlight a few points in order to clearly position MAD vis à vis other methods.
Inverse modeling in hydrology may (or may not) include elements of estimation and simulation. Estimation refers to obtaining estimates for parameters of interest based primarily on statistical laws or considerations, on one hand, or on meeting fitting criteria, on the other. Similarly, simulation refers to generating a single realization or multiple realizations of the parameters based on statistical laws or on one or more fitting criteria. One can thus define two conceptual approaches, or domains, of inverse modeling: one that is defined by statistical laws and one that is defined by fitting criteria. The boundaries between these two domains are not sharp, but they are useful in trying to map the terrain of inverse modeling: inverse modeling methods can be classified by how they are positioned, in terms of their primary goals or products (with some variations between authors), on the spectrum defined by these two domains.
The methods of maximum likelihood (ML) and maximum a posteriori (MAP) are positioned on the probabilistic side of the spectrum, and they focus primarily on estimation. In MAP, for example, one obtains a MAP estimate, whereas drawing samples from the MAP distribution, if available, would amount to simulation. The pilot point method is positioned on the fitting side of the spectrum, with a focus on simulation, because it is defined by an objective function that is based on one or more fitting criteria and because it produces some sort of conditional simulations. MAD is positioned on the probabilistic side of the spectrum, and it includes elements of both estimation and simulation, as will be explained in our subsequent discussion.
8.1. MAD and Maximum Likelihood (ML)
MAD and ML are both probabilistic methods. The difference between them is that ML is focused on finding an estimate of the unknown parameters, and is thus an estimation theory method, whereas MAD focuses on obtaining the distribution of the unknown parameters, and is thus a Bayesian method. It can be shown that MAD is an extension of the ML logic. For example, the ML approach of Kitanidis and Vomvoris [1983] can be related to MAD through equation (3). ML focuses on the likelihood term in equation (3), namely, p(zb|θ, za), and aims at estimating the vector θ. Usually, a modeling assumption is made with regard to this distribution, and the parameters of the assumed distribution comprise the vector θ. The ML parameter estimates are those that maximize the model approximation of p(zb|θ, za), or in other words, the probability of observing the data. The parameter vector θ models the global trends of the target variable (e.g., through its moments), and it is not intended to capture local features, which is the role of the anchors in MAD. One could possibly add anchors into the likelihood function, rewriting it in the form p(zb|θ, ϑ, za) and obtaining the ML estimates of both θ and ϑ. This would lead to a formulation of ML along the lines proposed by Carrera and Neuman [1986a, 1986b] and Riva et al. But including anchors in the likelihood function would not amount to transforming ML into MAD, because ML derives single-valued parameter estimates whereas MAD derives parameter distributions.
One of the challenges facing ML is providing estimation variances. Under some assumptions [Schweppe, 1973], ML can provide lower bounds for the estimation variances. These variances can be translated into statistical distributions by assuming some form of distribution: a Gaussian model is justified asymptotically. The assumption of Gaussianity is reasonable and in many cases justified [cf. Woodbury and Ulrych, 1993], but it cannot be guaranteed a priori. This is illustrated in Figure 3, where the distributions do not appear to be Gaussian, although a Gaussian approximation could work very well in this particular case. We will show later that it does not always work.
8.2. MAD and Maximum A Posteriori (MAP)
MAP, similar to ML, is a probabilistic method that aims at obtaining parameter estimates [McLaughlin and Townley, 1996]. MAP derives parameter estimates but not their distributions. Consider the posterior distribution shown in equation (3), and let us do two things: first, ignore the anchors, and second, replace the prior distribution of the parameters with a prior distribution for za. This leaves us with the MAP distribution in the form:
MAP proceeds by assuming models for the distributions appearing on the right-hand side of equation (18). The MAP parameter estimates are those that maximize the model approximation of equation (18). In other words, the MAP parameter estimates are those that correspond to the mode of the parameter distribution.
 The prior p(za) in MAP acts to regularize the solution by stabilizing it around the prior, but unlike the Pilot-Point method (PPM), its weight is not manipulated to control the results: in MAP, the prior is a starting point, not a constraint! We shall see below that the transformation of the prior term into a regularization term, as done by PPM, has significant consequences.
 The likelihood function p(zb | za, θ) is commonly taken as p(εb), where εb = zb − M(θ). The error terms in p(εb) are usually assumed to be zero-mean, uncorrelated, and Gaussian [McLaughlin and Townley, 1996]. Similar assumptions could be made in ML. A modeling assumption is a required component of both ML and MAP because both seek parameter values that are defined by a characteristic of the assumed distributions (e.g., the mode). In other words, both ML and MAP use parametric models. MAD, on the other hand, estimates the likelihood function itself rather than its parameters, and hence can employ nonparametric likelihood functions. The advantage of employing nonparametric models is the flexibility they offer in terms of model selection, but this of course comes with a heavy computational price tag. Additional discussion of the differences between ML and MAP is provided in the work of Kitanidis [1997a, 1997b].
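A minimal numerical sketch of the MAP logic described above, under the common Gaussian assumptions; the forward model (taken as the identity), the prior, and the noise levels are hypothetical choices for illustration:

```python
import random

random.seed(2)

# Hypothetical forward model M(theta): the identity, so the residuals
# eps_b = zb - M(theta) are zero-mean Gaussian by assumption.
zb = [random.gauss(3.0, 0.5) for _ in range(20)]

def log_posterior(theta, prior_mean=2.0, prior_sd=1.0, noise_sd=0.5):
    # Gaussian error model for the likelihood p(zb | theta) ...
    loglik = sum(-0.5 * ((z - theta) / noise_sd) ** 2 for z in zb)
    # ... plus a Gaussian prior that regularizes toward prior_mean.
    logprior = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
    return loglik + logprior

# MAP: the mode of the (unnormalized) posterior. A point estimate,
# not a distribution.
grid = [i * 0.001 for i in range(0, 6000)]
theta_map = max(grid, key=log_posterior)
```

The prior acts as a starting point that pulls the mode slightly toward the prior mean; the output remains a single value, which is the distinction from MAD drawn in the text.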
8.3. MAD and the Pilot Point Method (PPM)
 In this section we highlight the differences between PPM and MAD. PPM has been applied in several studies [e.g., Doherty et al., 2003; Kowalsky et al., 2004; Hernandez et al., 2006; Alcolea et al., 2006]. PPM is fundamentally different from ML, MAP, and MAD in that it is a model-fitting method and not a probabilistic method. We will show that PPM's goals are vastly different from those of the other methods, and how meeting these goals affects the results. We will also show that pilot points are not anchors: they differ not only in name but also in concept.
 Let us start by summarizing how PPM works. Schematically, it proceeds as follows (specific details may vary between authors):
 1. Define a vector of structural parameters θ using the data za.
 2. Generate an unconditional realization of Y and condition it on za; denote the resulting field Y0.
 3. Determine the number and locations of the pilot points, and assign to them initial values y0. The initial set of pilot point values y0 is taken from Y0.
 4. Set an objective function. The objective function is intended to control the values assigned to the pilot points. Additional discussion of the objective function is provided below.
 5. Change the values of y0 to y following a deterministic optimization search procedure, as follows. Starting with y0, the field Y0 is conditioned deterministically on the pilot point values, leading to a conditional field. A numerical simulation of the process, relating the Type B data zb to this conditional field, is then performed. The results of this simulation are compared with the measured zb values, and y is modified with the goal of improving the objective function. The process is repeated with the goal of eventually reaching a preset threshold value of the objective function. The search is terminated once the threshold is crossed.
 6. The final product of this process is the field Y0 conditioned on the set of pilot point values y obtained from the optimization process.
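The six steps above can be sketched schematically. The one-dimensional "field", the forward model (an average of field values), and the random-search optimizer below are deliberately trivial stand-ins chosen only to make the loop concrete:

```python
import random

random.seed(3)

# Toy 1-D "field" and a stand-in forward model: the simulated Type B
# datum is taken as the average of the field values (illustrative only).
true_field = [random.gauss(0.0, 1.0) for _ in range(10)]
zb_obs = sum(true_field) / len(true_field)

def condition(seed_field, pilot_idx, pilot_vals):
    """Step 5: deterministically impose pilot-point values on the field."""
    field = list(seed_field)
    for i, v in zip(pilot_idx, pilot_vals):
        field[i] = v
    return field

def objective(field):
    """Step 4: misfit between simulated and observed Type B data."""
    zb_sim = sum(field) / len(field)
    return (zb_sim - zb_obs) ** 2

# Steps 2-3: a seed realization and initial pilot values taken from it.
seed_field = [random.gauss(0.0, 1.0) for _ in range(10)]
pilot_idx = [2, 7]
pilot_vals = [seed_field[i] for i in pilot_idx]

# Step 5: crude random search, terminated at a preset threshold.
threshold = 1e-4
for _ in range(5000):
    current = objective(condition(seed_field, pilot_idx, pilot_vals))
    if current < threshold:
        break
    trial = [v + random.gauss(0.0, 0.2) for v in pilot_vals]
    if objective(condition(seed_field, pilot_idx, trial)) < current:
        pilot_vals = trial

final_field = condition(seed_field, pilot_idx, pilot_vals)   # Step 6
```

Note how the loop stops as soon as the threshold is crossed: the fit is never improved beyond it, which is the censoring behavior discussed later in this section.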
 Following this summary, we shall look at the following aspects of PPM: (1) The significance of PPM's stated goals, (2) the use of pilot points as fitting parameters, and (3) the implications of the optimization procedure. In doing so, we shall also highlight the differences between PPM and the other methods.
 By following the procedure outlined above, PPM attempts to achieve several goals [Cooley, 2000; Cooley and Hill, 2000]. The first goal is to keep the frequency distribution of the target variable in the generated field similar to the distribution observed in measurements of the target variable. The second goal is to generate realizations that are equally likely, and the third goal is to closely reproduce the observations of the dependent variables zb.
 Let us consider the first goal. It poses several challenges because the observed distributions of the target variable are either poorly defined or nonexistent to begin with. For the sake of discussion, however, let us assume that some data are available to construct a prior distribution for Y from Type A data, p(y | za). If Type B data is available, it should be used as an additional source of information, leading to p(y | zb, za). These distributions could be different, and it is reasonable to expect that they will be, because the Type B data brings additional information into consideration. We should therefore ask whether it is reasonable or helpful to treat p(y | za) as a constraint. MAP and MAD recognize the significance of the prior, but they do not use it as a constraint: they use it as a starting point, because they recognize that it could change if we have informative Type B data. Similarly, ML does not use p(y | za) as a constraint.
 The second PPM goal [Cooley and Hill, 2000] is the generation of equally likely realizations of the target variable field. This goal is challenging on several counts. First, generating a realization is a Bayesian concept. Within classical statistics we have the concept of generating conditional realizations of estimates, which can be obtained by perturbing the data in some manner. PPM is neither a Bayesian concept nor an estimation method, so it is unclear what the PPM realizations represent. Second, there is a question of semantics: to qualify multiple realizations as equally likely, one needs a statistical model to quantify that likelihood in the first place, which PPM does not have. Perhaps a more accurate adjective than “equally likely” would be “equally drawn.” Third, even if one assumes that PPM can generate equally likely realizations, the advantage of working with such realizations is questionable, because for prediction one would want to consider more likely and less likely realizations, or in short, random sampling. Random sampling is the key to sampling the complete probability space without bias: it is a fundamental tenet of statistics that samples must be drawn at random [Mugunthan and Shoemaker, 2006]. Equally likely realizations do not amount to random sampling, as shown in the next paragraph. MAD, in contrast, produces plausible realizations, with various degrees of plausibility, as measured by probability.
 Let us be more specific with regard to the third point of the last paragraph. PPM considers only realizations that cross a preset threshold value defined for the objective function; realizations that do not cross the threshold are not admitted into the pool. But the rejected realizations may carry nonzero probability, and hence should not be eliminated from consideration. Surprisingly, it is not only poor-performing realizations (“poor” in the PPM objective function sense) that are rejected by PPM: superior realizations are rejected as well, because the PPM search algorithm stops once the preset threshold is crossed, and no effort is made to improve them further. We can conclude, then, that PPM undersamples the probability space and is potentially biased. The bias effect due to optimization was noted by Mugunthan and Shoemaker [2006, p. W10428], who comment on “…bias introduced during optimization because of over-sampling of high goodness-of-fit regions of the parameter…” These effects may be small or large; we cannot say a priori. For credibility, PPM applications must show that this effect is small.
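The censoring effect described here can be demonstrated with a toy experiment: an ensemble admitted only through a narrow misfit band near the threshold cannot assign probability to events that fall outside that band. The distributions, misfit, and threshold below are illustrative:

```python
import random

random.seed(4)

# "Parameter" draws spanning the full probability space
# (a stand-in for unbiased posterior sampling).
full = [random.gauss(0.0, 1.0) for _ in range(20000)]

def misfit(x):
    return x * x   # toy objective: best fit at x = 0

# Threshold-band censoring in the spirit of the text: keep a realization
# only if its misfit sits just below a preset threshold, mimicking a
# search that stops as soon as the threshold is crossed (better fits are
# never pursued, worse fits are rejected).
threshold = 1.0
censored = [x for x in full if 0.5 < misfit(x) < threshold]

# Probability of a tail event, estimated from each ensemble.
p_full = sum(abs(x) > 1.5 for x in full) / len(full)
p_censored = sum(abs(x) > 1.5 for x in censored) / len(censored)
```

The censored ensemble assigns exactly zero probability to the tail event, while the full ensemble assigns it a substantial probability, which is the bias mechanism argued in the text.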
 An important implication of the biased sampling is that PPM cannot assign probabilities to realizations. Consequently, users cannot assign probabilities to events that are modeled based on the PPM realizations. In MAD, on the other hand, no optimization is used and no threshold criteria are set, thus avoiding all these issues altogether. The probability space is sampled exhaustively and without bias, and realizations can be associated with probabilities using the posterior distributions or by looking at histograms of events.
 Let us now take a look at the role of the pilot points in achieving PPM's third goal, the reproduction of the observations. PPM uses pilot points as fitting parameters, and in fact it encourages the user to add as many pilot points as needed [Doherty, 2003]. This aspect of PPM underlies its need for the plausibility term (synonymous with the more commonly used regularization term) discussed in step 4 of the algorithm. The plausibility term is used to control the problem of using a number of fitting parameters (the pilot points) that can far exceed the number of observations. Tikhonov and Arsenin [1977] showed theoretically that it is possible for a model to fit observations exactly when the number of parameters equals the number of data, and that additional parameters render the problem singular unless regularization is applied. This situation applies to PPM, and was confirmed in a study of PPM by Alcolea et al. [2006, p. 1679], who commented that over-parameterization (in the form of a large number of pilot points) “…leads to instability of the optimization problem” and that “…instability implies …large values of some model parameters due to unbounded fluctuations … large jumps in the value of hydraulic properties…” This instability is brought under control in PPM by the regularization term [referred to as the plausibility term by Alcolea et al., 2006]. The weight assigned to the plausibility term can be made arbitrarily large (or small) depending on the magnitude of the instability. That effect, although beneficial from the instability perspective, is the root cause of PPM's biased sampling, because it controls the extent of censoring (on both the “bad” and the “good” realization sides, as discussed earlier). It should also be noted that “…the degree of data reproduction is a poor indicator of the accuracy of estimates” [Kitanidis, 2007].
 In addition to creating instability, pilot points can lead to artifacts in the generated fields. Because the introduction of additional pilot points is the only PPM mechanism for improving model performance and compensating for neglected elements (such as three-dimensional flow, unsteady flow, recharge and leakage, geological discontinuity, and the like), it can lead to the appearance of artificial features in the target variable field realizations [see Cooley, 2000, p. 1162]. Alcolea et al. [2006, p. 1679] confirmed the existence of this effect and indicated that it could be controlled by regularization, but it is unclear how and to what extent. Kitanidis also confirmed the existence of this effect when he noted that “By over-weighting the data reproduction penalty, the data are reproduced more closely and more details appear in the image that is estimated from the optimization but the image is also more affected by spurious features.” Studies such as Hernandez et al. [2006] suggest that the artifact issue can be brought under control, but it is unclear what constitutes an artifact (except after it shows up) and how this aspect of the simulation can be managed. To summarize, the plausibility term, in addition to controlling instability, is also used to reduce artifacts. In the absence of any indication to the contrary, however, one can only speculate on how efficient it is in doing so.
 The fundamental difference between PPM and MAD in this regard is that MAD is a Bayesian method whereas PPM is a model-fitting method. Specifically, anchors are not fitting parameters: they are devices for reducing data into a suitably convenient form. PPM attempts to fit measurements by adding pilot points and tweaking their assigned values, whereas MAD does not fit anything: it is built around estimating the statistical distribution of the differences between observations and model predictions. MAD can use parameters to model distributions, but it does not adjust point estimates.
 The likelihood function in MAD is the only subject of estimation. Estimating the likelihood function in MAD is unlike the fitting exercise of PPM because of the number of data points involved. In PPM the number of data points is limited to the number of measurements, whereas in MAD the number of data points corresponds to the number of differences between measurements and predictions, which can be set arbitrarily high (it depends on the number of Monte Carlo realizations generated for the purpose of estimating the likelihood function). For example, consider the case of N measurements. PPM attempts to fit the N measurements with a number of pilot points that can far exceed N, whereas in MAD the number of data points is on the order of N × 10⁶ or more. Theoretically, there is no limit on the number of realizations that can be generated, and hence stability is not an issue in MAD.
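A sketch of estimating a nonparametric likelihood from Monte Carlo simulations, in the spirit described here: simulate the forward model many times for a candidate parameter value and estimate the density of the simulated Type B values at the observed datum. The forward model, kernel bandwidth, and sample sizes are illustrative choices:

```python
import math
import random

random.seed(5)

# Hypothetical forward model M: a noisy, slightly non-Gaussian response
# to the parameter (the non-Gaussian term motivates a nonparametric fit).
def forward_model(theta):
    return theta + random.gauss(0.0, 1.0) + 0.1 * random.gauss(0.0, 1.0) ** 2

zb_obs = 2.0

def likelihood_kde(theta, n_realizations=5000, bandwidth=0.3):
    """Kernel density estimate of p(zb | theta) at the observed datum."""
    sims = [forward_model(theta) for _ in range(n_realizations)]
    norm = n_realizations * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((zb_obs - s) / bandwidth) ** 2)
               for s in sims) / norm

# The likelihood is high where simulated zb values match the observation.
L_good = likelihood_kde(1.9)   # candidate close to the observation
L_bad = likelihood_kde(6.0)    # candidate far from the observation
```

Each likelihood evaluation here consumes thousands of forward-model runs, which illustrates the computational price of the nonparametric route noted in the text.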
 The issue of stability is demonstrated in Figure 6, which shows the marginal distribution of the pressure at a validation point (a point not used for inversion but reserved for testing the quality of the predictions). The pressure distribution is shown for different numbers of anchors. The distribution shows no sign of instability as the number of anchors increases, and similar stability was observed for dozens of points spread over the simulated domain. Figure 6 thus demonstrates that MAD does not have a stability problem, despite the fact that it does not use any regularization term. It also bears on the issue of anchor placement (see section 5): the convergence of the statistical distributions of the target variables provides an indication that the number of anchors has reached a satisfactory level in terms of the ability to extract information from the data. Such a measure of sufficiency is not available with PPM.
 This section includes a brief comparison of MAD with the PPM case study presented by Hernandez et al. [2006], subsequently referred to as H06. The case study focuses on a rectangular flow domain with spatially variable Y = ln(conductivity). The spatial variability is modeled using a stationary mean and an exponential spatial covariance with variance σY² equal to 4 and an integral scale equal to 1. The hydraulic head gradient was defined by a head difference of 10 length units. For both the head and Y, measurement errors with unit variance were added to the data. We implemented the same models in our case study. We did not have access to the baseline Y field of H06, so we generated our own baseline field using the same spatial variability model.
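The spatial variability model of the case study (exponential covariance, σY² = 4, integral scale 1) can be sketched in one dimension, where the exponential covariance model coincides exactly with a first-order autoregressive (Ornstein-Uhlenbeck) sequence. The grid spacing and series length below are illustrative:

```python
import math
import random

random.seed(6)

# 1-D sketch: stationary mean, exponential covariance
# C(h) = sigma2 * exp(-h / scale), with sigma2 = 4 and integral scale 1.
sigma2, scale, dx, n = 4.0, 1.0, 0.1, 50000
rho = math.exp(-dx / scale)   # lag-one correlation on the grid

# Exact sequential generation as an AR(1) process.
y = [random.gauss(0.0, math.sqrt(sigma2))]
for _ in range(n - 1):
    y.append(rho * y[-1]
             + math.sqrt(sigma2 * (1 - rho ** 2)) * random.gauss(0.0, 1.0))

# Sample covariance at lag h = 1 (one integral scale); the model value
# is sigma2 * exp(-1), about 1.47.
lag = int(1.0 / dx)
mean_y = sum(y) / n
cov1 = sum((y[i] - mean_y) * (y[i + lag] - mean_y)
           for i in range(n - lag)) / (n - lag)
```

In two dimensions the case study would require a full covariance-based generator, but the 1-D analogue suffices to check the parameter values against their definitions.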
 As discussed earlier, there are fundamental differences between MAD and PPM and we shall not repeat them here. In this section we shall focus on specific details related to the implementation of MAD and PPM to this case study and on some results. The first difference between MAD and H06 is with regard to estimating the parameters of the spatial variability model. H06 provides estimates for the parameters based on alternative criteria of optimality, whereas MAD considers the statistical parameters as random variables and derives their distributions.
 The second difference concerns the statistical model employed for the joint distribution of the heads and Y. H06 assumed the heads to be spatially uncorrelated. Their model amounts to assuming that the heads are deterministic variables subject to uncertainty due to spatially uncorrelated errors (the structure often assumed for measurement error). This assumption is not in line with the approach employed by H06 for modeling Y, which assumed Y to be a spatially correlated space random function. H06 also assumed the heads to be uncorrelated with Y, whereas Y was assumed to be normally distributed and spatially correlated. These assumptions are not in line with multiple studies and observations that showed the heads to be spatially correlated and cross-correlated with Y and, furthermore, to be normally distributed only for σY² smaller than 1 [Rubin, 2003]. This statistical model was employed in other studies as well [cf. Kowalsky et al., 2004]. Regardless of whether the model is justified, the point we want to make is that PPM requires simplifying, and possibly restrictive, assumptions of this kind. MAD, in contrast, does not require any assumption in this regard: it derives a nonparametric posterior distribution.
 We chose to demonstrate the differences with two sets of results. The first set focuses on the hydraulic heads, and the second on the geostatistical parameters. Figure 7 shows the actual and expected values of the hydraulic head along the centerline of the flow domain. We also show the 95% confidence intervals obtained with MAD as well as those identified by H06 and given in their Figure 5. Our 95% confidence intervals are taken directly from the head distribution (see Figure 8), whereas H06 obtained theirs by computing the variance of the head through the ensemble of realizations and assuming a Gaussian distribution. Comparison shows our confidence intervals to be somewhat larger than those predicted by H06. As discussed earlier, PPM censors suboptimal as well as above-threshold optimal realizations, which leads to underestimation of the confidence intervals and explains this difference. The H06 upper confidence interval near the left boundary allows the head to vary above the head at the boundary, and similarly, the head is allowed to vary below the lower bound set by the right boundary. This is an outcome of the assumption of Gaussianity. In MAD, the upper and lower bounds of the distributions next to the boundaries are bounded correctly, because the distribution is derived, not postulated.
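The difference between Gaussian-based and empirically derived confidence intervals near a constant-head boundary can be illustrated as follows. The bounded, skewed head distribution below (a boundary head of 10 minus a folded Gaussian) is a hypothetical stand-in:

```python
import random
import statistics

random.seed(7)

# Heads just inside a constant-head boundary at H = 10 are bounded above
# and skewed; modeled here as 10 minus a folded Gaussian (illustrative).
heads = [10.0 - abs(random.gauss(0.0, 1.0)) for _ in range(10000)]

# Gaussian-based 95% interval (H06-style): mean +/- 1.96 sd.
m = statistics.fmean(heads)
s = statistics.pstdev(heads)
upper_gauss = m + 1.96 * s

# Empirical 95% interval taken directly from the distribution (MAD-style).
heads.sort()
upper_emp = heads[int(0.975 * len(heads))]
```

The Gaussian construction places the upper confidence limit above the boundary head of 10, which is physically impossible, while the empirical percentile respects the bound.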
Figure 8 shows the head distributions at different locations along the transect shown in Figure 7. The distributions are shown on q-q plots, so that we could evaluate their departure from Gaussianity. We note that the head distributions are non-Gaussian throughout the flow domain. They are strongly skewed next to the boundaries because of the constraints imposed by the nearby boundaries. The departure from Gaussianity is much less pronounced around the center of the flow domain (Figure 8c), but the departure from Gaussianity is still pronounced at the tails. H06, on the other hand, assumes the heads to be Gaussian throughout the transect. One consequence of that assumption is that heads are allowed to be larger than 10 and smaller than zero next to the upper and lower boundaries, respectively. As noted earlier, MAD does not assume posterior distributions, but rather infers them, and a consequence of that is the flexibility to obtain a variety of distributions, and a better compliance with the underlying physics.
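A q-q comparison of the kind used in Figure 8 can be computed as follows. The samples are synthetic: the skewed sample mimics heads near a boundary (a boundary head of 10 minus an exponential deviate), and the central-quantile deviation measure is our own illustrative choice:

```python
import random
import statistics

random.seed(8)

nd = statistics.NormalDist()

def qq_points(sample):
    """Pairs (theoretical Gaussian quantile, standardized sample quantile)
    for a q-q plot; a Gaussian sample falls near the 1:1 line."""
    xs = sorted(sample)
    n = len(xs)
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return [(nd.inv_cdf((i + 0.5) / n), (x - m) / s)
            for i, x in enumerate(xs)]

def qq_dev(sample):
    """Maximum departure from the 1:1 line over the central quantiles
    (extreme order statistics are too noisy to compare)."""
    pts = qq_points(sample)
    n = len(pts)
    core = pts[int(0.01 * n):int(0.99 * n)]
    return max(abs(t - q) for t, q in core)

gaussian_heads = [random.gauss(5.0, 1.0) for _ in range(5000)]
boundary_heads = [10.0 - random.expovariate(1.0) for _ in range(5000)]
```

For the skewed boundary sample the q-q curve bends away from the 1:1 line, most strongly in the long lower tail, reproducing the kind of departure described for Figure 8.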
 Our next set of results deals with estimates of the geostatistical parameters, which MAD provides in the form of statistical distributions. Figure 9a shows the probability density functions obtained by MAD for three different sets of data and error levels. In all cases the distributions are well aligned with the actual value, which is 1 length unit. H06 deal with this parameter in their Figure 14. They do not provide distributions; instead, they attempt to identify an optimal value by analyzing various performance criteria for a wide range of parameter values. Theoretically, these criteria should peak in the vicinity of the actual values. Of the five evaluation criteria tested in Figure 14 of H06, none peaked at around 1; instead they peaked at around zero.
 Figure 9b provides our results for the variance. All cases we analyzed show well-defined peaks somewhere between 4 and 5 (with the higher values corresponding to the case with large measurement error on the heads). Results in H06 for this parameter are provided in their Figures 13 and 16. The evaluation criteria they used did not peak at the actual values. A few criteria show preference toward high values, but without displaying a well-defined peak. The various evaluation criteria in H06 are not consistent in the trends they display. For example, in Figure 13 of H06 one of the evaluation criteria identified the variance at around 0.5 whereas the others seem to prefer 6. It is thus unclear which criterion should be selected a priori as the most reliable. We speculate that this insensitivity or underperformance of the evaluation criteria in H06 could be related (1) to the elimination of the high-probability and low-probability events due to the use of optimality criteria, as discussed in section 8.3, and (2) to the use of multiple fitting parameters, in the form of pilot points, which masks the actual spatial structure by introducing artifacts. This possibility was alluded to in the works of Cooley [2000] and Cooley and Hill [2000].
 We presented a Bayesian method for inverse modeling of spatially variable fields, called the method of anchored distributions (MAD). This work expands in some parts on work done by Z. Zhang and Y. Rubin (unpublished manuscript, 2008). MAD is general in the sense that it is not limited to any particular data types or models. MAD is built around two new elements: the first is a concept called anchored distributions, and the second is a data classification scheme. On the basis of these two concepts, various types of data can be combined, or assimilated, systematically into an improved image of the random field.
 MAD's basic approach is to model the spatially variable fields as random fields. In MAD, the random field is modeled through a combination of structural parameters and anchors. Structural parameters capture the global trends and statistical characteristics of the random fields, whereas anchored distributions capture the local effects. Anchored distributions are statistical devices that extract information from data that is directly or indirectly and locally or nonlocally related to the target variable. The information is captured in the form of statistical distributions of the target variable.
 Data used for inversion is classified by the way it is related to the target variable. We distinguish between data that is local and directly related to the target variables (Type A) and data that is nonlocal and indirectly related to the target variables (Type B). Both types of data are represented using anchors. In the case of Type A data, anchors are either measured or regressed, and in the Type B case, anchors are obtained by inversion: MAD converts the Type B data, using formal statistical arguments, into local statistical distributions of the target variable called anchors. This conversion provides two benefits. The first is that Type A and Type B data can be used for inversion in a way that is consistent with the data acquisition technique on one hand and the support scale of the target variable on the other. The second is that both types of data can be used for conditional simulation. Data related to the target variables in complex ways can thus be used for conditional simulation with ease. For example, multiple realizations of the hydraulic conductivity field can be generated conditional to small and large scale pumping tests, tracer tests, geophysical surveys, and borehole information.
 The MAD algorithm requires the modeler to define prior distributions, when prior information is available, and a likelihood function. MAD can accommodate multiple modeling choices with regard to both. We presented a general, nonparametric formulation of the likelihood function in order to allow large flexibility in using the Type B data. Nonparametric formulations are usually associated with large computational efforts because the information cannot be collapsed into a small number of parameters. For example, Bayesian methods in many cases adopt a normal likelihood function because of the computational advantage of working with such sparse parameterization. This comes at a cost: the normal likelihood function is not universal and cannot fit all data. The nonparametric likelihood function used in MAD is much more flexible than its normal counterpart, but it is much more demanding computationally. However, MAD's ability to process nonnormal Type B data could counterbalance this increase because it allows the user to convert the Type B data into forms that are easier to model. For example, if we have Type B data in the form of drawdown data obtained from a pumping test, these data could be processed as a time series, but they could also be reduced to a series of temporal moments. The temporal moments can be related to the hydraulic conductivity using a series of steady state equations for M, which is much easier to process than the time series, which would require a transient flow equation for M [Zhu and Yeh, 2006].
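The reduction of a drawdown time series to temporal moments can be sketched as follows; the k-th temporal moment of a series s(t) is the integral of t^k s(t), approximated here by the trapezoidal rule. The synthetic drawdown curve is purely illustrative:

```python
import math

# Synthetic drawdown record: a smooth rise-and-recovery curve sampled
# every 0.5 time units (illustrative only).
times = [0.5 * i for i in range(101)]                 # t = 0 .. 50
drawdown = [t * math.exp(-t / 5.0) for t in times]

def temporal_moment(k, t, s):
    """k-th temporal moment, integral of t^k * s(t) dt (trapezoidal rule)."""
    integrand = [ti ** k * si for ti, si in zip(t, s)]
    return sum((t[i + 1] - t[i]) * (integrand[i] + integrand[i + 1]) / 2.0
               for i in range(len(t) - 1))

m0 = temporal_moment(0, times, drawdown)
m1 = temporal_moment(1, times, drawdown)
mean_arrival = m1 / m0   # characteristic time of the response
```

The full time series is thus collapsed into a few scalars per well, which is the data reduction that makes the steady state moment equations tractable.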
 The MAD algorithm has a modular structure. It is built around three distinct blocks: a block for analyzing Type A data, a block for analyzing Type B data, and a block dedicated to making predictions. The blocks are defined by their tasks in the total inversion schemes and not by the computational tools employed. The modular structure is advantageous because it is not limited to any type of data or model, and opens the door for applications with a wide variety of data types and modeling tools.
 We compared MAD with other concepts used for inversion, including maximum likelihood (ML), maximum a posteriori (MAP), and the pilot point method (PPM) (see section 8). For the purpose of this comparison we identified two classes of inversion methods. The first is built on principles of statistical estimation, which is the process of obtaining estimates for the target parameters based primarily on statistical laws or considerations. This class includes ML, MAP, and MAD. Within this first class, ML and MAP can be viewed as statistical estimation methods, whereas MAD is a Bayesian estimation method coupled with a statistical localization strategy. The second class of methods, which includes PPM, emphasizes model fitting through optimization of an objective function. The role of optimization in PPM is evaluated and shown to lead to bias resulting from the elimination of low-probability and high-probability realizations from consideration. Additional concerns with regard to PPM raised by Cooley [2000] and Cooley and Hill [2000] are also evaluated.
 Of interest, in this context, was to compare the definitions of anchors versus pilot points. Pilot points are fitting parameters, whereas anchors are statistical devices used for capturing and simplifying information. Pilot points are defined as point values that are optimal in some sense, whereas anchors are defined by joint statistical distributions. A numerical case study comparing MAD with Hernandez et al. [2006] highlights a few of the differences between MAD and PPM.
where K [LT−1] is the hydraulic conductivity and H [L] is the hydraulic head (related to the pressure P through P = ρgh, where ρ [ML−3] is the mass density and g [LT−2] is the gravitational acceleration). This equation is augmented by Dirichlet-type boundary conditions of constant (but different) heads along the boundaries at y = 0 and y = 120, with a head difference of 5 cm over the 120 m domain (see Figure 2), and Neumann-type no-flow boundary conditions at x = 0 and x = 90, leading to a uniform-in-the-average head gradient parallel to the y axis. The grid size was selected at ∼0.3 of the integral scale of the log conductivity. The numerical solution is based on the finite element method reported by Chapra and Canale.
Notation

Y: the SRF (space random function) of interest, for example, the log conductivity.
y: a realization of the Y field.
θ: vector of structural parameters describing the spatial variability of Y.
ϑ = (ϑa, ϑb): vector of anchors corresponding to both Type A and Type B data. The subscripts “a” and “b” refer to Type A data and Type B data, respectively.
na: number of Type A data.
nb: number of Type B data.
z = (za, zb): vector of Type A and Type B data of length na + nb.
b: a realization of the zb field.
x: vector of data locations. The vector xa denotes the coordinates of za, etc.
xϑ: vector of anchor locations. The vector xϑa denotes locations of anchors related to Type A data. xϑa and xa are identical.
M: a model, numerical or analytical, relating θ and ϑ to zb. For example, it could represent a flow model. When we have more than one type of data in zb, M is a collective name for all relevant models. Alternatively, Mi, i = 1, …, N is used to denote multiple models.
ε: model error, composed of the two error vectors below.
εa: error associated with Type A anchors.
εb: error associated with Type B anchors and structural parameters.

Notation conventions: boldface type denotes vectors or fields. A superscript in parentheses denotes a subset; e.g., θ(1) is a subset of θ. Subscripts, when a running index is used, denote the components of a vector; otherwise they denote the name of a related variable or data type. p(..) is used to denote statistical distributions; different distributions are denoted by the names of the variables in parentheses: p(θ) and p(ϑ) denote two different statistical distributions, not the same statistical distribution evaluated at different points.
 The authors wish to thank Peter Kitanidis for stimulating discussions on this paper. This study has been funded by the U. S. DOE Office of Biological and Environmental Research, Environmental Remediation Science Program (ERSP), through DOE-ERSP grant DE-FG02-06ER06-16 as part of Hanford 300 Area Integrated Field Research Challenge Project. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U. S. Department of Energy under contract DE-AC02-05CH11231.