## 1. Introduction

[2] Groundwater is vulnerable to contamination from point and nonpoint sources. One of the first steps in any environmental remediation project is to identify the locations and release histories of contaminant sources so that a cost-effective remediation strategy can be made and so that cleanup costs can be partitioned among liable parties. In most cases, source locations and release histories are unknown when contamination is first detected. The reconstruction of contaminant source locations and release histories from observed concentration records is a special type of inverse problem.

[3] *Sun* [1994] classified inverse problems in groundwater modeling into five types: the identification of parameters, boundary conditions, initial conditions, sinks and sources, and the simultaneous identification of more than one of these components. From a theoretical point of view, the identification of source terms in a partial differential equation is simpler than the identification of the equation coefficients [*Isakov*, 1990]. In practice, however, the source identification problem is very challenging because of its ill-posed nature and the lack of data [*Sun and Sun*, 2002, 2005].

[4] Contaminant source identification has been studied for over two decades in groundwater hydrology. Three subproblems are often considered under this topic: finding the release history of a source, finding the location of a source, and recovering the initial distribution of a contaminant plume. Various deterministic and statistical methods have been devised to solve these problems. *Atmadja and Bagtzoglou* [2001a] and *Michalak and Kitanidis* [2004] provided extensive literature reviews on this subject. Existing approaches can be roughly classified into three categories.

[5] The first category of approaches formulates the source identification problem as an optimization problem and solves it using either a linear or a nonlinear programming technique. For example, *Gorelick et al.* [1983] used nonlinear programming to identify pollution sources and disposal episodes. *Wagner* [1992] presented a nonlinear maximum likelihood methodology for simultaneously identifying flow and transport model parameters, as well as pollutant sources. By recognizing the fundamental ill-posed nature of source recovery problems, *Skaggs and Kabala* [1994] used Tikhonov regularization (TR) to reconstruct the source release history of a one-dimensional transport problem. *Alapati and Kabala* [2000] used a nonlinear least squares method without regularization to recover the release history of a one-dimensional problem. *Mahar and Datta* [2001] considered identifying contaminant sources in conjunction with optimal monitoring network design using nonlinear optimization techniques. *Aral et al.* [2001] used a progressive genetic algorithm to solve the nonlinear optimization problem for source identification.

[6] The second category of approaches adopts a probability-based method. For example, *Woodbury and Ulrych* [1996] and *Woodbury et al.* [1998] used minimum relative entropy (MRE), a Bayesian inference approach, for source recovery. Given prior information in terms of lower and upper bounds and a prior “best estimate” of the model, MRE yields closed-form expressions for the posterior density function of the estimate by minimizing a measure of relative entropy. *Snodgrass and Kitanidis* [1997] used a geostatistical method to recover the release history of a point source in a one-dimensional steady state flow field. In that method, the unknown release history is regarded as a statistical field characterized by a few statistical parameters. The geostatistical method has been extended to two- and three-dimensional cases [*Butera and Tanda*, 2002; *Michalak and Kitanidis*, 2004].

[7] The third category of approaches solves the advection-dispersion equation (ADE) backward in time. Since the dispersion process is irreversible, the ADE cannot be solved backward simply by using negative time steps. *Bagtzoglou et al.* [1992] and *Wilson and Liu* [1994] reversed the advection part while keeping the dispersion part unchanged. Their method gives a backward location probability or a backward traveltime probability. Later, *Neupauer and Wilson* [1999, 2001, 2004] used the adjoint state method to compute these probabilities. *Atmadja and Bagtzoglou* [2001b] derived a backward beam equation for the ADE to obtain the backward-in-time solution, and later, *Bagtzoglou and Atmadja* [2003] showed that this method might produce better results than those obtained by the quasi-reversibility method of *Skaggs and Kabala* [1995]. The backward-in-time method is an effective direct method if field measurements can provide the final condition for the backward solution. For two- and three-dimensional problems, however, the backward beam equation is difficult to solve because the advection-dispersion operator is squared in the equation.

[8] From our literature review and the comparison-of-methodology table provided by *Michalak and Kitanidis* [2004], we see that in most previous studies the effect of model error was evaluated only through ad hoc sensitivity studies. For the contaminant source identification problem, model error can be caused by an oversimplified model structure, inexact model parameters, and numerical error (e.g., the error caused by numerical dispersion or numerical discretization). The approach we present below allows a modeler to incorporate his or her knowledge about model uncertainty directly into the estimation.

[9] The contaminant source identification problem can be transformed into a direct least squares problem because of the linearity of the ADE. This otherwise effective method is not used in practice because of its numerical instability [*Lawson and Hanson*, 1995; *Bjorck*, 1996]. The least squares matrix is often ill-conditioned, so that small errors in the observation data, even if normally distributed, may cause a significant change in the solution. The TR method [*Tikhonov and Arsenin*, 1977] is often used [*Skaggs and Kabala*, 1994] to stabilize the solution and to prevent it from growing without bound. There is, however, no rigorous way of determining the regularization parameter. As a result, TR may yield either overregularized or underregularized solutions and thus does not allow confidence in the final result [*Schubert*, 2003]. The total least squares (TLS) method, pioneered by *Golub and Van Loan* [1980] and further refined by *Van Huffel and Vandewalle* [1989, 1991], considers errors in both the coefficient matrix and the right-hand side of the least squares equations. TLS has been successfully applied to many different fields in the last decade [e.g., *Van Huffel*, 1997; *Van Huffel and Lemmerling*, 2002], but direct application of the standard TLS is not well suited for contaminant source identification because it can break down easily when the errors are not independent and identically distributed (IID) with zero mean.
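The instability described above can be illustrated with a minimal Python sketch. It is an illustration only, not the method developed in this paper. The setup is assumed: a one-dimensional ADE with constant velocity `v` and dispersion coefficient `D`, a single source at `x = 0`, and an analytical kernel of the kind used in one-dimensional benchmark problems. All parameter values, the noise level, the function names `transfer_matrix` and `tikhonov`, and the hand-picked regularization parameter `alpha` are hypothetical choices made for the example.

```python
import numpy as np

def transfer_matrix(x_obs, t_src, T, v=1.0, D=1.0):
    """Discretized linear map from release history s(t_j) to concentrations c(x_i, T).

    Assumed kernel (1-D ADE, constant v and D, source at x = 0):
    f(x, tau) = x / (2 sqrt(pi D tau^3)) * exp(-(x - v tau)^2 / (4 D tau)),
    where tau = T - t is the time elapsed since release.
    """
    dt = t_src[1] - t_src[0]
    tau = T - t_src                      # must be strictly positive
    X, TAU = np.meshgrid(x_obs, tau, indexing="ij")
    kernel = X / (2.0 * np.sqrt(np.pi * D * TAU**3)) \
        * np.exp(-(X - v * TAU) ** 2 / (4.0 * D * TAU))
    return kernel * dt                   # simple rectangle-rule quadrature

def tikhonov(A, b, alpha):
    """Zeroth-order Tikhonov: minimize ||A s - b||^2 + alpha^2 ||s||^2."""
    n = A.shape[1]
    A_aug = np.vstack([A, alpha * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    return np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]

rng = np.random.default_rng(0)
T = 300.0                                # observation time (assumed)
t_src = np.linspace(0.0, 250.0, 50)      # release times, all earlier than T
x_obs = np.linspace(5.0, 300.0, 60)      # observation locations
A = transfer_matrix(x_obs, t_src, T)

s_true = np.exp(-(t_src - 130.0) ** 2 / (2.0 * 15.0**2))    # synthetic release history
b_noisy = A @ s_true + 1e-3 * rng.standard_normal(x_obs.size)

s_ls = np.linalg.lstsq(A, b_noisy, rcond=None)[0]           # direct least squares
s_tr = tikhonov(A, b_noisy, alpha=0.02)                     # alpha picked by hand

cond_A = np.linalg.cond(A)
err_ls = np.linalg.norm(s_ls - s_true) / np.linalg.norm(s_true)
err_tr = np.linalg.norm(s_tr - s_true) / np.linalg.norm(s_true)
print(f"condition number of A: {cond_A:.2e}")
print(f"relative error, direct least squares: {err_ls:.2e}")
print(f"relative error, Tikhonov:             {err_tr:.2e}")
```

Because `alpha` is chosen by hand here, rerunning the sketch with larger or smaller values shows the overregularized and underregularized behavior noted above.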

[10] In recent years, a robust counterpart of the ordinary least squares, the robust least squares (RLS) method, was developed based on advances in robust convex programming [*El Ghaoui and Lebret*, 1997; *Chandrasekaran et al.*, 1996; *Ben-Tal and Nemirovski*, 1995, 1998]. In the context of this paper, robustness is defined as the resistance, or immunity, of an estimator to model and data uncertainty. Uncertainties can be characterized either in a set theoretic setting, which consists of defining bounds for the uncertain variables, or in a probabilistic setting, which relies on finding probability density functions. Many parameter estimation techniques assume that the error distribution is Gaussian. Real-world uncertainties, however, also include non-Gaussian noise, nonwhite noise, and systematic errors. These uncertainties can easily be considered in a set theoretic setting [*Walter and Piet-Lahanier*, 1990; *Ben-Tal and Nemirovski*, 1997; *Goldfarb and Iyengar*, 2003]. This is the setting used in RLS, which assumes that the system is subject to unknown but bounded perturbations and attempts to reduce the sensitivity of the optimal solution to these perturbations. Robustness, however, does not come without a price: all robust estimators achieve robustness through some type of regularization and thus introduce bias into the solution. One of the key motivations behind using a robust estimator like RLS, as *Bertsimas and Sim* [2004, p. 1] wrote, is that “robustness assures that the solution remains feasible and near-optimal when data are uncertain.”
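For orientation, the unstructured RLS problem of *El Ghaoui and Lebret* [1997] can be written as a min-max problem over norm-bounded perturbations. The symbols below are notation assumed for this illustration and may differ from the notation used later in the paper: $A$ is the coefficient matrix, $b$ the observation vector, $x$ the unknown vector, $\Delta A$ and $\Delta b$ the unknown perturbations, and $\rho$ the perturbation bound.

$$
\min_{x} \; \max_{\left\| [\Delta A \;\; \Delta b] \right\|_F \le \rho} \left\| (A + \Delta A)\,x - (b + \Delta b) \right\|_2
\;=\;
\min_{x} \left( \left\| A x - b \right\|_2 + \rho \sqrt{\left\| x \right\|_2^2 + 1} \right)
$$

The second form makes the regularizing effect explicit: the worst-case residual equals the nominal residual plus a penalty that grows with the norm of the solution, which is the source of the bias mentioned above.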

[11] RLS has been successfully applied to robust controller analysis [*El Ghaoui and Lebret*, 1997; *De Fonseca et al.*, 2001], image processing [*Schubert*, 2003], structural analysis [*Ben-Tal and Nemirovski*, 1997; *Mares et al.*, 2002], and financial analysis [*Bertsimas and Sim*, 2004]. To the best of our knowledge, RLS has not been applied to contaminant source identification.

[12] In this paper, we formulate a new variant of RLS, i.e., a constrained robust least squares (CRLS) method, for solving the source identification problem. Prior information on variable bounds, as well as additional linear constraints, can be incorporated directly into our formulation. The presented method is effective (no iteration is needed), robust (a bound on model error is incorporated), and practical (system based). It can be readily combined with a mass transport model, such as MT3DMS [*Zheng and Wang*, 1999], for conducting real case studies.

[13] The paper is organized as follows. We present our system-based approach for source identification in section 2 and discuss various methods for solving a system of linear equations in section 3, with emphasis on RLS and our extension to RLS (i.e., CRLS). Finally, both one- and two-dimensional numerical examples are given in section 4. The impacts of observation errors and model errors (including numerical dispersion error) on estimated source strengths are discussed, and comparisons with other estimation methods are made.