# Fields of nonlinear regression models for inversion of satellite data

## Abstract

[1] A solution is provided to a common inverse problem in satellite remote sensing, the retrieval of a variable y from a vector x of explanatory variables influenced by a vector t of conditioning variables. The solution is in the general form of a field of nonlinear regression models, i.e., the relation between y and x is modeled as a map from some space to a subset of a function space. Elementary yet important mathematical results are presented for fields of shifted ridge functions, selected for their approximation properties. These fields are shown to span a dense set and to inherit the approximation properties of shifted ridge functions. A serious mathematical difficulty regarding the practical construction of continuous fields of shifted ridge functions is pointed out; it is circumvented while providing grounding to a large class of construction methodologies. Within this class, a construction scheme that builds upon multilinear interpolation is described. When applied to the retrieval of upper-ocean chlorophyll-a concentration from space, the solution shows potential for improved accuracy compared with existing algorithms.

## 1. Introduction

[2] A statistical model aims at explaining an exogenous variable $y$ from several explanatory variables $x_1,\ldots,x_n$. When $x_1,\ldots,x_n$ are deterministic variables, the model expresses the dependence of the expected value $E[y]$ on the explanatory variables and an unknown parameter vector $\omega$, as a function $f(x_1,\ldots,x_n;\omega)$. In the random case, the model is written conditionally on the observations, i.e., $E[y]$ is replaced by the conditional expected value $E[y \mid x_1,\ldots,x_n]$. The function $f$ is called the link function between $y$ and $x_1,\ldots,x_n$ and, depending on its expression, defines a linear or nonlinear regression model. Models such as perceptrons, which fall in the class of so-called ridge constructions, achieve this statistical modeling goal and have several well-known properties of interest. These include the density, or universal approximation, property [Cybenko, 1989; Lin and Pinkus, 1993], and results on the approximation rate, including the dimension-independent upper bound [Barron, 1993; Burger and Neubauer, 2001; Makovoz, 1998] and the asymptotic expression obtained by Maiorov [1999].

[3] In this vein, we focus on a slightly different regression problem, for which we propose a modified solution, based on ridge function approximants, that inherits the mathematical properties mentioned above. This problem still consists in explaining $y$ from $x_1,\ldots,x_n$, with the difference that only some of the $x_i$, say $x_1,\ldots,x_d$ ($d < n$), convey information about $y$, while the remaining variables act as parameters, or conditioning variables, in the sense that they influence the link function between $y$ and the truly informative variables $x_1,\ldots,x_d$.

[4] Typical examples of this kind of problem are found in geosciences, where the observed data may depend on several angular variables that define the geometry of the observation process. They include the retrieval of ocean color and aerosols from reflectance measurements in the visible and near infrared, and the retrieval of wind speed, salinity, and sea surface temperature from brightness temperature measurements at microwave wavelengths. In ocean color remote sensing, the objective is to estimate the concentration of oceanic constituents, such as phytoplankton chlorophyll-a. The informative variables $x_1, x_2,\ldots,x_d$, in this case the top-of-atmosphere reflectance measurements, depend continuously on the angular variables that characterize the positions of the observing satellite and of the Sun relative to the target on the Earth's surface. These angular variables carry no information about the chlorophyll-a concentration, yet they have to be taken into account, because the link function between chlorophyll-a concentration and $x_1, x_2,\ldots,x_d$ depends on them.

[5] For this kind of problem, it seems natural to separate the variables that are effectively informative with respect to y from the conditioning variables. We shall denote by x the d-dimensional vector of informative variables, and by t the p-dimensional vector of conditioning variables. The proposed solution consists in attaching to each t a nonlinear regression model explaining y from x, with the requirement that the attachment vary smoothly in t. This approach yields a field of nonlinear regression models over the set of permitted values for t.

[6] The paper is organized as follows. In section 2, the problem of interest is stated more formally, and fields of nonlinear regression models are defined. In section 3, construction schemes of such a model from scattered data are presented. In section 4, results obtained by applying this methodology to ocean color remote sensing are discussed. Finally, conclusions are given, as well as perspectives on future work.

## 2. Function Fields and Nonlinear Regression Model Fields

[7] Let $x$ be a vector of explanatory variables, let $t$ be a vector of conditioning variables, and let $y$ be the real variable to be explained. Let $X$ and $T$ be the sets of permitted values for $x$ and $t$, respectively. We consider statistical models of the following form:

$$y = f_t(x) + \varepsilon, \tag{1}$$

where for each $t \in T$, $f_t$ is an element of a subset $\mathcal{M}$ of $\mathcal{C}(X)$, the set of continuous real-valued functions on $X$, and $\varepsilon$ is a random variable with zero mean and finite variance $\sigma^2$ that is not correlated with $x$. Hence in this model, $x$ carries information about $y$, while $t$ does not, but the link function between $y$ and $x$ depends on $t$. The set $\mathcal{M}$ is defined below.

[8] To study the dependence of $f_t$ on $t$, including continuity and regularity, we introduce the notion of a function field over $T$. We shall assume that $X$ is locally compact and Hausdorff, and that $T$ is compact, metric, and Hausdorff. A space $S$ is said to be compact if every open covering of $S$ has a finite subcovering, and Hausdorff if, for any two points $x \neq y$, there exist two disjoint open sets $U$ and $V$ with $x \in U$ and $y \in V$. The compact subsets of $\mathbb{R}^n$ are well characterized: they are its closed and bounded subsets. We define a function field over $T$ as a map $T \to \mathcal{C}(X)$. The set of all continuous function fields over $T$ will be denoted by $(\mathcal{C}(X))^T$. The natural topology on $(\mathcal{C}(X))^T$ is the compact-open topology, which, under the above assumptions on $X$ and $T$, is equivalent to the topology of uniform convergence on compact sets. Furthermore, there is the homeomorphism $\mathcal{C}(X \times T) \cong (\mathcal{C}(X))^T$ [see, e.g., Bredon, 1993, pp. 437–440]. In other words, the elements of $(\mathcal{C}(X))^T$ are in one-to-one correspondence with the elements of $\mathcal{C}(X \times T)$, and this correspondence is continuous in both directions. Consequently, to each $\zeta \in (\mathcal{C}(X))^T$ there corresponds a unique map $\zeta^*$ of $\mathcal{C}(X \times T)$ such that $\zeta^*(x, t) = \zeta(t)(x)$ for all $x \in X$ and $t \in T$, and conversely. Similarly, for $\mathcal{M}$ a subset of $\mathcal{C}(X)$, the set of all $\mathcal{M}$-valued continuous function fields over $T$ will be denoted by $\mathcal{M}^T$.

[9] Returning to the initial problem, equation (1) may be rewritten equivalently as

$$y = \zeta(t)(x) + \varepsilon, \tag{2}$$

or as

$$y = \zeta^*(x, t) + \varepsilon, \tag{3}$$

where $\zeta$ belongs to $\mathcal{M}^T$. Hence equation (2) defines a field of regression models over $T$. One may show that if $\mathcal{M}$ is dense in $\mathcal{C}(X)$ and if $T$ is as above, then $\mathcal{M}^T$ is dense in $(\mathcal{C}(X))^T$.

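[10] Herein, we shall be interested in the case where the model set $\mathcal{M}$ is the set spanned by functions of the ridge form. A ridge function on $\mathbb{R}^d$ is a function of the form $h(a \cdot x + b)$, where $h \in \mathcal{C}(\mathbb{R})$, $a \in \mathbb{R}^d$, and $b \in \mathbb{R}$. Hence we consider the set $\mathcal{M} = \bigcup_n \mathcal{M}_n$, where

$$\mathcal{M}_n = \left\{ \sum_{i=1}^{n} c_i\, h(a_i \cdot x + b_i) \;;\; c_i \in \mathbb{R},\ a_i \in \mathbb{R}^d,\ b_i \in \mathbb{R} \right\}. \tag{4}$$

As mentioned in the introduction, $\mathcal{M}$ is dense in $\mathcal{C}(\mathbb{R}^d)$, in the topology of uniform convergence on compact subsets [Lin and Pinkus, 1993]. Let us introduce some notation. Each element of $\mathcal{M}_n$ depends on parameters $c_i$, $a_i$, $b_i$, for $i = 1,\ldots,n$, that we shall summarize by a vector $\theta_n$. The elements of $\mathcal{M}_n$ will be denoted by $f(\cdot\,; \theta_n)$. Let $\Theta_n$ be the set of allowable values for $\theta_n$, i.e., $\Theta_n = (\mathbb{R} \times \mathbb{R}^d \times \mathbb{R})^n$, and let $i_n: \Theta_n \to \mathcal{M}_n$ be the continuous map carrying a parameter vector $\theta_n$ to the corresponding model of $\mathcal{M}_n$. So $i_n(\theta_n)$ is the function $f(\cdot\,; \theta_n)$, which associates to each $x \in \mathbb{R}^d$ the real number $f(x; \theta_n)$.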
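To make the map from a parameter vector to a model concrete, the following minimal Python sketch evaluates an element of the model set of equation (4). The function name `ridge_model` and the representation of the parameter vector as a list of `(c_i, a_i, b_i)` triples are illustrative assumptions, not notation from the text; the generator function defaults to the hyperbolic tangent, the choice made in section 4.

```python
import math

def ridge_model(x, theta, h=math.tanh):
    """Evaluate f(x; theta) = sum_i c_i * h(a_i . x + b_i).

    theta is a list of (c_i, a_i, b_i) triples (hypothetical layout),
    where a_i is a sequence with the same length as x.
    """
    total = 0.0
    for c, a, b in theta:
        if len(a) != len(x):
            raise ValueError("a_i and x must have the same dimension")
        # Ridge function: h applied to an affine form of x.
        total += c * h(sum(aj * xj for aj, xj in zip(a, x)) + b)
    return total

# A model with n = 2 ridge functions on R^2 (d = 2).
theta = [(3.0, [0.5, 0.0], 0.0), (2.0, [0.0, 1.0], -1.0)]
value = ridge_model([2.0, 1.0], theta)  # 3*tanh(1) + 2*tanh(0)
```

Each triple contributes d + 2 real numbers, so a model with n ridge functions has n(d + 2) parameters, which is the dimension of the parameter set used later in the construction.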

[11] At this point, a function field $\zeta$ is a relatively abstract object: to evaluate $\zeta(t)(x)$ at some $t$ and $x$, an explicit representation of $\zeta$ is necessary. Consider a function field $\zeta \in \mathcal{M}^T$. Since $T$ is compact, we may assume, without loss of generality, that $\zeta$ belongs to $\mathcal{M}_n^T$ for some integer $n$. We intend to build a continuous function field $\zeta \in \mathcal{M}_n^T$ via a parameter map $\xi: T \to \Theta_n$ such that $\zeta = i_n \circ \xi$. Two difficulties arise because the map $i_n$ is only a continuous surjection. First, for a given $\zeta \in \mathcal{M}_n^T$, there might not exist a continuous map $\xi: T \to \Theta_n$ such that $\zeta = i_n \circ \xi$. Second, if we proceed conversely by building $\zeta$ according to $\zeta = i_n \circ \xi$, where $\xi$ is continuous, we are not sure to obtain all of $\mathcal{M}_n^T$ when $\xi$ is allowed to vary in all of $\mathcal{C}(T, \Theta_n)$. However, it is easy to prove that the set of continuous function fields $\zeta \in \mathcal{M}^T$ such that

$$\zeta(t)(x) = \sum_{i=1}^{n} c_i(t)\, h\big(a_i(t) \cdot x + b_i(t)\big), \tag{5}$$

for some integer $n$, $c_i \in \mathcal{C}(T)$, $a_i \in \mathcal{C}(T, \mathbb{R}^d)$, and $b_i \in \mathcal{C}(T)$, is dense in $(\mathcal{C}(X))^T$. This simply follows from the fundamentality in $\mathcal{C}(X \times T)$ of the set of functions of the ridge form on $X \times T$.
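As an illustration of a field built through continuous parameter maps, the sketch below returns, for each t, the regression model attached to t, and also exposes the associated function of (x, t) from section 2. The helper names `make_field` and `uncurry`, and the toy affine parameter maps, are assumptions made for this example only.

```python
import math

def make_field(c_maps, a_maps, b_maps, h=math.tanh):
    """Build zeta with zeta(t)(x) = sum_i c_i(t) * h(a_i(t) . x + b_i(t)).

    c_maps, b_maps: lists of callables t -> float.
    a_maps: list of callables t -> sequence of d floats.
    """
    def zeta(t):
        # Freeze the parameters at t: this is the model attached to t.
        params = [(c(t), a(t), b(t)) for c, a, b in zip(c_maps, a_maps, b_maps)]

        def f_t(x):
            return sum(
                c * h(sum(aj * xj for aj, xj in zip(a, x)) + b)
                for c, a, b in params
            )
        return f_t
    return zeta

def uncurry(zeta):
    """Return zeta_star with zeta_star(x, t) = zeta(t)(x)."""
    return lambda x, t: zeta(t)(x)

# Toy field with n = 1, d = 1, p = 1 and affine parameter maps (assumed).
zeta = make_field(
    c_maps=[lambda t: 1.0 + t[0]],
    a_maps=[lambda t: [t[0]]],
    b_maps=[lambda t: 0.0],
)
zeta_star = uncurry(zeta)
value = zeta((0.5,))([2.0])  # 1.5 * tanh(0.5 * 2.0)
```

Continuity of the field in t is inherited from the continuity of the parameter maps, which is why the construction methods of section 3 work on those maps rather than on the field directly.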

## 3. Construction Schemes

[12] Let $\mathcal{D}$ be a data set of $N$ samples $(x_i, t_i, y_i)$. Based on $\mathcal{D}$, we wish to represent the link between $y$, $x$, and $t$ through a field of nonlinear regression models of the form defined in equation (2), where $\zeta \in \mathcal{M}_n^T$ and $\mathcal{M}_n$ is as in (4). In light of the results stated in the previous section, we present below two methods for constructing $\zeta$ via a parameter map $\xi: T \to \Theta_n$. In both of them, we hypothesize normally distributed residuals, i.e., we assume $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, and use the averaged sum of the squared errors, $\mathcal{E} = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - \zeta(t_i)(x_i)\big)^2$, as the natural criterion to be minimized for selecting a function field from the data. The main difference from traditional parametric modeling techniques is that here the parameters of $\zeta$ are maps $T \to \Theta_n$.
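The selection criterion can be written in a few lines. In this sketch the field is any callable such that `zeta(t)(x)` returns a prediction, and the data set is a list of `(x, t, y)` triples; both conventions, and the toy field used to exercise the function, are assumptions for illustration.

```python
def mean_squared_error(zeta, samples):
    """Average of (y_i - zeta(t_i)(x_i))**2 over a list of (x_i, t_i, y_i)."""
    if not samples:
        raise ValueError("the data set must not be empty")
    return sum((y - zeta(t)(x)) ** 2 for x, t, y in samples) / len(samples)

# Toy field (assumed): zeta(t)(x) = t[0] * x[0].
toy_field = lambda t: (lambda x: t[0] * x[0])
samples = [([1.0], (2.0,), 2.0), ([2.0], (1.0,), 0.0)]
criterion = mean_squared_error(toy_field, samples)  # (0**2 + 2**2) / 2
```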

[13] The first method consists in taking a parameterized subset $\mathcal{F}_\rho$ of $\mathcal{C}(T, \Theta_n)$ that contains the constant and affine functions of $t$ (for the reason previously mentioned), where $\rho$ is the parameter vector. The problem then reduces to minimizing $\mathcal{E}$ with respect to $\rho$, which may be achieved, for instance, by means of a stochastic gradient descent algorithm or simulated annealing. However, this method may suffer from an inappropriate choice of $\mathcal{F}_\rho$, which may yield a much larger $n$ than necessary. The second method, described below, allows one to cope with this issue.

[14] This second method consists in building a sample of a continuous map $\xi: T \to \Theta_n$ such that the induced field $\zeta := i_n \circ \xi$ minimizes $\mathcal{E}$. The sketch is as follows. We start by defining a set $\mathcal{F}_K$ of real-valued, continuous, piecewise-differentiable functions whose domain is a set containing $T$; the value of any of these functions is obtained by multilinear interpolation. Next, those maps are used to define a function field over $T$. More precisely, assume $T$ is a compact subset of $\mathbb{R}^p$. Let $t_1^\Xi,\ldots,t_K^\Xi$ be $K$ points of $\mathbb{R}^p$ forming the vertices of a regular grid of $\mathbb{R}^p$, such that $T$ is included in the smallest $p$-dimensional cube $\Xi$ containing all of the $t_k^\Xi$. Note that $K$ is the product of $p$ integers, each at least 2. Hence $T \subset \Xi$, and $t_k^\Xi \in \Xi$ for all $k = 1,\ldots,K$. Let $\gamma_1,\ldots,\gamma_K$ be $K$ real numbers, and consider those continuous and piecewise-differentiable maps $g \in \mathcal{C}(\Xi)$ such that $g(t_k^\Xi) = \gamma_k$ for all $k = 1,\ldots,K$, and defined for all $t \in \Xi$ such that $t \neq t_k^\Xi$ for all $k$ by:

$$g(t) = \sum_{l=1}^{2^p} \alpha_l(t)\, \gamma_{k_l}. \tag{6}$$

In this equation, the $t_{k_l}^\Xi$, $l = 1,\ldots,2^p$, are the $2^p$ immediate neighbors of $t$ on the grid, i.e., they are the vertices of a $p$-cube containing $t$, and the coefficients $\alpha_l(t)$ are the coefficients of the standard $p$-dimensional interpolation procedure on a $p$-cube. We shall denote by $\mathcal{F}_K$ the set of all such maps. This construction method is illustrated in Figure 1, in the case where $T$ is of dimension 2. Next, consider those function fields $\zeta$ over $T$ that are the restrictions to $T$ of function fields over $\Xi$ of the form defined in equation (5), where the $b_i$, the $c_i$, and the components of the $a_i$ belong to $\mathcal{F}_K$. Such a function field is parameterized by $n(d + 2)K$ real numbers $\gamma_{kj}$, where $1 \leq k \leq K$ and $1 \leq j \leq n(d + 2)$, since $\dim(\Theta_n) = n(d + 2)$.
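The interpolation step can be sketched as follows for a regular grid given by one sorted coordinate list per axis. The function name `multilinear_interpolate`, the dictionary layout for the vertex values, and the affine test function are assumptions made for this example; the grid itself is the 2 × 2 × 3 grid used in section 4.

```python
import bisect
import itertools

def multilinear_interpolate(axes, gamma, t):
    """Interpolate vertex values gamma at point t on a regular grid.

    axes: list of p sorted coordinate lists, each with at least 2 entries.
    gamma: dict mapping a vertex index tuple (i_1, ..., i_p) to a value.
    t: sequence of p coordinates lying inside the grid.
    """
    lower, frac = [], []
    for axis, tj in zip(axes, t):
        if not axis[0] <= tj <= axis[-1]:
            raise ValueError("t lies outside the grid")
        # Index of the cell containing tj along this axis.
        j = min(bisect.bisect_right(axis, tj) - 1, len(axis) - 2)
        lower.append(j)
        frac.append((tj - axis[j]) / (axis[j + 1] - axis[j]))
    value = 0.0
    # Sum over the 2**p vertices of the cell containing t.
    for corner in itertools.product((0, 1), repeat=len(axes)):
        alpha = 1.0
        for bit, w in zip(corner, frac):
            alpha *= w if bit else 1.0 - w
        index = tuple(j + bit for j, bit in zip(lower, corner))
        value += alpha * gamma[index]
    return value

axes = [[0.0, 1.0], [0.0, 1.0], [-1.0, 0.0, 1.0]]
# Vertex values sampled from an affine function of t (assumed test case).
gamma = {
    idx: 1.0 + 2.0 * axes[0][idx[0]] - axes[1][idx[1]] + 0.5 * axes[2][idx[2]]
    for idx in itertools.product(*(range(len(a)) for a in axes))
}
value = multilinear_interpolate(axes, gamma, (0.25, 0.5, -0.3))
```

Because multilinear interpolation reproduces affine functions exactly, the interpolated value here coincides with the affine function evaluated at t, which is a convenient sanity check for an implementation.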

[15] The minimization of $\mathcal{E}$ with respect to the $\gamma_{kj}$ may be performed as follows. First, pick randomly a sample $(x_i, t_i, y_i)$ from $\mathcal{D}$ and compute the squared error $e_i = \big(y_i - \zeta(t_i)(x_i)\big)^2$ of the model for that sample. If $t_i$ falls on one of the vertices of the grid, say on $t_k^\Xi$, then the error $e_i$ depends on the $\gamma_{kj}$, $1 \leq j \leq n(d + 2)$. Otherwise, $t_i$ is different from all the vertices, and if we let $t_{k_l}^\Xi$, $1 \leq l \leq 2^p$, be the $2^p$ immediate neighbors of $t_i$ on the grid, then the error $e_i$ depends on the $\gamma_{k_l j}$ for $1 \leq l \leq 2^p$ and $1 \leq j \leq n(d + 2)$. So in a second step, modify the appropriate $\gamma_{kj}$ according to the rule:

$$\gamma_{kj} \leftarrow \gamma_{kj} - \eta\, \frac{\partial e_i}{\partial \gamma_{kj}}, \tag{7}$$

where $\eta$ is a strictly positive scalar. Finally, repeat those steps until convergence. By this method, we obtain from the sets $\mathcal{D}_j^\Xi := \{(t_k^\Xi, \gamma_{kj});\ k = 1,\ldots,K\}$ a sample of a map $\xi$ yielding the function field $\zeta = i_n \circ \xi$. The advantages with respect to the first method are (i) that the grid may be refined during the execution of the algorithm, and (ii) that the resulting sample may be used to choose an appropriate model set for $\xi$ that achieves, for example, a higher degree of regularity. Indeed, this algorithm performs a stochastic gradient descent and, when the number $N$ of samples tends towards infinity, the resulting field $\zeta$ of nonlinear regression models is expected to be a good approximation to the field $E_t[y \mid x]$ over $T$ of the ($t$-dependent) conditional means of $y$ given $x$.
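The iteration described above can be sketched on a deliberately small case: one conditioning variable (p = 1), one explanatory variable (d = 1), and a single ridge function (n = 1), so that each grid vertex carries three numbers (c, a, b). All names, the synthetic data, the step size, and the use of the squared error as the per-sample error are assumptions of this sketch, not details taken from the text.

```python
import bisect
import math
import random

def interp_weights(grid, t):
    """Return [(k, alpha)] for the two grid neighbors of t (linear interpolation)."""
    k = min(bisect.bisect_right(grid, t) - 1, len(grid) - 2)
    w = (t - grid[k]) / (grid[k + 1] - grid[k])
    return [(k, 1.0 - w), (k + 1, w)]

def evaluate(gamma, grid, t, x):
    """Interpolate (c, a, b) at t from the vertex values, then apply c*tanh(a*x + b)."""
    weights = interp_weights(grid, t)
    c, a, b = (sum(alpha * gamma[k][j] for k, alpha in weights) for j in range(3))
    return c * math.tanh(a * x + b), (c, a, b), weights

def sgd_step(gamma, grid, sample, eta):
    """Update the vertex values neighboring t by one gradient step on the squared error."""
    x, t, y = sample
    y_hat, (c, a, b), weights = evaluate(gamma, grid, t, x)
    h = math.tanh(a * x + b)
    # Partial derivatives of the prediction with respect to c, a, b.
    d_pred = (h, c * (1.0 - h * h) * x, c * (1.0 - h * h))
    for k, alpha in weights:  # only the neighbors of t are modified
        for j in range(3):
            grad = -2.0 * (y - y_hat) * d_pred[j] * alpha
            gamma[k][j] -= eta * grad

def mean_squared_error(gamma, grid, data):
    return sum((y - evaluate(gamma, grid, t, x)[0]) ** 2 for x, t, y in data) / len(data)

rng = random.Random(0)
grid = [0.0, 0.5, 1.0]
# Synthetic noise-free data from a field the model can represent exactly.
data = []
for _ in range(500):
    x, t = rng.uniform(-2.0, 2.0), rng.uniform(0.0, 1.0)
    data.append((x, t, (1.0 + t) * math.tanh(x)))

gamma = [[0.5, 0.5, 0.0] for _ in grid]  # initial (c, a, b) at each vertex
initial_loss = mean_squared_error(gamma, grid, data)
for _ in range(5000):
    sgd_step(gamma, grid, rng.choice(data), eta=0.02)
final_loss = mean_squared_error(gamma, grid, data)
```

When a sample falls exactly on a vertex, the interpolation weight of the other neighbor is zero, so only that vertex is updated, which matches the two cases distinguished in the text. Extending the sketch to p > 1 would replace `interp_weights` by the weights of the 2^p vertices of the enclosing cell.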

## 4. Application to Ocean Color Remote Sensing

[16] For this problem, we have let $t = (\cos\theta_s, \cos\theta_v, \cos\Delta\phi)^\top$, where $\theta_s$, $\theta_v$, and $\Delta\phi$ are the Sun, view, and relative azimuth angles, respectively, and the set $T$ of permitted values for $t$ is $[0, 1] \times [0, 1] \times [-1, 1]$. The vector $x$ is composed of top-of-atmosphere (TOA) reflectances in spectral bands in the visible and near-infrared, centered at 412, 443, 490, 510, 555, 670, 765, and 865 nm (case of the Sea-viewing Wide Field-of-view Sensor). The reflectances are corrected for molecular scattering effects. The variable $y$ is the near-surface chlorophyll-a concentration ([Chl-a]). A statistically significant ensemble of about 62,000 realizations of $x$, encompassing the major sources of variability (mostly due to the atmosphere) and including maritime, continental, and urban aerosols in varied mixtures, has been generated via intensive use of simulation (multiple runs of a coupled ocean-atmosphere radiative transfer code [Vermote et al., 1997]). In the code, the reflectance of the water body has been modeled according to Morel and Maritorena [2001], with no variability due to phytoplankton type. The ensemble has been split into data sets $\mathcal{D}_{l0}$ and $\mathcal{D}_{v0}$, used for model construction and validation, respectively. Two fields $F$ and $F_\nu$ of nonlinear regression models have been built according to the second method, on a 2 × 2 × 3 regular grid, i.e., with vertices belonging to $\{0; 1\} \times \{0; 1\} \times \{-1; 0; 1\}$. The nonlinear regression models attached to $t$ are elements of $\mathcal{M}_n$, with $n = 10$, where $\mathcal{M}_n$ is as in (4). The generator function $h$ has been taken as the hyperbolic tangent, a popular choice in the field of ridge approximation. The number $n = 10$ of basis functions has been determined experimentally, by first building models with a small value of $n$ and increasing the value gradually until no significant improvement in quality is noticed. This construction procedure leads to ridge function fields of the form given by equation (5), where the $a_i(t)$, $b_i(t)$, and $c_i(t)$ are continuous functions of the angular variables (8-dimensional vector-valued for the former, and real-valued for the latter two). Both fields have been built on $\mathcal{D}_{l0}$, but in the case of $F_\nu$, some amount of noise has been added to the data during the execution of the stochastic algorithm. To be as general as possible, this noise scheme has been defined as the sum of spectrally independent and correlated noises, with a total amount of 1%.
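For orientation, the size of this configuration can be worked out from the quantities given above. The sketch below enumerates the grid vertices, counts the free parameters n(d + 2)K with n = 10, d = 8 spectral bands, and K = 12 vertices, and builds the conditioning vector from the three angles. The function name `conditioning_vector` and the use of degrees as input units are assumptions of this example.

```python
import itertools
import math

def conditioning_vector(theta_s_deg, theta_v_deg, delta_phi_deg):
    """Return t = (cos theta_s, cos theta_v, cos delta_phi) for angles in degrees."""
    return tuple(
        math.cos(math.radians(angle))
        for angle in (theta_s_deg, theta_v_deg, delta_phi_deg)
    )

# Vertices of the 2 x 2 x 3 regular grid over T.
axes = [(0.0, 1.0), (0.0, 1.0), (-1.0, 0.0, 1.0)]
vertices = list(itertools.product(*axes))
K = len(vertices)

n, d = 10, 8                      # ridge functions per model, spectral bands
params_per_vertex = n * (d + 2)   # one c_i, one b_i, and d components of a_i
total_params = params_per_vertex * K

t = conditioning_vector(30.0, 0.0, 180.0)
```

With these values each vertex carries 100 numbers and the whole field 1200, which is small compared with the roughly 62,000 simulated realizations used for construction and validation.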

[17] To measure the quality of the [Chl-a] estimation, the root mean squared error (RMS), the bias in natural logarithm (bln), and the root mean squared error in natural logarithm (RMSln) have been used. Their values for fields $F$ and $F_\nu$ are summarized in Table 1. The error in [Chl-a] estimation is on the order of 4.2% over the range 0.03–30 mg m$^{-3}$ in the case of non-noisy data, and 10% in the case of realistic noisy data, which illustrates the efficiency and robustness of this modeling, as well as its generalization ability. Plots of estimated versus expected [Chl-a] are given in Figure 2, for fields $F$ and $F_\nu$ in the non-noisy and noisy cases, showing that the estimation is accurate over the whole range 0.03–30 mg m$^{-3}$. Performance is minimally affected by aerosol optical thickness and type. These results suggest that the methodology based on ridge approximants has potential for improving the accuracy of [Chl-a] retrievals, since current processing techniques yield, without noise, theoretical errors that may reach 20% in the presence of non- or weakly absorbing aerosols, and larger errors when aerosols are strongly absorbing [Gordon, 1997].
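The three error measures are not given as formulas in the text, so the following sketch assumes the usual definitions: RMS as the root mean squared difference between estimated and expected values, bln as the mean difference of their natural logarithms, and RMSln as the root mean squared difference of their natural logarithms. The function name `retrieval_errors` is hypothetical.

```python
import math

def retrieval_errors(estimated, expected):
    """Return RMS, bln, and RMSln under the assumed definitions (positive inputs)."""
    if len(estimated) != len(expected) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(expected)
    diffs = [e - y for e, y in zip(estimated, expected)]
    log_diffs = [math.log(e) - math.log(y) for e, y in zip(estimated, expected)]
    return {
        "RMS": math.sqrt(sum(v * v for v in diffs) / n),
        "bln": sum(log_diffs) / n,
        "RMSln": math.sqrt(sum(v * v for v in log_diffs) / n),
    }

# Toy values in mg m^-3 (illustrative only, not data from the study).
errors = retrieval_errors([math.e, math.e], [1.0, math.e])
```

Under these definitions a small RMSln is close to a relative error, which would be consistent with reading an RMSln of 0.042 as an error of about 4.2%.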

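Table 1. Mean Relative Error in [Chl-a] Estimation Evaluated on Data Sets $\mathcal{D}_{l0}$, $\mathcal{D}_{v0}$, $\mathcal{D}_{l1}$, $\mathcal{D}_{v1}$, for Models $F$ (Built on Non-noisy Data) and $F_\nu$ (Built on Noisy Data), Where $\mathcal{D}_{l1}$ and $\mathcal{D}_{v1}$ Are Noisy Versions of $\mathcal{D}_{l0}$ and $\mathcal{D}_{v0}$, Respectively, With a Total Amount of Noise of 1%

| Data set | Measure | $F$ | $F_\nu$ |
|---|---|---|---|
| $\mathcal{D}_{l0}$ | RMS | 0.520 | 0.836 |
| $\mathcal{D}_{l0}$ | bln | −0.000 | 0.000 |
| $\mathcal{D}_{l0}$ | RMSln | 0.042 | 0.068 |
| $\mathcal{D}_{v0}$ | RMS | 0.534 | 0.859 |
| $\mathcal{D}_{v0}$ | bln | −0.001 | −0.002 |
| $\mathcal{D}_{v0}$ | RMSln | 0.042 | 0.070 |
| $\mathcal{D}_{l1}$ | RMS | 1.672 | 1.091 |
| $\mathcal{D}_{l1}$ | bln | 0.000 | 0.000 |
| $\mathcal{D}_{l1}$ | RMSln | 0.151 | 0.104 |
| $\mathcal{D}_{v1}$ | RMS | 1.650 | 1.118 |
| $\mathcal{D}_{v1}$ | bln | −0.001 | −0.002 |
| $\mathcal{D}_{v1}$ | RMSln | 0.151 | 0.105 |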

## 5. Conclusions

[18] Fields of nonlinear regression models allow one to deal with composite data, where some variables are effectively explanatory while the others are conditioning, without recourse to the product space $X \times T$, which in some cases, such as the ocean color remote sensing problem, may be meaningless. They differ from classical parameterized models in that their parameters are functions of the conditioning variables. From this peculiarity follows a mathematical difficulty in constructing dense sets of continuous function fields for some choices of $\mathcal{M}_n$ and $\mathcal{M}$, namely those subsets that are not homeomorphic to an arcwise-connected subset of a Euclidean space, including the sets spanned by at most $n$ functions of the ridge form, and their union. In this particular case, the difficulty may be circumvented, and a large class of methodologies, comprising the presented method based on multilinear interpolation, applies to the practical generation of dense sets of continuous fields of ridge functions.

[19] In this nonlinear regression context, fields of shifted ridge functions are especially interesting when no external, problem-related knowledge is available since, as shown above, they inherit the approximation properties of shifted ridge functions. The developed methodology is rather general and could be adapted to another set $\mathcal{M}_n$, the choice of which could be driven by application-specific requirements. Extension to the simultaneous explanation of several, possibly correlated, variables is possible by choosing a set $\mathcal{M}_n$ of vector-valued functions. Remote sensing of ocean color is a multivariate problem, especially complex in coastal and estuarine waters, and one may attempt to retrieve the concentrations of constituents other than phytoplankton, such as yellow substances and inorganic material, or their inherent optical properties. It would be worth exploring, however, whether this approach would be optimal in the statistical sense of exhaustive description. Intuitively, a preliminary decorrelation of the variables to be explained might lead to an explanation problem of lower complexity.

## Acknowledgments

[20] This work was supported by the National Aeronautics and Space Administration, by the National Science Foundation, and by the Applied Mathematics Laboratory of the University of Le Havre, France.