The Regionalization of National-Scale SPARROW Models for Stream Nutrients1

Abstract This analysis modifies the parsimonious specification of recently published total nitrogen (TN) and total phosphorus (TP) national-scale SPAtially Referenced Regressions On Watershed attributes models to allow each model coefficient to vary geographically among three major river basins of the conterminous United States. Regionalization of the national models reduces the standard errors in the prediction of TN and TP loads, expressed as a percentage of the predicted load, by about 6 and 7%. We develop and apply a method for combining national-scale and regional-scale information to estimate a hybrid model that imposes cross-region constraints that limit regional variation in model coefficients, effectively reducing the number of free model parameters as compared to a collection of independent regional models. The hybrid TN and TP regional models have improved model fit relative to the respective national models, reducing the standard error in the prediction of loads, expressed as a percentage of load, by about 5 and 4%. Only 19% of the TN hybrid model coefficients and just 2% of the TP hybrid model coefficients show evidence of substantial regional specificity (more than ±100% deviation from the national model estimate). The hybrid models have much greater precision in the estimated coefficients than do the unconstrained regional models, demonstrating the efficacy of pooling information across regions to improve regional models.

The SPARROW watershed modeling technique uses a hybrid statistical and processbased approach to estimate contaminant sources and contaminant transport in the surface waters of large-scale watersheds. A unique feature of the method concerns the spatial referencing of sources, delivery processes and monitoring stations in relation to a reach network, a referencing system that enables improved resolution of the effects sources and transport processes have on monitored loads. A mass balance accounting of load is achieved by linking all measured in-stream loads to identified upstream sources, and by requiring that the accumulation of load across sources and reaches be strictly additive. The dependent variable used in SPARROW is the long-term mean annual contaminant load, the analysis of which does not require an understanding of temporary storage, and thereby facilitates mass balance accounting of mean conditions in streams over long time periods. The spatial referencing noted above, in conjunction with mass balance accounting, allows the model to be used to assess the origin of in-stream contaminant load, both by source type and location, at any location along the reach network.
The SPARROW model specification includes three major structural components: contaminant sources, processes affecting the delivery of sources to streams in the reach network, and the aquatic processes affecting the transport of contaminants once they are in the stream network. A more detailed understanding of the SPARROW methodology can be gained by consideration of the basic equation that describes these elements. Let * i F be the model-estimated flux for contaminant leaving reach i. This flux is equal to the flux leaving adjacent reaches upstream of reach i, denoted j F ′ , where j indexes the set ( ) i J of adjacent reaches upstream of reach i, plus additional flux that is generated within the incremental reach segment i. The set of adjacent upstream reaches will consist of two or more reaches if reach i is the result of a confluence, which is the most common condition, or one reach if reach i is a continuation of a single flowline, or no reaches if reach i is a headwater reach.
The functional relations determining reach i flux are given by (S1) The first summation term represents the amount of flux that leaves upstream reaches and is delivered downstream to reach i, where j F ′ equals measured flux, M j F , if upstream reach j is monitored or, if it is not, is given by the model-estimated flux, * j F . i δ is the fraction of upstream flux delivered to reach i. This fraction is assigned by the reach network: if there are no diversions, then i δ is set to 1; otherwise, the fraction is the fraction of streamflow diverted to reach i. ) (⋅ A is the aquatic delivery function representing attenuation processes acting on flux as it travels along the reach pathway. This function defines the fraction of flux entering reach i at the upstream node that is delivered to the reach's downstream node. The factor is a function of measured stream and reservoir characteristics, denoted by the vectors S i Z and R i Z , with corresponding coefficient vectors S θ and R θ . If reach i is a stream, then only the S i Z and S θ terms determine the value of ) (⋅ A ; conversely, if reach i is a reservoir then the terms that determine ) (⋅ A consist of R i Z and R θ . The second summation term in equation (S1) represents the amount of flux introduced to the stream network at reach i. This term is composed of the flux originating in specific sources, indexed by Associated with each source is a source variable, denoted m S . Depending on the nature of the source, this variable could represent the mass of the source available for transport to streams, or it could be the area of a particular land use. The variable m α is a source-specific coefficient (the collection of all source-specific coefficients is denoted subsequently by the vector represents the land-to-water delivery factor. For sources associated with the landscape, this function along with the source-specific coefficient determines the amount of contaminant delivered to streams. The land-to-water delivery factor is a source-specific function of a vector of delivery variables, denoted by D i Z , and an associated vector of coefficients D θ . It is common practice to express the D i Z variables as differences from their respective mean for the modeled region, a transformation that expresses the source coefficient as a mean contaminant load delivered to the nearest stream segment in the reach network, per unit of the associated source variable. The last term in the equation, the function ) (⋅ ′ A , represents the fraction of flux originating in and delivered to reach i that is transported to the reach's downstream node. This function is similar in form to the aquatic delivery factor defined in the first summation term of the flux equation; however, the default assumption in SPARROW models is that if reach i is classified as a stream, as opposed to a reservoir reach, the contaminants introduced to the reach from its incremental drainage area receive the square root of the reach's full in-stream delivery. This assumption is consistent with the notion that contaminants are introduced to the reach network at the midpoint of reach i and thus experience only half of the reach's time of travel. For reaches classified as reservoirs, the default assumption is that the contaminant receives the full attenuation associated with the reservoir. The nonlinear model structure in equation (S1) contains several key features. The additive contaminant source components and multiplicative land and water transport terms are conceptually consistent with the physical mechanisms that explain the long-term mean rates of supply and movement of contaminants in watersheds. Total modeled flux for a reach is separated into its individual sources. Because this same decomposition is done for all upstream reaches, it is possible in this framework to perform flux accounting, whereby total flux is attributed to its source components. All processes are spatially referenced with respect to the stream network according to the reach in which they operate. This means, for example, that a reservoir at reach i may affect the transport of contaminants entering the reach network upstream, but not downstream, of reach i. The additive source components also provide a mathematical structure in the model that preserves mass. This can be seen by noting that a doubling of each of the source variables i m S , , along with a doubling of all upstream sources, as represented by a doubling of j F ′ , results in an exact doubling of modeled flux * i F . Finally, the modeled flux at any reach i is conditioned on monitored fluxes entering the stream network anywhere upstream of reach i. This approach to nested basins serves to isolate errors introduced in any upstream basin from incremental errors that arise in a downstream basin, making it defensible to treat nested basins as independent observations. Formally, the approach is similar to the way time series models condition predictions on past realizations of the series.
The relation between measured ( M i F ) and modeled ( * i F ) flux for monitored reach i is assumed to be given by where i e is a random error that is independent across monitored basins , with mean zero and variance 2 i σ , and ( ) β * i F is abbreviated notation for the explicit formula given in equation (S1), with . Given a collection of N such observed basins, the method of weighted nonlinear least squares (WNLLS) is used to obtain estimates of the model coefficients, β , and their associated covariances. The weights in the WNLLS procedure are set to , a specification that preserves the average variance of the residuals. Typically, varying weights are used to account for differences in the accuracy of the mean load estimates obtained at monitoring stations. It is also common to estimate the model constraining the source coefficients and the land and aquatic delivery processes, ) (⋅ m D , ) (⋅ A , and ) (⋅ ′ A to be non-negative, a restriction that insures predicted in-stream load is positive. Additional details on the specification and estimation of a SPARROW model are provided in Schwarz et al. (2006).

Derivation of the covariance matrix for the partially constrained regional model used in the stepwise procedure
The following describes the derivation of the covariance matrix for the unconstrained regional differentials, reported as equation (  For simplicity, the observation-specific weights are assumed to be known; results presented in Amemiya (1985) imply that the covariance matrix derived below is unaffected if the weights are estimated consistently using a two-stage process whereby the squared values of the residuals estimated from a first-stage model, one with observation-specific weights set to 1, are regressed on various predictors, the predictions of which are used to compute the observation-specific weights in a second-stage model. Let , and . Multiplication of (S5) by s N and solving for ( ) The first term in the braces of (S6) is where, by assumption, . Because these coefficients are estimated using data that exclude region s, these errors are independent of the other terms in (S6). The secondorder partial derivative terms take the form has an asymptotic distribution that is multivariate normal with a zero mean vector and covariance matrix equal to and j Ṽ is the asymptotic covariance matrix of ( ) , which links the asymptotic covariance matrix derived here to the covariance matrix for the weighted average of the region-specific coefficient vectors reported in the article.
and the probability limits of the sample size-normalized second-order derivatives of the objective function are (S13) Given these results, and given the statistical independence between has an asymptotic distribution that is multivariate normal with zero mean vector and asymptotic covariance matrix given by . Note that except for the term involving s f , (S14) has the same form as the asymptotic variance derived by Murphy and Topel (1985). Murphy and Topel made the implicit assumption that in the limit the absolute difference between the number of observations used to estimate the prior model (here, the model used to estimate s b ) and the number of observations used to estimate the subsequent constrained model (here, the model used to estimate U s δ ) is bounded. In that case ( ) and the term involving s f in (S14) gets dropped from the expression. In the present analysis we allow for the possibility that sampling across regions is proportional, in which case the absolute difference between the number of observations in the various regions is not bounded, leading to the retention of the s f term in (S14). Ultimately, the issue is moot because the sample-estimated variances are not normalized by the number of observations, in which case the s f term disappears from the variance expression (see below).
A consistent estimate of ( ) U s δ Vˆ can be obtained as follows. First, obtain weighted nonlinear least squares estimates of the region-specific process coefficients for the complement regions. These estimates can be obtained by minimizing an objective function such as (S3) for each regional model independently, without cross-region constraints on the coefficients, to obtain consistent estimates of the region-specific coefficient vectors, being a consistent estimate of ( )

Derivation of the covariance matrix for the constrained estimation of the regionalized, fixed-effects national model
The following describes the derivation of the covariance matrix for the constrained estimation of the regional fixed-effects model (i.e., the hybrid model), appearing in the article as equation (A12). This derivation follows the method described by Gallant (1987). The Lagrangian representing the constrained form of the weighted nonlinear least squares objective function is given by with the weighted sum of squares objective given by vector of Lagrangian constraint parameters. To simplify the analysis, the observation-specific weight vector, w, is assumed to be known.
The first-order conditions for minimization of (S16) with respect to β and λ, to obtain the estimates β and λ , are given by (assuming all constraints are binding) ( ) where n 0 is an 1 × n vector of zeros.
The first-order Taylor expansion of ( ) about the true value of the coefficients, β , is given by Substitution of the right-hand side of (S19) for in the first equation of (S18) and, assuming Because the constraints must be satisfied for both β and β , pre-multiplication of (S20) by C G gives ( ) which, assuming is a nonsingular matrix, provides the solution for the Lagrangian constraint parameters Substitution of this result into (S20), and multiplication through by N , gives ; ; where, by assumption, e is an N-element vector of independent, normally distributed random variables having mean N 0 and assumed covariance matrix the asymptotic limit of which is given by Under the conditions stated for Theorem 4.3.2 in Amemiya (1985), the probability limit of β is β . Because β  lies between β and β , it must also have a probability limit of β .
Moreover, the conditions imply the probability limit of ( ) (S27) Thus, assuming the conditions for Amemiya's Theorem 4.3.2 hold, the asymptotic distribution of ( ) β β − N is multivariate normal with mean RK 0 and asymptotic covariance matrix given by ( ) . , 1 1 1 1 1 2 , and define the covariance matrix sample estimate to be Under the conditions of Theorem 4.3.2, is a consistent estimate of ( ) β Vˆ given in (S28), thus establishing the validity of equation (A12) in the article.

Complete results for the hybrid regionalized national model, the regionalized fixed-effects model, the weighted averages of the fixed-effects estimates, and the national model
The following tables present the complete estimation results, for each of the three regions, for both total nitrogen (TN , Table S1) and total phosphorus (TP , Table S2). Each table reports the coefficient estimates of the national model (as reported in Alexander et al., 2008); the regionalized fixed-effects model; and the hybrid model. Also reported are the standard errors and p-values for each coefficient estimate, an indicator of which process coefficients are subject to a physical bound on feasible values for the coefficient (the bound being that the coefficient must be greater than or equal to 0; applied to all source and aquatic removal coefficients), indicators of which hybrid and regional model estimates for which the physical bound is binding (that is, the estimate is forced to zero in order to meet a physical bound that the value be non-negative), and an indicator of which hybrid estimates have a binding cross-region constraint (that is, which hybrid coefficients are constrained to equal a weighted average of the complement regions' hybrid model estimates. The standard errors for the hybrid model estimates are based on the covariance matrix computed using equation (A12) of the Appendix for the article. The standard errors for the regionalized fixed-effects estimates are based on covariance matrices computed using standard nonlinear methods and derived from the results of the regional fixed-effects model (referred to in the article as the r V matrices; see, for example, equation (A2) in the Appendix). Note that because model residuals for a SPARROW model are assumed to be independent, and because all regions are delineated at a monitoring station, the covariances corresponding to fixed-effect coefficients from different regions are exactly zero. Standard errors for the regionalized fixedeffects and hybrid models are unreported if the corresponding coefficient is constrained in estimation to equal the physical bound of zero (source and aquatic removal rate coefficients are constrained to be non-negative; the coefficient is set to zero if the constraint is binding). The standard errors for the national model are taken from Alexander et al. (2008). The reported pvalues are based on a Student's t distribution, using errors degrees of freedom as reported in Table 2. Reported p-values are one-tailed for coefficients subject to a model constraint (coefficients with a "1" in the column titled "Coefficient subject to model constraint"); otherwise, a two-tailed p-value is reported. TABLE S1. Coefficient estimates and associated statistics for the total nitrogen (TN) models. See the above discussion for a description of each field presented in the table. Region abbreviations are E for East, NW for Northwest, and SW for Southwest (see Figure 1).  a Missing values indicate the corresponding coefficient has a binding physical constraint. b All source and aquatic removal rate coefficients are subject to the physical constraint that the coefficient be non-negative. c Value is 1 if the regional fixed-effects model coefficient estimate is equal to the physical constraint of zero. d Value is 1 if the hybrid model coefficient estimate is equal to the physical constraint of zero. e Value is 1 if the hybrid model coefficient is constrained to equal a weighted average of the hybrid model coefficients for the complementary regions.