Water Resources Research

Self-organizing linear output map (SOLO): An artificial neural network suitable for hydrologic modeling and analysis

Authors


Abstract

[1] Artificial neural networks (ANNs) can be useful in the prediction of hydrologic variables, such as streamflow, particularly when the underlying processes have complex nonlinear interrelationships. However, conventional ANN structures suffer from network training issues that significantly limit their widespread application. This paper presents a multivariate ANN procedure entitled self-organizing linear output map (SOLO), whose structure has been designed for rapid, precise, and inexpensive estimation of network structure/parameters and system outputs. More important, SOLO provides features that facilitate insight into the underlying processes, thereby extending its usefulness beyond forecast applications as a tool for scientific investigations. These characteristics are demonstrated using a classic rainfall-runoff forecasting problem. Various aspects of model performance are evaluated in comparison with other commonly used modeling approaches, including multilayer feedforward ANNs, linear time series modeling, and conceptual rainfall-runoff modeling.

1. Introduction

[2] Artificial neural network (ANN) methods have found increasing utility in a variety of hydrological applications [Maier and Dandy, 2000; ASCE Task Committee on the Application of Artificial Neural Networks in Hydrology, 2000]. Among the most widely used network structures are the Multilayer Feedforward Network (MFN), the recurrent neural network (RNN), and the radial basis function (RBF) network. In previous work [Hsu et al., 1995, 1997a, 1997b, 1999; Sorooshian et al., 2000], the applicability of ANN methods for hydrologic applications such as streamflow forecasting and estimation of spatial precipitation fields was investigated. Relevant to this paper, Hsu et al. [1995] showed that a 3-layer MFN (Figure 1a) provides excellent one step ahead predictions of streamflow, including both flood peaks and recessions. Hsu et al. [1997a] explored the RNN extension of the MFN structure (Figure 1b) which adds time-delayed feedback loops to simulate the “storage capacity” of a dynamical hydrologic system.

Figure 1.

(a) A three-layer feedforward neural network and (b) a three-layer recurrent neural network.

[3] However, the existence of multiple local optima and extensive regions of parameter insensitivity complicates the identification and training of MFN and RFN networks, which significantly limits their widespread application [Gupta et al., 1997]. Therefore it is important to resolve the network identification problem while maintaining high standards of network performance and accuracy. This paper presents a multivariate ANN procedure, entitled self-organizing linear output map (SOLO), whose structure has been designed for rapid, precise, and inexpensive estimation of network structure/parameters and system outputs, as well as estimates of their uncertainty. More important, SOLO provides additional insight into the underlying input-output processes, thereby extending its usefulness beyond forecast applications. The scope of this paper is organized as follows. The architecture of the SOLO model and an illustrative application to a streamflow prediction problem are described in sections 2 and 3, respectively. In addition, the performance of SOLO is evaluated in comparison with multilayer feedforward ANNs, a linear time series model, and a conceptual rainfall-runoff model. Section 4 discusses how the analysis of intermediate products generated by the network can facilitate insight into the underlying structure of the input-output process. Technical issues including identification of network size, stability of the parameter estimates, principal component analysis, the relationship to linear input-output modeling, and “overfitting” are discussed in section 5. Issues of model prediction uncertainty are presented in section 6.

2. SOLO Model

[4] The architecture of a SOLO network is listed in Figure 2. This network consists of three layers. The input layer is comprised of n0 neural units (one for each variable of the input vector), each connected to all units of the classification and mapping layers. The classification and mapping layers consist of n1 × n1 matrixes (Figure 2): one to classify the input information using a self-organizing feature map (SOFM) [Kohonen, 1989] and the other to map the inputs into the outputs using multivariate linear regression. The SOFM matrix functions as a “switchboard” to turn “on” or “off” the units of the regression matrix: i.e., the SOFM matrix classifies each input vector and determines to which unit in the regression matrix it must be routed for output prediction. The values of the network parameters (i.e., connection weights linking the units) are determined by the process of calibration (i.e., training).

Figure 2.

The architecture of a SOLO network.

[5] The mathematical description of SOLO is as follows. Let wji represent the connection strength (weight, parameter) linking the ith input variable (i = 1,…n0) to the jth SOFM unit (jn1 × n1), and let vji represent the connection strength linking the same ith input variable to the jth regression unit. Compute the Euclidian “distance” dj between the input vector, x = {xi, i = 1,… n0}, and the jth SOFM unit as follows:

equation image

Compute this distance to each SOFM unit and select the unit, c, which has the smallest distance, i.e., dc= min(dj), for all j. This identifies the input cluster to which the input vector xi belongs. All of the corresponding outputs of the units of the SOFM matrix will be “0”, except for the cth unit, for which the output will be “1”. This output is used to determine which unit of the regression matrix is to be used in the computation of the output value z from the input vector (x)

equation image

where ϕ means that the outputs other than the unit j = c are not selected or estimated. In this way, each unit of the regression matrix is represented by a linear input-output regression function that is associated with a single input cluster. The input-output function mapping is therefore accomplished by a set of n1 × n1 piecewise linear regression functions that covers the entire input domain.

[6] The training of the network connection weights wji and vji is achieved as follows. First, the weights, wji, of the SOFM matrix are trained using an iterative nonsupervised (self-organizing) procedure. In this procedure, the weights, wji, in the SOFM matrix are initialized to randomly selected values and are then adjusted so that the nodes which are in the neighborhood Λc of the node (c), determined to be closest to the current input vector, are moved towards the input vector using the iterative adjustment rule:

equation image

Here, m is the training iteration, Λc(m) defines the size of a neighborhood around the winner unit c, and η(m) is the learning rate (step size) at the iteration m. This procedure is applied several times to the entire data set of input vectors. As the training proceeds, the sizes of both Λc(m) and η(m) are progressively reduced, and the SOFM stabilizes to a form that approximates the distribution of the data in the input space.

[7] Next, the weights vji of the linear regression matrix are determined. Due to the self-organizing (clustering) procedure outlined above, each node of the regression matrix has become associated with a distinct region of the input space. Therefore the input-output data now associated with node j are used in the determination of the nodal regression parameters (vji) by solving the linear set of equation

equation image

where Z is a p × 1 vector with p output training data (runoff observations), X is a p × n0 matrix with p sets (rows) of training input vectors (xi)T, i = 1, …n0, θ is a n0 × 1 vector of regression parameters (weights) for unit j, θ = [vj1, vj2, …, vjn0]T, and ε is a p × 1 vector of estimation errors with zero mean and variance σe2. In general, because the number of equations (size, p, of the data set) exceeds the number of unknown regression parameters (n0), the “optimal” (unbiased) estimates of θ can be determined by minimizing the root mean square error of the output residuals, using the equation:

equation image

However, the solution of equation (5) is typically complicated by the presence of significant correlation among the input variables (xi), causing the matrix (X) to be colinear and the inverse matrix (XTX)−1 to be singular. This problem is avoided by applying a principal component transformation (C) (also called empirical orthogonal function) [Jolliffe, 1986; Tatsuoka and Lohnes, 1988; Peixoto and Oort, 1992] to the matrix X to obtain a matrix Y having independent (orthogonal) column vectors:

equation image

Here Y is the p × n0 matrix of principal components, and C is the p × n0 transformation matrix with eigenvectors derived from the covariance matrix of X: E(XTX) and CTC = CCT = I (Appendix A). Substituting equation (6) into equation (4) yields

equation image

where β = CTθ can be determined without difficulty, using the relevant analog to equation (5), because the variables in Y are independent (orthogonal). Selecting and including only the largest principal components of Y, those that explain the majority of the variance, can avoid the instability in estimates of the regression parameters. In SOLO, the number (m) of the largest principal components used in the regression is determined to ensure that the ratio equation image This reduces the dimension of the inverse matrices to be solved during regression, thereby simplifying and speeding up the network training.

3. Use of SOLO for Streamflow Prediction

[8] Applied ANNs in streamflow prediction are discussed in many recent publications. Smith and Eli [1995] trained MFNs to predict the peak discharge and peak time from synthetic data and hypothetical watersheds. In the study by Minns and Hall [1996], runoff was generated from a simple nonlinear model using synthetic storm events. The results show that the performance improved with more hidden layers in the network. Mason et al. [1996] applied RBF networks in runoff prediction and showed that RBFs are more efficient than slow back propagation learning strategy in the model calibration. Tokar and Johnson [1999] explored the network performance to the training data set and found that the model trained from both dry and wet records had a better prediction accuracy. Ahmad and Simonovic [2001] used MFNs to predict the amount and timing of peak flow, base flow, timing of rising and falling limbs, and shape of the runoff hydrograph. For more other relevant ANN applications, readers may refer to the summary papers from Maier and Dandy [2000] and the ASCE Task Committee on the Application of Artificial Neural Networks in Hydrology [2000].

[9] In this case study, the SOLO network was applied to the problem of streamflow prediction, and its performance was compared with that of four commonly used hydrological modeling approaches. The test data consisted of 36 years (1 October 1948 to 30 September 1983) of daily rainfall and streamflow data for the Leaf River basin (1949 km2) near Collins, Mississippi. With respect to the data record used in the model calibration, in the previous study, Yapo et al. [1996] suggested that approximately eight years of data are required to obtain calibrations that are relatively insensitive to the data period selected. To avoid models calibrated from an insufficient data record, the first 11 years of data were selected for model development and calibration, and the remaining 25 years were used for performance evaluation. The comparison is based only on the ability of each model to provide accurate one-day-ahead predictions of streamflow. Note, however, that each model has different input data requirements, as mentioned in the associated subsections.

3.1. ARX Model

[10] The ARX (n1, n2) time series model (autoregressive with exogenous inputs) is a lumped linear regression structure that has been used extensively for the prediction of streamflow using observed rainfall and runoff sequences [O'Connell and Clark, 1981; Wood, 1980]:

equation image

where ai and bj are parameters, and q(t) and r(t) are the observed streamflow and rainfall sequences, respectively. The time unit t is one day, and ε(t + 1) is the error of streamflow estimation. Three previous time steps of rainfall and streamflow observations are used as the inputs to the model (i.e., n1 = n2 = 2). For a fair comparison of various model performances, the same six input variables in the ARX, MFN, and SOLO models were also selected. The parameters of the ARX model (ai, bj) are determined by minimizing the root mean square error (RMSE) of the streamflow residuals computed over the training data set:

equation image

3.2. Multilayer Feedforward Network (MFN)

[11] The MFN model [see Hsu et al., 1995] is widely used to model nonlinear processes because the simple three-layer network MFN (n0, n1, n2) shown in Figure 1a has been mathematically proved [Hornik et al., 1990; Gallant and White, 1992] to be capable of mapping any kind of continuous nonlinear function (n0, n1, and n2 represent the number of units in the input, hidden, and output layers, respectively). For a comparison of different models, the same six input variables used in the ARX model are used as network inputs:

equation image

The output layer has a single unit representing the runoff prediction, and the hidden layer uses three units, based on tests exploring the use of different numbers of hidden nodes, with performance evaluated over both calibration and evaluation data [Hsu et al., 1995]. The optimal network structure for one-day-ahead prediction of streamflow equation image (t + 1) on the Leaf River basin was determined to be MFN(6, 3, 1), written as

equation image
equation image
equation image

where f(.) is the neural transfer function, yk is the output of unit k in the hidden layer, wjk is the connection weight between the unit k in the input layer and unit j in the hidden layer, vjk is the connection weight between the unit j in the output layer and unit k in the hidden layer, wj0 and v0 are bias weights. Estimates for the parameters (wjk and vk) were determined by minimizing the RMSE using the linear least squares simplex (LLSSIM) algorithm [Hsu et al., 1995].

3.3. Recurrent Neural Network (RNN)

[12] The RNN model (Figure 1b) is an extension of the MFN network to include internal temporal storage/memory processes. Its application to the problem of streamflow prediction has been discussed by Hsu et al. [1997a]. The RNN architecture requires that only the rainfall data at the current time step be used as external input to the network (i.e., (xi) = r(t), i = 1). The network architecture for streamflow prediction on the Leaf River basin is RNN(1, 4, 1), written as

equation image

where f(.) is the neural transfer function (equation 13), yj(t) is the output of the hidden unit j at time t, wj1 is the connection weight from the input unit to the hidden layer unit j, wrjk is the time-delayed recurrent connection weights from the hidden unit k to the hidden unit j, and wj0 is the bias weight. Model output is calculated by

equation image

where vj is the connection weight from hidden unit j to the output unit, vr is the time-delayed output recurrent connection weight, and v0 is the bias weight. As with the MFN, the network connection weights, {w, wr, v, vr}, were determined by minimizing the RMSE using the LLSSIM algorithm.

3.4. SAC-SMA Model

[13] The Sacramento Soil Moisture Accounting (SAC-SMA) model is a conceptual multistorage streamflow simulation model, developed and maintained by the U.S. National Weather Service [Burnash et al., 1973; Burnash, 1995]. The inputs to the model include the mean basin precipitation at the current time step, r(t), and the mean basin evapotranspiration. A description of the SAC-SMA model is given by Sorooshian et al. [1993]. The parameters of the SAC-SMA model were calibrated using the shuffled complex evolution (SCE-UA) algorithm, developed at the University of Arizona [Duan et al., 1992]. Discussions related to conceptual rainfall-runoff model identification and use are given by Sorooshian et al. [1993], Sorooshian and Gupta [1995], Gupta et al. [1998], and Boyle et al. [2000, 2001].

3.5. SOLO Model

[14] As described above, the SOLO model uses piecewise linear regression functions to predict the streamflow. To remain consistent with the ARX, MFN, and RNN models, the input vector (x) used here consisted of the three most recent precipitation and streamflow observations (see equation 10). The SOFM and regression matrixes were selected to consist of 15 × 15 nodes each. A discussion regarding the selection of the size of the SOFM and regression matrices is given later.

3.6. Results

[15] Summary statistics of the one-day-ahead streamflow prediction performance of the five models (ARX, MFN, RNN, SOLO, and SAC-SMA) are presented in Table 1. It shows that the SOLO model consistently provides the best values for the NSE (Nash-Sutcliffe coefficient of efficiency), RMSE, CORR (correlation), and BIAS (bias) statistics for both the 11-year calibration and 25-year evaluation periods. The MFN model provides the next best performance, having similar statistics to SOLO. The remaining three models (RNN, ARX, and SAC-SMA) provide significantly worse calibration and evaluation performance; in particular, the SAC-SMA model has noticeably larger residual bias (BIAS = 5.44).

Table 1. Comparison of Models Using 11-Year Calibration Data and 25-Year Evaluation Data
StatisticsCalibrationEvaluation
NSERMSECORRBIASTime, hoursNSERMSECORRBIAS
ARX0.91717.240.9610.1050.080.89420.530.949−0.036
MFN0.95113.570.9760.369460.92317.910.9620.588
RNN0.90419.310.954−0.682410.84225.810.924−0.593
SOLO0.95912.360.9800.0741.50.92917.010.9650.036
SAC-SMA0.91117.800.9592.2980.91519.430.9605.444

[16] The CPU time required for model calibration (on a SUN SPARC workstation) is a gross indication of the relative efficiencies of the different modeling procedures. As indicated in Table 1, the ARX model is the most efficient procedure, requiring less than five minutes to calibrate the model using 11 years of data. SOLO is next, requiring 1.5 hours, most of which were spent on the SOFM classification step. However, the SAC-SMA model required eight hours, and the MFN and RNN models each required over 40 hours of CPU time. While this is a somewhat crude basis for comparison, the results strongly support the conclusion that the SOLO network can provide a superior approximation of the complex nonlinear rainfall-runoff relationship in an efficient manner.

[17] A more detailed comparison of model performance is given in Figures 3 and 4. The daily RMSE values (cms) plotted against the volume of annual streamflow (cms) for each model for each year are shown in Figure 3. The solid squares represent calibration years, while the solid circles represent the evaluation years. For each model, it is clear that the error variance increases with “wetness” of the year (i.e., the annual RMSE increases with annual streamflow) in a somewhat linear fashion. An arbitrary line has been added to each graph from (0,0) to (70,40) to provide a basis for simple visual comparison. Notice that the SOLO model provides relatively smaller RMSEs over the full range of annual flows. The SAC-SMA model performs well on high-flow (wetter) years but not as well on low-flow (drier) years. The performance of the ARX model is similar to the SAC-SMA for low-flow years but is poorer for high-flow years. The RNN model provides the worst performance in high-flow years.

Figure 3.

The annual RMSE (cms) with respect to the annual streamflow (cms) of the testing models.

Figure 4.

The daily flow time series of testing models over highest validation flow year (1980).

[18] Plots of the hydrographs comparing one-day-ahead model predictions with streamflow observations are shown in Figure 4 for the wettest year (1980) of the evaluation period. For this year, the RMSEs are (in order of decreasing performance) SOLO = 25.39 cms, SAC-SMA = 30.44 cms, MFN = 32.31 cms, ARX = 34.36 cms, and RNN = 47.45 cms. The SOLO and MFN models show the closest tracking of the flows on all portions of the hydrograph. The SAC-SMA shows some difficulty in tracking the recessions. The ARX model shows undesirable (high-frequency) “spikes” during transitions between high- and low-flow periods and a significant tendency to underestimate the recessions.

4. Insights Into the Network Structure

[19] The SOFM layer of SOLO partitions the input space into a number of regions, each represented by a number of nodes, such that the nodal connection weights (wji) represent the cluster mean values for the associated subset of the data. A linear regression equation is then fit between the inputs and outputs for each region so that the result is an efficient, and arbitrarily accurate, piecewise linear approximation of the entire input-output domain.

[20] An interesting by-product of this approach is that an analysis of the properties of the SOFM layer can facilitate insight into the underlying structure of the input-output process. To illustrate this, we present an example in which the SOFM layer was constructed using only a 2 × 2 grid of nodes. The model was trained using the same 11-year Leaf River basin daily rainfall-runoff time series described in section 3. A 100-day portion of the observed rainfall hyetograph and observed and simulated streamflow hydrographs for water year 1961 is shown in Figure 5. As a result of the classification step, each of the four SOFM nodes is activated by a different characteristic pattern (let us call it a “mode”) of input (rainfall) behavior. Hence, a different linear regression equation (one of four) is utilized to provide one-step-ahead predictions of different portions of the streamflow hydrograph. In Figure 5, the streamflow predictions, q(t + 1), associated with each of these “modes” of input-output behavior are indicated by different symbols (squares, triangles, diamonds, and stars). Note that the automatic classification algorithm has identified four distinct modes of behavior: base flow recessions (indicated by squares), rising limbs (indicated by diamonds), peaks and quick recessions (indicated by triangles), and early portions of the rising limb having temporary reductions in rainfall intensity (indicated by stars). It is interesting to note that the first three of these are consistent with how a hydrologist might visually partition the hydrograph [e.g., Boyle et al., 2000, 2001], while the fourth represents a subtlety of behavior that might not normally draw the attention.

Figure 5.

Daily rainfall and observed and simulated streamflow from 2 × 2 units installed in the SOFM layer.

[21] Detailed results for the 15 × 15 node SOFM classification utilized in the model comparison study presented earlier are shown in Figure 6. Note that each SOFM node has six weights, corresponding to each of the six input variables (j = 1, …,6). To illustrate the distribution of weights across the SOFM matrix, an icon was constructed consisting of three vertical bars (representing the three rainfall inputs {r(t − 2), r(t − 1), r(t)} and three line-connected squares (representing the three streamflow inputs {q(t − 2), q(t − 1), q(t)}. The size of the vertical bar is proportional to the strength of the rainfall input contribution, and the relative vertical position of the squares represents the strength of streamflow contribution. Figure 6 visually illustrates the distribution of rainfall-runoff modes identified by the classification algorithm. For comparison, an interpolated contour plot of the distribution of average streamflow prediction q(t + 1) associated with the layout of SOFM nodes is given in Figure 7a. Analyses of the input-output relationships constructed (learned) by the SOLO procedure are presented in Figures 6 and 7a. For discussion, the five specific SOFM regions outlined loosely in Figure 6 by ellipses and connected by arrows have been identified. The association between these five classification regions and the rainfall-runoff process is also clearly illustrated in Figure 8. The regions are (1) base flow region (region I), (2) increasing rainfall region (region II), (3) peaking hydrograph region (region III), (4) quick recession region (region IV), and (5) slow recession region (region V).

Figure 6.

Classified characteristics in 15 × 15 SOFM units demonstrated by six normalized network connection weights.

Figure 7.

Contour maps of (a) averaged estimated streamflow and (b) estimated RMSEs over the 15 × 15 SOFM units.

Figure 8.

A relationship between five classification regions and the rainfall-runoff process.

[22] Region I is located in the central area of the SOFM matrix. The behavior is characterized by no-rain and low-level, almost unchanging streamflows during a 3-day period (Figure 6). The corresponding streamflow prediction is associated with a region of very small values, as shown in Figure 7a.

[23] Region II is located in the top right corner area of the SOFM matrix. Rainfall is steadily increasing during the 3-day period, but streamflows have only just begun to respond (Figure 6). The model predicts high streamflow levels during the next period, as given in Figure 7a. Thus this region identifies the initial stages of a rain storm and the associated rising limb of the hydrograph.

[24] Region III is located in the bottom right and middle portion of the SOMF matrix. The rainfall has peaked, but the streamflow continues to increase (Figure 6). A region of moderate to high streamflow predictions is displayed in Figure 7a. This region is therefore associated with prediction of the peak levels of the hydrograph.

[25] Region IV is located in the bottom left corner of the SOFM matrix. Rainfall intensities have reduced considerably, and the hydrograph has begun to recede. Streamflows, however, are still reasonably high. This region is associated with the early (quick) streamflow recession and determines the rate at which the streamflow will diminish.

[26] Region V is located in the left-middle portion of the SOFM matrix. There has been no rainfall during the past three days, and streamflow has continued to recede. The model predicts a progressively diminishing streamflow value. The remaining portion of the SOFM matrix, not discussed above, is clearly associated with events with small durations of rainfall leading to moderate streamflow responses.

[27] Another by-product of the classification scheme is the ability to analyze the level of prediction uncertainty associated with each SOFM node, and by extension, the uncertainty level associated with prediction of different portions of the hydrograph. An interpolated contour plot of the evaluation period root mean square error (RMSE) over the SOFM matrix is presented in Figure 7b. Consistent with the findings presented earlier (Figure 5), a comparison of Figures 7a and 7b indicates that the RMSE is directly related to the level of streamflow; for region I, the RMSEs are around 5 cm/d, increasing to 10–30 cms/d for regions II and III, and 30–40 cm/d for flood peaks (streamflows exceeding 100 cms/d). This information is used later (see section 6) to provide estimates of 95% confidence intervals on the predictions.

[28] Prediction accuracy provided by the SOLO model will depend on correct selection of the physical input variables containing information relevant to the generation of streamflow (output variable). Further, the training data set must contain sufficient variability that is representative of the underlying input-output process. While the skill and experience of the researcher will ultimately determine which input variables are selected for model development, the kinds of tools and analyses presented above can significantly facilitate the input selection and model construction process.

5. Technical Issues Related to Model Performance

[29] Some technical issues related to model performance are given in this section. These include the selection of the size of the SOLO network (i.e., the number of nodes in the SOFM matrix), the identification of stable estimates for the network weights via windowing and principal component analysis, and comments regarding “overfitting” to the data.

5.1. Selecting the Size of the SOFM Matrix

[30] In the example presented above, the selection of SOFM matrix sizes (2 × 2) and (15 × 15) was essentially arbitrary and for illustrative/exploratory purposes only. To determine parsimonious/optimal size for the Leaf River basin model, a series of experiments using progressively larger SOFM matrix sizes was conducted. The RMSE, correlation (CORR), and bias (BIAS) statistics, respectively, evaluated over the streamflow prediction residuals as a function of network matrix size are displayed in Figures 9a–9c. The results indicate only marginal improvements for networks of size exceeding 5 × 5. The statistics indicate a small tendency towards negative bias (underestimation) over both the calibration and evaluation periods. High correlation (0.96) between observed and one-day-ahead predicted streamflows over the 25-year evaluation period is obtained.

Figure 9.

Test of model performance from different size of SOFM units: (a) root mean square error, (b) correlation coefficient, and (c) bias estimates. Model performance of the SOLO model with respect to the percentage of the total data variation (explained by the principal components): (d) root mean square error, (e) correlation coefficient, and (f) bias estimates.

5.2. Windowing to Ensure Stability of the Model Parameters (Network Weights)

[31] The stability of the parameters estimated for the regression function associated with each SOFM node depends to a large extent on the number of data points available. For the Leaf River basin rainfall-runoff example presented above, 11 years of daily data correspond to a total of 4017 sample data points. The distribution of training data across the 15 × 15 SOFM matrix after the input classification step has been completed is shown in Table 2. Notice that the distribution is highly uneven, with most of the data points belonging to the central region (region I) associated with base flow recession and very small numbers of data points in regions associated with high streamflow values. For many of these nodes, the small number of available data points is insufficient to ensure stable estimates of the regression parameters (in this example, each node has seven regression parameters: one corresponding to each of the six input variables and one to allow for a bias adjustment term).

Table 2. Number of Data Classified in Each SOFM Unit: 15 × 15 SOFM Units
SOFM Unit Number123456789101112131415
1343124220313216
2514841554221223
3162565442252339
4218116291294233237
5152323551037374184
63424251276291164242802
72027431120019510124542
81631744399727142219177532
93441216197412142021
1024251744377752811101
11323234121131332723
123163123537323034
135111111265134231
14312313342254122
15504032335641223

[32] The solution implemented in SOLO is to extend a large enough window around each node to include data from the surrounding nodes in the fitting of the nodal regression equation. This amounts to a strategy of “borrowing” data from neighboring nodes having somewhat “similar” input-output characteristics. The method involves the selection of a minimum sample size threshold (herein selected empirically to be five times the number of parameters) and an extension of the window associated with that node to be just large enough to include enough data to satisfy that threshold. The size of the neighborhood Ω is defined as a matrix of (2n + 1 × 2n + 1), n = 0, 1… centered at the calculation unit. The neighborhood sizes determined for the Leaf River basin example are given in Table 3, and the resulting number of training data points thereby associated with each unit are presented in Table 4. Notice that the nodes with large numbers of data do not need to borrow data from the neighbors, while the neighborhood size for some of the nodes is as large as 3.

Table 3. Window Size of SOFM Unit for Including Sufficient Number of Data in Regression Analysis
SOFM Unit Number123456789101112131415
1322222222233223
2222212222222222
3221111112222222
4221111111222111
5221111011111112
6222221101101112
7221111100011112
8221100100111222
9221111111111122
10221111121011112
11222111221111112
12222222221111122
13222222221111222
14322222222111223
15333333222222233
Table 4. Total Number of Data Included in Finding Regression Parameters According to the Window Size of Each SOFM Unit Listed in Table 2
SOFM Unit Number123456789101112131415
1654355565658504543387782453752
24065831113612511010173594847635243
34876465475727449151897286998569
4477843457012213110343284219203455344
5417140406512655210151126124138896894
64284142212295973757663241564141946789
7428545851251291082200195101249156805180
842875510944391019727142217821774623314176
9418457120173159816218422811573180796813158
1047863864827548240412977195102943754
114467105424139118203118158210131964150
123953767477798416544931221065612352
134049585157576589367575641488137
144939453744455468114484436913781
15394553636174404864666256499137

[33] The use of data/node windowing has a very interesting implication. Consider an extreme case where the window size for each node is made so large that it includes the data from all of the other nodes. In this case, the procedure would identify an identical linear input-output regression model for all nodes, so that the effective output of SOLO would be identical to that of the linear time series ARX model. If the window size were then to be gradually reduced, the individual regression models at each node would begin to differ, and the overall input-output relationship would progressively adjust to match the underlying nonlinearity of the system. In the extreme case of no-windowing, the regression model at each node can be quite different (if so indicated by the data). The SOLO procedure therefore implements a sensible identification strategy in which (1) the availability of very small amounts of data would lead to identification of a linear input-output model, (2) the availability of large amounts of data would support the detection of system nonlinearity, and (3) the availability of data that are distributed in a nonuniform fashion over different modes of system behavior would lead to a smoothing (windowing) effect which ensures a balanced trade-off between system stability and system nonlinearity.

5.3. Regression Using Principal Component Analysis

[34] Another strategy used in SOLO to ensure stable estimates for the network weights is the use of principal component analysis. A complication inherent in the selection of relevant input variables is the tendency for informative input variables to be correlated to some degree. In the rainfall-runoff example presented earlier, the three previous rainfall values and three previous streamflow values are necessarily correlated in time. The SOLO procedure utilizes principal component (PC) transformations of the input space associated with each node before construction of the nodal input-output regression equation. This ensures a well-posed regression problem and also reduces the number of regression variables (and hence model parameters). The variation of the performance of the 15 × 15 Leaf River basin SOLO model with respect to the percentage of total data variation (explained by the principal components) detailed in Figures 9d–9f. By selecting the number of PCs to account for 95% of the total variance (V = 95%), the evaluation period RMSE reaches 21 cms/d, the CORR is 0.94, and the BIAS is −0.02 cms. Notice that the performance degrades sharply for V less than 80%. For the 15 × 15 Leaf River basin SOLO model, the numbers of principal components associated with different selections for V are given in Figure 10. With V chosen to be 99.9%, 196 of the 225 nodes utilize all six of the principal components (PCs), while 29 of the nodes use five PCs. In contrast, when V is reduced to 90%, only five nodes use five PCs, 40 nodes use four PCs, 170 nodes use three PCs, and 10 units use two PCs. To balance model efficiency with performance, the value, V = 95%, is suggested.

Figure 10.

Distribution of the number of principal component variables and the selected total data variance used (V%); the Leaf River SOLO model has six input variables and 15 × 15 SOFM nodes.

5.4. Overfitting

[35] Overfitting is a term used when the model fits the calibration data very well, but the performance degrades significantly over independent evaluation periods. To avoid this phenomenon, it is common to reference both the calibration and evaluation data during model development and training. An interesting by-product of the procedures described above for selection of network size and for stabilizing the parameter estimates is that these procedures seem to counter the tendency towards “overfitting” of the regression functions. For network sizes smaller than n1 = 5, the network performance tends to oscillate, but for n1 > 5, the performance remains consistent (see Figures 9a–9c). In fact, the performance of the 15 × 15 network is no worse than that of the 5 × 5 network during both calibration and evaluation, suggesting that the tendency towards overfitting has been avoided.

6. Model Prediction Uncertainty

[36] An important extension of the piecewise linear methodology used by SOLO is the ability to use the classical regression theory to provide robust estimates of the model output uncertainty. If the principal component data for an SOFM node are represented by y and the model prediction of streamflow by equation image, then equation image, where equation image is the vector of linear regression parameters [see equation (7)]. Accordingly, the expected value of the estimate equation image is yequation image and the variance of equation image is σ2γ with γ = yT(YTY)−1y, if the error in z comes from a normal distribution. Based on this, the upper and lower bounds (Uα, Lα) of the model output predictions corresponding to a 100(1 − α) confidence range can be derived as

equation image

where t1−α/2,np is a t distribution with np degrees of freedom, n is the size of data, and p is the rank of Y [Haan, 1977].

[37] A 100-day portion of the observed and predicted streamflow hydrographs for water year 1980 (an evaluation year) is given in Figure 11. The plot also shows the 66% and 95% upper and lower confidence bounds associated with the one-step-ahead streamflow predictions. Note that the piecewise linear SOLO model uses a total of 225 linear regression functions, each associated with a slightly different characteristic input-output behavior. The prediction uncertainty, as shown in Figure 11, is relatively small for the low and medium range of streamflows.

Figure 11.

100 days of observed and predicted streamflow data for water year 1980 (evaluation year); predicted streamflow confidence bounds (66% and 95%) included.

[38] It has been suggested in the literature [Sorooshian and Dracup, 1980; Sorooshian et al., 1983] that the errors associated with the rainfall-runoff process are not homogenous (i.e., have nonconstant variance). Sorooshian and Dracup [1980] commented that large flows tend to have larger error variance compared to smaller flows, partially because of the nonlinear nature of the rating curve used to transform stage measurements to flow volume estimates. The information depicted in Figures 7a, 7b, and 12 support this view. Figures 7a and 7b show that the size of residual RMSE increases with flow value and the distributions of the calibration data residuals (on the range [-20, 20] cms/d) for each of the 225 SOFM nodes, as presented in Figure 12. The variances are small in the central region I (low flows) and higher in the regions associated with precipitation variability and larger flows. The assumption of normally distributed residual error is reasonably good for node location (14,4), as displayed in Figure 13. However, the assumption does not hold up as well for other nodes. Further study of the residual distribution and its relation to the construction of accurate confidence bounds is ongoing and will be reported in due course.

Figure 12.

Distribution of calibration data residuals (on the range of [−20 20]) at 15 × 15 SOFM nodes.

Figure 13.

(a) Distribution of error residuals for the SOFM node (14,4) and (b) normal distribution plot of error residuals for the SOFM node (14,4).

7. Summary and Discussion

[39] In previous work [Hsu et al., 1995, 1997a, 1997b, 1999; Sorooshian et al., 2000], the applicability of ANN methods for hydrologic applications such as streamflow forecasting and estimation of spatial precipitation fields was investigated. The work presented here was motivated by network identification difficulties associated with classical ANN approaches that reduce performance accuracy and limit widespread application. A novel artificial neural network structure (SOLO) suitable for a wide variety of hydrologic (and nonhydrologic) applications was presented and illustrated using a case study application to streamflow forecasting. The similarities and differences between SOLO and several commonly used streamflow-forecasting approaches were discussed. The case study illustrates the relative superiority of the SOLO procedure and its ability to provide rapid, precise, and inexpensive estimation of network structure/parameters and system outputs. Equally important to scientists are the characteristics of SOLO that facilitate insight into the underlying physical/functional processes, thereby extending SOLO's usefulness beyond applied forecast applications.

[40] The power of the SOLO approach comes from a judicious merging of classical linear regression theory with self-organizing (data clustering) ideas developed by ANN researchers. The result is a function-mapping algorithm based on nodal piecewise linear principal component regression. Of course, the prediction capability of the resulting model is critically dependent on judicious selection of informative input variables. The structure of SOLO facilitates detailed analyses of the explanatory power of the current input selection separately over different portions of the input-output process. The value of this capability will be reported in the future.

[41] Our experiences with the SOLO architecture lead us to suggest that further study of its capabilities and usefulness is warranted. In related work, the use of the SOLO architecture for estimating regional- and global-scale precipitation fields by combining satellite-based remotely sensed data with atmospheric model outputs and surface measurements is being explored. One feature of SOLO which has been explored in that context (and has not been discussed here) is that the piecewise linear mapping can be readily combined with a recursive parameter identification procedure to facilitate (1) progressive model identification from new data as they become available and (2) recursive model updating to track temporal changes in the behavior of the underlying system. Furthermore, extensions of the nodal principal component regression algorithm from ARX to other regression structures such as ARMAX (autoregressive moving average with exogenous inputs) are straightforward to implement. Of particular interest to researchers, however, will be further investigation to determine what information about model system behavior can be revealed (visualized) by creative analyses of the properties of the SOFM matrix.

[42] As always, constructive dialog and collaboration with researchers interested in these methods is invited. The code for the SOLO algorithm can be obtained by request from the first author (hsu@sahra.arizona.edu).

Appendix A.

A1. Principal Component Transformation

[43] Consider a standardized data matrix, X, which has p rows (observations) and n0 columns (variables). Let the covariance matrix of standardized input variables be Σ, where Σ = Cov(X) = E(XTX). The linear transformed orthogonal matrix Y is presented as:

equation image

where Y is the principal components with element (i, j) of ith observation and jth principal component, and C is a (n0 × n0) matrix with eigenvector elements of the covariance matrix of X: Σ = Cov( X) = E( XTX), and CTC = CCT = I.

[44] Because the transformed components are uncorrelated to each other, the covariance matrix of principal components is listed below:

equation image

[45] The solution of the PCA provides a set of orthogonal-based eigenvectors, C, with their eigenvalues, λi, representing the variance of each component after PCA transformation. The total variance of the data matrix is represented as:

equation image

[46] This shows that the total variance of the data matrix is identical to the total variance after PCA transformation. The orthogonal PCA coordinates are selected according to the first component having the largest variance of the data matrix, X, whereas the other principal components are ranked from large variance to smaller variance, i.e., λ1 ≥ λ2 ≥ … ≥ λn0. To preserve most of the data variance after transformation, one could select the first few principal components with the coverage of most variances in the original data matrix. The percentage of total variance explained by the first mth component is:

equation image

[47] The higher the selection of the total data variance, V, the better the properties of the data matrix are preserved. A small number of principal components are selected, but still retain most of the data variance in the selected components, if the reduction of variables is considered. If the transformation is to prevent the colinearity of regression variables, the selected component number m in equation (A4) can be set for a higher total variance, such as V = 95% ∼ 99%. This higher total variance is mainly to avoid very small λk, k = m + 1n0, which may cause very high variance in the estimated regression parameters.

A.2. Principal Component Regression

[48] A multivariate linear regression model having p observations and n0 independent variables is given below:

equation image

where Z is a vector of p observations (p × 1), X is p × n0 matrix with element (i, j) of ith observation and jth independent variable, θ is a vector of regression coefficients, θ = [v1, v2, …, vn0]T, and ε is a vector of estimation error (p × 1) with zero mean and variance σe2. Parameters are estimated from minimizing the root mean square error of sample data and are given below

equation image

where equation image is the unbiased estimates of regression parameters. The above equation estimates the unbiased regression parameters that minimized the root mean square error. When the input variables are colinear, the inverse matrix of (XTX)−1 becomes singular, which makes finding regression parameters difficult. To reduce the uncertainty of the regression estimates, principal component transformation of input variables into uncorrelated variables before regression analyses is useful for finding more reliable regression parameters.

[49] Substituting the input variable of the linear regression function, as shown in equation (A5), with the transformed principal component variables from equation (A1), we obtain

equation image

where β = CTθ are the regression parameters of principal components.

[50] Parameters of principal component regression are estimated as

equation image

When multicolinearities exist among original input variables, the regression parameters show high variance to those variables that are colinear to others. The regression parameters of the original variable, θ, are given below

equation image

where Λ is a diagonal matrix with kth largest eigenvalue, λk, on kth diagonal element. The ek is the eigenvector of the principal component with kth largest eigenvalue. Assume that observations are uncorrelated and have a constant variance of σ2 for each observation zi. The covariance matrix of equation image is given below

equation image

If multicolinearity appears in the original variables, X, it will reveal that the eigenvalues are very small in the later principal components. The variances of the regression parameters become very large from the value of λk−1 term in the above equation. To avoid large variance on the regression parameters, those small eigenvalue terms in the calculation are removed. The new regression parameters are then expressed as

equation image

where λk, k=m+1, m+2,…,n0, are with very small eigenvalues being removed. The covariance of new regression parameters is reduced, and the covariance matrix of those regression parameters becomes:

equation image

Because none of the above eigenvalues are small numbers, the variances of the estimated regression parameters are not that high. We have:

equation image

If the above term is nonzero, omitting this term would result in a biased estimate. However, the advantage from the reduction of parameter variance is substantial under multicolinear circumstances.

Acknowledgments

[51] Partial support for this research was provided by SAHRA (NSF STC for “Sustainability of semi-Arid Hydrology and Riparian Areas”) (grant EAR-9876800), NASA-EOS, TRMM, and HyDIS research programs (grants NA56GPO185, NAG5-7716, and NAG-8503), and the National Weather Service (grant NA47WG0408). Corrie Thies provided careful proofing and editing of the manuscript.

Ancillary