Nonparametric direct mapping of rainfall-runoff relationships: An alternative approach to data analysis and modeling?



[1] We present a new approach for the analysis and modeling of catchment rainfall-runoff relationships that uses as predictor variables input history summary variables only. The latter are defined as linear combinations of inputs at a given number of previous time steps. This transforms the dynamic identification problem into a static one. As the identification algorithm we use regression trees, which act as a nonlinear nonparametric model. The original algorithm is adapted to account for serial correlation in variables. The new method is applied to two subcatchments of the U.S. Department of Agriculture Forest Service Andrews Experimental Forest Watershed (Oregon, United States). Simple and interpretable tree models explain more than 80% of the initial deviance of the observations in both calibration and validation. This suggests that the selected variables have a good predictive power and that further modeling attempts using them are warranted. The models show a distinct pattern of the selected explanatory variables. Applications of the method include data quality control, comparative analysis, assessment of hydrological change, and multicriterion evaluation of parametric hydrological models.

1. Introduction

1.1. A Critique of Parametric Rainfall-Runoff Modeling

[2] Catchments are nonlinear dynamic hydrological systems. A natural and long-established framework for modeling their rainfall-runoff relationships is the input-state-output parametric approach, i.e., the internal fluxes and the output are modeled through parametric relationships involving a set of state variables. The recourse to input and output data for parameter estimation appears as unavoidable for presently available models [Beven, 2000, 2002]. The data and model errors play a central role in calibration, as, explicitly or implicitly, inference is based on the nature of the distribution of modeling errors. There are good reasons to believe that hydrological modeling errors are both heteroscedastic and serially correlated [Sorooshian and Dracup, 1980]. Also, the presence of outliers (large errors in data) is likely. Despite this, the most widely used performance measures, such as the Nash-Sutcliffe criterion, implicitly assume a Gaussian independent and identically distributed (i.i.d.) additive error model.

[3] A critical issue, often raised in the hydrological modeling literature, is that of over-parameterization or over-fit [e.g., Jakeman and Hornberger, 1993; Hornberger et al., 1985; Young, 2001]. This essentially means that the flexibility or the capacity of the model structure is too high with respect to the information content of the available data. A related issue is that of interference. When global parametric relationships are used to model the mappings from the space of predictor variables to that of response variables, introducing the information content for a new evaluation data point may cause a change in the estimated model parameters; that is, it has a global effect [e.g., Schaal, 1994]. Estimating parameters using data from one region of the data space thus has a potential effect on how well the model will fit the observations in other regions. This leads to potential trade-offs between over-fitting parts of the mapping, especially those that contribute most to the performance measure, and under-fitting (introducing bias) in other parts. Interference has received less attention in the hydrological literature, although recent research [e.g., Gupta et al., 1998; Wagener et al., 2003] has questioned the use of a “global” performance measure on the grounds that it entails a loss of information through the aggregation of model residuals, i.e., evaluating the model globally on all the regions of the input space. Generally, global parametric methods are successful when the investigated structures of the model and the errors are sufficiently close to the “true” ones, but there is no guarantee that this will be the case in applications of rainfall-runoff models to real data sets (see discussion by Beven and Young [2003]).

[4] Accumulated experience in the use of parametric models suggests that only a model structure with fewer than 10 parameters can be supported by rainfall-runoff data identified using a single objective function for the prediction of runoff, with a suggestion that most of the information is present in 1–3 years of data (“wet,” “dry,” “average”) [e.g., Kirkby, 1975; Jakeman and Hornberger, 1993; Young, 2001], although further research [e.g., Yapo et al., 1996] has suggested that longer records are needed for conceptual rainfall-runoff models having larger parameterizations. The use of simpler model structures, which attempt to retrieve the “dominant modes” of behavior at the catchment scale, contrasts sharply with the perceived complexity of the physical characteristics and response of even “homogeneous” small basins or hill slopes when viewed from a bottom-up or process-based modeling viewpoint. However, it has proven very difficult to take advantage of the prior information on processes and physical characteristics that should be expected from a process-based approach, albeit for good reasons [Beven, 2000, 2002].

[5] It has also proven difficult to take advantage of good quality and extensive data sets from experimental catchments in refining model description. The availability of such data sets has not resulted, as might have been expected, in better predictive ability, in particular in allowing increased complexity in modeling with respect to shorter/poorer quality data sets. This is certainly due to continuing deficiencies with the current generation of model structures.

[6] The question is still open whether (1) the hydrological response at the catchment scale is intrinsically “simple,” (2) there is a fundamental limitation in the available data (quantity and quality) that allows only for the specification of simple (“dominant mode”) models, or (3) a simple parametric approach has significant limitations in consistently using the available information across the full range of catchment responses due to different event characteristics and spatial patterns of antecedent conditions. The answer to this question has far-reaching theoretical and practical implications. In our view, the hypothesis that our actual knowledge of system characteristics, especially model structure and errors, is too weak to fully exploit the information available in the data within this paradigm deserves further consideration.

1.2. An Alternative Nonparametric Input-Output Approach

[7] In this contribution we explore a new data-driven approach to modeling the rainfall-runoff relationship that differs radically from that outlined above by two characteristics. First, we hypothesize (reasonably) that the response of a catchment should be determined by its trajectory in the space of historical input forcing variables. As a first approximation, we use as predictor variables only linear combinations of inputs at previous time steps. That is, we do not use state variables and thus are able to circumvent the issues that can arise from their recurrent calculation in a nonlinear model [Sjöberg et al., 1995; Kavetski et al., 2003]. Second, we use as the identification algorithm regression trees [Breiman et al., 1984], a nonparametric identification method. Local and nonparametric methods avoid the need to find a potentially complicated parametric function capable of representing all the data by dividing the input space into many fixed or adaptive partitions and by modeling them locally with much simpler functions, based directly on the data [Schaal, 1994]. The partitions will reflect the observed responses under different sets of hydrologically similar conditions. These methods also allow the quality of a model to be evaluated locally, which is a significant advantage when the global error structure is poorly known and potentially far from the Gaussian i.i.d. assumption. The regression tree needs to be trained on the observed data, like a neural network, but avoids the introduction of weighting coefficients or other parameter values that require explicit calibration. However, these advantages come at a price, in particular the need for large data sets in order to identify complex relationships, especially in high-dimensional input spaces.

[8] Local and nonparametric inductive methods seem to have received little attention in the hydrological literature. Nevertheless, they have the potential to better exploit the long, good quality data sets that are available. In particular, they could give better insight into the measurement error structure in that they avoid aggregating all residuals and reduce the potential for a model structural component of the prediction error due to an a priori choice of conceptual model components. There is no requirement to specify a priori definitions of state variables and parametric relationships in making predictions, and therefore less scope for the model structure to be wrong. However, there will be new sets of conditions, not well represented in the calibration data set, that will require extrapolation within the space defined by the identified regression tree.

2. Methods

2.1. Predictor Variables

[9] The usual choice in the modeling of dynamical systems is to use as predictor variables inputs as well as past simulated outputs and/or past simulated values of other “state” variables calculated within the model. In this research we adopt an alternative choice that has not yet been explored in depth in hydrological modeling that consists of taking only inputs at a given number of previous time steps. This approach uses the fact that input history uniquely determines outputs and effectively transforms the identification problem into a “static” one [Sjöberg et al., 1995]. Its main advantages are that the resulting model does not have to cope with the issues related to recurrent calculations of the hypothesized state variables and that a large range of methods developed for static problems (that is, regressions) can be applied. The main disadvantage is the high dimensionality of the input space, as the chosen “memory” window has to cover the whole, or at least a major part, of the dynamic response time. The memory of hydrologic systems may span several orders of magnitude, from, say, a few minutes for a parking lot to several decades or more for a large regional aquifer. While system memory and the time discretization necessary to describe the response show a positive correlation, this may still result in a huge number of variables leading to the so-called “curse of dimensionality” [e.g., Friedman, 1994] issue. A variant of this approach is to form other predictors from the inputs, for example, by taking linear combinations of them. This allows fewer effective predictor variables to be used. A “natural” choice for hydrological systems is the sum of inputs over given time intervals. Another simple option is to capture the decaying influence of past inputs, for example, summation with an exponential function of time. 
Note that the latter is essentially similar to the well-known antecedent precipitation index (API) (though it is not necessary in this approach to restrict the choice of input variables to a single choice of API function). These choices are based on the assumption that inputs to hydrological systems are (loosely) additive. In this research we used five types of variables, which are presented in Table 1. The integration time for types I, II, and III and the exponential decay constant for types IV and V, both denoted by τ [h], can be interpreted as dynamic time characteristics, and m represents the considered memory window.

Table 1. Summary of Predictor Variables

| Type | Description | Mathematical Formulation |
|---|---|---|
| I | cumulative precipitation over the previous τ = kΔt time interval | $\sum_{i=0}^{k} P(t - i\Delta t)$, $k = 0, 1, \ldots, m$ |
| II | cumulative potential evapotranspiration over the previous τ = kΔt time interval | $\sum_{i=0}^{k} PET(t - i\Delta t)$, $k = 0, 1, \ldots, m$ |
| III | cumulative precipitation minus potential evapotranspiration over the previous τ = kΔt time interval | $\sum_{i=0}^{k} \left(P(t - i\Delta t) - PET(t - i\Delta t)\right)$, $k = 0, 1, \ldots, m$ |
| IV | sum of exponentially decaying (decay constant τ) precipitation (API) over the available record (r) | $\sum_{i=0}^{r} \exp(-i\Delta t/\tau)\, P(t - i\Delta t)$, $\tau \in [0, m\Delta t/2]$ |
| V | sum of exponentially decaying (decay constant τ) precipitation minus potential evapotranspiration over the available record (r) | $\sum_{i=0}^{r} \exp(-i\Delta t/\tau) \left(P(t - i\Delta t) - PET(t - i\Delta t)\right)$, $\tau \in [0, m\Delta t/2]$ |
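The two families of predictors in Table 1 can be illustrated with a minimal sketch. The function names, the inclusive summation bounds (current step plus the k previous ones), and the default Δt = 0.25 hour (matching 15-min data) are assumptions made for illustration, not details taken from the original implementation.

```python
import math


def cumulative_sum(series, t, k):
    """Cumulative predictor (types I-III): sum of `series` over the current
    step and the k previous steps, i.e. indices t-k .. t (assumed bounds)."""
    start = max(0, t - k)
    return sum(series[start:t + 1])


def exp_decay_sum(series, t, tau, dt=0.25):
    """Exponentially decaying predictor (types IV-V): weighted sum over the
    whole record available up to step t, with decay constant tau in hours."""
    return sum(math.exp(-i * dt / tau) * series[t - i] for i in range(t + 1))


def exp_decay_series(series, tau, dt=0.25):
    """Same quantity for every step, via the recursion A_t = x_t + w * A_{t-1}
    with w = exp(-dt / tau); avoids re-summing the record at each step."""
    w = math.exp(-dt / tau)
    out, acc = [], 0.0
    for x in series:
        acc = x + w * acc
        out.append(acc)
    return out
```

For types III and V the same functions would be applied to the series of precipitation minus potential evapotranspiration. The recursive form is convenient for long 15-min records because each new value costs a single multiplication and addition.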

2.2. Classification and Regression Trees

[10] Classification and regression trees (CART) provide nonparametric predictive inferences based on recursive binary partitioning of the data set. The space of predictor variables (regressors) is partitioned in a set of disjoint subspaces. The branches of the tree represent a hierarchical nested structure for the subspace partitions. At the end of each branch is a terminal node (or leaf), representing the finest subspace partition. The boundaries of these subspaces are planes (hyperplanes) in the space (hyperspace) of the predictor variables. Each subspace node should represent the expected responses (in this case discharges) under different sets of hydrological conditions as defined by the input variables (here different rainfall and other climatic forcing variables). A single valued predictor (or by extension a predicted nonparametric distribution) for the regressed variable is associated with each of these subspaces, based directly on the observed values during the training period. CART methods were made popular by the seminal monograph of Breiman et al. [1984]. The most significant application of CART in the hydrological literature to date is probably the exploration of parameter spaces by Spear et al. [1994].

[11] For building the regression tree we use the iterative growing and pruning method of Gelfand et al. [1991]. The latter consists of splitting the training data in two subsets having equal length and (roughly) similar distribution of the response variable (discharges in our case). A large tree is grown based on one of the subsets and pruned on the other (methods for growing and pruning are given below). Then a large tree is grown from the terminal nodes of the pruned tree based on the second training subset. Then a pruned tree is selected by using the first data subset. The procedure is then iterated, successively interchanging the roles of the two subsets. The iterative process is terminated when the pruned trees after two successive iterations are identical. Finally, the prediction is made using the entire training set with this final tree (defined by the variables and thresholds selected at each branching). Growing involves a procedure to systematically explore a tree in search of nodes to split. We use a breadth-first algorithm, meaning that all the nodes at a given depth (i.e., resulting from the same number of splits) are explored before going to the next depth. The search proceeds in a top-down manner starting from the root node (i.e., the entire training subset). A node is declared terminal if it has identical values either for all the predictor variables or for the response variable, or if it contains fewer than a prespecified number of samples, the usual default value for regression being 6. The same prediction value is associated with all samples in a node. All the results reported in this paper were obtained using the average of the training samples in a given node (partition) as the predictor. 
Once a nonterminal node has been found, the algorithm performs a search on all predictor variables and all possible splitting values (i.e., distinct values of the predictor variable present in the training data set) in order to identify the one that minimizes an error criterion to maximize the discriminatory power of the split. The choice for the error criterion and that of the prediction rule are interrelated. Usually the least squares error (LS) criterion is associated with taking the subset average as the predicted value. This is implemented here in terms of minimizing the sum of the deviances of the descendent partitions, where deviance is defined as the sum of squares of the differences between the actual values of the response variable in a partition and the average value in that partition. An alternative to LS trees is that based on the least absolute deviation criterion and the median as prediction (LAD trees). The latter are known to be more robust with respect to outliers in data [Breiman et al., 1984].
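The exhaustive split search with the least squares criterion described above can be sketched as follows. This is a deliberately naive illustration: the data layout (a list of rows), the function names, and the `min_samples` argument are assumptions, and the diversity term introduced in the next paragraph is not included.

```python
def deviance(values):
    """Sum of squared differences between the values and their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y, min_samples=1):
    """Search every predictor (column of X) and every distinct value present
    in the data; return (column, threshold, summed child deviance) for the
    split that minimizes the deviance of the two child partitions.
    Samples with x < threshold go left, the others go right."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted(set(row[j] for row in X)):
            left = [yi for row, yi in zip(X, y) if row[j] < threshold]
            right = [yi for row, yi in zip(X, y) if row[j] >= threshold]
            if len(left) < min_samples or len(right) < min_samples:
                continue  # skip splits that leave a child too small
            score = deviance(left) + deviance(right)
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best
```

The cost of this version grows quadratically with the number of samples for each predictor; a practical implementation for 15-min records would sort each predictor once and update the child sums incrementally.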

[12] However, we did use an ad hoc modification of the splitting criterion to better account for autocorrelations in variables (and errors). We define below a composite criterion that considers not only the deviance but also the “diversity” of the descendent partitions so as to enhance the possibility that each partition should be composed of data coming from as many different parts of the record as possible (richness) and that the sample distribution among these parts should be even (regularity). The rationale is to facilitate the identification of patterns that repeat themselves in time. The record is divided into l consecutive periods having equal duration. Then, using Shannon's entropy formula, the diversity of a partition could be expressed as

$$H = -\sum_{j=1}^{l} \frac{n_j}{n} \log \frac{n_j}{n} \qquad (1)$$

where H is the diversity index, n is the total number of samples in the partition, and n_j is the number of samples from period j. The index H has a minimum value equal to zero when all of the samples belong to a single time period and attains its maximum when samples are evenly distributed among all time periods. The composite criterion that is to be minimized at each partitioning split is then defined as

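[Equation (2): composite splitting criterion, combining through the weight α the deviances D^P, D^L, and D^R with the diversity of the child partitions]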

where α is a weighting parameter that takes values between zero and unity, D is the deviance of the response variable in a given partition, and the superscripts P, L, and R denote the parent, left, and right child partitions, respectively.
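The diversity index (Shannon's entropy over the time periods from which the samples of a partition originate) can be computed with a short sketch. The use of the natural logarithm and the representation of each sample by an integer period label are assumptions; the composite criterion itself is not reproduced here.

```python
import math
from collections import Counter


def diversity(period_labels):
    """Shannon diversity of a partition. `period_labels[i]` is the index of
    the time period that sample i comes from (hypothetical encoding)."""
    n = len(period_labels)
    counts = Counter(period_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def period_of(sample_index, samples_per_period):
    """Map a position in the record to the index of its period."""
    return sample_index // samples_per_period
```

With this definition a partition drawn from a single period has zero diversity, while a partition spread evenly over all l periods reaches the maximum, log(l).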

[13] Once a split is selected, two child nodes are created. The samples whose value of the selected predictor variable is less than the selected split value are assigned to the left node, and the others are assigned to the right node. A growing phase is terminated when all the undivided nodes are declared terminal.

[14] Pruning represents the mechanism by which the complexity of the regression tree is controlled in order to avoid over-fitting and is therefore a critical phase in building regression trees. Pruning a tree essentially means replacing some of the subtrees rooted in its internal nodes with terminal nodes (leaves), i.e., reducing its complexity. A brief description of the adopted pruning method, also called reduced error pruning [Esposito et al., 1997], is given in what follows. The criterion to prune is based on the prediction error on an independent data set. The prediction error of a node is calculated as the sum of the squared differences between the predicted and the observed values of the response variable. A similar breadth-first algorithm is used in order to explore the tree, the difference with respect to the growing phase being that the tree is explored in a bottom-up manner starting with the terminal nodes at the maximal depth. During this exploration the prediction error of the internal nodes of the tree is compared with the sum of the prediction errors of the terminal nodes of the subtree rooted in that node. The subtree is pruned when the latter is higher than the former. The process is terminated when the root node is reached.
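A minimal sketch of reduced error pruning is given below. It uses a recursive post-order traversal rather than the breadth-first bottom-up exploration described above; both visit the descendants of a node before the node itself. The dictionary-based tree layout and the function names are assumptions made for illustration.

```python
def sse(pred, ys):
    """Prediction error: sum of squared differences from a constant prediction."""
    return sum((v - pred) ** 2 for v in ys)


def prune(node, X, y):
    """Prune `node` in place using the independent samples (X, y) that reach
    it. A node is a dict with keys "pred" and "split" (None for a leaf, else
    (column, threshold)), plus "left" and "right" for internal nodes.
    Returns the prediction error of the (possibly pruned) subtree."""
    leaf_error = sse(node["pred"], y)
    if node.get("split") is None:
        return leaf_error
    j, thr = node["split"]
    left = [(r, v) for r, v in zip(X, y) if r[j] < thr]
    right = [(r, v) for r, v in zip(X, y) if r[j] >= thr]
    subtree_error = (
        prune(node["left"], [r for r, _ in left], [v for _, v in left])
        + prune(node["right"], [r for r, _ in right], [v for _, v in right])
    )
    if subtree_error > leaf_error:
        # The subtree predicts worse than its root alone: collapse it to a leaf.
        node["split"] = None
        node.pop("left")
        node.pop("right")
        return leaf_error
    return subtree_error
```

In the iterative scheme of Gelfand et al. [1991], a step of this kind would be applied alternately with each of the two training subsets playing the role of the independent data set.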

[15] Adopting these methods of deciding on the depth of branching and partitioning of nodes remains consistent with the choice of mean observation in each partition as the predicted value. However, the distribution of observed values in each partition can also provide additional information on the uncertainty associated with prediction in a way that is consistent with the training data set.

[16] The characteristics of CART methods make them candidates for the direct mapping of rainfall-runoff relationships. They are fast and conceptually simple [Breiman et al., 1984]. They are also the most interpretable among the techniques implementing function approximators [Friedman, 1994] in that they do not require the identification of any weighting coefficients (as in neural net methods) and each final partition is associated with a particular (if complex) set of hydrological conditions. To our knowledge, this is the first attempt to apply the regression tree technique to the prediction of the outputs of a dynamic hydrological system.

[17] One important advantage of regression trees is that the algorithm performs automatic variable selection [Breiman et al., 1984; Friedman, 1994]. Different variables may be selected in different parts of the input space; thus the method is able to exploit the local relevance of variables and a locally lower dimensionality of the mapping. These are partial responses to the “curse of dimensionality” issue. The method is also able to deal quite effectively with correlations in predictor variables [Friedman, 1994]. This can be intuitively understood as follows. Once a variable is selected and the partitioning performed, its relevance, and that of significantly correlated variables, is likely to decrease in its descendent partitions. That will encourage the choice for subsequent splits of other, less correlated, variables to the extent that their relevance is higher.

[18] While regression trees are able to approximate the complex functionality of nonlinear systems, they can only reflect the nature of the relationships that are contained within the available data and therefore will generally require extensive data sets. Another limitation is the piecewise constant nature of the predictions that are produced. All observations that lie in the same partition are assigned the same predicted value, irrespective of their location in the input space or subspace. In the simplest implementation (as used here), information from data points outside the boundary of the partition is ignored, even if those points are close to the point for which we want to make the prediction. The predictions also tend to be biased when extrapolating in the input space.

[19] Another interesting characteristic of regression trees is the sensitivity of the partitioning process to the training set [Breiman, 1996]; that is, a relatively small change in the training data set may lead to a different choice when selecting a split, which in turn may represent a significant change in the subtree rooted in that node. This is a drawback when interpretation of the selected variables is attempted. However, this does not affect the predictive ability. Moreover, this property proves to be essential for the success of the recent improvements in predictive ability developed by Breiman [1996, 2001a, 2001b]. The latter essentially imply building a large ensemble of trees (by either bootstrapping the data or randomizing the choice of the split) and aggregating their predictions for each sample. However, the interpretability of the partitioning process is completely lost. This is the main reason we chose to limit ourselves to the basic method.

[20] It is also worth noting that the principle of recursive partitioning lends itself to combinations with other concepts yielding numerous so-called hybrid methods such as fuzzy trees [e.g., Suárez and Lutsko, 1999], multivariate adaptive regression splines (MARS) [Friedman, 1991], or flexible metric nearest-neighbor [e.g., Friedman, 1994]. Linear regressions have also been applied on the terminal partitions [e.g., Quinlan, 1993].

2.3. Study Site and Data

[21] The H. J. Andrews Experimental Forest is situated in the western Cascade Mountains (Oregon, United States). Only a brief catchment description is given here, as numerous publications describe the sites in detail [e.g., Harr, 1977; Harr et al., 1982; Jones and Grant, 1996]. The watersheds WS1 and WS2 are adjacent subcatchments with areas of 1.0 and 0.6 km², respectively, and elevation ranges of 460–990 and 530–1070 m, respectively. The slopes range between 60 and 100%. Mean annual precipitation is a function of altitude and ranges between 2300 and 2500 mm. Over 80% of the precipitation falls from November to April. Elevations between 400 and 1200 m may alternatively receive snow or rain. Evapotranspiration accounts for about 40% of the incoming precipitation. The study area is underlain by highly weathered, deeply dissected volcanics. Soils are weakly developed with thick organic litter horizons, deeply weathered parent materials, and high stone content. Moisture storage and transfer are characterized by high porosity and hydraulic conductivities. The vegetation of these catchments consisted mainly of 100- to 500-year-old Douglas fir closed canopy forests. Catchment WS1 was clear-cut and regenerated between 1962 and 1966, while catchment WS2 has been left undisturbed.

[22] Forty-one years (1957–1998) of 15-min precipitation and discharge measurements were available for this study. As neither evapotranspiration measurements at finer timescales nor other meteorological variables were available for the whole period of data, evapotranspiration was estimated by distributing a mean annual value of 850 mm throughout the year using a sinusoidal form. More than 1000 predictor variables were built using the five types from Table 1. The smallest value considered for the time characteristic (τ) was 0.25 hour. This value was then progressively incremented, with increments taken in a geometric progression having an initial value of 0.25 hour and a ratio of 1.02. We considered a maximum 1-year memory window, so the first year of the record serves only to build the predictor variables. The 40 remaining years were divided into two 20-year periods. The first one was used for model training (calibration), while the second one was used for model validation. Table 2 gives some data statistics for the two periods.
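The construction of the set of time characteristics can be sketched as follows, under one possible reading of the description: τ starts at 0.25 hour and each successive increment is 1.02 times the previous one, starting from 0.25 hour. The function name, the argument names, and the 8760-hour (1-year) upper bound are assumptions; any rounding of τ to multiples of the 15-min time step is ignored.

```python
def tau_grid(tau_min=0.25, first_increment=0.25, ratio=1.02, tau_max=8760.0):
    """Time characteristics (hours) whose successive increments follow a
    geometric progression: 0.25, 0.25 * 1.02, 0.25 * 1.02**2, ..."""
    taus = []
    tau, step = tau_min, first_increment
    while tau <= tau_max:
        taus.append(tau)
        tau += step
        step *= ratio
    return taus


grid = tau_grid()
print(len(grid), grid[:4], grid[-1])
```

Under this reading the grid is nearly uniform at short timescales and nearly logarithmic at long ones, and it contains on the order of 300 values per variable type, which is compatible with the more than 1000 predictor variables reported for the five types.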

Table 2. Summary of Statistical Characteristics of Input-Output Data

| Period | Precipitation Average, mm yr⁻¹ | WS1 Average, L s⁻¹ | WS1 Variance, L² s⁻² | WS1 Coefficient of Variation | WS2 Average, L s⁻¹ | WS2 Variance, L² s⁻² | WS2 Coefficient of Variation |
|---|---|---|---|---|---|---|---|
| Calibration (1958–1978) | 2283 | 39.64 | 6311 | 2.004 | 23.08 | 1740 | 1.807 |
| Validation (1978–1998) | 2251 | 36.53 | 5522 | 2.034 | 21.54 | 1558 | 1.832 |

[23] The time period adopted for the calculated partition diversity (defined by equation (1)) was chosen to be equal to 2 months (equivalent to l = 60 for each of the two 10-year training subsets). The chosen time period reflects our perception that quite long autocorrelation timescales characterize the analyzed system.

3. Results and Discussion

[24] Figure 1 presents the reduction in deviance, expressed as a fraction of the initial deviance that can be obtained at the first split as a function of the time constant (τ) for the five variable types listed in Table 1. Results were obtained by setting α in equation (2) to zero (i.e., there was no attempt to account for serial correlations). Note that the relative reduction in deviance can be directly interpreted as values of the Nash-Sutcliffe criterion of the corresponding model. Figure 1 shows that simply splitting the data set in two partitions and taking the average discharge in each partition as predictions yields a model that explains more than 40% of the initial deviance in the data. Results for time constants of less than 1 hour are not shown because they add nothing to the explanation of the catchment outputs. This is not unexpected at the Andrews catchments, which, although small, are mostly dominated by relatively slow subsurface flow responses.
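The equivalence between the relative reduction in deviance and the Nash-Sutcliffe criterion can be made explicit for a single split. In the sketch below, q_i denotes an observed discharge, \hat{q}_i the prediction (the mean of the partition containing sample i), and \bar{q} the mean over the whole data set; this notation is introduced here for illustration, while D^P, D^L, and D^R are the deviances of the parent and child partitions as defined in section 2.2. The identity holds on the data used to compute the partition means.

```latex
\begin{align}
\mathrm{NSE}
  &= 1 - \frac{\sum_i \left(q_i - \hat{q}_i\right)^2}
              {\sum_i \left(q_i - \bar{q}\right)^2} \\
  &= 1 - \frac{D^L + D^R}{D^P}
   = \frac{D^P - \left(D^L + D^R\right)}{D^P}
\end{align}
```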

Figure 1. Reduction in deviance at the first split as a function of the time characteristic τ: (a) cumulative variable types; (b) exponential decay variable types.

[25] Slightly better results are obtained for exponential decay types of variables (types IV and V) compared with cumulative types and for “balance” variables (types III and V) compared with precipitation only variables (types I and IV). Evapotranspiration variables (type II) offer consistently lower deviance reductions except for short (<10 hour) and for very long (>1000 hour) timescales. It can be noted that the optimum region is quite flat due to the high correlations between variables. This pattern suggests that the regression trees' sensitivity to data (mentioned above) will manifest itself in the framework adopted for this application as an “uncertainty interval” on the time characteristic selected for the first split. All variables that are based on precipitation (TI, TIII, TIV, and TV) show a consistent pattern, with the maximum region for WS1 being obtained at significantly lower values of the time variable than for WS2. Three-parameter univariate nonlinear regressions on these variables (not shown) show a similar shape as a function of the time constant and explain up to 65% of the initial deviance in discharges. This confirms that the results obtained with the binary splits are a good indicator for the strength of the nonlinear relationships between the chosen predictor variables and discharge.

[26] Figure 2 represents a depth-five tree (i.e., a partition is the result of five splits at most) obtained at Andrews WS2. All five variable types were used, while α was set to 0.5 (in order to reflect a balance between reducing deviance and enhancing diversity in descendent partitions). The result of the splitting process can be directly interpreted as decision rules. For example, if TV_112 hours (sample_i) < 62 mm and TV_907 hours (sample_i) > 192 mm, then sample_i ∈ partition 5. Table 3 synthesizes the information on the partitions and their splits and presents performance assessment results for the training and validation periods. Prediction error and prediction error reduction are expressed as a percentage of the deviance of the training and validation data sets. Because of the skewed distribution and the least squares type of criterion, the distribution of samples between the descendant partitions is quite uneven. The use of diversity in the criterion only partly compensates for the effect of the least squares criterion, which gives emphasis to high flows. In this tree there are 29 terminal nodes (partitions), i.e., numbers 10, 32–39, and 44–63, that serve to make predictions. That is, the prediction is made using only 29 discrete values for the whole range of discharges. Despite this, the predictive performance is quite impressive: 84.7% explained deviance for the calibration period and 84.3% for the validation period. As stated above, these values can be directly interpreted as values of the Nash-Sutcliffe efficiency criterion. These performances can be improved by adopting more complex models. Note that deviance and deviance reduction in the validation period are calculated using the average of the response variable in the calibration period. These results can be compared with those obtained by Waichler et al. [2002] using the distributed hydrology soil vegetation model [Wigmosta et al., 1994], a process-based, distributed parameter hydrologic model. 
The reported Nash-Sutcliffe efficiency values using the whole 1958–1998 period for model calibration and an hourly time step are 80.7% for WS2 and 78.9% for WS1.
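The reading of the tree as decision rules can be illustrated with a short sketch that routes a sample through the two splits quoted in the example above. The node numbering (children of node n are 2n and 2n + 1) is inferred from the parent numbers listed in Table 3; the dictionary keys and the handling of values equal to the threshold are assumptions, and all other splits of the actual tree are omitted.

```python
# Only the two splits quoted in the text are encoded here (thresholds in mm).
SPLITS = {
    1: ("TV_112h", 62.0),   # root: type V variable, time characteristic 112 hours
    2: ("TV_907h", 192.0),  # node 2: type V variable, time characteristic 907 hours
}


def route(sample, splits, start=1):
    """Follow the splits from `start` until a node without a recorded split
    is reached; values below the threshold go to the left child (2n)."""
    node = start
    while node in splits:
        name, threshold = splits[node]
        node = 2 * node if sample[name] < threshold else 2 * node + 1
    return node


print(route({"TV_112h": 40.0, "TV_907h": 250.0}, SPLITS))  # partition 5
```

In the depth-five tree, partitions 3, 4, and 5 are themselves split further, so the node returned here is only the point at which this partial rule set stops.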

Figure 2. Depth-five regression tree for WS2 (see Table 3 for details).

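Table 3. Summary Results for the Depth-Five Regression Tree for WS2 (a)

| Node Number | Parent Number | Type | Time Characteristic, hours | Value, mm | Calibration: Number of Samples | Calibration: Average (Prediction), L s−1 | Calibration: Prediction Error, % | Calibration: Error Reduction, % | Validation: Number of Samples | Validation: Average, L s−1 | Validation: Prediction Error, % | Validation: Error Reduction, % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 16 | | | | 222,073 | 1.5 | 0.015 | | 201,268 | 1.4 | 0.013 | |
| 33 | 16 | | | | 6,129 | 7.6 | 0.010 | | 4,488 | 8.0 | 0.013 | |
| 34 | 17 | | | | 88,214 | 4.6 | 0.030 | | 76,446 | 4.0 | 0.018 | |
| 35 | 17 | | | | 3,135 | 13.5 | 0.015 | | 2,351 | 13.1 | 0.013 | |
| 36 | 18 | | | | 40,910 | 8.1 | 0.029 | | 46,578 | 7.6 | 0.034 | |
| 37 | 18 | | | | 31,597 | 12.6 | 0.098 | | 32,614 | 11.1 | 0.055 | |
| 38 | 19 | | | | 22,549 | 15.1 | 0.079 | | 25,177 | 14.1 | 0.171 | |
| 39 | 19 | | | | 1,418 | 29.6 | 0.023 | | 1,470 | 34.2 | 0.034 | |
| (40) | (20) | | | | (848) | (98.6) | (0.007) | | (3,587) | (16.9) | (2.205) | |
| (41) | (20) | | | | (759) | (36.7) | (0.023) | | (888) | (15.7) | (0.038) | |
| (42) | (21) | | | | (69,550) | (18.6) | (0.718) | | (80,477) | (15.8) | (0.510) | |
| (43) | (21) | | | | (61,356) | (26.2) | (0.543) | | (69,504) | (23.9) | (0.726) | |
| 44 | 22 | | | | 14,826 | 33.8 | 0.200 | | 16,536 | 30.0 | 0.266 | |
| 45 | 22 | | | | 17,294 | 38.3 | 0.333 | | 18,192 | 35.7 | 0.471 | |
| 46 | 23 | | | | 15,606 | 44.6 | 0.551 | | 17,421 | 40.5 | 0.480 | |
| 47 | 23 | | | | 13,368 | 54.2 | 0.717 | | 18,261 | 48.2 | 0.637 | |
| 48 | 24 | | | | 5,322 | 24.2 | 0.087 | | 8,476 | 22.7 | 0.143 | |
| 49 | 24 | | | | 8,351 | 48.7 | 0.325 | | 7,208 | 52.9 | 0.393 | |
| 50 | 25 | | | | 19,174 | 62.5 | 1.244 | | 16,990 | 56.7 | 0.647 | |
| 51 | 25 | | | | 16,915 | 81.2 | 1.398 | | 14,972 | 73.8 | 1.100 | |
| 52 | 26 | | | | 3,188 | 35.7 | 0.126 | | 2,815 | 35.7 | 0.165 | |
| 53 | 26 | | | | 1,513 | 71.6 | 0.130 | | 1,485 | 53.9 | 0.255 | |
| 54 | 27 | | | | 22,142 | 101.7 | 2.730 | | 20,405 | 97.2 | 2.251 | |
| 55 | 27 | | | | 8,525 | 144.2 | 1.933 | | 7,938 | 151.3 | 1.199 | |
| 56 | 28 | | | | 1,347 | 221.7 | 0.332 | | 904 | 205.4 | 0.340 | |
| 57 | 28 | | | | 829 | 167.1 | 0.278 | | 652 | 190.2 | 0.262 | |
| 58 | 29 | | | | 159 | 141.9 | 0.025 | | 335 | 141.5 | 0.125 | |
| 59 | 29 | | | | 1,169 | 258.1 | 0.482 | | 1,398 | 234.5 | 0.561 | |
| 60 | 30 | | | | 412 | 172.8 | 0.139 | | 95 | 104.3 | 0.066 | |
| 61 | 30 | | | | 358 | 296.9 | 0.126 | | 700 | 301.0 | 0.551 | |
| 62 | 31 | | | | 1,471 | 346.6 | 0.765 | | 1,010 | 320.4 | 0.932 | |
| 63 | 31 | | | | 773 | 496.8 | 1.174 | | 639 | 596.0 | 3.005 | |
| Total | | | | | | | | ∑ = 84.677 | | | | ∑ = 84.410 |

(a) Splits indicated in parentheses were eliminated from the final tree; see text for justification and comments.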
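The node numbers in Table 3 appear to follow the usual binary numbering of tree nodes, in which node k has parent ⌊k/2⌋ and children 2k and 2k + 1 (for instance, nodes 32 and 33 share parent 16). Assuming this convention holds throughout, the following sketch (with helper names of our own choosing) recovers the depth of a node and the list of terminal nodes once the subtree rooted in node 10 is pruned.

```python
def parent(node):
    """Parent of a node under binary numbering (root = 1)."""
    return node // 2


def children(node):
    """The two descendants created by splitting a node."""
    return 2 * node, 2 * node + 1


def depth(node):
    """Number of splits between the root and this node."""
    return node.bit_length() - 1


def subtree(node, max_depth):
    """All descendants of `node` down to `max_depth`, in numeric order."""
    found, frontier = [], [node]
    while frontier:
        current = frontier.pop(0)
        for child in children(current):
            if depth(child) <= max_depth:
                found.append(child)
                frontier.append(child)
    return sorted(found)


# Terminal nodes of the depth-five tree after declaring node 10 terminal.
pruned = subtree(10, max_depth=5)
terminal = [10] + [n for n in range(32, 64) if n not in pruned]
print(pruned)         # descendants removed with the subtree of node 10
print(len(terminal))  # number of partitions used for prediction
```

Under this assumption the pruned descendants are nodes 20, 21, and 40–43, which leaves the 29 terminal nodes quoted in the text (10, 32–39, and 44–63).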

[27] Figure 3 illustrates the observed and modeled hydrographs for a 2500-hour period from 8 January to 22 April 1986 (in the validation period). Despite the stair-type prediction, the regression tree provides visually acceptable reproductions of the timing and magnitudes of the observed response. The stair-type feature becomes less apparent as the partitioning becomes finer (compare Figures 3a and 3b) but is unlikely to disappear completely because of the high serial correlations in the variables and the inherently discontinuous prediction provided by regression trees. However, several hybrid methods (see sections 2.2 and 4.3) do provide continuous predictions.

Figure 3. Observed and modeled hydrograph for WS2 (8 January 1986 to 22 April 1986): (a) model with 29 terminal nodes; (b) model with 57 terminal nodes.

[28] The subtree rooted in node 10, which appears in Table 3 but has already been pruned from Figure 2, deserves special attention. After its split, the left partition (node 20), corresponding to values of the predictor variable less than the threshold (i.e., no rainfall in the previous 2.2 days), has an average discharge 3 times higher than that of the right node. This does not fit our understanding of the system's hydrological response. A closer examination of the upper tail of partition 10 (see also Figure 5c) using the original record suggests that it corresponds to spring snowmelt episodes with little incoming precipitation. Obviously, the available predictor variables cannot account for such episodes. At least temperature variables, or better estimates of the actual liquid phase inputs to the system (i.e., accounting for snow accumulation and melt), would have been needed. In its attempt to reduce the deviance, the algorithm found a spurious split. The diversity did not help much in this specific case, as this proved to be a recurring pattern in the calibration period (but not in the validation period; see below). These considerations justify pruning the subtree rooted in node 10 and declaring it terminal. They also illustrate the idea that the interpretability of regression trees may serve not only to provide insight into the functioning of the system but also to effectively control the results by revealing sets of conditions that are not physically meaningful. It can also be noted (Table 3 and Figure 5f) that the split of partition 10 actually decreases the predictive performance in the validation period.
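The contrast between the two descendants of node 10 can be checked approximately from the parenthesized rows of Table 3. The sketch below assumes that the averages of nodes 20 and 21 can be recovered as sample-weighted means of their own descendants (nodes 40–41 and 42–43, respectively); because the tabulated averages are rounded, the result is only indicative. Variable and function names are ours.

```python
# (number of samples, average discharge in L/s), copied from Table 3.
calibration = {40: (848, 98.6), 41: (759, 36.7),
               42: (69_550, 18.6), 43: (61_356, 26.2)}
validation = {40: (3_587, 16.9), 41: (888, 15.7),
              42: (80_477, 15.8), 43: (69_504, 23.9)}


def pooled_average(rows, nodes):
    """Sample-weighted mean discharge over a group of sibling partitions."""
    total = sum(rows[n][0] for n in nodes)
    return sum(rows[n][0] * rows[n][1] for n in nodes) / total


def left_to_right_ratio(rows):
    """Average of node 20 (left) divided by average of node 21 (right)."""
    return pooled_average(rows, (40, 41)) / pooled_average(rows, (42, 43))


for label, rows in (("calibration", calibration), ("validation", validation)):
    print(f"{label}: node 20 / node 21 = {left_to_right_ratio(rows):.2f}")
```

With these values the ratio is about 3.1 in the calibration period and about 0.85 in the validation period, which is consistent with the statement that the pattern recurs in calibration but not in validation.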

[29] Figure 4 plots the cumulative reduction in deviance as a function of the timescale (τ) of the variable on which the split was made. The model for WS1 shows an overall performance similar to that obtained for WS2. The profile of WS1 is almost parallel but shifted toward lower time characteristics. The pattern observed for the first split is preserved. These observations, together with the higher coefficient of variation of discharges (see Table 2), suggest that WS1 is more responsive and more sensitive to shorter timescales. As inputs to these adjacent subcatchments are similar (and were considered identical for this analysis), it follows that these differences reflect differences in catchment physical characteristics.

Figure 4. Cumulative explained deviance as a function of the timescale (τ) characterizing split variables.

[30] Figure 5 represents the distributions of the outputs in selected partitions, including that of the entire data set (partition 1, Figure 5a), for the calibration and validation periods. The distributions for the two periods are broadly similar (with the exception of partition 20, Figure 5f, as discussed above). The average discharge values in Table 3 corroborate this observation. These elements, as well as the good results obtained in validation, qualitatively support the hypothesis that the system is apparently stationary, in the sense that the catchment characteristics and discharge are similar between the two periods. Differences in the distributions of the outputs in most of the partitions are small (Figures 5a–5e). The use of more formal quantitative tests for change is an interesting possibility, but the fact that the samples are not independent needs adequate consideration.

Figure 5. Cumulative distributions for the calibration and validation periods (normal probability paper): (a) partition 1 (whole period); (b) partition 33; (c) partition 10; (d) partition 46; (e) partition 62; (f) partition 20.

[31] One can also note that the distributions in terminal partitions are closer to normal distributions than that of the entire sample. This is especially true for partition 62 (Figure 5e). The longer tails in partitions 10 (Figure 5c) and 33 (Figure 5b) are explained to a large extent by the snowmelt effect discussed above. In such a case, where some parts of the response cannot be explained by the available variables (no temperature data were used here), the use of LAD trees, which should be more robust to outliers (see above), may be justified but was not pursued in this initial study. An interesting feature that can be observed in Figure 5 and also derived from Table 3 is heteroscedasticity. However, this is not a prior assumption but rather a result of the partitioning algorithm. It reflects both measurement errors and the fact that the predictor space is unevenly populated, with higher densities for smaller values of the predictor variables. For example, partition 32 contains 31.7% of the initial sample but contributes only 0.1% to the unexplained deviance, while partition 63, which contains 0.1% of the sample, is responsible for almost 8% of it.
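The percentages quoted for partitions 32 and 63 can be reproduced from the calibration columns of Table 3. The sketch below assumes that nodes 32–63 (including the parenthesized rows) together cover the whole calibration sample and that the unexplained deviance is 100% minus the tabulated total of 84.677%; variable and function names are ours.

```python
# Calibration-period sample counts for nodes 32-63, copied from Table 3.
calibration_samples = {
    32: 222_073, 33: 6_129, 34: 88_214, 35: 3_135, 36: 40_910, 37: 31_597,
    38: 22_549, 39: 1_418, 40: 848, 41: 759, 42: 69_550, 43: 61_356,
    44: 14_826, 45: 17_294, 46: 15_606, 47: 13_368, 48: 5_322, 49: 8_351,
    50: 19_174, 51: 16_915, 52: 3_188, 53: 1_513, 54: 22_142, 55: 8_525,
    56: 1_347, 57: 829, 58: 159, 59: 1_169, 60: 412, 61: 358,
    62: 1_471, 63: 773,
}
# Prediction error (% of initial deviance) for the two partitions discussed.
prediction_error_pct = {32: 0.015, 63: 1.174}

total_samples = sum(calibration_samples.values())
unexplained_pct = 100.0 - 84.677  # complement of the explained deviance


def sample_share(node):
    """Share of the calibration sample falling in a partition, in percent."""
    return 100.0 * calibration_samples[node] / total_samples


def deviance_share(node):
    """Share of the unexplained deviance due to a partition, in percent."""
    return 100.0 * prediction_error_pct[node] / unexplained_pct


for node in (32, 63):
    print(f"partition {node}: {sample_share(node):.1f}% of samples, "
          f"{deviance_share(node):.1f}% of unexplained deviance")
```

Under these assumptions the sketch returns 31.7% and 0.1% for partition 32 and 0.1% and 7.7% for partition 63, in line with the figures given in the text.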

4. Perspectives

[32] This research concerns only a first, rather basic, application of nonparametric techniques for the identification of rainfall-runoff relationships using the direct mapping from the input space to the output space. However, the results obtained are promising and seem to warrant further developments in several directions that are briefly outlined below.

4.1. Data Quality Control

[33] The principle of regression trees is to identify increasingly homogeneous input configurations that should lead to increasingly homogeneous output configurations. The results convincingly show that the algorithm is quite effective at achieving this. In this context, the identification of outliers is of considerable interest and warrants further examination. There are several possible explanations for outliers in partitions. One, as illustrated above, is the lack of relevant predictor variables. Measurement errors are another significant source. The principle that higher inputs should yield higher outputs offers a complementary control. In this respect, the application of our technique to two other systems, the Plynlimon and Panola watersheds (I. Iorgulescu and K. Beven, unpublished results, 2003), identified very large outlier periods that we will later show to be a result of data measurement and data manipulation errors.

4.2. Comparative Studies and the Assessment of Hydrological Change

[34] Regression trees are exclusively data based, and only a minimal number of assumptions is needed. This feature opens interesting possibilities to perform comparative analysis. This paper already contains some elements illustrating the latter for catchments WS1 and WS2, in particular the differences, noted above, between the selected predictor variables. As the inputs (precipitation and evapotranspiration) are roughly similar for the two subcatchments, and were considered identical in this analysis, differences are essentially explained by the physical characteristics of the catchments. Another possible analysis is to feed the data from one catchment into the model built for the other and to compare the explained variances and the (normed) distributions in the partitions. This should allow a more detailed analysis of the differences for various input configurations.
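The cross-application of models between catchments can be sketched as follows. Everything in this example is hypothetical: the data, the single summary variable, the split thresholds, and the function names are chosen for illustration only, and the partition model is reduced to a set of intervals with one average per interval. The explained variance is computed here with respect to the mean of the evaluated data set, which is one possible choice among others.

```python
from bisect import bisect_right


def fit_partition_means(thresholds, x, y):
    """Average response in each interval defined by sorted thresholds."""
    sums = [0.0] * (len(thresholds) + 1)
    counts = [0] * (len(thresholds) + 1)
    for xi, yi in zip(x, y):
        k = bisect_right(thresholds, xi)  # index of the interval holding xi
        sums[k] += yi
        counts[k] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


def predict(thresholds, means, x):
    """Piecewise-constant prediction: one average per partition."""
    return [means[bisect_right(thresholds, xi)] for xi in x]


def explained_variance(y, y_hat):
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst


# Hypothetical rainfall summary variable (mm) and discharges (L/s).
x_a = [0, 5, 10, 20, 40, 60, 80, 100]
y_a = [1, 2, 3, 6, 15, 30, 50, 80]      # catchment A
x_b = [0, 5, 10, 20, 40, 60, 80, 100]
y_b = [1, 3, 5, 10, 25, 45, 70, 100]    # catchment B, more responsive

thresholds = [15, 50, 90]               # hypothetical split values (mm)
means_a = fit_partition_means(thresholds, x_a, y_a)

ev_own = explained_variance(y_a, predict(thresholds, means_a, x_a))
ev_cross = explained_variance(y_b, predict(thresholds, means_a, x_b))
print(f"A on A: {ev_own:.3f}   A on B: {ev_cross:.3f}")
```

A drop in explained variance when the model of catchment A is applied to catchment B, together with a partition-by-partition comparison of the residuals, would then point to the input configurations for which the two responses differ most.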

[35] Analysis of catchment change would follow a similar approach. When such a change is known to have occurred, models can be built for the prechange and/or postchange situations. The principle of the subsequent cross-validation procedure is illustrated by the analysis of the comparison between the calibration and the validation periods (see Table 3 and Figure 5). The Andrews Watershed case also offers the potential to be a test case for this approach. In particular the 1962 clear-cut of WS1 has to be considered as a hypothesis for explaining the differences in response with respect to the control catchment WS2. This analysis is in progress and will be reported in a subsequent contribution.

4.3. More Elaborate Nonparametric and Local Parametric Approaches

[36] Clearly, there is room for improving the results by adopting better-adapted methods, in an attempt to compensate for some of the disadvantages of the basic regression tree algorithm (e.g., discontinuity in prediction and fixed partitions) used in this research. Besides radically different methods, two lines of development that stem directly from the recursive partitioning principle seem to hold promise: (1) hybrid methods, such as fuzzy trees [e.g., Suárez and Lutsko, 1999] or MARS [Friedman, 1991], and (2) implementing the recent work of Breiman [1996, 2001a, 2001b] on bagging and random forests. Such methods will also be required to extrapolate to conditions beyond the range of the calibration data set. This is the subject of ongoing research with a number of catchment data sets.
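As an indication of how bagging might reduce the discontinuity of tree predictions, the sketch below averages least squares "stumps" (trees with a single split) fitted to bootstrap resamples of a one-dimensional data set. It is a deliberately minimal illustration, not the method used in this research: the data are synthetic, the function names are ours, and neither the diversity criterion nor the treatment of serial correlation described earlier is included.

```python
import random


def fit_stump(x, y):
    """Least squares split of 1-D data; returns (threshold, mean_l, mean_r)."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold separates equal predictor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = (sum((v - mean_l) ** 2 for v in left)
               + sum((v - mean_r) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, 0.5 * (pairs[i - 1][0] + pairs[i][0]), mean_l, mean_r)
    if best is None:  # degenerate resample: all predictor values identical
        mean_all = sum(y) / len(y)
        return (float("inf"), mean_all, mean_all)
    return best[1:]


def predict_stump(stump, xi):
    threshold, mean_l, mean_r = stump
    return mean_l if xi < threshold else mean_r


def fit_bagged(x, y, n_models, rng):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    n = len(x)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagged(models, xi):
    """Average of the individual stump predictions."""
    return sum(predict_stump(m, xi) for m in models) / len(models)


# Synthetic, smoothly increasing response (not data from the study).
x = [float(i) for i in range(20)]
y = [0.1 * xi ** 2 for xi in x]

single = fit_stump(x, y)
models = fit_bagged(x, y, n_models=25, rng=random.Random(0))
bagged_pred = [predict_bagged(models, xi) for xi in x]
print(sorted({round(p, 2) for p in bagged_pred}))
```

A single stump can only return two values, whereas the bagged average typically takes many intermediate values because the bootstrap resamples place the split at different thresholds. The prediction remains piecewise constant, but with smaller steps.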

4.4. Multicriterion Evaluation of Parametric Rainfall-Runoff Models

[37] The main idea is to allow a multicriterion prediction approach on the partitions identified by the regression tree model. The rationale behind such an approach is to give less opportunity for trade-offs between over-fitting dominant modes and accounting for useful information on less represented modes as well as to be able to relax the strong hypothesis of a simple global error structure. Partitioning has already been advocated for dealing with the interference issue and poor prior knowledge of a “global” error structure (as, for example, in the differentiation of hydrograph parts used by Wagener et al. [2003] and Freer et al. [2003] or the regressions used in the self-organizing linear output map (SOLO) approach of Hsu et al. [2002]). However, we argue that the regression trees provide a robust methodology for partitioning because of the following:

[38] 1. It is a data-based approach and thus reproducible and independent of a particular parametric model structure. Moreover, the partitions represent data-based “modes” of response that respect increasingly similar input configurations.

[39] 2. At each step in forming the regression tree, the algorithm uses increasingly local predictor variables and thresholds to define the subsets so as to maximize the explained deviance. This provides a partitioning of the nonlinear responses of the catchment that is derived directly from the characteristics of the data without the need for prior criteria for the classification of hydrological behaviors.

[40] 3. The nested structure and local (i.e., within-partition) learning of regression trees allow the complexity of the tree to be controlled. The within-partition deviance, which accounts both for heteroscedasticity in measurement errors and for uneven densities in the input space, offers conservative bounds on within-partition prediction error. Therefore values of Nash-Sutcliffe efficiency (if LS trees are acceptable) higher than zero can serve as justifiable “behavioral” thresholds. We conjecture that the latter will significantly constrain the range of acceptable models.
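The evaluation suggested in the three points above can be sketched as a Nash-Sutcliffe efficiency computed separately within each partition, with the within-partition mean of the observations (i.e., the least squares tree prediction) as reference. The function names, the partition labels, and the numerical values below are hypothetical; the sketch only illustrates how a simulation that performs well globally might still be rejected in a poorly represented partition.

```python
from collections import defaultdict


def partition_efficiency(partition_ids, observed, simulated):
    """Nash-Sutcliffe efficiency of a simulation within each partition."""
    groups = defaultdict(list)
    for pid, obs, sim in zip(partition_ids, observed, simulated):
        groups[pid].append((obs, sim))
    scores = {}
    for pid, pairs in groups.items():
        mean_obs = sum(o for o, _ in pairs) / len(pairs)
        deviance = sum((o - mean_obs) ** 2 for o, _ in pairs)
        sse = sum((o - s) ** 2 for o, s in pairs)
        # Efficiency is undefined when all observations in a partition agree.
        scores[pid] = 1.0 - sse / deviance if deviance > 0 else float("nan")
    return scores


def is_behavioral(scores, threshold=0.0):
    """A simulation is retained only if it beats the threshold everywhere."""
    return all(score > threshold for score in scores.values())


# Hypothetical low-flow ("A") and high-flow ("B") partitions, in L/s.
partition_ids = ["A", "A", "A", "B", "B", "B"]
observed = [1.0, 1.5, 2.0, 400.0, 500.0, 600.0]
sim_balanced = [1.1, 1.4, 2.1, 420.0, 510.0, 570.0]
sim_high_flow_only = [3.0, 3.5, 4.0, 400.0, 500.0, 600.0]

balanced = partition_efficiency(partition_ids, observed, sim_balanced)
high_flow_only = partition_efficiency(partition_ids, observed, sim_high_flow_only)
print(balanced, is_behavioral(balanced))
print(high_flow_only, is_behavioral(high_flow_only))
```

In this example the second simulation reproduces the high flows exactly, and would therefore score very well on a global least squares criterion, yet its efficiency in the low-flow partition is strongly negative, so it would not be retained.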

5. Conclusions

[41] This research provides a new approach for the analysis and modeling of rainfall-runoff relationships, based on a nonparametric mapping from the input space to the output space. Such an approach is best applied when the information in input-output data dominates prior knowledge about system behavior. Significant advantages of the method are its ability to “learn” when more data are presented to it, and the possibility of directly comparing the different responses under similar hydrological conditions as defined by the subspace partitioning. The recursive partitioning provided by the regression tree approach is a powerful technique for the exploratory analysis of data sets, especially in that it allows insights into features of the data including complexity, local nonlinearities, heteroscedasticity, and data outliers (that might identify measurement errors).

[42] One of the main findings of this research is that suitably chosen (linear) combinations of inputs offer a good performance when used as predictor variables. This option also offers the significant advantage of transforming the nonlinear dynamic problem into a static one, effectively eliminating possible instabilities due to model structural errors or numerical approximation errors, and allowing the use of a wide range of identification techniques (i.e., it basically involves solving a regression problem). The results demonstrate that the new method is quite successful; given a sufficient record length, it has a predictive ability comparable, if not superior, to traditional input-state-output models. The fact that such a data-based method is able to compete with state-of-the-art hydrological models, at least within the range of the available calibration data, puts the latter in an unflattering perspective and more generally questions the relevance of the knowledge on which they rely.
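The transformation of the dynamic problem into a static one can be sketched as follows. The example assumes the simplest possible summary variable, namely an equally weighted sum of the inputs over a given number of previous time steps (excluding the current one); the weights, the window lengths, the function names, and the data are illustrative assumptions rather than the variables actually used in this study.

```python
from itertools import accumulate


def trailing_sum(inputs, window):
    """Sum of the inputs over the `window` steps preceding each time step.

    Entry t covers inputs[t - window : t]; it is None while the available
    history is still shorter than the window.
    """
    prefix = [0.0] + list(accumulate(inputs))
    return [prefix[t] - prefix[t - window] if t >= window else None
            for t in range(len(inputs))]


def static_samples(rain, discharge, windows):
    """Pair each discharge with its vector of input history summaries."""
    columns = [trailing_sum(rain, w) for w in windows]
    rows = []
    for t in range(len(rain)):
        features = [col[t] for col in columns]
        if None not in features:  # skip steps without a full history
            rows.append((features, discharge[t]))
    return rows


# Hypothetical hourly rainfall (mm) and discharge (L/s).
rain = [0.0, 2.0, 5.0, 0.0, 1.0, 0.0]
discharge = [1.0, 1.0, 1.5, 4.0, 3.0, 2.0]

for features, q in static_samples(rain, discharge, windows=[2, 3]):
    print(features, q)
```

Each row is an ordinary regression sample: the predictors depend only on past inputs, not on simulated states, so any static identification technique (regression trees here) can be applied to them.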

[43] The predictor variables are easily interpretable in that the decision variables at the branches of the tree define particular sets of hydrological circumstances determined from the data themselves, rather than from any a priori assumptions. This might offer the potential to gain new insights into rainfall-runoff relationships (and also to quality control the data for anomalous periods).

[44] In this application, very simple regression tree models are able to extract a large part of the deviance (e.g., a model with just four partitions explains more than 60% of the initial deviance). This may be thought of as somehow mirroring the relative success of simple parametric models. However, much more complex models are allowed by the data. Even the terminal nodes (partitions) in the models presented above arguably still contain much useful information. Indeed, they are characterized by large coefficients of variation explained in part by measurement errors but also by the heterogeneity of their input configurations. This raises a more fundamental issue: Is it possible to find a global parametric model involving a relatively small set of hypothesized state variables linked by relatively “simple” relationships that can model accurately the complex natural responses revealed by the nonparametric partitioning?

[45] The greatest limitation on the approach relative to parametric methods at the present time (as is common to other data-based approaches) is the extrapolation of the nonlinearities present in the available data set to other (more extreme) conditions that have not yet appeared in the data, and to other catchments with fewer observations. This will require extrapolation algorithms that can be invoked when needed, making use of the partition data local to the sets of conditions (or trajectories in the history of the catchment) that produce those more extreme conditions. This, together with the uncertainty associated with such extrapolations, is the subject of ongoing research.


[46] Fred Swanson and Don Henshaw (U.S. Forest Service) are gratefully acknowledged for making available the data from the H. J. Andrews Experimental Forest catchments. We are also grateful to two anonymous referees, who caused us to think more carefully about the presentation of the methodology in this context.