2.1. Predictor Variables
 The usual choice in the modeling of dynamical systems is to use as predictor variables the inputs together with past simulated outputs and/or past simulated values of other "state" variables calculated within the model. In this research we adopt an alternative that has not yet been explored in depth in hydrological modeling: taking as predictors only the inputs at a given number of previous time steps. This approach exploits the fact that the input history uniquely determines the outputs and effectively transforms the identification problem into a "static" one [Sjöberg et al., 1995]. Its main advantages are that the resulting model does not have to cope with the issues raised by recurrent calculation of hypothesized state variables and that a large range of methods developed for static problems (that is, regressions) can be applied. The main disadvantage is the high dimensionality of the input space, as the chosen "memory" window has to cover the whole, or at least a major part, of the dynamic response time. The memory of hydrologic systems may span several orders of magnitude, from, say, a few minutes for a parking lot to several decades or more for a large regional aquifer. Although system memory and the time discretization necessary to describe the response are positively correlated, this may still result in a huge number of variables, leading to the so-called "curse of dimensionality" [e.g., Friedman, 1994].

 A variant of this approach is to form other predictors from the inputs, for example, by taking linear combinations of them, which allows fewer effective predictor variables to be used. A "natural" choice for hydrological systems is the sum of the inputs over given time intervals. Another simple option is to capture the decaying influence of past inputs, for example, by weighting the summation with an exponential function of time. Note that the latter is essentially similar to the well-known antecedent precipitation index (API), although this approach does not restrict the choice of input variables to a single API function. These choices rest on the assumption that inputs to hydrological systems are (loosely) additive. In this research we used the five types of variables presented in Table 1 (types I and IV are illustrated in the code sketch following the table). The integration time for types I, II, and III and the exponential decay constant for types IV and V, both denoted by τ [h], can be interpreted as dynamic time characteristics, and m represents the considered memory window.
Table 1. Summary of Predictor Variables
| Type | Description | Expression |
|---|---|---|
| I | cumulative precipitation over the previous τ = kΔt time interval | ∑_{i=0}^{k} P(t − iΔt), k = 0, 1, …, m |
| II | cumulative potential evapotranspiration over the previous τ = kΔt time interval | ∑_{i=0}^{k} PET(t − iΔt), k = 0, 1, …, m |
| III | cumulative precipitation minus potential evapotranspiration over the previous τ = kΔt time interval | ∑_{i=0}^{k} (P(t − iΔt) − PET(t − iΔt)), k = 0, 1, …, m |
| IV | sum of exponentially decaying (decay constant τ) precipitation (API) over the available record r | ∑_{i=0}^{r} exp(−iΔt/τ) P(t − iΔt), τ ∈ [0, mΔt/2] |
| V | sum of exponentially decaying (decay constant τ) precipitation minus potential evapotranspiration over the available record r | ∑_{i=0}^{r} exp(−iΔt/τ) (P(t − iΔt) − PET(t − iΔt)), τ ∈ [0, mΔt/2] |
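For concreteness, the construction of types I and IV can be sketched as follows. This is a minimal illustration, assuming the precipitation record is available as a NumPy array sampled at a fixed step Δt; the function names and the synthetic data are ours, not part of any published code.

```python
import numpy as np

def cumulative_predictor(P, k):
    """Type I: cumulative precipitation over the previous tau = k*dt
    interval, i.e., the sum of P(t - i*dt) for i = 0, ..., k."""
    # Moving sum over the current and the k previous steps; the series
    # is implicitly zero-padded before the start of the record.
    return np.convolve(P, np.ones(k + 1))[: len(P)]

def api_predictor(P, tau, dt):
    """Type IV: exponentially decaying sum (an API-like variable),
    sum over the record of exp(-i*dt/tau) * P(t - i*dt)."""
    decay = np.exp(-dt / tau)      # per-step decay factor
    api = np.empty(len(P))
    acc = 0.0
    for t, p in enumerate(P):
        acc = acc * decay + p      # recursive form of the decaying sum
        api[t] = acc
    return api

# Example with synthetic data: dt = 0.25 h, a 24 h cumulative window
P = np.random.default_rng(0).exponential(0.5, size=1000)
x_I = cumulative_predictor(P, k=96)            # tau = 96 * 0.25 h = 24 h
x_IV = api_predictor(P, tau=24.0, dt=0.25)
```

Types II, III, and V follow by substituting PET(t) or P(t) − PET(t) for P(t).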
2.2. Classification and Regression Trees
 Classification and regression trees (CART) provide nonparametric predictive inference based on recursive binary partitioning of the data set. The space of predictor variables (regressors) is partitioned into a set of disjoint subspaces. The branches of the tree represent a hierarchical nested structure for the subspace partitions. At the end of each branch is a terminal node (or leaf), representing the finest subspace partition. The boundaries of these subspaces are planes (hyperplanes) in the space (hyperspace) of the predictor variables. Each subspace should represent the expected responses (in this case discharges) under a particular set of hydrological conditions as defined by the input variables (here rainfall and other climatic forcing variables). A single-valued predictor (or, by extension, a predicted nonparametric distribution) for the regressed variable is associated with each of these subspaces, based directly on the values observed during the training period. CART methods were made popular by the seminal monograph of Breiman et al. [1984]. The most significant application of CART in the hydrological literature to date is probably the exploration of parameter spaces by Spear et al.
 For building the regression tree we use the iterative growing and pruning method of Gelfand et al. This method consists of splitting the training data into two subsets of equal length and (roughly) similar distribution of the response variable (discharges in our case). A large tree is grown on one of the subsets and pruned on the other (the growing and pruning methods are described below). A large tree is then grown from the terminal nodes of the pruned tree using the second training subset, and a pruned tree is selected using the first subset. The procedure is iterated, successively interchanging the roles of the two subsets, and terminates when the pruned trees obtained in two successive iterations are identical. Finally, the prediction is made using the entire training set with this final tree (defined by the variables and thresholds selected at each branching).

 Growing involves a procedure to systematically explore a tree in search of nodes to split. We use a breadth-first algorithm, meaning that all the nodes at a given depth (i.e., resulting from the same number of splits) are explored before moving to the next depth. The search proceeds in a top-down manner starting from the root node (i.e., the entire training subset). A node is declared terminal if its samples have identical values for all the predictor variables or for the response variable, or if it contains fewer than a prespecified number of samples, the usual default for regression being 6. The same prediction value is associated with all samples in a node; all the results reported in this paper were obtained using the average of the training samples in a given node (partition) as the predictor. Once a nonterminal node has been found, the algorithm searches over all predictor variables and all possible splitting values (i.e., distinct values of the predictor variables present in the training data set) to identify the combination that minimizes an error criterion, thereby maximizing the discriminatory power of the split. The choice of error criterion and that of the prediction rule are interrelated. The least squares (LS) criterion is usually associated with taking the subset average as the predicted value. It is implemented here by minimizing the sum of the deviances of the descendent partitions, where the deviance is defined as the sum of the squared differences between the actual values of the response variable in a partition and the average value in that partition. An alternative to LS trees is based on the least absolute deviation criterion with the median as prediction (LAD trees); the latter are known to be more robust with respect to outliers in the data [Breiman et al., 1984].
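The exhaustive split search under the LS criterion can be sketched as follows; this is an illustrative fragment under our own naming, not the original implementation, and it folds the minimum node size into the admissibility of a split.

```python
import numpy as np

def deviance(y):
    """Sum of squared deviations from the partition mean (LS criterion)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y, min_samples=6):
    """Search all predictor variables and all observed splitting values,
    minimizing the summed deviance of the two descendent partitions.
    Returns (summed deviance, variable index, split value), or None."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[1:]:    # candidate split values
            left = X[:, j] < s              # 'less than' goes left
            if left.sum() < min_samples or (~left).sum() < min_samples:
                continue
            score = deviance(y[left]) + deviance(y[~left])
            if best is None or score < best[0]:
                best = (score, j, s)
    return best
```

A node for which no admissible split exists (best_split returns None) would be declared terminal.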
 We did, however, use an ad hoc modification of the splitting criterion to better account for autocorrelation in the variables (and errors). We define below a composite criterion that considers not only the deviance but also the "diversity" of the descendent partitions, so as to enhance the possibility that each partition is composed of data coming from as many different parts of the record as possible (richness) and that the samples are distributed evenly among these parts (regularity). The rationale is to facilitate the identification of patterns that repeat themselves in time. The record is divided into l consecutive periods of equal duration. Using Shannon's entropy formula, the diversity of a partition can then be expressed as

H = −∑_{j=1}^{l} (n_j/n) ln(n_j/n),    (1)

where H is the diversity index, n is the total number of samples in the partition, and n_j is the number of samples from period j. The index H has a minimum value of zero when all of the samples belong to a single time period and attains its maximum, ln l, when the samples are evenly distributed among all time periods. The composite criterion to be minimized at each partitioning split is then defined as
where α is a weighting parameter that takes values between zero and unity, D is the deviance of the response variable in a given partition, and the superscripts P, L, and R denote the parent, left, and right child partitions, respectively.
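Purely as an illustration of how these quantities might combine, assuming a linear weighting of the child deviances, normalized by the parent deviance, against the child diversities, normalized by the parent diversity, the criterion could take a form such as

```latex
% Illustrative sketch only: assumes a linear combination normalized by
% the parent partition's values; the actual equation (2) combines the
% same quantities, but its precise weighting may differ.
C = \alpha \,\frac{D^{L} + D^{R}}{D^{P}}
  \;-\; (1 - \alpha)\,\frac{H^{L} + H^{R}}{2\,H^{P}}
```

Minimizing such a criterion favors splits that reduce the deviance while preserving the temporal diversity of the descendent partitions.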
 Once a split is selected, two child nodes are created. The samples whose value of the selected predictor variable is less than the selected split value are assigned to the left node, and the others are assigned to the right node. The growing phase terminates when all undivided nodes have been declared terminal.
 Pruning is the mechanism by which the complexity of the regression tree is controlled in order to avoid overfitting and is therefore a critical phase in building regression trees. Pruning a tree essentially means replacing some of the subtrees rooted at its internal nodes with terminal nodes (leaves), i.e., reducing its complexity. A brief description of the adopted pruning method, also called reduced error pruning [Esposito et al., 1997], follows. The criterion for pruning is based on the prediction error on an independent data set, where the prediction error of a node is calculated as the sum of the squared differences between the predicted and the observed values of the response variable. A breadth-first algorithm similar to that of the growing phase is used to explore the tree, the difference being that the tree is explored in a bottom-up manner starting with the terminal nodes at the maximal depth. During this exploration the prediction error of each internal node is compared with the sum of the prediction errors of the terminal nodes of the subtree rooted at that node, and the subtree is pruned when the latter is higher than the former. The process terminates when the root node is reached.
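Reduced error pruning can be sketched as follows. For brevity the bottom-up traversal is written recursively (post-order) rather than breadth-first, which visits the subtrees in an equivalent bottom-up order; the Node structure and the names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    prediction: float            # mean of the training samples in the node
    var: Optional[int] = None    # splitting variable (None for a leaf)
    split: float = 0.0           # splitting value
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def node_error(node, y):
    """Sum of squared differences between the node's prediction and the
    observed responses of the independent (pruning) samples reaching it."""
    return float(np.sum((y - node.prediction) ** 2))

def reduced_error_prune(node, X, y):
    """Prune bottom-up on an independent set: replace a subtree by a leaf
    whenever the subtree's error exceeds the error of the node itself."""
    if node.left is None:        # terminal node
        return node_error(node, y)
    go_left = X[:, node.var] < node.split
    subtree_err = (reduced_error_prune(node.left, X[go_left], y[go_left]) +
                   reduced_error_prune(node.right, X[~go_left], y[~go_left]))
    err = node_error(node, y)
    if subtree_err > err:        # subtree does worse: prune it
        node.left = node.right = None
        return err
    return subtree_err
```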
 Adopting these methods for deciding on the depth of branching and the partitioning of nodes remains consistent with the choice of the mean of the observations in each partition as the predicted value. However, the distribution of observed values in each partition can also provide additional information on the uncertainty associated with a prediction, in a way that is consistent with the training data set.
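For instance, prediction bounds can be read directly from the observed discharges in the terminal partition reached by a query point; a minimal sketch (names ours):

```python
import numpy as np

def leaf_quantiles(leaf_discharges, probs=(0.05, 0.5, 0.95)):
    """Empirical quantiles of the training discharges that fell in the
    same terminal partition as the point being predicted."""
    return np.quantile(np.asarray(leaf_discharges, dtype=float), probs)
```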
 The characteristics of CART methods make them candidates for the direct mapping of rainfall-runoff relationships. They are fast and conceptually simple [Breiman et al., 1984]. They are also the most interpretable among the techniques implementing function approximators [Friedman, 1994] in that they do not require the identification of any weighting coefficients (as in neural net methods) and each final partition is associated with a particular (if complex) set of hydrological conditions. To our knowledge, this is the first attempt to apply the regression tree technique to the prediction of the outputs of a dynamic hydrological system.
 One important advantage of regression trees is that the algorithm performs automatic variable selection [Breiman et al., 1984; Friedman, 1994]. Different variables may be selected in different parts of the input space; thus the method is able to exploit the local relevance of variables and a locally lower dimensionality of the mapping. These are partial responses to the “curse of dimensionality” issue. The method is also able to deal quite effectively with correlations in predictor variables [Friedman, 1994]. This can be intuitively understood as follows. Once a variable is selected and the partitioning performed, its relevance, and that of significantly correlated variables, is likely to decrease in its descendent partitions. That will encourage the choice for subsequent splits of other, less correlated, variables to the extent that their relevance is higher.
 While regression trees are able to approximate the complex functionality of nonlinear systems, they can only reflect the nature of the relationships contained within the available data and therefore will generally require extensive data sets. Another limitation is the piecewise constant nature of the predictions they produce: all observations that lie in the same partition are attributed the same predicted value, irrespective of their location in the input space or subspace. In the simplest implementation (as used here), information from data points outside the boundary of the partition is ignored, even if those points are near the point for which we want to make the prediction. The predictions also tend to be biased when extrapolating in the input space.
 Another interesting characteristic of regression trees is the sensitivity of the partitioning process to the training set [Breiman, 1996]; that is, a relatively small change in the training data set may lead to a different choice when selecting a split, which in turn may represent a significant change in the subtree rooted at that node. This is a drawback when interpretation of the selected variables is attempted, but it does not affect the predictive ability. Moreover, this property proves essential for the success of the recent improvements in predictive ability developed by Breiman [1996, 2001a, 2001b], which essentially involve building a large ensemble of trees (by either bootstrapping the data or randomizing the choice of the split) and aggregating their predictions for each sample. However, the interpretability of the partitioning process is then completely lost, which is the main reason we chose to limit ourselves to the basic method.
 It is also worth noting that the principle of recursive partitioning lends itself to combination with other concepts, yielding numerous so-called hybrid methods such as fuzzy trees [e.g., Suárez and Lutsko, 1999], multivariate adaptive regression splines (MARS) [Friedman, 1991], or flexible metric nearest-neighbor methods [e.g., Friedman, 1994]. Linear regressions have also been applied to the terminal partitions [e.g., Quinlan, 1993].
2.3. Study Site and Data
 The H. J. Andrews Experimental Forest is situated in the western Cascade Mountains (Oregon, United States). Only a brief catchment description is given here, as numerous publications describe the site in detail [e.g., Harr, 1977; Harr et al., 1982; Jones and Grant, 1996]. The watersheds WS1 and WS2 are adjacent subcatchments with areas of 1.0 and 0.6 km², respectively, and elevation ranges of 460–990 m and 530–1070 m, respectively. The slopes range between 60 and 100%. Mean annual precipitation is a function of altitude and ranges between 2300 and 2500 mm, with over 80% of the precipitation falling from November to April. Elevations between 400 and 1200 m may receive either snow or rain. Evapotranspiration accounts for about 40% of the incoming precipitation. The study area is underlain by highly weathered, deeply dissected volcanics. Soils are weakly developed, with thick organic litter horizons, deeply weathered parent materials, and high stone content. Moisture storage and transfer are characterized by high porosity and hydraulic conductivities. The vegetation of these catchments consisted mainly of 100- to 500-year-old closed-canopy Douglas fir forests. Catchment WS1 was clear-cut between 1962 and 1966 and subsequently regenerated, while catchment WS2 was left undisturbed.
 Forty-one years (1957–1998) of 15-min precipitation and discharge measurements were available for this study. As neither evapotranspiration measurements at finer timescales nor other meteorological variables from which to estimate them were available for the whole period, potential evapotranspiration was estimated by distributing a mean annual value of 850 mm through the year in a sinusoidal form. More than 1000 predictor variables were built using the five types from Table 1. The smallest value considered for the time characteristic τ was 0.25 hour; this value was progressively incremented, with the increments taken in a geometric progression having an initial value of 0.25 hour and a ratio of 1.02 (a sketch generating this grid of characteristic times is given at the end of the section). We considered a maximum memory window of 1 year. With the 1-year memory window accounting for the first year of record, the remaining 40 years were divided into two 20-year periods: the first was used for model training (calibration), and the second for model validation. Table 2 gives some statistics of the data for the two periods.
Table 2. Summary of Statistical Characteristics of Input-Output Data
 The period length adopted for the calculation of partition diversity (defined by equation (1)) was chosen equal to 2 months, equivalent to l = 60 for each of the two 10-year training subsets. This choice reflects our perception that quite long autocorrelation timescales characterize the analyzed system.
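For reference, the grid of characteristic times τ described above can be generated as follows; this is a sketch assuming, as stated, that the increments themselves follow the geometric progression.

```python
dt = 0.25                        # smallest characteristic time, hours
max_tau = 365.0 * 24.0           # 1-year memory window, hours

taus, inc = [dt], dt
while taus[-1] + inc <= max_tau:
    taus.append(taus[-1] + inc)  # each increment grows by the ratio 1.02
    inc *= 1.02
print(len(taus))                 # a few hundred values per variable type
```

A few hundred characteristic times per type, applied across the five variable types of Table 1 (with types IV and V limited to τ ≤ mΔt/2), is consistent with the more than 1000 predictors reported above.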