Tree-based iterative input variable selection for hydrological modeling

Authors

  • S. Galelli (corresponding author)

    1. Singapore-Delft Water Alliance, National University of Singapore, Singapore
    2. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy

    Correspondence: S. Galelli, Pillar of Engineering Systems & Design, Singapore University of Technology and Design, 20 Dover Drive, 138682, Singapore. (stefano_galelli@sutd.edu.sg)

  • A. Castelletti

    1. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy

Abstract

[1] Input variable selection is an important issue associated with the development of several hydrological applications. Determining the optimal input vector from a large set of candidates to characterize a preselected output might result in a more accurate, parsimonious, and, possibly, physically interpretable model of the natural process. In the hydrological context, the modeled system often exhibits nonlinear dynamics and multiple interrelated variables. Moreover, the number of candidate inputs can be very large and redundant, especially when the model reproduces the spatial variability of the physical process. The ideal input selection algorithm should therefore provide modeling flexibility, computational efficiency in dealing with high-dimensional data sets, scalability with respect to input dimensionality, and minimum redundancy. In this paper, we propose the tree-based iterative input variable selection algorithm, a novel hybrid model-based/model-free approach specifically designed to fulfill these four requirements. The algorithm structure provides robustness against redundancy, while the tree-based nature of the underlying model ensures the other key properties. The approach is first tested on a well-known benchmark case study to validate its accuracy and subsequently applied to a real-world streamflow prediction problem in the upper Ticino River Basin (Switzerland). Results indicate that the algorithm is capable of selecting the most significant and nonredundant inputs in different testing conditions, including the real-world large data set characterized by the presence of several redundant variables. This permits one to identify a compact representation of the observational data set, which is key to improving the model performance and assisting with the interpretation of the underlying physical processes.

1. Introduction

[2] The problem of input variable selection arises every time one wants to model the relationship between a variable of interest, or predictand, and a subset of potential explanatory variables, or predictors, but there is uncertainty about which subset to use among a number, usually large, of candidate sets available [George, 2000]. The selection of the most relevant input variables is a recurrent problem in hydrologic and water resources applications involving the resolution of single or multiple regression problems, such as rainfall-runoff modeling, prediction in ungauged basins, and water quality modeling. The difficulties in variable selection for these applications come primarily from three sources [May et al., 2008]: (i) the complexity of the unknown functional relationship between inputs and output; (ii) the number of available variables, which may be very large; and (iii) the cross correlation between candidate inputs, which induces redundancy [Maier et al., 2010]. If a linear relationship characterizes the underlying system, well-established selection methods exist to obtain a candidate subset of the input variables [e.g., Miller, 1990]. However, the assumption of linear dependency between the inputs and the output is overly restrictive for most real physical systems [see, e.g., Tarquis et al., 2011]. In addition, advances in monitoring systems, from remote sensing techniques to pervasive real and virtual sensor networks [e.g., Hart and Martinez, 2006; Hill et al., 2011], have made available an increasingly large amount of data at the local and global scale and at progressively finer temporal and spatial resolution, thus not only increasing data set dimensions from dozens to tens or hundreds of thousands of variables but also adding considerably to data set redundancy. The objective of variable selection is threefold [Guyon and Elisseeff, 2003]: (i) improving model performance, by avoiding the interference of nonrelevant or redundant information and by more effectively exploiting the data available for model calibration; (ii) providing faster and more cost-effective models; and (iii) assisting in the interpretation of the underlying process by enabling a more parsimonious and compact representation of the observational data set. In fact, reducing the input space dimension maximizes the generalization capability of the constructed model, while parsimonious and compact representations of the data allow for an easier interpretation of the underlying phenomenon [Kohavi and John, 1997].

[3] Input variable selection methods can be divided into model-based (or wrapper) and model-free (or filter) approaches (see Das [2001], Guyon and Elisseeff [2003], and Maier et al. [2010], among others). The model-based approach relies on the idea of calibrating and validating a number of models with different sets of inputs and selecting the set that ensures the best model performance. The candidate inputs are evaluated in terms of the prediction accuracy of a preselected underlying model. The problem can be solved by means of global optimization techniques, used to find the combination of input variables that maximizes the underlying model performance, or by stepwise selection (forward selection/backward elimination) methods, where inputs are systematically added/removed until the model performance no longer improves. The main drawback of this approach lies in its computational requirements [Kwak and Choi, 2002; Chow and Huang, 2005]: a large number of calibration and validation processes must be performed to single out the best combination of inputs, so the method does not scale well to large data sets. Moreover, the input selection result depends on the predefined model class and architecture. Thus, the optimality of a set of inputs selected with a particular model is not guaranteed for another one, and this restricts the applicability of the selected set [Maier et al., 2010]. On the other hand, model-based approaches generally achieve better performance, since they are tuned to the specific interactions between the model class and the data. Unlike the model-based approach, model-free algorithms base the variable selection directly on the information content of the candidate input data set, as measured by interclass distance, statistical dependence, or information-theoretic measures (e.g., the mutual information index [Peng et al., 2005]). Computational efficiency is a strong argument in favor of model-free methods; however, the significance measure is generally monotonic and, thus, without a predefined cutoff criterion, the algorithm tends to select very large subsets of input variables, with a high risk of redundancy.

[4] In hydrological and water resources studies, input variable selection has mainly been used for two classes of problems: (i) problems involving the identification of a nonlinear regression between time-independent explanatory factors and a relevant statistic of a given hydrological output characteristic and (ii) problems concerning the selection of the most relevant time-varying explanatory variables to characterize, through a dynamic model, the development over time of a hydrological variable of interest. The first class is predominantly populated by statistical regionalization methods, in which regional estimates of hydrological model parameters are used to characterize the hydrological response of ungauged watersheds. Both model-free approaches, such as principal component analysis [e.g., Alcázar and Palau, 2010; Salas et al., 2011; Wan Jaafar et al., 2011], and model-based methods, such as stepwise regression [e.g., Heuvelmans et al., 2006; Barnett et al., 2010], have long been used to relate runoff characteristics to climate and watershed descriptors. The use of variable selection for statistical downscaling of meteorological data can also be classified under this category [e.g., Traveria et al., 2010; Phatak et al., 2011, and references therein]. More recently, novel causal variable selection methods [Ssegane et al., 2012] have been demonstrated to outperform stepwise selection approaches in terms of accuracy in characterizing the physical process, while showing a lower predictive potential.

[5] The second class of problems comprises a larger variety of application fields, ranging from rainfall prediction [e.g., Sharma et al., 2000] to streamflow modeling [e.g., Wang et al., 2009], evaporation estimation [Moghaddamnia et al., 2009], and water quality modeling [e.g., Huang and Foo, 2002]. Given the usually high number of candidate input variables in these problems, model-free methods are generally preferred over model-based approaches [Maier et al., 2010], which are mostly used to determine the optimal input set for specific classes of models [e.g., Noori et al., 2011, and references therein]. Starting from Sharma [2000], who expanded the mutual information index into the computationally more efficient and reliable partial mutual information (PMI), traditional nonlinear information-theoretic selection criteria have been revisited and adapted to a number of increasingly complex hydrological applications. Bowden et al. [2005a] improve the PMI criterion of Sharma [2000] using artificial neural networks and apply it to salinity prediction in the River Murray, South Australia [Bowden et al., 2005b]. The method is further elaborated by May et al. [2008], who propose an alternative termination criterion to make the PMI more efficient, and, subsequently, by Fernando et al. [2009], who introduce the use of a more efficient estimator (shifted histograms) of the mutual information. Finally, Hejazi and Cai [2009] illustrate an improved, computationally efficient, minimum redundancy maximum relevance approach to select the most significant inputs among 121 candidates to predict the daily release from 22 reservoirs in California.

[6] In this paper, we build on these previous works and propose a novel hybrid approach to input variable selection, combining model-free and model-based methods, called tree-based iterative input variable selection (IIS). IIS incorporates some of the features of model-based approaches into a fast model-free method able to handle very large candidate input sets. The information-theoretic selection criterion of model-free methods is replaced by a ranking-based measure of significance [Wehenkel, 1998]: each candidate input is scored by estimating its contribution, in terms of variance reduction, to the building of an underlying model of the preselected output. First, unlike information-theoretic selection, ranking-based evaluation does not require any assumption about the statistical properties of the input data set (e.g., Gaussian distribution) and, thus, can be applied to any sort of sample. Second, it does not rely on computationally intensive methods (e.g., bootstrapping) to estimate the information content of the data and, thus, is generally faster and more efficient. Nonparametric tree-based regression methods, namely, extremely randomized trees [Geurts et al., 2006], are adopted as the underlying model family since, thanks to their ensemble nature [e.g., Sharma and Chowdhury, 2011], they perform particularly well in characterizing strongly nonlinear relationships and provide more flexibility and scalability than parametric models (e.g., artificial neural networks). The ranking-based selection is embedded into a stepwise forward selection model-based approach evaluated by k-fold cross validation [Allen, 1974]. Forward selection is appropriate because in most real-world hydrological data sets the number of significant variables is a small proportion of the total number of available variables; in such situations, forward selection is far less time consuming than backward elimination [Das, 2001]. In the rest of the paper, we first present the IIS algorithm and illustrate its building blocks. Then, we validate the IIS selection accuracy on the synthetic test problems used by Sharma [2000], Bowden et al. [2005a], May et al. [2008], and Hejazi and Cai [2009]. Following that, we demonstrate the algorithm on a real-world case study of streamflow prediction in the upper Ticino River Basin (Switzerland), a subalpine catchment characterized by extremely variable weather conditions, where both rainfall and snowmelt contribute significantly to the high flow variations. A comparison between the variables selected by the IIS algorithm and by the PMI-based input selection (PMIS) of May et al. [2008] is also provided. Finally, conclusions are drawn in section 5.

2. Methods and Tools

[7] Given a sample data set $\mathcal{D}$, composed of N observations of the output variable y to be explained and of the n candidate inputs $x_1, \ldots, x_n$, the model-based input variable selection problem can be formulated as follows: find the subset $\bar{X} \subseteq \{x_1, \ldots, x_n\}$ of input variables that provides the best performance of an underlying model $\hat{y} = f(\bar{X})$, as measured by a suitable metric $\mathcal{J}$. Within the proposed hybrid selection approach, the subset $\bar{X}$ is incrementally built using the information content in the data with a ranking-based procedure and then validated by a model-based forward selection process, as illustrated below.

2.1. Iterative Input Variable Selection

[8] The IIS algorithm is composed of three steps, as shown in Figure 1:

Figure 1.

Flowchart of the Iterative Input Variable Selection (IIS) algorithm.

[9] Step 1. Given the sample data set $\mathcal{D}$ and the n candidate inputs, the IIS algorithm runs an input ranking (IR) algorithm to sort the n candidate inputs according to a nonlinear statistical measure of significance (e.g., the explained variance). In principle, the first variable in the ranking should be the most significant in explaining the output. In practice, in the presence of several potentially significant, but redundant, inputs, their contribution to the output explanation is equally partitioned among them, and they might not be listed in the very top positions. To reduce the risk of misselection, the first p variables $x_{(1)}, \ldots, x_{(p)}$ in the ranking are individually evaluated in the following step.

[10] Step 2. The relative significance of the first p-ranked variables is assessed against the observed output. To this end, p single-input single-output (SISO) models $\hat{y}_j = f_j(x_{(j)})$, with $j = 1, \ldots, p$, are identified with an appropriate model building (MB) algorithm and compared in terms of a suitable distance metric between the output y and each SISO model prediction $\hat{y}_j$. The best performing input $x^*$ among the p considered is added to the set $\bar{X}$ of the variables selected to explain y.

[11] Step 3. The MB algorithm is then run to identify a multi-input single-output (MISO) model $\hat{y} = f(\bar{X})$ mapping the variables $\bar{X}$ selected so far into the output y.

[12] The procedure is repeated using the residuals $r = y - \hat{y}$ as the new output variable in steps 1 and 2, and these operations are iterated until either the best variable returned by the IR algorithm is already in the set $\bar{X}$ or the performance of the underlying model f, as measured by a suitable metric $\mathcal{J}$, does not significantly improve. The algorithm is then terminated, and $\bar{X}$ is the optimal set. The metric $\mathcal{J}$ can be a suitable accuracy index (e.g., the coefficient of determination or the mean-squared error) or a more sophisticated metric balancing accuracy and parsimoniousness (e.g., the Akaike information criterion, the Bayesian information criterion, or the Young identification criterion [see Young, 2011, and references therein]). A tabular version of the algorithm is given as Algorithm 1, while more details can be found in Galelli [2010].

Algorithm 1. Iterative Input Selection (IIS) algorithm.

Require: The data set $\mathcal{D}$, the set $X = \{x_1, \ldots, x_n\}$ of candidate input variables, and the output variable y to be explained.

Ensure: The set $\bar{X}$ of variables selected to explain y.

Initialize: Set the variable $\tilde{y}$ equal to y.

repeat

-With an Input Ranking (IR) algorithm, select the set $X_p$ of the p most relevant input variables to explain $\tilde{y}$.

repeat

-With a Model Building (MB) algorithm, estimate a SISO model $f_j$ that explains $\tilde{y}$ using as input the jth component $x_{(j)}$ of the set $X_p$.

-Compute the distance metric $\mathcal{J}_j$ between $\tilde{y}$ and the model output $\hat{\tilde{y}}_j$.

until j = p

-Select the most relevant variable $x^*$ in the set $X_p$ (i.e., the one with the best $\mathcal{J}_j$), and add $x^*$ to the set $\bar{X}$.

-Estimate a MISO model f that explains y using as input the set $\bar{X}$, and compute the residuals $r = y - \hat{y}$ between y and the model output $\hat{y}$.

-Compute the distance metric $\mathcal{J}$ between y and the model output $\hat{y}$.

-Evaluate the variation $\Delta\mathcal{J}$ of the distance metric between the current and the previous iteration.

-Set $\tilde{y}$ equal to the residuals r.

until $\Delta\mathcal{J} < \varepsilon$ or $x^* \in \bar{X}$

return $\bar{X}$

[13] The evaluation of the metric $\mathcal{J}$ follows a k-fold cross-validation approach [Allen, 1974]: the training data set is randomly split into k mutually exclusive subsets of equivalent size, and the MB algorithm is run k times. Each time, the underlying model is validated on one of the k folds and calibrated using the remaining k - 1 folds. The estimated prediction accuracy is then the average value of the metric $\mathcal{J}$ over the k validations. The k-fold cross validation is aimed at estimating the ability of the model to capture the behavior of unseen or future observations from the same underlying process and, as such, minimizes the risk of overfitting the data [Wan Jaafar et al., 2011, and references therein].

[14] The reevaluation of the ranking on the model residuals every time a candidate variable is selected ensures that all the candidates that are highly correlated with the selected variable, and whose additional explanatory power is thus largely exhausted, lose rank and are discarded. This strategy reinforces the SISO model-based evaluation in step 2 against the selection of redundant variables, and it is independent of the model building (MB) and input ranking (IR) algorithms adopted. The choice of these two algorithms is nonetheless fundamental to equip the IIS algorithm with the other desirable properties of modeling flexibility, i.e., the ability to characterize strongly nonlinear relationships; computational efficiency; and scalability with respect to input dimensionality, i.e., the ability to handle many input variables with different ranges of variability. Among the many MB and IR alternatives available, we combined the IIS algorithm with Extremely Randomized Trees (Extra-Trees), a tree-based method proposed by Geurts et al. [2006], which was empirically demonstrated to outperform other MB approaches in terms of the abovementioned properties. In addition, Extra-Trees, like several other tree-based ensemble methods [Jong et al., 2004], can be used directly as an IR procedure, since their particular structure can be exploited to infer the relative importance of the input variables and to order them accordingly [Wehenkel, 1998; Fonteneau et al., 2008].
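To make the procedure concrete, the following minimal sketch shows one possible implementation of the IIS loop in Python, using scikit-learn's ExtraTreesRegressor as both the IR and the MB algorithm. The function name, default parameter values, and the use of cross-validated R^2 as the metric $\mathcal{J}$ are illustrative assumptions, not the authors' original code.

```python
# Minimal sketch of the IIS loop (illustrative, not the authors' code).
# Ranking uses impurity-based feature importances; SISO/MISO models are
# scored by k-fold cross-validated R^2.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

def iis(X, y, p=5, k=5, eps=0.001, M=500, nmin=5):
    """X: (N, n) candidate inputs; y: (N,) output to explain."""
    n = X.shape[1]
    selected, residual, prev_score = [], y.copy(), -np.inf
    while True:
        # Step 1: rank all candidate inputs on the current residual.
        ranker = ExtraTreesRegressor(n_estimators=M, min_samples_split=nmin,
                                     max_features=n).fit(X, residual)
        ranking = np.argsort(ranker.feature_importances_)[::-1]
        # Step 2: one SISO model per top-p candidate, evaluated in k folds.
        siso = [cross_val_score(ExtraTreesRegressor(n_estimators=M,
                                                    min_samples_split=nmin,
                                                    max_features=1),
                                X[:, [i]], residual, cv=k, scoring='r2').mean()
                for i in ranking[:p]]
        best = int(ranking[:p][int(np.argmax(siso))])
        if best in selected:            # termination: best input already chosen
            break
        selected.append(best)
        # Step 3: MISO model on all inputs selected so far.
        miso = ExtraTreesRegressor(n_estimators=M, min_samples_split=nmin,
                                   max_features=len(selected))
        score = cross_val_score(miso, X[:, selected], y,
                                cv=k, scoring='r2').mean()
        if score - prev_score < eps:    # termination: no significant gain
            selected.pop()
            break
        prev_score = score
        miso.fit(X[:, selected], y)     # refit to compute the residuals
        residual = y - miso.predict(X[:, selected])
    return selected
```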

2.2. Extra-Trees

[15] Extra-Trees is a nonparametric tree-based regression method already tested in a wide range of applications, such as image classification [Marée et al., 2005], bioinformatics [Marée et al., 2006], environmental modeling [Jung et al., 2009; Castelletti et al., 2010a], and water reservoir operation [Castelletti et al., 2010b]. Tree-based regressors are all based on the idea of decision trees: tree-like structures, composed of decision nodes, branches, and leaves, which form a cascade of rules leading to numerical values. The tree is obtained by first partitioning the input data set into two subsets at the top decision node, with a proper splitting criterion, thus creating the first two branches. The splitting process is then repeated recursively on each derived subset, until some termination criterion is met, e.g., the numerical values belonging to a subset vary only slightly or only a few elements remain. When this process is over, the tree branches represent the hierarchical structure of the subset partitions, while the leaves are the smallest subsets associated with the terminal branches. Each leaf is finally labeled with a numerical value. Tree-based methods include both deterministic methods (e.g., classification and regression trees [Breiman et al., 1984] and M5 [Quinlan, 1992]) and randomized methods (e.g., Bagging predictors [Breiman, 1996], the Random Subspace method [Ho, 1998], Random Forests [Breiman, 2001], and PERT [Cutler and Guohua, 2001]), in which the splitting process is partly randomized. Extra-Trees belong to this second class of approaches and are an ensemble method, as explained next.

2.2.1. Model Building

[16] Given a regression problem with an output variable y, n inputs $x_1, \ldots, x_n$, and a training data set $\mathcal{D}$ composed of N input-output observations, the Extra-Trees MB algorithm grows an ensemble of M trees using a top-down approach. For each tree, the decision nodes are split using the following rule: K alternative "cut directions," i.e., input variables $x_i$, with $i \in \{1, \ldots, n\}$, candidate to be the argument of the node splitting criterion, are randomly selected and, for each one, a random cut point $s_i$ is chosen; the variance reduction (or, equivalently, the coefficient of determination) is computed for each cut direction, and the cut direction maximizing this score is adopted to split the node. Let $\mathcal{N}_j$ be the jth nonterminal node in a tree composed of $\Omega$ nonterminal nodes (i.e., $j = 1, \ldots, \Omega$) and $\mathcal{D}_j$ the subset of samples reaching it; the variance reduction $\Delta V(\mathcal{N}_j)$ associated with node $\mathcal{N}_j$ is then defined as

$$\Delta V(\mathcal{N}_j) = \operatorname{Var}(y \mid \mathcal{D}_j) - \frac{|\mathcal{D}_j^L|}{|\mathcal{D}_j|}\operatorname{Var}(y \mid \mathcal{D}_j^L) - \frac{|\mathcal{D}_j^R|}{|\mathcal{D}_j|}\operatorname{Var}(y \mid \mathcal{D}_j^R) \qquad (1)$$

where $\mathcal{D}_j^L$ and $\mathcal{D}_j^R$ are the two subsets of $\mathcal{D}_j$ satisfying the conditions $x_i < s_i$ and $x_i \geq s_i$, respectively, with $s_i$ being the randomly selected cut point. The algorithm stops partitioning a node if its cardinality is smaller than a predefined threshold $n_{min}$, and the node becomes a leaf in the tree structure. Each leaf is assigned a value, obtained as the average of the outputs y associated with the inputs that fall in that leaf. The estimates produced by the single trees are finally aggregated with an arithmetic average over the ensemble of M trees, and the associated coefficient of determination of the ensemble is

$$R^2 = 1 - \frac{\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2}{\sum_{t=1}^{N}\left(y_t - \bar{y}\right)^2}, \qquad \hat{y}_t = \frac{1}{M}\sum_{m=1}^{M}\hat{y}_t^{(m)} \qquad (2)$$

[17] The rationale behind the approach is that the combined use of randomization and ensemble averaging provides more effective variance reduction than other randomized methods, while minimizing the bias of the final estimate [Geurts et al., 2006]. This means that the final model structure does not require any of the postprocessing methods (e.g., pruning or smoothing) commonly used to reduce the output variance (and thus the risk of overfitting) of other tree-based methods [Jothiprakash and Kote, 2011]. Increasingly high values of M reduce the variance of the final estimate but also significantly add to the computational requirements of the MB algorithm. K regulates the level of randomness in the tree-building process: its optimal default value for regression problems is equal to the number n of inputs, so that all inputs are candidate cut directions at each node. Finally, the threshold $n_{min}$ is used to balance bias and variance reduction: large values of $n_{min}$ lead to small trees, with high bias and small variance; conversely, low values of $n_{min}$ lead to fully grown trees, which may overfit the data. The optimal tuning of $n_{min}$ can depend on the level of noise in the training data set: the noisier the outputs, the higher the optimal value of $n_{min}$. Although this tuning might require some experiments, Geurts et al. [2006] empirically demonstrated that a value of $n_{min}$ equal to 5 is a robust choice in a broad range of conditions. For a detailed analysis of the sensitivity of the Extra-Trees performance to the parameters M, K, and $n_{min}$, with an application to a streamflow modeling problem, the reader is referred to Galelli and Castelletti [2013].
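For readers who wish to experiment with this setting, the three parameters map onto the hyperparameters of scikit-learn's implementation of Extra-Trees as sketched below. The correspondence is an assumption of this sketch (the original experiments may rely on a different implementation): M corresponds to n_estimators, K to max_features, and $n_{min}$ to min_samples_split.

```python
# Hedged example: Extra-Trees with the paper's default setting
# (M = 500, K = number of inputs, n_min = 5) in scikit-learn terms.
from sklearn.ensemble import ExtraTreesRegressor

n_inputs = 15                      # e.g., the 15 candidate lags of section 3
model = ExtraTreesRegressor(
    n_estimators=500,              # M: number of trees in the ensemble
    max_features=n_inputs,         # K: cut directions drawn at each node
    min_samples_split=5,           # n_min: minimum cardinality to split a node
    bootstrap=False,               # each tree is grown on the full sample
)
```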

[18] From the computational point of view, the complexity of the Extra-Trees building procedure is on the order of $N \log N$ with respect to the sample size, while the computational time increases linearly with M and K and decreases logarithmically with increasing $n_{min}$, meaning that the approach remains computationally efficient even though it is based on the construction of an ensemble of trees [Geurts et al., 2006]. This is because the splitting rule adopted is very simple compared to splitting rules that locally optimize the cut points, such as the one used in classification and regression trees [Breiman et al., 1984].

2.2.2. Input Ranking

[19] The particular structure of Extra-Trees can be exploited to rank the importance of the n input variables in explaining the selected output behavior. This approach, originally proposed by Wehenkel [1998], is based on the idea of scoring each input variable by estimating the variance reduction associated with it when the training data set $\mathcal{D}$ is propagated through the M trees composing the ensemble. More precisely, the relevance $I(x_i)$ of the ith input variable $x_i$ in explaining the output y can be evaluated as follows:

$$I(x_i) = \sum_{m=1}^{M}\sum_{j=1}^{\Omega_m} \delta(x_i, \mathcal{N}_j)\,|\mathcal{D}_j|\,\Delta V(\mathcal{N}_j) \qquad (3)$$

where $\delta(x_i, \mathcal{N}_j)$ is equal to 1 if the variable $x_i$ is used to split the node $\mathcal{N}_j$ (and 0 otherwise), $\Omega_m$ is the number of nonterminal nodes of the mth tree, $|\mathcal{D}_j|$ is the number of samples reaching node $\mathcal{N}_j$, and $\Delta V(\mathcal{N}_j)$ is the variance reduction associated with node $\mathcal{N}_j$ (see equation (1)). Finally, the input variables $x_1, \ldots, x_n$ are sorted by decreasing values of their relevance.
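scikit-learn exposes an impurity-based score of this kind as feature_importances_, computed from the same per-node variance reduction (normalized to sum to one, which leaves the ranking unchanged). A minimal ranking step, assuming a candidate matrix X of shape (N, n) and an output vector y are already loaded, might read:

```python
# Minimal sketch: rank candidate inputs by ensemble variance reduction.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

model = ExtraTreesRegressor(n_estimators=500, max_features=X.shape[1],
                            min_samples_split=5).fit(X, y)
relevance = model.feature_importances_     # variance-reduction-based scores
ranking = np.argsort(relevance)[::-1]      # indices by decreasing relevance
```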

3. Synthetic Test Problems

[20] To assess the competence of the IIS algorithm in selecting the most significant and nonredundant subset of inputs, the method is tested on six synthetic data sets with a priori known dependence attributes. The first five data sets are generated by the autoregressive models introduced by Sharma [2000] and subsequently used by Bowden et al. [2005a] and Hejazi and Cai [2009] for the evaluation of input variable selection algorithms in hydrological and water resources problems. The first three are autoregressive (AR) models of different orders, while the other two are nonlinear threshold autoregressive (TAR) models. The sixth data set is produced by a nonlinear (NL) model introduced by Bowden et al. [2005a]. The model equations are as follows:

$$x_t = 0.9\,x_{t-1} + 0.866\,e_t \qquad \text{AR(1)} \qquad (4)$$

$$x_t = 0.6\,x_{t-1} - 0.4\,x_{t-4} + e_t \qquad \text{AR(4)} \qquad (5)$$

$$x_t = 0.3\,x_{t-1} - 0.6\,x_{t-4} - 0.5\,x_{t-9} + e_t \qquad \text{AR(9)} \qquad (6)$$

$$x_t = \begin{cases} -0.9\,x_{t-3} + 0.1\,e_t & \text{if } x_{t-3} \leq 0 \\ 0.4\,x_{t-3} + 0.1\,e_t & \text{if } x_{t-3} > 0 \end{cases} \qquad \text{TAR(1)} \qquad (7)$$

$$x_t = \begin{cases} -0.5\,x_{t-6} + 0.5\,x_{t-10} + 0.1\,e_t & \text{if } x_{t-6} \leq 0 \\ 0.8\,x_{t-10} + 0.1\,e_t & \text{if } x_{t-6} > 0 \end{cases} \qquad \text{TAR(2)} \qquad (8)$$

$$y = f_{NL}(x_2, x_6, x_9), \ \text{a strongly nonlinear function of } x_2, x_6, \text{ and } x_9 \text{ [Bowden et al., 2005a]} \qquad \text{NL} \qquad (9)$$

[21] As in the previous studies, in all the test models, $e_t$ is a standard Gaussian random variable, with zero mean and unit standard deviation. For the first five models, 520 data points are generated, with the first 20 points discarded to reduce the effect of initialization. The first 15 lags of the data (i.e., $x_{t-1}, \ldots, x_{t-15}$) are considered as candidate inputs. As for the nonlinear model, 15 standard Gaussian random variables (i.e., $x_1, \ldots, x_{15}$) are generated, each one consisting of 500 data points. The data set generation process is repeated 30 times for each model to provide a statistical analysis of the IIS algorithm results.
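As an illustration, the following snippet generates one instance of the AR(9) data set of equation (6) and assembles the 15-lag candidate matrix; the variable names and the random seed are arbitrary.

```python
# Generate one AR(9) instance (520 points, first 20 discarded) and
# build the 15-lag candidate input matrix described in the text.
import numpy as np

rng = np.random.default_rng(42)
T, burn, lags = 500, 20, 15
x = np.zeros(T + burn)
e = rng.standard_normal(T + burn)
for t in range(9, T + burn):
    x[t] = 0.3 * x[t-1] - 0.6 * x[t-4] - 0.5 * x[t-9] + e[t]
x = x[burn:]                                          # discard initialization
X = np.column_stack([np.roll(x, l) for l in range(1, lags + 1)])[lags:]
y = x[lags:]                                          # output aligned with its lags
```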

[22] The IIS algorithm is run on each independent instance with different configurations of its parameters p, k, and ε. The number p of SISO models evaluated at each iteration (step 2) and the number k of folds adopted for the k-fold cross-validation process are set to 1, 5, and 10 and to 2, 4, 6, 8, and 10, respectively. The metric $\mathcal{J}$ used to evaluate the SISO and MISO model performance (steps 2 and 3) is the coefficient of determination R^2. Since R^2 lies in the range [0,1], the algorithm tolerance ε is varied in the range [0,0.1], with a variation step equal to $10^{-4}$. When ε is equal to 0.1, the algorithm is stopped if the selection of a further variable leads to an increase of R^2 lower than 0.1. On the other hand, when ε is equal to 0, the algorithm is stopped only if the selection of a further variable leads to a decrease of R^2, namely, when the underlying MISO model is brought to an overfitting condition. Considering the values set for the parameters p, k, and ε, a total of about $15 \times 10^3$ different parameterizations are explored for each instance of each model.

[23] As for the Extra-Trees setting, the default values of the three parameters M, K, and $n_{min}$ are set following the indications of Geurts et al. [2006] and the authors' subsequent experience [Castelletti et al., 2010a; Galelli and Castelletti, 2013]: the number M of trees in an ensemble is 500, the number K of alternative cut directions is equal to the number of inputs (i.e., 1 and 15 for the SISO and MISO evaluations, respectively), and the minimum cardinality $n_{min}$ for splitting a node is 5.

[24] The overall results of this benchmarking study are summarized in Figures 2 and 3, which show the average selection accuracy achieved on the data sets generated with the linear models (i.e., AR(1), AR(4), and AR(9), Figure 2) and the nonlinear ones (i.e., TAR(1), TAR(2), and NL, Figure 3). Results for the AR(1) and AR(4) models show that the IIS algorithm produces a correct specification (i.e., only the exact input variables are selected) for a very broad range of parameter values. For values of ε and k larger than 0.02 and 4, respectively, the algorithm achieves 100% accuracy. The value of ε is key in determining the difference between correct specification and overspecification: when the value of ε is too small, the algorithm is not sufficiently sensitive to irrelevant inputs and tends to select a larger set of input variables. The value of k is less important in this selection process, although it is evident that small values of k, such as 2 or 4, can slightly decrease the algorithm accuracy. This is because the adoption of large folds (small values of k) does not allow the underlying model to capture the behavior of unseen observations.

Figure 2.

Average selection accuracy (in percentage) of IIS algorithm for linear data sets (i.e., AR(1), AR(4) and AR(9)) with 500 data points.

Figure 3.

Average selection accuracy (in percentage) of IIS algorithm for nonlinear data sets (i.e., TAR(1), TAR(2), and NL) with 500 data points.

[25] These large "islands of high performance" shrink for the AR(9) model. In this case, the algorithm can still achieve high accuracy (around 90% correct specification), but only for values of ε between 0.01 and 0.04. As for the previous models, a small value of ε is associated with overspecification, while for a large value of ε the algorithm shows underspecification and provides an incomplete set of inputs. This can be explained by considering the nature of the AR(9) model. While the AR(1) and AR(4) models are characterized by few significant variables ($x_{t-1}$, and $x_{t-1}$ and $x_{t-4}$, respectively), the AR(9) model has a larger number of useful inputs (i.e., $x_{t-1}$, $x_{t-4}$, and $x_{t-9}$), which can be discarded when the algorithm sensitivity is reduced by assuming a large value of ε. In all three cases, the value of p appears to be less important than k and ε. This is because p is used to minimize the risk of selecting redundant inputs, while the three data sets are essentially characterized by a high level of noise rather than by redundant information.

[26] The results on the data sets generated with the linear models are in line with those found by May et al. [2008], who obtained a selection accuracy of about 95% when applying the PMIS algorithm to data sets of the same size (500 samples). Interestingly, the PMIS algorithm shows its best performance when using as termination criterion the tabulated critical values of MI, which inherently make an assumption regarding the distribution of the data; this may affect the input selection results in case of deviations from the assumed distribution. The use of Extra-Trees as the underlying model, on the other hand, does not require any assumption about the statistical properties of the input data sets.

[27] The results obtained for the two nonlinear threshold models (i.e., TAR(1) and TAR(2), Figure 3) are similar to those obtained for the AR(9) one. For these two models, the islands of good performance are bounded by values of ε between 0.01 and 0.04 and are slightly enlarged by values of k larger than 4. When the IIS algorithm parameters are within these ranges, the accuracy is about 90% correct specification. Again, as for the AR(9) model, a small value of ε is associated with overspecification, while for a large value of ε the algorithm tends to provide an incomplete set of selected variables. Results for the NL model (Figure 3) are broadly in line with those found for the other nonlinear models, but with a significant reduction in the islands of high performance. The highest value of correct specification is around 60% and, most importantly, this performance is achieved for a narrow range of the algorithm parameters. In particular, the IIS algorithm provides an incomplete set of selected variables for values of ε larger than 0.01. In such a difficult test, where different variables (i.e., $x_2$, $x_6$, and $x_9$) are related by a strongly nonlinear function, a small increase in the algorithm tolerance can easily lead to the misselection of at least one variable. Again, these results are in line with May et al. [2008], who found a selection accuracy of about 75% (averaged over the three nonlinear data sets) when using the tabulated critical values of MI and almost 100% with the Akaike information criterion. The application to nonlinear data sets also shows that the amount of available data (i.e., the number of observations N) and the number k of folds have an important impact, as discussed in the supporting information, where the algorithm sensitivity to the data set length is described.

4. Case Study—The Upper Ticino River Basin

[28] We also evaluated the IIS algorithm on a real-world case study, where it was used to identify the most relevant input variables for daily streamflow prediction in the upper Ticino River Basin (Switzerland).

4.1. Study Site and Data

[29] The upper Ticino River Basin (Figure 4) extends for 1515 km2 from the alpine continental divide to the northern shores of Lake Maggiore (Switzerland). It is characterized by significant variations in hydroclimatic conditions and by pronounced orographic heterogeneity. The altitudinal variation in the basin ranges from 220 to 3402 m a.s.l., with more than 80% of the area located higher than 1000 m a.s.l. and a mean slope exceeding 27%. The dominance of a subalpine climate regime results in extremely variable weather conditions, which cause notable floods and drought periods.

Figure 4.

The upper Ticino River Basin (Switzerland) showing the measurement stations used in the input selection problem: S1, Robiei; S2, Piotta; S3, Acquarossa; S4, San Bernardino; S5, Magadino.

[30] The hydrometeorological observational data used in this study consist of time series covering a 4 year period (2004 and 2007–2009). As shown in Figure 4, the meteorological data are collected at five stations located in different parts of the basin, while the streamflow is measured at the Bellinzona station. The data set consists of daily streamflow (y), precipitation (p), maximum and minimum temperature ($T_{max}$ and $T_{min}$), maximum and minimum relative air moisture ($U_{max}$ and $U_{min}$), net solar radiation ($R_n$, calculated from the number of sun hours/overall solar radiation), and wind speed (u). Moreover, the daily potential evapotranspiration (ET) is computed for each meteorological station using the recorded meteorological data and the FAO-Penman-Monteith equation. For further details, see Table 1.

Table 1. Summary of the Hydrometeorological Observational Data Used in This Study (Daily Values)^a

Variable | Description | Unit of Measurement
$y$ | Streamflow | m^3/s
$p^i$ | Precipitation | mm
$ET^i$ | Potential evapotranspiration | mm
$T_{max}^i$ | Maximum temperature | °C
$T_{min}^i$ | Minimum temperature | °C
$U_{max}^i$ | Maximum relative air moisture | %
$U_{min}^i$ | Minimum relative air moisture | %
$R_n^i$ | Net solar radiation | MJ/m^2
$u^i$ | Wind speed at 2 m height | m/s

^a The superscript i indicates the station, as detailed in Figure 4.
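The paper does not restate the formula; for the reader's convenience, the standard FAO-56 form of the daily reference evapotranspiration is

```latex
ET_0 = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\frac{900}{T + 273}\,u_2\,(e_s - e_a)}
            {\Delta + \gamma\,(1 + 0.34\,u_2)}
```

where $\Delta$ is the slope of the saturation vapor pressure curve (kPa/°C), $\gamma$ the psychrometric constant (kPa/°C), T the mean daily air temperature at 2 m (°C), $u_2$ the wind speed at 2 m (m/s), $e_s - e_a$ the vapor pressure deficit (kPa), $R_n$ the net radiation and G the soil heat flux (both MJ/m^2/day, with G often taken as zero at the daily scale); $ET_0$ is in mm/day.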

4.2. Candidate Input Variables

[31] To assess the ability of the IIS algorithm to select the most significant input variables for predicting the daily streamflow at time t + 1 (i.e., $y_{t+1}$), two different experiments are considered.

4.2.1. Experiment 1

[32] The candidate input variables considered are the precipitation $p_t^i$ and $p_{t-1}^i$ and the potential evapotranspiration $ET_t^i$ and $ET_{t-1}^i$, i.e., with one and two days of temporal delay with respect to $y_{t+1}$, at the five measurement stations available ($i = S1, \ldots, S5$; see Figure 4). No a priori assumption is made on the spatial dynamics of the hydrometeorological process (e.g., spatial aggregation). Moreover, following Young et al. [2006], the previous day streamflow $y_t$ is also considered. This results in a total of 21 candidate inputs.

4.2.2. Experiment 2

[33] The precipitation and streamflow variables are the same as in Experiment 1, while the potential evapotranspiration is replaced by the variables appearing in the FAO-Penman-Monteith formula. The input selection exercise therefore considers the previous day streamflow $y_t$ and all the variables with one and two days of temporal delay at the five meteorological stations ($i = S1, \ldots, S5$), i.e., the precipitation $p^i$, the maximum and minimum temperature $T_{max}^i$ and $T_{min}^i$, the maximum and minimum relative air moisture $U_{max}^i$ and $U_{min}^i$, the net solar radiation $R_n^i$, and the wind speed $u^i$. In this case, the number of candidate input variables is 71 (seven variables with one and two days of delay at five stations, plus the previous day streamflow).
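A sketch of how such a candidate matrix can be assembled is given below; the dictionary layout of the station records is a hypothetical convention, not the authors' data format.

```python
# Sketch: assemble the 71 candidate inputs of Experiment 2.
# data[v]: array of shape (N, 5), one column per station S1..S5; y: (N,).
import numpy as np

variables = ['p', 'T_max', 'T_min', 'U_max', 'U_min', 'R_n', 'u']

def build_candidates(data, y):
    N = y.shape[0]
    t = np.arange(1, N - 1)                  # valid time indices
    cols = []
    for v in variables:                      # 7 vars x 2 lags x 5 stations = 70
        cols += [data[v][t], data[v][t - 1]]
    cols.append(y[t][:, None])               # previous day streamflow y_t
    X = np.hstack(cols)                      # shape (N - 2, 71)
    target = y[t + 1]                        # streamflow to predict, y_{t+1}
    return X, target
```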

4.3. IIS Algorithm Setting

[34] The IIS algorithm parameters p, k, and ε are set according to the results of the sensitivity analysis described in section 3. The number p of SISO models and the number k of folds adopted for the k-fold cross-validation process are both set to 5, while the algorithm tolerance ε is equal to 0.001 (with the coefficient of determination R^2 as the distance metric). As discussed in section 3, this latter value might lead to overspecification, i.e., to the selection of one or more redundant variables. However, considering that the hydrological processes to be modeled are highly nonlinear and that the correct set of input variables is unknown, we prefer taking the chance of selecting a larger set of inputs to missing one or more relevant ones. The Extra-Trees setting follows the empirical evaluations in Geurts et al. [2006]: M, the number of trees in an ensemble, is 500; K is equal to the number of input variables considered (i.e., 1 for the SISO evaluations, and 21 and 71 for the MISO evaluations in Experiments 1 and 2, respectively); and $n_{min}$, the minimum cardinality for splitting a node, is 5.

4.4. Results and Discussion

4.4.1. Experiment 1

[35] The results obtained for the first input selection experiment are reported in Figure 5, which shows the cumulated performance R^2 of the underlying model, as well as the contribution ΔR^2 of each selected variable, evaluated as the variation of the coefficient of determination R^2 at each iteration of the IIS algorithm. The cumulated performance increases monotonically with the number of selected variables up to the fifth, at which point the selection of an additional variable no longer leads to a significant improvement in the underlying model performance and the algorithm tolerance ε is reached. A large share of the streamflow process can thus be described by means of four variables, which relate to the main driving forces of the streamflow formation process: (i) the previous day streamflow $y_t$ can be seen as a proxy of the overall catchment condition and, as such, is characterized by a relevant contribution ΔR^2 of about 0.63; (ii) the measured precipitation $p_t^{S4}$ and $p_t^{S1}$, at the San Bernardino and Robiei stations, has a direct, positive impact on streamflow generation; and (iii) the potential evapotranspiration $ET_t^{S1}$ at the Robiei station, together with the precipitation $p_t^{S4}$ and $p_t^{S1}$, provides an indirect representation of the soil moisture content available in the catchment. Moreover, $ET_t^{S1}$ can negatively affect the output dynamics and introduces the annual periodicity of the process being modeled, as it depends on a set of meteorological variables with a strong seasonal/annual pattern. Interestingly, the algorithm selects two stations for the measured precipitation (i.e., San Bernardino and Robiei) lying at the boundaries of the basin, and not in the middle, such as Piotta or Acquarossa (see Figure 4). This may be due to the large spatial heterogeneity of rainfall, which makes it essential to use two stations instead of a single one representative of the basin. Indeed, the cross correlation between the measured precipitation $p_t^{S4}$ and $p_t^{S1}$ is only 0.70. In addition, the contribution of the precipitation (especially $p_t^{S4}$) is more prominent than that of the potential evapotranspiration $ET_t^{S1}$. The catchment is indeed characterized by a short time of concentration (slightly lower than 24 h) and relatively fast dynamics, so the IIS algorithm privileges the precipitation, highly correlated with surface runoff (the fast flow component), over the potential evapotranspiration, correlated with the interflow/baseflow linkage representing the low flow component. Moreover, all the selected variables have a one day delay only, while variables with a 2 day delay are completely discarded. Again, this might be due to the short time of concentration of the Ticino catchment. This analysis is confirmed by the scatterplots and the trajectories produced by the underlying model during the variable selection process (see Figure 6). The use of the previous day streamflow $y_t$ alone (1 input, Figure 6a) describes the main trends in the streamflow dynamics, while the introduction of the measured precipitation $p_t^{S4}$ and $p_t^{S1}$ (2 and 3 inputs, Figures 6b and 6c) enhances the model predictive capabilities and captures the sudden increases in the river streamflow (see, for example, the comparison in Figure 6e between the measured and predicted flow on 26 September). Finally, the inclusion of the potential evapotranspiration $ET_t^{S1}$ provides a further, yet modest, improvement of the model performance (Figure 6d).

Figure 5.

Performance improvement by each selected variable in Experiment 1. The continuous line is the cumulated performance of the underlying model.

Figure 6.

Comparison between the measured and predicted daily streamflow with the different inputs selected by the IIS algorithm in Experiment 1: (a–d) scatterplots and (e) specimen of the corresponding hydrograph for the period September–October 2004.

[36] The model with the four inputs selected by the IIS algorithm is then compared against (i) a model fed by the previous day streamflow $y_t$ plus the daily mean areal precipitation $p_t$ and potential evapotranspiration $ET_t$ (obtained through the Thiessen polygon method), (ii) a model fed by all 21 candidate inputs, and (iii) a naive model, with the predicted streamflow equal to the previous day streamflow (i.e., $\hat{y}_{t+1} = y_t$), used to assess the true utility of the developed models. The same Extra-Trees setting adopted for the four-input model (i.e., M = 500, $n_{min}$ = 5, and K equal to the number of considered inputs) ensures the best performance for these three further models as well. The comparison is based on multiple assessment criteria, aimed at describing the model behavior under different flow conditions: the coefficient of determination (R^2) and the relative root-mean-squared error (RRMSE), normalized statistics providing a description of the model behavior over the whole range of flows; the root-mean-squared error (RMSE), which measures the goodness of fit relevant to high flows; and the mean absolute error (MAE), which indicates the goodness of fit at moderate flow values. Results in Table 2 show that the model fed by the selected inputs (slightly) outperforms the first two models with respect to all the criteria. In particular, it is interesting to notice that the model fed by the daily mean areal precipitation $p_t$ and potential evapotranspiration $ET_t$ performs worse under high flow conditions (as shown by the higher value of RMSE). This effect is probably due to the spatial averaging of the Thiessen polygon method, which reduces the information content of the spatiotemporal variation of $p_t$ and $ET_t$, information that is fully exploited by the IIS algorithm. This is further demonstrated by an analysis of the scatterplot and the hydrograph (Figure 7), which show that the model fed by mean areal precipitation and evapotranspiration tends to underestimate the observed streamflow, while the model with the four selected inputs is characterized by a lower prediction error. Moreover, Table 2 shows that all the models based on Extra-Trees largely outperform the naive model: the hydrological processes leading to streamflow generation are strongly driven by an autoregressive component, but knowing the latter alone does not allow correct prediction of the quick streamflow increases driven by precipitation and potential evapotranspiration.

Table 2. Comparison of the Model Performance, Evaluated in k-Fold Cross Validation (With k = 5), Obtained in Experiment 1

Model Inputs | R^2 | RMSE (m^3/s) | RRMSE | MAE (m^3/s)
$y_t, p_t^{S1}, p_t^{S4}, ET_t^{S1}$ (IIS selection) | 0.762 | 25.052 | 0.495 | 10.971
$y_t, p_t, ET_t$ (mean areal values) | 0.757 | 25.745 | 0.509 | 11.025
21 candidate inputs | 0.753 | 25.389 | 0.502 | 11.080
$y_t$ (naive model) | 0.574 | 33.023 | 0.652 | 12.557
Figure 7.

Comparison between the measured and predicted daily streamflow: (a) scatterplot and (b) hydrograph, with a model fed by the selected inputs in Experiment 1 (red line), all the potential inputs (black line) and the mean areal precipitation and evapotranspiration (green line).
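For clarity, one common set of definitions of the four criteria reported in Tables 2 and 3 is sketched below; the RRMSE normalization (here, by the mean observed flow) is an assumption, as the paper does not spell it out.

```python
# Evaluation criteria used in Tables 2 and 3 (RRMSE definition assumed:
# RMSE normalized by the mean observed flow).
import numpy as np

def scores(obs, sim):
    err = sim - obs
    rmse = np.sqrt(np.mean(err ** 2))                            # high-flow fit
    return {
        'R2': 1 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2),
        'RMSE': rmse,
        'RRMSE': rmse / obs.mean(),
        'MAE': np.mean(np.abs(err)),                             # moderate-flow fit
    }
```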

4.4.2. Experiment 2

[37] The results of the input selection process for the 71 candidates considered in the second experiment are reported in Figure 8, together with the cumulated performance R^2 of the underlying model. Similarly to the first experiment, the cumulated performance increases monotonically with the number of selected variables, up to the sixth, when the overfitting effect prevails and the model performance decreases (i.e., the contribution ΔR^2 of the sixth variable is negative). The presence of $y_t$, $p_t^{S4}$, and $p_t^{S1}$ in the first three positions empirically confirms that the previous day streamflow is a proxy of the overall catchment condition and that precipitation is the main driving factor of the streamflow process in the Ticino catchment. With respect to Experiment 1, the potential evapotranspiration $ET_t^{S1}$ is replaced by the minimum temperature $T_{min,t}^{S2}$ and the minimum relative air moisture $U_{min,t}^{S1}$ at the Piotta and Robiei stations. The selection of the minimum temperature $T_{min,t}^{S2}$ can be related to the snowmelt process, which contributes significantly to streamflow generation in late spring.

Figure 8.

Performance improvement by each selected variable in Experiment 2. The continuous line is the cumulated performance of the underlying model.

[38] The scatterplots and the trajectories produced by the underlying model during the variable selection process (Figure 9) show that the selection of the measured precipitation $p_t^{S4}$ and $p_t^{S1}$ (2 and 3 inputs, Figures 9b and 9c) and of the minimum temperature $T_{min,t}^{S2}$ (4 inputs, Figure 9d) allows the model to predict the rapid increases in the river streamflow. On the other hand, the selection of the minimum relative air moisture $U_{min,t}^{S1}$ (Figure 9e) has minimal impact on the model performance. The same assessment criteria adopted for the previous experiment show that this model outperforms a model fed by all 71 candidate inputs (Table 3) over a wide range of flow conditions (as well as the naive model adopted in the previous experiment). This is also confirmed by the scatterplot and the hydrograph reported in Figure 10.

Figure 9.

Comparison between the measured and predicted daily streamflow with the different inputs selected by the IIS algorithm in Experiment 2: (a–e) scatterplots and (f) specimen of the corresponding hydrograph for the period Sept.–Oct. 2004.

Table 3. Comparison of the Model Performance, Evaluated in k-Fold Cross Validation (With k = 5), Obtained in Experiment 2

Model Inputs | R^2 | RMSE (m^3/s) | RRMSE | MAE (m^3/s)
$y_t, p_t^{S1}, p_t^{S4}, T_{min,t}^{S2}, U_{min,t}^{S1}$ (IIS selection) | 0.775 | 24.405 | 0.482 | 10.603
71 candidate inputs | 0.737 | 26.202 | 0.518 | 11.240
$y_t$ (naive model) | 0.574 | 33.023 | 0.652 | 12.557
Figure 10.

Comparison between the measured and predicted daily streamflow: (a) scatterplot and (b) hydrograph, with a model fed by the selected inputs in Experiment 2 (blue line) and all the potential inputs (black line).

4.4.3. Discussion

[39] The results obtained for Experiment 1 reveal that the IIS algorithm is able to select the most significant variables among an initial set of 21 candidates. This variable selection process minimizes the redundancy of the initial set, with consequent (albeit modest) benefits for the underlying model. The MB algorithm is thus not hindered by irrelevant information and can effectively exploit the available data. As far as modeling flexibility is concerned, the underlying model is also empirically demonstrated to approximate well the nonlinear relationship of streamflow formation. In this respect, it is important to consider that the structural error characterizing all the models is due to the physical characteristics of the catchment: the catchment has a time of concentration slightly lower than 24 h, so the model performance is effectively "bounded" by the use of measured external forcings only.

[40] Experiment 2, with a total of 71 candidate input variables, empirically confirms that the IIS algorithm scales well to large data sets. Interestingly, replacing the potential evapotranspiration $ET^i$ at each station with the raw meteorological information available at the same station is demonstrated to increase the underlying model performance (see the assessment criteria in Table 3 and the hydrograph in Figure 11). In particular, it is interesting to further comment on the selection of the minimum temperature $T_{min,t}^{S2}$. As shown in Figure 8, $T_{min,t}^{S2}$ is selected at the fourth iteration of the IIS algorithm, yet its contribution to R^2 is larger than that of some variables selected earlier. In other words, the contribution of the single variables does not monotonically decrease during the variable selection process. The likely reason for this behavior is that a variable, when considered alone, can be of little use in explaining the output, while it becomes much more informative when considered together with other variables [Guyon and Elisseeff, 2003].

Figure 11.

Comparison between the measured and predicted daily streamflow: (a) scatterplot and (b and c) hydrograph, with a model fed by the inputs selected in Experiments 1 and 2.

[41] Finally, the selection of a few significant and nonredundant variables leads to a more parsimonious and compact representation of the observational data set, which is key in assisting with the interpretation of the underlying physical processes. In the present study, both input selection experiments explore whether the direct use of the originally measured variables, without any a priori assumption (i.e., spatial aggregation and/or postprocessing of the meteorological data), is more convenient both in terms of accuracy of the streamflow prediction and in providing physical insight into the streamflow formation process. For example, the first experiment shows that it is more convenient to select precipitation and evapotranspiration information at a few relevant stations than to use the Thiessen polygon method to calculate the mean areal precipitation and evapotranspiration.

4.4.4. Comparison Against PMI-Based Input Selection

[42] To further validate the results obtained with the IIS algorithm, Experiments 1 and 2 are solved by means of the PMIS algorithm [Sharma, 2000; May et al., 2008], which has found successful application in different hydrological problems. More specifically, the PMIS algorithm is run with a bootstrap size of 100 and without specifying any termination criterion. This is done to study the relative importance, for the PMIS algorithm, of the inputs selected by the IIS approach.
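PMIS couples kernel-based PMI estimation with bootstrap significance testing [May et al., 2008]; the sketch below is only a simplified analogue for experimentation, not May et al.'s algorithm: the effect of the already-selected inputs is partialed out with a nonparametric fit, and candidates are then scored by mutual information with the residuals.

```python
# Simplified analogue of PMI-based forward selection (NOT the PMIS
# algorithm of May et al. [2008]): partialization via a nonparametric
# regression, candidate scoring via estimated mutual information.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.neighbors import KNeighborsRegressor

def mi_forward_selection(X, y, n_select=10):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        if selected:                  # remove the effect of selected inputs
            fit = KNeighborsRegressor().fit(X[:, selected], y)
            resid = y - fit.predict(X[:, selected])
        else:
            resid = y
        mi = mutual_info_regression(X[:, remaining], resid)
        best = remaining[int(np.argmax(mi))]
        selected.append(best)
        remaining.remove(best)
    return selected
```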

[43] The results obtained for Experiment 1 (first 10 iterations) are reported in Table 4 and show a good agreement between the two algorithms, with three of the inputs selected by the IIS algorithm chosen by the PMIS algorithm in the first four iterations. In particular, (i) the previous day streamflow $y_t$ is selected at the first iteration and is characterized by the highest value of mutual information I, equal to 0.851; (ii) the measured precipitation $p_t^{S2}$, $p_t^{S1}$, and $p_t^{S4}$, at the Piotta, Robiei, and San Bernardino stations, is subsequently selected because of its direct impact on streamflow generation; and (iii) the potential evapotranspiration from the different stations is then included. The two algorithms thus follow the same order in selecting the most relevant variables, with the only differences that the measured precipitation $p_t^{S2}$ at the Piotta station is not chosen by the IIS algorithm (only $p_t^{S1}$ and $p_t^{S4}$ at the Robiei and San Bernardino stations are selected) and that the potential evapotranspiration $ET_t^{S1}$ at the Robiei station is selected only after eight iterations of the PMIS algorithm. This could be explained by the fact that the contribution of the potential evapotranspiration is low and very similar across the different stations (as shown by the values of the mutual information), so the two algorithms tend to choose almost at random among different inputs when this uncertainty arises. Table 4 also reports the performance, in terms of R^2, of 10 models fed by the inputs iteratively selected by the PMIS algorithm. The same Extra-Trees setting adopted for the IIS algorithm (i.e., M = 500, $n_{min}$ = 5, and K equal to the number of inputs) is used in order to perform a fair comparison. Results show that the value of R^2 is slightly lower than the one obtained with the inputs selected by the IIS algorithm. This is because the optimal subset of inputs selected by the IIS algorithm depends on Extra-Trees as the underlying model and, in turn, the Extra-Trees prediction performance depends on the particular inputs selected, so the interaction between the IIS algorithm and Extra-Trees is expected to enhance the prediction performance.

Table 4. Variables Selected by the PMIS Algorithm During the First 10 Iterations in Experiment 1 and Corresponding Mutual Information I^a

Variable | I | R^2
$y_t$ | 0.851 | 0.567
$p_t^{S2}$ | 0.078 | 0.712
$p_t^{S1}$ | 0.013 | 0.724
$p_t^{S4}$ | 0.032 | 0.730
$ET_t$ | 0.012 | 0.746
$ET_t$ | 0.019 | 0.752
$ET_t$ | 0.012 | 0.758
$ET_t$ | 0.016 | 0.756
$ET_t^{S1}$ | 0.015 | 0.771
$ET_t$ | 0.007 | 0.764

^a The last column shows the performance of an ensemble of Extra-Trees evaluated in k-fold cross validation (with k = 5) and fed by the selected inputs. Unsuperscripted $ET_t$ entries denote potential evapotranspiration terms from the remaining stations (see text).

[44] These findings are confirmed in Experiment 2, where four of the input variables selected by the IIS algorithm are chosen by the PMIS algorithm in the first seven iterations (see Table 5). These variables are the previous day streamflow $y_t$, the minimum temperature and relative air moisture $T_{min,t}^{S2}$ and $U_{min,t}^{S1}$ at the Piotta and Robiei stations, and the measured precipitation $p_t^{S1}$ at Robiei. Only the precipitation $p_t^{S4}$ at San Bernardino is not selected. This difference between the IIS and PMIS results can be explained by considering that the second experiment has a larger number of candidate input variables (i.e., 71), so there is more uncertainty about which subset to use. Overall, the experiments show that the two algorithms are in agreement when there is a clear correlation between candidate inputs and output. This is demonstrated by the fact that they both select the previous day streamflow and the measured precipitation at three stations (Piotta, Robiei, and San Bernardino) and that they discard variables with a 2 day delay.

Table 5. Variables Selected by the PMIS Algorithm During the First 10 Iterations in Experiment 2 and Corresponding Mutual Information I^a

Variable | I | R^2
$y_t$ | 0.851 | 0.568
$p_t^{S2}$ | 0.078 | 0.712
– | 0.021 | 0.649
– | 0.026 | 0.745
– | 0.027 | 0.745
– | 0.026 | 0.742
– | 0.040 | 0.757
– | 0.035 | 0.748
– | 0.030 | 0.749
– | 0.027 | 0.752

^a The last column shows the performance of an ensemble of Extra-Trees evaluated in k-fold cross validation (with k = 5) and fed by the selected inputs. Among the entries marked with a dash, iterations 3–7 include the precipitation $p_t^{S1}$, the minimum temperature $T_{min,t}^{S2}$, and the minimum relative air moisture $U_{min,t}^{S1}$ (see text).

5. Conclusions

[45] This paper presents a novel tree-based algorithm for input variable selection in hydrological and water resources modeling studies. The approach combines the advantages of model-based selection with the efficiency of model-free methods. The tree-based nature of the underlying model ensures modeling flexibility, computational efficiency, and scalability, while the iterative nature of the algorithm minimizes redundancy. The accuracy of the IIS algorithm and its sensitivity to the parameters p, k, and ε are first evaluated on synthetic test problems from the input variable selection literature. Following that, the algorithm is demonstrated on a real-world daily streamflow prediction problem in the upper Ticino River Basin. The results, and the comparison against the PMIS algorithm, indicate that the algorithm is capable of selecting the most significant and nonredundant inputs under different conditions, such as the presence of noise (synthetic test problems) or of several redundant variables (real-world problem) in the sample data set. This permits the identification of more parsimonious and compact representations of the observational data set, with increased model performance and easier interpretation of the underlying physical processes. Moreover, the results show that the algorithm provides correct results for a broad range of parameter values, although its tolerance ε may require some trial-and-error tuning. Although the IIS algorithm is essentially conceived for the development of tree-based regression/classification models, its structure allows its use for the selection of the most significant input variables for any (linear or nonlinear) regression method.

[46] Future research will concentrate on improving the predictive performance of Extra-Trees, which show a tendency to underestimate flow peaks because of the averaging performed at the tree leaves [Galelli and Castelletti, 2013]. Another aspect that deserves further study is the physical interpretability of the selected variables. The IIS algorithm, like the majority of the input selection algorithms available in the literature, selects the most relevant variables on the basis of their numerical correlation with the output, which does not necessarily imply physical causality. It would therefore be interesting to combine the IIS algorithm with causality detection approaches [e.g., Bizzi et al., 2013].

Acknowledgments

[47] The research presented in this work was carried out as part of the Singapore-Delft Water Alliance (SDWA) Multi-Objective Multiple-Reservoir Management research program (R-303-001-005–272). The authors are grateful to Marcello Restelli and Rodolfo Soncini-Sessa for their essential contribution to the development of the algorithm. The meteorological and hydrological data used in this study were provided by the Swiss Federal Office for Meteorology (MeteoSwiss) and the Swiss Federal Office for Water and Geology (BWG).
