## 1. Introduction

[2] The problem of input variable selection arises every time one wants to model the relationship between a variable of interest, or predictand, and a subset of potential explanatory variables, or predictors, but there is uncertainty about which subset to use among a number, usually large, of candidate sets available [*George*, 2000]. The selection of the most relevant or important input variables is a recurrent problem in hydrologic and water resources applications involving the resolution of single or multiple regression problems, such as rainfall-runoff modeling, prediction in ungauged basins, and water quality modeling. The difficulties in variable selection for these applications come primarily from three sources [*May et al*., 2008]: (i) the complexity of the unknown functional relationship between inputs and output; (ii) the number of available variables, which may be very large; and (iii) the cross correlation between candidate inputs, which induces redundancy [*Maier et al*., 2010]. If a linear relationship characterizes the underlying system, well-established selection methods exist to obtain a candidate subset of the input variables [e.g., *Miller*, 1990]. However, the assumption of linear dependency between the inputs and the output is overly restrictive for most real physical systems [see, e.g., *Tarquis et al*., 2011]. In addition, advances in monitoring systems, from remote sensing techniques to pervasive real and virtual sensor networks [e.g., *Hart and Martinez*, 2006; *Hill et al*., 2011], have made available an increasingly large amount of data at the local and global scale at progressively finer temporal and spatial resolution, thus not only increasing data set dimension from dozens to tens or hundreds of thousands but also adding considerably to data set redundancy.
The objective of variable selection is threefold [*Guyon and Elisseeff*, 2003]: improving model performance by avoiding the interference of nonrelevant or redundant information and more effectively exploiting the data available for model calibration, providing faster and more cost-effective models, and assisting in interpretation of the underlying process by enabling a more parsimonious and compact representation of the observational data set. In fact, by reducing the input space dimension the generalization capability of the constructed model is maximized, while parsimonious and compact representations of the data allow for an easier interpretation of the underlying phenomenon [*Kohavi and John*, 1997].

[3] Input variable selection methods can be divided into *model-based* (or wrapper) and *model-free* (or filter) approaches (see *Das* [2001]; *Guyon and Elisseeff* [2003]; *Maier et al*. [2010], among others). The model-based approach relies on the idea of calibrating and validating a number of models with different sets of inputs and selecting the set that ensures the best model performance. The candidate inputs are evaluated in terms of prediction accuracy of a preselected underlying model. The problem can be solved by means of global optimization techniques used to define the combination of input variables that maximizes the underlying model performance, or by stepwise selection (forward selection/backward elimination) methods, where inputs are systematically added/removed until model performance is no longer improved. The main drawback of this approach lies in its computational requirements [*Kwak and Choi*, 2002; *Chow and Huang*, 2005], as a large number of calibration and validation processes must be performed to single out the best combination of inputs, and so the method does not scale well to large data sets. Moreover, the input selection result depends on the predefined model class and architecture. Thus, the optimality of a selected set of inputs obtained with a particular model is not guaranteed for another one, and this restricts the applicability of the selected set [*Maier et al*., 2010]. On the other hand, model-based approaches generally achieve better performance since they are tuned to the specific interactions between the model class and the data. Unlike the model-based approach, in model-free algorithms the variable selection is directly based on the information content of the candidate input data set, as measured by interclass distance, statistical dependence, or information-theoretic measures (e.g., the mutual information index [*Peng et al*., 2005]). 
Computational efficiency is a strong argument in favor of model-free methods; however, the significance measure is generally monotonic and, thus, without a predefined cutoff criterion, the algorithm tends to select very large subsets of input variables, with high risk of redundancy.
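
As a rough illustration of the model-based (wrapper) idea described above, the following Python sketch performs stepwise forward selection scored by *k*-fold cross-validation. Everything in it is an assumption made for illustration only: the synthetic data, the function names (`knn_predict`, `cv_mse`, `forward_selection`), the stopping tolerance `min_rel_gain`, and the simple nearest-neighbor regressor used as the underlying model. None of it is taken from the cited works.

```python
import math
import random


def knn_predict(train_X, train_y, x, k=5):
    """Predict the output at x as the mean output of the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), target)
        for row, target in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(target for _, target in nearest) / len(nearest)


def cv_mse(X, y, features, n_folds=5):
    """k-fold cross-validated mean squared error using only the given input columns."""
    n = len(y)
    sq_err = 0.0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        train_X = [[X[i][j] for j in features] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            x = [X[i][j] for j in features]
            sq_err += (knn_predict(train_X, train_y, x) - y[i]) ** 2
    return sq_err / n


def forward_selection(X, y, min_rel_gain=0.05):
    """Add one input at a time while the CV error drops by at least min_rel_gain."""
    remaining = list(range(len(X[0])))
    selected = []
    mean_y = sum(y) / len(y)
    # Reference error of the empty model (predicting the sample mean).
    best_err = sum((v - mean_y) ** 2 for v in y) / len(y)
    while remaining:
        # One full cross-validation per remaining candidate: this is the costly part.
        trials = [(cv_mse(X, y, selected + [j]), j) for j in remaining]
        err, j = min(trials)
        if err > (1.0 - min_rel_gain) * best_err:
            break  # no candidate improves the model enough, so stop
        selected.append(j)
        remaining.remove(j)
        best_err = err
    return selected


# Hypothetical synthetic data: 4 candidate inputs, only columns 0 and 1 matter.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(200)]
y = [math.sin(3 * r[0]) + 2 * r[1] ** 2 + rng.gauss(0, 0.05) for r in X]

selected = forward_selection(X, y)
print("selected input columns:", selected)
```

Each selection step recalibrates and revalidates the underlying model once per remaining candidate, which is the source of the computational burden noted above, and the selected set depends on the model that is plugged in.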

[4] In hydrological and water resources studies, input variable selection has been mainly used for two classes of problems: (i) problems involving the identification of a nonlinear regression between time-independent explanatory factors and a relevant statistic of a given hydrological output characteristic, and (ii) problems concerning the selection of the most relevant time-varying explanatory variables to characterize, through a dynamic model, the development over time of a hydrological variable of interest. The first class is predominantly populated by statistical regionalization methods and the use of regional hydrological model parameters for the estimation of the hydrological response in ungauged watersheds. Both model-free approaches, such as principal component analysis [e.g., *Alcázar and Palau*, 2010; *Salas et al*., 2011; *Wan Jaafar et al*., 2011], and model-based methods, such as stepwise regression [e.g., *Heuvelmans et al*., 2006; *Barnett et al*., 2010], have long been used to relate runoff characteristics to climate and watershed descriptors. The use of variable selection for statistical downscaling of meteorological data can also be classified under this category [e.g., *Traveria et al*., 2010; *Phatak et al*., 2011, and references therein]. More recently, novel causal variable selection methods [*Ssegane et al*., 2012] have been demonstrated to outperform stepwise selection approaches in terms of accuracy in characterizing the physical process while showing a lower predictive potential.

[5] The second class of problems comprises a larger variety of application fields ranging from rainfall predictions [e.g., *Sharma et al*., 2000] to streamflow modeling [e.g., *Wang et al*., 2009], evaporation estimation [*Moghaddamnia et al*., 2009], and water quality modeling [e.g., *Huang and Foo*, 2002]. Given the usually high number of candidate input variables in these problems, model-free methods are generally preferred over model-based approaches [*Maier et al*., 2010], which are mostly used to determine the optimal input set for specific classes of models [e.g., *Noori et al*., 2011, and references therein]. Starting from *Sharma* [2000], who expanded the mutual information index into the computationally more efficient and reliable partial mutual information (PMI), traditional nonlinear information-theoretic selection criteria have been revisited and adapted to a number of increasingly complex hydrological applications. *Bowden et al*. [2005a] improved the PMI criterion of *Sharma* [2000] using artificial neural networks for salinity prediction in the River Murray, South Australia [*Bowden et al*., 2005b]. The method was further elaborated by *May et al*. [2008], who proposed an alternative termination criterion to make the PMI more efficient and, subsequently, by *Fernando et al*. [2009], who introduced the use of a more efficient estimator (shifted histograms) of the mutual information. Finally, *Hejazi and Cai* [2009] illustrated an improved, computationally efficient, minimum redundancy maximum relevance approach to select the most significant inputs among 121 candidates to predict the daily release from 22 reservoirs in California.
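
To make the role of the mutual information index more concrete, the Python sketch below estimates it with a plain fixed-bin two-dimensional histogram and compares it with the Pearson correlation on a purely nonlinear dependence. This is only a minimal illustration under stated assumptions: the estimator is the basic plug-in histogram form, not the PMI of *Sharma* [2000] nor the shifted-histogram estimator of *Fernando et al*. [2009], and the data, bin count, and function names (`bin_index`, `mutual_information`, `pearson`) are hypothetical.

```python
import math
import random
from collections import Counter


def bin_index(v, lo, hi, n_bins):
    """Map v in [lo, hi] to an equal-width bin index in 0..n_bins-1."""
    if hi == lo:
        return 0
    return min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)


def mutual_information(x, y, n_bins=10):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    n = len(x)
    x_lo, x_hi, y_lo, y_hi = min(x), max(x), min(y), max(y)
    bx = [bin_index(v, x_lo, x_hi, n_bins) for v in x]
    by = [bin_index(v, y_lo, y_hi, n_bins) for v in y]
    joint = Counter(zip(bx, by))
    px = Counter(bx)
    py = Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i,j) * log(p(i,j) / (p(i) * p(j))), with probabilities as counts / n
        mi += (c / n) * math.log(c * n / (px[i] * py[j]))
    return mi


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y depends on x only through x**2; `noise` is unrelated to y.
rng = random.Random(1)
x = [rng.uniform(-1, 1) for _ in range(2000)]
noise = [rng.uniform(-1, 1) for _ in range(2000)]
y = [v ** 2 for v in x]

r_xy = pearson(x, y)
mi_xy = mutual_information(x, y)
mi_noise = mutual_information(noise, y)
print(f"Pearson(x, y) = {r_xy:.3f}")
print(f"MI(x, y)      = {mi_xy:.3f} nats")
print(f"MI(noise, y)  = {mi_noise:.3f} nats")
```

In this setting the linear correlation is close to zero although the output is fully determined by the input, while the mutual information estimate clearly separates the relevant input from the unrelated one. The small positive value obtained for the unrelated input reflects the bias of histogram estimators on finite samples.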

[6] In this paper, we build on these previous works and propose a novel hybrid approach to input variable selection that combines model-free and model-based methods, called the tree-based iterative input variable selection (IIS). IIS incorporates some of the features of model-based approaches into a fast model-free method able to handle very large candidate input sets. The information-theoretic selection criterion of model-free methods is replaced by a ranking-based measure of significance [*Wehenkel*, 1998]. Each candidate input is scored by estimating its contribution, in terms of variance reduction, to the building of an underlying model of the preselected output. This choice has two advantages. First, unlike information-theoretic selection, ranking-based evaluation does not require any assumption on the statistical properties of the input data set (e.g., Gaussian distribution) and, thus, can be applied to any sort of sample. Second, it does not rely on computationally intensive methods (e.g., bootstrapping) to estimate the information content in the data and, thus, is generally faster and more efficient. Nonparametric tree-based regression methods, namely extremely randomized trees [*Geurts et al*., 2006], are adopted as the underlying model family since, thanks to their ensemble nature [e.g., *Sharma and Chowdhury*, 2011], they perform particularly well in characterizing strongly nonlinear relationships and provide more flexibility and scalability than parametric models (e.g., artificial neural networks). The ranking-based selection is embedded into a stepwise forward selection model-based approach evaluated by *k*-fold cross-validation [*Allen*, 1974]. Forward selection is appropriate because on most real-world hydrological data sets the number of significant variables is a small proportion of the total number of available variables. In such situations, forward selection is far less time consuming than backward elimination [*Das*, 2001]. 
In the rest of the paper, we first present the IIS algorithm and illustrate its building blocks. Then, we validate the IIS selection accuracy on the synthetic test problems used by *Sharma* [2000], *Bowden et al*. [2005b], *May et al*. [2008], and *Hejazi and Cai* [2009]. Following that, we demonstrate the algorithm on a real-world case study of streamflow prediction in the upper Ticino River Basin (Switzerland), a subalpine catchment characterized by extremely variable weather conditions, where both rainfall and snowmelt significantly contribute to the high flow variations. A comparison between the variables selected by IIS and by the PMI-based input selection (PMIS) of *May et al*. [2008] is also provided. Finally, the 'Conclusions' section summarizes the findings.
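
The variance-reduction scoring mentioned above can be illustrated, in a heavily simplified form, at the level of a single tree node. The Python sketch below scores each candidate input by the largest relative reduction in output variance obtainable with one threshold split. It is an assumption-laden toy, not the IIS ranking itself: a real tree ensemble accumulates such reductions over many nodes and many trees, and extremely randomized trees draw cut points at random rather than searching all thresholds as done here. The data and the function names (`variance`, `best_split_variance_reduction`) are hypothetical.

```python
import math
import random


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def best_split_variance_reduction(x, y):
    """Largest relative reduction in var(y) achievable by one threshold on x."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_var = variance(y)
    best = 0.0
    for cut in range(1, n):
        if pairs[cut - 1][0] == pairs[cut][0]:
            continue  # no threshold can separate identical input values
        left = [p[1] for p in pairs[:cut]]
        right = [p[1] for p in pairs[cut:]]
        # Size-weighted variance remaining inside the two child nodes.
        within = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = max(best, (total_var - within) / total_var)
    return best


# Hypothetical data: the output depends nonlinearly on x0 only.
rng = random.Random(0)
n = 300
inputs = {name: [rng.uniform(-1, 1) for _ in range(n)] for name in ("x0", "x1", "x2")}
y = [math.sin(3 * v) + rng.gauss(0, 0.1) for v in inputs["x0"]]

scores = {name: best_split_variance_reduction(col, y) for name, col in inputs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: relative variance reduction = {scores[name]:.3f}")
```

The score requires no assumption on the distribution of the inputs, which is the property the ranking-based measure is credited with above. A single split can miss dependencies that are symmetric around the threshold, which is one reason full trees with many nodes are used in practice.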