Tree-based iterative input variable selection for hydrological modeling


  • S. Galelli

    Corresponding author
    1. Singapore-Delft Water Alliance, National University of Singapore, Singapore
    2. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy
    • Corresponding author: S. Galelli, Pillar of Engineering Systems & Design, Singapore University of Technology and Design, 20 Dover Drive, 138682, Singapore. (

  • A. Castelletti

    1. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy


[1] Input variable selection is an important issue in the development of many hydrological applications. Selecting the optimal input vector from a large set of candidates to characterize a preselected output can yield a more accurate, parsimonious, and possibly physically interpretable model of the natural process. In the hydrological context, the modeled system often exhibits nonlinear dynamics and multiple interrelated variables. Moreover, the set of candidate inputs can be very large and highly redundant, especially when the model reproduces the spatial variability of the physical process. The ideal input selection algorithm should therefore provide modeling flexibility, computational efficiency in dealing with high-dimensional data sets, scalability with respect to input dimensionality, and minimum redundancy. In this paper, we propose the tree-based iterative input variable selection algorithm, a novel hybrid model-based/model-free approach specifically designed to fulfill these four requirements. The algorithm structure provides robustness against redundancy, while the tree-based nature of the underlying model ensures the other three properties. The approach is first tested on a well-known benchmark case study to validate its accuracy and is subsequently applied to a real-world streamflow prediction problem in the upper Ticino River Basin (Switzerland). Results indicate that the algorithm selects the most significant and nonredundant inputs under different testing conditions, including the real-world large data set, which contains several redundant variables. This makes it possible to identify a compact representation of the observational data set, which is key to improving model performance and assisting with the interpretation of the underlying physical processes.
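The abstract only outlines the procedure, so the following is a minimal sketch of how an iterative, tree-based selection loop of this kind might look, not the authors' implementation. Everything beyond the outline is an assumption of the sketch: a single small regression tree stands in for the tree-based model, candidates are ranked by the squared-error reduction of their splits, scoring uses in-sample R² rather than held-out data, and the names `fit_tree`, `iterative_selection`, `top_p`, `depth`, and `tol` are hypothetical. Each iteration ranks all candidates against the current residual, evaluates the top-ranked ones with single-input models, adds the best one, and refits on the selected set. Working on the residual is what discourages redundant picks in this sketch.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, features, depth, gains):
    """Grow a small regression tree restricted to `features`.

    Each split's squared-error reduction is added to gains[feature].
    Returns a float (leaf mean) or a tuple (feature, threshold, left, right).
    """
    best = None  # (reduction, feature, threshold)
    parent = sse(y)
    if depth > 0:
        for j in features:
            levels = sorted({row[j] for row in X})
            for lo, hi in zip(levels, levels[1:]):
                t = (lo + hi) / 2
                left = [yi for row, yi in zip(X, y) if row[j] <= t]
                right = [yi for row, yi in zip(X, y) if row[j] > t]
                reduction = parent - sse(left) - sse(right)
                if reduction > 1e-12 and (best is None or reduction > best[0]):
                    best = (reduction, j, t)
    if best is None:
        return fmean(y)  # leaf: predict the mean of the node
    reduction, j, t = best
    gains[j] += reduction
    lx = [r for r in X if r[j] <= t]
    ly = [v for r, v in zip(X, y) if r[j] <= t]
    rx = [r for r in X if r[j] > t]
    ry = [v for r, v in zip(X, y) if r[j] > t]
    return (j, t, fit_tree(lx, ly, features, depth - 1, gains),
            fit_tree(rx, ry, features, depth - 1, gains))


def predict(tree, row):
    """Walk the tree down to a leaf and return its mean."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def fit_predict(X, y, features, depth):
    """Fit a tree on `features` and return its in-sample predictions."""
    tree = fit_tree(X, y, features, depth, [0.0] * len(X[0]))
    return [predict(tree, row) for row in X]


def r2(y, pred):
    """Coefficient of determination; 0.0 when there is nothing left to explain."""
    total = sse(y)
    if total == 0.0:
        return 0.0
    return 1.0 - sum((a - b) ** 2 for a, b in zip(y, pred)) / total


def iterative_selection(X, y, top_p=2, depth=3, tol=0.01):
    """Hypothetical selection loop; stopping rules are assumptions of this sketch."""
    n_features = len(X[0])
    selected, residual, best_score = [], list(y), 0.0
    while True:
        # Step 1: rank every candidate by its split gains on the current residual.
        gains = [0.0] * n_features
        fit_tree(X, residual, range(n_features), depth, gains)
        ranked = sorted(range(n_features), key=lambda j: -gains[j])[:top_p]
        # Step 2: score the top-ranked candidates with single-input models.
        scores = {j: r2(residual, fit_predict(X, residual, [j], depth)) for j in ranked}
        best = max(ranked, key=scores.get)
        if best in selected:
            break  # the same variable came up again: stop
        # Step 3: refit on the enlarged input set and check the improvement.
        trial = selected + [best]
        pred = fit_predict(X, y, trial, depth)
        score = r2(y, pred)
        if score - best_score < tol:
            break  # negligible gain: stop
        selected, best_score = trial, score
        residual = [yi - pi for yi, pi in zip(y, pred)]
    return selected


# Toy data: column 1 duplicates column 0, column 3 is irrelevant.
X = [[a, a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [2 * row[0] + row[2] for row in X]
print(iterative_selection(X, y))  # [0, 2]
```

On this toy data the duplicate column is never selected: once column 0 is in the model, the residual carries no information that column 1 could explain. A more careful version would score candidates on held-out data (for example with k-fold cross-validation) and use an ensemble of trees instead of a single one.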