### Abstract

- Abstract
- 1. Introduction
- 2. Contributions of the Study
- 3. Methodology
- 4. Results
- 5. Conclusions and Future Work
- Appendix A: Algorithm to Calculate the VC Dimension of the Set of Classifiers on the Basis of
- Appendix B: Algorithm for Implementation of Prediction Based on Robust Model Selection
- Appendix C: Algorithm for Nearest Neighbor Probabilistic Ensemble Method (NNPE)
- Acknowledgments
- References
- Supporting Information

[1] Water resource management requires robust assessment of the consequences of future states of the resource, and, when dependent on prediction models, it requires assessment of the uncertainties associated with those predictions. Ensemble prediction/forecast systems have been extensively used to address such issues and seek to provide a collection of predictions, via a collection of parameters, with intent to bracket future observations. However, such methods do not have well-established finite-sample properties and generally require large samples to additionally determine better performing predictions, for example, in nonlinear probabilistic ensemble methods. We here propose a different paradigm, based on Vapnik-Chervonenkis (VC) generalization theory, for robust parameter selection and prediction. It is based on a concept of complexity (that is data-independent) that relates finite sample performance of a model to its performance when a large sample of the same underlying process is available. We employ a nearest neighbor method as the underlying prediction model, introduce a procedure to compute its VC dimension, and test how the two paradigms handle uncertainty in one step ahead daily streamflow prediction for three basins. In both paradigms, the predictions become more efficient and less biased with increasing sample size. However, the complexity-based paradigm has a better bias-variance tradeoff property for small sample sizes. The uncertainty bounds on predictions resulting from ensemble methods behave in an inconsistent manner for smaller basins, suggesting the need for further postprocessing of ensemble members and uncertainty surrounding them before using them in modeling uncertainty estimation. Finally, complexity-based predictions appear to mimic the complexity of the underlying processes via input dimensionality selection of the nearest neighbor model.

### 1. Introduction


[2] Streamflow forecasting is critical for short- to medium- to long-term water resources planning. It is therefore also critical to investigate uncertainty in streamflow forecasting if managers or decision makers depend upon such numbers to plan for the future. Some of the uncertainty components, such as data and model uncertainty, are conceptually well understood [e.g., *Bárdossy et al.*, 2005; *Beven*, 2002, 2004, 2005; *Kavetski et al.*, 2006a, 2006b; *Liu and Gupta*, 2007; *Oreskes and Belitz*, 2001; *Pande et al.*, 2005; *Refsgaard et al.*, 2006], and balanced consideration of such uncertainties clearly makes forecasting credible.

[3] What is desirable is a description of uncertainty in future predictions of streamflow levels, say *U*, which is also minimal in extent, i.e., a set *U* that is smallest in size but explains most of the uncertainty in prediction. A probability measure of total uncertainty in forecasting, *P*(*U*), can then be defined as an integral of a joint distribution of uncertainty associated with its sources, i.e., models (ℳ) and data (*D*), with the integral taken over the latter. For discrete variables, it can be expressed as a sum. In other words,

$$P(U) = \sum_{i} \sum_{j} P(U, \mathcal{M}_i, D_j) \qquad (1)$$

where *i* and *j* denote indices for a model and a sample, respectively. This can further be decomposed in terms of conditionals, with some probability measures defined on ℳ and *D*:

$$P(U) = \sum_{i} \sum_{j} P(U \mid \mathcal{M}_i, D_j)\, P(\mathcal{M}_i \mid D_j)\, P(D_j) \qquad (2)$$

While vague, a probability measure on ℳ may define the weights ascribed to each element in model space. A probability measure on *D*, however, defines a measure of stochasticity of the underlying processes and is independent of model space. Thus, *P*(ℳ_{i} ∣ *D*_{j}) = *P*(ℳ_{i}) = *w*_{i}. If we assume a model space that contains discrete elements, as is the case when we have a finite collection of hydrologic models ℳ = {*m*_{i}, *i* = 1, …, *M*}, and analyze uncertainty under a given sample of data *D*_{j}, equation (2) transforms to

$$P(U) = \sum_{i=1}^{M} w_i\, P(U \mid m_i, D_j)\, P(D_j) \qquad (3)$$

where *w*_{i} is the weight associated with model *m*_{i}, and ∑_{i} *w*_{i} = 1.
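The weighted sum in equation (3) can be illustrated with a minimal numerical sketch; the weights and conditional uncertainty values below are hypothetical, chosen only to show the mechanics:

```python
# Numerical illustration of equation (3)-style model averaging.
# Weights w_i and conditional uncertainty values are hypothetical.

weights = [0.5, 0.3, 0.2]              # w_i for models m_1, m_2, m_3
p_u_given_model = [0.10, 0.25, 0.40]   # P(U | m_i, D_j), illustrative

assert abs(sum(weights) - 1.0) < 1e-12  # constraint: sum_i w_i = 1

# Total uncertainty measure as the weighted sum over models.
p_u = sum(w * p for w, p in zip(weights, p_u_given_model))
print(round(p_u, 3))  # → 0.205
```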

[4] In order to have a description of forecasting uncertainty that is minimal in extent, minimization of *P*(*U*) in (3) requires minimization of the right-hand side (RHS) over future data that are not yet available. When only one model is considered, minimization of (3), in the absence of future data, is realized by parameter estimation of the model based on maximization of some likelihood measure over the calibration data.

[5] Uncertainty bounds on future predictions have previously been estimated using the GLUE methodology [e.g., *Beven and Binley*, 1992; *Blazkova and Beven*, 2002, 2004; *Cameron et al.*, 2000; *Freer et al.*, 1996], where generalized likelihood measures are used to obtain an equifinal set of parameters for a model. Predictions corresponding to those equifinal sets represent uncertainty in prediction into the future. Other parameter estimation techniques that minimize some error function have also been used extensively; a review can be found in work by *Gupta et al.* [2003, 2005] and *Wagener and Gupta* [2005]. Novel parameter estimation algorithms based on extensions of the highly successful shuffled complex evolution algorithm [*Duan et al.*, 1992] have been used to assess predictive uncertainty as a consequence of uncertainty in the parameters [e.g., *Vrugt et al.*, 2003b; *Vrugt and Robinson*, 2007], with further extensions into multiobjective parameter estimation [e.g., *Vrugt et al.*, 2003a; *Yapo et al.*, 1998]. The latter not only consider predictive uncertainty in estimating parameters using one objective function but also include uncertainty arising from the selection of multiple objective functions. More recently, *Moradkhani et al.* [2005] and *Vrugt et al.* [2005] have coupled data assimilation and adaptive methods of parameter estimation to address forecasting uncertainty. However, all the above methods only account for uncertainty emanating from data finiteness based on one realization of data.

[6] When multiple hydrologic models are considered, predictive uncertainty can be defined in terms of multimodel ensemble methods [e.g., *Georgakakos et al.*, 2004; *McIntyre et al.*, 2005; *Schaake et al.*, 2006]. Overall uncertainty can then be defined as a weighted sum of the uncertainties corresponding to individual calibrated models. Bayesian model averaging methods [e.g., *Ajami et al.*, 2006, 2007; *Duan et al.*, 2007; *Neuman*, 2003; *Raftery et al.*, 2005] take a step further by also optimally identifying the model weights in the estimation of equation (3). Thus, Bayesian model averaging methods address predictive uncertainty due to both data finiteness and model uncertainty sources [*Hendry and Reade*, 2005].

[7] Treatment of predictive uncertainty conditional on a particular model is currently limited to the likelihood of the model on available calibration data. The summand on the RHS of equation (3) can be represented by uncertainty in parameter estimation given a model *m*_{i} and a sample *D** as follows:

$$P(U \mid m_i, D^{*}) = \int_{\theta} P(U \mid \theta, m_i, D^{*})\, P(\theta \mid m_i, D^{*})\, d\theta \qquad (4)$$

where *θ* denotes the parameters of model *m*_{i}. Here, the RHS is estimated on the basis of a likelihood measure on available data. It is therefore parameter uncertainty that propagates uncertainty into prediction. Irrespective of the treatment of conditional parameter uncertainty, whether by multiple objective methods or a generalized likelihood measure, it remains random because *D** is one realization of *D*, which occurs with probability *P*(*D* = *D**). Therefore, analysis of uncertainty, even using ensemble-based methods, is incomplete, as it is still plagued by the sampling uncertainty associated with finite samples.

[8] Major issues associated with finite samples arise, especially for high-dimensional hydrologic models. The parameter uncertainty obtained from finite calibration data is itself uncertain because of sampling error [*Vapnik*, 2002]. It is also difficult to identify a collection of parameters that works well for different realizations of data from the distribution induced by the underlying processes. Finally, as is evident from equation (3), predictive uncertainty based on calibration on only one sample does not account for the entire data uncertainty. Under such circumstances point forecasts, even if represented by the median of ensembles, may prove highly unreliable.

[9] Traditionally in regression analysis, such randomness is accounted for by bootstrapping samples of data *D* of the same length in order to mimic *P*(*D*), so that the overall uncertainty due to finite data can be considered [e.g., *Andrews and Buchinsky*, 2001; *Efron*, 1979; *Hardle and Bowman*, 1988; *Lall and Sharma*, 1996]. Similar studies have also been extended to examine the effect of data uncertainty on parameter uncertainty [e.g., *Kavetski et al.*, 2006a; *Pande et al.*, 2005]. More recently, I. A. Tcherednichenko et al. (Effect of data uncertainty on parameter estimation and its uncertainty using high-density regions of bootstrapped likelihood, unpublished manuscript, 2009) have studied how hydrologic models corresponding to parameters in high-density regions of bootstrapped likelihood space behave under a simple additive noise assumption on the underlying data distribution, and how well such parameters represent the overall parameter uncertainty resulting from data uncertainty. However, such methods are data and computationally intensive.
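The bootstrap idea referenced here [*Efron*, 1979] can be sketched in a few lines; the streamflow values below are made-up illustrative numbers, and the estimator is simply the sample mean:

```python
import random

def bootstrap_estimates(data, estimator, n_boot=500, seed=0):
    """Resample `data` with replacement to approximate the sampling
    distribution of `estimator` (the classical bootstrap of Efron, 1979)."""
    rng = random.Random(seed)
    n = len(data)
    return [estimator([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]

# Hypothetical daily streamflow values (m^3/s); estimator is the mean.
flows = [12.0, 15.5, 9.8, 22.1, 18.4, 11.2, 14.9, 20.3]
means = sorted(bootstrap_estimates(flows, lambda d: sum(d) / len(d)))
# Approximate 95% bootstrap percentile interval for the mean.
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
```

The interval (lo, hi) stands in for the spread one would otherwise only see by observing many independent realizations of the data.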

[10] The nature of predictive uncertainty arising from data uncertainty can be seen from the joint distribution of uncertainty conditional on model specification,

$$P(U \mid \mathcal{M}) = \sum_{j} P(U \mid \mathcal{M}, D_j)\, P(D_j)$$

Thus in order to study the current treatment of data uncertainty in ensemble methods, we choose a specific class of models called nearest neighbor algorithms, which have been extensively used in nonparametric stochastic hydrology [e.g., *Bárdossy et al.*, 2005; *Karlsson and Yakowitz*, 1987; *Lall and Sharma*, 1996; *Sharma et al.*, 1997; *Yakowitz*, 1993], and compare it with the performance of the same class of models using another paradigm that can also handle data uncertainty.

[11] The paper is organized as follows. Section 2 describes the specific contributions of this study. Section 3 describes the methodology and the data used; it covers the concept of the Vapnik-Chervonenkis (VC) complexity measure and the algorithm used for its calculation, explains the robust prediction and NNPE methods, and concludes with the steps to implement the two paradigms for further analysis. The specific algorithms are thoroughly described in Appendices A–C. Section 4 presents the results based on the methodology outlined in section 3: a calculation of the VC dimension for nearest neighbor methods; a comparison of the performance of complexity-based prediction and NNPE for varying sample sizes and over multiple test data sets for three basins; an assessment of how well complexity-based model selection mimics the underlying complexity of the processes across the three basins; and, finally, the behavior of the uncertainty bounds of the NNPE paradigm with increasing sample size for the same basins. Section 5 concludes the paper.

### 2. Contributions of the Study


[12] In order to empirically study current ensemble-based prediction methods and how they treat sampling uncertainty, and inspired by *Tamea et al.* [2005], we present the finite sample performance of a nonlinear probabilistic ensemble method and its prediction uncertainty for the nearest neighbor class of models. Our ensemble method, however, differs from that of *Tamea et al.* [2005] in the use of a locally constant model rather than a locally linear model. We therefore call the ensemble method presented here a nearest neighbor probabilistic ensemble (NNPE) method. We also point out that a conclusion drawn for the NNPE may not hold for the *Tamea et al.* [2005] method. By finite sample performance we mean how the model predictions perform with varying size of calibration (or "training") data. The need for a robust predictor is ever present, and ensemble methods may provide one via the median of ensembles. However, no theoretical foundation is available that can ensure the robustness of such median predictions. Further, the approach is more data-dependent, as training data have to be split into calibration and validation subsets to create appropriate ensembles [*Tamea et al.*, 2005].

[13] We therefore propose another paradigm for robust prediction based on Vapnik-Chervonenkis generalization theory [*Vapnik and Chervonenkis*, 1991]. Its foundations lie in a parameterization based on worst-case performance [*Cortes*, 1993], thus providing parameters of a model that work well for different realizations of data from the underlying but unknown processes. Such parameters perform especially well for small sample sizes, and their estimation is controlled for the dimensionality of the problem [*Vapnik*, 2002]. Critical to VC generalization-based model estimation is the concept of model complexity, which determines the robustness of prediction on unseen future data for a given size of calibration (or "training") data. Various measures of model complexity, and methods for selecting models with optimal complexity, have been proposed [*Atkinson et al.*, 2002; *Jakeman and Hornberger*, 1993; *Puente and Sivakumar*, 2007; *Schoups et al.*, 2008]. Model calibration, and therefore prediction, is then based on the principle of Occam's razor, which optimally trades off model complexity against likelihood on available data [*Downer and Ogden*, 2003; *van der Linden and Woo*, 2003; *Young et al.*, 1996]. The optimal complexity chosen for a given sample of training data may therefore also provide some insight into the complexity of the underlying processes. A finite sample performance analysis similar to that for the NNPE method is also applied to this paradigm to test its robustness and to compare its performance with the NNPE paradigm.

[14] Thus, the contributions of this study are manifold. While ensemble-based methods are prominent in the literature for describing predictive uncertainty, we are not aware of a study of how well they handle the sampling uncertainty resulting from finiteness of data. We examine this issue via the finite sample performance of NNPE-based prediction for the simplest class of hydrologic models, i.e., nearest neighbor methods. Since finite sample performance is closely linked to robustness of predictions via how well data uncertainty is handled, we propose a new paradigm based on VC generalization theory. We introduce a concept of complexity propounded within VC generalization theory and calculate it for our class of models. Furthermore, we investigate whether robust modeling is connected to the complexity of the underlying processes and how the uncertainty bounds provided by the NNPE paradigm behave with increasing sample sizes. Finally, we compare the finite sample performance of VC theory based robust modeling to that of the NNPE to find evidence, if any, of the advantages of explicitly considering model complexity over ensemble methods (i.e., improved model performance for small sample sizes).

### 3. Methodology


[15] The nearest neighbor methods belong to a class of nonparametric regression functions. Given a hydrologic time series, **D**_{N} = {{*y*_{i}, **x**_{i}} ∈ ℝ_{+}^{1} × ℝ_{+}^{ℓ}, *i* = 1, …, *N*}, a nonparametric regression estimator, *ŷ*^{NP}, minimizes the cost functional in (5) over **D**_{N} (from here on referred to as "training" data):

$$\hat{y}^{NP}(\mathbf{x}_o) = \arg\min_{y}\; \frac{1}{N} \sum_{i=1}^{N} K(\mathbf{x}_i, \mathbf{x}_o)\,(y_i - y)^2 \qquad (5)$$

The variable of interest, or outcome variable, is *y*_{i}, and the input variables are represented in the vector **x**_{i}. Input variables, or the "feature vector," may refer to rainfall, streamflow, lagged streamflow, etc., or any combination of these at multiple locations; **x**_{o} identifies a point of interest (henceforth the "query" or "query point" from a "test" data set) in the input space (or feature space) for which the value of the outcome variable is desired. Similar to **D**_{N}, we define another data set (a test data set), **D**_{M}, such that (1) it has no common element with **D**_{N}, (2) a "query" belongs to this set and corresponds to an independent **x**, and (3) the corresponding dependent *y* of the data set can be used to test the predictor obtained by parameterizing it on **D**_{N}. *K*(.) is the weight or kernel function that defines the locality about the query in the input space. For the nearest neighbor method, we define the kernel as

$$K(\mathbf{x}_i, \mathbf{x}_o) = \begin{cases} 1, & \|\mathbf{x}_i - \mathbf{x}_o\| \le b \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

Here, ∥.∥ refers to the Euclidean norm. Minimizing equation (5) over *y* with the kernel as defined in equation (6) renders the regression estimator

$$\hat{y}^{nn}_{pred}(\mathbf{x}_o) = \frac{1}{|I_{nn}|} \sum_{i \in I_{nn}} y_i \qquad (7)$$

The set *I*_{nn} contains the indices of the hydrologic series such that the input data lie within radius *b* of the query, and ∣*I*_{nn}∣ refers to the cardinality of the set *I*_{nn}:

$$I_{nn} = \left\{ i : \|\mathbf{x}_i - \mathbf{x}_o\| \le b \right\}$$

We use training data **D**_{N} such that the index *i* refers to time in days and {*y*_{i}, **x**_{i}} represents a lagged hydrologic time series. If {*z*_{t}}_{t=1, …, N} represents a hydrologic time series and ℓ the number of lags in days,

$$y_i = z_t, \qquad \mathbf{x}_i = \left( z_{t-1}, z_{t-2}, \ldots, z_{t-\ell} \right)$$
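The lag embedding and the locally constant, radius-based nearest neighbor estimator described above can be sketched as follows; the series values and parameter choices are illustrative:

```python
import math

def lag_embed(z, ell):
    """Build pairs (y_i, x_i) from a series {z_t}: y_i = z_t and
    x_i = (z_{t-1}, ..., z_{t-ell})."""
    return [(z[t], tuple(z[t - ell:t][::-1])) for t in range(ell, len(z))]

def nn_predict(pairs, x_query, b):
    """Estimator of equation (7): average of y_i over the neighbor set
    I_nn = {i : ||x_i - x_query|| <= b} (the 0/1 kernel of equation (6))."""
    ys = [y for y, x in pairs if math.dist(x, x_query) <= b]
    return sum(ys) / len(ys) if ys else None  # None if no neighbors

# Illustrative scaled series, two lags, radius b = 0.2.
z = [0.2, 0.4, 0.3, 0.5, 0.45, 0.6, 0.55, 0.7]
pairs = lag_embed(z, ell=2)
yhat = nn_predict(pairs, x_query=(0.55, 0.6), b=0.2)
```

The prediction is simply the mean outcome over historical days whose recent past resembles the query's recent past.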

We consider streamflow at two points within the Leaf River basin, one at Collins and the other at McClain. A third basin, of a size in between the other two but with similar characteristics, is also considered. Figure 1 shows the locations of the catchments, and Table 1 summarizes their characteristics. The data were obtained from the MOPEX database [*Duan et al.*, 2006].

Table 1. Characteristics of the Basins Used in the Study

| Basin | A (km^{2}) | P (mm/a) | Q_{a} (mm/a) | Q_{d} (m^{3}/s) | T_{max} (deg C) | T_{min} (deg C) |
|---|---|---|---|---|---|---|
| Leaf River at Collins, Mississippi | 1390 | 1439 | 507 | 22.3 | 24.3 | 10.9 |
| Leaf River at McClain, Mississippi | 9060 | 1462 | 522 | 150.0 | 24.6 | 11.3 |
| Spring River at Waco, Missouri | 5290 | 1077 | 282 | 47.3 | 19.6 | 7.3 |

[16] The nearest neighbor predictor is easily obtained from equation (7). However, its performance on future unseen data still depends on the choice of the parameters *b* and ℓ. These parameters are chosen on the basis of the two paradigms presented below.

#### 3.1. Complexity-Based Prediction Paradigm

[17] The complexity-based prediction paradigm is based on Vapnik-Chervonenkis generalization theory [*Vapnik and Chervonenkis*, 1991]. The complexity of the nearest neighbor methods is defined as their capacity to replicate a given set of data points [*Vapnik*, 2002]. Such a measure depends on both the structure and the parameters of the models; for nearest neighbor methods it therefore depends on the choice of the radius of neighborhood *b* and the number of lags ℓ. Models with larger complexity have a tendency to overfit a given data set, i.e., to have larger estimation error, as they trade off performance on unseen data for better performance on calibration or training data. Here, "estimation error" is defined as the absolute difference between the expected error of a model, say model 1, estimated on a given sample (used for model parameterization) and the expected error of a model, say model 2, estimated on an infinitely large sample obtained from the underlying processes. While the estimation of model 1 is limited by the given sample (and thus model 1 is probably not the best performing model available), model 2 is the best performing model available. We also define "approximation error" as the error made by the estimated model in approximating reality. More complex models may approximate the underlying processes better [*Atkinson et al.*, 2002; *Cucker and Smale*, 2001]; thus, models with larger complexity require more data to achieve lower estimation error and possibly lower approximation error. Since approximation error is generally not verifiable, complexity can be used as a measure to select a model with the lowest estimation error [*Bartlett and Kulkarni*, 1998].

[18] Figure 2 further illustrates the relation between estimation error and complexity and how measuring the complexity of a model is equivalent to measuring the size of its "span." By model span we mean the *N*-dimensional space (of **y** = {*y*_{1}, *y*_{2}, *y*_{3}, …, *y*_{N}}) spanned by a model under different data sets sampled from the underlying but unknown distribution. Consider two model spaces *M*_{1} and *M*_{2} (Figure 2a) and assume that there are only two possible values that **y** can take in the output space. Let points A and B refer to the outputs of models 1 and 2 corresponding to one input sequence **X** = {**x**_{1}, **x**_{2}, **x**_{3}, …, **x**_{N}}, and let two vectors define the expected errors of models 1 and 2 conditioned on **X**. The "estimation error" conditioned on **X**, associated with picking the wrong model (say choosing model 1 instead of model 2), is bounded from above by the magnitude of the difference between these vectors. Further, the uncertainty in estimating the conditional (on **X**) error depends on the uncertainty in **X**, which in turn depends on the width of model space *M*_{1} (*M*_{2}). This is so because a model space encapsulates the output space that is spanned by a model (Figure 2b). Thus, the unconditional "estimation error," obtained by integrating over different realizations of **X** from the underlying distribution, depends on the width of the combined model spaces *M*_{1} and *M*_{2} (Figure 2c). This concept can easily be extended to more general cases, where the complexity of a modeling system is determined by measuring the size of the corresponding model space.

[19] The concept of complexity is formalized in section 3.1.1 for nearest neighbor type models and the measuring of the corresponding modeling space in section 3.1.2, via estimation of its VC dimension.

##### 3.1.1. Concept of Complexity: VC Dimension

[20] To further explain the notion of complexity introduced here, we first consider a binary classification problem. This is a special case of regression estimation with the outcome variable being binary valued rather than real valued and the estimated classification function groups the input data (or the forcing data) into one of the two classes.

A set of classifier functions can be represented by a set of indicator functions,

$$M = \left\{ Q(\mathbf{w}, \alpha) : \alpha \in \Lambda \right\}, \qquad Q(\mathbf{w}, \alpha) \in \{0, 1\}$$

Here, Λ defines a set of parameters, each element of which identifies one element of the set *M*. In other words, for some *α* ∈ Λ, *Q*(**w**, *α*) is a classification function from the set *M* that labels an arbitrary input vector (or feature point) **w** as either 0 or 1.

[22] The Vapnik-Chervonenkis (VC) dimension [*Vapnik et al.*, 1994; *Vapnik*, 2002], due to VC generalization theory, is used to quantify the complexity of such a class of mappings. It is defined as the maximum number, *h*, of input or feature points **w**_{1}, **w**_{2}, …, **w**_{h} that can be separated into the two classes in all 2^{h} possible ways by subsets of *M* [*Vapnik*, 2002]. The *h* feature points are then said to be "shattered." If for any positive *n* there exists a set of *n* feature points that can be shattered by the set of indicator functions defined above, then its VC dimension is infinite. If a set of classifiers is capable of shattering a large number of feature points, it is complicated enough to find a mapping for any finite data set (of size *n* ≤ *h*) irrespective of the underlying input-outcome probability distribution from which the data set is sampled. Such highly complex mappings are said to have "overfit" the data and are also responsible for high estimation uncertainty [*Vapnik*, 1999]. Therefore, a set of functions with low complexity is desirable for finding an appropriate mapping that has low estimation uncertainty when the number of data points is small.
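Shattering can be made concrete with a toy one-dimensional family of threshold classifiers, which shatters any single point but no pair of distinct points, so its VC dimension is 1; a minimal sketch:

```python
from itertools import product

def dichotomies(points, classifiers):
    """Set of labelings of `points` achievable by the classifier family."""
    return {tuple(c(p) for p in points) for c in classifiers}

# Family of 1-D threshold classifiers Q_theta(w) = 1 if w >= theta else 0.
thresholds = [i / 10 for i in range(-5, 26)]
classifiers = [lambda w, t=t: int(w >= t) for t in thresholds]

one_point = [0.7]
two_points = [0.3, 0.8]

# One point is labeled in all 2^1 possible ways -> shattered.
shatters_one = dichotomies(one_point, classifiers) == set(product((0, 1), repeat=1))
# Two points are not: labeling (1, 0) is unreachable because 0.3 < 0.8.
shatters_two = dichotomies(two_points, classifiers) == set(product((0, 1), repeat=2))
print(shatters_one, shatters_two)  # → True False
```

For the nearest neighbor classifiers of this paper the family is far richer, which is why the VC dimension must be estimated numerically rather than read off by inspection.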

[23] In the case of nearest neighbor methods for time series forecasting (we scale the entire data set between 0 and 1), we can obtain a set of classifiers for a given set of data **D**_{N} from equation (7) as

$$M_{nn} = \left\{ Q(\mathbf{x}_o; b, \ell) : b > 0,\; \ell \in \{1, 2, \ldots\} \right\} \qquad (8)$$

where *b* is the radius of nearest neighbors, ℓ is the number of lags in days,

$$Q(\mathbf{x}_o; b, \ell) = I\!\left[ \hat{y}^{nn}_{pred}(\mathbf{x}_o; b, \ell) \ge 0.5 \right]$$

with *I*[.] the indicator function, and *ŷ*_{pred}^{nn}, *I*_{nn}, **D**_{N} have been previously defined. Therefore, possible values of *b* and ℓ substitute for the set of abstract parameters in the general definition of the set of classifiers.

[24] Finally, an equivalent definition of VC dimension applies to a set of real functions, such as the set of real valued mappings obtained from equation (7), defined via the cardinality of its minimal ε-net [*Vapnik et al.*, 1994; *Vapnik*, 1999]. However, the VC dimension of the set of real valued nearest neighbor algorithms can also be set to the VC dimension of the corresponding set of classifiers as defined in equation (8) [*Vapnik*, 2002].

[25] The VC dimension, for a given training size *N*, defines the rate of uniform convergence of the empirical error on the training data to its expected error. The expected error defines the performance of a model on future unseen data, or in our case the performance of a nearest neighbor model for a given value of *b* and ℓ on future unseen data. For the case of binary classifiers, if we define the empirical or classification error on data **D**_{N} as

$$v_1(\alpha) = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - Q(\mathbf{x}_i, \alpha) \right| \qquad (10)$$

and the corresponding expected error (with *P*(.) a measure of probability) as

$$R(\alpha) = \int \left| y - Q(\mathbf{x}, \alpha) \right| dP(\mathbf{x}, y)$$

then VC theory provides an upper bound on the worst case absolute deviation of the empirical from the expected error [*Cortes*, 1993; *Vapnik and Chervonenkis*, 1971]

$$P\left\{ \sup_{\alpha \in \Lambda} \left| R(\alpha) - v_1(\alpha) \right| > \varepsilon \right\} \le 4\, m(2N)\, e^{-\varepsilon^2 N / 8} \qquad (11)$$

Here, *m*(*n*) is the growth function of our set of classifiers, *M*_{nn}, defined as the maximum number of ways in which a finite set of data of size *n* can be classified by our set of classifiers, with the maximum taken over all possible data sets of size *n* that can be sampled from any given but fixed distribution. Given that the VC dimension of *M*_{nn} is *h*, the growth function of *M*_{nn} follows the relationship [*Cortes*, 1993]

$$m(n) = 2^{n} \;\; \text{for } n < h, \qquad m(n) \le \left( \frac{en}{h} \right)^{h} \;\; \text{for } n \ge h$$

Using (11) for a one-sided bound, the following inequality then holds for all *α* ∈ Λ with probability 1 − *δ*:

$$R(\alpha) \le v_1(\alpha) + \sqrt{ \frac{h\left( \ln(2N/h) + 1 \right) - \ln(\delta/4)}{N} } \qquad (12)$$

Inequality (11) shows that a larger complexity (implying a larger VC dimension) leads to a weaker bound on the uniform convergence (i.e., convergence for the worst performing parameter from the parameter space at any given finite sample size *N*) of the empirical error to its expected value as the sample size *N* goes to infinity. Since the upper bound on the rate of convergence is uniform, it applies to the rates of convergence for all parameters in the parameter space. By rearranging the terms in inequality (11), we obtain inequality (12), which suggests that the strength of the upper bound on the expected error for a given sample size *N* depends not only on the empirical risk but also on a function of the complexity of *M*_{nn}. Thus, for sample size *N*, more complex models tend to bound the expected error at the same level for any parameter (including the parameter that minimizes the empirical risk) with a lower confidence level 1 − *δ*. In other words, a larger number of data points is required for more complex models to obtain a desired level of confidence in bounding the expected error from above by a certain value. A similar inequality exists for a real-valued class and dictates a similar relationship between performance and the VC dimension [*Vapnik*, 2002].
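The qualitative behavior described here, namely that larger *h* weakens the bound at fixed *N* while larger *N* tightens it, can be checked numerically. The function below uses one common textbook form of the VC confidence term; the exact constants in the paper's inequality may differ:

```python
import math

def vc_confidence(h, N, delta):
    """Complexity penalty of a Vapnik-style generalization bound: with
    probability 1 - delta, expected error <= empirical error + this term.
    One common textbook form (constants may differ from the paper's)."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) + math.log(4 / delta)) / N)

# The penalty grows with VC dimension h and shrinks with sample size N.
penalty_simple = vc_confidence(h=5, N=10_000, delta=0.05)
penalty_complex = vc_confidence(h=50, N=10_000, delta=0.05)
penalty_more_data = vc_confidence(h=5, N=100_000, delta=0.05)
print(penalty_simple < penalty_complex, penalty_more_data < penalty_simple)  # → True True
```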

[26] An inequality of type (12) forms the basis of VC generalization theory [*Vapnik*, 1999]. It, however, only provides an upper bound on the expected error that is valid for any parameter of the parameter set defining the class of models. Minimizing the RHS of (12) at a certain level of confidence 1 − *δ* can only allow exclusion of worse performing (in the expected error sense) parameters, while obtaining the parameter that minimizes the expected error via such minimization remains approximate. Hereinafter, we call the predictor that minimizes an upper bound on the expected error of the type in (12) the robust predictor. Such parameterization also implicitly trades off the empirical error against the complexity of the models in the class *M*_{nn}. Finally, a set of models such as *M*_{nn} can also be subdivided into a collection of disjoint sets of models with their respective measures of complexity. Such a subdivision can be based on how well the complexity of the underlying class of models needs to be approximated. For example, *M*_{nn} can be decomposed into a collection of disjoint sets of the type

$$M_{nn}^{k} = \left\{ Q(\mathbf{x}_o; b, \ell) : (k-1)\,\varepsilon \le b < k\,\varepsilon \right\}, \qquad k = 1, 2, \ldots \qquad (13)$$

such that

$$M_{nn} = \bigcup_{k} M_{nn}^{k}, \qquad M_{nn}^{j} \cap M_{nn}^{k} = \emptyset \;\; \text{for } j \ne k$$

Thus, as ɛ decreases, *M*_{nn} can be represented by a union of many smaller sets of nearest neighbor models, with a certain complexity associated with each of those sets. Consequently, an inequality of type (12) is associated with each such set of models. This decomposition of the set of nearest neighbor models will be used later in the paper, when the complexity of the nearest neighbor models is calculated and used in robust predictor estimation.

##### 3.1.2. Algorithm for Calculation of VC Dimension

[27] If we now define *v*_{2}(.) as *v*_{1}(.) in equation (10) except that it is calculated on a different realization of data **D**′_{N} of the same size and drawn from the same distribution, then the VC dimension for our class of classifiers in equation (13) obeys the approximate equality in (14) [*Corani and Gatto*, 2006a, 2006b; *Cortes*, 1993; *Vapnik et al.*, 1994],

$$E\left[ \max_{\alpha \in \Lambda} \left| v_1(\alpha) - v_2(\alpha) \right| \right] \approx \Phi\!\left( \frac{N}{h} \right) \qquad (14)$$

where, with *τ* = *N*/*h*,

$$\Phi(\tau) = \begin{cases} 1, & \tau < 0.5 \\ B\, \dfrac{\ln(2\tau) + 1}{\tau - K} \left( 1 + \sqrt{1 + \dfrac{d\,(\tau - K)}{\ln(2\tau) + 1}} \right), & \tau \ge 0.5 \end{cases}$$

Here, the parameters of the bound above are *K*, *d*, and *B*. The value of *B* is chosen from the continuity conditions that at *N* = *h*/2 the value of Φ(.) is 1. The other two parameters are free, assumed to be universal, and are obtained empirically by fitting the bound in the relationship (14) to experimental data using a set of classifiers whose VC dimension is known [*Vapnik et al.*, 1994].
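The fitting step of this measurement procedure can be sketched as follows: measure (or here, synthesize) the maximal deviation between the two half-sample error rates at several design sample sizes *N*, then choose *h* so that Φ(*N*/*h*) best fits the observations. The Φ below is an illustrative stand-in with the right shape (equal to 1 below *N*/*h* = 0.5, then decreasing), not the fitted form of *Vapnik et al.* [1994]:

```python
import math

def phi(tau):
    """Illustrative stand-in for the bound function Phi(N/h) in (14):
    equal to 1 for tau < 0.5 and strictly decreasing thereafter. The
    paper's Phi has empirically fitted constants (K, d, B); this simpler
    shape only demonstrates the fitting step."""
    if tau < 0.5:
        return 1.0
    return (math.log(2 * tau) + 1) / (2 * tau)  # continuous: phi(0.5) = 1

def estimate_vc_dim(sizes, deviations, h_grid):
    """Choose h whose predicted curve Phi(N/h) best fits, in least
    squares, the measured maximal deviations xi(N)."""
    return min(h_grid,
               key=lambda h: sum((phi(N / h) - xi) ** 2
                                 for N, xi in zip(sizes, deviations)))

# Synthetic check: deviations generated with a known "true" h = 20
# should be recovered exactly by the fit.
sizes = [20, 40, 80, 160, 320, 640]
deviations = [phi(N / 20) for N in sizes]
h_hat = estimate_vc_dim(sizes, deviations, h_grid=range(5, 101))
print(h_hat)  # → 20
```

In the actual algorithm of Appendix A the deviations come from repeated two-sample experiments on the classifiers of equation (8) rather than from a synthetic curve.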

##### 3.1.3. Robust Model Selection

[29] The VC dimension calculated as described above also applies to the real valued class of functions from the *M*_{nn} class of classifiers [*Corani and Gatto*, 2006a, 2006b; *Vapnik*, 2002]. Therefore, robust estimation of *α* = {*b*, ℓ} for real valued nearest neighbor models can be obtained by minimizing the RHS of the following inequality, similar to inequality (12) [*Cherkassky and Mulier*, 1998; *Corani and Gatto*, 2006a, 2006b; *Pande*, 2005]:

$$R(\alpha, \mathbf{x}_o) \le \frac{r_N(\alpha, \mathbf{x}_o, N)}{vm} \qquad (15)$$

where *r*_{N}(*α*, **x**_{o}, *N*) is the minimized empirical risk corresponding to a nearest neighbor predictor *ŷ*_{pred}^{nn} at query **x**_{o} ∈ ℝ_{+}^{ℓ}. From the cost function in (5), the minimized empirical risk is

$$r_N(\alpha, \mathbf{x}_o, N) = \frac{1}{N} \sum_{i \in I_{nn}} \left( y_i - \hat{y}^{nn}_{pred}(\mathbf{x}_o) \right)^2$$

with the expected risk defined as

$$R(\alpha, \mathbf{x}_o) = \int K(\mathbf{x}, \mathbf{x}_o) \left( y - \hat{y}^{nn}_{pred}(\mathbf{x}_o) \right)^2 dP(\mathbf{x}, y)$$

Further,

Here *vm* (or “Vapnik's measure”) [*Cherkassky and Mulier*, 1998] is an inverse measure of complexity, i.e., it is inversely related to the VC dimension obtained from (14). The left-hand side of (15) is the expected (or generalization) error and the right-hand side bounds it from above. For a given *α* = {, ℓ}, the numerator of the RHS is the tightest possible bound for nearest neighbor models, obtained by minimizing the cost function (the empirical risk function for the nearest neighbor model) in (5) over ^{NP}. Further, minimizing the entire RHS of (15) over *α* formalizes the principle of Occam's razor, i.e., selection of the simplest (least complex) hypothesis that also agrees with the data as much as possible. The algorithm for the implementation of prediction based on robust model selection is presented in Appendix B.
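As an illustration only (the paper's exact expression for *vm* in (15) is not reproduced here), the sketch below uses the practical penalization factor for regression given by *Cherkassky and Mulier* [1998], with *p* = *h*/*N*, and selects *α* by minimizing the penalized empirical risk:

```python
import math

def vapnik_measure(h, N):
    """Practical penalization factor ('vm') used with Vapnik's bound for
    regression [Cherkassky and Mulier, 1998]; an illustrative choice,
    not necessarily the paper's exact expression (15). p = h/N is the
    ratio of VC dimension to sample size."""
    p = h / N
    return max(1.0 - math.sqrt(p - p * math.log(p) + math.log(N) / (2 * N)), 0.0)

def select_robust_model(candidates, N):
    """Occam's-razor selection: each candidate is (alpha, emp_risk, h);
    pick the alpha minimising the penalized risk emp_risk / vm(h, N).
    A vanishing vm makes the bound vacuous (treated as infinite)."""
    def bound(candidate):
        alpha, r_emp, h = candidate
        vm = vapnik_measure(h, N)
        return r_emp / vm if vm > 0 else float("inf")
    return min(candidates, key=bound)[0]
```

Note how a complex model with a slightly lower empirical risk loses to a simpler one once its VC dimension is large relative to the sample size: the penalty drives its bound toward infinity.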

#### 3.2. Probabilistic Ensemble Method Paradigm (Nearest Neighbor Probabilistic Ensemble Method)

[30] We built an ensemble approach similar to that of *Tamea et al.* [2005], devised for *k* nearest neighbors, and call it NNPE. This method differs from the nonlinear probabilistic ensemble (NLPE) method of *Tamea et al.* [2005]: the NLPE was proposed for nonlinear time series analysis using locally linear predictors, while here we use a method adapted for one-step-ahead prediction with nearest neighbor methods.

[31] A time series is divided into a training part and a testing part, with the training part further subdivided into a calibration and a noncalibration set. A nearest neighbor predictor for a query is estimated over the calibration set for a given value of the parameters *α* = {, ℓ}, and the performance of the different predictors (corresponding to different values of *α*) is evaluated over the noncalibration set. A certain number of top performers (a collection of top performing alphas) is retained to form an ensemble prediction for the query. The ensemble is made probabilistic by attaching additional uncertainty to it: the residuals of the calibration step are associated with each member of the ensemble. A collection of predictions for a query is thus created, from which a confidence interval of prediction can be constructed. As noted by *Tamea et al.* [2005], such a method differs from the GLUE methodology [*Beven and Binley*, 1992]: the latter pursues the notion of equifinality at all levels of model estimation, while here equifinal sets over *α* are obtained once a nearest neighbor point predictor has been created for each *α* = {, ℓ}. Further, a probabilistic nature is imparted to the ensemble predictions by incorporating each member's prediction uncertainty. The specific implementation of the NNPE algorithm is described in Appendix C.

[32] A nonlinear probabilistic ensemble can thus be obtained for a test data set, with the confidence interval defined as a certain interquantile range. We consider only the 95% confidence level, defined as the 2.5 to 97.5 percentile range, and, following *Tamea et al.* [2005], use the median of the ensemble predictions for any query as its point prediction.
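The steps above can be sketched as follows (our own minimal pseudo-implementation; function names, the 50/50 calibration split, and the residual resampling scheme are assumptions, and the actual algorithm is given in Appendix C):

```python
import random
import statistics

def nnpe_predict(train, query, alphas, predictor, n_top=10, n_draws=100, seed=0):
    """Sketch of NNPE: rank alphas on a noncalibration set, keep the top
    performers, perturb each member's prediction with its own calibration
    residuals, and summarise the resulting draws.
    `predictor(cal_set, alpha)` must return (predict_fn, residuals)."""
    rng = random.Random(seed)
    split = len(train) // 2
    cal, noncal = train[:split], train[split:]

    # Fit one predictor per alpha on the calibration set and score it
    # on the noncalibration set (mean squared error).
    scored = []
    for alpha in alphas:
        predict, residuals = predictor(cal, alpha)
        err = sum((y - predict(x)) ** 2 for x, y in noncal) / len(noncal)
        scored.append((err, predict, residuals))
    scored.sort(key=lambda s: s[0])

    # Probabilistic ensemble: each top member contributes draws of its
    # prediction at the query plus a resampled calibration residual.
    draws = []
    for _, predict, residuals in scored[:n_top]:
        base = predict(query)
        draws += [base + rng.choice(residuals) for _ in range(n_draws)]
    draws.sort()
    lo = draws[int(0.025 * len(draws))]
    hi = draws[int(0.975 * len(draws)) - 1]
    return statistics.median(draws), (lo, hi)
```

The median of the draws serves as the point prediction and the 2.5–97.5 percentile range as the 95% confidence interval, mirroring the summary used in the text.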

#### 3.3. Application to Hydrologic Data

[33] In order to apply the methods and compare their finite sample performance, we scale the 54 years of Leaf River streamflow data at each of the two gauging stations (Leaf River at Collins and at McLain, Mississippi) between 0 and 1, and do the same for another basin (Spring River at Waco, Missouri) with data overlapping the same time period. Collins is nested within McLain, and Spring is intermediate in size between the other two (see Figure 1 for locations and Table 1 for characteristics). At any time, the available normalized data is split into disjoint sets: a training data set **D**_{N} and 19 test data sets {**D**_{M}^{i} : *i* = 1, 2, …, 19}. The start and end dates of the 19 test data sets are fixed, while those of the training data set are varied as explained below. Starting from *t* = 0, the following steps are taken.

1. Nearest neighbor models are parameterized within the two paradigms for the same training data of a fixed size, *N* = 2^{t} (in years), and evaluated on a test data set of fixed size *M* (2 years).
2. Repeat step 1 for each of the 19 test data sets {**D**_{M}^{i} : *i* = 1, 2, …, 19}.
3. Increment *t*: *t* = *t* + 1.
4. If *t* < 5, go to step 1; else stop.
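The looped design in steps 1–4 can be enumerated as follows (a trivial sketch; the function name is ours):

```python
def doubling_design(max_t=5, n_test_sets=19):
    """Enumerate the experimental design described in the text:
    training sizes N = 2**t years for t = 0..max_t-1, each evaluated
    against the same fixed collection of 2-year test sets."""
    runs = []
    for t in range(max_t):                   # t = 0, 1, 2, 3, 4
        for i in range(1, n_test_sets + 1):  # test sets D_M^1 .. D_M^19
            runs.append((2 ** t, i))         # (training years, test set index)
    return runs
```

This yields 5 × 19 = 95 training/testing runs per basin and per paradigm, with training sizes 1, 2, 4, 8, and 16 years.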

### 5. Conclusions and Future Work


[49] We have introduced a complexity-based prediction paradigm for nearest neighbor methods that is robust in its choice of parameters. It is based on Vapnik-Chervonenkis generalization theory, which suggests good small sample properties, such as the capacity to handle uncertainty in prediction on future data [*Vapnik*, 2002]. We calculated a measure of complexity, the VC dimension, for nearest neighbor methods and utilized it in complexity-based robust predictions. We also compared this paradigm with a nearest neighbor probabilistic ensemble (NNPE) method inspired by *Tamea et al.* [2005]. In comparing the two paradigms, we partitioned the data sets for three basins into a fixed set of 19 2-year testing data sets, while using the remaining data, in different sizes, for training nearest neighbor methods via the two approaches. The two paradigms were thus trained on 1, 2, 4, 8, and 16 years of data, and their performance on the 19 2-year testing data sets shows how they compare in terms of small sample performance.

[50] Complexity-based prediction methods were found to have “weakly” better control on the bias-variance tradeoff in predictions than NNPE methods, with consistently lower bias (though there is no definitive evidence that either method is superior to the other). This is possibly due to the tendency of the robust methods to select conservative models. However, bias was more difficult to control for complexity-based methods in smaller basins, where correspondingly poorer control on the bias-variance tradeoff was observed.

[51] Significant coherence was also found between low-frequency observed streamflow events and the input dimensionality selected by the complexity-based predictor, potentially indicating that complexity-based methods tend to conform to the complexity of the underlying processes. However, despite the choice of higher input dimensionality, deemed a robust choice for the more complex parts of the observed hydrograph, predictions for these events remained inaccurate. A similar situation was observed for the NNPE method. In addition, the evaluation of prediction uncertainty, via uncertainty bounds, by NNPE methods was poor and inconsistent for the small basins (though consistent for the larger basin). Measures such as the fraction of observations bracketed by the uncertainty bounds, as well as the width of those bounds, were inconsistent with what is expected as sample size increases. This potentially indicates that the NNPE method has difficulty in controlling uncertainty in its predictions, and it suggests a need for postprocessing of ensemble members, and the uncertainty surrounding them, before they are used to estimate modeling uncertainty.

[52] We add a note of caution here. The observations made herein do not reflect the performance of the NLPE of *Tamea et al.* [2005] and are restricted to that of the NNPE. The difference between the two is that the former is based on locally linear models while the latter is based on locally constant models. Since the motivation of the paper was to introduce a complexity-based paradigm and evaluate its performance against a probabilistic ensemble paradigm for a given model (a nearest neighbor model), we restricted ourselves here to nearest neighbor models. However, we envision a comparison of the two paradigms for locally linear (or even locally polynomial) models in future work.

[53] Both methods showed poorer finite sample performance for the smaller basins of the three used. However, the study is not exhaustive over basin size (though two of the three basins were of similar size and smaller than the third), and this observation may not be universal. Nonetheless, possible weaknesses in the performance of the two methods for the smaller basin sizes may be due to the use of a non-physically-based model, i.e., nearest neighbor methods. The nearest neighbor methods used here depended solely on the autoinformation in a univariate time series for future predictions. For larger basin areas (assuming basins differ only in one characteristic, their size), streamflow data may contain more autoinformation relevant to future predictions at a particular time scale because of the slower response to precipitation.

[54] This, in turn, motivates future studies into how complexity-based methods (with VC dimension as the measure of complexity) perform for physically based or conceptual hydrologic models, especially for smaller basins. The motivation to use VC generalization theory lies in its concept of complexity being model-independent and in its foundation in probability theory. It will be an interesting exercise to look into the potential connection between the model structure selected by complexity-based methods and the complexity underlying a basin's behavior (similar to the work by *Jakeman and Hornberger* [1993] and *Young et al.* [1996]). A natural extension of the comparison with ensemble methods will also yield insights into the advantages of each method.