Keywords:

  • complexity;
  • hydrologic uncertainty;
  • VC generalization theory;
  • robust parameter estimation;
  • nearest neighbor models

Abstract


[1] Water resource management requires robust assessment of the consequences of future states of the resource and, when it depends on prediction models, assessment of the uncertainties associated with those predictions. Ensemble prediction/forecast systems have been extensively used to address such issues; they seek to provide a collection of predictions, via a collection of parameters, with the intent of bracketing future observations. However, such methods do not have well-established finite-sample properties and generally require large samples to additionally identify better performing predictions, for example, in nonlinear probabilistic ensemble methods. We here propose a different paradigm, based on Vapnik-Chervonenkis (VC) generalization theory, for robust parameter selection and prediction. It rests on a data-independent concept of complexity that relates the finite sample performance of a model to its performance when a large sample of the same underlying process is available. We employ a nearest neighbor method as the underlying prediction model, introduce a procedure to compute its VC dimension, and test how the two paradigms handle uncertainty in one step ahead daily streamflow prediction for three basins. In both paradigms, the predictions become more efficient and less biased with increasing sample size. However, the complexity-based paradigm has a better bias-variance tradeoff property for small sample sizes. The uncertainty bounds on predictions resulting from ensemble methods behave in an inconsistent manner for smaller basins, suggesting the need for further postprocessing of ensemble members, and of the uncertainty surrounding them, before they are used in modeling uncertainty estimation. Finally, complexity-based predictions appear to mimic the complexity of the underlying processes via input dimensionality selection of the nearest neighbor model.

1. Introduction


[2] Streamflow forecasting is critical for short- to medium- to long-term water resources planning. It is therefore also critical to investigate uncertainty in streamflow forecasting if managers or decision makers depend upon such numbers to plan for the future. Some of the uncertainty components, such as data and model uncertainty, are conceptually well understood [e.g., Bárdossy et al., 2005; Beven, 2002, 2004, 2005; Kavetski et al., 2006a, 2006b; Liu and Gupta, 2007; Oreskes and Belitz, 2001; Pande et al., 2005; Refsgaard et al., 2006], and balanced consideration of such uncertainties clearly makes forecasting credible.

[3] What is desirable is a description of uncertainty in future predictions of streamflow levels, say U, which is also minimal in extent, i.e., a set U that is smallest in size but explains most of the uncertainty in prediction. A probability measure of total uncertainty in forecasting, P(U), can then be defined as an integral of a joint distribution of uncertainty associated with the sources, i.e., models (ℳ) and data (D), with the integral over the latter. For discrete variables, it can be expressed as a sum. In other words,

  • $P(U) = \sum_{i}\sum_{j} P\left(U, \mathcal{M}_i, D_j\right) \qquad (1)$

where i and j denote indices for a model and a sample respectively. This can further be decomposed in terms of conditionals, with some probability measures defined on ℳ and D:

  • $P(U) = \sum_{i}\sum_{j} P\left(U \mid \mathcal{M}_i, D_j\right) P\left(\mathcal{M}_i \mid D_j\right) P\left(D_j\right) \qquad (2)$

Although necessarily vague, a probability measure on ℳ may define the weights ascribed to each element in model space. A probability measure on D, however, defines a measure of the stochasticity of the underlying processes and is independent of the model space. Thus, P(ℳi ∣ Dj) = P(ℳi) = wi. If we assume a model space that contains discrete elements, as is the case when we have a finite collection of hydrologic models ℳ = {mi, i = 1, …, M}, and analyze uncertainty under a given sample of data Dj, equation (2) transforms to

  • $P(U) = \sum_{i}\sum_{j} w_i\, P\left(U \mid m_i, D_j\right) P\left(D_j\right) \qquad (3)$

where wi is the weight associated with model mi, and Σi wi = 1.

[4] In order to have a description of forecasting uncertainty that is minimal in extent, minimization of P(U) in (3) requires minimization of the right-hand side (RHS) over future data that are not yet available. When only one model is considered, minimization of (3), in the absence of future data, is realized by parameter estimation of the model based on maximization of some likelihood measure over the calibration data.

[5] Uncertainty bounds on future predictions have previously been estimated using the GLUE methodology [e.g., Beven and Binley, 1992; Blazkova and Beven, 2002, 2004; Cameron et al., 2000; Freer et al., 1996], where generalized likelihood measures are used to obtain an equifinal set of parameters for a model. Predictions corresponding to those equifinal sets represent uncertainty in prediction into the future. Other parameter estimation techniques that minimize some error function have also been extensively used; reviews can be found in work by Gupta et al. [2003, 2005] and Wagener and Gupta [2005]. Novel parameter estimation algorithms based on extensions of the highly successful shuffled complex evolution algorithm [Duan et al., 1992] have been used to assess predictive uncertainty as a consequence of uncertainty in the parameters [e.g., Vrugt et al., 2003b; Vrugt and Robinson, 2007], with further extensions into multiobjective parameter estimation [e.g., Vrugt et al., 2003a; Yapo et al., 1998]. The former consider predictive uncertainty in estimating parameters using one objective function, while the latter also include uncertainty arising from the selection of multiple objective functions. More recently, Moradkhani et al. [2005] and Vrugt et al. [2005] have coupled data assimilation and adaptive methods of parameter estimation to address forecasting uncertainty. However, all the above methods only account for uncertainty emanating from data finiteness based on one realization of the data.

[6] When multiple hydrologic models are considered, predictive uncertainty can be defined in terms of multimodel ensemble methods [e.g., Georgakakos et al., 2004; McIntyre et al., 2005; Schaake et al., 2006]. Overall uncertainty can then be defined as a weighted sum of the uncertainties corresponding to individual calibrated models. Bayesian model averaging methods [e.g., Ajami et al., 2006, 2007; Duan et al., 2007; Neuman, 2003; Raftery et al., 2005] take a step further by also optimally identifying the model weights in the estimation of equation (3). Thus, Bayesian model averaging methods address predictive uncertainty due to both data finiteness and model uncertainty sources [Hendry and Reade, 2005].

[7] Treatment of predictive uncertainty conditional on a particular model is currently limited to the likelihood of the model on available calibration data. The summand on the RHS of equation (3) can be represented by the uncertainty in parameter estimation given a model mi and a sample D* as follows:

  • equation image

Here, the RHS is estimated on the basis of a likelihood measure on available data. It is therefore parameter uncertainty that propagates into predictive uncertainty. Irrespective of the treatment of conditional parameter uncertainty, whether by multiobjective methods or by a generalized likelihood measure, it remains random because D* is one realization of D that occurs with probability P(D = D*). Therefore, analysis of uncertainty, even using ensemble-based methods, is incomplete, as it is still plagued by the sampling uncertainty associated with finite samples.

[8] Major issues associated with finite samples arise especially for high-dimensional hydrologic models. The parameter uncertainty obtained from finite calibration data is itself uncertain because of sampling error [Vapnik, 2002]. It is also difficult to identify a collection of parameters that works well for different realizations of data from the distribution induced by the underlying processes. Finally, as is evident from equation (3), predictive uncertainty based on calibration on only one sample does not account for the entire data uncertainty. Under such circumstances point forecasts, even if represented by the median of ensembles, may prove highly unreliable.

[9] Traditionally in regression analysis, such randomness is treated by bootstrapping samples of data D of the same length in order to mimic P(D), so that the overall uncertainty due to finite data can be considered [e.g., Andrews and Buchinsky, 2001; Efron, 1979; Hardle and Bowman, 1988; Lall and Sharma, 1996]. Similar studies have also been extended to examine the effect of data uncertainty on parameter uncertainty [e.g., Kavetski et al., 2006a; Pande et al., 2005]. More recently, I. A. Tcherednichenko et al. (Effect of data uncertainty on parameter estimation and its uncertainty using high-density regions of bootstrapped likelihood, unpublished manuscript, 2009) have studied how hydrologic models corresponding to parameters in high-density regions of bootstrapped likelihood space behave under a simple additive noise assumption on the underlying data distribution. That work also probes how well such parameters represent the overall parameter uncertainty resulting from data uncertainty. However, such methods are data and computationally intensive.

[10] The nature of predictive uncertainty from data uncertainty can be evidenced from the joint distribution of uncertainty conditional on model specification,

  • equation image

Thus, in order to study the current treatment of data uncertainty in ensemble methods, we choose a specific class of models, nearest neighbor algorithms, which have been extensively used in nonparametric stochastic hydrology [e.g., Bárdossy et al., 2005; Karlsson and Yakowitz, 1987; Lall and Sharma, 1996; Sharma et al., 1997; Yakowitz, 1993], and compare their performance under the ensemble approach with the performance of the same class of models under another paradigm that can also handle data uncertainty.

[11] The paper is organized as follows. Section 2 describes the specific contributions of this study. Section 3 describes the methodology and the data used. The methodology section covers the concept of the Vapnik-Chervonenkis (VC) complexity measure and the algorithm used for its calculation. It also explains the robust prediction and NNPE methods, concluding with the steps to implement these two paradigms for further analysis. The specific algorithms are described in Appendices A–C. Section 4 presents the results based on the methodology outlined in section 3. In particular, it presents the calculation of the VC dimension for nearest neighbor methods; a comparison of the performance of complexity-based prediction and NNPE for varying sample sizes and over multiple test data sets for three basins; an assessment of how well complexity-based model selection mimics the underlying complexity of the processes across the three basins considered; and, finally, the behavior of the uncertainty bounds of the NNPE paradigm with increasing sample size for the same three basins. Section 5 concludes the paper.

2. Contributions of the Study


[12] In order to empirically study current ensemble-based prediction methods and how they treat sampling uncertainty, and inspired by Tamea et al. [2005], we present the finite sample performance of a nonlinear probabilistic ensemble method and its prediction uncertainty for the nearest neighbor class of models. Our ensemble method, however, differs from that of Tamea et al. [2005] in the use of a locally constant model rather than a locally linear model. We therefore call the ensemble method presented here a nearest neighbor probabilistic ensemble (NNPE) method. We also point out that a conclusion drawn for the NNPE may not hold for the Tamea et al. [2005] method. By finite sample performance we mean how the model predictions perform with varying size of the calibration (or "training") data. Nonetheless, the need for a robust predictor is ever present, which ensemble methods may address via the median of the ensembles. However, no theoretical foundation is available that can ensure the robustness of such median predictions. Further, the ensemble approach is more data demanding, as the training data have to be split into calibration and validation subsets to create appropriate ensembles [Tamea et al., 2005].

[13] We therefore propose another paradigm for robust prediction based on Vapnik-Chervonenkis generalization theory [Vapnik and Chervonenkis, 1991]. Its foundations lie in a parameterization based on worst-case performance [Cortes, 1993], thus providing parameters of a model that work well for different realizations of data from underlying but unknown processes. Such parameters perform especially well for small sample sizes, and their estimation is controlled for the dimensionality of the problem [Vapnik, 2002]. Critical to VC generalization-based model estimation is the concept of model complexity, which determines the robustness of prediction on unseen future data for a given size of calibration (or "training") data. Various measures of model complexity, and selection of models with optimal complexity, have been proposed [Atkinson et al., 2002; Jakeman and Hornberger, 1993; Puente and Sivakumar, 2007; Schoups et al., 2008]. Model calibration, and therefore prediction, is then based on the principle of Occam's razor, which optimally trades off model complexity against its likelihood on available data [Downer and Ogden, 2003; van der Linden and Woo, 2003; Young et al., 1996]. The optimal complexity chosen for a given sample of training data may therefore also provide some insight into the complexity of the underlying processes. A finite sample analysis similar to that for the NNPE method is also applied to this paradigm to test its robustness and to compare its performance with the NNPE paradigm.

[14] Thus, the contributions of this study are manifold. While ensemble-based methods are prominent in the literature for describing predictive uncertainty, we are not aware of a study of how well they handle the sampling uncertainty resulting from the finiteness of data. We examine this issue via the finite sample performance of NNPE-based prediction for the simplest class of hydrologic models, i.e., nearest neighbor methods. Since finite sample performance is closely linked to the robustness of predictions via how well data uncertainty is handled, we propose a new paradigm based on VC generalization theory. We introduce the concept of complexity propounded within VC generalization theory and calculate it for our class of models. Furthermore, we investigate whether robust modeling is connected to the complexity of the underlying processes and how the uncertainty bounds provided by the NNPE paradigm perform with increasing sample sizes. Finally, we compare the finite sample performance of VC theory-based robust modeling to that of NNPE to find evidence, if any, of the advantages of explicitly considering model complexity over ensemble methods (i.e., improved model performance for small sample sizes).

3. Methodology


[15] Nearest neighbor methods belong to a class of nonparametric regression functions. Given a hydrologic time series DN = {{yi, xi}, i = 1, …, N}, with yi a nonnegative scalar outcome and xi a vector of nonnegative inputs, a nonparametric regression estimator ŷNP minimizes the cost functional in (5) over DN (from here on referred to as "training" data):

  • $\sum_{i=1}^{N} K(x_i, x_o)\left(y_i - y_{NP}\right)^2 \qquad (5)$

The variable of interest, or outcome variable, is yi, and the input variables are represented in the vector xi. Input variables, or the "feature vector," may refer to rainfall, streamflow, lagged streamflow, etc., or any combination of these at multiple locations; xo identifies a point of interest (henceforth called the "query" or "query point," taken from a "test" data set) in the input space (or feature space) for which the value of the outcome variable is desired. Similar to DN, we define another data set (a test data set), DM, such that (1) it has no common element with DN, (2) a "query" belongs to this set and corresponds to an independent x, and (3) the corresponding dependent y of the data set can be used to test the predictor obtained by parameterizing it on DN. K(.) is the weight or kernel function that defines the locality about the query in the input space. For the nearest neighbor method, we define the kernel as

  • $K(x_i, x_o) = \begin{cases} 1 & \text{if } \lVert x_i - x_o \rVert \le b \\ 0 & \text{otherwise} \end{cases} \qquad (6)$

Here, ∥.∥ denotes the Euclidean norm and b the radius of the neighborhood. Minimizing equation (5) over yNP with the kernel defined in equation (6) renders the regression estimator

  • $\hat{y}^{\,pred}_{nn} = \frac{1}{\lvert I_{nn} \rvert} \sum_{i \in I_{nn}} y_i \qquad (7)$

The set Inn contains the indices of the hydrologic series for which the input data lie within radius b of the query, and ∣Inn∣ refers to the cardinality of the set Inn:

  • $I_{nn} = \left\{\, i : \lVert x_i - x_o \rVert \le b \,\right\}$

We use training data DN such that the index i refers to time in days and {yi, xi} represents a lagged hydrologic time series. If {zt}, t = 1, …, N, represents a hydrologic time series and ℓ the number of lags in days, then

  • $y_i = z_i, \qquad x_i = \left\{\, z_{i-1},\, z_{i-2},\, \ldots,\, z_{i-\ell} \,\right\}$
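To make equations (5)-(7) and the lag embedding above concrete, the sketch below implements a radius-based nearest neighbor one step ahead predictor in Python. This is a minimal illustration rather than the authors' code; the function names, the synthetic series, and the fallback to the series mean when no neighbor falls within radius b are assumptions made for the example.

```python
import numpy as np

def lag_embed(z, ell):
    """Build (y_i, x_i) pairs from a series z: y_i = z_i and x_i = (z_{i-1}, ..., z_{i-ell})."""
    y = z[ell:]
    x = np.column_stack([z[ell - k - 1 : len(z) - k - 1] for k in range(ell)])
    return y, x

def nn_predict(y_train, x_train, x_query, b):
    """Radius-based nearest neighbor estimate (eq. 7): mean of y over neighbors within radius b."""
    dist = np.linalg.norm(x_train - x_query, axis=1)
    neighbors = dist <= b
    if not neighbors.any():            # assumption: fall back to the unconditional mean
        return y_train.mean()
    return y_train[neighbors].mean()

# Example with a synthetic daily series scaled to [0, 1]
rng = np.random.default_rng(0)
z = rng.random(1000)
ell, b = 3, 0.2                        # number of lags and radius of neighborhood
y, x = lag_embed(z, ell)
print(nn_predict(y[:-1], x[:-1], x[-1], b))   # one step ahead prediction for the last query
```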

We consider streamflow at two points within the Leaf River basin, one at Collins and the other at McLain. A third basin, of a size in between the other two but with similar characteristics, is also considered. Figure 1 shows the locations of the catchments, and Table 1 summarizes their characteristics. The data were obtained from the MOPEX database [Duan et al., 2006].

Figure 1. Location of the gauges used in the present study.

Table 1. Characteristics of the Basins Used in the Study

Basin | A (km²) | P (mm/a) | Qa (mm/a) | Qd (m³/s) | Tmax (°C) | Tmin (°C)
Leaf River at Collins, Mississippi | 1390 | 1439 | 507 | 22.3 | 24.3 | 10.9
Leaf River at McLain, Mississippi | 9060 | 1462 | 522 | 150.0 | 24.6 | 11.3
Spring River at Waco, Missouri | 5290 | 1077 | 282 | 47.3 | 19.6 | 7.3

[16] The nearest neighbor predictor is easily obtained from equation (7). However, its performance on future unseen data still depends on the choice of the parameters b and ℓ. These parameters are chosen on the basis of the two paradigms presented below.

3.1. Complexity-Based Prediction Paradigm

[17] The complexity-based prediction paradigm is based on Vapnik-Chervonenkis generalization theory [Vapnik and Chervonenkis, 1991]. The complexity of a nearest neighbor method is defined as its capacity to replicate a given set of data points [Vapnik, 2002]. Such a measure depends on both the structure and the parameters of the model; for nearest neighbor methods it therefore depends on the choice of the radius of neighborhood b and the number of lags ℓ. Models of larger complexity have a tendency to overfit a given data set, i.e., to have larger estimation error, as they trade off performance on unseen data for better performance on the calibration or training data. Here, "estimation error" is defined as the absolute difference between the expected error of a model (say, model 1) estimated on a given sample (used for model parameterization) and the expected error of a model (say, model 2) estimated on an infinitely large sample from the underlying processes. While the estimation of model 1 is limited by the given sample (and model 1 is thus probably not the best performing model available), model 2 is the best performing model available. We also define "approximation error" as the error made by the estimated model in approximating reality. More complex models may approximate the underlying processes better [Atkinson et al., 2002; Cucker and Smale, 2001], but models with larger complexity require more data to achieve a low estimation error along with a possibly lower approximation error. Since approximation error is generally not verifiable, complexity can be used as a measure to select a model with the lowest estimation error [Bartlett and Kulkarni, 1998].

[18] Figure 2 further illustrates the relation between estimation error and complexity and shows how measuring the complexity of a model is equivalent to measuring the size of its "span." By model span we mean the N-dimensional space (of y = {y1, y2, y3, …, yN}) spanned by a model under different data sets sampled from the underlying but unknown distribution. Consider two model spaces M1 and M2 (Figure 2a) and assume that there are only two possible values that y can take in the output space. Let points A and B refer to the outputs of models 1 and 2 corresponding to one input sequence X = {x1, x2, x3, …, xN}, and let the corresponding vectors define the expected errors of models 1 and 2 conditioned on X. The "estimation error" conditioned on X, associated with picking the wrong model (say choosing model 1 instead of model 2), is bounded from above by the distance between A and B. Further, the uncertainty in estimating the error of model 1 (model 2) conditional on X depends on the uncertainty in X, which in turn depends on the width of model space M1 (M2). This is so because a model space encapsulates the output space that is spanned by a model (Figure 2b). Thus, the unconditional "estimation error," obtained by integrating the conditional error over different realizations of X from the underlying distribution, depends on the width of the combined model spaces M1 and M2 (Figure 2c). This concept can easily be extended to more general cases, where the complexity of a modeling system is determined by measuring the size of the corresponding model space.

Figure 2. Relation between estimation error and complexity and how the complexity of a model is equivalent to the size of its "span." (a) The "estimation error" conditioned on an input sample X is bounded by the distance between the points in the spans of models 1 and 2 corresponding to X. (b) Uncertainty in the expected error of model 1 (M1) in prediction conditional on X depends on the width of the span of the model. (c) Complexity, which affects the "estimation error," can be computed as the width of the combined model spans.

[19] The concept of complexity is formalized for nearest neighbor type models in section 3.1.1, and the measurement of the corresponding model space, via estimation of its VC dimension, is described in section 3.1.2.

3.1.1. Concept of Complexity: VC Dimension

[20] To further explain the notion of complexity introduced here, we first consider a binary classification problem. This is a special case of regression estimation in which the outcome variable is binary valued rather than real valued and the estimated classification function groups the input data (or the forcing data) into one of two classes.

[21] A set of classifier functions can be represented by a set of indicator functions,

  • $M = \left\{\, Q(w, \alpha) : \alpha \in \Lambda \,\right\}, \qquad Q(w, \alpha) \in \{0, 1\}$

Here, Λ defines a set of parameters, each element of which identifies one element of the set M. In other words for some α ∈ Λ, Q(w, α) is a classification function from the set M that labels an arbitrary input vector (or a feature point) w as either 0 or 1.

[22] The Vapnik-Chervonenkis (VC) dimension [Vapnik et al., 1994; Vapnik, 2002], due to VC generalization theory, is used to quantify the complexity of such a class of mappings. It is defined as the maximum number, h, of input or feature points w1, w2, …, wh that can be separated into the two classes in all 2^h possible ways by functions from the set M [Vapnik, 2002]. The h feature points are then said to be "shattered." If for any positive n there exists a set of n feature points that can be shattered by the set of indicator functions defined above, then its VC dimension is infinite. If a set of classifiers is capable of shattering a large number of feature points, it is complicated enough to find a mapping for any finite data set (of size n, with n ≤ h) irrespective of the underlying input-outcome probability distribution from which such a data set is sampled. Such highly complex mappings are said to have "overfit" the data and are also responsible for high estimation uncertainty [Vapnik, 1999]. Therefore, a set of functions with low complexity is desirable for finding an appropriate mapping that has low estimation uncertainty when the number of data points is small.
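As a toy illustration of shattering, consider one-dimensional threshold classifiers Q(w, t) = 1 if w ≥ t and 0 otherwise. The sketch below checks by brute force whether all 2^h labelings of h points can be produced; a single point can be shattered but two points cannot, so the VC dimension of this family is 1. The helper names are ours, and the example is purely illustrative.

```python
import itertools
import numpy as np

def threshold_classifier(w, t):
    """Q(w, t): label 1 if the (scalar) feature w is at least the threshold t, else 0."""
    return (np.asarray(w) >= t).astype(int)

def is_shattered(points, thresholds):
    """Check whether 'thresholds' can realize all 2^h labelings of 'points' (i.e., shatter them)."""
    points = np.asarray(points, dtype=float)
    achievable = {tuple(threshold_classifier(points, t)) for t in thresholds}
    return all(tuple(lab) in achievable for lab in itertools.product([0, 1], repeat=len(points)))

# Candidate thresholds on a grid; one point can be shattered, two points cannot.
grid = np.linspace(0.0, 1.0, 201)
print(is_shattered([0.4], grid))        # True  -> VC dimension is at least 1
print(is_shattered([0.3, 0.7], grid))   # False -> the labeling (1, 0) is never produced
```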

[23] In the case of nearest neighbor methods for time series forecasting (we scale the entire data set between 0 and 1), we can obtain a set of classifiers for a given set of data DN from equation (7) as

  • equation image

where b is the radius of nearest neighbors, ℓ is the number of lags in days,

  • equation image
  • equation image

and ŷprednn, Inn, and DN have been previously defined. The possible values of b and ℓ therefore substitute for the set of abstract parameters in the general definition of the set of classifiers.

[24] Finally, an equivalent definition of the VC dimension applies to a set of real-valued functions, such as the set of real valued mappings obtained from equation (7); there it is defined via the cardinality of the minimal ɛ net of the set [Vapnik et al., 1994; Vapnik, 1999]. However, the VC dimension of the set of real valued nearest neighbor algorithms can also be set to the VC dimension of the corresponding set of classifiers as defined in equation (8) [Vapnik, 2002].

[25] The VC dimension, for a given training size N, governs the rate of uniform convergence of the empirical error on the training data to its expected error. The expected error defines the performance of a model on future unseen data, or in our case the performance of a nearest neighbor model with given values of b and ℓ on future unseen data. For the case of binary classifiers, if we define the empirical or classification error on data DN as

  • $v_1(\alpha) = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - Q(w_i, \alpha) \right| \qquad (10)$

and the corresponding expected error (with P(.) a measure of probability) as

  • $R(\alpha) = \int \left| y - Q(w, \alpha) \right| \, dP(w, y)$

then VC theory provides an upper bound on the worst case absolute deviation of the empirical to the expected error [Cortes, 1993; Vapnik and Chervonenkis, 1971]

  • $P\left\{ \sup_{\alpha \in \Lambda} \left| R(\alpha) - v_1(\alpha) \right| > \varepsilon \right\} \le 4\, m(2N)\, \exp\left(-\varepsilon^2 N\right) \qquad (11)$

Here, m(n) is the growth function of our set of classifiers Mnn, defined as the maximum number of ways in which a finite data set of size n can be classified by the set of classifiers, the maximum being taken over all possible data sets of size n that can be sampled from any given but fixed distribution. Given that the VC dimension of Mnn is h, the growth function of Mnn follows the relationship [Cortes, 1993]

  • $m(n) = 2^{n} \ \ \text{for } n \le h, \qquad m(n) \le \left(\frac{en}{h}\right)^{h} \ \ \text{for } n > h$

Using (11) for a one-sided bound, the following inequality then holds for all α ∈ Λ with probability 1 − δ,

  • $R(\alpha) \le v_1(\alpha) + \sqrt{\frac{h\left(\ln(2N/h) + 1\right) - \ln(\delta/4)}{N}} \qquad (12)$

Inequality (11) shows that a larger complexity (implying a larger VC dimension) leads to a weaker bound on the uniform convergence (i.e., convergence for the worst performing parameter from the parameter space for any given finite sample size N) of the empirical error to its expected value as the sample size N goes to infinity. Since the upper bound on the rate of convergence is uniform, it applies to the rates of convergence for all parameters in the parameter space. By rearranging the terms in inequality (11), we obtain equation (12), which shows that the strength of the upper bound on the expected error for a given sample size N depends not only on the empirical risk but also on a function of the complexity of Mnn. Thus, for a sample size N, more complex models bound the expected error at the same level for any parameter (including the parameter that minimizes the empirical risk) only at a lower confidence level 1 − δ. Equivalently, a larger number of data points is required for more complex models to bound the expected error from above by a given value at a desired level of confidence. A similar inequality exists for a real-valued class of functions and dictates a similar relationship between performance and the VC dimension [Vapnik, 2002].

[26] An inequality of type (12) forms the basis of VC generalization theory [Vapnik, 1999]. It, however, only provides an upper bound on the expected error that is valid for any parameter of the parameter set defining the class of models. Minimizing the RHS of (12) at a certain level of confidence 1 − δ can only allow the exclusion of worse performing (in the expected error sense) parameters, while obtaining the parameter that minimizes the expected error via such minimization remains approximate. Hereinafter, we call a robust predictor one that minimizes an upper bound on the expected error of the type in (12). Such a parameterization also implicitly trades off the empirical error against the complexity of the models in the class Mnn. Finally, a set of models such as Mnn can also be subdivided into a collection of disjoint sets of models, each with its respective measure of complexity. Such a subdivision can be based on how well the complexity of the underlying class of models needs to be approximated. For example, Mnn can be decomposed into a collection of disjoint sets of the type

  • equation image

such that

  • equation image

Thus, as ɛ decreases, Mnn can be represented by a union of many smaller sets of nearest neighbor models, with a certain complexity associated with each of those sets. Consequently, an inequality of type (12) is associated with each such set of models. This decomposition of the set of nearest neighbor models is used later in the paper when the complexity of the nearest neighbor models is calculated and used in robust predictor estimation.

3.1.2. Algorithm for Calculation of VC Dimension

[27] If we now define v2(.) as v1(.) in equation (10), except that it is calculated on a different realization of data, D′N, of the same size and drawn from the same distribution, then the VC dimension for our class of classifiers in equation (13) obeys the approximate equality in (14) [Corani and Gatto, 2006a, 2006b; Cortes, 1993; Vapnik et al., 1994],

  • $\xi(N) = \mathrm{E}\left[\, \max_{\alpha \in \Lambda} \left| v_1(\alpha) - v_2(\alpha) \right| \,\right] \approx \Phi\!\left(\frac{N}{h}\right) \qquad (14)$

where

  • equation image

Here, the parameters of the bound above are K, d, and B. The value of B is chosen from the continuity conditions that at N = h/2 the value of Φ(.) is 1. The other two parameters are free, assumed to be universal, and are obtained empirically by fitting the bound in the relationship (14) to experimental data using a set of classifiers whose VC dimension is known [Vapnik et al., 1994].

[28] The approximate equality (14) allows us to empirically estimate the VC dimension of a class of classifiers [Vapnik et al., 1994; see also Corani and Gatto, 2006a, 2006b; Cortes, 1993; Shao and Cherkasky, 2000]. The resulting algorithm for calculating the VC dimension of our set of classifiers on the basis of (14) is outlined in Appendix A.
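Appendix A is not reproduced here, but the design of Vapnik et al. [1994] can be sketched as follows: for several sample sizes n, measure the maximal deviation ξ(n) between the error rates v1 and v2 of the fitted classifier on two independent samples, and then choose the h for which the bound Φ(n/h) best fits the measured deviations in a least squares sense. The sketch below shows only this fitting step, with synthetic deviations standing in for the measurements; the form and constants of Φ follow the formula published by Vapnik et al. [1994] (a = 0.16, b = 1.2, k ≈ 0.1493) and may not correspond exactly to the K, d, and B parameterization referred to above, so they should be treated as assumptions.

```python
import numpy as np

def phi(tau, a=0.16, b=1.2, k=0.14928):
    """Bound on the expected maximal deviation as a function of tau = n / h
    (form and constants as published by Vapnik et al. [1994]; treated as an assumption here)."""
    tau = np.asarray(tau, dtype=float)
    out = np.ones_like(tau)
    big = tau > 0.5
    num = np.log(2.0 * tau[big]) + 1.0
    out[big] = a * (num / (tau[big] - k)) * (1.0 + np.sqrt(1.0 + b * (tau[big] - k) / num))
    return out

def fit_vc_dimension(sample_sizes, measured_xi, h_grid):
    """Least squares fit of h: choose the h minimizing sum_n (xi(n) - phi(n / h))^2."""
    n = np.asarray(sample_sizes, dtype=float)
    xi = np.asarray(measured_xi, dtype=float)
    errors = [np.sum((xi - phi(n / h)) ** 2) for h in h_grid]
    return h_grid[int(np.argmin(errors))]

# Synthetic check: deviations generated from phi with h = 25 (plus noise) are recovered approximately.
rng = np.random.default_rng(1)
sizes = np.array([20, 40, 80, 160, 320, 640])
xi_obs = phi(sizes / 25.0) + 0.01 * rng.standard_normal(sizes.size)
print(fit_vc_dimension(sizes, xi_obs, h_grid=np.arange(5, 100)))
```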

3.1.3. Robust Model Selection

[29] The VC dimension calculated as described above also applies to the real valued class of functions derived from the Mnn class of classifiers [Corani and Gatto, 2006a, 2006b; Vapnik, 2002]. Therefore, a robust estimate of α = {b, ℓ} for real valued nearest neighbor models can be obtained by minimizing the RHS of the following inequality, similar to inequality (12) [Cherkasky and Mulier, 1998; Corani and Gatto, 2006a, 2006b; Pande, 2005]:

  • $R(\alpha, x_o) \le \frac{r_N(\alpha, x_o, N)}{vm} \qquad (15)$

where rN(α, xo, N) is the minimized empirical risk corresponding to a nearest neighbor predictor ŷprednn at query xo. From the cost function in (5), the minimized empirical risk is

  • equation image

with the expected risk defined as

  • equation image

Further,

  • equation image

Here vm (or "Vapnik's measure") [Cherkasky and Mulier, 1998] is an inverse measure of complexity, i.e., it is inversely related to the measure of complexity, or VC dimension, obtained from (14). The left-hand side of (15) is the expected error (or generalization error), and the right-hand side bounds it from above. For a given α = {b, ℓ}, the numerator of the RHS is also the tightest possible bound for nearest neighbor models, obtained by minimizing the cost function (or the empirical risk function for the nearest neighbor model) in (5) over ŷNP. Further, minimizing the entire RHS of (15) over α formalizes the principle of Occam's razor, i.e., the selection of the simplest hypothesis (of least complexity) that also agrees with the data as much as possible. The algorithm for the implementation of prediction based on robust model selection is presented in Appendix B.
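As a sketch of how this robust selection can be operationalized, the snippet below penalizes the empirical risk of each candidate α = {b, ℓ} by a factor that grows with the VC dimension h and picks the α with the smallest penalized risk. The penalization factor used here is the practical form associated with Cherkasky and Mulier [1998], namely dividing the empirical risk by (1 − sqrt(p − p ln p + ln N/(2N)))₊ with p = h/N; whether this is exactly the paper's vm should be checked against Appendix B, and the numbers in the usage example are illustrative only.

```python
import numpy as np

def penalized_risk(emp_risk, h, n):
    """Empirical risk inflated by a VC-type penalization factor.
    The factor 1 / (1 - sqrt(p - p*ln(p) + ln(n)/(2n)))_+ with p = h/n is assumed here;
    the paper's exact right-hand side of (15) is specified in Appendix B."""
    p = h / n
    denom = 1.0 - np.sqrt(max(p - p * np.log(p) + np.log(n) / (2.0 * n), 0.0))
    return np.inf if denom <= 0.0 else emp_risk / denom

def select_alpha(candidates, emp_risks, vc_dims, n):
    """Occam's razor selection: among candidate alphas = [(b, ell), ...] with their empirical risks
    and VC dimensions, return the alpha whose penalized risk (upper bound) is smallest."""
    bounds = [penalized_risk(r, h, n) for r, h in zip(emp_risks, vc_dims)]
    return candidates[int(np.argmin(bounds))], float(min(bounds))

# Illustrative numbers only: a complex alpha with slightly lower empirical risk loses to a simpler
# alpha once the complexity penalty is applied (n = 365 days of training data).
candidates = [(0.05, 3), (0.20, 1)]
print(select_alpha(candidates, emp_risks=[0.010, 0.012], vc_dims=[40, 5], n=365))
```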

3.2. Probabilistic Ensemble Method Paradigm (Nearest Neighbor Probabilistic Ensemble Method)

[30] We built an ensemble approach similar to that of Tamea et al. [2005] but devised for k nearest neighbors and call it NNPE. This method is different from the nonlinear probabilistic ensemble (NLPE) method of Tamea et al. [2005]. The NLPE was proposed for nonlinear time series analysis using locally linear predictors, while here we utilize a method adapted for one step ahead nearest neighbor methods.

[31] A time series is divided into a training part and a testing part, with the training part further subdivided into a calibration set and a noncalibration set. A nearest neighbor predictor for a query is estimated over the calibration set for a given value of the parameters α = {b, ℓ}, and the performance of the different predictors (corresponding to different values of α) is evaluated over the noncalibration set. A certain number of top performers (leading to a collection of top performing alphas) is retained to form an ensemble prediction for the query. Additional uncertainty is attached to the ensemble, making it probabilistic in nature, by associating the residuals of the calibration step with each member of the ensemble. A collection of predictions for a query is thus created, and a confidence interval of prediction can be constructed. As mentioned by Tamea et al. [2005], such a method differs from the GLUE methodology [Beven and Binley, 1992]: the latter pursues the notion of equifinality at all levels of model estimation, whereas here equifinal sets over α are obtained once a nearest neighbor point predictor has been created for each α = {b, ℓ}. Further, a probabilistic nature is imparted to the ensemble predictions by incorporating each member's prediction uncertainty. The specific implementation of the NNPE algorithm is described in Appendix C.

[32] A nonlinear probabilistic ensemble can thus be obtained for a test data set, with a confidence interval defined as a certain interquantile range. We consider only the 95% confidence level, defined as the 2.5 to 97.5 percentile range, and, following Tamea et al. [2005], we use the median of the ensemble predictions for any query as its point prediction.
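The NNPE steps described above and detailed in Appendix C can be paraphrased as the sketch below (the nearest neighbor helpers from the earlier sketch are repeated so the block is self-contained): the training series is split into calibration and noncalibration parts, every candidate α = {b, ℓ} is scored on the noncalibration part, the best scoring candidates form the ensemble, and each member's calibration residuals are added to its point prediction to make the ensemble probabilistic. The split ratio, the number of retained members, the number of residual draws, and the helper functions are assumptions of this sketch, not the settings used in the paper.

```python
import numpy as np

def lag_embed(z, ell):
    """y_i = z_i and x_i = (z_{i-1}, ..., z_{i-ell})."""
    return z[ell:], np.column_stack([z[ell - k - 1 : len(z) - k - 1] for k in range(ell)])

def nn_predict(y_tr, x_tr, x_q, b):
    """Radius-b nearest neighbor mean (eq. 7); falls back to the overall mean if no neighbor is found."""
    idx = np.linalg.norm(x_tr - x_q, axis=1) <= b
    return y_tr[idx].mean() if idx.any() else y_tr.mean()

def nnpe(z_train, query_by_lag, radii, lags, cal_frac=0.75, n_keep=20, n_draws=50, seed=0):
    """Sketch of the NNPE ensemble for a single query (cal_frac, n_keep, and n_draws are assumptions)."""
    rng = np.random.default_rng(seed)
    scored = []
    for ell in lags:
        y, x = lag_embed(z_train, ell)
        n_cal = int(cal_frac * len(y))
        for b in radii:
            # score alpha = (b, ell) on the noncalibration part of the training series
            preds = np.array([nn_predict(y[:n_cal], x[:n_cal], xq, b) for xq in x[n_cal:]])
            scored.append((np.mean((preds - y[n_cal:]) ** 2), b, ell))
    ensemble = []
    for _, b, ell in sorted(scored)[:n_keep]:              # best performing alphas form the ensemble
        y, x = lag_embed(z_train, ell)
        n_cal = int(cal_frac * len(y))
        point = nn_predict(y[:n_cal], x[:n_cal], query_by_lag[ell], b)
        resid = y[:n_cal] - np.array([nn_predict(y[:n_cal], x[:n_cal], xq, b) for xq in x[:n_cal]])
        ensemble.extend(point + rng.choice(resid, size=n_draws))  # attach calibration residuals
    ensemble = np.asarray(ensemble)
    return np.median(ensemble), np.percentile(ensemble, [2.5, 97.5])

# Usage with a synthetic series scaled to [0, 1]; the query is the vector of the most recent lags.
z = np.random.default_rng(2).random(400)
query = {ell: z[-ell:][::-1].copy() for ell in (1, 2, 3)}
median, bounds = nnpe(z, query, radii=(0.05, 0.1, 0.2), lags=(1, 2, 3))
print(median, bounds)
```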

3.3. Application to Hydrologic Data

[33] In order to apply the methods and compare their finite sample performance, we scale the 54 years of Leaf River streamflow data at each of the two gauging stations (Leaf River at Collins and at McLain, Mississippi) between 0 and 1 and do the same for another basin (Spring River at Waco, Missouri) with data overlapping the same time period. Collins is nested within McLain, and Spring is of a size in between the other two (see Figure 1 for locations and Table 1 for characteristics). At any given time, the available normalized data are split into disjoint sets: a training data set DN and 19 test data sets {DMi : i = 1, 2, …, 19}. The start and end dates of the 19 test data sets are fixed, while those of the training data set are varied as follows, starting from t = 0 (a sketch of this design is given after the list).

  1. Nearest neighbor models are parameterized within the two paradigms for the same training data of a fixed size, N = 2^t (in years), and evaluated on test data of fixed size M (2 years).
  2. Step 1 is repeated for each of the 19 test data sets {DMi : i = 1, 2, …, 19}.
  3. Increment t, i.e., t = t + 1.
  4. If t < 5, go to step 1; else stop.
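The sampling design above can be written compactly as the loop below. The helper `evaluate` stands in for fitting and scoring either paradigm and is hypothetical, as is the assumption that each training window immediately precedes the first year of the test window; both are illustrative choices rather than the paper's exact setup.

```python
import numpy as np

def finite_sample_experiment(series_by_year, test_windows, evaluate):
    """Train on N = 2**t years (t = 0, ..., 4) and score on each fixed 2-year test window.
    'series_by_year' maps year -> daily values; 'evaluate' is a hypothetical fit-and-score callable."""
    results = {}
    for t in range(5):                                   # training sizes of 1, 2, 4, 8, and 16 years
        n_train = 2 ** t
        for i, (start, end) in enumerate(test_windows):  # 19 disjoint 2-year test windows
            train = np.concatenate([series_by_year[y] for y in range(start - n_train, start)])
            test = np.concatenate([series_by_year[y] for y in range(start, end + 1)])
            results[(n_train, i)] = evaluate(train, test)
    return results

# Dummy usage with 54 synthetic years and a placeholder scoring function (illustration only)
years = {y: np.random.default_rng(y).random(365) for y in range(1948, 2002)}
test_windows = [(1964 + 2 * k, 1965 + 2 * k) for k in range(19)]
scores = finite_sample_experiment(years, test_windows, evaluate=lambda tr, te: te.mean() - tr.mean())
print(len(scores))                                       # 5 training sizes x 19 test windows = 95
```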

4. Results


4.1. Calculation of VC Dimension

[34] Figure 3 plots the VC dimension of nearest neighbor methods, computed with the algorithm presented in section 3.1.2, over the radius of neighborhood b and for input dimensionalities (numbers of lags) ℓ = {1, 2, …, 5}. For low values of b (the radius of nearest neighbors) the complexity approaches infinity, as each point then admits only itself as a nearest neighbor and the nearest neighbor method can classify the entire data set exactly. For very large b the effect of locality vanishes, as the method then considers the entire data set for predicting the labels, and the nearest neighbor method reduces to estimation of the mean of the labels. Mean estimators have a VC dimension of 1 [Vapnik et al., 1994], and therefore the estimated VC dimension asymptotes to 1 for large b.

Figure 3. VC dimension for nearest neighbor methods as a function of α = {b, ℓ}. Note that larger VC dimension implies higher complexity of the model.

[35] Figure 3 demonstrates that nearest neighbor methods with high input dimensionality require a larger radius of influence to have the same VC dimension, and hence the same complexity, as nearest neighbor methods with lower input dimensionality. Also, for the same radius b, the methods with larger dimensionality have greater complexity. It is worth noting that such estimation of the VC dimension is independent of the data, the underlying input-output probability distribution, and the method of calibration. The VC dimension is thus a universal property that depends only on the class of functions considered. Finally, Figure 3 also demonstrates that even the simplest of models, such as nearest neighbors, can be highly complex (and therefore not as simple as originally thought) depending on the choice of α = {b, ℓ}.

4.2. Application Results

4.2.1. General Observations

[36] Figure 4 shows the results of the application (section 3.3) of the two prediction paradigms to the McLain watershed and the first 200 days of the ninth test data set when the size of training data used is 4 years. The first row shows NNPE performance (Figure 4a) while the second row shows robust prediction performance (Figure 4b). Note that Figure 4a shows the median prediction of NNPE as well as its 95% confidence bounds. Confidence intervals for NNPE are available because of the probabilistic nature of NNPE. However, robust prediction only provides a point prediction (single value) and no confidence intervals for robust predictions are displayed in Figure 4b. For this subset of observations, both approaches appear to perform in a similar fashion: (1) both miss high-flow events in the middle of the selected time window (Figures 4a and 4b); (2) both predict the low-flow events closely and in a similar manner; and (3) both provide similar but not identical observed versus predicted scatterplots (Figures 4c and 4d).

Figure 4. Implementation results: performance of NNPE and robust predictors for the ninth test data set (out of 19) for the McLain basin. Sample size for training is 4 years, and a comparison is made for the first 200 days of the selected test data set. R2, mean square error (MSE), bias, and slope of observed to predicted fit (β) are also presented. (a, b) A simple k nearest neighbor (knn) is also presented. (c, d) The scatter of simple knn is displayed with pluses.

[37] Figure 4b shows the performance of the robust prediction paradigm along with the selected radius b and input dimensionality ℓ (shown as vertical bars from the top horizontal axis) for each query point in the test data set. Larger values of α = {b, ℓ} are generally chosen for high-flow events, while lower values of the parameters are chosen for the low-flow events. For high-flow events, the choice of larger values for the parameters, and therefore of more complex models, coincides with the rising and the recession parts of the hydrograph. This may serve as weak evidence for preferring more complex models, in terms of input dimensionality, as a robust choice for more complex underlying processes, considering that the rising and receding parts of hydrographs usually involve more complex hydrological processes. Note, however, that a larger radius of neighborhood is also chosen, which implies lower complexity for the same input dimensionality. Thus, implicit in the choice of parameters is a tradeoff between the complexity contributions of b and ℓ (to the overall complexity) that achieves a mix of complexity "optimal" for the prediction task at hand. Meanwhile, NNPE predictions are shown with 95% confidence intervals along with the median. These prediction bounds bracket ∼72% of the observed streamflow values and generally exclude the rising limbs of high-flow events. The confidence intervals are also broader for high-flow values, indicating that NNPE methods express higher uncertainty in predicting such events. A comparison of the observed versus predicted scatterplots suggests that NNPE predictions (Figure 4c) may be more biased than robust predictions (Figure 4d), with the slope of the predicted to observed fit and the bias being 0.76 and −0.0049 for NNPE and 0.87 and −0.0014 for robust predictions, respectively.

4.2.2. Finite Sample Performance

[38] We now compare the performance of the two paradigms for five different training sizes over multiple test data sets using four performance statistics: two related to efficiency, R2 and mean square error (MSE), and two related to bias, the slope of the observed to predicted regression (β) and the mean bias. Each of the 19 test data sets spans the same time period for the two paradigms; thus the statistics are comparable across paradigms. We use the median of the ensemble predictions from the NNPE paradigm as its point prediction in order to compare it with the predictions from the robust prediction paradigm. We calculate the four performance statistics on each of the 19 test data sets on the basis of the point predictions of both approaches, using the different training sizes of 1, 2, 4, 8, and 16 years (see section 3.3), in each of the three basins under study. We therefore have a distribution of performance indices for each training size, paradigm, and basin, which we use for the comparisons in the following sections.
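For reference, the four statistics can be computed per test data set as in the short sketch below. The orientation of β (predicted regressed on observed), the sign convention of the bias (mean of predicted minus observed), and the use of the squared correlation for R2 are our reading of the text and should be treated as assumptions.

```python
import numpy as np

def performance_stats(obs, pred):
    """R^2, mean square error, slope beta of the observed-to-predicted fit, and mean bias."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    r2 = np.corrcoef(obs, pred)[0, 1] ** 2        # squared correlation (assumed definition of R^2)
    mse = np.mean((pred - obs) ** 2)
    beta = np.polyfit(obs, pred, 1)[0]            # slope of pred regressed on obs (assumed orientation)
    bias = np.mean(pred - obs)                    # assumed sign convention
    return {"R2": r2, "MSE": mse, "beta": beta, "bias": bias}

print(performance_stats(obs=[0.2, 0.5, 0.9, 0.4], pred=[0.25, 0.45, 0.80, 0.42]))
```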

4.2.2.1. Comparison Between Robust and NNPE Prediction Paradigms

[39] In Figure 5 we show the performance of the robust prediction and NNPE predictors for the McLain basin. For the robust predictor (gray boxes), the efficiency of prediction increases (R2 increases and mean square error decreases), while one bias measure (the slope of the observed to predicted regression) is stable and the other (bias) asymptotically stabilizes at lower values with increasing training (or sample) size. Figure 5 also shows the same performance statistics for the NNPE predictions. While the efficiency of NNPE increases with training size in a fashion similar to robust prediction, its slope of the observed to predicted regression shows an increasing trend, and its bias shows a stable trend. However, a close cross comparison between the two paradigms reveals that NNPE consistently underperforms by a considerable margin on the bias measures, though it is slightly better than robust prediction on the efficiency measures. In particular, for small sample sizes, robust predictions have almost the same variance in prediction as NNPE but lower bias. Therefore, while for large sample sizes robust predictions yield less biased but higher-variance models than NNPE, for smaller sample sizes they provide "weakly" better models because the variance in prediction is nearly the same while the bias is lower. This evidence is not sufficient to unequivocally suggest that robust prediction outperforms NNPE. Nonetheless, an optimal bias-variance tradeoff is fundamental to the design of complexity-based prediction methods [Vapnik, 2002].

Figure 5. Finite sample performances for robust (gray) and NNPE (white) predictors on the McLain basin. Four performance measures are evaluated on 19 test data sets (each of 2 years) for each training length: (a) R2, (b) slope of observed to predicted regression β, (c) mean square error, and (d) bias. Sizes of training data are 1, 2, 4, 8, and 16 years, and periods of test data sets are the same for different training periods.

[40] Figures 6 and 7 similarly compare the finite sample performance of robust and NNPE predictions for the Collins and Spring river basins, respectively. Conclusions similar to those for the McLain basin can be drawn for these two basins. Specifically, (1) for smaller training sample sizes, the variance in predictions is similar for the two paradigms (except for the R2 comparison between robust and NNPE predictions in the case of the Spring river basin), while the bias measures are significantly lower for robust predictions, and (2) for large training sample sizes, the variance of robust predictions worsens in comparison to NNPE predictions, while the advantage of the former over the latter in bias sometimes diminishes. One interpretation is that robust predictors tend to select conservative models: the simplest (in complexity) of the best "equi-" performing models is chosen, which has lower bias but tends to have higher variance in predicting future unseen data. These observations suggest that the robust prediction paradigm may be advantageous over the median predictor of the NNPE paradigm when the available data are limited, while NNPE methods may be preferable for large sample sizes (large data availability).

Figure 6. Finite sample performance for robust (gray) and NNPE (white) predictors on the Collins basin. The definition of training and the test data sets is the same as in Figure 5.

Figure 7. Finite sample performance for robust (gray) and NNPE (white) predictors on the Spring river basin. The definition of training and the test data sets is the same as in Figures 5 and 6.

4.2.2.2. Comparison Across Basins

[41] Figures 8a and 8b compare the performance (R2 and bias) of NNPE predictions across basins for increasing training sizes. The largest basin (McLain, white) shows better performance statistics than the smaller basin (Spring, gray) for the same sample sizes. This assertion is supported by the results, which show, for the McLain basin, a higher R2 (Figure 8a) and a smaller negative bias (Figure 8b) for each sample size. When comparing NNPE performance between McLain (white) and Collins (gray) (Figures 8c and 8d), similar results are observed, i.e., superior performance for the larger basin, McLain, with the same characteristics for R2 and bias.

Figure 8. Comparison of finite sample performance for NNPE and robust predictors across different basins showing (a–d) performances for NNPE and (e–h) performances for robust predictors. Figures 8a and 8b display R2 and bias for the McLain (white) and Spring (gray) river basins. Figures 8c and 8d display R2 and bias for the McLain (white) and Collins (gray) river basins. Similarly, Figures 8e and 8f compare performances of the McLain (white) and Spring (gray) river basins, while Figures 8g and 8h compare performances of the McLain (white) and Collins (gray) river basins. The training and the test data sets are the same as in the corresponding Figures 5–7.

[42] The robust predictions are compared in Figures 8e and 8f for McLain (white) and Spring (gray), while Figures 8g and 8h compare McLain (white) and Collins (gray). Again the R2 statistic is superior for McLain compared to the Spring (Figure 8e) and Collins (Figure 8g) basins. However, the bias shown in Figures 8f (McLain versus Spring) and 8h (McLain versus Collins) is not necessarily better. A bias reduction is observed with increasing training sample size. In particular, we observe that the robust predictor is less able to control the bias in predictions for the smaller basins at small sample sizes.

[43] To summarize, the performance statistics of both paradigms for the smaller basins (i.e., Spring and Collins) are never superior to the statistics for the largest one (i.e., McLain). This holds especially when sample sizes are small and may suggest that control of the bias-variance tradeoff by robust predictors becomes more difficult for smaller basins at small sample sizes.

4.2.2.3. Mimicking Complexity of the Underlying Processes by Complexity Selection?

[44] Figure 3 demonstrates how the complexity of nearest neighbor models increases with an increase in input dimensionality, ℓ, or a decrease in the selected radius of neighborhood, b. Figure 4, though displaying only a small portion of the overall test data set, also demonstrates how the selection of larger input dimensionality (ℓ) and radius of neighborhood (b) in robust prediction is associated with the rising or receding parts of the observed hydrograph.

[45] The nearest neighbor model underlying the two paradigms minimizes empirical error while fitting a function locally. Increasing the number of data points in the neighborhood (by increasing the radius) increases the confidence in the local estimate (because a larger number of "effective" data points is available for induction). However, increasing the neighborhood size need not ensure the similarity of neighboring data points needed to predict rarer events. Prediction of rarer events, such as a high flow on the rising limb, requires a longer sequence of the immediate past to identify the current state as being on the rising part of the hydrograph. The requirement of a longer sequence of the past implies higher input dimensionality for the nearest neighbor models, which in turn implies higher complexity. The robust predictor, on the other hand, trades off the need for a larger number of effective data points (extent of locality) against the need to identify the current prediction requirement (for example, prediction on a rising limb requires similar data points from the past that were also on a rising limb). This leads to a tradeoff between the choice of radius of neighborhood and input dimensionality, which is achieved implicitly by the right-hand side of equation (15). This contrasts with the choice of input dimensionality and radius of neighborhood for more common events, for which the predictor can choose a smaller radius of neighborhood and a low input dimensionality and still make a prediction with at least as much confidence as for high flows, because more data points are available for more common events within a data set.

[46] In order to analyze more exhaustively how the selection of complexity, in terms of the input dimensionality selected within robust model selection, may be linked to the complexity of the underlying processes, we (1) assume that the complexity of the underlying processes can be measured by the rarity of the corresponding streamflow events; (2) combine all 19 (streamflow) test data sets (2 years each), which are contiguous in time, into one continuous data set of 38 years; (3) combine the parameters α = {b, ℓ} selected for the corresponding robust predictions; and (4) analyze the coherence between the selected input dimensionality and the streamflow data that is significant at the 95% confidence level, following Amjad et al. [1997]. In this context, we interpret coherence as the magnitude squared of the correlation between the finite Fourier transforms of two time series. Further, note that in step 3 we obtain a 38-year time series of selected α = {b, ℓ} parameters, because we pick the parameters α = {b, ℓ} of the nearest neighbor model at each test data point as we make predictions through the test data sets.
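Step 4 can be carried out with a standard magnitude squared coherence estimate; the sketch below uses scipy.signal.coherence on synthetic stand-in series and a purely illustrative cutoff, whereas the paper's significance test at the 95% level follows Amjad et al. [1997] and is not reproduced here. The segment length and the synthetic data are assumptions.

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(3)
n = 365 * 38                                              # 38 years of daily values (synthetic)
flow = rng.random(n)                                      # stand-in for the combined test streamflow
ell_selected = rng.integers(1, 6, size=n).astype(float)   # stand-in for the selected input dimensionality

# Magnitude squared coherence between the observed flow and the selected input dimensionality
f, cxy = coherence(flow, ell_selected, fs=1.0, nperseg=1024)

# Illustrative cutoff only; the paper's 95% significance test follows Amjad et al. [1997]
significant_freqs = f[cxy > 0.5]
print(significant_freqs[:10])
```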

[47] Figure 9 shows histograms of only those frequencies of the observed streamflow (test) data set, obtained from step 2 above, at which a significant level of coherence between the observed streamflow and the selected input dimensionality exists. Three basins and three different training sample sizes of 1, 4, and 16 years are displayed. We observe significant coherence relatively often at the lower frequencies of streamflow events. This indicates significantly high values of the cross spectra relative to the auto spectra at the frequencies of rare events, i.e., a tendency of the selected input dimensionality to follow the observations closely for low-frequency events. As for the lack of similar coherence at higher frequencies, it appears that complexity selection is not an important issue there: the lowest input dimensionality (of 1) is selected for most of the high-frequency low flows, leading to poor coherence at such frequencies. This in turn may suggest that complexity-based robust prediction tends to select more complex models, in terms of input dimensionality, as a robust choice when predicting low-frequency events (and thus, by assumption, more complex underlying processes). However, the rise in complexity due to the choice of a larger input dimensionality for rarer events is compensated for by the choice of a larger radius of neighborhood (which reduces the overall complexity of the selected model), thereby keeping the model choice robust. This behavior of robust predictors is more dominant for low training sizes and for larger basins.

Figure 9. Frequency distribution for (a) McLain, (b) Collins, and (c) Spring of the frequencies with significant coherence at the 95% confidence level between the test data set and the input dimensionality, ℓ, selected by the robust predictors, for different training sizes and basins. The training and test data sets are the same as in Figures 5–8.

4.2.2.4. Representation of Uncertainty by NNPE Bounds

[48] Figure 10 shows how well the 95% confidence level bounds of NNPE are able to include the streamflow observations and how wide these bounds are. These statistics are presented for all three basins and for all training sizes, with the statistics collected on each of the 19 test data sets for each training size. With increasing sample size, the 95% confidence bounds bracket more of the observed streamflow for the McLain and Collins basins; the fraction of observed test data that lie within the bounds shows an increasing trend at the median. That is not the case for Spring, where the bracketing is similar for all sample sizes. The width of the 95% confidence bounds tends to shrink at larger training sizes for McLain. Thus, the confidence bounds of NNPE represent uncertainty well only for McLain, and the behavior of the NNPE uncertainty bounds for the other basins is inconsistent with what is expected. If the uncertainty bounds of NNPE predictions represent the underlying uncertainty, they should neither increase in width nor bracket a decreasing fraction of the observed streamflow points as the sample size increases. We observe violations of this expectation, though inconclusively, for Collins and Spring, while the NNPE bounds behave in an almost consistent manner for McLain. Such observations may suggest that NNPE finds it harder to represent uncertainty in smaller basins.
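The two diagnostics shown in Figure 10 can be computed per test data set as in this sketch: the fraction of observations falling inside the 2.5-97.5 percentile bounds and the width of the bounds. The function name and the averaging of the width over the test period are assumptions.

```python
import numpy as np

def bound_diagnostics(obs, lower, upper):
    """Fraction of observations bracketed by [lower, upper] and the mean width of the bounds."""
    obs, lower, upper = (np.asarray(a, dtype=float) for a in (obs, lower, upper))
    coverage = np.mean((obs >= lower) & (obs <= upper))
    width = np.mean(upper - lower)
    return coverage, width

obs = np.array([0.2, 0.5, 0.9, 0.4])
lower = np.array([0.1, 0.4, 0.5, 0.35])
upper = np.array([0.3, 0.7, 0.8, 0.50])
print(bound_diagnostics(obs, lower, upper))   # the 0.9 observation lies outside its bounds -> 0.75
```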

Figure 10. Performance of the 95% confidence level bounds for NNPE predictions over the same test and training data sets as in Figures 5–8 for (a) McLain, (b) Collins, and (c) Spring. The top plots show the fraction of total observed test data points that lie within the uncertainty bounds, and the bottom plots show the width of these uncertainty bounds with increasing training data size.

5. Conclusions and Future Work


[49] We have introduced a complexity-based prediction paradigm for nearest neighbor methods that is robust in its choice of parameters. It is based on Vapnik-Chervonenkis generalization theory, which offers good small sample properties, such as the capacity to handle uncertainty in prediction on future data [Vapnik, 2002]. We calculated a measure of complexity, called the VC dimension, for nearest neighbor methods and utilized it in complexity-based robust prediction. We compared this paradigm with a nearest neighbor probabilistic ensemble (NNPE) method inspired by Tamea et al. [2005]. In comparing the two paradigms, we partitioned the data sets for three basins into a fixed set of 19 2-year testing data sets, while using the remaining data in different sizes for training the nearest neighbor methods under the two approaches. The two paradigms were thus trained on 1, 2, 4, 8, and 16 years of data and tested on the 19 2-year testing data sets to compare their small sample performance.

[50] Complexity-based prediction methods were found to have "weakly" better control of the bias-variance tradeoff in predictions than NNPE methods, with consistently lower bias (though there is no definitive evidence that either method is superior to the other). This was possibly due to the tendency of the robust methods to select conservative models. However, bias, and hence the bias-variance tradeoff, was more difficult to control for complexity-based methods in the smaller basins.

[51] Significant coherence between low-frequency observed streamflow events and the input dimensionality selected by the complexity-based predictor was also found, potentially indicating that complexity-based methods tend to conform to the complexity of the underlying processes. However, in spite of the choice of a higher input dimensionality, deemed a robust choice for the more complex parts of the observed hydrograph, predictions for these events remained inaccurate. A similar situation was observed for the NNPE method. In addition, the representation of predictive uncertainty by the NNPE uncertainty bounds was poor and inconsistent for the smaller basins (though consistent for the larger basin). Measures such as the fraction of observations bracketed by the uncertainty bounds, as well as the width of those bounds, did not behave as expected with increasing sample size. This potentially indicates that the NNPE method has difficulty in controlling uncertainty in its predictions, and it suggests the need for postprocessing of ensemble members, and of the uncertainty surrounding them, before their use in estimating modeling uncertainty.

[52] We add here a note of caution. The observations made herein do not reflect the performance of the NLPE of Tamea et al. [2005] and are restricted to that of NNPE. The difference between the two is that the former is based on local linear models while the latter is based on locally constant models. Since the motivation for this paper was to introduce a complexity-based paradigm and evaluate its performance against a probabilistic ensemble paradigm for a given model (a nearest neighbor model), we constrained ourselves to the choice of nearest neighbor models. However, we envision a comparison of the two paradigms for locally linear (or even locally polynomial) models in future work.

[53] Both methods showed poorer finite sample performance for the smaller basins among the three used. However, the study is not exhaustive over basin size (two of the three basins were of similar size and smaller than the third), so this observation may not be universal. Nonetheless, the possible weaknesses in the performance of the two methods for the smaller basins may be due to the use of a nonphysically based model, i.e., nearest neighbor methods. The nearest neighbor methods used here depend solely on autoinformation in a univariate time series for future predictions. For larger basin areas (assuming the basins differ only in size), streamflow data may contain more relevant autoinformation for future predictions at a particular time scale because of the basin's slower response to precipitation.

[54] This, in turn, motivates future studies into how complexity-based methods (with VC dimension as the measure of complexity) perform for physically based or conceptual hydrologic models, especially for smaller basins. The motivation for using VC generalization theory lies in its concept of complexity being model-independent and in its foundation in probability theory. It will be an interesting exercise to look into the potential connection between the model structure selected by complexity-based methods and the complexity underlying a basin's behavior (similar to the work of Jakeman and Hornberger [1993] and Young et al. [1996]). A natural extension of the comparison with ensemble methods would also yield insight into the advantages of either method.

Appendix A: Algorithm to Calculate the VC Dimension of the Set of Classifiers on the Basis of Equation (14)


[55] 1. Generate a random set of size 2n, Z_2n, such that the first column of the data has binary values (the class label of the remaining columns) and the number of remaining columns equals the input dimensionality, ℓ, of M^nn_ε(r, ℓ).

2. Split it into two sets of equal size, Z_1 and Z_2.

3. Flip the class labels of the second set.

4. Merge the two sets and evaluate the nearest neighbor predictor, ŷ^nn_pred, in M^nn_ε(r, ℓ) using equation (7) with parameter values α* = {r, ℓ}.

5. Separate the sets and flip the labels of the second set back to restore the original labels.

6. Measure the absolute difference between the error rates on the two sets,

   ξ(N, α*) = |v_1(N, α*) − v_2(N, α*)|,

where v_i(N, α*) is evaluated on Z_i, for i = {1, 2}, using the nearest neighbor predictor ŷ^nn_pred obtained in step 4.

7. Repeat steps 1–6 for different set sizes N = n_1, n_2, …, n_i, …, n_d, iterating m times for each sample size n_i.

8. Take the mean of the absolute differences between the error rates, ξ̄(n_i) = (1/m) Σ_{j=1}^{m} ξ_j(n_i); the VC dimension of M^nn_ε(r, ℓ) for the specific α* is then obtained by fitting equation (14) to ξ̄(n_i) over the set sizes n_1, …, n_d.

9. Repeat steps 1–8 for each element of the set 𝒜 = {(r, ℓ) : r ∈ {r_min, r_min + ε, …, r_max}, ℓ ∈ {1, 2, …, ℓ_max}} to obtain the VC dimensions of the nearest neighbor models corresponding to M^nn_ε(r, ℓ), as defined in equation (13).
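A minimal sketch of this procedure follows (not the authors' implementation). The radius-based classifier, the random data generator, the set sizes, and especially the function phi, a placeholder of the Vapnik-Levin-Le Cun form standing in for equation (14), together with its constants, are illustrative assumptions of this sketch.

import numpy as np
from scipy.optimize import minimize_scalar

def nn_classify(X_ref, y_ref, X, r):
    # Radius-based nearest neighbor classifier: predict the rounded mean label of
    # reference points within radius r (fall back to the single nearest point if none).
    preds = np.empty(len(X))
    for k, x in enumerate(X):
        d = np.linalg.norm(X_ref - x, axis=1)
        idx = np.where(d <= r)[0]
        if idx.size == 0:
            idx = np.array([np.argmin(d)])
        preds[k] = np.round(y_ref[idx].mean())
    return preds

def mean_discrepancy(n, r, ell, m=20, rng=None):
    # Steps 1-7: mean |error(Z1) - error(Z2)| over m random sets of size 2n.
    rng = rng or np.random.default_rng(0)
    xi = []
    for _ in range(m):
        X = rng.random((2 * n, ell))
        y = rng.integers(0, 2, 2 * n).astype(float)
        y_flip = y.copy()
        y_flip[n:] = 1.0 - y_flip[n:]            # step 3: flip labels of the second half
        yhat = nn_classify(X, y_flip, X, r)      # step 4: predictor on the merged, flipped set
        e1 = np.mean(yhat[:n] != y[:n])          # steps 5-6: error rates against original labels
        e2 = np.mean(yhat[n:] != y[n:])
        xi.append(abs(e1 - e2))
    return float(np.mean(xi))

def phi(tau, a=0.16, b=1.2, k=0.15):
    # Placeholder bound of the Vapnik-Levin-Le Cun form; stands in for equation (14).
    if tau < 0.5:
        return 1.0
    num = np.log(2.0 * tau) + 1.0
    return min(1.0, a * num / (tau - k) * (1.0 + np.sqrt(1.0 + b * (tau - k) / num)))

def estimate_vc_dimension(r, ell, sizes=(20, 40, 80, 160, 320)):
    # Step 8: least-squares fit of phi(n_i / h) to the observed mean discrepancies.
    xbar = [mean_discrepancy(n, r, ell) for n in sizes]
    loss = lambda h: sum((xb - phi(n / h)) ** 2 for xb, n in zip(xbar, sizes))
    res = minimize_scalar(loss, bounds=(1.0, 2.0 * max(sizes)), method="bounded")
    return res.x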

Appendix B: Algorithm for Implementation of Prediction Based on Robust Model Selection


[56] 1. For each query, x_o, find α* = {r*, ℓ*} that minimizes the robust model selection criterion (the VC-based bound on prediction risk) over the set 𝒜 = {r_min, r_min + ε, …, r_max} × {1, 2, …, ℓ_max}.

2. Form the point prediction for the query x_o from the selected neighborhood,

   X_α* = {x_i ∈ D_N : ‖x_i − x_o‖ ≤ r*},
   ŷ^nn_pred(x_o) = (1/|X_α*|) Σ_{x_i ∈ X_α*} y_i.

3. Repeat steps 1 and 2 for each query in the test data set.
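The sketch below illustrates the structure of this robust prediction loop (not the authors' implementation). The selection criterion of step 1 is not reproduced here; the Cherkassky-Mulier practical VC bound for regression is used as a stand-in, together with an assumed lag embedding, a leave-one-out empirical risk, and a vc_dim(r, ell) function presumed to be supplied by the Appendix A procedure.

import itertools
import numpy as np

def embed(q, ell):
    # Lag embedding of a univariate daily flow series: ell-dimensional inputs,
    # one-step-ahead targets, features ordered oldest to newest.
    n = len(q)
    return np.column_stack([q[i:n - ell + i] for i in range(ell)]), q[ell:]

def nn_regress(X_ref, y_ref, x, r):
    # Nearest neighbor prediction: mean response of reference points within radius r
    # (fall back to the single nearest point if the neighborhood is empty).
    d = np.linalg.norm(X_ref - x, axis=1)
    idx = np.where(d <= r)[0]
    if idx.size == 0:
        idx = np.array([np.argmin(d)])
    return y_ref[idx].mean()

def penalized_risk(emp_risk, h, n):
    # Stand-in criterion: Cherkassky-Mulier practical VC bound for regression.
    p = min(h / n, 1.0)
    pen = np.sqrt(p - p * np.log(p) + np.log(n) / (2.0 * n)) if p > 0 else 0.0
    return emp_risk / (1.0 - pen) if pen < 1.0 else np.inf

def robust_predict(train_q, x_o_lagged, radii, ells, vc_dim):
    # Step 1: pick alpha* = (r, ell) minimizing the penalized empirical risk;
    # step 2: predict with the selected neighborhood. x_o_lagged holds the most
    # recent flows, oldest first, with length >= max(ells).
    best_risk, best_pred = np.inf, None
    for r, ell in itertools.product(radii, ells):
        X, y = embed(train_q, ell)
        # Leave-one-out empirical risk (an illustrative choice).
        emp = np.mean([(y[j] - nn_regress(np.delete(X, j, 0), np.delete(y, j), X[j], r)) ** 2
                       for j in range(len(y))])
        risk = penalized_risk(emp, vc_dim(r, ell), len(y))
        if risk < best_risk:
            best_risk = risk
            best_pred = nn_regress(X, y, x_o_lagged[-ell:], r)
    return best_pred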

Appendix C: Algorithm for Nearest Neighbor Probabilistic Ensemble Method (NNPE)


[57] 1. Split the training data D_N = {{y_i, x_i} : y_i ∈ ℝ_+, x_i ∈ ℝ^ℓ_+, i = 1, …, N} of size N into a calibration set, D_C, and a validation set, D_V, of equal size.

2. Select a test data point, x_o ∈ ℝ^ℓ_+.

3. For a value of the radius of neighborhood, r, find the set of neighbors X_α^C of x_o, where α = {r, ℓ}, in the calibration data set such that

   X_α^C = {x_i : {y_i, x_i} ∈ D_C, ‖x_i − x_o‖ ≤ r}.

4. Calculate the average, y_α, of y_α^C, where y_α^C = {y_i : x_i ∈ X_α^C, {y_i, x_i} ∈ D_C}, thus forming a nearest neighbor prediction ŷ^nn_pred for α.

5. Repeat step 3 on the validation data set to obtain the set of neighbors, X_α^V, of x_o and obtain y_α^V = {y_i : x_i ∈ X_α^V, {y_i, x_i} ∈ D_V}.

6. Calculate the mean square error, E_α, between the prediction y_α obtained in step 4 and y_α^V obtained in step 5.

7. Repeat steps 3–6 for each α ∈ 𝒜, where 𝒜 = {r_min, r_min + ε, …, r_max} × {1, 2, …, ℓ_max}, to obtain a vector of errors, E_𝒜.

8. Sort E_𝒜 in ascending order, pick the top 100 performers, and retrieve the corresponding parameters α into a vector 𝒜_sel.

9. Retrieve y_α^C for each α ∈ 𝒜_sel and merge the elements to form the set Y^C = {y_α^C : α ∈ 𝒜_sel}, which constitutes the probabilistic ensemble of predictions for the test data point x_o.

10. Repeat steps 2–9 for each data point, x_o, in the test data set.
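A minimal sketch of the NNPE steps for a single query follows (not the authors' code); the lag embedding, the equal-halves split of the flow series, and the percentile-based 95% bounds noted at the end are assumptions of this sketch.

import itertools
import numpy as np

def embed(q, ell):
    # Lag embedding: ell-dimensional inputs and one-step-ahead targets.
    n = len(q)
    return np.column_stack([q[i:n - ell + i] for i in range(ell)]), q[ell:]

def nnpe_ensemble(train_q, x_o_lagged, radii, ells, n_keep=100):
    # Steps 1-9 for a single query; x_o_lagged holds the most recent flows,
    # oldest first, with length >= max(ells).
    half = len(train_q) // 2
    q_cal, q_val = train_q[:half], train_q[half:]           # step 1: calibration/validation split
    scored = []
    for r, ell in itertools.product(radii, ells):
        Xc, yc = embed(q_cal, ell)
        Xv, yv = embed(q_val, ell)
        x_o = x_o_lagged[-ell:]
        yc_nb = yc[np.linalg.norm(Xc - x_o, axis=1) <= r]   # steps 3-4: calibration neighbors
        yv_nb = yv[np.linalg.norm(Xv - x_o, axis=1) <= r]   # step 5: validation neighbors
        if yc_nb.size == 0 or yv_nb.size == 0:
            continue
        err = np.mean((yv_nb - yc_nb.mean()) ** 2)          # step 6: MSE against validation responses
        scored.append((err, yc_nb))
    scored.sort(key=lambda t: t[0])                         # steps 7-8: keep the best n_keep parameter pairs
    members = [y for _, y in scored[:n_keep]]               # step 9: merge into the ensemble
    return np.concatenate(members) if members else np.array([])

# 95% bounds used for the Figure 10 diagnostics (percentile bounds assumed):
# ens = nnpe_ensemble(q_train, x_query, radii, ells)
# lo, hi = np.percentile(ens, [2.5, 97.5])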

Acknowledgments


[58] This study was partially supported by the Utah Water Research Laboratory and the Utah Center for Water Resources Research at Utah State University, the U.S. Bureau of Reclamation, the Sevier River Water Users Association, and the U.S. Geological Survey. The authors would like to acknowledge the intellectual contribution of Roger Hansen of the Provo, Utah, office of the U.S. Bureau of Reclamation. The authors would also like to thank reviewer Stefania Tamea, Associate Editor Alberto Montanari, and one anonymous reviewer for their suggestions. The first author is grateful for the guidance of Mariush Kemblowski and Wynn Walker and also thanks the Center for World Food Studies, Amsterdam, for its kind support.

References


Supporting Information

wrcr12084-sup-0001-t01.txt (plain text document, 0K): Tab-delimited Table 1.
