Corresponding author: A. Sankarasubramanian, Department of Civil, Construction and Environmental Engineering, North Carolina State University, Raleigh, NC 27695-7908, USA. (firstname.lastname@example.org)
Model errors are inevitable in any prediction exercise. One approach that is currently gaining attention for reducing model errors is to combine multiple models to develop improved predictions. The rationale behind this approach rests primarily on the premise that optimal weights can be derived for each model so that the resulting multimodel predictions improve on those of the individual models. A new dynamic approach (MM-1) to combine multiple hydrological models by evaluating their performance/skill contingent on the predictor state is proposed. We combine two hydrological models, the “abcd” model and the variable infiltration capacity (VIC) model, to develop multimodel streamflow predictions. To quantify precisely under what conditions the multimodel combination results in improved predictions, we compare the multimodel scheme MM-1 with the optimal model combination scheme (MM-O) by employing them in predicting the streamflow generated from a known hydrologic model (abcd model or VIC model) with heteroscedastic error variance, as well as from a hydrologic model whose structure differs from that of the candidate models (i.e., “abcd” model or VIC model). Results from the study show that streamflow estimated from single models performed better than multimodels under almost no measurement error. However, under increased measurement errors and model structural misspecification, both multimodel schemes (MM-1 and MM-O) consistently performed better than the single model prediction. Overall, MM-1 performs better than MM-O in predicting the monthly flow values as well as in predicting extreme monthly flows. Comparison of the weights obtained from each candidate model reveals that as measurement errors increase, MM-1 assigns weights equally to all the models, whereas MM-O always assigns higher weights to the candidate model that performed best over the calibration period.
Applying the multimodel algorithms to predict streamflows at four different sites revealed that MM-1 performs better than all single models and the optimal model combination scheme, MM-O, in predicting the monthly flows as well as the flows during wetter months.
Recently, a new approach, as suggested by the Distributed Model Intercomparison Project (DMIP) [Smith et al., 2004], aims at improving hydrologic model predictions by combining multiple model predictions so that a particular model's deficiency can be compensated for by the other models. Studies have shown that multimodel ensemble averages perform better than the best-calibrated single-model simulations [Ajami et al., 2007]. Georgakakos et al. [2004] attributed the superior skill of multimodel ensembles to the fact that model structural uncertainty is at least partly accounted for in the multimodel approach. Recently, Devineni et al. [2008] showed that the reliability of multimodel streamflow forecasts is superior to the reliability of seasonal streamflow forecasts from individual models.
Compared with the traditional way of improving a single model and its parameterizations, the multimodel combination approach essentially seeks a different paradigm in which the modeler aims to extract as much information as possible from a pool of existing models. The rationale behind multimodel combination lies in the fact that predictions from individual models invariably contain errors from different sources, which could potentially be reduced through optimal model combination and performance analyses [Feyen et al., 2001]. With multimodel combination, these errors could be reduced by optimally combining candidate hydrologic models, resulting in overall improved predictions [Ajami et al., 2007]. Weigel et al. showed that the optimal combination of climate forecasts from multiple models reduces the overconfidence of individual models, resulting in a multimodel forecast that is superior to the best single model. Many studies have shown that combining different competing hydrologic models results in improved streamflow predictions [Georgakakos et al., 2004; Ajami et al., 2007; Vrugt and Robinson, 2007]. However, most of these studies have focused on demonstrating the superiority of a multimodel algorithm by applying it to a given basin. In this study, we systematically address why combining multiple hydrologic models results in improved predictions, using a synthetic setup, and then apply the candidate algorithms to four selected basins from different hydroclimatic regimes. Further, we also propose a unique way to combine multiple hydrologic models by modifying the multimodel combination algorithm of Devineni et al. [2008], which primarily focused on combining seasonal streamflow forecasts from multiple models.
This paper is organized as follows: section 2 provides an overview of multimodel combination techniques and also proposes a new methodology for multimodel combination, which is modified from Devineni et al. [2008] for combining watershed models. Following that, we describe the experimental design for understanding how multimodel combination works and discuss the results from the synthetic study. Section 4 demonstrates the application of the proposed multimodel combination algorithm to selected watersheds from humid and arid regions. Finally, we summarize the salient features and conclusions resulting from the study.
2. Multimodel Combination Methodology
Given the uncertainties present in the modeling process, it is unlikely that a single hydrologic model will give consistently skillful predictions during the entire evaluation period as well as over different flow conditions (e.g., during extremes). Multimodel prediction, basically a combination of predictions from many single models, can capture the strengths of these single models, resulting in improved predictability [Ajami et al., 2007; Duan et al., 2007; Devineni et al., 2008]. Multimodel predictions are usually obtained by taking a weighted average of the predictions from the single models. The weights, which sum to one, are obtained based on the performance of each single model over the calibration period. There are several established multimodel combination methods, which include simple or weighted averages of single model predictions [Shamseldin et al., 1997; Xiong et al., 2001; Georgakakos et al., 2004], statistical techniques such as multiple linear regression [Krishnamurti et al., 1999] and Bayesian model averaging [Hoeting et al., 1999; Ye et al., 2004; Marshall et al., 2005; Vrugt et al., 2006; Vrugt and Robinson, 2007; Duan et al., 2007], and a more flexible hierarchical modeling framework [Marshall et al., 2006, 2007]. Approaches to combine multiple models using dynamic weights that vary with time have also been proposed [Oudin et al., 2006; Chowdhury and Sharma, 2009]. A detailed treatment of uncertainty in hydrological predictions can be found in Montanari. Recently, Devineni et al. [2008] showed the importance of combining multiple models by evaluating the individual models' performance conditioned on the selected predictor state(s). The main advantage of this approach is that it gives higher weights to models that perform better under a given predictor condition. In this study, we modify the algorithm of Devineni et al. [2008] for combining multiple hydrologic models, since Devineni et al. [2008] primarily focused on combining probabilistic streamflow forecasts developed using sea surface temperature conditions. Apart from this multimodel combination algorithm, we also consider an optimal model combination approach in which weights are obtained purely based on optimization [Rajagopalan et al., 2002].
2.1. Multimodel Combination: Weight Conditioned on Predictor State
The multimodel combination algorithm presented in Figure 1 evaluates the performance of each model over the calibration period and obtains the multimodel prediction over the validation period (1977–1988). Assume that streamflow predictions are available from M calibrated hydrologic models, indexed by m, over a common calibration period t = 1, 2, … , nc. Based on this, we calculate the absolute error between the observed streamflow and each model's prediction for every time step during the calibration period. One could also use the squared error or the log of the squared error as a metric to evaluate model performance contingent on the predictor state. Store the errors from all the models, together with the predictor, in a matrix, ∑, of size nc × (M + 1), with the predictor in the very last column of the matrix ∑. Given the precipitation (predictor) at a time step in the validation period, we identify the “K” nearest neighbors of that value from the calibration period. Different approaches to estimate K are discussed in the next paragraph. For the identified neighbors, we retrieve the corresponding errors from matrix ∑ and obtain, for each model, the average error over the K neighbors. Based on these average errors, the time-varying weight for a specific model, m, at the given time step in the validation period can be determined. The underlying basis behind this dynamic weight estimation is to give higher (lower) weights to the models that perform better (poorer) under similar predictor conditions. We combine the predictions from the different single models and obtain the multimodel prediction (MM-1) as their weighted sum over the validation period using the dynamic weights.
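The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `mm1_predict`, its argument names, and the single scalar predictor are hypothetical, and the weights are assumed here to be proportional to the inverse of each model's average absolute error over the K neighbors, normalized to sum to one. The inverse-error form is only one way to give higher weights to better-performing models and is labeled as an assumption.

```python
def mm1_predict(calib_precip, calib_obs, calib_preds, valid_precip, valid_preds, k):
    """Combine model predictions with weights conditioned on the predictor state.

    calib_precip: predictor (precipitation) over the calibration period
    calib_obs:    observed flows over the calibration period
    calib_preds:  list of M lists, one list of calibration predictions per model
    valid_precip: predictor over the validation period
    valid_preds:  list of M lists, one list of validation predictions per model
    k:            number of nearest neighbors
    Returns (combined_predictions, weights_per_time_step).
    """
    n_models = len(calib_preds)
    # Absolute error of every model at every calibration time step.
    abs_err = [[abs(o - p) for o, p in zip(calib_obs, preds)] for preds in calib_preds]

    combined, weights = [], []
    for t, p_now in enumerate(valid_precip):
        # K calibration time steps whose predictor is closest to the current one.
        order = sorted(range(len(calib_precip)), key=lambda i: abs(calib_precip[i] - p_now))
        neighbors = order[:k]
        # Average error of each model over the identified neighbors.
        mean_err = [sum(abs_err[m][i] for i in neighbors) / len(neighbors)
                    for m in range(n_models)]
        # ASSUMPTION: weights proportional to inverse average error.
        # The small constant avoids division by zero for a perfect model.
        inverse = [1.0 / (e + 1e-12) for e in mean_err]
        total = sum(inverse)
        w = [v / total for v in inverse]
        weights.append(w)
        combined.append(sum(w[m] * valid_preds[m][t] for m in range(n_models)))
    return combined, weights
```

With several conditioning variables, the absolute difference in the neighbor search would be replaced by a multivariate distance.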
The selection of the number of nearest neighbors, K, has been addressed in detail in the semiparametric and nonparametric statistics literature. It has been shown that the optimum number of neighbors is equal to the square root of the number of points (nc) available for identifying the neighbors, as nc approaches infinity [Fukunaga, 1990]. For cases with very small samples for model fitting/estimation, studies have used leave-one-out cross validation, selecting the K neighbors that minimize the modeler's performance metric [Craven and Wahba, 1979; Prairie et al., 2007; Devineni and Sankarasubramanian, 2010]. To develop the experimental design for evaluating the utility of the proposed multimodel algorithm, we consider the precipitation, PET, and streamflow available for the Tar River basin at Tarboro. Monthly time series of precipitation, PET, and streamflow for the Tar River at Tarboro were obtained from the national hydroclimatic database described by Sankarasubramanian and Vogel. The available 37 years (1952–1988) of monthly time series were split into two sets, with the first 25 years (1952–1976) being employed for calibration and the remaining 12 years (1977–1988) being kept for validation.
To develop the experimental design for evaluating the utility of the proposed multimodel algorithm, we consider the case study of predicting the monthly streamflow for the Tar River basin at Tarboro (discussed in detail in section 4). The proposed multimodel combination algorithm was applied to the Tar River at Tarboro, and we found that K = 20 gives the lowest RMSE for the calibration period (1952–1976). Since the focus of this study is to evaluate the utility of multimodel combination in real-time prediction conditioned on the inputs, we estimated weights by evaluating the candidate models' performance over the calibration period. However, if the purpose is to develop better simulated flows from all the models, one could consider the entire period over which streamflow records are available for estimating the weights. Hence, we choose K = 20 for the synthetic study. It is important to note that our estimate of 20 neighbors is close to the optimum number of neighbors (i.e., 18 neighbors) suggested by Fukunaga [1990], since we consider 25 years of monthly flows (25 × 12) for identifying neighbors. Thus, for application at other timescales (e.g., daily), one could consider Fukunaga's suggestion as a preliminary estimate of the number of neighbors. Another issue related to the selection of K arises for basins experiencing many months of zero rainfall (e.g., North East Brazil, see Sankarasubramanian et al.). For basins experiencing zero rainfall, it would be better to consider additional conditioning variables, such as temperature/PET or initial states of the model (e.g., simulated soil moisture), so that we identify proper neighbors that influence the streamflow. Devineni et al. [2008] employed the Mahalanobis distance for identifying neighbors with multiple predictors. This way, the ability of the model to predict on days with zero rainfall could be differentiated based on relevant conditioning variables.
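The square-root rule of thumb mentioned above is easy to check for the 25-year monthly calibration record. The snippet below is only an arithmetic illustration; the helper name `sqrt_rule_neighbors` is hypothetical, and rounding up to the next whole neighbor is an assumption made here to arrive at an integer K.

```python
import math


def sqrt_rule_neighbors(n_points):
    """Square-root rule of thumb for K, rounded up to a whole neighbor (assumed rounding)."""
    return math.ceil(math.sqrt(n_points))


n_calibration_months = 25 * 12  # 25 years of monthly flows
print(math.sqrt(n_calibration_months))            # about 17.32
print(sqrt_rule_neighbors(n_calibration_months))  # 18
```

For a daily record of the same length, the same rule would suggest a considerably larger K, which is why it is best treated as a starting point for cross validation.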
Devineni et al. [2008] considered different probabilistic skill metrics for quantifying the skill of forecasts under similar neighboring (K) conditions. We considered both the mean absolute deviation and the mean-squared error for obtaining model weights, but the difference in model weights between the two metrics was minimal. Given that we focus on monthly streamflow predictions, the weights are perhaps not very sensitive to the choice between these two metrics over the identified neighbors. On the other hand, if one were to apply this approach for combining streamflows from multiple models at the daily timescale, it may be preferable to use the mean-squared error, since daily streamflow exhibits more variation and more extremes in the observed values.
2.2. Optimal Combination of Individual Models
As a baseline for the proposed dynamic weight-based algorithm, MM-1, we also consider a static weights approach in which the weights for individual models are obtained purely based on optimization. This is the baseline multimodel procedure against which we compare the performance of MM-1. We obtain weights based on equation (1) that minimize the squared error between the multimodel predictions and the observations over the calibration period:

$$\min_{w_1,\ldots,w_M} \sum_{t=1}^{n_c} \left( \sum_{m=1}^{M} w_m \hat{Q}_t^m - Q_t \right)^2 \quad \text{subject to} \quad \sum_{m=1}^{M} w_m = 1 \qquad (1)$$

where $w_m$ are the weights (decision variables) for model m, $\hat{Q}_t^m$ is the streamflow prediction at time step t from model m, and $Q_t$ is the streamflow observation at time step t. The weights have to sum to 1, which is included as a constraint when minimizing the objective function (1). The weights are obtained using the optimization algorithm available in the MATLAB optimization toolbox [Byrd et al., 1999]. Once the weight for each model is determined over the calibration period, we employ these optimal weights in the validation period to combine the predictions from the single models and obtain the multimodel prediction (MM-O) based on static weights. Thus, in MM-O, the weight for a given model does not vary with the time step “t” when obtaining multimodel predictions. The next section details a synthetic streamflow generation scheme for evaluating the performance of the two multimodel combinations (MM-1 and MM-O) proposed in this section.
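For illustration, the static weights can also be obtained without a numerical optimizer when the only constraint is that the weights sum to one. The sketch below is not the MATLAB routine used in the study; it swaps in a closed-form least-squares solution by eliminating the last weight and solving the resulting normal equations with Gaussian elimination. The function names `solve_linear` and `mmo_weights` are hypothetical, and non-negativity of the weights is not enforced, which is an assumption of this sketch.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: candidate predictions are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def mmo_weights(obs, preds):
    """Static weights minimizing squared error, subject to the weights summing to one.

    obs:   observed flows over the calibration period
    preds: list of M lists, one list of calibration predictions per model
    """
    last = preds[-1]
    # Substitute w_M = 1 - sum(other weights): regress (obs - last) on (pred_m - last).
    diffs = [[p - q for p, q in zip(model, last)] for model in preds[:-1]]
    target = [o - q for o, q in zip(obs, last)]
    a = [[sum(x * y for x, y in zip(di, dj)) for dj in diffs] for di in diffs]
    b = [sum(x * y for x, y in zip(di, target)) for di in diffs]
    free = solve_linear(a, b)
    return free + [1.0 - sum(free)]
```

The returned weights are then held fixed for every time step of the validation period, which is what distinguishes this scheme from the dynamic weights of section 2.1.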
 Though many studies have demonstrated improved multimodel combination techniques by evaluating them through case studies, to our knowledge, this is the first attempt to understand why multimodel combination results in improved performance using a synthetic study with the true model/flows being known. For this purpose, an experimental design is presented in this section to understand why multimodel combination techniques perform better than individual models. The experimental design is based on a synthetic streamflow generation scheme, which assumes that the “true” parameters of a given hydrologic model are known.
Given a hydrologic model, fm(.), whose arguments are the inputs, the initial conditions corresponding to time step t, and the true model parameters, one can obtain the true flows from model m. Basically, the function fm(.) is a set of nonlinear storage equations that convert the inputs and initial conditions into a set of leftover storage and outflows (i.e., streamflow and evapotranspiration) from the watershed. Since we consider the same prescribed inputs to obtain true flows, we assume the input uncertainty/errors to be very similar across all the models. We consider the observed monthly time series of precipitation and PET available for the Tar River at Tarboro as the true inputs, with the true streamflow being generated from the selected watershed model, whose true parameters are obtained by calibrating the selected model against the observed flows at Tarboro. We add noise to the true flows to develop corrupted flows, which are employed for testing the performance of different candidate watershed models and multimodel combination techniques in predicting the true flows from a given model. The exact steps of the experimental design are provided below:
1. Generate corrupted streamflow, Qt, by adding a measurement error term to the true flows. The error term has zero mean and a variance that depends on f2, the variance inflation factor [Cureton and D'Agostino, 1983], and on the variance of monthly flows in the given month, with t = 1, 2, … , n. This indicates that the measurement errors are heteroscedastic and the error variances vary depending on the month. We did not consider serially correlated measurement errors, since the focus of the study is at the monthly timescale. We split the corrupted flows, Qt, into two parts, t = 1, 2, … , nc and t = nc + 1, nc + 2, … , n, with one for calibration (1952–1976) and the other for validation (1977–1988), respectively.
2. Calibrate the candidate model m (discussed in the next section) using Qt and the inputs, and obtain the model parameters using the data available for the calibration period (t = 1, 2, … , nc). The parameters of the watershed models are obtained by minimizing the chosen objective function (discussed in the next section) using the optimization algorithm in the MATLAB optimization toolbox [Byrd et al., 1999].
3. Validate the model using the inputs and the calibrated parameters, obtain model-predicted flows over the validation period t = nc + 1, nc + 2, … , n, and compute the root-mean-square error (RMSE) between the model-predicted flows and the corrupted flows, Qt, over the validation period.
4. Combine the flows obtained from individual models using the multimodel combination techniques (discussed in sections 2.1 and 2.2) and obtain the RMSE between the multimodel-predicted flows and the corrupted flows, Qt, for the two combination techniques.
 5. Repeat the above procedures (1–4) 100 times and record the RMSE for individual models and multimodels for further analysis.
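Step 1 of the design can be illustrated with a short Python sketch. It is not the code used in the study: the function names are hypothetical, the record is assumed to start in the first calendar month and to span at least two full years, and the error variance is assumed to be the product of f2 and the sample variance of the true flows in the corresponding calendar month. That product form is an assumption about how the variance inflation factor enters.

```python
import random


def monthly_variances(flows):
    """Sample variance of the flows for each of the 12 calendar months."""
    variances = []
    for month in range(12):
        values = flows[month::12]  # every 12th value belongs to the same month
        mean = sum(values) / len(values)
        variances.append(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return variances


def corrupt_flows(true_flows, f2, rng):
    """Add zero-mean Gaussian noise whose variance changes with the calendar month.

    ASSUMPTION: error variance = f2 * (variance of true flows in that month).
    """
    var_by_month = monthly_variances(true_flows)
    corrupted = []
    for t, q in enumerate(true_flows):
        sd = (f2 * var_by_month[t % 12]) ** 0.5
        corrupted.append(q + rng.gauss(0.0, sd))
    return corrupted


def split_record(flows, n_calibration):
    """Split a record into calibration (first n_calibration steps) and validation parts."""
    return flows[:n_calibration], flows[n_calibration:]
```

Passing a seeded `random.Random` instance makes each of the 100 realizations reproducible. Note that additive Gaussian noise can produce negative flows for large f2; how such values were handled is not stated in the text, so the sketch leaves them unchanged.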
3.1. Candidate Models and Data Sources
We consider two hydrologic models, the “abcd” model and the “VIC” model (Appendix A), for developing true flows as well as for estimating the predicted flows. Apart from these two models, we also consider a water balance model developed using Budyko's supply-demand concept by Zhang et al. (see Appendix A), which is used only for generating true streamflow to understand how the multimodel combination addresses hydrologic model structural uncertainty (section 3.3) when the individual model predictions come from the abcd and VIC models. The inputs, monthly time series of precipitation and PET for the Tar River at Tarboro, are obtained from the monthly climate database (water years 1952–1988) for the hydroclimatic data network (HCDN) sites [Vogel and Sankarasubramanian, 2000]. The true parameters for each model are obtained by calibrating the model against the observed flows by minimizing the sum of squares of errors (SSE) over the period 1952–1976. We then generate the true streamflow from the abcd and VIC models using the monthly time series of precipitation and PET and the true parameters. Depending on the selected f2 and the monthly variance, the true flows are further corrupted with Gaussian noise to generate monthly flows (Qt) with measurement errors for 37 years.
Both models are calibrated against the synthetically generated streamflow, Qt, based on two objective functions (step 2 in the experimental design), SSE and the sum of absolute errors (SAE), to estimate the flows under validation using the calibrated model parameters. We use two objective functions because they differ in their ability to predict the observed flow conditions. For instance, it has been shown that models calibrated using an ordinary least-squares objective function perform well in predicting extreme conditions, whereas models calibrated with a heteroscedastic maximum likelihood estimator perform well in predicting normal flow conditions [Yapo et al., 1998]. By calibrating two hydrologic models using two different objective functions, we obtain a total of four streamflow predictions over the validation period, with m = 1 (2) representing abcd model-predicted flows using SSE (SAE) parameter estimates and m = 3 (4) representing VIC model-predicted flows using SSE (SAE) parameter estimates. Using these four streamflow predictions, we develop two different multimodel predictions (MM-1 and MM-O), whose performance is compared against the individual model predictions over the validation period.
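The two objective functions are simple to state in code. The sketch below is illustrative only (the function names `sse` and `sae` and the toy data are not from the study): it fits a single constant to a small series containing one extreme value, which shows that the squared-error fit is pulled toward the extreme (it recovers the mean), whereas the absolute-error fit stays with the bulk of the data (it recovers the median).

```python
def sse(observed, predicted):
    """Sum of squared errors."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def sae(observed, predicted):
    """Sum of absolute errors."""
    return sum(abs(o - p) for o, p in zip(observed, predicted))


# Toy series with one extreme value (hypothetical numbers).
observed = [1.0, 2.0, 3.0, 4.0, 50.0]

# Grid search over a constant prediction, standing in for model calibration.
candidates = [c / 10 for c in range(0, 501)]
best_sse = min(candidates, key=lambda c: sse(observed, [c] * len(observed)))
best_sae = min(candidates, key=lambda c: sae(observed, [c] * len(observed)))

print(best_sse)  # 12.0, the mean of the series
print(best_sae)  # 3.0, the median of the series
```

The same contrast is why parameter sets calibrated under the two objectives can behave differently across flow conditions, giving the multimodel schemes four distinct candidate predictions from two model structures.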
3.2. Synthetic Study Results: Measurement Errors
Following the steps described in section 3.1, the RMSEs of the two multimodel predictions (MM-1 and MM-O) and the four individual model predictions (m = 1, 2, 3, 4) are obtained over the validation period. The corresponding weights for each individual model under the two multimodel combination algorithms are also considered for analysis. We consider the median of the RMSE as well as the entire probability distribution of RMSE for analysis. Figure 2 shows the overall performance of the different models, given the abcd model as the true model (Figures 2a and 2c) and the VIC model as the true model (Figures 2b and 2d), with measurement errors being heteroscedastic. From Figures 2a and 2b, we can infer that, for very small measurement errors (e.g., f2 of 0.1 or less), the median RMSEs of MM-1 and MM-O are larger than the median RMSE of the true model (i.e., abcd or VIC models). This may be difficult to see from the plots (Figures 2a and 2b), but it can be clearly seen from Figures 2c and 2d, which show the probability of the MM-1 RMSE being smaller than that of the rest of the models. Given that the true flows are from the abcd model, it is expected that the VIC model would not be able to outperform the abcd model.
As the measurement error variance increases (i.e., as f2 increases), multimodel MM-1 outperforms all the other candidate models, including the predictions from the true model. In these situations, the median RMSE of MM-1 is much lower than the median RMSE of the individual models as well as that of the optimal multimodel combination algorithm MM-O. It is important to note that the summarized RMSE is obtained only over the validation period, indicating the potential for application of the multimodel algorithm in real-time streamflow prediction. For the case with the VIC model as the true model (Figure 2b), as expected, the VIC model has the lowest RMSE among all the models for small measurement error variances. However, as the measurement error variance increases, both MM-1 and MM-O perform better than the individual models.
In Figures 2c and 2d, we compare the performance of individual models and multimodels based on their respective distributions of RMSE obtained from the 100 validation trials. Figures 2c and 2d plot the probability of multimodel MM-1 being the best model, that is, having the minimum RMSE in comparison to the individual models (abcd model or VIC model) and MM-O. From these two figures, we can clearly see that, for lower measurement errors, the probability of the RMSE of MM-1 being lower than the minimum RMSE of the true models and multimodel MM-O is very small (around 5%–10%). The VIC model (abcd model) does not appear in Figure 2c (Figure 2d) because, with the abcd model (VIC model) as the true model, the VIC model (abcd model) did not attain the minimum RMSE among the candidate models even once in the 100 validation trials. As the measurement error variance increases, the probability of MM-1 being the best model increases and approaches 1.
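The probabilities plotted in this kind of comparison can be computed directly from the per-trial RMSE values. The following Python sketch is a hypothetical illustration (the function names and the dictionary layout are assumptions, not the study's code): for each trial it finds the scheme with the lowest RMSE and reports the fraction of trials won by each scheme. Ties, if any, are assigned to the scheme listed first.

```python
def rmse(observed, predicted):
    """Root-mean-square error between two equally long series."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5


def probability_of_best(rmse_by_model):
    """Fraction of trials in which each model attains the lowest RMSE.

    rmse_by_model: dict mapping a model name to its list of RMSE values,
                   one value per validation trial (same length for all models).
    """
    names = list(rmse_by_model)
    n_trials = len(rmse_by_model[names[0]])
    wins = {name: 0 for name in names}
    for trial in range(n_trials):
        best = min(names, key=lambda name: rmse_by_model[name][trial])
        wins[best] += 1
    return {name: wins[name] / n_trials for name in names}
```

A model that never attains the minimum RMSE receives a probability of zero and would simply not appear in a plot of these fractions.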
Comparing the performance of MM-1 with MM-O (Figures 2c and 2d), we infer that MM-O performs slightly better than MM-1 for smaller values of f2, although the differences in RMSE between the two schemes are very small in that range. As f2 increases further, we clearly see that MM-1 outperforms MM-O, with the RMSE from MM-1 being lower than that from MM-O with higher probability (>65%). This is primarily due to the nature of the combination methods employed in MM-1 and MM-O. Under MM-O, the weights are obtained optimally based on the model performance over the calibration period, whereas under MM-1, the weights for multimodel combination vary depending on the predictor state and are estimated statistically based on the mean absolute error over K neighbors. Thus, under large model error variance, analyzing the model performance conditioned on similar predictor state(s) provides better estimates of weights for multimodel combination.
Given the overall reduction in RMSE from the multimodel combination methods, it is interesting to look at the performance under different flow conditions. Figure 3 presents the results with the abcd model as the true model by comparing model performance (median RMSE) under six different monthly flow conditions: (1) <10th percentile, (2) 10th to 25th percentile, (3) 25th to 50th percentile, (4) 50th to 75th percentile, (5) 75th to 90th percentile, and (6) >90th percentile. From Figure 3, apart from the previously discussed findings, we infer that the improvement of multimodel MM-1 under extreme conditions (below the 10th percentile and above the 90th percentile) is larger than under normal conditions (10th percentile to 90th percentile). On the other hand, MM-O performs about as well as the best individual model. One reason for the improved performance of MM-1 compared with MM-O under high-flow and low-flow conditions is that MM-O evaluates the performance of individual models over the entire calibration period, whereas MM-1 evaluates the performance of individual models based on the mean absolute error around the conditioning state. Thus, MM-1 outperforms the best-performing individual model and MM-O not only over the considered validation period but also under different flow conditions. The only exception is the 25th to 50th percentile class for large f2 (Figure 3c), under which MM-1 has a slightly higher RMSE than MM-O and the abcd model. This increased RMSE for MM-1 could be due to ignoring additional predictors that could help in the proper identification of neighbors (K). For instance, during normal flow conditions, it is reasonable to expect that errors in predicting the flows could depend on both precipitation and PET. We observed similar behavior (figure not shown) with the VIC model as the true model, with the VIC model outperforming the abcd model and MM-1 outperforming all the individual models and MM-O during all the flow conditions, except during the 25th to 50th percentile.
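The stratified comparison can be sketched as follows. This is a hedged illustration, not the study's code: the function names are hypothetical, the flow classes are assumed to be defined from the percentiles of the reference flows over the evaluation period, and a linear-interpolation percentile is assumed. The six classes follow the percentile breaks listed above.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q between 0 and 100) of pre-sorted values."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def rmse_by_flow_class(observed, predicted, edges=(10, 25, 50, 75, 90)):
    """RMSE within each flow class defined by percentile breaks of the observed flows.

    Returns one value per class, from the lowest flows to the highest flows;
    a class with no members yields None.
    """
    ordered = sorted(observed)
    thresholds = [percentile(ordered, q) for q in edges]
    squared = [[] for _ in range(len(edges) + 1)]
    for o, p in zip(observed, predicted):
        # The class index is the number of thresholds the observed flow reaches.
        cls = sum(o >= th for th in thresholds)
        squared[cls].append((o - p) ** 2)
    return [(sum(s) / len(s)) ** 0.5 if s else None for s in squared]
```

Applying the function to each model and to each multimodel scheme, and taking the median across the 100 trials, gives the kind of class-wise summary discussed above.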
To understand why multimodel MM-1 performs better in comparison to individual models and MM-O, we show box plots of the weights obtained for multimodel schemes MM-1 (Figures 4a and 4b) and MM-O (Figures 5a and 5b), with the true flows being from the abcd (Figures 4a and 5a) and VIC models (Figures 4b and 5b). Each box in Figure 4 represents the various percentiles of median weights for the four individual models from the 100 validation runs under the MM-1 scheme. Similarly, boxes in Figure 5 represent various percentiles of optimized weights (MM-O) obtained from the 100 validation trials. It is important to note that we employ median weights under a given validation run for MM-1, since MM-1 weights vary with respect to time. On the other hand, the MM-O scheme employs the same weight for each model in a given validation trial. From Figure 4, we can clearly see that when the measurement error variance (f2) is very small, MM-1 draws weights close to one from the true model and weights close to zero from the other candidate models. However, as measurement errors (f2) increase, MM-1 draws weights equally from all the models, since the flows from the true model are substantially corrupted. On the other hand, comparing these weights with the optimized weights obtained from the MM-O scheme (Figure 5), we see that the weights are drawn mostly from the abcd-OLS model alone in most situations (Figure 5a), although the other models are assigned slightly higher weights (0–0.2) as the measurement error variance (f2) increases. One reason for the increased weights for abcd-OLS is the way in which we obtain optimized weights under MM-O (in equation (1)). Since MM-O weights are obtained by minimizing the sum of squares of errors during the calibration period, larger weights are assigned to abcd-OLS in comparison to abcd-ABS. With the VIC model as the true model (Figure 5b), we can see that both VIC-OLS and VIC-ABS obtain higher weights, whereas the abcd models gain little weight under large measurement error variance.
VIC-OLS weights are slightly higher than VIC-ABS weights (Figure 5b), which is primarily due to the way weights are estimated using equation (1). However, the weights under MM-O are far more equally distributed between the two estimates of the VIC model when the true flow arises from the VIC model, which differs from the MM-O weights for the candidate models when the true flow arises from the abcd model (Figure 5a). This requires further investigation, but one possible reason could be the structural differences between the VIC and abcd models. This brings us to an important point in understanding how the multimodel combination performs if the true flows arise from a model whose structure is completely different from that of the candidate models. We discuss the results from this analysis in the next section.
3.3. Synthetic Study Results: Model Structural Uncertainty
Results from section 3.2 showed how both the dynamic weights approach (MM-1) and the static weights approach (MM-O) for combining multiple models improve monthly streamflow estimates in comparison to the estimates from individual models if the true flow is corrupted with measurement error. However, this analysis does not consider structural uncertainty, since the true model was also part of the candidate models whose parameters are estimated from the corrupted flows. To understand how the multimodel combination techniques perform under model structural uncertainty and measurement errors, we generate flows from the Zhang et al. water balance model (see Appendix A for a quick overview of the model) for varying values of “f2” to account for measurement errors. The candidate models are abcd-OLS, abcd-ABS, VIC-OLS, VIC-ABS, MM-1, and MM-O.
Figure 6a shows the performance of all the models based on the median of the RMSE, and Figure 6b shows the probability of MM-1 being the best model in comparison to the rest of the models, when the true flows arise from the Budyko framework model of Zhang et al. It is very clear from the figure that the performance of MM-1 is better than that of the rest of the models. Though it is difficult to see from the plot, the median RMSE of MM-1 is lower than the median RMSE of the rest of the models for all values of f2. This can be inferred from Figure 6b, which shows the probability of the RMSE of individual models or MM-1 being the lowest across all the models. From Figure 6b, we understand that MM-1 has a 52% probability of attaining the minimum RMSE among all the models over 100 realizations even when f2 is equal to zero. Among the individual models, the RMSE of the abcd model was always lower than the RMSE of the VIC model. Hence, the VIC model is not shown in the plot. Similarly, we do not show the probabilities for MM-O, because it did not have the lowest RMSE in even one trial. This does not mean that MM-1 performed better than MM-O over all the 100 trials. Basically, it implies that whenever MM-O performed better than MM-1, the abcd model performed better than MM-O. As f2 increases, the probability of MM-1 having the lowest RMSE increases further, indicating its superior performance in the presence of both measurement and structural errors. Thus, when streamflow exhibits structural and measurement errors, the performance of MM-1 improves further in comparison to a situation with no model structural uncertainty. In real-world situations, it is natural to expect structural uncertainty to be significant, since it is difficult to explicitly model the actual physical processes within the water balance model. Though we have not shown the weights plots for this scenario, we found that MM-O draws weights from all the models equally under increased f2.
Thus, the synthetic study clearly demonstrates that it is more prudent to employ the proposed dynamic model combination approach, MM-1, particularly when the observed streamflow encompasses both measurement errors and structural uncertainty.
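The two summary statistics used in this comparison (median RMSE and the probability of having the lowest RMSE across realizations) can be illustrated with a short Python sketch. The function names and the toy RMSE values are hypothetical and are not taken from the study; the sketch only assumes that one RMSE value per model is available for each realization.

```python
import numpy as np

def rmse(pred, obs):
    """Root-mean-square error between predicted and reference flows."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def summarize_realizations(rmse_table):
    """Summarize model skill over many synthetic realizations.

    rmse_table maps a model name to a sequence of RMSE values,
    one per realization (all sequences have the same length).
    Returns (median RMSE per model, probability of lowest RMSE per model).
    """
    names = list(rmse_table)
    values = np.column_stack(
        [np.asarray(rmse_table[n], dtype=float) for n in names]
    )
    # Index of the model with the lowest RMSE in each realization.
    winners = np.argmin(values, axis=1)
    n_real = values.shape[0]
    median_rmse = {n: float(np.median(values[:, k])) for k, n in enumerate(names)}
    prob_best = {n: float(np.sum(winners == k)) / n_real for k, n in enumerate(names)}
    return median_rmse, prob_best
```

Because the probability counts only the single best model in each realization, a model can have zero probability of being best without being outperformed by any one competitor in every trial, which is the situation described for MM-O.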
 To further understand why MM-1 performs better than MM-O even in the presence of model structural uncertainty, the average weights (Figure 7) for each candidate model were computed over 100 realizations conditioned on the monthly precipitation value over the validation period for two different values of f2: f2 = 0.5 (Figure 7a) and f2 = 1.0 (Figure 7b). It is important to note that, for each f2, the input variable, precipitation, does not vary, whereas the generated synthetic streamflow varies according to the value of f2. Since the weights of MM-1 were estimated by evaluating the models' performance conditioned on the predictor state, the average weights over 100 realizations for the four candidate models vary depending on the monthly precipitation in the validation period. On the other hand, MM-O prescribes static weights for each model, since the weights under a given realization do not vary over the entire validation period. Apart from this, Figure 7 reveals several interesting features. For instance, under both f2 = 0.5 and f2 = 1.0, MM-1 estimates almost equal weights for the abcd and VIC models for monthly precipitation values around 100 mm/month, since all models perform similarly in that neighborhood. However, MM-1 gives relatively lower weights to the VIC model under a monthly precipitation of 50 mm/month due to its limited ability to predict streamflow under those conditions. Further, as the measurement error variance f2 increases, the weights assigned to each candidate model under MM-1 decrease and are spread over a smaller range for f2 = 1.0. On the other hand, MM-O always gives higher weights to the best-performing model, even under f2 = 1.0. Given that MM-1 evaluates the best-performing model under a given input/predictor state, it estimates dynamic weights for model combination conditional on the input state. The next section applies the proposed multimodel combination algorithm for predicting the observed streamflow in four different watersheds.
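The idea of weights that depend on the predictor state can be sketched in Python as follows. This is only an illustration under stated assumptions: the neighbors are the K calibration months whose precipitation is closest to the current month, and each model's weight is taken proportional to the inverse of its mean squared error over those neighbors. The inverse-MSE rule, the function names, and the toy inputs are assumptions made for the sketch and may differ from the weighting used in MM-1.

```python
import numpy as np

def dynamic_weights(precip_cal, errors_cal, precip_now, k):
    """Weights for one month from model skill over the k nearest calibration months.

    precip_cal : (n,) calibration precipitation (the predictor state)
    errors_cal : (n, m) calibration errors (prediction minus observation), m models
    precip_now : precipitation of the month being predicted
    """
    precip_cal = np.asarray(precip_cal, dtype=float)
    errors_cal = np.asarray(errors_cal, dtype=float)
    # Nearest neighbors in predictor space (absolute distance in precipitation).
    nearest = np.argsort(np.abs(precip_cal - precip_now))[:k]
    mse = np.mean(errors_cal[nearest] ** 2, axis=0)
    # Assumed rule: weight proportional to 1/MSE, normalized to sum to one.
    inv = 1.0 / np.maximum(mse, 1e-12)
    return inv / inv.sum()

def combine_dynamic(precip_cal, errors_cal, precip_val, preds_val, k):
    """Combine validation predictions (rows: months, columns: models)."""
    preds_val = np.asarray(preds_val, dtype=float)
    combined = np.empty(len(precip_val))
    for t, p in enumerate(precip_val):
        w = dynamic_weights(precip_cal, errors_cal, p, k)
        combined[t] = preds_val[t] @ w
    return combined
```

With this construction, the weights spread toward equality whenever the models have similar errors in the neighborhood of the current precipitation value, and they concentrate on one model where that model is clearly better.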
4. Multimodel Combination: Application
 To evaluate the utility of the proposed multimodel combination technique in predicting observed streamflow, we consider four basins as our study region: two humid basins from North Carolina (NC) and two semiarid basins from Arizona (AZ). The four selected watersheds are as follows: Tar River at Tarboro, NC (HCDN ID: 02083500), Haw River near Benaja, NC (HCDN ID: 02093500), San Pedro River at Charleston, AZ (HCDN ID: 09471000), and Salt River near Roosevelt, AZ (HCDN ID: 09498500). Given that these stations are from the HCDN database [Vogel and Sankarasubramanian, 2005], the observed flows are virgin, without any significant anthropogenic influences. Detailed physical characteristics of these four basins as well as the length of the study period are summarized in Table 1. Monthly time series of precipitation, PET, and streamflow are obtained from the national climatic database developed by Vogel and Sankarasubramanian.
Table 1. Physical Characteristics of Different Selected Basins
San Pedro River
Tarboro, North Carolina
Near Benaja, North Carolina
Near Roosevelt, Arizona
Annual precipitation (mm)
Oct 1952 to Sep 1988
Oct 1952 to Sep 1971
Oct 1952 to Sep 1988
Oct 1952 to Sep 1988
 We calibrate the abcd and VIC models based on two objective functions, by minimizing the SAE and the SSE over the calibration period (1952–1976). Based on the calibrated parameters, we obtain individual model predictions over the validation period (1977–1988). Following that, we apply the two multimodel combination algorithms discussed in section 3 to these four basins and compare their performance with the single model predictions developed for these basins. We calculate the correlation and RMSE (Figure 8) for all four single model predictions and the two multimodel predictions over the validation period.
 From Figure 8, we can see that the correlation between the multimodel (MM-1) predicted streamflow and the observed streamflow shows significant improvement in comparison to the correlations estimated from the single model predictions over the four basins. On the other hand, the correlation of MM-O is at about the same level as that of the best single model. This is probably due to MM-O's weighting scheme, which obtains the weights for individual models by minimizing the SSE over the calibration period (equation (1)). Further, the RMSE of MM-1 is also lower than that of the four single models and MM-O for all four basins during the validation period. The primary reason behind MM-1's better performance is its weighting scheme, which evaluates the performance of candidate models conditioned on the input state, thereby giving higher weights to the model that performs well under similar predictor conditions.
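As a rough illustration of the static weighting idea, the following Python sketch estimates one weight per model by minimizing the SSE over the calibration period. The sum-to-one constraint, the function name `static_weights`, and the toy numbers are assumptions made for this example; equation (1) may impose different or additional constraints (for example, nonnegative weights), which this sketch does not enforce.

```python
import numpy as np

def static_weights(preds_cal, obs_cal):
    """Least-squares weights that sum to one (one fixed weight per model).

    preds_cal : (n, m) calibration-period predictions from m models
    obs_cal   : (n,) calibration-period observed flows
    """
    preds_cal = np.asarray(preds_cal, dtype=float)
    obs_cal = np.asarray(obs_cal, dtype=float)
    # Eliminate the last weight with w_m = 1 - sum(other weights), which
    # turns the constrained problem into ordinary least squares.
    last = preds_cal[:, -1]
    design = preds_cal[:, :-1] - last[:, None]
    free, *_ = np.linalg.lstsq(design, obs_cal - last, rcond=None)
    return np.append(free, 1.0 - free.sum())
```

Once estimated, the same weight vector is applied to every month of the validation period, which is what distinguishes this scheme from a combination whose weights change with the predictor state.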
 To understand this further, we evaluate the ability of the individual models and multimodels (MM-1 and MM-O) in predicting the seasonality of streamflow (Figure 9). To compare across months, we consider the relative RMSE, which is calculated as $\mathrm{RRMSE}_j^m = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat{Q}_{i,j}^m - Q_{i,j}}{Q_{i,j}}\right)^2}$, where $\hat{Q}_{i,j}^m$ and $Q_{i,j}$ indicate the single model “m” predicted flows and the observed flows for month “j” in year i, and n is the number of years in the validation period. Thus, the squared relative errors in predicting the monthly flows are averaged over the respective month in the validation period to compute the relative RMSE. For clarity, this figure presents the results only for the best-performing individual models, abcd-OLS and abcd-ABS. The relative RMSE for the multimodels, MM-1 and MM-O, is computed in the same way. The primary Y axis in Figure 9 shows the relative RMSE for the abcd models and multimodels, and the secondary Y axis indicates the mean monthly streamflow. From Figure 9, we can clearly see that for the high-flow months at each station, the multimodel MM-1 performs better than the individual models and the multimodel MM-O. However, during the respective low-flow months, MM-1 does not perform better than the candidate single models or MM-O. During low-flow months, the precipitation is low and the runoff from the watershed could be more dependent on the available energy (i.e., PET). The higher relative RMSE also arises because the flows are very small, so that a small difference in the prediction is amplified. Thus, it may be appropriate to identify neighbors based on PET and initial conditions as well. For instance, for the San Pedro River (Figure 9c), MM-1 performs poorly even during July, which is one of the wettest months. This is mainly because the combination does not take into account the initial conditions of the model. As we move from the drier period (spring) to the wetter period (summer), the neighbors identified purely based on precipitation do not yield proper weights for model combination. Devineni et al. suggested using the Mahalanobis distance for identifying neighbors under multiple predictors. We plan to investigate combining multiple models contingent on multiple predictors as part of a future study. Overall, from Figures 8 and 9, it is clear that MM-1 performs better than the individual models and MM-O over the entire validation period (Figure 8) as well as in predicting the peak flow months (Figure 9). In our future studies, we will investigate the role of PET and other initial conditions such as soil moisture and groundwater as conditioning variables by considering basins exclusively from the arid regions.
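The month-wise relative RMSE described above can be computed with a few lines of Python. The function name and the toy flows are hypothetical; the sketch assumes that all observed flows are nonzero.

```python
import numpy as np

def relative_rmse_by_month(pred, obs, months):
    """Relative RMSE for each calendar month.

    pred, obs : predicted and observed monthly flows (same length)
    months    : calendar month (1-12) of each value
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    months = np.asarray(months)
    # Squared relative error of every month in the record.
    rel_sq = ((pred - obs) / obs) ** 2
    # Average within each calendar month, then take the square root.
    return {int(j): float(np.sqrt(rel_sq[months == j].mean()))
            for j in np.unique(months)}
```

For example, an absolute error of 2 units on an observed flow of 10 gives a relative error of 0.2, whereas an error of 10 units on a flow of 100 gives only 0.1, which illustrates how small flows inflate the relative RMSE during low-flow months.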
 A new approach based on dynamic weights is proposed to combine predictions from multiple hydrologic models. The motivation is to understand why multimodel combination performs better than individual candidate models and also to evaluate its utility in real-time streamflow prediction. For this purpose, we evaluated the proposed algorithm under a synthetic streamflow setting with the true model being known, as well as in predicting the monthly streamflow for four basins with varied hydroclimatic settings. Though the proposed dynamic combination approach, MM-1, performs better than the individual models and the static combination approach (MM-O), the algorithm introduces one additional parameter, K, the number of neighbors, in developing the multimodel predictions. K could be chosen by plotting the goodness-of-fit statistics of MM-1 over the calibration period for different values of K. Another approach that could be pursued for estimating the optimum number of neighbors is based purely on the square root of the number of points ($\sqrt{n}$) available for calibration. Since our interest is in evaluating the algorithm for improving real-time streamflow prediction, we did not estimate K using the validation period. Estimating weights purely based on the calibration period introduces a potential drawback, particularly if the candidate models perform very differently between the calibration and validation periods. Under such situations, the entire calibration window or fitting period could be moved 1 month at a time [Oh and Sankarasubramanian, 2012], so that the performance of each candidate model in predicting the streamflow over the recent period of record could be considered in estimating the number of neighbors. However, if such a distinct difference in model performance is noted between the calibration and validation periods, it may even require recalibration of the individual models.
Further, considering a moving window for estimating the number of neighbors introduces substantial computational challenges, particularly when the methodology is employed over large spatial scales.
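Both options for choosing K can be sketched in Python. The leave-one-out evaluation over the calibration period, the inverse-MSE weighting, and the function names below are assumptions made for illustration and are not the procedure used in the study.

```python
import math
import numpy as np

def sqrt_rule(n_points):
    """Number of neighbors from the square root of the calibration sample size."""
    return max(1, round(math.sqrt(n_points)))

def loo_rmse_for_k(precip, preds, obs, k):
    """Leave-one-out RMSE of a neighbor-weighted combination for a given k."""
    precip = np.asarray(precip, dtype=float)
    preds = np.asarray(preds, dtype=float)   # (n, m) calibration predictions
    obs = np.asarray(obs, dtype=float)
    sq_err = (preds - obs[:, None]) ** 2
    combined = np.empty(len(obs))
    for t in range(len(obs)):
        dist = np.abs(precip - precip[t])
        dist[t] = np.inf                     # exclude the month being predicted
        nearest = np.argsort(dist)[:k]
        # Assumed rule: weights proportional to inverse MSE over the neighbors.
        inv = 1.0 / np.maximum(sq_err[nearest].mean(axis=0), 1e-12)
        combined[t] = preds[t] @ (inv / inv.sum())
    return float(np.sqrt(np.mean((combined - obs) ** 2)))

def choose_k(precip, preds, obs, candidates):
    """Return the candidate k with the lowest leave-one-out RMSE, plus all scores."""
    scores = {k: loo_rmse_for_k(precip, preds, obs, k) for k in candidates}
    return min(scores, key=scores.get), scores
```

The scores returned by `choose_k` are the values one would plot against K. Repeating this scan inside a moving calibration window multiplies its cost by the number of window positions, which is the computational burden noted above.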
 Another issue worth discussing is extending the approach to combine the predictions from distributed models. If the distributed models have one parameter set for the entire watershed, then one could obtain a single number of neighbors, K, that minimizes the goodness-of-fit statistics over the calibration period. On the other hand, if the interest is in predicting streamflow at different locations, then different numbers of neighbors could be estimated for various subbasins depending on the spatial variability in the input variables (i.e., precipitation/PET) within the basin. For instance, if the upper basin exhibits spatial coherence in the input variables, then one value of K could be considered for the upper basin, with different numbers of neighbors being estimated for the lower basins. Another possible way is to estimate the number of neighbors, K, as part of a Bayesian modeling framework [Kuczera et al., 2006; Kavetski et al., 2006; Marshall et al., 2007], so that a posterior distribution of the number of neighbors could be estimated. One goal for a future study is to extend the proposed dynamic model combination approach into a fully integrated Bayesian modeling framework that recognizes the spatiotemporal covariability in estimating the model parameters, including the number of neighbors for evaluating the candidate models' performance.
5. Summary and Conclusions
 A systematic analysis of how multimodel combination reduces measurement uncertainty and model structural uncertainty is presented. The study also proposes a new dynamic-weighting approach (MM-1) that combines multiple hydrologic model predictions by assigning higher weights to a model that performs well under a given predictor(s) state. The performance of MM-1 is compared with optimized multimodel combination (MM-O), which obtains static weights for individual models purely by minimizing the errors over the calibration period. Thus, under MM-1, the model weights vary with time, whereas under MM-O, the weights depend purely on the model performance over the calibration period. Using flows generated from a known hydrologic model under measurement uncertainty and model structural uncertainty, the study compares MM-1 and MM-O with two single models whose parameters are obtained from two different objective functions. Apart from the synthetic study, the proposed multimodel combination methodology, MM-1, is also compared with single models and MM-O in predicting observed flows over four different watersheds.
 Results from the synthetic study show that under increased measurement errors, both multimodels (MM-1 and MM-O) perform better than the candidate single models even if the flows are generated from one of them. Overall, MM-1 performs better than MM-O in predicting the monthly flow values as well as in predicting extreme monthly flows (<10th percentile and >90th percentile). Comparison of the weights obtained for each candidate model reveals that as measurement errors increase, MM-1 assigns weights equally to all the models, whereas MM-O always assigns higher weights to the model that performed best over the calibration period. Given that the flows are corrupted under large measurement errors, MM-1 provides higher weights for the best-performing single model under a given predictor state, resulting in reduced RMSE over the entire validation period as well as under extreme flow conditions.
 We also evaluated the performance of the multimodel combination scheme, MM-1, under model structural uncertainty. For this purpose, we generated true flows with measurement errors from a different water balance model that was not one of the candidate models. Even under this scenario with no measurement errors, MM-1 performed better than the rest of the candidate models, but the difference in RMSE between the best individual model and MM-1 is very small. As the measurement error increased under model structural uncertainty, the performance of MM-1 improved substantially compared to the rest of the candidate models. Thus, the analyses from the synthetic study show that under increased model and measurement uncertainty, it is better to consider multiple models for monthly streamflow predictions. We also infer that our proposed dynamic weights approach of combining multiple models contingent on the predictor state performed better than the static combination approach.
 The study also evaluated both multimodel combination schemes, MM-1 and MM-O, by predicting the flows under different hydrologic regimes. Over the four selected sites, both multimodel schemes performed better than the streamflow predictions obtained from the best single model. Analyzing the ability of single models and multimodels in predicting the seasonality of monthly runoff showed that MM-1 performs better than single models and MM-O in high-flow months, whereas MM-O performed better than MM-1 and single models during low-flow seasons. One possible reason for this poor performance of MM-1 is that the precipitation is low during low-flow months, and the runoff from the watershed could be more dependent on other variables including soil moisture and PET. Thus, it may be necessary to consider additional predictors, such as PET and soil moisture, for combining single models. Though the proposed algorithm was demonstrated for lumped models, we expect similar behavior upon application of the algorithm to distributed models. We intend to investigate this in a future study.
Appendix A: Hydrological Models
 We consider two watershed models, the abcd model [Thomas, 1981] and the VIC model [Liang et al., 1994; Abdulla and Lettenmaier, 1997], for the purpose of demonstrating why multimodel combination performs better than the individual models. Since both of these models have been described in numerous studies, we present only a brief description of the models and their parameters here. Apart from these two models, for understanding the model structural uncertainty, we generate flows from a monthly water balance model based on Budyko's framework for long-term hydroclimatology [Budyko, 1958], which is also briefly described here.
A1. abcd Model
 The abcd model is a nonlinear hydrologic model that accepts precipitation and potential evaporation as input, producing streamflow as output. Internally, the model also represents soil moisture storage, groundwater storage, direct runoff, groundwater outflow to the stream channel, and actual evapotranspiration. The abcd model was originally introduced by Thomas at Harvard University and was later compared with numerous monthly water balance models, leading to its recommendation by Alley. A detailed description of the abcd model can be found in Sankarasubramanian and Vogel. The model has four parameters, “a,” “b,” “c,” and “d,” each having a physical significance. Parameter a reflects the propensity of runoff to occur before the soil is fully saturated, whereas parameter b represents the upper limit on the sum of actual evapotranspiration and soil moisture storage in a given month. Parameter c governs the allocation of the water that leaves the unsaturated zone between the saturated zone and direct runoff and thus relates to the baseflow index. Parameter d denotes the reciprocal of the groundwater residence time. All four parameters were estimated using automatic calibration by minimizing the SSE or the sum of absolute errors (SAE).
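For readers unfamiliar with the model, the monthly update can be sketched in Python using the form of the abcd equations commonly given in the literature. The function names, the zero default initial storages, and the parameter values in the usage note are assumptions made for illustration and are not the calibrated values from this study.

```python
import math

def abcd_step(precip, pet, soil_prev, gw_prev, a, b, c, d):
    """One monthly step of the abcd model (commonly cited formulation).

    Returns (streamflow, actual ET, soil moisture storage, groundwater storage).
    """
    avail = precip + soil_prev                       # available water
    half = (avail + b) / (2.0 * a)
    # Evapotranspiration opportunity: never exceeds available water or b.
    opportunity = half - math.sqrt(half * half - avail * b / a)
    soil = opportunity * math.exp(-pet / b)          # end-of-month soil moisture
    et = opportunity - soil                          # actual evapotranspiration
    excess = avail - opportunity                     # water leaving the soil zone
    gw = (gw_prev + c * excess) / (1.0 + d)          # groundwater storage
    flow = (1.0 - c) * excess + d * gw               # direct runoff + baseflow
    return flow, et, soil, gw

def simulate_abcd(precip, pet, a, b, c, d, soil0=0.0, gw0=0.0):
    """Run the model over monthly series and return the simulated streamflow."""
    soil, gw, flows = soil0, gw0, []
    for p, e in zip(precip, pet):
        q, _, soil, gw = abcd_step(p, e, soil, gw, a, b, c, d)
        flows.append(q)
    return flows
```

A call such as `simulate_abcd(precip, pet, 0.98, 300.0, 0.3, 0.1)` returns one flow per month. Each step conserves mass: precipitation plus the previous storages equals streamflow plus evapotranspiration plus the updated storages.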
A2. VIC Model
 The VIC model is a physically based hydrologic model that simulates water and energy fluxes at the land surface using three vertical soil layers [Liang et al., 1994, 1996]. VIC model comprises a two-layer characterization of the soil column. Since the focus of the study is to demonstrate why multimodel combination performs better than the individual models, we consider a lumped version of the VIC model to reduce computation. The infiltration algorithm in the VIC model can be interpreted within the context of a spatial distribution of soils of varying infiltration capacities. The model assumes that the infiltration capacity of the soil is not spatially uniform, and therefore runoff generation and evapotranspiration vary within an area owing to variations in topography, soil, and vegetation. Details of the VIC model can be found in Liang et al. [1994, 1996] and Abdulla and Lettenmaier, and hence they are not presented here. The VIC model has a total of nine parameters [Abdulla and Lettenmaier, 1997], of which the most commonly calibrated are the infiltration parameter, maximum soil moisture of layer 1, maximum soil moisture of layer 2, maximum baseflow parameter, fraction of the maximum baseflow, and fraction of the maximum soil moisture. In this study, we employed automatic calibration to obtain these model parameters.
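The variable infiltration capacity concept can be illustrated with the direct runoff calculation for the upper soil layer, written here in Python in the form commonly cited in the VIC literature. The function and argument names are hypothetical, and the lumped version used in this study may differ in detail.

```python
def vic_direct_runoff(precip, soil, soil_max, b_inf):
    """Direct runoff from the variable infiltration capacity curve.

    precip   : precipitation for the time step (depth units)
    soil     : current soil moisture of the upper layer (0 <= soil <= soil_max)
    soil_max : maximum soil moisture of the upper layer
    b_inf    : infiltration shape parameter (b_inf > 0)
    """
    # Largest point infiltration capacity within the area.
    i_max = (1.0 + b_inf) * soil_max
    # Infiltration capacity that corresponds to the current soil moisture.
    i_0 = i_max * (1.0 - (1.0 - soil / soil_max) ** (1.0 / (1.0 + b_inf)))
    deficit = soil_max - soil
    if i_0 + precip >= i_max:
        # The whole area saturates: everything beyond the deficit runs off.
        return precip - deficit
    # Only part of the area saturates during the time step.
    return precip - deficit + soil_max * (1.0 - (i_0 + precip) / i_max) ** (1.0 + b_inf)
```

Because the infiltration capacity varies within the area, some runoff is generated before the layer is fully saturated, and the saturated fraction grows as soil moisture and precipitation increase.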
A3. Monthly Water Balance Model Based on Budyko's Framework
 Budyko's analysis of long-term hydroclimatology [Budyko, 1958; Sankarasubramanian and Vogel, 2002] focused on partitioning the incoming moisture (precipitation) and energy (net radiation) into long-term runoff. The framework is based on the concept of water availability (supply) relative to atmospheric demand. Zhang et al. extended the concept to model the various storage zones within the watershed, soil moisture and groundwater, and thereby formulated a parsimonious water balance model. The model has four parameters: (1) the rainfall retention efficiency (values range from 0 to 1), which quantifies the percentage of water retained in the basin without direct runoff; (2) the evapotranspiration efficiency (values between 0 and 1), which denotes the partitioning of water between the atmosphere and groundwater, with large values indicating lesser groundwater recharge; (3) the maximum soil moisture holding capacity between the field capacity and wilting point of the soil; and (4) the recession constant, which quantifies the baseflow contribution. All four parameters are obtained by minimizing the SSE in predicting the observed flows of the Tar River at Tarboro over the calibration period. The Zhang et al. model is used only for generating the true flows and not as a candidate watershed model for multimodel combination, so that we can understand the utility of multimodel combination in improving streamflow predictions even under structural uncertainty (Figure 7).
 This study was partially supported by the National Science Foundation award (Award 0756269) from the Environmental Sustainability Program. The authors thank the two anonymous reviewers and Minxue He, whose valuable comments led to significant improvements in the manuscript. The authors also thank the Associate Editor, Alberto Montanari, for reviewing and handling the manuscript.