A nonparametric kernel regression model for downscaling multisite daily precipitation in the Mahanadi basin



[1] Hydrologic impacts of global climate change are usually assessed by downscaling large-scale climate variables, simulated by general circulation models (GCMs), to local-scale hydrometeorological variables. Conventional multisite statistical downscaling techniques often fail to capture the spatial dependence of rainfall amounts as well as hydrometeorological extremes. To overcome these limitations, a downscaling algorithm is proposed that first simulates the rainfall state of an entire study area or river basin from large-scale climate variables with classification and regression trees, and then projects multisite rainfall amounts using a nonparametric kernel regression estimator conditioned on the estimated rainfall state. Using a common rainfall state for the entire study area as an input for projecting rainfall amounts is found to be advantageous in capturing the cross correlation between rainfall at different downscaling locations. Temporal variability and rainfall extremes are captured in downscaling with multivariate kernel regression. The proposed model is applied to downscale daily monsoon precipitation at eight locations in the Mahanadi River basin of eastern India. The model performance is compared with that of a recently developed conditional random field based model and of established multisite downscaling models, and is found to be superior. Analysis of future rainfall scenarios projected with the developed downscaling model reveals considerable changes in rainfall intensity and in dry and wet spell lengths, among other characteristics, at different locations. An increasing trend of rainfall is projected for the lower (southern) Mahanadi River basin, and a decreasing trend for the upper (northern) Mahanadi River basin.

1. Introduction

[2] Global warming and its associated climate change are expected to have major impacts on ecosystems, agriculture, and human society, all of which are sensitive to changes in precipitation. The tremendous importance of water in both society and nature necessitates an understanding of how any change in global climate could affect regional water availability. The potential effects of climate change on regional hydrology are assessed by comparing future hydrologic scenarios, derived from the simulations of general circulation models (GCMs), with the observed case. GCMs are the most credible tools for simulating the response of the global climate system to increased greenhouse-gas concentrations, and they provide current and future time series of climate variables for the entire globe [Prudhomme et al., 2002; Intergovernmental Panel on Climate Change-Task Group on Scenarios for Climate Impact Assessment, 1999]. Although GCMs capture large-scale circulation patterns and model smoothly varying fields such as surface pressure, they often fail to reproduce nonsmooth fields such as precipitation [Hughes and Guttorp, 1994]. In addition, the spatial scale on which GCMs operate (e.g., 3.75° latitude × 3.75° longitude for the coupled global circulation model (CGCM2)) is too coarse for hydrological modeling purposes [Prudhomme et al., 2003]. GCMs also have limited skill in resolving subgrid-scale features such as convection and topography [Xu, 1999]. Hence, while the impact of greenhouse gases on large-scale atmospheric circulation is well understood, regional changes in the hydrological cycle are far more uncertain in GCM simulations. Downscaling is therefore necessary to model regional-scale climatic/hydrologic variables such as evapotranspiration, precipitation, soil moisture, and river runoff at a smaller scale, based on the large-scale GCM outputs.

[3] Downscaling techniques can broadly be classified into dynamic and statistical downscaling techniques. While dynamic downscaling involves nesting a high-resolution regional climate model within the coarser grid of the GCM [Jones et al., 1995], statistical downscaling techniques construct a parametric/nonparametric and/or linear/nonlinear relationship between large-scale atmospheric predictor variables and the regional climate variable(s) of interest (predictand) [Wilby et al., 1998] in order to simulate future climatological/hydrological scenarios. Statistical downscaling is primarily based on the view that regional climate is conditioned by the large-scale climate pattern and by regional/local physiographic features such as topography, land use, land cover, distance to coast, and land-sea distribution [von Storch, 1995, 1999]. The underlying concept is that there exists a link between large-scale climate phenomena and local climatic/meteorological conditions. Statistical downscaling techniques are popular because they are computationally simple and can easily be modified for application to new regions of study. Their wide application is also attributed to the small number of climate predictors needed to forecast the regional climate variables of interest. However, a major presumption involved in statistical downscaling is that the statistical relationship, once developed between the large-scale predictor field(s) and the local-scale variable, is not altered by climate change [Easterling, 1999]. Thorough reviews of downscaling concepts, prospects, and limitations can be found in the works of von Storch [1995], Hewitson and Crane [1996], Wilby and Wigley [1997], Gyalistras et al. [1998], Murphy [1999], Haylock et al. [2006], Hanssen-Bauer et al. [2005], and Christensen et al. [2007].

[4] Statistical downscaling techniques developed so far can principally be grouped into three categories, namely, weather classification/weather typing [e.g., Hay et al., 1991; Bardossy and Plate, 1992; Corte-Real et al., 1995; Conway and Jones, 1998; Schnur and Lettenmaier, 1998], regression/transfer function [Murphy, 1999; von Storch et al., 1993; Crane and Hewitson, 1998; Bardossy et al., 1995], and weather generators [Hughes et al., 1993; Hughes and Guttorp, 1994; Wilks, 1999; Khalili et al., 2009]. The weather typing and transfer function based approaches, generally known as perfect prognosis (PP) downscaling, establish a relationship between observed large-scale predictors and observed local-scale predictands. Application of these relationships in this context is justified if the predictors from GCMs are realistically simulated [Kalnay, 2003; Wilks, 2006]. Selection of large-scale predictors [Huth, 1996, 1999; Wilby and Wigley, 1997; Wilby et al., 1998; Charles et al., 1999; Timbal et al., 2008] and development of statistical model that establishes link between large-scale climate predictors and local-scale predictands are the main constituents of PP downscaling techniques. Use of high-dimensional correlated predictor field in the context of statistical downscaling may sometimes lead to overfitting or ignoring valuable information. Hence, methods such as principal component analysis (PCA) [Preisendorfer, 1988; Hannachi et al., 2007], canonical correlation analysis [Huth, 1999; von Storch and Zwiers, 1999; Widmann, 2005; Tippett et al., 2008], and physically/meteorologically motivated transformation techniques [Jones et al., 1993; Michelangeli et al., 1995; Wilby and Wigley, 2000; Stephenson et al., 2004; Philipp et al., 2007] are employed to reduce dimensionality of predictors field. 
Some of the widely used statistical downscaling models are based on the linear regression [Karl et al., 1990], the generalized linear model (GLM) [Dobson, 2001], the generalized additive model [Hastie and Tibshirani, 1990; Vrac et al., 2007b], and vector GLM [Yee and Wild, 1996; Yee and Stephenson, 2007; Maraun et al., 2010a]. The past decade witnessed a growth of objective weather typing based techniques using clustering and classification algorithms [Plaut and Simonnet, 2001; Bárdossy et al., 2005; Casola and Wallace, 2007; Wehrens and Buydens, 2007; Leloup et al., 2008; Vrac et al., 2007a; Rust et al., 2010]. Local-scale precipitation in this case is downscaled using a linear model conditioned on weather types [Maraun et al., 2010b]. Weather generators, on the contrary, are statistical models that generate random sequences of weather variables, which preserves statistical properties of observed weather [Richardson, 1981; Richardson and Wright, 1984; Wilks, 1998; Allcroft and Glasbey, 2003; Mason, 2004; Kilsby et al., 2007].

[5] Although progress has been made in the development of statistical downscaling techniques, especially for simulation of rainfall, challenges still exist in representing realistic levels of interannual variability in the generated sequences [Katz and Zheng, 1999; Wilby et al., 2004], generating multisite sequences with realistic spatial dependence, representing accurately the extreme behavior, and simulating complex dynamical structures within a relatively cheap computational framework [Wheater et al., 2005]. Perhaps the biggest challenge is the representation of spatial dependence in rainfall occurrence, particularly for a larger region [Yang et al., 2005]. Weather classification methods have limited success in reproducing the persistence characteristics of at-site wet spell (WS) and dry spell (DS) [Wilby, 1994]. Weather state based models such as nonhomogeneous hidden Markov models (NHMMs) [Hughes and Guttorp, 1994; Hughes et al., 1999; Charles et al., 1999, 2004; Vrac and Naveau, 2007; Vrac et al., 2007c] and nonparametric NHMMs [Mehrotra and Sharma, 2005] overcome the poor performances of weather classification methods, in simulating spatial variability of daily precipitation, by identifying distinct patterns in the multisite daily precipitation and also by capturing the temporal variability through persistence in the weather states. Mehrotra and Sharma [2007] developed a semiparametric model, which uses a two-state first-order Markov model for multistation rainfall occurrence and a kernel density estimator for generation of rainfall amounts conditioned on the occurrence of rainfall. Weather state-based generative models such as hidden Markov models (HMMs) evaluate the joint distribution by making use of the independence assumption that each hidden state depends only on its immediate predecessor and that each observation variable depends only on the current state [Raje and Mujumdar, 2009]. 
A recently developed conditional random field (CRF) downscaling model, as reported by Raje and Mujumdar [2009], does not require assumptions about the independence of atmospheric variables or their distribution, unlike HMM models. This property enables CRF models to utilize the entire sequence of observations for predicting output. However, a major disadvantage of CRF-based models, as reported by Raje and Mujumdar [2009], is the requirement of a large number of parameters to maintain the spatial and temporal structures and the requirement of intensive computational capabilities. Other disadvantages include discretization of precipitation into classes, which amounts to loss of information, and subjectivity in the selection of feature functions. Furthermore, the CRF model is observed [Raje and Mujumdar, 2009] to fail in capturing spatial correlations, and it also fails to simulate mean conditions, resulting in larger deviations from the observed mean. This may be due to the heuristic fixation of the number of rainfall classes used in the CRF model, without confirming the exact number of rainfall classes required, for the model based on any cluster validity tests. Other methods used are Bayesian hierarchical models [Cooley et al., 2007] and analog method in a weather generator context [Orlowsky et al., 2008]. Nonparametric statistical downscaling techniques such as kernel density estimators [Lall et al., 1996; Harrold et al., 2003a, 2003b] or k-nearest neighbors (KNNs) [Lall and Sharma, 1996; Rajagopalan and Lall, 1999; Harrold et al., 2003a; Yates et al., 2003; Mehrotra et al., 2004; Mehrotra and Sharma, 2006] are widely used for daily precipitation at multisites that belong to a subset of weather generators. A comprehensive review of downscaling techniques with a focus on recent developments in statistical downscaling, model output statistics, weather generators, and evaluation techniques to assess downscaling skill can be found in the work of Maraun et al. [2010b].

[6] In conclusion, a majority of relevant studies based on models such as the space-time model [Bardossy and Plate, 1992; Bogardi et al., 1993], NHMMs [Hughes and Guttorp, 1994; Hughes et al., 1999; Bellone et al., 2000] and the CRF-based model [Raje and Mujumdar, 2009], developed for forecasting multisite daily precipitation, are reported to be modestly successful in simulating the spatial dependence of observed precipitation series. Also, statistical downscaling approaches based on the Markov model or its variant for rainfall occurrence and simple/more complex probabilistic models for rainfall amounts can partially explain the unexplained variance associated with day-to-day variance in the rainfall [Katz and Parlange, 1998; Wilks, 1999]. Markov-based models of daily rainfall cannot effectively reproduce the variability of a nonstationary climate, as these models do not consider exogenous climate predictors. Some researchers have allowed variations in the stochastic model parameters by conditioning on a covariate containing atmospheric signals [Hughes and Guttorp, 1994; Hughes et al., 1999; Mehrotra et al., 2004; Katz and Parlange, 1993; Katz and Zheng, 1999; Wilks, 1989]. A majority of researchers still use simple/complex probabilistic models for rainfall amounts. Recently, Mehrotra and Sharma [2010] adopted a variant of a probabilistic model to generate rainfall amounts using a nonparametric kernel density simulator conditional on previous time step rainfall and selected exogenous atmospheric variables. This prompted us to develop a new approach that explicitly uses exogenous climate predictors for simulations of multisite rainfall amounts.

[7] In our study, we propose to model the rainfall state not at the station level but at the river basin level, using classification trees, as rainfall occurrence is largely controlled by global circulation. The rainfall states derived are analogous to the weather states reported in the work of Corte-Real et al. [1999]. We also propose a nonparametric kernel regression estimator to downscale multisite daily rainfall amounts conditional on the derived rainfall state. One of the major challenges in multisite downscaling is the modeling of spatial dependence [Yang et al., 2005]. In the present work, this is addressed with a novel approach in which a rainfall state of the region (where multisite downscaling is performed) is first obtained with a classification and regression tree (CART) from the large-scale atmospheric circulation pattern; this state is hypothesized to represent the spatial pattern of rainfall in that region. Conditional on this rainfall state, kernel regression is performed at individual locations with rainfall as the predictand and large-scale climate variables as predictors. The novelty of this method is the inclusion of the river basin-scale rainfall state in the simulations, which represents the spatial pattern, so that individual simulations of rainfall occurrence at each site are not required. The proposed multivariate kernel density function intrinsically holds the at-site rainfall occurrence information, and hence, occurrence need not be modeled explicitly. To demonstrate the applicability of the present model, the results are compared with those obtained with a recently developed model based on CRF [Raje and Mujumdar, 2009], a widely used stochastic weather generator developed by Wilks [1999], and a KNN-based model.

[8] Hence, the purpose of the present paper is twofold. First, we demonstrate that the proposed downscaling technique is adequate to capture the spatial dependence in rainfall field sequence. We then aim to construct time series of multisite daily rainfall by downscaling outputs of CGCM3.1 for various emission scenarios in an Indian river basin. Major emphasis is given to the methodology for statistical downscaling in this work. The results of the study will be used to investigate climate change impacts on hydrological regimes in the river basin.

[9] An overview of the proposed statistical downscaling technique for downscaling precipitation from a GCM-projected circulation pattern is presented in Figure 1. The proposed technique predicts multisite rainfall in two stages: first by predicting the rainfall state of the study area using large-scale atmospheric variables and subsequently, by forecasting multisite rainfall amounts with the help of a multivariate kernel regression estimator conditioned on the rainfall state and large-scale atmospheric variables. Historical rainfall states for the river basin were classified from the observed rainfall field by applying a k-means clustering algorithm. The CART-based model is allowed to build a classification tree between historic large-scale atmospheric circulations coupled with a lag-1 rainfall state of the river basin (predictors) and the historic rainfall state of the river basin (predictand). Rainfall states for future emission scenarios are predicted using the classified tree with the GCM output. Precipitation amounts at multiple grid points are generated with the help of the multivariate kernel regression estimator, conditioned on the current-day rainfall state of the river basin and the current-day principal components of climate variables of the GCM output. The proposed statistical downscaling technique is applied for a case study in the Mahanadi River basin to generate multisite rainfall sequences for plausible emission scenarios.

Figure 1. Overview of proposed downscaling technique. (a) CART formulation for prediction of rainfall state. (b) Multivariate kernel regression model for rainfall amounts.
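The first stage outlined above (k-means classification of the observed rainfall field into basin-wide states, followed by assembly of the CART inputs with a lag-1 state) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the rainfall field and circulation predictors are synthetic, the number of states (three) is arbitrary, and the function name `kmeans_states` is hypothetical.

```python
import numpy as np

def kmeans_states(rain, k, n_iter=100, seed=0):
    """Cluster daily multisite rainfall vectors (days x sites) into k basin states."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct randomly chosen days
    centers = rain[rng.choice(len(rain), size=k, replace=False)].astype(float)

    def nearest(c):
        # Squared Euclidean distance of every day to every centroid
        d2 = ((rain[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = nearest(centers)
        new_centers = np.array([
            rain[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return nearest(centers), centers

rng = np.random.default_rng(42)
n_days, n_sites = 300, 8
# Synthetic rainfall field in mm/day (assumption, not observed data)
rain = rng.gamma(shape=0.8, scale=10.0, size=(n_days, n_sites))
states, centers = kmeans_states(rain, k=3)

# Synthetic stand-in for the principal components of circulation predictors
pcs = rng.normal(size=(n_days, 5))
# CART inputs for day t: predictors of day t plus the basin state of day t-1
X_cart = np.column_stack([pcs[1:], states[:-1]])
y_cart = states[1:]
```

The arrays `X_cart` and `y_cart` would then be passed to any classification tree learner; the first day is dropped because it has no lag-1 state.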

2. Study Area and Data Used

[10] The Mahanadi River is a major peninsular river of India, flowing from west to east and joining the Bay of Bengal. It drains an area of 141,589 km² and has a length of 851 km from its origin. The Mahanadi basin lies northeast of the Deccan plateau between latitudes 19°21′N and 23°35′N and longitudes 80°30′E and 87°00′E. The location and basin map of the Mahanadi River is given in Figure 2. The Mahanadi River splits into at least six major distributaries and numerous smaller channels before meeting the Bay of Bengal. The delta region through which these distributaries flow is flat, extremely fertile, and densely populated (400–450 people/km²).

Figure 2. Mahanadi River basin with downscaling locations.

[11] Gridded daily rainfall data (1° latitude × 1° longitude) developed by Rajeevan et al. [2006] for the entire Indian region have been obtained from the India Meteorological Department (IMD). Rajeevan et al. [2006] considered daily rainfall data recorded at 1803 rain gauges across India, which had a minimum of 90% data availability during the observation period (1951–2003), for interpolation in order to minimize the risk of generating temporal inhomogeneities in the gridded data due to varying station densities. Irregularly spaced daily rainfall data were interpolated by Rajeevan et al. [2006] to form a regular n-dimensional array with the use of a numerical interpolation scheme proposed by Shepard [1968], which makes use of a suitably defined collection of simple, local interpolation functions that match appropriately at their boundaries. The local interpolation functions were so constructed that the subdomain for each local function is automatically defined, and the function is continuously differentiable everywhere, except at barriers where discontinuities are deliberately specified [Shepard, 1968]. The interpolation scheme takes care of the directional effects and the natural barriers. More details of the interpolation scheme are given in the works of Shepard [1968] and Rajeevan et al. [2006]. In our study, we extracted daily rainfall data (the predictand for the proposed downscaling technique) from the IMD gridded rainfall set, for the monsoon months of June, July, August, and September (JJAS) at eight grid points covering the entire Mahanadi basin (Figure 2).
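The core idea behind the interpolation of Shepard [1968] is inverse-distance weighting, which can be sketched as below. The sketch shows only the basic global form; the directional effects, barriers, and local search used in the scheme described above are deliberately omitted, and the gauge coordinates, rainfall values, and the function name `idw_interpolate` are hypothetical.

```python
import numpy as np

def idw_interpolate(gauge_xy, gauge_rain, grid_xy, power=2.0, eps=1e-12):
    """Inverse-distance-weighted rainfall estimate at each grid point."""
    # Pairwise distances, shape (n_grid, n_gauges)
    d = np.linalg.norm(grid_xy[:, None, :] - gauge_xy[None, :, :], axis=2)
    # Clip tiny distances so that a grid point on a gauge does not divide by zero
    w = 1.0 / np.maximum(d, eps) ** power
    return (w * gauge_rain).sum(axis=1) / w.sum(axis=1)

# Two hypothetical gauges and three target points (coordinates in degrees)
gauge_xy = np.array([[0.0, 0.0], [2.0, 0.0]])
gauge_rain = np.array([10.0, 30.0])  # mm/day
grid_xy = np.array([[1.0, 0.0], [0.0, 0.0], [0.5, 0.0]])
grid_rain = idw_interpolate(gauge_xy, gauge_rain, grid_xy)
```

A point midway between the gauges receives the average of the two values, a point on a gauge reproduces that gauge, and all estimates stay within the range of the gauge values.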

[12] Statistical downscaling techniques are predominantly based on the view that regional climate is largely controlled by the large-scale climate circulation pattern [Bardossy and Plate, 1991; Hughes and Guttorp, 1994; Bardossy et al., 1995; Wetterhall et al., 2005]. However, there is little consensus on the choice of appropriate climate predictors for downscaling. As reported in the literature [Wilby et al., 1999; Wetterhall et al., 2005], the predictors used for downscaling need to be:

[13] (1) reliably simulated by GCMs;

[14] (2) readily available from the archives of GCM outputs; and

[15] (3) strongly correlated with the surface variables of interest (rainfall in the present case).

[16] Sharma [2000a] reported the use of partial mutual information (PMI) criteria to identify seasonwise sets of important atmospheric predictors. However, standard predictor identification using PMI-type measures may not be applicable, as the models are to be used for a future climate, which may be poorly simulated by the GCM. A variable convergence score (VCS) developed by Johnson and Sharma [2009] addresses this problem by ranking variables based on the coefficient of variation of the ensemble. VCS allows a quantitative comparison between different hydroclimatic variables for a future climate. Johnson and Sharma [2009] found that scores for surface variables such as pressure, temperature, and humidity are the highest for the Australian region. It is also possible that some standard predictor identification methods exclude, on the basis of current climate performance, predictors that could be important in future changed climates. This led us to adopt the conventional method suggested by Wilby et al. [1999], following the strategy below, to identify climate predictors. Surface air temperature, which may explain the soil moisture/precipitation feedback and which also accounts for the observed terrestrial surface air temperature/precipitation covariability [Déry and Wood, 2005], is identified as one of the predictors for this study. Mean sea level pressure is selected, as it forms the basis for many other GCM-derived variables such as surface vorticity, airflow strength, meridional and zonal wind flow components, and divergence [Wilby and Wigley, 2000]. Surface specific humidity is chosen because of the reported significance of this variable to GCM precipitation schemes [Hennessy et al., 1997; Wilby and Wigley, 2000]. Zonal and meridional wind velocity components are selected, as they play a vital role in the transport of moist air.
The predictors thus identified are in tune with the variables identified by VCS as developed by Johnson and Sharma [2009].

[17] The National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis-I data [Kalnay et al., 1996; http://www.cdc.noaa.gov/cdc/reanalysis/reanalysis.shtml] provide global atmospheric information from 1948 to the present as a mixture of physical observations and model forecasts. The data assimilation system used in the NCEP/NCAR reanalysis includes the NCEP global spectral model and a 3-D analysis scheme that incorporates land surface, ship, rawinsonde, satellite, and other data, with a T62 horizontal resolution and 28 vertical sigma levels. Kalnay et al. [1996] classified the global circulation data into three classes. Type A variables, for example, zonal and meridional wind, are the most reliable products and are strongly influenced by available observations. Type B variables are influenced by both the observations and the model. Winds at the lowest sigma level and moisture variables, including specific humidity, are classified as type B. Type C variables, which include precipitation, among other variables, are completely determined by the model. More details on the types of NCEP/NCAR reanalysis variables can be found in Kalnay et al. [1996]. For the purpose of our work, the NCEP/NCAR reanalysis I provides daily reanalysis data on surface air temperature, mean sea level pressure, specific humidity, zonal wind velocity, and meridional wind velocity for a region spanning latitudes 7.5°N–35.0°N and longitudes 70.0°E–97.5°E, which encapsulates the entire study area. Thus, the daily data extracted from 144 grid points (12 × 12 points at 2.5° × 2.5° grid spacing) for the monsoon months of JJAS for a period of 50 years from 1951 to 2000 form the predictors for training (calibration) and validation of the proposed downscaling model.

[18] The output of the T63 version of the third generation coupled global climate model (CGCM3.1), obtained from the Canadian Centre for Climate Modeling and Analysis (CCCMA; http://www.cccma.ec.gc.ca/data/cgcm3/cgcm3.shtml), is used for downscaling precipitation for future emission scenarios. The model has been run for the current climate scenario (20C3M) and for various other emission scenarios such as COMMIT, SRESA1B, SRESA2, and SRESB1. Because the grid spacings of NCEP/NCAR and CGCM3.1 differ, the CGCM3.1 data are interpolated to the NCEP/NCAR grid points.

[19] The rationales behind using data obtained from CGCM3.1 are as follows:

[20] (1) Climate models developed by institutes such as CCCMA, Canada, Institute of Numerical Mathematics, Russia, and Meteorological Research Institute, Japan, used flux correction in order to maintain a stable climate in their control runs [Covey et al., 2003].

[21] (2) Kripalani et al. [2007] showed that the models developed by CCCMA (CGCM3.1), the Max Planck Institute for Meteorology, Germany (ECHAM5), Meteo-France, Centre National de Recherches Meteorologiques, France (CM3), Model for Interdisciplinary Research on Climate (MIROC), Japan (MIROC3.2_HIRES), and the Hadley Centre for Climate Prediction and Research, UK (HadCM3) can simulate well the interannual monsoon variability of India. These models can also simulate the biennial oscillation of monsoon rainfall reasonably well.

[22] (3) CGCM3 model output contains all relevant data necessary for our downscaling approach, at a finer spatial resolution.

[23] Standardization [Wilby et al., 2004] of the predictors is carried out prior to statistical downscaling to reduce systematic biases in the means and variances of the GCM predictors relative to those of the NCEP/NCAR predictors. A large-scale climatic predictor variable is standardized by subtracting its mean and dividing by its standard deviation, both obtained for a predefined baseline period. The mean and standard deviation of a predictor variable are computed using daily data of that variable for the monsoon months of JJAS over the entire baseline period. The baseline period considered for this study is 1951 to 1980. It is presumed that the climatology remained stationary and did not contain any strong global climate change signal during the baseline period. CGCM3.1 predictors for the various emission scenarios are standardized with the mean and standard deviation of the corresponding predictors from the 20C3M experiment for the same baseline period.
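The standardization step can be illustrated with a short sketch. The data below are synthetic stand-ins for the reanalysis and GCM predictors, and the function names `baseline_stats` and `standardize` are hypothetical; only the 1951 to 1980 baseline and the use of 20C3M statistics for the scenario runs are taken from the text.

```python
import numpy as np

def baseline_stats(values, years, start=1951, end=1980):
    """Mean and standard deviation of each predictor over the baseline years."""
    in_base = (years >= start) & (years <= end)
    return values[in_base].mean(axis=0), values[in_base].std(axis=0)

def standardize(values, mean, std):
    """Subtract the baseline mean and divide by the baseline standard deviation."""
    return (values - mean) / std

rng = np.random.default_rng(0)
days_per_season = 122  # June to September
years = np.repeat(np.arange(1951, 2001), days_per_season)

# Synthetic predictors, shape (days, variables); values are assumptions
ncep = rng.normal(loc=300.0, scale=5.0, size=(years.size, 3))
gcm_20c3m = rng.normal(loc=302.0, scale=6.0, size=(years.size, 3))
gcm_future = rng.normal(loc=304.0, scale=6.0, size=(days_per_season * 10, 3))

# Reanalysis predictors are standardized with their own baseline statistics
ncep_z = standardize(ncep, *baseline_stats(ncep, years))
# Scenario runs reuse the 20C3M statistics of the same baseline period
gcm_mean, gcm_std = baseline_stats(gcm_20c3m, years)
gcm_future_z = standardize(gcm_future, gcm_mean, gcm_std)
```

Because the scenario data are scaled with 20C3M baseline statistics rather than their own, any projected shift in the mean is preserved in the standardized predictors.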

[24] Use of high-dimensional correlated data in the downscaling framework is computationally expensive, and the high correlation between the variables may result in multicollinearity. PCA, the most widely used multivariate statistical technique for this purpose, reduces a data set containing a large number of variables to a data set containing fewer new variables that represent a large fraction of the variability contained in the original data. There is no single clear criterion for choosing the number of principal components to retain in a given circumstance. Of the many principal component selection rules investigated by Preisendorfer [1988], we have used Kaiser's rule adopted by Jolliffe [1972] to retain the principal components accounting for more than the average amount of the total variance. In our study, the standardized predictor set containing 720 variables (5 climate variables at 144 grid points) was reduced to a predictor set containing 50 variables without discarding important information carried in the original data.
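The retention rule described above (keep the components whose variance exceeds the average eigenvalue) can be sketched as follows. The predictor matrix is synthetic, with 20 variables driven by three latent factors, and the function name `pca_kaiser` is hypothetical; the sketch does not reproduce the 720-to-50 reduction reported in the text.

```python
import numpy as np

def pca_kaiser(Z):
    """PCA that keeps components explaining more than the average variance."""
    Zc = Z - Z.mean(axis=0)
    cov = np.cov(Zc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > eigvals.mean()         # retention rule
    scores = Zc @ eigvecs[:, keep]          # retained principal components
    return scores, eigvals, keep

rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 20))
# Synthetic standardized predictors: three latent signals plus small noise
Z = latent @ loadings + 0.1 * rng.normal(size=(500, 20))

scores, eigvals, keep = pca_kaiser(Z)
explained = eigvals[keep].sum() / eigvals.sum()
```

The ratio `explained` gives the fraction of total variance carried by the retained components, which is a convenient check that little information is lost.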

3. Methodology

[25] In this section, we introduce a daily rainfall generator called a “nonparametric kernel regression estimator” conditioned on the rainfall state of the river basin. Multivariate vectors and matrices are written in bold, and single variables, parameters, and observations are written in nonbold characters or symbols. The following subsections describe the rainfall occurrence and amount models.

3.1. Occurrence of Daily Rainfall State

[26] The general structure of the basin-wise daily rainfall state occurrence model presented in the study of Kannan and Ghosh [2011] is described here in the context of occurrence of daily rainfall state. In general, the rainfall state occurrence model could be expressed as a classification-type problem, where we attempt to predict values of a categorical dependent variable (rainfall state) from one or more continuous (large-scale circulation) and/or categorical predictor variables.

[27] Modeling the occurrence of daily rainfall state involves construction of a classification tree. The tree-building technique classifies objects or predicts outcomes by selecting, from a large number of variables, those most important in determining the outcome. Thus, classification is a systematic way of predicting the class of an object based on measurements. A classification tree is a directed acyclic graph T with a tree structure. A hypothetical six-class tree is depicted in Figure 3. The root node of the tree does not have any incoming edges, and every other node has exactly one incoming edge and may have zero, two, or more outgoing edges. A node T without outgoing edges is called a leaf node and is labeled with one of the class labels. Each internal node, which has two or more outgoing edges, is associated with one attribute variable XT, called the split attribute. Each edge (T,T′) from an internal node T to one of its children T′ has a predicate, or split selection rule, q(T,T′) associated with it, where q(T,T′) involves only the split attribute XT of node T. The set of predicates QT on the outgoing edges of an internal node T must contain disjoint predicates involving the split attribute whose disjunction is true; that is, for any value of the split attribute, exactly one of the predicates in QT is true. We will refer to the predicates in QT as the splitting predicates of T.

Figure 3. A hypothetical classification tree containing six classes.

[28] Given a decision tree T, one can define the associated classifier in the following recursive manner:

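$$C(\mathbf{x}, T) = \begin{cases} \mathrm{label}(T), & \text{if } T \text{ is a leaf node} \\ C(\mathbf{x}, T'), & \text{if } T \text{ is an internal node and } q_{(T,T')}(\mathbf{x}) \text{ is true} \end{cases} \tag{1}$$
$$DT(\mathbf{x}) = C(\mathbf{x}, \mathrm{root}(T)) \tag{2}$$

where x is the vector of predictor values, C(•) is a classifier, and DT(•) is a decision tree classifier. Thus, to make a prediction, one starts at the root node and navigates the tree along true predicates until a leaf is reached. The class label associated with the leaf node is returned as the result of the prediction. If the tree T is a well-formed decision tree, then the function DT(•) is also well defined.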
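The recursive definition of the classifier translates almost directly into code. The sketch below is illustrative only: the `Node` structure, the predictor names `pc1` and `lag_state`, and the three-leaf tree are hypothetical and are not taken from the fitted model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Node:
    label: Optional[int] = None  # class label, set only on leaf nodes
    # Outgoing edges as (splitting predicate, child node) pairs
    edges: List[Tuple[Callable[[dict], bool], "Node"]] = field(default_factory=list)

def classify(x, node):
    """C(x, T): follow the single true predicate at each node until a leaf."""
    if not node.edges:              # leaf node: return its class label
        return node.label
    for predicate, child in node.edges:
        if predicate(x):
            return classify(x, child)
    raise ValueError("splitting predicates are not exhaustive")

def decision_tree(x, root):
    """DT(x) = C(x, root(T))."""
    return classify(x, root)

# Hypothetical tree: split on a circulation component, then on the lag-1 state
inner = Node(edges=[
    (lambda x: x["lag_state"] == 0, Node(label=1)),
    (lambda x: x["lag_state"] != 0, Node(label=2)),
])
root = Node(edges=[
    (lambda x: x["pc1"] <= 0.0, Node(label=0)),
    (lambda x: x["pc1"] > 0.0, inner),
])

state_a = decision_tree({"pc1": -1.0, "lag_state": 2}, root)
state_b = decision_tree({"pc1": 0.5, "lag_state": 0}, root)
state_c = decision_tree({"pc1": 0.5, "lag_state": 1}, root)
```

Each pair of predicates is disjoint and exhaustive, so exactly one edge is followed at every internal node and the classifier is well defined.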

[29] Given a data set {ω1, ω2, …, ωN}, where ωi are independent identically distributed random samples from a probability distribution P over the set of events Ω, an optimal classification tree T is constructed by minimizing the misclassification rate. A classification tree is usually constructed in two phases [Breiman et al., 1984]. In the tree growth phase, an overly large classification tree is constructed from the training data. In the tree pruning phase, the final size of the tree T is determined with the goal of minimizing an approximation of the misclassification rate. A detailed discussion of tree building, split selection, and pruning methods can be found in the work of Breiman et al. [1984].

3.2. Multisite Rainfall Amounts

[30] In this study, generation of multisite rainfall is achieved by modeling the relationship between the large-scale global circulation climate variables (predictors) and the regional-scale precipitation.

display math(3)

where Rt is the rainfall at an individual grid point at time t, Xt is the vector of climate predictors at time t, and St is the rainfall state of the river basin at time t. The functional relationship in equation (3) can generally be modeled using resampling, regression (parametric/nonparametric), or a probabilistic model [e.g., Hay et al., 1991; Corte-Real et al., 1999; Wilks, 1999; Mehrotra and Sharma, 2007]. A nonparametric kernel regression is employed in our study to model the multisite rainfall. The predictors used in the kernel regression estimator are the current-day principal components of NCEP/NCAR climate predictors and the current-day rainfall state of the river basin, and the predictand is the current-day rainfall at multiple sites (computed individually). In the proposed technique, the statistical relationship between the predictors and the predictand is conditioned on the rainfall state of the river basin, which distinguishes the developed model from existing models; the rainfall state information helps to preserve the spatial correlation of the rainfall field.

3.2.1. Kernel Regression Estimator

[31] Any regression analysis deals with the relationship of how the dependent or predictand variable (Y; daily rainfall at a site) can be explained by the independent or predictor variable (X; large-scale global climate variables). Theoretically, it is not necessary to apply any restriction on the associated relationship function. But in practice, the relationship can be formalized as

display math(4)
display math(5)

[32] It is implied that the relationship does not need to hold exactly for the ith observation but is “disturbed” by the random variable εi. In a typical linear regression problem, the expected value of a dependent variable Y is related to a set of explanatory variables X1, X2, …, Xd in the following way:

display math(6)

with the prior assumption that m(•) has smooth functional form, and the parameters are estimated by the method of ordinary least squares. Nonparametric smoothing methods are designed to simultaneously estimate and model the underlying structure in the data [Härdle et al., 2004], for extracting structural elements of variable complexity from patterns of random variations without assuming the functional form of estimators. This involves high-dimensional objects like density functions, regression surfaces, or conditional quantiles. These objects are difficult to estimate for data sets with mixed, high-dimensional, and partially unobservable variables. Nonparametric regression estimators produce an estimate of m(•) at an arbitrary point by averaging over the neighboring region of the observed values of the dependent variable (predictand). The exact method of weighting is determined by a weight function, which assigns heavy weights to nearby observations and zero weights to far away observations. Multivariate kernel regression [Silverman, 1986; Härdle, 1990; Scott, 1992; Wand and Jones, 1995] is such a method, belonging to the class of nonparametric smoothing techniques, which uses a weighted sum of the observed responses with the use of kernel density functions for weights. The general form of conditional expectation function inline image for the multivariate distribution of the variables Y and X is given as follows:

display math(7)

where m(X) is the conditional expectation function, inline image is the large-scale climate predictors, and Y is the predictand, say, rainfall at a station, f(y|x) is the conditional probability density function (PDF) of Y given X = x, and fX(x) is the marginal PDF of X. The multivariate generalization of inline image (Nadaraya-Watson estimator) [Nadaraya, 1964; Watson, 1964] can be obtained by replacing the multivariate density f(y,x) and fX(x) by their kernel density estimates inline image and inline image, respectively, and

display math(8)
display math(9)
display math(10)

where K(•) is the kernel function of the predictand Y, κ(•) is the kernel function of the predictor X, n is the number of observations, Xi is the ith observation, and h or H is the smoothing parameter (scalar or matrix, respectively), known as the bandwidth in the context of kernel density estimation. Commonly available kernel functions are the uniform, triangular, biweight, triweight, Epanechnikov, and Gaussian kernels. The use of a uniform kernel results in kernel density estimates with discontinuities, which is undesirable. The Epanechnikov kernel is optimal in the sense of minimizing the mean integrated square error. Here, a Gaussian kernel function is used as the smoothing kernel, owing to its convenient mathematical properties. Depending on the choice of the kernel, inline image is estimated as the weighted (local) average of those Yi for which Xi lies in a ball or cube around x.
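The estimator described above, a kernel-weighted local average of the observed responses, can be sketched as follows. This is a minimal illustration, assuming a Gaussian product kernel and a diagonal bandwidth matrix supplied as one bandwidth per predictor dimension; the function names are hypothetical and the sketch is not the authors' implementation.

```python
import math
from typing import Sequence


def gaussian_product_kernel(u: Sequence[float]) -> float:
    """Product of univariate standard normal densities evaluated at u."""
    return math.prod(math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi) for v in u)


def nadaraya_watson(x, X, Y, h):
    """Kernel-weighted average of the responses Y at the query point x.

    x: query predictor vector (length d)
    X: list of n observed predictor vectors
    Y: list of n observed responses (e.g., daily rainfall at one site)
    h: per-dimension bandwidths (diagonal bandwidth matrix, an assumption)
    """
    weights = [
        gaussian_product_kernel([(xj - xij) / hj for xj, xij, hj in zip(x, xi, h)])
        for xi in X
    ]
    total = sum(weights)
    if total == 0.0:  # every kernel weight underflowed to zero
        raise ValueError("no observation carries weight at x; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, Y)) / total
```

In a state-conditioned setting such as the one described in this paper, `X` and `Y` would be restricted to the training days that share the rainfall state of the day being predicted.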

3.2.2. Bandwidth Selection

[33] The selection of bandwidth is an important step in kernel regression. A change in bandwidth may dramatically change the shape of the kernel estimate [Efromovich, 1999], which in turn affects the kernel regression altogether. In 1-D or 2-D cases, it is easy to choose an appropriate bandwidth, with the help of the plot of density estimates for different bandwidths. A generalized method for deriving rule-of-thumb bandwidths from asymptotic mean integrated square error (AMISE) for the multivariate kernel density estimator, reported in the work of Scott [1992], and Wand and Jones [1995], is used in this study.

display math(11)

where Rf(x) is the Hessian matrix of second partial derivatives of the kernel density estimate function f, and inline image is the d-dimensional squared L2 norm of κ. Wand and Jones [1995] further simplified the AMISE expression by adopting the following form:

display math(12)

to derive rule-of-thumb formulae for different assumptions of H and Σ. Considering H and Σ to be diagonal matrices while minimizing AMISE leads to

display math(13)

[34] The method suggested here to derive a rule of thumb for estimating the bandwidths is equivalent to applying Mahalanobis transformation to the data to transform the estimated covariance matrix into an identity matrix for computing the kernel estimate [Scott, 1992] and finally retransforming the estimated PDF back to the original scale.
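As an illustration of a rule-of-thumb bandwidth with diagonal H, the sketch below uses the widely cited Gaussian-reference form h_j = (4 / ((d + 2) n))^(1/(d + 4)) σ_j, where σ_j is the sample standard deviation of the jth predictor. This specific form is assumed here for illustration and may differ in detail from equation (13); the function name is hypothetical.

```python
import statistics


def rule_of_thumb_bandwidths(X):
    """Per-dimension bandwidths from the Gaussian-reference rule of thumb.

    X: list of n predictor vectors, each of length d (n >= 2).
    Assumes a Gaussian product kernel and a diagonal bandwidth matrix.
    """
    n = len(X)
    d = len(X[0])
    factor = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    # zip(*X) iterates over predictor dimensions (columns)
    return [factor * statistics.stdev(column) for column in zip(*X)]
```

Because each bandwidth scales with the standard deviation of its own predictor, rescaling a predictor rescales its bandwidth by the same factor, which is consistent with the standardization interpretation given above.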

3.2.3. Stochastic Simulation of Rainfall Amounts

[35] Stochastic simulation of rainfall can be achieved by perturbing the expected mean rainfall estimates ( inline image) obtained from the kernel regression model (hereinafter referred to as KR), using a nonparametric simulation procedure proposed by Sharma et al. [1997] and subsequently revised by Sharif and Burn [2006].

display math(14)
display math(15)
display math(16)

where inline image is the perturbed rainfall value for the ith day and jth station, xi is the climate predictor for the ith day, and σ2 is the covariance of the climate predictors in the kernel of interest. H is the bandwidth of the kernel under consideration, d is the dimension of the predictor vector xi, n is the number of samples in the kernel of interest, and Wi is a standard normal variate, N(0,1). However, there is a possibility of obtaining a negative estimate of Yi. This stems from the fact that the Gaussian kernels used in the kernel density estimate have infinite support and lead to leakage of probability across boundaries. To overcome this problem, the value of λ is transformed if the probability of generating a negative value is too large [Sharif and Burn, 2006]. A threshold probability, say α = 0.06, is selected, following Sharma and O'Neill [2002], for which the corresponding standard normal quantile is −1.55. Hence, the largest value of λ for which the probability of generating a negative value is exactly α is given by

display math(17)

where inline image is the conditional standard deviation of the predictand computed from KNNs [Sharma et al., 1997; Sharma and O'Neill, 2002; Sharif and Burn, 2006]. The leakage of probability can be addressed by checking at each step whether the simulated rainfall amounts are positive. Whenever a negative amount is encountered, a new sample can be generated from the same kernel slice by generating a new random number Wi until a positive value of Yi is obtained [Sharma et al., 1997; Sharma, 2000b]. The boundary normalization procedure leads to a small bias in the simulated density in the neighborhood of the boundary, which is negligible compared to the total realizations [Sharma et al., 1997]. Thus, the present downscaling scheme not only allows proper representation of temporal dependence attributes in the simulated rainfall series but also provides alternative rainfall realizations that are stochastically similar to the historical record.
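The two safeguards against negative rainfall described above can be sketched together. The sketch assumes an additive perturbation of the form Y = m + λσW with W ~ N(0,1), which is a simplification adopted here for illustration and not a transcription of equations (14)-(17). Under that assumption, the probability of a negative draw equals α = 0.06 when λ = m / (1.55σ), so λ is capped at that value and any remaining negative draw is rejected and redrawn. All function and parameter names are hypothetical.

```python
import random

Z_ALPHA = 1.55  # magnitude of the standard normal quantile at alpha = 0.06


def capped_lambda(lam, mean, sigma):
    """Shrink lam so that P(mean + lam*sigma*W < 0) does not exceed alpha."""
    if sigma <= 0.0:
        return lam
    return min(lam, mean / (Z_ALPHA * sigma))


def perturb(mean, sigma, lam, rng, max_tries=1000):
    """Draw mean + lam*sigma*W, redrawing W until the amount is positive.

    mean: kernel regression estimate of rainfall (must be positive)
    sigma: conditional standard deviation of the predictand
    lam: perturbation scale before capping
    rng: a random.Random instance, so that runs are reproducible
    """
    if mean <= 0.0:
        raise ValueError("mean must be positive")
    lam = capped_lambda(lam, mean, sigma)
    for _ in range(max_tries):
        y = mean + lam * sigma * rng.gauss(0.0, 1.0)
        if y > 0.0:
            return y
    raise RuntimeError("no positive draw obtained")
```

With the cap in place, a redraw is needed in at most about 6% of the draws, so the rejection loop terminates quickly in practice.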

3.3. Modeling Spatial Dependence in Rainfall

[36] The rainfall occurrence model outlined earlier generates rainfall occurrence in the form of a common rainfall state of the river basin. The rainfall occurrence model uses the previous-day rainfall state as one of the predictors to carry daily or short-term persistence, while the other continuous predictor variables account for the long-term persistence. Spatial dependence in the rainfall field is maintained by the combined use of the common rainfall state of the river basin and a data-driven KR approach in the present statistical downscaling scheme. Simulations based on this approach retain the marginal and joint density structure of the historical time series of the rainfall field, including nonlinearities and state dependence, thus deviating from the conventional practice of explicitly generating random variates to induce spatial dependence. The KR-based approach not only allows proper representation of temporal dependence attributes in the simulated rainfall, but it also provides alternative rainfall realizations that are stochastically similar to the historical record. The perturbation in our approach serves to smooth over the gaps between data points in the regression estimate. The proposed approach explicitly considers the effect of large-scale circulation conditioned on the rainfall state, thus providing a more logical and refined way of incorporating temporal and spatial dependence.

3.4. Model Application

[37] A classification tree is constructed to predict daily rainfall state of the river basin with training data consisting of input objects/predictors (current-day principal components of NCEP/NCAR climate predictors and previous day(s) rainfall state(s)), and the desired output object/predictand (current-day rainfall state of the river basin). Prior to building the classification tree, an unsupervised k-means clustering technique [McQueen, 1967] is applied to the IMD gridded rainfall data for a period of 30 years from 1951 to 1980, to identify historical daily rainfall states of the river basin. The optimum number of clusters or rainfall states is determined from cluster validity tests such as Dunn's [1973] index, Davies-Bouldin index [Davies and Bouldin, 1979], and Silhouette index [Rousseeuw, 1987]. More details on k-means clustering and cluster validity tests can be found in the work of Kannan and Ghosh [2011]. The optimum number of clusters is found to be three, and the optimal clusters/rainfall states are named as “almost dry,” “medium,” and “high,” on the basis of rainfall amounts represented by the cluster centroids. It is emphasized that the “almost dry” rainfall state indicates the occurrence of zero or marginal rainfall in the study area. Standardized and dimensionally reduced NCEP/NCAR predictor data and concurrent rainfall states for a period of 30 years from 1951 to 1980 are used as a training data set for construction of the classification tree, and the remaining data for a period of 20 years from 1981 to 2000 are used for validation of the CART model. Skill measures such as the success rate of model prediction, Heidke skill score, and inline image goodness of fit test statistic, as suggested by Wilks [2006], are applied for identification of the best classification tree. Thus, any future-day rainfall state of the basin is predicted with the best classification tree constructed using NCEP/NCAR predictor data along with a lag-1 rainfall state. 
More details of the CART procedure for modeling the occurrence of basin-wise rainfall state and the results obtained therein are reported in the work of Kannan and Ghosh [2011].
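Of the skill measures mentioned above, the Heidke skill score has a compact multicategory form: the proportion of correct predictions is compared with the proportion expected by chance from the marginal frequencies. The following sketch assumes a square contingency table of predicted versus observed rainfall states; the function name and table layout are illustrative assumptions.

```python
def heidke_skill_score(table):
    """Multicategory Heidke skill score.

    table[i][j] = number of days predicted in state i and observed in state j.
    Returns 1 for perfect predictions and 0 for predictions no better than chance.
    Undefined (division by zero) if all days fall in a single category.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    hit_rate = sum(table[i][i] for i in range(k)) / n
    # chance agreement: product of predicted and observed marginal frequencies
    expected = sum(
        (sum(table[i]) / n) * (sum(row[i] for row in table) / n)
        for i in range(k)
    )
    return (hit_rate - expected) / (1.0 - expected)
```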

[38] The KR model uses feature vector spaces for the generation of daily rainfall. The feature vector spaces are constructed with standardized and dimensionally reduced NCEP/NCAR reanalysis daily climate data and concurrent daily precipitation for a period of 30 years from 1951 to 1980. The feature vector spaces are classified on the basis of the identified daily rainfall states. We have validated the KR model with the remaining standardized and dimensionally reduced NCEP/NCAR predictors for a period of 20 years from 1981 to 2000, conditioned on the rainfall state, to predict daily rainfall during the monsoon months of JJAS. We extend the KR model to generate stochastic rainfall, following Sharma et al. [1997] and Sharma [2000b]. Leakage of probability in the stochastic model, caused by the adoption of a Gaussian kernel function, is arrested, following Sharif and Burn [2006]. We have also simulated the KR model with modified feature vector spaces containing logarithmically transformed precipitation data, termed the kernel regression with logarithmic transformed predictand (KRLT) model, to check the leakage of probability at the kernel boundaries [Lall et al., 1996]. Simulation results of kernel regression without conditioning on rainfall states (KRWS) are also compared with those of the present modeling approach in order to have a complete understanding of the kernel regression technique adopted in this study. This paper also compares the validation results of the KR model with those of several well-known models, namely the stochastic weather generation scheme adopted by Wilks [1999], the KNN approach, and a model based on CRF theory developed by Raje and Mujumdar [2009], in order to show the strength of the present downscaling technique. Short descriptions of these models are given below for the sake of continuity.

[39] Wilks' [1999] method (WM), based on a stochastic weather generation model, uses the two-state first-order Markov chain dependent process for daily precipitation occurrence. The transition probabilities p01, the probability of a wet day following a dry day, and p11, the probability of a wet day following a wet day, can then be used to simulate a sequence of wet and dry days by generating a uniform [0,1] variate for each day and comparing it with the appropriate transition probability [Wilks, 1999]. On simulating a wet day, the precipitation amount corresponding to that day is generated either by using gamma distribution or by the mixed exponential distribution. More details of this method can be obtained from the work of Wilks [1999].
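The occurrence and amount steps described above can be sketched for a single site. The sketch covers only the two-state first-order Markov chain and gamma-distributed wet-day amounts; the multisite correlation structure of Wilks [1999] is omitted, and all function names and parameter values are hypothetical.

```python
import random


def simulate_occurrence(n_days, p01, p11, rng, first_wet=False):
    """Two-state first-order Markov chain; returns a list of booleans (True = wet).

    p01: probability of a wet day following a dry day
    p11: probability of a wet day following a wet day
    first_wet: state assumed for the day before the simulation starts
    """
    wet = first_wet
    sequence = []
    for _ in range(n_days):
        p_wet = p11 if wet else p01      # transition probability given yesterday
        wet = rng.random() < p_wet       # compare a uniform [0, 1) variate with it
        sequence.append(wet)
    return sequence


def simulate_amounts(sequence, shape, scale, rng):
    """Gamma-distributed rainfall amount on wet days, zero on dry days."""
    return [rng.gammavariate(shape, scale) if wet else 0.0 for wet in sequence]


rng = random.Random(42)
days = simulate_occurrence(122, p01=0.3, p11=0.7, rng=rng)
rain = simulate_amounts(days, shape=0.8, scale=12.0, rng=rng)
```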

[40] The KNN estimator, on the other hand, can be viewed as a kernel estimator with the uniform kernel K(u) = (1/2)I(|u| ≤ 1) and a variable bandwidth equal to the distance between x and its furthest KNN. Otherwise, the form of equation (10) applies in this case as well. More details on bias, variance, and the number of nearest neighbors (k) can be found in the work of Härdle [1990].
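With a uniform kernel and a bandwidth equal to the distance to the kth nearest neighbor, all k neighbors receive equal weight, so the estimate reduces to their simple average (ignoring ties in distance). A minimal sketch, with a hypothetical function name and Euclidean distance assumed:

```python
import math


def knn_regression(x, X, Y, k):
    """Average the responses of the k observations nearest to x.

    x: query predictor vector
    X: list of observed predictor vectors
    Y: list of observed responses
    k: number of nearest neighbors (1 <= k <= len(X))
    """
    order = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))
    nearest = order[:k]
    return sum(Y[i] for i in nearest) / k
```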

[41] Raje and Mujumdar [2009] developed a probabilistic downscaling model that considers the daily precipitation sequence as a CRF. The conditional distribution of at-site precipitation sequence was modeled as a linear-chain CRF given the large-scale atmospheric information. Parameters of the CRF model, as reported by Raje and Mujumdar [2009], can be obtained using maximum likelihood parameter estimation technique using limited memory Broyden-Fletcher-Goldfarb-Shanno optimization, and the most likely precipitation sequence for a given set of large-scale atmospheric variables can be determined by the Viterbi algorithm with the help of maximum a posteriori estimation. More details on the modeling aspects of CRF can be obtained from Raje and Mujumdar [2009]. A list of models investigated in this study with their acronym is given in Table 1.

Table 1. Investigated Models With a Short Description
Sl. No.  Acronym  Description
1        KRWS     Kernel regression without conditioning on weather states
2        KR       Kernel regression conditioned on weather states
3        KRLT     Kernel regression conditioned on weather states with log-transformed predictands
4        KNN      k-nearest neighbor kernel regression conditioned on weather states
5        WM       Weather generator based on Wilks' method

4. Results and Discussion

[42] We generated 100 independent realizations of daily rainfall for the baseline period during model calibration (using reanalysis data) and model evaluation (using GCM 20C3M outputs) phases. The statistics obtained from these realizations are ranked to obtain the best estimate (50th percentile). The 5th and 95th percentile values obtained from these realizations are also extracted to form the confidence band around the median estimate. Results of various downscaling approaches such as KRWS, KR, KRLT, KNN, and WM are evaluated for spatial and temporal variations of rainfall. In addition, WS/DS length probabilities and cumulative distribution functions (CDFs) of the expected mean rainfall series at a few selected sites are presented to ascertain the ability of the downscaling approach under consideration. The present downscaling technique is applied to simulate rainfall for the near-future (2026–2050) and far-future (2076–2100) periods using the standardized and dimensionally reduced predictors for A1B, A2, and B1 emission scenarios of CGCM3.1/T63 runs, and the results are compared with those for the COMMIT scenario of the same GCM.
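The ranking of a statistic across realizations into a median estimate and a 5th to 95th percentile band can be sketched as follows. The linear-interpolation convention for percentiles is an assumption made for this illustration, and the function names are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a list, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summary_band(realizations):
    """Return the (5th, 50th, 95th) percentiles of a statistic across realizations."""
    return tuple(percentile(realizations, q) for q in (5, 50, 95))
```

For example, `realizations` could hold the monsoon total rainfall computed from each of the 100 simulated series, giving one median and one band per statistic.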

4.1. Model Validation Over the Baseline Period

4.1.1. Comparison of Statistics

[43] The performances of KRWS, KR, KRLT, KNN, and WM models in reproducing various statistical attributes of the observed data are analyzed. The first three moments of expected rainfall series that are generated using various models, and the same for the CRF model as obtained from Raje and Mujumdar [2009], are compared with those of the historical rainfall record.

[44] Table 2 presents the statistics of the rainfall series obtained using various modeling techniques and compares them with those of the observed rainfall for the validation period (1981–2000) at eight downscaling locations. The dispersion of the mean and standard deviation of daily rainfall series obtained from 100 simulations with different modeling approaches at all locations is shown in the form of a violin plot (Figure 4). It is observed that the mean and standard deviation of the generated rainfall series at many downscaling locations are qualitatively well captured by the KR-based approach. We have used Student's t test to check statistically whether the means of the model-generated rainfall series at various locations are similar to those of the observed data, using the null hypothesis H0 that the means of the two series are the same. It is observed from Table 3 that the means of rainfall sequences generated by the KR model are similar to those of the observed rainfall at a 1% level of significance at all locations except location 5. Similarly, the KNN-simulated means are similar to the observed means at all locations except locations 5 and 7. However, the test results for the KRLT and CRF models show that the simulated means are mostly different from the observed means at a 1% level of significance. The test results for KRWS and WM show mixed outcomes.
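A test of equal means of this kind can be sketched with the two-sample, pooled-variance form of Student's t statistic. The exact variant of the test used in the study is not stated, so this form is an assumption, and the critical value of 2.576 (large-sample, two-sided, 1% level) is likewise an illustrative choice; the function names are hypothetical.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    diff = statistics.fmean(a) - statistics.fmean(b)
    return diff / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))


def reject_equal_means(a, b, critical=2.576):
    """True if H0 (equal means) is rejected at the level implied by `critical`."""
    return abs(pooled_t_statistic(a, b)) > critical
```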

Table 2. Computed Statistics for Observed and Predicted Rainfall Series
Table 3. Results Obtained for Testing Means of Observed and Predicted Rainfall Series
Location ID  Test Results for Acceptance/Rejection of Null Hypothesis (H0) at 1% Level of Significance
1            Do not reject  Do not reject  Do not reject  Do not reject  Do not reject  Reject
2            Do not reject  Do not reject  Reject         Do not reject  Reject         Reject
3            Reject         Do not reject  Reject         Do not reject  Reject         Reject
4            Do not reject  Do not reject  Reject         Do not reject  Reject         Reject
5            Reject         Reject         Reject         Reject         Do not reject  Reject
6            Do not reject  Do not reject  Do not reject  Do not reject  Do not reject  Reject
7            Do not reject  Do not reject  Do not reject  Reject         Do not reject  Reject
8            Reject         Do not reject  Reject         Do not reject  Reject         Reject
Figure 4.

Violin plot of means and standard deviations of daily rainfall at eight downscaling locations. The uncertainties represented in the violin plots result from 100 simulations corresponding to each of the downscaling approaches.

4.1.2. Basin-Averaged Wet Days and Rainfall Amounts

[45] Characteristics of WSs and intervening DSs are extremely useful for planning and management of water resources, especially for countries that depend on monsoon activity for their water needs. This assumes greater significance in the wake of global climate change and climate change scenario projections. Hence, reproduction of WS/DS and rainfall amounts in the context of statistical downscaling is very important. Though many definitions exist in the literature for identification of WS (DS), we have adopted the following definition from the work of Singh and Ranade [2010]: a WS (DS) is a “continuous period with daily rainfall equal to or greater than (less than) daily mean rainfall (DMR) of the climatological monsoon period over the area of interest.” Singh and Ranade [2010] worked out a DMR of 9.1 mm/d for the northeast coast subregion of India, where the study area is located. More details on this topic can be found in the work of Singh and Ranade [2010]. Table 4 compares the 5th, 50th, and 95th percentile estimates of both observed and downscaled monthly and monsoon wet days and rainfall amounts over the study region. It shows that the KR and KNN models perform similarly to each other and better than the other models in capturing the total number of wet days. However, for monthly rainfall amount, KR performs better than KNN. The KRWS model performs better in capturing the monsoon total rainfall amounts. Nevertheless, the KR model is found to be reasonable on both criteria, viz., capturing the number of wet days and simulating total rainfall amounts, for the monsoon as a whole and for the individual JJAS months.
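The WS/DS definition quoted above translates directly into a run-length computation on a daily series. A minimal sketch, using the DMR of 9.1 mm/d quoted in the text as the default threshold; the function name is hypothetical.

```python
from itertools import groupby

DMR = 9.1  # mm/d, daily mean rainfall threshold quoted in the text


def spell_lengths(rainfall, threshold=DMR):
    """Return (wet_spell_lengths, dry_spell_lengths) for a daily rainfall series.

    A day is wet if rainfall >= threshold and dry otherwise; a spell is a
    maximal run of consecutive days of the same type.
    """
    wet, dry = [], []
    for is_wet, run in groupby(rainfall, key=lambda r: r >= threshold):
        (wet if is_wet else dry).append(len(list(run)))
    return wet, dry
```

Spell length probabilities of the kind compared in the validation can then be obtained as relative frequencies of the returned lengths.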

Table 4. Observed and Downscaled (5th, Median (50th), and 95th Percentile Estimates) Monthly and Monsoon Wet Days and Rainfall Amount for the Testing Period: 1981–2000
Columns: Season; Wet Days (Obs., Simulated Percentile Estimate); Rainfall Amount (mm) (Obs., Simulated Percentile Estimate); Percentage Change in Median Value (Wet Days, Rainfall Amount)
Results Using Reanalysis Data: Models KRWS, KR, KRLT, KNN, WM
Results for Validation of KR Using Current Climate Data of GCM (20C3M): Model KR

4.1.3. Distribution of Basin-Averaged Annual Wet Days and Rainfall Amounts

[46] Cumulative frequency (frequency at which a variable exceeds a specific value) curves of the basin-averaged annual wet days and rainfall amounts are presented in Figure 5 for comparison and validation of the downscaling models used in this study. The low-frequency variability (year-to-year persistence) of both the number of rainy days and rainfall amounts is an important characteristic when assessing the extremities of drought/flood regimes [Mehrotra and Sharma, 2010]. Figure 5 indicates that both the KRWS and WM models overestimate the year-to-year total number of monsoon wet days, whereas the KRLT model underestimates it. Among all the models, the KR model is the best in terms of capturing the year-to-year total number of monsoon wet days. As regards the year-to-year variability of monsoon total rainfall amount, both the KR and KRWS models perform better than the other models, with varying degrees of marginal overestimation of the monsoon total rainfall amounts. We infer from the results that the KR model is capable of reproducing the distribution of the basin-averaged observed number of wet days and rainfall amounts in the downscaled realizations using NCEP reanalysis data, and hence, we use the same model for downscaling the GCM simulations for 20C3M. These downscaled realizations of the basin-averaged monsoon rainfall amounts using the 20C3M simulations show an overestimate, though the number of wet days is well modeled. The differences in rainfall amount may result from the bias in GCM predictors, which has not been removed by standardization.

Figure 5.

Distribution plots of observed and model-simulated basin-averaged annual (a–f) wet days and (g–l) rainfall amounts for downscaled reanalysis data and CGCM3.1 output (20C3M). The downscaled outputs from reanalysis data are derived with (a and g) KRWS, (b and h) KR, (c and i) KRLT, (d and j) KNN, and (e and k) WM methods for comparison. (f and l) The KR method is finally selected for reasonable performances and used with CGCM3.1 20C3M simulations.

4.1.4. Basin-Averaged WS/DS Length and Conditional Probabilities

[47] In addition to the total number of wet days and their distribution over the time period, we also evaluate the model for WS/DS length probabilities and conditional probabilities of rainfall, the predictand. Our intention is to check whether the shape and distribution of WS/DS length and also the conditional probabilities (CDF of simulated rainfall) of the predictand are captured by the downscaling model for the validation period (1981–2000). Figure 6 shows the plots of basin-averaged WS and DS length probabilities computed for observed and model-predicted rainfall series for the validation period. It is inferred from Figure 6 that the KR and KRLT models exhibit good performances, as the spell length probability curves for both WS and DS have a close match with that of the observed case, whereas noticeable deviations in the shape of WS/DS length probability plots are detected in the results for KRWS and WM models. Figure 7 compares the CDF obtained from basin-averaged observed rainfall series with those obtained using various modeling techniques. Among all the downscaling methods used with reanalysis data, the CDF obtained from the results of the KR model shows minimum deviation from that obtained for observed rainfall. A sudden increase in probability value observed at three points in the subplot comparing KNN CDF with that of the observed might be attributed to the following facts: (a) uniform kernel used in the KNN model assigns equal weights to all neighborhood points of interest, and (b) assignment of feature vectors into kernels is based on the crisp classification of rainfall states, whereas some of the feature vectors under consideration might lie in the boundaries of two kernels.

Figure 6.

Basin-averaged WS and DS length probabilities for downscaled NCEP reanalysis data and GCM 20C3M simulations. (a–e) Downscaled simulations are obtained from reanalysis data with KRWS, KR, KRLT, KNN, and WM. (f) The KR model is finally selected for better simulation and used for GCM 20C3M outputs.

Figure 7.

CDF of basin-averaged rainfall. The downscaled outputs from reanalysis data are derived with (a) KRWS, (b) KR, (c) KRLT, (d) KNN, and (e) WM methods for comparison. (f) The KR method is finally selected for reasonable performances and used with CGCM3.1 20C3M simulations.

4.1.5. Basin-Averaged Interannual and Intraannual Rainfall Variability

[48] Indian summer monsoon rainfall exhibits large spatial and interseasonal variability across peninsular India. It is imperative to model the day-to-day variability of rainfall characterized by “active” and “break” periods [Krishnamurthy and Shukla, 2000]. The unique geographical features of the Indian subcontinent, along with associated atmospheric, oceanic, and geophysical components, strongly influence the behavior of the Indian summer monsoon. The monsoon starts on the western coast around 1 June and covers India entirely by around 15 July. Its withdrawal from India typically starts in the first week of September and is complete by the first week of October. Hence, we investigate the interannual and intraannual variability of area-averaged rainfall simulated by the various models for the validation period, using the expected daily rainfall sequence (50th percentile estimates) generated by each model. The monthly/monsoon total rainfall amounts are computed over the 122 days of JJAS in this analysis. Prior to computation of the daily climatological rainfall, very high frequency fluctuations in the daily rainfall are removed by applying a 5 day moving average to the daily rainfall data in order to obtain more coherent results [Krishnamurthy and Shukla, 2000]. Figure 8 compares the plots of annual, monthly, and daily climatological mean rainfall (area averaged over the river basin), simulated by the KRWS, KR, KRLT, KNN, and WM models for the 20 years from 1981 to 2000, with those of the observed rainfall. The basin-averaged annual rainfall computed from the expected rainfall (50th percentile) series generated by the KR model shows a good match (with a correlation coefficient of 0.88) with that of the observed. For the monthly rainfall series, the KNN and the KR models show a good match with the observed series (with correlation coefficients of 0.94 and 0.92, respectively). However, the daily climatological mean rainfall series obtained from the kernel regression-based methods perform better in capturing the vagaries of monsoon rainfall. Even the daily climatological mean series estimated for the WM-based stochastic weather generator is comparable with that obtained from the observed rainfall sequence. The onset/withdrawal of the Indian monsoon is clearly depicted by low/near-zero rainfall at the start and end of the monsoon period and by a daily mean monsoon rainfall varying between 8 and 10 mm/d, which confirms the results obtained by Singh and Ranade [2010] for this region. However, the daily climatological mean series obtained from the KNN model does not show any such vagaries during the entire monsoon period.
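The smoothing and averaging used for the daily climatological mean can be sketched as follows. The centered window and its truncation at the two ends of the season are assumptions made for this illustration, since the handling of the series boundaries is not specified; the function names are hypothetical.

```python
def moving_average(series, window=5):
    """Centered moving average; the window is truncated at both ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def daily_climatology(years):
    """Mean rainfall for each day of the season across years.

    years: list of equal-length daily series, one per year (e.g., 122 JJAS days).
    """
    n_years = len(years)
    return [sum(day_values) / n_years for day_values in zip(*years)]


def smoothed_climatology(years, window=5):
    """Apply the moving average to each year, then average across years."""
    return daily_climatology([moving_average(y, window) for y in years])
```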

Figure 8.

(a) Annual, (c) seasonal, and (e) daily climatological mean rainfall, spatially averaged over the Mahanadi basin. The daily climatological mean is computed with a 5 day moving average of the rainfall time series. The corresponding correlation coefficients for multiple downscaling approaches are presented in Figures 8b, 8d, and 8f.

4.1.6. Temporal Rainfall Variability and Spatial Dependence

[49] Any hydrological model requires accurate projection of rainfall exhibiting both spatial and temporal variability over the river basin for any studies on water availability and/or prediction of extremes, in the context of changed future emission scenarios, in order to assess the impending situation. In this regard, we evaluate the performance of all downscaling models in terms of their ability to capture both temporal and spatial dependence of the generated rainfall field. Table 5 provides details on the correlation coefficient obtained by comparing the model-predicted rainfall time series with that of the observed at all eight downscaling locations for the validation period. The results indicate that overall performance of the KR model conditioned on the rainfall states is fairly good when compared with other models.

Table 5. Correlation Coefficients Obtained for Observed and Predicted Rainfall Series at Various Downscaling Locations (Testing Period: 1981–2000)
Location IDCorrelation Coefficient Obtained for Model-Generated Rainfall Series

[50] The interstation correlation coefficients between various pairs of downscaling locations are computed for the observed and predicted rainfall sequences obtained from the model calibration runs. Figure 9 presents the scatterplots of cross-correlation coefficients obtained from the observed and model-simulated daily rainfall series for all station pairs by different modeling techniques. The scatterplots show that the spatial structure in the rainfall field is captured well by the KRWS model. It is also observed that the KR and KRLT models overestimate the cross-correlation coefficients. This may be because of the added artificial correlation to the simulations by the conditional rainfall state. Hence, the cautious use of the conditional state is important, where the spatial correlation is of primary interest. The other models, such as KNN, WM, and CRF, completely fail to capture the spatial correlation structure existing in the observed data.
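The interstation cross-correlation comparison can be sketched by computing the Pearson correlation coefficient for every pair of stations, once for the observed series and once for the simulated series. The station names and data layout below are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series.

    Undefined (division by zero) if either series is constant.
    """
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def pairwise_correlations(stations):
    """stations: dict mapping station name to daily series.

    Returns {(name_a, name_b): r} for all unordered station pairs.
    """
    return {
        (a, b): pearson(stations[a], stations[b])
        for a, b in combinations(sorted(stations), 2)
    }
```

For eight downscaling locations this yields 28 station pairs; plotting the simulated coefficient against the observed coefficient for each pair gives a scatterplot of the kind described above.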

Figure 9.

Interstation correlation coefficients between pairs of stations for different modeling approaches ((a) KRWS, (b) KRLT, (c) KNN, (d) CRF, and (e) WM) compared with those of the KR model.
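The comparison behind Figure 9 can be sketched as follows: compute the cross-correlation for every station pair from the observed and from the simulated series, then compare the two sets. The data below are synthetic, and the reduced local noise in the simulated field is an assumption chosen to mimic the kind of overestimation discussed above; `pairwise_correlations` is a hypothetical helper name.

```python
import numpy as np

def pairwise_correlations(rain):
    """Cross-correlation for every station pair; rain has shape (n_days, n_stations)."""
    corr = np.corrcoef(np.asarray(rain, dtype=float), rowvar=False)
    i, j = np.triu_indices(corr.shape[0], k=1)  # upper triangle, diagonal excluded
    return corr[i, j]

rng = np.random.default_rng(1)
n_days, n_stations = 20 * 122, 8
common = rng.gamma(0.8, 10.0, size=(n_days, 1))              # shared basin-wide signal
obs = common + rng.gamma(0.8, 6.0, size=(n_days, n_stations))
sim = common + rng.gamma(0.8, 3.0, size=(n_days, n_stations))  # weaker local noise

obs_cc = pairwise_correlations(obs)
sim_cc = pairwise_correlations(sim)
bias = float(np.mean(sim_cc - obs_cc))
print(len(obs_cc), round(bias, 3))
```

Plotting `sim_cc` against `obs_cc` gives a scatterplot of the same kind as the figure; points above the 1:1 line indicate overestimated spatial correlation.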

4.1.7. Model Selection

[51] We analyze the performance of the various downscaling models based on multiple criteria. Some important insights are presented in this section. The WM model performs poorly in capturing the basin-averaged annual and monthly mean rainfall, the spatial structure of the rainfall field, and the WS/DS length probabilities. Though the KNN model performs moderately well in capturing the basic statistics, it overestimates the basin-averaged annual rainfall amount, fails to capture the spatial dependence in the rainfall field, and performs poorly in simulating the basin-averaged daily climatological mean rainfall series. On the other hand, the KRLT model tends to underestimate the total number of rainy days and rainfall amounts, which is evident from the basic statistics as well as from the distribution plot of basin-averaged annual wet days and rainfall totals for the validation period. The choice between the KR and KRWS models is close. Hence, we have adopted a step-by-step comparison of performance measures for these two models:

[52] (i) Mean and standard deviation computed for rainfall series generated by KR model are closer than those generated by KRWS to the observed estimates at various locations;

[53] (ii) KR model means are accepted at a 1% level of significance for most of the locations except downscaling location 5, whereas means of KRWS-model-generated rainfall series are rejected at two downscaling locations;

[54] (iii) Basin-averaged monthly and monsoon wet days estimated from KR model simulations are found to be closer to the observed values than those of the KRWS model. However, basin-averaged monthly rainfall amounts estimated by both KR and KRWS models show mixed results. Rainfall estimates for the months of June and September are well captured by the KR model, whereas rainfall estimates for the months of July and August are well captured by the KRWS model;

[55] (iv) Distribution plot of basin-averaged annual wet days obtained from the KR model results closely matches with that of the observed case, whereas a similar plot obtained from basin-averaged monsoon rainfall series generated by the KRWS model shows a better match than that of the KR model;

[56] (v) Basin-averaged annual and monthly rainfall (computed with 50th percentile estimates) series obtained from the KR model show a good correlation with observed data. However, daily climatological mean rainfall series is well simulated by the KRWS model;

[57] (vi) Basin-averaged spell length and conditional probabilities obtained from the KR model have a good match with those of the observed data;

[58] (vii) Rainfall series generated by the KRWS model captures the spatial correlation structure present in the observed rainfall field better than that of the KR model.

[59] After careful consideration of the results of the various performance measures, we have selected the KR model as a reasonable model for the simulation of rainfall series pertaining to the A1B, A2, and B1 emission scenarios in the near-future (2026–2050) and far-future (2076–2100) periods.

[60] Further, to crosscheck whether the temporal variability of daily rainfall has adequately been captured by the KR model, we have selected 3 months in the calibration period for which the correlation between observed and estimated rainfall is high. Figure 10 shows the comparison plots of observed and NCEP-downscaled expected rainfall obtained for selected months in the calibration period at locations 1, 4, and 7. It is found that the KR model sometimes fails to capture the observed rainfall amounts. For example, the actual observed rainfall at location 1 on 9 August 1982 is close to 150 mm, whereas the KR model predicts much less rainfall on that day. Further, on 15 August 1982, the KR model predicts a high rainfall estimate, whereas the observed rainfall did not show any such high rainfall occurrence. Similar mismatches are detected on the 2nd, 3rd, and 21st of July 1989 for location 4 and on the 11th, 12th, and 14th day of September 1998 for location 7. Therefore, it is acknowledged that near-perfect prediction of rainfall estimates is a difficult task for the kernel regression estimator-based stochastic downscaling technique. The same difficulty is common for any well-known stochastic downscaling approach. While using the climate model outputs, we must understand that models are not reality, and they “cannot capture all the factors involved in a natural system, and those that they do capture are often incompletely understood” [Maslin and Austin, 2012].

Figure 10.

Comparison of observed and model-computed expected rainfall at three selected locations for selected months. The comparison plots for (a) location 1 during August 1982, (b) location 4 during July 1989, and (c) location 7 during September 1998 are presented.

4.2. Projections With GCM Simulations

[61] The KR model is applied with standardized and dimensionally reduced data pertaining to the COMMIT, A1B, A2, and B1 scenarios of the CGCM3.1 outputs conditioned on the rainfall states of respective scenarios, in order to obtain 100 independent simulations of future-day JJAS daily rainfall realizations. To investigate the influence of global warming on changes in precipitation characteristics, we have selected two time slices (2026–2050 and 2076–2100) in the future. It is expected that the water-holding capacity of air mass increases by about 7%/1°C warming [Trenberth et al., 2003], which leads to increased water vapor in the atmosphere, and this probably provides the biggest influence on precipitation. Intensification of storms, supplied by increased moisture, is expected to produce more intense precipitation events that are widely observed to be occurring, even in places where total precipitation is decreasing. We present here the changes in basin-averaged monthly and monsoon wet days and rainfall amounts, changes in shape of WS/DS length, and conditional probabilities at a few selected locations for two time slices in the near future (2026–2050) and the far future (2076–2100). Further, the 50 year return period of extreme daily rainfall, trends in annual mean and annual maximum rainfall estimates for all downscaling locations in the river basin, and similar estimates obtained from precipitation flux data of CGCM3.1/T63 for two grid points falling within the Mahanadi basin are also investigated for impending change in global climate. All results in the following sections are compared with those of the COMMIT scenario. The COMMIT emission scenario represents an idealized scenario in which the atmospheric burdens of long-lived greenhouse gases are held fixed at 2000 AD levels.
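The 7% per 1°C figure quoted above can be turned into a small worked calculation. The sketch below assumes that the rate compounds with each degree of warming, and the warming values are illustrative only; neither assumption comes from the study.

```python
def moisture_capacity_factor(warming_degc, rate_per_degc=0.07):
    """Multiplicative change in water-holding capacity for a given warming.

    Assumes the per-degree rate compounds, which is a simplification.
    """
    return (1.0 + rate_per_degc) ** warming_degc

# Illustrative warming levels in degrees Celsius (not taken from the GCM runs).
for dt in (1.0, 2.0, 3.0):
    increase_pct = 100.0 * (moisture_capacity_factor(dt) - 1.0)
    print(f"{dt:.0f} degC warming: about {increase_pct:.1f}% more water-holding capacity")
```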

4.2.1. Projected Rainfall Changes During 2026–2050 and 2076–2100

[62] Tables 6 and 7 provide details on the estimated changes in the number of wet days and rainfall amounts on a monthly and a monsoon basis during the period 2026–2050. Percentage change in median estimates of any scenario is obtained with respect to that of COMMIT results for the same period. The changes detected for the A1B scenario show a 3% increase in monsoon wet days and a 17% decrease in the monsoon rainfall amount. Both A2 and B1 scenarios project a moderate increase in monsoon wet days and rainfall amounts. For the A2 scenario, the increase is 24% for wet days and 21% for rainfall amounts, while the corresponding figures for B1 are 21% and 15% increases, respectively. Tables 8 and 9 give details on the estimated changes in the total number of wet days and rainfall amounts on a monthly and a monsoon basis during the period 2076–2100. The changes detected for the A1B scenario show a 44% increase in monsoon wet days and a 26% increase in monsoon rainfall amount. For the A2 scenario, the increase is 71% for wet days and 66% for rainfall amounts, while the corresponding figures for B1 are 44% and 46% increases, respectively. Percentage change in the median estimate of rainfall amounts for the month of June shows an increase for all emission scenarios, with a lowest value of 17% for the A1B scenario and a highest value of 80% for the A2 scenario during 2026–2050. However, an increase in rainfall amounts is reported for both June and September during 2076–2100. For the month of June, the increase is 57%, 82%, and 95% for the A1B, A2, and B1 scenarios respectively, while the corresponding figures for the month of September are 81%, 129%, and 65%, respectively. The highest increase in the rainfall amount is reported for the month of September during 2076–2100 for the A2 scenario.

Table 6. Percentage Changes in Monthly and Monsoon Numbers of Wet Days During 2026–2050
Median Estimate of Number of Wet Days | Median Estimate of Number of Wet Days | Percentage Change in Median Estimate | Median Estimate of Number of Wet Days | Percentage Change in Median Estimate | Median Estimate of Number of Wet Days | Percentage Change in Median Estimate

Table 7. Percentage Changes in Monthly and Monsoon Rainfall Amounts During 2026–2050
Median Estimate of Rainfall (mm) | Median Estimate of Rainfall (mm) | Percentage Change in Median Estimate | Median Estimate of Rainfall (mm) | Percentage Change in Median Estimate | Median Estimate of Rainfall (mm) | Percentage Change in Median Estimate

Table 8. Percentage Changes in Monthly and Monsoon Numbers of Wet Days During 2076–2100
Median Estimate of Number of Wet Days | Median Estimate of Number of Wet Days | Percentage Change in Median Estimate | Median Estimate of Number of Wet Days | Percentage Change in Median Estimate | Median Estimate of Number of Wet Days | Percentage Change in Median Estimate

Table 9. Percentage Changes in Monthly and Monsoon Rainfall Amounts During 2076–2100
Median Estimate of Rainfall (mm) | Median Estimate of Rainfall (mm) | Percentage Change in Median Estimate | Median Estimate of Rainfall (mm) | Percentage Change in Median Estimate | Median Estimate of Rainfall (mm) | Percentage Change in Median Estimate
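The percentage changes reported in Tables 6 to 9 are changes in the median estimate of a scenario relative to the COMMIT median for the same period. A minimal sketch of that calculation is given below. The 100 synthetic values per scenario stand in for the 100 simulations, and the numbers themselves are invented for illustration.

```python
import numpy as np

def percent_change_in_median(scenario_sims, commit_sims):
    """Percentage change of the scenario median relative to the COMMIT median."""
    med_scenario = np.median(scenario_sims)
    med_commit = np.median(commit_sims)
    return 100.0 * (med_scenario - med_commit) / med_commit

# Synthetic monsoon rainfall totals (mm) from 100 simulations per scenario.
rng = np.random.default_rng(2)
commit_total = rng.normal(1000.0, 40.0, size=100)
scenario_total = rng.normal(1200.0, 50.0, size=100)

print(round(percent_change_in_median(scenario_total, commit_total), 1))
```

The same function applies to wet-day counts, since only the medians of the two simulation ensembles enter the ratio.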

4.2.2. WS/DS Length Probabilities and Conditional Probability Under the Changed Climate

[63] A change in the number of wet days at any downscaling location changes the WS lengths and their corresponding occurrence probabilities. Similarly, a sustained increase/decrease in the amount of predicted rainfall affects the shape of the CDF of the conditional variable (rainfall). Therefore, we investigate the changes in WS/DS length probabilities and in the shape of the CDF due to the change in climate. Results are projected for changes during 2026–2050 and 2076–2100 at two downscaling locations. Locations 1 and 7 are selected based on their geographical positions: location 1 is in the coastal region, while location 7 is an interior location. Figure 11 shows the plots of WS and DS length probabilities obtained from the model results for the COMMIT, A1B, A2, and B1 scenarios at locations 1 and 7. The WS/DS length probabilities obtained for the A1B, A2, and B1 scenarios are compared with those of the COMMIT scenario. Plots of WS length probabilities show no change in shape during 2026–2050 at location 1 for all emission scenarios (Figure 11a). Similar behavior is exhibited by the plots of DS length probabilities during 2026–2050 at locations 1 and 7 (Figures 11e and 11f). The WS length probabilities computed for the A2 and B1 emission scenarios at location 7 during 2026–2050 show a marginal increase for three or more days, while the plot for the A1B scenario shows a noticeable decrease in WS length probabilities for three or more days (Figure 11b). However, the results are slightly different for 2076–2100 at location 1, where the A2 scenario shows a strong increase and the A1B scenario shows a marginal increase in WS length probabilities for three or more days (Figure 11c), while the A1B and A2 scenarios show a marginal increase in DS length probabilities for all days (Figure 11g). On the other hand, the results show a marginal increase in WS length probabilities for the A2 and B1 scenarios (Figure 11d) and no changes in DS length probabilities at location 7 during 2076–2100 (Figure 11h).

Figure 11.

WS and DS length probabilities for the near-future (2026–2050) and far-future (2076–2100) periods for locations 1 and 7. (a) The WS length probability plots obtained for all scenarios during 2026–2050 at location 1 and the DS length probability plots for all scenarios during 2026–2050 at locations (e) 1 and (f) 7 match with that of the COMMIT. (b) The WS length probability plots for location 7 during 2026–2050 show noticeable change in shape for A1B scenario. For period 2076–2100, (c) the WS length probabilities at location 1 for A2 scenario show a noticeable change in shape, while (g) A2 and A1B scenarios show a marginal increase in the DS length probabilities at location 1. (d) The WS length probability plots for A2 and B1 scenarios show noticeable changes in shape, whereas (h) no changes are found in the shapes of the DS length probabilities for all scenarios at location 7 during 2076–2100.
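The WS/DS length probabilities shown in Figure 11 can be estimated from a daily series by counting runs of consecutive wet or dry days. The sketch below is one way to do this. The wet-day threshold of 0.1 mm, the pooling of long spells at `max_len`, and the function names are assumptions for illustration; the series is also treated as one contiguous record, ignoring breaks between monsoon seasons.

```python
import numpy as np

def spell_lengths(flags):
    """Lengths of consecutive runs of True values in a boolean sequence."""
    lengths, run = [], 0
    for flag in flags:
        if flag:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

def spell_length_probabilities(rain, wet=True, threshold=0.1, max_len=10):
    """Empirical probability of spell lengths 1..max_len (longer spells pooled at max_len)."""
    rain = np.asarray(rain, dtype=float)
    flags = rain > threshold if wet else rain <= threshold
    lengths = np.minimum(np.array(spell_lengths(flags), dtype=int), max_len)
    counts = np.bincount(lengths, minlength=max_len + 1)[1:]  # drop the unused length-0 bin
    total = counts.sum()
    return counts / total if total else np.zeros(max_len)

# Tiny hand-made daily series (mm) to show the call pattern.
daily_rain = [0, 5, 3, 0, 0, 2, 0, 4, 4, 4]
print(spell_length_probabilities(daily_rain, wet=True)[:4])
print(spell_length_probabilities(daily_rain, wet=False)[:4])
```

Applying the function to the series of each scenario and overlaying the results gives curves comparable to those in the figure.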

[64] CDFs are useful tools for detecting changes in the frequency of occurrence of high/low rainfall activity over a period of time: an increase/decrease in rainfall activity shifts the CDF below/above the reference curve. CDFs are obtained from the model-generated rainfall series for all emission scenarios to detect changes in the frequency of high/low rainfall occurrence. Figure 12 presents the CDF plots obtained for locations 1 and 7 for the two future periods under consideration. The CDFs obtained for emission scenarios A1B, A2, and B1 almost match the CDF of COMMIT for both locations during 2026–2050. However, a downward shift of the CDF for the A2 scenario indicates an increased frequency of high rainfall events at location 1 during 2076–2100. Location 7 does not show any such change. We have also developed probability-probability (p-p) plots, with the CDF of the COMMIT scenario forming the abscissa and the CDFs of the other emission scenarios (A1B, A2, and B1) forming the ordinate of the plot. The p-p plots for the two future time periods of any given emission scenario give insight into the changes in frequency of high/low rainfall activity. Figure 13 shows p-p plots obtained for all three emission scenarios for the two time periods under consideration for locations 1 and 7. The p-p plots for the A1B and A2 scenarios at location 1 show considerable changes in the shapes of the CDFs for high rainfall values during 2076–2100. In contrast, the p-p plots for the B1 scenario show a notable change in the mid-rainfall values during 2076–2100 at location 1. However, the changes observed in the shape of the p-p plots at location 7 for various emission scenarios are opposite to those obtained for location 1. Therefore, location 1 encounters increased instances of high rainfall events for the A1B and A2 scenarios and decreased instances of medium rainfall events for the B1 scenario during 2076–2100. Location 7 encounters decreased instances of low-to-medium rainfall values for the A1B and A2 scenarios.

Figure 12.

CDF of daily rainfall for the near-future (2026–2050) and far-future (2076–2100) periods for locations (a and b) 1 and (c and d) 7, respectively.

Figure 13.

p-p plot of CDF of daily rainfall for the near-future (2026–2050) and far-future (2076–2100) periods for locations (a–c) 1 and (d–f) 7. The greenhouse forcing simulations (A1B, A2, and B1) are compared with COMMIT scenario.
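The CDF and p-p comparisons of Figures 12 and 13 can be sketched with empirical CDFs evaluated on a common rainfall grid. In the example below the reference sample plays the role of COMMIT and the second sample plays the role of a forcing scenario. Both samples are synthetic, and the function names and grid size are assumptions for illustration.

```python
import numpy as np

def empirical_cdf(sample, grid):
    """Fraction of sample values less than or equal to each grid value."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def pp_points(reference, scenario, n_grid=200):
    """Paired CDF values (reference on the abscissa, scenario on the ordinate)."""
    upper = max(np.max(reference), np.max(scenario))
    grid = np.linspace(0.0, upper, n_grid)
    return empirical_cdf(reference, grid), empirical_cdf(scenario, grid)

rng = np.random.default_rng(4)
commit = rng.gamma(0.7, 12.0, size=3000)    # stand-in for COMMIT daily rainfall (mm)
scenario = rng.gamma(0.7, 16.0, size=3000)  # stand-in with more frequent high rainfall

p_ref, p_scn = pp_points(commit, scenario)
print(round(float(np.mean(p_scn - p_ref)), 3))
```

A negative mean difference means the scenario CDF lies below the reference CDF, so the p-p points fall below the diagonal, which corresponds to more frequent high rainfall values in the scenario.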

4.2.3. Trends in Annual Daily Mean and Annual Maximum Daily Rainfall

[65] To identify any significant trend present in the model-generated expected rainfall series (2001–2100) under different emission scenarios, trends are computed for annual DMR and annual daily maximum rainfall data at all downscaling locations. Figure 14 shows the significant positive trends at locations 1, 2, and 4 and significant negative trends at locations 7 and 8 for annual DMR for the A1B scenario. Significant positive trends for all locations except location 8 and significant negative trends for all locations except location 1 are observed for the A2 and B1 scenarios. A similar exercise was carried out for raw precipitation flux data obtained from the runs of the same GCM for the same period, which does not show any significant trend in the derived precipitation series for the A1B, A2, and B1 scenarios. This may have resulted from poor simulation performance of coarse resolution GCM. Also, significant positive trends in annual maximum daily rainfall are observed at downscaling locations 1, 2, and 3, and negative trends are observed at locations 5, 6, 7, and 8 for the A1B scenario. For the A2 scenario, locations 1, 2, and 3 exhibit a positive annual maximum rainfall trend, and locations 7 and 8 exhibit a negative trend. The results for the B1 scenario show a positive rainfall trend for locations 1 and 2 and negative annual maximum rainfall trends for locations 4, 5, 6, and 7. However, the annual maximum daily rainfall series obtained from precipitation flux data shows a positive trend for nearshore locations for the A1B scenario only. Data pertaining to the other two emission scenarios do not show any significant positive/negative trend with the coarse resolution GCM output. However, cautious interpretation of the spatial trend exhibited by the model results is important, as the KR model exhibits overestimation of the interstation cross-correlation coefficients in the present downscaling scheme.

Figure 14.

(a–c) Annual DMR trend and (d–f) annual maximum daily rainfall trend for all locations in Mahanadi basin for downscaled output along with similar estimates obtained from raw precipitation flux data of CGCM3.1/T63 for two grid points falling within Mahanadi basin.
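This section does not state which test was used to judge trend significance. As a hedged illustration, the sketch below applies the Mann-Kendall test, a nonparametric trend test widely used in hydrology, to annual mean and annual maximum series derived from a synthetic daily record. The choice of test, the omission of a tie correction, and the synthetic data are all assumptions.

```python
import math
import numpy as np

def mann_kendall(series):
    """Mann-Kendall S statistic, normal-approximation Z score, and two-sided p value.

    No correction for ties is applied, which is adequate for continuous data.
    """
    x = np.asarray(series, dtype=float)
    n = x.size
    s = 0.0
    for i in range(n - 1):
        s += np.sign(x[i + 1:] - x[i]).sum()
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1.0) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1.0) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return s, z, p

# Synthetic daily monsoon rainfall for 2001-2100 with a slowly growing scale.
rng = np.random.default_rng(3)
years = np.arange(2001, 2101)
growth = (1.0 + 0.004 * (years - 2001))[:, None]
daily = rng.gamma(0.7, 12.0, size=(years.size, 122)) * growth

for name, annual in (("annual mean", daily.mean(axis=1)), ("annual maximum", daily.max(axis=1))):
    s, z, p = mann_kendall(annual)
    print(name, round(z, 2), round(p, 4))
```

A positive Z with a small p value indicates a significant increasing trend; the sign convention is the same for the annual mean and the annual maximum series.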

[66] Extreme events are reported to increase in a warming environment [Goswami et al., 2006]. A Gumbel extreme value distribution has been fitted to the annual daily rainfall maxima extracted for the two future time periods, 2026–2050 and 2076–2100, to obtain the 50 year return period extreme rainfall for the future emission scenarios. Figure 15 compares the results for all emission scenarios for the two future time periods at downscaling locations 1 and 7. The results can be summarized as follows: (a) location 1 exhibits a decreasing trend in the amount of extreme rainfall for the 50 year return period for the A1B and A2 scenarios, whereas an increasing trend has been reported for the B1 scenario; (b) location 7 exhibits a decreasing trend in extreme rainfall for all future emission scenarios; and (c) spatial heterogeneity exists in the extreme rainfall characteristics of locations 1 and 7, which is in agreement with Ghosh et al. [2012]. We also check whether similar trends are exhibited in the precipitation flux data at the two grid points falling within the Mahanadi basin. The results show an increasing trend in extreme rainfall for the A1B and A2 scenarios and a decreasing trend for the B1 scenario at both grid points, which contradicts the downscaled results. A recent work of Ghosh et al. [2012] reports increasing variability of rainfall extremes in India. This supports our findings of regional changes in extreme rainfall characteristics.

Figure 15.

Fifty year return period of extreme daily rainfall for the near-future (2026–2050) and far-future (2076–2100) time periods for locations 1 (a) and 7 (b) and similar estimates obtained from precipitation flux data of CGCM3.1/T63 for two grid points falling within Mahanadi basin (c and d).
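The 50 year return level from a fitted Gumbel distribution can be sketched as follows. The fitting method is not stated in this section, so the example uses the method of moments as an assumption; the 25 synthetic annual maxima stand in for one 25 year time slice and are not results from the study.

```python
import math
import numpy as np

EULER_GAMMA = 0.5772156649015329

def gumbel_return_level(annual_maxima, return_period=50.0):
    """Return level from a Gumbel fit by the method of moments (an assumed fitting choice)."""
    x = np.asarray(annual_maxima, dtype=float)
    scale = x.std(ddof=1) * math.sqrt(6.0) / math.pi
    loc = x.mean() - EULER_GAMMA * scale
    p_non_exceed = 1.0 - 1.0 / return_period
    # Invert the Gumbel CDF F(x) = exp(-exp(-(x - loc) / scale)).
    return loc - scale * math.log(-math.log(p_non_exceed))

# Synthetic annual maximum daily rainfall (mm) for one 25 year slice.
rng = np.random.default_rng(5)
maxima = rng.gumbel(loc=100.0, scale=25.0, size=25)
print(round(gumbel_return_level(maxima, 50.0), 1))
```

With only 25 values per time slice, the estimated return level carries considerable sampling uncertainty, which is worth keeping in mind when comparing slices.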

[67] The present findings are different from those obtained by Raje and Mujumdar [2009] using the CRF model. They projected similar trends for all the locations of the Mahanadi River basin. The CRF model failed to show the variability of rainfall trend due to its inability to simulate the spatial pattern of rainfall in the river basin. The proposed model simulates the spatial variability in rainfall pattern in an improved way, which is also reflected in terms of varying trend of rainfall at all downscaling locations in the Mahanadi River basin.

5. Summary and Conclusion

[68] The work reported in this paper contributes toward developing methodologies for the generation of multisite rainfall for a river basin from large-scale GCM output of circulation patterns conditioned on the rainfall state of the river basin. The downscaling approach presented here provides a new approach to reproduce the spatiotemporal structure of the observed record at daily time scales. The multisite statistical downscaling model presented here is composed of two parts, a classification tree-based basin-wise rainfall state occurrence model and a multivariate KR model for rainfall amounts. The classification tree-based CART model is allowed to have both categorical and continuous predictors to predict the rainfall state of the river basin. Lag-1 rainfall is used in the classification tree building process to impart daily or short-term persistence and/or other continuous variables to explain the higher time scale persistence. The multisite rainfall sequences in the Mahanadi basin are generated with the help of the nonparametric KR model conditioned on the climate predictors and rainfall state of the river basin. Since the present downscaling technique combines a nonparametric regression with a stochastic simulation process together, the KR formulation not only estimates a mean rainfall value from the conditional PDF but also adds a stochastic part following a procedure suggested by Sharma et al. [1997] and Sharma [2000b]. Furthermore, unlike widely used NHMMs, the present method does not have any hidden states; rather, the rainfall states used here can be visualized in terms of rainfall amounts. The model captures well the spatiotemporal variability of rainfall, though it overestimates the spatial cross correlation.
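The core idea of estimating multisite rainfall with a kernel regression conditioned on the rainfall state can be sketched with a Nadaraya-Watson estimator restricted to training days that share the predicted state. This is a simplified illustration, not the authors' implementation: it uses a Gaussian kernel with a single scalar bandwidth instead of a bandwidth matrix, returns only the conditional mean, and omits the stochastic part. The function name and the synthetic data are assumptions.

```python
import numpy as np

def kr_conditional_mean(x_new, state_new, X_train, Y_train, states_train, bandwidth):
    """Nadaraya-Watson estimate of multisite rainfall, using only days in the given state."""
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=float)
    mask = np.asarray(states_train) == state_new
    if not mask.any():
        raise ValueError("no training days share the requested rainfall state")
    Xs, Ys = X_train[mask], Y_train[mask]
    sq_dist = ((Xs - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1) / bandwidth ** 2
    weights = np.exp(-0.5 * (sq_dist - sq_dist.min()))  # shifted for numerical stability
    weights /= weights.sum()
    return weights @ Ys  # one value per station

# Synthetic training set: 500 days, 5 predictor components, 8 stations, 3 states.
rng = np.random.default_rng(6)
X = rng.normal(size=(500, 5))
states = rng.integers(0, 3, size=500)
Y = rng.gamma(0.8, 10.0, size=(500, 8)) * (1 + states)[:, None]

estimate = kr_conditional_mean(X[0], states[0], X, Y, states, bandwidth=1.0)
print(np.round(estimate, 1))
```

Because every station shares the same kernel weights for a given day, the estimates at different stations move together, which is consistent with the overestimated cross correlation noted above.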

[69] The kernel regression-based downscaling framework ensures a moderately good representation of seasonal variations as well as of the spatial dependence structure in the generated rainfall sequences. Because the approach directly uses exogenous variables, it is suitable for the simulation of the rainfall field under a changed climate as simulated by GCMs.

[70] The paucity of observed rainfall data in the study region has driven us to use the IMD gridded rainfall data product in our study, both for the basin-wise rainfall state occurrence model and for the simulation of rainfall sequences using the KR model. IMD gridded rainfall data are obtained by interpolating daily rainfall data observed at 1803 stations in India [Rajeevan et al., 2006]. It is reported that the density of stations considered for interpolation is not uniform throughout India. Network density is the highest over the southern peninsula and poor over the northern plains and eastern parts of central India. This is evident from the fact that the NCEP/NCAR reanalysis rainfall over eastern India shows a decreasing trend, whereas the rainfall rate over the same area derived from the IMD gridded rainfall data does not show any decreasing trend [Rajeevan et al., 2006]. Hence, the interpolated gridded rainfall data used in this study may introduce some uncertainty in our projections because of the lower rain gauge network density in the study area.

[71] We have adopted Kaiser's rule to identify the optimum number of principal components, as there is no clear rule for the selection of the optimum number of principal components in the literature. Our logic for the selection of predictor variables can be subjective, but it reflects the attributes a user would expect in the kind of studies used for the simulation of rainfall sequences under the changed climate. However, with the availability of a large set of GCM-simulated climate variables, nonparametric stepwise predictor identification analysis [Mehrotra and Sharma, 2010] based on the PMI may be performed.
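Kaiser's rule retains the principal components whose eigenvalues of the correlation matrix exceed 1. A minimal sketch on synthetic data is given below; the two latent factors and the noise level are invented so that the expected number of retained components is easy to reason about, and `kaiser_rule` is a hypothetical helper name.

```python
import numpy as np

def kaiser_rule(predictors):
    """Number of principal components with correlation-matrix eigenvalue above 1.

    Also returns the fraction of total variance explained by the retained components.
    """
    data = np.asarray(predictors, dtype=float)
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # descending order
    n_keep = int((eigvals > 1.0).sum())
    explained = float(eigvals[:n_keep].sum() / eigvals.sum())
    return n_keep, explained

# Synthetic predictors: 10 variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(8)
factors = rng.normal(size=(2000, 2))
predictors = np.hstack([
    factors[:, [0]] + 0.3 * rng.normal(size=(2000, 5)),
    factors[:, [1]] + 0.3 * rng.normal(size=(2000, 5)),
])

n_keep, explained = kaiser_rule(predictors)
print(n_keep, round(explained, 2))
```

Working with the correlation matrix is equivalent to standardizing the predictors first, which matches the use of standardized data before dimensionality reduction.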

[72] Errors in, and the inadequate resolution of, the NCEP/NCAR reanalysis data used as predictors in this study can affect the downscaling. The quality of the reanalysis data depends on the quality and quantity of the data fed in for reanalysis. For example, TIROS Operational Vertical Sounder data were not available before 1979 for data assimilation [Dominguez and Kumar, 2005]. Furthermore, reanalysis data products are a mixture of inhomogeneous observations from various platforms and model forecasts. Hence, the quality of the reanalysis model also affects the quality of the reanalysis data. For example, in large geographic regions with little or no observational coverage, a reanalysis will tend to move away from nature and reflect more of the model's own behavior [Dominguez and Kumar, 2005]. Other limitations are poorly observed quantities, such as surface evaporation, which mainly depend on the quality of the model's representation or parameterizations of the relevant physical processes. Therefore, the quality of the reanalysis data used might cause serious problems in the assessment of the downscaling relationship and affect the efficiency of the downscaling approach. Also, the KR modeling approach, being primarily data driven, captures relationships established from reanalysis data. It therefore assumes that the relationship between climate variables and rainfall remains stationary under changed conditions, which may not always be valid. This limitation is shared by any statistical downscaling algorithm. Testing the stationarity of the relationship and modifying the model accordingly may be considered as the future scope of the present work.

[73] Some of the other important factors one should be careful with while adopting a nonparametric modeling approach include the curse of dimensionality and the use of limited training data. Presently, we are using a daily predictor containing 50 attributes in our model, following Jolliffe [1972]. Still, there is scope for reducing the number of attributes in the predictor following robust principal component selection rules [Preisendorfer, 1988]. This is another future scope of the present study.

[74] Also, one should be cautious when a single GCM output is used for downscaling. Different outcomes may be expected if the present analysis is performed with other GCMs. This results in GCM uncertainty [Ghosh and Mujumdar, 2007; Mujumdar and Ghosh, 2008; Ghosh and Mujumdar, 2009], the modeling of which, after downscaling, is a potential research area and may be considered as the future scope of the present work.

[75] The main advantage of the present downscaling technique for the generation of multisite rainfall sequences is its ability to capture the variability of the predictand at a single location and also the ability to capture the spatial dependence of rainfall field. This is demonstrated with the application of the present nonparametric kernel regression modeling framework for downscaling multisite rainfall at eight locations in Mahanadi basin.

[76] Despite the data limitations mentioned above, the present study unveils rainfall trends in the Mahanadi River basin at a finer spatial scale and projects monsoon rainfall patterns incorporating the spatial variability of the trend, which is vital for planning and management of precious water resources in the study area. The results also corroborate the heterogeneous nature of rainfall patterns in the study area. The generated rainfall sequences for future emission scenarios will form a valuable input to study the impact of climate change on local hydrology. The present downscaling technique is less computationally intensive when compared to other parametric/nonparametric methods. It uses the gridded rainfall data only for the identification of rainfall states in the Mahanadi basin. However, there is scope for the inclusion of other climate variables in the identification of rainfall states, which can show a greater variability in the cluster centroid and thus improve the occurrence and amounts of rainfall predicted by the present downscaling technique.


Notation

objective function used in the k-means clustering algorithm.
ith feature vector in the n-dimensional attribute space.
squared Euclidean distance between feature vectors.
number of clusters.
number of feature vectors in cluster k.
amount of rainfall at station j on ith day assigned to cluster k.
mean value of rainfall at station j for cluster k.
rainfall state at time t.
climate predictors at time t.
rainfall on tth day.
state of atmospheric variables on tth day.
goodness of fit statistic used for evaluation of CART model.
kernel regression estimator.
conditional PDF of Y given X = x.
marginal PDF of X.
multivariate generalization of the Nadaraya-Watson estimator.
multivariate PDF by their kernel density estimates.
kernel density estimate of [symbol].
kernel density estimate of marginal PDF of X.
Kh(Yi − y): kernel function of the predictand.
κH(Xi − x): kernel function of the predictor.
number of observations in the training data set.
Xi: ith observation of predictor in the training data set.
Yi: ith observation of predictand in the training data set.
h or H: smoothing parameter or bandwidth of respective kernel function.
asymptotic mean integrated square error.
Hessian matrix of second partial derivatives of kernel density estimate function f.
2-D squared L2 norm of [symbol].


[77] The work presented in this article is funded by Space Technology Cell (STC, Indian Space Research Organization, and IIT Bombay). The authors sincerely thank the Editor, the Associate Editor, and the three anonymous reviewers for reviewing this manuscript and providing constructive comments to improve the quality. The first author wishes to express his gratitude to I. D. Gupta, Director, and M. M. Kshirsagar, Senior Research Officer, Central Water and Power Research Station, Khadakwasla, Pune, for their constant encouragement and support.