## 1. Introduction

[2] Global warming and its associated climate change are expected to have major impacts on ecosystems, agriculture, and human society that are sensitive to changes in precipitation. The tremendous importance of water in both society and nature necessitates an understanding of how any change in global climate could affect regional water availability. The potential effects of climate change on regional hydrology are assessed by comparing future hydrologic scenarios, derived from the simulations of general circulation models (GCMs), with the observed case. GCMs are the most credible tools for simulating the response of the global climate system to increased greenhouse-gas concentrations, and they provide current and future time series of climate variables for the entire globe [*Prudhomme et al*., 2002; *Intergovernmental Panel on Climate Change-Task Group on Scenarios for Climate Impact Assessment*, 1999]. Although they capture large-scale circulation patterns and model smoothly varying fields such as surface pressure, GCMs often fail to reproduce nonsmooth fields such as precipitation [*Hughes and Guttorp*, 1994]. In addition, the spatial scale on which GCMs operate (e.g., 3.75° latitude × 3.75° longitude for the coupled global circulation model (CGCM2)) is very coarse for hydrological modeling purposes [*Prudhomme et al*., 2003]. GCMs also have limited skill in resolving subgrid-scale features such as convection and topography [*Xu*, 1999]. Hence, while the impact of greenhouse gases on large-scale atmospheric circulation is well understood, regional changes in the hydrological cycle are far more uncertain in GCM simulations. Downscaling is therefore necessary to model regional-scale climatic/hydrologic variables such as evapotranspiration, precipitation, soil moisture, and river runoff at a smaller scale, based on the large-scale GCM outputs.

[3] Downscaling techniques can broadly be classified into dynamic and statistical techniques. Dynamic downscaling involves nesting a high-resolution regional climate model within the coarser grid of the GCM [*Jones et al*., 1995]. Statistical downscaling techniques instead construct a parametric/nonparametric and/or linear/nonlinear relationship between large-scale atmospheric predictor variables and the regional climate variable(s) of interest (the predictand) [*Wilby et al*., 1998] in order to simulate future climatological/hydrological scenarios. Statistical downscaling is primarily based on the view that regional climate is conditioned by the large-scale climate pattern and by regional/local physiographic features such as topography, land use, land cover, distance to coast, and land-sea distribution [*von Storch*, 1995, 1999]. The underlying concept is that there exists a link between large-scale climate phenomena and local climatic/meteorological conditions. Statistical downscaling techniques are popular because they are computationally simple and can easily be modified for application to new regions of study. Their wide application is also attributed to their minimal use of climate predictors for forecasting the regional climate variables of interest. However, a major presumption involved in statistical downscaling is that the statistical relationship, once developed between the large-scale predictor field(s) and the local-scale variable, is not altered by climate change [*Easterling*, 1999]. Thorough reviews of downscaling concepts, prospects, and limitations can be found in the works of *von Storch* [1995], *Hewitson and Crane* [1996], *Wilby and Wigley* [1997], *Gyalistras et al*. [1998], *Murphy* [1999], *Haylock et al*. [2006], *Hanssen-Bauer et al*. [2005], and *Christensen et al*. [2007].

[4] Statistical downscaling techniques developed so far can principally be grouped into three categories, namely, weather classification/weather typing [e.g., *Hay et al*., 1991; *Bardossy and Plate*, 1992; *Corte-Real et al*., 1995; *Conway and Jones*, 1998; *Schnur and Lettenmaier*, 1998], regression/transfer function [*Murphy*, 1999; *von Storch et al*., 1993; *Crane and Hewitson*, 1998; *Bardossy et al*., 1995], and weather generators [*Hughes et al*., 1993; *Hughes and Guttorp*, 1994; *Wilks*, 1999; *Khalili et al*., 2009]. The weather typing and transfer function based approaches, generally known as perfect prognosis (PP) downscaling, establish a relationship between observed large-scale predictors and observed local-scale predictands. Applying these relationships to GCM output is justified if the predictors from GCMs are realistically simulated [*Kalnay*, 2003; *Wilks*, 2006]. The main constituents of PP downscaling techniques are the selection of large-scale predictors [*Huth*, 1996, 1999; *Wilby and Wigley*, 1997; *Wilby et al*., 1998; *Charles et al*., 1999; *Timbal et al*., 2008] and the development of a statistical model that establishes a link between large-scale climate predictors and local-scale predictands. Using a high-dimensional, correlated predictor field in statistical downscaling may sometimes lead to overfitting or to valuable information being ignored. Hence, methods such as principal component analysis (PCA) [*Preisendorfer*, 1988; *Hannachi et al*., 2007], canonical correlation analysis [*Huth*, 1999; *von Storch and Zwiers*, 1999; *Widmann*, 2005; *Tippett et al*., 2008], and physically/meteorologically motivated transformation techniques [*Jones et al*., 1993; *Michelangeli et al*., 1995; *Wilby and Wigley*, 2000; *Stephenson et al*., 2004; *Philipp et al*., 2007] are employed to reduce the dimensionality of the predictor field.
Some of the widely used statistical downscaling models are based on linear regression [*Karl et al*., 1990], the generalized linear model (GLM) [*Dobson*, 2001], the generalized additive model [*Hastie and Tibshirani*, 1990; *Vrac et al*., 2007b], and the vector GLM [*Yee and Wild*, 1996; *Yee and Stephenson*, 2007; *Maraun et al*., 2010a]. The past decade witnessed a growth of objective weather typing based techniques using clustering and classification algorithms [*Plaut and Simonnet*, 2001; *Bárdossy et al*., 2005; *Casola and Wallace*, 2007; *Wehrens and Buydens*, 2007; *Leloup et al*., 2008; *Vrac et al*., 2007a; *Rust et al*., 2010]. Local-scale precipitation in this case is downscaled using a linear model conditioned on weather types [*Maraun et al*., 2010b]. Weather generators, in contrast, are statistical models that generate random sequences of weather variables that preserve the statistical properties of observed weather [*Richardson*, 1981; *Richardson and Wright*, 1984; *Wilks*, 1998; *Allcroft and Glasbey*, 2003; *Mason*, 2004; *Kilsby et al*., 2007].
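
The predictor-reduction step described above can be illustrated with a minimal PCA sketch in Python. It is not taken from any of the cited studies: the function name `pca_reduce`, the 95% variance threshold, and the synthetic predictor matrix are assumptions introduced here purely for illustration.

```python
import numpy as np

def pca_reduce(X, var_target=0.95):
    """Project standardized predictors onto the leading principal components.

    X is an (n_days, n_predictors) array; columns are assumed to have
    nonzero variance so that standardization is well defined.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions; s holds the singular values.
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Smallest number of components whose cumulative variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), var_target) + 1)
    k = min(k, len(s))
    return Xs @ Vt[:k].T, Vt[:k], explained[:k]

# Synthetic, strongly correlated predictor field (illustrative only):
# six columns driven by two latent factors plus small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, -0.3, 0.8],
                   [0.2, -1.0, 0.7, 0.4]])
X = np.hstack([base, base @ mixing + 0.01 * rng.normal(size=(200, 4))])

scores, components, explained = pca_reduce(X)
```

In this synthetic case the six correlated predictors collapse to two principal components, which would then replace the raw predictor field as inputs to a downscaling model.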

[5] Although progress has been made in the development of statistical downscaling techniques, especially for simulation of rainfall, challenges still exist in representing realistic levels of interannual variability in the generated sequences [*Katz and Zheng*, 1999; *Wilby et al*., 2004], generating multisite sequences with realistic spatial dependence, representing extreme behavior accurately, and simulating complex dynamical structures within a relatively cheap computational framework [*Wheater et al*., 2005]. Perhaps the biggest challenge is the representation of spatial dependence in rainfall occurrence, particularly for a larger region [*Yang et al*., 2005]. Weather classification methods have limited success in reproducing the persistence characteristics of at-site wet spells (WS) and dry spells (DS) [*Wilby*, 1994]. Weather state based models such as nonhomogeneous hidden Markov models (NHMMs) [*Hughes and Guttorp*, 1994; *Hughes et al*., 1999; *Charles et al*., 1999, 2004; *Vrac and Naveau*, 2007; *Vrac et al*., 2007c] and nonparametric NHMMs [*Mehrotra and Sharma*, 2005] overcome the poor performance of weather classification methods in simulating the spatial variability of daily precipitation by identifying distinct patterns in the multisite daily precipitation and by capturing the temporal variability through persistence in the weather states. *Mehrotra and Sharma* [2007] developed a semiparametric model, which uses a two-state first-order Markov model for multistation rainfall occurrence and a kernel density estimator for generation of rainfall amounts conditioned on the occurrence of rainfall. Weather state-based generative models such as hidden Markov models (HMMs) evaluate the joint distribution by making use of the independence assumption that each hidden state depends only on its immediate predecessor and that each observation variable depends only on the current state [*Raje and Mujumdar*, 2009].
A recently developed conditional random field (CRF) downscaling model, as reported by *Raje and Mujumdar* [2009], does not require assumptions about the independence of atmospheric variables or their distribution, unlike HMMs. This property enables CRF models to utilize the entire sequence of observations for predicting output. However, a major disadvantage of CRF-based models, as reported by *Raje and Mujumdar* [2009], is the large number of parameters required to maintain the spatial and temporal structures, together with the intensive computation this entails. Other disadvantages include the discretization of precipitation into classes, which amounts to a loss of information, and subjectivity in the selection of feature functions. Furthermore, the CRF model is observed [*Raje and Mujumdar*, 2009] to fail in capturing spatial correlations, and it also fails to simulate mean conditions, resulting in larger deviations from the observed mean. This may be because the number of rainfall classes used in the CRF model is fixed heuristically, without any cluster validity test to confirm the number of classes the model requires. Other methods used are Bayesian hierarchical models [*Cooley et al*., 2007] and the analog method in a weather generator context [*Orlowsky et al*., 2008]. Nonparametric statistical downscaling techniques such as kernel density estimators [*Lall et al*., 1996; *Harrold et al*., 2003a, 2003b] or *k*-nearest neighbors (KNNs) [*Lall and Sharma*, 1996; *Rajagopalan and Lall*, 1999; *Harrold et al*., 2003a; *Yates et al*., 2003; *Mehrotra et al*., 2004; *Mehrotra and Sharma*, 2006], which belong to a subset of weather generators, are widely used for multisite daily precipitation.
A comprehensive review of downscaling techniques with a focus on recent developments in statistical downscaling, model output statistics, weather generators, and evaluation techniques to assess downscaling skill can be found in the work of *Maraun et al*. [2010b].
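
To make the two-state first-order Markov occurrence model mentioned in this paragraph concrete, the following Python sketch estimates the two transition probabilities from a wet/dry record and simulates a synthetic occurrence sequence. It is a generic single-site illustration under stated assumptions, not the multistation formulation of any cited study; the function names and the short `observed` record are hypothetical.

```python
import random

def fit_two_state_markov(wet):
    """Estimate P(wet | previous day dry) and P(wet | previous day wet).

    `wet` is a sequence of 0 (dry) and 1 (wet) daily indicators.
    """
    counts = {0: [0, 0], 1: [0, 0]}  # counts[previous][current]
    for prev, curr in zip(wet[:-1], wet[1:]):
        counts[prev][curr] += 1
    p01 = counts[0][1] / max(1, sum(counts[0]))
    p11 = counts[1][1] / max(1, sum(counts[1]))
    return p01, p11

def simulate_occurrence(p01, p11, n_days, start=0, seed=0):
    """Generate a wet/dry sequence from the fitted transition probabilities."""
    rng = random.Random(seed)
    state, series = start, []
    for _ in range(n_days):
        p_wet = p11 if state == 1 else p01
        state = 1 if rng.random() < p_wet else 0
        series.append(state)
    return series

# Hypothetical 12-day wet/dry record used only to exercise the functions.
observed = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1]
p01, p11 = fit_two_state_markov(observed)
synthetic = simulate_occurrence(p01, p11, n_days=30)
```

Because the transition probabilities here are constant, such a chain carries no exogenous climate signal; conditioning `p01` and `p11` on atmospheric covariates is one way to relax that limitation.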

[6] In summary, a majority of relevant studies based on models such as the space-time model [*Bardossy and Plate*, 1992; *Bogardi et al*., 1993], NHMMs [*Hughes and Guttorp*, 1994; *Hughes et al*., 1999; *Bellone et al*., 2000], and the CRF-based model [*Raje and Mujumdar*, 2009], developed for forecasting multisite daily precipitation, are reported to be modestly successful in simulating the spatial dependence of observed precipitation series. Also, statistical downscaling approaches based on the Markov model or its variants for rainfall occurrence, and simple or more complex probabilistic models for rainfall amounts, can partially account for the otherwise unexplained day-to-day variance in rainfall [*Katz and Parlange*, 1998; *Wilks*, 1999]. Markov-based models of daily rainfall cannot effectively reproduce the variability of a nonstationary climate, as these models do not consider exogenous climate predictors. Some researchers have allowed variations in the stochastic model parameters by conditioning on a covariate containing atmospheric signals [*Hughes and Guttorp*, 1994; *Hughes et al*., 1999; *Mehrotra et al*., 2004; *Katz and Parlange*, 1993; *Katz and Zheng*, 1999; *Wilks*, 1989]. A majority of researchers still use simple/complex probabilistic models for rainfall amounts. Recently, *Mehrotra and Sharma* [2010] adopted a variant of a probabilistic model to generate rainfall amounts using a nonparametric kernel density simulator conditional on the previous time step rainfall and selected exogenous atmospheric variables. This prompted us to develop a new approach that explicitly uses exogenous climate predictors for simulations of multisite rainfall amounts.

[7] In our study, we propose to model the rainfall state not at the station level but at the river basin level, using classification trees, as the occurrence is largely controlled by global circulation. The rainfall states derived are analogous to the weather states reported in the work of *Corte-Real et al*. [1999]. We also propose to develop a nonparametric kernel regression estimator to downscale multisite daily rainfall amounts conditional on the derived rainfall state. One of the major challenges in multisite downscaling is the modeling of spatial dependence [*Yang et al*., 2005]. In the present work, this is addressed with a novel approach, in which a rainfall state of the region (where multisite downscaling is performed) is first obtained with a classification and regression tree (CART) from the large-scale atmospheric circulation pattern; this state is hypothesized to represent the spatial pattern of that region. Conditional on this rainfall state, the kernel regression is performed at individual locations with rainfall as the predictand and large-scale climate variables as predictors. The novelty of this method is the inclusion of the river basin-scale rainfall state in simulations, which represents the spatial pattern; hence, individual simulations of the rainfall occurrence at sites are not required. The proposed multivariate kernel density function intrinsically holds the rainfall occurrence information at each site, and hence, occurrence does not need to be modeled explicitly. To demonstrate the applicability of the present model, the results are compared with those obtained with a recently developed model based on CRF [*Raje and Mujumdar*, 2009], a widely used stochastic weather generator developed by *Wilks* [1999], and a KNN approach-based model.

[8] Hence, the purpose of the present paper is twofold. First, we demonstrate that the proposed downscaling technique is adequate to capture the spatial dependence in the rainfall field sequence. We then aim to construct time series of multisite daily rainfall by downscaling outputs of CGCM3.1 for various emission scenarios in an Indian river basin. Major emphasis is given to the methodology for statistical downscaling in this work. The results of the study will be used to investigate climate change impacts on hydrological regimes in the river basin.

[9] An overview of the proposed statistical downscaling technique for downscaling precipitation from a GCM-projected circulation pattern is presented in Figure 1. The proposed technique predicts multisite rainfall in two stages: first, by predicting the rainfall state of the study area using large-scale atmospheric variables, and then by forecasting multisite rainfall amounts with the help of a multivariate kernel regression estimator conditioned on the rainfall state and large-scale atmospheric variables. Historical rainfall states for the river basin are classified from the observed rainfall field by applying a *k*-means clustering algorithm. The CART-based model builds a classification tree between the historic large-scale atmospheric circulation, coupled with the lag-1 rainfall state of the river basin (predictors), and the historic rainfall state of the river basin (predictand). Rainfall states for future emission scenarios are predicted using the classification tree with the GCM output. Precipitation amounts at multiple grid points are generated with the help of the multivariate kernel regression estimator, conditioned on the current-day rainfall state of the river basin and the current-day principal components of climate variables of the GCM output. The proposed statistical downscaling technique is applied in a case study of the Mahanadi River basin to generate multisite rainfall sequences for plausible emission scenarios.
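
The second stage outlined above, kernel regression conditioned on the rainfall state, can be sketched for a single site as follows. This is only a hedged illustration: it uses a standard Nadaraya-Watson estimator with a Gaussian kernel and a fixed bandwidth, which is an assumption and not necessarily the estimator developed in this paper, and it takes the rainfall states as given rather than predicting them with CART. The function names, the bandwidth value, and the toy data are all hypothetical.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)

def kernel_regression(x_query, state_query, X, states, y, bandwidth=1.0):
    """Nadaraya-Watson rainfall estimate using only days in the same state.

    X      : principal components of the predictors, one tuple per training day
    states : basin-level rainfall state label for each training day
    y      : observed rainfall at one site for each training day
    """
    num = den = 0.0
    for x_i, s_i, y_i in zip(X, states, y):
        if s_i != state_query:
            continue  # condition on the rainfall state
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_query, x_i)))
        w = gaussian_kernel(dist / bandwidth)
        num += w * y_i
        den += w
    if den == 0.0:
        raise ValueError("no usable training days in the requested state")
    return num / den

# Toy training set (illustrative values only, not data from the study).
X = [(-1.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.5, 0.5)]
states = ["wet", "wet", "dry", "dry"]
y = [10.0, 20.0, 0.0, 1.0]

wet_estimate = kernel_regression((0.0, 0.0), "wet", X, states, y)
dry_estimate = kernel_regression((0.0, 0.0), "dry", X, states, y)
```

For multisite use, the same call would be repeated for each site with that site's rainfall series as `y`, while the basin-level state and the predictor principal components are shared across sites.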