Uncertainty resulting from multiple data usage in statistical downscaling



Statistical downscaling (SD), used to derive regional climate projections from coarse-resolution general circulation model (GCM) outputs, is subject to uncertainties resulting from multiple models. Here we identify another source of uncertainty, resulting from the use of multiple observed and reanalysis data products in model calibration. In training the SD model for Indian Summer Monsoon Rainfall (ISMR), we use two reanalysis data sets as predictors and three gridded ISMR data products from different sources. We observe that the uncertainty resulting from the six possible training options is comparable to that resulting from multiple GCMs. Although the original GCM simulations project a spatially uniform increase in ISMR by the end of the 21st century, SD does not reproduce this and instead projects spatially heterogeneous changes of mixed sign. This is due to differences in the statistical relationship between rainfall and predictors in GCM simulations and in observed/reanalysis data; SD relies on the latter.

1 Introduction

General circulation models (GCMs) are reported to simulate precipitation with low accuracy [Hughes and Guttorp, 1994; Mehrotra and Sharma, 2006] because of their coarse spatial resolution [Ghosh and Mujumdar, 2007; Gutowski et al., 2007]. The spatial resolutions at which GCMs operate (generally coarser than 1.8°) directly hamper the accuracy of rainfall projections at regional scales, since subgrid features (topography, cloud physics, and land surface processes) that influence rainfall are often not properly represented in the models. Furthermore, rainfall projections at coarse spatial resolution may not be suitable for impact assessment at regional scales, which underscores the need for downscaling coarse-resolution projections to high resolution. Downscaling simulates fine-resolution processes (e.g., precipitation) from the coarse-resolution variables simulated by a GCM. Statistical downscaling (SD) [Wilby et al., 2004] is a computationally efficient downscaling technique based on the assumption that regional climate is conditioned by two factors: the large-scale climatic state and “regional/local” physiographic factors (topography, land use, etc.) [Wilby et al., 2004]. Following this principle, SD first derives the statistical relationship between large-scale climatic factors (predictors) and regional target variables (predictands) from observations. This relationship is region specific and implicitly accounts for the regional factors. The statistical model is then applied to bias-corrected GCM simulations to project regional climate. This procedure is also known as the perfect prognosis (perfect prog) approach [Maraun et al., 2010]. The statistical downscaling used in this analysis is a transfer function-based approach in which linear regression is used to develop the relationship between predictors and predictand.

Climate change projections obtained with downscaling are associated with uncertainties [Huth, 2004; Ghosh and Mujumdar, 2007; Mujumdar and Ghosh, 2008] that comprise intermodel (multiple GCMs) uncertainty [Tebaldi et al., 2004], intramodel (multiple runs of the same GCM) uncertainty [Stainforth et al., 2007], scenario uncertainty [Wilby and Harris, 2006], and downscaling (multiple downscaling methods) uncertainty [Ghosh and Katkar, 2012]. A reliable and robust climate change projection must consider all sources of uncertainty. Developing the statistical relationship in SD methods requires observed data. For synoptic-scale predictor variables, reanalysis data are often used as a proxy for observed data [Wilby et al., 2004; Kannan and Ghosh, 2013]. Observed station-level or gridded data are used for predictands. Given the availability of multiple sources for both reanalysis and observed data [Collins et al., 2013], here we assess the uncertainty in downscaled simulations resulting from the use of different reanalysis and observed gridded data products. The model is applied to Indian monsoon rainfall at 0.5° resolution. Details of the data and method used for this analysis are presented in the next section.

2 Data and Methods

The data required for statistical downscaling are monthly large-scale predictors (from reanalysis data as well as GCM output) and a monthly local-scale predictand, which here is rainfall. The reanalysis data used here are the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis [Kalnay et al., 1996] and the ERA-Interim reanalysis [Dee et al., 2011] of the European Centre for Medium-Range Weather Forecasts. For Indian Summer Monsoon Rainfall (ISMR), we use three gridded data products, all at 0.5° resolution: the India Meteorological Department (IMD) product [Rajeevan et al., 2006; Rajeevan and Bhate, 2008]; the Asian Precipitation-Highly-Resolved Observational Data Integration Towards Evaluation (APHRODITE) product [Yatagai et al., 2012]; and the University of Delaware product, referred to here as UoD (UDel_AirT_Precip data provided by the NOAA/Oceanic and Atmospheric Research/Earth System Research Laboratory Physical Science Division, Boulder, Colorado, USA, from their Web site at http://www.esrl.noaa.gov/psd/). It should be noted that the quality of the gridded data sets is not the same; it depends on the number of stations used as well as on the interpolation technique applied. IMD uses more station data than the other two gridded rainfall products. The two reanalysis and three rainfall data sets, for a concurrent period of 27 years from 1979 to 2005, result in six training options, which are used in the present work.

The SD model used here (Figure S1 in the supporting information) involves bias correction of predictors, principal component analysis for dimensionality reduction, and linear regression to obtain the relationship between principal components and rainfall at individual grid points (Text S1). The predictors selected are mean sea level pressure (MSLP), specific humidity, air temperature, and zonal and meridional wind speeds at the surface and at 500 hPa. The spatial extents of the predictors differ across IMD meteorological zones (Figure S2 and Text S1). The principal components that together explain at least 85% of the predictor variance are used in the linear regression of the SD model. To test the model, we use threefold cross validation, in which the entire 27 years of data (1979–2005) are divided into three equal parts. Two subsets are used for training and one for validation, and this is repeated three times to cover all possible combinations. The statistical relationships thus developed, with all the training options, are then applied to 10 GCM simulations (Table S1) from the Coupled Model Intercomparison Project Phase 5 (CMIP5) suite. GCM simulations are bias corrected with standardization [Wilby et al., 2004]. The uncertainties resulting from multiple data sources (six training options) and from multiple CMIP5 models are quantified with the variance of changes. The changes in projected mean rainfall for 2070–2099, with respect to the historic period 1979–2005, are first obtained for all 10 GCMs with the six training options (60 combinations in total). For the ith GCM and jth training option, the change is denoted as Xij. For the ith GCM, the data uncertainty is computed as the variance of the changes Xij across all training options j. The mean data uncertainty is then computed as the average of these variances across all GCMs i. Similarly, for the jth training option, the GCM uncertainty is computed as the variance of the changes Xij across all GCMs i. The mean GCM uncertainty is then computed as the average of these variances across all training options j.
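
The two variance-based measures described above can be illustrated with a minimal sketch. It assumes the changes Xij are stored in an array whose first axis indexes GCMs and whose second axis indexes training options; any further axes (e.g., latitude and longitude) are carried through so that the measures are obtained grid by grid. The function name `uncertainty_from_changes` and the use of the sample variance (`ddof=1`) are assumptions made for illustration; the text does not state which variance estimator was used.

```python
import numpy as np

def uncertainty_from_changes(changes, ddof=1):
    """Return (data_uncertainty, gcm_uncertainty).

    changes[i, j, ...] is the projected change X_ij for GCM i and
    training option j; trailing axes (e.g., grid points) are optional.
    ddof=1 (sample variance) is an assumption, not taken from the text.
    """
    changes = np.asarray(changes, dtype=float)
    # Data uncertainty: variance across training options j for each GCM i,
    # then averaged across GCMs.
    data_unc = changes.var(axis=1, ddof=ddof).mean(axis=0)
    # GCM uncertainty: variance across GCMs i for each training option j,
    # then averaged across training options.
    gcm_unc = changes.var(axis=0, ddof=ddof).mean(axis=0)
    return data_unc, gcm_unc

# Synthetic example: 10 GCMs, 6 training options, a 4 x 5 grid.
rng = np.random.default_rng(0)
synthetic = rng.normal(size=(10, 6, 4, 5))
data_unc, gcm_unc = uncertainty_from_changes(synthetic)
print(data_unc.shape, gcm_unc.shape)
```

Keeping the grid axes in the array means the same call gives either a single pair of numbers (for a regional mean) or a pair of maps (for grid-wise changes).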

3 Results and Discussion

The SD model is first validated with the threefold cross-validation technique. The mean root-mean-square error (RMSE) and R2 values obtained with the threefold cross validation are presented in Figures S3 and S4. In general, the skill of the SD models appears reasonable; however, for IMD gridded rainfall, the RMSE is higher than for the APHRODITE and UoD products, whichever reanalysis data set is used for training. Errors are found to be high in the projections for the Western Ghats and northeast regions, which also receive high rainfall.
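
As an illustration of the validation procedure, the following sketch fits a principal component regression on two thirds of the record and evaluates it on the remaining third, for each of the three folds. It is a simplified stand-in, not the model of Text S1: the function names are hypothetical, the folds are assumed to be contiguous blocks, the predictors are standardized with the training-period statistics, and R2 is taken as the coefficient of determination, none of which is specified in the text.

```python
import numpy as np

def fit_pc_regression(X, y, var_threshold=0.85):
    """Standardize predictors, keep the leading PCs that explain at least
    var_threshold of the variance, and regress y on them."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)          # guard against constant columns
    Z = (X - mu) / sd
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)     # cumulative explained variance
    k = min(int(np.searchsorted(cum, var_threshold)) + 1, len(s))
    A = np.column_stack([np.ones(len(Z)), Z @ Vt[:k].T])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"mu": mu, "sd": sd, "axes": Vt[:k], "coef": coef}

def predict(model, X):
    """Apply a fitted model to new predictor values."""
    Z = (X - model["mu"]) / model["sd"]
    A = np.column_stack([np.ones(len(Z)), Z @ model["axes"].T])
    return A @ model["coef"]

def threefold_cv(X, y, var_threshold=0.85):
    """Return a list of (RMSE, R2) pairs, one per validation fold."""
    folds = np.array_split(np.arange(len(y)), 3)   # contiguous thirds (assumed)
    scores = []
    for k in range(3):
        val = folds[k]
        train = np.concatenate([folds[m] for m in range(3) if m != k])
        model = fit_pc_regression(X[train], y[train], var_threshold)
        resid = y[val] - predict(model, X[val])
        rmse = float(np.sqrt(np.mean(resid**2)))
        ss_tot = np.sum((y[val] - y[val].mean())**2)
        r2 = float(1.0 - np.sum(resid**2) / ss_tot)
        scores.append((rmse, r2))
    return scores

# Synthetic example: 27 samples, 8 predictors, noisy linear predictand.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(27, 8))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=27)
for rmse, r2 in threefold_cv(X_demo, y_demo):
    print(f"RMSE={rmse:.3f}  R2={r2:.3f}")
```

In the study the regression is fitted separately at each rainfall grid point; the sketch shows a single predictand series for brevity.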

Indian rainfall has significant spatial variability due to differences in orography and local physiographic factors, and all three rainfall data products exhibit this variability (Figures S5a–S5c), with high rainfall amounts in the Western Ghats and northeast India. GCMs, being coarse-resolution models, fail to simulate fine-resolution geophysical processes, resulting in biases of more than 50% in multimodel CMIP5 projections (Figures S5d–S5f). The linear regression-based SD approach reduces these errors for the multimodel CMIP5 average with all six training options (Figures S5g–S5l). The errors in the downscaled simulations are of the same order in all cases, although the errors obtained with IMD rainfall (Figures S5g and S5j) are higher and more heterogeneous than the others. This is probably attributable to the use of more station data in generating the IMD gridded product than in APHRODITE or UoD. The skill seems to cluster according to the choice of gridded rainfall product rather than the choice of reanalysis data.

Future (2070–2099) projections of the multimodel ensemble mean show spatially varying changes of monsoon rainfall in India (with respect to 1979–2005), and such spatial heterogeneity is observed for all the training options (Figure 1). The spatial heterogeneity is greater where IMD gridded data are used to calibrate the SD model. The projections obtained with relationships calibrated on different data sets disagree, even showing opposite signs in a few regions. We find that downscaling models trained with the same reanalysis predictors but different rainfall products as predictand project similar changes, as is evident from the subplots in the same row of Figure 1, whereas the projections differ between rows. This is due to the differences that exist between the values of predictor variables obtained from different reanalysis data. The differences in mean and standard deviation of the predictor variables (for the central Indian region) between NCEP/NCAR and ERA-Interim are presented in Figure S6. Such differences lead to different relationships between predictors and predictands and are further transmitted to and reflected in the projected changes.

Figure 1.

Changes in monsoon rainfall (differences between the periods 2070–2099 and 1979–2005), as projected by downscaled multimodel average projections. The six training options used in this study are (a) NCEP-IMD, (b) NCEP-APHRODITE, (c) NCEP-UoD, (d) ERA-Interim-IMD, (e) ERA-Interim-APHRODITE, and (f) ERA-Interim-UoD.

We also find that the downscaled changes of future monsoon rainfall are spatially heterogeneous, which does not agree with the original projections simulated by coarse-resolution GCMs. The original GCM projections of rainfall show spatially uniform increases. Similar spatial heterogeneity in projected changes has also been observed with other downscaling models (both statistical and dynamical) by Rupa Kumar et al. [2006], Krishna Kumar et al. [2011], Ashfaq et al. [2009], Dobler and Ahrens [2011], and Salvi et al. [2013]. Here we investigate the reason behind this dissimilarity and observe that it stems from differences in the partial correlation between predictors and predictand in observed and GCM-simulated data.

The projections for 2070–2099, as simulated by the multimodel average of GCMs, show an increase in ISMR over almost the entire country (Figure 2a). To understand the changes in the relationship between predictors and predictands in GCM simulations, we first obtain the relationship between the predictors and the interpolated predictand, both simulated by GCMs during 1979–2005, and then apply it to the predictors for the future (2070–2099) as simulated by the same GCMs. The result (Figure 2b) does not show increasing changes over the entire country, although it is smoother than the statistically downscaled projections calibrated with IMD and NCEP/NCAR data for the historic period (1979–2005) (Figure 2c). Figures 1a and 2c are the same but are plotted with different color bars for comparison. Individual GCMs show similar results, as illustrated by the MIROC simulations (Figures 2d–2f). To analyze this further, we obtain the partial correlations between the principal components of the predictors and the local predictand, which are used in the computation of the regression coefficients. The partial correlations between the first principal component of mean sea level pressure (MSLP) and fine-resolution rainfall over central India (Figure 2g) are presented in Figures 2h–2j for MIROC historical (1979–2005), MIROC future (2070–2099), and observed (IMD-NCEP/NCAR) data, respectively. These figures show two critical findings. First, the partial correlations between predictor and predictands in the MIROC simulations differ from those observed, possibly because of the GCMs' inability to model fine-resolution processes. The same is observed with other GCMs, and similar figures are presented in Figure S7. Second, there are dissimilarities between the partial correlation fields of the historical and future simulations, indicating possible changes in the relationship between predictors and predictands. Statistical downscaling has the limitation that it assumes stationarity in the relationship between predictors and predictands, and hence, the downscaled outputs should be used with caution.
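
For readers unfamiliar with the diagnostic, a partial correlation can be computed by removing the linear influence of the remaining predictors from both variables and correlating the residuals. The sketch below is a generic illustration under that definition; the function name `partial_correlation` is hypothetical, and treating the other retained principal components as the control variables is an assumption, since the text does not list the controls.

```python
import numpy as np

def partial_correlation(x, y, controls):
    """Correlation between x and y after regressing out `controls`.

    x, y: 1-D arrays of equal length (e.g., first PC of MSLP and rainfall
    at one grid point). controls: 1-D or 2-D array with one column per
    control variable (assumed here to be the other retained PCs).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    C = np.column_stack([np.ones(len(x)), controls])   # intercept + controls
    # Residuals of x and y after least-squares regression on the controls.
    rx = x - C @ np.linalg.lstsq(C, x, rcond=None)[0]
    ry = y - C @ np.linalg.lstsq(C, y, rcond=None)[0]
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Synthetic example: rainfall depends on pc1 and on a second component pc2.
rng = np.random.default_rng(2)
pc1 = rng.normal(size=200)
pc2 = rng.normal(size=200)
rain = 0.8 * pc1 - 1.5 * pc2 + rng.normal(size=200)
print(partial_correlation(pc1, rain, pc2))
```

Repeating the calculation at every rainfall grid point yields a partial correlation field of the kind compared in Figures 2h–2j.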

Figure 2.

Differences in the projections between original and downscaled GCM simulations. The multimodel averages of (a) the original GCM projections and (c) the downscaled projections differ when the downscaling model is calibrated with NCEP/NCAR reanalysis and IMD gridded rainfall data. (b) Results obtained when the downscaling model is calibrated with GCM-simulated predictors and rainfall. The historical calibration period is 1979 to 2005; future projections are made for 2070 to 2099. (d–f) Similar plots for an individual GCM, shown here for MIROC. In regression, the coefficients are controlled by partial correlation coefficients, which are obtained for (g) the central India rainfall field with the first principal component of MSLP. The partial correlation coefficients obtained with (h) historical (1979 to 2005) GCM simulations, (i) future (2070 to 2099) GCM simulations, and (j) observed data (1979 to 2005) are also presented.

The uncertainty resulting from different data products and GCMs is quantified in terms of variances across GCMs and calibration sets. The uncertainty is first computed for individual grids. The data uncertainty is presented in Figure 3a; it is estimated to be higher than that from multiple GCMs (Figure 3b) at various locations in the north, south, and northeastern hilly regions. The high uncertainty in the northeast hilly region is due to the inadequate number of rain gauge stations used for generating the gridded rainfall products. The mountainous/hilly regions have significant spatial as well as rainfall heterogeneity and need more gauging stations. GCM uncertainty has been addressed extensively in the scientific literature; however, the uncertainty resulting from different observed/reanalysis data products in downscaling has remained unexamined. Further, when we combine both of these sources of uncertainty, we observe large uncertainties (Figure 3c), which must be considered before using the downscaled projections in impacts assessment. The combined uncertainty is estimated to be high only in those regions where the data uncertainty is very high. This indicates that data uncertainty is a major source of uncertainty for downscaled projections. It is also important to note that the SD models are calibrated with post-1979 data, which are partially based on satellite products, and hence, the reanalysis products are expected to show fewer disagreements. However, this is not reflected in the uncertainty of the projected rainfall.

Figure 3.

Uncertainty resulting from (a) multiple data sources, (b) multiple GCMs, and (c) both.

To present a region-wise estimate of uncertainty, we aggregate the grid-wise changes over each IMD meteorological region into a spatial probability density function (pdf) for each GCM and training option (Figure 4). As seen from Figure 4, data uncertainty mainly stems from the different reanalysis products; we therefore use two different colors for the pdfs obtained with the two reanalysis calibration sets. We observe that for the north, south, and northeastern hilly regions, the differences in changes between the projections obtained with different reanalysis data are prominent. This is also seen in Figure 3a. Region-specific uncertainty estimates due to the use of different rainfall products may also be large but are not shown in Figures 4a–4g. To understand the sensitivity of these uncertainty estimates to the selection of training periods, we generate 12 sets by completely random selection of training and validation data, each with 18 years for training and 9 years for validation. The data and GCM uncertainties obtained from these 12 sets are presented as box plots of regional averages (Figure 4h). They consistently show higher data uncertainty than GCM uncertainty for all regions. The difference is highest for the northeast hilly region.
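
The random selection of training and validation years used in this sensitivity test can be sketched as follows. This is only an illustration of drawing 12 random 18/9 splits of the 27 years; the function name `random_year_splits` and the fixed seed are hypothetical, and the text does not state whether the random sets were constrained in any other way.

```python
import numpy as np

def random_year_splits(years, n_sets=12, n_train=18, seed=0):
    """Return n_sets (train_years, validation_years) pairs drawn at random.

    Each split uses n_train years for training and the remaining years
    for validation. The seed is only for reproducibility of the example.
    """
    rng = np.random.default_rng(seed)
    years = np.asarray(years)
    splits = []
    for _ in range(n_sets):
        shuffled = rng.permutation(years)
        splits.append((np.sort(shuffled[:n_train]), np.sort(shuffled[n_train:])))
    return splits

splits = random_year_splits(np.arange(1979, 2006))
train_years, val_years = splits[0]
print(len(splits), len(train_years), len(val_years))
```

Recomputing the data and GCM uncertainties for each split gives the 12 values per region that are summarized by the box plots.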

Figure 4.

Region-wise estimation of uncertainty from different reanalysis data. The pdfs of grid-wise changes in mean rainfall, obtained with all training options and GCMs, for (a) central, (b) Jammu Kashmir, (c) northeast, (d) northeast hills, (e) north, (f) south, and (g) west zones of India. (h) The sensitivity of these uncertainty estimates to the selection of training periods for all the zones, presented as box plots. Figures 4a–4g show the differences in downscaled projections due to the use of multiple reanalysis data as predictors in calibration.

4 Conclusions

Our results highlight the sensitivity of downscaled projected changes of Indian monsoon rainfall to the selection of data. The literature on uncertainty assessment in climate modeling deals with multimodel uncertainty, scenario uncertainty, or downscaling uncertainty. Here we identify another source of uncertainty, resulting from the use of multiple reanalysis and rainfall data during the training of models. The conclusions derived from the analysis are as follows:

  1. The uncertainty resulting from the use of multiple reanalysis and rainfall data is of higher magnitude than that assessed from multiple GCMs. Consideration of such uncertainty is essential for impacts assessment, as changes in data can even lead to opposite signs of projected changes of rainfall.
  2. The changes in the downscaled projections are observed to differ from those in the original GCM simulations.
  3. Statistical downscaling suffers from the assumption of stationarity in the statistical relationship between predictor and predictand. GCM simulations show the possibility of changes in the relationship between predictors and predictand.
  4. The relationships observed between predictors and predictands in the original GCM simulations are also not reliable, as there is little agreement among the multiple GCM simulations of partial correlation coefficients between predictors and predictand.
  5. A systematic study and design of experiments [Duan et al., 2012; Hertig and Jacobeit, 2013] are necessary to examine the validity of downscaling models in a changed climate.

Our results suggest that regional modelers need to be aware of the uncertainty arising from the use of multiple data products during calibration of downscaling models and should test the validity of the assumption of stationarity between predictor and predictand in a systematic way [Duan et al., 2012; Hertig and Jacobeit, 2013] before using the models for impacts assessment.


The reanalysis data were obtained from http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html and http://www.ecmwf.int/products/data/archive/descriptions/e4/. The APHRODITE and UoD rainfall data are available online, while the IMD data were purchased from the India Meteorological Department. The authors sincerely thank the Editor and the two anonymous reviewers for their suggestions, which improved the quality of the work.

The Editor thanks two anonymous reviewers for their assistance in evaluating this paper.