Percentile indices monitoring the frequency of moderate temperature extremes are widely used to assess changes in present and future temperature extremes because of their straightforward interpretation. While observed trends in such indices can be, and have been, compared with model-simulated trends, their definition relative to each model's own climatology inhibits their use for the evaluation of model-simulated temperature variability. This is unfortunate, as in many parts of the world, indices from observations remain the only source of publicly available information about extreme temperature variability. We approach this problem by introducing a novel adjustment to the standard method for deriving indices from climate models. This involves the removal of the bias in the mean annual cycle of the models and the use of percentile thresholds from a reference data set. We illustrate the technique by comparing daily minimum (TN) and maximum (TX) temperatures from the fifth phase of the Coupled Model Intercomparison Project (CMIP5) historical simulations with those from an observation-based data set and from the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) and the European Centre for Medium-Range Weather Forecasts (ERA-40) reanalyses. Biases in the annual cycle also translate into biases in the representation of the percentile indices in the models and reanalyses. Generally, percentile indices based on daily TX are well represented by the models and reanalyses compared to the observations. For percentile indices based on daily TN, however, large discrepancies occur, particularly between the reanalyses.
Indices for climate extremes are widely used to assess present and future climate change and its impacts (e.g., Tebaldi et al., 2006; Zhang et al., 2011; Orlowsky and Seneviratne, 2012; Seneviratne et al., 2012; Zwiers et al., 2013). They are also often used to monitor changes in extremes because, in many parts of the world, indices from observations remain the only source of publicly available information about temperature extremes (Klein Tank et al., 2009). Despite their wide use, a thorough evaluation of the representation of these indices in climate models has not yet been performed, apart from Sillmann et al. (2013). That study gives a first overview of the performance of the latest generation of global climate models participating in the fifth phase of the Coupled Model Intercomparison Project (CMIP5, Taylor et al., 2012) in simulating a wide variety of indices for climate extremes.
Here, we focus on one particular category of these indices, the percentile threshold-based indices (referred to as percentile indices below). These indices are associated with the exceedance rate of the 10th and 90th percentiles of the daily minimum (TN) or maximum (TX) near-surface temperature as determined from a standard base period (usually 1961–1990). Percentile indices are more statistically robust and spatially homogeneous than absolute value indices such as the annual maximum of TX or annual minimum of TN, which can be used to assess the most extreme tails of the temperature distribution (e.g., Kharin et al., 2013). Percentile indices do not describe very rare events, but they can be estimated robustly based on limited temporal availability of observations (Zhang et al., 2005).
As illustrated in Sillmann et al. (2013), there is fairly good agreement between CMIP5 models, reanalyses and observations in the temporal evolution of the globally averaged percentile indices based on the original temperature time series, particularly in the base period 1961–1990. Although we do not expect the models to perfectly match observed trends due to internal variability, we see consistent trends between the CMIP5 models and reanalyses compared to the observed changes documented in Donat et al. (2013), i.e., decreasing trends in TN10p and TX10p and increasing trends in TN90p and TX90p (Figure S1). The averaged magnitude of the original percentile indices is, however, largely independent of model deficiencies in simulating daily variability. It is, thus, difficult to evaluate models with these indices using standard performance measures as pointed out in Sillmann et al. (2013) because, by construction, the average threshold exceedance rate in the base period is approximately 10% for all models, reanalyses and observations (Figure 1).
Hence, to provide a means of evaluating model-simulated variability with the percentile indices, we suggest a bias-correction method in which the mean model bias is removed prior to computing the percentile indices. The bias-correction methodology proposed here is not meant to correct future climate projections based on mean biases in present-day climate simulations, as discussed in various studies on the bias-correction of global (e.g., Watanabe et al., 2012 and references therein) or regional climate models (e.g., Christensen et al., 2008; Boberg and Christensen, 2012; Räisänen and Räty, 2013). Rather, it is intended to enable a more meaningful evaluation of model-simulated temperature variability based on percentile indices.
2. Indices and data sets
2.1. Definition of percentile indices
The percentile indices as defined by the Expert Team on Climate Change Detection and Indices (ETCCDI) represent annual frequencies [%] of days for which TN or TX lie beyond their 10th or 90th percentiles as derived from the base period 1961–1990. The percentile thresholds for each calendar day are calculated from 5-day windows centered on that calendar day in all years in the base period using the bootstrapping method described in Zhang et al. (2005). We will consider indices that monitor the frequency of cold nights (TN10p) and cold days (TX10p), which describe the frequency with which TN or TX is below the 10th percentile threshold, as well as warm nights (TN90p) and warm days (TX90p), which describe the frequency with which TN or TX falls above the corresponding 90th percentile threshold. Observation-based data sets of these indices have fairly good global data coverage (e.g., GHCNDEX as described in Donat et al. (2013)), which is of particular importance for global climate model evaluation.
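The calendar-day threshold calculation described above can be sketched as follows. This is a simplified illustration, not the ETCCDI reference implementation: it pools a 5-day window centred on each calendar day across all base-period years and omits the out-of-base bootstrap of Zhang et al. (2005); the function name and the synthetic data are hypothetical.

```python
import numpy as np

def calendar_day_thresholds(daily, q=(10, 90), window=5):
    """Percentile thresholds for each of 365 calendar days, pooling a
    centred window of days across all base-period years (leap days ignored).

    daily : array of shape (n_years, 365) holding TN or TX.
    Simplified sketch: the out-of-base bootstrap of Zhang et al. (2005)
    is omitted.
    """
    n_years, n_days = daily.shape
    half = window // 2
    out = np.empty((len(q), n_days))
    for d in range(n_days):
        # 5-day window centred on calendar day d, wrapped at the year ends
        idx = np.arange(d - half, d + half + 1) % n_days
        pool = daily[:, idx].ravel()  # window days pooled over all years
        out[:, d] = np.percentile(pool, q)
    return out

# toy 30-year base period: sinusoidal annual cycle plus daily noise
rng = np.random.default_rng(0)
doy = np.arange(365)
cycle = 10.0 * np.sin(2.0 * np.pi * doy / 365.0)
tn = cycle + rng.normal(0.0, 3.0, size=(30, 365))
p10, p90 = calendar_day_thresholds(tn)
```

With 30 years and a 5-day window, each threshold rests on 150 values, which is what makes these moderate percentiles statistically robust despite limited record lengths.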
2.2. Data sets
We use model-simulated daily TN and TX from the CMIP5 multi-model ensemble as retrieved from the Earth System Grid (ESG) data portals. To illustrate our proposed methodology, we present results for an ensemble of 18 CMIP5 models (Table S1), considering only one ensemble member for each model. These simulations employ historical changes in anthropogenic forcing, which include time-evolving land cover information, and solar and volcanic forcing (Taylor et al., 2012).
As a basis for our comparison, we use percentile thresholds calculated from the National Climatic Data Center's Global Historical Climatology Network (GHCN)-Daily data set (Menne et al., 2012) that correspond to the percentile indices reported in the global observation-based extremes data set (GHCNDEX, Donat et al., 2013). Unlike the GHCNDEX percentile indices, which have limited global coverage (as shown later), the GHCN-Daily percentile thresholds have almost complete global coverage (except Antarctica). This is because the GHCN-Daily thresholds were gridded with a larger de-correlation length scale (DLS), the search radius used to select station data when calculating the weighted grid-box averages for both the GHCNDEX percentile indices (see Donat et al., 2013) and the GHCN-Daily percentile thresholds. A larger DLS, as used for the GHCN-Daily percentile thresholds, implies that more remote stations may contribute to a given grid-box value.
Owing to their complete global coverage and gridded format, reanalyses are often used to evaluate climate models. We consider two reanalyses that cover the ETCCDI-recommended base period 1961–1990 to show the effect of the choice of data set on the evaluation of percentile indices. ERA-40 reanalysis (ERA40, Uppala et al., 2005) 6-h near-surface temperature values were downloaded from the European Centre for Medium-Range Weather Forecasts (ECMWF) archive (http://www.ecmwf.int/products/data/archive/). The daily temperature extremes TN and TX are approximated by the minimum and maximum of the four instantaneous 6-h values available per day. The ERA40 data are available on a 144 × 73 (2.5° × 2.5°) longitude/latitude grid. Daily TN and TX values of the NCEP/NCAR Reanalysis 1 (NCEP1, Kalnay et al., 1996), produced by the National Centers for Environmental Prediction (NCEP) and the National Center for Atmospheric Research (NCAR), were downloaded from the Physical Sciences Division website of the National Oceanic and Atmospheric Administration (NOAA) Earth System Research Laboratory and are available on a 192 × 94 Gaussian grid (approximately 1.9° × 1.9°) (http://www.esrl.noaa.gov/psd/data/gridded/).
The percentile indices are first computed for each model and reanalysis on their native grids and then re-gridded to a common 144 × 73 (2.5° × 2.5°) grid for intercomparison, using the first-order conservative remapping procedure (Jones, 1999) implemented in the Climate Data Operators (https://code.zmaw.de/projects/cdo). The GHCNDEX indices were downloaded directly from http://www.climdex.org on the same 144 × 73 grid. As the analysis of temperature extremes is most relevant for populated land areas, it is performed for land grid points only, also noting that the observations are available only over land.
3.2. Bias-correction of model time series
The standard definition of the ETCCDI percentile indices expresses the indices relative to the percentiles of the data being evaluated. To evaluate model variability, we propose calculating percentile indices for models relative to thresholds derived from observations rather than from the models themselves. Model exceedance rates calculated relative to the observed percentile thresholds can deviate substantially from the nominal rate of approximately 10% for two main reasons. First, the model mean climate may deviate from that in the observations. Second, the model variability about the mean climate, which determines the width and shape of the temperature distribution, may also differ from the observed variability. For the relatively mild extremes described by the 10th and 90th percentiles, biases in the width (standard deviation) of the distribution are likely to be more important than possible biases in the shape (skewness, kurtosis). While differences in the mean climate are readily identified using standard methods, we are primarily interested in the aspects of model evaluation that relate to differences in variability about the mean (i.e., threshold-crossing behaviour). The simple solution we adopt is to correct the bias in the model mean climate prior to calculating exceedance frequencies relative to observed thresholds.
The bias-correction methodology is illustrated conceptually in Figure 1, which shows temperature probability distributions. First, the annual cycle of TN and TX is estimated for each model in the 1961–1990 base period. This annual cycle is then subtracted from the original model time series and replaced with the corresponding annual cycle estimated from the GHCN-Daily, NCEP1 or ERA40 data sets. By doing so, we shift the model distribution (red curve) to match the mean of the observed distribution (green curve). The resulting bias-corrected time series (orange curve) is then used to calculate the percentile indices for the model, but with the percentile thresholds derived from the observations or the respective reanalysis (the 90th-percentile threshold of the green curve). Since the annual cycle in the bias-corrected model time series (orange curve) now equals that in the observations or reanalysis (green curve), i.e., the mean of the temperature distribution is the same, the resulting differences in the exceedance rates (shaded areas) are due to differences in the simulated daily temperature variability. Thus, an exceedance rate substantially greater than 10% for a given model (green plus orange shaded area) indicates that the model overestimates the frequency of excursions into the upper tail, whereas an exceedance rate below 10% indicates that the model underestimates such variability relative to the observations or reanalysis (green shaded area). Different aspects of model-simulated variability can therefore be evaluated relative to the variability in the target observational or reanalysis data set by considering different percentile thresholds, including thresholds other than the 10th or 90th percentile.
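The mean-state adjustment described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the annual cycles here are plain calendar-day means (no smoothing), the per-day observed threshold is built crudely from a pooled percentile of anomalies, and all names and the synthetic data are hypothetical.

```python
import numpy as np

def bias_corrected_exceedance(model, obs, obs_p90):
    """Exceedance rate [%] of observed 90th-percentile thresholds after
    replacing the model's mean annual cycle with the observed one.

    model, obs : arrays (n_years, 365) of daily TX in the base period.
    obs_p90    : observed 90th-percentile threshold per calendar day (365,).
    Sketch of the adjustment in Figure 1; annual cycles are simple
    calendar-day means here (no smoothing).
    """
    model_cycle = model.mean(axis=0)             # model annual cycle
    obs_cycle = obs.mean(axis=0)                 # observed annual cycle
    corrected = model - model_cycle + obs_cycle  # shift model mean onto obs mean
    return 100.0 * np.mean(corrected > obs_p90)  # threshold exceedance rate

rng = np.random.default_rng(1)
doy = np.arange(365)
cycle = 10.0 * np.sin(2.0 * np.pi * doy / 365.0)
obs = cycle + rng.normal(0.0, 3.0, size=(30, 365))
# toy model with a +2 K mean bias and inflated daily variability
model = cycle + 2.0 + rng.normal(0.0, 4.0, size=(30, 365))
# crude per-day threshold: pooled 90th percentile of obs anomalies + cycle
obs_p90 = np.percentile(obs - cycle, 90) + cycle
rate = bias_corrected_exceedance(model, obs, obs_p90)
# expect rate above 10% here, since model variability is wider than observed
```

Because the mean bias is removed before counting exceedances, the departure of `rate` from the nominal 10% isolates the variability error (the wider red/orange distribution of Figure 1), not the +2 K mean offset.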
4.1. Annual cycle
As a first step, we compare the annual cycles of TN and TX derived from the different data sets (i.e., CMIP5, GHCN-Daily and the reanalyses), globally averaged over land (excluding Antarctica). At each location, and for each data set, the annual cycle is estimated by averaging values for each calendar day in the base period and applying a temporal Gaussian filter with a standard deviation of 3 weeks. For TN, shown in Figure 2, the globally averaged annual cycles of the CMIP5 ensemble mean and NCEP1 correspond well with that of GHCN-Daily, although NCEP1 shows a slight cold bias compared to the observations. In contrast, ERA40 shows a pronounced warm bias. For TX, both reanalyses and the models have a consistent cold bias compared to GHCN-Daily, which is less pronounced in the Northern Hemisphere (NH) summer season (May–August). These biases are also apparent in the annual cycles of the 10th and 90th percentiles of TN and TX (Figure S2(a) and (b)). Correcting the mean annual cycle of the models or reanalyses with respect to the mean annual cycle of the GHCN-Daily data set also considerably reduces the biases in the 10th and 90th percentiles of TN and TX (Figure S2(c) and (d)).
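The annual-cycle estimate described above (calendar-day means smoothed with a Gaussian filter of 3-week standard deviation) can be sketched as follows. This is an assumed implementation: the function name and synthetic data are hypothetical, and `mode="wrap"` is used so the filter treats the year as periodic.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smoothed_annual_cycle(daily, sigma_days=21):
    """Annual cycle as calendar-day means over the base period, smoothed
    with a temporal Gaussian filter (sigma = 3 weeks = 21 days).

    daily : array (n_years, 365). mode='wrap' makes the filter periodic
    across the year boundary.
    """
    day_means = daily.mean(axis=0)  # mean for each calendar day
    return gaussian_filter1d(day_means, sigma=sigma_days, mode="wrap")

rng = np.random.default_rng(2)
doy = np.arange(365)
cycle = 10.0 * np.sin(2.0 * np.pi * doy / 365.0)
tn = cycle + rng.normal(0.0, 3.0, size=(30, 365))
smooth = smoothed_annual_cycle(tn)
```

The smoothing suppresses the day-to-day sampling noise in the 30-year calendar-day means while leaving the seasonal shape essentially intact, which is what makes the subtracted cycle a stable estimate of the mean state.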
4.2. Bias-corrected percentile indices
Here, we evaluate the percentile indices in the CMIP5 ensemble mean and the reanalyses after bias adjustment using the percentile thresholds from GHCN-Daily. The biases revealed in the annual cycles of the models and reanalyses with respect to the observations are also reflected in the threshold exceedance rates represented by the percentile indices. As can be seen in Figure 3, threshold exceedance rates for TN-based percentile indices in ERA40 are generally underestimated over land areas, particularly for cold nights (TN10p), indicating a narrower TN distribution in ERA40 compared to the observations. Threshold exceedance rates of TN-based percentile indices in NCEP1 are generally overestimated, indicating a wider TN distribution than observed, except over Central America and parts of Southeast Asia. In the CMIP5 ensemble, threshold exceedances are underestimated in some areas, such as Central and South America as well as central and northern parts of Asia. In other areas, such as Australia, Southeast Asia, Central and Northern Europe, and northern parts of North America and Greenland, the threshold exceedance rates are overestimated. This pattern is most pronounced in the NH winter months December to February (DJF, Figure S3). Particularly over Greenland, the models as well as both reanalyses overestimate the threshold exceedance rate for TN-based percentile indices, i.e., they simulate too many cold (TN10p) and warm nights (TN90p) compared to the observations. Note, however, that observational data coverage in these regions is very limited and biased towards lower elevations and the coastline (Donat et al., 2013).
For TX-based percentile indices, the agreement between the data sets is more consistent, as was also the case for the annual cycles. Compared to GHCNDEX, threshold exceedance rates in the models and reanalyses are generally overestimated in regions such as Alaska, Greenland, northeastern Europe, Russia, Southeast Asia and South Africa. Regions where threshold exceedance rates are generally underestimated are North and Central America, southern South America and Southern Asia. This pattern is again most pronounced in DJF (Figure S3). The NH summer months June to August (JJA), however, reveal different spatial patterns for the TN- and TX-based indices (Figure S4). The CMIP5 models as well as NCEP1, when bias-corrected with GHCN-Daily, generally overestimate the exceedance rates of the 90th-percentile thresholds of TN and TX (i.e., warm nights and days) over NH land areas, but tend to underestimate them in the Southern Hemisphere (SH), except in South Africa. The opposite is the case for the exceedance rates of the 10th-percentile thresholds of TN and TX (i.e., cold nights and days). The ERA40 reanalysis also exhibits this pattern when bias-corrected with GHCN-Daily, although less strongly.
Unfortunately, there are large areas in South America, Africa and Asia (e.g., India and Saudi Arabia) where the GHCNDEX data set is incomplete, as indicated by the white land areas in Figure 3, which prevents an observation-based evaluation of the percentile indices in these regions. Reanalyses are often used to gain information for regions lacking observational data. However, as seen particularly for the TN-based percentile indices, the discrepancies between different reanalyses can be substantial due to differences in the representation of the annual cycle (cf. Figure 2). Thus, the bias-correction methodology based on reanalysis data can lead to misleading conclusions depending on which reanalysis is used. In particular, cold and warm nights (TN10p and TN90p, respectively) are simulated more frequently in the CMIP5 ensemble mean than observed when biases are adjusted to the annual cycle of ERA40 (Figure 4). In contrast, when biases are adjusted to NCEP1, the TN-based indices are closer to the indices derived from GHCN-Daily bias-corrected temperatures and are thus closer to the globally averaged observed variability over land. Particularly in regions where observations are sparse or missing (e.g., Greenland, Northern Africa and northern parts of South America), the discrepancies between different reanalyses can be substantial (Figure S5). For TX-based indices (TX10p and TX90p), the reanalyses are generally more consistent with each other and with the observations, particularly for warm days.
Finally, it is important to note that while there is considerable uncertainty in the representation of model- or reanalysis-simulated lower and upper percentile thresholds, the overall globally averaged trends in the exceedance rates below and above these thresholds are not affected by the bias-correction methodology (Figure 4).
5. Discussion and future research needs
We have introduced a simple method of bias-correcting model temperature time series for the calculation of percentile indices that allows a more complete evaluation of these indices. Although bias-correction would also be potentially useful for fixed-threshold-based ETCCDI indices (e.g., growing season length and frost days), it is necessary for the percentile indices if they are to be used to assess aspects of model threshold-crossing behaviour beyond their changes over time. The use of the ETCCDI indices is presently the only way to provide a reasonable global comparison between models and observations, since in many parts of the world, indices from observations remain the only source of publicly available information about extreme temperature variability. Owing to the insufficient spatial coverage of observations, reanalyses are often used to evaluate models, but as we show in our analysis, this can lead to very different results depending on the reanalysis used (see also Sillmann et al., 2013).
Our analysis of percentile indices calculated from bias-corrected temperature distributions indicates that particularly large uncertainties exist for the lower tail of the temperature distribution (i.e., the TN-based indices). In the global average over land, the variability of daily minimum temperature in the CMIP5 ensemble mean is underestimated with respect to the observations (GHCN-Daily) and NCEP1, but overestimated with respect to ERA40. Moreover, the spatial pattern obtained when the CMIP5 ensemble mean is bias-corrected with GHCN-Daily is not well reproduced when the models are bias-corrected with the reanalyses, which could partly be due to sparse data coverage in the areas with the largest discrepancies (e.g., Greenland). Note, however, that near-surface temperature fields in reanalyses are not well constrained by observations (e.g., Kalnay et al., 1996). For the upper tail of the temperature distribution (i.e., the TX-based indices), the data sets analyzed in this study show more consistent results (i.e., the discrepancy between data sets is smaller).
Furthermore, we found distinct seasonal patterns in the model-simulated variability of TN and TX compared to the observations and reanalyses. As pointed out, for instance, by Yiou et al. (2009) and Parey et al. (2010), the intra-seasonal or day-to-day variability in models could be improved if models were able to simulate a realistic seasonal cycle. Seasonal differences in model performance can also help to identify deficiencies in the representation of physical processes in the models, such as cloud and land-surface processes and sea-ice or snow interactions with the atmosphere, which need to be investigated in further studies.
In conclusion, our study shows that more effort is required to fill the gaps in observation-based data sets in order to improve the evaluation of the temperature variability simulated by the current model generation, particularly with respect to extremes. Although the bias-correction methodology accounts for biases in the mean state, the shape of the distribution remains unchanged. This implies that, for a given warming, models underestimating temperature variability would show a larger trend in exceedance rates, and models overestimating temperature variability would show a smaller trend. This can have implications for the interpretation of future projections of simulated temperature extremes and requires further investigation.
We thank two anonymous reviewers for their thoughtful comments that contributed to this manuscript. We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modelling groups for producing and making their model output available. For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. Furthermore, we acknowledge the efforts of ECMWF and NCEP in providing the reanalysis data sets. J. Sillmann is funded by the German Research Foundation (grant Si 1659/1-1) and the AERO-CLO-WV project (184714/S30) funded by the Norwegian Research Council.