Journal of Geophysical Research: Atmospheres

Climate extremes indices in the CMIP5 multimodel ensemble: Part 1. Model evaluation in the present climate

Abstract

[1] This paper provides a first overview of the performance of state-of-the-art global climate models participating in the Coupled Model Intercomparison Project Phase 5 (CMIP5) in simulating climate extremes indices defined by the Expert Team on Climate Change Detection and Indices (ETCCDI), and compares it to that in the previous model generation (CMIP3). For the first time, the indices based on daily temperature and precipitation are calculated with a consistent methodology across multimodel simulations and four reanalysis data sets (ERA40, ERA-Interim, NCEP/NCAR, and NCEP-DOE) and are made available at the ETCCDI indices archive website. Our analyses show that the CMIP5 models are generally able to simulate climate extremes and their trend patterns as represented by the indices in comparison to a gridded observational indices data set (HadEX2). The spread amongst CMIP5 models for several temperature indices is reduced compared to CMIP3 models, despite the larger number of models participating in CMIP5. Some improvements in the CMIP5 ensemble relative to CMIP3 are also found in the representation of the magnitude of precipitation indices. We find substantial discrepancies between the reanalyses, indicating considerable uncertainties regarding their simulation of extremes. The overall performance of individual models is summarized by a “portrait” diagram based on root-mean-square errors of model climatologies for each index and model relative to four reanalyses. This metric analysis shows that the median model climatology outperforms individual models for all indices, but the uncertainties related to the underlying reference data sets are reflected in the individual model performance metrics.

1 Introduction

[2] In the lead-up to the Fifth Assessment Report (AR5) of the Intergovernmental Panel on Climate Change (IPCC), the simulations from a new generation of state-of-the-art global climate models (GCMs) are becoming available for analysis within the Coupled Model Intercomparison Project Phase 5 (CMIP5) [Taylor et al., 2012]. In comparison with the previous model generation (CMIP3) [Meehl et al., 2007], CMIP5 includes more comprehensive global climate models (i.e., Earth system models) with generally higher spatial resolution, enabling the research community to address a wider variety of scientific questions. One important task is the evaluation of CMIP5 model performance in representing not only aspects of the mean climate, but also extreme climate and weather events. Over the last two decades, the analysis of such extreme events has become increasingly important due to the recognition of their significant impacts on society and natural systems [Intergovernmental Panel on Climate Change (IPCC), 2012].

[3] While extreme climate and weather events are generally multifaceted phenomena, the present study in particular discusses climate extremes based on daily temperature and precipitation, such as the hottest or coldest day of the year, heavy precipitation events, and dry spells [e.g., Seneviratne et al., 2012; Zwiers et al., 2013]. The Expert Team on Climate Change Detection and Indices (ETCCDI) has attempted to facilitate the analysis of such extremes over the last decade by defining a set of climate indices that provide a comprehensive overview of temperature and precipitation statistics focusing particularly on extreme aspects [Karl and Easterling, 1999; Klein Tank et al., 2009]. This effort has also supported the development of a gridded data set of indices (HadEX [Alexander et al., 2006] and the updated version HadEX2 [Donat et al., 2013]) based on observations with a reasonably dense global coverage. Most of the indices defined by the ETCCDI (hereafter simply referred to as indices) describe moderate climate extremes with recurrence times of a year or shorter, as compared to more extreme climate statistics such as 20 year return values of annual temperature and precipitation extremes as considered, for instance, in Kharin et al. [2007].

[4] The indices find multiple applications in climate research and related fields due to their robustness and fairly straightforward calculation and interpretation. For instance, the indices are used in various detection and attribution studies [e.g., Min et al., 2010; Morak et al., 2011], regional studies [e.g., Alexander and Arblaster, 2009], and also for testing statistical downscaling methods as in Bürger et al. [2012]. The substantial use of the indices in the literature, as reviewed in Zhang et al. [2011], motivates a comprehensive evaluation of how well they are simulated by the newly available CMIP5 models; such a comparison between indices calculated from reanalyses or observations and those calculated from multimodel simulations has not been done so far.

[5] On a global scale, however, the analysis of extreme climate events involves several difficulties compared to the analysis of time-mean quantities. Among others, they include (a) the formulation of globally valid definitions of extremes; (b) the identification of data sets with sufficient spatial and temporal coverage that are appropriate for comparison with models; and (c) performing analyses across the available model and observational data sets in a consistent manner. The choice of observational or observationally constrained data sets that are adequate for model evaluation, in particular, represents a compromise. In this paper, we will use both the gridded HadEX2 observational data set and extremes indices derived from reanalyses. Both have adequate temporal coverage, but as we will see, the spatial coverage of the gridded observational product is far less than ideal. Also, the gridded product has necessarily been constructed from smoothed point (station) values and thus is not necessarily representative of the climate model grid box scale. It may also be subject to inhomogeneities caused by, for instance, station instrumentation, location, and exposure, although Donat et al. [2013] did make every effort to use homogenized data sources. In contrast, reanalyses have full spatial coverage and simulate extremes on spatial scales comparable to those of climate models, but the near surface temperature and precipitation extremes that we wish to assess here are calculated from variables that are relatively poorly constrained by observations in the reanalyses. Reanalyses may also be affected by, for instance, inaccurate representation of surface and boundary layer processes, convection, and model spin-up. The challenges that are posed by these observational limitations will be evident in the evaluations described in the main part of this paper.

[6] This paper will provide a first look at the performance of the CMIP5 multimodel ensemble in simulating climate extreme aspects represented by the indices as compared to CMIP3, reanalyses, and HadEX2. The indices are calculated with a consistent methodology across the various data sets discussed in this paper and the indices from models and reanalyses are available at the ETCCDI indices archive (EIA) website hosted by the Canadian Centre for Climate Modelling and Analysis (http://www.cccma.ec.gc.ca/data/climdex/climdex.shtml). This paper serves as documentation of the EIA, which constitutes an essential part of this study, and provides the foundation for a companion paper that examines future changes in the indices as projected in CMIP3 and CMIP5 simulations [Sillmann et al., 2013].

[7] We organize the paper as follows. The analyzed data sets are described in section 2. The methodology for the computation and analysis of the indices is given in section 3. The results are analyzed in section 4, and the study is concluded with a summary in section 5.

2 Data Sets

2.1 Global Climate Models

[8] Daily minimum and maximum near surface temperatures (TN and TX, respectively) and daily precipitation accumulation (PR) were retrieved from the Earth System Grid (ESG) data portal for 31 CMIP5 (cf. Table S1 in the auxiliary material) and 18 CMIP3 models (cf. auxiliary material Table S2). The main improvements in CMIP5 include (a) the addition of interactive ocean and land carbon cycles of varying degrees of complexity, (b) more comprehensive modeling of the indirect effect of aerosols, and (c) the use of time-evolving volcanic and solar forcing in most models [e.g., Taylor et al., 2012]. The CMIP5 models generally have higher horizontal and vertical resolution (median resolution of about 180 × 96 horizontal grid points with 39 vertical levels (L39), cf. Table S1) than the CMIP3 models (median resolution of about 128 × 64 grid points, L24, cf. Table S2).

[9] Indices are being calculated from all available ensemble members with high-frequency temperature and precipitation output for CMIP3 and CMIP5 models and are available at the EIA. For the purpose of model evaluation in the present study, we consider only one (typically the first) ensemble member of each model. The analysis is based on the 20th century simulations of the CMIP3 models from 1961 to 2000 and on the historical simulations of the CMIP5 models for years from about 1850 to 2005. These simulations employ historical changes in the atmospheric composition reflecting both anthropogenic and natural sources. CMIP5 simulations also include time-evolving land cover information [Taylor et al., 2012], including land use change in some instances [e.g., Avila et al., 2012].

2.2 Observations

[10] The gridded HadEX2 data set of observation-based indices [Donat et al., 2013] allows comparison between model-simulated and observed indices. In many regions around the globe, these indices are the only publicly available source of information about temperature and precipitation extremes. HadEX2 indices are calculated directly from station-based observations and then interpolated to a global grid, which results in a spatial-scale mismatch with indices calculated from model output because the latter represent area (grid box) averages rather than point values. This might be less of an issue for interpreting results for temperature-based indices because temperature extremes typically occur on a larger spatial scale. However, the effect of station-based versus grid-box estimates on temperature extremes has not yet been thoroughly examined, and station exposure [e.g., Fall et al., 2011] as well as land use and land cover characteristics [e.g., Pielke et al., 2011] can affect their representation in the different data sets in some locations. For precipitation-based indices in particular, one should expect extremes derived from local station-based observations to be more intense than those derived from gridded model output, as discussed in Chen and Knutson [2008].

[11] The HadEX2 data set is provided only over land on a 2.5°×3.75° grid from the University of New South Wales Climate Change Research Centre website (http://www.climdex.org). The spatial coverage of HadEX2 depends on the data availability and the search radius used in the interpolation procedure, which is determined by spatial correlation (see Alexander et al. [2006] and Donat et al. [2013] for details). Because station-data availability and search radius differ from one index to another, different indices have different temporal and spatial coverage.

2.3 Reanalyses

[12] Reanalysis data sets are also often used for model evaluation. Reanalyses are more readily comparable with model simulations because of their gridded output, complete global spatial coverage, and the similarity of the spatial scales represented. Although reanalyses are essentially observationally constrained model output, variables that are directly assimilated into the reanalysis forecast model are typically closer to observations. Variables such as precipitation that are primarily determined by the forecast model are poorly constrained by the assimilated observations and are therefore classified as “type C” variables [Kalnay et al., 1996]. Near-surface temperature fields such as those used for the indices calculation are classified as “type B” variables [Kalnay et al., 1996] because the forecast model has substantial influence on the reanalyzed values.

[13] In this study, we computed indices for four widely used reanalyses: ERA40 [Uppala et al., 2005], ERA-Interim [Dee et al., 2011], National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) Reanalysis 1 (NCEP1) [Kistler et al., 2001; Kalnay et al., 1996], and NCEP-DOE Reanalysis 2 (NCEP2) [Kanamitsu et al., 2002]. For ERA40 and ERA-Interim, daily temperature and precipitation data were downloaded from the European Centre for Medium-Range Weather Forecasts archive (http://www.ecmwf.int/products/data/archive/). ERA40 does not provide daily temperature extremes (TN and TX), and thus they were approximated by the daily minimum and maximum of 6-hourly near-surface temperature values for the period from 1958 to 2001. The ERA40 is available on a 2.5°×2.5° regular latitude/longitude 144×73 grid. The ERA-Interim is available on a regular 1.5°×1.5° longitude/latitude 240×121 grid for the period from 1979 to 2010. The daily data of the NCEP reanalysis were downloaded from the Physical Science Division website of the NOAA Earth System Research Laboratory (ftp://ftp.cdc.noaa.gov/Datasets/). Both NCEP1 and NCEP2 reanalysis data sets are available on a 192×94 Gaussian grid. The NCEP1 and NCEP2 indices were computed for the years 1948–2010 and 1979–2010, respectively. The indices for the four reanalyses are also available for downloading from the EIA.

3 Methodology

3.1 Climate Extremes Indices

3.1.1 Definition

[14] The indices are defined and described in detail in Klein Tank et al. [2009] and Zhang et al. [2011]. The indices fall roughly into four categories: (1) absolute indices, which describe, for instance, the hottest or coldest day of a year, or the annual maximum 1 day or 5 day precipitation rates; (2) threshold indices, which count the number of days when a fixed temperature or precipitation threshold is exceeded, for instance, frost days or tropical nights; (3) duration indices, which describe the length of wet and dry spells, or warm and cold spells; and (4) percentile-based threshold indices, which describe the exceedance rates above or below a threshold which is defined as the 10th or 90th percentile derived from the 1961–1990 base period. The latter are referred to as percentile indices in the following. The complete set of 27 indices is summarized in Table 1.

Table 1. Core Set of 27 Extreme Indices Recommended by the ETCCDI. The Index R1mm Marked With * is Defined by ETCCDI for a User Specified Threshold Which is Set to 1 mm for This Study
Label | Index Name | Index Definition | Units
TN10p | Cold nights | Let TNij be the daily minimum temperature on day i in period j and let TNin10 be the calendar day 10th percentile centered on a 5 day window. The percentage of days in a year is determined where TNij < TNin10. | %
TX10p | Cold days | Let TXij be the daily maximum temperature on day i in period j and let TXin10 be the calendar day 10th percentile centered on a 5 day window. The percentage of days is determined where TXij < TXin10. | %
TN90p | Warm nights | Let TNij be the daily minimum temperature on day i in period j and let TNin90 be the calendar day 90th percentile centered on a 5 day window. The percentage of days is determined where TNij > TNin90. | %
TX90p | Warm days | Let TXij be the daily maximum temperature on day i in period j and let TXin90 be the calendar day 90th percentile centered on a 5 day window. The percentage of days is determined where TXij > TXin90. | %
WSDI | Warm spell duration | Let TXij be the daily maximum temperature on day i in period j and let TXin90 be the calendar day 90th percentile centered on a 5 day window for the base period 1961–1990. Then the number of days per period is summed where, in intervals of at least 6 consecutive days, TXij > TXin90. | days
CSDI | Cold spell duration | Let TNij be the daily minimum temperature on day i in period j and let TNin10 be the calendar day 10th percentile centered on a 5 day window for the base period 1961–1990. Then the number of days per period is summed where, in intervals of at least 6 consecutive days, TNij < TNin10. | days
TXx | Max TX | Let TXij be the daily maximum temperature on day i in month k, period j. The maximum daily maximum temperature each month is then TXxkj = max(TXij). | °C
TXn | Min TX | Let TXij be the daily maximum temperature on day i in month k, period j. The minimum daily maximum temperature each month is then TXnkj = min(TXij). | °C
TNx | Max TN | Let TNij be the daily minimum temperature on day i in month k, period j. The maximum daily minimum temperature each month is then TNxkj = max(TNij). | °C
TNn | Min TN | Let TNij be the daily minimum temperature on day i in month k, period j. The minimum daily minimum temperature each month is then TNnkj = min(TNij). | °C
FD | Frost days | Let TNij be the daily minimum temperature on day i in period j. Count the number of days where TNij < 0°C. | days
ID | Ice days | Let TXij be the daily maximum temperature on day i in period j. Count the number of days where TXij < 0°C. | days
SU | Summer days | Let TXij be the daily maximum temperature on day i in period j. Count the number of days where TXij > 25°C. | days
TR | Tropical nights | Let TNij be the daily minimum temperature on day i in period j. Count the number of days where TNij > 20°C. | days
GSL | Growing season length | Let Tij be the daily mean temperature ((TN + TX)/2) on day i in period j. Count the number of days between the first occurrence of at least 6 consecutive days with Tij > 5°C and the first occurrence after 1 July (NH) or 1 January (SH) of at least 6 consecutive days with Tij < 5°C. | days
DTR | Diurnal temperature range | Let TNij and TXij be the daily minimum and maximum temperature, respectively, on day i in period j. If I represents the number of days in j, then DTRj = (Σi=1..I (TXij − TNij)) / I. | °C
RX1day | Max 1 day precipitation | Let PRij be the daily precipitation amount on day i in period j. The maximum 1 day value for period j is RX1dayj = max(PRij). | mm
RX5day | Max 5 day precipitation | Let PRkj be the precipitation amount for the 5 day interval ending on day k in period j. The maximum 5 day value for period j is RX5dayj = max(PRkj). | mm
SDII | Simple daily intensity | Let PRwj be the daily precipitation amount on wet days (PR ≥ 1 mm) in period j. If W represents the number of wet days in j, then SDIIj = (Σw=1..W PRwj) / W. | mm
R1mm* | Number of wet days | Let PRij be the daily precipitation amount on day i in period j. Count the number of days where PRij > 1 mm. | days
R10mm | Heavy precipitation days | Let PRij be the daily precipitation amount on day i in period j. Count the number of days where PRij > 10 mm. | days
R20mm | Very heavy precipitation days | Let PRij be the daily precipitation amount on day i in period j. Count the number of days where PRij > 20 mm. | days
CDD | Consecutive dry days | Let PRij be the daily precipitation amount on day i in period j. Count the largest number of consecutive days where PRij < 1 mm. | days
CWD | Consecutive wet days | Let PRij be the daily precipitation amount on day i in period j. Count the largest number of consecutive days where PRij > 1 mm. | days
R95p | Very wet days | Let PRwj be the daily precipitation amount on a wet day w (PR ≥ 1 mm) in period j and let PRwn95 be the 95th percentile of precipitation on wet days in the 1961–1990 period. If W represents the number of wet days in the period, then R95pj = Σw=1..W PRwj, where PRwj > PRwn95. | mm
R99p | Extremely wet days | Let PRwj be the daily precipitation amount on a wet day w (PR ≥ 1 mm) in period j and let PRwn99 be the 99th percentile of precipitation on wet days in the 1961–1990 period. If W represents the number of wet days in the period, then R99pj = Σw=1..W PRwj, where PRwj > PRwn99. | mm
PRCPTOT | Total wet-day precipitation | Let PRij be the daily precipitation amount on day i in period j. If I represents the number of days in j, then PRCPTOTj = Σi=1..I PRij. | mm

3.1.2 Calculation

[15] All indices are calculated for the CMIP5 and CMIP3 simulations listed in Tables S1 and S2 and for four reanalyses (cf. section 2.3). The calculations are performed with the R package climdex.pcic as documented at The Comprehensive R Archive Network website (http://cran.r-project.org/web/packages/climdex.pcic/index.html). This R package was thoroughly cross-validated with the Fortran code (FClimdex) provided on the ETCCDI website for station-based indices calculation (http://cccma.seos.uvic.ca/ETCCDI/software.shtml) and the code used to produce HadEX2 indices. This step is important to ensure the consistent calculation of indices between models, reanalyses, and observations (i.e., HadEX2) and led to improvement of some of the original FClimdex routines. Note however that HadEX2 also includes indices that are calculated by individual countries and subsequently provided as input for the gridding procedure [cf. Alexander et al., 2006]. Thus, some inconsistencies due to differences in the implemented codes may exist. The percentile indices calculation is implemented according to Zhang et al. [2005] to avoid inhomogeneity in these indices at the boundaries between the base and out-of-base periods. All indices are computed and made available at the EIA on the native model grids for all available ensemble members for the historical time period 1850–2005 of the CMIP5 models and for 1961–2000 for the CMIP3 models, as well as for the available time periods in the reanalyses on their respective grids (cf. section 2.3).
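To make the index definitions and this calculation step more concrete, the following minimal Python sketch illustrates three of the indices (TXx, FD, and R95p) for a single grid point or station. The function and variable names are hypothetical, the inputs are assumed to be one-dimensional daily series in °C and mm, and the sketch deliberately omits details handled by climdex.pcic, such as calendar and missing-data rules and the Zhang et al. [2005] bootstrap for percentile thresholds within the base period.

```python
import numpy as np

def txx(tx_year):
    """TXx: annual maximum of daily maximum temperature (deg C)."""
    return float(np.nanmax(tx_year))

def frost_days(tn_year):
    """FD: number of days in the year with daily minimum temperature below 0 deg C."""
    return int(np.nansum(tn_year < 0.0))

def r95p(pr_year, pr_base):
    """R95p: annual precipitation total (mm) on very wet days, i.e. wet days
    (>= 1 mm) exceeding the 95th percentile of wet-day amounts in the
    1961-1990 base period (pr_base)."""
    wet_base = pr_base[pr_base >= 1.0]
    threshold = np.percentile(wet_base, 95)
    wet = pr_year[pr_year >= 1.0]
    return float(wet[wet > threshold].sum())
```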

3.1.3 Processing

[16] For our analyses, we regridded all indices to a common 144×73 grid (2.5°×2.5°) using a first-order conservative remapping procedure [Jones, 1999] implemented in the Climate Data Operators (https://code.zmaw.de/projects/cdo) to produce multimodel summary statistics. The HadEX2 indices are kept in their original resolution.
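As an illustration of this processing step (not the exact commands used for the archive), a first-order conservative remapping to the common 2.5° grid could be invoked through CDO from Python roughly as follows; the file names are placeholders and CDO is assumed to be installed on the system.

```python
import subprocess

# First-order conservative remapping [Jones, 1999] of an index field to a
# regular 2.5 deg x 2.5 deg (144x73) global latitude/longitude grid using
# the CDO "remapcon" operator. File names are hypothetical placeholders.
subprocess.run(
    ["cdo", "remapcon,r144x73", "txx_model_native.nc", "txx_model_2p5deg.nc"],
    check=True,
)
```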

[17] Regional analysis is performed for 21 subregions as defined in Giorgi and Francisco [2000] (cf. their Table 2) and illustrated in Figure 1. For each index, the respective space- and time-varying HadEX2 mask (cf. section 2.2) is regridded to the common grid and applied to the indices to intercompare the model-simulated and reanalysis indices with HadEX2 based only on grid locations where observations are available. Temporal and spatial averages of such masked indices are summarized by box-and-whisker plots that display the median of a multimodel ensemble (black/grey solid mark within the box), the interquartile model range, that is, the range between the 25th and 75th percentiles of the model ensemble (box), and the total intermodel range (whiskers). Regions where the HadEX2 data coverage was less than 10 grid points or 25% of the total land points in the particular region are excluded from the box plot summaries.

Figure 1.

Subregions over land adapted from Giorgi and Francisco [2000] (cf. their Table 2) and color-coded according to the continents, Australia (blue), South America (green), North America (purple), Africa (red), Europe (yellow), and Asia (cyan).
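A minimal sketch of how the HadEX2 masking and the regional box-plot statistics described above could be assembled is given below; the array names and shapes are hypothetical, grid-box area weighting is omitted for simplicity, and the time dimension is assumed to have been averaged beforehand.

```python
import numpy as np

def regional_box_stats(index_models, hadex2_mask, region_mask,
                       min_points=10, min_fraction=0.25):
    """Summarize a multimodel ensemble of one index over one subregion.

    index_models : (n_models, nlat, nlon) time-mean index values on the
                   common 2.5 deg grid (hypothetical input).
    hadex2_mask  : (nlat, nlon) boolean, True where HadEX2 provides data.
    region_mask  : (nlat, nlon) boolean, True over land points of the
                   Giorgi and Francisco [2000] subregion.
    Returns the multimodel median, interquartile range, and full range of
    the regional averages, or None if HadEX2 coverage falls below 10 grid
    points or 25% of the region's land points.
    """
    valid = hadex2_mask & region_mask
    if valid.sum() < min_points or valid.sum() < min_fraction * region_mask.sum():
        return None  # region excluded from the box-plot summary
    regional_means = np.array(
        [np.nanmean(np.where(valid, model, np.nan)) for model in index_models])
    return {
        "median": float(np.median(regional_means)),
        "iqr": np.percentile(regional_means, [25, 75]),
        "range": (float(regional_means.min()), float(regional_means.max())),
    }
```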

[18] For the time series and model metric performance analysis we consider all grid points over land. We particularly concentrate on land areas in this study, because (1) the impacts of temperature and precipitation extremes are of particular concern over populated land areas and (2) the observation-based HadEX2 indices are not available over ocean.

3.2 Model Performance Metrics

[19] Given the large number of indices and models analyzed in this study, we use a metric approach to assess model performance similar to that advocated by Gleckler et al. [2008] (hereafter referred to as G08), which is based on the root-mean-square errors (RMSEs) of model climatologies. This provides a compact graphical overview of model performance relative to each other for various climate parameters.

[20] The model climate is assessed with respect to the four reanalyses (cf. section 2.3). In this analysis RMSEs are computed for the annual climatologies of indices rather than for their annual cycle (as in G08) because most indices are defined on an annual basis. The RMSEs are calculated for land only in four domains: global, Northern Hemisphere extratropics (NH, 20°N–90°N), Southern Hemisphere extratropics (SH, 20°S–90°S), and Tropics (Tr, 20°N–20°S) as in G08. We first calculate the 20 year 1981–2000 annual climatology for each index. The RMSE is then calculated as

\[
\mathrm{RMSE}_{xy} = \sqrt{\left\langle \left( X - Y \right)^{2} \right\rangle} \qquad (1)
\]

where X represents the model climatology of an index, Y is the corresponding climatology in the reanalysis, and the angular brackets denote spatial averaging over a particular domain (global, NH, SH, or Tr). The collection of RMSEs for all models is then used to derive the relative model error for each model, RMSE'xy, defined as

\[
\mathrm{RMSE}'_{xy} = \frac{\mathrm{RMSE}_{xy} - \mathrm{RMSE}_{\mathrm{median}}}{\mathrm{RMSE}_{\mathrm{median}}} \qquad (2)
\]

where RMSEmedian is the median of RMSE for all models (as in G08).

[21] RMSE'xy provides an indication of a model's performance relative to the other models, with respect to a particular reanalysis. The median error RMSEmedian represents typical model performance in the multimodel ensemble. This is not to be confused with the performance of the median (or mean) model. The latter is obtained by first computing the multimodel median (or mean) of an index, and then deriving the RMSE′ statistic for this multimodel estimate. Negative values of RMSE′ indicate that the corresponding model performs better than the majority (50%) of models. RMSE′ values for all models and all indices obtained for the four reanalyses are summarized in a so-called “portrait” diagram (G08). We complement it with an indicator of overall model performance, referred to as RMSEall, which is obtained for each model by averaging its RMSE′ values across all indices.

[22] The original “portrait” diagram in G08 displays only relative model performance to each other, but it does not inform about the absolute magnitude of errors in the multimodel ensemble with respect to the reanalyses. We therefore display the multimodel median RMSE for each index standardized by the spatial standard deviation of the index climatology in the reanalyses

\[
\mathrm{RMSE}_{\mathrm{median,std}} = \frac{\mathrm{RMSE}_{\mathrm{median}}}{\sqrt{\left\langle \left( Y - \langle Y \rangle \right)^{2} \right\rangle}} \qquad (3)
\]

where Y is the index climatology in the reanalysis, and the expression in the denominator is its spatial standard deviation over global land. Values close to zero indicate that model errors are small as compared to typical spatial variations of the index on a global scale.
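For concreteness, the following Python sketch illustrates how equations (1)–(3) could be evaluated for one index over one domain; the cosine-of-latitude area weighting is our own assumption (the text only states that the angular brackets denote spatial averaging over a domain), and all array and function names are hypothetical.

```python
import numpy as np

def spatial_mean(field, lat, land_mask):
    """Spatial average of a 2-D field over the land points of a domain.
    Cosine-of-latitude weighting is an assumption of this sketch."""
    weights = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(field)
    weights = np.where(land_mask & np.isfinite(field), weights, 0.0)
    return np.nansum(np.where(weights > 0, field, 0.0) * weights) / weights.sum()

def rmse(model_clim, rean_clim, lat, land_mask):
    """Equation (1): RMSE of a model index climatology X against a
    reanalysis climatology Y over one domain."""
    return np.sqrt(spatial_mean((model_clim - rean_clim) ** 2, lat, land_mask))

def relative_rmse(rmse_per_model):
    """Equation (2): relative error RMSE' of each model with respect to
    the multimodel median RMSE."""
    rmse_per_model = np.asarray(rmse_per_model, dtype=float)
    rmse_median = np.median(rmse_per_model)
    return (rmse_per_model - rmse_median) / rmse_median

def standardized_median_rmse(rmse_per_model, rean_clim, lat, land_mask):
    """Equation (3): multimodel median RMSE standardized by the spatial
    standard deviation of the reanalysis index climatology over land."""
    y_mean = spatial_mean(rean_clim, lat, land_mask)
    y_std = np.sqrt(spatial_mean((rean_clim - y_mean) ** 2, lat, land_mask))
    return np.median(rmse_per_model) / y_std
```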

4 Results

4.1 Temperature Indices

4.1.1 Absolute and Threshold Indices

4.1.1.1 Spatial Representation

[23] We begin our analysis with a comparison of the spatial structure of the 1981–2000 climatologies of TNn and TXx, which describe the coldest and hottest day of a year, respectively (Figure 2). The HadEX2 spatial coverage is better for TNn than for TXx, which reflects differences in their decorrelation length scales (Alexander et al. [2006]; see also section 2.2). Overall, the CMIP5 models compare well with HadEX2 (Figure 2) and the four reanalyses (see auxiliary material, Figures S1 and S2). There are some differences, particularly in high terrain regions, such as the Tibetan Plateau or the Andes. In these regions in particular, the NCEP1 and NCEP2 reanalyses show lower TNn values than the CMIP5 models, ERA40, and ERA-Interim. The diminished detail in mountainous areas in HadEX2 reflects the relatively coarse spatial resolution of this data set, but may also be related to a likely bias toward lower elevations because most stations in mountainous areas are located in valley bottoms, which results in the loss of some orographic detail in high-elevation areas [e.g., Stahl et al., 2006].

Figure 2.

The 1981–2000 time mean of the annual minimum of TN (TNn, top panel) and maximum of TX (TXx, bottom panel) for HadEX2 and the CMIP5 multimodel ensemble median.

[24] The north-south temperature gradient from warmer tropical regions to colder polar regions is reasonably captured by reanalyses and CMIP5 models. However, these data sets are cold biased relative to HadEX2, particularly over Greenland. These spatial pattern features are similar for other absolute and threshold indices based on daily minimum or maximum temperature, such as frost days (FD) or summer days (SU) (see auxiliary material, Figures S3 and S4, respectively). Underestimation of mean surface temperatures over high northern latitudes, particularly in Greenland and Siberia, and for the Tibetan Plateau, was also reported for CMIP3 models in Randall et al. [2007]. Furthermore, NCEP1 shows a warm bias in TXx (see auxiliary material, Figure S2) particularly in tropical Africa, Asia, and also in Australia compared to HadEX2 and the other reanalyses.

[25] A more detailed look at regional differences between HadEX2, reanalyses and CMIP5 and CMIP3 models is provided by the box-and-whisker plots for the 21 subregions (cf. Figure 1). For TXx and TNn, the multimodel medians of the CMIP3 and CMIP5 are very similar for most regions (Figures 3a and 3b). The interquartile model range for the temperature indices in Figure 3, which is spanned by the 25th and 75th quantiles of the multimodel ensemble and indicated by the boxes, is also similar and often smaller in CMIP5 compared to CMIP3 despite the larger number of CMIP5 models. Exceptions occur in northern regions, such as Greenland (GRL), Western and Central North America (WNA and CNA, respectively), and Northern Europe (NEU) particularly for indices based on TX (e.g., TXx and summer days (SU)).

Figure 3.

Box-and-whisker plots for temperature indices calculated from 18 CMIP3 (grey) and 31 CMIP5 (black) models. The boxes indicate the interquartile model spread (range between the 25th and 75th quantiles), the black/grey solid marks within the boxes show the multimodel median, and the whiskers indicate the total intermodel range. For the regional averages, a spatio-temporal variable land mask according to the availability of the HadEX2 indices was applied. Displayed are only regions where at least 10 grid points and 25% of the grid boxes within a region were covered by the respective HadEX2 index over the time period 1981–2000; the HadEX2 value is indicated by a green cross. The reanalyses are indicated by different shapes for ERA40 (blue), ERA-Interim (cyan), NCEP1 (red), and NCEP2 (orange). The 21 subregions are color-coded according to Figure 1.

[26] Except for diurnal temperature range (DTR), the spread between the four reanalyses is similar or even larger than the interquartile model spread, with HadEX2 falling between the reanalyses and model estimates in most regions. In Tibet (TIB) and high northern latitude regions (GRL, Alaska (ALA), North Asia (NAS)), HadEX2 shows warmer temperature extremes than the models and reanalyses, perhaps due to poor observational network coverage (cf. Figure 1a in Donat et al. [2013]), which also is likely biased toward lower elevations.

[27] Models and reanalyses particularly overestimate the number of ice days (ID, see auxiliary material, Figure S5), and thus simulate more cold extremes in comparison to HadEX2, whereas CMIP5 simulates fewer ID than CMIP3 in most regions. On the other hand, in tropical regions, such as the Amazon Basin (AMZ), Central America (CAM), and South Africa (SAF), the models and reanalyses generally simulate warmer temperature extremes (e.g., TXx, SU, and tropical nights (TR)) as compared to HadEX2 with no distinct differences between CMIP3 and CMIP5.

[28] The models and reanalyses also disagree with HadEX2 for the DTR (Figure 3c) with HadEX2 showing much larger values than the median of CMIP3 and CMIP5. Particularly, at high northern latitudes (e.g., ALA and GRL), CMIP5 models estimate warmer TXx and colder TNn than CMIP3 models, and also exhibit greater DTR. HadEX2 values may be biased by poor observational network coverage in these regions (e.g., ALA and GRL) with stations located predominately in coastal areas [see Alexander et al., 2006; Donat et al., 2013]. In most regions, NCEP1 simulates DTR values closer to HadEX2 than the median CMIP3 and CMIP5 model or the other reanalyses, but exceeds HadEX2 values in some regions, such as the Mediterranean (MED), Sahara (SAH), South and Central Asia (SAS, CAS), and TIB.

4.1.1.2 Global Temporal Evolution

[29] The temporal evolution of the globally averaged indices over land in the models and reanalyses is shown in Figure 4 for the time period 1948 to 2010. HadEX2 indices are not shown to avoid effects related to changes in the spatial coverage of the HadEX2 data set over time.

Figure 4.

Global spatial means of temperature indices over land from 1948 to 2005 of the ensemble mean (solid) and median (dashed) of 31 CMIP5 (black) and 18 CMIP3 (green) models. Shading indicates the interquartile model spread (range between the 25th and 75th quantiles). Also shown are the reanalyses ERA40 (blue) from 1958 to 2001, ERA-Interim (cyan) and NCEP2 (orange) from 1979 to 2010, and NCEP1 (red) from 1948 to 2005. For each index, the left column displays absolute values and the right column shows anomalies with respect to the reference period 1981 to 2000. Grey shading along the horizontal x-axis indicates the evolution of globally averaged volcanic forcing according to Sato et al. [1993].

[30] We can see that the interquartile model spread, indicated by the shading, is larger for CMIP3 than for CMIP5 particularly for indices based on TN, such as TNn, FD, and TR (Figures 4a, 4e, and 4g, respectively), again despite the larger number of CMIP5 models. The multimodel median and mean are comparable in the two CMIP ensembles, particularly for TXx and TR (Figures 4c and 4g, respectively).

[31] The differences between the reanalyses can be as large as or larger than the interquartile model spread. Differences between reanalyses produced by the same center can be smaller than differences between reanalyses from different centers (that is, ERA40 and ERA-Interim versus NCEP1 and NCEP2), for instance for TNn, but can also be substantial, as for FD and TR. For the latter indices, the models compare better with the more recent reanalyses, ERA-Interim and NCEP2, than with their predecessors. NCEP1 in particular reveals some inhomogeneities in the temporal evolution of TXx with an abrupt increase of about 2°C in 1982 and a sharp decrease in the early 1990s (Figures 4c and 4d). This jump is a regional feature confined to some parts of the Tropics (20°S to 20°N) and may be related to problems in the formulation of boundary layer processes as mentioned in NCEP-NCAR [2011] and also discussed in Kharin et al. [2005].

[32] Results from individual CMIP5 models (not shown) do not reveal a clear relationship between a model's spatial resolution and its representation of temperature indices. For TXx and TNn, higher resolution models tend to center around the ensemble median, while lower resolution models spread out toward both ends of the ensemble distribution. A relatively large DTR (≈11–13°C, globally averaged) is simulated in CNRM-CM5, INMCM4, and CanESM2, which cover a wide range of spatial resolutions, and is comparable to NCEP1 (≈11°C) and HadEX2 (cf. Figure 3c).

[33] Although there are substantial differences between the models and reanalyses when absolute values of the indices are considered (left column of Figure 4), the spread is reduced significantly for anomalies relative to the reference period of 1981–2000 (right column of Figure 4). The 1981–2000 reference period is used here because it is available for all data sets considered.

[34] We can see that long-term trends in the historical temporal evolution of the indices are more clearly visible for the anomalies. For TXx and TNn, for instance, there are similar warming trends in the models and reanalyses starting in the 1960s. NCEP2 shows a greater increase for TNn in the last decade compared to the other reanalyses and CMIP models. The temporal inhomogeneities in the TXx time series in NCEP1 still dominate their temporal evolution.

[35] The threshold indices (Figures 4f and 4h) reveal corresponding trends, with decreasing numbers of frost days and increasing numbers of tropical nights. Alexander et al. [2006] also found a decrease in FD in the predecessor of HadEX2 (i.e., HadEX), although that observed decrease is much more pronounced.

[36] The grey shading along the horizontal x-axis in Figure 4 indicates the evolution of globally averaged volcanic radiative forcing derived from the stratospheric aerosol optical depth data set described in Sato et al. [1993]. Volcanic forcing is included in most of the CMIP5 models considered in this analysis (see Table S1) and in approximately one-third of the CMIP3 models (see Table S2). The effect of the volcanic forcing is visible in some of the indices. For example, distinct decreases in TXx and particularly in TR are seen in years following 1963 (Mt. Agung eruption), 1982 (El Chichón), and 1991 (Mt. Pinatubo) in the CMIP5 models. The volcano signal is less prominent in TNn and FD for both models and reanalyses.

4.1.2 Percentile Indices

[37] Compared to the absolute indices TNn and TXx, which are based on annual extremes, the percentile indices are related to less extreme aspects of climate variability and represent exceedance rates (in %) above or below the 10th and 90th percentiles of the temperature distribution with respect to a standard base period (usually 1961–1990). By construction, these indices average to approximately 10% over the base period. Because ERA-Interim and NCEP2 start in 1979, we used 1979–2008 as base period for computing the percentile indices from these two reanalyses. As a result, time series of ERA-Interim and NCEP2 indices deviate further away from the model-simulated indices than the ERA40 and NCEP1 indices. Note that this will also affect the warm and cold spell duration indices (WSDI and CSDI, respectively) that are also based on the percentile thresholds derived from the base period.

[38] The construction of the percentile indices results in fairly good agreement in the temporal evolution of the CMIP models and reanalyses (Figure 5). For a more meaningful evaluation of the percentile indices beyond the qualitative trend assessment discussed in this paper, a bias-correction methodology is suggested that removes the mean model bias and uses percentile thresholds from a reference data set in the calculation of the percentile indices (Sillmann et al., in preparation).

Figure 5.

Time series of percentile indices from 1948 to 2005 of the ensemble mean (solid) and median (dashed) of 31 CMIP5 models (black) and 18 CMIP3 models (green). The shading indicates the interquartile ensemble spread (range between the 25th and 75th quantiles). Note that the percentile indices from the reanalysis ERA40 (blue) from 1958 to 2001 and NCEP1 (red) from 1948 to 2005 are calculated with a different base period (1961 to 1990) than those from ERA-Interim (cyan) and NCEP2 (orange) with a base period from 1979 to 2008. Displayed are global averages over all land grid boxes. Grey shading along the horizontal x-axis indicates the evolution of globally averaged volcanic forcing according to Sato et al. [1993].

[39] Consistent with the observed changes documented in Alexander et al. [2006] and Donat et al. [2013], we can see decreasing trends in cold days and nights and increasing trends in warm days and nights for the CMIP models and reanalyses (Figures 5a–5d). The frequency of cold nights decreases to 6% from the nominal level of about 10% by year 2005 for CMIP5 models, with CMIP3 models following the trend closely up to year 2000. The frequency of cold days decreases by a slightly smaller amount, to 7%. The frequency of warm nights increases to 19% from the nominal level of about 10% by year 2005 in the CMIP5 models, whereas there is a smaller increase in warm days, to approximately 17%. The trends in the reanalyses are comparable to those in the models for cold and warm days (both based on TX) but are smaller for cold and warm nights (based on TN) in the last two decades. This asymmetric warming of TN and TX has also been discussed in previous studies [e.g., Karl et al., 1993; Brown et al., 2008; Trenberth et al., 2007] and may be related to greater skewness in the TN distribution, as proposed by Simolo et al. [2011] for Europe.

[40] The CMIP models display a decreasing trend in cold spell duration (Figure 5e) over the period 1948 to 2005, which is similar in ERA40 and ERA-Interim. NCEP1 and NCEP2 do not show this decrease and also have generally shorter cold spells in comparison to the other data sets. There is no trend in the observed cold spell duration in HadEX, but a decreasing trend in HadEX2 [Donat et al., 2013]. The warm spell duration (Figure 5f) increases from about 6 days, on average, in the early 1960s to about 20 days at the beginning of the 21st century in the CMIP models as well as in NCEP1 and ERA40. The increase in warm spell duration is smaller in HadEX (about 15 days) and HadEX2 (<10 days) as documented in Donat et al. [2013]. The changes are less pronounced in ERA-Interim and NCEP2, although a sharp increase in the warm spell duration (up to 17 days) is evident after the end of the base period for these reanalyses in the year 2008.

[41] There are visible changes in the percentile indices in the CMIP models and reanalyses in the years following major volcanic eruptions, with prominent peaks in the number of cold days and nights as well as in the cold spell duration. The response is not as pronounced in the percentile indices based on TX, but a decrease in warm days and nights as well as in the warm spell duration is still noticeable.

[42] We noticed a problem in the warm and cold spell duration indices in the predecessor of HadEX2 (i.e., HadEX) [Alexander et al., 2006]. Specifically, the HadEX values for CSDI and WSDI in North America, particularly in the U.S., were an order of magnitude greater than in other regions (see auxiliary material, Figures S6 and S7). These very high values were likely caused by insufficient data precision in the affected region, as discussed in Zhang et al. [2009], which leads to a bias in the temperature percentile exceedance rates. As a consequence of the findings of Zhang et al. [2009], the RClimdex/FClimdex code was revised, and HadEX2 consequently does not appear to be affected by this problem (see auxiliary material, Figures S6 and S7).

4.2 Precipitation Indices

4.2.1 Spatial Representation

[43] HadEX2 has generally poorer spatial coverage for precipitation-based indices compared to temperature-based indices due to the smaller decorrelation length scale for precipitation [Alexander et al., 2006]. The main features of the climatological patterns of extreme precipitation events, as represented by the annual maximum 5 day precipitation (RX5day) and very wet days (R95p), are reasonably well captured by the CMIP5 models as compared to HadEX2 (Figure 6). In particular, the east-west contrast in North America, with higher RX5day and R95p in southeast North America and lower values in the central and northwest interior of North America, and the high RX5day and R95p values in the tropics are reasonably well represented by the CMIP5 models. However, the actual magnitude of precipitation events is underestimated by the models as well as by the reanalyses (particularly NCEP1) (see auxiliary material, Figures S8 and S9) compared to HadEX2. The underestimation of the magnitude of precipitation extremes is also reflected in the ratio of precipitation on very wet days to the total wet-day precipitation (R95pTOT, auxiliary material, Figure S10).

Figure 6.

Same as Figure 2, but for the precipitation indices very wet days (R95p, top), maximum 5 day precipitation amount (RX5day, middle) and maximum number of consecutive dry days (CDD, bottom).

[44] This underestimation is most pronounced in the daily precipitation intensity (SDII, see auxiliary material, Figure S11). The very high SDII values (>11 mm/day) in HadEX2 in eastern North America and in the tropics are not well captured by the CMIP5 models or reanalyses, except NCEP2. Because HadEX2 represents smoothed point estimates while models and reanalyses represent area estimates, we expect to see larger values from HadEX2 than from models [e.g., Chen and Knutson, 2008]; thus, models or reanalyses that simulate larger precipitation extremes than HadEX2 should be considered with caution. However, we have no objective baseline for evaluating models that simulate lower values than HadEX2.

[45] Better agreement is found for the consecutive dry days index (CDD) shown in Figure 6 because dry conditions are usually of larger spatial scale than extreme precipitation events. In comparison to HadEX2, the spatial patterns and magnitudes of CDD are fairly well represented in the CMIP5 models and the reanalyses, except for some noisy features over Eurasia and Antarctica in NCEP1 (see auxiliary material, Figure S12).

[46] A regional analysis of several indices is presented in Figure 7. We can see that the CMIP5 medians of the total annual precipitation (PRCPTOT) and the simple daily intensity (SDII) (Figures 7a and 7b) are generally larger than in CMIP3 across all regions. This is also the case for annual precipitation extremes, such as very wet days (R95p), heavy precipitation days (R10mm), and RX5day (Figures 7c, 7d, and 7e) in most regions, with strong outliers in SEA in the CMIP5 ensemble for R95p (766 mm in BCC-CSM1-1) and for RX5day (301 mm in IPSL-CM5B-LR). Models generally underestimate SDII and RX5day in comparison with HadEX2. The discrepancy is most prominent for SDII (Figure 7b), for which HadEX2 values lie far above the CMIP model range. As mentioned before, this is likely related to the spatial-scale mismatch in indices between HadEX2 and models or reanalyses.

Figure 7.

Same as Figure 3, but for precipitation indices.

[47] The reanalyses are close to each other or within the interquartile model spread for PRCPTOT, for which models and reanalyses compare well with HadEX2, except in regions with very high annual precipitation amounts, such as AMZ and SEA. In AMZ, for instance, PRCPTOT is overestimated by the models and reanalyses compared to HadEX2. The opposite is the case for SEA. This also holds for indices that are related to more extreme aspects of precipitation variability, such as RX5day, R95p, and R10mm.

[48] CMIP5 models generally simulate fewer CDD compared to CMIP3 in relatively dry regions such as southern South America (SSA), MED, SAF, SAS, and CAS (Figure 7f). In several regions, such as AMZ, SSA, CAM, CNA, ALA, EAF, SAF, EAS, TIB, and NAS, the CMIP5 models underestimate CDD compared to HadEX2. Differences between the reanalyses can be as large as the intermodel spread, but CDD in the reanalyses generally center around the HadEX2 values, except in AMZ, MED, and SEA.

[49] CMIP models and reanalyses particularly disagree with HadEX2 in the representation of consecutive wet days (CWD), for which the models and the ERA-Interim and NCEP1 reanalyses simulate much longer wet spells than HadEX2 (see auxiliary material, Figure S5, and Figures S13e and S13f). However, in most regions, CMIP5 models simulate shorter wet spells than CMIP3 models, and NCEP2 simulates CWD closest to HadEX2.

4.2.2 Temporal Evolution of Indices Averaged Over Land

[50] The temporal evolution of the absolute values of precipitation indices averaged over global land is displayed in the left column of Figure 8. Generally, the interquartile model spreads in CMIP3 and CMIP5 are comparable; however, the CMIP5 median tends to be higher. The higher CMIP5 spatial resolution is probably part of the explanation. We looked at individual model time series (see auxiliary material, Figure S14) and found that high-resolution models such as MIROC4h, CCSM4, and CNRM-CM5 generally simulate greater values of PRCPTOT than lower resolution models, with MIROC5 simulating the overall highest PRCPTOT. SDII is greatest in the high resolution models MIROC4h and CMCC-CM and the medium resolution MPI-ESM-LR model, while lower resolution models tend to simulate lower SDII values that are comparable to the CMIP3 median. The high resolution MIROC4h and MIROC5 models simulate the highest values of R95p and RX5day. However, the relatively low resolution model BCC-CSM1-1, which is close to the CMIP5 median for PRCPTOT and SDII, also simulates very high values for R95p and RX5day.

Figure 8.

Same as Figure 4, but for precipitation indices.

[51] ERA40 and ERA-Interim agree with the CMIP5 models from the early 1970s onward. Before that time, ERA40 shows much lower PRCPTOT in comparison to the other reanalyses and models, with a steep increase that ends in 1972, which also marks the start of satellite data assimilation into the ERA40 reanalysis. NCEP2 is in line with NCEP1 for PRCPTOT, but has very high values for SDII and RX5day in comparison with the other reanalyses. The behavior of NCEP1 relative to the other reanalyses and the model results depends on the index under consideration. For example, PRCPTOT in NCEP1 is greater than in the models, NCEP1 SDII agrees with CMIP5, but NCEP1 RX5day is even lower than in CMIP3. Discrepancies in the temporal evolution between the models and reanalyses are diminished by considering index anomalies relative to the reference period 1981–2000 instead of absolute values (Figure 8, right column). The trends in precipitation-based indices in ERA40 and NCEP1 run in opposite directions up to the early 1970s, and subsequently evolve along with the models, with small increasing trends at the end of the 20th century.

[52] CDD is difficult to evaluate because this index shows very large interannual variability in the global land mean (Figures 9a and 9b): by definition, the length of a multiyear dry spell is assigned to the year in which the spell ends. Very long spell durations that may occur in very dry regions significantly affect the spatial averages over land in some years, resulting in large interannual variability. A more robust global estimate of CDD is given by the spatial median instead of the spatial mean (Figures 9c and 9d), which is less influenced by very long dry-spell outliers. The CMIP3 and CMIP5 models agree reasonably well with NCEP2 in simulating the spatial median of CDD, whereas NCEP1 generally simulates more CDD, and ERA40 and ERA-Interim simulate fewer CDD than the CMIP models. ERA40 shows a very steep decrease in CDD from 1958 to the 1970s.

Figure 9.

Time series of the (a, b) global spatial mean and (c, d) spatial median over all land grid points of consecutive dry days (CDD) from 1948 to 2005 of the ensemble mean (solid) and median (dashed) of 31 CMIP5 (black) and 18 CMIP3 (green) models, as well as the reanalyses ERA40 (blue) from 1958 to 2001, ERA-Interim (cyan) and NCEP2 (orange) from 1979 to 2010, and NCEP1 (red) from 1948 to 2005. The shading indicates the interquartile ensemble spread (range between the 25th and 75th quantiles). The left column shows absolute values (note the different scale between Figures 9a and 9c) and the right column shows anomalies with respect to the reference period 1981 to 2000. Grey shading along the horizontal x-axis indicates the evolution of globally averaged volcanic forcing according to Sato et al. [1993].
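The effect of such outliers on the spatial mean can be illustrated with a small synthetic example (the numbers below are entirely artificial and not taken from any of the data sets discussed here): a handful of grid boxes carrying multiyear dry spells shift the spatial mean substantially while leaving the spatial median almost unchanged.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic annual CDD values (days) for 1000 land grid boxes: most boxes
# have modest dry-spell lengths, while a few very dry boxes carry multiyear
# spells assigned to the year in which the spell ends.
cdd = rng.gamma(shape=2.0, scale=15.0, size=1000)
cdd[:5] = 900.0  # multiyear dry-spell outliers

print("spatial mean:  ", round(float(cdd.mean()), 1))      # pulled upward by the outliers
print("spatial median:", round(float(np.median(cdd)), 1))  # largely unaffected
```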

[53] The sensitivity of the spatial mean to CDD outliers is very evident for the IPSL-CM5A-LR and IPSL-CM5A-MR models (see auxiliary material, Figure S14e), which have different resolutions, but simulate similar, very high, CDD (≈120 days). The shortest spatial mean CDD (≈30 days) is simulated by INMCM4, which also simulates very low values for SDII, R95p, and RX5day (Figures S14b–S14d).

[54] The consecutive wet days index (CWD, see auxiliary material, Figures S13e and S13f) is not affected by the outlier problems present in CDD, so the interannual variability in the spatial mean values is comparable to that in other precipitation indices. The CMIP5 models tend to simulate shorter wet spells (≈20 days) than CMIP3 models and ERA-Interim (≈25 days) as well as NCEP1 (≈30 days). Wet spells are much shorter in ERA40 and NCEP2 (about 15 and 10 days, respectively). Results for individual models (auxiliary material, Figure S14f) do not indicate any systematic relationship between the model spatial resolution and the magnitude of CWD. INMCM4 simulates the longest wet spells (≈40 days) while CMCC-CM simulates very short wet spells (≈8 days).

4.3 Metric Analysis of Model Performance

[55] The overall performance of individual models in simulating the 1981–2000 climatology of the indices is summarized in a “portrait” diagram (Figure 10) as introduced in G08 and discussed in section 3.2. The portrait diagram displays the relative magnitude of spatially averaged RMSE for each index (rows) and for each model (columns).

Figure 10.

The “portrait” diagram of relative spatially averaged RMSEs in the 1981–2000 climatologies of temperature and precipitation indices simulated by the CMIP5 models with respect to the four reanalyses, ERA40 (left triangle), ERA-Interim (upper triangle), NCEP1 (right triangle), and NCEP2 (lower triangle). The RMSEs are spatially averaged over global land grid points. The top row indicates the mean relative RMSE across all indices for a particular model, and the grey-shaded columns on the right side indicate the standardized median RMSEmedian,std for CMIP3 and CMIP5 (see text for details).

[56] The colors characterize the magnitudes of the RMSEs, with warmer colors indicating models that perform worse than others, on average, and colder colors indicating models that perform better than others, on average. The model performance is assessed with respect to the four reanalyses, ERA40 (left triangle), ERA-Interim (upper triangle), NCEP1 (right triangle), and NCEP2 (lower triangle), in each cell of the diagram. HadEX2 is not used as a reference data set due to its limited spatial coverage for some indices.

[57] In addition to individual models, the performance of the so-called mean and median models is also displayed in the first two columns. The mean and median model is obtained by calculating the multimodel mean and median of an index first, and then deriving its relative RMSE. Consistent with the findings in G08 and other multimodel studies, the mean and median models generally outperform individual models because some of the systematic errors in individual models are canceled out in the multimodel mean or median. In contrast to G08, where similar model performance is reported for two reference data sets (ERA40 and NCEP1), there are larger discrepancies between the reanalyses when evaluating extremes indices, indicating considerable uncertainty in the reanalyses with respect to extremes.

[58] For the temperature-based percentile indices (i.e., TX90p, TX10p, TN90p, and TN10p), the models generally perform well (except FGOALS-s2 and IPSL-CM5A-LR) over global land (Figure 10a) as a consequence of their construction. Deviations from the nominal level of 10% outside the base period are mainly due to differences in the estimated trends in TN and TX of the individual models as compared to the respective reanalysis data set (see also section 4.1.2). An evaluation of these indices in terms of the model's ability to simulate observed temperature variability can be achieved when considering alternative approaches to the calculation of the percentile indices (Sillmann et al., in preparation).

[59] The other temperature indices are also represented reasonably well by most models, particularly ACCESS1-0, CCSM4, CESM1-BGC, and MRI-CGCM3. A distinct pattern of performance metrics that is robust to the choice of reference data set cannot be seen for models that perform worse than the median CMIP5 model.

[60] For the precipitation indices, the patterns of performance metrics are more consistent across reanalyses, although the performance of several models (e.g., BCC-CSM1-1, IPSL-CM5B-LR, MIROC4h, MIROC5, and MRI-CGCM3) deviates substantially from that of the median model with respect to ERA-Interim and NCEP1. Most models have problems in simulating the spell duration indices (CDD and CWD). Models that perform relatively well for most precipitation indices include ACCESS1-0, CCSM4, CESM1-BGC, HadGEM2-ES(-CC), and MPI-ESM-MR.

[61] We complement the original G08 “portrait” diagram with two additional summary statistics in Figure 10. The top row indicates the overall performance of each model averaged across all indices (RMSEall). The grey-shaded columns at the right-hand side indicate the median RMSE standardized by the spatial standard deviation of the index climatology in the reanalyses, RMSEmedian,std. Based on the overall RMS statistic for all indices, RMSEall, the ACCESS1-0 and MPI-ESM-P models appear to perform better than the other models with respect to all four reanalyses, followed by CCSM4, CESM1-BGC, CMCC-CM, MIROC5, MPI-ESM-LR(-MR), and MRI-CGCM3, which perform slightly worse with respect to NCEP1.

[62] The magnitude of the multimodel median error on a global scale, as measured by RMSEmedian,std, is generally larger for precipitation indices than for the absolute and threshold indices based on temperature. The exceptions are WSDI, CSDI, and DTR, which are poorly represented by the models on a global scale (Figure 10) as well as on hemispheric scales and in the tropics (auxiliary material, Figure S15). The comparison between extratropical and tropical land regions based on RMSEmedian,std (Figure S15) further indicates that temperature-based absolute and threshold indices are generally better represented in the tropics, except for TXx, SU, and TR. In contrast, the percentile indices are better represented in the extratropics. The precipitation indices are not as well represented by the models in the tropics as in extratropical regions, with the exception of CDD.

5 Summary and Concluding Remarks

[63] This paper provides a first overview of the performance of the CMIP5 multimodel ensemble in simulating climate extremes indices defined by the ETCCDI, as compared to reanalyses and the CMIP3 ensemble as well as to an observation-based set of indices (HadEX2). The set of 27 indices (Table 1) is calculated with a consistent methodology for 18 CMIP3 and 31 CMIP5 models as well as the reanalyses, and the indices are available for download for further analysis and applications from the ETCCDI indices archive (EIA, http://www.cccma.ec.gc.ca/data/climdex/climdex.shtml).

[64] The archive will also include indices for multiple ensemble members, if available, for each model, although only the first ensemble member of each model is considered in this study. Our preliminary analyses using multiple ensemble members (not shown) confirm findings of Kharin et al. [2007] that the spread between the ensemble members for an individual model is generally small compared to the multimodel spread for the considered indices.

[65] Our results indicate that for the temperature indices, the performance of the CMIP3 and CMIP5 multimodel ensembles is similar in regard to their ensemble mean and median, but that the spread amongst CMIP3 models tends to be larger than amongst CMIP5 models despite the larger number of models in the CMIP5 ensemble. At high northern latitudes, CMIP5 models simulate warmer TXx and colder TNn than CMIP3 models, and also show a greater diurnal temperature range.

[66] For the precipitation indices, the intermodel uncertainty in the CMIP3 and CMIP5 ensembles is comparable, but the CMIP5 models tend to simulate more intense precipitation and fewer consecutive wet days than the CMIP3 models, and thus are closer to the observations as represented by the HadEX2 indices. This indicates an improvement in the CMIP5 model generation with respect to model deficits in reproducing observed precipitation characteristics as, for instance, discussed in Stephens et al. [2010], who find that rainfall is simulated too often and too lightly in models. This improvement could in part be due to the generally higher spatial resolution of CMIP5 models compared to CMIP3. The effect of increasing resolution on precipitation extremes has been discussed, for instance, in Wehner et al. [2010]. Model improvements in the parametrization of unresolved physical processes, most notably of convective precipitation, may also play a role and need to be further investigated. Large differences still occur in the SDII, which is smaller in all CMIP5 models in comparison to HadEX2. However, we expect that models should tend to under-simulate the magnitude of precipitation extremes as represented by the HadEX2 indices, because of the spatial-scale mismatch between the area estimates of models (and reanalyses) and the smoothed point estimates represented by HadEX2.
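For reference, SDII is the total precipitation falling on wet days divided by the number of wet days (days with at least 1 mm); the following minimal single-series sketch, with a hypothetical input array of daily precipitation in mm, illustrates the definition.

```python
import numpy as np

def sdii(pr_daily, wet_threshold=1.0):
    """Simple daily intensity index for one year of daily precipitation (mm):
    mean precipitation amount on wet days (daily total >= 1 mm)."""
    wet = pr_daily[pr_daily >= wet_threshold]
    return float(wet.sum() / wet.size) if wet.size else 0.0
```

Because a model grid cell averages rainfall over a large area, its wet-day amounts are diluted relative to point-based station values, which is consistent with the scale-mismatch expectation stated above.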

[67] We showed that there are large discrepancies among the four reanalyses considered in this paper. The differences between the reanalyses can be as large as the typical intermodel spread of the CMIP ensembles for some indices (e.g., TNn, FD, RX5day) and in some regions. The agreement between the reanalyses and HadEX2 improves considerably for precipitation indices in the early 1970s, coinciding with the beginning of satellite data assimilation. Therefore, a careful selection of reference data sets is necessary when evaluating climate model simulations in terms of extremes indices. Note that near-surface temperature fields, such as TN and TX, and precipitation, which are used as input data for the indices calculation, are not well constrained in the reanalyses [Kalnay et al., 1996]. The model evaluation should therefore include data sets based on gridded observations, such as the specialized gridded product introduced in Haylock et al. [2008], which is comparable with the area average values of models. Such an effort is, however, only possible in regions, such as Europe, that have a high density of observational stations. Thus, HadEX2 is currently the only source of information on temperature and precipitation extremes for a global assessment of model performance as presented in this study.

[68] We also showed that discrepancies between the models and reanalyses are reduced by considering anomalies relative to a reference period (i.e., 1981–2000). Increasing trends over the period 1948 to 2005 are simulated by the models for some temperature indices, such as the hottest day of the year (TXx) and the coldest night of the year (TNn), as well as the numbers of SU and TR. Decreasing trends are simulated for the numbers of FD and ID, particularly in the last two decades, which agrees with trends in the observation-based HadEX and HadEX2 data sets (see Alexander et al. [2006] and Donat et al. [2013], respectively). These trends are also reflected in the percentile indices, which are more robust, by definition, than the other categories of indices. We see increases in warm nights and days and decreases in cold nights and days in the models and reanalyses over the past few decades, which is in close agreement with other studies [e.g., Alexander et al., 2006; Morak et al., 2011]. We only assessed trend patterns in the indices qualitatively; a comprehensive quantitative trend analysis including model simulations, reanalyses, and observations is the subject of a subsequent study.

[69] Our analysis further indicates the presence of the volcanic forcing signal in some temperature indices simulated by the CMIP5 models as reflected, for instance, by a decrease in the number of warm days and nights and an increase in the number of cold days and nights in years following major volcanic eruptions.

[70] Consistent with previous multimodel ensemble studies, the multimodel ensemble mean and in particular the ensemble median, which is robust to possible outliers, generally outperform any individual model across all indices considered in this study. Although the relative performance of an individual model may depend on the choice of the reference data set, the mean and median models tend to outperform individual models with respect to the four reanalyses used for verification. On a global and hemispheric scale, the models are closer to the reanalyses in terms of standardized RMSEs for the temperature indices but less so for the precipitation indices.

[71] Without any claim for completeness, this paper can only provide a first-order assessment of the CMIP5 model performance in terms of the ETCCDI indices. Other variables, such as surface moist enthalpy and water vapor content [e.g., Peterson et al., 2011], that can affect trends in temperature and precipitation extremes should also be evaluated in models, because it is important to combine the information available from the various ETCCDI indices with that from other variables, such as humidity, when considering future trends in extremes [e.g., Fischer and Knutti, 2013]. Some temperature extremes indices can also be systematically affected by land use and land cover changes, as pointed out by Avila et al. [2012]. Our model evaluation should further be complemented by more detailed studies, for instance, regarding seasonal aspects of model performance and specific regions that are particularly sensitive to the occurrence of and changes in extreme climate events. The ETCCDI indices archive, which is made available to the climate community as an essential part of this study, contributes to such efforts.

Acknowledgments

[72] We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output. For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. Furthermore, we acknowledge the efforts of ECMWF and NCEP in providing the reanalysis data sets. We also acknowledge Lisa Alexander and Markus Donat for their helpful discussions regarding the HadEX and HadEX2 data sets. Many thanks to Uwe Schulzweida for his continuous support with the Climate Data Operators (CDOs). The authors further thank Nathan Gillett and two anonymous reviewers for their helpful comments on the manuscript. F. Zwiers, L. Alexander, and M. Donat are supported by the Australian Research Council (grant LP100200690). J. Sillmann is funded by the German Research Foundation (grant Si 1659/1-1).
