A score-based method for assessing the performance of GCMs: A case study of southeastern Australia

Guobin Fu (1), Zhaofei Liu (2,3), Stephen P. Charles (1), Zongxue Xu (3), and Zhijun Yao (2)

1. CSIRO Land and Water, Wembley, Western Australia, Australia
2. Institute of Geographical Sciences and Natural Resource Research, Chinese Academy of Sciences, Beijing, China
3. Key Laboratory of Water and Sediment Sciences, Ministry of Education, College of Water Sciences, Beijing Normal University, Beijing, China

Corresponding authors: G. Fu, CSIRO Land and Water, Private Bag 5, Wembley, WA 6913, Australia (Guobin.Fu@csiro.au); Z. Liu, Institute of Geographical Sciences and Natural Resource Research, Chinese Academy of Sciences, Beijing 100101, China (zfliu@igsnrr.ac.cn)

[1] A multi-criteria score-based method is developed to assess General Circulation Model (GCM) performance at the regional scale. The method is applied to 25 GCM simulations of monthly mean sea level pressure (MSLP), monthly air temperature, and monthly and annual rainfall over the southeastern Australia region for 1960/1961–1999/2000. The results indicate that GCMs usually simulate monthly temperature better than monthly rainfall and MSLP. For example, the mean observed annual temperature for the study region is 16.7°C, while the median and mean values of the 25 GCMs are 16.8 and 16.9°C, respectively, and 24 GCMs (all except BCC:CM1) reproduce the annual cycle of temperature accurately, with a minimum correlation coefficient of 0.99. In contrast, the mean observed annual rainfall for the study region is 502 mm, whereas the GCM values vary from 195 to 807 mm, and 12 out of 25 GCMs produce a negative correlation coefficient for the annual cycle of monthly rainfall. The GCMs, however, overestimate the trend magnitude for temperature but underestimate it for rainfall. The observed annual temperature trend is +0.007°C/yr, while both the median and mean GCM values are +0.013°C/yr, almost double the observed magnitude. The observed annual rainfall trend is +0.62 mm/yr, while the median and mean values of the 25 GCMs are 0.21 and 0.36 mm/yr, respectively. This demonstrates the advantage of using multiple criteria to assess GCM performance. The method developed in this study can easily be extended to different study regions, and its results can support better informed regional climate change impact analysis.

1 Introduction

[2] General Circulation Model (GCM) climate projections have large uncertainties, especially at regional scales given their coarse horizontal resolution [Trenberth, 1997; Giorgi and Francisco, 2000; Murphy et al., 2004; Dessai et al., 2005; Fu and Charles, 2007], a major shortfall for assessing regional impacts of climate change [McCarthy et al., 2001]. Uncertainties stem from a hierarchy of sources [Giorgi and Francisco, 2000]: (1) uncertainty related to the forcing scenarios, (2) uncertainty between different GCMs, (3) uncertainty from different realizations for a given scenario and GCM, and (4) uncertainty related to sub-grid scale forcing and processes.

[3] Inter-model variability as a source of uncertainty has been assessed by comparing different GCMs at regional scales [Kittel et al., 1998; Hulme and Brown, 1998; Giorgi and Francisco, 2000; Giorgi and Mearns, 2002; Stainforth et al., 2007; Smith and Chandler, 2010]. Several global model intercomparison projects have been carried out to gain insight into model behavior by comparing model results among themselves and with observations [Lambert and Boer, 2001]. These projects include the Atmospheric Model Intercomparison Project (AMIP), the Paleoclimatic Model Intercomparison Project (PMIP), the Seasonal Forecast Model Intercomparison Project (SMIP), and the Coupled Model Intercomparison Project (CMIP). Harvey and Wigley [2003] have summarized the results of these global assessment studies, concluding that the better a model simulates the complex spatial patterns and seasonal and diurnal cycles of the present climate, the more confidence there is that the major important processes have been adequately represented. Consequently, for models to predict future climate conditions reliably, not only must they accurately simulate the current climate system, but they should also skilfully simulate changes in the climate system [Spelman and Manabe, 1984; Collier et al., 2006; Delworth et al., 2006; Randall et al., 2007]. This can involve running models in a hindcast mode to ascertain whether they can accurately predict observed regional changes over multi-decadal time scales. In addition to global-scale comparisons, numerous regional-scale analyses have also been undertaken [Cubasch et al., 1995, 1996; Kittel et al., 1998; Hulme and Brown, 1998; Giorgi and Francisco, 2000; Giorgi and Mearns, 2002; Stainforth et al., 2007; Smith and Chandler, 2010]. For example, Kittel et al. [1998] compared the biases of nine GCMs' outputs against observations over seven sub-continental domains ranging from low- to high-latitude. Results showed that the CSIRO Mk2 and HadCM2 GCMs performed comparatively well for both temperature and rainfall, while for the other GCMs assessed, low biases in one variable did not necessarily mean low biases in others. Nieto and Puebla [2006] compared observed and modeled rainfall over the Iberian Peninsula using empirical orthogonal function (EOF) and spectral analyses. Their results indicated that the Geophysical Fluid Dynamics Laboratory (GFDL) model results best correlated with the observed annual rainfall cycle and that the GFDL and Meteorological Research Institute (MRI) models better reproduced observed winter rainfall. Perkins et al. [2007] evaluated the Intergovernmental Panel on Climate Change Fourth Assessment Report (IPCC AR4) climate models in terms of simulating daily maximum temperature, minimum temperature, and rainfall probability density functions over Australia. Three models, MIROC-M, CSIRO, and ECHO-G, which were more skilful over Australia than the other models, were recommended for use in climate change impact assessments. Maxino et al. [2008] quantified the skill of the IPCC AR4 models' probability density functions for maximum and minimum temperatures and rainfall over the Murray-Darling Basin in Australia. The CSIRO, IPSL, and MIROC-M models captured the observed PDFs of these variables relatively well.

[4] However, the assessed performance of models depends heavily on the type and number of assessment criteria used [Schaller et al., 2011]. In this study, a score-based method has been developed to provide a multi-criteria assessment of model performance accounting for long-term mean and standard deviation, seasonal variation (annual cycle), temporal and spatial distributions, and probability density function (PDF). The motivation for using multiple criteria spanning a wide range of statistics is that a GCM that performs well for one statistic does not necessarily perform well for another. The benefit of the proposed method is that it provides a more comprehensive assessment than any individual criterion, which, as the results of this study show, can produce a biased assessment. The sensitivity of each individual criterion was investigated by comparing the overall assessment with assessments including or excluding that criterion. The southeastern Australian region was used for the case study, but the method can be employed in other regions to assess GCM performance at regional scales, and the results provide a reference for selecting GCMs in climate change impact studies.

2 Materials and Method

2.1 Data Set

2.1.1 Rainfall and Temperature Data

[5] The comprehensive archive of Australian gridded climate data, the SILO Data Drill [Jeffrey et al., 2001], was used in this study to assess GCM performance across southeastern Australia. Records from more than 6000 meteorological stations in the study region (Figure 1) were used to interpolate surfaces on a regular 0.05° grid with a thin plate smoothing spline and ordinary kriging [Jeffrey et al., 2001]. The daily records have been constructed using spatial interpolation algorithms to infill missing data. Variables include daily rainfall, maximum and minimum temperatures, evaporation, solar radiation, and vapor pressure. The daily mean temperature was estimated as the average of the daily maximum and minimum temperatures [Fu et al., 2007]. The data in the Data Drill are all synthetic; no original meteorological station data remain in the calculated grid fields. However, the Data Drill does have the advantage of providing complete spatial coverage of Australia [Jeffrey et al., 2001]. The daily data were aggregated to monthly values and to the centers of 2.5° × 2.5° grid cells.
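A minimal sketch of these two aggregation steps (daily records accumulated to monthly values, then 0.05° cells block-averaged onto 2.5° cell centers); the array names, shapes, and helper functions are illustrative assumptions, not the SILO processing itself:

```python
import numpy as np
import pandas as pd

def daily_to_monthly(daily, dates, how="sum"):
    """Aggregate a (time, lat, lon) daily array to monthly values.
    Use how='sum' for rainfall and how='mean' for temperature."""
    t, ny, nx = daily.shape
    df = pd.DataFrame(daily.reshape(t, ny * nx), index=pd.DatetimeIndex(dates))
    monthly = df.resample("MS").sum() if how == "sum" else df.resample("MS").mean()
    return monthly.to_numpy().reshape(-1, ny, nx)

def coarsen(field, factor=50):
    """Block-average a (time, lat, lon) 0.05-degree field onto 2.5-degree
    cells (2.5 / 0.05 = 50 fine cells per coarse cell per dimension)."""
    t, ny, nx = field.shape
    return (field.reshape(t, ny // factor, factor, nx // factor, factor)
                 .mean(axis=(2, 4)))
```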

Figure 1.

Location of the study region, meteorological stations used as input to Data Drill and the 21 selected GCM/NCEP grids.

2.1.2 Mean Sea Level Pressure Data

[6] National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis mean sea level pressure (MSLP) data were used as "observed" because of a lack of observed MSLP in the Data Drill data set. The original 2.5° by 2.5° latitude-longitude grid was linearly interpolated to the 2.5° by 2.5° study grid shown in Figure 1. The NCEP/NCAR Reanalysis project uses a state-of-the-art analysis/forecast system to perform data assimilation for data from 1948 to the present [Kalnay et al., 1996]. This involves the recovery of land surface, ship, rawinsonde, pibal, aircraft, satellite, and other data, as well as data quality control and assimilation. The product is therefore widely used by the international science community.

2.1.3 GCM Data

[7] Twenty-five CMIP3 GCMs were used in this study. The model name, institute, and data periods are listed in Table 1. Detailed information for these models can be found at the IPCC Data Distribution Centre website (http://ipcc-ddc.cru.uea.ac.uk).

Table 1. CMIP3 Climate Models Data Description

| GCMs | ID | Originating Group(s) | Country | Resolution | Selected Period |
|---|---|---|---|---|---|
| BCC:CM1 | 1 | Beijing Climatic Center | China | 1.9° × 1.9° | 1961–2000 |
| BCCR:BCM20 | 2 | Bjerknes Centre for Climatic Research | Norway | 1.9° × 1.9° | 1961–2000 |
| CGCM3.1_T47 | 3 | Canadian Centre for Climatic Modelling and Analysis | Canada | 2.8° × 2.8° | 1961–2000 |
| CGCM3.1_T63 | 4 | | | 1.9° × 1.9° | |
| CNRM:CM3 | 5 | Météo-France/Centre National de Recherches Météorologiques | France | 1.9° × 1.9° | 1960–1999 |
| CSIRO:MK30 | 6 | Commonwealth Scientific and Industrial Research Organization (CSIRO) Atmospheric Research | Australia | 1.9° × 1.9° | 1961–2000 |
| CSIRO:MK35 | 7 | | | 1.9° × 1.9° | |
| GFDL:CM20 | 8 | U.S. Department of Commerce/National Oceanic and Atmospheric Administration (NOAA)/Geophysical Fluid Dynamics Laboratory (GFDL) | USA | 2.0° × 2.5° | 1961–2000 |
| GFDL:CM21 | 9 | | | 2.0° × 2.5° | |
| GISS:AOM | 10 | National Aeronautics and Space Administration (NASA)/Goddard Institute for Space Studies (GISS) | USA | 3° × 4° | 1961–2000 |
| GISS:EH | 11 | | | 4° × 5° | |
| GISS:ER | 12 | | | 4° × 5° | |
| IAP:FGOALS-g1.0 | 13 | National Key Laboratory of Numerical Modelling for Atmospheric Sciences and Geophysical Fluid Dynamics (LASG)/Institute of Atmospheric Physics | China | 2.8° × 2.8° | 1960–1999 |
| INGV:ECHAM4 | 14 | Climatic Research Centre | Germany | 2.8° × 2.8° | 1961–2000 |
| INM:CM30 | 15 | Institute for Numerical Mathematics | Russia | 4° × 5° | 1961–2000 |
| IPSL:CM4 | 16 | Institut Pierre Simon Laplace | France | 2.5° × 3.75° | 1961–2000 |
| MIROC3.2_hires | 17 | Center for Climatic System Research (University of Tokyo), National Institute for Environmental Studies, and Frontier Research Center for Global Change (JAMSTEC) | Japan | 1.1° × 1.1° | 1961–2000 |
| MIROC3.2_medres | 18 | | | 2.8° × 2.8° | |
| MIUB:ECHO_G | 19 | Meteorological Institute of the University of Bonn, Meteorological Research Institute of KMA, and Model and Data Group | Germany/Korea | 3.9° × 3.9° | 1961–2000 |
| MPI:ECHAM5 | 20 | Max Planck Institute for Meteorology | Germany | 1.9° × 1.9° | 1961–2000 |
| MRI:CGCM2.3.2 | 21 | Meteorological Research Institute | Japan | 2.8° × 2.8° | 1961–2000 |
| NCAR:CCSM3 | 22 | National Center for Atmospheric Research | USA | 1.4° × 1.4° | 1961–2000 |
| NCAR:PCM | 23 | | | 2.8° × 2.8° | 1960–1999 |
| UKMO:HadCM3 | 24 | Hadley Centre for Climatic Prediction and Research/Met Office | UK | 2.5° × 3.75° | 1960–1999 |
| UKMO:HadGEM1 | 25 | | | 1.3° × 1.9° | |

[8] Since GCM horizontal resolution varies, the GCM outputs were interpolated to a uniform resolution of 2.5° × 2.5°. This leads to 21 grid cells over the study region (Figure 1). The study period is 1961–2000 (or 1960 to 1999 for GCMs that do not have data for 2000; Table 1). The 25 GCM runs used here are forced by 20th century emissions scenarios, i.e., IPCC AR4 20th century experiment scenario 20C3M.
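A sketch of this regridding step, assuming simple bilinear interpolation onto the cell centers of a uniform 2.5° grid (the interpolation scheme and the grid coordinates shown here are illustrative assumptions, not taken from the paper):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def regrid_bilinear(field2d, src_lat, src_lon, tgt_lat, tgt_lon):
    """Bilinearly interpolate one (lat, lon) field onto a target grid.
    src_lat and src_lon must be monotonically increasing."""
    interp = RegularGridInterpolator((src_lat, src_lon), field2d,
                                     method="linear",
                                     bounds_error=False, fill_value=None)
    lat2d, lon2d = np.meshgrid(tgt_lat, tgt_lon, indexing="ij")
    pts = np.column_stack([lat2d.ravel(), lon2d.ravel()])
    return interp(pts).reshape(lat2d.shape)

# hypothetical coarse GCM field regridded onto 2.5-degree cell centers
src_lat = np.arange(-45, -20, 2.8)
src_lon = np.arange(135, 155, 2.8)
field = np.random.default_rng(1).normal(1016, 4, (len(src_lat), len(src_lon)))
tgt_lat = np.arange(-40, -27.5, 2.5)
tgt_lon = np.arange(140, 152.5, 2.5)
print(regrid_bilinear(field, src_lat, src_lon, tgt_lat, tgt_lon).shape)
```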

2.2 Method

[9] The criteria by which the GCMs are assessed are listed in Table 2. A rank score (RS) value of 0–9 was computed for each individual assessment criterion as

$$RS_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \times 9 \qquad (1)$$

where $x_i$ is the relative error or relationship statistic between the GCM output and observations for the ith GCM, and $x_{\min}$ and $x_{\max}$ are the smallest and largest values of that statistic across the 25 GCMs; statistics are oriented so that a larger RS indicates relatively poorer performance (section 3.4). Figure 2 shows an example of how equation (1) is used to rank the 25 GCMs based on one criterion (the BS score, described later). The total RS for an individual GCM (for a specific climate variable) was obtained by summing the RS for each criterion used. All assessment criteria have a 1.0 weight in this summation except the trend test, trend magnitude, EOF1 and EOF2, and the two PDF criteria (BS and Sscore), which are weighted 0.5 each (Table 2). This total RS was then used to rank the GCMs for each of the three climate variables: monthly mean temperature, rainfall, and MSLP.
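A minimal sketch of the scoring of equation (1) and the weighted summation; the criterion names and the smaller-is-better orientation flags are illustrative assumptions (consistent with section 3.4, where larger-score groups are rejected):

```python
import numpy as np

# Table 2 weights; True means a smaller criterion value indicates a better fit
CRITERIA = {
    "mean_RE":      (1.0, True),  "std_RE":       (1.0, True),
    "NRMSE":        (1.0, True),  "cycle_corr":   (1.0, False),
    "spatial_corr": (1.0, False), "MK_Z_RE":      (0.5, True),
    "slope_RE":     (0.5, True),  "EOF1_RE":      (0.5, True),
    "EOF2_RE":      (0.5, True),  "BS":           (0.5, True),
    "Sscore":       (0.5, False),
}

def rank_score(x, smaller_is_better=True):
    """Equation (1): scale one criterion across the GCMs onto 0-9,
    with 0 for the best-performing model and 9 for the worst."""
    x = np.asarray(x, dtype=float)
    rs = (x - x.min()) / (x.max() - x.min()) * 9.0
    return rs if smaller_is_better else 9.0 - rs

def total_rank_score(values):
    """values: dict of criterion name -> raw statistic array over the GCMs."""
    return sum(w * rank_score(values[name], small)
               for name, (w, small) in CRITERIA.items())
```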

Table 2. Statistics of Climate Variables and Their Criteria and Weights

| Statistics of Climate Variables | Methods | Weights |
|---|---|---|
| Mean | Relative error (%) | 1.0 |
| Standard deviation | Relative error (%) | 1.0 |
| Temporal variation | NRMSE | 1.0 |
| Monthly distribution (annual cycle) | Correlation coefficient | 1.0 |
| Spatial distribution | Correlation coefficient | 1.0 |
| Trend and its magnitude | Mann-Kendall test Z | 0.5 |
| | Trend magnitude β | 0.5 |
| Space-time variability | EOF 1 | 0.5 |
| | EOF 2 | 0.5 |
| Probability density functions (PDFs) | BS | 0.5 |
| | Sscore | 0.5 |
Figure 2.

Example of ranking 25 GCMs based on BS statistics (x axis labels are GCM IDs as found in Table 1).

[10] The 11 assessment criteria (Table 2) characterize each climate variable's long-term mean and standard deviation, seasonal variation (annual cycle), temporal and spatial distributions, and probability density function (PDF). For the long-term monthly mean and standard deviation, the relative error (RE) was used to quantify the similarity between modeled and observed values:

$$RE = \frac{\frac{1}{n}\sum_{i=1}^{n} X_{mi} - \frac{1}{n}\sum_{i=1}^{n} X_{oi}}{\frac{1}{n}\sum_{i=1}^{n} X_{oi}} \times 100\% \qquad (2)$$

where Xmi and Xoi are the modeled and observed ith values of time series and n is the sample length (480 months in this study).

[11] The normalized root mean square error (NRMSE), defined as root mean square error divided by the corresponding standard deviation of the observed field [Randall et al., 2007], was used to compare the similarity of two time series by considering both mean value and standard deviation:

$$NRMSE = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_{mi} - X_{oi})^2}}{\sigma_o} \qquad (3)$$
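The two error measures translate directly into code; a sketch following equations (2) and (3), with hypothetical series:

```python
import numpy as np

def relative_error(stat_model, stat_obs):
    """Equation (2): relative error (%) of a modeled statistic
    (e.g., the long-term mean or standard deviation) against observed."""
    return (stat_model - stat_obs) / stat_obs * 100.0

def nrmse(xm, xo):
    """Equation (3): root mean square error between modeled and observed
    series, normalized by the observed standard deviation."""
    xm, xo = np.asarray(xm, float), np.asarray(xo, float)
    return np.sqrt(np.mean((xm - xo) ** 2)) / np.std(xo)

# e.g., 480 months of hypothetical modeled and observed temperatures
rng = np.random.default_rng(0)
obs = 15 + 5 * np.sin(np.linspace(0, 80 * np.pi, 480))
mod = obs + rng.normal(0, 1, 480)
print(relative_error(mod.mean(), obs.mean()), nrmse(mod, obs))
```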

[12] The correlation coefficient was used to evaluate both the annual cycle and the spatial distribution of monthly climate variables. For the annual cycle, the correlation coefficient was calculated between observed and modeled long-term monthly mean values (i.e., a sample size of 12). For the spatial distribution, the correlation coefficient was calculated between observed and modeled long-term means for each individual grid cell; the sample size is thus 21, the number of grid cells over the study region. The rank-based nonparametric Mann-Kendall test [Hirsch et al., 1982] and a trend magnitude estimator were applied to the annual time series to detect long-term monotonic trends and quantify their magnitudes. The Mann-Kendall test statistic Z, for a particular climate variable, observed or from a GCM, was estimated by

$$Z = \begin{cases} \dfrac{S-1}{\sqrt{\mathrm{Var}(S)}}, & S > 0 \\ 0, & S = 0 \\ \dfrac{S+1}{\sqrt{\mathrm{Var}(S)}}, & S < 0 \end{cases} \qquad (4a)$$

$$S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \operatorname{sgn}(x_j - x_i) \qquad (4b)$$

$$\operatorname{sgn}(\theta) = \begin{cases} 1, & \theta > 0 \\ 0, & \theta = 0 \\ -1, & \theta < 0 \end{cases} \qquad (4c)$$

$$\mathrm{Var}(S) = \frac{1}{18}\left[ n(n-1)(2n+5) - \sum_{t} t(t-1)(2t+5) \right] \qquad (4d)$$

where x is the climate variable annual time series, a sample of n data points (40 in this study), t is the extent of any given tie (the number of consecutive equal values), and $\sum_t$ denotes the summation over all ties.

[13] The trend magnitude β, proposed by Sen [1968] and extended by Hirsch et al. [1982], is defined as

$$\beta = \operatorname{median}\left( \frac{x_j - x_i}{j - i} \right), \quad 1 \le i < j \le n \qquad (5)$$

where the slope estimator β is the median over all possible pairs (i, j) in the data set [Fu et al., 2004] and x is the annual time series of the variable, i.e., annual rainfall, temperature, or MSLP. The relative error (equation (2)) was used to assess how close the Z statistic and β magnitude of each GCM are to the observed values.
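A sketch of the Mann-Kendall statistic and the trend magnitude estimator, following equations (4a)-(4d) and (5), applied to a hypothetical 40-year annual series:

```python
import numpy as np

def mann_kendall_z(x):
    """Equations (4a)-(4d): Mann-Kendall test statistic Z for a monotonic
    trend, with the tie correction in Var(S)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i])
            for i in range(n - 1) for j in range(i + 1, n))
    counts = np.unique(x, return_counts=True)[1]
    t = counts[counts > 1]                            # extents of ties
    var_s = (n * (n - 1) * (2 * n + 5)
             - np.sum(t * (t - 1) * (2 * t + 5))) / 18.0
    if s > 0:
        return (s - 1) / np.sqrt(var_s)
    if s < 0:
        return (s + 1) / np.sqrt(var_s)
    return 0.0

def sen_slope(x):
    """Equation (5): median of the pairwise slopes over all i < j."""
    x = np.asarray(x, dtype=float)
    return np.median([(x[j] - x[i]) / (j - i)
                      for i in range(len(x) - 1) for j in range(i + 1, len(x))])

# e.g., a hypothetical 40-year annual rainfall series
rng = np.random.default_rng(4)
series = 500 + 0.6 * np.arange(40) + rng.normal(0, 30, 40)
print(mann_kendall_z(series), sen_slope(series))
```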

[14] Empirical orthogonal function (EOF) analysis was used to compare the space-time variability of observed and GCM monthly variables [Harvey and Wigley, 2003]. One advantage of using EOFs is the ability to identify and quantify the spatial structures of correlated variability [Mu et al., 2004]. The first two leading EOF modes, which account for the majority of the total variance, were compared between observations and GCMs.
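A minimal sketch of how the explained variances of the two leading modes could be obtained, assuming the EOFs are computed by singular value decomposition of the temporal anomaly matrix (the numerical method is not specified in the text):

```python
import numpy as np

def leading_eof_variance(field, n_modes=2):
    """Percentage of total variance explained by the leading EOF modes of a
    (time, space) data matrix, via SVD of the temporal anomalies."""
    anom = field - field.mean(axis=0)                # remove the time mean
    s = np.linalg.svd(anom, compute_uv=False)        # singular values
    return s[:n_modes] ** 2 / np.sum(s ** 2) * 100.0

# e.g., 480 months x 21 grid cells of hypothetical data
data = np.random.default_rng(2).normal(size=(480, 21))
print(leading_eof_variance(data))                    # explained variance (%)
```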

[15] Two skill scores, the Brier score (BS) and Significance score (Sscore) were used to evaluate the GCM probability density functions (PDFs) of monthly climate variables:

$$BS = \frac{1}{n}\sum_{i=1}^{n} (P_{mi} - P_{oi})^2 \qquad (6)$$

$$Sscore = \sum_{i=1}^{n} \min(P_{mi}, P_{oi}) \qquad (7)$$

where Pmi and Poi are the modeled and observed ith probability values of each bin and n is the number of bins. All data sets were binned based on their data ranges, with bin sizes varying but the number of bins fixed at 100. The BS is a mean squared error measure for probability forecasts [Brier, 1950], and the Sscore calculates the cumulative minimum value of observed and modeled distributions for each bin, quantifying the overlap between two PDFs [Perkins et al., 2007; Watterson, 2008]. Therefore, a smaller value of BS and a larger value of Sscore indicate relatively better performance of a GCM.
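A sketch of the two PDF skill scores following equations (6) and (7); binning over the pooled observed-and-modeled range is an assumption here, as the text fixes only the number of bins (100):

```python
import numpy as np

def pdf_scores(xm, xo, n_bins=100):
    """Equations (6) and (7): Brier score (smaller is better) and Sscore
    (larger is better) between modeled and observed binned PDFs."""
    xm, xo = np.asarray(xm, float), np.asarray(xo, float)
    edges = np.linspace(min(xm.min(), xo.min()), max(xm.max(), xo.max()),
                        n_bins + 1)
    pm = np.histogram(xm, bins=edges)[0] / len(xm)   # bin probabilities
    po = np.histogram(xo, bins=edges)[0] / len(xo)
    bs = np.mean((pm - po) ** 2)
    sscore = np.sum(np.minimum(pm, po))              # overlap of the two PDFs
    return bs, sscore

# hypothetical modeled and observed monthly series at one grid cell
rng = np.random.default_rng(3)
print(pdf_scores(rng.normal(16, 5, 480), rng.normal(17, 5, 480)))
```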

2.3 Southeastern Australia

[16] Southeastern Australia (Figure 1) was selected for this study as it has recently experienced a decadal drought with an unprecedented decline in streamflow. The 1997–2009 dry spell is the driest 13 year period in the last 110 years of reliable climate records [CSIRO, 2010]. Moreover, there is evidence that this drought is partly attributable to climate change [CSIRO, 2010]. The impacts of global climate change on water resources, agriculture, and natural ecosystems are thus becoming apparent in this region; hence, the selection and usage of GCMs are important for further climate change impact studies. For example, the South Eastern Australian Climate Initiative (SEACI) commenced in 2006 to investigate the causes, impacts, and prediction of climate variability and change in southeastern Australia [CSIRO, 2010].


3 Results

3.1 Monthly and Annual Mean Temperature

[17] Table 3 presents the mean, standard deviation, NRMSE, annual cycle (monthly distribution) and spatial distribution correlation coefficients, trend Z values and magnitudes, EOF explained variances, and BS and Sscore values for the 25 GCMs' monthly temperature. The last column is the total ranking score. The results indicate that the MIROC3.2_hires model (GCM 17) is the best relative performer and BCC:CM1 (GCM 1) the worst.

Table 3. Model Performance for Monthly Mean Temperature

| GCMs | Mean (°C) | St Dev | NRMSE | Corr (Mon Dis) | Corr (Spa Dis) | Mann-Kendall Z | Slope (°C/yr) | EOF1 | EOF2 | BS | Sscore | Total RS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.7 | 5.1 | | | | 1.22 | 0.007 | 98.7 | 0.7 | | | |

"0" in the GCMs column denotes SILO data; EOF is the percentage of explained variance; PDF is the probability density function.

[18] The long-term mean (1961–2000 or 1960–1999) observed annual temperature for southeastern Australia was 16.7°C whilst the GCMs vary from 13.9 to 19.7°C (Table 3) with median and mean values of 16.8 and 16.9°C, respectively. The standard deviations of the GCM simulations vary from 2.5 to 6.9°C compared to the observed value of 5.1°C. The NRMSE values range from 0.3 to 1.1, where a smaller value indicates a better fit. NRMSE provides additional information because it measures the differences of every pair of observed and GCM values. For example, two data sets with the same mean and standard deviation could have different NRMSE values if their temporal patterns are diverse [Stainforth et al., 2007; Smith and Chandler, 2010].

[19] All GCMs, except BCC:CM1, reproduce the annual cycle (monthly distribution) of temperature, with a minimum correlation coefficient of 0.99 (Figure 3). BCC:CM1 significantly overestimates monthly temperature for May–September but underestimates it for January and February (Figure 3). The spatial distributions of both the GCMs and the observations show a clear northwest-southeast gradient (Figure 4), although the correlations are slightly lower than those of the annual cycle, ranging from 0.84 to 0.98 (Table 3).

Figure 3.

The SILO and GCM monthly temperature distribution.

Figure 4.

Spatial distributions of annual mean temperature from (a) SILO data and (b) 25 GCM simulations.

[20] All GCMs, except MPI:ECHAM5, simulate an increasing annual temperature trend for the period 1961–2000 (or 1960–1999), as do the observations, which have a Kendall Z statistic of 1.22 (Table 3). The Z statistics of these 24 GCMs range from 0.69 to 4.72, i.e., GCMs both overestimate (19 GCMs) and underestimate (6 GCMs) the temperature trend of the last 40 years. The GCM median and mean Z statistics, 2.0 and 2.1, respectively, overestimate the observed temperature trend. The magnitude of the observed annual temperature trend is +0.007°C/yr, whilst the GCMs range between −0.024 and +0.035°C/yr. The GCM median and mean trend magnitudes are both +0.013°C/yr, i.e., almost double the observed magnitude. The observed annual temperature trend is not statistically significant at the α = 0.05 level, whereas 14 out of 25 GCMs show a statistically significant trend at this level. MPI:ECHAM5 shows a statistically significant decreasing trend, casting doubt on its future temperature projections.

[21] The spatial and temporal variability of monthly temperature was characterized by EOF analysis. The first two EOFs of observed monthly mean temperature account for 98.7% and 0.7% of the total variance, respectively. In general, the GCMs simulate this variability well, the mean and median values for the first two EOFs being 97.8% and 1.5%, and 98.3% and 1.0%, respectively (Table 3). This suggests that the physical processes influencing temperature variability are captured by the GCMs, consistent with the spatial correlation coefficients (Table 3), which show that the GCMs simulate the northwest-southeast temperature gradient (Figure 4).

[22] Figure 5 shows empirical cumulative probability distributions for observed and GCM monthly mean temperature over the study region. Overall, the GCMs simulate probability distributions close to the observed, except for the INM:CM30 and BCC:CM1 models. The Sscore and BS results for each GCM across the 21 grid cells are presented in Figure 6. They are consistent with the empirical cumulative probability plots: the relatively poor performance of the BCC:CM1 model is confirmed by both indices, i.e., a larger BS and a smaller Sscore. The spread of scores indicates spatial differences, i.e., the simulated monthly mean temperature probability distribution is closer to the observed in some grid cells than in others.

Figure 5.

Empirical cumulative probabilities of observed (thick line) and GCM (thin lines) monthly mean temperature.

Figure 6.

Box plots of PDF-based skill scores of monthly mean temperature over all grids of study region.

3.2 Monthly and Annual Mean Sea Level Pressure (MSLP)

[23] Table 4 presents the GCM annual and monthly MSLP performance. Overall INM:CM30 (GCM 15) is ranked as giving the best reproduction of annual and monthly MSLP, and BCC:CM1 (GCM 1) is ranked the worst.

Table 4. Model Performance for Monthly MSLP

| GCMs | Mean Value (hPa) | St Dev | NRMSE | Corr (Mon Dis) | Corr (Spa Dis) | Mann-Kendall Z | Slope (hPa/yr) | EOF1 | EOF2 | BS | Sscore | Total RS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1016.5 | 3.9 | | | | 1.48 | 0.023 | 94.6 | 4.0 | | | |

"0" in the GCMs column denotes NCEP/NCAR data; EOF is the percentage of explained variance; PDF is the probability density function.

[24] The mean observed MSLP for the study region was 1016.5 hPa whilst the GCM values vary from 1009.7 to 1019.9 hPa (Table 4), with median and mean values of 1016.7 and 1016.0 hPa, respectively. GCM monthly MSLP standard deviations vary from 1.2 to 5.3 hPa, with both median and mean values of 4.2 hPa, compared to the observed mean of 3.9 hPa.

[25] Observed 1961–2000 annual MSLP has an increasing, although non-significant, trend with a Kendall Z statistic value of 1.48 (p-value = 0.0694). GCMs simulate both increasing and decreasing trends, with Z statistics ranging from −1.74 to +2.60. The Z statistic median (0.38) and mean (0.27) for the 25 GCMs underestimate the observed MSLP trend, with 10 simulating decreasing trends (Table 4). The magnitude of the observed trend is 0.023 hPa/yr, whilst the GCMs range between −0.022 and 0.028 hPa/yr. The ensemble mean would thus produce a biased trend magnitude given 23 of the 25 GCMs underestimate the trend magnitude.

[26] The first two monthly MSLP EOFs explain 94.6% and 4.0% of the total variance (Table 4). In general, the GCMs reproduce this variability: the mean and median values of the first two EOFs being 92.4% and 6.6%, and 92.5% and 6.2%, respectively (Table 4). Whilst this suggests that the physical processes producing MSLP variability are generally captured by the GCMs, the spatial distribution is not as well simulated as that of temperature. The spatial correlation coefficients between NCEP/NCAR and the GCMs vary from −0.05 to 0.95 (compared to 0.84–0.98 for temperature), with a median and mean value of 0.45 and 0.41 (0.95 and 0.94 for temperature), respectively. A negative correlation coefficient implies that a GCM simulates a spatial distribution opposite to that of NCEP/NCAR. For example, NCEP/NCAR MSLP has a high pressure center in the eastern study region, whilst IPSL:CM4 simulates a smooth north-south gradient, with the gradient much larger than that of NCEP/NCAR, as shown in Figure 7.

Figure 7.

Spatial distributions of MSLP: (a) NCEP and (b) simulated by IPSL:CM4.

[27] Figure 8 shows the empirical cumulative probability distributions for monthly MSLP. Overall the GCMs reproduce the observed probability distribution, except for a few models such as BCC:CM1, GISS:EH, and CGCM3.1_T63. The results of the Sscore and the BS for each grid cell are presented in Figure 9. The variation of scores implies that GCM monthly MSLP probability distributions are close to observed in some grid cells, but not in others.

Figure 8.

Empirical cumulative probabilities of NCEP/NCAR (thick line) and GCM (thin lines) monthly MSLP.

Figure 9.

Box plots of PDF-based skill scores of monthly MSLP over all grids of study region.

3.3 Monthly and Annual Rainfall

[28] Table 5 presents the GCM assessment for annual and monthly rainfall. Overall, rainfall is poorly simulated compared to temperature and MSLP. The mean observed annual rainfall for the study region is 502 mm, whereas the GCM values vary from 195 to 807 mm (Table 5), i.e., biases of up to more than 60%. The 25 GCM median and mean values are 561 and 517 mm, that is, 11.7% and 2.9% larger than the observed annual rainfall, respectively. The NRMSE ranges from 1.23 to 2.08, larger than the corresponding ranges for temperature (0.25–1.07) and MSLP (0.72–1.99).

Table 5. Model Performance for Rainfall

| GCMs | Mean Value (mm/yr) | St Dev | NRMSE | Corr (Mon Dis) | Corr (Spa Dis) | Mann-Kendall Z | Slope (mm/yr) | EOF1 | EOF2 | BS | Sscore | Total RS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 502.3 | 21.4 | | | | 0.71 | 0.62 | 48.0 | 23.7 | | | |

"0" in the GCMs column denotes SILO data; EOF is the percentage of explained variance; PDF is the probability density function.

[29] The annual cycle of rainfall is not simulated as accurately as those of temperature and MSLP: 12 GCMs have negative correlations, with the two worst performers having correlation coefficients of −0.37 and −0.34 (Figure 10). Southeastern Australia receives slightly more rainfall in the winter half-year (May–October) than in summer (November–April), whilst both BCC:CM1 and CSIRO:MK35 simulate the opposite pattern, with summer rainfall greater than winter (Figure 10). The spatial distribution of rainfall produces no negative correlations (correlation coefficients 0.43–0.95), indicating that most GCMs capture the spatial gradient of rainfall decreasing from the southeastern coast to the northwestern inland.

Figure 10.

Monthly rainfall from SILO and two GCMs.

[30] Observed annual rainfall has a statistically insignificant increasing trend for 1961–2000, with a Kendall Z statistic of 0.71 (p-value = 0.239). GCMs simulate both increasing and decreasing trends, with Z statistics ranging from −1.48 to 2.06. The observed annual rainfall trend magnitude is +0.62 mm/yr, whilst the GCM values range from −2.33 to +2.85 mm/yr. Only a few GCMs simulate a trend magnitude similar to the observed. The 25 GCM median and mean values are 0.21 and 0.36 mm/yr, respectively, i.e., 66.5% and 42.4% less than observed.

[31] The first two observed rainfall EOFs explain 48.0% and 23.7% of the total variance. GCMs simulate a more uniform spatial distribution, e.g., GCM EOF1 explains 49.1–69.1% of the total variance (Table 5), and the mean and median values for EOFs 1 and 2 are 59.0% and 22.0%, and 59.3% and 21.7%, respectively. The differences between observed and GCM values are larger than those for temperature and MSLP, consistent with the other statistics in showing that rainfall is relatively poorly simulated.

[32] Figure 11 compares the empirical cumulative probability distributions of observed and GCM monthly rainfall. The GCMs do not simulate the probability distributions as well as for monthly temperature and MSLP. This is further confirmed from the box plots of Sscore and BS (Figure 12). The BS score for monthly rainfall is much larger than those of monthly temperature and MSLP (Figures 6 and 9) with the many outliers indicating inconsistency between GCM and observed grid point rainfall. Although the Sscore median values are almost of the same magnitude as those of monthly temperature and mean sea level pressure, its lower values are much smaller (Figure 12), again emphasizing that rainfall is not as accurately simulated as temperature and MSLP.

Figure 11.

Empirical cumulative probabilities of observed (thick line) and GCM (thin lines) monthly rainfall.

Figure 12.

Box plots of PDF-based skill scores of monthly mean rainfall across all grids of study region (circles represent statistical “outliers”).

3.4 Ranking Score Testing for Overall Performance

[33] The resulting ranking scores for temperature, MSLP, and rainfall (last columns in Tables 3-5, respectively) are assessed for the purpose of discriminating poorly performing GCMs. The ranking scores for each variable are sorted into ascending order, and the moving range, the difference between two successive data point values, is used to detect the presence of any change points (Figure 13). The two-sample t test is used to test whether two sample means are statistically significantly different, and the Grubbs test is used to test whether there is an outlier in a univariate data set (assumed to come from a normally distributed population); a sketch of this procedure follows the numbered list below. If the tests indicate a statistically significant difference, then we have evidence to reject the GCMs within the larger ranking score group; if not, then we cannot reject those GCMs. Results of this assessment are summarized in Table 6, leading to several conclusions:

  1. [34] For monthly temperature, one change point is detected (Figure 13a). Since there is only one GCM in the larger ranking score group, Grubbs test is used to test whether it is an outlier. Results indicate that it is indeed an outlier at the α = 0.05 significance level. Thus, this GCM can be rejected, as its ranking score is statistically significantly different from those of the other GCMs. Based on these tests, we cannot reject any of the other GCMs, even though some GCMs simulated monthly temperature better than others in terms of our ranking scores (Table 3 and Figures 13a and 13d).

  2. [35] For monthly MSLP, two change points are detected (Figure 13b). The two-sample t test is then used to test whether the groups on either side of these change points have the same mean value. The results indicate that they are statistically significantly different, with a p-value of 0.00042. This implies that we would be justified in rejecting the second and third groups of GCMs because they simulate monthly mean sea level pressure more poorly than, and are statistically significantly different from, the GCMs in the first group (Figure 13e).

  3. [36] The rainfall-sorted ranking score change points are more complicated: GCMs 3 (CGCM3.1_T47) and 15 (INM:CM30) are the two best at simulating monthly rainfall, whilst GCMs 1 (BCC:CM1), 2 (BCCR:BCM20), and 7 (CSIRO:MK35) are the worst (Figures 13c and 13f). These two groups are both statistically significantly different from the remaining 20 GCMs in terms of the two-sample t test, as also indicated in the moving range chart (Figures 13c and 13f). The remaining 20 GCMs are further separated into two groups by the two-sample t test, with a p-value of 1.209 × 10−5. This implies that rainfall from the GCMs in the first of these groups may be used for climate change impact studies with caution, whilst it would be preferable not to use rainfall from the second group.

  4. [37] Of the three variables assessed, GCMs generally simulate monthly temperature the best and rainfall the worst. The poor performance of GCM rainfall is the rationale behind statistical downscaling. The relative performance in simulating the three variables assessed here is consistent with the results of a global-scale study of AR4 GCMs, which found that most were able to capture the probability distribution of monthly temperature and MSLP, but not that of rainfall [Randall et al., 2007].

  5. [38] The results depend on the climate variables assessed. For example, INM:CM30 (GCM 15) ranks as the best model for both monthly MSLP and monthly rainfall, but only ranks 23 out of the 25 GCMs for monthly temperature.

  6. [39] The results are region specific. For example, BCC:CM1 (GCM 1) consistently ranks as the worst model for the three climate variables we investigated and therefore cannot be used with any confidence for climate change studies in southeastern Australia. However, this GCM could potentially perform well for a different study region.
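A minimal sketch of the testing procedure referenced above (moving range, two-sample t test, and one-sided Grubbs test), using SciPy and hypothetical score arrays:

```python
import numpy as np
from scipy import stats

def moving_range(scores):
    """Differences between successive sorted ranking scores; a large jump
    suggests a break point between groups of GCMs (Figure 13)."""
    return np.diff(np.sort(scores))

def groups_differ(group_a, group_b, alpha=0.05):
    """Two-sample t test on the ranking scores either side of a break point."""
    _, p = stats.ttest_ind(group_a, group_b)
    return p < alpha, p

def grubbs_max_outlier(scores, alpha=0.05):
    """One-sided Grubbs test: is the largest ranking score an outlier?"""
    x = np.asarray(scores, dtype=float)
    n = len(x)
    g = (x.max() - x.mean()) / x.std(ddof=1)
    t2 = stats.t.ppf(1 - alpha / n, n - 2) ** 2   # one-sided critical t, squared
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return g > g_crit

# hypothetical sorted ranking scores for a handful of GCMs
rs = np.array([12.0, 14.5, 15.1, 16.0, 17.2, 18.0, 31.5])
print(moving_range(rs))           # the last jump flags a candidate break point
print(grubbs_max_outlier(rs))     # test whether 31.5 is a statistical outlier
```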

Figure 13.

(a–c) Moving range testing and (d–f) break points of ranking scores for temperature, MSLP, and rainfall.

Table 6. Comparison of Scores for Monthly MSLP, Temperature, and Rainfall

| Climate Model | ID | Temperature | MSLP | Rainfall | Total |
|---|---|---|---|---|---|

Values that are crossed-out indicate that the GCM was statistically rejected (see section 3.4 and Figure 13).


3.5 Sensitivity Analysis

[40] The sensitivity of the ranking results to each individual assessment criterion (statistic) was investigated in two ways. First, the overall result was compared to that obtained by removing each statistic in turn. This showed that adding or removing a specific statistic did not change the overall ranking (Figure 14), which implies that the ranking score method provides a robust assessment of GCM performance. This is one advantage of using multiple criteria rather than an individual assessment criterion, because a GCM performing well for a specific statistic will not necessarily perform well for other statistics.

Figure 14.

Comparison between overall ranking score and that after removal of one statistic.

[41] Second, we used each of the statistics individually to produce single-criterion ranking scores and compared these with the overall results. No single criterion produced exactly the same result as the overall ranking (Figure 15), which also confirms that a multi-criteria assessment is more robust than a single-criterion assessment. Some single criteria produce rankings close to the overall ranking, such as the mean, standard deviation, NRMSE, or PDF (Figure 15), whilst others produce very different rankings. For example, the EOF and trend rankings are totally different, with Spearman's rank correlation coefficients between them and the multi-criteria ranking of 0.2 or less (Figure 15). This implies that EOF or trend analysis alone is not a robust indicator of GCM performance, i.e., a GCM that accurately simulates EOF variance, trend significance, and trend magnitude does not necessarily reproduce other statistics such as the long-term mean and standard deviation, annual cycle, or spatial distribution.
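A minimal sketch of this comparison, using Spearman's rank correlation on hypothetical ranking scores:

```python
from scipy.stats import spearmanr

def ranking_agreement(overall_rs, single_rs):
    """Spearman rank correlation between the multi-criteria ranking scores
    and a single-criterion ranking of the same GCMs (1 = identical order)."""
    rho, _ = spearmanr(overall_rs, single_rs)
    return rho

# hypothetical ranking scores for five GCMs
print(ranking_agreement([3.1, 7.4, 5.0, 9.8, 1.2],
                        [2.9, 8.1, 4.7, 9.5, 1.0]))
```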

Figure 15.

Correlation between overall and single-criterion ranking scores.

3.6 Comparison of CMIP3 and CMIP5 GCMs

[42] This method was initially designed and applied when only IPCC AR4/CMIP3 GCMs were available, but it can be reapplied to assess CMIP5 GCMs as their outputs become available. The CMIP5 state-of-the-art multimodel data set is designed to advance our knowledge of climate variability and climate change [Taylor et al., 2012]. Therefore, a comparison between CMIP3 and CMIP5 GCM results would not only test the robustness of our method but also provide useful information on consistency or changes between CMIP3 and CMIP5 GCMs.

[43] For a preliminary assessment over our study region, the 1961–2000 monthly rainfalls from 40 CMIP5 GCMs (Table 7) were extracted from the CMIP5 archive (http://cmip-pcmdi.llnl.gov/cmip5). The monthly rainfalls from the combined set of 65 GCMs (25 from CMIP3 and 40 from CMIP5) were then assessed. The results indicate that the CMIP3 and CMIP5 GCMs' monthly and annual rainfall for southeastern Australia are not statistically significantly different (Figure 16). The "best" ranked GCM is a CMIP3 model, and the four "worst" ranked GCMs are CMIP5 models. In addition, the two "worst" CMIP5 GCMs have been identified as statistical "outliers" (Figure 16b).

Table 7. CMIP5 GCMs

| GCMs | Originating Group(s) | Country |
|---|---|---|
| ACCESS1.0 | Commonwealth Scientific and Industrial Research Organization (CSIRO) and Bureau of Meteorology (BOM) | Australia |
| BCC-CSM1.1 | Beijing Climatic Center | China |
| BNU-ESM | Beijing Normal University | China |
| CanCm4 | Canadian Centre for Climate Modelling and Analysis | Canada |
| CCSM4 | National Center for Atmospheric Research (NCAR) | USA |
| CESM1(BGC) | Community Earth System Model Contributors, NSF-DOE-NCAR | USA |
| CMCC-CM | Centro Euro-Mediterraneo per I Cambiamenti Climatici | Italy |
| CNRM-CM5 | Centre National de Recherches Meteorologiques/Centre Europeen de Recherche et Formation Avancees en Calcul Scientifique | France |
| CSIRO-Mk3.6.0 | Commonwealth Scientific and Industrial Research Organization in collaboration with Queensland Climate Change Centre of Excellence | Australia |
| EC-EARTH | EC-EARTH consortium | Europe |
| FGOALS-g2 | LASG, Institute of Atmospheric Physics, Chinese Academy of Sciences and CESS, Tsinghua University | China |
| FGOALS-s2 | LASG, Institute of Atmospheric Physics, Chinese Academy of Sciences | China |
| FIO-ESM | The First Institute of Oceanography, SOA | China |
| GFDL-CM3 | NOAA Geophysical Fluid Dynamics Laboratory | USA |
| GISS-E2-H | NASA Goddard Institute for Space Studies | USA |
| HadGEM2-AO | National Institute of Meteorological Research/Korea Meteorological Administration | Korea |
| HadCM3 | Met Office Hadley Centre | UK |
| INM-CM4 | Institute for Numerical Mathematics | Russia |
| IPSL-CM5A-LR | Institut Pierre-Simon Laplace | France |
| MIROC4h | Atmosphere and Ocean Research Institute (The University of Tokyo), National Institute for Environmental Studies, and Japan Agency for Marine-Earth Science and Technology | Japan |
| MIROC-ESM | Japan Agency for Marine-Earth Science and Technology, Atmosphere and Ocean Research Institute (The University of Tokyo), and National Institute for Environmental Studies | Japan |
| MPI-ESM-LR | Max Planck Institute for Meteorology | Germany |
| MRI-CGCM3 | Meteorological Research Institute | Japan |
| NorESM1-M | Norwegian Climate Centre | Norway |
Figure 16.

Ranking score of GCM rainfall from CMIP3 (red) and CMIP5 (blue) models and corresponding box plots.

[44] One motivation of CMIP5 is to address a number of limitations of the CMIP3 GCM simulations. For example, the CMIP5 models enhance the representation of processes including feedbacks from changes in the carbon cycle, clouds, and the effects of aerosols [Taylor et al., 2012]. However, they do not appear to improve regional rainfall simulations, owing to the complexity of the physical processes involved in modeling rainfall. Future work will assess their temperature and MSLP performance.

4 Discussions and Conclusion

[45] A score-based multi-criteria method was developed to assess and rank the regional performance of 25 CMIP3 GCMs. Assessing three variables for the southeastern Australia region, GCM performance in simulating monthly temperature was better than for MSLP, which was better than for rainfall.

[46] A sensitivity analysis supported the robustness of the methodology, showing that adding or removing a specific statistic did not change the overall ranking and that no single-criterion result reproduced the multi-criteria assessment exactly. This is one advantage of using multiple criteria to assess GCMs over the single-criterion methods existing in the literature.

[47] The method is easily applied to different study regions, with assessment results suitable to guide the selection of GCMs for use in regional climate change impact studies. For example, future temperature projections could be extracted directly from the better performing GCMs. However, GCMs currently do not provide reliable rainfall information at the regional scales required by many climate change impact studies. Downscaling techniques have been developed to resolve the scale discrepancy between climate change scenarios and the resolution required for impact assessment, based on the assumption that large-scale circulation patterns have a strong influence on local-scale weather [Maraun et al., 2010]. Two approaches to downscaling are commonly used. Dynamical downscaling nests a regional climate model (RCM) within a GCM to represent finer-resolution atmospheric physics over a limited area of interest, or uses a stretched-grid GCM of finer resolution over the area of interest. Statistical downscaling models the relationships between local-scale climate variables and large-scale atmospheric processes. The GCM assessment results are useful in selecting suitable GCMs to downscale from: many statistical downscaling models use MSLP as a predictor, and empirical daily scaling methods use GCM rainfall directly. In fact, the main motivation of this study was to select suitable GCMs for statistically downscaling daily rainfall over southeastern Australia [Fu et al., 2013].

[48] An initial comparison of CMIP5 and CMIP3 GCMs in terms of simulating monthly and annual rainfall indicates that this method can easily be used to assess CMIP5 GCMs. There are no statistically significant differences between CMIP3 and CMIP5 monthly and annual rainfall for southeastern Australia (Figure 16). It is interesting to note that the "best" GCM is a CMIP3 model and the four "worst" GCMs are CMIP5 models.

[49] This study rests on the assumption, common in the literature, that GCM agreement with observations supports confidence in projections [Knutti, 2008; Masson and Knutti, 2011], i.e., that a model that better reproduces observations will give a more realistic future response. This assumption is not tested here, as all the results and conclusions are based on GCM 20C3M runs, and uncertainty remains due to model limitations in sub-grid scale forcing and processes. However, the assumption appears justified in light of the literature: Smith and Chandler [2010] found that better performing GCMs tend to agree on the sign of future rainfall changes, with the differences between the best five and the remaining 17 GCMs most evident in winter and spring for southeastern Australia, resulting in annual changes of −13% compared to −3% for all models.

[50] Much climate change impact research requires daily climate data, whereas this study concentrated on monthly data because of data availability and processing time. The assumption is that if a GCM can simulate monthly climate accurately, then there is a high probability that it can simulate daily climate accurately; conversely, if a GCM cannot simulate monthly climate accurately, then it is unlikely to simulate daily climate accurately. The first part of this assumption will be tested in future research.


Acknowledgments

[51] We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups (listed in Tables 1 and 7) for producing and making available their model output. For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. Thanks are also due to the Chinese Scholarship Council-CSIRO Joint Supervision of Chinese PhD Students project, which made it possible for the second author to study in Australia for 1 year. The kind assistance from colleagues in CSIRO, Australia, is also greatly appreciated. We wish to thank the Editor (Sara C. Pryor), the Associate Editor, Dr. Ian Smith, Dr. Freddie Mpelosoka, and four anonymous reviewers for their invaluable comments and constructive suggestions used to improve the quality of the manuscript.