While General Circulation Models (GCMs) generally converge well at the global level, results for individual regions usually show a wide range of variation. This study assesses the performance of seventeen GCMs regarding their simulation of temperature and precipitation based on hindcasts for the periods of 1961–1990 and 1931–1960. Skill scores are plotted on a 2° × 2° grid to present “zones” of GCM performance. An overlay of these skill score maps with global climate zones, land cover, and elevation maps shows correlations between GCM performance and the distribution of these geographic variables. No GCM is superior in predicting temperature or precipitation for the whole world, although some GCMs score better in particular regions. For researchers working with GCM results and policymakers who need to make decisions based on GCM projections, the skill score maps may provide useful guidance, while for GCM developers they may open areas for further study to improve their models.
 Climate change projections are generated by highly sophisticated GCMs. While these models converge relatively well at the global scale, outcomes for individual regions can vary significantly among the various GCMs [Giorgi and Mearns, 2003; Schmittner et al., 2005; Connolley and Bracegirdle, 2007; Laurent and Cai, 2007; Whetton et al., 2007]. This regional variability is problematic for impact assessments at the regional level and has been recognized as one of the major sources of uncertainty for climate change projections [Giorgi and Francisco, 2000; Murphy et al., 2004]. Differences among model simulations are generally due to different regional responses to global climate change and to chaotic behavior embedded in simulations of multi-decadal variability. This study presents global maps (except for Antarctica) of GCM performance in simulating temperature and precipitation, based on the root mean square error (RMSE) of the model simulation relative to observed temperature and precipitation. The paper further shows the spatial association of the GCM skill scores with geographic variables such as land cover, earth surface elevation, and climate zones. Rather than evaluating the quality of the GCMs, we argue that GCM structure, parameterization, and model validation practices might be affected by the distribution of these geographic variables.
 The performance of GCMs is assessed according to their “skill scores” [Murphy et al., 2004; Müller et al., 2005; Connolley and Bracegirdle, 2007]. We calculate the skill scores based on the RMSE of the model simulation relative to observed temperature and precipitation, respectively, for each of the 17 GCMs listed in Table S1. The GCM simulations of monthly temperature and precipitation included in the “Climate of the 20th century experiment” (20C3M) were downloaded from the database prepared by the World Climate Research Programme's Coupled Model Intercomparison Project (CMIP3) [Program for Climate Model Diagnosis and Intercomparison, 2008]. For each of these models, several realizations for different initial conditions were prepared for the climate experiment (20C3M). Without compromising significance, a single scenario (scenario 1) using the same initial condition for all GCMs was used for this study. The observed temperature and precipitation data were taken from the CRU05 0.5°, 1901–1995 monthly climate time series of the Climatic Research Unit at the University of East Anglia [New et al., 2000]. The comparison between the GCM simulations and the observations is based on the average value of each variable in each month over the period 1961–1990. The period 1931–1960 is used to verify the results.
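The comparison described above can be sketched for a single grid cell as follows. This is a minimal illustration, not the authors' processing chain: it assumes the RMSE is taken across the 12 monthly climatological means, and the function names and sample values are hypothetical.

```python
import math

def monthly_climatology(series, n_years):
    """Average a month-ordered series (Jan..Dec, repeated per year) into 12 monthly means."""
    if len(series) != 12 * n_years:
        raise ValueError("expected 12 values per year")
    return [sum(series[m::12]) / n_years for m in range(12)]

def rmse(simulated, observed):
    """Root mean square error between two equal-length sequences."""
    if len(simulated) != len(observed):
        raise ValueError("sequences must have equal length")
    squared = [(s - o) ** 2 for s, o in zip(simulated, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up two-year series for one grid cell (assumption, not data from the paper).
obs = [float(m) for m in range(12)] * 2
sim = [float(m) + 1.0 for m in range(12)] * 2  # simulation with a constant +1 bias

cell_rmse = rmse(monthly_climatology(sim, 2), monthly_climatology(obs, 2))
```

With a constant bias of one unit in every month, the RMSE of the monthly means equals that bias.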
 The skill scores are calculated for 2° × 2° grid cells over the global land surface (except for Antarctica). For the given periods (1961–1990 and 1931–1960), the RMSE is calculated for each of the 17 GCMs, and the inverse RMSE is used as the skill score in this study. Furthermore, for the convenience of comparison, the inverse RMSE values are normalized to values between 0 and 1 (dividing the inverse RMSE of one GCM by the sum of the inverse RMSE of the 17 GCMs), which represent relative skill scores rather than absolute ones. The sum of the normalized values equals 1 and the average skill score is 1/17 ≈ 0.06. Thus a GCM with a skill score higher than 0.06 performs above average.
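The normalization described above can be written compactly. The sketch below computes relative skill scores for one grid cell from per-model RMSE values; the function name, model labels, and RMSE values are illustrative assumptions.

```python
def relative_skill_scores(rmse_by_model):
    """Normalize inverse RMSE so that the scores of all models in one cell sum to 1."""
    if any(r <= 0 for r in rmse_by_model.values()):
        raise ValueError("RMSE must be positive to take its inverse")
    inverse = {name: 1.0 / r for name, r in rmse_by_model.items()}
    total = sum(inverse.values())
    return {name: inv / total for name, inv in inverse.items()}

# Made-up RMSE values for three hypothetical models in one cell.
scores = relative_skill_scores({"A": 1.0, "B": 2.0, "C": 4.0})

# With 17 equally accurate models, every model receives the average score 1/17.
equal = relative_skill_scores({f"GCM {i}": 2.5 for i in range(1, 18)})
```

Because the scores are relative, a high value only means that a model has a smaller error than the others in that cell, not that its absolute error is small.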
 Maps of skill scores developed with a geographic information system (GIS) are then used to conduct further spatial analyses by overlaying these maps with maps of land cover, digital elevation models (DEM), and a map of climate zones, respectively, to identify the possible association of GCM regional variability with geographic variables. Moreover, the overall performance of each GCM for the whole world is evaluated using frequency analysis.
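One simple way to quantify such an overlay is a cross-tabulation of grid cells by GCM zone and geographic class. The sketch below is a plain-Python stand-in for the GIS overlay, not the procedure used in this study; the function names and the cell assignments are hypothetical.

```python
from collections import Counter, defaultdict

def cross_tabulate(zone_by_cell, class_by_cell):
    """Count cells for every (GCM zone, geographic class) pair."""
    table = defaultdict(Counter)
    for cell, zone in zone_by_cell.items():
        if cell in class_by_cell:  # skip cells missing from the second map
            table[zone][class_by_cell[cell]] += 1
    return table

def dominant_share(table):
    """For each zone, return its most common class and the share of cells in it."""
    result = {}
    for zone, counts in table.items():
        top_class, top_count = counts.most_common(1)[0]
        result[zone] = (top_class, top_count / sum(counts.values()))
    return result

# Made-up cells keyed by (lat, lon); zone labels follow the paper's notation.
zones = {(20, 0): "6,7,11", (20, 2): "6,7,11", (20, 4): "6,7,11", (50, 10): "1,8,10"}
cover = {(20, 0): "barren", (20, 2): "barren", (20, 4): "shrub", (50, 10): "crop"}

shares = dominant_share(cross_tabulate(zones, cover))
```

A share close to 1 indicates that a zone lies almost entirely within one class, which is the kind of match the overlays are meant to reveal.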
Figure 1 presents the skill score maps for 17 GCMs regarding temperature simulations based on hindcasts. GCMs 1 and 16, GCM 14, and GCM 13 have the highest skill scores for high-, medium-, and low-latitude regions, respectively. Some GCMs score high in specific regions, such as GCM 1 for Australia and Europe; GCMs 5 and 9 for Greenland with snow and ice cover; GCM 13 for the western coastal area of North and South America and Southern Africa; GCMs 16 and 17 for the Amazon region and central and eastern Europe; GCM 6 for Northern Africa and the eastern side of the Rocky Mountains in North America; and GCM 14 for the Queen Elizabeth Islands of Canada. Moreover, GCMs 5, 6, and 7 have a similar distribution for North America, the Amazon region, Northern Africa, and the Middle East, probably because these three models were developed by the same institute (Table S1).
 For some regions, such as South Asia, none of the GCMs is strongly preferred according to the skill scoring technique used in this study. One plausible explanation is the complex climate in this region, in particular the monsoons, which undergo aperiodic and high-amplitude variations on intraseasonal, annual, biennial, and interannual timescales. As illustrated by Webster et al., the simulation of the mean structure of the Asian monsoon has proven to be elusive and the observed ENSO-monsoon relationships are difficult to replicate.
 Starting from the skill score maps (Figure 1), we delineate “zones of best GCMs” for temperature by aggregating the pixels in the neighborhood taking into account the highest skill scores for up to three GCMs, as shown in Figure 2. Such delineation can be accomplished in GIS by overlaying the seventeen skill score maps of Figure 1 and then generating polygons in which one to three GCMs have the highest skill scores. Since the patterns are obvious for most regions from the skill score maps, we identify the zones visually by comparing the skill scores in the various regions. For example, in Northern Africa GCMs 6, 7 and 11 have the best skill scores. This region is thus labeled with the numbers of these GCMs (i.e., 6, 7, 11). Zones where only one or two GCMs have high performance scores are labeled accordingly. Sixty-nine zones are identified for the global land surface except for Antarctica.
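Although the zones in this study are identified visually, the per-pixel labeling step could be approximated programmatically. The sketch below labels a pixel with up to three GCMs whose scores are close to the best one; the function name and the `margin` parameter are assumptions introduced for illustration, and the scores are made up.

```python
def best_gcm_label(scores, max_models=3, margin=0.9):
    """Return up to `max_models` model ids whose score is at least `margin` times the best."""
    if not scores:
        raise ValueError("need at least one model score")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best = ranked[0][1]
    return tuple(model for model, score in ranked[:max_models] if score >= margin * best)

# Made-up scores for a pixel where three models lead by a similar amount.
label_three = best_gcm_label({6: 0.15, 7: 0.145, 11: 0.14, 1: 0.05, 3: 0.04})

# Made-up scores for a pixel dominated by a single model.
label_one = best_gcm_label({1: 0.30, 2: 0.10, 3: 0.10})
```

Neighboring pixels that share a label could then be merged into polygons in a GIS; that aggregation step is not shown here.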
Figures 3a, 3b, and 3c display overlays of these GCM zones with climate zones, land cover, and elevation, respectively. As shown in Figure 3a, the GCM zones match well with the most recently developed climate zones [Peel et al., 2007]. For example, zones (6, 7, 11) and (1, 3) are located in the climate zone of arid hot deserts; zones (2, 6, 10) and (1, 13, 17) are located in the cold climate zone with very cold winters and without a dry season; zone (2) is located in the temperate climate region with hot summers and without a dry season. Climate zones depend on global terrain morphology and land cover since these variables influence large-scale atmospheric circulations [Peel et al., 2007]. In Figure 3b, most GCM zones closely match land cover around the world; for example, zone (6, 7, 11) matches well with barren land in northern Africa; zone (1, 8, 10) relates to crop land; zone (16, 17) to forest land; and zone (7, 4, 13) to shrub land. Land cover depends on climate zones but also provides feedback effects to climate dynamics.
 Furthermore, the GCM zones also seem to be related to elevation. The relationship is particularly strong in the Pacific region of America and for almost all of Africa and Asia (Figure 3c). For example, zones (15, 17) and (5, 9), among others, relate to regions with high elevations, and zones (1, 16), (11, 15, 9), and (5, 13), among others, relate to regions with low elevations. Thus the GCM zones in terms of temperature likely reflect some geo-biophysical patterns.
 To verify the results, the skill score maps are developed for another testing period, 1931–1960 (see Figure S1). These maps show results similar to those with the primary testing period (1961–1990).
 Compared to temperature, what is the regional variability of GCM simulations regarding precipitation? We calculate the skill scores and conduct similar spatial analyses for precipitation simulation based on the same set of GCMs, as displayed in Figures S2 and S3 for the two testing periods. Comparing the precipitation maps to the temperature maps, the skill scores of the two variables are consistent for most GCMs regionally and for many even globally; for example, GCM 1, GCM 14, and GCM 17 have higher skill scores for both temperature and precipitation, while GCM 8 has low skill scores for both. However, several GCMs, such as GCM 2, GCM 10, GCM 11, and GCM 16, have different performance scores for temperature and precipitation simulations in most regions of the world.
Figure 4 shows the highest skill scores for each pixel. For temperature, about 3.8% of the global land surface area has the highest skill score between 0.06 and 0.12 and 27.5% of the area has the highest score greater than 0.16. For precipitation, on the other hand, on 52.3% of the land area the highest skill score is between 0.06 and 0.12, and on only 14.2% of the land, the highest score is above 0.16. This confirms the observation that for precipitation fewer models perform much better than the others, compared to temperature, especially in high latitude regions.
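Area shares of this kind can be estimated from the per-pixel maximum scores. The sketch below is one possible calculation, not necessarily the one used for Figure 4: it assumes cells are weighted by the cosine of their center latitude to approximate the area of equal-angle grid cells, and the function name, bin edges as code defaults, and sample cells are illustrative.

```python
import math

def area_share_by_bin(best_score_by_cell, edges=(0.06, 0.12, 0.16)):
    """Area-weighted share of cells whose best score falls in each bin.

    Cells are keyed by the (lat, lon) of the cell center in degrees. The weight
    cos(lat) is an assumed approximation of relative cell area.
    """
    labels = [f"<{edges[0]}"]
    labels += [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]
    labels += [f">={edges[-1]}"]
    area = dict.fromkeys(labels, 0.0)
    for (lat, _lon), score in best_score_by_cell.items():
        index = sum(score >= edge for edge in edges)  # number of edges passed
        area[labels[index]] += math.cos(math.radians(lat))
    total = sum(area.values())
    if total == 0:
        raise ValueError("no cells with positive area weight")
    return {label: value / total for label, value in area.items()}

# Two made-up cells: an equatorial cell counts twice as much as one at 60°N.
best = {(0.0, 10.0): 0.08, (60.0, 10.0): 0.20}
shares = area_share_by_bin(best)
```

Without the latitude weighting, high-latitude cells would be over-represented in the area percentages.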
 To assess the overall skill score of different GCMs for the whole world, we examine the exceedance probability of the skill score for each GCM, as shown in Figure 5, with respect to both temperature and precipitation. Exceedance probability is computed using the frequency analysis of the GCM scores over all pixels in the world (the total number of samples is set as the number of pixels, 4461). For example, the exceedance probability for GCM 1 in terms of temperature with a skill score above 0.06 (the average value among the 17 GCMs) is 54%, which means that GCM 1 performs above average for 54% of the global land area (excluding Antarctica). GCM 1, GCM 14, and GCM 17 score best among the 17 GCMs in predicting temperature. At the average skill score (1/17 ≈ 0.06), the exceedance probability ranges from 0.25 (GCM 9) to 0.54 (GCM 1); whereas at a higher skill score threshold—for example 0.2—no GCM has an exceedance probability of more than 0.05. Among the 17 GCMs, GCMs 5, 8, and 9 have the lowest scores for most regions around the world. Comparing the exceedance probability curves of precipitation to those of temperature, we find that GCM 1's performance is superior for both temperature and precipitation simulations based on the exceedance probability.
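The frequency analysis described above amounts to counting, for each threshold, the share of pixels whose score exceeds it. A minimal sketch follows, with hypothetical function names and made-up scores standing in for the 4461 pixel values of one GCM.

```python
def exceedance_probability(scores, threshold):
    """Fraction of grid cells in which a model's skill score exceeds `threshold`."""
    if not scores:
        raise ValueError("need at least one grid cell")
    return sum(score > threshold for score in scores) / len(scores)

def exceedance_curve(scores, thresholds):
    """Exceedance probability at each threshold, as (threshold, probability) pairs."""
    return [(t, exceedance_probability(scores, t)) for t in thresholds]

# Made-up scores of one model over five pixels (assumption, not data from the paper).
pixel_scores = [0.03, 0.05, 0.07, 0.10, 0.25]

above_average = exceedance_probability(pixel_scores, 1.0 / 17.0)
curve = exceedance_curve(pixel_scores, [0.0, 1.0 / 17.0, 0.2])
```

Each pixel counts equally here, consistent with using the number of pixels as the sample size.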
 To further display the spatial distribution of the skill scores, Figure 4 maps the highest skill score over the global land surface except for Antarctica. Because the 17 normalized scores in each pixel sum to 1, the highest skill score in any pixel is at least the average skill score of 1/17 ≈ 0.06. If the highest skill score in one pixel is only slightly higher than 0.06, then no single GCM has a superior performance in that pixel. In that case, all GCMs perform similarly and the choice of GCM will not make a significant difference regarding the simulation results. If, on the other hand, the highest score in one pixel is much higher than 0.06, then one or a few GCMs perform much better than the others. For example, in some high-latitude regions and northern Africa, a single GCM with a very high skill score, GCM 1, seems to work best to simulate regional temperature and precipitation. Some GCMs perform much better than others for both temperature and precipitation in high- and medium-latitude regions, while all 17 GCMs have similar skill scores, close to the average value of 0.06, in low-latitude regions, except for Western Australia. In general, temperature and precipitation need to be simulated simultaneously in an internally consistent manner, which implies that the same GCMs should be used for temperature and precipitation simulation.
 Finally, it should be noted that we do not intend to judge the quality of any GCM, since we assessed the performance of the 17 GCMs only with limited indices and limited validation periods. The evaluation of GCMs at the regional scale requires more than assigning skill scores based on hindcast analysis [Whetton et al., 2007; Laurent and Cai, 2007]. Nevertheless, the skill score maps presented in this paper provide some guidance on which GCMs perform better in reproducing the past and current climate against observations in particular regions, and on the relationship between temperature and precipitation performance. This is expected to be useful information for researchers working with GCM results and policymakers who need to make decisions based on GCM projections.
 A regional mapping of skill scores that assess GCM performance shows that model performance for temperature simulation seems to be related to land cover, terrain morphology, and climate zones. No single GCM scores high over the entire global land surface, although some GCMs score high for particular regions. Most GCMs perform similarly for precipitation and temperature, although inconsistencies exist for several GCMs. Is the spatial pattern of GCM skill scores an indication of GCM structure, parameterization, and model validation practices, which might be affected by geographic variables such as land cover, surface elevation, and climate zones? It is beyond the scope of this paper to explain the spatial patterns of GCM performance in depth, and further study is left to GCM developers. We hope that the skill score maps provide some hints for GCM developers to examine and further improve their models. Furthermore, our results show which individual GCMs perform well in which places around the world. This may provide useful information for regional climate model (RCM) nesting, since RCM modelers are often faced with the dilemma of having to choose a specific GCM for a certain region.
 The authors are grateful to three anonymous reviewers, whose comments and suggestions helped improve the original version of this paper.